DECIMER 1.0: deep learning for chemical image recognition using transformers

Rajan, Kohulan; Zielesny, Achim; Steinbeck, Christoph

doi:10.1186/s13321-021-00538-8

Research article
Open access
Published: 17 August 2021

Journal of Cheminformatics volume 13, Article number: 61 (2021)

This article has been updated

Abstract

The amount of data available on chemical structures and their properties has increased steadily over the past decades. However, articles published before the mid-1990s are available only in printed or scanned form. Extracting the data from those articles and storing it in a publicly accessible database is desirable, but doing this manually is a slow and error-prone process. To extract chemical structure depictions and convert them into a computer-readable format, Optical Chemical Structure Recognition (OCSR) tools were developed; the best-performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical IMagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods and to provide an automated open-source software solution. Various current deep learning approaches were explored to find the best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.

Introduction

Scientists build on the results of their peers. Knowledge and data arising from previous research are shared through scientific publications and increasingly through the deposition of data in repositories. The availability of open data benefits progress in core areas of chemistry [1]. Most chemical data is published in the form of text and images in scientific publications [2]. Retrieving published information and storing it in open-access databases will facilitate its reuse as well as the development of new methods and products [3]. However, most published data is not machine-readable, and manual curation is still the standard. This manual work is tedious and error-prone [4]. The growing number of publications with valuable chemical information [5] encourages the development of tools for automated data retrieval. Information retrieval with corresponding database storage is an ongoing task, and multiple projects are working towards it. The CHEMDNER [6] challenge is one good example.

There has been a significant amount of development in the field of chemical data mining [5], with a couple of open-source solutions, including ChemDataExtractor [4] and ChemSchematicResolver (CSR) [7], building upon each other. CSR, however, cannot handle a scanned page of an article, and not all publications can be processed by it. Although most publishers offer documents in markup format, many of the older publications are stored as scanned PDF files. For example, the Journal of Natural Products has published scientific articles since 1978, and one of its issues even dates back to 1949; however, these publications were not formatted in markup format. Retrieving this information is therefore a difficult process.

Image mining of chemical structure depictions and their conversion into a machine-readable file format is a comparatively small research area [8]. The automatic recognition of chemical structure depictions and their conversion into machine-readable formats such as SMILES [9] or InChI [10], however, is an important task for creating corresponding databases. Publications include chemical structure depictions alongside other information in textual format, and some information is presented as tables, graphs, spectra, etc.

Optical Chemical Structure Recognition (OCSR) software was built to parse chemical structure depictions. However, most of these tools are unable to handle whole-page articles or scanned ones. To use these tools, it is necessary to segment the chemical structure depictions from the printed literature into separate images and then use these segmented images as inputs. The user should also ensure that a segmented image contains no elements or artefacts other than the representation of a chemical structure. The available systems vary in their accuracy; OSRA [11] and MolVec [11, 12] can resolve a chemical structure with 80–90% accuracy [8].

With the advancements in computer vision, a few deep learning-based OCSR tools have been developed, e.g. by Staker et al. [13], the first machine learning-based system for segmentation of images and resolution into a computer-readable format. Another deep learning-based work is Chemgrapher [14], where multiple neural networks are combined for the recognition of molecules. More recently, ChemPix [15] was published, a deep learning-based method developed to recognize hand-drawn hydrocarbon structures. Another recent publication describes SMILES generation from images [16] using an encoder–decoder method with a pre-trained decoder from previous work [17]. These contributions demonstrate an increasing interest in this field of research. Even though they all claim to provide enhanced accuracy, none of them is accessible to the general public to date.

The DECIMER (Deep lEarning for Chemical IMagE Recognition) project [18] is an end-to-end open-source system that can perform chemical structure segmentation on scanned scientific literature and use the segmented structure depictions to convert them into a computer-readable molecular file format.

In our work on DECIMER-Segmentation [19], the segmentation workflow was specifically addressed. Here we present a transformer-based algorithm that converts the bitmap of a chemical structure depiction into a computer-readable format. The system does not inherit any rules or make any assumptions; it relies solely on the chemical structure depiction to perform its task.

The DECIMER algorithm was primarily inspired by the successful AlphaGo Zero algorithm [20] developed by Google’s DeepMind. The success of AlphaGo Zero allowed us to realize that very challenging problems could be adequately tackled with a sufficient amount of data and an adequate neural network architecture. With tens of millions of molecules available in databases such as PubChem [21], ZINC20 [22], and GDB-17 [23], we have shown in our preliminary communication that our goal of a system working with about 90% accuracy could be achieved by training the network on a dataset of 50–100 million molecules.

Materials and methods

DECIMER is a completely data-driven solution to chemical image recognition. Recent impressive applications of deep learning, such as the AlphaGo Zero example, all relied on the availability of very large to unlimited amounts of training data. In our case, one of the largest chemical databases on the planet, PubChem [21], was used.

Data preparation

The latest version of PubChem was downloaded from its FTP site. All explicit hydrogens were removed using the CDK [24], and isomeric SMILES [9] were generated, which are canonicalised and retain the stereochemistry information. After generating the SMILES, the following set of rules was used to filter the molecules and obtain a balanced dataset. The molecules in both the training and test sets should:

have a molecular weight of less than 1500 Daltons,
not possess counter ions,
only contain the elements C, H, O, N, P, S, F, Cl, Br, I, Se and B,
not contain isotopes of hydrogen (D, T),
have 3–40 bonds,
not contain any charged groups, including zwitterionic forms,
only contain implicit hydrogens, except in functional groups,
have fewer than 40 SMILES characters,
not contain any stereochemistry.
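Several of these rules can be checked directly on the SMILES text. The following is a minimal sketch of such a pre-filter, not the pipeline used in this work: the function name `passes_string_rules` and the regular expressions are hypothetical, and the molecular weight, bond count and explicit-hydrogen rules are deliberately left out because they require a cheminformatics toolkit such as the CDK.

```python
import re

# Elements permitted by the rule set above
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

# A bracket atom such as [NH4+], [2H] or [Se]
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# Inside a bracket: optional isotope number, element symbol, remainder
BRACKET_PARTS = re.compile(r"^(\d*)([A-Z][a-z]?|[a-z]{1,2})(.*)$")


def passes_string_rules(smiles: str, max_len: int = 40) -> bool:
    """Check the subset of filter rules that is visible in the SMILES text."""
    # Fewer than 40 SMILES characters
    if len(smiles) >= max_len:
        return False
    # No counter ions: a dot separates disconnected components
    if "." in smiles:
        return False
    # No stereochemistry: '@' marks chirality, '/' and '\' mark double-bond geometry
    if any(ch in smiles for ch in "@/\\"):
        return False
    # Atoms outside brackets belong to the organic subset, which is fully
    # allowed, so only bracket atoms need an element, isotope and charge check.
    for body in BRACKET_ATOM.findall(smiles):
        match = BRACKET_PARTS.match(body)
        if match is None:
            return False
        isotope, symbol, rest = match.groups()
        # No isotopes of hydrogen: [2H] and [3H] encode D and T
        if symbol == "H" and isotope:
            return False
        # Aromatic atoms are written in lower case, e.g. [se] or [nH]
        if symbol.capitalize() not in ALLOWED_ELEMENTS:
            return False
        # No charged groups: inside a bracket, '+' and '-' denote charges
        if "+" in rest or "-" in rest:
            return False
    return True
```

For example, `passes_string_rules("CCO")` accepts ethanol, while `passes_string_rules("C[N+](C)(C)C")` rejects a charged ammonium group. A molecule that passes this check would still need the toolkit-based checks before it could enter the dataset.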

The resulting main dataset contains 39 million molecules. The same rule set was used to generate a second dataset, but molecules with charged groups (including zwitterionic forms) and stereochemistry were retained. Furthermore, molecules containing tokens that were rare in the dataset were removed (see “Tokenization” section), resulting in a dataset of approximately 37 million molecules. Retaining charge and stereochemistry information makes the SMILES strings longer, so the rule that the SMILES length should not exceed 40 characters removed more molecules. As a result, dataset 2 is smaller than dataset 1.

Molecular bitmap images were generated using the CDK Structure Diagram Generator (SDG). The CDK depiction generator enables the generation of production-quality 2D images. In this work, every molecule was randomly rotated and depicted as an 8-bit PNG image with a resolution of 299 × 299 pixels. Care was taken that each image contains only one structure.

The third dataset was generated by applying image augmentations to the set of images from the second dataset. The augmentations were applied using the imgaug [25] Python package. One of the following augmentations was randomly applied to the images:

Gaussian Blur
Average Blur
Additive Gaussian Noise
Salt and Pepper
Salt
Pepper
Coarse Dropout
Gamma Contrast
Sharpen
Enhance Brightness
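The selection scheme, in which exactly one augmentation from a pool is applied at random, can be illustrated with a small standard-library sketch. This is not the imgaug code used in this work: the functions below are simplified stand-ins for five of the listed augmentations, they operate on a grayscale image stored as nested lists, and all parameter values (noise probability, gamma range, brightness range) are assumptions chosen for illustration.

```python
import random


def salt(img, rng, p=0.02):
    """Set a random fraction p of pixels to white."""
    return [[255 if rng.random() < p else px for px in row] for row in img]


def pepper(img, rng, p=0.02):
    """Set a random fraction p of pixels to black."""
    return [[0 if rng.random() < p else px for px in row] for row in img]


def salt_and_pepper(img, rng, p=0.02):
    """Set a random fraction p of pixels to either black or white."""
    def noisy(px):
        return rng.choice((0, 255)) if rng.random() < p else px
    return [[noisy(px) for px in row] for row in img]


def gamma_contrast(img, rng, low=0.5, high=2.0):
    """Apply a gamma curve with a randomly drawn exponent."""
    gamma = rng.uniform(low, high)
    return [[round(255 * (px / 255) ** gamma) for px in row] for row in img]


def enhance_brightness(img, rng, low=0.8, high=1.2):
    """Scale all pixel values by a random factor, clipped to 255."""
    factor = rng.uniform(low, high)
    return [[min(255, round(px * factor)) for px in row] for row in img]


# Hypothetical pool; the list above names ten augmentations in total
AUGMENTATIONS = [salt, pepper, salt_and_pepper, gamma_contrast, enhance_brightness]


def augment_one(img, rng):
    """Apply exactly one randomly chosen augmentation and report its name."""
    op = rng.choice(AUGMENTATIONS)
    return op(img, rng), op.__name__


rng = random.Random(0)                    # fixed seed for reproducibility
image = [[128] * 8 for _ in range(8)]     # toy 8 x 8 grayscale image
augmented, applied = augment_one(image, rng)
```

Because only one operation is drawn per image, every augmented depiction differs from its source in a single, identifiable way.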

Deep learning in chemistry often uses SMILES as a textual representation of structures. Training Neural Networks (NNs) directly with SMILES, however, has pitfalls: in order to generate tokens, a set of rules has to be set up on how and where to split long SMILES strings into smaller words. After training, invalid SMILES are often encountered in the predictions, which significantly reduces the overall accuracy. To tackle this problem, two new text representations were introduced, DeepSMILES [26] and SELFIES [27]. DeepSMILES exhibited better results than standard SMILES, but invalid DeepSMILES again caused similar problems. In the end, SELFIES were used, since they can be split easily into tokens by splitting the SELFIES string at closing (“]”) and opening brackets (“[”). No further rules had to be applied to split them into a working token set (Fig. 1). They also translate back into a SMILES string without any errors. All SMILES strings in our 3 datasets were converted into SELFIES using Python.
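The bracket-based splitting can be sketched in a few lines of Python. This is an illustrative sketch rather than the tokenizer used in this work: the names `split_selfies` and `build_vocabulary` are hypothetical, the example strings are hand-written, and reserving index 0 for padding is an assumption.

```python
import re

# One token is everything from an opening bracket to the next closing bracket
TOKEN = re.compile(r"\[[^\[\]]*\]")


def split_selfies(selfies: str) -> list:
    """Split a SELFIES string into its bracketed tokens."""
    tokens = TOKEN.findall(selfies)
    # The tokens must reassemble into the input; otherwise characters
    # outside brackets were present and would be silently dropped.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a plain bracketed SELFIES string: {selfies!r}")
    return tokens


def build_vocabulary(strings) -> dict:
    """Map each distinct token to an integer index (0 is left for padding)."""
    vocab = sorted({tok for s in strings for tok in split_selfies(s)})
    return {tok: idx for idx, tok in enumerate(vocab, start=1)}


tokens = split_selfies("[C][C][O]")
vocabulary = build_vocabulary(["[C][C][O]", "[C][=O]"])
```

Here `tokens` is `['[C]', '[C]', '[O]']`, and `vocabulary` assigns one index to each of the three distinct tokens `[=O]`, `[C]` and `[O]`. No chemistry-aware rules are needed, which is the property that makes SELFIES convenient as a token source.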

To summarize, the datasets used in this work are:

1. Dataset 1: PNG images of chemical structure depictions plus corresponding canonical SMILES converted into SELFIES, without stereochemical information and charged groups.
2. Dataset 2: PNG images of chemical structure depictions plus corresponding canonical SMILES converted into SELFIES, with stereochemical information and charged groups.
3. Dataset 3: Augmented PNG images of chemical structure depictions plus corresponding canonical SMILES converted into SELFIES, with stereochemical information and charged groups.

The test datasets comprise 10% of each dataset. To ensure that the chemical diversity of test and training data was similar, these 10% of the SMILES were selected using the RDKit MaxMin algorithm. An overview of all the train and test datasets and the naming of subsets can be found in Table 1.
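The idea behind MaxMin selection is to pick, at each step, the candidate whose nearest already-picked neighbour is farthest away, so that the selected subset spans the chemical space of the whole set. The following pure-Python sketch illustrates that greedy scheme under simplifying assumptions; it is not RDKit's implementation. Fingerprints are modelled as sets of integer feature identifiers, the distance is one minus the Tanimoto similarity, and the names `tanimoto_distance` and `maxmin_pick` as well as the fixed starting index are hypothetical.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - Tanimoto similarity of two feature sets."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedily pick n_pick diverse items and return their indices."""
    if n_pick <= 0 or not fingerprints:
        return []
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        remaining = (i for i in range(len(fingerprints)) if i not in picked)
        # Choose the item that is farthest from everything picked so far
        candidate = max(remaining, key=lambda i: min_dist[i])
        picked.append(candidate)
        for i, fp in enumerate(fingerprints):
            d = tanimoto_distance(fp, fingerprints[candidate])
            min_dist[i] = min(min_dist[i], d)
    return picked


# Toy fingerprints: items 0, 1 and 3 are similar, item 2 is unrelated
fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
test_indices = maxmin_pick(fps, n_pick=2)
```

With these toy fingerprints the second pick is item 2, the only one that shares no features with the starting item. For a 10% test split, `n_pick` would be set to one tenth of the dataset size.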

Table 1 Overview of the datasets
