Volume 2, Issue 1 e202100069
Full Paper
Open Access

Image2SMILES: Transformer-Based Molecular Optical Recognition Engine**

Ivan Khokhlov
Syntelly LLC, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation

Lev Krasnov
Syntelly LLC, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation
Department of Chemistry, Lomonosov Moscow State University, 1 Leninskiye Gory, 119991 Moscow, Russia

Prof. Maxim V. Fedorov
Syntelly LLC, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation
Sirius University of Science and Technology, Olimpiysky ave. b.1, 354000 Sochi, Russia
Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation

Dr. Sergey Sosnin (Corresponding Author)
Syntelly LLC, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation
Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russian Federation
First published: 11 January 2022
** A previous version of this manuscript has been deposited on a preprint server (https://doi.org/10.26434/chemrxiv.14602716.v1).

Graphical Abstract

A Transformer-based artificial neural network that can convert images of organic structures to molecular templates is presented. To train this network, a comprehensive data generator that stochastically simulates various drawing styles, functional groups, functional group placeholders (R-groups), and visual contamination was developed.

Abstract

The rise of deep learning in various scientific and technology areas promotes the development of AI-based tools for information retrieval. Optical recognition of organic structures is a key part of the automated extraction of chemical information. However, this is a challenging task because there is a large variety of representation styles. In this research, we present a Transformer-based artificial neural network to convert images of organic structures to molecular structures. To train the model, we created a comprehensive data generator that stochastically simulates various drawing styles, functional groups, functional group placeholders (R-groups), and visual contamination. We demonstrate that the Transformer-based architecture can gather chemical insights from our generator with almost absolute confidence. That means that, with Transformer, one can fully concentrate on data simulation to build a good recognition model. A web demo of our optical recognition engine is available online on the Syntelly platform, and the code for dataset generation is available on GitHub.

Introduction

A large body of chemical data has been published in the literature over the years.1 Unfortunately, before the computer era, these valuable data were available in printed sources only, and the current challenge is to extract and mine the data from them. The extensive development of deep neural networks has significantly improved the performance of optical recognition tasks. However, the recognition of graphical or weakly structured information remains a challenging problem. A common example is the recognition of chemical structures. First, the drawing style of chemical compounds (atom label fonts, bond depiction style, etc.) is not fully standardized among publishers. Second, chemical compounds are commonly drawn as Markush structures: a scaffold that can describe many compounds. There are no common guidelines for Markush structures, which leads to a large variety of Markush representations; sometimes even experienced chemists struggle with extreme cases. Moreover, authors of chemical papers occasionally represent chemical structures in an artistic style. Some examples of artistically styled schemes are given in Figure 1. To sum up, recognizing chemical structures and molecular templates is a challenging problem, and we believe that it can be solved with AI-based tools only.

Figure 1. Examples of molecules depicted in artistic styles.

There are two principal approaches to constructing an optical recognition system: i) a rules-based approach that uses detectors of optical primitives (characters, lines, edges) and a set of pre-defined rules describing common journal drawing styles; ii) a fully data-driven approach based on deep neural networks. There are several examples of rules-based software: Optical Structure Recognition Software To Recover Chemical Information (OSRA),2 ChemReader,3 Kekule,4 and CLiDE Pro.5 However, ANN-based approaches are growing explosively nowadays. A common deep-learning approach is a hybrid neural network with convolutional layers for image processing and recurrent layers as a decoder that produces a structural representation of an organic molecule in a linear molecular notation (SMILES, DeepSMILES6 or SELFIES7). The approaches proposed by Rajan et al.,8 Staker et al.,9 and Clevert et al.10 generally follow this CNN+RNN pipeline for the direct conversion of bitmap images to structural representations: a stack of convolutional layers serves as an image encoder, and an RNN-based decoder generates the structures. These methods showed promising results and proved the feasibility of data-driven approaches. Weir et al.11 extended the CNN+RNN approach to the recognition of hand-drawn hydrocarbon structures. Sundaramoorthy et al.12 and Rajan et al.13 proposed Vision Transformer and CNN+Transformer approaches as alternatives to the CNN+RNN pipeline. In contrast to end-to-end approaches, Oldenhof et al.14 proposed a hybrid approach in which chemical primitives are located and recognized by several deep-learning models and then combined into a chemical graph by a predefined algorithm. A detailed review of existing approaches is given in the survey by Rajan et al.15

A common application of a chemical recognition system is the automatic population of chemical databases.16 However, there is an unpleasant property: even a single mistake in one token breaks the whole structure. This is not the case, for example, in neural machine translation, where replacing a word with its synonym is acceptable. Therefore, the accuracy of the recognition engine should be as high as possible, and there should be a way to judge whether a prediction is correct or not.

Transformer is an architecture initially proposed by the Google team for neural machine translation.17 However, the architecture and its modifications have demonstrated outstanding performance in many other tasks, e.g., text generation18 and computer vision.19 In chemistry, Transformer has been applied to the prediction of the outcomes of organic reactions,20, 21 QSAR modelling,22 and the conversion between SMILES and IUPAC names.23 The performance of Transformer-based architectures is generally higher than that of RNN-based approaches. This observation motivated us to implement a Transformer-based engine for the optical recognition of chemical structures.

Data is the key in machine learning. However, as far as we know, there are no open-access datasets with annotated chemical objects in articles. The only way to obtain a large dataset is to build a data-generative model that simulates a large variety of real paper drawings. We believe that without a proper data augmentation scheme, there is no way to train Transformer well.

The novelty of our approach lies both in a strong focus on the data generation scheme and in the ability to process not only organic structures but molecular templates as well, which makes the approach ready to use on real data.

In this work, we present an advanced training data generator that simulates many real molecule rendering cases and train a Transformer-based model on the simulated data. Surprisingly, this model demonstrated very high performance on the external validation set of simulated data; moreover, it shows good performance on real data. We show that there is a strong correlation between the confidence score of our model and the validity of recognition, so threshold tuning can provide high confidence for the extracted structures. Our model will be a significant part of the pipeline for extracting and analyzing chemical documents from scans that we are working on.

Materials and Methods

As mentioned above, the key part of our approach is the data generator. In this section, we describe the dataset of molecular structures, our extension for processing Markush structures, the data augmentation scheme, and the network architecture.

Dataset

The PubChem database contains about 100 M molecules. Not all of them are required: we estimate that we need about 10 M structures, according to the limits of our training environment, so it was necessary to select a roughly 10 % subset biased towards more complicated structures. We implemented a probability-based approach that selects rare or complicated molecules with higher probability. We propose two empirical coefficients: the Basic Coefficient (BC) and the Complexity Gain (CG). BC depends on the molecular size (the number of non-hydrogen atoms) and is calculated by the following equation:
$BC = 0.1 + 1.2\left(\frac{n_{max} - n}{n_{max}}\right)^{3}$  (1)

where n is the molecule size and n_max = 60 is the upper limit. This empirical formula gives a greater probability to small molecules, compensating for the much larger number of big molecules compared to small ones. The Basic Coefficient can be modified by multiplying it by the Complexity Gain (CG), which favours more complicated molecules. In this work we use a single value CG = 1.5, and N_cond counts how many of the following conditions are satisfied:

  • a molecule has rare rings (size 3, 4, 7+)

  • a molecule is bridged

  • a molecule has atoms that are other than C, N, S, O, P, and halogens

The Full Coefficient (FC) is:

$FC = BC \prod_{i=1}^{N_{cond}} CG = BC \cdot CG^{N_{cond}}$  (2)

FC is related to the probability of choosing a molecule, p(M), by the following equation:

$p(M) = \begin{cases} FC & \text{if } FC < 1 \\ 1 & \text{if } FC \geq 1 \end{cases}$  (3)

To prepare the training set of molecular structures, we took the first million PubChem structures (PubChem indices 1-1000000), because they are quite small and, we suppose, the majority of these compounds are common scaffolds. At the next stage, we iterated over the rest of PubChem and calculated FC for each molecule. Then we sampled 10 M molecules randomly, following their probabilities p(M).
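As an illustration, this selection procedure can be sketched as follows. The RDKit-based checks for the three conditions are our own assumptions about how they could be implemented; the generator's exact predicates may differ.

```python
import random
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

N_MAX, CG = 60, 1.5
COMMON_ELEMENTS = {"C", "N", "S", "O", "P", "F", "Cl", "Br", "I"}

def basic_coefficient(mol):
    # Eq. (1): gives small molecules a higher base probability
    n = min(mol.GetNumHeavyAtoms(), N_MAX)
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3

def n_conditions(mol):
    # number of satisfied "complexity" conditions from the list above
    ring_sizes = {len(ring) for ring in mol.GetRingInfo().AtomRings()}
    rare_rings = any(s in (3, 4) or s >= 7 for s in ring_sizes)
    bridged = rdMolDescriptors.CalcNumBridgeheadAtoms(mol) > 0
    rare_atoms = any(a.GetSymbol() not in COMMON_ELEMENTS for a in mol.GetAtoms())
    return sum([rare_rings, bridged, rare_atoms])

def selection_probability(mol):
    fc = basic_coefficient(mol) * CG ** n_conditions(mol)  # Eq. (2)
    return min(fc, 1.0)                                    # Eq. (3)

def sample(smiles):
    """Decide stochastically whether a PubChem molecule enters the training set."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and random.random() < selection_probability(mol)
```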

Rendering augmentations

Rendering augmentations are variations in the style and geometry of the rendered molecules. They provide coverage of different drawing styles, improving the robustness of the final model.

We chose RDKit24 as an auto-drawing tool. The following drawing options were used:

  • rotation by a random angle

  • varied font size

  • varied line thickness

  • varied distance between the lines of double and triple bonds

  • varied whitespace around atom labels

  • the optional CoordGen coordinate generator from RDKit, used in 20 % of cases

The example in Figure 2 shows two different augmented images generated for one structure.

Figure 2. An example of two images for a molecule generated by the rendering augmentation engine.
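A minimal sketch of such style randomization with RDKit's SVG drawer is given below; the value ranges are illustrative assumptions rather than the exact settings of our generator.

```python
import random
from rdkit import Chem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D

def render_augmented(smiles, size=384):
    mol = Chem.MolFromSmiles(smiles)
    rdDepictor.SetPreferCoordGen(random.random() < 0.2)   # CoordGen in ~20 % of cases
    rdDepictor.Compute2DCoords(mol)

    drawer = rdMolDraw2D.MolDraw2DSVG(size, size)
    opts = drawer.drawOptions()
    opts.rotate = random.uniform(0.0, 360.0)               # random rotation angle
    opts.bondLineWidth = random.choice([1, 2, 3])          # line thickness
    opts.minFontSize = opts.maxFontSize = random.randint(12, 22)  # font size
    opts.multipleBondOffset = random.uniform(0.10, 0.25)   # gap in double/triple bonds
    opts.additionalAtomLabelPadding = random.uniform(0.0, 0.15)   # space around atom labels

    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    return drawer.GetDrawingText()                          # SVG string

svg = render_augmented("Cc1cc(C)c(-c2ccccc2)c(-c2ccc([N+](=O)[O-])cc2)c1")
```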

Functional and R-groups

In most chemical documents, authors draw molecules with functional groups and R-group substituents (Markush structures). In order to generate molecules with such substituents, we created a list of more than 100 common functional groups and described each group as a SMARTS template. Our augmentation algorithm stochastically replaces functional groups in molecules to generate an augmented dataset.

One should note that some functional groups are nested; typical examples are the methyl (-Me) and methoxy (-OMe) groups (Figure 3). We designed a resolving method, described below, to prevent nested groups from overlapping.

Figure 3. Nested functional groups: methyl (-Me) and methoxy (-OMe).

At the first stage, our algorithm searches for all occurrences of all SMARTS templates in a molecule; for each overlapping case, only one randomly chosen group is kept. A random half of the selected groups is then substituted in the following way: the whole functional group branch is removed and replaced by a pseudo-atom whose label is the short name of the group. To provide a variety of R-groups, the algorithm substitutes R-groups in place of half of the found -Me groups. We chose methyl groups for this replacement because they are the most frequent substituents in chemical structures. The types and corresponding probabilities of R-group substitution are given in Table 1.

Table 1. R-groups and their generation probabilities.

| Label | Prob |
| R | 0.2 |
| R1 | 0.15 |
| R1 | 0.05 |
| R2 | 0.1 |
| R3 | 0.1 |
| R4 | 0.05 |
| R5 | 0.05 |
| R6 | 0.05 |
| R7 | 0.025 |
| R8 | 0.025 |
| R9 | 0.025 |
| R10 | 0.025 |
| R' | 0.1 |
| R" | 0.05 |
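The substitution step can be sketched as follows; the SMARTS dictionary here is a tiny illustrative subset of the full list of more than 100 groups, and the overlap and sampling handling is simplified compared to the actual generator.

```python
import random
from rdkit import Chem

# tiny illustrative subset of the functional-group templates (SMARTS -> short label);
# larger (possibly nesting) templates come first, so -OMe is resolved before -Me
FUNCTIONAL_GROUPS = [
    ("[OX2H0][CH3]", "OMe"),   # methoxy (contains a nested methyl)
    ("[N+](=O)[O-]", "NO2"),   # nitro
    ("[CH3]", "Me"),           # methyl
]

def labelled_pseudo_atom(label):
    """A single dummy atom whose drawn text is the group's short name."""
    atom = Chem.Atom(0)
    atom.SetProp("atomLabel", label)
    frag = Chem.RWMol()
    frag.AddAtom(atom)
    return frag.GetMol()

def substitute_groups(mol, p=0.5):
    """Collapse a random subset of matched groups into labelled pseudo-atoms."""
    for smarts, label in FUNCTIONAL_GROUPS:
        if random.random() > p:
            continue
        query = Chem.MolFromSmarts(smarts)
        mol = Chem.ReplaceSubstructs(mol, query, labelled_pseudo_atom(label),
                                     replaceAll=True)[0]
    Chem.SanitizeMol(mol)
    return mol

mol = substitute_groups(Chem.MolFromSmiles("COc1ccc(C)cc1[N+](=O)[O-]"))
print(Chem.MolToSmiles(mol))   # pseudo-atoms appear as "*" in the SMILES output
```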

We render molecules as vector images (SVG format), with some postprocessing to fix the depiction of labels with subscript and superscript indices, dashes, and multiple uppercase characters. An additional augmentation is the variation of the size and scale of symbols. An example with a source molecule and several random modifications is given in Figure 4.

Figure 4. Examples of generated molecules with functional and R-groups.

We implemented a method to generate images with R-groups in variable positions in a ring because it is a common case in real chemical documents (Figure 5).

Figure 5. Example of R-groups in variable positions, from [25].

If a ring has no more than two substituents, R-groups (R, R1, R2, R', R'') are drawn in variable positions with 20 % probability, with at most one replacement per ring and two per molecule. A virtual bond is added to make RDKit place the group in front of a ring bond, and SVG postprocessing then replaces the two bonds with a single line. A visual explanation is given in Figure 6.

Figure 6. Rendering an R-group in variable positions.

Functional Groups SMILES

SMILES notation represents molecules, while Markush structures are molecular templates. There is no way to represent molecular templates in standard SMILES, so we designed a modified syntax that we named FG-SMILES (Functional Group SMILES). It is an extension of standard SMILES in which a substituent or an R-group can be written as a single pseudo-atom. If a substituent is a functional group, FG-SMILES can be translated to SMILES directly by replacing the corresponding pseudo-atoms. An example:

SMILES: Cc1cc(C)c(-c2ccccc2)c(-c2ccc([N+](=O)[O-])cc2)c1

FG-SMILES: [Me]c1cc([Me])c(-[Ph])c(-c2ccc([NO2])cc2)c1

FG-SMILES notation also allows describing variable R-group positions. We add the v symbol to denote a variable R-group inside an aromatic system. For example, the template c1[vR']cccc([R2])c1 represents the template in Figure 6. Formally, this notation breaks SMILES grammar because the branching atom is inside the ring, but it represents the case when an R-group is attached not to a specific position in the ring but to the ring itself.
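A sketch of the direct FG-SMILES-to-SMILES translation for functional-group pseudo-atoms is shown below. The group table is a small illustrative subset of the dictionary, and R-group pseudo-atoms, which have no plain-SMILES equivalent, are left untouched.

```python
import re
from rdkit import Chem

# illustrative subset of the group dictionary; each fragment's first atom is the attachment point
GROUP_SMILES = {"Me": "C", "Ph": "c1ccccc1", "NO2": "[N+](=O)[O-]", "OMe": "OC"}

def fg_smiles_to_smiles(fg_smiles):
    # 1) turn known [Group] pseudo-atoms into numbered dummy atoms that RDKit can parse
    labels = []
    def to_dummy(m):
        label = m.group(1)
        if label not in GROUP_SMILES:
            return m.group(0)              # R-groups and unknown labels stay as-is
        labels.append(label)
        return f"[*:{len(labels)}]"
    mol = Chem.MolFromSmiles(re.sub(r"\[([A-Za-z0-9]+)\]", to_dummy, fg_smiles))

    # 2) replace every numbered dummy atom with its functional-group fragment
    for i, label in enumerate(labels, start=1):
        frag_root = mol.GetNumAtoms()      # index of the fragment's first atom after combining
        rw = Chem.RWMol(Chem.CombineMols(mol, Chem.MolFromSmiles(GROUP_SMILES[label])))
        dummy = next(a for a in rw.GetAtoms()
                     if a.GetAtomicNum() == 0 and a.GetAtomMapNum() == i)
        anchor = dummy.GetNeighbors()[0].GetIdx()
        rw.AddBond(anchor, frag_root, Chem.BondType.SINGLE)
        rw.RemoveAtom(dummy.GetIdx())
        mol = rw.GetMol()
    Chem.SanitizeMol(mol)
    return Chem.MolToSmiles(mol)

print(fg_smiles_to_smiles("[Me]c1cc([Me])c(-[Ph])c(-c2ccc([NO2])cc2)c1"))
# equivalent to: Cc1cc(C)c(-c2ccccc2)c(-c2ccc([N+](=O)[O-])cc2)c1
```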

The generator source code link is available in Supporting information. The code is published under MIT licence.

Image augmentations

When our model operates in a real environment, its input is a region cropped from an optical scan. However, a molecular image is commonly contaminated by other details such as parts of other molecules, text, arrows, lines, and other elements, and the model has to be robust to this contamination. For example, sometimes these objects are very close to or even touch the main molecule, and small labels can lie inside the molecule contour. In such cases, classic computer vision approaches fail: a contour detection algorithm cannot separate the main molecule from the contamination robustly. Even worse, one cannot simply ignore disconnected elements in the image because they may be in their proper place. Our preliminary experiments showed that the presence of even small contamination in an image spoils the prediction. To combat this problem, we proposed a contamination augmentation algorithm that simulates typical contamination: it stochastically renders common contamination types, namely parts of other structures, labels, and arrows taken randomly from a set of chemical documents. We noted that models trained with contamination augmentations are much more robust. Some examples of the outcomes of our contamination augmentation algorithm are given in Figure 7.

Figure 7. Examples of molecules generated by the contamination augmentation algorithm.

Also, we used standard computer vision augmentations implemented in the "albumentations" library.26 Three types of augmentations were used: geometric distortions (non-linear and piecewise linear), pixel-wise operations (various noises, smoothing, and sharpening), and random cuts (padding, shifts, and linear stretching). After augmentation, the images were resized to 384x384 and fed to the neural network. Some examples of augmented images are available in the Supporting Information.
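A pipeline of this kind can be sketched with the albumentations API as follows; the chosen transforms and parameters are illustrative, not the exact configuration used for training.

```python
import cv2
import albumentations as A

# geometric distortions, pixel-wise operations and random cuts, then resize to 384x384
augment = A.Compose([
    A.ElasticTransform(p=0.3),                              # non-linear distortion
    A.PiecewiseAffine(p=0.2),                               # piecewise-linear distortion
    A.GaussNoise(p=0.3),                                    # noise
    A.GaussianBlur(blur_limit=(3, 5), p=0.3),               # smoothing
    A.Sharpen(p=0.2),                                       # sharpening
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                       rotate_limit=0, p=0.3),              # shifts / linear stretching
    A.PadIfNeeded(min_height=400, min_width=400, p=0.3),    # padding
    A.Resize(384, 384),
])

image = cv2.imread("molecule.png")            # HxWxC uint8 array
augmented = augment(image=image)["image"]     # ready to be fed to the network
```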

Model architecture

Transformer was initially designed for conversion between linear representations, so we had to modify it to work with images. Initially, we followed the standard route and implemented a ResNet-5027 block as an embedding layer before the Transformer encoder. However, this idea turned out not to be sufficiently effective. Surprisingly, the performance of the model improved when we removed the encoder completely. The architecture of our model is given in Figure 8. It is worth mentioning that the Vision Transformer28 had not yet been published at that moment, which is why we chose this route. Nevertheless, our finding that a CNN block without any input attention gave the best result is interesting in itself, and whether this no-input-attention approach can benefit general image captioning tasks deserves a dedicated study.

Figure 8. A scheme of the Img2SMILES model. Adapted from [17].

The input to the model has a shape of 384x384. We use ResNet-50 as the CNN block. The full ResNet module's output has a shape of 2048x12x12; this is not well suited for substituting the encoder attention with a CNN, because the decoder receives it as a sequence of only 12 tokens. Also, we considered 2048 too large a depth for the Transformer, which gives unnecessarily long training and inference times. That is why we removed the last two residual blocks of ResNet-50, obtaining a feature map of shape 512x48x48, which is very close to the classical Transformer dimension. The other parameters of the Transformer decoder were taken from the classical architecture.17
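A condensed PyTorch sketch of this architecture is given below. The decoder hyperparameters follow the classical Transformer configuration, but details such as the positional embedding and maximum sequence length are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class Img2SMILES(nn.Module):
    """Sketch of the encoder-free model: a truncated ResNet-50 (last two residual
    stages removed) yields a 512x48x48 feature map for a 384x384 input, which is
    flattened into a token sequence and fed directly to a Transformer decoder."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, max_len=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep stem + layer1 + layer2 only -> (B, 512, 48, 48)
        self.cnn = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                 resnet.maxpool, resnet.layer1, resnet.layer2)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, 384, 384); tgt_tokens: (B, T) FG-SMILES token ids
        memory = self.cnn(images).flatten(2).permute(2, 0, 1)   # (48*48, B, 512)
        T = tgt_tokens.size(1)
        pos = torch.arange(T, device=tgt_tokens.device)
        tgt = (self.token_emb(tgt_tokens) + self.pos_emb(pos)).permute(1, 0, 2)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tgt_tokens.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)        # (T, B, 512)
        return self.head(out).permute(1, 0, 2)                  # (B, T, vocab)
```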

Model training

The model was trained for 2 weeks on a node of the Zhores HPC cluster29 equipped with 4 Nvidia V100 GPUs and 36 CPU cores. The generation of the 10 M training dataset took an additional 3 days on 80 CPUs. We randomly reserved 5 % of the dataset for internal validation; this looks like a small fraction, but in absolute numbers it is 500k samples, which is quite sufficient. There are observations that Transformer-based models have a low tendency to overfit,23, 30 and some augmentations were applied on the fly at each epoch. We used the RAdam optimizer31 with a learning rate of 3e-4 and trained our model for five epochs, which took 63 hours per epoch on average. To ensure that our training and test sets have similar molecular distributions, we projected these molecules onto a 2D plane using a parametric t-SNE approach32 and inspected them visually. The projections are given in the Supporting Information for this article. The performance of the model on the training and internal validation sets is given in Table 2.

Table 2. The statistics on training and internal validation sets.

| Epoch | Train loss | Train acc. per token | Train acc. per sequence | Val. loss | Val. acc. per token | Val. acc. per sequence |
| 1 | 0.0766 | 0.9716 | 0.6696 | 0.0195 | 0.9933 | 0.8614 |
| 2 | 0.0176 | 0.9939 | 0.8753 | 0.0144 | 0.9951 | 0.8958 |
| 3 | 0.0137 | 0.9953 | 0.8753 | 0.0119 | 0.9960 | 0.8958 |
| 4 | 0.0119 | 0.9959 | 0.9112 | 0.0104 | 0.9960 | 0.9223 |
| 5 | 0.0108 | 0.9963 | 0.9178 | 0.0095 | 0.9968 | 0.9280 |

Our model did not achieve saturation at the final epoch, so we suppose that longer training would result in better performance.
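For completeness, the corresponding training step can be sketched as follows: teacher forcing with token-level cross-entropy and the RAdam optimizer from recent PyTorch versions. The padding handling and data loader are simplified assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, pad_idx, epochs=5, device="cuda"):
    model.to(device)
    optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    for epoch in range(epochs):
        for images, tokens in loader:                 # tokens: (B, T) FG-SMILES ids
            images, tokens = images.to(device), tokens.to(device)
            logits = model(images, tokens[:, :-1])    # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tokens[:, 1:].reshape(-1))   # predict the next token
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```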

Results and discussion

Our model demonstrates surprisingly good performance on the validation set. However, we want to stress that this is an optimistic estimate of quality because our validation set does not reflect real data; rather, it shows how well the model has learned the data generation principle. We tried to simulate many different variants of molecule rendering, but our generator is still limited and does not cover all possible functional groups, drawing styles, and contamination types. A proper validation should be grounded in manually curated data from real chemical documents.

Performance on the generated validation set

We validated our model on the external validation set, which consists of 1 M generated images. The validation set is a PubChem subset chosen by the same approach as the training set and not intersecting with it. On this set, our model demonstrated 90.7 % accuracy. We want to highlight that the match criterion was strict: we considered only fully correct recognition, including stereochemistry and indices in R-groups. For example, if R' was recognized as R1, we regard the whole recognition as wrong. The most common mistakes were incorrect stereo groups and wrong indices in R-groups, which are obviously the most challenging tasks.

The correlation between the confidence score and the recognition accuracy is given in Table 3. We noticed a strong correlation between prediction correctness and the network's prediction score: if we ignore samples with low prediction scores, we obtain more precise recognition on the remaining samples. This correlation opens the door to a fully automatic chemical data extraction pipeline.

Table 3. The correlation between the confidence score and the accuracy on the structures that are above the confidence score threshold.

Threshold

Ignored part, %

Rest part accuracy, %

0

90.7

0.98

10.3

97.0

0.99

15.1

98.6

0.995

22.5

99.85

One can see that with a 0.995 threshold, 22.5 % of the data is ignored, but the accuracy on the rest is almost absolute: 99.85 %.
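This filtering step can be sketched as follows; aggregating the per-token probabilities by their product into a sequence-level score is our illustrative assumption, not necessarily the engine's exact definition of the confidence score.

```python
import torch

def sequence_confidence(logits, token_ids, pad_idx):
    """Product of the per-token probabilities of a generated FG-SMILES string."""
    probs = torch.softmax(logits, dim=-1)                         # (T, vocab)
    token_probs = probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
    return token_probs[token_ids != pad_idx].prod().item()

def keep_confident(predictions, threshold=0.995):
    """predictions: list of (fg_smiles, confidence); drop low-confidence structures."""
    return [(s, c) for s, c in predictions if c >= threshold]
```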

Validation on real data

Massive automated testing on real data is challenging at this moment because of the lack of data. However, we tried to estimate the performance of our model on real data. The approach we used was to recognize an image cropped manually from a page, convert the predicted structure back into an image, and compare the original image with the rendered structure manually. A demonstration of this approach is given in Figure 9: the original image is shown on the right-hand side, the recognition result, redrawn by RDKit, on the left-hand side, and the red string above represents the FG-SMILES outcome with the corresponding score.

Figure 9. Demonstration of our manual test tool. The original image is shown on the right-hand side and the recognition result on the left-hand side; one should match these structures to estimate the validity. This example was taken from [33].

Our practice revealed that an experienced person can compare structures visually within a dozen seconds, even for large molecules, so a manual test is feasible for a limited number of samples. We took ten articles34-43 to test our method on real data. Structures that represent reaction mechanisms were excluded, so we considered only molecules and Markush templates.

The possibility of recognition relies on the presence of the corresponding structural patterns in our data generator, which means that there are inherent limitations in our model. However, to estimate the true performance, we kept such structures in the test. We defined this set as dataset A. We tested 332 structures; 263 were recognized correctly and 69 incorrectly. A closer look at the incorrectly recognized structures revealed that our model cannot process 56 of those 69 due to the absence of the required functional groups in the model dictionary: these structures contain groups that are not included in our engine, like Ms, NMe2, and even # (Figure 14). Another case is the varying chain length "()n" pattern in Markush structures, which is not supported by our generator. Among the structures that our model is able to recognize, the most frequent failure was the confusion of R-indices (for example, between R, R' and R1). Another common mistake is wrong stereo configurations. A few structural errors also occurred; in those cases, a fragment of a molecule was missing, but the generated result was still chemically valid.

It is worth stressing that we do not search for bounding boxes automatically in this validation scheme; we crop the bounding boxes manually. Therefore, the performance in a fully automatic extraction pipeline can vary depending on the performance of the bounding-box detector used. Some examples of incorrect recognition are given in Figures 10, 11, 12, 13 and 14.

Figure 10. An example of the model's failure on a large (>40 atoms) molecule (from [36]).

Figure 11. An example of incorrect recognition because the functional group NMe2 is not present in the template dictionary (from [34]).

Figure 12. An example of incorrect recognition because the "()n" syntax is not supported yet (from [37]).

Figure 13. An example of incorrect recognition due to the fact that we do not train the model to recognize disconnected structures, i.e. ionic bonds (from [35]).

Figure 14. An example of incorrect recognition because the "#" syntax is not supported yet (from [35]).

To provide a comparative study, we analyzed the same set of articles with the OSRA engine44 (latest version 2.11). OSRA automatically extracted 562 structures from the papers. We manually inspected these structures and excluded those that represent reaction schemes or incorrectly cropped regions (partially cropped molecules, absent molecules, etc.). After this manual processing, 465 valid structures remained. We validated the OSRA predictions in the same way as for Image2SMILES, comparing the rendered images with the reference structures. OSRA predicted 289 structures correctly, resulting in a total accuracy of 62.1 %.

Systematic validation

One can see that in the previous test OSRA extracted 465 valid structures, while we tested our system on 332 structures. The reason is that we did not include repeating or similar molecules in our test; within one paper, similar or repeating molecules occur often, and from our observations simple structures repeat even more often than complicated ones. Moreover, our manual selection could be biased too. That was the reason why we decided to prepare another manual validation test that avoids the weak points of the previous one.1

When preparing paper set A, we took papers randomly. However, to prevent human bias, one should collect a dataset following a systematic approach. To solve this issue, we used one paper from each issue of the Journal of Organic Chemistry for the year 2020; the full list of papers is given in the Supporting Information for this article. We extracted only the structures nearest to the page corners, resulting in no more than four images per page; this approach equalizes pages containing different numbers of molecules. Invalid objects, i.e. parts of reaction mechanisms and artistic drawings, were ignored. If the closest structure was not valid, or there were no structures at all in the corresponding quarter of a page, we ignored that quarter to prevent selection bias. We processed no more than 10 pages from the beginning of each paper, reducing bias towards large articles.

This procedure resulted in 296 extracted structures, for each of which we manually created the corresponding FG-SMILES. We defined this set as dataset B; it is available in the Supporting Information. We performed a manual validation on this dataset: the Image2SMILES model gave 185 correct answers, while OSRA gave 71. The comparison between OSRA and Image2SMILES on datasets A and B is given in Table 4.

Table 4. The accuracy of recognition for the OSRA and Image2SMILES engines on datasets A and B.

| Dataset | Engine | Accuracy, % |
| A | Image2SMILES | 79.2 |
| A | OSRA | 62.1 |
| B | Image2SMILES | 62.5 |
| B | OSRA | 24.0 |

We investigated the reasons that led our model to fail in each case. Our observations revealed that there are several types of mistakes. These types are summarized in Table 5.

Table 5. A summary of common mistakes on dataset B.

| Failure type | Single cause | One of multiple causes |
| Unknown group | 18 | 8 |
| Confusion between R' and R1 | 13 | 4 |
| Incorrect group recognition | 10 | 13 |
| Incorrect stereo recognition | 8 | 7 |
| Bold-highlighted bonds | 8 | 4 |
| Incorrect R-group recognition | 7 | 2 |
| Ultracondensed rings | 6 | 0 |
| Incorrect structure recognition | 5 | 14 |
| Ultra-small molecules | 5 | 0 |
| Hydrogen with an explicit bond | 4 | 5 |
| Incorrect carbon chain length | 3 | 2 |

The first column contains the number of structures that were incorrect because of a single reason; the second column contains the number of structures in which more than one failure occurred. All records counted in the first column are mutually exclusive: for example, if a structure was counted under "Confusion between R' and R1", it is not counted in the "Incorrect R-group recognition" row.

One can see that the most common mistake is an unknown group that is not present in our generator dictionary. It can be a completely new group (like Ad or NC), a combination of known groups (for example, OR2R3, PCy2, or OtBu), or an alternative writing of an existing group (like Trt instead of Tr).2 Whatever the reason, it leads to incorrect recognition. The number of groups used by chemists is large but bounded, so improving the dictionary is a reasonable way to enhance the performance of the model.

A rather frustrating issue is the "Confusion between R' and R1". According to our observations, in recent articles authors tend to draw R-group indices in superscript, not in subscript, while our generator was tuned to deliver images with subscript indices mostly (Figure 15).

Figure 15. Example of a molecule with an R1 group (from [45]).

Our observations revealed that there are no papers in which authors mix R'/R'' with numerical indexing in the same molecule, so this observation can be used to fix possible misrecognitions of the "Confusion between R' and R1" type. Following this idea, we can treat the 13 cases where R' was the only mistake as correct answers, enhancing the total accuracy from 62.5 % to 67.0 %.

Another issue3 is caused by bold-highlighted bonds (Figure 16). When a bond in a molecule is bold, our system recognizes it as a double bond or a wedged stereo bond. Adding this drawing style to the generator would have given 8 more correct answers.

Figure 16. Example of a molecule with bold-highlighted bonds (from [46]).

We did not train our model to deal with explicit hydrogens drawn as a separate group; for this reason, it tends to put I (iodine) instead of H (hydrogen) (Figure 17).

Figure 17. Example of a molecule with explicit hydrogens rendered as a separate group (from [46]).

Other common types of failures are: incorrect recognition of functional groups (when one known group is placed instead of another known group), incorrect R-groups (for example, R2 instead of R3 or R instead of vR, etc.), incorrect stereo configurations (while the structure and groups are recognized correctly).

In a number of cases, the molecular structure itself is broken: for example, a part of a molecule is deleted, or a branch of the structure is duplicated somewhere else in the molecule. We believe that these mistakes can be triggered by different reasons, such as a non-typical depiction style, a small distance to a neighbouring molecule, or a molecule that is simply too complicated for the model.

It is worth stressing that, even in the case of heavy structural errors, our model generates syntactically valid (but not correct) FG-SMILES in most cases (Figure 18).

Figure 18. A complicated molecule (from [47]).

A particular issue is the processing of ultracondensed rings (see Figure 19). It indicates that such molecules were underrepresented in the training set.

Figure 19. Molecule with ultracondensed rings (from [48]).

Other noticeable mistakes are incorrect carbon chain lengths and issues with ultra-small molecules. A possible explanation of the first issue is that our model does not process a structure atom-by-atom (as, for example, OSRA does) but takes a shot of the entire molecule, so it seems challenging for the model to count the exact number of carbons in long chains (Figure 20).

Figure 20. Molecule with a long carbon chain (from [49]).

It has already been noticed that extra-short sequences (1-5 atoms, see Figure 21) are poorly predicted by Transformer-based architectures.23 A possible explanation is that Transformer analyzes correlations inside input sequences with its self-attention mechanism, and when the sequences are extra-short, there is not enough information for the network. However, the chemical space of extra-small molecules is limited, and such molecules can be recognized by another, simpler model.

Figure 21. Example of an extra-small molecule with 4 atoms (from [50]).

Analyzing the validation results, we conclude that the majority of failures are related to the generator because it does not cover some cases. The analysis of the Transformer's performance on an automatically generated validation set demonstrates that the architecture itself is able to learn the majority of structures that are chemically valid.

Metrics

Estimating the performance of a chemical OCR model is a challenging task. On the one hand, there is the problem of the absence of standardized datasets.51 On the other hand, defining a proper metric for chemical OCR requires standardization too. The most straightforward choice is a binary metric that indicates whether a molecule is correctly recognized, and this is the metric we used in this work (valid or invalid). It has a practical sense because a real-world problem (for example, filling up chemical databases) requires fully correct recognition, and there is no room for an "almost right" chemical structure. However, the types of mistakes described in the previous section are unequal and have different costs, so a binary metric may have low discriminative ability when comparing the performance of chemical OCR engines; for such a comparison, a continuous metric is desirable. The Tanimoto index or graph edit distance are usually used to estimate the difference between molecules, but they would have to be modified for molecular templates, and at present we have no clear solution for a continuous metric that measures distances between molecular templates. We leave this problem open.

Nevertheless, dataset B contains molecules for which we are able to calculate distances. We extracted 53 structures that RDKit successfully recognized as molecules and calculated the graph edit distance (GED) between the predicted molecules and the corresponding targets. The graph edit distance indicates the number of graph operations (insertions, deletions, or replacements) required to convert one molecule into another; one can consider it a robust molecular similarity metric. For 47 molecules the GED was zero, indicating that the predictions were exactly the same. For five of the remaining molecules the GED was small (2 or 3), and only for one molecule it was 12.
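A sketch of this comparison is given below, assuming the predicted and target molecules are converted to networkx graphs labelled by element and bond order; networkx's exact graph_edit_distance is computationally feasible only for the relatively small molecules considered here.

```python
import networkx as nx
from rdkit import Chem

def mol_to_graph(mol):
    """Convert an RDKit molecule into an atom/bond-labelled networkx graph."""
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), element=atom.GetSymbol())
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                   order=bond.GetBondTypeAsDouble())
    return g

def graph_edit_distance(smiles_pred, smiles_true):
    g1 = mol_to_graph(Chem.MolFromSmiles(smiles_pred))
    g2 = mol_to_graph(Chem.MolFromSmiles(smiles_true))
    return nx.graph_edit_distance(
        g1, g2,
        node_match=lambda a, b: a["element"] == b["element"],
        edge_match=lambda a, b: a["order"] == b["order"])

print(graph_edit_distance("CCO", "CCN"))   # one atom substitution -> 1.0
```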

Generate and Train!

Neural networks are becoming more and more powerful, and Transformer-based models lead in many practical challenges. At the same time, these models can extract objective laws directly from the data with outstanding efficiency.20, 23 One can speculate that, given a large amount of high-quality data, Transformer-based models can learn just about anything (within reasonable limits). Under this assumption, the most important goal in solving a challenge is to develop a data generator (augmenter) that covers the whole variety of real data. However, data augmentation schemes for chemical OCR have not drawn much attention so far. Describing the DECIMER system, Rajan et al.8 focus on the influence of the linear molecular representation on performance but pay little attention to data augmentation, and the Schrödinger team9 concentrates on the network architecture. In contrast, in our research the most important part is data generation. Given the fast-growing applications of Transformer, we believe that in the future the focus of researchers' attention will shift from the models themselves to the proper construction of training data. Clearly, this is not a paradigm shift, because data augmentation is already a key technique in deep learning; rather, it is a tendency that we expect to see. We suppose that the idea of focusing on data generation, which we call "Generate and Train!", will be helpful for other challenges.

Conclusions

In this work, we demonstrated that a Transformer-based architecture can provide a notable advantage in recognizing optical images of chemical structures. We built a comprehensive training data generator that can emulate many drawing styles, molecular templates, and contamination types. Our research shifts the attention from building a data recognition engine to constructing a proper data augmentation scheme. We demonstrated that the performance of a Transformer-based network on our emulated data is very high, and we also tested our engine on real clippings from chemical articles. We noted that the correlation between the confidence score and the recognition performance proves the feasibility of using the engine for automatic chemical data extraction. This work also demonstrates that complicated graphical schemes can be translated into strict formulas. Our modified Transformer without an encoder at the input performed surprisingly well, and we encourage researchers to validate this idea in other optical recognition challenges.

Data and Software Availability

The key part of our approach, the data generator, is freely available on GitHub (https://github.com/syntelly/img2smiles generator). A dataset of 1 M randomly generated samples is located on Zenodo (https://doi.org/10.5281/zenodo.5069806). We demonstrate our Img2SMILES model as a part of a broader pipeline that includes a PDF preprocessor and a bounding-box detector of structures. The demo is available on the Syntelly platform: https://app.syntelly.com/pdf2smiles. It allows users to upload a PDF file and see the recognized structures and how they were cropped. We want to stress that our current bounding-box detector is not perfect, which can lead to poor structure cropping and aberrant predictions in some cases.

The dataset of images, cropped from real papers, with corresponding target strings, is available at Zenodo: https://doi.org/10.5281/zenodo.5356500.

The Supporting Information is located on Zenodo (https://doi.org/10.5281/zenodo.4746136). It contains examples of generated training images and the list of SMARTS templates used for functional group replacement.

The Authors acknowledge the use of computational resources of the Skoltech CDISE supercomputer Zhores for obtaining the results presented in this paper.29 The authors are thankful to Arkadiy Buianov for the help in creation of the web-demo.

Conflict of interest

The authors declare the following competing interests: Maxim Fedorov and Sergey Sosnin are co-founders of Syntelly LLC. Lev Krasnov and Ivan Khokhlov are employees of Syntelly LLC. The authors are going to integrate the functionality described in this paper into the Syntelly online platform.

  • n1 We decided to keep the report on the previous validation because such manual tests require considerable work.
  • n2 We included some of them in the latest version of the generator.
  • n3 One can see that we encountered several types of provoking issues, in which the model correctly parsed a complicated image but stumbled on a small ill-posed detail. Still, an experiment is an experiment, and we have to discard incorrect answers regardless of the reason.