Improving the performance of models for one-step retrosynthesis through re-ranking

Abstract

Retrosynthesis is at the core of organic chemistry. Recently, the rapid growth of artificial intelligence (AI) has spurred a variety of novel machine learning approaches for data-driven synthesis planning. These methods learn complex patterns from reaction databases in order to predict, for a given product, sets of reactants that can be used to synthesise that product. However, their performance as measured by the top-N accuracy in matching published reaction precedents still leaves room for improvement. This work aims to enhance these models by learning to re-rank their reactant predictions. Specifically, we design and train an energy-based model to re-rank, for each product, the published reaction as the top suggestion and the remaining reactant predictions as lower-ranked. On the standard USPTO-50k benchmark dataset, we show that re-ranking can significantly improve one-step models such as RetroSim, a similarity-based method (from 35.7 to 51.8% top-1 accuracy), and NeuralSym, a deep learning method (from 45.7 to 51.3%), and also that re-ranking the union of two models' suggestions can lead to better performance than either model alone. However, the state-of-the-art top-1 accuracy is not improved by this method.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-022-00594-8.

S3 Extra steps to clean the USPTO-50K dataset

1. Remove reaction SMILES strings whose products contain fewer than 3 alphabetic characters and are clearly incorrect: for example, organic reactants giving just 1 equivalent of water as the product. There were 4 such reactions in the training dataset.
2. We canonicalized all reactions using RDKit. Afterwards, duplicate reactions were removed so that only one unique copy of each reaction remains.
3. We were surprised to see that some reactions in the validation and test data also appeared in the training data. This would exaggerate models' generalization performance. In total, 50 reactions in the test set and 44 in the validation set appeared in the training data and were removed from the test and validation data. 6 test reactions that also appear in the validation data were likewise removed from the test set, so that the validation and test sets do not overlap (a minimal sketch of steps 2 and 3 follows this list).
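To make steps 2 and 3 concrete, here is a minimal sketch of the canonicalization, deduplication, and leakage removal. The function names, the assumption that each entry is an atom-map-free reactants>>product SMILES string, and the sorting of multi-component SMILES are ours; the authors' actual preprocessing scripts may differ.

```python
from rdkit import Chem

def canonicalize_rxn(rxn_smiles: str) -> str:
    """Canonicalize every molecule in a 'reactants>>product' SMILES with RDKit."""
    def canon(smiles: str) -> str:
        mols = [Chem.MolFromSmiles(s) for s in smiles.split(".")]
        return ".".join(sorted(Chem.MolToSmiles(m) for m in mols))
    reactants, product = rxn_smiles.split(">>")
    return f"{canon(reactants)}>>{canon(product)}"

def clean_splits(train, valid, test):
    """Deduplicate each split (step 2) and remove overlaps with earlier splits (step 3)."""
    train = {canonicalize_rxn(r) for r in train}
    valid = {canonicalize_rxn(r) for r in valid} - train
    test = {canonicalize_rxn(r) for r in test} - train - valid
    return train, valid, test
```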

S4 Details of model and training hyperparameters
The Adam optimizer S3 with default hyperparameters was used to optimize the parameters of all EBM models. We experimented with multiple learning rate schedules, such as the cyclic learning rate, which has previously been shown S4 to greatly speed up convergence.
Ultimately, however, we found that the most reliable and hyperparameter-insensitive choice is the PyTorch built-in ReduceLROnPlateau, which reduces the learning rate by a user-defined ratio whenever the top-1 accuracy on the validation data stops improving for a specified number of training epochs. To avoid overfitting, we used early stopping, which similarly monitors the validation top-1 accuracy and stops the training once it does not improve after a defined number of epochs. A minimal sketch of this training loop is given below; specific hyperparameters for each EBM are given in the subsections that follow.
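In the sketch, the model, the per-epoch training routine, and the validation top-1 accuracy computation are placeholders (they are not specified in this SI); the factor and patience values shown are those used for the Feedforward-EBM in Section S4.1.

```python
import torch

def fit(model, train_one_epoch, validation_top1, max_epochs=300,
        lr=1e-3, factor=0.4, lr_patience=1, es_patience=3):
    """Adam + ReduceLROnPlateau on validation top-1 accuracy + early stopping.
    `train_one_epoch(model, optimizer)` and `validation_top1(model)` are
    placeholder callables supplied by the caller."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # otherwise default Adam hyperparameters
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=factor, patience=lr_patience)
    best_top1, stale_epochs = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch(model, optimizer)
        top1 = validation_top1(model)
        scheduler.step(top1)                  # reduce LR when validation top-1 plateaus
        if top1 > best_top1:
            best_top1, stale_epochs = top1, 0
        else:
            stale_epochs += 1
            if stale_epochs >= es_patience:   # early stopping
                break
    return best_top1
```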

S4.1 Feedforward-EBM
For the Feedforward-EBM, we use hidden sizes of {1024, 128} for each of the 3 input networks, with one dropout layer (coefficient 0.2) in between. The final output layer has a dimension of 128 and is preceded by a dropout layer (coefficient 0.2). The PReLU activation is used throughout. The starting learning rate is set to 0.001, with a reduction factor of 0.4 and patience of 1 for ReduceLROnPlateau. During training a batch size of 16 reactions is used, while during evaluation a batch size of 32 is used. A patience of 3 is used for early stopping. The training takes about 2 to 3 hours on a single RTX2080Ti GPU.
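A minimal PyTorch sketch of this architecture follows. The fingerprint dimension, the identity of the three inputs, and the way the three branch outputs are combined into a scalar energy are assumptions on our part (the combination step is not fully specified above); only the layer sizes, dropout coefficients, and PReLU activations follow the description.

```python
import torch
import torch.nn as nn

class InputBranch(nn.Module):
    """One of the three input networks: fp_dim -> 1024 -> 128, PReLU, dropout 0.2 in between."""
    def __init__(self, fp_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fp_dim, 1024), nn.PReLU(), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.PReLU())

    def forward(self, x):
        return self.net(x)

class FeedforwardEBM(nn.Module):
    def __init__(self, fp_dim: int = 4096):  # fingerprint size is an assumption
        super().__init__()
        self.branches = nn.ModuleList([InputBranch(fp_dim) for _ in range(3)])
        # 128-dim output layer preceded by dropout 0.2, as described; the final
        # scalar readout that turns it into an energy is our assumption
        self.output = nn.Sequential(nn.Dropout(0.2), nn.Linear(3 * 128, 128),
                                    nn.PReLU(), nn.Linear(128, 1))

    def forward(self, x1, x2, x3):
        h = torch.cat([b(x) for b, x in zip(self.branches, (x1, x2, x3))], dim=-1)
        return self.output(h).squeeze(-1)  # one energy per candidate reaction
```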

S4.2 Graph-EBM
For the Graph-EBM, we observed that slightly different hyperparameters are needed to achieve optimal re-ranking performance for different one-step models. The differences lie mainly in whether a single shared MPNN encoder or two separate MPNN encoders are used to embed reactants and products, the pooling method used to obtain the energy output from the reactant and product embeddings, the learning rate reduction factor for ReduceLROnPlateau, and the early stopping patience.
First we describe the hyperparameters for re-ranking GLN, and then we specify the differences from this set for each of the other one-step models. We use the Gated Recurrent Unit (GRU) for the message passing operations. Following GraphRetro, S1 we use a message passing depth of 10 with dimension 300 for the GRU (ReLU activation). We additionally modify the GRU to have a dropout layer (coefficient 0.08) after each message passing iteration. Different from GraphRetro, S1 we use two hidden layers instead of one for the final node update module W_o, with dimensions {320, 300} and a dropout layer (coefficient 0.08) in between. Also note that we use two MPNN encoder networks with identical architectures but separate parameters to encode product molecules separately from reactant molecules, which allows each MPNN to specialise to reactants or products and resulted in a significant performance boost. The pooled output embedding (300-dim) from each MPNN encoder is passed through a small projection network (PReLU activation) of sizes {256, 200} (dropout coefficient 0.12), and again we use a separate projection network for each MPNN encoder, to obtain the reactants embedding R_G and product embedding P_G, which are concatenated with their difference P_G − R_G and element-wise product P_G ∗ R_G. Finally, the output network (PReLU activation) has dimensions {600, 300, 1} with a dropout layer (coefficient 0.15) in between. The initial learning rate is set to 1e-4, with a learning rate reduction factor of 0.3 and patience of 1 for ReduceLROnPlateau. A patience of 3 is used for early stopping.
A batch size of 2 reactions is used during training and 4 during evaluation, because these are the largest sizes we could fit on an RTX2080Ti. A GPU with larger memory would allow larger batch sizes, which should speed up training. All gradients are clipped to 20 to stabilize training. Also note that we set a condition for training to stop if the learning rate falls below 1e-7, though in practice this was not triggered when re-ranking GLN.
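The pooling and output head just described can be sketched as follows. The two MPNN encoders themselves are abstracted away (only their pooled 300-dim graph embeddings enter the head), and the exact placement of the dropout layers inside the {256, 200} projections and the {600, 300, 1} output network is our reading of the description above.

```python
import torch
import torch.nn as nn

def projection():
    """Projection network of sizes {256, 200} with PReLU and dropout 0.12 (placement assumed)."""
    return nn.Sequential(nn.Linear(300, 256), nn.PReLU(), nn.Dropout(0.12),
                         nn.Linear(256, 200), nn.PReLU())

class GraphEBMHead(nn.Module):
    """Energy head of the two-encoder Graph-EBM used to re-rank GLN."""
    def __init__(self):
        super().__init__()
        self.proj_r, self.proj_p = projection(), projection()  # separate per encoder
        # output network {600, 300, 1} on [R_G, P_G, P_G - R_G, P_G * R_G] (4 x 200 = 800-dim)
        self.output = nn.Sequential(nn.Linear(800, 600), nn.PReLU(), nn.Dropout(0.15),
                                    nn.Linear(600, 300), nn.PReLU(), nn.Linear(300, 1))

    def forward(self, reactants_emb, product_emb):
        R = self.proj_r(reactants_emb)   # R_G from the reactant-side MPNN encoder
        P = self.proj_p(product_emb)     # P_G from the product-side MPNN encoder
        feats = torch.cat([R, P, P - R, P * R], dim=-1)
        return self.output(feats).squeeze(-1)  # one scalar energy per proposal
```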
For RetroSim, we observed that using just 1 shared MPNN encoder (depth = 10, dimension = 300, ReLU activation) for both reactants and products is sufficient, and using 2 MPNN encoders offers no conclusive improvement. We still use two separate projection networks, each of size {256, 200}, for reactants and products respectively after the shared MPNN encoder. However, unlike for GLN, we do not concatenate the reactants embedding R_G and product embedding P_G. Instead, the energy is obtained directly through a dot product R_G · P_G; thus, there is no output network (Fig S1). With significantly fewer parameters than the variant with 2 separate MPNN encoders, we can afford a larger batch size of 4 during training and 8 during testing; accordingly, the initial learning rate is doubled to 2e-4, and we stop the training if the learning rate falls below 1e-6, which was triggered occasionally. Otherwise, we use the same hyperparameters as the two-encoder Graph-EBM used with GLN.

Figure S1: Schematic of the Graph-EBM with a single shared MPNN encoder and dot-product pooling.
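For the lighter single-encoder variant shown in Fig S1, the head reduces to separate projections followed by a dot product. A sketch (the shared MPNN encoder is again abstracted away; the ReLU activation inside the projections is stated for NeuralSym below and assumed here for RetroSim):

```python
import torch.nn as nn

class DotProductHead(nn.Module):
    """Graph-EBM head with dot-product pooling (Fig S1): separate {256, 200}
    projections of the shared encoder's embeddings, then energy = R_G · P_G."""
    def __init__(self, enc_dim: int = 300):
        super().__init__()
        def projection():
            return nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 200), nn.ReLU())
        self.proj_r, self.proj_p = projection(), projection()

    def forward(self, reactants_emb, product_emb):
        R, P = self.proj_r(reactants_emb), self.proj_p(product_emb)
        return (R * P).sum(dim=-1)  # dot product R_G · P_G used directly as the energy
```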
For NeuralSym, we also use just 1 shared MPNN encoder (depth = 10, dimension = 300, ReLU activation), with separate projection networks (ReLU activation) of size {256, 200}, followed by the dot product R_G · P_G to obtain the energy. We found it helpful to use a more conservative learning rate reduction factor of 0.8, with the same patience of 1 for ReduceLROnPlateau, the same early stopping patience of 3, and the same learning rate floor of 1e-6.
With RetroXpert, due to the particularly poor re-ranking performance of the Graph-EBM, we explored more architectural and hyperparameter variations. While using just 1 shared MPNN encoder (depth = 10, dimension = 300, PReLU activation) proved optimal, we found it helpful to include a small pre-embedding layer W_emb of size {90} in the MPNN encoder. Specifically, W_emb is applied to the initial message feature vectors before any message passing is done, and these transformed feature vectors are then fed into the GRU cell for the message passing iterations. Otherwise, we continue to employ two separate projection networks (ReLU activation) of size {256, 200}, followed by the dot product R_G · P_G to obtain the energy. The initial learning rate is set higher at 3e-4 with a learning rate reduction factor of 0.6, but otherwise we keep the same patience of 1 for ReduceLROnPlateau and the same early stopping patience of 3.
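The only architectural addition here is the pre-embedding layer W_emb. The sketch below shows where it sits relative to the GRU message passing; the raw message feature dimension is a placeholder, and the neighbour aggregation inside the message-passing loop is omitted for brevity.

```python
import torch.nn as nn

class PreEmbeddedMPNN(nn.Module):
    """Sketch: W_emb (output size 90) transforms the initial message feature
    vectors once, before the GRU message-passing iterations begin."""
    def __init__(self, raw_msg_dim: int, hidden_dim: int = 300, depth: int = 10):
        super().__init__()
        self.w_emb = nn.Linear(raw_msg_dim, 90)  # pre-embedding layer W_emb
        self.gru = nn.GRUCell(90, hidden_dim)    # message-passing update (simplified)
        self.depth = depth

    def forward(self, raw_msg_feats, hidden):
        msgs = self.w_emb(raw_msg_feats)         # applied before any message passing
        for _ in range(self.depth):
            hidden = self.gru(msgs, hidden)      # neighbour aggregation omitted
        return hidden
```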
Finally, for the union of GLN and RetroSim, we again use 1 shared MPNN encoder (depth = 10, dimension = 300, ReLU activation), as using two MPNN encoders would make training too slow, given that we have 100 training proposals instead of 50 (which roughly doubles the training time). This is followed by separate projection networks (ReLU activation), each of size {256, 200}, and dot-product pooling R_G · P_G to obtain the energy. A batch size of 2 reactions is used during training and 4 during evaluation; accordingly, the initial learning rate is set to 1e-4 with a learning rate floor of 1e-7 (both the same as for re-ranking GLN). Just as with RetroXpert, we found it beneficial to use a more conservative learning rate reduction factor of 0.8, with the same patience of 1 for ReduceLROnPlateau but a longer early stopping patience of 4.

Architecture
Besides the FF-EBM and Graph-EBM, we can also represent each reactant and product using SMILES and encode each reaction with the Transformer S5 (Fig S2). We observed that the TF-EBM does poorly, failing to even recover RetroSim's original top-N accuracies. When we attempted to train the Transformer-EBM for an extended duration, the training top-1 accuracy consistently plateaued at about 33%. Due to its poor performance, we do not report the TF-EBM's results for the remaining one-step models. One hypothesis is that the TF-EBM struggles to learn, from reaction SMILES, the fine-grained chemical rules necessary to accurately distinguish chemically similar reactant-sets. Two chemically similar reactant-sets can have very different SMILES due to RDKit canonicalization, which may exacerbate the learning difficulty in this re-ranking setting, whose nature differs from the original task of reactant-set generation.

Results
We note that the lackluster performance of the TF-EBM does not contradict the success of the Transformer architecture on sequence-to-sequence tasks in chemistry, including the task of one-step retrosynthesis itself. Sequence-to-sequence tasks typically use an autoregressive loss and model the task as conditional SMILES generation. Our EBM formulation, on the other hand, is more similar to the Bidirectional Model used in S9, which also exhibits low performance despite making use of bidirectional context similar to BERT.

S5 Additional metric to evaluate the EBMs' re-ranking performance
The primary objective of our work is to demonstrate that re-ranking with trained EBMs can improve the performance of existing one-step retrosynthesis models. Therefore, we believe that our reported metrics of top-N accuracy and Mean Reciprocal Rank (MRR) before and after re-ranking suitably highlight the benefits of our approach. Still, to substantiate our case further, we also report the area under the top-N curve as an additional metric.

S5.1 Area under the top-N curve (AUC-N)
The AUC-N depends on the choice of N , and we report the values for N ∈ {3, 5, 10, 20, 50}.
The area was calculated as a sum of rectangles, with one unit-width rectangle for each top-i accuracy for i = 1, ..., N, and finally normalised by dividing by N, the total number of rectangles. Thus, if N = 1, the area simply equals the top-1 accuracy. The formula is summarised as follows:

AUC-N = (1/N) ∑_{i=1}^{N} a_i,

where a_i is the top-i accuracy and N is the largest rank that we consider.
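As a sanity check, the metric can be computed directly from the list of top-k accuracies; the numbers in the usage comment below are made up for illustration.

```python
def auc_n(top_k_accuracies, n):
    """AUC-N: the mean of the top-1 ... top-n accuracies
    (sum of unit-width rectangles, normalised by n)."""
    return sum(top_k_accuracies[:n]) / n

# e.g. auc_n([0.50, 0.65, 0.72], 3) == (0.50 + 0.65 + 0.72) / 3
```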

For the AUC-N, the trends exactly parallel those of the top-N accuracy, as expected. As shown in Table S4, re-ranking with the Graph-EBM achieves the best metric for RetroSim, NeuralSym and GLN, while for RetroXpert the FF-EBM is superior to the Graph-EBM, although both are worse than the raw RetroXpert when N = 3. For re-ranking the union of RetroSim and GLN with the Graph-EBM, we also see that it achieves a better AUC-N than re-ranking RetroSim alone or GLN alone.

Table S4: Results of re-ranking the four one-step models, as well as the union of RetroSim and GLN, on the USPTO-50K test dataset. Bolded values refer to the best area under the top-N curve for that one-step model. We report the average of 3 experiments where both the proposer and re-ranker are initialized with a different random seed, with the standard deviation in parentheses.

S6 Examples of egregious re-ranking

While we have mainly highlighted the strengths of our EBM re-ranking approach, we also wish to highlight a few "failure" cases, where the EBMs have worsened the rank of the published reaction relative to the original proposer. However, as we will see, this is not necessarily "bad", as the EBM's top-ranked suggestions can still be useful.
In the first example (Fig S3), the published reaction is the formation of an ether linkage between a phenol and a primary alcohol. While RetroSim ranked this 4th, the Graph-EBM worsened its rank to 7th. The Graph-EBM's top-1 proposal (ranked 11th by RetroSim) is a Heck coupling between a phenyl bromide and a terminal alkene. While there is no guarantee this reaction would work, it is not an absurd suggestion.

Figure S3: Graph-EBM prefers a Heck coupling between a phenyl bromide and a terminal alkene over etherification.

Secondly, the published reaction shown in Fig S4 is the oxidation of a secondary alcohol to a ketone. Such oxidation reactions seem rather straightforward, and this one was ranked 2nd by RetroSim but pushed to 9th by the Graph-EBM. The top re-ranked suggestion (ranked 6th by RetroSim) is the Friedel-Crafts acylation of the benzene ring at the position para to the methoxy group. While this is certainly a valid chemical transformation, it may suffer from reduced selectivity due to the ortho-acylated side-product. Although the oxidation may be more selective, the EBM's top-1 proposal may still work in practice and may simply reflect an overall bias in the dataset toward bimolecular reactions rather than unimolecular redox manipulations.

Figure S4: Graph-EBM favors electrophilic substitution over oxidation.

Next, in Fig S5, the published reaction involves a Grignard coupling between a phenylmagnesium reagent and a Weinreb amide; in fact, this is the second step of the Weinreb ketone synthesis, a reliable method of selectively synthesising ketones without the alcohol side-product that arises from over-addition of the organometallic reagent to the carbonyl center. The Weinreb amide is usually synthesised from the reaction of N,O-dimethylhydroxylamine with an acyl chloride precursor. While RetroSim ranked this 4th, the Graph-EBM worsened its rank to 8th. Instead, the Graph-EBM's rank-1 proposal is the coupling between a phenyl bromide and an acyl chloride to form the ketone. Presumably, the phenylamine group acts as an electron-donating group to activate the ortho-bromo position towards electrophilic attack by the acyl chloride. This reaction also seems practical and could yield the desired ketone. Again, this example highlights the subjectivity of evaluating reaction proposals, especially without considering other factors like reaction conditions.
Figure S5: Graph-EBM favors the coupling of an acyl chloride with bromobenzene over the Grignard coupling with a Weinreb amide.

The last example (Fig S6) involves the coupling between a nitrogen atom on a heterocycle (1H-pyrazolo[3,4-d]pyrimidine) and iodobenzene to install the benzene ring onto the nitrogen. This published reaction was ranked 1st by RetroSim, but only 12th by the Graph-EBM. The top proposal by the Graph-EBM is the substitution of a chlorine side group on the N-heterocycle with an amino group using ammonia. It is difficult to judge which reaction is more favorable in practice; the choice may boil down to the multi-step strategy being employed rather than single-step chemical feasibility.

Figure S6: Graph-EBM favors the nucleophilic substitution of a heteroaryl chloride with ammonia over the coupling of a ring nitrogen and iodobenzene.