In this paper, we present a data-driven method for the uncertainty-aware prediction of chemical reaction yields. The reactants and products in a chemical reaction are represented as a set of molecular graphs. The predictive distribution of the yield is modeled as a graph neural network that directly processes a set of graphs with permutation invariance. Uncertainty-aware learning and inference are applied to the model to make accurate predictions and to evaluate their uncertainty. We demonstrate the effectiveness of the proposed method on benchmark datasets with various settings. Compared to the existing methods, the proposed method improves the prediction and uncertainty quantification performance in most settings.

Introduction

In organic chemistry, the prediction of chemical reaction yields is an important research topic in chemical synthesis planning [1, 2]. This enables the estimation of the overall yield of a complex synthetic pathway and the detection of low-yield reactions that negatively affect the overall yield. It also provides clues for designing new reactions that provide higher yields to save on the time and cost required for experimental syntheses.

Machine learning has achieved remarkable success in the data-driven prediction of chemical reaction yields [1, 3,4,5,6,7]. The main concept is to construct a prediction model that predicts the yield of a chemical reaction by learning from previously accumulated data comprising a number of chemical reactions annotated with their experimentally measured yields. The successful application of a prediction model enables fast and efficient estimation of chemical reaction yields without performing experimental syntheses, which are costly and time-consuming.

Early studies represented each chemical reaction as a fixed-size vector of handcrafted features, such as molecular fingerprints and chemical property descriptors, and constructed an off-the-shelf prediction model on top of the vector representation [3,4,5, 8]. The limitation of this approach is that the choice of adequate features relies on chemical knowledge and intuition, and some inherent information to the original reaction may be lost in the representation. With advances in deep learning [9], recent studies have applied deep neural networks constructed on a more informative representation of a chemical reaction. Schwaller et al. [6, 10] used simplified molecular-input line-entry system (SMILES) to represent a chemical reaction. To predict the reaction yield, they fine-tuned a bidirectional encoder representations from transformers (BERT) model pre-trained using a reaction SMILES database [11] to predict the yield. Saebi et al. [7] represented a chemical reaction as a set of graphs, on which a graph neural network was constructed to predict the yield.

In this paper, we present an alternative method for predicting chemical reaction yields. As a prediction model, we adapt a graph neural network that directly operates on the graph representation of a chemical reaction in a permutation-invariant fashion. We use uncertainty-aware learning and inference in the model to make accurate predictions of yields and determine the confidence of predictions.

Methods

Data representation

We suppose that a chemical reaction consists of a number of reactants and a single product. This chemical reaction is labeled with its reaction yield. Each instance is represented as \(( \mathcal {R},\mathcal {P}, y )\), where \(\mathcal {R} = \{\mathcal {G}^{R,1},\ldots ,\mathcal {G}^{R,m}\}\) and \(\mathcal {P} = \{\mathcal {G}^{P}\}\) are the set of m reactants and the resulting product in the reaction, respectively, and y is the reaction yield. The number of reactants m can be different for each reaction.

Each molecule in \(\mathcal {R}\) and \(\mathcal {P}\) is defined as an undirected graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), where \(\mathcal {V}\) and \(\mathcal {E}\) represent the set of nodes and the set of edges, respectively. The node feature vectors \(\mathbf {v}^j \in \mathcal {V}\) and edge feature vectors \(\mathbf {e}^{j,k} \in \mathcal {E}\) are associated with heavy atoms (e.g., C, N, O, and F) and their bonds (e.g., single, double, triple, and aromatic), respectively. Hydrogen atoms are treated implicitly. The number of heavy atoms and bonds in each molecule is the same as the number of node feature vectors and edge feature vectors in the corresponding graph representation, respectively. Figure 1 illustrates an example of the graph representation of a molecule.

For the j-th atom, \(\mathbf {v}^j=(v^{j,1},\ldots ,v^{j,p})\) is a vector indicating the atom type, formal charge, degree, hybridization, number of hydrogens, valence, chirality, whether it accepts or donates electrons, whether it is aromatic, whether it is in a ring, and associated ring sizes. For the bond between the j-th and k-th atoms, \(\mathbf {e}^{j,k}=(e^{j,k,1},\ldots ,e^{j,k,q})\) is a vector indicating the bond type, stereochemistry, whether it is in a ring, and whether it is conjugated.

Prediction model

To predict the reaction yield y, we introduce a predictive distribution for y conditioned on the set of reactants \(\mathcal {R}\) and product \(\mathcal {P}\), denoted by \(p_\theta (y|\mathcal {R},\mathcal {P})\), which is modeled as a normal distribution as follows:

where \(\mu\) and \(\sigma ^2\) are the mean and variance of the distribution, respectively. We parameterize the predictive distribution \(p_\theta\) using a neural network f that produces \(\mu\) and \(\sigma ^2\) as a function of \(\mathcal {R}\) and \(\mathcal {P}\) with a set of parameters \(\theta\):

To construct the neural network f, we adapt the architecture presented by Saebi et al. [7] to process two sets of molecular graphs with advanced neural network modules. Figure 2 illustrates the architecture used in this study. The architectural details of each component are presented next.

A message passing neural network (MPNN) [12] is used as the GNN component of f to process each molecular graph \(\mathcal {G}\) in \(\mathcal {R}\) and \(\mathcal {P}\). The GNN is designed to take \(\mathcal {G}\) as the input and return the graph representation vector \(\mathbf {r}\) as the output:

In the GNN, we apply multiple message passing steps using an edge network as a message function and a gated recurrent unit (GRU) network as an update function to generate node representation vectors. We then apply a set2set model [13] as a readout function for global pooling over the node representation vectors to obtain a graph-level embedding that is invariant to the order of the nodes. The embedding is sparsified by a fully-connected layer to obtain the graph representation vector \(\mathbf {r}\). The use of the GNN renders the representation invariant to graph isomorphism.

We summate the graph representation vectors for \(\mathcal {R} = \{\mathcal {G}^{R,1},\ldots ,\mathcal {G}^{R,m}\}\). This makes the representation invariant with respect to the order of the reactants. The summated vector is concatenated with the graph representation vector \(\mathcal {P} = \{\mathcal {G}^{P}\}\) to generate the reaction representation vector \(\mathbf {h}\):

The reaction representation vector \(\mathbf {h}\) is further processed by a feed-forward neural network (FNN) with two output units. The first unit returns the predictive mean \(\mu\). The second unit returns the log predictive variance \(\log \sigma ^2\).

The main advantages of the prediction model f presented in this study can be summarized as follows. First, the input for the model is the graph representation of a chemical reaction, which can directly encompass various atom and bond features regarding their chemical properties that make the representation more informative. Second, the model can handle chemical reactions of varying sizes with different numbers of reactants as the input. Third, the output of the model is invariant to permutations of reactants in the input reaction and is also invariant to permutations of atoms in each of the reactants/products. Fourth, the output of the model specifies the corresponding predictive distribution, which allows for uncertainty-aware learning and inference.

Uncertainty-aware learning

The learning procedure aims to train the prediction model f such that it can estimate the predictive mean \(\mu\) and variance \(\sigma ^2\) of the unknown yield y for a chemical reaction \((\mathcal {R}, \mathcal {P})\). For the model f to learn from data, we construct a training dataset of N chemical reactions and their yields, denoted by \(\mathcal {D}=\{(\mathcal {R}_i, \mathcal {P}_i, y_i)\}_{i=1}^N\).

We train the model f based on the maximum likelihood estimation. Based on the normality assumption for the predictive distribution \(p_\theta\), the log-likelihood is given by:

which involves two learning objectives with the hyperparameter \(\lambda\) that controls the relative strength of each objective. The first term is to minimize the conventional mean squared error over the training dataset \(\mathcal {D}\), which corresponds to the maximization of the log-likelihood over \(\mathcal {D}\) under the homoscedasticity assumption. The second term is to maximize the log-likelihood over \(\mathcal {D}\) under the heteroscedasticity assumption. The first term contributes to stabilizing the training with respect to the predictive mean \(\mu\). The second term enables the predictive variance \(\sigma ^2\) to quantify the aleatoric uncertainty caused by the inherent noise in \(\mathcal {D}\).

Uncertainty-aware inference

Once trained, the prediction model f is used to predict the yields of new chemical reactions. We employ the Monte-Carlo (MC) dropout [14] for the Bayesian approximation of the model f. Following the Bayesian approach, the approximate predictive distribution q is given by

Given a query reaction \((\mathcal {R}_*,\mathcal {P}_*)\), we wish to predict the unknown yield \({y}_*\) of the reaction as well as to quantify the uncertainty of the prediction. We empirically derive the MC estimates by sampling T predictions \(\{(\hat{\mu }_*^{(t)}, \hat{\sigma }_*^{2(t)})\}_{t=1}^T\) based on stochastic forward passes through the model f with dropout applied. Because some hidden units are randomly dropped out at each forward pass, the T predictions vary for the same reaction. The variability in the predictions is primarily caused by the epistemic uncertainty of the model f owing to the insufficiency of the training dataset \(\mathcal {D}\).

For prediction, the predictive mean can be estimated by averaging over \(\{\hat{\mu }_*^{(t)}\}_{t=1}^T\):

where \(\bar{\mu }_* = \frac{1}{T} \sum _{t=1}^T \hat{\mu }_*^{(t)}\). This is used as the uncertainty score for the prediction. The predictive variance can be decomposed into two types of uncertainty [15]. The first term corresponds to the aleatoric uncertainty, which accounts for the statistical uncertainty caused by inherent noise in the dataset \(\mathcal {D}\). The second term corresponds to the epistemic uncertainty, which accounts for the systemic uncertainty in the model f caused by the insufficiency of \(\mathcal {D}\).

The prediction of chemical reaction yields supports the identification of high-yield reactions from a pool of possible candidates in an efficient manner. The prerequisite is that the prediction model must be as accurate as possible. In practice, the prediction model may be imperfect and result in inaccurate predictions. To overcome this issue, we can selectively use the model based on uncertainty quantification. Because a high prediction uncertainty tends to cause erroneous predictions, the rejection of uncertain predictions would be beneficial for the actual use of the prediction model. If the prediction uncertainty is sufficiently low, we can use the model with confidence to identify whether a reaction has a high yield. Otherwise, the model abstains from predicting. Rejected cases can be carefully investigated by chemists in terms of their yields.

Experimental investigation

Datasets

We investigate the effectiveness of the proposed method using the following two benchmark datasets: Buchwald-Hartwig [3] and Suzuki-Miyaura [16]. In these datasets, each reaction was annotated with a measured yield ranging from 0% to 100%. The summary statistics of the datasets are presented in Table 1.

The Buchwald-Hartwig dataset was released by Ahneman et al. [3]. They conducted high-throughput experiments on the class of Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions. They experimented on combinations of 15 aryl halides, 4 ligands, 3 bases, and 23 additives. A total of 3955 reactions were reported with their measured yields. The studies [3,4,5,6] evaluated the performance of the chemical reaction yield prediction on this dataset.

The Suzuki-Miyaura dataset was released by Perera et al. [16]. They conducted high-throughput experiments on the class of Suzuki-Miyaura cross-coupling reactions. 15 couplings of electrophiles and nucleophiles across combinations of 12 ligands, 8 bases, and 4 solvents were considered, resulting in measured yields for a total of 5760 reactions. The studies [6, 16, 17] have investigated this dataset.

For experimental investigations, we use 10 random shuffles for each benchmark dataset and 4 out-of-sample splits of the Buchwald-Hartwig dataset [3, 6].

Implementation

In the experimental investigation, we use the following configurations for the proposed method. For the GNN component of the model, the node representation vectors and graph representation vectors have dimensions of 64 and 1024, respectively. The graph representation vectors were set to have higher dimensionality because they are summated over multiple reactants to obtain the reaction representation vector. The number of message passing steps and set2set processing steps are both set to 3. Increasing the size of the GNN component may provide better performance, but it also incurs higher computational costs and memory usage. Thus, we set it to moderately large so that it can be trained in a reasonable time. The FNN component of the model has two fully-connected layers with 512 dimensions, followed by an output layer. During training, we standardize the yield y to have a mean of 0 and a variance of 1 over the training dataset \(\mathcal {D}\). A dropout rate of 0.1 is applied to the fully-connected layers in the FNN component. The hyperparameter \(\lambda\) in the objective function \(\mathcal {J}\) is set to 0.1. L2 regularization with a factor of \(10^{-5}\) is applied to the parameters \(\theta\). To train the model f, we update the parameters \(\theta\) for 500 epochs using the Adam optimizer with a batch size of 128. The learning rate is set to \(10^{-3}\) for the initial epochs and decayed to \(10^{-4}\) and \(10^{-5}\) over the last 100 epochs. We did not consider hyperparameter optimization through holdout validation, because it is unsuitable when the training dataset is very small. At inference, we set the number of forward passes T to 30 for MC dropout. We use Equation 8 and Equation 9 for the prediction and uncertainty score, respectively.

The proposed method is implemented using PyTorch in Python. The source code used in this study is available online at http://github.com/seokhokang/reaction_yield_nn/. The results of the experimental investigations are reported and discussed in the following section.

Results and discussion

Prediction and uncertainty quantification

We investigated the effectiveness of the proposed method for predicting the chemical reaction yields on the Buchwald-Hartwig and Suzuki-Miyaura datasets. For the proposed method, we derived two ablations by adjusting the hyperparameter \(\lambda\) in the objective function \(\mathcal {J}\). For the first ablation, the model was trained using only homoscedastic loss by setting \(\lambda =0\), which is equivalent to fixing the predictive variance \(\sigma\) to 1. For the second ablation, the model was trained using only heteroscedastic loss by setting \(\lambda =1\). For baselines, we considered YieldBERT [6] and YieldBERT-DA [10], which demonstrated superior performance compared to the other methods presented in the literature [3,4,5]. YieldBERT adapted a pre-trained BERT encoder [11] to predict the chemical reaction yield as a function of the reaction SMILES. YieldBERT-DA is an extension of YieldBERT based on data augmentation, which increases the quantity of the training dataset using SMILES randomization. For YieldBERT-DA, the prediction uncertainty score was computed using the prediction variance obtained from the test-time augmentation, as implemented in [10]. The source codes for YieldBERT and YieldBERT-DA are available online at https://github.com/rxn4chemistry/rxn_yields/, which we used to reproduce the experimental results. Consequently, a total of five methods were compared: YieldBERT, YieldBERT-DA, and the proposed method with \(\lambda =0\), 1, and 0.1.

For performance evaluation, we split each dataset into training and test sets. We then trained the prediction model using the training set and evaluated its performance on the test set. To examine the effects of training set size on performance, the training/test splits were varied as 70/30, 50/50, 30/70, 20/80, 10/90, 5/95, and 2.5/97.5. Regarding prediction performance, we used the following three measures calculated on the test set: mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R\(^2\)). Uncertainty quantification performance was evaluated in terms of the Spearman rank correlation coefficient \(\rho\) between the absolute prediction error and uncertainty score on the test set [10, 18].

Table 2 reports the average and standard deviation of the results over the 10 repetitions. In terms of prediction performance, the proposed method outperformed all the baseline methods. Although YieldBERT-DA was the best baseline method, the MAE and RMSE values of the proposed method reduced by around 5\(\sim\)10% compared to those of YieldBERT-DA on both benchmark datasets. The higher prediction performance indicates that the proposed method can provide more accurate predictions of yields for new reactions. Regarding uncertainty quantification performance, the proposed method yielded a Spearman \(\rho\) comparable to that of YieldBERT-DA.

For the proposed method, the prediction performance with \(\lambda = 1\) was slightly better than that with \(\lambda = 0\). The uncertainty quantification performance with \(\lambda = 1\) was far better than that with \(\lambda = 0\), which implies that capturing the aleatoric uncertainty is beneficial. Compared to the ablations, setting \(\lambda = 0.1\) yielded a better trade-off between prediction performance and uncertainty quantification performance. The results demonstrated that the use of both homoscedastic and heteroscedastic losses helped to improve performance.

Out-of-sample prediction

We also evaluated the performance of the proposed method for out-of-sample prediction. As in [6, 10], we used four out-of-sample training/test splits of the Buchwald-Hartwig dataset, which we denote by Test 1, Test 2, Test 3, and Test 4. In each split, certain additives are absent from the training set but only appear in the test set. The proposed method was compared with YieldBERT and YieldBERT-DA. The training configurations and evaluation scheme were the same as before. The experiments were repeated five times independently using different random seeds.

Table 3 summarizes the results averaged over the five repetitions. Overall, the proposed method was comparable to the best of the baseline methods for out-of-sample prediction. In terms of prediction performance, the proposed method performed best on Test 2 and Test 4, while was comparable or inferior to the best baseline on Test 1 and Test 3. Among the baselines, YieldBERT-DA yielded a lower performance than YieldBERT on average. For uncertainty quantification performance, the proposed method yielded the highest Spearman \(\rho\) for Test 1, Test 3, and Test 4.

Selective prediction with rejection

We investigated the effectiveness of the proposed method for selective prediction using 70/30 splits of benchmark datasets. For the proposed method, prediction uncertainty was quantified using the total predictive variance in Eq. 9. Because it can be decomposed into aleatoric and epistemic uncertainties, we conducted an ablation study to examine the effects of each component. The first ablation quantified the prediction uncertainty using the aleatoric uncertainty term. The second ablation used the epistemic uncertainty term. The proposed method was compared to the best baseline method, YieldBERT-DA, for which the uncertainty quantification was based on the test-time augmentation.

To evaluate the selective prediction performance, we rejected the prediction for a reaction if its uncertainty score was above a certain threshold. The threshold controls the trade-off between prediction accuracy and coverage. As performance measures, we computed the MAE and RMSE on the test set with various prediction coverage rates ranging from 100% to 30%.

Tables 4 and 5 present the comparison results for the selective prediction performance in terms of the MAE and RMSE with various prediction coverage rates, which are summarized in Fig. 3. The results clearly demonstrated that a high uncertainty score for a reaction causes its predicted yield to be less accurate for all compared methods. Reducing the prediction coverage with more rejections led to a significant improvement in the prediction performance. The proposed method outperformed YieldBERT-DA in most cases. The MAE and RMSE decreased by over 10% and were nearly halved at 90% and 40% coverages, respectively, for both datasets.

Regarding the two ablations of the proposed method, the selective prediction performance with the epistemic uncertainty was superior at higher prediction coverages, whereas that with the aleatoric uncertainty was better at lower coverages. Compared to the ablations, using the total predictive variance combining the aleatoric and epistemic uncertainty improved the performance by taking their individual strengths to detect erroneous predictions.

Conclusion

We presented an uncertainty-aware method for predicting chemical reaction yields. We represented a chemical reaction as a set of graphs. We constructed a prediction model whose input was the graphs and output was the predictive mean and variance for the reaction yield. For a query reaction, the predictive mean of the model was used as the predicted yield and the predictive variance was used to quantify the uncertainty of the prediction, which allowed the model to avoid making predictions with high uncertainty. The effectiveness of the proposed method for chemical reaction yield prediction was successfully demonstrated through experimental validation on two benchmark datasets. We also demonstrated that a high predictive variance tends to cause a high prediction error, allowing for selective prediction with rejection.

The accurate prediction of chemical reaction yields with uncertainty quantification can assist in advanced synthesis planning considering imposed constraints in practice, including availability, variability, and budget limits. Future research directions for improving prediction performance will be to enrich the data representation of chemical reactions to make it more informative by incorporating various atom/bond features and molecular descriptors associated with reaction yields.

Chuang KV, Keiser MJ (2018) Comment on “Predicting reaction performance in C–N cross-coupling using machine learning”. Science. 362(6416)

Sandfort F, Strieth-Kalthoff F, Kühnemund M, Beecks C, Glorius F (2020) A structure-based platform for predicting chemical reactivity. Chem 6(6):1379–1390

Saebi M, Nan B, Herr J, Wahlers J, Wiest O, Chawla N (2021) Graph neural networks for predicting chemical reaction performance. ChemRxiv

Schneider N, Lowe DM, Sayle RA, Landrum GA (2015) Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J Chem Inf Model 55(1):39–53

Schwaller P, Vaucher AC, Laino T, Reymond JL (2020) Data augmentation strategies to improve reaction yield predictions and estimate uncertainty. In: Proceedings of NeurIPS 2020 Machine Learning for Molecules Workshop

Schwaller P, Probst D, Vaucher AC, Nair VH, Kreutter D, Laino T et al (2021) Mapping the space of chemical reactions using attention-based neural networks. Nat Machine Intell 3(2):144–152

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: Proceedings of International Conference on Machine Learning; p. 1263–1272

Vinyals O, Bengio S, Kudlur M (2015) Order matters: sequence to sequence for sets. In: Proceedings of International Conference on Learning Representations

Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of International Conference on Machine Learning; p. 1050–1059

Kendall A, Gal Y (2017) What uncertainties do we need in Bayesian deep learning for computer vision? Adv Neural Inf Process Syst 30:5574–5584

Perera D, Tucker JW, Brahmbhatt S, Helal CJ, Chong A, Farrell W et al (2018) A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359(6374):429–434

Granda JM, Donina L, Dragone V, Long DL, Cronin L (2018) Controlling an organic Synthesis robot with machine learning to search for new reactivity. Nature 559(7714):377–381

The authors thank the anonymous reviewers for their valuable comments.

Funding

This work was supported by Samsung Advanced Institute of Technology and the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT; Ministry of Science and ICT) (No. NRF-2020R1C1C1003232).

Author information

Authors and Affiliations

Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea

Youngchun Kwon, Dongseon Lee & Youn-Suk Choi

Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea

Youngchun Kwon

Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea

YK and SK designed and implemented the methodology. DL performed the analysis. YSC and SK supervised the research. YK and SK wrote the manuscript. All authors read and approved the final manuscript.

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Kwon, Y., Lee, D., Choi, YS. et al. Uncertainty-aware prediction of chemical reaction yields with graph neural networks.
J Cheminform14, 2 (2022). https://doi.org/10.1186/s13321-021-00579-z