Prediction of chemical reaction yields with large-scale multi-view pre-training

Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest and importance. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES or graphs of molecules using deep neural networks. However, the progression of chemical reactions is inherently determined by the molecular 3D geometric properties, which have been recently highlighted as crucial features in accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential in enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability to predict new reactions. Scientific Contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-024-00815-2.


Introduction
The prediction of chemical reaction yields, which refer to the percentage of product formed in relation to the reactant consumed, is an important research topic in organic chemistry [1,2].In the field of organic synthesis, chemists often synthesize a target molecule through several or a dozen reaction steps [3].Consequently, low-yield reactions in the intermediate steps can have a negative impact on the total yield of the synthesis route due to the cumulative effect of each step.The estimation of chemical reaction yields plays an important role in guiding synthetic chemists to choose appropriate molecular synthesis routes, particularly in the case of identifying highly active and selective catalysts efficiently.Traditionally, chemists depend on empirical predictions or specific wet experiments to determine yields, which require extensive domain knowledge and are both timeconsuming and labor-intensive.Therefore, data-driven machine learning techniques are needed to provide an efficient alternative.
It remains a great challenge to accurately predict chemical reaction yields due to the complexity of reaction space and the diverse factors that influence chemical experiments [4,5].To develop machine learning-based approaches, it is crucial to establish effective methods for representing chemical reactions.Conventional studies typically rely on feature engineering to represent chemical reactions using fingerprints or descriptors.This involves creating customized descriptors based on domain expertise to capture molecular, atomic, vibrational, or physicochemical properties [6][7][8][9][10][11][12].Some researchers derived descriptors from circular substructures present in the simplified molecular-input line-entry system (SMILES) [13] strings of the reactions [14].Chemical reactions can be perceived as sequences or collections of molecules.Therefore, the traditional practice of concatenating molecular fingerprints or descriptors at the molecule level is commonly employed [15][16][17][18].However, this concatenation approach is typically suitable only for reactions with a fixed number of molecules, posing limitations on their ability to generalize to downstream tasks.
In recent years, deep learning (DL) models have gained popularity and have led to significant advancements in the representation and prediction of chemical reactions.Some studies utilized SMILES as input and employed well-established models from the field of natural language processing (NLP) to encode the SMILES notations of chemical reactions into continuous vectors.These studies employed either pre-trained Transformer-based models [19][20][21] or Recurrent Neural Network (RNN) models [22,23] on large-scale datasets, and fine-tuned their models [3,24] with downstream tasks to capture task-specific representations of chemical reactions.Other studies incorporated molecular graph structures to represent chemical reactions.Recent approaches include adding [25] or concatenating [26] molecular features learned from graph neural networks, encoding the condensed graph of chemical reactions [27], and learning organic reactivity based on generalized reaction templates [28].
While the YieldGNN model [12] attempts to combine 2D chemical reaction graphs with 1D descriptors, most of the aforementioned studies have focused solely on a specific perspective of chemical information, namely, 1D sequences or 2D graphs.By contrast, multi-view learning methods for molecular representation learning [29][30][31][32][33] have achieved success by incorporating multiple views of molecules with different dimensional inputs.
However, due to the difficulty in extracting effective features from molecular geometry, only a few molecule prediction methods [30,[34][35][36][37] have leveraged 3D spatial structure information of molecules, which is critical for determining molecular properties and reaction outcomes.Although these studies have demonstrated the potential of 3D geometric information in providing comprehensive and complementary insights into chemical reactions, to further enhance the accuracy of reaction prediction, there is a need to explore more effective and efficient algorithms for handling the high dimensionality of 3D molecular structures.
In this study, we propose a large-scale Reaction Multi-View Pre-training framework, named ReaMVP, to represent chemical reactions and predict their yields.ReaMVP utilizes a two-stage approach that involves the pre-training of a sequence encoder and a conformer encoder.In the first stage, ReaMVP aims to capture the consistency of chemical reactions from different views via distribution alignment and contrastive learning.In the second stage, ReaMVP further enhances the representation of chemical reactions through supervised learning using reaction data with known yields.By incorporating this additional information, the model can improve its performance on downstream prediction tasks.The contributions of this work are summarized as follows: (

Pre-training datasets
Two large-scale datasets are utilized for the first and second stages of pre-training, respectively.Figure 1 illustrates the data preparation pipeline.The United States Patent and Trademark Office (USPTO) from 1976 to September 2016 [38] is a large database of reactions.The original dataset contains over 1.8 million chemical reactions stored in the form of SMILES arbitrary target specification (SMARTS) [39].We remove duplicate records and invalid reactions that RDKit [40] fails to recognize, and then transform the remaining reactions into SMILES to obtain the dataset USPTO-1.Subsequently, we retain reactions in which RDKit is able to generate geometric information for all molecules to obtain the dataset USPTO-2, which is used in the first stage of pre-training.
For the second stage of pre-training, we select reactions from USPTO-2 that possess known and valid yields.However, the yield distribution in USPTO-2 is highly biased since it contains only a few reactions with low yields (the detailed distribution is present in Additional file 1: Figure S4), which potentially limits the model's generalization ability.To address this issue, we augment USPTO-2 by adding more reactions with low yields from the Chemical Journals with High Impact Factor (CJHIF) [23] dataset to cover a wider range of values.The original CJHIF dataset includes over 3.2 million chemical reactions in the form of SMARTS extracted from chemistry journals by Chemical.AI [41], which only considers the reactants and products.However, additional compounds such as catalysts and solvents are represented using plain English names or abbreviations that cannot be directly recognized and processed by computers.To convert these names into RDKit-recognizable formats, we utilize both the open parser for systematic IUPAC nomenclature (OPSIN) [42] and the chemical identifier resolver (CIR) [43] to obtain their corresponding SMILES representations.Similar to the processing pipeline for USPTO, we further remove duplicate and invalid reactions and convert the remaining reactions into SMILES to obtain the dataset CJHIF-1.Subsequently, we retain reactions with known and valid yields and sample reactions whose yields are lower than 50% .The combination of chemical reactions with known yields from USPTO-2 and CJHIF forms the dataset USPTO-CJHIF, which is used in the second stage of pre-training.
Note that the conformer encoder necessitates the atom coordinates of each molecule.However, generating a dataset comprising millions of transition state reactions is exceedingly intricate and computationally demanding [44].As a substitute, we employ molecular conformers to depict the geometric structures of reactions.Experimental determination of conformers involves resource-intensive physical and chemical experiments.Hence, we rely on computational chemistry, employing simulation software and algorithms to model molecule conformers.Specifically, we utilize the ETKDG algorithm [45] provided by RDKit with default parameters to compute one conformer for each molecule.
USPTO-2 is randomly divided into training, validation, and test sets in an 18:1:1 ratio.Similarly, USPTO-CJHIF is divided in a stratified way according to the yields of reactions in the same ratio as USPTO-2.Table 1 presents an overview of the datasets for pre-training.

Downstream datasets
We fine-tune and assess ReaMVP on two benchmark datasets for the prediction of chemical reaction yields.Table 2 presents the data statistics.It is noteworthy that, when predicting chemical reaction yields, chemists are often able to select appropriate reactants guided by the desired product.However, selecting influential precursors, such as additives and catalysts, that have a significant impact on yields poses a considerable challenge.This challenge necessitates the exploration of numerous unobserved alternative molecules, demanding a machine learning model with high generalization capability under outof-sample conditions.Specifically, the model must be capable of accurately predicting yields of reactions that involve molecules not included in the training set.
The Buchwald-Hartwig dataset was released by Ahneman et al. [8].They conducted high-throughput experiments (HTEs) with 1536-well plates on the class of Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions.They experimented on the combinations of 15 aryl halides, four ligands, three bases, and 23 additives.A total of 3955 reactions were reported with their measured yields.Ahneman et al. and Sandfort et al. [15] split the dataset into eight representative training and test sets according to isoxazole additives as out-of-sample conditions.We further split the dataset based on reactants to construct five out-of-sample conditions (detailed split groups are shown in Additional file 1: Figure S1).We also apply the same random 70/30 split as reported in Sandfort et al. to get training and test sets.
The Suzuki-Miyaura dataset was released by Perera et al. [46].They conducted high-throughput experiments on the class of Suzuki-Miyaura cross-coupling reactions.Discarding water (H 2 O), 15 couplings of electrophiles and nucleophiles across combinations of 12 ligands (with a blank one), eight bases (with a blank one), and four solvents were considered, resulting in measured yields for a total of 5760 reactions.To evaluate the model under out-of-sample conditions similar to the Buchwald-Hartwig dataset, we further split the Suzuki-Miyaura dataset into four representative training and test sets according to ligands (detailed split groups are shown in Additional file 1: Figure S3).We also apply the same random 70/30 split as reported in Philippe et al. [3] to get training and test sets.

Problem formulation
Suppose that a chemical reaction contains a total of n molecules, including reactants, catalysts, products, and other relevant compounds.Each reaction is associated with a yield value, and the value of n may vary across different reactions.We denote each sample as (M s , M c , y) , where M s = {S 1 , . . ., S n } and M c = {C 1 , . . ., C n } represent the set of n molecules in sequence format and conformer format, respectively, and y refers to the reaction yield.For the i-th molecule within a reaction, S i represents its molecular sequence, and C i = {V, R} represents its corresponding molecular conformer, where V = {v 1 , . . ., v m } and R = {r 1 , . . ., r m } denote the set of atoms and their spatial coordinates, respectively.
Given a reaction (M ′ s , M ′ c , y ′ ) where y ′ is unknown, the task of predicting reaction yields aims to find a mapping function that can be defined as follows: where φ(•) represents the desired mapping function and y p denotes the predicted yield.

Model architecture
Figure 2 shows the basic structure of ReaMVP.As a generic self-supervised learning pipeline, the proposed model, ReaMVP, consists of two phases: pre-training and finetuning.The pre-training phase consists of two stages.In stage I, we map multiple views of reactions into the representation space via sequence (1D) and conformer (3D) encoders, respectively.The mapped representations are then projected into the alignment space where we (1)

Pre-training stage I
In the pre-training stage I, we aim to improve the representation of chemical reactions by conducting distribution alignment (Align) and contrastive learning (Contrast) using data collections that provide both sequential and geometric structures.To achieve this, ReaMVP utilizes the 1D sequence and the 3D conformer as complementary views for each reaction.We employ a sequence encoder φ s (•) and a conformer encoder φ c (•) to learn the representations of reactions.Subsequently, a projection head [47] g(•) (also called a projector) is applied to map the learned representations to the alignment space as follows: where φ s (•) and φ c (•) represent the sequence encoder and the conformer encoder, h s and h c denote the learned representation of the corresponding encoders, and (2) x s and x c are the output in the alignment space of the sequence encoder and the conformer encoder, respectively.Given a batch of N inputs X s = {x s 1 , . . ., x s N } and X c = {x c 1 , . . ., x c N } , ReaMVP aims to align the represen- tations of the two views for the same reaction, i.e., pushing x s i and x c i as close as possible.Hence we apply the Jeffreys divergence [48] to achieve this in a distributionlike format as follows: where D denotes the Kullback-Leibler divergence [49], d denotes the dimension of features, z s i denotes the i- th input of Z s , z j s i denotes the j-th logit of z s i , and z c i and Z j c i are defined in a similar way.Furthermore, we want to separate the outputs from distinct views for different sample pairs as far as possible to enhance the representation ability of chemical reactions.Thus, we adopt contrastive learning based on InfoNCE [50] to maximize the mutual information between X s and X c as follows: where f(x, y) equals exp(x • y/τ ) , τ denotes a hyper- parameter called temperature, x s ′ and x c ′ represent the sequence and conformer views of other reactions within the same batch, relative to positive pair (x s , x c ) , and (x s i , x c k ) denotes the feature of the sequence encoder and the conformer encoder, respectively.The pair , Fig. 2 Overview of the model comes from different reactions when i is not equal to k (negative pairs) and vice versa (positive pairs).To combine the distribution alignment and contrastive learning objectives, we formulate the overall loss function for the pre-training stage I as a combination of the distribution alignment loss in Equation ( 4) and the contrastive learning loss in Equation ( 5) as follows: where denotes the weighting coefficient to balance the contributions of these two objectives.

Pre-training stage II
Focusing on the prediction of chemical reaction yields, we aim to improve the generalization capability of ReaMVP by leveraging supervised learning techniques (see "Both self-supervised and supervised pre-training enhance prediction performance" Section for details).Despite the inherent dissimilarity among various types of chemical reactions, we adopt the large-scale dataset USPTO-CJHIF for pre-training to capture the shared and common characteristics between chemical reactions and their corresponding yields.
In the pre-training stage II, we concatenate the learned representations from both the sequence encoder and the conformer encoder.The fused representations are then used for supervised pre-training, where a predictor is introduced to further improve performance on reaction yield prediction tasks.The loss function of the pretraining stage II is, (6) where ⊕ denotes the concatenation operation and p(•) denotes the predictor.

Fine-tuning
The learned concatenated representations can be fixed as reaction descriptors or further trained during finetuning.In this study, we fine-tune the entire ReaMVP model along with the predictor for the problem of predicting reaction yields.We adopt the model on the Buchwald-Hartwig dataset and the Suzuki-Miyaura dataset, respectively.

Reaction sequence encoding
SMILES is a well-designed and widely-used sequence format to represent molecules, which has demonstrated its effectiveness in various chemistry-related tasks [3,[19][20][21][22][23][24].Hence we adopt a multi-layer bidirectional gated recurrent unit (GRU) model [51] as the sequence encoder to process SMILES input as follows: where Bi-GRU denotes a multi-layer bidirectional gated recurrent unit model and h t i (1 ≤ i ≤ n) denotes the embedding of the i-th token.Figure 3a presents the basic structure of the sequence encoder.Similar to natural language processing problems, tokenization is a key technology for the sequence encoder, and the level of granularity has a great impact on model performance [52].Existing works for SMILES representation apply either a coarsegrained tokenization method [23] based on the statistical probability of characters or fields in the dataset, or a fine-grained tokenization method at the character level [3,53,54].Considering the versatility (8)  of tokenization, we employ a fine-grained tokenization approach adapted from Xue et al. [54] based on the character level to ensure a high frequency of each token, except that the symbol "%" followed by two digits is segmented as one word.It is worth noting that aromatic elements are represented in lowercase when forming chemical bonds in the SMILES representation.Hence it is necessary to distinguish between cases such as "[cs]" and "cs".The former is the representation of the lowercase element "Cs" in the aromatic bond, while the latter is the bond representation of the two elements "c" and "s".In addition to the elements or other characters obtained from tokenization, we introduce five special characters to construct the pretraining task.The "[PAD]" symbol is used for aligning the length of input SMILES; the "[CLS]" symbol is used for recording classification information or indicating the start position of a SMILES string; the "[SEP]" symbol is used for separating reactants, catalysts, products, etc.; and the "[UNK]" symbol is used for representing characters that are either unknown or have a frequency less than ten.Consequently, the final corpus consists of 115 valid tokens (111 original tokens plus the additional 4 tokens).

Reaction conformer encoding
Although numerous studies have explored the use of 3D conformers to predict molecular properties [30,[34][35][36][37], there has been a notable dearth in the development of reaction representations and architectures that incorporate geometric information.In addition, models need to be rotational and translational invariant since conformers are generally described by atomic coordinates and are not fixed in the Cartesian or spherical coordinate systems.In light of these considerations, we propose a simple yet effective approach to capturing spatial features of reactions by utilizing sequential molecular conformers, as illustrated in Fig. 3b.The conformer encoder is composed of a SchNet [34] model that satisfies rotational and translational invariance to embed molecules and a multi-layer bidirectional GRU to aggregate molecular representations as follows: where h m i denotes the feature of the i-th molecule extracted from its conformer C i , and SchNet is a variant of Deep Tensor Neural Network (DTNN).The SchNet model incorporates a continuous-filter convolution layer, which is particularly well-suited for molecular dynamics simulations aimed at predicting potential energy surfaces (9) and energy-conserving force fields.A SchNet model can be formulated as follows: where I u denotes the input feature of atom u for the embedding layer, e ℓ u denotes the output at layer ℓ ( 0 < ℓ < K ) for atom u, r denotes the spatial coordinate of the corresponding atom, MLP denotes the multi-layer perceptron, K denotes the number of hidden layers, and is the continuous-filter convolution layer that captures continuous coordinates of atoms instead of discrete ones using different hyper-parameters γ and µ k .
We use three metrics, namely mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R 2 ), to assess the performance of yield prediction.The sequence encoder comprises an embedding layer with a dimensionality of 256, followed by a two-layer bidirectional GRU model with a hidden layer of 128-D and a dropout ratio of 0.3.The conformer encoder employs a SchNet model (detailed initial features of atoms are provided in Additional file 1: Table S7) with four interaction blocks, 64 Gaussian filters, a molecular radius threshold of ten, and all hidden layers and filters with a dimension of 128.Similarly, a two-layer bidirectional GRU model is employed with a hidden layer of 128-D and a dropout ratio of 0.3.We use the Adam [57] optimizer with default parameters during the training process.In the pre-training stage I, we set to 1.0 after exploring values in {0.1, 0.5, 1.0} (see "Both Align and Contrast operations are indispensable for the selfsupervised pre-training" Section for details).In the pretraining stage II, we use a regression head that takes the concatenation of the outputs from the sequence encoder (10) e 0 u = embedding(I u ), and conformer encoder (see "Multi-view learning excels over single-view methods on most splits" Section for details).During the fine-tuning stage, we perform a grid search for hyper-parameters, including the learning rate in {3e-4, 1e-3, 3e-3}, the dropout ratio in {0.1, 0.3}, the weight decay in {0, 1e-4, 1e-5}, and the loss function in {MSE, MAE}.

ReaMVP demonstrates superior generalization capability compared to SOTA models
To evaluate the effectiveness of ReaMVP, we compare its performance with three state-of-the-art DL yield prediction methods as listed below.
• YieldBERT [3] adapts a pre-trained BERT encoder [19] to predict chemical reaction yields via reaction SMILES.• YieldBERT-DA [24] is an extension of YieldBERT, which uses the same pre-trained BERT encoder and adds SMILES randomization and permutation as data augmentation.
• UA-GNN [25] aggregates molecular embeddings learned by graph neural networks using a set of molecular graphs with permutation invariance and utilizes uncertainty-aware learning and inference.
To perform a rigorous assessment of model generalization capability to unseen data, we experiment with the out-of-sample splits (as described in "Downstream datasets" Section), i.e., the reactions for test include molecules that are not present in the training set.The Buchwald-Hartwig (BH) dataset has eight ligandbased splits (Tests 1-4, Plates 1-3, and Plate 2 new) and five reactant-based splits (Halide Br, Halide Cl, Halide I, Pyridyl, and Nonpyridyl), while the Suzuki-Miyaura (SM) dataset has four ligand-based splits (Tests 1-4).
Tables 3, 4 present the average results from ligandbased splits across five random runs for the BH and SM datasets, respectively.The results of Tests 1-4 for the BH dataset are as reported in the original papers of the three baseline models; while the results of other splits are not available, thus we reproduce the three models using their released codes and conduct the experiments.To reproduce YieldBERT and YieldBERT-DA, we employ the pre-trained model labeled as "pre-trained" instead of "ft".During the fine-tuning, we determine the hyperparameters via a grid-search, including the learning rate in {5e-6, 1e-5, 5e-5, 1e-4} and the dropout ratio in {0.3, 0.4, 0.5, 0.6, 0.7}.For the augmentation hyper-parameters of YieldBERT-DA, we adopt the fixed random order with a random type of "rotated" and conduct ten permutations for training and testing the model.To reproduce UA-GNN, we maintain the same set of hyper-parameters as specified in the original paper to ensure consistency and comparability.
Remarkably, we observe that ReaMVP demonstrates superior performance under various evaluation metrics in most cases, except for the BH Plate 1 split and the BH Plate 3 split.The proposed model exhibits outstanding performance in out-of-sample yield prediction tasks.For instance, the R 2 value of ReaMVP increases by approximately 40% under the BH Test 4 split, by approximately 22% under the BH Plate 2 new split, and by approximately 24% under the SM Test 1 split.These substantial  improvements in prediction performance highlight the effectiveness of distribution alignment during the pretraining stage, which enhances the generalization capability of models.ReaMVP demonstrates a more balanced representation of isoxazole additives in the BH dataset and ligands in the SM dataset, as evidenced by its consistent performance across these different categories.
In addition to the ligand-based splits, we experiment with five reactant-based splits of the BH dataset.Table 5 presents the average results along with the corresponding standard deviations of five random runs.ReaMVP yields the best performance on the Halide Br and Pyridyl splits, while also attaining a top 2 position (on par with YieldBERT-DA) for the Halide I split.
We also apply the random forest (RF) models and support vector machines (SVMs) on the BH dataset under out-of-sample conditions to offer a better understanding of comparisons between different methods (detailed values are shown in Additional file 1: Table S5).Reaction features are adopted from Mandana Saebi et al. [12].We observe that non-DL methods perform worse in most cases and have a higher risk of overfitting, especially for reactant-based splits.
Notably, both the DL and non-DL methods underperform on the Halide Cl split with negative values of R 2 .In contrast, most models obtain meaningful predictions for the Halide Br and Halide I splits.To further investigate the inferior performance for the Halide Cl split among three halide-based splits, we analyze the yield distributions of the training dataset alongside the corresponding test dataset (detailed distributions are present in Additional file 1: Figure S2).Three histogram metrics are computed to quantify the dissimilarity between these distributions (detailed values are present in Additional file 1: Table S1).The results reveal a substantial dissimilarity in yield distributions between the training and test datasets for the Halide Cl split.For instance, the normalized histogram intersection decreases by roughly 33%, the chi-squared distance increases by around 91%, and the Jeffreys divergence increases by approximately 216% in comparison with the Halide I split.Such a distribution shift poses a great challenge for machine learning models to accurately predict reaction yields.
Additionally, as many previous studies reported results under random conditions, here we also compare the results obtained by random splitting.Table 6 presents the averaged results with standard deviations for ten random 70/30 splits.To ensure a fair comparison, all methods utilize the same random splits.ReaMVP exhibits competitive performance compared to the state-of-the-art method by a slight margin.YieldBERT, YieldBERT-DA, and ReaMVP all utilize large-scale pre-training strategies followed by fine-tuning on downstream tasks.These methods are supposed to have a higher generalization capability than those without pre-training.ReaMVP generally achieves the best performance among them, except for MAE in the BH dataset.It suggests that considering more dimensional information during pre-training is an effective approach to improving model performance.
Under random splits, UA-GNN, the model without pre-training, outperforms the three models with pretraining, which may be attributed to the presence of data leakage in the datasets.Both the BH and SM datasets are characterized by a relatively small number of unique molecules, consisting of only 51 and 36 molecules, respectively.Hence the training set is likely to contain all molecules at least once under the random 70/30 split setting [15].Overlapping data between the training and test sets poses significant challenges in accurately evaluating a model's generalization ability.As can be seen from Table 6, the models exhibit notably better performance under random conditions compared to out-of-sample conditions.Specifically, the BH dataset shows exceptionally high R 2 values exceeding 0.95, which is remarkably high in practical scenarios.This suggests that the model performance may be overestimated when using random splits, potentially leading to misleading conclusions.

Multi-view learning excels over single-view methods on most splits
ReaMVP utilizes both the 1D sequential and 3D geometric information of molecules to learn the representation of chemical reactions.To investigate the impact of the data view for the predictor, we set (in Eq. 6) to 1.0 and examine the performance when considering the sequence encoder alone, the conformer encoder alone, and the concatenation of their outputs both during the pre-training stage II and the fine-tuning stage.Figure 4 (detailed values are shown in Additional file 1: Table S2) depicts the results.Taking the concatenation of outputs from each encoder yields the best performance against others in six out of eight splits, with the remaining two splits showing very close results to the best ones.Notably, on Plate 2 and Plate 2 new, the '1D+ 3D' approach substantially improves performance, with an increase of over 0.14 and 0.07 in R 2 against 1D-and 3D-only methods, respectively.
The results highlight the efficacy of integrating information from multiple views to enhance model performance.Interestingly, the method using 3D information exhibits inferior performance compared to its 1D counterpart, possibly because some 3D coordinates of molecules generated by the ETKDG algorithm in RDKit are inaccurate.Obtaining precise 3D geometric information is a challenging task.Nonetheless, the simulated 3D information can still provide valuable supplementary information to the 1D sequence and ultimately improve the accuracy.
Additionally, we further investigate the influence of multi-view choices.The ReaMVP model includes a sequence encoder and a conformer encoder to encode chemical reactions during the pre-training phase.The model allows for various data views, including 1D sequential, 2D topologic, and 3D geometric data structures.We replace the sequence encoder with a graph encoder (the graph neural network model GIN [58]) to include 2D features from consideration.This adjustment transforms the input into molecular graph structures, where atoms and bonds are treated as nodes and edges, respectively.
Figure 5 (detailed values are given in Additional file 1: Table S2) presents the results.The '1D+3D' approach yields the better performance in seven out of eight splits, exhibiting an average increase of about 0.05 in R 2 .The superior results of '1D+3D' over '2D+3D' can be attributed from complementarity between views and the nature of contrastive learning.First, considering the differences in these representations, the 1D SMILES and 3D conformer views could provide more complementary information to each other than the 2D and 3D views.The 1D SMILES view encapsulates bonding and atomic information effectively in a compact form, while the 3D conformer view provides spatial information.By contrast, there is an overlap between the information provided by the 3D and 2D structures, which may lead to shortcut  Here, we conduct experiments that only employ the first or second stage of pre-training.The results shown in Fig. 6 demonstrate the efficacy of pre-training stages I and II.Specifically, it leads to inferior performance in seven out of eight splits with only stage I and in six out of eight splits with only stage II.Note that since there is no overlap between the reactions in the two-stage pretraining data and the downstream data, we can conclude that both the self-supervised and supervised pre-training, which primarily include reaction types different from those in the downstream data, capture some common patterns related to reaction mechanisms.As a result, the two-stage pre-training helps improve yield prediction for the downstream datasets.

Both Align and Contrast operations are indispensable for the self-supervised pre-training
During our pre-training stage I, there are two basic operations, i.e., Align and Contrast (as shown in Fig. 2).The process of distribution alignment (Align), achieved through the utilization of Kullback-Leibler divergence, primarily focuses on positive pairs that correspond to different views of the same reactions.While pulling the distributions between positive pairs closer together is a crucial aspect, it is equally important to differentiate the outputs between distinct views from different sample pairs to avoid trivial representations.To investigate the impact of distribution alignment and contrastive learning using the concatenation of outputs from the sequence encoder and the conformer encoder, we experiment with different weighting coefficients in Equation (6).
Figure 7c (detailed values are given in Additional file 1: Table S3) presents the R 2 values obtained on downstream datasets under out-of-sample conditions when equals 0.1, 0.5, and 1.0, respectively.We set the batch size as 32 and extract reaction pairs from the pre-training dataset USPTO-2 to compute the Jeffreys divergence of positive and negative pairs. Figure 7a and b reveal the cumulative distribution function (CDF) curves for both the positive pairs and the negative pairs.
Based on the results, there are some observations that can be made: (1) The use of distribution alignment via Kullback-Leibler divergence alone may not produce the desired outcome.When the hyper-parameter is set to 0, both the distances between positive and negative pairs in the alignment space tend to be zero, which can limit the model's ability to capture meaningful patterns in the data.(2) Using distribution alignment along with contrastive learning proves to be an effective approach to learning more informative representations that yield better performance on downstream tasks, especially under out-of-sample conditions.(3) Different weighting coefficients have a slight influence on the Jeffreys divergence between positive pairs but exhibit a more significant impact on negative pairs.(4) The trends observed in the CDF curves are generally consistent with the downstream performance, which indicates that the Jeffreys divergence may serve as a criterion for model selection during pretraining.We notice that it outperforms others in all cases when equals 1.0.This setting effectively enforces a notable separation between negative pairs and maintains the proximity between positive pairs.Meanwhile, results obtained with equal to 0.5 tend to surpass those obtained with equal to 0.1.

Further investigation on predicting the data from electronic laboratory notebooks (ELNs)
The BH dataset and SM dataset are both from highthroughput experiments (HTE), yet they represent a small part of the reaction space due to limited categories of molecules.For example, the BH dataset and the SM dataset form only five and one different products, respectively.Such a low diversity may lead to obstacles for a general-purpose reaction yield prediction [59].
To evaluate the model's performance on previously unseen and complex data points, we conduct experiments on the ELN BH dataset, which was released by Mandana Saebi et al. [12].They collected a legacy dataset for Buchwald-Hartwig reactions from electronic laboratory notebooks with a wide range of substrates, ligands, and solvents.The structural diversity of the ELN dataset is much higher than that of the HTE dataset.We apply the same random 70/30 split as reported in the original paper.
The results are present in Additional file 1: Table S6.We observe that ReaMVP outperforms other DL methods in all metrics, demonstrating a higher generalization capability.However, the non-DL model, RF with RDKit features, yields the best performance.Besides, none of the models provide meaningful predictions.Our findings are in agreement with recent studies on reaction condition prediction [12,60].While DL models excel on larger datasets BH and SM, their relatively inferior performance on the smaller ELN dataset warrants examination.One potential explanation is that DL models tend to perform well when trained on larger, more comprehensive datasets, which allow them to learn representations sufficiently.On smaller datasets like ELN BH, they might be overfitting to noise or unable to develop a rich representation due to the scarcity of data, resulting in suboptimal performance.Moreover, the structural diversity introduced by a wide range of substrates, ligands, bases, and solvents in the ELN BH dataset poses a significant challenge to machine learning models.Although the pre-training techniques are employed, due to the large gap between the pre-training data and the downstream task data, the small amount of fine-tuning data leads to poor generalization capability.

Conclusions
In this study, we introduce ReaMVP, a large-scale multiview pre-training method with two pre-training stages designed to enhance the representation of chemical reactions for predicting reaction yields.In the first pretraining stage, we learn representations of reactions by learning the consistency relationship between different views of reactions via distribution alignment and contrastive learning.In the second pre-training stage, we combine the outputs from the sequence encoder and conformer encoder and incorporate a predictor for supervised pre-training, thereby further refining the learned representations for accurate yield prediction.
While the ETKDG algorithm used to obtain molecular coordinates may generate inaccurate 3D information, the experimental findings demonstrate the effectiveness of the incorporation of the 3D view.By combining 1D and 3D representations, we can capitalize on the strengths of both views and mitigate their limitations, leading to enhanced performance in predicting chemical reaction yields.Notably, even when the algorithm fails to simulate 3D structures, ReaMVP's sequence encoder can still effectively predict chemical reaction yields.
ReaMVP stands out from the state-of-the-art methods with its exceptional performance in out-of-sample scenarios, demonstrating its potential in predicting chemical reaction yields involving unseen additives or ligands.Additionally, the model can be extended to other prediction tasks related to chemical reaction outcomes.With its robust performance and versatility, ReaMVP represents a valuable tool for chemists and researchers in chemical reaction studies.

Fig. 1
Fig.1Overview of the data preparation process.We remove duplicate reactions and retain reactions with valid SMILES and conformers.We use USPTO-2 and USPTO-CJHIF for the first stage and the second stage of pre-training, respectively conduct self-supervised pre-training.In stage II, we carry out supervised pre-training in the representation space.

Fig. 3
Fig. 3 Structure of a one-layer bidirectional GRU as an example of the sequence encoder a and the conformer encoder b

Fig. 4 R
Fig. 4 R 2 values of the Buchwald-Hartwig dataset under eight out-of-sample conditions based on ligands with different data views for fine-tuning

Fig. 5 RFig. 6 R
Fig. 5 R 2 values of the Buchwald-Hartwig dataset under eight out-of-sample conditions based on ligands with different data views for pre-training and fine-tuning

Fig. 7
Fig. 7 Experimental results on investigating the efficacy of the pre-training stage I. a The CDF curves of positive pairs.b The CDF curves of negative pairs.c R 2 values of the Buchwald-Hartwig dataset under eight out-of-sample conditions based on ligands with different weighting coefficients

Table 1
Overview of the pre-training datasets

Table 2
Overview of the downstream datasets

Table 3
Results of the Buchwald-Hartwig dataset under ligand-based out

Table 4
Results of the Suzuki-Miyaura dataset under ligand-based out

Table 5
Results of the Buchwald-Hartwig dataset under reactant-based

Table 6
Results of the Buchwald-Hartwig and Suzuki-Miyaura datasets with random splits