### The energy-based re-ranking model

Motivated by statistical mechanics, EBMs describe the unnormalized probability distribution \(p_\theta\) of a variable of interest, *x*, using an energy function \(E_\theta (x) : {\mathbb {R}}^D \rightarrow {\mathbb {R}}\). Popularized in the mid-2000s [25], EBMs have regained interest after recent works drew theoretical connections between EBMs and the widespread tasks of classification and generation [16, 17]. Formally, we have:

$$\begin{aligned} p_\theta (x) = \frac{e^{-E_\theta (x)}}{Z(\theta )}\qquad Z(\theta ) = \int _xe^{-E_\theta (x)}dx \end{aligned}$$

(1)

EBMs are highly flexible as there is no restriction on the form of the energy function \(E_\theta\), allowing the user massive freedom to explore and utilize what best suits the problem at hand. We exploit this flexibility by exploring different architectures of deep neural networks, encoding different molecular representations, to parameterize \(E_\theta\). The architecture of the re-ranking model can be chosen independently of the proposer model. Given an input reaction *x*, the output of our network, \(E_\theta (x)\), is its “energy”, a scalar value (the lower the better) that holistically represents that reaction’s feasibility. Notice that the denominator \(Z(\theta )\) in Eq. 1, the partition function, requires integrating \(E_\theta\) over all possible input reactions in order to represent a meaningful probability distribution. This partition function is computationally intractable, however, and must be approximated. In the context of retrosynthesis, for each product, a one-step model can generate up to *K* reactant-sets: \(\left\{ R_k \right\} _{k=1}^K\). While only one of these can exactly match the published reactant-set (“ground truth”, \(R_{true}\in \left\{ R_k \right\} _{k=1}^K\)), given a well-trained one-step model, several of the remaining reactant-sets, \(\left\{ R_k \right\} _{k=1}^K\setminus R_{true}\), could also be chemically viable options or could be similar to \(R_{true}\) (differing by the identity of a leaving group, for example). We therefore reasoned that we could simply use these remaining reactant-sets as “negatives” to approximate the intractable partition function \(Z(\theta )\). A similar simplification was made in the energy-based modelling work of Sun et al. [24], which shares our theoretical motivations but samples the “negatives” in a different way.
They assume that the extracted reaction template sets are exhaustive and apply non-ground-truth templates to generate negative samples, whereas we assume that the relevant proposed reactants are exhaustive. We then show empirically in this work that such an approximation is sufficient for good retrosynthesis re-ranking performance.

Therefore, the EBM’s training objective is to maximize the separation in energy between a positive reaction (a product paired with its published reactant-set) and its associated negatives (the same product paired with non-published reactant-sets proposed by the one-step model), pushing down the energy of positive reactions while pushing up the energy of negative reactions. To achieve this, we design the following loss function: for a product *P*, given a batch of top-*K* reactant-set proposals \(\left\{ R_k \right\} _{k=1}^K\) from a one-step model, we have:

$$\begin{aligned} {\mathcal {L}}_{batch} = -\log \,p_\theta \left( R_{true},\, \{R_k\} ,\,P\right) = -\log \,\left( \frac{e^{-E_\theta (R_{true}, \,P)}}{\sum _{k=1}^{K}e^{-E_\theta (R_k,\,P)}}\right) \end{aligned}$$

(2)

The EBM can then be trained using this loss function through stochastic gradient descent (or its variants) over mini-batches of data. This approach has conceptual similarities to contrastive learning, where the decision to use negative examples from existing one-step models can be thought of as a form of hard negative sampling. Theoretically, the larger *K* is, the better the approximation to the true partition function \(Z(\theta )\), but in practice, we find no noticeable benefit in using \(K > 50\), and therefore use \(K = 50\) for all experiments. We also note that because one-step models do not have perfect top-50 accuracy, the set of top-*K* proposals will not always contain the published reactant-set. During training, we add the ground-truth reactant-set \(R_{true}\) to the list of candidates if it is not already present. When re-ranking validation or test reactions, however, if \(R_{true}\) is not among the top *K* proposals from the one-step model for a given product, the EBM cannot re-rank correctly for that product. To maximize the chances of \(R_{true}\) being present in \(\left\{ R_k \right\} _{k=1}^K\), we use a larger \(K = 200\) when re-ranking validation and test proposals.
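To make Eq. 2 concrete: the loss is a softmax cross-entropy over negated candidate energies, which can be computed in a numerically stable way via log-sum-exp. The following is a minimal NumPy sketch (the function name and the `true_idx` convention marking the ground-truth candidate's position are ours, not from the released implementation):

```python
import numpy as np

def ebm_batch_loss(energies, true_idx):
    """Eq. 2: negative log-probability of the ground-truth reactant-set
    under a softmax over the negated energies of all K candidates."""
    logits = -np.asarray(energies, dtype=float)   # lower energy -> higher logit
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
    return log_z - logits[true_idx]
```

With uniform energies over four candidates the loss is \(\log 4\); lowering the ground truth's energy relative to the negatives decreases the loss, which is exactly the separation the training objective rewards.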

### Architecture choices for the energy-based model

Due to the flexibility of the EBM framework, we enjoy great freedom in both input representation and architecture. In this work, we focus on two machine-readable formats to represent a chemical reaction, with each choice corresponding to a different architecture below: a feedforward backbone and a graph-based Message Passing Neural Network (MPNN) backbone. We also briefly experimented with a Transformer-based architecture, but did not observe good performance; we discuss it further in Additional file 1: Section S4.3.3. For all three architectures, we elaborate on details of the network structure and hyperparameter choices in Additional file 1: Section S4.

#### Feedforward EBM (FF-EBM)

We represent each molecule as a Morgan count fingerprint of length 16,384 with radius 3, and employ 3 input networks (Fig. 2, left). The first network receives the product fingerprint \(\mathbf{P }_\text {in}\). The second network receives the reactants fingerprint \(\mathbf{R }_\text {in}\); because each reaction can have multiple reactants, we sum reactant fingerprints into a single, 16,384-length fingerprint. Lastly, the third network receives the “difference” fingerprint [26] \(\mathbf{D }_\text {in}\), which captures fragments lost and gained during the reaction: \(\mathbf{D }_\text {in} = \mathbf{P }_\text {in} - \mathbf{R }_\text {in}\). From these 3 input networks, we obtain 3 dense embeddings \(\mathbf{P }_\text {out}\), \(\mathbf{R }_\text {out}\), \(\mathbf{D }_\text {out}\). We concatenate these 3 vectors with their element-wise products \(\mathbf{P }_\text {out} *\mathbf{R }_\text {out}\), \(\mathbf{R }_\text {out} * \mathbf{D }_\text {out}\) and \(\mathbf{P }_\text {out} * \mathbf{D }_\text {out}\) to capture higher-order interactions as inspired by Ref. [27], as well as the cosine similarity of product and reactants embeddings \(\mathbf{sim} (\mathbf{P }_\text {out},\mathbf{R }_\text {out}) = \frac{\mathbf{P }_\text {out}\cdot \mathbf{R }_\text {out}}{\left\| \mathbf{P }_\text {out} \right\| \left\| \mathbf{R }_\text {out} \right\| }\). Finally, we apply another feedforward network on this concatenated vector to output the energy.
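The combination step described above can be sketched as follows (a minimal NumPy sketch; the function name is ours, and the real model passes this concatenated vector through a further feedforward network to output the energy):

```python
import numpy as np

def combine_embeddings(p_out, r_out, d_out):
    """Concatenate the 3 dense embeddings with their pairwise element-wise
    products and the product-reactants cosine similarity."""
    cos = p_out @ r_out / (np.linalg.norm(p_out) * np.linalg.norm(r_out))
    return np.concatenate([
        p_out, r_out, d_out,                          # the 3 dense embeddings
        p_out * r_out, r_out * d_out, p_out * d_out,  # higher-order interactions
        [cos],                                        # scalar similarity feature
    ])
```

For embeddings of dimension *d*, the resulting vector has length \(6d + 1\): six *d*-dimensional components plus the scalar cosine similarity.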

#### Graph EBM

A graph \({\mathcal {G}}\) contains nodes \({\mathcal {V}}\) corresponding to atoms and edges \({\mathcal {E}}\) that link nodes, corresponding to bonds. We adapt the graph representation from GraphRetro [28]. Each atom *u* has a feature vector \(\mathbf{x }_u\) containing chemical properties such as its element and charge. Similarly, each bond (*u*, *v*) between two atoms *u* and *v* has a feature vector \(\mathbf{x }_{uv}\), containing information like bond type. The full list is detailed in Additional file 1: Table S1. We directly adapt the graph encoder from Ref. [28]. The MPNN performs a fixed number of message passing operations around each atom and bond in the molecule, communicating chemical information from neighbouring bonds and atoms to extract meaningful representations. These representations can be very powerful as they are custom-learnt for the task at hand. In contrast, fingerprints are constructed using a fixed algorithm agnostic to the task, which may not yield optimal performance. For brevity, we denote the MPNN’s encoding process by MPNN(\(\cdot\)) and refer readers to Refs. [28, 29] for detailed explanations of the MPNN architecture. In our case, we apply two MPNNs with separate parameters (Fig. 2, right): one MPNN just for the reactant graphs, and the other just for the product graph. For each graph \({\mathcal {G}}\), the MPNN computes atom representations \(\mathbf{c }_u\) for each atom *u*, i.e. \(\left\{ \mathbf{c }_u|u\in {\mathcal {G}} \right\}\), using Eq. 3:

$$\begin{aligned} \left\{ \mathbf{c }_u \right\} = MPNN ({\mathcal {G}}, \left\{ \mathbf{x }_u \right\} , \left\{ \mathbf{x }_{uv} \right\} _{v\in {\mathcal {N}}(u)}) \end{aligned}$$

(3)

where \({\mathcal {N}}(u)\) refers to the neighbouring atoms of atom *u*. To obtain a fixed-length graph-level embedding for each molecule \(\mathbf{c }_{\mathcal {G}}\), all of its atom representations could be summed: \(\mathbf{c }_{\mathcal {G}} = \sum _{u\in {\mathcal {V}}}\mathbf{c }_u\). However, a naive sum may not be optimal as certain atoms may be more important in determining a reaction’s feasibility. Thus, we use attention pooling [30] which uses a feedforward network to calculate the weight (“attention”) each atom should contribute to the graph-level embedding in a weighted sum. Since a reaction can have multiple reactants, we sum the graph-level embeddings of all reactants into a single vector. We additionally apply a small projection network to the pooled output from each MPNN encoder to respectively obtain the product embedding \(\mathbf{P }_{{\mathcal {G}}}\) and reactants embedding \(\mathbf{R }_{{\mathcal {G}}}\). Finally, we concatenate \(\mathbf{P }_{{\mathcal {G}}}\) and \(\mathbf{R }_{{\mathcal {G}}}\) with their difference \(\mathbf{P }_{{\mathcal {G}}} - \mathbf{R }_{{\mathcal {G}}}\) and element-wise product \(\mathbf{P }_{{\mathcal {G}}} * \mathbf{R }_{{\mathcal {G}}}\), before applying an output feedforward network to obtain the energy.
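The attention-pooling step can be illustrated as follows (a minimal NumPy sketch; we use a single linear scoring layer where the actual model uses a small feedforward network, and the function name is ours):

```python
import numpy as np

def attention_pool(atom_reprs, w, b=0.0):
    """Pool per-atom representations (n_atoms, d) into a single graph-level
    embedding (d,) via a learnt weighted sum."""
    scores = atom_reprs @ w + b             # one attention score per atom
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over atoms
    return weights @ atom_reprs             # weighted sum of atom vectors
```

With a zero scoring layer this reduces to a plain mean over atoms; training the scorer lets atoms that matter most for reaction feasibility dominate the pooled embedding.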

### Dataset and preprocessing

We trained our models on the USPTO-50K dataset of roughly 50,000 reactions extracted from the United States patent literature from 1976 to 2016 [31]. The reactions are recorded as atom-mapped SMILES strings and comprise 10 types (Additional file 1: Table S2). We use a cleaned version of the random 80%/10%/10% split from Refs. [8, 9], where additional duplicate reactions are removed (explained in Additional file 1: Section S3), for a total of 39,713 training, 4989 validation and 5005 test reactions.

### One-step models used for re-ranking

We re-trained from scratch all the one-step models to be re-ranked—RetroSim [8], NeuralSym [6], GLN [9] and RetroXpert [32]—on our cleaned USPTO-50K dataset, which leads to minor discrepancies in quantitative top-*N* accuracies relative to previously reported values, though typically within 2%. RetroSim [8] is a template-based approach that computes and compares molecular similarity to choose the best template for a given product. The similarity is a combination of product similarity and reactants similarity, calculated against the training data. Using only similarity values, RetroSim is a purely data-driven method without any model parameters. Next, NeuralSym [6] is a one-step retrosynthesis deep-learning model trained to classify, for a given product in each retrosynthesis step, the most relevant template from a library of templates algorithmically extracted from the training set. One of the best-performing template-based models, the Graph Logic Network (GLN) [9] is parameterized by graph neural networks. First, GLN identifies reaction centres in a given product, then ranks the most relevant templates, before finally scoring and matching reactants given each template. In the realm of template-free methods, Retrosynthesis eXpert (RetroXpert) [32] is a hybrid graph-sequence model that employs a graph network to first identify likely reaction centres and decompose the product into synthons, followed by a Transformer network to generate full reactants from each synthon.

### Implementation

We used several open-source libraries with Python 3.6 [33]. PyTorch [34] was used as the backbone for building, training and testing all models. We used RDKit [35] and rdchiral [36] for chemical processing, and NetworkX [37] for processing molecular graphs for the Graph-EBM. As NeuralSym is not open-sourced, we re-implemented it from scratch at https://github.com/linminhtoo/neuralsym following the original work [6]. All code for data preprocessing, proposer training, EBM training and evaluation is open-sourced at: https://github.com/coleygroup/rxn-ebm.

We tuned the hyperparameters for each of the three EBM architectures by choosing those that produced the best top-1 accuracy on the validation data. It is also possible to optimize for other top-*N* accuracies, if desired. Tuning was done manually, starting from common settings found in the literature. Only after the best hyperparameters for each EBM architecture had been finalized did we calculate and report the top-*N* accuracies with these optimized hyperparameters on the USPTO-50K test data, for re-ranking each of the four one-step models. Specific model and training hyperparameters are described in Additional file 1: Section S4.