### Implementation details

COMA was implemented in Python 3.6 with several open-source tools, including PyTorch 1.10.1 and RDKit 2021.03.5. RDKit, an open-source cheminformatics toolkit, was used for SMILES kekulization, SMILES validity checking, Tanimoto similarity computation, and QED estimation. PyTorch, an open-source machine learning framework, was used to construct and train the neural networks of COMA. All experiments were conducted on Ubuntu 18.04.6 LTS with 64 GB of memory and a GeForce RTX 3090 GPU.

### Tanimoto similarity

The Tanimoto similarity, which ranges from 0 to 1, compares molecular structures on the basis of substructure fingerprints; here, Morgan fingerprints were used. In this study, the Morgan fingerprints were binary vectors generated using RDKit with a radius of 2 and a length of 2048 bits. For any two SMILES strings *x* and *y* with the corresponding fingerprint vectors \(FP(x)=(p_1, p_2,..., p_{2048})\) and \(FP(y)=(q_1, q_2,..., q_{2048})\), the Tanimoto similarity score was computed as:

$$\begin{aligned} {\mathcal {T}}(x,y)=\frac{\sum \limits _{i=1}^{2048}{p_i q_i}}{\sum \limits _{j=1}^{2048}{(p_j + q_j - p_j q_j)}}. \end{aligned}$$

(1)
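Equation (1) can be transcribed directly: for 0/1 entries, \(p_j q_j\) is the bitwise AND and \(p_j + q_j - p_j q_j\) is the bitwise OR, so the score is the bit-intersection count over the bit-union count. A minimal sketch (in practice, RDKit performs this computation on its own fingerprint objects):

```python
def tanimoto(p, q):
    """Tanimoto similarity between two equal-length binary fingerprint vectors.

    Implements Eq. (1): sum(p_i * q_i) / sum(p_j + q_j - p_j * q_j).
    For 0/1 entries the numerator is the bit intersection and the
    denominator the bit union.
    """
    assert len(p) == len(q)
    intersection = sum(pi & qi for pi, qi in zip(p, q))
    union = sum(pi | qi for pi, qi in zip(p, q))
    # Convention assumed here: two all-zero fingerprints score 0.0.
    return intersection / union if union else 0.0
```

For example, `tanimoto([1, 1, 0, 0], [1, 0, 1, 0])` returns 1/3 (one shared bit, three set bits in total).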

### Binding affinity prediction

Predicting binding affinity scores for ABCG2 and BRAF is crucial for applying COMA to sorafenib resistance. In this study, DeepPurpose [38], a PyTorch-based library for virtual screening, was used for accurate, high-throughput affinity prediction on more than 4.6 million pairs of molecules. We exploited a predictive model combining message-passing and convolutional neural networks, pretrained on BindingDB, a public database of measured binding affinities [39], to generate training datasets for UGMMT and COMA and to compute the rewards for reinforcement learning in COMA.

### Benchmark datasets

In this study, we used four benchmark datasets provided in [13] and the original dataset for sorafenib resistance (Additional file 1: Table S1).

The DRD2 dataset contains 34k molecular pairs (source and target) with DRD2 activity scores derived from the ZINC database [40]. The DRD2 activity score ranges from 0 to 1 and was assessed using a support vector machine regression model from [41]. For each pair in the DRD2 dataset, the Tanimoto similarity between the two SMILES strings was greater than or equal to 0.4, and the DRD2 scores of the source and target were less than 0.05 and greater than 0.5, respectively. The QED dataset contains 88k molecular pairs derived from the ZINC database with QED scores. The QED score ranges from 0 to 1 and was calculated using RDKit [42]. For each pair in the QED dataset, the Tanimoto similarity between the two SMILES strings was greater than or equal to 0.4, and the QED scores of the source and target were in the ranges [0.7, 0.8] and [0.9, 1.0], respectively. The penalized logP04 and penalized logP06 datasets contain 98k and 74k molecular pairs, respectively, derived from the ZINC database with penalized logP scores. The penalized logP score ranges from \(-\)63.0 to 5.5. For each pair in the penalized logP04 dataset, the Tanimoto similarity between the two SMILES strings was greater than or equal to 0.4; for penalized logP06, the similarity threshold was set to 0.6.

We constructed a dataset for sorafenib-like molecular generation to illustrate an application of COMA. Based on the observation that ABCG2 activity is related to sorafenib resistance in hepatocellular carcinoma [27, 43], this application aimed to generate sorafenib-like molecules with lower binding affinity against ABCG2 while conserving, as much as possible, the level of affinity against BRAF, the target kinase of sorafenib. The dataset contains 231k molecular pairs derived from the ChEMBL database [29] with binding affinity scores against ABCG2 and BRAF. Binding affinity was evaluated as pKd using DeepPurpose. For each pair in the ABCG2 dataset, the Tanimoto similarity between the two molecules was greater than or equal to 0.4, and the ABCG2 affinity values of the source and target were in the ranges [4.9, 8.4] and [3.3, 4.7], respectively. For BRAF, both the source and target had binding affinities greater than 6.0.
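The pair constraints above amount to a simple filter over candidate (source, target) pairs. In the sketch below, `tanimoto_sim` and `predict_pkd` are hypothetical callables standing in for the RDKit similarity computation and the DeepPurpose pKd predictor, respectively:

```python
def keep_pair(src, tgt, tanimoto_sim, predict_pkd):
    """Return True if a (source, target) SMILES pair satisfies the
    ABCG2/BRAF dataset constraints described in the text.

    tanimoto_sim(a, b) -> float in [0, 1]      (hypothetical similarity callable)
    predict_pkd(smiles, protein) -> float pKd  (hypothetical affinity callable)
    """
    if tanimoto_sim(src, tgt) < 0.4:                   # structural similarity >= 0.4
        return False
    if not (4.9 <= predict_pkd(src, "ABCG2") <= 8.4):  # source ABCG2 pKd in [4.9, 8.4]
        return False
    if not (3.3 <= predict_pkd(tgt, "ABCG2") <= 4.7):  # target ABCG2 pKd in [3.3, 4.7]
        return False
    # Both molecules must retain BRAF affinity > 6.0.
    return predict_pkd(src, "BRAF") > 6.0 and predict_pkd(tgt, "BRAF") > 6.0
```

Running this predicate over all candidate pairs yields the 231k-pair dataset described above.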

### Metric learning for molecular structural similarity

COMA has the structure of a variational autoencoder (VAE). It consists of an encoder \(q_\phi \) that maps an input SMILES string to a latent vector and a decoder \(p_\psi \) that reconstructs a string from a latent vector. However, whereas the original VAE aims to sample a random SMILES string, the goal of COMA is to generate a SMILES string that is structurally similar to a given input SMILES string. More concretely, given a pair of SMILES strings \(x_{s}\) and \(x_{t}\) with similar chemical structures, the objective function of COMA, \(L(x_{s}, x_{t})\), is expressed as

$$\begin{aligned} \begin{aligned} L(x_{s}, x_{t})&= \,{\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{t})}{[\log p_{\psi }(x_{t} \mid z)]} \\&+ {\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{s})}{[\log p_{\psi }(x_{s} \mid z)]} \\&- D(q_{\phi }(z \mid x_{t}) \Vert p(z \mid x_{s})) \\&- D(q_{\phi }(z \mid x_{s}) \Vert p(z \mid x_{t})), \end{aligned} \end{aligned}$$

(2)

where \(p(z \mid x_{s})\) and \(p(z \mid x_{t})\) are prior distributions, and \(D(\cdot \Vert \cdot )\) is the Kullback–Leibler (KL) divergence. Equation (2) was derived from the lower bound of \(\log p(x_{t} \mid x_{s}) + \log p(x_{s} \mid x_{t})\). The prior distributions were replaced by \(q_{\phi }(z \mid x_{s})\) and \(q_{\phi }(z \mid x_{t})\), respectively, because our similarity assumption was that paired SMILES strings should be embedded at an identical latent point; the KL terms therefore push \(q_{\phi }(z \mid x_{s})\) and \(q_{\phi }(z \mid x_{t})\) towards equality. However, computing both KL terms is complex and unstable because gradients must be calculated through both trainable distributions, \(q_{\phi }(z \mid x_{s})\) and \(q_{\phi }(z \mid x_{t})\). To replace the KL terms with a more efficiently computable penalty, we utilized the Fréchet distance [44], which measures the distance between probability distributions. The objective function of COMA in Eq. (2), negated into a minimisation loss, is rewritten as

$$\begin{aligned} \begin{aligned} L(x_{s}, x_{t})&= \,{\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{t})}{[-\log p_{\psi }(x_{t} \mid z)]} \\&+ {\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{s})}{[-\log p_{\psi }(x_{s} \mid z)]} \\&+ \Vert \mu _{t} - \mu _{s}\Vert ^2 + tr[\Sigma _{t} + \Sigma _{s} - 2(\Sigma _{t}\Sigma _{s})^{1/2}], \end{aligned} \end{aligned}$$

(3)

where \(\mu _{t}\) and \(\Sigma _{t}\) are the mean vector and covariance matrix, respectively, of a target SMILES string \(x_{t}\) computed by encoder \(q_\phi \), and \(\mu _{s}\) and \(\Sigma _{s}\) are the mean vector and covariance matrix, respectively, of a source string \(x_{s}\).
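With the diagonal covariances typical of VAE encoders, the Fréchet term in Eq. (3) has a simple closed form: for variance vectors \(v_t\) and \(v_s\), the trace reduces to \(\sum _i (\sqrt{v_{t,i}} - \sqrt{v_{s,i}})^2\). A sketch under that diagonal assumption (COMA's actual encoder parameterisation may differ):

```python
import math

def frechet_gaussian(mu_t, mu_s, var_t, var_s):
    """Squared Fréchet distance between two Gaussians with diagonal covariances.

    Implements ||mu_t - mu_s||^2 + tr[S_t + S_s - 2 (S_t S_s)^{1/2}] from
    Eq. (3); for diagonal S the trace term is sum_i (sqrt(v_t_i) - sqrt(v_s_i))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_t, mu_s))
    trace_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_t, var_s))
    return mean_term + trace_term
```

The distance is zero exactly when the two distributions coincide, which is the configuration the contractive loss drives paired encodings toward.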

Equation (3) imposes the restriction that similar SMILES strings have identical latent distributions; however, it does not guarantee that structurally dissimilar strings have different distributions. To address this issue, triplet learning was designed to create effective molecular embeddings for (dis)similar molecular pairs. The algorithm for constructing the triplet dataset from a paired dataset is described in Additional file 1: Algorithm S3. Given a triplet of SMILES strings (\(x_{s}\), \(x_{t}\), \(x_{n}\)), where \(x_{s}\) and \(x_{t}\) are similar but \(x_{n}\) is relatively different, the final objective function of COMA, \(L(x_{s}, x_{t}, x_{n})\), is expressed as

$$\begin{aligned} \begin{aligned} L(x_{s}, x_{t}, x_{n})&= \,{\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{t})}{[-\log p_{\psi }(x_{t} \mid z)]} \\&+ {\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{s})}{[-\log p_{\psi }(x_{s} \mid z)]} \\&+ \underbrace{{\mathbb {E}}_{z \sim q_{\phi }(z \mid x_{n})}{[-\log p_{\psi }(x_{n} \mid z)]}}_\text {Reconstruction loss} \\&+ \underbrace{\Vert \mu _{t} - \mu _{s}\Vert ^2 + tr[\Sigma _{t} + \Sigma _{s} - 2(\Sigma _{t}\Sigma _{s})^{1/2}]}_\text {Contractive loss} \\&+ softplus(1 - \Vert \mu _{t} - \mu _{n}\Vert ^2) \\&+ \underbrace{softplus(1 - \Vert \mu _{s} - \mu _{n}\Vert ^2)}_\text {Margin loss}, \end{aligned} \end{aligned}$$

(4)

where \(softplus(x)=\log (1+e^x)\). The softplus function was used to prevent the excessive spread of latent vectors. In Eq. (4), the *reconstruction loss* trains the encoder and decoder to generate valid and diverse SMILES strings, whereas the *contractive loss*, which equals the Fréchet distance between two multivariate Gaussian distributions, and the *margin loss*, which resembles a soft hinge loss, aim to improve the performance of structure-constrained molecular generation in terms of similarity. COMA was first trained using these three loss terms (Additional file 1: Algorithm S2) and then trained via the reinforcement learning procedure explained below.
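The margin terms of Eq. (4) can be sketched directly: because \(softplus(1 - d^2)\) decreases smoothly toward zero as the squared distance \(d^2\) to the negative's latent mean grows, the penalty pushes dissimilar molecules apart without blowing up for pairs that are already far apart (a sketch; the function and variable names are ours):

```python
import math

def softplus(x):
    """softplus(x) = log(1 + e^x), as defined after Eq. (4)."""
    return math.log(1.0 + math.exp(x))

def margin_loss(mu_s, mu_t, mu_n):
    """Margin loss of Eq. (4):
    softplus(1 - ||mu_t - mu_n||^2) + softplus(1 - ||mu_s - mu_n||^2).

    Penalises a negative sample x_n whose latent mean mu_n sits close to
    the source/target means; saturates once the negative is far away.
    """
    d_tn = sum((a - b) ** 2 for a, b in zip(mu_t, mu_n))
    d_sn = sum((a - b) ** 2 for a, b in zip(mu_s, mu_n))
    return softplus(1.0 - d_tn) + softplus(1.0 - d_sn)
```

A nearby negative (small distances) yields a larger penalty than a distant one, which is exactly the separation behaviour triplet learning requires.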

### Reinforcement learning for molecular property optimisation

COMA can translate an input molecule into a valid SMILES string that resembles the input via metric learning; however, its outputs are not yet guaranteed to exhibit desirable chemical properties. Reinforcement learning aims to encourage COMA to produce molecules that are structurally similar to a given input molecule but have desirable chemical properties. To achieve this goal, we utilized the REINFORCE algorithm [19], in which an episode is defined as the transformation process of an input SMILES string, and a reward as a score based on the evaluated properties of the output. Specifically, for a given SMILES string *x* and a given reward function *R*, the objective function is expressed as

$$\begin{aligned} L(x) = -R({\hat{x}},x)\log p_{\psi }({\hat{x}} \mid z_{x}), \end{aligned}$$

(5)

where \({\hat{x}}\) is an output translated from *x*, and \(z_{x}\) is a latent vector of *x* sampled by the pretrained encoder \(q_\phi \). Because generating molecules with desirable properties depends solely on the decoder’s ability, only the gradient with respect to the decoder was calculated, as follows:

$$\begin{aligned} \nabla _{\psi }L(x) = {\mathbb {E}}_{p_{\psi }({\hat{x}} \mid z_{x})}[-R({\hat{x}},x)\nabla _{\psi }\log p_{\psi }({\hat{x}} \mid z_{x})]. \end{aligned}$$

(6)
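The score-function identity behind Eq. (6), \(\nabla _{\psi } {\mathbb {E}}[-R] = {\mathbb {E}}[-R \, \nabla _{\psi } \log p_{\psi }]\), can be checked on a toy policy. The sketch below uses a one-parameter Bernoulli "decoder" (our illustrative construction, not COMA's actual network) and compares the Monte Carlo REINFORCE gradient against the analytic gradient:

```python
import math
import random

def reinforce_gradient_demo(theta=0.0, n_samples=100_000, seed=0):
    """Monte Carlo estimate of grad E[-R] for a Bernoulli(sigmoid(theta)) policy.

    Reward R = 1 if the sampled token is 1, else 0, so E[-R] = -sigmoid(theta)
    and the analytic gradient is -sigmoid(theta) * (1 - sigmoid(theta)).
    The REINFORCE estimator averages -R * d/dtheta log p(x), as in Eq. (6).
    """
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))        # sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0      # sample an "output token"
        reward = 1.0 if x == 1 else 0.0
        # d/dtheta log p(x): (1 - p) for x = 1, -p for x = 0
        score = (1.0 - p) if x == 1 else -p
        total += -reward * score
    estimate = total / n_samples
    analytic = -p * (1.0 - p)
    return estimate, analytic
```

With `theta = 0` the analytic gradient is \(-0.25\), and the Monte Carlo estimate converges to it as the sample count grows, illustrating why Eq. (6) is an unbiased gradient of the expected negative reward.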

In Eqs. (5) and (6), the reward function \(R({\hat{x}},x)\) should be designed to improve the target property while ensuring that any degradation of the similarity between \({\hat{x}}\) and *x* stays within allowable limits. In the case of DRD2 and QED, the reward function was defined as

$$\begin{aligned} R({\hat{x}},x) = {\left\{ \begin{array}{ll} \max \{0,\frac{\phi ({\hat{x}}) - \delta }{1 - \delta }\} & \quad \text {if } {\mathcal {T}}({\hat{x}},x) > \epsilon \\ 0 & \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

(7)

where \({\mathcal {T}}\) is the Tanimoto similarity function, \(\phi \) is a property scoring function, \(\delta \) is the property threshold, and \(\epsilon \) is the Tanimoto similarity threshold. Using this reward function, the decoder of COMA was guided to increase the probability of producing molecules with property \(>\delta \) and similarity \(>\epsilon \). In this study, we heuristically set \(\delta \) to 0 and 0.75 for DRD2 and QED, respectively, and \(\epsilon =0.3\) in both tasks. For the pLogP04 and pLogP06 tasks, a slightly modified reward function was exploited:

$$\begin{aligned} R({\hat{x}},x) = {\left\{ \begin{array}{ll} \max \{0,\frac{[\phi ({\hat{x}}) - \phi (x)] - \delta }{1 - \delta }\} & \quad \text {if } {\mathcal {T}}({\hat{x}},x) > \epsilon \\ 0 & \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

(8)

We set the property-improvement threshold \(\delta \) to 0 for both pLogP04 and pLogP06, and the similarity threshold \(\epsilon \) to 0.3 and 0.5 for pLogP04 and pLogP06, respectively. In contrast, for decreasing property values, such as the affinity against ABCG2, the reward function was defined as:

$$\begin{aligned} R({\hat{x}},x) = {\left\{ \begin{array}{ll} \max \{0,\frac{\delta - \phi ({\hat{x}})}{\delta }\} & \quad \text {if } {\mathcal {T}}({\hat{x}},x) > \epsilon \\ 0 & \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

(9)

Using this reward function, the decoder of COMA could receive a positive reward only if it generated a molecule with property \(<\delta \) and similarity \(>\epsilon \). In the sorafenib-resistance experiment, we set \(\epsilon =0.4\) and \(\delta =4.989\), and added a condition for a positive reward that the affinity against BRAF be larger than 6.235. The two thresholds, 4.989 and 6.235, equal the affinities of sorafenib against ABCG2 and BRAF, respectively, as evaluated using DeepPurpose.
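The three reward schemes in Eqs. (7)-(9) differ only in how the property term is scaled; all share the similarity gate \({\mathcal {T}}({\hat{x}},x) > \epsilon \). A sketch with the property oracle `phi` and similarity `sim` passed in as callables (hypothetical stand-ins for the actual scorers):

```python
def reward_threshold(x_hat, x, phi, sim, delta, eps):
    """Eq. (7): reward for pushing a property above delta (DRD2, QED)."""
    if sim(x_hat, x) <= eps:
        return 0.0
    return max(0.0, (phi(x_hat) - delta) / (1.0 - delta))

def reward_improvement(x_hat, x, phi, sim, delta, eps):
    """Eq. (8): reward for improving the property relative to the source (pLogP)."""
    if sim(x_hat, x) <= eps:
        return 0.0
    return max(0.0, ((phi(x_hat) - phi(x)) - delta) / (1.0 - delta))

def reward_decrease(x_hat, x, phi, sim, delta, eps):
    """Eq. (9): reward for pushing a property below delta (ABCG2 affinity)."""
    if sim(x_hat, x) <= eps:
        return 0.0
    return max(0.0, (delta - phi(x_hat)) / delta)
```

For instance, under Eq. (7) with \(\delta =0.75\), a generated molecule with QED score 0.9 and sufficient similarity earns a reward of \((0.9-0.75)/0.25=0.6\).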

The details of the reinforcement learning procedure for COMA are described in the Additional file 1: Algorithm S3.

### Evaluation metrics

To evaluate performance on structure-constrained molecular generation tasks, we used the following seven metrics: validity, novelty, property, improvement, similarity, diversity, and success rate. More concretely, given a training dataset \(X_{train}\) and a test dataset \(X_{test}\), the seven metrics for a molecular generation model \({\mathcal {M}}\) that generates 20 molecules per source molecule were defined as

$$\begin{aligned} Valid({\mathcal {M}})= & {} \sum _{x \in X_{test}}{\frac{{1\!\!1}\left( \sum _{y \in {\mathcal {M}}(x)}{\zeta (y)}>0\right) }{|X_{test} |}}, \end{aligned}$$

(10)

$$\begin{aligned} Novel({\mathcal {M}})= & {} \sum _{x \in X_{test}}{\frac{{1\!\!1}\left( {\mathcal {M}}(x)\setminus X_{train}\ne \emptyset \right) }{|X_{test} |}}, \end{aligned}$$

(11)

$$\begin{aligned} Prop({\mathcal {M}})= & {} \frac{1}{|X_{test} |} {\sum _{x \in X_{test}}{\left( \frac{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y)\phi (y)}}{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y)}}\right) }}, \end{aligned}$$

(12)

$$\begin{aligned} Impr({\mathcal {M}})= & {} \frac{1}{|X_{test} |} {\sum _{x \in X_{test}}{\left( \frac{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y)[\phi (y)-\phi (x)]}}{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y)}}\right) }}, \end{aligned}$$

(13)

$$\begin{aligned} Sim({\mathcal {M}})= & {} \frac{1}{|X_{test} |} {\sum _{x \in X_{test}}{\left( \frac{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y){\mathcal {T}}(y,x)}}{\sum _{y \in {\mathcal {M}}(x)}{\zeta (y)}}\right) }}, \end{aligned}$$

(14)

$$\begin{aligned} Div({\mathcal {M}})= & \frac{1}{|X_{test} |} {\sum _{x \in X_{test}}{\left( 1- \frac{\sum _{\begin{array}{l} y \in {\mathcal {M}}(x); \\ z \in {\mathcal {M}}(x) \setminus \{y\} \end{array}}{\zeta (y)\zeta (z){\mathcal {T}}(y,z)}}{\sum _{\begin{array}{l} y \in {\mathcal {M}}(x); \\ z \in {\mathcal {M}}(x) \setminus \{y\} \end{array}}{\zeta (y)\zeta (z)}} \right) }}, \end{aligned}$$

(15)

$$\begin{aligned} SR({\mathcal {M}})= & \sum _{x \in X_{test}} \frac{{ {1\!\!1}\left( \sum _{\begin{array}{c} y \in {\mathcal {M}}(x); \\ y \notin X_{train} \end{array}}{\left\{ \zeta (y){1\!\!1}(\phi (y)-\phi (x)\ge \delta ){1\!\!1}({\mathcal {T}}(y,x)\ge \epsilon )\right\} > 0}\right) }}{{|X_{test} |}}, \end{aligned}$$

(16)

where \({1\!\!1}(p)\) is an indicator function that returns 1 if a statement *p* is true and zero otherwise; \(\zeta (x)\) is an indicator function that returns 1 if an input SMILES string *x* is valid and zero otherwise; \(\phi (x)\) is an oracle that calculates a property score of an input SMILES string *x*; \({\mathcal {T}}\) is a function for Tanimoto similarity, and \(\delta \) and \(\epsilon \) are thresholds of improvement and similarity, respectively.
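As an illustration of how these ingredients compose, the success rate of Eq. (16) counts a test molecule as a success if at least one of its generated candidates is valid, novel, sufficiently improved, and sufficiently similar. A sketch with the validity check, property oracle, and similarity passed in as callables (hypothetical stand-ins for \(\zeta \), \(\phi \), and \({\mathcal {T}}\)):

```python
def success_rate(x_test, x_train, generate, zeta, phi, sim, delta, eps):
    """Eq. (16): fraction of test molecules with at least one successful translation.

    generate(x) -> iterable of generated SMILES for source x
    zeta(y)     -> 1 if y is a valid SMILES string, else 0
    phi(y)      -> property score of y
    sim(y, x)   -> Tanimoto similarity between y and x
    """
    hits = 0
    for x in x_test:
        for y in generate(x):
            if y in x_train:                 # candidate must be novel
                continue
            if zeta(y) and phi(y) - phi(x) >= delta and sim(y, x) >= eps:
                hits += 1
                break                        # one success per source suffices
    return hits / len(x_test)
```

The other metrics follow the same pattern, averaging property, improvement, or similarity over the valid candidates instead of taking an indicator per source.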

### Docking simulation

To verify our findings, that is, the molecules translated from sorafenib in the application example, we analysed binding poses and evaluated binding energies using docking simulation tools: AutoDock Vina [30], Chimera [31], and LigPlot Plus [32]. AutoDock Vina, one of the most widely used open-source docking programs, was used to conduct docking simulations and compute binding-energy scores, assessing the extent to which our molecules had lower binding affinity against ABCG2, which is associated with sorafenib resistance. Chimera, a program for interactive three-dimensional (3D) visualisation and analysis of molecular structures, was used to inspect docking poses between BRAF and several ligands, including sorafenib and our generated molecules. LigPlot Plus, a program for 2D visualisation of ligand-protein interaction diagrams, was used to identify the protein residues interacting with sorafenib and to analyse whether our molecules lost binding ability against BRAF, a therapeutic target of sorafenib.