SMILES-based deep generative scaffold decorator for de-novo drug design

Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, these models are not able to generate molecules given a scaffold (i.e., a partially built molecule with explicit attachment points). Herein we report a new SMILES-based molecular generative architecture that generates molecules from scaffolds and can be trained from any arbitrary molecular set. This approach is possible thanks to a new molecular set pre-processing algorithm that exhaustively slices all possible combinations of acyclic bonds of every molecule, combinatorially obtaining a large number of scaffolds with their respective decorations. Moreover, it serves as a data augmentation technique and can be readily coupled with randomized SMILES to obtain even better results with small sets. Two examples showcasing the potential of the architecture in medicinal and synthetic chemistry are described: First, models were trained with a training set obtained from a small set of Dopamine Receptor D2 (DRD2) active modulators and were able to meaningfully decorate a wide range of scaffolds and obtain molecular series predicted to be active on DRD2. Second, a larger set of drug-like molecules from ChEMBL was selectively sliced using synthetic chemistry constraints (RECAP rules). In this case, the resulting scaffolds with decorations were filtered to allow only those that included fragment-like decorations. This filtering process allowed models trained with this dataset to selectively decorate diverse scaffolds with fragments that were generally predicted to be synthesizable and attachable to the scaffold using known synthetic approaches. In both cases, the models were able to decorate molecules using the specific knowledge embedded in the training set, without the need to add it through other techniques such as reinforcement learning. We envision that this architecture will become a useful addition to the existing architectures for de novo molecular generation.
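As an illustration of the slicing idea described above, the following is a minimal sketch (not the authors' exhaustive pre-processing algorithm itself) that uses RDKit to cut combinations of acyclic single bonds and split each molecule into a scaffold and its decorations. The choice of the largest fragment as the scaffold and the maximum number of cuts are simplifying assumptions made here for illustration.

```python
from itertools import combinations

from rdkit import Chem

def slice_molecule(smiles, max_cuts=4):
    """Yield (scaffold, decorations) SMILES pairs for every combination of cuts."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return
    # Candidate bonds: acyclic single bonds (ring bonds are never cut).
    cuttable = [
        bond.GetIdx() for bond in mol.GetBonds()
        if not bond.IsInRing() and bond.GetBondType() == Chem.BondType.SINGLE
    ]
    for n_cuts in range(1, max_cuts + 1):
        for bond_ids in combinations(cuttable, n_cuts):
            # Cut the selected bonds; attachment points become dummy atoms in the SMILES.
            cut_mol = Chem.FragmentOnBonds(mol, list(bond_ids), addDummies=True)
            frags = Chem.GetMolFrags(cut_mol, asMols=True, sanitizeFrags=False)
            # Simplification: treat the largest fragment as the scaffold and the
            # remaining fragments as its decorations.
            frags = sorted(frags, key=lambda f: f.GetNumHeavyAtoms(), reverse=True)
            scaffold, decorations = frags[0], frags[1:]
            yield Chem.MolToSmiles(scaffold), [Chem.MolToSmiles(d) for d in decorations]

# Example: slice a simple phenylpiperazine with up to two cuts.
for scaffold, decorations in slice_molecule("c1ccc(CCN2CCN(c3ccccc3)CC2)cc1", max_cuts=2):
    print(scaffold, decorations)
```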


S1. Databases

DRD2 modulators
A set of 4,613 human DRD2 active modulators (pXC50 ≥ 5) obtained from ExCAPE-DB [1] was downloaded from the official website and cleaned using a process very similar to that of [2], which had the following steps: First, the MolVS 0.1.1 library [3] was used to sanitize all molecules; duplicates, stereochemistry information and salts were removed, and all fragments except the largest were discarded. Then, all molecules containing heavy atom types other than C, N, O, S, Cl, Br and F were removed. Afterwards, a series of filters were applied sequentially to remove outliers (Table 1).

Table 1: Filters applied to the DRD2 modulator set from ExCAPE-DB, in order (from top to bottom). The first entry is not a filter but represents the initial state.
Initial set — 4,613 molecules
Token filter: removed SMILES with non-ring tokens with less than 0.5% abundance in the dataset — 4,211 molecules
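As a rough illustration of the cleaning steps described above (not the authors' exact script), a sketch using MolVS and RDKit could look as follows; the input SMILES are hypothetical toy examples.

```python
from molvs import Standardizer
from rdkit import Chem

ALLOWED_ATOMS = {"C", "N", "O", "S", "Cl", "Br", "F", "H"}
standardizer = Standardizer()

def clean(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = standardizer.standardize(mol)      # sanitize / normalize the molecule
    mol = standardizer.fragment_parent(mol)  # keep only the largest fragment (drops salts)
    mol = standardizer.stereo_parent(mol)    # strip stereochemistry information
    if any(atom.GetSymbol() not in ALLOWED_ATOMS for atom in mol.GetAtoms()):
        return None                          # disallowed heavy atom type
    return Chem.MolToSmiles(mol)             # canonical SMILES (also used for de-duplication)

# Hypothetical inputs: an aspirin sodium salt and an amino acid with a stereocentre.
raw = ["CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", "C[C@H](N)C(=O)O"]
cleaned = {c for c in (clean(s) for s in raw) if c is not None}
print(cleaned)
```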

ChEMBL subset
The ChEMBL 25 database [4] was obtained from the official website, and the same process as for the DRD2 set was used to filter it, but with different cutoffs and some additional descriptors (Table 2). Specifically, all molecules bigger than 40 heavy atoms, with fewer than two rings, or with rings of size other than 5 or 6 were discarded. Also, a restrictive QED [5] filter was applied that removed around 350,000 compounds. Lastly, molecules whose SMILES were too complicated or that included tokens that seldom appeared in the dataset were filtered out. This process ensured a database with fewer outliers.

Table 2: Filters applied to the ChEMBL 25 set.
Token filter: removed SMILES with non-ring tokens present in less than 0.05% of the canonical SMILES strings — 827,098 molecules
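A sketch of the additional descriptor filters described above is shown below, using RDKit; the QED cutoff is an assumption (the exact value is not given in this section) and the example SMILES is arbitrary.

```python
from rdkit import Chem
from rdkit.Chem import QED

QED_CUTOFF = 0.5  # assumption: the exact QED cutoff is not stated in this section

def passes_filters(mol):
    if mol.GetNumHeavyAtoms() > 40:
        return False                               # too large
    rings = mol.GetRingInfo().AtomRings()
    if len(rings) < 2:
        return False                               # fewer than two rings
    if any(len(ring) not in (5, 6) for ring in rings):
        return False                               # ring size other than 5 or 6
    if QED.qed(mol) < QED_CUTOFF:
        return False                               # fails the drug-likeness filter
    return True

mol = Chem.MolFromSmiles("O=C(Nc1ccc(F)cc1)N1CCN(c2ncccn2)CC1")  # arbitrary example
print(passes_filters(mol))
```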

ZINC fragments
The In-Stock Fragment subset of ZINC was obtained from the official website and was further processed with RDKit to remove stereochemistry, obtain canonical SMILES, and remove repeated molecules. The final size of the subset was 541,281 molecules. This database was used only to check whether decorations from the ChEMBL model were readily purchasable. As the database only holds molecules with more than 3 heavy atoms, any smaller decoration was automatically considered to be in the database.
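A sketch of this purchasability check, assuming the ZINC subset has already been loaded into a set of canonical SMILES, could look like this (the toy subset shown is hypothetical):

```python
from rdkit import Chem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    Chem.RemoveStereochemistry(mol)   # match the stereochemistry-free ZINC entries
    return Chem.MolToSmiles(mol)

# Hypothetical toy subset; in practice this would be the 541,281 ZINC fragments.
zinc_in_stock = {canonical(s) for s in ["c1ccccc1", "CCO", "CC(C)N"]}

def is_purchasable(decoration_smiles):
    mol = Chem.MolFromSmiles(decoration_smiles)
    if mol.GetNumHeavyAtoms() <= 3:
        return True                   # smaller than anything in ZINC: assumed purchasable
    return canonical(decoration_smiles) in zinc_in_stock

print(is_purchasable("CO"), is_purchasable("c1ccccc1"), is_purchasable("c1ccncc1"))
```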

S2. Training details

DRD2 models
The decorator models (multi-step and single-step) were trained on a training-validation split of (131,241; 5,820) scaffold-decoration pairs. The models were trained for 100 epochs with a batch size of 64 and exponential learning rate decay from a starting value of 10⁻³ down to 10⁻⁵.
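A minimal sketch of such a schedule, assuming a PyTorch implementation and an Adam optimizer (the optimizer choice is an assumption; it is not specified in this section), could be:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the decorator network
# Assumption: an Adam optimizer is used (not specified in this section).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epochs = 100
# Per-epoch multiplicative factor so the learning rate decays from 1e-3 to 1e-5.
gamma = (1e-5 / 1e-3) ** (1.0 / epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(epochs):
    # ... one pass over the scaffold-decoration training set with batch size 64 ...
    optimizer.step()       # stands in for the actual training updates
    scheduler.step()
print(scheduler.get_last_lr())  # ~1e-5 after 100 epochs
```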
The scaffold generator model was trained on a subset of the scaffolds in the training set, which included all scaffolds with at least two attachment points and for which the shortest path between all attachment points passed through a ring atom. This set amounted to 9,925 scaffolds, which were divided into training and validation sets of (9,425; 500) scaffolds, respectively. The model was trained for 500 epochs with the same exponential learning rate schedule and optimizer specified before, but with a batch size of 8. The model took roughly 3 hours to train, and the best epoch was chosen using the UC-JSD, as specified in [2].
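An illustrative sketch of this scaffold selection rule (not the authors' code), using RDKit with dummy atoms as attachment points, could be:

```python
from itertools import combinations

from rdkit import Chem

def keep_scaffold(scaffold_smiles):
    """Keep scaffolds with >= 2 attachment points whose pairwise paths cross a ring atom."""
    mol = Chem.MolFromSmiles(scaffold_smiles)
    if mol is None:
        return False
    # Attachment points are dummy atoms ([*], atomic number 0).
    attachment_idxs = [a.GetIdx() for a in mol.GetAtoms() if a.GetAtomicNum() == 0]
    if len(attachment_idxs) < 2:
        return False
    for i, j in combinations(attachment_idxs, 2):
        path = Chem.GetShortestPath(mol, i, j)
        if not any(mol.GetAtomWithIdx(idx).IsInRing() for idx in path):
            return False
    return True

print(keep_scaffold("[*]c1ccc(N2CCN([*])CC2)cc1"))  # True: the path crosses ring atoms
print(keep_scaffold("[*]CC[*]"))                    # False: purely acyclic path
```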

ChEMBL models
The decorator models were trained using the same hyperparameters as the decorator models in the previous section, except for the batch size, which was increased to 512. The training-validation set split was (4,119,080; 48,127). As the training set was very large, models took 50 minutes to train each epoch, amounting to a total of 3-4 days each.
The scaffold generator was trained with the same hyperparameters as the previous one, but with a batch size of 32. The training set was obtained in the same way as for the previous scaffold generator model, yielding a total of 167,099 scaffolds, which were then split into (162,099; 5,000) scaffolds for the training and validation sets, respectively. The model took 2 days and 10 hours to train, and the best epoch was chosen using the UC-JSD as before.
A note on training duration

The models in this research were trained using randomized SMILES. As shown in [2], these models already give good results after just a few epochs, but they can be trained for much longer periods to obtain models that make fewer mistakes. Other published approaches [7], [8] use graph generative models, which are substantially slower than SMILES models but, in their applications, are only trained for a small number of steps; thus the total training time is lower. For instance, the decorator model from [7] was trained for 50,000 steps and took 20 hours, whereas our ChEMBL decorator model was trained for 4,167,207 ⋅ 100 / 512 ≈ 814,000 steps.
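The step count quoted above can be reproduced with a back-of-the-envelope calculation using the dataset size, batch size, and number of epochs given earlier:

```python
import math

dataset_size = 4_167_207   # scaffold-decoration pairs quoted above
batch_size = 512
epochs = 100

steps_per_epoch = math.ceil(dataset_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 8,140 steps per epoch, 814,000 steps in total
```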