Randomized SMILES strings improve the quality of molecular generative models

Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1,000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that measure how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models using LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model trained with randomized SMILES was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even larger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and again illustrate that training with randomized SMILES leads to models that better represent drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared with one trained with canonical SMILES.

than 0.05% of the total molecules in the set were kept. For continuous descriptors, an arbitrary cut-off was set. Lastly, canonical SMILES were tokenized (see next section) and several string-based filters were applied. First, SMILES with too many tokens were removed. Then, SMILES with a ratio of non-atom to atom tokens higher than 2 were also removed; this filtered out molecules with excessive branching. Finally, any non-ring token that appeared in less than 0.05% of the molecules was removed, which accounted for 26 rarely occurring tokens. The final database size was 1,562,045 compounds.
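As an illustration of these string-based filters, a minimal sketch follows. The tokenizer, the thresholds, and the interpretation of the non-atom ratio (non-atom tokens per atom token) are our assumptions, not the paper's code:

```python
import re
from collections import Counter

# Naive SMILES tokenizer: bracket atoms, two-letter halogens, then single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|.")
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]")

def keep_smiles(smiles, max_tokens=100, max_nonatom_ratio=2.0):
    """Apply the two per-SMILES filters (threshold values are illustrative)."""
    tokens = TOKEN_RE.findall(smiles)
    if len(tokens) > max_tokens:                 # filter 1: too many tokens
        return False
    atoms = [t for t in tokens if ATOM_RE.fullmatch(t)]
    non_atoms = len(tokens) - len(atoms)
    if atoms and non_atoms / len(atoms) > max_nonatom_ratio:
        return False                             # filter 2: too much branching
    return True

def rare_token_filter(all_smiles, min_freq=0.0005):
    """Drop SMILES containing non-ring tokens seen in < 0.05% of molecules."""
    counts = Counter()
    for smi in all_smiles:
        counts.update(set(TOKEN_RE.findall(smi)))
    rare = {t for t, c in counts.items()
            if c / len(all_smiles) < min_freq and not t.isdigit()}  # digits = ring tokens
    return [s for s in all_smiles if rare.isdisjoint(TOKEN_RE.findall(s))]
```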

[Table: filter applied, range allowed, and set size after each filter]
S2. Benchmark
Adaptive learning rate decay strategy
The adaptive strategy used in all trained models is based on exponential learning rate decay: a parameter $0 < \gamma < 1$ multiplies the learning rate after each epoch, thus reducing it. In this approach, the learning rate is multiplied by $\gamma$ only when the average UC-JSD of the last few epochs is not lower than the current one. Additionally, the learning rate is not reduced during the first epochs after a change (i.e., patience). This allows models to continue training with the same learning rate while they are still improving, yet remain resilient to training instability.
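A minimal sketch of such a schedule, assuming per-epoch UC-JSD values are available; the class and parameter names (`AdaptiveDecay`, `gamma`, `patience`) are ours, not the paper's code:

```python
class AdaptiveDecay:
    """Exponential LR decay gated on UC-JSD improvement (illustrative sketch)."""

    def __init__(self, lr=1e-3, gamma=0.8, patience=5):
        self.lr = lr              # current learning rate
        self.gamma = gamma        # decay factor, 0 < gamma < 1
        self.patience = patience  # epochs to wait after a decay step
        self.history = []         # per-epoch UC-JSD values
        self.cooldown = 0         # epochs left before decay is allowed again

    def step(self, uc_jsd):
        """Call once per epoch with the epoch's UC-JSD; returns the LR to use."""
        self.history.append(uc_jsd)
        if self.cooldown > 0:
            self.cooldown -= 1
            return self.lr
        recent = self.history[-(self.patience + 1):-1]
        # Decay only if the metric has stopped improving
        # (current UC-JSD not below the recent average).
        if recent and uc_jsd >= sum(recent) / len(recent):
            self.lr *= self.gamma
            self.cooldown = self.patience
        return self.lr
```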
Obtaining statistics for each model

Molecule NLL of randomized SMILES models
To compare the canonical and randomized SMILES models, the NLLs of the randomized SMILES have to be normalized per molecule. To achieve that, a number of different randomized SMILES are generated for each molecule and duplicates are removed. The NLL of each remaining randomized SMILES is then calculated with the randomized SMILES model. Finally, for each molecule, all of its randomized SMILES are grouped, their NLLs converted back to probabilities and summed, and the molecule NLL is obtained from this cumulative probability. This is a lower-bound approximation of the real probability of sampling a given molecule, with an error that depends on the number of randomized SMILES generated.
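A minimal NumPy sketch of this aggregation (the function and variable names are ours; `nlls` is assumed to hold one NLL per unique randomized SMILES of a molecule):

```python
import numpy as np

def molecule_nll(nlls):
    """Aggregate per-SMILES NLLs into a single molecule NLL.

    Converts each NLL back to a probability, sums the probabilities of all
    unique randomized SMILES, and returns the negative log of the cumulative
    probability (computed as a log-sum-exp for numerical stability).
    """
    nlls = np.asarray(nlls, dtype=np.float64)
    # log(sum_i exp(-nll_i)) computed stably, then negated.
    m = (-nlls).max()
    log_cum_prob = m + np.log(np.exp(-nlls - m).sum())
    return -log_cum_prob

# Example: three unique randomized SMILES of the same molecule.
print(molecule_nll([25.1, 26.3, 24.8]))
```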

S3. Similarity maps
Similarity maps were based on previous literature [7]. Molecular Quantum Number (MQN) [8] fingerprints were calculated using the JChem Library 18.22.0 from ChemAxon for all molecules in two sets randomly sampled from GDB-13: one with 25 million molecules and another with 1,000 molecules, also called probes. Next, the Manhattan distance (i.e., city-block distance) was calculated between all molecules from the first set and the probes, yielding the distance of every molecule to every probe.
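For illustration, a small sketch of this distance computation with SciPy; the array names and sizes are placeholders (MQN fingerprints are 42-dimensional integer count vectors):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Placeholder MQN fingerprints: rows are molecules, columns the 42 MQN counts.
rng = np.random.default_rng(0)
molecules = rng.integers(0, 30, size=(10_000, 42))  # stand-in for the 25M set
probes = rng.integers(0, 30, size=(1_000, 42))      # the 1,000 probe molecules

# Manhattan (city-block) distance between every molecule and every probe.
distances = cdist(molecules, probes, metric="cityblock")
nearest_probe = distances.argmin(axis=1)  # each molecule's closest probe
print(distances.shape, nearest_probe[:5])
```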
Variance of the number of compounds sampled
Inserting these terms, we obtain the expression for the variance of the total number of compounds sampled. Since the desired metric is the coverage as a fraction of the total set, we divide $E[C]$ and $\mathrm{Var}(C)$ by $K$ and $K^{2}$, respectively, where $K$ is the size of the target set (here, GDB-13).
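For concreteness, a sketch of the relevant expressions under the uniform assumption; the notation ($n$ samples drawn with replacement from $K$ equiprobable molecules, $C$ the number of distinct molecules obtained) is ours, and these are standard occupancy-problem results rather than formulas copied from the paper:

```latex
\begin{aligned}
E[C] &= K\,(1 - q), & q &= \left(1 - \tfrac{1}{K}\right)^{n},\\
\mathrm{Var}(C) &= K\,q\,(1 - q) + K(K-1)\left(r - q^{2}\right), & r &= \left(1 - \tfrac{2}{K}\right)^{n},\\
E[\mathrm{coverage}] &= \frac{E[C]}{K}, & \mathrm{Var}(\mathrm{coverage}) &= \frac{\mathrm{Var}(C)}{K^{2}}.
\end{aligned}
```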

Proof that the variance is maximal when the model is uniform
Since $C \geq 1$ (at least one unique compound must be generated for any sample of size $n > 0$), […]

In the worst case, a sampled coverage $\hat{c}_1$ would have been drawn from the upper bound of its confidence interval, and $\hat{c}_2$ from the lower bound of another. To ensure that, even in this worst case, the confidence intervals do not overlap, the difference between the sampled coverages must satisfy

$$\hat{c}_1 - \hat{c}_2 \geq 2 z_{\alpha} \sigma_1 + 2 z_{\alpha} \sigma_2 \quad \text{(Equation 8)}$$

However, the true values of $\sigma_1$ and $\sigma_2$ are unknown, so we use the upper estimate $\sigma^{*}$ obtained from the uniform model. Thus, if the difference in coverage between two models fulfills

$$\hat{c}_1 - \hat{c}_2 \geq 4 z_{\alpha} \sigma^{*} \quad \text{(Equation 9)}$$

the difference is statistically significant at confidence level $1 - \alpha$.
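A small sketch of this test; the function names, the use of `scipy.stats.norm` for $z_\alpha$, and the example numbers are our choices, with $\sigma^{*}$ taken from the uniform-model variance above:

```python
import math
from scipy.stats import norm

def uniform_sigma(n, k):
    """Std. dev. of the coverage under a uniform model: n samples, set of size k."""
    q = (1 - 1 / k) ** n
    r = (1 - 2 / k) ** n
    var_c = k * q * (1 - q) + k * (k - 1) * (r - q ** 2)
    return math.sqrt(max(var_c, 0.0)) / k  # divide by k for the coverage scale

def significantly_different(c1, c2, n, k, alpha=0.05):
    """Equation 9: the coverage difference must exceed 4 * z_alpha * sigma*."""
    z_alpha = norm.ppf(1 - alpha)
    sigma_star = uniform_sigma(n, k)
    return abs(c1 - c2) >= 4 * z_alpha * sigma_star

# Toy example (numbers are not the paper's): 2 billion samples from GDB-13.
print(significantly_different(0.87, 0.64, n=2_000_000_000, k=975_820_227))
```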