Randomized SMILES strings improve the quality of molecular generative models

Table 3 Best models trained on subsets of GDB-13 after the hyperparameter optimization

Set	SMILES	Time	% GDB-13	Valid	Unif	Comp	Closed	UCC
1M	Canonical	4:08	72.8	0.994	0.879	0.836	0.861	0.633
	Rand. unr.	31:47	80.9	0.995	0.970	0.929	0.876	0.790
	Rand. unr. no DA	1:37	77.0	0.987	0.957	0.795	0.883	0.672
	Rand. rest.	7:19	83.0	0.999	0.977	0.953	0.925	0.860
	Rand. rest. no DA	1:21	78.2	0.992	0.957	0.829	0.898	0.712
	DS branch	1:33	72.1	0.987	0.881	0.828	0.834	0.608
	DS rings	1:11	68.6	0.979	0.852	0.788	0.798	0.535
	DS both	1:05	68.4	0.979	0.851	0.785	0.796	0.532
10K	Canonical	0:04	38.8	0.905	0.666	0.445	0.426	0.126
	Rand. rest.	0:36	62.3	0.974	0.882	0.715	0.598	0.377
1K	Canonical	0:01	14.5	0.504	0.611	0.167	0.133	0.014
	Rand. rest.	0:04	34.1	0.812	0.790	0.392	0.276	0.085

See the “Methods” section for a description of the ratios
The best results for each training set size are indicated in italics
Set: benchmark training set size; SMILES: SMILES variant, including randomized variants with and without data augmentation (DA); Time: training time in hh:mm; % GDB-13: percentage of unique molecules from GDB-13 generated in a sample of 2 billion drawn with replacement; Valid: fraction of valid SMILES; Unif: uniformity ratio; Comp: completeness ratio; Closed: closedness ratio; UCC: UCC ratio
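
The legend defers the definition of the ratios to the “Methods” section. As a quick consistency check on the table, the following sketch assumes that the UCC ratio is the product of the uniformity, completeness and closedness ratios. That definition is an assumption here, not something the table states, and the names `ROWS`, `ucc` and `max_abs_deviation` are illustrative only. The values are transcribed from the table above.

```python
# Rows transcribed from Table 3:
# (set, SMILES variant, Unif, Comp, Closed, reported UCC)
ROWS = [
    ("1M", "Canonical", 0.879, 0.836, 0.861, 0.633),
    ("1M", "Rand. unr.", 0.970, 0.929, 0.876, 0.790),
    ("1M", "Rand. unr. no DA", 0.957, 0.795, 0.883, 0.672),
    ("1M", "Rand. rest.", 0.977, 0.953, 0.925, 0.860),
    ("1M", "Rand. rest. no DA", 0.957, 0.829, 0.898, 0.712),
    ("1M", "DS branch", 0.881, 0.828, 0.834, 0.608),
    ("1M", "DS rings", 0.852, 0.788, 0.798, 0.535),
    ("1M", "DS both", 0.851, 0.785, 0.796, 0.532),
    ("10K", "Canonical", 0.666, 0.445, 0.426, 0.126),
    ("10K", "Rand. rest.", 0.882, 0.715, 0.598, 0.377),
    ("1K", "Canonical", 0.611, 0.167, 0.133, 0.014),
    ("1K", "Rand. rest.", 0.790, 0.392, 0.276, 0.085),
]


def ucc(unif, comp, closed):
    """Assumed definition: UCC as the product of the three ratios."""
    return unif * comp * closed


def max_abs_deviation(rows):
    """Largest gap between the recomputed and the reported UCC."""
    return max(abs(ucc(u, c, cl) - rep) for _, _, u, c, cl, rep in rows)


# Print recomputed and reported values side by side for each row.
for size, variant, u, c, cl, rep in ROWS:
    print(f"{size:>3} {variant:<18} recomputed={ucc(u, c, cl):.3f} reported={rep:.3f}")
print(f"max deviation: {max_abs_deviation(ROWS):.4f}")
```

Because the three input ratios are themselves rounded to three decimals, small differences in the last digit between the recomputed and reported values are expected under this assumption.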

ISSN: 1758-2946