Table 6 Impact of the training set size on GENs performance

From: GEN: highly efficient SMILES explorer using autodidactic generative examination networks

| Dataset | Evaluated size | Augmented size (real factor)ᵃ | Best model epoch # | Validity (%) | Uniqueness (%) | Training (%) | Length match (%)ᵇ | HAC match (%)ᶜ |
|---|---|---|---|---|---|---|---|---|
| PubChem225k | 9k | 54,624 (4.8) | 10, 10, 10 | 81.3 ± 0.9 | 100.0 ± 0.0 | 0.3 ± 0.1 | 97.7 ± 0.0 | 90.5 ± 0.0 |
| PubChem225k | 45k | 218,124 (4.8) | 5, 5, 5 | 95.6 ± 0.7 | 99.9 ± 0.1 | 2.6 ± 0.5 | 99.0 ± 0.0 | 94.7 ± 0.0 |
| PubChem225k | 225k | 1,088,864 (4.8) | 4, 4, 4 | 98.3 ± 0.3 | 99.9 ± 0.0 | 11.2 ± 0.5 | 97.3 ± 0.7 | 96.6 ± 0.3 |
| Chembl24 | 9k | 35,928 (4.0) | 44, 43, 45 | 74.2 ± 1.9 | 99.0 ± 0.2 | 0.2 ± 0.2 | 81.9 ± 5.4 | 95.9 ± 1.0 |
| Chembl24 | 45k | 179,888 (4.0) | 5, 6, 5 | 91.9 ± 1.9 | 100.0 ± 0.0 | 0.2 ± 0.1 | 90.6 ± 2.8 | 97.6 ± 1.4 |
| Chembl24 | 225k | 896,214 (4.0) | 9, 6, 6 | 94.6 ± 0.1 | 100.0 ± 0.0 | 1.4 ± 0.3 | 88.4 ± 1.6 | 98.1 ± 0.6 |
| Zinc15 | 9k | 32,546 (3.6) | 24, 21, 21 | 77.2 ± 1.0 | 100.0 ± 0.0 | 0.0 ± 0.0 | 82.2 ± 3.3 | 91.2 ± 1.1 |
| Zinc15 | 45k | 163,929 (3.6) | 10, 7, 11 | 90.4 ± 1.1 | 100.0 ± 0.0 | 0.1 ± 0.1 | 87.6 ± 1.2 | 92.6 ± 1.1 |
| Zinc15 | 225k | 820,747 (3.6) | 4, 6, 6 | 95.2 ± 0.3 | 100.0 ± 0.0 | 0.3 ± 0.1 | 90.4 ± 1.2 | 93.5 ± 1.2 |
  a. Size of the augmented dataset after 5 random attempts per SMILES and de-duplication to unique SMILES. The real augmentation factor varies depending on the dataset.
  b. Length match between the SMILES length distributions of the training set and the generated set (see “Methods”).
  c. HAC match between the atom count distributions of the generated set and the training set (see “Methods”).
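
The augmentation footnote (5 random attempts per SMILES, then de-duplication) can be illustrated with a minimal sketch. This is not the authors' code: the function name `augment`, the `randomize` callable, and the toy randomizer are hypothetical, and the sketch assumes that the augmented set contains only the generated strings and that the real factor is the number of unique strings divided by the number of input SMILES. In practice, `randomize` would wrap a cheminformatics toolkit that can emit a randomized SMILES for a molecule.

```python
import random
from typing import Callable, Sequence, Set, Tuple


def augment(
    smiles_list: Sequence[str],
    randomize: Callable[[str], str],
    attempts: int = 5,
) -> Tuple[Set[str], float]:
    """Return (unique augmented strings, real augmentation factor).

    Assumption: the factor is len(unique strings) / len(input), so it can
    never exceed `attempts` and drops whenever attempts collide.
    """
    augmented: Set[str] = set()
    for smi in smiles_list:
        for _ in range(attempts):
            # The set silently discards duplicate strings (de-duplication).
            augmented.add(randomize(smi))
    factor = len(augmented) / len(smiles_list) if smiles_list else 0.0
    return augmented, factor


def toy_randomize(smiles: str) -> str:
    """Hypothetical stand-in: returns a random rotation of the string.

    This is NOT chemically meaningful SMILES randomization; it only
    mimics the fact that short inputs have few distinct variants.
    """
    k = random.randrange(len(smiles))
    return smiles[k:] + smiles[:k]


if __name__ == "__main__":
    unique, factor = augment(["CCO", "C", "CCN"], toy_randomize)
    print(len(unique), round(factor, 2))
```

Because short or highly symmetric inputs admit fewer distinct variants, repeated attempts collide more often, which is one plausible reading of why the real factor in the table stays below 5 and differs between datasets.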