Table 6 Impact of the training set size on GENs performance

From: GEN: highly efficient SMILES explorer using autodidactic generative examination networks

| Dataset | Evaluated size | Augmented size (real factor) (a) | Best model epoch # | Validity % | Uniqueness % | Training % | Length match % (b) | HAC match % (c) |
|---|---|---|---|---|---|---|---|---|
| PubChem225k | 9k | 54,624 (4.8) | 10, 10, 10 | 81.3 ± 0.9 | 100.0 ± 0.0 | 0.3 ± 0.1 | 97.7 ± 0.0 | 90.5 ± 0.0 |
| PubChem225k | 45k | 218,124 (4.8) | 5, 5, 5 | 95.6 ± 0.7 | 99.9 ± 0.1 | 2.6 ± 0.5 | 99.0 ± 0.0 | 94.7 ± 0.0 |
| PubChem225k | 225k | 1,088,864 (4.8) | 4, 4, 4 | 98.3 ± 0.3 | 99.9 ± 0.0 | 11.2 ± 0.5 | 97.3 ± 0.7 | 96.6 ± 0.3 |
| Chembl24 | 9k | 35,928 (4.0) | 44, 43, 45 | 74.2 ± 1.9 | 99.0 ± 0.2 | 0.2 ± 0.2 | 81.9 ± 5.4 | 95.9 ± 1.0 |
| Chembl24 | 45k | 179,888 (4.0) | 5, 6, 5 | 91.9 ± 1.9 | 100.0 ± 0.0 | 0.2 ± 0.1 | 90.6 ± 2.8 | 97.6 ± 1.4 |
| Chembl24 | 225k | 896,214 (4.0) | 9, 6, 6 | 94.6 ± 0.1 | 100.0 ± 0.0 | 1.4 ± 0.3 | 88.4 ± 1.6 | 98.1 ± 0.6 |
| Zinc15 | 9k | 32,546 (3.6) | 24, 21, 21 | 77.2 ± 1.0 | 100.0 ± 0.0 | 0.0 ± 0.0 | 82.2 ± 3.3 | 91.2 ± 1.1 |
| Zinc15 | 45k | 163,929 (3.6) | 10, 7, 11 | 90.4 ± 1.1 | 100.0 ± 0.0 | 0.1 ± 0.1 | 87.6 ± 1.2 | 92.6 ± 1.1 |
| Zinc15 | 225k | 820,747 (3.6) | 4, 6, 6 | 95.2 ± 0.3 | 100.0 ± 0.0 | 0.3 ± 0.1 | 90.4 ± 1.2 | 93.5 ± 1.2 |

a. Size of the augmented dataset after 5 random attempts per SMILES and de-duplication to unique SMILES. The real augmentation factor, given in parentheses, varies with the dataset.
b. Length match between the SMILES length distributions of the training set and the generated set (see “Methods”).
c. HAC (heavy atom count) match between the atom count distributions of the generated set and the training set (see “Methods”).
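
The augmentation described in the first footnote (5 random attempts per SMILES, then de-duplication) can be illustrated with a minimal sketch. It assumes the real augmentation factor is the number of unique augmented SMILES divided by the number of original SMILES, which the table does not state explicitly. The names `augment_smiles`, `real_augmentation_factor`, `randomize`, and `EQUIVALENT_SMILES` are hypothetical, and the lookup-table randomizer is only a stand-in for a cheminformatics toolkit's random-SMILES writer.

```python
import random


def augment_smiles(smiles_list, randomize, attempts=5):
    """Collect the unique strings from `attempts` randomizations of each input SMILES."""
    augmented = set()  # a set drops repeated outputs automatically
    for smiles in smiles_list:
        for _ in range(attempts):
            augmented.add(randomize(smiles))
    return augmented


def real_augmentation_factor(original_count, augmented_count):
    """Assumed definition: unique augmented SMILES per original SMILES."""
    return augmented_count / original_count


# Hypothetical stand-in randomizer: a lookup of hand-written equivalent SMILES.
# A real pipeline would call a toolkit's random-SMILES writer instead.
EQUIVALENT_SMILES = {
    "C": ["C"],                      # methane has only one way to be written
    "CCO": ["CCO", "OCC", "C(O)C"],  # three equivalent strings for ethanol
}
rng = random.Random(0)


def randomize(smiles):
    """Return one randomly chosen equivalent SMILES for a known molecule."""
    return rng.choice(EQUIVALENT_SMILES[smiles])


training = ["C", "CCO"]
augmented = augment_smiles(training, randomize, attempts=5)
factor = real_augmentation_factor(len(training), len(augmented))
print(len(augmented), round(factor, 2))
```

In this sketch, repeated attempts that yield the same string collapse into one entry, so the factor cannot exceed the number of attempts and is lower for molecules with few distinct SMILES representations. That is one plausible reason the reported factors (3.6 to 4.8) fall below 5 and differ between datasets.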