Table 2 Performance of atom-wise and atom-in-SMILES tokenization schemes tested on various restricted GDB-13 test sets [33]

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

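Prediction accuracy (%)

| GDB-13 subsets [33] (cumulative) | Atom-wise x10 | Atom-wise x30 | Atom-wise x50 | Atom-in-SMILES x10 | Atom-in-SMILES x30 | Atom-in-SMILES x50 |
|---|---|---|---|---|---|---|
| ab | 34.2 | 34.3 | 33.2 | 37.3 | 35.9 | 34.1 |
| abc | 31.0 | 30.8 | 29.6 | 33.7 | 32.1 | 30.4 |
| abcd | 30.8 | 30.4 | 29.2 | 34.3 | 32.3 | 30.5 |
| abcde | 48.7 | 47.6 | 45.5 | 53.6 | 50.0 | 47.0 |
| abcdef | 41.8 | 40.6 | 39.1 | 52.5 | 49.6 | 46.9 |
| abcdefg | 50.9 | 50.9 | 50.0 | 59.9 | 58.6 | 56.8 |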

  1. Training was conducted with one million molecules randomly sampled from GDB-13, combined with a randomly sampled 150K subset of the strictest cumulative abcdefgh data, which we augmented at three levels (\(\times\)10, \(\times\)30, and \(\times\)50).
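
To make the comparison in Table 2 easier to read, the difference between the two tokenization schemes can be computed per subset and augmentation level. The following is a minimal sketch, not part of the original article: the accuracies are transcribed from the table, and the names `TABLE_2`, `LEVELS`, and `accuracy_gain` are hypothetical, chosen only for illustration.

```python
# Accuracies (%) transcribed from Table 2.
# Each entry maps a cumulative GDB-13 subset to a pair of tuples:
# (atom-wise accuracies, atom-in-SMILES accuracies), ordered by
# augmentation level x10, x30, x50.
LEVELS = ("x10", "x30", "x50")

TABLE_2 = {
    "ab":      ((34.2, 34.3, 33.2), (37.3, 35.9, 34.1)),
    "abc":     ((31.0, 30.8, 29.6), (33.7, 32.1, 30.4)),
    "abcd":    ((30.8, 30.4, 29.2), (34.3, 32.3, 30.5)),
    "abcde":   ((48.7, 47.6, 45.5), (53.6, 50.0, 47.0)),
    "abcdef":  ((41.8, 40.6, 39.1), (52.5, 49.6, 46.9)),
    "abcdefg": ((50.9, 50.9, 50.0), (59.9, 58.6, 56.8)),
}


def accuracy_gain(table):
    """Return atom-in-SMILES minus atom-wise accuracy, in percentage points."""
    gains = {}
    for subset, (atom_wise, atom_in_smiles) in table.items():
        # Round to one decimal to match the precision reported in the table.
        gains[subset] = tuple(
            round(ais - aw, 1) for aw, ais in zip(atom_wise, atom_in_smiles)
        )
    return gains


if __name__ == "__main__":
    for subset, row in accuracy_gain(TABLE_2).items():
        cells = ", ".join(f"{lvl}: +{g:.1f}" for lvl, g in zip(LEVELS, row))
        print(f"{subset:<8} {cells}")
```

With the values as transcribed, the gain is positive in every cell, is largest for the abcdef subset at x10 (10.7 percentage points), and shrinks within each row as the augmentation level increases.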