Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Table 2 Performance of atom-wise and atom-in-SMILES tokenization schemes tested on various restricted GDB-13 test sets [33]

Training is conducted on one million molecules randomly sampled from GDB-13, combined with a randomly sampled 150K subset of the strictest cumulative abcdefgh data, which we augmented at different levels (\(\times\)10, \(\times\)30, and \(\times\)50).
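
The composition of this training set can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `build_training_set`, its parameters, and the caller-supplied `augment` routine are hypothetical, and the sketch assumes that an augmentation level of \(\times\)N means N SMILES strings are produced per molecule of the restricted subset.

```python
import random
from typing import Callable, List, Sequence


def build_training_set(
    gdb13: Sequence[str],
    restricted: Sequence[str],
    augment: Callable[[str, int], List[str]],
    n_base: int = 1_000_000,
    n_restricted: int = 150_000,
    factor: int = 10,
    seed: int = 0,
) -> List[str]:
    """Assemble a training set from two SMILES pools (illustrative only).

    gdb13      -- SMILES strings from the full GDB-13 pool
    restricted -- SMILES strings satisfying the strictest cumulative constraints
    augment    -- hypothetical callable returning `factor` SMILES variants
                  for one molecule (for example, randomized SMILES)
    factor     -- augmentation level; assumed to mean variants per molecule
    """
    rng = random.Random(seed)

    # Random sample from the unrestricted pool (capped at the pool size).
    base = rng.sample(list(gdb13), min(n_base, len(gdb13)))

    # Random sample from the restricted pool, then augment each molecule.
    subset = rng.sample(list(restricted), min(n_restricted, len(restricted)))
    augmented: List[str] = []
    for smiles in subset:
        augmented.extend(augment(smiles, factor))

    return base + augmented
```

Under these assumptions, the three augmentation levels correspond to calling the function with `factor=10`, `factor=30`, and `factor=50` while keeping the two sampled pools fixed through the same `seed`.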

ISSN: 1758-2946