Fig. 4 | Journal of Cheminformatics

Fig. 4

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization


Comparison of expressiveness and normalized repetition rates across various molecular representations. The distributions show the distinct characteristics of the tokenization schemes on representative datasets, each designed to test a different facet of molecular structure: coordination compounds and ligands (metal complexes), ring structures and functional groups (steroids), long-chain formations (phospholipids, ionizable lipids), complex and diverse structures (natural products), small organic molecules (drugs), and configurational changes in molecular structure (octane isomers). Each dataset contains 100 members, with the exception of steroids (59 members) and octane isomers (18 members). The mean values of normalized repetitions and the deviations from the mean are shown as horizontal and dashed vertical lines, respectively, accompanying the distributions.
