Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Table 5. Performance analysis of tokenization schemes for molecular property prediction using the MoleculeNet benchmark suite

Dataset	SMILES	DeepSMILES	SELFIES	SmilesPE	AIS
Regression Datasets: RMSE (lower is better)
ESOL	0.628	0.631	0.675	0.689	0.553
FreeSolv	0.545	0.544	0.564	0.761	0.441
Lip	0.924	0.895	0.938	0.800	0.683
Classification Datasets: ROC-AUC (higher is better)
BBBP	0.758	0.777	0.799	0.847	0.885
BACE	0.740	0.774	0.746	0.837	0.835
HIV	0.649	0.648	0.653	0.739	0.729

Random Forest regression and classification models are compared using 5-fold cross-validation. The best-performing scheme is AIS on ESOL, FreeSolv, Lip, and BBBP, and SmilesPE on BACE and HIV.
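
The best-performing scheme for each dataset can be checked directly from the tabulated values. The following is a minimal sketch, not part of the original study: the numbers are transcribed from Table 5, and the names `SCHEMES`, `RMSE`, `ROC_AUC`, and `best_scheme` are hypothetical helpers introduced here for illustration. It assumes that the lowest RMSE is best for regression and the highest ROC-AUC is best for classification.

```python
# Values transcribed from Table 5; list order follows the table's column order.
SCHEMES = ["SMILES", "DeepSMILES", "SELFIES", "SmilesPE", "AIS"]

RMSE = {  # regression datasets: lower is better
    "ESOL": [0.628, 0.631, 0.675, 0.689, 0.553],
    "FreeSolv": [0.545, 0.544, 0.564, 0.761, 0.441],
    "Lip": [0.924, 0.895, 0.938, 0.800, 0.683],
}
ROC_AUC = {  # classification datasets: higher is better
    "BBBP": [0.758, 0.777, 0.799, 0.847, 0.885],
    "BACE": [0.740, 0.774, 0.746, 0.837, 0.835],
    "HIV": [0.649, 0.648, 0.653, 0.739, 0.729],
}


def best_scheme(scores, higher_is_better):
    """Return (scheme name, score) for the best entry in one table row."""
    pick = max if higher_is_better else min
    value = pick(scores)
    return SCHEMES[scores.index(value)], value


# Collect the winner for every dataset, regression rows first.
best = {name: best_scheme(row, False) for name, row in RMSE.items()}
best.update({name: best_scheme(row, True) for name, row in ROC_AUC.items()})

for dataset, (scheme, value) in best.items():
    print(f"{dataset}: {scheme} ({value:.3f})")
```

On the classification rows the margin between SmilesPE and AIS is small (0.002 on BACE and 0.010 on HIV), whereas AIS leads on all three regression datasets.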

ISSN: 1758-2946