
Table 5 Performance analysis of tokenization schemes for molecular property prediction using MoleculeNet benchmark suite

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

 

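| Dataset | SMILES | DeepSMILES | SELFIES | SmilesPE | AIS |
|---|---|---|---|---|---|
| Regression Datasets: RMSE | | | | | |
| ESOL | 0.628 | 0.631 | 0.675 | 0.689 | **0.553** |
| FreeSolv | 0.545 | 0.544 | 0.564 | 0.761 | **0.441** |
| Lip | 0.924 | 0.895 | 0.938 | 0.800 | **0.683** |
| Classification Datasets: ROC-AUC | | | | | |
| BBBP | 0.758 | 0.777 | 0.799 | 0.847 | **0.885** |
| BACE | 0.740 | 0.774 | 0.746 | **0.837** | 0.835 |
| HIV | 0.649 | 0.648 | 0.653 | **0.739** | 0.729 |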

  1. Comparison of Random Forest regression and classification models with 5-fold cross-validation. Bold emphasis denotes the best-performing approach for each dataset (lowest RMSE for regression, highest ROC-AUC for classification).
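
Because the two halves of the table use metrics that point in opposite directions (RMSE is better when lower, ROC-AUC when higher), it can help to recompute the best entry per dataset directly from the reported numbers. The following is a minimal sketch using only the values transcribed from Table 5; the names `TABLE`, `best_tokenizer`, and `best_per_dataset` are hypothetical helpers introduced for illustration and are not part of the original article or its code.

```python
# Values transcribed from Table 5; column order follows the table header.
TOKENIZERS = ["SMILES", "DeepSMILES", "SELFIES", "SmilesPE", "AIS"]

# Hypothetical layout: metric -> (lower_is_better, {dataset: scores per tokenizer})
TABLE = {
    "RMSE": (True, {
        "ESOL": [0.628, 0.631, 0.675, 0.689, 0.553],
        "FreeSolv": [0.545, 0.544, 0.564, 0.761, 0.441],
        "Lip": [0.924, 0.895, 0.938, 0.800, 0.683],
    }),
    "ROC-AUC": (False, {
        "BBBP": [0.758, 0.777, 0.799, 0.847, 0.885],
        "BACE": [0.740, 0.774, 0.746, 0.837, 0.835],
        "HIV": [0.649, 0.648, 0.653, 0.739, 0.729],
    }),
}


def best_tokenizer(scores, lower_is_better):
    """Return (tokenizer name, score) of the best entry in one table row."""
    pick = min if lower_is_better else max
    idx = pick(range(len(scores)), key=scores.__getitem__)
    return TOKENIZERS[idx], scores[idx]


def best_per_dataset(table=TABLE):
    """Map each dataset to its best tokenizer, respecting the metric direction."""
    result = {}
    for _metric, (lower_is_better, rows) in table.items():
        for dataset, scores in rows.items():
            result[dataset] = best_tokenizer(scores, lower_is_better)
    return result


if __name__ == "__main__":
    for dataset, (name, score) in best_per_dataset().items():
        print(f"{dataset:9s} {name:9s} {score:.3f}")
```

With the values above, AIS comes out best on ESOL, FreeSolv, Lip, and BBBP, while SmilesPE is best on BACE and HIV.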