
Table 4 Performance (top-1 accuracy) of various tokenization schemes on the single-step retrosynthesis task, and the number of predictions with token repetition

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

| Tokenization schemes | \(\text{rep-l}\vert_{P} - \text{rep-l}\vert_{GT} \ge 2\) | Acc. (%) greedy, String exact | Acc. (%) greedy, Tc exact |
|---|---|---|---|
| Atom-wise baseline [57] | – | 42.00 | – |
| Atom-wise (ref. [57] is reproduced) | 801 | 42.05 | 44.72 |
| SmilesPE (ref. [21]) | 821 | 19.82 | 22.74 |
| SELFIES (ref. [17]) | 886 | 28.82 | 30.76 |
| DeepSMILES (ref. [16]) | 902 | 38.63 | 41.20 |
| Atom-in-SMILES | 727 | 46.32 | 47.62 |
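
The repetition column counts predictions whose rep-l value exceeds that of the ground truth (GT) by at least 2. As a rough sketch of how such a count could be computed, the Python below assumes that rep-l is the number of tokens that reappear within a fixed window of preceding tokens. That definition, the window size of 4, and the function names `rep_l` and `count_repetitive_predictions` are illustrative assumptions and are not taken from this table.

```python
from typing import Sequence


def rep_l(tokens: Sequence[str], window: int = 4) -> int:
    """Count tokens that also occur among the `window` tokens before them.

    Assumed definition of rep-l, used here for illustration only.
    """
    return sum(
        1
        for t, tok in enumerate(tokens)
        if tok in tokens[max(0, t - window):t]
    )


def count_repetitive_predictions(
    predictions: Sequence[Sequence[str]],
    ground_truths: Sequence[Sequence[str]],
    margin: int = 2,
    window: int = 4,
) -> int:
    """Count pairs where rep-l(prediction) - rep-l(ground truth) >= margin."""
    return sum(
        1
        for pred, gt in zip(predictions, ground_truths)
        if rep_l(pred, window) - rep_l(gt, window) >= margin
    )


# Toy token sequences (hypothetical, not real model output).
preds = [["C", "C", "C", "O"], ["C", "O"]]
gts = [["C", "O", "N"], ["C", "O"]]
n_repetitive = count_repetitive_predictions(preds, gts)
```

In this toy case only the first prediction repeats tokens more often than its ground truth by the required margin, so it is the only one counted.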