Table 1 Comparison of the important aspects of natural and chemical languages within the NLP framework

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

| Aspects | Natural language | SMILES language |
| --- | --- | --- |
| Sequence length | 15-20 words | \(\sim\) 3 times higher |
| Token space | >100K | \(\sim\) 1000 times smaller |
| Token order | Tone, meaning, fluency | \({}_n C_{2}\) alternatives* |
| Meaning-wise | isolation \(\equiv\) context | isolation \(\equiv\) context |

*Practically fewer, owing to the rules of chemistry.
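
To make the orders of magnitude in the table concrete, the following minimal sketch turns its ratios into rough numbers. It is an illustration only, not a calculation from the paper: the function name `smiles_estimates` and its parameters are hypothetical, the factors 3 and 1000 are the approximate ratios quoted in the table, >100K is taken as exactly 100,000, and \(n\) in \({}_n C_{2}\) is assumed to be the number of tokens in the sequence.

```python
from math import comb


def smiles_estimates(words_per_sentence, nl_vocab=100_000,
                     length_factor=3, vocab_factor=1000):
    """Rough SMILES-side figures implied by the table's approximate ratios.

    All defaults are assumptions read off the table, not measured values.
    """
    # Sequence length: roughly 3 times the natural-language sentence length.
    tokens = words_per_sentence * length_factor
    # Token space: roughly 1000 times smaller than the natural-language one.
    vocab = nl_vocab // vocab_factor
    # Token order: n-choose-2 alternatives, assuming n is the token count.
    # The table's footnote says the practical number is lower than this.
    pairs = comb(tokens, 2)
    return {"tokens": tokens, "vocab": vocab, "pairs": pairs}


for words in (15, 20):
    print(words, smiles_estimates(words))
```

Under these assumptions, a 15-20 word sentence corresponds to a SMILES string of about 45-60 tokens drawn from a token space of about 100, with an upper bound of 990 to 1770 pairwise alternatives before the rules of chemistry are applied.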