From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
Aspects | Natural language | SMILES language |
---|---|---|
Sequence length | 15-20 words | \(\sim\)3 times longer |
Token space | >100K | \(\sim\) 1000 times smaller |
Token order | Tone, meaning, fluency | \({}_n C_{2}\) alternatives* |
Meaning-wise | isolation \(\equiv\) context | isolation \(\equiv\) context |
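
To make the sequence-length and token-order rows concrete, here is a minimal sketch that splits a SMILES string into atom-wise tokens and counts the unordered token pairs, \(n(n-1)/2\). The regular expression is a simplified pattern assumed for illustration, not the tokenizer used in the paper, and reading \({}_n C_{2}\) as "number of token pairs" is likewise an assumption, since the footnote behind the asterisk is not reproduced here. The names `SMILES_TOKEN` and `tokenize` are hypothetical.

```python
import re
from math import comb

# Simplified atom-wise SMILES pattern (an assumption for illustration only):
# bracket atoms, two-letter halogens, %NN ring closures, single-letter atoms,
# ring-closure digits, then bond/branch/other symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@$*]"
)


def tokenize(smiles):
    """Split a SMILES string into a list of atom-wise tokens."""
    return SMILES_TOKEN.findall(smiles)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)

n = len(tokens)        # sequence length in tokens
vocab = set(tokens)    # distinct symbols used by this one molecule
pairs = comb(n, 2)     # unordered token pairs, n * (n - 1) / 2

print(tokens)
print(f"tokens: {n}, distinct symbols: {len(vocab)}, token pairs: {pairs}")
```

For aspirin this gives 21 tokens and 210 token pairs, which is in line with the table's point that a single small molecule already produces a sequence about as long as a typical sentence, drawn from a far smaller symbol set.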