Feature description | Note/Regular expression |
---|---|
Roman numeral | [ivxdlcm]+\|[IVXDLCM]+ |
Punctuation | [,\.;:?!] |
Start with dash | -.* |
Nucleotide sequence | [atgcu]+ |
Number | [0-9]+ |
Capitalized | [A-Z][a-z]* |
Quote | [\"`'] |
The lemma for the current token | Provided by BioLemmatizer [23] |
2, 3 and 4-character prefixes and suffixes | |
2 and 3 character n-grams | Token start or end indicators are included |
2 and 3 word n-grams | |
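
As an illustration of how the token-level features in the table might be computed, the following is a minimal Python sketch and not the authors' implementation. The patterns are transcribed from the table with whitespace and escaping normalized. Whole-token matching, the `^` and `$` start/end indicators, and the names `PATTERNS`, `char_ngrams` and `token_features` are assumptions made for this example. The lemma (BioLemmatizer) and word n-gram features are left out.

```python
import re

# Patterns transcribed from the table. Matching the whole token
# (fullmatch) is an assumption; the table does not state the semantics.
PATTERNS = {
    "roman": re.compile(r"[ivxdlcm]+|[IVXDLCM]+"),
    "punctuation": re.compile(r"[,\.;:?!]"),
    "dash_start": re.compile(r"-.*"),
    "nucleotide": re.compile(r"[atgcu]+"),
    "number": re.compile(r"[0-9]+"),
    "capitalized": re.compile(r"[A-Z][a-z]*"),
    "quote": re.compile(r"[\"`']"),
}


def char_ngrams(token, n):
    """Character n-grams with hypothetical start (^) and end ($) indicators."""
    padded = "^" + token + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def token_features(token):
    """Build a feature dictionary for a single token."""
    # One boolean feature per regular expression in the table.
    feats = {name: bool(p.fullmatch(token)) for name, p in PATTERNS.items()}
    # 2-, 3- and 4-character prefixes and suffixes, when the token is long enough.
    for k in (2, 3, 4):
        if len(token) >= k:
            feats[f"prefix{k}"] = token[:k]
            feats[f"suffix{k}"] = token[-k:]
    # 2- and 3-character n-grams, including the boundary indicators.
    feats["char_ngrams"] = char_ngrams(token, 2) + char_ngrams(token, 3)
    return feats
```

For example, `token_features("XIV")` sets `roman` to true and `capitalized` to false, because the capitalized pattern allows only lower-case letters after the first character.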