|Name of feature||Description|
|Length||Classifies tokens by length. If the length is less than 5, the token is Short; if it is between 5 and 15, the token is Medium; otherwise, the token is Large.|
|Word class||Automatic generation of features in terms of frequency of upper and lower case characters, digits and other types of characters.|
|Autom. Prefixes/Suffixes||Automatic generation of suffixes and prefixes (lengths 2, 3 and 4).|
|List||Automatic generation of a feature for every token that matches an element within the list. We used lists of basic name segments (~3300) and stop words (~550).|
|Dictionaries||Dictionary matching for the trivial, family and abbreviation name classes (~6400, ~1300 and ~1400 elements, respectively).|
|||Regular expressions that identify specific features, such as "contains dashes?", "is all caps?", or "contains numbers?".|
|||Regular expressions that identify specific types of characters that are more common in chemical entities than in other words, such as Greek letters, Roman numerals, etc.|
|||Regular expressions that match specific morphological features of chemical formulas, identifiers, and systematic features of chemical names.|
|||Regular expressions used in the post-processing step to filter out common names that the systems systematically tag incorrectly.|
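
The token-level features in the table above can be illustrated with a minimal Python sketch. It is not the original implementation: the function names, feature keys, and regular expressions are hypothetical, and treating a length of exactly 15 as Medium is an assumption, since the table does not say whether the bounds are inclusive.

```python
import re

# Hypothetical orthographic patterns, modelled on the examples
# "contains dashes?", "is all caps?" and "contains numbers?".
ORTHO_PATTERNS = {
    "has_dash": re.compile(r"-"),
    "all_caps": re.compile(r"^[A-Z]+$"),
    "has_digit": re.compile(r"\d"),
}


def length_class(token: str) -> str:
    """Map a token to Short / Medium / Large by its length."""
    n = len(token)
    if n < 5:
        return "Short"
    if n <= 15:  # assumption: the upper bound of Medium is inclusive
        return "Medium"
    return "Large"


def char_class_counts(token: str) -> dict:
    """Count upper-case, lower-case, digit and other characters."""
    counts = {"upper": 0, "lower": 0, "digit": 0, "other": 0}
    for ch in token:
        if ch.isupper():
            counts["upper"] += 1
        elif ch.islower():
            counts["lower"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1
    return counts


def affixes(token: str, lengths=(2, 3, 4)) -> dict:
    """Collect prefixes and suffixes of the given lengths."""
    feats = {}
    for n in lengths:
        if len(token) >= n:  # skip affixes longer than the token
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


def token_features(token: str) -> dict:
    """Combine all sketched features into one dictionary."""
    feats = {"length": length_class(token)}
    feats.update(char_class_counts(token))
    feats.update(affixes(token))
    for name, pattern in ORTHO_PATTERNS.items():
        feats[name] = bool(pattern.search(token))
    return feats
```

Calling `token_features("2-chloroethanol")` yields a single dictionary per token, which is the shape most sequence-labelling toolkits expect. List and dictionary matching could be added in the same way as set-membership tests, given the corresponding resources.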