
Table 1 The baseline features.

From: Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

Feature description                           Note/Regular expression
Roman numeral                                 [ivxdlcm]+|[IVXDLCM]+
Punctuation                                   [,\\.;:?!]
Starts with dash                              -.*
Nucleotide sequence                           [atgcu]+
Number                                        [0-9]+
Capitalized                                   [A-Z][a-z]*
Quote                                         [\"`']
Lemma of the current token                    Provided by BioLemmatizer [23]
2-, 3- and 4-character prefixes and suffixes
2- and 3-character n-grams                    Token start and end indicators are included
2- and 3-word n-grams
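
The orthographic and character-level features in Table 1 could be computed roughly as in the following Python sketch. It is an illustration only, not the authors' implementation. The function names, the feature labels, the `^`/`$` boundary markers, and the choice to match each pattern against the whole token are all assumptions. The doubled backslash and the escaped quote in the table are read here as string-literal escaping, so the punctuation class is written with a plain dot. The lemma feature (which requires BioLemmatizer) and the word n-grams are left out.

```python
import re

# Patterns transcribed from Table 1. Matching the whole token is an assumption;
# the table does not say whether the patterns are anchored.
ORTHOGRAPHIC_PATTERNS = {
    "roman_numeral": r"[ivxdlcm]+|[IVXDLCM]+",
    "punctuation": r"[,.;:?!]",  # a dot needs no escape inside a character class
    "starts_with_dash": r"-.*",
    "nucleotide_sequence": r"[atgcu]+",
    "number": r"[0-9]+",
    "capitalized": r"[A-Z][a-z]*",
    "quote": r"[\"`']",
}
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return the names of the Table 1 patterns that match the whole token."""
    return [name for name, rx in COMPILED.items() if rx.fullmatch(token)]


def affix_features(token, lengths=(2, 3, 4)):
    """Prefixes and suffixes of 2, 3 and 4 characters.

    Lengths longer than the token are skipped (an assumption of this sketch).
    """
    feats = []
    for n in lengths:
        if len(token) >= n:
            feats.append(f"prefix{n}={token[:n]}")
            feats.append(f"suffix{n}={token[-n:]}")
    return feats


def char_ngrams(token, sizes=(2, 3), start="^", end="$"):
    """Character n-grams over the token padded with start/end indicators.

    The marker characters are hypothetical; the table only states that
    token start or end indicators are included.
    """
    padded = f"{start}{token}{end}"
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for tok in ["IV", "Protein", "acgt", "-alpha", "42", ","]:
        print(tok, orthographic_features(tok))
    print(affix_features("kinase"))
    print(char_ngrams("p53"))
```

Note that the patterns overlap: a single-character token such as `c` matches both the Roman numeral and the nucleotide sequence patterns, so a token can carry several of these features at once.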