Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

Table 1 The baseline features.

Feature description	Note/Regular expression
Roman number	[ivxdlcm]+|[IVXDLCM]+
Punctuation	[,\.;:?!]
Start with dash	-.*
Nucleotide sequence	[atgcu]+
Number	[0-9]+
Capitalized	[A-Z][a-z]*
Quote	[\"`']
The lemma for the current token	Provided by BioLemmatizer [23]
2, 3 and 4-character prefixes and suffixes
2 and 3 character n-grams	Token start or end indicators are included
2 and 3 word n-grams
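
The surface features in Table 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature names, the function names (`regex_features`, `affixes`, `char_ngrams`), the choice to require a whole-token match, and the `^`/`$` boundary markers are all assumptions made for illustration. The lemma and word n-gram features are omitted because they depend on external tools and on sentence context.

```python
import re

# Patterns transcribed from Table 1. The dictionary keys are hypothetical labels.
# Inside a character class the dot needs no escape, so "[,.;:?!]" is used here.
PATTERNS = {
    "roman_number": re.compile(r"[ivxdlcm]+|[IVXDLCM]+"),
    "punctuation": re.compile(r"[,.;:?!]"),
    "starts_with_dash": re.compile(r"-.*"),
    "nucleotide_sequence": re.compile(r"[atgcu]+"),
    "number": re.compile(r"[0-9]+"),
    "capitalized": re.compile(r"[A-Z][a-z]*"),
    "quote": re.compile(r"[\"`']"),
}


def regex_features(token):
    """Return one boolean per pattern.

    Assumption: a pattern fires only if it matches the whole token.
    """
    return {name: bool(p.fullmatch(token)) for name, p in PATTERNS.items()}


def affixes(token, sizes=(2, 3, 4)):
    """Return the 2-, 3- and 4-character prefixes and suffixes of a token."""
    feats = {}
    for n in sizes:
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


def char_ngrams(token, sizes=(2, 3), start="^", end="$"):
    """Return character n-grams with token start/end indicators.

    Assumption: "^" and "$" stand in for the boundary indicators.
    """
    padded = start + token + end
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for tok in ["IV", "atgc", "-like", "BRCA1"]:
        print(tok, regex_features(tok), affixes(tok), char_ngrams(tok))
```

Under the whole-token assumption, a token such as `IV` triggers the Roman number feature but not the capitalized one, since `[A-Z][a-z]*` allows only one uppercase letter. Using `search` instead of `fullmatch` would make most features fire on far more tokens, so the matching mode is a design choice worth stating explicitly.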

ISSN: 1758-2946