
Table 1 The baseline features.

From: Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

Feature description                         Note/Regular expression

Roman number                                [ivxdlcm]+|[IVXDLCM]+
Punctuation                                 [,\\.;:?!]
Start with dash                             -.*
Nucleotide sequence                         [atgcu]+
Number                                      [0-9]+
Capitalized                                 [A-Z][a-z]*
Quote                                       [\"`']
The lemma for the current token             Provided by BioLemmatizer [23]
2, 3 and 4-character prefixes and suffixes
2 and 3 character n-grams                   Token start or end indicators are included
2 and 3 word n-grams
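
As an illustration only, the orthographic, affix and n-gram features in Table 1 could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each pattern is assumed to be matched against the whole token, the doubled backslash and escaped quote in the table are assumed to be source-code string escapes (so the sketch uses the plain regex forms), and the `^` and `$` boundary markers are an assumed choice for the "token start or end indicators". The lemma feature is omitted because it depends on BioLemmatizer.

```python
import re

# Patterns transcribed from Table 1 (plain regex forms, see assumptions above).
ORTHOGRAPHIC_PATTERNS = {
    "roman_number": r"[ivxdlcm]+|[IVXDLCM]+",
    "punctuation": r"[,\.;:?!]",
    "starts_with_dash": r"-.*",
    "nucleotide_sequence": r"[atgcu]+",
    "number": r"[0-9]+",
    "capitalized": r"[A-Z][a-z]*",
    "quote": "[\"`']",
}
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return one boolean per pattern; True if the whole token matches."""
    return {name: bool(rx.fullmatch(token)) for name, rx in COMPILED.items()}


def affix_features(token, lengths=(2, 3, 4)):
    """Return 2-, 3- and 4-character prefixes and suffixes of the token."""
    feats = {}
    for n in lengths:
        if len(token) >= n:  # skip affixes longer than the token itself
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


def char_ngrams(token, sizes=(2, 3), start="^", end="$"):
    """Return character n-grams over the token padded with boundary markers."""
    padded = f"{start}{token}{end}"
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]


def word_ngrams(tokens, sizes=(2, 3)):
    """Return word n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for n in sizes for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(orthographic_features("IV"))
    print(affix_features("kinase"))
    print(char_ngrams("ab"))
    print(word_ngrams(["protein", "kinase", "C"]))
```

Note that the patterns overlap by design: a token such as "mix" satisfies the Roman number pattern, and any lowercase string over a, t, g, c, u satisfies the nucleotide pattern, so these features act as weak indicators rather than definitive labels.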