From: tmChem: a high performance approach for chemical named entity recognition and normalization
Aspect | Model 1 | Model 2 |
---|---|---|
System adapted | BANNER [22] | tmVar [24] |
Preprocessing | | |
Unicode transliteration | No | Yes |
Tokenization | Breaks at whitespace, punctuation, digits, lowercase to uppercase | Breaks at whitespace, punctuation, digits, lowercase to uppercase, uppercase to lowercase |
Sentence segmentation | Java BreakIterator | None |
Conditional random field configuration and settings | | |
Implementation | MALLET [25] | CRF++ [23] |
Order | 1 | 2 |
Label model | IOB with one entity label | IOB with one entity label |
Regularization | L2 | L2 |
Gaussian prior variance (σ) | 1.0 | 4.0 |
Feature frequency threshold | 0 | 3 |
Features | | |
Individual tokens | Yes | Yes |
Morphology | Lemmatization | Stemming |
Part of speech | Yes | No |
Word shapes | Yes | Yes |
Characters | N-grams of length 2-4 | Prefixes and suffixes of length 2-5 |
Character counts | None | Total characters, digits, uppercase, lowercase |
ChemSpot [4] | Yes | No |
Semantic affixes | None | Suffixes, alkane stems, trivial rings, simple multipliers, etc. |
Chemical elements | Name and symbol | Name |
Amino acids | Name, 3-char abbreviation, 1-char abbreviation | None |
Chemical formulas | Within a single token | None |
Amino acid sequences | Across tokens | None |
Context window | 2 | 3 |
Post-processing | | |
Consistency | Yes | No |
Abbreviation resolution | Yes | Yes |
Parenthesis balancing | Yes | Yes |
Chemical identifiers | Yes | Yes |
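
The tokenization row can be illustrated with a minimal Python sketch. It is not the tmChem code: the function names `tokenize` and `split_case` and the flag `split_upper_to_lower` are hypothetical, and it assumes that "digits" means a break at every letter/digit boundary and that each punctuation character becomes its own token. With the flag off it follows the Model 1 column; with the flag on it adds the uppercase-to-lowercase break listed for Model 2.

```python
import re


def split_case(word, split_upper_to_lower):
    """Split an alphabetic string at case transitions."""
    parts, start = [], 0
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        lower_to_upper = prev.islower() and cur.isupper()
        upper_to_lower = prev.isupper() and cur.islower()
        if lower_to_upper or (split_upper_to_lower and upper_to_lower):
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def tokenize(text, split_upper_to_lower=False):
    """Break text at whitespace, punctuation, digit runs, and case transitions.

    Assumption: the exact break rules of the original systems may differ;
    this only mirrors the rule names given in the table.
    """
    tokens = []
    for chunk in text.split():  # whitespace breaks
        # Digit runs, letter runs, and single punctuation characters.
        for piece in re.findall(r"\d+|[^\W\d_]+|[\W_]", chunk):
            if piece.isalpha():
                tokens.extend(split_case(piece, split_upper_to_lower))
            else:
                tokens.append(piece)
    return tokens


if __name__ == "__main__":
    print(tokenize("NaCl 5-HT2A"))                      # Model 1 style
    print(tokenize("NaCl", split_upper_to_lower=True))  # Model 2 style
```

Under these assumptions, the extra uppercase-to-lowercase break makes the Model 2 style tokenization strictly finer-grained than the Model 1 style, which is consistent with Model 2 using a wider context window (3 versus 2 tokens).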