
Table 1 Comparison of Model 1 and Model 2.

From: tmChem: a high performance approach for chemical named entity recognition and normalization

| Aspect | Model 1 | Model 2 |
|---|---|---|
| System adapted | BANNER [22] | tmVar [24] |
| Preprocessing | | |
| Unicode transliteration | No | Yes |
| Tokenization (token boundaries) | whitespace, punctuation, digits, lowercase to uppercase | whitespace, punctuation, digits, lowercase to uppercase, uppercase to lowercase |
| Sentence segmentation | Java BreakIterator | None |
| Conditional random field configuration and settings | | |
| Implementation | MALLET [25] | CRF++ [23] |
| Order | 1 | 2 |
| Label model | IOB with one entity label | IOB with one entity label |
| Regularization | L2 | L2 |
| Gaussian prior variance (σ) | 1.0 | 4.0 |
| Feature frequency threshold | 0 | 3 |
| Features | | |
| Individual tokens | Yes | Yes |
| Morphology | Lemmatization | Stemming |
| Part of speech | Yes | No |
| Word shapes | Yes | Yes |
| Characters | N-grams length 2-4 | Prefixes and suffixes length 2-5 |
| Character counts | None | Total characters, digits, uppercase, lowercase |
| ChemSpot [4] | Yes | No |
| Semantic affixes | None | Suffixes, alkane stems, trivial rings, simple multipliers, etc. |
| Chemical elements | Name and symbol | Name |
| Amino acids | Name, 3-char abbreviation, 1-char abbreviation | None |
| Chemical formulas | Within a single token | None |
| Amino acid sequences | Across tokens | None |
| Context window | 2 | 3 |
| Post processing | | |
| Consistency | Yes | No |
| Abbreviation resolution | Yes | Yes |
| Parenthesis balancing | Yes | Yes |
| Chemical identifiers | Yes | Yes |
  1. This table compares the setup and configuration of Model 1 and Model 2.
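
The Tokenization row lists only the kinds of token boundary each model uses, not the exact rules. The Python sketch below is one literal reading of that row and is not taken from the tmChem code: the function names `tokenize` and `_is_boundary` are hypothetical, and it assumes that every punctuation character becomes its own token and that runs of digits stay together.

```python
def _is_boundary(prev, ch, split_upper_to_lower):
    """Return True if a token boundary falls between characters prev and ch."""
    # Assumption: each punctuation (non-alphanumeric) character stands alone.
    if not prev.isalnum() or not ch.isalnum():
        return True
    # Break between a digit and a non-digit; digit runs stay together.
    if prev.isdigit() != ch.isdigit():
        return True
    # Lowercase followed by uppercase (listed for both models).
    if prev.islower() and ch.isupper():
        return True
    # Uppercase followed by lowercase (listed for Model 2 only).
    if split_upper_to_lower and prev.isupper() and ch.islower():
        return True
    return False


def tokenize(text, split_upper_to_lower=False):
    """Split text at whitespace, then at the boundaries defined above."""
    tokens = []
    for chunk in text.split():  # whitespace boundaries
        current = chunk[0]
        for prev, ch in zip(chunk, chunk[1:]):
            if _is_boundary(prev, ch, split_upper_to_lower):
                tokens.append(current)
                current = ch
            else:
                current += ch
        tokens.append(current)
    return tokens


# Model 1 style boundaries
print(tokenize("NaCl 2-propanol"))
# Model 2 style boundaries (adds uppercase-to-lowercase breaks)
print(tokenize("NaCl 2-propanol", split_upper_to_lower=True))
```

Under this reading, the extra uppercase-to-lowercase boundary makes the Model 2 tokenization strictly finer than that of Model 1: `NaCl` yields two tokens with the first setting and four with the second.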