tmChem: a high performance approach for chemical named entity recognition and normalization

Table 1. Comparison of Model 1 and Model 2.

Aspect	Model 1	Model 2
System adapted	BANNER [22]	tmVar [24]
Preprocessing
Unicode transliteration	No	Yes
Tokenization	whitespace, punctuation, digits, lowercase to uppercase	whitespace, punctuation, digits, lowercase to uppercase, uppercase to lowercase
Sentence segmentation	Java BreakIterator	None
Conditional random field configuration and settings
Implementation	MALLET [25]	CRF++ [23]
Order	1	2
Label model	IOB with one entity label	IOB with one entity label
Regularization	L₂	L₂
Gaussian prior variance (σ)	1.0	4.0
Feature frequency threshold	0	3
Features
Individual tokens	Yes	Yes
Morphology	Lemmatization	Stemming
Part of speech	Yes	No
Word shapes	Yes	Yes
Characters	N-grams length 2 - 4	Prefixes and suffixes length 2 - 5
Character counts	None	Total characters, digits, uppercase, lowercase
ChemSpot [4]	Yes	No
Semantic affixes	None	Suffixes, alkane stems, trivial rings, simple multipliers, etc.
Chemical elements	Name and symbol	Name
Amino acids	Name, 3-char abbreviation, 1-char abbreviation	None
Chemical formulas	Within a single token	None
Amino acid sequences	Across tokens	None
Context window	2	3
Postprocessing
Consistency	Yes	No
Abbreviation resolution	Yes	Yes
Parenthesis balancing	Yes	Yes
Chemical identifiers	Yes	Yes
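
The tokenization and character-level feature rows above can be illustrated with a short Python sketch. It is not the tmChem implementation: the function names (`tokenize`, `char_features`), the feature key names, and the choice to emit each punctuation character as its own token are assumptions made for illustration. The table only states which boundaries trigger a token break and which character features Model 2 uses.

```python
def _is_boundary(prev, ch, split_upper_to_lower):
    """Return True if a token break should fall between prev and ch."""
    if prev.isdigit() != ch.isdigit():
        return True  # letter/digit boundary (both models)
    if prev.islower() and ch.isupper():
        return True  # lowercase-to-uppercase break (both models)
    if split_upper_to_lower and prev.isupper() and ch.islower():
        return True  # uppercase-to-lowercase break (Model 2 only)
    return False


def tokenize(text, split_upper_to_lower=False):
    """Break text at whitespace, punctuation, digits, and case changes.

    split_upper_to_lower=False approximates the Model 1 column,
    True approximates the Model 2 column.
    """
    tokens, current = [], ""
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current token and is discarded.
            if current:
                tokens.append(current)
            current = ""
        elif not ch.isalnum():
            # Assumption: each punctuation character is its own token.
            if current:
                tokens.append(current)
            tokens.append(ch)
            current = ""
        else:
            if current and _is_boundary(current[-1], ch, split_upper_to_lower):
                tokens.append(current)
                current = ""
            current += ch
    if current:
        tokens.append(current)
    return tokens


def char_features(token, min_len=2, max_len=5):
    """Model 2 style character counts plus prefixes/suffixes of length 2-5."""
    feats = {
        "n_chars": len(token),
        "n_digits": sum(c.isdigit() for c in token),
        "n_upper": sum(c.isupper() for c in token),
        "n_lower": sum(c.islower() for c in token),
    }
    for n in range(min_len, max_len + 1):
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


if __name__ == "__main__":
    print(tokenize("NaCl 2mg"))                             # Model 1 style
    print(tokenize("NaCl 2mg", split_upper_to_lower=True))  # Model 2 style
    print(char_features("ethanol"))
```

Under these assumptions, the finer Model 2 tokenization splits a formula such as "NaCl" into single characters, whereas the Model 1 rules keep "Na" and "Cl" together.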

ISSN: 1758-2946