
Table 1 Comparison of Model 1 and Model 2.

From: tmChem: a high performance approach for chemical named entity recognition and normalization

| Aspect | Model 1 | Model 2 |
| --- | --- | --- |
| System adapted | BANNER [22] | tmVar [24] |
| Preprocessing | | |
| Unicode transliteration | No | Yes |
| Tokenization | whitespace; punctuation; digits; lowercase to uppercase | whitespace; punctuation; digits; lowercase to uppercase; uppercase to lowercase |
| Sentence segmentation | Java BreakIterator | None |
| Conditional random field configuration and settings | | |
| Implementation | MALLET [25] | CRF++ [23] |
| Order | 1 | 2 |
| Label model | IOB with one entity label | IOB with one entity label |
| Regularization | L2 | L2 |
| Gaussian prior variance (σ) | 1.0 | 4.0 |
| Feature frequency threshold | 0 | 3 |
| Features | | |
| Individual tokens | Yes | Yes |
| Morphology | Lemmatization | Stemming |
| Part of speech | Yes | No |
| Word shapes | Yes | Yes |
| Characters | N-grams of length 2-4 | Prefixes and suffixes of length 2-5 |
| Character counts | None | Total characters, digits, uppercase, lowercase |
| ChemSpot [4] | Yes | No |
| Semantic affixes | None | Suffixes, alkane stems, trivial rings, simple multipliers, etc. |
| Chemical elements | Name and symbol | Name |
| Amino acids | Name, 3-char abbreviation, 1-char abbreviation | None |
| Chemical formulas | Within a single token | None |
| Amino acid sequences | Across tokens | None |
| Context window | 2 | 3 |
| Post processing | | |
| Consistency | Yes | No |
| Abbreviation resolution | Yes | Yes |
| Parenthesis balancing | Yes | Yes |
| Chemical identifiers | Yes | Yes |

  1. This table compares the setup and configuration of Model 1 and Model 2.
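The Tokenization and Characters rows can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the code of either model: the function names (`tokenize`, `char_ngrams`, `affixes`) are hypothetical, the "digits" rule is assumed to mean that runs of digits form their own tokens, and each punctuation character is assumed to be a separate token. The table does not say whether the original systems behave exactly this way.

```python
import re


def tokenize(text, split_upper_to_lower=False):
    """Split text at whitespace, punctuation, digits, and case transitions.

    With split_upper_to_lower=False the rules follow the Model 1 column;
    with True, the extra uppercase-to-lowercase break from the Model 2
    column is added. Both readings are assumptions based on the table.
    """
    # First pass: runs of letters, runs of digits, single punctuation
    # characters. Whitespace is discarded.
    pieces = re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)

    # Zero-width boundary between a lowercase and an uppercase letter.
    boundary = r"(?<=[a-z])(?=[A-Z])"
    if split_upper_to_lower:
        # Additional boundary between an uppercase and a lowercase letter.
        boundary += r"|(?<=[A-Z])(?=[a-z])"

    tokens = []
    for piece in pieces:
        if piece[0].isalpha():
            # Splitting on empty matches requires Python 3.7 or later.
            tokens.extend(re.split(boundary, piece))
        else:
            tokens.append(piece)
    return tokens


def char_ngrams(token, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max (Model 1, Characters row)."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def affixes(token, n_min=2, n_max=5):
    """Prefixes and suffixes of length n_min..n_max (Model 2, Characters row)."""
    lengths = [n for n in range(n_min, n_max + 1) if n <= len(token)]
    prefixes = [token[:n] for n in lengths]
    suffixes = [token[-n:] for n in lengths]
    return prefixes, suffixes
```

Under these assumptions, `tokenize("NaCl")` yields `["Na", "Cl"]` with the Model 1 rules and `["N", "a", "C", "l"]` once the uppercase-to-lowercase break is enabled, which shows how much finer the Model 2 tokenization is.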