
Table 2 Details on the chemical NER tools in terms of training sets, databases to which the entities are normalized, classes of chemicals addressed, and tokenization methods

From: Recognizing chemicals in patents: a comparative analysis

| NER tool | Training set | Databases | Classes | Tokenization method |
| --- | --- | --- | --- | --- |
| tmChem [24] | CHEMDNER corpus at BioCreative IV (training and development sets) | CHEBI, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character, at number-letter changes, and where a lowercase letter is followed by an uppercase letter |
| ChemSpot [13] | A subset of SCAI Corpus [29] containing only IUPAC | ChemIDplus, CHEBI, CAS NUMBER, PubChem, InChI, DrugBank, KEGG, Human Metabolome, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character and at number-letter changes |
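
The two tokenization methods described in the table differ only in the lowercase-to-uppercase rule. The following is a minimal sketch of those rules as stated, not the actual tmChem or ChemSpot implementation. The function name `tokenize` and the `split_case` flag are hypothetical, and the sketch assumes ASCII letters and digits, treats each non-letter, non-digit character as its own token, and drops whitespace.

```python
import re

# Shared rule: letter runs and digit runs are separate tokens, and every
# character that is neither a letter nor a digit is a token of its own.
# Assumption: ASCII only, and whitespace is discarded rather than kept.
_BASE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

# Extra rule from the tmChem description: a boundary wherever a lowercase
# letter is directly followed by an uppercase letter.
_CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def tokenize(text, split_case=False):
    """Split text according to the rules described in the table.

    split_case=False approximates the ChemSpot description,
    split_case=True approximates the tmChem description.
    """
    tokens = []
    for tok in _BASE.findall(text):
        if split_case and tok.isalpha():
            # Mark each lowercase-to-uppercase boundary with a space, then split.
            tokens.extend(_CASE_BOUNDARY.sub(" ", tok).split())
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    print(tokenize("2-methylpropan-1-ol"))
    print(tokenize("NaCl2"))
    print(tokenize("NaCl2", split_case=True))
```

Under these assumptions, a formula such as `NaCl2` stays as a letter run plus a digit run without the case rule, and is further split into `Na` and `Cl` with it.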