Table 2 Details on the chemical NER tools in terms of training sets, databases to which the entities are normalized, classes of chemicals addressed, and tokenization methods

From: Recognizing chemicals in patents: a comparative analysis

| NER tool | Training set | Databases | Classes | Tokenization method |
| --- | --- | --- | --- | --- |
| tmChem [24] | CHEMDNER corpus at BioCreative IV (training and development sets) | CHEBI, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character, at number-letter changes, and where a lowercase letter is followed by an uppercase letter |
| ChemSpot [13] | A subset of SCAI Corpus [29] containing only IUPAC | ChemIDplus, CHEBI, CAS NUMBER, PubChem, InChI, DrugBank, KEGG, Human Metabolome, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character and at number-letter changes |
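
The tokenization rules listed in the table can be illustrated with a minimal Python sketch. This is not the code of tmChem or ChemSpot. The function name `tokenize`, the flag `split_lower_upper`, the restriction to ASCII letters and digits, and the choice to discard whitespace are all assumptions made for illustration only.

```python
import re

# Boundaries between a digit and a letter, in either order ("number-letter changes").
_NUMBER_LETTER = r"(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])"
# Boundary where a lowercase letter is followed by an uppercase letter.
_LOWER_UPPER = r"(?<=[a-z])(?=[A-Z])"


def tokenize(text, split_lower_upper=False):
    """Hypothetical tokenizer following the rules described in the table.

    split_lower_upper=False: break at every non-letter, non-digit character
                             and at number-letter changes.
    split_lower_upper=True:  additionally break where a lowercase letter is
                             followed by an uppercase letter.
    Assumption: ASCII only; whitespace is dropped rather than kept as a token.
    """
    boundary = _NUMBER_LETTER
    if split_lower_upper:
        boundary += "|" + _LOWER_UPPER

    tokens = []
    # Each non-letter, non-digit, non-space character is a token of its own;
    # alphanumeric runs are split further at the zero-width boundaries.
    for piece in re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text):
        tokens.extend(re.split(boundary, piece))  # empty-match split needs Python 3.7+
    return tokens


if __name__ == "__main__":
    sample = "2-methylPropane C6H12"
    print(tokenize(sample))
    print(tokenize(sample, split_lower_upper=True))
```

With the extra lowercase-to-uppercase rule enabled, `methylPropane` is split into `methyl` and `Propane`; without it, the string stays a single token. Both settings split `C6H12` at each number-letter change.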