Recognizing chemicals in patents: a comparative analysis

Table 1 Details of the gold standard patent corpora annotated with chemical entities

Corpus	Number of patents	Annotated entities	Number of annotations
CEMP training set (CEMP_T) [11, 25], \(\approx\)660 thousand tokens	7000 patents (title and abstract)	ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS	33543 (without normalization)
CEMP development set (CEMP_D) [11, 25], \(\approx\)650 thousand tokens	7000 patents (title and abstract)	ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS	32142 (without normalization)
CHEBI patent corpus (chapati) [18], \(\approx\)265 thousand tokens	40 full patents (title, abstract, claims, description)	CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM	18746 (normalized to CHEBI identifiers)
BioSemantic patent corpus (BioS) [19], 11,500 pages and \(\approx\)4.2 million tokens	200 full patents (title, abstract, claims, description)	IUPAC, SMILES, InChi, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET	400125 (without normalization)

ISSN: 1758-2946