
Table 1 The details of the gold standard patent corpora containing the annotations for chemicals

From: Recognizing chemicals in patents: a comparative analysis

| Corpus | Number of patents | Annotated entities | Number of annotations |
| --- | --- | --- | --- |
| CEMP training set (CEMP_T) [11, 25], \(\approx\)660 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 33543 (without normalization) |
| CEMP development set (CEMP_D) [11, 25], \(\approx\)650 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 32142 (without normalization) |
| CHEBI patent corpus (chapati) [18], \(\approx\)265 thousand tokens | 40 full patents (title, abstract, claims, description) | CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM | 18746 (normalized to CHEBI identifiers) |
| BioSemantic patent corpus (BioS) [19], 11,500 pages and \(\approx\)4.2 million tokens | 200 full patents (title, abstract, claims, description) | IUPAC, SMILES, InChi, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET | 400125 (without normalization) |