From: Recognizing chemicals in patents: a comparative analysis
Corpus | Number of patents | Annotated entities | Number of annotations |
---|---|---|---|
CEMP training set (CEMP_T) [11, 25], \(\approx\)660 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 33543 (without normalization) |
CEMP development set (CEMP_D) [11, 25], \(\approx\)650 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 32142 (without normalization) |
ChEBI patent corpus (chapati) [18], \(\approx\)265 thousand tokens | 40 full patents (title, abstract, claims, description) | CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM | 18746 (normalized to ChEBI identifiers) |
BioSemantic patent corpus (BioS) [19], 11,500 pages and \(\approx\)4.2 million tokens | 200 full patents (title, abstract, claims, description) | IUPAC, SMILES, InChi, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET | 400125 (without normalization) |