
Table 1 CHEMDNER corpus overview.

From: The CHEMDNER corpus of chemicals and drugs and its annotation principles

                      Training set  Development set   Test set  Entire corpus
Abstracts                    3,500            3,500      3,000         10,000
Nr. characters           4,883,753        4,864,558  4,199,068     13,947,379
Nr. tokens                 770,855          766,331    662,571      2,199,757
Abstracts with CEM           2,916            2,907      2,478          8,301
Nr. mentions                29,478           29,526     25,351         84,355
Nr. chemicals                8,520            8,677      7,563         19,805
Nr. journals                   193              188        188            203
TRIVIAL                      8,832            8,970      7,808         25,610
SYSTEMATIC                   6,656            6,816      5,666         19,138
ABBREVIATION                 4,538            4,521      4,059         13,118
FORMULA                      4,448            4,137      3,443         12,028
FAMILY                       4,090            4,223      3,622         11,935
IDENTIFIER                     672              639        513          1,824
MULTIPLE                       202              188        199            589
NO CLASS                        40               32         41            113
  1. This table gives an overview of the CHEMDNER corpus: the number of manually revised abstracts (Abstracts) and their total size in characters and tokens; the number of abstracts containing at least one chemical entity mention (Abstracts with CEM); the number of annotated chemical entity mentions; the number of unique chemicals annotated (the non-redundant list of mentions); and the number of journals from which the annotated abstracts were drawn. The lower half of the table gives the number of mentions for each CHEMDNER entity class (see Figure 1), for each set and for the entire corpus.
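The arithmetic behind the lower half of the table can be checked with a short script. The following is a minimal sketch, not part of the original article: the counts are transcribed from Table 1, and the names `MENTIONS`, `CLASS_COUNTS`, `rows_add_up`, `class_totals` and `class_shares` are hypothetical helpers introduced here for illustration only.

```python
# Counts transcribed from Table 1, in the order
# (training set, development set, test set, entire corpus).
MENTIONS = (29478, 29526, 25351, 84355)

CLASS_COUNTS = {
    "TRIVIAL": (8832, 8970, 7808, 25610),
    "SYSTEMATIC": (6656, 6816, 5666, 19138),
    "ABBREVIATION": (4538, 4521, 4059, 13118),
    "FORMULA": (4448, 4137, 3443, 12028),
    "FAMILY": (4090, 4223, 3622, 11935),
    "IDENTIFIER": (672, 639, 513, 1824),
    "MULTIPLE": (202, 188, 199, 589),
    "NO CLASS": (40, 32, 41, 113),
}


def rows_add_up(counts):
    """True if, in every row, the three sets sum to the entire-corpus value."""
    return all(train + dev + test == total
               for train, dev, test, total in counts.values())


def class_totals(counts):
    """Sum the per-class counts column by column (one total per set)."""
    return tuple(sum(column) for column in zip(*counts.values()))


def class_shares(counts):
    """Fraction of all corpus mentions that falls into each entity class."""
    grand_total = sum(row[3] for row in counts.values())
    return {name: row[3] / grand_total for name, row in counts.items()}


if __name__ == "__main__":
    print("rows add up:", rows_add_up(CLASS_COUNTS))
    print("class totals match Nr. mentions:",
          class_totals(CLASS_COUNTS) == MENTIONS)
    for name, share in class_shares(CLASS_COUNTS).items():
        print(f"{name:<13}{share:6.1%}")
```

The additive check only makes sense for rows that count mentions, abstracts, characters or tokens. Nr. chemicals and Nr. journals are unique counts, so the same chemical or journal can appear in more than one set and the three set values are not expected to sum to the entire-corpus figure.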