From: The CHEMDNER corpus of chemicals and drugs and its annotation principles
 | Training set | Development set | Test set | Entire corpus |
---|---|---|---|---|
Abstracts | 3,500 | 3,500 | 3,000 | 10,000 |
Nr. characters | 4,883,753 | 4,864,558 | 4,199,068 | 13,947,379 |
Nr. tokens | 770,855 | 766,331 | 662,571 | 2,199,757 |
Abstracts with SACEM | 2,916 | 2,907 | 2,478 | 8,301 |
Nr. mentions | 29,478 | 29,526 | 25,351 | 84,355 |
Nr. chemicals | 8,520 | 8,677 | 7,563 | 19,805 |
Nr. journals | 193 | 188 | 188 | 203 |
TRIVIAL | 8,832 | 8,970 | 7,808 | 25,610 |
SYSTEMATIC | 6,656 | 6,816 | 5,666 | 19,138 |
ABBREVIATION | 4,538 | 4,521 | 4,059 | 13,118 |
FORMULA | 4,448 | 4,137 | 3,443 | 12,028 |
FAMILY | 4,090 | 4,223 | 3,622 | 11,935 |
IDENTIFIER | 672 | 639 | 513 | 1,824 |
MULTIPLE | 202 | 188 | 199 | 589 |
NO CLASS | 40 | 32 | 41 | 113 |
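
The per-class rows can be cross-checked against the "Nr. mentions" row with a few lines of Python. This is only a minimal sketch: the numbers are copied by hand from the table above, and the names `class_counts`, `total_mentions`, `column_sums` and `shares` are illustrative, not part of any CHEMDNER tooling.

```python
# Mention counts per class, copied from the table above.
# Tuple order: (training, development, test, entire corpus).
class_counts = {
    "TRIVIAL":      (8832, 8970, 7808, 25610),
    "SYSTEMATIC":   (6656, 6816, 5666, 19138),
    "ABBREVIATION": (4538, 4521, 4059, 13118),
    "FORMULA":      (4448, 4137, 3443, 12028),
    "FAMILY":       (4090, 4223, 3622, 11935),
    "IDENTIFIER":   (672, 639, 513, 1824),
    "MULTIPLE":     (202, 188, 199, 589),
    "NO CLASS":     (40, 32, 41, 113),
}

# "Nr. mentions" row from the same table.
total_mentions = (29478, 29526, 25351, 84355)

# Sum each column over all classes.
column_sums = tuple(sum(col) for col in zip(*class_counts.values()))

# Share of each class in the entire corpus, in percent.
shares = {
    name: round(100 * row[3] / total_mentions[3], 1)
    for name, row in class_counts.items()
}

if __name__ == "__main__":
    print("class sums match mention totals:", column_sums == total_mentions)
    for name, pct in shares.items():
        print(f"{name:<12} {pct:5.1f}%")
```

With the values as listed, the class counts in each column add up to the corresponding "Nr. mentions" figure, and the three subset columns of every class row add up to its "Entire corpus" value.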