
Table 1 CHEMDNER corpus overview.

From: The CHEMDNER corpus of chemicals and drugs and its annotation principles

 

| | Training set | Development set | Test set | Entire corpus |
|---|---|---|---|---|
| Abstracts | 3,500 | 3,500 | 3,000 | 10,000 |
| Nr. characters | 4,883,753 | 4,864,558 | 4,199,068 | 13,947,379 |
| Nr. tokens | 770,855 | 766,331 | 662,571 | 2,199,757 |
| Abstracts with CEM | 2,916 | 2,907 | 2,478 | 8,301 |
| Nr. mentions | 29,478 | 29,526 | 25,351 | 84,355 |
| Nr. chemicals | 8,520 | 8,677 | 7,563 | 19,805 |
| Nr. journals | 193 | 188 | 188 | 203 |
| TRIVIAL | 8,832 | 8,970 | 7,808 | 25,610 |
| SYSTEMATIC | 6,656 | 6,816 | 5,666 | 19,138 |
| ABBREVIATION | 4,538 | 4,521 | 4,059 | 13,118 |
| FORMULA | 4,448 | 4,137 | 3,443 | 12,028 |
| FAMILY | 4,090 | 4,223 | 3,622 | 11,935 |
| IDENTIFIER | 672 | 639 | 513 | 1,824 |
| MULTIPLE | 202 | 188 | 199 | 589 |
| NO CLASS | 40 | 32 | 41 | 113 |

  1. This table provides an overview of the CHEMDNER corpus in terms of the number of manually revised abstracts (Abstracts) with their total sizes as number of characters and tokens, the number of abstracts containing at least one chemical entity mention (Abstracts with CEM), the number of annotated mentions of chemical entities, the number of unique chemicals annotated (the non-redundant list of mentions) and the number of corresponding journals for the annotated abstracts. The number of mentions for each CHEMDNER entity class (see Figure 1) is provided for each set and the entire corpus in the lower half of the table.
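
The relationship between the two halves of the table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts are copied from the table, while the variable names (`class_counts`, `total_mentions`, `class_share`) are made up for this example and do not come from the CHEMDNER distribution. It assumes that every annotated mention is counted under exactly one class row, which the figures are consistent with.

```python
# Mention counts per CHEMDNER class, copied from the table above.
# Tuple order: (training, development, test, entire corpus).
class_counts = {
    "TRIVIAL":      (8832, 8970, 7808, 25610),
    "SYSTEMATIC":   (6656, 6816, 5666, 19138),
    "ABBREVIATION": (4538, 4521, 4059, 13118),
    "FORMULA":      (4448, 4137, 3443, 12028),
    "FAMILY":       (4090, 4223, 3622, 11935),
    "IDENTIFIER":   (672, 639, 513, 1824),
    "MULTIPLE":     (202, 188, 199, 589),
    "NO CLASS":     (40, 32, 41, 113),
}

# "Nr. mentions" row, same column order.
total_mentions = (29478, 29526, 25351, 84355)

# Sum the eight class rows column by column.
column_sums = tuple(sum(col) for col in zip(*class_counts.values()))

# For each class, the three sets should add up to the entire-corpus figure.
rows_consistent = all(sum(c[:3]) == c[3] for c in class_counts.values())

# Fraction of all corpus mentions that falls into each class.
class_share = {
    name: counts[3] / total_mentions[3]
    for name, counts in class_counts.items()
}

if __name__ == "__main__":
    print("class rows sum to Nr. mentions:", column_sums == total_mentions)
    print("sets sum to entire corpus:", rows_consistent)
    for name, share in class_share.items():
        print(f"{name:<13}{share:6.1%}")
```

The same additivity does not hold for the "Nr. chemicals" and "Nr. journals" rows, since a chemical or journal that occurs in more than one set is counted once in the entire corpus.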