Table 1 Statistics of the dataset.

From: A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

| Types | Training set | Development set | Test set | Entire corpus |
|---|---|---|---|---|
| ABBREVIATION | 4,538 | 4,521 | 4,059 | 13,118 |
| FAMILY | 4,090 | 4,223 | 3,622 | 11,935 |
| FORMULA | 4,448 | 4,137 | 3,443 | 12,028 |
| IDENTIFIER | 672 | 639 | 513 | 1,824 |
| MULTIPLE | 202 | 188 | 199 | 589 |
| SYSTEMATIC | 6,656 | 6,816 | 5,666 | 19,138 |
| TRIVIAL | 8,832 | 8,970 | 7,808 | 25,610 |
| NO CLASS | 40 | 32 | 41 | 113 |
| ALL | 29,478 | 29,526 | 25,351 | 84,355 |
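
The counts in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original article. It transcribes the table values and checks that each type's three splits add up to its "Entire corpus" figure and that each column adds up to the ALL row. The names `COUNTS`, `ALL_ROW`, `check_totals`, and `corpus_share` are hypothetical, chosen only for illustration.

```python
# Counts transcribed from Table 1, per type:
# (training set, development set, test set, entire corpus).
COUNTS = {
    "ABBREVIATION": (4538, 4521, 4059, 13118),
    "FAMILY": (4090, 4223, 3622, 11935),
    "FORMULA": (4448, 4137, 3443, 12028),
    "IDENTIFIER": (672, 639, 513, 1824),
    "MULTIPLE": (202, 188, 199, 589),
    "SYSTEMATIC": (6656, 6816, 5666, 19138),
    "TRIVIAL": (8832, 8970, 7808, 25610),
    "NO CLASS": (40, 32, 41, 113),
}

# The ALL row of Table 1, in the same column order.
ALL_ROW = (29478, 29526, 25351, 84355)


def check_totals(counts, all_row):
    """Return a list of inconsistencies; the list is empty if the table adds up."""
    problems = []

    # Row check: training + development + test should equal the corpus total.
    for name, (train, dev, test, total) in counts.items():
        if train + dev + test != total:
            problems.append(f"row {name}: {train + dev + test} != {total}")

    # Column check: summing over all types should reproduce the ALL row.
    column_sums = tuple(sum(column) for column in zip(*counts.values()))
    if column_sums != all_row:
        problems.append(f"columns: {column_sums} != {all_row}")

    return problems


def corpus_share(counts, all_row, name):
    """Fraction of the entire corpus that belongs to the given type."""
    return counts[name][3] / all_row[3]


if __name__ == "__main__":
    print(check_totals(COUNTS, ALL_ROW))
    for type_name in COUNTS:
        print(f"{type_name}: {corpus_share(COUNTS, ALL_ROW, type_name):.1%}")
```

The same helper can be reused to compute the share of any type in the corpus, which makes the class imbalance visible: TRIVIAL and SYSTEMATIC are by far the largest types, while NO CLASS and MULTIPLE are the smallest.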