
Table 3 CEMP and GPRO corpora overview

From: A neural network approach to chemical and gene/protein entity recognition in patents

 

                        Training set    Test set    Entire corpus
Patent abstracts        21,000          9,000       30,000
CEMP mentions           99,632          44,486      144,188
GPRO mentions           17,751          8,998       26,749
GPRO type 1 mentions    12,422          5,330       17,752
GPRO type 2 mentions    5,329           3,668       8,997
Tokens                  1,770,836       767,599     2,538,435