Table 3 CEMP and GPRO corpora overview

From: A neural network approach to chemical and gene/protein entity recognition in patents

| | Training set | Test set | Entire corpus |
|---|---|---|---|
| Patent abstracts | 21,000 | 9,000 | 30,000 |
| CEMP mentions | 99,632 | 44,486 | 144,188 |
| GPRO mentions | 17,751 | 8,998 | 26,749 |
| GPRO type 1 mentions | 12,422 | 5,330 | 17,752 |
| GPRO type 2 mentions | 5,329 | 3,668 | 8,997 |
| Tokens | 1,770,836 | 767,599 | 2,538,435 |