From: A neural network approach to chemical and gene/protein entity recognition in patents
| | Training set | Test set | Entire corpus |
|---|---|---|---|
| Patent abstracts | 21,000 | 9,000 | 30,000 |
| CEMP mentions | 99,632 | 44,486 | 144,188 |
| GPRO mentions | 17,751 | 8,998 | 26,749 |
| GPRO type 1 mentions | 12,422 | 5,330 | 17,752 |
| GPRO type 2 mentions | 5,329 | 3,668 | 8,997 |
| Tokens | 1,770,836 | 767,599 | 2,538,435 |
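
If the "Entire corpus" column is the sum of the training and test columns, and the GPRO mentions are the sum of the type 1 and type 2 mentions, the counts can be cross-checked with a few lines of Python. This is only a sketch: the values are transcribed from the table, and the names `rows`, `split_mismatches`, and `gpro_mismatches` are illustrative, not taken from the paper.

```python
# Counts transcribed from the table: (training set, test set, entire corpus).
rows = {
    "Patent abstracts": (21_000, 9_000, 30_000),
    "CEMP mentions": (99_632, 44_486, 144_188),
    "GPRO mentions": (17_751, 8_998, 26_749),
    "GPRO type 1 mentions": (12_422, 5_330, 17_752),
    "GPRO type 2 mentions": (5_329, 3_668, 8_997),
    "Tokens": (1_770_836, 767_599, 2_538_435),
}


def split_mismatches(table):
    """Map each row whose total != train + test to (total - (train + test))."""
    return {
        name: total - (train + test)
        for name, (train, test, total) in table.items()
        if train + test != total
    }


def gpro_mismatches(table):
    """Return column indices where type 1 + type 2 != all GPRO mentions."""
    all_gpro = table["GPRO mentions"]
    type1 = table["GPRO type 1 mentions"]
    type2 = table["GPRO type 2 mentions"]
    return [i for i in range(3) if type1[i] + type2[i] != all_gpro[i]]


mismatches = split_mismatches(rows)
for name, diff in mismatches.items():
    print(f"{name}: reported total differs from train + test by {diff}")
print("GPRO columns that do not add up:", gpro_mismatches(rows))
```

Any row the check reports should be compared against the published table before the figures are reused.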