- Research article
- Open Access
A neural network approach to chemical and gene/protein entity recognition in patents
© The Author(s) 2018
- Received: 17 September 2018
- Accepted: 5 December 2018
- Published: 18 December 2018
In biomedical research, patents contain a significant amount of information, and biomedical text mining of patents has received much attention recently. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks, i.e., chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO) and technical interoperability and performance of annotation servers, to focus on biomedical entity recognition in patents. This paper describes our neural network approach for the CEMP and GPRO tracks. In the approach, a bidirectional long short-term memory with a conditional random field layer is employed to recognize biomedical entities from patents. To improve the performance, we explored the effect of additional features (i.e., part of speech, chunking and named entity recognition features generated by the GENIA tagger) for the neural network model. In the official results, our best runs achieve the highest performances (a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track; a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track) among all participating teams in both tracks.
- Biomedical entity recognition
- Deep learning
- Long short-term memory
- Conditional random field
Biomedical named entity recognition (NER) aims to automatically find biomedical mentions in text, which is crucial for information extraction in the biomedical domain. In the previous BioCreative challenges [1–3], various tasks have been addressed to recognize biomedical entities (such as genes/proteins, chemicals and diseases) from the scientific literature. In addition to the scientific literature, patents are another important source since they contain a wealth of useful biomedical information. Therefore, automatic extraction of the information contained in patents has received much attention, and automatic biomedical entity recognition from medicinal chemistry patents has become an important research task.
To promote the development of NER systems, the BioCreative V.5, a major challenge event in biomedical natural language processing, organized three tracks to focus on biomedical entity recognition in patents. This challenge included three individual tracks: two traditional BioCreative tracks to detect relevant biomedical entities (chemical entity mention recognition (CEMP) track and gene and protein related object recognition (GPRO) track) and a novel track called technical interoperability and performance of annotation servers (TIPS). The latter focuses on the technical aspects of the evaluation of continuous text Annotation Servers for NER. For the challenge, we participated in the CEMP and GPRO tracks, and our submissions to the two tracks were created by our deep learning system.
Biomedical NER is a fundamental step for further biomedical text mining and has received much attention recently. However, it is particularly challenging for several reasons. For gene/protein NER, for example, millions of gene/protein names are in use and new names are created constantly and rapidly; gene/protein names naturally co-occur with other entity types that have similar morphology and context; and genes are named in various ways, with ambiguities caused by DNA sequences that may vary in nonspecific ways. In previous work, the state-of-the-art CRF-based biomedical NER methods [5–9] depend on effective feature engineering, i.e., the design of effective features using various natural language processing (NLP) tools and knowledge resources, which is still a labor-intensive and skill-dependent task. Recently, deep learning has become prevalent in the machine learning research community. Deep learning methods are neural network-based representation learning methods that compose simple but non-linear modules to obtain multiple levels of representation. For the NER task in the general domain (such as the news domain), several similar neural network architectures [11–13] have been proposed and exhibit promising results. Moreover, deep learning methods have begun to be explored in the biomedical field, including for genes and proteins, diseases and chemicals. Compared with traditional machine learning methods, the key advantage of deep learning methods is that the layers of features are not designed by human engineers and, therefore, less feature engineering is needed.
In this paper, we describe our neural network-based NER systems for the CEMP and GPRO tracks. In the approach, first the word embedding is learned from a large unlabeled dataset. Thereafter, a character-level feature is produced from the character and capitalization embeddings. Then the concatenation of the character-level feature and the word embedding is used as the basic input. Finally, the input is fed into a bidirectional long short-term memory with a conditional random field layer (BiLSTM-CRF) to recognize chemical and gene/protein entities from patents. Furthermore, we explored the effect of additional features (i.e., part of speech (POS), chunking and NER features generated by the GENIA tagger) for the neural network model. In the official results, our best runs achieve the highest performances (F-scores of 90.42% and 79.19% on the CEMP and GPRO corpora, respectively) in both tracks. The details of our method and results are presented in the following sections.
First, document titles and abstracts are extracted from the dataset. The extracted text is then split into sentences and tokenized using the Stanford CoreNLP tool. Note that the Stanford CoreNLP tokenizer does not split text at the dash (-) character. However, in biomedical documents, chemical and gene/protein entity names and other words are often combined into one token with a dash character. For example, "ephrinB-EphB" is annotated as two entities (i.e., "ephrinB" and "EphB"), while "CD3-binders" is annotated with only "CD3" as an entity. To address these cases, we broke the text into separate segments at the dash character (e.g., "ephrinB-EphB" is split into three tokens: "ephrinB", "-" and "EphB"). The experimental results show that this processing can improve the performance of our system.
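This dash-splitting step can be sketched as a small post-processing pass over the tokenizer output (a minimal illustration; the function name and list-based interface are our own, not part of Stanford CoreNLP):

```python
import re

def split_dashes(tokens):
    """Break tokens at dash characters, keeping each dash as its own token,
    so names joined by '-' become separate candidate entities."""
    out = []
    for tok in tokens:
        if "-" in tok and len(tok) > 1:
            # e.g. "ephrinB-EphB" -> ["ephrinB", "-", "EphB"]
            out.extend(part for part in re.split(r"(-)", tok) if part)
        else:
            out.append(tok)
    return out
```

Applied after tokenization, "CD3-binders" then yields the tokens "CD3", "-" and "binders", so "CD3" can be tagged on its own.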
An example of all features (illustrated on the sample text "substituted piperidines with selective binding histamine receptor", where each word is decomposed into its character sequence for the character-level features)
Word embedding, also known as distributed word representation, can capture both the semantic and syntactic information of words from a large unlabeled corpus and has attracted considerable attention from many researchers. Compared with the bag-of-words representation, word embeddings are low-dimensional and dense. In recent years, several models, such as word2vec and GloVe, have been proposed and widely used in the field of NLP. To achieve a high-quality word embedding, we downloaded a total of 1,918,662 MEDLINE abstracts from the PubMed website as unlabeled data. Then these data and all datasets provided in the BioCreative V.5 CEMP and GPRO tracks (the training set comprises 21,000 abstracts, and the test set comprises 9000 abstracts) were used to train the pre-trained word embedding with the word2vec tool using the skip-gram model.
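As an illustration of how the skip-gram model frames training (a schematic sketch only; the actual embeddings were trained with the word2vec tool), each word is paired with the words in a fixed-size window around it to form (center, context) training pairs:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) word pairs as used by skip-gram training:
    each word is paired with every word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

The skip-gram objective then trains embeddings so that a center word predicts its context words well, which is what lets the vectors capture semantic and syntactic regularities.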
In addition to the word embedding, character-level features of a name contain rich structural information about the entity. Such features (e.g., character n-grams, prefixes and suffixes) are commonly employed in current NER methods. Unlike previous traditional methods in which character features are hand-engineered, character embeddings can be learned during training. Character embeddings have been found useful for many NLP tasks: they can not only learn interior representations of entity names, but also alleviate the out-of-vocabulary problem. In our model, a bidirectional long short-term memory (BiLSTM) is used to obtain the character-level feature. First, a character lookup table, which contains a character embedding for every character, is initialized randomly. The sequence of characters in a word is transformed into a sequence of embeddings with fixed length L, where L is the maximum length over all words. If a word is shorter than L, we pad it with zero embeddings. Then the character embeddings corresponding to the characters in a word are given in both forward and reverse order to a BiLSTM. Further, we used a separate lookup table to add a capitalization feature, since capitalization information is erased in the word and character embeddings. The capitalization feature takes one of the following options: allCaps (all characters in the word are uppercase), firstCaps (only the first character is uppercase), lower (all characters are lowercase) and others (any other case). Finally, the concatenation of the forward and backward representations from the BiLSTM and the capitalization feature is used as the character-level feature of the word.
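The four-way capitalization feature described above can be computed deterministically per word; a minimal sketch (the function name is illustrative):

```python
def cap_feature(word):
    """Map a word to one of the four capitalization classes used as a
    separate embedding lookup: allCaps, firstCaps, lower or others."""
    if word.isupper():
        return "allCaps"
    if word[:1].isupper() and word[1:].islower():
        return "firstCaps"
    if word.islower():
        return "lower"
    return "others"
```

Mixed-case names such as "EphB" fall into the "others" class, which is common for gene/protein symbols.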
Due to the complexity of natural language and the specialty of the biomedical domain, linguistic and domain features are often employed in traditional machine learning methods for biomedical NER [7, 9]. We also explored the effect of linguistic features (such as POS and chunking features). The POS and chunking information of each word was generated by the GENIA tagger (http://www.nactem.ac.uk/GENIA/tagger/). In addition, the named entity tag information (including protein, DNA, RNA, cell line and cell type entities) generated by the GENIA tagger was also used as a feature, and the NER feature of each token was encoded in the BIO tagging scheme. In our experiments, three different lookup tables, initialized randomly, were used to output the POS, chunking and NER embeddings, respectively.
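The BIO encoding of the NER feature can be illustrated as follows (a sketch under our own span convention, in which entities are given as half-open token ranges with a type label):

```python
def to_bio(tokens, entities):
    """Encode entity spans over a token list in the BIO scheme:
    B- marks the first token of an entity, I- the remaining tokens,
    and O marks tokens outside any entity."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags
```

For instance, a two-token protein name at the start of a sentence receives the tags B-protein and I-protein, while the rest of the tokens receive O.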
This can be computed using dynamic programming, and the Viterbi algorithm is chosen for this inference.
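A minimal sketch of Viterbi decoding over per-position emission scores and tag-transition scores (the dictionary-based interface is illustrative, not our actual implementation):

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence by dynamic programming.
    emissions: list of {tag: score} dicts, one per token position.
    transitions: {(prev_tag, tag): score} dict over all tag pairs."""
    tags = list(emissions[0])
    # best score so far for each tag at the current position
    score = {t: emissions[0][t] for t in tags}
    back = []  # backpointers: best previous tag for each tag, per position
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            prev, s = max(
                ((p, score[p] + transitions[(p, t)]) for p in tags),
                key=lambda x: x[1],
            )
            new_score[t] = s + em[t]
            ptr[t] = prev
        score = new_score
        back.append(ptr)
    # backtrack from the best final tag to recover the full sequence
    best = max(score, key=score.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The complexity is O(n·k²) for n tokens and k tags, which is what makes exact CRF inference tractable.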
The main hyper-parameters of our model and the candidate values explored:
- Word embedding dimension: 50, 100, 200
- Character embedding dimension
- Character-level BiLSTM state size
- Capitalization embedding dimension
- POS embedding dimension
- Chunking embedding dimension
- NER embedding dimension
- Word-level BiLSTM state size: 50, 100, 200
- SGD learning rate: 0.01, 0.005, 0.001
For performance optimization, we also employed several common post-processing steps including tagging consistency, abbreviation resolution and bracket balance.
If the number of occurrences of a word sequence tagged by our model as a biomedical entity exceeds 50% of its total occurrences in a document (title and abstract), all occurrences of the word sequence are tagged as entity mentions. For example, if our BiLSTM-CRF model found three gene/protein mentions of "nociception receptor" but missed two other mentions of it in a document, the missed mentions would be retrieved.
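This tagging-consistency step can be sketched over character-offset spans (an illustrative simplification assuming exact string matching; the function name is our own):

```python
import re

def tagging_consistency(text, spans):
    """spans: (start, end) character offsets tagged as entities.
    If a mention string is tagged in more than 50% of its occurrences
    in the document, tag all of its occurrences."""
    result = set(spans)
    for s, e in spans:
        mention = text[s:e]
        occ = [m.span() for m in re.finditer(re.escape(mention), text)]
        tagged = sum(1 for o in occ if o in result)
        if occ and tagged / len(occ) > 0.5:
            result.update(occ)
    return sorted(result)
```

With two of three occurrences of a name already tagged, the third occurrence is added; with only one of three tagged, nothing changes.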
For abbreviation resolution, all local abbreviation definitions, such as "protease-activated receptor 1 (PAR1)", are found. If the abbreviation (i.e., "PAR1") in the definition was tagged by our model, then all instances of the abbreviation in the document are tagged.
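A sketch of this abbreviation-resolution step, assuming entities are given as character-offset spans and abbreviations are single parenthesized tokens (a simplification of general abbreviation detection; the function name is our own):

```python
import re

def resolve_abbreviations(text, spans):
    """spans: (start, end) character offsets tagged as entities.
    For each parenthesized short form "... long form (PAR1)", if the short
    form inside the parentheses was tagged, tag every occurrence of it."""
    result = set(spans)
    for m in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]*)\)", text):
        abbr_span = (m.start(1), m.end(1))
        if abbr_span in result:
            abbr = m.group(1)
            result.update(o.span() for o in re.finditer(re.escape(abbr), text))
    return sorted(result)
```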
When a mention has unbalanced brackets (such as parentheses, square brackets or curly brackets), we attempt to balance the brackets by adding or removing characters to the right or left of the mention. For example, if "OGP(10" (followed in the text by "− 14)") was tagged as a mention by our model, the mention would be extended to include the right parenthesis (i.e., "OGP(10–14)"). If the unbalanced bracket is the first or last character of the entity tagged by the model (e.g., "(nNOS"), the bracket is simply discarded.
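The bracket-balancing rules can be sketched as follows (simplified to round brackets only; span offsets are half-open and the function name is our own):

```python
def balance_brackets(text, start, end):
    """Repair a tagged span with unbalanced parentheses: drop an unmatched
    bracket at the edge of the span, otherwise extend the span to the next
    closing bracket in the text."""
    mention = text[start:end]
    if mention.count("(") > mention.count(")"):
        if mention.startswith("("):
            return start + 1, end            # "(nNOS" -> "nNOS"
        nxt = text.find(")", end)
        if nxt != -1:
            return start, nxt + 1            # "OGP(10" -> "OGP(10-14)"
    elif mention.count(")") > mention.count("("):
        if mention.endswith(")"):
            return start, end - 1
    return start, end
```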
In this section, first the experimental datasets and settings are introduced, and then the experimental results and discussion are presented.
Experimental datasets and settings
CEMP and GPRO corpora overview
GPRO type 1 mentions
GPRO type 2 mentions
The effect of the different ratios of positive and negative documents
Full training set
On the CEMP corpus, there is little difference among the F-scores of these models. The reason is that only a small number of documents do not contain chemical entities. On the GPRO corpus, the model achieves the best performance, with an F-score of 75.95%, when the numbers of positive and negative documents in the training set are equal. When the number of positive documents exceeds the number of negative documents, more token sequences are predicted as entities; in this case, the model performs worse owing to a significant drop in precision. When the number of negative documents exceeds the number of positive documents, the model also performs worse, owing to a significant drop in recall. In the following experiments, the full CEMP training set is used to train the models, while the balanced version of the GPRO training set is used.
The effect of the model components on the development set
The effect of our baseline components on our development sets
− Character embedding
− Capitalization feature
− CRF layer
Similar results were observed on both the CEMP and GPRO corpora. The results show that each component contributes to a different degree. Among them, the CRF layer makes the most significant contribution. After the CRF layer is removed, the F-score decreases by 3.31% and 5.81% on the CEMP and GPRO development sets, respectively. This demonstrates that, although the BiLSTM can handle sequential data and learn long-range context information, the performance of the model can still be further improved by considering the dependencies between output labels (which is implemented with the CRF layer). In addition, the character embedding is also important. Removing the character embedding leads to a decrease in F-score of 1.41% and 1.73% on the CEMP and GPRO development sets, respectively. The reason is that character information can not only capture interior representations of entity names, but also alleviate the out-of-vocabulary problem. Moreover, the post-processing can slightly improve the performance of our model.
The effect of additional features on the development set
The effect of additional features on our development sets
+ POS feature
+ Chunking feature
+ NER feature
+ All features
When the additional features are added, the models achieve slightly lower F-scores than the baseline on the CEMP corpus. The plausible reason is that the deep neural network itself has learned sufficient high-level and abstract features automatically from the word and character embeddings given the large training set, while noise may be introduced into the models by the errors of the NLP tools, which decreases the performance of the models. On the GPRO corpus, when only the POS feature is added, a higher F-score (an improvement of 0.23% over the baseline) is achieved. The main reason is that POS information can help boost the precision of the baseline; for example, most entities are nouns rather than verbs. When only the chunking feature is added, the model achieves a slight improvement (0.06% in F-score). The main reason is that some entity boundary errors can be corrected by the chunking information, even though some of the chunking information generated by the GENIA tagger is erroneous. Introducing the NER feature alone also improves the performance (by 0.30% in F-score), which demonstrates that the prior entity information provided by the GENIA tagger can help boost the performance. When all the additional features are added to the baseline, the best performance (an improvement of 0.81% in F-score) is achieved. Compared with the GPRO training set, the CEMP training set contains more entity mentions (99,623 chemical mentions vs 17,751 gene/protein mentions), and the additional features are more helpful for a small training set than for a large one. For the GPRO task, the different kinds of additional features contribute complementary information, and introducing them into our baseline can further improve the performance.
Performance comparison with other participants on the test set
Performance comparison with other participants on the test sets (the best runs per team)
Examples of gene/protein named entity recognition errors:
- Incorrect boundary: "And in the treatment of diseases and conditions that are mediated by AXL receptor tyrosine kinase"
- Missing gene/protein mention: "Combination of C1-INH and lung surfactant for the treatment of respiratory disorders"
- Not a gene/protein mention: "Application of tumor inhibitor MLN4924 to preparation of antiviral drug"
For the incorrect boundary error, most cases occur where a gene/protein is nested within a larger gene/protein mention (e.g., our model predicts "AXL" as a mention, but the correct mention is "AXL receptor tyrosine kinase" in Table 8). The main reason may be that the annotated training set contains tagging inconsistencies. For example, "5-ht2a" within the string "5-ht2a serotonin receptor" is annotated as an entity in the document with ID CN101871931A, while the whole string "5-ht2a serotonin receptor" is annotated as the entity in the document with ID WO2006060762A3. For the missing gene/protein mention error, the reason is that our model cannot detect an entity without sufficient context information. In the example in Table 8, "C1-INH" is the abbreviation of "C1 esterase inhibitor" in the document, but it is difficult for our model to detect the entity in the sentence without sufficient information. In addition, we observed that many strings with similar surface forms and strong gene/protein indicators are falsely identified as gene/protein mentions. For example, "MLN4924" consists of uppercase letters and numbers, and its context contains the strong gene/protein indicator "inhibitor"; our model incorrectly identified this chemical as a gene/protein mention. As can be seen from the above analysis, even though automatic learning of high-level features is an advantage of deep learning methods and the BiLSTM-CRF model can capture long-range dependencies, it is difficult for our model to automatically learn domain knowledge from raw text and to capture sufficient context information from a single sentence. Therefore, more contextual information from the document and external knowledge could be incorporated to improve our model.
In this paper, we present our deep learning system for the chemical and gene/protein NER tasks in the BioCreative V.5 CEMP and GPRO tracks. In our approach, a BiLSTM-CRF model is employed to recognize biomedical entities from patents. Moreover, the effect of additional features (such as POS, chunking and NER features) for the neural network model is investigated. The experimental results show that the additional features effectively improve the performance of our system for the GPRO track, and our system achieves state-of-the-art performances on both the CEMP and GPRO corpora. This demonstrates the effectiveness of our approach for the biomedical NER task in patents. However, our error analysis suggests that our system can be further improved by considering more contextual information at the document level (not only at the sentence level) and external knowledge, which will be explored in our future work.
LL designed the algorithm, conducted the experiments and drafted the manuscript. ZY provided the initial ideas and revised the manuscript. PY participated in the model designs and the experiments. LW provided biomedical support and revised the manuscript. YZ, JW and HL commented on algorithm designs. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
This work was supported by the grants from the National Key Research and Development Program of China (No. 2016YFC0901902, funding body: Ministry of Science and Technology of China), Natural Science Foundation of China (Nos. 61272373, 61572102 and 61572098, funding body: National Natural Science Foundation of China), and Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084, funding body: Ministry of Education of China).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Smith L, Tanabe LK, nee Ando RJ, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K (2008) Overview of BioCreative II gene mention recognition. Genome Biol 9(2):S2
- Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A (2015) CHEMDNER: the drugs and chemical names extraction challenge. J Cheminformatics 7(1):S1
- Wei CH, Peng Y, Leaman R, Davis AP, Mattingly CJ, Jiao L, Wiegers TC, Lu Z (2016) Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database J Biol Databases Curation 2016:baw032
- Krallinger M, Rabal O, Lourenço A, Perez MP, Rodriguez GP, Vazquez M, Leitner F, Oyarzabal J, Valencia A (2015) Overview of the CHEMDNER patents task. In: Proceedings of the fifth BioCreative challenge evaluation workshop, pp 63–75
- Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 21(14):3191–3192
- Campos D, Matos S, Oliveira JL (2013) Gimli: open source and high-performance biomedical name recognition. BMC Bioinform 14(1):54
- Wei C-H, Kao H-Y, Lu Z (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed Res Int 2015:918710
- Leaman R, Wei C-H, Zou C, Lu Z (2016) Mining chemical patents with an ensemble of open systems. Database 2016:baw065
- Leaman R, Wei C-H, Lu Z (2015) tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminformatics 7(1):S3
- LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
- Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
- Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. In: Proceedings of NAACL-HLT 2016, pp 260–270
- Ma X, Hovy E (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354
- Li L, Jin L, Jiang Z, Song D, Huang D (2015) Biomedical named entity recognition based on extended recurrent neural networks. In: 2015 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 649–652
- Sahu SK, Anand A (2016) Recurrent neural network models for disease name recognition using domain invariant features. arXiv preprint arXiv:1606.09371
- Luo L, Yang Z, Yang P, Zhang Y, Wang L, Lin H, Wang J (2018) An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34(8):1381–1388
- Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 55–60
- Lai S, Liu K, Xu L, Zhao J (2015) How to generate a good word embedding? arXiv preprint arXiv:1507.05523
- Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
- Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP 2014), vol 12, pp 1532–1543
- Wang X, Yang C, Guan R (2018) A comparative study for biomedical named entity recognition. Int J Mach Learn Cybernet 9(3):373–382
- Rei M, Crichton GK, Pyysalo S (2016) Attending to characters in neural sequence labeling models. arXiv preprint arXiv:1611.04361
- Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
- Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
- Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5):602–610
- Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13(2):260–269
- Bottou L (1991) Stochastic gradient learning in neural networks. In: Neuro-Nîmes, vol 91, no 8, p 12
- Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
- Prechelt L (1998) Automatic early stopping using cross validation: quantifying the criteria. Neural Netw 11(4):761–767
- Pérez-Pérez M, Rabal O, Pérez-Rodríguez G, Vazquez M, Fdez-Riverola F, Oyarzabal J, Valencia A, Lourenço A, Krallinger M (2017) Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: the CEMP and GPRO patents tracks. In: Proceedings of the BioCreative V.5 challenge evaluation workshop, pp 11–18
- Pérez-Pérez M, Pérez-Rodríguez G, Blanco-Míguez A, Fdez-Riverola F, Valencia A, Krallinger M, Lourenço A (2018) Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm. J Cheminformatics (in press)
- Rabal O, Pérez-Pérez M, Pérez-Rodríguez G, Vazquez M, Fdez-Riverola F, Oyarzabal J, Valencia A, Lourenço A, Krallinger M (2018) Comparative assessment of named entity recognition strategies on medicinal chemistry patents for systems pharmacology. J Cheminformatics (in press)