CheNER: a tool for the identification of chemical entities and their classes in biomedical literature

Background Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult due to both the diverse morphology of chemical names and the alternative types of nomenclature that are simultaneously used to describe them. To address these issues, the last BioCreAtIvE challenge proposed the CHEMDNER task, a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text. Methods To address this challenge we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step to tag those chemical names in a corpus of Medline abstracts. We named our best performing system CheNER. Results We evaluate the performance of the various approaches using the F-score statistic. Higher F-scores indicate better performance. The highest F-score we obtain in identifying unique chemical entities is 72.88%. The highest F-score we obtain in identifying all chemical entities is 73.07%. We also evaluate the F-score of combining our system with ChemSpot, and find an increase from 72.88% to 73.83%. Conclusions CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents. In addition, CheNER may be used to derive new features to train newer methods for tagging chemical entities. CheNER can be downloaded from http://metres.udl.cat and included in text annotation pipelines.


Background
Scientific literature accumulates at a rate that makes it impossible for any biologist to extract all the relevant information from the multitude of available sources. For this reason, there is a keen interest in the development of systems that can automatically mine information from the text and provide that information to researchers.
Mining biologically important information from text is a two-step process: one must first identify the relevant entities in the documents and, subsequently, the relationships between those entities. Methods that fully automate both steps in a combined way with highly accurate results have yet to be developed. So far the focus has been mostly on creating and testing methods that perform one of the steps of the text-mining process (see for example [1][2][3][4][5][6][7][8]). This focus has been further promoted by initiatives such as the BioCreAtIvE challenge (BioCreAtIvE Workshops I, II, II.5, III, and IV, held in 2004, 2007, 2009, 2010, and 2013, respectively) [1][2][3][4][5].
The BioCreAtIvE challenge provides participating research teams with annotated literature corpora that enable a controlled comparison of the performance between the various competing methods for automated recognition of specific types of entities in biomedical documents. There are various BioCreAtIvE challenge tracks that focus on identifying various types of biologically relevant entities, such as genes and their functions, diseases, phenotypes, or chemical compounds. The importance of these chemical compounds arises from their involvement in regulating biological activity of proteins and genes, and from their potential use to treat pathological states.
Identifying chemical entities in biomedical textbooks, patents, articles, and other scientific documents is a challenging task. The difficulty arises from two main factors: the diverse morphology of chemical entities and the various types of nomenclature that are simultaneously used to describe them in biomedical documents [9]. These factors make it difficult to develop a single approach that can successfully identify all types of chemical mentions with high accuracy. Because of this, only a small number of applications are available to perform NER of chemical names [10][11][12][13][14][15][16][17][18][19][20][21][22]. In addition, many of these applications are not freely available to the community, as summarized in Table 1.
A detailed review of this subject can be found in [9].
The statistical methods used to identify chemical entities must be trained on appropriate and encompassing gold standard collections of documents (corpora) containing precisely annotated chemical entities [5]. Although quite useful, the existing corpora [15,16,28,29] that can be used for training those methods often limit the development of automatic annotation systems, because they are small and incompletely annotated. The DDI corpora contain a larger number of documents (766) and chemical entities (13029). However, they are only adequate for training methods that perform NER of pharmacological substances. Because of this, only the SCAI corpora could be considered a general gold standard covering a large class of chemical entities, containing a total of ~1550 abstracts with ~6600 annotated entities. However, the Medline corpus within the SCAI corpora only contains 100 Medline abstracts with 151 annotated IUPAC (International Union of Pure and Applied Chemistry) chemical names.
The latest round of the BioCreAtIvE challenge emphasized the importance of automated annotation of chemical entities in biomedical documents by setting up a track (CHEMDNER) to promote the development of more accurate methods to perform that annotation. In order to lift one of the main limitations in developing annotation methods, two new biological literature corpora with annotated chemical entities were provided for the community to use in training their methods. Each corpus contains 3500 documents, with approximately 29500 annotated chemical entities, divided into several classes: SYSTEMATIC, TRIVIAL, FAMILY, FORMULA, ABBREVIATIONS, IDENTIFIERS, MULTIPLE, and NO CLASS. The corpora developed by BioCreAtIvE IV are significantly larger than the SCAI corpora [15,16] and the DDI corpora [28,29] that were freely available for the training and testing of applications that perform chemical NER. Our team had previously developed CheNER, a tool that automatically and specifically tags IUPAC chemical names in documents [22]. CheNER uses CRFs based on Mallet [30] to identify the IUPAC names and achieves F-scores higher than 70% in the SCAI corpora [15,16]. Given that the IUPAC nomenclature is only one of the many that are used, we took the opportunity provided by the BioCreAtIvE IV organizers to further develop CheNER so that it specifically identifies and tags the different classes of chemical names.
In this paper we report the development of this improved version of CheNER and analyse its performance. We implemented and tested a set of approaches that combine dictionary matching, linear CRFs and regular expressions in different ways to tag chemical entities according to their nomenclature classes in the biomedical literature. We find that the approach with the highest performance implements a CRF that is trained to simultaneously identify the individual classes of chemical entities. Our system is freely available at http://metres.udl.cat and can be easily integrated in pipelines to annotate large bodies of literature. To our knowledge, CheNER is unique with respect to other chemical entity annotation programs that were presented during the challenge because CheNER groups the chemical terms it annotates into the various classes of chemical names.

Materials & methods
Our set of approaches combines CRFs, dictionary matching, and regular expression matching in five different ways (Table 2; also see below for details). We defined two different taggers: a CRFs tagger and a Regular Expression tagger (which includes the dictionary and regular expression approaches).

CRF implementation
In the original development of CheNER we systematically tested how order, offset conjunction, and tokenization affected the performance of the CRF [22]. Based on those tests we decided to use linear-chain, 2nd-order CRFs, with an offset conjunction value of 1 and tokenization by spaces in the development of the current CheNER version. We note that punctuation marks at the end of the tokens are not taken into account when extracting their features. All CRFs for the current work were implemented using Mallet [30], and trained using the training corpus provided by the BioCreAtIvE organizers, containing 3500 abstracts with ~29500 annotated entities.
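The tokenization described above can be illustrated with a minimal Python sketch. It is not CheNER's own code (which builds on Mallet); the function name `tokenize` and the set of characters in `TRAILING_PUNCT` are assumptions made for illustration. Each token keeps its character offsets, and a punctuation-stripped form is kept alongside it for feature extraction.

```python
# Assumed set of trailing punctuation marks; the source does not list them.
TRAILING_PUNCT = ".,;:!?"


def tokenize(text):
    """Split text on whitespace and return (token, start, end, core) tuples.

    `core` is the token without trailing punctuation, i.e. the form that
    would be used to extract features in this sketch.
    """
    tokens = []
    pos = 0
    for raw in text.split():
        start = text.index(raw, pos)  # locate the token after the previous one
        end = start + len(raw)
        core = raw.rstrip(TRAILING_PUNCT) or raw  # keep bare punctuation tokens
        tokens.append((raw, start, end, core))
        pos = end
    return tokens


if __name__ == "__main__":
    for token in tokenize("Treated with aspirin, then rested."):
        print(token)
```

Keeping the original offsets next to the stripped form means that a tagged token can still be mapped back to its position in the abstract.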

Word features, regular expressions, and dictionaries
The features originally used to train CheNER's CRF [22] were also used in the current work. However, we note that the first version of CheNER was developed to specifically identify IUPAC chemical names. The BioCreAtIvE IV CHEMDNER track that CheNER participated in called for identifying and annotating all types of chemical entities. To accommodate this we added the features described in Table 3 to the training process. These features were chosen because they have been previously identified as the subset of features that best discriminates chemical names [15,16]. Given that several classes of chemical names present either a very regular structure or a finite set of names, we wanted to see if using regular expressions and/or dictionaries to identify the entities for those classes would perform as well as using CRFs. The classes for which we wanted to test this were TRIVIAL, FAMILY, ABBREVIATION, FORMULA, and IDENTIFIER chemical names. The regular expressions that were defined to train our system in the runs that combine CRFs and Regular Expression taggers are also summarized in Table 3. FORMULA chemical names were identified in these runs by using regular expressions describing patterns containing atomic elements, SMILES, etc. The dictionaries used to identify TRIVIAL, FAMILY, and ABBREVIATIONS in the relevant runs were built from a non-redundant list of the entities from each class annotated in the corpora provided by the BioCreAtIvE organizers and in the SCAI corpora, and also by extracting the names of chemical entities from http://www.drugs.com/. In total, these dictionaries have ~9100 terms, with ~6400 for the TRIVIAL dictionary, ~1300 for the ABBREVIATION dictionary and 1400 for the FAMILY dictionary. To identify SYSTEMATIC names using a CRF, we used regular expressions to define patterns that identify morphological structures such as isomers (e.g., 3,5,4'-trihydroxy-trans-stilbene), as well as the expressions used in [22].
We note that regular expressions or dictionary words used to identify any type of chemical entity by the Regular Expression tagger were also used as a feature to identify the same type of entities by the CRFs tagger in the relevant runs.
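As an illustration of the kind of token features listed in Table 3, the following Python sketch computes a length class, character-type counts, a few regular-expression flags, and list-membership flags for a single token. The patterns, the tiny word lists, and all function names are hypothetical stand-ins; the actual expressions and lists used by CheNER are not reproduced here, and treating the 5 to 15 range as inclusive is an assumption.

```python
import re

# Illustrative patterns only; CheNER's real expressions are not shown in the text.
HAS_DASH = re.compile(r"-")
ALL_CAPS = re.compile(r"^[A-Z]+$")
HAS_DIGIT = re.compile(r"\d")
HAS_GREEK = re.compile(r"[α-ωΑ-Ω]")

# Hypothetical miniature lists standing in for the ~3300 name segments
# and ~550 stop words mentioned in Table 3.
NAME_SEGMENTS = ("hydroxy", "methyl", "chlor")
STOP_WORDS = frozenset({"the", "of", "and", "with"})


def length_class(token):
    """Short (<5 characters), Medium (5 to 15, assumed inclusive), else Large."""
    n = len(token)
    if n < 5:
        return "Short"
    if n <= 15:
        return "Medium"
    return "Large"


def char_counts(token):
    """Count upper-case, lower-case, digit, and other characters."""
    counts = {"upper": 0, "lower": 0, "digit": 0, "other": 0}
    for ch in token:
        if ch.isupper():
            counts["upper"] += 1
        elif ch.islower():
            counts["lower"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1
    return counts


def token_features(token):
    """Build a feature dictionary for one token."""
    lowered = token.lower()
    features = {
        "length": length_class(token),
        "has_dash": bool(HAS_DASH.search(token)),
        "all_caps": bool(ALL_CAPS.match(token)),
        "has_digit": bool(HAS_DIGIT.search(token)),
        "has_greek": bool(HAS_GREEK.search(token)),
        "has_name_segment": any(seg in lowered for seg in NAME_SEGMENTS),
        "is_stop_word": lowered in STOP_WORDS,
    }
    features.update({f"n_{kind}": n for kind, n in char_counts(token).items()})
    return features


if __name__ == "__main__":
    print(token_features("3,5,4'-trihydroxy-trans-stilbene"))
```

In a CRF setting, a dictionary like this would be produced for every token and passed to the sequence model, so that a regular expression or dictionary hit acts as one feature among many rather than as a final decision.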
It is likely that overall performance of our system would improve by including additional dictionaries such as ChEBI [31,32], Jochem [33] and PubChem [34]. However, the deadlines of the BioCreAtIvE challenge made it impossible to develop a reasonable way to correctly attribute class type to each entity in these dictionaries, and class attribution was a differential feature that we wanted CheNER to have.

Runs
We tested five different approaches (Runs) to Chemical NER, in order to see which approach works better in the global identification of the chemical names. Each of these Runs is described in Table 2.

Output
The output of the CRFs, dictionary, and Regular Expression taggers in each run is marked according to the IOB (In-Out-Beginning) labelling scheme [9]. This output is reformatted to the required specifications of the CDI (Chemical Document Indexing) and/or CEM (Chemical Entity Mention) output format.
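The IOB scheme can be sketched in a few lines of Python. The label format (`B-CLASS`, `I-CLASS`, `O`), the function name `to_iob`, and the representation of entities as token index ranges are assumptions made for illustration, not the exact format produced by CheNER.

```python
def to_iob(tokens, entities):
    """Return one IOB label per token.

    `entities` holds (start, end, cls) triples, where `start` and `end`
    are token indices (end exclusive) and `cls` is the entity class.
    """
    labels = ["O"] * len(tokens)
    for start, end, cls in entities:
        labels[start] = f"B-{cls}"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{cls}"          # remaining tokens of the entity
    return labels


if __name__ == "__main__":
    tokens = ["treated", "with", "sodium", "chloride", "and", "NaCl"]
    entities = [(2, 4, "SYSTEMATIC"), (5, 6, "FORMULA")]
    for token, label in zip(tokens, to_iob(tokens, entities)):
        print(token, label)
```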
The integration of the output from the various recognition approaches used in a run (CRF, dictionary, and regular expression matching) is done through a post-processing step. In this step we perform several clean-up actions, such as correcting unequal numbers of opening and closing brackets or detagging "action words" that are often appended to the end of chemical mentions, such as "-based", "-regulated", etc. This clean-up is done in the following way. Once the names are tagged by all the approaches, the system removes all mentions that match regular expressions designed to eliminate various classes of potential false positive entities. In addition, regular expression matching is also used to correct the mentions that contain "action words". Once this clean-up is done, the output of all approaches is merged and tagged using the IOB scheme (see Figure 1 for examples).
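A minimal Python sketch of such a clean-up step is given below. Only the suffixes "-based" and "-regulated" come from the text; the other suffixes, the false-positive pattern, the bracket heuristic, and all names are hypothetical choices made to keep the example small.

```python
import re

# "-based" and "-regulated" are mentioned in the text; the rest are assumed.
ACTION_SUFFIX = re.compile(r"-(?:based|regulated|induced|mediated)$", re.IGNORECASE)

# Hypothetical false-positive filter: bare numbers and percentages.
FALSE_POSITIVE = re.compile(r"^\d+(?:\.\d+)?%?$")

BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}


def fix_brackets(mention):
    """Trim unmatched brackets from the ends of a mention (simple heuristic)."""
    for opener, closer in BRACKET_PAIRS.items():
        while mention.count(closer) > mention.count(opener) and mention.endswith(closer):
            mention = mention[:-1]
        while mention.count(opener) > mention.count(closer) and mention.startswith(opener):
            mention = mention[1:]
    return mention


def post_process(mentions):
    """Drop likely false positives, then strip action words and fix brackets."""
    cleaned = []
    for mention in mentions:
        if FALSE_POSITIVE.match(mention):
            continue
        mention = ACTION_SUFFIX.sub("", mention)
        mention = fix_brackets(mention)
        if mention:
            cleaned.append(mention)
    return cleaned


if __name__ == "__main__":
    print(post_process(["platinum-based", "(NaCl", "42", "Fe(OH)3)"]))
```

The bracket heuristic only trims characters at the ends of a mention, so a mention with an unmatched bracket in the middle is left unchanged in this sketch.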

Evaluation of the results
The F-score is a standard way to evaluate the performance of NER methods [9]. It is given by the harmonic mean of precision and recall. We calculate the micro-averaged F-score of the individual Runs over the development and test corpora, which is the evaluation measure used by the BioCreAtIvE IV organizers. The micro-averaged performance is calculated by weighing equally every annotated entity in the corpus. To get the macro-averaged scores, each document is evaluated separately, and the resulting per-document scores are then averaged over the whole corpus. The calculations of precision, recall, and F-score are done using the evaluation library provided by the BioCreAtIvE IV organizers, downloaded from http://www.biocreative.org/resources/biocreative-ii5/evaluationlibrary/.
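The difference between the two averaging schemes can be made concrete with a short Python sketch. It is not the BioCreAtIvE evaluation library; the function names and the representation of annotations as one set of entity strings per document are assumptions for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall, and balanced F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def micro_macro_f(gold, pred):
    """Return (micro F, macro F) for dicts mapping doc id -> set of entities."""
    tp = fp = fn = 0
    per_document_f = []
    for doc in sorted(set(gold) | set(pred)):
        g, p = gold.get(doc, set()), pred.get(doc, set())
        d_tp, d_fp, d_fn = len(g & p), len(p - g), len(g - p)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn   # pooled counts (micro)
        per_document_f.append(prf(d_tp, d_fp, d_fn)[2])  # per-document F (macro)
    micro_f = prf(tp, fp, fn)[2]
    macro_f = sum(per_document_f) / len(per_document_f) if per_document_f else 0.0
    return micro_f, macro_f


if __name__ == "__main__":
    gold = {"d1": {"aspirin", "NaCl", "ethanol", "glucose"}, "d2": {"ATP"}}
    pred = {"d1": {"aspirin", "NaCl", "ethanol", "glucose"}, "d2": {"ADP"}}
    print(micro_macro_f(gold, pred))
```

In this toy example the first document is tagged perfectly and the second is entirely wrong, so the macro-averaged F-score is 0.5, while the micro-averaged F-score is 0.8 because four of the five annotated entities were found.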

Results & discussion
The evaluation of the systems presented at the BioCreAtIvE IV workshop was done by the organizers using a subset of 3000 abstracts within a test data set composed of 20000 abstracts, and calculating micro-averaged precision, recall, and balanced F-score. The performance of the systems was calculated with the BioCreAtIvE evaluation library.

Performance of the five runs
The performance of the systems implemented in each run was tested using the CHEMDNER development corpus in two different ways. On the one hand, we tested the performance of the system in identifying unique chemical entities in the documents of the corpus (CDI subtask). Table 4 summarizes the results, and we see that the system implemented in Run 5 has the highest F-score. On the other hand, we tested the performance of each system in identifying all mentions of chemical entities in the documents of the corpus (CEM subtask). Table 5 summarizes the results and, again, we see that the system implemented in Run 5 has the highest F-score. In addition, we see that the system implemented in Run 5 has similar performance in the two tasks, suggesting that it might be at the upper limit of performance for the set of features considered during the training of the CRFs. We remind readers that the system implemented in Run 5 uses a single CRF that simultaneously identifies both chemical entities and their classes.

Table 3 Examples of features and regular expressions used during the training of the chemical entity identification systems

Name of feature | Description
Length | Classifies tokens by length. If the length is less than 5, the token is Short. If the length is between 5 and 15, the token is Medium; otherwise, the token is Large.
Word class | Automatic generation of features in terms of the frequency of upper and lower case characters, digits, and other types of characters.
List | Automatic generation for every token that matches an element within the list. We used lists of basic name segments (~3300) and stop words (~550).
Regular expressions | Regular expressions that identify specific features, such as "contains dashes?", "is all caps?", or "contains numbers?". Regular expressions that identify specific types of characters that are more common in chemical entities than in other words, such as Greek letters, Roman numerals, etc. Regular expressions that match specific morphological features of chemical formulas, identifiers, and systematic chemical names. Regular expressions used in the post-processing step that filter out common names that are systematically tagged incorrectly by the systems.

What causes the differences in performance between the various approaches we use to identify chemical entities? For example, the approach in Run 3 has the lowest F-score in both subtasks, CDI and CEM. This run implements an individual CRF for each entity class. The CRF that identifies FORMULA chemical names tags a large number of false positives, leading to a very low recall. This is seen by comparing the results from Run 3 and Run 4. These two runs differ only in how the system identifies the FORMULA chemical names. We see that the identification of FORMULA chemical names using a single CRF decreases the recall by ~15% when compared to FORMULA identification using regular expressions. This suggests that the context where FORMULA names are often found in the text is not sufficiently informative to allow the CRF to appropriately rule out many false positives.
We see a similar effect in Run 2. This Run has an F-score closer to Run 3 in the CDI subtask, while its F-score in the CEM task is closer to that of the best system. This difference is due to the fact that the system missed more unique entities than systems using CRFs to identify FAMILY, ABBREVIATION, FORMULA and IDENTIFIER chemical names. However, the entities of these types identified by Run 2 are the most frequently repeated in the texts that are analyzed, which raises the F-score of this Run in the CEM task.
To summarize, using a separate CRF for each entity class leads to many false positives for each class, due to the similarity between the entity types. Replacing some CRFs with the direct use of Regular Expression taggers leads to a smaller number of entities being identified but improves the identification of the class for those entities, decreasing false positives. When a single CRF is used to tag all classes of entities (Run 5), this CRF can create a more accurate model for each class, thus improving the ability of the method to clearly identify the difference between the entity classes.
In the evaluation done for the BioCreAtIvE Challenge, the best system presented by CheNER achieves an F-score of 67.78% in the CDI task and an F-score of 63.74% in the CEM task. These scores are higher in the development corpus (72.08% F-score in the CDI task and 72.61% F-score in the CEM task). The version of CheNER we present in this work improves the original F-scores from the BioCreAtIvE workshop to 72.68% in the CDI task and 73.07% in the CEM task. This increase in F-score indicates that the new version of CheNER has an improved performance. Nevertheless, it would be important to calculate the performance for both tasks once the annotated test corpus becomes available, to make sure that performance has also improved in that corpus.

Merging the tagging results from different chemical NER tools
The systems with the highest F-score performance in the BioCreAtIvE challenge were trained by combining features that are derived from a human analysis of patterns in chemical names with features that are derived from the automated tagging of chemical entities by tools such as OSCAR or ChemSpot [35][36][37][38][39][40][41][42][43][44]. All these systems have F-scores that are 10%-15% higher than those of CheNER, which uses only human-derived features.
We wanted to see whether adding features derived from the automated tagging by CheNER to those combined systems could improve their performance. These features would, for example, be the annotated chemical names themselves. To test this directly we would have to include the output of CheNER ourselves into the tools described in [35][36][37][38][39][40][41][42][43][44] and measure the resulting F-Score. However, the relevant tools were not publicly available and this conclusive experiment could not be performed.
As an alternative test to see whether adding features derived from the automated tagging by CheNER to those combined systems might improve their performance, we merged the individual results of CheNER [22], OSCAR [13,14], and ChemSpot [21] in tagging the CHEMDNER development corpus. This allowed us to investigate whether or not the three programs identified largely overlapping sets of entities. We did this for the CDI subtask. The experiment was done in the following way. Each of the three tools was run on the CHEMDNER development corpus. The entities tagged by each tool were then filtered through the post-processing step described in Methods for CheNER. After post-processing, the precision, recall, and F-score were recalculated for the combinations of CheNER, OSCAR, and ChemSpot described in Table 6. We find that the performance of OSCAR and ChemSpot improves by a few percent when the post-processing step we developed is applied to the entities that they tag. However, this improvement is not enough to compensate for the low precision achieved by OSCAR.
If we compare Tables 4 and 6, we see that CheNER always outperforms the other two programs when they are run in their "out of the box" version, meaning that the tool can be downloaded from the Internet (http://metres.udl.cat/ in the case of CheNER) and used as is in annotation pipelines. In addition, Table 7 shows that combining CheNER and ChemSpot improves on the individual performance of either tool. However, combining both tools with OSCAR significantly decreases the F-score with respect to either CheNER or OSCAR. This is a consequence of the low precision shown by OSCAR.
Overall, our results show that combining the result list of CheNER and ChemSpot improves the performance of either tool (Tables 4, 6, 7). We find that there are 2643 annotated chemical entities that are only recognized by ChemSpot and 2893 annotated chemical entities that are only recognized by CheNER (Table 8). Taken together, the results from Tables 4, 5, 6, 7, 8 suggest that including CheNER in combination with ChemSpot could improve the performance of methods that combine several tools.
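The kind of comparison summarized in Table 8 can be sketched in Python as follows. Treating the merge as a per-document set union is an assumption, as are the function names and the toy data; the entity names below are invented examples, not results from the corpus.

```python
def merge_predictions(first, second):
    """Union of two taggers' entity sets, document by document."""
    documents = set(first) | set(second)
    return {doc: first.get(doc, set()) | second.get(doc, set()) for doc in documents}


def exclusive_true_positives(gold, first, second):
    """Count annotated entities recognized by only one of the two taggers."""
    only_first = only_second = 0
    for doc, annotated in gold.items():
        hits_first = annotated & first.get(doc, set())
        hits_second = annotated & second.get(doc, set())
        only_first += len(hits_first - hits_second)
        only_second += len(hits_second - hits_first)
    return only_first, only_second


if __name__ == "__main__":
    # Toy data: one document, two hypothetical tagger outputs.
    gold = {"d1": {"aspirin", "NaCl", "ATP"}}
    tagger_a = {"d1": {"aspirin", "NaCl", "water"}}
    tagger_b = {"d1": {"aspirin", "ATP"}}
    print(merge_predictions(tagger_a, tagger_b))
    print(exclusive_true_positives(gold, tagger_a, tagger_b))
```

A union raises recall whenever the two taggers find different annotated entities, but it also pools their false positives, which is consistent with the drop in F-score reported when a low-precision tool is added to the combination.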

Notes on the IV BioCreAtIvE Challenge
One of the most important outcomes of the BioCreAtIvE IV Challenge is the development of larger literature corpora that can be used for the training and evaluation of automated chemical entity annotation systems. Specifically, two corpora of 3500 abstracts each for training and development, and a test corpus containing more than 20000 abstracts, are invaluable resources for the development of better chemical annotation systems. However, even these corpora should be further curated and, to some extent, reannotated. This is because there is a small percentage of cases where the same chemical entities were either not consistently annotated over different abstracts or not recognized as chemical entities by the annotators (see Figure 2 for examples). In addition, there are still some problems with the normalization of chemical entity names in documents. The methods presented in this volume could greatly facilitate this process if a semi-automated reannotation approach is applied.

Conclusions
Here we presented CheNER, the latest version of our system for chemical entity tagging in biological literature. While the original version of CheNER only tagged IUPAC names, the current version tags and identifies various classes of chemical entities (see Figure 1 for an example), with a performance that is better than that of other comparable tools that can be downloaded from the internet and used "out of the box" (see Tables 4, 6, and 7 and references [5] and [35]). This version is a development over the one we presented at the IV BioCreAtIvE Challenge workshop, where we only presented early results from Runs 1, 2, 4 in the CDI subtask and Run 1 in the CEM subtask [5]. In addition to testing additional systems, we further refined the post-processing of the results, significantly improving our F-Score.
CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents that can be downloaded from http://metres.udl.cat and easily integrated in annotation workflows. Examples of how to perform this integration are provided on the website. The individual performance of CheNER could be further improved by expanding the dictionaries of chemical entities used in its training. In addition, CheNER may provide a valuable resource to automatically derive new features that could be used for training and improving the performance of newer methods for tagging chemical entities.