Annotation and detection of drug effects in text for pharmacovigilance

Thompson, Paul; Daikou, Sophia; Ueno, Kenju; Batista-Navarro, Riza; Tsujii, Jun’ichi; Ananiadou, Sophia

doi:10.1186/s13321-018-0290-y

Research article
Open access
Published: 13 August 2018

Journal of Cheminformatics volume 10, Article number: 37 (2018)

Abstract

Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. 
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/.

Background

Pharmacovigilance [1] (PV) is a vital activity that assesses drugs, in terms of risks and benefits, in order to validate and improve upon their safety and efficacy. Monitoring the effects of drugs is important to determine the effectiveness of treatments in patient populations. In addition, the identification and understanding of adverse drug effects (ADEs) is important for risk assessment. Information about ADEs is also important for stratified medicine, which seeks to identify how drug response may be stratified across patient subgroups [2].

PV is a well-established field, and its research outcomes are used in the creation and update of reference resources that record detailed, evidence-based information about drugs, their effects and interactions (e.g., [3,4,5,6,7]). Since it is estimated that at least 60% of adverse drug reactions are preventable [8], such resources are of critical importance in analysing the appropriateness, effectiveness and safety of prescription medicines. However, maintaining these resources is extremely difficult. Firstly, they are typically manually curated, meaning that they are reliant upon dedicated and intensive efforts of domain experts to carry out extensive surveys of literature and other relevant information sources. Secondly, they can never be considered to be complete, due to the constantly changing evidence about the effects of existing drugs and the development of new drugs, which may be reported in a wealth of different textual resources.

Given the growth of textual data, it is becoming impossible for domain experts to manually curate the information it contains in an efficient and timely way. Potentially vital information may remain hidden in the deluge of results returned when querying these sources. The difficulties in creating and maintaining comprehensive resources are highlighted in a recent survey of several frequently used drug interaction resources [9], which found several discrepancies amongst the resources, in terms of factors such as the scope of reactions covered, completeness of information about the reactions and consistency of information between the resources. Such inconsistencies could result in patient care being compromised.

In order to mitigate such issues, text mining (TM) techniques have been shown to provide a basis for more efficient solutions. They have been used to detect information relevant to drug effects in a range of potentially complementary information sources, including the scientific literature [10,11,12], electronic health records [13, 14], product labels [15, 16] and social media [17,18,19,20,21], along with pharmacokinetic [22,23,24,25,26] or genetic [27,28,29] evidence for these effects. TM methods have been used to semi-automate the curation of a number of databases in areas such as biomedicine [30], pharmacogenomics [31] and drug side effects [16], by increasing the efficiency with which articles of interest can be identified and/or by automatically locating relevant details within these articles. Furthermore, TM can be applied to the contents of such databases to uncover meaningful associations amongst drugs [32].

Particularly in the field of biocuration, much attention has been devoted to ensuring that TM methods can be effectively translated into tools that can increase the efficiency of curators’ tasks [30, 33,34,35]. Additionally, given the variability in curation tasks and working methods, efforts have been made to increase the flexibility with which different TM tools can be integrated within curation workflows [36,37,38].

The level of TM-driven support that can be provided to curators for particular tasks is dependent on the sophistication and performance of the tools that are available for a given domain/subject area. The development of such tools is typically reliant on the availability of annotated corpora, i.e., collections of texts manually marked up by domain experts with semantic information pertaining to a domain, which are used for training and evaluating text mining tools.

The levels of semantic annotation in different corpora determine the types of information that can be recognised by TM tools. Named Entities (NEs), i.e., semantically categorised words/phrases, such as drugs and disorders, form the basis for a number of more complex types of annotation. Several efforts have produced corpora annotated with such NEs and/or demonstrated how such corpora can be used to train machine learning (ML) tools to recognise NEs automatically to high degrees of accuracy [39,40,41,42,43,44,45,46,47,48,49,50].

In many cases, the annotation process involves linking each NE with a unique concept identifier in a domain-specific terminological resource. These resources include the UMLS Metathesaurus [51], the Medical Subject Headings (MeSH) [52], DrugBank [3] and ChEBI [53]. Such linking is important, because of the wide range of ways in which a given concept may be mentioned in text. For example, variant mentions of the disorder concept dyspnea may include spelling variations (dyspnoea), completely different forms (e.g., shortness of breath), abbreviations (e.g., SOB) and altered internal structures (e.g., breath shortness). Similarly, the medication etanercept may be mentioned in various forms, such as a brand name (Enbrel), a different form (Tumor Necrosis Factor Receptor IgG Chimera), or an abbreviation (ETN; TNFR:Fc). The NEs in several corpora relevant to PV include manually-assigned concept IDs (e.g., [44, 45, 54, 55]). Such corpora can facilitate the development of normalisation methods (e.g., [56,57,58,59]), which aim to automatically assign a concept ID in a given terminological resource to each NE. The challenge of this task is that, although terminological resources usually list some synonyms/variant forms for each concept, the range of variants that can actually appear in text is far larger, and mostly unpredictable. However, successful normalisation makes it possible to develop search tools that can automatically locate all mentions of a concept of interest in large collections of text, regardless of how the concept is mentioned.
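
The normalisation idea described above can be illustrated with a minimal dictionary-lookup sketch. It is not the method used by any of the cited systems: the concept identifiers (`CONCEPT:dyspnea`, `CONCEPT:etanercept`) are hypothetical placeholders rather than real UMLS, MeSH, DrugBank or ChEBI IDs, and the names `VARIANTS`, `LOOKUP` and `normalise` are introduced here purely for illustration. The variant lists are the ones given in the paragraph above.

```python
import re
from typing import Optional

# Hypothetical concept IDs; a real system would take identifiers from a
# terminological resource such as the UMLS Metathesaurus or DrugBank.
VARIANTS = {
    "CONCEPT:dyspnea": [
        "dyspnea", "dyspnoea", "shortness of breath", "SOB", "breath shortness",
    ],
    "CONCEPT:etanercept": [
        "etanercept", "Enbrel",
        "Tumor Necrosis Factor Receptor IgG Chimera", "ETN", "TNFR:Fc",
    ],
}


def _key(mention: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", mention.strip().lower())


# Invert the variant lists into a single mention -> concept ID lookup table.
LOOKUP = {_key(v): cid for cid, variants in VARIANTS.items() for v in variants}


def normalise(mention: str) -> Optional[str]:
    """Return the concept ID for a mention, or None if the variant is unlisted."""
    return LOOKUP.get(_key(mention))


if __name__ == "__main__":
    for m in ["Dyspnoea", "shortness  of breath", "TNFR:Fc", "breathlessness"]:
        print(m, "->", normalise(m))
```

The last mention, "breathlessness", is not in the variant list and so is left unmapped. This shows the limitation noted above: a fixed list cannot anticipate every form that appears in text, which is why learned normalisation methods are needed. Lower-casing also makes the abbreviation SOB indistinguishable from the ordinary word "sob", so a practical system would need context to disambiguate.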

Binary relation annotations link together pairs of NEs whose textual contexts connect them in specific ways. The wide range of corpora annotated with binary relations relevant to PV (summarised in Table 1) mainly concern interactions between pairs of drugs (see sentence S1), or different types of relationships between disorders and treatments, e.g., a treatment may improve, worsen or cause a disorder (see sentence S2).
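
As a rough illustration of what a binary relation annotation contains, the sketch below represents it as a typed link between two entity spans. The sentence, the placeholder drug names (`DrugA`, `DrugB`), the labels (`Drug`, `Drug_Interaction`) and the class and function names are all assumptions made for this example; they are not taken from any of the corpora in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # character offset of the first character of the span
    end: int    # character offset one past the last character
    label: str  # semantic category, e.g. "Drug" (hypothetical label set)
    text: str   # the covered text, stored for convenience


@dataclass(frozen=True)
class Relation:
    label: str    # relation type, e.g. "Drug_Interaction" (hypothetical)
    arg1: Entity
    arg2: Entity


def find_entity(sentence: str, phrase: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of `phrase` in `sentence`."""
    start = sentence.index(phrase)  # raises ValueError if the phrase is absent
    return Entity(start, start + len(phrase), label, phrase)


# Invented example sentence with placeholder drug names.
sentence = "DrugA increased the plasma concentration of DrugB."

drug_a = find_entity(sentence, "DrugA", "Drug")
drug_b = find_entity(sentence, "DrugB", "Drug")
relation = Relation("Drug_Interaction", drug_a, drug_b)

if __name__ == "__main__":
    print(relation.label, ":", relation.arg1.text, "<->", relation.arg2.text)
```

Storing character offsets rather than only the strings keeps each annotation tied to a specific position in the text, so repeated mentions of the same drug in one sentence remain distinguishable.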

Table 1 Summary of corpora annotated with relations relevant to PV
