Application of machine reading comprehension techniques for named entity recognition in materials science

Abstract

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. A large amount of scientific literature contains rich knowledge in the field of materials science, but manually analyzing these papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role because it can automatically extract entities in the field of materials science, which are of significant value in tasks such as building knowledge graphs. The sequence labeling methods typically used for named entity recognition in materials science (MatNER) often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose to convert the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method can effectively solve the challenge of extracting multiple overlapping entities by transforming it into the form of answering multiple independent questions. Moreover, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature by integrating prior knowledge from queries. With the MRC approach, state-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in the field of materials science, thus accelerating the development of the field.

Scientific contribution

We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in the field of materials science by transforming the sequence labeling task into an MRC task. This approach provides robust support for constructing knowledge graphs and other data analysis tasks.

Introduction

The field of materials science has witnessed a significant surge in research and literature in recent years. While scientific publications offer valuable and reliable data, manually analyzing a vast number of papers to extract essential information about materials is an arduous undertaking. Manual extraction is time-consuming and impedes researchers’ access to the information they need. Emerging technologies in natural language processing (NLP) offer promising solutions for extracting relevant information from scientific literature. Among them, automatically recognizing named entities in a given text is an important NLP task. In materials science, the identification of various materials, compounds, elements, and other entities is crucial for extracting and transforming materials science knowledge from unstructured texts. However, named entity recognition in materials science (MatNER) [1,2,3,4,5] is extremely challenging because the literature contains many kinds of entities and complex combinations of them, such as acronyms, misspellings, and synonyms of compound names.

In the early stages, named entity recognition (NER) mainly relied on rule-based and handcrafted-feature methods [3, 6,7,8,9,10]. These methods required manual definition of rules and feature templates and demanded extensive domain knowledge. Owing to the complexity and limitations of such rules and features, they adapted poorly to different languages and domains. As machine learning gained popularity, statistical and machine learning techniques were increasingly applied to NER. These approaches leverage annotated datasets to train models, enabling them to learn the statistical patterns and contextual information associated with entities in text. Common algorithms include Hidden Markov Models (HMM) [11], Conditional Random Fields (CRF) [12], and deep learning models. Deep learning has made significant progress in the field of NER [13,14,15,16,17]: deep models automatically learn text feature representations, extracting and classifying information through multi-layer neural networks. For example, Bidirectional Long Short-Term Memory networks (BiLSTM) [14], a variant of Recurrent Neural Networks (RNNs) [18], can capture the contextual dependencies between entities, improving the accuracy of NER. Furthermore, the emergence of large-scale pre-trained language models such as ELMo [19] and BERT [20] has greatly benefited NER. Owing to its strong performance, pre-training BERT on large corpora and fine-tuning it on target datasets has become the mainstream approach. In the field of materials science, Gupta et al. [21] used MatSciBERT, i.e., BERT pre-trained on a materials science corpus, to recognize materials science entities, and their method achieved SOTA performance on multiple materials science datasets. The ability of deep learning methods to learn features automatically makes them more competitive than feature-engineering methods.

Existing methods typically approach the MatNER task as a sequence labeling problem, training a model to assign labels to individual tokens in a given sequence. However, these methods have limitations in effectively capturing semantic information and addressing the nested entity problem. Motivated by the recent trend of formulating NLP tasks as machine reading comprehension (MRC) tasks [22,23,24,25,26,27], a MatSciBERT-MRC method based on the MRC framework is proposed in this study. In the MRC framework, each type of materials science entity is encoded through a natural language query and extracted from the given context by answering that query, thereby using the information in the training data more effectively and improving the generalization ability of the model. Recent studies have converted various NLP tasks into MRC tasks. For instance, Levy et al. [23] cast the relation extraction task as a QA task by parameterizing each relation type R(x,y) as a question Q(x), with y being the answer. Similarly, McCann et al. [24] achieved competitive performance by uniformly implementing ten different NLP tasks in a question answering framework. In NER specifically, Li et al. [26] applied BERT under the MRC framework to texts from general domains, while Sun et al. [27] achieved strong performance on biomedical texts.

To our knowledge, no specific study has focused on NER in materials science under the MRC framework. Herein, we aim to identify entities in materials science, which differs from previous research [26, 27]. Additionally, the impact of different MRC strategies on the MatNER task is explored. The performance of MatSciBERT-MRC was evaluated on five public materials science datasets and compared with traditional sequence labeling models. Experimental results show that MatSciBERT-MRC performs well in detecting various material names, compounds, elements, etc., achieving new SOTA performance. This research provides materials science researchers with a powerful tool for handling large-scale materials science literature and data more accurately and efficiently. Accurately identifying and extracting key information can accelerate the materials research process and open more possibilities for material design and discovery.

Methodology

Datasets construction

The input to a traditional sequence labeling task is a sequence \(X=\{{x}_{1},{x}_{2},... , {x}_{N}\}\), where \({x}_{i}\) represents the i-th word in the sequence. In this study, the labeled NER data needs to be transformed into triples of (Context, Query, Answer). The Context is a given input sequence \(X\), the Query is a query sentence designed for that sequence, and the Answer is the span of the target entity. The MRC task requires constructing a query sentence \({Q}_{y}\) to elicit the relevant information: for each entity type, keywords or phrases associated with label \(y\) are combined into a query sentence, whose length can be chosen according to the requirements of the task.
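As an illustration of this transformation, the following minimal sketch converts a BIO-tagged example into (Context, Query, Answer) triples; the `QUERIES` mapping is a hypothetical label-to-query table built from the annotation guidelines (cf. Table 1), and the answers are token-level index spans.

```python
# Hypothetical label -> query mapping derived from the annotation guidelines.
QUERIES = {
    "MAT": "Look up any inorganic solids or alloys, any non-gaseous elements.",
}

def bio_to_mrc(tokens, tags, queries=QUERIES):
    """Yield one (context, query, answer_spans) triple per entity type."""
    context = " ".join(tokens)
    for label, query in queries.items():
        spans, start = [], None
        for i, tag in enumerate(tags):
            if tag == f"B-{label}":           # a new entity opens here
                if start is not None:
                    spans.append((start, i - 1))
                start = i
            elif tag != f"I-{label}" and start is not None:
                spans.append((start, i - 1))  # the running entity closes
                start = None
        if start is not None:                 # entity runs to the end
            spans.append((start, len(tags) - 1))
        yield context, query, spans
```

For example, the tokens ["Thin", "films", "of", "GaN"] with tags ["O", "O", "O", "B-MAT"] yield the triple (context, MAT query, [(3, 3)]).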

Query generation

The generation of queries is recognized as a crucial step, as it encodes prior knowledge about the labels and ultimately influences the final results of MatNER tasks. In this study, the queries were created with reference to the annotation guidelines, i.e., the instructions that dataset producers provide to annotators to describe the label categories. These guidelines should be expressed in a general yet precise manner to eliminate ambiguity. Table 1 presents examples of queries we constructed for the Matscholar [1] dataset.

Table 1 Examples of constructed queries

Model details

In this study, BERT [20] was used as the model backbone, with MatSciBERT [21] providing the pre-trained weights, to identify entities in the field of materials science. Figure 1 depicts the implementation of the MatNER task using BERT in the MRC framework. Initially, the combined sequence {[CLS], \({q}_{1}\), \({q}_{2}\), …, \({q}_{m}\), [SEP], \({x}_{1}\), \({x}_{2}\), …, \({x}_{n}\)} is formed by concatenating the query \({Q}_{y}\) with the sequence X, where the special tokens [CLS] and [SEP] mark the start of the sequence and the boundary between query and context. The combined sequence is fed into the BERT model, which outputs the context representation. Since only predictions over the context are required, the query representations can be discarded, as they are not part of the prediction target.
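A minimal sketch of this encoding step with the Hugging Face Transformers API is shown below; the hub ID "m3rg-iitd/matscibert" for the MatSciBERT weights and the example sentence are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public checkpoint for MatSciBERT on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
encoder = AutoModel.from_pretrained("m3rg-iitd/matscibert")

query = "Look up any inorganic solids or alloys, any non-gaseous elements."
context = "Thin films of GaN were grown by molecular beam epitaxy."

# Encodes [CLS] q1 ... qm [SEP] x1 ... xn [SEP]; token_type_ids distinguish
# the query segment (0) from the context segment (1).
inputs = tokenizer(query, context, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, sequence_length, 768)

# Only the context positions are kept for span prediction; the query
# representations are discarded, as described above.
```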

Fig. 1 Using BERT to perform MatNER in the MRC framework

In the MRC framework, there are two prevalent approaches to span selection. The first employs a pair of n-class classifiers, one predicting the start index and the other the end index. These classifiers extract features using a pre-trained model such as BERT and output an n-dimensional probability distribution indicating the probability of each position being the start or end position; the highest-probability start and end positions are then selected to form a single span. The second approach uses two binary classifiers, one predicting for each position whether it is a start index and the other whether it is an end index. These binary classifiers likewise extract features using a pre-trained model, but output a binary probability distribution per position. This approach can output multiple start and end indices for a given context and query, making it possible to extract all entities relevant to the query. The second approach is utilized in this study, and a detailed explanation of its workings follows.

For start-index prediction, a linear layer followed by softmax estimates, for each token, the probability of its being a start index:

$${K}_{start} = linear(L{Q}_{start} ) \in {R}^{N\times 2}$$
(1)

where \(L\) is the context representation output by BERT, \({Q}_{start}\) is a learnable weight matrix, and each row of \({K}_{start}\) represents the probability distribution of the corresponding index being the start position of an entity.

The model then predicts the probability of each token being the corresponding end index, using the following formula:

$${K}_{end} = linear\left(L{Q}_{end}; softmax\left({K}_{start}\right)\right)\in {R}^{N\times 2}$$
(2)

To determine the end position of an entity for a given query, a second learnable weight matrix \({Q}_{end}\) is introduced. Each row of \({K}_{end}\) represents the probability distribution of the corresponding index being the end position of an entity.

For a given X, there may exist multiple possible start and end indices, and simply matching them by proximity is not reasonable. Instead, applying argmax to each row of \({K}_{start}\) and \({K}_{end}\) yields all candidate start and end indices, as determined by the following formulas:

$${I}_{start} =\{i|argmax({K}_{start}^{i}) = 1, i = 1, 2, \cdots , N\}$$
(3)
$${I}_{end} =\{j|argmax({K}_{end}^{j}) = 1, j = 1, 2, \cdots , N\}$$
(4)

Superscripts i and j are used to indicate the i-th and j-th rows of the matrix, respectively.
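A minimal PyTorch sketch of Eqs. (1)–(4) follows; reading the semicolon in Eq. (2) as feature concatenation is our assumption, and the hidden size simply follows BERT-base.

```python
import torch
import torch.nn as nn

class MRCSpanHead(nn.Module):
    """Binary start/end classifiers of Eqs. (1)-(4), sketched under the
    assumption that Eq. (2) concatenates L with softmax(K_start)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.start_head = nn.Linear(hidden_size, 2)      # Eq. (1)
        self.end_head = nn.Linear(hidden_size + 2, 2)    # Eq. (2)

    def forward(self, L: torch.Tensor):
        # L: (N, hidden_size) context representation from BERT.
        k_start = self.start_head(L)                                # (N, 2)
        start_probs = torch.softmax(k_start, dim=-1)
        k_end = self.end_head(torch.cat([L, start_probs], dim=-1))  # (N, 2)
        # Eqs. (3)-(4): positions whose argmax equals 1 are candidates.
        i_start = (k_start.argmax(dim=-1) == 1).nonzero(as_tuple=True)[0]
        i_end = (k_end.argmax(dim=-1) == 1).nonzero(as_tuple=True)[0]
        return k_start, k_end, i_start, i_end
```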

Datasets and experiment settings

Five NER datasets were considered, i.e., BC4CHEMD [28], Matscholar [1], NLMChem [29], and SOFC-Slot and SOFC [30], representing various text sources and questions related to materials science. Table 2 displays the statistics of all datasets used in this study; they were selected because they are publicly available and ensure a comprehensive evaluation of the proposed method.

Table 2 Statistics on datasets

The experiments in this study were conducted using Python 3.8.12 and torch 1.12.1. The models were trained on a single RTX 3060 GPU. Owing to computational constraints, many previous works in materials science have used the \({\text{BERT}}_{\text{BASE}}\) model; to ensure comparability with these works, all BERT models employed in this study were based on \({\text{BERT}}_{\text{BASE}}\) [20], which consists of 12 transformer layers, a 768-dimensional hidden layer, and a 12-head multi-head attention mechanism. For the specific hyperparameters used in the experiments, please refer to Table 3.

Table 3 The detailed hyper-parameters of MatSciBERT-MRC

Evaluation metrics

In the experimental phase, the F1-score was employed as the metric to assess the overall performance of the model. Furthermore, precision and recall were utilized to evaluate the model’s capability in accurately identifying positive and negative samples.

Precision is the ratio of correctly identified positive instances (true positives) to all instances identified as positive.

$$P=\frac{TP}{TP+FP}$$
(5)

Recall is the ratio of correctly identified positive instances to all instances labeled as positive.

$$R=\frac{TP}{TP+FN}$$
(6)

F1-score is the harmonic mean of precision and recall.

$$F1=\frac{2PR}{P+R}$$
(7)
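Eqs. (5)–(7) translate directly into code; the sketch below assumes entity-level counts of true positives (tp), false positives (fp), and false negatives (fn).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute P, R, and F1 from entity-level counts, per Eqs. (5)-(7)."""
    p = tp / (tp + fp) if tp + fp else 0.0       # Eq. (5)
    r = tp / (tp + fn) if tp + fn else 0.0       # Eq. (6)
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # Eq. (7)
    return p, r, f1
```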

Result and discussion

The effect of different BERT models on NER performance

To investigate the effect of different BERT variants on NER performance, MatSciBERT [21], BioBERT [31], and SciBERT [32] were evaluated on the NER task, with the aim of determining which model performs best for NER in materials science. The results are shown in Table 4. Overall, MatSciBERT achieved higher scores than BioBERT and SciBERT on all the datasets (p < 0.05). Unlike SciBERT, MatSciBERT was trained on a large corpus of materials science texts covering multiple research fields, journals, and data sources, and thus encompasses a broad body of materials science knowledge and domain-specific terminology. This gives MatSciBERT greater adaptability and accuracy when working with materials science-related texts.

Table 4 Performance comparison for different BERT Models

These experimental results indicate a significant difference between the materials science literature on which MatSciBERT was pre-trained and the biomedical literature on which BioBERT was pre-trained. Each scientific discipline evidently varies substantially in ontology, vocabulary, and domain-specific symbols. The pre-training corpus therefore plays a pivotal role in determining the model’s performance, and MatSciBERT, trained on an extensive collection of materials science publications, proves to be the better fit for the MatNER task in this experiment.

MRC vs sequence labeling frameworks

A detailed comparison was conducted between BERT models under the MRC framework and under the sequence labeling framework; the results are presented in Table 5. In MatSciBERT-Softmax, each token in the sequence is classified by applying the Softmax function to the MatSciBERT output. MatSciBERT-CRF learns the constraint relationships between labels through a CRF, ensuring the plausibility of the predicted label sequence and thereby yielding the best sequence labeling results. MatSciBERT-BiLSTM-CRF adds a BiLSTM-CRF layer to strengthen context learning, enabling the model to better capture semantic information in the context. Among the three sequence labeling models (i.e., MatSciBERT-CRF, MatSciBERT-BiLSTM-CRF, and MatSciBERT-Softmax), MatSciBERT-CRF achieves the best performance on four of the five datasets. It can be inferred that the CRF is able to learn the interdependencies between labels, which guarantees the plausibility of the predicted label sequence and significantly enhances the accuracy of entity recognition. On the BC4CHEMD dataset, however, MatSciBERT-BiLSTM-CRF performs best among the three sequence labeling models, possibly because BC4CHEMD is relatively large, allowing the BiLSTM to learn more contextual semantic information from the dataset.

Table 5 Performance comparison for different models

Unlike the above three methods, MatSciBERT-MRC turns the MatNER task into a machine reading comprehension problem and predicts the answer span \({x}_{start, end}\) from the input sequence \(X\) and query \({Q}_{y}\). As shown in Table 5, compared with the sequence tagging methods, MatSciBERT-MRC brings a substantial enhancement to the performance of the MatNER task (p < 0.05). By encoding crucial prior knowledge into the query, the MRC formulation effectively mitigates problems arising from sparse tagging, limited corpus size, and sentence length, leading to improvements on all the datasets. The experimental results clearly demonstrate that, in the domain of materials science, BERT identifies entities better within the MRC framework than within the sequence tagging framework.

The effect of different MRC strategies on NER performance

The influence of different span prediction strategies was also evaluated, specifically the effect of the end-index information in the MatSciBERT-MRC model on NER performance. To assess this, a baseline model called MatSciBERT-MRC-base was designed and compared with MatSciBERT-MRC. In implementation, MatSciBERT-MRC-base only replaces \({K}_{end}\) of MatSciBERT-MRC, described above, with the following formula, keeping everything else unchanged:

$${K}_{end} = linear\left(L{Q}_{end}\right)\in {R}^{N\times 2}$$
(8)
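Relative to the span-head sketch given earlier, the base variant simply drops the start-probability features from the end classifier, as in this minimal illustration:

```python
import torch
import torch.nn as nn

# MatSciBERT-MRC-base, Eq. (8): the end logits depend only on the encoder
# output L, not on softmax(K_start) as in Eq. (2).
hidden_size = 768
end_head_base = nn.Linear(hidden_size, 2)

L = torch.randn(32, hidden_size)   # dummy context representation, N = 32
k_end = end_head_base(L)           # (N, 2), independent of the start prediction
```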

A performance comparison between these two models is presented in Table 6. Both models have competitive average F1 scores, but overall MatSciBERT-MRC outperformed MatSciBERT-MRC-base on four of the five datasets (p < 0.05). This advantage likely arises because the model considers the start index when predicting the end index, allowing more accurate prediction of entity boundaries: the start index provides context that helps the model determine the most likely end position of an entity, thereby reducing boundary errors. Moreover, predicting the start and end indices independently can yield invalid spans, such as an end index preceding a start index or spans that do not correspond to valid entities. On the BC4CHEMD dataset, however, the base model performed better than MatSciBERT-MRC, possibly because the entity structure in this dataset is relatively simple, allowing the baseline model to capture it well.

Table 6 Performance comparison for different end index strategies

Based on the experimental findings, it has been observed that the model's performance can be influenced to a certain degree by the implementation of different end_index functions. Furthermore, considering the start index during the prediction of the end index has been found to enhance the overall performance of the model.

The data transformation required in the early stage of the MRC model reduces the number of entity tokens per example and leads to label imbalance. To address these issues, this study investigates the impact of different loss functions on model performance, comparing Focal Loss, CrossEntropy Loss, and Label Smoothing. Focal Loss tackles class imbalance by adjusting the weights of samples, focusing on difficult-to-classify samples. CrossEntropy Loss measures the accuracy of the model by calculating the difference between predicted results and true labels. Label Smoothing introduces some noise to soften the labels, thereby alleviating overfitting to the training data.
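As an illustration, a common form of Focal Loss for the token-level binary classifiers is sketched below; the focusing parameter gamma and weighting factor alpha are standard defaults, not values reported in this study, and a scalar alpha is used for simplicity.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal Loss over (N, 2) logits and (N,) integer targets (a sketch)."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t
    p_t = torch.exp(-ce)                                     # model confidence
    # Easy samples (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```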

In these experiments, three distinct loss functions were employed to evaluate the performance of the model. The results in Table 7 demonstrate that using Focal Loss leads to a measurable improvement in model performance (p < 0.05), which can be attributed to Focal Loss effectively addressing class imbalance and thereby enhancing the model’s ability to classify difficult samples. In conclusion, selecting an appropriate loss function is crucial for improving the performance of MRC models; when dealing with class imbalance, Focal Loss can be an effective choice.

Table 7 Performance comparison for different Losses

Our findings show that incorporating appropriate loss functions and span prediction strategies can significantly improve the performance of the model on imbalanced datasets.

The effect of different query constructs on NER performance

The structure of the query plays a crucial role in determining the final outcomes. In this section, different approaches to query construction and their implications were explored, using the label “MAT” in the Matscholar dataset as an example. Several common methods for query construction were employed (collected in the code sketch after this list), including:

  • Keywords: The query describes the label using keywords. For example, the query for the label MAT is “inorganic material”.

  • Rule-based template filling: Queries are generated using templates. The query for the label MAT is “Which inorganic material is mentioned in the text?”.

  • Wikipedia: Queries are constructed using Wikipedia definitions. The query for the label MAT is “Materials made from inorganic substances alone or in combination with other substances”.

  • Synonyms: Words or phrases with meanings identical or closely similar to the original keyword, extracted from the Oxford Dictionary. The query for the label MAT is “Inorganic material”.

  • Keywords + Synonyms: Keywords are combined with their synonyms.

  • Annotation guideline annotation: This is the method used in this paper. The query for the label MAT is “Look up any inorganic solids or alloys, any non-gaseous elements.”
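For reference, the six strategies above can be collected into a single label-to-query table; the combined form of the Keywords + Synonyms entry is an assumption, as the text does not spell it out.

```python
# Illustrative query constructions for the Matscholar label "MAT".
MAT_QUERIES = {
    "keywords": "inorganic material",
    "template": "Which inorganic material is mentioned in the text?",
    "wikipedia": ("Materials made from inorganic substances alone "
                  "or in combination with other substances"),
    "synonyms": "Inorganic material",
    "keywords_synonyms": "inorganic material; Inorganic material",  # assumed format
    "annotation_guideline": ("Look up any inorganic solids or alloys, "
                             "any non-gaseous elements."),
}
```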

The experimental results of our MatNER on the Matscholar dataset are presented in Table 8. The BERT-MRC model performs better than the BERT-Tagger model in all settings. The annotation guideline method outperforms the other methods because it provides clearer and more detailed label definitions (p < 0.05): these guidelines comprise the instructions that dataset producers give to annotators, making the label category descriptions explicit and specific and thereby reducing ambiguity and errors in annotation. In contrast, Wikipedia falls short of the annotation guideline method, which can be attributed to its relatively general definitions that may not precisely align with the specific data annotations required. These findings highlight the importance of query construction in MatNER tasks: carefully designed queries can significantly improve the performance of MatNER models.

Table 8 Performance comparison for different Query constructs

Performance comparison with other methods

In this section, the proposed method is compared with previous studies on five materials science datasets. We observed a significant improvement on these datasets compared to the previous SOTA models: the F1 scores on Matscholar, BC4CHEMD, NLMChem, SOFC and SOFC-Slot were 89.64%, 94.30%, 85.89%, 85.95% and 71.73%, respectively, representing improvements of +0.89%, +2.55%, +1.08%, +0.91%, and +1.39% over the previous SOTA performances.

These results demonstrate the superior performance of our method in the field of material science, surpassing previous benchmarks. This improvement is crucial for information extraction and entity recognition tasks in material science. Through experiments on these five datasets, the robustness, generality, and effectiveness of our method have been validated across multiple datasets and different scenarios (Table 9). In addition, to verify the effectiveness of our model, we have also conducted a five-fold cross-validation on the dataset. For more details, please refer to Table S2 in the supporting information.

Table 9 Performance comparison with other existing methods

Case study

A comprehensive case study was conducted to further investigate the distinctions between MatSciBERT-MRC and MatSciBERT-CRF; the outcomes are presented in Table 10. From the cases drawn from Matscholar, BC4CHEMD, and SOFC-Slot, we observe that the MatSciBERT-MRC model accurately demarcates entity boundaries, such as “UV-light illumination”, “docosahexaenoic acids”, and “ceria-based ceramics”. It can therefore be inferred that MatSciBERT-MRC successfully identifies words and phrases related to entity categories and provides accurate boundary information. In contrast, MatSciBERT-CRF has limitations in determining boundary information accurately, which may be attributed to the difficulties CRF models encounter with complex syntactic structures and boundary information. Furthermore, in the cases from NLMChem and SOFC, MatSciBERT-MRC identifies the “ceramic cell” entity, which MatSciBERT-CRF fails to capture, and corrects MatSciBERT-CRF's misidentification of the “Immunocytochemistry (ICC)” entity. This further validates the superiority of the MatSciBERT-MRC model in entity recognition tasks.

Table 10 Representative results of case study

In addition, since our materials datasets lack nested cases, we artificially created nested examples to evaluate both models. MatSciBERT-CRF can only recognize the “BWT-Pt catalysts” entity, while MatSciBERT-MRC also recognizes the “BWT-Pt” entity nested within it. MatSciBERT-MRC thus overcomes a limitation of sequence labeling architectures and handles nested entities efficiently. From these examples it can be inferred that MatSciBERT-MRC excels at accurately identifying entity boundaries while mitigating label inconsistency and resolving entity nesting issues, highlighting its robustness and practicality in various materials science information extraction scenarios. This advancement is not only pivotal for materials science but also has broader implications: the method can be adapted for named entity recognition in domains such as chemistry and the biosciences, where the MRC-based approach can effectively handle diverse texts, including those on chemical synthesis, chemical property analysis, and biological processes.

Conclusion

In summary, BERT in the MRC framework was employed for the named entity recognition in materials science (MatNER) task. Compared to BERT in the sequence labeling framework, BERT (i.e., MatSciBERT) in the MRC framework improves the recognition of target entities. Moreover, the MRC framework has the advantage of incorporating prior knowledge, and its performance can be further enhanced through query design. The proposed approach achieves SOTA performance on five MatNER datasets, and the results clearly indicate that utilizing BERT in the MRC framework with carefully designed queries can significantly improve the accuracy of MatNER models. By demonstrating the versatility and effectiveness of BERT in the MRC framework, our findings contribute to the development of more accurate and efficient natural language processing tools, which can be instrumental across a range of applications in materials science, chemistry, the biosciences, and other fields, enabling precise extraction of named entities from different types of scientific texts.

Availability of data and materials

The code of this study was written using PyTorch and the Transformers library and is available at the GitHub repository https://github.com/huilkq/MatsciBERT_MRC, which also includes the code for MatsciBERT_MRC usage and data processing. The code and datasets for training our model can be found in this repository to ensure the reproducibility of this work. Additionally, all the pre-trained models and datasets used for fine-tuning are publicly available.

References

  1. Weston L, Tshitoyan V, Dagdelen J, Kononova O, Trewartha A, Persson KA, Ceder G, Jain A (2019) Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J Chem Inf Model 59:3692–3702

  2. Isazawa T, Cole JM (2022) Single model for organic and inorganic chemical named entity recognition in ChemDataExtractor. J Chem Inf Model 62:1207–1213

  3. Leaman R, Wei C, Lu Z (2015) tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminformatics 7:S3–S3

  4. Eltyeb S, Salim N (2014) Chemical named entities recognition: a review on approaches and applications. J Cheminformatics 6:17

  5. Choudhary K, DeCost B, Chen C, Jain A, Tavazza F, Cohn R, Park CW, Choudhary A, Agrawal A, Billinge SJ, Holm E (2022) Recent advances and applications of deep learning methods in materials science. NPJ Comput Mater 8:1–26

  6. Rocktäschel T, Weidlich M, Leser U (2012) ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics 28:1633–1640

  7. Humphreys K, Gaizauskas R, Azzam S (1998) University of Sheffield: Description of the LaSIE-II System as Used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia

  8. Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J (2005) ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 6:S14

  9. Quimbaya AP (2016) Named entity recognition over electronic health records through a combined dictionary-based approach. Procedia Comput Sci

  10. Bikel DM, Schwartz R, Weischedel RM (1999) An algorithm that learns what’s in a name. Mach Learn 34:211–231

  11. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286

  12. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning; pp 282–289

  13. Luo L, Yang Z, Yang P, Zhang Y, Wang L, Lin H, Wang J (2018) An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34:1381–1388

  14. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics; pp 260–270

  15. Jagannatha AN, Yu H (2016) Structured prediction models for RNN based sequence labeling in clinical text. Proc Conf Empir Methods Nat Lang Process 2016:856–865

  16. Cho H, Lee H (2019) Biomedical named entity recognition using deep neural networks with contextual information. BMC Bioinformatics 20:735

  17. Strubell E, Verga P, Belanger D (2017) Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; pp 2670–2680

  18. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019

  19. Peters M, Neumann M (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics; pp 2227–2237

  20. Devlin J, Chang MW, Lee K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics; pp 4171–4186

  21. Gupta T, Zaki M, Krishnan NA, Mausam A (2022) MatSciBERT: a materials domain language model for text mining and information extraction. NPJ Comput Mater 8:102

  22. Shen Y, Huang PS, Gao J (2017) ReasoNet: learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; pp 1047–1055

  23. Levy O, Seo M, Choi E (2017) Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning; pp 333–342

  24. McCann B, Keskar NS, Xiong C (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730

  25. Li X, Yin F, Sun Z (2019) Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; pp 1340–1350

  26. Li X, Feng J, Meng Y (2020) A Unified MRC Framework for Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; pp 5849–5859

  27. Sun C, Yang Z, Wang L, Zhang Y, Lin H, Wang J (2021) Biomedical named entity recognition using BERT in the machine reading comprehension framework. J Biomed Inform 118:103799

  28. Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktäschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Žitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, et al. (2015) The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminformatics 7:S2

  29. Islamaj R, Leaman R, Kim S, Kwon D, Wei C, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z (2021) NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 8:91

  30. Friedrich A, Adel H, Tomazic F (2020) The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. In Proceedings of the 58th annual meeting of the association for computational linguistics; pp 1255–1268

  31. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240

  32. Beltagy I, Lo K, Cohan A (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); pp 3615–3620

  33. Shetty P, Rajan AC, Kuenneth C, Gupta S, Panchumarti LP, Holm L, Zhang C, Ramprasad R (2023) A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. NPJ Comput Mater 9:52–52

  34. Yoon W, So CH, Lee J, Kang J (2019) CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics 20:55

  35. Watanabe T, Tamura A, Ninomiya T, Makino T, Iwakura T (2019) Multi-task learning for chemical named entity recognition with chemical compound paraphrasing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); pp 6244–6249

  36. Leaman R, Lu Z (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32:2839–2846

  37. Peng Y (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistic; pp 58–65

  38. Tong Y, Zhuang F, Zhang H, Fang C, Zhao Y, Wang D, Zhu H, Ni B (2022) Improving biomedical named entity recognition by dynamic caching inter-sentence information. Bioinformatics 38:3976–3983

Acknowledgements

Not applicable.

Funding

This research was supported by the National Natural Science Foundation of China (No.32171314), Guangdong Basic and Applied Basic Research Foundation (2022A1515010671), Guangzhou Basic and Applied Basic Research Foundation (202201010371) and University Innovative Team Support for Major Chronic Diseases and Drug Development (26330320901).

Author information

Authors and Affiliations

Authors

Contributions

Z.H. conceived the study, participated in its design, developed the extension program, and drafted the manuscript. L.H. carried out calculations and helped draft the manuscript. Y.Y. and A.L. participated in data analysis and helped draft the manuscript. Z.Z. participated in study design. S.W. and Y.W. helped draft the manuscript. Y.H. and X.L. provided ideas. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yan He or Xujie Liu.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

The query used by materials science datasets.

Additional file 2:

Supplementary Information for fivefold cross-validation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Huang, Z., He, L., Yang, Y. et al. Application of machine reading comprehension techniques for named entity recognition in materials science. J Cheminform 16, 76 (2024). https://doi.org/10.1186/s13321-024-00874-5
