Improved chemical text mining of patents using infinite dictionaries, translation and automatic spelling correction

Sayle, Roger A; Petrov, Plamen; Winter, Jon; Muresan, Sorel

doi:10.1186/1758-2946-3-S1-O16

Volume 3 Supplement 1

6th German Conference on Chemoinformatics, GCC 2010

Oral presentation
Open access
Published: 19 April 2011

Journal of Cheminformatics volume 3, Article number: O16 (2011)

The text mining of patents and patent applications for chemical structures of interest to medicinal chemists poses a number of unique challenges not encountered in other fields of text analytics. Traditional text mining relies on the co-occurrence of common terms between documents to provide similarity measures that can be used to cluster and rank related documents. The more words shared between two documents, the more similar they are, and the greater the probability that they discuss the same topic. By contrast, in pharmaceutical “composition of matter” patents the novel and unique chemical entities are far more significant than those that can be found elsewhere. Although the text of a pharmaceutical patent may explicitly name thousands of individual compounds, and via generic Markush structures claim an infinite number, the role of these patents is to protect the intellectual property of only one or perhaps two drug candidates.
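The co-occurrence similarity that traditional text mining relies on can be illustrated with a minimal sketch. It assumes a plain bag-of-words model scored by cosine similarity; the function names `term_counts` and `cosine_similarity` are illustrative and do not come from the work described here.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-count vectors of two documents."""
    a, b = term_counts(doc_a), term_counts(doc_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    # Documents with no tokens are treated as sharing nothing.
    return dot / norm if norm else 0.0
```

Under such a measure, frequently shared terms dominate the score, which is the opposite of what is wanted when the compounds of interest are the ones that appear nowhere else.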

In this work, we present an analysis of the “quality not quantity” of structures extracted by automatic Chemical Named Entity Recognition (CNER) methods, both on a small hand-curated benchmark set [1] and in a large-scale analysis of a comprehensive database of 12 million patents [2, 3]. Our results show the limited value of traditional lexicon/dictionary-based approaches in extracting “key” compounds, and that the major impediment is not the performance of the name-to-structure software used but the high rate of OCR errors, typos and lexicographic problems found in patent-office data feeds. To address this problem, we have developed novel algorithms for automatic chemical spelling correction that take advantage of the grammar used in IUPAC-like nomenclature. This correction forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve results in our study.
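The idea of grammar-guided spelling correction as a preprocessing pass can be sketched as follows. This is not the algorithm developed in this work, whose details are not given here; it is a toy illustration that assumes a deliberately tiny, hypothetical name grammar (the `LOCANT`, `SUBSTITUENT`, `STEM` and `SUFFIX` patterns) and a generic single-character edit search over lowercase input.

```python
import re

# Hypothetical toy grammar: optional locants, substituent prefixes, a stem, a suffix.
# Real IUPAC-like nomenclature is far richer than this fragment.
LOCANT = r"\d+(?:,\d+)*-"
SUBSTITUENT = r"(?:chloro|bromo|fluoro|methyl|ethyl|amino|hydroxy)"
STEM = r"(?:meth|eth|prop|but|pent|hex)"
SUFFIX = r"(?:ane|ene|anol|anal|anone|anoic acid)"
NAME = re.compile(rf"(?:(?:{LOCANT})?{SUBSTITUENT})*{STEM}{SUFFIX}")

# Characters tried when substituting or inserting (an assumption of this sketch).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-, "


def is_grammatical(name):
    """True if the whole name is accepted by the toy grammar."""
    return NAME.fullmatch(name) is not None


def single_edits(name):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts


def correct(name):
    """Return the unique grammatical name one edit away, else the input unchanged."""
    if is_grammatical(name):
        return name
    candidates = {c for c in single_edits(name) if is_grammatical(c)}
    # Ambiguous or impossible corrections are left for downstream handling.
    return candidates.pop() if len(candidates) == 1 else name
```

For example, `correct("2-chloroethano1")` repairs the OCR-style digit to give `2-chloroethanol`, whereas `correct("propan0l")` is left unchanged because both `propanol` and `propanal` are one edit away. Because the pass only rewrites strings, its output can be handed to any name-to-structure tool.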

References

1. Hattori K, Wakabayashi H, Tamaki K: Predicting Key Example Compounds in Competitor’s Patent Applications using Structural Information Alone. J Chem Inf Model. 2008, 48: 135-142. 10.1021/ci7002686.
2. Rhodes J, Boyer S, Kreulen J, Chen Y, Ordonez P: Mining Patents using Molecular Similarity Search. Pac Symp Biocomput. 2007, 12: 304-315.
3. Suriyawongkul I, Southan C, Muresan S: The Cinderella of Biological Data Integration: Addressing the Challenges of Entity and Relationship Mining from Patent Sources. Data Integration in the Life Sciences. Lect Notes Bioinf. 2010, Springer, 6254:
4. Sayle R: Foreign Language Translation of Chemical Nomenclature by Computer. J Chem Inf Model. 2009, 49: 519-530. 10.1021/ci800243w.

Author information

Authors and Affiliations

NextMove Software, Santa Fe, New Mexico, 87501, USA
Roger A Sayle
DECS, AstraZeneca, Molndal, Sweden
Plamen Petrov & Sorel Muresan
CIRA, AstraZeneca, Alderley Park, Cheshire, UK
Jon Winter

Corresponding author

Correspondence to Roger A Sayle.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sayle, R.A., Petrov, P., Winter, J. et al. Improved chemical text mining of patents using infinite dictionaries, translation and automatic spelling correction. J Cheminform 3 (Suppl 1), O16 (2011). https://doi.org/10.1186/1758-2946-3-S1-O16
