Wikipedia Chemical Structure Explorer: substructure and similarity searching of molecules from Wikipedia
Journal of Cheminformatics volume 7, Article number: 10 (2015)
Wikipedia, the world’s largest and most popular encyclopedia, is an indispensable source of chemistry information. Among other content, it contains entries for over 15,000 chemicals, including metabolites, drugs, agrochemicals and industrial chemicals. To provide easy access to this wealth of information, we decided to develop a substructure and similarity search tool for chemical structures referenced in Wikipedia.
The web-based Wikipedia Chemical Structure Explorer is a useful resource for both research and chemical education, giving researchers and students an easy, user-friendly way to search chemical structures and identify relevant information in Wikipedia. The tool can also help to improve the quality of chemical entries in Wikipedia by providing potential contributors with a regularly updated list of entries with problematic structures. Last but not least, this search system is a good example of how modern web technology can be applied in the field of cheminformatics.
Wikipedia is the 6th most accessed web site worldwide, with currently more than 4.6 million articles in English and many million in other languages. It is an indispensable reference for all scientific disciplines, including chemistry, and contains information about numerous important molecules such as metabolites, drugs, agrochemicals and industrial chemicals. Wikipedia currently contains entries for over 15,000 chemicals, growing by about 1000 new molecules per year. These chemical pages have been created and are constantly updated by thousands of enthusiastic Wikipedia users, both experts in the field and random passersby. Without this huge volunteer effort the tool we present here would not be possible. The authors therefore want to express their gratitude to the Wikipedia community and to contribute to this global effort by developing the Wikipedia Chemical Structure Explorer and making it freely available. Our motivation was to address the lack of support for specialized chemical searches in Wikipedia. The tool offers an easy way to perform substructure and similarity searches for molecules referenced in Wikipedia.
Extraction of molecules from Wikipedia
Among its other chemical content, Wikipedia contains numerous entries describing specific chemicals. Such entries use a special chemical template: either an “Infobox drug” (also called Drugbox), used for drugs, or a Chembox, used for other chemicals. These chemical templates (also called infoboxes) are pieces of Wikipedia markup embedded in chemistry pages that contain the most important information about a molecule. They make it possible to present chemical data in a standardized way and also support computer mining of the data. The templates have a modular design. After general information, including the chemical name and structure depiction, possibly a 3D molecule image, the SMILES code and links to other chemical databases such as PubChem or ChEMBL, these boxes often contain further data such as physicochemical and pharmacological properties and information about chemical hazards. They can be built from multiple sections, each covering one group of information. Depending on the compound, sections can be added or left out, and within a section parameters can be added or omitted. An example of a Wikipedia page with a Chembox is shown in Figure 1 and its encoding in Wiki markup in Figure 2.
Several projects under the Wikipedia umbrella use the information stored in the chemical infoboxes. One of them is the DBpedia:Chemical Compound project, which extracts structured information from Wikipedia, converts it into RDF format and makes it freely available on the Web. DBpedia is therefore something like a Semantic Web mirror of Wikipedia. Another activity in this area is Wikidata:WikiProject Chemistry. Wikidata is a free database of structured data extracted from Wikipedia. Its Chemistry subproject focuses on defining data items for chemical entries and checking their quality; it also organizes and monitors the data implementation (including with automatic bots).
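As a rough illustration of the kind of computer mining that the infobox markup supports, the sketch below pulls SMILES values out of a piece of wikitext with a regular expression. It is not the extraction procedure used for the Explorer. The function name `extract_smiles`, the assumption that the parameter is spelled `SMILES` or `smiles` (optionally followed by a digit), and the simplified sample markup are all illustrative.

```python
import re

# Assumption: the SMILES parameter appears as "| SMILES = ..." or
# "| smiles = ...", optionally numbered (SMILES1, SMILES2, ...).
# Only spaces/tabs are allowed around "=", so an empty value is not
# filled with text from the following line.
SMILES_PARAM = re.compile(r"\|\s*smiles\d*[ \t]*=[ \t]*([^\s|}]+)", re.IGNORECASE)


def extract_smiles(wikitext):
    """Return every non-empty SMILES value found in infobox-style markup."""
    return SMILES_PARAM.findall(wikitext)


# Simplified, hand-written sample markup (not copied from a real page).
sample = """{{Chembox
| Name = Ethanol
| Section1 = {{Chembox Identifiers
|   SMILES = CCO
|   PubChem = 702
  }}
}}"""

print(extract_smiles(sample))
```

A regular expression is enough for a sketch, but real infoboxes can nest templates and wrap values across lines, so a production extractor would more likely work from parsed template parameters.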
In the end we were able to extract some 13,000 SMILES codes from the Wikipedia entries. Over 600 of these codes could not be processed by the SMILES parser. A clear majority of the problems (over 350 cases) was caused by violating the SMILES syntax rules for unsubstituted pyrrole-type nitrogen: this nitrogen was encoded as n rather than as [nH], as required by the SMILES grammar (for example, benzimidazole was incorrectly encoded as n2c1ccccc1nc2). Since incorrect SMILES encoding of heteroaromatic molecules is apparently a common issue in Wikipedia, we recommend entering SMILES into Chemboxes or Drugboxes in nonaromatic form, with alternating single and double bonds. Other SMILES errors were caused by missing ring closures, unclosed parentheses, hydrogens and non-organic atoms outside square brackets, and so on.
During the extraction, the authors (LP and MZ) corrected over 100 such errors directly in Wikipedia. The remaining SMILES codes that cannot be parsed may be viewed by clicking the “Browse errors” link on the Wikipedia Explorer main page. This list of problematic SMILES codes is regenerated daily. Clicking a SMILES code opens the respective Wikipedia page directly. We hope that this mechanism will make the correction of SMILES errors much easier for the Wikipedia cheminformatics community.
As mentioned previously, the chemical infoboxes contain, in addition to SMILES, links to other chemical databases, most notably PubChem. When a Wikipedia entry contained both a SMILES code and a link to PubChem, it was possible to check whether the two structures are the same. In over 600 cases the structures differed. In most cases this was caused by a difference in tautomeric form, but in many cases the two molecules also differed in the position of functional groups or in atom types, probably because of errors made when drawing the structures. In some cases the two structures differed completely (probably because an incorrect PubChem CompoundID was included). In such ambiguous cases we used the structure from Wikipedia, but disagreement with PubChem may indicate problems that should be checked in the future.
After final processing, the database of SMILES codes of Wikipedia molecules contained 13,072 entries. To document the diversity of this molecule collection, the 250 most frequent scaffolds present in the set are shown in Figure 3 in the form of a Molecule Cloud diagram. The size of each scaffold image is proportional to the number of molecules containing the scaffold, ranging from the largest, benzene (there are 1116 entries for benzene derivatives in Wikipedia), down to the smallest images, each representing 5 Wikipedia entries. The 250 scaffolds displayed in Figure 3 together represent 4294 Wikipedia molecules. Although a detailed analysis of Wikipedia chemical content is outside the scope of this communication, it is interesting to compare, at least briefly, the Wikipedia scaffolds with those present in common synthetic molecules and bioactive molecules (Figures four and five in ref. ). In Wikipedia one can see a clear preference for more complex structures, such as natural products, steroids or cores of common drugs. This is, of course, not surprising, because Wikipedia chemical entries are created subjectively, based on the usefulness and application area of the respective molecules.
The list of entries with canonical SMILES also allows an easy check for structure duplication. We identified about 30 cases where different Wikipedia pages describe the same molecule (molecules with the same SMILES). This is sometimes caused by different spelling (Dichlorophen vs Dichlorophene) but mostly by the use of synonyms (for example Amphetamine vs Adderall, Pozanicline vs A-84,543, or Tretinoin vs Retinoic_acid). Some of these duplicates are caused by stereochemical information missing from the SMILES (three sugars with different stereochemistry, namely Pinitol, Quebrachitol and Ononitol, are described by the same nonstereo SMILES), and some clearly by error (for example, the 2-Nitropropane and 1-Nitropropane entries both have the same SMILES).
The data extraction procedure is currently set up to run automatically every day, so the Wikipedia Explorer always contains up-to-date information. The “About” page provides information about the latest update and the number of structures extracted. The extracted data (SMILES codes and names of the respective Wikipedia pages) may be obtained by clicking the “Download SMILES” link in the Explorer top menu or downloaded from www.cheminfo.org/wikipedia/smiles.txt.
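Some of the purely syntactic errors listed above (missing ring closures, unclosed parentheses, unclosed square brackets) can be caught without a full SMILES parser. The following is a minimal sketch of such a check, not the parser used in this work; the function name and the wording of the messages are illustrative. It does not detect the pyrrole-type nitrogen problem, which requires real aromaticity handling.

```python
DIGITS = "0123456789"


def smiles_syntax_problems(smiles):
    """Report unbalanced brackets, parentheses and ring closures in a SMILES string."""
    problems = []
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom; digits inside are isotopes, charges, etc.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed square bracket")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("unmatched closing parenthesis")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and all(c in DIGITS for c in label):
                open_rings ^= {int(label)}
                i += 3
                continue
            problems.append("malformed two-digit ring closure")
        elif ch in DIGITS:
            # Toggle the label: first occurrence opens the ring, second closes it.
            open_rings ^= {int(ch)}
        i += 1
    if depth > 0:
        problems.append("unclosed parenthesis")
    if open_rings:
        problems.append("unpaired ring closure")
    return problems


for code in ["c1ccccc1", "C1CCCC", "CC(C", "n2c1ccccc1nc2"]:
    print(code, smiles_syntax_problems(code))
```

Note that the incorrectly encoded benzimidazole from the text passes this check, because its parentheses and ring closures are balanced; only a parser that assigns bonds to the aromatic atoms will reject it.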
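The duplicate check described above amounts to grouping page names by their canonical SMILES and keeping the groups with more than one member. A minimal sketch follows; the function name, the page names and the SMILES strings are placeholders, not data from the Explorer.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Group (page_name, canonical_smiles) pairs and keep SMILES shared by several pages."""
    pages_by_smiles = defaultdict(list)
    for page, smiles in entries:
        pages_by_smiles[smiles].append(page)
    return {s: pages for s, pages in pages_by_smiles.items() if len(pages) > 1}


# Hypothetical entries; "Page_A" and "Page_B" stand for two pages describing one molecule.
entries = [
    ("Page_A", "CCO"),
    ("Page_B", "CCO"),
    ("Page_C", "CC(=O)O"),
]

print(find_duplicates(entries))
```

The grouping is only as good as the canonicalization: as the Pinitol, Quebrachitol and Ononitol case shows, SMILES without stereochemistry will merge distinct stereoisomers into one group.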
Substructure and similarity searching
As mentioned in the introduction, the main motivation for the present project was to offer easy substructure and similarity searches for chemical entries in Wikipedia. The majority of classical chemical search engines currently in use rely on a central server and are therefore vulnerable in several ways: not only to possible server malfunction but also, in the more extreme scenario, to abandonment of the project and the resulting end of maintenance and updates. One approach to reducing this risk is to release the server code as open source, but installing open-source software from scratch usually requires relatively good IT skills. A more recent approach is to distribute a server image that allows a new server to be set up quickly using a virtual machine. In this project we decided to use this approach and capitalize on the latest web technologies to ensure the sustainability of the project by moving all the search intelligence from the server to the client. As a result we are able to perform exact, substructure and similarity searches using pure HTML5 technology, compatible with most recent browsers, without any access to an external web service.
In our system all molecules from Wikipedia are stored in a JSON data file as an array of objects. Each object contains a canonical representation of the chemical structure, an array of 16 numbers holding the 512 index bits (16 × 32 bits) used for structure similarity searching and substructure fingerprint pre-screening, the molecular formula, the molecular weight, and the Wikipedia page name, which allows the client to create a direct link to the original information. The file is currently 5 MiB in size, but because HTTP servers and web browsers negotiate a compression algorithm, less than 2 MiB of data is actually transferred. Since the whole project is available as a zipped file, it can even be placed on a local server by simply downloading and unzipping one file.
The search is interactive: each time the query is modified by adding or removing an atom or bond, a new search is performed and the results are displayed immediately. To provide more intuitive results, particularly for very small queries, some simple tuning has been implemented. For substructure search, the results are sorted by the absolute difference between the molecular weight of the hit and that of the query. For similarity search, hits with the same score are sorted by the absolute difference in molecular weight from the query, with a check that an exact match is always at the top of the list. These simple “tricks” ensure that the query structure is always the first hit, which is the result a user would expect.
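To make the role of the 16-number index concrete, the sketch below treats each number as a 32-bit word of a 512-bit fingerprint and shows two typical uses: a substructure pre-screen (every bit set in the query must also be set in the candidate) and a similarity score. The Tanimoto coefficient is used here as a common choice and is an assumption, as are the function names and the toy bit patterns; the text does not state which similarity measure the Explorer uses.

```python
def popcount(words):
    """Number of set bits across a list of 32-bit words."""
    return sum(bin(w).count("1") for w in words)


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    common = popcount([a & b for a, b in zip(fp_a, fp_b)])
    union = popcount(fp_a) + popcount(fp_b) - common
    return common / union if union else 0.0


def passes_prescreen(query_fp, target_fp):
    """True if the target has every bit of the query set (necessary, not sufficient)."""
    return all((q & t) == q for q, t in zip(query_fp, target_fp))


# Toy fingerprints: 16 words x 32 bits = 512 bits, only the first word is non-zero.
target = [0b1011] + [0] * 15
query = [0b0011] + [0] * 15

print(passes_prescreen(query, target))
print(tanimoto(query, target))
```

The pre-screen only rules candidates out; molecules that pass it still need a full atom-by-atom substructure match, which is why a cheap bitwise test in front of it keeps an in-browser search responsive.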
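The two ordering rules can be written as sort keys. This is a sketch under assumptions: hits are represented as dictionaries with hypothetical fields `name`, `mw`, `score` and `exact`, which are not the Explorer's actual data structures.

```python
def sort_substructure_hits(hits, query_mw):
    """Order substructure hits by closeness of molecular weight to the query."""
    return sorted(hits, key=lambda h: abs(h["mw"] - query_mw))


def sort_similarity_hits(hits, query_mw):
    """Exact match first, then higher score, then closer molecular weight."""
    return sorted(
        hits,
        key=lambda h: (not h.get("exact", False), -h["score"], abs(h["mw"] - query_mw)),
    )


# Query: benzene (molecular weight 78.11); scores are made-up illustrative values.
hits = [
    {"name": "Phenol", "mw": 94.11, "score": 0.8},
    {"name": "Toluene", "mw": 92.14, "score": 0.8},
    {"name": "Benzene", "mw": 78.11, "score": 1.0, "exact": True},
]

print([h["name"] for h in sort_substructure_hits(hits, 78.11)])
print([h["name"] for h in sort_similarity_hits(hits, 78.11)])
```

For a substructure query the query molecule itself has zero mass difference, so it sorts first without any special handling; the explicit `exact` flag is only needed on the similarity side, where other molecules might share the top score.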
Results and discussion
More information about a hit may be obtained by moving the mouse over its result line. Clicking a hit loads the corresponding Wikipedia page directly in the bottom-right module. The page may also be opened in a new window.
We developed the web-based Wikipedia Chemical Structure Explorer, which allows easy, user-friendly substructure and similarity searching and navigation within the Wikipedia chemical content. The tool is a useful resource for research as well as for chemical education. The presented analysis can hopefully also help to improve the quality of chemical entries in Wikipedia by providing a daily updated list of entries with problematic or missing structural information, in this way directing potential contributors to the areas where their effort is most needed. Last but not least, this search system is a good showcase of how modern web technology can be applied in the field of cheminformatics. The Wikipedia Chemical Structure Explorer is available at www.cheminfo.org/wikipedia, and the code is also available from GitHub at https://github.com/cheminfo/wikipedia for local installation. The system is released under the open source BSD license.
Availability and requirements
Project name: Wikipedia Chemical Structure Explorer.
Project home page: http://www.cheminfo.org/wikipedia/
Operating system: Platform independent.
Other requirements: none.
License: the tool itself is released under the BSD license; various support libraries have their own licenses.
Any restrictions to use by non-academics: none other than those specified by the licenses.
List of most popular websites. [http://en.wikipedia.org/wiki/List_of_most_popular_websites]
WikiProject Chemistry. [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemistry]
Chemical infobox. [http://en.m.wikipedia.org/wiki/Wikipedia:Chemical_infobox]
Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28:31–6.
Bolton E, Wang Y, Thiessen PA, Bryant SH. PubChem. In: Integrated Platform of Small Molecules and Biological Activities. Chapter 12 In Annual Reports in Computational Chemistry, vol. 4. Washington, DC: American Chemical Society; 2008.
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, et al. ChEMBL: a Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012;40:D1100–7.
DBpedia:Chemical Compound project. [http://live.dbpedia.org/ontology/ChemicalCompound]
Wikidata:WikiProject Chemistry. [http://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry]
Wikipedia template pages. [https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Drugbox] [https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Chembox].
Node.js platform [http://nodejs.org/]
Ertl P, Rohde B. The molecule cloud - compact visualization of large collections of molecules. J Cheminformatics. 2012;4:12.
Davies M, Nowotka M, Papadatos G, Atkinson F, van Westen GJP, Dedman N, et al. MyChEMBL: A Virtual Platform for Distributing Cheminformatics Tools and Open Data. Challenges. 2014;5:334–7.
The project has been tested on Google Chrome 40, Safari 8, Firefox 36 and Internet Explorer 11.
Hanson RM, Prilusky J, Renjian Z, Nakane T, Sussman JL. JSmol and the Next-Generation Web-Based Representation of 3D Molecular Structure as Applied to Proteopedia. Israel J Chem. 2013;53:207–16 [http://sourceforge.net/projects/jsmol/]
Machine learning tools ml.js. [https://github.com/mljs/ml]
The project can be retrieved from GitHub using “git clone https://github.com/cheminfo/wikipedia.git” or by downloading the latest zip file “https://github.com/cheminfo/wikipedia/archive/master.zip”.
Sander T, Freyss J, von Korff M, Rufener C. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J Chem Inf Mod. 2015;55:460–73.
The authors declare that they have no competing interests.
Peter Ertl: http://peter-ertl.com.
Luc Patiny: http://cheminformatics.epfl.ch.