Skip to main content

Technical implications of new IUPAC elements in cheminformatics

Abstract

The symbols for the new IUPAC elements named in November 2016 can introduce subtle ambiguities within cheminformatics software. The ambiguities are described and demonstrated by highlighting inconsistencies between software when handling existing element symbols.

Background

On the 28th November 2016 the International Union of Pure and Applied Chemistry (IUPAC) approved the names and symbols of four new elements: 113 Nihonium (Nh), 115 Moscovium (Mc), 117 Tennessine (Ts), 118 Oganesson (Og). Cheminformatics libraries typically use a centralised dictionary of elements to store and look up symbols in the periodic table. Naïvely adding the new element symbols to this table can introduce unexpected behaviour.

Contracted abbreviations

The first ambiguity was noted in the the preliminary recommendations from IUPAC. Contracted group abbreviations are common in published chemistry research to make sketches more concise and readable. Common abbreviations include Ph for Phenyl, NO2 for Nitro, COOH for Carboxylic acid, etc. The new symbol Ts is widely used to represent a Tosyl group. Unfortunately the symbol Tn could not be used as it was previously used for \(^{220}\)Radon (Thoron). It was noted in the preliminary recommendations that similarly to how Ac is used for both Actinium and Acetyl (and sometimes Acyl) the intended meaning would be clear from the context (Fig. 1).

Fig. 1
figure 1

6-(Diacetoxyiodo)-1-tosylindoline Intermediate 30 in US 2016/362375 A1. The intended meaning of Ts is Tosyl, and OAc is Acetoxy

While it is true that a human may decipher the intended meaning, it is more difficult for the software especially when the compound is no longer associated with the original context. The existence of compounds like PubChem [1]’s p-tolylactinium (CID 20712520 [2]) instead of the intended p-acetyl structure demonstrate this. The error here was propagated from the original substance submissions: a deprecated ChemSpider [3] entry, and patent sketches extracted by SCRIPDB [4].

Software for sketching chemical diagrams often allow the input of contracted abbreviations. ChemDraw 15 interprets all Ac labels as Acetyl and it is impossible to add an Actinium atom to a sketch even via the periodic table selection menu or from file or line-notation input. In MarvinSketch 16.4.18.0 entering Ac using the periodic table menu or keyboard short-cuts results in Actinium whilst using the “Label Editor” produces Acetyl. BIOVIA Draw 2017 makes a clear distinction when adding abbreviated atoms and both interpretations can be input. ChemDoodle 7.0.2 always interprets Ac as Actiniumwhen setting the atom label but does allow OAc for Acetoxy. With all of these, there is often little visual indication or feedback as to whether a user has entered the input they intended to.

To remove the ambiguity between Tosyl and Tennessine the alternative abbreviation Tos can be used. A brief analysis of sketches taken from United States patent applications published in 2015 shows that Ts is used in atom labels 2290 times and Tos 113 times.

Case insensitivity

A more subtle problem may arise with the symbol Nh in software that allows case insensitive atom labels. It is reasonable to accept CL as equal to Cl for chlorine (e.g. PDB HETATMs) but NH (secondary amine) may now unexpectedly be picked up as Nihonium from the internal dictionary.

SMARTS queries

Support for the SMARTS query language is available in many closed and open-source cheminformatics toolkits. A potential area for ambiguity is again found with Nihonium and the interpretation of other transfermium symbols. Transfermium symbols were officially named after the initial release of the Daylight SMARTS toolkit [5] and in subsequent implementations some are interpreted differently between toolkits either as an element or a conjuction (AND) expression. For example, at the time of writing both the CDK [6] and Open Babel [7] interpret [Bh] as [B&h] by whilst RDKit [8] interprets it as [#107].

The problem occurs due to the implicit conjunction between adjacent primitive expressions. The new symbol [Nh] could be interpreted as [#113] (element 113) or [N&h] (Aliphatic nitrogen and at least one implicit hydrogen). Table 1 lists the transfermium symbols and their different interpretations. Software that generates SMARTS patterns should take extra care to avoid writing ambiguous expressions.

Table 1 Ambiguous SMARTS for transfermium element symbols officially named since 1997

Conclusions

A pragmatic approach to handling the new elements or perhaps all high atomic number elements with a very short half-life could be to simply ignore them. Whilst these elements are unlikely to have a practical application it is unsatisfactory to simply ignore them and we hope this commentary highlights that care should be taken when supporting the new symbols in cheminformatics software.

References

  1. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucleic Acids Res 44(D1):1202–1213. doi:10.1093/nar/gkv951

    Article  Google Scholar 

  2. National Center for Biotechnology Information. PubChem compound database; CID=20712520. https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=20712520. Accessed 3 Jan 2017

  3. Pence HE, Williams A (2010) ChemSpider: an online chemical information resource. J Chem Educ 87(11):1123–1124. doi:10.1021/ed100697w

    Article  CAS  Google Scholar 

  4. Heifets A, Jurisica I (2012) SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents. Nucleic Acids Res 40(D1):428–433. doi:10.1093/nar/gkr919

    Article  Google Scholar 

  5. Daylight Chemical Information Systems Inc. http://www.daylight.com. Accessed 3 Jan 2017

  6. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E (2003) The Chemistry Development Kit (CDK): an open-source java library for chemo- and bioinformatics. J Chem Inf Comput Sci 43(2):493–500. doi:10.1021/ci025584y

    Article  CAS  Google Scholar 

  7. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform 3(1):33. doi:10.1186/1758-2946-3-33

    Article  Google Scholar 

  8. RDKit: Open-source cheminformatics. http://www.rdkit.org. Accessed 3 Jan 2017

Download references

Authors' contributions

JWM wrote the manuscript and RAS recognized the ambiguities with SMARTS expressions. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to John W. Mayfield.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mayfield, J.W., Sayle, R.A. Technical implications of new IUPAC elements in cheminformatics. J Cheminform 9, 10 (2017). https://doi.org/10.1186/s13321-017-0196-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13321-017-0196-0

Keywords