- Open Access
ErtlFunctionalGroupsFinder: automated rule-based functional group detection with the Chemistry Development Kit (CDK)
© The Author(s) 2019
- Received: 7 March 2019
- Accepted: 28 May 2019
- Published: 4 June 2019
The Ertl algorithm for the automated detection and extraction of functional groups (FG) in organic molecules is implemented on the basis of the Chemistry Development Kit (CDK). A distinct impact of the chosen CDK aromaticity model is demonstrated by an FG analysis of the ChEMBL database compounds. The average performance of less than a millisecond for a single-molecule FG extraction allows for fast processing of even large compound databases.
- Chemistry Development Kit
- Functional group
- Electron donation
- Cycle finder
Functional groups (abbreviated FG) are an important concept of organic chemistry. They allow for a systematic and (in many cases) adequate molecular categorization according to a molecule’s reactivity and its chemical properties. Moreover, the FG concept may be successfully exploited across a wide range of molecular research, e.g. to construct quantitative structure–activity relationships (QSAR) in order to support drug discovery. In a recent publication [1], Peter Ertl proposed a new, purely rule-driven approach to identify the FGs of an organic molecule. This effort may be regarded as the first genuinely algorithmic method to tackle FG identification, in contrast to the common manual FG definition performed by chemists. A first open implementation of the Ertl algorithm (denoted IFG, for Identify Functional Groups) was realized by Guillaume Godin and Richard Hall for the RDKit package [2].
In this work, the Ertl algorithm for automated FG detection and extraction is implemented on the basis of the Chemistry Development Kit (CDK) [3–6] with a new Java class ErtlFunctionalGroupsFinder to extend its open applicability for molecular research. The concrete CDK implementation and the distinct impact of the chosen CDK aromaticity model on FG detection as well as a comparison with the IFG RDKit implementation are discussed in detail. Due to the average performance of less than a millisecond for a single-molecule FG extraction using a single standard workstation processor core, ErtlFunctionalGroupsFinder may be used to process even large databases with tens of millions of molecules within an hour.
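As a rough plausibility check of the throughput statement above, the following minimal Java sketch converts an average per-molecule extraction time into molecules per core-hour. The class and method names are illustrative assumptions and not part of ErtlFunctionalGroupsFinder; the only ingredient is the fact that one hour has 3,600,000 milliseconds.

```java
/** Illustrative arithmetic only; not part of ErtlFunctionalGroupsFinder. */
final class ThroughputSketch {

    /** Molecules processed per core-hour for a given average time per molecule (in ms). */
    static long moleculesPerHour(double millisPerMolecule) {
        // One hour = 3,600,000 ms; round to the nearest whole molecule.
        return Math.round(3_600_000.0 / millisPerMolecule);
    }
}
```

An average of exactly 1 ms corresponds to 3.6 million molecules per core-hour and 0.5 ms to 7.2 million, so processing tens of millions of molecules within an hour on one core implies an average clearly below the 1 ms upper bound (roughly 0.36 ms or less for 10 million).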
The implementation of the Ertl algorithm is divided into three consecutive steps. Step I marks all atoms within a molecule that meet the Ertl rules. Step II detects groups of connected marked atoms and extracts each group as an FG including information about its environment. The final step III applies the Ertl generalization scheme to the detected FGs.
AtomContainer is the basic molecule representation class of the CDK. In order to mark atoms according to the Ertl rules, atomic connectivity information of an AtomContainer has to be queried (e.g. ‘Is atom A connected to atom B?’). Since AtomContainer internally uses edge lists, each such query would scale linearly with the number of a molecule’s bonds. To avoid this inefficiency, the CDK utility class GraphUtil is invoked instead to generate an adjacency list with a complementing edge-to-bond map for quick atomic connectivity access. The marking procedure iterates over all non-aromatic atoms of the molecule in the order given by its AtomContainer. Heteroatoms (atoms other than carbon or hydrogen) are identified by their atomic number, whereas for carbons all neighboring atoms and bonds are evaluated successively according to the Ertl rules. For special treatment in the following steps, aromatic heteroatoms are collected separately and marked carbons in carbonyl groups are specifically labeled. If not already the case, explicit hydrogens are converted to implicit hydrogen counts on their connected parent atoms. The iteration result is a set of marked atoms which acts as the basis for the following FG extraction in Step II.
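The benefit of an adjacency list over an edge list can be illustrated with a minimal plain-Java sketch. It is a stand-in for the idea only and does not use or reproduce the CDK GraphUtil code; the class name, method names and the representation of bonds as index pairs are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an edge-list to adjacency-list conversion (not the CDK implementation). */
final class AdjacencySketch {

    /** Builds neighbor lists for atoms 0..atomCount-1 from bonds given as {begin, end} index pairs. */
    static List<List<Integer>> toAdjacencyList(int atomCount, int[][] bonds) {
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] bond : bonds) {
            // Bonds are undirected, so each atom is registered as a neighbor of the other.
            adjacency.get(bond[0]).add(bond[1]);
            adjacency.get(bond[1]).add(bond[0]);
        }
        return adjacency;
    }

    /** Connectivity query whose cost depends on the degree of atom a, not on the total bond count. */
    static boolean isConnected(List<List<Integer>> adjacency, int a, int b) {
        return adjacency.get(a).contains(b);
    }
}
```

For the heavy atoms of acetic acid with bonds {0, 1}, {1, 2} and {1, 3}, a query such as isConnected(adjacency, 1, 2) inspects only the three neighbors of atom 1 instead of scanning the whole bond list.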
To identify groups of connected marked atoms and merge them into FGs, a single unprocessed marked atom is picked as the starting point for a new FG. Then an iterative breadth-first search (BFS) based on the adjacency list explores all neighboring atoms and expands the group by adding connected marked atoms until unmarked carbons or aromatic heteroatoms are reached, which form the FG’s environment. These terminal atoms are not included in the FG themselves, but their aromaticity and bonding information is extracted and attributed to their connected marked atoms. The extraction process keeps the aromaticity assignments and molecular orbitals of the molecule under investigation. In addition, aromatic heteroatoms that are not included in a group are extracted separately as single-atom FGs. Once all marked atoms are processed, the complete list of FGs of the molecule (including environmental carbon information) is obtained in the form of a list of AtomContainers.
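The grouping described above can be sketched as a breadth-first search over marked atoms. The following plain-Java example is a simplified illustration under stated assumptions: atoms are integer indices, the adjacency list is an int[][], and the class and method names are hypothetical. It omits the extraction of environment information and the separate handling of aromatic heteroatoms that the actual implementation performs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch of Step II: grouping connected marked atoms via BFS (simplified, not the CDK-based code). */
final class MarkedGroupSketch {

    /**
     * adjacency[i] holds the neighbor indices of atom i; marked[i] is true if atom i
     * was marked in Step I. Returns one list of atom indices per connected marked group.
     */
    static List<List<Integer>> findGroups(int[][] adjacency, boolean[] marked) {
        boolean[] visited = new boolean[marked.length];
        List<List<Integer>> groups = new ArrayList<>();
        for (int start = 0; start < marked.length; start++) {
            if (!marked[start] || visited[start]) {
                continue; // only unprocessed marked atoms start a new group
            }
            List<Integer> group = new ArrayList<>();
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            visited[start] = true;
            while (!queue.isEmpty()) {
                int atom = queue.remove();
                group.add(atom);
                for (int neighbor : adjacency[atom]) {
                    // Unmarked neighbors end the expansion; they would form the environment.
                    if (marked[neighbor] && !visited[neighbor]) {
                        visited[neighbor] = true;
                        queue.add(neighbor);
                    }
                }
            }
            groups.add(group);
        }
        return groups;
    }
}
```

For a hypothetical chain in which atoms 1, 2 and 3 form a marked ester-like unit and atom 6 is an isolated marked heteroatom, the method returns two groups, one with atoms 1, 2, 3 and one with atom 6.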
The final generalization step processes all extracted FGs separately according to the Ertl generalization scheme. Each FG is represented by an AtomContainer that contains marked atoms, their connecting bonds, and information on (former) neighboring environmental carbons. The information on environmental carbons comprises their location and their aromaticity as derived from the molecule under investigation. First, all exceptional cases are addressed where the FG contains a single marked atom only. This includes single-atomic nitrogen or oxygen FGs with one environmental carbon, simple thiols, and secondary amines or single aromatic heteroatom FGs. Then an iteration over all atoms in the FG is performed. In case of a heteroatom, all hydrogens are replaced by new R-atoms, which are implemented as instances of the PseudoAtom class; as an exception, oxygens in hydroxyl groups retain their hydrogens. According to the generalization scheme, any environmental information about carbon atoms is deleted, with the exception of previously labeled carbons in carbonyls, whose environmental carbons are replaced by R-atoms. The resulting generalized FGs of the molecule are finally returned as an AtomContainer list.
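The hydrogen-replacement rule for heteroatoms stated above can be illustrated in isolation. The following plain-Java sketch is a deliberately narrow assumption-based example: it models substituents as simple string labels instead of CDK PseudoAtom instances, uses hypothetical names, and ignores the single-atom exceptional cases and the treatment of environmental carbons.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of one generalization rule only: hydrogens on heteroatoms become R placeholders. */
final class HydrogenGeneralizationSketch {

    /**
     * Returns the substituent labels that replace the hydrogens of a heteroatom:
     * "R" placeholders in general, but retained "H" for the oxygen of a hydroxyl group.
     */
    static List<String> generalizeHydrogens(int hydrogenCount, boolean hydroxylOxygen) {
        String label = hydroxylOxygen ? "H" : "R";
        List<String> substituents = new ArrayList<>();
        for (int i = 0; i < hydrogenCount; i++) {
            substituents.add(label);
        }
        return substituents;
    }
}
```

Under these assumptions, a nitrogen with two hydrogens yields two R placeholders, whereas a hydroxyl oxygen keeps its single hydrogen.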
ErtlFunctionalGroupsFinder provides two basic processing modes which can be selected via the class constructor: The default generalization mode generalizes all detected FGs as outlined above, whereas the no-generalization mode skips generalization and instead retains the environmental information as distinct atoms and bonds, including their original aromaticity.
The most common FG for daylight electron donation is a tertiary amine with an aromatic central nitrogen atom, followed by an ether group, an amide group and a tertiary amine with a non-aromatic central atom. Differences between the four electron donation types are obvious, with some striking examples: The “O=C*” FG (representing a carbonyl group containing an aromatic carbon atom) is frequent for the daylight and cdkAllowingExocyclic types but does not appear at all for cdk and piBonds. The cdk type does not allow atoms with exocyclic double or triple bonds in aromatic systems. Therefore, a carbon atom connected to a carbonyl oxygen is not considered aromatic in any case. For piBonds, all possibly aromatic atoms must be connected to a cyclic pi-bond, which is impossible for the carbon atom in a carbonyl group (and also for oxygen and sulfur atoms, compare FGs “RO*R” and “RS*R” in Fig. 3). Type cdkAllowingExocyclic, on the other hand, allows electron contributions from exocyclic pi-bonds, and the daylight type tolerates a carbonyl carbon in an aromatic system but considers its electron contribution to be zero since the oxygen atom is more electronegative. The most frequent ChEMBL FGs detected are in correspondence with the findings in [1] obtained from a specific bioactive subset of ChEMBL. The most frequent “RN*(R)R” aromatic amine FG is not explicitly mentioned in [1] but published in the supplementary file.
The FG detection results of both implementations are in good general agreement: The most common FG detected by IFG RDKit is “O” (representing a single oxygen atom with unknown connections). It corresponds to the generalized ErtlFunctionalGroupsFinderEvaluationTest FGs representing an ether group (pseudo SMILES “ROR”), hydroxyl groups connected to aromatic carbon atoms (pseudo SMILES “[H]OC*”) or aliphatic carbon atoms (pseudo SMILES “[H]O[C]”) and a carbonyl group containing an aromatic carbon atom (pseudo SMILES “O=C*”). A striking deviation is the chemically meaningful FG “RN*(R)N*(R)R”, which represents two bonded aromatic nitrogen atoms (e.g. found in pyridazine): While this FG is frequently detected with IFG RDKit, it is not found at all by ErtlFunctionalGroupsFinder. This detection failure of the latter is, however, in accordance with the Ertl algorithm, which defines that aromatic heteroatoms are to be collected as single atoms if no aliphatic group is connected. Finally, the total numbers of different identified FGs are smaller for IFG RDKit (11,000 with AROMATICITY_RDKIT up to 43,000 with AROMATICITY_MDL) compared to ErtlFunctionalGroupsFinder (41,000 with daylight up to 134,000 with piBonds), which can be traced to the different FG output representations.
In summary, chemical FG detection remains challenging, and different open implementations of the Ertl algorithm are a real asset, not only for general comparisons but especially because of their different oddities and subtleties. For example, the ErtlFunctionalGroupsFinderEvaluationTest pseudo SMILES FG “R[N]R” is not straightforward to comprehend; the same applies to “[S]R”, “O=C(R)[N]R” or “O=[C]R” (the latter does not represent an aldehyde group, since that FG is represented by “[C]=O”; thus it may be an artefact due to an SD file error or originate from an amide group containing an aromatic nitrogen atom). Further investigations of these rare problems may lead to improved structural pre-processing as well as possibly useful extensions of the Ertl FG detection rules.
FG detection has numerous possible applications in molecular research: FGs may be regarded as “intrinsic seeds” for proper fragmentation of molecules. The FG sets of single molecules up to large molecule collections can be regarded as chemically meaningful “feature vectors” or “fingerprints”: This qualifies their use for molecular comparison, clustering/classification or ranking purposes (e.g. based on “overlap in FG space”) and may substantially improve QSAR/QSPR research, especially the growing and increasingly complex machine learning approaches.
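One simple way to quantify an “overlap in FG space” is the Jaccard (Tanimoto) coefficient of two FG sets. The text does not prescribe a specific measure, so the following plain-Java sketch should be read as one possible choice; the class name, the method name and the representation of FGs as pseudo SMILES strings are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: Jaccard overlap of two molecules' FG sets, with FGs given as pseudo SMILES strings. */
final class FgOverlapSketch {

    /** Returns |A ∩ B| / |A ∪ B|; two empty sets are treated as identical (1.0) by convention here. */
    static double jaccard(Set<String> fgsA, Set<String> fgsB) {
        if (fgsA.isEmpty() && fgsB.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(fgsA);
        intersection.retainAll(fgsB);
        int unionSize = fgsA.size() + fgsB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

For example, a molecule with the FG set {“ROR”, “RN*(R)R”} and another with {“ROR”, “[H]O[C]”} share one of three distinct FGs, giving an overlap of one third. Such pairwise values could then feed a clustering or ranking procedure.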
The ErtlFunctionalGroupsFinder LGPL Java class code is openly available from its project page. It is recommended to place this Java class in the tools package of the cdk-extra module. ErtlFunctionalGroupsFinder depends only on CDK base classes and interfaces as well as GraphUtil for quick connectivity queries. Unit tests for 20 compounds given in [1] (with the daylight electron donation model and cycle finder Cycles.all for aromaticity detection) are implemented: They demonstrate adequate FG extraction of the new implementation by comparison of the expected FGs according to the Ertl rules with the actual results. The open tools ErtlFunctionalGroupsFinderPerformanceTest and ErtlFunctionalGroupsFinderEvaluationTest provide detailed sample code for using the new functionality. An integration into future CDK releases is requested and will hopefully be approved by the CDK community.
The authors would like to thank Peter Ertl for describing his algorithm in a way that allowed easy re-implementation, which is not always the case. We also thank him for valuable discussions. We appreciate help from Egon Willighagen and John Mayfield with the CDK integration and from Felix Bänsch for unbiased release testing.
CS initiated the project. SF designed and implemented the new code. JS contributed to the test code and performed the FG frequency distribution analysis and the run time measurements. SN, CS, and AZ led the project development. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ertl P (2017) An algorithm to identify functional groups in organic molecules. J Cheminform 9:36
- RDKit. https://github.com/rdkit/rdkit/releases/tag/Release_2018_03_1 and https://github.com/rdkit/rdkit/tree/master/Contrib/IFG. Accessed 15 Feb 2019
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen EL (2003) The Chemistry Development Kit (CDK): an open-source java library for chemo- and bioinformatics. J Chem Inform Comput Sci 43(2):493–500
- Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL (2006) Recent developments of the Chemistry Development Kit (CDK)—an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12(17):2111–2120
- May JW, Steinbeck C (2014) Efficient ring perception for the Chemistry Development Kit. J Cheminform 6:3
- Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluska T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C (2017) The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9:33
- Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR (2017) The ChEMBL database in 2017. Nucleic Acids Res 45(D1):D945–D954
- ChEMBL database SD-file chembl_24_1.sdf.gz. ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_24_1.sdf. Accessed 21 Jan 2019
- Product specification Intel Xeon Processor E5 2697 v2. https://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2_70-GHz. Accessed 18 April 2018