The CDK implementation of the Ertl algorithm allows fast FG extraction. A performance snapshot with the ErtlFunctionalGroupsFinderPerformanceTest tool shows an in-memory processing time of 74 s for 1.8 million ChEMBL compounds [7, 8] using a single core of an Intel Xeon E5-2697 v2 workstation CPU [9]. This corresponds to a single-core processing speed of more than 50 million molecules per hour. The FG extraction performance of parallelized ErtlFunctionalGroupsFinder threads with equal shares of the ChEMBL molecules (using the same hardware) is shown in Fig. 2: an initial distinct decrease in processing time flattens to only minor improvements beyond four parallel threads. With four parallel threads, more than 150 million molecules can be processed per hour.
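The throughput figures can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers quoted above; the class and method names are hypothetical and not part of ErtlFunctionalGroupsFinderPerformanceTest.

```java
import java.util.Locale;

public class ThroughputEstimate {

    /** Converts a molecule count and a wall-clock time in seconds into molecules per hour. */
    static double moleculesPerHour(long molecules, double seconds) {
        return molecules / seconds * 3600.0;
    }

    /** Returns the wall-clock time in seconds needed to reach a given hourly throughput. */
    static double secondsForThroughput(long molecules, double moleculesPerHour) {
        return molecules / moleculesPerHour * 3600.0;
    }

    public static void main(String[] args) {
        // Figures quoted in the text: 1.8 million ChEMBL compounds in 74 s on one core.
        double singleCore = moleculesPerHour(1_800_000L, 74.0);
        System.out.printf(Locale.ROOT,
                "single core: %.1f million molecules per hour%n", singleCore / 1e6);

        // Time limit implied by the four-thread statement (150 million molecules per hour).
        double limit = secondsForThroughput(1_800_000L, 150e6);
        System.out.printf(Locale.ROOT,
                "150 million per hour means at most %.1f s for the data set%n", limit);
    }
}
```

With the quoted figures, the single-core rate evaluates to roughly 87.6 million molecules per hour, which is consistent with the conservative statement of more than 50 million. The four-thread statement implies a total processing time below about 43 s for the full data set.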
The ErtlFunctionalGroupsFinderEvaluationTest tool generates FG extraction results in which FGs are represented as pseudo SMILES strings, with aromatic atoms marked by an asterisk and pseudo-atoms indicated by the character R. In a preliminary step, ErtlFunctionalGroupsFinderEvaluationTest excludes molecules with metal or metalloid atoms, selects the largest part from compounds with multiple unconnected structures and neutralizes charged atoms. The latter is performed by zeroing the formal atomic charges and filling up free valences with hydrogen atoms (according to the CDK atom types). This procedure allows a more general charge treatment than a pre-defined transformation list but may produce “wrong” structures, e.g. it turns a nitro NO2 group into pseudo SMILES “[H]O[N](=O)R” with an uncharged four-bonded nitrogen atom (other examples are “R[N](R)(R)R”, “[C]#[N]R” or “RS(R)(R)R”). Thus an improved charge neutralization scheme is desirable for future implementations. Figure 3 shows the twenty most frequently detected FGs of 1.8 million ChEMBL compounds for the daylight electron donation type in comparison to the findings with the three other electron donation types (the cycle finder algorithm Cycles.all is used, which is replaced by Cycles.vertexShort in case of a CDK Intractable exception).
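The preprocessing step that selects the largest part of a compound with multiple unconnected structures can be illustrated without any toolkit. The following plain-Java sketch is not the code used by ErtlFunctionalGroupsFinderEvaluationTest. It assumes that "largest" means the highest atom count and represents a molecule simply as an atom count plus a list of bonded index pairs; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class LargestFragment {

    /**
     * Returns the atom indices of the largest connected component.
     * Each bond is a pair of atom indices; ties go to the component found first.
     */
    static List<Integer> largestComponent(int atomCount, int[][] bonds) {
        // Build an undirected adjacency list from the bond pairs.
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] bond : bonds) {
            adjacency.get(bond[0]).add(bond[1]);
            adjacency.get(bond[1]).add(bond[0]);
        }

        boolean[] visited = new boolean[atomCount];
        List<Integer> best = new ArrayList<>();
        for (int start = 0; start < atomCount; start++) {
            if (visited[start]) {
                continue;
            }
            // Breadth-first search collects one connected component.
            List<Integer> component = new ArrayList<>();
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            visited[start] = true;
            while (!queue.isEmpty()) {
                int atom = queue.poll();
                component.add(atom);
                for (int neighbour : adjacency.get(atom)) {
                    if (!visited[neighbour]) {
                        visited[neighbour] = true;
                        queue.add(neighbour);
                    }
                }
            }
            if (component.size() > best.size()) {
                best = component;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Heavy atoms of ethylamine hydrochloride: C0-C1-N2 plus an unconnected Cl3.
        int[][] bonds = {{0, 1}, {1, 2}};
        System.out.println(largestComponent(4, bonds)); // prints [0, 1, 2]
    }
}
```

In a CDK-based workflow, the same selection would operate on atom containers rather than on index arrays. The sketch only makes the selection criterion explicit.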
The most common FG for daylight electron donation is a tertiary amine with an aromatic central nitrogen atom, followed by an ether group, an amide group and a tertiary amine with a non-aromatic central atom. Differences between the four electron donation types are obvious, with some striking examples: The “O=C*” FG (representing a carbonyl group containing an aromatic carbon atom) is frequent for the daylight and cdkAllowingExocyclic types but does not appear at all for cdk and piBonds. The cdk type does not allow atoms with exocyclic double or triple bonds in aromatic systems. Therefore, a carbon atom connected to a carbonyl oxygen is never considered aromatic. For piBonds, all possibly aromatic atoms must be part of a cyclic pi-bond, which is impossible for the carbon atom in a carbonyl group (and also for oxygen and sulfur atoms, compare FGs “RO*R” and “RS*R” in Fig. 3). Type cdkAllowingExocyclic, on the other hand, allows electron contributions from exocyclic pi-bonds, and the daylight type tolerates a carbonyl carbon in an aromatic system but considers its electron contribution to be zero since the oxygen atom is more electronegative. The most frequently detected ChEMBL FGs agree with the findings in [1], which were obtained from a specific bioactive subset of ChEMBL. The most frequent “RN*(R)R” aromatic amine FG is not explicitly mentioned in [1] but is listed in its supplementary file.
Figure 4 depicts the influence of the cycle finder algorithm on the resulting FG frequencies for the daylight electron donation type, using a subset of the available CDK cycle finder algorithms. Compared to the differences between the electron donation types, the influence of the cycle finder algorithm is of minor importance but nonetheless leads to deviations of about 4% in FG frequencies (e.g. the number of molecules containing the FG “RS*R” varies between 187,503 and 194,136).
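The size of this deviation follows directly from the two quoted counts. The sketch below is illustrative only: it assumes that the deviation is measured relative to the lowest count, and its class and method names are hypothetical.

```java
import java.util.Locale;

public class FrequencySpread {

    /** Relative spread between the lowest and highest count, in percent of the lowest count. */
    static double relativeSpreadPercent(long lowest, long highest) {
        return 100.0 * (highest - lowest) / lowest;
    }

    public static void main(String[] args) {
        // Counts quoted in the text for the FG "RS*R" across cycle finder algorithms.
        double spread = relativeSpreadPercent(187_503L, 194_136L);
        System.out.printf(Locale.ROOT, "relative spread: %.1f %%%n", spread);
    }
}
```

For the “RS*R” example, the spread evaluates to roughly 3.5%, in line with the stated order of magnitude.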
We originally intended this article purely to describe our CDK implementation of the Ertl algorithm. During the publication process, one of the reviewers requested a comparison with the open IFG RDKit implementation, not in terms of execution time but in terms of providing similar results. We agreed to include the following evaluation of both implementations but would like to highlight some caveats. A direct one-to-one FG detection comparison of ErtlFunctionalGroupsFinder with the IFG RDKit implementation is hampered by the fact that IFG RDKit does not provide generalized FGs according to the Ertl generalization scheme. However, the resulting IFG RDKit SMILES string with only marked atoms can be regarded as an approximate generalized FG representation (since it does not contain environmental information), so it can be mapped to a set of multiple pseudo SMILES FGs generated by ErtlFunctionalGroupsFinderEvaluationTest. As an example, the IFG RDKit SMILES string “O=CO” represents a carboxyl group, an ester group and a formic acid ester group, which correspond to the pseudo SMILES FGs “[H]OC(=O)R”, “O=C(R)OR” and “O=[C]OR” of ErtlFunctionalGroupsFinderEvaluationTest. Figure 5 shows the comparison for the twenty most frequent FGs detected in 1.8 million ChEMBL compounds using IFG RDKit with the AROMATICITY_RDKIT aromaticity model (plus the standard RDKit valence model and the standard cycle finder algorithm): each IFG RDKit FG is represented by the corresponding set of ErtlFunctionalGroupsFinderEvaluationTest pseudo SMILES FGs with their individual frequencies summed up.
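The mapping and summation used for this comparison can be sketched as follows. The “O=CO” assignment is taken from the example above. The frequency values are arbitrary placeholders chosen only to make the code runnable and are not results from the ChEMBL evaluation; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IfgFrequencyMapping {

    /** Sums the frequencies of all pseudo SMILES FGs assigned to one IFG RDKit SMILES string. */
    static long summedFrequency(List<String> pseudoSmilesSet, Map<String, Long> frequencies) {
        long sum = 0L;
        for (String pseudoSmiles : pseudoSmilesSet) {
            // FGs that were never detected contribute zero.
            sum += frequencies.getOrDefault(pseudoSmiles, 0L);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assignment from the text: carboxyl, ester and formic acid ester groups.
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("O=CO", List.of("[H]OC(=O)R", "O=C(R)OR", "O=[C]OR"));

        // Placeholder counts, NOT measured values.
        Map<String, Long> frequencies = Map.of(
                "[H]OC(=O)R", 3L,
                "O=C(R)OR", 5L,
                "O=[C]OR", 1L);

        for (Map.Entry<String, List<String>> entry : mapping.entrySet()) {
            long total = summedFrequency(entry.getValue(), frequencies);
            System.out.println(entry.getKey() + " -> " + total); // prints O=CO -> 9
        }
    }
}
```

Because one IFG RDKit string collects several pseudo SMILES FGs, the summed value is the quantity to compare against the IFG RDKit count, not any individual pseudo SMILES frequency.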
The FG detection results of both implementations are in good general agreement: The most common FG detected by IFG RDKit is “O” (representing a single oxygen atom with unknown connections). It corresponds to the generalized ErtlFunctionalGroupsFinderEvaluationTest FGs representing an ether group (pseudo SMILES “ROR”), hydroxyl groups connected to aromatic carbon atoms (pseudo SMILES “[H]OC*”) or aliphatic carbon atoms (pseudo SMILES “[H]O[C]”) and a carbonyl group containing an aromatic carbon atom (pseudo SMILES “O=C*”). A striking deviation is the chemically meaningful FG “RN*(R)N*(R)R”, which represents two bonded aromatic nitrogen atoms (e.g. found in pyridazine): While this FG is frequently detected with IFG RDKit, it is not found at all by ErtlFunctionalGroupsFinder. This non-detection is, however, consistent with the Ertl algorithm, which specifies that aromatic heteroatoms are to be collected as single atoms if no aliphatic group is connected. Finally, the total numbers of different identified FGs are smaller for IFG RDKit (11,000 with AROMATICITY_RDKIT up to 43,000 with AROMATICITY_MDL) than for ErtlFunctionalGroupsFinder (41,000 with daylight up to 134,000 with piBonds), which can be traced to the different FG output representations.
In summary, chemical FG detection remains challenging, and different open implementations of the Ertl algorithm are a true virtue, not only for general comparisons but especially for revealing their different oddities and subtleties. For example, the ErtlFunctionalGroupsFinderEvaluationTest pseudo SMILES FG “R[N]R” is not straightforward to comprehend; the same applies to “[S]R”, “O=C(R)[N]R” or “O=[C]R” (the latter does not represent an aldehyde group, since that FG is represented by “[C]=O”; it may thus be an artefact due to an SD file error or originate from an amide group containing an aromatic nitrogen atom). Further investigation of these rare problems may lead to improved structural pre-processing as well as to useful extensions of the Ertl FG detection rules.
FG detection has numerous possible applications in molecular research: FGs may be regarded as “intrinsic seeds” for proper fragmentation of molecules. The FG sets of single molecules up to large molecule collections can be regarded as chemically meaningful “feature vectors” or “fingerprints”. This qualifies them for molecular comparison, clustering/classification or ranking purposes (e.g. based on “overlap in FG space”) and may substantially improve QSAR/QSPR research, especially the growing and increasingly complex machine learning approaches.
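One simple way to quantify an “overlap in FG space” is a Jaccard (Tanimoto) coefficient on the sets of detected FGs. This particular measure is an assumption made for illustration, since the text does not prescribe one. In the minimal sketch below, the two FG sets are hypothetical examples written as pseudo SMILES strings, and the class and method names are likewise hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class FgOverlap {

    /** Jaccard coefficient of two FG sets: shared FGs divided by all distinct FGs. */
    static double jaccard(Set<String> first, Set<String> second) {
        if (first.isEmpty() && second.isEmpty()) {
            return 1.0; // convention chosen here: two empty sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(first);
        intersection.retainAll(second);
        Set<String> union = new HashSet<>(first);
        union.addAll(second);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical FG sets of two molecules (pseudo SMILES as set elements).
        Set<String> moleculeA = Set.of("ROR", "[H]OC*", "RN*(R)R");
        Set<String> moleculeB = Set.of("ROR", "[H]OC(=O)R", "RN*(R)R");
        System.out.println(jaccard(moleculeA, moleculeB)); // prints 0.5
    }
}
```

A set-based measure ignores how often an FG occurs within a molecule. A count-based variant would be needed if multiplicity matters for the intended comparison.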
The LGPL-licensed ErtlFunctionalGroupsFinder Java class is openly available from its project page. It is recommended to place this Java class in the tools package of the cdk-extra module. ErtlFunctionalGroupsFinder depends only on CDK base classes and interfaces as well as on GraphUtil for quick connectivity queries. Unit tests for 20 compounds given in [1] (with the daylight electron donation model and the cycle finder Cycles.all for aromaticity detection) are implemented: They demonstrate adequate FG extraction of the new implementation by comparing the FGs expected according to the Ertl rules with the actual results. The open tools ErtlFunctionalGroupsFinderPerformanceTest and ErtlFunctionalGroupsFinderEvaluationTest provide detailed sample code for using the new functionality. Integration into future CDK releases has been requested and will hopefully be approved by the CDK community.