“Molecular Anatomy”: a new multi-dimensional hierarchical scaffold analysis tool
Journal of Cheminformatics volume 13, Article number: 54 (2021)
The scaffold representation is widely employed to classify bioactive compounds on the basis of common core structures or to correlate compound classes with specific biological activities. In this paper, we present a novel approach called “Molecular Anatomy”, a flexible and unbiased molecular scaffold-based metric to cluster large sets of compounds. We introduce a set of nine molecular representations at different abstraction levels, combined with fragmentation rules, to define a multi-dimensional network of hierarchically interconnected molecular frameworks. We demonstrate that the introduction of a flexible scaffold definition and multiple pruning rules is an effective method to identify relevant chemical moieties. This approach makes it possible to cluster together active molecules belonging to different molecular classes, capturing most of the structure-activity information, in particular when libraries containing a large number of singletons are analyzed. We also propose a procedure to derive a network visualization that provides a full graphical representation of a compound dataset, permitting efficient navigation of scaffold space and contributing significantly to high-quality SAR analysis. The protocol is freely available as a web interface at https://ma.exscalate.eu.
High-throughput screening (HTS) of small-molecule libraries is routinely used in the drug discovery process to identify novel leads against clinically relevant targets. Successful HTS requires high-quality, validated screening assays, but an effective strategy for selecting chemical structures is also fundamental for the subsequent hit-to-lead phase. HTS libraries comprise some hits of interest, but also many compounds that give false positives or promiscuous hits, as well as a number of compounds with no relevant biological activity at micromolar concentrations in biochemical assays. The first fundamental step, which affects the probability of success of the entire lead discovery process, is an incisive preliminary structure-activity relationship (SAR) analysis. One of the crucial tasks in the design of large diverse libraries is chemical space mapping. Selecting a representative subset of the desired chemical space generally requires three elements: a set of meaningful descriptors, a similarity metric for pairwise comparison of molecular structures, and a clustering algorithm that groups structures according to the calculated pairwise similarity values [4, 5]. Many clustering algorithms exist, and many clustering techniques can handle sets of 10^5 to 10^7 compounds; however, identifying relevant chemical series within the generated clusters is much more difficult. The generation of clusters organized as “series” in the medicinal chemistry sense is an important asset of scaffold-based techniques.
A chemical scaffold, also referred to as a ‘chemotype’ or ‘Markush structure’, can be defined as the common structure characterizing a group of molecules. Compounds sharing the same scaffold are likely to have a similar synthetic pathway, a consequence of the concept of molecular template in combinatorial chemistry. Once a scaffold is defined, SAR can be developed by analyzing the effects of the substitution patterns. The scaffold approach has several advantages: its outcomes are simple to interpret and medicinal chemistry-oriented, and it combines some of the most significant features of graph-based approaches with characteristics of molecular fingerprints and maximum common substructure methods. Bemis et al. introduced a systematic analysis of drugs according to their scaffold/framework representation, which is now a well-established method alongside molecular descriptors, molecular fingerprints and graphs. In the last twenty years, different scaffold definitions have been introduced and numerous scaffold-based computational approaches have been developed for structural classification and the prediction of biological activities. The introduction in 2005, by Wilkens et al., of hierarchies based on several kinds of scaffold deconstruction represented a milestone in the development of scaffold-based approaches. In 2007, Schuffenhauer et al. demonstrated the potential of combining the scaffold-based approach with ad hoc graphical representations through the “Scaffold Tree” algorithm and visualization tool; Schuffenhauer also introduced a rule-based ring disassembly. Since then, other decomposition and visualization tools have been developed, such as Scaffold Hunter in 2009, recently revised and extended, and Scaffold Explorer in 2010. In 2008, Gianti and Sartori proposed an alternative procedure to address scaffold decoration, pruning and fragmentation as a workflow for the identification of “privileged fragments”.
In 2010, Agrafiotis et al. demonstrated that including relevant side chains and functional groups in the scaffold representation could greatly enhance the derivation of robust SAR, indicating that the explicit consideration of the most significant molecular features overcomes the limits associated with the “a priori” definition of specific pruning rules. In 2011, Varin et al. proposed an extended version of the previously developed Scaffold Tree and demonstrated that rule-based approaches to fragmentation are less useful and flexible than unbiased ones. Lipkus introduced hierarchies between different abstraction levels and, finally, different hierarchical scaffold decomposition and abstraction approaches were proposed by many authors [19, 20].
All the methods described above share two major limitations. First, a single ring system is represented, decorated with chains of various lengths; therefore, when pruned, all the molecules collapse into a degenerate cluster. Second, there is no relationship between scaffolds that differ by one or more ring systems. These issues are particularly limiting for the analysis and selection of vendor libraries to build diverse compound collections and, afterward, for the analysis of HTS campaigns to obtain preliminary SAR.
Very recently, Bandyopadhyay et al. described a method based on fuzzy clustering, which may assign a molecule to several clusters, in order to overcome the limits of hard clustering methods, which assign each molecule to a single cluster and so tend to place structurally analogous molecules into different, unrelated clusters. In this method, all Bemis-Murcko scaffolds corresponding to all possible combinations of ring systems were exhaustively enumerated for each molecule, and data were annotated and aggregated at the scaffold level, making it possible to relate molecules on the basis of shared scaffolds. Another recent approach to scaffold analysis is based on retrosynthesis rules, which make it easy to identify analog series [22, 23]. Such analog series-based scaffolds can also be associated with activity information to develop target hypotheses for other compounds containing the same scaffold. The organization of compounds in analog series leads to the formation of “constellations” of molecules in chemical space, which can be visualized as a network of all possible molecule–core relationships.
However, the main limitation of the networks connecting molecules and scaffolds generated with these implementations is that they are based on a single scaffold representation, which is not sufficient to map effectively the chemical space of a heterogeneous ensemble of molecules, for example multi-scaffold libraries, or to capture relationships with the associated biological activity. The critical point is that the best representation of a molecule cannot be defined a priori, because it depends mainly on the biological context and on the nearest neighbors in the screened library.
Here, we present a novel approach, called “Molecular Anatomy”, for the generation and analysis of correlated molecular frameworks, aimed at overcoming the limitations of scaffold analysis based on a single predefined set of rules. In our experience, the molecular frameworks and related fragments identified here capture most of the structure-activity information from HTS campaigns, and are also useful for other applications, such as library design and analysis. In particular, the combination of fragments, correlated in frameworks and wireframes, identifies relevant chemical moieties efficiently, clustering together many scaffolds with similar shapes despite, for example, different dispositions of heteroatoms or small differences in bond order. To the best of our knowledge, this is a distinctive feature of our approach compared to other known methodologies, such as the widely accepted Maximum Common Substructure (MCS).
In the “Methods” section, the molecular scaffold representations and the fragmentation rules used to generate the related fragments are defined. A COX-2 inhibitors dataset has been chosen to illustrate our approach. Then, we introduce an innovative network representation as a more convenient tool for SAR evaluation and visualization. We first apply this visualization to the molecular frameworks proposed, and then we extend the network visualization also to the underlying fragments, to show the full graphical representation. We also summarize the main advantages of our method compared to the other approaches proposed so far. Finally, we show the general applicability of our approach by performing the SAR analysis of 26,092 commercial compounds tested in an HTS campaign aimed at identifying potential inhibitors of the enzyme Histone deacetylase 7 (HDAC7).
COX-2
A dataset containing COX-2 inhibitors was prepared and used to illustrate the “Molecular Anatomy” approach. To this end, the Integrity™ database was queried for COX-2 inhibitors, yielding 2599 molecules in total. Of these, 816 were in the preclinical phase or in a higher phase of clinical development. This subset was used in the following analysis to compare different scaffold representations. A Pipeline Pilot protocol was used to standardize the molecular structures, to classify them according to molecular mechanism (e.g. inhibitors) and highest phase of development, to perform substructure searches, to generate molecular frameworks according to our definition rules and, finally, to analyze the results in order to compare the different scaffold definitions.
HDAC7
A dataset of 26,092 commercial compounds, tested as potential HDAC7 inhibitors during an HTS campaign performed internally at Dompé, was used as a more complex case study. The compounds were stratified into activity classes according to the percent inhibition of HDAC7 activity measured at 10 μM concentration (Table 1).
Additional compound datasets were retrieved from release 28 of ChEMBL [28, 29] for further proof-of-concept studies. Two sets were selected, with at least 1000 active compounds, identified with ChEMBL Target IDs 202 (Dihydrofolate reductase) and 2000 (Plasma kallikrein). Only data measuring binding of compounds (i.e., assay type “B”) were collected. To ensure a high level of data integrity, only compounds with explicitly defined IC50 values were selected, using a cut-off of 5 μM as minimum potency to define compounds as “actives”. A third dataset consisted of a recently added repository generated within the “EXaSCale smArt pLatform Against paThogEns for Corona-Virus, Exscalate4CoV or E4C” project (CHEMBL4495564), containing activity data for ~ 8000 compounds from the primary screening against SARS-CoV-2 Main protease (Mpro) . In this case, compounds were considered as actives if enzyme inhibition was at least 40%.
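The selection criteria above can be sketched as a simple record filter. This is only an illustration: the `Record` fields and the example compounds are hypothetical, "explicitly defined IC50" is interpreted here as a relation of "=", and treating the 5 μM cut-off as inclusive is an assumption, since the text does not say which side the boundary falls on.

```python
from dataclasses import dataclass
from typing import Optional

IC50_CUTOFF_UM = 5.0        # potency cut-off quoted in the text
MPRO_MIN_INHIBITION = 40.0  # percent inhibition threshold quoted in the text


@dataclass
class Record:
    """Hypothetical activity record; field names are illustrative only."""
    compound_id: str
    assay_type: str           # "B" denotes a binding assay
    relation: Optional[str]   # "=" is taken to mean an explicitly defined value
    ic50_um: Optional[float]  # IC50 in micromolar


def is_active_binding(rec: Record) -> bool:
    """Binding assay, explicit IC50, potency at or below the cut-off (assumed inclusive)."""
    return (
        rec.assay_type == "B"
        and rec.relation == "="
        and rec.ic50_um is not None
        and rec.ic50_um <= IC50_CUTOFF_UM
    )


def is_active_mpro(percent_inhibition: float) -> bool:
    """Primary-screen rule: active if enzyme inhibition is at least 40%."""
    return percent_inhibition >= MPRO_MIN_INHIBITION


# Toy records, not taken from any database.
records = [
    Record("cpd-1", "B", "=", 0.8),   # kept
    Record("cpd-2", "B", ">", 10.0),  # censored value, dropped
    Record("cpd-3", "F", "=", 0.1),   # not a binding assay, dropped
    Record("cpd-4", "B", "=", 12.0),  # too weak, dropped
]
actives = [r.compound_id for r in records if is_active_binding(r)]
```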
Identification of common scaffold representations and evaluation of their performance
In theory, an arbitrary number of scaffold representations can be defined, based on different levels of abstraction and pruning rules. Figure 1 shows an example applied to the COX-2 inhibitor Polmacoxib (whose full structure is depicted in panel e).
Panel 1a shows the most abstracted representation (hereafter 1a), obtained by removing both bond order and atom type. This representation is also known as the “cyclic skeleton”. Representations 1b and 1c retain only bond order and only atom type, respectively, whereas the 1d representation corresponds to the Bemis-Murcko scaffold, which contains all the rings of the original molecule and the chains connecting them.
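The effect of the four abstraction levels can be illustrated on toy ring graphs. The sketch below is not the Pipeline Pilot protocol: it assumes scaffolds are given as atom labels plus a bond-order dictionary with already aligned atom numbering, so that equality of keys stands in for the canonicalization a cheminformatics toolkit would perform. The function and variable names are hypothetical.

```python
from typing import Dict, Tuple

Bonds = Dict[Tuple[int, int], int]  # (atom index, atom index) -> bond order

# Information kept by each representation: (atom types, bond orders).
LEVELS = {"1a": (False, False), "1b": (False, True),
          "1c": (True, False), "1d": (True, True)}


def abstract(atoms: Tuple[str, ...], bonds: Bonds, level: str):
    """Return a hashable key for the scaffold at the given abstraction level."""
    keep_atoms, keep_bonds = LEVELS[level]
    atom_key = tuple(a if keep_atoms else "*" for a in atoms)
    bond_key = tuple(sorted(
        (min(i, j), max(i, j), order if keep_bonds else 1)
        for (i, j), order in bonds.items()
    ))
    return atom_key, bond_key


# Toy five-membered rings with aligned atom numbering (Kekule bond orders).
ring_double = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 2, (4, 0): 1}
ring_single = {pair: 1 for pair in ring_double}
furan = (("O", "C", "C", "C", "C"), ring_double)
thiophene = (("S", "C", "C", "C", "C"), ring_double)
tetrahydrofuran = (("O", "C", "C", "C", "C"), ring_single)


def same(mol_a, mol_b, level: str) -> bool:
    """True if the two scaffolds collapse to the same key at this level."""
    return abstract(*mol_a, level) == abstract(*mol_b, level)
```

In this toy setting, furan and thiophene coincide once atom types are dropped (1b, 1a), while furan and tetrahydrofuran coincide once bond orders are dropped (1c, 1a); all three remain distinct at level 1d.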
By using the most abstracted representation 1a of the Polmacoxib scaffold, corresponding to the most common moiety of COX-2 inhibitors, a subset of 224 COX-2 inhibitors was identified. Figure 2 reports the MDL substructure query and the corresponding SMARTS string used to retrieve the molecules containing this substructure.
These 224 molecules correspond to 84 different scaffolds if the less abstracted 1d representation is used (Table 2), which makes it impossible to relate them to one another as sharing the same substructure. Furthermore, we found that 53 of the 84 Bemis-Murcko scaffolds (63.1%) have one or more additional rings; they correspond to 82 of the 224 molecules (36.6%) and are clustered in 34 groups according to the 1a representation. The remaining 142 molecules, with exactly 3 rings, correspond to 31 scaffolds in the 1d representation and collapse to a single cluster if the 1a representation is used.
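The counts quoted in this paragraph are mutually consistent, as the following short check shows. It uses only the numbers stated in the text; the variable names are illustrative.

```python
# Counts quoted in the text for the subset matching the common substructure.
total_molecules = 224
total_scaffolds_1d = 84

extra_ring_scaffolds = 53   # 1d scaffolds with one or more additional rings
extra_ring_molecules = 82
three_ring_scaffolds = 31   # 1d scaffolds with exactly 3 rings
three_ring_molecules = 142

# The two groups partition both the molecules and the 1d scaffolds.
molecules_sum = extra_ring_molecules + three_ring_molecules
scaffolds_sum = extra_ring_scaffolds + three_ring_scaffolds

# Percentages as reported (one decimal place).
pct_scaffolds = round(100 * extra_ring_scaffolds / total_scaffolds_1d, 1)
pct_molecules = round(100 * extra_ring_molecules / total_molecules, 1)

print(molecules_sum, scaffolds_sum, pct_scaffolds, pct_molecules)
```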
Identification of nine correlated scaffold representations
In “Molecular Anatomy” we used, as a starting point, the widely accepted scaffold abstraction representation (here called Basic Scaffold), which is generated by removing all side chains and terminal atoms. We then defined a set of nine molecular frameworks (MF) at different abstraction levels to match different side chain definitions, as shown in Fig. 3 for the COX-2 inhibitor Polmacoxib. We used two sets of pruning rules that together determine a multidimensional hierarchy.
The first set of rules is based on an increasing level of structural information with respect to the basic scaffold. In a first step, terminal atoms with bond order greater than one are retained (Decorated Scaffold); in a second step, the longest atom chain, considering also substitutions, is retained, but all terminal non-carbon atoms belonging to side chains are iteratively pruned (Augmented Scaffold). If no terminal atoms remain after removing all terminal non-carbon atoms with a bond order equal to 1, the decorated and augmented scaffolds coincide. Figure 4 illustrates these rules with examples covering the different possible cases.
The second set of rules, conversely, increases chemical abstraction by removing the atom type label and then the bond order, generating, respectively, a Framework and a Wireframe for each level of the scaffold (basic, decorated and augmented), thus finally producing nine molecular representations with a hierarchical correlation.
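The nine representations can be viewed as a 3 × 3 grid: three levels of structural detail crossed with three levels of chemical abstraction. The sketch below enumerates the grid and links each representation to its immediate, more abstracted neighbors. Linking only adjacent cells, and treating "basic" as more abstracted than "decorated" and "augmented", are assumptions made for illustration; the exact hierarchy is the one defined in Fig. 3.

```python
from itertools import product

LEVELS = ["basic", "decorated", "augmented"]           # increasing structural detail
ABSTRACTIONS = ["scaffold", "framework", "wireframe"]  # increasing abstraction

representations = [f"{lvl} {ab}" for lvl, ab in product(LEVELS, ABSTRACTIONS)]


def parent_edges():
    """Edges (child, parent) pointing toward the more abstracted representation."""
    edges = []
    for li, lvl in enumerate(LEVELS):
        for ai, ab in enumerate(ABSTRACTIONS):
            child = f"{lvl} {ab}"
            # Remove atom types, then bond orders: scaffold -> framework -> wireframe.
            if ai + 1 < len(ABSTRACTIONS):
                edges.append((child, f"{lvl} {ABSTRACTIONS[ai + 1]}"))
            # Prune decorations: augmented -> decorated -> basic.
            if li > 0:
                edges.append((child, f"{LEVELS[li - 1]} {ab}"))
    return edges


edges = parent_edges()
```

Under these assumptions the "basic wireframe" is the only node without outgoing edges, which matches its role as the most abstracted representation.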
Fragmentation rules definition
Scaffold-based techniques [12, 14, 32] share one general limitation: by definition, molecules that share a scaffold only partially are assigned to distinct clusters. To overcome it, in the “Molecular Anatomy” approach we have implemented an unbiased fragmentation scheme that can be applied in parallel to all nine scaffold representations described above. The rules are illustrated in Fig. 5, applied to molecules chosen as representative examples. The first rule (Fig. 5a) is a fragmentation based on the exhaustive and progressive elimination of all the internal chains from the scaffold. The second rule is an unbiased ring disassembly: the ring decomposition involves the removal of all fused rings, allowing their opening into fragments (Fig. 5b). For the sake of consistency, we also introduced a third rule to remove internal rings (Fig. 5c).
The fragmentation and deconstruction reported here introduce further hierarchies, meaning that each fragment of the original scaffold is related to all the other representations in a multi-dimensional space. As a result of this multi-dimensional hierarchical scaffold analysis, the entire set of generated molecular frameworks is highly interconnected, and it is possible to move from one to another following SAR. To clarify this concept, Fig. 6 reports an example showing how the combination of fragments and molecular frameworks at different abstraction levels makes it possible to cluster molecules with different scaffolds.
Network representation of “Molecular Anatomy”’s frameworks
The software Cytoscape was used to create and visualize an MF-based network, which was also integrated with activity data for the SAR analysis. This network provides a full graphical representation of the dataset composition, allowing easy navigation through the molecular frameworks and their hierarchical correlation. A Pipeline Pilot protocol was implemented to prepare the data matrix needed for the visualization, in the format required for import.
Each molecule from the dataset of 816 COX-2 inhibitors was described according to the nine molecular representations implemented in the “Molecular Anatomy” approach (Additional file 3: Table S2). A unique list of frameworks, corresponding to the nodes of the network, was then obtained (Additional file 4: Table S3), keeping the less abstracted representation in case of duplicate structures (when the same scaffold structure was obtained with different representations). All possible parent–child relationships between the nine molecular frameworks of each molecule, corresponding to the edges of the network, were generated according to the hierarchical relationships between the representation types shown in Fig. 3 (Additional file 5: Table S4).
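The construction of nodes and edges can be sketched as follows. For brevity the example uses only three of the nine representations and opaque placeholder strings ("S1", "F1", ...) instead of real structures; the rank table, the hierarchy pairs and the function name are hypothetical stand-ins for the relationships of Fig. 3.

```python
# Lower rank = less abstracted (illustrative subset of the nine representations).
RANK = {"basic scaffold": 0, "basic framework": 1, "basic wireframe": 2}
HIERARCHY = [("basic scaffold", "basic framework"),
             ("basic framework", "basic wireframe")]

# Toy per-molecule table: representation type -> structure key.
molecules = {
    "mol-1": {"basic scaffold": "S1", "basic framework": "F1", "basic wireframe": "W1"},
    "mol-2": {"basic scaffold": "S2", "basic framework": "F1", "basic wireframe": "W1"},
    # All three representations give the same structure for this molecule.
    "mol-3": {"basic scaffold": "W2", "basic framework": "W2", "basic wireframe": "W2"},
}


def build_network(table):
    """Return (nodes, edges): unique structures and child -> parent links."""
    nodes = {}     # structure -> representation type, least abstracted wins
    edges = set()  # (child structure, parent structure)
    for reps in table.values():
        for rep_type, structure in reps.items():
            current = nodes.get(structure)
            if current is None or RANK[rep_type] < RANK[current]:
                nodes[structure] = rep_type
        for child, parent in HIERARCHY:
            if reps[child] != reps[parent]:  # skip self-loops on duplicates
                edges.add((reps[child], reps[parent]))
    return nodes, edges


nodes, edges = build_network(molecules)
```

Here "mol-1" and "mol-2" have different scaffolds but meet at the shared framework "F1", and the duplicate structure "W2" is stored once, labeled with its least abstracted representation.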
To fully exploit this graphical representation, the network data matrix can be integrated with the enrichment factor (EF) calculated with Eq. (1) for each molecular framework (MF) from the activity data of the corresponding molecules; when the unique list of frameworks is generated, the highest EF value is kept in case of duplicate structures.
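Eq. (1) is not reproduced in this text. The sketch below therefore assumes the conventional definition of an enrichment factor, namely the fraction of actives among the molecules of a framework divided by the fraction of actives in the whole dataset; this is an assumption, not a confirmed transcription of Eq. (1). Function names and the example numbers are illustrative.

```python
def enrichment_factor(actives_in_mf: int, size_mf: int,
                      actives_total: int, size_total: int) -> float:
    """Assumed EF: (actives_in_mf / size_mf) / (actives_total / size_total)."""
    if size_mf <= 0 or actives_total <= 0 or size_total <= 0:
        raise ValueError("group sizes and total number of actives must be positive")
    # Rearranged to a single division to limit floating-point rounding.
    return (actives_in_mf * size_total) / (size_mf * actives_total)


def merge_duplicates(entries):
    """Keep the highest EF when the same structure appears more than once."""
    best = {}
    for structure, ef in entries:
        best[structure] = max(ef, best.get(structure, ef))
    return best


# Toy numbers: 4 actives among 10 molecules of a framework,
# 100 actives among 1000 molecules overall.
ef_example = enrichment_factor(4, 10, 100, 1000)
merged = merge_duplicates([("W1", 2.0), ("W1", 4.0), ("F1", 0.0)])
```

With this definition a framework that contains no active molecule has EF = 0, and a framework with the same active fraction as the whole dataset has EF = 1.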
In order to focus the network visualization on the most relevant dataset information, the nodes associated with EF = 0 can be filtered out, thus stepping through the nodes with increasingly higher EF values, as described in the HDAC7 case study (see “Results and discussion”).
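A minimal sketch of this filtering step, assuming the network is held as a dictionary of node EF values and a list of (child, parent) edges; the node names, EF values and thresholds are toy data, and in practice the same filter can be applied interactively in the network viewer.

```python
def filter_by_ef(node_ef, edges, threshold=0.0):
    """Keep nodes with EF strictly above the threshold and edges between kept nodes."""
    kept = {node for node, ef in node_ef.items() if ef > threshold}
    kept_edges = [(c, p) for c, p in edges if c in kept and p in kept]
    return kept, kept_edges


# Toy network: three child frameworks linked to one parent framework.
node_ef = {"A": 1.5, "B": 6.0, "C": 0.0, "D": 3.0}
edges = [("B", "A"), ("C", "A"), ("D", "A")]

# Step through increasingly strict thresholds, starting with EF > 0.
steps = {t: filter_by_ef(node_ef, edges, t) for t in (0.0, 2.0, 5.0)}
for t, (kept, kept_edges) in steps.items():
    print(f"EF > {t}: nodes={sorted(kept)} edges={kept_edges}")
```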
A fully connected network representation by means of “Molecular Anatomy”’s frameworks and fragments
As already described, “Molecular Anatomy” makes it possible to derive trees in multiple dimensions, such as wireframe > framework > scaffold, augmented > decorated > basic, or wireframe > ring disassembly > fragments, and in all the other possible directions, maximizing the SAR information of the dataset. The network visualization can also be extended to the fragments to obtain a fully connected network, considering that the smallest fragments (e.g. a benzene ring) are shared by a large number of the original molecules. In this implementation, the network nodes can be molecular frameworks, fragments or entire small molecules, and the edges, whose direction is defined by the fragmentation rule, start from the originator fragment and end at the corresponding fragments.
Molecular Anatomy Web interface implementation
The above-described protocol is freely accessible at https://ma.exscalate.eu. The web interface was implemented using LAMP (Linux Apache MariaDB PHP), an open source Web development platform that allows a fluent and responsive user experience in displaying and handling the output data, which in this case are calculated on the fly in a completely automated Pipeline Pilot workflow.
Implementation of Molecular Anatomy approach in Knime
The Pipeline Pilot protocol for the preparation of the data matrix was also re-implemented in Knime 4.3.2 by means of in-house Python scripts, taking advantage of the available RDKit and Indigo nodes.
Results and discussion
Comparison between common scaffold representations and “Molecular Anatomy” to perform SAR analysis
As shown in the “Methods” section for the COX-2 inhibitors dataset, scaffold representations with a high level of abstraction, shown in Fig. 1a–b for Polmacoxib, generally perform better than the others in identifying relevant chemotypes. Table 2 summarizes the results obtained for each representation in terms of the number of clusters generated, starting on the one hand from all 816 COX-2 inhibitors in preclinical development or in a higher phase and, on the other hand, from the subset of COX-2 inhibitors matching the MDL substructure reported in Fig. 2, the most common COX-2 inhibitor moiety. In particular, the number of clusters containing molecules that match the common substructure is reported separately for molecules with exactly 3 rings and with more than 3 rings.
Representation 1a clusters together most of the well-known marketed drugs, such as valdecoxib and celecoxib, as well as many other leads and experimental drugs, and collapses all 142 active molecules with exactly 3 rings into a single cluster. This cluster likely also includes several inactive molecules. Interestingly, even when this representation is used, almost 40% of the scaffold information, corresponding to the 82 molecules with additional rings, would still be lost in unrelated clusters, impairing the identification of the most relevant additional structural information.
Using the less abstracted representation 1d, we can retrieve and distinguish the most diverse COX-2 inhibitor scaffolds, although this information is distributed over 84 clusters when scaffolds with exactly 3 rings and with more than 3 rings are both considered. An intermediate representation such as 1b, where only the atom type information is removed, allows a more effective clustering of the relevant structural information, identifying only 11 different frameworks containing molecules with exactly 3 rings, instead of 31; however, almost the same number of clusters containing molecules with more than 3 rings is generated with the two representations (48 instead of 53).
This example on COX-2 inhibitors clearly shows how this kind of analysis depends strongly on the nature of the dataset: each scaffold abstraction of Fig. 1 provides some important structural information, but none of them alone is sufficient to capture the complexity of a heterogeneous ensemble of molecules. Only the integration of the information captured by the different scaffold abstractions, in a Multi-Dimensional Hierarchical Scaffold Analysis, makes it possible to map effectively the entire chemical space of multi-scaffold libraries. Furthermore, the combination of the “Molecular Anatomy” approach, the fragmentation rules and the network representation immediately focuses attention on the most interesting and useful structural information, and lets the user navigate easily among several structural clusters, moving from one molecular framework to another on the basis of their hierarchy and according to the SAR.
Attempts to identify more relevant chemical moieties have been presented in the past, for example the rule-based decompositions proposed by Schuffenhauer et al., schematized in Fig. 7 for three COX-2 inhibitor scaffolds. However, a clear limitation lies in the difficulty of defining a priori a set of rules able to maintain general consistency with SAR information.
The method that we propose, involving the combination of correlated molecular frameworks and fragments, is able to efficiently identify relevant chemical moieties, and to cluster together active molecules (also in the nanomolar range) belonging to different molecular classes within HTS campaigns, capturing most of the SAR information.
To fully exploit the hierarchical correlation among the molecular frameworks and to generate a full graphical representation of the analyzed dataset, we also propose a network visualization. Indeed, the combination of the MF approach with a network representation provides a more convenient tool for SAR evaluation and visualization [35,36,37,38], guiding the user from one molecular framework to another on the basis of their hierarchy, in the direction of increasing or decreasing level of abstraction, and according to the SAR.
Figure 8 shows the complete network obtained for the dataset of 816 COX-2 inhibitors. As reported in the list of statistical parameters (Fig. 8b), 277 connected components were generated, corresponding to the clusters obtained using the most abstracted (basic wireframe) representation. The biggest cluster, at the top of Fig. 8a, corresponds to the 142 molecules with exactly 3 rings (Table 2), all sharing the basic wireframe 1a. Figure 8c reports the hierarchical visualization of a smaller cluster, to further show that this graphical representation of the data matrix is a directed network, where nodes are in general molecular frameworks and the direction of the edges follows the increasing abstraction level of the molecular representations.
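Connected components are counted on the undirected view of the network, ignoring edge direction. The sketch below shows one standard way to compute them, a depth-first traversal over an adjacency map; the node names and edges are toy data, and the function is illustrative rather than the routine used by the network software.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components, treating the directed edges as undirected."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Toy network: one three-node cluster, one pair, one singleton.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("c", "b"), ("d", "e")]
components = connected_components(nodes, edges)
largest = max(components, key=len)
```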
Furthermore, it is possible to retrieve the relationships among the different representations within this cluster and, focusing on the most interconnected frameworks, to identify the structural characteristics representative of the active molecules, as shown in Fig. 9. On the other hand, the network visualization clearly shows the high number of singletons that would be dispersed if only representation 1a were considered. Here, thanks to the fragmentation, these singletons can be related to each other if they contain the same fragments, making it easy to verify whether they share characteristics with relevant clusters of actives.
Focusing on the fragments related to the basic wireframe representation, all the clusters identified in Fig. 8a can be connected to each other in a single network, as can be seen in Additional file 1: Figure S1.
Furthermore, Additional file 1: Figure S2 shows the fragments with the highest indegree values, in particular cyclohexane and cyclopentane, that is, the fragments with the highest number of incoming connections from other nodes within the network in Additional file 1: Figure S1.
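The indegree of a node is the number of edges that end at it. As a minimal sketch, assuming edges are stored as (originator, fragment) pairs, it can be computed with a counter; the originator labels and edge list below are toy data, not the content of Figure S1.

```python
from collections import Counter


def indegree(edges):
    """Count incoming edges per target node; edges are (originator, fragment) pairs."""
    return Counter(target for _, target in edges)


# Toy fragment network: three wireframes sharing two small ring fragments.
edges = [
    ("wireframe-1", "cyclohexane"),
    ("wireframe-2", "cyclohexane"),
    ("wireframe-3", "cyclohexane"),
    ("wireframe-2", "cyclopentane"),
    ("wireframe-3", "cyclopentane"),
]
ranking = indegree(edges).most_common(2)
```

Ranking nodes by indegree in this way surfaces the small, widely shared fragments that hold the network together.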
Some qualitative considerations can be made about the networks obtained. First, it is reasonable that highly connected singletons tend to be small fragments shared by a large number of molecules in the library (as shown in Additional file 1: Figure S2). Conversely, low molecular weight singletons involved in a small number of connections represent potentially interesting decorations of a specific group of the original molecules. If this group is enriched in a specific activity of interest, the singleton fragments connecting all the molecules in the group could represent a pharmacophore. Second, high molecular weight singleton fragments connecting clusters of molecules with enriched activity could represent chemical scaffolds or the “minimal chemical entity” that confers the selected activity on the cluster. Third, the meaning of the singletons constituting the networks may change according to the fragmentation rules used. While the approach suggested herein is a purely informatics-based fragmentation procedure, an alternative method is possible, in which singletons are reaction intermediates derived by applying retrosynthetic rules to the original molecules. In that case, the network would contain, as “fragments”, the precursors used to synthesize larger molecules and, as pathways connecting pairs of singletons, possible synthetic strategies to attach a specific interesting low molecular weight singleton to another one representing, for example, a scaffold.
In our experience, the “Molecular Anatomy” approach makes the connections between chemotypes easier to decipher. In particular, filtering by EF and ranking by number of connections for each cluster focuses the analysis on the highly connected singletons. These frameworks are highly relevant because they connect different chemotypes without overlapping fragments and could therefore suggest the most significant parts of active molecules, the fragments that could be exchanged, and the bond orders and atom types relevant for SAR derivation. This approach also brings into the SAR analysis molecules that are usually underestimated because they are singletons, or compounds with low ligand efficacy, which are here connected to relevant clusters corresponding to specific series of compounds. In this way, valuable information can be added to the SAR of the major hit series by connecting them to additional latent ones. This method can be considered an extension of the previously proposed compound set enrichment [17, 40, 41], based on a higher level of abstraction and potentially able to identify new hit series connected with the conventional ones.
Case study I: SAR analysis of an HTS campaign on HDAC7
To better illustrate the molecular scaffold representations and the fragmentation rules that we introduced, and to clarify the advantages of the proposed network visualization for SAR evaluation, we present as a case study the SAR analysis of the HTS campaign on HDAC7, performed on 26,092 compounds.
First, the set of nine molecular frameworks at different abstraction levels was generated for the entire dataset. For each of the nine frameworks, the EF was calculated from the inhibition data of the corresponding molecules; molecules were considered active if they belonged to the moderate, strong or very strong activity classes (Table 1).
Figure 10 shows the complete network obtained with Cytoscape for this dataset, as described in the “Methods” section. It is clearly a more complex case than the previous one, and was chosen to show the potential of our approach. In total, 3061 connected components were generated, corresponding to the clusters obtained using the most abstracted (basic wireframe) representation.
The most interesting basic wireframes in terms of SAR evaluation were selected (Additional file 1: Figure S3) by filtering for the highest EF values and numbers of connected active molecules, to focus the analysis on the abstracted scaffolds accounting for the most actives.
Figure 11a reports the network corresponding to one of these selected clusters, using a hierarchical layout for better visualization. The complexity of this network is due to the high number of nodes corresponding to all the molecules (on top, in light blue) and the related molecular frameworks (all other nodes) matching the basic wireframe reported in Fig. 11c. This complex network can, however, be considerably simplified by removing the nodes with an EF value equal to 0, that is, all the nodes connected only to inactive molecules. Applying this filter gives a clearer and more useful network (Fig. 11b) containing the most relevant dataset information. In this way, it is possible to extract easily only the pathways that are interesting in terms of SAR analysis, starting from a huge number of connections that ensure a complete evaluation of the structural information.
In more detail, starting from the selected basic wireframe (Fig. 11c), the network visualization makes it possible to identify more interesting sub-clusters, corresponding to the decorated wireframes reported in Fig. 12. The EF values of these decorated wireframes are higher than that of the basic wireframe they share, meaning that this approach allows focusing on specific characteristics of the active compounds.
Focusing on the increase in EF, for example moving from the basic wireframe (EF = 2.1) to the first decorated wireframe (EF = 7.8), highlights interesting “hotspots”, identifying a specific feature of the structure of the active molecules (e.g. a protruding bond in position 1 of the spacer between the two rings).
Furthermore, it is also possible to move to the less abstracted representation within the network, the decorated frameworks also reported in Fig. 12, which provide information about the bond order characteristics common to the active compounds. Continuing in this way, moving back through the network toward the lowest abstraction level, it is possible to visualize the original molecules.
A first interesting consideration about these results concerns the introduction of decorations in our scaffold representations: defining a description level in which protruding bonds are added to the basic scaffold makes it possible to better identify and distinguish the requirements essential for activity. This point is clearly shown in Figs. 11 and 12, where, moving from the basic to the decorated wireframes with higher values of EF and number of connections, it is possible to retrieve all the clusters containing the active molecules. On the other hand, 12 decorated wireframes and 37 decorated frameworks are shared with inactive molecules, which is further useful information for rationalizing which scaffold decorations are responsible for a decrease or even loss of activity.
Finally, we want to show how the most useful SAR information can be obtained by extending the analysis and the network to the fragments. When the fragmentation rules are applied to the dataset, the network visualization of the fragmented library interconnects all the molecular frameworks containing the same fragment, and the EF can be recalculated for each fragment according to the activity data of all the molecules connected via the corresponding molecular frameworks.
In particular, focusing on the interesting structures identified above, Fig. 13 reports the same scheme as Fig. 12, with the EF values recalculated considering all the clusters identified by molecular frameworks corresponding to superscaffolds of the scaffolds visualized (superframeworks).
Comparing Figs. 12 and 13, it is possible to identify the molecular frameworks whose EF increases when they are considered as fragments, and which therefore contain relevant structural characteristics of active molecules.
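The difference between the EF of a framework taken alone and its EF when treated as a fragment can be sketched as follows. The snippet assumes the usual enrichment-factor definition and a hypothetical data layout (sets of molecule ids per framework and a mapping from each fragment to the superframeworks containing it); the molecule ids and counts are invented for illustration.

```python
def ef_of(members, actives, total_size):
    """Enrichment factor of a set of molecule ids (usual definition assumed)."""
    hits = len(members & actives)
    return (hits * total_size) / (len(members) * len(actives))


def fragment_ef(fragment, members_by_framework, superframeworks, actives, total_size):
    """Pool the molecules of a framework and of all frameworks containing it."""
    pooled = set(members_by_framework.get(fragment, ()))
    for sup in superframeworks.get(fragment, ()):
        pooled |= set(members_by_framework.get(sup, ()))
    return ef_of(pooled, actives, total_size)


# Hypothetical dataset of 100 molecules, 10 of which (m1..m10) are active.
actives = {f"m{i}" for i in range(1, 11)}
members_by_framework = {
    "frameworkF": {"m1", "m11", "m12", "m13"},   # 1 active out of 4
    "superS1": {"m2", "m3"},                     # contains frameworkF as a fragment
    "superS2": {"m4", "m14"},                    # contains frameworkF as a fragment
}
superframeworks = {"frameworkF": ["superS1", "superS2"]}

ef_alone = ef_of(members_by_framework["frameworkF"], actives, 100)
ef_as_fragment = fragment_ef("frameworkF", members_by_framework, superframeworks, actives, 100)
```

Because molecules are pooled as a set, a molecule reachable through several superframeworks is counted once. Here the EF doubles from 2.5 to 5.0 when the framework is treated as a fragment.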
To better explain the contribution of the integration of fragments and molecular frameworks to the SAR analysis, we report at the top of Fig. 14, as an example, the decorated wireframe of Fig. 13 showing the highest increase in EF value with respect to Fig. 12, together with the corresponding five decorated wireframes retrieved in the fragmented library that contain it as a fragment. For each of these decorated wireframes, the EF value is reported, and that of the common wireframe on top, here treated as a fragment, is recalculated by adding the contribution of the other five. For each decorated wireframe, the corresponding decorated framework and scaffold of the active molecules are reported in the lower panel of Fig. 14.
This example shows how this approach allows structural information to be extracted from the different levels of scaffold representation. The decorated wireframe at the top of Fig. 14, identified on the basis of its higher EF value, represents the common pharmacophore of nine active compounds, corresponding to two rings connected by a linker of five heavy atoms with an H-bond acceptor in position 2. The five decorated wireframes matching this pharmacophore enrich the information, showing that the structural flexibility can be reduced by a cyclization involving different positions of the linker; this information can clearly orient the design of new compounds. Moving to the framework level, the information related to double bonds can be added, showing that in most of the active compounds both rings are aromatic. This information might suggest that aliphatic rings could also be included in active compounds. A simple search among the library frameworks makes it easy to verify whether this feature is already present in the library in inactive compounds; otherwise, it may represent a possible modification to be explored. Finally, moving to the scaffold level, the information related to atom type can be added, showing whether ketone, amide, urea or thiourea are preferred moieties in the active compounds, and thus providing useful insights for the design of new compounds.
We can conclude that, even if in this particular case study the decorated wireframe seems to be the most informative representation, in general the integration of all molecular frameworks and fragments in the network visualization is crucial for capturing the most relevant information in the SAR analysis of compound libraries.
Performances of different molecular frameworks in terms of EF
In order to investigate which level of scaffold abstraction leads to the highest enrichment factors, we employed three further datasets from ChEMBL (see “Methods” section) together with the above-described HDAC7 dataset. The nine molecular frameworks were generated for all these datasets and the corresponding EF values were calculated. The top 50 frameworks with the highest EF values and number of connected molecules were selected; the plot in Fig. 15 shows the number of selected frameworks for each of the nine representations. As pointed out above, the basic wireframe is often the molecular framework that performs best in identifying relevant chemotypes of active molecules, in particular for larger and more diverse datasets, such as Mpro and HDAC7. However, the other representations are also very informative in some cases, in particular for capturing the most significant features of molecules within active compound series. Therefore, integrating the information retrieved from the different scaffold abstractions makes it possible to map the chemical space of different types of compound datasets more effectively.
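The selection behind a plot such as Fig. 15 can be sketched as a ranking followed by a count per representation. The exact ranking rule is not spelled out here, so the snippet assumes frameworks are sorted by EF first and by number of connected molecules second; the record fields and values are hypothetical.

```python
from collections import Counter


def count_top_frameworks(records, top_n=50):
    """Count how many of the top-ranked frameworks belong to each representation."""
    # Assumed ranking: highest EF first, ties broken by number of connected molecules.
    ranked = sorted(records, key=lambda r: (r["ef"], r["n_molecules"]), reverse=True)
    return Counter(r["representation"] for r in ranked[:top_n])


# Hypothetical framework records for one dataset.
toy_records = [
    {"id": "f1", "representation": "basic wireframe", "ef": 8.0, "n_molecules": 12},
    {"id": "f2", "representation": "decorated wireframe", "ef": 8.0, "n_molecules": 5},
    {"id": "f3", "representation": "basic wireframe", "ef": 6.5, "n_molecules": 30},
    {"id": "f4", "representation": "scaffold", "ef": 1.2, "n_molecules": 40},
]
counts = count_top_frameworks(toy_records, top_n=3)
```

Running the same function once per dataset gives the per-representation counts that can then be plotted side by side.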
The “Molecular Anatomy” approach is available as a web application, freely accessible at https://ma.exscalate.eu. The user can upload a text file containing one or more compounds, encoded as canonical SMILES, to generate the molecular frameworks related to the nine molecular representations at different abstraction levels.
The output consists of four tables, resembling Additional files 2, 3, 4, 5: Tables S1–S4 of this paper, named Molecular Frameworks for each SMILES, Molecular Frameworks list, Attribute file for network, and Network file. Each output table can be downloaded as a .csv file. Specifically, the first table contains the uploaded compounds and the related nine molecular representations, encoded as canonical SMILES and InChIKey, listed in the same row. In the second output table, each compound and the corresponding nine molecular representations are listed as single rows. The third output table consists of the list of unique frameworks (attribute file), whereas the fourth output table (network file) lists all the parent–child relationships, which can be used for the network visualization in Cytoscape, permitting an efficient navigation of the scaffold space that can be readily used for SAR analysis.
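As a rough illustration of how a parent–child network file could be consumed outside Cytoscape, the sketch below parses a small CSV and walks from an abstract framework down to the molecules beneath it. The column names `child` and `parent`, the direction of the relationship (the more abstract representation as parent) and all identifiers are assumptions; the actual header of the downloaded file should be checked before use.

```python
import csv
import io
from collections import defaultdict


def load_network(csv_text, child_col="child", parent_col="parent"):
    """Build a parent -> children mapping from parent-child rows."""
    children_of = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        children_of[row[parent_col]].add(row[child_col])
    return children_of


def descendants(children_of, node):
    """Collect every node reachable below `node` (iterative depth-first walk)."""
    seen, stack = set(), [node]
    while stack:
        for child in children_of.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical network file content; ids stand in for SMILES or InChIKey values.
toy_csv = """child,parent
mol1,scaffoldA
mol2,scaffoldA
scaffoldA,frameworkA
frameworkA,wireframeA
"""
network = load_network(toy_csv)
below_wireframe = descendants(network, "wireframeA")
```

The same mapping can be joined with the attribute file to attach EF values or activity labels to each node.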
The dataset of 816 COX-2 inhibitors used for this study is provided within the web interface as a template file to test the application (by clicking on the “Submit Example” button).
We propose “Molecular Anatomy” as a fast and flexible method for the analysis of the chemical space, library design and SAR studies.
This set of tools could be useful in the management of large compound collections, for example in the analysis of HTS campaign results, as well as in the design of focused libraries. In addition, this kind of data organization makes it possible to efficiently analyze scaffold-activity relationships, identify relevant clusters and easily connect different chemotypes with biological activity. The limitation arising from the underestimation of the side chain effect can be easily circumvented by combining the “Molecular Anatomy” approach with other techniques, such as matched molecular pairs (MMP) [42,43,44]. In this case, the identified molecular frameworks can make MMP equally or even more efficient and consistent than other methods [45, 46]. Furthermore, using MA with a higher level of abstraction, it is possible to effectively compare SAR behaviors on multiple scaffolds, or to support scaffold hopping strategies.
Another interesting application, still in an early phase of evaluation, is the possibility of annotating a library according to the therapeutic area information of classified drugs, in order to accelerate the identification of target-based or disease-based libraries, using for example annotated databases such as MDDR and WOMBAT, or public databases.
Furthermore, molecular frameworks can be profitably used for compound clustering and database indexing. One of the most critical tasks in the design of large diverse libraries is the comprehensive mapping of chemical space. The generation of groups that can be considered as “series” in a medicinal chemist’s perception represents an important asset of the scaffold-based techniques. However, the clusters generated by currently available approaches generally tend to contain a large number of scaffolds, hampering the selection of chemical series for follow-up activities. Some improvements have been introduced to overcome this problem, for example by means of Maximum Overlapping Set (MOS). The main advantage of scaffold-based clustering techniques is that they do not require the calculation of similarity indices or pairwise similarities: indeed, the scaffold structure itself represents the aggregation rule, so that each molecule is assigned to a cluster regardless of the nature of its neighbors. In this sense, the approach can be defined as a “Natural Clustering” (NC) method. Another important feature is that no bias, such as the average number of molecules per group or the expected number of scaffolds, has to be introduced. This is very useful when some over-represented scaffolds may drive the analysis. Therefore, MA enables hierarchical clustering, considerably extending the potential of scaffold-based clustering.
Finally, NC makes relational databases ideal for chemical graph-based compound clustering applications. The composition of a specific cluster is independent of the chemical structure of the other clusters and, once the scaffold abstraction is defined, there is no need to re-cluster the whole library.
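The practical consequence of this property can be sketched with a dictionary keyed by the scaffold representation: adding a molecule touches only its own cluster. In the snippet below, `scaffold_of` is a hypothetical stand-in for whichever routine computes the chosen representation (here a simple lookup table), and all ids are invented.

```python
from collections import defaultdict


class ScaffoldClusters:
    """Clusters keyed by scaffold; no similarity matrix and no re-clustering."""

    def __init__(self, scaffold_of):
        self._scaffold_of = scaffold_of      # callable: molecule -> scaffold key
        self._clusters = defaultdict(list)

    def add(self, molecule):
        """Assign a molecule to its cluster and return the cluster key."""
        key = self._scaffold_of(molecule)
        self._clusters[key].append(molecule)
        return key

    def clusters(self):
        """Return a plain-dict copy of the current clusters."""
        return {key: list(members) for key, members in self._clusters.items()}


# Hypothetical precomputed scaffold keys standing in for a real scaffold routine.
toy_keys = {"mol1": "scaffoldA", "mol2": "scaffoldB", "mol3": "scaffoldA"}
library = ScaffoldClusters(toy_keys.__getitem__)
library.add("mol1")
library.add("mol2")
before = library.clusters()
library.add("mol3")   # only the scaffoldA cluster changes
after = library.clusters()
```

Because the key is computed from the molecule alone, the same logic maps naturally onto a grouped or indexed column in a relational database.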
Organizing molecules within a database by means of clustered SQL indices (based on the MA) can dramatically reduce the time required for substructure searches, as reported by Wilkens et al. and Masciocchi et al. In our implementation, owing to the higher abstraction of the frameworks and wireframes, it is also possible to further speed up substructure searches by using these representations as wildcard-like queries, such as “any atom” or “any bond”. Interestingly, MA representations are, per se, searchable molecular representations, which allows local similarities to be defined in the substructure search space. For example, the scaffold of a target molecule could be searched with either a lower or higher similarity to the reference template. Besides that, it is also possible to constrain the local diversity of the scaffold by requiring, for example, the presence of a specific hydrogen bond acceptor at a given position on the scaffold, or even by specifying a LogP range.
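The indexing idea can be illustrated with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: the table layout, column names and keys are hypothetical, and an ordinary secondary index is used as a stand-in for the clustered indices mentioned above. The indexed lookup acts as a cheap pre-filter, so that an expensive substructure match only has to be run on the returned candidates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecules (mol_id TEXT PRIMARY KEY, wireframe_key TEXT NOT NULL)"
)
# Index on the abstract representation so lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_wireframe ON molecules (wireframe_key)")
conn.executemany(
    "INSERT INTO molecules VALUES (?, ?)",
    [("mol1", "wireframeA"), ("mol2", "wireframeB"), ("mol3", "wireframeA")],
)


def candidates_for(connection, wireframe_key):
    """Return the ids of molecules sharing the query's wireframe key."""
    rows = connection.execute(
        "SELECT mol_id FROM molecules WHERE wireframe_key = ? ORDER BY mol_id",
        (wireframe_key,),
    )
    return [row[0] for row in rows]


hits = candidates_for(conn, "wireframeA")
```

Additional columns (for example a calculated LogP) could be added to the same table to express the property constraints described above as extra `WHERE` conditions.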
All these prospective applications make the MA approach a valuable cheminformatics tool that can considerably improve structural data analysis.
Availability of data and materials
Datasets generated and analyzed during this study are included in this published article and its Additional files. The Pipeline Pilot protocol employed for the current study can be downloaded at: https://github.com/andreabeccari/Molecular_Anatomy. A reimplementation of this protocol in Knime is available at the same address. The protocol is also available as a web application, freely accessible at https://ma.exscalate.eu.
SAR: Structure activity relationships
MCS: Maximum Common Substructure
HDAC7: Histone deacetylase 7
MOS: Maximum Overlapping Set
MMP: Matched molecular pairs
Macarron R (2015) Chemical libraries: how dark is HTS dark matter? Nat Chem Biol 11:904–905. https://doi.org/10.1038/nchembio.1937
Bender A, Jenkins JL, Scheiber J et al (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49:108–119. https://doi.org/10.1021/ci800249s
Todeschini R, Consonni V, Xiang H et al (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model. https://doi.org/10.1021/ci300261r
Brown RD, Martin YC (1996) Use of structure−activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comput Sci 36:572–584. https://doi.org/10.1021/ci9501047
McGregor MJ, Pallai PV (1997) Clustering of large databases of compounds: using the MDL “Keys” as structural descriptors. J Chem Inf Comput Sci 37:443–448. https://doi.org/10.1021/ci960151e
Raymond JW, Blankley CJ, Willett P (2003) Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures. J Mol Graph Model 21:421–433
Katritzky AR, Kiely JS, Hebert N, Chassaing C (2000) Definition of templates within combinatorial libraries. J Comb Chem 2:2–5
Hu Y, Bajorath J (2011) Target family-directed exploration of scaffolds with different SAR profiles. J Chem Inf Model 51:3138–3148. https://doi.org/10.1021/ci200461w
Bonchev D, Rouvray DH (1991) Chemical graph theory: introduction and fundamentals. Abacus, New York, London
Bemis GW, Murcko MA (1996) The properties of known drugs. 1 Molecular frameworks. J Med Chem 39:2887–2893. https://doi.org/10.1021/jm9602928
Hu Y, Stumpfe D, Bajorath J (2016) Computational exploration of molecular scaffolds in medicinal chemistry. J Med Chem 59:4062–4076. https://doi.org/10.1021/acs.jmedchem.5b01746
Wilkens SJ, Janes J, Su AI (2005) HierS: hierarchical scaffold clustering using topological chemical graphs. J Med Chem 48:3182–3193. https://doi.org/10.1021/jm049032d
Schuffenhauer A, Ertl P, Roggo S et al (2007) The scaffold tree–visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47:47–58. https://doi.org/10.1021/ci600338x
Wetzel S, Klein K, Renner S et al (2009) Interactive exploration of chemical space with Scaffold Hunter. Nat Chem Biol 5:581–583. https://doi.org/10.1038/nchembio.187
Agrafiotis DK, Wiener JJ (2010) Scaffold explorer: an interactive tool for organizing and mining structure-activity data spanning multiple chemotypes. J Med Chem 53:5002–5011. https://doi.org/10.1021/jm1004495
Gianti E, Sartori L (2008) Identification and selection of “privileged fragments” suitable for primary screening. J Chem Inf Model 48:2129–2139. https://doi.org/10.1021/ci800219h
Varin T, Schuffenhauer A, Ertl P, Renner S (2011) Mining for bioactive scaffolds with scaffold networks: improved compound set enrichment from primary screening data. J Chem Inf Model 51:1528–1538. https://doi.org/10.1021/ci2000924
Lipkus AH, Yuan Q, Lucas KA et al (2008) Structural diversity of organic chemistry. A scaffold analysis of the CAS Registry. J Org Chem 73:4443–4451. https://doi.org/10.1021/jo8001276
Vogt M, Huang Y, Bajorath J (2011) From activity cliffs to activity ridges: informative data structures for SAR analysis. J Chem Inf Model 51:1848–1856. https://doi.org/10.1021/ci2002473
Hu Y, Stumpfe D, Bajorath J (2011) Lessons learned from molecular scaffold analysis. J Chem Inf Model 51:1742–1753. https://doi.org/10.1021/ci200179y
Bandyopadhyay D, Kreatsoulas C, Brady PG et al (2019) Scaffold-based analytics: enabling hit-to-lead decisions by visualizing chemical series linked across large datasets. J Chem Inf Model 59:4880–4892. https://doi.org/10.1021/acs.jcim.9b00243
Stumpfe D, Dimova D, Bajorath J (2016) Computational method for the systematic identification of analog series and key compounds representing series and their biological activity profiles. J Med Chem 59:7667–7676. https://doi.org/10.1021/acs.jmedchem.6b00906
Dimova D, Stumpfe D, Hu Y, Bajorath J (2016) Analog series-based scaffolds: computational design and exploration of a new type of molecular scaffolds for medicinal chemistry. Futur Sci OA 2:FSO149. https://doi.org/10.4155/fsoa-2016-0058
Cerchia C, Dimova D, Lavecchia A, Bajorath J (2017) Exploring structural relationships between bioactive and commercial chemical space and developing target hypotheses for compound acquisition. ACS Omega 2:7760–7766. https://doi.org/10.1021/acsomega.7b01338
Naveja JJ, Medina-Franco JL (2019) Finding constellations in chemical space through core analysis. Front Chem 7:510
Hariharan R, Janakiraman A, Nilakantan R et al (2011) MultiMCS: a fast algorithm for the maximum common substructure problem on multiple molecules. J Chem Inf Model 51:788–806. https://doi.org/10.1021/ci100297y
Dassault Systèmes BIOVIA (2016) BIOVIA Pipeline Pilot.
Gaulton A, Hersey A, Nowotka M et al (2017) The ChEMBL database in 2017. Nucleic Acids Res 45:D945–D954. https://doi.org/10.1093/nar/gkw1074
Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940. https://doi.org/10.1093/nar/gky1075
Kuzikov M, Costanzi E, Reinshagen J et al (2021) Identification of inhibitors of SARS-CoV-2 3CL-pro enzymatic activity using a small molecule in vitro repurposing screen. ACS Pharmacol Transl Sci. https://doi.org/10.1021/acsptsci.0c00216
Penning TD, Talley JJ, Bertenshaw SR et al (1997) Synthesis and biological evaluation of the 1,5-diarylpyrazole class of cyclooxygenase-2 inhibitors: identification of 4-[5-(4-methylphenyl)-3-(trifluoromethyl)-1H-pyrazol-1-yl]benzenesulfonamide (SC-58635, celecoxib). J Med Chem 40:1347–1365. https://doi.org/10.1021/jm960803q
Ertl P, Schuffenhauer A, Renner S (2011) The scaffold tree: an efficient navigation in the scaffold universe. Methods Mol Biol 672:245–260. https://doi.org/10.1007/978-1-60761-839-3_10
RDKit. https://www.rdkit.org/. Accessed 28 May 2021
GGA Software Services LLC. Indigo Nodes for KNIME. http://ggasoftware.com/opensource/indigo. Accessed 28 May 2021
Xiong B, Liu K, Wu J et al (2008) DrugViz: a Cytoscape plugin for visualizing and analyzing small molecule drugs in biological networks. Bioinformatics 24:2117–2118. https://doi.org/10.1093/bioinformatics/btn389
Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504. https://doi.org/10.1101/gr.1239303
Iyer P, Stumpfe D, Bajorath J (2011) Molecular mechanism-based network-like similarity graphs reveal relationships between different types of receptor ligands and structural changes that determine agonistic, inverse-agonistic, and antagonistic effects. J Chem Inf Model 51:1281–1286. https://doi.org/10.1021/ci2001378
Lepp Z, Huang C, Okada T (2009) Finding key members in compound libraries by analyzing networks of molecules assembled by structural similarity. J Chem Inf Model 49:2429–2443. https://doi.org/10.1021/ci9001102
Varin T, Didiot MC, Parker CN, Schuffenhauer A (2012) Latent hit series hidden in high-throughput screening data. J Med Chem 55:1161–1170. https://doi.org/10.1021/jm201328e
Varin T, Gubler H, Parker CN et al (2010) Compound set enrichment: a novel approach to analysis of primary HTS data. J Chem Inf Model 50:2067–2078. https://doi.org/10.1021/ci100203e
Kruger F, Stiefl N, Landrum GA (2020) rdScaffoldNetwork: the Scaffold Network Implementation in RDKit. J Chem Inf Model 60:3331–3335. https://doi.org/10.1021/acs.jcim.0c00296
Griffen E, Leach AG, Robb GR, Warner DJ (2011) Matched molecular pairs as a medicinal chemistry tool. J Med Chem 54:7739–7750. https://doi.org/10.1021/jm200452d
Wassermann AM, Bajorath J (2011) Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem 3:425–436. https://doi.org/10.4155/fmc.10.293
Leach AG, Jones HD, Cosgrove DA et al (2006) Matched molecular pairs as a guide in the optimization of pharmaceutical properties; a study of aqueous solubility, plasma protein binding and oral exposure. J Med Chem 49:6672–6682. https://doi.org/10.1021/jm0605233
Hussain J, Rea C (2010) Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model 50:339–348. https://doi.org/10.1021/ci900450m
Hu X, Hu Y, Vogt M et al (2012) MMP-Cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J Chem Inf Model 52:1138–1145. https://doi.org/10.1021/ci3001138
Keiser MJ, Setola V, Irwin JJ et al (2009) Predicting new molecular targets for known drugs. Nature 462:175–181. https://doi.org/10.1038/nature08506
Zhou Y, Zhou B, Chen K et al (2007) Large-scale annotation of small-molecule libraries using public databases. J Chem Inf Model 47:1386–1394. https://doi.org/10.1021/ci700092v
Stahl M, Mauser H, Tsui M, Taylor NR (2005) A robust clustering method for chemical structures. J Med Chem 48:4358–4366. https://doi.org/10.1021/jm040213p
Wilkens SJ (2006) Relational database driven two-dimensional chemical graph analysis. Chem Biol Drug Des 68:135–138. https://doi.org/10.1111/j.1747-0285.2006.00426.x
Masciocchi J, Frau G, Fanton M et al (2009) MMsINC: a large-scale chemoinformatics database. Nucleic Acids Res 37:D284–D290. https://doi.org/10.1093/nar/gkn727
The authors declare that they have no competing interests.
Figure S1. Cytoscape network visualization of the 816 COX-2 inhibitors subset, where nodes include fragments related to the basic wireframe representation, contributing to create a fully connected unique network. Figure S2. The fragments extracted from the basic wireframe representation with the highest number of connections (indegree) in the Cytoscape network visualization reported in Figure S1. Figure S3. Selection of the most interesting basic wireframes, corresponding to the most abstracted representation in common within each cluster of the network, filtered by the highest values of EF and number of connected active molecules of the corresponding cluster.
Table S1. Dataset of 816 COX-2 inhibitors in preclinical development or in a higher phase, described according to the nine molecular representations of “Molecular Anatomy”. Each compound and the related nine molecular representations are encoded as canonical SMILES and InChIKey, respectively, and are listed in the same row.
Table S2. Dataset of 816 COX-2 inhibitors reported in Table S1; each compound and the related nine molecular representations are encoded as canonical SMILES and InChIKey and are listed as single rows. From this list, Tables S3 and S4 are generated.
Table S3. List of unique frameworks obtained from the molecules in Table S2.
Table S4. List of parent–child relationships between the nine molecular frameworks of molecules listed in Table S2.
Manelfi, C., Gemei, M., Talarico, C. et al. “Molecular Anatomy”: a new multi-dimensional hierarchical scaffold analysis tool. J Cheminform 13, 54 (2021). https://doi.org/10.1186/s13321-021-00526-y