Spectrophores as one-dimensional descriptors calculated from three-dimensional atomic properties: applications ranging from scaffold hopping to multi-target virtual screening
Journal of Cheminformatics volume 10, Article number: 9 (2018)
Spectrophores are novel descriptors that are calculated from the three-dimensional atomic properties of molecules. In our current implementation, the atomic properties that were used to calculate spectrophores include atomic partial charges, atomic lipophilicity indices, atomic shape deviations and atomic softness properties. This approach can easily be widened to also include additional atomic properties. Our novel methodology finds its roots in the experimental affinity fingerprinting technology developed in the 1990’s by Terrapin Technologies. Here we have translated it into a purely virtual approach using artificial affinity cages and a simplified metric to calculate the interaction between these cages and the atomic properties. A typical spectrophore consists of a vector of 48 real numbers. This makes it highly suitable for the calculation of a wide range of similarity measures for use in virtual screening and for the investigation of quantitative structure–activity relationships in combination with advanced statistical approaches such as self-organizing maps, support vector machines and neural networks. In our present report we demonstrate the applicability of our novel methodology for scaffold hopping as well as virtual screening.
Computational drug design has played a major role in the discovery of molecular therapeutics for more than three decades. This domain can be broadly classified into protein structure-based and ligand-based methods. Protein structure-based methods rely on the availability of structural information of both protein target and ligands, and includes technologies such as computational fragment-based drug design  and molecular docking [2, 3]. Ligand-based methods use only information obtained from the ligands for predicting activity, dependent on their similarity or dissimilarity to previously known active ligands. Widely used ligand-based methods include pharmacophore searching [4, 5], bitwise fingerprint-based similarity searches [6,7,8,9] and the development of quantitative structure–activity/property relationships involving a variety of different molecular descriptors [10,11,12,13]. Additionally, important cheminformatics approaches such as the establishment and maintenance of compound databases [14,15,16,17], compound clustering [18, 19] and maximum common substructure calculations [20, 21] are now firmly integrated into the workflow of many pharmaceutical drug discovery processes.
Ligand-based virtual screening approaches all rely on the concept that structurally similar molecules have similar biological activities . Molecular fingerprints are bitwise representations of molecular structure and properties and examples include hashed connectivity pathways , dictionary-based , and layered atom fingerprints [25, 26]. Another example are the 3D-MoRSE descriptors . These methods are also called two-dimensional (2D) similarity methods since these do not rely on the underlying three-dimensional (3D) structure of the molecules.
It has been shown that in a significant number of cases ligand-based virtual screening outperforms protein structure-based virtual screening , although the latter performs better in scenarios where novel scaffolds need to be identified . 3D similarity virtual screening methods make use of the three-dimensional structure of the reference compound, as a query to search for compounds that have similar spatial atomic arrangements. These methods are not dependent upon the underlying molecular topology of the query compounds and are therefore also useful for scaffold hopping. Examples of such algorithms include shape-matching algorithms and shape-based fingerprints [30, 31], molecular field descriptors [32, 33], pharmacophore fingerprints [34,35,36] and pharmacophore-based screening [37,38,39]. A number of recent reviews on the use of descriptors and classification methods are available [40,41,42,43].
In this study, a novel shape-based descriptor is described which is termed a ‘spectrophore’, referring to the fact that this descriptor is composed of a one-dimensional ‘spectrum’ of n real numbers, with each of these numbers representing the interaction between a given molecular property and a certain artificial environment (hence resembling a type of ‘pharmacophore’). Because spectrophores are shape-based, these descriptors are not directly dependent on the actual molecular topology but rather on the molecular field that is generated by the underlying topology, hence craving it use as a scaffold-hopping tool in combination with automated molecular design approaches. In addition, since the spectrophores are composed of a set of real numbers and being independent on the underlying molecular orientation, these descriptors can be used as input to automated machine learning approaches for the generation of advanced QSAR models. The spectrophore approach has been based on the affinity fingerprinting technology, which was originally described in the 90’s by Terrapin Technologies, Inc. . In this approach, an affinity fingerprint is the pattern of the in vitro binding potency of a single compound to a reference panel of eight diverse proteins. Using a database of such affinity fingerprints, the authors were able to predict the binding potency of a novel compound for a specific protein target using a multivariate linear regression model, derived from the affinity fingerprints of a small set of training compounds. An analogous fingerprinting system is used in the spectrophore technology presented here, but in which the diverse set of reference proteins of the original affinity fingerprinting technology has been replaced by a set of virtual affinity cages. In addition, the in vitro measured binding potencies have also been replaced by the calculated interaction energies between a number of atomic properties and the surrounding cage points. We report the applicability of this approach for virtual screening and compound clustering. The influence of the conformational flexibility on the generated spectrophores is also discussed.
Artificial cages surrounding the molecule
Spectrophores are generated by calculating the interaction energies between the molecule and a set of predefined artificial cages that surround the molecular conformation. Each cage consists of 12 points and each point is assigned a value of + 1 or − 1, with the additional constraint that the sum of the values on all points on the cage should be 0 (hence each cage consists of six points with value + 1 and six points with value − 1) (Fig. 1).
Within the constraint that the sum of all values on the cage must be zero, it is possible to construct either 12 or 18 unique cages, depending on whether the + 1 and − 1 values are distributed in either a symmetrical or asymmetrical manner along the cage. These different cages are summarized in Table 1.
Each molecule (or more specific: each conformation) is inserted into each of these cages with the molecular center of geometry corresponding to the center of the cage. The initial orientation of the molecule is taken from the input geometry provided by the user; however this parameter is not important as the molecule is subsequently rotated along all its axes within the surrounding cage (see below). In our current implementation we opted to use a rectangular cage with the cell dimensions adjusted in such a manner that the minimum distance between the enclosed molecule and each of the cell edges corresponds to a constant value which is in the same range of a typical ligand-receptor non-bonded contact, for example around 3 Å, and which can be specified at runtime (corresponds to the resolution of the spectrophore; see below). This constant distance between molecule and cage is established by altering the cage dimensions on each new orientation of the enclosed molecule.
The generation of a spectrophore requires the calculation of a number of atomic properties of which the interaction energy with the cage point values is obtained. In our current implementation, four atomic properties were generated, which include the atomic partial charges, atomic lipophilicities, atomic shape deviations and atomic electrophilicities. These properties were selected based on the fact that no or little correlation exists between each of these, as shown in Fig. 2 in which the atomic properties calculated from a subset of 10,000 compounds randomly selected from the DUD-E dataset [45, 46] are plotted against each other.
Atomic shape deviations are generated by calculating, for each atom, the atom’s deviation from the average molecular radius. Atomic partial charges and atomic electrophilicity properties are calculated using the electronegativity equalization method, as described by Bultinck and coworkers [47, 48]. Atomic lipophilicities are assigned using a rule-based approach, according to the particular atom type. Parameters used to calculate lipophilicity, electronegativity and partial charges are summarized in Table 2.
For a molecular conformation with j atoms and p atomic properties (in the current implementation p = 4), the total interaction value V(c, p) between the atomic contribution values A(j, p) of property p and the cage values P(c, i) for a given cage c with i cage points, is calculated according a standard interaction energy equation as given in Eq. 1:
with r ij being the Euclidean distance between cage point i and atom j. The arbitrary factor of − 100 in the equation above is used to scale the calculated interaction values to a number of order ~ 1, with attractive interaction values expressed as positive numbers and repulsive values as negative numbers. This total interaction energy V(c, p) for a given property p and cage c is maximized by rotating the molecular orientation along the three angular dimensions and calculating at each rotational orientation the corresponding V(c, p) value. The final interaction energy V(c, p) for a given cage c and property p corresponds to the maximal interaction energy obtained this way. The entire process is repeated for each cage and for each atomic property, hence a typical spectrophore vector consists of c times p values, with c being the number of artificial cages that are used and p the number of different atomic properties. In the current implementation, default values for c and p are 12 and 4, respectively, meaning that 12 different cages and 4 different atomic properties are used, thereby generating spectrophores of 48 values per molecule (Fig. 3). The 48 values are organized into four sets of 12 values each:
Values 01–12: optimal interaction energies calculated from the atomic partial charges;
Values 13–24: optimal interaction energies calculated from the atomic lipophilicities;
Values 25–36: optimal interaction energies calculated from the atomic shape deviations;
Values 37–48: optimal interaction energies calculated from the atomic electrophilicities.
It should be noted that other spectrophore sizes are also possible. When the asymmetrical cages are selected (Table 1) the resulting spectrophore consists of 18 × 4 = 72 values. In case both the asymmetric and symmetric cages are selected, the resulting spectrophore will consist of 30 × 4 = 120 values. However, by default the non-stereospecific set of 12 cages are used, resulting in a spectrophore of 48 values.
As mentioned previously, the total interaction energy between a given cage and molecule for a given property is optimized by sampling the molecular orientation in angular steps of a given magnitude. Larger angular step sizes lead to faster computing times, but at the risk of missing the global interaction energy maximum, leading to a dependency of the spectrophore values on the actual starting orientation. Smaller angular step sizes sample the orientational space more thoroughly, but at a much higher computational cost. In our current implementation, accuracy is restricted to angular step sizes of 1°, 2°, 5°, 10°, 15°, 20°, 30°, 36°, 45° or 60° along all three axes. The user can specify this step size and therefore influence the required accuracy of the method.
Spectrophores capture information about the property fields surrounding the molecule. The closer the surrounding cage is wrapped around the molecule the more atomic details and variations are captured in the resulting spectrophore values. The default distance between the molecule and cage is 3 Å, as this resembles a non-bonded average distance between the receptor and ligand. Computational time is not influenced by the applied resolution setting. In the current implementation, resolution can be specified by any real number that is larger than 0.
As previously mentioned, there are 12 cages that are symmetrical and 18 cages with an asymmetrical distribution of points. These latter cages are therefore sensitive to the enantiomeric configuration of the molecule within the cage. For example, the generated spectrophores of both enantiomers of a chiral molecule will be of opposite sign to each other. In most instances, the symmetric cages will suffice for normal usage of the spectrophore technology. In the current implementation, there are three stereospecificity settings: ‘none’ for no stereoselectivity (hence using the 12 symmetric cages), ‘unique’ for using only the 18 asymmetric cages, and ‘all’ for using the 12 symmetric and 18 asymmetric cages.
In some circumstances it may be desirable to focus on the relative differences in the spectrophore values rather than on the absolute numbers, and for this reason normalization of the calculated values may be needed. Normalization may be important when comparing spectrophores of charged and neutral molecules, since the presence of a formal charge in the molecule will lead to a shift in the spectrophore values of the atomic charge and electrophilic properties: the lipophilicity and shape deviation spectrophore points are not influenced by the presence or absence of a formal charge. Normalization is performed on a ‘per-property’ basis, meaning that normalization is only performed on the data points belonging to the same property and not across all the data points. In our current implementation, there were four normalization settings: ‘none’ for no normalization, ‘mean’ for normalization by zero mean, ‘std’ for normalization by unit standard deviation, and ‘all’ for normalization by zero mean and unit standard deviation.
Results and discussion
Conformational flexibility dependency
The DUD-E dataset [45, 46] was used as a source for the selection of 1000 random compounds. Conformations were generated using RDKit . For each molecule, the number of conformations generated was equal to 1.5 times the number of atoms. Each conformation was converted into a spectrophore using the default values (accuracy: 20°, resolution: 3 Å, stereospecificity: ‘none’). The standard deviation of the spectrophore values from different molecules (using only the first conformation of each molecule to calculate the spectrophore from) was compared to the standard deviation of the spectrophore values calculated from the different molecular conformations. The results are summarized in Fig. 4, and demonstrate that the variation in the spectrophore values, resulting from the conformational flexibility, is less than the variability resulting from molecular differences: hence spectrophores from different molecules show more variation than spectrophores from different conformations of the same molecule.
Influence of the resolution and accuracy settings
The resolution of the spectrophore calculations is controlled by the distance between the central molecule and its surrounding cage. In Fig. 5 (top), the dependency of the spectrophore values on the resolution is shown. With increasing values in resolution setting, the absolute values of the calculated spectrophores become smaller, which is a consequence of the larger distance between the molecule and its surrounding cage. Changes in resolution often do not modify the overall shape of the resulting spectrophores, but minor local changes in the values are nonetheless observable (for example in the 1–12 region of the spectrophore; see Fig. 5). It is therefore important to use an identical resolution setting when comparing spectrophores of different molecules. In our current spectrophore implementation, a distance of 3 Å was used as the default resolution, since this distance reflects the typical non-bonded distance between ligand and its surrounding receptor.
Accuracy of the spectrophore calculations is specified by the angular step size that is used to rotate the molecule within its surrounding cage. Smaller step sizes can lead to significantly longer calculation times, while larger step sizes are much faster in sampling the rotational space but at the risk of missing the global interaction energy minimum. As shown in Fig. 5 (bottom), the best compromise between accuracy and computing time is obtained with an angular step size of 20°, as when using this step size there are no significant differences with the corresponding spectrophore values obtained using smaller step sizes. In contrast, accuracy settings of 36° and 60° lead to significant deviations in the calculated values.
Application note 1: precision and recall calculated on the DUD-E dataset
The DUD-E dataset [45, 46] was used to evaluate the potential use of spectrophores in the field of virtual screening. The original DUD-E dataset contains a total of 22,886 active compounds for 102 different pharmacological targets, with each active compound ‘diluted’ by an average of 50 decoy structures. For each of the 102 pharmacological targets, a reference compound was selected and the ranked Euclidean distances between the spectrophores of these reference compounds and all other DUD-E spectrophores were used to calculate the area under the curve (AUC) from the corresponding receiver operating characteristic (ROC) curves. Spectrophores were calculated with varying normalization parameters and stereospecificity settings, however with a constant default resolution of 3 Å and a constant angular step size of 20°. Conformations were those as provided in the DUD-E dataset. The results are summarized in Table 3.
The best median AUC value (0.63 ± 0.13) was obtained when normalization was calculated over the average (normalization ‘mean’) and with stereo option ‘none’ (hence using the 12 non-stereospecific cages). This AUC value came close to the AUC of 0.66 as calculated using Morgan fingerprints [50, 51] in combination with the Tanimoto similarity index (data not shown) . On the other hand, the worst AUC median (0.55 ± 0.19) was found without normalization (normalization ‘none’) and when only the 18 stereospecific probes (stereospecificity ‘unique’) were used.
Spectrophores are vectors of real numbers and can therefore be used in machine learning applications to classify active from inactive compounds. The example shown in Fig. 6 applies four machine learning approaches [(a) stochastic gradient descent (SGD) linear regression, (b) logistic regression and support vector machine (SVM) with both a linear (c) and a polynomial (d) kernel], as implemented within the scikit-learn package , to classify active from inactive compounds within 102 targets of the original DUD-E dataset [45, 46]. For each molecule, spectrophores were calculated from the single molecular conformation as provided in the DUD-E set, each of them differing in the applied normalization and treatment of stereospecificity. Each dataset was cross-validated ten-fold using 10 stratified subsets, and the best models were selected based on maximum precision and recall. At the end, averages of the 102 recall and precision values were calculated and these averages are plotted in Fig. 6.
In general, logistic regression and SVM with a polynomial kernel gave the best results. The best values of precision were found using logistic regression with normalization and stereospecificity both set to ‘none’ (precision = 0.94 and recall = 0.71), or with normalization set to ‘std’ and stereospecificity set to ‘all’ (precision = 0.96 and recall = 0.64). The best result for recall was obtained using the SVM polynomial kernel with normalization set to ‘none’ and stereospecificity set to ‘unique’ (precision = 0.79 and recall = 0.94) (Fig. 6).
The obtained metrics are comparable to results obtained with other approaches based on 2D-fingerprints or standard shape-based methods, indicating that the spectrophore technology can also be used as a virtual screening platform. However, direct comparison between the different approaches, and in particular questions aimed at answering which method is the’best’, are in our opinion not useful since each method or approach has its own application domain and one method may be more applicable or desired compared depending on the question to be answered.
Application note 2: scaffold hopping
Spectrophores are calculated as interaction energies between a set of atomic properties and a set of artificial receptors, in casu a set of cages represented by 12 cage points each with a + 1 or − 1 value. With respect to this, since only the atomic properties themselves, and not the actual atom types, contribute to the final spectrophore values, spectrophores can be useful for scaffold hopping in which one wants to identify fragments with similar interaction properties but with different atomic and topological environments. In order to demonstrate the applicability of spectrophores to scaffold hopping, all five- and six-membered disubstituted aromatic rings were extracted from the DUD-E dataset [45, 46] and converted into their corresponding spectrophores (using full normalization and with only the 12 non-stereospecific cages) after replacing each of the two sidechains by a methyl group and generation of a single conformer for each ring. Subsequent clustering of the calculated spectrophores using the affinity propagation implementation of scikit-learn  classified the 72 different ring systems into seven different clusters, populated with 4–17 members each (Table 4; for a list of all ring systems and their corresponding cluster, see Additional file 1: S1). For the majority of clusters, a consensus chemical scaffold could be identified; these scaffolds are shown in Fig. 7. As demonstrated in Table 4 and Fig. 7, it can be seen that ring types 0, 1, 3, 5 and 6 are very well separated from the other types. Ring types 0, 3 and 5 are all 1,3-disubstituted five-membered ring systems, with a hydrogen bond acceptor functionality in the 4- and 2-position for types 0 and 5, respectively. Ring types 1 and 6 are 1,2-disubstituted five-membered rings, the difference between these two is the presence of a hydrogen bond acceptor pharmacophore at position 3 and 4 for type 1 and 6, respectively. Clusters 2 and 4 are less well clearly defined, with cluster type 2 being a mixture of rings of type 0, 1, 2, together with a significant fraction of rings which cannot be not classified into these seven ring clusters. Finally, cluster type 4 is merely composed of 1,2-disubstituted six-membered rings with a significant contamination of rings which can be classified as ring type 1 (Table 4).
Again, as in the case of case study 1, we do not want make any statements whether the spectrophore approach in scaffold hopping is better than any other method; we have only included this case in order to demonstrate the applicability of the technology in scaffold hopping and the use of it in compound clustering.
Application note 3: combining machine learning and spectrophores for the identification of novel inhibitors from compound databases
In order to demonstrate the applicability of spectrophores in the domain of virtual screening, mathematical models were generated from the spectrophores, and these models were subsequently used to identify compounds predicted to be inhibitory active against a particular subset of therapeutic targets. Following this virtual screening step, a number of these compounds were actually acquired and their predicted inhibitory activity was subsequently biochemically validated.
The training set was constructed from the DUD-E dataset [45, 46]. For each compound in the DUD-E set, ten conformations were generated using RDKit  and each conformation was converted into a spectrophore with default parameters (accuracy: 20°, resolution: 3 Å, stereospecificity: ‘none’). All ten spectrophores of each molecule were then merged into a single spectrophore by calculating the element-wise maximum of all the ten spectrophores. The training set hence consisted of 22,802 spectrophores corresponding to the ‘active’ compounds, and 1180,480 spectrophores corresponding to the ‘inactive’ compounds (the decoy set in DUD-E).
Building the classifiers
In the first phase of this virtual screening experiment, a binary classification machine learning model was trained and selected based on maximal ‘precision’ to classify ‘active’ spectrophores from ‘inactive’ ones (in order to limit the number of false positives, ‘precision’ was chosen as a model evaluation parameter). All models were generated using the scikit-learn package in Python . The best model comprised a majority soft voting model with a random forest and k-nearest neighbors model as underlying classifiers (Table 5, phase 1). A second model was also generated in which the training set consisted of all ‘active’ spectrophores labeled according their DUD-E pharmacological targets, and in which only these ‘active’ compounds were used to train the particular model (hence no ‘inactives’ in this training set). This multiclass model was used in the second phase of the virtual screening experiment to assign the most likely pharmacological target label to each of the ‘active’ compounds that were selected during the first phase. The best multiclass model with ‘precision’ as evaluation parameter was the Extra Trees classifier (Table 5, phase 2). Cross-validated precision scores for this model are given in Table 6.
Virtual screening of a test set
The Enamine HTS compound collection  was downloaded as a test set and the 1735,523 compounds were converted into their respective spectrophores using the same protocol as for the training set. Screening of this test set was performed in two phases (Fig. 8). In the first phase, all Enamine spectrophores were classified as either ‘active’ or ‘inactive’ according the prediction calculated by the phase 1 classifier (Fig. 8 and Table 5). In total, 93 of these spectrophores were labeled as ‘active’. In the second phase, all 93 ‘actives’ were filtered through the phase 2 classifier model in which each of these spectrophores were labeled with one of the 102 possible pharmacological classes according to the prediction made by phase 2 classifier. In total, 32 different labels were assigned to each of the 93 spectrophores. The results are summarized in Table 6 and demonstrate that 2/3 of all 93 spectrophores (read: compounds) are assigned to only seven pharmacological classes: acetylcholinesterase (25 compounds), tyrosine-protein kinase SRC (10) and LCK (4), dopamine receptor D3 (9), peroxisome proliferator-activated receptor α (4) and γ (5), and thrombin (5). The quality of these predictions (expressed as ‘precision’) ranges from ‘low’ (0.33, 0.38, and 0.42 for the peroxisome proliferator-activated receptor α and γ, and thrombin, respectively), over ‘medium’ for the two tyrosine-protein kinases SRC and LCK (0.53 and 0.57, respectively), and up to ‘good’ for the dopamine D3 receptor and acetylcholinesterase (0.67 and 0.74, respectively). The highest quality models in term of ‘precision’ are those for farnesyl transferase (0.98), β-glucocerebrosidase (0.81) and dihydroorotate dehydrogenase (0.87).
Validating the two highest quality models
According Table 6, the two best models are farnesyl diphosphate synthase and dihydroorotate dehydrogenase, with tenfold cross validated specificity values of 0.980 ± 0.016 and 0.870 ± 0.042, respectively. For each of these targets only a single compound from the Enamine library was predicted to bind, and these structures with corresponding Enamine codes are given in Table 7. The calculated distances, expressed as Tanimoto and Euclidean distances, between each of the two hits and the corresponding actives from the DUD-E dataset are plotted in Fig. 9.
Focusing in first instance on the farnesyl diphosphate synthase target, the hit retrieved from the Enamine HTS library for this enzyme was Z2181651281, a compound also known as ibandronic acid and described as a potent farnesyl pyrophosphate synthase inhibitor . Unfortunately, this compound was also one of the 85 compounds that were part of the DUD-E training set used to derive the farnesyl diphosphate synthase model, and therefore we cannot conclude that the model has been able to identify a novel farnesyl diphosphate synthase inhibitor. However, in Fig. 9, the Euclidean and Tanimoto distances between Z2181651281 and each of the 85 DUD-E compounds are plotted (green dots). Inspection of these Euclidean distances reveal three other compounds that have a significant spectrophore similarity to Z2181651281 (Euclidean spectrophore distance ≤ 50). The structures of these three compounds, together with their topological and spectrophore distances to Z2181651281, is given in Table 8. All these compounds are bisphosphonates and are structurally very similar as reflected by the large spectrophore similarities. However, the calculated Tanimoto similarities range from 0.36 to 0.67, values which are significantly below the generally accepted cutoff of 0.85 to reflect chemical similarity . Hence, these compounds would not have been retrieved as putative farnesyl diphosphate synthase inhibitors when searched with topological fingerprints and a Tanimoto similarity cutoff ≥ 0.7, while using spectrophores in combination with an Euclidean distance cutoff ≤ 50 would have retrieved all three compounds.
The single compound identified as hit of dihydroorotate dehydrogenase is Z2465619914 (Table 7). This compound is also known as brequinar and has been described as a potent dihydroorotate dehydrogenase inhibitor , hence validating our virtual screening approach using spectrophores as similarity metric. Within the DUD-E dataset, there are 17 compounds having a spectrophore Euclidean distance less than 50 to Z2465619914 (Fig. 9). The structures of these compounds are shown in Table 8. From this table it is clear that most of these 17 compounds are quite similar to Z2465619914, however six of these compounds have a Tanimoto similarity ≤ 0.7 while their spectrophore similarity is still within the cutoff window of 50. Hence the same conclusion as for farnesyl diphosphate synthase can be made here: these compounds would not have been retrieved as putative inhibitors when searched with topological fingerprints and a Tanimoto similarity cutoff ≥ 0.7; however a different result would have been obtained using spectrophores in combination with an Euclidean distance cutoff ≤ 50.
In vitro biochemical validation of the predictions
In order to validate the predictions in an orthogonal manner, in vitro biochemical testing of the inhibitory activities of some of the compounds was performed. Driven by the availability of a number of in house biochemical assays on the one hand, but on the other hand also being limited by financial restrictions, we selected acetylcholinesterase (medium quality model, see Table 6) and thrombin (low quality classification model) as pharmacological targets for the biochemical validation. As shown in Table 6, five compounds were predicted to target thrombin and 25 compounds targeted acetylcholinesterase; hence these 30 compounds were ordered from Enamine in 5 mg solid state quantities each. All 30 purchased compounds were tested in both biochemical assays in concentrations ranging from 100 to 1 μM. Acetylcholinesterase activity was determined by a kinetic assay using the indicator 5,5′-dithiobis-(2-nitrobenzoic acid) (DTNB, Ellman’s reagent) and the substrate acetylthiocholine iodide (K m = 420 μM) at concentration 500 and 400 μM, respectively. All the experiments were conducted in duplicate at 25 °C in a 100 mM phosphate buffer at pH 7.8. Control experiment using a commercial inhibitor of acetylcholinesterase, neostigmine methyl sulfate (IC50 ≈ 40 nM), was included in each screening assay. Thrombin activity was determined by a kinetic assay using the chromogenic substrate Biophen CS-21(66) (pyro-Glu-Pro-Arg-pNA∙HCl, Km = 400 μM) at 415 μM concentration. All the experiments were conducted in duplicate at 37 °C in 50 mM HEPES buffer at pH 8.1. Control experiment using a commercial inhibitor of thrombin, gabexate mesylate (Ki = 500 nM) was included in each screening assay. Detailed protocols for the biochemical assays are described in Additional file 2: S2.
The results obtained from the biochemical assays are summarized in Table 9. Protocols used for the biochemical assays are described in Additional file 2: S2. Many of the compounds were difficult to solubilize in the assay buffers, so the highest concentration at which measurements could be done was determined by the actual compound solubility. Additionally, acetylcholinesterase is very sensitive to the presence of DMSO, therefore the final test mixture contained 0.25% of DMSO, limiting range of the compound concentrations tested to low-micromolar values. In case of thrombin, the enzyme can tolerate better the presence of DMSO; the final percentage of DMSO in the test mixture was 2.5%, allowing to test the compounds at slightly higher concentrations. Focusing in first instance on the results obtained for the low-quality target thrombin, none of the five compounds that were predicted to target thrombin actually inhibited this enzyme at concentrations lower than 100 μM. Although this may seem disappointing at first sight, one should bear in mind that the phase 2 classification model for thrombin is of very poor quality (cross-validated precision of 0.425 ± 0.022; see Table 6) and that only five compounds were predicted to bind this pharmacological target. However, the results were far more encouraging for acetylcholinesterase. The phase 2 classification model for acetylcholinesterase was of much higher quality in terms of cross validated precision (0.742 ± 0.028; Table 6), and this is reflected in the fact that 2 out of the 25 predicted compounds (Z44853616 and Z1723688652) from the Enamine library showed inhibitory activity against this target at IC50 values lower than 1 μM (Table 9).
The two identified acetylcholinesterase inhibitors are topologically quite diverse but are very similar in spectrophore space. As shown in Fig. 10, the Tanimoto distance between the topological fingerprints generated from both compounds is 0.49 (calculated with RDKit ), a value far below the 0.85 cutoff that is commonly used to reflect similar bioactivities . However, the Euclidean distance between the spectrophores of both compounds is on the lower edge of the spectrum (Fig. 10; red dot), indicating high similarity from a spectrophore’s point-of-view. In addition, none of the two identified compounds possess any significant topological similarity to any of the 453 acetylcholinesterase inhibitors from the DUD-E dataset (all Tanimoto distance are less than 0.6), indicating that these two compounds would likely not have been identified from the DUD-E dataset when these topology-based fingerprints would have used as similarity criterion. The lack of correlation between the Spectrophore and topological fingerprint distances is also clear from Fig. 10. The calculated squared correlation coefficients r2 are 0.043 and 0.004 for Z44853616 and Z1723688652, respectively.
The current spectrophore algorithm has been implemented in three separate frameworks: 1) an Open Babel implementation written in C++ (OBSpectrophore [58, 59]), 2) an RDKit version written in Python and C++, and 3) an RDKit implementation coded entirely in Python. On a personal computer equipped with an i7-5500U processor with 12 GB of main memory, the Open Babel version runs fastest, with an average calculation speed of 36 ± 2 ms/molecule. On the other hand, the code entirely written in Python performs the same calculations 140 times slower (4950 ± 79 ms/molecule). The mixed Python/C ++ implementation in RDKit is only 3 times slower than OBSpectrophore, performing the same calculations at an average speed of 102 ± 2 ms/molecule.
The spectrophore is a novel descriptor that reflects, in a virtual manner, the way how molecules are binding to a set of artificial receptors, taking into account the spatial interactions between a molecule and its surroundings. Because of these unique properties, the spectrophore can be considered to be a one-dimensional mathematical description of a three-dimensional pharmacophore. This makes it applicable for a wide range of cheminformatics approaches, including virtual screening using sophisticated statistical models and clustering approaches. Successful applications in the area of scaffold hopping and virtual screening have been demonstrated in this study.
In the multi-target virtual screening experiment, all compounds were treated as neutral and they were not ionized according to their physiological pH. This could be one of the factors explaining the poor model quality of the thrombin target (Table 6), as it has been demonstrated that many of the thrombin inhibitors carry a positively charged functional group as a common feature binding into the P1 pocket of thrombin . Although the current setup was sufficient for our novel proof-of-concept study such as this work, in a real-world virtual screening experiment the correct pretreatment and washing conditions for each compound would need to be carefully determined .
The spectrophore technology is one of the many existing descriptors that may be used in the field of cheminformatics and could be useful to compare some of these. However, we believe that descriptor comparison, by evaluating their performance in clustering and virtual screening, is a difficult and very subjective task as 2D- and 3D-descriptors are fundamentally different in the way they represent molecules. Therefore, this will lead to fundamental differences in the outcome of a virtual screening experiment. Rather than comparing these technologies in an attempt to identify the ‘most powerful one’ (by whichever criterion is selected), we believe that it would be more useful to integrate many of these orthogonal molecular representations into a unified machine learning model, with the goal of developing a virtual screening toolbox with optimized predictive power. Research into such an approach is currently ongoing in our laboratory.
The spectrophore technology could also be extended to describe and compare protein pockets. In this case, the spectrophore approach needs to be modified in such way that the protein pockets, along with their structural and electronic properties, are converted into a type of mirror image from which the corresponding spectrophore can be calculated. Research is currently also on-going to investigate the feasibility and applicability of this approach.
area under curve
quantitative structure–activity relations
support vector machine
Sheng C, Zhang W (2013) Fragment informatics and computational fragment-based drug design: an overview and update. Med Res Rev 33:554–598
Meng X-Y, Zhang H-X, Mezei M, Cui M (2011) Molecular docking: a powerful approach for structure-based drug discovery. Curr Comput Aided Drug Des 7:146–157
Lyne PD (2002) Structure-based virtual screening: an overview. Drug Discov Today 7:1047–1055
Yang S-Y (2010) Pharmacophore modeling and applications in drug discovery: challenges and recent advances. Drug Discov Today 15:444–450
Qing XY, Lee XY, De Raeymaeker J, Tame JRH, Zhang KYJ, De Maeyer M, Voet ARD (2014) Pharmacophore modeling: advances, limitations, and current utility in drug discovery. J Recept Ligand Channel Res 7:81–92
Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujada G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63
Willett P (2011) Similarity searching using 2D structural fingerprints. Methods Mol Biol 672:133–158
Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11:1046–1053
Maldonado AG, Doucet JP, Petitjean M, Fan B-T (2006) Molecular similarity and diversity in chemoinformatics: from theory to applications. Mol Divers 10:39–79
Dudek AZ, Arodz T, Gálvez J (2006) Computational methods in developing quantitative structure-activity relationships (QSAR): a review. Comb Chem High Throughput Screen 9:213–228
Grover M, Singh B, Bakshi M, Singh S (2000) Quantitative structure–property relationships in pharmaceutical research—part 1. Pharm Sci Technol Today 3:28–35
Grover M, Singh B, Bakshi M, Singh S (2000) Quantitative structure–property relationships in pharmaceutical research—part 2. Pharm Sci Technol Today 3:50–57
Kubinyi H (1997) QSAR and 3D QSAR in drug design. Part 1: methodology. Drug Discov Today 11:457–467
Scior T, Bernard P, Medina-Franco JL, Maggiora GM (2007) Large compound databases for structure–activity relationships studies in drug discovery. Mini Rev Med Chem 7:851–860
Gozalbes R, Pineda-Lucena A (2011) Small molecule databases and chemical descriptors useful in chemoinformatics: an overview. Comb Chem High Throughput Screen 14:548–558
Ekins S, Freundlich JS, Choi I, Sarker M, Talcott C (2011) Computational databases, pathway and cheminformatics tools for tuberculosis drug discovery. Trends Microbiol 19:65–74
Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue):W623–W633
Downs G, Barnard J (2002) Clustering methods and their uses in computational chemistry. Rev Comput Chem 18:1–40
Bajorath J (2002) Integration of virtual and high-throughput screening. Nature Rev Drug Discov 1:882–894
Stahl M, Mauser H (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J Chem Inf Model 45:542–548
Englert P, Kovács P (2015) Efficient heuristics for maximum common substructure search. J Chem Inf Model 55:941–955
Kubinyi H (1998) Similarity and dissimilarity—a medicinal chemist’s view. Perspect Drug Discov Des 11:225–252
James CA, Weininger D (2006) Daylight theory manual, chapter 6. In: Daylight chemical information systems. http://www.daylight.com/dayhtml/doc/theory/index.pdf. Accessed 04 July 2016
Barnard JM, Downs GM (1997) Chemical fragment generation and clustering software. J Chem Inf Comput Sci 37:141–142
Bender A, Mussa HY, Glen RC, Reiling S (2004) Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci 44:1708–1718
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
Devinyak O, Havrylyuk D, Lesyk R (2014) 3D-MoRSE descriptors explained. J Mol Graph Model 54:194–203
Zhang Q, Muegge I (2006) Scaffold hopping through virtual screening using 2D and 3D similarity descriptos: ranking, voting, and consensus scoring. J Med Chem 49:1536–1548
Good AC, Cheney DL, Sitkoff DF, Tokarski JS, Stouch TR, Bassolino DA, Krystek SR, Li Y, Mason JS, Perkins TDJ (2003) Analysis and optimization of structure-based virtual screening protocols: 2. Examination of docked ligand orientation sampling methodology: mapping a pharmacophore for success. J Mol Graph Model 22:31–40
Rush TS, Grant JA, Mosyak L, Nicholls A (2005) A shape-based 3-D scaffold hopping method and Its application to a bacterial protein-protein interaction. J Med Chem 48:1489–1495
Haigh JA, Pickup BT, Grant JA, Nicholls A (2005) Small molecule shape-fingerprints. J Chem Inf Model 45:673–684
Cheeseright T, Mackey M, Rose S, Vinter A (2006) Molecular field extrema as descriptors of biological activity: definition and validation. J Chem Inf Model 46:665–676
Pastor M, Cruciani G, McLay I, Pickett S, Clementi S (2000) GRid-INdependent descriptors (GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J Med Chem 43:3233–3243
McGregor MJ, Muskal SM (1999) Pharmacophore fingerprinting. 1. Application to QSAR and focused library design. J Chem Inf Comput Sci 39:569–574
McGregor MJ, Muskal SM (2000) Pharmacophore fingerprinting. 2. Application to primary library design. J Chem Inf Model 40:117–125
Saeh JC, Lyne PD, Takasaki BK, Cosgrove DA (2005) Lead hopping using SVM and 3D pharmacophore fingerprints. J Chem Inf Model 45:1122–1133
Kim K-H, Kim ND, Seong B-L (2010) Pharmacophore-based virtual screening: a review of recent applications. Expert Opin Drug Discov 5:205–222
Spitzer GM, Heiss M, Mangold M, Markt P, Kirchmair J, Wolber G, Liedl KR (2010) One concept, three implementations of 3D pharmacophore-based virtual screening: distinct coverage of chemical search space. J Chem Inf Model 50:1241–1247
Yang S-Y (2010) Pharmacophore modeling and applications in drug discovery: challenges and recent advances. Drug Discov Today 15:444–450
Mathea M, Klingspohn W, Baumann K (2016) Chemoinformatic classification methods and their applicability domain. Mol Inform 35:160–180
Basak SC (2013) Mathematical descriptors for the prediction of property, bioactivity, and toxicity of chemicals from their structure: a chemical-cum-biochemical approach. Curr Comput Aided Drug Des 9:449–462
Grimme S, Schreiner PR (2017) Computational chemistry: the fate of current methods and future challenges. Angew Chem Int Ed Engl. https://doi.org/10.1002/anie.201709943
Humbeck L, Koch O (2017) What can we learn from bioactivity data? Chemoinformatics tools and applications in chemical biology research. ACS Chem Biol 12:23–35
Kauvar LM, Higgins DL, Villar HO, Sportsman JR, Engqvist-Goldstein Å, Bukar R, Bauer KE, Dilley H, Rocke DM (1995) Predicting ligand binding to proteins by affinity fingerprinting. Chem Biol 2:107–118
Huang N, Stoichet BK, Irwin JJ (2006) Benchmarking sets for molecular docking. J Med Chem 49:6789–6801
Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594
Bultinck P, Langenaeker W, Lahorte P, De Proft F, Geerlings P, Van Alsenoy C, Tollenaere JP (2002) The electronegativity equalization method II: applicability of different atomic charge schemes. J Phys Chem A 106:7895–7901
Bultinck P, Langenaeker W, Carbó-Dorca R, Tollenaere JP (2003) Fast calculation of quantum chemical molecular descriptors from the electronegativity equalization method. J Chem Inf Comput Sci 43:422–428
Landrum G. RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 10 July 2016
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
Riniker S, Landrum G (2013) Open source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminf 5:26. https://doi.org/10.1186/1758-2946-5-26
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132:1115–1118
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
Mühlbauer RC, Bauss F, Schenk R, Janner M, Bosies E, Strein K, Fleisch H (1991) BM, 21.0955, a potent new bisphosphonate to inhibit bone resorption. J Bone Miner Res 6:1003–1011
Maggiora G, Vogt M, Stumpfe D, Bajorath J (2014) Molecular similarity in medicinal chemistry. J Med Chem 57:3186–3204
Chen SF, Perrella FW, Behrens DL, Papp LM (1992) Inhibition of dihydroorotate dehydrogenase activity by brequinar sodium. Cancer Res 52:3521–3527
Hutchison GR, Morley C, O’Boyle NM, James C, Swain C, De Winter H, Vandermeersch T (2011) Open babel or how I learned to love the chemical file format—official user guide
O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminf 3:33. https://doi.org/10.1186/1758-2946-3-33
Skordalakes E, Tyrell R, Elgendy S, Goodwin CA, Green D, Dodson G, Scully MF, Freyssinet J-MH, Scully VV, Deadman JJ (1997) Crystallographic structures of human α-thrombin complexed to peptide boronic acids lacking a positive charge at P1. Evidence of novel interactions. J Am Chem Soc 119:9935–9936
Klebe G (2006) Virtual ligand screening: strategies, perspectives and limitations. Drug Discov Today 11:580–594
GT, WL and HDW are the original inventors of the spectrophore approach. FMDS coded and wrote the precision and recall code on the DUD-E dataset (application note 1). FMDS also coded the C++/Python implementation of the spectrophore code. KA reviewed the manuscript. RG designed and performed the biological experiments. HDW coded the scaffold hopping and virtual screening application (application notes 2 and 3), supervised the project and wrote the majority of the paper. All authors read and approved the final manuscript.
FMDS is grateful for the financial support from the National Council for Scientific and Technological Development (CNPq) of Brazil. RG is supported by a postdoctoral fellowship from FWO, Belgium. The spectrophore code was originally invented and developed at Silicos NV in 2010 by HDW, GT and WL (all authors of this paper). This paper is dedicated to our colleague, Prof. Dr. Sandra Apers, who passed away much too early on February 5th, 2017.
The authors declare that they have no competing interests.
Availability of data and materials
Both the standalone Python implementation (PySpectrophore) and the Python/C++ implementation for RDKit are available from https://github.com/UAMCAntwerpen. The Open Babel implementation (OBSpectrophore) is provided within the Open Babel package (http://openbabel.org/wiki/Main_Page).
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gladysz, R., Dos Santos, F.M., Langenaeker, W. et al. Spectrophores as one-dimensional descriptors calculated from three-dimensional atomic properties: applications ranging from scaffold hopping to multi-target virtual screening. J Cheminform 10, 9 (2018). https://doi.org/10.1186/s13321-018-0268-9