- Research article
- Open Access
Application of 3D Zernike descriptors to shape-based ligand similarity searching
Journal of Cheminformaticsvolume 1, Article number: 19 (2009)
The identification of promising drug leads from a large database of compounds is an important step in the preliminary stages of drug design. Although shape is known to play a key role in the molecular recognition process, its application to virtual screening poses significant hurdles both in terms of the encoding scheme and speed.
In this study, we have examined the efficacy of the alignment independent three-dimensional Zernike descriptor (3DZD) for fast shape based similarity searching. Performance of this approach was compared with several other methods including the statistical moments based ultrafast shape recognition scheme (USR) and SIMCOMP, a graph matching algorithm that compares atom environments. Three benchmark datasets are used to thoroughly test the methods in terms of their ability for molecular classification, retrieval rate, and performance under the situation that simulates actual virtual screening tasks over a large pharmaceutical database. The 3DZD performed better than or comparable to the other methods examined, depending on the datasets and evaluation metrics used. Reasons for the success and the failure of the shape based methods for specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability.
The 3DZD has unique ability for fast comparison of three-dimensional shape of compounds. Examples analyzed illustrate the advantages and the room for improvements for the 3DZD.
A crucial step in early phase drug discovery is the identification of promising drug leads i.e. those of pharmacological interest. A guiding premise in this stage is that of the similarity property principle [1, 2] which states that similar structures are likely to have similar properties (although exceptions to this rule do exist [3, 4]). Proceeding along these lines, similarity based virtual screening efforts  look for nearest neighbours for a given query structure. The output of the screen is in the form of a sorted list, where top-ranking molecules are selected to undergo further processing.
As compound databases can hold millions of structures (spanning a large chemical space), the application of such approaches requires suitable molecular representations that aid rapid screening. An additional requirement is that of a numerical measure that quantifies the similarity between the compounds. Popular descriptions include fingerprints that encode the two-dimensional molecular structure as a bit string where each value indicates the presence or absence of a desired attribute (e.g. a substructural fragment) [6, 7]. Similarities between the ligands can then be obtained using the Tanimoto score  which accounts for the number of bits shared by the fingerprints. While these descriptors are extremely efficient and easy to calculate they have still some limitations . A number of three-dimensional (3D) similarity methods [10–12] have therefore been developed to investigate if the 3D structure information improves over the existing descriptors.
While a number of techniques for 3D molecular comparison have been proposed [1, 13, 14], in this article, we focus on similarity-based virtual screening using molecular shape [15–17] as the key feature for discrimination. Shape is known to play an important role in molecular-recognition, with previous studies demonstrating successful applications to virtual screening experiments [16, 18]. However, identifying suitable encodings based on shape are far from trivial [18–21], which pose significant hurdles in their application to fast screening of compound databases.
In order to facilitate efficient comparisons, several representations of shape have been proposed ranging from those based on moments  and surfaces [21, 23] to grid-based approaches  (see Putta & Beroz  for a comprehensive review). A well-established method is that of ROCS (Rapid Overlay of Chemical Structures) that describes the molecule as a set of atom-centered Gaussians . Shape similarity scores are then evaluated in terms of the rigid body overlap volume with comparison timings in the milliseconds range. Goldman and Wipke , on the other hand, divide the molecular surface  into a series of patches (2Å radius) centered on a set of critical points  with each patch defined by a geometrically invariant descriptor (the principal curvatures, normals, and the shape index). Points with similar geometric signatures are identified, based on which a transformation can be calculated. Proceeding along the same lines, SURFCOMP  uses a graph matching to identify correspondences between shape (local curvature) critical points of the molecular surfaces being compared. Although the superimpositions found using the above two methods were found to be reasonably accurate, pairwise comparisons took more than a minute, which limits their large scale application.
Spherical harmonics based representations [29, 30] have been further applied to comparing shapes of ligand binding sites  and as geometric filters for virtual high throughput screening . The use of spherical harmonics allows the surface information to be encoded in a compact form as an orthonormal one-dimensional (1D) vector of floating numbers rendering it amenable to fast comparison. However, for the molecules to be compared, they have to be placed in a standard frame of reference. This has been shown to be error-prone and hence may result in the decreased performance of the descriptor [33, 34]. Consequently, descriptors that obviate the need for any pre-alignment have been the focus in a number of studies. Shape signatures [18, 35], for example, produce a 1D representation of the ligand or receptor site by ray-tracing the molecular volume. The geometric information is encoded as a probability distribution which enables fast comparisons using the shape histograms. The signatures can be further extended to incorporate other properties such as electrostatics. Another method that captures shape independent of orientation is the Ultrafast Shape Recognition (USR) scheme . In this technique, the 3D molecular shape is represented as a set of statistical moments, generated from all atom distance distributions that are calculated with respect to preselected reference locations.
More recently, a number of articles [36–41] have advocated the use of 3D Zernike invariants as descriptors for shape comparison. An extension of spherical harmonics, the 3D Zernike descriptors (3DZD) have favourable features such as orthonormality and compactness. More importantly, the 3DZD are also invariant to transformation (see section on computational details), and thus the pre-alignment step is no longer required. Represented by a 1D set of numbers (subject to a specified order of expansion), the 3DZD have facilitated rapid screening of protein databases , discrimination of proteins based on the electrostatic potential  and for analyzing shapes of ligand binding pockets [34, 37].
In this paper, we examine the efficacy of the 3DZD as a tool for shape similarity based virtual screening. Due to its compact representation, 3DZD enable fast comparison of compounds, which is a key property of virtual screening methods given the fast growing size of molecular databases. The performance is compared with several other methods, USR , SIMCOMP , EVA , Unity2D , Molprint-2D , and MACCS . These approaches and the metrics used for evaluation are briefly described in the next section. Three datasets were used: the first one is a set of 47 diverse odour compounds (divided into seven classes) taken from a previous study by Takane and Mitchell , while the second one is the Directory of Useful Decoys (DUD) dataset [45, 46], that contains 13 targets with 66013 compounds. The last dataset includes 42,689 anti-HIV inhibitors  categorized into active, inactive, and moderately active. With the first dataset, we test the ability of the methods to classify compounds, while the second dataset is employed to examine the ability to rank actives among decoys. In addition, the third dataset is used to simulate actual virtual screening process against a large pharmaceutical database. Results evaluated with respected to the datasets are assessed in terms of several evaluation metrics. Reasons for the failure of the shape based methods for specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability with suggestions for future work.
Computational and Experimental Details
Methods for structure comparison
Computational approaches for ligand screening used in this study are briefly introduced here. For further details, please refer to the cited articles.
3D Zernike Descriptors
The 3D Zernike functions  are given by
where are complex valued spherical harmonics expressed in terms of spherical coordinates (θ, φ), n, l, m are integers such that |m| ≤ n and n - |m| is even and R nl (r) represents orthogonal radial polynomials. Given a 3D shape function f(x): x ∈ R3, the Zernike moments are the projection of the shape function onto these orthogonal basis functions. For an order n they can be expressed as a linear combination of scaled geometrical moments (to fit a unit sphere)
The moments however are not rotationally invariant but as rotations do not change the magnitudes of the functions, the invariant features are expressed in terms of the norms . Translational invariance is obtained by fixing the coordinate system with the origin coinciding with the spatial center of the molecule. From mathematical point of view, this procedure is proven to compute the identical descriptor for an object regardless of the positioning of the object in the space [36, 48]. However, in practice some variance may be caused due to numerical errors and the voxelization step of the object necessary to represent the shape of the object. In our previous paper, we examined the variance caused by rotation . An advantage of the 3DZD is that it can also describe non-star-like shapes. This is a limiting factor for spherical harmonics as they can only model single-valued surfaces .
Extraction of moments starts with the generation of a suitable molecular surface. In this study, the Connolly surface  definition has been used. Unlike the spherical harmonics which are calculated with respect to a spherical grid, the 3D Zernike formulation uses a rectangular grid (voxelization) to compute the geometrical moments. Moments of orders ranging from n = 4 to n = 14 have been examined in this study, with each molecule represented as a 1D floating point vector of numbers when n is an even number . Three distance measures have been used to compare structures based on their 3DZD representations (X and Y). These include the Euclidean distance (D E ), the Pearson correlation coefficient (r) and the third based on a scaled Manhattan distance (D M )[22, 48]. While the first two measures are commonly used in similarity searching, the third metric is taken from an earlier article by Ballester and Richards .
Note that DE is 0 for two identical molecules, while the correlation coefficient and DM give the value of 1 for that case.
Ultrafast Shape Descriptor
A purely atom-based approach, ultrafast shape recognition (USR) uses the statistical moments (mean, standard deviation and skewness) generated from the interatomic distance distributions. The moments attempt to define shape in terms of the size, compactness and the asymmetry associated with the structure. The all-atom distance distributions are calculated with respect to four reference points: the centroid (ctd), the atom closest to the centroid (cst), the atom furthest to the centroid (fct) and the farthest atom to fct (ftf). In the original implementation, Ballester and Richards  had used the first three moments yielding a 12-valued descriptor.
In the original implementation of the USR, similarities between the structures being compared is then calculated based on DM with the score (range 0-1) given by
Here, M Q and M D are the one-dimensional vectors corresponding to the query and database molecules and N is the length of the vector, determined by the number of statistical moments considered i.e. N = 12 for the first three and N = 16 for the first four. In addition to DM, in this study we also employed the Euclidean distance (DE) and the correlation coefficient.
The other existing programs compared in this study
In addition, we compared with several existing programs: EVA , UNITY2D , SIMCOMP , Molprint-2D , and MACCS . Below we provide a brief description of the characteristics of the methods.
The EVA (Eigen VAlue) descriptors are derived from the vibrational frequency calculations (calculated normal modes) with each molecule represented as a vector of 761 numbers . In the UNITY2D  the molecule is encoded as a Boolean array of 922 bits (available in SYBYL 7.1) that encode the presence (1) or absence (0) of substructural features. The results of these two methods are taken from the paper by Takane & Mitchell (2004), who analyzed the odour dataset.
The program SIMCOMP , uses a graph matching approach to compare chemical compounds. Each molecule is represented as a two-dimensional graph  with atoms and bonds becoming the vertices and edges respectively. Each atom is then assigned a label based on its neighbourhood i.e. the atom-typing scheme takes into consideration the adjacent atoms, the type of bonds they are involved in, and whether they have an associated ring structure. In all 68 different atom types were defined that included 23 carbon atom types, 18 oxygen, 7 sulphur, 2 phosphorous, and the rest for halogens and others. The edges are labelled according to the type of bond (single, double, triple) they represent. With this representation in place, the problem of finding a match is reduced to that of identifying a maximal common subgraph (subgraph of one graph that is isomorphic to a subgraph of the other) between the two graphs being compared. For this purpose, an association graph (AG) is constructed that encodes the possible mappings between the nodes (similar atom environments) of the two graphs. Further, each vertex pair in AG is also assigned a weight, 0.5 for cases where partial matches for the same atom species with different environments were found. All the other pairs were weighted as 1 if they belonged to the same atom species and 0 otherwise. SIMCOMP adopts a clique (fully connected subgraph) detection approach (a modified version of the Bron-Kerbosch (BK) algorithm ), to identify common substructures among which subgraphs with the largest sum of weights are sought. As graph matching has a high time complexity, additional heuristics in the form of a minimum size of the cliques to be found and the number of recursions of the BK algorithm to be executed were introduced to speed up the matching. Based on the largest match found, a numerical measure of the similarity between the two structures S(G1, G2) was calculated as
Here, MCS is the maximal common subgraph found and |.| represents the number of vertices in the graphs. The score depends on the sizes of the graphs and ranges between 0 and 1.
The MACCS descriptors are a set of 166 predefined structural keys that encode patterns in the molecule (such as the presence of S-S bonds, rings of size 4, presence of halogen etc).
The Molprint-2D fingerprint  also encodes the molecule as a binary vector by taking into consideration the atom environment (only heavy atoms) i.e. the counts of the types of the atoms (SYBYL atom types are used) within two bond-lengths of a central atom. The bits thus encode the presence or absence of these atom environments.
In addition, we employ a method which simply considers the number of atoms in the molecule (the atom count method). The similarity of two molecules is defined as the difference of the number of atoms. This method serves as the reference to examine the effect of using shape information by the above methods.
The aforementioned approaches were tested on three datasets that are chosen to demonstrate the applicability of the methods for classification and rapid screening of large databases. The first dataset was taken from an earlier study by Takane and Mitchell , who attempted a clustering of 47 odour compounds using the EVA descriptors. The compounds are divided into seven categories: amber (9 compounds), bitter almond (9 compounds), camphor (5 compounds), musk (11 compounds), jasmine (2 compounds), rose (5 compounds) and muguet (6 compounds).
The second data set is Directory of Useful Decoys (DUD) dataset . It is derived from ZINC , a database of commercially available compounds for virtual screening. A subset of the DUD containing 13 targets that is more suitable for vertical screening has often been used in recent studies [46, 53–55]. This subset is the result of a filter approach applied to the original DUD dataset (40 targets with about 95172 compounds) that not only removes molecules with unsuitable physicochemical properties but also resolves the problem of bias of an artificial enrichment (due to the presence of close analogues of the actives). The first step of the filtering protocol involved the generation of the seed structures for the actives (obtained from http://dud.docking.org/clusters/) using CORINA3D and refinement using MACROMODEL at pH 7.0 . The decoys were also subjected to a similar refinement process. Subsequently, a filtering process is applied to retain lead like structures (Alog P filter of 5.5 for nuclear hormone receptors; 4.5 for others) . See Table 1 for the final composition. The prepared structures for the actives and decoys for the 13 DUD targets were downloaded http://dud.docking.org/jahn/ and have been analysed in this study. For the search query, crystal structure coordinates of the same complexed ligands were taken from the DUD http://dud.docking.org/r2/dud_target_ligands.tar.gz.
The third dataset from the National Cancer Institute (NCI) consists of 42,687 compounds derived from an assay measuring protection from HIV-1 infection of human CEM cells . The compounds were further categorized into 423 confirmed actives (100% protection), 1,081 moderately actives (> 50% protection) and 41,185 confirmed inactives (<50% protection) yielding a ratio of 97 decoys per active. More details of the dataset is available at the website at NCI http://dtp.nci.nih.gov/docs/aids/aids_data.html. The coordinates for these structures were downloaded from http://ligand.info in the SDF format. The dataset not only resembles a pharmaceutical database but also enables the extraction of actives in a form akin to that of a typical virtual screening experiment. For cases with missing coordinates, the structures were rebuilt using CORINA . Examination of the compounds in the dataset revealed that a small number of cases had disconnected components. As neither the 3DZD nor USR can currently handle such structures, we decided to choose the largest fragment to represent the compound. Following a previous work on the same dataset by von Grotthus et al. , the objective was to test the retrieval capability of the actives using the 1081 moderately actives as queries. The datasets used are available at our website, http://kiharalab.org/zernikeligand/.
In order to assess the performance of the different methods used for screening, the following metrics have been used while taking into consideration the size of the datasets:
Clustering and the Adjusted Rand Index- For the odour dataset, results were evaluated based on the quality of the clustering obtained. Ward's hierarchical clustering  was done using software downloaded from http://www.let.rug.nl/kleiweg/clustering/. Starting with an initial number of clusters (say N), Ward's method merges two clusters at a time while minimizing the sum of squared errors at each step. To compare, the quality of the clusters, the adjusted Rand index [43, 61] has been applied and is given by(8)
where G1 and G2 are the true and predicted partitions of the data set of size N, m ij is the number of samples in both class i of G1 and class j of G2, m i and m j are the number of samples in the ith and jth class of the partitions of G1 and G2, respectively. The index provides a numerical measure of the agreement between the original and predicted clusters and ranges between 0 for dissimilar groupings to 1 for similar ones.
Enrichment factor - This metric  describes the ratio of actives retrieved relative to the percentage of the database scanned. If T A be the total number of actives in a database of size T D and N a be the number of actives in the top x percent N x of the database, then the enrichment factor is given by(9)
BEDROC - Although frequently used, the enrichment factor has a major drawback in the form of the "early recognition problem". It does not distinguish between schemes that rank actives ranked at the top of the list from those that place them at the end. As actives ranked earlier in the list are desired, the Boltzmann enhanced discrimination of receiver operating characteristic or BEDROC  was proposed to evaluate the performance of ranking methods. The metric is given by(10)
where, n is the number of actives among N compounds, ri is the rank of the ith active and α is a parameter that assigns a weight towards compounds the top of the ranked list. The BEDROC metric ranges between 0 and 1 and in this study, has been calculated for α = 160.9 and α = 32.2 which corresponds to the top 1% and 5% of the relative rank accounting for 80% of the BEDROC score.
4) Area Under Curve for Receiver Operator Characteristic (ROCAUC) - The area under the curve metric represents the probability of a randomly chosen active being ranked higher than a randomly chosen decoy . If N be the number of compounds with N a actives, and N d decoys the area under the ROC curve is given by
where is the number of decoys ranked above the ithactive.
All calculations were performed on a 2.13 GHz Intel dual processor system running Linux with 8 GB RAM. Programs for the extraction and comparison of the moments based on the 3DZD and USR were written in C++. For the USR approach, the extraction of moments typically takes around 4 ms on an average. Times for the 3DZD on an average take about 1s including surface generation which is about 250 times that of the USR. However, this step needs to be performed only once and can be directly stored in a database.
Software for SIMCOMP  was downloaded from the KEGG website http://web.kuicr.kyoto-u.ac.jp/simcomp/. The software executable SIMCOMP which performs a pairwise comparison of two structures was used to calculate the similarities. Prior to the calculation, all structures were converted into the required KCF (KEGG Chemical Function) format using the SOAP/WSDL http://www.genome.jp/kegg/soap/ interface provided by the KEGG database.
Pairwise comparisons of the 47 odour compounds (Figure 1) were performed using all approaches, i.e. the 3DZD, USR and its kurtosis variant, SIMCOMP, EVA, UNITY2D, MACCS, and Molprint2D. The values of EVA and UNITY2D are taken from the previous study of EVA . In addition, we employed a simple atom counting method as a reference, which compares the number of atoms in compounds. Since the main objective was to see if the methods could provide correct groupings of the seven fragrances, the data set was clustered into as many groups, based on the similarity matrix obtained. In order to compare the quality of the clustering against the experimentally determined classification, the adjusted Rand index, BEDROC, and ROCAUC values are calculated (Table 2).
For the 3DZD, similarity matrices were built based on the three different measures of the correlation coefficient (r), DE, and DM. Orders of expansion ranging from 4 to 14 were tested, with significant gains observed in the value of the Rand index, as the order increases. The highest adjusted Rand index (0.487) is obtained for the 3DZD, where both correlation and DE metrics provide the same results for an order 12 expansion. The value becomes worse when the DM is used (0.393). Using a higher order (here we examined 14) resulted in smaller adjusted Rand index value which also suggests that expansion orders of 10-12 should be appropriate for comparison. The poorer performance at this level (>12) can be attributed to the noise resulting from far too detailed a description of the molecular shape. On the other hand, using a smaller order say 4-6, results in a much lower Rand index value (0.25-0.30), implying insufficient resolution for shape description.
The highest value achieved by the 3DZD (0.487) is higher than that of SIMCOMP (0.400), USR (0.342) and its variant USR-k (0.328), MACCS (0.364), and the atom count method (0.400). The 3DZD also marginally outperforms the EVA descriptor based classification which achieved a highest Rand index value of 0.480. However, Molprint-2D, which considers atom environments and atom types, shows the highest value among all (0.516).
The methods are also evaluated by two ranking-based scores, BEDROC and ROCAUC. The performance of the 3DZD becomes worse when evaluated by these two scores relative to the other methods. When evaluated by BEDROC (α = 160.9), Molprint-2D shows the highest value (0.848), and the rest of the methods are ranked in the following order: SIMCOMP, USR (Manhattan, 0.782), MACCS (0.778), and the 3DZD (0.739 with the correlation coefficient, order = 8). In terms of BEDROC (α = 32.2) and ROCAUC, SIMCOMP shows the highest value and Molprint-2D comes to the close second. With the AUCROC, the 3DZD (0.748 with Manhattan, order = 14) is ranked the third, and MACCS (0.742), USR (0.718 with Manhattan, order = 12) follow in this order. Although BEDROC and ROCAUC are frequently used ranking-based scores, it would be argued that such ranking-based scores are not very appropriate for a small dataset like the odour dataset . As mentioned in Method section, BEDROC α = 160.9 and α = 32.2 emphasize ranks in top 1% and 5%, respectively, which correspond only to the top rank and top two ranks for the odour dataset of 47 compounds. On the other hand, the AUCROC computes unintuitively high value for many search results since the number of hits in the dataset is relatively high (5 to 11 hits among 49 total, as described in the dataset section).
Dendrograms for the five methods (3DZD correlation, USR Manhattan, SIMCOMP, Molprint-2D, and MACCS) are shown in Figure 2. None of the methods provide perfect distinction between the compounds and they produce different groupings. For the 3DZD based clustering, all camphor compounds are located in a separate partition while the musk odours are placed in two neighbouring clusters. Although USR manages to separate the camphor structures from the rest, the bitter almond series are located in three different groups (a trait shared by the 3DZD as well). On the whole, however, a poor separation of the other odours is seen, resulting in a smaller adjusted Rand index (0.343). SIMCOMP places most of the amber compounds in the same cluster, but splits the camphor series into two groups. On the other hand, the EVA descriptor based clustering (see Table 3 in the reference ) was able to locate the amber compounds in a single cluster; the other fragrances were split across multiple partitions. Molprint-2D classifies muguet compounds in a single cluster and also well captures similarity of amber compounds. Finally, MACCS shows a slightly different clustering, capturing similarity of camphor compounds but considers jasmine similar to camphor.
While most of the methods clustered the camphor structures in the same group, the two jasmine fragrances are typically split. The 3DZD places one of the jasmines (Jasmine_40) by itself in a separate cluster. Inspection of the 3DZD for these compounds shows that their magnitudes follow very different trends. Graphs of the invariants are shown for camphor and jasmine (Figure 3), the former illustrating the similar trend in the values (3DZD for the camphor series) with all the compounds placed in a single group. In contrast, using Jasmine_39 as the query tends to pick up amber, muguet, and bitter compounds as the top 3 hits with corresponding correlation coefficients of 0.990, 0.988, and 0.987, respectively. The other jasmine compound (Jasmine_40) yields a much lower correlation value of 0.782 and is ranked poorly at 41.
Overall, the clustering results based on the 3DZD for the odour dataset have been encouraging. To further evaluate this approach, we have analysed the performances with respect to much larger datasets, results for which are presented in the next section.
In the previous section, we investigated how well the methods classify compounds using a small dataset of 47 odour compounds. Next, we use a larger dataset, namely, the DUD dataset (Table 1), to examine the performance of the methods in ranking and retrieving active compounds. The DUD dataset is appropriate for this task since it is developed for virtual screening benchmark and has been used in several previous studies. For the query, the active ligand molecule crystallized with the target protein is used to retrieve the other known active molecules among decoys. We compare 3DZD, USR, MACCS, and the atom count method. Three ranking-based evaluation metrics are used, i.e. BEDROC (a = 32.2), the enrichment factor (5%), and the AUCROC.
Figure 4 shows the performance of the four methods for each target. In addition, the summary (the average values) are provided in Table 3. On average the three methods, 3DZD, USR, and MACCS, all outperform the atom count method in all three evaluation metrics (Table 3), although there are some individual targets where the atom count method performs better than the others when evaluated by certain metrics (for example, results for vegfr2 and src evaluated by BEDROC and EF5%). The rankings of the three methods are not consistent across different evaluation metric used. When EF5% is used, MACCS shows the highest score and the 3DZD and USR follow in this order. Using BEDROC, MACSS stays at the best rank and the order of the 3DZD and USR switches. On the other hand, USR shows the highest score in terms of AUCROC and 3DZD comes to the second. These results illustrate difficulty of evaluating virtual screening methods and importance of evaluating methods by using several different metrics.
NCI anti-HIV dataset
The third dataset, the anti-HIV dataset from the National Cancer Institute, is employed to simulate a typical virtual screening experiment. With both the actives and inactives forming the database to be searched, each of the remaining 904 moderately active molecules was used as the query. USR typically takes about 0.74 (12 terms)-0.76 (16 terms) seconds per query. Timings for the 3DZD comparisons are about 3 times that of the USR with per query comparisons taking 2.62 seconds for order 8 expansions and up to 2.70 seconds for order 14. It must be noted that the 3DZD considers more terms ranging from 25 (order 8) to 64 terms (order 14). The timings are reported with respect for 38352 database molecules that were compared and include the calculation of all the three distance metrics (correlation, DE, and DM). In comparison, the graph based SIMCOMP is significantly slower with timings exceeding an hour (~4245 seconds/query) in most cases.
The relative performances of the different methods were assessed using the BEDROC scores, enrichment factors, and AUCROC, which are shown in Table 4. As with the odour dataset case, for the range of expansion orders of the 3DZD tested, there is a trend where higher order terms lead to increase the enrichment. This trend is clear in AUCROC.
The performance of the 3DZD for this dataset is however slightly poorer in comparison to the other methods tested. SIMCOMP achieves the highest value for the enrichment factor value at the 1% cutoff (2.735), BEDROC (α = 160.9) (0.0383), and for AUCROC (0.477). It also shows a better score than the 3DZD for BEDROC (α = 32.2). USR (Manhattan, with the order of 12) outperforms the 3DZD at all the metrics except for AUCROC. However, the relatively small value by all the methods, the 3DZD, USR, and SIMCOMP, suggests that all the methods compared here would not effective from an early recognition perspective for the anti-HIV dataset. It is particularly notable that the performance of the atom count method is close to the other methods on this dataset and even shows the highest value in the EF5% and BEDROC (α = 32.2). These results imply that molecular shape information is not effective as it is for the previous two datasets. We discuss the nature of this dataset in Discussion, which could be a reason of the results.
To understand, why the three methods (3DZD, USR, SIMCOMP) have relatively poorer results for this dataset we examined the structures retrieved for specific queries. Three different cases are considered:
A case where the 3DZD is able to retrieve more hits in the top 100 than SIMCOMP and USR,
A case where SIMCOMP outperforms USR and 3DZD, and
Where USR does better than SIMCOMP and 3DZD.
The first case is illustrated by the example in Figure 5 that lists five structures retrieved of which two are active. It can be seen that the 3DZD is able to retrieve actives and rank them much higher than USR and SIMCOMP. All three similarity metrics - the correlation coefficient, DE, and DM are able to discriminate almost equally well finding at least 2 actives in the top 10. Although difficult to generalize based on some of the cases considered, it would seem that the 3DZD, being an extension of spherical harmonics, is able to discriminate symmetrically shaped structures well and is therefore more effective in retrieving actives for molecules of this class.
Coming to the second case (shown in Figure 6), SIMCOMP achieves a much higher retrieval rate as compared to the other two methods. While on one hand these actives are retrieved very early on (4 out of top 5 are actives), the structures are quite diverse, which is a useful feature to have for scaffold hopping (structures with similar bioactivity but different chemotype).
In comparison, the other schemes find no actives in the top 100 structures with the exception of the 3DZD (DM) which finds a single active ranked 73. We therefore tried to analyze why the moment based methods behaved as they did. The graph shown in Figure 7 illustrates why this might be the case. It can be seen that the moments for the highest ranked actives (represented by the dashed lines) and inactives (represented by the bold lines) retrieved by the 3DZD for this case, follow the same trends as those of the query (shown in black). However, this trend is stronger for the inactives. Closer inspection of the correlation coefficients values (Table 4) of the 3DZD for the actives showed that their values were fairly close with difference of 0.0122 between the closest inactive retrieved (0.0995) and the first hit (active) ranked at 245 (0.9825). However, both DE and DM manifest this difference more clearly with values, thus the closest inactive is measured further closer relative to the first hit: the gap between the closest inactive and the first active is 0.7482 and 0.0575 for DE and DM, respectively. Thus, from a 3DZD point of view, these inactives are closer in the molecular shape to the query than the actives. SIMCOMP, on the other hand, using as it does the idea of atom environments, is able to capture similarities that the moment based methods missed (Figure 6).
The statistical moments produced by USR were also examined in this context. Here also a pattern similar to that observed for the 3DZD is seen (Figure 8). While the moments of both the actives (dashed lines) and inactives (solid lines) trace the same trends as those followed by the query, greater deviations occur amongst the actives. Similarity scores with respect to the query are just above 0.5 (the left most column in Table 5) while those for the top 3 inactives average 0.66 with a relatively large difference of about 0.15 as shown in Table 5. This may suggest that, for cases where moments, both geometrical and statistical, are not as discriminating, other considerations are mandated - such as the atom environments used by SIMCOMP. This also reiterates the fact that while geometrical shape is a useful property to characterize molecules; it sometimes is by itself insufficient to provide a clear discrimination.
Finally, we analyzed the cases where USR reported a better discrimination of actives and inactives as compared to SIMCOMP and the 3DZD (Figure 9). For this query, the 3DZD does retrieve two actives within the top 100 while SIMCOMP obtains none. In contrast both USR and its kurtosis variant (USR-K) retrieve very similar looking structures  within the top 10. Though difficult to generalize based on these examples alone (Figures 5, 6 and 9) it is a case in point that USR may work relatively well for asymmetric molecules, considering as it has the third statistical moment, skewness, which is a measure of the molecule's asymmetry. On the other hand, the 3DZD seem to work better for spherical structures (Figure 6) rather than the other molecules that have typically elongated shapes and are not even close to being spherically symmetric (Figure 9).
On the basis of these results observed on the three datasets, it would be of interest to combine other molecular surface properties such as electrostatics and hydrophobicity that are captured implicitly by the atom environments used in SIMCOMP and Molprint-2D. While this may be a limiting factor for USR, the 3DZD facilitates their incorporation in a more convenient way as shown in our previous paper .
In this article, we have presented the 3D Zernike descriptors for ligand similarity searching. The work was motivated by previous studies that showed that these rotation invariant descriptors outperformed several other shape and alignment based comparison techniques when applied to proteins . Compactness in terms of the number of coefficients generated in comparison to their spherical harmonics counterpart and easy extensibility to other molecular properties  were additional factors in favour of this descriptor. The compactness of the 3DZD enable fast comparison of compounds, which is a key property of virtual screening methods given the fast growing size of molecular databases.
The application to ligand similarity searching was exemplified using three datasets, each of which has a different purpose. The first odour dataset is suitable for examining ability to classify compounds into experimentally verified categories. The adjusted Rand index used as a measure of agreement with the known classification was found to be the second highest (0.487) for the 3DZD and it outperformed other shape based method (USR), a chemical graph matching scheme (SIMCOMP), a 2D finger print-based method (UNITY2D), the vibrational frequency based method (EVA), and MACCS. The second dataset, DUD, was chosen to investigate ability of ranking compounds, as the dataset has been used for the same purpose in several previous studies. All the methods compared consistently showed better performance than the simple atom count method, and the performance of the 3DZD was comparable among the methods.
For the third dataset, moderately active structures were used as queries to search an anti-HIV database of active and inactives. This dataset is intended to simulate actual application of virtual screening methods to a large pharmaceutical database. Rapid comparisons are facilitated by the floating point vector representation and both 3DZD and USR were found to be more than 60 orders faster than SIMCOMP. Although both USR and 3DZD describe shapes using moments, they exhibit a preference for specific shape types. Examination of a few cases suggests that the 3DZD may perform better for comparing molecules of more or less spherical shape while USR performs well for elongated asymmetric structures. However, a more detailed analysis of this would be required and is planned for the future. For this dataset, the 3DZD, USR, and SIMCOMP showed much lower values in terms of all the evaluation metrics as compared with the results for the previous two datasets and those typically seen in literature . Moreover, the atom count method which simply considers the size of molecules showed comparable results, indicating molecular shape information did not add effective information for retrieval. To understand these results, it should be noted that this dataset may contain active compounds for multiple different molecular targets and, moreover, the heterogeneity of actives may be further increased due to the fact that the activity of compounds is measured by assays in living cell systems, where metabolism and uptake become important factor. Hence, generally speaking, we must say that this dataset is not among the most appropriate data for rigorous benchmark of virtual screening methods. In this study, however, we tried this dataset to mimic actual situation of the virtual screening after testing the methods on two well curated datasets.
Compound similarity searching by the 3DZD is intrinsically sensitive to the shape of molecules. This characteristic of the 3DZD can work as an advantage or can also lead to poor performance. In Figure 5, we showed a case that the 3DZD were able to find an active compound which was failed by SIMCOMP and USR. On the other hand, the two jasmine compounds in Figure 3 are the case where the 3DZD failed but the method which considers atom environments (i.e. SIMCOMP, see Figure 2), can detect their similarity. Figures 6 and 9 also exemplify compounds that the 3DZD find similarity based on the shape, which are not desired.
In summary, the 3DZD provide compact representations of molecular shape and can be applied to rapid screens of large compound databases. In addition to shape, other molecular properties can also be incorporated, thus enabling uniform comparison of the structures. The fact that they are surface based has other advantages such as comparison of shapes of binding pockets and the ligands bound to them, which are currently being studied in our group.
Bender A, Glen RC: Molecular similarity: a key technique in molecular informatics. Org Biomol Chem. 2004, 2: 3204-3218. 10.1039/b409813g.
Johnson AM, Maggiora GM: Concepts and applications of molecular similarity. 1990, New York: Wiley
Martin YC, Kofron JL, Traphagen LM: Do structurally similar molecules have similar biological activity?. J Med Chem. 2002, 45: 4350-4358. 10.1021/jm020155c.
Kubinyi H: Similarity and dissimilarity: A medicinal chemist's view. Perspectives in Drug Discovery and Design. 1998, 9-11: 225-252. 10.1023/A:1027221424359.
Eckert H, Bajorath J: Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov Today. 2007, 12: 225-233. 10.1016/j.drudis.2007.01.011.
MACCS Keys. 2009, MDL Information Systems, Inc., San Leandro, CA, [http://www.mdl.com]
Bender A, Mussa HY, Glen RC, Reiling S: Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci. 2004, 44: 1708-1718.
Edgar SJ, Holliday JD, Willett P: Effectiveness of retrieval in similarity searches of chemical databases: a review of performance measures. J Mol Graph Model. 2000, 18: 343-357. 10.1016/S1093-3263(00)00061-9.
Schneider G, Schneider P, Renner S: Scaffold-hopping: How far can you jump?. Qsar & Combinatorial Science. 2009, 25: 1162-1171. 10.1002/qsar.200610091.
Nisius B, Goller AH: Similarity-based classifier using topomers to provide a knowledge base for hERG channel inhibition. J Chem Inf Model. 2009, 49: 247-256. 10.1021/ci800304t.
Bender A, Mussa HY, Gill GS, Glen RC: Molecular surface point environments for virtual screening and the elucidation of binding patterns (MOLPRINT 3D). J Med Chem. 2004, 47: 6569-6583. 10.1021/jm049611i.
Markt P, Petersen RK, Flindt EN, Kristiansen K, Kirchmair J, Spitzer G, Distinto S, Schuster D, Wolber G, Laggner C, Langer T: Discovery of novel PPAR ligands by a virtual screening approach based on pharmacophore modeling, 3D shape, and electrostatic similarity screening. J Med Chem. 2008, 51: 6303-6317. 10.1021/jm800128k.
Chen Q, Higgs RE, Vieth M: Geometric accuracy of three-dimensional molecular overlays. J Chem Inf Model. 2006, 46: 1996-2002. 10.1021/ci060134h.
Artymiuk PJ, Bath PA, Grindley HM, Pepperrell CA, Poirrette AR, Rice DW, Thorner DA, Wild DJ, Willett P, Allen FH: Similarity searching in databases of three-dimensional molecules and macromolecules. J Chem Inf Comput Sci. 1992, 32: 617-630.
Putta S, Beroza P: Shapes of things: computer modeling of molecular shape in drug discovery. Curr Top Med Chem. 2007, 7: 1514-1524. 10.2174/156802607782194770.
Rush TS, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem. 2005, 48: 1489-1495. 10.1021/jm040163o.
Kirchmair J, Ristic S, Eder K, Markt P, Wolber G, Laggner C, Langer T: Fast and efficient in silico 3D screening: toward maximum computational efficiency of pharmacophore-based and shape-based approaches. J Chem Inf Model. 2007, 47: 2182-2196. 10.1021/ci700024q.
Chekmarev DS, Kholodovych V, Balakin KV, Ivanenkov Y, Ekins S, Welsh WJ: Shape signatures: new descriptors for predicting cardiotoxicity in silico. Chem Res Toxicol. 2008, 21: 1304-1314. 10.1021/tx800063r.
Willett P: Searching techniques for databases of two- and three-dimensional chemical structures. J Med Chem. 2005, 48: 4183-4199. 10.1021/jm0582165.
Labute P, Williams C, Feher M, Sourial E, Schmidt JM: Flexible alignment of small molecules. J Med Chem. 2001, 44: 1483-1490. 10.1021/jm0002634.
Hofbauer C, Lohninger H, Aszodi A: SURFCOMP: a novel graph-based approach to molecular surface comparison. J Chem Inf Comput Sci. 2004, 44: 837-847.
Ballester PJ, Richards WG: Ultrafast shape recognition to search compound databases for similar molecular shapes. J Comput Chem. 2007, 28: 1711-1723. 10.1002/jcc.20681.
Hudson BD, Whitley DC, Ford MG, Swain M, Essex JW: Pattern recognition based on color-coded quantum mechanical surfaces for molecular alignment. J Mol Model. 2008, 14: 49-57. 10.1007/s00894-007-0251-2.
Tervo AJ, Ronkko T, Nyronen TH, Poso A: BRUTUS: optimization of a grid-based similarity function for rigid-body molecular superposition. 1. Alignment and virtual screening applications. J Med Chem. 2005, 48: 4076-4086. 10.1021/jm049123a.
Grant JA, Pickup BT: A Gaussian description of molecular shape. J Phys Chem. 1995, 99: 3503-3510. 10.1021/j100011a016.
Goldman BB, Wipke WT: Quadratic shape descriptors. 1. Rapid superposition of dissimilar molecules using geometrically invariant surface descriptors. J Chem Inf Comput Sci. 2000, 40: 644-658.
Connolly ML: Analytical molecular surface calculation. J Appl Cryst. 1983, 16: 548-558. 10.1107/S0021889883010985.
Lin SL, Nussinov R, Fischer D, Wolfson HJ: Molecular surface representations by sparse critical points. Proteins. 1994, 18: 94-101. 10.1002/prot.340180111.
Max NL, Getzoff ED: Spherical harmonic molecular surfaces. IEEE Comput Graph Appl. 1988, 8: 42-50. 10.1109/38.7748.
Ritchie DW, Graham J: Fast computation, rotation, and comparison of low resolution spherical harmonic molecular surfaces. J Comp Chem. 1999, 20: 383-395. 10.1002/(SICI)1096-987X(199903)20:4<383::AID-JCC1>3.0.CO;2-M.
Morris RJ, Najmanovich RJ, Kahraman A, Thornton JM: Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics. 2005, 21: 2347-2355. 10.1093/bioinformatics/bti337.
Cai W, Xu J, Shao X, Leroux V, Beautrait A, Maigret B: SHEF: a vHTS geometrical filter using coefficients of spherical harmonic molecular surfaces. J Mol Model. 2008, 14: 393-401. 10.1007/s00894-008-0286-z.
Funkhouser T, Min P, Kazhdan M, Chen J, Halderman A, Dobkin D, Jacobs D: A search engine for 3D models. Acm Transactions on Graphics. 2003, 22: 83-105. 10.1145/588272.588279.
Venkatraman V, Sael L, Kihara D: Potential for protein surface shape analysis using spherical harmonics and 3D Zernike descriptors. Cell Biochem Biophys. 2009, 54: 23-32. 10.1007/s12013-009-9051-x.
Zauhar RJ, Moyna G, Tian L, Li Z, Welsh WJ: Shape signatures: a new approach to computer-aided ligand- and receptor-based drug design. J Med Chem. 2003, 46: 5674-5690. 10.1021/jm030242k.
Novotni M, Klein R: 3D Zernike descriptors for content based shape retrieval. ACM Symposium on Solid and Physical Modeling, Proceedings of the eighth ACM symposium on Solid modeling and applications. 2003, 216-225. full_text.
Kihara D, Sael L, Chikhi R: Local surface shape-based protein function prediction using Zernike descriptors. Biophys J. 2009, 96: 650a-10.1016/j.bpj.2008.12.3435.
Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008, 72: 1259-1273. 10.1002/prot.22030.
Sael L, Kihara D: Protein surface representation and comparison: New approaches in structural proteomics. Biological Data Mining. Edited by: Chen J, Lonardi S. 2009, Boca Raton, Florida, USA: Chapman & Hall/CRC Press, 89-109.
Sael L, La D, Li B, Rustamov R, Kihara D: Rapid comparison of properties on protein surface. Proteins. 2008, 73: 1-10. 10.1002/prot.22141.
Mak L, Grandison S, Morris RJ: An extension of spherical harmonics to region-based rotationally invariant descriptors for molecular shape description and comparison. J Mol Graph Model. 2007, 26: 1035-1045. 10.1016/j.jmgm.2007.08.009.
Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J Am Chem Soc. 2003, 125: 11853-11865. 10.1021/ja036030u.
Takane SY, Mitchell JB: A structure-odour relationship study using EVA descriptors and hierarchical clustering. Org Biomol Chem. 2004, 2: 3250-3255. 10.1039/b409802a.
Tripos Inc: SYBYL 6.8. South Hanley Road, St. Louis, Missouri, 63114, USA. 2009, [http://tripos.com/]
Huang N, Shoichet BK, Irwin JJ: Benchmarking sets for molecular docking. J Med Chem. 2006, 49: 6789-6801. 10.1021/jm0608356.
Jahn A, Hinselmann G, Fechner N, Zell A: Optimal assignment methods for ligand-based virtual screening. J Cheminformatics. 2009, 1: 14-10.1186/1758-2946-1-14.
von GM, Koczyk G, Pas J, Wyrwicz LS, Rychlewski L: Ligand.Info small-molecule Meta-Database. Comb Chem High Throughput Screen. 2004, 7: 757-761. 10.2174/1386207043328265.
Canterakis N: 3D Zernike moments and Zernike affine invariants for 3D image analysis and recognition. Proc 11th Scandinavian Conference on Image Analysis. 1999, 85-93.
Joanes DN, Gill CA: Comparing measures of sample skewness and kurtosis. The Statistician. 1998, 47: 183-189.
Raymond JW, Willett P: Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J Comput Aided Mol Des. 2002, 16: 521-533. 10.1023/A:1021271615909.
Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph. Commun ACM. 1973, 16: 575-577. 10.1145/362342.362367.
Irwin JJ, Shoichet BK: ZINC--a free database of commercially available compounds for virtual screening. J Chem Inf Model. 2005, 45: 177-182. 10.1021/ci049714+.
Kinnings SL, Jackson RM: LigMatch: a multiple structure-based ligand matching method for 3D virtual screening. J Chem Inf Model. 2009, 49: 2056-2066. 10.1021/ci900204y.
Cheeseright TJ, Mackey MD, Melville JL, Vinter JG: FieldScreen: virtual screening using molecular fields. Application to the DUD data set. J Chem Inf Model. 2008, 48: 2108-2117. 10.1021/ci800110p.
Good AC, Oprea TI: Optimization of CAMD techniques 3. Virtual screening enrichment studies: a help or hindrance in tool selection?. J Comput Aided Mol Des. 2008, 22: 169-178. 10.1007/s10822-007-9167-2.
Schrodinger LLC: MacroModel (ver. 9.6). New York, NY. 2008, [http://www.schrodinger.com/ProductDescription.php?mID=6&sID=8&cID=0]
Weislow OS, Kiser R, Fine DL, Bader J, Shoemaker RH, Boyd MR: New soluble-formazan assay for HIV-1 cytopathic effects: application to high-flux screening of synthetic and natural products for AIDS-antiviral activity. J Natl Cancer Inst. 1989, 81: 577-586. 10.1093/jnci/81.8.577.
Molecular Networks: CORINA. Erlangen, Germany. 2009, [http://www.molecular-networks.com/products/corina]
von Grotthuss M, Pas J, Rychlewski L: Ligand-Info, searching for similar small compounds using index profiles. Bioinformatics. 2003, 19: 1041-1042. 10.1093/bioinformatics/btg117.
Ward J: Hierarchical grouping to optimize an objective function. J Am Stat Ass. 1963, 58: 236-244. 10.2307/2282967.
Hubert L, Arabie P: Comparing Partitions. Journal of Classification. 1985, 2: 193-218. 10.1007/BF01908075.
Bender A, Glen RC: A discussion of measures of enrichment in virtual screening: comparing the information content of descriptors with increasing levels of sophistication. J Chem Inf Model. 2005, 45: 1369-1375. 10.1021/ci0500177.
Truchon JF, Bayly CI: Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. J Chem Inf Model. 2007, 47: 488-508. 10.1021/ci600426e.
Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters. 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010.
Mackey MD, Melville JL: Better than random? The chemotype enrichment problem. J Chem Inf Model. 2009, 49: 1154-1162. 10.1021/ci8003978.
The authors acknowledge Lee Sael for developing programs for the 3DZD method. This work is supported by grants from the National Institutes of Health (R01GM075004) and in part by National Science Foundation (DMS800568, EF0850009, IIS0915801).
The authors declare that they have no competing interests.
VV implemented the USR method and scripts to perform database screening. VV and PR tested the presented methods and prepared the manuscript for this publication. DK conceived the study, supervised and coordinated the project, and helped writing the manuscript. All authors have read and approved of the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.