Research article  Open  Published:
4D Flexible AtomPairs: An efficient probabilistic conformational space comparison for ligandbased virtual screening
Journal of Cheminformaticsvolume 3, Article number: 23 (2011)
Abstract
Background
The performance of 3Dbased virtual screening similarity functions is affected by the applied conformations of compounds. Therefore, the results of 3D approaches are often less robust than 2D approaches. The application of 3D methods on multiple conformer data sets normally reduces this weakness, but entails a significant computational overhead. Therefore, we developed a special conformational space encoding by means of Gaussian mixture models and a similarity function that operates on these models. The application of a modelbased encoding allows an efficient comparison of the conformational space of compounds.
Results
Comparisons of our 4D flexible atompair approach with over 15 stateoftheart 2D and 3Dbased virtual screening similarity functions on the 40 data sets of the Directory of Useful Decoys show a robust performance of our approach. Even 3Dbased approaches that operate on multiple conformers yield inferior results. The 4D flexible atompair method achieves an averaged AUC value of 0.78 on the filtered Directory of Useful Decoys data sets. The best 2D and 3Dbased approaches of this study yield an AUC value of 0.74 and 0.72, respectively. As a result, the 4D flexible atompair approach achieves an average rank of 1.25 with respect to 15 other stateoftheart similarity functions and four different evaluation metrics.
Conclusions
Our 4D method yields a robust performance on 40 pharmaceutically relevant targets. The conformational space encoding enables an efficient comparison of the conformational space. Therefore, the weakness of the 3Dbased approaches on single conformations is circumvented. With over 100,000 similarity calculations on a single desktop CPU, the utilization of the 4D flexible atompair in realworld applications is feasible.
Background
Sorting and comparing molecules from chemical databases represent two of the key tasks in cheminformatics [1]. The sorting of such databases, with respect to a given set of queries (molecules) and similarity functions, is known as virtual screening (VS). The goal of VS is to enrich molecules with similar properties (e.g., biological activity) to the query molecules and to discover new chemical entities in a small fraction of the database. To ensure the desired properties (e.g., biological activity) and to evaluate the success of the VS run, it is necessary to further analyze the enriched molecules by means of biological assays. The success of a VS run consists of two different aspects. First, the enriched molecules should have similar properties as the query molecules. Second, the discovery of new chemical entities that consist of different scaffolds in comparison with the query molecules, and, therefore represent an information gain. Based on the focus on a relevant subset of the database and the possible structural information gain, VS experiments represent a fundamental approach in the drug discovery pipeline [2, 3].
In the last two decades a plethora of different similarity functions were proposed [4, 5], and the development of new functions is still an open field of research. All similarity functions can be categorized by the dimension of the applied representation of molecules. 1D similarity functions are based on molecular property counts such as molecular weight or number of hydrogen bond acceptors. 2D approaches make use of the adjacency matrix of the molecular graph, and, therefore they are also called topologicalbased approaches. MOLPRINT2D [6], substructurebased fingerprints like BCI [7] and DAYLIGHT [8] as well as the MACCS [9] keys are well known 2D similarity methods. Those topological or structural fingerprints yield promising results with respect to the enrichment of active molecules, but often lack the ability to discover new chemical entities [10]. 3D similarity functions are based on the shape [11–14] or geometrical distance information [15–17] of molecules. Information of the conformational ensembles of molecules extends the 3Dbased methods and can be seen as 4D approaches [18, 19].
Based on the keylock principle of Hermann Emil Fischer, it could be expected that the shape of molecules plays an important role for the biological activity. However, the shape of a molecule is not unique, but rather a function of internal parameters like the torsion angles. Hence, each rotatable bond represents a degree of freedom and increases the number of possible shapes (conformations) of the molecule. The resulting space, which contains all possible conformations, represents the conformational space of the molecule. Based on this increased complexity, it is not surprising that several literature studies reported a more robust VS performance of 2D methods in comparison to 3D approaches [20, 21]. Further arguments for 2D methods are their simplicity and speed [22].
In a comprehensive study, Venkatraman et al. [21] investigated the performance of different 2D and 3D methods on a wide range of pharmaceutically relevant targets. The results of the study underpin the predominant opinion that 2Dbased approaches are superior to 3D approaches with respect to the enrichment of active molecules. The performance of the 2D and 3D approaches with respect to the knowledge gain by means of the discovery of new chemical entities was not evaluated by the study. A possible reason for the inferior performance of 3D methods is the geometric information that is based on one conformation of the molecule [21]. One opportunity to improve the performance of 3D methods is to apply the 3D methods on different conformations of the molecules and use the mean or maximum similarity value. The drawback of this workaround is the quadratic increase in computation time, which scales with the number of conformations. To address this runtime issue, it is necessary to perform the similarity calculation on the complete conformational ensemble in one step in a feasible manner. These limitations of 3D approaches also affect the performance of instancebased machine learning QSAR/QSPR models. To improve the robustness of those QSAR/QSPR models, we developed a 4Dbased approach that is able to compare the conformational space of molecules within one step in feasible time [23]. The results showed that our approach produces robust models that are superior to similar 3D and 2D approaches. Given the fact that the reasons for the inferior performance of 3Dbased methods seem to be similar in both applications (VS and QSAR/QSPR), it is possible that our 4Dbased approach is also able to increase the VS performance in comparison to 2D and 3D methods.
The aim of this study is to evaluate our 4D approach as VS similarity function on a variety of literature VS benchmark data sets. Additionally, we compare the results to stateoftheart 2D and 3D approaches to assess the performance of our method. We employed VS performance metrics that measure the chemotype enrichment performance to reduce the influence of artificial enrichment. The results show a robust performance of our approach in comparison to stateoftheart 2D and 3D approaches. Therefore, our conformational space comparison is able to reduce the weakness of 3Dbased methods without the timedemanding pairwise comparison of individual conformations.
Methods
This section describes our 4D flexible atompair (4D FAP) similarity measure on the conformational space of molecules. To allow an efficient comparison of the conformational space of molecules, our approach needs a special encoding of the conformational ensembles, which can be seen as a preprocessing step. First, We describe our conformational space encoding. Afterwards, a modified Expectation Maximization (EM) algorithm will be presented that computes generative models, which represent the behavior of the molecules in their conformational space. Finally, the actual similarity calculation, which operates on the preprocessed molecules, will be explained.
Conformational Space Encoding
To ensure a fast comparison of the conformational space of molecules, it is necessary to transform the complex information of the conformational space of molecules into a representation that is suitable for the integration into fast similarity functions. Therefore, we decompose the information of the conformational space into small portions. Given a conformational sampling C_{ M } of molecule M with M heavy atoms, the encoding is based on the distance behavior of atompairs in the conformational space. Hence, the conformational space C_{ M } of molecule M is segmented into the distance behavior of atompairs.
Figure 1 represents exemplarily an atompair and the corresponding geometric distance. Not all of the atompairs of a molecule M have a flexible distance behavior in the conformational space. The distance relation of neighboring atoms or atoms of a ring system only shows a small variability of the distance. Therefore, our encoding separates the atompairs into two disjoint classes: The flexible and the rigid atompair class. The separation is realized by a heuristic that employs the number of rotatable bonds in the shortest path of the corresponding molecular graph. Figure 1 visualizes the shortest path of the marked atompair. A bond is supposed to be rotatable if it is a single bond and not a bond of a ring system. If the number of rotatable bonds in the shortest path is ≥ 1, the atompair represents a flexible, otherwise a rigid atompair. Terminal rotatable bonds (rotatable bonds that are adjacent to one of the atoms that form the atompair) are not counted in the heuristic because a rotation of such a bond has no influence on the distance relation of the atoms (Figure 1).
Given the class of flexible atompairs from the heuristic above, our encoding measures the distance of each atompair and conformation of the given conformational sampling C_{ M } . This results into atompairs that have C_{ M }  distance values, where C_{ M }  represents the number of sampled conformers of molecule M. We refer to the atompairs containing the distance information in the conformational space as distance profiles.
These distance profiles can be visualized by means of normalized histograms, which represent the relative frequency of observing the corresponding atompair distance within a binned distance range. A histogrambased visualization of the distance profile from the atompair of Figure 1 can be seen in Figure 2. The application of histograms in a similarity function entails two major drawbacks. First, the binning size represents a parameter and has substantial impact on the resulting similarity value. Second, the storage of the information needs more space than a modelbased encoding. Therefore, we decided to describe the distance behavior in the conformational space by means of Gaussian Mixture Models (GMMs). After the encoding of the distance profiles as GMMs the preprocessing of the molecules is finished.
Gaussian Mixture Models and Parameter Estimation on Distance Profiles
Mixture models are probabilistic models that represent a complex distribution based on a linear combination of individual subdistributions. Applying Gaussian distributions as subdistributions in a mixture model yields a GMM as given in Equation 1, where p(x) represents the probability density at the point x, π_{ c } determines the weight of the cth Gaussian distribution, and depicts the cth Gaussian distribution with mean μ_{ c } and covariance matrix Σ _{ c }.
GMMs are generative models for realworld data and involve two advantages in our application. First, a conformational ensemble of a molecule represents only a discrete sampled approximation of the complete conformational space. Therefore, the flexible atompairs contain a series of sampled distance values. A generative model, fitted to the distance values, represents a continuous function, and, therefore describes a more realistic model in comparison to discrete or binned values. Second, the models can be efficiently stored because only the model parameters are necessary for a similarity calculation between such models. A drawback of the GMMs is the parameter estimation for a given data set. Given the distance values of a flexible atompair, it is necessary to fit the parameters π_{ c } , μ _{ c }, Σ _{ c }, and C (number of Gaussian components) of Equation 1 to the distance values. A popular approach to determine the parameters of a mixture model is the Expectation Maximization (EM) algorithm [24]. This algorithm is based on the maximum likelihood framework and optimizes the objective function given in Equation 2.
The EM algorithm represents an iterative process that consists of two steps. The first step (Estep) evaluates the responsibilities that the kth component of the GMM was responsible for generating the nth (x _{ n }) data point of the given data set X (Equation 3).
The second step (Mstep) updates the parameters of the GMM on the basis of the responsibilities of the previous Estep (Equation 47).
These two steps are repeated until a predefined convergence criterion is reached. The EM algorithm optimizes the parameters of the GMM and guarantees a local optimum solution. Therefore, it is necessary to execute the EM algorithm with different initial parameters to avoid a model from a local optimum with an inferior likelihood value.
The EM algorithm estimates the parameters of a GMM based on a predefined number of Gaussian components. A suitable number of components is crucial for a useful model. Therefore, a model selection step that determines an optimal number of components is necessary. To reduce the risk of overfitting (high number of Gaussian components), several model selection criterions, such as the Bayesian information criterion [25] or the Akaike information criterion [26], were proposed that penalize an increased number of components. This model selection step involves a significant runtime overhead and can be avoided if the number of subdistributions can be estimated. In our application, a GMM has to model the distance behavior of the corresponding atompair in the conformational space. The distance behavior of an atompair can be seen as a function of the flexibility of the shortest path in the molecular graph.
Therefore, the number of flexible bonds in the shortest path (as applied to classify the atompairs) can also be applied as a heuristic to determine the number of Gaussian components for the GMM. In an earlier study we already presented the comparable performance of the heuristic in comparison to model selection criterions [23]. This heuristic avoids the model selection step and reduces the runtime of the preprocessing step. Figure 2 presents the GMM that models the distance behavior of the atompair in Figure 1. The presented EM algorithm assumes that all samples of the data set are equally important for the final model. Transferred to our application this means that each conformation has the same influence on the final model. Based on a thermodynamic point of view, this assumption of equal influence holds if all conformations of the ensemble have the same energy. To emphasize the influence of low energy conformations on the final model, we developed an extension of the EM algorithm that integrates the importance of each sample into the optimization process. In an earlier study, our modified EM algorithm generated improved QSAR/QSPR models in comparison to models based on equally weighted GMMs [27].
Boltzmann Weighted Expectation Maximization Algorithm
To increase the importance of low energy conformations on the final GMMs of a molecule, we apply the normalized Boltzmann distribution as given in Equation 8 to determine a probability value for a given conformer. ΔE_{ n } represents the energy offset of the nth conformer to the global optimum of the conformational ensemble, R presents the gas constant, and T the temperature of the canonical ensemble.
These probability values have to be integrated into the EM algorithm and modify the objective function as outlined in Equation 9, where E symbolizes a vector containing the energy values of the conformers and p(E _{ n }) depicts the probability of the nth conformation.
The Estep (computation of the responsibilities) remains unchanged, and, therefore the responsibilities are calculated as stated in the Equation 3. However, the equations of the Mstep (update of the parameters) need the integration of the probability values as listed in the Equations 1013.
Based on the described modifications, the EM algorithm computes GMMs that represent the distance behavior of atompairs as a function of the frequency of observing an atompair at a certain distance as well as the probability that the canonical ensemble will occupy these states (conformations). Figure 3 visualizes a weighted (probabilities of the conformations) histogrambased representation of the same atompair visualized in Figure 2. The Boltzmann weighted model of Figure 3 shows that the distances at 9.25 Å and 11 Å are energetically favorable, and, therefore the probability density is increased in comparison to the unweighted model of Figure 2. In contrast, the conformations with low range distances have higher energy values and, as a consequence, the corresponding probability densities are reduced.
4D Flexible AtomPair Similarity Function
After the preprocessing of the molecules (encoding the distance distributions by means of GMMs) the actual 4D similarity calculation can be conducted. The similarity function operates on the molecular graph (adjacency matrix) and the GMMs of the flexible atompairs. Therefore, the conformational ensemble of the molecules is not further needed.
In a first step, the 4D FAP creates for each heavy atom of the molecule an atompair prefix tree. This data structure represents an efficient approach for search and comparison operations and was already applied as a data structure for an atompairbased similarity measure [28, 29]. Each prefix tree has one atom as root node and contains all atompair information to the remaining M  1 heavy atoms of the molecule (leaves of the tree). The preprocessing step divides the atompairs into two disjoint classes. Therefore, an atompair tree T contains the two different subtrees R and F for the rigid and flexible atompair class, respectively. The rigid subtree R contains the information of the rigid atompairs that were not modeled by GMMs. To increase the information content of the rigid atompairs, the subtree additionally contains the topological distance information of each atompair. An example of such an atompair tree can be seen in Figure 4.
The nodes in these prefix trees can be labeled by any arbitrary atom typing scheme. We applied a labeling function that consists of three different elements. The first element is the element symbol of the atom. A ring flag indicates the membership to a ring system. The final value is the result of the number of neighboring heavy atoms minus the number of neighboring hydrogen atoms.
Given the prefix trees of two molecules A and B, the 4D FAP computes a similarity matrix S, where each entry represents the similarity value between two atompair trees. Based on the two different subtrees, an entry S^{ij} in the similarity matrix is the sum of two distinct similarity calculations on the subtrees. Hence, the 4D FAP utilizes different similarity functions for the subtrees.
The rigid subtree contains the labels of the atoms and a topological distance value. This type of information represents nominal features and enables the use of simple similarity functions. We applied the Tanimoto similarity function as stated in Equation 14. R_{ i } and R_{ j } represent the rigid atompair subtrees of the ith and jth atom, respectively.
The comparison of the flexible atompair subtree consists of two different similarity functions. Given the flexible atompair subtrees F_{ i } and F_{ j } containing F_{ i }  and F_{ j }  flexible atompairs (number of leaves in the subtree). The first function compares the atom labels of two given flexible atompairs AP _{ n } ∈ F_{ i } and AP _{ m } ∈ F_{ j } as presented in Equation 15. l_{ m } and l_{ n } represent the atom labels of the atompairs. The function is a Dirac function on the atom labels and returns a value of 1.0 if the labels are equal and 0 otherwise.
The second similarity function compares the behavior of the atompairs in the conformational space. For this purpose, a correlation measure on GMMs is applied as denoted in Equation 16, where g_{ m } and g_{ n } symbolize the GMMs of the flexible atompair AP _{ m } and AP_{ n }, respectively.
The assembly of both similarity functions for flexible atompairs results in Equation 17 and represents the similarity function for the flexible atompair subtree.
Unlike the similarity function for the rigid atompair subtree, the similarity function for the flexible atompair subtree is not based on nominal features. Therefore, Equation 17 performs a pairwise comparison of all atompairs and sums up the individual similarity scores. To avoid overestimated similarity values of subtrees with an increased number of flexible atompairs, the similarity value is normalized by Equation 18
After computation of the similarity matrix S, which contains all pairwise similarity values of the atompair trees, the 4D FAP computes a final similarity value based on the matrix S. The original 4D FAP, as applied in QSAR/QSPR studies [23, 27], sums up the entries of the S matrix and normalizes the sum to obtain a value in the range [0.0, 1.0]. Another possibility to compute a final similarity value represents the optimal assignment. This approach was introduced into the field of cheminformatics by Fröhlich et al. [30, 31] and applied as a VS similarity function in a previous study [28].
Preliminary experiments (not published) showed that an optimal assignment on the matrix S improves the VS performance in comparison to the normalized summation of the matrix elements. Therefore, we changed the final computation step of the 4D FAP to perform an optimal assignment on the matrix S as stated in Equation 19. Given the molecules A and B (with A < B w.l.o.g.), π represents a function that maps each value of i ∈ [1, ..., A] on a value in the range [1, ..., B] in such a way that the sum of the similarity entries is maximized. The final sum of the optimal assigned similarity values of the atompair trees is also normalized by Equation 18. We refer to the optimal assignmentbased variant of the 4D FAP as 4D FAP_{OA}.
Experimental
In this section we initially characterize the applied VS benchmark data sets as well as their preparation step. Afterwards, the protocol for the conformational sampling of the molecules as well as a short description of the VS evaluation metrics follow. Finally, we present a brief overview of literature VS methods that were applied to classify the results of the 4D FAP_{OA}.
Data sets
To evaluate the 4D FAP_{OA} on a wide range of pharmaceutically relevant targets, we employed the directory of useful decoys (DUDs) release 2 [32]. These data sets were introduced as a benchmark data set compilation for the evaluation of docking algorithms [33]. Ligandbased VS, based on similarity values to a query structure, can be afflicted with an analogue enrichment bias. This bias results from the enrichment of structurally similar molecules with respect to the query structure. These similar structures represent only a limited information gain, and, therefore the results of the experiment will have an analogue enrichment bias.
To reduce this bias and to enable a fair comparison between similaritybased and dockingbased algorithms, Good and Oprea applied a leadlike filter [34] on the data sets and clustered the actives [35]. These filtered and clustered data sets were already applied in a ligandbased VS study [28] and are publicly available [36]. Table 1 shows a complete overview of the 40 targets and the number of actives and decoys for the DUD release 2 and the filtered variant. For the VS experiments we applied the target ligands as query structure for the respective active and decoy data set. The data sets were not further modified to allow a fair comparison of the results.
The evaluation of similarity functions by means of the DUD data sets represents a retrospective evaluation. Analogous to the "Kubinyi paradox" [37] of QSAR models, the solely retrospective evaluation possibly implies the risk that the development of new methods or the improvement of existing approaches will increase their retrospective performance at the expense of the prospective performance. However, the DUD data sets contain over 100,000 molecules for 40 different targets. Consequently, the evaluation on all 40 data sets is based on an increased molecular diversity in comparison with the usually smaller and less diverse benchmark data sets of QSAR experiments. Therefore, the risk of an inferior prospective performance of VS similarity functions as a result of their optimization for the retrospective performance is reduced but still present.
Conformational sampling
To create the conformational ensembles of the molecules, we applied the ConfGen tool of Schrödinger [38]. Recent studies showed the ability of ConfGen to compute reasonable conformers of molecules [39, 40]. The tool provides four standard parameter schemes that sample the conformational space at different resolutions. To compute useful GMMs in the preprocessing step, it is necessary to sample the conformational space at a high resolution. Therefore, we modified the 'comprehensive' parameter scheme of ConfGen to further increase the resolution. We reduced the heavy atom rmsd for distinct conformers from 0.5 Å to 0.1 Å. This modification results in more conformers but does not increase the runtime of the conformational sampling. The energy values, which are necessary for the Boltzmann weighted GMMs, were computed by the OPLS 2005 forcefield with standard parameters.
The applied conformational sampling algorithm as well as the forcefield model have a major impact on the final results of the 4D FAP_{OA}. Different conformational sampling algorithms compute different sets of conformers, which in turn yield different atompair distance profiles. The forcefield computes an energy value for each conformer and determines the weight of each measured atompair distance. As a result, a different conformational sampling protocol will yield different GMMs of the atompairs. Hence, the computed similarity values differ and will probably change the results. However, the aim of this study is not the evaluation of the impact of different conformational sampling protocols on the 4D FAP_{OA}, but the evaluation of the 4D FAP_{OA} as a VS similarity function based on the given protocol.
Evaluation metrics
To evaluate the performance of our 4D FAP_{OA} approach, we applied different standard evaluation metrics. The receiver operating characteristic (ROC) curve represents a function that plots the true positive rate as a function of the false positive rate. The area under the ROC curve (AUC) represents a quantification of the curve and facilitates an easier comparison of results. The AUC is calculated as given in Equation 20, where N_{actives} depicts the number of actives, N_{decoys} represents the number of decoys, and describes the number of decoys that are higher ranked than the ith active structure. The received value is in the range [0.0, 1.0], where 0.5 indicates a random performance.
The AUC metric represents a measure that evaluates the performance on the complete data. However, a major goal of VS experiments is the enrichment of active structures in a small fraction of the database. Therefore, it is necessary to apply additional metrics that focus on the early enrichment behavior. A common metric for the early enrichment problem is the enrichment factor at a predefined fraction of the data set (x%) as given in Equation 21.
The enrichment factor depends on the number of actives, and, therefore it is not a robust metric. Korff et al. [41] proposed the relative enrichment factor (REF) as stated in Equation 22 to remove the dependency on the number of actives structures.
The enrichment of active structures that are based on different scaffolds emerged to an additional important goal of VS experiments. All metrics that evaluate the socalled chemotype enrichment are based on a clustering of the active structures into different chemotypes (scaffolds). Mackey and Melville showed the integration of the scaffold information into common VS metrics [42]. We decided to apply the arithmetic weighting on the ROC enrichment as given in Equation 23. Based on the information of the clustering, a structure i obtains a weight that is inversely proportional to the number of structures (N_{ j } ) in the cluster . represents a binary function that returns 1 if the ith active structure of the jth cluster is contained in the first x% of the data set.
The evaluation metrics listed above represent only a small fraction of possible metrics. Other popular metrics for the early enrichment evaluation are the BEDROC [43] score or the enrichment factor. To enable future comparisons with the presented results of the 4D FAP_{OA}, we computed for each target of the filtered DUD data set a result file that contains several additional VS metrics (e.g., BEDROC score at nine predefined alpha values). Additionally, the files contain the complete ranking of the molecules that allows the computation of the VS metric of choice. The 40 result files are contained in the additional file 1 of this study.
Literature Similarity Functions
We employed a wide range of different 2D and 3D similarity approaches to assess the performance of the 4D FAP_{OA}. Due to the fact that we compare our approach to 20 other approaches, we only mention the name of the method and the applied type of information. For a comprehensive description we refer to the original publications.
Different optimal assignment approaches were already evaluated on the filtered DUD data sets in an earlier publication [28]. The best approach of this study was a twostep hierarchical assignment (2SHA) that first operates on a substructure level and afterwards on the atomic level. A second approach of that study optimally assigns the atompair (OAAP) environment trees and represents a similar 3D concept in comparison to the 4D FAP_{OA}. The optimal assignment kernel (OAK) [30, 31] and its flexibility extension, the OAK_{FLEX} [44], were also evaluated in this earlier publication.
Cheeseright et al. [45] introduced FieldScreen as a multiconformerbased VS tool. FieldScreen utilizes a database that contains conformers of each molecule. Therefore, it operates on a conformational ensemble in a similar way as the 4D FAP_{OA} and represents an interesting reference approach. FieldScreen employs four different types of locally optimized molecular field points to compute a similarity value between two given molecules.
Venkatraman et al. conducted a comparison study in which a plethora of different 2D and 3D approaches were evaluated on the original as well as the filtered DUD data sets [21]. We compared the performance of the 4D FAP_{OA} to the main results of this study. The study conducted by Venkatraman et al. employed the 2D fingerprint methods: OPENBABEL [46], DAYLIGHT [8], BCI [7], MACCS [9], and MOLPRINT2D [6]. As 3Dbased approaches they utilized ROCS [12] with two different scoring schemes ROCSS (shape only) and ROCS_{SC} (shape and chemistry). The EON [47] approach compares the electrostatic fields computed by the PoissonBoltzmann equation and was also evaluated using two different parameterizations. EON_{SE} is based on the shape and the electrostatic, whereas EON_{SCE} additionally uses chemical information.
SHAEP [14] is based on a maximum common subgraph approach that is employed to perform a superposition of the molecules. The method operates only on the shape of molecules (SHAPE_{S}) or on the shape and the electrostatic (SHAPE_{SE}). The Ultrafast Shape Recognition (USR) [17] employs four distance relations of each atom and computes the first three moments of each distribution to obtain 12 descriptor values for each molecule. ESHAPE3D is based on a heavy atom distance matrix that is employed to compute fingerprints. The ESHAPE_{HYD} alternatively uses the hydrophobic heavy atoms.
PARAFIT [13] computes a similarity value based on spherical harmonic expansions of molecular surfaces. Another important class of similarity functions are the pharmacophorebased approaches. These approaches operate on an abstract representation of the molecules by means of pharmacophore features. These features are divided into different classes (e.g., hydrogen bond acceptor, aromatic, or hydrophobic) and represent important interaction points of molecules. The distance relation between these pharmacophore features plays an important role and can be measured in a topological or geometrical manner. Therefore, the pharmacophorebased approaches can also be divided into 2D and 3Dbased approaches. Korff et al. [41] compared different structure and ligandbased VS approaches on the DUD data sets. This study contains two different pharmacophorebased methods. The topological pharmacophore point histogram (TopPPHist) computes for each pair of pharmacophore classes a distance histogram based on the topological distances. Therefore, the TopPPHist represents a 2Dbased pharmacophore approach. Finally, the distance histograms are converted into a descriptor vector. The Flexophore approach [1, 41] computes geometrical and binned distance histograms for each pharmacophore point pair based on a representative set of given conformers. The final comparison between two molecules is similar to the maximum common subgraphisomorphism because the pharmacophore points together with the distance histograms form complete graphs.
Results and Discussion
The results section is divided into four different subsections. The first subsection compares the results of the 4D FAP_{OA} approach with other optimal assignmentbased approaches that were already evaluated on the DUD data sets [28]. The second part is based on the results of Venkatraman et al. [21] and compares the average performance of the 4D FAP_{OA} with 15 stateoftheart 2D and 3D approaches. Afterwards, a comparison with the pharmacophorebased approaches of Korff et al. [41] follows. The final subsection focuses on the performance difference between 3D approaches on multiple conformers and our 4D FAP_{OA} approach.
Comparison with other Optimal Assignment Methods
The comparison with other optimal assignment methods measures the influence of the applied information type on the final performance. The OAAP represents the comparable 3D approach in comparison with the 4D FAP_{OA}, and, therefore directly measures the performance gain of the 4D extension. As an early enrichment metric we applied the awROCE_{5%}, which also assesses the chemotype enrichment performance. To reduce the bias introduced by a low number of chemotypes, we only applied data sets that have at least 15 different chemotypes. The AUC value was applied to evaluate the performance on the complete data sets.
Table 2 shows the results of the four optimal assignment methods and the 4D FAP_{OA}. The direct comparison of the OAAP and the 4D FAP_{OA} indicates that the 4D FAP_{OA} outperforms the OAAP on 10 out of 13 data sets with respect to both performance measures. The OAAP is superior to the 4D FAP_{OA} on the COX2, HIVRT, and the PDGFrb data sets. These three data sets are more rigid data sets with respect to the number of rotatable bonds of the query compounds. The query compounds of COX2, HIVRT, and PDGFrb have 5, 9, and 7 rotatable bonds, respectively. In comparison, the most flexible data sets are the ACE and EGFr data sets with 18 and 14 rotatable bonds, respectively. The correlation between the performance gain on the AUC metric and the number of rotatable bonds of the query structure amounts to 0.54. Based on the equation , where n is the number of samples (13 data sets) and r the correlation, the probability that both variables (AUC performance gain and flexibility of query compounds) result in such a correlation if there is no true correlation of the variables (ρ = 0.0) is 0.0277. Therefore, the correlation is significant (p = 0.05) and indicates that the performance gain of the 4D FAP_{OA} is a function of the flexibility of the data set. In comparison to the other optimal assignment methods there is no correlation apparent. However, the OAK, OAK_{FLEX}, and 2SHA are based on a different type of information (local atom similarity based on atom and bond features), and therefore, a direct comparison of the correlations is not meaningful.
The comparisons of the 4D FAP_{OA} with all other optimal assignment approaches show that the 4D FAP_{OA} outperforms all other methods on 6 and 9 data sets with respect to the awROCE_{5%} and AUC, respectively. These results yield a best average rank of 1.58 for the 4D FAP_{OA} with respect to the AUC. For the awROCE_{5%} results the 2SHA achieves the best average rank of 2.15 followed by the 4D FAP_{OA} with an average rank of 2.31. In a direct comparison the 4D FAP_{OA} outperforms the 2SHA approach on 7 data sets, whereas the reverse case only occurs on 5 data sets. To conclude, the 4D FAP_{OA} shows a robust performance on 13 data sets. Considering the results of the complete data sets (AUC) the 4D FAP_{OA} outperforms all other optimal assignment methods. The ability of 4D FAP_{OA} to early enrich different scaffolds is comparable with the 2SHA approach.
The encoding of the conformational space should be most beneficial if the flexibility of the query structure is high. Therefore, we discuss the results on the two data sets with the most flexible query compounds, the ACE and EGFr data set, in more detail.
Figure 5 shows the ROC plot of all optimal assignment methods on the ACE data set. The curve of the 4D FAP_{OA} passes always above the other curves with the exception of the 2SHA curve between 0.3 and 0.4 false positive rate. In the early enrichment range (0.0  0.1 false positive rate) the 4D FAP_{OA} shows a strong increase of the true positive rate without any longer phases of stagnation (horizontal elements in the curve). The other optimal assignment methods show a similar behavior till ≈ 0.03 false positive rate, but they stagnate until ≈ 0.1 false positive rate. Therefore, the 4D FAP_{OA} has an offset of the true positive rate of nearly 0.2 in comparison to the other methods. The 4D FAP_{OA} is also the first approach that is able to retrieve all actives of the data set (≈ 0.7 false positive rate). The second approach that retrieves all actives is the 2SHA approach at a false positive rate of ≈ 0.9. The comparable 3D approach (OAAP) is always inferior in comparison to the 4D FAP_{OA}.
To evaluate the chemotype discovery on the complete data set, we plotted the fraction of the discovered chemotypes as a function of the fraction of the ranked data set. A chemotype is considered as discovered if one compound of the chemotype is ranked.
Figure 6 presents the chemotype discovery of all optimal assignment approaches on the ACE data set. The curves of the 4D FAP_{OA}, 2SHA, OAK, and OAK_{FLEX} show a similar behavior over the complete data set. Only the OAAP has an inferior chemotype discovery rate until ≈ 40% of the data set. Therefore, the information gain by discovering new chemotypes is increased by the 4D FAP_{OA} in comparison to its similar 3D method (OAAP). However, the other approaches that are based on a different type of information (OAK, OAK_{FLEX}, and 2SHA) show a similar discovery rate.
The ROC plots and the chemotype discovery on the EGFr data set can be seen in the Figures 7 and 8, respectively. In both figures a considerably performance gain of the 4D FAP_{OA} is apparent. The 4D FAP_{OA} is able to retrieve all actives within ≈ 30% of the data set (Figure 7). All chemotypes were discovered within 23% of the data set (Figure 8). All other optimal assignment methods retrieve at least 20% of the actives and at least 15% of the chemotypes within the last percent of the data set. In comparison to the OAAP the performance gain of the 4D FAP_{OA} is approximately twice that of the OAK, OAK_{FLEX}, and 2SHA. Therefore, our encoding of the conformational space entails a significant performance gain on the EGFr data set.
Another important property of a VS similarity function is the computation performance. To enable a VS experiment on a realworld database, the VS similarity function should be able to process a reasonable number of compounds in a feasible time. All presented VS similarity functions that are based on the optimal assignment approach were developed at our department, and, therefore we are able to perform a fair comparison of the computation time. We computed the average computation time of each optimal assignment method on the 13 data sets, which were used in Table 2, to approximate a reasonable performance for druglike compounds.
The 4D FAP_{OA} approach has an averaged performance of 38.8 ± 27.56 similarity calculations per second. This computation time is based on preprocessed molecules (GMMs already computed). The OAK yields 27.34 ± 3.40 calculations per second, whereas its flexibility extension (OAK_{FLEX}) computes 41.03 ± 7.32 molecules per second. The OAAP represents the fastest approach with 51.49 ± 18.07 computations per second. In contrast, the 2SHA is the slowest method with a throughput of 14.04 ± 1.78 per second. All calculations were done on a Core2Duo CPU with 2 GHz using one core and 1 GB memory. As a result, the 4D FAP_{OA} is fast enough to screen over 100,000 molecules within one hour on a desktop CPU using only one core. The similarity calculation can be easily parallelized to further increase the throughput, and, therefore the approach should be fast enough for realworld applications.
The preprocessing step (conformational sampling and GMM calculation) represents an additional computational task of our approach. However, the preprocessing step has only to be computed once for each molecule. Additionally, the computation of different conformers (conformational sampling) is often necessary for different tasks in the drug discovery pipeline. Furthermore, our encoding is a modelbased encoding that reduces the memory usage in a database in comparison to the storage of multiple conformers of a molecule.
Comparison with StateoftheArt 2D and 3D Approaches
In this subsection we compare the performance of the 4D FAP_{OA} with different stateoftheart 2D and 3D approaches. Venkatraman et al. [21] conducted a comprehensive evaluation of 15 different literature methods on the DUD data sets. The study contains the averaged (over all 40 data sets) relative enrichment factors at 1%, 5%, and 10% as well as the AUC values for each method. Unfortunately, the study lacks any evaluation metric that rates the chemotype discovery of the approaches. Therefore, the results in this section are only based on the early enrichment (relative enrichment factors) and the performance on the complete data set (AUC). All results are based on the filtered data sets and compiled in Table 3.
The results of Table 3 confirm that the 2D approaches are more robust in comparison to the 3D methods. Only the ROCS_{SC} is able to yield comparable results in comparison to the MACCS keys and MOLPRINT2D. The 4D FAP_{OA} is able to utilize the GMMs as a source of reasonable information, and, therefore the approach yields the best results with respect to the relative enrichment factor at 5% and 10% as well as the AUC metric. Only the BCI approach is able to marginally improve the results with respect to the relative enrichment factor at 1%. The best performance of the 4D FAP_{OA} on three out of four metrics results in the best average rank of 1.25. The BCI and DAYLIGHT fingerprints yield an average rank of 2.75 and represent the best 2Dbased approach. ROCS_{SC} is the best 3Dbased approach with an average rank of 5.0, and, therefore higher ranked as the 2Dbased approaches MOLPRINT2D (6.0) and the MACCS keys (7.0). All other 3Dbased methods are inferior in comparison to the 2Dbased approaches. To conclude, the 4D FAP_{OA} benefits from the conformational space information and is able to yield the best average performance of all methods.
Comparison with PharmacophoreBased Approaches
Korff et al. [41] evaluated the TopPPHist and the Flexophore approach on the 40 targets of the DUD data sets. The early enrichment performance was assessed by the relative enrichment factor at 1% of the data set. To evaluate the chemotype enrichment, Korff et al. counted the discovered chemotypes within the enriched data set fraction with respect to the chemotype definition of Good and Oprea [35]. Table 4 lists the relative enrichment factors and the number of discovered chemotypes for each of the 40 data sets of the DUD.
With respect to the early enrichment performance the TopPPHist and the Flexophore approach achieved an average relative enrichment factor of 37.34 ± 31.38 and 43.31 ± 33.25, respectively. The application of the 4D FAP_{OA} resulted in an average relative enrichment factor of 55.45 ± 33.26 and increased the performance of the Flexophore approach by over 20%. However, based on their abstract representation of molecules, one of the strengths of pharmacophorebased approaches is the ability to discover new chemical entities. This abstraction from the query scaffold can be seen in the chemotype discovery results of Table 4. The Flexophore approach needs ≈ 20% less active compounds to discover a similar amount of chemotypes (94) in comparison with the 4D FAPOA (98). The 2Dbased TopPPHist discovered only 66 chemotypes over all 40 data sets and showed an inferior chemotype discovery in comparison with the 4Dbased approaches (Flexophore, 4D FAP_{OA}).
Comparison with Multiple Conformer Approaches
The results of the previous sections demonstrated the inferior performance of 3Dbased approaches in comparison with 2Dbased methods. A common technique to tackle this deficit of 3D approaches is to utilize multiple conformers and average or use the maximum of all pairwise similarity values. The number of necessary similarity computations scales with O(n^{2}), where n represents the number of conformers of the molecules. Therefore, this technique implies a significant increase in computation time. However, the averaging over multiple conformers increases the available information content of the 3Dbased approaches to a level that is similar in comparison to the 4D FAP_{OA}. The 4D FAP_{OA} has a modelbased description of the conformational space, whereas the 3Dbased approaches explicitly have the conformational space. Consequently, a comparison of the 4D FAP_{OA} with 3Dbased approaches on multiple conformers represents an interesting comparison based on a equal source of information.
Venkatraman et al. [21] evaluated the ROCS_{SC} (best 3Dbased approach of Table 3) in three additional experiments on the unfiltered DUD data sets with different ensembles of size 10, 100, and 1000 conformers per molecule. Table 5 lists in detail the AUC performance of the ROCS_{SC} on different ensemble sizes and the AUC results of the 4D FAP_{OA}. The table also contains the AUC results of the ROCS_{SC} on one given conformation as a baseline to evaluate the performance gain of the multiple conformer setup.
The average AUC of the ROCS_{SC} increases from 0.692 (AUC(1)) over 0.703 (AUC(10)) to 0.725(AUC(100)). The results on 1000 conformers are marginally inferior (average AUC(1000) of 0.722) in comparison to the results on 100 conformers. As a result, the ROCS_{SC} slightly benefits from the additional information content of multiple conformers. However, the average AUC of the 4D FAP_{OA} is 0.80, and, therefore superior in comparison to all four ROCS_{SC} setups. These results are verified by the average ranks of the approaches. The 4D FAP_{OA} is able to achieve the best AUC value on 27 out of 40 data sets and demonstrates its robust performance on a wide range of pharmaceutically relevant targets. The best ROCS_{SC} setup (100 conformers) yields on eight data sets the best result. Please note that the different average AUC values in Table 3 and 5 are the result of the applied data sets (filtered DUD in Table 3 and unfiltered in Table 5).
Despite the robust and superior performance of the 4D FAP_{OA} on the majority of the 40 data sets, the weak performance on the HIVPR data set is conspicuous. The HIVPR data set has an average number of heavy atoms of 36.3 and represents the data set with the largest compounds of all 40 DUD data sets. The 4D FAP_{OA} entails an optimal assignment step to compute a final similarity value based on the atompair tree similarity matrix S. If the approach computes the similarity value between the query compound and a data set compound, the ith row of S represents the atompair tree with the ith atom of the query compound as root node. Analogously, this applies to the jth column of S and the jth atom of the data set compound. The optimal assignment step maps each atom of the query compound onto an atom of the data set compound. With an increased size of atoms the number of possibilities (possible mappings) scales with O(n!), where n is the number of heavy atoms. This increase also increments the risk of a topological error in the assignment step. Topological errors are assignments that do not preserve a substructure mapping (e.g., atoms of a ring are assigned to atoms of different rings). Figure 9 shows a mapping with several topological errors. These topological errors maximize the final similarity value, but from a chemical point of view these mappings are questionable. Therefore, these errors can negatively influence the ranking of the compounds on the HIVPR data set.
The FieldScreen approach by Cheeseright et al. [45] represents a VS similarity function that applies four different types of locally optimized field points and operates on a multiconformer database. Therefore, it also operates on a comparable information content as our 4D FAP_{OA} approach. Cheeseright et al. evaluated the FieldScreen approach on the filtered DUD data sets and applied the chemotype information on the result metrics. The results of the FieldScreen approach as well as the 4D FAP_{OA} are listed in Table 6.
The 4D FAP_{OA} yields a superior early enrichment performance (awROCE_{5%}) on 9 out of 13 data sets. Concerning the performance on the complete data set (AUC) our approach outperforms FieldScreen on 8 data sets. The 4D FAP_{OA} is able to increase the mean early enrichment and complete data set performance by ≈ 30% and ≈ 16%, respectively. The major improvements of FieldScreen in comparison to the 4D FAP_{OA} are on the FXa and HIVRT data sets. These data sets also consist of larger molecules, and, therefore the risk of topological errors is increased and is probably a reason for the inferior 4D FAP_{OA} performance.
To conclude, the best 3Dbased approach of Table 3 (ROCS_{SC}) could increase the performance if it is applied on multiple conformer data sets. However, the performance gain was not strong enough to reach the results of the 4D FAP_{OA}. The comparison with the FieldScreen approach yields similar results and underpinned the robust performance of the 4D FAP_{OA}. The detailed evaluation of the results reveals a weakness of our approach if the compounds of a data set have an increased number of heavy atoms. This weakness is likely the result of the optimal assignment step and was already reported as a weak point of optimal assignment approaches [28]. Nevertheless, the 4D FAP_{OA} represents a robust similarity measure for small and medium sized druglike compounds.
Conclusions
We presented a VS similarity function that operates on GMM encoded conformational space information. Our approach is able to compare the conformational space of molecules within one step, and, therefore avoids the application of timeconsuming averaging techniques. The approach was already applied in QSAR experiments and demonstrated its robust performance in comparison to similar 3Dbased QSAR models [23, 28].
The aim of this study was to evaluate our approach as VS similarity function. Therefore, we compared the results of the 4D FAP_{OA} with 20 other 2D and 3Dbased approaches. Additionally, we applied two approaches (ROCS_{SC} and FieldScreen) that operate on multiple conformers to provide a comparison of approaches that are based on a similar information content.
The results showed that our approach is able to achieve superior results on a wide range of pharmaceutically relevant targets. Even the best 3D approach, with respect to the results of Venkatraman et al. [21], applied on multiple conformers is inferior in comparison to our approach.
The preprocessing, which is necessary to encode the conformational space information by means of GMMs, represents an additional computational step. However, all compounds have only be computed once and the encoded models need less space in comparison to the storage of conformational ensembles. The computational speed of the actual similarity function is fast enough to screen over 100,000 compounds within one hour on a standard desktop CPU with one core. Therefore, our approach should meet the requirements of realworld VS applications.
The complete source code of the preprocessing tool (computing GMMs based on conformational ensembles) as well as the 4D FAP_{OA} similarity function are publicly available on our department website http://www.cogsys.cs.unituebingen.de/software/4DFAP.
Abbreviations
 ACE:

angiotensinconverting enzyme
 AChE:

acetylcholinesterase
 ADA:

adenosine deaminase
 ALR2:

aldose reductase
 AmpC:

AmpC βlactamase
 AR:

androgen receptor
 CDK2:

cyclindependent kinase 2
 COMT:

catechol Omethyltransferase
 COX1:

cyclooxygenase1
 COX2:

cyclooxygenase2
 DHFR:

dihydrofolate reductase
 EGFr:

epidermal growth factor receptor
 ER:

estrogen receptor
 FGFr1:

fibroplast growth factor receptor kinase
 FXa:

factor Xa
 GART:

glycinamide ribonucleotide transformylase
 GPB:

glycogen phosphorylase β
 GR:

glucocorticoid receptor
 HIVPR:

HIV protease
 HIVRT:

HIV reverse transcriptase
 HMGR:

hydroxymethylglutarylCoA reductase
 HSP90:

human heat shock protein 90
 InhA:

enoyl ACP reductase
 MR:

mineralocorticoid receptor
 NA:

neuraminidase
 P38:

P38 mitogen activated protein
 PARP:

poly(ADPribose) polymerase
 PDE5:

phosphodiesterase 5
 PDGFrb:

platelet derived growth factor receptor kinase
 PNP:

purine nucleoside phosphorylase
 PPAR_{ γ }:

peroxisome proliferator activated receptor γ
 PR:

progesterone receptor
 RXR_{ α }:

retinoic X receptor α
 SAHH:

Sadenosylhomocysteine hydrolase
 SRC:

tyrosine kinase SRC
 TK:

thymidine kinase
 VEGFr2:

vascular endothelial growth factor receptor.
References
 1.
von Korff M, Freyss J, Sander T: Flexophore, a New Versatile 3D Pharmacophore Descriptor That Considers Molecular Flexibility. J Chem Inf Model. 2008, 48 (4): 797810. 10.1021/ci700359j.
 2.
Bajorath J: Integration of virtual and highthroughput screening. Nat Rev Drug Discov. 2002, 1 (11): 882894. 10.1038/nrd941.
 3.
Varnek A, Tropsha A, (Eds): Chemoinformatics Approaches to Virtual Screening. 2008, Cambridge: The Royal Society of Chemistry
 4.
Geppert H, Vogt M, Bajorath J: Current Trends in LigandBased Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. J Chem Inf Model. 2010, 50 (2): 205216. 10.1021/ci900419k.
 5.
Bender A, Jenkins JL, Scheiber J, Sukuru SC, Glick M, Davies JW: How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space. J Chem Inf Model. 2009, 49: 108119. 10.1021/ci800249s.
 6.
Bender A, Mussa HY, Glen RC, Reiling S: Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT 2D): Evaluation of Performance. J Chem Inf Comput Sci. 2004, 44 (5): 17081718.
 7.
Barnard JM, Downs GM: Chemical Fragment Generation and Clustering Software. J Chem Inf Comput Sci. 1997, 37: 141142.
 8.
Daylight Chemical Information Systems Inc. [http://www.daylight.com]
 9.
Symyx Software: MACCS structural keys. San Ramon, CA. 2005
 10.
Good AC, Hermsmeier MA, Hindle S: Measuring CAMD Technique Performance: A Virtual Screening Case Study in the Design of Validation Experiments. J ComputAided Mol Des. 2004, 18 (7): 529536. 10.1007/s1082200440671.
 11.
Venkatraman V, Chakravarthy P, Kihara D: Application of 3D Zernike descriptors to shapebased ligand similarity searching. J Cheminf. 2009, 1: 1910.1186/17582946119.
 12.
Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17 (14): 16531666. 10.1002/(SICI)1096987X(19961115)17:14<1653::AIDJCC7>3.0.CO;2K.
 13.
Mavridis L, Hudson BD, Ritchie DW: Toward High Throughput 3D Virtual Screening Using Spherical Harmonic Surface Representations. J Chem Inf Model. 2007, 47 (5): 17871796. 10.1021/ci7001507.
 14.
Vainio MJ, Puranen SJ, Johnson MS: ShaEP: Molecular Overlay Based on Shape and Electrostatic Potential. J Chem Inf Model. 2009, 49 (2): 492502. 10.1021/ci800315d.
 15.
Clark DE, Jones G, Willett P, Kenny PW, Glen RC: Pharmacophoric pattern matching in files of threedimensional chemical structures: Comparison of conformationalsearching algorithms for flexible searching. J Chem Inf Comput Sci. 1994, 34: 197206.
 16.
Baumann K: Distance Profiles (DiP): A translationally and rotationally invariant 3D structure descriptor capturing steric properties of molecules. Quant StructAct Relat. 2002, 21: 507519. 10.1002/15213838(200211)21:5<507::AIDQSAR507>3.0.CO;2L.
 17.
Ballester PJ, Finn PW, Richards WG: Ultrafast shape recognition: Evaluating a new ligandbased virtual screening technology. J Mol Graphics Modell. 2009, 27 (7): 836845. 10.1016/j.jmgm.2009.01.001.
 18.
Iyer M, Hopfinger AJ: Treating Chemical Diversity in QSAR Analysis: Modeling Diverse HIV1 Integrase Inhibitors Using 4D Fingerprints. J Chem Inf Model. 2007, 47 (5): 19451960. 10.1021/ci700153g.
 19.
Senese CL, Duca J, Pan D, Hopfinger AJ, Tseng YJ: 4DFingerprints, Universal QSAR and QSPR Descriptors. J Chem Inf Comput Sci. 2004, 44 (5): 15261539.
 20.
Ebalunode JO, Zheng W: Unconventional 2D Shape Similarity Method Affords Comparable Enrichment as a 3D Shape Method in Virtual Screening Experiments. J Chem Inf Model. 2009, 49 (6): 13131320. 10.1021/ci900015b.
 21.
Venkatraman V, PérezNueno VI, Mavridis L, Ritchie DW: Comprehensive Comparison of LigandBased Virtual Screening Tools Against the DUD Data set Reveals Limitations of Current 3D Methods. J Chem Inf Model. 2010, 50 (12): 20792093. 10.1021/ci100263p.
 22.
Dixon SL, Merz KM: OneDimensional Molecular Representations and Similarity Calculations: Methodology and Validation. J Med Chem. 2001, 44 (23): 37953809. 10.1021/jm010137f.
 23.
Jahn A, Hinselmann G, Fechner N, Henneges C, Zell A: Probabilistic Modeling of Conformational Space for 3D Machine Learning Approaches. Mol Inf. 2010, 29 (5): 441455. 10.1002/minf.201000036.
 24.
Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Ser B Stat Methodol. 1977, 39: 138.
 25.
Schwarz GE: Estimating the dimension of a model. Ann Stat. 1978, 6 (2): 461464. 10.1214/aos/1176344136.
 26.
Akaike H: A new look at the statistical model identification. IEEE Trans Autom Control. 1974, 19 (6): 716723. 10.1109/TAC.1974.1100705.
 27.
Jahn A, Hinselmann G, Rosenbaum L, Fechner N, Zell A: BoltzmannEnhanced Flexible AtomPair Kernel with Dynamic Dimension Reduction. Mol Inf. 2011, 30 (4): 307315. 10.1002/minf.201000120.
 28.
Jahn A, Hinselmann G, Fechner N, Zell A: Optimal assignment methods for ligandbased virtual screening. J Cheminf. 2009, 1: 1410.1186/17582946114.
 29.
Hinselmann G, Fechner N, Jahn A, Eckert M, Zell A: Graph kernels for chemical compounds using topological and threedimensional local atom pair environments. Neurocomputing. 2010, 74 (13): 219229. 10.1016/j.neucom.2010.03.008.
 30.
Fröhlich H, Wegner JK, Sieker F, Zell A: Optimal assignment kernels for attributed molecular graphs. ICML '05. Proceedings of the 22nd international conference on Machine learning. 2005, New York, NY, USA: ACM, 225232.
 31.
Fröhlich H, Wegner JK, Sieker F, Zell A: Kernel Functions for Attributed Molecular Graphs  A New SimilarityBased Approach to ADME Prediction in Classification and Regression. QSAR Comb Sci. 2006, 25 (4): 317326. 10.1002/qsar.200510135.
 32.
DUD  A Directory of Useful Decoys. [http://dud.docking.org]
 33.
Huang N, Shoichet BK, Irwin JJ: Benchmarking Sets for Molecular Docking. J Med Chem. 2006, 49 (23): 67896801. 10.1021/jm0608356.
 34.
Oprea TI, Davis AM, Teague SJ, Leeson PD: Is There a Difference between Leads and Drugs? A Historical Perspective. J Chem Inf Comput Sci. 2001, 41 (5): 13081315.
 35.
Good AC, Oprea TI: Optimization of CAMD Techniques 3. Virtual Screening Enrichment Studies: a Help or Hindrance in Tool Selection?. J ComputAided Mol Des. 2008, 22 (34): 169178. 10.1007/s1082200791672.
 36.
DUD Filtered  Leadlike filtered DUD. [http://dud.docking.org/jahn/]
 37.
van Drie JH: Pharmacophore Discovery  Lessons Learned. Curr Pharm Des. 2003, 9 (20): 16491664. 10.2174/1381612033454568.
 38.
Schrödinger, LLC: ConfGen. version 2.2, New York, NY. 2010
 39.
Watts KS, Dalal P, Murphy RB, Sherman W, Friesner RA, Shelley JC: ConfGen: A Conformational Search Method for Efficient Generation of Bioactive Conformers. J Chem Inf Model. 2010, 50 (4): 534546. 10.1021/ci100015j.
 40.
Chen IJ, Foloppe N: Druglike Bioactive Structures and Conformational Coverage with the LigPrep/ConfGen Suite: Comparison to Programs MOE and Catalyst. J Chem Inf Model. 2010, 50 (5): 822839. 10.1021/ci100026x.
 41.
von Korff M, Freyss J, Sander T: Comparison of Ligand and StructureBased Virtual Screening on the DUD Data Set. J Chem Inf Model. 2009, 49: 209231. 10.1021/ci800303k.
 42.
Mackey MD, Melville JL: Better than Random? The Chemotype Enrichment Problem. J Chem Inf Model. 2009, 49 (5): 11541162. 10.1021/ci8003978.
 43.
Truchon JF, Bayly CI: Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" Problem. J Chem Inf Model. 2007, 47 (2): 488508. 10.1021/ci600426e.
 44.
Fechner N, Jahn A, Hinselmann G, Zell A: Atomic Local Neighborhood Flexibility Incorporation into a Structured Similarity Measure for QSAR. J Chem Inf Model. 2009, 49 (3): 549560. 10.1021/ci800329r.
 45.
Cheeseright TJ, Mackey MD, Melville JL, Vinter JG: FieldScreen: Virtual Screening Using Molecular Fields. Application to the DUD Data Set. J Chem Inf Model. 2008, 48 (11): 21082117. 10.1021/ci800110p.
 46.
The Open Babel Package. [Version 2.2.3], [http://openbabel.org]
 47.
Muchmore SW, Souers AJ, AkritopoulouZanze I: The Use of ThreeDimensional Shape and Electrostatic Similarity Searching in the Identification of a MelaninConcentrating Hormone Receptor 1 Antagonist. Chem Biol Drug Des. 2006, 67 (2): 174176. 10.1111/j.17470285.2006.00341.x.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AJ designed and developed the main part of the 4D FAP_{OA}, has written the manuscript, participated in the design of the experiments and the discussion of the results. LR participated in the design of the experiments and the discussion of the results. GH contributed to the development of the 4D FAP_{OA}, participated in the design of the experiments and the discussion. AZ participated in the design of the 4DFAP_{OA}, the design of the experiments, and the discussion of the results.
Electronic supplementary material
Additional file 1: Archive of the 4D FAP_{OA} result files. This is a Gzip compressed Tar archive containing the result files of the 4D FAP_{OA} on the filtered DUD data sets. The result files are tabseparated plain text files including the following information: method name, active data set with size, cluster information and the distribution of the active molecules over the clusters, decoy data set with size, ratio active:decoy, AUC, awAUC, BEDROC scores for predefined alpha values as suggested by Truchon and Bayly [43], enrichment factors, relative enrichment factor [41], ROC enrichments, awROC enrichments at predefined false positive fractions, chemotype enrichment, ROC and awROC data points, and the ranking of each structure to compute other VS metrics. (GZ 2 MB)
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Virtual Screening
 Expectation Maximization Algorithm
 Conformational Space
 Optimal Assignment
 Conformational Ensemble