Ligand-based virtual screening experiments are an important task in the early drug discovery stage. An ambitious aim in each experiment is to disclose active structures based on new scaffolds. To perform these "scaffold-hoppings" for individual problems and targets, a plethora of different similarity methods based on diverse techniques were published in the last years. The optimal assignment approach on molecular graphs, a successful method in the field of quantitative structure-activity relationships, has not been tested as a ligand-based virtual screening method so far.
We evaluated two already published and two new optimal assignment methods on various data sets. To emphasize the "scaffold-hopping" ability, we used the information of chemotype clustering analyses in our evaluation metrics. Comparisons with literature results show an improved early recognition performance and comparable results over the complete data set. A new method based on two different assignment steps shows an increased "scaffold-hopping" behavior together with a good early recognition performance.
The presented methods show a good combination of chemotype discovery and enrichment of active structures. Additionally, the optimal assignment on molecular graphs has the advantage to investigate and interpret the mappings, allowing precise modifications of internal parameters of the similarity measure for specific targets. All methods have low computation times which make them applicable to screen large data sets.
Virtual screening (VS) approaches are a fundamental part of the drug discovery pipeline [1, 2]. The top-ranked structures of VS experiments are further analysed in biological assays to elucidate their activities. The methods for this purpose can be divided into structure-based and ligand-based VS techniques. Structure-based approaches, like DOCK [3, 4] or PLANTS , try to dock the ligand into the binding pocket of a target protein, which requires X-ray diffraction or nuclear magnetic resonance spectroscopy experiments to obtain three-dimensional coordinates of the protein structure. Additionally, there are doubts about the ability of docking approaches to predict the affinity or even the rank of the structures . In spite of these negative examples, there exists a remarkable list of successful structure-based VS stories .
Ligand-based methods operate only on one or several known active ligands. Therefore, ligand-based approaches are the method of choice if no protein structure is available. The underlying assumption of ligand-based approach is that similar structures have similar biological activity. Maggiora and Johnson  introduced the similarity-property principle that implies that the chemical similarity can be related to the biological activity of structures. Although Martin et al.  reported that the correlation between structural similarity and biological activity is not so strong, the correlation between activity and similarity is of course the result of the used similarity measure and varies between different methods. Thus, a wide variety of different similarity measures have been proposed in recent years [10, 11]. All methods can be divided into different classes concerning the type of information used to calculate the similarity between two structures. A simple and fast approach is to encode the information in fingerprints based on features that are included in the structure. Those topological or structural fingerprints, like the MACCS keys  or Daylight fingerprints , are often used but lack the ability to perform "scaffold-hoppings" . A more elaborate approach is based on molecular graphs in combination with maximum common substructure or maximum common edge subgraph isomorphism as used by the RASCAL algorithm or different reduced graph approaches [15–17]. Feature Trees are another reduced graph method which uses a reduced tree representation of important molecular fragments like hydrophobic parts or functional groups . The topological based MOLPRINT2D approach uses count vectors of atom types in different layers as molecular atom environment fingerprints and has shown to be useful descriptors for naïve Bayesian classifiers in VS experiments .
Similarity measures, based on three-dimensional coordinates, include geometrical information of arbitrary objects defined on structures. The type of an object varies between different methods and can be classified into three types: pharmacophores, molecular shapes or volumes, and molecular (interaction) fields. Pharmacophore based methods generate patterns of distances between predefined molecular properties like aromatic systems or hydrogen bond acceptors/donors [20, 21] and calculate a similarity value by a comparison of the corresponding patterns. Molecular shape or volume approaches try to maximize the overlap of shapes or volumes and determine a similarity value based on the overlap. Ballester et al. introduced a non-superposition comparison algorithm for molecular shapes, called Ultrafast Shape Recognition, and applied it in VS experiments . ROCS uses three-dimensional Gaussian functions to describe the volume of query structures and to screen databases in search of structures with similar shapes . Molecular interaction field methods like GRID  or CoMFA  were originally introduced in the field of quantitative structure-activity relationship modelling (QSAR) but are also suitable for VS experiments . FieldScreen, a recently published method by Cheeseright et al. , uses four different types of molecular fields and reduces these fields to the local maximal values. These values, referred to as "field points" , serve as the descriptors. This method has shown an improved "scaffold-hopping" behavior on various data sets and improved enrichment rates in comparison to DOCK.
The concept of an optimal assignment of arbitrary objects of molecular graphs was introduced in the field of cheminformatics by Fröhlich et al. [29, 30]. Just like CoMFA studies [25, 31], the optimal assignment on molecular graphs was used to derive QSAR and quantitative structure-property relationship models. Despite several proofs of the usability of this concept on several data sets and different problems [29, 30, 32–34], no experiments have been conducted to evaluate this idea in the field of ligand-based VS.
There are two objectives of this study. First, we want to evaluate the existing optimal assignment similarity measure and its flexibility extension on a set of different ligand-based VS experiments. Second, we introduce two new methods based on the optimal assignments. The experiments were designed with emphasis on the evaluation of the "scaffold-hopping" behavior of the methods. This was achieved by using recently suggested VS metrics based on chemotype clustering and extended receiver operating characteristics [35–41]. To benchmark the optimal assignment methods, we compared the results with FieldScreen, the 166 bit MACCS keys and DOCK.
The results show an improved early recognition performance for the optimal assignment methods in comparison to selected literature methods. The performance over the complete data set suffers from late retrievals of dissimilar chemotypes reducing the arithmetic weighted version of the area under the receiver operating characteristic curve values. One of the optimal assignment methods, which is based on two different assignment steps, shows an increased performance on both types of evaluation. The presented methods have low computation times; consequently, they can be applied to screen large databases.
In this section we first summarize the optimal assignment problem and give a general definition of the problem. Afterwards, a detailed description of the different methods which are evaluated in the scope of ligand-based VS, follows.
The optimal assignment problem is one of the basic discrete optimization problems. The fundamental task is to find the set of assignments that minimizes (or maximizes) the overall cost of the assignments. A mathematical description of the optimization problem is to find a minimum (or maximum) weight matching in a complete weighted bipartite graph. Given two disjoint sets of arbitrary objects X = (x1, x2,..., xn) and Y = (y1, y2,..., ym) with n ≥ m without loss of generality, the optimization problem can be defined as:
where π is a subset of the indices 1,..., n of size m describing the assigned objects of the set X and w(xπ(i), yi) is the weight (cost) of mapping the object xπ(i) onto yi. The Kuhn-Munkres algorithm (also called Hungarian Method) [42, 43] is one popular algorithm to solve this optimization problem.
Fröhlich et al. introduced the concept of the optimal assignment in the field of cheminformatics as a kernel function for attributed molecular graphs [29, 30]. Although this function is not a valid kernel function , it can be used as a similarity function based on molecular graphs.
Optimal Assignment Kernel
The idea of the Optimal Assignment Kernel (OAK) [29, 30] is to find a mapping of the atoms of the smaller molecule onto the atoms of the other molecule, maximizing the sum of the pairwise atom similarities. The first step of this approach is the calculation of all pairwise atom similarities. For this purpose, a radial basis function (RBF) calculates a similarity value based on physico-chemical descriptors of the atoms (Equation 2, where aiand biare the ith descriptor value of the atom a and b, σ2 represents the variance of a descriptor). To calculate the similarity between two atoms, the OAK uses 24 atom and 8 bond descriptors of the expert system of JOELib2 [45, 46]. For a complete list of the descriptors, we refer to Additional file 1 of this study.
To improve the chemical interpretation of the similarity calculation, the information of the topological neighbors and the bonds up to a predefined depth is integrated into the atom-wise similarity calculation. The integration is realized by a recursive atom-wise similarity calculation of the neighbors. In addition, a decay parameter reduces the influence of atoms with increased topological distance. Given two molecules A and B with atoms a1,..., anand b1,..., bmthe primal form of the optimal assignment problem is modified to Equation 3.
The result of Equation 3 is the sum of all atom-wise similarity values. Hence, the function yields higher results if the molecules have more atoms. To reduce the impact of the number of atoms on the final similarity value, the result of the optimal assignment is normalized by Equation 4 to a final value in the range of [0; 1].
An example of an optimal assignment on molecular graphs can be seen in Figure 1.
Optimal Assignment Kernel with Flexibility Extension
The original OAK computes the similarity of two molecules using the sets of atoms augmented with their local neighborhood. The idea of the OAK can also be applied to approximate the similarity of the conformational space of two molecules . This is achieved by breaking down the overall molecular flexibility into a set of local flexibilities similar to the local atom environments used by the OAK. The local flexibility is defined as the spatial positions on which the neighbors of a center (core) atom could be found depending on the length and angles of the bonds by which they are connected to the core (e.g. Figure 2). Only the flexibility of the second and third order neighbors are considered and the information is stored in the core atom.
The spatial positioning possibilities are expressed by using internal coordinate parameters like bond lengths and angles. These parameters are regarded as only depending on the hybridization of the atoms that connect the core with the neighbor. Non-bonding interactions like electrostatic and steric interactions, which would also influence the conformational space, are not considered. Ring bonds and non-single bonds are generally regarded as rigid. Therefore, the approach only approximates the real spatial positioning of the neighbor. The details of the parametrization of the local flexibility can be found in .
The overall similarity of the approximated conformational spaces is computed by augmenting the local atom similarity in the original OAK formulation with the similarity of the local flexibility (i.e. the parametrization of the spatial positioning of the neighbors) of the center atoms (OAKFLEX). The flexibility similarity between two atoms is the weighted sum of the flexibility similarities of the second and third order neighbors. The weighted sum is adjusted by an internal parameter. Additionally, the contribution of the flexibility similarity to the original OAK similarity is set by a second parameter. We used the first default parametrization as suggested by Fechner et al.  (OAK weight = 0.95, flexibility extension weight = 0.05, second order neighbor weight = 0.3, and third order neighbor weight = 0.7)
Two-Step Hierarchical Assignment
An accurate investigation of the assignments of the OAK discloses failures and shows that the OAK is not able to generate a substructure preserving assignment of the atoms in some cases. These topological errors result in an assignment that scatters the atoms of a substructure over the complete other molecule. The assignments maximize the overall similarity, but from a chemical point of view some mappings are problematic. An example of an OAK assignment with topological errors can be seen in Figure 3.
Our analysis uncovered that the occurrence of these errors is favored by the existence of multiple aromatic or condensed ring systems with a small number of substitutions and the presence of conjugated environments. To reduce the number of topological errors, we developed a two-step hierarchical assignment approach (2SHA). The basic idea is to perform two different assignments on different types of objects. The first assignment step operates on a substructure level and maps similar substructures of the molecules onto each other. For this purpose, a preprocessing step is necessary to identify condensed, aromatic, and conjugated systems (e.g. Figure 4). The ring detection is done with a modified biconnected component algorithm that operates on the molecular graph. The identification of the aromaticity and conjugated environments is done by the expert system of JOELib2 [45, 46]. To compute the similarity between two substructures, we use a modified version of the original OAK that computes a similarity score without performing the normalization as given in Equation 4. The advantage of omitting this step is a reduced computation time because the self-similarities Sim(A, A) and Sim(B, B) are not necessary and are not computed.
The algorithm computes the first optimal assignment calculation based on the substructure similarity values and maps one fragment of the smaller molecule on exactly one fragment of the other molecule. An example of such a mapping process is shown in Figure 4 on the two benzodiazepine derivatives used to explain the topological errors. The information of the fragment-based mapping is stored in each atom of a mapped fragment. This knowledge is used to establish constraints for the second assignment on the atomic level that reduces the frequency of topological errors.
An additional aim of this approach is the inclusion of geometrical information into the similarity calculation. To achieve this, we integrate an extra calculation step between the two assignment calculations. The initial preprocessing of the structures identifies aromatic systems, condensed rings, and conjugated environments, which are all rigid scaffolds of molecules. Our idea is to superimpose the rigid substructures which were mapped in the first assignment step. To realize the superposition, the substructures are treated as individual molecules and an algorithm based on quaternions, which represents an equivalent method to the well-known Kabsch algorithm, was used [47–49]. Note that the necessary information of matching atoms of the Kabsch algorithm is the result of the similarity calculation between the substructures. The reason for this is the optimal assignment of the atoms of the substructures to calculate the similarity. Although the algorithm calculates a superposition minimizing the root-mean-square deviation, our approach uses the three-dimensional atom coordinates and integrates the information into the second assignment step.
The second assignment is the final optimal assignment on the atomic level, which includes all information of the previous steps. Each atom has information about its integration into substructures and the mapping of these substructures. Considering this information, all atoms of the molecules can be divided into three different atom classes: First, atoms that are not part of a substructure, like linkers or side-chains. Second, atoms that are part of fragments which were not mapped in the first assignment. Third, atoms that are part of fragments which were mapped. Using these three classes we obtain six different unordered 2-element subsets. Each subset represents a different case of the atom-wise similarity calculation using different information and parameters. The case in which both atoms are part of a fragment and both fragments were not mapped cannot occur because of the nature of the optimal assignment problem. This reduces the number of different cases to five, which have to be regarded in the similarity calculation on the atomic level. However, the case in which both atoms are part of a fragment has to be divided into two cases, depending on whether the fragments were mapped onto each other or not. So we need to consider six cases overall. To include this discrimination, we integrated the following case differentiation into the original atom-wise similarity calculation:
Both atoms are not part of a fragment. Calculate the similarity using the original RBF of the OAK.
Both atoms are part of a fragment and the fragments were not mapped onto each other. This case is the main cause of topological errors. Therefore, the method has to penalize the similarity score to avoid a mapping of those atoms. The penalization is done by a decreased σ value in the RBF. This modification sharpens the RBF and reduces the overall score of the atom-wise similarity calculation.
Both atoms are part of a fragment and the fragments were mapped onto each other. The method makes use of the geometrical information of the superposition of the fragments. Given two atoms from the two superimposed fragments, the method calculates a vector based on the two atom coordinates and maps the vector into a two-dimensional plane using an isometric mapping. The initial and terminal points of the two-dimensional vector represent the coordinates of the atoms in the two-dimensional plane. An RBF is centered at each point using the van der Waals radius of the corresponding atom as σ value (σvdW). Equation 5 calculates the final geometrical similarity value for two mapped atoms a and b using two RBF and the half atomic distance as input. The result of this geometrical atom-wise similarity calculation is integrated into the original RBF of the OAK in the form of an additional numerical descriptor.
One atom is part of a fragment, but this fragment is not mapped. The other atom is not part of a fragment. In this case the method tries to map an atom of a rigid fragment onto a side-chain or linker atom. The fragment was not mapped in the first assignment step, but this mapping is also doubtful and has to be penalized. The technique for penalizing those mappings is the same as in the second case.
One atom is part of a fragment which is mapped. The other atom is not part of a fragment. This case is related to the previous one, but the fragment was mapped in the first assignment step. If the atom-wise similarity calculation results in a high similarity value, the possibility that the atom of the fragment is assigned to an atom of the corresponding fragment of the other molecule is reduced. Depending on the structure of the two molecules, this increases the possibility of topological errors. Therefore, the penalty has to be higher than in the fourth case.
Both atoms are part of a fragment, but only one fragment was mapped. This case is also related to the fourth and fifth one, but the mapping of two atoms of fragments is more reasonable from a chemical point of view. However, the risk of topological errors is also increased, and therefore we use the same penalization as in the fourth case.
The separate σ values of each case can be set individually and have a great impact on the final similarity value. The optimal parametrization depends on the structures of a data set. Based on a chemical interpretation of the different cases, we expected the following relations between the different σ values: σ2 <σ5 <σ6 ≤ σ4, where the indices represent the number of the case. We performed a grid search in the range [0.25, 0.5, ⋯, 10.0] to define the σ values for the penalization and to validate our hypothesis concerning the relation between the values. The result of the grid search yields the following values: σ2 = 2.5, σ5 = 5.0, σ6 = 4.0, and σ4 = 4.0. The empirical results of the grid search differ only in the relation between the σ values of the fifth and sixth case. Thus, the different cases of the 2SHA method show a correlation with the chemical interpretation.
The final result of the optimal assignment of the atoms, based on the atom-wise similarity calculations using the differentiation with the six cases, is shown in Figure 5 on the same molecules causing topological errors using the original OAK. The assignment preserves the mapping of the substructures and reduces the occurrence of topological errors. The assignment of the atom from the condensed ring system onto the nitrogen of the nitro group is penalized.
Optimal Local Atom Pair Environment Assignment
This variant of the optimal assignment approach uses local atom pair environments (OAAP). There are two fundamental types of atom pairs: topological and geometrical atom pairs. We employed the binned geometrical distance matrix between the three-dimensional coordinates of atoms i, j of a molecule in this study. The geometrical bin size b was set to 1 Å. Therefore, this method can be considered as a geometrical similarity measure. The computation time of D has a quadratic complexity with the number of atoms. The matrix is symmetric, so it is sufficient to compute the upper half of each matrix. The distance in the diagonal equals zero.
D is used as lookup table to store the information of all geometrical atom pairs from atom i in a trie. A trie is a prefix search tree that can be applied to patterns with a reading direction. In our case, we have a star-shaped local atom environment. At the root of the trie of atom i the hash code of the atomic symbol i is placed. Next, the patterns of the form
are inserted as ordered triplets. Note that symbol can be any atom type or fragment representation and hash(symbol) any suitable hash function hash : symbol → ℕ. By using a trie, the collection of patterns is non-redundant, because the trie is updated whenever a known part of a pattern is inserted. If the whole pattern is already contained in the trie, the count is incremented by one. For many similarity metrics, it is necessary to know basic properties of a trie, for example the total number of patterns or the number of unique patterns (number of leaves). Therefore, our trie implementation has a method to compute such properties. A geometrical local atom pair environment and the corresponding trie of the marked atom is shown in Figure 6.
The representation of the local atom environments as tries allows us to apply efficient recursive similarity computations. A well-known similarity metric for nominal features is the Tanimoto coefficient. The computation of the Tanimoto coefficient of two local atom environments is reduced to a comparison of two tries.
Let LA, LBbe the sets of local atom pair environments of two molecular graphs A, B and the tries i, j of the nominal features (atom pair environments of atoms i, j). Now, the Tanimoto coefficient can be defined as:
The overall molecular similarity is computed by the score of the optimal assignment of the local atom pair environments. The cost of computing the local similarity matrix is O(nml) for n, m atoms of the molecules A, B and l leaves of the larger trie. The lookup for a pattern has a constant computation time, because the depth of the tries is fixed.
We computed the OAAP similarity between two structures on single low-energy conformations. In spite that, it is possible to extend this method to conformational ensembles. This could be accomplished by simply averaging the distances as a case in point. The approach was implemented using the Chemistry Development Kit (CDK) library [50, 51].
Parameters and Computation Time
All presented methods have internal parameters, which allow modifications of the similarity measures. These modifications can improve the results on a specific problem and data set. The optimization of the parameters is a complex process making extensive methods like grid searches necessary. However, one aim of this work was to evaluate the usability of optimal assignment methods on molecular graphs for ligand-based VS experiments. Therefore, we used the described standard parameters for each method, which were constant for all data sets to obtain comparable results of the overall performance.
The computation time of the methods depends on the number of atoms, which have to be mapped in the assignment step. In addition, each method has its own type of preprocessing and mining the chemical information. This yields differences in the performance depending on the data set. Instead of giving a complete overview of the computation time for each method and data set, we report an averaged computation time, as it can be expected for drug-like structures, for each method. The original OAK has an averaged performance of 27.34 ± 3.40 similarity calculations per second. The flexibility extension OAKFLEX yields 41.03 ± 7.32 calculations per second. This speed-up of the OAKFLEX is the result of a reduction of the used features. The OAAP method achieves a performance of 51.49 ± 18.07 similarity calculations per second (computation time for the atom typing is not included). The 2SHA method has the highest computation time, because of the two assignment steps and the superposition of the fragments. However, the approach is capable to perform 14.04 ± 1.78 calculations per second. All values were measured on a Intel Core2Duo CPU with 2 GHz using one core and 1 GB memory, Java 1.6.0_07, Ubuntu 8.04.3, Linux kernel 2.6.24.
The setup of the VS experiments is based on the work of Cheeseright et al. in which the FieldScreen approach is introduced and compared against a docking algorithm. We decided to use this workflow to create a common setup of the data sets with the objective to achieve comparable results.
Data Sets and Preparation
All ligand-based VS experiments were performed using a modified version of the Directory of Useful Decoys (DUD) Release 2 [52, 53]. The DUD contains known actives and mimetic  decoys for 40 target proteins. This collection of data sets was compiled to serve as an unbiased community benchmark database for the evaluation of docking algorithms. Therefore, the original version of the DUD is not suited for ligand-based VS experiments . A modification of the active structures of the DUD database, suggested by Good and Oprea , solves this problem by applying a lead-like filter  and a cluster algorithm . The clustering identifies analogue structures and can be used to reduce the bias of analogue or trivial enrichment. Additionally, each cluster can be seen as an individual chemotype and allows the analysis of the "scaffold-hopping" behavior of the methods.
The active structures for the VS experiments were prepared as follows. We obtained the modified active structures of Good and Oprea from the DUD site . These data sets contain only topological information about the structures. Thus, we used CORINA3D  to generate initial geometrical seed structures. Those were refined with MacroModel 9.6  using the OPLS 2005 force field and the limited Broyden-Fletcher-Goldfarb-Shanno optimization method with 5000 iterations and a gradient criterion of 0.0001 RMSD for the atomic movement between two iterations. The final data sets were used as active structures.
The original decoys of the DUD release 2 were obtained from the DUD site . Using these decoys, together with the filtered actives from Good and Oprea, would introduce a bias of an artificial enrichment based on the distinction of physical property values. To remove this artificial enrichment bias, we applied the same lead-like filter, used by Good and Oprea to filter the active structures, on the decoys. Thus, the AlogP value and the molecular weight was calculated for each structure with dragonX 1.4 . All structures with an AlogP value > = 4.5 (5.5 for nuclear hormone receptor data sets) or a molecular weight were removed. The initial three-dimensional coordinates were further optimized using the same configuration of MacroModel as for the active structures.
The described preparation protocol was applied on all 40 data sets of the DUD. The SD files of the prepared actives and decoys are directly obtainable from the DUD site .
To obtain results that are not biased by a low number of chemotypes, it is advisable to use data sets with a sufficient number of chemotypes. Therefore, we used the same subset of DUD targets as used by Cheeseright et al. This subset has a minimum of 17 and a maximum of 44 chemotypes. A comprehensive overview of the data sets can be seen in Table 1.
Ligand-based VS methods need a biologically active structure that serves as a query in the experiment. In the work of Huang et al.  complexed crystal structures were used to identify the binding sites for the docking algorithm. Cheeseright et al.  used the ligands of the same complexed crystal structures as query structures to evaluate the FieldScreen approach. To allow a comparison with FieldScreen, we extracted the same bounded ligands from the protein data bank [59, 60] and corrected the bonds lengths with MacroModel 9.6. These structures serve as queries for the VS experiments in this study.
Evaluation of VS Performance
In the following section we describe the metrics for the evaluation of VS experiments used in our work. Additionally, we explain why we used these metrics and show their correlation to well-known established metrics.
The evaluation and comparison of VS performance is an important but also an error-prone process [39, 40, 61]. In recent years, a plethora of evaluation metrics have been published, each metric with its strengths and weaknesses. For an overview, we refer to the work of Kirchmair et al.  There is no standard protocol for the analysis and publication of VS results. Thus, the comparison of different works is a challenging task. It is necessary to analyze the results, regarding three different aspects, to characterize the performance of a VS method. The first aspect is the early recognition problem and originates from real-world screening applications, where only top ranked molecules were selected for testing in biological assays because of high costs and expenditure of time. Truchon and Bayly developed for this purpose the BEDROC score , which uses a decreasing exponential weighting function. The drawback of this metric is its lack of interpretability and the dependency of an extrinsic variable . To resolve these problems, Jain and Nicholls suggested to report the enrichment values at predefined false positive fractions . These values can be obtained by dividing the sensitivity value by the fraction of false positives. Equation 7 represents the interrelation between the ROC value and the ROC enrichment for a predefined false positive fraction. and is the number of true positives (actives retrieved) and false positives (decoys retrieved), respectively, found in the range containing a false positive rate of X%. TP, TN, FP, and FN are the entries of the confusion matrix ranking X% false positives.
We decided to report the ROC enrichments at false positive rates of 0.5%, 1.0%, 2%, and 5%, as suggested by Jain and Nicholls .
The second aspect of characterizing the performance of VS methods is the evaluation of the performance considering the complete data set. This overall performance can be visualized plotting the receiver operating characteristic (ROC), which should always be part of VS results [39, 64]. However, a comparison of many ROC plots is not straightforward, thus we report the area under the ROC curve to provide a better overview of the results. Equation 8 shows the calculation of the area under the ROC curve (AUC), where Nactives and Ndecoys are the numbers of active and decoy structures, respectively is the number of decoys that are higher ranked than the ith active structure.
The last point of characterizing the performance of VS methods is the evaluation of the retrieval of new scaffolds. As already mentioned, ligand-based VS suffers from the bias caused by enrichments of analog structures. To reduce the influence of this overestimated enrichments on the result metrics, Clark and Webster-Clark recommended an adaption of the ROC and AUC calculation . They proposed an arithmetic and a harmonic weighting scheme to reduce the influence of structurally similar structures. Using the arithmetic weighting, each structure has a weight that is inversely proportional to the size of the cluster it belongs to. Therefore, the weight of all structures taken from one cluster is equal. This leads to the equation: , where wijis the weight of the ith structure taken from the jth cluster with Njstructures. As a result of this, the true positive value of the sensitivity is no more the number of true active structures seen in a predefined fraction of the data set. Instead, it is the sum of the weights resulting in the equation The value of is 1 if the ith structure of the jth cluster is contained in the fraction of the data set, otherwise it is 0. Nclustersis the number of clusters in the data set and Njis the number of structures in the jth cluster.
The harmonic weighting scheme uses weights representing the ranks of the active structures within each cluster. Top-ranked structures of a cluster have a higher weight, motivated by the idea that those structures have the highest information content. The weight of the structures is defined as , where i is the ith structure of the jth cluster in decreasing rank order. A recently published analysis of the weighting schemes conducted by Mackey and Melville discloses that the arithmetic weighting scheme has better properties and is more robust than the harmonic scheme . Therefore, we decided to use the arithmetic weighting scheme. Integrating this scheme into the basic ROC enrichment leads to an arithmetic weighted version (awROC enrichment) given by Equation 9.
This weighting scheme can also be embedded into the AUC calculation. The modified version (awAUC) can be seen in Equation 10.
To account for intrinsic variances  of the active and decoy structures, we bootstrapped the data sets and approximated the error of the awROC enrichments and awAUC. We performed 25000 iterations removing randomly 20% of the data set. The standard deviation of the 25000 metric results were used as an error of the result using the complete data set.
Comparison against Other Methods
The docking study by Huang et al.  provides the energy scores of the DOCK algorithm for each target of the DUD. We used these rankings to calculate the awROC enrichments and awAUC on the same data sets as our approaches. Hence, extensive properties such as active to decoy ratio and data set size are identical and permit a comparison of the results. A direct comparison of docking algorithms with ligand-based VS tools is critical because of the different techniques and information included in the methods. We calculated the results of the docking algorithm with the objective to provide an impression of the performance of the DOCK algorithm using the awROC and awAUC evaluation metrics.
We considered the results of the FieldScreen approach to include a sophisticated ligand-based method. To reduce the information content solely to the information obtained from the query structure, the results without the excluded volume of the protein were used. Although the data set sizes and active to decoy ratios of the FieldScreen results deviate from our setup, the use of awROC enrichments ensures the comparability because the ROC enrichment is independent from extensive quantities .
Introducing a new method to a well-established field, like ligand-based VS, it is necessary to justify it by comparing the results to a common approach. To meet these requirements, we compared our approaches to the 166 bit MACCS keys  using the Tanimoto coefficient to calculate a similarity value. Although the results of the MACCS keys are inferior in comparison to other fingerprints  we provide the results of the MACCS keys to give a baseline for the performance on the data sets. The MACCS keys were calculated using the CDK 1.1.5 [50, 51].
Results and Discussion
The results of the four optimal assignment based methods and the three comparison approaches are organized as follows. The awROC enrichments for each target at a predefined false positive fraction of 0.5%, 1.0%, 2.0%, and 5.0% can be seen in the Tables 2, 3, 4 and 5. The awAUC values are compiled in Table 6. Each value in the Tables 2, 3, 4, 5 and 6 has a standard deviation, which is the result of the error approximation using the bootstrapping approach. The bold value in each row indicates the best result concerning the data set and evaluation metric. The last row in each Table contains the rank of each method averaged over all data sets. These values describe the overall performance of one method considering the results of all other methods.
In addition to the already mentioned VS metrics, Additional file 2 contains the complete ranking of all structures and several VS metrics for each method and data set.
Virtual Screening Results
The results of Table 2 show the awROC enrichments at a false positive rate of 0.5%, which implies that the early recognition problem is focused in this evaluation. The best overall performance is achieved by the (OAAP). The results of the original OAK and the flexibility extension are similar and only marginally worse than the OAAP. The 2SHA approach obtains an average rank that lies in between the other optimal assignment methods and FieldScreen. The MACCS keys show an increased performance on the fxa and vegfr2 data set, but the overall performance is lower than all other methods except the DOCK algorithm. The ranking of the methods at higher false positive fractions changes, but there is a tendency observable. The 2SHA yields the best average rank at 1%, 2%, and 5%. Additionally, the average rank decreases with an increasing false positive fraction. FieldScreen shows a similar behavior resulting in comparable results at 5% in comparison to OAK and OAAP. The average ranks of the MACCS keys, OAK, OAKFLEX, and OAAP increased with higher false positive fractions by a value of ≈ 0.7. The average rank of the DOCK algorithm alternates between 6.54 and 5.54, showing no dependency on the false positive rate.
The increasing performance of the FieldScreen and 2SHA approach can be explained by the ability to perform "scaffold-hoppings". The other optimal assignment methods have a good early enrichment as seen in Table 2, but these enrichment values are the result of retrievals of chemical similar scaffolds with respect to the search query. Those methods are not able to hold up the high retrieval rates of the chemotypes and the discovery of new chemotypes stagnates. The influence of this bias is reduced with increasing false positive fraction and the more robust methods became apparent. Regarding the technique of FieldScreen, the improved "scaffold-hopping" behavior is probably the result of the conformational sampling together with the similarity calculation which is based on molecular fields. The 2SHA method yields a similar effect by the identification of rigid scaffolds, their assignment and superposition. This procedure can be seen as an inherent heuristic of a flexible alignment of the search query and the screening structures. The method assumes that the linker between two fragments has the flexibility to perform the translation and rotation which is needed for the superposition of the assigned fragments. Additionally, this assumes a comparable length of the linkers in both molecules. Considering these facts, the inherent flexible alignment of the method is a rough approximation, but it increases the ability to do "scaffold-hoppings" without the use of a time-consuming conformational sampling.
A comparison of the OAK with its flexibility extension shows a comparable overall performance at 0.5%. The gap between the methods increases at higher false positive fractions concerning the average rank. In contrast to the average rank, the difference on each data set is marginal with an advantage for the OAK.
As a result, the OAK obtains lower ranks and the difference of the average rank increases. Major differences regarding the performance on each data set can only be seen at 0.5% false positive rate. Surprising is the improved performance of the OAKFLEX on the cox2 data set. The inhibitors of the cox2 are usually rigid scaffolds consisting of two or three ring systems. Hence, there should not be a performance gain using an approach that integrates the local flexibility of the atom neighborhood. The active structures of the cox2 data set have terminal substitutions with a depth of two or three bonds. These substitutions form the flexible part of the molecules and have an optimal depth for the local flexibility approach of the OAKFLEX. The flexibility information of the terminal substitutions is stored in the atom which is part of the ring system. Therefore, the method has an improved ability to distinguish different substitution stages like ortho, meta and para substitutions. Additionally, the existence of rigid and flexible parts with an ideal length yields a better distinction between rigid and flexible parts of a molecule. The last point was already observed in QSAR experiments reported by Fechner et al.  The active structures of the egfr data set contain also a rigid basic scaffold with various substitutions among the molecules. Based on the previous observation, the OAKFLEX should attain a higher awROC enrichment as the OAK. Obviously, it is not the case as it can be seen in Table 2. This contradictory result can be explained by the chemical nature of the search query. Erlotinib has two flexible substitutions each with five heavy atoms forming a chain. These substitutions increase the number of rotatable bonds to 11. The average number of rotatable bonds in the active data set of egfr is 4.07. For that reason, the OAKFLEX has difficulties to map the 10 flexible atoms of the search query onto the atoms of the active structures. These mappings reduce the overall similarity and performance on the egfr data set. An evidence for this hypothesis is the result on the vegfr2 data set. The search query of the vegfr2 data set has five rotatable bonds, whereas the actives have a mean value of 4.2. Therefore, the penalty of the OAKFLEX mapping is reduced and results in a higher enrichment rate, because of the different flexibility properties. The OAAP method achieves good results on the cox2, cdk2, and pdgfrb data set. These results are a consequence of the binning approach on the geometrical distances. The binning represents a fuzzy geometrical distance measure that allows minor changes in the basic scaffold of a structure without reducing the similarity value. This increases the ability to perform "scaffold-hoppings" to related chemotypes with a similar basic structure.
The awROC enrichment of the 2SHA algorithm at 0.5% on the egfr data set is remarkable. The enrichment on this data set outperforms the results of the other methods. To elucidate this result, we analysed the similarity calculations of the 2SHA on the egfr structures in detail. The active structures have a common property that favors the computation technique of the 2SHA. Each active cluster has a so-called parent molecule. This molecule is the smallest molecule of the cluster regarding the number of heavy atoms . We compared these parent molecules and identified a common basic scaffold, which is included in 32 of 40 clusters and in the search query. Figure 7 shows four examples of this basic scaffold, which consists of a ring system and a condensed heterocyclic system with two or three rings. These systems are aromatic in the query structure and in 25 out of 32 clusters. The 2SHA method identifies the aromatic systems and maps the corresponding systems onto each other in the first assignment step. The superposition of the fragments yields minimal distances of the mapped atoms, because of the planarity of aromatic systems. This results in a high similarity score based on this common basic scaffold and is an explanation of the high awROC enrichment value at 0.5%. Our hypothesis confirms the results of the 2SHA on egfr at higher false positive rates. The method still has the highest enrichment rates, but in comparison to the other methods the difference is significantly reduced.
The awAUC values of Table 6 represent the performance of each method using the complete data set. The 2SHA approach yields the best result with an average rank of 3.08. The FieldScreen method performs better than the remaining optimal assignment techniques OAAP, OAK, and OAKFLEX. The DOCK algorithm improves the results and obtained an average rank of 4.27 and achieves a better result than the MACCS keys. Table 6 confirms with the tendency over the awROC enrichments and assigns the best "scaffold-hopping" performance to the 2SHA and FieldScreen. The reduced performance of the other optimal assignment methods can be explained by the results of the chemotype enrichment. Figure 8 shows the chemotype enrichment of the four optimal assignment methods, the MACCS keys, and the random performance on the p38 data set. All optimal assignment methods have a comparable chemotype retrieval until ≈ 50% of the chemotypes are retrieved, which is consistent with the results of the Tables 2, 3, 4, 5. From that point on, the retrieval of the OAK, OAKFLEX, and OAAP stagnates and their chemotype enrichment is reduced to the level of the random performance. The reduced enrichment performance of those three optimal assignment methods at higher false positive rates explains the improved results of the FieldScreen approach regarding the awAUC values. Only the 2SHA approach has the ability to achieve a higher generalization of the chemotypes which results in an increased retrieval rate over the complete data set.
Correlation of Optimal Assignment Methods
The presented optimal assignment methods are based on the same functional principle. On the one hand, they use different representations and algorithms, but on the other hand, the final similarity computation between two molecules is the result of the assignment algorithm. Hence, it is possible that the two new optimal assignment methods (2SHA and OAAP) are only marginally different compared to the two existing methods (OAK and OAKFLEX). The results in the previous section are an evidence that there are differences in the rankings of active and inactive structures. Another interesting question is the order of the chemotype discovery of the individual methods. Different sequences of the chemotype retrieval indicate that the methods explore the chemspace in various directions. Therefore, we analyzed the sequences of all optimal assignment methods on all data sets. We calculated the Kendall's τ rank correlation coefficient for each data set to obtain a measure for the correlation of the chemotype retrieval between two methods. This results in 13 correlation coefficients for each pair of methods. The results are illustrated in Figure 9 in form of box plots.
The correlation analysis shows that the 2SHA and OAAP approach have the lowest median of the correlation coefficients. This is the result of a different calculation process to determine the similarity between two atoms. The increased correlation between 2SHA and OAKFLEX as well as between 2SHA and OAK can be explained by a common subset of descriptors and the fact that the similarity between two atoms in all three methods is calculated by a RBF. 2SHA has a higher correlation with the OAKFLEX than with the OAK. This indicates the already mentioned inherent flexibility consideration of the method. In contrast to the 2SHA, the OAAP method shows a superior correlation to the OAK method without flexibility information. The strong correlation between OAK and OAKFLEX is not surprising because the OAKFLEX is an extension of the OAK and the standard parametrization of the OAKFLEX uses a weight of 0.95 for the OAK similarity .
To further analyse the optimal assignment methods and their characteristic behaviour in chemspace, we conducted the same experiment and evaluated the correlation between the optimal assignment methods and DOCK or the MACCS keys. The results can be seen in Figure 10.
The results show that the chemotype retrieval sequence of the optimal assignment methods has no correlation with the sequence retrieved by DOCK. The correlation to the methods considering flexibility information (2SHA and OAKFLEX) is marginal increased in comparison to the other methods. This is probably the result of the flexible-ligand method that is used by DOCK allowing the incorporation of the flexibility . The overall low correlation is not surprising given the fundamental differences of ligand-based and docking approaches. The correlation coefficients between the optimal assignment methods and the MACCS keys are comparable to the coefficients in the previous experiment between two optimal assignment methods. From these results, it follows that the similarity of the chemotype retrieval sequence between the MACCS keys and an optimal assignment method is comparable to the similarity of two optimal assignment methods. Therefore, the optimal assignment methods and the MACCS keys explore the chemspace in a comparable direction, but the VS results show that each method has its strengths on different data sets.
The enrichment results of the DOCK algorithm in the Tables 2, 3, 4, 5, 6 show an inferior performance in comparison to the other methods. The findings of the chemotype retrieval sequence disclose an interesting property of DOCK. The uncorrelated sequences demonstrate that DOCK explores the chemspace in an orthogonal manner with respect to the other methods. Accordingly, DOCK retrieves different chemotypes that will only be found by the optimal assignment methods and the MACCS keys if a large part of the data set is ranked.
We have introduced the concept of the optimal assignment approach in the field of ligand-based VS. We presented two new optimal assignment methods. The OAAP method is based on geometrical distance atom pairs. The 2SHA computes two assignment steps. Each method uses the optimal assignment approach on different types of objects showing the wide field of application of the approach. Another advantage is the interpretability of the mappings using visualizations as shown in Figure 1.
We evaluated the methods on 13 modified DUD data sets covering a wide range of different molecules and chemotypes. In order to grade the results of the optimal assignment approaches we compared the results with a state-of-the-art ligand-based VS method that uses a conformational sampling and molecular fields. Additionally, we provided the results of the 166 bit MACCS keys in combination with the Tanimoto coefficient and the performance of the DOCK algorithm on the same data sets. The results show an improved early enrichment performance of all optimal assignment methods. Analyses show that these early enrichments are the results of the retrievals of chemical similar chemotypes with respect to the search query. With increasing false positive rates this bias is reduced and the gap between the optimal assignment methods and the literature results is successively closed. Only the 2SHA approach has the ability to perform the necessary "scaffold-hoppings" to maintain a robust enrichment and obtained the overall best results. The calculations of the 2SHA method approximate an implicit flexible alignment of the substructures and enables the retrieval of chemotypes that have larger distances to the query in chemspace. Further research will be spent on the 2SHA method in combination with multiple query screenings. The fragmentation and the two assignment steps enable a substructure based data fusion on the fragment level.
The first assignment step can disclose fragment classes with similar fragments of two or more search queries. These classes can be used to enumerate all combinations of the fragments resulting in an increased chemotype coverage. This procedure can further improve the "scaffold-hopping" performance and contribute to the robustness of the method.
The presented methods are fast enough to screen more than one million structures within one day on a single core CPU. This high throughput performance qualifies the presented methods to perform screenings on large databases with the aim to select relevant subsets for a given problem. These subsets can be used in combination with biological screenings or more time-consuming docking algorithms.
Bajorath J: Integration of virtual and high-throughput screening. Nat Rev Drug Discov. 2002, 1 (11): 882-894. 10.1038/nrd941.
Bender A, Jenkins JL, Scheiber J, Sukuru SC, Glick M, Davies JW: How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space. J Chem Inf Model. 2009, 49: 108-119. 10.1021/ci800249s.
Good AC, Hermsmeier MA, Hindle S: Measuring CAMD Technique Performance: A Virtual Screening Case Study in the Design of Validation Experiments. J Comput-Aided Mol Des. 2004, 18 (7): 529-536. 10.1007/s10822-004-4067-1.
Raymond JW, Willett P: Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. J Comput-Aided Mol Des. 2002, 16: 59-71. 10.1023/A:1016387816342.
Rush TS, Grant JA, Mosyak L, Nicholls A: A Shape-Based 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein-Protein Interaction. J Med Chem. 2005, 48 (5): 1489-1495. 10.1021/jm040163o.
Cramer RD, Patterson DE, Bunce JD: Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J Am Chem Soc. 1988, 110 (18): 5959-5967. 10.1021/ja00226a005.
Fröhlich H, Wegner JK, Sieker F, Zell A: Optimal assignment kernels for attributed molecular graphs. ICML '05: Proceedings of the 22nd international conference on Machine learning. 2005, New York, NY, USA: ACM, 225-232.
Hinselmann G, Jahn A, Fechner N, Zell A: Chronic Rat Toxicity Prediction of Chemical Compounds Using Kernel Machines. Lecture Notes in Computer Science (EvoBIO 2009). Edited by: Pizutti C, Ritchie M, Giacobini M. 2009, Springer-Verlag Berlin Heidelberg, 5483: 25-36.
Good AC, Oprea TI: Optimization of CAMD Techniques 3. Virtual Screening Enrichment Studies: a Help or Hindrance in Tool Selection?. J Comput-Aided Mol Des. 2008, 22 (3–4): 169-178. 10.1007/s10822-007-9167-2.
Coutsias EA, Seok C, Dill KA: Rotational superposition and least squares: the SVD and quaternions approaches yield identical results. Reply to the preceding comment by G. Kneller. J Comput Chem. 2005, 26 (15): 1663-1665. 10.1002/jcc.20316.
Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci. 2003, 43 (2): 493-500.
Triballeau N, Acher F, Brabet I, Pin JP, Bertrand HO: Virtual Screening Workflow Development Guided by the "Receiver Operating Characteristic" Curve Approach. Application to High-Throughput Docking on Metabotropic Glutamate Receptor Subtype 4. J Med Chem. 2005, 48 (7): 2534-2547. 10.1021/jm049092j.
Muchmore SW, Debe DA, Metz JT, Brown SP, Martin YC, Hajduk PJ: Application of belief theory to similarity data fusion for use in analog searching and lead hopping. J Chem Inf Model. 2008, 48 (5): 941-948. 10.1021/ci7004498.
The authors want to thank John J. Irwin and Brian Shoichet from the University of California, San Francisco for adding the filtered and optimized DUD data sets to the official DUD site . This work has been partially funded by Nycomed GmbH, Konstanz, Germany.
Authors and Affiliations
University of Tübingen, Center for Bioinformatics Tübingen (ZBIT), Sand 1, 72076, Tübingen, Germany
Andreas Jahn, Georg Hinselmann, Nikolas Fechner & Andreas Zell
The authors declare that they have no competing interests.
AJ filtered and prepared the DUD data sets, developed the two-step hierarchical assignment approach and the flexibility extension of the Optimal Assignment Kernel, designed the experimental setup, performed the virtual screening experiments, performed the correlation experiments of the methods, drafted the manuscript, and participated in the discussion of the results. GH developed the optimal assignment based on atom pairs, helped to draft the manuscript, and participated in the discussion of the results. NF helped to develop the two-step hierarchical assignment approach and the flexibility extension of the Optimal Assignment Kernel, participated in drafting the manuscript, and participated in the discussion of the results. AZ participated in the development of the original Optimal Assignment Kernel, participated in the design of the experimental setup, and participated in the discussion of the results. All authors read and approved the final manuscript.
Additional file 2: Archive of the result files. The file Results.tar.gz is a Gzip compressed Tar archive containing the results file of each method for each data set. The result files are tab-separated plain text files including the following information: method name, active data set with size, cluster information and the distribution of the molecules over the clusters, decoy data set with size, ratio active:decoy, AUC, awAUC, BEDROC scores for predefined alpha values as suggested by Truchon and Bayly , relative enrichment factor , ROC enrichments, awROC enrichments at predefined false positive fractions, chemotype enrichment, ROC data points, and the ranking of each structure. (ZIP 6 MB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.