In this section we first summarize the optimal assignment problem and give a general definition of the problem. Afterwards, a detailed description of the different methods which are evaluated in the scope of ligandbased VS, follows.
The optimal assignment problem is one of the basic discrete optimization problems. The fundamental task is to find the set of assignments that minimizes (or maximizes) the overall cost of the assignments. A mathematical description of the optimization problem is to find a minimum (or maximum) weight matching in a complete weighted bipartite graph. Given two disjoint sets of arbitrary objects X = (x_{1}, x_{2},..., x_{
n
}) and Y = (y_{1}, y_{2},..., y_{
m
}) with n ≥ m without loss of generality, the optimization problem can be defined as:
where π is a subset of the indices 1,..., n of size m describing the assigned objects of the set X and w(x_{π(i)}, y_{
i
}) is the weight (cost) of mapping the object x_{π(i) }onto y_{
i
}. The KuhnMunkres algorithm (also called Hungarian Method) [42, 43] is one popular algorithm to solve this optimization problem.
Fröhlich et al. introduced the concept of the optimal assignment in the field of cheminformatics as a kernel function for attributed molecular graphs [29, 30]. Although this function is not a valid kernel function [44], it can be used as a similarity function based on molecular graphs.
Optimal Assignment Kernel
The idea of the Optimal Assignment Kernel (OAK) [29, 30] is to find a mapping of the atoms of the smaller molecule onto the atoms of the other molecule, maximizing the sum of the pairwise atom similarities. The first step of this approach is the calculation of all pairwise atom similarities. For this purpose, a radial basis function (RBF) calculates a similarity value based on physicochemical descriptors of the atoms (Equation 2, where a_{
i
}and b_{
i
}are the ith descriptor value of the atom a and b, σ^{2} represents the variance of a descriptor). To calculate the similarity between two atoms, the OAK uses 24 atom and 8 bond descriptors of the expert system of JOELib2 [45, 46]. For a complete list of the descriptors, we refer to Additional file 1 of this study.
To improve the chemical interpretation of the similarity calculation, the information of the topological neighbors and the bonds up to a predefined depth is integrated into the atomwise similarity calculation. The integration is realized by a recursive atomwise similarity calculation of the neighbors. In addition, a decay parameter reduces the influence of atoms with increased topological distance. Given two molecules A and B with atoms a_{1},..., a_{
n
}and b_{1},..., b_{
m
}the primal form of the optimal assignment problem is modified to Equation 3.
The result of Equation 3 is the sum of all atomwise similarity values. Hence, the function yields higher results if the molecules have more atoms. To reduce the impact of the number of atoms on the final similarity value, the result of the optimal assignment is normalized by Equation 4 to a final value in the range of [0; 1].
An example of an optimal assignment on molecular graphs can be seen in Figure 1.
Optimal Assignment Kernel with Flexibility Extension
The original OAK computes the similarity of two molecules using the sets of atoms augmented with their local neighborhood. The idea of the OAK can also be applied to approximate the similarity of the conformational space of two molecules [34]. This is achieved by breaking down the overall molecular flexibility into a set of local flexibilities similar to the local atom environments used by the OAK. The local flexibility is defined as the spatial positions on which the neighbors of a center (core) atom could be found depending on the length and angles of the bonds by which they are connected to the core (e.g. Figure 2). Only the flexibility of the second and third order neighbors are considered and the information is stored in the core atom.
The spatial positioning possibilities are expressed by using internal coordinate parameters like bond lengths and angles. These parameters are regarded as only depending on the hybridization of the atoms that connect the core with the neighbor. Nonbonding interactions like electrostatic and steric interactions, which would also influence the conformational space, are not considered. Ring bonds and nonsingle bonds are generally regarded as rigid. Therefore, the approach only approximates the real spatial positioning of the neighbor. The details of the parametrization of the local flexibility can be found in [34].
The overall similarity of the approximated conformational spaces is computed by augmenting the local atom similarity in the original OAK formulation with the similarity of the local flexibility (i.e. the parametrization of the spatial positioning of the neighbors) of the center atoms (OAK_{FLEX}). The flexibility similarity between two atoms is the weighted sum of the flexibility similarities of the second and third order neighbors. The weighted sum is adjusted by an internal parameter. Additionally, the contribution of the flexibility similarity to the original OAK similarity is set by a second parameter. We used the first default parametrization as suggested by Fechner et al. [34] (OAK weight = 0.95, flexibility extension weight = 0.05, second order neighbor weight = 0.3, and third order neighbor weight = 0.7)
TwoStep Hierarchical Assignment
An accurate investigation of the assignments of the OAK discloses failures and shows that the OAK is not able to generate a substructure preserving assignment of the atoms in some cases. These topological errors result in an assignment that scatters the atoms of a substructure over the complete other molecule. The assignments maximize the overall similarity, but from a chemical point of view some mappings are problematic. An example of an OAK assignment with topological errors can be seen in Figure 3.
Our analysis uncovered that the occurrence of these errors is favored by the existence of multiple aromatic or condensed ring systems with a small number of substitutions and the presence of conjugated environments. To reduce the number of topological errors, we developed a twostep hierarchical assignment approach (2SHA). The basic idea is to perform two different assignments on different types of objects. The first assignment step operates on a substructure level and maps similar substructures of the molecules onto each other. For this purpose, a preprocessing step is necessary to identify condensed, aromatic, and conjugated systems (e.g. Figure 4). The ring detection is done with a modified biconnected component algorithm that operates on the molecular graph. The identification of the aromaticity and conjugated environments is done by the expert system of JOELib2 [45, 46]. To compute the similarity between two substructures, we use a modified version of the original OAK that computes a similarity score without performing the normalization as given in Equation 4. The advantage of omitting this step is a reduced computation time because the selfsimilarities Sim(A, A) and Sim(B, B) are not necessary and are not computed.
The algorithm computes the first optimal assignment calculation based on the substructure similarity values and maps one fragment of the smaller molecule on exactly one fragment of the other molecule. An example of such a mapping process is shown in Figure 4 on the two benzodiazepine derivatives used to explain the topological errors. The information of the fragmentbased mapping is stored in each atom of a mapped fragment. This knowledge is used to establish constraints for the second assignment on the atomic level that reduces the frequency of topological errors.
An additional aim of this approach is the inclusion of geometrical information into the similarity calculation. To achieve this, we integrate an extra calculation step between the two assignment calculations. The initial preprocessing of the structures identifies aromatic systems, condensed rings, and conjugated environments, which are all rigid scaffolds of molecules. Our idea is to superimpose the rigid substructures which were mapped in the first assignment step. To realize the superposition, the substructures are treated as individual molecules and an algorithm based on quaternions, which represents an equivalent method to the wellknown Kabsch algorithm, was used [47–49]. Note that the necessary information of matching atoms of the Kabsch algorithm is the result of the similarity calculation between the substructures. The reason for this is the optimal assignment of the atoms of the substructures to calculate the similarity. Although the algorithm calculates a superposition minimizing the rootmeansquare deviation, our approach uses the threedimensional atom coordinates and integrates the information into the second assignment step.
The second assignment is the final optimal assignment on the atomic level, which includes all information of the previous steps. Each atom has information about its integration into substructures and the mapping of these substructures. Considering this information, all atoms of the molecules can be divided into three different atom classes: First, atoms that are not part of a substructure, like linkers or sidechains. Second, atoms that are part of fragments which were not mapped in the first assignment. Third, atoms that are part of fragments which were mapped. Using these three classes we obtain six different unordered 2element subsets. Each subset represents a different case of the atomwise similarity calculation using different information and parameters. The case in which both atoms are part of a fragment and both fragments were not mapped cannot occur because of the nature of the optimal assignment problem. This reduces the number of different cases to five, which have to be regarded in the similarity calculation on the atomic level. However, the case in which both atoms are part of a fragment has to be divided into two cases, depending on whether the fragments were mapped onto each other or not. So we need to consider six cases overall. To include this discrimination, we integrated the following case differentiation into the original atomwise similarity calculation:

1.
Both atoms are not part of a fragment. Calculate the similarity using the original RBF of the OAK.

2.
Both atoms are part of a fragment and the fragments were not mapped onto each other. This case is the main cause of topological errors. Therefore, the method has to penalize the similarity score to avoid a mapping of those atoms. The penalization is done by a decreased σ value in the RBF. This modification sharpens the RBF and reduces the overall score of the atomwise similarity calculation.

3.
Both atoms are part of a fragment and the fragments were mapped onto each other. The method makes use of the geometrical information of the superposition of the fragments. Given two atoms from the two superimposed fragments, the method calculates a vector based on the two atom coordinates and maps the vector into a twodimensional plane using an isometric mapping. The initial and terminal points of the twodimensional vector represent the coordinates of the atoms in the twodimensional plane. An RBF is centered at each point using the van der Waals radius of the corresponding atom as σ value (σ_{vdW}). Equation 5 calculates the final geometrical similarity value for two mapped atoms a and b using two RBF and the half atomic distance as input. The result of this geometrical atomwise similarity calculation is integrated into the original RBF of the OAK in the form of an additional numerical descriptor.

4.
One atom is part of a fragment, but this fragment is not mapped. The other atom is not part of a fragment. In this case the method tries to map an atom of a rigid fragment onto a sidechain or linker atom. The fragment was not mapped in the first assignment step, but this mapping is also doubtful and has to be penalized. The technique for penalizing those mappings is the same as in the second case.

5.
One atom is part of a fragment which is mapped. The other atom is not part of a fragment. This case is related to the previous one, but the fragment was mapped in the first assignment step. If the atomwise similarity calculation results in a high similarity value, the possibility that the atom of the fragment is assigned to an atom of the corresponding fragment of the other molecule is reduced. Depending on the structure of the two molecules, this increases the possibility of topological errors. Therefore, the penalty has to be higher than in the fourth case.

6.
Both atoms are part of a fragment, but only one fragment was mapped. This case is also related to the fourth and fifth one, but the mapping of two atoms of fragments is more reasonable from a chemical point of view. However, the risk of topological errors is also increased, and therefore we use the same penalization as in the fourth case.
The separate σ values of each case can be set individually and have a great impact on the final similarity value. The optimal parametrization depends on the structures of a data set. Based on a chemical interpretation of the different cases, we expected the following relations between the different σ values: σ_{2} <σ_{5} <σ_{6} ≤ σ_{4}, where the indices represent the number of the case. We performed a grid search in the range [0.25, 0.5, ⋯, 10.0] to define the σ values for the penalization and to validate our hypothesis concerning the relation between the values. The result of the grid search yields the following values: σ_{2} = 2.5, σ_{5} = 5.0, σ_{6} = 4.0, and σ_{4} = 4.0. The empirical results of the grid search differ only in the relation between the σ values of the fifth and sixth case. Thus, the different cases of the 2SHA method show a correlation with the chemical interpretation.
The final result of the optimal assignment of the atoms, based on the atomwise similarity calculations using the differentiation with the six cases, is shown in Figure 5 on the same molecules causing topological errors using the original OAK. The assignment preserves the mapping of the substructures and reduces the occurrence of topological errors. The assignment of the atom from the condensed ring system onto the nitrogen of the nitro group is penalized.
Optimal Local Atom Pair Environment Assignment
This variant of the optimal assignment approach uses local atom pair environments (OAAP). There are two fundamental types of atom pairs: topological and geometrical atom pairs. We employed the binned geometrical distance matrix between the threedimensional coordinates of atoms i, j of a molecule in this study. The geometrical bin size b was set to 1 Å. Therefore, this method can be considered as a geometrical similarity measure. The computation time of D has a quadratic complexity with the number of atoms. The matrix is symmetric, so it is sufficient to compute the upper half of each matrix. The distance in the diagonal equals zero.
D is used as lookup table to store the information of all geometrical atom pairs from atom i in a trie. A trie is a prefix search tree that can be applied to patterns with a reading direction. In our case, we have a starshaped local atom environment. At the root of the trie of atom i the hash code of the atomic symbol i is placed. Next, the patterns of the form
are inserted as ordered triplets. Note that symbol can be any atom type or fragment representation and hash(symbol) any suitable hash function hash : symbol → ℕ. By using a trie, the collection of patterns is nonredundant, because the trie is updated whenever a known part of a pattern is inserted. If the whole pattern is already contained in the trie, the count is incremented by one. For many similarity metrics, it is necessary to know basic properties of a trie, for example the total number of patterns or the number of unique patterns (number of leaves). Therefore, our trie implementation has a method to compute such properties. A geometrical local atom pair environment and the corresponding trie of the marked atom is shown in Figure 6.
The representation of the local atom environments as tries allows us to apply efficient recursive similarity computations. A wellknown similarity metric for nominal features is the Tanimoto coefficient. The computation of the Tanimoto coefficient of two local atom environments is reduced to a comparison of two tries.
Let L_{
A
}, L_{
B
}be the sets of local atom pair environments of two molecular graphs A, B and the tries i, j of the nominal features (atom pair environments of atoms i, j). Now, the Tanimoto coefficient can be defined as:
The overall molecular similarity is computed by the score of the optimal assignment of the local atom pair environments. The cost of computing the local similarity matrix is O(nml) for n, m atoms of the molecules A, B and l leaves of the larger trie. The lookup for a pattern has a constant computation time, because the depth of the tries is fixed.
We computed the OAAP similarity between two structures on single lowenergy conformations. In spite that, it is possible to extend this method to conformational ensembles. This could be accomplished by simply averaging the distances as a case in point. The approach was implemented using the Chemistry Development Kit (CDK) library [50, 51].
Parameters and Computation Time
All presented methods have internal parameters, which allow modifications of the similarity measures. These modifications can improve the results on a specific problem and data set. The optimization of the parameters is a complex process making extensive methods like grid searches necessary. However, one aim of this work was to evaluate the usability of optimal assignment methods on molecular graphs for ligandbased VS experiments. Therefore, we used the described standard parameters for each method, which were constant for all data sets to obtain comparable results of the overall performance.
The computation time of the methods depends on the number of atoms, which have to be mapped in the assignment step. In addition, each method has its own type of preprocessing and mining the chemical information. This yields differences in the performance depending on the data set. Instead of giving a complete overview of the computation time for each method and data set, we report an averaged computation time, as it can be expected for druglike structures, for each method. The original OAK has an averaged performance of 27.34 ± 3.40 similarity calculations per second. The flexibility extension OAK_{FLEX} yields 41.03 ± 7.32 calculations per second. This speedup of the OAK_{FLEX} is the result of a reduction of the used features. The OAAP method achieves a performance of 51.49 ± 18.07 similarity calculations per second (computation time for the atom typing is not included). The 2SHA method has the highest computation time, because of the two assignment steps and the superposition of the fragments. However, the approach is capable to perform 14.04 ± 1.78 calculations per second. All values were measured on a Intel Core2Duo CPU with 2 GHz using one core and 1 GB memory, Java 1.6.0_07, Ubuntu 8.04.3, Linux kernel 2.6.24.
Experimental
The setup of the VS experiments is based on the work of Cheeseright et al. in which the FieldScreen approach is introduced and compared against a docking algorithm. We decided to use this workflow to create a common setup of the data sets with the objective to achieve comparable results.
Data Sets and Preparation
All ligandbased VS experiments were performed using a modified version of the Directory of Useful Decoys (DUD) Release 2 [52, 53]. The DUD contains known actives and mimetic [39] decoys for 40 target proteins. This collection of data sets was compiled to serve as an unbiased community benchmark database for the evaluation of docking algorithms. Therefore, the original version of the DUD is not suited for ligandbased VS experiments [41]. A modification of the active structures of the DUD database, suggested by Good and Oprea [35], solves this problem by applying a leadlike filter [54] and a cluster algorithm [55]. The clustering identifies analogue structures and can be used to reduce the bias of analogue or trivial enrichment. Additionally, each cluster can be seen as an individual chemotype and allows the analysis of the "scaffoldhopping" behavior of the methods.
The active structures for the VS experiments were prepared as follows. We obtained the modified active structures of Good and Oprea from the DUD site [38]. These data sets contain only topological information about the structures. Thus, we used CORINA3D [56] to generate initial geometrical seed structures. Those were refined with MacroModel 9.6 [57] using the OPLS 2005 force field and the limited BroydenFletcherGoldfarbShanno optimization method with 5000 iterations and a gradient criterion of 0.0001 RMSD for the atomic movement between two iterations. The final data sets were used as active structures.
The original decoys of the DUD release 2 were obtained from the DUD site [52]. Using these decoys, together with the filtered actives from Good and Oprea, would introduce a bias of an artificial enrichment based on the distinction of physical property values. To remove this artificial enrichment bias, we applied the same leadlike filter, used by Good and Oprea to filter the active structures, on the decoys. Thus, the AlogP value and the molecular weight was calculated for each structure with dragonX 1.4 [58]. All structures with an AlogP value > = 4.5 (5.5 for nuclear hormone receptor data sets) or a molecular weight were removed. The initial threedimensional coordinates were further optimized using the same configuration of MacroModel as for the active structures.
The described preparation protocol was applied on all 40 data sets of the DUD. The SD files of the prepared actives and decoys are directly obtainable from the DUD site [52].
To obtain results that are not biased by a low number of chemotypes, it is advisable to use data sets with a sufficient number of chemotypes. Therefore, we used the same subset of DUD targets as used by Cheeseright et al. This subset has a minimum of 17 and a maximum of 44 chemotypes. A comprehensive overview of the data sets can be seen in Table 1.
Ligandbased VS methods need a biologically active structure that serves as a query in the experiment. In the work of Huang et al. [53] complexed crystal structures were used to identify the binding sites for the docking algorithm. Cheeseright et al. [27] used the ligands of the same complexed crystal structures as query structures to evaluate the FieldScreen approach. To allow a comparison with FieldScreen, we extracted the same bounded ligands from the protein data bank [59, 60] and corrected the bonds lengths with MacroModel 9.6. These structures serve as queries for the VS experiments in this study.
Evaluation of VS Performance
In the following section we describe the metrics for the evaluation of VS experiments used in our work. Additionally, we explain why we used these metrics and show their correlation to wellknown established metrics.
The evaluation and comparison of VS performance is an important but also an errorprone process [39, 40, 61]. In recent years, a plethora of evaluation metrics have been published, each metric with its strengths and weaknesses. For an overview, we refer to the work of Kirchmair et al. [62] There is no standard protocol for the analysis and publication of VS results. Thus, the comparison of different works is a challenging task. It is necessary to analyze the results, regarding three different aspects, to characterize the performance of a VS method. The first aspect is the early recognition problem and originates from realworld screening applications, where only top ranked molecules were selected for testing in biological assays because of high costs and expenditure of time. Truchon and Bayly developed for this purpose the BEDROC score [63], which uses a decreasing exponential weighting function. The drawback of this metric is its lack of interpretability and the dependency of an extrinsic variable [39]. To resolve these problems, Jain and Nicholls suggested to report the enrichment values at predefined false positive fractions [40]. These values can be obtained by dividing the sensitivity value by the fraction of false positives. Equation 7 represents the interrelation between the ROC value and the ROC enrichment for a predefined false positive fraction. and is the number of true positives (actives retrieved) and false positives (decoys retrieved), respectively, found in the range containing a false positive rate of X%. TP, TN, FP, and FN are the entries of the confusion matrix ranking X% false positives.
We decided to report the ROC enrichments at false positive rates of 0.5%, 1.0%, 2%, and 5%, as suggested by Jain and Nicholls [40].
The second aspect of characterizing the performance of VS methods is the evaluation of the performance considering the complete data set. This overall performance can be visualized plotting the receiver operating characteristic (ROC), which should always be part of VS results [39, 64]. However, a comparison of many ROC plots is not straightforward, thus we report the area under the ROC curve to provide a better overview of the results. Equation 8 shows the calculation of the area under the ROC curve (AUC), where N_{actives} and N_{decoys} are the numbers of active and decoy structures, respectively is the number of decoys that are higher ranked than the ith active structure.
The last point of characterizing the performance of VS methods is the evaluation of the retrieval of new scaffolds. As already mentioned, ligandbased VS suffers from the bias caused by enrichments of analog structures. To reduce the influence of this overestimated enrichments on the result metrics, Clark and WebsterClark recommended an adaption of the ROC and AUC calculation [36]. They proposed an arithmetic and a harmonic weighting scheme to reduce the influence of structurally similar structures. Using the arithmetic weighting, each structure has a weight that is inversely proportional to the size of the cluster it belongs to. Therefore, the weight of all structures taken from one cluster is equal. This leads to the equation: , where w_{
ij
}is the weight of the ith structure taken from the jth cluster with N_{
j
}structures. As a result of this, the true positive value of the sensitivity is no more the number of true active structures seen in a predefined fraction of the data set. Instead, it is the sum of the weights resulting in the equation The value of is 1 if the ith structure of the jth cluster is contained in the fraction of the data set, otherwise it is 0. N_{
clusters
}is the number of clusters in the data set and N_{
j
}is the number of structures in the jth cluster.
The harmonic weighting scheme uses weights representing the ranks of the active structures within each cluster. Topranked structures of a cluster have a higher weight, motivated by the idea that those structures have the highest information content. The weight of the structures is defined as , where i is the ith structure of the jth cluster in decreasing rank order. A recently published analysis of the weighting schemes conducted by Mackey and Melville discloses that the arithmetic weighting scheme has better properties and is more robust than the harmonic scheme [65]. Therefore, we decided to use the arithmetic weighting scheme. Integrating this scheme into the basic ROC enrichment leads to an arithmetic weighted version (awROC enrichment) given by Equation 9.
This weighting scheme can also be embedded into the AUC calculation. The modified version (awAUC) can be seen in Equation 10.
To account for intrinsic variances [39] of the active and decoy structures, we bootstrapped the data sets and approximated the error of the awROC enrichments and awAUC. We performed 25000 iterations removing randomly 20% of the data set. The standard deviation of the 25000 metric results were used as an error of the result using the complete data set.
Comparison against Other Methods
The docking study by Huang et al. [53] provides the energy scores of the DOCK algorithm for each target of the DUD. We used these rankings to calculate the awROC enrichments and awAUC on the same data sets as our approaches. Hence, extensive properties such as active to decoy ratio and data set size are identical and permit a comparison of the results. A direct comparison of docking algorithms with ligandbased VS tools is critical because of the different techniques and information included in the methods. We calculated the results of the docking algorithm with the objective to provide an impression of the performance of the DOCK algorithm using the awROC and awAUC evaluation metrics.
We considered the results of the FieldScreen approach to include a sophisticated ligandbased method. To reduce the information content solely to the information obtained from the query structure, the results without the excluded volume of the protein were used. Although the data set sizes and active to decoy ratios of the FieldScreen results deviate from our setup, the use of awROC enrichments ensures the comparability because the ROC enrichment is independent from extensive quantities [39].
Introducing a new method to a wellestablished field, like ligandbased VS, it is necessary to justify it by comparing the results to a common approach. To meet these requirements, we compared our approaches to the 166 bit MACCS keys [12] using the Tanimoto coefficient to calculate a similarity value. Although the results of the MACCS keys are inferior in comparison to other fingerprints [66] we provide the results of the MACCS keys to give a baseline for the performance on the data sets. The MACCS keys were calculated using the CDK 1.1.5 [50, 51].