- Research article
- Open Access
Life beyond the Tanimoto coefficient: similarity measures for interaction fingerprints
© The Author(s) 2018
- Received: 24 April 2018
- Accepted: 27 September 2018
- Published: 4 October 2018
Interaction fingerprints (IFPs) have been repeatedly shown to be valuable tools in virtual screening to identify novel hit compounds that can subsequently be optimized to drug candidates. As a complementary method to ligand docking, IFPs can be applied to quantify the similarity of predicted binding poses to a reference binding pose. For this purpose, a large number of similarity metrics can be applied, and various parameters of the IFPs themselves can be customized. In a large-scale comparison, we have assessed the effect of similarity metrics and IFP configurations in a number of virtual screening scenarios with ten different protein targets and thousands of molecules. In particular, we studied the effect of general interaction definitions (such as Any Contact, Backbone Interaction and Sidechain Interaction), the effect of filtering methods, and the different groups of similarity metrics.
The performances were primarily compared based on AUC values, but we have also used the original similarity data for the comparison of similarity metrics with several statistical tests and the novel, robust sum of ranking differences (SRD) algorithm. With SRD, we can evaluate the consistency (or concordance) of the various similarity metrics to an ideal reference metric, which is provided by data fusion from the existing metrics. Different aspects of IFP configurations and similarity metrics were examined based on SRD values with analysis of variance (ANOVA) tests.
- Virtual screening
- Interaction fingerprint
- Similarity metrics
- Binary fingerprints
Interaction fingerprints are a relatively new concept in cheminformatics and molecular modeling. Just as molecular fingerprints are binary (or bitstring) representations of molecular structure, interaction fingerprints are binary (or bitstring) representations of 3D protein–ligand complexes. Each bit position of an interaction fingerprint corresponds to a specific amino acid of the protein and a specific interaction type. A value of 1 (“on”) denotes that the given interaction is established between the given amino acid and the small-molecule ligand (a 0, or “off”, value denotes the lack of that specific interaction). Two such fingerprints are most commonly compared with the Tanimoto similarity metric, which takes a value between 0 and 1, with 1 corresponding to identical fingerprints, i.e. identical protein–ligand interaction patterns. In the most common setting, the Tanimoto similarity is calculated between a reference fingerprint (usually belonging to a known active molecule) and many query fingerprints.
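As a minimal illustration of this comparison, the sketch below computes the Tanimoto similarity of two short, hypothetical interaction fingerprints; the bit values and fingerprint length are invented for the example, not taken from this work.

```python
# Hypothetical 12-bit interaction fingerprints (layout and values are
# illustrative only, e.g. 4 residues x 3 interaction types).
reference = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
query     = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

def tanimoto(fp1, fp2):
    """Tanimoto (Jaccard) similarity of two equal-length bitstrings."""
    a = sum(1 for x, y in zip(fp1, fp2) if x == 1 and y == 1)  # common "on" bits
    b = sum(1 for x, y in zip(fp1, fp2) if x == 1 and y == 0)  # "on" only in fp1
    c = sum(1 for x, y in zip(fp1, fp2) if x == 0 and y == 1)  # "on" only in fp2
    return a / (a + b + c) if (a + b + c) else 0.0

print(tanimoto(reference, query))  # a=4, b=1, c=1 -> 4/6 ~ 0.667
```

Identical fingerprints give 1.0; fingerprints with no common “on” bits give 0.0.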
Summary of the bit definitions of the modified SIFt implemented in the Schrödinger Suite and applied in this work
Any contact
A ligand atom is within the required distance of a receptor atom
Backbone interaction
A ligand atom is within the required distance of a receptor backbone atom
Sidechain interaction
A ligand atom is within the required distance of a receptor side chain atom
Polar interaction
A ligand atom is within the required distance of an atom in a polar residue of the receptor (ARG, ASP, GLU, HIS, ASN, GLN, LYS, SER, THR, ARN, ASH, GLH, HID, HIE, LYN)
Hydrophobic interaction
A ligand atom is within the required distance of an atom in a hydrophobic residue of the receptor (PHE, LEU, ILE, TYR, TRP, VAL, MET, PRO, CYS, ALA, CYX)
Hydrogen bond acceptor
The ligand forms a hydrogen bond with an acceptor in a receptor residue
Hydrogen bond donor
The ligand forms a hydrogen bond with a donor in a receptor residue
Aromatic interaction
A ligand atom is within the required distance of an atom in an aromatic residue of the receptor (PHE, TYR, TRP, TYO)
Charged interaction
A ligand atom is within the required distance of an atom in a charged residue of the receptor (ARG, ASP, GLU, LYS, HIP, CYT, SRO, TYO, THO)
A widely-applied variant, simply termed interaction fingerprint (IFP), was introduced by Marcou and Rognan, containing seven interactions per residue. A marked difference between SIFt and IFP is that IFP differentiates aromatic interactions by their orientations (face-to-face vs. edge-to-face), and charged interactions by the specific charge distribution (i.e. cation on the ligand vs. anion on the ligand). Furthermore, IFPs can be configured to include less common interaction types, such as weak H-bonds or cation–π interactions. Later, the same group introduced triplet interaction fingerprints (TIFPs), which encode triplets of interaction points in a fixed length of 210 bits.
Mpamhanga et al. introduced three types of interaction fingerprints in 2006, out of which the one termed CHIF is probably the most prominent. The atom-pairs-based interaction fingerprint (APIF) is a variant implemented by Pérez-Nueno et al. in the MOE SVL scripting language. APIF accounts for the relative positions of pairs of interactions (based on their binned distances) and stores them in a count-based fingerprint with a fixed length (294 bits).
Da and Kireev introduced SPLIF (structural protein–ligand interaction fingerprints), whose main difference from SIFt is that the interactions are encoded only implicitly, by encoding the interacting ligand and protein fragments (whereas in SIFt the interaction type explicitly defines the given bit in the bitstring). In the same year, Sato and Hirokawa introduced another approach called PLIF (protein–ligand interaction fingerprints), which relies on the per-residue identification of the number of atoms interacting with the ligand. To our knowledge, the most recent novel interaction fingerprint implementation is the PADIF (protein per-atom score contributions derived interaction fingerprint) approach of Jasper et al. PADIF incorporates the strengths of the different interactions by exploiting the per-atom score contributions of the protein atoms, which are calculated for each pose during docking with GOLD, or with any other scoring function that can output atom contributions. As a consequence, PADIF is an atom-based interaction fingerprint.
Interaction fingerprints have been applied numerous times to complement docking scores in virtual screening campaigns, e.g. for the discovery of GPCR (G-protein coupled receptor) ligands or kinase inhibitors. In more complex examples, they have been applied for interpreting activity landscapes, for training machine learning models, and for identifying covalently targetable cysteine residues in the human kinome. Additionally, interaction fingerprints are applied to support large, specialized structural databases, such as GPCRdb (for GPCRs), KLIFS (for kinases) [21, 22] or PDEStrIAn (for phosphodiesterases).
Binary similarity measures are applied in various scientific fields to compare binary and continuous data vectors. To our knowledge, the most comprehensive collection of similarity measures was published by Todeschini et al., listing 51 similarity measures (out of which seven have been shown to perfectly correlate with others).
Confusion matrix for a pair of interaction fingerprints, containing the frequencies of common “on” bits (a), common “off” bits (d), and exclusive “on” bits for Complex 1 (b) and Complex 2 (c); the total number of bits is p = a + b + c + d

|  | Complex 2: 1 (interaction present) | Complex 2: 0 (interaction absent) |
|---|---|---|
| Complex 1: 1 (interaction present) | a | b |
| Complex 1: 0 (interaction absent) | c | d |
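The four frequencies can be tallied in a few lines of Python; the two example fingerprints below are invented for illustration.

```python
def confusion_counts(fp1, fp2):
    """Frequencies from two equal-length binary fingerprints:
    a = common 'on' bits, d = common 'off' bits,
    b = 'on' only in fp1, c = 'on' only in fp2; p = a + b + c + d."""
    a = b = c = d = 0
    for x, y in zip(fp1, fp2):
        if x and y:
            a += 1
        elif x and not y:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d

fp1 = [1, 1, 0, 0, 1, 0, 0, 0]
fp2 = [1, 0, 1, 0, 1, 0, 0, 0]
a, b, c, d = confusion_counts(fp1, fp2)
assert a + b + c + d == len(fp1)  # p equals the fingerprint length
print(a, b, c, d)  # 2 1 1 4 for this pair
```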
In our related earlier works, we have confirmed the choice of the Tanimoto coefficient for molecular fingerprints (by a comparison of eight commonly available measures), and more recently we have suggested the Baroni–Urbani–Buser (BUB) and Hawkins–Dotson (HD) coefficients for metabolomic fingerprints. We should note, however, that due to the highly different data structure, these conclusions are not transferable to interaction fingerprints (or other fingerprint types).
In this work, our goals were (1) to compare and rank these 44 similarity measures for their use with interaction fingerprint data, and (2) to dissect the interaction fingerprints and investigate how changes in the data structure affect the ranking of similarity coefficients. We also aimed to answer some specific questions concerning interaction fingerprints, regarding e.g. the usefulness of IFP filtering schemes (i.e. exclusion of certain bit positions or blocks), or of general interaction definitions (e.g. “Any contact”). We note here that we use the abbreviation IFP throughout this work to refer to interaction fingerprints in general, not to the specific fingerprinting method of Marcou and Rognan. (The specific method we used here is a modified version of SIFt, implemented in the Schrödinger Suite.)
Summary of the applied protein targets and ligand sets
- Androgen receptor agonists
- Cyclin dependent kinase 2
- Estrogen receptor antagonists
- Tyrosine kinase SRC
- Vascular endothelial growth factor receptor kinase
The case studies correspond to ten virtual screening scenarios, where IFPs are used for retrieving the active molecules from among the chemically similar, but inactive decoy compounds. A standard tool for evaluating virtual screens is the area under the receiver operating characteristic curve (ROC AUC, or simply AUC). The AUC can take values between 0 and 1, and corresponds to the probability of ranking a randomly selected active compound higher than a randomly selected inactive compound (as a consequence, an AUC value of 0.5 corresponds to random ranking). In this work, we have used AUC values as a first approach to evaluating the various IFP–similarity measure combinations, followed by a more detailed statistical analysis, as explained below.
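The probabilistic interpretation of the AUC lends itself to a direct (if naive, O(n·m)) implementation via pairwise comparisons; the similarity values below are illustrative only, not data from this study.

```python
def auc_from_scores(scores_actives, scores_decoys):
    """AUC as the probability that a randomly chosen active is scored
    above a randomly chosen decoy (ties count as 0.5), i.e. the
    Mann-Whitney U statistic divided by n_actives * n_decoys."""
    wins = 0.0
    for sa in scores_actives:
        for sd in scores_decoys:
            if sa > sd:
                wins += 1.0
            elif sa == sd:
                wins += 0.5
    return wins / (len(scores_actives) * len(scores_decoys))

# Illustrative similarity values for three actives and four decoys:
actives = [0.9, 0.8, 0.6]
decoys  = [0.7, 0.5, 0.4, 0.3]
print(auc_from_scores(actives, decoys))  # 11/12 ~ 0.917
```

A perfect separation of actives from decoys gives 1.0, and identical score distributions give 0.5, matching the "random ranking" baseline described above.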
Generation of interaction fingerprints
All the preprocessing procedures for the protein targets and ligands were carried out with the relevant Schrödinger software (LigPrep, Protein Preparation Wizard, etc.). Standard (default) protocols were used for grid generation and ligand docking (Glide) [30, 31]. The IFPs were also generated with a Schrödinger module based on the docked poses, and contained by default all nine interactions listed in Table 1. To study the effects of the more general interaction definitions (bits), we have generated two more sets of IFPs, where we have omitted (1) the Any Contact (Any) definition, and (2) the Any Contact (Any), Backbone Interaction (BB), and Sidechain Interaction (SC) definitions. We have labeled the resulting IFPs ALL (original), WO1 (without Any), and WO3 (without Any, BB and SC).
We have implemented a Python module (FPKit) to calculate the 44 similarity measures (collected by Todeschini et al.) on plain bitstrings. The definitions of these similarity measures can be found in the original publication of Todeschini et al., and as a supplement to our recent (open access) article on metabolomic profiles. Those measures that do not, by definition, produce values in the [0, 1] range are scaled with the α and β scaling parameters, published together with the definitions (see also Eq. 4). In some instances, we needed to correct some of these scaling parameters and implement additional checks to avoid division-by-zero errors: these are summarized in Additional file 1. The Python module additionally contains the implemented filtering rules, and is available at: https://github.com/davidbajusz/fpkit.
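To illustrate how such measures are computed from plain bitstrings, here is a minimal sketch of three of the simpler coefficients (Jaccard–Tanimoto, simple matching and Rogers–Tanimoto), with a division-by-zero guard of the kind mentioned above. This is a simplified stand-in written for this example, not FPKit's actual code.

```python
def counts(fp1, fp2):
    """a, b, c, d frequencies for two equal-length 0/1 fingerprints."""
    a = sum(x & y for x, y in zip(fp1, fp2))
    b = sum(x & (1 - y) for x, y in zip(fp1, fp2))
    c = sum((1 - x) & y for x, y in zip(fp1, fp2))
    d = sum((1 - x) & (1 - y) for x, y in zip(fp1, fp2))
    return a, b, c, d

def jt(fp1, fp2):
    """Jaccard-Tanimoto: a / (a + b + c), guarded against all-zero input."""
    a, b, c, _ = counts(fp1, fp2)
    return a / (a + b + c) if a + b + c else 0.0

def sm(fp1, fp2):
    """Simple matching (Sokal-Michener): (a + d) / p."""
    a, b, c, d = counts(fp1, fp2)
    return (a + d) / (a + b + c + d)

def rt(fp1, fp2):
    """Rogers-Tanimoto: (a + d) / (a + 2(b + c) + d)."""
    a, b, c, d = counts(fp1, fp2)
    return (a + d) / (a + 2 * (b + c) + d)

fp1, fp2 = [1, 1, 0, 0], [1, 0, 1, 0]  # a=1, b=1, c=1, d=1
print(jt(fp1, fp2), sm(fp1, fp2), rt(fp1, fp2))
```

Note how SM and RT use the common “off” bits (d), while JT ignores them entirely; this distinction between symmetric and asymmetric measures recurs throughout the analysis below.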
SRD is a novel algorithm based on the calculation of the differences between the object-wise ranks produced by a vector (corresponding to a method, model, similarity metric, etc.) and those produced by a reference vector [32, 33]. The reference can be experimental values as a gold standard, or a consensus produced by data fusion (row-average, minimum, maximum, etc.). This is related to the basic idea of multicriteria decision making, where the objective is to rank the objects simultaneously by each criterion: using that terminology, the criteria would be the various similarity measures in this case. The basic steps of the protocol are the following: (1) rank the samples (here, ligands) in order of magnitude by each column vector (similarity measure), (2) for each sample (ligand), calculate the difference between the rank produced by each similarity measure and the rank produced by the reference, and (3) sum up the absolute values of the differences for each similarity measure. The resulting sums are called SRD values and can be used to compare the similarity measures: the smaller the SRD value, the closer the measure is to the reference (in terms of ranking behavior). A detailed animation of the calculation procedure can be found as a supplement to our earlier work. The method is validated with both cross-validation and a randomization test. The MS Excel SRD macro is freely available for download at: http://aki.ttk.mta.hu/srd
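The three numbered steps can be sketched in a few lines of Python (average ranks for ties, then the sum of absolute rank differences). This is an illustrative re-implementation for the example data below, not the validated MS Excel macro.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srd(column, reference):
    """Sum of absolute rank differences between one measure and the reference."""
    rc, rr = ranks(column), ranks(reference)
    return sum(abs(x - y) for x, y in zip(rc, rr))

# Illustrative: two "similarity measures" scored against a consensus reference
reference = [0.9, 0.5, 0.7, 0.2]
measure_a = [0.8, 0.4, 0.6, 0.1]  # same ranking as the reference -> SRD = 0
measure_b = [0.1, 0.9, 0.3, 0.6]
print(srd(measure_a, reference), srd(measure_b, reference))  # 0.0 8.0
```

A measure that ranks the ligands exactly as the reference does gets SRD = 0; the more the two rankings disagree, the larger the sum.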
We should note that besides SRD, a number of methods for the comparison of rankings are reported in the literature, or used routinely by statisticians. Spearman’s rank correlation coefficient—probably the most commonly used rank-based statistical test—was compared to SRD in the paper of Héberger and Kollár-Hunek as early as 2011, and we have also shown in our recent work the more sophisticated discriminatory power of SRD as compared to Spearman’s rho. An interesting novel application of SRD is in post-Pareto optimality analysis, where it was clearly shown to be a well-suited decision support tool (ranking the solutions along the Pareto front).
More generally: while it is also based on a comparison of rankings, the SRD workflow can be clearly distinguished from rank-based statistical tests, as it involves not one, but three essential steps. The first is the definition of the reference vector (i.e. reference ranking), which, depending on the problem, can be a “gold standard” (such as experimental values for the comparison of computational methods for modeling/predicting the same property) or a consensus of the existing (compared) methods, produced with a suitable data fusion technique, such as average, minimum, maximum, etc. This is again problem-dependent, as the reference vector must always represent a hypothetical optimum (or ideal) ranking. (It may involve more than one data fusion technique, if necessary: e.g. in the present work, the hypothetical best similarity measure would be one that produces the highest possible similarity value for active molecules and the lowest possible value for inactives, so our current solution involved the use of maximum values for actives and minimum values for inactives, see the Results section.) Definition of a reference vector is not part of any rank-based statistical test we are aware of.
The second step is the calculation of the distance measure itself between the reference vector (ranking) and the rankings produced by the compared methods (here, similarity measures). In the current implementation of SRD, the Manhattan distance is applied: in the case where there are no tied ranks, this is identical to another rank-based distance measure, the Spearman footrule metric. Koziol related SRD to another distance measure for permutations, namely the inversion number, but it has less discriminatory power, and has not found any applications yet (to the best of our knowledge).
The third step is the application of a meticulous validation approach, involving a randomization (permutation) test and leave-one-out or leave-many-out cross-validation. This step instantly provides answers to two important questions: whether the SRD values characterizing two compared methods (i.e. rankings) are significantly different from each other (cross-validation), and whether there is any among the compared methods (i.e. rankings) that is not significantly better (i.e. not closer to the reference vector) than random rankings (randomization test).
The further statistical analysis of SRD values was carried out by factorial analysis of variance (ANOVA). This method is based on the comparison of the average values for the different groups of samples. The input matrices contained the SRD values and several grouping factors such as similarity metrics, symmetricity, metricity, bit selection and filtering rule. The complete procedure of statistical analysis was carried out three times with different pretreatment methods (rank transformation, range scaling, autoscaling). STATISTICA 13 (Dell Inc., Tulsa, OK, USA) was used for the analysis.
Comparison based on AUC values
Results based on SRD values
Because of the problem detailed above, we have decided to apply the SRD method for the statistical comparison. Selecting the reference value (data fusion) was not trivial in this particular case, since we have both active and inactive ligands, for which the ideal behavior of a similarity measure is to produce the highest and the lowest similarity values, respectively. Thus, the reference was defined as the minimum or maximum of the similarity values, depending on the activity of the specific ligand: if it was active, the row-maximum was used; if it was inactive, the row-minimum was used. The analysis was run 90 times altogether, corresponding to each possible combination of 10 protein targets, 3 bit selections, and 3 filtering rules.
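This activity-dependent data fusion can be expressed compactly; the similarity values and activity labels below are invented for illustration.

```python
def fused_reference(similarity_rows, is_active):
    """Row-wise data fusion for the SRD reference vector: for each ligand
    (one row of per-measure similarity values), take the row-maximum if
    the ligand is active and the row-minimum if it is inactive, i.e. the
    behavior of a hypothetical ideal similarity measure."""
    return [max(row) if active else min(row)
            for row, active in zip(similarity_rows, is_active)]

# Illustrative values: rows = ligands, columns = similarity measures
rows = [[0.8, 0.6, 0.9],   # active
        [0.4, 0.7, 0.5],   # inactive
        [0.3, 0.2, 0.6]]   # active
print(fused_reference(rows, [True, False, True]))  # [0.9, 0.4, 0.6]
```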
The original input matrices contained the 44 similarity measures for the different molecules in each case study, but the ranges of these measures were sometimes very different. For example, values close to 0 were typical for the Mou (Mountford) similarity, but values close to 1 were typical for the Yu1 (Yule) similarity. Obviously, in such cases, taking the row minimum as the reference value would favor the former, regardless of the ligand being active or inactive. Thus, an additional round of data pretreatment was essential for the analysis, to provide a valid basis of comparison. Autoscaling, range scaling and rank transformation were applied for this purpose.
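Two of the three pretreatment methods can be sketched as follows (rank transformation simply replaces each value by its rank within the column); the example columns are invented to mimic the Mountford-like and Yule-like value ranges mentioned above.

```python
def autoscale(col):
    """Autoscaling (standardization): zero mean, unit (population) std.
    Assumes the column has non-zero spread."""
    m = sum(col) / len(col)
    sd = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
    return [(x - m) / sd for x in col]

def range_scale(col):
    """Range (interval) scaling of a column to [0, 1].
    Assumes the column has non-zero spread."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]

mou_like  = [0.01, 0.03, 0.02]   # a measure living near 0 (e.g. Mountford)
yule_like = [0.97, 0.99, 0.98]   # a measure living near 1 (e.g. Yule)
# After range scaling, both columns span [0, 1] and become comparable:
print(range_scale(mou_like), range_scale(yule_like))
```

Without such scaling, taking the row-minimum of the raw values would always pick the Mountford-like column, regardless of the ligand's activity, which is exactly the bias described above.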
One example of the original plots produced by the SRD script can be seen in Additional file 1: Figure S1, where the normalized (scaled) SRD values are plotted in increasing magnitude and the distribution of random SRD rankings (for random numbers) is plotted as a basis of comparison.
SRD analysis was performed with fivefold cross-validation for every combination of the original parameters (bit selection, filtering, scaling), and the results (the SRD values) for each similarity measure were collected from every dataset (see Fig. 2) for a final factorial ANOVA analysis. The collected SRD values for the ten datasets (i.e. target proteins) were used together for the further ANOVA analysis, to allow more general conclusions to be drawn.
We can observe that there are some measures with very high SRD values (i.e. producing very different rankings as compared to the reference/consensus method), for example RR (Russel–Rao), Mic (Michael) or CT3 (Consonni–Todeschini 3). On the other hand, one can identify the best measures (i.e. closest to the reference) as SM (simple matching), RT (Rogers–Tanimoto), SS2 (Sokal–Sneath 2), CT1 (Consonni–Todeschini 1), CT2 (Consonni–Todeschini 2) or AC (Austin–Colwell). These similarity measures are closer to the reference and can be recommended for use. The JT (Jaccard–Tanimoto) metric, which is the de facto standard of cheminformatics (simply called the “Tanimoto coefficient” in most of the related scientific literature), is located relatively close to the reference, but somewhat farther than those mentioned above, meaning that the SM, RT, SS2, CT1, CT2 and AC metrics could be considered viable alternatives to the Tanimoto coefficient.
The preference for symmetric measures over asymmetric ones is somewhat surprising, considering that one would expect symmetric measures to be affected by the amount of “off” bits (and consequently, the number of common “off” bits, d) more than asymmetric ones. If we look at the effects of the filtering rules (and therefore the amount of “off” bits) on the SRD values of the similarity metrics separately (Additional file 1: Figure S2), we find that this assumption is only partially confirmed: the similarity measures where we can observe major differences are Mic (Michael), HD (Hawkins–Dotson), Den (Dennis), dis (dispersion), SS4 (Sokal–Sneath 4), Phi (Pearson–Heron), Coh (Cohen), Pe1, Pe2 (Peirce), MP (Maxwell–Pilliner), and HL (Harris–Lahey). These are symmetric and correlation-based coefficients, without exception. The associated ANOVA plots are included in Additional file 1: Figure S2.
In this study forty-four similarity measures were compared based on ten case studies, corresponding to interaction fingerprint-based virtual screening scenarios. The effects of the applied set of bits (interaction types) and filtering rules were studied in detail. The comparison was carried out with a novel algorithm, sum of ranking differences (SRD), coupled with analysis of variance (ANOVA). This work complements our earlier comparative studies on metabolomic fingerprints  and molecular fingerprints .
There are several similarity metrics that are worth consideration as viable alternatives to the popular Jaccard–Tanimoto coefficient, namely: Sokal–Michener (SM), Rogers–Tanimoto (RT), Sokal–Sneath 2 (SS2), Consonni–Todeschini 1 and 2 (CT1, CT2) and Austin–Colwell (AC). These six similarity measures gave the most consistent results with the “ideal” (hypothetical best) reference method in our evaluations using ten highly diverse protein data sets. We can also conclude that metric similarity measures are usually more consistent with the reference method than non-metric ones. Similarly, symmetric and intermediately symmetric measures gave more consistent results than asymmetric and correlation-based ones.
Finally, there are important and significant differences with regard to the applied bit definitions and filtering rules. As a general conclusion, we can recommend omitting the “Any contact” bit definition from IFP-based analyses, as this will not deteriorate the results in a virtual screening scenario (however, omitting the backbone and sidechain interaction bits, BB and SC, is not recommended). Similarly, applying a bit filtering rule, such as interaction-based filtering (omitting any interaction that is not established even once in the whole dataset), can improve the results on average. The open-source Python-based FPKit (FingerPrint Kit) package applied for IFP filtering and similarity calculations is freely available at: https://github.com/davidbajusz/fpkit.
ALL: all interactions; WO1: all interactions, except “Any contact”; WO3: all interactions, except “Any contact”, “Backbone interaction” and “Sidechain interaction”.
INTS: interaction-based filtering; NO: no filtering; RES: residue-based filtering.
ANOVA: analysis of variance; SRD: sum of (absolute) ranking differences.
AUTO: autoscaling (or standardization); RGS: range (interval) scaling; RANK: rank transformation.
AR conducted the virtual screening campaigns. AR and KH conducted the statistical analyses. DB implemented and calculated the similarity metrics, and supervised the project. All authors have participated in preparing the manuscript. All authors read and approved the final manuscript.
The authors thank Prof. Roberto Todeschini for his advice regarding the implementation of some of the similarity metrics. This work was supported by the National Research, Development and Innovation Office of Hungary under Grant Numbers OTKA K 119269 and KH_17 125608.
The authors declare that they have no competing interests.
Availability of data and materials
The FPKit package is freely available at: https://github.com/davidbajusz/fpkit.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bajusz D, Rácz A, Héberger K (2017) Chemical data formats, fingerprints, and other molecular descriptions for database analysis and searching. In: Chackalamannil S, Rotella DP, Ward SE (eds) Comprehensive medicinal chemistry III. Elsevier, Oxford, pp 329–378
- Deng Z, Chuaqui C, Singh J (2004) Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein–ligand binding interactions. J Med Chem 47:337–344
- Mordalski S, Kosciolek T, Kristiansen K et al (2011) Protein binding site analysis by means of structural interaction fingerprint patterns. Bioorg Med Chem Lett 21:6816–6819. https://doi.org/10.1016/j.bmcl.2011.09.027
- Small-Molecule Drug Discovery Suite 2017-4, Schrödinger, LLC, New York, NY, 2017. https://www.schrodinger.com/citations
- Cao R, Wang Y (2016) Predicting molecular targets for small-molecule drugs with a ligand-based interaction fingerprint approach. ChemMedChem 11:1352–1361. https://doi.org/10.1002/cmdc.201500228
- Marcou G, Rognan D (2007) Optimizing fragment and scaffold docking by use of molecular interaction fingerprints. J Chem Inf Model 47:195–207. https://doi.org/10.1021/ci600342e
- Desaphy J, Raimbaud E, Ducrot P, Rognan D (2013) Encoding protein–ligand interaction patterns in fingerprints and graphs. J Chem Inf Model 53:623–637. https://doi.org/10.1021/ci300566n
- Mpamhanga CP, Chen B, McLay IM, Willett P (2006) Knowledge-based interaction fingerprint scoring: a simple method for improving the effectiveness of fast scoring functions. J Chem Inf Model 46:686–698. https://doi.org/10.1021/ci050420d
- Pérez-Nueno VI, Rabal O, Borrell JI, Teixidó J (2009) APIF: a new interaction fingerprint based on atom pairs and its application to virtual screening. J Chem Inf Model 49:1245–1260. https://doi.org/10.1021/ci900043r
- Molecular Operating Environment (MOE), 2013.08 (2018) Chemical Computing Group ULC, QC, Canada. https://www.chemcomp.com/Research-Citing_MOE.htm
- Da C, Kireev D (2014) Structural protein–ligand interaction fingerprints (SPLIF) for structure-based virtual screening: method and benchmark study. J Chem Inf Model 54:2555–2561
- Sato M, Hirokawa T (2014) Extended template-based modeling and evaluation method using consensus of binding mode of GPCRs for virtual screening. J Chem Inf Model 54:3153–3161. https://doi.org/10.1021/ci500499j
- Jasper JB, Humbeck L, Brinkjost T, Koch O (2018) A novel interaction fingerprint derived from per atom score contributions: exhaustive evaluation of interaction fingerprint performance in docking based virtual screening. J Cheminform 10:15. https://doi.org/10.1186/S13321-018-0264-0
- Jones G, Willett P, Glen RC et al (1997) Development and validation of a genetic algorithm for flexible docking. J Mol Biol 267:727–748. https://doi.org/10.1006/jmbi.1996.0897
- de Graaf C, Kooistra AJ, Vischer HF et al (2011) Crystal structure-based virtual screening for fragment-like ligands of the human histamine H1 receptor. J Med Chem 54:8195–8206
- Bajusz D, Ferenczy GG, Keserű GM (2016) Discovery of subtype selective Janus kinase (JAK) inhibitors by structure-based virtual screening. J Chem Inf Model 56:234–247. https://doi.org/10.1021/acs.jcim.5b00634
- Méndez-Lucio O, Kooistra AJ, de Graaf C et al (2015) Analyzing multitarget activity landscapes using protein–ligand interaction fingerprints: interaction cliffs. J Chem Inf Model 55:251–262. https://doi.org/10.1021/ci500721x
- Smusz S, Mordalski S, Witek J et al (2015) Multi-step protocol for automatic evaluation of docking results based on machine learning methods—a case study of serotonin receptors 5-HT6 and 5-HT7. J Chem Inf Model 55:823–832. https://doi.org/10.1021/ci500564b
- Zhao Z, Liu Q, Bliven S et al (2017) Determining cysteines available for covalent inhibition across the human kinome. J Med Chem 60:2879–2889. https://doi.org/10.1021/acs.jmedchem.6b01815
- Pándy-Szekeres G, Munk C, Tsonkov TM et al (2018) GPCRdb in 2018: adding GPCR structure models and ligands. Nucleic Acids Res 46:D440–D446. https://doi.org/10.1093/nar/gkx1109
- van Linden OPJ, Kooistra AJ, Leurs R et al (2014) KLIFS: a knowledge-based structural database to navigate kinase–ligand interaction space. J Med Chem 57:249–277. https://doi.org/10.1021/jm400378w
- Kooistra AJ, Kanev GK, van Linden OPJ et al (2016) KLIFS: a structural kinase–ligand interaction database. Nucleic Acids Res 44:D365–D371. https://doi.org/10.1093/nar/gkv1082
- Jansen C, Kooistra AJ, Kanev GK et al (2016) PDEStrIAn: a phosphodiesterase structure and ligand interaction annotated database as a tool for structure-based drug design. J Med Chem 59:7029–7065. https://doi.org/10.1021/acs.jmedchem.5b01813
- Todeschini R, Consonni V, Xiang H et al (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52:2884–2901. https://doi.org/10.1021/ci300261r
- Rácz A, Andrić F, Bajusz D, Héberger K (2018) Binary similarity measures for fingerprint analysis of qualitative metabolomic profiles. Metabolomics 14:29. https://doi.org/10.1007/s11306-018-1327-y
- Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:20. https://doi.org/10.1186/s13321-015-0069-3
- Huang N, Shoichet B, Irwin J (2006) Benchmarking sets for molecular docking. J Med Chem 49:6789–6801
- Jain AN, Nicholls A (2008) Recommendations for evaluation of computational methods. J Comput Aided Mol Des 22:133–139. https://doi.org/10.1007/s10822-008-9196-5
- Sastry GM, Adzhigirey M, Day T et al (2013) Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J Comput Aided Mol Des 27:221–234. https://doi.org/10.1007/s10822-013-9644-8
- Friesner RA, Banks JL, Murphy RB et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749
- Halgren TA, Murphy RB, Friesner RA et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem 47:1750–1759
- Héberger K (2010) Sum of ranking differences compares methods or models fairly. TrAC Trends Anal Chem 29:101–109. https://doi.org/10.1016/j.trac.2009.09.009
- Kollár-Hunek K, Héberger K (2013) Method and model comparison by sum of ranking differences in cases of repeated observations (ties). Chemom Intell Lab Syst 127:139–146. https://doi.org/10.1016/j.chemolab.2013.06.007
- Héberger K, Kollár-Hunek K (2011) Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers. J Chemom 25:151–158. https://doi.org/10.1002/cem.1320
- Andrić F, Bajusz D, Rácz A et al (2016) Multivariate assessment of lipophilicity scales—computational and reversed phase thin-layer chromatographic indices. J Pharm Biomed Anal 127:81–93. https://doi.org/10.1016/j.jpba.2016.04.001
- Lourenco JM, Lebensztajn L (2018) Post-Pareto optimality analysis with sum of ranking differences. IEEE Trans Magn 54:1–10. https://doi.org/10.1109/TMAG.2018.2836327
- Sipos L, Gere A, Popp J, Kovács S (2018) A novel ranking distance measure combining Cayley and Spearman footrule metrics. J Chemom 32:e3011. https://doi.org/10.1002/cem.3011
- Koziol JA (2013) Sums of ranking differences and inversion numbers for method discrimination. J Chemom 27:165–169. https://doi.org/10.1002/cem.2504
- Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
- Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 28:1409–1438
- Rogers D, Tanimoto T (1960) A computer program for classifying plants. Science 132:1115–1118. https://doi.org/10.1126/science.132.3434.1115
- Sokal R, Sneath P (1963) Principles of numerical taxonomy. W. H. Freeman, San Francisco, CA
- Consonni V, Todeschini R (2012) New similarity coefficients for binary data. MATCH Commun Math Comput Chem 68:581–592
- Austin B, Colwell R (1977) Evaluation of some coefficients for use in numerical taxonomy of microorganisms. Int J Syst Bacteriol 27:204–210