The power metric: a new statistically robust enrichment-type metric for virtual screening applications with early recovery capability

Lopes, Julio Cesar Dias; dos Santos, Fábio Mendes; Martins-José, Andrelly; Augustyns, Koen; De Winter, Hans

doi:10.1186/s13321-016-0189-4

Research article
Open access
Published: 02 February 2017

Julio Cesar Dias Lopes¹,
Fábio Mendes dos Santos¹,
Andrelly Martins-José¹,
Koen Augustyns² &
…
Hans De Winter ORCID: orcid.org/0000-0002-4450-7677²

Journal of Cheminformatics volume 9, Article number: 7 (2017)

A Commentary to this article was published on 15 March 2018

Abstract

A new metric for the evaluation of model performance in the field of virtual screening and quantitative structure–activity relationship applications is described. This metric has been termed the power metric and is defined as the true positive rate divided by the sum of the true positive and false positive rates, for a given cutoff threshold. The performance of this metric is compared with alternative metrics such as the enrichment factor, the relative enrichment factor, the receiver operating characteristic enrichment, the correct classification rate, Matthews correlation coefficient and Cohen’s kappa coefficient. The performance of this new metric is found to be quite robust with respect to variations in the applied cutoff threshold and in the ratio of the number of active compounds to the total number of compounds, while remaining sensitive to variations in model quality. It possesses the correct characteristics for its application in early-recognition virtual screening problems.

Background

The field of virtual screening with applications in drug design has become increasingly important in terms of hit finding and lead generation [1–3]. Many different methods and descriptors have emerged over time to help the drug discovery scientist apply the most suitable technique to almost any given computational problem [4]. However, a serious drawback in the domain of virtual screening remains the lack of standard metrics to statistically evaluate and compare the performance of different methods and descriptors. Nicholls [5] suggested a short list of desirable characteristics of a good metric:

1. independence from extensive variables,
2. statistical robustness,
3. straightforward assessment of error bounds,
4. no free parameters,
5. easily understandable and interpretable.

In addition to these five characteristics, we believe that a good metric might also benefit from having well-defined lower and upper boundaries as this facilitates quantitative comparison of different models and facilitates optimization of fitness functions based on these metrics.

In this paper a new metric is proposed that adheres to the six desired characteristics of an ideal metric. The metric is based on the principles behind the power of a hypothesis test, which is the probability of making the correct decision if the alternative hypothesis is true. The new power metric is compared with more established metrics, including the enrichment factor (EF) [6, 7], the relative enrichment factor (REF) [8], the receiver operating characteristic (ROC) enrichment (ROCE) [9–11], the correct classification rate (CCR) [12, 13], Matthews correlation coefficient (MCC) [14] and Cohen’s kappa coefficient (CKC) [15, 16], together with the standard precision (PRE), accuracy (ACC), sensitivity (SEN) and specificity (SPE) metrics.

Methods

Definitions

In the field of virtual screening, the quality of a model can be quantified by a number of metrics. The area under the curve (AUC) represents the overall accuracy of a model, with a value approaching 1.0 indicating a high sensitivity and high specificity [17]. A model with an AUC of 0.5 represents a test with zero discrimination. AUC metrics are calculated from typical ROC curves; these are plots of the (1 − SPE) values on the x-axis against the SEN values on the y-axis for all possible cutoff points. Sensitivity and specificity, and thus the AUC, are good indicators of the validity of a method but do not measure its predictive value [18].
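As a minimal sketch of how an AUC value relates to a ranked list, the following Python function computes the AUC from the ranks of the active compounds. It assumes that rank 1 is the best-scored compound and that there are no ties; the function name `auc_from_active_ranks` is illustrative and does not come from the original work. For each active it counts how many inactives are ranked ahead of it, which gives the fraction of (active, inactive) pairs that are ordered correctly.

```python
def auc_from_active_ranks(active_ranks, N):
    """AUC of a ranked list of N compounds (rank 1 = best score, no ties).

    active_ranks holds the 1-based ranks of the n active compounds.
    """
    ranks = sorted(active_ranks)
    n = len(ranks)
    # For the i-th active (1-based, in rank order), r - i inactives are ranked ahead of it.
    inactives_ahead = sum(r - i for i, r in enumerate(ranks, start=1))
    # Fraction of (active, inactive) pairs in which the active is ranked first.
    return 1.0 - inactives_ahead / (n * (N - n))


perfect = auc_from_active_ranks(range(1, 101), N=10000)      # all actives on top
worst = auc_from_active_ranks(range(9901, 10001), N=10000)   # all actives at the bottom
```

A perfect ranking gives 1.0 and a fully inverted ranking gives 0.0, in line with the interpretation of the AUC given above.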

The AUC is a metric that describes the overall quality of a model. In practical virtual screening experiments, however, it is typical to score each molecule according to a value proposed by the model, and to rank these molecules in decreasing order based on these calculated values. It is customary to define a cutoff threshold χ that separates predicted actives (all compounds along the ‘top’ side of this ranked list) from predicted non-actives (all compounds along the ‘bottom’ side of the ranked list) (see Fig. 1). The cutoff threshold χ is defined as the fraction of compounds selected:

$$\chi = N_{s} /N$$

(1)

with N_s being the number of compounds in the selection set (the predicted actives) and N being the total number of compounds in the entire dataset. The majority of metrics, including all metrics in this paper, are dependent on the value of this cutoff criterion χ since this criterion defines which compounds are predicted to be active and non-active.

Apart from the N_s and N variables, two other definitions are used in the following sections: the number of true active compounds in the selection set, defined as n_s, and the number of true active compounds in the entire dataset, defined as n. Finally, the prevalence of actives R_a in the entire dataset can be defined as:

$$R_{a} = n/N$$

(2)

Definition and calculation of established metrics

The sensitivity of a model is defined as the ability of the model to correctly identify active compounds from all the actives in the screening set (also termed the true positive rate or TPR), while specificity refers to the ability of the model to correctly identify inactives from all inactives in the dataset at a given cutoff threshold χ:

$${SEN}\left( \chi \right) = TPR\left( \chi \right) = \frac{TP}{TP + FN} = \frac{{n_{s} }}{n}$$

(3)

$$SPE\left( \chi \right) = \frac{TN}{FP + TN} = \frac{{N - N_{s} - n + n_{s} }}{N - n}$$

(4)

In line with the true positive rate, one can also define a false positive rate FPR as the number of true inactives in the selection set in relation to the total number of inactives in the entire dataset:

$$FPR\left( \chi \right) = \frac{FP}{FP + TN} = \frac{{N_{s} - n_{s} }}{N - n}$$

(5)

Other well-established metrics include the precision and accuracy:

$$PRE\left( \chi \right) = \frac{TP}{TP + FP} = \frac{{n_{s} }}{{N_{s} }}$$

(6)

$$ACC\left( \chi \right) = \frac{TP + TN}{TP + TN + FP + FN} = \frac{{2n_{s} + N - N_{s} - n}}{N}$$

(7)
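The relations in Eqs. 3–7 can be sketched in Python as follows. The function names and the example counts are illustrative assumptions, not taken from the original work; the code simply maps the four counts N, N_s, n and n_s onto the confusion-matrix entries and then evaluates the standard metrics.

```python
def confusion_counts(N, N_s, n, n_s):
    """Return (TP, FP, FN, TN) from the four counts used in the text."""
    TP = n_s                    # actives inside the selection set
    FP = N_s - n_s              # inactives inside the selection set
    FN = n - n_s                # actives outside the selection set
    TN = N - N_s - n + n_s      # inactives outside the selection set
    return TP, FP, FN, TN


def basic_metrics(N, N_s, n, n_s):
    """Sensitivity, specificity, FPR, precision and accuracy (Eqs. 3-7)."""
    TP, FP, FN, TN = confusion_counts(N, N_s, n, n_s)
    return {
        "SEN": TP / (TP + FN),
        "SPE": TN / (FP + TN),
        "FPR": FP / (FP + TN),
        "PRE": TP / (TP + FP),
        "ACC": (TP + TN) / (TP + TN + FP + FN),
    }


# Hypothetical example: 10,000 compounds, 100 actives, 150 selected, 51 actives selected.
m = basic_metrics(N=10000, N_s=150, n=100, n_s=51)
```

Note that SPE and FPR always sum to 1, which is why ROC curves can be plotted against either (1 − SPE) or FPR.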

The enrichment factor is probably the most used metric in virtual screening and other fields as well. The EF at a given cutoff χ is calculated from the proportion of true active compounds in the selection set in relation to the proportion of true active compounds in the entire dataset:

$$EF\left( \chi \right) = \frac{TP/\left( {TP + FP} \right)}{\left( {TP + FN} \right)/\left( {TP + TN + FP + FN} \right)} = \frac{{N \times n_{s} }}{{n \times N_{s} }}$$

(8)

The enrichment factor is very intuitive and easy to understand, but it lacks a strong statistical background and has some drawbacks. These include the lack of a well-defined upper boundary [the EF(χ) can vary from 0 when there are no active compounds in the selection set (n_s = 0) up to 1/χ when all active compounds are located in the selection set (n_s = n); see Ref. [19] for the derivation], the dependency of the value on the ratio of active to inactive compounds in the dataset, and a pronounced ‘saturation effect’: when the actives saturate the early positions of the ranking list, the metric cannot get any higher, which makes it impossible to distinguish between good and excellent models [6].

To avoid the problems associated with the EF, a number of other metrics have been proposed. The first of these is the relative enrichment factor [8], a metric in which the problem associated with the saturation effect is fixed by considering the maximum EF achievable at the cutoff point:

$$REF\left( \chi \right) = \frac{{100 \times n_{s} }}{{\hbox{min} \left( {N \times \chi ,n} \right)}}$$

(9)

The REF has well-defined boundaries, ranging from 0 to 100, and is less subject to the saturation effect.
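The difference between the two enrichment factors can be illustrated with a short Python sketch of Eqs. 8 and 9. The function names and the example counts are assumptions made for illustration only. Since N × χ equals N_s by Eq. 1, the denominator of the REF is written directly as min(N_s, n).

```python
def enrichment_factor(N, N_s, n, n_s):
    """EF (Eq. 8): active fraction in the selection relative to the whole dataset."""
    return (N * n_s) / (n * N_s)


def relative_enrichment_factor(N, N_s, n, n_s):
    """REF (Eq. 9): percentage of the maximum number of actives that could be selected."""
    return 100.0 * n_s / min(N_s, n)   # N * chi equals N_s by Eq. 1


# Hypothetical dataset: N = 10,000 compounds, n = 100 actives, cutoff chi = 2% (N_s = 200).
ef_all = enrichment_factor(10000, 200, 100, 100)             # all actives recovered
ref_all = relative_enrichment_factor(10000, 200, 100, 100)
ef_part = enrichment_factor(10000, 200, 100, 60)             # 60 of 100 actives recovered
ref_part = relative_enrichment_factor(10000, 200, 100, 60)
```

With all actives recovered the EF equals 1/χ = 50, a ceiling that changes with the cutoff, whereas the REF reaches its fixed maximum of 100.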

The ROC enrichment metric is defined as the fraction of actives found when a given fraction of inactives has been found [9]:

$$ROCE\left( \chi \right) = \frac{{n_{s} /n}}{{\left( {N_{s} - n_{s} } \right)/\left( {N - n} \right)}} = \frac{{n_{s} \times \left( {N - n} \right)}}{{n \times \left( {N_{s} - n_{s} } \right)}}$$

(10)

The ROCE metric has been advocated by some researchers as a better approach to address early recovery [5, 9]. However, some issues still remain, such as the lack of a well-defined upper boundary [which is equal to 1/χ when TPR(χ) equals 1], a smaller but still noticeable saturation effect, and a statistical robustness that is less than desirable, as we will demonstrate later.
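A minimal Python sketch of Eq. 10 is given below. The function name and the example counts are illustrative assumptions. Returning infinity when the selection set contains no inactives is also an assumption of this sketch, since the ratio is undefined in that case; it makes the absence of a fixed upper boundary explicit.

```python
import math


def roc_enrichment(N, N_s, n, n_s):
    """ROCE (Eq. 10): true positive rate divided by false positive rate."""
    false_positives = N_s - n_s
    if false_positives == 0:
        # Assumption of this sketch: report infinity when no inactive is selected.
        return math.inf
    return (n_s * (N - n)) / (n * false_positives)


# Hypothetical example: TPR = 51/100 and FPR = 99/9900, giving a ratio of 51.
value = roc_enrichment(N=10000, N_s=150, n=100, n_s=51)
# A selection set that contains only actives has no finite ROCE value.
unbounded = roc_enrichment(N=10000, N_s=50, n=100, n_s=50)
```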

Another metric often considered to measure classification performances is the correct classification rate [12], defined as the percentage of instances correctly classified:

$$CCR\left( \chi \right) = \frac{1}{2}\left[ {\frac{TP}{TP + FN} + \frac{TN}{TN + FP}} \right] = \frac{1}{2}\left[ {\frac{{n_{s} }}{n} + \frac{{N - N_{s} - n + n_{s} }}{N - n}} \right]$$

(11)

The CCR is sometimes also called the balanced accuracy [20].
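Eq. 11 can be sketched in Python as the mean of sensitivity and specificity. The function name and the example counts are illustrative assumptions. The example also shows a direct consequence of the formula: when the selection set is smaller than the number of actives, the sensitivity cannot exceed N_s/n, which caps the CCR even for a flawless ranking.

```python
def correct_classification_rate(N, N_s, n, n_s):
    """CCR (Eq. 11): mean of sensitivity and specificity (balanced accuracy)."""
    sensitivity = n_s / n
    specificity = (N - N_s - n + n_s) / (N - n)
    return 0.5 * (sensitivity + specificity)


# Hypothetical flawless ranking at chi = 0.5%: all 50 selected compounds are active,
# but only half of the 100 actives fit into the selection set.
capped = correct_classification_rate(N=10000, N_s=50, n=100, n_s=50)
# The same flawless ranking at chi = 1%: the selection set holds exactly the 100 actives.
ideal = correct_classification_rate(N=10000, N_s=100, n=100, n_s=100)
```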

Matthews correlation coefficient has been advocated as a balanced measure that can be used on classes of different sizes [14]. The MCC is in essence a correlation coefficient between the measured and predicted classifications; it returns a coefficient of +1 in the case of a perfect prediction, 0 when no better than random prediction and −1 in cases of total disagreement between prediction and observation:

$$MCC\left( \chi \right) = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} }} = \frac{{N \times n_{s} - N_{s} \times n}}{{\sqrt {N_{s} \times n \times \left( {N - n} \right) \times \left( {N - N_{s} } \right)} }}$$

(12)
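As a sanity check on Eq. 12, the sketch below evaluates both forms of the MCC, once from the confusion-matrix entries and once from the counts N, N_s, n and n_s, and confirms numerically that they agree. The function names and the example counts are illustrative assumptions.

```python
import math


def mcc_from_confusion(TP, FP, FN, TN):
    """MCC from the confusion-matrix entries (first form of Eq. 12)."""
    numerator = TP * TN - FP * FN
    denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return numerator / denominator


def mcc_from_counts(N, N_s, n, n_s):
    """MCC from the dataset counts (second form of Eq. 12)."""
    numerator = N * n_s - N_s * n
    denominator = math.sqrt(N_s * n * (N - n) * (N - N_s))
    return numerator / denominator


# Hypothetical example: N = 10,000, N_s = 150, n = 100, n_s = 51,
# i.e. TP = 51, FP = 99, FN = 49, TN = 9801.
a = mcc_from_confusion(TP=51, FP=99, FN=49, TN=9801)
b = mcc_from_counts(N=10000, N_s=150, n=100, n_s=51)
```

Both forms are undefined when the selection set is empty or covers the whole dataset, because the denominator is then zero.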

The last metric whose performance is compared with that of the power metric developed here is Cohen’s kappa coefficient [21–24]:

$$CKC\left( \chi \right) = 1 - \frac{{1 - \frac{TP + TN}{TP + TN + FP + FN}}}{{1 - \frac{{\left( {TP + FN} \right)\left( {TP + FP} \right) + \left( {FP + TN} \right)\left( {FN + TN} \right)}}{{\left( {TP + TN + FP + FN} \right)^{2} }}}} = 1 - \frac{{N \times n + N \times N_{s} - 2 \times n_{s} \times N}}{{N \times n + N \times N_{s} - 2 \times n \times N_{s} }}$$

(13)
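The two forms of Cohen’s kappa can be checked against each other with the Python sketch below. The first function uses the usual definition, with the chance agreement normalized by the square of the total number of compounds; the second uses the expression in N, N_s, n and n_s. The function names and the example counts are illustrative assumptions.

```python
def kappa_from_confusion(TP, FP, FN, TN):
    """Cohen's kappa from observed and chance agreement."""
    total = TP + TN + FP + FN
    p_observed = (TP + TN) / total
    # Chance agreement: product of the marginals, normalized by total squared.
    p_chance = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / total ** 2
    return 1.0 - (1.0 - p_observed) / (1.0 - p_chance)


def kappa_from_counts(N, N_s, n, n_s):
    """Cohen's kappa written in the dataset counts (second form of Eq. 13)."""
    return 1.0 - (N * n + N * N_s - 2 * n_s * N) / (N * n + N * N_s - 2 * n * N_s)


# Hypothetical example: N = 10,000, N_s = 150, n = 100, n_s = 51.
k_confusion = kappa_from_confusion(TP=51, FP=99, FN=49, TN=9801)
k_counts = kappa_from_counts(N=10000, N_s=150, n=100, n_s=51)
```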

Derivation of a new metric: the power metric

In virtual screening studies, we can take the assumption that all compounds are inactive as the null hypothesis, and the assumption that some compounds are active as the alternative hypothesis. The statistical power, also known as sensitivity or recall, is equal to the true positive rate.

However, the statistical power alone does not include information about the distribution of negative instances or the effect size. Therefore, a metric based on statistical power and suited for applications in the field of virtual screening should incorporate information about the negative instances as well. Ideally, a good virtual screening method must be able to predict true positive instances well while keeping the false positive rate small. This translates into a metric that combines the TPR with the false positive rate:

$$'net\;power '\;(\chi ) = TPR(\chi ) - FPR(\chi )$$

(14)

Graphically, the ‘net power’ is the area of the distribution of positive instances or the alternative hypothesis, minus the area of the distribution of negative instances or the null hypothesis (Fig. 2).

The metric is not new; it has been developed independently several times in the past. Its origin can be traced back to the seminal paper of Peirce [25] with his ‘science of the method’ [26]. More than 60 years later, it was proposed again by Youden as Youden’s index (Y’J) [27]. Youden’s index is often used in conjunction with the ROC curve as a criterion for selecting the optimum cutoff point [28]. Once more, over 50 years later in 2003, it was proposed again by Powers, who called it ‘informedness’ [10].

Despite the success of this metric in evaluating the prediction power of a method, it is not entirely appropriate for virtual screening studies because it lacks the early recovery capabilities that are very desirable in any virtual screening application. Consider, for instance, a database of 10,000 compounds of which 1% are active compounds. In this thought experiment, different methods can yield identical Youden’s indices calculated from different TPR and FPR values. Consider two methods that each produce a Youden’s index of 0.5, the first characterized by a TPR = 0.9 and a FPR = 0.4, and the second by a TPR = 0.51 and a FPR = 0.01. In the case of the first method, 4050 compounds will be marked as ‘hits’, of which only 90 are true actives (or 2.2% of the selected compounds). For the second method, however, only 150 compounds are flagged as ‘hits’, of which 51 are true actives (or 34% of the selected compounds). Obviously, for virtual screening applications, the second method provides a better early recovery rate, since only 1.5% of the original dataset needs to be tested in order to recover 51% of all active compounds.
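The arithmetic of this thought experiment can be reproduced with a few lines of Python. The helper `screening_outcome` and its dictionary keys are illustrative names, not part of the original work; the inputs (10,000 compounds, 1% actives, and the two TPR/FPR pairs) are the ones stated above.

```python
def screening_outcome(N, prevalence, tpr, fpr):
    """Translate a TPR and FPR into hit counts for a dataset of N compounds."""
    n = round(N * prevalence)        # number of true actives
    tp = round(tpr * n)              # actives flagged as hits
    fp = round(fpr * (N - n))        # inactives flagged as hits
    selected = tp + fp
    return {
        "youden": tpr - fpr,
        "selected": selected,
        "true_actives": tp,
        "hit_rate": tp / selected,             # fraction of the hits that are active
        "fraction_screened": selected / N,     # fraction of the dataset to be tested
    }


method_1 = screening_outcome(N=10000, prevalence=0.01, tpr=0.90, fpr=0.40)
method_2 = screening_outcome(N=10000, prevalence=0.01, tpr=0.51, fpr=0.01)
```

Both dictionaries carry the same `youden` value, while `selected`, `hit_rate` and `fraction_screened` differ markedly between the two methods.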

Normalization of the ‘net power’ metric by dividing by the sum of the true positive and false positive rates introduces a bias towards early recovery into the ‘net power’ metric. This difference-over-the-sum normalized ‘net power’ expresses the dominance of the true positive rate over the false positive rate among those instances predicted as positive:

$$normalized\;'net\;power' = \frac{TPR\left( \chi \right) - FPR\left( \chi \right)}{TPR\left( \chi \right) + FPR\left( \chi \right)}$$

(15)

The metric ranges from −1 to +1 and can easily be modified to range from 0 to +1 by adding 1 to the metric and dividing by 2. We call this new metric the power metric (PM); it is defined as follows:

$${PM}\left( \chi \right) = \frac{{\left( {\frac{TPR\left( \chi \right) - FPR\left( \chi \right)}{TPR\left( \chi \right) + FPR\left( \chi \right)} + 1} \right)}}{2} = \frac{TPR\left( \chi \right)}{TPR\left( \chi \right) + FPR\left( \chi \right)} = \frac{{n_{s} \times N - n \times n_{s} }}{{n_{s} \times N - 2 \times n \times n_{s} + n \times N_{s} }}$$

(16)
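A minimal Python sketch of Eq. 16 is given below, once in terms of the rates and once in terms of the counts N, N_s, n and n_s. The function names are illustrative assumptions, and raising an error for an empty selection set (where TPR + FPR = 0 and the metric is undefined) is a choice made for this sketch only. The example reuses the two TPR/FPR pairs with identical Youden’s index from the thought experiment above.

```python
def power_metric(tpr, fpr):
    """PM (Eq. 16) from the true positive and false positive rates."""
    if tpr + fpr == 0:
        # Assumption of this sketch: treat an empty selection set as an error.
        raise ValueError("PM is undefined when TPR + FPR = 0")
    return tpr / (tpr + fpr)


def power_metric_from_counts(N, N_s, n, n_s):
    """PM (Eq. 16) written in the dataset counts."""
    return (n_s * N - n * n_s) / (n_s * N - 2 * n * n_s + n * N_s)


pm_method_1 = power_metric(tpr=0.90, fpr=0.40)   # Youden's index 0.5
pm_method_2 = power_metric(tpr=0.51, fpr=0.01)   # Youden's index 0.5
pm_counts = power_metric_from_counts(N=10000, N_s=150, n=100, n_s=51)
```

Although both methods share a Youden’s index of 0.5, the PM is about 0.69 for the first method and about 0.98 for the second, so the metric rewards the method with the better early recovery.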

Probability distribution function to evaluate the metrics

In order to evaluate the performance of several metrics used in the field of virtual screening, we used the probability distribution function approach as suggested by Truchon and Bayly to build hypothetical models of different qualities [6]. For a typical virtual screening study with N compounds of which n are active, we generated the ranks of these active compounds according to the exponential distribution as proposed by Truchon and Bayly [6]:

$$X_{i} = \frac{ - 1}{\lambda }ln\left( {1 - U_{i} \left( {1 - e^{ - \lambda } } \right)} \right)$$

(17)

The generated real number X_i corresponds to the relative position of active compound i, and U_i is a pseudo-random number with values between 0 and 1. In this exponential distribution, the λ parameter represents the model quality (lower λ values correspond to poorer models and larger λ values to better models). The number X_i is transformed into an integer rank r_i that falls between 1 and N:

$$r_{i} = int\left( {N \times X_{i} + 0.5} \right)$$

(18)

No ties were allowed and each active compound occupies one unique position. In cases when a clash occurred, a new random number was generated. In our simulations we used values of λ equal to 1, 2, 5, 10, 20 and 40. Visualization of the quality of these models is given in Fig. 3.

To illustrate the model generation process by example, consider a model with quality λ = 20 that consists of n = 100 active compounds out of a total of N = 10,000 compounds. To generate the relative rankings of these 100 active compounds, Eq. 17 was called 100 times, each time with a different random number U_i. Using Eq. 18, the 100 generated X_i numbers are then converted into 100 rankings r_i with N set to 10,000. These 100 rankings are the absolute positions of the active compounds; the remaining 9900 ranks (10,000 − 100 = 9900) are those of the inactive compounds.
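The model generation process described above can be sketched in Python as follows. This is not the authors’ code: it uses only the standard library, the function name `sample_active_ranks` is illustrative, and redrawing a rank of 0 (which Eq. 18 can produce for very small X_i) is an assumption of this sketch. Clashes are handled as described in the text, by drawing a new random number.

```python
import math
import random


def sample_active_ranks(n, N, lam, rng):
    """Draw n unique ranks in 1..N for the actives following Eqs. 17 and 18."""
    ranks = set()
    while len(ranks) < n:
        u = rng.random()                                            # U_i in [0, 1)
        x = -math.log(1.0 - u * (1.0 - math.exp(-lam))) / lam       # Eq. 17
        r = int(N * x + 0.5)                                        # Eq. 18
        # Assumption of this sketch: a rank outside 1..N is discarded and redrawn.
        if 1 <= r <= N:
            ranks.add(r)      # a clash leaves the set unchanged, so a new draw follows
    return sorted(ranks)


# Example from the text: lambda = 20, n = 100 actives, N = 10,000 compounds.
active_ranks = sample_active_ranks(n=100, N=10000, lam=20.0, rng=random.Random(0))
```

All ranks in 1..N that are not returned belong to the inactive compounds. For λ = 20 most of the sampled ranks fall in the first few hundred positions, since the median of the underlying distribution lies near X = ln(2)/20 ≈ 0.035.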

In order to evaluate the quality of the PM metric and to compare its behavior to the other metrics, a large number of datasets were generated and analyzed. The total number of compounds N, number of actives n, model quality λ and cutoff parameter χ were varied. Each simulation was repeated 10,000 times and the results were analyzed by inspecting the variations of mean and standard deviation (STD) of the metrics as a function of the number of actives and total compounds. The eleven enrichment-type metrics that were analyzed were the PM, EF, ROCE, CCR, REF, MCC, CKC, together with the standard PRE, ACC, SEN and SPE metrics.
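A reduced version of such a simulation can be sketched as follows for the PM alone. The code is illustrative rather than the authors’ implementation: it relies on the Python standard library instead of NumPy, uses 200 repeats instead of 10,000 to keep the run time short, and all function names, the seed and the handling of rank 0 are assumptions of this sketch.

```python
import math
import random
import statistics


def sample_active_ranks(n, N, lam, rng):
    """Unique ranks in 1..N for the actives (Eqs. 17 and 18, redraw on clash)."""
    ranks = set()
    while len(ranks) < n:
        x = -math.log(1.0 - rng.random() * (1.0 - math.exp(-lam))) / lam
        r = int(N * x + 0.5)
        if 1 <= r <= N:       # assumption: discard rank 0 and draw again
            ranks.add(r)
    return ranks


def power_metric_at_cutoff(active_ranks, N, chi):
    """PM (Eq. 16) for the top chi fraction of a ranked list."""
    n = len(active_ranks)
    N_s = int(round(chi * N))                 # size of the selection set (Eq. 1)
    if N_s < 1:
        raise ValueError("cutoff selects no compounds")
    n_s = sum(1 for r in active_ranks if r <= N_s)
    tpr = n_s / n
    fpr = (N_s - n_s) / (N - n)
    return tpr / (tpr + fpr)


def simulate_pm(N, n, lam, chi, repeats, seed):
    """Mean and standard deviation of the PM over repeated random models."""
    rng = random.Random(seed)
    values = [
        power_metric_at_cutoff(sample_active_ranks(n, N, lam, rng), N, chi)
        for _ in range(repeats)
    ]
    return statistics.mean(values), statistics.stdev(values)


mean_good, std_good = simulate_pm(N=10000, n=100, lam=20.0, chi=0.01, repeats=200, seed=1)
mean_poor, std_poor = simulate_pm(N=10000, n=100, lam=1.0, chi=0.01, repeats=200, seed=1)
```

The same loop can be extended to the other metrics by replacing `power_metric_at_cutoff` with the corresponding formula and by varying N, n, λ and χ.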

All calculations were performed under Python 2.7 using Numpy and Scipy [29]. The IPython notebook [30] was used as programming environment and figures were generated with Matplotlib [31]. MarvinSketch was used for drawing chemical structures [32].

Results and discussion

Dependency on model quality

One of the key aspects of a suitable metric is that its value depends on the model quality. Table 1 shows the dependency of the different metrics on the model quality parameter λ. Nearly all metrics depend on model quality: the ROCE, EF, REF, MCC, CKC, SEN and PRE metrics show an approximately tenfold increase when moving from a poor model with quality λ = 2 to a good model with quality λ = 40, while the PM metric roughly doubles (from PM = 0.5 for a poor model to 0.98 for a good model; Table 1). The accuracy and specificity metrics are not influenced by the model quality λ or by the cutoff value χ; both fluctuate around a value of 0.97–1.00 irrespective of the underlying model quality or applied threshold cutoff. The CCR metric reaches its limit at 0.75 ± 0.02 for an extremely good model quality of λ = 40 in combination with a threshold cutoff χ of 2% (for a model with 100 actives out of a total of 10,000 compounds, a model quality of λ = 40 corresponds to an AUC of 97.25%, as compared to an AUC of 99.5% for the ideal case). This is not what one would expect from a metric meant to separate good models from poor models. Furthermore, the PM metric seems to be less influenced by the applied cutoff parameter χ, since the PM metric for a good model (λ = 40) at the different cutoffs of 0.5, 1 and 2% remains largely unchanged (at approximately 0.98; see Table 1), while an increase is seen for the CCR metric. All metrics other than PM, SPE and ACC appear to be more dependent on the applied cutoff threshold χ (indicated by the shifts in the values and by the larger variations in the calculated metrics; Table 1), making it more difficult to define an appropriate metric value for identifying proper virtual screening models. Starting with models of reasonable quality, and up to models of higher quality (λ ≥ 10), the PM is calculated to vary between 0.9 and 1.0 with a relative standard deviation of less than 10%. For the other metrics (except the CCR, ACC and SPE metrics), this relative standard deviation is in most instances larger than 10%.

Table 1 Dependency on the model quality parameter λ using models generated from datasets with 100 actives (n) on 10,000 compounds in total (N)

Dependency on the ratio of actives to total number of compounds

The influence of the R_a value, calculated from the ratio of the number of actives n to the total number of compounds N, on the different metrics is given in Table 2. For the different model qualities (a poor model with λ = 1 or a good model with λ = 20) and different cutoff values (χ = 1 or 10%), there is a significant dependency of the REF, PRE and ACC metrics on the R_a value. The EF, CKC, SEN and ROCE metrics are not very sensitive to the R_a value when applied to poor models (λ = 1), but show more dependency on the R_a ratio when applied to good models (λ = 20). In contrast, the REF is very sensitive to the R_a value when used on poor models (λ = 1), but is not dependent on the R_a value when applied to a good model in combination with a large cutoff value (χ = 1%; Table 2). The PM and CCR metrics, on the other hand, remain largely insensitive to the R_a value, except when the PM metric is applied to a very poor model (λ = 1) in combination with a small cutoff threshold value (χ = 1%). Again, good models all have PM values ≥ 0.9 with small variations, and are independent of the number of actives in relation to the total number of compounds. The combination of a high model quality of λ = 20 with a cutoff threshold of χ = 1%, applied to a database with n = 50 actives out of a total of N = 5000 compounds, corresponds to a virtual screening situation characterized by a high true positive and high true negative rate. It is therefore surprising that for the CCR metric a value of 0.58 ± 0.02 is calculated, while for the PM metric a more intuitive value of 0.95 ± 0.02 is found (Table 2). Increasing the cutoff threshold to 10% improves the calculated CCR value to 0.88 ± 0.02 and decreases the PM value from 0.95 ± 0.02 to 0.90 ± 0.01, again in line with what one would expect from considering the true positive and true negative rates in this situation.

Table 2 Dependency on the R_a value

Dependency on the cutoff threshold χ

The dependency of the different metrics on the applied cutoff value χ is given in Table 3. This dependency was evaluated using models with n = 250 active compounds in a dataset of N = 10,000 compounds in total, and at five different cutoff values χ (0.5, 1, 2.5, 5 and 10%) for both a poor and high quality model (λ = 1 and 20, respectively). A significant dependency on the cutoff χ is observed for the REF and SEN metrics, increasing their values with increasing cutoff values. A similar behavior is observed for the CCR, MCC and CKC metrics when applied to the high quality model situation (λ = 20). Interestingly, the calculated REF metric values remain constant up to a cutoff of 2.5%, but at higher cutoff values this metric increases significantly. It is not surprising that this turning point in metric behavior is observed at a cutoff value of 2.5%, since this corresponds to a selection set of exactly 250 compounds when applied to a dataset of 10,000 compounds with 250 actives mixed into it. In case of a high quality model, this translates to a situation with maximum rates of true positives and true negatives. Focusing on the EF, ROCE, CCR, SPE, ACC and PM metrics, their values are quite constant over the different cutoff values in the case of a bad model quality, but a significant drift is observed for the EF, CCR and ROCE metrics in case of a good model quality. This shift is again observed at a χ cutoff value larger than 2.5%. A similar drift is not observed for the PM metric that, together with the CCR metric, also has the smallest relative standard deviations (Table 3).

Table 3 Dependency on the χ cutoff value using models generated from datasets with 250 actives (n) on 10,000 compounds in total (N)

Dependency on both model quality λ and cutoff threshold χ

A direct comparison of the variation of the values of the five most commonly used metrics (CCR, ROCE, MCC, REF and CKC) with those of the PM, as a function of both model quality λ and cutoff threshold χ, is provided in Fig. 4. Comparing the results of the PM and CCR metrics, both metric values increase with increasing model quality, but the PM metric seems to be less dependent on the applied cutoff threshold than the CCR metric (in fact, the CCR metric value increases with increasing cutoff thresholds, while the opposite behavior is observed for the PM metric). The CCR metric reaches its highest values at larger cutoff thresholds in combination with high model qualities, making it less suitable for early-recognition problems. A similar conclusion can be drawn for the MCC and CKC metrics, as in both cases maximum values are obtained near a cutoff threshold χ that is equal or close to the fraction of true actives within the entire dataset (in the example of Fig. 4, this is 2.5%). Focusing on the ROCE metric, maximum values are calculated when models of high quality are combined with cutoff thresholds χ that are smaller than the fraction of true actives within the entire dataset of compounds, in this case 2.5%. At very low cutoff thresholds, the ROCE metric decreases again. A main disadvantage of the ROCE metric is the lack of a well-defined upper boundary, which makes it difficult to compare the quality of underlying models and applied cutoff thresholds. Finally, the REF metric is not a continuous function but shows a discontinuity in its metric value along a threshold cutoff value of 2.5%, a value that is equal to the fraction of true actives in the dataset. At this cutoff threshold value and for all model qualities, a minimum in metric value is observed, so that for any given model quality under consideration two maxima are found: a first optimum at a cutoff threshold smaller than 2.5%, and a second optimum that is located at a cutoff threshold χ much larger than 2.5%.

Based on these observations, it can be concluded that the CCR, MCC and CKC metrics are all less suitable for early-recognition problems; for these problems the PM and ROCE metrics are better suited. The REF metric might also be an option to some extent, but some caution is warranted when it is used in combination with cutoff thresholds χ that are equal to or larger than the fraction of true actives in the entire dataset. In these cases an increase in the REF metric is observed, which makes it less suitable for early-recognition problems. As already mentioned, the main disadvantage of the ROCE metric is the lack of a well-defined upper boundary, and for this reason the PM metric seems to possess powerful early-recognition properties and might be one of the preferred metrics for evaluating virtual screening models.

Conclusions

The power metric PM as described in this paper is a statistically solid metric with little sensitivity to the ratio of actives to the total number of compounds (the R_a value; see Table 2) and little sensitivity to the cutoff threshold parameter χ (Table 3). The metric is dependent on the underlying model quality, in the sense that PM values around 0.5 are calculated for poor to random models, and values between 0.9 and 1.0 for high quality models. It is statistically robust in the sense that the calculated standard deviations are small and largely insensitive to the applied threshold cutoff value χ.

Abbreviations

ACC: accuracy
AUC: area under the curve
CCR: correct classification rate
CKC: Cohen’s kappa coefficient
EF: enrichment factor
MCC: Matthews correlation coefficient
PM: power metric
PRE: precision
QSAR: quantitative structure–activity relationship
REF: relative enrichment factor
ROC: receiver operating characteristic
ROCE: ROC enrichment
SEN: sensitivity
SPE: specificity
STD: standard deviation
TNR: true negative rate
TPR: true positive rate

References

Cross JB, Thompson DC, Rai BK, Baber JC, Fan KY, Hu Y, Humblet C (2009) Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. J Chem Inf Model 49:1455–1474
Article CAS Google Scholar
Kirchmair J, Markt P, Distinto S, Wolber G, Langer T (2008) Evaluation of the performance of 3D virtual screening protocols: RMSD comparisons, enrichment assessments, and decoy selection—what can we learn from earlier mistakes? J Comput Aided Mol Des 22:213–228
Article CAS Google Scholar
Taminau J, Thijs G, De Winter H (2008) Pharao: pharmacophore alignment and optimization. J Mol Graph Model 27:161–169
Article CAS Google Scholar
Geppert H, Vogt M, Bajorath J (2010) Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. J Chem Inf Model 50:205–216
Article CAS Google Scholar
Nicholls A (2008) What do we know and when do we know it? J Comput Aided Mol Des 22:239–255
Article CAS Google Scholar
Truchon J-F, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem. J Chem Inf Model 47:488–508
Article CAS Google Scholar
Fecher U, Schneider G (2004) Evaluation of distance metrics for ligand-based similarity searching. ChemBioChem 5:538–540
Article Google Scholar
von Korff M, Freyss J, Sander T (2009) Comparison of ligand- and structure-based virtual screening on the DUD data set. J Chem Inf Model 49:209–231
Article Google Scholar
Nicholls A (2014) Confidence limits, error bars and method comparison in molecular modeling. Part 1: the calculation of confidence intervals. J Comput Aided Mol Des 28:887–918
Article CAS Google Scholar
Powers DMW (2011) Evaluation: from precision, recall and F-score to ROC, informedness, markedness & correlation. J Mach Learn Technol 2:37–63
Google Scholar
Fawcelt T (2006) An introduction to ROC analysis. Pattern Recogn Lett 2006(27):861–874
Article Google Scholar
Fleiss JL (1981) Statistical methods for rates and proportions, 2nd edn. Wiley, New York
Google Scholar
Brodersen KH, Ong CS, Stephan KE, Buhmann JM. (2010) The balanced accuracy and its posterior distribution. In: Proceedings of the 20th international conference on pattern recognition, pp 3121–3124
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA Protein Struct 405:442–451
Article CAS Google Scholar
Smeeton NC (1985) Early history of the kappa statistic. Biometrics 41:795
Google Scholar
Viera AJ, Garrett JM (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37:360–363
Google Scholar
Hawkins PCD, Warren GL, Skillman AG, Nicholls A (2008) How to do an evaluation: pitfalls and traps. J Comput Aided Mol Des 22:179–190
Article CAS Google Scholar
Altman DG, Bland JM (1994) Diagnostic tests 2: predictive values. Brit. Med. J. 309:102
Article CAS Google Scholar
Inserting equation 1 into equation 8 gives $EF\left( \chi \right) = \frac{1}{\chi }\frac{{n_{s} }}{n}$; hence EF(χ) can vary from 0 in the case that n _s equals 0, to 1/χ in the case that n _s equals n
Hardison NE, Fanelli TJ, Dudek SM, Reif DM, Ritchie MD, Motsinger-Reif AA (2008) A balanced accuracy fitness function leads to robust analysis using grammatical evolution neural networks in the case of class imbalance. Genet Evol Comput Conf 2008:353–354
Google Scholar
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46
Article Google Scholar
Ben-David A (2008) About the relationship between ROC curves and Cohen’s kappa. Eng Appl Artif Intell 21:874–882
Article Google Scholar
Ben-David A (2008) Comparison of classification accuracy using Cohen’s weighted kappa. Expert Syst Appl 34:825–832
Article Google Scholar
Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22:249–254
Peirce CS (1884) The numerical measure of the success of predictions. Science 4:453–454
Baker SG, Kramer BS (2007) Peirce, Youden, and receiver operating characteristic curves. Am Stat 61:343–346
Youden WJ (1950) Index for rating diagnostic tests. Cancer 3:32–35
Schisterman EF, Perkins NJ, Liu A, Bondell H (2005) Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology 16:73–81
van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13:22–30
Pérez F, Granger BE (2007) IPython: a system for interactive scientific computing. Comput Sci Eng 9:21–29
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95
MarvinSketch (version 15.10.26), calculation module developed by ChemAxon. http://www.chemaxon.com/products/marvin/marvinsketch/

Authors’ contributions

JCDL and FMDS: original idea, manuscript writing and programming; AMJ: original idea; HDW: programming, interpretations and writing of the manuscript; KA: general supervision. All authors have read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Funding

Julio Cesar Dias Lopes received a fellowship from the Brazilian research agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) within the Science Without Borders program.

Author information

Authors and Affiliations

NEQUIM - Chemoinformatics Group, Departamento de Quimica, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Julio Cesar Dias Lopes, Fábio Mendes dos Santos & Andrelly Martins-José
Medicinal Chemistry Group, Department of Pharmaceutical Sciences, University of Antwerp, Campus Drie Eiken, Building A, Universiteitsplein 1, 2610, Wilrijk, Antwerp, Belgium
Koen Augustyns & Hans De Winter

Authors

Julio Cesar Dias Lopes
Fábio Mendes dos Santos
Andrelly Martins-José
Koen Augustyns
Hans De Winter

Corresponding author

Correspondence to Hans De Winter.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Lopes, J.C.D., dos Santos, F.M., Martins-José, A. et al. The power metric: a new statistically robust enrichment-type metric for virtual screening applications with early recovery capability. J Cheminform 9, 7 (2017). https://doi.org/10.1186/s13321-016-0189-4

Received: 03 October 2016
Accepted: 30 December 2016
Published: 02 February 2017
DOI: https://doi.org/10.1186/s13321-016-0189-4
