Cross-validation is dead. Long live cross-validation! Model validation based on resampling

Baumann, Knut

doi:10.1186/1758-2946-2-S1-O5

Volume 2 Supplement 1

5th German Conference on Cheminformatics: 23. CIC-Workshop

Oral presentation
Open access
Published: 04 May 2010


Journal of Cheminformatics volume 2, Article number: O5 (2010)

Cross-validation was originally invented to estimate the prediction error of a mathematical modelling procedure, and it can be shown that it does so with almost no bias. Nonetheless, there are numerous reports in the chemoinformatics literature that cross-validated figures of merit cannot be trusted and that a so-called external test set has to be used to estimate the prediction error of a mathematical model. In most cases where cross-validation fails to estimate the prediction error correctly, the failure can be traced back to its use as an objective function for model selection. Typically, each model has some meta-parameters that need to be tuned, such as the choice of the actual descriptors and the number of variables in a QSAR equation, the network topology of a neural net, or the complexity of a decision tree. In this case, the meta-parameter is varied and the cross-validated prediction error is determined for each setting. Finally, the setting that optimizes the cross-validated prediction error is chosen in an attempt to optimize the predictivity of the model. In these cases, however, cross-validation is no longer an unbiased estimator of the prediction error and may deviate grossly from the result of an external test set. It can be shown that the "amount" of model selection is directly related to the inflation of cross-validated figures of merit. Hence, the model selection step has to be separated from the step of estimating the prediction error. If this is done correctly, cross-validation (or resampling in general) retains its property of estimating the prediction error without bias. As a matter of fact, it can be shown that splitting the data into a training set and an external test set often estimates the prediction error less precisely than proper cross-validation. It is this variability of prediction errors, which depends on test set size, that causes seemingly paradoxical phenomena such as the so-called "Kubinyi's paradox" for small data sets.
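
The separation of model selection from error estimation described above can be illustrated with a small sketch. It is not taken from the presentation: the toy data, the single-descriptor majority-vote classifier, and all function names (`fit_majority`, `cv_error`, `select`, `nested_cv_error`) are assumptions chosen only to keep the example self-contained. The class labels are pure noise, so the expected error of any model on new data is 0.5 by construction. The "naive" figure is the cross-validated error of the descriptor that was picked because it minimized that same error; the "nested" figure repeats the whole selection inside each outer training set and scores the chosen descriptor on rows that played no part in the selection.

```python
import random

def fit_majority(x, y, train):
    """Learn, for each descriptor value (0 or 1), the majority class among the training rows."""
    rule = {}
    for v in (0, 1):
        labels = [y[i] for i in train if x[i] == v]
        # Ties and empty groups default to class 0.
        rule[v] = 1 if 2 * sum(labels) > len(labels) else 0
    return rule

def cv_error(x, y, rows, k):
    """k-fold cross-validated misclassification rate of one descriptor on `rows`."""
    errors = 0
    for f in range(k):
        test = rows[f::k]                      # every k-th row forms one fold
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        rule = fit_majority(x, y, train)
        errors += sum(rule[x[i]] != y[i] for i in test)
    return errors / len(rows)

def select(X, y, rows, k):
    """Model selection: return the descriptor with the lowest CV error on `rows`."""
    scores = [cv_error(x, y, rows, k) for x in X]
    best = min(range(len(X)), key=scores.__getitem__)
    return best, scores[best]

def nested_cv_error(X, y, k_outer, k_inner):
    """Outer loop estimates the error; the inner loop (inside `select`) does the selection."""
    rows = list(range(len(y)))
    errors = 0
    for f in range(k_outer):
        test = rows[f::k_outer]
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        best, _ = select(X, y, train, k_inner)  # selection sees the training rows only
        rule = fit_majority(X[best], y, train)
        errors += sum(rule[X[best][i]] != y[i] for i in test)
    return errors / len(rows)

# Toy data (an assumption for illustration): 40 objects, 200 random binary
# descriptors, and class labels that are unrelated to every descriptor.
rng = random.Random(0)
n, p = 40, 200
y = [rng.randint(0, 1) for _ in range(n)]
X = [[rng.randint(0, 1) for _ in range(n)] for _ in range(p)]

best, naive = select(X, y, list(range(n)), 5)
nested = nested_cv_error(X, y, 5, 5)
print(f"CV error of the selected descriptor (used for selection): {naive:.3f}")
print(f"Nested CV error (selection repeated inside each fold):    {nested:.3f}")
```

Because the naive figure is the minimum over 200 noisy estimates, it will typically fall well below 0.5, while the nested figure has no such systematic optimism. Increasing `p` in this sketch is one way to see how more model selection leads to more inflation of the naive figure.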

Author information

Authors and Affiliations

Institute of Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstr. 55, D-38106, Braunschweig, Germany
Knut Baumann


Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


About this article

Cite this article

Baumann, K. Cross-validation is dead. Long live cross-validation! Model validation based on resampling. J Cheminform 2 (Suppl 1), O5 (2010). https://doi.org/10.1186/1758-2946-2-S1-O5

