Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation

Background: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge of the optimal QSAR model. Prediction errors (PE) are frequently used both to select and to assess the models under study. Reliable estimation of prediction errors is challenging, especially under model uncertainty, and requires independent test objects. These test objects must not be involved in model building or in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.

Methods: Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided.

Results: The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters of the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters of the outer loop mainly influence the variability of the resulting prediction error estimate.

Conclusions: Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Double cross-validation provided a more realistic picture of model quality than a single test set and should therefore be preferred.

Electronic supplementary material: The online version of this article (doi:10.1186/s13321-014-0047-1) contains supplementary material, which is available to authorized users.
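The structure of the procedure can be illustrated with a short sketch. The following is a minimal illustration of double (nested) cross-validation, assuming scikit-learn-style estimators; the candidate models (number of retained principal components), the fold counts and the function name double_cv are illustrative choices, not the parameterization studied in the paper.

```python
# Minimal sketch of double (nested) cross-validation for a regression model.
# Assumptions: scikit-learn is available; the candidate models, fold counts and
# names are illustrative placeholders, not the settings used in the paper.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def double_cv(X, y, candidate_q=(1, 2, 3, 4, 5), n_outer=5, n_inner=10, seed=0):
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    outer_errors = []
    for train_idx, test_idx in outer.split(X):            # outer loop: PE estimation
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        inner_pe = {}
        for q in candidate_q:                              # inner loop: model selection
            fold_pe = []
            for cal_idx, val_idx in inner.split(X_tr):
                model = make_pipeline(PCA(n_components=q), LinearRegression())
                model.fit(X_tr[cal_idx], y_tr[cal_idx])
                fold_pe.append(mean_squared_error(y_tr[val_idx],
                                                  model.predict(X_tr[val_idx])))
            inner_pe[q] = np.mean(fold_pe)
        best_q = min(inner_pe, key=inner_pe.get)           # winner of the inner loop
        final = make_pipeline(PCA(n_components=best_q), LinearRegression())
        final.fit(X_tr, y_tr)                              # refit on the outer training set
        outer_errors.append(mean_squared_error(y_te, final.predict(X_te)))
    return np.mean(outer_errors)                           # PE estimated from test objects only
```

The essential point of the scheme is that the test objects of the outer loop never enter the inner-loop model selection.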


I. Bias-Variance Decomposition
In the simulation study, the composition of the prediction errors (model errors) was analysed by decomposing them into bias and variance contributions. The underlying theory is explained in the following.

MLR
Assume that the following relationship holds:

$$y = X_m b_m + X_o b_o + e, \qquad e \sim N(0, \sigma^2 I)$$

where $X_m$ are the selected model variables, $X_o$ are omitted but true variables, and $b_m$ and $b_o$ are the corresponding regression coefficients. Under the usual assumptions [1], the regression vector estimate for MLR can generally be expressed as:

$$\hat{b}_{MLR} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (Xb + e) \qquad (1)$$

Substituting $b = E[\hat{b}_{MLR}]$ in equation (1) and rearranging yields:

$$\hat{b}_{MLR} = E[\hat{b}_{MLR}] + (X^T X)^{-1} X^T e \qquad (2)$$

According to equation (2), different regression vector estimates scatter randomly around their expectation value; equation (2) therefore describes the random influences. Analogously, the MLR estimate for a given variable subset $m$ is exposed to randomness:

$$\hat{b}_{m,MLR} = E[\hat{b}_{m,MLR}] + (X_m^T X_m)^{-1} X_m^T e \qquad (3)$$

The following definitions are introduced for simplicity:

$$A_m = (X_m^T X_m)^{-1} X_m^T, \qquad H_m = X_m A_m$$

Under the Gauss-Markov assumptions, MLR is known to yield unbiased estimates of the regression vector [2]. These assumptions are not necessarily satisfied under model uncertainty because relevant variables may be omitted [3]. If true variables are erroneously excluded, the estimates of the selected variables are likely to be biased, i.e. the coefficients of the selected variables are systematically over- or underestimated. This bias is also known as the omitted variable bias [3]. It depends on the correlation between the omitted and the included variables and can be derived for MLR as follows [4]:

$$E[\hat{b}_{m,MLR}] = b_m + A_m X_o b_o \qquad (4)$$

Thus, the regression vector estimate for a given variable subset can be calculated as:

$$\hat{b}_{m,MLR} = b_m + A_m X_o b_o + A_m e \qquad (5)$$

The model error (ME) describes the squared difference between the predicted and the true (noise-free) response. Evaluated at the $n$ calibration objects it reads:

$$ME = \frac{1}{n}\,\lVert X_m \hat{b}_{m,MLR} - (X_m b_m + X_o b_o)\rVert^2 \qquad (6)$$

Inserting equation (5) into equation (6) and using $H_m X_m = X_m$ yields:

$$ME = \frac{1}{n}\Bigl(\lVert (I - H_m) X_o b_o \rVert^2 + 2\, b_o^T X_o^T (H_m - I)^T H_m e + \lVert H_m e \rVert^2\Bigr) \qquad (7)$$

The first quadratic term on the right side of equation (7) is the systematic contribution caused by the omitted variables. Since $H_m$ is symmetric and idempotent, it follows that $(H_m - I)^T H_m = H_m^2 - H_m = 0$, so the cross term vanishes, and the expectation of the last (random) term is:

$$E\bigl[\lVert H_m e \rVert^2\bigr] = E\bigl[e^T H_m e\bigr] = \sigma^2\,\mathrm{tr}(H_m)$$

In the simulation study, bias and variance estimates were derived according to this bias-variance decomposition, i.e.

$$\widehat{\mathrm{bias}^2}(ME) = \frac{1}{n}\,\lVert (I - H_m) X_o b_o \rVert^2, \qquad \widehat{\mathrm{var}}(ME) = \frac{\sigma^2}{n}\,\mathrm{tr}(H_m),$$

where $b_m$, $b_o$ and $\sigma^2$ are known in the simulation setting.
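To make the omitted variable bias tangible, the following sketch checks equation (4) numerically by Monte Carlo simulation; the dimensions, coefficients, correlation structure and noise level are arbitrary illustrative choices, not those of the simulation models used in the paper.

```python
# Numerical check of the MLR omitted variable bias (illustrative sketch;
# all numbers below are placeholder assumptions, not the paper's setup).
import numpy as np

rng = np.random.default_rng(1)
n, m, o = 200, 3, 2                                 # objects, selected vars, omitted vars
X_m = rng.normal(size=(n, m))
X_o = 0.5 * X_m[:, :o] + rng.normal(size=(n, o))    # omitted variables correlated with X_m
b_m = np.array([1.0, -2.0, 0.5])
b_o = np.array([1.5, -1.0])
sigma = 0.3

# theoretical bias term: (X_m^T X_m)^-1 X_m^T X_o b_o
bias_theory = np.linalg.solve(X_m.T @ X_m, X_m.T @ X_o @ b_o)

# Monte Carlo average of the MLR estimate over repeated noise realizations
estimates = []
for _ in range(5000):
    y = X_m @ b_m + X_o @ b_o + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(X_m.T @ X_m, X_m.T @ y))
bias_mc = np.mean(estimates, axis=0) - b_m

print("omitted variable bias (theory):     ", bias_theory)
print("omitted variable bias (Monte Carlo):", bias_mc)
```

The Monte Carlo average of the coefficient estimates deviates from the true $b_m$ by the correlation-dependent term of equation (4), as expected.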

PCR
A widely applied matrix decomposition is the singular value decomposition (SVD) [1]. The predictor matrix $X$ ($n$ rows and $p$ columns) can be decomposed according to the SVD as:

$$X = U S V^T \qquad (8)$$

where $U$ and $V$ contain the left and right singular vectors, the diagonal matrix $S$ contains the singular values in decreasing order, and $r$ is the maximum (mathematical) rank of the predictor matrix. The regression vector estimate for PCR can be described as follows [1]:

$$\hat{b}_{PCR} = V_q S_q^{-1} U_q^T y \qquad (9)$$

where $q$ ($q < r$) is the selected number of principal components and the subscript $q$ denotes restriction to the first $q$ singular vectors and singular values. The omission of principal components associated with negligibly small singular values often reduces the variance considerably; if the predictor matrix is ill-conditioned and almost singular, the variance reduction can be very large [1]. There is, however, a drawback, because the omission of principal components causes some bias [1,5].
Nevertheless, it is often reasonable to accept a small or moderate increase in bias for the benefit of variance reduction. The difficulty is to find a reasonable bias-variance tradeoff [1]. The following definitions are introduced for simplicity: the subscript $j$ denotes the omitted principal components ($U_j$, $S_j$, $V_j$), and for a given variable subset $m$ the SVD of $X_m$ is written as $X_m = U_m S_m V_m^T$. The expectation value of the PCR estimate for the full model can be described as:

$$E[\hat{b}_{PCR}] = V_q V_q^T\, b \qquad (10)$$

Analogously, the expectation value of the PCR estimate can be calculated for a given variable subset $m$ as:

$$E[\hat{b}_{m,PCR}] = V_{m,q} V_{m,q}^T\, b_m + V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o \qquad (11)$$

The following equation describes the bias of the PCR estimate for the full model [5]:

$$E[\hat{b}_{PCR}] - b = (V_q V_q^T - I)\, b = -V_j V_j^T\, b \qquad (12)$$

The bias of the PCR estimate for a given variable subset follows from equation (11):

$$E[\hat{b}_{m,PCR}] - b_m = -V_{m,j} V_{m,j}^T\, b_m + V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o \qquad (13)$$

According to equation (13), the bias due to rank approximation can be described for a specific variable subset as:

$$\mathrm{bias}_{rank} = -V_{m,j} V_{m,j}^T\, b_m \qquad (14)$$

In case of PCR the omitted variable bias can be described as:

$$\mathrm{bias}_{ov} = V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o \qquad (15)$$

Comparison of equation (15) with the MLR result (equation (4)) shows that MLR yields a larger omitted variable bias than PCR, since the omitted variable bias also depends on the number of selected principal components. Certainly, the PCR estimate is also exposed to random influences [5]:

$$\hat{b}_{PCR} = E[\hat{b}_{PCR}] + V_q S_q^{-1} U_q^T e \qquad (16)$$

The PCR estimate is exposed to random influences to a smaller extent than the MLR estimate, because the components with the smallest singular values, which inflate $S^{-1}$ most strongly, are omitted. For a given variable subset the random part is analogous:

$$\hat{b}_{m,PCR} = E[\hat{b}_{m,PCR}] + V_{m,q} S_{m,q}^{-1} U_{m,q}^T e \qquad (17)$$

Thus, the PCR estimate can be derived by combining equations (13) and (17):

$$\hat{b}_{m,PCR} = b_m - V_{m,j} V_{m,j}^T\, b_m + V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o + V_{m,q} S_{m,q}^{-1} U_{m,q}^T e \qquad (18)$$

Thus, the model error for external test data $X_t = [X_{t,m}, X_{t,o}]$ can be derived as:

$$ME = \frac{1}{n_t}\,\lVert X_{t,m}\hat{b}_{m,PCR} - (X_{t,m} b_m + X_{t,o} b_o) \rVert^2 \qquad (19)$$

Inserting equation (18) gives the error of the predicted response:

$$X_{t,m}\hat{b}_{m,PCR} - (X_{t,m} b_m + X_{t,o} b_o) = -X_{t,m} V_{m,j} V_{m,j}^T\, b_m + X_{t,m} V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o - X_{t,o} b_o + X_{t,m} V_{m,q} S_{m,q}^{-1} U_{m,q}^T e \qquad (20)$$

According to equation (20), the approximate bias and variance terms can be calculated from the first three (systematic) terms and the last (random) term, respectively. The different sources of bias can be estimated as follows: the term $-X_{t,m} V_{m,j} V_{m,j}^T\, b_m$ refers to the bias due to rank approximation; the term $-X_{t,o} b_o$ refers to the influence of the omitted variables on the prediction error estimates; the term $X_{t,m} V_{m,q} S_{m,q}^{-1} U_{m,q}^T X_o b_o$ relates to the bias due to poor model specification.
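As a complement to the equations above, the following sketch computes the PCR regression vector via the truncated SVD (equation (9)); the data set, the near-collinearity and the choice q = 4 are illustrative assumptions, not taken from the paper.

```python
# PCR regression vector via the truncated SVD (illustrative sketch; the data
# and the number of retained components q are placeholder assumptions).
import numpy as np

def pcr_coefficients(X, y, q):
    """Return the PCR regression vector using the first q principal components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U S V^T
    U_q, s_q, V_q = U[:, :q], s[:q], Vt[:q, :].T
    return V_q @ ((U_q.T @ y) / s_q)                    # b_hat = V_q S_q^-1 U_q^T y

# small synthetic example with an ill-conditioned predictor matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 1e-4 * rng.normal(size=50)          # nearly collinear columns
b_true = np.array([1.0, -1.0, 0.5, 2.0, 0.0])
y = X @ b_true + 0.1 * rng.normal(size=50)

b_mlr = np.linalg.lstsq(X, y, rcond=None)[0]            # full-rank (MLR) solution
b_pcr = pcr_coefficients(X, y, q=4)                     # drop the smallest singular value
print("MLR coefficients:", np.round(b_mlr, 2))
print("PCR coefficients:", np.round(b_pcr, 2))
```

Dropping the component with the smallest singular value stabilizes the coefficients of the nearly collinear columns, at the cost of the rank-approximation bias described by equation (14).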

TS-PCR
There is no overall pattern in the deviations. In the worst case, the 'oracle' prediction error is underestimated by 7%. The standard deviations, which are shown in the main body of the paper, indicate that the deviations can be attributed to random fluctuations.

[Figure: average bias terms, ave.bias(ME), for simulation model 1.]
Leave-multiple-out cross-validation (LMO: d = 80%) yielded lower prediction errors than Lasso. This was due to the fact that the number of irrelevant variables was high in the case of Lasso.
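For illustration only, an LMO design with d = 80% of the objects left out per split can be mimicked with repeated random splitting; the estimator, the data and the number of repetitions below are assumed placeholders, not the settings used in the study.

```python
# Leave-multiple-out (LMO) cross-validation with d = 80% of the objects left
# out per split, sketched as repeated random splits (all numbers are
# illustrative assumptions).
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.2 * rng.normal(size=60)

lmo = ShuffleSplit(n_splits=100, test_size=0.8, random_state=0)   # d = 80% left out
scores = cross_val_score(LinearRegression(), X, y, cv=lmo,
                         scoring="neg_mean_squared_error")
print("LMO (d=80%) mean squared prediction error:", -scores.mean())
```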