Cross-validation pitfalls when selecting and assessing regression and classification models

Table 2 Distribution of optimal parameters

PLS on AquaticTox	Number of components	10	11	12	13	14	15
PLS on AquaticTox	Frequency	1	9	9	23	6	2
Ridge regression on AquaticTox	Lambda	≤0.027	0.035	0.040	0.046	0.053	0.061	0.070	0.081	≥0.093
Ridge regression on AquaticTox	Frequency	6	5	7	8	4	6	10	6	2
Ridge logistic regression on bbb2	Lambda	≤0.09	0.10	0.12	0.14	0.16	0.18	0.21	0.24	≥0.28
Ridge logistic regression on bbb2	Frequency	7	3	4	5	10	6	5	2	8
Ridge logistic regression on caco-PipelinePilotFP	Lambda	<0.0046	0.0046	0.0053	0.0061	0.0070	0.0081	0.0093	0.0107	>0.0107
Ridge logistic regression on caco-PipelinePilotFP	Frequency	6	2	2	4	7	12	6	6	5
Ridge logistic regression on caco-QuickProp	Lambda	≤0.018	0.021	0.024	0.028	0.032	0.037	0.042	0.049	≥0.056
Ridge logistic regression on caco-QuickProp	Frequency	7	2	8	7	7	7	4	4	4
PLS on MeltingPoint	Number of components	34-35	36	37-40	41	42-46	47	48-51	57	60
PLS on MeltingPoint	Frequency	7	7	6	8	7	8	5	1	1
Ridge regression on MeltingPoint	Lambda	≤0.031	0.036	0.042	0.048	0.055	0.063	0.073	0.084	≥0.096
Ridge regression on MeltingPoint	Frequency	5	1	4	6	5	5	7	10	5
Ridge logistic regression on Mutagen	Lambda	<0.0016	0.0016	0.0018	0.0021	0.0024	0.0031	0.0036	0.0042	>0.0042
Ridge logistic regression on Mutagen	Frequency	7	2	1	6	5	8	4	6	7
Ridge logistic regression on PLD	Lambda	≤0.34	0.34	0.39	0.44	0.67	0.77	0.89	1.02	≥1.17
Ridge logistic regression on PLD	Frequency	10	2	3	2	1	5	5	5	19

Distribution of the optimal parameter values (number of components for PLS, lambda for the ridge models) selected in 50 single cross-validations for each method/dataset pair.
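The kind of tally reported in Table 2 can be illustrated with a minimal sketch. It is not the authors' code: it assumes synthetic data, a single-predictor ridge regression with a closed-form slope, 10-fold cross-validation, and a hypothetical lambda grid. All function names (`ridge_slope`, `cv_error`, `make_folds`, `optimal_lambda_distribution`) are made up for illustration. Each repeat draws a new random split into folds, records the lambda with the lowest cross-validated squared error, and the counts over 50 repeats form the frequency row.

```python
import random
from collections import Counter


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for one predictor, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, folds):
    """Mean squared prediction error of ridge with penalty lam over the folds."""
    sq_err, n = 0.0, 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        sq_err += sum((ys[i] - b * xs[i]) ** 2 for i in test_idx)
        n += len(test_idx)
    return sq_err / n


def make_folds(n, v, rng):
    """Randomly split indices 0..n-1 into v folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[k::v] for k in range(v)]


def optimal_lambda_distribution(xs, ys, lambdas, n_repeats=50, v=10, seed=0):
    """Count how often each lambda wins across repeated single cross-validations."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        folds = make_folds(len(xs), v, rng)  # new random split for every repeat
        best = min(lambdas, key=lambda lam: cv_error(xs, ys, lam, folds))
        counts[best] += 1
    return counts


# Synthetic, noisy data (an assumption; not one of the datasets in Table 2).
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + data_rng.gauss(0, 2) for x in xs]

lambdas = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]  # hypothetical grid
counts = optimal_lambda_distribution(xs, ys, lambdas)
for lam in lambdas:
    print(lam, counts[lam])
```

With the same data and the same grid, only the random fold assignment changes between repeats, so any spread in the printed counts reflects the dependence of the selected parameter on the split.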

ISSN: 1758-2946