Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation

In addition to a prediction itself, it is insightful to report an estimate of how certain a model is about that prediction. For regression tasks, most approaches implement a variation of the ensemble method, with few exceptions. Instead of a single estimator, a group of estimators yields several predictions for an input. The uncertainty can then be quantified by measuring the disagreement between the predictions, for example by their standard deviation. In theory, ensembles should not only provide uncertainties, but also boost predictive performance by reducing errors arising from variance. Despite the development of novel methods, ensembles are still considered the "gold standard" for quantifying the uncertainty of regression models. Subsampling-based methods to obtain ensembles can be applied to all models, regardless of whether they stem from deep learning or traditional machine learning. However, little attention has been given to the question of whether the ensemble method is applicable to virtually all scenarios occurring in the field of cheminformatics. In a broad and diversified evaluation, ensembles are assessed on 32 datasets of different sizes and modeling difficulty, ranging from physicochemical properties to biological activities. For increasing ensemble sizes of up to 200 members, the predictive performance as well as the applicability as an uncertainty estimator are shown for all combinations of five modeling techniques and four molecular featurizations. Useful recommendations are derived for practitioners regarding the success and minimum size of ensembles, depending on whether predictive performance or uncertainty quantification is more important for the task at hand.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00709-9.
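As a minimal illustration of the ensemble principle described above (a sketch with made-up data, not the study's code): given per-member predictions for a set of test compounds, the ensemble prediction is the mean over the members, and the uncertainty estimate is their disagreement, here the standard deviation.

```python
import numpy as np

# Hypothetical per-member predictions: shape (n_members, n_test_compounds).
rng = np.random.default_rng(0)
member_preds = rng.normal(loc=5.0, scale=0.3, size=(200, 10))

ensemble_prediction = member_preds.mean(axis=0)  # averaged prediction per compound
uncertainty = member_preds.std(axis=0)           # member disagreement per compound
```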

Figure S1: Overview of the predictive performances (R²) of the 200-member ensembles, for all datasets and each combination of featurization and modeling technique. Brighter colors correspond to larger values.

For a selection of eight datasets, another 200 members were generated using a bootstrapping procedure instead of 2-fold CV to generate the random subsamples. Producing repeated bootstrap samples of dataset size n and averaging over all predictions for each left-out object is also known as the leave-one-out bootstrap [S1, S2]. The leave-one-out bootstrap was found to yield error rates comparable to 2-fold CV (called "half-sample CV" in the original publication) [S3], which raised the question of whether this comparability also holds for predictive performance and UQ performance. To make potential differences noticeable, the comparison between the two subsampling techniques was carried out for datasets whose performances depend exceptionally strongly on the modeling technique or featurization. Overall, the results were comparable; both methods exhibited the same strengths and weaknesses within a dataset and a given combination of featurization and modeling technique. While bootstrap ensembles slightly outperformed 2-fold CV ensembles in predictive performance, they were surpassed in UQ performance. The performance comparisons are visualized in Fig. S2.1 and S2.2 for predictive performance and UQ performance, respectively. Since the training sample in the bootstrap approach covers a larger fraction of objects, the individual ensemble members are expectedly somewhat less diverse. This may improve the predictive quality due to the better information coverage, but at the same time leaves less variance for estimating uncertainties. A sketch of both subsampling schemes follows the figure captions below.

Figure S2.1: Differences in predictive performances between subsampling by 2-fold CV and by bootstrapping for the eight selected datasets. Each row corresponds to the comparisons between results of one dataset; each plot in a row shows the results for a specific modeling technique. Within each plot, a pair of bars represents the results for a specific featurization, also indicated by the color. The left bar of each pair shows R² when generating 200 members by 2-fold CV, the right bar when using bootstrapping.

Figure S2.2: Differences in UQ performances between subsampling by 2-fold CV and by bootstrapping for the eight selected datasets. Each row corresponds to the comparisons between results of one dataset; each plot in a row shows the results for a specific modeling technique. Within each plot, a pair of bars represents the results for a specific featurization, also indicated by the color. The left bar of each pair shows ρ when generating 200 members by 2-fold CV, the right bar when using bootstrapping.
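The following sketch contrasts the two subsampling schemes on toy data (our illustrative code, not the study's implementation; the assumption that 100 repeats of 2-fold CV yield the 200 members is ours, and a random forest serves as a stand-in member model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X, y = rng.normal(size=(100, 8)), rng.normal(size=100)  # toy data

def members_by_2fold_cv(X, y, n_repeats=100):
    """Each repeat of 2-fold CV yields two members; every object is
    predicted by exactly one member per repeat (100 repeats -> 200 members)."""
    preds = []  # per-member out-of-sample predictions, NaN where the object was trained on
    for rep in range(n_repeats):
        kf = KFold(n_splits=2, shuffle=True, random_state=rep)
        for train_idx, test_idx in kf.split(X):
            model = RandomForestRegressor(random_state=rep).fit(X[train_idx], y[train_idx])
            p = np.full(len(y), np.nan)
            p[test_idx] = model.predict(X[test_idx])
            preds.append(p)
    return np.array(preds)

def members_by_bootstrap(X, y, n_members=200):
    """Leave-one-out bootstrap: train on a bootstrap sample of size n and
    predict only the objects that the sample left out."""
    preds = []
    for m in range(n_members):
        boot = rng.integers(0, len(y), size=len(y))       # sample with replacement
        oob = np.setdiff1d(np.arange(len(y)), boot)       # left-out objects
        model = RandomForestRegressor(random_state=m).fit(X[boot], y[boot])
        p = np.full(len(y), np.nan)
        p[oob] = model.predict(X[oob])
        preds.append(p)
    return np.array(preds)

# Averaging each column over the non-NaN entries, e.g. np.nanmean(preds, axis=0),
# gives the per-object ensemble prediction for either scheme.
```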

Some per-member evaluations resulted in unrealistic predictions. An example of such a case is the set of predictions of member 40 from the ensemble evaluation of Tetrahymena, featurized by RDKit descriptors and using the SNN. The predictive performance assessed from the raw, unfiltered predictions of member 40 is visualized in Fig. S3.1. Especially the SNN was prone to produce such predictions, mostly in combination with RDKit descriptors. Predictions of the SNN showed a noticeable variation in general, where initialization with different random seeds can distinctly alter the model quality. Furthermore, they were more prone to outliers, which can be extreme in the case of RDKit descriptors. To deal with such predictions, an outlier filter was applied to each set of per-member predictions. Let $y_{\mathrm{train}}^{\max}$ denote the maximum value of the training output variables and $y_{\mathrm{train}}^{\min}$ their minimum. Then, the width of the interval covering all training outputs can be defined as $y_{\mathrm{train}}^{\mathrm{range}} = y_{\mathrm{train}}^{\max} - y_{\mathrm{train}}^{\min}$. The filter then removes test predictions $\hat{y}_{\mathrm{test}}$ outside a specified interval wider than $y_{\mathrm{train}}^{\mathrm{range}}$, yielding $\hat{y}_{\mathrm{test}}^{\mathrm{filtered}}$:

$$\hat{y}_{\mathrm{test}}^{\mathrm{filtered}} = \{\hat{y}_{\mathrm{test}} \mid y_{\mathrm{train}}^{\min} - \delta \cdot y_{\mathrm{train}}^{\mathrm{range}} < \hat{y}_{\mathrm{test}} < y_{\mathrm{train}}^{\max} + \delta \cdot y_{\mathrm{train}}^{\mathrm{range}}\}$$

The acceptance level δ was set to 0.5, resulting in a coverage range of twice the train output interval width. In the example, the outlier filter removed three predictions, which were located between 16.90 and 26.99. The result is shown in Fig. S3.2. A direct implementation of this filter is sketched below.
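A minimal numpy sketch of the filter (function and variable names are ours, not from the study's code):

```python
import numpy as np

def filter_outlier_predictions(y_hat_test, y_train, delta=0.5):
    """Keep only test predictions inside the training output interval,
    widened by delta * range on each side (delta=0.5 doubles the width)."""
    y_min, y_max = y_train.min(), y_train.max()
    y_range = y_max - y_min
    mask = (y_hat_test > y_min - delta * y_range) & \
           (y_hat_test < y_max + delta * y_range)
    return y_hat_test[mask], mask  # the mask can filter the matching targets too
```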
An independent variable that often causes extreme outliers for datasets featurized as RDKit descriptors is the shape index Kappa3. In the case of the example, deleting Kappa3 entirely and rerunning the evaluation of member 40 also leads to a reasonable predictive performance, as visualized in Fig. S3.3. However, knowing which variables cause outliers in the test set obviously requires knowing the test set. Although test data were available for a split as part of a cross-validation, it cannot be guaranteed that future compounds would conform to the remaining descriptors. It was therefore considered more realistic to remove outliers retrospectively, from the predictions, using the filtering approach. The notebook to reproduce the figures, with an additional outlier analysis and an investigation of the influence of Kappa3, can be found in [S4].

Figure S4: Overview of the UQ performances (ρ) of the 200-member ensembles, for all datasets and each combination of featurization and modeling technique. Brighter colors correspond to larger values.

The following example plots for SNN ensembles of P03372 demonstrate how the cumulative member curves were processed to obtain the ensemble size at the supposed saturation. The "raw" cumulative predictive and UQ performance curves for each featurization, considering only the generation order of the members, are visualized in Fig. S5.1. The corresponding notebook for obtaining the raw cumulative member curves can be found in [S5]. Randomly permuting the order of members 200 times, computing a cumulative member curve each time, and taking the median of the 200 values at each ensemble size results in Fig. S5.2. The notebook that generates all permuted cumulative member curves can be found in [S6]. Finally, the median cumulative member curves are smoothed by fitting a Michaelis-Menten function through them. The median curves together with the fitted functions and the estimated points of saturation are visualized in Fig. S5.3. The notebook that fits the Michaelis-Menten functions and estimates the points of saturation is provided in [S7].
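The processing chain can be sketched as follows (our illustrative code: the `cumulative_curve` helper is hypothetical, and the saturation criterion, here the size at which the fit reaches 95% of its asymptote, is an assumption for demonstration, not necessarily the study's criterion):

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, v_max, k_m):
    # Saturating curve: approaches v_max as the ensemble size x grows.
    return v_max * x / (k_m + x)

def median_cumulative_curve(cumulative_curve, n_members=200, n_permutations=200, seed=0):
    """cumulative_curve(order) must return one performance value per ensemble
    size for the members taken in the given order (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    curves = [cumulative_curve(rng.permutation(n_members)) for _ in range(n_permutations)]
    return np.median(np.array(curves), axis=0)  # median at each ensemble size

def saturation_size(median_curve, fraction=0.95):
    sizes = np.arange(1, len(median_curve) + 1)
    (v_max, k_m), _ = curve_fit(michaelis_menten, sizes, median_curve,
                                p0=(median_curve[-1], 1.0))
    # michaelis_menten reaches fraction * v_max at x = fraction / (1 - fraction) * k_m
    return fraction / (1.0 - fraction) * k_m
```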

The median ensemble sizes at which saturation was reached were plotted against the performances of the full ensembles, for each dataset, as visualized in Fig. S6. The Spearman's rank correlation coefficient between ensemble sizes and full performances was -0.902 in the case of predictive performance and 0.433 in the case of UQ performance.

Figure S8: Overview of the predictive performances (R²) and UQ performances (ρ) of the 10-fold CV single RF models, for all datasets and each featurization. Brighter colors correspond to larger values.

Table S1: All data sets used for evaluation. The original number of compounds refers to the number of …

Table S2: Python packages with versions used for this study.
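The reported rank correlations can be recomputed from the per-dataset values with scipy (a sketch with hypothetical numbers standing in for the actual per-dataset results underlying Fig. S6):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-ins: one median saturation size and one full-ensemble
# performance value per dataset.
saturation_sizes  = np.array([12, 25, 8, 40, 15])
full_performances = np.array([0.85, 0.60, 0.90, 0.45, 0.78])

rho, p_value = spearmanr(saturation_sizes, full_performances)
# A strongly negative rho mirrors the reported -0.902 for predictive
# performance: better-performing datasets saturate at smaller ensemble sizes.
print(f"Spearman rho = {rho:.3f}")
```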