A hybrid framework for improving uncertainty quantification in deep learning-based QSAR regression modeling

Reliable uncertainty quantification for statistical models is crucial in various downstream applications, especially for drug design and discovery where mistakes may incur a large amount of cost. This topic has therefore absorbed much attention and a plethora of methods have been proposed over the past years. The approaches that have been reported so far can be mainly categorized into two classes: distance-based approaches and Bayesian approaches. Although these methods have been widely used in many scenarios and shown promising performance with their distinct superiorities, being overconfident on out-of-distribution examples still poses challenges for the deployment of these techniques in real-world applications. In this study we investigated a number of consensus strategies in order to combine both distance-based and Bayesian approaches together with post-hoc calibration for improved uncertainty quantification in QSAR (Quantitative Structure–Activity Relationship) regression modeling. We employed a set of criteria to quantitatively assess the ranking and calibration ability of these models. Experiments based on 24 bioactivity datasets were designed to make critical comparison between the model we proposed and other well-studied baseline models. Our findings indicate that the hybrid framework proposed by us can robustly enhance the model ability of ranking absolute errors. Together with post-hoc calibration on the validation set, we show that well-calibrated uncertainty quantification results can be obtained in domain shift settings. The complementarity between different methods is also conceptually analyzed. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-021-00551-x.


Introduction
With the increasing scale of available datasets, deep learning methods have made tremendous impact in the chemical domain [1]. However, most works in this area have focused on improving model accuracy, less attention has been paid for quantifying the uncertainty of predictions given by the model. Uncertainty quantification refers to estimating the confidence level of a model output. Reliable estimation of this certainty is often crucial for high-stacks problems, especially drug design and discovery [2][3][4]. For example, in the scenario of virtual screening, molecules with high predictive activity are chosen for further experimental verification. Through this process it is always expected that the molecules with unreliable predictions can be excluded in order to avoid wasting time and money [5]. However, a deterministic model cannot give such information. This example shows that numerical results without a measure of veracity do not contain enough information for decision making [6].
Given the importance of uncertainty quantification, a plethora of methods have been proposed so far and employed in various cheminformatics tasks such as molecular property prediction [7], chemical reaction prediction [8], material property prediction [9], NMR spectral property prediction [10] and interatomic potential prediction [11]. Broadly speaking, current mainstream uncertainty quantification methods used in the chemical domain can be divided into two categories: distancebased approaches and Bayesian approaches.
The core of distance-based approaches is the traditional concept of applicability domain (AD). AD estimates the chemical space in which the model could give reliable predictions. It is generally accepted that if a test sample is too remote from the training set, its prediction is likely to be unreliable. While the common goal is the same, the representation of the distance between a molecule and the model training set is varied across different distance-based methods. Many classical methods use feature space distance defined by molecular fingerprints [12][13][14][15][16][17], while some recent studies have shown that the distance in latent space may yield superior performance [18,19].
Bayesian approaches encompass a diverse group of strategies with strong theoretical guarantee which enjoyed a recent reconstruction as a result of the improvement of computing power [20][21][22]. The underlying assumption behind the Bayesian approach is that the model weights and predictions are no longer definite point estimates but probability distributions which allows uncertainties of predictions to be taken into the model naturally [23]. By fitting a defined model to the observed data (training set), the posterior distribution of model weights can be theoretically obtained and used to make inference. The total uncertainty of a prediction is then quantified as its posterior variance. Interestingly, in Bayesian modelling, the total uncertainty can be decomposed into two components: aleatoric uncertainty which captures the noise of labels and epistemic uncertainty which results from the lack of training data [7,24]. Many researches have made use of this desired property to identify the main source of uncertainty for their specific tasks [9,25].
Despite the progress mentioned above, the reliability and applicable conditions of both distance-based and Bayesian methods are still limited by some challenges. On the one hand, the measure of chemical space distance is ambiguous and the threshold for classifying reliable predictions is hard to define. Also, the distance-to-model metric lacks the information of stochasticity arising from the data. On the other hand, although the computational intractability of Bayesian methods has been eased by several approximating ways [26], Bayesian approaches are reported to tend to make overconfident predictions for out-of-domain examples [27,28]. In this context, we make the assumption that combining both distancebased and Bayesian methods represents a feasible strategy which can minimize the intrinsic drawbacks of these methods.
The complementarity between Bayesian methods and distance-based methods can be viewed in a more theoretic way. Some recent studies proposed that except for the epistemic uncertainty and aleatoric uncertainty that have been included in the Bayesian approach, distributional uncertainty is another source of uncertainty that needs to be considered [29]. Distributional uncertainty describes that the model is unfamiliar with the test data and thus cannot confidently make predictions, no matter for the label or the data noise. Some uncertainty quantification methods can explicitly model distributional uncertainties, but at the same time need out-of-domain samples during training time, which is unrealistic in realword applications [30]. To this end, distance-based methods here play a similar role as distributional uncertainty modeling methods to estimate whether a sample is outof-domain, which makes up the shortcomings of Bayesian methods.
In this study, we investigated the performance of several consensus strategies that combine both distance-based and Bayesian uncertainty quantification approaches in the context of deep learning-based QSAR regression modeling. The value of performing post-hoc calibration on a leave-out validation set was also studied. Special emphasis was put on model's ability of ranking absolute errors and providing calibrated uncertainty quantification results. The performance of different models was benchmarked on 24 biological regression datasets. We found that the consensus model showed improved performance over individual methods in both in-domain and out-of-domain settings.

Problem definition
Suppose we are given a training set with m samples Here x represents an input molecule and y ∈ R is a real-valued property. A deep learning-based regression model M parameterized by weights θ is trained on D A with early stopping on D B . M is then used to make predictions on D B and D C which can be represented by vectors . Signed error vectors are defined by . It can be assumed that [31] the error of each prediction e C i is a random variable following a Gaussian distribution with a zero mean and a specific variance σ C i 2 : where σ C i 2 is a function of the trained model and M and x C i , an uncertainty quantification method Q is used to give a relative or direct When it is clear from the context, this estimation will be directly referred to as Q x C i . Suppose there are t individual uncertainty quantification methods Q 1 , . . . , Q t , each of which could give a set of uncertainty estimation on D C named as We make the assumption that U C 1 , . . .,U C t can be combined through a consensus model f into a stronger estimation .

Datasets
24 bioactivity datasets gathered by Cortés-Ciriano et al. were used to benchmark the performance of different uncertainty quantification methods in this study [32]. Each dataset consists of bioactive small molecules for a specific drug target and IC50 values extracted from ChEMBL [33]. The number of data points per target varies from 203 (A2a) to 5207 (HERG). When multiple pIC50 values were available for the same compound, the average pIC50 value was calculated and taken as the label. Further information and all datasets can be found in ref [32,[34][35][36].

Graph convolutional neural network (GCNN)
Directed Message Passing Neural Network (D-MPNN) [37] was used in this study to conduct molecular activity prediction. D-MPNN is a kind of graph convolutional neural networks which uses messages centered on directed bonds instead of atoms to avoid unnecessary messagepassing loop. Unless otherwise noted, the default hyperparameter setting from the Chemprop package [37] was used to train the model. Noam learning rate schedule [38] was adopted to dynamically adjust learning rate. Early stopping was used to avoid over-fitting during training. At the end of each epoch, the loss on the validation set was calculated and recorded. If the loss did not decrease over 50 epochs, the training process was stopped and the best model on the validation set was taken as the final model for evaluation. Mean-variance loss was used as the loss function to estimate aleatoric uncertainty, for more details see the next subsection.

Mean-variance estimation (MVE)
The basic assumption of MVE is that the labels of a regression dataset are normal variables with different variances which arise from the experimental noise [27]. This is also called aleatoric uncertainty in Bayesian uncertainty analysis. MVE gives the estimation of the mean and variance for each data point with maximum likelihood estimation (MLE). In this way, the prediction of the model is a distribution rather than a single point. Practically, the output of GCNN is branched into two predictions: the mean µ(θ , x) and the variance v(θ , x) . v(θ , x) is taken as the result of MVE as the estimated uncertainty during the prediction process. A loss function proportional to the negative loglikelihood of the normal distribution instead of the traditional mean squared error (MSE) loss is used to optimize the weights θ: The output v(θ , x) of a model trained in this way is taken as the estimated aleatoric uncertainty Q A (x):

Ensemble (ENS)
Except for label noise inherent in dataset, the randomness of training process is another source of uncertainty. With the same loss on the training set, there are multiple possible model weights that are able to explain the data. This is called epistemic uncertainty in Bayesian uncertainty estimation, which results from the lack of knowledge at certain regions of the feature space, and can be neutralized by increasing training data in those low-density regions. In this study, an ensemble approach was used to obtain this kind of uncertainty. For each training process, a set of models {M k } K k=1 were trained parallelly with different random initial weights. The ensemble variance defined as the following equations is then used to estimate the epistemic uncertainty Q E x C i for the test sample x C i : where M k x C i is the prediction of the k-th model, M x C i is the average prediction of ensemble models and Q E x C i is the estimation of the uncertainty given by the ensemble variance. If MVE is used during training (as shown in Eq. 2), the predicted mean value µ θ k , x C i is taken as M k x C i . This approach was first proposed as Deep Ensemble [39] and has been widely used in other researches. Although there still exist other popular ensemble techniques for obtaining ensemble models, such as Monte Carlo dropout [21,40], hyperparameter ensemble [41] and bootstrapping [42], a recent study has shown that many of them are essentially equivalent to simply ensemble of several independently trained networks as Deep Ensemble does [43], which has become a practical alternative to improve the accuracy of deep learning models. It has also been reported that Deep Ensemble significantly outperforms other ensemble techniques on a variety of machine learning tasks [27,28]. Another advantage for Deep Ensemble is that it is based on the standard training pipeline without additional computation burden, which makes the implementation much simpler.

Feature space distance (FDIST)
The feature space distance between the test sample and the training dataset has long been used to define the AD for machine learning models. A variety of approaches have been proposed in the literature over the years for measuring the feature space distance [13-15, 17, 44]. Considering the generality of the method, a rather simple but robust measurement named SIMILARI-TYNEARIST1 [12,14] is used in this study. Here, each molecule is represented by the MinHash fingerprint6 (MHFP6) [45], which removes the curse of dimensionality and outperforms ECFP4 [46] in analog recovery experiments. The uncertainty of a test sample is given by the Tanimoto distance [47] of that with the nearest sample in the training set.

Latent space distance (LDIST)
GCNN is able to build a learned molecular representation and further engineer it automatically to amplify the effects of task-related features and limit the effects of weakly-informative features. When the model could extract the task-related features of molecules successfully, the distance in latent space is extremely meaningful to stand for the similarity of two molecules on a specific task. As a result, latent space distance was recently proposed as a novel and efficient strategy of uncertainty quantification [19]. Here, the cosine distance in the final layer latent space was used to assess the similarity of a test point and the training dataset. The number of nearest neighbors was also set to one as used in the calculation of FDIST.
In practice, 10 models were trained with randomly initialized weights for a given training/validation/testing split. The arithmetic mean of these 10 predictions was taken as the ensemble predictions on the validation set and the test set as shown in Eq. 4, while the variance of which was taken as the result of ENS as shown in Eq. 5. Noticed that FDIST is independent with the model weights, but the results of MVE and LDIST vary as the weights change. In this case the results of these two approaches were also averaged over the 10 models as is done for the predictions. A graph illustration of each individual method is shown in Fig. 1.

Bayesian approach (BYS)
With the MVE approach, the prediction of a model is not a definite number but a random variable. Given a model with weights θ and input x , the conditional expectation and variance of the prediction ŷ are: where µ(·) and v(·) are two deterministic functions. Under the Bayesian perspective, the weights of a trained model are also random variables following posterior distribution p θ |D A : where p(θ ) is the prior and p D A |θ is the likelihood. Combining Eqs. 8-10, the posterior distribution of the prediction is: whose expectation and variance are: shows how to define and calculate uncertainty of model predictions in a Bayesian way. It is noticed that according to Eq. 13, the total uncertainty var ŷ|x, D A is decomposed into two components, aleatoric uncertainty (the former one) and epistemic uncertainty (the latter one). Due to the large number of parameters contained in θ which leads to high dimensional integrals, calculation of Eq. 13 is intractable. Many methods have focused on finding strategies to make efficient approximation. As a realization of this framework, Scalia et al. proposed that the outputs of MVE and ENS can be added to obtain an approximation of Bayesian total uncertainty (as shown in Eq. 14), leading to a better performance compared with using single method alone [27]. This method which can also be treated as a simple aggregation strategy was used as a baseline model in this study for comparison and referred to as BYS.

Consensus methods and corresponding evaluation metrics
To combine the outputs of single models , two weighted consensus strategies were considered in this study: (1) weighted averaging focusing on improving the ranking ability; and (2) NLL (negative log-likelihood) (14) Q calibration focusing on improving the calibration ability, both of which take the form of a linear combination of predictions provided by individual models to get the final uncertainty estimation, as follows: are denormalized outputs of individual models, H(·) is a normalization function and U C * is the final outputs. Next, we will explain the meaning of ranking and calibration ability and how we obtain these abilities with corresponding methods.

Weighted averaging
In general, an ideal uncertainty quantification method should assign higher uncertainty values to predictions with higher absolute errors. We name this as the ranking ability of the model. Assuming that each individual model has already possessed ranking ability to some degree, the problem now is how to combine these models into a stronger one. A weighted averaging approach as shown in Eq. 15 is used to achieve this goal. However, directly adding the results of different individual models is not applicable owing to different quantities and units of individual predictions. In this context, a normalization is performed for each model's results at first. Two classical normalization strategies, z-score normalization (referred to as Zscore) and min-max normalization (referred to as MinMax), were tested in this study. We also considered to directly transform the results into rankings (referred to as Rank) before performing weighted averaging.
To determine the weights W in Eq. 15, without any prior knowledge, the most intuitive way is to use arithmetic mean (referred to as Unweighted), where w 0 = 0 and w 1 = . . . = w t = 1 . We also explored the possibility of assigning higher weight for the method showing better performance on the validation set (referred to as Weighted). Since the ranking ability can be quantitatively measured by the Spearman Correlation Coefficient (SCC) between the predictions and absolute errors, Eq. 16 is used to determine the weight for each individual model.
Practically we found that SCCs can rarely simultaneously be negative for all individual models on the validation set. In this case the arithmetic mean would be used instead.
Confidence curve is a usual way to visually assess the ranking ability of the model. To draw this curve, the most uncertain samples are successively removed and the mean absolute error (MAE) is calculated for the remaining predictions. The confidence curve is plotted by showing how the value of MAE varies as the function of confidence percentile. A monotonic decreasing confidence curve would be expected for an ideal uncertainty estimator. When two curves were compared in parallel, the one with a smaller AUC (area under the curve) should be regarded as the better one. SCC between the estimated uncertainties and the absolute errors on the test set was used to quantitatively measure the ranking ability, following the work of Hirschfeld et al. [48].

NLL calibration (NLLCAL)
Even if a consensus model shows a perfect ranking ability, which we would not expect since e C i is considered to be a random variable sampled from a zero-mean normal distribution as shown in Eq. 1, it still remains unknown whether Q * x C i equals to σ C i 2 in value. The weighted averaging methods described in the previous subsection (16) are only aimed at giving estimation of rankings for absolute errors instead of directly predicting the uncertainty σ C i 2 . To do this, a post-hoc calibration on the validation set with MLE strategy is used. In MLE, we hope to learn a set of weights W that maximize the likelihood of observing errors on the validation set E B , given by ) . This is achieved by minimizing the following NLL loss function: where This method will be referred to as NLLCAL from now on. A similar MLE-based calibration approach is adopted by Janet et al. in which only latent distance information is used [19]. The idea of fitting absolute errors against several uncertainty metrics on the validation set to construct a hybrid error prediction model has also been explored by a set of work of Sheridan in the context of machine-learning-based QSAR modeling earlier [14,49,50].
can be evaluated under error-based view or confidence-based view [27,48]. Error-based calibration means that given the predicted mean ŷ and uncertainty Q(x) , the following condition is met: To assess error-based calibration, we followed the method proposed by Levi et al. [51]. According to this procedure, the test set is divided into multiple bins according to the predicted uncertainties U C * . In this study, each bin is set to have 20 samples with close predicted uncertainty. A scatter plotting between the root mean squared error (RMSE) and the root of average predicted uncertainty of each bin is then drawn. For a perfect calibrated model, points should be distributed around the diagonal line of the plotting. Expected normalized calibration error (ENCE) is defined to quantitatively measure the error-based calibration error: where N equals to the number of bins, mVAR(i) is the average predicted uncertainty over the i-th bin and MSE(i) is the MSE over the i-th bin.
Obviously, there is no need to make any distribution assumption of errors for checking whether a model is error-based calibrated. However, if we want to define the (17) confidence interval of error with a specified confidence level, such as 80%, for each test sample using the uncertainty value obtained, a zero-mean Gaussian distribution assumption is usually required, as shown in Eq. 1. Under the Gaussian distribution assumption, accuracy of the confidence interval is a function of confidence level ranging from 0 to 100%. According to this, the reliability diagrams can be plotted to check whether a model is confidence-based calibrated. Here the accuracy of a confidence interval is defined as the proportion of errors falling into the interval. A perfectly confidence-based calibrated model should obtain a diagonal line for the reliability diagram. In other words, for a x% confidence interval, it should always be observed that x% errors fall into it. Expected calibration error (ECE) is used to give a scalar statistic of miscalibration rate, defined as: where acc(i%) refers to the accuracy of i% confidence interval.

Splitting strategy
For each target we followed the standard fivefold cross validation (fivefold CV) protocol as is commonly done in the development pipeline of QSAR models. To analyze the performance of the model in different application scenarios, three different splitting strategies were adopted. Each strategy performed a fivefold CV split on the full dataset for each target, but with varying degrees of domain shift. For each split, the ratio of training/validation/testing is always 60/20/20. We named these three splitting strategies as IVIT, IVOT and OVOT, where V refers to "validation set", T refers to "test set", I refers to "in-domain" and O refers to "out-of-domain". The difference between them will be briefly discussed next.
IVIT (in-domain validation set; in-domain test set) is just the standard random splitting strategy making the training, validation, and testing sets cover the same chemical space. This strategy is widely used, but it is reported that random-split validation is tended to give over-optimistic results [1]. Accordingly, a stricter splitting method, cluster cross-validation [52], which can guarantee that compounds of different clusters cover disparate chemical spaces was adopted in this study.
We used the single-linkage algorithm to carry out the fivefold cluster CV. Single-linkage algorithm is a type of hierarchical clustering depending on the smallest dissimilarity between two examples x i , x j respectively from two different clusters C i , C j , which can guarantee that the minimum distance between any two folds is larger than a given threshold, as the following equation shows: where dist(·, ·) is the distance function and d cutoff is the threshold used. In this study, dist was defined by the Tanimoto distance on binarized ECFP4 fingerprints [46], and d cutoff was set to 0.3. Under this configuration, for example, the compounds in HERG were clustered into 1690 clusters with 131 compounds for the largest cluster, and the compounds in JAK2 were clustered into 534 clusters with 324 compounds for the largest cluster. These clusters were then united into five different folds with approximately the same size. For OVOT (out-of-domain validation set; out-of-domain test set), two folds were selected as the validation set and the test set, respectively, and the remaining three folds as the training set. For IVOT (indomain validation set; out-of-domain test set), one fold was selected as the test set at each time and the remaining four folds were further mixed and randomly split into the training set and the validation set in a ratio of 60: 20. OVOT has been adopted by some previous studies to evaluate the performance of uncertainty quantification models in an out-of-domain setting [48]. However, IVOT is a better way to simulate real usage scenarios since typically we would not intentionally use an out-of-domain validation set for early stopping. What's more, in this study the validation set functions as a calibration set for weighted averaging and NLLCAL. IVOT provides a better way of checking whether these methods are robust to dataset shift [28].

Motivation of building hybrid uncertainty quantification model
First of all, we would like to discuss the motivation of incorporating the distance information into traditional Bayesian framework according to the equations given in the "Methods and datasets" section. As shown in Eq. 13, the definition of Bayesian uncertainty is the posterior variance var(ŷ|x C i , D A ) , which can be further approximated by adding Q A and Q E together. However, we suppose that there exists a discrepancy between the Bayesian uncertainty var(ŷ|x C i , D A ) and the absolute error e C i which is actually the goal we concern about. A low var(ŷ|x C i , D A ) does not guarantee that ŷ C i is close to the label y C i . That is to say, a model may be quite "certain" but the prediction is "wrong". This is common for many out-of-domain samples, for which it has been observed that Bayesian approaches usually give overconfident estimation. A typical scenario is that the training set is biased, in which the positive samples contain low dimensional properties, for example, molecular weights, that can be used to easily identify positives and negatives. Considering an extreme case where nearly all positive molecules possess biphosphate groups while the negative ones do not (in fact this is a real case for target FPPS in DUD-E [53]). During training it is very likely that the neural network maps all molecules with biphosphate groups into a neighboring region in the latent distance. For an out-of-domain molecule (defined on the chemical space) containing a biphosphate group, var(µ(θ, x)) will be quite small since the model takes it as an in-domain sample (defined on the latent space) that can be well explained by the posterior weights. On the other hand, the FDIST will not be affected by the biased data and give important prior knowledge that the prediction is rather uncertain owing to the dissimilar molecular structure.

Consensus models outperform individual models
The performance of four individual models (MVE, ENS, LDIST, FDIST) and NLLCAL across three data splitting strategies are provided in Table 1 by showing the values of SCC, ECE and ENCE. The SCCs of unweighted and weighted averaging models are shown in Table 2. The results of SCC, ECE and ENCE are also illustrated in Figs. 2, 3, 4 with box plots. For more detailed performance metrics on each target see Additional file 1: Tables S1-S9.
As expected, the performance of individual models for IVIT is generally better than that for IVOT and OVOT, since uncertainty estimation for in-domain samples is easier than that for out-of-domain samples. Among four individual models, LDIST always showed the best ranking ability and FDIST always showed the worst. According to the results of ECE and ENCE, MVE consistently showed a better performance compared with ENS at the task of confidence-based and error-based calibration.
We next examine the performance of BYS which combines ENS and MVE. For ranking ability, it is found that BYS achieved better or at least similar performance compared with the best performing individual model. A clearer trend is observed for calibration tasks where BYS consistently significantly surpassed MVE and ENS.
The results of NLLCAL presented in the last column of Table 1 clearly show that the performance of BYS can be further improved by performing post-hoc calibration on the validation set. In fact, NLLCAL outperformed all of the baseline models that were not calibrated including BYS, regardless of splitting strategies and evaluation metrics. A previous study suggested that the post-hoc calibration is expected to show good performance in independent and identically distributed regimes, but may fail in the conditions of distributional shift, even when the shift is minor [28]. Accordingly, it is not surprising that NLLCAL showed a satisfied performance for IVIT. However, as shown in Table 1, NLLCAL still achieved better SCC, ECE and ENCE compared with BYS for IVOT and OVOT, which suggests that it is worth  calibrating the prediction on the validation set even in the out-of-domain scenario. Different from NLLCAL, unweighted averaging strategy does not require the calibration process and therefore possesses wider application scenarios. Table 2 shows that with a proper normalization strategy, unweighted averaging models could achieve higher SCC compared with baseline models and even NLLCAL (0.314 VS 0.308 for IVIT, 0.240 VS 0.225 for IVOT and 0.211 VS 0.194 for OVOT). Considering the normalization strategy, there was no significant difference between MaxMin, Zscore and Rank, despite the observation that Rank seems to show a slightly higher performance.
If a weighted strategy was used for averaging models, the same situation with NLLCAL arose where calibration had a high probability of further improving the performance for Zscore-based and MinMax-based averaging models. For example, the mean SCC of MinMax-based averaging model improved from 0.306 to 0.314 for IVIT, from 0.230 to 0.239 for IVOT and from 0.200 to 0.204 for OVOT. However, it is observed that the SCCs of Rank-based averaging models decrease slightly with the weighted strategy.
For visually presenting and comparing the results of different methods, the error-based calibration plots, confidence-based calibration plots and confidence curves for the first fold of erbB1 are shown in Figs. 5, 6, and 7, respectively. As one can observe in Figs. 5 and 6, NLLCAL shows a better calibration ability by performing post-hoc calibration on the validation set. The results of MVE and ENS were both significantly miscalibrated, especially for IVOT. Figure 5 also demonstrates that although in the IVOT setting the NLLCAL was still miscalibrated compared with the "ideal" diagonal line, it still outperformed BYS which indicates that post-hoc calibration is, to some extent, robust to the domain shift. Figure 7 demonstrates that for all splits weighted averaging method and NLLCAL could achieve better or comparable results with respect to the best performing individual model.
Except for the mean values, we also examined the relative performance of different methods across all 24 datasets. Taken one target as a single mission, Fig. 8 shows how often a model (y-axis) outperformed another (x-axis) which is defined by obtaining a higher mean value of SCC. As it can be seen, the improvement of using consensus strategy is outstanding and robust. For all targets and data splitting strategies, merely using Unweighted_Rank alone without performing post-hoc calibration is already very likely to get better performance compared with the traditional BYS method. This is especially ideal for real-world deployment.

Ablation studies highlight the importance of each individual model
Ablation studies were performed to assess the effect of each individual model. First, for weighted averaging methods (using Weighted_Rank as the representative model), each single method was removed in turn during the construction of the consensus model and four new sub-models were built whose performance would be compared with that of the whole model. The results are shown in Table 3. As it can be seen, for IVIT and OVOT, the performance of the consensus model built using the entirely four types of uncertainties surpassed that of sub-models. For IVOT, the sub-model excluding MVE shows slightly better performance compared with the whole model. According to Eq. 2, MVE learns aleatoric uncertainty from the distribution of training set, thus it is  The frequency for a model (y-axis) to outperform another model (x-axis) across all targets with respect to the metric SCC. Since a model always performs equally with itself, the values on the diagonal lines are zeros. Rank is used as the representation of different normalizations reasonable to find that MVE did not perform well on outof-domain test sets, as used in IVOT.
It is also interesting to note that although FDIST showed the worst ranking ability among the individual models (as shown in Table 1), the sub-model which excluded FDIST (M + E + L) showed significant decrease in SCC. We suppose that this is due to the strong complementary effect of FDIST. To prove this point, the correlation of predictions between the four individual models for each data splitting strategy are shown in Fig. 9. Correlation coefficients are calculated by averaging over different folds and targets. From a rank-order point of view, no matter for which data splitting method, there exists relatively high correlation between MVE, ENS, and LDIST. However, FDIST was weakly related with these three strategies, especially when IVOT and OVOT were adopted. In Fig. 10 we show two representative  . 9 The correlation between the rank ordering of four individual models. Spearman rank correlation coefficients are annotated in the corresponding squares examples. In the left part of Fig. 10, all individual models except FDIST showed decreasing confidence curves, while FDIST failed to give a meaningful prediction. On the contrary, in the right part, the performance of other models was rather poor, but FDIST showed a satisfied decreasing curve.
Since FDIST is not related with the task or model, we assume that it can be treated as a prior estimation of the reliability. As in Bayesian inference, the nonconformity between the likelihood and the prior does not necessarily evidence that the prior is inappropriate, but may be due to the data bias or a mis-specified model [26]. FDIST could function as prior knowledge and would show its unique value in some cases, especially for a dataset with biased features, which can be utilized through the consensus strategy we proposed.
One more thing should be noticed in Fig. 9 is the high correlation between MVE and ENS. Technically, MVE and ENS capture the aleatoric and epistemic uncertainties, respectively, which are conceptually orthogonal, so it is expected that these two methods should be independent. However, the results shown in Fig. 9 suggest that these two methods are highly correlated on the whole. We suppose that it is because MVE, ENS and LDIST are all derived from the final layer of the neural network, but FDIST is calculated directly from the original representation of the molecule. The same phenomenon has also been reported by Scalia et al. [27].
As for NLLCAL, we compared the performance of three models: considering LDIST only (L), considering MVE and ENS together (M + E) and finally the fully constructed model (M + E + L + F). Only performing calibration using LDIST was previously proposed by Janet et al. [19] and thus taken as the baseline model for comparison. M + E was also taken into consideration in order to assess the value of incorporating chemical space distance into the consensus model. The results of ablation study are shown in Table 4. As it can be seen, M + E + L + F which used the whole four individual models showed the best performance for most metrics, except for ECE and ENCE in the OVOT setting that M + E performed equally or slightly better.
The results discussed above reveal that all four individual models have its unique value no matter for weighted averaging or NLLCAL. The total consensus model considering all of the four individual models showed the best performance.

Comparison of NLLCAL with conformal prediction
As mentioned in the "Methods and datasets" section, a confidence-based calibrated uncertainty method, like NLLCAL, can be used to make interval estimation according to Eq. 1. It is noticed that another widely used approach for generating the prediction interval is the Conformal Prediction (CP) [54]. In this case, we compared the performance of these two methods in the mission of prediction interval estimation. For detailed explanation of Conformal Prediction methodology we refer the readers to Ref [54], and for different ways of defining nonconformity values, to Ref [55]. We adopted the following ESD (ensemble standard deviation) nonconformity function [55] to calculate the nonconformity value: where Q E (x i ) is the estimation obtained from the ENS approach as shown in Eq. 5, y i is the label and ŷ i is the predicted value. The confidence level was set to 0.9 for all experiments. Validity and efficiency were used to evaluate the performance of confidence interval predictors. Validity of the predictor was assessed by calculating the average fraction of labels falling inside the prediction interval across all folds for a single target. This value is expected to be as close to 0.9, the confidence level we set, as possible. Efficiency of the predictor was calculated as the average range of the prediction interval generated. For example, for an interval of 5.74 ± 0.38, the efficiency is 0.76. Provided the validity is close to 0.9, the efficiency is the lower the better. We say a model is valid for a target if the validity value falls into [0.85, 0.95].
The results of validity and efficiency of NLLCAL and CP are shown in Additional file 1: Tables S10, S11, respectively. For IVIT, both NLLCAL and CP were valid for all 24 targets, while NLLCAL could on average generate narrower intervals than CP (2.147 VS 2.205). For OVOT, NLLCAL and CP showed comparable performance that both of them were valid on 17 targets. (22)  Although the test set used in OVOT is out-of-domain, NLLCAL and CP were still able to generate valid estimation for most targets owing to the similar residual distribution between the validation set and the test set. However, for IVOT, it can be observed that CP was valid for only two targets (Carbonic and COX-1). The validity values of CP for the rest targets are all much lower than the confidence level we set, indicating that the prediction intervals given by CP were generally too narrow. It is not surprising since calibration on an in-domain validation set will definitely lead to underestimation of residuals on an out-of-domain test set. In fact, strictly speaking, CP is not applicable in the condition of IVOT and OVOT since the use of CP requires the randomness assumption that samples are independently drawn from the same distribution [54]. For the same reason, the prediction intervals given by NLLCAL were also generally too narrow for IVOT, but NLLCAL was still able to give valid estimation for 7 targets and showed better average validity value compared with CP (0.833 VS 0.782). All of these results clearly show that NLLCAL can generate valid prediction intervals with practical usefulness in the IVIT setting, while at the same time show more robust performance in domain shift settings compared with the Conformal Prediction approach.

Mean-variance estimation does not decrease model performance
One thing remains unclear is that whether the pipeline we proposed would affect the model performance, since we hope that any additional strategies for better uncertainty quantification will not decrease the accuracy of the origin model, which is the most important thing we concern about. Obviously, FDIST and LDIST have no influence on the prediction process. ENS requires to ensemble several individually trained models, which has been widely used to improve the robustness of the model. However, whether using mean-variance estimation would affect model accuracy still remains to be unknown.
To investigate this effect, we retrained all models with the normal MSE loss, whose results were taken as the baseline for comparison. The fivefold CV RMSEs on each target are reported in Additional file 1:  Table S12 indicate that mean-variance loss may have the added advantage of further driving down the prediction error on the test set. Similar phenomena have also been reported by some other researches [9,27,48]. With mean-variance loss, MVE is able to capture the aleatoric uncertainty [variance of the conditional distribution of target variables p(y|x) ] under the heteroskedastic assumption. Considering that heteroskedasticity is very common in bioactivity datasets, mean-variance loss can be treated as a useful regularization technique for avoiding overfitting on high noise samples, which may explain the improvement for model accuracy. However, whether this kind of regularization effect can be generalized to other datasets and tasks still needs to be further studied under more systematic tests, which is beyond the scope of this study.

Efficient uncertainty quantification for machine learning models is necessary and needs to be further explored
Although this research focuses on constructing hybrid uncertainty quantification methods for deep learning models as stated in the title, we wondered whether the similar consensus approach could also work for machine learning models. Random forest (RF) was taken as the representative model for studies. All experiments were reperformed by replacing the D-MPNN model with the RF model trained using the Scikit-learn package [56]. The default parameter values were used during training, where the number of trees was set to 100. It is reported that more trees generally do not lead to improved performance for QSAR regression modeling [57]. ECFP4 fingerprint was used to featurize molecules. Since LDIST and MVE are not applicable for RF models, we only attempted to build the hybrid models by combining ENS and FDIST together. ENS was calculated using the variance of predictions generated by 100 trees.
The performance of RF models across all targets are presented in Additional file 1: Table S13 by showing the average RMSE values. Additional file 1: Table S14 reports the performance of ENS, FDIST, Weighted_Rank and NLLCAL in the same way of Table 1 by showing the average SCC, ECE and ENCE. As it can be seen, ENS showed much better ranking ability than that of FDIST and performed well regarding to the calibration tasks. However, by combining these two individual models using weighted averaging and post-hoc calibration, Weighted_ Rank and NLLCAL still showed comparable or better performance compared with ENS.
Although GCNN models, like D-MPNN we studied in this research, have gained great attention in QSAR modeling and shown outstanding performance on large datasets, machine learning models still have unique value that should not be omitted. Generally speaking, machine learning models are much faster for training (Additional file 1: Figure S1) and more convenient for parameter tunning and interpretation. Jiang et al. even proposed that descriptor-based machine learning models on average showed better performance on small datasets than GCNN models [58]. As Additional file 1: Table S14 shows, the hybrid algorithm may also be beneficial for uncertainty quantification of machine learning models. However, how to choose individual methods and make combination is still an open question that needs to be further studied. In conclusion, efficient uncertainty quantification for deep learning and traditional machine learning models are equally important in QSAR modeling. We will continue to explore these two aspects in the future studies.

Conclusion
Data-driven methods are emerging as important tools for drug design and discovery. To fully realize the potential of these models, well-calibrated uncertainty quantification can be as important as accurate predictions. Many uncertainty quantification strategies have been proposed and benchmarked in the context of deep-learning-based QSAR regression modeling in recent studies. However, it has been reported that these approaches have the deficiency of showing large performance variation across different datasets and model architectures. In this study we explored several consensus strategies for improving the performance of the individual model. We found that both weighted averaging and post-hoc calibration on the validation set could lead to better performance. The importance of incorporating chemical space distance information into traditional Bayesian framework is also highlighted. Although the performance improvement is promising, there still exists gap between the reliability of the model and the need for real-world deployment. Considering the consensus strategies used in this study are rather simple, future work could focus on the transformation of chemical space distance information into the prior distribution that can be effectively used by existing Bayesian uncertainty quantification approaches.