
Table 1 Methodologies used to take uncertainty in predictions into account, with their advantages and disadvantages

From: Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

Method: Applicability Domain (AD) estimation
Description: Provides an estimate of whether the assumptions of a model are fulfilled for a given input [42,43,44,45]; e.g., distance-to-model AD assesses reliability based on whether a query compound is close to the model's training data
Advantage: Provides uncertainty estimates when making predictions for new compounds
Disadvantage: Does not commonly take into account the uncertainty in the underlying data

Method: Conformal Prediction
Description: Produces error bands around the predictions, under the assumption that inputs less similar to the model's training data should yield less certain estimates; this is captured by a nonconformity measure, i.e., a nonconformity score is calculated for each new query compound [46,47,48]
Advantage: Provides uncertainty estimates when making predictions for new compounds
Disadvantage: Does not commonly take into account the uncertainty in the underlying data

Method: Probability Calibration
Description: Addresses how to obtain accurate likelihoods for predictions, based on the distributions of reference observations in a given dataset [36]
Advantage: Advantages depend on the specific calibration methodology; e.g., isotonic regression makes no assumptions about the shape of the calibration curve
Disadvantage: Inductive methods must split the data in order to create 'proper' calibration sets; performance depends on the reference observations used; limitations also depend on the specific methodology, e.g., isotonic regression requires a large number of calibration points and has a tendency to overfit

Method: Gaussian Processes (GP; a Bayesian methodology)
Description: Probability distributions over possible functions are used to evaluate confidence intervals and to decide, based on those, whether the prediction should be refit in some region of interest [7]
Advantage: Allows the incorporation of prior knowledge about the data; the uncertainty of a fitted GP increases away from the training data
Disadvantage: Can be computationally expensive, because GPs are non-parametric and must take into account all the training data each time they make a prediction
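To make the distance-to-model flavour of AD estimation concrete, the following sketch (not from the paper; the descriptors, neighbour count, and percentile threshold are all illustrative choices) flags query compounds whose mean distance to their k nearest training neighbours exceeds a threshold derived from the training set itself:

```python
# Sketch of a distance-to-model applicability-domain check: a query point
# is "in domain" if its mean distance to its k nearest training neighbours
# is no larger than a percentile of the training set's own such distances.
# All data here are random stand-ins for molecular descriptors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))  # stand-in for training descriptors

nn = NearestNeighbors(n_neighbors=5).fit(X_train)

def mean_knn_distance(X_query):
    dists, _ = nn.kneighbors(X_query)
    return dists.mean(axis=1)

# Threshold: e.g. the 95th percentile of training-set distances,
# skipping each point's zero distance to itself (column 0).
train_d, _ = nn.kneighbors(X_train, n_neighbors=6)
threshold = np.percentile(train_d[:, 1:].mean(axis=1), 95)

queries = rng.normal(size=(3, 8))
in_domain = mean_knn_distance(queries) <= threshold
print(in_domain)
```

Note that this reliability estimate is purely geometric: it says nothing about the experimental uncertainty of the training labels, which is the disadvantage listed above.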
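The conformal-prediction idea can be sketched with a minimal split-conformal regressor, assuming (as one common choice, not prescribed by the paper) absolute residuals on a held-out calibration split as the nonconformity score:

```python
# Split conformal prediction sketch: residuals on a calibration split give
# a quantile q; the interval for a new query is prediction +/- q. Model
# and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(600, 1))
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(600)

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)

# Nonconformity scores: absolute residuals on the calibration split.
scores = np.abs(y_cal - model.predict(X_cal))

# Empirical quantile with the usual finite-sample correction, for 90% coverage.
alpha = 0.1
n = len(scores)
q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

# Error band for a new query: point prediction +/- q.
x_new = np.array([[0.5]])
pred = model.predict(x_new)[0]
print(pred - q, pred + q)
```

A query compound very unlike the training data would receive a large nonconformity score, but the band width here still ignores any uncertainty in the calibration labels themselves, matching the disadvantage noted in the table.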
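The inductive calibration setup described for Probability Calibration can be sketched with scikit-learn's CalibratedClassifierCV using isotonic regression; the dataset and classifier below are illustrative, not from the paper:

```python
# Isotonic probability calibration sketch: internal cross-validation
# splits the data so the isotonic regressor is fit on held-out
# predictions, i.e. on a 'proper' calibration split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=3,
)
calibrated.fit(X, y)

# Calibrated class probabilities for a few compounds.
probs = calibrated.predict_proba(X[:5])
print(probs)
```

Isotonic regression imposes no parametric form on the calibration curve, but, as the table notes, it needs many calibration points and can overfit when the calibration split is small.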
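The GP property listed above (predictive uncertainty growing away from the training data) can be demonstrated on a one-dimensional toy function; the kernel and data below are illustrative choices:

```python
# Gaussian process sketch: the predictive standard deviation is small
# near the training inputs and grows for queries far outside their range.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 5.0, size=(30, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01,
                              random_state=0)
gp.fit(X_train, y_train)

# Query inside the training range vs far outside it.
_, std_inside = gp.predict(np.array([[2.5]]), return_std=True)
_, std_outside = gp.predict(np.array([[15.0]]), return_std=True)
print(std_inside[0] < std_outside[0])  # True
```

The non-parametric cost mentioned in the table is visible here too: prediction involves the full training set, which is why exact GPs scale poorly to large bioactivity datasets.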