Table 1 Description of methodologies used to take into account uncertainty in predictions, together with their advantages and disadvantages

From: Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

| Method | Description | Advantage | Disadvantage |
| --- | --- | --- | --- |
| Applicability Domain (AD) estimation | Provides an estimate of whether the assumptions of a model are fulfilled for a given input [42,43,44,45]; e.g., distance-to-model AD assigns a reliability based on whether a query compound is close to the model training data | Provides uncertainty estimates when making predictions for new compounds | Does not commonly take into account the uncertainty in the underlying data |
| Conformal Prediction | Produces error bands around the predictions, under the assumption that inputs less similar to the model training data should lead to less certain estimates; this is captured by a nonconformity measure, i.e., a nonconformity score is calculated for each new query compound [46,47,48] | Provides uncertainty estimates when making predictions for new compounds | Does not commonly take into account the uncertainty in the underlying data |
| Probability Calibration | Addresses the question of obtaining accurate likelihoods of predictions based on the distributions of reference observations for a given dataset [36] | There are advantages related to specific calibration methodologies; e.g., isotonic regression makes no assumptions about the curve form | Performance depends on the reference observations used. There are also limitations related to specific calibration methodologies: e.g., isotonic regression requires a large number of calibration points and has a tendency to overfit, and inductive methods must split the data in order to create 'proper' calibration splits |
| Gaussian processes (GP, a Bayesian methodology) | Probability distributions over possible functions are used to evaluate confidence intervals and to decide, based on those, whether the prediction should be refitted in some region of interest [7] | Allows the incorporation of prior knowledge about the data; the uncertainty of a fitted GP increases away from the training data | Can be computationally expensive: because of their non-parametric nature, GPs must take into account all of the training data each time they make a prediction |
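To make the nonconformity-score idea in the Conformal Prediction row concrete, below is a minimal pure-Python sketch of split (inductive) conformal prediction for regression. The function name `conformal_interval` and the toy linear model are illustrative assumptions, not code from the paper: any trained point-prediction model could stand in for `predict`.

```python
import math

def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Return a (1 - alpha) prediction interval around predict(x_new)."""
    # Nonconformity score: absolute residual on a held-out calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(scores)
    # Finite-sample conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    # smallest score (0-indexed here), clipped to the largest score.
    k = min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)
    q = scores[k]
    y_hat = predict(x_new)
    return (y_hat - q, y_hat + q)

# Toy usage: a hypothetical linear "model" and ten calibration points.
model = lambda x: 2 * x
xs = list(range(1, 11))
ys = [2 * x + 0.125 * x for x in xs]   # residual grows with x
print(conformal_interval(model, xs, ys, 4))
```

Note that this split-conformal sketch yields intervals of constant width; normalized nonconformity scores (residuals scaled by a difficulty estimate) are what make intervals wider for inputs less similar to the training data, as described in the table.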