Table 1 Description of methodologies used to take into account uncertainty in predictions, together with their advantages and disadvantages

From: Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

| Method | Description | Advantage | Disadvantage |
| --- | --- | --- | --- |
| Applicability Domain (AD) estimation | Provides an estimate of whether the assumptions of a model are fulfilled for a given input [42,43,44,45]; e.g., distance-to-model AD assigns a reliability based on whether a query compound is close to the model training data | Provides uncertainty estimates when making predictions for new compounds | Does not commonly take into account the uncertainty in the underlying data |
| Conformal Prediction | Produces error bands around the predictions, under the assumption that inputs less similar to the model training data should lead to less certain estimates; this is captured by a nonconformity measure, i.e., a nonconformity score is calculated for each new query compound [46,47,48] | Provides uncertainty estimates when making predictions for new compounds | Does not commonly take into account the uncertainty in the underlying data |
| Probability Calibration | Addresses the question of obtaining accurate likelihoods of predictions based on the distributions of reference observations for a given dataset [36] | There are advantages related to specific calibration methodologies; e.g., isotonic regression makes no assumptions about the curve form | Performance depends on the reference observations used. There are also limitations related to specific calibration methodologies: e.g., isotonic regression requires a large number of calibration points and has a tendency to overfit, and inductive methods must split the data in order to create 'proper' calibration splits |
| Gaussian processes (GP, a Bayesian methodology) | Probability distributions over possible functions are used to evaluate confidence intervals and to decide, based on those, whether the prediction should be refitted in some region of interest [7] | Allows the incorporation of prior knowledge about the data; the uncertainty of a fitted GP increases away from the training data | Can be computationally expensive: because of their non-parametric nature, GPs must take into account all of the training data each time they make a prediction |
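To make the nonconformity-score idea in the Conformal Prediction row concrete, below is a minimal pure-Python sketch of split (inductive) conformal prediction for regression. The function name `conformal_interval` and the toy linear model are illustrative assumptions, not code from the paper: any trained point-prediction model could stand in for `predict`.

```python
import math

def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Return a (1 - alpha) prediction interval around predict(x_new)."""
    # Nonconformity score: absolute residual on a held-out calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(scores)
    # Finite-sample conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    # smallest score (0-indexed here), clipped to the largest score.
    k = min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)
    q = scores[k]
    y_hat = predict(x_new)
    return (y_hat - q, y_hat + q)

# Toy usage: a hypothetical linear "model" and ten calibration points.
model = lambda x: 2 * x
xs = list(range(1, 11))
ys = [2 * x + 0.125 * x for x in xs]   # residual grows with x
print(conformal_interval(model, xs, ys, 4))
```

Note that this split-conformal sketch yields intervals of constant width; normalized nonconformity scores (residuals scaled by a difficulty estimate) are what make intervals wider for inputs less similar to the training data, as described in the table.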