Bioactivity data set
The ChEMBL (version 27) database [32] was filtered for compounds with a reported pChEMBL (normalized −log10) activity value from 'binding' (IC50/EC50/Ki/Kd) human protein assays. Confidence scores of 5 and 8 were employed for the reproducibility comparison, corresponding to activity values aggregated at the level of protein complexes or of specific individual proteins, respectively. Compounds were subsequently filtered for a confidence score of 8 for modelling purposes. Targets were then filtered to retain those with at least 50 active compounds at each of the pChEMBL activity thresholds of 5, 6 and 7 (corresponding to activity values of 10, 1 and 0.1 μM), to ensure that only proteins covering sufficient chemical space across the activity thresholds were retained for the training set. Models were trained for 559 targets; Additional file 1: Figure S1 summarizes the number of active and inactive data points for each model, for which a large variance in the amount of bioactivity data available per target was observed. For example, there was a median of 389, 375 and 386 active compounds per target for the pChEMBL classification thresholds of 5, 6 and 7, respectively. A median of 1000 inactive compound datapoints was calculated across targets and thresholds, with a median ratio of 0.4 active compounds to inactive compounds (see Additional file 1: Figure S1 for details). The dataset of putative inactives per target is available for download as zip files here: https://pidginv3.readthedocs.io/en/latest/install.html.
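As an illustration of this filtering, the sketch below applies the pChEMBL, assay-type, confidence-score and per-target activity-count criteria to a hypothetical flat export of ChEMBL 27; the file name and column names (standard_type, pchembl_value, confidence_score, target_chembl_id) are assumptions and depend on how the database is queried.

```python
import pandas as pd

# Hypothetical flat export of ChEMBL 27 'binding' human protein assay records
acts = pd.read_csv("chembl27_binding_human_activities.csv")

# Keep binding endpoints with a reported pChEMBL value and confidence score 8
acts = acts[acts["standard_type"].isin(["IC50", "EC50", "Ki", "Kd"])]
acts = acts[acts["pchembl_value"].notna()]
acts = acts[acts["confidence_score"] == 8]

# Retain targets with >= 50 actives at every pChEMBL threshold
# (5, 6 and 7, i.e. 10, 1 and 0.1 uM)
def enough_actives(group, thresholds=(5, 6, 7), n_min=50):
    return all((group["pchembl_value"] >= t).sum() >= n_min for t in thresholds)

kept = acts.groupby("target_chembl_id").filter(enough_actives)
print(kept["target_chembl_id"].nunique(), "targets retained")
```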
Compound pre-processing
Compound structures were standardized using the IMI eTox molecular structure standardizer (https://github.com/flatkinson/standardiser), with settings to remove salts, waters and solvents, to normalize charges, to tautomerize (to the most favourable form) and to remove duplicates. RDKit [33] (version 2019.03.4) was employed to remove structures without carbon and structures containing atoms with atomic numbers between 21–32, 36–52 or greater than 53, and to retain only compounds with a molecular weight between 100 and 1000 Da, in order to retain small organic molecule chemical space.
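The following sketch illustrates the RDKit part of this filtering (carbon requirement, element exclusion and molecular-weight window); it assumes standardization has already been performed with the IMI eTox standardiser, and the helper function shown is illustrative rather than the exact code used in this work.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Atomic numbers excluded from small organic molecule chemical space,
# following the ranges described in the text (21-32, 36-52, > 53)
DISALLOWED = set(range(21, 33)) | set(range(36, 53)) | set(range(54, 119))

def keep_compound(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    atomic_nums = {atom.GetAtomicNum() for atom in mol.GetAtoms()}
    if 6 not in atomic_nums:          # remove structures without carbon
        return False
    if atomic_nums & DISALLOWED:      # remove non-organic elements
        return False
    return 100.0 <= Descriptors.MolWt(mol) <= 1000.0  # 100-1000 Da window

print(keep_compound("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
```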
Calculating uncertainty values for ChEMBL activity labels
Prior to the application of the PRF algorithm, uncertainty values for the bioactivity labels had to be calculated. Motivated by the hypothesis that bioactivity data extracted from public databases carry a degree of experimental uncertainty, we introduced this uncertainty into the labels by "injecting noise", so that labels were treated as probability distributions rather than deterministic values. Bioactivity training data were converted into an uncertainty-based scale on a per-threshold basis (\(pActivity^{T}\)), across a range of arbitrary standard deviation (σ) thresholds between 0.0 and 0.6, at increments of 0.2. By varying the standard deviation, \(\sigma\), we evaluated model behaviour over a range of uncertainties.
For each bioactivity value (\(pActivity\)), we used the cumulative distribution function (CDF) of a normal distribution (Eq. 1) with a mean equal to the bioactivity threshold for each \(pActivity^{T}\). More concretely, assuming only the mean and variance of the activity values are known, the maximum entropy distribution representing these values is a normal distribution [34]. One can set the mean and standard deviation parameters of this distribution to a threshold value (e.g., 10 µM) and an experimental error (e.g., \(\sigma\) of 0.3), and compute the probability of activity with the CDF. Each \(pActivity\) value was converted to a y-label probability (∆y), a value representing the uncertainty in the measurement, which was used for PRF training. We refer to this as the 'ideal y-label', or simply 'y-ideal', because it represents the ideal case in which experimental error is taken into account when training a target prediction model. For the calculation of ∆y, the stats.norm.cdf() function from the scipy [35] library in Python was used, as in Eq. 1:
$$\Delta y\left( {\vec{c}} \right) = \frac{1}{2}\left[ {1 + {\text{erf}}\left( {\frac{{\overrightarrow {{p_{Activity} }} - \overrightarrow {{p_{Threshold} }} }}{\sigma \sqrt 2 }} \right)} \right]$$
(1)
where ∆y are the y-label probabilities, \(\vec{c} = (C_{1} , \ldots ,C_{n} )\) represents the compounds in the training set, \(\overrightarrow {{p_{Threshold} }}\) describes the pre-defined binding affinity thresholds for the \(\overrightarrow {{p_{Activity} }}\) (−log10) values, and \(\sigma\) is the standard deviation, defined in this work using arbitrarily chosen cut-offs (which could also be set, as required, to the deviation across replicates within or between experiments, screening platforms or activity unit aggregation methods).
Values of \(\Delta y\) hence capture the likelihood that a given compound \(C_{n}\) has a binding affinity falling within the boundary of the active class at the \(p_{Threshold}\), given \(p_{Activity}\) and the assumption that most bioactivity data are homoscedastic (which is not always true in practice). For example, a compound with a pChEMBL value of 5.1 (8 μM) was assigned a new ∆y of ~0.63 for a pChEMBL activity threshold of 5.0 (10 μM) and a user-defined standard deviation \(\sigma\) of 0.3 (Fig. 2), i.e., there is a 63% chance for that compound to belong to the active class given those parameters, whereas a traditional RF classifier would treat it as 100% active. This enables the activity to be represented in a framework in-between the classification and regression architectures, with philosophical differences from either approach. Compared to classification, this approach enables a better representation of the factors increasing or decreasing activity around the cut-off. Compared to a classical regression framework, one can utilize all data (even delimited, operator-qualified or censored data far from a cut-off) while still taking into account the granularity around the cut-off. Thereby, PRF combines characteristics from both classification and regression settings.
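A minimal sketch of Eq. 1 using scipy.stats.norm.cdf, reproducing the worked example above, is shown below; the function name y_label_probability is ours and the σ = 0 fallback to hard labels is an assumption.

```python
from scipy.stats import norm

# Eq. 1: the y-label probability (delta-y) is the normal CDF of the measured
# pActivity around the activity threshold, with sigma as the assumed
# experimental uncertainty.
def y_label_probability(p_activity, p_threshold, sigma):
    if sigma == 0.0:
        return float(p_activity >= p_threshold)  # hard (deterministic) label
    return norm.cdf(p_activity, loc=p_threshold, scale=sigma)

# Worked example from the text: pChEMBL 5.1 vs. a threshold of 5.0, sigma 0.3
print(round(y_label_probability(5.1, 5.0, 0.3), 2))  # ~0.63
```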
Supplemental inactive data
In order to ensure sufficient chemical space of compounds not binding to targets (hence assigned a constant [\(p_{Activity}\) = 0] across all test-train standard deviations), an inactive dataset of compounds from PubChem was used, as published in Mervin et al. [36] and available at https://github.com/lhm30/PIDGINv3. These supplemental inactive compounds were selected by sphere exclusion (SE), i.e., randomly sampled from PubChem with a fingerprint Tanimoto coefficient similarity to the actives lower than 0.4, until the desired number of compounds that could reasonably be assumed to be inactive against a given target was obtained. The dataset included 38,902,310 inactive-labelled compound annotations across the full complement of targets. For these inactive datapoints, ∆y remained constant across test-train σ thresholds (i.e., only bioactivity data points from ChEMBL were assigned ∆y probabilities greater than zero). In more detail, out of a total of 557 models trained (e.g., with a pXC50 threshold equal to 5), 310 models (~56%) included at least one SE datapoint in the inactive set of compounds, and the percentage of SE data included in the inactive data of these 310 models is shown in Fig. 3. We observe that 183 models (33% of the total) were trained with a small fraction of SE data (about 20% of the total inactive compounds), while 116 models (21% of the total) were trained with a high fraction of SE data points (more than 80% of the inactive compound set).
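A simplified sketch of how such putative inactives can be selected is shown below; it accepts randomly drawn candidate compounds whose maximum Morgan-fingerprint Tanimoto similarity to the actives is below 0.4. The fingerprint settings, sampling details and example molecules are assumptions and may differ from the actual PIDGINv3 sphere-exclusion procedure.

```python
import random
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) if mol else None

def sample_putative_inactives(active_smiles, candidate_smiles, n_wanted, cutoff=0.4):
    # Accept random candidates whose maximum similarity to any active is < cutoff
    active_fps = [fp for fp in map(fingerprint, active_smiles) if fp is not None]
    inactives = []
    for smi in random.sample(candidate_smiles, len(candidate_smiles)):
        fp = fingerprint(smi)
        if fp is None:
            continue
        if max(DataStructs.BulkTanimotoSimilarity(fp, active_fps)) < cutoff:
            inactives.append(smi)
        if len(inactives) == n_wanted:
            break
    return inactives

actives = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1"]      # illustrative active
candidates = ["CCO", "c1ccccc1", "CCN(CC)CC"]    # illustrative candidates
print(sample_putative_inactives(actives, candidates, n_wanted=2))
```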
Machine learning modelling and benchmarking
Random Forest
The Probabilistic Random Forest (PRF) is a modification to the original RF algorithm; hence we first outline the RF concept followed by the modifications to enable uncertainty estimation via the PRF.
RF is an ensemble method using a number of decision trees during training. Each decision tree is a graph relating (chemical) features to the target (activity) variable through a series of conjoined conditions arranged in a top-to-bottom "tree-like" structure. For binary classification, trees are constructed via nodes searching for the 'best split', i.e., the combination of feature and threshold providing the best separation between classes [37]. Gini impurity, the probability that a randomly selected object (compound) will be misclassified if assigned a randomly selected label (i.e., active or inactive), is frequently employed for this purpose. Let \(P_{n,A}\) and \(P_{n,B}\) denote the fractions of objects of classes A and B within the group in node n (the class probabilities); the Gini impurity \(G_{n}\) is then:
$$G_{n} = 1 - \left( {P_{n,A}^{2} + P_{n,B}^{2} } \right)$$
(2)
The algorithm iterates over features and thresholds, dividing the training data into "left" and "right" groups corresponding to objects left or right of the threshold, respectively. The splitting threshold selected is the one resulting in the minimal combined impurity of the two groups, defined as:
$$G_{n,right} \times f_{n,right} + G_{n,left} \times f_{n,left}$$
(3)
where \(G_{n,right}\) and \(G_{n,left}\) are the Gini impurities of the two groups in node n, and \(f_{n,right}\) and \(f_{n,left}\) are the fractions of objects in each group. This iteration over features and thresholds is repeated recursively (so long as the groups have a lower combined impurity than the impurity of the node) until ending in a terminal node, which assigns probabilities according to the distribution of compounds among the classes. Novel predictions are propagated through the tree, with predictions assigned via the largest fraction of samples in the terminal nodes.
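A small numerical illustration of Eqs. 2-3 (node Gini impurity and the combined impurity of a candidate split) is given below; the toy feature values and labels are ours.

```python
import numpy as np

def gini(labels):
    # Eq. 2: Gini impurity of a node from the class fractions
    p_a = np.mean(labels == 1)   # fraction of class A (active)
    p_b = 1.0 - p_a              # fraction of class B (inactive)
    return 1.0 - (p_a ** 2 + p_b ** 2)

def combined_impurity(feature, labels, threshold):
    # Eq. 3: impurities of the two groups weighted by their object fractions
    right = labels[feature >= threshold]
    left = labels[feature < threshold]
    f_right, f_left = len(right) / len(labels), len(left) / len(labels)
    return gini(right) * f_right + gini(left) * f_left

feature = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(combined_impurity(feature, labels, 0.5))  # perfect split -> 0.0
```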
Individual decision trees are prone to overfitting since they are engineered to perfectly fit all samples in the training data set. To combat this, a RF is a set of many decision trees, with randomness introduced via: (1) randomly sampled subsets of the full dataset, and (2) random subsets of the features in each node of the trees. Aggregation across the randomised decision trees reduces the tendency to overfit. An unlabelled object is propagated through the trees in the forest, and the predicted class probability for an input sample is computed as the mean predicted fraction of samples of the same class in the terminal nodes across the trees. Both (a) the fraction of trees voting for a predicted class and (b) the deviation of the fraction of samples in the terminal nodes across the forest can serve as certainty measures for predictions.
Probabilistic Random Forest
A RF receives a sample of observed pairs of random variables, \(\left( {x_{1} ,y_{1} } \right), \ldots ,\left( {x_{n} ,y_{n} } \right)\), describing the relation \(h:X \to Y\) used to predict \(y\) for a given value of \(x\). The PRF, on the other hand, receives \(\left( {x_{1} ,y_{1} ,\Delta x_{1} ,\Delta y_{1} } \right), \ldots ,\left( {x_{n} ,y_{n} ,\Delta x_{n} ,\Delta y_{n} } \right)\), where ∆x and ∆y represent the uncertainty in the features and labels, respectively. Naturally, the focus of this work is on (activity) label uncertainties, and (chemical) feature uncertainties were not specified.
To account for uncertainty, the PRF treats labels as probability mass functions (PMFs) rather than deterministic values: each object is assigned a label with some probability. The relationship between RF and PRF follows naturally from this concept, since the PRF converges to a RF when the uncertainties in ∆y are low or zero (see Fig. 2). Another difference between the two algorithms is that the randomness of a RF is introduced epistemically (i.e., by the model itself), by training different decision trees on randomly selected subgroups of the data and by using random subsets of features in each node of each decision tree. The PRF, on the other hand, introduces randomness aleatorically; it is not drawn from a model-defined distribution, but rather from the underlying uncertainty (experimental deviation) relevant for classification. Label uncertainties propagate through the splitting criterion during the construction of the tree. As in a standard tree, nodes are split into left and right groups such that the resulting subsets are more homogeneous than the set in the parent node, and a cost function is minimized for this purpose. Since the transition from y to ∆y means that labels become random variables, instead of calculating the fraction of objects of each class in node n, expectation values weighted by \(\pi_{i} \left( n \right)\), the probability of object i reaching node n, are calculated:
$$\begin{gathered} P_{n,A} \to \overline{P}_{n,A} = \frac{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right) \, p_{i,A} }}{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right)}} \hfill \\ P_{n,B} \to \overline{P}_{n,B} = \frac{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right) \, p_{i,B} }}{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right)}} \hfill \\ \end{gathered}$$
(4)
Hence, Gini impurity is transformed to:
$$G_{n} \to \overline{G}_{n} = 1 - \left( {\overline{P}_{n,A}^{2} + \overline{P}_{n,B}^{2} } \right)$$
(5)
The cost function (weighted average of the modified impurities of the two nodes) is then:
$$\overline{G}_{{\left( {n,r} \right)}} \times \frac{{\mathop \sum \nolimits_{{i \in \left( {n,r} \right)}} \pi_{i} \left( {n,r} \right) }}{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right)}} + \overline{G}_{{\left( {n,l} \right)}} \times \frac{{\mathop \sum \nolimits_{{i \in \left( {n,l} \right)}} \pi_{i} \left( {n,l} \right) }}{{\mathop \sum \nolimits_{i \in n} \pi_{i} \left( n \right)}}$$
(6)
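To illustrate Eqs. 4-6, the sketch below computes the modified class probabilities and the modified Gini impurity for a toy node, using the propagation probabilities \(\pi_{i}(n)\) and the soft labels ∆y; all numbers are illustrative.

```python
import numpy as np

def modified_class_probabilities(pi, p_class_a):
    # Eq. 4: class fractions replaced by expectation values weighted by pi_i(n)
    p_bar_a = np.sum(pi * p_class_a) / np.sum(pi)
    p_bar_b = np.sum(pi * (1.0 - p_class_a)) / np.sum(pi)
    return p_bar_a, p_bar_b

def modified_gini(pi, p_class_a):
    # Eq. 5: Gini impurity computed from the modified class probabilities
    p_bar_a, p_bar_b = modified_class_probabilities(pi, p_class_a)
    return 1.0 - (p_bar_a ** 2 + p_bar_b ** 2)

# Three objects reaching the node with probabilities pi and soft labels delta-y
pi = np.array([1.0, 0.8, 0.3])
delta_y = np.array([0.63, 0.95, 0.10])
print(modified_gini(pi, delta_y))
```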
The modified propagation scheme and cost function are the two major conceptual changes separating the PRF from the RF. After training, the PRF classifies new objects using a propagation scheme that is identical for training and prediction. In a classical RF, the class probability of the single terminal node an object reaches provides the prediction; in the PRF, each object reaches every terminal node with some probability. Hence, the predictions of all terminal nodes must be taken into account to obtain the prediction of the tree, which is given by the following equations:
$$Pr_{A} = \mathop \sum \limits_{terminal\;nodes} \pi \left( n \right) \times \overline{P}_{n,A}$$
(7)
$$Pr_{B} = \mathop \sum \limits_{terminal\;nodes} \pi \left( n \right) \times \overline{P}_{n,B}$$
(8)
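The corresponding aggregation of Eqs. 7-8 over the terminal nodes of a single tree can be written, with illustrative numbers, as follows.

```python
import numpy as np

# Eqs. 7-8: the tree prediction is the sum over terminal nodes of the
# probability pi(n) of reaching node n times that node's modified class probability.
pi_terminal = np.array([0.7, 0.2, 0.1])   # probabilities of reaching each leaf
p_bar_a = np.array([0.9, 0.4, 0.1])       # modified P(A) in each leaf
pr_a = np.sum(pi_terminal * p_bar_a)
pr_b = np.sum(pi_terminal * (1.0 - p_bar_a))
print(pr_a, pr_b)                          # ~0.72, ~0.28
```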
Computational details
The PRF implementation of Reis, Baron and Shahaf [30] was employed for this work, as provided via https://github.com/ireis/PRF. The algorithm was fit with the RDKit fingerprints and the corresponding ∆y labels on a per standard deviation (σ) basis, with a lower propagation probability limit ("keep_proba") of 0.05, to ensure that a given object did not propagate to branches with a low probability (reducing runtime without impairing performance). The output of the PRF was recorded as the number of probabilistic decision trees in the forest predicting the label. The RF was implemented using the RandomForestClassifier class from Scikit-Learn.
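A minimal sketch of this model fitting set-up is given below; the toy fingerprint matrix and ∆y labels are placeholders, the RandomForestClassifier settings beyond the defaults are assumptions, and the commented PRF call is indicative only, since the exact class and argument names depend on the version of the ireis/PRF package (only the keep_proba = 0.05 setting is taken from this work).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for RDKit fingerprints (rows) and delta-y labels
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 2048))
delta_y_train = rng.random(200)

# Baseline RF (Scikit-Learn), trained on hard 0/1 labels
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, (delta_y_train >= 0.5).astype(int))
rf_proba = rf.predict_proba(X_train)[:, 1]

# The PRF (https://github.com/ireis/PRF) is instead fit on the soft delta-y
# labels with keep_proba=0.05; the class/argument names below are indicative.
# from PRF import prf
# prf_model = prf(n_estimators=100, keep_proba=0.05)
# prf_model.fit(X=X_train, py=delta_y_train)
```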
Two different metrics were used to compare the PRF and RF prediction probabilities. The first metric is the error margin as described in Eq. 9:
$$Error\;margin = \left( {ideal\;ylabel - {\varvec{RF}}\;probabilities} \right) - \left( {ideal\;ylabel - {\varvec{PRF}}\;probabilities} \right)$$
(9)
In addition to the error margin, when the two scores (the y-probabilities from the RF and from the PRF) are compared, it is also possible to compare the scores relative to each other rather than only by their absolute values. This is achieved by calculating the relative increase towards the potential optimum (i.e., the ideal y-label), as shown in Eq. 10:
$$Relative\;score = \frac{{\left| {\left| {error\;margin\;RF} \right| - \left| {error\;margin\;PRF} \right|} \right|}}{{error\;margin \left( {worst\;performing\;classifier} \right)}} \times 100$$
(10)
The rationale behind this calculation is that, for an ideal y-label of e.g. 0.65, a difference between RF and PRF y-probabilities of 0.75 versus 0.70 is more meaningful than a difference of 0.85 versus 0.80. In terms of relative score, the former and the latter differences in y-probabilities correspond to a 50% and a 25% change, respectively.
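This is reproduced in the short sketch below, which computes the per-classifier error margins and the relative score (Eq. 10) for the two example cases; the function name is ours.

```python
def relative_score(ideal, rf_prob, prf_prob):
    # Per-classifier error margins relative to the ideal y-label
    err_rf = abs(ideal - rf_prob)
    err_prf = abs(ideal - prf_prob)
    # Eq. 10: difference of error margins relative to the worst classifier
    worst = max(err_rf, err_prf)
    return abs(err_rf - err_prf) / worst * 100.0

print(round(relative_score(0.65, 0.75, 0.70), 1))  # 50.0
print(round(relative_score(0.65, 0.85, 0.80), 1))  # 25.0
```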
Evaluation of the sphere exclusion effect on the fraction of models improved by PRF
The effect of including sphere-excluded putative inactives on the error margins achieved by the Probabilistic RF was evaluated. In this comparison, we selected (a) targets that did not contain any putative inactives (models without SE data) and (b) targets for which more than 80% of the inactive datapoints were putative inactives (models with SE data), across the three different bioactivity thresholds and the different emulated test-train standard deviations. We calculated the error margin between the two algorithms (as described in the section above) separately for models without SE data and models with SE data across the different standard deviations. As a result, we derived two error margin distributions and compared their means to assess whether there is a statistically significant difference. First, a Kolmogorov–Smirnov (KS) test in scipy (scipy.stats.kstest) was applied to check whether the error margin distributions are normally distributed. Next, an unpaired t-test (scipy.stats.ttest_ind, with the 'equal_var' parameter set to False, i.e., Welch's t-test) was applied to statistically compare the distributions.
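A sketch of these two statistical tests using scipy is shown below; the two arrays are placeholders standing in for the error margin distributions of models without and with SE data.

```python
import numpy as np
from scipy import stats

# Placeholder error margin distributions for the two groups of models
rng = np.random.default_rng(0)
err_without_se = rng.normal(loc=0.02, scale=0.05, size=100)
err_with_se = rng.normal(loc=0.05, scale=0.05, size=100)

# Normality check with a Kolmogorov-Smirnov test (scipy.stats.kstest)
print(stats.kstest(err_without_se, "norm"))
print(stats.kstest(err_with_se, "norm"))

# Unpaired t-test without assuming equal variances (Welch's t-test)
print(stats.ttest_ind(err_without_se, err_with_se, equal_var=False))
```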