### Data set

ChEMBL is an open, large-scale chemical database containing more than 1.7 million distinct compounds, with bioactivity data extracted from the chemical literature and calculated molecular properties [12]. From ChEMBL version 23, we extracted all compounds having the calculated property acd_logd (calculated logD) at \(\hbox {pH}=7.4\), resulting in 1,679,912 compounds. Standardization of chemical structures was performed with ambitcli version 3.0.2, which is part of the AMBIT cheminformatics platform and relies on the CDK library [13,14,15].

Standardization was performed using default settings, except that the option ‘splitfragments’ was set to TRUE; in this way, salt and solvent components were filtered away. After standardization and removal of duplicates, the data set consisted of 1,592,127 chemical compounds. To evaluate the predictive ability of the developed models, we set aside a test set comprising 100,000 compounds. To perform predictions with the developed model, we downloaded 91,498,351 chemical compounds from the PubChem database [16], which were standardized in the same way as the compounds from the ChEMBL database.

### LogP and logD

The most commonly used measure of lipophilicity is logP, the logarithm of the partition coefficient of a neutral (non-ionized) molecule between two immiscible solvents, usually octanol and water. The distribution coefficient, logD, takes both the compound’s non-ionized and ionized forms into account, and in the determination of logD the aqueous phase is adjusted to a specific pH. Most drugs, and the majority of molecules under investigation for pharmaceutical purposes, contain ionizable groups; logD should therefore be preferred over logP as the descriptor for lipophilicity, especially for compounds that are likely to ionize in physiological media. Of particular interest is logD at \(\hbox {pH}=7.4\) (the physiological pH of blood serum).

### Signature molecular descriptor

The compounds were encoded by the signature molecular descriptor [17], generated by CPSign [18]. A signature molecular descriptor constitutes a vector of occurrences of all atom signatures in the dataset, where an atom signature is a canonical representation of the atom’s environment (i.e., neighboring and next-to-neighboring atoms). Signatures distinguish between different atom and bond types, as well as between aromatic and aliphatic atoms in the atom’s environment. Presence of the same atom signature in several compounds thus indicates that these compounds share an identical 2D structural fragment. Atom signatures can be calculated up to a predefined height (i.e., the number of bonds to the neighboring and next-to-neighboring atoms that the signature spans). We here calculated atom signatures of heights one, two and three, a set of heights that works well both for modeling and for visualization purposes [19, 20].
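As an illustration, the counting of atom signatures can be sketched in pure Python. This is a simplified toy that uses element labels only and ignores bond orders and aromaticity, both of which real signatures encode; the molecule representation and function names are ours, not CPSign's:

```python
from collections import Counter

def atom_signature(atoms, bonds, idx, height, parent=None):
    """Canonical string for the environment of atom `idx` up to `height`
    bonds away. Simplified: element labels only, no bond orders/aromaticity."""
    subs = sorted(atom_signature(atoms, bonds, nb, height - 1, idx)
                  for nb in bonds[idx] if nb != parent) if height > 0 else []
    return atoms[idx] + ("(" + ",".join(subs) + ")" if subs else "")

def signature_descriptor(atoms, bonds, heights=(1, 2, 3)):
    """Vector (here: Counter) of occurrences of all atom signatures
    of the given heights over all atoms of the molecule."""
    return Counter(atom_signature(atoms, bonds, i, h)
                   for i in range(len(atoms)) for h in heights)

# Ethanol with hydrogens suppressed: C-C-O
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
desc = signature_descriptor(atoms, bonds)
```

Two compounds sharing the same key in such a Counter would, in this simplified picture, share the same 2D fragment.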

In total 1,068,830 different 2D structural fragments were found in the dataset. Of these, 675,996 fragments were present in at least two compounds, 251,278 in at least ten compounds, and 50,293 in at least one hundred compounds.

### QSPR modeling by SVM

To model the relationship of logD to the molecular descriptors, we used SVM, a machine learning algorithm that correlates independent variables to the dependent one by means of a linear or nonlinear kernel function. Kernel functions map the data into a high-dimensional space, where correlation is performed based on the structural risk minimization principle; i.e., aiming to increase the generalization ability of a model [21].

We elected to perform the correlation with a linear kernel, using signature molecular descriptors comprised of a vector of 1,068,830 integers. This choice was also supported by the results of our earlier large-scale modeling study, where a linear kernel performed on par with nonlinear ones while requiring dramatically fewer computational resources [22].

SVM with a linear kernel requires fine-tuning of two parameters to obtain an optimal model: the error penalty parameter *cost* and the tolerance of the termination criterion, *epsilon*. We found the optimal *cost* and *epsilon* by performing a grid search, with *cost* values ranging from 0.1 to 10 and *epsilon* values from 0.1 to \(10^{-5}\). SVM models were created with the LIBLINEAR software as accessed from CPSign [18, 23].
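The grid search itself amounts to evaluating every (cost, epsilon) pair and keeping the best-scoring one. A minimal sketch; `cv_rmse` is a hypothetical stand-in for a cross-validated LIBLINEAR run, not part of any library:

```python
from itertools import product

def cv_rmse(cost, epsilon):
    """Hypothetical stand-in for the cross-validated error of a linear-kernel
    SVM trained with the given parameters; a toy surface whose optimum
    happens to sit at cost = 1.0, epsilon = 0.01."""
    return abs(cost - 1.0) + abs(epsilon - 0.01)

# Grid in the ranges used in the study
costs = [0.1, 1.0, 10.0]
epsilons = [0.1, 1e-2, 1e-3, 1e-4, 1e-5]
best_cost, best_eps = min(product(costs, epsilons),
                          key=lambda p: cv_rmse(*p))
```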

### Conformal prediction

In the conformal prediction framework, conventional single-value predictions are complemented with measures of their confidence. In the case of regression, the conformal prediction algorithm outputs a prediction interval around the single prediction point [24]. In QSPR modeling, the size of the prediction interval is determined by some measure of dissimilarity (nonconformity measure) of the new chemical compound to the compounds used in the development of the prediction model. Thus, a compound that is “typical” of the data set will likely be given a smaller interval than a compound lying in a less explored area of, or outside, the modeled chemical domain [25,26,27].

The size of intervals also depends on the desired confidence level (also called *validity*) which is defined as the ratio of compounds for which the true value falls within the prediction interval. Validity can thus range from 0 to 100%, where 0% means that none of the prediction intervals include the true value and 100% means that all of them include the true value.
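Empirical validity at a given confidence level can thus be checked directly on a held-out set, for example (a minimal sketch with made-up numbers):

```python
def empirical_validity(y_true, intervals):
    """Fraction of observations whose true value falls inside the
    corresponding prediction interval (lo, hi)."""
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)

# Toy example: three of the four intervals contain the true value
y_true = [1.2, 0.5, -0.3, 2.0]
intervals = [(0.8, 1.6), (0.0, 0.4), (-1.0, 0.5), (1.5, 2.5)]
validity = empirical_validity(y_true, intervals)
```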

For *inductive conformal prediction*, the training set is split into a *proper training set* and a *calibration set*. The proper training set is used for creating a prediction model, and the calibration set is used for comparing new compounds to existing ones and for estimating the sizes of intervals at a given confidence level. The *inductive* setting means that the split and training are performed once and all subsequent predictions are made by the same model; splitting is typically done so that the calibration set is smaller than the proper training set [26].

In the present study, we applied a 10-fold *cross-conformal predictor* (CCP) as described in [28]. In brief, this algorithm attempts to reduce the influence of the split into proper training and calibration sets by performing multiple such splits, each resulting in an inductive conformal predictor, and aggregating the resulting predictions. Here we chose to use ten aggregated models, performing the dataset splits in a folded fashion (the *cross* prefix refers to *k*-fold cross validation). The workflow of CCP is presented in Fig. 1.
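The folded splitting behind a CCP can be sketched as simple index bookkeeping (the function name is ours, not CPSign's):

```python
def cross_conformal_splits(n, k=10):
    """Yield the k (proper training, calibration) index splits of a k-fold
    cross-conformal predictor: each fold serves once as the calibration set,
    with the remaining folds forming the proper training set."""
    for i in range(k):
        calibration = [j for j in range(n) if j % k == i]
        proper = [j for j in range(n) if j % k != i]
        yield proper, calibration

splits = list(cross_conformal_splits(10, k=5))
```

Each of the k splits yields one inductive conformal predictor; their predictions are aggregated afterwards.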

Conformal predictors are always valid under the assumption of exchangeability, i.e., that predicted compounds are drawn from the same distribution as the compounds used to develop the prediction model. The main criterion when comparing different nonconformity measures is therefore their *efficiency*, i.e., the sizes of the prediction intervals in the case of regression: a smaller prediction interval indicates a higher efficiency. In this work we evaluated three different nonconformity measures. The simplest measure tested here was based on the prediction error given by the endpoint model, where the nonconformity of compound *i*, denoted \(\alpha _{\mathrm{i}}\), is calculated using Eq. 1. This measure, termed *absolute difference*, gives the same prediction interval size for all predictions at a given confidence level, but in turn does not require any error model to be fitted and can thus lessen the computational demands.

$$\begin{aligned} \alpha _{\mathrm{i}}=|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}| \end{aligned}$$

(1)
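With this measure, an inductive conformal predictor reduces to taking a quantile of the sorted calibration errors. A minimal sketch; the rank formula below is one common convention, and exact conventions differ between implementations:

```python
import math

def abs_diff_interval(y_cal, yhat_cal, yhat_new, confidence=0.8):
    """Constant-width prediction interval from Eq. 1: alpha_i = |y_i - yhat_i|
    over the calibration set; the half-width is the alpha found after
    traversing `confidence` of the sorted list."""
    alphas = sorted(abs(y - yh) for y, yh in zip(y_cal, yhat_cal))
    idx = math.ceil(confidence * len(alphas)) - 1  # one common rank choice
    half = alphas[idx]
    return yhat_new - half, yhat_new + half
```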

The second nonconformity measure used, termed *normalized*, assigns larger prediction intervals to objects that are more different from objects used in the model development and hence are “harder” to predict, and smaller intervals to “easier” objects. Naturally, when using normalized nonconformity measures, we expect the median prediction interval to be smaller, i.e., the efficiency to be increased. One of the common ways to obtain a normalized nonconformity measure is by creating an error model, where the dependent variable is the absolute value of error in the endpoint prediction model. This is expected to provide a more efficient nonconformity measure than absolute difference, provided that the error model is predictive. The normalized nonconformity measure is defined following Eq. 5 in [26], here shown in Eq. 2, where \(|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}|\) is the absolute value of error for object *i* in the endpoint prediction model and \(\hat{\mu }_{\mathrm{i}}\) is the prediction from an error model (note that both \(\hat{y}_{\mathrm{i}}\) and \(\hat{\mu }_{\mathrm{i}}\) are calculated when the compound is placed in the calibration set, i.e., is not present in the proper training set).

$$\begin{aligned} \alpha _{\mathrm{i}}=\frac{|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}|}{\hat{\mu }_{\mathrm{i}}} \end{aligned}$$

(2)
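In code, the only change from the absolute-difference case is that the calibration alphas are divided by, and the new compound's half-width multiplied by, the predicted error (a sketch under the same rank convention; the function name is ours):

```python
import math

def normalized_interval(y_cal, yhat_cal, mu_cal, yhat_new, mu_new,
                        confidence=0.8):
    """Eq. 2: alpha_i = |y_i - yhat_i| / mu_i, with mu_i the error-model
    prediction. The new compound's half-width is alpha_conf scaled by its
    own predicted error mu_new, so 'harder' compounds get wider intervals."""
    alphas = sorted(abs(y - yh) / mu
                    for y, yh, mu in zip(y_cal, yhat_cal, mu_cal))
    idx = math.ceil(confidence * len(alphas)) - 1
    half = alphas[idx] * mu_new
    return yhat_new - half, yhat_new + half
```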

The third nonconformity measure, termed *log-normalized* and proposed in [25], uses \(\ln {|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}|}\) instead of \(|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}|\) as the dependent variable when fitting the error model. It also introduces a smoothing factor, \(\beta\), that can be used for “smoothing” the interval sizes, making small intervals a bit larger and very large intervals a bit smaller, i.e., reducing the influence of \(\hat{\mu }_{\mathrm{i}}\) in calculating \(\alpha _{\mathrm{i}}\), Eq. 3. The smoothing can be advantageous because biological measurements always include some measurement error, precluding predictions with intervals close to 0. Very large intervals, on the other hand, can arise from a badly predicted \(\hat{\mu }\) in the error model. We here created models with \(\beta =0\) and \(\beta =1\).

$$\begin{aligned} \alpha _{\mathrm{i}}=\frac{|y_{\mathrm{i}}-\hat{y}_{\mathrm{i}}|}{e^{\hat{\mu }_{\mathrm{i}}} + \beta }; \quad \beta \ge 0 \end{aligned}$$

(3)
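Compared with the normalized measure, only the denominator changes to \(e^{\hat{\mu }} + \beta\) (a sketch under the same conventions as above):

```python
import math

def log_normalized_interval(y_cal, yhat_cal, mu_cal, yhat_new, mu_new,
                            confidence=0.8, beta=1.0):
    """Eq. 3: alpha_i = |y_i - yhat_i| / (exp(mu_i) + beta), where the error
    model mu was fitted to ln|error|; beta >= 0 smooths the interval sizes."""
    alphas = sorted(abs(y - yh) / (math.exp(mu) + beta)
                    for y, yh, mu in zip(y_cal, yhat_cal, mu_cal))
    idx = math.ceil(confidence * len(alphas)) - 1
    half = alphas[idx] * (math.exp(mu_new) + beta)
    return yhat_new - half, yhat_new + half
```

With a large \(\beta\) the exponential term matters less and the intervals approach the constant width of the absolute-difference measure.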

For each of the inductive conformal predictors, \(\alpha _{\mathrm{i}}\) values are computed for all compounds in the calibration set and are then sorted in ascending order. When performing a prediction, the test compound is first predicted by the endpoint model to get the prediction midpoint, \(\hat{y}\). To compute the prediction interval, the algorithm looks in the ordered set of nonconformity values to get \(\alpha _{\mathrm{conf. lev.}}\), which depends on the desired confidence of the prediction. If, for example, an 80% confidence is required, \(\alpha _{\mathrm{conf. lev.}}\) is the \(\alpha _{\mathrm{i}}\) value found after traversing 80% of the list. If the nonconformity measure depends on an error model, an error prediction, \(\hat{\mu }_{\mathrm{i}}\), is also made. The size of the prediction interval is then calculated by rearranging the nonconformity measure to solve for \(|y-\hat{y}|\), resulting in the final prediction interval \((\hat{y} - |y-\hat{y}|, \hat{y} + |y-\hat{y}|)\) for the single inductive predictor. The CCP prediction is then computed as the median prediction midpoint and the median prediction interval size.
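The final aggregation step can be sketched as follows, with each element of `icp_predictions` being the (lo, hi) interval produced by one of the inductive predictors:

```python
from statistics import median

def ccp_predict(icp_predictions):
    """Aggregate per-model prediction intervals (lo, hi) into the
    cross-conformal result: median midpoint and median interval size."""
    mids = [(lo + hi) / 2 for lo, hi in icp_predictions]
    sizes = [hi - lo for lo, hi in icp_predictions]
    mid, size = median(mids), median(sizes)
    return mid - size / 2, mid + size / 2

# Three toy inductive predictions instead of the ten used in the study
interval = ccp_predict([(0.0, 2.0), (0.5, 2.5), (1.0, 4.0)])
```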

### Molecule gradient for the prediction

CPSign allows the computation of a “prediction gradient”, as described in [29]. This is done by altering the number of occurrences of each signature descriptor of the molecule, changing one descriptor at a time. For each alteration a new prediction is made, and the relative change in the prediction output is taken as the gradient for that signature descriptor. If the gradient value for a descriptor is positive, the altered prediction gave a larger regression value, meaning that adding more of this descriptor would move the prediction to a higher response value, and vice versa if the gradient value is negative. In CCP, each of the ten models produces its own gradient, and the resulting gradient is computed as the median of the individual gradients. The *per-descriptor* contributions can then be transformed into *per-atom* contributions by summing, for each atom, the contributions of all signatures that the atom is part of.
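The per-descriptor part of this procedure can be sketched as follows. Here the molecule is a mapping from signature to occurrence count and `predict_fns` is a hypothetical list of the aggregated models' prediction functions; CPSign's actual implementation differs:

```python
from statistics import median

def prediction_gradient(descriptor, predict_fns, delta=1):
    """For each signature, bump its occurrence count by `delta`, re-predict
    with every model, and take the median per-model change as the gradient."""
    base = [f(descriptor) for f in predict_fns]
    gradient = {}
    for key in descriptor:
        perturbed = dict(descriptor)
        perturbed[key] += delta
        gradient[key] = median((f(perturbed) - b) / delta
                               for f, b in zip(predict_fns, base))
    return gradient

# Two toy linear "models" over signatures "a" and "b"
models = [lambda d: 2 * d["a"] + d["b"], lambda d: d["a"]]
grad = prediction_gradient({"a": 2, "b": 1}, models)
```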