 Methodology
 Open access
CPSign: conformal prediction for cheminformatics modeling
Journal of Cheminformatics volume 16, Article number: 75 (2024)
Abstract
Conformal prediction has seen many applications in pharmaceutical science, as it can calibrate the outputs of machine learning models and produce valid prediction intervals. We here present the open source software CPSign, a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, as well as probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures, but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4J models. We also describe features for visualizing results from conformal models, including calibration and efficiency plots, as well as features to publish predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches, including random forest and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparable predictive efficiency, and with superior runtime and lower hardware requirements compared to neural-network-based models. CPSign has been used in several studies and is in production use in multiple organizations. The ability to work directly with chemical input files and to perform descriptor calculation and SVM modeling in the conformal prediction framework, all within a single software package with a low footprint and fast execution time, makes CPSign a convenient yet flexible package for training, deploying, and predicting on chemical data. CPSign can be downloaded from GitHub at https://github.com/arosbio/cpsign.
Scientific contribution
CPSign provides a single software package that allows users to perform data preprocessing, modeling, and prediction directly on chemical structures, using conformal and probabilistic prediction. Building and evaluating new models can be achieved at a high abstraction level, without sacrificing flexibility and predictive performance, as showcased in a method evaluation against contemporary modeling approaches in which CPSign performs on par with a state-of-the-art deep-learning-based model.
Introduction
Ligand-based modeling and quantitative structure–activity relationships (QSAR) are computational methods used in drug discovery to predict properties of small molecules, such as binding affinity or activity towards a protein target, and toxicity [1,2,3]. The approach relies on the structures and properties of known chemical compounds, and commonly takes advantage of machine learning to construct predictive models. Over the years, the available data in public repositories related to cheminformatics have increased, and the applications and accuracy of predictive models have expanded and improved. This has led to an increased utilization of ligand-based modeling in drug discovery projects [4].
The predictive performance of machine learning models is commonly measured on an external test set or using cross-validation, with accuracy, AUC, and F1 scores (classification), and RMSE and \(\mathrm {R^2}\) (regression) as commonly used metrics. However, these metrics do not convey the level of confidence for individual predicted objects. When predicting two different objects, it would seem natural that the object most dissimilar to the training data should result in a larger prediction interval, reflecting greater uncertainty, and vice versa. In drug discovery, where predicted objects are in many cases novel chemical structures, this is particularly important, and concepts and approaches to determine a model’s applicability domain have been proposed to this end [5]. However, in most cases these are ad hoc methods without a proven theoretical underpinning.
Conformal prediction is a framework that provides a way to generate valid prediction intervals for a wide range of machine learning algorithms [6]. Unlike traditional prediction intervals, which apply the same certainty regardless of the predicted object, conformal prediction constructs prediction intervals that are both guaranteed to be valid and based on the estimated difficulty of the predicted object. This makes conformal prediction a powerful tool for machine learning in settings where the underlying distribution of data is unknown, and a way to address applicability domain assessment for compounds [7, 8].
Conformal prediction has been extensively used in drug discovery [9] with applications including screening [10], toxicology prediction [11, 12], property prediction [13], target prediction [14], and prediction of pharmacokinetics [15, 16]. More recently, conformal prediction has also been used with Deep Neural Networks in drug discovery applications [17,18,19] and in medical applications [20].
Existing software for conformal prediction includes the Nonconformist package [21], a Python implementation of the conformal prediction framework that has been used in many drug discovery projects [22,23,24]. PUNCC [25] is a Python library that implements conformal prediction algorithms and associated techniques. Crepes [26] is a more recent Python package for generating conformal regressors and predictive systems. None of the aforementioned tools, however, can work with chemical structures as input; all operate on numerical data. Hence we think there is room for a conformal prediction tool specifically tailored for chemical structures. For a more extensive list of resources, papers, and software related to conformal prediction, we refer to the Awesome Conformal Prediction GitHub repository [27].
In this manuscript we present CPSign, a standalone software tool that implements conformal prediction for cheminformatics modeling. We start by introducing conformal prediction and Venn-ABERS prediction as well as the default CPSign methods: Signatures [28, 29] for molecular representation, and support vector machines (SVMs) for machine learning modeling. We continue by discussing the implementations in CPSign and associated tools, and also present a comparison between CPSign and several other methods on a set of regression and classification datasets.
Methods
Conformal prediction
Conformal prediction is a mathematical framework built up by a collection of algorithms used for producing confidence guarantees for standard machine learning algorithms [6]. Many resources have introduced it in different settings, e.g., in the drug discovery domain [9, 30]. Here we give a brief introduction to the subject, focusing mainly on the inductive versions of the algorithms, in which an underlying scoring algorithm is trained once and then reused for all future predictions (until enough new training data has been accumulated to warrant a full retraining to include new knowledge in the model). The inductive setting contrasts with transductive modeling, where the underlying scoring algorithm is retrained for every prediction.
At the heart of conformal prediction lies the notion of nonconformity, which intuitively is a measure of how “strange” an object is compared to other objects. The term object here refers to the features x of a training observation (x, y), where y is the label of the observation. The nonconformity of an object i is often referred to as \(\alpha _i\) and is computed using a nonconformity function, \(h(x_i)\). This function, h, is based on the output of an underlying scoring algorithm, which could be any machine learning algorithm that produces a prediction score. In the inductive setting, where the underlying algorithm is trained only once, the available training data is split into two disjoint sets: the proper training set and the calibration set. The proper training set is used for training the underlying algorithm, while the calibration set is used for calibrating the predictions based on the nonconformity function.
The classification and regression algorithms differ slightly, and we refer readers to, e.g., Vovk et al. [6] for a detailed explanation. In essence, the classification algorithm computes the nonconformity, \(\alpha _i\), for all n instances in the calibration set during training, resulting in a list \(\alpha _1,...,\alpha _n\). When predicting a new test object, \(x_{n+1}\), its nonconformity, \(\alpha _{n+1}\), is calculated and then ranked against the list \(\alpha _1,...,\alpha _{n+1}\) (i.e., including the \(n+1\) test instance) according to Eq. 1, resulting in a p-value for the object. This ranking is in most cases performed separately (referred to as mondrian) for each possible class label y (i.e., with a separate list of \(\alpha\) values for each class). The resulting prediction for a test object is a set of p-values, one for each possible class label. These p-values can then be subjected to a statistical test in order to obtain a set prediction; i.e., if we wish to have 90% confidence in the prediction we specify a significance level, \(\varepsilon\), of 0.1 and include all labels with a p-value equal to or higher than 0.1 in the prediction set. The resulting prediction sets can thus be empty (no classes predicted), single-label (informative), or multi-label (less informative).
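As an illustration (this is a minimal sketch, not CPSign's actual implementation or API), the standard mondrian p-value ranking and the resulting set prediction can be written in a few lines:

```python
import numpy as np

def mondrian_p_value(cal_alphas, test_alpha):
    """Standard (non-smoothed) p-value of a test object's nonconformity
    score, ranked against the class-specific calibration scores."""
    n = len(cal_alphas)
    # count how many of the n+1 scores (test instance included) are
    # at least as nonconforming as the test score
    return (int(np.sum(np.asarray(cal_alphas) >= test_alpha)) + 1) / (n + 1)

def prediction_set(p_values, significance):
    """Include every label whose p-value is >= the significance level."""
    return {label for label, p in p_values.items() if p >= significance}
```

For example, with calibration scores [0.1, 0.4, 0.7, 0.9] for one class, a test score of 0.5 gives p = (2 + 1)/5 = 0.6 for that class.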
In the regression algorithm it is common to use a nonconformity function that also attempts to scale the prediction intervals based on the predicted difficulty of the test object, commonly by training an additional error model on, e.g., the residuals produced when predicting the training set. The regression algorithm, in contrast to classification, also requires the user to specify a desired significance level (\(\varepsilon\)) at prediction time, and the output is a prediction interval for the given \(\varepsilon\). This prediction interval should enclose the true label, y, with a probability of \(1-\varepsilon\) or greater (i.e., the expected error is at most \(\varepsilon\)). Naturally, as the desired significance level is decreased the predictor has to widen the prediction interval in order to comply with the lowered level of accepted errors (see Fig. 1 for an illustrative example).
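A minimal sketch of a normalized conformal regression interval, assuming the common nonconformity function \(\alpha_i = |y_i - \hat{y}_i| / \sigma_i\) (names and details are illustrative, not CPSign's code):

```python
import numpy as np

def conformal_interval(y_hat, sigma_hat, cal_residuals, cal_sigmas, eps):
    """Prediction interval y_hat +/- alpha_q * sigma_hat, where alpha_q
    is the (1 - eps) quantile of the normalized calibration
    nonconformities alpha_i = |y_i - y_hat_i| / sigma_i."""
    alphas = np.abs(cal_residuals) / np.asarray(cal_sigmas, dtype=float)
    n = len(alphas)
    # conservative quantile index among the n+1 scores
    k = int(np.ceil((1 - eps) * (n + 1))) - 1
    q = np.sort(alphas)[min(k, n - 1)]
    return y_hat - q * sigma_hat, y_hat + q * sigma_hat
```

Decreasing eps selects a larger quantile alpha_q and thus a wider interval, mirroring the confidence/interval-width trade-off described above.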
Under the relatively weak assumption of exchangeability of calibration and test data, these inductive versions of conformal predictors are proven to produce valid (well-calibrated) predictions, i.e., an error rate equal to or smaller than the specified significance level [6]. Furthermore, in the case of classification, given that a mondrian (class-conditional) calibration is used, the guarantee holds individually for each class, which has been shown to handle imbalanced datasets very well without the need to apply balancing techniques [18, 31, 32]. However, this guarantee may in practice sometimes be difficult to achieve due to assay drift [12] or when data splitting is performed in a non-random way (such as scaffold splitting). Validity is thus commonly assessed by calculating the error rate for a set of significance levels, or by plotting a calibration curve of error rate vs. significance level across a range of significance levels (see e.g. Fig. 3A).
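The calibration assessment amounts to counting, at each significance level, how often the true label falls outside the prediction set. A sketch (assuming p-values are stored per object as label-to-p-value mappings; not CPSign code):

```python
def error_rates(p_values, true_labels, significances):
    """Empirical error rate at each significance level. The true label
    is excluded from the prediction set exactly when its p-value < eps,
    so a valid predictor keeps each rate at or below eps."""
    rates = []
    for eps in significances:
        errors = sum(pv[y] < eps for pv, y in zip(p_values, true_labels))
        rates.append(errors / len(true_labels))
    return rates
```

Plotting these rates against the significance levels gives the calibration curve; points at or below the diagonal indicate validity.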
Once calibration has been assessed, the goal is to produce as informative predictions as possible, referred to as predictive efficiency. A conformal predictor could always place all possible class labels in the prediction set or predict the interval \((-\infty , \infty )\), and thus always be correct, but those predictions are not informative. The predictive efficiency thus relates to how specific the predictions are (small prediction sets or tight prediction intervals). Many metrics have been proposed for evaluating predictive efficiency, most of which are summarized in Vovk et al. [33]. Some of the most commonly used metrics for classification are observed fuzziness, the average number of predicted classes (average C), and the ratio of single-label prediction sets, where the latter two require a fixed significance level. For regression the most commonly used metrics are the median or mean prediction interval width at fixed significance levels.
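Two of the classification efficiency metrics mentioned above, average C and the single-label ratio, are simple to compute from the per-object p-values (illustrative sketch, reusing the label-to-p-value representation from before):

```python
def efficiency(p_values, significance):
    """Average C (mean prediction-set size) and the ratio of
    single-label prediction sets at a fixed significance level."""
    sizes = [sum(p >= significance for p in pv.values())
             for pv in p_values]
    avg_c = sum(sizes) / len(sizes)
    single_ratio = sum(s == 1 for s in sizes) / len(sizes)
    return avg_c, single_ratio
```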
Evaluating a conformal model requires more effort and thought than standard point predictions, as we also need to apply domain knowledge to how each dataset should be evaluated. More specifically, we need to choose a sensible level of confidence for the evaluation, something that requires insight into what level of confidence is needed for a prediction to be useful, or conversely, how specific the predictions must be to be useful (e.g. tight prediction intervals). An example of this is choosing to use the proportion of single-label predictions produced by a classification model at a fixed significance level; this seems sensible since a prediction is most useful when it contains a single label, but a priori it may be hard to decide on which confidence level to use. See Fig. 2B, where the green area shows the proportion of single-label predictions at any given significance level. If we in advance chose to use, e.g., 0.35 as significance level, the predictions would include empty prediction sets; the predictive efficiency is in fact better at a lower significance level (0.29 being the best), which one may miss when focusing on a fixed significance level.
VennABERS prediction
Apart from the conformal predictors introduced in the previous section, CPSign also supports probabilistic modeling using the Venn-ABERS predictor (VAP) [34, 35]. This algorithm, contrary to the conformal predictors, outputs probability estimates rather than p-values, which is preferred by some users. The VAP is a special type of Venn predictor which relies on a machine learning model to produce so-called Venn taxonomies. VAP is a multiprobability predictor for binary classification tasks, giving two probability estimates (\(p_0\) and \(p_1\)) for each test object. One of these estimates is the true probability of the test object, but we do not know which one. The simplest version of VAP relies on splitting the full training set into a proper training and a calibration set (called the Inductive Venn-ABERS Predictor, or IVAP), in the same manner as the inductive conformal predictors discussed in the previous section. The proper training set is similarly used for training the underlying scoring algorithm, and the calibration set is here used for producing the predictions using an isotonic regression in which the test object is included. The calibration step is performed twice, once for each of the possible class labels, with the test object augmented with one of the class labels as a tentative label.
An extension of the one-split VAP is to train a Cross-Venn-ABERS (CVAP) [35] model, in which the training set is split several times in a folded fashion similar to k-fold cross-validation, and an IVAP is trained for each such split. The benefit of this strategy is that the k multiprobabilities can be aggregated into a single probability with conditional guarantees [35]. A benefit of VAP is that it guarantees well-calibrated probabilities without introducing further assumptions on the data being modeled, something that is not guaranteed by most probabilistic models [36]. In many cases a probabilistic predictor is favorable over a conformal classifier, mostly because probabilities are easier to interpret than p-values, and because its performance can be measured and compared using standard evaluation metrics. However, there are cases where the conformal algorithms are preferable, such as for imbalanced datasets, where the mondrian calibration handles the imbalance without requiring balancing techniques, whereas VAP generally needs data balancing or other techniques to perform well. Conformal classification also handles multi-class data, whereas the VAP is only defined for binary classes. For small datasets the transductive conformal predictor may be a more favorable alternative, as no valuable training observations need to be set aside for model calibration, whereas the VAP algorithms require a separate calibration set.
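The collapse of one multiprobability pair (p0, p1) into a single probability, and the log-loss merging of k fold-wise pairs in CVAP, follow the formulas in Vovk and Petej [35]; a sketch (function names are our own, not CPSign's):

```python
import math

def merge_ivap(p0, p1):
    """Single probability from one (p0, p1) pair: p1 / (1 - p0 + p1)."""
    return p1 / (1 - p0 + p1)

def merge_cvap(p0s, p1s):
    """Merge k fold-wise (p0, p1) pairs via geometric means
    (the log-loss minimax merging)."""
    gmean = lambda xs: math.exp(sum(math.log(x) for x in xs) / len(xs))
    g1 = gmean(p1s)
    g0 = gmean([1 - p0 for p0 in p0s])
    return g1 / (g0 + g1)
```

With k = 1 the CVAP merging reduces to the single-pair formula.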
In life science research, VAP has been applied in drug screening [37], to predict metabolic transformations [38], and to assess cardiovascular risk based on in vitro assay data [39].
CPSign
This section briefly goes through the implementation choices made when developing CPSign and some of its key features; a summary can be found in Supplemental Table 1, Additional file 1.
Molecular representation—descriptors
Numerical features representing chemical structures are generally referred to as molecular or chemical descriptors [40], but sometimes also as chemical fingerprints, although the latter term is mostly used in the context of structural comparisons and similarity searching. Many descriptor implementations have been proposed, with varying performance. The simplest include physicochemical descriptors such as molecular weight, number of rotatable bonds, lipophilicity, etc. Over the years, a type of topological (2D) fingerprints describing the local environment around each atom, referred to as circular fingerprints [41], have emerged as robust descriptors that sustain efficient modeling with machine learning methods. Several different approaches and implementations exist, with Morgan fingerprints [42], extended-connectivity fingerprints (ECFP) [43], and Signatures [29] being the most widely used. These descriptors can be rapidly calculated and, since they stem from chemical substructures, they allow for chemically relevant feature interpretations.
CPSign implements Signatures as the main descriptor type, but also CDK molecular descriptors [44] including ECFPs. The user can also generate descriptors by other tools and load them as properties from CSV or SDF files together with the chemical structures. Additional descriptors can be calculated by extending an interface as explained in section Adding custom extensions.
Underlying scoring algorithms
CPSign includes the Java versions of the popular LIBLINEAR [45] and LIBSVM [46] packages, both implementing support vector machines (SVMs) and allowing for sparse input data. A sparse data representation is essentially required when using the Signatures descriptor, as the number of features can be on the order of hundreds of thousands for larger datasets, which would require vast amounts of RAM in a dense representation. For smaller datasets the RBF-kernel SVM from LIBSVM is preferable (other kernels are also possible to use), but its training time scales poorly for larger datasets [47]. For larger datasets the linear-kernel SVM from LIBLINEAR is often preferred, as it also includes heuristics to speed up training [45]. Earlier work has aimed at finding good sweet-spot hyperparameters for the combination of the Signatures descriptor with SVMs [48], leading to robust results using the default settings in CPSign.
LIBLINEAR also has a sparse implementation of the logistic regression algorithm that can be used within CPSign, albeit less tested, and it may require more tuning to achieve good results. As with the chemical descriptors, users can implement and expose their own machine learning methods to be used as underlying scoring algorithms. We have made such an extension by wrapping the DeepLearning4J library [49], which can be found in the CPSign-DL4J GitHub repository (https://github.com/arosbio/cpsigndl4j). Note, however, that this extension requires the user to build and adjust it for a particular platform and hardware, as it runs native code.
Predictor types
CPSign implements the transductive conformal predictor (only for classification), a variety of inductive conformal predictors (for both regression and classification), and the Cross-Venn-ABERS predictor (a binary probabilistic classifier). For the conformal models there are three base types: TCPClassifier, ACPClassifier and ACPRegressor. The ACPClassifier and ACPRegressor predictors can be switched between running a one-split ICP, several splits (ACP [50]), or folded splits (CCP [51]), depending on the data sampling strategy. There is also the option of using a predefined split from the user, or of adding custom splitting strategies such as in Arvidsson McShane et al. [52]. For the TCPClassifier there is no splitting of data into proper training and calibration sets; instead the underlying scoring model is trained once for every possible label for every test example. The TCP model can thus use all available data both for training the underlying model and for calibration, but at the cost of being highly computationally demanding, and it is thus only recommended for smaller datasets.
For the conformal predictors the notion of a nonconformity measure (NCM) is central, and CPSign comes with four NCMs for classification and four for regression (see Supplemental Table 1, Additional file 1). Depending on what the function requires, they can be combined with different types of underlying scoring models; e.g., the InverseProbability NCM requires probability scores from the underlying model and thus restricts the number of available scoring models, and the NegativeDistanceToHyperplane NCM requires an SVM as scoring model. For the regression algorithm the NCM also dictates whether an additional error model should be trained to predict the difficulty of a test example and normalize the prediction intervals. By default the error model uses the same algorithm and hyperparameter settings as the main scoring model, but it is possible to, e.g., use a more complex model (RBF-kernel SVM) for predicting the midpoint and the computationally cheaper linear-kernel SVM for normalizing the prediction intervals.
Another feature of CPSign is the possibility to use different ways of calculating and handling the p-values: apart from the standard calculation (Eq. 1), CPSign also allows calculating “smoothed p-values” [6] as well as both linear and spline interpolation [53, 54]. The interpolation options can be useful especially for small datasets, where only a few examples can be set aside for the calibration set; see Fig. 3A, B for a comparison between the standard p-value calculation and linear interpolation when only five calibration instances are available. This example is exaggerated, and we do not recommend using only five examples for calibration, but it shows the usefulness of including interpolation.
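For reference, the smoothed p-value breaks ties with a uniform random tau, which makes the p-values exactly uniformly distributed under exchangeability (illustrative sketch, not CPSign code):

```python
import random

def smoothed_p_value(cal_alphas, test_alpha, rng=None):
    """Smoothed p-value: strictly larger calibration scores count
    fully, ties (including the test instance itself) count with a
    random weight tau drawn uniformly from [0, 1)."""
    rng = rng or random
    n = len(cal_alphas)
    greater = sum(a > test_alpha for a in cal_alphas)
    ties = sum(a == test_alpha for a in cal_alphas)
    tau = rng.random()
    return (greater + tau * (ties + 1)) / (n + 1)
```

Without ties the smoothed p-value lies strictly between the two adjacent standard p-value steps, which is what produces the smoother calibration curves.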
Hyperparameter tuning
CPSign has robust predictive performance using the default parameters (see the method evaluation section for a comparison against tuned hyperparameters, as well as against other popular modeling methods). To further improve model performance it is possible to fine-tune the hyperparameters using a standard grid search algorithm. For some parameters, e.g., the SVM cost parameter, there are default values to try out; for other hyperparameters the user has to decide which values to evaluate in the grid. There is flexibility in how this is done, e.g., the choice of performance metric and the evaluation strategy to use (see section Validation strategies). Furthermore, it is possible to choose whether to tune the parameters based on the underlying scoring model by itself, or whether the evaluation should be performed on a full conformal or Venn-ABERS predictor. Tuning based only on the underlying scoring model can save considerable computational time when several ICPs are aggregated, or if the final model will be a TCP (which requires retraining the underlying scoring model multiple times for each test prediction).
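The grid search itself is conceptually simple; a generic sketch (the evaluate callback and the cost grid are hypothetical stand-ins for CPSign's configurable metric and parameter settings, not its actual defaults):

```python
def grid_search(param_grid, evaluate):
    """Exhaustive search: evaluate each candidate value and keep the
    one with the best (here: lowest) score, e.g. observed fuzziness."""
    scores = {value: evaluate(value) for value in param_grid}
    best = min(scores, key=scores.get)
    return best, scores[best]

# e.g. an exponential grid for the SVM cost parameter (illustrative)
cost_grid = [2.0 ** k for k in range(-4, 9, 2)]
```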
Validation strategies
Three validation strategies, with several configurable settings, are available: k-fold cross-validation, a single test–train split, and leave-one-out cross-validation. The former two have several configurable parameters, such as performing the splits stratified (for classification), the fraction of test instances, the k in k-fold, and the number of repetitions. The validation strategy is another feature of CPSign that can be extended.
Interpretation of predictions
CPSign can render images of molecules with an interpretation of the prediction based on the algorithm outlined in Ahlberg et al. [55] (Fig. 4 and Supplemental Table 3). This interpretation is based on feature importances that can be mapped back to atoms in the molecule, either in terms of which molecular signature had the highest impact on the prediction (Fig. 4A) or as a complete molecule gradient where all features’ individual contributions are aggregated and mapped back to their originating atoms (Fig. 4B). This feature has been especially appreciated by chemists using the predictive models, as it allows them to edit chemical structures in a drawing graphical user interface and immediately visualize how the predictions change.
Implementation details
CPSign is written in Java and is thus platform-independent, only requiring a Java runtime of version 11 or later. Build and dependency management is handled by Maven, and published artifacts are available from the Maven Central repository for easy inclusion in other JVM-based projects. The code base is split into several child modules, outlined in Fig. 5B, to allow users to depend on only the parts needed for their particular requirements. For instance, users modeling non-chemical data can depend on the ConfAI module alone and thereby reduce the dependency graph, and the REST servers depend on the CPSign-API module since no CLI functionality is required.
Adding custom extensions
There are several ways for users to add their own custom extensions to CPSign, e.g., by providing pull requests on GitHub, cloning the repository, or simply extending the desired interface and exposing it as a Java Service Provider. CPSign loads extendable interfaces using the ServiceLoader class, which makes it possible, with minor effort, to add custom code that can be used even through the CLI. For an example of how this can be achieved, see our CPSign-DL4J extension on GitHub (https://github.com/arosbio/cpsigndl4j), which makes it possible to use deep learning models as underlying scoring models by wrapping the DeepLearning4J [49] package. This DL4J extension was developed and evaluated as part of a thesis project [56], resulting in predictive models performing on par with the SVM-based models.
Interfaces
There are multiple ways of running CPSign, each explained in more detail in the following subsections. User documentation is found at https://cpsign.readthedocs.io. An overview of how CPSign can be used and the typical workflows are outlined in Fig. 5A, and described briefly here. CPSign works with tabular data, either in CSV format containing chemistry in SMILES format, or in SDF files. The first step is always to convert the chemical input file(s) into numerical data, which is performed using one or several descriptors and is termed precompute. The precompute step results in a precomputed dataset, containing both the numerical data and all metadata from that step, so that, e.g., the same descriptors and data transformations are applied to any future test molecules. From precomputed data, the user can run crossvalidate to evaluate the given data with a predictor setting (i.e., a conformal or Venn-ABERS model, including specific settings for the scoring model, nonconformity function and any additional hyperparameters that can be set), to quickly assess the expected performance for a new dataset and settings.
From a precomputed dataset a predictor model can be trained, with an optional intermediate step of hyperparameter tuning using either tune (hyperparameter tuning including all tunable predictor parameters) or tune-scorer (hyperparameter tuning of the underlying scoring model only). The train step can thus be run either with default parameters (or manually set parameters) or with tuned parameters from the optional tuning step. The trained model can then be validated on an external validation set or used to predict new compounds. Final trained models can also be deployed as microservices, locally or publicly, allowing users to run query predictions using REST; this option is further described in the section REST API.
Command line interface
The Command Line Interface (CLI) is the main way CPSign is intended to be used, at a high abstraction level which facilitates rapid evaluation of new datasets and models. Apart from the online user documentation, the CLI tool has a rich user manual available directly in the terminal environment, as well as a help program (explain) that both provides detailed explanations of key arguments and lists available settings. The listing functionality is useful because CPSign can be extended with custom implementations; the documentation is generated dynamically depending on what is currently available, including dynamic listing of sub-parameters.
Working with the CLI follows the outline in Fig. 5A, with a separate “program” for each rectangle in the figure. The goal has been to make the CLI as feature-complete as possible while balancing the complexity of the interface. In this spirit, most parameters have good default values, favoring lower computational complexity (such as using a linear-kernel SVM by default) while always making it possible to switch to more elaborate alternatives. Most users thus prefer working with the CLI, e.g. for publications [12, 57,58,59] as well as other unpublished work.
Java API
For greater control of all available tweaks and handles, and for incorporating CPSign into other programs, the Java API can be used. Here the user can also choose to depend on another sub-module of CPSign (Fig. 5B) depending on their specific requirements. Coding examples can be found in the CPSign-examples GitHub repository (https://github.com/arosbio/cpsignexamples), to make it easier for new users to start coding against the API.
REST API
To make it easy for users to make their developed models publicly available, final models can be deployed as microservices that users interact with using REST. Each service is automatically documented using the OpenAPI 3.0 specification [60], and can optionally include a graphical user interface in which the user can draw or paste chemical structures and get back predictions, as well as atom contributions rendered using the method described in the section Interpretation of predictions. The web service implementation is freely available in the CPSign_predict_services GitHub repository (https://github.com/arosbio/cpsign_predict_services), and can be altered according to further requirements, e.g., by adding user identification. Prebuilt Docker images for each model type are available from the GitHub container registry, making it possible to spin up a web server with a CPSign model using a single Docker command. Examples of these services running in production are the models serving the web page accessible at https://predgui.serve.scilifelab.se.
Conformal eval
When implementing CPSign, a decision was made not to include any plotting functionality in the program itself, but instead to let users create figures with their favorite tool for this task. Creating figures through a CLI would both restrict the level of flexibility and clutter the API with too many parameters. However, to make it easy to quickly generate figures for analyzing results, we developed a Python library building on top of the popular matplotlib [61], with added functions for loading results from CPSign. This library can be found in the conformal-eval GitHub repository (https://github.com/pharmbio/conformaleval) and was used for generating, e.g., Figs. 2 and 3. An example of how to generate Fig. 3D can be found in Algorithm 1. Note that the regression case is more involved than classification, as the confidence/significance level must be given at prediction time and the lower/upper bounds of each prediction interval must be saved and loaded. For classification, loading only requires picking the columns containing p-values from a CSV file, and significance levels can be applied when generating the figures.
Evaluation
In this section we make a comparison of CPSign versus other common modeling approaches used for QSAR modeling, both the traditional method of handcrafted Morgan fingerprints combined with Random Forest, as well as the contemporary graph neural network based Chemprop [62]. The objective here is not to make a comprehensive comparison across all combinations of descriptors and modeling methods that are currently available, but rather using representative methods and data sets.
Datasets
For the evaluation we use a subset of the benchmarking datasets from the popular MoleculeNet [63]. The classification datasets were picked from both of the categories Biophysics and Physiology in order to obtain a broader range of tasks; the selected datasets are outlined in Table 1. For regression the number of available datasets in MoleculeNet was limited, so all tasks that were not part of the category Quantum mechanics were selected, see Table 2. The reason for excluding the Quantum mechanics datasets was that they are based on crystal structures of protein–ligand complexes, for which CPSign lacks descriptors for modeling. To expand the evaluation, the 13 largest curated datasets published in Škuta et al. [64] (Additional file 1) were included, as well as the two largest datasets from Papyrus [65]. This resulted in 16 datasets for classification and 18 for regression.
Each dataset was split into three subsets: train, calibrate, and test, in either 80%–10%–10% splits, or 60%–20%–20% splits for the classification datasets with fewer than 500 observations for the minority class (see Tables 1 and 2). The datasets from MoleculeNet were downloaded and split using the deepchem software [66]; the splitting was performed randomly for the regression datasets and random stratified for the classification datasets. The datasets from Škuta et al. and Papyrus were split randomly using numpy [67].
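A random three-way split of this kind can be sketched with the standard library alone (the deepchem and numpy splitters used in the study work analogously; the fractions and seed below are illustrative):

```python
import random

def random_split(items, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and partition into train/calibrate/test by the given fractions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_cal = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    calibrate = shuffled[n_train:n_train + n_cal]
    test = shuffled[n_train + n_cal:]
    return train, calibrate, test

train, calibrate, test = random_split(range(1000))
```

For the smaller classification datasets, the same idea is applied per class (stratified) with 60%–20%–20% fractions, so that the minority class is represented in every partition.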
Modeling methods
In this comparison the Inductive Conformal Predictor (ICP) was used by all methods, with predefined splits of proper training and calibration sets according to the train and calibrate splits described in the previous paragraph, so all methods had exactly the same training, calibration and testing data. The distinguishing factors between the methods were the descriptors, the underlying scoring models, the nonconformity function, and any additional parameters that can be tweaked within each software, such as employing interpolation of the p-values to smooth out the predictions. The following modeling methods were used in the comparison:

CPSign: CPSign using the CLI with the default descriptor, i.e. the Signatures descriptor [28, 29], with an RBF-kernel SVM except for the largest classification dataset (PCBA-686978), for which a linear-kernel SVM was used instead. The default nonconformity function was used for both classification (negative distance to the SVM hyperplane) and regression (LogNormalized). The LogNormalized function uses an additional error model to estimate the difficulty of each example, which is used to scale the prediction interval. Additionally, linear interpolation was employed for the p-value calculation. All other settings were the default ones.
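As a generic illustration of how an inductive conformal classifier turns nonconformity scores into p-values and prediction sets (a textbook sketch, not CPSign's internal implementation; the scores below are made up):

```python
def p_value(cal_scores, test_score):
    """Standard conformal p-value: the proportion of calibration nonconformity
    scores (plus the test object itself) at least as large as the test score."""
    ge = sum(1 for a in cal_scores if a >= test_score)
    return (ge + 1) / (len(cal_scores) + 1)

def prediction_set(p_values, significance):
    """All labels whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}

# E.g. with nonconformity = negative distance to the SVM hyperplane,
# computed for each candidate label of the test object:
p_vals = {"active": p_value([0.1, 0.2, 0.3, 0.4], 0.25),
          "inactive": p_value([0.1, 0.2, 0.3, 0.4], 0.45)}
labels = prediction_set(p_vals, 0.2)
```

At significance 0.2 only "active" remains in the prediction set; raising the significance level shrinks the sets further, at the cost of a higher permitted error rate.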

CPSign tuned: This strategy used the same parameters as described for CPSign above, but extended with a hyperparameter tuning step of the SVM hyperparameters cost (C) and gamma (\(\gamma\)) using grid search. The grid consisted of ten values for C (\(2^{-6},2^{-4},\ldots,2^{12}\)) and six for \(\gamma\) (\(2^{-14},2^{-12},\ldots,2^{-4}\)) for a total of 60 combinations, apart from the largest classification dataset, which used the linear-kernel SVM and where only the ten C values were evaluated. Hyperparameter tuning was performed exclusively on the train partition of the data, running a 10-fold cross-validation of the SVM only (i.e. without adding the conformal calibration) and optimizing with respect to macro F1 score for classification and RMSE for regression.
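The grid can be generated as follows (exponents in steps of two, as stated above):

```python
# Ten cost values 2^-6, 2^-4, ..., 2^12 and six gamma values 2^-14, ..., 2^-4
C_GRID = [2.0 ** e for e in range(-6, 13, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-14, -3, 2)]
PARAM_GRID = [(c, g) for c in C_GRID for g in GAMMA_GRID]
```

The Cartesian product gives the 60 (C, gamma) combinations evaluated per dataset; for the linear kernel only `C_GRID` applies.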

FP+RF tuned: Random Forest (RF) using Morgan Fingerprints (FP) as descriptor and nonconformist [21] as the CP implementation. The Morgan Fingerprints were calculated using RDKit [68], with a bit length of 2048 and radius 2. The RF hyperparameters were tuned using the train split, without including conformal calibration, optimizing for balanced accuracy for the classification datasets and RMSE for the regression datasets. The grid of tested hyperparameters had 32 combinations for the classification models and 64 for the regression models. For the conformal classification model the MarginErrFunc was used as nonconformity function, and for regression the AbsErrFunc was used in conjunction with a normalizer model: an additional RF model with the same hyperparameters as the scoring model.
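The normalized regression setup (absolute-error nonconformity plus a normalizer/error model) can be sketched generically: the nonconformity score is alpha_i = |y_i − yhat_i| / sigma_i on the calibration set, and the interval half-width for a new object is the relevant calibration quantile scaled by that object's predicted difficulty. A minimal stdlib sketch of the standard ICP construction, not the nonconformist implementation itself:

```python
def normalized_interval(y_hat, sigma_hat, cal_alphas, significance):
    """ICP regression interval from normalized absolute-error nonconformity
    scores alpha_i = |y_i - y_hat_i| / sigma_i computed on the calibration set.

    y_hat:     point prediction from the scoring model
    sigma_hat: predicted difficulty of this object from the error model
    """
    alphas = sorted(cal_alphas, reverse=True)
    # the s-th largest calibration score bounds the (1 - significance) region
    s = int(significance * (len(alphas) + 1))
    quantile = alphas[s - 1] if s >= 1 else float("inf")
    half_width = quantile * sigma_hat
    return y_hat - half_width, y_hat + half_width

interval = normalized_interval(5.0, 1.0, [1, 2, 3, 4, 5, 6, 7, 8, 9], 0.2)
```

A difficult object (large `sigma_hat`) gets a proportionally wider interval, which is what makes the normalized variant more informative than a constant-width conformal regressor.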

Chemprop: The Chemprop software [62, 69] was used to develop Directed Message Passing Neural Network (D-MPNN) models. Default parameter settings and network architecture were used. A separate validation set (10%) was randomly split off from the train dataset for monitoring model training (using random stratified splitting for classification). For the classification models, 1 minus the predicted probability of the class was used as nonconformity measure, using the smoothed calculation of p-values (i.e. special treatment of equal nonconformity scores). The procedure described in Norinder et al. [30] was used for regression, i.e., using one model to predict the midpoint and a second (error model) for predicting the error made by the first model, in order to normalize the prediction intervals based on the predicted difficulty of the object. The nonconformity function was extended by adding a small smoothing factor, \(\beta\), of 0.01, as CPSign does for the normalized nonconformity measures (which increases stability as well as removes the potential for division by zero in the calculation).
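The smoothed p-value mentioned above resolves ties between the test object's nonconformity score and calibration scores with a uniform random tie-break, a standard construction in the conformal prediction literature (a sketch of that construction, not the exact Chemprop/CPSign code):

```python
import random

def smoothed_p_value(cal_scores, test_score, rng=None):
    """Smoothed conformal p-value: ties between the test score and the
    calibration scores are broken by a uniform random factor tau in [0, 1)."""
    rng = rng or random.Random(0)
    greater = sum(1 for a in cal_scores if a > test_score)
    equal = sum(1 for a in cal_scores if a == test_score)
    tau = rng.random()
    return (greater + tau * (equal + 1)) / (len(cal_scores) + 1)

p = smoothed_p_value([1.0, 1.0, 2.0], 1.0)
```

With the non-smoothed formula all tied scores count fully, making p-values conservative; the smoothed version makes them exactly uniformly distributed under exchangeability.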

Chemprop tuned: This method used the same settings as the method above, but with the added step of hyperparameter tuning of the Chemprop model using the chemprop_hyperopt function. Chemprop performs a Bayesian hyperparameter optimization using the hyperopt package [70], here evaluated using the default 20 different hyperparameter settings. To minimize information leakage, this optimization step was applied only on the 90% split from the train dataset; chemprop thus internally split that set further into a validation set for monitoring model training, a test set for comparing model performance across hyperparameter settings, and data used for training the model. For the regression experiments the error model used the same optimized hyperparameters as the scoring model that predicted the midpoint.
Comparison
When comparing the methods we restricted the analyzed significance levels to 0.01–0.3 for classification and 0.05–0.3 for regression, corresponding to 70–99% and 70–95% confidence, respectively. The methods were first assessed with respect to calibration using calibration plots, shown in Supplemental Figures 2 and 3, Additional file 1. To simplify and quantify the calibration of the different methods we computed the maximum (signed) difference between the error rate and the specified significance level, the RMSE of error rate against significance level, as well as the “capped” RMSE, Supplemental Fig. 1 (Additional file 1). The capped RMSE was calculated by setting the error rate equal to the significance level whenever it was lower than the significance level (for every evaluated significance level), so that over-conservative predictions (i.e. a lower error rate than required) do not contribute to a higher RMSE. Since the guarantee made by the conformal framework is that the error rate should be at most equal to the significance level, the capped RMSE is more representative of the level of calibration.
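The capped RMSE described above can be computed as follows (an illustrative sketch; the error rates below are made up):

```python
import math

def capped_rmse(significance_levels, error_rates):
    """RMSE of error rate vs. significance level, where error rates below the
    significance level (over-conservative predictions) are capped to the
    significance level so they contribute zero to the sum."""
    squares = [(max(err, sig) - sig) ** 2
               for sig, err in zip(significance_levels, error_rates)]
    return math.sqrt(sum(squares) / len(squares))

# Only the middle point exceeds its significance level and contributes.
score = capped_rmse([0.1, 0.2, 0.3], [0.05, 0.30, 0.30])
```

A perfectly valid (or over-conservative) predictor thus gets a capped RMSE of zero, while the plain RMSE would penalize it.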
All methods produce similar results with respect to calibration, where the only concern is the calibration of the minority class for some of the classification datasets. This is most likely due to the smaller number of observations of the minority class in both the calibration and test splits, which leads to higher variance. The methods perform similarly enough that they can be compared fairly with respect to predictive efficiency.
Classification
Aggregating the predictive efficiency across the datasets resulted in similar performance for all evaluated methods (Fig. 6A, B). Differences between methods can be seen when analyzing the datasets individually in terms of Observed Fuzziness (Supplemental Fig. 4, Additional file 1) and fraction of single-label predictions (Supplemental Fig. 5, Additional file 1), where the most notable trend was that the two Chemprop methods performed best on the two largest datasets (HIV and PCBA-686978). The results were further ranked (Supplemental Table 2, Additional file 1) to find the top performer as well as the overall ranks, where the CPSign method (i.e., without tuning) was the overall best method.
Regression
Aggregating the results across all evaluated datasets (Fig. 6C) shows that the CPSign tuned method generated the most efficient predictions overall, although the methods again perform largely similarly. Calculating the Wilcoxon signed-rank test between each pair of methods, separately for each significance level, gave no significant difference between any of the methods. All results presented separately for each dataset can be found in Supplemental Fig. 6 (Additional file 1), and the rankings across all datasets in Supplemental Table 2 (Additional file 1). From the ranking it is clear that the CPSign tuned method performed best overall, but also that each method was the top-performing method for at least one dataset and significance level.
Runtime comparison
The comparison was further performed with respect to the runtime of each experiment. Both CPSign methods and the FP+RF tuned method were run on a laptop, whereas the Chemprop experiments were run on a computer cluster with an Nvidia 1080 Ti GPU. A summary of the runtimes can be found in Table 3, with individual datasets in Supplemental Figures 7 and 8, Additional file 1. Note that no replicate runs were performed, so the results should be interpreted as an indication of how the methods compare.
Discussion
When making predictions on novel chemical structures that have not been used in model training, scientists face the inevitable questions of how much to trust the prediction and, if it is trusted, how to interpret the result. While most traditional methods provide a single level of model confidence from the training procedure, e.g. a metric from cross-validation, CPSign via the conformal prediction methodology outputs a prediction interval that is specific for each predicted object. If the object to predict differs more from what has been seen before, the prediction interval becomes larger, and the user can trust the size of these intervals as they are based on proven mathematical theory [6]. This also offers a compelling alternative to the concept of Applicability Domain [5]. It is important to note that the size of the prediction intervals is related to the choice of nonconformity function; a poor nonconformity function will still be valid but will result in larger prediction intervals [71, 72]. Since conformal prediction outputs a prediction interval given a user-specified level of confidence, it can be difficult to choose this level of confidence; requiring a higher confidence naturally leads to larger prediction intervals. In the end this comes down to the user having to decide what is acceptable for the specific problem given the interval sizes produced by the model. In the authors' experience this can lead to a more realistic view of model expectations.
Recently, Deep Neural Networks (DNNs) have emerged as a popular method for supervised learning, and have shown higher accuracy for many traditional machine learning tasks. A prominent example is computer vision, where the objects are images and where convolutional neural networks (CNNs) have yielded dramatic improvements [73]. For tabular data the improvements are not as profound. DNNs generally require larger training sets compared to traditional machine learning methods, although techniques such as transfer learning and augmentation can somewhat reduce this burden [74]. Further, DNNs necessitate hyperparameter tuning on a much larger scale than traditional machine learning methods, making them costly in terms of time and computational resources.
For supervised learning where the data objects are chemical structures, several studies have compared deep learning approaches with more traditional machine learning methods, such as [75, 76]. One problem with such comparisons lies in the choice of metric; for example, accuracy or AUC (AUROC) is not suitable when working with unbalanced data, which is very common in the field. Another, more serious, problem is that studies rarely assess the level of calibration of the models. Deep learning models have in many cases been shown to be poorly calibrated [77], rendering comparisons of the produced output probabilities biased. Conformal prediction is one method to calibrate models to obtain valid (well-calibrated) probability estimates, and in our evaluation we use it to compare CPSign with DNNs as implemented in the Chemprop package, both in terms of calibration and efficiency. As conformal prediction produces prediction intervals, traditional metrics such as AUC and F1 cannot be used, and we instead use the well-established metrics Observed Fuzziness, Average C and fraction of single-label prediction intervals. Using Mondrian conformal prediction [6] also improves the modeling and calibration for imbalanced datasets, which has been shown in several ligand-based studies [18, 31, 32].
Comparing CPSign against other popular modeling methods showed that overall, all methods performed similarly in terms of predictive efficiency on the classification datasets. For the regression datasets, CPSign with tuned hyperparameters was overall the top-performing method (Fig. 6). Looking at individual datasets, each modeling method was the top-performing method at least once, showing that the optimal modeling approach can vary depending on the data being modeled. We note for instance that the DNN-based Chemprop performed better on the largest datasets, supporting the hypothesis that DNNs require large training sets. Our overall conclusion is that CPSign performs on par with DNNs (Chemprop) when calibrating models using conformal prediction, and hence takes advantage of theoretically proven model validity, which was also assessed empirically.
The lack of interpretability of DNNs is widely acknowledged [78]. CPSign utilizes the Signatures descriptor to represent chemistry, which allows feature importance to be visualized as chemical substructures. Due to the fast predictions, CPSign allows for immediate feedback and visualization of atom contributions (Fig. 4) to the prediction (“calculate-as-you-draw”). This has been a much appreciated feature among users of the software, such as medicinal chemists.
Setting up and maintaining computational environments for machine learning can be demanding, and this is especially evident for deep neural networks with their many and specific dependencies. Further, with changing versions of frameworks and dependencies, it can be time-consuming to maintain models and predictions over time, such as in production environments [79]. The requirements on IT infrastructure also vary considerably between modeling methods, with DNNs generally requiring access to GPUs to accelerate the learning. Even so, the number of hyperparameters that need to be optimized commonly leads to several days of model training. In contrast, CPSign has a single dependency (Java), which makes it straightforward to download, use, and integrate into other systems, and modeling is fast to complete. Examples of tools and systems based on CPSign are ANDROMEDA by Prosilico (https://prosilico.com/andromeda) and PredGUI in SciLifeLab Serve (https://predgui.serve.scilifelab.se).
CPSign is an advanced tool with many options, and hence it requires some learning to understand the parameters of the software. Effort has been made to reduce the number of required parameters and to provide good default values. For example, the default option of CPSign is Inductive Conformal Prediction (ICP), which requires a number of data points to be set aside into a calibration set. This is the price to pay for obtaining valid (well-calibrated) models. When there are few data points in the training set, there is always the option to use Transductive Conformal Prediction (TCP), which does not use a calibration set, but with the downside that each prediction requires retraining the model. However, when data sizes are so small that TCP is mandated, this is usually an acceptable trade-off. Another challenge lies in communicating model evaluation, as model efficiency for prediction intervals is not directly comparable with commonly used point-estimate validation metrics such as F1 score or AUC.
For SVMs, using a nonlinear kernel such as RBF generally leads to more efficient models. However, this is computationally expensive when datasets are large, although it has been shown that for larger models the difference between linear and nonlinear kernels is small [47]. The default setting for CPSign is to use a linear kernel (LIBLINEAR) to produce fast results when prototyping, and it is recommended to switch to the RBF kernel if datasets are of small or moderate size, also depending on the computational infrastructure available. As the comparison shows (Fig. 6, and supplemental figures, Additional file 1), the default values for the parameters C and \(\gamma\) as previously devised [48] generally perform well, but in some cases the efficiency can be improved by hyperparameter tuning.
Although CPSign is a worthy tool for many typical cheminformatics modeling tasks, it is worth mentioning that there are tasks for which it is not suitable, such as non-tabular data (e.g. images or graph-based data) or when multi-task learning could be employed.
Conclusion
CPSign is a robust and complete implementation of conformal prediction for cheminformatics applications. The combination of signatures and SVM has been shown to produce robust and accurate models, and conformal prediction adds the ability to produce valid prediction intervals. The implementation as a single software package with no dependencies apart from a Java runtime, a well-developed API, and a low footprint makes it suitable both for rapid prototyping and for integration in production systems. The evaluation of modeling methods highlights that CPSign overall performs on par with, or outperforms, other state-of-the-art cheminformatics approaches.
Data availability
All data is openly accessible from MoleculeNet (https://moleculenet.org/), additional files from Škuta et al. [64] (https://doi.org/10.1186/s13321-020-00443-6) and Papyrus (https://doi.org/10.5281/zenodo.7019874).
References
Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18(6):463–477
Basile AO, Yahi A, Tatonetti NP (2019) Artificial intelligence for drug toxicity and safety. Trends Pharmacol Sci 40(9):624–635
Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, Isayev O, Curtarolo S, Fourches D, Cohen Y, AspuruGuzik A, Winkler DA, Agrafiotis D, Cherkasov A, Tropsha A (2020) QSAR without borders. Chem Soc Rev 49(11):3525–3564
JiménezLuna J, Grisoni F, Weskamp N, Schneider G (2021) Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opin Drug Discov 16(9):949–959
Gadaleta D, Mangiatordi GF, Catto M, Carotti A, Nicolotti O (2016) Applicability domain for QSAR models: where theory meets reality. Int J Quant Struct Prop Relatsh 1(1):45–63
Vovk V, Gammerman A, Shafer G (2005) Algorithmic learning in a random world. Springer, New York. https://doi.org/10.1007/b106715
Norinder U, Carlsson L, Boyer S, Eklund M (2014) Introducing conformal prediction in predictive modeling: a transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54(6):1596–603. https://doi.org/10.1021/ci5001168
Norinder U, Rybacka A, Andersson PL (2016) Conformal prediction to define applicability domain—a case study on predicting ER and AR binding. SAR QSAR Environ Res 27(4):303–16. https://doi.org/10.1080/1062936X.2016.1172665
Alvarsson J, McShane SA, Norinder U, Spjuth O (2021) Predicting with confidence: using conformal prediction in drug discovery. J Pharm Sci 110(1):42–49
Svensson F, Afzal AM, Norinder U, Bender A (2018) Maximizing gain in high-throughput screening using conformal prediction. J Cheminform 10(1):7. https://doi.org/10.1186/s13321-018-0260-4
Svensson F, Norinder U, Bender A (2017) Modelling compound cytotoxicity using conformal prediction and PubChem HTS data. Toxicol Res 6(1):73–80. https://doi.org/10.1039/c6tx00252h
Morger A, Svensson F, Arvidsson McShane S, Gauraha N, Norinder U, Spjuth O, Volkamer A (2021) Assessing the calibration in toxicological in vitro models with conformal prediction. J Cheminf 13(1):35
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O (2018) A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform 10(1):17. https://doi.org/10.1186/s13321-018-0271-1
Lampa S, Alvarsson J, Arvidsson Mc Shane S, Berg A, Ahlberg E, Spjuth O (2018) Predicting off-target binding profiles with confidence using conformal prediction. Front Pharmacol 9:1256. https://doi.org/10.3389/fphar.2018.01256
Fagerholm U, Hellberg S, Alvarsson J, Spjuth O (2022) In silico predictions of the human pharmacokinetics/toxicokinetics of 65 chemicals from various classes using conformal prediction methodology. Xenobiotica 52(2):113–118. https://doi.org/10.1080/00498254.2022.2049397
Fagerholm U, Hellberg S, Alvarsson J, Spjuth O (2023) In silico prediction of human clinical pharmacokinetics with andromeda by prosilico: predictions for an established benchmarking data set, a modern small drug data set, and a comparison with laboratory methods. Altern Lab Anim 51(1):39–54. https://doi.org/10.1177/02611929221148447
CortésCiriano I, Bender A (2019) Deep confidence: a computationally efficient framework for calculating reliable prediction errors for deep neural networks. J Chem Inf Model 59(3):1269–1281. https://doi.org/10.1021/acs.jcim.8b00542
Norinder U (2022) Traditional machine and deep learning for predicting toxicity endpoints. Molecules. https://doi.org/10.3390/molecules28010217
Zhang J, Norinder U, Svensson F (2021) Deep learningbased conformal prediction of toxicity. J Chem Inf Model 61(6):2648–2657. https://doi.org/10.1021/acs.jcim.1c00208
Olsson H, Kartasalo K, Mulliqi N, Capuccini M, Ruusuvuori P, Samaratunga H, Delahunt B, Lindskog C, Janssen EAM, Blilie A, Egevad L, Spjuth O, Eklund M, ISUP Prostate Imagebase Expert Panel (2022) Estimating diagnostic uncertainty in artificial intelligence assisted pathology using conformal prediction. Nat Commun 13(1):7761. https://doi.org/10.1038/s41467-022-34945-8
Linusson H. Nonconformist. 2015. http://donlnz.github.io/nonconformist/. Accessed Aug 2023
Bosc N, Atkinson F, Felix E, Gaulton A, Hersey A, Leach AR (2019) Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J Cheminf 11:1–16
Svensson F, Norinder U, Bender A (2017) Improving screening efficiency through iterative screening using docking and conformal prediction. J Chem Inf Model 57(3):439–444
Norinder U, Naveja JJ, LópezLópez E, Mucs D, MedinaFranco JL (2019) Conformal prediction of HDAC inhibitors. SAR QSAR Environ Res 30(4):265–277. https://doi.org/10.1080/1062936X.2019.1591503
Mendil M, Mossina L, Vigouroux D. PUNCC: a python library for predictive uncertainty calibration and conformalization. In: Conformal and Probabilistic Prediction with Applications, PMLR. 2023. p. 582–601.
Boström H (2022) crepes: a python package for generating conformal regressors and predictive systems. In: Conformal and Probabilistic Prediction with Applications, pp. 24–41. PMLR
Manokhin V. Awesome conformal prediction. https://doi.org/10.5281/zenodo.6467205. Accessed Nov 2023
Faulon JL, Visco DP, Pophale RS (2003) The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Model 43(3):707–720. https://doi.org/10.1021/ci020345w
Faulon JL, Churchwell CJ, Visco DP (2003) The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Model 43(3):721–734. https://doi.org/10.1021/ci020346o
Norinder U, Carlsson L, Boyer S, Eklund M (2014) Introducing conformal prediction in predictive modeling: a transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54(6):1596–1603
Sun J, Carlsson L, Ahlberg E, Norinder U, Engkvist O, Chen H (2017) Applying mondrian crossconformal prediction to estimate prediction confidence on large imbalanced bioactivity data sets. J Chem Inf Model 57(7):1591–1598. https://doi.org/10.1021/acs.jcim.7b00159
Norinder U, Boyer S (2017) Binary classification of imbalanced datasets using conformal prediction. J Mol Gr Modell 72:256–265
Vovk V, Fedorova V, Nouretdinov I, Gammerman A. Criteria of efficiency for conformal prediction. In: Symp. on Conformal and Probabilistic Prediction with Appl. Springer; 2016. p. 23–39.
Vovk V. Venn predictors and isotonic regression. CoRR abs/1211.0025. 2012.
Vovk V, Petej I, Fedorova V. Largescale probabilistic prediction with and without validity guarantees. In: Proceedings of NIPS, vol. 2015. 2015.
Sweidan D, Johansson U. Probabilistic prediction in scikitlearn. In: The 18th International Conference on Modeling Decisions for Artificial Intelligence, Sept 27–30, 2021. 2021.
Buendia R, Kogej T, Engkvist O, Carlsson L, Linusson H, Johansson U, Toccaceli P, Ahlberg E (2019) Accurate hit estimation for iterative screening using venn–abers predictors. J Chem Inf Model 59(3):1230–1237
Arvidsson S, Spjuth O, Carlsson L, Toccaceli P. Prediction of metabolic transformations using cross venn–abers predictors. In: Conformal and Probabilistic Prediction and Applications, PMLR. 2017. p. 118–31.
Ahlberg E, Buendia R, Carlsson L. Using venn–abers predictors to assess cardiovascular risk. In: Conformal and Probabilistic Prediction and Applications, PMLR. 2018. p. 132–46.
Todeschini R, Consonni V (2008) Handbook of molecular descriptors. John Wiley & Sons, Hoboken
Glen RC, Bender A, Arnby CH, Carlsson L, Boyer S, Smith J (2006) Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 9(3):199
Morgan HL (1965) The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service. J Chem Doc 5(2):107–113
Rogers D, Hahn M (2010) Extendedconnectivity fingerprints. J Chem Inf Model 50(5):742–754
Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL (2006) Recent developments of the chemistry development kit (CDK): an open-source Java library for chemo- and bioinformatics. Curr Pharm Des 12(17):2111–2120
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27
Alvarsson J, Lampa S, Schaal W, Andersson C, Wikberg JE, Spjuth O (2016) Largescale ligandbased predictive modelling using support vector machines. J Cheminf 8(1):1–9
Alvarsson J, Eklund M, Andersson C, Carlsson L, Spjuth O, Wikberg JE (2014) Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model 54(11):3211–3217
Team EDD. Deeplearning4j: open-source distributed deep learning for the JVM. 2023. https://deeplearning4j.konduit.ai/. Accessed Nov 2023
Carlsson L, Eklund M, Norinder U (2014) Aggregated conformal prediction. In: Iliadis L, Maglogiannis I, Papadopoulos H, Sioutas S, Makris C (eds) Artificial intelligence applications and innovations IFIPAICT 14. Springer, Berlin, pp 231–240
Vovk V (2015) Cross-conformal predictors. Ann Math Artif Intell 74(1–2):9–28. https://doi.org/10.1007/s10472-013-9368-4
Arvidsson McShane S, Ahlberg E, Noeske T, Spjuth O (2021) Machine learning strategies when transitioning between biological assays. J Chem Inf Model 61(7):3722–3733
Johansson U, Ahlberg E, Boström H, Carlsson L, Linusson H, Sönströd C. Handling small calibration sets in mondrian inductive conformal regressors. In: Int Symp on Statistical Learning and Data Sci, Springer. 2015. p. 271–80.
Carlsson L, Ahlberg E, Boström H, Johansson U, Linusson H. Modifications to pvalues of conformal predictors. In: Int Symp on Statistical Learning and Data Sci. Springer. 2015. p. 251–9.
Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of conformal prediction classification models. In: statistical learning and data sciences: third international symposium, SLDS 2015, Egham, UK, April 20–23, 2015, Proceedings 3, Springer. 2015. p. 323–34.
Deligianni M. Comparison of support vector machines and deep learning for QSAR with conformal prediction. 2022.
Fagerholm U, Hellberg S, Alvarsson J, Spjuth O (2023) In silico prediction of human clinical pharmacokinetics with andromeda by prosilico: predictions for an established benchmarking data set, a modern small drug data set, and a comparison with laboratory methods. Altern Lab Anim. https://doi.org/10.1177/02611929221148447
Lampa S, Alvarsson J, Arvidsson Mc Shane S, Berg A, Ahlberg E, Spjuth O (2018) Predicting offtarget binding profiles with confidence using conformal prediction. Front Pharmacol 9:1256
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O (2018) A confidence predictor for logD using conformal regression and a supportvector machine. J Cheminf 10:1–10
Software S. OpenAPI specification. 2023. https://swagger.io/specification/. Accessed Nov 2023
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
Heid E, Greenman KP, Chung Y, Li SC, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ (2023) Chemprop: machine learning package for chemical property prediction. J Chem Inf Model. https://doi.org/10.1021/acs.jcim.3c01250
Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530
Škuta C, CortésCiriano I, Dehaen W, Kříž P, Westen GJ, Tetko IV, Bender A, Svozil D (2020) QSARderived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminf 12(1):1–16
Béquignon OJ, Bongers BJ, Jespers W, IJzerman AP, Water B, Westen GJ (2023) Papyrus: a largescale curated dataset aimed at bioactivity predictions. J Cheminf 15(1):1–11
Ramsundar B, Eastman P, Walters P, Pande V, Leswing K, Wu Z (2019) Deep learning for the life sciences. O’Reilly Media, Sebastopol
Harris CR, Millman KJ, Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, Kerkwijk MH, Brett M, Haldane A, Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
RDKit: Open-source cheminformatics software. https://zenodo.org/record/7671152#.ZFIV43ZBzao. Accessed Aug 2023
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, GuzmanPerez A, Hopper T, Kelley B, Mathea M et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388
Bergstra J, Yamins D, Cox D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning, PMLR. 2013. p. 115–23.
Eklund M, Norinder U, Boyer S, Carlsson L (2015) The application of conformal prediction to the drug discovery process. Ann Math Artif Intell 74(1–2):117–132
Svensson F, Aniceto N, Norinder U, Cortes-Ciriano I, Spjuth O, Carlsson L, Bender A (2018) Conformal regression for quantitative structure-activity relationship modeling-quantifying prediction uncertainty. J Chem Inf Model 58(5):1132–1140
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25.
Kensert A, Harrison PJ, Spjuth O (2019) Transfer learning with deep convolutional neural networks for classifying cellular morphological changes. SLAS Discov Adv Life Sci R&D 24(4):466–475
Wu Z, Zhu M, Kang Y, Leung ELH, Lei T, Shen C, Jiang D, Wang Z, Cao D, Hou T (2021) Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief Bioinform 22(4):321
Korotcov A, Tkachenko V, Russo DP, Ekins S (2017) Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol Pharm 14(12):4462–4475
Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning, PMLR. 2017. p. 1321–30.
Baskin II (2020) The power of deep learning to ligand-based novel drug discovery. Expert Opin Drug Discov 15(7):755–764
Spjuth O, Frid J, Hellander A (2021) The machine learning life cycle and the cloud: implications for drug discovery. Expert Opin Drug Discov 16(9):1071–1079
Funding
Open access funding provided by Uppsala University. OS acknowledges funding from the Swedish Research Council (Grants 2020-03731 and 2020-01865), FORMAS (Grant 2022-00940), Swedish Cancer Foundation (22 2412), and Horizon Europe Grant agreement #101057014 (PARC).
Author information
Authors and Affiliations
Contributions
O.S., E.A. and L.C. wrote the first pilot version of CPSign. S.A.M. has written the majority of CPSign, the documentation website, and the CPSign extension packages. J.A. has tested CPSign and provided feedback and feature requests. All authors took part in designing the method comparison. U.N. selected the datasets used in the method comparison and provided code for conformal predictions for Chemprop. S.A.M. performed the final experiments for the method comparison. All authors contributed to writing and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
OS is the owner of Aros Bio AB, providing commercial licenses of CPSign. Both JA and SAM have previously been employed at Genetta Soft which at the time owned the rights to CPSign.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13321_2024_870_MOESM1_ESM.pdf
Additional file 1: PDF file with Supplemental information, including three large tables and evaluation figures for the individual datasets.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Arvidsson McShane, S., Norinder, U., Alvarsson, J. et al. CPSign: conformal prediction for cheminformatics modeling. J Cheminform 16, 75 (2024). https://doi.org/10.1186/s13321-024-00870-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13321-024-00870-9