Quantitative prediction of selectivity between the A1 and A2A adenosine receptors

The development of drugs is often hampered due to off-target interactions leading to adverse effects. Therefore, computational methods to assess the selectivity of ligands are of high interest. Currently, selectivity is often deduced from bioactivity predictions of a ligand for multiple targets (individual machine learning models). Here we show that modeling selectivity directly, by using the affinity difference between two drug targets as output value, leads to more accurate selectivity predictions. We test multiple approaches on a dataset consisting of ligands for the A1 and A2A adenosine receptors (among others classification, regression, and we define different selectivity classes). Finally, we present a regression model that predicts selectivity between these two drug targets by directly training on the difference in bioactivity, modeling the selectivity-window. The quality of this model was good as shown by the performances for fivefold cross-validation: ROC A1AR-selective 0.88 ± 0.04 and ROC A2AAR-selective 0.80 ± 0.07. To increase the accuracy of this selectivity model even further, inactive compounds were identified and removed prior to selectivity prediction by a combination of statistical models and structure-based docking. As a result, selectivity between the A1 and A2A adenosine receptors was predicted effectively using the selectivity-window model. The approach presented here can be readily applied to other selectivity cases.


Introduction
Computational modeling of small molecules in drug discovery is typically focused on modeling their binding affinity or bioactivity. These models can be used to identify active compounds in silico, or to rationalize which chemical groups are correlated with bioactivity. Quantitative structure-activity relationship (QSAR) models can be applied to model compound activity for single proteins, whereas proteochemometrics (PCM) can be applied to model activity for multiple proteins in one single model [1,2]. Next to statistical models (e.g. machine learning), structure-based models are used to predict and rationalize compound activity. Methods such as docking, molecular dynamics, and free-energy perturbation (FEP) are widely applied to study binding and bioactivity [3][4][5].
Although modeling of activity is essential in drug discovery, these models often do not take target selectivity into account. The ability to control promiscuity of potential drug candidates is crucial as off-target effects can be avoided this way. Whereas selective drugs are designed to be non-promiscuous, polypharmacological drugs are designed to interact with multiple targets [6]. The development of both polypharmacological and selective drugs requires predictions for more than one target. Polypharmacology and selectivity can both be modeled by machine learning or structure-based approaches that predict the affinity of compounds on multiple targets separately. The resulting bioactivity profile can subsequently be applied to match the desired on-target(s) and to avoid off-target effects. However, this indirect way of predicting selectivity based on predicted bioactivities Burggraaff et al. J Cheminform (2020) 12:33 requires multiple model predictions to calculate one feature: the activity difference of a compound for one target over the other.
Here we explore selectivity modeling for the adenosine receptors, which are members of the class A G proteincoupled receptors (GPCRs). The adenosine receptor family, existing of subtypes A 1 , A 2A , A 2B , and A 3 , is involved in many physiological processes including cardiac control and inflammation [7]. The A 1 and A 2A adenosine receptors (A 1 AR and A 2A AR) both control cyclic adenosine-5′monophosphate (cAMP) levels in the cell. Activation of A 1 AR results in decreased cAMP levels, whereas A 2A AR activation increases cAMP levels [8]. These contrary effects justify a need for computational models that can predict selectivity between these two subtypes. A novel method to model selectivity is presented in this study: namely to train machine learning models directly on the differences between experimentally determined affinities.
For the A 2A AR many crystal structures are available in the Protein Data Bank [9]. More recently, protein crystal and cryo-EM structures for the A 1 AR have been obtained also [10][11][12]. The availability of structures for both proteins allows for a structure-based comparison of the subtypes to gain insights into selectivity. Previous studies revealed differences between the protein structures that correspond with ligand selectivity of specific chemical structures [10,11,13]. For example, the A 2A AR-selectivity of reference antagonist ZM241385 could be explained by a combination of three structural differences between the A 1 AR and A 2A AR: a salt bridge at the binding pocket entry, a hydrophobic pocket in the A 1 AR, and a (de)stabilized water network [13]. Furthermore, the A 1 AR-selectivity of xanthine-based antagonists with a bulky substituent has shown to be caused by steric hindrance in the A 2A AR by residue Met270 7.35 (Thr270 7.35 in A 1 AR) [11].
In addition to the availability of crystal structures, many small molecule ligands have been experimentally tested for their activity on one or multiple adenosine receptors. This data has already been exploited to train bioactivity models using classification or continuous statistical models [14,15]. However, direct modeling of selectivity in the adenosine receptors using statistical models has not yet been reported. Statistical models to predict selectivity are faster than predicting selectivity using time-consuming FEP approaches [5]. However, structure-based methods can give additional mechanistic information on binding and selectivity of compounds, and are in principle not dependent on available bioactivity data of ligands.
In this research a combined approach of QSAR modeling and structure-based docking is presented to model bioactivity for the A 1 AR and A 2A AR. Moreover, the selectivity between the A 1 AR and A 2A AR is predicted directly by training on affinity differences (selectivity-window). This is in contrast to methods that were reported up till now that deduced selectivity from predicted bioactivities of separate models. Furthermore, by training a continuous model (regression), the degree of selectivity was calculated in addition to a selectivity class with predefined thresholds. Finally, to enhance the performance of the selectivity models, statistical bioactivity models and structure-based docking were used to exclude inactive compounds.
Our study shows that continuous QSAR models can effectively predict selectivity between the A 1 AR and A 2A AR. A model trained directly on the difference in affinity between the two proteins, the selectivity-window model, outperformed models that are generally used to predict selectivity: models that are trained on separate bioactivities for the A 1 AR and A 2A AR.

Datasets
Information on compound-protein interactions (e.g. binding affinity and efficacy) was collected for the human A 1 AR and A 2A AR. Public bioactivity data was extracted from ChEMBL version 23 [16] and supplemented with in-house data (see Additional file 1). As compounds with a ribose scaffold are often associated with agonistic activity and dicyanopyridines with partial agonism, these compounds were removed to generate an antagonist-focused dataset [17,18]. Bioactivity values were standardized to pActivity (conceptually similar to the pChEMBL value, an ensemble from pK i /IC 50 /EC 50 /K d values [19]), while simultaneously combining data from different labs and assays. The data was subsequently used to compile the following compound datasets: an ' A 1 AR bioactivity dataset' (containing bioactivities of compounds on the A 1 AR), an ' A 2A AR bioactivity dataset' (containing bioactivities of compounds on the A 2A AR) and an ' A 1 AR/A 2A AR dataset' (bioactivities of compounds tested on both the A 1 AR and A 2A AR). The latter dataset included information on the selectivity of compounds. Compounds were termed 'selective' when the difference in activity between the two proteins was more than 100-fold (e.g., A 2A AR-selective when pActivity A 1 AR = 6.5 and pActivity A 2A AR ≥ 8.5). The A 1 AR/A 2A AR dataset consisted of five classes: non-binder (pActivity A 1 AR and A 2A AR < 6.5), A 1 AR-selective (pActivity A 1 AR ≥ 6.5 and selectivity ≥ 100-fold), A 2A AR-selective (pActivity A 2A AR ≥ 6.5 and selectivity ≥ 100-fold), and dual binder (pActivity A 1 AR and A 2A AR ≥ 6.5 and selectivity ≤ tenfold). Additionally, compounds that had measured bioactivities for both the A 1 AR and A 2A AR, but did not fit into any of the classes of the A 1 AR/ A 2A AR dataset, were termed 'semi-selective' compounds. Compounds that only had measured bioactivity for one receptor and were not present in the A 1 AR/A 2A AR dataset, but were included in either the A 1 AR bioactivity dataset or A 2A AR bioactivity dataset were termed 'single points' . The distribution of activities in the different datasets was comparable (Table 1) and normally distributed in both the A 1 AR and A 2A AR (Fig. 1). It should be noted that the total number of A 1 AR-selective compounds is about three times smaller than the number of A 2A AR-selective compounds (50 and 146 compounds, respectively).

Modeling A 1 AR/A 2A AR subtype selectivity using classification QSAR models
Several QSAR models were created to study selectivity. Firstly, subtype selectivity for the A 1 AR and A 2A AR was modeled using classification models. Additionally, non-selective compounds (dual binders) and non-binders were modeled. The following four models were constructed: a 2-class model (A 1 AR-selective/ A 2A AR-selective), two 3-class models (A 1 AR-selective/ A 2A AR-selective/dual inhibitors on one hand, and A 1 AR-selective/A 2A AR-selective/non-binders on the other hand), and a 4-class model (A 1 AR-selective/ A 2A AR-selective/dual/non-binders). All models were  (Fig. 2). Furthermore, the sensitivity and the positive predictive value (PPV) were consistently higher for A 2A AR-selective compounds than for A 1 AR-selective compounds, whereas specificity and negative predictive value (NPV) were higher for A 1 AR-selective compounds. The non-binder class contains compounds that are inactive at both receptors. However, these compounds are not well differentiated from the active classes (A 1 AR-, A 2A AR-selective, and dual), as is observed by low MCC (0.15 ± 0.06) and poor ROC (0.57 ± 0.07) for the non-binder class. The next section therefore describes bioactivity modeling of the A 1 AR and A 2A AR in an attempt to categorize non-binders.

Modeling A 1 AR and A 2A AR bioactivity using classification and regression QSAR models
The bioactivity of compounds for the A 1 AR and A 2A AR were modeled with both classification and regression models. Classification models categorize compounds with using a pre-defined threshold (here pActivity ≥ 6.5) as 'active' and compounds below that threshold are termed 'inactive' . The model is trained on these activity classes and provides an activity class for test compounds as well. In contrast, regression models are not trained on classes, but on numerical bioactivity values. The output that is generated from a regression model is a bioactivity value, which can subsequently be assigned to an activity class. As can be observed in Table 1 where the median pActivity for the sets is shown, this value (pActivity 6.5) is applicable for these data sets and was previously also shown to be a relevant threshold leading to balanced classes [15].
Bioactivity classification and regression QSAR models were trained on the A 1 AR/A 2A AR dataset, the same dataset that was used in the selectivity-classification QSAR models described in the previous section. Additionally, semi-selective compounds were added to increase the amount of training data. These semi-selective compounds have experimental activities for both receptors but do not fit into any of the four selectivity classes (e.g. a compound with pActivity A 1 AR = 7.0, pActivity A 2A AR = 8.1). However, for bioactivity modeling the selectivity class is irrelevant, and thus these compounds were now included to increase The validation test sets were composed based on chemical clusters and bioactivity of compounds; each subset contained both actives and inactives. These validation sets were not equal to the aforementioned (selectivity) validation sets as they were used for a different purpose: validation of bioactivity models instead of selectivity models. All bioactivity models were validated using the same crossvalidation test sets, regardless of the dataset that was used in training (A 1 AR/A 2A AR dataset, A 1 AR bioactivity, or A 2A AR bioactivity). The A 1 AR and A 2A AR bioactivity datasets contained more data points than the A 1 AR/A 2A AR dataset as these sets also included single points (bioactivity measured only for one of the two receptors). The single bioactivity points were included in training, but excluded from validation to retain comparability of performance for the different models. Single points that belonged to the same chemical cluster as the data points in the test set were also excluded from training to prevent bias. The regression models show good model quality in training, with a high R 2 (≥ 0.98) and low RMSE values (≤ 0.14). Unfortunately, when applied on the validation set, performances are lower than expected based on training performance (likely caused by the challenging test set based on chemical clustering). Nevertheless judging the model performance on classification validation metrics a realistic estimation can be made for the predictive performance of the models. The mean performances of all bioactivity models (based on fivefold cross-validation) were better for the A 2A AR than for the A 1 AR, with an average ROC difference of 0.12 (Table 3). Furthermore, classification models performed worse than regression models, as indicated by their lower values for enrichment (ROC) and MCC: average difference in ROC for the A 1 AR = 0.20 and A 2A AR = 0.07, and average difference in MCC for the A 1 AR = 0.07 and A 2A AR = 0.03. Moreover, the MCC and ROC for A 1 AR bioactivity classification models even indicated performances worse than random (MCC < 0 and ROC < 0.5). The best performing bioactivity models were based on regression, which reached an average performance (ROC score 0.60-0.70) for predicting bioactivities.

Modeling A 1 AR/A 2A AR subtype selectivity using regression models
The application of the above bioactivity model approach was tested in modeling the selectivity of compounds (i.e. modeling the affinity on the respective receptors and deriving selectivity from that indirectly). As the predicted bioactivities of the two bioactivity models are not correlated with each other, a separate validation was performed to indicate the performance of selectivity predictions. However, the cross-validation sets of the models in Table 3 were clustered based on bioactivity instead of selectivity classes, hence bioactivity models were retrained using differently composed cross-validation sets to justify comparison with a later discussed selectivity-window model. Regression bioactivity models were selected as these outperformed the classification bioactivity models. Moreover, regression was preferred in selectivity modeling as with regression a quantitative value for selectivity can be derived.
Thus, bioactivity regression models were used to predict compound activity for the A 1 AR and A 2A AR. Models were trained on the A 1 AR/A 2A AR dataset including additional 'semi-selective' compounds. The difference in predicted bioactivity for the two receptors was calculated as selectivity value. Subsequently, selectivity classes

Table 3 Performances of classification and regression bioactivity models for the A 1 AR and A 2A AR
Query compounds were categorized based on post-classification of the bioactivity predictions: predicted pActivity < 6.5 = inactive and predicted pActivity ≥ 6.5 = active  (Table 4). Continuing our single model approach to predict selectivity we explored the usage of regression for a selectivitywindow model rather than classification. In contrast to the two-step A 1 AR-A 2A AR difference model (regression), the single model to predict selectivity between the A 1 AR and A 2A AR was based on the difference in affinity rather than the prediction of bioactivity and calculation of the resulting selectivity. The regression model was trained directly on the difference in bioactivity for both receptors (pActivity A 1 AR − pActivity A 2A AR = selectivitywindow) and predicts a quantitative score for the degree of selectivity of a compound (difference in pActivity). A positive score indicates A 1 AR-selectivity, a negative score A 2A AR-selectivity, and a score close to zero indicates dual binders. The model was evaluated based on the ROC score and classification metrics (MCC, sensitivity, specificity, PPV, and NPV). The rules for classification of the A 1 AR-, A 2A AR-selective, and dual binders were derived from the thresholds applied in the selectivity classification models: A 1 AR ≥ 100-fold selective equals pActivity difference ≥ 2, A 2A AR ≥ 100-fold selective equals pActivity difference ≤ − 2, and for dual binders (≤ tenfold selective) pActivity difference ≥ − 1 and ≤ 1.
The selectivity-window regression model was trained on the same data (A 1 AR/A 2A AR dataset and semi-selective compounds) as the two-step A 1 AR-A 2A AR difference model described above in which selectivity was deducted from two separate bioactivity models. The selectivitywindow outperformed the two-step A 1 AR-A 2A AR difference model with increased ROC values for selectivity classes A 1 AR-and A 2A AR-selective (ROC increase 0.07-0.13) ( Table 4). Figure 3 shows example compounds that were misclassified with the two-step A 1 AR-A 2A AR difference model, but were correctly predicted using the selectivity-window model. The similarity (Tanimoto FCFP4) between the mispredicted compounds by the two-step A 1 AR-A 2A AR difference model was 0.25, whereas the similarity within Table 4 Performances of selectivity modeling using the two-step A 1 AR-A 2A AR difference model or the selectivity-window model  wrongly predicted compounds by the selectivity-window model was 0.58. This indicates that the two-step A 1 AR-A 2A AR difference model is challenged by selectivity prediction of more diverse compounds and the selectivity-window model underperforms on specific chemical scaffolds. The most frequently mispredicted scaffold by the selectivity-window model was N-(2-(furan-2-yl)-6-(1H-pyrazol-1-yl)pyrimidin-4-yl)-2-phenoxyacetamide (see Additional file 2). It should be noted that for the A 1 AR and A 2A AR predictions MCC and PPV could not always be calculated in cross-validation, resulting in failed cross-validation folds, or iterations. This is explained by the lack of true/false positives in the particular cross-validation fold: PPV cannot be calculated if there are no positives, MCC cannot be calculated without a PPV.
The selectivity-window model performed better than the three-class selectivity classification model validated earlier (Table 2). It was tested whether this increased performance was a result from the increased amount of data as the selectivity-window model included additional semi-selective compounds and non-binders, which the three-class selectivity classification model necessarily lacked as the affinity difference was not large enough to meet the classification cut-off. When these additional data points were excluded from the selectivity-window model, the ROC dropped from 0.78 to 0.67 (average over classes), which is comparable with the ROC of 0.65 of the three-class selectivity classification model. This observation clearly confirms a direct link between model quality and data availability and shows that the increased performance of the selectivity-window model is attributed to additional data points. Hence it is advantageous to use continuous models in selectivity modeling as in this case more data can be included. In addition to the benefit of increased data availability, continuous selectivity models also provide the ability to calculate a selectivity ratio as opposed to the class only. This selectivity ratio indicates the degree of selectivity and therefore cannot only identify selective compounds, but can also differentiate highly selective compounds from weakly selective compounds.
Remarkably, metrics based on classification (MCC, sensitivity, specificity, PPV, and NPV) for the selectivitywindow model (without non-binders and semi-selective compounds) (Table 4) are lower for the A 1 AR-selective compounds and dual binders than metrics of the 3-class classification model (both trained on the same data) ( Table 2), whereas ROC scores are comparable or higher. Therefore, the predictions of the selectivity-window model were compared with the experimentally measured selectivity values (Fig. 4). It was observed that the A 1 AR-selective compounds have consistently lower selectivity-window predictions than the experimental selectivity values. As a result, fewer compounds reached the A 1 AR-selective classification threshold, decreasing the number of the A 1 AR-selective positives drastically. From the 50 A 1 AR-selective compounds, none reached the threshold. Of all predictions, only three compounds reached the A 1 AR-selective threshold, which were dual binders instead of the A 1 AR-selective compounds. This deficiency of predicted positives explains the failed crossvalidation calculations for the A 1 AR-selective class.
To compensate, classification validation metrics were re-calculated post hoc using classification thresholds that were adapted to compensate for the generalization of selectivity for A 1 AR-selective compounds. Compounds were deemed A 1 AR-selective when the selectivity-window ≥ 0.5, A 2A AR-selective when the selectivity-window ≤ − 2 (unchanged), and dual binder when the selectivity-window ≥ − 1 and < 0.5. Using the new thresholds, values of the metrics for the A 1 AR-selectivity predictions improved: MCC 0.31 ± 0.10, sensitivity 0.45 ± 0.12, specificity 0.91 ± 0.02, PPV 0.32 ± 0.06, and NPV 0.95 ± 0.01, indicating that the revised threshold improves the categorization of the A 1 AR-selective compounds. Not all A 1 AR-selective compounds were correctly categorized but the post hoc optimized threshold was considered adequate, as lowering the A 1 AR-selective threshold further would increase sensitivity (by categorization of more compounds as A 1 AR-selective), but would also decrease PPV. Here, the correctness of predictions was prioritized over the number of predicted active compounds; hence PPV was prioritized over sensitivity.

Removal of non-binders to enhance performance
Although the selectivity-window model differentiates between the A 1 AR-, A 2A AR-selective compounds, and dual binders, the model does not consider potential inactivity of compounds. Consequently, non-binders cannot be filtered using this model. Therefore, a consensus approach of statistical modeling and structure-based docking was applied to identify and exclude non-binders.
Bioactivity regression models described above for the A 1 AR and A 2A AR were combined with docking of the A 1 AR/A 2A AR dataset and semi-selective compounds into crystal structures of both proteins. Bioactivity predictions for the A 1 AR and A 2A AR, and selectivity-window predictions, were derived for the entire A 1 AR/A 2A AR dataset and semi-selective compounds by assembling the predictions made during fivefold cross-validation of the previously trained regression models. Compounds were docked into crystal structures of the A 1 AR (PDB: 5UEN) [10] and the A 2A AR (PDB: 5OLZ) [21], which resulted in docking scores for both receptors. Compounds were assigned a separate bioactivity label for the A 1 AR and A 2A AR: compounds in the A 1 AR were labeled 'active' when predicted pActivity ≥ 7 and docking score ≤ − 9. Compounds in the A 2A AR were labeled 'active' when predicted pActivity ≥ 7 and docking score ≤ − 10.
Compounds with predicted selectivity-window ≥ 0.5 or ≤ − 2, corresponding with A 1 AR-and A 2A AR-selective, were subsequently filtered using the consensus bioactivity filter (Fig. 5). The PPV for A 1 AR-selective compounds drastically increases from 0.13 to 0.39 when the selectivity-window predictions were filtered using the consensus bioactivity filter for the A 1 AR. The A 2A AR bioactivity filtering also increased the PPV of A 2A AR-selective compounds; here docking and consensus filtering performed equally well (PPV A 2A AR-selective: 0.80).
Non-binders that were not removed using the statistical filter only, but were filtered when the consensus approach was used, were inspected in the crystal structure of the A 1 AR. Some non-binders (e.g., CHEMBL1800792) did not adapt a favorable conformation when docked into the A 1 AR (Fig. 6). Moreover, an interaction with pocket residue Asn 6.55 (Ballesteros-Weinstein numbering) was frequently not observed. This is an essential missing element, as interaction with this residue has shown to be important for ligand binding to the A 1 AR and A 2A AR [11,22]. However, some non-binders were able to make this interaction (CHEMBL372580), but nevertheless had a docking score that did not reach the set threshold (docking score ≤ − 9). The poses of the non-binders were compared to those of an A 1 AR-selective (CHEMBL207824) and an A 2A AR-selective compound (CHEMBL371436). Both selective compounds adapt a conformation that is able to make an interaction with residue Asn 6.55 . Furthermore, the poses also constitute the same aromatic interactions (with Phe171 EL2 in the A 1 AR and with Phe168 EL2 and His250 6.52 in the A 2A AR) as the co-crystalized ligands and adapt a similar scaffold orientation. Finally, the poses of the selective ligands have favorable docking scores (− 10.30 and − 10.91, respectively), supporting that these compounds are binders for the A 1 AR or A 2A AR.

Validation of the selectivity-window model on an external set
The predictive selectivity-window model (trained on A 1 AR/A 2A AR dataset and semi-selective compounds) was challenged to predict the selectivity of compounds from an external validation set. This set contained 1482 compounds of which a dose-response bioactivity value (K i /IC 50 /EC 50 /K d ) was known for at least one of the two receptors. If an accurate bioactivity value was available for both receptors, the compound was classified according to prior rules applied in this study. However, if an accurate bioactivity value for only one receptor was known, a less accurate bioactivity measurement (inhibition as percentage displacement/efficacy/change) was used to identify inactivity for the missing receptor. The low accuracy of the bioactivity values makes this data less suitable for model training on the quantitative difference between activity on the two receptors, but suitable for classification validation. A pChEMBL value of < 4.5 or inhibition threshold of ≤ 50% (at 10 μM) was used to label inactive compounds, whereas a pChEMBL value of ≥ 6.5 was used to indicate active compounds. The selectivity-window model (see Additional file 3) was applied to the compounds in the external validation set, providing them all with a predicted selectivity score and, subsequently, a selectivity class. The validation encompassed all selectivity classes: A 1 AR-and A 2A AR-selective compounds, dual binders and nonbinders. Since the selectivity-window model has no threshold for non-binders, non-binders were always considered either as true or false negative (never false/ true positive).
Without filtering inactives, the selectivity-window model performed average in the prediction of the A 1 AR-selective compounds (ROC 0.75) and A 2A AR-selective compounds (ROC 0.66) in the external validation set (Table 5). However, application of the consensus bioactivity filter resulted in an increase of the classification enrichment of the A 1 AR-and A 2A AR-selective compounds. Although the ROC for A 1 AR-selective compounds decreased after applying the bioactivity filter, PPV, and thus the fraction of true A 1 AR-selective compounds compared to all predicted A 1 AR compounds, increased from 0.12 to 0.21. In addition, MCC increased slightly from 0.13 to 0.18. Inspection of the compounds showed that all non-binders were removed after filtering the selectivity-window predictions with the consensus bioactivity filter. The decrease in ROC for the A 1 AR-selective class was thus caused by the presence of dual binders only. Remarkably, sensitivity for the A 1 AR-and A 2A AR-selective compounds was both 1.00 (100%), whereas sensitivity for dual binders was 0.00 (0%). Although dual compounds were present in the set that was filtered with the selectivity-window model, these compounds were wrongly categorized as either A 1 ARor A 2A AR-selective. The predicted dual binders prior to bioactivity filtering were, in fact, non-binders. However, these non-binders were correctly filtered out using the bioactivity filter, leaving the dual binder class without positive-predicted compounds. Note that the results do not specify dual binder enrichment, as the bioactivity predictions encompassed only compounds predicted to be A 1 AR-or A 2A AR-selective.

Discussion
While QSAR models are widely applied in bioactivity modeling, they can also effectively be used in selectivity modeling. However, modeling of selectivity requires a substantial amount of data, as activities for more than one protein have to be measured. The amount of data that is available influences the performance of the selectivity model as was observed for the performances of the selectivity-window models when trained on limited data. To increase the amount of data that is sufficient for selectivity modeling continuous regression models can be applied instead of classification models. With regression not only compounds that belong to a defined selectivity class can be included, but also compounds of which there is some selectivity but not large enough to fit into a class. Another benefit of regression is that the degree of selectivity can be provided in addition to the selectivity class of a compound.
Multiple QSAR regression models to derive selectivity for a panel of kinases were used by Sciabola et al. [23]. First, regression bioactivity models were trained for every kinase in the panel. Next, bioactivity patterns were predicted for a set of compounds against all kinases, from which subsequently selectivity was derived. To compare, we repeated a similar approach was repeated by us in the current work. However, we also introduce the selectivitywindow model, which is a direct implementation of selectivity. We show that this approach outperformed models that predicted selectivity indirectly by using separate bioactivity models. Even though separate bioactivity models can include more data since compounds measured for just one protein can be considered, this approach did not increase model performance enough to outperform the selectivity-window model.
Nevertheless, an advantage of using separate bioactivity models to deduce selectivity from is that

Table 5 Performance of the selectivity-window model on an external validation set
The query compounds were categorized based on post hoc optimized classification of the selectivity predictions: A 1 AR-selective when selectivity-window ≥ 0.5, A 2A AR-selective when selectivity-window ≤ − 2, and dual binder when selectivity-window ≥ − 1 and < 0.5 additional proteins can be added easily: the selectivity between the added protein and existing proteins can quickly be deduced from the results of the added bioactivity model. In selectivity-window modeling multiple models need to be trained to predict selectivities of compounds against a panel of targets: one model for every target-target combination. However, while using separate bioactivity models can be more convenient, selectivity-window modeling may yield more accurate predictions. It should also be noted that automatically generating these models using scripting can be considered trivial. Therefore, when sufficient selectivity data is available it is worthwhile to apply selectivity-window modeling.

MCC
The higher accuracy of selectivity-window modeling compared to using multiple bioactivity models is suggested to be influenced by the higher quality of the data used in selectivity-window modeling. In selectivitywindow modeling, selectivity is predicted based on data from biological experiments. Although biological experiments are susceptible to error (on average an error of 0.6 log units [24]), this data is more reliable than data derived from statistical models whose error by definition should be higher than the error of the data they were trained on. In practice the error of statistical models (and hence the error of predictions) varies around 0.5-1.0 log units [25,26]. While this error may accumulate with the experimental error, there is also the possibility that modeling can reduce part of the experimental error. An additional study is required to reveal how the modeling error behaves in combination with the experimental error.
The selectivity-window model by itself is not capable of distinguishing actives from inactives as it is trained on the difference only, which is different from the affinity. However, separate bioactivity models can be applied to filter potentially selective compounds from inactives, or nonbinders. A study by Zhao et al., where subtype selectivity between epigenetic targets HDAC1 and HDAC6 was modeled by classification of selectivity, utilizes a comparable approach by first predicting selectivity, followed by bioactivity [27]. However, the selectivity model in that study is incapable of predicting the degree of selectivity as a classification model was used. Furthermore, only statistical models were used by the authors to predict bioactivity of compounds. In the current study it was observed that implementation of statistical bioactivity models only increased the enrichment of selective compounds slightly. In contrast, addition of structure-based docking scores increased the enrichment of selective compounds substantially for both the A 1 AR and A 2A AR. Moreover, structure-based docking performed equally well as the consensus model (statistical bioactivity and docking) for A 2A AR-selective enrichment.

Conclusion
We demonstrated that continuous QSAR models can be applied to model selectivity on the A 1 AR and A 2A AR. The selectivity-window model, which was trained directly on the difference in affinity between both receptors, outperformed a two-step A 1 AR-A 2A AR selectivity model. In the two-step model, which is generally applied in selectivity modeling, selectivity predictions are derived indirectly by calculation of the difference between bioactivity predictions that resulted from two separate models. Even though the separate bioactivity models included more data, the performance did not increase enough to outperform our selectivity-window model. Furthermore, a combination of statistical bioactivity models and structure-based docking contributed to the enrichment of selective compounds and can be used to exclude nonbinders (which are not predicted accurately when directly predicting selectivity). In summary, we demonstrated that accurate selectivity predictions can be made for the A 1 and A 2A adenosine receptors by combining the selectivity-window model and consensus bioactivity modeling. This method can easily be applied to other protein targets (e.g., kinases) as well, provided sufficient data is available.

Training/test datasets
The dataset was compiled from publicly available data derived from the ChEMBL database [16,28] (version 23) and in-house data from Leiden University (Leiden, The Netherlands). Compounds with experimental activities were collected for the human A 1 AR (P30542) and human A 2A AR (P29274). The data derived from ChEMBL was filtered on confidence score 7 and 9, and a pChEMBL value ≥ 4. In-house data was filtered similarly: activity (K i /IC 50 /EC 50 ) ≤ 10 −4 M. K i values were prioritized over IC 50 or EC 50 values. Thus, for duplicates, when more than one type was available for a given compound-receptor pair, K i values were kept and IC 50 and EC 50 values were discarded. The mean value was taken when multiple bioactivity values of the same type were reported for a given compound-receptor pair (e.g., mean of multiple K i values for the same compound). The standardized activity values are reported as pActivity values.
An antagonist-focused dataset was compiled from the filtered data by removing compounds with a ribose or dicyanopyridine scaffold. From this antagonist-focused dataset an A 1 AR/A 2A AR dataset that contained only compounds with activities measured on both the A 1 AR and A 2A AR was derived. The compounds were assigned to the A 1 AR/A 2A AR dataset after they were categorized into one of the following five classes: non-binders when pActivity for both the A 1 AR and A 2A AR < 6.5, A 1 AR-selective (pActivity ≥ 6.5 for the A 1 AR and activity compared to the A 2A AR ≥ 100-fold), A 2A AR-selective (pActivity ≥ 6.5 for the A 2A AR and activity compared to the A 1 AR ≥ 100fold), and dual binders (pActivity ≥ 6.5 for both the A 1 AR/A 2A AR and activity difference ≤ 10-fold). Compounds with experimental bioactivities for both the A 1 AR and A 2A AR, but that did not fit into any of the classes of the A 1 AR/A 2A AR dataset, were termed "semiselective" compounds (855 bioactivities). The antagonistfocused dataset contained 5897 activities, the A 1 AR/ A 2A AR dataset included 1106 compounds, of which 50 A 1 AR-selective and 146 A 2A AR-selective. Additionally, the antagonist-focused dataset was split into two datasets for bioactivity modeling: the A 1 AR bioactivity dataset (2774 compounds) and the A 2A AR bioactivity dataset (3123 compounds).

T-distributed stochastic neighbor embedding (t-SNE)
The chemical similarity of the A 1 AR/A 2A AR dataset was plotted using t-SNE [20]. Compounds were described using FCFP4 fingerprints (fixed-length array of bits 2024). Two dimensions were calculated: t-SNE component 1 and t-SNE component 2. The settings were as follows: maximum number of iterations 5000, theta 0, perplexity 30, momentum 0.5, final momentum 0.8, and learning rate 10. Additionally a t-SNE was conducted showing the distribution of chemically clustered compounds using affinity propagation (FCFP4) [29].

External validation set
An external validation set was created by using compounds that had been newly added in ChEMBL version 24 and 25. Furthermore, compounds with confidence score 6 and 8 from previous ChEMBL versions were added. Additionally, less accurate bioactivity measurements (e.g., % displacement) were used to identify inactives. These less accurate bioactivities included bioactivities measured as percentage displacement, efficacy and change. If a pChEMBL value was known for both receptors, the compounds were categorized into the selectivity classes A 1 AR-selective, A 2A AR-selective, dual binder, and non-binder according to the same rules as used for the A 1 AR/A 2A AR dataset. If a pChEMBL value was known for only one of the two receptors, less accurate measurements (displacement/efficacy/change) were used to identify if the compound was marked as inactive for the other receptor. Subsequently, a selectivity class could be assigned. A pChEMBL value of < 4.5 or inhibition threshold of ≤ 50% (at 10 μM) was used to label compounds as inactive and a pChEMBL value of ≥ 6.5 was used to identify active compounds. Subsequently the selectivity class was derived from these bioactivity classes: A 1 AR-selective if active on the A 1 AR and inactive on the A 2A AR, A 2A AR-selective if inactive on the A 1 AR and active on the A 2A AR, and non-binder if inactive on both the A 1 AR and A 2A AR. Again, compounds with ribose and dicyanopyridine scaffolds were excluded, resulting in an external validation set of 1482 compounds.

Machine learning
QSAR bioactivity and selectivity models were constructed using the R XGBoost model component in Pipeline Pilot (version 18.1.0.1604) [30]. The following settings were applied for both classification and continuous models: maximum number of trees 1000, learning rate 0.3, maximum depth 7, data fraction 1.0, and descriptor fraction 0.7. Compound descriptors were calculated within the component and included ALogP, molecular weight, number of H-donors, number of H-acceptors, number of rotatable bonds, number of atoms, number of (aromatic) rings, and FCFP6 fingerprints (fixed-length array of bits 2024). The workflow to train the selectivity-window model in Pipeline Pilot is included in Additional file 3 (trained with the dataset in Additional file 1). A comparable model training protocol in KNIME [31] is added in Additional file 4 (trained with Additional file 5).

Cross-validation
The models were validated with fivefold cross-validation where they were trained on 80% and tested on 20% of the dataset. The A 1 AR/A 2A AR dataset was split into five subsets that each contained all four classes (A 1 AR-selective/ A 2A AR-selective/dual/non-binder) and the A 1 AR bioactivity dataset and A 2A AR bioactivity dataset were both separated into five subsets considering an active/inactive distribution (Table 6). This consideration of class-distribution ensured that every subset contained each (bioactivity) class, which allows for balanced model training and validation. Chemical similarity of compounds was also considered; the A 1 AR/A 2A AR dataset and bioactivity datasets were each split into five subsets with every set covering different chemical structures. In order to create five chemically distinct subsets, each selectivity/ activity class was clustered into ten clusters with the cluster molecules component in Pipeline Pilot (based on FCFP4). Subsequently, the smallest and largest clusters were combined into one group. This was done recurrently until all clusters were divided into five groups with every group containing 2 clusters per class. Finally, the resulting groups of each selectivity/activity class were distributed equally, resulting in five chemically distinct