Exploring QSAR models for activity-cliff prediction

Pairs of similar compounds that differ only by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that QSAR models struggle to predict ACs and that ACs thus form a major source of prediction error. However, the AC-prediction power of modern QSAR methods and its quantitative relationship to general QSAR-prediction performance are still underexplored. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. Our results provide strong support for the hypothesis that QSAR models indeed frequently fail to predict ACs. We observe low AC-sensitivity amongst the evaluated models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance amongst the tested input representations. A potential future pathway to improve QSAR-modelling performance might be the development of techniques to increase AC-sensitivity.


Introduction
Activity cliffs (ACs) are pairs of small molecules that exhibit high structural similarity but at the same time show an unexpectedly large difference in their binding affinity against a given pharmacological target [14,47,63,64,67-69]. The existence of ACs directly defies the intuitive idea that chemical compounds with similar structures should have similar activities, often referred to as the molecular similarity principle. An example of an AC between two inhibitors of blood coagulation factor Xa [43] is depicted in Figure 1; a small chemical modification involving the addition of a hydroxyl group leads to an increase in inhibition of almost three orders of magnitude.
For medicinal chemists, ACs can be puzzling and confound their understanding of structure-activity relationships (SARs) [19,67,77]. ACs reveal small compound modifications with large biological impact and thus represent rich sources of pharmacological information. Mechanisms by which a small structural transformation can give rise to an AC include a drastic change in 3D conformation and/or switching to a different binding mode or even binding site. ACs form discontinuities in the SAR-landscape and can therefore have a crucial impact on the success of lead-optimisation programmes. While knowledge about ACs can be powerful when trying to escape from flat regions of the SAR-landscape, their presence can be detrimental in later stages of the drug development process, when multiple molecular properties beyond mere activity need to be balanced carefully to arrive at a safe and effective compound [14,67].
In the field of computational chemistry, ACs are suspected to form one of the major roadblocks for successful quantitative structure-activity relationship (QSAR) modelling [14,26,47,63]; abrupt changes in potency are expected to negatively influence machine learning algorithms for pharmacological activity prediction. During the development of QSAR models, ACs are sometimes dismissed as measurement errors [49], but simply removing ACs from a training data set can result in a loss of precious SAR-information [15].
Golbraikh et al. [26] developed the MODI metric to quantify the smoothness of the SAR-landscape of binary molecular classification data sets and showed that SAR-landscape smoothness is a strong determinant of downstream QSAR-modelling performance. In a related work, Sheridan et al. [63] found that the density of ACs in a molecular data set is strongly predictive of its overall modelability by classical descriptor- and fingerprint-based QSAR methods. Furthermore, they found that such methods incur a significant drop in performance when the test set is restricted to "cliffy" compounds that form a large number of ACs. In a more extensive study, van Tilborg et al. [75] observed a similar drop in performance when testing classical and graph-based QSAR techniques on compounds involved in ACs. Notably, in both studies this performance drop was also observed for highly nonlinear and adaptive deep learning models. In fact, van Tilborg et al. report that descriptor-based QSAR methods even outperform more complex deep learning models on "cliffy" compounds associated with ACs. This runs counter to earlier hopes expressed in the literature that the approximation power of deep neural networks might ameliorate the problem of ACs [79].

Figure 1: Example of an activity cliff (AC) for blood coagulation factor Xa. A small structural transformation in the upper compound leads to an increase in inhibitory activity of almost three orders of magnitude. Both compounds were identified in the same ChEMBL assay with ID 658338.
While these works provide valuable insights into the detrimental effects of SAR discontinuity on QSAR models, they consider ACs mainly indirectly by focussing on individual compounds involved in ACs. Arguably, a distinct and more natural approach would be to investigate ACs directly at the level of compound pairs. This approach has been followed in the AC-prediction field, which is concerned with developing techniques to classify whether a pair of similar compounds forms an AC or not. An effective AC-prediction method would be of high value for drug development, with important applications in rational compound optimisation and automatic SAR-knowledge acquisition.
The AC-prediction literature is still very thin compared to the QSAR-prediction literature. An attempt to conduct an exhaustive literature review on AC-prediction techniques revealed a total of 15 methods [4,7,10,27,29,32,34,39,44,51,52,54,57,71], all of which have been published since 2012. Current AC-prediction methods are often based on creative ways to extract features from pairs of molecular compounds in a manner suitable for standard machine learning pipelines. For example, Horvath et al. [29] used condensed graphs of reactions [28,35], a representation technique originally introduced for the modelling of chemical reactions, to encode pairs of similar compounds and subsequently predict ACs. Another method was recently described by Iqbal et al. [34], who investigated the abilities of convolutional neural networks operating on 2D images of compound pairs to distinguish between ACs and non-ACs. Interestingly, none of the AC-prediction methods we identified employ feature extraction techniques built on modern graph neural networks (GNNs) [20,25,40,76,81], with the exception of Park et al. [54], who recently applied graph convolutional methods to compound pairs to predict ACs.
In spite of the existence of advanced AC-prediction models, there are significant gaps left in the current AC-prediction literature. Note that any QSAR model can immediately be repurposed as an AC-prediction model by using it to individually predict the activities of two structurally similar compounds and then thresholding the predicted absolute activity difference. Nevertheless, at the moment there is no study that uses this straightforward technique to investigate the potential of current QSAR models to classify whether a pair of compounds forms an AC or not. Importantly, this also entails that the most salient AC-prediction models [27,29,34,44,71] have not been compared to a simple QSAR-modelling baseline applied to compound pairs. It is thus an open question to what extent (if at all) these tailored AC-prediction techniques outcompete repurposed QSAR methods in the detection of ACs. This is especially relevant in light of the fact that several published AC-prediction models [27,34,44] are evaluated via compound-pair-based data splits which incur a significant overlap between training set and test set at the level of individual molecules; this type of data split should strongly favour standard QSAR models for AC-prediction, yet a comparison to such baseline methods is lacking.
We address these gaps by systematically investigating the abilities of nine frequently used QSAR models to classify pairs of similar compounds as ACs or non-ACs within three pharmacological data sets: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. Each QSAR model is constructed by combining a molecular representation method (physicochemical-descriptor vectors (PDVs) [72], extended-connectivity fingerprints (ECFPs) [59], or graph isomorphism networks (GINs) [81]) with a regression technique (random forests (RFs), k-nearest neighbours (kNNs), or multilayer perceptrons (MLPs)). All models are used for two distinct prediction tasks: QSAR-prediction at the level of individual molecules, and AC-classification at the level of compound pairs. The main contribution of this study is to shed light on the following questions:

• What is the relationship between the ability of a QSAR model to predict the activities of individual compounds and its ability to classify whether pairs of similar compounds form ACs?
• When (if at all) are common QSAR models capable of predicting ACs?
• When (if at all) are common QSAR models capable of predicting which of two similar compounds is the more active one?
• Which QSAR model shows the strongest AC-prediction performance, and should thus be used as a baseline against which to compare tailored AC-prediction models?
• Do differentiable GINs outperform classical non-trainable ECFPs and PDVs as molecular representations for QSAR- and/or AC-prediction?
• How could ACs potentially be used to improve QSAR-modelling performance?

Experimental Methodology
Molecular Data Sets

We built three binding affinity data sets of small-molecule inhibitors of dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. Factor Xa is an enzyme in the coagulation cascade and a canonical target for blood-thinning drugs [43]. Dopamine receptor D2 is the main site of action for classic antipsychotic drugs, which act as antagonists of the D2 receptor [62]. SARS-CoV-2 main protease is one of the key enzymes in the viral replication cycle of the SARS coronavirus 2, which recently caused the unprecedented COVID-19 pandemic; it is one of the most promising targets for antiviral drugs against this coronavirus [74].
For dopamine receptor D2 and factor Xa, data was extracted from the ChEMBL database [45] in the form of SMILES strings with associated Ki [nM] values. For SARS-CoV-2 main protease, data was obtained from the COVID moonshot project [1] in the form of SMILES strings with associated IC50 [µM] values. SMILES strings were standardised and desalted via the ChEMBL structure pipeline [8]. This step also removed solvents and all isotopic information. Following this, SMILES strings that produced error messages when turned into an RDKit mol object were deleted. Finally, a scan for duplicate molecules was performed: if the activities in a set of duplicate molecules were within the same order of magnitude, the set was unified via geometric averaging; otherwise the measurements were considered unreliable and the corresponding set of duplicate molecules was removed. This procedure reduced the data set for dopamine receptor D2 / factor Xa / SARS-CoV-2 main protease from 8883 / 4116 / 1926 compounds to 6333 / 3605 / 1924 unique compounds, whereby 174 / 21 / 0 sets of duplicate SMILES were removed and the rest were unified.
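The duplicate-handling rule described above can be sketched as follows; `unify_duplicates` is a hypothetical helper (not code from the study), and "same order of magnitude" is interpreted here as a maximal ratio of 10 between the largest and smallest measurement, which is an assumption on our part:

```python
from math import prod

def unify_duplicates(activities, max_ratio=10.0):
    """Unify repeated activity measurements for one molecule.

    Returns the geometric mean if all values lie within one order
    of magnitude of each other, and None (i.e. discard the molecule)
    otherwise.
    """
    if max(activities) / min(activities) > max_ratio:
        return None  # measurements disagree too strongly
    return prod(activities) ** (1.0 / len(activities))
```

The geometric mean (rather than the arithmetic mean) is the natural choice here because activities are later modelled on a logarithmic scale.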

Activity Cliffs: Definition of Binary Classification Tasks
The exact definition of an AC hinges on two concepts: structural similarity and a large activity difference. An elegant technique to measure structural similarity in the context of AC analysis is given by the matched molecular pair (MMP) formalism [31,38]. An MMP is a pair of compounds that share a common structural core but differ by a small chemical transformation at a specific site. Figure 1 depicts an example of an MMP whose variable parts are formed by a hydrogen atom and a hydroxyl group. To detect MMPs algorithmically, we used the mmpdb Python package provided by Dalke et al. [17]. We restricted ourselves to MMPs with the following commonly used [27,29,71] size constraints: the MMP core was required to contain at least twice as many heavy atoms as either of the two variable parts; each variable part was required to contain no more than 13 heavy atoms; the maximal size difference between both variable parts was set to eight heavy atoms; and bond cutting was restricted to single exocyclic bonds. To guarantee a well-defined mapping from each MMP to a unique structural core, we canonically chose the core that contained the largest number of heavy atoms whenever there was ambiguity. Based on the ratio of the activity values of both MMP compounds, each MMP was assigned to one of three classes: "AC", "non-AC" or "half-AC". In accordance with the literature [5,27,29,52,77], we assigned an MMP to the "AC" class if both activity values differed by at least a factor of 100. If both activity values differed by no more than a factor of 10, the MMP was assigned to the "non-AC" class. In the residual case, the MMP was assigned to the "half-AC" class. To arrive at a well-separated binary classification task, we labelled all ACs as positives and all non-ACs as negatives. The half-ACs were removed and not considered further in our experiments. It is also relevant to know the direction of a potential activity cliff, i.e. which of the compounds in the pair is the more active one. We thus assigned a binary label to each MMP indicating its potency direction (PD). PD-classification is a balanced binary classification task. Table 1 gives an overview of all our curated data sets.
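The three-class labelling rule can be condensed into a short function; this is a minimal sketch of the rule as stated above, not the study's actual implementation:

```python
def assign_mmp_class(act_1, act_2):
    """Assign an MMP to "AC", "non-AC" or "half-AC" based on the
    ratio of the two measured activity values."""
    ratio = max(act_1, act_2) / min(act_1, act_2)
    if ratio >= 100.0:
        return "AC"       # activities at least a factor of 100 apart
    if ratio <= 10.0:
        return "non-AC"   # activities no more than a factor of 10 apart
    return "half-AC"      # residual case, removed from the binary task
```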

Data Splitting Technique
ACs are molecular pairs rather than single molecules; it is thus not obvious how best to split a chemical data set into non-overlapping training and test sets for the fair evaluation of an AC-prediction method. There seems to be no consensus about which data splitting strategy should canonically be used. Several authors [27,34,44] have employed a random split at the level of compound pairs. While this technique is conceptually straightforward, it must be expected to incur a significant overlap between training and test set at the level of individual molecules. For example, randomly splitting a set of three MMPs {{s1, s2}, {s1, s3}, {s2, s3}} into a training and a test set might lead to {s1, s2} and {s1, s3} being assigned to the training set and {s2, s3} being assigned to the test set, which amounts to a full inclusion of the test set in the training set at the level of individual molecules. This molecular overlap is problematic for at least three reasons. Firstly, it likely leads to overly optimistic results for AC-prediction methods since they will already have encountered some of the test compounds during training. Secondly, it does not model the natural situation encountered by medicinal chemists, who we assume will not know the activity value of at least one compound in a test-set pair. Thirdly, the mentioned molecular overlap should lead to strong AC-prediction results for standard QSAR models, but to the best of our knowledge, no such control experiments have been run in the literature. Horvath et al. [29] and Tamura et al. [71] have made efforts to address the shortcomings of a compound-pair-based random split. They came up with advanced data splitting algorithms designed to mitigate the molecular-overlap problem by either managing distinct types of test sets according to compound membership in the training set or by designing splitting techniques based on the structural cores of MMPs. However, their data splitting schemes exhibit a relatively high degree of complexity, which can make their implementation and interpretation difficult.
We propose a novel data splitting method which represents a favourable trade-off between rigour, interpretability and simplicity. Our technique shares some of its concepts with the methods proposed by Horvath et al. [29] and Tamura et al. [71] but might be simpler to implement and interpret. We first split the data into a training and test set at the level of individual molecules and then use this basic split to distinguish several types of test sets at the level of compound pairs. Let D be the molecular data set, randomly split into a training set Dtrain and a test set Dtest, and let M denote the set of all MMPs {s, s̃} ⊆ D. We define

Mtrain = {{s, s̃} ∈ M : s, s̃ ∈ Dtrain},
Mtest = {{s, s̃} ∈ M : s, s̃ ∈ Dtest},
Minter = {{s, s̃} ∈ M : s ∈ Dtrain and s̃ ∈ Dtest, or vice versa},

as well as Mcores ⊆ Mtest, the subset of MMPs in Mtest whose structural cores do not appear as cores of MMPs in Mtrain ∪ Minter. Note that Mtrain ∪ Minter ∪ Mtest = M. The pair (Dtrain, Mtrain) describes the training space at the level of individual molecules and MMPs, and can be used to train a QSAR- or AC-prediction method. A trained method can then classify MMPs in Mtest, Minter and Mcores. Mtest models an AC-prediction setting where the activities of both MMP compounds are unknown. Mcores models the more difficult task of predicting ACs in a structurally novel area of chemical space. Finally, Minter represents an AC-prediction scenario where the activity of one MMP compound is given a priori; this can be interpreted as a compound-optimisation task where one strives to predict small AC-inducing modifications of a query compound with known activity. An illustration of our data splitting strategy is given in Figure 2.
We implemented our data splitting strategy within a k-fold cross-validation scheme repeated with m random seeds. This generated data splits of the form S_ij = (Dtrain_ij, Dtest_ij) for i ∈ {1, ..., m} and j ∈ {1, ..., k}, where (Dtrain_ij, Dtest_ij) represents the j-th split of D in the cross-validation round with random seed i. The overall QSAR- and AC-prediction performance of each model was recorded as the average over the mk training and test runs based on all data splits S_11, ..., S_mk. We chose the configuration (k, m) = (2, 3), which gave a good trade-off between computational costs and accuracy, and reasonable numbers of MMPs in the compound-pair sets. In particular, random cross-validation with k = 2 gave expected relative sizes of roughly 25% / 50% / 25% for Mtrain / Minter / Mtest, since each individual molecule falls into the training or test set with equal probability. On average, 12.7%, 11.91%, and 6.84% of MMPs in Mtest were also in Mcores for dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease, respectively.
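Given a molecule-level train/test split, the four compound-pair sets can be derived with simple set operations; this is an illustrative sketch (molecule IDs, core labels and the helper name are our own, not the study's code):

```python
def split_mmp_sets(mmps, cores, train_mols, test_mols):
    """Partition MMPs into the compound-pair sets Mtrain, Minter,
    Mtest and Mcores described above.

    mmps       -- list of frozensets {s, s_tilde} of molecule IDs
    cores      -- dict mapping each MMP to its structural core
    train_mols -- set of molecule IDs in Dtrain
    test_mols  -- set of molecule IDs in Dtest
    """
    m_train, m_inter, m_test = [], [], []
    for pair in mmps:
        if pair <= train_mols:        # both compounds in Dtrain
            m_train.append(pair)
        elif pair <= test_mols:       # both compounds in Dtest
            m_test.append(pair)
        else:                         # one compound in each set
            m_inter.append(pair)
    # Mcores: MMPs in Mtest whose core never occurs in Mtrain or Minter
    seen_cores = {cores[p] for p in m_train + m_inter}
    m_cores = [p for p in m_test if cores[p] not in seen_cores]
    return m_train, m_inter, m_test, m_cores
```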

Prediction Strategies and Performance Measures
In a data split of the form (Dtrain, Dtest), each compound s ∈ D = Dtrain ∪ Dtest can be associated with an activity label a(s) ∈ ℝ, defined as the negative decadic logarithm of the experimentally measured activity of s. We stuck with the canonical units used in the ChEMBL database and the COVID moonshot project ([nM] for Ki and [µM] for IC50); each activity label a(s) thus represents a standard pKi or pIC50 value (with an additive shift towards 0 caused by the units, which might slightly benefit prediction techniques initialised around the origin). We are interested in QSAR-prediction functions f : D → ℝ that map a chemical structure s ∈ D to an estimate of its binding affinity a(s). The mapping f is found via an algorithmic training process on the labelled data set Dtrain and can then either be used to predict the activity labels of compounds in Dtest, or it can be repurposed to classify whether an MMP forms an activity cliff (AC-classification) and what the potency direction of an MMP is (PD-classification). If {s, s̃} ∈ Minter, one can assume that the activity label of one of the compounds, say a(s), is known; f is then used to classify {s, s̃} as an AC if and only if

|a(s) − f(s̃)| ≥ dcrit.

Here dcrit ∈ ℝ>0 is a critical threshold above which an MMP is classified as an AC. Throughout this work we use dcrit = 1.5 (in pKi or pIC50 units), since this value represents the middle point between the intervals [0, 1] and [2, ∞), which correspond to the absolute activity-label differences associated with non-ACs and ACs respectively.
If {s, s̃} ∈ Mtest ∪ Mcores, the activities of both compounds are unknown and we classify {s, s̃} as an AC if and only if

|f(s) − f(s̃)| ≥ dcrit.

PD-classification for MMPs is performed in a straightforward manner: the activity labels of both MMP compounds are predicted via f and then compared to classify which compound is the more active one. The performance of f for standard QSAR-prediction on Dtest is measured via the mean absolute error (MAE). For the balanced PD-classification problem we rely on accuracy as a suitable performance measure. For the highly imbalanced task of AC-classification, however, we use the Matthews correlation coefficient (MCC), as well as sensitivity and precision. For the relatively small SARS-CoV-2 main protease data set we sometimes encountered the edge case where there were no positive predictions; we then set MCC = 0 and ignored ill-defined precision measurements when averaging the performance metrics to obtain the final results.
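The two thresholding rules and the PD rule can be sketched together; `classify_ac` and `classify_pd` are hypothetical helper names, and `f` stands for any trained QSAR-prediction function:

```python
D_CRIT = 1.5  # critical threshold in pKi or pIC50 units

def classify_ac(f, s, s_tilde, a_s=None):
    """Repurpose a QSAR model f as an AC-classifier.

    If the true activity label a_s of compound s is known (the
    Minter scenario), it replaces the prediction f(s); otherwise
    both activities are predicted (the Mtest / Mcores scenario).
    """
    left = a_s if a_s is not None else f(s)
    return abs(left - f(s_tilde)) >= D_CRIT

def classify_pd(f, s, s_tilde):
    """Predict which MMP compound is the more active one."""
    return s if f(s) > f(s_tilde) else s_tilde
```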

Molecular Representation and Regression Techniques
We constructed nine QSAR models via a robust combinatorial methodology that systematically combines three molecular representation methods with three regression techniques. This setup allows, for example, the performance of molecular representation methods to be compared across regression techniques, data sets and prediction tasks.
For molecular representation, we used extended-connectivity fingerprints (ECFPs) [59], physicochemical molecular descriptor vectors (PDVs) [72], and graph isomorphism networks (GINs) [81]. Both ECFPs and PDVs were computed via RDKit [42]. The ECFPs used a radius of two, a length of 2048 bits, and active chirality flags. The PDVs had a dimensionality of 200 and were constructed using the general list of descriptors from the work of Fabian et al. [21]. This list encompasses properties related to druglikeness, logP, molecular refractivity, electrotopological state, molecular graph structure, fragment profile, charge, and topological surface properties. The GIN was implemented using PyTorch Geometric [23] and consisted of a variable number of graph convolutional layers, each with two internal hidden layers with ReLU activations and batch normalisation [33]. We further chose the max-pool operator, which computes the component-wise maximum over all atom feature vectors in the final graph layer, to obtain a graph-level representation.
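The max-pool readout mentioned above reduces a variable-size set of atom feature vectors to a fixed-size molecular embedding; a minimal NumPy sketch of the operation (not the study's PyTorch Geometric implementation):

```python
import numpy as np

def maxpool_readout(atom_features):
    """Component-wise maximum over all atom feature vectors.

    atom_features -- array of shape (num_atoms, feature_dim) holding
                     the final-layer GIN embeddings of one molecule
    Returns a graph-level vector of shape (feature_dim,).
    """
    return np.asarray(atom_features).max(axis=0)
```

Because the maximum is taken over the atom axis, the output dimension is independent of molecule size, which is what makes the readout usable as input to a downstream regressor.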
The GIN was combined with the regression techniques as follows: for MLP regression, the GIN was trained with the MLP as a projection head after the pooling step in the usual end-to-end manner. For RF- or kNN-regression, the GIN was first trained with a single linear layer added after the global pooling step that directly mapped the graph-level representation to an activity prediction. After this training phase the weights of the GIN were frozen and it was used as a static feature extractor. The RF or kNN regressor was then trained on the features extracted by the frozen GIN. Figure 3 illustrates our combinatorial experimental methodology.

Model Training and Hyperparameter Optimisation
All models were trained using full inner hyperparameter-optimisation loops. Hyperparameters of RFs and kNNs were optimised in scikit-learn [56] by uniformly random sampling of hyperparameters from a predefined grid. The hyperparameters of MLPs and GINs were sampled from a predefined grid via the tree-structured Parzen estimator algorithm implemented in Optuna [2]. Deep learning models were trained for 500 epochs on a single NVIDIA GeForce RTX 3060 GPU via the mean squared error loss function using AdamW optimisation [46]. Weight decay, learning rate decay and dropout [65] were employed at all hidden layers for regularisation. Batch size, learning rate, learning rate decay rate, weight decay rate, and dropout rate were treated as hyperparameters and subsequently optimised. Note that the training length (i.e. the number of gradient updates) was implicitly optimised by tuning the batch size for the fixed number of 500 training epochs. Further implementation details can be found in our public code repository [1].
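Uniformly random sampling from a predefined hyperparameter grid can be sketched in a few lines; the grid values below are illustrative, not the ones used in the study:

```python
import random

def sample_hyperparameters(grid, n_trials, seed=0):
    """Draw n_trials uniformly random configurations from a grid
    given as {parameter_name: list_of_candidate_values}."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_trials)]

rf_grid = {                      # illustrative random-forest grid
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
trials = sample_hyperparameters(rf_grid, n_trials=10)
```

Each configuration in `trials` would then be evaluated on an inner validation split, with the best-performing one refitted on the full training data.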

Results and Discussion
The QSAR-prediction, AC-classification and PD-classification results for all three data sets are depicted in Figures 4 to 9.

QSAR-Prediction Performance
When considering the results depicted in Figures 4 to 9 with respect to QSAR-prediction performance, one can see that ECFPs tend to lead to better performance (i.e. a lower QSAR-MAE) than GINs, which in turn tend to lead to better performance than PDVs. In particular, the combination MLP-ECFP consistently produced the lowest QSAR-MAE across all three targets. These observations reinforce a growing corpus of literature suggesting that trainable GNNs have not yet reached a level of technical maturity by which they consistently and definitively outperform the much simpler non-differentiable ECFPs at important molecular property prediction tasks [13,37,48,50,60,66,80].

[1] https://github.com/MarkusFerdinandDablander/QSAR-activity-cliff-experiments

AC-Classification Performance
The AC-MCC plots in Figures 4 to 6 reveal surprisingly strong overall AC-classification results on Minter. This type of MMP set models a compound-optimisation scenario where a researcher strives to identify small structural modifications with a large impact on the activity of query compounds with known activities. For this task, a significant portion of our QSAR models exhibit an AC-MCC value greater than 0.5 across targets, which appears impressive considering the simplicity of the approach. Exchanging Minter for either Mtest or Mcores leads to a substantial drop in the AC-MCC to approximately 0.3, which appears to be mediated by a large drop in AC-sensitivity.
In most cases, GINs perform better than the other molecular representation methods with respect to the AC-MCC. Notably, kNN regressors consistently perform best for AC-classification when combined with GIN features; this supports the idea that GINs might have a heightened ability to resolve ACs by learning an embedding of chemical space in which the distance between two compounds reflects activity difference rather than structural difference. The combinations GIN-MLP, GIN-RF and ECFP-MLP exhibit particularly high AC-MCC values relative to the other methods. We recommend using at least one of these three models as a baseline against which to compare tailored AC-prediction models; the practical utility of any AC-prediction technique that cannot outperform these three common QSAR methods is questionable.

Figure 4: QSAR-prediction and AC-classification results for dopamine receptor D2. For each plot, the x-axis corresponds to a combination of MMP-set and AC-classification performance metric and the y-axis shows the QSAR-prediction performance on the molecular test set Dtest. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

Across all three targets, AC-sensitivity is moderately high on Minter but universally low on Mtest and Mcores. This is consistent with the hypothesis that ACs form one of the major sources of prediction error for QSAR models. The weak AC-sensitivity on Mtest and Mcores indicates that modern QSAR methods are largely blind to ACs in novel areas of chemical space and thus lack essential chemical knowledge. GINs clearly outperform the two more classical molecular representations across regression techniques with respect to AC-sensitivity. In particular, the GIN-MLP combination leads to the highest AC-sensitivity in all examined cases and thus discovers the most ACs. The highly parametric nature of GINs that makes them prone to overfitting could at the same time enable them to model jagged regions of the SAR-landscape that contain ACs better than classical task-agnostic representations.

Figure 5: QSAR-prediction and AC-classification results for factor Xa. For each plot, the x-axis corresponds to a combination of MMP-set and AC-classification performance metric and the y-axis shows the QSAR-prediction performance on the molecular test set Dtest. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

There is a wide gap between distinct prediction techniques with respect to AC-precision: some models achieve a considerable level of AC-precision such that over 50% of positively predicted MMPs in Mtest and Mcores are indeed actual ACs. Other QSAR models, however, seem to fail almost entirely with respect to this metric on Mtest and Mcores and only deliver modest performance on Minter. RFs tend to exhibit the strongest AC-precision and the weakest AC-sensitivity. This might be a result of their ensemble nature, which should intuitively lead to conservative but trustworthy predictions of extreme effects such as ACs.

Figure 6: QSAR-prediction and AC-classification results for SARS-CoV-2 main protease. For each plot, the x-axis corresponds to a combination of MMP-set and AC-classification performance metric and the y-axis shows the QSAR-prediction performance on the molecular test set Dtest. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. The AC-precision of the ECFP + kNN technique is missing on Mtest and Mcores since this method produced only negative AC-predictions for all trials on this data set. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

PD-Classification Performance
The ability of the evaluated QSAR models to identify which is the more active compound in an MMP is universally weak, with PD-accuracies clustering around 0.7 on Minter and around 0.6 on Mtest and Mcores, as can be seen in the top rows of Figures 7 to 9. Predicting the potency direction for two compounds with similar structures, and thus usually similar activity levels, must be considered a challenging task. The combination ECFP-MLP reaches the strongest PD-accuracy in the majority of cases and we recommend starting with this model as a baseline for more advanced PD-prediction methods.
One can argue that the activity order of two similar compounds is of little interest if the true activity difference is small, as is often the case. We therefore also restricted PD-classification to predicted ACs. The three plots in the bottom rows of Figures 7 to 9 depict the PD-accuracy of each QSAR model on the subset of MMPs that were also predicted to be ACs by the same model. In this practically more relevant scenario, PD-prediction accuracy tends to exceed 0.9 on Minter and 0.8 on Mtest and Mcores. The QSAR models investigated here are thus able to identify the correct activity order of MMPs if they also predict them to be ACs. The relatively rare instances in which the PD of a predicted AC is misclassified, however, reflect severe QSAR-prediction errors.

Linear Relationship between QSAR-MAE and AC-MCC
Our experiments reveal a consistent linear relationship between the QSAR-MAE and the AC-MCC, as can be seen in the left columns of Figures 4 to 6. A potential mechanism driving this effect could be that as the overall QSAR-MAE of a model improves, its accuracy at predicting activity differences between similar molecules might be expected to improve as well; previously misclassified MMPs whose predicted absolute activity differences were already close to the critical value dcrit = 1.5 might then gradually move to the correct side of the decision boundary and increase the AC-MCC. The results suggest that for real-world QSAR models the AC-MCC and the QSAR-MAE are strongly predictive of each other; while this observation rests on only nine models, it is highly consistent across MMP-sets and pharmacological targets.

Future Research: Exploring Twin-Network Training Schemes
ACs are rich in pharmacological information; at the same time, our experiments have shown that QSAR models exhibit low AC-sensitivity and thus frequently fail to predict ACs. In spite of this, to the best of our knowledge no method has yet been described that tackles this problem by attempting to increase the AC-sensitivity of QSAR models. We propose twin-network training of deep-learning models as a potential strategy to increase AC-sensitivity. Comparatively little work has been done to investigate twin neural network architectures (also referred to as Siamese networks [9,12,41,70]) in computational drug discovery [3,6,11,18,22,24,36,53,58,61,73,82]. However, twin networks provide a natural way to tackle chemical prediction problems on compound pairs such as AC-classification.
Instead of training a deep network f on an individual compound s with activity label a(s) via a classical squared-error loss (a(s) − f(s))², we suggest training f on compound pairs {s, s̃} using a pair-based loss of the form

ℓ({s, s̃}) = w_{s,s̃} [ (a(s) − f(s))² + (a(s̃) − f(s̃))² + w_diff ((a(s) − a(s̃)) − (f(s) − f(s̃)))² ].

The quantity w_{s,s̃} specifies the weight put on the compound pair {s, s̃} during training; w_diff determines the relative importance of predicting the individual activities of s and s̃ versus predicting the activity difference associated with {s, s̃}. Twin-network training could be conducted in two phases: first on general compound pairs in D_train × D_train and then on MMPs in M_train. In the second phase, the weight function w_{s,s̃} could be used to assign training weights to MMPs proportional to their associated activity differences; MMPs that represent larger activity differences might encode structural transformations that are pharmacologically more relevant and should thus receive more attention during training. This weighting procedure could lead to increased AC-sensitivity and the extraction of more chemical knowledge. Our pair-based training strategy is depicted in Figure 10 and is based on a twin neural network model for AC-prediction with discrete outputs that we explored in a previous research study [16]. We intend to evaluate the proposed twin-network training scheme in a future study.
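As a concrete sketch of such a pair-based objective, the following function combines the two individual squared errors with a squared error on the activity difference, weighted by a pair weight and by w_diff. The function name, argument names, and the exact functional form are illustrative assumptions, not the definitive loss of the proposed scheme; it operates on plain floats but would work unchanged on autograd tensors of a deep-learning framework.

```python
def pair_loss(f_s, f_t, a_s, a_t, w_pair=1.0, w_diff=1.0):
    """Sketch of a pair-based twin-network loss (names and exact form are
    illustrative). f_s, f_t are the twin outputs for compounds s and s~;
    a_s, a_t are their activity labels. w_pair weights the whole pair
    (e.g. proportional to the MMP's activity difference in phase two);
    w_diff trades off individual-activity prediction against
    activity-difference prediction."""
    individual = (a_s - f_s) ** 2 + (a_t - f_t) ** 2
    difference = ((a_s - a_t) - (f_s - f_t)) ** 2
    return w_pair * (individual + w_diff * difference)
```

Setting w_diff = 0 recovers plain per-compound regression on both twins, while large w_diff pushes the network to rank similar compounds correctly, which is the behaviour a higher AC-sensitivity requires.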

Conclusions
To the best of our knowledge, this is the first study to investigate the AC-prediction capabilities of QSAR models. It is also the first work to explore the quantitative relationship between QSAR-prediction (at the level of individual molecules) and AC-prediction (at the level of compound pairs). As part of our methodology, we have additionally introduced a simple, interpretable and rigorous data-splitting technique for pair-based prediction tasks.
When the activities of both MMP-compounds are unknown (i.e. absent from the training set), common QSAR models exhibit low AC-sensitivity, which limits their utility for AC-prediction. This strongly supports the hypothesis that QSAR methods do indeed regularly fail to predict ACs, which might thus form a major source of prediction errors in QSAR modelling [14,26,47,63]. However, if the activity of one MMP-compound is known (i.e. present in the training set), AC-sensitivity increases substantially; for query compounds with known activities, QSAR methods can therefore be used as simple AC-prediction, compound-optimisation and SAR-knowledge-acquisition tools. Furthermore, based on the observed potency-direction (PD) classification results, we can expect the estimated activity direction of predicted ACs to have a high degree of accuracy.
With respect to molecular representation, we have found robust evidence that non-trainable, task-agnostic ECFPs still outcompete differentiable GINs at general QSAR-prediction. This adds to a growing awareness that standard message-passing GNNs might need to be improved further to definitively beat classical molecular featurisations such as ECFPs [13,37,48,50,60,66,80]. One potential angle to achieve this could be self-supervised GNN-pretraining, which has recently shown promising results in the molecular domain [30,78]. However, while GINs appear to be inferior to ECFPs for QSAR-prediction, they tend to be advantageous for AC-classification; their highly parametric nature might simultaneously lead to increased overfitting but also to a better modelling of the more jagged regions of the SAR-landscape. We thus recommend using GINs as an AC-classification baseline, since such an agreed-upon baseline is currently lacking.
Finally, the low AC-sensitivity of QSAR models when the activities of both MMP-compounds are unknown suggests that such methods are still lacking essential SAR knowledge; on the flip side, it might be possible to boost QSAR-modelling performance and increase the amount of extracted SAR knowledge by developing techniques to increase AC-sensitivity. To this end, we propose an AC-sensitive twin-network [9,12,41,70] training scheme for deep-learning models that we intend to explore in the future.

Let D = {s_1, s_2, ...} be the given data set of individual molecules. Furthermore, let M ⊆ {{s, s̃} | s ≠ s̃ and s, s̃ ∈ D} be the set of all MMPs in D that have been labelled as either ACs or non-ACs. Each MMP {s, s̃} ∈ M shares a common structural core denoted core({s, s̃}). We use a random split to partition D into a training set D_train and a test set D_test and then define the following MMP-sets: M_train (both MMP-compounds in D_train), M_inter (one MMP-compound in D_train and the other in D_test), M_test (both MMP-compounds in D_test) and M_cores (the MMPs in M_test whose structural cores do not appear in M_train ∪ M_inter).
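The splitting strategy just described can be sketched as follows. The function and variable names are ours, and the molecule and core representations are placeholders (molecules as hashable identifiers, MMPs as 2-tuples, `core` as a caller-supplied function):

```python
import random

def split_mmp_sets(D, M, core, train_frac=0.5, seed=0):
    """Randomly partition the molecules D into D_train/D_test and derive the
    MMP-sets M_train, M_inter, M_test and M_cores described above."""
    mols = sorted(D)
    random.Random(seed).shuffle(mols)           # reproducible random split
    D_train = set(mols[: int(train_frac * len(mols))])
    M_train = [p for p in M if all(s in D_train for s in p)]
    M_inter = [p for p in M if sum(s in D_train for s in p) == 1]
    M_test = [p for p in M if not any(s in D_train for s in p)]
    # M_cores: test MMPs whose structural core was never seen during training
    seen_cores = {core(p) for p in M_train + M_inter}
    M_cores = [p for p in M_test if core(p) not in seen_cores]
    return D_train, M_train, M_inter, M_test, M_cores
```

By construction, every MMP falls into exactly one of M_train, M_inter and M_test, and M_cores is the subset of M_test that is hardest in the sense of structural novelty.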

Figure 2
Figure 2 Illustration of our data-splitting strategy. We distinguish between three MMP-sets, M_train, M_inter and M_test, depending on whether both MMP-compounds are in D_train, one MMP-compound is in D_train and the other one is in D_test, or both MMP-compounds are in D_test. We additionally consider a fourth MMP-set, M_cores, consisting of the MMPs in M_test whose structural cores do not appear in M_train ∪ M_inter.

Figure 3
Figure 3 Schematic showing the combinatorial experimental methodology used for the study. Each molecular representation method is systematically combined with each regression technique, giving a total of nine QSAR models. Each QSAR model is trained and evaluated for QSAR-prediction, AC-classification and PD-classification within a 2-fold cross-validation scheme repeated with 3 random seeds. For each of the 2 * 3 = 6 trials, an extensive inner hyperparameter-optimisation loop on the training set is performed for each QSAR model.

Figure 7
Figure 7 QSAR-prediction and PD-classification results for dopamine receptor D2. Each column corresponds to an upper plot and a lower plot for one of the MMP-sets M_inter, M_test or M_cores. The x-axis of each upper plot indicates the PD-classification accuracy on the full MMP-set; the x-axis of each lower plot indicates the PD-classification accuracy on a restricted MMP-set consisting only of MMPs predicted to be ACs by the respective method. The y-axis of each plot shows the QSAR-prediction performance on the molecular test set D_test. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

Figure 8
Figure 8 QSAR-prediction and PD-classification results for factor Xa. Each column corresponds to an upper plot and a lower plot for one of the MMP-sets M_inter, M_test or M_cores. The x-axis of each upper plot indicates the PD-classification accuracy on the full MMP-set; the x-axis of each lower plot indicates the PD-classification accuracy on a restricted MMP-set consisting only of MMPs predicted to be ACs by the respective method. The y-axis of each plot shows the QSAR-prediction performance on the molecular test set D_test. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

Figure 9
Figure 9 QSAR-prediction and PD-classification results for SARS-CoV-2 main protease. Each column corresponds to an upper plot and a lower plot for one of the MMP-sets M_inter, M_test or M_cores. The x-axis of each upper plot indicates the PD-classification accuracy on the full MMP-set; the x-axis of each lower plot indicates the PD-classification accuracy on a restricted MMP-set consisting only of MMPs predicted to be ACs by the respective method. The y-axis of each plot shows the QSAR-prediction performance on the molecular test set D_test. The total length of each error bar equals twice the standard deviation of the performance metric measured over all mk = 3 * 2 = 6 hyperparameter-optimised models. The PD-classification accuracy on predicted ACs is missing for ECFP + kNN on M_test and M_cores, since this method produced only negative AC-predictions for all trials on this data set. For each plot, the lower right corner corresponds to strong performance at both prediction tasks.

Figure 10
Figure 10 Twin-network training strategy for deep-learning-based QSAR models that might increase AC-sensitivity. Twin-network training could be conducted on general compound pairs and on MMPs, with larger weights given to MMPs associated with larger activity differences.
The manuscript was written by M.D. Feedback was provided by R.L., G.M.M. and T.H. during the writing process. The novel data-splitting technique for MMP-data, the QSAR-modelling-based activity-cliff-prediction strategies and the proposed twin-network training scheme were developed by M.D. All scientific figures were designed by M.D., with input from G.M.M., R.L. and T.H. All chemical data sets were gathered and cleaned by M.D. All authors read and approved the final manuscript.

Table 1
Sizes of our curated data sets and their respective numbers of matched molecular pairs (MMPs), activity cliffs (ACs), half-activity-cliffs (half-ACs) and non-activity-cliffs (non-ACs).