Identifying a drug candidate is a long and complex task. It requires finding a compound that is active on the therapeutic target of interest, and that also satisfies multiple criteria related to safety and pharmacokinetics. To speed up this difficult search, de novo drug design aims at finding promising novel chemical structures in silico. The two principal use cases of generative de novo drug design [1] are distribution learning, where the goal is to design new molecules that resemble an existing set of molecules, and goal-directed generation, where a generative model designs small molecules that maximize a given scoring function. The scoring function takes a molecular structure as input and, by means of in silico computations, returns a score that reflects the suitability of the molecular structure in a drug discovery setting. The scoring function is usually a combination of predicted biological and physico-chemical properties. These predictions are often computed by machine learning models [2,3,4], widely referred to in the literature as QSPR models (Quantitative Structure-Property Relationships). While such models have shown impressive predictive accuracy on many drug-discovery tasks [5], their performance deteriorates outside of their validity domain [6] and they can easily be fooled [7]. Novel molecules with high scores can be designed with deep generative models [2] coupled with reinforcement learning or other optimization techniques [8, 9]. Classical optimization methods, such as genetic algorithms [10, 11], have also shown good performance in goal-directed generation [1].
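To make this concrete, the sketch below shows what such a composite scoring function could look like in practice. It is a minimal illustration, not the scoring function of [12]: the Morgan fingerprint featurization, the QED drug-likeness term and the 50/50 weighting are all assumptions made for the example.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, QED

def scoring_function(smiles, bioactivity_model, weight=0.5):
    """Toy multi-property score: weighted mix of a predicted bioactivity
    (the confidence of a fitted QSPR classifier exposing predict_proba)
    and a computed drug-likeness term (QED). The equal weighting is an
    arbitrary choice for illustration."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable structure gets the worst score
        return 0.0
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    x = np.array(fp).reshape(1, -1)
    p_active = bioactivity_model.predict_proba(x)[0, 1]
    return weight * p_active + (1 - weight) * QED.qed(mol)
```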
In a recent study, Renz et al. [12] identified failure modes of goal-directed generation guided by machine learning models. As they highlight in their work, goal-directed generation can optimize molecules with respect to scoring functions in unintended ways. Machine learning models are not oracles: many factors can lead them to make erroneous predictions, such as distribution shift at test time [13], inherent limitations of the model, or adversarial examples [14]. Furthermore, condensing every requirement of a drug-discovery project into a single score is not necessarily feasible. Since the ultimate goal of goal-directed generation is to identify promising bioactive molecules for drug discovery, understanding how and why goal-directed generation guided by machine learning can fail is of paramount importance for the adoption and success of these algorithms in drug discovery.
Experimental setup and results of the original study
In their study, Renz et al. design an experiment, outlined in Fig. 1, to assess whether goal-directed generation exploits features that are unique to the predictive model used for optimization. Three datasets extracted from ChEMBL are considered. Each dataset is split into two stratified random sets (Split 1 and Split 2), such that the ratio of actives to inactives is equal in both splits. Three bioactivity models are then built in such a way that they should have roughly equivalent predictive performance: a classifier \(C_{opt}\) trained on Split 1, which takes as input a molecule x and returns the confidence score \(S_{opt}(x)\), called the optimization score; a classifier \(C_{mc}\) trained on the same split with a different random seed, which yields the model control score \(S_{mc}\); and a classifier \(C_{dc}\) trained on Split 2, which yields the data control score \(S_{dc}\). All three classifiers are Random Forest models [15] that share the same architecture (see “Methods” section) and differ only by the random seed they are initialized with (\(C_{opt}\) versus \(C_{mc}\)) or the sample from the data distribution they are trained on (\(C_{opt}\) versus \(C_{dc}\)). The scores lie between 0 and 1 and correspond to the confidence score returned by Random Forest classification models, given by the fraction of trees predicting that a compound is active. The \(S_{opt}\) confidence score is used as a reward function for goal-directed generation, which is performed with three different algorithms: a SMILES-based LSTM [3], a Graph Genetic Algorithm (Graph GA) [10], and Multiple Swarm Optimization (MSO) [8] (see “Methods” section for more details on the goal-directed generation algorithms). As the three bioactivity models are trained to predict the same property on the same data distribution, a practitioner would expect generated molecules with high \(S_{opt}\) to also have high \(S_{mc}\) and \(S_{dc}\): QSAR models built with different random seeds or train/test splits are most of the time treated as interchangeable in practice. In a highly valuable critical analysis, Renz et al. highlighted several issues related to distribution learning and goal-directed generation. For the latter, they observe that while \(S_{opt}\) grows during goal-directed generation, \(S_{mc}\) and \(S_{dc}\) diverge from the optimization score over the course of the optimization, reaching on average lower values than \(S_{opt}\) (see Fig. 2 and “Methods” section for further details) and sometimes even decreasing. These results, which suggest that the molecules produced through goal-directed generation exploit biases unique to the model they are optimized on, have been noted in the literature [16, 17]. They are concerning, as they cast doubt on the viability of generating optimized molecules guided by machine learning models.
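A minimal sketch of this three-classifier setup with scikit-learn is given below. The random placeholder data, hyperparameters and variable names are assumptions made for illustration; the exact configuration is described in the “Methods” section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a ChEMBL dataset:
# X is a binary fingerprint matrix, y a binary activity label.
rng = np.random.default_rng(0)
X = (rng.random((200, 2048)) > 0.9).astype(int)
y = rng.integers(0, 2, 200)

# Stratified random split: same active/inactive ratio in both halves.
X1, X2, y1, y2 = train_test_split(X, y, train_size=0.5, stratify=y, random_state=0)

c_opt = RandomForestClassifier(random_state=0).fit(X1, y1)  # optimization model
c_mc = RandomForestClassifier(random_state=1).fit(X1, y1)   # model control: new seed
c_dc = RandomForestClassifier(random_state=0).fit(X2, y2)   # data control: Split 2

def score(model, x):
    """Confidence that x is active; with fully grown trees this is
    (approximately) the fraction of trees voting 'active'."""
    return model.predict_proba(x.reshape(1, -1))[0, 1]
```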
To avoid the pitfall of designing molecules with high optimization scores and low control scores, Renz et al. suggest stopping goal-directed generation when control scores stop increasing. This requires holding out a significant part of the original dataset to build a data control model, which might not be feasible in low-data regimes and would harm the predictive power of the optimization models used during goal-directed generation.
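One hedged reading of this stopping criterion in code is a simple patience-based check on the control score of each generated batch; the patience and tolerance values below are illustrative assumptions, not values from [12].

```python
def should_stop(control_score_history, patience=5, tol=1e-3):
    """Stop goal-directed generation once the (mean) data control score
    of the generated batch has not improved by more than `tol` over the
    last `patience` iterations."""
    if len(control_score_history) <= patience:
        return False
    best_before = max(control_score_history[:-patience])
    recent_best = max(control_score_history[-patience:])
    return recent_best <= best_before + tol
```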
Interpretation of the initial results
The observed difference between \(S_{opt}\), \(S_{mc}\) and \(S_{dc}\) is explained in [12] by the fact that goal-directed generation algorithms exploit biases in the scoring function, defined as the presence of features that yield a high \(S_{opt}\) but do not generalize to \(S_{mc}\) and \(S_{dc}\). The first case (high \(S_{opt}\), low \(S_{mc}\)) is referred to as exploiting model-specific biases. Indeed, \(S_{opt}\) and \(S_{mc}\) are scores given by two classifiers trained on the same data (only the model differs, through the choice of a different random seed). A molecule with a high \(S_{opt}\) and a low \(S_{mc}\) must therefore exploit features that are specific to the exact model \(C_{opt}\) and not shared by \(C_{mc}\), even though both models were trained on the same data. Conversely, the second case (high \(S_{opt}\), low \(S_{dc}\)) is referred to as exploiting data-specific biases. The authors interpret this as a failure of the goal-directed generation procedure: “there is a mismatch between optimization scores and data control scores, which shows that the optimization procedure suffers from model and/or data specific biases”. This interpretation is further supported by the fact that the difference between optimization and control scores grows over time during goal-directed generation.
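The two failure cases can be summarized as a simple check on the three scores of a generated molecule. The sketch below is purely illustrative, and the “high” and “low” thresholds are arbitrary assumptions rather than values used in [12].

```python
def diagnose_bias(s_opt, s_mc, s_dc, high=0.8, low=0.5):
    """Label a generated molecule according to the two failure cases
    described above; the thresholds are arbitrary examples."""
    flags = []
    if s_opt >= high and s_mc < low:
        flags.append("model-specific bias suspected")  # features unique to C_opt
    if s_opt >= high and s_dc < low:
        flags.append("data-specific bias suspected")   # features unique to Split 1
    return flags or ["scores consistent"]
```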
Formally, we can view a dataset of molecules and associated bioactivities as a sample \((X_i, y_i)_{i \in [\![1;n]\!]} \sim P\), where \(X_i\) denotes a molecule, \(y_i\) its associated bioactivity, and P the distribution of the dataset. Most classifiers return a decision function or a confidence score \(S_{classifier}(x)\), which can be used as a scoring function when searching for novel active compounds with goal-directed generation algorithms: \(S_{classifier}(x)\) is maximized through optimization techniques or reinforcement learning. As the three classifiers \(C_{opt}\), \(C_{mc}\) and \(C_{dc}\) model the same property on the same dataset, Renz et al. expect molecules obtained with goal-directed generation to also be predicted active by the classifiers \(C_{mc}\) and \(C_{dc}\), with control scores similar to their optimization scores. As \(S_{mc}\) and \(S_{dc}\) are significantly lower than \(S_{opt}\), the conclusion reached is that the generated molecules are predicted active by \(C_{opt}\) for the wrong reasons (the exploitation of biases, a behavior already observed in the machine learning literature [18]), since those predictions do not transfer to similar bioactivity models. The intuition behind this conclusion is that features that are true explanatory factors of the output will receive high scores from both the optimization and control models. Therefore, a goal-directed generation algorithm that designs molecules with high optimization scores but low control scores could be exploiting spurious features specific to the optimization model alone, which will not translate when the molecules are tested in a wet-lab experiment [12].
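Schematically, maximizing \(S_{classifier}\) can be written as a generic search loop. The sketch below uses greedy hill-climbing over binary feature vectors as a deliberately crude stand-in for the SMILES LSTM, Graph GA and MSO optimizers actually used; the bit-flip proposal and iteration budget are assumptions made for illustration.

```python
import numpy as np

def hill_climb(score_fn, x0, n_iter=1000, seed=0):
    """Greedily maximize a classifier confidence score over binary
    feature vectors: flip one random bit per step and keep the change
    if the score improves. A toy stand-in for the goal-directed
    generation algorithms discussed in the text."""
    rng = np.random.default_rng(seed)
    x, best = x0.copy(), score_fn(x0)
    for _ in range(n_iter):
        candidate = x.copy()
        candidate[rng.integers(len(candidate))] ^= 1  # random bit flip
        s = score_fn(candidate)
        if s > best:
            x, best = candidate, s
    return x, best

# Hypothetical usage with the classifiers sketched earlier:
# x_best, s_best = hill_climb(lambda v: score(c_opt, v), X1[0])
```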
This conclusion rests on the unproven assumption that, in the original data distribution P, molecules predicted active with high confidence by the optimization model \(C_{opt}\) are also predicted active with high confidence by the control models. This hypothesis might seem reasonable, considering that all three models share the same architecture and are trained to predict the same property on the same data distribution [12]. However, the goal of this work is precisely to test this assumption. Modeling biological properties from chemical structure is a difficult task, especially on the small, noisy and chemically very diverse (see “Methods” section) datasets used in [12]. While the optimization and control models display similar predictive performance (as assessed by the ROC-AUC metric), this does not imply that \(S_{opt}\) will correlate perfectly with the control scores \(S_{mc}\) and \(S_{dc}\). It is therefore necessary to validate this assumption before attributing the failure of goal-directed generation to the algorithms themselves rather than to a difference already present in the data distribution. This requires comparing \(S_{opt}\) with \(S_{mc}\) and \(S_{dc}\) on an independent sample of molecules from the initial data distribution P.
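This check can be expressed directly: score an independent held-out sample with all three models and compare the resulting score distributions. The sketch below is one reasonable way to do so (Spearman correlation and mean absolute difference are our choices for the example, not the statistics used in [12]); X_test is assumed to be drawn from P.

```python
import numpy as np
from scipy.stats import spearmanr

def compare_scores(c_opt, c_mc, c_dc, X_test):
    """Compare optimization and control scores on an independent sample
    from the original distribution P, before attributing any divergence
    to the generation algorithm."""
    s_opt = c_opt.predict_proba(X_test)[:, 1]
    for name, model in [("model control", c_mc), ("data control", c_dc)]:
        s_ctrl = model.predict_proba(X_test)[:, 1]
        rho, _ = spearmanr(s_opt, s_ctrl)
        gap = np.abs(s_opt - s_ctrl).mean()
        print(f"{name}: Spearman rho = {rho:.2f}, "
              f"mean |S_opt - S_ctrl| = {gap:.3f}")
```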
In this work, we show that the difference observed between optimization and control scores on goal-directed generation tasks is not due to a failure of the procedure itself, but can be explained by an initial difference between the scores of the classifiers on the original data distribution. We further show that the optimized population has, in feature space, statistics similar to those of the original dataset, and that the divergence between optimization and control scores is already present in the initial dataset. We adapt the experimental setup of Renz et al. [12] so that it can answer the question of whether goal-directed generation algorithms exploit model- or data-specific biases during optimization. In this adapted setting, we assess whether the difference between optimization and control scores is still observed, and show that it is not: the reported failure of goal-directed generation algorithms disappears. Finally, we highlight that the behavior described in [12] warrants caution when designing predictive models for goal-directed molecular generation.