Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study

Deep generative models have shown the ability to devise both valid and novel chemistry, which could significantly accelerate the identification of bioactive compounds. Many current models, however, use molecular descriptors or ligand-based predictive methods to guide molecule generation towards a desirable property space. This restricts their application to relatively data-rich targets, neglecting those where too little data is available to train a predictor. Moreover, ligand-based approaches often bias molecule generation towards previously established chemical space, thereby limiting their ability to identify truly novel chemotypes. In this work, we assess the use of molecular docking via Glide, a structure-based approach, as a scoring function to guide the deep generative model REINVENT, and compare model performance and behaviour to those obtained with a ligand-based scoring function. Additionally, we modify the previously published MOSES benchmarking dataset to remove any induced bias towards non-protonatable groups. We also propose a new metric to measure dataset diversity, which is less confounded by the distribution of heavy atom count than the commonly used internal diversity metric. We found that when optimizing the docking score against DRD2, the model improves predicted ligand affinity beyond that of known DRD2 active molecules. In addition, generated molecules occupy complementary chemical and physicochemical space compared to the ligand-based approach, and novel physicochemical space compared to known DRD2 active molecules. Furthermore, the structure-based approach learns to generate molecules that satisfy crucial residue interactions, which is information only available when taking protein structure into account.
Overall, this work demonstrates the advantage of using molecular docking to guide de novo molecule generation over ligand-based predictors with respect to predicted affinity, novelty, and the ability to identify key interactions between ligand and protein target. Practically, this approach has applications in early hit generation campaigns to enrich a virtual library towards a particular target, and also in novelty-focused projects, where de novo molecule generation either has no prior ligand knowledge available or should not be biased by it.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13321-021-00516-0.


Model performance metrics
The following metrics were used to assess model performance (unless otherwise stated, RDKit was used to canonicalize SMILES):
• Validity is the fraction of SMILES strings that RDKit [1] can parse; this indicates whether a SMILES string translates to a real structure.
• Uniqueness is the fraction of unique molecules, where non-unique molecules are defined as having canonical SMILES that match those previously sampled or in the same batch. Low uniqueness is indicative of a poorly behaving model that is 'stuck' in a particular region of chemical space.
• Novelty is the fraction of valid, unique canonical SMILES not present in the training dataset (ZINC subset). Low novelty indicates that the model cannot generalize beyond the training data, whereas generalizing beyond it is precisely the aim of de novo design.
• Filters is the fraction of valid, unique molecules that pass the filters applied to the training dataset, as implemented in the original publication [2] (i.e., not allowing charged molecules).
• Internal diversity (IntDiv1) is one minus the average pairwise Tanimoto similarity (or Jaccard index) of all molecules. More specifically, the MOSES implementation [2] calculates the Tanimoto similarity of Morgan fingerprints (radius=2, nBits=1024) using RDKit [1]. IntDiv2 is one minus the square root of the average pairwise squared Tanimoto similarity [2]. Low internal diversity indicates that a model samples from a very narrow range of chemical space.
• Fréchet ChemNet Distance (FCD) [3] was used to enable comparison with previous studies. It compares the mean and covariance of the penultimate-layer activations of ChemNet [4] for two datasets. This provides a measure of distance between two datasets and has been shown to take into account differences in predicted properties related to internal diversity, 'drug-likeness', logP and synthetic accessibility proxies [3].
• Single nearest neighbour similarity (SNN) is the average maximum Tanimoto similarity of a dataset to a reference dataset. More specifically, the MOSES implementation [2] calculates the Tanimoto similarity of Morgan fingerprints (radius=2, nBits=1024) using RDKit [1]. This provides a measure of how close, on average, the most similar molecules are between datasets.
• Fragment similarity (Frag) is the cosine distance between the frequency of substructures in two datasets as enumerated using BRICS fragmentation [5] in RDKit [1].
• Scaffold similarity (Scaff) is the cosine distance between the frequency of Bemis-Murcko scaffolds [6] in two datasets as implemented in RDKit [1]. This provides a measure of scaffold distribution similarity between two datasets.
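The set- and similarity-based metrics in the list above can be illustrated with a minimal sketch. This is not the MOSES or RDKit code: it assumes that SMILES have already been canonicalized and that each fingerprint is represented as a plain Python set of on-bit indices. The function names are hypothetical. Averaging over all ordered pairs (self-pairs included) and returning 0.0 for two empty fingerprints are assumptions of this sketch.

```python
from itertools import product


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


def uniqueness(canonical_smiles):
    """Fraction of sampled canonical SMILES that are distinct."""
    return len(set(canonical_smiles)) / len(canonical_smiles)


def novelty(canonical_smiles, training_smiles):
    """Fraction of distinct canonical SMILES that are absent from the training set."""
    unique = set(canonical_smiles)
    return len(unique - set(training_smiles)) / len(unique)


def internal_diversity(fps, p=1):
    """One minus the p-th root of the mean p-th power of pairwise similarities.

    Assumption: the mean runs over all ordered pairs, self-pairs included.
    """
    sims = [tanimoto(a, b) ** p for a, b in product(fps, repeat=2)]
    return 1.0 - (sum(sims) / len(sims)) ** (1.0 / p)


def snn(generated_fps, reference_fps):
    """Mean, over generated molecules, of the highest similarity to any reference molecule."""
    best = [max(tanimoto(g, r) for r in reference_fps) for g in generated_fps]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Toy fingerprints (hypothetical bit indices, not real Morgan fingerprints).
    toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    print("IntDiv1:", internal_diversity(toy_fps, p=1))
    print("IntDiv2:", internal_diversity(toy_fps, p=2))
    print("SNN:", snn(toy_fps, [{2, 3, 4}, {1, 2}]))
```

In practice the fingerprints would come from RDKit Morgan fingerprints (radius=2, nBits=1024) as described above; the set representation is used here only to keep the example self-contained.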
In addition to the above metrics, we extend the performance metrics to include:
• Scaffold diversity (ScaffDiv) is identical to the internal diversity but is calculated on the Morgan fingerprints (radius=2, nBits=1024) of the Bemis-Murcko scaffolds [6] using RDKit [1]. This allows further interpretation as to whether the model is generating similar scaffolds.

Table S1. Basic generative model metrics of the Prior, Glide-Agent (@2000 steps) and SVM-Agent (@500 steps).

Table S2. Diversity metrics of the Prior, Glide-Agent (@2000 steps) and SVM-Agent (@500 steps).

Table S3. Similarity metrics of the Prior, Glide-Agent (@2000 steps) and SVM-Agent (@500 steps) to training and held out test data.

Figure S4. The cumulative number of molecular fingerprint analogues to known DRD2 active compounds (a) and number of known DRD2 active molecules with analogues (b) generated during training. The SVM-Agent generates more analogues to known DRD2 active molecules, although the Glide-Agent generates analogues to more known DRD2 active molecules.

(coloured, unshaded). Of note, the D114 3x32 HB-Acceptor (Charged) interaction is associated with better docking scores than Charged Residue interaction.