To give an idea of how to utilize CIME, we describe three use cases from authors of this paper, who are data scientists and computational chemists:
- Use case 1: Visualizing attributions to hydration free energy predictions using SHAP values.
- Use case 2: Comparing the attributions of models trained on a lipophilicity dataset.
- Use case 3: Comparing the latent space of a trained model to a fingerprint representation.
Use case 1: visualizing attributions to hydration free energy predictions using SHAP values
In this use case, we explored the predictions of a model that was trained on the hydration free energy of a set of compounds. Hydration energy is one component in the quantitative analysis of solvation: it is the special case in which the solvent is water and describes the amount of energy released when one mole of ions is surrounded by water molecules. If the hydration energy exceeds the lattice energy, the enthalpy of solution is negative (heat is released); otherwise it is positive (heat is absorbed). The more negative the hydration free energy, the more soluble the compound is in water. Hydration free energy is therefore an important physicochemical quantity for assessing properties such as the bioavailability of small molecules.
With the goal of exploring the hydration free energy of compounds, we downloaded the Free Solvation Database (FreeSolv) dataset [50], which has already been used as a benchmark set in the past [51]. Its latest version consists of 642 compounds along with their measured and calculated hydration free energy values. We then trained a CatBoost gradient-boosted tree model [52] in a multi-output regression setting to predict these two variables, using Morgan count fingerprints [44] combined with MACCS keys [53] as features. The model performed well, with an RMSE of 1.03 as estimated by 5-fold nested cross-validation (see Supplementary Material, Additional File 1 for details).
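A minimal sketch of this featurization and model-fitting step is shown below; it is not the authors' exact pipeline. The file name freesolv.csv, the column names smiles and expt, the fingerprint length, and the use of a single regression target (instead of the multi-output setting described above) are all assumptions, and the nested cross-validation is omitted.

```python
# Hedged sketch: featurize FreeSolv SMILES with Morgan count fingerprints plus
# MACCS keys and fit a CatBoost regressor. File and column names are assumed;
# nested cross-validation and the second target are omitted for brevity.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys
from catboost import CatBoostRegressor

N_BITS = 1024  # assumed length of the hashed Morgan count fingerprint

def featurize(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    # Morgan count fingerprint (radius 2), hashed to N_BITS counts
    morgan = AllChem.GetHashedMorganFingerprint(mol, radius=2, nBits=N_BITS)
    counts = np.zeros(N_BITS)
    for bit, count in morgan.GetNonzeroElements().items():
        counts[bit] = count
    # 167-bit MACCS keys appended as binary features
    maccs = np.array([int(b) for b in MACCSkeys.GenMACCSKeys(mol).ToBitString()])
    return np.concatenate([counts, maccs])

df = pd.read_csv("freesolv.csv")                       # assumed local copy of FreeSolv
X = np.stack([featurize(s) for s in df["smiles"]])
y = df["expt"].values                                  # measured hydration free energy (kcal/mol)

model = CatBoostRegressor(loss_function="RMSE", verbose=False)
model.fit(X, y)
```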
Aiming to understand how each atom contributed to the predicted hydration free energy value, we first calculated the tree SHAP (SHapley Additive exPlanations [54, 55]) values for every fingerprint feature. SHAP values are given in the same unit(s) as the target variable(s)—in our case hydration free energy—and indicate by how many units a feature pushed the prediction towards positive or negative values for a given instance.
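Assuming the fitted model and feature matrix X from the previous sketch, this step can be reproduced with the shap library roughly as follows:

```python
# Tree SHAP values for the fitted model: one row per compound, one column per
# fingerprint feature, expressed in the units of the target (kcal/mol for FreeSolv).
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_compounds, n_features)
```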
To analyze the chemical space, we derived a UMAP projection from the rank-based Spearman correlation matrix of the SHAP values of all observations. With this, we grouped the compounds by the similarity of their explanations (Fig. 4), making full use of the multivariate and feature-interaction information, which should be more expressive than simply using Tanimoto similarity based on Morgan and MACCS fingerprints.
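One way to realize this projection step is sketched below; the conversion of correlations to distances (one minus the correlation) and the UMAP settings are assumptions rather than the exact configuration used.

```python
# Pairwise Spearman correlation between the compounds' SHAP vectors, converted to
# a distance matrix and embedded with UMAP using a precomputed metric.
import numpy as np
from scipy.stats import spearmanr
import umap

corr, _ = spearmanr(shap_values, axis=1)  # correlation between rows (compounds)
dist = 1.0 - corr                         # assumed correlation-to-distance conversion
np.fill_diagonal(dist, 0.0)

embedding = umap.UMAP(metric="precomputed", random_state=0).fit_transform(dist)
```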
As we can see in Fig. 4, the projection reveals a few groups. The color encodes the predicted hydration free energy of the trained model and indicates how well the SHAP values segregate the compounds, since the grouping of points matches the color gradient. The projection algorithm placed the compounds with positive predictions mostly in the top-right area. At the bottom right, we found a group of 12 compounds that are similar in terms of structure and explanations, highlighted with a rectangle and detailed on the right side of the figure. The bold stroke represents the maximum common substructure (i.e., the three rings that they have in common).
Furthermore, we used the SHAP values to understand how much each individual atom of a compound increased or decreased the predicted value. To this end, we determined for every non-zero feature the atoms that represent this feature, and then summed all SHAP values for every atom in the compound—these sums are our explanations, indicating how each atom contributed to the prediction. As an example, in Fig. 5 we show four compounds and how their atoms contribute to hydration free energy. For these compounds, the less polar hydrocarbon regions appear in green, whereas polar atoms that can form hydrogen bonds appear in magenta, as we would expect.
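For the Morgan part of the feature vector, this bit-to-atom mapping can be sketched as follows; MACCS keys would be handled analogously via their SMARTS matches, and the exact way a bit's SHAP value is distributed over its environment atoms is an assumption.

```python
# Sketch: map each non-zero Morgan bit back to the atoms of its environment via
# bitInfo and accumulate that bit's SHAP value on those atoms.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import AllChem

def atom_attributions(smiles: str, shap_row, n_bits: int = 1024) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}
    AllChem.GetHashedMorganFingerprint(mol, radius=2, nBits=n_bits, bitInfo=bit_info)
    weights = defaultdict(float)
    for bit, environments in bit_info.items():
        for centre, radius in environments:
            atoms = {centre}
            if radius > 0:
                # bonds within the environment of this bit -> atoms they connect
                for bond_idx in Chem.FindAtomEnvironmentOfRadiusN(mol, radius, centre):
                    bond = mol.GetBondWithIdx(bond_idx)
                    atoms.update({bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()})
            for atom_idx in atoms:
                weights[atom_idx] += shap_row[bit]
    return dict(weights)  # atom index -> summed SHAP contribution
```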
In this use case, we demonstrated how a set of molecules can be explored from the perspective of SHAP values (Task Explore). Exploring the chemical space considering how a model sees the data can help users identify interesting groups of compounds. SHAP-based explanations allowed us to confirm that the model seems to identify which regions of the selected compounds contribute positively and negatively to hydration free energy (Task Understand).
Use case 2: comparing the attributions of models trained on physicochemical properties
Lipophilicity is an important parameter in medicinal chemistry, related to the pharmacokinetic properties of a drug [56]. Therefore, it is of great interest to monitor this property in drug discovery projects. Here, we explore a set of compounds with respect to their lipophilicity and compare two in-house models in terms of their interpretability.
The lipophilicity dataset was taken from the MoleculeNet datasets [57]. Two in-house pre-trained graph convolutional models (see [58] for more details on the training datasets) were used to predict the logD of the compounds in this dataset. Here, logD is the logarithm of the partition coefficient of a compound between octanol and water, taking into account the charge state of the compound at a physiologically relevant pH. The first model is hereafter referred to as the “base model”. The second model, here identified as the “XAI model”, was designed to be more interpretable by adding constraints during training [59]. The dataset of 4200 compounds was uploaded to CIME. It contains the measured lipophilicity, the logD predictions of the two models, the models’ latent space representations, and the atom contributions for both predictions. The Class Activation Maps (CAM) methodology, adapted to graph neural networks [30], was used to obtain the atom contributions for the two models.
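As an illustration of how such per-compound data can be packaged as SD properties with RDKit before upload, consider the sketch below; the property names (logD_base, logD_xai, latent, atom_contributions) are hypothetical placeholders and not CIME's documented input schema.

```python
# Hedged sketch: store predictions, latent vectors and atom contributions as SD
# properties. Property names are placeholders; consult the CIME documentation for
# the keys it actually expects.
import json
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # toy molecule standing in for a dataset compound
records = [(mol, 1.2, 1.1, [0.13, -0.42], [0.05, -0.10, 0.02])]  # toy example row

writer = Chem.SDWriter("lipophilicity_cime.sdf")
for m, pred_base, pred_xai, latent, atom_weights in records:
    m.SetProp("logD_base", str(pred_base))                     # base model prediction
    m.SetProp("logD_xai", str(pred_xai))                       # XAI model prediction
    m.SetProp("latent", ";".join(f"{x:.4f}" for x in latent))  # latent space vector
    m.SetProp("atom_contributions", json.dumps(atom_weights))  # one weight per atom
    writer.write(m)
writer.close()
```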
Once the data had been uploaded, a UMAP projection was calculated based on the explainable model’s latent space representations. We then proceeded to explore different groups, the predictions obtained by the models and the related explanations. Here we present our findings related to one specific group that contains 26 compounds with high structural similarity (see Supplementary Material, Additional File 1 for a detailed view of the group and projection).
Using CIME’s “Table View”, we display in Fig. 6 an overview of the measured and predicted logD values and the absolute errors of each model for the entire dataset (a) and the selected group (b). We observe that for some compounds the predictions (of one or both models) are accurate, with errors below 0.5 log units, while for others the predictions deviate more strongly (errors above 0.5 log units)—see Supplementary Material, Additional File 1.
Figure 7 shows attributions from both models for a subset of accurately predicted compounds in the selected group. Note that magenta atom contributions mark sites that push the prediction towards lower logD values (i.e., less lipophilic), whereas green contributions mark sites that push the prediction towards higher logD values (i.e., more lipophilic). We observe that the attributions produced by the base model are uniformly green for all compounds, which is not useful to a chemist trying to find optimal positions for modifications. This is the case for all compounds of the cluster, not only for those shown in Fig. 7. In contrast, the atom contributions of the XAI model are more diverse and sparse: there are atom contributions labeled as (i) increasing lipophilicity, (ii) decreasing lipophilicity, and (iii) largely irrelevant to the prediction.
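Outside of CIME, such per-atom weights can be rendered, for example, with RDKit's similarity-map code; the molecule and weights below are made up for illustration.

```python
# Illustrative stand-in for CIME's attribution rendering: draw per-atom weights as
# a 2D heat map with RDKit (one weight per heavy atom, sign encodes the direction).
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D, SimilarityMaps

mol = Chem.MolFromSmiles("CCOC(=O)c1ccccc1")  # hypothetical example compound
weights = [0.10, 0.20, -0.30, -0.50, -0.40, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10]

drawer = rdMolDraw2D.MolDraw2DCairo(400, 400)
SimilarityMaps.GetSimilarityMapFromWeights(mol, weights, draw2d=drawer)
drawer.FinishDrawing()
with open("attributions.png", "wb") as f:
    f.write(drawer.GetDrawingText())
```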
Both models give similar predictions. In four out of six cases, the XAI model attributes lower lipophilicity to the ester group. Similarly, the heteroatoms in the three rings of the scaffold are often marked as lowering the lipophilicity, or are at least excluded from the green highlights. Both observations accord with a medicinal chemist’s intuition. Nevertheless, the attributions are far from perfect, especially from a stability point of view: some very similar compounds receive different attributions from the XAI model (for example, molecules 239 and 621 differ only by one methyl group but have very different explanations).
This use case demonstrated how CIME can be used to compare attributions from two models (Task Compare) through the exploration of a test dataset (Task Explore), and might increase user trust in predictions made by an interpretable model. A similar workflow could be used to compare two (or more) attribution methods for a single model, or to compare one attribution method against ground-truth attributions in cases where ground-truth explanations are known.
Use case 3: comparing the latent space of a trained model to a fingerprint representation
Protein kinases feature prominently in the human genome [60], and kinase inhibitors are of particular interest in drug discovery [61]. Recently, Sydow et al. [62] have developed a fragment-library approach to generating novel kinase inhibitors. In this approach, known kinase inhibitors are split into smaller molecular fragments, and those fragments are then virtually recombined. While theoretically the number of potential new kinase inhibitors is limited only by the number of possible fragment combinations, in practice some of these “recombined” compounds will be more desirable than others, for instance, because of their physicochemical properties or synthetic feasibility. It is thus of interest to explore the large set of virtually generated candidates to find subsets of promising candidate kinase inhibitors.
Extended connectivity fingerprints (ECFPs) [63] are commonly used descriptors in ligand-based virtual screening. However, ECFPs encode only structural information. More abstract encodings related to the prediction of physicochemical properties can be captured by latent space representations generated by deep learning models (i.e., using latent space representations instead of fingerprints to generate a projection). In this use case, we used the same in-house pre-trained explainable model as in Use Case 2 to generate the learned embeddings for the compounds and fragments in the kinase dataset.
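For reference, the fingerprint side of this comparison can be computed with RDKit as sketched below; the latent vectors would come from the pre-trained model's encoder, which is not reproduced here.

```python
# ECFP4 corresponds to a Morgan fingerprint with radius 2, here folded to 2048 bits,
# as used for the fingerprint-based projection.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp4(smiles: str, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array([int(b) for b in fp.ToBitString()])
```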
In Fig. 8, we illustrate the representation of the fragments for both the latent space of a deep learning model (left) and the ECFP4 fingerprint (right). We highlight and color only the fragments known to bind to the FP (front pocket) subpocket. Regarding the positioning of the fragments, the visualizations suggest that the latent space produces a smoother representation than the ECFP4 fingerprint space. This makes intuitive sense, since ECFP4 is a 2048-dimensional bitwise fingerprint based solely on structural features, whereas the deep learning representation is a 256-dimensional continuous vector. In the left part of Fig. 8, we colored the fragments by their predicted solubility and see that most of them are predicted to be soluble (i.e., they are colored between yellow and green). The fact that the analyzed “front pocket” fragments generally have higher predicted solubility is congruent with the chemical rationalizations given in [62]. Since the ECFP4 fingerprint is not by itself predictive, in Fig. 8 (right) we only highlight whether a fragment binds to the front pocket or not.
Sydow et al. [62] provided a recombined ligand library of over 6 million potential kinase inhibitors, helpfully scoring the ligands based on their closest chemical similarity to compounds found in the ChEMBL database [64, 65], as measured by the Tanimoto similarity. By using this information, we can quickly identify regions in a projection where the recombined compounds are similar to known molecules.
We therefore projected the recombined ligands based on the latent space of the deep learning model, as was done for the fragments in Fig. 8 (left). We used only ligands with a Tanimoto similarity greater than 0.8 to at least one ligand in ChEMBL. Then, we colored the compounds according to their similarity to known ligands in ChEMBL (Fig. 9). This view of the recombined ligand space allows focusing on specific regions that are densely populated with compounds highly similar to existing ones. The selected region is enlarged for a closer view, and several relevant chemical structures are revealed. We speculate that compounds that differ from the known ChEMBL molecules (“Distant ligands” in Fig. 9) but are positioned closer to more ChEMBL-similar molecules in the fingerprint space are more likely to represent promising ligands than recombined molecules in dark blue regions (where none of their neighbors is close to a known molecule).
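The similarity scores used here were precomputed by Sydow et al. [62]; a hedged re-implementation of such a pre-filter could look like the sketch below, in which the SMILES lists are toy placeholders.

```python
# Sketch of the Tanimoto pre-filter: keep only recombined ligands whose maximum
# ECFP4 Tanimoto similarity to any ChEMBL ligand exceeds 0.8. SMILES lists are toy data.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4_fp(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

chembl_smiles = ["c1ccccc1O", "CCN(CC)CC"]        # placeholder ChEMBL ligands
recombined_smiles = ["c1ccccc1OC", "CCCCCCCC"]    # placeholder recombined ligands

chembl_fps = [ecfp4_fp(s) for s in chembl_smiles]

def max_tanimoto(smiles: str) -> float:
    return max(DataStructs.BulkTanimotoSimilarity(ecfp4_fp(smiles), chembl_fps))

close_ligands = [s for s in recombined_smiles if max_tanimoto(s) > 0.8]
```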
This use case demonstrated how CIME can be utilized to explore a chemical space and to compare molecular representations for a set of labeled compounds (Task Explore). By using an approach based on exploring two types of similarities, we showed how CIME can be used to select smaller sets of pertinent candidate compounds from a large chemical space.
Performance
We conducted structured benchmarks on two different machines by gradually increasing (i) the number of compounds in the dataset and (ii) the number of features used for projection (i.e., fingerprints). A summary of the benchmark is visualized in Fig. 10. We provide a detailed description of the CIME benchmark in the Supplementary Material, Additional File 1.
Overall, CIME dealt well with datasets of up to 20,000 compounds and 1,000 fingerprints. Beyond these thresholds, we experienced longer loading times (i.e., five minutes or more). The results are better if fingerprints are not handled by the system; that is, if the projection is precalculated and stored in the SDF. Not having fingerprints uploaded or computed by CIME resulted in a considerable drop in memory usage in both the back-end and the front-end. To simulate this scenario, we benchmarked datasets of up to 100,000 compounds with only one fingerprint; CIME generally handled these datasets well, with only LineUp's initial loading being slow (5-20 seconds) when more than 60,000 compounds were used.
Future work
Currently, the tool does not allow direct comparison of different projected spaces: users see only one projection at a time. However, we are working on a feature that allows displaying two projections next to each other for better comparison of representations.
Another limitation of the tool is its inability to save its current state, which means that users must show their live analyses directly to collaborators or take screenshots to document the results. We are working on a solution that simplifies collaboration between users on different devices and enables users to store their analysis and continue it at a later point.
CIME enables users to select compounds and display each compound structure overlaid with attributions. Although CIME allows users to show structure-based aggregations of selected compounds using MCS, it is not possible to display aggregations of attributions of a list of compounds. We are not aware of existing visualization techniques that are capable of displaying multiple weights (attributions) per atom effectively.
Regarding the visual representation of compounds, users can neither interact with the compounds nor check the numerical values of atom contributions. However, we plan to adapt a JavaScript library for drawing the compounds in the front-end and make them interactive.
Currently, only one algorithm is available for projection and one for clustering data—UMAP and HDBSCAN, respectively. Users can alternatively include precalculated projections and cluster affiliations in the SDF file. CIME can also be extended programmatically by users to include additional projection methods. As part of future work, we plan to provide more projection and clustering algorithms directly within the tool. However, not every library can be integrated into CIME’s official repository due to licensing restrictions.