Use case 1—analyzing attributions and developing a bioconcentration factor model
The Bioconcentration Factor (BCF) quantifies a chemical’s potential to accumulate in living organisms, most frequently fish. As such, it is an important characteristic in the environmental risk assessment of chemicals. Zhao et al. [9], a group that includes authors of XSMILES, created a model called xBCF that predicts BCF and provides attributions for SMILES strings.
In summary, xBCF is a deep learning model based on CDDD [19] molecular representations, which use SMILES strings as input. To attribute a token of interest, the XAI method substitutes it with each token in the vocabulary set of the CDDD model. The difference between the prediction for the original SMILES and the average prediction over all substituted SMILES is then taken as the attribution of that token: the sensitivity score. A positive attribution indicates that the predicted BCF value is expected to drop when that token is substituted with another token from the vocabulary.
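The sensitivity score described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [9]: `toy_predict`, `TOY_WEIGHTS`, and the small vocabulary are hypothetical stand-ins for the CDDD-based model and its token set, and we assume that the original token is excluded from the set of substitutes.

```python
from statistics import mean
from typing import Callable, List, Sequence


def substitution_attributions(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    predict: Callable[[Sequence[str]], float],
) -> List[float]:
    """Sensitivity score per token: the original prediction minus the mean
    prediction over all single-token substitutions at that position."""
    original = predict(tokens)
    scores = []
    for i, token in enumerate(tokens):
        substituted = []
        for candidate in vocabulary:
            if candidate == token:
                continue  # assumption: the original token is not a substitute
            variant = list(tokens)
            variant[i] = candidate
            substituted.append(predict(variant))
        scores.append(original - mean(substituted))
    return scores


# Hypothetical stand-in for a real model: sums fixed per-token weights.
TOY_WEIGHTS = {"C": 0.5, "O": -1.0, "Br": 1.0, "(": 0.0, ")": 0.0}


def toy_predict(tokens: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS[t] for t in tokens)


scores = substitution_attributions(
    ["C", "C", "O", "Br"], list(TOY_WEIGHTS), toy_predict
)
```

In this toy setting the last token receives a positive score, because replacing it lowers the prediction on average, while the third token receives a negative score.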
The xBCF model was trained on public BCF data and internal logD data so that it can predict both logBCF and logD simultaneously. LogD represents the distribution coefficient of a chemical between octanol and water, where octanol is often seen as a proxy for organic tissue. This multitasking nature of xBCF was motivated by the high correlation between logBCF and logD. Therefore, when the XAI method is applied to the xBCF model, one obtains explanations for both logBCF and logD predictions, which enable chemists to gain insights into the predictions and the model.
During the development of xBCF, patterns of SMILES non-atom and atom tokens were analyzed for many molecules. Because xBCF depends on the CDDD molecular representations encoded from SMILES strings, non-atom tokens played a key role in the translational autoencoder and the downstream predictive models for BCF and logD.
XSMILES was developed iteratively alongside xBCF and was of great importance for the authors when analyzing results during and after the development process. The model is now deployed in-house, and XSMILES is used to display results to end users through interactive visualization. The XAI Substitution method is open-source and publicly available (see section Availability of data and materials). Detailed explanations of both the model and the XAI method can be found in the original article [9].
Zhao et al. [9] extensively used XSMILES to analyze how their model and XAI methods work. In Fig. 6 we reproduce examples illustrating that the xBCF model is able to recognize symmetry-equivalent functional groups and attribute similar sensitivity scores to equivalent atoms. Although the attributions in these examples are almost perfectly symmetric, it is important to note that this was not always the case. Regardless of the results, XSMILES played a key role in quickly screening molecules, identifying patterns, and creating hypotheses.
Another activity described by the authors is the comparison of logD and logBCF. In Fig. 7 we illustrate one of their examples with high logD (5.5) and low logBCF (0.66) predicted values: spirodiclofen, a molecule known to be readily metabolized. We see that the sensitivity scores for important parts of the molecule have different signs, which means that logD cannot explain the low BCF value.
Another aspect that helped the development of xBCF was that the authors could output a JSON file, quickly share it with colleagues, and visualize the results without setting up any coding environment. The file with the molecules of this use case is available at our git repository and can be visualized with the XSMILES demonstration website.
Being able to use XSMILES from within JupyterLab notebooks also helped them to quickly test and re-render visualizations when they changed the parameters used to train the models, developed the Substitution method, or adapted the visualization to better highlight patterns in the attributions.
In this use case, we described how XSMILES assisted the development of the xBCF model and how it is being used by end users. The importance of XSMILES was highlighted through examples of analyses that helped the xBCF authors to develop the model and the Substitution XAI method, both of which are based on SMILES strings.
Use case 2—analyzing logP attributions against Crippen logP atomic contributions
Rasmussen et al. [8] studied the original and transformed logP Crippen contributions as a potential ground truth for attributions calculated with the “atom attribution from fingerprints” method developed by Riniker and Landrum [20] (referred to in this text as R&L). They compared the overlap of heatmaps between this attribution method and the original (atom-based) and adapted (fragment-based) logP atomic contributions. Throughout their analysis, they visually compared contributions with attributions, highlighting molecules with high and low heatmap overlap.
Here, we explore this idea of comparing logP contributions with attributions, but with three different XAI methods. We visually compare the original logP atomic contributions calculated with RDKit against the R&L attributions and attributions from two additional approaches: one based on the SMILES token-substitution method [9] described in Use case 1, and one based on Morgan fingerprints [21] and SHAP [10, 22, 23] values. A JupyterLab notebook with all methods is available (see section Availability of data and materials).
To calculate attributions, we combined the three attribution methods with two different CatBoost [24] (catboost 1.0.5, iterations=10000, depth=6) regressors, for a total of three different setups:
- CDDD-Substitution) a model trained with CDDD [19] molecular representations with attributions calculated using the Substitution method [9],
- Morgan-SHAP) a model trained with Morgan fingerprint bits (radius 1) with attributions calculated through the SHAP method [22], and
- Morgan-R&L) the same fingerprint-based model as the latter, but with attributions calculated using Riniker and Landrum’s method [20].
Overall, the two CatBoost regression models achieved a good coefficient of determination (above 0.9) and root mean squared error (below 0.19). More details about the models that we tested and their performance are given in Additional file 1.
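For reference, the two reported metrics can be computed as in the following sketch. The input values are made up for illustration and are not taken from our models; the function name `r2_and_rmse` is likewise only illustrative.

```python
from math import sqrt
from typing import Sequence, Tuple


def r2_and_rmse(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[float, float]:
    """Return (coefficient of determination, root mean squared error)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, sqrt(ss_res / n)


# Made-up values, for illustration only.
r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9])
```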
We analyzed the attributions from the CDDD-Substitution, Morgan-SHAP, and Morgan-R&L methods. Note that there are significant differences among the compared methods regarding XAI techniques (R&L, SHAP, Substitution), molecular representation (Morgan, CDDD), and predictive performance. This use case shows how we can explore their calculated attributions with XSMILES to generate hypotheses and inspire new ideas.
To visualize dozens of molecules, we generated JSON files describing their calculated attributions. These datasets were then visualized using the demonstration website available at the project’s main repository. The website provides the user the capability of quickly visualizing sets of molecules and their attributions, and of changing XSMILES’ parameters, such as color palette, color domain, and thresholds. Here we focus on one molecule that we found to be quantitatively and qualitatively very interesting.
Figure 8 shows diagrams where the color domain was defined for each molecule based on its maximum absolute score. With a threshold of 0.75, the diagram highlights the most influential atoms with white circles and darker colors, and displays a horizontal line that helps to identify tokens exceeding the threshold. The logP contribution (A) of the bromine (last token of the SMILES string) is highly positive. While CDDD-Substitution (B) and Morgan-SHAP (C) identified the same bromine as the most influential atom, Morgan-R&L (D) attributed the highest values to carbons. An explanation for this difference could be the molecular representation, which is not atom-based but fragment-based, as made clear by Rasmussen et al. [8]. However, Morgan-SHAP uses the same molecular representation and spotlighted the bromine similarly to the contributions.
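The scaling and thresholding used for these diagrams can be illustrated with a short sketch. It is not the XSMILES code: `highlight_tokens` is a hypothetical helper, the scores are made up, and we assume the threshold is applied to scores scaled by the molecule's maximum absolute score.

```python
from typing import List, Sequence, Tuple


def highlight_tokens(
    tokens: Sequence[str], scores: Sequence[float], threshold: float = 0.75
) -> List[Tuple[str, float, bool]]:
    """Scale scores to [-1, 1] by the maximum absolute score and flag the
    tokens whose scaled magnitude exceeds the threshold."""
    max_abs = max(abs(s) for s in scores)
    if max_abs == 0:
        # All scores are zero: nothing can be highlighted.
        return [(t, 0.0, False) for t in tokens]
    return [
        (t, s / max_abs, abs(s) / max_abs > threshold)
        for t, s in zip(tokens, scores)
    ]


# Made-up scores for a toy token sequence.
result = highlight_tokens(["C", "C", "O", "Br"], [0.2, -0.1, -0.7, 0.8])
flagged = [token for token, _, is_flagged in result if is_flagged]
```

Because the scaling is done per molecule, the same threshold selects the relatively most influential tokens regardless of the absolute magnitude of each method's scores.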
Although the CDDD-Substitution (B) highlighted the bromine atom in Fig. 8, it attributed higher values to the carbons than the ones we find in the contributions vector (A). Moreover, it considers the prediction to be as sensitive to a substitution of non-atom tokens as to a substitution of carbons, in general. This highlights that the CDDD model utilizes the non-atom tokens to correctly represent the molecular structure, as opposed to reading only a linear chain of atoms.
Although quantitatively the same bromine was highlighted by the contributions (A), CDDD-Substitution (B) and Morgan-SHAP (C) in Fig. 8, the story changes if we analyze them qualitatively through the signs of the attributions. In Fig. 9 we see that the contributions for the two oxygen atoms are different: positive for the first oxygen token in the string, and negative for the second. All three methods (B, C, D) attributed the opposite direction to both oxygen atoms. As additional information, all atoms have positive values in the FPA contributions, which also disagrees with the three methods. To ignore the magnitude in this example, we set the color domain to a tiny range, i.e., between −0.00001 and 0.00001. With this approach, all attributions are represented equally in terms of absolute values.
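A sign-only comparison of this kind can also be done programmatically. The following sketch uses made-up score vectors and a hypothetical helper, `sign_disagreements`, that lists the positions where a reference vector and an attribution vector point in opposite directions.

```python
from typing import List, Sequence


def sign(x: float) -> int:
    """Return 1, -1, or 0 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sign_disagreements(
    reference: Sequence[float], attributions: Sequence[float]
) -> List[int]:
    """Indices where the two score vectors have opposite (non-zero) signs."""
    return [
        i
        for i, (r, a) in enumerate(zip(reference, attributions))
        if sign(r) * sign(a) < 0
    ]


# Made-up values: the two vectors disagree in sign at index 2 only.
contributions = [0.3, 0.1, 0.2, -0.4]
attributions = [0.5, 0.2, -0.1, -0.3]
mismatches = sign_disagreements(contributions, attributions)
```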
In this use case, we demonstrated how XSMILES can be used to compare attributions from different methods. We used methods based on atom-attributions only and one method based on SMILES-attributions. Although the models and molecule representations differ drastically, we found many cases in which attributions created by each method are relatively similar to the logP contributions. In other cases, attributions would agree among themselves and disagree with the contributions. The analysis gets complex and XSMILES has helped in the task of identifying patterns and facts that agree and disagree with our beliefs about the methods, models, and molecular representations.