ChemInformatics Model Explorer (CIME): exploratory analysis of chemical model explanations

The introduction of machine learning to small molecule research, an inherently multidisciplinary field in which chemists and data scientists combine their expertise and collaborate, has been vital to making screening processes more efficient. In recent years, numerous models that predict pharmacokinetic properties or bioactivity have been published, and these are used on a daily basis by chemists to make decisions and prioritize ideas. The emerging field of explainable artificial intelligence is opening up new possibilities for understanding the reasoning that underlies a model. In small molecule research, this means relating the contributions of substructures of a compound to its predicted properties, which in turn allows the regions of the compound with the greatest influence on the outcome to be identified. However, there is no interactive visualization tool that facilitates such interdisciplinary collaborations towards interpretability of machine learning models for small molecules. To fill this gap, we present CIME (ChemInformatics Model Explorer), an interactive web-based system that allows users to inspect chemical data sets, visualize model explanations, compare interpretability techniques, and explore subgroups of compounds. The tool is model-agnostic and can be run on a server or a workstation.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-022-00600-z.


Memory usage and response time
Although CIME was designed to handle up to 20,000 compounds and around 1000 fingerprints, users can upload larger datasets. The projection algorithm, which runs in the browser and operates on the fingerprints, is CPU-intensive, whereas memory is the bottleneck in the back-end due to the storage and processing of SDF files. Depending on where the back- and front-end run, the available computational resources define the limits of the tool.
To understand CIME's limits, we ran a benchmark on two computers, both laptops, with the specifications listed in Table 1. For the benchmark, we created larger datasets by replicating compounds up to 200,000 entries and uploaded them to the system. We also varied the number of fingerprints, testing 1, 500, and 1000 fingerprints. The number of explainability, property, and prediction columns was fixed at 25. We ran the back-end using Docker 20.10.7 (docker.com) and accessed the front-end with the web browser Google Chrome 95 (google.com/chrome).
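The replication step can be sketched as follows. The authors' exact benchmark script is not given, so the function and variable names here are illustrative.

```python
def replicate(compounds, target_size):
    """Repeat a list of compounds until it reaches target_size entries.

    Illustrative sketch of how the benchmark datasets could be built by
    replicating compounds; the exact script used for the benchmark is
    not given in the text.
    """
    repeats = -(-target_size // len(compounds))  # ceiling division
    return (compounds * repeats)[:target_size]

# e.g. grow a 1,000-compound dataset to 200,000 entries
benchmark = replicate(list(range(1_000)), 200_000)
```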
The benchmark consisted of uploading a dataset, loading all compounds into LineUp, executing the UMAP projection using fingerprints, and interacting with the graphical user interface to check whether it remained responsive, using CIME version 0.1.19. We restarted the Docker container whenever a new dataset was uploaded. Table 2 reports the relation between dataset dimensions and the following columns: (Upload) the time the back-end took to process the upload of a dataset; (RAM) the amount of memory used in the back-end just after the upload completed, in MiB; (LU) whether or not LineUp was able to load all items of the dataset; (UMAP) the time to compute the projection in minutes:seconds (MM:SS). We highlight in gray the upper limits of our tests that worked on both computers. Asterisks (*) indicate when CIME stopped working.

Table 2 shows that in many cases the upload concluded successfully but the UMAP projection could not be completed. This might be connected to the fact that we ran both front- and back-end on the same computer: the back-end uses a large amount of memory, limiting the resources available for the browser, where LineUp is populated with compounds and the projection is computed. The browser uses memory (a) to store the fingerprints and to compute and store a distance matrix for the projection of size N×N, where N is the number of compounds; and (b) to store the compounds and their properties in another matrix in the LineUp object. Therefore, we expect that the limits of CIME on computers A and B would increase if the back-end ran on another computer with enough RAM, i.e., 2 GB available for a dataset with 100,000 compounds and 1 fingerprint, or 10 GB for a dataset with 100,000 compounds and 500 fingerprints or 60,000 compounds and 1000 fingerprints. Overall, CIME easily handled datasets with up to 20,000 compounds. For a higher number of compounds, we recommend pre-calculating the projection and adding only one fingerprint to the SDF file.
This will drastically reduce memory consumption and improve response time in the browser. In the current version, adding a single fingerprint to the SDF makes CIME skip the calculation of fingerprints (the default behavior when CIME does not find fingerprints in the dataset). Users must therefore add a 1-bit fingerprint to their SDF and provide the x,y coordinates. Future versions of CIME should not require the 1-bit fingerprint and should calculate fingerprints on request.
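As a sketch of what adding the 1-bit fingerprint and x,y coordinates to an SDF record amounts to, the helper below appends SDF data items to a single molecule record. The property names (`fingerprint`, `x`, `y`) are assumptions for illustration and should be checked against the CIME documentation.

```python
def add_sdf_properties(record: str, props: dict) -> str:
    """Append SDF data items ('> <name>' followed by a value line) to a
    single molecule record, just before the '$$$$' terminator.

    Hypothetical helper: the exact property names CIME expects for the
    1-bit fingerprint and the x,y coordinates are assumptions here.
    """
    body = record.rstrip()
    if body.endswith("$$$$"):
        body = body[: -len("$$$$")].rstrip()
    items = "".join(f"> <{name}>\n{value}\n\n" for name, value in props.items())
    return f"{body}\n{items}$$$$\n"

record = "benzene\n  (molblock omitted)\nM  END\n$$$$\n"
annotated = add_sdf_properties(record, {"fingerprint": 1, "x": 0.12, "y": -3.40})
```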
We are changing CIME's architecture and expect to release an improved version of the system by mid-2022. Please check the updated documentation in the Git repository: github.com/jku-vdslab/cime.

Use cases
In this section, we provide supplementary information for the use cases:
• Use case 1: Visualizing attributions to hydration energy predictions using SHAP values.
• Use case 2: Comparing the attributions of models trained on a lipophilicity dataset.
Use case 1: Visualizing attributions to hydration energy predictions using SHAP values

ML modeling
A nested 5-fold cross-validation was performed to estimate the error and optimize hyperparameters. We obtained an RMSE of 1.03 on the predicted experimental values. To define the folds of the cross-validation such that very similar compounds are placed in the same fold, we grouped the compounds using hierarchical clustering (average linkage). Groups of compounds with a Tanimoto similarity of at least 0.75 were placed in the same fold. These folds define the outer loop of the nested cross-validation. For each of the outer folds, an inner loop of cross-validation was used to optimize the CatBoost hyperparameters using Optuna [1].
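The fold construction described above can be sketched with SciPy's hierarchical clustering. This is our reconstruction from the description in the text, not the authors' code; in particular, the greedy assignment of whole clusters to folds is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_folds(tanimoto_sim: np.ndarray, n_folds: int = 5,
                  sim_threshold: float = 0.75) -> np.ndarray:
    """Assign compounds to cross-validation folds so that compounds with
    Tanimoto similarity >= sim_threshold end up in the same fold, using
    average-linkage hierarchical clustering on Tanimoto distance.

    Sketch reconstructed from the text; the balancing of clusters over
    folds is an assumption.
    """
    distances = squareform(1.0 - tanimoto_sim, checks=False)
    tree = linkage(distances, method="average")
    clusters = fcluster(tree, t=1.0 - sim_threshold, criterion="distance")

    folds = np.empty(len(tanimoto_sim), dtype=int)
    fold_sizes = np.zeros(n_folds, dtype=int)
    # Greedily place each whole cluster into the currently smallest fold,
    # so similar compounds never straddle a fold boundary.
    for cluster_id in np.unique(clusters):
        members = np.where(clusters == cluster_id)[0]
        target = int(fold_sizes.argmin())
        folds[members] = target
        fold_sizes[target] += len(members)
    return folds
```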
Use case 2: Comparing the attributions of models trained on a lipophilicity dataset

Multidimensional projection
The projection was calculated using UMAP and the "fingerprint-XAI" properties of the compounds, which are shown in the Projection Menu when using CIME with the provided dataset (see Figure 1).
Absolute error
To identify areas of the chemical space where both models performed well, we calculated for each compound the absolute error [2] of the predictions from the base and XAI models against the measured LogD. Each compound thus has two associated errors: a base error and an XAI error. In Figure 1, we show the compounds colored by the mean of their base and XAI errors, an approach we used to identify a cluster with low error in both models. The projection shows that most compounds have errors much closer to 0 than to 4 (see color legend). Very few (dark) regions reveal extensive disagreement between the predictions and the experimentally measured LogD values.
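The per-compound error aggregation is straightforward; a minimal NumPy sketch (with illustrative names) is:

```python
import numpy as np

def per_compound_errors(measured, pred_base, pred_xai):
    """Absolute errors of both models against the measured LogD, plus
    their per-compound mean (the value used to color the projection).
    Variable names are illustrative.
    """
    measured = np.asarray(measured, dtype=float)
    base_error = np.abs(measured - np.asarray(pred_base, dtype=float))
    xai_error = np.abs(measured - np.asarray(pred_xai, dtype=float))
    return base_error, xai_error, (base_error + xai_error) / 2.0
```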

Histograms of predictions and measured LogD
In Figure 3, the histograms displayed at the top summarize the models' predictions and performance for the entire dataset. The histograms in red (base and XAI error) and brown (mean of base and XAI errors) confirm that the errors from both models are concentrated near 0, i.e., the error distributions are right-skewed. The histograms in blue (measured and predicted LogD) show that the predictions' distributions are similar to each other (2nd and 3rd columns) and slightly different from the measured experimental LogD values (1st column), with the predicted values closer to normal distributions.
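Error distributions with their mass near 0 and a tail toward larger values have positive (right) skew, which is what the red and brown histograms show. A quick numerical check on synthetic, error-like data (not the paper's dataset):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic error-like values: mass near 0, long tail to the right.
errors = np.abs(rng.normal(loc=0.0, scale=1.0, size=10_000))
sample_skew = skew(errors)  # positive skew => right-skewed distribution
```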
Group of compounds
In this use case, we selected a few compounds to compare the explanations extracted from two models. These compounds were filtered from a group of similar structures detailed in Figures 2 and 3. The "Table View" shows the ID, measured LogD, predicted LogD (base and XAI models), absolute errors (base and XAI), and mean of the errors (base and XAI) of each compound in the studied group (bar charts). Compounds are sorted by mean error. Histograms represent the entire dataset, and box plots represent all compounds that are not in the studied group.