No single data visualization will satisfy all user requirements independent of the type of data and tasks at hand. Scaffold Hunter thus provides several visualizations, including established techniques from information visualization, as well as visualizations tailored towards the specific requirements of compound analysis. Three novel views are the most prominent recent innovations of Scaffold Hunter: (i) the tree map view, see section “Tree map view”, (ii) the heat map view, see section “Heat map view”, and (iii) the molecule cloud view, see section “Molecule cloud view”. Before we describe them in detail, we give a short summary of Scaffold Hunter’s architecture in section “Architecture” and of the established views in section “Established views”.
Architecture
Scaffold Hunter’s modular architecture comprises three main building blocks that support data integration, automated analysis, and interactive visualization for use in a cyclic knowledge discovery process, see Fig. 1. Support for different data import formats, data transformations, and property calculations is realized using a plugin concept, and thus can easily be extended to new formats and properties. The current version of Scaffold Hunter already supports several built-in property calculators, most notably for structural fingerprints. Datasets can be merged and extended by additional data at a later stage, e.g., when new experimental or public information is available. After import, the data can be processed and analyzed using filtering, clustering, dimensionality reduction, and the scaffold tree approach. The results can then be visualized in any of the interactive views provided by the framework for visual inspection and further selection.
The visualizations can be used in combination, which allows the user to profit from their complementary strengths. Linking via a global selection and filtering mechanism makes it easy to switch between visualizations, and a flexible subset storage mechanism allows the user to organize the compound set and to either compare different datasets using a single visualization or to analyze different aspects of a single dataset with the help of different types of views. All views support a mapping of compound or scaffold properties to visual cues, e.g., color, and the visualizations can be annotated with comments to support collaboration or to persistently store findings generated during the workflow. Detailed information on a scaffold or compound is displayed in tooltip windows that pop up when hovering the mouse over the corresponding element.
Established views
We briefly summarize the established visualizations provided by Scaffold Hunter and refer the reader to our previous publications [3, 5, 6] for further details.
Scaffold tree view
The scaffold tree view allows interactive exploration of the scaffold tree structure described in section “Scaffold based approaches”. Using a classical radial tree layout for the visualization, this view allows the user to gain an overview of the structure classification hierarchy as well as the distribution of structures within the dataset, and serves as a starting point for the search workflow. Figure 2a shows the scaffold tree view with a user-defined sorting of the scaffolds based on a scaffold property, as well as a mapping of further properties to the node background color.
Dendrogram view
The hierarchy resulting from the clustering techniques described in section “Clustering techniques” can be represented by a binary tree and is often visualized as a dendrogram, where the height of each inner node in the tree corresponds to the similarity of the two child clusters, see Fig. 6 for an example. Scaffold Hunter provides a dedicated dendrogram view that realizes an enhanced version of the standard dendrogram visualization, adapted to the requirements of compound analysis. It features a combined dendrogram and spreadsheet configuration that allows a detailed analysis of the clustered molecules within limited screen space. The user can navigate the hierarchy by panning and zooming and can select a level in the cluster hierarchy for highlighting.
Molecular spreadsheet and plot view
Scaffold Hunter also provides a spreadsheet and a plot view to visualize information on the molecules in the database. The spreadsheet view provides detailed molecule information shown in a tabular visualization including molecule properties, as well as their names, SVG-images and user-defined flags. It implements the standard functionalities such as sorting by a user-selected column, reordering of columns, pan and zoom with the minimap, sticky columns, and cell resizing. Annotations such as comments and flags can be changed via the spreadsheet view. The plot view allows the user to visualize relations between molecule properties in 2- and 3-dimensional scatter plots. Beyond the mapping of properties to coordinates, additional properties can be mapped to color or size to allow the user to visualize more than three properties at a time, see Fig. 2b for an example of a color mapping.
Tree map view
The tree map view visualizes the scaffold tree described in section “Scaffold based approaches” in a space-filling manner. It provides an alternative to the scaffold tree view and expands the applicability of Scaffold Hunter to use cases for which this view turned out to be less suitable. In contrast to the scaffold tree view, the tree map view does not employ a classical tree layout, where the relation between nodes is represented by edges. Instead, the scaffold tree structure is illustrated by nested rectangles, and each scaffold in the tree is represented by a rectangle. The molecules associated with the scaffold are drawn within this rectangle as structural formulas and the children in the scaffold tree are represented by nested rectangles, see Fig. 3. This space-filling approach to visualize trees is referred to as tree map and was first proposed in [37]. Compared to the radial layout of the scaffold tree in Fig. 2a, this approach provides a compact representation making optimal use of the available screen space. Moreover, while the classical scaffold tree view focuses on the scaffolds themselves, the tree map view uses the scaffold tree primarily as a hierarchical classification scheme to group molecules by their common scaffolds. As a consequence, the size of an encompassing rectangle directly corresponds to the number of molecules it contains and thus gives an intuitive impression of the fraction of molecules in the dataset that are related to the associated scaffold. As an alternative, it is possible to scale the depiction of molecules relative to their property values. The size of the encompassing rectangles is adjusted accordingly and the screen space occupied by a subtree reflects its relative importance with respect to the property. For very large datasets, the tree map view optionally displays only the leaves of the scaffold tree instead of the molecules, which reduces the size of the representation.
Our visualization uses clearly visible shaded frames with title lines and padding to the neighboring rectangles to highlight the nested structure. Furthermore, this makes the background of all rectangles visible, which is used to encode property values by colors just as for the scaffold tree view. When displaying the molecules of a dataset and mapping a property of the molecules to the background color, the encompassing rectangles can be colored according to the average, minimum or maximum value of either all the contained molecules or just those directly associated with the corresponding scaffold. This gives an easily comprehensible impression of the heterogeneity with respect to the selected property at a glance and indicates how well it aligns with the scaffold based grouping. The tree map view allows zooming and supports panning by grab and drag, consistent with all other views. Furthermore, it is possible to focus an encompassing rectangle by clicking on its title. This triggers an animated zooming operation and supports systematic exploration along branches of the scaffold tree.
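The correspondence between rectangle size and molecule count can be illustrated with a minimal sketch. It uses a slice-and-dice subdivision, which is one simple tree map scheme and not necessarily the layout implemented in Scaffold Hunter. The `Scaffold` class and the `layout` function are hypothetical names introduced only for this example, and frames, titles and padding are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Scaffold:
    """Hypothetical scaffold tree node (not Scaffold Hunter's data model)."""
    name: str
    molecules: int = 0  # molecules directly associated with this scaffold
    children: list = field(default_factory=list)

    def weight(self) -> int:
        """Number of molecules in the whole subtree rooted at this scaffold."""
        return self.molecules + sum(c.weight() for c in self.children)


def layout(node, x, y, w, h, horizontal=True, out=None):
    """Assign each scaffold a rectangle (x, y, width, height).

    A node's rectangle is divided among its own molecules and its children
    in proportion to their weights; the split direction alternates per
    level (slice-and-dice).
    """
    if out is None:
        out = {}
    out[node.name] = (x, y, w, h)
    total = node.weight()
    if total == 0:
        return out
    # The first share of the rectangle is reserved for the node's own molecules.
    offset = node.molecules / total
    for child in node.children:
        share = child.weight() / total
        if horizontal:
            layout(child, x + offset * w, y, share * w, h, False, out)
        else:
            layout(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


root = Scaffold("root", 0, [
    Scaffold("A", 2, [Scaffold("A1", 4), Scaffold("A2", 2)]),
    Scaffold("B", 2),
])
rects = layout(root, 0.0, 0.0, 100.0, 60.0)
```

In this example, scaffold A covers 8 of the 10 molecules and therefore receives 80% of the area of the root rectangle. Scaling by a property value instead of the molecule count would only change how `weight` is computed.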
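The two aggregation scopes for coloring an encompassing rectangle can be sketched as follows. The nested dictionary layout and the function names `subtree_values` and `aggregate` are assumptions made for this illustration and do not reflect Scaffold Hunter's internal API.

```python
def subtree_values(node):
    """Collect the property values of all molecules in the subtree of `node`."""
    values = list(node["values"])
    for child in node.get("children", []):
        values.extend(subtree_values(child))
    return values


def aggregate(node, mode="mean", scope="subtree"):
    """Aggregate a molecule property for one scaffold rectangle.

    scope="subtree" uses all contained molecules, scope="direct" only the
    molecules directly associated with the scaffold.
    """
    values = subtree_values(node) if scope == "subtree" else list(node["values"])
    if not values:
        return None  # nothing to color by (assumed convention)
    if mode == "min":
        return min(values)
    if mode == "max":
        return max(values)
    return sum(values) / len(values)


scaffold = {
    "values": [1.0, 3.0],
    "children": [{"values": [8.0]}, {"values": [4.0, 4.0]}],
}
mean_all = aggregate(scaffold, "mean", "subtree")
mean_direct = aggregate(scaffold, "mean", "direct")
```

A large difference between the two scopes, as in this toy example, hints at heterogeneity of the property below the scaffold.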
Molecule cloud view
The concept of a molecule cloud is inspired by the widely used word clouds and has been introduced to cheminformatics by Ertl and Rohde [7] to provide compact diagrams summarizing a collection of molecules. The approach relies on the concept of scaffolds described in section “Scaffold based approaches”: the elements constituting the molecule cloud are the scaffolds obtained from a set of molecules, where each scaffold is scaled according to the number of molecules it represents. These scaffolds are drawn compactly on the plane to form a cloud diagram. To this end, Ertl and Rohde [7] proposed a layout algorithm, which first places large scaffolds at the center and then aims at arranging the remaining scaffolds such that the gaps are filled, see Fig. 4a. Our realization in Scaffold Hunter extends the original concept by supporting user interaction and semantic layout algorithms, which take molecular similarities into account.
Interactive realization
In Scaffold Hunter an interactive view was developed, which uses the static concept of molecule clouds as a basis. Just like several other existing views, the cloud view supports zooming and panning, which is particularly essential for large datasets. Properties of molecules and scaffolds can be represented by the background color, see Fig. 4. For this purpose, properties of scaffolds can be computed as the average, minimum or maximum value of all molecules associated with them. The cloud view benefits from the subset concept described in section “Architecture” like any other view. In addition, interactive refinement of the molecule cloud is supported by filtering out scaffolds, e.g., according to the number of molecules associated with them or their number of rings. The filter criteria can quickly be adjusted by sliders on the left side bar, which control the minimum and maximum values. The view can be configured to automatically recompute the layout after each filtering step in order to keep the representation as compact as possible and close the emerging unused space. To achieve this efficiently, a new dynamic layout algorithm was developed, which avoids the re-computation from scratch and at the same time aims at preserving the user’s mental map. This is achieved by minimizing the change between two different layouts in combination with smooth animated layout transitions.
Semantic layouts
Scaffold Hunter supports different layout algorithms for the cloud diagram and the user can switch between them at any time. First, the original layout algorithm kindly provided by Ertl [7] was integrated. Moreover, layout algorithms from [36] were incorporated, which were originally proposed for so-called semantic word clouds: the key idea of these approaches is to place semantically related words close to each other. In the domain of molecules our goal is to place similar scaffolds close to each other, and the user can choose from various similarity measures. These are derived from a specific property (or a set of properties) or obtained by standard cheminformatics techniques such as the Tanimoto coefficient applied to structural fingerprints. The user can quickly switch between different similarity measures, giving rise to animated changes of the layout. The effect of the semantic layout algorithms is illustrated in Fig. 4b, where the same property used to define the background color is also used as basis for the similarity calculation. The scaffolds are arranged such that green scaffolds with a high property value are placed on the left and those with a low value on the right, giving the impression of a color gradient, which indicates that the layout preserves the similarities well. However, similarities and distances stemming from high-dimensional data cannot be embedded in the plane without severe distortion, see section “High-dimensional data visualization and low-dimensional embeddings”. The primary objective was to provide a compact diagram without gaps and to consider similarities only as a secondary criterion. In this respect, our approach differs from many techniques widely used in cheminformatics [34] to map high-dimensional molecular data into a low-dimensional space for visualization.
Nevertheless, this technique can be particularly useful when the similarity is defined based on structural properties, e.g., obtained from fingerprints, and the background color represents a numerical property of interest, e.g., the biological activity: in this case, the molecule cloud can give hints on the relation between structure and activity. Figure 5 shows a cloud diagram obtained for this setting. The region on the left-hand side is predominantly populated by green scaffolds (high property values), while blue scaffolds (low property values) are placed on the right half of the diagram. However, there are also individual blue scaffolds on the left, which may indicate the presence of so-called activity cliffs [38], i.e., structurally similar molecules with a significant difference in potency. In contrast to the results obtained by clustering, the visualization does not necessarily partition elements into distinct groups, but represents pairwise similarities in an intuitively comprehensible manner.
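For reference, the Tanimoto coefficient mentioned above as one of the structural similarity measures can be computed as in the following minimal sketch. Fingerprints are modelled here as sets of on-bit indices, which is an assumption of this example rather than the representation used in Scaffold Hunter, and the value returned for two empty fingerprints is a convention chosen for the sketch.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices.

    Defined as |a AND b| / |a OR b|, ranging from 0 (no shared bits)
    to 1 (identical fingerprints).
    """
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
similarity = tanimoto(fp_a, fp_b)  # 2 shared bits, 5 bits in the union
dissimilarity = 1.0 - similarity   # usable as a distance for a layout
```

A semantic layout would consume such pairwise values for all scaffold pairs; scaffolds with a high coefficient should end up close to each other as far as the compactness objective permits.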
Heat map view
Scaffold Hunter provides a heat map view to show relations between compounds and their properties. Some computer science aspects of the approach have previously been discussed in [39]. It is best suited to discover (partial) correlations between compound similarity and different properties as well as outliers with respect to compound similarity. A heat map is a visual representation of a 2-dimensional matrix. It is composed of rectangular tiles, colored or shaded, where each tile represents a matrix entry and is positioned according to the matrix indices. The color of each tile is defined by a mapping of the value of the associated matrix entry. This way of coloring matrices has its roots in the 19th century [40] and gives a quick graphical overview of the matrix value distribution. The matrix displayed in the heat map view contains property values. Each column of the matrix is associated with a compound and each row with a property.
In order to identify correlations between the compound similarity and the property values, it is important that the matrix is ordered in a way such that similar compounds are close to each other. In this case, correlations can be identified by consecutive tiles with similar colors. Analogously, an appropriate ordering of the properties enables the user to discover correlations between different properties. For this reason, a heat map is often ordered with the help of a dendrogram (see section “Clustering techniques”) stemming from a hierarchical clustering algorithm. It is important to note that a dendrogram induces multiple linear orderings that are consistent with its binary tree structure. Neighbors in such an ordering do not need to be similar, especially if they stem from different high level clusters. Depicting the dendrogram beside a heat map therefore reveals important additional information with respect to the similarity structure of the dataset. Figure 6 shows an example of such a situation inside the heat map view of Scaffold Hunter. In order to calculate the similarity of two properties, the heat map rows are interpreted as numerical vectors in the Euclidean space and are clustered using a single link SAHN algorithm [29]. Missing values are replaced by their expected value with respect to other values of the same property. Thus, it is possible to display a heat map on partially defined properties. The heat map shown in Fig. 6 quickly reveals the correlation of several attributes with respect to the clustering structure. This is indicated by the different average colors for the heat map cells associated with the red and the blue cluster. It also reveals an outlier of the property Qualified \(\mathrm{AC}_{50}\) (3rd row from below) for the second compound from the right of the red cluster, i.e., the \(\mathrm{AC}_{50}\) value is unexpectedly low for the red cluster.
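The preprocessing of property rows described above can be sketched as follows. The sketch assumes that the expected value is estimated by the arithmetic mean of the defined values in the same row, which is one plausible reading of the text. The function names and the fallback for a row without any defined value are hypothetical.

```python
import math


def impute_row_means(matrix):
    """Replace missing entries (None) by the mean of the defined values of the same row.

    Each row holds the values of one property over all compounds.
    """
    result = []
    for row in matrix:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known) if known else 0.0  # assumed fallback
        result.append([mean if v is None else v for v in row])
    return result


def euclidean(u, v):
    """Euclidean distance between two property rows of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


matrix = [
    [1.0, None, 3.0],   # property 1, one value missing
    [2.0, 2.0, 2.0],    # property 2
    [None, 6.0, 10.0],  # property 3, one value missing
]
filled = impute_row_means(matrix)
distance = euclidean(filled[0], filled[1])
```

The resulting pairwise distances between rows could then be passed to any single-linkage hierarchical clustering routine to obtain the dendrogram that orders the properties.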
Scaffold Hunter’s heat map view is highly configurable and allows the adjustment of the matrix ordering, the visual appearance, and the level of detail to individual needs. The hierarchical clustering configuration allows the selection of various similarity measures, linkages, fingerprints and numerical properties. As an alternative to the matrix ordering induced by a clustering, it is possible to adjust the ordering of the properties manually using drag and drop. Irrelevant properties can be filtered out using the left sidebar. To support the visualization of larger datasets, the heat map view supports a semantic zoom, which automatically adjusts the level of detail displayed by a single tile. When a tile is large enough it displays the exact value of the property. For lower zoom levels the property values are rounded or hidden and the cells are rendered without borders as shown in Fig. 7. Still, the exact value and molecular depiction can be quickly accessed in the details section of the left sidebar by hovering over a tile with the mouse. In order to handle different property types and scales, Scaffold Hunter provides a manual adjustment of the value-to-color mapping for each individual property. Scaffold Hunter offers gradient or interval based color mappings with adjustable ranges. This feature is also highlighted in Fig. 7, where different color palettes show different color mapping configurations.
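The difference between gradient and interval based color mappings, together with the idea of a semantic zoom, can be illustrated with a short sketch. All function names, the default colors and the pixel thresholds are hypothetical choices for this example and are not taken from Scaffold Hunter.

```python
from bisect import bisect_right


def gradient_color(value, lo, hi, start=(0, 0, 255), end=(0, 255, 0)):
    """Gradient mapping: interpolate linearly between two RGB colors.

    Values outside the adjustable range [lo, hi] are clamped.
    """
    if hi == lo:
        t = 0.0
    else:
        t = min(1.0, max(0.0, (value - lo) / (hi - lo)))
    return tuple(round(s + t * (e - s)) for s, e in zip(start, end))


def interval_color(value, boundaries, colors):
    """Interval mapping: one fixed color per interval.

    `boundaries` must be sorted ascending and `colors` must contain
    exactly one more entry than `boundaries`.
    """
    return colors[bisect_right(boundaries, value)]


def tile_label(value, tile_px):
    """Semantic zoom: choose the level of detail from the tile size in pixels."""
    if tile_px >= 40:       # hypothetical threshold: show the exact value
        return f"{value:.3f}"
    if tile_px >= 20:       # hypothetical threshold: show a rounded value
        return f"{value:.0f}"
    return ""               # tile too small: color only


low = gradient_color(-3.0, 0.0, 10.0)   # clamped to the start color
mid = gradient_color(2.0, 0.0, 10.0)
band = interval_color(5.0, [1.0, 10.0], ["green", "yellow", "red"])
```

A gradient mapping suits continuous properties, whereas an interval mapping is convenient when thresholds, e.g., activity classes, carry meaning of their own.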
Other recent developments
Besides the integration of the novel views described above, Scaffold Hunter has been extended by various new features and improved in many ways. In the following, the most important changes are summarized. Several performance improvements have affected multiple modules of Scaffold Hunter. This includes several visualizations, the dataset import and export as well as the calculation of fingerprints and properties. The handling of SD files, especially if they do not fully comply with the technical standard, has become much more robust. Scaffold Hunter now supports the calculation of additional properties for molecules as well as for scaffolds. This has been achieved by a plugin mechanism to support easy extension, which is now also applicable to scaffolds. Among others, we have incorporated a plugin for the calculation of the popular extended connectivity fingerprints (ECFP/FCFP) [41] in various configurations. Recent versions of Scaffold Hunter support chirality and it is possible to individually select whether chiral information is taken into account, e.g., for molecular matching and depiction. The molecular depiction now also offers a cleaner visual appearance with more colors and a better layout of complex structures. Finally, the subset management has been improved and additional operations to generate subsets are now available. Scaffold Hunter supports random sampling of subsets and an operation to split a subset according to its scaffold tree structure into smaller parts.