
Visualization of very large high-dimensional data sets as minimum spanning trees

Abstract

The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.

Introduction

The recent development of new and often very accessible frameworks and powerful hardware has enabled the implementation of computational methods to generate and collect large high dimensional data sets and created an ever increasing need to explore as well as understand these data [1,2,3,4,5,6,7,8,9]. Generally, large high-dimensional data sets are matrices where rows are samples and columns are measured variables, each column defining a dimension of the space which contains the data. Visualizing such data sets is challenging because reducing the dimensionality, which is required in order to make the data visually interpretable for humans, is both lossy and computationally expensive [10].

Large high-dimensional data sets are frequently used in the chemical sciences. For instance, the ChEMBL database (\(n = 1,159,881\)) of bioactive molecules from the scientific literature and their associated biological assay data is used daily in the area of drug discovery [11]. Further examples of large databases containing molecules include FDB17 (\(n = 10,101,204\)), a fragment-like subset of the enumerated database GDB17 listing theoretically possible molecules up to 17 atoms [12,13,14], and DSSTox (\(n = 848,816\)), containing molecules investigated for toxicity [15]. Examples of smaller data sets include the Natural Products Atlas (\(n = 24,594\)), collecting microbially-derived natural products [16]; Drugbank (\(n = 9300\)), listing molecules marketed or investigated as drugs [17]; and the MoleculeNet benchmark, containing a collection of 16 data sets of small organic molecules [18].

To visualize such databases, simple linear dimensionality reduction methods such as principal component analysis and similarity mapping readily produce 2D- or 3D-representations of global features [19,20,21,22,23,24,25]. However, local features defining the relationships between close or even nearest neighbor (NN) molecules, which are very important to understand the structure of data, are mostly lost, limiting the applicability of linear dimensionality reduction methods for visualization. The important NN relationships are much better preserved using non-linear manifold learning algorithms, which assume that the data lies on a lower-dimensional manifold embedded within the high-dimensional space. Algorithms such as nonlinear principal component analysis (NLPCA), t-distributed stochastic neighbor embedding (t-SNE), and more recently uniform manifold approximation and projection (UMAP) are based on this assumption [26,27,28]. Other techniques used are probabilistic generative topographic maps (GTM) and self-organizing maps (SOM), which are based on artificial neural networks [29, 30]. However, these algorithms have time complexities between at least \(O\left( {n^{1.14} } \right)\) and \(O\left( {n^{5} } \right)\), limiting the size of the data sets that can be visualized [31]. The same limitations in terms of data set size apply when distributing data in a tree by implementing the neighbor joining algorithm or similar methods used to create phylogenetic trees [32, 33]. This limiting behavior has been documented by the ChemTreeMap tool, which can only visualize up to approximately 10,000 data points (molecules or clusters of molecules) [34]. Due to the described challenges, large scientific data sets are generally visualized in aggregated or reduced form [35, 36].

Here we present an algorithm, named TMAP (Tree MAP), to generate and distribute intuitive visualizations of large data sets on the order of up to \(10^{7}\) data points with arbitrary dimensionality as a tree. Our method is based on a combination of locality sensitive hashing, graph theory, and modern web technology, and integrates into established data analysis and plotting workflows. This tree-based layout facilitates visual inspection of the data with a high resolution by explicitly visualizing the closest distance between clusters and the detailed structure of clusters through branches and sub-branches. We demonstrate the performance of TMAP with toy data sets from computer graphics and with ChEMBL subsets of different size and composition, and show that it surpasses comparable algorithms such as t-SNE and UMAP in terms of time and space complexity. We further exemplify the use of TMAP for visualizing large high-dimensional data sets from chemistry as well as from further scientific fields (Table 1).

Table 1 Data sets visualized using TMAP

Methods

Given an arbitrary data set as an input, TMAP encompasses four phases: (I) LSH forest indexing [37, 38], (II) construction of a \(c\)-approximate \(k\)-nearest neighbor graph, (III) calculation of a minimum spanning tree (MST) of the \(c\)-approximate \(k\)-nearest neighbor graph [39], and (IV) generation of a layout for the resulting MST [40].

During phase I, the input data are indexed in an LSH forest data structure, enabling \(c\)-approximate \(k\)-nearest neighbor (k-NN) searches with a time complexity sub-linear in \(n\). Text and binary data are encoded using the MinHash algorithm, while integer and floating-point data are encoded using a weighted variation of the algorithm [41,42,43]. The LSH Forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions \(d\) used in encoding the data, and the number of prefix trees \(l\). Increasing the value of either parameter increases main memory usage; higher values for \(l\) also decrease query speed. The effect of parameters \(d\) and \(l\) on the final visualization is shown in Additional file 1: Fig. S1. The use of a combination of (weighted) MinHash and LSH Forest, which supports fast estimation of the Jaccard distance between two binary sets, has been shown to perform very well for molecules [44]. Note that other data structures and algorithms implementing a variety of different distance metrics may show better performance on other data and can be used as drop-in replacements of phase I.
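The way MinHash supports fast estimation of the Jaccard distance can be illustrated with a minimal sketch. It is not the TMAP implementation: the function names are hypothetical, salted BLAKE2b hashes stand in for the \(d\) hash functions, and the LSH forest indexing itself is omitted. Two signatures agree at a given position with probability equal to the Jaccard similarity of the underlying sets, so the fraction of disagreeing positions estimates the Jaccard distance.

```python
import hashlib


def minhash_signature(items, d=128):
    """Return a d-element MinHash signature for a non-empty set of strings."""
    signature = []
    for seed in range(d):
        # One salted hash function per signature position (illustrative choice).
        salt = seed.to_bytes(4, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as the fraction of differing positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def exact_jaccard_distance(set_a, set_b):
    """Exact Jaccard distance between two non-empty sets, for comparison."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```

In this sketch, a larger \(d\) lowers the variance of the estimate at the cost of longer signatures, which is in line with the memory trade-off described above.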

In phase II, an undirected weighted \(c\)-approximate \(k\)-nearest neighbor graph (\(c\)\(k\)-NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks is used to increase efficiency [45]. The \(c\)\(k\)-NNG construction phase takes two arguments, namely \(k\), the number of nearest-neighbors to be searched for, and \(k_{c}\), the factor used by the augmented query algorithm. The variant of the query algorithm increases the time complexity of a single query from \(O\left( {\log n} \right)\) to \(O\left( {k \cdot k_{c} + \log n} \right)\), resulting in an overall time complexity of \(O\left( {n\left( {k \cdot k_{c} + \log n} \right)} \right)\), where practically \(k \cdot k_{c} > \log n\), for the \(c\)\(k\)-NNG construction. The edges of the \(c\)\(k\)-NNG are assigned the Jaccard distance of their incident vertices as their weight. Depending on the distribution and the hashing of the data, the \(c\)\(k\)-NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to highly connected clusters of size \(\ge k\) in the Jaccard space, connected components are created. However, the following phases are agnostic to whether this phase yields a disconnected graph. The effect of parameters \(k\) and \(k_{c}\) on the final visualization is shown in Additional file 1: Fig. S2. Alternatively, an arbitrary undirected graph can be supplied to the algorithm as a (weighted) edge list.
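The structure of the graph built in this phase can be sketched as follows. The example is an assumption-laden stand-in: it computes an exact \(k\)-nearest neighbor graph by brute force instead of querying an LSH forest, it has no counterpart to \(k_{c}\), and the function names are hypothetical. It does show how edges carry the Jaccard distance as weight and how an outlier sharing nothing with any other point leaves the graph disconnected.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance between two sets (0.0 for two empty sets)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def knn_graph(sets, k):
    """Exact k-NN graph as a sorted, undirected edge list (u, v, weight), u < v."""
    edges = {}
    for u, set_u in enumerate(sets):
        ranked = sorted(
            (jaccard_distance(set_u, set_v), v)
            for v, set_v in enumerate(sets)
            if v != u
        )
        for dist, v in ranked[:k]:
            if dist < 1.0:  # points with nothing in common stay unconnected
                edges[(min(u, v), max(u, v))] = dist
    return [(u, v, w) for (u, v), w in sorted(edges.items())]
```

Because each undirected edge is stored once under the key `(min(u, v), max(u, v))`, mutual neighbors do not produce duplicate edges.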

During phase III, a minimum spanning tree (MST) is constructed on the weighted \(c\)\(k\)-NNG using Kruskal’s algorithm, which represents the central and differentiating phase of the described algorithm. Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low dimensional embedding. The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage; both properties are also desirable in data visualization. The time complexity of Kruskal’s algorithm is \(O\left( {E\log V} \right)\), rendering this phase negligible compared to phase II in terms of execution time. In the case of a disconnected \(c\)\(k\)-NNG, a minimum spanning forest is created.
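A minimal sketch of Kruskal's algorithm with a union-find structure illustrates this phase. It is a textbook version, not the code used by TMAP, and the function name and edge format `(u, v, weight)` are assumptions made for the example. Edges are visited in order of increasing weight and kept only if they join two different components, so a disconnected input yields a minimum spanning forest.

```python
def kruskal(num_vertices, edges):
    """Return the minimum spanning forest as a list of (u, v, weight) edges."""
    parent = list(range(num_vertices))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v, weight in sorted(edges, key=lambda edge: edge[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            forest.append((u, v, weight))
    return forest
```

For a connected graph the result has `num_vertices - 1` edges; for a graph with several connected components it has one edge fewer per additional component.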

Phase IV lays out the tree on the Euclidean plane. As the MST is unrooted, and to keep the drawing compact, the tree is laid out using a graph layout algorithm rather than a tree layout algorithm. In order to draw MSTs of considerable size (millions of vertices), a spring-electrical model layout algorithm with multilevel multipole-based force approximation is applied. This algorithm is provided by the open graph drawing framework (OGDF), a modular C++ library [40]. In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements. Whereas several parameters can be configured for the layout phase, only parameter \(p\) must be adjusted based on the size of the input data set (Additional file 1: Fig. S3). This phase constitutes the bottleneck regarding computational complexity.
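The idea behind a spring-electrical layout can be sketched in a few lines, under strong simplifying assumptions. The example below is a naive force-directed layout with all-pairs repulsion: it is not the OGDF algorithm, it uses neither the multilevel scheme nor the multipole approximation, and its quadratic cost per iteration makes it suitable only for very small trees. The function name, the force laws, and all constants are illustrative choices.

```python
import math
import random


def spring_electrical_layout(num_vertices, edges, iterations=200, seed=0):
    """Return a list of (x, y) positions for a small graph given as an edge list."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(num_vertices)]
    ideal = 1.0 / math.sqrt(max(1, num_vertices))  # target edge length
    step = 0.1  # maximum displacement per iteration, decays over time
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(num_vertices)]
        # Electrical repulsion between every pair of vertices.
        for i in range(num_vertices):
            for j in range(i + 1, num_vertices):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = ideal * ideal / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force
        # Spring attraction along the edges of the tree.
        for u, v, *_ in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / ideal
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move every vertex along its net force, capped by the current step.
        for i in range(num_vertices):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            capped = min(length, step)
            pos[i][0] += disp[i][0] / length * capped
            pos[i][1] += disp[i][1] / length * capped
        step *= 0.98
    return [(x, y) for x, y in pos]
```

Edges may be given as `(u, v)` or `(u, v, weight)` tuples; weights are ignored in this sketch. The fixed seed makes repeated runs on the same input return the same coordinates.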

Results and discussion

TMAP performance assessment with toy data sets and ChEMBL subsets

The quality of our TMAP algorithm is first assessed by comparing TMAP and UMAP to visualize the common benchmarking data sets MNIST, FMNIST, and COIL20 (Fig. 1). UMAP generally represents clusters as tightly packed patches and tries to reach maximal separation between them. On the other hand, TMAP visualizes the relations between, as well as within, clusters as branches and sub-branches. While UMAP can represent the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins subsets through one or more edges of smallest difference (Fig. 1a, b). However, the plot shows that this removal of local connectivity leads to an untangling of highly similar data (shown in dark green, orange, dark red, dark purple, and light blue). This behavior has been assessed and compared to UMAP in Additional file 1: Figures S4 and S5, where it is shown that both TMAP and UMAP have to sacrifice locality preservation for more complex examples. For the MNIST and FMNIST data sets, the tree structure results in a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives (Fig. 1c–f).

Fig. 1

Comparison between TMAP and UMAP on benchmark data sets. Please use the interactive versions of the TMAP visualizations at http://tmap.gdb.tools to see images associated with each point on the map. TMAP explicitly visualizes the relations between as well as within clusters. a, b While UMAP represents the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins clusters through an edge of smallest difference. c–f For the MNIST and FMNIST data sets, the tree structure allows for a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives. The image data of all three sets was binarized using the average intensity per image as a threshold

In a second, more applied comparison example, we visualize data from ChEMBL using TMAP and UMAP. For this analysis, molecular structures are encoded using ECFP4 (extended connectivity fingerprint up to 4 bonds, 512-D binary vector), a molecular fingerprint which encodes circular substructures and performs well in virtual screening and target prediction [46,47,48]. We consider a subset \(S_{t}\) of the top 10,000 ChEMBL compounds by insertion date, as well as a random subset \(S_{r}\) of 10,000 ChEMBL molecules.

Taking the more homogeneous set \(S_{t}\) as an input, the 2D-maps produced by each representation, plotted using the Python library matplotlib, illustrate that TMAP, which distributes clusters in branches and subbranches of the MST, produces a much more even distribution of compounds on the canvas compared to UMAP, thus enabling better visual resolution (Fig. 2a, b). Furthermore, in a visualization of the heterogeneous set \(S_{r}\), nearest neighbor relationships (locality) are better preserved in TMAP compared to UMAP, as illustrated by the positioning of the 20 structurally nearest neighbors of compound CHEMBL370160 [2, 49] reported as a potent inhibitor of human tyrosine-protein kinase SYK. The 20 structurally similar nearest neighbors are defined as 20 nearest neighbors in the original 512-dimensional fingerprint space. TMAP directly connects the query compound to three of the 20 nearest neighbors, CHEMBL3701630, CHEMBL3701611, and CHEMBL38911457, its nearest, second nearest, and 15th nearest neighbor respectively. The nearest neighbors 1 through 7 are all within a topological distance of 3 around the query (Fig. 2c). In contrast, UMAP has positioned nearest neighbors 2, 3, 9, and 18, among several even more distant data points, closer to the query than the nearest neighbor from the original space (Fig. 2d). Indeed, TMAP preserves locality in terms of retaining 1-nearest neighbor relationships much better than UMAP, applying both topological and Euclidean metrics (Fig. 2e, f; Additional file 1: Fig. S6). The quality of the preservation of locality largely depends on parameter \(d\), with adjustments to parameters \(k\) and \(k_{c}\) only having a minor influence (Additional file 1: Fig. S7). Moreover, TMAP yields reproducible results when running on identical parameters and input data, whereas results of comparable algorithms such as UMAP change considerably with every run (Additional file 1: Fig. S8) [26].
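The topological distance used above is the number of edges on the path between two points in the tree. A minimal sketch of how it could be computed with a breadth-first search is given below; the function name and the edge list format are assumptions for illustration and do not reflect the code used for the reported evaluation.

```python
from collections import deque


def topological_distances(num_vertices, tree_edges, source):
    """Return the hop count from source to every vertex (None if unreachable)."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v, *_ in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    hops = [None] * num_vertices
    hops[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if hops[v] is None:  # first visit gives the shortest hop count
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops
```

Given the indices of the true nearest neighbors of a query in the original space, looking them up in the returned list shows how many of them lie within a chosen radius, for example a topological distance of 3. Vertices in a different component of a spanning forest are reported as `None`.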

Fig. 2

Comparing TMAP and UMAP for visualizing ChEMBL. The first \(n\) compounds \(S_{t}\) (a, b, e) and a random sample \(S_{r}\) (c, d, f), each of size \(n = 10,000\), were drawn from the 512-D ECFP-encoded ChEMBL data set to visualize the distribution of biological entity classes and k-nearest neighbors respectively. a TMAP lays out the data as a single connected tree, whereas (b) UMAP draws what appears to be a highly disconnected graph, with the connection between components becoming impossible to assert. TMAP keeps the intra- and inter-cluster distances at the same magnitude, increasing the visual resolution of the plot. c, d The 20 nearest neighbors of a randomly selected compound from a random sample. c TMAP directly connects the query compound to three of the 20 nearest neighbors (1, 2, 15); nearest neighbors 1 through 7 are all within a topological distance of 3 around the query compound. d The closest nearest neighbors of the same query compound in the UMAP visualization are true nearest neighbors 2, 3, 18, 9, and 1, with 1 being the farthest of the five. e, f Ranked distances from true nearest neighbor in original high dimensional space after embedding based on topological and Euclidean distance for data sets \(S_{t}\) and \(S_{r}\) respectively. g Computing the coordinates for a random sample (\(n = 1,000,000\)) highlights the running time behavior of TMAP and allows an inspection of the time and space requirements of the different phases of the algorithm. 
h, i Four random samples increasing in size (\(n = 10,000\), \(n = 100,000\), \(n = 500,000\), and \(n = 1,000,000\)) detail the differences in memory usage (h) and running time (i) between TMAP and UMAP. For \(n = 10,000\): \(t_{TMAP} = 4.865{\text{s}}\), \(a_{TMAP} = 0.223{\text{GB}}\); \(t_{UMAP} = 20.985{\text{s}}\), \(a_{UMAP} = 0.383{\text{GB}}\). For \(n = 100,000\): \(t_{TMAP} = 33.485{\text{s}}\), \(a_{TMAP} = 1.12{\text{GB}}\); \(t_{UMAP} = 115.661{\text{s}}\), \(a_{UMAP} = 2.488{\text{GB}}\). For \(n = 500,000\): \(t_{TMAP} = 175.89{\text{s}}\), \(a_{TMAP} = 4.521{\text{GB}}\); \(t_{UMAP} = 3,577.768{\text{s}}\), \(a_{UMAP} = 18.854{\text{GB}}\). For \(n = 1,000,000\): \(t_{TMAP} = 354.682{\text{s}}\), \(a_{TMAP} = 8.553{\text{GB}}\); \(t_{UMAP} = 41,325.944{\text{s}}\), \(a_{UMAP} = 48.507{\text{GB}}\)

For small random subsets of the 512-D ECFP-encoded ChEMBL data set (\(n = 10,000\) and \(n = 100,000\)), TMAP and UMAP have comparable running time \(t\) and memory usage \(a\), whereas TMAP significantly outperforms UMAP for larger random subsets (\(n = 500,000\) and \(n = 1,000,000\)) (Fig. 2h, i). Further insight into the computational behavior of TMAP is provided by analyzing running times for the different phases based on a larger subset (\(n = 1,000,000\)) of the ECFP4-encoded ChEMBL data set (Fig. 2g). During phase I of the algorithm, which accounts for \(180{\text{s}}\) of the execution time and approximately \(5{\text{GB}}\) of main memory usage, data is loaded and indexed in the LSH Forest data structure in chunks of 100,000, as expressed by 10 distinct jumps in memory consumption. The construction of the \(c\)\(k\)-NNG during phase II requires a negligible amount of main memory and takes approximately \(110{\text{s}}\). During \(10{\text{s}}\) of execution time, MST creation (phase III) occupies a further \(2{\text{GB}}\) of main memory of which approximately \(1{\text{GB}}\) is retained to store the tree data structure. The graph layout algorithm (phase IV) requires \(2{\text{GB}}\) throughout \(55{\text{s}}\), after which the algorithm completes with a total wall clock run time of \(355{\text{s}}\) and peak main memory usage of \(8.553{\text{GB}}\).
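The reported timings can be cross-checked with a few lines of arithmetic. The sketch below only re-uses numbers stated in the text and in the caption of Fig. 2; the variable names are illustrative, and the assignment of the caption values to the four sample sizes follows the order in which they are listed there.

```python
# Phase timings for n = 1,000,000 as reported in the text (seconds).
phase_seconds = {
    "I   LSH forest indexing": 180,
    "II  k-NN graph construction": 110,
    "III minimum spanning tree": 10,
    "IV  layout": 55,
}
total_seconds = sum(phase_seconds.values())

# (n, TMAP seconds, UMAP seconds) as reported in the caption of Fig. 2.
timings = [
    (10_000, 4.865, 20.985),
    (100_000, 33.485, 115.661),
    (500_000, 175.89, 3577.768),
    (1_000_000, 354.682, 41325.944),
]
speedups = {n: t_umap / t_tmap for n, t_tmap, t_umap in timings}

print(f"total: {total_seconds} s")
for name, seconds in phase_seconds.items():
    print(f"phase {name}: {seconds / total_seconds:.0%} of the run time")
for n, ratio in speedups.items():
    print(f"n = {n:>9,}: UMAP / TMAP running time = {ratio:.1f}x")
```

The four phase durations add up to the stated total wall clock run time, and the ratio of UMAP to TMAP running time grows with the sample size.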

Note that TMAP supports Jaccard similarity estimation through MinHash and weighted MinHash for binary and weighted sets, respectively. While the Jaccard metric is very suitable for chemical similarity calculations based on molecular fingerprints, the metric may not be the best option for problems presented by other data sets. However, there exists a wide range of LSH families supporting distance and similarity metrics such as Hamming distance, \(l_{p}\) distance, Levenshtein distance, or cosine similarity, which are compatible with TMAP [50, 51]. Furthermore, the modularity of TMAP allows plugging in arbitrary nearest-neighbor-graph creation techniques or loading existing graphs from files.

TMAPs of small molecule data sets: ChEMBL, FDB17, DSSTox, and the Natural Products Atlas

The high performance and relatively low memory usage of TMAP, as well as the ability to generate highly detailed and interpretable representations of high-dimensional data sets, is illustrated here by interactive visualization of a series of small molecule data sets available in the public domain. In these examples we use MHFP6 (512 MinHash permutations), a molecular fingerprint related to ECFP4 but with better performance for virtual screening tasks and the ability to be directly indexed in an LSH Forest data structure, which considerably speeds up computation for large data sets [45].

As a first example, we discuss the TMAP of the full data set of the ChEMBL database containing the 1.16 million ChEMBL compounds associated with biological assay data. TMAP completes the calculation within 613 s with a peak memory usage of 20.562 GB. Note that approximately half of the main memory usage is accounted for by SMILES, activities, and biological entity classes which are loaded for later use in the visualization. To facilitate data analysis, the coordinates computed by TMAP are exported as an interactive portable HTML file using Faerun, where molecules are displayed using the JavaScript library SmilesDrawer (Fig. 3a) [25, 52].

Fig. 3

TMAP visualization of ChEMBL, FDB17, DSSTox, and the Natural Products Atlas in the MHFP6 chemical space. Please use the interactive versions at https://tmap.gdb.tools to visualize molecular structures associated with each point. a Visualization of all ChEMBL compounds associated with biological assay data (\(n = 1,159,881\)) colored by target class. The inset shows molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters (cyan); inhibitors of the phenylethanolamine N-methyltransferase (orange); and structurally related compounds with high binding affinities for nicotinic acetylcholine receptors and inhibitory effects on cytochrome p450s (red, dark blue). b The ChEMBL data set was merged with fragment database (FDB17) compounds (\(n = 11,261,085\)) and visualized. FDB17 molecules are shown in light gray. The inset shows a branch of steroid and steroid-like ChEMBL compounds, as well as dominantly FDB17 branches which are sparsely populated by ChEMBL molecules. c Visualization of DSSTox compounds colored by reported toxicity level. The inset shows a subtree containing a high number of toxic compounds structurally similar or related to naphthalenes and other polycyclic aromatic hydrocarbons. d The Natural Products Atlas chemical space colored by origin genus of the 9 largest groups. The inset shows that structurally similar compounds are grouped into distinct branches and subbranches and are usually produced by plants and fungi from the same genus

Analyzing the distribution of molecules on the tree shows that TMAP groups molecules according to their structure and their biological activity, accurately reflecting similarities calculated in the high-dimensional MHFP6 space. This is well illustrated for a subset of the map (Fig. 3a, inset). In this area of the map, data points in cyan indicate molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters in two connected branches (right side of inset), while data points in orange show inhibitors of the phenylethanolamine N-methyltransferase (PNMT) (left side of inset), and red and dark blue data points indicate nicotinic acetylcholine receptor (nAChRs) ligands and cytochrome p450s (CYPs) inhibitors, respectively.

As a second example, we visualize the ChEMBL set merged with FDB17 (\(n = 10,101,204\)) into a superset of size \(n = 11,261,085\) (Fig. 3b), which corresponds to the largest data set that TMAP can successfully handle. As above, the TMAP 2D-layout accurately reflects structural and functional similarities computed in the high-dimensional MHFP6 space. In this TMAP visualization, the majority of ChEMBL compounds accumulate in closely connected clusters (branches) due to the prevalence of aromatic carbocycles. A notable exception is a relatively sizable branch of steroids and steroid-like compounds, which is connected to a branch of FDB17 molecules containing non-aromatic 5-membered carbocycles and ketones (Fig. 3b, inset). Many more detailed insights can be gained by inspecting the interactive map in Faerun (http://tmap-fdb.gdb.tools).

Further examples include MHFP6-encoded compounds from the Distributed Structure-Searchable Toxicity (DSSTox) Database (\(n = 848,816\)) and the Natural Products Atlas (\(n = 24,594\)). Visualizing DSSTox and coloring the resulting tree by toxicity rating, TMAP creates several subtrees and branches representing structural regions with a high incidence of highly toxic compounds (shown in red, Fig. 3c). An example of such a subtree contains naphthalenes and other polycyclic aromatic hydrocarbons (Fig. 3c, inset). The TMAP tree of the Natural Products Atlas was colored according to origin genus and reveals that branches and subbranches containing distinct substructures usually correlate with a certain genus, such as the various combinations of phenols, fused cyclopentanes, lactones, and steroids produced by the fungal genus Ganoderma (colored purple in Fig. 3d, inset).

Visualization of the MoleculeNet benchmark data sets

We further use TMAP to visualize MoleculeNet, a benchmark for molecular machine learning which has found wide adoption in cheminformatics and encompasses 16 data sets ranging in size and composition (Table 1) [18]. As for the other small molecule data sets above, we computed MHFP6 fingerprints of the associated molecules and the corresponding TMAPs, which we then color-coded according to various numerical values available in the benchmarks. The procedure was applied to all MoleculeNet data sets except for QM7/b, where no SMILES have been provided.

The resulting TMAP representations, accessible at the TMAP website (http://tmap.gdb.tools), reveal the detailed structure of the data sets as well as the behaviour of methods applied to these data sets as a function of the chemical structures of the molecules. For example, TMAPs of the QM8 and QM9 (\(n = 21,786\) and \(n = 133,885\)), which contain small molecules and DFT-modelled parameters, reveal relationships between molecular structures and the various computed physico-chemical values. For instance the TMAP of the QM8 data set color-coded by the oscillator strengths of the lowest two singlet electronic states reveals how the value correlates with molecular structure and explains the performance differences in machine learning models trained on Coulomb matrices versus those trained on structure-sensitive molecular fingerprints [53]. In the case of the ESOL data set containing measured and calculated water solubility values of common small molecules (\(n = 1128\)), its TMAP color-coded with the difference between computed and measured values reveals the limitation of the ESOL model when estimating solubility of polycyclic aromatic hydrocarbons and compounds containing pyridines. For the FreeSolv data set (\(n = 642\)) containing small molecules and their measured and calculated hydration free energy in water, the TMAP visualization hints at possible limitations of the method when calculating hydration free energies of sugars. Finally, for the MUV data set (\(n = 93,087\)), which contains active small drug-like molecules against 17 different protein targets mixed in each case with inactive decoy molecules, the various TMAPs reveal differences in the structural distribution of actives among decoys. Actives are usually well distributed but appear to form clusters in certain subsets (e.g. MUV-548 and MUV-846), explaining the generally higher performance of fingerprint benchmarks for these subsets [47].

Application to other scientific data sets

We further illustrate the general applicability of TMAP to visualize data sets from the fields of linguistics, biology, and particle physics. All produced maps are available as interactive Faerun plots on the TMAP website (http://tmap.gdb.tools).

Our first example concerns visualization of the RCSB Protein Data Bank, which contains experimental 3D-structures of proteins and nucleic acids (\(n = 131,236\)) [54]. The PDB files were extracted from the Protein Data Bank and encoded using the protein shape fingerprint 3DP (136-D integer vector, 256 weighted MinHash samples). 3DP encodes the structural shape of large molecules stored as PDB files based on through-space distances of atoms [22]. Processing data extracted from the PDB and indexed using a weighted variant of MinHash demonstrates the ability of TMAP to visualize both global and local structure, improving on previous efforts on the visualization of the database [22, 55]. The global structure of the 3DP-encoded PDB data is dominated by the size (heavy atom count) of the proteins (Fig. 4a), whereas the local structure is defined by properties such as the fraction of negative charges (Fig. 4b).

Fig. 4

TMAP visualizations of the RCSB Protein Data Bank (PDB), PANCAN, and ProteomeHD data. For a and b, please use the interactive versions at http://pdb-tmap.gdb.tools to visualize protein structures associated with each point. 3DP-encoded PDB entries visualized using TMAP with weighted MinHash indexing; the color bars show the log–log distribution of the property values. a Colored according to the macromolecular size (heavy atom count). The resulting map reflects the size-sensitivity of the 3DP fingerprint. b Colored according to the fraction of negative charges in the molecules. Macromolecules with a high fraction of negatively charged atoms, predominantly nucleic acids, are visible as clusters of red branches. c The PANCAN data set (n = 801, d = 20,531) consists of gene expression data of five types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA) and was indexed using a weighted variant of the MinHash algorithm. d Visualization of the ProteomeHD data set (n = 5013, d = 5013) based on co-regulation scores of proteins. The data points have been colored according to the associated cellular location

As an additional example from biology, we consider the PANCAN data set (\(n = 801\), \(d = 20,531\)), which consists of gene expression data of patients having different types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA), randomly extracted from the cancer genome atlas database [56]. Here we index the PANCAN data directly using the LSH Forest data structure and weighted MinHash. The output produced by processing the PANCAN data set displays the successful differentiation of tumor types based on RNA sequencing data by the algorithm (Fig. 4c). We also visualize the ProteomeHD data set using TMAP [57]. This data set consists of co-regulation scores of 5013 proteins, annotated with their respective cellular location. In addition to the ProteomeHD data set, Kustatscher et al. also released an R script to create a map of the set using t-SNE, which took a total of 400 s to complete; in contrast, TMAP visualized the data set within 32 s (Fig. 4d), successfully clustering proteins by their cellular location based on co-regulation scores. As a further biology example, our TMAP webpage also features flow cytometry measurements (\(n = 436,877, d = 14\)), exemplifying the method's application to the visualization of relatively low dimensional data [17, 58].

As an example from physics, we represent the MiniBooNE data set (\(n = 130,065\), \(d = 50\)), which consists of measurements extracted from Fermilab’s MiniBooNE experiment and contains the detection of signal (electron neutrinos) and background (muon neutrinos) events [59]. As the attributes in MiniBooNE are real numbers, we use the Annoy indexing library which supports the cosine metric in phase I of the algorithm to index the data for \(k\)-NNG construction, which demonstrates the modularity of TMAP [60]. This example reflects the independence of the MST and layout phases of the algorithm from the input data, displaying the distribution of the signal over the background data (Fig. 5a).

Fig. 5

Visualizing linguistics, RNA sequencing, and particle physics data sets. a The MiniBooNE data set (\(n = 130,065\), \(d = 50\)) consists of measurements extracted from Fermilab’s MiniBooNE experiment. TMAP visualizes the distribution of the signal data among the background. b The GUTENBERG data set is a selection of books by 142 authors (\(n = 3036, d = 1,217,078\)). The works of five different authors are shown to occupy distinct branches. Interactive versions of these maps and further examples can be found at http://tmap.gdb.tools

Outside of the natural sciences, we use TMAP to visualize the GUTENBERG set as an example of a data set from linguistics. This data set features a selection of \(n = 3036\) books by 142 authors written in English [61]. To analyze this data, we define a book fingerprint as a dense-form binary vector indicating which words from the universe of all words extracted from all books occurred at least once in a given book (yielding a dimensionality of \(d = 1,217,078\)), and index this book fingerprint using the LSH Forest data structure with MinHash. The visualization of the GUTENBERG data set exemplifies the ability of TMAP to handle input with extremely high dimensionality (\(d = 1,217,078\)) efficiently (Fig. 5b). The works of different authors tend to populate specific branches, with notable expected exceptions such as the autobiography of Charles Darwin, which does not lie on the same branch as all his other works. Meanwhile, the works of Alfred Russel Wallace are found on subbranches of the Darwin branch.
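The book fingerprint and its MinHash encoding can be sketched in a few lines. The example below stores each fingerprint as the set of words that occur, which is the sparse form of the binary vector described above, and uses a generic affine hash family to build signatures. The hash family, function names, and toy strings are assumptions for illustration and do not reproduce the hashing used inside TMAP.

```python
import random
import re
import zlib

MERSENNE_PRIME = (1 << 61) - 1


def book_fingerprint(text):
    """Set of distinct lower-cased words, i.e. the non-zero vector positions."""
    return set(re.findall(r"[a-z']+", text.lower()))


def exact_jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_hash_params(num_hashes, seed=0):
    """Parameters (a, b) of hash functions h(x) = (a * x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(num_hashes)]


def minhash_signature(words, params):
    """Per hash function, keep the minimum hash value over all words."""
    if not words:
        raise ValueError("cannot hash an empty fingerprint")
    # crc32 gives a stable integer id per word (str hash() is randomized).
    ids = [zlib.crc32(w.encode("utf-8")) for w in words]
    return [min((a * x + b) % MERSENNE_PRIME for x in ids) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Toy "books" (short made-up strings, not the GUTENBERG data).
book_a = book_fingerprint("The origin of species by means of natural selection")
book_b = book_fingerprint("The descent of man and selection in relation to sex")
params = make_hash_params(256)
sig_a = minhash_signature(book_a, params)
sig_b = minhash_signature(book_b, params)
exact = exact_jaccard(book_a, book_b)
estimate = estimated_jaccard(sig_a, sig_b)
```

The signature length is fixed by the number of hash functions and does not grow with the vocabulary size, which is why a dimensionality above one million does not by itself make indexing expensive.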

Related to linguistics, the TMAP webpage further features a map of the distribution of different scientific journals (Nature, Cell, Angewandte Chemie, Science, the Journal of the American Chemical Society, and Demography) over the entire PubMed article space (\(n = 327,628, d = 1,633,762\)), revealing specialization, diversification, and overlaps; as well as a TMAP of the NeurIPS conference papers (\(n = 7,241, d = 225,423\)), visualizing the increase in occurrence of the word “deep” in conference paper abstracts over time (1987–2016).

Conclusion

In this study, we introduced TMAP as a visualization method for very large, high-dimensional data sets enabling high data interpretability by preserving and visualizing both global and local features. By using TMAP in combination with the MHFP6 fingerprint, we can visualize databases of millions of organic small molecules and the associated property data with a high degree of resolution, which was not possible with previous methods. TMAP is also well-suited to visualize arbitrary data sets such as images, text, or RNA-seq data, hinting at its usefulness in a wide range of fields including computational linguistics or biology.

TMAP excels with its low memory usage and running time, with performance superior to other visualization algorithms such as t-SNE, UMAP or PCA. Because the available parameters allow output quality to be balanced against memory usage, TMAP does not require specialized hardware for high-quality visualizations of data sets containing millions of data points. Most importantly, TMAP generates visualizations with an empirical sub-linear time complexity of \(O\left( {n^{0.931} } \right)\), allowing the visualization of much larger high-dimensional data sets than previous methods.
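To make the exponent concrete, the following sketch shows how an empirical exponent of this kind can be estimated from timing measurements by a least-squares fit in log-log space, and what an exponent of 0.931 implies when the data set grows tenfold. The timings are synthetic values generated from the power law itself, and the function names and the constant 0.002 are hypothetical; none of this reproduces the benchmark behind the reported figure.

```python
import math


def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def runtime_growth(size_ratio, exponent=0.931):
    """Factor by which the running time grows when n grows by size_ratio."""
    return size_ratio ** exponent


# Synthetic timings that follow t = c * n^0.931 exactly (not measured data).
sizes = [10_000, 100_000, 1_000_000]
times = [0.002 * n ** 0.931 for n in sizes]
exponent = fit_power_law_exponent(sizes, times)

# With an exponent below 1, ten times more data costs less than ten times
# more time.
growth = runtime_growth(10)
```

On real measurements the fitted slope carries noise, so the exponent is best reported together with the range of \(n\) over which it was measured.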

All the TMAP visualizations presented, including installation and usage instructions, are available as interactive online versions (http://tmap.gdb.tools). The source code for TMAP is available on GitHub (https://github.com/reymond-group/tmap) and a Python package can be obtained using the conda package manager.

Availability of data and materials

The datasets generated during and/or analysed during the current study are available in the tmap repository, http://tmap.gdb.tools.

Abbreviations

DSSTox:

Distributed structure-searchable toxicity

ECFP:

Extended connectivity fingerprint

FDB17:

Fragment database 17

GDB17:

Generated database 17

GTM:

Generative topographic maps

LSH:

Locality sensitive hashing

MHFP:

MinHash fingerprint

MST:

Minimum spanning tree

NLPCA:

Nonlinear principal component analysis

NN:

Nearest neighbor

NNG:

Nearest neighbor graph

OGDF:

Open graph drawing framework

PANCAN:

Pan-cancer (The Cancer Genome Atlas Pan-Cancer analysis project)

PCA:

Principal component analysis

PDB:

Protein data bank

SMILES:

Simplified molecular input line entry specification

SOM:

Self-organizing maps

TMAP:

Tree MAP

t-SNE:

t-distributed stochastic neighbor embedding

UMAP:

Uniform manifold approximation and projection

References

  1. Callahan SP et al (2006) VisTrails: visualization meets data management. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data. ACM, pp 745–747. https://doi.org/10.1145/1142473.1142574

  2. Fox P, Hendler J (2011) Changing the equation on scientific data visualization. Science 331:705–708

  3. Michel J-B et al (2011) Quantitative analysis of culture using millions of digitized books. Science 331:176–182

  4. Keim D, Qu H, Ma K (2013) Big-data visualization. IEEE Comput Graphics Appl 33:20–21

  5. Costa FF (2014) Big data in biomedicine. Drug Disc Today 19:433–440

  6. Stephens ZD et al (2015) Big data: astronomical or genomical? PLoS Biol 13:e1002195

  7. Bikakis N, Sellis T (2016) Exploration and visualization in the web of big linked data: a survey of the state of the art. arXiv:1601.08059

  8. Kahles A et al (2018) Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34:211–224.e6

  9. Arús-Pous J et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20

  10. van der Maaten L, Postma EO, van den Herik HJ (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71

  11. Gaulton A et al (2017) The ChEMBL database in 2017. Nucleic Acids Res 45:D945–D954

  12. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52:2864–2875

  13. Visini R, Awale M, Reymond J-L (2017) Fragment database FDB-17. J Chem Inf Model 57:700–709

  14. Awale M, Visini R, Probst D, Arús-Pous J, Reymond J-L (2017) Chemical space: big data challenge for molecular diversity. Chimia 71:661–666

  15. Richard AM, Williams CR (2002) Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res 499:27–52

  16. Natural Products Atlas. https://www.npatlas.org/joomla/

  17. Wishart DS et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082

  18. Wu Z et al (2017) MoleculeNet: a benchmark for molecular machine learning. arXiv:1703.00564 [physics, stat]

  19. Oprea TI, Gottfries J (2001) Chemography: the art of navigating in chemical space. J Comb Chem 3:157–166

  20. Awale M, van Deursen R, Reymond J-L (2013) MQN-Mapplet: visualization of chemical space with interactive maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J Chem Inf Model 53:509–518

  21. Awale M, Reymond J-L (2015) Similarity Mapplet: interactive visualization of the directory of useful decoys and ChEMBL in high dimensional chemical spaces. J Chem Inf Model 55:1509–1516

  22. Jin X et al (2015) PDB-explorer: a web-based interactive map of the protein data bank in shape space. BMC Bioinform 16:339

  23. Awale M, Reymond J-L (2016) Web-based 3D-visualization of the DrugBank chemical space. J Cheminform 8:25

  24. Awale M, Probst D, Reymond J-L (2017) WebMolCS: a web-based interface for visualizing molecules in three-dimensional chemical spaces. J Chem Inf Model 57:643–649

  25. Probst D, Reymond J-L (2018) FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 34:1433–1435

  26. McInnes L, Healy J, Melville J (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [cs, stat]

  27. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  28. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507

  29. Bishop CM, Svensén M, Williams CKI (1998) GTM: the generative topographic mapping. Neural Comput 10:215–234

  30. Kohonen T (1997) Exploration of very large databases by self-organizing maps. In: Proceedings of international conference on neural networks (ICNN’97), vol 1, pp PL1–PL6

  31. Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on World wide web (WWW’11), p 577. ACM Press. https://doi.org/10.1145/1963405.1963487

  32. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

  33. Zhou Z et al (2018) GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res 28:1395–1404

  34. Lu J, Carlson HA (2016) ChemTreeMap: an interactive map of biochemical similarity in molecular datasets. Bioinformatics 32:3584–3592

  35. P’ng C et al (2019) BPG: seamless, automated and interactive visualization of scientific data. BMC Bioinform 20:42

  36. Idreos S, Papaemmanouil O, Chaudhuri S (2015) Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 277–281. https://doi.org/10.1145/2723372.2731084

  37. Andoni A, Razenshteyn I, Nosatzki NS (2017) LSH Forest: practical algorithms made theoretical. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 67–78 https://doi.org/10.1137/1.9781611974782.5

  38. Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on World Wide Web (WWW’05), p 651. ACM Press. https://doi.org/10.1145/1060745.1060840

  39. Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7:48–50

  40. Chimani M et al (2013) The open graph drawing framework (OGDF). Handbook Graph Draw Vis 2011:543–569

  41. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp 21–29. https://doi.org/10.1109/sequen.1997.666900

  42. Manber U (1994) Finding similar files in a large file system. In: Usenix Winter 1994 technical conference, pp 1–10

  43. Wu W, Li B, Chen L, Zhang C, Yu P (2017) Improved consistent weighted sampling revisited. arXiv:1706.01172 [cs]

  44. Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:20

  45. Probst D, Reymond J-L (2018) A probabilistic molecular fingerprint for big data settings. J Cheminform 10:66

  46. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754

  47. Riniker S, Landrum GA (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform 5:26

  48. Awale M, Reymond J-L (2019) Polypharmacology Browser PPB2: target prediction combining nearest neighbors with machine learning. J Chem Inf Model 59:10–17

  49. Binding DB (2014) BindingDB Entry 6310: Compounds and compositions as Syk kinase inhibitors. https://doi.org/10.7270/q24q7sns

  50. Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv:1408.2927 [cs]

  51. Marcais G, DeBlasio D, Pandey P, Kingsford C (2019) Locality sensitive hashing for the edit distance. bioRxiv. https://doi.org/10.1101/534446

  52. Probst D, Reymond J-L (2018) SmilesDrawer: parsing and drawing SMILES-encoded molecular structures using client-side JavaScript. J Chem Inf Model 58:1–7

  53. Ramakrishnan R, Hartmann M, Tapavicza E, von Lilienfeld OA (2015) Electronic spectra from TDDFT and machine learning in chemical space. J Chem Phys 143:084111

  54. Berman HM et al (2000) The protein data bank. Nucleic Acids Res 28:235–242

  55. Awale M, Reymond J-L (2014) Atom pair 2D-fingerprints perceive 3D-molecular shape and pharmacophores for very fast virtual screening of ZINC and GDB-17. J Chem Inf Model 54:1892–1907

  56. The Cancer Genome Atlas Research Network et al (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45:1113–1120

  57. Kustatscher G et al (2019) Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol 37:1361–1371

  58. Hanley MB, Lomas W, Mittar D, Maino V, Park E (2013) Detection of low abundance RNA molecules in individual cells by flow cytometry. PLoS ONE 8:e57002

  59. Roe BP et al (2005) Boosted decision trees as an alternative to artificial neural networks for particle identification. Nucl Instrum Methods Phys Res 543:577–584

  60. Bernhardsson E. Annoy (Approximate Nearest Neighbors Oh Yeah). https://github.com/spotify/annoy

  61. Lahiri S (2013) Complexity of word collocation networks: a preliminary structural analysis. arXiv:1310.5111 [physics]

Acknowledgements

This work was supported financially by the Swiss National Science Foundation, NCCR TransCure (Grant No. 51NF40-185544).

Author information

Contributions

DP designed and realized the study and wrote the paper. JLR supervised the study and wrote the paper. Both authors read and approved the final manuscript.

Corresponding authors

Correspondence to Daniel Probst or Jean-Louis Reymond.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Additional figures.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Probst, D., Reymond, JL. Visualization of very large high-dimensional data sets as minimum spanning trees. J Cheminform 12, 12 (2020). https://doi.org/10.1186/s13321-020-0416-x

  • DOI: https://doi.org/10.1186/s13321-020-0416-x

Keywords