Visualization of very large high-dimensional data sets as minimum spanning trees
Journal of Cheminformatics volume 12, Article number: 12 (2020)
Abstract
The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.
Introduction
The recent development of new and often very accessible frameworks and powerful hardware has enabled the implementation of computational methods to generate and collect large high-dimensional data sets and created an ever-increasing need to explore as well as understand these data [1,2,3,4,5,6,7,8,9]. Generally, large high-dimensional data sets are matrices where rows are samples and columns are measured variables, each column defining a dimension of the space which contains the data. Visualizing such data sets is challenging because reducing the dimensionality, which is required in order to make the data visually interpretable for humans, is both lossy and computationally expensive [10].
Large high-dimensional data sets are frequently used in the chemical sciences. For instance, the ChEMBL database (\(n = 1,159,881\)) of bioactive molecules from the scientific literature and their associated biological assay data is used daily in the area of drug discovery [11]. Further examples of large databases containing molecules include FDB17 (\(n = 10,101,204\)), a fragment-like subset of the enumerated database GDB17 listing theoretically possible molecules up to 17 atoms [12,13,14], and DSSTox (\(n = 848,816\)), containing molecules investigated for toxicity [15]. Examples of smaller data sets include the Natural Products Atlas (\(n = 24,594\)), collecting microbially-derived natural products [16]; DrugBank (\(n = 9300\)), listing molecules marketed or investigated as drugs [17]; and the MoleculeNet benchmark, containing a collection of 16 data sets of small organic molecules [18].
To visualize such databases, simple linear dimensionality reduction methods such as principal component analysis and similarity mapping readily produce 2D or 3D representations of global features [19,20,21,22,23,24,25]. However, local features defining the relationships between close or even nearest neighbor (NN) molecules, which are very important to understand the structure of data, are mostly lost, limiting the applicability of linear dimensionality reduction methods for visualization. The important NN relationships are much better preserved using nonlinear manifold learning algorithms, which assume that the data lies on a lower-dimensional manifold embedded within the high-dimensional space. Algorithms such as nonlinear principal component analysis (NLPCA), t-distributed stochastic neighbor embedding (t-SNE), and more recently uniform manifold approximation and projection (UMAP) are based on this assumption [26,27,28]. Other techniques used are probabilistic generative topographic maps (GTM) and self-organizing maps (SOM), which are based on artificial neural networks [29, 30]. However, these algorithms have time complexities ranging from at least \(O\left( {n^{1.14} } \right)\) to \(O\left( {n^{5} } \right)\), limiting the size of the data sets that can be visualized [31]. The same limitations in terms of data set size apply when distributing data in a tree by implementing the neighbor joining algorithm or similar methods used to create phylogenetic trees [32, 33]. This limiting behavior has been documented by the ChemTreeMap tool, which can only visualize up to approximately 10,000 data points (molecules or clusters of molecules) [34]. Due to the described challenges, large scientific data sets are generally visualized in aggregated or reduced form [35, 36].
Here we present an algorithm, named TMAP (Tree MAP), to generate and distribute intuitive visualizations of large data sets, on the order of up to \(10^{7}\) data points with arbitrary dimensionality, as a tree. Our method is based on a combination of locality sensitive hashing, graph theory, and modern web technology, and also integrates into established data analysis and plotting workflows. This tree-based layout facilitates visual inspection of the data with a high resolution by explicitly visualizing the closest distance between clusters and the detailed structure of clusters through branches and subbranches. We demonstrate the performance of TMAP with toy data sets from computer graphics and with ChEMBL subsets of different size and composition, and show that it surpasses comparable algorithms such as t-SNE and UMAP in terms of time and space complexity. We further exemplify the use of TMAP for visualizing large high-dimensional data sets from chemistry as well as from further scientific fields (Table 1).
Methods
Given an arbitrary data set as an input, TMAP encompasses four phases: (I) LSH forest indexing [37, 38], (II) construction of a \(c\)-approximate \(k\)-nearest neighbor graph, (III) calculation of a minimum spanning tree (MST) of the \(c\)-approximate \(k\)-nearest neighbor graph [39], and (IV) generation of a layout for the resulting MST [40].
During phase I, the input data are indexed in an LSH forest data structure, enabling \(c\)-approximate \(k\)-nearest neighbor (kNN) searches with a time complexity sublinear in \(n\). Text and binary data are encoded using the MinHash algorithm, while integer and floating-point data are encoded using a weighted variation of the algorithm [41,42,43]. The LSH forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions \(d\) used in encoding the data, and the number of prefix trees \(l\). An increase in the values of both parameters leads to an increase in main memory usage; however, higher values for \(l\) also decrease query speed. The effect of parameters \(d\) and \(l\) on the final visualization is shown in Additional file 1: Fig. S1. The use of a combination of (weighted) MinHash and LSH forest, which supports fast estimation of the Jaccard distance between two binary sets, has been shown to perform very well for molecules [44]. Note that other data structures and algorithms implementing a variety of different distance metrics may show better performance on other data and can be used as drop-in replacements of phase I.
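To make the MinHash step concrete, the following is a minimal, self-contained sketch of how a signature of \(d\) hash values can estimate the Jaccard similarity of two binary fingerprints. It is not the TMAP implementation: the salted BLAKE2b hash family, the function names, and the toy fingerprints are assumptions chosen for illustration, and no LSH forest is built.

```python
import hashlib


def _hash(item, i):
    """i-th hash function (assumed family): salted 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(
        str(item).encode(), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, d=128):
    """Return d minimum hash values for a non-empty set of items."""
    return [min(_hash(x, i) for x in items) for i in range(d)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


# Toy fingerprints given as sets of on-bit indices (hypothetical data).
fp_a = set(range(0, 60))
fp_b = set(range(30, 90))
sig_a = minhash_signature(fp_a, d=512)
sig_b = minhash_signature(fp_b, d=512)

print(exact_jaccard(fp_a, fp_b))        # exact similarity, 30 / 90
print(estimated_jaccard(sig_a, sig_b))  # MinHash estimate of the same quantity
```

A larger \(d\) reduces the variance of the estimate at the cost of longer signatures, which is consistent with the memory behaviour described above.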
In phase II, an undirected weighted \(c\)-approximate \(k\)-nearest neighbor graph (\(c\)–\(k\)NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks is used to increase efficiency [45]. The \(c\)–\(k\)NNG construction phase takes two arguments, namely \(k\), the number of nearest neighbors to be searched for, and \(k_{c}\), the factor used by the augmented query algorithm. The variant of the query algorithm increases the time complexity of a single query from \(O\left( {\log n} \right)\) to \(O\left( {k \cdot k_{c} + \log n} \right)\), resulting in an overall time complexity of \(O\left( {n\left( {k \cdot k_{c} + \log n} \right)} \right)\), where practically \(k \cdot k_{c} > \log n\), for the \(c\)–\(k\)NNG construction. The edges of the \(c\)–\(k\)NNG are assigned the Jaccard distance of their incident vertices as their weight. Depending on the distribution and the hashing of the data, the \(c\)–\(k\)NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to highly connected clusters of size \(\ge k\) in the Jaccard space, separate connected components are created. However, the following phases are agnostic to whether this phase yields a disconnected graph. The effect of parameters \(k\) and \(k_{c}\) on the final visualization is shown in Additional file 1: Fig. S2. Alternatively, an arbitrary undirected graph can be supplied to the algorithm as a (weighted) edge list.
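The structure of the graph built in this phase can be illustrated with a small sketch. Note the assumptions: it computes an exact \(k\)-nearest neighbor graph by brute force rather than a \(c\)-approximate one via an LSH forest, the function names and toy sets are hypothetical, and dropping edges with a Jaccard distance of 1.0 is one simple way to reproduce the outlier case described above.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_graph(points, k):
    """Exact undirected kNN graph as {(u, v): weight} with u < v."""
    edges = {}
    for u, p in enumerate(points):
        # Rank all other points by their distance to point u.
        ranked = sorted(
            (jaccard_distance(p, q), v) for v, q in enumerate(points) if v != u
        )
        for dist, v in ranked[:k]:
            if dist < 1.0:  # assumption: sets sharing nothing stay unconnected
                edges[(min(u, v), max(u, v))] = dist
    return edges


# Hypothetical toy data: two clusters and one outlier (index 5).
points = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {10, 11}, {10, 11, 12}, {99}]
graph = knn_graph(points, k=2)
for (u, v), w in sorted(graph.items()):
    print(u, v, round(w, 3))
```

The resulting graph is disconnected: vertices 0 to 2 and vertices 3 to 4 form separate components, and vertex 5 has no edges at all.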
During phase III, a minimum spanning tree (MST) is constructed on the weighted \(c\)–\(k\)NNG using Kruskal’s algorithm, which represents the central and differentiating phase of the described algorithm. Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low-dimensional embedding. The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage, properties which are also desirable in data visualization. The time complexity of Kruskal’s algorithm is \(O\left( {E\log V} \right)\), rendering this phase negligible compared to phase II in terms of execution time. In the case of a disconnected \(c\)–\(k\)NNG, a minimum spanning forest is created.
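Kruskal’s algorithm can be sketched in a few lines with a union-find structure. The version below is an illustrative assumption, not the code used by TMAP; the function name and the toy edge list are hypothetical. Because it simply skips edges that would close a cycle, it returns a minimum spanning forest when the input graph is disconnected.

```python
def kruskal(num_vertices, edges):
    """Minimum spanning forest of an undirected graph.

    edges is an iterable of (weight, u, v) tuples.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Find the component root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):  # greedy: cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:       # the edge joins two different components
            parent[root_u] = root_v
            forest.append((w, u, v))
    return forest


# Hypothetical weighted graph with 6 vertices; vertex 5 is isolated.
edges = [(0.5, 0, 1), (0.8, 0, 2), (0.5, 1, 2), (0.3, 3, 4)]
forest = kruskal(6, edges)
print(forest)
```

Sorting the edge list dominates the running time of this sketch; the union-find operations are close to constant time per edge.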
Phase IV lays out the tree on the Euclidean plane. Because the MST is unrooted, and to keep the drawing compact, the tree is drawn using a graph layout algorithm rather than a tree layout algorithm. In order to draw MSTs of considerable size (millions of vertices), a spring-electrical model layout algorithm with multilevel multipole-based force approximation is applied. This algorithm is provided by the open graph drawing framework (OGDF), a modular C++ library [40]. In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements. Whereas several parameters can be configured for the layout phase, only parameter \(p\) must be adjusted based on the size of the input data set (Additional file 1: Fig. S3). This phase constitutes the bottleneck regarding computational complexity.
Results and discussion
TMAP performance assessment with toy data sets and ChEMBL subsets
The quality of our TMAP algorithm is first assessed by comparing TMAP and UMAP visualizations of the common benchmarking data sets MNIST, FMNIST, and COIL20 (Fig. 1). UMAP generally represents clusters as tightly packed patches and tries to reach maximal separation between them. On the other hand, TMAP visualizes the relations between, as well as within, clusters as branches and subbranches. While UMAP can represent the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins subsets through one or more edges of smallest difference (Fig. 1a, b). However, the plot shows that this removal of local connectivity leads to an untangling of highly similar data (shown in dark green, orange, dark red, dark purple, and light blue). This behavior has been assessed and compared to UMAP in Additional file 1: Figures S4 and S5, where it is shown that both TMAP and UMAP have to sacrifice locality preservation for more complex examples. For the MNIST and FMNIST data sets, the tree structure results in a higher resolution of both variances and errors within clusters, as it becomes apparent how subclusters (branches within clusters) are linked and which true positives connect to false positives (Fig. 1c–f).
In a second, more applied comparison example, we visualize data from ChEMBL using TMAP and UMAP. For this analysis, molecular structures are encoded using ECFP4 (extended connectivity fingerprint up to 4 bonds, 512-D binary vector), a molecular fingerprint that encodes circular substructures and performs well in virtual screening and target prediction [46,47,48]. We consider a subset \(S_{t}\) of the top 10,000 ChEMBL compounds by insertion date, as well as a random subset \(S_{r}\) of 10,000 ChEMBL molecules.
Taking the more homogeneous set \(S_{t}\) as an input, the 2D maps produced by each method, plotted using the Python library matplotlib, illustrate that TMAP, which distributes clusters in branches and subbranches of the MST, produces a much more even distribution of compounds on the canvas compared to UMAP, thus enabling better visual resolution (Fig. 2a, b). Furthermore, in a visualization of the heterogeneous set \(S_{r}\), nearest neighbor relationships (locality) are better preserved in TMAP compared to UMAP, as illustrated by the positioning of the 20 structurally nearest neighbors of compound CHEMBL370160 [2, 49] reported as a potent inhibitor of human tyrosine-protein kinase SYK. The 20 structurally nearest neighbors are defined as the 20 nearest neighbors in the original 512-dimensional fingerprint space. TMAP directly connects the query compound to three of the 20 nearest neighbors, CHEMBL3701630, CHEMBL3701611, and CHEMBL38911457, its nearest, second nearest, and 15th nearest neighbor, respectively. The nearest neighbors 1 through 7 are all within a topological distance of 3 around the query (Fig. 2c). In contrast, UMAP has positioned nearest neighbors 2, 3, 9, and 18, among several even more distant data points, closer to the query than the nearest neighbor from the original space (Fig. 2d). Indeed, TMAP preserves locality in terms of retaining 1-nearest neighbor relationships much better than UMAP, applying both topological and Euclidean metrics (Fig. 2e, f; Additional file 1: Fig. S6). The quality of the preservation of locality largely depends on parameter \(d\), with adjustments to parameters \(k\) and \(k_{c}\) only having a minor influence (Additional file 1: Fig. S7). Moreover, TMAP yields reproducible results when running on identical parameters and input data, whereas results of comparable algorithms such as UMAP change considerably with every run (Additional file 1: Fig. S8) [26].
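The topological distance used in this comparison is the number of edges on the path between two vertices of the tree. A minimal sketch of how it could be computed with a breadth-first search follows; the toy tree, the vertex labels, and the ranked neighbor list are hypothetical and do not correspond to the ChEMBL compounds above.

```python
from collections import deque


def tree_distances(adjacency, source):
    """Hop counts from source to every reachable vertex (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical toy tree given as an edge list; "q" is the query vertex.
tree_edges = [("q", "a"), ("q", "b"), ("a", "c"), ("c", "d"), ("b", "e")]
adjacency = {}
for u, v in tree_edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

hops = tree_distances(adjacency, "q")

# Hypothetical nearest neighbors of "q" in the original space, ranked by distance.
neighbors = ["a", "c", "e", "d"]
within_2 = [n for n in neighbors if hops[n] <= 2]
print(within_2)
```

Counting how many of the original-space nearest neighbors fall within a given number of hops gives a simple locality score for a tree layout.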
In terms of calculation times, while TMAP and UMAP have comparable running time \(t\) and memory usage \(a\) for small random subsets of the 512-D ECFP-encoded ChEMBL data set with sizes \(n = 10,000\) and \(n = 100,000\), TMAP significantly outperforms UMAP for larger random subsets (\(n = 500,000\) and \(n = 1,000,000\)) (Fig. 2h, i). Further insight into the computational behavior of TMAP is provided by analyzing running times for the different phases based on a larger subset (\(n = 1,000,000\)) of the ECFP4-encoded ChEMBL data set (Fig. 2g). During phase I of the algorithm, which accounts for \(180{\text{s}}\) of the execution time and approximately \(5{\text{GB}}\) of main memory usage, data is loaded and indexed in the LSH forest data structure in chunks of 100,000, as expressed by 10 distinct jumps in memory consumption. The construction of the \(c\)–\(k\)NNG during phase II requires a negligible amount of main memory and takes approximately \(110{\text{s}}\). During 10 s of execution time, MST creation (phase III) occupies a further \(2{\text{GB}}\) of main memory, of which approximately \(1{\text{GB}}\) is retained to store the tree data structure. The graph layout algorithm (phase IV) requires \(2{\text{GB}}\) throughout \(55{\text{s}}\), after which the algorithm completes with a total wall clock run time of \(355{\text{s}}\) and peak main memory usage of \(8.553{\text{GB}}\).
Note that TMAP supports Jaccard similarity estimation through MinHash and weighted MinHash for binary and weighted sets, respectively. While the Jaccard metric is very suitable for chemical similarity calculations based on molecular fingerprints, the metric may not be the best option for problems presented by other data sets. However, there exists a wide range of LSH families supporting distance and similarity metrics such as Hamming distance, \(l_{p}\) distance, Levenshtein distance, or cosine similarity, which are compatible with TMAP [50, 51]. Furthermore, the modularity of TMAP allows arbitrary nearest-neighbor-graph creation techniques to be plugged in, or existing graphs to be loaded from files.
TMAPs of small molecule data sets: ChEMBL, FDB17, DSSTox, and the Natural Products Atlas
The high performance and relatively low memory usage of TMAP, as well as the ability to generate highly detailed and interpretable representations of high-dimensional data sets, is illustrated here by interactive visualization of a series of small molecule data sets available in the public domain. In these examples we use MHFP6 (512 MinHash permutations), a molecular fingerprint related to ECFP4 but with better performance for virtual screening tasks and the ability to be directly indexed in an LSH forest data structure, which considerably speeds up computation for large data sets [45].
As a first example, we discuss the TMAP of the full data set of the ChEMBL database containing the 1.13 million ChEMBL compounds associated with biological assay data. TMAP completes the calculation within 613 s with a peak memory usage of 20.562 GB. Note that approximately half of the main memory usage is accounted for by SMILES, activities, and biological entity classes which are loaded for later use in the visualization. To facilitate data analysis, the coordinates computed by TMAP are exported as an interactive portable HTML file using Faerun, where molecules are displayed using the JavaScript library SmilesDrawer (Fig. 3a) [25, 52].
Analyzing the distribution of molecules on the tree shows that TMAP groups molecules according to their structure and their biological activity, accurately reflecting similarities calculated in the high-dimensional MHFP6 space. This is well illustrated for a subset of the map (Fig. 3a, inset). In this area of the map, data points in cyan indicate molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters in two connected branches (right side of inset), while data points in orange show inhibitors of the phenylethanolamine N-methyltransferase (PNMT) (left side of inset), and red and dark blue data points indicate nicotinic acetylcholine receptor (nAChR) ligands and cytochrome P450 (CYP) inhibitors, respectively.
As a second example, we visualize the ChEMBL set merged with FDB17 (\(n = 10,101,204\)) into a superset of size \(n = 11,261,085\) (Fig. 3b), which corresponds to the largest data set that TMAP can successfully handle. As above, the TMAP 2D layout accurately reflects structural and functional similarities computed in the high-dimensional MHFP6 space. In this TMAP visualization, the majority of ChEMBL compounds accumulate in closely connected clusters (branches) due to the prevalence of aromatic carbocycles. A notable exception is a relatively sizable branch of steroids and steroid-like compounds, which is connected to a branch of FDB17 molecules containing non-aromatic 5-membered carbocycles and ketones (Fig. 3b, inset). Many more detailed insights can be gained by inspecting the interactive map in Faerun (http://tmapfdb.gdb.tools).
Further examples include MHFP6-encoded compounds from the Distributed Structure-Searchable Toxicity (DSSTox) Database (\(n = 848,816\)) and the Natural Products Atlas (\(n = 24,594\)). Visualizing DSSTox and coloring the resulting tree by toxicity rating, TMAP creates several subtrees and branches representing structural regions with a high incidence of highly toxic compounds (shown in red, Fig. 3c). An example of such a subtree contains naphthalenes and other polycyclic aromatic hydrocarbons (Fig. 3c, inset). The TMAP tree of the Natural Products Atlas was colored according to origin genus and reveals that branches and subbranches containing distinct substructures usually correlate with a certain genus, such as the various combinations of phenols, fused cyclopentanes, lactones and steroids produced by the fungal genus Ganoderma (colored purple in Fig. 3d, inset).
Visualization of the MoleculeNet benchmark data sets
We further use TMAP to visualize MoleculeNet, a benchmark for molecular machine learning which has found wide adoption in cheminformatics and encompasses 16 data sets ranging in size and composition (Table 1) [18]. As for the other small molecule data sets above, we computed MHFP6 fingerprints of the associated molecules and the corresponding TMAPs, which we then color-coded according to various numerical values available in the benchmarks. The procedure was applied to all MoleculeNet data sets except for QM7/b, for which no SMILES have been provided.
The resulting TMAP representations, accessible at the TMAP website (http://tmap.gdb.tools), reveal the detailed structure of the data sets as well as the behaviour of methods applied to these data sets as a function of the chemical structures of the molecules. For example, TMAPs of the QM8 and QM9 data sets (\(n = 21,786\) and \(n = 133,885\)), which contain small molecules and DFT-modelled parameters, reveal relationships between molecular structures and the various computed physicochemical values. For instance, the TMAP of the QM8 data set color-coded by the oscillator strengths of the lowest two singlet electronic states reveals how the value correlates with molecular structure and explains the performance differences in machine learning models trained on Coulomb matrices versus those trained on structure-sensitive molecular fingerprints [53]. In the case of the ESOL data set containing measured and calculated water solubility values of common small molecules (\(n = 1128\)), its TMAP color-coded with the difference between computed and measured values reveals the limitation of the ESOL model when estimating solubility of polycyclic aromatic hydrocarbons and compounds containing pyridines. For the FreeSolv data set (\(n = 642\)) containing small molecules and their measured and calculated hydration free energy in water, the TMAP visualization hints at possible limitations of the method when calculating hydration free energies of sugars. Finally, for the MUV data set (\(n = 93,087\)), which contains active small drug-like molecules against 17 different protein targets mixed in each case with inactive decoy molecules, the various TMAPs reveal differences in the structural distribution of actives among decoys. Actives are usually well distributed but appear to form clusters in certain subsets (e.g. MUV-548 and MUV-846), explaining the generally higher performance of fingerprint benchmarks for these subsets [47].
Application to other scientific data sets
We further illustrate the general applicability of TMAP to visualize data sets from the fields of linguistics, biology, and particle physics. All produced maps are available as interactive Faerun plots on the TMAP website (http://tmap.gdb.tools).
Our first example concerns visualization of the RCSB Protein Data Bank, which contains experimental 3D structures of proteins and nucleic acids (\(n = 131,236\)) [54]. The PDB files were extracted from the Protein Data Bank and encoded using the protein shape fingerprint 3DP (136-D integer vector, 256 weighted MinHash samples). 3DP encodes the structural shape of large molecules stored as PDB files based on through-space distances of atoms [22]. Processing data extracted from the PDB and indexed using a weighted variant of MinHash demonstrates the ability of TMAP to visualize both global and local structure, improving on previous efforts at visualizing the database [22, 55]. The global structure of the 3DP-encoded PDB data is dominated by the size (heavy atom count) of the proteins (Fig. 4a). The local structure, on the other hand, is defined by properties such as the fraction of negative charges (Fig. 4b).
As an additional example from biology, we consider the PANCAN data set (\(n = 800\), \(d = 20,531\)), which consists of gene expressions of patients having different types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA), randomly extracted from the cancer genome atlas database [56]. Here we index the PANCAN data directly using the LSH forest data structure and weighted MinHash. The resulting map shows that the algorithm successfully differentiates tumor types based on RNA sequencing data (Fig. 4c). We also visualize the ProteomeHD data set using TMAP [57]. This data set consists of co-regulation scores of 5013 proteins, annotated with their respective cellular location. In addition to the ProteomeHD data set, Kustatscher et al. also released an R script to create a map of the set using t-SNE which took a total of 400 s to complete; in contrast, TMAP visualized the data set within 32 s (Fig. 4d), successfully clustering proteins by their cellular location based on co-regulation scores. As a further biology example, our TMAP webpage also features flow cytometry measurements (\(n = 436,877\), \(d = 14\)), exemplifying the method’s application to the visualization of relatively low-dimensional data [17, 58].
As an example from physics, we represent the MiniBooNE data set (\(n = 130,065\), \(d = 50\)), which consists of measurements extracted from Fermilab’s MiniBooNE experiment and contains the detection of signal (electron neutrinos) and background (muon neutrinos) events [59]. As the attributes in MiniBooNE are real numbers, we use the Annoy indexing library which supports the cosine metric in phase I of the algorithm to index the data for \(k\)NNG construction, which demonstrates the modularity of TMAP [60]. This example reflects the independence of the MST and layout phases of the algorithm from the input data, displaying the distribution of the signal over the background data (Fig. 5a).
Outside of the natural sciences, we use TMAP to visualize the GUTENBERG set as an example of a data set from linguistics. This data set features a selection of \(n = 3036\) books by 142 authors written in English [61]. To analyze this data, we define a book fingerprint as a dense-form binary vector indicating which words from the universe of all words extracted from all books occurred at least once in a given book (yielding a dimensionality of \(d = 1,217,078\)), and index this book fingerprint using the LSH forest data structure with MinHash. The visualization of the GUTENBERG data set exemplifies the ability of TMAP to handle input with extremely high dimensionality (\(d = 1,217,078\)) efficiently (Fig. 5b). The works of different authors tend to populate specific branches, with notable expected exceptions such as the autobiography of Charles Darwin, which does not lie on the same branch as all his other works. Meanwhile, the works of Alfred Russel Wallace are found on subbranches of the Darwin branch.
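The book fingerprint described here can be pictured with a small sketch. It is an illustration under stated assumptions, not the preprocessing used for the GUTENBERG set: the tokenizer (lowercase runs of the letters a to z), the function names, and the three toy "books" are hypothetical. Storing the set of distinct words is equivalent to storing the on-bits of the binary vector over the full vocabulary.

```python
import re


def book_fingerprint(text):
    """Set of distinct lowercase words, i.e. the on-bits of the binary vector."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_distance(a, b):
    """Jaccard distance between two word sets (0.0 for two empty sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Hypothetical toy corpus; real books would contribute thousands of words each.
books = {
    "book_a": "The origin of species by means of natural selection",
    "book_b": "Natural selection and the origin of species",
    "book_c": "A tale of two cities",
}
fingerprints = {title: book_fingerprint(text) for title, text in books.items()}

# The dimensionality d is the size of the vocabulary over all books.
vocabulary = sorted(set().union(*fingerprints.values()))
print(len(vocabulary))
print(jaccard_distance(fingerprints["book_a"], fingerprints["book_b"]))
print(jaccard_distance(fingerprints["book_a"], fingerprints["book_c"]))
```

Books sharing most of their vocabulary end up at a small Jaccard distance, which is what places works by the same author on neighboring branches.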
Related to linguistics, the TMAP webpage further features a map of the distribution of different scientific journals (Nature, Cell, Angewandte Chemie, Science, the Journal of the American Chemical Society, and Demography) over the entire PubMed article space (\(n = 327,628\), \(d = 1,633,762\)), revealing specialization, diversification, and overlaps; as well as a TMAP of the NeurIPS conference papers (\(n = 7,241\), \(d = 225,423\)), visualizing the increase in occurrence of the word “deep” in conference paper abstracts over time (1987–2016).
Conclusion
In this study, we introduced TMAP as a visualization method for very large, high-dimensional data sets enabling high data interpretability by preserving and visualizing both global and local features. By using TMAP in combination with the MHFP6 fingerprint, we can visualize databases of millions of organic small molecules and the associated property data with a high degree of resolution, which was not possible with previous methods. TMAP is also well-suited to visualize arbitrary data sets such as images, text, or RNA-seq data, hinting at its usefulness in a wide range of fields including computational linguistics or biology.
TMAP excels with its low memory usage and running time, with performance superior to other visualization algorithms such as t-SNE, UMAP or PCA. Because the available parameters can be adjusted to trade off output quality against memory usage, TMAP does not require specialized hardware for high-quality visualizations of data sets containing millions of data points. Most importantly, TMAP generates visualizations with an empirical sublinear time complexity of \(O\left( {n^{0.931} } \right)\), allowing the visualization of much larger high-dimensional data sets than previous methods.
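An empirical complexity of the form \(O\left( {n^{b} } \right)\) is typically estimated as the slope of a straight-line fit of \(\log t\) against \(\log n\). The sketch below shows one way to do this with ordinary least squares. The timings are synthetic values generated from a known power law with exponent 0.9, not measurements from this study, and the function name is hypothetical.

```python
import math


def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic timings from t = 0.002 * n**0.9 (assumed values, NOT measured data).
sizes = [10_000, 100_000, 500_000, 1_000_000]
times = [0.002 * n ** 0.9 for n in sizes]

exponent = fit_power_law_exponent(sizes, times)
print(round(exponent, 3))  # recovers the exponent used to generate the data
```

An exponent below 1 means that doubling the data set size less than doubles the running time over the measured range.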
All the TMAP visualizations presented, including installation and usage instructions, are available as interactive online versions (http://tmap.gdb.tools). The source code for TMAP is available on GitHub (https://github.com/reymondgroup/tmap) and a Python package can be obtained using the conda package manager.
Availability of data and materials
The datasets generated during and/or analysed during the current study are available in the tmap repository, http://tmap.gdb.tools.
Abbreviations
 DSSTox:

Distributed structure-searchable toxicity
 ECFP:

Extended connectivity fingerprint
 FDB17:

Fragment database 17
 GDB17:

Generated database 17
 GTM:

Generative topographic maps
 LSH:

Locality sensitive hashing
 MHFP:

MinHash fingerprint
 MST:

Minimum spanning tree
 NLPCA:

Nonlinear principal component analysis
 NN:

Nearest neighbor
 NNG:

Nearest neighbor graph
 OGDF:

Open graph drawing framework
 PANCAN:

Pancreatic cancer action network
 PCA:

Principal component analysis
 PDB:

Protein data bank
 SMILES:

Simplified molecular input line entry specification
 SOM:

Self-organizing maps
 TMAP:

Tree MAP
 t-SNE:

t-distributed stochastic neighbor embedding
 UMAP:

Uniform manifold approximation and projection
References
 1.
Callahan SP et al (2006) VisTrails: visualization meets data management. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data. ACM. pp 745–747. https://doi.org/10.1145/1142473.1142574
 2.
Fox P, Hendler J (2011) Changing the equation on scientific data visualization. Science 331:705–708
 3.
Michel JB et al (2011) Quantitative analysis of culture using millions of digitized books. Science 331:176–182
 4.
Keim D, Qu H, Ma K (2013) Big-data visualization. IEEE Comput Graphics Appl 33:20–21
 5.
Costa FF (2014) Big data in biomedicine. Drug Disc Today 19:433–440
 6.
Stephens ZD et al (2015) Big data: astronomical or genomical? PLoS Biol 13:e1002195
 7.
Bikakis N, Sellis T (2016) Exploration and visualization in the web of big linked data: a survey of the state of the art. arXiv:1601.08059
 8.
Kahles A et al (2018) Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34:211–224.e6
 9.
Arús-Pous J et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20
 10.
van der Maaten L, Postma EO, van den Herik HJ (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71
 11.
Gaulton A et al (2017) The ChEMBL database in 2017. Nucleic Acids Res 45:D945–D954
 12.
Ruddigkeit L, van Deursen R, Blum LC, Reymond JL (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52:2864–2875
 13.
Visini R, Awale M, Reymond JL (2017) Fragment database FDB17. J Chem Inf Model 57:700–709
 14.
Awale M, Visini R, Probst D, Arús-Pous J, Reymond JL (2017) Chemical space: big data challenge for molecular diversity. Chimia 71:661–666
 15.
Richard AM, Williams CR (2002) Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res 499:27–52
 16.
Natural Products Atlas. https://www.npatlas.org/joomla/
 17.
Wishart DS et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082
 18.
Wu Z et al (2017) MoleculeNet: a benchmark for molecular machine learning. arXiv:1703.00564 [physics, stat]
 19.
Oprea TI, Gottfries J (2001) Chemography: the art of navigating in chemical space. J Comb Chem 3:157–166
 20.
Awale M, van Deursen R, Reymond JL (2013) MQN-Mapplet: visualization of chemical space with interactive maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J Chem Inf Model 53:509–518
 21.
Awale M, Reymond JL (2015) Similarity Mapplet: interactive visualization of the directory of useful decoys and ChEMBL in high dimensional chemical spaces. J Chem Inf Model 55:1509–1516
 22.
Jin X et al (2015) PDB-explorer: a web-based interactive map of the protein data bank in shape space. BMC Bioinform 16:339
 23.
Awale M, Reymond JL (2016) Web-based 3D-visualization of the DrugBank chemical space. J Cheminform 8:25
 24.
Awale M, Probst D, Reymond JL (2017) WebMolCS: a web-based interface for visualizing molecules in three-dimensional chemical spaces. J Chem Inf Model 57:643–649
 25.
Probst D, Reymond JL (2018) FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 34:1433–1435
 26.
McInnes L, Healy J, Melville J (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [cs, stat]
 27.
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
 28.
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507
 29.
Bishop CM, Svensén M, Williams CKI (1998) GTM: the generative topographic mapping. Neural Comput 10:215–234
 30.
Kohonen T (1997) Exploration of very large databases by self-organizing maps. In: Proceedings of international conference on neural networks (ICNN’97), vol 1, pp PL1–PL6
 31.
Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on World Wide Web (WWW’11). ACM Press, p 577. https://doi.org/10.1145/1963405.1963487
 32.
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
 33.
Zhou Z et al (2018) GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res 28:1395–1404
 34.
Lu J, Carlson HA (2016) ChemTreeMap: an interactive map of biochemical similarity in molecular datasets. Bioinformatics 32:3584–3592
 35.
P’ng C et al (2019) BPG: seamless, automated and interactive visualization of scientific data. BMC Bioinform 20:42
 36.
Idreos S, Papaemmanouil O, Chaudhuri S (2015) Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 277–281. https://doi.org/10.1145/2723372.2731084
 37.
Andoni A, Razenshteyn I, Nosatzki NS (2017) LSH Forest: practical algorithms made theoretical. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 67–78. https://doi.org/10.1137/1.9781611974782.5
 38.
Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on World Wide Web (WWW’05). ACM Press, p 651. https://doi.org/10.1145/1060745.1060840
 39.
Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7:48–50
 40.
Chimani M et al (2013) The open graph drawing framework (OGDF). Handbook Graph Draw Vis 2011:543–569
 41.
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp 21–29. https://doi.org/10.1109/sequen.1997.666900
 42.
Manber U (1994) Finding similar files in a large file system. In: Usenix Winter 1994 technical conference, pp 1–10
 43.
Wu W, Li B, Chen L, Zhang C, Yu P (2017) Improved consistent weighted sampling revisited. arXiv:1706.01172 [cs]
 44.
Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:20
 45.
Probst D, Reymond JL (2018) A probabilistic molecular fingerprint for big data settings. J Cheminform 10:66
 46.
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
 47.
Riniker S, Landrum GA (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform 5:26
 48.
Awale M, Reymond JL (2019) Polypharmacology Browser PPB2: target prediction combining nearest neighbors with machine learning. J Chem Inf Model 59:10–17
 49.
BindingDB (2014) BindingDB Entry 6310: Compounds and compositions as Syk kinase inhibitors. https://doi.org/10.7270/q24q7sns
 50.
Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv:1408.2927 [cs]
 51.
Marcais G, DeBlasio D, Pandey P, Kingsford C (2019) Locality sensitive hashing for the edit distance. https://doi.org/10.1101/534446
 52.
Probst D, Reymond JL (2018) SmilesDrawer: parsing and drawing SMILES-encoded molecular structures using client-side JavaScript. J Chem Inf Model 58:1–7
 53.
Ramakrishnan R, Hartmann M, Tapavicza E, von Lilienfeld OA (2015) Electronic spectra from TDDFT and machine learning in chemical space. J Chem Phys 143:084111
 54.
Berman HM et al (2000) The protein data bank. Nucleic Acids Res 28:235–242
 55.
Awale M, Reymond JL (2014) Atom pair 2D-fingerprints perceive 3D-molecular shape and pharmacophores for very fast virtual screening of ZINC and GDB-17. J Chem Inf Model 54:1892–1907
 56.
The Cancer Genome Atlas Research Network et al (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45:1113–1120
 57.
Kustatscher G et al (2019) Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol 37:1361–1371
 58.
Hanley MB, Lomas W, Mittar D, Maino V, Park E (2013) Detection of low abundance RNA molecules in individual cells by flow cytometry. PLoS ONE 8:e57002
 59.
Roe BP et al (2005) Boosted decision trees as an alternative to artificial neural networks for particle identification. Nucl Instrum Methods Phys Res 543:577–584
 60.
Bernhardsson E. Annoy (Approximate Nearest Neighbors Oh Yeah). https://github.com/spotify/annoy
 61.
Lahiri S (2013) Complexity of word collocation networks: a preliminary structural analysis. arXiv:1310.5111 [physics]
Acknowledgements
This work was supported financially by the Swiss National Science Foundation, NCCR TransCure (Grant No. 51NF40-185544).
Author information
Contributions
DP designed and realized the study and wrote the paper. JLR supervised the study and wrote the paper. Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Probst, D., Reymond, JL. Visualization of very large high-dimensional data sets as minimum spanning trees. J Cheminform 12, 12 (2020). https://doi.org/10.1186/s13321-020-0416-x
Keywords
 Data visualization
 Chemistry databases
 Algorithms
 Big data
 Dimensionality reduction