Visualization of very large high-dimensional data sets as minimum spanning trees

The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.


Main
2][3][4][5][6][7][8][9] Generally, large high dimensional data sets are matrices where rows are samples and columns are measured variables, each column defining a dimension of the space which contains the data.Visualizing such data sets is challenging because reducing the dimensionality, which is required in order to make the data visually interpretable for humans, is both lossy and computationally expensive. 10tabases of millions of molecules used in the area of drug discovery such as the ChEMBL database of bioactive molecules from the scientific literature and their associated biological assay data ( = 1,159,881), 11 from which mathematical representations of chemical structures in the form of molecular fingerprints (high-dimensional binary or integer vectors, encoding structure or composition) are calculated, 12 represent a typical case of need.
The problem extends to even larger molecular databases, as exemplified here for FDB17, a database of 10.1 million theoretically possible fragment-like molecules of up to 17 atoms, 13 as well as for databases of biomolecules such as the RSCB Protein Databank. 146][17][18] However, local features defining the relation between close or even nearest neighbor (NN) molecules, which are very important in drug research, are mostly lost, limiting the applicability of linear dimensionality reduction methods for visualization.
The important NN relationships are much better preserved using non-linear manifold learning algorithms, which assume that the data lies on a lower-dimensional manifold embedded within the high-dimensional space.0][21] Other techniques used are probabilistic generative topographic maps (GTM) and self-organizing maps (SOM), which are based on artificial neural networks. 22,23However, these algorithms have time complexities between at least ( 1.14 ) and ( 5 ), limiting the size of to be visualized data sets. 24The same limitations in terms of data set size apply when distributing data in a tree by implementing the neighbor joining algorithm or similar methods used to create phylogenetic trees. 25,26This limiting behavior has been documented by the ChemTreeMap tool, which can only visualize up to approximately 10,000 data points (molecules or clusters of molecules). 27Due to the described challenges, large scientific data sets are generally visualized in aggregated or reduced form. 28,29re we present an algorithm, named TMAP (Tree MAP), to generate and distribute intuitive visualizations of large data sets in the order of up to 10 7 with arbitrary dimensionality in a tree based on a combination of locality sensitive hashing, graph theory, and modern web technology which also integrates into established data analysis and plotting workflows.This tree-based layout facilitates visual inspection of the data with a high resolution by explicitly visualizing the closest distance between clusters and the detailed structure of clusters through branches and sub-branches.We show that TMAP is superior to comparable algorithms such as t-SNE and UMAP in terms of time and space complexity.
Additionally, we argue that visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, an increased neighborhood-preservation, and the transparency of the methods the algorithm is based on.

Algorithm
Given an arbitrary data as an input, TMAP encompasses four phases: (I) LSH forest indexing, 30,31 (II) construction of a -approximate -nearest neighbor graph, (III) calculation of a minimum spanning tree (MST) of the -approximate -nearest neighbor graph, 32 and (IV) generation of a layout for the resulting MST. 33ring phase I, the input data are indexed in an LSH forest data structure, enabling approximate -nearest neighbor (k-NN) searches with a time complexity sub-linear in .5][36] The LSH Forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions  used in encoding the data, and the number of prefix trees .An increase in the values of both parameters lead to an increase main memory usage; however, higher values for  also decrease query speed.The effect of parameters  and  on the final visualization is shown in Fig. S1.The use of a combination of (weighted) MinHash and LSH Forest, which supports fast estimation of the Jaccard distance between two binary sets, has been shown to perform very well for molecules. 37Note, however, that other data structures and algorithms implementing a variety of different distance metrics may show better performance on other data and can be used as a drop-in replacements of phase I.
In phase II, an undirected weighted -approximate -nearest neighbor graph (--NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks, 38 is used to increase efficiency.The --NNG construction phase takes two arguments, namely , the number of nearest-neighbors to be searched for, and   , the factor used by the augmented query algorithm.This variant of the query algorithm increases the time complexity of a single query from (log ) to ( •   + log ), resulting in an overall time complexity of (( •   + log )), where practically  •   > log , for the --NNG construction.The edges of the --NNG are assigned the Jaccard distance of their incident vertices as their weight.Depending on the distribution and the hashing of the data, the --NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to clusters of size ≥  in the Jaccard space, connected components are created.However, the following phases are agnostic to whether this phase yields a disconnected graph.The effect of parameters  and   on the final visualization is shown in Fig. S2.Alternatively, an arbitrary graph can be supplied to the algorithm as a (weighted) edge list.
During phase III, a minimum spanning tree (MST) is constructed on the weighted --NNG using Kruskal's algorithm, which represents the central and differentiating phase of the described algorithm.Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low dimensional embedding.The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage-properties which are also desirable in data visualization.The time complexity of Kruskal's algorithm is ( + log ), rendering this phase negligible compared to phase II in terms of execution time.In the case of a disconnected --NNG, a minimum spanning forest is created.
Phase IV lays out the tree on the Euclidean plane.As the MST is unrooted and to keep the drawing compact, the tree is not visualized by applying a tree but a graph layout algorithm.In order to draw MSTs of considerable size (millions of vertices), a springelectrical model layout algorithm with multilevel multipole-based force approximation is applied.This algorithm is provided by the open graph drawing framework (OGDF), a modular C++ library.In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements.
Whereas several parameters can be configured for the layout phase, only parameter  must be adjusted based on the size of the input data set (Fig. S3).This phase constitutes the bottleneck regarding computational complexity.

TMAP performance assessment and comparison with UMAP
The quality of our TMAP algorithm is first illustrated by comparing TMAP and UMAP to visualize the common benchmarking data sets MNIST, FMNIST, and COIL20 (Fig. 1).
UMAP represents clusters as tightly packed patches and tries to reach maximal separation between them.On the other hand, TMAP visualizes the relations between as well as within clusters as branches and sub-branches.While UMAP is capable of representing the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins subsets through one or more edges of smallest difference (Fig. 1a, b).However, the plot shows that this removal of local connectivity leads to an untangling of highly similar data (shown in dark green, orange, dark red, dark purple, and light blue).For the MNIST and FMNIST data sets, the tree structure results in a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives (Fig 1c, d, e, f).In a second, more applied comparison example, we visualize data from ChEMBL using TMAP and UMAP.0][41] We consider a subset   of the top 10,000 ChEMBL compounds by insertion date, as well as a random subset   of 10,000 ChEMBL molecules.
Taken the more homogeneous set   as an input, the 2D-maps produced by each representation, plotted using the Python library matplotlib, illustrate that TMAP, which distributes clusters in branches and subbranches of the MST, produces a much more even distribution of compounds on the canvas compared to UMAP, thus enabling better visual resolution (Fig. 2a, b).Furthermore, in a visualization of the heterogeneous set   , nearest neighbor relationships (locality) are better preserved in TMAP compared to UMAP, as illustrated by the positioning of the 20 structurally nearest neighbors of compound CHEMBL3701602, 42 reported as a potent inhibitor of human tyrosine-protein kinase SYK.
The 20 structurally nearest neighbors are defined as 20 nearest neighbors in the original 512dimensional fingerprint space.TMAP directly connects the query compound to three of the 20 nearest neighbors, CHEMBL3701630, CHEMBL3701611, and CHEMBL38911457, its nearest, second nearest, and 15 th nearest neighbor respectively.The nearest neighbors 1 through 7 are all within a topological distance of 3 around the query (Fig. 2c).In contrast, UMAP has positioned nearest neighbors 2, 3, 9, and 18, among several even more distant data points, closer to the query than the nearest neighbor from the original space (Fig. 2d).Indeed, TMAP preserves locality in terms of retaining 1-nearest neighbor relationships much better than TMAP, applying both topological and Euclidean metrics (Fig. 2e, f; Fig. S4).The quality of the preservation of locality largely depends on parameter , with adjustments to parameters  and   only having a minor influence (Fig. S5).Moreover, TMAP yields reproducible results when running on identical parameters and input data, whereas results of comparable algorithms such as UMAP change considerably with every run (Fig. S6). 19 terms of calculation times, TMAP and UMAP have comparable running time  and memory usage  for small random subsets of the 512-D ECFP-encoded ChEMBL data set with sizes  = 10,000 and  = 100,000, TMAP significantly outperforms UMAP for larger random subsets ( = 500,000 and  = 1,000,000) (Fig. 2h, i).Further insight into the computational behavior of TMAP is provided by analyzing running times for the different phases based on a larger subset ( = 1,000,000) of the ECFP4-encoded ChEMBL data set (Fig. 2g).During phase I of the algorithm, which accounts for 180s of the execution time and approximately 5GB of main memory usage, data is loaded and indexed in the LSH Forest data structure in chunks of 100,000, as expressed by 10 distinct jumps in memory consumption.
The construction of the --NNG during phase II requires a neglectable amount of main memory and takes approximately 110s.During 10 seconds of execution time, MST creation (phase III) occupies a further 2GB of main memory of which approximately 1GB is retained to store the tree data structure.The graph layout algorithm (phase IV) requires 2GB throughout 55s, after which the algorithm completes after a total wall clock time of 355s and peak main memory usage of 8.553GB.

Visualizing very large high-dimensional data sets with TMAP: ChEMBL and FDB17
The high performance and relatively low memory usage of TMAP as well as the ability to generate highly detailed and interpretable representations of high-dimensional data sets is illustrated here by interactive visualization of the full data set containing the 1.13 million ChEMBL compounds associated with biological assay data.Here we use MHFP6 (512 MinHash permutations), a molecular fingerprint related to ECFP4 but with better performance for virtual screening and the ability to be used with LSH. 38TMAP completes the calculation within 613 seconds with a peak memory usage of 20.562 GB.Note however that approximately half of the main memory usage is accounted for by SMILES, activities, and biological entity classes which are loaded for later use in the visualization.To facilitate data analysis, the coordinates computed by TMAP are exported as an interactive portable HTML file using Faerun (Fig. 3a).
Analyzing the distribution of molecules on the tree shows that TMAP groups molecules according to their structure and their biological activity, accurately reflecting similarities calculated in the high-dimensional MHFP6 space.This is well illustrated for a subset of the map (Fig. 3a, insert).In this area of the map, data points in cyan indicate molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters in two connected branches (right side of inset), while data points in orange show inhibitors of the phenylethanolamine N-methyltransferase (PNMT) (left side of inset), and red and dark blue data points indicate nicotinic acetylcholine receptor (nAChRs) ligands and cytochrome p450s (CYPs) inhibitors, respectively.TMAP can also be used to visualize even larger data sets, as illustrated here for the ChEMBL set merged with FDB17 ( = 10,101,204) into a superset of size  = 11,261,085 (Fig. 3b).As above, the TMAP 2D-layout accurately reflect structural and functional similarities computed in the high-dimensional MHFP6 space.In this TMAP visualization, the majority of ChEMBL compounds accumulate in closely connected clusters (branches) due to the prevalence of aromatic carbocycles.A notable exception is a relatively sizable branch of steroids and steroid-like compounds, which is connected to a branch of FDB17 molecules containing non-aromatic 5-membered carbocycles and ketones (Fig. 2b, insert).Many more detailed insights can be gained by inspecting the interactive map in Faerun (http://tmapfdb.gdb.tools)."Cytochrome p450", "Epigenetic Regulator", "Ion Channel", "Kinase", "Protease", "Other Enzyme", "Membrane Receptor", "Transcription Factor", "Transporter", and "Other".The inset shows: molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters (cyan); inhibitors of the phenylethanolamine N-methyltransferase (orange); and structurally related compounds with high binding affinities for nicotinic acetylcholine receptors and inhibitory effects on cytochrome p450s (red, dark blue).(b) The ChEMBL data set was merged with fragment database (FDB17) compounds ( = 11,261,085) and visualized.FDB17 molecules are shown in light gray.The inset shows a branch of steroid and steroid-like ChEMBL compounds, as well as dominantly FDB17 branches which are sparsely populated by ChEMBL molecules.An interactive version of (b) is available at http://tmap-fdb.gdb.tools.

TMAP visulization of the RSCB Protein DataBank
We further illustrate TMAP in the area of biomolecules to visualize the RCSB PDB ( = 131,236). 14The PDB files were extracted from the Protein Data Bank and encoded using the protein shape fingerprint 3DP (136-D integer vector, 256 weighted MinHash samples) 3DP encodes the structural shape of large molecules stored as PDB files based on through-space distances of atoms.Processing data extracted from the PDB and indexed using a weighted variant of MinHash, demonstrates the ability of TMAP to visualize both global and local structure, improving on previous efforts on the visualization of the database. 17,43The global structure of the 3DP-enconded PDB data is dominated by the size (heavy atom count) of the proteins (Fig. 4a), on the other hand, the local structure is defined by properties such as the fraction of negative charges (Fig. 4b).As a third example, we represent the MiniBooNE data set ( = 130,065,  = 50), which consists of measurements extracted from Fermilab's MiniBooNE experiment and contains the detection of signal (electron neutrinos) and background (muon neutrinos) events 48 .As the attributes in MiniBooNE are real numbers, we use the Annoy indexing library which supports the cosine metric in phase I of the algorithm to index the data for -NNG construction, which demonstrates the modularity of TMAP. 49This example reflects the independence of the MST and layout phases of the algorithm from the input data, displaying the distribution of the signal over the background data (Fig. 5c).

Discussion
In this study, we introduced the data visualization method TMAP, which is suitable for very large, high-dimensional data sets containing molecular information.Compared to currently available methods such as t-SNE, UMAP or PCA, TMAP excels with its low memory usage and running time.Indeed, TMAP has shown to generate visualizations with an empirical sub-linear time complexity of ( 0.931 ) when processing real-world chemical data.In addition, TMAP facilitates a high interpretability of the resulting visualization, the ability to preserve and visualize both global and local features, and has been shown to be applicable to arbitrary data sets such as images, text, or RNA-seq data, hinting at its usefulness in a wide range of fields including computational linguistics or biology.By adjusting the available parameters and leveraging output quality and memory usage, the algorithm does not require specialized hardware for high-quality visualizations of data sets containing millions of data points.
TMAP supports Jaccard similarity estimation through MinHash and weighted MinHash for binary and weighted sets respectively.While the Jaccard metric has proven to be suitable for the challenges presented by chemical fingerprint similarity calculation, the metric may not be the best option available to problems presented by other data sets.However, there exists a wide range of LSH families supporting distance and similarity metrics such as Hamming distance,   distance, Levenshtein distance, or cosine similarity, which are compatible with TMAP. 50,51Furthermore, the modularity of TMAP allows to plug in arbitrary nearest-neighbor-graph creation techniques or load existing graphs from files.
All the TMAP visualizations presented, including installation and usage instructions, are available as interactive online versions (http://tmap.gdb.tools).The source code for TMAP is available on GitHub (https://github.com/reymond-group/tmap)and a Python package can be obtained using conda package manager.download file view on ChemRxiv manuscript_mstmap_daniel_probst_v7.pdf (2.08 MiB)

Fig. 1 .
Fig. 1.Comparison between TMAP and UMAP on benchmark data sets.TMAP explicitly visualizes the relations between as well as within clusters.(a, b) While UMAP represents the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins clusters between through an edge of smallest difference.(c, d, e, f) For the MNIST and FMNIST data sets, the tree structure for a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives.The image data of all three sets was binarized using the average intensity per image as a threshold.Interactive versions of the TMAP visualizations can be found on http://tmap.gdb.tools.

Fig. 2
Fig. 2 Comparing TMAP and UMAP for visualizing ChEMBL.The first  compounds   (a, b, e) and a random sample   (c, d, f), each of size  = 10,000, were drawn from the 512-D ECFP-encoded ChEMBL data set to visualize the distribution of biological entity classes and k-nearest neighbors respectively.(a) TMAP lays out the data as a single connected tree, whereas (b) UMAP draws what appears to be a highly disconnected graph, with the connection between components becoming impossible to assert.TMAP keeps the intra-and inter-cluster distances at the same magnitude, increasing the visual resolution of the plot.(c, d) The 20 nearest neighbors of a randomly selected compound from a random sample.(c) TMAP directly connects the query compound to three of the 20 nearest neighbors (1, 2, 15); nearest neighbors 1 through 7 are all within a topological distance of 3 around the query compound.(d) The closest nearest neighbors of the same query compound in the UMAP visualization are true nearest neighbors 2, 3, 18, 9, and 1, with 1 being the farthest of the five.(e, f) Ranked distances from true nearest neighbor in original high dimensional space after projection based on topological and Euclidean distance for data sets   and   respectively.(g) Computing the coordinates for a random sample ( = 1,000,000) highlights the running time behavior of TMAP and allows an inspection of the time and space requirements of the different phases of the algorithm.Four random samples increasing in size ( = 10,000,  = 100,000,  = 500,000, and n=1,000,000) detail the differences in memory usage (h) and running time (i) between TMAP and UMAP.(  = 4.865s,   = 0.223GB;   = 20.985s,  = 0.383GB and   = 33.485s,  = 1.12GB;   = 115.661s,  = 2.488GB respectively) (  = 175.89s,  = 4.521GB;   = 3,577.768s,  = 18.854GB and   = 354.682s,  = 8.553GB;   = 41,325.944s,  = 48.507GBrespectively) where the molecule expressed the highest activity in a biological assay.

Fig. 3
Fig. 3 TMAP visualization of ChEMBL and FDB17 in the MHFP6 chemical space.(a) Visualization of all ChEMBL compounds associated with biological assay data ( = 1,159,881) colored by target class:"Cytochrome p450", "Epigenetic Regulator", "Ion Channel", "Kinase", "Protease", "Other Enzyme", "Membrane Receptor", "Transcription Factor", "Transporter", and "Other".The inset shows: molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters (cyan); inhibitors of the phenylethanolamine N-methyltransferase (orange); and structurally related compounds with high binding affinities for nicotinic acetylcholine receptors and inhibitory effects on cytochrome p450s (red, dark blue).(b) The ChEMBL data set was merged with fragment database (FDB17) compounds ( = 11,261,085) and visualized.FDB17 molecules are shown in light gray.The inset shows a branch of steroid and steroid-like ChEMBL compounds, as well as dominantly FDB17 branches which are sparsely populated by ChEMBL molecules.An interactive version of (b) is available at http://tmap-fdb.gdb.tools.

Fig. 4
Fig. 4 TMAP visualization of the RCSB Protein Data Bank (PDB).3DP-encoded PDB entries visualized using TMAP with weighted MinHash indexing, the color bars show the log-log distribution of the property values.(a) Colored according to the macromolecular size (heavy atom count).The resulting map reflects the size-sensitivity of the 3DP fingerprint.(b) Colored according to the fraction of negative charges in the molecules.Macromolecules with a high fraction of negatively charged atoms, predominantly nucleic acids, are visible as clusters of red branches.An interactive version of (a) is available at http://pdb-tmap.gdb.tools.

Fig. 5
Fig. 5 Visualizing linguistics, RNA sequencing, and particle physics data sets.(a) The GUTENBERG data set is a selection of books by 142 authors ( = 3,036,  = 1,217,078).The works of five different authors are shown to occupy distinct branches.(b) The PANCAN data set ( = 801,  = 20,531) consists of gene expressions data of five types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA) and was indexed using a weighted variant of the MinHash algorithm.(c) The MiniBooNE data set ( = 130,065,  = 50) consists of measurements extracted from Fermilab's MiniBooNE experiment.TMAP visualizes the distribution of the signal data among the background.Interactive version of these maps and further examples can be found at http://tmap.gdb.tools.

Fig. S2
Fig. S2 Influence of LSH Forest parameters  and   on visualization of MNIST.Whereas parameter  directly influences the average degree of the -nearest neighbor graph,   increases the quality of the returned  nearest neighbors.Both parameters only marginally influence the aesthetics and quality of the visualization.

Fig. S3
Fig. S3 Influence of parameter  on visualization of MNIST.The point size parameter  has major influence on the aesthetics of the visualization, as it controls the sparseness of the drawn tree.Decreasing the point size and thus the repulsive force between two points, allows the layout algorithm to draw points closer to their respective (sub) branches.

Fig. S4
Fig. S4 Ranked distance from true nearest neighbor when visualizing the MNIST data set.Ranked distances from true nearest neighbor in original high dimensional space after projection based on topological and Euclidean distance for the MNIST data set.Whereas UMAP preserves less than 10% of true 1-nearest neighbors, TMAP preserves more than 80% based on topological and more than 35% based on Euclidean distance.

Fig. S5
Fig. S5 Influence of TMAP parameters on locality preserving performance.Ranked distances from true nearest neighbor in original high dimensional space after projection based on topological and Euclidean distance for the MNIST data set.While, parameters  and  (a, b) have a major influence on both, the topological and Euclidean measure of locality preserving performance, parameters  and   have only marginal influence(c, d).The point size  does not influence topological distances; however, it has a minor effect on the Euclidean distance-based metric, as higher values increase the sparsity of the drawn tree.

Fig. S6
Fig. S6 Stability of TMAP.Algorithms TMAP (a, c) and UMAP (b, d) have been repeatedly ( = 4) run on the same data sets with the same parameters.Whereas the output of TMAP is perceived as identical in all instances, the results by UMAP show considerable differences between each run.