Fig. 2 | Journal of Cheminformatics

Fig. 2

From: Visualization of very large high-dimensional data sets as minimum spanning trees

Comparing TMAP and UMAP for visualizing ChEMBL. The first \(n\) compounds \(S_{t}\) (a, b, e) and a random sample \(S_{r}\) (c, d, f), each of size \(n = 10{,}000\), were drawn from the 512-D ECFP-encoded ChEMBL data set to visualize the distribution of biological entity classes and the k-nearest neighbors, respectively. (a) TMAP lays out the data as a single connected tree, whereas (b) UMAP draws what appears to be a highly disconnected graph, in which the connections between components become impossible to ascertain. TMAP keeps the intra- and inter-cluster distances at the same magnitude, increasing the visual resolution of the plot. (c, d) The 20 nearest neighbors of a randomly selected compound from the random sample. (c) TMAP directly connects the query compound to three of the 20 nearest neighbors (1, 2, 15); nearest neighbors 1 through 7 are all within a topological distance of 3 of the query compound. (d) In the UMAP visualization, the five points closest to the same query compound are true nearest neighbors 2, 3, 18, 9, and 1, with 1 being the farthest of the five. (e, f) Ranked distances from the true nearest neighbors in the original high-dimensional space after embedding, based on topological and Euclidean distance, for data sets \(S_{t}\) and \(S_{r}\), respectively. (g) Computing the coordinates for a random sample (\(n = 1{,}000{,}000\)) highlights the running time behavior of TMAP and allows an inspection of the time and space requirements of the different phases of the algorithm. (h, i) Four random samples of increasing size (\(n = 10{,}000\), \(n = 100{,}000\), \(n = 500{,}000\), and \(n = 1{,}000{,}000\)) detail the differences in memory usage (h) and running time (i) between TMAP and UMAP. For \(n = 10{,}000\) and \(n = 100{,}000\), respectively: \(t_{TMAP} = 4.865\,\text{s}\), \(a_{TMAP} = 0.223\,\text{GB}\); \(t_{UMAP} = 20.985\,\text{s}\), \(a_{UMAP} = 0.383\,\text{GB}\) and \(t_{TMAP} = 33.485\,\text{s}\), \(a_{TMAP} = 1.12\,\text{GB}\); \(t_{UMAP} = 115.661\,\text{s}\), \(a_{UMAP} = 2.488\,\text{GB}\). For \(n = 500{,}000\) and \(n = 1{,}000{,}000\), respectively: \(t_{TMAP} = 175.89\,\text{s}\), \(a_{TMAP} = 4.521\,\text{GB}\); \(t_{UMAP} = 3{,}577.768\,\text{s}\), \(a_{UMAP} = 18.854\,\text{GB}\) and \(t_{TMAP} = 354.682\,\text{s}\), \(a_{TMAP} = 8.553\,\text{GB}\); \(t_{UMAP} = 41{,}325.944\,\text{s}\), \(a_{UMAP} = 48.507\,\text{GB}\).
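
The running times and memory figures reported in the caption can be turned into UMAP-to-TMAP ratios to make the scaling gap easier to read. The following is a minimal sketch and not part of the original figure. It assumes that the four reported value groups map to the four sample sizes in the order listed, that \(t\) denotes running time in seconds and \(a\) memory usage in GB, and the names `REPORTED` and `ratios` are hypothetical.

```python
# Values transcribed from the caption, assuming the four groups correspond to
# the four sample sizes in the order given:
# (n, t_TMAP [s], a_TMAP [GB], t_UMAP [s], a_UMAP [GB])
REPORTED = [
    (10_000, 4.865, 0.223, 20.985, 0.383),
    (100_000, 33.485, 1.12, 115.661, 2.488),
    (500_000, 175.89, 4.521, 3577.768, 18.854),
    (1_000_000, 354.682, 8.553, 41325.944, 48.507),
]


def ratios(rows):
    """Return (n, UMAP/TMAP time ratio, UMAP/TMAP memory ratio) per sample size."""
    return [(n, t_u / t_t, a_u / a_t) for n, t_t, a_t, t_u, a_u in rows]


if __name__ == "__main__":
    for n, time_ratio, mem_ratio in ratios(REPORTED):
        print(f"n = {n:>9,}: time x{time_ratio:.1f}, memory x{mem_ratio:.1f}")
```

Under these assumptions, the ratios grow with sample size for both running time and memory, consistent with the differences the caption describes for panels h and i.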
