TMAP performance assessment with toy data sets and ChEMBL subsets
We first assess the quality of the TMAP algorithm by comparing TMAP and UMAP visualizations of the common benchmarking data sets MNIST, FMNIST, and COIL20 (Fig. 1). UMAP generally represents clusters as tightly packed patches and tries to reach maximal separation between them. TMAP, on the other hand, visualizes the relations between, as well as within, clusters as branches and sub-branches. While UMAP can represent the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins subsets through one or more edges of smallest difference (Fig. 1a, b). However, the plot shows that this removal of local connectivity leads to an untangling of highly similar data (shown in dark green, orange, dark red, dark purple, and light blue). This behavior has been assessed and compared to UMAP in Additional file 1: Figures S4 and S5, where it is shown that both TMAP and UMAP have to sacrifice locality preservation for more complex examples. For the MNIST and FMNIST data sets, the tree structure results in a higher resolution of both variances and errors within clusters, as it becomes apparent how subclusters (branches within clusters) are linked and which true positives connect to false positives (Fig. 1c–f).
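The cutting of circular clusters described above can be illustrated with a minimal sketch. It assumes Kruskal's algorithm with a union-find structure as the spanning tree method, which is one standard choice and is not specified in this section; the function name `minimum_spanning_tree` and the four-node toy cycle are hypothetical.

```python
def minimum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm (assumed here) on edges given as (weight, u, v)."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent pointers to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical circular cluster 0-1-2-3-0; the edge 3-0 has the largest difference.
cycle_edges = [(0.1, 0, 1), (0.2, 1, 2), (0.3, 2, 3), (0.9, 3, 0)]
mst = minimum_spanning_tree(4, cycle_edges)
print(mst)  # [(0, 1, 0.1), (1, 2, 0.2), (2, 3, 0.3)]
```

In this toy case the cycle is opened at its most dissimilar edge, which mirrors the behavior observed for the COIL20 subsets.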
In a second, more applied comparison, we visualize data from ChEMBL using TMAP and UMAP. For this analysis, molecular structures are encoded using ECFP4 (extended connectivity fingerprint up to 4 bonds, 512-D binary vector), a molecular fingerprint that encodes circular substructures and performs well in virtual screening and target prediction [46, 47, 48]. We consider a subset \(S_{t}\) of the top 10,000 ChEMBL compounds by insertion date, as well as a random subset \(S_{r}\) of 10,000 ChEMBL molecules.
Taking the more homogeneous set \(S_{t}\) as input, the 2D maps produced by each method, plotted using the Python library matplotlib, illustrate that TMAP, which distributes clusters in branches and subbranches of the MST, produces a much more even distribution of compounds on the canvas than UMAP, thus enabling better visual resolution (Fig. 2a, b). Furthermore, in a visualization of the heterogeneous set \(S_{r}\), nearest neighbor relationships (locality) are better preserved in TMAP than in UMAP, as illustrated by the positioning of the 20 structurally nearest neighbors of compound CHEMBL370160 [2, 49], reported as a potent inhibitor of human tyrosine-protein kinase SYK. These 20 structurally nearest neighbors are defined as the 20 nearest neighbors in the original 512-dimensional fingerprint space. TMAP directly connects the query compound to three of the 20 nearest neighbors, CHEMBL3701630, CHEMBL3701611, and CHEMBL38911457, its nearest, second nearest, and 15th nearest neighbor, respectively. The nearest neighbors 1 through 7 are all within a topological distance of 3 around the query (Fig. 2c). In contrast, UMAP has positioned nearest neighbors 2, 3, 9, and 18, among several even more distant data points, closer to the query than the nearest neighbor from the original space (Fig. 2d). Indeed, TMAP preserves locality in terms of retaining 1-nearest neighbor relationships much better than UMAP, applying both topological and Euclidean metrics (Fig. 2e, f; Additional file 1: Fig. S6). The quality of the preservation of locality largely depends on parameter \(d\), with adjustments to parameters \(k\) and \(k_{c}\) only having a minor influence (Additional file 1: Fig. S7). Moreover, TMAP yields reproducible results when run with identical parameters and input data, whereas results of comparable algorithms such as UMAP change considerably with every run (Additional file 1: Fig. S8) [26].
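The topological distance used above is the number of tree edges between two data points. The following minimal sketch computes it with a breadth-first search; the function name `tree_distances`, the toy tree, and the neighbor ranking are hypothetical and are not taken from the ChEMBL data.

```python
from collections import deque


def tree_distances(adjacency, source):
    """Breadth-first search: edge count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy tree; node 0 plays the role of the query compound.
toy_tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}

# Hypothetical ranking of the neighbors of node 0 in the original space.
ranked_neighbors = [1, 2, 3, 4]

distances = tree_distances(toy_tree, 0)
topological = [distances[n] for n in ranked_neighbors]
print(topological)  # [1, 1, 2, 3]
```

A layout preserves locality well when the highest-ranked neighbors from the original space have the smallest topological distances, as is the case in this toy example.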
In terms of calculation times, TMAP and UMAP have comparable running time \(t\) and memory usage \(a\) for small random subsets of the 512-D ECFP4-encoded ChEMBL data set with sizes \(n = 10,000\) and \(n = 100,000\), whereas TMAP significantly outperforms UMAP for larger random subsets (\(n = 500,000\) and \(n = 1,000,000\)) (Fig. 2h, i). Further insight into the computational behavior of TMAP is provided by analyzing running times for the different phases based on a larger subset (\(n = 1,000,000\)) of the ECFP4-encoded ChEMBL data set (Fig. 2g). During phase I of the algorithm, which accounts for \(180{\text{s}}\) of the execution time and approximately \(5{\text{GB}}\) of main memory usage, data is loaded and indexed in the LSH Forest data structure in chunks of 100,000, as reflected by 10 distinct jumps in memory consumption. The construction of the \(c\)–\(k\)-NNG during phase II requires a negligible amount of main memory and takes approximately \(110{\text{s}}\). MST creation (phase III) takes \(10{\text{s}}\) and occupies a further \(2{\text{GB}}\) of main memory, of which approximately \(1{\text{GB}}\) is retained to store the tree data structure. The graph layout algorithm (phase IV) requires \(2{\text{GB}}\) throughout its \(55{\text{s}}\) of execution, after which the algorithm completes with a total wall clock run time of \(355{\text{s}}\) and peak main memory usage of \(8.553{\text{GB}}\).
Note that TMAP supports Jaccard similarity estimation through MinHash and weighted MinHash for binary and weighted sets, respectively. While the Jaccard metric is very suitable for chemical similarity calculations based on molecular fingerprints, it may not be the best option for problems presented by other data sets. However, there exists a wide range of LSH families supporting distance and similarity metrics such as Hamming distance, \(l_{p}\) distance, Levenshtein distance, or cosine similarity, which are compatible with TMAP [50, 51]. Furthermore, the modularity of TMAP allows plugging in arbitrary nearest-neighbor-graph creation techniques or loading existing graphs from files.
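To make the Jaccard estimation concrete, the following minimal sketch implements plain MinHash for binary sets using only the Python standard library. It is not the TMAP implementation: the salted BLAKE2b hash family, the helper names, and the two toy sets of "on" bit indices are assumptions chosen for illustration.

```python
import hashlib


def hash_item(item, seed):
    """Salted 64-bit hash; each seed acts as one hash function (an assumption)."""
    digest = hashlib.blake2b(
        str(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, n_hashes):
    """Keep the minimum hash value of the set under each hash function."""
    return [min(hash_item(x, seed) for x in items) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical fingerprints given as sets of "on" bit indices.
bits_a = set(range(0, 60))
bits_b = set(range(20, 80))

signature_a = minhash_signature(bits_a, 512)
signature_b = minhash_signature(bits_b, 512)

exact = exact_jaccard(bits_a, bits_b)  # 40 shared / 80 total = 0.5
estimate = estimate_jaccard(signature_a, signature_b)
print(exact, estimate)
```

The estimate approaches the exact value as the number of hash functions grows; with 512 hash functions the expected standard error for a true similarity of 0.5 is roughly 0.02.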
TMAPs of small molecule data sets: ChEMBL, FDB17, DSSTox, and the Natural Products Atlas
The high performance and relatively low memory usage of TMAP, as well as its ability to generate highly detailed and interpretable representations of high-dimensional data sets, are illustrated here by interactive visualization of a series of small molecule data sets available in the public domain. In these examples we use MHFP6 (512 MinHash permutations), a molecular fingerprint related to ECFP4 but with better performance for virtual screening tasks and the ability to be directly indexed in an LSH Forest data structure, which considerably speeds up computation for large data sets [45].
As a first example, we discuss the TMAP of the full data set of the ChEMBL database containing the 1.13 million ChEMBL compounds associated with biological assay data. TMAP completes the calculation within 613 s with a peak memory usage of 20.562 GB. Note that approximately half of the main memory usage is accounted for by SMILES, activities, and biological entity classes which are loaded for later use in the visualization. To facilitate data analysis, the coordinates computed by TMAP are exported as an interactive portable HTML file using Faerun, where molecules are displayed using the JavaScript library SmilesDrawer (Fig. 3a) [25, 52].
Analyzing the distribution of molecules on the tree shows that TMAP groups molecules according to their structure and their biological activity, accurately reflecting similarities calculated in the high-dimensional MHFP6 space. This is well illustrated for a subset of the map (Fig. 3a, inset). In this area of the map, data points in cyan indicate molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters in two connected branches (right side of inset), while data points in orange show inhibitors of phenylethanolamine N-methyltransferase (PNMT) (left side of inset), and red and dark blue data points indicate nicotinic acetylcholine receptor (nAChR) ligands and cytochrome P450 (CYP) inhibitors, respectively.
As a second example, we visualize the ChEMBL set merged with FDB17 (\(n = 10,101,204\)) into a superset of size \(n = 11,261,085\) (Fig. 3b), which corresponds to the largest data set that TMAP can successfully handle. As above, the TMAP 2D layout accurately reflects structural and functional similarities computed in the high-dimensional MHFP6 space. In this TMAP visualization, the majority of ChEMBL compounds accumulate in closely connected clusters (branches) due to the prevalence of aromatic carbocycles. A notable exception is a relatively sizable branch of steroids and steroid-like compounds, which is connected to a branch of FDB17 molecules containing non-aromatic 5-membered carbocycles and ketones (Fig. 3b, inset). Many more detailed insights can be gained by inspecting the interactive map in Faerun (http://tmap-fdb.gdb.tools).
Further examples include MHFP6-encoded compounds from the Distributed Structure-Searchable Toxicity (DSSTox) Database (\(n = 848,816\)) and the Natural Products Atlas (\(n = 24,594\)). Visualizing DSSTox and coloring the resulting tree by toxicity rating, TMAP creates several subtrees and branches representing structural regions with a high incidence of highly toxic compounds (shown in red, Fig. 3c). An example of such a subtree contains naphthalenes and other polycyclic aromatic hydrocarbons (Fig. 3c, inset). The TMAP tree of the Natural Products Atlas was colored according to origin genus and reveals that branches and subbranches containing distinct substructures usually correlate with a certain genus, such as the various combinations of phenols, fused cyclopentanes, lactones, and steroids produced by the fungal genus Ganoderma (colored purple in Fig. 3d, inset).
Visualization of the MoleculeNet benchmark data sets
We further illustrate the use of TMAP by visualizing MoleculeNet, a benchmark for molecular machine learning which has found wide adoption in cheminformatics and encompasses 16 data sets ranging in size and composition (Table 1) [18]. As for the other small molecule data sets above, we computed MHFP6 fingerprints of the associated molecules and the corresponding TMAPs, which we then color-coded according to various numerical values available in the benchmarks. The procedure was applied to all MoleculeNet data sets except for QM7/b, where no SMILES have been provided.
The resulting TMAP representations, accessible at the TMAP website (http://tmap.gdb.tools), reveal the detailed structure of the data sets as well as the behavior of methods applied to these data sets as a function of the chemical structures of the molecules. For example, TMAPs of the QM8 and QM9 data sets (\(n = 21,786\) and \(n = 133,885\)), which contain small molecules and DFT-modelled parameters, reveal relationships between molecular structures and the various computed physico-chemical values. For instance, the TMAP of the QM8 data set color-coded by the oscillator strengths of the lowest two singlet electronic states reveals how the value correlates with molecular structure and explains the performance differences between machine learning models trained on Coulomb matrices and those trained on structure-sensitive molecular fingerprints [53]. In the case of the ESOL data set containing measured and calculated water solubility values of common small molecules (\(n = 1128\)), its TMAP color-coded with the difference between computed and measured values reveals the limitation of the ESOL model when estimating the solubility of polycyclic aromatic hydrocarbons and compounds containing pyridines. For the FreeSolv data set (\(n = 642\)) containing small molecules and their measured and calculated hydration free energy in water, the TMAP visualization hints at possible limitations of the method when calculating hydration free energies of sugars. Finally, for the MUV data set (\(n = 93,087\)), which contains active small drug-like molecules against 17 different protein targets mixed in each case with inactive decoy molecules, the various TMAPs reveal differences in the structural distribution of actives among decoys. Actives are usually well distributed but appear to form clusters in certain subsets (e.g. MUV-548 and MUV-846), explaining the generally higher performance of fingerprint benchmarks for these subsets [47].
Application to other scientific data sets
We further illustrate the general applicability of TMAP to visualize data sets from the fields of linguistics, biology, and particle physics. All produced maps are available as interactive Faerun plots on the TMAP website (http://tmap.gdb.tools).
Our first example concerns visualization of the RCSB Protein Data Bank, which contains experimental 3D-structures of proteins and nucleic acids (\(n = 131,236\)) [54]. The PDB files were extracted from the Protein Data Bank and encoded using the protein shape fingerprint 3DP (136-D integer vector, 256 weighted MinHash samples). 3DP encodes the structural shape of large molecules stored as PDB files based on through-space distances of atoms [22]. Processing data extracted from the PDB and indexed using a weighted variant of MinHash demonstrates the ability of TMAP to visualize both global and local structure, improving on previous efforts to visualize the database [22, 55]. The global structure of the 3DP-encoded PDB data is dominated by the size (heavy atom count) of the proteins (Fig. 4a), whereas the local structure is defined by properties such as the fraction of negative charges (Fig. 4b).
As an additional example from biology, we consider the PANCAN data set (\(n = 800\), \(d = 20,531\)), which consists of gene expression data from patients with different types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA), randomly extracted from The Cancer Genome Atlas database [56]. Here we index the PANCAN data directly using the LSH Forest data structure and weighted MinHash. The resulting map shows that the algorithm successfully differentiates tumor types based on RNA sequencing data (Fig. 4c). We also visualize the ProteomeHD data set using TMAP [57]. This data set consists of co-regulation scores of 5013 proteins, annotated with their respective cellular location. In addition to the ProteomeHD data set, Kustatscher et al. also released an R script to create a map of the set using t-SNE, which took a total of 400 s to complete; in contrast, TMAP visualized the data set within 32 s (Fig. 4d), successfully clustering proteins by their cellular location based on co-regulation scores. As a further biology example, our TMAP webpage also features flow cytometry measurements (\(n = 436,877\), \(d = 14\)), exemplifying the method's application to the visualization of relatively low-dimensional data [17, 58].
As an example from physics, we represent the MiniBooNE data set (\(n = 130,065\), \(d = 50\)), which consists of measurements extracted from Fermilab’s MiniBooNE experiment and contains the detection of signal (electron neutrinos) and background (muon neutrinos) events [59]. As the attributes in MiniBooNE are real numbers, we use the Annoy indexing library which supports the cosine metric in phase I of the algorithm to index the data for \(k\)-NNG construction, which demonstrates the modularity of TMAP [60]. This example reflects the independence of the MST and layout phases of the algorithm from the input data, displaying the distribution of the signal over the background data (Fig. 5a).
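The modularity mentioned above means that any procedure producing a \(k\)-nearest-neighbor graph can feed the later phases. The following minimal sketch builds such a graph under the cosine metric by exhaustive search, as a simple stand-in for an approximate index such as Annoy; it does not use the Annoy API, and the function names and the four toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_edges(points, k):
    """Exhaustive k-nearest-neighbor search; returns (source, target, distance)."""
    edges = []
    for i, p in enumerate(points):
        ranked = sorted(
            (cosine_distance(p, q), j) for j, q in enumerate(points) if j != i
        )
        edges.extend((i, j, d) for d, j in ranked[:k])
    return edges


# Hypothetical real-valued measurements forming two directions in 2D.
events = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
edges = knn_edges(events, k=1)
print([(i, j) for i, j, _ in edges])  # [(0, 1), (1, 0), (2, 3), (3, 2)]
```

The exhaustive search scales quadratically with the number of points, which is why an approximate index is preferable for data sets of the size discussed here.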
Outside of the natural sciences, we use TMAP to visualize the GUTENBERG set as an example of a data set from linguistics. This data set features a selection of \(n = 3036\) books by 142 authors written in English [61]. To analyze this data, we define a book fingerprint as a dense-form binary vector indicating which words from the universe of all words extracted from all books occurred at least once in a given book (yielding a dimensionality of \(d = 1,217,078\)), and index this book fingerprint using the LSH Forest data structure with MinHash. The visualization of the GUTENBERG data set exemplifies the ability of TMAP to handle input with extremely high dimensionality (\(d = 1,217,078\)) efficiently (Fig. 5b). The works of different authors tend to populate specific branches, with notable expected exceptions such as the autobiography of Charles Darwin, which does not lie on the same branch as all his other works. Meanwhile, the works of Alfred Russel Wallace are found on subbranches of the Darwin branch.
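The book fingerprint defined above can be sketched as follows. This is a minimal illustration, not the code used for the GUTENBERG set: the tokenization by a lowercase letter pattern, the helper names, and the two toy texts are assumptions.

```python
import re


def word_set(text):
    """Set of distinct lowercase words in a text (tokenization is an assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def binary_fingerprint(words, vocabulary):
    """Dense binary vector: 1 if the vocabulary word occurs at least once."""
    return [1 if word in words else 0 for word in vocabulary]


def jaccard(vec_a, vec_b):
    """Jaccard similarity of two binary vectors of equal length."""
    shared = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    total = sum(1 for a, b in zip(vec_a, vec_b) if a or b)
    return shared / total


# Hypothetical toy "books".
books = {
    "book_a": "The origin of species",
    "book_b": "The descent of man",
}
word_sets = {title: word_set(text) for title, text in books.items()}

# The universe of all words extracted from all books, in a fixed order.
vocabulary = sorted(set().union(*word_sets.values()))

fingerprints = {
    title: binary_fingerprint(words, vocabulary)
    for title, words in word_sets.items()
}
similarity = jaccard(fingerprints["book_a"], fingerprints["book_b"])
print(vocabulary)  # ['descent', 'man', 'of', 'origin', 'species', 'the']
print(fingerprints["book_a"])  # [0, 0, 1, 1, 1, 1]
print(fingerprints["book_b"])  # [1, 1, 1, 0, 0, 1]
```

Here the two toy books share two of six vocabulary words, giving a Jaccard similarity of one third. The dimensionality of the fingerprint equals the vocabulary size, which is why it grows into the millions for a full corpus.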
Related to linguistics, the TMAP webpage further features a map of the distribution of different scientific journals (Nature, Cell, Angewandte Chemie, Science, the Journal of the American Chemical Society, and Demography) over the entire PubMed article space (\(n = 327,628\), \(d = 1,633,762\)), revealing specialization, diversification, and overlaps, as well as a TMAP of the NeurIPS conference papers (\(n = 7,241\), \(d = 225,423\)), visualizing the increase in occurrence of the word “deep” in conference paper abstracts over time (1987–2016).