Table 1 Data sets visualized using TMAP

From: Visualization of very large high-dimensional data sets as minimum spanning trees

Data set | Description | Data type | Size

Toy data sets
COIL20 | Gray-scale images of 20 objects, each rotated 72 times at 5° intervals | Images | 1440
MNIST | Gray-scale images of handwritten digits | Images | 70,000
Fashion MNIST | Gray-scale images of fashion items from 10 classes | Images | 70,000

Chemical compound databases and PDB
ChEMBL | Bioactive molecules with drug-like properties | SMILES | 1,159,881
FDB17 and ChEMBL | Fragment database (up to 17 atoms) and ChEMBL | SMILES | 11,261,085
Natural Products Atlas | Bacterial and fungal natural products | SMILES | 24,594
DSSTox | U.S. EPA information on toxicity of chemicals | SMILES | 848,816
PDB | Information on the 3D structures of proteins and nucleic acids | Atomic coordinates | 131,236
DrugBank | Approved, investigational, experimental, and withdrawn drugs | SMILES | 9300

MoleculeNet benchmark data sets
QM8 | Subset of GDB-13 with associated QM properties | SMILES | 21,786
QM9 | Subset of GDB-13 with associated QM properties | SMILES | 133,885
ESOL | Common organic small molecules with solubility information | SMILES | 1128
FreeSolv | Calculated and experimental hydration free energy of molecules | SMILES | 642
Lipophilicity | Experimental results of logD for organic small molecules | SMILES | 4200
PCBA | PubChem subset with biological activities | SMILES | 437,929
MUV | PubChem subset for virtual screening validation | SMILES | 93,087
HIV | Experimental results for HIV replication inhibition | SMILES | 41,127
PDBbind | Binding affinities for ligands in biomolecular complexes | SMILES | 11,908
BACE | IC50 values against BACE-1 (human β-secretase 1) | SMILES | 1513
BBBP | Ability of organic molecules to cross the blood–brain barrier | SMILES | 2039
Tox21 | Toxicity measurements on 12 targets | SMILES | 7831
ToxCast | Toxicity measurements on more than 600 targets | SMILES | 8575
SIDER | Adverse drug reactions of a selection of marketed drugs | SMILES | 1427
ClinTox | FDA-approved drugs and drugs that failed clinical trials for toxicity reasons | SMILES | 1478

Other data sets
PubMed Central | Full-text archive of biomedical and life sciences journal literature | Text | 327,628
Gutenberg | A subset of public domain Project Gutenberg eBooks | Text | 3036
NIPS | Abstracts of NIPS conference papers from 1987 to 2015 | Text | 7241
RNA sequencing | A subset of the PANCAN database | Gene expression | 801
ProteomeHD | Human proteome co-regulation data | Co-regulation scores | 5013
Flow cytometry | Data gathered from a flow cytometry experiment | Signal intensity | 436,877
MiniBooNE | Data gathered by the MiniBooNE particle physics experiment | Particle ID | 130,065