Table 1 Data sets visualized using TMAP

From: Visualization of very large high-dimensional data sets as minimum spanning trees

| Data set | Description | Data type | Size |
| --- | --- | --- | --- |
| Toy data sets | | | |
| COIL20 | Gray-scale images of 20 objects, each rotated 72 times at 5° intervals | Images | 1440 |
| MNIST | Gray-scale images of handwritten digits | Images | 70,000 |
| Fashion MNIST | Gray-scale images of fashion items from 10 classes | Images | 70,000 |
| Chemical compound databases and PDB | | | |
| ChEMBL | Bioactive molecules with drug-like properties | SMILES | 1,159,881 |
| FDB17 and ChEMBL | Fragment database (up to 17 atoms) and ChEMBL | SMILES | 11,261,085 |
| Natural Products Atlas | Bacterial and fungal natural products | SMILES | 24,594 |
| DSSTox | U.S. EPA information on toxicity of chemicals | SMILES | 848,816 |
| PDB | Information on the 3D structures of proteins and nucleic acids | Atomic coordinates | 131,236 |
| DrugBank | Approved, investigational, experimental, and withdrawn drugs | SMILES | 9300 |
| MoleculeNet benchmark data sets | | | |
| QM8 | Subset of GDB-13 with associated QM properties | SMILES | 21,786 |
| QM9 | Subset of GDB-13 with associated QM properties | SMILES | 133,885 |
| ESOL | Common organic small molecules with solubility information | SMILES | 1128 |
| FreeSolv | Calculated and experimental hydration free energy of molecules | SMILES | 642 |
| Lipophilicity | Experimental results of logD for organic small molecules | SMILES | 4200 |
| PCBA | PubChem subset with biological activities | SMILES | 437,929 |
| MUV | PubChem subset for virtual screening validation | SMILES | 93,087 |
| HIV | Experimental results for HIV replication inhibition | SMILES | 41,127 |
| PDBind | Binding affinities for ligands in biomolecular complexes | SMILES | 11,908 |
| BACE | IC50 values against BACE-1 (human β-secretase 1) | SMILES | 1513 |
| BBBP | Ability of organic molecules to cross the blood–brain barrier | SMILES | 2039 |
| Tox21 | Toxicity measurements on 12 targets | SMILES | 7831 |
| ToxCast | Toxicity measurements on more than 600 targets | SMILES | 8575 |
| SIDER | Adverse drug reactions of a selection of marketed drugs | SMILES | 1427 |
| ClinTox | FDA-approved drugs that failed clinical trials for toxicity reasons | SMILES | 1478 |
| Other data sets | | | |
| PubMed Central | Full-text archive of biomedical and life sciences journal literature | Text | 327,628 |
| Gutenberg | A subset of public domain Project Gutenberg eBooks | Text | 3036 |
| NIPS | Abstracts of NIPS conference papers from 1987 to 2015 | Text | 7241 |
| RNA sequencing | A subset of the PANCAN database | Gene expression | 801 |
| ProteomeHD | Human proteome co-regulation data | Co-regulation scores | 5013 |
| Flow cytometry | Data gathered from a flow cytometry experiment | Signal intensity | 436,877 |
| MiniBooNE | Data gathered by the MiniBooNE particle physics experiment | Particle ID | 130,065 |