Table 1 Data sets visualized using TMAP

From: Visualization of very large high-dimensional data sets as minimum spanning trees

| Data set | Description | Data type | Size |
| --- | --- | --- | --- |
| Toy data sets | | | |
| COIL20 | Gray-scale images of 20 objects, each rotated 72 times at 5° intervals | Images | 1440 |
| MNIST | Gray-scale images of handwritten digits | Images | 70,000 |
| Fashion MNIST | Gray-scale images of fashion items from 10 classes | Images | 70,000 |
| Chemical compound databases and PDB | | | |
| ChEMBL | Bioactive molecules with drug-like properties | SMILES | 1,159,881 |
| FDB17 and ChEMBL | Fragment database (up to 17 atoms) and ChEMBL | SMILES | 11,261,085 |
| Natural Products Atlas | Bacterial and fungal natural products | SMILES | 24,594 |
| DSSTox | U.S. EPA information on toxicity of chemicals | SMILES | 848,816 |
| PDB | Information on the 3D structures of proteins and nucleic acids | Atomic coordinates | 131,236 |
| DrugBank | Approved, investigational, experimental, and withdrawn drugs | SMILES | 9300 |
| MoleculeNet benchmark data sets | | | |
| QM8 | Subset of GDB-13 with associated QM properties | SMILES | 21,786 |
| QM9 | Subset of GDB-13 with associated QM properties | SMILES | 133,885 |
| ESOL | Common organic small molecules with solubility information | SMILES | 1128 |
| FreeSolv | Calculated and experimental hydration free energy of molecules | SMILES | 642 |
| Lipophilicity | Experimental results of logD for organic small molecules | SMILES | 4200 |
| PCBA | PubChem subset with biological activities | SMILES | 437,929 |
| MUV | PubChem subset for virtual screening validation | SMILES | 93,087 |
| HIV | Experimental results for HIV replication inhibition | SMILES | 41,127 |
| PDBbind | Binding affinities for ligands in biomolecular complexes | SMILES | 11,908 |
| BACE | IC50 values against BACE-1 (human β-secretase 1) | SMILES | 1513 |
| BBBP | Ability of organic molecules to cross the blood–brain barrier | SMILES | 2039 |
| Tox21 | Toxicity measurements on 12 targets | SMILES | 7831 |
| ToxCast | Toxicity measurements on more than 600 targets | SMILES | 8575 |
| SIDER | Adverse drug reactions of a selection of marketed drugs | SMILES | 1427 |
| ClinTox | FDA-approved drugs and drugs that failed clinical trials for toxicity reasons | SMILES | 1478 |
| Other data sets | | | |
| PubMed Central | Full-text archive of biomedical and life sciences journal literature | Text | 327,628 |
| Gutenberg | A subset of public domain Project Gutenberg eBooks | Text | 3036 |
| NIPS | Abstracts of NIPS conference papers from 1987 to 2015 | Text | 7241 |
| RNA sequencing | A subset of the PANCAN database | Gene expression | 801 |
| ProteomeHD | Human proteome co-regulation data | Co-regulation scores | 5013 |
| Flow cytometry | Data gathered from a flow cytometry experiment | Signal intensity | 436,877 |
| MiniBooNE | Data gathered by the MiniBooNE particle physics experiment | Particle ID | 130,065 |