From: Visualization of very large high-dimensional data sets as minimum spanning trees
Data set | Description | Data type | Size |
---|---|---|---|
Toy data sets | | | |
 COIL20 | Gray-scale images of 20 objects, each rotated 72 times at 5° intervals | Images | 1440 |
 MNIST | Gray-scale images of handwritten digits | Images | 70,000 |
 Fashion MNIST | Gray-scale images of fashion items from 10 classes | Images | 70,000 |
Chemical compound databases and PDB | | | |
 ChEMBL | Bioactive molecules with drug-like properties | SMILES | 1,159,881 |
 FDB17 and ChEMBL | Fragment database (up to 17 atoms) and ChEMBL | SMILES | 11,261,085 |
 Natural Products Atlas | Bacterial and fungal natural products | SMILES | 24,594 |
 DSSTox | U.S. EPA information on toxicity of chemicals | SMILES | 848,816 |
 PDB | Information on the 3D structures of proteins and nucleic acids | Atomic coordinates | 131,236 |
 DrugBank | Approved, investigational, experimental, and withdrawn drugs | SMILES | 9300 |
MoleculeNet benchmark data sets | | | |
 QM8 | Subset of GDB-17 with associated QM properties | SMILES | 21,786 |
 QM9 | Subset of GDB-17 with associated QM properties | SMILES | 133,885 |
 ESOL | Common organic small molecules with solubility information | SMILES | 1128 |
 FreeSolv | Calculated and experimental hydration free energy of molecules | SMILES | 642 |
 Lipophilicity | Experimental results of logD for organic small molecules | SMILES | 4200 |
 PCBA | PubChem subset with biological activities | SMILES | 437,929 |
 MUV | PubChem subset for virtual screening validation | SMILES | 93,087 |
 HIV | Experimental results for HIV replication inhibition | SMILES | 41,127 |
 PDBbind | Binding affinities for ligands in biomolecular complexes | SMILES | 11,908 |
 BACE | IC50 values against BACE-1 (human β-secretase 1) | SMILES | 1513 |
 BBBP | Ability of organic molecules to cross the blood–brain barrier | SMILES | 2039 |
 Tox21 | Toxicity measurements on 12 targets | SMILES | 7831 |
 ToxCast | Toxicity measurements on more than 600 targets | SMILES | 8575 |
 SIDER | Adverse drug reactions of a selection of marketed drugs | SMILES | 1427 |
 ClinTox | FDA-approved drugs and drugs that failed clinical trials for toxicity reasons | SMILES | 1478 |
Other data sets | | | |
 PubMed Central | Full-text archive of biomedical and life sciences journal literature | Text | 327,628 |
 Gutenberg | A subset of public domain Project Gutenberg eBooks | Text | 3036 |
 NIPS | Abstracts of NIPS conference papers from 1987 to 2015 | Text | 7241 |
 RNA sequencing | A subset of the PANCAN database | Gene expression | 801 |
 ProteomeHD | Human proteome co-regulation data | Co-regulation scores | 5013 |
 Flow cytometry | Data gathered from a flow cytometry experiment | Signal intensity | 436,877 |
 MiniBooNE | Data gathered by the MiniBooNE particle physics experiment | Particle ID | 130,065 |