Table 2 The detailed information of the datasets used in this study

From: Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models

| Dataset | Task type | Compounds | Tasks | Metric | Description |
| --- | --- | --- | --- | --- | --- |
| ESOL | Regression | 1127 | 1 | RMSE | Water solubility for organic small molecules |
| FreeSolv | Regression | 639 | 1 | RMSE | Hydration free energy of small molecules in water |
| Lipop | Regression | 4200 | 1 | RMSE | Octanol/water distribution coefficient (logD at pH = 7.4) |
| HIV | Classification | 40748 | 1 | AUC-ROC | Inhibition of HIV replication |
| BACE | Classification | 1513 | 1 | AUC-ROC | Inhibition of human β-secretase 1 (BACE-1) |
| BBBP | Classification | 2035 | 1 | AUC-ROC | Binary labels of blood–brain barrier penetration |
| ClinTox | Classification | 1475 | 2 | AUC-ROC | Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons |
| SIDER | Classification | 1366 | 27 | AUC-ROC | Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes |
| Tox21 | Classification | 7811 | 12 | AUC-ROC | Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways |
| ToxCast | Classification | 8539 | 182 | AUC-ROC | Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks |
| MUV | Classification | 93087 | 17 | AUC-PRC | Subset of PubChem BioAssay by applying a refined nearest neighbor analysis, designed for the validation of virtual screening techniques |
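
The Metric column names three measures: RMSE for the regression datasets, AUC-ROC for most classification datasets, and AUC-PRC for MUV. The following is a minimal, standard-library-only Python sketch of how such metrics can be computed for a single task. It is an illustration, not the authors' evaluation code: the function names are hypothetical, average precision is used here as one common estimate of the area under the precision-recall curve (the study may have used a different estimator), and the toy inputs are made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, the metric listed for the regression datasets."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive is scored above
    a random negative (tied pairs count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: the mean of the precision values measured at the
    rank of each positive, with items ranked by descending score.
    Assumption: scores are not tied; ties would need threshold grouping."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return total / hits


# Toy inputs, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(auc_roc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

The pairwise AUC-ROC above is quadratic in the number of compounds, which is fine for a sketch but slow for a dataset the size of MUV; a rank-based formulation or a library routine would be the practical choice there. Precision-recall measures are often preferred when positives are very rare, since AUC-ROC can look favorable even when few of the top-ranked compounds are true actives.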