Table 5 The 76 datasets used for our model building experiments

From: Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability

| Type | Dataset/group | Num datasets | Compounds | Active | Inactive | Source |
|---|---|---|---|---|---|---|
| Balanced | AMES | 1 | 4337 | 2401 | 1936 | [47] |
| Balanced | CPDBAS | 5 | 1102.6 | 545.8 | 556.8 | [48] |
| Balanced | NCTRER | 1 | 217 | 126 | 91 | [33] |
| Virtual-screening | ChEMBL | 50 | 10,100 | 100 | 10,000 | [6, 49] |
| Virtual-screening | DUD | 3 | 1822.3 | 42 | 1780.3 | [6, 50] |
| Virtual-screening | MUV | 16 | 15,026.8 | 30 | 14,996.8 | [6, 51] |

  1. Multiple occurrences of the same compound are inserted only once; e.g., some of the original 15,000 decoys in each MUV dataset were removed. If multiple occurrences have differing endpoint values, the compound is omitted. Only 5 of the 7 endpoints from the CPDBAS dataset could be used in this study, as two endpoints (Hamster and Dog/Primates) are too small and yield fewer than 1024 ECFP4 fragments. A more detailed list of datasets is provided in Additional file 2.
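
The duplicate handling described in the table footnote can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each compound is identified by a canonical string (for example, a canonical SMILES) and that the endpoint is a simple label. The function name `deduplicate` and the sample records are hypothetical.

```python
from collections import defaultdict


def deduplicate(records):
    """Collapse duplicate compounds and drop those with conflicting endpoints.

    `records` is an iterable of (compound_id, endpoint_value) pairs, where
    compound_id is assumed to be a canonical identifier (an assumption; the
    source does not say how duplicates were detected).
    """
    values_by_compound = defaultdict(set)
    first_seen_order = []
    for compound_id, endpoint in records:
        if compound_id not in values_by_compound:
            first_seen_order.append(compound_id)
        values_by_compound[compound_id].add(endpoint)

    cleaned = []
    for compound_id in first_seen_order:
        endpoints = values_by_compound[compound_id]
        # Keep the compound once if all occurrences agree; omit it otherwise.
        if len(endpoints) == 1:
            cleaned.append((compound_id, next(iter(endpoints))))
    return cleaned


# Hypothetical sample data: one consistent duplicate, one conflicting duplicate.
sample = [
    ("CCO", "active"),
    ("CCO", "active"),
    ("c1ccccc1", "active"),
    ("c1ccccc1", "inactive"),
    ("CC(=O)O", "inactive"),
]
result = deduplicate(sample)
```

Here `CCO` is kept once, `c1ccccc1` is dropped because its occurrences disagree, and `CC(=O)O` passes through unchanged. Removing entries this way would explain why a dataset can end up with fewer compounds than its nominal size.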