Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability

Table 5 The 76 datasets used for our model building experiments

Type	Dataset/group	No. of datasets	Compounds	Active	Inactive	Source
Balanced	AMES	1	4337	2401	1936	[47]
Balanced	CPDBAS	5	1102.6	545.8	556.8	[48]
Balanced	NCTRER	1	217	126	91	[33]
Virtual-screening	ChEMBL	50	10,100	100	10,000	[6, 49]
Virtual-screening	DUD	3	1822.3	42	1780.3	[6, 50]
Virtual-screening	MUV	16	15,026.8	30	14,996.8	[6, 51]

A compound that occurs multiple times is included only once; for example, some of the original 15,000 decoys in each MUV dataset were removed. If multiple occurrences of a compound have differing endpoint values, the compound is omitted. Only 5 of the 7 endpoints from the CPDBAS dataset could be used for this study, as two endpoints (Hamster and Dog/Primates) are too small and yield fewer than 1024 ECFP4 fragments. A more detailed list of datasets is provided in Additional file 2.
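The duplicate handling described in the table note can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `deduplicate`, the use of SMILES strings as compound identifiers, and the toy records are assumptions made for illustration only.

```python
from collections import defaultdict


def deduplicate(records):
    """Collapse duplicate compounds and drop those with conflicting labels.

    `records` is an iterable of (compound_id, label) pairs. The identifier
    is assumed to be a canonical representation of the compound (here,
    hypothetically, a SMILES string) so that duplicates compare equal.
    """
    labels = defaultdict(set)
    for compound_id, label in records:
        labels[compound_id].add(label)

    # Keep a compound once if all its occurrences agree on the endpoint
    # value; omit it entirely if the occurrences disagree.
    return {
        compound_id: next(iter(values))
        for compound_id, values in labels.items()
        if len(values) == 1
    }


# Toy data (hypothetical): "CCO" is a consistent duplicate,
# "CCN" appears with conflicting endpoint values.
records = [
    ("CCO", 1),
    ("CCO", 1),
    ("c1ccccc1", 0),
    ("CCN", 1),
    ("CCN", 0),
]
cleaned = deduplicate(records)
```

In this toy example, `CCO` is kept once, `c1ccccc1` is kept unchanged, and `CCN` is omitted because its two occurrences disagree.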

ISSN: 1758-2946