Table 2 Data cleaning results on the dataset from Obach et al. [17], the SIN list from ChemSec [18] and the EPISuite™ solubility dataset

From: A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

| Category | Obach et al. [17] | SIN list [18] | EPISuite™ |
| --- | --- | --- | --- |
| Total | 668 | 913 | 5761 |
| Maintain (w/duplicates) | 607 | 256 | 4635 |
| Maintain (w/o duplicates) (H reliability) | 514 | 163 | 3536 |
| Maintain (w/o duplicates) (M reliability) | 85 | 68 | 850 |
| Manual check | 47 | 115 | 639 |
| Rejected (mixtures) | 0 | 23 | 9 |
| Rejected (inorganic or unusual elements) | 2 | 194 | 96 |
| Rejected (missing/ambiguous) | 12 | 394 | 395 |
| Maintain (manual check) (w/duplicates) | 652 | 335 | 5014 |
| Maintain (manual check) (w/o duplicates) (H reliability) | 515 | 163 | 3554 |
| Maintain (manual check) (w/o duplicates) (M reliability) | 128 | 127 | 1171 |
| Rejected (manual check failed) | 2 | 36 | 260 |

  1. Numbers refer to results before and after the manual check procedure
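
The before/after relationship in the footnote can be checked arithmetically. The sketch below assumes (this is inferred from the counts, not stated in the table) that records sent to manual check and not rejected there are added to the "Maintain (w/duplicates)" count. The dictionary keys and function names are illustrative only; the numbers are copied from the table.

```python
# Counts copied from Table 2. The key names are hypothetical labels chosen
# for this sketch; they do not come from the original workflow.
table = {
    "Obach et al.": {"maintain": 607, "manual_check": 47, "failed": 2, "maintain_after": 652},
    "SIN list": {"maintain": 256, "manual_check": 115, "failed": 36, "maintain_after": 335},
    "EPISuite": {"maintain": 4635, "manual_check": 639, "failed": 260, "maintain_after": 5014},
}


def recovered_by_manual_check(row):
    """Records flagged for manual check that were not rejected by it."""
    return row["manual_check"] - row["failed"]


def is_consistent(row):
    """Assumed relation: after = before + (manual check - manual check failed)."""
    return row["maintain"] + recovered_by_manual_check(row) == row["maintain_after"]


for name, row in table.items():
    print(name, recovered_by_manual_check(row), is_consistent(row))
```

Under this assumption the relation holds for all three datasets, which suggests the "(manual check)" rows report the counts after the manual check procedure.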