Table 2 Data cleaning results on the dataset from Obach et al. [17], the SIN list from ChemSec [18] and the EPISuite™ solubility dataset

From: A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

| Category | Obach et al. [17] | SIN list [18] | EPISuite™ |
|---|---|---|---|
| Total | 668 | 913 | 5761 |
| Maintain (w/duplicates) | 607 | 256 | 4635 |
| Maintain (w/o duplicates) (H reliability) | 514 | 163 | 3536 |
| Maintain (w/o duplicates) (M reliability) | 85 | 68 | 850 |
| Manual check | 47 | 115 | 639 |
| Rejected (mixtures) | 0 | 23 | 9 |
| Rejected (inorganic or unusual elements) | 2 | 194 | 96 |
| Rejected (missing/ambiguous) | 12 | 394 | 395 |
| Maintain (manual check) (w/duplicates) | 652 | 335 | 5014 |
| Maintain (manual check) (w/o duplicates) (H reliability) | 515 | 163 | 3554 |
| Maintain (manual check) (w/o duplicates) (M reliability) | 128 | 127 | 1171 |
| Rejected (manual check failed) | 2 | 36 | 260 |
  1. Rows labeled "(manual check)" report results after the manual check procedure; all other rows report results before it.
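
The before and after rows of Table 2 appear to be linked by a simple balance: the records maintained after the manual check (with duplicates) equal those maintained before it, plus the records sent to manual check, minus those that failed it. This relation is inferred from the reported counts and is not stated in the table itself. The Python sketch below checks it for each dataset; the dictionary keys and the function name are illustrative and are not taken from the workflow.

```python
# Counts copied from Table 2 (with-duplicates rows only).
# Key names are hypothetical labels chosen for this sketch.
TABLE_2 = {
    "Obach et al. [17]": {
        "maintain": 607, "manual_check": 47,
        "manual_failed": 2, "maintain_after": 652,
    },
    "SIN list [18]": {
        "maintain": 256, "manual_check": 115,
        "manual_failed": 36, "maintain_after": 335,
    },
    "EPISuite": {
        "maintain": 4635, "manual_check": 639,
        "manual_failed": 260, "maintain_after": 5014,
    },
}


def maintained_after_manual_check(row):
    """Assumed balance: kept before + manually checked - failed manual check."""
    return row["maintain"] + row["manual_check"] - row["manual_failed"]


for name, row in TABLE_2.items():
    computed = maintained_after_manual_check(row)
    status = "matches" if computed == row["maintain_after"] else "differs from"
    print(f"{name}: computed {computed} {status} reported {row['maintain_after']}")
```

For the Obach et al. dataset, for example, 607 + 47 - 2 = 652, which agrees with the reported value.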