Table 2 Shingles statistics for the datasets at different stages of the workflow

From: Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization

| Dataset | Size | Distinct shingles r1 | Distinct shingles r2 | Unique shingles r2 | Distinct shingles r3 | Unique shingles r3 |
|---|---|---|---|---|---|---|
| QM9 step 3 | 122,227 | 229 | 28,053 | 7162 | 376,852 | 273,423 |
| PC9 step 3 | 77,790 | 1295 | 39,725 | 18,718 | 223,127 | 158,226 |
| OD9_0 step 3 | 184,158 | 1297 | 57,741 | 22,130 | 544,460 | 392,637 |
| OD9_1 step 1 | 1,023,624 | 1007 | 642,265 | 282,311 | 4,568,964 | 3,675,203 |
| OD9_1 step 2 | 854,059 | 3585 | 979,596 | 548,870 | 4,255,262 | 3,513,467 |
| OD9_1 step 3 | 250,874 | 762 | 213,034 | 103,858 | 1,156,813 | 929,228 |
| OD9 step 1 | 1,276,171 | 2447 | 691,715 | 301,669 | 5,156,545 | 4,064,788 |
| OD9 step 2 | 1,088,773 | 3714 | 1,013,639 | 557,832 | 4,798,140 | 3,870,539 |
| OD9 step 3 | 435,032 | 1563 | 250,163 | 116,483 | 1,665,725 | 1,293,995 |
  1. The unique count is the number of shingles that appear only once
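
The difference between the "distinct" and "unique" columns can be illustrated with a minimal sketch. It assumes that each molecule has already been reduced to a list of shingle strings, and that "appears only once" means a single occurrence across the whole dataset; the article's own extraction procedure is not reproduced here. The function name `shingle_stats` and the toy shingle strings are hypothetical and are not taken from the article.

```python
from collections import Counter


def shingle_stats(shingles_per_molecule):
    """Return (distinct, unique) shingle counts for a dataset.

    distinct: number of different shingles seen at least once.
    unique:   number of shingles seen exactly once in the whole dataset
              (assumption: occurrences are counted across all molecules).
    """
    # Count every occurrence of every shingle over all molecules.
    counts = Counter(
        shingle
        for molecule in shingles_per_molecule
        for shingle in molecule
    )
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique


# Toy data: three hypothetical molecules, each a list of shingle strings.
toy_dataset = [
    ["C-C", "C-O"],
    ["C-C", "C-N"],
    ["C-O", "C=O"],
]

distinct, unique = shingle_stats(toy_dataset)
print(distinct, unique)  # 4 distinct shingles, 2 of which occur only once
```

Under this reading, the unique count can never exceed the distinct count, which is consistent with every row of the table.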