Table 2 Shingles statistics for the datasets at different stages of the workflow

From: Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization

| Dataset | Size | Distinct shingles r1 | Distinct shingles r2 | Unique shingles r2 | Distinct shingles r3 | Unique shingles r3 |
|---|---|---|---|---|---|---|
| QM9 step 3 | 122,227 | 229 | 28,053 | 7162 | 376,852 | 273,423 |
| PC9 step 3 | 77,790 | 1295 | 39,725 | 18,718 | 223,127 | 158,226 |
| OD9_0 step 3 | 184,158 | 1297 | 57,741 | 22,130 | 544,460 | 392,637 |
| OD9_1 step 1 | 1,023,624 | 1007 | 642,265 | 282,311 | 4,568,964 | 3,675,203 |
| OD9_1 step 2 | 854,059 | 3585 | 979,596 | 548,870 | 4,255,262 | 3,513,467 |
| OD9_1 step 3 | 250,874 | 762 | 213,034 | 103,858 | 1,156,813 | 929,228 |
| OD9 step 1 | 1,276,171 | 2447 | 691,715 | 301,669 | 5,156,545 | 4,064,788 |
| OD9 step 2 | 1,088,773 | 3714 | 1,013,639 | 557,832 | 4,798,140 | 3,870,539 |
| OD9 step 3 | 435,032 | 1563 | 250,163 | 116,483 | 1,665,725 | 1,293,995 |

  1. A unique count is the number of shingles that appear only once
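
The difference between the "distinct" and "unique" columns can be illustrated with a minimal sketch. It assumes that the shingles of each molecule have already been extracted as strings, and that "appears only once" means a single occurrence across the whole dataset; neither detail is confirmed by the table itself. The function name `shingle_stats` and the toy shingle labels are hypothetical and are not taken from the article.

```python
from collections import Counter


def shingle_stats(shingles_per_molecule):
    """Return (distinct, unique) shingle counts for a dataset.

    distinct: number of different shingles seen at least once.
    unique:   number of shingles seen exactly once in the whole dataset
              (assumed reading of the table footnote).
    """
    # Count every occurrence of every shingle across all molecules.
    counts = Counter(
        shingle
        for shingles in shingles_per_molecule
        for shingle in shingles
    )
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique


# Toy dataset: three "molecules", each given as a list of placeholder shingles.
toy = [
    ["C-C", "C-O"],
    ["C-C", "C-N"],
    ["C-C", "C-O", "N-O"],
]
distinct, unique = shingle_stats(toy)
print(distinct, unique)  # 4 distinct shingles, 2 of them seen only once
```

Under this reading, the unique count can never exceed the distinct count, which is consistent with every row of the table.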