
Table 1. Translation-related statistics for the domain-specific datasets generated from the structural fingerprints used in the performance analysis, together with the target molecular representations, SMILES and SELFIES

From: Reconstruction of lossless molecular representations from fingerprints

| Abbreviations | Description | Dim | Sequence length (Ave.) | Sequence length (Max) | Token size |
|---|---|---|---|---|---|
| Predefined substructures | | | | | |
| MACCS | | 166 | 50 | 107 | 160 |
| Paths and feature classes | | | | | |
| Avalon | Hashed | 512 | 182 | 470 | 516 |
| Path-based | | | | | |
| HashAP | Atom pair - hashed | 2048 | 92 | 273 | 1998 |
| RDK4 | RDKit fingerprint - hashed | 2048 | 83 | 288 | 2052 |
| RDK4-L | RDK4 - with no branch | 2048 | 58 | 209 | 2052 |
| 4-atom-paths | | | | | |
| TT | Topological torsion | sparse | 32 | 124 | 54973 |
| HashTT | TT - hashed | 2048 | 31 | 118 | 2052 |
| Circular | | | | | |
| AEs | Morgan radius 1 | sparse | 29 | 65 | 54076 |
| ECFP0 | Morgan radius 0 - hashed | 2048 | 10 | 25 | 100 |
| ECFP2 | Morgan radius 1 - hashed | 2048 | 28 | 64 | 2052 |
| ECFP4 | Morgan radius 2 - hashed | 2048 | 47 | 103 | 2052 |
| FCFP2 | Feature-class of ECFP2 | 2048 | 20 | 51 | 1576 |
| FCFP4 | Feature-class of ECFP4 | 2048 | 36 | 86 | 2052 |
| Unique Representation | | | | | |
| SMILES | Tokenized atom-wise | | 51 | 125 | 109 |
| SELFIES | Generic tokenization | | 44 | 127 | 205 |
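
To make the table's columns concrete, the following is a minimal sketch of how average sequence length, maximum sequence length, and token size could be computed for atom-wise tokenized SMILES. It is an illustration under stated assumptions, not the authors' pipeline: the regular expression is a commonly used atom-wise SMILES pattern, the function names `tokenize_smiles` and `sequence_stats` are hypothetical, and the vocabulary count here excludes any special tokens (start, end, padding) that a translation model would normally add.

```python
import re
from statistics import mean

# Commonly used atom-wise SMILES pattern (assumption: the paper's exact
# tokenizer may differ). Bracket atoms and two-letter halogens are matched
# before single-letter atoms so that "Cl" is not split into "C" and "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def sequence_stats(smiles_list):
    """Return average length, maximum length, and vocabulary size."""
    tokenized = [tokenize_smiles(s) for s in smiles_list]
    lengths = [len(tokens) for tokens in tokenized]
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "ave": mean(lengths),
        "max": max(lengths),
        "token_size": len(vocabulary),  # special tokens not counted
    }


if __name__ == "__main__":
    # Toy input only; these are not molecules from the paper's datasets.
    print(tokenize_smiles("CC(=O)Cl"))
    print(sequence_stats(["CCO", "c1ccccc1"]))
```

The same statistics apply to the fingerprint rows if each fingerprint is first written as a sequence of tokens; how the paper builds those sequences is not stated in this table, so that step is left out of the sketch.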