
Table 1 Description of canSAR hierarchy and steps to generate each output

From: canSAR chemistry registration and standardization pipeline

3,157,884 unique 2D structures in canSAR

| canSAR Hierarchy | Input file | Description | Steps |
| --- | --- | --- | --- |
| 2,668,609 Standard Forms and Canonical Representatives | | | |
| Standardized Compound (SC) | Original source DBs in SDF format | Standard form | 1. Checker and 2. Standardizer: SDF parsing and filtering of empty molblocks; RDKit sanitization; Structure Standardization through RDKit Standardizer |
| Canonical Representative (CR) | SC Output | Canonical Representative | 3. Generation of Canonical Representative: Generation of at most 30 canonical tautomers; Prevent canonicalization in the presence of chiral center; Time-out for canonicalization set at 250 ms |
| 2,304,805 Unsalted Canonical Representatives | | | |
| Unsalted Canonical Representative (UCR) | CR Output | Free base canonical tautomer | 4. Salt strip: Strip inorganic and organic counterions; Strip solvents and fragments; Strip shorter SMILES string; Keep first fragment with two identical SMILES strings; Neutralization |
| 2,162,736 Abstract Representation | | | |
| Abstract Representation (AR) | UCR Output | Abstract representation (Canonical compound stripped of salts and stereochemistry) | 5. Generation of abstract structure to get parent compounds: Strip stereochemistry; Strip cis/trans and E/Z isomerism; Strip isotopes |

The total number of unique 2D structures in canSAR is reported for each hierarchy level, together with the input file and the steps carried out to generate the corresponding output files.
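
The fragment-selection rules listed under step 4 ("Strip shorter SMILES string" and "Keep first fragment with two identical SMILES strings") can be illustrated with a minimal sketch. This is not the canSAR implementation: the function name `keep_longest_fragment` is hypothetical, fragments are assumed to be the dot-separated components of a SMILES string, "shorter" is assumed to mean fewer characters, and the first fragment is assumed to win any tie in length. Counterion and solvent stripping and neutralization, which the table lists as separate sub-steps, are not covered.

```python
def keep_longest_fragment(smiles: str) -> str:
    """Return the dot-separated fragment with the longest SMILES string.

    Assumptions (not taken from the canSAR code): length is measured in
    characters, and the first fragment wins when lengths are equal.
    """
    # In SMILES, "." separates disconnected components (e.g. salt and counterion).
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        raise ValueError("empty SMILES string")
    # max() returns the first maximal item, so ties keep the first fragment.
    return max(fragments, key=len)


if __name__ == "__main__":
    # Acetate with a sodium counterion: the shorter "[Na+]" fragment is dropped.
    print(keep_longest_fragment("CC(=O)[O-].[Na+]"))
    # Two identical fragments: the first one is kept.
    print(keep_longest_fragment("CCO.CCO"))
```

Selecting on string length is only a proxy for molecular size. A production pipeline would more plausibly compare fragments on the parsed molecular graph, for example by heavy-atom count, using a cheminformatics toolkit.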