From: canSAR chemistry registration and standardization pipeline
3,157,884 unique 2D structures in canSAR | canSAR Hierarchy | Input file | Description | Steps |
---|---|---|---|---|
2,668,609 Standard Forms and Canonical Representatives | Standardized Compound (SC) | Original source DBs in SDF format | Standard form | 1. Checker and 2. Standardizer: SDF parsing and filtering of empty molblocks; RDKit sanitization; structure standardization through RDKit Standardizer |
2,668,609 Standard Forms and Canonical Representatives | Canonical Representative (CR) | SC Output | Canonical Representative | 3. Generation of Canonical Representative: generation of at most 30 canonical tautomers; prevent canonicalization in the presence of chiral center; time-out for canonicalization set at 250 ms |
2,304,805 Unsalted Canonical Representatives | Unsalted Canonical Representative (UCR) | CR Output | Free base canonical tautomer | 4. Salt strip: strip inorganic and organic counterions; strip solvents and fragments; strip shorter SMILES string; keep first fragment with two identical SMILES strings; neutralization |
2,162,736 Abstract Representation | Abstract Representation (AR) | UCR Output | Abstract representation (Canonical compound stripped of salts and stereochemistry) | 5. Generation of abstract structure to get parent compounds: strip stereochemistry; strip cis/trans and E/Z isomerism; strip isotopes |
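
The fragment-selection part of the salt strip (step 4: strip the shorter SMILES string, and keep the first fragment when two fragments have identical SMILES strings) can be illustrated with a minimal, pure-Python sketch. It is not the canSAR implementation. The function name `pick_parent_fragment` is hypothetical, and the sketch makes several assumptions: fragments are separated by `.` in the input SMILES, string length is used as the size criterion as the table suggests, and ties in length are resolved in favour of the earlier fragment. Counterion and solvent lookup and neutralization are not covered here.

```python
def pick_parent_fragment(smiles: str) -> str:
    """Keep one fragment of a multi-component SMILES string.

    Assumptions (not taken from the canSAR code):
    - fragments are separated by '.', with no ring-closure digits
      shared across the dot;
    - the fragment with the longest SMILES string is the parent;
    - identical fragments collapse to their first occurrence;
    - on a length tie, the earlier fragment wins.
    """
    # Split into fragments and drop empty pieces (e.g. from a trailing dot).
    fragments = [frag for frag in smiles.split(".") if frag]

    # dict.fromkeys preserves insertion order, so duplicates collapse
    # onto their first occurrence.
    unique = list(dict.fromkeys(fragments))
    if not unique:
        return ""

    # max() returns the first maximal element, which gives the
    # "earlier fragment wins" tie-break.
    return max(unique, key=len)


if __name__ == "__main__":
    for example in ("CC(=O)[O-].[Na+]", "Cl.CCN(CC)CC", "CCO.CCO"):
        print(example, "->", pick_parent_fragment(example))
```

In a real pipeline the comparison would presumably be made on canonical SMILES produced by a cheminformatics toolkit such as RDKit, since the length of an arbitrary SMILES string is only a rough proxy for fragment size.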