From: A probabilistic molecular fingerprint for big data settings

MHFP and ECFP workflow comparison. (a) Comparison of hashing and approximate nearest neighbor search indexing of ECFP with Annoy (gray) and of MHFP, built via molecular shingling and MinHash, with LSH Forest (orange). In addition, MinHash is applied to unfolded ECFP hashes, which are likewise indexed using LSH Forest (green), resulting in the hybrid fingerprint MHECFP. The latter was used as a control to separate the influence of molecular shingling from that of MinHash on the measured performance. (b) Circular substructure SMILES of an input molecule are computed with each heavy atom as the center (examples for MHFP4 shown in red and blue). In addition, SMILES for each ring are extracted (examples shown in black). Circular substructure SMILES are rooted at the central atom. All substructure SMILES are canonicalized and kekulized.
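
The MinHash step applied to a molecular shingling can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the shingle strings are made-up placeholders rather than substructure SMILES computed from a real molecule, and the names `hash_shingle`, `make_permutations`, `minhash`, and `estimate_jaccard`, the SHA-1-based hashing, and the 128-permutation default are assumptions chosen for illustration. The sketch only shows how a set of substructure strings can be reduced to a fixed-length signature whose agreement rate approximates the Jaccard similarity of the underlying sets.

```python
import hashlib
import random

# Large prime for the linear "permutation" hashes (an assumption of this sketch).
PRIME = (1 << 61) - 1


def hash_shingle(shingle):
    """Map a substructure string to a 32-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(n_perm=128, seed=42):
    """Draw (a, b) coefficient pairs for the hash family h -> (a * h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_perm)]


def minhash(shingles, perms):
    """Return the MinHash signature: the minimum hash value under each permutation."""
    hashed = [hash_shingle(s) for s in set(shingles)]
    if not hashed:
        raise ValueError("cannot MinHash an empty shingling")
    return [min((a * h + b) % PRIME for h in hashed) for a, b in perms]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical shinglings: placeholder strings standing in for the
# canonical substructure SMILES of two molecules.
shingles_a = {"c1ccccc1", "C(C)O", "CC", "C(=O)O", "CCN"}
shingles_b = {"c1ccccc1", "C(C)O", "CC", "C(=O)N", "CCS"}

perms = make_permutations()
sig_a = minhash(shingles_a, perms)
sig_b = minhash(shingles_b, perms)

exact = len(shingles_a & shingles_b) / len(shingles_a | shingles_b)
print("exact Jaccard:", exact)
print("MinHash estimate:", estimate_jaccard(sig_a, sig_b))
```

Because every signature has the same length regardless of how many shingles a molecule yields, signatures of this kind can be handed to a locality-sensitive hashing index such as an LSH Forest, which is the role the index plays in the workflow described in the caption.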

