Poster presentation | Open | Published:

The FPS fingerprint format and chemfp toolkit

During the GCC 2010 poster session, I presented a draft version of the FPS format for storing dense binary fingerprints. That format is now stable and is supported by RDKit [1], CACTVS [2], and other software. The chemfp package is a set of command-line tools and a Python library for fingerprint generation and high-speed Tanimoto search. It can extract pre-computed fingerprints from an SD tag, or it can use OpenEye's OEChem [3], Open Babel [4], or RDKit to generate fingerprints. Search uses a combination of careful indexing [5], CPU-specific instructions (if available), and OpenMP. Nearest-100 similarity searches of PubChem-sized data sets take less than a second on a laptop, and Butina clustering [6] of 2 million compounds takes about 6 hours on a 15 CPU node. In my poster I present the FPS format and the chemfp package, and describe how the memory and performance requirements lead to the internal search architecture.
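To make the ideas above concrete, here is a minimal Python sketch of reading FPS-style records and running a Tanimoto threshold search. It assumes each data line holds a hex-encoded fingerprint, a tab, and an identifier, with header lines starting with "#". The function names (read_fps, popcount, tanimoto, threshold_search) and the sample data are hypothetical and are not the chemfp API. The pruning step uses the popcount bound described in [5], which is only one ingredient of the indexing mentioned in the abstract; the real implementation is considerably more optimized.

```python
# Illustrative sketch only; it does not use or reproduce the chemfp API.

# Made-up sample data in an FPS-like layout: header lines, then hex<TAB>id.
SAMPLE = "#FPS1\n#num_bits=16\nf00f\tmol1\n0ff0\tmol2\nf00e\tmol3\n"


def read_fps(lines):
    """Yield (id, fingerprint_bytes) pairs from FPS-style text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        hex_fp, fp_id = line.split("\t")[:2]
        yield fp_id, bytes.fromhex(hex_fp)


def popcount(fp):
    """Number of on-bits in a fingerprint stored as bytes."""
    return sum(bin(byte).count("1") for byte in fp)


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    both = sum(bin(x & y).count("1") for x, y in zip(a, b))
    either = popcount(a) + popcount(b) - both
    return both / either if either else 0.0


def threshold_search(query, targets, threshold):
    """Return (id, score) for targets scoring >= threshold, best first."""
    q = popcount(query)
    hits = []
    for fp_id, fp in targets:
        t = popcount(fp)
        # Popcount bound: Tanimoto can never exceed min(q, t) / max(q, t),
        # so targets failing this test are skipped without a full comparison.
        if min(q, t) < threshold * max(q, t):
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((fp_id, score))
    return sorted(hits, key=lambda hit: -hit[1])


query = bytes.fromhex("f00f")
hits = threshold_search(query, read_fps(SAMPLE.splitlines()), 0.7)
print(hits)  # [('mol1', 1.0), ('mol3', 0.875)]
```

In this sketch the bound is checked per record. If the targets were first sorted by popcount, the same bound would select a contiguous range of candidates, so most of a large data set would never need to be touched.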

References

  1. [http://rdkit.org]

  2. [http://xemistry.org/]

  3. [http://www.eyesopen.com/oechem-tk]

  4. [http://openbabel.org]

  5. Swamidass SJ, Baldi P: Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J Chem Inf Model. 2007, 47: 302-317. 10.1021/ci600358f.

  6. Butina D: Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J Chem Inf Comput Sci. 1999, 39: 747-750. 10.1021/ci9803381.

Author information

Correspondence to Andrew Dalke.

About this article

Keywords

  • Similarity Search
  • Performance Requirement
  • Poster Session
  • Draft Version
  • Careful Indexing