chemfp - fast and portable fingerprint formats and tools

Fingerprints are conceptually simple, but the abstract sequence of 0 and 1 bits is represented in an astonishing variety of forms. The diversity exists for a very practical reason: it is easier for most researchers to create a simple format than it is to search for or advocate a common standard. Incompatible formats often have no immediate or large negative consequence. The problems are more subtle. Ad hoc formats cannot easily be exchanged with other groups. They lack metadata to help track the provenance of a data set. They do not come with existing tools for creating and manipulating records, and the tools which do get written are often an order of magnitude slower than what an optimized program can achieve.

I have developed two portable file formats for storing the short, dense fingerprints (on the order of 16 K bits or less, with bit density > 1%) often seen in cheminformatics. The FPS format is a line-based text format using hex fingerprint encoding. It is designed to be readable and easy to generate and parse. The FPB format is a block-based binary format designed for high-performance operations, including an optimized ordering for sublinear Tanimoto searches [1]. The format descriptions are freely available at [2], along with the chemfp Python package to generate, convert, and work with the formats. The package includes a C library and extension for fast parsing and fingerprint operations.
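To make the two ideas above concrete, here is a minimal Python sketch of reading hex-encoded fingerprint records and running a Tanimoto threshold search that skips fingerprints by bit count. It is not the chemfp API. All function names are hypothetical, and the record layout (hex fingerprint, a tab, then an identifier, with `#` lines treated as metadata) is an assumption for illustration; the format description at [2] is the authority. The pruning uses the bound from [1]: the Tanimoto score of two fingerprints with bit counts a and b cannot exceed min(a, b) / max(a, b).

```python
from collections import defaultdict


def parse_fps_line(line):
    """Split one record line into (fingerprint bytes, identifier).

    Assumed layout: hex fingerprint, a tab, then the identifier.
    """
    hex_fp, identifier = line.rstrip("\n").split("\t")[:2]
    return bytes.fromhex(hex_fp), identifier


def read_records(lines):
    """Yield (fingerprint, identifier), skipping blank and '#' lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # assumed: '#' lines carry metadata, not records
        yield parse_fps_line(line)


def popcount(fp):
    """Number of bits set to 1 in the fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")


def tanimoto(a, b):
    """|a AND b| / |a OR b|, defined here as 0.0 when both are empty."""
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    union = bin(x | y).count("1")
    if union == 0:
        return 0.0
    return bin(x & y).count("1") / union


def build_buckets(records):
    """Group records by bit count so whole groups can be skipped."""
    buckets = defaultdict(list)
    for fp, identifier in records:
        buckets[popcount(fp)].append((fp, identifier))
    return buckets


def threshold_search(query, buckets, threshold):
    """Return (identifier, score) pairs with score >= threshold."""
    q = popcount(query)
    hits = []
    for count, members in buckets.items():
        largest = max(q, count)
        # Upper bound on the score for every fingerprint in this bucket.
        if largest == 0 or min(q, count) / largest < threshold:
            continue
        for fp, identifier in members:
            score = tanimoto(query, fp)
            if score >= threshold:
                hits.append((identifier, score))
    return hits


# Hypothetical 16-bit records, made up for this example.
sample_lines = [
    "#an assumed metadata line\n",
    "ff00\tA\n",
    "ff0f\tB\n",
    "0100\tC\n",
    "f000\tD\n",
]
sample_buckets = build_buckets(read_records(sample_lines))
sample_hits = threshold_search(bytes.fromhex("ff00"), sample_buckets, 0.6)
```

With the query `ff00` and a threshold of 0.6, the buckets holding `C` (1 bit set) and `D` (4 bits set) are skipped without computing a single intersection, because their bounds of 1/8 and 4/8 fall below the threshold. Storing fingerprints ordered by bit count, as a binary format can, turns the same test into a contiguous range of records to scan.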

References

  1. Swamidass S, Baldi P: Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J Chem Inf Model. 2007, 47: 302-317. 10.1021/ci600358f.

  2. chem-fingerprints project at Google Code. http://code.google.com/p/chem-fingerprints/

Author information

Correspondence to AP Dalke.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Dalke, A. chemfp - fast and portable fingerprint formats and tools. J Cheminform 3, P12 (2011). https://doi.org/10.1186/1758-2946-3-S1-P12

Keywords

  • Simple Format
  • File Format
  • Format Description
  • Text Format
  • Binary Format