Bloom filters for molecules
Journal of Cheminformatics volume 15, Article number: 95 (2023)
Ultra-large chemical libraries are reaching tens to hundreds of billions of molecules. A challenge for these libraries is to efficiently check if a proposed molecule is present. Here we propose and study Bloom filters for testing if a molecule is present in a set using either string or fingerprint representations. Bloom filters are small enough to hold billions of molecules in just a few GB of memory and check membership in under a millisecond. We found string representations can have a false positive rate below 1% and require significantly less storage than using fingerprints. Canonical SMILES with Bloom filters with the simple FNV (Fowler-Noll-Vo) hashing function provide fast and accurate membership tests with small memory requirements. We provide a general implementation and specific filters for detecting if a molecule is purchasable, patented, or a natural product according to existing databases at https://github.com/whitead/molbloom.
With the growing scale of molecular screening, which now involves searching through billions of chemical structures, the processing times for querying extensive compound datasets have significantly increased [1, 2]. To address this, a Bloom filter can compress any database into a compact structure that supports only membership verification.
The Bloom filter, a space-efficient probabilistic data structure, was designed to ascertain whether an element belongs to a specific set. First proposed by Burton H. Bloom, this data structure has demonstrated exceptional value for large datasets, where traditional set membership testing methods would be excessively time-consuming. At its core, the Bloom filter uses a fixed-size (m) bit array to represent n elements, employing k hash functions to map each element to k positions within the array [3,4,5]. This allows Bloom filters to conduct set membership tests with low false positive rates while using less time and space than traditional data retrieval techniques.
Originally applied in dictionaries and spell checkers [3, 6], Bloom filters allowed for the quick identification of words within a given vocabulary, where the only significant drawback was false positives, in which misspelled words were accepted as correct. Over time, the scope of their applications broadened to encompass web searches, such as Google Chrome’s former use of a Bloom filter to detect malicious URLs, among other use cases [8,9,10]. Concrete examples of the usage of Bloom filters in chemistry workflows include exploration of chemical space while either asserting commercial availability or excluding patented chemicals, without needing memory-intensive databases or external server dependencies. A real-life example can be found in ChemCrow, where Bloom filters are used in the molecule recommendation setting to make sure recommended chemicals are purchasable without intensive memory requirements. As underscored by the Bloom filter principle, “Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.”
Traditionally, molecules have been represented using structure-based fingerprints. In this study, we built different Bloom filters using the Coconut database to compare the effectiveness of structure-based hashing with string hashes in the Bloom filter; we demonstrate that string hashing consistently outperforms its counterpart. To provide further context, Table 1 presents well-known chemistry databases, their approximate number of compounds, storage size required for text (SMILES) representation, and a comparison with a Bloom filter designed to store an equivalent number of molecules.
This study explores the use of Bloom filters in molecular databases. We also note alternative data structures that can either mitigate some limitations of Bloom filters or serve entirely different objectives. For example, Cuckoo filters provide the capability for dynamic item insertion and deletion, a feature absent in conventional Bloom filters. Other alternatives, such as Quotient filters and Count-Min sketches, also offer unique advantages and can be found in the literature. On a different note, Locality-Sensitive Hashing (LSH) serves the specialized purpose of maximizing hash collisions to facilitate similarity searches. However, LSH techniques often grapple with computational challenges as data scales, leading to memory requirements that can quickly exceed available main memory. In contrast, Bloom filter indices, even for extensive databases like ZINC, can comfortably reside in the main memory of everyday consumer devices, such as smartwatches or cellphones.
A Bloom filter is initialized with an m-length bit vector, with all positions set to zero, and employs k independent hashing functions. These hashing functions generate k values ranging from 0 to m−1, which correspond to the positions in the bit vector where a “1” will be assigned. The hashing functions must exhibit the following characteristics: (1) quick computation; (2) an avalanche effect, where minor input changes result in substantial and unpredictable output alterations; and (3) the generation of integers between 0 and m−1.
Bloom filters enable the addition of new members but do not support individual removals. The filter can be queried to determine if a particular element has been added previously. However, this simplicity comes with certain drawbacks, such as the potential for two or more elements to be hashed to the same position in the Bloom filter (i.e., collisions). As a result, removing an element (by changing its positions from one to zero) could inadvertently affect other members with overlapping positions. This issue underscores the importance of randomness in hashing functions, often referred to as the avalanche effect. Figures 1 and 2 illustrate the workings of a Bloom filter and the storage of molecules within such filters.
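The add and query operations described above can be sketched in a few lines of Python. This is a minimal illustration and not the MolBloom implementation: the class name `BloomFilter` and the use of index-salted MD5 digests to emulate k independent hash functions are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash positions per element."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at zero

    def _positions(self, item: str):
        # Assumption: k "independent" hashes are emulated by salting MD5 with i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the k bits for this element; bits are never cleared.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Report membership only if every one of the k bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=4)
added = ["CCO", "c1ccccc1", "CC(=O)O"]
for smiles in added:
    bf.add(smiles)

print("CCO" in bf)  # added elements are always reported as present
print("CCN" in bf)  # very likely False; a True here would be a false positive
```

Because several elements may share a bit, clearing the bits of one element could erase evidence of another, which is why this sketch offers no `remove` method.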
Double hashing is employed to minimize the probability of collisions in the indexing of new members. Two distinct “universal hashes,” \(h_\alpha\) and \(h_\beta\), are utilized to obtain k individual indices:
\[h_i(A) = \left(h_\alpha(A) + i \cdot h_\beta(A)\right) \bmod |m|\]
Here, ‘A’ represents an element being hashed, \(h_i\) refers to one of the k hash functions generated per element (as illustrated in Figs. 1 and 2), |m| denotes the fixed size of the filter, and “mod” signifies the remainder of the division. Restrictions that reduce collision probability are:
\(h_\beta \ne 0\),
\(h_\beta (A)\) should not be divisible by the size of the filter.
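As a hedged sketch of double hashing under the two restrictions above, the following Python function derives k indices from two base hashes using the common form \(h_i(A) = (h_\alpha(A) + i \cdot h_\beta(A)) \bmod |m|\). The function name `double_hash_indices`, the choice of the two halves of an MD5 digest as \(h_\alpha\) and \(h_\beta\), and the fallback used when \(h_\beta\) violates a restriction are all assumptions for illustration.

```python
import hashlib


def double_hash_indices(item: str, k: int, m: int) -> list:
    """Return k bit positions in [0, m) for `item` via double hashing."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    # Assumption: the two 64-bit halves of one digest act as h_alpha and h_beta.
    h_alpha = int.from_bytes(digest[:8], "big")
    h_beta = int.from_bytes(digest[8:], "big")
    # Enforce the restrictions: h_beta must be nonzero and not divisible by m.
    # (For m >= 2, adding 1 to a multiple of m leaves remainder 1.)
    if h_beta % m == 0:
        h_beta += 1
    return [(h_alpha + i * h_beta) % m for i in range(k)]


indices = double_hash_indices("CCO", k=8, m=1024)
print(indices)
```

If \(h_\beta\) were a multiple of the filter size, every index would collapse onto \(h_\alpha \bmod |m|\) and the element would set a single bit, which is the failure the restrictions guard against.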
Using the described method, the number of generated hashing functions can be selected depending on the number of elements to add (n), the Bloom filter size in bits (M), and the pre-established false positive rate (\(\epsilon\)).
If the false positive rate is specified, M can be calculated as follows:
\[M = -\frac{n \ln \epsilon}{(\ln 2)^2}\]
Conversely, altering M impacts the final false positive rate. Given M and the number of elements to be added, the number of hash functions k is calculated as:
\[k = \frac{M}{n} \ln 2\]
This will yield a range from 8 to 64 hashing functions.
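To make the sizing arithmetic concrete, the sketch below applies the standard optimal Bloom filter relations \(M = -n \ln \epsilon / (\ln 2)^2\) and \(k = (M/n) \ln 2\), together with the usual approximation \((1 - e^{-kn/M})^k\) for the resulting false positive rate. The function names and the example values (400,000 elements and a 0.1% target) are illustrative assumptions, not settings reported in this work.

```python
import math


def optimal_size_bits(n: int, eps: float) -> int:
    """Bits needed to hold n elements at target false positive rate eps."""
    return math.ceil(-n * math.log(eps) / (math.log(2) ** 2))


def optimal_num_hashes(m_bits: int, n: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(m_bits / n * math.log(2)))


def expected_fpr(m_bits: int, n: int, k: int) -> float:
    """Standard approximation of the false positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m_bits)) ** k


n = 400_000   # illustrative dataset size
eps = 0.001   # illustrative target false positive rate
M = optimal_size_bits(n, eps)
k = optimal_num_hashes(M, n)

print(f"bits per element: {M / n:.2f}")
print(f"filter size: {M / 8 / 1e6:.2f} MB")
print(f"hash functions: {k}")
print(f"expected false positive rate: {expected_fpr(M, n, k):.5f}")
```

Under these assumptions the filter needs roughly 14.4 bits per element, well under a megabyte in total, and ten hash functions.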
The Python package MolBloom developed for this work is an open-source package designed for molecules, featuring a built-in filter with ZINC-in-stock molecules. The package permits the creation of custom filters of varying sizes, which were adjusted in increments of one order of magnitude. Tests were conducted using the Coconut dataset (approximately 400,000 molecules).
For comparative purposes, molecular fingerprints were employed to populate a Bloom filter and measure the false positive rate for increasing bit-array sizes. The hashing functions used in this study include Fowler-Noll-Vo (FNV), as well as message digest 4 and 5 algorithms (MD4 and MD5) [28, 29] for string hashing. For chemical structure fingerprints, six pairwise combinations of MACCS, Morgan, Atom-pair, and RDKit fingerprints were utilized. This was done to investigate how traditional ways to hash molecules would behave in this setting. FNV is a hash function designed for rapid, non-cryptographic hashing of data, leveraging prime numbers and bitwise operations to generate hash values that identify unique data elements. The FNV algorithm offers variants of different bit sizes and prime numbers, such as FNV-1 and FNV-1a. MD4 and MD5 are well-established hashing functions within the computer science community.
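For readers unfamiliar with FNV, the following is a short pure-Python sketch of the 64-bit FNV-1a variant applied to SMILES strings. The text above does not state which FNV variant or bit width was used, so the choice of 64-bit FNV-1a here, the function name `fnv1a_64`, and the filter size are assumptions for illustration; the offset basis and prime are the published FNV constants.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # published 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # published 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep the state at 64 bits
    return h


m = 1_000_003  # hypothetical filter size in bits
for smiles in ("CCO", "CCN"):
    h = fnv1a_64(smiles.encode("utf-8"))
    print(smiles, hex(h), h % m)
```

Even though the two strings differ by a single character, their hashes should look unrelated, which is the avalanche behaviour a Bloom filter relies on. FNV-1 differs only in the order of the two steps: it multiplies first and then XORs.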
To assess false positive rates in each filter with different sizes, a fifty-fifty split was performed. The first half was added to empty filters, followed by membership testing in the second half. Any molecules from the second half classified as part of the set were counted as false positives.
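The fifty-fifty protocol can be sketched as follows. This is an illustrative reconstruction rather than the evaluation code used in this work: the function names are hypothetical, the items are synthetic placeholder strings standing in for canonical SMILES, and MD5-based double hashing is assumed for the bit positions.

```python
import hashlib
import random


def positions(item: str, k: int, m: int) -> list:
    """k bit positions via double hashing on the two halves of an MD5 digest."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    h_a = int.from_bytes(digest[:8], "big")
    h_b = int.from_bytes(digest[8:], "big")
    if h_b % m == 0:  # keep the stride nonzero modulo m
        h_b += 1
    return [(h_a + i * h_b) % m for i in range(k)]


def false_positive_rate(items, m: int, k: int, seed: int = 0) -> float:
    """Add a random half of `items` to a filter and test the other half."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    members, held_out = shuffled[:half], shuffled[half:]

    bits = bytearray(m)  # one byte per bit, for clarity rather than compactness
    for item in members:
        for p in positions(item, k, m):
            bits[p] = 1

    # Held-out items were never added, so every hit is a false positive.
    hits = sum(all(bits[p] for p in positions(item, k, m)) for item in held_out)
    return hits / len(held_out)


# Synthetic, unique placeholder strings (not real molecules).
items = [f"mol-{i}" for i in range(20_000)]
rates = {}
for bits_per_member in (2, 8, 32):
    m = bits_per_member * (len(items) // 2)
    rates[bits_per_member] = false_positive_rate(items, m, k=4)
    print(bits_per_member, rates[bits_per_member])
```

The protocol assumes the items are unique, so that nothing in the held-out half was actually inserted; with that in place the measured rate should fall steeply as the number of bits per stored element grows.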
An evaluation was conducted to compare the speed of Bloom filters and traditional methods in searching for elements within a dataset (using the dataset’s native API).
Results and discussion
All six possible fingerprint combinations across eight distinct orders of magnitude for the Bloom filter and string hash implementations were examined. Figure 3 provides a comprehensive summary of the results.
As illustrated in Fig. 3, two key observations can be made. First, as anticipated, the false positive rate of Bloom filters approaches zero as the ratio between the filter size and dataset size increases. Second, the hashing of string SMILES representation outperforms most chemical structure fingerprints by over an order of magnitude in terms of false positive rate (combinations 7 & 8). Only the Morgan-MACCS and Atompair-MACCS fingerprint (combinations 3 & 5) hashing achieve false positive rates comparable to strings while requiring half an order of magnitude more bits of space.
Message Digest and FNV hashing (7 & 8) of strings yielded nearly identical and seemingly smooth curves, suggesting a well-randomized hashing of the elements. In contrast, other methods exhibit a “noisy” pattern, which serves as evidence of inadequate randomization. By design, these alternative methods are not highly randomized, as similar molecules tend to have comparable chemical fingerprints. This characteristic is the basis for their use in numerous optimization methods, as they can measure the distance between molecules. Consequently, their performance is suboptimal, as similar molecules have a higher likelihood of collisions within the Bloom filter.
In terms of the time required for Bloom filters to verify whether a molecule is part of a set or not, Fig. 4 clearly illustrates that Bloom filters demand up to three orders of magnitude less time compared to the native API, and one order of magnitude less than B-Tree indexing search. Even the “slower” Python implementation using RDKit for fingerprints requires two orders of magnitude less time than membership checks against an online server. To showcase the effect of latency in this test, a locally installed PostgreSQL database with a B-Tree index with 400,000 members was used. Assuming the Internet search uses another efficient search method, the difference can be considered latency.
We demonstrate that string hashing (FNV and MD4–5) for Bloom filters outperforms fingerprint hashing and approaches the theoretical limit of these structures, confirming that strings are sufficient for molecule storage. Even taking into account the time spent on canonicalizing SMILES, Bloom filter retrieval is still more than two orders of magnitude faster than using an internet search. We also show that FNV, despite its simplicity and speed, is as effective as MD5. Employing other string representations, such as InChI and SELFIES, is expected to yield similar results. Potential applications for the Bloom filter are to quickly determine if a molecule is purchasable in ZINC, patented according to SureChEMBL, or a natural product.
Rester Ulrich (2008) From virtuality to reality - virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective. Curr Opinion Drug Disc Devel 11(4):559–568
Irwin John J, Tang Khanh G, Jennifer Young, Chinzorig Dandarchuluun, Wong Benjamin R, Munkhzul Khurelbaatar, Moroz Yurii S, John Mayfield, Sayle RA (2020) Zinc20-a free ultralarge-scale chemical database for ligand discovery. J Chem Inform Model 60(12):6065–6073
Bloom Burton H (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426
Tarkoma Sasu, Rothenberg Christian Esteve, Lagerspetz Eemil (2012) Theory and practice of bloom filters for distributed systems. IEEE Commun Surv Tutor 14(1):131–155
Broder Andrei, Mitzenmacher Michael (2004) Network applications of bloom filters: a survey. Internet Mathemat 1(4):485–509
McIlroy M (1982) Development of a spelling list. IEEE Trans Commun 30(1):91–99
Yakunin Alex (2010) Nice bloom filter application
Dasgupta Sanjoy, Sheehan Timothy C, Stevens Charles F, Navlakha Saket (2018) A neural data structure for novelty detection. Proc Natl Acad Sci 115(51):13093–13098
Talbot Jamie (July 2015) What are Bloom filters?
Goodwin Bob, Hopcroft Michael, Luu Dan, Clemmer Alex, Curmei Mihaela, Elnikety Sameh, He Yuxiong (August 2017) BitFunnel: Revisiting Signatures for Search. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605–614, Shinjuku Tokyo Japan, ACM
Bran Andres M, Cox Sam, White Andrew D, Schwaller Philippe (2023) ChemCrow: Augmenting large-language models with chemistry tools
Muegge Ingo, Mukherjee Prasenjit (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Disc 11(2):137–148
Sorokina Maria, Merseburger Peter, Rajan Kohulan, Yirik MehmetAziz, Steinbeck Christoph (2021) COCONUT online: collection of open natural products database. J Cheminform 13(1):2
Fan Bin, Andersen Dave G., Kaminsky Michael, Mitzenmacher Michael D. (2014) Cuckoo filter: Practically better than bloom. In: Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, CoNEXT ’14, page 75-88, New York, NY, USA. Association for Computing Machinery
Bender Michael A, Farach-Colton Martin, Johnson Rob, Kuszmaul Bradley C, Medjedovic Dzejla, Montes Pablo, Shetty Pradeep, Spillane Richard P, Zadok Erez (2011) Don’t thrash: how to cache your hash on flash. In: 3rd Workshop on Hot Topics in Storage and File Systems (HotStorage 11)
Cormode Graham (2009) Count-min sketch
Rajaraman Anand, Ullman Jeffrey David (2011) Mining of massive datasets. Cambridge University Press; Cambridge
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40(D1):D1100–D1107
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucl Acids Res 35(Database):D198–D201
Kim Sunghwan, Chen Jie, Cheng Tiejun, Gindulyte Asta, He Jia, He Siqian, Li Qingliang, Shoemaker Benjamin A, Thiessen Paul A, Bo Yu, Zaslavsky Leonid, Zhang Jian, Bolton Evan E (2023) PubChem 2023 update. Nucl Acids Res 51(D1):D1373–D1380
Papadatos George, Davies Mark, Dedman Nathan, Chambers Jon, Gaulton Anna, Siddle James, Koks Richard, Irvine Sean A, Pettersson Joe, Goncharoff Nicko, Hersey Anne, Overington John P (2016) SureChEMBL: a large-scale, chemically annotated patent document database. Nucl Acids Res 44(D1):D1220–D1228
Pence Harry E, Williams Antony (2010) ChemSpider: an online chemical information resource. J Chem Educ 87(11):1123–1124
St Denis Tom, Johnson Simon (2007) Chapter 5 - hash functions. In: St Denis Tom, Johnson Simon (eds) Cryptography for Developers, pages 203–250. Syngress, Burlington
Wikipedia contributors (2023) Bloom filter, 2
Dillinger Peter C, Manolios Panagiotis (2004) Bloom filters in probabilistic verification. In: International Conference on Formal Methods in Computer-Aided Design
White Andrew D (2022) molbloom: quick assessment of compound purchasability with bloom filters. https://github.com/whitead/molbloom, Dec 2022
Fowler Glenn, Noll Landon Curt, Vo Kiem-Phong, Eastlake Donald E 3rd, Hansen Tony (2023) The FNV Non-Cryptographic Hash Algorithm. Internet-Draft draft-eastlake-fnv-19, Internet Engineering Task Force, January 2023. Work in Progress
Rivest Ronald L (April 1992) The MD4 Message-Digest Algorithm. RFC 1320
Rivest Ronald L (April 1992) The MD5 Message-Digest Algorithm. RFC 1321
Durant Joseph L, Leland Burton A, Henry Douglas R, Nourse James G (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inform Comp Sci 42(6):1273–1280 (PMID: 12444722)
Morgan HL (1965) The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. J Chem Document 5(2):107–113
Capecchi Alice, Probst Daniel, Reymond Jean-Louis (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12(1):43
Bosselaers Antoon (2005) Md4-Md5, pages 378–379. Springer US, Boston, MA
Papadatos George, Davies Mark, Dedman Nathan, Chambers Jon, Gaulton Anna, Siddle James, Koks Richard, Irvine Sean A, Pettersson Joe, Goncharoff Nicko et al (2016) Surechembl: a large-scale, chemically annotated patent document database. Nucl acids Res 44(D1):D1220–D1228
Medina Jorge (March 2023) molbloom: quick assessment of compound purchasability with bloom filters. https://github.com/Jgmedina95/molbloom-paper
We thank the Center for Integrated Research Computing (CIRC) at the University of Rochester for providing computational resources and technical support.
This work has been supported by funds from the Robert L. and Mary L. Sproull Fellowship gift and U.S. Department of Energy, Grant No. DE-SC0023354.
The authors have no competing interests to declare.