### Fingerprint design

The MinHash fingerprint (MHFP) described herein combines the concept of extended connectivity used for ECFP with MinHash as a hashing scheme to later enable LSH-based ANN searches. As a first step, we enumerate all circular substructures around each atom in a molecule and write these out as SMILES [6]. This operation yields \(O\left( {n\left( {r + 1} \right)} \right)\) SMILES strings for a molecule with a heavy atom count (HAC) of \(n\) and a maximum radius \(r\). As for either small radii \(r\) or macrocycles the ring information of a molecule is lost, we also extract the SMILES string for each ring of the symmetrized smallest set of smallest rings in the molecule. We then filter the SMILES strings for duplicates and combine them to a set \(S\left( A \right)\) representing the molecular shingling of the molecule \(A\).

We denote the process described above as “shingling of a molecule” and the resulting set *S(A)* as “molecular shingling”. A molecular shingling differs from the w-shingling of a document, where a w-shingling consists of n-grams with \(n = w\), in that it includes SMILES strings of different lengths, with the maximum length depending on the maximum radius \(r\) and the size of the rings in the molecule. The number of hashed unique SMILES-encoded molecular subgraphs with radius \(r\) grows according to Heaps’ law with lower \(\beta\) than ECFP hashes with radius \(r\) when processing 1.7 million compounds from ChEMBL24 (Additional file 1: Fig. S1) [40, 41]. Given the molecular shinglings \(S\left( {M_{a} } \right)\) and \(S\left( {M_{b} } \right)\) of two molecules \(M_{a}\) and \(M_{b}\), the Jaccard similarity coefficient of the molecules is calculated according to Eq. 1 (see “Methods” section).

As the MinHash scheme cannot be applied directly to strings, the SMILES in a molecular shingling are first hashed to a 32-bit unsigned integer using a function \(f:\varOmega \to \left\{ {0, \ldots ,2^{32} - 1} \right\}\). There is a trade-off when choosing this relatively small 32-bit hash, as the number of collisions (two or more different strings being hashed to the same integer value) during hashing is inversely proportional to the length of the hash. To estimate the number of collisions, molecular shingles with \(r = 2\) were extracted from 1.7 million ChEMBL24 compounds, yielding a total number of 197,604 unique SMILES. Applying Eq. 3 (see “Methods” section), the number of expected collisions yields \(c\left( {k = 197,604,N = 2^{32} - 1} \right) = 4.546.\) Increasing the maximum radius to \(r = 3\) results in an increase to 2022,448 unique SMILES and 476.098 expected collisions. The measured numbers of collisions when hashing molecular shinglings from ChEMBL24 were 3 and 481 for \(r = 2\) and \(r = 3\), respectively, proving Eq. 3 to be a good estimator for SMILES hashing collisions. Substituting the 32-bit (SHA-1) hash with a 64-bit (SHA-1) hash would lower the number of estimated collisions to 0. However, a 64-bit hash would have the numbers of most calculations during MinHash computation exceed 64 bits, potentially slowing the MinHash computation and further processing by a factor of 2 on current hardware. In addition, the space requirement of the MinHash would double as well. Thus, SMILES contained within molecular shinglings are hashed to a 32-bit (SHA-1) hash.

To transform the hashed molecular shingling into our final fingerprints, we finally apply MinHash according to Eq. 2 (see “Methods” section). In the present study we calculated MinHash fingerprints for hashed molecular shinglings with \(r \in \left\{ {2, 3, 4} \right\}\) and \(k \in \left\{ {128, 1024, 2048, 4096} \right\}\). We considered radii \(r = 2\) (MHFP4), \(r = 3\) (MHFP6), and \(r = 4\) (MHFP8), resulting in 12 fingerprints with different level of structural encoding and compression (according to common notation, the numbers in the fingerprint names represent the maximum diameter rather than the maximum radius).

### Benchmarking study

To validate the SMILES-strings based approach as well as the chosen hash function, we used a platform to benchmark fingerprints for ligand-based virtual screening with Jaccard similarity as a metric [7]. The benchmark performs statistically valid comparisons of fingerprints using structural and activity data drawn from DUD, MUV, and ChEMBL [40, 42, 43]. The benchmark evaluates 7 metrics: The area under the receiver operating characteristic (ROC) curve (AUC), the enrichment factor (EF) for \(\chi = 0.01\) and \(\chi = 0.05\), the Boltzmann-enhanced discrimination of ROC (BEDROC) for \(\alpha = 20\) and \(\alpha = 100\), and the robust initial enhancement (RIE) for \(\alpha = 20\) and \(\alpha = 100\).

First, we compared the hashed molecular shinglings to ECFP hashes before folding, as well as to ECFP*, a variant of ECFP considering only atomic numbers as invariants, all with \(r = 2\) and \(r = 3\). This comparison showed that the hashed molecular shingling method with a radius of \(r = 3\) is superior to ECFP hashing, as it beats unfolded ECFP (with either radius \(r = 2\) or \(r = 3\)) significantly in 2 out of 7 values (AUC, EF 5%) and with a *p* value above 0.05 in 5 out of 7 (EF 1%, BEDROC20, BEDROC100, RIE20 and RIE100) metrics (Fig. 2, Additional file 1: Fig. S5). ECFP* performed significantly worse with both \(r = 2\) and \(r = 3\) in all metrics compared to molecular shingling with \(r = 3\).

To establish whether results based on evaluating hashed molecular shinglings carry over to minhashed molecular shinglings, we then compared our 12 different MHFPs variants with each other. Comparing these different fingerprints in the benchmark confirmed that MHFP6 (MinHash applied to hashed molecular shinglings with \(r = 3\)) performed better than both MHFP4 (\(r = 2\)) and MHFP8 (\(r = 4\)) for medium (1024-D, 2048-D) to high dimensional (4096-D) variants (Fig. 3). The data further suggested that low dimensional variants such as 128-D perform better with \(r = 2\). As MHFP8 failed to perform better than MHFP6, it was discounted from further experiments. MHFP4, while also performing worse than MHFP6, was kept for further experiments as a comparison to ECFP variants with \(r = 2\).

Given the results of benchmarking unfolded ECFP hashes and hashed molecular shinglings (Fig. 2), as well as the results of benchmarking different MHFP radii (Fig. 3), we finally selected the following fingerprints for a detailed comparison aimed at identifying the best fingerprint: (1) Folded ECFP4 and ECFP6; (2) MinHash molecular shinglings with radii 2 and 3, henceforth denoted MHFP4 and MHFP6; (3) MinHash ECFP4 and ECFP6, henceforth denoted MHECFP4 and MHECFP6, respectively, used here to control for the performance of encoding SMILES (MHFP) as opposed to hashes of invariants (ECFP) by applying the minhashing scheme to unfolded ECFP values (Fig. 1). For each fingerprint four different dimensionalities were evaluated.

An average rank comparison according to the benchmark is shown in Fig. 4. Comparing the average ranking of the fingerprints as a function of the chosen radius, both ECFP4 and MHECFP4 perform marginally better than their respective counterparts, ECFP6 and MHECFP6, in the vast majority of cases. In contrast, MHFP6 generally performs better than MHFP4. This result confirms the observations from Fig. 2 where hashed molecular shinglings performed better with \(r = 3\) than with \(r = 2\), while the ECFP4 hashes outperformed ECFP6 hashes. With the exception of the 128-D variant, MHFP4/6 exhibit strictly better performance in AUC compared to both MHECFP4/6 and ECFP4/6, while both MHFP4/6 and MHECFP4/6 perform better than ECFP4/6 in early recognition metrics EF1 and EF5, suggesting that the AUC performance gains are a result of the molecular shingling approach, while the gains in early recognition can be attributed to minhashing. Note that MHFP6 (both 2048-D and 4096-D) did not perform significantly worse than path-based methods (TT and AP, [7]) in AUC, while performing generally significantly better in other metrics, which is in contrast to ECFP fingerprints, which perform worse in AUC benchmarks than path based fingerprints (Additional file 1: Fig. S4).

The above comparisons established that MHFP6 provided the best overall performance across all fingerprints considered, with the 2048-D offering a good compromise between performance and size. In detail, 2048-D MHFP6 significantly outperformed 2048-D ECFP4 in AUC, EF1 and EF5, while performing non-significantly better in BEDROC20, BEDROC100, RIE20 and RIE100. In fact, 2048-D MHFP6 was comparable to 16,384-D ECFP4, although it still performed better in terms of BEDROC20 and RIE20. 2048-D MHFP6 also performed significantly better in AUC than 2048-D MHECFP4 while non-significantly better in EF1, EF5, BEDROC100 and RIE100 and worse in BEDROC20 and BEDROC100. While 2048-D MHFP6 ranked significantly worse than 4096-D MHECFP4 in AUC, 4096-D MHFP6 significantly outranked 4096-D MHECFP4 in AUC (Additional file 1: Fig. S6). Further analysis of the data suggested that gains by MHFP6 over ECFP4 was largely due to better performance on benchmark targets selected from ChEMBL24, while performing approximately equal on DUD and MUV data (Fig. 5, see full target-level performance comparisons between 2048-D MHFP6 and 2048-D ECFP4 and MHECFP4 in Additional file 1: Figs. S2 and S3, respectively).

To further compare MHFP6 and ECFP4, we explored the respective Jaccard distance measurements between molecules within three sets: (1) A subset of hydrocarbons extracted from GDB-13 (\(n = 3,824\)), (2) Drugbank (\(n = 9,300\)), and (3) a matched molecular pairs (MMP) set (\(n = 240,322\)) [44,45,46]. For the hydrocarbon and Drugbank sets, 50 compounds were randomly selected from each, and their Jaccard distance to all the compounds in their respective set was computed. In the MMP set, the Jaccard distance between each pair was computed. While the distances in all data sets show moderate to strong linear correlation (\(r = 0.659\), \(r = 0.792\), and \(r = 0.829\) for GDB-13 hydrocarbons, Drugbank, and MMP respectively), we observed interesting differences. While the distribution of measured distances is similar for MHFP6 and ECFP4 for the GDB-13 subset, ECFP4 seems to measure a distance of 0.0 between clearly different molecules (Fig. 6a, d). In addition, gaps appear in measured ECFP4 distances, resulting in a multimodal distribution—an effect that cannot be fully attributed to the folding operation of ECFP4, as MHECFP4 shows a similar pattern (Additional file 1: Fig. S8a, d). The distances measured in Drugbank show a strong correlation, however, both fingerprints seem to measure a distance of 1.0 in molecules where a finer grained distance measure could prove beneficial (Fig. 6b, e). The MMP data set exposes the inability of ECFP4 to distinguish between highly similar molecules that differ only in the size of one ring compared to MHFP6, which seems to express higher resolution for distance measurements between very similar compounds.

As MHFP6 significantly outperformed both MHECFP4 and ECFP4 (Figs. 4, 5, and Additional file 1: S6), a fingerprint variant on MHFP’s SMILES-based circular substructure hashing scheme, folded by the same modulo \(n\) operation that is used by ECFP, was compared to both minhashed MHFP and folded ECFP with \(r \in \left\{ {2, 3} \right\}\) and \(D = 2048\) (Fig. 7). We denoted this variant SECFP (SMILES extended connectivity fingerprint). While SECFP4/6 were outperformed by MHFP4/6 respectively, SECFP6 performed significantly better than both ECFP4/6 (Additional file 1: Fig. S9). These results suggest that SECFP6 can be readily used as a drop-in replacement for ECFP4 with beneficial results. By performing significantly worse compared to MHFP6, acting as a control, SECFP6 further validates the minhashing approach as compared to folding (Additional file 1: Fig. S10). However, as the minhashed MHFP is based on a sparse representation of the \(2^{32}\)-dimensional binary hash space with a fixed number (\(D\)) of set bits, search optimization algorithms assuming \(D\)-dimensional binary vectors such as BitBound cannot be applied to it.

### Approximate k-nearest neighbor (ANN) searches

In the context of big data, the key advantage of our MHFP over ECFP consists in the implementation of MinHash (Fig. 7), which enables the use of the LSH Forest algorithm to perform ANN searching in the sparse, \(2^{32}\)-dimensional hash space. As a comparison, ECFP hashes are folded into binary arrays, indexed and searched using the ANN algorithm Annoy [33, 36]. Annoy is used by the R package eiR for accelerated structure similarity searching of very large small molecule data sets [26]. To establish whether the performance of LSH Forest can be compared to that of state-of-the-art ANN algorithms when indexing chemical fingerprints, we compared 2048-D MHFP6 fingerprints and 2048-D ECFP4 fingerprints indexed by LSH Forest and Annoy, respectively. A benchmark based on all compounds found in ChEMBL24 (\(n = 1,712,978\)) was set up. From ChEMBL24, 20 compounds were randomly selected as query compounds. Next, for each of the 20 query compounds, the Jaccard distances to all compounds from ChEMBL24 were calculated using brute-force linear scan, resulting in 20 sorted lists. These steps were performed for MHFP6, and ECFP4 based Jaccard distances. Finally, the recovery rates of \(k\)-nearest neighbors for \(k \in \left\{ {5, 10, 50, 100} \right\}\) of approximate \(k\)-nearest neighbor algorithms (LSH Forest for MHFP6 and Annoy for ECFP4) were calculated and the respective query times measured. For each value of \(k\), the benchmark was repeated over parameter \(k_{c} \in \left\{ {1, 10, 20, \ldots , 90, 100} \right\}\).

LSH Forest and Annoy were each benchmarked with \(l \in \left\{ {8, 16, 32, 64, 128, 256} \right\}\) prefix and Annoy trees, respectively. While LSH Forest performs better for \(k = 5\) and \(k = 10\) nearest neighbors, Annoy surpasses LSH Forest for \(k = 50\) and \(k = 100\) (Fig. 8a). By increasing the number of nearest neighbors by a factor of \(k_{c}\), the performance of both ANN neighbor methods can be greatly improved. LSH Forest (orange) shows worse performance compared to Annoy (green) for \(k_{c} < 20\), however, for \(k_{c} \ge 20\) it surpasses Annoy (Fig. 8b). As LSH Forest and Annoy both construct multiple trees (prefix and binary trees respectively) in order to approximate optimal nearest neighbor search, increasing the number of trees \(l\) increases the recovery rate for both methods at the expense of main memory. Annoy performs slightly better for \(l = \left\{ {8, \ldots ,128} \right\}\), however, performance of LSH Forest increases at a greater rate, overtaking Annoy at the final value of \(l = 256\) (Fig. 8c).

Increasing values of parameters \(k_{c}\) and \(k\) affects query times of Annoy negatively, while the average query time for LSH Forest only shows a small increase and remains below 100 ms for \(k = 50\) and \(k = 100\), Annoys average query time increases to above 100 and 200 ms respectively (Fig. 8d, e). The comparatively steep increase in query time for Annoy with \(k_{c} > 1\) is caused by cosine similarity computations, which are more resource demanding than Jaccard distance computations. A major difference between the two methods is the effect of parameter *l* on query time. As the number of prefix trees *l*, and thus the recovery rate, in LSH Forest increases, the query time decreases. On the other hand, an increase in Annoy trees, while having a beneficial effect on recovery rate, has a negative effect on query time (Fig. 8f).

The combination of MHFP and LSH Forest allows for fast and accurate searching in sparse, high-dimensional binary chemical spaces. Its performance is comparable to methods such as Annoy which rely on the folding of fingerprint vectors, although the presented implementation is limited in terms of speed and scope of data set size due to in-memory processing and Python.