

Fig. 8 | Journal of Cheminformatics

Fig. 8

From: A probabilistic molecular fingerprint for big data settings


Performance of k-nearest neighbor searches on ChEMBL (\(n = 1.7M\)) for 2048-D MHFP6 indexed using LSH Forest and 2048-D ECFP4 indexed using Annoy. Recovery rates for both implementations depend on the parameters \(k\), \(k_{c}\), and \(l\). a While LSH Forest performs better for \(k = 5\) and \(k = 10\) nearest neighbors, Annoy surpasses LSH Forest for \(k = 50\) and \(k = 100\). b Increasing the number of nearest neighbors by a factor of \(k_{c}\) greatly improves the performance of both ANN methods. While LSH Forest (orange) performs worse than Annoy (green) for \(k_{c} < 20\), it surpasses Annoy for higher values. c Increasing the number of trees \(l\) increases the recovery rate of both methods at the expense of main memory. Annoy performs slightly better for \(l = 8, \ldots, 128\), but the performance of LSH Forest increases at a greater rate, overtaking Annoy at \(l = 256\). d, e Increasing the values of \(k_{c}\) and \(k\) negatively affects the query times of Annoy. While the average query time of LSH Forest remains below 100 ms for \(k = 50\) and \(k = 100\), Annoy's average query time rises above 100 ms and 200 ms, respectively. f As the number of prefix trees in LSH Forest, and thus the recovery rate, increases, the query time decreases. In contrast, increasing the number of Annoy trees improves the recovery rate but also increases the query time. For each pair of subplots, the data have been aggregated over all measured values of the two parameters not shown: \(k_{c}\) and \(l\) for a, d; \(k\) and \(l\) for b, e; and \(k_{c}\) and \(k\) for c, f
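The interplay of \(k\), \(k_{c}\), and the recovery rate can be illustrated with a small, self-contained sketch. It is not the benchmark code behind the figure. It assumes that the recovery rate is the fraction of the true \(k\) nearest neighbors that the approximate search returns, and that \(k_{c}\) acts as an over-retrieval factor: \(k \cdot k_{c}\) candidates are fetched and then re-ranked by exact distance. The function `noisy_candidates` is a hypothetical stand-in for an ANN index such as LSH Forest or Annoy, and all other names are illustrative as well.

```python
import random


def jaccard_distance(a, b):
    """Exact Jaccard distance between two sets."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def exact_knn(query, data, k):
    """Brute-force k nearest neighbors, used as ground truth."""
    order = sorted(range(len(data)),
                   key=lambda i: (jaccard_distance(query, data[i]), i))
    return order[:k]


def noisy_candidates(query, data, n, rng, noise=0.3):
    """Hypothetical stand-in for an ANN index (assumption, not a real API).

    Ranks all items by their distance plus uniform noise and returns the
    first n indices, mimicking an imperfect candidate generator.
    """
    scored = sorted(
        (jaccard_distance(query, data[i]) + rng.uniform(-noise, noise), i)
        for i in range(len(data))
    )
    return [i for _, i in scored[:n]]


def approx_knn(query, data, k, k_c, rng):
    """Fetch k * k_c candidates, re-rank them exactly, and keep the best k."""
    candidates = noisy_candidates(query, data, k * k_c, rng)
    candidates.sort(key=lambda i: (jaccard_distance(query, data[i]), i))
    return candidates[:k]


def recovery_rate(approx, exact):
    """Fraction of the true neighbors found by the approximate search."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy data set: 500 random sets of 20 items drawn from 200 possible items.
    data = [frozenset(rng.sample(range(200), 20)) for _ in range(500)]
    query = data[0]
    truth = exact_knn(query, data, k=10)
    for k_c in (1, 5, 10, 50):
        found = approx_knn(query, data, k=10, k_c=k_c, rng=random.Random(0))
        print(k_c, recovery_rate(found, truth))
```

In this toy setting, a larger \(k_{c}\) enlarges the candidate pool that is re-ranked exactly, so the recovery rate cannot decrease for a fixed noisy ranking, at the cost of more exact distance computations per query. How that cost translates into query time for a real index depends on the implementation.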
