Statistical-based database fingerprint: chemical space dependent representation of compound databases
Journal of Cheminformatics volume 10, Article number: 55 (2018)
Abstract
Background
Simplified representation of compound databases has several applications in cheminformatics. Herein, we introduce an alternative and general method to build single fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the statistical-based database fingerprint (SBDFP) proposed herein is that it is generated from binomial proportion comparisons, taking as reference the distribution of “1” bits in a large representative set of the chemical space.
Results
To illustrate the method, SBDFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SBDFPs were built based on two representative fingerprints of different design, using as reference a data set with more than 15 million compounds from ZINC. The application of SBDFP was illustrated and compared to other methods through association relationships of the 28 epigenetic data sets and through similarity searching. Overall, SBDFPs captured both the common features between data sets and the distinct features of each set. In similarity searching, SBDFP equaled or outperformed other approaches for at least 20 out of the 28 sets.
Conclusions
SBDFP is a general approach based on binomial proportion comparisons to represent a compound data set with a single fingerprint. SBDFP can be developed, at least in principle, based on any fingerprint and reference data set. SBDFP is a good alternative for exploring relationships between targets through their associated compound data sets and for performing similarity searching.
Background
Molecular fingerprints are bit-string representations of chemical structures in which each position indicates the presence (1) or absence (0) of chemical features as defined in the design of the fingerprint. There are several types of molecular fingerprints, described elsewhere [1, 2]. Such representations are broadly employed for the assessment of chemical space coverage, molecular diversity and similarity searching [1,2,3]. With the constantly increasing size of chemical databases, such studies have become more computationally demanding, leading to the need for simplified representations of compound databases to optimize storage and calculation speed. To this end, many of the proposed approaches generate a single fingerprint that tries to capture the chemical features common to all compounds in a database (or at least to most of them). The first strategy dates back to 1996, when Shemetulskis et al. [4] employed the Daylight Chemical Information Systems, Inc. molecular fingerprint to build the so-called modal fingerprint, which contains the common bits found in the molecular fingerprints of a given compound data set. In the modal fingerprint, the degree to which bits have to be in common in the data set in order to be set as “1” is determined by a user-defined threshold value, which ranges from 50 to 100%, with 50% being the best performing threshold in different studies. Since 1996 the algorithm has been extended to different molecular fingerprints, and a number of studies have shown its application in similarity searching [5, 6] and for the quantification of intra- and inter-database diversity [7]. In parallel, several modifications to this concept have been developed, mostly aiming to enhance its performance in similarity searches at the expense of increasing the complexity of implementing the approach. Such approaches include bit scaling [8,9,10], bit silencing [11] and the determination of the best feature combinations [12].
In different publications the term “modal fingerprint” has been used to refer to distinct approaches. To avoid confusion, herein we use “database fingerprint (DFP)” to refer to the modal fingerprint constructed using 50% as the predefined threshold.
In this work, we present the statistical-based database fingerprint (SBDFP) as a novel and general approach to generate a compound database fingerprint based on binomial proportion comparisons. We illustrate the application of SBDFP in comparing target-associated compound data sets and performing similarity searching. As a case study, and to further advance the emerging field of epi-informatics [13], the SBDFPs were applied to a recently published epigenomics database with potential therapeutic significance.
Methods
Concept and construction of SBDFP
As noted in the Background, in a “classic” DFP representation a bit is set to “1” only if it is present in at least 50% of all the molecules in the input data set. The basic idea of such a threshold is to extract the bits common to at least half of the input data set. However, the underlying hypothesis assumes that the probability of presence of a feature (bit) in a molecular representation is 50% for every bit, so all bits are compared against that probability.
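The 50% rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: fingerprints are assumed to be equal-length lists of 0/1 integers, and the function name `classic_dfp` and the toy data are hypothetical.

```python
def classic_dfp(fingerprints, threshold=0.5):
    """Return a modal fingerprint: bit i is 1 if at least `threshold`
    of the input fingerprints have bit i set."""
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    # Count how many molecules set each bit position.
    counts = [sum(fp[i] for fp in fingerprints) for i in range(n_bits)]
    # Compare each count against the required fraction of the data set.
    return [1 if c >= threshold * n else 0 for c in counts]


# Hypothetical toy data set: four molecules, five bit positions.
toy_set = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
]
dfp = classic_dfp(toy_set)
print(dfp)
```

Note that every bit is judged against the same fixed fraction, regardless of how common that bit is in chemical space at large.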
SBDFP is based on the basic hypothesis that the probability of presence of a feature (bit) in a molecular representation is not equal for each bit. Instead, it is determined by the availability of such feature in a reference set, e.g., the “known chemical space” (or a reasonable approximation), and such availability has to be determined. Once the frequency of occurrence of each bit in a molecular representation is determined for both the reference set and the data set under study, the SBDFP is constructed by comparing the frequency of occurrence of each bit between both sets. Thus, a bit is set to “1” only if its frequency in the target data set is statistically higher than in the reference. Figure 1 depicts a schematic comparison between a classic DFP (reminiscent of the modal fingerprint, vide supra) and the SBDFP. In this scheme the database fingerprint is illustrated for a short hypothetical fingerprint representation with 20 bit positions.
It should be noted that the SBDFP representation for a given data set requires three main features (Fig. 1b): (1) a reference set, (2) a molecular fingerprint representation and (3) a statistical method to do the binomial proportion comparisons. The chosen features for this work are described below, although SBDFP can be developed with different fingerprints, reference sets, and statistical methods.
Compound data sets
As a case study we generated SBDFPs for a recently published epigenomics database [14]. The targets used as a test case in this work were selected based on their relevance in probe and epigenetic drug discovery, which has attracted attention for virtual screening [15, 16]. However, the SBDFP is general and could be used for other targets. The epigenomics database used in this study contains compound associations against 60 epigenetic targets. For our analysis, we selected the information for 28 targets for which there were at least 50 reported compounds with a potency of 10 µM or better. Table 1 summarizes the targets considered in this work, which included bromodomain-containing proteins (BRD2, BRD3 and BRD4), histone acetyltransferases (CREBBP and EP300), DNA methyltransferase (DNMT1), histone lysine methyltransferase (EHMT2), histone deacetylases (HDAC1–HDAC11), lysine acetyltransferase (KAT2B), lysine demethylases (KDM1A and KDM4C), histone methyl-lysine binding proteins (L3MBTL1 and L3MBTL3), mitogen-activated protein kinase (MAP3K7), O-GlcNAcase (MGEA5), nuclear receptor coactivators with histone acetyltransferase activity (NCOA1 and NCOA3), and protein arginine methyltransferase (PRMT1). Table 1 also includes the number of compounds in each set (350 compounds on average, with a maximum of 2740 for HDAC1). Note that SBDFP could be applied to other data sets with a larger number of compounds, and its performance in, for instance, virtual screening would need to be assessed on a case-by-case basis. It might be anticipated that the performance could be target-dependent, as happens in other virtual screening approaches.
Reference set
In this study, the All Clean subset from the ZINC12 database [17], with 16,403,844 unique compounds, was selected as the starting point to build the reference set for SBDFP calculations. We removed 21 compounds that could not be processed by the RDKit module for Python [18] and also 154 compounds present in the epigenomics database. The remaining molecules were randomly divided into two groups: one group with 1,000,000 compounds to be used as decoys in similarity searching (vide infra) and the second group with the remaining 15,403,690 molecules to be used as reference for SBDFP calculations. We employed this database with more than 15 million compounds as a representative sample of the currently known chemical space of small molecules available in ZINC. We emphasize that SBDFP could be implemented using other reference data sets.
Fingerprints
We selected two fingerprints to illustrate the applicability of the concept of SBDFP: Molecular ACCess System (MACCS) keys (166-bit) [19] as a “low resolution” dictionary fingerprint, and Extended Connectivity Fingerprint diameter 4 (ECFP4) as a “high resolution” representation [20] in its folded version of 2048 bits. MACCS keys and ECFP4 were generated with RDKit.
Binomial proportion comparisons
To perform the binomial proportion comparisons we employed a Z-test, as implemented in the statsmodels [21] module for Python. As can be found elsewhere [22], the proportion comparison relies on the calculation of a test statistic (called \(Z_{test}\)) defined as:
\[ Z_{test} = \frac{p_{t} - p_{r}}{\sqrt{P\left(1 - P\right)\left(\frac{1}{n_{t}} + \frac{1}{n_{r}}\right)}} \]
where \(p_{t}\) and \(p_{r}\) are the proportions in which a given bit appears as “1” in the target and reference data sets for a total of \(n_{t}\) and \(n_{r}\) observations, respectively. \(P\) is the estimated true proportion of “1” bits considering both sample observations and it is calculated as:
\[ P = \frac{p_{t} n_{t} + p_{r} n_{r}}{n_{t} + n_{r}} \]
With the \(Z_{test}\) calculated, and through the standard Normal distribution, the exact probability that the observed difference between proportions is due to random variation can be determined (the p value). The proportion difference is statistically significant if the p value is lower than the threshold associated with the confidence level selected a priori. For example, for bit 100 in the MACCS fingerprint, the bit “1” occurrence in the reference set is 10,892,579 from 15,403,690 observations (\(p_{r} = 0.707\)). By selecting a confidence level of 99% (p value < 0.01) and doing the calculations, one finds that for a target data set of 350 compounds the bit occurrence must be equal to or greater than 268 (\(p_{t} = 0.766\), p value = 0.008) for the bit to be set as “1” in the SBDFP representation, even though for a bit occurrence of 248 the proportion is already larger than in the reference (\(p_{t} = 0.708\), p value = 0.476). This example illustrates that a greater proportion of “1” in a given bit for the target data set in comparison to the reference data set does not necessarily imply that such bit will be set as “1” in the SBDFP. In other words, the proportion difference must be “big enough”.
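The arithmetic of this example can be checked with a short Python sketch. It does not call statsmodels; instead it assumes a pooled two-proportion Z-test with a one-sided alternative (target proportion larger than reference), an assumption that is consistent with the p values quoted above. The function name `one_sided_p_value` is hypothetical.

```python
import math


def one_sided_p_value(count_t, n_t, count_r, n_r):
    """P value of a pooled two-proportion Z-test for H1: p_t > p_r (assumed form)."""
    p_t = count_t / n_t
    p_r = count_r / n_r
    # Pooled estimate of the proportion of "1" bits over both samples.
    pooled = (count_t + count_r) / (n_t + n_r)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_t + 1.0 / n_r))
    z = (p_t - p_r) / std_err
    # Upper-tail probability of the standard Normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Values quoted in the text for bit 100 of the MACCS keys.
COUNT_R, N_R = 10_892_579, 15_403_690
N_T = 350

p_268 = one_sided_p_value(268, N_T, COUNT_R, N_R)
p_248 = one_sided_p_value(248, N_T, COUNT_R, N_R)

# Smallest target count whose p value falls below 0.01.
min_count = next(
    c for c in range(N_T + 1)
    if one_sided_p_value(c, N_T, COUNT_R, N_R) < 0.01
)
print(f"p(268) = {p_268:.4f}, p(248) = {p_248:.4f}, minimum count = {min_count}")
```

Under these assumptions the search over counts recovers 268 as the smallest occurrence that passes the 99% confidence level.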
For this work we chose a confidence level of 99% (p value < 0.01) based on the average AUC values obtained from similarity searching for ECFP4 and MACCS keys at five different confidence levels (vide infra). For the sets of targets and the fingerprints explored, the best performing method is the one with a confidence level of 99% (Additional file 1: Table S1) and all further calculations and discussion are based on such method. Of note, other p values could be chosen for other targets and/or other fingerprints.
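Putting the pieces together, the construction of an SBDFP can be sketched as follows. This is an illustrative sketch rather than the code released with this work: it assumes fingerprints stored as lists of 0/1 integers, precomputed reference counts of “1” bits, and a pooled one-sided Z-test; the names `upper_tail_p` and `build_sbdfp` and all toy numbers are hypothetical.

```python
import math


def upper_tail_p(count_t, n_t, count_r, n_r):
    """One-sided p value of a pooled two-proportion Z-test (target > reference)."""
    pooled = (count_t + count_r) / (n_t + n_r)
    variance = pooled * (1.0 - pooled) * (1.0 / n_t + 1.0 / n_r)
    if variance == 0.0:
        # Bit is constant across both sets: no evidence of enrichment.
        return 1.0
    z = (count_t / n_t - count_r / n_r) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def build_sbdfp(target_fps, ref_counts, n_ref, alpha=0.01):
    """Set bit i to 1 only if it is significantly enriched in the target set."""
    n_t = len(target_fps)
    sbdfp = []
    for i, count_r in enumerate(ref_counts):
        count_t = sum(fp[i] for fp in target_fps)
        p_value = upper_tail_p(count_t, n_t, count_r, n_ref)
        sbdfp.append(1 if p_value < alpha else 0)
    return sbdfp


# Hypothetical toy data: 40 target fingerprints of 4 bits each.
target = [[1, 1, 0, 1]] * 30 + [[1, 0, 0, 0]] * 10
# Hypothetical counts of "1" bits among 100,000 reference compounds.
reference_counts = [95_000, 20_000, 5_000, 74_000]

sbdfp = build_sbdfp(target, reference_counts, n_ref=100_000)
print(sbdfp)
```

In this toy case bits 0 and 3 are present in at least half of the target set, so a 50% threshold would switch them on, but they are about as common in the reference and are therefore left off here; only bit 1, which is strongly enriched relative to the reference, is set.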
SBDFP to study inter-data set relationships
To evaluate the ability of SBDFP to capture the differences between data sets, we calculated both the classic DFP and the SBDFP for each of the 28 targets. Both database fingerprints were constructed based on ECFP4 and MACCS keys fingerprints. Using the Tanimoto coefficient [23] and for each molecular fingerprint, we constructed the similarity matrices between epigenetic targets with three methodologies to calculate the similarity between pairs of targets: the median similarity between all-compound comparisons (ACC) in the data sets, the similarity between DFPs, and the similarity between SBDFPs. This led to a total of six representations, herein referred to as ACC/MACCS, ACC/ECFP4, DFP/MACCS, DFP/ECFP4, SBDFP/MACCS and SBDFP/ECFP4. The range of similarity values for each representation was taken as a measure of its resolution. All six similarity matrices were transformed to their corresponding distance matrices based on the relationship distance = 1 − similarity. The distance matrices were used as the basis for hierarchical clustering with complete linkage to analyze the ability of the representations to recover the known relationships between epigenetic targets based on their sequence identity. Such ability was assessed by calculating the Adjusted Rand Index (ARI) of each clustering [24] at a level of 10 clusters. The ARI measures the similarity between a given clustering and a ground truth: an ARI value of 1 indicates that the clustering recovers the original groups and an ARI value of 0 indicates random assignments. As ground truth, we used the hierarchical clustering with complete linkage obtained from the distance form of the sequence identity matrix (shown as Additional file 1: Table S10) as obtained from the alignment with Clustal Omega [25] with default parameters for the 28 targets studied. Sequences for all targets were taken from the Universal Protein Knowledgebase (UniProt) [26].
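The step from database fingerprints to a distance matrix can be sketched as below. The sketch assumes fingerprints as lists of 0/1 integers; the function names, target labels and bit patterns are hypothetical, and treating two all-zero fingerprints as having similarity 0 is a convention chosen here, not one stated in this work.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length 0/1 fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention assumed here: two all-zero fingerprints have similarity 0.
    return both / either if either else 0.0


def distance_matrix(database_fps):
    """Pairwise (1 - Tanimoto) distances between database fingerprints."""
    names = sorted(database_fps)
    matrix = [
        [1.0 - tanimoto(database_fps[a], database_fps[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical 8-bit database fingerprints for three targets.
toy_fps = {
    "target_A": [1, 1, 1, 0, 0, 0, 1, 0],
    "target_B": [1, 1, 0, 0, 0, 0, 1, 0],
    "target_C": [0, 0, 0, 1, 1, 0, 1, 0],
}
names, dist = distance_matrix(toy_fps)
for name, row in zip(names, dist):
    print(name, [round(d, 2) for d in row])
```

A matrix of this form could then be passed to any complete-linkage hierarchical clustering routine.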
In addition, the number of “1” bits present in each representation was calculated as an approximation of the amount of information contained in each one.
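For readers unfamiliar with the ARI used in this subsection, the following sketch computes it from two flat cluster assignments using the pair-counting formulation of Hubert and Arabie [24]. The function name and the toy labels are hypothetical, and the handling of degenerate inputs is a choice made for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # No pairs to compare; treated here as perfect agreement.
    # Pairs of items placed together in both, in the truth, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both clusterings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical ground-truth groups and two candidate clusterings of six targets.
truth = ["BRD", "BRD", "HDAC", "HDAC", "HDAC", "NCOA"]
same_partition = [0, 0, 1, 1, 1, 2]
other_partition = [0, 1, 1, 1, 2, 2]

ari_same = adjusted_rand_index(truth, same_partition)
ari_other = adjusted_rand_index(truth, other_partition)
print(ari_same, round(ari_other, 4))
```

The index depends only on which items are grouped together, so relabeling the clusters leaves it unchanged.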
SBDFP as query for similarity searching
Previous studies have shown that using a single fingerprint representation of a compound database as query yields better results in similarity searching than fingerprint representations of single compounds [5, 6]. However, when single fingerprint representations are compared with methods that use information from multiple compounds in a database, such as k-nearest neighbors (kNN) and binary kernel discrimination, the single fingerprint searches are outperformed [5]. In this work, we tested the performance of SBDFP in similarity searching as compared to the classic DFP and 1NN search strategies for both MACCS keys and ECFP4 fingerprints. Methods such as binary kernel discrimination were not compared in this work given their reported lack of efficiency [5]. The Tanimoto coefficient was used as similarity measure, although other similarity metrics could be explored. For SBDFP, five different confidence levels were tested for binomial proportion comparisons; here we report only the best performing one (99%), and the rest are summarized in Additional file 1: Table S1.
Using an approach similar to the one reported by Heikamp et al. [27], from each of the 28 epigenetic targets, 100 sets of 10 active compounds each were randomly selected and used as query. In each case, all remaining active compounds were added as active database of compounds (ADCs) to a database containing one million compounds randomly selected from the ZINC All Clean subset (vide supra), called the search set. For the searches involving DFP and SBDFP, the 10 compounds used as query were employed to build the corresponding single fingerprint, which was compared against all compounds in the search set, leading directly to a single similarity value per compound. On the other hand, for 1NN, each of the compounds in the search set was compared to the 10 compounds used as query, leading to 10 similarity values per compound, from which the highest value was taken. For each similarity search, the compound recovery rates (RR) were calculated in a target-specific selection over the number of available ADCs as a measure of early enrichment. Receiver operating characteristic (ROC) curves and ROC area under the curve (AUC) values were also computed.
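The two scoring schemes and the evaluation metrics can be sketched as follows. All names and toy fingerprints are hypothetical. The recovery rate is written with an explicit `top_k` selection size; setting it equal to the number of ADCs in the usage below is an assumption of this sketch, and the AUC is computed by direct pair counting, which is only practical for small toy sets.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length 0/1 fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


def score_single(query_fp, search_set):
    """One comparison per database compound (DFP or SBDFP as query)."""
    return [tanimoto(query_fp, fp) for fp in search_set]


def score_1nn(query_fps, search_set):
    """len(query_fps) comparisons per database compound; keep the highest."""
    return [max(tanimoto(q, fp) for q in query_fps) for fp in search_set]


def recovery_rate(scores, labels, top_k):
    """Fraction of all actives found among the top_k ranked compounds."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:top_k]) / sum(labels)


def roc_auc(scores, labels):
    """ROC AUC as the probability that an active outranks a decoy (ties count 0.5)."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if a > d else 0.5 if a == d else 0.0
        for a in actives for d in decoys
    )
    return wins / (len(actives) * len(decoys))


# Hypothetical query compounds, single database fingerprint, and search set.
query_fps = [[1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0]]
single_query = [1, 1, 0, 0, 0, 0]
search_set = [
    [1, 1, 0, 0, 0, 1],  # active (ADC)
    [1, 0, 1, 0, 0, 0],  # active (ADC)
    [0, 0, 0, 1, 1, 0],  # decoy
    [0, 1, 0, 1, 0, 1],  # decoy
]
labels = [1, 1, 0, 0]

single_scores = score_single(single_query, search_set)
nn_scores = score_1nn(query_fps, search_set)
n_actives = sum(labels)
print(roc_auc(single_scores, labels), recovery_rate(single_scores, labels, n_actives))
print(roc_auc(nn_scores, labels), recovery_rate(nn_scores, labels, n_actives))
```

The single-fingerprint search performs one comparison per database compound, whereas the 1NN search performs one per query compound and database compound before taking the maximum.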
Results and discussion
Bit proportions in the reference set
As detailed in the Methods section, 15,403,690 compounds from the ZINC All Clean subset were taken as a representative sample of the currently known chemical space of small molecules. For the complete data set, the frequency of each bit was calculated for ECFP4 and MACCS keys. The results are summarized in Additional file 1: Tables S2 and S3. Of note, only 43 out of 166 bits for MACCS keys and 12 out of 2048 bits for ECFP4 have frequencies over 0.5. This means that 43 and 12 bits of MACCS keys and ECFP4, respectively, are the most likely to appear in the DFP representation of any data set. Such bias is avoided in SBDFP.
Compound data sets
For the 28 data sets studied in this work a total of six representations were generated for each set: the fingerprints for each compound, the single DFP, and SBDFP, all based on ECFP4 and MACCS keys, respectively. Of note, the data set representations based on DFP and SBDFP have the advantage over the “all-compounds” representation that comparing two data sets requires a single fingerprint comparison instead of the N × M pairwise comparisons between all compounds (with N and M being the number of compounds in the two data sets).
The median of the intraset similarity for all compounds in each data set was computed with MACCS keys and ECFP4 and the results are summarized in Table 1. Overall, all 28 sets have structurally diverse compounds with, for instance, a maximum median MACCS keys similarity of 0.694 (average of 0.507) and a maximum median ECFP4 similarity of 0.551 (average of 0.178).
Table 1 also reports the average number of “1” bits for all compounds, as well as the number of “1” bits in the DFP and SBDFP, respectively. For both MACCS keys and ECFP4 fingerprints, the DFP representation has, on average, fewer “1” bits (46 and 18, respectively) than the all-compounds representation (52 and 48, respectively) but more than the number of bits with occurrence frequencies over 0.5 in the reference set (vide supra). As expected, DFP contains less information than the complete data set. However, DFP captures more features in the data set than expected according to the occurrence frequencies in the reference data set.
DFP/MACCS and SBDFP/MACCS capture a similar amount of information, with an average number of “1” bits of 46. However, as shown in Table 1, there is a dramatic increase in the number of “1” bits for SBDFP/ECFP4 as compared to DFP/ECFP4 (232 vs. 18). These results indicate that for the 28 data sets considered in this work, SBDFP/ECFP4 captures a higher amount of specific structural features of the compounds.
Similarity matrices
The similarity matrices between epigenetic targets were calculated with three different approaches to calculate the similarity between pairs of targets: the median similarity of all pairwise comparisons (i.e., all-compound comparisons) in the data sets (ACC), the similarity between their DFPs, and the similarity between their SBDFPs, all based on MACCS keys and ECFP4 using the Tanimoto coefficient. As described in the Methods section, these representations are referred to in this work as ACC/MACCS, ACC/ECFP4, DFP/MACCS, DFP/ECFP4, SBDFP/MACCS, and SBDFP/ECFP4. The six matrices are shown in Additional file 1: Tables S4–S9. Table 2 summarizes the maximum, minimum, average and range of Tanimoto similarity values for each similarity matrix. By using the median similarity between ACC in the data sets, the ranges are the smallest for MACCS keys and ECFP4 (i.e., 0.51 and 0.49, respectively). Table 2 shows that the similarity matrices constructed using SBDFP present a broader range of values (0.950 and 0.989) than those constructed using DFP (0.746 and 0.930).
The SBDFP matrices also have lower average similarities between data sets than the DFP matrices (0.342 vs. 0.540 for MACCS keys and 0.185 vs. 0.408 for ECFP4, respectively). Based on these results, the representation that best captures the differences between data sets is SBDFP/ECFP4. This result agrees with the relative “higher resolution” of SBDFP/ECFP4, i.e., the higher number of “1” bits discussed above (Table 1).
SBDFP to study inter-data set relationships
Figure 2 shows the dendrograms for each hierarchical clustering obtained with the corresponding distance matrices (vide supra). Analyzing the differences between data sets is not a trivial task, and evaluating the performance of a structural representation is not straightforward. In this work, we assessed the ability of the six representations listed above to recover the known relationships between epigenetic targets based on their sequence identity, using as metric the ARI at a level of 10 clusters and as ground truth the hierarchical clustering obtained from the distance form of the sequence identity matrix (vide supra). The level of ten clusters was selected as ground truth given its recovery of four groups of epigenetic targets with known relationships: group 1 containing BRDs 2–4, CREBBP and EP300; group 2 containing HDACs 1–11; group 3 including L3MBTLs 1 and 3; and group 4 consisting of NCOAs 1 and 3. According to the results, the best performing methods were those based on the SBDFP, with ARI values of 0.831 for SBDFP/ECFP4 and 0.808 for SBDFP/MACCS. Methods based on ACC had worse but similar performances for both fingerprints, with ARI values of 0.762 and 0.708 for ACC/MACCS and ACC/ECFP4, respectively. Finally, methods based on DFP had contrasting performances, with DFP/MACCS tied as the second-best method with an ARI value of 0.808 and DFP/ECFP4 the worst of them with an ARI value of 0.388.
SBDFP as template for similarity searching
All 28 epigenetic data sets were subjected to systematic fingerprint search calculations. To obtain statistically relevant data, from each data set, 100 compound reference sets of 10 compounds were randomly selected and used as query in six different representations: the fingerprints for each compound (1NN), the DFP and the SBDFP, all three based on ECFP4 and MACCS keys. For the six search strategies, Figs. 3 and 4 show the results of the RR and AUC, respectively. In terms of early enrichment, using MACCS keys as molecular representation, the SBDFP approach outperformed the other methods with an average RR of 35.3%, followed by 1NN (33.1%) and DFP (26.4%). Similar trends were obtained using ECFP4, with average RRs of 50.2%, 46% and 21.5% for SBDFP, 1NN, and DFP, respectively. Regarding the global performance, the tendency was identical. The best performing method in both cases was SBDFP: for MACCS keys with an average AUC of 0.898, followed by 1NN and DFP with average AUCs of 0.853 and 0.824, respectively, and for ECFP4 with average AUCs of 0.926, 0.882 and 0.755 for SBDFP, 1NN and DFP, respectively. These results revealed the anticipated differences between high- and low-resolution fingerprints, since ECFP4 achieved higher RRs and AUCs for 1NN searches, while for the single fingerprint searches the higher values corresponded to the most populated representations in terms of number of “1” bits (MACCS keys for DFP and ECFP4 for SBDFP).
The results also illustrated the general data set dependence of the similarity searching performance and the good success rates achieved by 2D fingerprint methods, since the best performing search strategy for each data set obtained an average RR of at least 50% in 22 of 28 cases, and an average AUC larger than 0.7 in all of them. Analyzing the individual performances according to RRs (Table 3), SBDFP was the best method for 17 cases, of which eight were based on MACCS keys, seven on ECFP4 and two showed no significant difference between molecular fingerprints. The second-best method was 1NN, with eight favorable cases using ECFP4. For three data sets there was no significant difference between SBDFP and 1NN (Fig. 3). Additionally, the DFP representation was not the best performing method for any of the data sets studied.
According to the AUC values (Table 4), the best performing method for 23 data sets was SBDFP, of which four were based on MACCS keys, 17 on ECFP4 and two showed no significant difference between fingerprints. The overall second-best approach was 1NN, with better predictions for two data sets (one for each molecular fingerprint). In general, DFP had lower AUC values as compared to the other two search methods (Table 4).
Remarkably, for at least 20 of the 28 data sets studied the search method based on SBDFP led to the best RRs, with the additional advantage over 1NN that the number of comparisons is N times lower (with N being the number of compounds used as query). This is because the number of comparisons needed for the screening with SBDFP is always equal to the number of compounds in the screened database, whereas for 1NN this number scales with the number of compounds used as query.
Conclusions and perspectives
Here we presented the statistical-based database fingerprint (SBDFP) as a novel general approach to generate single fingerprints of compound databases based on binomial proportion comparisons. In this work we showed its implementation for two molecular fingerprints (ECFP4 and MACCS keys) and one specific reference set (ZINC). However, the applicability of SBDFP can be extended to any binary fingerprint and to other reference sets. Using as a case study a recently published collection of 28 epigenetic compound sets with therapeutic relevance, we illustrated the application of SBDFP to capture the inter-data set relationships and to perform similarity searching. For the data sets explored in this work the largest set has 2740 compounds (as deposited in ChEMBL), but SBDFP could be applied to other, larger compound data sets with relevance in drug or probe discovery. Although no quantitative analysis was performed in terms of speed of calculation, single fingerprint approaches to represent compound databases are expected to be faster because they depend on single rather than multiple comparisons.
Two major perspectives of the SBDFP approach are its application in high-throughput virtual screening and target identification. To these ends, studies involving different molecular fingerprints, target-associated compound sets and reference data sets would be required, as well as exhaustive validations of their performance. Part of this work is ongoing and will be reported in due course.
Abbreviations
ACC: all compound comparisons
ADC: active database of compounds
ARI: Adjusted Rand Index
AUC: area under the curve
DFP: database fingerprint
ECFP4: extended connectivity fingerprint of diameter four
kNN: k-nearest neighbors
MACCS: molecular access system
ROC: receiver operating characteristic
RR: recovery rate
SBDFP: statistical-based database fingerprint
UniProt: Universal Protein Knowledgebase
References
 1.
Cereto-Massagué A, Ojeda MJ, Valls C et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63. https://doi.org/10.1016/j.ymeth.2014.08.005
 2.
Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discov 11:137–148. https://doi.org/10.1517/17460441.2016.1117070
 3.
Heikamp K, Bajorath J (2012) Fingerprint design and engineering strategies: rationalizing and improving similarity search performance. Future Med Chem 4:1945–1959. https://doi.org/10.4155/fmc.12.126
 4.
Shemetulskis NE, Weininger D, Blankley CJ et al (1996) Stigmata: an algorithm to determine structural commonalities in diverse datasets. J Chem Inf Comput Sci 36:862–871. https://doi.org/10.1021/ci950169+
 5.
Hert J, Willett P, Wilton DJ et al (2004) Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J Chem Inf Comput Sci 44:1177–1185. https://doi.org/10.1021/ci034231b
 6.
Duan J, Dixon SL, Lowrie JF, Sherman W (2010) Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods. J Mol Graph Model 29:157–170. https://doi.org/10.1016/j.jmgm.2010.05.008
 7.
Fernández-De Gortari E, García-Jacas CR, Martinez-Mayorga K, Medina-Franco JL (2017) Database fingerprint (DFP): an approach to represent molecular databases. J Cheminform 9:1–9. https://doi.org/10.1186/s13321-017-0195-1
 8.
Xue L, Stahura FL, Godden JW, Bajorath J (2001) Fingerprint scaling increases the probability of identifying molecules with similar activity in virtual screening calculations. J Chem Inf Comput Sci 41:746–753. https://doi.org/10.1021/ci000311t
 9.
Xue L, Godden JW, Stahura FL, Bajorath J (2003) Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys. J Chem Inf Comput Sci 43:1218–1225. https://doi.org/10.1021/ci030287u
 10.
Xue L, Stahura FL, Bajorath J (2004) Similarity search profiling reveals effects of fingerprint scaling in virtual screening. J Chem Inf Comput Sci 44:2032–2039. https://doi.org/10.1021/ci0400819
 11.
Wang Y, Bajorath J (2008) Bit silencing in fingerprints enables the derivation of compound class-directed similarity metrics. J Chem Inf Model 48:1754–1759. https://doi.org/10.1021/ci8002045
 12.
Lounkine E, Hu Y, Batista J, Bajorath J (2009) Relevance of feature combinations for similarity searching using general or activity class-directed molecular fingerprints. J Chem Inf Model 49:561–570. https://doi.org/10.1021/ci800377n
 13.
Medina-Franco JL (2016) Epi-informatics: discovery and development of small molecule epigenetic drugs and probes. Academic Press, Cambridge. https://doi.org/10.1016/C2014-0-03789-6
 14.
Naveja JJ, Medina-Franco JL (2017) Insights from pharmacological similarity of epigenetic targets in epi-polypharmacology. Drug Discov Today 23:141–150. https://doi.org/10.1016/j.drudis.2017.10.006
 15.
Lu W, Zhang R, Jiang H et al (2018) Computer-aided drug design in epigenetics. Front Chem 6:57. https://doi.org/10.3389/fchem.2018.00057
 16.
Prieto-Martinez FD, Medina-Franco JL (2018) Charting the Bromodomain BRD4: towards the identification of novel inhibitors with molecular similarity and receptor mapping. Lett Drug Des Discov 15:1002–1011. https://doi.org/10.2174/1570180814666171121145731
 17.
Irwin JJ, Sterling T, Mysinger MM et al (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52:1757–1768. https://doi.org/10.1021/ci3001277
 18.
RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed Nov 2018.
 19.
Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42:1273–1280. https://doi.org/10.1021/ci010132r
 20.
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754. https://doi.org/10.1021/ci100050t
 21.
Seabold S, Perktold J (2010) Statsmodels: econometric and statistical modeling with python. In: Proceedings of 9th python in science conference, pp 57–61
 22.
LeBlanc D (2004) Statistics: concepts and applications for science. Jones & Bartlett Publishers, Sudbury
 23.
Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:20. https://doi.org/10.1186/s13321-015-0069-3
 24.
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218. https://doi.org/10.1007/BF01908075
 25.
Sievers F, Wilm A, Dineen D et al (2014) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539. https://doi.org/10.1038/msb.2011.75
 26.
The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucl Acids Res 45:D158–D169. https://doi.org/10.1093/nar/gkw1099
 27.
Heikamp K, Bajorath J (2011) Large-scale similarity search profiling of ChEMBL compound data sets. J Chem Inf Model 51:1831–1839. https://doi.org/10.1021/ci200199u
Authors’ contributions
All authors designed the study. NSC performed the calculations. All authors wrote, read and approved the final manuscript.
Acknowledgements
This work was funded by the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT) IA203718, Facultad de Química, UNAM. NSC is thankful to CONACyT for the granted scholarship number 335997.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All datasets reported in the current study as well as the code implemented for calculations of DFP and SBDFP are available in the GitHub repository: https://github.com/DIFACQUIM/SBDFP
Funding
This work was funded by the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT) IA203718.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
This work is dedicated to Dr. Gerald M. Maggiora on the occasion of his 80th Birthday.
Additional file
Additional file 1: Table S1.
Average similarity searching performances for SBDFP constructed at different confidence levels. Table S2. “1” bits count for 15,403,690 compounds taken from ZINC using MACCS keys. Table S3. “1” bits count for 15,403,690 compounds taken from ZINC using ECFP4. Table S4. Similarity matrix of compound data sets computed as the median Tanimoto coefficient between its compounds using MACCS keys. Table S5. Similarity matrix of compound data sets computed as Tanimoto coefficient between its DFP based on MACCS keys. Table S6. Similarity matrix of compound data sets computed as Tanimoto coefficient between its SBDFP based on MACCS keys. Table S7. Similarity matrix of compound data sets computed as the median Tanimoto coefficient between its compounds using ECFP4. Table S8. Similarity matrix of compound data sets computed as Tanimoto coefficient between its DFP based on ECFP4. Table S9. Similarity matrix of compound data sets computed as Tanimoto coefficient between its SBDFP based on ECFP4. Table S10. Sequence identity matrix of targets computed from Clustal Omega alignments.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Sánchez-Cruz, N., Medina-Franco, J.L. Statistical-based database fingerprint: chemical space dependent representation of compound databases. J Cheminform 10, 55 (2018). https://doi.org/10.1186/s13321-018-0311-x
Keywords
 Chemical space
 Epi-informatics
 Molecular fingerprints
 Representation
 Similarity searching