- Research article
- Open Access
A new chemoinformatics approach with improved strategies for effective predictions of potential drugs
© The Author(s) 2018
- Received: 2 June 2018
- Accepted: 2 October 2018
- Published: 11 October 2018
Fast and accurate identification of potential drug candidates against therapeutic targets (i.e., drug–target interactions, DTIs) is a fundamental step in the early drug discovery process. However, experimental determination of DTIs is time-consuming and costly, especially for testing the associations between the entire chemical and genomic spaces. Therefore, computationally efficient algorithms with accurate predictions are required to achieve such a challenging task. In this work, we design a new chemoinformatics approach derived from neighbor-based collaborative filtering (NBCF) to infer potential drug candidates for targets of interest. One of the fundamental steps of NBCF in the application of DTI predictions is to accurately measure the similarity between drugs solely based on the DTI profiles of known knowledge. However, commonly used similarity calculation methods such as COSINE may be noise-prone due to the extremely sparse property of the DTI bipartite network, which decreases the model performance of NBCF. We herein propose three strategies to remedy such a dilemma, which include: (1) adopting a positive pointwise mutual information (PPMI)-based similarity metric, which is noise-immune to some extent; (2) performing low-rank approximation of the original prediction scores; (3) incorporating auxiliary (complementary) information to produce the final predictions.
We test the proposed methods on three benchmark datasets, and the results indicate that our strategies improve NBCF performance for DTI prediction. Compared with the prior algorithm, our methods achieve better results as assessed by a recall-based evaluation metric.
A new chemoinformatics approach with improved strategies was successfully developed to predict potential DTIs. Among the resulting models, the one based on the sparsity-resistant PPMI similarity metric exhibits the best performance. It may help researchers identify potential drugs against therapeutic targets of interest, and it can also be applied to related problems such as identifying candidate disease genes.
A key component of the drug discovery process is the accurate identification of drug–target interactions (DTIs). Experimental determination of DTIs is both costly and time-consuming. Moreover, as the known chemical and genomic (drug target) spaces continue to grow, it becomes impractical to validate all possible drug–target pairs experimentally. Effective computational algorithms for predicting potential DTIs are therefore increasingly in demand. Docking simulation is often used to probe, at the molecular level, the interactions between a series of small molecules and a target under study. However, docking methods require accurate three-dimensional structures of the target proteins, which makes such studies difficult for membrane proteins because these proteins are hard to crystallize. Quantitative structure–activity relationship (QSAR) modeling is another way to characterize possible DTIs, but it typically performs best on molecules that share similar scaffolds [2–4]. Advances in next-generation sequencing (NGS) and small-molecule high-throughput screening (HTS) are now accelerating the identification of potential therapeutic targets and drug compounds, which presents both challenges and opportunities for chemogenomic research that explores the chemical and genomic spaces simultaneously.
In line with this, Yamanishi et al. proposed a bipartite graph learning method that correlates the chemical and genomic spaces with the interaction space (i.e., the pharmacological space) to predict potential DTIs; several algorithms with improved performance followed. For example, Bleakley et al. proposed a supervised inference method that predicts unknown DTIs using bipartite local models (BLM); BLM transforms the edge prediction problem into a binary classification problem over labeled points. van Laarhoven et al. used a regularized least squares algorithm combined with the Gaussian interaction profile (GIP) kernel, which is calculated only from the topological information of the drug–target network, to infer DTIs. Mei et al. introduced a neighbor-based interaction-profile inferring method and integrated it into BLM, enabling the model to make predictions for new drugs and targets. Hao et al. employed a nonlinear kernel fusion (KF) technique to infer DTIs. Liu et al. proposed a neighborhood regularized logistic matrix factorization (NRLMF) algorithm to partly overcome the imbalance problem in DTI prediction. Later, Hao et al. designed a dual-network integrated logistic matrix factorization (DNILMF) technique by incorporating an idea from social ensemble modeling into the DTI prediction model. Recently, Olayan et al. proposed a method called DDR, which uses a heterogeneous graph comprising the known DTI network and multiple target and drug similarities, with Random Forest as the classifier, to predict unknown DTIs. By adding a heuristic selection of similarity matrices and the nonlinear KF technique, DDR outperformed other state-of-the-art methods. Many other previously developed DTI prediction algorithms are covered in the reviews [13–17].
Among popular DTI prediction algorithms, the most reliable and accurate are those based on similarities. However, the similarity information they use is derived either from protein sequences or from drug structures. Despite the importance of the DTI graph, few studies have used its similarity information as the main source when building a model, with the exception of previous work. In fact, the DTI bipartite network itself contains important information that can benefit model performance. The success of recommender systems in e-commerce, which exploit the bipartite network alone, provides a proof of concept [18–21]. Inspired by this technology, we apply and extend it to DTI prediction. Specifically, we adopt neighbor-based collaborative filtering (NBCF), one of the most successful techniques in the recommendation community. To apply NBCF to DTI prediction, a fundamental step is to accurately compute pairwise drug similarities from the drug interaction profiles (DIPs) with targets in the bipartite interaction network, rather than from drug structures as in previous studies. A similar idea has been reported in which protein similarities were measured from their associated ligands rather than from amino acid sequences [22, 23]. With the DIP-based drug similarities, an intuitive NBCF model is built on the following assumption: if drug A and drug B are highly similar (as indicated by their DIP-based similarity) and drug A interacts with a given target, then drug B has a high probability of interacting with the same target, though there may be exceptions [22, 23].
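The NBCF assumption above can be illustrated with a minimal sketch. It assumes a binary drug-by-target matrix `Y`, COSINE similarity between drug interaction profiles, and a plain similarity-weighted sum as the scoring rule; the function names and the toy matrix are hypothetical, and the authors' exact scoring formula may differ.

```python
import numpy as np

def cosine_similarity_rows(Y):
    """Pairwise cosine similarity between the rows (drug profiles) of Y."""
    Y = np.asarray(Y, dtype=float)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # drugs without known targets get zero similarity
    unit = Y / norms
    return unit @ unit.T

def nbcf_scores(Y, S):
    """Score every (drug, target) pair from the profiles of the other drugs."""
    S = np.array(S, dtype=float)   # copy so the caller's matrix is untouched
    np.fill_diagonal(S, 0.0)       # a drug does not vote for itself
    return S @ np.asarray(Y, dtype=float)

# Toy network (assumed data): rows are drugs, columns are targets.
Y = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
S = cosine_similarity_rows(Y)
scores = nbcf_scores(Y, S)
# scores[0, 2] is positive because drug 1, which resembles drug 0,
# is known to interact with target 2.
```

Because the score for a pair depends only on the known interactions of similar drugs, the quality of the similarity matrix directly determines the quality of the ranking.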
However, experimentally validated interaction information is well known to be extremely limited relative to the whole drug–target interaction space. Computing similarity from such a sparse network with conventional similarity measures therefore introduces noise (sparseness is defined here as the number of links divided by the total number of possible target–drug pairs). To tackle this challenge, we propose three strategies: designing a new similarity metric that mitigates noise, performing low-rank approximation (LRA) of the original prediction scores, and incorporating auxiliary information into the model.
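As a rough illustration of the first strategy, the sketch below computes a positive pointwise mutual information (PPMI) similarity between drugs from their shared targets. This is one common way to define PPMI over a co-occurrence matrix; the co-occurrence definition (`Y @ Y.T`, diagonal included), the function name, and the toy data are assumptions and not necessarily the formulation used by the authors.

```python
import numpy as np

def ppmi_similarity(Y):
    """PPMI between drugs, based on how many targets each pair shares."""
    Y = np.asarray(Y, dtype=float)
    C = Y @ Y.T                          # C[i, j] = targets shared by drugs i and j
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        expected = row @ col / total     # co-occurrence expected under independence
        pmi = np.log(C / expected)
    pmi[~np.isfinite(pmi)] = 0.0         # pairs with no shared target carry no signal
    return np.maximum(pmi, 0.0)          # keep only positive associations

# Toy network (assumed data): drugs 0 and 1 share two targets, drug 2 shares none.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
sim = ppmi_similarity(Y)
```

Clipping negative values to zero means that weak or accidental co-occurrences, which are common in a very sparse network, do not contribute to the neighbor votes.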
It is critical to select an appropriate evaluation method in order to assess the strength of a DTI prediction algorithm and to identify room for further improvement. Instead of adopting the commonly used evaluation metrics [i.e., area under the curve (AUC) and area under the precision-recall (AUPR) curve], we introduce a recall-based metric, mean percentile ranking (MPR), which is under-studied in DTI prediction but routinely used in recommender system studies [18, 24]. We select MPR as the evaluation criterion because only one-class, experimentally validated information is available (i.e., a drug interacts with a target, which is considered positive information), whereas negative information (i.e., a drug does not interact with a target) is unknown owing to the lack of comprehensive experimental data on drug–target pairs. A recall-based metric is therefore suitable for this scenario. Finally, we validate our method on three large publicly available datasets and compare the proposed algorithm with the prior art based on MPR. We conclude that the proposed NBCF algorithm with the improved strategies is both effective and computationally efficient for DTI prediction, and that it outperforms the previously developed algorithm for identifying potential drugs against therapeutic targets under study.
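To make the metric concrete, the following sketch computes an MPR in the style used for implicit-feedback recommenders: for each held-out positive pair, the drug's percentile position among all candidate drugs for that target is recorded (0 is the top of the list, 1 the bottom), and the percentiles are averaged. Ranking per target, the optimistic handling of ties, and all names and numbers here are assumptions made for illustration.

```python
import numpy as np

def mean_percentile_ranking(scores, test_pairs):
    """Average percentile rank of held-out (drug, target) positives; lower is better."""
    scores = np.asarray(scores, dtype=float)
    n_drugs = scores.shape[0]
    percentiles = []
    for drug, target in test_pairs:
        column = scores[:, target]
        # Number of candidate drugs ranked strictly above the held-out drug.
        better = np.count_nonzero(column > column[drug])
        percentiles.append(better / (n_drugs - 1))
    return float(np.mean(percentiles))

# Assumed prediction scores for 4 drugs (rows) and 2 targets (columns).
scores = np.array([[0.9, 0.1],
                   [0.5, 0.8],
                   [0.2, 0.3],
                   [0.1, 0.7]])
held_out = [(0, 0), (2, 1)]   # drug 0 ranks first for target 0, drug 2 third for target 1
mpr = mean_percentile_ranking(scores, held_out)
```

The metric needs only known positives, which is why it suits the one-class setting described above.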
Table 1 Benchmark datasets and corresponding properties
Properties listed for each dataset: number of targets; number of drugs; number of interactions; average number of interactions of each drug with targets; average number of interactions of each target with drugs; minimum and maximum number of interactions of each drug with targets; minimum and maximum number of interactions of each target with drugs.
Workflow of the proposed algorithm
Properties of benchmark datasets
We validate our algorithm using three benchmark datasets (Table 1). (1) DATASET-H: this dataset, obtained from our previous work, contains 733 unique targets and 829 unique drugs extracted from the DrugBank database after several pre-processing steps. On average, DATASET-H has about 4 known targets for each drug and 5 drugs for each target. From the drug side, the minimum and maximum numbers of interacting targets are 1 and 48, respectively. From the target side, the minimum and maximum numbers of interacting drugs are 1 and 75, respectively. The sparsity value (the number of known interactions divided by the total number of possible drug–target pairs; the lower the value, the sparser the dataset) is 0.006, indicating that the dataset is very sparse. (2) DATASET-K: this dataset was retrieved from the publication of Kuang and co-workers. It is similar to DATASET-H but has more targets than drugs. It also has a sparsity value of 0.006, which makes DTI prediction extremely difficult. (3) DATASET-Y: this dataset is a subset of the previous work by Yamanishi and co-workers and has the largest number of possible interaction pairs. Like DATASET-K, it has more targets than drugs. Compared with the first two datasets, DATASET-Y has a relatively higher sparsity value (0.010), indicating that it is somewhat less (but still very) sparse and contains more known interactions. In summary, all three benchmark datasets have very low sparsity values, which makes them challenging test cases for DTI prediction algorithms.
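The sparsity value defined above is straightforward to compute from a binary interaction matrix. The sketch below uses a small made-up matrix rather than any of the benchmark datasets; only the 733 and 829 counts come from the text.

```python
import numpy as np

def sparsity(Y):
    """Known interactions divided by all possible drug-target pairs."""
    Y = np.asarray(Y)
    return np.count_nonzero(Y) / Y.size

# Assumed toy matrix: 5 drugs, 4 targets, 3 known interactions.
toy = np.zeros((5, 4), dtype=int)
toy[0, 1] = toy[2, 3] = toy[4, 0] = 1
value = sparsity(toy)            # 3 known pairs out of 20 possible

# Number of possible pairs in DATASET-H, from the counts given in the text.
possible_pairs_h = 733 * 829
```

With several hundred thousand possible pairs and a sparsity of 0.006, well under 1% of the DATASET-H matrix is populated, which is the setting in which profile-based similarities become noisy.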
Results of the proposed algorithm
Table 2 Results of MPR for the proposed algorithms based on 5 trials of tenfold cross-validation in the benchmark datasets
0.054 ± 0.010 | 0.049 ± 0.010 | 0.020 ± 0.006
0.081 ± 0.019 | 0.068 ± 0.019 | 0.037 ± 0.013
0.092 ± 0.026 | 0.070 ± 0.017 | 0.035 ± 0.012
0.061 ± 0.012 | 0.055 ± 0.014 | 0.023 ± 0.008
0.066 ± 0.013 | 0.049 ± 0.010 | 0.029 ± 0.007
0.066 ± 0.013 | 0.052 ± 0.011 | 0.028 ± 0.007
0.109 ± 0.020 | 0.077 ± 0.014 | 0.023 ± 0.007
0.086 ± 0.013 | 0.051 ± 0.009 | 0.027 ± 0.006
0.083 ± 0.014 | 0.055 ± 0.010 | 0.027 ± 0.004
0.083 ± 0.023 | 0.063 ± 0.016 | 0.037 ± 0.013
Comparison to counterpart and further consideration
We compared the proposed NBCF algorithm with DT-Hybrid, proposed by Alaimo and co-workers. We selected DT-Hybrid for comparison because (1) both NBCF and DT-Hybrid are derived from network-based recommendation technology [19, 20, 33]; (2) both algorithms adopt a recall-based metric; and (3) both are effective and computationally efficient for DTI prediction. For DT-Hybrid, we adopted the default parameters according to the reported values (i.e., lambda set to 0.5 and alpha set to 0.4). As shown in Table 2, in all three datasets, the PPMI-based NBCF model in Strategy 1 and both the COSINE-based and TANIMOTO-based NBCF models in Strategy 2 generated much better results than DT-Hybrid. Similarly, the COSINE and TANIMOTO models in Strategy 3 consistently outperformed DT-Hybrid. Together, these results indicate that the proposed algorithm with the improved strategies has stronger prediction ability for inferring potential DTIs. Although the NBCF algorithm combined with PPMI, COSINE, or TANIMOTO similarity was successful for DTI prediction, we were also interested in the effect of other similarity methods on model performance. Since the GIP kernel was reported to be a useful similarity metric for predicting potential DTIs in previous work [7, 9, 11], we performed an experiment based on GIP. However, it did not yield satisfactory results in terms of MPR in any of the three benchmark datasets. When we tested another similarity metric, the DICE coefficient, the results showed a trend similar to those based on COSINE and TANIMOTO. Furthermore, we tested the proposed algorithm on the IC dataset from the previous study and found that the PPMI-based model still exhibited the best performance. To further validate the effectiveness of the model, we also performed fivefold cross-validation, which generated results similar to those from tenfold cross-validation.
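For reference, the TANIMOTO and DICE coefficients on binary interaction profiles can be sketched as follows. The set-based definitions are standard; the function name and the toy matrix are assumptions. On binary data DICE is a monotone function of TANIMOTO, D = 2T / (1 + T), so the two order any drug's neighbors identically, which is consistent with (though not proof of the cause of) the similar trends reported above.

```python
import numpy as np

def tanimoto_and_dice(Y):
    """Tanimoto and Dice coefficients between the binary rows of Y."""
    Y = np.asarray(Y, dtype=float)
    inter = Y @ Y.T                              # shared targets per drug pair
    counts = Y.sum(axis=1)                       # known targets per drug
    sums = counts[:, None] + counts[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        tanimoto = inter / (sums - inter)        # |A and B| / |A or B|
        dice = 2.0 * inter / sums                # 2 |A and B| / (|A| + |B|)
    tanimoto[~np.isfinite(tanimoto)] = 0.0       # pairs of empty profiles
    dice[~np.isfinite(dice)] = 0.0
    return tanimoto, dice

# Toy network (assumed data): rows are drugs, columns are targets.
Y = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
tanimoto, dice = tanimoto_and_dice(Y)
```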
It should be noted that the current work focuses mainly on inferring potential drugs for targets of interest. The inverse operation (i.e., inferring potential targets for drugs of interest) is also possible and will be explored in the future. Moreover, we plan to improve the current algorithm so that it scales to larger datasets and handles new-target (or new-drug) scenarios [8, 10, 11, 36, 37].
In this work, we propose a straightforward yet effective and computationally efficient algorithm, NBCF, for inferring potential DTIs. To overcome the data sparsity inherent in the known DTI network, we designed three strategies. In Strategy 1, we use a sparsity-resistant similarity metric, PPMI, to measure the correlation between drugs from the DTI network alone; this model exhibits the best performance in the current work. In Strategies 2 and 3, we respectively apply the low-rank approximation technique to, and incorporate additional auxiliary similarity into, the noise-prone models (i.e., COSINE-based NBCF and TANIMOTO-based NBCF). Both strategies are shown to enhance the accuracy of identifying drug candidates for therapeutic targets.
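Strategies 2 and 3 can be sketched in a few lines. The example below assumes a truncated singular value decomposition for the low-rank approximation and a simple weighted average for combining the profile-based similarity with an auxiliary one; both choices, the `weight` parameter, and the function names are illustrative assumptions, since this section does not specify how the authors implement either step.

```python
import numpy as np

def low_rank_approximation(scores, rank):
    """Best rank-`rank` approximation of a score matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(scores, dtype=float), full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def blend_similarities(S_profile, S_aux, weight=0.5):
    """Weighted average of a profile-based and an auxiliary similarity matrix."""
    return weight * np.asarray(S_profile) + (1.0 - weight) * np.asarray(S_aux)

# Assumed toy score matrix: keeping one component drops the weaker direction.
raw = np.array([[3.0, 0.0],
                [0.0, 1.0]])
smoothed = low_rank_approximation(raw, rank=1)

# Assumed toy similarities: identity (profile-based) and all-ones (auxiliary).
blended = blend_similarities(np.eye(2), np.ones((2, 2)), weight=0.5)
```

Discarding the small singular components is a common way to suppress noise in a score matrix, and the blend lets structure-derived or other auxiliary similarity compensate where interaction profiles are too sparse to be informative.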
MH and YW conceptualized the project. MH was responsible for the solution development. YW supervised the project. All authors participated in the project discussion. All authors read and approved the final manuscript.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
The authors declare that they have no competing interests.
Availability of data and materials
The source codes with the manuscript are available at: https://github.com/minghao2016/NBCF4DTIPred.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Hao M, Li Y, Wang Y, Yan Y, Zhang S (2011) Combined 3D-QSAR, molecular docking, and molecular dynamics study on piperazinyl-glutamate-pyridines/pyrimidines as potent P2Y12 antagonists for inhibition of platelet aggregation. J Chem Inf Model 51:2560–2572
- Cai J, Li C, Liu Z, Du J, Ye J, Gu Q, Xu J (2017) Predicting DPP-IV inhibitors with machine learning approaches. J Comput Aided Mol Des 31:393–402
- Hao M, Li Y, Wang Y, Zhang S (2011) A classification study of human β3-adrenergic receptor agonists using BCUT descriptors. Mol Divers 15:877
- Myint K-Z, Wang L, Tong Q, Xie X-Q (2012) Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Mol Pharm 9:2912–2923
- Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M (2008) Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24:i232–i240
- Bleakley K, Yamanishi Y (2009) Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25:2397–2403
- van Laarhoven T, Nabuurs SB, Marchiori E (2011) Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 27:3036–3043
- Mei J-P, Kwoh C-K, Yang P, Li X-L, Zheng J (2013) Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics 29:238–245
- Hao M, Wang Y, Bryant SH (2016) Improved prediction of drug–target interactions using regularized least squares integrating with kernel fusion technique. Anal Chim Acta 909:41–50
- Liu Y, Wu M, Miao C, Zhao P, Li X-L (2016) Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput Biol 12:e1004760
- Hao M, Bryant SH, Wang Y (2017) Predicting drug–target interactions by dual-network integrated logistic matrix factorization. Sci Rep 7:40376
- Olayan RS, Ashoor H, Bajic VB (2017) DDR: efficient computational method to predict drug–target interactions using graph mining and machine learning approaches. Bioinformatics 34:1164–1173
- Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y (2016) Drug–target interaction prediction: databases, web servers and computational models. Brief Bioinform 17:696–712
- Cheng T, Hao M, Takeda T, Bryant S, Wang Y (2017) Large-scale prediction of drug–target interaction: a data-centric review. AAPS J 19:1264–1275
- Mousavian Z, Masoudi-Nejad A (2014) Drug–target interaction prediction via chemogenomic space: learning-based methods. Expert Opin Drug Metab Toxicol 10:1273–1287
- Ezzat A, Wu M, Li XL, Kwoh CK (2018) Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey. Brief Bioinform. https://doi.org/10.1093/bib/bby002
- Hao M, Bryant SH, Wang Y (2018) Open-source chemogenomic data-driven algorithms for predicting drug–target interactions. Brief Bioinform. https://doi.org/10.1093/bib/bby010
- Johnson CC (2014) Logistic matrix factorization for implicit feedback data. In: Neural information processing systems workshop on distributed machine learning and matrix computations
- Volkovs M, Yu GW (2015) Effective latent models for binary feedback in recommender systems. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval
- Zhou T, Kuscsik Z, Liu J-G, Medo M, Wakeling JR, Zhang Y-C (2010) Solving the apparent diversity-accuracy dilemma of recommender systems. Proc Natl Acad Sci USA 107:4511–4515
- Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on world wide web
- Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK (2007) Relating protein pharmacology by ligand chemistry. Nat Biotechnol 25:197–206
- Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB (2009) Predicting new molecular targets for known drugs. Nature 462:175
- Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: IEEE international conference on data mining
- Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36:D901–D906
- Kuang Q, Xu X, Li R, Dong Y, Li Y, Huang Z, Li Y, Li M (2015) An eigenvalue transformation technique for predicting drug–target interaction. Sci Rep 5:13867
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita K, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34:D354–D357
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 32:D431–D433
- Günther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ (2007) SuperTarget and Matador: resources for exploring drug–target relationships. Nucleic Acids Res 36:D919–D922
- Dimova D, Bajorath J (2016) Advances in activity cliff research. Mol Inform 35:181–191
- Whittle M, Gillet VJ, Willett P, Alex A, Loesel J (2004) Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: a comparison of similarity coefficients. J Chem Inf Comput Sci 44:1840–1848
- Yan X, Guo J, Liu S, Cheng X, Wang Y (2013) Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the SIAM international conference on data mining
- Sedhain S, Menon AK, Sanner S, Braziunas D (2016) On the effectiveness of linear models for one-class collaborative filtering. In: AAAI
- Ma H, King I, Lyu MR (2009) Learning to recommend with social trust ensemble. In: Proceedings of SIGIR
- Alaimo S, Pulvirenti A, Giugno R, Ferro A (2013) Drug–target interaction prediction through domain-tuned network-based inference. Bioinformatics 29:2004–2008
- van Laarhoven T, Marchiori E (2013) Predicting drug–target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS ONE 8:e66952
- Gönen M (2012) Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28:2304–2310