
An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model


Identification of interactions between chemical compounds and proteins is crucial for various applications, including drug discovery, target identification, network pharmacology, and elucidation of protein functions. Deep neural network-based approaches are becoming increasingly popular in efficiently identifying compound-protein interactions with high-throughput capabilities, narrowing down the scope of candidates for traditional labor-intensive, time-consuming and expensive experimental techniques. In this study, we proposed an end-to-end approach termed SPVec-SGCN-CPI, which utilized simplified graph convolutional network (SGCN) model with low-dimensional and continuous features generated from our previously developed model SPVec and graph topology information to predict compound-protein interactions. The SGCN technique, dividing the local neighborhood aggregation and nonlinearity layer-wise propagation steps, effectively aggregates K-order neighbor information while avoiding neighbor explosion and expediting training. The performance of the SPVec-SGCN-CPI method was assessed across three datasets and compared against four machine learning- and deep learning-based methods, as well as six state-of-the-art methods. Experimental results revealed that SPVec-SGCN-CPI outperformed all these competing methods, particularly excelling in unbalanced data scenarios. By propagating node features and topological information to the feature space, SPVec-SGCN-CPI effectively incorporates interactions between compounds and proteins, enabling the fusion of heterogeneity. Furthermore, our method scored all unlabeled data in ChEMBL, confirming the top five ranked compound-protein interactions through molecular docking and existing evidence. These findings suggest that our model can reliably uncover compound-protein interactions within unlabeled compound-protein pairs, carrying substantial implications for drug re-profiling and discovery. 
In summary, SPVec-SGCN demonstrates its efficacy in accurately predicting compound-protein interactions, showcasing potential to enhance target identification and streamline drug discovery processes.

Scientific contributions

The methodology presented in this work not only enables comparatively accurate prediction of compound-protein interactions but also, for the first time, simultaneously takes into consideration both sample imbalance, which is very common in the real world, and computational efficiency, thereby accelerating the target identification and drug discovery process.


Identification of interactions between compounds and proteins holds immense importance in various realms. Specifically, discovering new drugs is globally significant, both in academic research and commercial endeavors. The exploration of compound-protein interactions (CPIs) is pivotal in identifying compounds that interact with specific molecular targets. This process is fundamental for various purposes such as drug discovery, target identification, network pharmacology, comprehending protein functionalities, and more [1, 2]. However, the task of identifying new compounds along with their corresponding protein targets remains a formidable challenge, primarily due to the limited comprehension of the intricate relationships between the chemical space and proteomic space. Wet experimental tests are crucial methods utilized to assess the safety and effectiveness of novel drugs or treatment strategies. Nonetheless, these methods are often proven to be costly and time-consuming, demanding substantial resources. The evaluation process involves in vivo testing, which examines the effects of a drug or treatment within a living organism. However, this can be intricate and problematic due to ethical concerns and the inherent variability of biological systems. Alternatively, in vitro testing investigates the effects of drugs or treatments in a controlled laboratory setting outside a living organism, which also tends to be time-consuming and expensive due to the requirement for specialized equipment and expertise [3].

Recently, machine learning (ML) and deep learning (DL) algorithms such as Random Forest (RF) [4], Support Vector Machine (SVM) [5], Deep Neural Network (DNN) [6], and Gradient Boosting Decision Tree (GBDT) [7] have sped up the CPI identification process by enabling the development of novel compound candidates with enhanced efficiency, efficacy, and quality [8]. Chen et al. [9] proposed TransformerCPI to improve compound-protein interaction prediction through sequence-based deep learning with a self-attention mechanism. Li et al. [10] developed MONN, a multi-objective neural network capable of accurately predicting binding affinities between compounds and proteins while also capturing the non-covalent interactions between them. DEEPScreen [11] utilized convolutional neural networks with 2-D structural compound representations. These techniques have been used to discover targets that are more specific and effective, and to identify novel compounds that can be further optimized for therapeutic use. By leveraging large datasets and computational models, ML and DL algorithms can predict the interactions between compounds and their targets, analyze the pharmacological properties of candidate compounds, and optimize the chemical structures of molecules to improve their potency, selectivity, and safety profiles. Moreover, these techniques can significantly reduce the time and cost associated with traditional drug discovery methods, which rely on trial-and-error experiments and animal testing. Overall, the integration of ML and DL algorithms into CPI prediction holds great promise for the development of safer, more effective, and more affordable treatments for a wide range of diseases. However, few ML- and DL-based methods for predicting CPIs use end-to-end representation learning. Instead, they rely on hand-extracted, carefully designed compound and protein features as input to the neural network. 
Using molecular fingerprints and protein structures as input features requires prior knowledge about the data and involves hand-crafted features. While end-to-end learning has proven to be an effective method for feature representation, it is not commonly used in biological problems. In the CPI problem, compounds and proteins can be represented as sequences in which each character represents an atom or an amino acid, similar to ‘sentences’ in natural language processing. There is therefore considerable potential in end-to-end learning of CPI feature representations. Various protein and compound language models have been proposed for feature representation. Examples include ProtVec [12] and SMILES2Vec [13], which are based on the word2vec [14] methodology, ProteinBERT [15] for proteins and Knowledge-based BERT [16] for compounds, as well as large language models like ESM-1b [17].

Graph convolutional networks (GCN) [18] have achieved significant advancement in processing network or data with graph-structure and are deemed a promising solution to the CPIs problems. GCN plays a vital role in investigating intricate biological systems, which are represented as graphs made up of nodes (i.e., biomolecules) and edges (i.e., connections between biomolecules), such as CPIs analyzed in this study. Previous studies demonstrated that biomolecules, such as small molecules and proteins, carry out their functions not only individually but also through interactions with other biomolecules. As a result, network topology should be considered to predict interactions among biomolecules. To date, GCN has been extensively applied in numerous real-world tasks, yielding satisfactory results in drug-target interaction or affinity prediction [19,20,21,22], drug-drug interaction prediction [23, 24], disease-gene association recognition [25,26,27], and so on. In biological tasks like CPIs, the number of pertinent entities (e.g., genes, compounds, proteins, etc.) is typically enormous. The most typical hurdle is the “neighbor explosion” phenomenon encountered when dealing with complex large graphs (the complexity of node representation and stochastic gradient calculation will exponentially increase with the increasing number of message passing layers), and the over-smoothing or overfitting issues caused by stacking multiple layers of GCN (as the neural network goes deeper, nodes tend to have similar representations after aggregation operations). Researchers have proposed various graph sampling techniques to reduce the number of nodes involved in message passing, thereby lowering training costs. The most common techniques include node sampling (such as GraphSAGE [28], PinSage [29], VRGCN [30]), layer sampling (such as FastGCN [31], ASGCN [32]) and edge sampling [28]. 
In addition to training complexity issues, there are still challenges on accuracy and scalability [33, 34].

In real-world scenarios, the number of observed CPIs is often significantly lower than the number of potential interactions that could exist. Therefore, imbalanced datasets more accurately reflect the natural distribution of positive and negative samples in CPI prediction tasks. However, many existing methods for CPI prediction are trained and evaluated using balanced datasets, as observed in [35,36,37]. In balanced datasets, models might achieve artificially high accuracy due to being prone to predicting the majority class. This can be misleading and does not provide an accurate assessment of a model’s performance. Therefore, it becomes imperative to address the challenge posed by naturally imbalanced data, even though training models on imbalanced datasets remains a significant hurdle for machine learning techniques [38]. Imbalanced datasets force models to learn the underlying patterns of interactions, leading to more reliable evaluations. They can also improve a model's sensitivity to true positive predictions. Sensitivity is crucial in CPI prediction, as accurately identifying existing CPIs is essential for drug development and repurposing. Successfully predicting interactions in an imbalanced setting indicates better generalization and robustness of the model when applied to real-world situations.

In this study, we proposed an end-to-end approach called SPVec-SGCN-CPI, which uses a simplified homogeneous GCN model on concatenated compound and protein features derived from the SPVec [39] model. Figure 1 illustrates the whole pipeline for CPI prediction. The CPI prediction task consists of three steps: (i) feature representation via the SPVec method, (ii) graph construction based on feature concatenation and feature similarity, and (iii) a simplified GCN model with \(K\) layers. The SGCN technique separates the local neighborhood aggregation and nonlinearity layer-wise propagation steps, effectively aggregating \(K\)-order neighbor information while preventing neighbor explosion and speeding up training [40]. This makes the training process more efficient and allows the algorithm to handle larger graphs. The performance of the SPVec-SGCN-CPI method was evaluated across three datasets and compared against four ML- and DL-based methods and four state-of-the-art methods. Experimental results demonstrated that SPVec-SGCN-CPI outperformed the ML, DL and state-of-the-art methods, particularly excelling on unbalanced datasets. In sum, SPVec-SGCN demonstrates its capacity to reliably predict CPIs, exhibiting potential to enhance target identification and streamline drug discovery processes.

Fig. 1 The whole pipeline for CPI prediction. There are three steps for the CPI prediction task: (i) feature representation via the SPVec method, (ii) graph construction based on feature concatenation and feature similarity, and (iii) a simplified GCN model with \(K\) layers



ChEMBL [41], BindingDB [42] and PubChem [43] are commonly used and frequently reported databases of chemical molecules and their biological activities. Affinity data for protein–ligand complexes are curated from published literature in major medicinal chemistry journals, and the data have been manually annotated to ensure reliability. Supplementary Table S1 shows the numbers of compounds, targets and their interactions obtained from the three data sources mentioned above as of October 2023. While many types of activity measures (IC50, AC50, EC50, \({K}_{i}\), \({K}_{d}\)) have been widely used to quantify the potency of compounds in inhibiting the activity of a biological target, IC50 is most commonly reported in experimental studies because the determination of \({K}_{i}\)/\({K}_{d}\) is somewhat cumbersome. Either a low IC50 value or a low \({K}_{i}\)/\({K}_{d}\) value indicates high binding affinity [44]. To maximize the utilization of our collected data, we selected IC50 as the primary quantitative measure. We first removed compound-protein pairs with missing IC50 values, as well as inorganic compounds because of their low druggability. Following the activity threshold discussion in [45, 46], compound-protein pairs with IC50 values < 100 nM were selected as positive samples and compound-protein pairs with IC50 values > 10000 nM were selected as negative samples. It is worth noting that this threshold is variable: one can adjust the IC50 cutoff [47,48,49] or classify positive and negative samples based on \({K}_{i}\) or \({K}_{d}\) values [50,51,52,53]. Table S2 lists the different criteria adopted by researchers, with related analyses following Table S2. According to data source, we used the ChEMBL dataset as training data and the other two datasets as test data. Table 1 shows the final numbers of entries in the three datasets obtained from ChEMBL, BindingDB and PubChem, respectively. 
Unlike the ChEMBL and BindingDB datasets, where the numbers of positive and negative samples are relatively close, the ratio of positive to negative samples in the PubChem dataset is approximately 1:81, indicating a highly imbalanced distribution. This is in line with real-world conditions, because most compound-protein pairs are unlabeled or negative samples. The PubChem dataset thus allows us to address the sample imbalance issue, which many other studies have not considered. Figure 2 shows the numbers of compounds (Fig. 2a) and protein targets (Fig. 2b) that are unique or common in the ChEMBL, BindingDB and PubChem datasets. Each CPI entry in the three datasets is unique, with no overlap. The overlap of samples (compounds or targets) among the three datasets is also very low, so BindingDB and PubChem are suitable for use as test sets.
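
As a minimal sketch of the labeling rule described above, the following Python snippet assigns class labels from IC50 values. The function name `label_pair`, the constants and the example records are hypothetical. It also assumes that pairs with intermediate IC50 values (between 100 nM and 10000 nM) are left out of both classes, which the text implies but does not state explicitly.

```python
from typing import Optional

POSITIVE_MAX_NM = 100.0      # IC50 below this value -> positive (interacting) pair
NEGATIVE_MIN_NM = 10_000.0   # IC50 above this value -> negative (non-interacting) pair


def label_pair(ic50_nm: Optional[float]) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (pair is discarded)."""
    if ic50_nm is None:
        return None              # pairs with missing IC50 values are removed
    if ic50_nm < POSITIVE_MAX_NM:
        return 1
    if ic50_nm > NEGATIVE_MIN_NM:
        return 0
    return None                  # assumption: intermediate activity is not used


# Hypothetical records: (compound id, protein id, IC50 in nM)
records = [
    ("c1", "p1", 12.0),
    ("c2", "p1", 50_000.0),
    ("c3", "p2", 850.0),
    ("c4", "p2", None),
]

labeled = [
    (compound, protein, label_pair(ic50))
    for compound, protein, ic50 in records
    if label_pair(ic50) is not None
]
```

With the hypothetical records above, only the first two pairs survive the filter; changing the two constants reproduces the alternative cutoffs mentioned in the text.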

Table 1 Numbers of entries in three datasets obtained from ChEMBL, BindingDB and PubChem, respectively
Fig. 2 Numbers of compounds (a) and protein targets (b) that are unique or common in the ChEMBL, BindingDB and PubChem datasets. Numbers in parentheses indicate the inclusion relationship of different groups: 1 means containment, 0 means non-containment, and the position indicates the group in which it is located

Feature representations

In our previous study, we proposed the SPVec [39] method to learn feature representations of small compounds (in SMILES format) and target protein sequences. SPVec, which was inspired by word2vec, uses the Skip-gram [54] model. The Skip-gram model is a type of neural network architecture that is used to predict the context words surrounding a target word in a sentence or text. The basic idea is to learn a distributed representation (vector) for each word in the vocabulary, such that similar words have similar vectors. The Skip-gram model is trained on a large corpus of text data, and it learns to predict the probability of observing a context word given a target word. One of the advantages of the Skip-gram method is its ability to capture the semantic relationships between words. The negative-sampling method is used to train the SPVec model, which reduces computational complexity while simultaneously enhancing the quality of the word vectors and accelerating training. Given a non-empty subset of negative samples \(NEG\left(w\right)\) (\(NEG\left(w\right)\ne \varnothing\)) and \(\forall \widetilde{w}\in D\), the probability of a word \(u\) given \(\widetilde{w}\) is

$$p\left( u \mid \tilde{w} \right) = \left[ \sigma \left( V\left( \tilde{w} \right)^{T} \theta^{u} \right) \right]^{L^{w} \left( u \right)} \cdot \left[ 1 - \sigma \left( V\left( \tilde{w} \right)^{T} \theta^{u} \right) \right]^{1 - L^{w} \left( u \right)}$$

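where \({L}^{w}(u)\) is the label of word \(u\) (1 if \(u=w\), i.e. a positive sample, and 0 if \(u\) is a negative sample), \(\sigma (\bullet )\) is the sigmoid function, \(V(\widetilde{w})\) is the vector of word \(\widetilde{w}\), and \(\theta\) are the parameters of the latent word vectors. The following function is maximized for each sample \((w,Context\left(w\right))\):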

$$g\left( w \right) = \prod_{\tilde{w} \in Context\left( w \right)} \, \prod_{u \in \left\{ w \right\} \cup NEG^{\tilde{w}} \left( w \right)} p\left( u \mid \tilde{w} \right)$$

where \({NEG}^{\widetilde{w}}(w)\) denotes the subset of negative samples generated while processing the word \(\widetilde{w}\). The following objective function \(\mathcal{L}\) is maximized using stochastic gradient updates (the SGD method):

$$\mathcal{L} = \log G = \log \prod_{w \in \mathcal{C}} g\left( w \right)$$
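
To make the negative-sampling objective concrete, the sketch below evaluates \(\log g(w)\) for a single target word in plain Python. It is an illustration under stated assumptions, not the SPVec implementation: the function names are hypothetical, vectors are plain lists, and the same negative samples are reused for every context word, whereas the formulation above draws a separate set \({NEG}^{\widetilde{w}}(w)\) per context word.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_p(v_context, theta_u, label):
    """log p(u | w~); label is 1 for the target word and 0 for a negative sample."""
    s = sigmoid(dot(v_context, theta_u))
    return label * math.log(s) + (1 - label) * math.log(1.0 - s)


def log_g(context_vectors, theta_target, theta_negatives):
    """log g(w): sum over context words w~ and over u in {w} united with NEG(w)."""
    total = 0.0
    for v in context_vectors:
        total += log_p(v, theta_target, 1)       # positive term, u = w
        for theta in theta_negatives:
            total += log_p(v, theta, 0)          # one term per negative sample
    return total


# Hypothetical toy values: one context word, one negative sample.
value = log_g([[0.0, 0.0]], [1.0, 1.0], [[1.0, 1.0]])
```

Because every factor of \(g(w)\) is a probability, the returned value is never positive; training pushes it towards zero by raising the score of the true word and lowering the scores of the negative samples.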

SPVec learns distributed representations (vectors) for each category instead of using one-hot encoding. To represent drug molecules, we treated SMILES strings as “sentences” and each atom as a “word”. For protein sequences, we regarded each sequence as a “sentence” and every three non-overlapping amino acids as a “word”. Since the same chemical structure can be written as different SMILES strings, we converted the original SMILES in the datasets to canonical SMILES to ensure the consistency and quality of the generated features. Our previous work demonstrated that SPVec is superior to manually designed features and that it avoids the sparseness problem and reduces the number of bit collisions. We also discussed the effect of different word vector dimensions and context window sizes. However, the influence of the corpus on word2vec was previously ignored. Here, we designed three corpuses to explore the sensitivity of word2vec to corpus quality: (1) Corpus_1 contains only the ~ 273 K compounds and 3451 proteins used in the training phase; (2) Corpus_2 contains the ~ 335 K compounds and 4806 proteins in all three datasets (i.e. ChEMBL, BindingDB and PubChem); (3) Corpus_3 contains the ~ 2.4 M compounds and ~ 15 K proteins in ChEMBL, that is, SPVec was pre-trained on external data. We proposed two methods to explicitly address the limitations associated with word2vec. First, handling out-of-vocabulary (OOV) words is a significant obstacle for a small corpus (Corpus_1). One common solution is to assign random vectors to OOV words; here we instead replaced the random vectors with averaged vectors of ‘words’ (i.e. compound and protein segments). Second, because a large corpus (Corpus_3) may contain noisy data, we filtered out molecules with Tanimoto similarity > 80% and proteins with sequence similarity > 80%. The improved corpuses obtained with these methods are named Corpus_1_imp and Corpus_3_imp, respectively. 
Besides, to make sure that no information about new proteins, new SMILES, or both was leaked during the pre-training and training processes, Corpus_3_only was constructed by removing the proteins and compounds that appear in the test sets (i.e. BindingDB and PubChem).
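
The “sentence” and “word” construction described above can be sketched as follows. This is a hedged illustration rather than the SPVec code: the regular expression is a simplified SMILES tokenizer chosen for the example (it treats every SMILES symbol, including bonds and parentheses, as a token), the function names are hypothetical, and the out-of-vocabulary rule assumes that “averaged vectors” means the mean of all known word vectors.

```python
import re
from typing import Dict, List

# Simplified SMILES tokenizer (an assumption): bracket atoms, two-letter halogens,
# then any other single character (atoms, bonds, ring digits, parentheses).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def smiles_words(smiles: str) -> List[str]:
    return SMILES_TOKEN.findall(smiles)


def protein_words(sequence: str, k: int = 3) -> List[str]:
    """Non-overlapping k-mers; a trailing remainder shorter than k is dropped."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


def mean_vector(rows: List[List[float]]) -> List[float]:
    return [sum(column) / len(rows) for column in zip(*rows)]


def embed(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors; OOV words fall back to the mean known vector."""
    oov_vector = mean_vector(list(vectors.values()))
    return mean_vector([vectors.get(word, oov_vector) for word in words])


# Hypothetical two-dimensional vocabulary for illustration only.
toy_vectors = {"MKT": [1.0, 3.0], "AYI": [3.0, 5.0]}
sentence = protein_words("MKTAYIAK")           # trailing "AK" is dropped
embedding = embed(["MKT", "XXX"], toy_vectors)  # "XXX" is out of vocabulary
```

In practice the vectors would come from a trained Skip-gram model, and the SMILES strings would be canonicalized before tokenization.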

According to Mikolov et al. [55], increasing the range (the context window) can enhance the quality of word vectors, but at the cost of increased computational complexity. The training complexity of the Skip-gram model can be expressed as:

$$Q = C \times \left( {D + D \times log_{2} \left( V \right)} \right)$$

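Here, \(C\) represents the maximum distance between words (the context window size), \(D\) is the dimensionality of the word vectors, and \(V\) is the size of the vocabulary. Note that in this expression \(D\) and \(V\) follow the notation of [55] and differ from the dictionary \(D\) and the word vector \(V\left(w\right)\) used above.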
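
The trade-off can be checked numerically with a short sketch. It assumes, as in the word2vec formulation, that \(C\) is the maximum word distance, \(D\) the vector dimensionality and \(V\) the vocabulary size; the function name and the example values are hypothetical.

```python
import math


def skipgram_complexity(C: int, D: int, V: int) -> float:
    """Per-word training complexity Q = C * (D + D * log2(V))."""
    return C * (D + D * math.log2(V))


# Hypothetical settings: window 5, 100-dimensional vectors, 1024-word vocabulary.
q_small_window = skipgram_complexity(C=5, D=100, V=1024)
q_large_window = skipgram_complexity(C=10, D=100, V=1024)
```

The cost grows linearly with the window size and the vector dimensionality but only logarithmically with the vocabulary size, which is why enlarging the corpus vocabulary is comparatively cheap per training word.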

Simplified graph convolutional networks

GCN is one type of neural network specifically developed to handle data with graph structure. In convolutional neural networks (CNN), convolution operation works by sliding a filter over the input image and applying a dot product between the filter and the input at each location [56]. This operation can be extended to graph data by defining a filter as a weight matrix that is shared across all nodes in the graph. The output of the convolution operation is then calculated by taking the dot product between the weight matrix and a node feature matrix, where the node feature matrix contains feature vectors for all nodes in the graph.

The graph-based neural network model is theoretically motivated by the layer-wise propagation rule below. Define

$${\text{S}} = \,\tilde{D}^{{ - \frac{1}{2}}} \tilde{A}\tilde{D}^{{ - \frac{1}{2}}}$$

Here, \(\widetilde{\text{A}}=A+{I}_{N}\), where \({I}_{N}\) is the identity matrix and \(A\) is adjacency matrix of graph G. \(\widetilde{\text{D}}\) is the degree matrix of \(\widetilde{\text{A}}\).
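
A minimal NumPy sketch of this normalization is given below. The function name and the three-node path graph are hypothetical, and a dense matrix is used for readability; a large graph would normally use a sparse representation.

```python
import numpy as np


def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """S = D~^(-1/2) (A + I) D~^(-1/2) for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    degree = A_tilde.sum(axis=1)          # diagonal entries of D~
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


# Hypothetical path graph 0 - 1 - 2
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
S = normalized_adjacency(A)
```

Because of the self-loops every node has degree at least one, so the inverse square root is always defined.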

The representation updating rule of the \(k\)-th layer is:

$${\text{H}}^{\left( k \right)} \leftarrow {\text{ReLU}}\left( {SH^{{\left( {k - 1} \right)}} \Theta^{\left( k \right)} } \right)$$

The weight matrix \({\Theta }^{\left(k\right)}\) is trainable and specific to each layer. \({\text{H}}^{\left(k\right)}\in {\mathbb{R}}^{N\times D}\) is the matrix of activations in the \(k\)-th layer, with \({\text{H}}^{\left(0\right)}=\text{X}\) being the input node feature matrix.

For binary classification, the predicted class \(\widehat{\mathbf{Y}}\) in a \(k\)-layer GCN can be expressed as:

$$\hat{Y}_{{{\text{GCN}}}} = sigmoid\left( {SH^{{\left( {k - 1} \right)}} \Theta^{\left( k \right)} } \right)$$

where \(sigmoid(x)=\frac{1}{1+{e}^{-x}}\) acts as a normalizer among two classes.
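
The forward pass defined by the two equations above can be sketched as follows. The function names, layer sizes and random weights are assumptions made for illustration; the toy matrix \(S\) corresponds to a fully connected three-node graph with self-loops.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gcn_forward(S, X, thetas):
    """k-layer GCN: ReLU after each hidden layer, sigmoid on the final layer."""
    H = X                                    # H^(0) = X
    for theta in thetas[:-1]:
        H = relu(S @ H @ theta)              # H^(k) <- ReLU(S H^(k-1) Theta^(k))
    return sigmoid(S @ H @ thetas[-1])       # Y_hat = sigmoid(S H^(k-1) Theta^(k))


rng = np.random.default_rng(0)
S = np.full((3, 3), 1.0 / 3.0)               # complete graph: A~ is all ones, degree 3
X = rng.normal(size=(3, 4))                  # 3 nodes, 4 hypothetical input features
thetas = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]
y_hat = gcn_forward(S, X, thetas)            # one interaction probability per node
```

In this fully connected toy graph every node averages over the same neighborhood, so all nodes receive the same prediction, a small-scale view of the over-smoothing effect discussed earlier.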

For a traditional multilayer perceptron (MLP), greater depth enhances expressivity by enabling the formation of feature hierarchies. For instance, features in the second layer build upon those of the first layer. Feature propagation is the key factor that sets a GCN apart from an MLP. In GCNs, layers serve an additional crucial role: at each layer, hidden node representations are obtained by averaging over neighbors situated one hop away. Consequently, after \(k\) layers, a node incorporates feature information from all nodes located \(k\) hops away in the graph. This effect resembles that of CNNs, where depth expands the receptive field of internal features [57]. While convolutional networks significantly benefit from increased depth [58], MLPs typically derive little advantage beyond 4 or 5 layers.

Our hypothesis suggests that GCN performs well on graph data for two key reasons. (1) local neighborhood aggregation: GCN can effectively capture the local neighborhood information of each node. By using the features of a node and its immediate neighbors for convolution, GCN aggregates information from surrounding nodes, incorporating their information into the representation of each node. (2) nonlinearity layer-wise propagation: GCN models typically employ nonlinearity layer-wise propagation, where each layer depends on the output of the previous layer. This layer-wise propagation effectively preserves and propagates information through the layers, allowing the model to gradually capture more extensive graph structural information and complex nonlinear relationships in the input data. These two parts can be executed separately. Figure 3 shows a schematic layout comparison between GCN and Simplified GCN (SGCN). SGCN eliminates the nonlinear transition functions in each layer, retaining only the final sigmoid to generate probabilistic outputs in a range of 0–1. The resultant model is linear, yet maintains the same increased receptive field characteristic of a \(k\)-layer GCN and can be freely combined with nonlinearity layer-wise propagation.

Fig. 3 Schematic layout comparison between GCN and SGCN. On the left, the GCN iteratively transforms feature vectors across \(K\) layers before employing a linear classifier on the final representation. In contrast, the right shows the SGCN, which simplifies the entire process to a single step of \(K\)-hop feature propagation

The predicted class \(\widehat{\text{Y}}\) in a \(k\)-layer SGCN can be written as:

$$\hat{Y}_{{{\text{SGCN}}}} = sigmoid\left( {{\text{S}}^{k} {\text{X}}\Theta } \right)$$

Equation 8 yields a straightforward and intuitive understanding of SGCN. By separating feature representation from classification, SGCN comprises a fixed, parameter-free feature smoothing component \(\widehat{\text{X}}={\text{S}}^{k}\text{X}\), followed by a linear logistic regression classifier \(\widehat{\text{Y}}=\mathit{sigmoid}\left(\widehat{\text{X}}\Theta \right)\). As the computation of \(\widehat{\text{X}}\) involves no weights \(\Theta\), it is essentially equivalent to a feature preprocessing step. Consequently, the entire model training process reduces to binary logistic regression on the transformed features \(\widehat{\text{X}}\). Because \(\widehat{\text{X}}\) is computed only once, SGCN scales well to very large graphs and trains significantly faster than GCNs. For an \(n\)-layer model, we use one or two nonlinear propagation (GCN) layers to capture complex nonlinear relationships, while the remaining layers directly perform \(k\)-layer local neighborhood aggregation via SGCN.
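
The two-stage view of SGCN can be illustrated with a short NumPy sketch: the smoothed features are precomputed once, and a logistic regression is then fitted on them. Everything here is a simplified assumption for illustration (hypothetical function names, a four-node toy graph, plain batch gradient descent without a bias term), not the training setup used in this study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sgcn_features(S, X, k):
    """Parameter-free smoothing X_hat = S^k X, computed as k successive products."""
    X_hat = X
    for _ in range(k):
        X_hat = S @ X_hat
    return X_hat


def fit_logistic(X_hat, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss (no bias term, for brevity)."""
    theta = np.zeros(X_hat.shape[1])
    for _ in range(epochs):
        p = sigmoid(X_hat @ theta)
        theta -= lr * X_hat.T @ (p - y) / len(y)
    return theta


# Toy graph (hypothetical): edges 0-1 and 2-3, already normalized with self-loops.
S = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
X = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])
y = np.array([1.0, 1.0, 0.0, 0.0])

X_hat = sgcn_features(S, X, k=2)     # computed once, before any training
theta = fit_logistic(X_hat, y)
y_prob = sigmoid(X_hat @ theta)      # predicted interaction probabilities
```

The same sketch also shows why removing the nonlinearities collapses the layers: for any weight matrices, \(S(SX{\Theta }^{(1)}){\Theta }^{(2)}={S}^{2}X({\Theta }^{(1)}{\Theta }^{(2)})\), so a stack of linear GCN layers is equivalent to one propagation step followed by a single weight matrix.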

To construct the graph for our task, we use compound-protein pairs as nodes. Each node has a 200-dimensional feature vector obtained using SPVec. The adjacency matrix is derived from the similarity between the node feature vectors. The label of each node indicates whether the compound and the protein in the pair interact. Table S3 shows detailed information about the hyperparameters and architecture of the SGCN model.
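
One possible way to build such a graph is sketched below. The text does not specify the similarity measure or how similarities are turned into edges, so the cosine similarity, the threshold of 0.9, the function names and the 100 + 100 split of the 200 dimensions are all assumptions made for illustration.

```python
import numpy as np


def pair_features(compound_vec, protein_vec):
    """Node feature of a compound-protein pair: concatenation of both vectors."""
    return np.concatenate([compound_vec, protein_vec])


def similarity_adjacency(X, threshold=0.9):
    """Connect two nodes when the cosine similarity of their features is high."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)     # guard against zero vectors
    similarity = X_unit @ X_unit.T
    A = (similarity >= threshold).astype(float)
    np.fill_diagonal(A, 0.0)                     # self-loops are added later via A + I
    return A


# Hypothetical node: 100-dimensional compound vector + 100-dimensional protein vector.
node = pair_features(np.ones(100), np.zeros(100))

# Hypothetical two-dimensional features for three nodes.
X = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
A = similarity_adjacency(X)
```

For very large numbers of pairs, a dense similarity matrix becomes impractical, and a k-nearest-neighbor search would be a common substitute for the all-pairs product used here.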

Model evaluation

The evaluation process serves as an important step in determining the overall effectiveness of the model and ensuring its future applicability. The metrics for evaluating the model performance include accuracy, precision, recall, F1 score, area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR). Each metric provides unique insights into different aspects of a model's performance: (1) Accuracy is the proportion of correctly classified instances among the total predictions. It is simple and easy to interpret but may not reflect class-specific performance and could overemphasize the majority class for imbalanced datasets. (2) Precision is useful when minimizing false positives is crucial; however, it ignores false negatives and thus does not provide a complete picture of model performance. (3) Recall emphasizes capturing all positive instances. There is a trade-off between recall and precision: increasing recall may decrease precision, and vice versa. (4) The F1 score balances precision and recall and is useful when both are important. It assumes equal importance of precision and recall and may not be suitable for all scenarios. (5) AUC is an important metric for binary classification and assesses model performance across various decision thresholds. (6) AUPR is particularly useful in cases where the positive class (or the class of interest) is rare, making precision and recall more informative than accuracy, but it may not be as interpretable as AUC. To ensure the stability of our proposed model, fivefold cross-validation (CV) was performed 10 times for the CPI task.
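
The threshold-based metrics above follow directly from the confusion matrix, and a small sketch shows why accuracy alone is misleading on imbalanced data. The function name and all counts are hypothetical; the 1:81 class ratio is borrowed from the PubChem dataset only to make the example concrete.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set with 100 positives and 8100 negatives (ratio 1:81).
# A trivial model that predicts "negative" for every pair:
all_negative = classification_metrics(tp=0, fp=0, tn=8100, fn=100)

# A hypothetical model that recovers 80 of the 100 positives with 20 false alarms:
useful_model = classification_metrics(tp=80, fp=20, tn=8080, fn=20)
```

The trivial model reaches an accuracy above 98% while its recall and F1 score are zero, which is why precision, recall, F1 and AUPR are reported alongside accuracy.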

Results and discussion

Performance of SPVec-SGCN-CPI model using six corpuses

Figure 4 shows the classification performance of the SPVec-SGCN-CPI model averaged over 10 runs on the two test sets. Although the model achieved similar evaluation metrics with the three corpuses on the training set (see Supplementary Figure S1), enlarging the corpus from Corpus_1 to Corpus_2 improved the performance of the SPVec-SGCN-CPI model on the two test sets. Corpus_2 is larger, encompassing a broader “vocabulary” and a more diverse range of “language” (i.e., protein sequence and SMILES) contexts, allowing the model to learn richer and more specific feature representations. Besides, a larger corpus provides more contextual information, enabling the model to better understand the meanings of words in different contexts. This helps in generating word embeddings that are more contextually sensitive and enhances the model's generalizability. The test sets contained new ‘words’ that the SPVec model could not represent, and these out-of-vocabulary “words” reduced the model's predictive ability. Compared to Corpus_1, replacing random vectors with averaged vectors of ‘words’ in Corpus_1_imp improved the model performance, indicating that this approach is one way to address the OOV problem. However, it is essential to note that model performance does not always improve linearly with corpus size. The SPVec-SGCN-CPI model achieved roughly equivalent performance on Corpus_2 and Corpus_3. This indicates that beyond a certain point, the marginal benefits of enlarging the corpus may diminish, while the computational requirements and training time increase. Moreover, Corpus_3 may contain noise or low-quality text, which could negatively impact the model. Corpus_3_imp, obtained by removing redundant compounds and proteins, did not show a decline in model performance. This provides an option for removing noisy data from biological datasets. 
Corpus_3 and Corpus_3_only both achieved excellent model performance with no significant difference. The embeddings of proteins or compounds are obtained by summing and averaging the vectors of their “words” (protein sequence and SMILES segments). We found that Corpus_3_only does not contain any new “words”. SPVec has learned the optimized vector representation of each “word” even without using the new proteins or new SMILES. Therefore, including test set data in the SPVec pre-training process does not artificially inflate the model’s performance and generalization capability.

Fig. 4 Classification performance of SPVec-SGCN-CPI model using six corpuses averaged over 10 runs on the BindingDB test set and PubChem test set

Performance of SPVec compared with SPBert and SPGPT for feature representation

To demonstrate the feature representation capabilities of SPVec, we compared it with state-of-the-art large language models. We utilized BERT-based pre-trained models, ChemBERTa-77M-MTR [59] and esm2_t33_650M_UR50D [60], for feature extraction; their combination is referred to as SPBert. Similarly, we employed GPT-based pre-trained models, ChemGPT-4.7M [61] and ProtGPT2 [62], and their combination is denoted as SPGPT. We repeated the model evaluation 10 times on the three datasets to reduce the influence of chance factors, thus improving the reliability of our assessment of the model’s performance. Figure 5 illustrates the averaged AUC and AUPR, while Figure S2 shows boxplots of AUC and AUPR over the 10 repetitions for the three feature representation methods on ChEMBL (fivefold cross-validation dataset), BindingDB (test set), and PubChem (test set). All models performed well on fivefold cross-validation and on the two independent test sets, suggesting a saturation of modeling performance based on biological embeddings generated by pre-trained language models. This is likely because the SGCN model's potent neighbor aggregation and topological graph representation capabilities for CPI information place relatively basic requirements on the input features, which all three methods can satisfy. Table 2 compares the three feature representation methods in terms of dimensionality, computation time, and memory consumption. For the CPI prediction task, word2vec might already suffice to capture word semantics. Given the higher dimensionality (\(d=1664\) for both SPBert and SPGPT) and the substantial computational resources required by SPBert and SPGPT, SPVec may be a more practical choice under resource constraints.

Fig. 5

AUC and AUPR of three different feature representation methods on ChEMBL training set, BindingDB test set and PubChem test set

Table 2 Comparison of three feature representation methods in dimensionality, computation time and memory consumption

Model performance and efficiency of different model combinations between SGCN and GCN

Performance. Table 3 shows the performance of nine model combinations of SGCN and GCN layers on the ChEMBL dataset. As the number of (S)GCN layers increases from 1 to 3, the performance metrics improve consistently, suggesting that a larger receptive field, that is, increased connectivity in the graph, benefits the model's ability to make accurate predictions. Figure S3 shows the fivefold cross-validation results averaged over 10 runs in the 1-hop to 4-hop settings. The same trend holds as the hop value increases from 1 to 3, whereas larger hop values yield little further improvement. We therefore chose hop = 3 for the subsequent experiments. When the model uses only three SGCN layers, it essentially performs a linear transformation, namely a 3-hop neighbor aggregation. Its performance is superior to that of the three-layer GCN, indicating that non-linear transformations between layers are unnecessary here. Table 4 shows the performance of the nine model combinations on two independent test sets. Despite achieving comparable results on the training set, the SGCN-only model significantly outperforms the GCN and GCN + SGCN combinations on the test sets, which suggests that SGCN generalizes better than GCN. Specifically, the performance of the GCN model decreases on both test sets, indicating potential overfitting during training. On the PubChem dataset, SGCN performs consistently well across all layer counts, while GCN's performance is relatively limited. Notably, when the model consists of two or more layers, substituting standard GCN layers with SGCN layers improves model performance.
In summary, SGCN demonstrates superior generalization capabilities compared to GCN, particularly on independent test sets, and replacing GCN layers with SGCN layers can enhance model performance, especially in deeper architectures.
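The difference between the two propagation schemes can be sketched in a few lines of NumPy. This is an illustrative sketch that follows the commonly used simplified graph convolution formulation (K applications of a normalized adjacency matrix with no non-linearity in between); the function names, the toy path graph, and the exact normalization are assumptions and may differ from the implementation used in this study.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2}, with self-loops added before normalizing."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sgc_features(adj, x, k):
    """K-hop linear aggregation S^K X: no non-linearity between hops."""
    s = normalized_adjacency(adj)
    out = x
    for _ in range(k):
        out = s @ out
    return out

def gcn_forward(adj, x, weights):
    """Standard GCN stack for comparison: ReLU(S H W) at every layer."""
    s = normalized_adjacency(adj)
    h = x
    for w in weights:
        h = np.maximum(s @ h @ w, 0.0)
    return h

# Toy path graph 0-1-2-3; one-hot features make S^K directly visible.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.eye(4)

z2 = sgc_features(adj, x, k=2)  # node 0 cannot yet see node 3
z3 = sgc_features(adj, x, k=3)  # node 0 now receives information from node 3
```

In this toy graph, node 3 is three edges away from node 0, so its features only reach node 0 when k = 3. The SGCN path contains no trainable weights inside the loop, which is why a single linear classifier can be applied afterwards.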

Table 3 Model performance of nine model combinations between SGCN and GCN on ChEMBL dataset
Table 4 Model performance of nine model combinations between SGCN and GCN on two independent test sets

Efficiency. Figure 6 illustrates the training time and AUC of the nine model combinations of SGCN and GCN layers. As the number of standard GCN layers grows, the training time increases, gradually diminishing the model’s efficiency. However, using SGCN for \(K\)-order neighbor aggregation while omitting the non-linear layer-wise propagation effectively reduces the model’s training time. For instance, compared with GCN + GCN + GCN, the SGCN + SGCN + SGCN structure reduces the training time by 72.23% while still achieving competitive performance. Consequently, substituting standard GCN layers with SGCN layers makes it possible to extend the depth of the network without neighbor explosion and to improve training speed while maintaining competitive performance.
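One reason a linear K-hop model can train faster is that the propagated features can be computed once, before training, so that each epoch only updates a linear classifier. The sketch below illustrates this under stated assumptions: the ring graph, the random features, the labels, and the plain gradient-descent logistic regression are all hypothetical stand-ins, not the training setup of this study.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def train_logistic(features, labels, lr=0.1, epochs=200):
    """Gradient-descent logistic regression on fixed, precomputed features."""
    n, d = features.shape
    w, b, losses = np.zeros(d), 0.0, []
    eps = 1e-12  # guards against log(0)
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        loss = -np.mean(labels * np.log(probs + eps)
                        + (1 - labels) * np.log(1 - probs + eps))
        losses.append(loss)
        grad = probs - labels
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b, losses

# Toy ring graph with 6 nodes (hypothetical data).
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = 1.0
    adj[(i + 1) % n, i] = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(n, 3))
labels = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# The graph is touched exactly once: S^3 X is precomputed before training.
propagated = np.linalg.matrix_power(normalized_adjacency(adj), 3) @ x
w, b, losses = train_logistic(propagated, labels)
```

In a standard GCN, by contrast, the sparse-dense products with the adjacency matrix sit inside every forward and backward pass, so their cost is paid at every epoch and grows with depth.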

Fig. 6

Training time and AUC of nine model combinations between SGCN and GCN

Performance of SPVec-SGCN model compared with ML- and DL-based models on BindingDB and PubChem test sets

On the BindingDB test set, the accuracy, precision, recall, F1-score, AUC and AUPR of the SPVec-SGCN model reached 0.9805, 0.9763, 0.9847, 0.9805, 0.9979 and 0.9979, respectively. On the PubChem test set, the corresponding values were 0.9948, 0.7754, 0.8262, 0.8000, 0.9875 and 0.8709. These results demonstrate strong predictive ability for CPI prediction tasks, especially on unbalanced data. We compared the SPVec-SGCN model with Gaussian Naïve Bayes (GNB), RF, GBDT and DNN to show its superiority in CPI prediction. Our proposed model outperforms these traditional ML and DL models on the training set (Table S4) and on the two test sets (Table 5). On the BindingDB test set, the AUC of our method is higher than those of GNB, RF, GBDT and DNN by 46.27, 14.11, 29.04 and 21.37%, respectively. All four ML and DL models failed to predict CPIs on the PubChem test set, probably because they are biased towards the majority class and struggle to identify the minority class. For instance, GBDT and DNN overlooked the minority samples and predicted almost all samples as negative. We attribute the performance of our model to its effective graph processing ability via adaptive neighbor feature aggregation. The results indicate that our model has learned robust patterns and features that are applicable across various data distributions, rather than being overfit to a specific dataset. These results support the generalizability and robustness of our model.
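The relationship between these metrics can be checked with a short worked example. The sketch below recomputes the F1-score from the precision and recall values reported above and then shows, on a hypothetical imbalanced toy set, why accuracy alone can hide a model that predicts every sample as negative. The toy labels are an assumption made for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def confusion_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F1) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)

# Precision and recall as reported in the text.
f1_bindingdb = f1_score(0.9763, 0.9847)
f1_pubchem = f1_score(0.7754, 0.8262)

# Hypothetical imbalanced set: 2 positives among 100 samples, all predicted negative.
y_true = [1] * 2 + [0] * 98
y_pred = [0] * 100
acc, prec, rec, f1 = confusion_metrics(y_true, y_pred)
```

Rounded to four decimals, the recomputed F1-scores agree with the reported 0.9805 and 0.8000. The all-negative predictor reaches 98% accuracy on the toy set while its recall and F1-score are zero, which is the failure mode described for GBDT and DNN on the PubChem test set.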

Table 5 Results of SPVec-SGCN model performance compared with machine learning- and deep learning-based models on BindingDB and PubChem test sets

Further experimentation with imbalanced datasets

In addition to compound-protein pairs with IC50 > 10,000 nM, we randomly selected a matching number of unknown compound-protein pairs (excluding all known CPIs) as negative samples [50, 51]. Figure 7 shows the performance of our model under different ratios of positive to negative samples on the ChEMBL training set, the BindingDB test set and the PubChem test set. As the proportion of negative samples increases, the AUC on the training set remains unchanged. AUPR is more sensitive to class imbalance because the balance between precision and recall shifts, which results in a slight decrease in AUPR. On the BindingDB and PubChem test sets, both AUC and AUPR show a decreasing trend as the proportion of negative samples increases. This indicates that the performance of the model over the entire sample space decreases slightly as negative samples are added. However, even at a positive-to-negative ratio of 1:5, both AUC and AUPR remain high, indicating that the model handles class imbalance well. The high AUPR also suggests a strong ability to identify positive instances (the minority class), meaning that the model can find true positives while maintaining a low misclassification rate.
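The random selection of unknown pairs as negatives can be sketched as follows. This is a minimal illustration under assumptions: the identifiers, the function name `sample_negatives`, and the exhaustive enumeration of candidate pairs are hypothetical, and enumeration is only practical for small toy inputs.

```python
import random

def sample_negatives(compounds, proteins, known_pairs, num_negatives, seed=0):
    """Draw unknown (compound, protein) pairs uniformly at random, excluding known pairs."""
    known = set(known_pairs)
    candidates = [(c, p) for c in compounds for p in proteins if (c, p) not in known]
    if num_negatives > len(candidates):
        raise ValueError("not enough unknown pairs to sample from")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no pair is selected twice.
    return rng.sample(candidates, num_negatives)

# Hypothetical toy data.
compounds = ["C1", "C2", "C3", "C4"]
proteins = ["P1", "P2", "P3"]
positives = [("C1", "P1"), ("C2", "P2")]

negatives_per_positive = 2  # corresponds to a positive-to-negative ratio of 1:2
negatives = sample_negatives(
    compounds, proteins, positives, negatives_per_positive * len(positives)
)
```

Changing `negatives_per_positive` reproduces the different class ratios discussed above. For large compound and protein sets, rejection sampling over random index pairs would avoid materializing every candidate pair.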

Fig. 7

Model performance under different proportions of positive and negative samples on ChEMBL training set, BindingDB test set and PubChem test set
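The different sensitivity of AUC and AUPR to class imbalance can be illustrated with a small worked example. The scores below are hypothetical, and the negatives are simply replicated five times to mimic a 1:5 ratio, which is a simplification: in practice additional negatives would receive their own scores. AUPR is approximated here by average precision.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_precision(pos_scores, neg_scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda item: (-item[0], item[1]),  # on ties, negatives rank first
    )
    tp = fp = 0
    precisions = []
    for _, label in ranked:
        if label == 1:
            tp += 1
            precisions.append(tp / (tp + fp))
        else:
            fp += 1
    return sum(precisions) / len(pos_scores)

# Hypothetical scores for 4 positives and 4 negatives.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]
neg_x5 = neg * 5  # same score distribution, five times as many negatives

auc_1to1, auc_1to5 = auc(pos, neg), auc(pos, neg_x5)
ap_1to1, ap_1to5 = average_precision(pos, neg), average_precision(pos, neg_x5)
```

Because AUC depends only on how positives rank relative to negatives, replicating the negatives leaves it unchanged. Average precision falls, since every negative ranked above a positive now contributes five false positives instead of one.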

Further experimentation with larger datasets and deeper SGCN

To better demonstrate the advantages of the SGCN model, we collected data from multiple sources to augment the dataset. Details of the augmented training set (named MultiSource) can be found in Table S5. After data cleaning, the MultiSource dataset contains 676,414 positive samples and 319,197 negative samples, a ratio of approximately 2:1. Deeper SGCN models were trained on this larger dataset. Figure 8 shows the performance of SGCN and GCN models with different numbers of layers on the MultiSource training set, the BindingDB test set and the PubChem test set, averaged over 10 runs. Note that data from the BindingDB and PubChem test sets were not involved in the model fine-tuning process.

Fig. 8

Model performance of SGCN and GCN models with different layers on MultiSource training set, BindingDB test set and PubChem test set averaged over 10 runs

As the number of layers in the SGCN and GCN models increases from 1 to 5, performance improves gradually on the MultiSource dataset. This suggests that adding layers helps the models learn more complex representations, thereby enhancing performance. The models reach their optimum at 5 layers, and further increases in the number of layers no longer improve performance. Across all layers, SGCN generally outperforms GCN in terms of both AUC and AUPR, indicating that SGCN is more effective on imbalanced training data. On the BindingDB dataset, the SGCN model exhibits a trend similar to that observed on the MultiSource dataset as the number of layers increases. However, the performance of the GCN model declines beyond 3 layers, indicating overfitting. On the PubChem dataset, SGCN performs well across all layers, while GCN's performance is relatively limited.

Performance of SPVec-SGCN model compared with six state-of-the-art models

To provide additional evidence of the efficacy of our proposed SPVec-SGCN-CPI approach, we conducted a comparative analysis with six published state-of-the-art methods. Brief descriptions of these methods follow. (1) PMFCPI [63] is a pre-trained multi-functional model for assessing drug selectivity. (2) GraphCPI [64] is a graph-based computational model for predicting potential CPIs. (3) STCPI [65] is a self-training model that augments negative samples. (4) GcForest [66] is an ensemble decision tree learning algorithm. (5) CCL-DTI [67] applies a contrastive loss to CPI prediction using a CNN. (6) SgCPI [68] is a heterogeneous sampled-subgraph neural network model with knowledge distillation. The results in Table 6 demonstrate that SPVec-SGCN-CPI outperforms these state-of-the-art methods on the performance evaluation metrics for the BindingDB and PubChem test sets. Although these methods achieved relatively high performance in the training phase (Table S6), our method achieves higher scores across all metrics on the BindingDB dataset. Its accuracy (0.9805), precision (0.9763), recall (0.9847), F1-score (0.9805), AUC (0.9979) and AUPR (0.9979) are higher than those of the second-best model (i.e., PMFCPI) by 19.31, 19.21, 15.83, 19.08, 11.35 and 10.34%, respectively. Our method also performs best on the PubChem dataset, with the highest AUPR (0.8709), which is 20.22% higher than that of STCPI. PMFCPI, GcForest and GraphCPI achieved notably lower precision scores, showing a poor ability to predict unbalanced data. These results suggest that the SPVec-SGCN model is an effective approach for improving the accuracy of CPI prediction. SPVec-SGCN combines beneficial similarity features to build a homogeneous network, thereby maximizing the utility of the available information through the aggregation of neighborhood data.
In addition, SPVec-SGCN uses a three-layer simplified GCN model to learn from the graph structure, which offers advantages in terms of accuracy, scalability and training speed.

Table 6 Classification results of SPVec-SGCNs model compared with four state-of-the-art models on BindingDB and PubChem test sets

Prediction and validation of unidentified CPIs

To further validate the CPI prediction ability of the SPVec-SGCN model, we scored all the unlabeled CPIs in the ChEMBL dataset. Table S7 lists the top 30 predicted CPIs. Specifically, we examined the top five ranked CPIs by molecular docking (MD) and cross-referenced them with external supporting evidence from relevant databases and the biomedical literature. Figure 9 shows the interactions of the top five ranked compound-protein pairs obtained via MD, and Table S8 lists the positions, bond types, distances and energy values of these interactions, which are important for understanding the mechanisms of CPIs. All five compound-protein pairs exhibit various interactions, such as hydrogen bonds and van der Waals forces. Table 7 shows detailed information and external supporting evidence for the top five ranked compound-protein pairs. Except for the third pair, all of them have very low Ki or IC50 values, which is consistent with the results of the molecular simulations and demonstrates a strong interaction between compound and target protein. The third compound-protein pair has a higher IC50 value because our training and modeling process is based on the premise that similar compounds and protein targets are related and that similar compound-protein pairs tend to have similar interactions, as described above. Protein-related biological processes are complex, however, and do not fully conform to this assumption. We examined the training data and found a very strong interaction (IC50: 0.860 nM) between the third-ranked molecule and another protein that is highly similar (sequence identity: 99%) to the predicted target (Integrin alpha-4/beta-7), which was responsible for the high prediction score. Overall, these results suggest that the SPVec-SGCN-CPI model is highly effective in predicting novel CPIs and has important potential in drug discovery and development.
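The scoring-and-ranking step described above can be sketched as a top-k selection over unlabeled pairs. The scores, pair identifiers, and the function name `top_k_pairs` are hypothetical; in practice the scores would come from the trained model.

```python
import heapq

def top_k_pairs(scores, labeled_pairs, k=5):
    """Return the k highest-scoring pairs that are not already labeled."""
    unlabeled = ((pair, s) for pair, s in scores.items() if pair not in labeled_pairs)
    return heapq.nlargest(k, unlabeled, key=lambda item: item[1])

# Hypothetical model scores for (compound, protein) pairs.
scores = {
    ("C1", "P1"): 0.99,  # already labeled, so it is excluded from the ranking
    ("C1", "P2"): 0.95,
    ("C2", "P1"): 0.40,
    ("C2", "P2"): 0.90,
    ("C3", "P1"): 0.85,
}
labeled = {("C1", "P1")}

top = top_k_pairs(scores, labeled, k=2)
```

`heapq.nlargest` avoids sorting the full list of candidates, which matters when the number of unlabeled pairs is large and only a short list is carried forward to docking.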

Fig. 9

Interactions of top five ranked compound-protein pairs predicted by SPVec-SGCN model via molecular docking

Table 7 Detailed information and external supporting evidence of top five ranked compound-protein pairs predicted by SPVec-SGCN model


In this study, we proposed an end-to-end approach, named SPVec-SGCN-CPI, which uses a simplified GCN model with low-dimensional, continuous features generated by the SPVec model together with graph topology information to predict CPIs. The SGCN technique separates the local neighborhood aggregation step from the non-linear layer-wise propagation step to aggregate \(K\)-order neighbor information effectively while avoiding neighbor explosion and accelerating training. This makes the method more effective than traditional methods that cannot handle such complexity. We evaluated the performance of the SPVec-SGCN-CPI method on three datasets and compared it with classic ML and DL methods such as GNB, GBDT, RF and DNN, as well as advanced CPI prediction methods, including PMFCPI, GraphCPI, STCPI and GcForest. The classification results showed that SPVec-SGCN-CPI outperformed all these methods in terms of prediction accuracy, especially on unbalanced data. SPVec-SGCN-CPI propagates node features and topological information to the feature space, which enables the method to take interactions among CPIs into account for the fusion of heterogeneity. All unlabeled data in ChEMBL were scored using our method, and the top five ranked CPIs were confirmed by molecular docking and existing evidence. The results suggest that our model can discover reliable CPIs among unlabeled compound-protein pairs, which has significant implications for drug re-profiling and drug discovery. Overall, SPVec-SGCN has demonstrated a superior ability to predict CPIs. This method has great potential to contribute to the identification of new targets and to improve the efficiency of drug discovery.

Availability of data and materials

All source codes are available in the GitHub repository Furthermore, the raw data used in this work can be downloaded from The processed datasets and features extracted by our method can be downloaded from (password: aUPW).



CPI: Compound-protein interaction
ML: Machine learning
DL: Deep learning
RF: Random Forest
SVM: Support Vector Machine
DNN: Deep Neural Network
GBDT: Gradient Boosting Decision Tree
GCN: Graph convolutional network
CNN: Convolutional neural network
MLP: Multilayer perceptron
SGCN: Simplified Graph convolutional network
AUC: Area under receiver operating characteristic curve
AUPR: Area under the precision-recall curve
GNB: Gaussian Naïve Bayes
MD: Molecular docking


  1. Gupta R, Srivastava D, Sahu M, Tiwari S, Ambasta RK, Kumar P (2021) Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers 25:1315–1360

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Schenone M, Dančík V, Wagner BK, Clemons PA (2013) Target identification and mechanism of action in chemical biology and drug discovery. Nat Chem Biol 9(4):232–240

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. DiMasi JA, Hansen RW, Grabowski HG (2003) The price of innovation: new estimates of drug development costs. J Health Econ 22(2):151–185

    Article  PubMed  Google Scholar 

  4. Belgiu M, Drăguţ L (2016) Random forest in remote sensing: a review of applications and future directions. ISPRS J Photogramm Remote Sens 114:24–31

    Article  Google Scholar 

  5. Noble WS (2006) What is a support vector machine? Nat Biotechnol 24(12):1565–1567

    Article  CAS  PubMed  Google Scholar 

  6. Sze V, Chen Y-H, Yang T-J, Emer JS (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329

    Article  Google Scholar 

  7. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W et al (2017) Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30(1):3149–3157

    Google Scholar 

  8. Tsubaki M, Tomii K, Sese J (2019) Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35(2):309–318

    Article  CAS  PubMed  Google Scholar 

  9. Chen L, Tan X, Wang D, Zhong F, Liu X, Yang T et al (2020) TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36(16):4406–4414

    Article  CAS  PubMed  Google Scholar 

  10. Li S, Wan F, Shu H, Jiang T, Zhao D, Zeng J (2020) MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Syst 10(4):308–322

    Article  CAS  Google Scholar 

  11. Rifaioglu AS, Nalbat E, Atalay V, Martin MJ, Cetin-Atalay R, Doğan T (2020) DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem Sci 11(9):2531–2557

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Väth P, Münch M, Raab C, Schleif F-M (2022) PROVAL: a framework for comparison of protein sequence embeddings. J Comput Math 3:100044

    Google Scholar 

  13. Lambard G, Gracheva E (2020) SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors. Mach Learn: Sci Technol 1(2):025004

    Google Scholar 

  14. Di Gennaro G, Buonanno A, Palmieri FA (2021) Considerations about learning Word2Vec. J Supercomput 23(2):1–16

    Google Scholar 

  15. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38(8):2102–2110.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Wu Z, Jiang D, Wang J, Zhang X, Du H, Pan L et al (2022) Knowledge-based BERT: a method to extract molecular features like computational chemists. Brief Bioinform.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Villegas-Morcillo A, Gomez AM, Sanchez V (2022) An analysis of protein language model embeddings for fold prediction. Brief Bioinform 23(3):142

    Article  Google Scholar 

  18. Zhang S, Tong H, Xu J, Maciejewski R (2019) Graph convolutional networks: a comprehensive review. Comput Soc Netw 6(1):1–23

    Article  Google Scholar 

  19. Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q et al (2020) Drug–target affinity prediction using graph neural network and contact maps. RSC Adv 10(35):20701–20712

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Nguyen T, Le H, Quinn TP, Nguyen T, Le TD, Venkatesh S (2021) GraphDTA: predicting drug-target binding affinity with graph neural networks. Bioinformatics 7(8):1140–1147.

    Article  CAS  Google Scholar 

  21. Jiang L, Sun J, Wang Y, Ning Q, Luo N, Yin M (2022) Identifying drug–target interactions via heterogeneous graph attention networks combined with cross-modal similarities. Brief Bioinform 23(2):016.

    Article  CAS  Google Scholar 

  22. Lim J, Ryu S, Park K, Choe YJ, Ham J, Kim WY (2019) Predicting drug-target interaction using a novel graph neural network with 3D structure-embedded graph representation. J Chem Inf Model 59(9):3981–3988.

    Article  CAS  PubMed  Google Scholar 

  23. Purkayastha S, Mondal I, Sarkar S, Goyal P, Pillai JK (2019) Drug-Drug Interactions Prediction Based on Drug Embedding and Graph Auto-Encoder. Paper presented at 19th international conference on bioinformatics and bioengineering, Athens, Greece, 28–30 Octobor 2019.

  24. Xiong W, Li F, Yu H, Ji D (2019) Extracting Drug-drug Interactions with a Dependency-based Graph Convolution Neural Network. Paper presented at 19th international conference on bioinformatics and bioengineering, Athens, Greece, 28–30 Octobor 2019.

  25. Zhang Y, Chen L, Li S (2022) CIPHER-SC: disease-gene association inference using graph convolution on a context-aware network with single-cell data. IEEE/ACM Trans Comput Biol Bioinform 19(2):819–829.

    Article  CAS  PubMed  Google Scholar 

  26. Li C, Liu H, Hu Q, Que J, Yao J (2019) a novel computational model for predicting microRNA-disease associations based on heterogeneous graph convolutional networks. Cells.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Rao A, Vg S, Joseph T, Kotte S, Sivadasan N, Srinivasan R (2018) Phenotype-driven gene prioritization for rare diseases using graph convolution on heterogeneous networks. BMC Med Genomics 11(1):57.

    Article  PubMed  PubMed Central  Google Scholar 

  28. Zhang T, Shan HR, Little MA (2022) Causal GraphSAGE: a robust graph method for classification based on causal sampling. Pattern Recogn 128:108696.

    Article  Google Scholar 

  29. Ying R, He R, Chen K, Eksombatchai P, Hamilton WL, Leskovec J. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. Paper presented at proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, London, United Kingdom, 19–23 August 2018.

  30. Chen J, Zhu J, Song L (2018) Stochastic Training of Graph Convolutional Networks with Variance Reduction. Paper presented at 35th international conference on machine learning, Stockholmsmässan, Stockholm, 10–15 July 2018.

  31. Chen J, Ma T, Xiao C (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint.

    Article  Google Scholar 

  32. Zhang C, Li QC, Song DW (2019) Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks. Paper presented at proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, Hong Kong, China, 4 November 2019.

  33. Hamilton WL, Ying R, Leskovec J (2017) Inductive representation learning on large graphs. Paper presented at proceedings of the 31st international conference on neural information processing systems, Long Beach, California, 4–7 December 2017.

  34. Li C, Yang Y, Feng M, Chakradhar S, Zhou H (2016) Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs. Paper presented at SC '16: proceedings of the international conference for high performance computing, networking, storage and analysis, Salt Lake City, Utah, 13–18 November 2016.

  35. Du B-X, Qin Y, Jiang Y-F, Xu Y, Yiu S-M, Yu H et al (2022) Compound–protein interaction prediction by deep learning: databases, descriptors and models. Drug Discov Today 27(5):1350–1366

    Article  CAS  PubMed  Google Scholar 

  36. Tian K, Shao M, Wang Y, Guan J, Zhou S (2016) Boosting compound-protein interaction prediction by deep learning. Methods 110:64–72

    Article  CAS  PubMed  Google Scholar 

  37. Liu H, Sun J, Guan J, Zheng J, Zhou S (2015) Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12):i221–i229

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Putthiporn T, Chidchanok L (2013) Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques. Pattern Recognit Lett 34(12):1339–1347

    Article  Google Scholar 

  39. Zhang Y-F, Wang X, Kaushik AC, Chu Y, Shan X, Zhao M-Z et al (2020) SPVec: a Word2vec-inspired feature representation method for drug-target interaction prediction. Front Chem 7:895

    Article  PubMed  PubMed Central  Google Scholar 

  40. Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V (2019) Accurate, efficient and scalable graph embedding. Paper presented at 2019 IEEE international parallel and distributed processing symposium, Rio de Janeiro, Brazil, 20–24 May 2019.

  41. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40(D1):D1100–D1107

    Article  CAS  PubMed  Google Scholar 

  42. Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucl Acids Res 44(D1):D1045–D1053

    Article  CAS  PubMed  Google Scholar 

  43. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47(D1):D1102–D1109

    Article  PubMed  Google Scholar 

  44. Cer RZ, Mudunuri U, Stephens R, Lebeda FJ (2009) IC50-to-Ki: a web-based tool for converting IC50 to Ki values for inhibitors of enzyme activity and ligand binding. Nucl Acids Res 37:W441-445

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Zheng Y, Tang P, Qiu W, Wang H, Guo J, Huang Z (2023) A novel deep learning framework for interpretable drug-target interaction prediction with attention and multi-task mechanism. In: Wang X, Sapino ML, Han W-S, El Abbadi A, Dobbie G, Feng Z, Shao Y, Yin H (eds) Database systems for advanced applications. Springer Nature Switzerland, Cham

    Google Scholar 

  46. Wang Z, Liang L, Yin Z, Lin J (2016) Improving chemical similarity ensemble approach in target prediction. J Cheminform 8(1):20

    Article  PubMed  PubMed Central  Google Scholar 

  47. Chatterjee A, Walters R, Shafi Z, Ahmed OS, Sebek M, Gysi D et al (2023) Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat Commun 14(1):1989

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Lim J, Ryu S, Park K, Choe YJ, Ham J, Kim WY (2019) Predicting drug-target interaction using a novel graph neural network with 3D structure-embedded graph representation. J Chem Inf Model 59(9):3981–3988

    Article  CAS  PubMed  Google Scholar 

  49. Torng W, Altman RB (2019) Graph convolutional neural networks for predicting drug-target interactions. J Chem Inf Model 59(10):4131–4149

    Article  CAS  PubMed  Google Scholar 

  50. Zeng X, Zhu S, Hou Y, Zhang P, Li L, Li J et al (2020) Network-based prediction of drug-target interactions using an arbitrary-order proximity embedded deep forest. Bioinformatics 36(9):2805–2812

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y et al (2020) Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci 11(7):1775–1797

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Lee I, Keum J, Nam H (2019) DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol 15(6):e1007129.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Huang Y, Huang H-Y, Chen Y, Lin Y-C-D, Yao L, Lin T et al (2023) A robust drug-target interaction prediction framework with capsule network and transfer learning. Int J Mol Sci 24(18):14061

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Gao M, Zhang D, Chen Y, Zhang Y, Wang Z, Wang X et al (2024) GraphormerDTI: a graph transformer-based approach for drug-target interaction prediction. Comput Biol Med 173:108339

    Article  CAS  PubMed  Google Scholar 

  55. Bražinskas A, Havrylov S, Titov I (2017) Embedding words as distributions with a Bayesian skip-gram model. arXiv preprint.

    Article  Google Scholar 

  56. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint.

    Article  Google Scholar 

  57. Ghiasi-Shirazi K (2019) Generalizing the convolution operator in convolutional neural networks. Neural Process Lett 50(3):2627–2646

    Article  Google Scholar 

  58. Hariharan B, Arbeláez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. Paper presented at 2015 IEEE conference on computer vision and pattern recognition, Boston, Massachusetts, 7–12 June 2015.

  59. Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ (2016) Deep networks with stochastic depth. Paper presented at computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, 11–14 October 2016.

  60. Lang AS, Chong WK, Wörner JH (2023) Fine-tuning ChemBERTa-2 for aqueous solubility prediction. Ann Chem Sci Res 4:1–3.

    Article  Google Scholar 

  61. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W et al (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637):1123–1130.

    Article  CAS  PubMed  Google Scholar 

  62. Frey NC, Soklaski R, Axelrod S et al (2023) Neural scaling of deep chemical models. Nat Mach Intell 5:1297–1305.

    Article  Google Scholar 

  63. Ferruz N, Schmidt S, Höcker B (2022) ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13:4348.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Song N, Dong R, Pu Y et al (2023) PMF-CPI: assessing drug selectivity with a pretrained multi-functional model for compound–protein interactions. J Cheminform 15:97.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Quan Z, Guo Y, Lin X,Wang Z-Y, Zeng X (2019) GraphCPI: Graph Neural Representation Learning for Compound-Protein Interaction. Paper presented at 2019 IEEE international conference on bioinformatics and biomedicine, San Diego, California, 18–21 November 2019.

  66. Chen S, Wang X, Guo H, Xie P, Wang J, Hao X (2020) A conditional probability interpolation method based on a space-time cube for MODIS snow cover products gap filling. Remote Sens 12(21):3577.

    Article  Google Scholar 

  67. Shan W, Chen L, Xu H, Zhong Q, Xu Y et al (2023) GcForest-based compound-protein interaction prediction model and its application in discovering small-molecule drugs targeting CD47. Front Chem 11:1292869.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  68. Palhamkhani F, Alipour M, Dehnad A, Abbasi K, Razzaghi P, Ghasemi JB (2023) DeepCompoundNet: enhancing compound-protein interaction prediction with multimodal convolutional neural networks. J Biomol Struct Dyn 12:1–10.

    Article  CAS  Google Scholar 

  69. Dehghan A, Abbasi K, Razzaghi P (2024) CCL-DTI: contributing the contrastive loss in drug–target interaction prediction. BMC Bioinform 25:48.

    Article  Google Scholar 

  70. Chang LL, Truong Q, Mumford RA, Egger LA, Kidambi U, Lyons K et al (2002) The discovery of small molecule carbamates as potent dual α4β1/α4β7 integrin antagonists. Bioorg Med Chem Lett 12(2):159–163

    Article  CAS  PubMed  Google Scholar 

  71. Johnson TW, Richardson PF, Bailey S, Brooun A, Burke BJ, Collins MR et al (2014) Discovery of (10 R)-7-Amino-12-fluoro-2, 10, 16-trimethyl-15-oxo-10, 15, 16, 17-tetrahydro-2H-8, 4-(metheno) pyrazolo [4, 3-h][2,5,11]-benzoxadiazacyclotetradecine-3-carbonitrile (PF-06463922), a macrocyclic inhibitor of anaplastic lymphoma kinase (ALK) and c-ros oncogene 1 (ROS1) with preclinical brain exposure and broad-spectrum potency against ALK-resistant mutations. J Med Chem 57(11):4720–4744


  72. Kopka IE, Young DN, Lin LS, Mumford RA, Magriotis PA, MacCoss M et al (2002) Substituted N-(3,5-dichlorobenzenesulfonyl)-L-prolyl-phenylalanine analogues as potent VLA-4 antagonists. Bioorg Med Chem Lett 15(18):6043–6053.


  73. Parai MK, Huggins DJ, Cao H, Nalam MN, Ali A, Schiffer CA et al (2012) Design, synthesis, and biological and structural evaluations of novel HIV-1 protease inhibitors to combat drug resistance. J Med Chem 55(14):6328–6341.


  74. Liu H, Altenbach RJ, Diaz GJ, Manelli AM, Martin RL, Miller TR et al (2010) In vitro studies on a class of quinoline containing histamine H3 antagonists. Bioorg Med Chem Lett 20(11):3295–3300.




The authors would like to thank Prof. Wei for hosting us in his research group, as well as Prof. Xiong for his fruitful feedback and discussions.


This work was supported by the National Science Foundation of China (Grant Nos. 32070662, 61832019, 32030063), the Intergovernmental International Scientific and Technological Innovation and Cooperation Program of the National Key R&D Program (2023YFE0199200), and the Joint Research Funds for Medical and Engineering and Scientific Research at Shanghai Jiao Tong University (YG2021ZD02). The computations were partially performed at the Pengcheng Lab. and the Center for High-Performance Computing, Shanghai Jiao Tong University.

Author information




YZ conducted the experiments and wrote the original draft. JL and JZ conceived the molecular docking experiments. SL collected the data. YX revised the manuscript and D-QW reviewed it. All authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Yi Xiong or Dong-Qing Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit the Creative Commons website. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Zhang, Y., Li, J., Lin, S. et al. An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model. J Cheminform 16, 67 (2024).
