
PINNED: identifying characteristics of druggable human proteins using an interpretable neural network

Abstract

The identification of human proteins that are amenable to pharmacologic modulation without significant off-target effects remains an important unsolved challenge. Computational methods have been devised to identify features which distinguish between “druggable” and “undruggable” proteins, finding that protein sequence, tissue and cellular localization, biological role, and position in the protein–protein interaction network are all important discriminant factors. However, many prior efforts to automate the assessment of protein druggability suffer from low performance or poor interpretability. We developed a neural network-based machine learning model capable of generating druggability sub-scores based on each of four distinct categories, combining them to form an overall druggability score. The model achieves excellent performance in separating drugged and undrugged proteins in the human proteome, with an area under the receiver operating characteristic curve (AUC) of 0.95. Our use of multiple sub-scores allows the assessment of potential protein targets of interest based on distinct contributors to druggability, leading to a more interpretable and holistic model to identify novel targets.

Introduction

The cost of developing new therapeutic drugs has risen significantly in recent years, with the average R&D (Research & Development) cost per new drug ranging between $314 million and $2.8 billion [41]. Most of this expense is incurred in the clinical phase, where trial compounds primarily fail due to a poor understanding of the disease process leading to lack of efficacy or toxicity caused either intrinsically by the actual protein being targeted or extrinsically by off-target effects on other proteins [19]. Determining whether a prospective target protein is “druggable,” however, is a complex problem without a clearly understood solution. This can lead to a considerable amount of trial and error in the drug development process. A successful method for pre-screening prospective target proteins for druggability could save billions of dollars per year and increase the number of lifesaving drugs reaching the market.

Druggability is a poorly defined term; it can be used narrowly to refer only to a protein’s ability to bind an activity-modifying small molecule ligand, or more broadly to refer to a protein’s relevance as a therapeutic target in human disease. For this paper’s purposes, druggability encompasses the ability of a protein’s activity to be modulated for pharmacologic effect by a drug which gains regulatory approval. Undruggable proteins are those that cannot be influenced for therapeutic benefit, either because they lack disease relevance, are biologically essential, or cannot be targeted through any known drug modality. Throughout our work we will use these definitions.

Recent advances in machine learning offer the potential for in silico feature identification of druggable proteins. This can facilitate computational evaluation of prospective targets prior to the initiation of expensive clinical trials. A variety of efforts in this area have taken different approaches, incorporating different predictors of druggability into their feature sets. Several groups have sought to use solely properties derived from the primary protein sequence, achieving impressive results in distinguishing drugged proteins from a select subset of difficult-to-drug proteins [17, 21, 26, 28, 35]. However, it is unclear how well these models generalize to the entire proteome. Others have analyzed the position of drugged proteins in the protein–protein interaction network to identify common features [14, 27, 29, 40, 43, 46]. Although these models successfully extracted network properties of drugged proteins, their effectiveness is undermined by the lack of information about the proteins’ own properties, which may make them difficult to target chemically.

Given the wide variety of features which may determine whether a protein can be categorized as druggable, it is likely that the most successful approach will incorporate a comprehensive range of properties, including physical and chemical attributes, expression profile, biological functions, and protein–protein interactions. Successful machine learning efforts in this area have utilized features from several of these domains [3, 5, 10, 15, 43]. A 2020 study by Dezső and Ceccarelli focused exclusively on proteins that were targeted by oncology drugs, generating a feature set including a wide variety of chemical, expression, biological function, and network properties. Using a random forest-based model, cancer drug targets were capable of being distinguished from the remainder of the proteome with an AUC of 0.89 [13]. We utilized this feature set and augmented it with additional protein attributes to build a classifier for property identification of drugged human proteins.

To our knowledge, all previously published machine learning models are trained to discriminate druggable from undruggable proteins with a single druggability score or binary classification. This approach lacks interpretability and wholeness, particularly when many distinct types of features are specifically and uniquely contributing to druggability. For example, a protein’s position in the protein–protein interaction network may have major implications for potential off-target effects during clinical trials but does not demonstrate its structural amenability to small molecule modulation. A classifier that separates distinct features into sub-scores prior to obtaining a total druggability score could output multiple types of pertinent information about whether a protein is druggable or undruggable. We created the Predictive Interpretable Neural Network for Druggability (PINNED), a deep learning model which divides its inputs into four distinct groups—sequence and structure, localization, biological functions, and network information—and generates interpretable sub-scores that contribute to a final druggability score.

Results

Many factors influence a protein’s druggability, including its effectiveness as a disease-modifying target and its propensity for causing undesired side-effects. A protein’s physical and chemical properties, such as amino acid composition, secondary structure, post-translational modification, and others, can determine whether it can be readily liganded by a drug-like molecule. Its position in the complex network of protein–protein interactions which occur within the human body can influence its role in disease and its potential for off-target effects. The biological function of a protein plays a significant role in whether it is a useful drug target; however, many proteins are involved in multiple different processes, and disturbance of any of these can have unanticipated consequences for homeostasis, leading to off-target effects. Additionally, a protein’s expression profile across target and non-target tissues can have implications for its efficacy and safety.

To incorporate all these contributions to druggability, we generated a feature set that contains a variety of data for 20,404 human proteins, including properties extracted from the protein sequence, tissue specificity, subcellular localization, biological functions, and position in the protein–protein interaction network [13]. The features were divided into four feature groups: sequence and structure, localization, biological functions, and network information. Each category was then augmented with additional features obtained from the protein sequence, Gene Ontology (GO) knowledgebase [1], and the protein’s 3-dimensional structure as estimated by the artificial intelligence system AlphaFold [23] (Table 1).

Table 1 All features used to train the model, divided into the four feature groups

Sequence and structure properties

Sequence and structure properties included information about 52 physicochemical features, such as protein molecular weight and amino acid residues, charge and isoelectric points, extinction coefficients, predicted post-translational modifications, secondary structure, and solvent accessibility. Previous works indicate that the grouped dipeptide composition (GDPC) and pseudo amino acid composition (PAAC) of a protein may be useful characteristics in determining its druggability [17, 28, 35]. GDPC represents the relative composition of all the amino acid 2-mers in a protein’s sequence, with the 20 amino acids being reduced to five groups according to their physical properties. PAAC is an algorithm designed to reduce the sequence characteristics of a protein to a defined-length vector while incorporating information about its sequence order [9]. GDPC and PAAC were generated for each of the proteins in our dataset and included in the sequence and structural properties.
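The GDPC calculation described above can be illustrated with a minimal sketch. The five-group assignment below follows the grouping commonly used by the iFeature toolkit and is an assumption here, since the text does not list the groups; the function name `gdpc` and the example sequence are hypothetical.

```python
from itertools import product

# Assumed five-group reduction of the 20 amino acids (not specified in the text).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def gdpc(sequence: str) -> dict[str, float]:
    """Return the 25 grouped dipeptide frequencies for one protein sequence."""
    # Selenocysteine (U) is mapped to cysteine (C), as described in the methods.
    seq = sequence.upper().replace("U", "C")
    counts = {f"{a}.{b}": 0 for a, b in product(GROUPS, repeat=2)}
    total = 0
    # Slide over every adjacent residue pair (2-mer) in the sequence.
    for first, second in zip(seq, seq[1:]):
        if first in RESIDUE_TO_GROUP and second in RESIDUE_TO_GROUP:
            counts[f"{RESIDUE_TO_GROUP[first]}.{RESIDUE_TO_GROUP[second]}"] += 1
            total += 1
    # Normalise counts to relative frequencies.
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Hypothetical six-residue peptide, used only for illustration.
features = gdpc("MKDEFW")
```

Each protein, regardless of length, is reduced to the same 25-element vector, which is what makes the descriptor usable as a fixed-size network input.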

AlphaFold is a deep learning network developed by DeepMind that can predict a protein’s three-dimensional structure from its amino acid sequence. The AlphaFold Protein Structure Database was established by DeepMind and EMBL-EBI [11]. This freely available database contains predicted structure models for the human proteome in UniProt. Fpocket is an open-source software package that automatically detects pockets in a protein’s 3-dimensional structure and provides descriptors for them [25]. It enables the identification of potential drug binding sites and reports relevant properties for each pocket detected. The pockets are ranked according to their predicted ability to bind small molecules. AlphaFold models of each protein were collected from the AlphaFold database, and Fpocket was used to identify druggable and undruggable cavities in these predicted structures and to generate pocket information.

Localization

The Subcellular Localization Predictive System (CELLO) was used to predict subcellular localization for each protein in the dataset [44]. We included this prediction, in addition to tissue specificity data obtained from the Genotype-Tissue Expression (GTEx) and the Human Protein Atlas (HPA) [18, 38]. The GO Knowledgebase was used to retrieve Cellular Component annotations for each protein. These labels are manually assigned based on published literature and represent the cellular structures in which the protein performs its functions.

Biological functions

Gene essentiality, assessed by lethality of mouse homozygous loss-of-function mutations [16], and enzyme classifications, obtained from the Swiss-Prot database [2], were included in the biological functions score. Scores were generated by Dezső et al. for each gene ontology in the MetaCore database based on their 102-protein target enrichment set of cancer drugs. The highest three ontology scores in the categories “Biological Functions,” “Molecular Process,” and “Maps” (signaling pathways) were included in each protein’s feature set. “Biological Functions” and “Molecular Process” were used as inputs to the “biological functions” sub-score, while “Maps” was included in the “network information” sub-score (see below). It should be noted that the Biological Functions score generated by Dezső et al. represents only one feature input into the biological functions network.

Network information

The signaling pathways (“Maps”) score generated by Dezső et al. was included in the network information features. Degree, closeness, betweenness, eigen centrality, and PageRank of each protein in the protein–protein interaction network were calculated using information from the STRING database [37]. These features were incorporated into the network information input.
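As a rough illustration of two of the centrality measures named above, the sketch below computes degree centrality and PageRank on a toy undirected graph in plain Python. The five-node graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration only; the text does not state which software or parameters were used on the STRING network.

```python
def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank for an undirected graph with no isolated nodes."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        new_rank = {}
        for node in adj:
            # Each neighbour shares its current rank equally among its own edges.
            incoming = sum(rank[nbr] / len(adj[nbr]) for nbr in adj[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical interaction network: A is a hub, E is peripheral.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}
degree = degree_centrality(toy_network)
rank = pagerank(toy_network)
```

In this toy graph the hub node receives both the highest degree and the highest PageRank, which mirrors the intuition that centrally placed proteins have a larger influence on the network.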

Fig. 1

Design of the PINNED model and dataset. A Division of the data into training, validation, and test sets. B PINNED architecture including the four constituent subnetworks

Protein set

The National Center for Biotechnology Information (NCBI) Pharos database, a data repository of human protein properties and drugged status, identifies proteins as confirmed drug targets if they are “protein drug targets via which approved drugs act” (“Tclin”) [34]. As of October 2022, 704 of the 20,412 proteins in Pharos are categorized as Tclin.

All other proteins are classified as one of three other categories: undrugged proteins which bind small molecules with high potency (“Tchem”), proteins with well-studied biology (“Tbio”), and proteins not meeting the criteria for any of the other categories (“Tdark”) [34]. Of these proteins, 19,873 were represented in both Dezső et al.’s dataset and the AlphaFold database, including 696 of the 704 Tclin proteins in Pharos. We used the 696 Tclin proteins as our positive “drugged” set, and the remaining 19,177 proteins in the other categories as our negative “undrugged” set (Fig. 1A). It is likely that the undrugged set contains many potentially druggable proteins which have not yet been targeted by approved therapeutics.

PINNED model

The model architecture consisted of four separate deep neural networks, designated “sequence and structure,” “localization,” “biological functions,” and “network information.” Each network contained an input layer, a hidden layer with ReLU activation, and a single output neuron representing the network sub-score. The four sub-scores were summed, producing a logit which was passed through a sigmoid function to generate the final probability of druggability (Fig. 1B).
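The forward pass of this architecture can be sketched in a few lines of plain Python. Only the structure (one ReLU hidden layer per subnetwork, a single output neuron, summed sub-scores, and a final sigmoid) is taken from the description above; the input sizes, hidden-layer width, random weights, and all function names are hypothetical placeholders.

```python
import math
import random


def make_subnetwork(n_inputs, n_hidden, rng):
    """Randomly initialised weights for one subnetwork (illustrative only)."""
    return {
        "w1": [[rng.gauss(0, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.gauss(0, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def sub_score(net, x):
    """Hidden layer with ReLU activation, then a single output neuron."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(net["w1"], net["b1"])
    ]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]


def druggability(nets, inputs):
    """Sum the four sub-scores into a logit and squash it with a sigmoid."""
    scores = {name: sub_score(nets[name], inputs[name]) for name in nets}
    logit = sum(scores.values())
    return scores, 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
# Hypothetical input sizes; the real feature counts differ per subnetwork.
sizes = {"sequence_structure": 6, "localization": 4, "biological_functions": 8, "network": 3}
nets = {name: make_subnetwork(n, 5, rng) for name, n in sizes.items()}
inputs = {name: [rng.gauss(0, 1) for _ in range(n)] for name, n in sizes.items()}
scores, probability = druggability(nets, inputs)
```

Because the final logit is a plain sum, each sub-score can be read as that feature group's additive contribution to the overall prediction, which is the source of the model's interpretability.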

Prior to model tuning, 20% of the dataset was held out to form a separate test set, which was used to evaluate the model after the optimal architecture had been determined. The remaining data was divided into five equal groups; in each fold, one group was held out as a validation set while the remaining four were combined to form the training set (5-fold cross-validation) (Fig. 1A). It was necessary to oversample the positive set to prevent the model from converging towards a naïve negative classifier due to the significant imbalance between drugged and undrugged proteins. After the validation set had been separated, drugged proteins in the training set were randomly oversampled with replacement until the numbers of drugged and undrugged proteins were equal. The feature matrix was then divided into sequence and structure, localization, biological functions, and network information matrices. These matrices served as inputs to their respective networks.
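A minimal sketch of this splitting and oversampling procedure is given below. It assumes a simple random (non-stratified) split, since the text does not say whether stratification was used; the function name, the seed, and the toy label vector are hypothetical.

```python
import random


def split_and_oversample(labels, seed=0, test_fraction=0.2, n_folds=5, val_fold=0):
    """Return (training, validation, test) index lists for one fold."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)

    # Hold out the test set before any tuning.
    n_test = int(len(idx) * test_fraction)
    test, rest = idx[:n_test], idx[n_test:]

    # Split the remainder into folds; one fold becomes the validation set.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    val = folds[val_fold]
    train = [i for k, fold in enumerate(folds) if k != val_fold for i in fold]

    # Oversample positives with replacement, in the training set only,
    # so the validation and test sets keep their natural class balance.
    pos = [i for i in train if labels[i] == 1]
    neg = [i for i in train if labels[i] == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))] if pos else []
    return train + extra, val, test


# Toy labels with roughly the 3.5% positive rate of the real dataset.
labels = [1] * 35 + [0] * 965
train, val, test = split_and_oversample(labels)
```

Oversampling after the validation split matters: duplicating positives first would let copies of the same protein appear on both sides of the split and inflate validation performance.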

Fig. 2

Performance of PINNED on the test set. A ROC curves of the model and each subnetwork for distinguishing between drugged and undrugged proteins. B Histogram showing the distribution of druggability probabilities for undrugged proteins in the test set. C Histogram showing the distribution of druggability probabilities for drugged proteins in the test set

After hyperparameter optimization, a model was trained on the full training/validation set, with the held-out test data used as the final validation set. The complete model achieved an excellent AUC of 0.950 on the test set (Fig. 2A), with the scores from each subnetwork attaining a lower AUC. Although the biological functions sub-score performed by far the best with an AUC of 0.924, the other networks successfully classified proteins as drugged or undrugged with reasonable discriminatory power. The full model consistently scored undrugged proteins in the test set as having low druggability due to the substantial number of negative examples (undrugged proteins) to learn from (Fig. 2B). Druggability scores were more variable for the drugged proteins, reflecting the difficulty of identifying a consistent “druggable” profile from a small number of positives (Fig. 2C). However, PINNED’s high AUC demonstrates its ability to successfully distinguish between proteins with high and low druggability potential.
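For readers less familiar with the metric, AUC can be understood as the probability that a randomly chosen drugged protein receives a higher score than a randomly chosen undrugged protein. The sketch below computes it directly from that definition; the labels and scores are made-up toy values, not results from the model.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two drugged and three undrugged proteins.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

Because AUC depends only on the ranking of scores, it is unaffected by the class imbalance that makes accuracy at a fixed threshold hard to interpret.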

To determine if the scores could predict success in clinical trials, we tested them against a dataset of successful and failed phase III clinical targets [33]. We found that the overall druggability score achieved an AUC superior to that of the original publication (Additional file 5). Although this may reflect bias in the data, in which more GO annotations or protein–protein interactions have been identified for targets which were successful in clinical trials, it indicates that PINNED may be a useful resource for informing not just target selection, but also later-stage clinical trials.

Fig. 3

Confusion matrices of PINNED on the test set. A Confusion matrix with threshold for druggability set at 0.5. B Confusion matrix with threshold set at 0.03 to balance sensitivity and specificity

Reducing the druggability score required to consider a protein “druggable” can increase the sensitivity of the predictor. By default, this value was set as 0.5 during training, but may be changed to any arbitrary value during inference. At a threshold of 0.5, PINNED achieves excellent specificity but low sensitivity, with many drugged proteins in the test set being mistakenly classed as undruggable (Fig. 3A; Table 2). At a reduced threshold of 0.03, chosen to balance sensitivity and specificity, all the drugged proteins are properly classed, while many undrugged proteins are now considered “druggable” (Fig. 3B; Table 2). This cohort of undrugged proteins with high druggability scores represents potential opportunities for pharmaceutical targeting.
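The effect of moving the threshold can be shown with a small sketch that recomputes the confusion matrix at two cut-offs. The labels and probabilities below are invented toy values chosen to show the trade-off; they are not taken from Table 2.

```python
def confusion(labels, probs, threshold):
    """Confusion-matrix counts plus sensitivity and specificity at a threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy scores: three drugged proteins followed by five undrugged proteins.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.80, 0.20, 0.04, 0.10, 0.02, 0.01, 0.01, 0.005]

strict = confusion(labels, probs, threshold=0.5)
lenient = confusion(labels, probs, threshold=0.03)
```

In this toy case the strict threshold gives perfect specificity but misses two of three positives, while the lenient threshold recovers all positives at the cost of one false positive.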

Table 2 Comparison of PINNED’s test performance at different druggability thresholds

Comparing PINNED’s performance to prior machine learning efforts to assess protein druggability is challenging due to the wide variety of datasets used and metrics reported. Many previous works exclude proteins with significant homology to drugged proteins from their undrugged sets [3, 14], even though there may be significant differences between these proteins’ properties which alter their utility as drug targets. Similarly, some construct an idealized set of “undruggable” proteins, making it difficult to generalize to the whole proteome [6, 17, 21, 26, 28, 35, 36, 46]. Others only focus on a specific target or indication, such as oncology [4, 12, 13, 22], or ion channels [20]. Restricting our focus to models which seek to assess the druggability of the entire proteome, we find that PINNED comfortably outperforms much of the prior literature in sensitivity, specificity, and AUC [5, 10, 15, 43] (Additional file 1). A recent publication by Raies et al. achieved a higher AUC, but without the constituent sub-scores PINNED generates [32]. The interpretability of our model is a unique advantage which enhances its value to the target selection process.

Of the 696 drugged proteins in our dataset, 294 were affiliated with three protein families: ion channels (124 proteins), G-protein-coupled receptors, or GPCRs (102 proteins), and kinases (68 proteins). PINNED’s four-subnetwork architecture is intended to determine which individual members of protein families represent the most promising targets, rather than only assessing the druggability of entire families. To test whether the model learns features indicative of druggability across the whole proteome and can apply them to identify targets within unseen protein families, we excluded ion channels, GPCRs, or kinases, respectively, from the training data, and tested the models’ performance on these held-out families. In each case, a training and validation set was constructed consisting of the entire proteome except for the members of the held-out family and used to train five models with cross-validation. Each model was then used to score all proteins within the held-out family, and the scores were averaged to generate an ensemble score. The AUC of the ensemble score in distinguishing drugged and undrugged members of the held-out family was assessed for the overall druggability score and all four constituent sub-scores. This process was repeated for each of the three main drugged families. The PINNED framework maintained the ability to distinguish between drugged and undrugged proteins within each family with reasonable discriminatory power (Table 3). Although the performance of the overall druggability score and each of the sub-networks was reduced relative to a fully heterogeneous training set, the network retained a notable ability to separate drugged from undrugged targets despite not having seen any members of the family in the training data. The sequence and structure sub-score consistently performed poorly, consistent with members of the same protein family being highly homologous in sequence and in signature structural motifs, and therefore difficult to distinguish. Because the other three networks incorporate information on each protein’s known localization, functions, and interactions, respectively, they are more capable of capturing differentiating features which impact druggability, and their performance is correspondingly higher. These results indicate that PINNED can generalize properties of druggable proteins to entire unseen families.

Table 3 PINNED performance in distinguishing drugged from undrugged members of major drug target families, as measured by AUC

After training the model and assessing it on the test set, we ablated each feature by randomly shuffling (permuting) its values among the proteins in the test set and assessed the increase in test loss induced by the change. As loss is inversely related to the network’s performance, more important features will produce a larger increase in loss after being permuted. We found that features belonging to the biological functions subnetwork comprised seven of the top 10 (Table 4), consistent both with the substantial number of features in that network and the fact that it was by far the most significant in contributing to PINNED’s performance. Many of the features, including essentiality, degree, transmembrane helices, and PageRank, overlapped with the most notable features selected by Dezső et al. [13]. This indicates a similarity between the properties of oncology targets and other drugged proteins. Additionally, several of the top features derived from GO annotations (ATP binding, voltage-gated potassium channel activity, and potassium ion transmembrane transport) are known to be relevant factors in druggability [8, 42].
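Permutation importance of this kind can be sketched as follows. The loss is assumed to be binary cross-entropy, which the text does not state explicitly, and the "model" is a hypothetical fixed function that uses only the first of two features, so that the expected outcome is easy to see.

```python
import math
import random


def log_loss(labels, probs):
    """Mean binary cross-entropy (assumed loss; not specified in the text)."""
    eps = 1e-12
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    ) / len(labels)


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Increase in loss after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = log_loss(labels, [predict(r) for r in rows])
    increases = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(log_loss(labels, [predict(r) for r in permuted]) - baseline)
    return increases


# Hypothetical model: feature 0 is informative, feature 1 is ignored.
def toy_model(row):
    return 1.0 / (1.0 + math.exp(-4.0 * row[0]))


data_rng = random.Random(1)
labels = [i % 2 for i in range(20)]
rows = [[1.0 if y else -1.0, data_rng.gauss(0, 1)] for y in labels]
increases = permutation_importance(toy_model, rows, labels, n_features=2)
```

Shuffling the informative feature raises the loss, while shuffling the ignored feature leaves it unchanged, which is the signal used to rank features.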

Table 4 Most notable features, as ranked by change in test loss after random permutation of the feature

To generate druggability scores for the entire proteome, we split our entire dataset, including the training/validation and test sets, into five parts. Each part was held out in turn and the remaining four were used to train a classifier model. The scores for the held-out part were designated as the final druggability scores for those proteins, so that every protein in the proteome was scored by a model that had not seen it during training. Of the 10 highest-scoring undrugged proteins in the proteome, all except TNFRSF11A are listed by Pharos as Tchem, having validated high-potency small molecule ligands (Table 5). The mechanism of action for many drugs is not entirely clear, as they may interact with multiple proteins in the same family, making conclusive classification of proteins as targets or non-targets challenging. We cross-referenced the 10 top-scoring proteins with the Therapeutic Targets Database (TTD) and the Open Targets platform, two other databases of drug–target interaction [30, 45]. Of these, five were listed by TTD and two by Open Targets as already being the targets of approved therapeutics, while two were listed by TTD and two by Open Targets as clinical trial targets (Additional file 2). This discrepancy between databases reflects the difficulty of conclusively classifying proteins as mechanism-of-action drug targets. However, the high prevalence of likely interactors of approved drugs demonstrates that PINNED successfully generalizes the properties of drugged proteins to previously unseen data.

Table 5 Highest scoring undrugged proteins

Of the 20,412 proteins in the Pharos database, 5679 (28%) are designated as “Tdark”—having extremely limited data about their properties and functions. Considerable interest exists in exploring these understudied parts of the genome, particularly to discover novel therapeutic targets which have previously been overlooked [31]. At least one of the top scoring Tdark proteins in our model has been investigated as a drug target (Table 6). Transmembrane protease serine 11B (TMPRSS11B) was identified as upregulated in lung squamous cell carcinomas, serving as a poor prognostic marker. Inhibition of the protein in vitro reduced transformation and proliferation [39].

Table 6 Highest scoring Tdark proteins

TMPRSS11B’s sub-scores for sequence and structure, localization, biological functions, and network information, compared to the Tclin (drugged) proteins, were respectively in the 84th, 97th, 29th, and 1st percentiles (Additional file 3). The high score for sequence and structure is consistent with the observation that transmembrane helices are highly indicative of druggability (Table 4). Similarly, for the localization subnetwork, permutation importance suggests three of the five most notable features are GO annotations related to localization to the plasma membrane (Additional file 4). Although TMPRSS11B attains a lower score in the biological functions network, it is higher than 95% of undrugged proteins. Its network information score, however, is low even among undrugged proteins, at the 7th percentile. This may indicate that TMPRSS11B lacks the network centrality to have a significant impact on cellular homeostasis. Overall, our results indicate that TMPRSS11B may be structurally amenable to drugging and demonstrates localization and biological activity consistent with other drug targets, but may lack the centrality in the protein–protein interaction network typical of successfully drugged proteins. The use of multiple sub-scores to characterize a protein’s druggability profile enables a more detailed analysis of its potential strengths and weaknesses than a single unified score.

Discussion

The implementation of a pre-screening methodology that differentiates druggable and undruggable targets can help ameliorate the difficulty of target selection in pharmaceutical development and aid in allocating R&D investments to promising targetable proteins. Consequently, it is imperative that an interpretable model can accurately identify novel druggable targets. We developed a neural network-based machine learning model able to produce druggability sub-scores based on separate feature categories spanning multiple factors in druggability. These allow the analysis of each category individually and its contribution to an overall druggability score.

PINNED attained excellent results in its ability to distinguish drugged from undrugged proteins with an AUC of 0.95. Importantly, this was achieved on the entire proteome, indicating that the model can handle difficult cases such as undrugged family members of drugged proteins. Notably, PINNED was far better at assigning low druggability scores to undrugged proteins than assigning high scores to drugged proteins (Fig. 2), consistent with the large imbalance between the two classes. By reducing the score required to designate a protein as “druggable,” it is possible to increase the sensitivity of the classifier in positively labeling drugged proteins at the expense of also designating as druggable many currently undrugged proteins (Fig. 3). However, these may represent proteins which are already the targets of approved drugs but have not been formally labeled due to insufficient evidence, or potential new targets which merit further investigation (Table 5).

Among our sub-scores, the biological functions network achieved the best performance with a standalone AUC of 0.924. This is potentially due to it being the largest subnetwork, with 3,464 inputs, allowing it to incorporate a large amount of information about protein function. The network information sub-score attained the second-highest performance at 0.810, despite being by far the smallest network, suggesting that the relationship between number of inputs and classification value is complex. Sequence and structure and localization were the lowest-performing subnetworks, achieving AUCs of 0.777 and 0.729, respectively. However, these scores are still competitive with previous efforts at using machine learning to assess protein druggability (Additional file 1). This result indicates that our druggability sub-scores are useful not just as inputs to the overall score, but as standalone estimates of each protein’s druggability within that subdomain. Furthermore, we found that PINNED’s overall druggability score exceeds prior publications in predicting success in phase III clinical trials, despite not being trained to directly predict clinical success (Additional file 5).

The 10 most relevant features fed into PINNED, in terms of impact on test loss, span three of the four subnetworks, with the majority coming from biological functions, but none from localization (Table 4). While this finding is consistent with the fact that the localization subnetwork achieves the lowest standalone AUC, the “transmembrane helices” feature in the sequence and structure network can be assumed to be a strong indicator of whether a protein is localized to the plasma membrane, a property which dominates the most important localization features (Additional file 4). Some collinearity exists among the feature inputs of the different networks. This is an inevitable result of the proteins’ functions, structures, and interactions being closely interrelated. However, the observation that many proteins score highly on some subnetworks but poorly on others demonstrates that the subnetworks capture distinct information about a protein’s druggability. Many of the top features overlap with those identified in previous publications [5, 12, 13, 24]. This suggests that machine learning models trained to predict protein druggability converge on a common set of important contributors.

The “dark genome” encompasses the proteins in the human proteome which have not been extensively studied, especially as prospective drug targets, and has thus become of particular interest to the pharmaceutical industry [31]. Our work indicates that a substantial number of proteins in the dark genome may have drug-like properties. For instance, we found transmembrane serine protease TMPRSS11B, a dark genome protein, is similar in structure, localization, and function to many successfully drugged targets. Our model enables dark genome proteins with disease associations to be investigated for druggability potential.

Conclusions

We established a neural network-based machine learning model, termed PINNED, able to assess proteins’ druggability based on their sub-scores across four distinct categories. We have demonstrated that our proposed methodology is a highly predictive network (test AUC 0.95) with the ability to estimate the druggability of over 20,000 proteins spanning the entire human proteome. PINNED can be used as a pre-screening tool to determine a protein’s amenability to drugging prior to the initiation of pre-clinical programs and identify weaknesses in the form of low sub-scores of top targets that do not necessarily score high in all four areas, providing room for insight and early remediation. This methodology enables the exploration of novel targets cost-effectively while improving the clinical phase success rate.

Materials and methods

Drug targets

Drugged and undrugged proteins and sequences were obtained from the Pharos database on October 12, 2022. Proteins categorized as Tclin were labeled as drugged, while proteins categorized as Tchem, Tbio, or Tdark were labeled as undrugged. Protein features were obtained from Dezső et al.’s feature set [13] and the AlphaFold database [11]. The final protein set used to train the model was the intersection of these three sources; proteins not found in all three were removed. Labels identifying proteins as GPCRs, ion channels, or kinases were obtained from the Knowledge Management Center for Illuminating the Druggable Genome via Pharos on May 20, 2023.

All features generated by Dezső et al. were incorporated into our feature set and divided between the four subnetworks. These include characteristics calculated or predicted from the amino acid sequence, such as post-translational modifications, enzyme classification, localization, secondary structure, and sequence motifs. Details on the generation of these features can be found in Dezső et al. [13]. All numeric features were standardized to a mean of 0 and standard deviation of 1 (“standard scaled”), while all categorical features were one-hot encoded.
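A minimal sketch of the two preprocessing steps, written with the Python standard library only. It assumes the population standard deviation is used for scaling (the text does not say which variant was applied), and the example values are invented for illustration.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale a numeric column to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Invented example columns.
scaled = standard_scale([10.0, 20.0, 30.0])
cats, encoded = one_hot(["membrane", "nucleus", "membrane"])
print(scaled)
print(cats, encoded)
```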

Sequence and structure properties sub-score

Information on protein molecular weight, amino acid residues, charge and isoelectric point, extinction coefficients, predicted post-translational modifications, secondary structure, and solvent accessibility from Dezső et al.’s feature set was included as sequence and structure properties.

Grouped dipeptide composition (GDPC) and pseudo amino acid composition (PAAC) were calculated using the iFeature toolkit [7]. All selenocysteine (U) residues in the protein sequences were converted to cysteine (C) for the calculations. A lambda of 3 was chosen for PAAC.
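To make the GDPC descriptor concrete, the sketch below recomputes it from scratch rather than calling iFeature. The five residue groups are an assumption based on the grouping commonly used for this descriptor and should be checked against the iFeature source; the function name and example sequence are illustrative only.

```python
from itertools import product

# Assumed residue groups (verify against iFeature before relying on them).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def gdpc(sequence):
    """Return the 25 grouped-dipeptide frequencies for one protein sequence."""
    seq = sequence.upper().replace("U", "C")  # selenocysteine -> cysteine
    counts = {pair: 0 for pair in product(GROUPS, repeat=2)}
    for first, second in zip(seq, seq[1:]):
        if first in RESIDUE_TO_GROUP and second in RESIDUE_TO_GROUP:
            counts[(RESIDUE_TO_GROUP[first], RESIDUE_TO_GROUP[second])] += 1
    total = sum(counts.values())
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


features = gdpc("MKUDE")  # treated as MKCDE after the U -> C conversion
print(features[("aliphatic", "positive")])
```

Each protein yields a fixed-length vector of 25 values regardless of sequence length, which is what makes the descriptor usable as neural network input.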

Human protein structure predictions were acquired from AlphaFold (last modified on May 5, 2022). The structures were prepared for analysis with Fpocket, an open-source pocket detection algorithm based on Voronoi tessellation and alpha sphere theory [25]. Fpocket begins by filtering the Voronoi vertices, retaining the associated alpha spheres whose size falls between a minimum and a maximum. Alpha spheres that cluster together are recognized as a pocket. The pockets are further reduced based on zones of compact atom packing. The alpha spheres are labeled according to the atoms they contact, and the pockets are then ranked by their prospective capacity to bind small molecules. All features were standard scaled.

Localization sub-score

Protein localization and tissue specificity data obtained from Dezső et al. was included in the localization data.

GO terms were downloaded from the Target Central Resource Database (TCRD) on July 29, 2022, and separated into Components, Functions, and Processes. These were used to generate a one-hot encoded matrix mapping each protein in the dataset to its GO terms. Terms mapped to fewer than 10 proteins were excluded. GO Components were included in the localization data, while Functions and Processes were included in the biological functions data (see below).
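The construction of the GO term matrix, including the exclusion of rarely used terms, might look like the sketch below. The input shape (a dictionary from protein ID to a set of GO term IDs), the function name, and the placeholder term IDs are assumptions made for illustration.

```python
from collections import Counter


def go_term_matrix(annotations, min_proteins=10):
    """Build a protein x GO-term 0/1 matrix, dropping terms mapped to too few proteins.

    annotations: dict mapping protein ID -> set of GO term IDs (assumed input shape).
    """
    term_counts = Counter(term for terms in annotations.values() for term in set(terms))
    kept_terms = sorted(t for t, n in term_counts.items() if n >= min_proteins)
    proteins = sorted(annotations)
    matrix = [[int(t in annotations[p]) for t in kept_terms] for p in proteins]
    return proteins, kept_terms, matrix


# Toy data: 12 placeholder proteins; "GO:A" maps to all 12, "GO:B" to only 3.
toy = {f"P{i:02d}": {"GO:A"} | ({"GO:B"} if i < 3 else set()) for i in range(12)}
proteins, terms, matrix = go_term_matrix(toy)
print(terms)
```

Here "GO:B" is removed because it maps to fewer than 10 proteins, leaving a single column.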

Biological functions sub-score

Scores generated for each protein by Dezső et al. from the MetaCore database for “Biological Function” and “Molecular Process” were standard scaled and included in the “biological functions” sub-score. The enzyme classification and gene essentiality features from Dezső et al. were also included in the biological functions data.

GO Functions and Processes were obtained and processed as described above and included in biological functions.

Network information sub-score

The “Maps” (signaling pathways) scores from Dezső et al. and calculated protein–protein interaction network features were used as the input to the network information subnetwork.

Model

Features for all four sub-scores were combined into a single feature matrix. Prior to model development, 20% of the proteins were selected at random and held out as a test set. Prior to training, the drugged proteins in the training set were randomly oversampled with replacement until their number equaled that of the undrugged proteins. Oversampling with SMOTE or ADASYN and applying different weights to positive and negative samples were also evaluated, but none of these approaches improved performance.
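The hold-out split and random oversampling can be sketched with the standard library as follows. The function name, seed handling, and toy labels are illustrative assumptions; the sketch is not taken from the authors’ code.

```python
import random


def split_and_oversample(labels, test_fraction=0.2, seed=0):
    """Hold out a random test set, then oversample drugged training proteins.

    labels: dict mapping protein ID -> 1 (drugged) or 0 (undrugged).
    Returns (train_ids, test_ids); train_ids may contain repeated drugged IDs.
    """
    rng = random.Random(seed)
    ids = sorted(labels)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids, train_ids = ids[:n_test], ids[n_test:]

    drugged = [p for p in train_ids if labels[p] == 1]
    undrugged = [p for p in train_ids if labels[p] == 0]
    # Draw extra drugged proteins with replacement until the classes are balanced.
    n_extra = max(0, len(undrugged) - len(drugged))
    extra = rng.choices(drugged, k=n_extra) if drugged else []
    return train_ids + extra, test_ids


toy_labels = {f"P{i:03d}": int(i % 10 == 0) for i in range(100)}  # 10% drugged
train_ids, test_ids = split_and_oversample(toy_labels)
print(len(train_ids), len(test_ids))
```

Splitting before oversampling matters: duplicated drugged proteins then appear only in the training set, so the held-out test set stays free of copies of training examples.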

Our model was implemented in Python 3.7.13 using TensorFlow 2.11.0 and consisted of four densely connected neural networks, one for each of the four sub-scores. Each subnetwork consisted of an input layer of size n (the number of input features), a hidden layer of size $2^i$, where i is the largest integer such that $2^i \le n$, and an output layer of size 1, representing that subnetwork’s sub-score. ReLU activation was applied to the hidden layers, and an L2 penalty of 0.001 was applied to both the hidden and output layers. The outputs of the four subnetworks were summed to generate the logit of the overall druggability score. Different numbers of hidden layers, dropout rates for the input and hidden layers, learning rates, and L2 coefficients were tested, and the above values were found to give the best AUC scores on validation sets.
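Two details of this architecture are easy to misread, so a small sketch may help: how the hidden layer width follows from the number of inputs, and how four sub-scores combine into one score. The sketch assumes the summed logit is passed through a sigmoid to obtain a probability, which the text implies but does not state; the feature counts and sub-score values are invented.

```python
import math


def hidden_size(n_inputs):
    """Largest power of two that does not exceed n_inputs (n_inputs >= 1)."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    return 1 << (n_inputs.bit_length() - 1)


def overall_score(sub_scores):
    """Sum the sub-score logits and map the total to a probability (assumed sigmoid)."""
    logit = sum(sub_scores.values())
    return 1.0 / (1.0 + math.exp(-logit))


# Invented feature counts and sub-scores, for illustration only.
print(hidden_size(100))  # 64
print(hidden_size(64))   # 64
example = {
    "sequence_structure": 1.2,
    "localization": -0.4,
    "biological_functions": 0.7,
    "network": -1.5,
}
print(overall_score(example))
```

Because the sub-scores are simply added, each one can be read as that category’s additive contribution to the overall logit, which is the basis of the model’s interpretability.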

Support vector machine, logistic regression, XGBoost, and random forest models were also evaluated and found to deliver performance comparable or inferior to that of the neural network.

The model was trained using the Adam optimizer with TensorFlow default parameters other than the learning rate, which was set to $10^{-3.5}$, with a batch size of 32 and the binary cross-entropy loss function.
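For reference, the training objective can be written out as below. The notation is introduced here and is not from the original text: $s_k$ denotes the output of subnetwork $k$, $z$ the summed logit, $y_j$ the drugged/undrugged label of protein $j$, and $B$ the batch size. The sigmoid link is an assumption implied by the use of logits, and the L2 penalty terms are omitted for brevity.

```latex
\eta = 10^{-3.5} = 10^{-4}\sqrt{10} \approx 3.16 \times 10^{-4}

z = \sum_{k=1}^{4} s_k, \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}

\mathcal{L} = -\frac{1}{B} \sum_{j=1}^{B}
  \left[ y_j \log p_j + (1 - y_j) \log (1 - p_j) \right], \qquad B = 32
```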

Availability of data and materials

Our code is available at https://github.com/abbvie-external/Predictive-Interpretable-Neural-Network-for-Druggability-PINNED-.

Abbreviations

ADASYN: Adaptive Synthetic Sampling

AUC: Area Under the Receiver Operating Characteristic Curve

CELLO: Cellular Localization

GDPC: Grouped Dipeptide Composition

GO: Gene Ontology

GPCR: G-protein-coupled receptor

GTEx: Genotype-Tissue Expression

HPA: Human Protein Atlas

NCBI: National Center for Biotechnology Information

PAAC: Pseudo Amino Acid Composition

PINNED: Predictive Interpretable Neural Network for Druggability

R&D: Research and Development

ReLU: Rectified Linear Unit

SMOTE: Synthetic Minority Over-sampling Technique

STRING: Search Tool for the Retrieval of Interacting Genes/Proteins

TCRD: Target Central Resource Database

TMPRSS11B: Transmembrane Serine Protease 11B

TTD: Therapeutic Target Database

References

  1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25(1):25–29. https://doi.org/10.1038/75556

  2. Bairoch A, Boeckmann B (1991) The SWISS-PROT protein sequence data bank. Nucleic Acids Res 19(suppl):2247–2249. https://doi.org/10.1093/nar/19.suppl.2247

  3. Bakheet TM, Doig AJ (2009) Properties and identification of human protein drug targets. Bioinformatics 25(4):451–457

  4. Bazaga A, Leggate D, Weisser H (2020) Genome-wide investigation of gene-cancer associations for the prediction of novel therapeutic targets in oncology. Sci Rep 10(1):10787. https://doi.org/10.1038/s41598-020-67846-1

  5. Bull SC, Doig AJ (2015) Properties of protein drug target classes. PLoS ONE 10(3):e0117955. https://doi.org/10.1371/journal.pone.0117955

  6. Charoenkwan P, Schaduangrat N, Lio’ P, Moni MA, Shoombuatong W, Manavalan B (2022) Computational prediction and interpretation of druggable proteins using a stacked ensemble-learning framework. iScience 25(9):104883. https://doi.org/10.1016/j.isci.2022.104883

  7. Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, Webb GI, Smith AI, Daly RJ, Chou K-C, Song J (2018) iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34(14):2499–2502. https://doi.org/10.1093/bioinformatics/bty140

  8. Chène P (2002) ATPases as drug targets: learning from their structure. Nat Rev Drug Discov 1(9):665–673. https://doi.org/10.1038/nrd894

  9. Chou K-C (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct Funct Genet 43(3):246–255. https://doi.org/10.1002/prot.1035

  10. Costa PR, Acencio ML, Lemke N (2010) A machine learning approach for genome-wide prediction of morbid and druggable human genes based on systems-level data. BMC Genomics. https://doi.org/10.1186/1471-2164-11-S5-S9

  11. David A, Islam S, Tankhilevich E, Sternberg MJE (2022) The AlphaFold database of protein structures: a biologist’s guide. J Mol Biol 434(2):167336. https://doi.org/10.1016/j.jmb.2021.167336

  12. de Falco A, Dezso Z, Ceccarelli F, Cerulo L, Ciaramella A, Ceccarelli M (2021) Adaptive one-class gaussian processes allow accurate prioritization of oncology drug targets. Bioinformatics 37(10):1420–1427. https://doi.org/10.1093/bioinformatics/btaa968

  13. Dezső Z, Ceccarelli M (2020) Machine learning prediction of oncology drug targets based on protein and network properties. BMC Bioinformatics 21(1):104. https://doi.org/10.1186/s12859-020-3442-9

  14. Feng Y, Wang Q, Wang T (2017) Drug target protein-protein interaction networks: a systematic perspective. BioMed Research International. https://doi.org/10.1155/2017/1289259

  15. Ferrero E, Dunham I, Sanseau P (2017) In silico prediction of novel therapeutic targets using gene–disease association data. J Transl Med. https://doi.org/10.1186/s12967-017-1285-6

  16. Georgi B, Voight BF, Bućan M (2013) From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet 9(5):e1003484. https://doi.org/10.1371/journal.pgen.1003484

  17. Gong Y, Liao B, Wang P, Zou Q (2021) DrugHybrid_BS: using hybrid feature combined with Bagging-SVM to predict potentially druggable proteins. Front Pharmacol 12:771808. https://doi.org/10.3389/fphar.2021.771808

  18. GTEx Consortium (2017) Genetic effects on gene expression across human tissues. Nature 550(7675):204–213. https://doi.org/10.1038/nature24277

  19. Harrison RK (2016) Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 15(12):817–818. https://doi.org/10.1038/nrd.2016.184

  20. Huang C, Zhang R, Chen Z, Jiang Y, Shang Z, Sun P, Zhang X, Li X (2010) Predict potential drug targets from the ion channel proteins based on SVM. J Theor Biol 262(4):750–756. https://doi.org/10.1016/j.jtbi.2009.11.002

  21. Jamali AA, Ferdousi R, Razzaghi S, Li J, Safdari R, Ebrahimie E (2016) DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins. Drug Discov Today 21(5):718–724. https://doi.org/10.1016/j.drudis.2016.01.007

  22. Jeon J, Nim S, Teyra J, Datti A, Wrana JL, Sidhu SS, Moffat J, Kim PM (2014) A systematic approach to identify novel cancer drug targets using machine learning, inhibitor design and high-throughput screening. Genome Med 6(7):57. https://doi.org/10.1186/s13073-014-0057-7

  23. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman R, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583–589. https://doi.org/10.1038/s41586-021-03819-2

  24. Kim B, Jo J, Han J, Park C, Lee H (2017) In silico re-identification of properties of drug target proteins. BMC Bioinform. https://doi.org/10.1186/s12859-017-1639-3

  25. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10(1):168. https://doi.org/10.1186/1471-2105-10-168

  26. Li Q, Lai L (2007) Prediction of potential drug targets based on simple sequence properties. BMC Bioinformatics 8(1):353. https://doi.org/10.1186/1471-2105-8-353

  27. Li Z-C, Zhong W-Q, Liu Z-Q, Huang M-H, Xie Y, Dai Z, Zou X-Y (2015) Large-scale identification of potential drug targets based on the topological features of human protein–protein interaction network. Anal Chim Acta 871:18–27. https://doi.org/10.1016/j.aca.2015.02.032

  28. Lin J, Chen H, Li S, Liu Y, Li X, Yu B (2019) Accurate prediction of potential druggable proteins based on genetic algorithm and Bagging-SVM ensemble classifier. Artif Intell Med 98:35–47. https://doi.org/10.1016/j.artmed.2019.07.005

  29. Mitsopoulos C, Schierz AC, Workman P, Al-Lazikani B (2015) Distinctive behaviors of druggable proteins in cellular networks. PLoS Comput Biol 11(12):e1004597. https://doi.org/10.1371/journal.pcbi.1004597

  30. Ochoa D, Hercules A, Carmona M, Suveges D, Baker J, Malangone C, Lopez I, Miranda A, Cruz-Castillo C, Fumis L, Bernal-Llinares M, Tsukanov K, Cornu H, Tsirigos K, Razuvayevskaya O, Buniello A, Schwartzentruber J, Karim M, Ariano B, Osorio REM, Ferrer J, Ge X, Machlitt-Northen S, Gonzalez-Uriarte A, Saha S, Tirunagari S, Mehta C, Roldán-Romero JM, Horswell S, Young S, Ghoussaini M, Hulcoop DG, Dunham I, McDonagh EM (2023) The next-generation open targets platform: reimagined, redesigned, rebuilt. Nucleic Acids Res 51(D1):D1353–D1359. https://doi.org/10.1093/nar/gkac1046

  31. Oprea TI (2019) Exploring the dark genome: implications for precision medicine. Mamm Genome 30(7–8):192–200. https://doi.org/10.1007/s00335-019-09809-0

  32. Raies A, Tulodziecka E, Stainer J, Middleton L, Dhindsa RS, Hill P, Engkvist O, Harper AR, Petrovski S, Vitsios D (2022) DrugnomeAI is an ensemble machine-learning framework for predicting druggability of candidate drug targets. Commun Biol. https://doi.org/10.1038/s42003-022-04245-4

  33. Rouillard AD, Hurle MR, Agarwal P (2018) Systematic interrogation of diverse omic data reveals interpretable, robust, and generalizable transcriptomic features of clinically successful therapeutic targets. PLoS Comput Biol 14(5):e1006142. https://doi.org/10.1371/journal.pcbi.1006142

  34. Sheils TK, Mathias SL, Kelleher KJ, Siramshetty VB, Nguyen D-T, Bologa CG, Jensen LJ, Vidović D, Koleti A, Schürer SC, Waller A, Yang JJ, Holmes J, Bocci G, Southall N, Dharkar P, Mathé E, Simeonov A, Oprea TI (2021) TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res 49(D1):D1334–D1346. https://doi.org/10.1093/nar/gkaa993

  35. Sikander R, Ghulam A, Ali F (2022) XGB-DrugPred: computational prediction of druggable proteins using eXtreme gradient boosting and optimized features set. Sci Rep 12(1):5505. https://doi.org/10.1038/s41598-022-09484-3

  36. Sun T, Lai L, Pei J (2018) Analysis of protein features and machine learning algorithms for prediction of druggable proteins. Quant Biology 6(4):334–343. https://doi.org/10.1007/s40484-018-0157-2

  37. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ, von Mering C (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 47(D1):D607–D613. https://doi.org/10.1093/nar/gky1131

  38. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CA-K, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist P-H, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Pontén F (2015) Tissue-based map of the human proteome. Science 347(6220):1260419. https://doi.org/10.1126/science.1260419

  39. Updegraff BL, Zhou X, Guo Y, Padanad MS, Chen P-H, Yang C, Sudderth J, Rodriguez-Tirado C, Girard L, Minna JD, Mishra P, DeBerardinis RJ, O’Donnell KA (2018) Transmembrane protease TMPRSS11B promotes lung cancer growth by enhancing lactate export and glycolytic metabolism. Cell Rep 25(8):2223–2233.e6. https://doi.org/10.1016/j.celrep.2018.10.100

  40. Viacava Follis A (2021) Centrality of drug targets in protein networks. BMC Bioinform 22(1):527. https://doi.org/10.1186/s12859-021-04342-x

  41. Wouters OJ, McKee M, Luyten J (2020) Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323(9):844. https://doi.org/10.1001/jama.2020.1166

  42. Wulff H, Castle NA, Pardo LA (2009) Voltage-gated potassium channels as therapeutic targets. Nat Rev Drug Discov 8(12):982–1001. https://doi.org/10.1038/nrd2983

  43. Yao L, Rzhetsky A (2008) Quantitative systems-level determinants of human genes targeted by successful drugs. Genome Res 18(2):206–213. https://doi.org/10.1101/gr.6888208

  44. Yu C-S, Lin C-J, Hwang J-K (2004) Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci 13(5):1402–1406. https://doi.org/10.1110/ps.03479604

  45. Zhou Y, Zhang Y, Lian X, Li F, Wang C, Zhu F, Qiu Y, Chen Y (2022) Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res 50(D1):D1398–D1407. https://doi.org/10.1093/nar/gkab953

  46. Zhu M, Gao L, Li X, Liu Z, Xu C, Yan Y, Walker E, Jiang W, Su B, Chen X, Lin H (2009) The analysis of the drug–targets based on the topological properties in the human protein–protein interaction network. J Drug Target 17(7):524–532. https://doi.org/10.1080/10611860903046610

Acknowledgements

The authors would like to thank Keith Kelleher of the NIH, as well as Maria Argiriadi, Marlon Cowart, Felix DeAnda, Jacob Degner, Phil Hajduk, Howard Jacob, Jozsef Karman, Xavier Langlois, Frank Oellien, Ahmad Sheikh, and Jeff Waring of AbbVie, Inc. for their assistance with this project.

Funding

All funding was provided by AbbVie, Inc. All authors are employees of AbbVie. The design, study conduct, and financial support for this research were provided by AbbVie. AbbVie participated in the interpretation of data, review, and approval of the publication.

Author information

Contributions

MC developed and tested the machine learning model. DP generated the fpocket data. MC and DP wrote the manuscript. AP supervised the project. ZD generated protein features and developed an earlier model which was the inspiration for this work. All authors reviewed the manuscript and gave feedback.

Corresponding author

Correspondence to Michael Cunningham.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Comparison to previous druggability classifiers.

Additional file 2. Target classification of highest scoring undrugged proteins.

Additional file 3. All protein scores.

Additional file 4. Feature importance scores.

Additional file 5. Results from evaluating PINNED scores on phase III clinical targets.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Cunningham, M., Pins, D., Dezső, Z. et al. PINNED: identifying characteristics of druggable human proteins using an interpretable neural network. J Cheminform 15, 64 (2023). https://doi.org/10.1186/s13321-023-00735-7

  • DOI: https://doi.org/10.1186/s13321-023-00735-7

Keywords