Human oral bioavailability (HOB) is a key factor in determining the fate of new drugs in clinical trials. HOB is conventionally measured using expensive and time-consuming experimental tests. The use of computational models to evaluate HOB before the synthesis of new drugs will be beneficial to the drug development process. In this study, a total of 1588 drug molecules with HOB data were collected from the literature for the development of a classifying model that uses the consensus predictions of five random forest models. The consensus model shows excellent prediction accuracies on two independent test sets with two cutoffs of 20% and 50% for classification of molecules. The analysis of the importance of the input variables allowed the identification of the main molecular descriptors that affect the HOB class value. The model is available as a web server at www.icdrug.com/ICDrug/ADMET for quick assessment of oral bioavailability for small molecules. The results from this study provide an accurate and easy-to-use tool for screening of drug candidates based on HOB, which may be used to reduce the risk of failure in late stage of drug development.
Poor pharmacokinetic properties, including absorption, distribution, metabolism, excretion, and toxicity (ADMET), are the key reasons of late-stage failures in drug development . Therefore, ADMET assessments of candidate compounds during the early stages of drug discovery have become critical for improving the success rate of drug discovery and reducing the risk of late-stage attrition . However, experimental testing of ADMET properties is time-consuming and costly. Thus, the accurate prediction of these properties is becoming increasingly important in drug discovery.
Among the ADMET properties, one of the most important pharmacokinetic characteristics of newly developed drugs is high oral bioavailability. Because oral administration is convenient and does not damage the skin or mucous membranes, 80% of the world’s drugs are administered orally . Human oral bioavailability (HOB) is an important pharmacokinetic parameter that measures the amount of a drug that actually enters circulation within the body after ingestion. If intravenous administration is used, the human body can use the blood to deliver the drug to the site where it can exert pharmacological effects through the systemic circulation. Higher oral availability of the drug can reduce the amount of administration required to achieve the expected pharmacological effect because it can reduce the side effects and toxicity risks brought by the drug. On the other hand, poor oral bioavailability can lead to inefficiency of drugs and high inter-individual variability in the use of drugs, triggering some unpredictable drug reactions in the human body. In the actual drug development process, approximately 50% of candidate drugs fail due to low oral availability [4, 5]. Therefore, the level of oral availability is one of the key factors determining the success or failure of clinical trials of new drugs.
Experimental measurements of drug HOB are not only expensive, but also particularly time-consuming. Therefore, the development of a predictive model that can evaluate the HOB of a candidate compound before synthesis is of great help to drug discovery. Because the oral availability of a drug is affected by various biological, physical and chemical factors, such as the solubility of chemicals in the gastrointestinal tract, the permeability of the intestinal membrane, and the first pass metabolism of the intestine and liver, it is a very difficult and challenging task to develop accurate models to predict HOB. Nonetheless, a number of prediction models based on quantitative structure property relationships (QSPR) have been published [6, 7]. For example, Falcón-Cano et al. used 1448 molecules and obtained a consensus model with an accuracy of 78% . Yang et al. used the random forest algorithm and 995 molecules to develop the admetSAR method with an accuracy of 69.7% . On the basis of 995 data points, Kim et al. obtained 76% accuracy using the logistic classifier method .
In this study, we collected an HOB dataset composed of 1588 molecules and proposed a new model for HOB prediction based on machine learning. Using random forest (RF) [11,12,13] and two cutoffs of 20% and 50% for classifying molecules, we developed consensus models with a state-of-the-art accuracy on two independent test sets. Moreover, the importance of input molecular features to the prediction results was analyzed using the SHapley Additive exPlanation (SHAP) algorithm , revealing key molecular properties that affect HOB.
Materials and methods
Classification of positive and negative data
The performance of a classifier strongly depends on how the positive and negative samples are defined. However, there is still no consensus criterion to define positive and negative samples in HOB prediction. In previous studies, four cutoff values have been used: 20% (if HOB ≥ 20%, then the molecule belongs to the positive class; otherwise, it is a negative example) , 30% , 50% , and 80%  (Additional file 1: Table S1). We have therefor used two cutoffs 50% and 20% for labeling molecules in this study, which are used in more recent methods.
The HOB training and test datasets from Falcón-Cano et al.  were used in this study, which includes 1157 training and 290 test molecules. The 290 test molecules (test set 1) were selected by randomly selecting 20% of all molecules. Three molecules in the test set 1 had wrong values and were manually corrected according to the relevant literatures [18,19,20,21]. An additional test set of 27 molecules was collected from a number of publications  and was then combined with the HOB data from ChEMBL to form an additional test set of 141 molecules (Test set 2, Table 1). To ensure that molecules in test set 2 do not overlap with the molecules in training set and test set 1, 2D structures were first generated and used to remove duplicates followed by deduplication using molecular fingerprints. Neither test set 1 nor test set 2 was used during the training.
The labeling of molecules using the 50% cutoff was obtained directly from Falcón-Cano et al. For the 20% cutoff, some molecules cannot be classified due to inaccurate experimental values. These molecules were discarded, leaving training set with 1142 molecules, test set 1 with 287 molecules and test set 2 with 133 molecules (Table 1).
All molecules in the training and test sets were converted to 3D structures using the RDKit package. However, the calculated 3D fingerprints and descriptors were not used during model training due to null value or small variance for many of the molecules.
To evaluate the similarity between the training set and the test sets, we calculated the max Tanimoto coefficient similarity of each molecule in the test sets with all molecules in the training set. The similarity ranges from 0.1 to 1 in test set 1 and 0.2 to 1 in test set 2 (Fig. 1). The average similarity between the test set 1 and the training set is 0.655, and the average similarity between the test set 2 and the training set is 0.612.
Calculation of descriptors
We used Mordred  software to calculate 1614 molecular descriptors and fingerprints (Additional file 1: Table S2). Descriptors that had zero values or zero variance for all compounds were removed, reducing the total number of descriptors to 1143. All these 1143 features were used for training RF models.
Evaluation of models
The performance of the individual and consensus models was evaluated by analyzing the sensitivity (SE), specificity (SP), accuracy of prediction (ACC), Matthew’s correlation coefficient (MCC)  and F1_score.
where TP denotes true positive, FP is false positive, FN is false negative, and TN is true negative. In addition, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also calculated.
We utilized the SHapley Additive exPlanation (SHAP) algorithm to explain the prediction model by providing consistent and locally accurate attribution values (SHAP values) for each feature within each prediction model . The SHAP values evaluate the importance of the output resulting from the inclusion of a certain feature A for all combinations of features other than A.
Performance of the models and comparison with previous methods
For the 50% cutoff, we first assembled a training set of 1157 molecules and two independent test sets with 290 and 146 molecules from the literature and public databases (see Materials and methods). The input features of the models include 1143 2D descriptors calculated with the Mordred package (see Materials and methods for details). Five individual RF models were first trained on the training set with fivefold cross-validation, using grid search and accuracy score (ACC) for hyperparameter optimization to obtain the best n_estimators and min_samples_leaf values while leaving the remaining hyperparameters set to default values (Table 2). Restricting the number of tunable hyper-parameters may reduce the risk of overfitting. The accuracy of the our model is 0.86–0.90 on the training set (Additional file 1: Tables S3, S4), and 0.74–0.77 on test set 1 (Table 3). The moderate decrease of accuracy on the test set suggests that our model has certain extend of overfitting but not severe. We then combined the five RF models to obtain a voting model. The final bioavailability class is the result of voting from each classification model with equal weight. Individual models can usually identify different aspects of the relationship between independent variables and dependent variables, and the relationships between the variables identified by those models may be different. In certain cases, the usage of a consensus model can greatly reduce the prediction error [25,26,27]. The accuracy of the five individual models ranges from 0.742 to 0.808 on the two test sets (Table 3). The consensus model shows improvement in accuracy on test set 1 and test set 2. The AUC values of the consensus model were 0.830 and 0.878 on the two test sets (Fig. 2). In addition, we merged the training set and test set 1, and used the same protocol to train five random forest models and obtained a consensus model from fivefold cross-validation. This whole process was repeated 50 times, and the average accuracy on test 2 was 0.826 with a standard deviation of 0.014, which was close to the accuracy of 0.823 when test set 1 is not included for training.
To further estimate the accuracy of the models, the accuracy of our model on test set 1 was compared with those of previously published HOB models (Table 4). It should be noted that we used the same training and test sets as Falcón-Cano et al., but the accuracies of other models are directly taken from the respective literature publications, which may use different data sets.
Among these reported models, only admetSAR and ADMETlab provide an online prediction server that enables a direct comparison with our method based on the same test data. However, ADMETlab used different cutoffs of 30% and 20%. Therefore, we compared our method with admetSAR, which also uses F = 50% as the cut-off value. For a fair comparison, the molecules in the admetSAR training set were removed from the two test sets. Our model is more accurate than admetSAR in terms of SP, ACC and AUC (Table 5).
For the F = 20% cutoff, we used the same method to build a consensus model (Table 6). The accuracy of the five individual models ranges from 0.739 to 0.932 on the two test sets (Table 7). The consensus model shows improvement in accuracy on test set 1 and test set 2. The AUC values of the consensus model were 0.804 and 0.978 on the two test sets.
The performance of our model is compared with that of ADMETlab  on the two test sets (Table 8). Our model showed lower ACC and AUC on test set 1 but better performance on test set 2. Because test set 1 comes from an earlier data set that was published before ADMETlab [8, 10, 28,29,30], and has overlapping molecules with the training sets of several previous methods, it is likely that some molecules in test set 1 may be included in the training of ADMETlab (We were not able to obtain the training set of ADMETlab for removing of redundancy). On the other hand, the HOB data in test set 2 does not overlap with any published training sets and may server as a more objective testing of the two models.
Diversity distribution of HOB data
In this study, 1157, 290 and 141 molecules that have human oral availability data were collected for model construction. To examine the diversity of these molecules and provide a possible way to access the applicability of our model, we carried out principal component analysis (PCA) of these molecules. The 1143 fingerprints and descriptors mentioned above for model training were used to generate PCA for all compounds. We selected the two most important components to create a chemical space for characterizing training set, test set 1, and test set 2 (Fig. 3). The results suggest that the chemical space of the two test sets is roughly within the space of the training set, therefore it is sensible to use the prediction model trained by the training set to predict the HOB values for the test sets. In addition, after removing an outlier in test set 1 (circled points in Fig. 3) that are outside the PCA space of the training set, the accuracy remains unchanged on test set 2 with 50% and 20% cutoff. Moreover, the PCA analysis can be used to determine the applicability of our model to new molecules. When the projection of a new molecule is within the range of the training molecules, it is considered as “inside” the application domain, indicating a more reliable prediction.
Diversity evaluation of base learners
Diversity is very important in combining base learners. The Q-value approach is one way to measure the diversity of two classifiers . It ranges between − 1 and 1, and is 0 if two classifiers are independent. The larger the Q value, the smaller the difference between the predictions of two classifiers. We used Q-value to measure the difference between the decision trees in each model. The average Q-value for individual trees in the five random forest models when the cutoff is 50% was 0.207, 0.233, 0.27, 0.241, and 0.267. When the cutoff is 20%, the Q-values were 0.235, 0.269, 0.288, 0.293, and 0.275, which suggest that these trees have high diversity.
Importance of the input features
The analysis of important descriptors and fingerprints for prediction provides more information to fully understand these models. To this end, we used the SHapley Additive exPlanation (SHAP) algorithm to calculate the importance score of the input descriptor and fingerprints . SHAP (SHapley Additive exPlanations) is a game theory method used to explain the output of a machine learning model. It uses the classical Shapley value from game theory and its related extension to link the optimal credit allocation with local explanation.
We used the same method to analyze the contribution of each descriptor in the RF models when the cutoff is 50%, and the top 20 most important variables that contributed to the model were obtained through importance matrix plots (Additional file 1: Figs. S1–S5). The importance matrix plot for the consensus method was calculated by averaging the value in each model (Fig. 4A), which depicts the importance of each input feature in the development of the final predictive model. SsOH (number of all atoms) which is an atom type e-state descriptor contributes the most to predictive power, followed by the topological structure descriptor ATS5i and polar surface area descriptor TopoPSA (NO). In addition, we also counted the number of appearances of the features in the five individual models (Fig. 4B). SsOH, ATS4p and TopoPSA(NO) appear three times among the 20 most important descriptors of the five models. In the two methods of quantifying feature importance, the specific information of the top 20 features is sorted in Table 9.
The SHAP dependence plot is further used to understand how a single feature affects the output of the prediction models, where the color bar indicates the actual values of the features, and the SHAP values are plotted on the x-axis. The dependence plot of the consensus model when the cutoff is 50% was obtained by averaging the contributions of the individual models (Fig. 5, Additional file 1: Figs. S6–10). For each feature, a dot is created for each molecule, and positive SHAP values for a specific feature represent an increase in the predicted HOB value. For example, TopoPSA (NO) (topological polar surface area, using only nitrogen and oxygen) is an important feature that is overall negatively correlated with HOB (Fig. 5), i.e., a higher TopoPSA(NO) value will reduce the predicted HOB value. This is consistent with the finding that reducing the polar surface area increases the permeation rate of a molecule . Similarly, the TopoPSA descriptor which calculates the entire polar surface area is also negatively correlated with the HOB value.
In addition, SsOH which is the total number of OH bonds also significantly affects the prediction of HOB. The blue dots are mainly concentrated in the area where the SHAP is greater than 0, therefore a small SsOH value will increase the HOB value. Decreasing the number of OH groups will increase the hydrophobicity and membrane absorption of a molecule, therefore leading to higher HOB. This is in line with the Lipinski's ‘Rule-of-Five’ : if the number of hydrogen bond donors exceeds 5, the absorption or permeability may be poor .
It is believed that the charge state of molecules exerts a key influence on the perception of biomolecules (including membranes, enzymes and transporters) . Several features that have great influence in the consensus model, such as RPCG in RF1, ATSC0c in RF2 and RNCG in RF4 are charge related descriptors.
Using the same method, we also analyzed the model obtained with the 20% cutoff, and obtained importance matrix plots (Additional file 1: Figs. S11–S15, S21) and the dependence plots (Additional file 1: Figs. S16–S20, S22) of the individual and the consensus models. The important features from the models trained with the two cutoffs are overall consistent, e.g., TopoPSA (NO), SsOH and Autocorrelation also have a significant impact on the F = 20% consensus model as that on the F = 50% model.
Discussion and conclusions
On the basis of a comprehensive data sets collected in this study, accurate RF models for prediction of HOB were developed. Moreover, we also analyzed the importance of the features, and found that the number of OH bonds and polar surface area of a molecule are negatively correlated with the HOB value, which is consistent with molecular characteristics that affect the oral drug bioavailability. The model is available as a web server at www.icdrug.com/ICDrug/ADMET, which provides researchers with an accurate tool to quickly and automatically predict the HOB of new molecules without any machine learning or statistical modeling knowledge.
Falcon-Cano G, Molina C, Cabrera-Perez MA (2020) ADME prediction with KNIME: development and validation of a publicly available workflow for the prediction of human oral bioavailability. J Chem Inf Model 60(6):2660–2667
Lipinski CA, Lombardo F, Dominy BW et al (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 23(1–3):3–25
Hou T, Wang J, Zhang W et al (2007) ADME evaluation in drug discovery. 6. Can oral bioavailability in humans be effectively predicted by simple molecular property-based rules? J Chem Inf Model 47(2):460–3
The computer time was provided by the ECNU Multifunctional Platform for Innovation001.
This work was supported by the National Key R&D Program of China (Grant No. 2016YFA0501700), the National Natural Science Foundation of China (22033001, 21933010), and the Natural Science Foundation of Shanghai (19ZR1473600).
Authors and Affiliations
Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, Shanghai Key Laboratory of Green Chemistry & Chemical Process, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai, 200062, China
Min Wei, Xudong Zhang, Xiaolin Pan, Bo Wang, Changge Ji & John Z. H. Zhang
NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai, 200062, China
Changge Ji & John Z. H. Zhang
Department of Medicinal Chemistry, School of Pharmacy, Fudan University, Shanghai, 201203, China
Department of Chemistry, New York University, New York, NY, 10003, USA
John Z. H. Zhang
Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
John Z. H. Zhang
Collaborative Innovation Center of Extreme Optics, Shanxi University, Taiyuan, Shanxi, 030006, China
Cut-off values used in different studies and the number of positive and negative samples in each study. Table S2. List of descriptors calculated using Mordred. Table S3. The performance of the consensus model on the training set and each fold in the fivefold cross validation when the cutoff is 50%. Table S4. The performance of the consensus model on the training set and each fold in fivefold cross validation when the cutoff is 20%. Figure S1. The importance matrix plot for the RF model 1 when the cutoff is 50%. Figure S2. The importance matrix plot for the RF model 2 when the cutoff is 50%. Figure S3. The importance matrix plot for the RF model 3 when the cutoff is 50%. Figure S4. The importance matrix plot for the RF model 4 when the cutoff is 50%. Figure S5. The importance matrix plot for the RF model 5 when the cutoff is 50%. Figure S6. SHAP dependence plot of the top 20 features of the RF model 1 when the cutoff is 50%. Figure S7. SHAP dependence plot of the top 20 features of the RF model 2 when the cutoff is 50%. Figure S8. SHAP dependence plot of the top 20 features of the RF model 3 when the cutoff is 50%. Figure S9. SHAP dependence plot of the top 20 features of the RF model 4 when the cutoff is 50%. Figure S10. SHAP dependence plot of the top 20 features of the RF model 5 when the cutoff is 50%. Figure S11. The importance matrix plot for the RF model 1 when the cutoff is 20%. Figure S12. The importance matrix plot for the RF model 2 when the cutoff is 20%. Figure S13. The importance matrix plot for the RF model 3 when the cutoff is 20%. Figure S14. The importance matrix plot for the RF model 4 when the cutoff is 20%. Figure S15. The importance matrix plot for the RF model 5 when the cutoff is 20%. Figure S16. SHAP dependence plot of the top 20 features of the RF model 1 when the cutoff is 20%. Figure S17. SHAP dependence plot of the top 20 features of the RF model 2 when the cutoff is 20%. Figure S18. SHAP dependence plot of the top 20 features of the RF model 3 when the cutoff is 20%. Figure S19. SHAP dependence plot of the top 20 features of the RF model 4 when the cutoff is 20%. Figure S20. SHAP dependence plot of the top 20 features of the RF model 5 when the cutoff is 20%. Figure S21. (A) Importance matrix plot of the consensus model when the cutoff is 20%. (B) A statistical graph of the number of occurrences of the top 20 features that affect all models when the cutoff is 20%. Figure S22. SHAP dependence plot of the top 20 features of the consensus model when the cutoff is 20%.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.