ADMET evaluation in drug discovery. 20. Prediction of breast cancer resistance protein inhibition through machine learning

Breast cancer resistance protein (BCRP/ABCG2), an ATP-binding cassette (ABC) efflux transporter, plays a critical role in multi-drug resistance (MDR) to anti-cancer drugs and drug–drug interactions. The prediction of BCRP inhibition can facilitate evaluating potential drug resistance and drug–drug interactions in early stage of drug discovery. Here we reported a structurally diverse dataset consisting of 1098 BCRP inhibitors and 1701 non-inhibitors. Analysis of various physicochemical properties illustrates that BCRP inhibitors are more hydrophobic and aromatic than non-inhibitors. We then developed a series of quantitative structure–activity relationship (QSAR) models to discriminate between BCRP inhibitors and non-inhibitors. The optimal feature subset was determined by a wrapper feature selection method named rfSA (simulated annealing algorithm coupled with random forest), and the classification models were established by using seven machine learning approaches based on the optimal feature subset, including a deep learning method, two ensemble learning methods, and four classical machine learning methods. The statistical results demonstrated that three methods, including support vector machine (SVM), deep neural networks (DNN) and extreme gradient boosting (XGBoost), outperformed the others, and the SVM classifier yielded the best predictions (MCC = 0.812 and AUC = 0.958 for the test set). Then, a perturbation-based model-agnostic method was used to interpret our models and analyze the representative features for different models. The application domain analysis demonstrated the prediction reliability of our models. Moreover, the important structural fragments related to BCRP inhibition were identified by the information gain (IG) method along with the frequency analysis. In conclusion, we believe that the classification models developed in this study can be regarded as simple and accurate tools to distinguish BCRP inhibitors from non-inhibitors in drug design and discovery pipelines.


Introduction
The breast cancer resistance protein (BCRP/ABCG2) is the second member of the G subfamily of the ATPbinding cassette (ABC) transporter superfamily. It is widely distributed and expressed in many normal tissues, especially the small intestine, liver, brain endothelium, and placenta [1]. BCRP functions as a xenobiotic efflux pump by exporting antineoplastic drugs out of cells, such as topotecan and mitoxantrone, and its overexpression or activation results in multi-drug resistance (MDR) to many chemotherapeutic agents [2][3][4]. It is also a vital constituent part of blood-tissue barriers and is closely involved in absorption blocking and excretion enhancing. Therefore, BCRP has been recognized as an important player in drug absorption, distribution, metabolism, excretion, and toxicity (ADMET) [5,6], and has also been regarded by the Food and Drug Administration (FDA) as one of the key drug transporters in clinical drug disposition [4,7].
Many drugs, such as mitoxantrone and topotecan, have been identified as BCRP substrates [3,8], and BCRP inhibitors can enhance the bioavailability of these drugs [9,10]. For instance, the co-administration of topotecan (an anti-cancer drug) and GF120918 (a BCRP inhibitor) resulted in a significant increase of the systemic exposure of topotecan [9]. Gefitinib, a potent BCRP inhibitor, could considerably improve the bioavailability of irinotecan in mice [10]. Therefore, the discovery of novel BCRP inhibitors can contribute to the design of better therapeutic strategies for cancer treatment and identification of potential drug-drug interactions [11]. Many experimental techniques have been employed to measure the BCRP inhibition of drugs or drug candidates, but these assays are expensive, resource-intensive, and time-consuming. Thus, it is of critical importance to develop in silico models to predict BCRP inhibition in the process of drug design and discovery in order to reduce the failure rates and avoid adverse drug-drug interactions.
Quantitative structure-activity relationship (QSAR) analysis is one of the most classical techniques in the realm of computer-aided drug design (CADD) to correlate the chemical structures of molecules with their bioactivities. To date, numerous computational QSAR models based on various machine learning (ML) approaches and pharmacophore modeling have been proposed to predict BCRP inhibition [12][13][14][15][16][17][18][19][20][21][22] For example, in 2005, based on a dataset of 25 flavonoids, Zhang et al. [15] built a multiple linear regression model to quantitatively predict the BCRP inhibition of flavonoids. In 2007, Matsson and co-workers used the orthogonal partial least-squares projection to latent structures discriminant analysis (OPLS-DA) to construct a classification model to differentiate BCRP inhibitors from non-inhibitors with predictive accuracies of 93% and 79% for the training and test sets, respectively [16]. It is surprising that an easily interpretable model could be established based on only two descriptors: octanol-water partition coefficient at pH 7.4 and molecular polarizability. In 2013, Pan et al. generated the Bayesian classification and pharmacophore models based on a training set of 124 BCRP inhibitors and non-inhibitors, and the best Bayesian classification model achieved a Matthews correlation coefficient (MCC) of 0.69 for the test set with 79 compounds and the best pharmacophore model yielded a MCC of 0.29 for the same test set [11]. In 2014, based on a large dataset containing 780 diverse compounds, Montanari et al. also used the naïve Bayesian approach to establish a model with an encouraging global accuracy of 0.92 but a relatively low AUC (the area under the receiver operating characteristic curve) value of 0.854 for the training set under tenfold cross-validation. In addition, this model was not validated by a test set [20]. In 2015, based on a training set of 197 compounds, Belekar et al. developed a number of models to distinguish BCRP inhibitors from non-inhibitors by using support vector machine (SVM), k-nearest neighbor (k-NN), and artificial neural networks (ANN), and the models achieved global accuracies of 0.828-0.878 for the test set with 99 compounds and 0.745-0.775 for the validation set with 99 compounds [21]. Recently, Montanari et al. developed a computational model to qualitatively predict BCRP inhibition by using logistic regression (LR) based on a relatively large dataset with 978 compounds [22], and the tenfold crossvalidation of the LR model for the training set yielded a MCC of 0.65 and an AUC of 0.90. Apparently, most reported computational models for differentiating BCRP inhibitors from non-inhibitors were developed based on relatively small datasets (less than 1000 compounds) or congeneric compounds and suffered from relatively low accuracy. As a result, the reliability and general applicability of the developed models may be limited. Therefore, it is an urgent need to develop computational models with high reliability and robustness to predict BCRP inhibition based on an extensive dataset.
Recently, new artificial intelligence (AI) methods, such as ensemble learning represented by extreme gradient boosting (XGBoost) and deep learning (DL) represented by deep neural networks (DNN), have been widely employed in CADD, such as drug-target interactions prediction [23], cancer diagnosis [24], ADMET predictions and QSAR modeling [25][26][27][28][29][30][31][32][33]. DL is a categorization of artificial neural networks-based (ANNbased) ML methods that gradually extract features from the raw input using multiple layers. The application of DL in drug discovery was hindered by computability, but it can be well alleviated by using graphics processing units (GPUs) acceleration [34]. Ensemble learning represented by XGBoost attracted the attention of researchers in drug discovery due to its accurate predictions and low computational cost, and some studies reported that XGBoost could achieve comparative or even better performance than other widely-used ML methods such as RF and DNN [31]. Thus, it is a recognized need to explore how well those state-of-the-art methods perform in the prediction of BCRP inhibition. In this study, we reported a large and structurally diverse dataset with 1098 BCRP inhibitors and 1701 non-inhibitors. Then, based on the extensive dataset, the classification models to discriminate between BCRP inhibitors and non-inhibitors were developed by using seven ML approaches, including a representative DL method: DNN, two representative ensemble learning methods: stochastic gradient boosting (SGB) and XGBoost, and four traditional ML methods: naive Bayes (NB), k-NN, regularized logistic regression (RLR), and SVM. The optimal feature subset used in model development was determined by a wrapper feature selection method that couples simulated annealing (SA) algorithm and random forest (RF) and the hyperparameters were determined by the Bayesian optimization technique. To our knowledge, there are relatively few studies devoted to using new AI methods, such as ensemble learning and deep learning, to the prediction of BCRP inhibition. The robustness and reliability of the models were validated internally and externally, and the statistical results demonstrated that one traditional ML method (SVM) and two new ML methods (DNN and XGBoost) could generate reliable classification models for the prediction of BCRP inhibition, and the SVM model achieved the best predictions. The identification and analyses of representative features for different models were accomplished through a perturbation-based method. Furthermore, the important structural fragments related to BCRP inhibition were recognized by the information gain method along with the frequency analysis, which may provide valuable clues for the design of potent BCRP inhibitors and avoiding undesirable drugdrug interactions.

Dataset preparation
In this study, three data sources were integrated to build up an extensive BCRP inhibition dataset. The first data source is the dataset of 978 compounds reported by Montanari et al. [20], and it contains 433 inhibitors and 545 non-inhibitors. The second data source that has 765 BCRP inhibitors and 220 non-inhibitors were manually collected from 51 literatures [22,. However, considering the fact that BCRP non-inhibitors should be much more than inhibitors, the third data source containing 1000 non-inhibitors was also integrated into the final dataset. The 1000 non-inhibitors were extracted from the PubChem database (PubChem AID: 1325), and they have the lowest efflux pump activity, which was defined as "% inhibition of efflux pump activity". According to the authors' statement, compounds with percentage inhibition lower than 80% were decisively inactive. The molecules from the above three data sources were integrated into a single dataset and the duplicated molecules with incoherent labels were removed. Then, the salts, mixtures, and metal-organic compounds were eliminated using the wash tool in the Molecular Operating Environment (MOE) software package (version 2015), and then all the compounds were minimized with the MMFF94 force field [85]. The final dataset contains 1098 BCRP inhibitors and 1701 BCRP non-inhibitors. The whole dataset can be accessed from Additional file 1: Table S1 or the supporting website: http://cadd.zju.edu. cn/pkkb/. As far as we know, the dataset reported in this study is the largest one of its kind available to the public. The BCRP inhibitors were labeled as 1 and non-inhibitors were labeled as 0.

Generation of molecular descriptors and fingerprints
A total of 379 molecular descriptors, which characterize the physicochemical, structural and drug-like properties of the studied compounds, were calculated using the MOE 2015 as the molecular features. According to many previous studies, the addition of molecular fingerprints can improve the performance of QSAR modeling [86][87][88]. Therefore, the PubChem fingerprints (PubchemFP) with 881 substructures generated by PaDEL-Descriptor [89], which characterize 2-dimensional substructures in a binary format, were also used as molecular features. The molecular features with missing values were removed. Then, the features that have one unique value (i.e., zerovariance features) or features have very few unique values relative to the number of compounds or features have a large frequency gap between the most common value and the second most common value were removed by the nearZeroVar function in the caret package of R (version 3.5.3 ×64). In addition, the correlation between any two features was calculated and the feature that has high correlation (r > 0.90) with another feature was removed. After this step, 382 molecular features were retained and the whole dataset was randomly divided into the training and test sets with a ratio of 4:1 using a stratified sampling method. As a result, the training set contains 2240 compounds (879 inhibitors and 1361 non-inhibitors), and the test set contains 559 compounds (219 inhibitors and 340 non-inhibitors).

Feature selection by rfSA
All the calculated molecular features cannot be used for the QSAR modeling due to the curse of dimension for high-dimensional data and the phenomenon of overfitting. Therefore, a wrapper feature selection method named rfSA (SA algorithm coupled with RF) was used to remove redundant or irrelevant features without sacrificing too much information based on the training set. The SA algorithm is used to search the feature subset from the origin feature space and the performance of the RF model developed from the selected feature subset is used to guide the search. As a common method for combinatorial optimization problems, SA has good capability to avoid being trapped into local minima [90]. In the feature subset search process, the RF model developed from a new feature subset selected by the SA algorithm is compared with that developed from the previous feature subset identified in the previous step according to the RF model performance (e.g. accuracy of RF model). If the new feature subset performs better, it would be accepted, otherwise acceptance probability is calculated based on the accuracy difference between the two RF models developed from the two feature subsets and the current iteration of the search. This process is repeated until the stopping criteria are satisfied. The rfSA was implemented using a built-in safs function in the caret package of R (version 3.5.3 ×64). Here, the resample method was set as fivefold cross-validation with five repetitions to guarantee the statistical significance, where four-fifth of the training set (internal set) was used in the feature subset search conducted by SA and the remaining one-fifth (external set) was used to estimate the external accuracy. The best iteration of SA was determined by maximizing the external accuracy. The maximum iterations of the SA optimization were set to 1000. More descriptions about the feature selection process can be found in the documentations [91,92].

QSAR model construction and hyper-parameters optimization
Here, seven ML methods were employed to develop the classification models to discriminate BCRP inhibitors and non-inhibitors, including a representative DL method (DNN), two representative ensemble learning methods (SGB and XGBoost), and four traditional ML methods (NB, k-NN, RLR and SVM). The DNN method was implemented in the h2o package of R (version 3.5.3 ×64), and the other six ML methods were implemented in the caret package of R (version 3.5.3 ×64). The caret package provides miscellaneous functions for building classification and regression models and focuses on simplifying model training at the same time. The whole QSAR modeling pipeline is presented in Fig. 1. The source code that implements the workflow is available in the supplementary information (Additional file 2).

Naive Bayes (NB)
The NB algorithm is a simple and interpretable probabilistic classification method, and it estimates the corresponding class probability for an instance represented by conditionally independent feature variables based on the Bayes' theorem. Despite the simple theorem and oversimplified assumptions, NB has been extensively used in classification and achieved outstanding performance in many intricate real-world situations, such as text classification. In addition, NB is fast and efficient for large datasets, and it is less affected by "curse of dimensionality" when a large number of descriptors are used [93]. The detailed descriptions of the NB algorithm were documented previously [88].

k-Nearest neighbors (k-NN)
The k-NN algorithm is a commonly used non-parametric supervised learning approach for classification and regression [94]. The principle of this algorithm is to find the k closest training instances when a test instance is given and this test instance is predicted based on the information of the closest training instances. In our study, the "weighted voting method" was employed, Fig. 1 The workflow of QSAR modeling which weights the contributions of the k closest instances using a distance weighting function, where the closest instance contributes most to the voting and the furthest instance contributes least.

Regularized logistic regression (RLR)
As an efficient and simple classification methods, the logistic regression (LR) algorithm uses the logistic function as the link function of generalized linear model [22,95]. It is suited for conducting regression analysis where the response variable is binary. Different from conventional linear regression which fits a straight line or hyperplane, LR converts the output of a linear regression model to values between 0 and 1 using a logistic function. In this study, a regularization term by penalizing high values of the optimized parameters was introduced into the cost function of the LR algorithm to prevent over-fitting, which is the so-called regularized logistic regression (RLR).

Support vector machine (SVM)
Support vector machine is one of the most popular ML approaches in QSAR modeling. It is appropriate to develop the classification models for complex but smallor medium-sized datasets. The fundamental principle of SVM is to find a separating hyper-plane with maximum margin in the feature space for classifying instances of different categories, and the classification results produced by this hyper-plane should be insusceptible to local perturbations of training instances. In this study, the radial basis function (RBF) was used as the kernel.

Ensemble learning
Ensemble learning is a group of popular algorithms that can produce a strong learner in the form of an ensemble of weak learners. Gradient boosting is one of the most representative ensemble learning methods and it improves the strong learner gradually through fitting the new weak learner to residuals: an approximation of the negative gradient of the loss function being minimized. Here two gradient boosting-based methods, SGB and XGBoost, were employed to build the classification models. In SGB, a subsample of the training data is selected randomly to fit the new weak learner and update the strong learner for the current iteration, and this randomized approach increases both the approximation accuracy and execution speed of gradient boosting [96]. In XGBoost, some optimization algorithms are introduced based on the gradient boosting framework, such as minor improvements in the loss function by penalizing the complexity of the model, introduction of shrinkage and column subsampling for further preventing overfitting, and employment of sparsity-aware split finding technique for efficient training on sparse data, etc. The significant predictive power of XGBoost on QSAR modeling has been illustrated in previous studies [29,31].

Deep neural networks (DNN)
Deep learning has revolutionized many applications such as natural language processing, computer vision, and speech recognition due to its capability to learn complicated and rapidly-varying non-linear functions and extract a hierarchy of useful features from input [97]. DNN is one of the typical DL models which have an input layer, an output layer, and many hidden layers (typically more than three hidden layers). It is inspired from biological neurons networks and the basic component in DNN is the neuron model. In 2012, Merck sponsored a molecular activity challenge at the Kaggle data mining platform, and the competition was won by a group that utilized DNN as the main ML technique [25]. In addition, the Tox21 data challenge 2014 for toxicity assessment was won by a group that also utilized DNN as the main ML technique [26]. In DNN, each neuron receives the input signals from its connected neurons, and those input signals are multiplied by their respective weights and then summed up. The weighted sum plus the bias in each neuron to feed into the activation function and produce an output. Here the weights and biases in each neuron are the parameters in DNN and they are typically learned by error back-propagation algorithm in combination with optimization method such as stochastic gradient descent. It is commonly known that DNN learns to pick up important features from the original feature space by down-weighing irrelevant features. However, for the sake of fairness, DNN was trained based on the selected features from the training set although it seems redundant or not necessary.

Model hyper-parameters optimization
The performance of the models developed by ML approaches are closely related to important hyper-parameters. In this study, the hyper-parameters were optimized by the Bayesian optimization (aka model-based optimization) algorithm based on the MCC value of the fivefold cross-validation to the training set [98]. As a compelling global optimization method for black-box functions, the Bayesian optimization framework can obtain an ideal solution only after a few objective function evaluations by designing appropriate surrogate model and acquisition function. Before Bayesian optimization, a coarse hyper-parameter tuning based on grid search within relatively wide ranges was performed to determine a smaller region of hyper-parameters where the models work well. The Bayesian optimization was then used to zoom into these smaller regions and find more optimal settings of hyper-parameters. The Bayesian optimization algorithm was implemented by the mlrMBO package in R (version 3.5.3 ×64). The search space of the Bayesian optimization and the optimal hyper-parameters for all the studied models are summarized in Additional file 1: Table S2. It should be noted that some other important hyperparameters of the DNN model were fixed in our study. Here, the optimizer of the DNN modeling was set to an adaptive learning rate algorithm: ADADELTA, and the activation function was set to ReLU which is highly recommended for DNN according to the previous studies [25,99]. All the tested DNN configurations were trained for 300 epochs.

Statistical validation of the QSAR models
According to the Organization for Economic Co-operation and Development (OECD) Validation Principle [4,100], a QSAR model should be associated with appropriate measures of goodness-of-fit, robustness, and predictivity. Here, the random fivefold cross-validation of the training set was used to evaluate the robustness of each model, and the external validation given by the predictions to the test set was used to measure the actual predictivity of each model. In addition to the random fivefold cross-validation, the performance of the seven studied ML algorithms was also evaluated by the cluster crossvalidation method proposed by Mayr et al. [101,102]. The agglomerative hierarchical clustering using complete linkage was used to identify different compound clusters. The distance between any two compounds was measured by the Tanimoto similarity based on the PubChem fingerprints. The maximum distance between any identified clusters was set to 0.7 and the generated smaller compound clusters were distributed across the fivefolds randomly. The following statistical parameters based on the confusion matrix were used to assess the performance of Moreover, the residual (the binary cross entropy was used as the residual function in this study) distribution plots for the fivefold cross-validated training set and the test set were used to further diagnose the quality of the classification models. Different from the confusion matrix based statistical parameters which only supports the predicted class of a compound, it is capable of providing information about the predicted class probability distribution of a compound. For example, for an inhibitor: I [the class probability distribution is (0,1)] and a non-inhibitor: NI [the class probability distribution is (1,0)], the predictions (class probability) for I and NI given by model A are (0.4, 0.6) and (0.65, 0.35), respectively and those given by model B are (0.1, 0.9) and (0.95, 0.05), respectively. Both models give the same confusion matrix and the statistical parameters would be identical [for the majority of ML approaches, the class of a certain compound was derived from its class probability distribution. If the probability for a certain class is larger than or equal to a certain threshold (generally 0.5), this compound would be predicted as the corresponding class], which model should be used in this case? Actually, we still believe that model B is better than model A because the residual to the referenced class probability distribution of model B is smaller than that of model A. In this study, the binary cross entropy was used as the residual function, which was defined as: where y is the true class label of compound (1 for inhibitor and 0 for non-inhibitor); p is the probability of being inhibitor for a compound and the greater the binary cross entropy is, the larger the residual is, and vice versa.

Definition of application domain (AD)
Defining AD is an important step for QSAR modeling in the OECD guidelines because a developed model can only be expected to give reliable predictions for query chemicals that fall within the AD [100]. In this study, the non-parametric probability density distribution-based method was applied to identify the AD of a QSAR model. This method estimates the probability density function for the given data and it can be mainly classified into two categories: parametric and non-parametric approaches. Parametric approaches assume the data distribution follows a standard distribution, such as Gaussian or Poisson distribution. Non-parametric approaches do not make any assumptions about the data distribution and they can estimate the probability density solely from data [103,104]. An advantage of non-parametric approaches is that they can identify internal empty spaces, and it has been argued that they are more accurate and appropriate than other common approaches, such as the range, distance and leverage approaches [103]. More details of the probability density distribution-based method have been documented in literature [103,104]. The AD analysis was conducted by the AMBIT Discovery software (version 0.04) [105].

Interpretation of QSAR models
Identification and analysis of important features may contribute to a better understanding of QSAR models and provide new domain knowledge. Here the agnostic feature importance assessment algorithm, a perturbation-based method, was employed to identify the most important features for different well-performing QSAR models. This algorithm is model agnostic and not relevant to the specific structure of a model. A couple of previous studies only identified and analyzed the important features given by feature selection procedure for QSAR models [21,29,106] but did not explore the relationships between features and different QSAR models. It is quite possible that a set of important descriptors/fingerprints for a well-performing QSAR model may be not important to another QSAR model. In view of that, this perturbation-based model-agnostic method was used to recognize important features for different well-performing QSAR models and further compare the important features between models for interpreting how different QSAR models handle particular features. The basic principle of this method is to evaluate how much the model performance is gone from the performance of the original model after the selected feature is perturbed or resampled. The more the model performance is gone, the more important the feature is, and vice versa. In a more specific way, L = L Y , f (X) denotes the loss function of a model where Y is the observations for all instances and f (X) is the predictions of all instances. The flowing steps were conducted for each single feature v i in the V : first, a perturbed data X * ,−i with feature v i perturbed or resampled was generated; second, the model predictions f X * ,−i on the perturbed data were calculated; then, the loss function L * ,−i = L Y , f X * ,−i on the perturbed data was obtained; finally, the feature importance of v i is measured as the difference or ratio of the original loss L and loss on perturbed data L * ,−i [107,108]. Here, the binary cross entropy was used as the loss function, and the identification and analysis procedure was implemented in the DALEX package of R (version 3.5.3 ×64). In addition, the analysis for each single feature was repeated 10 times to guarantee the repeatability of results due to the perturbation randomness and the importance score was obtained by averaging the results of the 10 times repetitions.

Analysis of important fragments to characterize BCRP inhibitors and non-inhibitors
In order to better understand important chemical features related to BCRP inhibition, each compound was decomposed into three structural fragments and analyzed. The generated structural fragments include contiguous ring systems (Ring Assemblies), contiguous ring systems that share two or more bonds (Bridge Assemblies), and contiguous ring systems plus chains that link two or more rings (BemisMurcko Assemblies). The details of these three structural fragments can be found in previous studies [109]. The structural fragments for all the compounds were generated using the Generate Fragments component in Pipeline Pilot 2017, and the Merge Molecules component was then used to merge records containing identical fragments into a single record for the next statistical analysis.
The important fragments that have large contributions to BCRP inhibitors or non-inhibitors were identified by the information gain (IG) method coupled with frequency analysis [110,111]. The IG value of a fragment was used to measure its importance to the classification system, and the fragments with low IG values are less effective in a classification system and vice versa. In this study, the IG value of a fragment was calculated as followings: where V represents the possible value of a fragment (0 or 1); 0 represents the fragment is absent in a compound and 1 represents the fragment is present in a compound. N denotes the total number of compounds; N v is the number of compounds where the fragment is present or absent. Ent(D) and Ent D v represent the information entropy and conditional entropy of all the compounds, respectively, and both of them were calculated by Eq. 6, where K represents the classes of the compounds (1 represents inhibitors and 0 represents non-inhibitors), and p k is the ratio of each class compounds. The frequency of a fragment was defined as: where N fragment_class is the number of compounds containing the fragment in each class; N total is the total number of compounds; N fragment_total is the total number of compounds containing the fragment; N class is the number of compounds in each class. If the occurrence of a fragment is more frequent in inhibitors than in noninhibitors, this fragment was considered as an important fragment for the inhibitors and vice versa.

Dataset collection
Except 978 BCRP inhibitors and non-inhibitors reported by Montanari et al. [20] and 1000 non-inhibitors extracted from the PubChem database (PubChem AID: 1325), 985 BCRP inhibitors and non-inhibitors collected manually from 51 publications were integrated into the whole BCRP inhibition dataset. The basic information of the compounds in these 51 publications, including compound name, compound structure, activity source, activity index, and reported value, were collected manually and stored into a Molecular Operating Environment (MOE) database. The reported inhibition activity index is represented by IC 50 , EC 50 , %inhibition at a given concentration, or substrate fold-increase, etc. The experimental assays to determine the BCRP inhibition activity can be roughly categorized into two types: (1) fluorescent or radioactivity substrate intracellular accumulation assay and (2) membrane vesicular transport assay. In the fluorescent or radioactivity substrate intracellular accumulation assay, cells overexpressing BCRP are incubated with both fluorescent or radiolabeled substrate (pheophorbide A, mitoxantrone, BODIPY-prazosin, or Hoechst 33342) and the tested compound [52], and the quantity of fluorescence or radioactivity accumulation was measured and it is related to the inhibitory effect of the tested compound. The membrane vesicular transport assay employs inside-out-oriented membrane vesicles, prepared from cells expressing BCRP, to measure the influx of radioactive substrate such as 3 H-methotrexate [19], and it can be used as an indirect (inhibition-type) way to determine the inhibitory effect of a tested compound [112].
Because the inhomogeneity of the experimental data, we used the following rules to convert the reported values into the inhibitor and non-inhibitor classes. First, the suggestions recommended by the authors were used to assign the possible classification threshold or the compound category. It should be noted that the suggestions recommended by the authors may be arbitrary, but it is the situation we are facing now. For example, Gros (7) frequency of a fragment = N fragment_class × N total N fragment_total × N class et al. [40] checked if some compounds could inhibit the transport activity of BCRP, and the authors found that a portion of the studied compounds had no any effect on the mitoxantrone efflux and therefore they were not inhibitors of BCRP. Naturally, those compounds were considered to be non-inhibitors in our study. Elsby et al. [36] suggested a rheumatoid arthritis candidate drug (AZD9056) was an inhibitor of BCRP based on biological activity assay. Similarly, AZD9056 was considered as an inhibitor. The study reported by Sjostedt et al. [73] defined compounds that result in normalized transport values below 50% as BCRP inhibitors. Second, the IC 50 or EC 50 value was used to categorize a compound. More specifically, a compound with IC 50 or EC 50 < 10 μM was classified as an inhibitor, and a compound with IC 50 or EC 50 > 50 μM was classified as a non-inhibitor, and those compounds with IC 50 or EC 50 between 10 and 50 μM were excluded because the testing results are susceptible to experimental conditions. In addition, for compounds that have multiple reported IC 50 or EC 50 values, the average values were used to assign inhibitors or non-inhibitors. But if the discrepancy between multiple values was significant for a compound, such compound was excluded. Third, for the activity index: %inhibition at a given concentration, only the concentration at l0 μM was taken into consideration. A compound with %inhibition at 10 μM > 50% was classified as an inhibitor, and the compound with %inhibition at 10 μM < 25% was classified as a non-inhibitor. Similarly, the compounds with %inhibition at 10 μM ranging from 25 to 50% were removed from the dataset owing to their susceptibilities to experimental conditions. Unfortunately, some compounds still could not be categorized after this step. For those compounds, the "outcome" column of the PubChem database was employed to define the compound category. The "outcome" column can be one of the following four values, including "Active", "Inactive", "Inconclusive", and "Unspecified", and it is in some ways equivalent to the author's subjective definition of what the compound is. In the light of the "outcome" column, the compounds with the "Active" label were considered as inhibitors in our study, and those "Inactive" labeled compounds were considered as non-inhibitors. Certainly, "Inconclusive" or "Unspecified" labeled compounds were removed.
It should be noted that the inhibitory mechanisms of different compounds towards BCRP may vary from one to another and the BCRP inhibitors can be approximately classified into the following three categories: [4] (a) compounds are regarded as "general" inhibitors to BCRP if they are able to inhibit ATPase activity of BCRP, such as the compounds No. (c) compounds are not the substrates of BCRP, but they can bind to BCRP and induce conformational changes to affect the transport function of BCRP, for example, the compound No. 1633 (nelfinavir) in the final dataset. To better understand the complicated molecular mechanisms of BCRP inhibition, it is highly valuable to develop computational classification model for "general" inhibitors and "non-general" inhibitors, respectively. However, considering the fact that the inhibitory mechanisms for most inhibitors in the final dataset are still not clear, in this study, "global" classification models were developed to discriminate BCRP inhibitors and BCRP non-inhibitors with no consideration of their specific inhibitory mechanisms.
The Tanimoto similarity index based on the 881 PubChem fingerprints was used to evaluate the chemical diversity of the final dataset. As shown in Fig. 2, for the overwhelming majority of the compounds, the Tanimoto similarities between any two of compounds are relatively low and the average Tanimoto similarity for the whole dataset is 0.464, highlighting the structural diversity of the final dataset.

The impact of simple molecular descriptors on BCRP inhibition
Various simple molecular descriptors have been used extensively in ADMET prediction and proven to be helpful [86], and they can characterize molecular properties from different aspects, such as hydrophobicity, flexibility, H-bonding ability, etc. It is interesting to analyze whether BCRP inhibitors and non-inhibitors can be distinguished by any of those molecular descriptors. Thus, we systematically explored the discrimination capability of eight molecular descriptors on BCRP inhibition. The examined descriptors include octanol-water partition coefficient (SlogP), aqueous solubility (logS), H-bond acceptor count (a_acc), H-bond donor count (a_don), flexible rotatable bond count (b_rotN), molecular flexibility index (Kier-Flex), aromatic atom count (a_aro), and aromatic bond count (b_aro). A non-parametric test method, known as Mann-Whitney U-test, was used to assess the significance of the difference between the means for inhibitors and non-inhibitors. Compared with Student's t-test, Mann-Whitney U-test gains a competitive advantage because it does not require the assumption of normal distributions for the dataset. The distributions of these eight descriptors for the BCRP inhibitors and non-inhibitors are shown in Fig. 3. Besides, the means and associated p-values of the eight descriptors between the inhibitors and non-inhibitors are listed in Table 1.
In the eight studied descriptors, both SlogP and logS are related to the hydrophobicity and solubility of a molecule. As shown in Table 1, the mean values of SlogP for the inhibitors versus non-inhibitors are 4.26 and 3.29, respectively, and those of logS for the inhibitors versus non-inhibitors are − 6.07 and − 5.25, respectively. In other words, BCRP inhibitors are more hydrophobic and less soluble than non-inhibitors, which is consistent with the finding that the pharmacophore models for BCRP inhibitors contain hydrophobic features reported in the previous studies [11,113]. By investigating the molecular descriptors that determine the interaction between BCRP and its inhibitors, Matsson et al. found that octanol-water partition coefficient at pH 7.4 (logD 7.4 ) is an influential descriptor for the binding of inhibitors to BCRP [16]. The p-value for SlogP is 2.16 × 10 −50 at the 95% confidence level, suggesting that the SlogP distributions for the inhibitors and non-inhibitors are remarkably different. Compared with the p-value for SlogP, that for logS is relatively higher (4.63 × 10 −27 ).
As shown in Fig. 3 and Table 1, we can observe that the two H-bond features, a_acc and a_don, demonstrate relatively high impact on BCRP inhibition. The mean values of a_acc for the inhibitors and non-inhibitors are 4.10 and 3.70, respectively, and the associated p-value is 1.32 × 10 −9 . The mean values of a_don for the inhibitors Fig. 2 The chemical diversity of the final dataset evaluated by the Tanimoto similarities between any two of the compounds. The Tanimoto index was calculated based on the 881 PubChem fingerprints and non-inhibitors are 1.19 and 1.41, respectively, and the associated p-value is 1.12 × 10 −8 . That is to say, BCRP inhibitors tend to contain more H-bond acceptors and less H-bond donors than BCRP non-inhibitors. This finding was supported by the previous studies that H-bond acceptors are quite important for BCRP inhibition [11,16,114]. The descriptors b_rotN and KierFlex are mainly used to describe the flexibility of a molecule. As can be seen from Table 1, the mean values of b_rotN for the two classes are very close and the associated p-value is 6.38 × 10 −1 at the 95% confidence level, showing that the distribution of b_rotN for inhibitors and non-inhibitors is not statistically significant, and it is the same situation for KierFlex but its associated p-value is 3.14 × 10 −4 at the 95% confidence level, demonstrating that the difference between the two distributions of KierFlex is statistically significant. However, as can be seen from Fig. 3 and Table 1, the distribution differences of these two properties between the two classes are not as significant as some other properties, such as hydrophobicity-related descriptors. The other two descriptors, a_aro and b_aro, are related to the aromaticity of a molecule. The mean values of a_aro for the inhibitors and non-inhibitors are 17.51 and 13.46, respectively, and the mean values of b_aro for the inhibitors and non-inhibitors are 18.01 and 13.73, respectively, which indicates that BCRP inhibitors exhibit stronger aromaticity than non-inhibitors. Our finding is in good agreement with the previous observations that aromaticity is an important molecular feature for BCRP inhibitors revealed by the computational model developed by Matsson and co-workers [115]. As can be seen from Fig. 3, the distributions of these two aromaticity-related descriptors differ notably between the two classes, and the associated p-values are 6.90 × 10 −75 and 5.90 × 10 −80 , respectively, indicating that aromaticity is the best indicator to discriminate inhibitors from non-inhibitors. In summary, our findings imply that BCRP inhibitors are generally more hydrophobic and aromatic than non-inhibitors, which are consistent with the results reported by the previous studies [16,73,115,116]. However, as shown in Fig. 3, for any descriptor, the distributions for the BCRP inhibitors and non-inhibitors still overlap largely, suggesting that any single molecular property cannot discriminate between BCRP inhibitors and non-inhibitors effectively.

Feature selection based on rfSA
At first, a total of 1260 molecular features, including 379 molecular descriptors and 881 PubChem fingerprints, were calculated for each molecule in the final dataset.
After eliminating the features with missing values and the redundant features, a smaller feature subset containing 382 molecular features was obtained and normalized. Then, a wrapper feature selection method known as rfSA was employed to identify the most representative features for QSAR modeling based on the training set. As shown in Fig. 4, the mean predictive accuracy of the external set in the fivefold cross-validation with five repetitions has reached the maximum at the 873th iteration (0.896). Therefore, the best iteration for the SA optimization was set to 873, and the SA algorithm at this iteration selected 144 features. As mentioned earlier, the selected 144 features were regarded as the optimal feature subset, where includes 65 molecular descriptors and 79 fingerprints. The detailed descriptions of these representative descriptors and fingerprints are listed in Additional file 1: Tables S3 and S4.
The chemical space distributions of the training and test sets were characterized by the scatter plots of the first three principal components derived from the principal component analysis (PCA) based on the 144 molecular features. Furthermore, the scattered distributions of molecular weight and SlogP [117] were also used to determine the chemical space distributions. As shown in Fig. 5a-d, the chemical space of the test set roughly falls Table 1 The  within that of the training set, implying that it is feasible and reliable to use the test set to validate the prediction performance and generalization capability of the QSAR models developed from the training set.

Comparison of various classification models for BCRP inhibition
Base on the 144 molecular features selected from the training set, seven ML methods, including one DL method (DNN), two ensemble learning methods (XGBoost and SGB) and four traditional ML methods (NB, k-NN, RLR, and SVM), were used to develop the QSAR models to distinguish BCRP inhibitors from noninhibitors. In model development, the Bayesian optimization technique was employed to determine the optimal settings of hyper-parameter for all the methods. The statistical results for the random fivefold cross-validated training and test sets given by the optimized QSAR models are summarized in Table 2. As shown in Table 2, it can be observed that four models (i.e., SVM, DNN, XGBoost, and SGB) give satisfactory statistical parameters for both the fivefold cross-validated training set and the test set. It is obvious that they have potential advantage over the other three traditional models (i.e., NB, RLR, k-NN) on the BCRP inhibition prediction in general. Here, SVM, a classical ML method, yields the best predictions for the test set when MCC and GA were used as the main criteria for the classification ability of models (MCC = 0.812 and GA = 0.911). Compared with GA, BA may give a more comprehensive consideration on sensitivity and specificity and it is capable of avoiding exaggerated assessment of model performance on an unbalanced data set. According to BA, the SVM model also gives the best prediction (BA = 0.905) on the test set. However, we found that the other three methods, including DNN, XGBoost, and SGB, showed similar performance to SVM (Table 2). Therefore, it is difficult to determine which model is the best only according to the confusion matrix based statistical parameters. In view of that, the residual distribution plot was used to further diagnose the quality of those four well-performing models (i.e., SVM, DNN, XGBoost, and SGB). As shown in Fig. 6, the residuals derived from the DNN model are larger than those derived from the other three models for both the fivefold cross-validated training set and test set, implying that the DNN model gives the worst class probability estimation for some compounds although it may still give correct classification for them. As shown in Fig. 6, the residual distribution of the XGBoost model is similar to that of the SVM model. However, it should be noted that, for those compounds with smaller residuals (smaller than 0.7), the DNN model  . 6 The residual (binary cross entropy) distribution plots of the a fivefold cross-validated training set and b test set (b) for the four well-performing QSAR models (DNN, SGB, XGBoost, and SVM). X axis represents the value of binary cross entropy, and Y axis represents the ratio of number of compounds with the residuals higher than a value to the total number of compounds in the training or test set gives the best class probability estimation and the other three models show similar performance. In conclusion, based on the statistical parameters and residual distributions, SVM, DNN and XGBoost were recommended to develop the classification models for BCRP inhibition, and among them, SVM is the best choice. Notably, considerable efforts have been dedicated to explore the performance of DNN on QSAR modeling and many previous studies demonstrated that DNN is the best classifier compared with other conventional methods such as NB, RF, and SVM [32,99,118]. In this study, the DNN model is an outstanding but not the best classifier for BCRP inhibition prediction. As all we know, the DNN method can benefit from big data and is able to perform automatic representation learning from original feature space without the requirement of feature selection. In this study, it is possible that the performance of DNN was limited by the size of the dataset, the number of features used, and even the DNN implementation or hardware used. Finally, two consensus models (Consen-sus1 and Consensus2) were developed by combining the outputs from multiple models. Consensus1 is a class label consensus model based on the predictions given by the SVM, DNN, and XGBoost models, and Consen-sus2 is a class probability consensus model based on the predictions given by the SVM, DNN, and XGBoost models. All the contributions from the individual models were equal, and we could avoid overemphasizing any ML approach. As shown in Table 2, the two consensus models did not improve the predictivity obviously. However, the class probability consensus model gives the best AUC values for both the fivefold cross-validated training set (AUC = 0.956) and test set (AUC = 0.963). It could be found that the class probability consensus model is slightly better than the class label consensus model.
As described earlier, in addition to the evaluation for the random fivefold cross-validated training set, we also performed a cluster fivefold cross-validation for the training set to check the predictive capacity of the wellestablished model for the unseen compounds. A threshold value of 0.7 for complete linkage clustering identified 213 different compound clusters and the random distribution of them across fivefolds results in the numbers of compounds for different folds are 445, 447, 452, 442, and 454, respectively. Based on the cluster fivefold cross-validated training set, the seven ML methods were employed to developed the classification models with the same hyper-parameters determined in the random fivefold cross-validation and the optimal feature subsets determined by rfSA. As shown in Table 3, the four ML approaches (i.e., SVM, DNN, XGBoost, and SGB) that have better performance for the random fivefold crossvalidated training set also give satisfactory predictions to the clustered fivefold cross-validated training set. Moreover, the SVM model gives the best MCC value of 0.810 to the cluster fivefold cross-validated training set, illustrating that the models developed in this study can also be used to predict the compounds with new scaffolds. Briefly, the SVM model gives the best predictions to both the training and test set. The AD statistical results defined by the non-parametric probability density distribution-based method shown in Table 4 illustrate that only 11 compounds are outside the AD. The AD coverages for the training and test sets are 100% and 98%, respectively, indicating that the predictions of the test set are reliable.

Comparison with published models and interpretation of models
In order to further estimate the prediction capacities of the built models, the performance of our models was compared with those of some models reported in the previous studies. As shown in Table 5, a majority of the reported classification models for differentiating BCRP inhibitors and non-inhibitors were developed using conventional ML methods such as NB, k-NN, SVM and RF based on relatively small datasets less than 1000   Model interpretation is a recommended procedure during QSAR modeling in the OECD principles [100].
Here a perturbation-based model-agnostic method was employed to identify and analyze the important features for the SVM, DNN, XGBoost and SGB models. The top ten descriptors for each model are shown in Fig. 7. It can be observed that the DNN model has the highest average binary cross entropy for the test set (0.476), and the SVM model has the lowest value (0.249). Those results are in good agreement with the observed residual distributions shown in Fig. 6 that the residual distribution curve of the DNN model is the highest, to some extent, among the four curves and that of the SVM model is the lowest. Another interesting finding is that different important features were identified for different models, indicating that different methods handle the features in a very varying ways and it potentially depends on the basic principle of approach. Besides, it is quite possible that a set of important descriptors/fingerprints for a well-performing model may be not suitable for another model.
However, we found that some descriptors/fingerprints related to molecular lipophilicity (BCUT_ SLOGP_2, GCUT_SLOGP_0, and SlogP_VSA5), molecular surface area, volume and shape (std_dim2, glob, and vsurf_descriptors) and electronic properties (PEOE_VSA_ descriptors and Q_VSA_POS) were captured by most well-performing models. The importance of molecular lipophilicity descriptors in the development of QSAR models to distinguish BCRP inhibitors from non-inhibitors has been highlighted by the previous studies [8,15,21]. Moreover, according to the two-step mechanism for the inhibition of Fig. 7 The top ten descriptors/fingerprints for the a DNN, b XGBoost, c SGB and d SVM models identified by the perturbation-based model-agnostic method with 10 repetitions. The term "full_model_" denotes the estimation of a model performance on the test set based on the cross entropy loss function BCRP proposed by Matsson et al., high lipophilicity is a prerequisite for an inhibitor to reach the binding site of BCRP in the first step, and then geometrical complementary, H-bonding and π-π interactions, which can be described by the shape, H-bond and electronic descriptors (such as std_dim2, PEOE_VSA_FPNEG and PEOE_VSA+4) [115], determine the binding of inhibitors to BCRP in the second step.
It has been reported that the addition of fingerprints can significantly improve the prediction accuracy for many QSAR models [86,88]. As shown in Fig. 7, some important features are fingerprints. The fingerprint PubchemFP758 (Cc1c(N)cccc1) was identified by three well-performing models (XGBoost, SGB and SVM) and PubchemFP293 (C-S) was identified by two well-performing models (XGBoost and SGB). We then analyzed the frequencies of these two important fingerprints, and found that their distributions for the inhibitors and noninhibitors are quite different: 48.4% inhibitors (531 out of 1098) and 20.1% non-inhibitors (342 out of 1701) contain PubChem758, respectively, and 47.2% non-inhibitors (803 out of 1701) and 6.2% inhibitors (68 out of 1098) contain PubChem293, respectively. The frequency analysis of PubchemFP758 that is a o-Toluidine fragment implies that BCRP inhibitors are more aromatic than non-inhibitors, which is in line with the results reported by Sjöstedt et al. and Matsson et al. that aromaticity is an important feature of BCRP inhibitors [73,115]. PubChem293 contains a sulfur atom, which is supported by the results reported by Montanari and Ecker that the presence of a sulfur atom is unfavorable to BCRP inhibition due to its steric hindrance in the binding site [20].

The important fragments identified by the IG method
As described in "Materials and methods", three kinds of fragments were generated for all the compounds and the frequency values of generated fragments in the inhibitor class and non-inhibitor class were calculated. If the occurrence of a fragment is more frequent in the inhibitor class than in non-inhibitor class, this fragment was considered as an important fragment for the inhibitor class, and it has positive contribution to BCRP inhibition and vice versa. The IG value was also calculated for all the generated fragments to identify the important structural fragments to BCRP inhibition.
As shown in Fig. 8, the IG values of the fragments range from 0 to 0.067, and most fragments have relatively low IG values. Analyses of the IG values and frequencies of the fragments show that 12 positive fragments are responsible for BCRP inhibition ( Table 6). All of them have high IG values and their frequencies for the inhibitor and non-inhibitor classes are obviously different. Similarly, 11 negative fragments were identified to be responsible for BCRP non-inhibition ( Table 7).
As shown in Tables 6 and 7, it can be observed that those positive fragments contain more abundant nitrogen atoms, and 6 of them contain at least three nitrogen atoms (No. 2, 3, 6, 8, 11, and 12). However, most negative fragments have only one or two nitrogen atoms or even do not contain any nitrogen atom. In addition, it is obvious that the positive fragments are more planar. According to the previous studies, it is suggested that the abundance of nitrogen atoms and molecular planarity have a high impact on BCRP inhibition [16,20]. Among the 12 positive fragments, No. 2 fragment (N,2-diphenylquinazolin-4-amine) has been reported to be favorable to BCRP inhibition [119]. In addition, It is obvious that almost all the negative fragments, with the exception of No. 19, contain only one individual ring in their structures, but most positive fragments contain at least two individual rings. This is in good agreement with the results that BCRP inhibitors tend to have more rings than non-inhibitors [73]. The 11 fragments shown in Table 7 have negative contributions to BCRP inhibition. It is obvious that many fragments do not contain planar components, and they have unfavorable effects on BCRP inhibition because molecular planarity is critical to BCRP inhibition as mentioned above. Many negative fragments contain a sulfate group, including No. 14, 15, 18, 19, and 20. According to the previous study, it is noted that the presence of sulfate is unfavorable to BCRP inhibition because sulfate is bulky, which may lead to some kinds of steric hindrance in the binding site [20].
Generally speaking, the positive fragments are larger than negative fragments, and contain more rings and

Analysis of the misclassified compounds
As discussed above, the SVM model developed based on the selected features from the training set yields the best predictions for the test set, but it still misclassified 50 compounds in the test set. These misclassified compounds include 27 false negatives and 23 false positives. The detailed descriptions of the misclassified molecules are listed in Additional file 1: Table S5. First, we analyzed the scaffolds of the 50 misclassified compounds. Each compound was decomposed into three kinds of structural scaffolds including Ring Assemblies, Bridge Assemblies, and BemisMurcko Assemblies. For the generated fragments whose counts in the misclassified compounds are equal or larger than 2 were kept and their occurrence ratios were calculated. In the same way, their counts and occurrence ratios were also calculated for the training and test sets ( Table 8). Most of those representative fragments contain hetero-cycles with nitrogen, oxygen or sulfur atom. These fragments are not abundant in the whole dataset (average occurrence ratio in the dataset is 3.65%) and some fragments in the test set are not enriched in the training set (fragments No. 26 and 36). It is quite possible that the SVM model could not give accurate predictions to those infrequent fragments.
Among the 50 misclassified compounds, except compound No. 221 that only contains a single ring, most of them have complicated structures. It is quite possible that some specific groups in those complex compounds need specific spatial arrangements to form favorable interaction with BCRP, which may not be well characterized by traditional descriptors or fingerprints [29]. Furthermore, in this study, the reported experimental data were binarized according to a predefined threshold, and therefore the compounds with experimental data close to the threshold may be more likely to be misclassified than those with experimental data far from the threshold, which has been well explained by Wu et al. [120]. In addition, existence of a few activity cliffs (similar structures but different activities) that distort the models may lead to misclassifications [29]. For example, in the 48 misclassified compounds, a pair of compounds (compounds No. 90 and 1360) may be a pair of activity cliffs (Tanimoto similarities based on the 881 PubChem fingerprints are 0.923).
Here, five of the 27 misclassified inhibitors (compounds No. 3, No. 472, No. 1606, No. 1657, andNo. 1806) contain a sulfate group, which is unfavorable to BCRP inhibition as mentioned above [20]. Thus, this is one of the potential reasons why they were easily predicted as noninhibitors. Two misclassified non-inhibitors (compounds No. 237 and No. 239) have abundant methoxyl (-OCH 3 ) groups. The methoxyl groups observed in most QSAR studies could be interpreted as favorable H-bond acceptors [8], which may have positive contributions to BCRP inhibition [11,16,114].

Conclusion
As an important efflux transporter, BCRP plays a crucial role in drug metabolism, multi-drug resistance to chemotherapeutic agents, and drug-drug interactions. It is an increasingly important area to develop computational models with high reliability and robustness to predict BCRP inhibition. In this study, we reported an extensive and structurally diverse BCRP inhibition dataset, where includes 1098 BCRP inhibitors and 1701 non-inhibitors. The relationships between the eight simple molecular descriptors and BCRP inhibition were examined. It is observed that BCRP inhibitors were characterized by higher hydrophobicity and aromaticity. The optimal feature subset of 144 molecular features was determined based on a wrapper feature selection named rfSA. Then, based on the extensive BCRP inhibition dataset and the optimal subset, seven ML approaches, including one DL method: DNN, two ensemble learning methods: SGB and XGBoost, and four classical methods: NB, weighted k-NN, RLR, and SVM, were used to develop a number of accurate classification models. Bayesian optimization technique was used to select the optimal hyper-parameters for  all the models. This is the first attempt that the new AI methods (i.e., XGBoost and DNN) were used to predict BCRP inhibition. The statistical results indicated that one conventional ML method (SVM) and two new ML methods (XGBoost and DNN) outperformed the other three traditional ML methods, and among them SVM yielded the best prediction capability. Then, a perturbation-based method was used to interpret our models and analyze the representative features for different models. In addition, the favorable and unfavorable fragments to BCRP inhibition identified by information gain (IG) method along with frequency analysis were then analyzed. The results may provide some valuable information for the design and discovery of BCRP inhibitors.
Additional file 1: Table S1. The detailed descriptions of the final dataset containing 1098 BCRP inhibitors and 1701 BCRP non-inhibitors. Table S2.
The searching space of the Bayesian optimization and the optimal hyperparameters determined by the Bayesian optimization for all the studied models; Table S3. The detailed descriptions of the 65 representative descriptors chosen by rfSA; Table S4. The detailed descriptions of the 79 representative fingerprints chosen by rfSA; Table S5. The detailed descriptions of the misclassified molecules given by the SVM mode1.
Additional file 2. The R source code that implements the workflow.