Machine learning (ML) models require an extensive, user-driven selection of molecular descriptors in order to learn from chemical structures and predict actives and inactives with high reliability. In addition, privacy concerns often restrict access to sufficient data, leading to models that cover only a narrow chemical space. Therefore, we propose a framework of re-trainable models that can be transferred from one local instance to another and that require a less extensive descriptor selection. The models are shared via a Jupyter Notebook, which allows a broader chemical space to be evaluated and implemented while most of the tunable parameters remain pre-defined. This enables the models to be updated in a decentralized, facile, and fast manner. Herein, the method was evaluated with six transporter datasets (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp), which revealed the general applicability of this approach.
The importance of machine learning (ML) approaches in drug discovery and in silico toxicity prediction has increased significantly in recent years. As the amount of available toxicity data has grown [1,2,3], ML approaches have become an essential part of the drug discovery pipeline. Public–private partnerships such as eTOX and eTRANSAFE, as well as public databases (ChEMBL, PubChem), provide a trustworthy data supply for the establishment of predictive ML models. A large amount of data is crucial for training ML models and improving their performance. However, pooling data from multiple sources is subject to several restrictions. Companies quite often restrict access to in-house data because of its business value. In addition, collecting, curating, and preserving data requires considerable effort and time.
Furthermore, once a sufficient amount of high-quality data has been established, additional challenges can arise on the path towards efficient ML models. One of them is the selection of the chemical descriptors best suited to derive models of sufficient quality. Selecting a proper set of descriptors is an extensive, time-intensive, and still mostly manual process, especially when trying to understand relationships between chemical properties and their effect on biological targets. The best-suited descriptors can vary considerably depending on the biological target. Combined with the fact that additional hyperparameters have to be tuned for each model, the creation of high-accuracy ML models becomes an exhaustive process.
To overcome these issues and allow the user to establish predictive models in an easy and fast way, we created a framework that can be used in a semi-automated fashion for the creation and/or re-training of ML models for predicting inhibitory activity towards ABC and SLC transporters. Furthermore, in contrast to previous methods, our approach does not require descriptor selection or a hyperparameter search, which enables fast and efficient model building.
The set of transporters mainly used in this study has caught the attention of regulatory agencies such as the FDA, the EMA, and the Japanese regulatory agency, as the inhibition of these proteins may play a role in drug-drug interactions and/or drug-induced liver injury. Therefore, predicting the inhibitory profiles of small molecules towards this set of transporters can help to guide the safety assessment of new drugs, as often requested or recommended by regulatory agencies. Additionally, this knowledge can help medicinal chemists to prioritize compounds at the early drug discovery stage [10,11,12,13,14,15,16,17].
By combining Jupyter Notebooks (JNs) as a framework for creating ML models with high-quality data on membrane transport proteins to train these models, shareable models can be built for assessing the interaction profile of compounds. In general, JN is a web-based interactive computing platform that enables the combination of computer code (e.g. Python) and rich text elements (e.g. figures). A web browser is used to navigate the JN app, and its graphical user interface allows a clear representation of files and so-called notebook documents. These notebook documents can be executed as well as read by users, as they contain code, rich text, images, plots, interactive figures and widgets. The notebooks can be easily shared since they are saved as structured text files (JSON format), which enables the transfer of the model code from one instance to another for re-training the model. This allows the enrichment of the chemical space of the model. The notebook further provides a generalizable set of molecular descriptors for the ABC and SLC transporter families that has been shown to be applicable at least to the transporter proteins BCRP, BSEP, OATP1B1, OATP1B3, MRP3, and P-gp. This procedure was selected because the notebook can be shared in a facile manner and because it allows workflows to be created for non-experienced users. Once data have been uploaded to the JN, the code can be executed to create the models and to verify them within the JN. In addition, owing to the ease of integrating RDKit, JNs are a versatile tool for cheminformatics tasks.
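To illustrate why a notebook is easy to share, the following minimal sketch builds a two-cell notebook document as a plain Python dictionary and round-trips it through JSON using only the standard library. The cell contents are hypothetical placeholders and do not correspond to the notebook published with this study; the field names follow the public notebook format (version 4).

```python
import json

# A notebook document is a JSON object with a list of cells plus metadata.
# The cell contents below are hypothetical placeholders.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Transporter model (toy example)"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('re-train the model here')"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
```

Because the file is plain text, it can be exchanged between two instances like any other document and re-executed with local data.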
Furthermore, JNs are great tools for educational purposes. The TeachOpenCADD platform by AG Volkamer has demonstrated this by creating JNs with step-by-step tutorials that can be used as a teaching platform for classroom lessons and self-study. Open-source data and Python packages are used as tools for establishing both ligand- and structure-based approaches. These JNs convey knowledge in the fields of cheminformatics and structural bioinformatics to students and other users interested in these topics. Therefore, our JN not only offers the possibility of improving the ML models, building models, and making predictions for the six endpoints, but also gives students, universities and interested users the opportunity to learn more about model building, data handling, datasets, standardization procedures, descriptor calculation and model evaluation in cheminformatics.
In this study, datasets of six different transmembrane transport proteins (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp) were used as a case study [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]. Firstly, datasets from the Vienna LiverTox Workspace (LiverTox) were chosen, as these datasets had already been published and used for the development of predictive models. The corresponding web service allows the prediction of substrates and inhibitors for a set of ABC and SLC transporters.
Secondly, an in-house KNIME workflow was used to retrieve additional new data from public platforms such as ChEMBL and PubChem (ChEMBL26, ChEMBL27, ChEMBL28, PubChem). The data from ChEMBL26 and ChEMBL27 were used as additional training sets (see below), while data from ChEMBL28 and additional data from PubChem served as test sets. Activity values were taken from the original publication, and class labels for binary classification were assigned based on an IC50 threshold of 10 µM. All datasets were provided in SDF format together with a binary classification (0/inactive or 1/active) for each of the six endpoints. For each compound, the InChI (IUPAC International Chemical Identifier), InChIKey and SMILES (Simplified Molecular Input Line Entry System) were calculated. All datasets are available on GitHub at https://github.com/PharminfoVienna/Retraining_Notebook/tree/main/data.
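The class labeling step can be sketched in a few lines of Python. This is an illustration only: the function names and example values are hypothetical, and it assumes that compounds with an IC50 at or below 10 µM count as active, since the handling of the boundary value is not specified here.

```python
from typing import Iterable, List

# Threshold named in the text: an IC50 of 10 µM.
IC50_THRESHOLD_UM = 10.0


def label_activity(ic50_um: float, threshold_um: float = IC50_THRESHOLD_UM) -> int:
    """Return 1 (active) or 0 (inactive) for one IC50 value given in µM.

    Assumption: a value exactly at the threshold is labeled active.
    """
    if ic50_um <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 1 if ic50_um <= threshold_um else 0


def label_dataset(ic50_values_um: Iterable[float]) -> List[int]:
    """Label a sequence of IC50 values (µM) for binary classification."""
    return [label_activity(value) for value in ic50_values_um]


# Hypothetical IC50 values in µM, for illustration only.
example_ic50 = [0.5, 10.0, 12.3, 250.0]
example_labels = label_dataset(example_ic50)
```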
Before the standardization protocol was applied, stereochemistry information was removed from the InChIs and duplicated InChIs were identified. If duplicates showed the same class label, one of the compounds was kept. Otherwise, both compounds were removed. Data cleaning and standardization were performed using a modified version of the Standardizer provided by Atkinson (available at https://github.com/flatkinson/standardiser). This tool was applied to remove salts, neutralize, and discard non-organic compounds. Tables 1 and 2 show the number of data points available per transporter for LiverTox and for the newly collected datasets. The datasets were further used to generate classification models which allow the prediction of inhibitors for a number of liver transporters involved in severe side effects.
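The duplicate handling described above can be sketched as follows. This is a simplified, string-level illustration rather than the workflow used in this study: it assumes that stereochemistry can be removed by dropping the InChI stereo layers (/b, /t, /m, /s), and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# InChI layer prefixes that carry stereochemistry:
# /b (double bond), /t (tetrahedral), /m and /s (stereo type flags).
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Drop stereo layers from an InChI string (simplified sketch)."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([head, *kept])


def resolve_duplicates(records: List[Tuple[str, int]]) -> Dict[str, int]:
    """Map each stereo-free InChI to its label, dropping conflicting duplicates.

    records: (InChI, label) pairs with label 0 (inactive) or 1 (active).
    """
    labels_by_key: Dict[str, Set[int]] = defaultdict(set)
    for inchi, label in records:
        labels_by_key[strip_stereo_layers(inchi)].add(label)
    # One label seen: keep a single entry. Two labels seen: discard all copies.
    return {
        key: next(iter(labels))
        for key, labels in labels_by_key.items()
        if len(labels) == 1
    }


# Toy records; the labels are made up for illustration.
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
methane = "InChI=1S/CH4/h1H4"
records = [(l_ala, 1), (d_ala, 1), (ethanol, 0), (ethanol, 1), (methane, 0)]
cleaned = resolve_duplicates(records)
```

In this toy input, the two alanine enantiomers collapse into one entry because they share a label, whereas the two conflicting ethanol records are both discarded.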
For the characterization of the chemical space related to ABC and SLC transporter inhibition, a variety of molecular descriptors from the RDKit library (version 2020.09.1) were used. These molecular descriptors translate chemical structures into numerical representations of atomic or molecular properties of compounds. In total, 197 two-dimensional (2D) descriptors were chosen as a starting point for the selection of features applicable to ABC and SLC transporters. Three different feature selection methods from the scikit-learn Python library (version 0.24.2) were applied: VarianceThreshold, univariate feature selection, and recursive feature elimination. By applying VarianceThreshold, all calculated molecular descriptors with zero variance were removed. As a next step, the best descriptors were selected based on a univariate statistical approach. ANOVA-f was chosen over mutual information due to the nature of the six transporter datasets. This method estimates the degree of linear dependency by using the F-test approach. In parallel, recursive feature elimination (RFE) was performed to select features by recursively considering subsets of molecular descriptors. A random forest wrapper was used for assigning the weights. As a last step, the results of the univariate feature selection and of the recursive feature elimination were compared for each dataset, and the resulting molecular descriptors were manually aligned with each other. Aligning the 50 top-scored ANOVA-f descriptors of each transporter yielded 70 descriptors, whereas aligning the RFE results yielded 170 descriptors. The resulting 70 descriptors were used for the creation of the final models (see Fig. 1). A graphical representation of the workflow can be seen in Fig. 2.
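The three selection steps can be sketched with scikit-learn as shown below. The sketch runs on a synthetic descriptor matrix instead of RDKit descriptors; the descriptor names, the number of features kept by RFE, the RFE step size, and the forest size are assumptions made to keep the example small and fast, not the settings of this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif

# Synthetic stand-in for a descriptor matrix (rows: compounds, columns: descriptors).
X, y = make_classification(n_samples=200, n_features=60, n_informative=10, random_state=0)
# Append two constant columns to mimic zero-variance descriptors.
X = np.hstack([X, np.zeros((X.shape[0], 2))])
names = np.array([f"desc_{i}" for i in range(X.shape[1])])  # hypothetical names

# Step 1: remove descriptors with zero variance.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
names_var = names[vt.get_support()]

# Step 2a: univariate selection, keeping the 50 best ANOVA F-scores.
anova = SelectKBest(score_func=f_classif, k=50).fit(X_var, y)
anova_names = set(names_var[anova.get_support()].tolist())

# Step 2b (in parallel): recursive feature elimination with a random forest.
# n_features_to_select=30, step=5 and n_estimators=50 are illustrative choices.
rfe = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=30,
    step=5,
)
rfe.fit(X_var, y)
rfe_names = set(names_var[rfe.get_support()].tolist())

# Descriptors retained by both methods for this single dataset.
shared = sorted(anova_names & rfe_names)
```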
Four different classifiers, namely logistic regression, support vector machine, random forest, and k-nearest neighbor, were used for model generation. The scikit-learn Python library (version 0.24.2) implementations were used to train binary classification models for the six above-mentioned datasets.
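As a rough sketch, the four scikit-learn classifiers could be instantiated and tuned by a cross-validated grid search as follows. The data are synthetic, and the grid values below are placeholders chosen for illustration; they are not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data standing in for one transporter dataset.
X, y = make_classification(n_samples=150, n_features=20, n_informative=5, random_state=0)

# Placeholder grids: NOT the values used in the study.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "support_vector_machine": (SVC(), {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "k_nearest_neighbor": (KNeighborsClassifier(), {"n_neighbors": [3, 5]}),
}

# Tenfold stratified cross-validation, scored by balanced accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="balanced_accuracy", cv=cv)
    search.fit(X, y)
    best[name] = (search.best_params_, search.best_score_)
```

After the loop, `best` holds the selected parameter combination and its mean cross-validated balanced accuracy for each classifier.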
Hyperparameter grid search
To find the optimal parameters for each classifier, a grid search of the hyperparameters was performed. The following parameters were used:
Training procedure, cross-validation, and evaluation
In a first step, prediction models were generated based solely on the LiverTox dataset and the settings mentioned above. The performance of these models was compared with that of the LiverTox models to validate our approach.
In a next step, the newly collected datasets of the six transporters were used for the training of the actual models. The performance of the models was evaluated using tenfold cross-validation, and statistical metrics such as accuracy, sensitivity, specificity, and balanced accuracy were calculated (see Table 3). For that, the scikit-learn Python library was used. Additionally, an external test set was used to test the newly generated models. This test set was collected from ChEMBL28 and PubChem, and only data that were novel with respect to the training set were kept.
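For reference, the four metrics follow directly from the confusion matrix: sensitivity is TP/(TP+FN), specificity is TN/(TN+FP), and balanced accuracy is the mean of the two. The following dependency-free sketch computes them for a toy prediction vector; the function name and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and balanced accuracy for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of actives recovered
    specificity = tn / (tn + fp)  # fraction of inactives recovered
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy labels: 4 actives and 6 inactives (TP=3, FN=1, TN=4, FP=2).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Balanced accuracy is the more informative number for unbalanced datasets, because a model that predicts only the majority class reaches a high accuracy but a balanced accuracy of 0.5.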
Local outlier factor (LOF), as described by Breunig and coworkers and as implemented in the scikit-learn Python library (version 0.24.2), was used to calculate the applicability domain. In this approach, the local density of a compound is compared with the local densities of its nearest neighbors, and an outlier factor is assigned. In brief, if the local density of a compound is greater than or equal to that of its surroundings, the compound is considered inside the domain; otherwise it is considered outside the domain.
The following parameters were used:
5 nearest neighbors
novelty = True
contamination = 0.1
MinMax-scaled descriptors
First two principal components were chosen as input
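With the parameters listed above, an applicability domain check could be sketched with scikit-learn as follows. The training and query data are synthetic and the variable names are hypothetical; only the scaling, the projection onto two principal components, and the LOF settings are taken from the list.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic training "descriptors": two latent factors spread over five columns.
latent = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, -1.0, 1.0],
])
X_train = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Two query compounds: one at the training centre, one far away from it.
X_query = np.array([[0.0, 0.0], [8.0, 8.0]]) @ mixing

# MinMax scaling followed by projection onto the first two principal components.
projector = make_pipeline(MinMaxScaler(), PCA(n_components=2))
Z_train = projector.fit_transform(X_train)

# novelty=True allows predict() to be called on unseen compounds.
lof = LocalOutlierFactor(n_neighbors=5, novelty=True, contamination=0.1)
lof.fit(Z_train)

# predict() returns +1 for inliers and -1 for outliers.
in_domain = lof.predict(projector.transform(X_query)) == 1
```

In this sketch the distant query compound is flagged as outside the domain. Note that the scaler and the PCA are fitted on the training set only and are then reused unchanged for the query compounds.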
Three different feature selection methods were applied. A variance threshold set to zero, univariate feature selection using ANOVA-f, and RFE with a random forest wrapper (default settings) were used to retrieve the most relevant molecular descriptors from the RDKit module for each dataset. Once molecular descriptors with constant values had been removed, the ANOVA-f and RFE methods were applied. For the RFE method, different sets of descriptors were obtained for each transporter. The obtained descriptors were then aligned with each other to identify the most frequent descriptors occurring in each dataset. However, this yielded 170 descriptors, which is still a high number considering the basic principle of parsimony in QSAR. Therefore, we conducted the ANOVA-f approach in parallel. Instead of using all scored molecular descriptors, we used only the 50 best-scored molecular descriptors of each dataset for the alignment procedure. Our aim was to keep the number of descriptors as low as possible; the threshold was set to 50 because the performance of the models decreased in individual cases when a lower number was applied. The sets of descriptors obtained for the six transporter proteins were then aligned, which resulted in a final set of 70 descriptors. The impact of using 197, 170 and 70 descriptors on all four models was then examined by calculating the balanced accuracy for each dataset. Interestingly, we obtained similar results using only the 70 descriptors from the ANOVA-f approach (see Fig. 3) compared to the full set of 197 descriptors and the 170 descriptors obtained from the RFE method. Therefore, we decided to implement the 70 molecular descriptors retrieved from the ANOVA-f approach in the Jupyter Notebook.
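The alignment of per-transporter descriptor sets can be illustrated with a simple frequency count. This is a sketch under stated assumptions: the alignment in this study was performed manually, the rule used here (keep a descriptor if it was selected for at least a given number of datasets) is hypothetical, and the toy selections below are made up.

```python
from collections import Counter
from typing import Dict, List, Set


def align_descriptors(selected: Dict[str, Set[str]], min_datasets: int) -> List[str]:
    """Return descriptors selected in at least `min_datasets` of the datasets."""
    counts = Counter(name for names in selected.values() for name in names)
    return sorted(name for name, n in counts.items() if n >= min_datasets)


# Made-up selections for three transporters, using RDKit-style descriptor names.
toy_selection = {
    "BCRP": {"MolLogP", "TPSA", "MolWt"},
    "BSEP": {"MolLogP", "TPSA", "NumHDonors"},
    "P-gp": {"MolLogP", "MolWt", "NumHAcceptors"},
}

union_set = align_descriptors(toy_selection, min_datasets=1)     # selected anywhere
frequent_set = align_descriptors(toy_selection, min_datasets=2)  # selected at least twice
common_set = align_descriptors(toy_selection, min_datasets=3)    # selected everywhere
```

With `min_datasets=1` the result is the union of all per-dataset selections, which is why an aligned set can be larger than the number of descriptors kept per dataset.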
Performance of the ML models
For the development of predictive models that can be shared in an easy manner and used for all six transporter datasets, four distinct modeling strategies were applied. Logistic regression, support vector machine, random forest and k-nearest neighbor classifiers were used to train models with the datasets from LiverTox. This step served to validate our approach before using it for the actual model generation. The comparison of the performance indicated results similar to those shown in the documentation of LiverTox. The support vector and random forest models performed better overall. To improve the models, new datasets for all six transporters were collected from ChEMBL and PubChem. Newly published data from ChEMBL28 and PubChem were used as external datasets, whereas the previous versions were used for the training of the four modeling approaches. Again, the three feature selection methods and a hyperparameter search were conducted to find the optimal number of descriptors and the optimal parameters. Finally, we obtained two models for each transporter, support vector and random forest, with very similar balanced accuracies in tenfold cross-validation, ranging from 0.67 to 0.83 for SVM and from 0.62 to 0.82 for RF across the various transporters. Overall, we observed that training the models with a subset of all descriptors, adding new chemical space, and applying the grid search can improve the model performance compared to the LiverTox models, especially with regard to the balanced accuracy.
Fast and facile model generation for binary classification tasks that is applicable to more than one transporter and additionally allows a retraining of the model is of great interest. Current ML approaches are often based on one protein when trying to predict substances as transporter inhibitors or non-inhibitors [32, 53,54,55]. This makes it harder to generalize models when trying to predict substances for a group of transporters. As the appropriate molecular descriptors can vary from one protein to another, their selection becomes a quite challenging step. Therefore, we established a Jupyter Notebook that allows the user to generate classification models for six transporter proteins (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp) without intensive descriptor analysis and hyperparameter search. Moreover, these models can be shared between two instances for additional training. Our analysis indicated that 70 molecular descriptors from the RDKit module can be used for the creation of well-performing predictive models when random forest and support vector classifiers are used. The feature selection methods implemented in the scikit-learn Python library proved useful for reducing the number of descriptors while maintaining a good performance for most of the transporter models, and for establishing a general set of 70 descriptors for all six transporter proteins. However, in the case of MRP3 an overall low performance was obtained due to the low number of available data points. The data gathering step revealed that only two transporter proteins, namely BCRP and P-gp, had a well-balanced number of actives and inactives. This becomes apparent when comparing the resulting precision with that of the remaining, unbalanced datasets. Both the P-gp and the BCRP models correctly predict 76 to 80% of the cases when a random forest classifier is chosen.
Interestingly, for all transporters except MRP3, both good sensitivity and good specificity values were obtained, although the other transporters have unbalanced datasets. Only for OATP1B1 was a sensitivity lower than 70% obtained. The best performance was obtained with the BCRP dataset, with a balanced accuracy of 80%, a precision of 85%, and sensitivity and specificity values from 79 to 83%. This can be explained by the high number of well-curated data points and the balanced number of actives and inactives in the dataset. Nevertheless, these models can be used for re-training, and therefore the performance can increase once more data are available. For each transporter protein a tenfold cross-validation was performed, and an external dataset was used for a thorough evaluation after the final model was trained. A reasonable number of test compounds was collected for the BCRP and P-gp transporters. In the case of OATP1B1 and OATP1B3 more than 17 compounds were retrieved, and fewer than 10 compounds were obtained for BSEP and MRP3. Therefore, an external validation was meaningful for the BCRP and P-gp test sets. In both cases, the balanced accuracy and the specificity decreased by more than 20% compared to the cross-validation, which still indicated a moderate performance. This could be explained by the fact that 31 compounds from the BCRP test set and 35 compounds from the P-gp test set were out of domain when the local outlier factor algorithm was used for the applicability domain estimation (Table 4). The same approach for OATP1B1 and OATP1B3 indicated a total of 9 outliers and a similar decrease in performance. For the remaining test sets no results could be retrieved due to the low number of data points obtained from ChEMBL28.
Nevertheless, a tenfold cross-validation was carried out for each transporter dataset, indicating performances close to 80% for five out of six transporter datasets, which makes the approach a valuable and feasible tool for the prediction of new data related to both ABC and SLC transporters. Additionally, this approach benefits from the models' ability to be updated and shared in a facile manner using Jupyter Notebook.
In this study, we present a JN which enables the user to generate classification models for six transporter proteins (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp) based on four different classifiers with pre-selected descriptors and without an extensive hyperparameter search. In addition, the notebook can be used to create models for additional transporters and to retrain the existing prediction models with an extended or novel dataset, using the pre-defined descriptors and hyperparameters. The JN can also be used for educational purposes, especially by those interested in the creation of predictive ML models for inhibitory activity predictions.
Yang H, Sun L, Li W et al (2018) In silico prediction of chemical toxicity for drug design using machine learning methods and structural alerts. Front Chem 6:30. https://doi.org/10.3389/fchem.2018.00030
Cases M, Briggs K, Steger-Hartmann T et al (2014) The eTOX data-sharing project to advance in silico drug-induced toxicity prediction. Int J Mol Sci 15:21136–21154. https://doi.org/10.3390/ijms151121136
Pastor M, Quintana J, Sanz F (2018) Development of an infrastructure for the prediction of biological endpoints in industrial environments. Lessons learned at the eTOX Project. Front Pharmacol. https://doi.org/10.3389/FPHAR.2018.01147
Briz O, Serrano MA, Macias RIR et al (2003) Role of organic anion-transporting polypeptides, OATP-A, OATP-C and OATP-8, in the human placenta-maternal liver tandem excretory pathway for foetal bilirubin. Biochem J 371:897–905. https://doi.org/10.1042/BJ20030034
Kluyver T, Ragan-Kelley B, Pérez F, et al (2016) Jupyter Notebooks—a publishing format for reproducible computational workflows. Position Power Acad Publ Play Agents Agendas—Proc 20th Int Conf Electron Publ ELPUB 2016 87–90. https://doi.org/10.3233/978-1-61499-649-1-87
Sydow D, Morger A, Driller M, Volkamer A (2019) TeachopenCadd: a teaching platform for computer-aided drug design using open source packages and data. J Cheminform 11:1–7. https://doi.org/10.1186/s13321-019-0351-x
Hirano H, Kurata A, Onishi Y et al (2006) High-speed screening and QSAR analysis of human ATP-binding cassette transporter ABCB11 (bile salt export pump) to predict drug-induced intrahepatic cholestasis. Mol Pharm 3:252–265. https://doi.org/10.1021/mp060004w
Winter E, Lecerf-Schmidt F, Gozzi G et al (2013) Structure-activity relationships of chromone derivatives toward the mechanism of interaction with and inhibition of breast cancer resistance protein ABCG2. J Med Chem 56:9849–9860. https://doi.org/10.1021/jm401649j
Warner DJ, Chen H, Cantin LD et al (2012) Mitigating the inhibition of human bile salt export pump by drugs: opportunities provided by physicochemical property modulation, in silico modeling, and structural modification. Drug Metab Dispos 40:2332–2341. https://doi.org/10.1124/dmd.112.047068
Karlgren M, Vildhede A, Norinder U et al (2012) Classification of inhibitors of hepatic organic anion transporting polypeptides (OATPs): influence of protein expression on drug–drug interactions. J Med Chem 55:4740–4763. https://doi.org/10.1021/JM300212S
Pedersen JM, Matsson P, Bergström CAS et al (2013) Early identification of clinically relevant drug interactions with the human bile salt export pump (BSEP/ABCB11). Toxicol Sci 136:328–343. https://doi.org/10.1093/toxsci/kft197
Kotsampasakou E, Brenner S, Jäger W, Ecker GF (2015) Identification of novel inhibitors of organic anion transporting polypeptides 1B1 and 1B3 (OATP1B1 and OATP1B3) using a consensus vote of six classification models. Mol Pharm 12:4395–4404. https://doi.org/10.1021/ACS.MOLPHARMACEUT.5B00583
Contino M, Zinzi L, Cantore M et al (2013) Activity-lipophilicity relationship studies on P-gp ligands designed as simplified tariquidar bulky fragments. Bioorgan Med Chem Lett 23:3728–3731. https://doi.org/10.1016/j.bmcl.2013.05.019
Morgan RE, Trauner M, van Staden CJ et al (2010) Interference with bile salt export pump function is a susceptibility factor for human liver injury in drug development. Toxicol Sci 118:485–500. https://doi.org/10.1093/toxsci/kfq269
Hayashi D, Tsukioka N, Inoue Y et al (2015) Synthesis and ABCG2 inhibitory evaluation of 5-N-acetylardeemin derivatives. Bioorganic Med Chem 23:2010–2023. https://doi.org/10.1016/j.bmc.2015.03.017
Köck K, Ferslew BC, Netterberg I et al (2014) Risk factors for development of cholestatic drug-induced liver injury: inhibition of hepatic basolateral bile acid transporters multidrug resistance-associated proteins 3 and 4. Drug Metab Dispos 42:665–674. https://doi.org/10.1124/DMD.113.054304
Ochoa-Puentes C, Bauer S, Kühnle M et al (2013) Benzanilide-biphenyl replacement: a bioisosteric approach to quinoline carboxamide-type ABCG2 modulators. ACS Med Chem Lett 4:393–396. https://doi.org/10.1021/ml4000832
Orlandi F, Coronnello M, Bellucci C et al (2013) New structure-activity relationship studies in a series of N, N-bis(cyclohexanol)amine aryl esters as potent reversers of P-glycoprotein-mediated multidrug resistance (MDR). Bioorgan Med Chem 21:456–465. https://doi.org/10.1016/j.bmc.2012.11.019
Dawson S, Stahl S, Paul N et al (2012) In vitro inhibition of the bile salt export pump correlates with risk of cholestatic drug-induced liver injury in humans. Drug Metab Dispos 40:130–138. https://doi.org/10.1124/dmd.111.040758
Capparelli E, Zinzi L, Cantore M et al (2014) SAR studies on tetrahydroisoquinoline derivatives: the role of flexibility and bioisosterism to raise potency and selectivity toward P-glycoprotein. J Med Chem 57:9983–9994. https://doi.org/10.1021/jm501640e
Reis M, Ferreira RJ, Santos MMM et al (2013) Enhancing macrocyclic diterpenes as multidrug-resistance reversers: structure-activity studies on jolkinol D derivatives. J Med Chem 56:748–760. https://doi.org/10.1021/jm301441w
Contino M, Zinzi L, Perrone MG et al (2013) Potent and selective tariquidar bioisosters as potential PET radiotracers for imaging P-gp. Bioorgan Med Chem Lett 23:1370–1374. https://doi.org/10.1016/j.bmcl.2012.12.084
Winter E, Devantier Neuenfeldt P, Chiaradia-Delatorre LD et al (2014) Symmetric bis-chalcones as a new type of breast cancer resistance protein inhibitors with a mechanism different from that of chromones. J Med Chem 57:2930–2941. https://doi.org/10.1021/jm401879z
Baumert C, Günthel M, Krawczyk S et al (2013) Development of small-molecule P-gp inhibitors of the N-benzyl 1,4-dihydropyridine type: novel aspects in SAR and bioanalytical evaluation of multidrug resistance (MDR) reversal properties. Bioorgan Med Chem 21:166–177. https://doi.org/10.1016/j.bmc.2012.10.041
Montanari F, Knasmüller B, Kohlbacher S et al (2020) Vienna LiverTox workspace—a set of machine learning models for prediction of interactions profiles of small molecules with transporters relevant for regulatory agencies. Front Chem 7:899. https://doi.org/10.3389/fchem.2019.00899
Jain S, Grandits M, Richter L, Ecker GF (2017) Structure based classification for bile salt export pump (BSEP) inhibitors using comparative structural modeling of human BSEP. J Comput Aided Mol Des 31:507–521. https://doi.org/10.1007/s10822-017-0021-x
Prachayasittikul V, Worachartcheewan A, Shoombuatong W et al (2015) Classification of p-glycoprotein-interacting compounds using machine learning methods. EXCLI J 14:958–970. https://doi.org/10.17179/excli2015-374
Belekar V, Lingineni K, Garg P (2015) Classification of breast cancer resistant protein (BCRP) inhibitors and non-inhibitors using machine learning approaches. Comb Chem High Throughput Screen 18:476–485. https://doi.org/10.2174/1386207318666150525094503
The authors acknowledge the eTRANSAFE consortium for the support. Disclaimer. This work reflects only the author's views and the JU is not responsible for any use that may be made of the information it contains.
This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No 777365. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA. The Pharmacoinformatics Research Group (Ecker lab) acknowledges funding provided by the Austrian Science Fund FWF AW012321 MolTag.
Authors and Affiliations
Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria
Aljoša Smajić, Melanie Grandits & Gerhard F. Ecker
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.