CardioTox net: a robust predictor for hERG channel blockade based on deep learning meta-feature ensembles

Motivation Ether-a-go-go-related gene (hERG) channel blockade by small molecules is a major concern during drug development in the pharmaceutical industry. Blockade of hERG channels may prolong the QT interval, which can lead to cardiotoxicity. Various in-silico techniques, including deep learning models, are widely used to screen out small molecules with potential hERG-related toxicity. Most published deep learning methods use a single type of feature, which might restrict their performance. Methods based on more than one type of feature, such as DeepHIT, struggle with the aggregation of extracted information. DeepHIT shows better performance when evaluated against one or two metrics such as negative predictive value (NPV) and sensitivity (SEN) but struggles when evaluated against others such as Matthews correlation coefficient (MCC), accuracy (ACC), positive predictive value (PPV) and specificity (SPE). Therefore, there is a need for a method that can efficiently aggregate information gathered from models based on different chemical representations and boost hERG toxicity prediction over a range of performance metrics.
Results In this paper, we propose a deep learning framework based on step-wise training to predict the hERG channel blocking activity of small molecules. Our approach uses five individual deep learning base models with their respective base features and a separate neural network to combine the outputs of the five base models. On three external independent test sets with potency activity of IC50 at a threshold of 10 μM, our method achieves better performance for a combination of classification metrics. We also investigate the effective aggregation of extracted chemical information for robust hERG activity prediction. In summary, CardioTox net can serve as a robust tool for screening small molecules for hERG channel blockade in drug discovery pipelines and performs better than previously reported methods on a range of classification metrics.
Supplementary Information The online version contains supplementary material available at 10.1186/s13321-021-00541-z.


S1: Data Preparation
A dataset of molecular structures labelled as hERG blockers and non-blockers, in the form of SMILES strings, was obtained from the DeepHIT authors [1]. It was curated from five sources: the BindingDB database (3056 hERG blockers, 3039 hERG non-blockers) [2], the ChEMBL bioactivity database (4859 hERG blockers, 4751 hERG non-blockers) [3], and three literature-derived sets (4355 hERG blockers, 3534 hERG non-blockers) [4], (1545 hERG blockers, 816 hERG non-blockers) [5] and (2849 hERG blockers, 1202 hERG non-blockers) [6]. SMILES strings from all five sources were standardized using RDKit (http://www.rdkit.org/) and MolVS (https://molvs.readthedocs.io/en/latest/) as described by Ryu et al. [1] and shown in Fig. S1a. After standardization, each data source was split into four sets: a 70% base training set, a 10% base validation set, a 10% meta training set and a 10% meta validation set. All redundant molecules were removed and the respective sets were merged to form a combined base training set, base validation set, meta training set and meta validation set. We used test set-I from DeepHIT "as is"; it contains more hERG blockers than non-blockers. Pairwise Tanimoto similarity [1] was computed between all molecules of the combined data sets and the molecules in test set-I obtained from DeepHIT. All molecules in the combined data sets with a Tanimoto similarity > 0.7 to any molecule in test set-I were removed, thus forming the gold standard training and validation data shown in Fig. S1a.
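The per-source split and the similarity filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes fingerprints are already available as Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit such as RDKit), and the function names `split_source`, `tanimoto` and `remove_similar`, the random seed, and the handling of empty fingerprints are all hypothetical choices.

```python
import random


def split_source(molecules, seed=0):
    """Shuffle one data source and split it 70/10/10/10 (assumed random split)."""
    mols = list(molecules)
    random.Random(seed).shuffle(mols)
    n = len(mols)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    a, b, c = n * 7 // 10, n * 8 // 10, n * 9 // 10
    return {
        "base_train": mols[:a],
        "base_val": mols[a:b],
        "meta_train": mols[b:c],
        "meta_val": mols[c:],
    }


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def remove_similar(candidates, reference, threshold=0.7):
    """Keep candidates whose similarity to every reference molecule is <= threshold."""
    return [
        fp for fp in candidates
        if all(tanimoto(fp, ref) <= threshold for ref in reference)
    ]
```

Here `remove_similar(training_fps, test_set_1_fps)` would play the role of the > 0.7 filter against test set-I; the same helper could be reused for the later test sets with a different reference list.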
To evaluate our model on another independent test set containing more non-blockers than blockers, we curated 110 hERG blockers and 336 hERG non-blockers from the "E3 training" set of Siramshetty et al. [7]. We chose E3 training because it treats molecules with IC50 values < 10 µM as hERG blockers and those with IC50 values ≥ 10 µM as hERG non-blockers, which is compatible with the other datasets used in our study. E3 training is also negatively imbalanced, that is, it contains more non-blockers than blockers; test set-II is intended to be negatively imbalanced, unlike test set-I, which is positively imbalanced. We also obtained 9250 molecules from Kumar et al. [8] with potency reported as pIC50 values. We converted the potency from pIC50 to IC50 and labelled molecules with IC50 values < 10 µM as hERG blockers and those with IC50 values ≥ 10 µM as hERG non-blockers. Both data sets were merged, and all molecules with a Tanimoto similarity > 0.7 to any molecule in the gold standard training and validation data or in test set-I were removed. We thus obtained test set-II, which contains more non-blockers than blockers and is dissimilar to the gold standard training and validation data as well as to test set-I, as shown in Fig. S1b.
Both test set-I and test set-II are relatively small, so we curated another, larger independent test set from the recent work of Siramshetty et al. [9]. This larger test set is also negatively biased, with 53 blockers and 786 non-blockers. All molecules were compared with the training set, test set-I and test set-II in terms of pairwise Tanimoto similarity. Molecules with a Tanimoto similarity > 0.7 to any molecule in the training set, test set-I or test set-II were removed to form test set-III. We thus obtained a total of 706 hERG non-blockers and 34 hERG blockers in test set-III, as shown in Fig. S1b.
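The unit conversion and labelling step for the Kumar et al. data can be illustrated with a short sketch. It assumes the usual convention that pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L, which the text does not state explicitly; the function names are hypothetical. Under that assumption, IC50 < 10 µM is equivalent to pIC50 > 5.

```python
def pic50_to_ic50_um(pic50):
    """Convert pIC50 to IC50 in micromolar.

    Assumes pIC50 = -log10(IC50 in mol/L), so
    IC50 [uM] = 10 ** (-pIC50) * 1e6 = 10 ** (6 - pIC50).
    """
    return 10 ** (6 - pic50)


def label_from_pic50(pic50, threshold_um=10.0):
    """Return 1 (hERG blocker) if IC50 < threshold, else 0 (non-blocker)."""
    return 1 if pic50_to_ic50_um(pic50) < threshold_um else 0
```

A molecule sitting exactly at the threshold (IC50 = 10 µM) is labelled a non-blocker, matching the "≥ 10 µM" rule in the text.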

S2: SMILES embedding vectors
Based on the training data, a SMILES vocabulary is generated using the tokenizer module developed by Reverie Labs (https://blog.reverielabs.com/transformers-for-drug-discovery/).
Each SMILES string is converted into a fixed-size numerical vector using a mapping dictionary built from the SMILES vocabulary, as shown in Fig. S2. The mapping dictionary maps each vocabulary element to a numerical value. In the training data considered for this work, the longest SMILES string is 97 vocabulary elements long.
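The mapping from a SMILES string to a fixed-size integer vector can be sketched as below. This is not the Reverie Labs tokenizer: the regular expression is a simplified stand-in, and the reserved indices for padding (0) and unknown tokens (1), the right-padding, and the truncation to the maximum length are assumptions made for illustration. Only the length of 97 comes from the text.

```python
import re

# Simplified, hypothetical tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
PAD, UNK = 0, 1  # assumed reserved indices


def tokenize(smiles):
    """Split a SMILES string into vocabulary elements."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map each token seen in the training data to an integer >= 2."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 2 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len=97):
    """Encode a SMILES string as a right-padded vector of length max_len."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(smiles)][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

With a vocabulary built from the training SMILES, `encode(smiles, vocab)` always returns a list of 97 integers, so strings of different lengths can be stacked into one input array.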

S3: Standard deviation for base features validation
Table S1 shows the standard deviation for each split of the base validation set when training the individual base models.