Materials
All chemicals and reagents were purchased from Sigma-Aldrich Chemical Company (St. Louis, MO) and Acros Organics (Fair Lawn, NJ), and were used without further purification or drying. The progress of the reactions was monitored by thin-layer chromatography (TLC) on ready-made silica gel plates (Merck, Darmstadt, Germany; UV active). The plates were either developed under iodine vapor or visualized directly under UV light (254 nm). Column chromatography was performed over flash silica gel (60 Å pore size). 1H-NMR spectra were recorded at 400 MHz or 500 MHz and 13C-NMR spectra were recorded at 100 MHz or 125 MHz using Bruker (Billerica, MA) DRX-400 and Bruker DRX-500 spectrometers, respectively, and chemical shifts are reported in parts per million (ppm) on the δ scale relative to tetramethylsilane as the internal standard. Coupling constants (J) are reported in Hz.
Design and synthesis of ten phenylthiazole-4N-substituted piperazine hybrid molecules
Thiobenzamides (1a and 1b) were reacted with ethyl bromopyruvate to form the corresponding 2-phenylthiazole 4-ethyl ester derivatives (2a and 2b). The esters were hydrolysed and coupled with N-Boc-piperazine using DCC and HOBt to give the corresponding amides (3a and 3b). N-Boc deprotection of amide 3a, followed by reaction with acetic anhydride or benzyl bromide, provided the carbonyl-linked phenylthiazole-4N-substituted piperazine hybrid molecules (4a–4c). On the other hand, ester 2a was reduced with sodium borohydride to alcohol 5, which was converted into the corresponding mesylate with methanesulfonyl chloride and triethylamine. The mesylate was then substituted with N-Boc-piperazine to give intermediate compound 6. Finally, N-Boc deprotection and reaction with acetic anhydride or benzyl bromide, followed by reduction using lithium aluminium hydride, provided the methylene-linked phenylthiazole-4N-substituted hybrid molecules (7a–7e).
General procedure for synthesis of phenyl thiazoles 2a and 2b
Ethyl bromopyruvate (1.8 ml, 12 mmol, 1.2 equiv) was added to a stirred solution of thiobenzamide derivative 1a or 1b (10 mmol) in ethanol (30 ml) at room temperature. The reaction mixture was heated under reflux for 5 h, and the progress of the reaction was monitored by TLC. The solvent was removed by evaporation at reduced pressure, and the residue was washed with water and extracted with ethyl acetate (3 × 30 ml). The combined organic layer was dried over anhydrous sodium sulfate, filtered, and evaporated. The residue was purified by silica gel column chromatography using ethyl acetate/hexane (1:9) as the eluent to afford the desired products 2a and 2b (70–90%) as white solids.
Synthesis of compounds 3a and 3b
To a stirred solution of compound 2a or 2b (3 mmol) in THF (10 ml) and ethanol (15 ml) was added an aqueous solution of lithium hydroxide (1.2 M, 5 ml). After stirring at room temperature for 4–5 h, the volatiles were removed under reduced pressure. The residue was acidified with 1 N hydrochloric acid to pH 3, and the precipitate was extracted with ethyl acetate (3 × 30 ml). The combined organic layer was washed with brine, dried over anhydrous sodium sulfate, filtered, and evaporated under reduced pressure to afford a white powder. The crude acid thus obtained was dissolved in dry dichloromethane (20 ml), and DCC (0.74 g, 3.6 mmol) was added at 0 ℃ under nitrogen. Thereafter, a solution of HOBt (0.49 g, 3.6 mmol) in dimethylformamide (DMF) (5 ml) was added, followed by the addition of N-Boc-piperazine (0.56 g, 3 mmol). The reaction mixture was stirred at room temperature overnight. After completion of the reaction, the precipitated DCU was filtered off, and the filtrate was diluted with dichloromethane and washed with a saturated bicarbonate solution (2 × 30 ml) and a 10% citric acid solution (2 × 30 ml). The organic layer was then washed with brine (2 × 30 ml), dried over anhydrous sodium sulfate, filtered, and concentrated to a residue. The crude residue was purified by silica gel column chromatography using ethyl acetate/hexane (3:7) as the eluent to afford the desired products 3a and 3b as white solids.
Synthesis of compound 4a
The compound 3a (0.5 g, 1.34 mmol) was dissolved in a solution of 4 M HCl in anhydrous dioxane at room temperature with vigorous stirring. The reaction mixture was stirred for 2 h at room temperature. After completion of the reaction, dioxane was removed by evaporation at reduced pressure, and the residue was precipitated by adding cold diethyl ether. The precipitate was filtered off, washed with cold diethyl ether, and dried in a vacuum desiccator to afford a white powder as the hydrochloride salt of compound 4a (0.38 g, 92% yield).
Synthesis of compound 4b
To a solution of compound 4a (0.31 g, 1 mmol) in dry pyridine (0.8 ml) was added acetic anhydride (1.02 ml, 10 mmol), and the mixture was stirred overnight at room temperature. After completion of the reaction, the reaction mixture was diluted with water (20 ml) and extracted with ethyl acetate (3 × 30 ml). The combined organic layer was washed with brine (2 × 30 ml), dried over anhydrous sodium sulfate, filtered, and evaporated to dryness. The oily residue was purified by silica gel column chromatography, using ethyl acetate/hexane (3:7) as the eluent, to produce compound 4b (0.26 g, 82%) as a colorless oil.
Synthesis of compound 4c
Benzyl bromide (0.07 ml, 0.6 mmol) was added to a suspension of compound 4a (0.136 g, 0.5 mmol) and anhydrous K2CO3 (0.166 g, 1.2 mmol) in dichloromethane (3 ml). The reaction mixture was stirred overnight at room temperature. Water (30 ml) was then added to the reaction mixture, and the mixture was extracted with dichloromethane (3 × 30 ml). The combined organic layer was washed with brine, dried over anhydrous sodium sulfate, filtered, and evaporated under reduced pressure to give a gummy residue, which was purified by silica gel column chromatography, using ethyl acetate/hexane (3:7) as the eluent, to afford compound 4c as a colorless oil (0.123 g, 68%).
Synthesis of compound 6d
A solution of HOBt (0.233 g, 1.73 mmol) in DMF (2 ml) was slowly added to a mixture of 2-methylthiazole-4-carboxylic acid (0.206 g, 1.44 mmol) and DCC (0.356 g, 1.73 mmol) in dichloromethane (18 ml) at 0 ℃. Thereafter, a solution of compound 6a (0.41 g, 1.34 mmol) in DMF (2 ml) was added, followed by the addition of DIPEA (0.375 ml, 2.16 mmol). The reaction mixture was stirred for 12 h at room temperature. After completion of the reaction, the mixture was cooled for 2 h at 0 ℃ and the precipitated DCU was filtered off. The filtrate was then washed with 1 N HCl (3 × 30 ml), 10% aqueous sodium bicarbonate (3 × 30 ml) and brine (2 × 30 ml). The organic layer was dried over anhydrous sodium sulfate, filtered, and evaporated to afford a crude residue, which was purified by silica gel column chromatography, using ethyl acetate/hexane (3:7) as the eluent, to afford compound 6d (0.45 g, 84%) as a white solid.
Synthesis of compound 5
A solution of sodium borohydride (0.57 g, 15 mmol, 4 equiv) in methanol (4 ml) was added slowly over 15 min to a solution of compound 2a (0.87 g, 3.75 mmol) in tetrahydrofuran (16 ml) at 50 ℃ under nitrogen. The reaction mixture was refluxed for an additional 1 h and cooled to room temperature, and cold water (10 ml) was added slowly. The volatiles were removed under reduced pressure, and the aqueous residue was extracted with ethyl acetate (3 × 30 ml) and washed with brine (2 × 30 ml). The organic layer was dried over anhydrous sodium sulfate, filtered, and evaporated under reduced pressure. The residue was purified by silica gel column chromatography, using ethyl acetate/hexane (2:8) as the eluent, to afford compound 5 as a white solid (0.67 g, 93%).
Synthesis of compound 6
To a stirred solution of compound 5 (0.58 g, 3 mmol) in THF (30 ml) was added triethylamine (1.25 ml, 9 mmol), followed by the addition of mesyl chloride (1.2 ml, 18 mmol) at 0 ℃. The reaction mixture was stirred at 0 ℃ for 30 min, saturated aq. sodium bicarbonate (30 ml) was added, and the aqueous layer was extracted with ethyl acetate (3 × 50 ml). The combined organic layer was washed with brine, dried over anhydrous sodium sulfate, and evaporated under reduced pressure. The crude product obtained was dissolved in DMF (15 ml), and N-Boc-piperazine (0.67 g, 3.6 mmol) was added, followed by the addition of anhydrous potassium carbonate (0.83 g, 6 mmol). The reaction mixture was stirred for 5 h at 80 ℃. After completion of the reaction, the reaction mixture was poured into cold water (50 ml) and extracted with ethyl acetate (3 × 50 ml). The combined organic layer was washed with brine, dried over anhydrous sodium sulfate, filtered, and concentrated under reduced pressure. The crude residue was purified by silica gel column chromatography, using ethyl acetate/hexane (3:7) as the eluent, to afford compound 6 as a white solid (0.87 g, 81%).
Synthesis of compound 7a
The method for the synthesis of this compound was the same as that for compound 4a, but the starting material used in this case was compound 6. Compound 7a was isolated after purification as a white solid (0.072 g, 93%).
Synthesis of compound 7b
Dry THF (10 ml) was added dropwise under an N2 atmosphere to an ice-cooled round-bottom flask containing LiAlH4 (0.114 g, 3.0 mmol, 3 equiv.) at 0 °C. After completion of the THF addition, compound 6 (1 mmol) in dry THF was added slowly over 30 min at 0 °C, and the resulting mixture was stirred at room temperature for 2 h. After completion of the reaction, ice was added to the reaction mixture and the precipitate was filtered off. The filter cake was washed with diethyl ether (3 × 20 ml) and then with ethyl acetate (3 × 10 ml). The combined organic phase was dried over anhydrous sodium sulfate, filtered, and evaporated under vacuum to produce a light yellow oil. The residue was purified by silica gel column chromatography using methanol/chloroform (1:19) as the eluent to give compound 7b as a colorless oil (0.109 g, 40%).
Synthesis of compound 7c
The method for the synthesis of this compound was the same as that for compound 4b, but the starting material used in this case was compound 7a. This compound was isolated after purification as a colorless oil (0.058 g, 74%).
Synthesis of compound 7d
The method for the synthesis of this compound was the same as that for compound 7b, but the starting material used in this case was compound 7c. This compound was isolated after purification as a light yellow oil (0.076 g, 53%).
Synthesis of compound 7e
The method for the synthesis of this compound was the same as that for compound 4c, but the starting material used in this case was compound 7a. This compound was isolated after purification as a white solid (0.07 g, 71%).
In-vitro translation
The inhibitory effect of the compounds (and of chloramphenicol as the reference compound) on M. smegmatis ribosomes was tested in a bacterial coupled transcription/translation assay, in which the expression of the luciferase gene was measured. The luciferase gene was inserted into the plasmid downstream of the T7 RNA polymerase promoter. The reaction mixture contained 160 mM HEPES-KOH (pH 7.5), 6.5% polyethylene glycol 8000, 0.074 mg/ml tyrosine, 1.3 mM ATP, 0.86 mM each of CTP, GTP and UTP, 208 mM potassium glutamate, 83 mM creatine phosphate, 28 mM ammonium acetate, 0.663 mM cAMP, 1.8 mM DTT, 0.036 mg/ml folinic acid, 0.174 mg/ml Escherichia coli tRNA mix, 1 mM of each amino acid, 0.25 mg/ml creatine kinase, 0.044 mg/ml T7 RNA polymerase, 25% (v/v) M. smegmatis S30 cell-free extract, 7 ng/µl luciferase-encoding plasmid and the compound to be tested at a final concentration of 1 mM. Molecules that showed significant inhibition at a concentration of 1 mM were further tested at concentrations ranging from 6 nM to 1 mM. The reaction mixture was incubated at 37 ℃ for 1 h, and the reaction was terminated by the addition of erythromycin at a final concentration of 8 µM. To quantify the reaction products, luciferin assay reagent (LAR, Promega) was added at a 5:3 (LAR:reaction mix) volume ratio, and luminescence was measured in a plate reader (BioTek SYNERGY H1). The results were plotted (compound concentration vs. luminescence intensity), and IC50 values were calculated.
S30 extract
M. smegmatis was cultured using established protocols with minor modifications [26, 27]. M. smegmatis cells were grown in LB medium (5 g/L yeast extract, 5 g/L NaCl, 10 g/L tryptone) with continuous shaking (200 rpm) at 30 ℃ to an OD of 0.6 to 0.8. Tween 80 (0.04% w/v) was added to the LB medium to avoid clumping. The cells were harvested by centrifugation and washed twice in 200 ml buffer A (10 mM HEPES-NaOH [pH 7.4], 60 mM K-glutamate, 14 mM MgCl2, and freshly added 7 mM β-mercaptoethanol). The cells were resuspended in buffer A to 0.5 g/ml and lysed with a high-pressure homogenizer at 20,000 psi. The extract was centrifuged at 30,000 RCF for 30 min and the supernatant was transferred to a new vial; this step was performed twice. The supernatant was incubated at 37 ℃ for 1.5 h and dialysed for 2 h at 4 ℃ against buffer A. The M. smegmatis S30 extract was aliquoted and kept frozen.
In silico molecular docking
Molecular docking was performed as described [9] by using AutoDock 4.2 [10, 11] to estimate the binding free energy (ΔGbind) and the poses of the investigated compounds in relation to the receptor. The PTC receptor for simulations was derived from the Cartesian coordinates of the large ribosomal subunit (50S) of Staphylococcus aureus (PDBID 4WCE [28]). The virtual screening protocol was conducted through the Raccoon implementation (http://autodock.scripps.edu/resources/raccoon), in which the receptor and ligand molecules were preprocessed for docking. The docking grid was set with 126 points in each dimension, and the default spacing was 0.375 Å. The obtained grid map was centered with respect to the receptor. Free energy calculations and conformational sampling of the ligands were then carried out using the Lamarckian genetic algorithm (LGA), with an initial population size of 150 individuals, 2,500,000 free energy evaluations and 27,000 LGA generations. Clustering of the results was performed by root mean-square deviation (RMSD) calculations for the obtained poses of the ligands, with a constant tolerance of 2.0 Å. Further analyses of the results were performed using default AutoDock VS tutorial scripts along with several in-house written scripts.
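As a rough illustration of the grid and clustering settings above, the following Python sketch computes the edge length of the search box implied by 126 grid units at 0.375 Å spacing and groups poses with a simple RMSD check at the 2.0 Å tolerance. The function names and the greedy clustering scheme are assumptions made for illustration and are not AutoDock's own implementation; the 126 points are treated as grid intervals, and poses are assumed to share the same atom ordering.

```python
import math

def box_edge(n_intervals=126, spacing=0.375):
    """Edge length (in Å) of a cubic grid box, treating n_intervals as the
    number of grid intervals per dimension (an assumption)."""
    return n_intervals * spacing

def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses, each a list of
    (x, y, z) tuples with identical atom ordering."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum((a - b) ** 2
                  for p, q in zip(pose_a, pose_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(pose_a))

def cluster_poses(poses, tolerance=2.0):
    """Greedy clustering (hypothetical scheme): each pose joins the first
    cluster whose representative (its first member) lies within the RMSD
    tolerance; otherwise it starts a new cluster. Returns lists of indices."""
    clusters = []
    for idx, pose in enumerate(poses):
        for members in clusters:
            if rmsd(poses[members[0]], pose) <= tolerance:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

With these settings the box edge is 126 × 0.375 Å = 47.25 Å per dimension.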
Choosing the dataset for the model
As previously described [9], transverse relaxation NMR spectroscopy adapted for fragment screening selected 2-phenylthiazole as a molecular scaffold. Importantly, our data are labeled, i.e., every molecule in the dataset (860) is assigned a binding value to the RNA target. These binding values were obtained by virtual screening. Molecules that appeared twice with the same name (as a result of different charge or oxidation states) were deleted.
Decision tree
A decision tree was designed to aid in planning new potential inhibitory molecules by identifying essential features that differentiate molecules with high binding scores from molecules with low binding scores. We divided the data (stratified split) into two groups, binders and non-binders, on the basis of their binding values. The top 2.5% of the docking scores (≤ −13.2) was set as the threshold. For this model, all 204 chemical descriptors of the small molecules extracted using RDKit were used. After removing zero-variance features, the features were further reduced using the 'Forward Selection' wrapper method on the decision tree model, resulting in a three-feature dataset. The decision tree classifier was trained using tenfold cross-validation on the training data and reached 100% accuracy on the test sets.
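The labelling and stratified split described above can be sketched in plain Python as follows. This is a minimal illustration only: the function names, the test fraction of 0.2 and the random seed are hypothetical, and the sketch covers data preparation rather than the decision tree itself.

```python
import random

def label_binders(scores, threshold=-13.2):
    """Label a molecule 1 (binder) if its docking score is at or below the
    threshold, otherwise 0 (non-binder)."""
    return [1 if s <= threshold else 0 for s in scores]

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class contributes roughly
    the same fraction to the test set. test_fraction and seed are
    hypothetical values chosen for the example."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Stratification matters here because binders make up only a small fraction of the data, so an unstratified split could leave a fold with no binders at all.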
Regression model
The Lasso regressor is a linear regression model with L1 regularization (the magnitude of the penalty is an adjustable hyperparameter). The penalty reduces model complexity and thereby provides embedded feature selection. In Lasso, the weights of inessential features are pushed to zero, and these features are thus eliminated from the model [29]. The Lasso model loss function is:
$$\mathrm{Loss} = \frac{1}{n}\sum \left( y - \hat{y} \right)^{2} + \alpha \sum \left| \theta \right|$$
where the mean of the squared errors (y − ŷ)² is combined with an L1 regularization term, which comprises the sum of the absolute values of all weights (θ) multiplied by the penalty term (α). Regularization drives the weights of some variables to zero, thereby removing irrelevant features from the model [30].
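To make the loss concrete, the short Python sketch below evaluates it for given predictions and weights. The soft-thresholding helper is included only to illustrate how an L1 penalty can set small weights exactly to zero; it is the update commonly used by coordinate-descent Lasso solvers and is an assumption here, since the text does not state which solver was used. All function names are hypothetical.

```python
def lasso_loss(y_true, y_pred, weights, alpha):
    """Mean squared error plus alpha times the sum of absolute weights."""
    n = len(y_true)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n
    l1_penalty = alpha * sum(abs(w) for w in weights)
    return mse + l1_penalty

def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by `penalty`; weights whose magnitude is
    below the penalty become exactly zero (illustrative helper)."""
    magnitude = max(abs(weight) - penalty, 0.0)
    return magnitude if weight >= 0 else -magnitude
```

For example, with y = [1, 2], ŷ = [1, 4], weights [0.5, −1.5, 0] and α = 0.1, the mean squared error is 2.0 and the penalty is 0.2, giving a loss of 2.2.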
The chemical descriptors used for the regressor are: NHOHCount, TPSA, NOCount, EState_VSA8, PEOE_VSA1, SMR_VSA7, HallKierAlpha, EState_VSA7, NumHDonors, VSA_EState8, EState_VSA9, SlogP_VSA8, SlogP_VSA11, VSA_EState2, EState_VSA5, VSA_EState9, Chi3v, EState_VSA6, VSA_EState5, MolLogP, BalabanJ, SMR_VSA2, PEOE_VSA2, PEOE_VSA3, SlogP_VSA2, SlogP_VSA10, EState_VSA4, PEOE_VSA10, PEOE_VSA13, PEOE_VSA7, SlogP_VSA4, PEOE_VSA11. The diagonal correlation matrix of the 32 features is presented in Additional file 1: Fig. S1.
Feature engineering
First, all features were normalized to standard scores (z-scores). The data were divided into a training set (80%) and a test set (20%). Feature selection and hyperparameter tuning were evaluated by the mean score over the held-out groups of tenfold cross-validation. A summary of the hyperparameter tuning for all models is presented in Additional file 1: Table S6. The approaches used to select the optimal set of features were: (1) Lasso regression, (2) removal of features with near-zero variance (below 0.03), (3) removal of features that are highly collinear with other features (Pearson correlation above 0.9), (4) removal of features with no correlation to the binding score (the label), and (5) forward/backward feature selection, which chooses the most significant features for prediction by iteratively selecting and evaluating each feature, one at a time. In backward feature selection, the model is fitted with all the features, the p-value of the least significant feature is examined, and if it is high (> 0.05), this feature is removed from the model. After several iterations of model fitting and feature removal, only the 32 significant features remain in the model. The selected features are listed above, and their Pearson correlations are presented in Additional file 1: Fig. S1.
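Two of the filtering steps above, the variance cutoff and the collinearity cutoff, together with the z-score normalization, can be sketched in plain Python as follows. This is an illustrative outline rather than the pipeline actually used: the function names are hypothetical, the variance filter is applied to the raw values (after z-scoring every variance would equal 1), and of two collinear features the one encountered first is kept.

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def z_scores(values):
    """Standard scores: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(variance(values))
    return [(v - mean) / std for v in values]

def pearson(x, y):
    """Pearson correlation as the mean product of the z-scores."""
    return sum(a * b for a, b in zip(z_scores(x), z_scores(y))) / len(x)

def filter_features(columns, min_variance=0.03, max_corr=0.9):
    """columns maps a feature name to its list of raw values. Drops
    low-variance features, then features that correlate strongly with one
    already kept, and returns the survivors as z-scores."""
    kept_raw = {}
    for name, values in columns.items():
        if variance(values) < min_variance:
            continue  # near-constant descriptor
        if any(abs(pearson(values, ref)) > max_corr for ref in kept_raw.values()):
            continue  # redundant with a descriptor that was already kept
        kept_raw[name] = values
    return {name: z_scores(values) for name, values in kept_raw.items()}
```

Checking the variance before computing any correlation also avoids a division by zero for constant columns.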
Training the model
Following the selection of the features and the hyperparameters, the model was trained on all the training data, and the weights and bias were extracted. The final model was evaluated on the test data. The importance of the features was estimated from the values of the linear regression coefficients and from the global SHAP value of each feature.
CNN architecture; SMILES representation of the input data
Since SMILES constitutes an indexed sequential representation of the molecules, a pre-processing step was required to convert the "words" into matrices of numbers that the computer can process. The SMILES strings were converted into matrices, each containing 42-bit vectors (inspired by the design of Hirohara et al. [17], where 21 bits were used for an atom and chemical description such as degree of unsaturation, formal charge, total valence, aromaticity and ring content, chirality, and hybridization, using RDKit, and 21 bits for structural information, Additional file 1: Table S3). Zero-padding vectors were added to the bottom of each matrix in order to standardize the matrix size. Matrices of 240 × 42 were then used as input to the CNN. The dataset containing 791 molecules was split into a training set (80%, 632 molecules) and a test set (20%, 159 molecules). Fivefold cross-validation training (using only the training data) was performed for model optimization and hyperparameter tuning. The test MAE was averaged over the five sub-groups for each model, and the model with the lowest MAE was chosen.
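The zero-padding step and the split sizes quoted above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; the featurization of each SMILES token into a 42-element vector is not shown and is assumed to have been done beforehand.

```python
def pad_matrix(rows, n_rows=240, n_cols=42):
    """Append all-zero rows to the bottom of a feature matrix (a list of
    n_cols-long rows) until it has exactly n_rows rows."""
    if len(rows) > n_rows:
        raise ValueError("matrix has more rows than the padded size")
    if any(len(r) != n_cols for r in rows):
        raise ValueError("every row must have n_cols entries")
    padding = [[0] * n_cols for _ in range(n_rows - len(rows))]
    return [list(r) for r in rows] + padding

def split_sizes(n_total, train_fraction=0.8):
    """Number of training and test items, rounding the training size down."""
    n_train = int(n_total * train_fraction)
    return n_train, n_total - n_train
```

For 791 molecules, `split_sizes(791)` gives 632 training and 159 test molecules, matching the numbers in the text.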
The CNN consists of two convolutional layers and two fully connected layers separated by global max-pooling. A rectified activation function (Rectified Linear Unit, ReLU [31]) was applied to the convolution results in the hidden layers and after the first fully connected operation. As we aimed to solve a regression type of problem, no activation function was used for the output layer (after the second fully connected layer). Batch normalization was applied after each computational layer, and then average pooling was applied to the matrices. All weights were initialized from a normal distribution with a mean of 0 and a standard deviation of 0.01. The model was trained using mini-batch (32 instances each) gradient descent for 1200 epochs. Optimization was achieved using ADAM [32] with a learning rate of 5 × 10⁻⁵. The weights that achieved the best test loss were saved as the final model.
The output from the global max-pooling layer was extracted for the entire dataset before it was passed on to the fully connected layer. The data were converted into z-scores for each filter separately. Pixels with z-scores larger than 2.58 (i.e., 99th percentile) were chosen, and the receptive field was calculated for the corresponding molecules; that is, from the one pixel that remains in each filter after global max-pooling, we took only the pixels whose values fall within the top 1% of a normal distribution after normalization. The receptive fields of these pixels therefore represent the motifs that contribute most significantly to the score.
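The two calculations in this paragraph, selecting strongly activated molecules per filter and mapping an activation back to its receptive field, can be sketched as follows. The function names are hypothetical, and the kernel sizes and strides in the example are placeholders, since the text does not give them.

```python
import math

def significant_activations(activations, z_cutoff=2.58):
    """Indices of molecules whose max-pooled activation for one filter has
    a z-score above the cutoff."""
    n = len(activations)
    mean = sum(activations) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in activations) / n)
    if std == 0:
        return []  # a constant filter output carries no signal
    return [i for i, a in enumerate(activations) if (a - mean) / std > z_cutoff]

def receptive_field(kernel_sizes, strides=None):
    """Number of consecutive input rows seen by one output position of
    stacked 1-D convolutions (standard receptive-field recursion)."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field
```

For instance, two stacked convolutions with a hypothetical kernel size of 3 and stride 1 see 5 consecutive rows of the input matrix, so a selected pixel would correspond to a motif of up to five SMILES tokens.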
CNN architecture; pictorial representation of the input data
Preprocessing of data
The dataset contained 791 color images of molecular graphs (783 × 1316 × 3), each assigned the binding score obtained by docking to the RNA target. For pre-processing, the foreground of each image was dilated and the image was then cropped to reduce background pixels, which resized the images to 775 × 775 × 3. Next, the images were resized to 50 × 50 × 3. The pixel values of the images were normalized into the range [0, 1]. In addition, the color images were converted to grayscale and to binary images for different experiments, and the same pre-processing was then applied (Additional file 1: Fig. S2). The data were divided into a training set (80%, 632 images) and a test set (20%, 159 images).
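The pixel-level conversions mentioned above can be illustrated on nested Python lists. This is a sketch under stated assumptions: images are taken to be 8-bit (values 0 to 255), grayscale conversion is done by unweighted channel averaging, and the binarization threshold of 0.5 is hypothetical; the dilation, cropping and resizing steps are not shown.

```python
def to_unit_range(image, max_value=255):
    """Scale every channel value of an H x W x C nested list into [0, 1]."""
    return [[[v / max_value for v in pixel] for pixel in row] for row in image]

def to_gray(image):
    """Collapse the channels of each pixel by unweighted averaging
    (an assumed conversion)."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in image]

def to_binary(gray, threshold=0.5):
    """Mark dark (foreground) pixels as 1 and light (background) pixels as 0;
    the threshold is a hypothetical choice."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]
```

Applied in sequence to a white pixel and a black pixel, these functions give unit-range values of 1.0 and 0.0, the same gray values, and binary values of 0 and 1.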
Designing the model and training
For model optimization and hyperparameter tuning, we used fivefold cross-validation. The test MAE was averaged over the five sub-groups for each model, and the model with the lowest MAE was chosen. The processed pixel values of the images were used as input for a CNN. The model contains 2 convolution layers followed by max pooling and 2 dense layers at the end of the network. A rectified activation function (Rectified Linear Unit, ReLU [33]) was applied to all matrices within the hidden layers. As we aimed to solve a regression type of problem, no activation function was used for the output layer, i.e., the output layer was linear. For the optimization, the ADAM algorithm [32] was used, and the loss function was expressed in terms of the mean absolute error (MAE). A dropout layer with a rate of 20% was added before the final layer. The learning rate was 0.001.
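The selection criterion described above, averaging the MAE over the five folds and keeping the configuration with the lowest mean, can be written as a short Python sketch. The function names and the configuration labels in the usage note are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def select_model(fold_maes):
    """fold_maes maps a configuration name to its list of per-fold MAE
    values. Returns the name with the lowest mean MAE and that mean."""
    means = {name: sum(v) / len(v) for name, v in fold_maes.items()}
    best = min(means, key=means.get)
    return best, means[best]
```

For example, given per-fold MAE values of [0.5, 0.6, 0.4, 0.5, 0.5] for a configuration labelled "A" and [0.7, 0.8, 0.6, 0.7, 0.7] for one labelled "B", `select_model` returns "A" with a mean MAE of 0.5.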