KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images

The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC50 data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which require only standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that simply averaging the output of the two model types yields significantly lower errors in prediction for multiple data sets than either model alone, although the effect size is small, indicating that the features extracted by the convolutional layers of the ConvNets provide predictive signal complementary to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images make it possible to model COX isoform selectivity on a continuous scale with errors in prediction comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from Kekulé structure representations with state-of-the-art performance, requiring neither the generation of compound descriptors nor sophisticated image processing techniques.
The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope.


Introduction
Cultured cancer cell lines are limited disease models in that they do not recapitulate the tumor microenvironment nor interactions with the immune system [1][2][3][4][5][6], fundamental properties of cellular organization are altered in culture 7, and their response to anticancer drugs is affected by both assay heterogeneity 8 and genomic alterations acquired in vitro 9. However, cancer cell lines still represent versatile models to study fundamental aspects of cancer biology 10,11 and the genomic determinants of drug response 3,[12][13][14]. Hence, the development of computational methods to harness the large amount of in vitro sensitivity data collected to date, unravel the underlying molecular mechanisms mediating drug activity, and identify novel biomarkers for drug response is an area of intense research [14][15][16][17][18][19][20].
Whereas existing computational tools to model in vitro compound activity mostly rely on established algorithms (e.g., Random Forest or Support Vector Machines), the utilization of deep learning in drug discovery is gaining momentum, a trend that is only expected to increase in the coming years 21. Deep learning techniques have already been applied to numerous drug discovery tasks, including toxicity modelling 22,23, bioactivity prediction [24][25][26][27][28][29][30], and de novo drug design [31][32][33][34], among others. Most of these studies have utilized feedforward neural networks consisting of multiple fully-connected layers trained on one of the many compound descriptors developed over the last >30 years in the chemoinformatics field 27,35. However, the high performance of convolutional neural networks (ConvNets) [36][37][38], a type of neural network developed for image recognition tasks, in finding complex high-dimensional relationships in diverse image data sets is fostering their application in drug discovery 21,39.
ConvNets consist of two sets of layers (Figure 1): (i) the convolutional layers, which extract features from the input images, and (ii) the classification/regression layers, which are generally fully-connected layers that output one value for each of the tasks being modelled. A major advantage of ConvNets is that the extraction of features is performed in a fully automatic and data-driven fashion, removing the need to engineer feature selection or image preprocessing filters beforehand 39. Today, convolutional neural networks are applied to diverse image recognition tasks in healthcare and biomedicine [40][41][42][43]. A critical element for the application of ConvNets is the availability of images for training, or the ability to formulate the modelling task of interest as an image classification problem. An illustrative example of the latter is DeepVariant 44, a recently proposed algorithm that uses images of sequencing read pileups as input to detect small indels and single-nucleotide polymorphisms, instead of statistically modelling the genotypes supported by the data, as has been the standard approach for years.
In drug discovery, applications of ConvNets include elucidation of the mechanism of action of small molecules from high-content screening images 45,46 , and modelling in vitro assay endpoints using 2D representations of compound structures, termed "compound images", as input 23,[47][48][49][50] .
Efforts to model compound activity using ConvNets trained on compound images were spearheaded by Goh et al., who developed Chemception 47,51, a ConvNet based on the Inception-ResNet v2 architecture 52. The performance of Chemception was compared to multi-layer perceptron deep neural networks trained on circular fingerprints in three tasks: prediction of free energy of solvation (632 compounds; regression), inhibition of HIV replication (41,193; binary classification), and compound toxicity using data from the "Toxicology in the 21st Century" (Tox21) project (8,014; multi-task binary classification) 53. Chemception slightly outperformed the multi-layer perceptron networks except for the Tox21 task. In a follow-up study, the same group introduced ChemNet 47, a learning strategy that consists of pretraining a ConvNet (e.g., Chemception 51) on a large set of compounds (1.7M) to predict their physicochemical properties (e.g., logP, which represents an easy task) in order to learn general aspects of the images related to chemistry. Subsequently, the trained networks are applied to model smaller data sets using transfer learning. Although such an approach led to higher performance than Chemception, a major disadvantage thereof is that it requires the initial training of the network on a large set of compounds, which is computationally demanding. More recently, Fernández et al. proposed Toxic Colors, a framework to classify toxic compounds from the Tox21 data set using compound images as input 23. Although these studies have paved the way for the application of ConvNets to model the bioactivity of compounds using their images as input, a comprehensive analysis of ConvNet architectures with a reduced computational footprint to model cancer cell line sensitivity on a continuous scale, together with a comparison against the state of the art, is still missing. Moreover, whether the combination of models trained on widely-used compound descriptors (e.g., circular fingerprints) and ConvNets trained on compound images leads to increased predictive power remains to be studied.
Here, we introduce KekuleScope, a flexible framework for modelling the bioactivity of compounds on a continuous scale from their Kekulé structure representations using ConvNets pretrained on unrelated image classification tasks. We demonstrate using eight cytotoxicity data sets from ChEMBL version 23 (Table 1) that these compound images convey enough predictive power to build robust models using ConvNets. Instead of using networks pretrained on compound images 47, we show that widely-used architectures for image classification tasks (AlexNet 54, DenseNet-201 55, ResNet152 56 and VGG-19 57) are versatile enough to generate robust predictions across a dynamic range of bioactivity values using compound images as input. Moreover, comparison with Random Forest (RF) models trained on circular fingerprints (Morgan fingerprints 58,59) reveals that ConvNets trained using compound images lead to comparable or higher predictive power on the test set. In addition, combining RF and ConvNet predictions into model ensembles consistently leads to increased model performance, suggesting that the features extracted by the convolutional layers of the networks provide information complementary to Morgan fingerprints. Therefore, our work presents a novel framework for the prediction of compound activity that requires minimal processing of chemical structures and no descriptor choice, and that leads to improved predictive power over the state of the art in our validation on eight cancer cell line sensitivity data sets.

Data Collection and Curation
We gathered cytotoxicity IC50 data for eight cancer cell lines from ChEMBL database version 23 using the chembl_webresource_client python module [60][61][62]. To gather high-quality cytotoxicity data sets, we only kept IC50 values for small molecules that satisfied the following stringent filtering criteria 8: (i) activity unit equal to "nM", and (ii) activity relationship equal to '='. IC50 values were modeled on a logarithmic scale (pIC50 = −log10 IC50 [M]). The average pIC50 value was calculated when multiple IC50 values were annotated for the same compound-cell line pair. Further information about the data sets is given in Table 1. All data sets used in this study are available at https://github.com/isidroc/kekulescope.
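The filtering and aggregation steps described above can be sketched in a few lines of Python. The helper names below are illustrative, not part of the KekuleScope code base:

```python
from math import log10

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 [M])."""
    return -log10(ic50_nm * 1e-9)

def aggregate_pic50(records):
    """Average pIC50 values per (compound, cell line) pair.

    `records` is a list of (compound_id, cell_line, ic50_nm, relation, unit)
    tuples; only '=' relations with 'nM' units are kept, mirroring the
    stringent filtering criteria described in the text.
    """
    grouped = {}
    for compound, cell_line, ic50_nm, relation, unit in records:
        if relation != "=" or unit != "nM":
            continue  # keep exact nM measurements only
        grouped.setdefault((compound, cell_line), []).append(pic50_from_nm(ic50_nm))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}
```

For example, a compound measured at 1000 nM and 100 nM against the same cell line would receive the average of pIC50 values 6.0 and 7.0.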

Molecular Representation
We standardized all chemical structures to a common representation scheme using the python module standardiser (https://github.com/flatkinson/standardiser). Inorganic molecules were removed, and the largest fragment was kept in order to filter out counterions. We note that, although imperfect, removing counterions is a standard procedure in the field 63,64.
In addition, salts are not generally well-handled by descriptor calculation software, and hence, filtering them out is generally preferred 65 .
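As an illustration of the counterion-removal step, the sketch below keeps the largest dot-separated fragment of a SMILES string. This is a deliberately crude stand-in for the standardiser pipeline: fragment size is approximated by counting atom-like letters rather than by a proper atom count, which is sufficient for simple salts:

```python
def strip_counterions(smiles):
    """Keep only the largest dot-separated fragment of a SMILES string.

    A crude approximation of counterion removal: fragments are compared by
    their number of alphabetic characters, not by a true atom count.
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(c.isalpha() for c in frag))
```

For example, sodium acetate "CC(=O)O.[Na+]" reduces to the acetate fragment "CC(=O)O". A production pipeline would instead operate on parsed molecules (e.g., RDkit's fragment utilities) rather than raw strings.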
Kekulé structure representations for all compounds (i.e., 'compound images') in Scalable Vector Graphics (SVG) format were generated from the compound structures in SDF format using the RDkit function MolsToGridImage with default parameter values.

Data Splitting
The data sets were randomly split into a training (70% of the data), validation (15%), and test set (15%). For each data set, the training set was used to train a given network, the validation set served to monitor its predictive power during the training phase, and the test set served to assess the predictive power after the ConvNets were trained.
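A minimal sketch of the 70/15/15 split described above (the function name is ours):

```python
import random

def split_dataset(items, frac_train=0.70, frac_val=0.15, seed=None):
    """Randomly split `items` into training/validation/test subsets.

    Whatever remains after the training and validation fractions goes to
    the test set, mirroring the 70/15/15 split used in this study.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * frac_train)
    n_val = int(len(shuffled) * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Because the split is random, repeating it with different seeds yields the distinct train/validation/test assignments used for the ten replicates described later.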

Network Architecture and Training
ConvNets pretrained on the ImageNet 67 data set were downloaded using the python library Pytorch 68. The classification layer(s) of each architecture were modified to output a single value, corresponding to compound pIC50 values in this case, by removing the softmax transformation of the last fully-connected layer (which is used in classification tasks to output class scores in the 0-1 range). The Root Mean Squared Error (RMSE) on the validation set was used as the loss function during the training phase of the ConvNets, and to compare the predictive power of RF models and ConvNets on the test set. We performed a grid search to find the optimal combination of parameters for all networks. The parameter values considered are listed in Table 2.
We generated an extended version of each architecture by including five fully-connected layers (Figure 1C), consisting of 4,096, 1,000, 200, and 100 neurons, followed by a single output neuron (Figure 1). Thus, for each architecture we implemented two regression versions, one containing one fully-connected layer and a second one containing five fully-connected layers (abbreviated from now on as "extended"). The feature extraction layers were not modified. We used the stochastic gradient descent algorithm with Nesterov momentum 69 to train all networks; the momentum was set to 0.9 and kept constant during the training phase. The parameters for all layers, including the convolutional and regression layers, were optimized during the training phase. Networks were allowed to evolve over 600 epochs. To reduce the chance of overfitting, we used early stopping, i.e., the training phase was stopped if the validation loss did not decrease after 250 epochs, and 50% dropout 27,70,71 in the five fully-connected layers (labelled as "regression layers" in Figure 1) in the extended versions of the architectures considered. The training phase was divided into cycles, throughout which the learning rate was annealed and set back to its original value at the beginning of the next cycle. The learning rate was decreased by 90 or 40% every 10 or 25 epochs (decay rates of 0.1 and 0.6, respectively; Table 2).
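The cyclical learning-rate annealing and early-stopping logic described above can be sketched as follows. `stepwise_lr` and `EarlyStopping` are illustrative helpers written for this sketch, not the actual KekuleScope implementation (which relies on Pytorch schedulers):

```python
def stepwise_lr(base_lr, epoch, cycle_length, step, gamma):
    """Learning rate at `epoch` under cyclical step decay.

    Within each cycle the rate is multiplied by `gamma` every `step`
    epochs (gamma = 0.1 or 0.6 and step = 10 or 25 in the text); at the
    start of each new cycle it is reset to `base_lr`.
    """
    epoch_in_cycle = epoch % cycle_length
    return base_lr * gamma ** (epoch_in_cycle // step)

class EarlyStopping:
    """Stop training once the validation loss has stagnated for `patience` epochs."""

    def __init__(self, patience=250):
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

With base_lr = 0.01, step = 10 and gamma = 0.1, the rate drops to 0.001 at epoch 10 and returns to 0.01 when a new cycle begins.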
Random Forest (RF)
RF models were generated using the python library scikit-learn 72 based on Morgan fingerprint representations, which were calculated as described above. Default parameter values were used except for the number of trees, which was set to 100 because higher values do not generally increase model performance when modelling bioactivity data sets 15,73. Identical data splits were used to train the ConvNets and the RF models.
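A minimal sketch of the RF setup, using toy binary vectors in place of the real 256-bit Morgan fingerprints (the data below are invented for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for Morgan fingerprints (binary vectors) and pIC50 labels;
# the actual models use 256-bit fingerprints computed with RDkit.
X_train = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
y_train = [5.2, 6.8, 6.1, 5.5]

# 100 trees, as in the study; all other parameters left at their defaults.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
predictions = rf.predict([[0, 1, 1, 0], [1, 0, 0, 1]])
```

Because each tree predicts an average of training labels, the predictions necessarily stay within the observed pIC50 range of the training data.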

Experimental Design
To compare the predictive power of the ConvNets and RF models in a robust statistical manner, we designed a balanced fixed-effect full-factorial experiment with replications 74. The following factors were considered: (i) Data set: 8 data sets (Table 1); (ii) Model: 7 convolutional network architectures; (iii) Batch: batch size; and (iv) Augmentation: whether data augmentation was applied. Model performance was described by the linear model (Equation 1):

RMSE_{i,j,k,l,m} = μ_0 + Data set_i + Model_j + Batch_k + Augmentation_l + ε_{i,j,k,l,m}    (Equation 1)

where the factors Data set_i, Model_j, Batch_k, and Augmentation_l are the main effects considered in the model. The levels "A2780" (Data set), "AlexNet" (Model), "4" (Batch), and "0" (Augmentation) were used as reference factor levels to calculate the intercept term of the linear model, μ_0, which corresponds to the mean RMSE value for this combination of factor levels. The coefficients (i.e., slopes) for the other combinations of factor levels correspond to the difference between their mean RMSE value and the intercept. The error term, ε_{i,j,k,l,m}, corresponds to the random error of each RMSE value, defined as ε_{i,j,k,l,m} = RMSE_{i,j,k,l,m} − mean(RMSE_{i,j,k,l}). These errors are assumed to (i) be mutually independent, (ii) have zero expectation value, and (iii) have constant variance.
We trained ten models for each combination of factor levels, each time randomly assigning different sets of data points to the training, validation and test sets. The normality and homoscedasticity assumptions of the linear models were assessed with (i) quantile-quantile (Q-Q) plots and (ii) plots of the fitted values against the residuals, respectively 74.
Homoscedasticity means that the residuals are equally dispersed across the range of the dependent variable used in the linear model.A systematic bias of the residuals would indicate that the errors are not random and that they contain predictive information that should be included in the model 75,76 .
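To illustrate how the intercept and coefficients of the linear model are interpreted, the sketch below estimates main effects from cell means. This is a simplified illustration of Equation 1, not the least-squares fit used in the actual analysis:

```python
from statistics import mean

def factor_effects(observations, reference):
    """Estimate the intercept and main-effect coefficients of a factorial design.

    `observations` maps (dataset, model, batch, augmentation) tuples to lists
    of replicate RMSE values; `reference` is the reference level combination.
    The intercept is the mean RMSE at the reference levels, and each
    coefficient is the shift in mean RMSE when a single factor changes level
    (main effects only; no interaction terms).
    """
    intercept = mean(observations[reference])
    coefficients = {}
    for position, ref_level in enumerate(reference):
        levels = {key[position] for key in observations}
        for level in levels - {ref_level}:
            # Cell where only this one factor deviates from the reference.
            cell = tuple(level if i == position else reference[i]
                         for i in range(len(reference)))
            if cell in observations:
                coefficients[(position, level)] = mean(observations[cell]) - intercept
    return intercept, coefficients
```

A negative coefficient for an architecture level then reads directly as a reduction in mean RMSE relative to the AlexNet reference.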
To compare the performance of RF, the most predictive ConvNet for each data set and replication, and the ensemble models, we also used a linear model with two factors, namely Data set and Model (Equation 2). In this case, we only considered the results of the ConvNet architecture leading to the lowest RMSE value on the test set for each data set and replication.

Results and Discussion
We initially evaluated the performance of ConvNets in predicting the activity of compounds from their Kekulé structure representations. To this aim, we generated models for all data sets using four widely-used architectures, namely AlexNet, DenseNet-201, ResNet-152, and VGG-19 with batch normalization (VGG-19-bn), and the extended versions thereof that we implemented by including four additional fully-connected layers after the convolutional layers (see Methods and Figure 1). We obtained high performance on the test set for all networks, with mean RMSE values in the 0.65-0.96 pIC50 range (Figure 2). These errors in prediction are comparable to the uncertainty of heterogeneous IC50 measurements in ChEMBL 8, and to the performance of drug sensitivity prediction models previously reported 15,18,77. Notably, high performance was also obtained for data sets containing a few hundred compounds (e.g., LoVo or HCT-15), suggesting that the framework proposed here is applicable to model small data sets.
In order to study the relative performance of the networks in a robust manner, we implemented a factorial design that we evaluated using a linear model (Equation 1). The linear model displayed an R2 value, adjusted for the number of parameters, of 0.68 (P < 10−12), indicating that the variables considered in our factorial design explain a large proportion of the variation observed in model performance; hence, its coefficients provide valuable information to study the relative performance of the networks in a statistically sound manner. Analysis of the model coefficients revealed that the extended versions of the architectures considered consistently led to a decrease in RMSE values of ~5-10% (P < 10−12; Figure 2), with ResNet-152 and VGG-19-bn consistently yielding the most predictive models. Together, these results suggest that the four additional fully-connected layers we included in the architectures and the use of dropout regularization help mitigate overfitting (Figure 2), and hence increase the generalization capabilities of the networks.
We previously showed that data augmentation represents a versatile approach to increase the predictive power of Random Forest models trained on compound fingerprints 78. Similarly, we here find a significant increase in performance for ConvNets trained on augmented data sets (P = 0.02). In fact, the utilization of data augmentation during training led to the most predictive models in 68% of the cases; when considering only the most predictive network for each data set and run, we find that data augmentation was used in 91% of the cases. Overall, these results indicate that the extraction of chemical information by the ConvNets is robust against rotations of the compound images, and that data augmentation helps improve chemical structure-activity modelling based on compound images 78.
In addition to assessing the predictive power on the test set, we performed Y-scrambling experiments to ensure that the predictive power obtained by the ConvNets did not arise by chance. With this aim in mind, the bioactivity values for the training and validation set instances were shuffled before training. We observed R2 values around 0 (P < 0.001) for the observed against the predicted values on the test set for all the Y-scrambling experiments we performed.
Therefore, these results indicate that the features extracted by the convolutional layers capture chemical information related to bioactivity.
Next, we compared the predictive power of the ConvNets to that of RF models trained on Morgan fingerprints using the factorial design described in Equation 2. The linear model in this case showed an adjusted R2 value of 0.97, suggesting that the covariates we considered account for most of the variability in model performance. Overall, we did not find significant differences in performance between RF models trained on circular fingerprints and ConvNets trained on compound images (P = 0.76; Figures 3-4), although the average performance of the ConvNets was slightly higher than that of RF for several data sets. Morgan fingerprints and RF have previously been shown to generate models with high predictive power in benchmarking studies of compound descriptors and algorithms [78][79][80]. Taken together, these results suggest that compound images provide sufficient predictive signal to generate ConvNets with predictive power comparable to state-of-the-art methods, even for small data sets of a few hundred compounds.
To further characterize the differences between RF and ConvNets, we first assessed the correlation between the predicted values calculated for the same test set instances using models trained on the same data splits. We found (Figure 5) that the predictions of both models are highly correlated for all data sets, with R2 values in the 0.80-0.89 range (Pearson's correlation coefficient; P < 0.05), indicating that the predictions calculated with the RF models explain a large fraction of the variance observed for the predictions calculated with the ConvNets, and vice versa. Analysis of the correlation of the absolute errors in prediction for each test set instance, however, revealed that the error profiles of RF and ConvNets are only moderately correlated (R2 in the 0.58-0.65 range, P < 0.05; Figure 6). From the latter, we hypothesized that combining the predictions generated by each modelling technique into a model ensemble might lead to increased predictive power 82. In fact, ensemble models built by averaging the predictions generated by RF and ConvNet models consistently displayed higher predictive power, leading to 4-12% and 5-8% decreases in RMSE values with respect to RF and ConvNet models, respectively (P < 10-5; green bars in Figure 3). In contrast to previous analyses 47, where compound fingerprints and related representations were often thought to contain most of the information related to bioactivity 81, our results indicate that Morgan fingerprints and the features extracted from compound images by the ConvNets convey complementary predictive signal, so that combining them into a model ensemble yields more accurate predictions than either model alone.
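The ensemble step amounts to averaging the two prediction vectors. The toy example below (invented values) shows how errors in partly opposite directions cancel:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def ensemble(pred_a, pred_b):
    """Average two prediction vectors, as done for the RF/ConvNet ensembles."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

# Toy pIC50 values: the two models err in partly opposite directions,
# so averaging cancels part of the error.
observed = [5.0, 6.0, 7.0, 8.0]
rf_pred = [5.4, 5.7, 7.3, 7.8]
convnet_pred = [4.8, 6.2, 6.8, 8.3]
combined = ensemble(rf_pred, convnet_pred)
```

Whenever the two error profiles are only moderately correlated, as observed here, the averaged predictions tend to have a lower RMSE than either input.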
Finally, we examined the distributions of the residuals for the ConvNet and RF models. This analysis serves to ensure that the low RMSE values observed are not a consequence of simply predicting the mean value of the response variable, because, as we have previously shown for protein-ligand systems 83, networks that fail to converge often simply predict the mean value of the dependent variable. Overall, we observed similar patterns for both modelling approaches (Figure 7), with residuals centered around zero and generally showing homoscedasticity, i.e., displaying comparable variance across the entire bioactivity range.
Examination of the residuals is also important when modelling imbalanced data sets, which is generally the case for data sets extracted from ChEMBL, because a large fraction of instances are annotated with pIC50 values in the low micromolar range (4-5 pIC50 units), and by simply predicting the mean value of the response variable one might already obtain low RMSE values.
In such cases, the residuals would be heteroscedastic, displaying increasingly higher variance towards the low-nanomolar range (i.e., pIC50 values of 8-9); this, however, was not the case for the models generated here. Together, these results indicate that compound images convey sufficient chemical information to model compound bioactivities across a wide dynamic range of pIC50 values.
In this work, we show that a proper design and parametrization of ConvNets is sufficient to generate highly predictive models trained on images of structural compound representations sketched using standard functionalities of commonly used software packages (e.g., RDkit).
Therefore, exploiting such networks which were designed for general image recognition tasks, and pre-trained on unrelated image data sets, represents a versatile approach to model compound activity directly from Kekulé structure representations in a purely data-driven fashion.
However, it is paramount to note that the computational footprint of ConvNets still represents a major limitation of this approach: whereas training the Random Forest models for these data sets required 6-14 seconds per model using 16 CPU cores and no parameter optimization, training times per epoch for the ConvNets were in the 15-64 seconds range (i.e., 150-640 minutes per model using one GPU card and 16 CPU cores for image processing).
While the computation of compound descriptors has traditionally relied on predefined rules or prior knowledge of chemical properties, bioactivity profiles or topological information of compounds, among others 84 , the descriptors calculated by the convolutional layers of ConvNets represent an automatic and data-driven approach to derive features from chemical structures 50 .
As we show in this study, these compound features make it possible to model compound activity with high accuracy, even on a continuous scale. However, image-derived features are still harder to interpret than more traditional descriptors, e.g., unhashed Morgan fingerprints 82. We anticipate that extending the work presented here by including 3D representations of compounds and binding sites using 3D convolutional neural networks, to account for conformational changes of small molecules and protein dynamics, respectively, will likely improve compound activity modelling [85][86][87][88]. Finally, future work will also be required to evaluate whether ConvNets trained on both compound and cellular images lead to more accurate modelling of compound activity on cancer cell lines, as well as of other output variables (e.g., toxicity), than current modelling approaches based on gene expression or mutation profiles 15,16,18,89.

Conclusions
In this contribution, we introduce KekuleScope, a framework to model compound activity on a continuous scale using extended versions of four widely-used architectures trained on Kekulé structure representations, without requiring any image preprocessing steps. All architectures enabled the generation of models with high predictive power on the test set, with RMSE values in the 0.65-0.96 pIC50 range, comparable to RF models trained on circular fingerprints and to the estimated experimental uncertainty of the input data; the extended versions of these architectures, designed by including five fully-connected layers (see Figure 1), consistently led to increased predictive power. Our work thus shows that Kekulé representations can be harnessed to derive robust models without requiring any additional descriptor calculation. In addition, we show that the chemical information extracted by the convolutional layers of the ConvNets is complementary to that provided by Morgan fingerprints: ensemble models consistently displayed higher predictive power than either RF models or ConvNets alone (P < 10-5), leading to 4-12% and 5-8% decreases in RMSE values with respect to RF and ConvNet models, respectively. The framework proposed here is generally applicable across endpoints, and we expect that combining models will also lead to increases in performance on other data sets.

Figure 1
Figure 1 KekuleScope framework. (A) We collected and curated a total of eight cytotoxicity data sets from ChEMBL version 23. (B) Compound Kekulé representations were generated for all compounds and used as input to the ConvNets. (C) We implemented an extended version of commonly used architectures (e.g., VGG-19-bn, shown in the Figure) by including five additional fully-connected layers to predict pIC50 values on a continuous scale. (D) The generalization power of the networks was assessed on the test set, and compared to Random Forest models trained using Morgan fingerprints as covariates.

Figure 2
Figure 2 Benchmarking the predictive power of ConvNet architectures. Mean RMSE values (+/- standard deviation) on the test set across ten runs for each of the ConvNet architectures explored in this study (AlexNet 54, DenseNet-201 55, ResNet152 56 and VGG-19 57). Overall, all architectures enabled the generation of models with high predictive power on the test set, with RMSE values in the 0.65-0.96 pIC50 range. However, the extended versions of these architectures that we designed by including 5 fully-connected layers (see Figure 1) consistently led to increased predictive power on the test set.

Figure 3
Figure 3 Comparing the predictive power of ConvNets and RF. Mean RMSE values (+/- standard deviation) on the test set across ten runs for (i) the ConvNet showing the highest predictive power for each data set and run combination, (ii) RF models trained on Morgan fingerprints, and (iii) the ensemble models built by averaging the predictions generated with the RF and ConvNet models. Overall, ConvNets lead to predictive power comparable to RF models (or higher in some cases, although the effect size is small and not significant). Ensemble models consistently displayed higher predictive power than either model alone (P < 10-5), leading to 4-12% and 5-8% decreases in RMSE values with respect to RF and ConvNet models, respectively.

Figure 4
Figure 4 Predictions for the test set molecules. Observed against predicted pIC50 values for the test set compounds calculated using ConvNets (top panels) or RF models (bottom panels). The results for the ten repetitions are shown (in each repetition the molecules in the training, validation and test sets were different). Overall, both RF models and ConvNets generated comparable error profiles across the entire bioactivity range considered, showing Pearson's correlation coefficient values in the 0.72-0.84 range.

Figure 5
Figure 5 RF and ConvNet predictions on the test set. Correlation between the predictions for the test set compounds calculated with RF models and ConvNets trained on the same training set instances. Overall, the predictions show a positive and significant correlation (P < 0.05; Pearson's correlation coefficient values in the 0.72-0.84 range). The predictions for the ten runs are shown.

Figure 6
Figure 6 Absolute errors in prediction. Relationship between the absolute errors in prediction for the same test set instances calculated with ConvNets (x-axis) and RF models (y-axis).

Figure 7
Figure 7 Analysis of the residuals. Residuals for the ConvNet (top panels) and RF (bottom panels) models. Overall, the residuals are centered around zero and show comparable variance across the bioactivity range for both types of models, indicating that compound images make it possible to model the activity of small molecules across a dynamic range of pIC50 values.

Table 1. Data sets used in this study. Columns: Cell line, Description, ChEMBL Cell ID, Cellosaurus ID, Organism of origin, Number of bioactivity data points.
To represent molecules for subsequent model generation based on fingerprints, we computed circular Morgan fingerprints 58 for all compounds using RDkit (release version 2013.03.02) 66. The radius was set to 2 and the fingerprint length to 256.
Compound images were generated with the RDkit function MolsToGridImage and default parameter values. SVG images were then converted to Portable Network Graphics (PNG) format using the programme convert (version ImageMagick 6.7.8-9 2016-03-31 Q16; http://www.imagemagick.org) and resized to 224 x 224 pixels using a density (-density argument) of 800. The code needed to reproduce the results presented in this study is provided at https://github.com/isidroc/kekulescope.

Table 2 . Parameters tuned during the training phase using grid search.
The names in parentheses indicate the parameter name abbreviation used in the main text and figures.
Data augmentation consisted of three random image transformations, including vertical flips (transforms.RandomVerticalFlip) and random 90° rotations (transforms.RandomRotation). In all three cases, each transformation was applied at every epoch during the training phase with a 50% chance. Thus, in some cases a set of the images might remain intact in a given epoch, depending on this sampling step.