Chemically Aware Model Builder (camb): an R package for property and bioactivity modelling of small molecules

Background: In silico predictive models have proved valuable for the optimisation of compound potency, selectivity and safety profiles in the drug discovery process.
Results: camb is an R package that provides an environment for the rapid generation of quantitative Structure-Property and Structure-Activity models for small molecules (including QSAR, QSPR, QSAM and PCM) and is aimed at both advanced and beginner R users. camb's capabilities include the standardisation of chemical structure representation; computation of 905 one-dimensional and 14 fingerprint-type descriptors for small molecules, 8 types of amino acid descriptors and 13 whole-protein sequence descriptors; filtering methods for feature selection; generation of predictive models (through an interface to the R package caret); and techniques to create model ensembles (from the R package caretEnsemble). Results can be visualised through high-quality, customisable plots (R package ggplot2).
Conclusions: Overall, camb constitutes an open-source framework to perform the following steps: (1) compound standardisation, (2) molecular and protein descriptor calculation, (3) descriptor pre-processing and model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules. camb aims to speed up model generation, and to provide reproducibility and tests of robustness. QSPR and proteochemometric case studies are included which demonstrate camb's application.
Graphical abstract: From compounds and data to models: a complete model building workflow in one package.
Electronic supplementary material: The online version of this article (doi:10.1186/s13321-015-0086-2) contains supplementary material, which is available to authorized users.


Reading and Preprocessing
The compounds are read in and standardised. Internally, Indigo's C API [2], incorporated into the camb package, is used to perform this task. Molecules are represented with implicit hydrogens, dearomatised, and passed through the InChI format to ensure that tautomers are represented by the same SMILES.
The StandardiseMolecules function allows representation of the molecular structures in a similarly processed form. The arguments of this function allow control over the maximum number of (i) fluorines, (ii) chlorines, (iii) bromines, and (iv) iodines the molecules can contain in order to be retained for training. Inorganic molecules (those containing atoms not in {H, C, N, O, P, S, F, Cl, Br, I}) are removed if the argument remove.inorganic is set to TRUE, which is the function's default behaviour. The upper and lower limits for molecular mass can be set with the arguments min.mass.limit and max.mass.limit. The name of the file containing the chemical structures is provided by the argument structures.file.

std.options <- StandardiseMolecules(structures.file = "solubility_2007_ref2.sdf",
                                    standardised.file = "standardised.sdf",
                                    removed.file = "removed.sdf",
                                    properties.file = "properties.csv")

Molecules that Indigo manages to parse and that pass the filters are written to the file indicated by the standardised.file argument once they have been through the standardisation procedure. Molecules that were discarded are written to the file indicated by the removed.file argument. The molecule name and the molecular properties specified in the structure file are written, in CSV format, to the file indicated by the argument properties.file. A column kept is added which indicates which molecules were deleted (0) or kept (1).
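As a minimal illustration of how the resulting report can be used, the sketch below filters a table like the one written to properties.file; the toy rows here are assumptions standing in for an actual read.csv("properties.csv") call.

```r
# Toy stand-in for the CSV report written to properties.file.
props <- data.frame(Name = c("mol1", "mol2", "mol3"),
                    kept = c(1, 0, 1))
# In practice: props <- read.csv("properties.csv")
retained <- props[props$kept == 1, ]   # molecules that passed the filters
nrow(retained)                          # number of standardised molecules kept
```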

Target Visualisation
camb provides a function to visualise the density of the target variable. Visualising the distribution of the target variable can give a sense of how accurate a trained model needs to be: the narrower the distribution, the lower the RMSE must be for the model to exhibit predictive power.
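As a base-R sketch of this idea (with simulated values, not data from the paper), one can relate the spread of the target to the RMSE of a mean-only baseline:

```r
# Simulated target values standing in for measured log-solubilities.
set.seed(1)
y <- rnorm(1000, mean = -3, sd = 2)
plot(density(y), main = "Density of the target variable", xlab = "logS")
# A useful model should beat the RMSE obtained by always predicting the mean;
# this baseline RMSE shrinks as the target distribution narrows.
baseline.rmse <- sqrt(mean((y - mean(y)) ^ 2))
```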

Statistical Pre-processing
The descriptors and the target values are then merged by name into a single data.frame. We check that the merged data.frame has the same number of rows as the original ones. We then split the data.frame into ids, x and y, where ids holds the molecule names, x the block of descriptor values, and y the target values.
all <- merge(x = targets, y = descriptors, by = "Name")
ids <- all$Name
x <- all[3:ncol(all)]
y <- all$target

Sometimes a descriptor is not calculated for all molecules, giving "NA" or "Inf" as the descriptor value. Instead of removing that descriptor for all molecules, the missing values can be imputed from the corresponding descriptor values of the molecules closest to the molecule with the missing information. "Inf" descriptor values are first converted to "NA". For the imputation of missing descriptor values, the R package impute is required; depending on the R version, it can be installed from either CRAN or Bioconductor.
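The sketch below illustrates the two steps in base R on a toy descriptor block: "Inf" is recoded as "NA", and the gaps are then filled, here with a simple column mean standing in for the neighbour-based imputation that impute::impute.knn performs.

```r
# Toy descriptor block with a missing and an infinite value.
x <- data.frame(d1 = c(1.0, 2.0, Inf), d2 = c(0.5, NA, 0.7))
# Step 1: recode Inf (and -Inf) as NA.
x[] <- lapply(x, function(col) { col[!is.finite(col)] <- NA; col })
# Step 2: fill NAs; a column mean is a simple stand-in for
# k-nearest-neighbour imputation.
x[] <- lapply(x, function(col) { col[is.na(col)] <- mean(col, na.rm = TRUE); col })
```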
## Loading required package: impute
## Cluster size 1606 broken into 375 1231
## Done cluster 375
## Done cluster 1231

The dataset is randomly split into a training set (80%) used for training and a holdout set (20%) used to assess the predictive ability of the models on molecules drawn from the same distribution as the training set. Unhelpful descriptors are removed: (i) those with a variance close to zero (near-zero variance), and (ii) those highly correlated with one another. The remaining descriptors are converted to z-scores by centring them to have a mean of zero and scaling them to have unit variance:

dataset <-PreProcess(dataset)
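A base-R sketch of what these wrappers do on a toy descriptor matrix (thresholds and data are illustrative; camb delegates the real work to caret, as listed below):

```r
# Toy descriptor matrix: column c has zero variance, d nearly duplicates a.
set.seed(1)
x <- data.frame(a = rnorm(100), b = rnorm(100), c = rep(1, 100))
x$d <- x$a + rnorm(100, sd = 0.01)
# (i) drop near-zero-variance descriptors
x <- x[, apply(x, 2, var) > 1e-8]
# (ii) drop one descriptor of each highly correlated pair (|r| > 0.95)
cors <- cor(x)
drop <- apply(upper.tri(cors) & abs(cors) > 0.95, 2, any)
x <- x[, !drop]
# (iii) convert the remaining descriptors to z-scores
x <- scale(x)
```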
Five-fold cross-validation (CV) is used to optimise the hyperparameters of the models:

dataset <- GetCVTrainControl(dataset, folds = 5)
saveRDS(dataset, file = "dataset_logS_preprocessed.rds")

All models are trained with the same CV options, i.e. the arguments of the function GetCVTrainControl, because ensemble modelling requires the same fold split of the data across all methods used. It is important to mention that the functions presented in the previous code blocks depend on functions from the caret package, namely:
• RemoveNearZeroVarianceFeatures : nearZeroVar
• RemoveHighlyCorrelatedFeatures : findCorrelation
• PreProcess : preProcess
• GetCVTrainControl : trainControl
Experienced users might want more control over the underlying caret functions. This is certainly possible, as the arguments given to the camb functions are passed on to their caret counterparts in the typical ... R fashion. In fact, experienced users may want to learn from the internals of the functions camb provides and create their own specialised pipeline that fits their modelling needs. camb is intended to speed up the modelling process and should not limit use of the extremely valuable caret package on which it builds.
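Why identical fold splits matter can be sketched in base R: every model must produce out-of-fold predictions on the same partition for the ensemble to combine them row by row. (GetCVTrainControl stores such fold indices through caret's trainControl; the numbers below are illustrative.)

```r
# Assign each of n training rows to one of k folds, once, up front.
set.seed(1)
n <- 20; k <- 5
fold.id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold.id)   # held-out row indices per fold
# Every model is then trained and evaluated on these same folds,
# so their out-of-fold predictions are directly comparable.
lengths(folds)
```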
The default argument values of these functions permit the less experienced user to handle the statistical pre-processing steps quickly and with ease.

Model Training
In the following section we present the different steps required to train a QSPR model with camb. It should be noted that the above steps can be run locally on a low-powered computer such as a laptop and the pre-processed dataset saved to disk. This dataset can then be copied to a high-powered machine or a farm with multiple cores for model training, and the resulting models saved back to the local machine. Pro tip: Dropbox can be used to sync this procedure so that manual transfer is not required.

Support Vector Machines (SVM)
First, an SVM with a radial basis function kernel is trained [4]. A base-2 exponential grid is used to optimise the hyperparameters. The train function from the caret package is used directly for model training.
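The exponent ranges below are illustrative assumptions; such a grid would be handed to caret's train via its tuneGrid argument for method "svmRadial", sigma and C being the kernel width and cost hyperparameters.

```r
# Base-2 exponential grid over the two RBF-SVM hyperparameters.
grid <- expand.grid(sigma = 2 ^ seq(-8, 0, by = 2),  # kernel width
                    C     = 2 ^ seq(-2, 8, by = 2))  # cost
dim(grid)   # one row per (sigma, C) combination
```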

Ensemble Modeling
In the following section, two ensemble modeling techniques are applied, namely greedy optimization and model stacking. Further information about these methods can be found in ref [7] and [8].
We append all the trained models to a list. The sort function shows the cross-validation RMSEs of the trained models in ascending order. The greedy and the linear stack ensembles have cross-validated RMSEs lower than any of the individual models.
We then test to see if these ensemble models outperform the individual models on the holdout set.
preds <- data.frame(sapply(all.models, predict, newdata = dataset$x.holdout))
preds$ENS_greedy <- predict(greedy, newdata = dataset$x.holdout)
preds$ENS_linear <- predict(linear, newdata = dataset$x.holdout)
preds$ENS_nonlinear <- predict(nonlinear, newdata = dataset$x.holdout)
sort(sqrt(colMeans((preds - dataset$y.holdout)^2)))

The linear ensemble slightly outperforms the other models on the holdout set as well, which leads us to choose it as the most predictive model for future predictions. Should the ensemble models underperform the single models on the holdout set, it is advisable to pick the best single model for future predictions, as a simpler model is preferable when performance is equal.
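With toy numbers (assumed, not the paper's results), the RMSE-per-column idiom works as follows:

```r
# Toy holdout targets and predictions from three hypothetical models.
y.holdout <- c(1.0, 2.0, 3.0, 4.0)
preds <- data.frame(svm        = c(1.1, 2.2, 2.9, 4.3),
                    rf         = c(0.8, 2.4, 3.3, 3.6),
                    ENS_linear = c(1.0, 2.1, 3.0, 4.1))
# Subtracting a vector from a data.frame recycles it down each column,
# so colMeans of the squared errors gives one MSE per model.
rmses <- sort(sqrt(colMeans((preds - y.holdout) ^ 2)))
names(rmses)[1]   # model with the lowest holdout RMSE
```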

External Predictions
One of the main attractions of this package is that it makes standardising and making predictions on new molecules a simple task. It is essential to ensure that the same standardisation options and descriptor types are used when the model is applied to make predictions for new molecules.
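A base-R sketch of one aspect of this, reapplying the training-set centring and scaling to the descriptors of a new molecule (values illustrative; camb stores these choices so they need not be redone by hand):

```r
# Training descriptors and the centring/scaling learned from them.
x.train <- data.frame(d1 = c(0, 2, 4), d2 = c(10, 20, 30))
centers <- colMeans(x.train)
scales  <- apply(x.train, 2, sd)
# Descriptors of a new molecule: transformed with the TRAINING statistics,
# never with statistics recomputed on the new data.
x.new   <- data.frame(d1 = 2, d2 = 40)
x.new.z <- scale(x.new, center = centers, scale = scales)
# x.new.z is now on the model's scale and can be passed to predict().
```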