 Research article
 Open Access
 Published:
RRegrs: an R package for computeraided model selection with multiple regression models
Journal of Cheminformaticsvolume 7, Article number: 46 (2015)
Abstract
Background
Predictive regression models can be created with many different modelling approaches. Choices need to be made for data set splitting, crossvalidation methods, specific regression parameters and best model criteria, as they all affect the accuracy and efficiency of the produced predictive models, and therefore, raising model reproducibility and comparison issues. Cheminformatics and bioinformatics are extensively using predictive modelling and exhibit a need for standardization of these methodologies in order to assist model selection and speed up the process of predictive model development. A tool accessible to all users, irrespectively of their statistical knowledge, would be valuable if it tests several simple and complex regression models and validation schemes, produce unified reports, and offer the option to be integrated into more extensive studies. Additionally, such methodology should be implemented as a free programming package, in order to be continuously adapted and redistributed by others.
Results
We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10fold and leaveoneout crossvalidation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and crossvalidation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending on the caret package.
Conclusion
The universality of the new methodology is demonstrated using five standard data sets from different scientific fields. Its efficiency in cheminformatics and QSAR modelling is shown with three use cases: proteomics data for surfacemodified gold nanoparticles, nanometal oxides descriptor data, and molecular descriptors for acute aquatic toxicity data. The results show that for all data sets RRegrs reports models with equal or better performance for both training and test sets than those reported in the original publications. Its good performance as well as its adaptability in terms of parameter optimization could make RRegrs a popular framework to assist the initial exploration of predictive models, and with that, the design of more comprehensive in silico screening applications.
Background
Many opensource statistical, datamining, and machine learning software projects give access to a wide range of data processing and modelling algorithms often providing graphical user interfaces. Among them are Weka (Waikato Environment for Knowledge Analysis) [1], RapidMiner [2], Keel (Knowledge Extraction based on Evolutionary Learning) [3], Orange [4], Scikitlearn [5], C1C2 [6] and KNIME (Konstanz Information Miner) [7], effectively, providing full predictive modelling frameworks.
Additionally, webbased platforms such as OpenTox [8] and Online Chemical Modelling Environment (OCHEM) [9] focus on the development of quantitative structureactivity relationship (QSAR) models, i.e. regression or classification models that are used for the in silica assessment of physicochemical properties and biological activities of chemical compounds such as toxicity, biological potency and side effects [10–12]. Such platforms typically consist of two major subsystems: the database of experimental measurements and the modelling framework. They allow users to create their own QSAR models, predict results for new chemicals, and share them. OpenTox in particular is a platformindependent collection of components which communicate through web services, so that the user can combine data, models and validation results from multiple sources in a dependable and timeeffective way. Several other tools offer virtual evaluation of chemical properties and toxicity using implemented QSAR models; for instance, Vega [13], EPI Suite [14], Toxicity Estimation Software Tool (TEST) [15], QSAR4u [16], BuildQSAR [17], OECD QSAR Toolbox [18], AZOrange [19] or BioclipseR [20]. However, these tools are limited to supported data sets, QSAR models or specific regressions methods.
On the other hand, the R statistical language environment [21] offers many solutions for regression modelling and also some packages providing simultaneous access to multiple methods. For example, the glmulti package conducts automated model selection and modelaveraging based on the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) [22], the kernlab package applies Kernelbased machine learning methods for classification, regression, clustering, novelty detection, quantile regression and dimensionality reduction, and the e1071 package includes functions for latent class analysis, short time Fourier transform, fuzzy clustering, and support vector machines. Other R packages are specializing on particular set of algorithms, for instance the R package tree focuses on producing classification and regression trees.
Of particular interest is the R package for predictive modelling called caret (Classification and Regression Training) which gathers and simplifies numerous R algorithms for the development of a wide variety of predictive models by calling and integrating more than 25 other packages [23]. Unique features of caret include data splitting, preprocessing, characterizing performance and variable importance, and parallel processing tools.
Although this package is providing useful methods for the syntactical unification of regression and classification prediction modelling approaches, the various models have different inputs and the outcomes different formats, typically depending on their parameters. This makes it hard to run all the available methods for multiple data sets, compare all the outputs, and produce a standardized results summary.
Furthermore, the complexity of the workflows complicates the reproduction of the same regression results and it may affect decisions on issues, such as how to split the original data set, how often to split the data set, which crossvalidation method is to be use, which data filtering to apply before regression (correlated features removal, “not available” (NA) values elimination, etc.), which data scaling is applied (normalization, standardization, etc.), which regression methods to test, which regression parameters and seeds to use, how to summarize and compare the results for several regression models, and which criteria to use in order to choose the best regression model.
To address these limitations, the current manuscript describes a standardized framework that automates the development of a reliable and wellvalidated QSAR model, or set of models. The socalled RRegrs tool (R Regressions) is based on the R caret package and is focusing on regression modelling. RRegrs workflow offers a fully validated procedure capturing any variability or inconsistency in the data. A single RRegrs function call is needed to run the entire workflow and obtain the produced validated QSAR model(s) in a reproducible format in contrast to the standard, inefficient and timeconsuming QSAR modelling workflow, where the modeller tries many different algorithms and even needs to further search the parameter space of each algorithm. This is a considerable advantage for users with perhaps limited statistical knowledge or limited R experience. Also reproducibility is a comparative advantage since often the same procedure needs to be applied for different data sets. RRegrs implements an easy way to explore the models’ search space of linear and nonlinear models with special parameters specifications and cross validation schemes. Furthermore, model outputs are easily accessible and readable, organized by methods, centralized and averaged by multiple reproducible data set splits. Summary files are also produced helping the user to easily access all methodologies results, which can then be prioritized based on various statistics. The current implementation of the RRegrs package contains ten different regression algorithms and supports parallel processing, if prompted. RRegrs function calls can be integrated into complex desktop and web tools for QSAR. RRegrs package is available as an open repository at https://www.github.com/enanomapper/RRegrs. The current release is available from ZENODO with the doi:10.5281/zenodo.21946.
Results and discussion
RRegrs is an R package for computeraided model selection, designed and implemented as a collection of regression tools available from the caret package. RRegrs uses the R package testthat for testing [24]. It does not apply full unit testing, but several RRegrs parameter combinations are tested, which are run during the build process. RRegrs can be used to find the best regression model for any numerical data set using some or all of ten linear and nonlinear regression models: Multiple Linear regression (LM), Generalized Linear Model with Stepwise Feature Selection (GLM) [25], Partial Least Squares Regression (PLS) [26], Lasso regression (LASSO) [27], Elastic Net regression (ENET) [28], Support vector machine using radial functions (SVRM) [29], Neural Networks regression (NN) [30], Random Forest (RF) [31], Random Forest Recursive Feature Elimination (RFRFE) [32] and Support Vector Machines Recursive Feature Elimination (SVMRFE) [33]. Using the above methods, we explore the model space and compare outputs to decide on the optimal model, given the data. We are setting the regression method parameters with grid functions which have been carefully chosen to optimize different models, and particularly this is done for PLS, SVRM, SVMRFE, NN, RF, RFRFE, ENET. Specifically for SVRM, SVMRFE the user could also specify custom parameters.
The main scope of the presented tool is to be able to run a large number of regression methods from the caret package using only one function call, to use standardized crossvalidation (CV) scheme for all the methods, to obtain standardized outputs, to generate result summary tables and comparison plots for the regression methods and to store for each method detailed statistics and fitting plots. RRegrs integrates results of individual models and decides on the best model given the data set and the user specified parameters, unlike caret. Therefore, RRegrs permits users, irrespective of their programming or statistics experience, to predict any type of numerical output using multiple regression methods. In addition, advanced users could integrate RRegrs in other applications or software packages. For example, in cheminformatics RRegrs can be used to generate predictive models using molecular descriptors calculated with the rcdk package, using functions offered by the Chemistry Development Kit [34].
RRegrs function wraps up all the above mentioned procedures within just one call. The following steps are included (see Fig. 1): load parameters and data set, remove the NA values, remove near zero variance features, scale the data set, remove correlated features, split data set into training and test sets, run the selected regression methods using the selected crossvalidation method(s), summarize the statistics for all methods and splittings, average for each method and crossvalidation type for all splittings, apply the best model of each method and split on the test sets, apply Yrandomization on the best model, and assess the Applicability Domain. For each model, a CV scheme is introduced with two options: 10fold repeated CV and LeaveOneOut (LOO) CV. The more timeconsuming regression methods (RF, SVMRFE, RFRFE) are using only repeated crossvalidation (other validation methods could be very slow for these complex functions), whereas for computationally demanding methods RRegrs offers parallel processing for a defined number of CPU cores. Detailed output files for all regression methods are produced, plots for individual model statistics, as well as summary statistics and comparison plots between methods, resulting in a significant number of CSV, PDF, PNG outputs files. Additionally, several summary files are created. A CSV output file is created with all the basic statistics (17 values) for each method, data splitting and CV type. Based on the above, averaged statistics are calculated for each regression method and across all data splits, which are the values needed to decide on the final best model performance. The best model is further validated with Yrandomization runs (100 by default).
For each regression method, caret package utilities are employed. For example, RRegrs uses the trainControl and train functions to set the training conditions (10 repetitions; RMSE used as metrics to choose the model) and train the model, respectively. For each method, RRegrs is generating the same list of 17 statistics values: regression name, split number, crossvalidation type, number of model features, names of model features, training adjusted Rsquared (adj.R^{2}), training root mean squared error (RMSE_{CV/LOO}), training \({\text{R}}_{{{\text{CV}}/{\text{LOO}}}}^{2}\), training standardized RMSE, test adjusted R^{2}, test RMSE (RMSE_{test}), test R^{2} (\({\text{R}}_{\text{test}}^{2}\)), test correlation, and corresponding values for both sets. If the user requests detailed output (the default flag is set to True), several files are generated such as a CSV file with statistics about each regression model listing the following information: regression method, splitting number, crossvalidation type, training set summary, test set summary, fitting summary, list of predictors, training/test predictors, a full list of statistics as defined above, feature importance, residuals of the fitted model, assessment of applicability domain/leverage analysis such as mean of hat values, hat values with warnings, leverage threshold, list of points with leverage greater than threshold, Cook’s distances, Cook’s distance cutoff, points influence. Particularly, for each data splitting and CV method, the following plots are produced: observed versus predicted values for training/test sets, feature importance, fitted versus residuals for the fitted model, leverage statistics for fitted model, Cook’s distance for fitted model, and six standard fitting plots including Cook’s distance cutoff.
Moreover, RRegrs offers an exhaustive validation framework by introducing multiple random data splittings. For each algorithm and data split, the model is produced based on training and validation sets. We are reporting both CV and external validation statistics, however, the test set is used to select the final best model, i.e. the best performing amongst the optimal models produced by different algorithms (both linear and nonlinear). Decision is made on the averaged statistics across data splits to remove any bias towards the structure of the test set. The best regression model is selected based on the following criterion: from the best averaged \({\text{R}}_{\text{test}}^{2}\) (+/−0.05), the model with minimum RMSE_{test} is the final one. Alternatively, the test adjusted R^{2} can be used to select the best model. For the best model, an additional CSV file is generated providing detailed statistics as well as PDF plots for important statistics.
As mentioned above, parallel processing is employed during training steps by enabling caret’s parallel design, and it is activated by either using caret’s TrainControl “allowParallel” option, or in the case of RFE methods also within the model selection (iterating with a parallel foreach through the cross validation model selection for each feature size) using RFEControl “allowParallel” option. Libraries doMC (Linux/Mac) and doSNOW (Windows) provide foreach parallel adaptor.
The uses of RRegrs reported here are aimed at finding QSAR models for cheminformatics and nanotoxicology for the eNanoMapper European FP7 project. RRegrs was first tested with five standard data sets from UC Irvine Machine Learning Repository [35], followed by a demonstration the efficiency of RRegrs in cheminformatics and bioinformatics areas, using three publicly available data sets, as presented in the following sections.
RRegrs models for standard data sets
To benchmark RRegrs we first used five standard regression data sets from the UC Irvine machine learning repository [35]: the housing [36], computer hardware, wine quality [37], automobile [38], and Parkinsons telemonitoring [39] data sets. Based on these data sets we demonstrate the RRegrs methodology capabilities in different scientific fields. The number of cases and features of these data sets are described in the Methods section.
The housing data set is the most used standard data set for complex regression methods: combination of regression estimators as genetic algorithmbased selective neural network ensemble [40], distributed multivariate regression using waveletbased collective data mining [41], application of the Bayesian evidence framework to support vector regression (SVR) [42], Principal Components approach that combines regression estimates [43], regression on feature projections (RFP) method [44], subsetbased least squares subspace regression in reproducing Kernel Hilbert space (RKHS) [45], Smola and Scholkopf’s sequential minimal optimization (SMO) algorithm for SVM regression [46], etc.
Tables 1 and 2 present two statistic values for these standard data sets: \({\text{R}}_{\text{test}}^{2}\) and RMSE_{test} values, averaged by 10 different data set splits, using 10fold repeated CV and 10 Yrandomization. The results show that advanced methods such as RFRFE and RF give the highest R^{2} values. In the case of the Housing data set, PLS provides a very low \({\text{R}}_{\text{test}}^{2}\) of 0.266 compared with the RFRFE/RF that shows 0.875/0.874 (\({\text{R}}_{\text{test}}^{2}\) for LM is 0.707). Because of its slightly lower RMSE_{test} value compared to RFRFE (less than 0.001), RRegrs suggests RF as the best model. Figure 2 shows the differences for \({\text{R}}_{\text{test}}^{2}\) and RMSE_{CV} on the training set (data split 1) and Fig. 3 presents the comparison for resampling on the training set (data split 1). In order to observe the quality of the regression models, Fig. 4 presents the observed versus predicted values in the test set for the best models for five data sets (10fold repeated CV). The applicability domain section of RRegrs plotted the leverage for the Housing best fitted model (RF) as in Fig. 5.
The best model for the Computer Hardware data set was obtained with the RF method (\({\text{R}}_{\text{test}}^{2}\) of 0.907) and the worst one using PLS (\({\text{R}}_{\text{test}}^{2}\) of 0.740). The LM method has \({\text{R}}_{\text{test}}^{2}\) of 0.822. Models for the Red Wine data set do not produce good values for \({\text{R}}_{\text{test}}^{2}\) (>0.501) due to the noncontinuous values of the output variable. When RRegrs is applied to the Automobile data set, \({\text{R}}_{\text{test}}^{2}\) values vary from about 0.915 for RF/RFRFE to 0.714 for SVMRFE (\({\text{R}}_{\text{test}}^{2}\) for LM is 0.824).
If the RRegrs call uses all available CPU cores for the complex methods, one dataset split, one Yrandomizations, and all the regression methods, the following execution times (in seconds) are obtained for the Boston standard dataset (see Table 3). The computer was an Windows 8.1 64bit with i74790 CPU (3.60 GHz, 4 cores, 8 logical cores), 16G RAM. The total execution time was 5.43 min (325.64 s).
Use case 1: RRegrs application on protein corona data
Recent studies have shown that the presence of serum proteins in vitro cell culture systems form a protein adsorption layer (a.k.a. the ‘protein corona’) on the surface of nanoparticles (NPs). This corona is reported to affect the nanoparticlecell interactions as well as change the cell response [47, 48], and defines the NP’s ‘biological identity’ [49]. It thus encodes information about the interface formed between the NP core and the cell surface within a physiological environment.
This section presents results for proteomics data recently published that characterizes the serum protein corona ‘fingerprint’ formed around a library of 105 distinct surfacemodified gold NPs [49]. The authors used LC–MS/MS to identify 129 serum proteins which were considered suitable for relative quantification. The relative abundances for each of these proteins on the corona of a nanoparticle formulation defines the serum protein ‘fingerprint’ for that formulation. The authors presented a multivariate regression model that uses the protein corona fingerprint to predict cell association for the gold NPs and found a model predicted cell association with a \({\text{R}}_{\text{LOO}}^{2}\) of 0.81. Specifically, they applied a PLS regression model along with an internal iterative filtering procedure using the Variable Importance to the Projection (VIP) score and jackknife resampling.
Here we present results on the initial set of 129 × 84 proteins to gold NPs data (21 neutral NPs were excluded from analysis as in Walkey et al. [49]), and also on a set of 76 × 84 proteins to gold NPs data. These 76 proteins are selected in [49] with VIP ≥ 0.6 threshold. RRegrs was applied with 10 random splits of the data (75 % train and 25 % test) along with 10 Yrandomization runs for the best model. Table 4 shows the best model selected by RRegrs, its number of features, the adj.R^{2} and the R^{2} and RMSE values for the train and test sets, averaged over 10 random splits of the data. Table 5 shows the best model found in all data splits, i.e. we compare all methodologies and data splits to find the best R _{test} ^{2} . Data are normalized and filtered using the RRegrs near zero variance and correlation filters, for that reason the 129 proteins are filtered to be 99 and the 76 proteins data set are reduced to 60 features.
For the data set with 129 proteins, the best model is an SVRM model with \({\text{R}}_{\text{test}}^{2}\) = 0.631. CV results on the training set can be seen in Fig. 6, where we can observe that the LM and GLM models are not suitable for the protein corona data, whereas the remaining methodologies perform similarly. It can be observed in Table 5 that the highest value reported was \({\text{R}}_{\text{test}}^{2}\) = 0.844 for individual split nine of the data set.
When we study the set with 76 proteins, we find that the best model is an SVRM model with averaged \({\text{R}}_{\text{test}}^{2}\) = 0.728, whereas the best individual split value is \({\text{R}}_{\text{test}}^{2}\) = 0.89. CV results on the training set are shown in Fig. 7. The corresponding RRegrs results for the PLS model are \({\text{R}}_{\text{test}}^{2}\) = 0.7 (averaged over 10 data splits), whereas the highest values are reported for individual split five \({\text{R}}_{\text{test}}^{2}\) = 0.885 (for repeated CV) and \({\text{R}}_{\text{test}}^{2}\) = 0.873 (for LOO). Although the last number cannot be directly compared to \({\text{R}}_{\text{LOO}}^{2}\) = 0.81 reported by the authors, it gives an indication of how our PLS implementation performs for the specific data set.
Use case 2: RRegrs application on metal oxides data set
The authors of [50] combined experimental and theoretical measurements to develop a nanoQSAR model that describes the toxicity of eighteen nanometal oxides (MeOx) to the human keratinocyte (HaCaT) cell line, which is a common in vitro model for keratinocyte response during toxic dermal exposure. The study was aimed at exposing and explaining the differences in modes of toxic action of metal oxide nanoparticles between the eukaryotic system and the prokaryotic system (E. coli).
They calculated 32 parameters that quantitatively describe the variability of the nanoparticles’ structure, called nanodescriptors, which included quantummechanical descriptors derived from quantumchemical calculations, and image descriptors derived from transmission electron microscopy (TEM) images. Some of the descriptors included are: particle size and size distribution, agglomeration state, particle shape, crystal structure, chemical composition, surface area, surface chemistry, surface charge, electronic properties (reactivity, conductivity, interaction energies, etc.), and porosity. Additionally, the LC_{50} was calculated from experimental data for all MeOx NPs. This is the concentration that caused a 50 % reduction of the cells after 24 h exposure, whereas the −log(LC_{50}) values were used in modelling, as the dependent variable t be predicted.
The authors applied a Genetic Algorithm (GA) to independently select the most efficient combination of the molecular descriptors which were then analyzed using multiple linear regression. They found that two descriptors were sufficient to predict NPs toxicity with high statistical significance, namely \(\Delta {\text{H}}_{\text{c}}^{\text{f}}\) descriptor, which is the enthalpy of formation of metal oxide nanocluster representing a fragment of the surface and, χ^{c} descriptor, which is the Mulliken’s electronegativity of the cluster. The reported values are: R^{2} = 0.93 (RMSE = 0.12), \({\text{R}}_{\text{LOO}}^{2}\) = 0.86 (RMSE_{LOO} = 0.16), \({\text{R}}_{\text{test}}^{2}\) = 0.83 (RMSE_{test} = 0.13). Note that R^{2,} here and in other cases, refers to the coefficient of determination for fitting the model to the training data.
Ten random splits of the data were performed (75 % train and 25 % test) along with ten Yrandomization runs for the best model. Tables 4 and 5 show RRegrs results for the initial set of 32 parameters to the eighteen MeOx’s, averaged or nonaveraged values across the ten data splits, respectively. Because of the restricted number of samples and descriptors RRegrs was applied without filtering options, whereas data were normalized, as in [50]. The best model was selected between those that perform feature selection, i.e. GLM, LASSO, SVMRFE, RFRFE, ENET. As can be seen from the tables, the best performance model and the best averaged model is ENET in this case, keeping on average 8.8 variables from the data including the two important variables (∆H^{c}, χ^{c}) selected in the original publication. The ENET averaged statistics for 10 splits of the data are \({\text{R}}_{\text{test}}^{2}\) = 0.746, \({\text{R}}_{\text{CV}}^{2}\) = 0.933, which are very similar to the values reported by the authors. The best individual split value is equal to \({\text{R}}_{\text{test}}^{2}\) = 0.998 for ENET model with eight variables including the final two suggested by the authors (LOO at the eighth split of the data). The boxplots in Fig. 8 show the resampling values of RMSE_{CV} and \({\text{R}}_{\text{CV}}^{2}\) values in the training set for split eight, where only methodologies with the same resampling scheme are included in the graph. Figure 8 also includes the fit of the selected model ENET for the same data split.
Use case 3: RRegrs application on aquatic toxicity data set
The authors of [51] developed a QSAR model based on 546 organic molecules, to predict acute aquatic toxicity towards Daphnia magna, which is the organism preferred for shortterm aquatic toxicity testing according to REACH [52]. Ad hocdesigned workflows were used for data curation and filtering, as well as for the extraction of LC_{50} data, which in this case is defined to be the concentration that causes death in 50 % of test Daphnia magna over a test duration of 48 h. For modelling purposes the −log(LC_{50}) values were considered as the dependent variable to be predicted. Other experimental data on aquatic toxicity were retrieved from three databases and available scientific publications, as well as onedimensional (1D) and2D molecular descriptors implemented with DRAGON software [51], resulting in a total of 2, 187 molecular descriptors.
A modified kNearest Neighbour (kNN) strategy coupled with GA algorithms was used to select the relevant molecular descriptors. The final data set comprised of 546 organic molecules and a set of 201 descriptors. The GAkNN strategy was implemented with a threshold on the average Mahalanobis distance from the k nearest neighbours, so that only molecules satisfying the threshold criterion were predicted. Particularly, predictions for molecules with an average distance greater than 1.26 from their three neighbours, were considered to be outside of the applicability domain. The training molecules exceeding the threshold did not contribute to the model’s statistics, but were not removed from the data set. The final model showed good performance when the average distance threshold was applied, namely \({\text{R}}_{\text{CV}}^{2}\) = \({\text{R}}_{\text{test}}^{2}\) = 0.78 (5fold CV), R^{2} = 0.72. The model selected eight molecular descriptors that encoded information about lipophilicity, the formation of Hbonds, polar surface area, polarisability, nucleophilicity and electrophilicity. When no distance threshold is applied, the corresponding values are: R^{2} = 0.60, \({\text{R}}_{\text{CV}}^{2}\) = 0.61, \({\text{R}}_{\text{test}}^{2}\) = 0.43.
Tables 4 and 5 show the results for the final set of eight descriptors for 546 organic molecules: the averaged values across data splits and the best model statistics for all data splits are presented. RRegrs is applied using normalization and filtering options. Ten random splits of the data were performed (75 % train and 25 % test) along with ten Yrandomization runs for the best model. As can be seen from the tables the best performance model and the best averaged model is SVRM in this case, keeping all 8 descriptors in the data. The SVRM averaged statistics for ten splits of the data are \({\text{R}}_{\text{CV}}^{2}\) = 0.556 (10fold CV), \({\text{R}}_{\text{test}}^{2}\) = 0.537, which are close to the ones reported by the authors when the distance threshold is not applied. The adjusted R^{2} = 0.7, exceeding the 0.6 value reported without the distance threshold application, and still approaching the 0.78 value when the distance threshold was applied. The best individual split value is reported to be \({\text{R}}_{\text{test}}^{2}\) = 0.657 for SVRM model with eight variables (LOO at the second split of the data). Figure 9 shows the differences between the various models in terms of \({\text{R}}_{\text{CV}}^{2}\) and RMSE_{CV} values in the training set of the second data split, i.e. the train data where the highest \({\text{R}}_{\text{test}}^{2}\) was observed. We can observe that the PLS model differs from all others, having the worst performance, whilst LM, GLM, LASSO, and ENET models have very similar performances.
Conclusions
This paper introduces RRegrs as a new computeraided model selection framework using a single R function call. The aim of RRegrs is to automatically obtain the best regression model given the data set, and the set of all ten regression models available, after an extensive search of the model space. A fully validated procedure is suggested where data are split in training and test sets, ten times by default, capturing any variability or inconsistency in the data. The best model is then found across different data splits and crossvalidation schemes, based on the averaged data splits statistics. RRegrs produces easily accessible summary files that provide an overview of model details and allows methodology comparisons using the same statistics, enabling QSAR model selection. These direct comparisons are built on top of the caret package, and in that respect provide a useful flexibility for all users. However, use of this package does not require advanced knowledge of R, while, on the other hand, experienced R users can easily modify and extend the package with additional algorithms of choice. The single function call makes it easy to integrate into larger QSAR and in silico molecular screening studies. The new tool was tested with five standard data sets from several domains and three use cases originating from cheminformatics and nanotoxicity, showing good performance in all cases. RRegrs is open source and available from https://www.github.com/enanomapper/RRegrs (doi:10.5281/zenodo.21946).
Methods
Regression methods in RRegrs
The RRegrs tool is using ten different linear and nonlinear regression models briefly described in this section, to explore the model space. The most basic model in this package is the LM model [33]. Variable selection could improve the result of prediction in regression models. For that reason we have included a generalized linear model, denoted by GLM, which selects variables that minimize the AIC score. LASSO and ENET are also penalizing the number of variables via an embedded minimization process [27, 28].
Apart from the standard regression methodologies included in RRegrs, other methods that focus on specific characteristics of the data are included. PLS uses linear projections of input and output sets, which is a useful strategy when many of the inputs are correlated. PLS coefficient optimization algorithm improves previous regression coefficient algorithms because the search path is directed to high variance and high correlations paths [26]. The SVMR algorithm attempts to find the hyperplane that separates the positive and negative samples, practically allowing a nonlinear solution to a regression problem by transforming the data to a hyperdimensional feature space using the kernel functions [53, 54]. SVMR in RRegrs uses the radial function or Gaussian function. RRegrs package also allows the use of a support vector regression model where the w^{2} (the square of SVM hyperplane weight vector) measures the importance of each feature [29].
RRegrs includes two additional learning algorithms, namely NN and ensemble RF. NN is usually defined as a network of a large number of connected neurons (simple processors), which produce good results with imprecise and complicated data [30]. RF is a bagging method constructing decision trees based on the random subspace method [31].
Finally, we have included two of the best performing methodologies with extra feature selection characteristics. Particularly, because the SVM and RF methods can be timeconsuming, we have considered their implementation with random feature elimination (RFE), a feature selection method also introduced in caret where less important features are sequentially removed from the model until optimal performance is reached [32]. The two methods are here labeled as RFRFE and SVMRFE. Further details for the functions’ main parameters are available in the online tutorial of the RRegrs package.
Model optimization
Two CV schemes are employed within RRegrs, namely 10fold repeated CV and LOO CV. In the case of repeated CV, we run ten repeats of 10fold CV for all models except SVMRFE (3folds, one repeat) and RFRFE (5folds, one repeat), which are particularly timeconsuming methods. The procedure followed by caret and also introduced in RRegrs tool, randomly splits the data in K distinct blocks of roughly equal size (K = 10, 3, 5 depending on the method). Each block of data is left out sequentially, and a model is fit to the remaining of the data; this model is used to predict the held out block. The process is repeated where for each repetition a random proportion of the data are used to train the model (default value is 0.75) while the remainder is used for testing the models. Average performance across the number of repeats are reported: \({\text{R}}_{\text{test}}^{2}\), RMSE_{test,}, \({\text{R}}_{\text{CV}}^{2}\), \({\text{R}}_{\text{LOO}}^{2}\), RM SE_{C V}, RM SE_{LOO}. The best model is selected based on the averaged R _{test} ^{2} value; if multiple models only differ by ≤0.005 from the best \({\text{R}}_{\text{test}}^{2}\)value, the model with the lowest RMSE_{test} statistic is selected.
In order to further validate RRegrs test results, Yrandomization is applied to the best model found. For the last data split and the best model found, RRegrs performs Yrandomization for the 10fold repeated CV scheme, and compares \({\text{R}}_{\text{test}}^{2}\) values to the best model corresponding value.
RRegrs tested data sets
RRegrs has been used to find the best regression models for eight data sets: three from nano and cheminformatics (use cases), and five standard data sets from different fields. The standard data sets have been downloaded from UC Irvine machine learning repository [35]: housing [36], computer hardware, wine quality [37], automobile [38] and Parkinsons telemonitoring [39] data sets. The nonnumeric columns have been eliminated, whereas the first column is the dependent variable (output of the model). The number of initial features/cases are the following: 13/506 for Housing data set, 6/209 for Computer Hardware data set, 11/1, 599 for Red Wine Quality data set, 14/195 for Automobile data set, and 19/5, 875 for Parkinson telemonitoring data set. The use case data sets were derived from their original publications; the initial number of features and cases are the following: the protein corona data set [49] has 129 features and 84 cases, the metal oxide data set [50] has 32 features and 18 cases and the toxicity data set [51] contains 8 features, and 546 cases.
RRegrs call in R
The main function of RRegrs (also called RRegrs) permits the call of the entire RRegrs methodology in a single line. All details about functions’ main parameters are given in the R package documentation. All these parameters have default values. The default values imply a default location for the output files, execution of all modelling steps (removal of NA, and near zero variance features, and of correlated features), normalization of the data set, ten splits, ten Yrandomization steps, and running of all ten regression methods. The user can alter any step or parameter of the RRegrs methodology.
The following examples show simple calls of the RRegrs() function using a specific dataset file entitled ”MyDataSet.csv” that it should be provided by the user:
The output variable RRegrsResults is a complex object which contains the object of the fitted models and the main statistics for each regression model. Details about each function are presented into the tutorial of the RRegrs package.
The following example could be used to test the RRegrs package using a dataset file from RRegrs GitHub URL:
Availability and requirements

Project name: RRegrs

Project home page: RRegrs

Operating system(s): Platformindependent

Programming language: R programming language

Other requirements: R 3.1.0 or higher

License: NewBSD or MIT

Any restrictions to use by nonacademics: none other than those defined by the license
Abbreviations
 RRegrs:

R regressions (package)
 QSAR:

quantitative structureactivity relationship
 R^{2} :

Rsquared
 RMSE:

rootmeansquare error
 AIC:

akaike information criterion
 LM:

linear multiregression
 GLM:

generalized linear model with stepwise feature selection
 PLS:

partial least squares regression
 LASSO:

Lasso regression
 ENET:

elastic net regression
 SVRM:

support vector machine using radial functions
 NN:

neural networks regression
 RF:

random forest
 RFRFE:

random forest recursive feature elimination
 SVMRFE:

support vector machines recursive feature elimination
 CSV:

commaseparated values file format
 PDF:

portable document format of files
 PNG:

portable network graphics file format
 NP:

nanoparticle
References
 1.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor. Newsl 11(1):10–18
 2.
Hofmann M, Klinkenberg R (2013) RapidMiner: Data mining use cases and business analytics applications. Chapman and Hall, CRC Press, Boca Raton
 3.
Alcal´aFdez J, S´anchez L, Garc´ıa S, del Jesu´s MJ, Ventura S, Garrell J, Otero J, Romero C, Bacardit J, Rivas VM et al (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318
 4.
Demšar J, Zupan B, Leban G, Curk T (2004) Orange: From experimental machine learning to interactive data mining. Springer, Berlin Heidelberg
 5.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikitlearn: machine learning in python. J Mach Learn Res 12:2825–2830
 6.
Eklund M, Spjuth O, Wikberg JE (2008) The c1c2: a framework for simultaneous model selection and assessment. BMC Bioinform 9(1):360
 7.
Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer, Berlin, Heidelberg
 8.
Hardy B, Douglas N, Helma C, Rautenberg M, Jeliazkova N, Jeliazkov V, Nikolova I, Benigni R, Tcheremenskaia O, Kramer S et al (2010) Collaborative development of predictive toxicology applications. J Cheminform 2(1):1–29
 9.
Sushko I, Novotarskyi S, Körner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY et al (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comp Aided Mol Design 25(6):533–554
 10.
Cases M, Briggs K, StegerHartmann T, Pognan F, Marc P, Kleinöder T, Schwab CH, Pastor M, Wichard J, Sanz F (2014) The eTOX datasharing project to advance in silico druginduced toxicity prediction. Int J Mol Sci 15(11):21136–21154
 11.
Ekins S (2014) Progress in computational toxicology. J Pharmacol Toxicol Methods 69(2):115–140
 12.
Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R, Consonni V, Kuzmin VE, Cramer R, Benigni R, Yang C, Rathman J, Terfloth L, Gasteiger J, Richard A, Tropsha A (2014) QSAR modeling: where have you been? where are you going to? J Med Chem 57(12):4977–5010
 13.
Fjodorova N, Vracko M, Novic M, Roncaglioni A, Benfenati E (2010) New public QSAR model for carcinogenicity. Chem Cent J 4(Suppl 1):3
 14.
US Environmental Protection Agency (2012) EPI Suite software. http://www.epa.gov/oppt/exposure/pubs/episuitedl.htm
 15.
US Environmental Protection Agency (2012) Toxicity estimation software tool (TEST). http://www.epa.gov/nrmrl/std/qsar/qsar.html#TEST
 16.
National Academy of Sciences of Ukraine (2012) QSAR4u. http://www.qsar4u.com/
 17.
de Oliveira DB, Gaudio AC (2000) Buildqsar: A new computer program for qsar analysis. Quant Struct Act Relat 19(6):599–601
 18.
OECD (2011) OECD QSAR Toolbox. http://www.oecd.org/chemicalsafety/riskassessment/theoecdqsartoolbox.htm
 19.
Stålring JC, Carlsson L, Almeida P, Boyer S (2011) AZOrangeHigh performance open source machine learning for QSAR modeling in a graphical programming environment. J Cheminform 3:28
 20.
Spjuth O, Georgiev V, Carlsson L, Alvarsson J, Berg A, Willighagen E, Wikberg JES, Eklund M (2013) BioclipseR: Integrating management and visualization of life science data with statistical analysis. Bioinformatics 29(2):286–9
 21.
Team RC et al (2011) R: A language and environment for statistical computing. The R Foundation for Statistical Computing, Vienna, Austria. ISBN 3900051070. http://www.Rproject.org/
 22.
Venables WN, Ripley BD (2002) Modern Applied Statistics with S, 4th edn. Springer, New York
 23.
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
 24.
Wickham H (2011) testthat: get started with testing. R J 3:5–10
 25.
Hocking RR (1976) The Analysis And Selection Of Variables In Linear Regression. Biometrics 32(1):1–49
 26.
Wold S, Ruhe A, Wold H, Dunn W III (1984) The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J Sci Stat Comput 5(3):735–743
 27.
Tibshirani R (1994) Regression selection and shrinkage via the lasso. J R Stat Soc Ser B Stat Methodol 58:267–288
 28.
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67:301–320
 29.
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
 30.
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc, New York
 31.
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
 32.
Saeys Y, Inza I, Larrñaaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
 33.
Dobson AJ, Barnett AG (2008) An introduction to generalized linear models. Chapman and Hall, CRC Press, Boca Raton
 34.
Guha R (2007) Chemical informatics functionality in R. J Stat Softw 18(5):1–16
 35.
Bache K, Lichman M (2013) UCI machine learning repository. http://www.archive.ics.uci.edu/ml
 36.
Harrison D, Rubinfeld DL (1978) Hedonic housing prices and the demand for clean air. J Environ Econ Manage 5(1):81–102
 37.
Cortez P, Cerdeira A, Almeida F, Matos T, Reis J (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 47(4):547–553
 38.
Kibler D, Aha DW, Albert MK (1989) Instancebased prediction of realvalued attributes. Comput Intell 5(2):51–57
 39.
Tsanas A, Little MA, McSharry PE, Ramig LO (2010) Accurate telemonitoring of parkinsons disease progression by noninvasive speech tests. Biomed Eng IEEE Trans 57(4):884–893
 40.
Zhou ZH, Wu JX, Tang W, Chen Z (2001) Combining regression estimators: GAbased selective neural network ensemble. Int J Comput Intell Appl 1:341
 41.
Hershberger DE, Kargupta H (2001) Distributed multivariate regression using waveletbased collective data mining. J Parallel Distrib Comput 61:372
 42.
Law MHC, Kwok JT (2001) Applying the bayesian evidence framework to υsupport vector regression. In: ECML, pp 312
 43.
Merz CJ, Pazzani MJ (1999) A principal components approach to combining regression estimates. Mach Learn 36:9
 44.
Guvenir HA, Uysal I (2000) Regression on feature projections. Knowl Based Syst 13(4):207–214
 45.
Hoegaerts L, Suykens JAK, Vandewalle J, De Moor B (2005) Subset based least squares subspace regression in RKHS. Neurocomputing 63:293–323
 46.
Shevade SK, Keerthi SS, Bhattacharyya C, Murthy KRK (2000) Improvements to the smo algorithm for svm regression. Neural Netw IEEE Trans 11(5):1188–1193
 47.
Ge C, Du J, Zhao L, Wang L, Liu Y, Li D, Yang Y, Zhou R, Zhao Y, Chai Z et al (2011) Binding of blood proteins to carbon nanotubes reduces cytotoxicity. Proc Natl Acad Sci 108(41):16968–16973
 48.
Lesniak A, Fenaroli F, Monopoli MP, Aberg C, Dawson KA, Salvati A (2012) Effects of the presence or absence of a protein corona on silica nanoparticle uptake and impact on cells. ACS Nano 6(7):5845–5857
 49.
Walkey CD, Olsen JB, Song F, Liu R, Guo H, Olsen DWH, Cohen Y, Emili A, Chan WCW (2014) Protein corona fingerprinting predicts the cellular interaction of gold and silver nanoparticles. ACS Nano 8(3):2439–2455
 50.
Gajewicz A, Schaeublin N, Rasulev B, Hussain S, Leszczynska D, Puzyn T, Leszczynski J (2015) Towards understanding mechanisms governing cytotoxicity of metal oxides nanoparticles: Hints from nanoQSAR studies. Nanotoxicology 9(3):313–325
 51.
Cassotti M, Ballabio D, Consonni V, Mauri A, Tetko I, Todeschini R (2014) Prediction of acute aquatic toxicity toward daphnia magna by using the gaknn method. Altern Lab Anim ATLA 42(1):31–41
 52.
Lahl U, GundertRemy U (2008) The use of (Q)SAR methods in the context of REACH. Toxicol Mech Methods 18(2–3):149–158
 53.
Brereton RG, Lloyd GR (2010) Support vector machines for classification and regression. Analyst 135(2):230–267
 54.
Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Authors’ contributions
GT and CRM equally contributed in the conceptual framework, implementation of the framework and writing the paper, supplementary information, and R manual files. JAS and CFL contributed in implementing ENET, RF, SVMRFE, RFRFE methods, writing the corresponding supporting information, and R manual files. HS and ELW contributed in the conceptual framework and writing of the paper. ELW built and maintains the RRegrs R package. All authors read and approved the final manuscript.
Acknowledgements
The eNanoMapper project is funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration (FP7NMP2013SMALL7) under grant agreement no 604134. The authors acknowledge the support by the Galician Network of Drugs R+D REGID (Xunta de Galicia R2014/025) and by “Collaborative Project on Medical Informatics (CIMED)” P I 13/00280 funded by the Carlos III Health Institute from the Spanish National plan for Scientific and Technical Research and Innovation 2013–2016 and the European Regional Development Funds (FEDER).
Compliance with ethical guidelines
Competing interests The authors declare that they have no competing interests.
Author information
Additional information
Georgia Tsiliki and Cristian R. Munteanu contributed equally to this work
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Multiple regression
 QSAR
 R package
 Caretbased tool