QSARCoX: an open source toolkit for multitarget QSAR modelling
Journal of Cheminformatics volume 13, Article number: 29 (2021)
Abstract
Quantitative structure-activity relationships (QSAR) modelling is a well-known computational tool, often used in a wide variety of applications. Yet one of the major drawbacks of conventional QSAR modelling is that models are set up based on a limited number of experimental and/or theoretical conditions. To overcome this, the so-called multitasking or multitarget QSAR (mtQSAR) approaches have emerged as new computational tools able to integrate diverse chemical and biological data into a single model equation, thus extending and improving the reliability of this type of modelling. We have developed QSARCoX, an open source Python-based toolkit (available to download at https://github.com/ncordeirfcup/QSARCoX) for supporting mtQSAR modelling following the Box-Jenkins moving average approach. The new toolkit embodies several functionalities for dataset selection and curation plus computation of descriptors, for setting up linear and nonlinear models, as well as for a comprehensive analysis of results. The workflow within this toolkit is guided by multiple statistical parameters and graphical outputs for assessing both the predictivity and the robustness of the derived mtQSAR models. To monitor and demonstrate the functionalities of the designed toolkit, four case studies pertaining to previously reported datasets are examined here. We believe that this new toolkit, along with our previously launched QSARCo code, will significantly contribute to making mtQSAR modelling widely and routinely applicable.
Introduction
Quantitative Structure–Activity Relationships (QSAR) modelling is one of the most frequently employed in silico techniques for chemical data mining and analysis. Though QSAR was introduced more than 50 years ago, it remains an efficient technique for building mathematical models that identify the crucial structural requirements for targeting specific response variables (e.g., activity, toxicity, physicochemical properties, etc.). At the same time, QSAR provides one of the most effective strategies for predicting properties of new chemicals and also for identifying potential hits through virtual screening of chemical libraries [1, 2]. The last few decades have witnessed several transformations in the field of QSAR modelling, owing to the progress in model development strategies, data mining techniques, validation methodologies, along with machine learning and statistical analysis tools [3, 4]. Nevertheless, the quest for new modelling strategies is still ongoing to further improve the overall efficacy of QSAR modelling [1, 5, 6]. For example, one of the major limitations of conventional QSAR is that models are developed for the response variable(s), regardless of the experimental (or theoretical) conditions followed to obtain such response variable(s). In reality, however, researchers come across datapoints pertaining to various experimental and/or theoretical conditions, the inclusion of which may significantly improve the scope of QSAR modelling. This has paved the way to unconventional computational modelling approaches, so-called multitasking or multitarget QSAR (mtQSAR), which are able to integrate data obtained under different conditions into a single model equation for simultaneous prediction of the targeted response variable(s) [7,8,9]. Accordingly, the interest of QSAR practitioners in such multitarget modelling has been growing steadily [1, 5].
In particular, mtQSAR modelling techniques based on the Box-Jenkins moving average approach have already proved to be highly efficient in dealing with datasets pertaining to multiple conditions [10,11,12,13,14]. Our group has recently developed an open source standalone software “QSARCo” (https://sites.google.com/view/qsarco) [15] to set up classification-based QSAR models. Briefly, QSARCo enables users to set up linear or nonlinear classification models, by resorting to the Genetic Algorithm based Linear Discriminant Analysis (GA-LDA) [16, 17] or to the Random Forests (RF) [18] classifier, respectively. In our experience so far, mtQSAR modelling is highly sensitive to the strategies used for model development, especially because the number of starting descriptors increases with the number of experimental (and/or theoretical) conditions. The possibility of employing a larger range of development strategies should therefore improve the usefulness and scope of such mtQSAR modelling. The present work moves a step forward and describes a new toolkit named QSARCoX which, apart from supporting the development of multitarget QSAR models based on the Box-Jenkins moving average approach, allows the usage of various descriptor generation schemes, along with several model development strategies, feature selection algorithms and machine learning tools, as well as model selection and validation methodologies. As will be seen, the QSARCoX software implements a number of additional utilities that render a much more compact and well-designed platform for multitarget QSAR modelling, following the principles of QSAR modelling recommended by the OECD (Organization for Economic Cooperation and Development) [19]. The major differences between these two software tools are listed and discussed in Table 1.
As can be seen, two additional feature selection techniques were included for establishing LDA models, namely fast stepwise (FS) and sequential forward selection (SFS). Even though the GA implemented earlier in QSARCo has proved to be a highly efficient feature selection technique, judging from our previous analyses [11, 20], the implementation of these additional feature selection techniques in QSARCoX improves the scope of LDA modelling in multiple ways. Firstly, the application of more feature selection techniques enhances the chances of obtaining more predictive models, especially for big data analysis [21]. Secondly, the GA selection involves the random generation of an initial population, which usually requires several runs to produce the most statistically significant (or optimised) model. Also, due to this randomisation step, the models generated by GA-LDA lack reproducibility. By contrast, both FS and SFS techniques are more straightforward and reproducible, allowing the swift establishment of linear discriminant models. Finally, simultaneous application of GA with the two newly implemented feature selection algorithms can help find a greater number of LDA models, thereby increasing the possibility of consensus modelling. Additionally, the QSARCoX software provides significant modifications as far as strategies for the development of nonlinear models are concerned. First of all, it comprises a toolkit for building nonlinear models by resorting to six different machine learning (ML) algorithms. One of its modules assists in tuning the hyperparameters of such ML tools (not included in QSARCo [15]) for achieving optimised models. As an alternative, a separate module is available for setting up user-specific parameters, intended for rapid development of nonlinear models. As in QSARCo, model development in QSARCoX is guided by descriptor pretreatment, two-stage external validation, and determination of the applicability domain of linear and nonlinear models.
In addition, the QSARCoX toolkit offers further options for calculating the modified descriptors using different types of Box-Jenkins moving average operators. It also provides a modified Y-based randomisation method [15], the so-called Y_{c}-randomisation, to check the robustness of the derived linear model. The toolkit may also be used for ‘condition-wise prediction’, in which the user checks the predictivity of a model for each experimental/theoretical condition. The relevance of all these new utilities implemented in the toolkit is exemplified with four case studies.
Implementation
QSARCoX version 1.0.0 is an open source standalone toolkit developed using Python 3 [22]. It can be downloaded freely from https://github.com/ncordeirfcup/QSARCoX. The manual provided along with the toolkit describes its operating procedures in detail. The QSARCoX toolkit comprises four modules, namely: (i) LM (abbreviation for linear modelling); (ii) NLG (abbreviation for nonlinear modelling with grid search); (iii) NLU (abbreviation for nonlinear modelling with user-specific parameters); and (iv) CWP (abbreviation for condition-wise prediction). Details about the functionalities of each of these modules are described below.
Module 1 (LM)
This module assists in dataset division, the calculation of deviation descriptors from input descriptors using the Box-Jenkins scheme, and data pretreatment. Along with these, the module comprises two feature selection algorithms for development and validation of the LDA models (see the screenshot in Fig. 1). The following six-step procedure is adopted for establishing the linear models.
Step 1: Dataset division
The first step of any mtQSAR model encompasses a division of the initial dataset into a training and a validation set. In this module, that may be performed following three schemes, namely: (a) predetermined data distribution, (b) random division and (c) k-means cluster analysis (kMCA) based data division [20]. In the first scheme (a), the user is allowed to explicitly provide information about the training and validation set samples, i.e., the samples are to be tagged as ‘Train’ and ‘Test’, respectively. This is extremely important when the user intends to compare a model with a specific data distribution previously derived from any other in silico tool with the models developed using QSARCoX. In the second scheme (b), the random division of the dataset is obtained on the basis of the user-specified percentage of validation set datapoints. At the same time, different training and validation sets may be obtained by changing the random seed value. As an alternative to random data splitting, the user may opt for a k-means cluster analysis based rational dataset division strategy (kMCA) [20, 23]. In the latter option, the dataset is first divided into n (user-specified) clusters on the basis of the input descriptors. Subsequently, a specific number of validation set samples are randomly collected from each cluster. Similar to the random division scheme, the ratio between the training and validation sets may be varied and, simultaneously, different combinations of these sets obtained by changing the random seed value. The Python code KMCA.py included in the toolkit performs the kMCA-based dataset division.
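The two seeded division schemes can be illustrated with a minimal sketch. It is not the toolkit's KMCA.py: the function names are hypothetical, only the standard library is used, and the cluster labels are assumed to come from a k-means run performed beforehand on the input descriptors.

```python
import random
from collections import defaultdict


def random_split(n_samples, validation_fraction, seed):
    """Seeded random division; returns (training indices, validation indices)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_val = round(n_samples * validation_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


def cluster_split(cluster_labels, validation_fraction, seed):
    """kMCA-style division: draw the validation samples cluster by cluster.

    cluster_labels[i] is the (precomputed) cluster of sample i.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    validation = []
    for label in sorted(members):
        group = members[label]
        n_val = round(len(group) * validation_fraction)
        validation.extend(rng.sample(group, n_val))

    chosen = set(validation)
    training = [i for i in range(len(cluster_labels)) if i not in chosen]
    return training, sorted(validation)


# Changing the seed changes the composition of the sets, not their sizes.
train_idx, val_idx = random_split(100, 0.25, seed=2)
```

Sampling within each cluster keeps every region of descriptor space represented in the validation set, which is the rationale for preferring this scheme over a purely random split.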
Step 2: Box-Jenkins moving average approach
The most important part of current mtQSAR modelling is the calculation of the deviation descriptors from the input descriptors, following the Box-Jenkins moving average approach. The input descriptors can be calculated using any commercial or non-commercial software package (e.g., DRAGON [24] or QuBiLS-MAS [25]), but they then have to be modified to incorporate the influence of the different experimental (and/or theoretical) elements (\(c_{j}\)).
The mathematical details of the Box-Jenkins moving average approach have been extensively described in the past [8, 9, 26], so we will restrict ourselves to a short description highlighting only its most important aspects. There are different ways of calculating the modified descriptors by this approach, the simplest one being as follows:
\[\Delta (D_{i})c_{j} = D_{i} - avg\;(D_{i})c_{j} \qquad (1)\]
Specifically, the new descriptors \(\Delta (D_{i})c_{j}\) are calculated as the difference between the input descriptors of the active chemicals (D_{i}) and their averages \(avg\;(D_{i})c_{j}\), i.e., their arithmetic mean over the \(n(c_{j})\) datapoints for a specific element of the experimental and/or theoretical conditions (ontology) \(c_{j}\) [8]:
\[avg\;(D_{i})c_{j} = \frac{1}{n(c_{j})}\sum D_{i} \qquad (2)\]
In recent years, however, different forms of these modified descriptors have been suggested depending on the conditions. For example, the descriptors may be standardised by resorting to the maximum (\(D_{i\max }\)) and minimum (\(D_{i\min }\)) values of the input descriptors [12]:
\[\Delta (D_{i})c_{j} = \frac{D_{i} - avg\;(D_{i})c_{j}}{D_{i\max } - D_{i\min }} \qquad (3)\]
Analogously, the elements of \(c_{j}\) may also be standardised, as recently proposed by Speck-Planche [27], leading to the following expression for the modified descriptors:
In this equation \(p{(}c_{j} {)}\) represents the a priori probability of finding the datapoints pertaining to particular conditions and so, \(p{(}c_{j} {)}_{c}\) may simply be obtained by dividing the number of actives in the data under a specific element of \(c_{j}\), denoted \(n{(}c_{j} {)}\), by the total number of datapoints N (see Eq. 5). More details about this topic will be discussed within case study 3 reported in this work.
In the present toolkit, the user can choose one of the four methods provided (Method1 to Method4) to compute the modified descriptors. The first three are based on Eqs. 1, 3 and 4, respectively. Note that Method2 and Method3 do not work with invariant descriptors, which may hamper further calculations. Therefore, in these two methods, a descriptor pretreatment is carried out to remove constant descriptors. Finally, Method4 allows users to apply their own scheme for establishing the \(p{(}c_{j} {)}\) values [27, 28], and the resulting modified descriptors are thus represented as follows:
where the term \(p{(}c_{j} {)}_{u}\) denotes the user-specified \(p{(}c_{j} {)}\), whose values should be provided as inputs. Within that context, the \(p{(}c_{j} {)}\) values do not always need to be calculated, since these may also be obtained from experimental and/or theoretical data. As an example, in a previous study [26], \(p{(}c_{j} {)}\) accounted for the degree of reliability of the experimental information, and the values of 0.55, 0.75 and 1.00 were used for the datapoints classified as ‘autocuration’, ‘intermediate’ and ‘expert’, respectively, according to the labelling of the ChEMBL database.
Similar to QSARCo, the current toolkit uses two stages of external validation for mtQSAR modelling, thereby requiring two separate test sets as well. As mentioned earlier, the dataset is initially split into training and validation sets by employing predefined sets, random division or kMCA-based systematic division schemes. The Box-Jenkins moving average approach is then applied to calculate the modified descriptors for the training set, by selecting one of the methods described above. The training set and its corresponding modified descriptors are subsequently randomly subdivided into a sub-training set and a test set (or calibration set). Here, it is important to remark that the \(avg{(}D_{i} {)}c_{j}\) values obtained from the training set are applied to calculate the modified descriptors for the validation set. The latter can thus be recognised as the ‘ideal test set’, because its datapoints participate neither in the model development nor in the descriptor calculation. On the other hand, the test set may be employed both as a ‘calibration set’ (especially for GA-LDA) and as an ‘external validation set’.
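The simplest operator (Method1) and the rule that the validation set reuses the training-set averages can be sketched as follows. This is an illustrative sketch rather than the toolkit's code: the function names and the row layout are hypothetical, a single condition element is handled (the toolkit derives one set of modified descriptors per element of \(c_{j}\)), and averaging over the active (+1) training cases only is an assumption based on the description above.

```python
def condition_means(rows, active_label=1):
    """Per-condition mean of each descriptor, computed on the training set.

    rows: iterable of (condition, class_label, [descriptor values]).
    Only rows whose label equals active_label contribute (assumption);
    pass active_label=None to average over all rows instead.
    """
    sums, counts = {}, {}
    for condition, label, descriptors in rows:
        if active_label is not None and label != active_label:
            continue
        if condition not in sums:
            sums[condition] = [0.0] * len(descriptors)
            counts[condition] = 0
        for k, value in enumerate(descriptors):
            sums[condition][k] += value
        counts[condition] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def deviation_descriptors(rows, means):
    """Subtract the training-set condition mean from each input descriptor.

    Raises KeyError for a condition that has no mean in the training set.
    """
    return [
        [value - avg for value, avg in zip(descriptors, means[condition])]
        for condition, _label, descriptors in rows
    ]


training = [
    ("c1", 1, [2.0, 10.0]),
    ("c1", 1, [4.0, 14.0]),
    ("c1", -1, [9.0, 0.0]),
    ("c2", 1, [1.0, 1.0]),
]
validation = [("c1", -1, [5.0, 12.0])]

means = condition_means(training)                    # from the training set only
print(deviation_descriptors(validation, means))      # [[2.0, 0.0]]
```

Because `means` is computed once from the training rows and merely looked up for the validation rows, the validation datapoints never influence the descriptor calculation.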
Step 3: Data pretreatment
The user-specific data pretreatment step of this module includes: (a) removal of highly correlated descriptors based on the user-specified correlation cutoff, and (b) removal of descriptors with low variance, based on the user-specified variance cutoff. The latter is essential because constant descriptors cause all the feature selection procedures to fail.
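A minimal sketch of such a pretreatment is given below. The function names are hypothetical, and two details are assumptions rather than documented toolkit behaviour: the population variance is used, and of two highly correlated descriptors the one encountered first is retained.

```python
import math
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pretreat(columns, variance_cutoff, correlation_cutoff):
    """Return the names of the descriptors that survive both filters.

    columns: dict mapping descriptor name -> list of values.
    """
    kept = []
    for name, values in columns.items():
        # (b) drop constant or near-constant descriptors
        if pvariance(values) <= variance_cutoff:
            continue
        # (a) drop descriptors too strongly correlated with one already kept
        if any(abs(pearson(values, columns[k])) > correlation_cutoff for k in kept):
            continue
        kept.append(name)
    return kept


descriptors = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [5.0, 5.0, 5.0, 5.0],   # constant
    "d": [1.0, 0.0, 1.0, 0.0],
}
print(pretreat(descriptors, variance_cutoff=0.001, correlation_cutoff=0.95))  # ['a', 'd']
```

Applying the variance filter first guarantees that the correlation is never computed for a constant column, where it would be undefined.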
Step 4: Linear model development
Two feature selection algorithms are used for setting up the linear discriminant analysis (LDA) models, namely: (a) fast stepwise (FS) and (b) sequential forward selection (SFS). Although many feature selection algorithms are available, the two chosen here can be highly efficient in mtQSAR modelling because of their ability to generate models quickly. Both can be employed along with the GA selection available in QSARCo, which however requires many iterations to find the optimised LDA models. FS is a very popular algorithm in which the independent descriptors are included in the model stepwise depending on a specific statistical parameter, the p-value, and it has previously been successfully employed to set up mtQSAR models [10, 26]. The usual criteria for forward selection (i.e., p-value to enter) and backward elimination (p-value to remove) are set in the present toolkit. That is, the descriptor with the lowest p-value is included first, and subsequently other descriptors are included in the model based on the lowest p-value only if the criterion for forward selection is met. Yet, if the p-value of a descriptor included in the model is found to be greater than the ‘p-value to remove’, it is eliminated from the model. The SFS algorithm adds features to an initially empty set until either the performance of the model no longer improves upon addition of another feature or the maximum number of features is reached [29]. Similar to FS, it is also a greedy search algorithm in which the best subsets of descriptors are selected stepwise, and the model performance is judged by user-specified statistical parameters, denoted as ‘scoring’ parameters. In the current version of QSARCoX, two scoring parameters are provided, namely ‘Accuracy’ and ‘AUROC’ (see description below). Users may develop separate models by varying these two scoring parameters in QSARCoX (see Case Study 4 for more details).
In contrast to GA, in which the generation of models is based on a randomisation process, these two feature selection algorithms for LDA are systematic and therefore faster. In this work, we resorted to the tool SequentialFeatureSelector from the library mlxtend (version 0.17.1: http://rasbt.github.io/mlxtend/) for developing the FS/SFS-LDA models. In both cases, the singular value decomposition (svd) solver, recommended for data containing a large number of features, is applied within the Scikit-learn Linear Discriminant Analysis package [30, 31].
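The greedy logic of sequential forward selection can be illustrated independently of any library. The sketch below is not the mlxtend implementation used by the toolkit: the function name is hypothetical, and `score` stands for any user-supplied callable (for instance, a cross-validated accuracy or AUROC of an LDA model built on the candidate subset).

```python
def sequential_forward_selection(features, score, max_features):
    """Greedy forward selection.

    Starting from an empty set, repeatedly add the feature that gives the
    highest score; stop when no addition improves the score or when
    max_features have been selected.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break  # adding any further feature does not help
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy scoring function (purely illustrative): each descriptor contributes a
# fixed gain, and every added descriptor costs a small complexity penalty.
gains = {"d1": 0.5, "d2": 0.3, "d3": 0.01}


def toy_score(subset):
    return sum(gains[f] for f in subset) - 0.05 * len(subset)


chosen, final_score = sequential_forward_selection(gains, toy_score, max_features=10)
print(chosen)  # ['d1', 'd2']
```

The procedure is deterministic for a deterministic scoring function, which is the source of the reproducibility advantage over GA-based selection noted above.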
Step 5: Model validation
The reliability and statistical significance of the models are evaluated by goodness-of-fit as well as by internal and external validation criteria.
Goodness-of-fit for the sub-training set is assessed by looking at the usual p and F (Fisher’s statistic) parameters along with the Wilks’ lambda (λ) statistic [32]. The latter essentially measures the discriminatory power of the LDA classification models, i.e., how well they separate cases into groups. It is equal to the proportion of the total variance in the discriminant scores not explained by differences among groups, and can take values from zero (perfect discrimination) to one (no discrimination). Similar to Wilks’ λ, the F-test measures how much better a complex model is than a simpler version of the same model in its capacity to explain the variance in the response variable [33].
All these statistical parameters are calculated with the help of the ordinary least squares routines of the Python library “statsmodels” (https://www.statsmodels.org/stable/api.html/).
The overall predictivity of the models is checked by examining the confusion matrix, which includes the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) samples. Based on those numbers, other statistical parameters such as the Sensitivity, Specificity, Accuracy, F1-score, and the Matthews correlation coefficient (MCC) are computed for the sub-training, test and validation sets (see Eq. 7), as well as the area under the receiver operating characteristic curve (AUROC) [34,35,36]. Additionally, the ROC curves are automatically created for each model.
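For reference, these parameters follow from the four confusion-matrix counts through their standard textbook definitions, as in the short sketch below. The function name is hypothetical and the values are returned as fractions rather than percentages.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification statistics from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative and one
    predicted positive, so that the ratios below are defined.
    """
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1_score": f1_score,
        "mcc": mcc,
    }


stats = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(stats["accuracy"])  # 0.85
```

Unlike the accuracy, the MCC uses all four counts symmetrically, which makes it more informative for imbalanced datasets such as those with far more active than inactive cases.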
Apart from confirming the internal and external predictivity, the choice of the best linear model should be guided by additional criteria. For example, highly correlated descriptors in the linear model may reduce its overall significance, and therefore the degree of collinearity among its descriptors must be carefully examined. To do so, the current module automatically generates the cross-correlation matrix for the selected sub-training set descriptors. It is also important to assess the applicability domain (AD) of the derived model, i.e., the response and chemical structure space within which the model makes reliable predictions. Here, the models’ AD is estimated by the standardisation approach proposed earlier by Roy et al. [37], which also allows possible structural outliers to be identified. The Python code for this approach is provided in the applicability.py file of the toolkit.
Step 6: Y_{c}-randomisation
In the previous QSARCo [15], the Y-randomisation scheme was implemented to judge the performance of the derived linear models. That is, following a classical scheme, the statistical quality of the original linear model is compared to that of models generated upon randomly shuffling the response variable several times, according to the user-specified ‘number of runs’, n. Since in Box-Jenkins based mtQSAR modelling the experimental/theoretical condition elements participate in the determination of the modified descriptors, the Y-randomisation is slightly modified here and named Y_{c}-randomisation, i.e., Y-randomisation with conditions. In this new scheme, along with the response variable, the experimental elements \(c_{j}\) are also scrambled n times, so that n randomised data matrices are generated. The models are subsequently rebuilt with these randomised data, and the average Wilks’ lambda (λ_{r}) and accuracy (Accuracy_{r}) are calculated. In a robust model, the average Accuracy_{r} should be considerably lower, and the average λ_{r} considerably higher (i.e., closer to one), than the accuracy and Wilks’ λ of the original model. The Python code ycr.py handles this scheme in QSARCoX.
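The scrambling loop can be sketched as follows. This is not the toolkit's ycr.py: the function name and the `rebuild_model` callable are hypothetical. The callable is assumed to recompute the modified descriptors from the scrambled conditions, refit the linear model on the scrambled responses, and return a (Wilks’ λ, accuracy) pair.

```python
import random


def yc_randomisation(responses, conditions, descriptors, rebuild_model, n_runs, seed=0):
    """Average Wilks' lambda and accuracy over n_runs scrambled datasets.

    Both the response vector and the condition labels are shuffled
    independently in each run; the inputs themselves are left untouched.
    """
    rng = random.Random(seed)
    lambdas, accuracies = [], []
    for _ in range(n_runs):
        shuffled_y = list(responses)
        shuffled_c = list(conditions)
        rng.shuffle(shuffled_y)
        rng.shuffle(shuffled_c)
        wilks_lambda, accuracy = rebuild_model(descriptors, shuffled_c, shuffled_y)
        lambdas.append(wilks_lambda)
        accuracies.append(accuracy)
    return sum(lambdas) / n_runs, sum(accuracies) / n_runs
```

The two averages returned correspond to λ_{r} and Accuracy_{r}, which are then compared with the statistics of the model built on the unscrambled data.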
Module 2 (NLG): hyperparameter tuning
Module 2 assists in setting up nonlinear models using a grid search based hyperparameter optimisation scheme (see Fig. 2). Six machine learning tools have so far been implemented in QSARCoX, namely: (a) k-Nearest Neighbours (kNN) [38], (b) Bernoulli Naïve Bayes (NB) classifier [39], (c) Support Vector Classifier (SVC) [40], (d) Random Forests (RF) [18], (e) Gradient Boosting (GB) [41], and (f) Multilayer Perceptron (MLP) neural networks [42]. For all these nonlinear modelling techniques, the Scikit-learn machine learning package is used [30, 31]. As in Module 1, the data pretreatment option is available in this module as well as in Module 3. In both these modules, the sub-training, test and validation sets set up with Module 1 of QSARCoX are required to be uploaded one after another for development of the nonlinear models.
In Module 2, a range of parameters of the machine learning tools are varied to obtain the most robust and predictive nonlinear models, based on an n-fold (i.e., user-specified) cross-validation scheme using the GridSearchCV of Scikit-learn [30, 31]. In this module, a parameter file should be provided as a .csv file that includes the names of the parameters to be optimised together with their candidate values. Six such parameter files, one for each machine learning technique, are available at https://github.com/ncordeirfcup/QSARCoX, namely: grid_knn.csv, grid_nb.csv, grid_svc.csv, grid_mlp.csv, grid_rf.csv and grid_gb.csv. The parameter names and their values mentioned in these files are shown in Table 2 below. The files were prepared based upon the importance of the parameters as well as our previous experience regarding the overall time requirements for the calculations. Nevertheless, the scope of this module is not limited to these parameters (and values), because users may select their own options for hyperparameter tuning by simply altering them. After selecting the best parameters, internal validation of the sub-training set is carried out by n-fold (i.e., user-specified) cross-validation, as well as external validation with both the test and validation sets. Similar to Module 1, the statistical results obtained for the nonlinear models are automatically generated along with the optimised parameters, as well as ROC curves for the test and validation sets. Similar to QSARCo, the nonlinear models’ AD is determined by the confidence estimation approach [43, 44].
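To make the idea of a parameter file concrete, the sketch below reads a small grid from CSV text and enumerates the combinations that an exhaustive grid search would evaluate. The row layout (parameter name followed by its candidate values) is a hypothetical one chosen for illustration and may differ from the files shipped with the toolkit; the function names are likewise hypothetical. Only `n_neighbors` and `weights` are taken from a real API, being parameters of the Scikit-learn k-nearest neighbours classifier.

```python
import csv
import io
import itertools


def parse_value(text):
    """Convert a CSV cell to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def read_grid(csv_text):
    """Read rows of the form 'name,value1,value2,...' into a parameter grid."""
    grid = {}
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank lines
        name, values = cells[0], cells[1:]
        grid[name] = [parse_value(v) for v in values if v]
    return grid


def grid_combinations(grid):
    """Every parameter setting an exhaustive grid search would try."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


example_file = "n_neighbors,3,5,7\nweights,uniform,distance\n"
grid = read_grid(example_file)
print(len(grid_combinations(grid)))  # 6
```

A dictionary of this shape (parameter name mapped to a list of candidate values) is what Scikit-learn's GridSearchCV accepts as its `param_grid` argument, and the number of combinations, multiplied by the number of cross-validation folds, gives the number of model fits required.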
Module 3 (NLU): user-specific parameter settings
The functionality of Module 3 (Fig. 2) is the same as that of Module 2, i.e., development of nonlinear models. However, in Module 3, the user specifies the parameter settings directly. Since grid search is a recommended but time-consuming technique, this module can be used for fast generation of the nonlinear models. Moreover, after hyperparameter tuning, the optimised parameters obtained from Module 2 can be specified in Module 3 to rebuild the optimised models rapidly. Other utilities of Module 3, such as calculation of statistics for internal and external validation, pretreatment of data files, and generation of ROC curves for both the test and the validation sets, are similar to those of Module 2.
Module 4 (CWP): condition-wise prediction
The QSARCoX toolkit includes this simple automated analysis tool for checking the results obtained from mtQSAR modelling. Indeed, since the mtQSAR modelling implemented in QSARCoX leads to a unique model for datasets containing several experimental and/or theoretical conditions, one may need to assess how predictive the derived model is for a specific condition. Module 4 (see Fig. 2) is therefore employed to inspect the models’ performance against each condition, for several reasons. For example, if the user ends up with several almost equally predictive models, one of them might be selected on the basis of being more predictive towards a particular condition of interest. Moreover, the conditions for which the model is less predictive may be removed to obtain more predictive and/or more significant models. Finally, experimental or theoretical conditions with a negligible number of cases may also be identified through this analysis and, if the derived model is found to be less predictive towards such conditions, these may likewise be removed to rebuild the model.
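The kind of summary produced by such an analysis can be sketched in a few lines. The function name and the output layout are hypothetical; the sketch simply groups the predictions of a set by condition element and reports the number of cases and the fraction correctly classified.

```python
from collections import defaultdict


def conditionwise_accuracy(conditions, observed, predicted):
    """Number of cases and fraction correctly predicted for each condition."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, y_true, y_pred in zip(conditions, observed, predicted):
        totals[condition] += 1
        hits[condition] += int(y_true == y_pred)
    return {c: (totals[c], hits[c] / totals[c]) for c in totals}


conditions = ["c1", "c1", "c1", "c2", "c2", "c3"]
observed = [1, -1, 1, 1, -1, 1]
predicted = [1, -1, -1, 1, -1, -1]

for condition, (n_cases, accuracy) in conditionwise_accuracy(
    conditions, observed, predicted
).items():
    print(condition, n_cases, round(accuracy, 3))
```

Reporting the case count next to the accuracy matters, because a low accuracy based on one or two samples (as for `c3` here) is weak evidence compared with the same accuracy over many samples.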
The overall workflow of this new toolkit, along with all of the modules described above, can be seen in Fig. 3.
Results
To check as well as to demonstrate the utilities of the developed QSARCoX toolkit, four case studies pertaining to previously compiled datasets [9, 11, 26, 27] are examined in this section. For all of them, both the activity cutoff values and the descriptors employed in the original publications were used here (exact details can be found in the original papers). The main purposes of these four case studies are as follows:

Case study 1: Demonstrate how linear and nonlinear mtQSAR models may be developed with this toolkit.

Case study 2: Show how different models may be generated using the different data-splitting facilities of the toolkit.

Case study 3: Describe how models may be generated using the various available Box-Jenkins operators.

Case study 4: Perform a comparative analysis between the model development techniques of the former QSARCo and the new QSARCoX toolkit.
Case study 1 (CS1)
The first dataset comprises 726 inhibitors of four class I phosphoinositide 3-kinase (PI3K) enzyme isoforms (PI3Kα, β, γ, δ), the activities of which have been assayed against 34 mutated or wild human cell lines [11]. The experimental conditions considered in this dataset can be expressed as an ontology of the form c_{j} → (b_{t}, c_{l}, m_{t}), i.e., corresponding to the combination of the three following elements: b_{t} (biological enzyme target), c_{l} (cell line) and m_{t} (mutated or wild cell lines).
Compounds with IC_{50}/K_{i}/K_{d} values ≤ 600 nM were assigned as active, whereas the remaining data samples were considered inactive. The dataset contained 536 active (+ 1) and 190 inactive (− 1) compounds, and the mtQSAR models were developed for predicting the activity of inhibitor compounds against these four isoforms of PI3K under various experimental conditions.
Linear interpretable models
The dataset was first divided into a training and a validation set using a random division scheme (22% of the data taken as the validation set, seed value = 2). Subsequently, the Box-Jenkins operator (Method1, Eq. 1) was applied to produce a sub-training set (n_{str} = 452), a test set (n_{ts} = 114) and a validation set (n_{vd} = 160), using a seed value of 2. The FS-LDA model was then derived with the following options: (a) correlation cutoff of 0.999, (b) variance cutoff of 0.001, (c) p-value to enter of 0.05, and (d) p-value to remove of 0.05. Meanwhile, the SFS-LDA model was built using the following: (a) correlation cutoff of 0.999, (b) variance cutoff of 0.001, (c) Floating = True, and (d) Scoring = Accuracy. For both models, a maximum of ten descriptors was allowed; the sub-training results are shown in Supplementary Information (Additional file 1: Table S1). As can be seen in Table S1, the FS-LDA model shows a higher goodness-of-fit than the SFS-LDA model.
The FS-LDA model developed in the first attempt showed high intercollinearity, with a maximum Pearson correlation coefficient (r) of 0.926 between two of its descriptors. Therefore, the maximum allowed pairwise correlation coefficient was reduced to 0.90, and the final rebuilt model yielded a Wilks’ λ of 0.261. Similarly, the first SFS-LDA model developed also presented high intercollinearity between two of its descriptors (r > 0.98). Therefore, the latter model was rebuilt by reducing the correlation cutoff to 0.95, and this revised SFS-LDA model showed a much more satisfactory intercollinearity among descriptors (maximum r = 0.808). The overall predictivity of the linear models is shown in Table 3.
As can be seen, the SFS-LDA model was found to be more predictive than the FS-LDA model. The average accuracy and MCC values found for the newly developed SFS-LDA model are 94.95% and 0.873, respectively. After analysing the AD computed by the standardisation approach, 15 datapoints of the sub-training set, 6 datapoints of the test set, and 5 datapoints of the validation set were found to be outliers in the FS-LDA model. In the SFS-LDA model, by contrast, 43 sub-training set, 13 test set and 14 validation set samples emerged as structural outliers. Therefore, based on the AD results, it may be inferred that the FS-LDA model was developed with descriptors that yield a considerably smaller number of structural outliers than the SFS-LDA model. The ROC plots of the FS-LDA and SFS-LDA models generated with the current toolkit can be found in Supplementary Information (Additional file 1: Figure S1).
Nonlinear models
This dataset was then subjected to nonlinear model development using the QSARCoX toolkit. For that purpose, the hyperparameter tuning implemented in its Module 2 was employed. Details about the corresponding optimised parameters along with the accuracy values obtained for the sub-training, test and validation sets can be found in Supplementary Information (Additional file 1: Table S2). It can be observed that, except for Bernoulli NB, all the machine learning tools are able to produce highly predictive mtQSAR models. Among them, the RF and GB tools lead to the most significant nonlinear mtQSAR models, judging from their internal and external validation parameters (i.e., accuracy in this case; see Table 4). Although the same accuracy is obtained for the validation set, on the basis of overall predictivity the RF model is found to be slightly superior to the GB model. Table 4 shows the overall statistical predictivity of the latter two models, whereas the ROC plots for the validation and test sets are shown in Supplementary Information (Additional file 1: Figure S2). Interestingly, the external predictivity of the RF model exactly matches that of the FS-LDA model (cf. Table 3).
Finally, Module 4 of QSARCoX was applied for a condition-wise prediction with the FS-LDA model, and the results obtained are listed in Table 5. Here, it should be mentioned that the present dataset pertains to as many as 34 experimental condition elements, and from Table 5 it can be observed that not all of them appear in both the test and validation sets. However, owing to the high external predictivity of the model, most of these experimental elements are predicted with high accuracy values. Nevertheless, it can additionally be seen that the samples pertaining to elements 18 and 24 are not only few in number but also poorly predicted. These samples may then be removed, or alternative models generated with other techniques in which the predictivities for these experimental condition elements are higher. A similar ‘condition-wise prediction’ analysis might also be performed using the derived nonlinear models with the help of the present module. The results, i.e., the output files generated for the FS-LDA, SFS-LDA, RF and GB models of CS1, are given in Additional file 2.
Case study 2 (CS2)
The second case study investigates the impact of data distribution on the development of mtQSAR models. It also aims to demonstrate the significance of Y_{c} randomization as an extra criterion for justifying the robustness of linear models. A previously collected dataset [26] is employed, which contains 46,229 datapoints describing the antibacterial activity against Gram-negative pathogens and in vitro safety profiles related to absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties. This dataset pertains to four experimental condition elements (c_{j}), namely: b_{t} (biological target), m_{e} (measure of effect), a_{i} (assay information), and t_{m} (target mapping). Additionally, each datapoint includes a probabilistic factor p_{c} to account for the degree of reliability of the experimental information. Each case in the dataset was assigned to one of two possible categories, namely positive (+ 1) or negative (− 1). Cutoff values for different measures of toxicity effects of compounds are provided in Supplementary Information (Additional file 1: Table S4).
Two types of models were generated. In the first case, the probabilistic factor p_{c} was discarded and the models were developed using ‘Method1’. In the second case, the models were developed considering the influence of p_{c}; because of its presence, the BoxJenkins operator based on ‘Method4’ (Eq. 6) was employed. For both cases, we applied the three dataset distribution methods available in QSARCoX for splitting the data into training and validation sets. In the first method (i.e., predefined sets), the training (75% of the data) and validation (25% of the data) sets from the original work were used. In the second method (i.e., random division), 25% of the data was placed in the validation set using a random seed value of 2. In the third method (i.e., kMCA based division), the data was divided into ten clusters and, from each of these, 25% of the data was selected for the validation set. Subsequently, each training set was divided into subtraining (80%) and test (20%) sets using a random seed value of 3. For each of these data distributions, SFSLDA models were developed using the current toolkit with the following parameters: (a) correlation cutoff of 1.0, (b) variance cutoff of 0.001, (c) maximum steps = 6, (d) Floating = True, and (e) Scoring = Accuracy. The statistical results thus gathered, as well as the ROC plots for the three derived linear models, can be found in Supplementary Information (Additional file 1: Figure S3, Tables S3 and S4). These plots, along with the corresponding AUROC values, allow one to infer the classification ability of the generated mtQSAR models.
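The random and cluster-based divisions described above can be sketched in a few lines. This is an illustration only, not the toolkit's code: the cluster labels are assumed to be already available (the kMCA clustering itself is not shown), the helper names `random_split` and `cluster_split` are hypothetical, and the seed used for picking samples inside each cluster is an arbitrary choice.

```python
import random
from collections import defaultdict


def random_split(indices, fraction, seed):
    """Move `fraction` of the indices to a held-out set after a seeded shuffle."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_out = round(len(shuffled) * fraction)
    return sorted(shuffled[n_out:]), sorted(shuffled[:n_out])  # (kept, held out)


def cluster_split(cluster_labels, fraction, seed):
    """Hold out `fraction` of the samples from every cluster."""
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)
    rng = random.Random(seed)
    kept, held_out = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        n_out = round(len(members) * fraction)
        held_out.extend(members[:n_out])
        kept.extend(members[n_out:])
    return sorted(kept), sorted(held_out)


# Toy data: 200 samples already assigned to ten clusters of equal size.
cluster_labels = [i % 10 for i in range(200)]

# Cluster-based division: 25% of every cluster goes to the validation set.
training, validation = cluster_split(cluster_labels, 0.25, seed=0)

# The training set is then divided into subtraining (80%) and test (20%) sets.
subtraining, test = random_split(training, 0.20, seed=3)

print(len(subtraining), len(test), len(validation))
```

Drawing the validation samples cluster by cluster keeps every region of the descriptor space represented in both sets, which a purely random division does not guarantee.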
As one may observe from Additional file 1: Table S4, irrespective of the data distribution method used, the models generated with ‘Method4’ display slightly better statistical parameters. This suggests that the probabilistic factor considered in the original investigation does influence the response variable.
Focusing now only on the ‘Method4’ based models, the Wilks’ λ values obtained for the predefined, random and kMCA division-based models were 0.438, 0.432 and 0.440, respectively. Such low values for the subtraining sets show that all these models display adequate discriminatory power and a satisfactory goodness-of-fit. In addition, at first sight (Additional file 1: Table S4), there are no significant differences in statistical quality between these models, indicating that the quality of the linear model remains almost the same whichever data distribution method is considered. However, after verifying the internal and external validation results, the random division-based model is seen to be the best linear mtQSAR model. Further, the degree of collinearity among the variables of this model is not too high, the maximum correlation coefficient between two of its descriptors being 0.831. To further judge the statistical significance of this model, we applied the Y_{c} randomization scheme implemented in QSARCoX. To do so, the response variable and experimental elements were randomised 100 times, and the resulting 100 randomised data matrices were then subjected to the same BoxJenkins operator (i.e., ‘Method4’) used for generating the original model. Subsequently, 100 models were created with the randomised subtraining sets using the descriptors of the original model. The average Wilks’ λ (λ_{r}) and average accuracy (Accuracy_{r}) found for such models were 0.999 and 58.09%, respectively; compared with the values attained for the original model (i.e., 0.432 and 96.37%), these confirm that the latter is unique and not the result of chance correlations. The results, i.e., the output files from the current toolkit, of these SFSLDA models for CS2 are shown in Additional file 3.
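The logic of such a randomisation test can be sketched with the standard library alone. The example below is a simplification and not the QSARCoX implementation: it shuffles only the class labels, whereas the scheme described above also randomises the experimental elements and recomputes the BoxJenkins descriptors; the data are synthetic and the function names are hypothetical. Wilks' λ is computed as det(W)/det(T), with W the pooled within-class and T the total sum of squares and cross-products matrix.

```python
import random


def sscp(rows):
    """Sum of squares and cross-products matrix of the rows about their mean."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows)
             for b in range(p)] for a in range(p)]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return d


def wilks_lambda(X, y):
    """Wilks' lambda = det(W) / det(T); values near 0 mean good class separation."""
    within = None
    for label in sorted(set(y)):
        group = sscp([x for x, t in zip(X, y) if t == label])
        within = group if within is None else [
            [a + b for a, b in zip(ra, rb)] for ra, rb in zip(within, group)]
    return det(within) / det(sscp(X))


def y_randomisation(X, y, n_runs=100, seed=0):
    """Average Wilks' lambda over n_runs evaluations with shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(y)
    lambdas = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        lambdas.append(wilks_lambda(X, shuffled))
    return sum(lambdas) / n_runs


# Synthetic data: only the first of three descriptors separates the classes.
rng = random.Random(42)
y = [1] * 100 + [-1] * 100
X = [[rng.gauss(1.5 * label, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
     for label in y]

lambda_original = wilks_lambda(X, y)
lambda_random = y_randomisation(X, y)
print(f"original: {lambda_original:.3f}, randomised mean: {lambda_random:.3f}")
```

A model whose λ stays far below the randomised average, as in the case study, is unlikely to owe its discriminatory power to chance.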
Case study3 (CS3)
The purpose of the third case study is to show how the different BoxJenkins operators may affect the statistical quality of the derived models. The dataset of CS3 was retrieved from a recently published work in which the toxicity of 260 pesticides was targeted by mtQSAR modelling with artificial neural networks (ANN) [27]. The dataset comprises a total of 3610 datapoints related to four primary experimental condition elements (c_{j}), namely: m_{e} (measure of toxicity), b_{s} (bioindicator species), a_{g} (assay guideline) and e_{p} (exposure period). For detailed information about the cutoff values employed for the different measures of toxicity effects, please refer to the Supplementary Information (Additional file 1: Table S5). Further details about m_{e}, b_{s}, a_{g} and e_{p} can be obtained from the original work [27]. The dataset contains 1992 toxic (+ 1) and 1618 nontoxic (− 1) datapoints. Additionally, three other experimental condition elements were taken into consideration while modelling: the concentration lethality (l_{c}), target mapping (t_{m}) and time classification (t_{c}). These three may be regarded as secondary experimental elements (\(c_{{j_{2} }}\)) simply because l_{c}, t_{m} and t_{c} are related to m_{e}, b_{s} and e_{p}, respectively. On the basis of these related primary and secondary experimental elements, three probabilistic factors were calculated in that work as follows [27]:
\[p(c_{j})_{c_{j_{2}}} = \frac{n_{\text{T}}(c_{j})}{N_{\text{T}}(c_{j_{2}})}\]
where \(n_{{\text{T}}} {(}c_{j} {)}\) and \(N_{{\text{T}}} {(}c_{{j_{2} }} {)}\) stand for the number of training set samples, including toxic and nontoxic datapoints, within the primary and secondary experimental elements, respectively.
In that work, another probabilistic factor was also included based on the following equation [27]:
\[p(c_{j}) = \frac{n_{\text{T}}(c_{j})}{N_{\text{T}}}\]
where N_{T} stands for the total number of samples in the training set. Notably, this equation is just like Eq. 5, already implemented within one of the BoxJenkins operators (‘Method3’) in QSARCoX, because it merely corresponds to a normalisation by the total number of elements.
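As a numerical illustration, the sketch below computes such factors for a toy training set. It assumes that each factor is a plain ratio of training-set counts, which is how the definitions of n_{T} and N_{T} above read; the function name `probabilistic_factors` and the example element names are hypothetical and do not come from the original work [27].

```python
from collections import Counter


def probabilistic_factors(primary, secondary=None):
    """Per-sample factor: count of the sample's primary element divided by
    the count of its secondary element, or by the training-set size when
    no secondary element is given."""
    n_primary = Counter(primary)
    if secondary is None:
        total = len(primary)
        return [n_primary[element] / total for element in primary]
    n_secondary = Counter(secondary)
    return [
        n_primary[element] / n_secondary[group]
        for element, group in zip(primary, secondary)
    ]


# Toy training set: one primary element and its related secondary element
# per sample (for instance, a bioindicator species and its target mapping).
species = ["fish_A", "fish_A", "fish_B", "bee", "bee", "bee"]
mapping = ["aquatic", "aquatic", "aquatic", "terrestrial", "terrestrial", "terrestrial"]

p_with_secondary = probabilistic_factors(species, mapping)
p_total = probabilistic_factors(species)

print([round(p, 3) for p in p_with_secondary])
print([round(p, 3) for p in p_total])
```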
Each of these probabilistic factors may be simply denoted as \(p{(}c_{j} {)}\), so the final deviation descriptors employed in that work [27] are similar to the standardised modified descriptors presented in Eq. 4. These final descriptors nevertheless embody a more complex moving average operator that is not implemented in QSARCoX (cf. Equations 3–6). Even so, ‘Method4’ (Eq. 6) may be applied with a slight modification to obtain the same modified descriptors used in that work. To that end, the python code of ‘Method4’ was adapted to calculate the modified descriptors (‘Method4 modified’, cf. Table 6) from the starting descriptors reported in that work [27]. Then, nonlinear mtQSAR models were developed using a predefined data distribution, i.e., the same training and validation sets employed in the original work [27]. Eighty percent of the training dataset was treated as the subtraining set, whereas the remainder was used as the test set for setting up RF based nonlinear models. However, instead of employing preselected features for developing the nonlinear models, as was done in the original work, here we resort to the maximum descriptor space for model generation. To remove less informative and highly correlated features, a data pretreatment was carried out by setting the correlation cutoff at 0.95 and the variance cutoff at 0.001. In addition, a fivefold cross-validation was used for the grid search as well as for inspecting the internal predictivity of the subtraining set. After developing the model using the adapted ‘Method4’, this model was also compared to models based on the other operators (i.e., the original Methods1–4) implemented in QSARCoX. However, the probabilistic factors (i.e., the original \(p{(}m_{e} {)}_{{l_{c} }}\), \(p{(}b_{s} {)}_{{t_{m} }}\), and \(p{(}e_{p} {)}_{{t_{c} }}\) factors) could not be accommodated when calculating the descriptors with Methods 1–3. Therefore, for these methods the influence of all secondary experimental elements was discarded, whereas these probabilistic factors were considered in the model developed with Method4. The results of the RF models developed with all five types of moving average operators and related deviation descriptors are shown in Table 6.
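The data pretreatment mentioned above can be sketched as a two-stage filter. This is a minimal illustration under stated assumptions, not the toolkit's routine: the population variance is used, a descriptor is dropped when it correlates too strongly with one that has already been kept (so the outcome depends on column order), and the function names and toy descriptors are hypothetical.

```python
import math
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def pretreat(columns, var_cutoff=0.001, corr_cutoff=0.95):
    """Return the names of the descriptors that survive both filters.

    `columns` maps a descriptor name to its list of values.
    """
    kept = []
    for name, values in columns.items():
        # Near-constant descriptors carry almost no information.
        if statistics.pvariance(values) < var_cutoff:
            continue
        # Skip descriptors that duplicate one already retained.
        if any(abs(pearson(values, columns[k])) > corr_cutoff for k in kept):
            continue
        kept.append(name)
    return kept


# Toy descriptor table: D2 is almost collinear with D1 and D4 is constant.
descriptors = {
    "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D2": [2.1, 4.0, 6.2, 8.1, 9.9],
    "D3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "D4": [0.5, 0.5, 0.5, 0.5, 0.5],
}

selected = pretreat(descriptors)
print(selected)
```

Running the variance filter first guarantees that every retained column has a nonzero spread, so the correlation is always defined.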
As seen, the models obtained here show a higher predictive ability than the model reported in the original investigation (MCC score of 0.524 for the test set) [27]. Nevertheless, the latter is more interpretable since only a limited number of features was used for its development. Therefore, a direct comparison of the reported model with the current RF models is not feasible, nor is it the purpose of the current case study. Rather, our aim here is to show the importance of the different operators implemented in QSARCoX. Even though the variations in the operators did not have a significant impact on the statistical quality of these models, the mtQSAR model obtained from ‘Method1’ is found to produce the best solution in terms of both internal and external predictivity. However, this outcome is based on only one data distribution technique and one machine learning method. Therefore, no final conclusion can be drawn regarding the utility of these operators. The case study nevertheless demonstrates that the multiple operators implemented in QSARCoX may be utilised to judge which option is most suitable for a specific dataset. The results, i.e., the output files from the current toolkit, obtained from the RF model by applying Method1 for CS3 are given in Additional file 4.
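For reference, the Matthews correlation coefficient (MCC) quoted above is obtained from the four cells of the confusion matrix. The sketch below uses made-up labels and a hypothetical function name; returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Made-up predictions: three of four positives and three of four negatives
# are classified correctly.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, 1]

score = mcc(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"MCC = {score:.3f}, accuracy = {accuracy:.2f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it more informative for unbalanced datasets.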
Finally, it is important to remark that the previously reported model was developed using commercial software.
Case study4 (CS4)
Case studies 1–3 were examined mainly to demonstrate some of the basic utilities of QSARCoX. In the final case study, however, we attempted to compare the performance of previously reported QSARCo models with that of newly created QSARCoX models. For this purpose, we collected a previously reported dataset containing 2,123 peptides (amino acid length 4–119) with antibacterial activities against multiple Gram-negative bacterial strains and cytotoxicity against multiple cell types [9]. This dataset pertains to two experimental condition elements (c_{j}), namely: b_{s} (biological target) and m_{e} (measure of effect). Each peptide in the dataset was assigned to one of two possible categories: positive (+ 1), indicating high antibacterial activity or low cytotoxicity, or negative (− 1), indicating low antibacterial activity or high cytotoxicity. The cutoff values to annotate a peptide as positive were: MIC ≤ 14.97 μM, or CC50 ≥ 60.91 μM, or HC50 ≥ 105.7 μM. For more details, please refer to the original investigation [9]. MtQSAR modelling of this dataset has already been performed using the QSARCo tool [15], with the linear model developed using the GALDA technique and the nonlinear model using the RF technique. In this case study, three additional linear models were built using QSARCoX, keeping the same maximum number of descriptors (i.e., four) and data distributions. Table 7 shows the statistical parameters obtained for all these models. Note that two LDA models were set up by applying SFS for feature selection with the two different scoring parameters (i.e., Accuracy and AUROC).
The Wilks’ lambda (λ) value obtained for the originally developed GALDA model is 0.422, whereas those of the FSLDA, SFSLDA (Scoring: Accuracy) and SFSLDA (Scoring: AUROC) models are 0.421, 0.444 and 0.451, respectively. As seen in Table 7, among the QSARCoX linear models, the SFSLDA model generated with the AUROC scoring parameter is found to be the best one, judging from its overall predictivity results. Furthermore, the overall predictivity of this model is significantly higher than that of the previously reported GALDA model [15].
Similarly, in this case study, we also developed two nonlinear models through the RF and GB techniques. It is important to mention here that QSARCo does not provide any option for hyperparameter optimisation and therefore the earlier reported RF model has been generated without it. On the other hand, the models generated by QSARCoX were set up with hyperparameter optimisation by supplying the values for the parameter settings in its Module 2. Table 8 shows the attained results for these models.
By inspecting the statistical parameters given in Table 8, it is clear that the GB model affords the best predictivity and leads to a significant improvement in the external predictive accuracy compared with the previously reported RF model generated with QSARCo. However, the significance of this GB based model is not limited to its better performance. Since it was developed with hyperparameter optimisation, its overall acceptability is much higher than that of the RF model generated with QSARCo without any tuning of hyperparameters [45, 46]. On the whole, the results shown in Tables 7 and 8 suggest that the QSARCoX toolkit provides some very useful strategies for setting up linear and nonlinear mtQSAR models.
The results of the SFSLDA and GB models, i.e., the output files from the current toolkit, obtained for CS4 are given in the Supplementary Information (Additional file 5).
Conclusions
In this work, we described the user-friendly open-source QSARCoX toolkit, which is an extension of our previously launched java-based tool QSARCo [15] and has a number of advantages over the latter in supporting mtQSAR modelling efforts. Indeed, the current toolkit moves a step forward by including more up-to-date and advanced strategies, namely regarding data distribution options, schemes for the calculation of modified descriptors, feature selection algorithms, machine learning methods, validation strategies and analysis techniques. The QSARCoX toolkit is based on Python, which is undoubtedly one of the most popular and widely used programming languages, especially in the field of data science [22]. The current toolkit utilises some well-known Python based libraries, such as NumPy [47], SciPy [48], Pandas [49], Matplotlib [50], Tkinter (https://anzeljg.github.io/rin2/book2/2405/docs/tkinter/index.html), and Scikitlearn [30, 31]. The code of the toolkit is publicly available, so that any necessary modifications or updates may be easily implemented by its users. Similar to QSARCo, this toolkit relies primarily on BoxJenkins based mtQSAR modelling, which has proved to be highly efficient in handling large datasets pertaining to various experimental and/or theoretical conditions [10,11,12,13,14,15, 20, 26,27,28, 51]. Further, the ability to explore all of its code tools simultaneously, as well as the graphical user interface itself, provides simple and efficient solutions to the main practical challenges involved in mtQSAR modelling. This was clearly shown by testing its functionalities on four case studies. Indeed, we were able to demonstrate the basic utilities of its tools and, at the same time, to show how different feature selection algorithms, machine learning methods, dataset division options and BoxJenkins operators may play crucial roles in the development of more predictive mtQSAR models.
The toolkit allows the users to save the developed models and use these for predicting properties of new external chemicals. Clearly, future investigations using various datasets will lead to a better understanding about the utilities and shortcomings of the functionalities of the present toolkit and will naturally give rise to its upgrading. Yet, on the whole, the toolkit presented here has the potential of becoming a widely used platform for easily setting up predictive mtQSAR models.
Availability of data and materials
Project name: QSARCoX.
Project home page: The source code of the toolkit along with its manual and reference data files are available from https://github.com/ncordeirfcup/QSARCoX.
Operating system(s): Platform independent.
Programming language: Python.
Other requirements: NumPy, SciPy, Pandas, Matplotlib, Tkinter and Scikitlearn.
License: GNU GPL version 3.
Any restrictions to use by nonacademics: None.
References
Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, Isayev O, Curtarolo S, Fourches D, Cohen Y, AspuruGuzik A, Winkler DA, Agrafiotis D, Cherkasov A, Tropsha A (2020) QSAR without borders. Chem Soc Rev 49:3525–3564
Lewis RA, Wood D (2014) Modern 2D QSAR for drug discovery. WIREs Comput Mol Sci 4:505–522
Neves BJ, Braga RC, Melo CC, Moreira JT, Muratov EN, Andrade CH (2018) QSARbased virtual screening: advances and applications in drug discovery. Front Pharmacol 9:1275
Gramatica P (2020) Principles of QSAR Modeling: Comments and suggestions from personal experience. Int J Quant StrucProp Relation 5:61–97
Toropov AA, Toropova AP (2020) QSPR/QSAR: Stateofart, weirdness, the future. Molecules 25:1292
Polanski J (2017) Big data in structureproperty studies—from definitions to models. In: Roy K (ed) Advances in QSAR Modeling. Challenges and Advances in Computational Chemistry and Physics. Springer, Cham
SpeckPlanche A (2018) Recent advances in fragmentbased computational drug design: tackling simultaneous targets/biological effects. Future Med Chem 10:2021–2024
SpeckPlanche A, Cordeiro MNDS (2017) Advanced in silico approaches for drug discovery: mining information from multiple biological and chemical data through mtkQSBER and ptQSPR strategies. Curr Med Chem 24:1687–1704
Kleandrova VV, Ruso JM, SpeckPlanche A, Cordeiro MNDS (2016) Enabling the discovery and virtual screening of potent and safe antimicrobial peptides. Simultaneous prediction of antibacterial activity and cytotoxicity. ACS Comb Sci 18:490–498
Halder AK, Natalia M, Cordeiro MNDS (2019) Probing the environmental toxicity of deep eutectic solvents and their components: An in silico modeling approach. ACS Sust Chem Eng 7:10649–10660
Halder AK, Cordeiro MNDS (2019) Development of multitarget chemometric models for the inhibition of class i pi3k enzyme isoforms: a case study using QSARCo tool. Int J Mol Sci 20:4191
SpeckPlanche A (2019) Multicellular target QSAR model for simultaneous prediction and design of antipancreatic cancer agents. ACS Omega 4:3122–3132
SpeckPlanche A, Scotti MT (2019) BET bromodomain inhibitors: fragmentbased in silico design using multitarget QSAR models. Mol Divers 23:555–572
Kleandrova VV, Scotti MT, Scotti L, Nayarisseri A, SpeckPlanche A (2020) Cellbased multitarget QSAR model for design of virtual versatile inhibitors of liver cancer cell lines. SAR QSAR Environ Res 31:815–836
Ambure P, Halder AK, Diaz HG, Cordeiro MNDS (2019) QSARCo: An open source software for developing robust multitasking or multitarget classificationbased QSAR models. J Chem Inf Model 59:2538–2544
Rogers D, Hopfinger AJ (1994) Application of genetic function approximation to quantitative structureactivityrelationships and quantitative structureproperty relationships. J Chem Inf Comput Sci 34:854–866
Ambure P, Aher RB, Gajewicz A, Puzyn T, Roy K (2015) “NanoBRIDGES” software: Open access tools to perform QSAR and nanoQSAR modeling. Chemometrics Intellig Lab Syst 147:1–13
Breiman L (2001) Random forests. Mach Learn 45:5–32
Organization for Economic CoOperation and Development (OECD). Guidance document on the validation of (quantitative) structureactivity relationship ((q)sar) models; OECD Series on Testing and Assessment 69; OECD Document ENV/JM/ MONO2007, pp 55−65.
Halder AK, Giri AK, Cordeiro MNDS (2019) MultiTarget chemometric modelling, fragment analysis and virtual screening with erk inhibitors as potential anticancer agents. Molecules 24:3909
Khan PM, Roy K (2018) Current approaches for choosing feature selection and learning algorithms in quantitative structureactivity relationships (QSAR). Expert Opin Drug Disc 13:1075–1089
Van Rossum G, Drake FL (2009) Python 3 Reference Manual. CreateSpace, CA
Gore PA (2000) Cluster Analysis. In: Tinsley HEA, Brown SD (eds) Handbook of applied multivariate statistics and mathematical modeling. Academic Press, San Diego, p 297
Mauri A, Consonni V, Pavan M, Todeschini R (2006) Dragon software: An easy approach to molecular descriptor calculations. MATCH Commun Math Comput Chem 56:237–248
ValdesMartini JR, MarreroPonce Y, GarciaJacas CR, MartinezMayorga K, Barigye SJ, Almeida YSV, PerezGimenez F, Morell CA (2017) QuBiLSMAS, open source multiplatform software for atom and bondbased topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J Cheminform 9:35
SpeckPlanche A, Cordeiro MNDS (2017) De novo computational design of compounds virtually displaying potent antibacterial activity and desirable in vitro ADMET profiles. Med Chem Res 26:2345–2356
SpeckPlanche A (2020) Multiscale QSAR approach for simultaneous modeling of ecotoxic effects of pesticides. In: Roy K (ed) Ecotoxicological QSARs. Springer, New York
SpeckPlanche A (2018) Combining ensemble learning with a fragmentbased topological approach to generate new molecular diversity in drug discovery: In silico design of Hsp90 inhibitors. ACS Omega 3:14704–14716
Menzies T, Kocagüneli E, Minku L, Peters F, Turhan B (2015) Complexity: using assemblies of multiple models. In: Menzies T, Kocagüneli E, Minku L, Peters F, Turhan B (eds) Sharing data and models in software engineering. Morgan Kaufmann, Boston
Hao JG, Ho TK (2019) Machine learning made easy: a review of scikitlearn package in python programming language. J Educ Behav Stat 44:348–361
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikitlearn: Machine learning in python. J Mach Learn Res 12:2825–2830
Wilks SS (1932) Certain generalizations in the analysis of variance. Biometrika 24:471–494
Hahs-Vaughn DL, Lomax RG (2020) An introduction to statistical concepts. Routledge, NY
Boughorbel S, Jarray F, ElAnbari M (2017) Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12:e0177678
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874
Hanczar B, Hua JP, Sima C, Weinstein J, Bittner M, Dougherty ER (2010) Smallsample precision of ROCrelated estimates. Bioinformatics 26:822–830
Roy K, Kar S, Ambure P (2015) On a simple approach for determining applicability domain of QSAR models. Chemometr Intell Lab Sys 145:22–29
Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
McCallum A, Nigam K (2001) A comparison of event models for naive bayes text classification. Work Learn Text Categ 752:41–48
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Huang GB, Babri HA (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Trans Neural Netw 9:224–229
Ambure P, Bhat J, Puzyn T, Roy K (2019) Identifying natural compounds as multitargetdirected ligands against Alzheimer’s disease: an in silico approach. J Biomol Struct Dyn 37:1282–1306
Mathea M, Klingspohn W, Baumann K (2016) Chemoinformatic classification methods and their applicability domain. Mol Inform 35:160–180
Probst P, Boulesteix AL, Bischl B (2019) Tunability: importance of hyperparameters of machine learning algorithms. J Mach Learn Res 20:1–32
Wu J, Chen XY, Zhang H, Xiong LD, Lei H, Deng SH (2019) Hyperparameter optimization for machine learning models based on bayesian optimization. J Electr Sci Technol 17:26–40
van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13:22–30
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, Contributors S (2020) SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272
McKinney W (2010) Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, Austin, Texas, 28 June–3 July 2010
Hunter JD (2007) Matplotlib: A 2D graphics environment. Comput Sci Eng 9:90–95
Halder AK, Melo A, Cordeiro MNDS (2020) A unified in silico model based on perturbation theory for assessing the genotoxicity of metal oxide nanoparticles. Chemosphere 244:125489
Acknowledgements
This work received financial support from FCT - Fundação para a Ciência e Tecnologia through funding for the project PTDC/QUI-QIN/30649/2017. The authors also thank FCT for its support to LAQV-REQUIMTE (UID/QUI/50006/2020).
Author information
Contributions
AKH designed and implemented the software. MNDSC tested the software. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Supplementary Information
Additional file 1.
File containing the QSARCoX generated ROC plots (Figures S1–3) and additional information related to the several case studies (Tables S1–5).
Additional file 2.
Folder (CS_1) containing the results (i.e., the output files from the current toolkit) of the FSLDA, SFSLDA, RF and GB models for case study 1.
Additional file 3.
Folder (CS_2) containing both the input files and the results (i.e., the output files from the current toolkit) of the SFSLDA models for Case study2.
Additional file 4.
Folder (CS_3) containing both the input files and the results (i.e., the output files from the current toolkit) obtained from the RF model by applying Method1 for Case study3.
Additional file 5.
Folder (CS_4) containing the input file of SFSLDA and GB models and the results (i.e., the output files from the current toolkit) obtained from the SFSLDA for Case study4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Halder, A.K., Dias Soeiro Cordeiro, M.N. QSARCoX: an open source toolkit for multitarget QSAR modelling. J Cheminform 13, 29 (2021). https://doi.org/10.1186/s13321-021-00508-0
Keywords
 QSAR
 Multitarget models
 Software tools
 Feature selection
 Machine learning