The kernel-weighted local polynomial regression (KwLPR) approach: an efficient, novel tool for development of QSAR/QSAAR toxicity extrapolation models

The ability of accurate predictions of biological response (biological activity/property/toxicity) of a given chemical makes the quantitative structure‐activity/property/toxicity relationship (QSAR/QSPR/QSTR) models unique among the in silico tools. In addition, experimental data of selected species can also be used as an independent variable along with other structural as well as physicochemical variables to predict the response for different species formulating quantitative activity–activity relationship (QAAR)/quantitative structure–activity–activity relationship (QSAAR) approach. Irrespective of the models' type, the developed model's quality, and reliability need to be checked through multiple classical stringent validation metrics. Among the validation metrics, error-based metrics are more significant as the basic idea of a good predictive model is to improve the predictions' quality by lowering the predicted residuals for new query compounds. Following the concept, we have checked the predictive quality of the QSAR and QSAAR models employing kernel-weighted local polynomial regression (KwLPR) approach over the traditional linear and non-linear regression-based approaches tools such as multiple linear regression (MLR) and k nearest neighbors (kNN). Five datasets which were previously modeled using linear and non-linear regression method were considered to implement the KwPLR approach, followed by comparison of their validation metrics outcomes. For all five cases, the KwLPR based models reported better results over the traditional approaches. The present study's focus is not to develop a better or improved QSAR/QSAAR model over the previous ones, but to demonstrate the advantage, prediction power, and reliability of the KwLPR algorithm and establishing it as a novel, powerful cheminformatic tool. To facilitate the use of the KwLPR algorithm for QSAR/QSPR/QSTR/QSAAR modeling, the authors provide an in-house developed KwLPR.RMD script under the open-source R programming language.


Introduction
The biological response, physicochemical properties as well as intrinsic toxicity of a chemical have a strong relationship with its structural representation. This can be mathematically modeled with a series of chemical information employing quantitative structure-activity/property/toxicity relationship (QSAR/QSPR/QSTR) and/or read-across models [1]. Over the last two decades, the QSAR models became an integral part of computer-aided drug design (CADD) and discovery [2,3], earlier prediction of adsorption, distribution, metabolism, excretion, and toxicity (ADMET) of new drug candidates [4,5], environmental risk assessment through fate and toxicity modeling of chemicals [6,7], solution of diverse complications in materials sciences [8,9], food science [10] along with agricultural science [11]. Researchers from academia and industries employ QSAR models as a popular technique for data gap filling through early prediction of activity and toxicity of various chemicals and pharmaceuticals even before their synthesis. Rigorously validated and statistically significant QSAR models may significantly help the prioritization strategy for future synthesis and analysis, leading to substantial advantages in saving resources in the form of materials, human resources, time, and money [12].
The interspecies quantitative activity-activity relationship (QAAR)/quantitative structure-activity-activity relationship (QSAAR) model offers to predict an endpoint (which is a dependent variable) for specific species employing the same endpoint (response in the form of activity, property, or toxicity) for another species along with selected structural and physicochemical features as a predictor or explanatory or independent variables (descriptors) [13,14]. The terms QAAR and QSAAR are used interchangeably by different groups of authors, but for the ease of understanding and concept of the present study, we will use the term QSAAR throughout the manuscript. The endpoint acts as a predictor variable, it can highlight the mechanism of action (MOA) of a series of chemicals to some extent as they are derived from experimental bioassay along with structural and physicochemical features. This specific feature strengthens the reliability and precision of the QSAAR model over the simple QSAR models. Furthermore, when experimental data of a series of chemicals for one species is present but absent for another species, the QSAAR model delivers a mathematical model (or equations) to predict the endpoint for that specific species [15]. Thus, extrapolating data from one species to another helps fill the data gaps without wasting time, money, and animal study maintaining the 3R's approach intended to a replacement, reduction, and refinement of animals.
The most common techniques for developing QSAR and QSAAR models are Multiple Linear Regression (MLR), Principal Component Regression (PCR), Partial Least Squares (PLS), under linear regression technique [1][2][3]. While, k nearest neighbors (kNN), artificial neutral network (ANN), support vector machine (SVM) are the most frequently practiced techniques for non-linear regression modeling. Once the model is developed, it requires to be checked through rigorous validation methods (cross-validation, test set validation, Y-randomization) using stringent validation metrics (R 2 , Q 2 LOO , R 2 pred /Q 2 F1 , Q 2 F2 , Q 2 F3 , r m 2 , CCC, mean absolute prediction error (MAE), Root Mean Square Error (RMSE)) [1]. The applicability domain strategy is useful to identify those chemicals that cannot be predicted reliably by the model. Most importantly, majority of models offer excellent performance for closely related chemicals structurewise while the prediction error can be in the higher side for the new query chemical which is located outside of the AD generated from the training set [2,3].
To reinforce the scientific and systematic validity of any QSAR/QSAAR model and promote its acceptance for regulatory purposes, drug design and discovery, toxicity prediction to humans and the environment, the correct statistical approach for developing the model is an imperative one. Over the years, multiple forms of linear, polynomial, lasso, ridge, ecologic, Bayesian, ElasticNet, etc. regression approaches have been evolved [16]. Without any doubt, the most widely used and acceptable form of regression has been linear regression. Still, drawbacks like over-fitting and sensitivity to both cross-correlations and outliers are a matter of concern. In comparison, ridge and lasso represent a more vigorous version of linear regression taking constraints on regression and being less subject to over-fitting and straightforward interpretation. In polynomial regression, the coefficient can be changed with the predictor or explanatory variable's value and estimated from data that lie within a specific window. The polynomial regression is suitable to evaluate the density of distribution and can be successfully employed when there are two or more predictor variables present in the model [17]. The polynomial models are also instrumental where the relationship between response study and predictor variables is curvilinear. Additionally, a nonlinear relationship in a narrow range of explanatory variables can also be modeled by polynomial regressions.
The use of the kernel-based nonparametric approach to regression analysis has a long tradition in econometrics. Its application in computational toxicology and chemical risk assessment has a much shorter history. Although multiple examples of traditional QSAR/ QSPR models based on nonparametric regression exist [18,19], the attempt to employ the proposed approach to interspecies QSAAR modelling has not been presented in literature before, to the best of the authors' knowledge. Hence to introspect the advantages and predictive quality of the kernel-weighted local polynomial regression (KwLPR) approach over the traditional linear and non-linear regression-based methods we have re-developed both QSAR and QSAAR models. Therefore, we have taken five different datasets of varying sizes from big, medium to small, and used diverse chemical classes of compounds for modeling purposes. Four datasets [20][21][22] were previously utilized to develop the high-quality QSAAR model using a common linear regression technique like MLR and one dataset [23] was employed to develop non-linear model implicating kNN technique. In present study, all five datasets were used to develop models utilizing the KwLPR approach. Then we have compared the statistical quality of our models with the previous models. It is important to mention that we have used the same modeled descriptors and the same combination of training and validation sets compounds as was done in the original papers. This study's idea is neither the generation of the new prediction oriented QSAR/QSAAR models nor criticizing the previous ones. The present KwLPR models demonstrate the worth of the locally weighted least squares kernel regression in interspecies extrapolation as well as in application to simple QSAR model by offering better statistical quality than previously developed models. The R-script code to prepare the KwPLR based QSAR/QSPR/QSTR/QSAAR model is also provided for better accessibility.

Materials and methods
The main idea behind the locally weighted least squares kernel regression that combines the mathematical simplicity and interpretability of the classical least squares method with the flexibility of nonlinear regression is the pointwise approximation of the unknown regression function m(x) by a polynomial of order p in a small neighborhood of x 0 [24]. Using a local Taylor series expansion in the neighborhood of x 0 , a p th degree polynomial approximation of m(x) yields [25]: This polynomial is fitted locally at each point of interest by weighted least squares (also termed as kernelweighted linear regression), that minimizes: where: X denotes the design matrix centered at x 0 ; Y represents a vector with the response variable; β is a vector of regression coefficients obtained by applying weighted least squares; K denotes a non-negative kernel function assigning weights to each point; h is a smoothing parameter controlling the size of the local neighborhood; whereas n is the number of independent variables.
Using matrix notation, one can write this as [25]: The solution vector of coefficients β is provided by weighted least squares theory and can be conveniently expressed as [25]: where: W represents a diagonal matrix of weights.
In the light of the above, the main advantage of the kernel-weighted local polynomial regression approach is that unlike the most common approaches for regression analysis, applied in QSAR/QSAAR studies (e.g. LR, MLR, PCR, PLS, etc.), where the regression coefficients are estimated using the least squares method by minimizing the sum of the squared residuals on the training data, it employs only a small batch of training data from the entire training set, that are most chemically similar to a given target point. This means that in the local polynomial regression approach, the regression coefficients are estimated with a sliding smoothing window [x 0 -h(x 0 ), x 0 + h(x 0 )] by fitting a polynomial of degree p locally at each query point (Fig. 1). Only observations within that smoothing window are used to approximate the unknown regression function m(x) by a polynomial, whereas the coefficients of this polynomial are fitted at each point of interest by weighted least squares regression [24]. It is worth emphasizing that the shape and width of the smoothing window, that is, in practice the shape and extent of the local regression neighborhood is determined by the kernel function and bandwidth described below: 1. Kernel function (K) specifies the neighborhood's shape and assigns weights to the neighboring points based on the distance to the target point. In the overall weighting scheme, the most significant weights are given to data points closest to the target point whose response is being estimated than those that are further away. So simply speaking, the weights determine how much each response value of the neighboring training data points influences the activity of a given fitting point.
2. Bandwidth also called the smoothing parameter (h) dictates the width of the kernel function (i.e., the width of the neighborhood). In practice, bandwidth and kernel function determine the number of nearest neighbors for regression.
As discussed elsewhere [26], the most relevant factors affecting the statistical properties of kernel-weighted local polynomial regression are the degree of the polynomials (p), the bandwidth (h), and the chosen kernel function (K). A brief description of the above-mentioned parameters is given below.

Degree of local polynomials
The benefits, drawbacks, limitations, and applicability of different local polynomial degrees have been extensively reviewed in the literature [24,26]. Generally, as the K x i − x 0 h flexibility in the model fitting increases (by increasing the degree of the polynomial used), the local-polynomial estimator's bias declines, but at the same time, the variance increases. Thus, high-degree polynomials will tend to overfit the data. The choice of degree of the approximating polynomial should appropriately balance the trade-off between bias and variance. It is fairly evident that in a relatively flat (non-sloping) region, a local constant (p = 0) or local linear (p = 1) estimator is preferred.
In contrast, at peaks and valleys, the most common choices are the local quadratic (p = 2) and local cubic (p = 3) estimators [27]. In practice, for spatially inhomogeneous curves, an order of polynomial approximation is adjusted to the curvature of the unknown regression function m(x) in the fixed neighborhood of x 0 by choosing that order p for which the estimated mean squared error (MSE) of the estimator is the smallest. Additionally, as Ruppert and Wand [28], and Fan and Gijbels [26] pointed out, good practice in local polynomial regression is to adopt low odd-degree local polynomials since they have a more straightforward asymptotic bias expression. However, an in-depth review of the literature shows, in practice, the constant, linear, and quadratic polynomials (p ≤ 2) are the most frequently used. An additional motivation behind using the low-order polynomials is that they appear to provide an adequate prediction Fig. 1 Locally weighted least squares kernel regression is illustrated with simulated data, where the dashed grey curve represents m(x) from which the data were generated, while the solid brown curve corresponds to the locally weighted linear regression estimate. The purple-colored points are the neighboring points to the query point whose response is estimated (x 0 ). The light purple bell-shape superimposed on the plot indicates weights assigned to the adjacent points, decreasing to zero with increasing distance from the query point when extrapolated beyond the range of the given data [29]. Although the extrapolation in QSAR/QSAAR predictions is more tenuous than the interpolation, it is a common practice. In general, the estimates based on low-order polynomial extrapolation, similarly to linear extrapolation, are capable of reasonable approximation, especially when the region of extrapolation is not too far beyond the data range. However, one should be aware that this will not hold true for high-order polynomials ( Fig. 2). High-order polynomials frequently fail and result in grossly misleading predictions. Hahn [29] and others [30,31] stressed the inadequacy of high-order polynomials for extrapolation.

Bandwidth
The bandwidth governs the complexity of the model, and therefore the choice of the smoothing parameter is of crucial importance for every kernel regression [32]. A broad bandwidth tends to over-smooth the data with a large bias (i.e., resulting in underfitting), whereas a small bandwidth on the contrary, greatly restricts neighborhood size, and produces to a high-variance estimate (i.e., resulting in overfitting) ( Fig. 3) [27].
Thus, to attain the trade-off between the goodnessof-fit and the model complexity, bandwidth should be optimized in the kernel-based regression methods. To this date, two main strategies have been devised for the bandwidth selection, namely: constant bandwidth selection and variable bandwidth selection [33]. As the name suggests, a constant bandwidth is constant across the entire range of the data (i.e., for all x in m(x)). In general, it is a reasonable choice when the unknown curve is spatially homogeneous. However, it might not be best suited to capture the unknown regression function's complexity that shows different behavior in different regions. Hence, for estimating coefficient functions with a more complicated shape of the curve, it may be desirable to use a bandwidth that varies according to the fitting point x at which m(x) is estimated. This bandwidth is referred to as local variable bandwidth and is denoted by h(x). Despite a vast number of bandwidth selection techniques, most of these methods are based on minimizing the mean squared error (MSE) or the mean integrated squared error (MISE). Among the automatic data-driven bandwidth selection procedures, the most commonly used are, e.g. direct plug-in method [34], cross-validated bandwidth method, least-squares cross-validation method [35], smoothed cross-validation method [36], and the contrast method [37]. An interesting comprehensive review of bandwidth selection techniques and their applications can be found in [27,[38][39][40].

Kernel function
As has already been pointed out, the kernel's choice determines the local neighborhood's shape over which the smoothing is performed. Different kernels merely vary in the relative weights assigned to points closer/farther in relation to a regression point. However, the particular form of the function, K has only a relatively small effect on estimation accuracy. Therefore, the differentiable kernels with low computational complexity such as the Gaussian kernel or Epanechnikov kernel are being favored. To date, several kernel functions have been proposed, the most common are plotted in Fig. 4.

Datasets
As a proof of concept, the kernel-weighted local polynomial regression approach has been applied to five diverse data sets, differing in both numbers and types of chemicals. A summary of the selected datasets that were previously used to develop ecotoxicity QSAR/QSAAR models using linear and non-linear regressions, along with the quality metrics provided by the authors of the original contributions, is given in Table 1. However, it should be highlighted that the specific focus of this study is neither the development of the new prediction oriented QSAR/ QSAAR models nor criticizing the existing ones. The cited models serve as illustrative case studies to demonstrate the usefulness of the locally weighted least squares kernel regression in simple as well as interspecies toxicity extrapolation. In order to make a straightforward comparison between initially employed approaches for QSAR/QSAAR model development and the proposed KwLPR approach, the same training and validation sets, as well as the same independent variables (as in original works), have been used (Additional files 1, 2, 3, 4, 5).

Interpretability of the KwLPR model
As the proposed KwLPR modelling method does not provide a single model equation that would allow to quantify the relative importance of individual explanatory variables on the endpoint of interest, it can be hastily perceived, therefore, as unremarkable. This inherent limitation can be easily overcome. Hence, to gain mechanistic insight into the nature of the interrelationships among the modeled activity and the related descriptors, the factor loadings derived from principal components analysis (PCA) should be evaluated. In essence, the loadings refer to the correlation coefficients between original variables and the particular principal component (PC), and thereby indicate the strength and direction of a linear association. A positive loading implies that a given descriptor correlates positively with the PC, whereas a negative loading means an inverse correlation. In order to enhance the readability and the interpretability of the KwLPR model, the PCA biplot was used. This choice was motivated by the fact that the PCA biplot simultaneously shows both the observations and the variables as well as provides information on the strength of the relationship (expressed by the vector length) and the degree of correlation among the variables (expressed by the angles between the loading vectors: an adjacent angle implies high positive correlation; a straight angle indicates high negative correlation, whereas a right angle suggests no correlation between two variables).

R-script
To facilitate the use of the kernel-weighted local polynomial regression approach for QSAR/QSAAR modeling, we additionally provide an in-house developed script as Supplementary information (Additional file 6). The script is all written in the open-source R programming language [41], and it heavily relies on the 'np' package [42]. An excellent and detailed introduction to 'np' package is provided in Hayfield and Racine publication [43]. To make the above-mentioned code as much user-friendly tool as possible and to provide its further functionality (i.e. preview plots and outputs all in a workspace) it is written in a. RMD file (i.e. R Markdown file created using RStudio as a graphical front-end to R). KwLPR.RMD script requires a single input data matrix written into a. CSV file, placed in the same working directory as the source R-code file (to see an example of an input data file please refer to the Supplementary information). The final modeling outputs are two-fold: (i) a single summary table with the most informative quality metrics organized by kernel regression estimator, degree of local polynomials, and kernel functions; and (ii) detailed output files for every single possible combination of the most influential and frequently used modeling parameters that might affect the quality of the fit. A summary table into a. CSV file is saved in the current working directory where the input file, as well as the source R-code file, are located. Whereas, detailed output files for each individual modelling scheme (including computational results, plot of model predicted and experimentally observed endpoint of interest) are saved in the automatically created nested subdirectory of the working directory.
Noteworthily, aside from comprehensive model development and its validation, initial data transformation that ensures that all variables receive equal attention during the training process is a crucial step, which cannot be overlooked. This is extremely important, since different variables mostly span several orders of magnitude and/or different ranges of units, whereas those with large numerical values can compromise the stability and statistical validity of any predictive model. To alleviate this problem, the script offers automatic data transformation also known as auto-scaling. As previously mentioned, KwLPR.RMD code intends to facilitate the kernel-weighted local polynomial regression modeling using the most commonly used bandwidth selection methods, kernel regression estimator as well as kernel functions. Its current version uses two implemented automatic data-driven bandwidth selection procedures, namely: expected Kullback-Leibler cross-validation (cv. aic), and least-squares cross-validation (cv.ls). Moreover, the current implementation of the KwLPR.RMD script permits application of two different degrees of the local polynomial smoother, namely: local-constant (Nadaraya-Watson) estimator (lc), and local-linear estimator (ll). Besides, it contains three of the most popular kernel functions, i.e.: Gaussian, Epanechnikov, and uniform. It is fairly obvious that although the use of the provided script does not require expertise and/or advanced knowledge of R, however, users who are more familiar with R language can easily customize this code further as per their requirements, by employing, for example, different bandwidth selection method and/or kernel function than is proposed herein. It should be emphasized, moreover, that the provided KwLPR.RMD script can also be successfully applied for the development of any traditional QSAR/ QSPR/QSTR/QSAAR model.

Results and discussion
To address the aim of this study and to introspect the advantages and predictive quality of the KwPLR approach over the traditional linear and non-linear regression-based methods, we have employed five toxicity datasets that were previously utilized to develop the QSAR/QSAAR models. To keep the modeling strategy as consistent as possible, the following rules have been maintained: a.  0.794, respectively). A similar trend is observed for the CCC parameter. The present model is also less error-prone than the previous one which is reflected in the lowest value of the RMSE in both training and validation sets (RMSE C = 0.60; RMSE P = 0.54 while previous one had 0.65 and 0.68, respectively). For detailed information on the experimental and predicted toxicity data for particular compounds as well as the numerical values of the molecular descriptors used within this study, please refer to the Supplementary information (Additional file 7: Table S1). The scatter plot (Fig. 5a) of experimental and predicted toxicity values illustrated that training and validation set chemicals scatter on both sides of the line of the perfect fit, and no points have deviated within ± 1 value. The Williams plot (Fig. 5b) for the applicability domain (AD) analysis suggested no validation compounds are outliers. In contrast, two training compounds showed higher leverage value compare to the critical value (h * ) of 0.035. Both compounds behave as influential observations (X outlier), although they were not response outliers (not Y outlier). We have also performed 500-fold Y-scrambling test where the plot in Fig. 5c suggested the KwPLR model was not obtained by chance and the model is extremely reliable. The Spider plot (Fig. 5d) suggested that KwPLR has superior regression-based (higher and close to 1), as well as error based metrics (lower and close to 0) values, compared to MLR implemented in the previous model. To provide an insight into the structure-activity relationship of the studied pesticides the PCA analysis was performed (Fig. 5e). It is straightforward to see that pLC50 (O. mykiss) increases moving from left to right along the X-axis, termed as first principal component (PC1). Due to low PC1 scores and positive loading values, the pesticides with lower values of pLC50 are characterized by relatively lower lipophilicity (LogP) and toxicity to D. magna (pEC50 [mM]) compared to pesticides with higher values of pLC50. An acute angle between both variables indicates moderate positive correlation (r = 0.54), whereas comparable vectors length indicates that both variables equally contribute to describe the toxicity towards O. mykiss.

Case study 2
The  Table S2). The scatter plot (Fig. 6a) of experimental and predicted toxicity values illustrated that training and validation set chemicals scatter on both sides of the line of the perfect fit, and no points have deviated within ± 1 value. The Williams plot (Fig. 6b) for the applicability domain (AD) analysis suggested that one validation compound (Tributyltin oxide) is detected as an outlier. While one training compound showed a higher leverage value compare to the critical value (h * ) of 0.035, which behaves as influential observations (X outlier) although not a response outlier (not Y outlier). We have also performed 500-fold Y-scrambling test where the plot in Fig. 6c suggested the KwPLR model was not obtained by chance and the model is extremely reliable. The Spider plot (Fig. 6d) suggested that KwPLR offered better validation outcomes over the MLR based model. Analysis of the PCA biplot graphically shown in Fig. 6e revealed that the toxicity of pesticides in L. macrochirus (pLC50) increases with increasing values of both related descriptors (i.e. LogP and pEC50 [mM] D. magna). Considering the vectors length and the angle between them it can be inferred that they are moderately positively correlated (r = 0.58) and paly an equally important role in determining toxicity to L. macrochirus.

Case study 3
The dataset 3 includes toxicity value for O. mykiss and D. magna of 50 PPCPs as LC 50 . The modeled toxicity values of O. mykiss covered the toxicity range of 6.173 logarithmic units. The original model was developed using the MLR technique with 35 compounds in the training set and 15 in the validation set [21]. The present model uses D. magna toxicity and GATS1e as modeled descriptors like the previous model. The current KwLPR based QSAAR model slightly improved the R 2 value (the present model reported a value of 0.95, and the earlier one was 0.91). A similar trend reported for external validation metrics. KwLPR based QSAAR model yielded considerably higher assessment statistics in the validation set (Q 2 F1 = 0.83; Q 2 F2 = 0.83; Q 2 F3 = 0.83), compared to the previous model (Q 2 F1 = 0.77; Q 2 F2 = 0.77; Q 2 F3 = 0.77). A sharp decline of RMSE C and RMSE P values is observed for the present model compared to the previous one, which is expected. It demonstrates the good predictive quality of the KwLPR derived model. The perfect scattering of training and validation set data points around the line's best fit in the scatter plot (Fig. 7a) illustrated the present model's predictive capability. The AD analysis suggested that all training and validation set compounds remain within the set boundaries of standardized residuals and critical leverage value, which implies that none of the PPCPs are outliers, and that their predictions are highly reliable. The Williams plot of AD analysis is illustrated in Fig. 7b. The 500-fold Y-scrambling test randomization plot (Fig. 7c) suggested that the developed KwPLR based model is highly reliable and not obtained by chance. Additionally, the spider plot (Fig. 7d) clearly showed similar internal and superior external regression-based validation metrics values for the KwLPR model compared to the previous MLR model. Considering error-based metrics, the KwLPR model also outperforms the MLR-based model. Analysis of the PCA biplot (Fig. 7e) confirmed an evident trend in the toxicity values of tested PPCPs. PC1 is positively correlated with the 2D autocorrelation descriptor, namely Geary autocorrelation of lag 1 weighted by Sanderson electronegativity (GATS1e) [21], and negatively correlated with the toxic to D. magna (pEC50). Thus, in general, low values of PC1 scores correspond to PPCPs with lower GATS1e and greater pEC50 (D. magna) resulted in higher toxicity to fish (O. mykiss). The angle between the loading vectors confirmed relatively low negative correlation (r = -− 0.43) among the variables. In turn, the similar length of vectors provided the evidence that both descriptors are equally important for the studied toxicity endpoint Supplementary information (Additional file 7: Table S3).

Case study 4
This dataset includes the quantitative toxicity value of 41 substituted phenols to Chlorella vulgaris and Tetrahymena pyriformis expressed in pT = − logIGC 50 (50% growth inhibition). The toxicity values of C. vulgaris (modeled toxicity) covered a range of 2.83 logarithmic units. The previous model was developed using simple linear regression (LR). In contrast, we have developed the present model using KwLPR with 31 training set substituted phenols, and the remaining ten compounds were considered for external validation purposes [22]. Like the earlier model, we have used only T. pyriformis acute toxicity as a modeled descriptor. The previous model reported a determination coefficient (R 2 ) of 0.75, while the present model using KwLPR increases this value to 0.81 using the same division as well as modeled descriptor. While, all external correlation validation metrics showed slightly improved value for the KwLPR based QSAAR model ( Similarly, in the case of error-based metrics, lower values for RMSE C and RMSE P were obtained for the present model compare to the previous, which is the indication of better predictive nature of the KwLPR model (0.28 and 0.27, respectively) over the LR model (0.32 and 0.28, respectively). The scatter plot (Fig. 8a) revealed a good correlation between the experimental and predicted toxicity. The resulting graph showed that most points were close and scattered around both sides to fit the best line. Even the most scattered point showed the residual error of less than one numerical value, implicating the good quality of the developed model. The Williams plot (Fig. 8b) suggested that neither training nor validation set compounds are identified as X or Y-outliers, and their predictions are highly reliable and acceptable. The 500-fold Y-scrambling test supports the robustness of the KwPLR model (Fig. 8c). The spider plot (Fig. 8d) clearly illustrated almost similar metrics values for Q 2 LOO , Q 2 F1 , Q 2 F2 , Q 2 F3 , CCC, and RMSE P while the KwLPR model outperforms the LR model in case of better results for R 2 and RMSE C metrics. In this particular case, since only one independent variable was used to develop the KwLPR model, the interrelationship among the modeled toxicity and the related descriptor can be assessed through the correlation coefficient. Thus, in general, strong positive correlation (r = 0.88) indicates that substituted phenols highly toxic to T. pyriformis are also highly toxic to C. vulgaris Supplementary information (Additional file 7: Table S4).

Case study 5
The acute toxicity (LC 50 96 h) of 908 diverse organic chemicals towards the fathead minnow (Pimephales promelas) were considered for the last case study [23]. Cassotti et al. [23] employed non-linear kNN approach to develop the predictive QSAR model and they obtained a model with k = 6 (R 2 = 0.62; Q 2 cv = 0.61; Q 2 Ext = 0.61). The studied toxicity range is: 0.053 to 9.612. The calibration and validation of the KwPLR model was carried out using the same training/test set (n T = 726/n V = 182) and explanatory variables as provided by the authors of the original work. The list of training/validation set chemicals along with the modelling parameters can be found in Supplementary information (Additional file 7: Table S5). The KwPLR model reported around 37% improvement of quality of the model fit over kNN model. Although, the model's cross-validation for the kNN model showed higher value than the KwLPR model, the better prediction was achieved for the validation set, which can be confirmed by Q 2 EXT parameter (0.61 and 0.68 for kNN and KwLPR model, respectively). The perfect scattering of training and validation set data points around the line's best fit in the scatter plot (Fig. 9a) demonstrated the present model's predictive capability. The Williams plot of AD analysis is reported in Fig. 9b. The 500-fold Y-scrambling test randomization plot (Fig. 9c) suggested that the developed KwPLR based model is highly reliable and not obtained by chance. Additionally, the spider plot ( Fig. 9d) clearly showed that except internal cross-validation (Q 2 CV ) metric, all remaining validation metrics values for the KwLPR model compared to the previous kNN model are superior in terms of accepted threshold. Interpretation of PCA biplot (Fig. 9e) revealed that out of six descriptors employed for the KwLPR model development the most influential variables (indicated by the longest arrow vectors) were Moriguchi octanol-water partitioning coefficient (MLOGP), Complementary Information Content index (CIC0) that accounts for the presence of heteroatoms, and the 2D autocorrelation descriptor that considers the ionization potential of bonded atoms (GATS1i). Two of these variables (i.e. MLOGP and CIC0) were positively correlated, whereas the last (i.e. GATS1i) was negatively correlated. In light of these it can be stated, that highly lipophilic chemicals with large CIC0 and low GATS1i are characterized by an inherently greater toxicity towards P. promelas compared to chemicals with lower MLOGP and CIC0 and larger GATS1i.
The details of the factors that influence the modeling outcomes (i.e., bandwidth method selection, bandwidth parameters, local polynomial's degree, kernel function) as well as internal and external validation metrics of all five KwLPR based QSAAR models are given in Table 2. If one compares the accessible validation metrics of the previously developed model and present ones (Tables 1 and 2), it is quite clear that for all five case studies, the quality of current models is superior.
Finally, to make the comparison between the proposed KwLPR approach and the traditional linear and non-linear regression-based techniques more meaningful, we have considered three error-based metrics RMSE C (Root Mean Square Error of Calibration), RMSE CV (Root Mean Square Error of Cross-Validation) for the training set and RMSE P (Root Mean Square Error of Prediction) for validation set along with six classical internal and external regression-based metrics (Fig. 10). Four new QSAR/ QSAAR models achieved higher (Case studies 1, 3, 4 and 5) and one model (Case study 2) achieved identical R 2 value in the case of the KwLPR model compared to linear and non-linear models. While comparing other regression-based metrics values, in all cases, KwLPR models outperformed the linear as well as non-linear models. As no previous QSAAR studies considered RMSE CV , we can't compare their results with the present models, but all the values obtained here are acceptable. Lower RMSE C and RMSE P values are required to show the KwLPR approach's superiority over the traditional linear and non-linear regression approaches. The RMSE C value of case study 2 is better for the linear model compared to KwLPR. Except for this one instance, both error-based metrics demonstrated improved outcomes for all remaining three case studies (1, 3 and 4) in the KwLPR model's case over the linear models. Similarly, the KwLPR model for case study 5 improve the quality of the model fit (R 2 ) by over 37% compare to non-linear kNN based model.
The reason behind the better statistically predictive model for all five cases is that the regression coefficients of the KwLPR are estimated with a sliding smoothing window by fitting a polynomial of degree (0 or 1) locally at each query point using specific bandwidth method selection (direct plug-in method; least squares cross-validation method or expected Kullback-Leibler cross-validation method) and kernel function (Gaussian) which help in minimizing the MSE of the individual model. Thus, for the precise prediction of external test compounds (new or query compounds), especially for biological activity and toxicity of chemicals and pharmaceuticals, the KwPLR method is a superior choice over the traditional linear and non-linear regression methods for the development of QSAR/ QSAAR models. On the other hand, even though the Gajewicz-Skretna et al. J Cheminform (2021) 13:9