SAR and QSAR modeling of a large collection of LD50 rat acute oral toxicity data

The median lethal dose for rodent oral acute toxicity (LD50) is a standard piece of information required to categorize chemicals in terms of the potential hazard posed to human health after acute exposure. The exclusive use of in vivo testing is limited by the time and costs required for performing experiments and by the need to sacrifice a number of animals. (Quantitative) structure–activity relationships [(Q)SAR] proved a valid alternative to reduce and assist in vivo assays for assessing acute toxicological hazard. In the framework of a new international collaborative project, the NTP Interagency Center for the Evaluation of Alternative Toxicological Methods and the U.S. Environmental Protection Agency’s National Center for Computational Toxicology compiled a large database of rat acute oral LD50 data, with the aim of supporting the development of new computational models for predicting five regulatory relevant acute toxicity endpoints. In this article, a series of regression and classification computational models were developed by employing different statistical and knowledge-based methodologies. External validation was performed to demonstrate the real-life predictability of models. Integrated modeling was then applied to improve performance of single models. Statistical results confirmed the relevance of developed models in regulatory frameworks, and confirmed the effectiveness of integrated modeling. The best integrated strategies reached RMSEs lower than 0.50 and the best classification models reached balanced accuracies over 0.70 for multi-class and over 0.80 for binary endpoints. Computed predictions will be hosted on the EPA’s Chemistry Dashboard and made freely available to the scientific community.


Introduction
Over the past 25 years, synthetic organic chemical production world-wide has increased dramatically, from about 50 million tons to approximately 150 million tons [1]. This ever-growing increase of chemical substances represents a primary issue for the environment and human safety. Toxicological tests need to be performed to evaluate which of these chemicals are safe and which can potentially contaminate the environment and cause toxicity.
In the first stages of toxicological testing programs, acute toxicity studies are frequently used to categorize the agent in terms of the potential hazard posed to human health. Acute toxicity describes the adverse toxicological effects of a chemical that occur either from a single exposure or from multiple exposures in a short period of time (usually less than 24 h) [2].
The median lethal dose (LD50) is the basis for the toxicological classification of chemicals for various regulations concerning chemical hazard [3,4]. The acute LD50 is the lethal dose of a substance that will kill 50% of the test animals/organisms within 24 h of exposure to the test substance [5][6][7]. Acute toxicity studies are conducted following various routes of exposure (e.g. oral, dermal and inhalation), and rodents are the most common animal model employed to estimate the lethal dose [8]. The estimation of rodent acute toxicity provides a baseline value when detailed toxicity data are unavailable for the chemical(s) of interest. In this case, LD50 values may be employed to make a first assessment of relative toxicity among chemicals [6].
However, the exclusive use of in vivo testing has obvious limitations, related to the high monetary and time cost of performing such experiments, the need to sacrifice a number of animals, and the number of chemicals requiring assessment. Indeed, it has been reported that toxicological and other safety evaluations represent 8% of the total number of animals used for experimental purposes in Europe, with rodents being the most commonly used specie [9,10].
In light of this, recent laws are pushing the acceptance of alternative methods (e.g., in vitro and in silico methods) and their use by the regulatory and public health bodies in order to reduce the use of animals [11,12]. Computational toxicology is a viable approach to reduce both the cost and the number of animals used for experimental toxicity assessment [13].
Structure-activity relationship and quantitative structure-activity relationship [(Q)SAR] models are in silico approaches to determine the toxicity of a large number of chemicals by analyzing their chemical structure. These methods are increasingly used to fill the toxicological data gaps for high-production volume chemicals (e.g., pharmaceutical, agrochemical, food additives and industrial) [3,6,14,15] because they require a relatively small amount of resources and time.
Despite this, the reported number of (Q)SAR studies on mammalian toxicity is limited [3, [9], with the majority being restricted to particular classes of chemicals and based on small, focused datasets [16,17,18].
Recently, the NTP Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) in collaboration with the U.S. Environmental Protection Agency (EPA) National Center for Computational Toxicology (NCCT) compiled a large list of rat acute oral LD50 data on ~ 12 k chemicals. These data have been made available to the scientific community, to serve as the basis for an international collaborative modeling initiative. The modeling initiative was launched by the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) Acute Toxicity Workgroup with the aim to develop new computational models for predicting five specific acute oral systemic toxicity endpoints required for regulatory purposes [19].
The present article encompasses all the efforts that our research group contributed to this initiative. Both regression and classification computational models were developed for the five endpoints, by employing several statistical (i.e. QSAR) and knowledge-based (i.e. SAR) methods. External validation performance was provided to demonstrate the predictive capacity of the models. In the end, integrated modeling strategies were proposed to improve performance of single models. A multi-objective optimization based on the concept of Pareto optimum was proposed for identifying the best solution (i.e., individual or integrated model) for each modeled endpoint [21,22].

Endpoints
The following endpoints were considered for modeling:  [24], the National Library of Medicine's (NLM) Hazardous Substance Data Bank (HSDB) [25], the OECD's eChemPortal (https ://www.echem porta l.org/echem porta l/index .actio n), NICEATM's Pesticide Active Ingredients (PAI) (https ://ntp.niehs .nih.gov/pubhe alth/evala tm/ index .html) and NLM's ChemIDPlus (https ://chem.nlm. nih.gov/chemi dplus /). Over the 75% of chemical structures were retrieved from the EPA's DSSTox database [26], while the remaining were crosschecked from literature publications. Details on the preparation of these data and the related variability is discussed in a separate paper by Kleinstreuer et al. (in preparation). In this work, the list of 8994 chemicals proposed by NICEATM and NCCT for modeling was processed to create a training set (TS) for the development of models here presented, specific for each endpoint modeled. Project information reports the presence of 158 duplicate (Q)SAR-ready structures, mostly due to the presence of different counterions associated to the main molecule (https ://ntp.niehs .nih.gov/ iccva m/metho ds/acute tox/model /qna.pdf ). Because (Q) SAR approaches applied in this work do not deal with counterions, duplicated entries were aggregated. A single experimental value was assigned to each unique chemical structure (see Table 1 for details on the composition of the TSs). Details on the procedure applied to obtain the final version of datasets is included in Additional file 1.
The ES was initially imbedded into a larger prediction set by NICEATM and NCCT to facilitate a blind evaluation of all the models that were developed by the various institutions during the initiative. It was subsequently released with all the information relative to the five endpoints (https ://ntp.niehs .nih.gov/iccva m/metho ds/acute tox/model /valid ation set.txt). While the TS used here were a result of a reworking of data provided for modeling, the ES was used as released for validating models presented here. Deduplication was performed by organizers of the project based only on CAS registration numbers. Consequently, a small degree of superimposition was observed between the TS and the ES due to the presence of different CAS numbers pointing to the same chemical structure. This overlap in chemistry is limited (i.e., about 8% of ES chemicals were also included in the TS) and it does not undermine statistical relevance of validation results. Table 1 summarizes, for each endpoint, the number of chemicals included in each toxicity category for TS and ES. In all cases, TS and ES showed a nearly analogous distribution of chemicals among the various classes.
The entire TS and ES are included in Additional file 2: Table S1 and Additional file 2: Table S2. The TS was analyzed by means of a principal component analysis based on CDK descriptors [27] implemented in KNIME analytical platform [31]. Eight chemicals identified as structural outliers based on score values on the first two principal components were removed from the TS. Table 2 lists all the statistical methods applied for the development of models submitted to NICEATM for predicting rat oral acute toxicity. For each method, modelled endpoints were also specified. Below, technical details of each method are described.

Balanced random forest (BRF)/random forest in regression (rRF)
Data split For each compound, structural fingerprints were calculated using the Indigo toolkit implemented in KNIME (http://lifes cienc e.opens ource .epam.com/indig

Table 1 Number of chemicals included in the TS and the ES for each toxicity class
Hazard categories for the two multi-class modeled endpoints (EPA and GHS classes) are sorted for decreasing toxicity, from 1 to 5. For the vT classification, class 1 corresponds to the positive (i.e., very toxic) compounds, while for the nT endpoint class 1 corresponds to the negative (i.e., toxic) compounds o/). K-means clustering was applied, taking into account the structural information (codified by the fingerprints) and the experimental values of each chemical. TS was split into an internal training set (iTS) (80%) and an internal validation set (iVS) (20%) based on a stratified sampling of the obtained clusters, to ensure structural and activity analogy between the two datasets. The iTS was used for the development of QSAR models, while the iVS was used for the tuning of model and applicability domain (AD) parameters. The number of chemicals included in iTS and iVS for each modeling endpoint is reported ("Results" section). Algorithms Random Forest in regression (rRF) [28,29] was applied for the derivation of LD50 single point estimate prediction models. For the categorical ones (i.e., nT, vT, GHS and EPA), balanced random forest (BRF) was used. This technique is a combination of under-sampling and the ensemble idea that artificially alters the class distribution so that classes are represented equally in each tree [30]. This allowed handling of unbalanced distributions of chemicals among classes for some of the categorical endpoints. The number of trees for each model was varied among 50, 100 and 150, then the best solution was selected based on performance on the iVS. All algorithms were implemented in the KNIME platform [31].
Descriptors Molecular descriptors were calculated for each compound using Dragon software [32]. Descriptors for iTS compounds were pruned by constant and semi-constant values (i.e., standard deviation < 0.01), then those having at least one missing value were removed. In case of highly correlated pairs of descriptors (i.e., absolute pair correlation higher than 95%), only one was retained and the descriptor showing the highest pair correlation with all the other descriptors was removed. Descriptors were normalized in the range of 0-1, then the same normalization scheme was applied to iVS descriptors.
Applicability domain Three approaches were applied for the definition of AD.
• Similarity A matrix containing pairwise Manhattan distances (based on Dragon descriptors used in the model) was calculated for iTS compounds. Chemicals were sorted based on the mean distance with respect to their first k neighbors and then the value corresponding to a given percentile of the distribution of distances was used as a threshold (T D ). Chemicals with mean distances above T D were excluded from the AD. The same procedure was repeated for iVS chemicals with respect to their neighbors in the iTS, for identifying compounds outside of AD. Values assigned to k were 1 and 5; values assigned to T D were those corresponding to the 100th, the 97.5th, the 95th and the 90th percentiles of the iTS distance distribution. This method was applied on both continuous (LD50) and classification models. • Error model An "error model" predicts the uncertainty of the predictions coming from a classical "activity model". An error model was derived from the same iTS of the associated activity model, with the difference that the cross-validated absolute errors (previously generated by the activity model) represent the dependent variables while independent variables are represented by a series of AD metrics that reflect the accuracy of the predictions made by the activity model [33]. The RF algorithm was used for the error model derivation. iTS chemicals were sorted based on errors in prediction estimated by the error model, then the value corresponding to a given percentile of the distribution of predicted errors was used as a threshold (T E ). Chemicals exceeding T E were excluded from the AD. The same T E was applied on predicted errors calculated for iVS chemicals. For the present work, values assigned to T E corresponded to the 100th, the 90th, the 75th and the 65th percentile of the iTS errors distribution. This method was applied only on the continuous LD50 point estimate models. • Confidence The percentage of trees within the RF yielding the same prediction (i.e., confidence) was estimated. This method was applied only for classification models. For binary classification models (i.e., vT and nT) a confidence threshold (T C ) was gradually incremented by 0.05, from a minimum of 0.60 to a maximum of 0.75. For multi-class models (i.e., EPA and GHS), the confidence threshold was incremented by 0.10 from a minimum of 0.30 to a maximum of 0.70. Chemicals having confidence lower than this threshold were considered outside of the model AD.
For each model, the various parameters were evaluated in each possible combination, i.e. T D , k, T E for continuous models, T D , k, T C for classification models. The best combination of parameters was selected based on the best trade-off in terms of coverage and performance on the iVS (see "Results" section).
aiQSAR Data split The entire TS was used to derive aiQSAR models.
Algorithm aiQSAR methodology described by Vukovic et al. [34] was applied for development of regression and classification models. This method is based on the-runtime derivation of local models specific to each compound (in this section referred to as the target compound). Each model is derived from a small local group of structurally similar compounds included in the TS. Figure 1 summarizes the entire aiQSAR workflow.
Similar compounds were selected on the basis of Tanimoto distances computed from the comparison of "PubChem" (ftp://ftp.ncbi.nlm.nih.gov/pubch em/speci ficat ions/pubch em_finge rprin ts.txt) and "Extended" fingerprints [35] between the target compound and all the TS compounds, as implemented in the "rcdk" R package [36]. A minimum of 20 and a maximum of 50 compounds were selected for the development of each model. First all compounds that are above the threshold value are selected (PubChem" similarity ≥ 0.80 and "Extended" similarity ≥ 0.70) Then, in case where the required number is not met, average ranks of compounds are considered to either add additional compounds, or to discard some of the selected ones.
Several mathematical models implemented in "caret" R package [37] were built from the local group of neighbors and each model was used to predict the value of the target compound. The type and the number of models varied based on the endpoint (i.e., regression, binary classification, multi-class classification) and is listed in Fig. 1. Further details on the methods used are reported in Vukovic et al. [34].
Finally, predictions from all methods were combined into a single output value for each target compound. For regression endpoints (i.e., LD50 point estimates), the final prediction was the average of predictions from individual methods, after discarding those more than 10% out of the range of experimental values of the TS. For classification endpoints (i.e., vT, nT, EPA and GHS), a majority vote approach was applied. In case of any tied votes, the class that is least represented in TS (out of the tied classes) was selected. In the present work, this always corresponded to the most toxic class in the tie being selected.
Descriptors Molecular descriptors were calculated using Dragon 7 software [32]. All available 1D and 2D descriptors (3839 overall) were considered for modeling purposes. Dragon descriptors with a missing value in any compound from the local group or in the target compound were iteratively discarded before each local model generation, as well as descriptors that were constant and near-constant within the local group of neighbors.
Applicability domain Applicability domain measure (ADM) of the target compound was computed based on average values of "PubChem" and "Extended" fingerprint similarities between the target compound and its local group of neighbors. The more similar is the local group of neighbors, the higher is the ADM score, that ranges between 1 (out of AD) and 5 (in AD). For the present work, chemicals with ADM ≥ 2 (i.e., "PubChem" similarity ≥ 0.60 and "Extended" similarity ≥ 0.30) were considered within the model's AD.
istkNN Data split The same data split described in "BRF/ rRF" paragraph ("Data split") was used for selecting model's optimal parameters (see below). Once optimal parameters were selected, the global TS (iTS + iVS) was used to derive a new model that was validated on the ES.
Algorithm istkNN is a commercial tool [38] implementing a modified k-Nearest Neighbors (kNN) algorithm. kNN estimates the outcome of a sample in a dataset on the basis of read-across accounting for the k most similar samples (neighbors) in the TS for which the outcomes are known [39,40]. If the algorithm is applied for predicting continuous endpoints (e.g., LD50 point estimates) the mean of the activities of neighbors is calculated [41].
Similarity Similarity between chemicals is described by an integrated similarity index (SI) ranging from 1 (maximum similarity) to 0 (minimum similarity), resulting from a weighted combination of a binary fingerprint array and three non-binary structural keys, as follow: where CD are structural keys with 35 constitutional descriptors (MW, nr of skeleton atoms, etc.), HE are structural keys with 11 hetero-atoms descriptors and FG are structural keys with 154 functional groups (specific chemical moieties) as implemented in Dragon (v. 7.0.8, Kode srl, 2017) (Talete srl, Milano, Italy). FP are the extended fingerprints, which comprise Daylight notation (http://www.dayli ght.com/dayht ml/doc/theor y/theor y.finge r.html) and additional bits accounting for ring features. FP similarity is calculated with Maxwell-Pilliner index while CD, HE, and FG similarities are calculated with Bray-Curtis index [42].
Applicability domain istKNN refines the classical kNN algorithm by setting additional conditions that a sample (i.e. chemical) should fulfill to be considered reliably predicted. The k nearest neighbors used for prediction should have a similarity value with the target greater than a given threshold (T sim1 ), otherwise they are not used for prediction. If no neighbors match the threshold, the model does not provide a prediction for the target compound. If only one neighbor matches the threshold, the similarity should be higher than a second stricter threshold (T sim2 ) to return a prediction (which is equal in this case to the experimental values of this selected neighbor). If two or more neighbors fulfill the T sim1 , the range of experimental values of retained neighbors is considered. If the difference between the maximum and minimum experimental values of neighbors is lower than a threshold (T min-max ), the target is predicted, otherwise the model does not return predictions. To calculate the prediction when more than one neighbor is selected, the experimental values of the similar compounds can be weighted differently on the basis of their similarity with the target (by setting an enhancement factor that increases the weight to the most similar compounds in the prediction).
In the present work, a batch process was used to optimize the settings of the five customizable parameters according to the following criteria: (a) number of neighbors from 2 to 5; (b) T sim1 from 0.70 to 0.90 (step = 0.50); T sim2 = 0.85 or 0.90; (c) enhancement factor from 1 to 3; (d) T min-max from 1.0 to 2.0 (step = 0.50).
The iTS was used for the development of (Q)SAR models using all the possible combinations of the parameters above. A restricted pool of valid models was pre-selected based on leave-one-out cross-validation on the iTS, seeking a good compromise in terms of coverage and performance. The final model was the one among the pool with the best performance in external validation on the iVS with a coverage of at least 0.85 (arbitrary threshold). Finally, the selected parameter settings were used to derive a new model on the global TS (iTS + iVS) and to use it for the prediction of the ES.

SARpy
Data split The entire TS was used to derive SARpy models.
Algorithm The freely available SARpy software (https :// www.vegah ub.eu/portf olio-item/sarpy /) was used to build classification models for nT and vT classifications.
Given a TS of molecular structures described by SMILES, SARpy applies a recursive algorithm considering every combination of bond breakages working directly on the SMILES string. The software generates every structural fragment included in the TS that is then encoded into SMARTS (www.dayli ght.com/dayht ml/ doc/theor y/).
Each substructure is validated as a potential structural alert (SA) by verifying the existing correlation between the incidence of a particular molecular fragment and the class of activity of the molecules containing it. In this way, a reduced ruleset of relevant SAs is defined. Each SA was associated with an activity label (e.g., positive or negative) and a likelihood ratio, estimating the statistical relevance of the SA [43,44].
Applicability domain If a chemical does not contain any fragment present in the rule set, it is not predicted and is considered outside of the model AD.

Random forest with hyper parameter tuning (HPT-RF)
Data split The whole TS was used to derive HPT-RF models.
Algorithm "caret" [37] and "ranger" [45] R packages were used to develop regression (LD50) and multi-category classification (EPA and GHS) RF models [28]. Hyperparameter tuning (HPT) research was performed during the RF derivation, in order to select an "optimal" model across various parameters. Selection was made by evaluating the effect of model tuning on performance (i.e., accuracy) in internal validation (i.e., bootstrap). Three parameters were tuned by grid search: (1) mtry (number of randomly selected descriptors used in each tree of the RF), (2) splitrule (the rule used to choose descriptors for a single tree, i.e. "gini" or "extratrees" for classification; "variance" or "extratrees" for regression), (3) min.node. size (minimal node size of trees). The number of trees was equal to 500. The reader is referred to the user's guides of the above-indicated packages for further details.
A first model run served to evaluate the presence of response outliers within the TS. The isolation forest [46] method for anomalies isolation as implemented in the "isofor" R package was used to identify outliers. In addition, chemicals were flagged as outliers if they were characterized by a high variance and high error among iterations of bootstrap internal validation (100 iterations). In a second run, outliers were excluded from model's derivation.
Descriptors Calculation, pruning and normalization of descriptors were the same as in "rRF/BRF" paragraph ("Descriptors").
Applicability domain Two approaches were applied for the definition of models AD: • Similarity PubChem Fingerprints (ftp://ftp.ncbi.nlm. nih.gov/pubch em/speci ficat ions/pubch em_finge rprin ts.txt) were calculated for each compound starting from SMILES using the "rcdk" R package. Fingerprint-based similarity (Tanimoto) between a target compound and all the compounds in the TS was computed. If the mean similarity with the three most similar compounds among those flagged as outliers was higher than the similarity with three most similar compound from the TS cleaned from outliers, the compound was considered out of AD. • "Dummy" matrix Descriptors of TS chemicals were randomly permuted (vertical permutation) to create a mirror TS. The shuffled TS was merged with the original one. Samples of the original TS were flagged as "real", while those of the mirror TS were flagged as "dummy". A RF classification model was built to distinguish real from dummy samples. External chemicals classified as "dummy" were considered outside of the model's AD [47,48].

Generalized linear model (GLM)
Data split The same data split described in "rRF/ BRF" paragraph ("Data split") was used. The original iTS for the nT classification was further split by 20% of the iTS as an internal calibration set (iCS, 1330 chemicals) in order to evaluate the accuracy of the model during the building process. The iVS was used for validating the final model. Algorithm The model was built using the H 2 O 3.16.0.3 (https ://www.h2o.ai/downl oad/) package for GLM in R v. 3.4.0 (https ://www.R-proje ct.org).
Logistic regression (LR) is used for binary classification problems when the response is a categorical variable with two levels. In this case, only the nT endpoint was modelled. This approach was not applied to the vT endpoint due to a strong bias arising from few chemicals having a vT designation, and the inadequacy of the method in modeling highly unbalanced datasets. LR-GLM model was fitted by finding a set of parameters that maximizes the probability of an observation belonging to its experimental category. Parameter tuning was performed on the iCS. Penalties were also introduced to the model building process to avoid over-fitting, reducing variance of the prediction error and handling correlated predictors. Penalties (elastic method) were controlled by parameters α and λ that were set to 0.09 and 0.02, respectively. A more in-depth description of the algorithm is reported by [49].
Applicability domain AD was not implemented for this model.

Integrated modeling
Integrated modeling was applied to all five endpoints. The approach was based on the intuitive assumption that taking into account predictions from multiple models could compensate for the limitations of single models. Different approaches were used for regression and classification endpoints: • Integrated model for predicting LD50 point estimates was obtained by averaging predictions of the four individual models (i.e., rRF, aiQSAR, istkNN, HPT-RF).
Only predictions within the AD of each model were considered for integration. The integrated model's AD was implemented by applying the concept of "integrated prediction fraction" (PF) described by [50]. In particular, the number of models returning a prediction in AD and contributing to the final integrated prediction for a given sample was considered. The higher was the number of used models, the more reliable the integrated prediction. A threshold was set based on the PF and then integrated predictions with a PF lower than the threshold were considered out of the integrated AD. • Integrated models for categorical endpoints (vT, nT, EPA, GHS) were obtained by using a majority voting approach. Similarly to above, a consensus score (CS) was used as an index of prediction reliability. In this case, CS was calculated as the number of single models returning the same prediction as the integrated one, minus the number of models returning a prediction different from the integrated one. Only predictions within the AD of each model were considered for integration. For example, three out of four nT models returned predictions for a given sample, two of them being positive and one negative. In this case, the integrated prediction for the sample was "positive" with a CS of 1. No predictions were returned in case of ties.
For all the endpoints, the variation of integrated performance with respect to the PF/CS was evaluated. Table 3 reports final settings for each model developed with the rRF and BRF method.

Model parameterization
For the istkNN approach, the selected model was characterized by the use of 3 neighbors maximum, T sim1 = 0.80, T sim2 = 0.85, enhancement factor = 3 and T min-max = 2.0. For the SARpy model, 349 fragments were identified for the vT endpoint (64 for the vT class, 285 for the not vT class), while 446 fragments were identified for the nT endpoint (228 for the nT class, 218 for the class of toxic compounds). The list of fragments encoded as SMARTS are reported in Additional file 3: Table S3a (vT model) and Table S3b (nT model). Table 4 indicates the values for the three parameters mtry, splitrule and min_node_size of the HPT-RF model.

Statistical performance
External validation performance represents the real proof of the predictive power of (Q)SAR models. In this work, results on the common ES allowed a fair comparison of the models' predictivity. Table 5 shows the performance of the LD50 continuous models on the ES, while Table 6 shows the performance of classification models.
Internal validation is not discussed in the present article. The interested reader is referred to Additional file 1: Tables S4 and S5 for internal performance of each model.

Table 3 rRF and BRF settings
For each model, the size of the internal training set (iTS) and internal validation set (iVS), the number of descriptors (#descrs), the number of trees (#trees) and the tuned parameters for AD definition are indicated  Predictions of all models on the TS and ES are included in Additional file 2: Tables S1 and S2.
All continuous logLD50 (mmol/kg) predictive models showed good predictivity on ES. In particular, aiQSAR, istkNN and HPT-RF models showed a similar behavior with an R 2 higher than 0.60 and RMSE near 0.54. On this basis, the istkNN method was particularly appealing for its intuitiveness and simplicity, compared with other more computationally-demanding methods. The rRF model was slightly less predictive, but with a higher percentage of predictions in AD (%AD) (i.e., 0.900) compared to other models ( Table 5).
As far as classification models, Matthew correlation coefficient (MCC) [51] and balanced accuracy (BA) metrics were used for an overall estimation of classification performance for their ability to deal with unbalanced classes. Cooper statistics [52] were also calculated. For multi-category classifiers (EPA and GHS), the generalized formula of MCC reported by Ballabio et al. [53] was used, while BA was obtained as arithmetic mean of averaged sensitivities and specificities calculated for each separate category.
With respect to binary classification models, all the models showed good predictivity on the ES, with BAs close or higher than 0.80 and MCCs often higher than 0.50 (i.e., BRF and aiQSAR). In both cases, the best techniques was BRF, that returned BAs in external validation of 0.839 (for nT) and 0.880 (for vT), and MCCs of 0.674 (for nT) and 0.585 (for vT), despite a slightly lower %AD (i.e. 0.728) with respect to other methods (Table 6). This confirmed our previous experience [54] on the suitability of this technique to effectively handle highly unbalanced datasets.
As far as multi-categorical endpoints, EPA predictive models showed close performance across algorithms, with MCCs higher than 0.40 and BAs close to 0.73 in all cases. The HPT-RF method showed high performance, but at the cost of a slight loss in coverage with respect of other methods. aiQSAR was close in terms of performance, but with a gain in %AD (i.e., 0.891 compared to 0.763) ( Table 6).
For GHS models, aiQSAR and HPT-RF performed better than BRF. However, coverage for GHS models was disappointing, being close to or lower than 0.50 (Table 6). This is reasonable due to the challenging nature of the GHS endpoint, characterized by a high number of (unbalanced) categories (n = 5) ( Table 1).

Integrated models evaluation
Integrated modeling was applied in order to improve predictions of single models. Table 7 shows the performance on ES of integrated models obtained by averaging single LD50 predictions, while Fig. 2 compared the experimental logLD50 (mmol/kg bw ) values with the predictions returned by the integrated model.   Table 8 shows performance on ES of integrated classification models. As expected, the increase of the PF/CS threshold to exclude predictions from the AD always paired with an increased performance, although at the cost of a reduced percentage of predicted compounds. This was especially true for classification when a complete agreement among integrated predictions was required (i.e., maximum CS value) and the number of categories was higher (e.g., GHS integrated models, having %AD = 0.214 for CS = 3). Internal performance on the TS of integrated models is not discussed in the manuscript and is reported in Additional file 1: Tables S6, S7).
The concept of "Pareto optimum" was applied to retrospectively confirm the improvement of consensus strategy with respect of single models, keeping into account both statistical performance and %AD. The approach allows solving of multi-objective optimization problems, in which no single solution exists that simultaneously optimizes multiple quality criteria. A solution is called "nondominated" (i.e., Pareto optimal) if no other solutions exist for which an evaluation function can be improved without degrading one of the other functions [21,22].
Quality functions used in this case were performance in external prediction (i.e., RMSE for regression, BA for classification) and %AD on the ES. Figure 3 graphically shows the set of optimal solutions for each endpoint representing the optimal trade-offs between the two functions. Such solutions constitute a Pareto front [21].
As shown in Fig. 3, integrated models always represented the optimal Pareto solutions for each endpoint, confirming the effectiveness of the integrated approach. Conversely, it was never possible to designate the best Table 7 External performance of the continuous integrated model for predicting single point logLD50 (mmol/kg) The R 2 , the mean absolute error (MAE), the root-mean squared error (RMSE), the number (#AD) and the percentage (%AD) of predictions in AD are reported, with respect to the PF threshold for defining predictions in AD  integrated model among those obtained varying the PF/ CS value. Indeed, the increase of the threshold always resulted an increased performance, but at the cost of a systematic reduction of %AD. For LD50 point estimate models, the integrated model with PF = 0.75 returned a coverage (i.e., 90%) analogous to best individual models, but with a relevant gain in performance (i.e., RMSE = 0.512 with respect of a mean RMSE = 0.542 for the best three individual models) (Tables 5 and 7).
For vT and nT endpoints, the BRF models were the closest to the Pareto front, however both solutions are dominated by the related integrated model with CS = 2.
As for multi-categorical endpoints, the aiQSARs were both close to the Pareto front and in particular to the integrated model with CS = 1. In this case, aiQSAR and integrated models have comparable performance, with a BA close to 0.730 for both endpoints. In both cases, integrated models showed a gain in %AD with respect to aiQSAR models, i.e. 3.7% additional predicted chemicals for the EPA model and 3.0% for GHS.

Discussion
This work describes the modeling efforts our research group contributed in the development of new (Q)SAR models for predicting five endpoints (one continuous, four classification) related to acute oral toxicity in rats, as a result of our participation in the collaborative project launched by NICEATM and NCCT [19]. This endpoint is of utmost importance to several regulatory frameworks, being currently the basis for the toxicological classification of chemicals [4].
To date, the recent literature reports only a small number of successful attempts in modeling oral rat acute toxicity [3,9]. The small number of models available can be explained by the nature of the endpoint and the lack of curated datasets prior to this effort. Indeed, compared with other widely modeled endpoints (e.g., (eco)toxicity) the modeling of mammalian toxicity is challenging, representing the sum of a plethora of toxicological mechanisms, each involving different biological pathways and molecular events concurring to the final effect [3,55].
Individual (Q)SAR methods often showed inherent limitations in developing single models able to handle different mechanism of action, even more in case of a lack of complete understanding of some of the biological mechanisms contributing to the overall effect (e.g., death) [6,41]. This issue can, however, be compensated with large enough datasets that adequately represent the diversity of the chemical universe.
An additional obstacle is often represented by the lack of reliable toxicity data in terms of quality, source of experimental data and organisms used [56]. The involvement of absorption, distribution, bioaccumulation, metabolism and excretion aspects further contributes to making the whole scenario even more complex and challenging [3,56].
As a consequence, (Q)SAR models for this endpoint so far have been largely limited to small datasets restricted to a well-defined class of chemicals while global models are few and, often, not satisfactory in terms of predictive power [3,9].
Despite this, the demand of in silico tools for such complex in vivo endpoints and (Q)SAR models continues to grow, due to objective resource and ethical limitations related to the execution of in vivo tests for a high number of chemicals. In this regard, the need of developing global models addressing a common endpoint such as acute toxicity have arisen as a primary need [3]. Within the project of NICEATM and EPA's NCCT Acute Toxicity Modeling Consortium, the availability of a large and high-quality rat oral acute toxicity database and the involvement of the entire scientific community represented an unprecedented chance to improve the current state of the art on the in silico modeling of this endpoint. To the best of our knowledge, this is the largest curated dataset of heterogeneous chemicals made available so far for the development of new (Q)SARs.
Integrated modeling was also applied in order to improve predictions of single models. As demonstrated previously and in the current work, integrated models had the highest external prediction power compared to any individual model used in the integrated prediction [57,58,59]. This is particularly true when the individual models have been developed using different techniques, have ADs differently defined and showed different behaviors based on the structural and activity profile of predicted chemicals. For example, an inspection of correct predictive rates that classification models have on specific categories of toxicity showed that some methods performed better on certain classes (e.g. more toxic or less toxic compounds) than others (for detailed statistics see Additional file 1: Tables S8-S11).
In this regard, the integrated method can compensate for and correct the limitations of individual techniques, as well as afford greater chemical space coverage [57,59].
Given the improvement of statistical performance, the use of PF/CS as a way to define AD of integrated models proved to be particularly suitable for identifying predictions likely to be wrong. Another advantage of PF/ CS metrics is that these values can be applied to assign a degree of reliability to the prediction. Therefore, this metric can work as an indicator of the likelihood that a compound is within the AD. This meets the need of accounting for the fuzzy nature of boundaries of AD [60], instead of considering the AD assessment as a yes/ no issue (e.g., a compound with an intermediate PF value can be considered questionable instead of out of AD). This is confirmed by an inspection of integrated LD50 models' performance limited to a given PF value. As shown in Table 9, statistical performance on the ES recalculated with respect to a given PF value clearly showed that samples characterized by lower PFs also showed an overall higher error. Looking at external predictions (i.e., ES), samples with PF = 1.00 have the lower RMSE. On the other hand, the value dramatically increases considering samples with lower PFs, to a maximum value of 0.865 in case of PF = 0.25. The effectiveness of the PF/CS approach is further reinforced by looking at the percentage of samples with a high prediction error (i.e., equal or higher than 1.00 log unit) for each PF value, that is about 20% of the total number of samples with PF = 0.25 and only 4.3% for PF = 1.00 (Table 9).
Machine learning methods we used (e.g., RFs) proved to be valid in terms of predictive performance, but a mechanistic interpretation of these models is often more difficult than classical linear models. Indeed they can be based on thousands of different molecular descriptors and the relationship existing between the endpoint and each descriptor is often a complex, non-linear one that is only implicitly included in the model itself. With this in mind, we provided an analysis of most relevant features used in the descriptor-based global models here presented (rRF/BRF, HPT-RF, GLM). Models generated with aiQSAR and istKNN were not considered in this analysis as they are local models not capable of identifying features associated with the global trend of acute toxicity. In addition, istKNN is based on the similarity concept only and not on descriptors.
The top twenty Dragon descriptors for each of the above-cited models were listed in Additional file 1: Table S12. Details on how the importance of descriptors was determined for each model were included in the same Table. Descriptors belonging to the 2D Atom Pairs category (binary, frequency or weighted topological atom pairs) were the most frequent (79 out of 180 descriptors). They were followed by 2D autocorrelation (16), CATS2D (14), functional group counts (11) and P_VSAlike descriptors (11). P_VSA_s_1 (P_VSA-like on I-state, bin 1) was also the single most frequent descriptor (it was included in four out of nine models considered). P_VSA descriptors are related to van der Waals surface area of Table 9 External performance of the continuous integrated model for separate PFs The R2, the mean absolute error (MAE), the root-mean squared error (RMSE), the number (#) and the percentage (%) of predictions with a given prediction fraction (PF) are reported. In addition, the percentage of samples with a given PF value and an absolute error in prediction equal to or greater than 1.00 log unit with respect of the total number of samples with the same PF (%AE ≥ 1) are reported  (26), fluorine, nitrogen and sulfur (19) atoms were those more frequent. In particular, chemicals having a higher number of phosphorus, fluorine and sulfur atoms had lower LD50 values with respect to the mean of the distribution of values for the TS. The number of phosphorus (nP) was also one of the single most frequent descriptors, that was found in HPT-RF models for EPA and GHS and in BRF/ rRF models for single point LD50 and vT classification. Binary/frequency phosphorous-based 2D Autocorrelation descriptors (B01[O-P], B04[C-P] and F02[C-P]) also appeared in multiple models (three models each).
Functional group counts are easier to be interpreted with respect to other theoretical descriptors, because they describe the presence of well-defined structural motif. For example, high values of nOHp (number of primary alcohols) were characteristic of low-toxicity chemicals. High numbers of hydroxyls flagged for high solubility of chemicals that influence the excretion rate, the capability to cross biological membranes and accumulate in tissues to exert toxicity. The presence of primary hydroxyls is important for phase II metabolism (conjugations) that contribute to detoxification of chemicals [17]. On the contrary, high number of aliphatic tertiary alcohols (nOHt) and aliphatic tertiary amines (nRNR2) was observed in high-toxicity compounds. It is possible that high counts of tertiary groups flags for bulky, lipophilic molecules that are easily to accumulate in the organism. Conversely, beta-Lactams, sulphonic and sulphuric acids, that are more hydrophilic, are only presents in safest toxicity categories.
Fragments identified by the SARpy model for vT class with LR = inf were evaluated too. Compared to the functional groups they can include larger fragments. The chemical moieties spotted with these fragments are halogenated 2-trifluoromethyl benzimidazoles, dioxins, phosphonothioates, organothiophosphates (including organothiophosphate aliphatic amides). They refer especially to chemical classes well represented among pesticides or former pesticides active ingredients.

Conclusions
In the present study, a series of computational models were developed as part of the NICEATM and EPA's NCCT collaborative project, for the prediction of five regulatory relevant endpoints describing rat acute oral toxicity (LD50). A series of different (Q)SAR methods were applied and the obtained models were validated on a large external dataset to assess their predictivity. Briefly, no single methods proved to be the best for all the endpoints, despite some of them constantly returning highly satisfactory predictive performance. In particular, results showed that some machine learning methods (e.g. RFs) were especially effective in modeling this kind of composite endpoint. These findings support the fact that machine learning approaches have often been indicated as promising tools in the field of computational toxicology [61,62], and that they are able to handle multiple mechanisms of actions better than classical linear approaches [63,64].
A review from Gonella-Diaza et al. [15] recently proposed an evaluation of the performance of existing models predicting LD50 implemented in a series of computational platforms. It was shown that only a few models were able to deliver acceptable predictive performance. Indeed, the best models among those evaluated returned RMSE values in external validation and within AD never lower than 0.55 for regression, while accuracy values for the five-class GHS classification ranged between 0.45 and 0.56. The models presented here showed robust performance in external validation, with RMSE values close to 0.50 for integrated models and BAs exceeding 0.80. In this regard, the models presented here can be easily considered an improvement on the current state-of-the-art for in silico modeling of this endpoint.
Another important outcome of this study is that integrated methods always returned improved performance with respect to single models. This confirms, as has already been widely reported, that the integration of multiple strategies and the application of a weight-of-evidence approach solves the limitations inherent to single methods and increases the confidence in the final toxicological prediction.
Several other groups were involved in NICEATM/ NCCT collaborative project, and other (Q)SAR models were developed and validated starting from the same data used here. For the ease of example, Alberga et al. [65] developed a multi-fingerprints similarity approach for predicting the five relevant toxicology endpoints related to the acute oral systemic toxicity. The approach integrated the results coming from a similarity search based on 19 different fingerprint definitions to return a consensus prediction value. The algorithm also accounted for toxicity cliffs, i.e. large gaps in LD50 values existing between two highly similar chemicals. Ballabio et al. [55] proposed a Bayesian consensus approach integrating three different mathematical methods for modeling the nT and vT classification endpoints, i.e., N-Nearest