 Research article
 Open
 Published:
Complementary PLS and KNN algorithms for improved 3DQSDAR consensus modeling of AhR binding
Journal of Cheminformaticsvolume 5, Article number: 47 (2013)
Abstract
Multiple validation techniques (Yscrambling, complete training/test set randomization, determination of the dependence of R^{2}_{test} on the number of randomization cycles, etc.) aimed to improve the reliability of the modeling process were utilized and their effect on the statistical parameters of the models was evaluated. A consensus partial least squares (PLS)similarity based knearest neighbors (KNN) model utilizing 3DSDAR (three dimensional spectral dataactivity relationship) fingerprint descriptors for prediction of the log(1/EC_{50}) values of a dataset of 94 aryl hydrocarbon receptor binders was developed. This consensus model was constructed from a PLS model utilizing 10 ppm x 10 ppm x 0.5 Å bins and 7 latent variables (R^{2}_{test} of 0.617), and a KNN model using 2 ppm x 2 ppm x 0.5 Å bins and 6 neighbors (R^{2}_{test} of 0.622). Compared to individual models, improvement in predictive performance of approximately 10.5% (R^{2}_{test} of 0.685) was observed. Further experiments indicated that this improvement is likely an outcome of the complementarity of the information contained in 3DSDAR matrices of different granularity. For similarly sized data sets of Aryl hydrocarbon (AhR) binders the consensus KNN and PLS models compare favorably to earlier reports. The ability of 3DQSDAR (three dimensional quantitative spectral dataactivity relationship) to provide structural interpretation was illustrated by a projection of the most frequently occurring bins on the standard coordinate space, thus allowing identification of structural features related to toxicity.
Background
During the past decade, the application of consensus modeling to various QSAR related problems has been explored [1–3]. Early QSARs often relied on single models, which under certain circumstances were prone to arbitrary overestimation of the contribution of given structural features at the expense of others that were suppressed or ignored. To mitigate such risks consensus models based on a multitude of individual models can be advantageously used. Reports of improved performance of consensus models [4–6] or its lack thereof [7] have been published.
Recently, our group introduced the concept of a robust 3DQSDAR approach [8]. 3DQSDAR utilizes unique fingerprints constructed from pairs of ^{13}C chemical shifts augmented with their corresponding interatomic distances. The proposed 3DQSDAR methodology was designed in accordance with the Organization for Economic Cooperation and Development (OECD) principles [9]: it provided several levels of validation, thus assuring models would be both reliable and interpretable. In our earlier work [8] an automated partial least squares (PLS) algorithm was used to process data from regularly tessellated 3DSDAR fingerprints and to derive averaged (composite model) predictions from 100 randomized training/holdout test set pairs. A technique [10] based on the standard deviation of the experimental data was employed to determine a “realistic” upper bound for coefficient of determination. A Yscrambling procedure [11, 12] assessed the probability of generating seemingly “good” models by chance.
However, the above described modeling procedure employed a single data processing algorithm, namely PLS. As a step forward, experiments designed to explore the likelihood of building improved consensus models combining predictions generated by conceptually unrelated algorithms operating on 3DSDAR matrices of different granularity were conceived. A KNN algorithm intended to supplement PLS by capturing complementary aspects of the structureactivity relationship was devised. It was hypothesized that the improvement of performance in consensus modeling should depend on the degree of orthogonality of the predictions produced by the individual models. Beyond the accuracy of biological data the inherent information content of a given descriptor pool was thought of as a factor limiting the improvement of R^{2}_{test} in consensus modeling. In other words, regardless of data processing algorithm the maximum achievable R^{2} for a holdout test set would be limited by the descriptors’ ability to depict specific aspects of the molecular structure directly related to the observed effect.
In this work, the effect of data processing algorithms on the quality of prediction and the ability of 3DQSDAR to provide a meaningful structural interpretation was tested on a dataset of 94 PCBs, PHDDs and PCDFs binding the aryl hydrocarbon receptor (AhR). Similar datasets were extensively modeled (Table 1 summarizes these results) and the structural features affecting binding are well known. This allowed quantitative and qualitative comparison of these earlier reports with the results carried out using the 3DQSDAR methodology.
Dataset
A dataset compiled by Mekenyan et al. [22] containing the experimental log(1/EC_{50}) values of 94 persistent organic pollutants inhibiting AhR was used for the purpose of this work. Here, EC_{50} is defined as the concentration of the test chemical necessary to reduce the specific binding of a radiolabeled 2,3,7,8tetrachlorodibenzopdioxin to 50% of its maximum value in the absence of competitor. This dataset consists of three distinct chemical classes: i) polychlorinated biphenyls (PCBs); ii) polyhalogenated dibenzopdioxins (PHDDs) and iii) polychlorinated dibenzofurans (PCDFs), shown in Table 2. Since we were unable to locate sources reporting the experimental errors of the EC_{50} data for the current dataset, a work by Long et al. [23] listing the absolute errors of 7 PCDF congeners was used to evaluate the quality of data. Under the assumption that for similar compounds and experimental settings the average relative experimental error would vary insignificantly, based on the above report at least ~17% error in the EC_{50} data should be expected. However, since Mekenyan et al. [22] compiled their dataset from various sources, a negative impact on the accuracy of data which would further lower the “realistic” R^{2}[10] should be anticipated.
Methods
Conventions
Several layers of complexity related to the utilized modeling procedures were introduced in this manuscript and these require clarification. To avoid ambiguity, models utilizing the same algorithm (either PLS or KNN) operating on an individual 3DSDAR data matrix by generating multiple randomized training/test subset pairs later combined to form a single model will be referred to as “composite models”. Models averaging the predictions from two (or eventually more) composite models will be referred to as “consensus models”. The term “individual models” is used interchangeably to denote either the individual PLS or KNN models forming the “consensus model” or the individual randomized training/test subset models resulting in a “composite model”. However, its specific meaning would be determined through its contextual use. The term “matching training/test subset pairs” indicates complementary training and test subset pairs processed by different algorithms, but composed of the same subsets of compounds.
Molecular conformation
In its current implementation, 3DQSDAR does not employ docking or alignment algorithms, nor does it use Xray structures to achieve more consistent geometries of the molecules constituting the dataset. This choice widens its applicability to datasets of compounds with unknown, multiple, or no specific targets and in the absence of knowledge about the binding site and its conformational requirements. For the purpose of reproducibility, however, the conformation at the global minimum of the potential energy surface was used. It has to be acknowledged that, while this conformation is the most energetically stable, it may not be the one assumed during solvent interaction or upon binding with a macromolecule [24].
To find the lowest energy conformers of all PCBs (the PHDDs and PCDFs have no rotatable bonds) a conformational search analysis was performed in HyperChem 8.0 [25]. An AMBER force field [26] and a random walks search method with an acceptance energy criterion of 6 kcal/mol were used. At the final stage of optimization all PCBs, PHDDs and PCDFs were optimized by employing a semiempirical Austin Model 1 (AM1) Hamiltonian with a rootmeansquare gradient of 0.01 kcal/Å × mol.
3DQSDAR descriptor calculations
The final refined geometries of all 94 AhR binders together with their respective ^{13}C chemical shifts simulated by ACD/NMR Predictor version 12.0 [27] were used to generate unique 3Dfingerprints representing each compound in the abstract 3DSDAR space [8]. This space is defined by three orthogonal coordinates, with the chemical shifts of atoms C_{i} and C_{j} on axes X and Y, respectively, and the distance between them on the Z axis. This concept is illustrated in Figure 1, which shows the structure and the ^{13}CNMR spectrum of 2,3,7,8tetrachlorodibenzopdioxin (Figures 1a and 1b) and its corresponding 3DSDAR fingerprint (Figure 1c). For example, in Figure 1c the coordinates of the topmost left fingerprint element are defined by the chemical shifts of atoms 1 and 6 (116.94 ppm) and the distance between them (5.52 Å). Due to the D_{2h} symmetry of the 2,3,7,8tetrachlorodibenzopdioxin molecule, the position of this fingerprint element in the 3DSDAR space coincides with the one representing atoms 4 and 9. The remaining fingerprint elements are constructed in a similar manner.
Because the units of length on each axis are not identical, X, Y and Z do not form a Cartesian coordinate system. Since the number of carbon atoms in a molecule (N_{ C }) determines uniquely the number of elements in a fingerprint (N_{ C }(N_{ C }  1)/2), each of the 94 AhR binders will be represented by at least 66 such fingerprint elements in the 3DSDAR space. This 3DSDAR space was further tessellated using regular grids to form bins ranging in size from 2 ppm x 2 ppm x 0.5 Å to 20 ppm x 20 ppm x 2.5 Å (i.e. incremental steps of 0.5 Å on the Zaxis and 2 ppm on the chemical shifts plane XY were used). As a result, 50 regular grids of different granularity were generated. A procedure performed separately on each of the 50 grids counted the number of fingerprint elements of a molecule belonging to a given bin (i.e., bin occupancy) and stored these values as row vectors in m x n matrices. Here m represents the number of compounds in the dataset, whereas n represents the number of occupied bins.
Determination of the optimal number of randomization cycles
Experiments aimed at the determination of the optimal number of training/test subset randomization cycles necessary to achieve an asymptotic convergence of R^{2}_{test} (an average of N individual R^{2}_{test} values, 10 ≤ N ≤ 1000) to its “true” value were conducted. As an example, Figure 2 shows the dependence between R^{2}_{test} and the number of randomization cycles in case of: i) our best PLS model utilizing 10 ppm x 10 ppm x 0.5 Å bins and 7 latent variables (LVs) and ii) our best KNN model using 2 ppm x 2 ppm x 0.5 Å bins and 6 neighbors. Figure 2 indicates that a minimum of 100 randomization cycles would be needed so that the average R^{2}_{test} values would converge to their asymptotic values. Therefore, to reduce the computational demand and to avoid reporting overlyoptimistic results, 100 randomizations for each of the 50 3DSDAR data matrices were performed.
Model building
To explore the ability of different data processing techniques to capture complementary portions of the variance in biological data, two algorithms based on unrelated concepts but operating on descriptor matrices originating from the 3DQSDAR approach were employed.

i)
A SIMPLS based [28] PLS algorithm written in Matlab [29] was employed to process each of the 50 3DQSDAR data matrices. All descriptors were standardized using the “zscore” Matlab function. As described above, 100 random training/test set pairs were generated and composite (ensemble) PLS models for the training sets, including somewhere between 1 and 10 LVs, were built. These models were then used to predict the log(1/EC_{50}) values for the complementary 20% “holdout” test subsets. At the end, each of the individual 100 R^{2} _{training}, R^{2} _{test} and R^{2} _{scrambling} values were recorded and their averages for the composite models were reported. For each of the 50 average models utilizing grids of different granularity the random number generator was initialized in order to recreate the same training/holdout test sequence (Additional file 1). Due to the specifics of the chosen modelbuilding procedure, the reader should bear in mind that these average reported parameters include contributions from “good” as well as “bad” models (see the results and discussion section).

ii)
Alternatively, a KNN algorithm written in Matlab and based on Tanimoto similarity [30] in its generalized vector form, $T\left(A,B\right)=\frac{A.B}{{\Vert A\Vert}^{2}+{\Vert B\Vert}^{2}A.B}$ was employed. In this equation, A and B are data objects represented by vectors (originally bit vectors). Thus, the Tanimoto similarity is a dot product of two vectors A and B (bin occupancy row vectors for a pair of compounds) divided by the squared magnitudes of A and B minus their dot product. In other words, for compounds sharing common structural features T will be closer to 1, otherwise T will be closer to 0.
Because T is not invariant to standardization, the desire for preservation of its universal nature required use of the original, nontransformed 3DSDAR descriptor pool. At a constant granularity of the grid this specific choice allowed bijection of T  there is one and only one T for a given pair of compounds. For a standardized descriptor pool, T loses its universal nature by being dependant on the mean and the standard deviation of the descriptors within the training set, and multiple Ts between a pair of compounds would exist (i.e., T would become a local characteristic of similarity).
These invariant T values (calculated for all pairs of compounds) were later used to predict the holdout test set activities by ranking the compounds from the training set in a descending order of their similarity to each compound from the holdout test and using T of the first Kneighbors (1 ≤ K ≤ 10) to weight their contributions to activity. Under these experimental settings, both odd and even numbers of neighbors can be used. As with PLS, the KNN validation procedure involved 100 randomized training/holdout test set pairs recreated by the use of the same random seed.
Fit and prediction
The majority of QSARs are built for prediction. Hence, parameters such as the coefficient of determination for the training set (R^{2}_{training}) that measure the fitting ability of a model play only a minor role, typically unrelated to predictive power. Since we are more interested in the behavior of models intended for prediction, our attention was primarily focused on R^{2}_{test} and R^{2}_{scrambling}. More specifically, the behavior of R^{2}_{test} was closely followed, whereas R^{2}_{scrambling} was monitored only as an indicator of potential chance correlations.
Results and discussion
Similarity as a discrimination function
The ability of T to detect structural similarity and thus structural variations is illustrated in Figure 3 (2 ppm x 2 ppm x 0.5 Å bins were used). Three regions of higher similarity are clearly distinguishable: i) compounds with IDs between 1 and 30 are all PCBs; ii) compounds with IDs between 31 and 55 are PHDDs and iii) the remaining 39 compounds are PCDFs. Because T is calculated from row vectors, it can be demonstrated that KNN operating on T may capture information complementary to that of the respective PLS models (virtually all multivariate methods operate on column vectors). Thus, the degree of orthogonality between PLS and KNN would be one of the factors with an impact on the performance of consensus PLSKNN models. Another such factor would be the complementarity of the information content specific to 3DSDAR matrices of different granularity.
Optimal bin size
As described above a total of 50 PLS and KNN composite models of different granularity (each of which is a result of 100 training/test set combinations) were built. The statistical parameters of these models calculated as an average of their corresponding 100 individual values are shown in Table 3. The predictive power of the PLS and KNN models in terms of R^{2}_{test} as a function of the granularity of the 3DSDAR abstract space is shown on Figure 4.
From Table 3 and Figure 4 one can see that the PLS and KNN models achieve their optimal performance at a different granularity of the grid. The best performing PLS model reaches its highest average R^{2}_{test} of 0.633 (σ = 0.147) at 10 ppm x 10 ppm x 0.5 Å, while the KNN model achieves it highest average R^{2}_{test} of 0.618 (σ = 0.170) at 2 ppm x 2 ppm x 0.5 Å. This observation can be explained by the combined effect of several contributing factors:

i)
As described in the methodology section, the PLS algorithm utilizes data standardization, which adjusts for the size disparity of variables. Unlike PLS, KNN uses the original bin occupancies. Thus, in the case of PLS, the optimal bin size would mainly reflect the inherent estimation error in the chemical shifts of carbon atoms and their associated interatomic distances. As demonstrated in [8], bins with a high resolution on the Z axis (C_{i}C_{j} distance) and a granularity of at least twice the estimation error of ^{13}C chemical shifts in the XY plane would result in PLS models of optimal performance. For the current dataset, the ^{13}C chemical shifts estimation error was 3.98 ppm, which would require bins with granularity of at least 8 ppm in the XY plane. Hence, it is not surprising that the best performing PLS model utilizes 10 ppm x 10 ppm bins in the XY plane and a 0.5 Å on the Zaxis. Besides the ^{13}C chemical shifts estimation error, the optimal grid granularity also depends on the bin occupancy. Bins that are too narrow will result in a large, but sparsely populated 3DSDAR matrix and PLS models unable to generalize (poor predictive performance). On the other hand, models using bins that are too wide (e.g., > 14 ppm in the chemical shifts plane XY) may assign fingerprint elements encoding divergent structural features to the same bin, thus producing models lacking in their ability to decode the underlying relationship between structure and activity.

ii)
The use of T as a factor for activity determination in KNN results in smaller optimal bin sizes in part due to the cancelation of the error in the chemical shifts plane (XY) for similar compounds. Note that the highest contribution to the determination of activity in KNN comes only from the first Knearest neighbors, which by definition are most similar to the compound the activity of which is being predicted. Because for similar structures the error of prediction propagates in parallel, it is not surprising that similarity based KNN algorithms will achieve maximum performance at smaller bin sizes.

iii)
Unlike PLS, which assigns a different contribution of each bin to the final model, KNN treats all bins as independent coordinates of a vector compared against other such vectors (i.e., assigns equal contribution). Thus, depending on the model building technique being employed, grids of different granularity may be identified as performing better.
Composite and consensus models
For both PLS and KNN composite models, Figures 5a and 5b show plots of the average predicted activities of all 94 compounds (each was part of the holdout test and predicted ~20 times) against their experimental log(1/EC_{50}) values. Note that the coefficients of determination shown on Figures 5a and 5b differ slightly from the R^{2}_{test} values given in Table 3, as the latter represent an average of 100 individual R^{2}_{test} values for the randomized training/test subset pairs.
Because consensus requires at least two individual models an obvious choice was to select complementary composite PLS and KNN models (indicated by the green arrows in Figure 4). This choice allowed the construction of consensus models from pairwise averaging of the predictions from individual models using: i) the same algorithm but different bin size (IDs 2 and 4); ii) the same bin size but different algorithm (IDs 5 and 6) or iii) different algorithm and different bin size (IDs 1 and 3). The percentage improvement in consensus modeling over the average of the coefficients of determination of the individual models is carried out in Table 4. Although consensus between other pairs of individual models or of higher order (averaging predictions from more than two individual models) is possible and may perform better, the enormous number of such models was prohibitive and such efforts were not undertaken.
As can be seen from Table 4, both differences in the data processing algorithms and the granularity of the 3DSDAR space contribute to the improvement in consensus modeling. A comparison of the performance improvement of consensus models indicates that generally the models of type iii perform best. A possible explanation for this observation is that these models benefit from: i) the complementary information extracted from 3DSDAR matrices of different granularity and ii) the utilization of different data processing algorithms. Among these 6 consensus models, the one averaging the predictions from the best performing PLS (10 ppm x 10 ppm x 0.5 Å bins, 7 LVs) and KNN (2 ppm x 2 ppm x 0.5 Å bins, 6 neighbors) individual models was characterized by the highest coefficient of determination (shown in Figure 5c and the last column of Table 2).
To further understand the factors playing a role in consensus modeling and to explain the observed improvement over the composite PLS and KNN models, analysis based on training/test set pairs of individual models was carried out.
According to our initial hypothesis an improvement in consensus modeling would be observed only if the individual composite models account for complementary information (i.e., explain complementary portions of the variance in the biological data). For this purpose, the behavior of the individual 100 submodels resulting in the best composite PLS and KNN models was investigated. If for each of the 100 training/test set pairs both algorithms capture almost identical structural information encoded in the 3DSDAR descriptor pool, the corresponding R^{2}_{test} values generated on each cycle should be highly correlated and therefore no improvement in consensus modeling would be observed. In other words, the two algorithms would be somewhat redundant and the consensus R^{2}_{test} would be an average of R^{2}_{test} for the 100 individual submodels. It has to be emphasized that such an experiment would be valid only in a case of matching training/test subset pairs. This condition is satisfied by the use of the same random seed for both PLS and KNN and a random number generator which was initialized after 100 runs.
Three different views of the 100 individual R^{2}_{test} values for the best composite PLS (10 ppm x 10 ppm x 0.5 Å bins, 7 LVs) and KNN models (2 ppm x 2 ppm x 0.5 Å bins, 6 neighbors), are shown in Figure 6. A plot of the ranked R^{2}_{test} values for each of these 100 models (Figure 6a) indicated a similar level of performance of both algorithms. Figure 6a also demonstrates that some combinations of training/test subset pairs may produce highly accurate models with R^{2}_{test} reaching 0.9, while others may result in models with inferior performance (i.e., models in which the test set compounds are not well represented by the training set).
Figure 6b shows a plot of R^{2}_{test} of matching training/test subset pairs processed alternatively by PLS or KNN. Although, there were PLS and KNN submodels performing equally well (forming a cluster in the upper right corner or the plot), a significant portion of submodels predicted well by PLS were combined with inferior KNN models and viceversa. This observation and the relatively low R^{2} of 0.367 suggest that the two individual models reflect different structural patterns in the data and are partially “orthogonal”. The distribution of ΔR^{2}_{test} PLSKNN shown on Figure 6c indicated that a total of 28 models deviate by at least 1σ from the mean. PLS outperformed KNN for 13 models while KNN performed better for the remaining 15 models. These 28 models, for which one of the algorithms succeeded in establishing a structureactivity relationship undetected by the other, were identified as a major contributing factor affecting the performance of consensus models. Thus, a consensus PLSKNN model would benefit from the partial orthogonality of the PLS and KNN approaches on different sized bins and would outperform the individual composite models.
Interpretation
An essential part of the QSAR modeling process is the interpretation of the model in terms of structural variations leading to corresponding changes in the biological activity. For the purpose of interpretation the bins with the 10 most positive and negative PLS weights for each of the individual models forming the composite 10 ppm x 10 ppm x 0.5 Å PLS model were extracted and their relative frequencies of occurrence were calculated. Since each of the individual models utilized 7 LVs, a total number of 14000 positive and negative bins were extracted (2 × 100 submodels × 7 LVs × 10 bins). Unique among these were 87 bins with positive weights and 74 bins with negative weights. Their corresponding relative frequencies of occurrence were calculated and ranked. For simplicity, only the topmost 20% unique positive and negative bins were mapped back to the 3DSDAR abstract space (see Figure 7).
A detailed examination of the 3DSDAR maps shown in Figure 7 reveals that none of the bins with positive weights overlaps with any of the bins with negative weights: i.e., the structural features affecting binding (increasing or decreasing log(1/EC_{50})) are well separated. Therefore, compounds with 3DSDAR fingerprints predominantly occupying bins with positive PLS weights will be stronger binders (highly toxic). Conversely, chemicals with fingerprint elements falling into regions of the 3DSDAR space occupied by bins with negative weights will be weaker binders (less toxic). This hypothesis was tested using an in house program projecting some of the most frequently occurring positively and negatively weighted bins on the standard coordinate space. This projection allowed identification of subsets of structures in which these bins can often be found together.
As was expected, most of the bins characterized by positive PLS weights were found in the structures of PHDDs representing the most toxic class of chemicals in the dataset investigated (see Figure 8). Specifically, a subclass of polybrominated dioxins showed consistent presence of multiple positively weighted bins (see Figure 8). As anticipated, 7 of these were among the top 10 most toxic compounds in the dataset. However, infrequently occurring bins specific to compounds with peculiar structural features did not appear as highly ranked in the composite 3DQSDAR models. This explains the absence of bins specific to the 2,8dibromo3,7dichlorodibenzopdioxin, since it is the only dioxin derivative with both Cl and Br substituents on the same ring.
Although in its current version 3DQSDAR “sees” only the carbon atoms, inferences about their chemical environments can be easily drawn. For example Figure 9 shows that the 140 ppm  150 ppm, 110 ppm  120 ppm, 2.0 Å  2.5 Å bin is persistently occupied by carbon atoms neighboring the oxygen atoms in PHDDs, indicating the importance of oxygen atoms for binding to AhR [14, 31]. Hence, the lack of oxygens in the structure of PCBs can be correlated to their weaker binding affinity (and consequently their lower molar toxicity).
In contrast, most of the negatively weighted bins were found to be present in the structures of PCBs. As can be seen from Figure 9, positions 2 and 2′ and (due to symmetry) positions 6 and 6′ are particularly affected and chlorine substitution at these positions will lower the toxicity of PCBs, compared to that of other chlorine substituted homologues.
As an intermediate chemical class with an average activity higher than that of PCBs and lower than that of PHDDs, the activity of dibenzofurans is affected by the presence/absence of structural patterns similar to those observed in the structures of both PCBs and PHDDs. For example, the presence of an oxygen atom resulting in a chemical shift range of the neighboring carbon atoms between 150 and 160 ppm will lower the EC_{50} of PCDFs (higher toxicity). Analogously to the 2 and 2′ positions in biphenyls, chlorine substitution at positions 1 and 9 will result in PCDFs with toxicity lower than that of PCDF homologues substituted elsewhere.
Comparison to earlier models
Due to variability in the datasets and the multitude of available data processing algorithms and validation techniques, a direct quantitative comparison with the QSARs summarized in Table 1 is impossible. However, if one takes into account the much stringent validation criteria imposed in our work (vs the crossvalidation procedures employed in [13–21]) it is clear that the 3DQSDAR methodology performs at least on par with these earlier models. Similarly to CoMFA [17] on a qualitative level the 3DQSDAR was able to recognize correctly the positions that affect the strength of binding to AhR. Since our work is based on a dataset originally compiled by Mekenyan et al. a more direct comparison with the QSARs reported in [22] was possible. Multiple separate QSARs for the three classes of PCBs, PHDDs and PCDFs with R^{2} ranging from 0.640 (n = 30) to 0.899 (n = 14) were derived. The statistical parameters of a model combining the most planar PCBs, PHDDs and PCDFs (n = 80) were as follows: R^{2} = 0.73; s^{2} = 0.59; R^{2}_{cv} = 0.73 and F = 69.2. In comparison, for the complete set of 94 compounds our best consensus model produced an R^{2}_{test} of 0.685 and a q^{2}_{LOO} of 0.79 which are both close to the R^{2}_{cv} of 0.73 reported by Mekenyan et al.
Conclusions
We have introduced several validation techniques intended to improve the quality and reliability of individual and consensus QSAR models. Their use was illustrated on a dataset of 94 AhR binders modeled by 3DQSDAR. The functional dependence between R^{2}_{test} and the number of training/test subset randomization cycles was used to determine the minimum number of cycles necessary to achieve convergence of R^{2}_{test} to its asymptotic “true” value. In this specific case, which uses 20% of the compounds as a holdout test set, 100 randomization cycles proved sufficient for achieving convergence for both PLS and KNN models. The use of a distance measure (Tanimoto similarity) as a discriminant function in KNN was shown to produce models with performance similar to that of PLS when applied to the same dataset. A plot of R^{2}_{test} for matching test set pairs was used to demonstrate the partial orthogonality of PLS and similarity based KNN approaches on different bin granularity. However, further investigations may shed additional light on the character of the multiple factors playing role in the improvements observed in consensus modeling.
In the last stage of the modeling process the most frequently occurring positively and negatively weighted bins were projected back to the standard coordinate space to identify structural features related to toxicity. It was found that most of the highly ranked bins with positive PLS weights were specific to a class of polybrominated dioxins. The oxygen atoms of PHDDs and PCDFs participating in formation of donoracceptor bonds with the receptor were associated with the high toxic effect of these two chemical classes. In the absence of other substituents, PCBs with chlorine atoms at positions 2 and 2′ (and due to symmetry positions 6 and 6′) were accurately predicted to be relatively weaker binders (less toxic).
Abbreviations
 PLS:

Partial Least Squares
 KNN:

k Nearest Neighbors
 3DSDAR:

ThreeDimensional Spectral Data  Activity Relationship
 AhR:

Aryl Hydrocarbon Receptor
 PCBs:

Polychlorinated Biphenyls
 PHDDs:

Polyhalogenated DibenzopDioxins
 PCDFs:

Polychlorinated Dibenzofurans
 QSAR:

Quantitative Structure  Activity relationship
 LOO:

LeaveOneOut
 CV:

CrossValidation
 MLR:

Multiple Linear Regression.
References
 1.
Ganguly M, Brown N, Schuffenhauer A, Ertl P, Gillet VJ, Greenidge PA: Introducing the Consensus Modeling Concept in Genetic Algorithms: Application to Interpretable Discriminant Analysis. J Chem Inf Model. 2006, 46: 21102124. 10.1021/ci050529l.
 2.
Gramatica P, Giani E, Papa E: Statistical External Validation and Consensus Modeling: A QSPR Case Study for Koc Prediction. J Mol Graphics Modell. 2007, 25: 755766. 10.1016/j.jmgm.2006.06.005.
 3.
Kuzmin VE, Muratov EN, Artemenko AG, Varlamova E, Gorb L, Wang J, Leszczynski J: Consensus QSAR Modeling of PhosphorContaining Chiral AChE Inhibitors. QSAR Comb Sci. 2009, 28: 664677. 10.1002/qsar.200860117.
 4.
Gramatica P, Pilutti P, Papa E: Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into TrainingTest Sets and Consensus Modelling. J Chem Inf Comput Sci. 2004, 44: 17941802. 10.1021/ci049923u.
 5.
Mario L, Vinothini S: In Silico Prediction of Aqueous Solubility, Human Plasma Protein Binding and Volume of Distribution of Compounds from Calculated pKa and AlogP98 Values. Mol Divers. 2003, 7: 6987.
 6.
Sussman NB, Arena VC, Yu S, Mazumdar S, Thampatty BP: Decision Tree SAR Models for Developmental Toxicity Based on an FDA/TERIS Database. SAR QSAR Environ Res. 2003, 14: 8396. 10.1080/1062936031000073126.
 7.
Hewitt M, Cronin MT, Madden JC, Rowe PH, Johnson C, Obi A, Enoch SJ: Consensus QSAR models: do the benefits outweigh the complexity?. J Chem Inf Model. 2007, 47: 14601468. 10.1021/ci700016d.
 8.
Slavov S, Geesaman E, Pearce B, Schnackenberg L, Buzatu D, Wilkes J, Beger R: ^{13}C NMRDistance Matrix Descriptors: Optimal Abstract 3D Space Granularity for Predicting Estrogen Binding. J Chem Inf Model. 2012, 52: 18541864. 10.1021/ci3001698.
 9.
Report from the Expert Group on (Quantitative) StructureActivity Relationship ([Q]SARs) on the Principles for the Validation of (Q)SARs. 2004, Paris, France: Organisation for Economic Cooperation and Development
 10.
Doweyko AM, Bell AR, Minatelli JA, Relyea DI: Quantitative StructureActivity Relationships for 2[(Phenylmethyl)Sulfonyl]Pyridine 1Oxide Herbicides. J Med Chem. 1983, 26: 475478. 10.1021/jm00358a004.
 11.
Klopman G, Kalos AN: Causality in StructureActivity Studies. J Comput Chem. 1985, 6: 492506. 10.1002/jcc.540060520.
 12.
Wold S, Eriksson L: Statistical Validation of QSAR Results. Chemometric Methods in Molecular Design. Edited by: van de Waterbeemd H. 1995, Weinheim, Germany: WileyVCH Verlag GmbH, 309318.
 13.
Beger RD, Wilkes JG: Models of Polychlorinated Dibenzodioxins, Dibenzofurans, and Biphenyls Binding Affinity to the Aryl Hydrocarbon Receptor Developed Using ^{13}C NMR Data. J Chem Inf Comput Sci. 2001, 41: 13221329. 10.1021/ci000312l.
 14.
Beger RD, Buzatu DA, Wilkes JG: Combining NMR spectral and structural data to form models of polychlorinated dibenzodioxins, dibenzofurans, and biphenyls binding to the AhR. J Comput Aided Mol Des. 2002, 16: 727740. 10.1023/A:1022479510524.
 15.
Arulmozhiraja S, Morita M: Structureactivity relationships for the toxicity of polychlorinated dibenzofurans: approach through density functional theorybased descriptors. Chem Res Toxicol. 2004, 17: 348356. 10.1021/tx0300380.
 16.
Hirokawa S, Imasaka T, Imasaka T: Chlorine substitution pattern, molecular electronic properties, and the nature of the ligandreceptor interaction: quantitative propertyactivity relationships of polychlorinated dibenzofurans. Chem Res Toxicol. 2005, 18: 232238. 10.1021/tx049874f.
 17.
Ashek A, Lee C, Park H, Cho SJ: 3D QSAR studies of dioxins and dioxinlike compounds using CoMFA and CoMSIA. Chemosphere. 2006, 65: 521529. 10.1016/j.chemosphere.2006.01.010.
 18.
Gu C, Jiang X, Ju X, Yu G, Bian Y: QSARs for the toxicity of polychlorinated dibenzofurans through DFTcalculated descriptors of polarizabilities, hyperpolarizabilities and hyperorder electric moments. Chemosphere. 2007, 67: 13251334. 10.1016/j.chemosphere.2006.10.057.
 19.
Zhao YY, Tao FM, Zeng EY: Theoretical study of the quantitative structureactivity relationships for the toxicity of dibenzopdioxins. Chemosphere. 2008, 73: 8691. 10.1016/j.chemosphere.2008.05.018.
 20.
Gu C, Jiang X, Ju X, Gong X, Wang F, Bian Y, Sun C: QSARs for congenerspecific toxicity of polyhalogenated dibenzopdioxins with DFT and WHIM theory. Ecotoxicol Environ Saf. 2009, 72: 6070. 10.1016/j.ecoenv.2008.04.003.
 21.
Diao J, Li Y, Shi S, Sun Y, Sun Y: QSAR Models for Predicting Toxicity of Polychlorinated Dibenzopdioxins and Dibenzofurans Using Quantum Chemical Descriptors. Bull Environ Contam Toxicol. 2010, 85: 109115. 10.1007/s0012801000652.
 22.
Mekenyan OG, Veith GD, Call DJ, Ankley GTA: QSAR evaluation of Ah receptor binding of halogenated aromatic xenobiotics. Environ Health Perspect. 1996, 104: 13021310.
 23.
Long G, McKinney J, Pedersen L: Polychlorinated dibenzofuran (PCDF) binding to the Ah receptor(s) and associated enzyme induction. Theoretical model based on molecular parameters. Quant StructAct Relat. 1987, 6: 17. 10.1002/qsar.19870060102.
 24.
Eliel EL: Chemistry in Three Dimensions. Chemical Structures. Edited by: Warr WA. 1993, Berlin, Germany: Springer, 1
 25.
HyperChem 8 Professional, version 8.0. 2007, Gainesville, FL: HyperCube Inc
 26.
Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA: A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J Am Chem Soc. 1995, 117: 51795197. 10.1021/ja00124a002.
 27.
ACD/NMR Predictor Release 12.00, version 12.5; Advanced Chemistry Development, Inc. 2011, Toronto, ON, Canada, http://www.acdlabs.com,
 28.
De Jong S: SIMPLS: an alternative approach to partial least squares regression. Chemom Intell Lab Systems. 1993, 18: 251263. 10.1016/01697439(93)85002X.
 29.
MATLAB, version 8.0 (R2012b), The MathWorks Inc. 2012, Cambridge, MA, USA, http://www.mathworks.com,
 30.
Tanimoto TT: IBM Internal Report: 17th Nov. Technical report. 1957, Armonk, NY, USA: IBM
 31.
Kobayashi S, Saito A, Ishii Y, Tanaka A, Tobinaga S: Relationship between the biological potency of polychlorinated dibenzopdioxins and their electronic states. Chem Pharm Bull. 1991, 39: 21002105. 10.1248/cpb.39.2100.
Acknowledgements
The authors thank F.D.A. for the financial support.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SS worked on the main concepts, significant portion of the data processing code, the computational experiments design and the data analysis and interpretation; BP wrote the routines for 3DSDAR space tessellation into bins; DB carried out experiments using nonlinear pattern recognition methods to confirm that narrow bins could be used to improve consensus model predictions in combination with other pattern recognition methods using larger bins (unpublished but conceptually important data); RB conceived various experimental designs addressing questions regarding the factors playing role in consensus modeling, and participated in all discussions concerning the improvements in the 3DQSDAR methodology; JW proposed different hypotheses explaining the observed improvements in consensus modeling and proposed new experiments that would quantify the impact of each contributing factor. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 QSAR
 Molecular descriptors
 Quantitative spectral dataactivity relationship (3DQSDAR)
 Estrogen receptor binding
 Molecular modeling