Complementary PLS and KNN algorithms for improved 3D-QSDAR consensus modeling of AhR binding

Multiple validation techniques (Y-scrambling, complete training/test set randomization, determination of the dependence of R2test on the number of randomization cycles, etc.) aimed to improve the reliability of the modeling process were utilized and their effect on the statistical parameters of the models was evaluated. A consensus partial least squares (PLS)-similarity based k-nearest neighbors (KNN) model utilizing 3D-SDAR (three dimensional spectral data-activity relationship) fingerprint descriptors for prediction of the log(1/EC50) values of a dataset of 94 aryl hydrocarbon receptor binders was developed. This consensus model was constructed from a PLS model utilizing 10 ppm x 10 ppm x 0.5 Å bins and 7 latent variables (R2test of 0.617), and a KNN model using 2 ppm x 2 ppm x 0.5 Å bins and 6 neighbors (R2test of 0.622). Compared to individual models, improvement in predictive performance of approximately 10.5% (R2test of 0.685) was observed. Further experiments indicated that this improvement is likely an outcome of the complementarity of the information contained in 3D-SDAR matrices of different granularity. For similarly sized data sets of Aryl hydrocarbon (AhR) binders the consensus KNN and PLS models compare favorably to earlier reports. The ability of 3D-QSDAR (three dimensional quantitative spectral data-activity relationship) to provide structural interpretation was illustrated by a projection of the most frequently occurring bins on the standard coordinate space, thus allowing identification of structural features related to toxicity.


Background
During the past decade, the application of consensus modeling to various QSAR related problems has been explored [1][2][3]. Early QSARs often relied on single models, which under certain circumstances were prone to arbitrary overestimation of the contribution of given structural features at the expense of others that were suppressed or ignored. To mitigate such risks consensus models based on a multitude of individual models can be advantageously used. Reports of improved performance of consensus models [4][5][6] or its lack thereof [7] have been published.
Recently, our group introduced the concept of a robust 3D-QSDAR approach [8]. 3D-QSDAR utilizes unique fingerprints constructed from pairs of 13 C chemical shifts augmented with their corresponding inter-atomic distances. The proposed 3D-QSDAR methodology was designed in accordance with the Organization for Economic Cooperation and Development (OECD) principles [9]: it provided several levels of validation, thus assuring models would be both reliable and interpretable. In our earlier work [8] an automated partial least squares (PLS) algorithm was used to process data from regularly tessellated 3D-SDAR fingerprints and to derive averaged (composite model) predictions from 100 randomized training/hold-out test set pairs. A technique [10] based on the standard deviation of the experimental data was employed to determine a "realistic" upper bound for coefficient of determination. A Y-scrambling procedure [11,12] assessed the probability of generating seemingly "good" models by chance. However, the above described modeling procedure employed a single data processing algorithm, namely PLS. As a step forward, experiments designed to explore the likelihood of building improved consensus models combining predictions generated by conceptually unrelated algorithms operating on 3D-SDAR matrices of different granularity were conceived. A KNN algorithm intended to supplement PLS by capturing complementary aspects of the structure-activity relationship was devised. It was hypothesized that the improvement of performance in consensus modeling should depend on the degree of orthogonality of the predictions produced by the individual models. Beyond the accuracy of biological data the inherent information content of a given descriptor pool was thought of as a factor limiting the improvement of R 2 test in consensus modeling. In other words, regardless of data processing algorithm the maximum achievable R 2 for a hold-out test set would be limited by the descriptors' ability to depict specific aspects of the molecular structure directly related to the observed effect.
In this work, the effect of data processing algorithms on the quality of prediction and the ability of 3D-QSDAR to provide a meaningful structural interpretation was tested on a dataset of 94 PCBs, PHDDs and PCDFs binding the aryl hydrocarbon receptor (AhR). Similar datasets were extensively modeled (Table 1 summarizes these results) and the structural features affecting binding are well known. This allowed quantitative and qualitative comparison of these earlier reports with the results carried out using the 3D-QSDAR methodology.

Dataset
A dataset compiled by Mekenyan et al. [22] containing the experimental log(1/EC 50 ) values of 94 persistent organic pollutants inhibiting AhR was used for the purpose of this work. Here, EC 50 is defined as the concentration of the test chemical necessary to reduce the specific binding of a radiolabeled 2,3,7,8-tetrachlorodibenzo-p-dioxin to 50% of its maximum value in the absence of competitor. This dataset consists of three distinct chemical classes: i) polychlorinated biphenyls (PCBs); ii) polyhalogenated dibenzop-dioxins (PHDDs) and iii) polychlorinated dibenzofurans (PCDFs), shown in Table 2. Since we were unable to locate sources reporting the experimental errors of the EC 50 data for the current dataset, a work by Long et al. [23] listing the absolute errors of 7 PCDF congeners was used to evaluate the quality of data. Under the assumption that for similar compounds and experimental settings the average relative experimental error would vary insignificantly, based on the above report at least~17% error in the EC 50 data should be expected. However, since Mekenyan et al. [22] compiled their dataset from various sources, a negative impact on the accuracy of data which would further lower the "realistic" R 2 [10] should be anticipated.

Conventions
Several layers of complexity related to the utilized modeling procedures were introduced in this manuscript and these require clarification. To avoid ambiguity, models   utilizing the same algorithm (either PLS or KNN) operating on an individual 3D-SDAR data matrix by generating multiple randomized training/test subset pairs later combined to form a single model will be referred to as "composite models". Models averaging the predictions from two (or eventually more) composite models will be referred to as "consensus models". The term "individual models" is used interchangeably to denote either the individual PLS or KNN models forming the "consensus model" or the individual randomized training/test subset models resulting in a "composite model". However, its specific meaning would be determined through its contextual use. The term "matching training/test subset pairs" indicates complementary training and test subset pairs processed by different algorithms, but composed of the same subsets of compounds.

Molecular conformation
In its current implementation, 3D-QSDAR does not employ docking or alignment algorithms, nor does it use X-ray structures to achieve more consistent geometries of the molecules constituting the dataset. This choice widens its applicability to datasets of compounds with unknown, multiple, or no specific targets and in the absence of knowledge about the binding site and its conformational requirements. For the purpose of reproducibility, however, the conformation at the global minimum of the potential energy surface was used. It has to be acknowledged that, while this conformation is the most energetically stable, it may not be the one assumed during solvent interaction or upon binding with a macromolecule [24]. To find the lowest energy conformers of all PCBs (the PHDDs and PCDFs have no rotatable bonds) a conformational search analysis was performed in HyperChem 8.0 [25]. An AMBER force field [26] and a random walks search method with an acceptance energy criterion of 6 kcal/mol were used. At the final stage of optimization all PCBs, PHDDs and PCDFs were optimized by employing a semiempirical Austin Model 1 (AM1) Hamiltonian with a rootmean-square gradient of 0.01 kcal/Å × mol.

3D-QSDAR descriptor calculations
The final refined geometries of all 94 AhR binders together with their respective 13 C chemical shifts simulated by ACD/NMR Predictor version 12.0 [27] were used to generate unique 3D-fingerprints representing each compound in the abstract 3D-SDAR space [8]. This space is defined by three orthogonal coordinates, with the chemical shifts of atoms C i and C j on axes X and Y, respectively, and the distance between them on the Z axis. This concept is illustrated in Figure 1, which shows the structure and the 13 C-NMR spectrum of 2,3,7,8-tetrachlorodibenzop-dioxin (Figures 1a and 1b) and its corresponding 3D-SDAR fingerprint ( Figure 1c). For example, in Figure 1c the coordinates of the topmost left fingerprint element are defined by the chemical shifts of atoms 1 and 6 (116.94 ppm) and the distance between them (5.52 Å). Due to the D 2h symmetry of the 2,3,7,8-tetrachlorodibenzo-p-dioxin molecule, the position of this fingerprint element in the 3D-SDAR space coincides with the one representing atoms 4 and 9. The remaining fingerprint elements are constructed in a similar manner.
Because the units of length on each axis are not identical, X, Y and Z do not form a Cartesian coordinate system. Since the number of carbon atoms in a molecule (N C ) determines uniquely the number of elements in a fingerprint (N C (N C -1)/2), each of the 94 AhR binders will be represented by at least 66 such fingerprint elements in the 3D-SDAR space. This 3D-SDAR space was further tessellated using regular grids to form bins ranging in size from 2 ppm x 2 ppm x 0.5 Å to 20 ppm x 20 ppm x 2.5 Å (i.e. incremental steps of 0.5 Å on the Z-axis and 2 ppm on the chemical shifts plane XY were used). As a result, 50 regular grids of different granularity were generated. A procedure performed separately on each of the 50 grids counted the number of fingerprint elements of a molecule belonging to a given bin (i.e., bin occupancy) and stored these values as row vectors in m x n matrices. Here m represents the number of compounds in the dataset, whereas n represents the number of occupied bins.

Determination of the optimal number of randomization cycles
Experiments aimed at the determination of the optimal number of training/test subset randomization cycles necessary to achieve an asymptotic convergence of R 2 test (an average of N individual R 2 test values, 10 ≤ N ≤ 1000) to its "true" value were conducted. As an example, Figure 2 shows the dependence between R 2 test and the number of randomization cycles in case of: i) our best PLS model utilizing 10 ppm x 10 ppm x 0.5 Å bins and 7 latent variables (LVs) and ii) our best KNN model using 2 ppm x 2 ppm x 0.5 Å bins and 6 neighbors. Figure 2 indicates that a minimum of 100 randomization cycles would be needed so that the average R 2 test values would converge to their asymptotic values. Therefore, to reduce the computational demand and to avoid reporting overlyoptimistic results, 100 randomizations for each of the 50 3D-SDAR data matrices were performed.

Model building
To explore the ability of different data processing techniques to capture complementary portions of the variance in biological data, two algorithms based on unrelated concepts but operating on descriptor matrices originating from the 3D-QSDAR approach were employed.

i) A SIMPLS based [28] PLS algorithm written in
Matlab [29] was employed to process each of the 50 3D-QSDAR data matrices. All descriptors were standardized using the "zscore" Matlab function. As described above, 100 random training/test set pairs were generated and composite (ensemble) PLS models for the training sets, including somewhere between 1 and 10 LVs, were built. These models were then used to predict the log(1/EC 50 ) values for the complementary 20% "hold-out" test subsets. At the end, each of the individual 100 R 2 training , R 2 test and R 2 scrambling values were recorded and their averages for the composite models were reported. For each of the 50 average models utilizing grids of different granularity the random number generator was initialized in order to recreate the same training/ hold-out test sequence (Additional file 1). Due to the specifics of the chosen model-building procedure, the reader should bear in mind that these average reported parameters include contributions from "good" as well as "bad" models (see the results and discussion section). ii) Alternatively, a KNN algorithm written in Matlab and based on Tanimoto similarity [30] in its generalized vector form, In this equation, A and B are data objects represented by vectors (originally bit vectors). Thus, the Tanimoto similarity is a dot product of two vectors A and B (bin occupancy row vectors for a pair of compounds) divided by the squared magnitudes of A and B minus their dot product. In other words, for compounds sharing common structural features T will be closer to 1, otherwise T will be closer to 0.
Because T is not invariant to standardization, the desire for preservation of its universal nature required use of the original, non-transformed 3D-SDAR descriptor pool. At a constant granularity of the grid this specific choice allowed bijection of T -there is one and only one T for a given pair of compounds. For a standardized descriptor pool, T loses its universal nature by being dependant on the mean and the standard deviation of the descriptors within the training set, and multiple T-s between a pair of compounds would exist (i.e., T would become a local characteristic of similarity).
These invariant T values (calculated for all pairs of compounds) were later used to predict the hold-out test set activities by ranking the compounds from the training set in a descending order of their similarity to each compound from the hold-out test and using T of the first K-neighbors (1 ≤ K ≤ 10) to weight their contributions to activity. Under these experimental settings, both odd and even numbers of neighbors can be used. As with PLS, the KNN validation procedure involved 100 randomized training/hold-out test set pairs recreated by the use of the same random seed.

Fit and prediction
The majority of QSARs are built for prediction. Hence, parameters such as the coefficient of determination for the training set (R 2 training ) that measure the fitting ability of a model play only a minor role, typically unrelated to predictive power. Since we are more interested in the behavior of models intended for prediction, our attention was primarily focused on R 2 test and R 2 scrambling . More specifically, the behavior of R 2 test was closely followed, whereas R 2 scrambling was monitored only as an indicator of potential chance correlations.

Results and discussion
Similarity as a discrimination function The ability of T to detect structural similarity and thus structural variations is illustrated in Figure 3 (2 ppm x 2 ppm x 0.5 Å bins were used). Three regions of higher similarity are clearly distinguishable: i) compounds with IDs between 1 and 30 are all PCBs; ii) compounds with IDs between 31 and 55 are PHDDs and iii) the remaining 39 compounds are PCDFs. Because T is calculated from row vectors, it can be demonstrated that KNN operating on T may capture information complementary to that of the respective PLS models (virtually all multivariate methods operate on column vectors). Thus, the degree of orthogonality between PLS and KNN would be one of the factors with an impact on the performance of consensus PLS-KNN models. Another such factor would be the complementarity of the information content specific to 3D-SDAR matrices of different granularity.

Optimal bin size
As described above a total of 50 PLS and KNN composite models of different granularity (each of which is a result of 100 training/test set combinations) were built. The statistical parameters of these models calculated as an average of their corresponding 100 individual values are shown in Table 3. The predictive power of the PLS and KNN models in terms of R 2 test as a function of the granularity of the 3D-SDAR abstract space is shown on Figure 4.
From Table 3 and Figure 4 one can see that the PLS and KNN models achieve their optimal performance at a different granularity of the grid. The best performing PLS model reaches its highest average R 2 test of 0.633 (σ = 0.147) at 10 ppm x 10 ppm x 0.5 Å, while the KNN model achieves it highest average R 2 test of 0.618 (σ = 0.170) at 2 ppm x 2 ppm x 0.5 Å. This observation can be explained by the combined effect of several contributing factors: i) As described in the methodology section, the PLS algorithm utilizes data standardization, which adjusts for the size disparity of variables. Unlike PLS, KNN uses the original bin occupancies. Thus, in the case of PLS, the optimal bin size would mainly reflect the inherent estimation error in the chemical shifts of carbon atoms and their associated interatomic distances. As demonstrated in [8], bins with a high resolution on the Z axis (C i C j distance) and a granularity of at least twice the estimation error of 13 C chemical shifts in the XY plane would result in PLS models of optimal performance. For the current dataset, the 13 C chemical shifts estimation error was 3.98 ppm, which would require bins with granularity of at least 8 ppm in the XY plane. Hence, it is not surprising that the best performing PLS model utilizes 10 ppm x 10 ppm bins in the XY plane and a 0.5 Å on the Z-axis. Besides the 13 C chemical shifts estimation error, the optimal grid granularity also depends on the bin occupancy. Bins that are too narrow will result in a large, but sparsely populated 3D-SDAR matrix and PLS models unable to generalize (poor predictive performance). On the other hand, models using bins that are too wide (e.g., > 14 ppm in the chemical shifts plane XY) may assign fingerprint elements encoding divergent structural features to the same bin, thus producing models lacking in their ability to decode the underlying relationship between structure and activity. ii) The use of T as a factor for activity determination in KNN results in smaller optimal bin sizes in part due to the cancelation of the error in the chemical shifts plane (XY) for similar compounds. Note that the highest contribution to the determination of activity in KNN comes only from the first K-nearest neighbors, which by definition are most similar to the compound the activity of which is being predicted. Because for similar structures the error of prediction propagates in parallel, it is not surprising that similarity based KNN algorithms will achieve maximum performance at smaller bin sizes. iii) Unlike PLS, which assigns a different contribution of each bin to the final model, KNN treats all bins as independent coordinates of a vector compared   Table 3, as the latter represent an average of 100 individual R 2 test values for the randomized training/test subset pairs.
Because consensus requires at least two individual models an obvious choice was to select complementary composite PLS and KNN models (indicated by the green arrows in Figure 4). This choice allowed the construction of consensus models from pairwise averaging of the predictions from individual models using: i) the same algorithm but different bin size (IDs 2 and 4); ii) the same bin size but different algorithm (IDs 5 and 6) or iii) different algorithm and different bin size (IDs 1 and 3). The percentage improvement in consensus modeling over the average of the coefficients of determination of the individual models is carried out in Table 4. Although consensus between other pairs of individual models or of higher order (averaging predictions from more than two individual models) is possible and may perform better, the enormous number of such models was prohibitive and such efforts were not undertaken.
As can be seen from Table 4, both differences in the data processing algorithms and the granularity of the Table 3 Average statistical parameters of the best PLS and KNN models at a given number of LVs and neighbors as a function of the granularity of the 3D-SDAR space (Continued) 4   3D-SDAR space contribute to the improvement in consensus modeling. A comparison of the performance improvement of consensus models indicates that generally the models of type iii perform best. A possible explanation for this observation is that these models benefit from: i) the complementary information extracted from 3D-SDAR matrices of different granularity and ii) the utilization of different data processing algorithms. Among these 6 consensus models, the one averaging the predictions from the best performing PLS (10 ppm x 10 ppm x 0.5 Å bins, 7 LVs) and KNN (2 ppm x 2 ppm x 0.5 Å bins, 6 neighbors) individual models was characterized by the highest coefficient of determination (shown in Figure 5c and the last column of Table 2). To further understand the factors playing a role in consensus modeling and to explain the observed improvement over the composite PLS and KNN models, analysis based on training/test set pairs of individual models was carried out.
According to our initial hypothesis an improvement in consensus modeling would be observed only if the individual composite models account for complementary information (i.e., explain complementary portions of the variance in the biological data). For this purpose, the behavior of the individual 100 sub-models resulting in the best composite PLS and KNN models was investigated. If for each of the 100 training/test set pairs both algorithms capture almost identical structural information encoded in the 3D-SDAR descriptor pool, the corresponding R 2 test values generated on each cycle should be highly correlated and therefore no improvement in consensus modeling would be observed. In other words, the two algorithms would be somewhat redundant and the consensus R 2 test would be an average of R 2 test for the 100 individual sub-models. It has to be emphasized that such an experiment would be valid only in a case of matching training/test subset pairs. This condition is satisfied by the use of the same random seed for both PLS and KNN and a random number generator which was initialized after 100 runs.
Three different views of the 100 individual R 2 test values for the best composite PLS (10 ppm x 10 ppm x 0.5 Å bins, 7 LVs) and KNN models (2 ppm x 2 ppm x 0.5 Å bins, 6 neighbors), are shown in Figure 6. A plot of the ranked R 2 test values for each of these 100 models ( Figure 6a) indicated a similar level of performance of both algorithms. Figure 6a also demonstrates that some combinations of training/test subset pairs may produce highly accurate models with R 2 test reaching 0.9, while others may result in models with inferior performance (i.e., models in which the test set compounds are not well represented by the training set). Figure 6b shows a plot of R 2 test of matching training/ test subset pairs processed alternatively by PLS or KNN.   Although, there were PLS and KNN sub-models performing equally well (forming a cluster in the upper right corner or the plot), a significant portion of submodels predicted well by PLS were combined with inferior KNN models and vice-versa. This observation and the relatively low R 2 of 0.367 suggest that the two individual models reflect different structural patterns in the data and are partially "orthogonal". The distribution of ΔR 2 test PLS-KNN shown on Figure 6c indicated that a total of 28 models deviate by at least 1σ from the mean.
PLS outperformed KNN for 13 models while KNN performed better for the remaining 15 models. These 28 models, for which one of the algorithms succeeded in establishing a structure-activity relationship undetected by the other, were identified as a major contributing factor affecting the performance of consensus models. Thus, a consensus PLS-KNN model would benefit from the partial orthogonality of the PLS and KNN approaches on different sized bins and would outperform the individual composite models.

Interpretation
An essential part of the QSAR modeling process is the interpretation of the model in terms of structural variations leading to corresponding changes in the biological activity. For the purpose of interpretation the bins with the 10 most positive and negative PLS weights for each of the individual models forming the composite 10 ppm x 10 ppm x 0.5 Å PLS model were extracted and their relative frequencies of occurrence were calculated. Since each of the individual models utilized 7 LVs, a total number of 14000 positive and negative bins were extracted (2 × 100 sub-models × 7 LVs × 10 bins). Unique among these were 87 bins with positive weights and 74 bins with negative weights. Their corresponding relative frequencies of occurrence were calculated and ranked. For simplicity, only the topmost 20% unique positive and negative bins were mapped back to the 3D-SDAR abstract space (see Figure 7). A detailed examination of the 3D-SDAR maps shown in Figure 7 reveals that none of the bins with positive weights overlaps with any of the bins with negative weights: i.e., the structural features affecting binding (increasing or decreasing log(1/EC 50 )) are well separated. Therefore, compounds with 3D-SDAR fingerprints predominantly occupying bins with positive PLS weights will be stronger binders (highly toxic). Conversely, chemicals with fingerprint elements falling into regions of the 3D-SDAR space occupied by bins with negative weights will be weaker binders (less toxic). This hypothesis was tested using an in house program projecting some of the most frequently occurring positively and negatively weighted bins on the standard coordinate space. This projection allowed identification of subsets of structures in which these bins can often be found together.
As was expected, most of the bins characterized by positive PLS weights were found in the structures of PHDDs representing the most toxic class of chemicals in the dataset investigated (see Figure 8). Specifically, a subclass of polybrominated dioxins showed consistent presence of multiple positively weighted bins (see Figure 8). As anticipated, 7 of these were among the top 10 most toxic compounds in the dataset. However, infrequently occurring bins specific to compounds with peculiar structural features did not appear as highly ranked in the composite 3D-QSDAR models. This explains the absence of bins specific to the 2,8-dibromo-3,7-dichlorodibenzo-p-dioxin, since it is the only dioxin derivative with both Cl and Br substituents on the same ring.
Although in its current version 3D-QSDAR "sees" only the carbon atoms, inferences about their chemical environments can be easily drawn. For example Figure 9 shows that the 140 ppm -150 ppm, 110 ppm -120 ppm, 2.0 Å -2.5 Å bin is persistently occupied by carbon atoms neighboring the oxygen atoms in PHDDs, indicating the importance of oxygen atoms for binding to AhR [14,31]. Hence, the lack of oxygens in the structure of PCBs can be correlated to their weaker binding affinity (and consequently their lower molar toxicity).
In contrast, most of the negatively weighted bins were found to be present in the structures of PCBs. As can be seen from Figure 9, positions 2 and 2′ and (due to symmetry) positions 6 and 6′ are particularly affected and chlorine substitution at these positions will lower the toxicity of PCBs, compared to that of other chlorine substituted homologues.
As an intermediate chemical class with an average activity higher than that of PCBs and lower than that of PHDDs, the activity of dibenzofurans is affected by the presence/absence of structural patterns similar to those observed in the structures of both PCBs and PHDDs. For example, the presence of an oxygen atom resulting in a chemical shift range of the neighboring carbon atoms between 150 and 160 ppm will lower the EC 50 of PCDFs (higher toxicity). Analogously to the 2 and 2′ positions in biphenyls, chlorine substitution at positions 1 and 9 will result in PCDFs with toxicity lower than that of PCDF homologues substituted elsewhere.

Comparison to earlier models
Due to variability in the datasets and the multitude of available data processing algorithms and validation techniques, a direct quantitative comparison with the QSARs summarized in Table 1 is impossible. However, if one takes into account the much stringent validation criteria imposed in our work (vs the cross-validation procedures employed in [13][14][15][16][17][18][19][20][21]) it is clear that the 3D-QSDAR methodology performs at least on par with these earlier models. Similarly to CoMFA [17] on a qualitative level the 3D-QSDAR was able to recognize correctly the positions that affect the strength of binding to AhR. Since our work is based on a dataset originally compiled by Mekenyan et al. a more direct comparison with the QSARs reported in [22] was possible. Multiple separate QSARs for the three classes of PCBs, PHDDs and PCDFs with R 2 ranging from 0.640 (n = 30) to 0.899 (n = 14) were derived. The statistical parameters of a model combining the most planar PCBs, PHDDs and PCDFs (n = 80) were as follows: R 2 = 0.73; s 2 = 0.59; R 2 cv = 0.73 and F = 69.2. In comparison, for the complete set of 94 compounds our best consensus model produced an R 2 test of 0.685 and a q 2 LOO of 0.79 which are both close to the R 2 cv of 0.73 reported by Mekenyan et al.

Conclusions
We have introduced several validation techniques intended to improve the quality and reliability of individual and consensus QSAR models. Their use was illustrated on a dataset of 94 AhR binders modeled by 3D-QSDAR. The functional dependence between R 2 test and the number of training/test subset randomization cycles was used to determine the minimum number of cycles necessary to achieve convergence of R 2 test to its asymptotic "true" value. In this specific case, which uses 20% of the compounds as a hold-out test set, 100 randomization cycles proved sufficient for achieving convergence for both PLS and KNN models. The use of a distance measure (Tanimoto similarity) as a discriminant function in KNN was shown to produce models with performance similar to that of PLS when applied to the same dataset. A plot of R 2 test for matching test set pairs was used to demonstrate the partial orthogonality of PLS and similarity based KNN approaches on different bin granularity. However, further investigations may shed additional light on the character of the multiple factors playing role in the improvements observed in consensus modeling.
In the last stage of the modeling process the most frequently occurring positively and negatively weighted bins were projected back to the standard coordinate space to identify structural features related to toxicity. It was found that most of the highly ranked bins with positive PLS weights were specific to a class of polybrominated dioxins. The oxygen atoms of PHDDs and PCDFs participating in formation of donor-acceptor bonds with the receptor were associated with the high toxic effect of these two chemical classes. In the absence of other substituents, PCBs with chlorine atoms at positions 2 and 2′ (and due to symmetry positions 6 and 6′) were accurately predicted to be relatively weaker binders (less toxic).