Journal of Cheminformatics Research article DPRESS: Localizing estimates of predictive uncertainty

Background The need to have a quantitative estimate of the uncertainty of prediction for QSAR models is steadily increasing, in part because such predictions are being widely distributed as tabulated values disconnected from the models used to generate them. Classical statistical theory assumes that the error in the population being modeled is independent and identically distributed (IID), but this is often not actually the case. Such inhomogeneous error (heteroskedasticity) can be addressed by providing an individualized estimate of predictive uncertainty for each particular new object u: the standard error of prediction su can be estimated as the non-cross-validated error st* for the closest object t* in the training set adjusted for its separation d from u in the descriptor space relative to the size of the training set. The predictive uncertainty factor γt* is obtained by distributing the internal predictive error sum of squares across objects in the training set based on the distances between them, hence the acronym: Distributed PRedictive Error Sum of Squares (DPRESS). Note that st* and γt*are characteristic of each training set compound contributing to the model of interest. Results The method was applied to partial least-squares models built using 2D (molecular hologram) or 3D (molecular field) descriptors applied to mid-sized training sets (N = 75) drawn from a large (N = 304), well-characterized pool of cyclooxygenase inhibitors. The observed variation in predictive error for the external 229 compound test sets was compared with the uncertainty estimates from DPRESS. Good qualitative and quantitative agreement was seen between the distributions of predictive error observed and those predicted using DPRESS. Inclusion of the distance-dependent term was essential to getting good agreement between the estimated uncertainties and the observed distributions of predictive error. The uncertainty estimates derived by DPRESS were conservative even when the training set was biased, but not excessively so. Conclusion DPRESS is a straightforward and powerful way to reliably estimate individual predictive uncertainties for compounds outside the training set based on their distance to the training set and the internal predictive uncertainty associated with its nearest neighbor in that set. It represents a sample-based, a posteriori approach to defining applicability domains in terms of localized uncertainty.


Background
Early work on quantitative structure-activity relationships (QSAR) was primarily concerned with relating select physical properties to in vivo biological activity [1,2]. Ordinary least squares regression (multiple linear regression) was the analytical tool of choice, and the statistical questions addressed focused on whether a particular descriptor was significant or not. QSAR methods soon evolved, however, into being ways of identifying optimal physical properties rather than simply trends, a shift accomplished by fitting to quadratic and bilinear equations. This development was spurred in no small part by the desire to identify optimal octanol/water partition coefficients (logP), generally in pursuit of optimal in vivo activity.
The focus for pharmaceutical drug discovery subsequently shifted from in vivo testing to in vitro evaluation of interactions between candidate ligands and isolated enzymes or receptors. This change brought with it a shift of descriptors from measurable properties of compounds to computationally estimated properties of molecules, with the calculations in question often being based on (sub)structural descriptors. The next step was to take descriptors into account that were based on molecular structure but were not themselves measurable physical properties. Often these were more or less local in nature, and the purposes of doing the analysis shifted from identifying significant underlying relationships to the descriptors to identifying optimal substituents or substitution patterns. Interest in artificial neural networks (ANNs) [3] and partial least squares with projection onto latent structures (PLS) [4] as analytical tools increased at the same time. Questions related to validity of the model as a whole took center stage as the number of descriptors available proliferated [5,6], followed closely by a strong interest in predictivity and how best to establish applicability domains [7][8][9][10][11][12][13][14][15].
Today, however, the overall statistical properties of a particular QSAR are less relevant to medicinal chemists or environmental regulatory agencies. Recent pressure to simultaneously reduce clinical failures, ensure the safety of bulk chemicals [16][17][18] and reduce testing on animals have led to an increasing reliance on models for predicting off-target biological effects and toxicity. This use of QSAR models entails applications to more structurally diverse compounds, but it also changes the relative importance of different kinds of mistakes. If a structure is predicted to have a much higher affinity for the target than it actually does, the cost to a lead optimization program is limited to the synthetic resources wasted on that particular structure. Even that cost is mitigated if something useful was learned about the underlying structure-activity relationship (SAR) in the process. Such a false positive error in predictive toxicology, however, may mean that a life-saving (and profitable) drug never gets commercial-ized. Compounds mistakenly predicted to be inactivefalse negatives -represent a missed opportunity in the context of lead optimization, but they have the potential to be downright catastrophic (and ruinous) in the context of predictive toxicology.
Such considerations put a premium on being able to make a quantitative estimate of how reliable an individual prediction obtained from a given model is. What is more, answers to the question, "How reliable are the predictions about this particular molecule that I am considering for synthesis, clinical evaluation or registration?" are often most relevant for extrapolations to structures near the "outside" edges of the descriptor space defined by the training set. Hence, to be of practical use, constraints on applicability domains need to be "soft" -i.e., increase with distance from the descriptor space covered by the training set -but "hard" enough to indicate just how far outside the training set one can safely expect to go. They also need to provide a robust quantitative estimate of predictive reliability that is sensitive to local variations in the descriptor space. This paper presents a novel methodology for doing exactly that based on how close a new compound is to those in the training set and the distribution of internal predictive error across compounds in that set.

Classical statistical theory
The underlying model for linear regression on a vector X of p independent variables is reflected in Eq. 1, wherein Y is the response variable of interest, μ Y is the population mean of Y, β is a vector representing the sensitivities of Y to changes in X, and x is a vector of deviations in X from the population centroid μ X .
As indicated in Eq. 1, the error ε is assumed to be normally distributed with mean 0 and a standard deviation σ X . Best linear unbiased estimators (BLUEs) for the various parameters in Eq. 1 can be calculated from a sample T 0 of n observations (in QSAR, compounds) drawn from the full population, provided several preconditions are met [19]: 4. the error distribution ε is homoskedastic and independent of X and Y -i.e., its standard deviation is the Journal of Cheminformatics 2009, 1:11 http://www.jcheminf.com/content/1/1/11 same everywhere in the descriptor space, so σ X = σ for all X.
The corresponding regression estimators for each individual observation i and the overall standard error of regression s FIT are then given as shown in Eqs. 2 and 3.
where is the mean value of Y for the sample; x i = X i -X 0 , with X 0 being the sample centroid for X; and is the predicted value of Y at X i [19]. Note that s FIT is greater than the root mean square error (RMSE); this is because the means and X 0 and the calculated coefficient vector b are themselves estimates that are subject to sampling error, with 1 and p degrees of freedom, respectively.
Under these assumptions, the potential error in estimating Y increases as one moves away from the centroid X 0 . As a result, the uncertainty s u in predicting the value of Y at some new ("unknown") value X u is generally greater than s FIT . In fact, under the assumptions given above [19]: where s u is the expected standard error of prediction (uncertainty) for the new observation u and n is the number of training set observations t used to build the model. The Mahalanobis distances d 0, u and d 0, t are measured in the model space defined by b, i.e., they are weighted Euclidean distances between the centroid X 0 of the descriptor matrix for the training set and the vectors X u and X t , respectively.
The rationale behind the "extra" terms in Eq. 4 is straightforward. For any random sample, the error involved in using as an estimate of μ Y is inversely proportional to n -hence the 1/n term in Eq. 4. In addition, the accuracy with which β is estimated by b is inversely proportional to how thoroughly X is sampled by the training set, but how much difference that makes to the error is directly proportional to the distance d 0, u between X u and X 0 in the model space. Together these countervailing effects of variation in X account for the second term within the outer brackets.

Dealing with violated assumptions
The value of s u produced by Eq. 4 is a best linear unbiased estimator of σ u -provided the assumptions underlying its derivation hold. Unfortunately, one or more of those assumptions are violated in most QSAR applications. In particular: 1. the dependence of Y on X rarely fits the prescribed function perfectly, linear or otherwise; 2. the training set used is usually a non-random sample, its selection biased by matters of historical accident and convenience that reflect the historical trajectory of the synthesis program that motivated the analysis; 3. the descriptors contributing to X are often correlated to a greater or lesser degree and hence are not independent variables in the statistical sense (correlation implies lack of independence, but the inverse is not true: lack of correlation does not imply statistical independence); and 4. ε is usually heteroskedastic -its standard deviation σ X is often different in different regions of the descriptor space.
Most or all of the assumptions are, in fact, explicitly violated when ANNs, PLS, variable selection, quadratic regression, or bilinear regression techniques are applied, with the result that s FIT and the estimator given by Eq. 4 underestimate the actual uncertainty of prediction, often drastically.
Several groups have derived theoretical variations of Eq. 4 for use with PLS and principal component analysis (PCA) that seek to address departures from ideality [20][21][22].
Unfortunately, subsequent work has demonstrated that these methods are often not robust when applied in realistic situations [23].
An alternative, completely empirical approach to assessing aggregate predictive uncertainty is cross-validation, in which each compound in the training set is held back in turn [24]. The value of Y for the held-back compound is then predicted using a model built from the other n -1 compounds in the training subset T u = T 0 -{u}. In parallel to Eq. 3, the standard error of cross-validation s CV is calculated from the predictive error sum of squares (PRESS) according to equation 5: where is the value of Y predicted by applying the reduced model built from the n -1 compounds in training subset T u to X u and is the corresponding predictive error. The summation is indexed across u to emphasize that prediction is external to the training subset used in each case. Here p represents the number of PLS components included in the model rather than the number of descriptors.
Cross-validation statistics were originally employed in PLS solely as a way to determine an optimal model complexity, a role for which the classical goodness-of-fit measure r 2 used in ordinary least squares is unsuited [24]. It has since come to widely used to assess predictivity, however. This use is unfortunate, in that a poorly predictive model will have a high s CV and a low q 2 , but the converse may or may not be true: good cross-validation statistics may be due to redundancies in the training set rather than truly robust predictive performance [25][26][27][28]. Some workers prefer to use "leave-some-out" cross-validation -in which several compounds are held back together -to address this problem. Nonetheless, the LOO standard error is the best estimate of the full model's predictivity for each individual compound in the training set [29], which makes it is a reasonable starting point for estimating a model's predictive reliability for structures occupying nearby points in the descriptor space.
Violation 4 -that error is not identically and independently distributed across compounds -is especially problematic for QSAR analyses. In one recently described case in point, the variation in predictive error was clearly correlated with one of the two descriptors being used [7]. If that is true when many descriptors are involved (as is the case for PLS), the overall variability in predictive error should be similar across the full range of Y. Such a distribution of error is, in fact, often seen in place of the quadratically increasing spread implied by Eq. 4 [30]. This makes it all too easy to make the unjustified leap to the unjustified conclusion that the aggregate predictive uncertainty -typically s CV or the root mean square error of prediction for an external test set (RMSEP or s PRED ) -is a reliable indicator of the level of uncertainty associated with individual predictions: independence from Y does not imply independence from X.

Partitioning the PRESS
The increasing reliance of drug developers on tabulations of predicted properties makes getting accurate estimates of the uncertainty σ u for individual predictions critically important. Unfortunately, it is rarely if ever possible to construct a unified global model for the dependence of σ u on X. It is neither necessary nor even desirable to do so, however. A better approach is to shift from the classical, descriptor-based view of regression to a sample-based formalism such as that used in the SAMPLS algorithm [31]. This algorithm exploits the fact that Eq. 2 can be recast as Eq. 6 without loss of generality: is the covariance between x t and x i and v t is a weight vector that is specific to compound t. Basically, Eq. 6 says that activity can be expressed as a linear function of the similarities of each compound to each of the other compounds in the training set. This suggests that the observed predictive error e u can be cast as a sum of contributions from each compound in the training set that increases with similarity to those compounds, which is consistent with the observation that predictive error tends to increase with distance from -i.e., tends to decrease with increasing similarity to -compounds in the training set [11,12,15]. If t* is the closest (i.e., the most similar) such compound, its standard error (s t* ) is a reasonable first approximation to the predictive error s u for a new compound. In most QSAR applications, a single response value Y t is assigned to each compound in the training set, so the best estimate of s t is simply |e t |, where e t is the deviation seen for t in the full, non-cross-validated model, i.e., the residual error of fitting.
Though the "true" dependence of predictive uncertainty on the Euclidean distance d t*, u from t* is unknown, its dependence on distance can likely be approximated by a Taylor expansion in which all but the first, linear term in d is dropped. Taken together, these considerations yield the estimator defined by Eq. 7: where d 00 is the length of the vector x 00 defined by the standard deviations of the descriptors; d 00 = 1 when descriptors have been centered and autoscaled, as was the case here.
The problem then becomes one of estimating the predictive error γ t associated with each compound t in the training set. PLS tends to overfit, so this term is likely to be greater than s t* ; otherwise Eq. 7 would parallel Eq. 4 exactly, except for the loss of the 1/n aggregation term within the brackets. Instead, one can turn to the squared predictive errors collected during cross-validation. In the calculation of the aggregate predictive uncertainty s CV (Eq. 5), these are lumped into a single sum -the PRESS. If, however, one assumes that contributions from nearby training set compounds dominate the predictive error and, further, that the value of γ will be comparable for the training subset compounds closest to each individual compound u, the contribution that cross-validation of the i th compound makes to the PRESS can more appropriately be distributed across the training subset in inverse proportion to the distances between X i and the n -1 compounds used to predict Y i (Eq. 8 and Fig. 1). A similar approach is taken to distributing response variance across the various sources of deviation from the mean in classical analysis of variance (ANOVA).
The normalization factor α i in Eq. 9 is necessary to ensure that the distribution is a partition -i.e., that the contributions from the cross-validation step in which compound i was set aside sum to the observed cross-validation error in prediction .
A small constant (1/n) needs to be included to prevent the reciprocal from "exploding" at small distances. Basically, it dictates the distance at which error is expected to distribute evenly. The choice of this particular value is somewhat arbitrary, but 1/n works well and nicely accommodates the tendency of data points to get closer together as the training set gets larger. Taken together, Eqs. 7-9 define the Distributed PRedictive Error Sum of Squares (DPRESS) approach to estimating predictive uncertainty.

Results
The suitability of DPRESS or any other quantitative model of predictive uncertainty is best evaluated by applying it to experimental QSAR data sets. Here, DPRESS is tested against PLS models obtained using a 3D descriptor (com- Schematic representation of predictive error distribution in DPRESS parative molecular field analysis, or CoMFA [32][33][34]) and a 2D descriptor (hologram QSAR, or HQSAR [35][36][37]). A large data set (N = 304) was used to insure that the number of compounds held back to evaluate external predictivity was much greater than the numbers needed to train a reasonably robust model.

The data set
The set of structurally diverse cyclooxygenase inhibitors examined here was originally compiled by Chavatte et al. [38]. It includes data on five major and three minor structural classes (Fig. 2) of inhibitors of the inducible form of the enzyme (COX-2). This data set is attractive because the target has been a major focus of research on anti-inflammatory drugs and because it combines substantial structural variation with a few key shared elements such as the distal sulfonyl (SO 2 CH 3 ) or sulfamoyl (SO 2 NH 2 ) group.
In addition, regression models based on this data set are well-characterized in terms of predictive robustness [25,28] and with respect to variations in how training subsets are selected [39]. Finally, the uneven representation of the different core structures reflects a sampling bias that is typical of the data sets used to build QSARs. Fig. 3A shows how activity is distributed across the various structural classes when the compounds in the data set are projected into two dimensions using embedded non-linear mapping [40,41] based on the similarity in their molecular fields: symbols are colored by structural class and sized by activity. Clearly, no one structural class has a monopoly on high activity. Fig. 3B shows the distribution of activity across the descriptor space defined by the compounds' molecular holograms. Molecular fields are 3D descriptors, which are more generalized than holograms -2D descriptors derived from substructure counts. The more literal character of holograms leads to smaller distances between inhibitors within classes relative to the distances between classes, which accounts for the greater between-class resolution in Fig. 3B. It also accounts for the fact that the sulfonyl and sulfamoyl subclasses are cleanly separated in the hologram space ( duces a biased training set because, as in most such data sets, the major structural classes are not evenly represented (Fig. 2). Therefore diverse but representative ("boosted" [39]) training sets were generated by independently drawing five training (sub)sets of 75 compounds from the full set using optimizable k-dissimilarity (OptiSim) selection [39,42,43]. Models based on those training sets were then used to predict the activities of the 229 inhibitors not used to construct them. Three additional training sets were drawn at random, only one of which gave acceptable internal cross-validation statistics. Representation in the full data set is biased, so such simple random subsets are biased as well. The results obtained using that training set (set R) are included here to illustrate the effect of sampling bias due to structural redundancy [39,44,45].

CoMFA models
The optimal number of components p* for the CoMFA models obtained for the boosted training sets ranged from three to seven. It is not appropriate to compare models that differ in complexity directly, however, so a consensus complexity of p = 6 was used in all cases. The corresponding leave-one-out (LOO) cross-validated standard errors (s CV ) ranged from 0.681 to 0.762, corresponding to internal predictivities (q 2 ) of 0.537 to 0.337. The non-crossvalidated models exhibited standard errors of regression (s FIT ) ranging from 0.279 to 0.398, corresponding to r 2 values between 0.901 and 0.827. Calculating the root mean square error for external predictions yielded s PRED = 0.633 to 0.655 -i.e., the internal cross-validated error underestimated the overall accuracy of external prediction somewhat.
In contrast, the biased training set R yielded a cross-validated standard error (s CV ) of 0.489, corresponding to a q 2 of 0.696. The overall goodness-of-fit statistics for the noncross-validated model were s FIT = 0.279 and r 2 = 0.901. As expected, however, the predictive performance on those compounds not in R was substantially worse than that of the boosted training sets, with s PRED = 0.744. Several conclusions can be drawn by comparing the distribution of errors to each other and to the distribution of activities. Firstly, though the distributions of observed predictive errors for the three models differ from one another (Fig. 4A vs 4B vs 4C), they resemble each other more than they resemble the distribution of activity itself (Fig. 3A). Secondly, the larger observed errors are not particularly concentrated among the singletons or at the edges of the descriptor space, as would be expected for the ordinary least squares distribution expected based on Eq. 6 and in most published approaches to establishing applicability domains. Thirdly, the distributions of predictive uncertainty seen for the boosted training sets are in good overall agreement with the observed errors with respect to the regions of descriptor space where the observed error is relatively high or low (Fig. 4D vs 4A and 4E vs 4B). Though somewhat less evident, the same is true for the model constructed using the biased training set R (Fig. 4F  vs 4C). Finally, the smaller errors predicted by the boosted training with the better internal predictivity (Fig. 4E vs 4D) do seem to be realized in the localized errors actually observed (Fig. 4B vs 4A), even though this was not obvious in the aggregate statistics (s PRED = 0.637 and s PRED = 0.633, respectively).
Interpretation of the plots shown in Fig. 4 is complicated because the uncertainty s u is a measure of the spread in predictive error at X u ; the expected value of the error is still 0.
If is an accurate prediction of uncertainty, the magni- = 2|e u | less than 5% of the time. This is clearly not the case when the cross-validated error for the most similar compound t* in the training set is taken as a direct estimate of , i.e., when γ t is set equal to 0 for all t (Fig. 5A).
There are fewer unduly low predicted uncertainties for the biased training set R, but still more than would be expected by chance (Fig. 5B). Note that the bias evident in the model constructed from R comes mostly in the form of negative residuals, i.e., predicted activities that are larger than the observed activities. Such false positives account for most of the "extra" out-of-bounds errors seen in Fig. 5B. The distributions of errors for the boosted training sets are much better behaved; indeed, the predicted uncertainties are slightly more conservative than necessary for large errors in prediction ( Fig. 5C and 5D).

HQSAR
HQSAR analyses were carried out as a complement to the results obtained in the CoMFA studies described above. The 2D molecular holograms used were built up from the number of each kind of substructure comprised of between four and seven heavy atoms, the counts being mapped down into count vectors of various lengths by hashing [37]. HQSAR models were then constructed bŷ The consensus optimal complexity across the boosted training subsets was five components, in keeping with the full data set's having nearly four times as many compounds and, therefore, containing substantially more information. The s CV values obtained ranged from 0.691 to 0.776 versus a value of 0.540 for the biased training set R; the respective q 2 values were 0.386 to 0.514 and 0.623.
The s PRED for the boosted subsets ranged from 0.619 to 0.669 and the corresponding value for the biased subset was 0.735. Hence HQSAR performance followed the trend seen for CoMFA: cross-validation under-estimated the predictive error substantially for the biased subset (i.e., was overly optimistic about the extensibility of the model) and over-estimated the predictive error slightly for the boosted training sets. It differed in that it was the boosted training set B which gave the better external predictive performance.
The distribution of observed predictive errors and predicted uncertainties across the hologram descriptor space are shown in Fig. 6 for the model based on boosted training set B, and the corresponding plots of as a function of e u are shown in Fig. 7. Note that the predicted uncertainties for the boosted HQSAR models were more conservative than those for the CoMFA models discussed s u Predictive uncertainty as a function of the observed error for the CoMFA models s uŝ tŝu above, with the result that the magnitudes of nearly all errors above 0.75 log units were less than the corresponding . This effect is probably a side-effect of the exaggerated separation between classes seen in the hologram space (Fig. 3B).

Discussion
The degree to which any QSAR can be extended to compounds outside of the training set used to construct it is necessarily limited to some degree by the structural diversity of that training set. Some extensibility is necessary, however, if the QSAR is to be of use for something beyond mere rationalization of known activities. When only a few descriptors are being considered, it may be possible to restrict the applicability domain to "internal" regions in the descriptor space, but as the number of descriptors increases distinguishing compounds that lie "outside" the space defined by the training set from those that are "inside" becomes progressively less meaningful. Regardless of the complexity of the system, it is clear that one will often need to extend the applicability domain beyond the training set somehow. It is equally clear that this must be done cautiously, however, and that it would be desirable for the degree of caution to reflect the idiosyncrasies of the QSAR being examined. It would be particularly desirable to take local variations in the uncertainty of predictions into account, rather than trying to find a single acceptable distance to the model that is applicable across the entire descriptor space [12,14,15].
DPRESS was formulated to address these needs. It is based on two simple assumptions: that the uncertainty in prediction for new objects (e.g., molecular structures) is likely to be dominated by the error in prediction for objects near them in the descriptor space; and that this influence is, to a first approximation, inversely related to the distance between them. The "true" dependence may well be more complex in some cases, but the size of the training set needed to characterize that dependence will almost always be impractically large. In any event, such dependence is likely to reduce to a linear relationship over the relatively short ranges of QSAR extrapolations that have any chance of being relevant.
The fact that the predictive uncertainties derived from DPRESS analysis are sometimes more conservative than necessary for large errors is of some concern, though that is certainly preferable to the alternative of their being overly optimistic; further work in this area is a matter on ongoing investigation. Nonetheless, the method is intrinsically less constraining than the classical quadratic relationship based on distance from the mean (Eq. 4). Given how much predictive error varies across the model space (e.g., Fig. 4), any approach based on the overall s PRED seems bound to be overly optimistic regarding the reliability of predicted potencies for some compounds.
The underlying QSARs examined here -CoMFA and HQSAR -both rely on (nominally [46]) linear PLS, but there is no intrinsic reason that the method cannot be more broadly applied. The key point is that the error being distributed must be predictive -i.e., it needs to reflect predictions made for objects not included in the training set. LOO cross-validation yields the most information for any given dataset, but a leave-some-out approach should be a viable alternative. The predictive errors obtained from the validation sets often used in ANN analysis could be used as well, since there is no intrinsic reason that a linear model for local error distribution should be incompatible with a QSAR that is non-linear on a global scale.
The usual reasons for preferring LSO over LOO cross-validation are unlikely to be relevant to DPRESS calculations, however. LOO can indeed be distorted when the training set is biased due to redundancy, but DPRESS based on LOO turns out to be conservative in such a situation (see above). The reduction in s CV that occurs when the sampling density in one particular area of descriptor space is high is reflected in a reduction in the error that each individual prediction contributes to the PRESS. But is not a root mean square, so the effect on its value is offset by the fact that biased sampling necessarily: increases the total number of errors; decreases their spread (d 00 ); increases the distance between the training set and most new observations; or effects some combination thereof.
Diverse training sets representative of the full structural space produce more reliable local uncertainty estimateŝ s uŝ u than do biased training sets, indicating that taking care to avoid undue sampling bias (redundancy) in the training set is worth the effort. Even the biased training set R, however, did better than setting the uncertainty of prediction for a new object equal to the observed error for the closest object in the training set ( Fig. 5 and 7). Moreover, the errors falling outside the range expected based for the calculated for R were false positives, the least serious type of error to make when trying to predict toxicity.
There are two fundamental differences between the estimate of predictive uncertainty derived from classical theory (Eq. 4) and the DPRESS model represented by Eqs. 7-9. The first difference is that Eq. 4 is a sum of squares, whereas Eq. 7 is a sum of linear terms. Using a sum of squares formulation was considered for DPRESS, but was found to consistently overestimate the uncertainty of prediction (details not shown). The second difference is that Mahalanobis distances d measured in the model space are used in the classical model, whereas Euclidean distances measured in the descriptor space are used in DPRESS. The less parametric approach is followed for DPRESS because the variation in one or more variables in a particular training set may not be large enough to reveal the influence that variable might exert if examined across a greater range. The small coefficient assigned to such a variable in that event means that substantial deviations in its value will have a negligible effect on distances in the model space. Sticking with distances in the "raw" descriptor space rather than using the descriptor weights from b to calculate a Mahalanobis distance is more conservative -it assumes that variation in things that have yet to be explored are likely to make predictions less reliable.

Conclusion
Examination of the distribution of predictive errors across the descriptor space makes it clear that errors are consistently larger in some regions than in others -i.e., the predictive error is heteroskedastic (Fig. 4). Given that a major use of QSAR predictions is in chemoinformatic tabulations used by medicinal chemists and other third parties, it would be good practice to routinely attach some estimate of uncertainty to each prediction. Doing so based on some analytical estimator would be preferable, but is impractical in most real-world situations because it requires detailed a priori knowledge of the global dependence of error on the descriptors. In the absence of such knowledge, a locally linear estimator of predictive reliability that is embedded in the sample space represents a reasonable alternative. Partition of predictive error sum of squares (DPRESS) provides just such an estimator in a form -that of a standard error -that is widely understood by those likely to use it. The calculations involved are straightforward and the estimator produced is a qualita-tively ( Fig. 4 and 6) and quantitatively ( Fig. 5 and 7) reliable estimate of how much confidence one should place in the associated prediction. Moreover, though the particular applications studied here involved PLS models built using 2D and 3D descriptors, the technique is likely applicable to any regression method that can be reformulated in kernel-based terms [12,47].
It is also important when constructing the model in the first place to examine the distribution of predictive error in the descriptor space. If uncertainty is homoskedastic, a classical or uniform distribution model may provide a somewhat more precise estimate of predictive uncertainty. Should (e)NLM or principal components analysis (PCA) indicate heteroskedacity, however, a DPRESS calculation should be carried out before applying the model -e.g., for prioritizeing compounds for synthesis, acquisition or detailed testing. DPRESS may also serve to highlight regions of structural space from which more data needs to be gathered.

Experimental
Ordinary multiple linear regression is not suitable when the number of descriptors in a data set exceeds the number of observations. PLS [4] was used instead, with the appropriate number of latent variables (components) to include (i.e., the model complexity) being the number corresponding to the first minimum in the "leave-oneout" cross-validated standard error (s CV ). This measure of internal consistency is obtained by setting aside each of the n compounds in the training subset in turn and trying to predict its activity using the other n -1 compounds in the training set. The external error of prediction (s PRED ) was calculated as the root mean square error for the N -n compounds left out of the model calculation altogether.

Training set selection
Boosted training sets were obtained by applying OptiSim selection to the full data set. OptiSim selection entails drawing a series of random subsamples of size k from the data set of interest. For each subsample in the series, the individual that is most different from those selected from previous subsamples is extracted and added to the selection set S. This procedure results in a representative but diverse selection set that samples the full data set space both efficiently and effectively [42]. Here the structural space was defined in terms of the Tanimoto similarity T(a, b) between the corresponding UNITY substructural fingerprints [48]. The individual a in the i th subsample for which max (T(a, b): b ∈ S) is smallest was added to S. Candidates with a Tanimoto similarity greater than 0.8 to any compound already in the selection set were deemed redundant and were excluded from subsamples.
The selection process was repeated five times with n = 75 and k = 4, using a different random number seed each time. Five inhibitors appeared in every boosted training set, including the thiophene, cyclopentadiene and isoxazole analogs that fall outside the five major classes. A total of 113 inhibitors were not selected for any of the boosted training sets, whereas 191 were selected for at least one of them.

Molecular fields
CoMFA involves using PLS to identify correlations of biological activity with variations in steric and electrostatic molecular fields, which requires that the molecules under consideration be put into similar conformations and into a common frame of reference as a key part of the process.
Here, conformations were set and molecular structures aligned based on the homologous atoms in their central and peripheral rings, as has been described in detail elsewhere [39]. Charges were calculated using the method of Marsili and Gasteiger [49], as extended in SYBYL [50] to take the distribution of π electrons into account ("Gasteiger-Hückel charges"). Coulombic and Lennard-Jones interaction energies were calculated on a 2 Å rectilinear grid extending 4 Å beyond the edge of any molecule in the full data set. The probe atom used to calculate the fields was an sp 3 -hybridized carbon monocation. Interaction energies were truncated at nominal values above 30 kcal/mol, and electrostatics were ignored within the steric envelope of each inhibitor.

Molecular holograms
The first step in constructing a molecular hologram is to identify all substructures in a molecule that fall within a specified size range -here, all fragments made up of four to seven atoms, with hydrogens ignored and bond types taken into account. Each fragment is then mapped into a compressed count vector of specified length using a hashing function, so that the elements of that count vector can be used as descriptors in subsequent PLS analyses [36]. The hashing means that different fragments may map to the same position in the final count vector. The fragments overlap, however, so each substructure contributes to many fragment counts. The result is that the noise introduced by "collisions" for a few subfragments constitutes a relatively minor perturbation that is, on average, self-limiting. Overfit PLS models are characteristically unstable to such perturbations, however, so surveying a range of hash lengths and picking one with good but representative statistical properties is a good way to avoid picking a length whose cross-validation statistics are overly optimistic. This is a non-parametric perturbation analysis analogous to looking at the effect of small perturbations in response to assess model stability [28].

Visualization
2D depictions of the relationship between different compounds were obtained using the embedded non-linear mapping (eNLM) facility [40] in Benchware DataMiner [51]. "Ordinary" NLM can be thought of as placing springs between all pairs of points in the original descriptor space, then compressing the ensemble into two dimensions in such a way that the residual tension in those springs is minimized. Embedded NLM differs in that parts of springs longer than some specified threshold length (horizon) are treated as elastic to extension, i.e., they do not contribute to the overall stress in the system. Here, spring "tensions" were based on the block-wise autoscaled Euclidean distances ("CoMFA Standard scaling" [32]) between the molecular fields or between the molecular holograms of different compounds.

DPRESS
CoMFA and HQSAR analyses were carried out in SYBYL. The distances d t, i used to partition the PRESS were taken from the SAMPLS.dist file generated by the SYBYL interface as input to the SAMPLS program [52] and represent inter-observation distances in the descriptor space after autoscaling has been applied. The descriptors used here are already either fully commensurate (HQSAR) or are piecewise commensurate (within steric and electrostatic fields but not between them, for CoMFA), so "CoMFA standard" (block) autoscaling was used [33]. Observed and predicted responses were taken from the SAMPLS.out file generated by the SAMPLS program.
Localized predictivity estimates were calculated by combining scripts written in SYBYL programming language (SPL) with spreadsheet manipulations carried out in Excel. For each compound t in the training set, the scaling factor γ t was calculated based on the observed predictive variance (squared cross-validation error of prediction, δ i 2 ) for every other compound in the training set (i ≠ t) weighted inversely by the square of the Euclidean distance between the two (d t, i ) in the descriptor space (Eq. 8). A normalization factor α i for each compound i was calculated as the sum of squared distances to all other compounds in the training set (Eq. 9). A limiting proximity term of 1/n was included to ensure reasonable behavior for closely-spaced compounds where d t, i approaches 0; this works well when the descriptors have been autoscaled in some way before use.
The individual scaling factors γ t obtained from the n LOO cross-validation errors for the training set were used to calculate an estimate for the predictive uncertainty associated with each new structure u based on the observed cross-validation error of the training set compound (t*) lying closest to it in the descriptor space, its distance from s u