Exploring non-linear distance metrics in the structure–activity space: QSAR models for human estrogen receptor

Background Quantitative structure-activity relationship (QSAR) models are important tools used in discovering new drug candidates and identifying potentially harmful environmental chemicals. These models often face two fundamental challenges: limited amount of available biological activity data and noise or uncertainty in the activity data themselves. To address these challenges, we introduce and explore a QSAR model based on custom distance metrics in the structure-activity space. Methods The model is built on top of the k-nearest neighbor model, incorporating non-linearity not only in the chemical structure space, but also in the biological activity space. The model is tuned and evaluated using activity data for human estrogen receptor from the US EPA ToxCast and Tox21 databases. Results The model closely trails the CERAPP consensus model (built on top of 48 individual human estrogen receptor activity models) in agonist activity predictions and consistently outperforms the CERAPP consensus model in antagonist activity predictions. Discussion We suggest that incorporating non-linear distance metrics may significantly improve QSAR model performance when the available biological activity data are limited. Electronic supplementary material The online version of this article (10.1186/s13321-018-0300-0) contains supplementary material, which is available to authorized users.


Introduction
Identifying and understanding the connection between chemical structure and biological activity is a central problem in contemporary pharmacology and toxicology. Advances in such understanding could facilitate in silico discovery of novel drug candidates and give rise to more efficient methods for computational screening of environmental chemicals for potential adverse effects on human health [1,2]. QSAR models address this problem by establishing structure-activity relationships from available chemical and biological data (training set) and using these relationships to estimate biological activities of other chemicals (evaluation set). In order to do so, QSAR models often utilize structure-activity landscapes, i.e., biological response surfaces in the structure-activity space reconstructed from the training set data [3]. The structure-activity landscapes are particularly useful for identifying chemical space domains where activity smoothly depends on structure ("rolling hills") and those where small structural changes lead to significant changes in activity ("activity cliffs") [4]. However, the limited size of typical training sets translates into the limited "resolution" of the reconstructed structure-activity landscapes: the latter only reveal net activity changes from one training set chemical to another but not details of the structure-activity relationship in-between these chemicals [5]. For example, if a training set only includes chemicals with similar activities, the reconstructed structure-activity landscape will be smooth, even though the actual structure-activity landscape may be rugged because of other chemicals with significantly different activities. In that case, the limited size of the training set may result in disappointing accuracy of QSAR model predictions [5]. Since activity cliffs are essential for specificity of many biological targets, most notably receptors, the limited amount of available activity data is a fundamental challenge that QSAR models face.
To address this challenge, we introduce and explore a QSAR model based on custom distance metrics in the structure-activity space. The distance metrics are designed to place higher (or lower, depending on the model parameters) weights on structurally close chemicals and chemicals with higher biological activities. We build our model on top of a simple approach that directly applies the similarity principle-the k-nearest neighbor (kNN) model [6]. Whereas the kNN model with non-Euclidean distances have been in use for decades [7], this, to the best of our knowledge, is the first attempt to incorporate non-linearity not only in the chemical structure space, but also in the biological activity space. We term this approach the generalized k-nearest neighbor (GkNN) model. Since we focus on the effects of the nonlinearity of the distance metrics rather than the choice of a specific metric, we do not perform feature selection [8] but rather utilize conventional chemical fingerprints and similarity measures.
We evaluate the GkNN approach by building and tuning a model for human estrogen receptor (hER) activity using data from the US EPA ToxCast [9] and Tox21 [10] databases. Because of the critical regulatory role of the hER as a part of the endocrine system, the influence of chemicals on its activity has been extensively studied using a variety of methods such as molecular dynamics and docking [11,12], CoMFA [13], pharmacophore-based QSAR modeling [14], and high-throughput screening [15]. We compare the performance of the GkNN-hER model with the recently developed CERAPP (Collaborative Estrogen Receptor Activity Prediction Project) consensus model built on top of 48 other classification and regression models [16].

Chemical and biological data
The training set included 1667 chemicals from the Tox-Cast database [9]. The training set chemicals were curated while they were prepared for the CERAPP collaboration; the curation procedure is described in the CERAPP article [16]. The chemicals had hER agonist, antagonist, and binding activity scores on the scale from 0.0 (inactive) to 1.0 (active). These activity scores were derived from a model that combined data from 18 in vitro hER assays using a variety of different cell types and readout technologies [2]. Because all assays yield some false positives and false negatives, we created a model to quantify our belief that the activity was "true" (i.e., it arose from interaction of the chemicals and the hER), or false (i.e., it arose from some form of technology interference or simple experimental noise) [2]. The activity value for a chemical represents an estimate of potency (the higher the value, the lower the concentration of the chemical that is required to activate the receptor), but also a certainty that the chemical actually interacts with hER [2]. Chemicals with low activity values (e.g., below 0.1) have a higher chance of being false positives than do chemicals with values well above this cutoff. To reduce the uncertainty, a small number of chemicals with activity values between 0.01 and 0.1 was removed from the training set.
The evaluation set included 7221 chemicals from the CERAPP database [10] with AC50, IC50, and/or other hER activity measures reported in the literature [16] (see Additional file 1: Fig. S1). Agonist and antagonist activity scores on the scale from 0.0 to 1.0 for these chemicals were estimated from their AC50 values that constituted the vast majority of all activity data (39,804 out of 44,641 records for agonist activity) and the dependence obtained from the training set [9]. A small number of chemicals with missing AC50 data were not included in model evaluation. For each chemical, activity scores from different sources were averaged. In this larger dataset from Tox21 and the open literature, we observed the same lack of consistency from one assay to another (or one lab to another) in activity, and the range of values from 0.0 to 1.0 again represents a combination of estimated potency (higher values are more potent) and certainty of a true interaction with hER (higher values are more certain to be true actives).
In addition to the entire evaluation set, calculations were performed with its subsets that included more than 3, 5, 7, or 9 consistent activity sources per chemical, respectively. Consistent means that the majority call (active or inactive) had to occur in at least 80% of cases for a chemical. As chemicals required more consistent data (either positive or negative), the quality of the biological data increased, but the number of chemicals decreased.

Structure-activity space
To visualize positions of the training set and evaluation set chemicals in the chemical structure space, we performed principal component analysis (PCA) on the fingerprints of the training set chemicals. The analysis was performed independently for Morgan and Indigo full fingerprints, and positions of the chemicals were described by their projections on the first three eigenvectors. In addition, relative positions of the chemicals were characterized by the distributions of pairwise molecular similarities (analogs of the radial distribution function commonly used in statistical mechanics) [17,18]. To characterize how much positions of chemicals in the chemical structure space depend on the choice of the specific fingerprint, we compiled lists of nearest neighbors for each training set chemical using Morgan and Indigo full fingerprints, respectively.
The extent of ruggedness of the structure-activity landscape was described by the structure-activity landscape index [3] is the activity score of chemical i and S ij is the similarity between chemicals i and j . The distribution of the pairwise SALI values characterized the entire structure-activity landscape, whereas the maximum value per chemical max j SALI ij identified specific chemicals that form activity cliffs.

GkNN model
The model estimates biological activity of a chemical as a non-linear weighted average over activities of k most similar chemicals from the training set: where A j is the activity score of chemical j and S ij is the molecular similarity between chemicals i and j . The activity scores vary continuously in the range from 0.0 (inactive) to 1.0 (active), and a chemical is classified as active or inactive depending on whether its activity score exceeded a specified cutoff. The similarities vary continuously in the range from 0.0 to 1.0. The similarity to the closest chemical from the training set q i = max j S ij characterizes the confidence in the estimate. Tunable parameters x and y characterize non-linearity in the biological activity space and the chemical structure space, respectively.
The GkNN model was compared with three other variations of kNN models suggested earlier [19]: These models are based on arithmetic averaging of the nearest neighbor activities (Eq. 2), geometric averaging of these activities (Eq. 3), and exponential averaging of these activities weighted by distances to the neighbors in the chemical structure space (Eq. 4). In the exponential model, we assumed that the distances are related with molecular similarities as d ij = 1/S ij − 1 and added a tunable parameter X that varied between 0.1 and 10. Molecular similarities were calculated using MACCS keys, Daylight, Morgan, or Indigo full fingerprints and Tanimoto similarity [20]. Calculations with Morgan fingerprints folded to 1024 bits and 16,384 bits, respectively, produced nearly identical results, indicating that increasing the fingerprint folding size beyond about 1000 bits has negligible influence on the performance of QSAR models. Whereas the obtained activity estimates were qualitatively similar for all fingerprints, using Morgan or Indigo full fingerprints consistently resulted in more accurate estimates.

Parameter tuning and evaluation
The accuracy of agonist, antagonist, and binding activity estimates obtained using the GkNN model and other models was characterized by the following metrics Here, TP, FP, FN, and TN indicate the numbers of true positive, false positive, false negative, and true negative evaluations, respectively. These numbers were obtained by converting continuous activity estimates to binary classes using the same activity threshold of 0.1 that was used for the training set.
To identify the values of parameters k , x , and y that yield the most accurate estimates, leave-one-out crossvalidation calculations for the training set were performed with every combination of the model parameters from the following lists (2560 combinations total): Since different parameterizations of the model were found to maximize different accuracy metrics, parameterizations were ranked by the score defined as the product of balanced accuracy, accuracy, and ROC AUC.
Parameterizations that maximize this score were also found to result in nearly maximum values of individual accuracy metrics, indicating that this score provides a robust characteristics of the QSAR model accuracy. Optimal parameterizations were independently identified for agonist, antagonist, and binding activities.
Model evaluation included estimating agonist and antagonist activities for the evaluation set chemicals. The evaluation did not include estimating binding activities, since their values for the evaluation set chemicals were not derived from AC50 data. The evaluation was performed using the optimal parameter combinations identified by cross-validation.

Software implementation
The GkNN model was implemented as a set of standalone Python scripts. Chemical fingerprints and molecular similarities were calculated using open source cheminformatics toolkits RDKit [21] and Indigo [22], activity estimates were obtained using NumPy toolkit [23], and accuracy metrics were calculated using Scikitlearn toolkit [24].

Results and discussion
Chemical structure space Figure 1 indicates that the training set chemicals and the evaluation set chemicals occupy similar domains of the chemical structure space. Chemicals from both sets form approximately Gaussian distributions with a common center and similar shape (the widths of the evaluation set are slightly larger than those of the training set). Whereas using Morgan fingerprints and Indigo full fingerprints Results shown in panels a-c were calculated using Morgan fingerprints and Tanimoto similarity, and results shown in panels d-f were calculated using Indigo full fingerprints and Tanimoto similarity, respectively. Panels a, d distributions of pairwise molecular similarities (overlaid plots). Panels b, e projections of the training set (blue) and the evaluation set (red) on the 3D space formed by the first three eigenvectors of the training set self-similarity matrix. Panels C and F: distributions of the training and evaluation sets along each of the first three eigenvectors results in significantly different absolute similarity values, the above observations hold for the both fingerprints, suggesting that structure-activity relationships inferred from the training set are likely to hold for the evaluation set. This conclusion is further supported by the distributions of pairwise molecular similarities calculated using Indigo full and Morgan fingerprints (Fig. 1) and MACCS keys (Additional file 1: Fig. S2).
How sensitive is the "neighborhood" of a chemical in the chemical structure space to the choice of the molecular fingerprint? To address this question, we compiled neighbor lists for each chemical in the training set using Morgan and Indigo full fingerprints, respectively. We found that these lists significantly overlap, although the order of neighbors in the two lists is essentially different. As such, the overall arrangement of chemicals in the chemical structure space may be robust to the choice of the fingerprint, whereas the order of nearest neighbors for each chemical is fingerprint-sensitive.

Structure-activity landscape
Panels A and D in Fig. 2 show that the hER structureagonist activity landscape obtained from the training set chemicals is mostly smooth, except for a few cliffs that arise from pairs of chemicals with similar structures but significantly different activities. Panels B and E in Fig. 2 along with Additional file 1: Fig. S3 support this conclusion, indicating that the structure-agonist activity landscapes for the training set, the evaluation set, and subsets of the evaluation set are characterized by relatively small SALI values, and higher SALI values that indicate activity cliffs are rare exceptions. Similarly, panels C and F in Fig. 2 along with Additional file 1: Fig. S3 show that the hER structure-antagonist activity landscape is even more smooth than that for agonist activity, in part because the number of active antagonist chemicals (9) was much smaller than the number of active agonist ones (80). Importantly, even though Morgan fingerprints, Indigo full fingerprints, and MACCS keys result in significantly different molecular similarity values and therefore Fig. 2 Structure-activity landscapes of the training set and the evaluation set. Results shown in panels a-c were calculated using Morgan fingerprints, and results shown in panels d-f were calculated using Indigo full fingerprints, respectively. Panels a, d maximum SALI values per chemical for agonist activities of the training set chemicals projected on the 3D space described in Fig. 1. Panels b,  For the evaluation set, increasing the minimum number of sources per chemical reduces the number of chemicals and thereby increases the average distance among them in the chemical structure space. This has the effect of smoothing the structure-activity landscape, as indicated by the elimination of the larger SALI values. For chemicals with more than 9 activity sources, differences in activities are close to either 0.0 or 1.0, suggesting that these chemicals are either clearly active or clearly inactive. We therefore conclude that the full hER structure-activity landscape is more rugged than those reconstructed from the available chemical sets. As discussed above, this ruggedness may be key factor that limits the accuracy of QSAR models. Table 1 shows the accuracy metrics for the tuned GkNN model and the arithmetic, geometric, and exponential averaging kNN models. In all cross-validation calculations, the geometric averaging kNN model was consistently the least accurate one, whereas the arithmetic averaging kNN model performed considerably better, and the exponential averaging kNN model provided further improvement in accuracy. These results are consistent with the earlier calculations of melting point using these models [19]. The tuned GkNN model was found to provide an increase in balanced accuracy over the exponential averaging kNN model.

Optimal parameters
For agonist and binding activity, the most accurate estimates were obtained by using Morgan fingerprints with k = 10 . Increasing the values of the GkNN model parameters X and Y from 1.0 to 1.5 and 3.0 , respectively, resulted in a small increase in balanced accuracy and had no significant effect on ROC AUC. A similar increase in balanced accuracy was observed when the value of the exponential kNN model parameter X increased from 1.0 to 1.5 . Interestingly, all models (except the geometric kNN model that was consistently much less accurate than the others) performed almost as well when using Indigo fingerprints with k = 7 and the same values of parameters X and, for the GkNN model, Y. Using Daylight fingerprints or MACCS keys resulted in a significantly lower performance (see Additional file 1: Table S1).
For antagonist activity, using Indigo fingerprints with k = 10 resulted in the most accurate estimates. The Table 1 Accuracy metrics for agonist, antagonist, and binding activity cross-validation "kNN arithm", "kNN geom", and "kNN exp" indicate the kNN models with the arithmetic, geometric, and exponential averaging, respectively. The cumulative score shown in the last column is the product of balanced accuracy, accuracy, and ROC AUC. Italic font indicates accuracy metric values that exceed those for the CERAPP consensus model exponential kNN model provided an improvement in balanced accuracy over the arithmetic kNN model. Using the exponential model with Morgan fingerprints and k = 3 resulted in similar outcome. Still, the highest balanced accuracy gain was achieved by using the GkNN model with Indigo fingerprints, k = 10 , and two combinations of the other parameters: X = 3 , Y = 7 and X = 5 , Y = 15 , respectively. We suggest that the higher optimum values of X and Y for agonist activity calculations arise from the significantly smaller number of the agonist active chemicals, as discussed above.
Notably, multiple parameter combinations resulted in nearly identical accuracy in cross-validation as well as evaluation, indicating that the model parameters are not completely independent. Indeed, parameter k that controls the number of relevant nearest neighbors and parameter Y that weights contributions from these neighbors both influence the distance in the chemical structure space where the similarity principle is assumed to break down. Accordingly, simultaneously increasing parameters k and Y was found to have minor effect on the GkNN model estimates compared to changing one of those parameters. The above conclusions held when using Indigo full fingerprints as well, although the optimal parameter values in that case were different.
The optimal value of parameter X > 1 suggests that lower (but non-zero) biological activity estimates obtained from assay data might be not as reliable as higher activity estimates, consistent with the analysis of the assay data [2] and the activity distributions for different numbers of literature sources (see Additional file 1: Fig. S4). The optimal value of parameter Y > 1 indicates that the structure-activity principle is more likely to hold at closer distances in the chemical structure space, supporting the conclusion that the full hER structure-activity landscape is more rugged than the one reconstructed from the training set and/or the evaluation set.

Model performance
Tables 2 and 3 summarize the accuracy of agonist and antagonist activity estimates for the evaluation set chemicals obtained by using the kNN models, the GkNN model, and the CERAPP consensus model [16]. As in cross-validation, the geometric kNN model yielded the least accurate estimates, and the arithmetic kNN model performed considerably better but not as well as the exponential kNN model or the GkNN model. In the agonist activity estimates (Table 2), the latter two performed on par with each other. They both closely trailed the CERAPP consensus model in ROC AUC and slightly outperformed it in balanced accuracy for chemicals with 5-9 activity sources. In most antagonist activity estimates (Table 3), the exponential kNN model was on par with the CERAPP consensus model in balanced accuracy and slightly outperformed it in ROC AUC, whereas the GkNN model consistently outperformed the both. Notably, the improvement in balanced accuracy provided by the GkNN model over the exponential kNN model was higher for chemicals with larger numbers of activity sources.
The dependence of the model performance on the confidence level of activity estimates q i is illustrated by Additional file 1: Table S2. For agonist activity, balanced accuracy and ROC AUC for chemicals with higher confidence levels are consistently higher than those calculated for chemicals with lower confidence levels. Panel A in Fig. 3 illustrates the dependence of ROC curves on confidence level, supporting the earlier suggestion that confidence levels can be used to define applicability domains for QSAR models.
For agonist activity estimates, the exponential kNN model and the GkNN model closely trails the CERAPP consensus model [16]. For antagonist activity, the exponential kNN model and the GkNN model consistently outperform the CERAPP consensus model for all estimates except those with q ≥ 0.9 . Since the training set included much fewer antagonist chemicals (9) than agonist chemicals (80), these observations reinforce the suggestion that employing non-linear distance metrics in the structure-activity space may be particularly efficient when training set data are limited. The influence of the uncertainty in the data from literature on the performance of the kNN models, the GkNN model, and the CERAPP consensus model is summarized in Additional file 1: Table S3 and illustrated in panels B and C in Fig. 3. As expected, for either model, increasing the number of literature sources for the evaluation chemicals (and thereby the quality of the activity data) results in increasing accuracy of the estimates and decreasing the number of false positive estimates, as illustrated in Additional file 1: Fig. S5.

Conclusions
We introduced the GkNN QSAR model based on a custom non-linear distance metric in the chemical structure-biological activity space and explored how this non-linearity influences the model performance. Using the hER data from the ToxCast [9] and Tox21 [10] databases, we compared the accuracy of the GkNN model against that of other variants of the kNN model with non-linear weighting schemes and the CERAPP consensus model [16]. We found that the GkNN model, along with the exponential kNN model [19], appears most efficient when the training set data, most notably the number of active chemicals, are limited. Table 2 Accuracy metrics for agonist activity evaluation with different numbers of activity sources per chemical "kNN arithm", "kNN geom", and "kNN exp" indicate the kNN models with the arithmetic, geometric, and exponential averaging, respectively. The cumulative score shown in the last column is the product of balanced accuracy, accuracy, and ROC AUC.  Table 3 Accuracy metrics for antagonist activity evaluation with different numbers of activity sources per chemical "kNN arithm", "kNN geom", and "kNN exp" indicate the kNN models with the arithmetic, geometric, and exponential averaging, respectively. The cumulative score shown in the last column is the product of balanced accuracy, accuracy, and ROC AUC.