 Research article
 Open Access
 Published:
Using Pareto points for model identification in predictive toxicology
Journal of Cheminformaticsvolume 5, Article number: 16 (2013)
Abstract
Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals. A reliable prediction of toxic effects of chemicals in living systems is highly desirable in cosmetics, drug design or food protection to speed up the process of chemical compound discovery while reducing the need for lab tests. There is an extensive literature associated with the best practice of model generation and data integration but management and automated identification of relevant models from available collections of models is still an open problem. Currently, the decision on which model should be used for a new chemical compound is left to users. This paper intends to initiate the discussion on automated model identification. We present an algorithm, based on Pareto optimality, which mines model collections and identifies a model that offers a reliable prediction for a new chemical compound. The performance of this new approach is verified for two endpoints: IGC50 and LogP. The results show a great potential for automated model identification methods in predictive toxicology.
Background
Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals [1]. These models are continuously built and validated on large collections of toxicological experimental studies to discover new biologically active compounds that are more effective, selective, less toxic, or satisfy various toxicological criteria [2, 3]. A reliable prediction of toxic effects of chemicals in living systems is highly desirable in domains such as: cosmetics, drug design or food safety. This knowledge allows an earlier rejection of those chemicals that may fail the testing phase and reduces the cost of manufacturing chemical compounds in the development stage. Additionally, the European Commission’s Legislation of Registration, Evaluation and Authorization of Chemicals (REACH) [4] allows the registration of chemicals that were developed using in silico modelling, which facilitates a reduction in the number of animal tests. These two factors have contributed to increased interests from research and business communities in development of toxicological modelling systems that are focused on data integration, model development and predictions (e.g OpenTox [5], InkSpot [6] or OCHEM [7]).
Quantitative StructureActivity Relationship (QSAR) or StructureActivity Relationship (SAR) models (both regression and classification) are the most common and widely used methods to relate chemical structure/properties with their biological, chemical or environmental activities [8]. According to the Organisation for Economic Cooperation and Development (OECD) Principles for QSAR Model Validation [9], a model should be statistically significant and robust, have its application boundaries defined and be validated by an external dataset [10, 11]. A model applicability domain [12, 13] determines the boundary of the chemical subspace where the model makes reliable prediction for a given activity. Applying models for chemicals from outside of their applicability domains increases the likelihood of inaccurate prediction.
There is an extensive literature associated with the best practice of model generation and data integration [14–19] but management and identification of relevant models from available collections of models is still an open problem. In recent years a large number of highly predictive models, having various applicability domains, has become publicly available. Some of them, tested on a wide chemical space, have become officially approved tools, e.g. KOWWIN (estimates the log octanolwater partition coefficient) or BCFBAF (estimates fish bioconcentration factor) built into Estimation Program Interface (EPI) Suite [20]. There is also a large number of quality models that are applicable only for a narrow chemical space. Some of them are annotated according to the OECD principles and publicly available in databases like JRC QSAR Models Database [21]. This database includes reports of model generation, validation and prediction according to the OECD standards. QSAR Model Reporting Format (QRMF) and QSAR Prediction Reporting Format (QPRF) have been developed at the Computational Toxicology and Modelling lab of the JRC’s Institute to standardise annotation of model metainformation. Currently, there is a lot of effort to build the ontologies for QSAR experiments and to provide an interoperable and reproducible framework for QSAR analyses [22].
Models that are stored in model databases can be reused to predict toxicity of new chemical compounds. Unfortunately, this involves a manual process of model identification. A potential user is required to make a comparison of model applicability domains and their predictivity for a given activity in order to decide if the model can make reliable predictions for a given chemical compound. Model comparison is a difficult task since models are generated using various subsets or various chemical compound descriptors. Consequently, models can be trained and validated on different datasets. For regression models, the model performance can be described by the predictive squared correlation coefficient q^{2}. Since the sizes and contents of modelling and validation datasets may differ for various models, the value of q^{2} is not sufficient for model comparison [10]. Several model performance matrices were analysed in the context of model validation and model selection [14]. They are applied in automated model development where models are validated by the same dataset. In the case where two models come from different sources, model comparison becomes challenging. This requires predictive models to be validated across the entire chemical space, which is very difficult as the list of available chemicals and assays may be limited.
Clearly, there is a need for automated techniques for mining model repositories. This includes methods for model quality control, data and model integration, model comparison and model identification. Our research aims to address this gap. In this paper, we draw attention to the importance of existing models’ usage in predictive toxicology. We also introduce methods for effective model identification for a new unseen chemical compound. The term “model identification“ covers the whole range of problems related to model selection from a collection of models (for a given endpoint) developed on various datasets. In the extreme case, datasets (and specified applicability domains) for two models can be disjoint. Model identification is a much harder problem than the well known model selection problem [23], i.e choosing a model from a set of candidate models with the same applicability domains. Therefore, various methods applied in traditional model selection [24–27] cannot be directly applied to model identification. In contrast to model selection, model identification cannot take into account model variables or parameters since some model variables cannot be easily accessed for new chemical compounds.
The interesting questions here are whether efficient model identification is possible based on molecular structures and models performances, and how good the identified model can be for a new chemical compound. In [28], authors defined the framework for automated model selection and described a simple algorithm for model selection. The method selects the most predictive model from the collection of models for a nearest neighbour to the query chemical compound. Often, the nearest neighbourhood can contain more than one element and model performances can differ slightly. In this case, it is difficult to say which model would be the most reliable for a given chemical compound.
To answer the above question, in this paper we present a new method for model identification for regression models. This method uses Pareto points [29] to define the nearest Pareto neighbourhood according to two criteria: structural similarity of chemicals and models performances. In the next section a framework for model identification, Pareto points and their properties are introduced. Having the Pareto nearest neighbourhood defined, we present two methods for model identification. The first method averages model performances for all Pareto neighbours and identifies the one with the smallest error. The second method identifies a model for which the Pareto point is the closest (based on Euclidean distance) to a centroid of all points in the Pareto neighbourhood. We also demonstrate that model identification improves the quality of the test set, or unseen chemical compound prediction. Experimental work using IGC50 for Tetrahymena pyriformis and internal Syngenta LogP datasets show that our approach provides good results and it is worth being considered for further research.
Methods
Framework for model identification in predictive toxicology
There are several chemical compound representations and thousands of available chemical descriptors [8] used for predictive model development. In this paper, a chemical space X is a set of chemicals represented by pairs x = (x^{d},x^{f}), where ${x}^{d}\in {\mathbb{R}}^{{K}_{1}}$ represents a vector of descriptor values, ${x}^{f}\in {\{0,1\}}^{{K}_{2}}$ is a fingerprint, and K_{1} + K_{2} is the dimension of the chemical space. Descriptors represent various topological, geometrical, physical and chemical properties of a chemical compound. A fingerprint is a binary vector whose coordinates define the presence or absence of predefined structural fragments within a molecule [30]. A fingerprint is also a one dimensional representation of a chemical compound and it is widely used for chemical similarity search in large databases [31]. It is also worth noting that a fingerprint is not a unique chemical compound representation because it encodes only a fragment of a molecule. There can be two different molecules having the same fingerprint representation.
A predictive model M is a mapping X → Y, where $Y\subset \mathbb{R}$ is the output space. The output space Y might, for example, represent a particular biological, physical or chemical activity of a chemical compound.
The input data is represented by the pairs: (x_{ i },y_{ i }) ∈ X × Y for i = 1,…,n, where x_{ i } is an element of the chemical space and y_{ i } is the measured activity of that element. There is also a set of m predictive models $\mathcal{\mathcal{M}}=\{{M}_{1},\dots ,{M}_{m}\}$ associated with the activity Y. These models were generated using various statistical or data mining techniques and they have different applicability domains and performances. To identify the most predictive model from the collection of models $\mathcal{\mathcal{M}}$ for a new chemical compound x, we define a partitioning model that splits the chemical space into disjoint groups and allows an unambiguous model identification.
A partitioning model$\widehat{M}$ is a mapping X → Y given by the following formula:
where

D_{1},…,D_{ m } ⊆ X are disjoint,

$\bigcup _{i=1}^{m}{D}_{i}=X$.
The main hypothesis in predictive modeling is that similar chemical compounds have similar properties [32]. Following this hypothesis we build the partitioning model that it splits the chemical space in groups in order to maximize the similarity of their chemical compounds and to minimize the error of a model associated with this group. It is easy to notice that this is a bicriteria problem and the solutions have to represent a tradeoff between optimality of these criteria (the socalled Pareto points). Pareto optimality is a multicriteria optimisation technique widely applied in decision making problems [29]. In QSAR modelling multiobjective (criteria) was used for feature selection [33] in order to maximize predictive capacity and to reduce the number of selected descriptors. In this paper we present how Pareto optimality can be applied in QSAR model identification. In the following sections we recall the basic definition of the Pareto set and we propose an algorithm that finds Pareto points in 2D vector space.
Pareto points and their properties
Let consider a vector $v=\phantom{\rule{2.77626pt}{0ex}}[{f}_{1},{f}_{2},\dots ,{f}_{K}]$ in the K– dimensional space. Let π_{ j }(v) = f_{ j } denote a jth coordinate of vector v and V be a finite set of vectors in ${\mathbb{R}}^{K}$.
Definition 1 (Domination)
A vector $v\in {\mathbb{R}}^{K}$ is dominated by a vector $w\in {\mathbb{R}}^{K}$, which is denoted by v ≼ w, if
We say that v is strictly dominated by w (v ≺ w), if v ≼ w and v ≠ w, i.e.
Definition 2 (Comparison)
Vectors $v,w\in {\mathbb{R}}^{K}$ are incomparable, which we denote by v∼w, if neither v ≼ w nor w ≼ v.
Note that v ∼ w if and only if there exist $i,j\in \{1,\dots ,K\}$, i ≠ j, such that
Definition 3 (Pareto set)
A set Γ ⊂ V of minimal vectors with respect to ≼ is called a Pareto set for V.
Note that Γ consists of incomparable vectors. We can define Γ equivalently by the formula
The above definitions and basic properties of the Pareto set can be found in [34]. Now, we introduce below some properties of Pareto sets and Pareto order that are used in the following sections. First, we introduce the convenient notation. Let
and
The set V_{ j } consists of all vectors in V with minimal value on the jth coordinate.
Lemma 1
Let Γ_{ j } be the set of all minimal vectors in V_{ j }. Then Γ_{ j } ⊂ Γ, where Γ is the Pareto set for V.
Let $\mathrm{I}\Gamma =\bigcup _{j=1,\dots ,K}{\Gamma}_{j}$ and
In particular, IΓ is a subset of Γ and it is called an initial Pareto set. Now we establish the dependence of the conditions for incomparability with vectors in this initial Pareto set.
Lemma 2
If a vector v ∈ V is incomparable with all vectors in IΓ, then there exist at least two indices $j\in \{1,\dots ,K\}$ such that
The proof of this Lemma 1 and Lemma 2 as well as all other results in the paper are provided in Appendix 1.
Pareto order in two dimensions
This subsection is devoted to the study of the twodimensional case, i.e. K = 2. We shall use the notation introduced above.
Lemma 3
The set IΓ has at most two elements.

1.
If IΓ = 1, then IΓ is the Pareto set for V.

2.
If IΓ = 2, then a vector v ∈ V is incomparable with vectors in IΓ if and only if
$${\forall}_{j=1,2}\phantom{\rule{1em}{0ex}}{\pi}_{j}\left(v\right)\in ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}}).$$(10)
As shown in Figure 1 and Figure 2, when IΓ consists of two elements w_{1} and w_{2}, a set of vectors incomparable with IΓ is given by the rectangle $\mathcal{V}$. Let Γ be a vector incomparable with IΓ, i.e. $\gamma \in \mathcal{V}$. The introduction of v_{0} divides the rectangle $\mathcal{V}$into three areas:

A^{′} and A^{′′} is a set of vectors incomparable with IΓ ∪ {γ},

B is a set of vectors smaller then Γ,

C is a set of vectors bigger then Γ.
The above properties of IΓ and vectors incomparable with IΓ allow us to limit the search space$\mathcal{V}$to find Pareto solutions.
Finding a Pareto set in 2D vector space
In this section, we present an algorithm for finding a Pareto set in twodimensional space (see Algorithm 1). FINDPARETOSET(V) is a recursive algorithm that finds all Pareto points in the rectangle$\mathcal{V}$ defined by two points in the initial Pareto set IΓ (see Lemma 1); this rectangle contains all points from V. The algorithm starts from finding a point Γ that does not dominate any other points in V (line 4). This point splits the area$\mathcal{V}$ into four rectangles (see Figure 2). According to Lemma 2 and Lemma 3, B ∩ V = ∅, C does not contain Pareto points, whereas points in rectangles A^{′} and A^{′′} are incomparable with Γ. The above procedure is recursively repeated for V ∩ A ^{′} and V ∩ A^{′′}.
Algorithm 1 FINDPARETOSET(V)
The algorithm sketched above calls FINDPARETOPOINT($\stackrel{\u0304}{V}$) (see Algorithm 2) to find a Pareto point in the set$\stackrel{\u0304}{V}$. This procedure works in the pessimistic time O(n^{2}), where n is a number of elements in$\stackrel{\u0304}{V}$ (when all solutions are comparable, i.e., to form a chain it may take niterations to find a Pareto point). However, the expected running time is much shorter thanks to the random selection of points.
Algorithm 2 FINDPARETOPOINT($\stackrel{\u0304}{V}$)
Model identification in predictive toxicology
Following the similarity hypothesis researchers build models for groups of chemicals that have a common molecular fragment or common properties. These models are more reliable and give better predictions for chemicals that lie in the model applicability domains. Further, high quality models developed for a small subset of chemical space can be combined in a global model that covers larger chemical space using various ensemble techniques. In this section we present how to identify a reliable model from a collection of already existing models for new before unseen chemicals.
The chemical space X is a set of chemical compounds represented by the combination of all possible existing chemical descriptors, and for a given endpoint there is a collection of existing models$\mathcal{\mathcal{M}}$. For each chemical compound x ∈ X, model predictions${Y}^{\prime}=\{{y}_{1}^{\prime},\dots ,{y}_{m}^{\prime}\}$ for models from$\mathcal{\mathcal{M}}$ are known (see Figure 3). To identify a model for a given query chemical compound q we convert the set of chemicals from X and their model performances into a set of pairs (d_{ i },e_{ i m }), where d_{ i } represents the distance between q and the i–th chemical compound from the chemical space. The error${e}_{\mathit{\text{im}}}=\lefty\right({x}_{i}){y}_{m}^{\prime}({x}_{i}\left)\right$ defines the model performance for the m–th model from$\mathcal{\mathcal{M}}$ and for the i–th chemical compound. In a set of such pairs, one can find models that have a low predictive power for the most similar chemical compounds whereas the other gives better predictions. This illustrates the situation often encountered in multicriterial optimization problems: there is no solution that outperforms the others with respect to all criteria. Hence, instead of having one solution we have a set of solutions that cannot be compared to each other. The above task is a Pareto problem: one has to balance similarity to existing chemical compounds and correctness of predictions offered by available models.
The model identification procedure (see Algorithm 3) can be described as follows: for a query chemical compound q and a given chemical space – 1) create the set V of pairs (d_{ i },e_{ i m }), 2) find the Pareto set for V, 3) select the most suitable model for q. To create a set V we start from the array T (see Figure 3) that contains a structural representation of the chemical compound, its measured activity (for a given endpoint) and predictive performance of each model from $\mathcal{\mathcal{M}}$.
Algorithm 3 MODELIDENTIFY(T q)
After executing MODELIDENTIFY(T,q), in line 1, the array T is converted into a list of vectors V using procedure INIT(T,q) (see Algorithm 4). Every vector v_{ i } ∈ V is defined as a pair of the distance between q and the ith chemical compound from T, and the error of the jth model from$\mathcal{\mathcal{M}}$for the compound i. The distance d_{ q i } = 1  S T_{ q i } is calculated using Tanimoto coefficient ST, which is the most frequently used similarity measure in chemoinformatics [35]. This coefficient works with fingerprints (binary representation of molecules) and is defined as a ratio between the number of bits set on the same position in two fingerprints and the sum of bits set on different positions. The model error e_{ i j } is defined as a distance between the true activity for compound i and the value computed by model j. We treat V as a set of all possible solutions for model identification for a given query molecule q and known chemical subspace.
Algorithm 4 INIT(T,q)
In line 2 of MODELIDENTIFY(T,q), we call FINDPARETOSET(V) to find the set of all Pareto points Γ in V. Then, we analyse points in Γ in order to choose the most predictive model for q. In the case when Γ = 1, there is only one candidate, so the choice is trivial. This case is comparable to the algorithm proposed in [28] which selects the most predictive model for the most similar chemical compound of q. In the case when Γ consists of many Pareto points, the model identification becomes a difficult task: the Tanimoto similarity coefficient (as well as other fingerprint similarity measures) between chemical compounds may not be correlated enough with their activity partially contradicting the similarity hypothesis [32] (see the end of this section for a detailed example). To identify a model using Pareto points, first we define nPareto Neighbourhood as follows:
Definition 4
nPareto Neighbourhood is a set with at most n  Pareto points from Γ which are at distance less than τ from the element q where τ > 0 and n > 0.
The threshold τ is selected by experiment and depends on the chemical similarity within a given chemical space. Having defined the Pareto neighbourhood for a given chemical compound q, we provide two methods for model identification. The first one is called nAverage Pareto (see Algorithm 5). The threshold τ provides means for removing those chemical compounds which are dissimilar to the query compound q but their activity is very well predicted by some model. Next, the model average model errors for the chemicals represented by Pareto points and then the model with the smallest average error is selected. We call this method nAverage Pareto Model Identification (nAPMI). The usage of Pareto neighbourhood in comparison with the standard nearest neighbourhood is that this method is more sensitive on model performances and allows for the rejections of the similar chemical compounds on which models perform badly.
Algorithm 5 Average Pareto
The second method is called nCentroid Pareto (see Algorithm 6). For all Pareto points from the nPareto Neighbourhood the centroid Pareto point c is calculated according to formula:
where d_{ c } is the average of distances and e_{ c } is the average of model errors for all Pareto points from the neighbourhood (n  P N). In the next step the Euclidean distance between Pareto points and the centroid is computed. The model that is associated with the Pareto point for which the Euclidean distance to the centroid is minimal, is selected. We call this method nCentroid Pareto Model Identification (nCPMI). According to the definition, both nAPMI and nCPMI are partitioning models that splits chemical space into disjoin groups and allow unambiguous model identification.
Algorithm 6 Centroid Pareto
We mentioned above that similar chemical compounds might have very different measurements of activity. To demonstrate this, we analysed the TETRATOX [36] dataset which contains growth inhibition concentration (IGC50) for Tetrahymena pyriformis. Chemical compounds were compared in pairs. Their Tanimoto similarity coefficient and differences in measured activity were collected. Summarised results are displayed in Table 1. Column headers hold differences in the measured activity between two chemicals, while row headers describe molecule similarity threshold. The single cell of this array represents a number of pairs of chemical compounds for which the distance is smaller than the row identifier and the difference in the activity is smaller than the column identifier.
The TETRATOX dataset contains over one thousand chemical compounds and the biggest difference between measured values of IGC50 is equal to 5.3. Notice that the number of pairs of chemicals that are similar, based on both the fingerprint similarity and the activity, is very small. There is only one pair of chemical compounds that have the same activity and maximal similarity (1row, 1 column). On the other hand, there are many chemicals which are similar fingerprintwise but have different activities. This makes the model identification challenging.
In the next section we present results of the experiments that were carried out in order to demonstrate how model identification works.
Experimental results
Two experiments were proposed in order to demonstrate the advantages of model identification for predictive toxicology. Each experiment has two phases. In the first phase we treated model identification as a classification problem to study the performances of proposed methods in comparison with the other classification algorithms. We defined an “oracle model” that associates each chemical compound from a given chemical space with the most predictive model from the collection of existing models and we used this model to validate our methods. In the second phase, for each chemical compound we applied an identified model to predict the growth inhibition concentration (IGC50) in the first experiment and Partition coefficient (LogP) in the second. Finally, we compared these results with the original model performances applied to the whole chemical space.
IGC50 Prediction for Tetrahymena Pyriformis
A dataset (Tetrahymena Pyriformis Toxicity  TPT) of 1129 chemicals was obtained from the INCHEMICOTOX webpage [37]. This dataset is compiled of toxicity data for the unicellular ciliated protozoa Tetrahymena pyriformis (see [38]) and was published in [39]. The measure of toxicity is 50% growth inhibition concentration (IGC50). Two QSAR regression models were obtained from INCHEMICOTOX. These models are also reported in the JRC QSAR Models Database. The first, non polar narcosis (NPN) QSAR [40], was originally trained on 87 chemicals identified as non polar narcotics with q^{2} = 0.95. The linear regression model was defined as follows:
where log P is the octanolwater partition coefficient. The second, polar narcosis (PN) QSAR model [41] for Tetrahymena pyriformis, was trained on 138 polar narcotics chemicals with q^{2} = 0.75 and defined as follows:
Training datasets for both models were obtained from JRC QSAR Models Database. These datasets were compared with the Tetrahymena pyriformis dataset and 204 (136 from the PN model and 68 from the NPN models) training chemicals were present in the TPT dataset. We did not perform any data curation for this dataset. The above described models were implemented for the logP value calculated using the cdk library [42] and used to predict toxicity for the TPT datasets.
First, we considered the model identification problem as a classification problem to predict which model will be the most reliable for a given chemical compound. Having a dataset of the predicted IGC50 for both models and the measured value, we used a priori information (“oracle model“) about the best selected model for each chemical compound and we applied various classification methods. To simulate the model identification for before unseen chemical compounds the leaveoneout (LOO) method was used. This methods takes out one chemical compound from the dataset and uses others chemicals to predict which model would be the most reliable for it. This procedure were repeated for all chemicals in the dataset.
Table 2 includes results from the comparison of nCPMI and nAPMI proposed in this paper with the DMS (Double Min Score algorithm) [28] and with the standard classification algorithms such as: NaiveBayes, BayesNet decision trees (PART and J48), nearest neighbour (IBK) or support vector machine (SMO) implemented in WEKA [43]. These classifiers were initialised by the default parameter settings. The dataset, used to generate these classification models, consisted of chemicals represented by binary descriptors (1024  bit fingerprints calculated using cdk library) and the model errors. We compared all classifiers according to a number of the correctly classified chemicals and the classifiers accuracies. The 3APMI methods gives the highest number of correctly classified elements and relatively low numbers for false positive and false negative  especially comparing this method to the IBK(3). The 3APMI uses the 3Pareto neighbourhood where as IBK(3) uses the 3nearest neighbourhood for classification. This shows that the model identification using Pareto points is as good as or can be better than the other well know classification algorithms.
The decision on model identification relies on the distance to the Pareto points. Figures 4 and 5 show misclassification examples for the 3APMI method. On Figure 5 for 3Phenyl1propanol the NPN model was identified. Its Pareto neighbourhood included three chemicals: 4Chloro3methylphenol Methylbenzene and 4Dimethylbenzene with the distances and models errors shown in Table 3. The 3APMI model averages model errors for all Pareto points in this neighbourhood and selects the one with the smallest average, in this case the NPN model. One can notice that the best model for this Pareto neighbourhood is the NPN model for 4Dimethylbenzene whereas this chemical compound is not the most similar to the query chemical compound.
To demonstrate a correct classification example, we selected Benzylamine that was associated correctly with the PN model. Its Pareto neighbourhood included two chemicals: 2Chloroaniline and (+/)1,2Diphenyl2propanol with distances and model performances shown in Table 4 (notice that according to Definition 4, the three Pareto neighbourhood consists of at most three Pareto points). These distances to the query chemical compound are small and for both chemicals the PN model gives the most reliable prediction. The 3APMI identifies the PN model that has the minimal average error for all Pareto neighbours.
Additionally, from the entire TPT dataset, chemicals included in the original training datasets for both models were selected. We identified 4 out of 68 chemicals that were used to train the NPN model but the oracle model associated them with the PN model (see Figure 6). The same analysis were repeated for the training dataset of the PN model and we identified 9 out of 136 chemicals that were associated with the NPN model by the oracle model (see Figure 7).
To predict IGC50 for the TPT dataset we used the identified model for each chemical compound in this dataset. The results obtained for the entire dataset are shown in Table 5. The statistics used are: R2  correlation coefficient for the observed and predicted values, RSE  rootsquared error, Q2  predictive squared correlation coefficient, MAE  mean absolute error and RMSE  root mean square error. The “oracle model” has the knowledge of the best model for each chemical compound. Its predictivity is low because we used only two existing models from JRC QSAR database that were designed based on modeofaction (polar/non polar narcosis) for chemicals from TPT.
The 3APMI method provides the best prediction among “nonoracle models”. The first two rows present prediction statistics for PN and NPN models. They are lower than for all other models. Notice, however, that their R2 and RSE statistics are identical. This is due to the fact that both models are affine functions of one and the same explanatory variable. An affine function can, therefore, transform one model into another. This is what happens when regression is applied to compute R2 and RSE. Notice that other two measures of Q2 and predictive errors are different for these models.
As another example, we considered only a small subset of the whole initial TPT dataset that contains only 376 chemical compounds. This dataset includes all training chemicals used in PN and NPN models plus over 100 additional chemicals from the TPT dataset. We included chemicals for which the absolute error of the oracle model is less than 0.4 and they are in the applicability domain of both models. The value of logP ∈ [0.5,6.2] and the toxicity value is in the range [2.5,3.05]. Again we compared various classifiers that were used for model identification (see Table 6).
In this case, the best method is 3CPMI that from the 3Pareto neighbourhood selects model for which Pareto point is the closest to the neighbourhood centroid. This method gives better results if compared with the DMS method that selects the model with the smallest error for the nearest neighbour. Tables 7 and 8 show the list of chemicals that were wrongly classified by the 3CPMI algorithm. Comparing the regression models for IGC50 (see Table 9), 3CMPI method provides better prediction than DMS, PN and NPN models.
The above examples show the great potential of the model identification methods. We demonstrated that the method based on predefined rules (such as maximal similarity for chemicals and minimal error for a model assigned with them) can be compared with the standard machine learning algorithms for the classification problem. Model identification can be considered as an ensemble technique to build high predictive consensus models in predictive toxicology.
LogP prediction for inhouse Syngenta dataset
For the second experiment we considered the estimation of the LogP for an internal Syngenta dataset. The octanol/water Partition coefficient (LogP) is a measure of the lipophilicity of chemical compounds and is an important descriptive parameter in biostudies [8]. Currently, there are various methods for estimating this coefficient: fragmental methods (CLOGP, KOWWIN), atom contribution methods (TSAR, XLOGP), topological indices (MLOGP), molecular properties (BLOGP).
The initial dataset contains about 9000 chemical compounds and their measured LogP value in Syngenta’s laboratories. The measured value of LogP is in the range [5.08,8.65] (see Figures 8 and 9). There was no additional data curation than the curation provided by Syngenta researchers. Three models to predict LogP: CLOGP developed in Syngenta, KOWWIN in EPI Suite and MLOGP in Dragon were applied for this dataset. We randomly selected 1000 chemicals (out of 9000) and used the remaining 8000 chemicals as the chemical space of the partitioning model. We used the 3APMI method as it was the best method in the first experiment. We compared the performance of these four models on 1000 selected chemicals (see Table 10). We repeated the same experiment with 2000 randomly selected chemicals. Additionally, we selected from the initial dataset those chemical compounds for which oracle model has absolute error >0.7. We obtained a set of 2333 chemical compounds.
Table 10 displays the accuracy of model predictions. The 3APMI is generally at least as good as the best model (CLOGP). In the case of randomly selected chemicals CLOGP was hard to beat, although for 2000 randomly selected chemicals one can clearly see the benefit of using 3APMI (higher Q2 and lower MAE). The biggest gain is, however, observed for those chemicals whose activity is difficult to predict (the last experiment). This shows that partitioning model (3APMI) can be a powerful knowledge extraction tool.
All methods proposed in the paper were implemented in R [44]. The logP value, fingerprints and Tanimoto similarity were calculated using the RCDK [45] library. A number of tests were run to define the threshold τ. It is important to notice that the nPareto neighbourhood defines the set of at most nPareto points. Therefore, for the 3Pareto neighbourhood we found chemicals that have 1, 2, or 3 Pareto neighbours for τ = 0.4 for the entire TPT dataset. For the 5Pareto neighbourhood τ = 0.7 and for the 10Pareto neighbourhood we considered all Pareto neighbours. This shows that a size of the Pareto neighbourhood depends on a size of the available chemical space and may vary for different endpoints. Also, looking at the results for APMI and CPMI one can notice that it is not worth considering all Pareto points, and that the size of the Pareto neighbourhood depends on chemical compound similarities.
Conclusion
In this paper, we draw attention to advantages of model reusage in predictive toxicology. Since the amount of experimental data and the number of predictive models are growing every day, it is crucial to develop automated methods for mining models in repositories. The most demanding task is to find a model for a new chemical compound from a collection of models for a given endpoint.
In this paper, we proposed two methods (APMI and CPMI) that identify the suitable model for a query chemical compound based on the model performances in its Pareto neighbourhood. These algorithms are based on our simple yet effective method for finding the Pareto set in 2D space. The experimental results demonstrate the advantage of our approach and indicate that automated model identification is a promising research direction with many practical applications. Our approach is mainly focused on regression models and in the future we plan to extend it to classification models, including the analysis of model variables in chemical space partitioning. An additional interesting direction could address the estimation of identified model reliability for a new chemical compound.
Appendix 1 Proofs
Proof
(Lemma 1). We prove this lemma by contradiction. Let’s$j\in \{1,\dots ,K\}$ and choose v ∈ Γ_{ j }. Assume that v ∉ Γ, which is equivalent to saying that there exists w ∈ V that is strictly dominated by v, i.e. w ≺ v. This means that π_{ j }(w) = π_{ j }(v) and w ∈ V_{ j }. By the definition of Γ_{ j } we know that v is a minimal vector in V_{ j }, so v ≼ w, which contradicts w ≺ v.
Proof
(Lemma 2). Let v ∈ V. First notice that${\pi}_{j}\left(v\right)\ge {f}_{j}^{\mathit{\text{min}}}$,$j=1,\dots ,K$. If${\pi}_{j}\left(v\right)\notin ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}})$ for all j then${\pi}_{j}\left(v\right)\ge {f}_{j}^{\mathit{\text{max}}}$ for all j and w ≼ v for w ∈ IΓ. If there exists exactly one$j\in \{1,\dots ,K\}$ such that${\pi}_{j}\left(v\right)\in ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}})$, then for each index l ≠ j we have${\pi}_{l}\left(v\right)\ge {f}_{l}^{\mathit{\text{max}}}$ and there exists a vector w ∈ Γ_{ j } such that w ≼ v. Therefore, if v is incomparable with vectors in IΓ, none of the above cases can take place, and the proof is completed.
Proof
(Lemma 3). Notice first that each Γ_{ j }, j = 1,2, consists of one element, because the Pareto order ≼ induces a linear order on the sets V_{ j }. Therefore, IΓ consists of at most two elements. Assume that IΓ has one element, which we denote by w. From the construction of IΓ we have:
Consequently, w is dominated by every vector of V, so it is the only minimal vector in V.
Assume now that IΓ consists of two vectors: w_{1} and w_{2}.
(⇒) After renumbering, Γ_{1} = {w_{1}} and Γ_{2} = {w_{2}}. Hence, we obtain from equations (5)(7)
Due to (3) the set of vectors v ∈ V incomparable with IΓ satisfies (9).
(⇐) Let v ∈ V for which inclusion (9) holds, then using renumbering of set Γ_{ j }, j = 1,2, from the above implication, we obtain:
According to the Definition 2 and formula (4) we obtain v ∼ w_{1} and v ∼ w_{2}. Since IΓ = {w_{1},w_{2}}, then v is incomparable with the vectors w_{1} and w_{2}.
References
 1.
Helma C (Ed): Predictive Toxicology. 2005, Boca Raton: Taylor & Francis Group
 2.
Kavlock R: A framework for computational toxicology research in ORD. [http://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=100046MA.txt]
 3.
Judson R: Public databases supporting computational toxicology. J Toxicol Environ Health, Part B. 2010, 13 (2): 218231.
 4.
REACH. [http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm]
 5.
OpenTox. [http://www.opentox.org]
 6.
Inkspot Cloud platform for portable, scalable and secure cloud computing. [http://www.inkspot.co]
 7.
OCHEM. [http://ochem.eu]
 8.
Gasteiger J (Ed): Handbook of Chemoinformatics: From Data to Knowledge. 2003, Weinheim: WileyVCH Verlag GmbH
 9.
OECD principles for the validation, for regulatory purposes, of QSAR models. [http://www.oecd.org/dataoecd/33/37/37849783.pdf]
 10.
Golbraikh A, Tropsha A: Beware of q2. J Mol Graph Model. 2002, 20: 269276. 10.1016/S10933263(01)001231.
 11.
Gramatica P: Principles of QSAR models validation: internal and external. QSAR Comb Sci. 2007, 26: 6947012. 10.1002/qsar.200610151.
 12.
Jaworska J, Comber M, Auer C, Leeuwen C: Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environ Health Perspect. 2003, 111: 13581360. 10.1289/ehp.5757.
 13.
Jaworska J, NikolovaJelizkova N, Aldenberg T: QSAR applicability domain estimation by projection of the training set in descriptor space: a review. ATLA Alternat Lab Anim. 2005, 33: 445459.
 14.
Tropsha A, Gramatica P, Gombar V: The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci. 2003, 22: 6977. 10.1002/qsar.200390007.
 15.
Cartmell J, Enoch S, Krstajic D, Leahy D: Automated QSPR through competitive workflow. J ComputAided Mol Design. 2005, 19: 821833. 10.1007/s1082200590298.
 16.
Zhu H, Tropsha A, Fourches D, Varnek A, Papa E, Gramatical P, Öberg T, Dao P, Cherkasov A, Tetko I: Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. J Chem Inf Model. 2008, 48 (4): 766784. 10.1021/ci700443v.
 17.
Patlewicz G, Jeliazkova N, Gallegos Saliner A, Worth A: Toxmatcha new software tool to aid in the development and evaluation of chemically similar groups. SAR QSAR Environ Res. 2008, 19 (3): 397412.
 18.
Tropsha A: Best practices for QSAR model development, validation, and exploitation. Mol Inf. 2010, 29 (67): 476488. 10.1002/minf.201000061.
 19.
Hardy B, Douglas N, Helma C, Rautenberg M, Jeliazkova N, Jeliazkov V, Nikolova I, Benigni R, Tcheremenskaia O, Kramer S, Girschick T, Buchwald F, Wicker J, Karwath A, Gutlein M, Maunz A, Sarimveis H, Melagraki G, Afantitis A, Sopasakis P, Gallagher D, Poroikov V, Filimonov D, Zakharov A, Lagunin A, Gloriozova T, Novikov S, Skvortsova N, Druzhilovsky D, Chawla S, Ghosh I, Ray S, Patel H, Escher S: Collaborative development of predictive toxicology applications. J Cheminformatics. 2010, 2: 710.1186/1758294627.
 20.
EPI Suite. [http://www.epa.gov/oppt/exposure/pubs/episuite.htm]
 21.
JRC QSAR Model Reporting Format (QMRF). [http://qsardb.jrc.ec.europa.eu/qmrf/]
 22.
Spjuth O, Willighagen E, Guha R, Eklund M, Wikberg J: Towards interoperable and reproducible QSAR analyses: Exchange of datasets. J Cheminformatics. 2010, 2: 510.1186/1758294625.
 23.
Kohavi R: A study of crossvalidation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th international joint conference on Artificial intelligence  Volume 2, IJCAI’95. 1995, San Francisco: Morgan Kaufmann Publishers Inc., 11371143.
 24.
Izrailev S, Agrafiotis DK: A method for quantifying and visualizing the diversity of QSAR models. J Mol Graph Model. 2004, 22 (4): 275284. 10.1016/j.jmgm.2003.10.001.
 25.
Kuncheva L: Combining Pattern Classifiers: Methods and Algorithms. Wiley. 2004, : Wiley
 26.
Todeschini R, Consonni V, Pavan M: A distance measure between models: a tool for similarity/diversity analysis of model populations. Chemometrics Intell Lab Syst. 2004, 70: 5561. 10.1016/j.chemolab.2003.10.003.
 27.
Makhtar M, Neagu D, Ridley M: Predictive model representation and comparison: Towards data and predictive models governance. Comput Intell (UKCI), 2010 UK Workshop on. 2010, 16.
 28.
Wojak A, Neagu D, Ridley M: Double MinScore (DMS) Algorithm for automated model selection in predictive toxicology. United Kingdom Workshop in Computational Intelligence (UKCI 2011). 2011, 150156.
 29.
Ehrgott M: Multicriteria Optimization. 2005, New York, Inc.: SpringerVerlag
 30.
Todeschini R, Consonni V: Handbook of Molecular Descriptors. 2000, Weinheim: WileyVCH Verlag GmbH
 31.
Flower DR: On the properties of bit stringbased measures of chemical similarity. J Chem Inf Comput Sci. 1998, 38 (3): 379386. 10.1021/ci970437z.
 32.
Jahnson M, Maggiora G: Concept of Application of Molecular Similarity. 1990, New York: John Wiley & Sons
 33.
Soto A, Cecchini R, Vazquez G, Ponzoni I: Multiobjective feature selection in QSAR using a machine learning approach. QSAR & Comb Sci. 2009, 28 (1112): 15091523.
 34.
Tappeta RV, Renaud JE: Interactive multiobjective optimization procedure. AIAA J. 1999, 37: 881889. 10.2514/2.7537.
 35.
Willet P, Berdnard J, Downs G: Chemical Similarity Searching. J Chem Inf Comput Sci. 1998, 38: 983996. 10.1021/ci9800211.
 36.
Tetratox. [http://www.vet.utk.edu/TETRATOX]
 37.
Inchemicotox. [http://www.inchemicotox.org/results/]
 38.
Schultz TW: TETRATOX: Tetrahymena Pyriformis population growth impairment endpointa surrogate for fish lethality. Toxicol Methods. 1997, 7 (4): 289309. 10.1080/105172397243079.
 39.
Xue Y, Li H, Ung CY, Yap CW, Chen YZ: Classification of a diverse set of Tetrahymena pyriformis toxicity chemical compounds from molecular descriptors by statistical learning methods. Chem Res Toxicol. 2006, 19 (8): 10301039. 10.1021/tx0600550.
 40.
Ellison C, Cronin M, Madden J, Schultz T: Definition of the structural domain of the baseline nonpolar narcosis model for Tetrahymena pyriformis. SAR QSAR Environ Res. 2008, 19 (78): 751783. 10.1080/10629360802550366.
 41.
Enoch S, Cronin M, Schultz T, Madden J: An evaluation of global QSAR models for the prediction of the toxicity of phenols to Tetrahymena pyriformis. Chemosphere. 2008, 71 (7): 12251232. 10.1016/j.chemosphere.2007.12.011.
 42.
Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An opensource java library for Chemo and Bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493500. 10.1021/ci025584y.
 43.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009, 11: 1018. 10.1145/1656274.1656278.
 44.
The R Project for Statistical Computing. [http://www.rproject.org/]
 45.
RCDK. [http://cran.rproject.org/web/packages/rcdk/index.html]
Acknowledgements
The authors would like to thank BBSRC and Syngenta Ltd for funding the Industrial CASE Studentship Grant (No. BB/H530854/1) for AP. The authors are also grateful to Kim Travis and Richard MarcheseRobinson from Syngenta Ltd for their useful comments, and to John Delaney and Nathan Kidley from Syngenta Ltd for the access to the LogP dataset. The authors are also grateful to the referees for their invaluable and insightful comments that have helped to improve the presentation of this work.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AP proposed the concept of model identification in predictive toxicology. She designed and validated the method that uses Pareto points for model selection from the collection of existing models. She also proposed the algorithm for finding Pareto points in 2D space. DN originated the concept of data and model governance as a framework for reusing and mining models in predictive toxicology. DN, MR have been involved in the review discussions and proofread the draft of this manuscript. All authors have read and approved the final version of the manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Predictive toxicology
 Model identification
 Pareto optimality
 Model combination