Identifying uncertainty in physical–chemical property estimation with IFSQSAR

Brown, Trevor N.; Sangion, Alessandro; Arnot, Jon A.

doi:10.1186/s13321-024-00853-w

Research
Open access
Published: 30 May 2024

Trevor N. Brown¹,
Alessandro Sangion¹ &
Jon A. Arnot^1,2,3

Journal of Cheminformatics volume 16, Article number: 65 (2024)

Abstract

This study describes the development and evaluation of six new models for predicting physical–chemical (PC) properties that are highly relevant for chemical hazard, exposure, and risk estimation: solubility (in water S_W and octanol S_O), vapor pressure (VP), and the octanol–water (K_OW), octanol–air (K_OA), and air–water (K_AW) partition ratios. The models are implemented in the Iterative Fragment Selection Quantitative Structure–Activity Relationship (IFSQSAR) Python package, Version 1.1.0. These models are implemented as Poly-Parameter Linear Free Energy Relationship (PPLFER) equations which combine experimentally calibrated system parameters and solute descriptors predicted with QSPRs. Two other ancillary models have been developed and implemented, a QSPR for Molar Volume (MV) and a classifier for the physical state of chemicals at room temperature. The IFSQSAR methods for characterizing applicability domain (AD) and calculating uncertainty estimates expressed as 95% prediction intervals (PI) for predicted properties are described and tested on 9,000 measured partition ratios and 4,000 VP and S_W values. The measured data are external to IFSQSAR training and validation datasets and are used to assess the predictivity of the models for “novel chemicals” in an unbiased manner. The 95% PIs calculated from validation datasets for partition ratios needed to be scaled by a factor of 1.25 to capture 95% of the external data. Predictions for VP and S_W are more uncertain, primarily due to the challenges in differentiating their physical state (i.e., liquids or solids) at room temperature. The prediction accuracy of the models for log K_OW, log K_AW and log K_OA of novel, data-poor chemicals is estimated to be in the range of 0.7 to 1.4 root mean squared error of prediction (RMSEP), with RMSEP in the range 1.7–1.8 for log VP and log S_W.

Scientific contribution

New partitioning models integrate empirical PPLFER equations and QSARs, allowing for seamless integration of experimental data and model predictions. This work tests the real predictivity of the models for novel chemicals which are not in the model training or external validation datasets.

Introduction

Physical–chemical (PC) property data are essential for conducting legislated ecological and human health assessment for new and existing organic chemicals [1,2,3]. Common PC properties used in chemical assessments are solubility in water (S_W; mol/L), solubility in octanol (S_O; mol/L), vapor pressure (VP; Pa), melting point (T_M; K), boiling point (T_B; K) and the octanol–water (K_OW), octanol–air (K_OA), and air–water (K_AW) partition ratios. The partition ratios are considered dimensionless, and K_AW is the dimensionless Henry’s Law Constant (H; Pa.m³/mol) as K_AW = H/RT, where R is the Ideal Gas Law Constant (Pa.m³/(mol.K)) and T is the system temperature (K; kelvin). Models used for predicting bioaccumulation [4], overall persistence and long-range transport potential [5], toxicity, toxicokinetics in in vitro and in vivo systems, chemical concentrations in natural and manufactured environments, and ultimately exposure to human and ecological receptors require at least some of the listed PC properties as input parameters. Chemical assessment outcomes are sensitive to the selected PC values, e.g., [5,6,7,8,9] and reliable PC data are therefore required for reliable chemical assessments; “garbage in = garbage out” [10]. There is a need to better understand which chemicals and properties have the greatest uncertainties so these sources of error in regulatory decision-making can be addressed.
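As a minimal illustration of the K_AW = H/RT relationship above, the following Python sketch converts a Henry’s Law Constant to the dimensionless K_AW. The function name, the default temperature of 298.15 K, and the example H value are illustrative assumptions and are not taken from IFSQSAR.

```python
import math

# Ideal Gas Law Constant in Pa·m³/(mol·K)
R = 8.314462618

def kaw_from_henry(h_pa_m3_per_mol, temp_k=298.15):
    """Return the dimensionless air-water partition ratio K_AW = H / (R T)."""
    return h_pa_m3_per_mol / (R * temp_k)

# Hypothetical Henry's Law Constant, chosen only to illustrate the call
h_example = 50.0
log_kaw = math.log10(kaw_from_henry(h_example))
```

Because H carries units of Pa.m³/mol and RT carries the same units, the ratio is dimensionless, as stated in the text.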

Uncertainty in PC data is inherent whether the data are measured or modelled [11, 12] and guidance for selecting PC data for chemical assessments is available [11]. Theoretical relationships between S_W, S_O, VP, K_OW, K_OA, and K_AW have been outlined by Mackay and colleagues [13,14,15] and others [16, 17]. These theoretical relationships (sometimes referred to as the “three solubility approach” [15]) can be applied for evaluating measured and predicted PC property data quality and obtaining consistency amongst them all as a method to address uncertainty in available PC property data and guide the selection of reliable data. Predictive methods for PC property data are required for thousands of chemicals legislated for evaluation [18,19,20,21]. Methods for predicting PC properties include Quantitative Structure-(Activity)Property Relationships (QS(A)PRs) and Poly-Parameter Linear Free Energy Relationships (PPLFERs), also known as Abraham equations [22, 23]. Organization for Economic Co-operation and Development (OECD) guidance for QS(A)PR development and validation for applications in regulatory decision-making exists [24, 25] including consideration of the applicability domain (AD) for a predicted property as outlined in the recent OECD QSAR assessment framework (QAF) [26]. There is a need for reliable predictive methods that include AD information as well as uncertainty estimates for predictions.

The Iterative Fragment Selection QSAR (IFSQSAR) development methods have been progressively updated and applied to various chemical properties over the last 10 years [27,28,29]. IFSQSARs are fragment-based multiple linear regression (MLR) models developed using extensive cross-validation and conservative goodness-of-fit metrics to create robust and predictive models, and make predictions based only on the chemical structure as a Simplified Molecular Input Line Entry System (SMILES) string [30]. The IFSQSARs include the prediction of solute descriptors required to parameterize PPLFER equations and other PC properties directly. The IFSQSARs have been developed in agreement with OECD guidance and apply three complementary methods for assessing if predictions are within the QSPR AD and provide estimates of the prediction uncertainty. The IFSQSAR methods and the mechanistic insights of the PPLFER methods are applied in this work to identify and characterize general uncertainties in predicting PC property data required for chemical assessments. The model development ensures that predicted properties are thermodynamically consistent, and their calculation is based on a consistent set of descriptors, i.e. the PPLFER solute descriptors. This is like previous efforts based on different descriptors, such as the Unified Physicochemical Property Estimation Relationships (UPPER) method of Yalkowsky and colleagues [31].

The present study describes the development and evaluation of new models in IFSQSAR Version 1.1.0 (https://github.com/tnbrowncontam/ifsqsar) for predicting S_W, S_O, VP, K_OW, K_OA, and K_AW. The new models, and other QSARs, are available in a user-friendly, freely accessible online platform, the Exposure And Safety Estimation (EAS-E) Suite (www.eas-e-suite.com). QSPRs have previously been developed for solute descriptors and system parameters of PPLFERs [32, 33]. These QSPRs are combined with empirically calibrated PPLFER equations to make predictions for PC properties, some calibrated in previous research [34] and some newly calibrated in this work. A key objective of this work is to validate the predictive power of the new models against experimental data for novel chemicals; therefore, in the validation process, the PPLFERs are only parameterized with solute descriptors predicted by the IFSQSARs to represent conditions of applying models to chemicals and properties for which there are no measured data. The new model predictions are compared against independent measured property data to assess their predictive power (uncertainty) expressed as 95% prediction intervals. Methods for quantifying the predictive power of the QSPR predictions for novel chemicals, i.e. chemicals that are outside of the training and validation datasets, are evaluated. Based on these evaluations and the detailed AD information of the IFSQSAR models, methods for further improving the understanding of the prediction uncertainty for novel chemicals are recommended.

Methods

Theory

Thermodynamic property cycles that describe the interrelation between partitioning and solubility in octanol, water and air phases are referred to as the three-solubility approach. The three-solubility approach interprets the partition ratios K_OW, K_OA, and K_AW as ratios of the solubilities S_O, S_W and solubility in air (S_A), where S_A is a conversion of VP at atmospheric pressure and temperature. Figure 1 shows how the three-solubility approach [15] is used in this study to calibrate consistent solubility and partitioning properties. Partition ratios and solubility in this work are calculated using PPLFERs. PPLFERs were pioneered by Michael Abraham and colleagues, and are empirical correlations used to predict chemical properties with many applications in environmental chemistry [33]. There are three different forms of PPLFER equations which include different sub-sets of solute descriptors and system parameters. Two forms are recommended by Abraham for partitioning between two condensed phases, or partitioning between one condensed phase and one gaseous phase [22]. A third form was suggested by Goss [23] which contains descriptors from each of the two suggested by Abraham and is shown in Eq. 1. PPLFERs in the form of Eq. 1 are used in this work because they offer two advantages for environmental chemistry research. The first is that using a single form of the equation allows for the application of thermodynamic property cycles. The second is that this form of PPLFER equation shows better predictive power for some solutes with unique properties, including perfluorinated alkyl substances and methyl siloxanes, which are of environmental interest [35].

$$\log \text{K} = \text{s} \cdot \text{S} + \text{a} \cdot \text{A} + \text{b} \cdot \text{B} + \text{v} \cdot \text{V} + \text{l} \cdot \text{L} + \text{c}$$

(1)
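Eq. 1 is a plain linear combination, which the following Python sketch evaluates. Lower-case keys hold the system parameters and upper-case keys hold the solute descriptors. The function name and the dictionary layout are illustrative assumptions, and the numerical values are hypothetical placeholders rather than calibrated parameters from Table 1.

```python
def pplfer_log_k(system, solute):
    """Evaluate Eq. 1: log K = s*S + a*A + b*B + v*V + l*L + c."""
    return (
        system["s"] * solute["S"]
        + system["a"] * solute["A"]
        + system["b"] * solute["B"]
        + system["v"] * solute["V"]
        + system["l"] * solute["L"]
        + system["c"]
    )

# Hypothetical placeholder values, NOT calibrated system parameters
example_system = {"s": 1.0, "a": 1.0, "b": 1.0, "v": 1.0, "l": 1.0, "c": 1.0}
# Hypothetical placeholder values, NOT measured solute descriptors
example_solute = {"S": 1.0, "A": 2.0, "B": 3.0, "V": 4.0, "L": 5.0}

example_log_k = pplfer_log_k(example_system, example_solute)
```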

PPLFER equations consist of solute descriptors, which correlate with the molecular interactions of the solute, and system parameters which are fitted to the properties of the system of interest. For partition ratios the system will be the two phases that the partition coefficient describes, and the system parameters describe the relative propensity for solutes to partition to one phase or the other with positive values favoring the first phase and negative values favoring the second phase. For solubility the two phases are the pure phase of the solute and water, air, or octanol. System parameters are determined by MLR of the property against experimentally determined solute descriptors of the dataset of training chemicals for which both the solute descriptors and property are available; this is referred to as calibrating a PPLFER equation. Experimental solute descriptors are available for about 8000 solutes and system parameters have been calibrated for solvent-air and solvent–water partitioning of about 100 solvents including octanol [36, 37].
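The calibration step described above is an ordinary least-squares fit, which can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, assuming an unweighted fit with an intercept column. The function name is hypothetical and the sketch does not reproduce the authors' calibration procedure or data.

```python
import numpy as np

def calibrate_pplfer(descriptors, log_k):
    """Fit the Eq. 1 system parameters (s, a, b, v, l, c) by least squares.

    descriptors: n x 5 array with columns S, A, B, V, L
    log_k:       n measured property values
    """
    x = np.asarray(descriptors, dtype=float)
    y = np.asarray(log_k, dtype=float)
    # Append a column of ones so that the intercept c is fitted as well
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return dict(zip("sabvlc", coef))

# Synthetic, noise-free data generated from hypothetical "true" parameters
rng = np.random.default_rng(0)
synthetic_descriptors = rng.normal(size=(50, 5))
true_params = np.array([0.5, -1.0, 2.0, 1.5, 0.3, -0.2])
synthetic_log_k = synthetic_descriptors @ true_params[:5] + true_params[5]

fitted = calibrate_pplfer(synthetic_descriptors, synthetic_log_k)
```

With noise-free synthetic data the fit recovers the generating parameters. With measured data the residual scatter gives the standard error of fitting.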

In Fig. 1, Table 1, and Eq. 1 the lower-case letters s, a, b, v, l, and c are the system parameters specific to the system. The upper-case letters S, A, B, V, and L are the solute descriptors specific to the solute. For solubility an additional term that combines A and B with an additional system parameter d is required, as discussed below. The solute descriptors correlate with different types of molecular interactions: S is a combination of the solute dipolarity and polarizability, A is the hydrogen bond donor capacity, B is the hydrogen bond acceptor capacity, V is the McGowan volume which has been interpreted as correlating with energy of cavity formation, and L is the partition coefficient for the hexadecane-air system which correlates with van der Waals interactions. Abraham has also calibrated PPLFER equations for the pure phase properties solubility S_W [38] and vapor pressure VP [39]. Separate PPLFER equations were developed for liquid and solid solutes with quite different system parameters. These PPLFER equations represent a system where partitioning is between the chemical pure phase and the water and air phases, meaning that the system is different for every solute, which is not consistent with how PPLFERs are typically applied. Equation 2 shows a PPLFER equation analogous to Eq. 1 for solubility of solute in water, octanol, or air, which has been modified according to Abraham’s method [38, 39].

$$\log \text{S}_{\text{[W,O,A]}} = \text{s} \cdot \text{S} + \text{d} \cdot \text{A} \cdot \text{B} + \text{v} \cdot \text{V} + \text{l} \cdot \text{L} + \text{c}$$

(2)

Table 1 Poly-Parameter Linear Free Energy Relationship (PPLFER) system parameters^a

In these PPLFERs the solute descriptors are being used to describe how a chemical behaves as both a solute and the solvent. The A∙B term explicitly accounts for the effects of hydrogen bonding between molecules of the chemical, and some versions proposed by Abraham [39, 40] include an S∙S term to account for dipole–dipole interactions. The system parameters quantify how each solute descriptor favors solubility in water, octanol, or air, and any broadly applicable interactions within the pure phase of the solute. Equation 2 was modified to Eq. 3 in this work, because this was found to give better fitting results, and the (AB)^0.5 term is more consistent with previous work done predicting system parameters [34]:

$$\log \text{S}_{\text{[W,O,A]}} = \text{s} \cdot \text{S} + \text{a} \cdot \text{A} + \text{b} \cdot \text{B} + \text{d} \cdot \left( \text{A} \cdot \text{B} \right)^{0.5} + \text{v} \cdot \text{V} + \text{l} \cdot \text{L} + \text{c}$$

(3)
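Eq. 3 differs from Eq. 1 only by the d·(A·B)^0.5 term, as the following Python sketch shows. The function name and dictionary layout are illustrative assumptions, and the numbers are hypothetical placeholders rather than the calibrated values in Table 1.

```python
import math

def pplfer_log_s(system, solute):
    """Evaluate Eq. 3: log S = s*S + a*A + b*B + d*(A*B)**0.5 + v*V + l*L + c."""
    a_desc, b_desc = solute["A"], solute["B"]
    return (
        system["s"] * solute["S"]
        + system["a"] * a_desc
        + system["b"] * b_desc
        + system["d"] * math.sqrt(a_desc * b_desc)  # hydrogen-bond cross term
        + system["v"] * solute["V"]
        + system["l"] * solute["L"]
        + system["c"]
    )

# Hypothetical placeholder values, NOT calibrated system parameters
example_system = {"s": 1.0, "a": 1.0, "b": 1.0, "d": 1.0, "v": 1.0, "l": 1.0, "c": 1.0}
# Hypothetical placeholder values, NOT measured solute descriptors
example_solute = {"S": 1.0, "A": 4.0, "B": 1.0, "V": 1.0, "L": 1.0}

example_log_s = pplfer_log_s(example_system, example_solute)
```

The cross term vanishes when either A or B is zero, so it only contributes for chemicals that can both donate and accept hydrogen bonds.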

Previous research developed empirical regressions between solute descriptors and system parameters for solvent-air partitioning which can be used as an alternative method to predict solubility [34]. System parameters of PPLFER equations in the form of Eq. 1 can be predicted for each solute using the empirical regressions. These predicted PPLFER equations are then used to predict the partitioning of a solute between air and the solute’s own pure liquid phase, giving a partition ratio (log K_kAk). These log K_kAk values are then converted to VP using Eq. 4, which is a rearrangement of Raoult’s Law [34], and converted to S_W by the three-solubility approach. In Eq. 4, γ is the activity coefficient of the solute which is assumed to be unity in the pure phase, and MV is the molar volume of the liquid or supercooled liquid solute. VP is then unit converted to S_A at standard temperature and pressure and a thermodynamic property cycle is applied to calculate S_W and S_O from the calibrated PPLFER equations for log K_AW and log K_OA.

$$\log \text{VP} = \log \left( \frac{\text{RT}}{\gamma \, \text{K}_{\text{kAk}} \, \text{MV}} \right)$$

(4)

This indirect method has only been validated for predicting the VP of liquids, and testing done in this work for solids showed that the results were poor.
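Eq. 4 can be evaluated as in the following Python sketch. It assumes that MV is supplied in m³/mol so that RT/MV is in Pa; that unit choice, the function name, and the default temperature are assumptions made here for illustration. As noted above, the indirect method has only been validated for liquids.

```python
import math

# Ideal Gas Law Constant in Pa·m³/(mol·K)
R = 8.314462618

def log_vp_from_log_kkak(log_k_kak, mv_m3_per_mol, temp_k=298.15, gamma=1.0):
    """Evaluate Eq. 4: log VP = log(R T / (gamma * K_kAk * MV)), VP in Pa.

    gamma defaults to 1 (activity coefficient of the solute in its own pure phase).
    mv_m3_per_mol is ASSUMED to be in m3/mol for unit consistency with R.
    """
    return math.log10(R * temp_k / (gamma * mv_m3_per_mol)) - log_k_kak

# Hypothetical inputs: log K_kAk = 4 and MV = 1e-4 m3/mol (100 cm3/mol)
example_log_vp = log_vp_from_log_kkak(4.0, 1.0e-4)
```

Each unit increase in log K_kAk lowers log VP by exactly one unit, which is the only solute-specific dependence apart from MV.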

PPLFER equations for partition ratios involving pure solvent phases, water, and air typically have standard errors of fitting and prediction of less than 0.2 log units when calibrated with experimental solute descriptors. The Abraham PPLFERs for VP have larger errors on the order of 0.3 log units for liquids and up to 0.8 log units for some solids, but these equations also contain other correction factors for specific functional groups [39]. For S_W the error is about 0.6 log units [38]. The indirect method for calculating solubilities had errors of about 0.4 and 0.5 log units when applied to solubility in air for liquids. All these statistics are calculated on different datasets and are typically fitting errors rather than predictive errors, so they give an idea of the goodness of fit of the models, but not necessarily the predictive power. If PPLFERs are properly calibrated with sufficient data then they have broad applicability and accuracy [35].

Table 1 summarizes the PPLFER equations used in this work to predict PC properties. The equations for log K_OA and log K_AW have been calibrated in previous work [32, 34]; the system parameters for dry log K_OW (pure octanol) are calculated as the sum of the system parameters for log K_OA and log K_AW, i.e., using the three solubility approach. Sections SI-2, SI-3, and SI-4 detail the calibration of new PPLFER equations in this work, for wet log K_OW (water saturated octanol), log K_OO (hypothetical partition ratio between wet and dry octanol), VP, S_W, and S_O (dry and wet). One of the goals of this work is to create models that predict partition ratios and solubilities which have thermodynamic consistency built in, and this is achieved by calibrating the PPLFER system parameters to be thermodynamically consistent using the concept of the three solubility approach [15]. The PPLFER equations in this work have all been calibrated on experimental data except for S_O, which is only calculated by the three solubility approach due to limited data availability and is shown in a different color in Fig. 1 to reflect this.
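Because the PPLFERs share one linear form, the thermodynamic cycle log K_OW(dry) = log K_OA + log K_AW carries over to the system parameters term by term. The following Python sketch shows that summation. The function name is hypothetical, and the parameter values are placeholders chosen for easy arithmetic, not the calibrated values in Table 1.

```python
def sum_system_parameters(first, second):
    """Add two sets of Eq. 1 system parameters term by term."""
    return {name: first[name] + second[name] for name in first}

# Hypothetical placeholder values, NOT the calibrated parameters from Table 1
koa_params = {"s": 0.5, "a": 3.5, "b": 0.75, "v": 0.25, "l": 0.75, "c": -0.25}
kaw_params = {"s": -2.0, "a": -3.5, "b": -4.5, "v": 2.5, "l": -0.5, "c": 0.5}

# Dry octanol-water parameters follow from the cycle K_OW = K_OA * K_AW
kow_dry_params = sum_system_parameters(koa_params, kaw_params)
```

Any solute evaluated with the summed parameters then satisfies the cycle exactly, which is the sense in which consistency is built into the models.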

One challenge in this process is that there is an inherent discrepancy in the three solubility approach with regards to how the data are measured. Most measurements of log K_OW are performed with the octanol and water phases in direct contact so that the octanol becomes saturated with water and vice versa. The solubility of octanol in water is very low so the effect of partitioning of chemicals to the water phase is negligible. However, a significant amount of water is soluble in the octanol phase, and this changes the partitioning properties [41]. The PPLFER system parameters in Table 1 show the “dry” log K_OW will be lower than the “wet” log K_OW for polar and hydrogen bonding chemicals because the s, a, and b system parameters are lower. In contrast, log K_OA measurements are usually made using dry octanol [42]. In addition, the difference between wet S_O (S_O[w]) and dry S_O (S_O[d]) must be considered. A PPLFER for a hypothetical partition ratio between wet and dry octanol (K_OO) has been derived in this work which can make these corrections, ensure thermodynamic consistency, and is implemented as a QSPR in IFSQSAR.

IFSQSAR description and AD

The IFSQSAR development methods have been described in previous work [27,28,29, 32, 43] and are summarized in Section SI-1. An important aspect to understand for this work is the division of experimental data into a training dataset used to calibrate the QSPR and a validation dataset used to validate the QSPR and estimate the prediction uncertainty. The splitting is rational and deterministic, ensuring that both datasets represent the chemical diversity of the experimental data and the range of expected values. The solute descriptor QSPRs were trained and validated on a common dataset, so that each solute is only in either the training or validation dataset for all solute descriptor QSPRs. Further details on the dataset splitting are in Brown 2022 [32]. All the QSPRs and PPLFERs described here are coded in the IFSQSAR version 1.1.0 Python package and implemented in the EAS-E Suite online platform (www.eas-e-suite.com). IFSQSARs apply three complementary approaches to define the basic AD of predictions. The first two approaches are very similar to, but were developed in parallel with, the AD methods applied by OPEn structure–activity/property Relationship App (OPERA) [44]. The first approach uses the leverage which is interpreted as a measure of extrapolation from the training dataset [45, 46], and the second is Chemical Similarity Score (CSS) which is a nearest neighbours approach and is less sensitive to extrapolation. Various cut-offs are defined for both approaches and are combined to assign each QSPR prediction an Uncertainty Level (UL) between UL 0–3 which correlates with uncertainty of the QSPR predictions, or inversely correlates with predictive power. Individual predictions can always be good or bad regardless of the UL; the UL only quantifies the typical uncertainty.
Some special cases are also defined. UL 4 means that all fragments in the QSPR have a count of zero for the chemical; this may be defined as in or out of the AD depending on the meaning of the intercept. UL 5 is the third complementary AD approach and has been described as a “denylist” AD check [47], but also might be described as a negative domain check, or inverse structural alerts. All the information about atoms and bonds in the training dataset is summarized regardless of whether the exact substructures are included in the fragments selected for the QSPR. Chemicals are checked against this summary and if they contain a substructure that is not found in the training data then they are flagged as UL 5. Finally, for some QSPRs it is pragmatic to set boundary conditions on possible values, and any predictions which violate these boundary conditions are flagged as UL 6. Table 2 summarizes the seven IFSQSAR ULs.

Table 2 IFSQSAR uncertainty level (UL) specifications

The IFSQSARs that use chemical structure to predict solute descriptors (used in PPLFER equations) and other PC properties directly provide a UL and predictivity metric along with each prediction [32]. Here predictivity refers to the predictive power of the QSPR, i.e. how accurate the predictions are likely to be, or inversely how uncertain the predictions are likely to be. Predictivity is quantified by the root mean squared error of prediction (RMSEP) as calculated from the external validation dataset of each solute descriptor QSPR; more discussion of the RMSEP can be found in the “Metrics of model performance and predictivity” section. As the RMSEP increases the predictivity is lower and the uncertainty is higher.

All the property PPLFER equations in IFSQSAR are implemented as Meta QSPRs. Meta QSPRs use the outputs of other QSPRs as their inputs and calculate new values, aggregated ULs, and error estimates. For example, log K_OW is estimated with a Meta QSPR which combines solute descriptors predicted by QSPR and the experimental system parameters from Table 1 in PPLFER Eq. 1. All the PPLFER equations in this work (K_OW, K_AW, K_OA, VP, S_W and S_O) are implemented as Meta QSPRs. Note that IFSQSAR will by default use experimental solute descriptors instead of predicted ones where possible to increase the accuracy of predictions. This feature of IFSQSAR was not included in the validation process of this study so that only predicted solute descriptors were used to evaluate the models’ expected predictivity for novel or data-poor chemicals. The AD and predictivity as UL and RMSEP of the Meta QSPRs are calculated as an aggregate of UL and RMSEP of the Meta QSPR model inputs and other parameters written into the model such as the experimental system parameters. The details are described elsewhere [32], but in brief the aggregated UL and RMSEP are calculated according to propagation of uncertainty rules. These calculations are done automatically in the Meta QSPR code and documented in the output.
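The text states that aggregated RMSEP values follow propagation of uncertainty rules, with details given elsewhere [32]. The following Python sketch shows one standard form of such a rule, first-order propagation through the linear Eq. 1 assuming independent errors in the predicted solute descriptors. That independence assumption, the function name, and the optional system error term are assumptions made here for illustration and may differ from the IFSQSAR implementation.

```python
import math

def aggregate_rmsep(system, descriptor_rmsep, system_error=0.0):
    """Propagate descriptor RMSEPs through a linear PPLFER.

    ASSUMPTION: errors are independent, so variances add:
        var = sum((system_parameter * descriptor_RMSEP)**2) + system_error**2
    system:           lower-case system parameters, e.g. {"s": ..., "a": ...}
    descriptor_rmsep: upper-case descriptor RMSEPs, e.g. {"S": ..., "A": ...}
    system_error:     optional error attributed to the calibrated equation itself
    """
    variance = system_error ** 2
    for name, err in descriptor_rmsep.items():
        variance += (system[name.lower()] * err) ** 2
    return math.sqrt(variance)

# Hypothetical placeholder numbers, chosen only for easy arithmetic
example_rmsep = aggregate_rmsep({"s": 3.0, "a": 4.0}, {"S": 1.0, "A": 1.0})
```

Under this rule a descriptor contributes to the aggregated uncertainty in proportion to the magnitude of its system parameter, so descriptors with large system parameters dominate the error of the predicted property.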

Meta QSPRs for predicting VP and S_W for liquids have already been implemented in previous work based on QSPRs that predict the PPLFER system parameters for liquid solvents [32]. These are referred to as indirect predictions in the present study as opposed to the direct predictions of VP and S_W made with the new PPLFER system parameters in Table 1. As outlined in the “Model evaluations with empirical datasets and endpoint relevance” section, it is known that VP, S_W and S_O for liquids and solids have notable differences. To help account for these differences two previously created QSPRs were used, and two new ones were created. The previously developed direct prediction QSPRs are the entropy of fusion (ΔS_M) and T_M [29]. The first new QSPR introduced in this study is a classifier model to predict whether a chemical is a gas, liquid or solid at 25 °C and standard atmospheric pressure, which is used to determine when corrections for solids need to be applied. The state classifier is implemented as a Meta QSPR which takes solute descriptors, T_M, and T_B as inputs, and is described in Section SI-5. Finally, as discussed in the “Model evaluations with empirical datasets and endpoint relevance” section, the values for S_W and S_O are capped at solute molar volume (MV) in some cases; therefore, Section SI-6 describes a new QSPR for MV developed in this study.

Model evaluations with empirical datasets and endpoint relevance

Figure 1 shows the general workflow and the relationships between properties datasets and the models developed in this study. Yellow filled boxes represent experimental datasets, and in the case of the system parameters, values that have been empirically calibrated using only experimental data inputs. Blue filled boxes represent QSPR predictions, and green filled boxes represent hybrid models which combine QSPR predictions with system parameters calibrated on experimental data. There is a separate PPLFER equation and model for each property, but the calibration of the system parameters for all partitioning properties is interrelated through the three solubility approach. The main division of experimental data is between solutes with available solute descriptors, which are used for training and validating the models (top left box), and solutes with partitioning data but no solute descriptors (bottom box). IFSQSAR predictions were made for the following PC properties then evaluated using datasets of experimental values originally from the PhysProp database included in EPI Suite package [48]: log K_OW, log K_AW, log K_OA, log VP, and log S_W. These predictions and data are then used to assess the predictivity of IFSQSAR PPLFER-based models for novel chemicals. The PhysProp datasets have been further curated as a part of the creation of the OPERA QSAR package, including assigning all chemicals QSAR-ready structures as SMILES [44, 49]. Chemicals have been matched by CAS number with chemicals in the solute descriptor database used to develop the IFSQSARs [32], and identified as being in the training dataset, the validation dataset, or in neither. Chemicals in neither dataset are novel and are referred to here as being external to IFSQSAR.

There are several caveats to consider when comparing the IFSQSAR model predictions to the experimental datasets of PC properties. The first thing to consider is the difference between wet and dry octanol, as described in “Theory” section. Secondly, PC properties involving a pure chemical phase such as VP and S_W are different for liquids and solids. Chemical fate and transport models typically assume that all chemicals are liquids, or supercooled liquids, also called subcooled liquids. The theory is that at very low concentrations in a phase the solid chemicals behave as liquids because there are never enough molecules to form a solid pure phase. Measured or predicted VP and S data for solids can be corrected to equivalent supercooled liquid values using the Clausius Clapeyron equation or one of its simplifications, the most common being the Van’t Hoff approximation [50]. This is discussed in more detail in Section SI-4. As discussed in the previous section the data inputs required to apply the Van’t Hoff approximation, ΔS_M and T_M, were developed in previous work, and the new classifier helps determine if a chemical is likely to be a liquid or a solid at system temperature (default in IFSQSAR = 25 °C).
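A commonly used textbook form of the Van’t Hoff approximation expresses the solid to supercooled liquid correction as a fugacity ratio F, with log F = −ΔS_M·(T_M − T)/(ln(10)·R·T) for T_M above T and log F = 0 otherwise. The following Python sketch implements that form as an illustration. The exact expression used by the authors is given in Section SI-4 and may differ, and the function names and example values are assumptions.

```python
import math

# Ideal Gas Law Constant in J/(mol·K)
R = 8.314462618

def log_fugacity_ratio(delta_s_m, t_m, temp_k=298.15):
    """Return log10 of the solid/supercooled-liquid fugacity ratio.

    ASSUMED textbook form: log F = -dS_M * (T_M - T) / (ln(10) * R * T).
    delta_s_m: entropy of fusion in J/(mol·K); t_m: melting point in K.
    Chemicals that melt at or below temp_k are liquids and need no correction.
    """
    if t_m <= temp_k:
        return 0.0
    return -delta_s_m * (t_m - temp_k) / (math.log(10.0) * R * temp_k)

def solid_to_supercooled_liquid(log_value_solid, delta_s_m, t_m, temp_k=298.15):
    """Convert a log VP or log S value for a solid to its supercooled liquid equivalent."""
    return log_value_solid - log_fugacity_ratio(delta_s_m, t_m, temp_k)

# Hypothetical solid: entropy of fusion 56.5 J/(mol·K), melting point 398.15 K
example_log_f = log_fugacity_ratio(56.5, 398.15)
```

Because log F is negative for solids, the supercooled liquid value is always higher than the measured solid value, and the gap grows with melting point.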

Another endpoint mismatch that is commonly encountered in partitioning data is the partitioning of ions and ionizable chemicals. This is mostly important for partitioning where water is one of the phases, although the effect in other phases, e.g., water-saturated octanol, is possible. The present study focusses only on the partitioning of neutral organic chemicals. Chemical ionization is only considered in this work to identify experimental data where the measurement may be influenced by it and remove those data from model development and evaluation. Strong acids and bases are identified as acids with a pK_a less than 4 and bases with a pK_a greater than 10 and were removed. Experimental pK_a were collected from the curated OPERA database. If a pK_a was not available, a consensus value between ChemAxon estimates (available in the ChEMBL database [51]) and ACD Labs 2023.1.0 (Build 3666) was determined.
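The screening rule for strong acids and bases can be written as a simple filter, sketched below in Python. The record layout, chemical names, and pK_a values are hypothetical; only the two cut-offs (acids with pK_a below 4, bases with pK_a above 10) come from the text.

```python
def is_strong_ionizer(kind, pka):
    """Flag strong acids (pKa < 4) and strong bases (pKa > 10)."""
    if kind == "acid":
        return pka < 4.0
    if kind == "base":
        return pka > 10.0
    raise ValueError(f"unknown ionizable group type: {kind!r}")

# Hypothetical records: (name, ionizable group type, pKa)
records = [
    ("chem_1", "acid", 2.5),   # strong acid, removed
    ("chem_2", "acid", 9.8),   # weak acid, kept
    ("chem_3", "base", 10.7),  # strong base, removed
    ("chem_4", "base", 5.2),   # weak base, kept
]

kept = [name for name, kind, pka in records if not is_strong_ionizer(kind, pka)]
```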

In this study upper boundaries have been set for VP and S_O and S_W values. When a solute is miscible in water or octanol there is no limit for how much of the solute can be dissolved. This might be expressed as an S where the amount of the solute is greater than the amount of water or octanol, which is not measurable or physically reasonable in a real system. We propose as a reasonable upper boundary on all solubility values to use the inverse of the solute liquid MV, i.e., the concentration of solute in its own pure liquid phase. The liquid MV QSPR developed in this work is used to set the capped value for solubility predictions. A similar upper boundary can be defined for VP; in this case we use standard atmospheric pressure as the upper boundary, because in the context of modelling the natural environment the pressure of a chemical will not be greater than this value.
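The two upper boundaries can be applied as in the following Python sketch, which caps log S at the log of the inverse molar volume and log VP at standard atmospheric pressure. It assumes that S is in mol/L and VP in Pa, as defined in the Introduction, and that MV is supplied in L/mol. The function names and the MV unit are assumptions made for illustration.

```python
import math

# Standard atmospheric pressure in Pa
ATM_PA = 101325.0

def cap_log_solubility(log_s_mol_per_l, mv_l_per_mol):
    """Cap log S (mol/L) at log(1/MV), the concentration of the pure liquid solute.

    mv_l_per_mol is ASSUMED to be the liquid molar volume in L/mol.
    """
    return min(log_s_mol_per_l, math.log10(1.0 / mv_l_per_mol))

def cap_log_vp(log_vp_pa):
    """Cap log VP (Pa) at standard atmospheric pressure."""
    return min(log_vp_pa, math.log10(ATM_PA))

# Hypothetical miscible solute with MV = 0.1 L/mol and an uncapped prediction of log S = 2
capped_log_s = cap_log_solubility(2.0, 0.1)
```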

Metrics of model performance and predictivity

The RMSEP is calculated from experimental values of the external validation datasets and predicted values from IFSQSAR PPLFER based models using Eq. 5:

$$\text{RMSEP} = \left( \frac{\sum_{i=1}^{n_{\text{ext}}} \left( y_{i} - \hat{y}_{i} \right)^{2}}{n_{\text{ext}}} \right)^{0.5}$$

(5)

where n_ext is the number of data points in the validation dataset, y_i are the experimental values and ŷ_i are the predicted values. The RMSEP is then used to calculate an estimated 95% prediction interval (PI) using Eq. 6:

$$95\%\ \text{PI} = \left[ \text{M} - 1.96 \cdot \text{RMSEP},\ \text{M} + 1.96 \cdot \text{RMSEP} \right]$$

(6)

where M is the predicted PC property value. In an ideal case the validation dataset of a QSPR is representative of the structural diversity of chemicals to which the model might be applied. In this ideal case the RMSEP calculated from the validation dataset would be a good estimate of global RMSEP and 95% of predictions would have the experimental value contained within their PI. However, in practice the data available for validating QSPRs is limited by the experimental methods used to measure the data and will not be representative of the diversity of chemicals to which the model may be applied, so the RMSEP and the PI will only be estimates.
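Eqs. 5 and 6 translate directly into code, as in the following Python sketch. The function names and the example values are illustrative and hypothetical.

```python
import math

def rmsep(y_exp, y_pred):
    """Eq. 5: root mean squared error of prediction over the validation data."""
    n_ext = len(y_exp)
    squared_errors = sum((y - y_hat) ** 2 for y, y_hat in zip(y_exp, y_pred))
    return math.sqrt(squared_errors / n_ext)

def prediction_interval_95(predicted_value, rmsep_value):
    """Eq. 6: 95% PI = [M - 1.96*RMSEP, M + 1.96*RMSEP]."""
    half_width = 1.96 * rmsep_value
    return (predicted_value - half_width, predicted_value + half_width)

# Hypothetical experimental and predicted log values
example_rmsep = rmsep([0.0, 0.0], [3.0, 4.0])
example_pi = prediction_interval_95(2.0, 1.0)
```

The factor 1.96 is the two-sided 95% quantile of a normal distribution, so Eq. 6 implicitly treats the prediction errors as approximately normal with zero mean.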

In cases where predictions are made for chemicals that are well within the AD, the RMSEP is typically comparable to the goodness-of-fit quantified as the standard deviation between the experimental and fitted value of the training dataset, i.e., the same as Eq. 5 but between the experimental values of the training dataset and the fitted QSPR values. The further out of the AD a group of predictions are, the larger the real RMSEP will be. As stated above, individual predictions can always be good or bad regardless of whether they are in the AD or not, the RMSEP is a probabilistic metric.

During IFSQSAR model development each chemical in the external validation dataset is assigned a UL as discussed in “IFSQSAR description and AD” section, and then the RMSEP is calculated for all chemicals within each UL. ULs 0 to 3 almost always have an increasingly large RMSEP for investigated datasets [29, 32, 43]. UL 4 may have a high or low RMSEP depending on whether intercept-only predictions are considered within the AD, which depends on the property and structure of the model. Because UL 5 means that the chemical contains atoms or bonds not represented in the training dataset, the RMSEP cannot be reasonably estimated, because the untrained atoms and bonds may have unexpected effects on the property. However, in practice the RMSEP of predictions for UL 5 has typically been comparable to the RMSEP for UL 3, provided that the chemicals are not inorganic. UL 6 means that the model has made a prediction outside of a boundary condition set at the time of model calibration. This UL is assigned after a normal prediction is made and a UL is assigned; the RMSEP of the original UL is used. The same trends of RMSEP with aggregate UL are observed for the PPLFER based models in this work.
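The per-UL bookkeeping described above can be sketched as follows. This is an illustrative reading of the text, not IFSQSAR source code: the function names, the record layout, and the data are hypothetical. It groups validation residuals by UL, computes one RMSEP per UL, and shows how a prediction flagged as UL 6 would keep the RMSEP of the UL it was originally assigned.

```python
import math
from collections import defaultdict


def rmsep_by_ul(records):
    """Return {UL: RMSEP} from (ul, experimental, predicted) records."""
    squared_errors = defaultdict(list)
    for ul, y_exp, y_pred in records:
        squared_errors[ul].append((y_exp - y_pred) ** 2)
    return {ul: math.sqrt(sum(errs) / len(errs))
            for ul, errs in sorted(squared_errors.items())}


def assign_rmsep(original_ul, violates_boundary, ul_table):
    """Return (reported UL, RMSEP); UL 6 reuses the RMSEP of the original UL."""
    reported_ul = 6 if violates_boundary else original_ul
    return reported_ul, ul_table[original_ul]


# Hypothetical validation records: (UL, experimental, predicted).
records = [
    (0, 2.0, 2.2), (0, 3.0, 2.8),
    (1, 1.0, 1.5), (1, 4.0, 3.5),
    (3, 5.0, 6.0), (3, 2.0, 1.0),
]
ul_table = rmsep_by_ul(records)
flagged = assign_rmsep(original_ul=3, violates_boundary=True, ul_table=ul_table)
print(ul_table, flagged)
```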

One major goal of this work is to assess the accuracy of the RMSEP estimates provided by IFSQSAR models when compared to data that are not in the training or validation datasets, i.e., novel chemicals. The RMSEP values (in log units) will then be adjusted for the partitioning properties log K_OW, log K_AW, log K_OA, S_A and S_W based on this comparison. To do this, PIs are calculated from the RMSEP and then the actual fraction of predictions within the PI is calculated to assess the accuracy of the PI and the RMSEP. The RMSEPs of each partitioning property are adjusted until the 95% PIs contain at least 95% of the experimental values, by multiplying by a factor that depends on the trends observed for different ULs or chemical states.
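The adjustment procedure can be illustrated with a short sketch. The text does not state how the multiplicative factor was searched for, so the fixed-step grid search below is an assumption, as are the function names and the residuals, which are invented for illustration.

```python
def coverage(residuals, rmsep_value, scale=1.0):
    """Fraction of residuals that fall inside the (scaled) 95% PI."""
    half_width = 1.96 * rmsep_value * scale
    inside = sum(1 for r in residuals if abs(r) <= half_width)
    return inside / len(residuals)


def smallest_scale(residuals, rmsep_value, target=0.95, step=0.05, max_scale=5.0):
    """Smallest factor on a fixed grid (an assumed search strategy) that
    brings the PI coverage up to the target; None if no grid value does."""
    n_steps = int(round((max_scale - 1.0) / step))
    for k in range(n_steps + 1):
        scale = round(1.0 + k * step, 10)
        if coverage(residuals, rmsep_value, scale) >= target:
            return scale
    return None


# Hypothetical residuals (experimental minus predicted, log units).
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -0.9,
             0.05, -0.15, 0.25, -0.35, 0.45, -0.55, 0.65, -0.75, 1.1, 3.0]

initial = coverage(residuals, rmsep_value=0.5)
factor = smallest_scale(residuals, rmsep_value=0.5)
print(f"coverage before scaling: {initial:.2f}; factor needed: {factor}")
```

In this toy case 18 of 20 residuals lie inside the unscaled PI, and one extreme outlier stays outside the PI even after scaling, which mirrors the point that the target is coverage of 95% rather than of every data point.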

Results and discussion

Evaluation of IFSQSAR partition ratio predictions

Figures 2A, B show the IFSQSAR predictions compared to experimental K_OW data split into two subsets. Figure 2A includes chemicals that have experimental solute descriptors and are in the IFSQSAR validation dataset, but the plotted values are the IFSQSAR predictions. Figure 2B shows chemicals with no experimental solute descriptors, which are therefore entirely external to the IFSQSAR partition ratio and solubility models. The data points in Fig. 2 are colored by the aggregate UL of the predictions, with UL 0, the least uncertain, colored green and UL 1 to UL 3 colored blue, yellow and red, corresponding to increasing uncertainty. Data points with lower UL tend to fall closer to the 1:1 line, indicating more accurate predictions. UL 5 and UL 6 are colored purple and plotted with triangle and square shapes to distinguish their different AD types. As could be expected, chemicals which are external to IFSQSAR have more uncertainty and variability in the predictions (RMSEP 1.00) compared to chemicals in the IFSQSAR validation dataset (RMSEP 0.57). The external data span a larger range of log K_OW values, from about −5 to 11, compared to the validation data, which span values from about −2 to 8. The chemicals in this expanded lower range tend to be flagged as out of the AD with UL 2 or UL 3 and are mostly identified as solids by the chemical state predictions. The chemicals in the middle of the range with over-predicted log K_OW values, which are mostly UL 2 and UL 3, are also mostly identified as solids and are mostly very large and complex chemicals.

Figure S2 shows the data for wet and dry K_OW. Strong acids and bases and salts were not included because these data were likely distribution ratios (D_OW) rather than K_OW. Data in the IFSQSAR training dataset are excluded from these figures; only data in the IFSQSAR validation dataset and data that are in neither set are included. The IFSQSAR model based on the PPLFER equation for dry K_OW shows poorer statistics (RMSEP 1.19) than the model based on the PPLFER equation for wet K_OW (RMSEP 0.98). As expected, the PPLFER for dry K_OW tends to underestimate the experimental K_OW values for more water-soluble chemicals, with the predictions skewing to lower values.

Figure S3 shows chemicals identified as liquids or solids plotted separately. Predicted K_OW values for liquids are more accurate with overall RMSEP 0.67 compared to 1.03 for solids. The ratio between RMSEP for solids and the RMSEP for liquids tends to increase with increasing RMSEP. More of the liquids are within the AD: 64% have aggregate UL 0 or 1, compared to solids, of which only 31% are assigned UL 0 or 1. This means that solids are more likely to be out of the AD, and regardless of whether they are in the AD or not, the K_OW predictions for solids are less accurate, though the difference is relatively small for solids that are UL 0, 1, or 2. There are a few different reasons why the predictions for solids may be less accurate. Solids tend to be larger chemicals than liquids, and fragment-based QSPR predictions, such as the IFSQSAR solute descriptor QSPRs which are used in the PPLFER based models in this work, are known to be less accurate for larger chemicals [43, 52]. The functional group counts in larger chemicals are more likely to be outside of the range of values in the training dataset, meaning that the QSPRs must be extrapolated outside of their training set. Extrapolation is always more uncertain than interpolation between values within the range of the training dataset. Larger chemicals have more opportunities for intramolecular interactions between functional groups, which can confound group contribution QSPRs such as those in IFSQSAR. Making experimental measurements for larger chemicals also tends to be more challenging because their solubility in some phases may be very low, so the experimental data also may be less accurate. For example, solids might be more likely to self-associate and undergo a phase transition at low concentrations in water or octanol, such as has been observed for perfluorinated alkyl substances [53], which would have a confounding effect for interpreting experimental concentrations in octanol and water. Another example is polymorphism, where a chemical has multiple solid forms, each with a different crystal structure and a different solubility [54]. This effect is well known in pharmaceutical science because it is an aspect of drug formulation but is not considered as much in environmental applications.

Table 3 summarizes statistics for the model evaluations and shows the fraction of each subset of data where the experimental values fall within the 95% PI calculated from the aggregate RMSEP estimates. For chemicals in the IFSQSAR validation dataset slightly more than 95% of the chemicals fall within the PI, which is to be expected because these chemicals are a subset of the data used to estimate the RMSEP. For the data which are external to IFSQSAR only 90% of chemicals fall within the 95% PI. The results are about the same for liquids vs. solids at 90% overall. The results are quite consistent across the different ULs with no obvious trend. For liquids the fraction within the PI is more variable at UL 2 and UL 3 due to the small number of chemicals. Adjusted RMSEP estimates will be made for all QSPRs in “IFSQSAR uncertainty estimates” section.

Table 3 Validation statistics for log K_OW, log K_AW, and log K_OA


There are far fewer data available for K_AW and K_OA than for K_OW; therefore, the statistics are less reliable, but the results are consistent with the general trends observed for K_OW. Figures analogous to Figs. 2A, B, and S3 are shown in the SI for K_AW and K_OA (Figure S4 through Figure S6). The prediction statistics are better for chemicals in the validation dataset than in the dataset of chemicals external to IFSQSAR as shown in Figure S4 for K_AW and Figure S6 for K_OA. Figure S5 shows the prediction statistics for liquids are better than for solids for K_AW, while for K_OA the external test set chemicals are all solids. Table 3 also summarizes the validation statistics for the performance of the K_AW and K_OA models. The fraction of experimental data which falls within the 95% PI is much more variable compared to the data for K_OW, likely due to the limited amount of data, but again the overall trend is similar: chemicals in the validation set are within the PI more often than chemicals in the external set, and liquids are within the PI about as often as solids.

Evaluation of IFSQSAR VP and S_W predictions

There are two IFSQSAR methods for predicting VP and S_W: the indirect method developed in previous work [32, 34] and the direct method developed in the current study as described in Section SI-4. The indirect method predicts system parameters for VP and then calculates system parameters for S_W by thermodynamic property cycle, while the direct method uses system parameters calibrated with experimental data for S_W and uses a thermodynamic property cycle to calculate system parameters for VP. Table 4 shows the validation statistics for the VP and S_W direct method predictions, and Table S1 and Table S2 in the SI compare the direct predictions to the indirect predictions and direct predictions with the Van’t Hoff correction applied. Section SI-4 briefly describes theoretical reasons that VP and S_W will be different for liquids and solids. The indirect method was trained only on liquids and is not applicable to solid chemicals; the RMSEPs for predicting properties for solids, i.e., VP_[s] and S_W[s], are 5 to 6, respectively (results not shown). Figure S7 shows the indirect method gives good predictions for VP_[l] and S_W[l] with RMSEP values of 0.78 and 0.96, respectively, for chemicals which are external to IFSQSAR, i.e., are not in either the training or validation dataset of the IFSQSAR solute descriptor QSPRs.
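The thermodynamic property cycle linking VP, S_W and K_AW can be sketched from the definitions S_A = VP/RT and K_AW = S_A/S_W. The sketch below applies the cycle to property values rather than to PPLFER system parameters, and it assumes VP in Pa, solubilities in mol/m³, and the same (liquid or super-cooled liquid) reference state for both properties. The function names and the example values are hypothetical.

```python
import math

R = 8.314462618  # ideal gas law constant, Pa m^3 mol^-1 K^-1
T = 298.15       # system temperature, K


def log_sa_from_log_vp(log_vp_pa):
    """log S_A (mol/m^3) from log VP (Pa), using S_A = VP / RT."""
    return log_vp_pa - math.log10(R * T)


def log_sw_from_cycle(log_vp_pa, log_kaw):
    """log S_W (mol/m^3) from log VP and log K_AW, using K_AW = S_A / S_W."""
    return log_sa_from_log_vp(log_vp_pa) - log_kaw


def log_vp_from_cycle(log_sw, log_kaw):
    """log VP (Pa) from log S_W and log K_AW; the same cycle run in reverse."""
    return log_sw + log_kaw + math.log10(R * T)


# Hypothetical liquid: log VP = 3.0 (Pa) and log K_AW = -2.0.
log_sw = log_sw_from_cycle(3.0, -2.0)
log_vp_back = log_vp_from_cycle(log_sw, -2.0)
print(f"log S_W = {log_sw:.2f}; round-trip log VP = {log_vp_back:.2f}")
```

Running the cycle forward and then backward returns the starting VP, which is the internal consistency that a property cycle is meant to guarantee.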

Table 4 Validation statistics for log VP and log S_W


The direct method predicts VP and S_W specifically for liquids and supercooled liquids if the chemical is a solid at 25 °C. When applying the direct method to solids the predictions need to be converted to VP_[s] and S_W[s] using the Van’t Hoff equation and ΔS_M and T_M which can be predicted by QSPRs in the IFSQSAR software. These additional QSPR predictions will introduce more uncertainty and variability into the predicted values for solids and the predictions would be expected to be less accurate. Because of this additional uncertainty the prediction accuracy of VP_[s] and S_W[s] using the Van’t Hoff correction is no better than just using the supercooled liquid predictions when comparing to the experimental data. Nevertheless, we present the results here for thoroughness because comparing the supercooled predictions to experimental VP_[s] and S_W[s] is an end-point mismatch. Large predictions for VP and S_W are capped to provide more reasonable values and assigned UL 6 corresponding to a boundary condition violation. Aside from the challenges for predicting properties for solids, much the same trends are observed in the data and model performance as observed for the partition ratios.
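To illustrate the size of the liquid-to-solid correction, the sketch below uses a commonly cited simplified form of the Van’t Hoff relationship, log(X_[s]/X_[l]) = −ΔS_M (T_M − T) / (2.303 R T), which assumes a constant entropy of melting and neglects heat capacity differences. The exact form used in IFSQSAR is given in Section SI-4 and may differ. The function names are hypothetical, and the example inputs (Walden's rule value of 56.5 J mol⁻¹ K⁻¹ for ΔS_M and a melting point of 398.15 K) are illustrative rather than IFSQSAR predictions.

```python
import math

R = 8.314462618  # ideal gas law constant, J mol^-1 K^-1


def log_solid_correction(delta_s_m, t_m, t=298.15):
    """log10 of the solid / super-cooled liquid ratio for VP or S_W.

    Simplified Van't Hoff form (assumed here, see lead-in). Returns 0 for
    chemicals that melt at or below the system temperature.
    """
    if t_m <= t:
        return 0.0
    return -delta_s_m * (t_m - t) / (math.log(10) * R * t)


def solid_property(log_liquid_value, delta_s_m, t_m, t=298.15):
    """Convert a super-cooled liquid log VP or log S_W to the solid value."""
    return log_liquid_value + log_solid_correction(delta_s_m, t_m, t)


# Hypothetical solid: melting point 100 K above 25 °C, Walden's rule entropy.
correction = log_solid_correction(delta_s_m=56.5, t_m=398.15)
print(f"correction = {correction:.2f} log units")
```

With these inputs the correction lowers the predicted value by roughly one log unit, so errors in the predicted ΔS_M and T_M propagate directly into VP_[s] and S_W[s].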

Figure S8 shows the predictions for VP using the direct method versus experimental values for chemicals external to IFSQSAR, comparing the effect of correcting with the Van’t Hoff equation or leaving the data uncorrected. Figures 2C, D show predictions corrected with the Van’t Hoff equation for data that are in the IFSQSAR validation dataset and data that are external to IFSQSAR. As is observed for the log K values, predictions for chemicals in the validation dataset (RMSEP 0.91) are more accurate than predictions for external chemicals (RMSEP 2.04). Predictions for liquids are again more accurate (RMSEP 0.71) than predictions for solids (RMSEP 2.59). Table 4 and Table S1 show the statistics for IFSQSAR log VP predictions. The trend is again the same for log S_W, with predictions for chemicals in the validation dataset (RMSEP 1.28) more accurate than predictions for external chemicals (RMSEP 1.69) as shown in Fig. 2E, F, and predictions for liquids (RMSEP 0.88) more accurate than predictions for solids (RMSEP 1.81). Figure S9 shows the data with and without correction with the Van’t Hoff equation, and Table 4 and Table S2 show the statistics for IFSQSAR log S_W predictions.

The indirect and direct IFSQSAR methods for predicting log VP_[l] and log S_W[l] have comparable RMSEP and AD coverage; therefore, the direct method is preferable because the model has fewer inputs. For chemicals flagged as UL 0, 1, 2, or 6 the IFSQSAR model predictions for solids with the Van’t Hoff correction applied have comparable or better RMSEP compared to the predictions with no correction applied. However, for chemicals flagged as being egregiously outside of the AD with UL 3 or UL 5 the IFSQSAR predictions with no Van’t Hoff correction applied have a better RMSEP. This can be interpreted to mean that if the IFSQSAR predictions are already very far outside of the AD, adding further correction factors with their own AD and uncertainty is likely to only make the predictions worse.

IFSQSAR uncertainty estimates

Tables 3 and 4 show the IFSQSAR 95% PIs typically capture about 80–90% of the deviations from experimental data for the external dataset, indicating a slight underestimation of the standard error of prediction. Multiplying the estimated RMSEP by 1.25 for all IFSQSAR QSPRs brought the fraction within the 95% PI of the partition ratio models close to 95%. No further adjustments were required for the partition ratio QSPRs. For the VP and S_W QSPRs there is a tendency for the 95% PIs to capture less than 95% of the predictions for solids; therefore, additional multiplicative adjustment factors of 1.67 and 1.25 were applied to the 95% PIs for the VP and S_W QSPRs respectively for chemicals identified as maybe or likely solids by the IFSQSAR state classifier. After these adjustments there was still a tendency for the VP QSPR to capture less than 95% for chemicals with high UL; therefore, an additional multiplicative adjustment factor of 1.25 is applied to chemicals with UL 2, UL 3, and UL 5.
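The adjustment factors described above can be collected into a single lookup, sketched below. This is one reading of the text rather than the IFSQSAR implementation: it assumes the factors compound multiplicatively, it applies them to the RMSEP (which is equivalent to scaling the PI width in Eq. 6), and the function name, property labels, and boolean solid flag are hypothetical.

```python
def adjusted_rmsep(base_rmsep, prop, flagged_solid, ul):
    """Scale a validation-set RMSEP by the adjustment factors in the text.

    prop: "VP", "SW", or a partition ratio label such as "KOW".
    flagged_solid: True if the state classifier returns maybe or likely solid.
    ul: aggregate uncertainty level (for UL 6, pass the original UL).
    Assumes the factors multiply together (an interpretation, see lead-in).
    """
    factor = 1.25  # applied to all QSPRs
    if flagged_solid:
        if prop == "VP":
            factor *= 1.67
        elif prop == "SW":
            factor *= 1.25
    if prop == "VP" and ul in (2, 3, 5):
        factor *= 1.25
    return base_rmsep * factor


# Hypothetical base RMSEP values, for illustration only.
kow_liquid = adjusted_rmsep(0.5, "KOW", flagged_solid=False, ul=0)
vp_solid_ul3 = adjusted_rmsep(1.0, "VP", flagged_solid=True, ul=3)
sw_solid_ul1 = adjusted_rmsep(1.0, "SW", flagged_solid=True, ul=1)
print(kow_liquid, vp_solid_ul3, sw_solid_ul1)
```

Under these assumptions the largest combined factor, for VP of a flagged solid at UL 2, 3, or 5, is 1.25 × 1.67 × 1.25 ≈ 2.6.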

The method proposed by Endo to calculate the prediction interval of PPLFER equations [35] was applied to see if the additional uncertainty of extrapolating outside of the PPLFER equation training dataset could explain why some chemicals were not within the estimated RMSEPs. As shown by Endo this is not a large source of additional uncertainty for PPLFERs with at least 100 chemicals in the training dataset, and all PPLFERs used in this work have hundreds of chemicals in their training datasets. The increase in RMSEP from applying this method rarely made any difference in the fraction of chemicals that were within the 95% PI.
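For orientation, the standard ordinary least squares prediction interval that leverage-based approaches build on can be written as below; the exact formulation in [35] may differ in detail. Here x is the vector of solute descriptors for the chemical (with a leading 1 for the intercept), X is the design matrix of the PPLFER training dataset, s is the standard error of the regression, n is the number of training chemicals, and p is the number of fitted system parameters.

$$h = \mathbf{x}^{\mathsf{T}} \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}, \qquad 95\%\,\text{PI} = \widehat{y} \pm t_{0.975,\,n-p}\; s\, \sqrt{1 + h}$$

Because the mean leverage of the training chemicals is p/n, a hypothetical PPLFER with p = 6 and n = 300 has a mean leverage of 0.02, which widens the interval by a factor of only √1.02 ≈ 1.01. This is consistent with the observation that the leverage term rarely changed the fraction of chemicals within the 95% PI.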

Conclusions

In summary, by applying the methods outlined in this study, reasonable PIs can be assigned to all IFSQSAR PPLFER predictions for partition ratios, typically even those which are flagged as out of the AD and assigned UL 2 or UL 3. The main exceptions where a PI cannot be reasonably estimated are cases where the experimental endpoint is not applicable to the chemical in question, e.g., log K_OW at pH 7 of a strong acid. If a chemical is a valid target for the QSPR endpoint, then even if the prediction is out of AD, the model predictions are still useful when an acceptable level of uncertainty from the 95% PI estimation is determined. The acceptable level of uncertainty in a property prediction is fundamentally specific to an end user’s judgement and decision-context. For example, for priority setting or screening-level application of the IFSQSAR models, a higher level of uncertainty may be more tolerable than for a definitive risk assessment scenario. Given that typical experimental variability is about 0.1 log units for log K_OW, and standard errors for PPLFERs with experimental solute descriptors are about 0.2 log units, an RMSEP of about 0.5 for chemicals within the AD of the models is probably an acceptable level of uncertainty for many decision-contexts. Even predictions which are out of the AD will typically have an RMSEP that gives a PI which is smaller than the full range of possible values for a partitioning property.
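To put an RMSEP of 0.5 into linear terms, Eq. 6 gives the following half-width for the 95% PI and the corresponding multiplicative range on K_OW (a worked illustration, not a result reported in the study):

$$1.96 \times 0.5 = 0.98\ \text{log units}, \qquad 10^{0.98} \approx 9.5$$

That is, the experimental K_OW would be expected to lie within about a factor of 10 above or below the predicted value for 95% of such predictions.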

In general, the methods presented here predict partition ratios as log K for novel chemicals with an overall RMSEP of about 1 log unit. The RMSEP of log K_AW is a little larger and that of log K_OA is a little smaller than 1 log unit. This may have to do with the relative difficulties in making the measurements, or in making predictions for them. The log K_AW measurements have more experimental difficulties because of ionization and other effects specific to water, so the inherent variability may be larger; however, there are fewer log K_OA measurements, so the dataset of log K_OA values may not represent the full range of variability. VP and S_W of liquids are also predicted with an RMSEP of about one log unit, but predictions for solids have larger RMSEP, up to 2 log units or more depending on the subset. Many of these predictions are still good, for example 85% of predictions which are out of the AD for solid chemicals in the external VP dataset are within ± 1.98 log units of the expected value, corresponding to the 95% PI of an RMSEP of 1. The high overall RMSEPs for VP and S_W of solids are clearly heavily influenced by a relatively small group of outliers. These instances tend to be strongly under-predicted, apparently due at least in part to the liquid to solid correction done with the Van’t Hoff equation. This disparity in prediction accuracy between liquids and solids is also apparent even for K values, where it should theoretically not be an issue, and warrants further investigation, which will be a part of future work.

The new work described here advances the capacity for estimating uncertainty in PC property predictions, particularly for novel chemicals, and future work will show how these new methods and existing property prediction methods can be used to systematically address uncertainty in PC property data through integrated approaches to testing and assessment.

Availability of data and materials

The data and model predictions included in this study are available in a user-friendly online platform, the Exposure And Safety Estimation (EAS-E) Suite (www.eas-e-suite.com). The IFSQSAR source code is available on GitHub: https://github.com/tnbrowncontam/ifsqsar.

Abbreviations

AD: Applicability domain
EAS-E (Suite): Exposure And Safety Estimation Suite
H: Henry’s Law constant
IFSQSAR: Python package for chemical property prediction used in this work
K: Partition ratio
K_AW: Air–water partition ratio (K_AW = H/RT)
K_OA: Octanol–air partition ratio (Henry’s Law constant for octanol)
K_OW: Octanol–water partition ratio, mutually saturated (wet)
dry K_OW: Octanol–water partition ratio, not mutually saturated (dry)
K_O[w]O[d] or K_OO: Hypothetical partition ratio between wet and dry octanol
MV_[l] or MV: Molar volume of liquids or super-cooled liquids
MW: Molecular weight
OPERA: OPEn structure–activity/property Relationship App
PC (property): Physical–chemical property
PI: Prediction interval
PPLFER: Poly-parameter linear free energy relationship (Abraham solvation model)
R: Ideal gas law constant
RMSEP: Root mean squared error of prediction
SMILES: Simplified molecular input line entry system
S, A, B, V, L: PPLFER solute descriptors (Abraham descriptors)
s, a, b, v, l, c: PPLFER system parameters (Abraham equation system parameters)
S_A: “Solubility” in air (S_A = VP/RT)
S_W[l] or S_W: Solubility in water of liquids or super-cooled liquids
S_W[s]: Solubility in water of solids
S_O[w][l]: Solubility in wet octanol of liquids or super-cooled liquids
S_O[d][l] or S_O: Solubility in dry octanol of liquids or super-cooled liquids
ΔS_M: Entropy of melting
T: System temperature
T_B: Boiling point
T_M: Melting point
UL: Uncertainty Level
VP_[l] or VP: Vapor pressure of liquids or super-cooled liquids
VP_[s]: Vapor pressure of solids
QSP(A)R: Quantitative structure–property (activity) relationship
UPPER: Unified physicochemical property estimation relationships

References

Government of Canada (1999) Canadian Environmental Protection Act, 1999. Canada Gazette Part III, vol 22
Commission E (2007) Regulation (EC) No 1907/2006—Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH). Off J Eur Union L 136:3–280
Frank R (2016) Lautenberg Chemical Safety for the 21st Century Act. US Congress (114th Congress), Pub. L. No. 114–182.
ECHA (2017) Guidance on Information Requirements and Chemical Safety Assessment Chapter R.11 PBT/vPvB Assessment. European Chemicals Agency, Helsinki, Finland
Wegmann F, Cavin L, MacLeod M, Scheringer M, Hungerbühler K (2009) The OECD software tool for screening chemicals for persistence and long-range transport potential. Environ Model Softw 24(2):228–237
Meyer T, Wania F, Breivik K (2005) Illustrating sensitivity and uncertainty in environmental fate models using partitioning maps. Environ Sci Technol 39(9):3186–3196. https://doi.org/10.1021/Es048728t
Armitage JM, Wania F, Arnot JA (2014) Application of mass balance models and the chemical activity concept to facilitate the use of in vitro toxicity data for risk assessment. Environ Sci Technol 48(16):9770–9779. https://doi.org/10.1021/es501955g
Baskaran S, Wania F (2023) Applications of the octanol–air partitioning ratio: a critical review. Environ Sci Atmospheres 3(7):1045–1065. https://doi.org/10.1039/D3EA00046J
Wania F, Lei YD, Baskaran S, Sangion A (2022) Identifying organic chemicals not subject to bioaccumulation in air-breathing organisms using predicted partitioning and biotransformation properties. Integr Environ Assess Manag 18(5):1297–1312. https://doi.org/10.1002/ieam.4555
Buser AM, MacLeod M, Scheringer M, Mackay D, Bonnell M, Russell MH, DePinto JV, Hungerbuhler K (2012) Good modeling practice guidelines for applying multimedia models in chemical assessments. Integr Environ Assess Manage 8(4):703–708. https://doi.org/10.1002/ieam.1299
Li L, Zhang Z, Men Y, Baskaran S, Sangion A, Wang S, Arnot JA, Wania F (2022) Retrieval, selection, and evaluation of chemical property data for assessments of chemical emissions, fate, hazard, exposure, and risks. ACS Environ Au 2(5):376–395. https://doi.org/10.1021/acsenvironau.2c00010
Pontolillo J, Eganhouse RP (2001) The search for reliable aqueous solubility (Sw) and octanol-water partition coefficient (Kow) data for hydrophobic organic compounds: DDT and DDE as a Case Study. Water-Resources Investigations Report 01-4201. U.S. Geological Survey. https://doi.org/10.3133/wri014201
Beyer A, Wania F, Gouin T, Mackay D, Matthies M (2002) Selecting internally consistent physicochemical properties of organic compounds. Environ Toxicol Chem 21(5):941–953. https://doi.org/10.1002/etc.5620210508
Mackay D (2001) Multimedia environmental models: the fugacity approach, 2nd edn. Lewis Publishers, Boca Raton
Cole JG, Mackay D (2000) Correlating environmental partitioning properties of organic compounds: the three solubility approach. Environ Toxicol Chem 19(2):265–270. https://doi.org/10.1002/etc.5620190203
Li NQ, Wania F, Lei YD, Daly GL (2003) A comprehensive and critical compilation, evaluation, and selection of physical-chemical property data for selected polychlorinated biphenyls. J Phys Chem Ref Data 32(4):1545–1590. https://doi.org/10.1063/1.1562632
Schenker U, MacLeod M, Scheringer M, Hungerbühler K (2005) Improving data quality for environmental fate models: a least-squares adjustment procedure for harmonizing physicochemical properties of organic compounds. Environ Sci Technol 39(21):8434–8441
Egeghy PP, Judson R, Gangwal S, Mosher S, Smith D, Vail J, Cohen Hubal EA (2012) The exposure data landscape for manufactured chemicals. Sci Total Environ 414(1):159–166.
Arnot JA, Gobas FAPC (2006) A review of bioconcentration factor (BCF) and bioaccumulation factor (BAF) assessments for organic chemicals in aquatic organisms. Environ Rev 14(4):257–297. https://doi.org/10.1139/a06-005
Wetmore BA, Wambaugh JF, Ferguson SS, Sochaski MA, Rotroff DM, Freeman K, Clewell HJ, Dix DJ, Andersen ME, Houck KA, Allen B, Judson RS, Singh R, Kavlock RJ, Richard AM, Thomas RS (2012) Integration of dosimetry, exposure, and high-throughput screening data in chemical toxicity assessment. Toxicol Sci 125(1):157–174. https://doi.org/10.1093/toxsci/kfr254
Judson R, Richard A, Dix DJ, Houck K, Martin M, Kavlock R, Dellarco V, Henry T, Holderman T, Sayre P, Tan S, Carpenter T, Smith E (2009) The toxicity data landscape for environmental chemicals. Environ Health Perspect 117(5):685–695. https://doi.org/10.1289/ehp.0800168
Abraham MH (1993) Scales of solute hydrogen-bonding: their construction and application to physicochemical and biochemical processes. Chem Soc Rev 22:73–83.
Goss K-U (2005) Predicting the equilibrium partitioning of organic compounds using just one linear solvation energy relationship (LSER). Fluid Phase Equilib 233(1):19–22. https://doi.org/10.1016/j.fluid.2005.04.006
OECD (2007) Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships (QSAR) Models. OECD Environment Health and Safety Publications Series on Testing and Assessment No. 69. Organisation for Economic Cooperation and Development, Environment Directorate, Paris
OECD (2004) OECD Principles for the validation, for regulatory purposes, of (quantitative) structure-activity relationship models. OECD, Paris
OECD (2023) (Q)SAR assessment framework: guidance for the regulatory assessment of (Quantitative) structure − activity relationship models, predictions, and results based on multiple predictions. Series on Testing and Assessment No. 386. Organisation for Economic Cooperation and Development, Paris
Brown TN, Arnot JA, Wania F (2012) Iterative fragment selection: a group contribution approach to predicting fish biotransformation half-lives. Environ Sci Technol 46:8253–8260. https://doi.org/10.1021/es301182a
Arnot JA, Brown TN, Wania F (2014) Estimating screening-level organic chemical half-lives in humans. Environ Sci Technol 48:723–730. https://doi.org/10.1021/es4029414
Brown TN, Armitage JM, Arnot JA (2019) Application of an Iterative Fragment Selection (IFS) method to estimate entropies of fusion and melting points of organic chemicals. Mol Inf 38(8–9):1800160. https://doi.org/10.1002/minf.201800160
Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comp Sci 28(1):31–36. https://doi.org/10.1021/ci00057a005
Lian B, Yalkowsky SH (2014) Unified physicochemical property estimation relationships (UPPER). J Pharm Sci 103(9):2710–2723. https://doi.org/10.1002/jps.24033
Brown TN (2022) QSPRs for predicting equilibrium partitioning in solvent-air systems from the chemical structures of solutes and solvents. J Solution Chem 51(9):1101–1132. https://doi.org/10.1007/s10953-022-01162-2
Endo S, Goss K-U (2014) Applications of polyparameter linear free energy relationships in environmental chemistry. Environ Sci Technol 48(21):12477–12491. https://doi.org/10.1021/es503369t
Brown TN (2021) Empirical regressions between system parameters and solute descriptors of polyparameter linear free energy relationships (PPLFERs) for predicting solvent-air partitioning. Fluid Phase Equilib 540:113035. https://doi.org/10.1016/j.fluid.2021.113035
Endo S (2022) Applicability domain of polyparameter linear free energy relationship models evaluated by leverage and prediction interval calculation. Environ Sci Technol 56(9):5572–5579. https://doi.org/10.1021/acs.est.2c00865
Ulrich N, Endo S, Brown TN, Watanabe N, Bronner G, Abraham MH, Goss KU (2017) UFZ-LSER database v 3.2.1. http://www.ufz.de/lserd. Accessed 25 Jan 2021
Abraham MH, Smith RE, Luchtefeld R, Boorem AJ, Luo R, Acree WE Jr (2010) Prediction of solubility of drugs and other compounds in organic solvents. J Pharm Sci 99(3):1500–1515. https://doi.org/10.1002/jps.21922
Abraham MH, Le J (1999) The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. J Pharm Sci 88(9):868–880. https://doi.org/10.1021/js9901007
Abraham MH, Acree WE (2020) Estimation of vapor pressures of liquid and solid organic and organometallic compounds at 298.15K. Fluid Phase Equilib 519:112595. https://doi.org/10.1016/j.fluid.2020.112595
Brown TN, Celsie A, Arnot JA, Parnis JM (2023) PPLFER paper #3 Mixtures. In Prep
Abraham MH, Acree WE (2008) Comparison of solubility of gases and vapours in wet and dry alcohols, especially octan-1-ol. J Phys Org Chem 21(10):823–832. https://doi.org/10.1002/poc.1374
Baskaran S, Lei YD, Wania F (2021) A database of experimentally derived and estimated octanol-air partition Ratios (KOA). J Phys Chem Ref Data. https://doi.org/10.1063/5.0059652
Brown TN (2014) Predicting hexadecane-air equilibrium partition coefficients (L) using a group contribution approach constructed from high quality data. SAR QSAR Environ Res 25(1):51–71. https://doi.org/10.1080/1062936X.2013.841286
Mansouri K, Grulke CM, Judson RS, Williams AJ (2018) OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 10:10. https://doi.org/10.1186/s13321-018-0263-1
Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26(5):694–701. https://doi.org/10.1002/qsar.200610151
Gramatica P, Cassani S, Roy PP, Kovarich S, Yap CW, Papa E (2012) QSAR modeling is not “push a button and find a correlation”: a case study of toxicity of (Benzo-)triazoles on algae. Mol Inform 31:817–835. https://doi.org/10.1002/minf.201200075
Zhang Z, Sangion A, Shenghong W, Gouin T, Brown TN, Arnot JA, Li L (2024) Chemical space covered by applicability domains of quantitative structure-property relationships and semi-empirical relationships in chemical assessments. Environ Sci Technol 58(7):3386–3398. https://doi.org/10.1021/acs.est.3c05643
US E.P.A. (2011) Estimation Programs Interface (EPI) Suite for Microsoft® Windows, Ver. 4.1., Released October, 2011 edn. U. S. Environmental Protection Agency, Washington, D.C.
Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ (2016) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR QSAR Environ Res 27(11):911–937. https://doi.org/10.1080/1062936X.2016.1253611
Schwarzenbach RP, Gschwend PM, Imboden DM (2016) Environmental organic chemistry, 3rd edn. Wiley, Hoboken
Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Felix E, Magarinos MP, Mosquera JF, Mutowo P, Nowotka M, Gordillo-Maranon M, Hunter F, Junco L, Mugumbate G, Rodriguez-Lopez M, Atkinson F, Bosc N, Radoux CJ, Segura-Cabrera A, Hersey A, Leach AR (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47(D1):D930–D940. https://doi.org/10.1093/nar/gky1075
Ulrich N, Ebert A (2022) Can deep learning algorithms enhance the prediction of solute descriptors for linear solvation energy relationship approaches? Fluid Phase Equilib 555:113349. https://doi.org/10.1016/j.fluid.2021.113349
Hodges G, Eadsforth C, Bossuyt B, Bouvy A, Enrici M-H, Geurts M, Kotthoff M, Michie E, Miller D, Müller J, Oetter G, Roberts J, Schowanek D, Sun P, Venzmer J (2019) A comparison of log Kow (n-octanol–water partition coefficient) values for non-ionic, anionic, cationic and amphoteric surfactants determined using predictions and experimental methods. Environ Sci Eur 31(1):1. https://doi.org/10.1186/s12302-018-0176-7
Pudipeddi M, Serajuddin ATM (2005) Trends in solubility of polymorphs. J Pharm Sci 94(5):929–939. https://doi.org/10.1002/jps.20302

Acknowledgements

Not applicable.

Funding

The authors acknowledge funding from the American Chemistry Council Long-Range Research Initiative. As this publication has not been formally reviewed by the American Chemistry Council, views expressed in this document are solely those of the authors.

Author information

Authors and Affiliations

ARC Arnot Research & Consulting, Toronto, ON, M4C 2B4, Canada
Trevor N. Brown, Alessandro Sangion & Jon A. Arnot
Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, ON, M1C 1A4, Canada
Jon A. Arnot
Department of Pharmacology and Toxicology, University of Toronto, Toronto, ON, M5S 1A8, Canada
Jon A. Arnot

Authors

Trevor N. Brown
Alessandro Sangion
Jon A. Arnot

Contributions

Trevor N. Brown: project conceptualization, data curation, model development, coding and testing, manuscript writing and editing. Alessandro Sangion: data curation, model deployment on EAS-E Suite, manuscript writing and editing. Jon A. Arnot: project conceptualization, management and funding, manuscript writing and editing.

Corresponding author

Correspondence to Trevor N. Brown.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Brown, T.N., Sangion, A. & Arnot, J.A. Identifying uncertainty in physical–chemical property estimation with IFSQSAR. J Cheminform 16, 65 (2024). https://doi.org/10.1186/s13321-024-00853-w

Received: 28 January 2024
Accepted: 09 May 2024
Published: 30 May 2024
DOI: https://doi.org/10.1186/s13321-024-00853-w
