
Identifying uncertainty in physical–chemical property estimation with IFSQSAR

Abstract

This study describes the development and evaluation of six new models for predicting physical–chemical (PC) properties that are highly relevant for chemical hazard, exposure, and risk estimation: solubility (in water, SW, and in octanol, SO), vapor pressure (VP), and the octanol–water (KOW), octanol–air (KOA), and air–water (KAW) partition ratios. The models are implemented in the Iterative Fragment Selection Quantitative Structure–Activity Relationship (IFSQSAR) Python package, Version 1.1.0. These models are implemented as Poly-Parameter Linear Free Energy Relationship (PPLFER) equations which combine experimentally calibrated system parameters and solute descriptors predicted with QSPRs. Two other ancillary models have been developed and implemented: a QSPR for Molar Volume (MV) and a classifier for the physical state of chemicals at room temperature. The IFSQSAR methods for characterizing applicability domain (AD) and calculating uncertainty estimates expressed as 95% prediction intervals (PI) for predicted properties are described and tested on 9,000 measured partition ratios and 4,000 VP and SW values. The measured data are external to IFSQSAR training and validation datasets and are used to assess the predictivity of the models for “novel chemicals” in an unbiased manner. The 95% PIs calculated from validation datasets for partition ratios needed to be scaled by a factor of 1.25 to capture 95% of the external data. Predictions for VP and SW are more uncertain, primarily due to the challenges in differentiating their physical state (i.e., liquids or solids) at room temperature. The prediction accuracy of the models for log KOW, log KAW and log KOA of novel, data-poor chemicals is estimated to be in the range of 0.7 to 1.4 root mean squared error of prediction (RMSEP), with RMSEP in the range 1.7–1.8 for log VP and log SW.

Scientific contribution

New partitioning models integrate empirical PPLFER equations and QSARs, allowing for seamless integration of experimental data and model predictions. This work tests the real predictivity of the models for novel chemicals which are not in the model training or external validation datasets.

Introduction

Physical–chemical (PC) property data are essential for conducting legislated ecological and human health assessment for new and existing organic chemicals [1,2,3]. Common PC properties used in chemical assessments are solubility in water (SW; mol/L), solubility in octanol (SO; mol/L), vapor pressure (VP; Pa), melting point (TM; K), boiling point (TB; K) and the octanol–water (KOW), octanol–air (KOA), and air–water (KAW) partition ratios. The partition ratios are considered dimensionless, and KAW is the dimensionless Henry’s Law Constant (H; Pa.m3/mol) as KAW = H/RT, where R is the Ideal Gas Law Constant (Pa.m3/(mol.K)) and T is the system temperature (K; kelvin). Models used for predicting bioaccumulation [4], overall persistence and long-range transport potential [5], toxicity, toxicokinetics in in vitro and in vivo systems, chemical concentrations in natural and manufactured environments, and ultimately exposure to human and ecological receptors require at least some of the listed PC properties as input parameters. Chemical assessment outcomes are sensitive to the selected PC values, e.g., [5,6,7,8,9] and reliable PC data are therefore required for reliable chemical assessments; “garbage in = garbage out” [10]. There is a need to better understand which chemicals and properties have the greatest uncertainties so these sources of error in regulatory decision-making can be addressed.

Uncertainty in PC data is inherent whether the data are measured or modelled [11, 12] and guidance for selecting PC data for chemical assessments is available [11]. Theoretical relationships between SW, SO, VP, KOW, KOA, and KAW have been outlined by Mackay and colleagues [13,14,15] and others [16, 17]. These theoretical relationships (sometimes referred to as the “three solubility approach” [15]) can be applied for evaluating measured and predicted PC property data quality and obtaining consistency amongst them all as a method to address uncertainty in available PC property data and guide the selection of reliable data. Predictive methods for PC property data are required for thousands of chemicals legislated for evaluation [18,19,20,21]. Methods for predicting PC properties include Quantitative Structure-(Activity)Property Relationships (QS(A)PRs) and Poly-Parameter Linear Free Energy Relationships (PPLFERs), also known as Abraham equations [22, 23]. Organization for Economic Co-operation and Development (OECD) guidance for QS(A)PR development and validation for applications in regulatory decision-making exists [24, 25], including consideration of the applicability domain (AD) for a predicted property as outlined in the recent OECD QSAR assessment framework (QAF) [26]. There is a need for reliable predictive methods that include AD information as well as uncertainty estimates for predictions.

The Iterative Fragment Selection QSAR (IFSQSAR) development methods have been progressively updated and applied to various chemical properties over the last 10 years [27,28,29]. IFSQSARs are fragment-based multiple linear regression (MLR) models developed using extensive cross-validation and conservative goodness-of-fit metrics to create robust and predictive models, and make predictions based only on the chemical structure as a Simplified Molecular Input Line Entry System (SMILES) string [30]. The IFSQSARs include the prediction of solute descriptors required to parameterize PPLFER equations and other PC properties directly. The IFSQSARs have been developed in agreement with OECD guidance and apply three complementary methods for assessing whether predictions are within the QSPR AD and provide estimates of the prediction uncertainty. The IFSQSAR methods and the mechanistic insights of the PPLFER methods are applied in this work to identify and characterize general uncertainties in predicting PC property data required for chemical assessments. The model development ensures that predicted properties are thermodynamically consistent, and their calculation is based on a consistent set of descriptors, i.e. the PPLFER solute descriptors. This is similar to previous efforts based on different descriptors, such as the Unified Physicochemical Property Estimation Relationships (UPPER) method of Yalkowsky and colleagues [31].

The present study describes the development and evaluation of new models in IFSQSAR Version 1.1.0 (https://github.com/tnbrowncontam/ifsqsar) for predicting SW, SO, VP, KOW, KOA, and KAW. The new models, and other QSARs, are available in a user-friendly, freely accessible online platform, the Exposure And Safety Estimation (EAS-E) Suite (www.eas-e-suite.com). QSPRs have previously been developed for solute descriptors and system parameters of PPLFERs [32, 33]. These QSPRs are combined with empirically calibrated PPLFER equations to make predictions for PC properties, some calibrated in previous research [34] and some newly calibrated in this work. A key objective of this work is to validate the predictive power of the new models against experimental data for novel chemicals; therefore, in the validation process, the PPLFERs are only parameterized with solute descriptors predicted by the IFSQSARs to represent conditions of applying models to chemicals and properties for which there are no measured data. The new model predictions are compared against independent measured property data to assess their predictive power (uncertainty) expressed as 95% prediction intervals. Methods for quantifying the predictive power of the QSPR predictions for novel chemicals, i.e. chemicals that are outside of the training and validation datasets, are evaluated. Based on these evaluations and the detailed AD information of the IFSQSAR models, methods for further improving the understanding of the prediction uncertainty for novel chemicals are recommended.

Methods

Theory

Thermodynamic property cycles that describe the interrelation between partitioning and solubility in octanol, water and air phases are referred to as the three-solubility approach. The three-solubility approach interprets the partition ratios KOW, KOA, and KAW as ratios of the solubilities SO, SW and solubility in air (SA), where SA is a conversion of VP at atmospheric pressure and temperature. Figure 1 shows how the three-solubility approach [15] is used in this study to calibrate consistent solubility and partitioning properties. Partition ratios and solubility in this work are calculated using PPLFERs. PPLFERs were pioneered by Michael Abraham and colleagues, and are empirical correlations used to predict chemical properties with many applications in environmental chemistry [33]. There are three different forms of PPLFER equations which include different sub-sets of solute descriptors and system parameters. Two forms are recommended by Abraham for partitioning between two condensed phases, or partitioning between one condensed phase and one gaseous phase [22]. A third form was suggested by Goss [23] which contains descriptors from each of the two suggested by Abraham and is shown in Eq. 1. PPLFERs in the form of Eq. 1 are used in this work because they offer two advantages for environmental chemistry research. The first is that using a single form of the equation allows for the application of thermodynamic property cycles. The second is that this form of PPLFER equation shows better predictive power for some solutes with unique properties, including perfluorinated alkyl substances and methyl siloxanes, which are of environmental interest [35].

$${\text{log}}\;{\text{K}}{\mkern 1mu} = {\mkern 1mu} {\text{s}} \cdot {\text{S }}{\mkern 1mu} + {\mkern 1mu} {\text{a}} \cdot {\text{A }}{\mkern 1mu} + {\mkern 1mu} {\text{b}} \cdot {\text{B }}{\mkern 1mu} + {\mkern 1mu} {\text{v}} \cdot {\text{V}}{\mkern 1mu} + {\mkern 1mu} {\text{l}} \cdot {\text{L}}{\mkern 1mu} + {\mkern 1mu} {\text{c}}$$
(1)
Fig. 1

Schematic of the workflow in this research. Yellow boxes represent experimental data and empirical models, blue boxes represent QSPR predictions, green boxes represent MetaQSPRs which combine both, orange text represents models calibrated only by thermodynamic property cycle, purple arrows represent the validation process in which only the chemical structure (SMILES) is used to apply the models. log KOW, log KOA, log KAW partition ratios, VP, SA vapor pressure and solubility in air, SW solubility in water, SO solubility in octanol, MLR multiple linear regression, ΔS entropy of melting, TB boiling point, TM melting point, MW molecular weight, MV molar volume

PPLFER equations consist of solute descriptors, which correlate with the molecular interactions of the solute, and system parameters which are fitted to the properties of the system of interest. For partition ratios the system will be the two phases that the partition coefficient describes, and the system parameters describe the relative propensity for solutes to partition to one phase or the other with positive values favoring the first phase and negative values favoring the second phase. For solubility the two phases are the pure phase of the solute and water, air, or octanol. System parameters are determined by MLR of the property against experimentally determined solute descriptors of the dataset of training chemicals for which both the solute descriptors and property are available, this is referred to as calibrating a PPLFER equation. Experimental solute descriptors are available for about 8000 solutes and system parameters have been calibrated for solvent-air and solvent–water partitioning of about 100 solvents including octanol [36, 37].
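As a minimal sketch of what calibrating a PPLFER equation involves, the Python snippet below evaluates Eq. 1 and recovers the system parameters by ordinary least squares. The function names, the synthetic descriptor matrix, and the parameter values are hypothetical placeholders chosen for illustration; they are not values from Table 1 and not part of the IFSQSAR API.

```python
import numpy as np

def pplfer_log_k(descriptors, params):
    """Evaluate Eq. 1: log K = s*S + a*A + b*B + v*V + l*L + c."""
    S, A, B, V, L = descriptors
    s, a, b, v, l, c = params
    return s * S + a * A + b * B + v * V + l * L + c

def calibrate_pplfer(descriptor_matrix, log_k):
    """Fit (s, a, b, v, l, c) by multiple linear regression (least squares).

    descriptor_matrix has one row per training solute with columns S, A, B, V, L.
    """
    X = np.asarray(descriptor_matrix, dtype=float)
    design = np.column_stack([X, np.ones(len(X))])  # last column fits the intercept c
    params, *_ = np.linalg.lstsq(design, np.asarray(log_k, dtype=float), rcond=None)
    return params

# Synthetic, noise-free example (hypothetical numbers, not experimental data).
rng = np.random.default_rng(0)
true_params = np.array([0.5, -1.0, 2.0, 1.5, 0.3, -0.2])
X = rng.uniform(0.0, 2.0, size=(50, 5))
y = np.array([pplfer_log_k(row, true_params) for row in X])
fitted = calibrate_pplfer(X, y)
```

With real measurements the fit would not be exact, and the residual scatter would correspond to the standard error of fitting discussed later in this section.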

In Fig. 1, Table 1, and Eq. 1 the lower-case letters s, a, b, v, l, and c are the system parameters specific to the system. The upper-case letters S, A, B, V, and L are the solute descriptors specific to the solute. For solubility an additional term that combines A and B with an additional system parameter d is required, as discussed below. The solute descriptors correlate with different types of molecular interactions: S is a combination of the solute dipolarity and polarizability, A is the hydrogen bond donor capacity, B is the hydrogen bond acceptor capacity, V is the McGowan volume which has been interpreted as correlating with energy of cavity formation, and L is the partition coefficient for the hexadecane-air system which correlates with van der Waals interactions. Abraham has also calibrated PPLFER equations for the pure phase properties solubility SW [38] and vapor pressure VP [39]. Separate PPLFER equations were developed for liquid and solid solutes with quite different system parameters. These PPLFER equations represent a system where partitioning is between the chemical pure phase and the water and air phases, meaning that the system is different for every solute, which is not consistent with how PPLFERs are typically applied. Equation 2 shows a PPLFER equation analogous to Eq. 1 for solubility of solute in water, octanol, or air, which has been modified according to Abraham’s method [38, 39].

$${\text{logS}}_{{\text{[W,O,A]}}} \,{ = }\,{\text{s}} \cdot {\text{S }}\,{ + }\,{\text{d}} \cdot {\text{A}} \cdot {\text{B }}\,{ + }\,{\text{v}} \cdot {\text{V}}\,{ + }\,{\text{l}} \cdot {\text{L}}\,{ + }\,{\text{c}}$$
(2)
Table 1 Poly-Parameter Linear Free Energy Relationship (PPLFER) system parameters

In these PPLFERs the solute descriptors are being used to describe how a chemical behaves as both a solute and the solvent. The AB term explicitly accounts for the effects of hydrogen bonding between molecules of the chemical, and some versions proposed by Abraham [39, 40] include an SS term to account for dipole–dipole interactions. The system parameters quantify how each solute descriptor favors solubility in water, octanol, or air, and any broadly applicable interactions within the pure phase of the solute. Equation 2 was modified to Eq. 3 in this work, because this was found to give better fitting results, and the (A·B)^0.5 term is more consistent with previous work done predicting system parameters [34]:

$${\text{logS}}_{{\text{[W,O,A]}}} \,{ = }\,{\text{s}} \cdot {\text{S}}\,{ + }\,{\text{a}} \cdot {\text{A}}\,{ + }\,{\text{b}} \cdot {\text{B}}\,{ + }\,{\text{d}} \cdot \left( {{\text{A}} \cdot {\text{B}}} \right)^{{{0}{\text{.5}}}} \,{ + }\,{\text{v}} \cdot {\text{V}}\,{ + }\,{\text{l}} \cdot {\text{L}}\,{ + }\,{\text{c}}$$
(3)
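Eq. 3 can be written as a short function to make the role of the cross term explicit. This is an illustrative sketch only: the function name and argument order are assumptions, and any numbers passed in are placeholders rather than calibrated system parameters.

```python
import math

def log_solubility_eq3(S, A, B, V, L, s, a, b, d, v, l, c):
    """Evaluate Eq. 3: log S = s*S + a*A + b*B + d*(A*B)**0.5 + v*V + l*L + c.

    Upper-case arguments are solute descriptors, lower-case arguments are
    system parameters for the phase of interest (water, octanol, or air).
    """
    cross_term = d * math.sqrt(A * B)  # self-association via hydrogen bonding
    return s * S + a * A + b * B + cross_term + v * V + l * L + c

# Placeholder inputs to show the arithmetic; not values from Table 1.
example = log_solubility_eq3(S=1.0, A=0.25, B=1.0, V=1.0, L=2.0,
                             s=1.0, a=2.0, b=3.0, d=4.0, v=5.0, l=6.0, c=7.0)
```

Note that the cross term vanishes whenever either A or B is zero, so it only contributes for solutes that can act as both hydrogen bond donors and acceptors.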

Previous research developed empirical regressions between solute descriptors and system parameters for solvent-air partitioning which can be used as an alternative method to predict solubility [34]. System parameters of PPLFER equations in the form of Eq. 1 can be predicted for each solute using the empirical regressions. These predicted PPLFER equations are then used to predict the partitioning of a solute between air and the solute’s own pure liquid phase, giving a partition ratio (log KkAk). These log KkAk values are then converted to VP using Eq. 4, which is a rearrangement of Raoult’s Law [34], and converted to SW by the three-solubility approach. In Eq. 4, γ is the activity coefficient of the solute, which is assumed to be unity in the pure phase, and MV is the molar volume of the liquid or supercooled liquid solute. VP is then unit converted to SA at standard temperature and pressure and a thermodynamic property cycle is applied to calculate SW and SO from the calibrated PPLFER equations for log KAW and log KOA.

$${\text{logVP}}\,{ = }\,{\text{log}}\left( {\frac{{{\text{RT}}}}{{{\gamma K}_{{{\text{kAk}}}} {\text{MV}}}}} \right)$$
(4)

This indirect method has only been validated for predicting the VP of liquids, and testing done in this work for solids showed that the results were poor.
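The indirect route can be sketched in a few lines of Python: Eq. 4 converts log KkAk to log VP, VP is converted to a solubility in air, and the three-solubility relationships (KAW = SA/SW, KOA = SO/SA) give SW and SO. The function names are hypothetical, and the unit choices (MV in m3/mol, VP in Pa, solubilities in mol/L, T = 298.15 K) are assumptions made here for a self-consistent example.

```python
import math

R = 8.314462618  # ideal gas constant, Pa·m3/(mol·K)

def log_vp_from_log_kkak(log_k_kak, mv_m3_per_mol, temp_k=298.15, gamma=1.0):
    """Eq. 4: log VP = log(RT / (gamma * K_kAk * MV)), with VP in Pa."""
    return math.log10(R * temp_k / (gamma * mv_m3_per_mol)) - log_k_kak

def log_sa_from_log_vp(log_vp_pa, temp_k=298.15):
    """Solubility in air: SA = VP / (RT) in mol/m3, converted to mol/L."""
    return log_vp_pa - math.log10(R * temp_k) - 3.0

def three_solubility(log_sa, log_kaw, log_koa):
    """Close the cycle: log SW = log SA - log KAW, log SO = log SA + log KOA."""
    log_sw = log_sa - log_kaw
    log_so = log_sa + log_koa
    return log_sw, log_so

# Placeholder inputs: log KkAk = 5 and MV = 1e-4 m3/mol (100 cm3/mol).
log_vp = log_vp_from_log_kkak(5.0, 1e-4)
log_sa = log_sa_from_log_vp(log_vp)
log_sw, log_so = three_solubility(log_sa, log_kaw=-2.0, log_koa=6.0)
```

By construction log SO minus log SW equals log KOA plus log KAW, which is the dry log KOW implied by the same cycle.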

PPLFER equations for partition ratios involving pure solvent phases, water, and air typically have standard errors of fitting and prediction of less than 0.2 log units when calibrated with experimental solute descriptors. The Abraham PPLFERs for VP have larger errors, on the order of 0.3 log units for liquids and up to 0.8 log units for some solids, but these equations also contain other correction factors for specific functional groups [39]. For SW the error is about 0.6 log units [38]. The indirect method for calculating solubilities had errors of about 0.4 and 0.5 log units when applied to solubility in air for liquids. All these statistics are calculated on different datasets and are typically fitting errors rather than predictive errors, so they give an idea of the goodness of fit of the models, but not necessarily the predictive power. If PPLFERs are properly calibrated with sufficient data then they have broad applicability and accuracy [35].

Table 1 summarizes the PPLFER equations used in this work to predict PC properties. The equations for log KOA and log KAW have been calibrated in previous work [32, 34]; the system parameters for dry log KOW (pure octanol) are calculated as the sum of the system parameters for log KOA and log KAW, i.e., using the three solubility approach. Sections SI-2, SI-3, and SI-4 detail the calibration of new PPLFER equations in this work, for wet log KOW (water saturated octanol), log KOO (hypothetical partition ratio between wet and dry octanol), VP, SW, and SO (dry and wet). One of the goals of this work is to create models that predict partition ratios and solubilities which have thermodynamic consistency built in, and this is achieved by calibrating the PPLFER system parameters to be thermodynamically consistent using the concept of the three solubility approach [15]. The PPLFER equations in this work have all been calibrated on experimental data except for SO, which is only calculated by the three solubility approach due to limited data availability and is shown in a different color in Fig. 1 to reflect this.

One challenge in this process is that there is an inherent discrepancy in the three solubility approach with regards to how the data are measured. Most measurements of log KOW are performed with the octanol and water phases in direct contact so that the octanol becomes saturated with water and vice versa. The solubility of octanol in water is very low so the effect of partitioning of chemicals to the water phase is negligible. However, a significant amount of water is soluble in the octanol phase, and this changes the partitioning properties [41]. The PPLFER system parameters in Table 1 show the “dry” log KOW will be lower than the “wet” log KOW for polar and hydrogen bonding chemicals because the s, a, and b system parameters are lower. In contrast, log KOA measurements are usually made using dry octanol [42]. In addition, the difference between wet SO (SO[w]) and dry SO (SO[d]) must be considered. A PPLFER for a hypothetical partition ratio between wet and dry octanol (KOO) has been derived in this work which can make these corrections, ensure thermodynamic consistency, and is implemented as a QSPR in IFSQSAR.

IFSQSAR description and AD

The IFSQSAR development methods have been described in previous work [27,28,29, 32, 43] and are summarized in Section SI-1. An important aspect to understand for this work is the division of experimental data into a training dataset used to calibrate the QSPR and a validation dataset used to validate the QSPR and estimate the prediction uncertainty. The splitting is rational and deterministic, ensuring that both datasets represent the chemical diversity of the experimental data and the range of expected values. The solute descriptor QSPRs were trained and validated on a common dataset, so that each solute is only in either the training or validation dataset for all solute descriptor QSPRs. Further details on the dataset splitting are in Brown 2022 [32]. All the QSPRs and PPLFERs described here are coded in the IFSQSAR version 1.1.0 Python package and implemented in the EAS-E Suite online platform (www.eas-e-suite.com). IFSQSARs apply three complementary approaches to define the basic AD of predictions. The first two approaches are very similar to, but were developed in parallel with, the AD methods applied by the OPEn structure–activity/property Relationship App (OPERA) [44]. The first approach uses the leverage, which is interpreted as a measure of extrapolation from the training dataset [45, 46], and the second is the Chemical Similarity Score (CSS), which is a nearest neighbours approach and is less sensitive to extrapolation. Various cut-offs are defined for both approaches and are combined to assign each QSPR prediction an Uncertainty Level (UL) between UL 0–3 which correlates with uncertainty of the QSPR predictions, or inversely correlates with predictive power. Individual predictions can always be good or bad regardless of the UL; the UL only quantifies the typical uncertainty.
Some special cases are also defined. UL 4 means that all fragments in the QSPR have a count of zero for the chemical; this may be defined as in or out of the AD depending on the meaning of the intercept. UL 5 is the third complementary AD approach and has been described as a “denylist” AD check [47], but also might be described as a negative domain check, or inverse structural alerts. All the information about atoms and bonds in the training dataset is summarized regardless of whether the exact substructures are included in the fragments selected for the QSPR. Chemicals are checked against this summary and if they contain a substructure that is not found in the training data then they are flagged as UL 5. Finally, for some QSPRs it is pragmatic to set boundary conditions on possible values, and any predictions which violate these boundary conditions are flagged as UL 6. Table 2 summarizes the seven IFSQSAR ULs.

Table 2 IFSQSAR uncertainty level (UL) specifications
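The leverage used in the first AD approach has a standard definition for linear regression models, h = x (XᵀX)⁻¹ xᵀ, which can be sketched as follows. This is a generic illustration under stated assumptions: the function name and the toy fragment-count matrix are hypothetical, and the specific cut-offs that IFSQSAR combines with the CSS to assign a UL are not reproduced here.

```python
import numpy as np

def leverage(train_matrix, query_row):
    """Leverage of a query chemical relative to the training design matrix.

    train_matrix: rows are training chemicals, columns are descriptor values
    (for a fragment-based MLR model these would be fragment counts).
    Larger values indicate more extrapolation from the training data.
    """
    X = np.asarray(train_matrix, dtype=float)
    x = np.asarray(query_row, dtype=float)
    xtx_inv = np.linalg.pinv(X.T @ X)  # pseudo-inverse tolerates collinear columns
    return float(x @ xtx_inv @ x)

# Toy training set with two descriptors (hypothetical values).
X_train = [[1.0, 0.0], [0.0, 1.0]]
h_inside = leverage(X_train, [1.0, 0.0])   # same as a training chemical
h_outside = leverage(X_train, [2.0, 0.0])  # twice the largest training count
```

A query that doubles a descriptor value relative to the training data has four times the leverage in this toy case, which is why leverage is read as a measure of extrapolation.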

The IFSQSARs that use chemical structure to predict solute descriptors (used in PPLFER equations) and other PC properties directly provide an UL and predictivity metric along with each prediction [32]. Here predictivity refers to the predictive power of the QSPR, i.e. how accurate the predictions are likely to be, or inversely how uncertain the predictions are likely to be. Predictivity is quantified by the root mean squared error of prediction (RMSEP) as calculated from the external validation dataset of each solute descriptor QSPR, more discussion of the RMSEP can be found in “Metrics of model performance and predictivity” section. As the RMSEP increases the predictivity is lower and the uncertainty is higher.

All the property PPLFER equations in IFSQSAR are implemented as Meta QSPRs. Meta QSPRs use the outputs of other QSPRs as their inputs and calculate new values, aggregated ULs, and error estimates. For example, log KOW is estimated with a Meta QSPR which combines solute descriptors predicted by QSPR and the experimental system parameters from Table 1 in PPLFER Eq. 1. All the PPLFER equations in this work (KOW, KAW, KOA, VP, SW and SO) are implemented as Meta QSPRs. Note that IFSQSAR will by default use experimental solute descriptors instead of predicted ones where possible to increase the accuracy of predictions. This feature of IFSQSAR was not included in the validation process of this study so that only predicted solute descriptors were used to evaluate the models’ expected predictivity for novel or data-poor chemicals. The AD and predictivity as UL and RMSEP of the Meta QSPRs are calculated as an aggregate of UL and RMSEP of the Meta QSPR model inputs and other parameters written into the model such as the experimental system parameters. The details are described elsewhere [32], but in brief the aggregated UL and RMSEP are calculated according to propagation of uncertainty rules. These calculations are done automatically in the Meta QSPR code and documented in the output.
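To illustrate what aggregating RMSEP by propagation of uncertainty can look like for a linear equation such as Eq. 1, the sketch below applies the standard first-order rule for a weighted sum of independent inputs. This is an assumption made for illustration: the exact aggregation used by the Meta QSPRs is documented in [32], the function name is hypothetical, and the rule for aggregating ULs is deliberately not shown because it is not specified in this section.

```python
import math

def propagate_rmsep(system_params, descriptor_rmseps, equation_se=0.0):
    """First-order uncertainty propagation for log K = sum(p_i * D_i) + c.

    Assumes the errors of the predicted solute descriptors are independent.
    system_params: system parameters multiplying each descriptor (e.g. s, a, b, v, l)
    descriptor_rmseps: RMSEP of each predicted descriptor (same order)
    equation_se: standard error of the calibrated PPLFER equation itself
    """
    variance = equation_se ** 2
    for p, r in zip(system_params, descriptor_rmseps):
        variance += (p * r) ** 2
    return math.sqrt(variance)

# Placeholder numbers chosen so the arithmetic is easy to follow.
combined = propagate_rmsep([3.0, 4.0], [1.0, 1.0])
```

The practical consequence is that a descriptor with a large system parameter contributes disproportionately to the uncertainty of the predicted property.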

Meta QSPRs for predicting VP and SW for liquids have already been implemented in previous work based on QSPRs that predict the PPLFER system parameters for liquid solvents [32]. These are referred to as indirect predictions in the present study as opposed to the direct predictions of VP and SW made with the new PPLFER system parameters in Table 1. As outlined in “Model evaluations with empirical datasets and endpoint relevance” section it is known that VP, SW and SO for liquids and solids have notable differences. To help account for these differences two previously created QSPRs were used, and two new ones were created. The previously developed direct prediction QSPRs are the entropy of fusion (ΔSM) and TM [29]. The first new QSPR introduced in this study is a new classifier model to predict whether a chemical is a gas, liquid or solid at 25 °C and standard atmospheric pressure to predict when corrections for solids need to be applied. The state classifier is implemented as a Meta QSPR which takes solute descriptors, TM, and TB as inputs, and is described in Section SI-5. Finally, as discussed in “Model evaluations with empirical datasets and endpoint relevance” section the values for SW and SO are capped at the inverse of the solute molar volume (MV) in some cases; therefore, Section SI-6 describes a new QSPR for MV developed in this study.

Model evaluations with empirical datasets and endpoint relevance

Figure 1 shows the general workflow and the relationships between property datasets and the models developed in this study. Yellow filled boxes represent experimental datasets, and in the case of the system parameters, values that have been empirically calibrated using only experimental data inputs. Blue filled boxes represent QSPR predictions, and green filled boxes represent hybrid models which combine QSPR predictions with system parameters calibrated on experimental data. There is a separate PPLFER equation and model for each property, but the calibrations of the system parameters for all partitioning properties are interrelated through the three solubility approach. The main division of experimental data is between solutes with available solute descriptors, which are used for training and validating the models (top left box), and solutes with partitioning data but no solute descriptors (bottom box). IFSQSAR predictions were made for the following PC properties and then evaluated using datasets of experimental values originally from the PhysProp database included in the EPI Suite package [48]: log KOW, log KAW, log KOA, log VP, and log SW. These predictions and data are then used to assess the predictivity of IFSQSAR PPLFER-based models for novel chemicals. The PhysProp datasets have been further curated as a part of the creation of the OPERA QSAR package, including assigning all chemicals QSAR-ready structures as SMILES [44, 49]. Chemicals have been matched by CAS number with chemicals in the solute descriptor database used to develop the IFSQSARs [32], and identified as being in the training dataset, the validation dataset, or in neither. Chemicals in neither dataset are novel and are referred to here as being external to IFSQSAR.

There are several caveats to consider when comparing the IFSQSAR model predictions to the experimental datasets of PC properties. The first thing to consider is the difference between wet and dry octanol, as described in “Theory” section. Secondly, PC properties involving a pure chemical phase such as VP and SW are different for liquids and solids. Chemical fate and transport models typically assume that all chemicals are liquids, or supercooled liquids, also called subcooled liquids. The theory is that at very low concentrations in a phase the solid chemicals behave as liquids because there are never enough molecules to form a solid pure phase. Measured or predicted VP and S data for solids can be corrected to equivalent supercooled liquid values using the Clausius–Clapeyron equation or one of its simplifications, the most common being the van’t Hoff approximation [50]. This is discussed in more detail in Section SI-4. As discussed in the previous section, the QSPRs for the data inputs required to apply the van’t Hoff approximation, ΔSM and TM, were developed in previous work, and the new classifier helps determine if a chemical is likely to be a liquid or a solid at system temperature (default in IFSQSAR = 25 °C).
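The solid-to-supercooled-liquid correction can be sketched with the commonly used fugacity-ratio form, ln F = −(ΔSM/R)(TM/T − 1), which assumes that the entropy of fusion is constant between T and TM. The exact form used in IFSQSAR is given in Section SI-4, so the snippet below is an illustration under that assumption; the function names and input values are hypothetical.

```python
import math

R = 8.314462618  # ideal gas constant, J/(mol·K)

def log_fugacity_ratio(delta_s_m, t_m, temp_k=298.15):
    """log10 of the solid/supercooled-liquid ratio F for VP or solubility.

    delta_s_m: entropy of fusion in J/(mol·K); t_m: melting point in K.
    Returns 0 for chemicals that are already liquid at temp_k.
    """
    if t_m <= temp_k:
        return 0.0
    return -(delta_s_m / R) * (t_m / temp_k - 1.0) / math.log(10.0)

def to_supercooled_liquid(log_solid_value, delta_s_m, t_m, temp_k=298.15):
    """Convert a log VP or log S of a solid to its supercooled liquid equivalent."""
    return log_solid_value - log_fugacity_ratio(delta_s_m, t_m, temp_k)

# Placeholder solid: ΔSM = 56.5 J/(mol·K), TM = 398.15 K, log SW(solid) = -4.0
log_f = log_fugacity_ratio(56.5, 398.15)
log_sw_liquid = to_supercooled_liquid(-4.0, 56.5, 398.15)
```

For this placeholder solid the correction is close to one log unit, which shows why a misclassified physical state can dominate the error in predicted VP and SW.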

Another end point mismatch that is commonly encountered in partitioning data is the partitioning of ions and ionizable chemicals. This is mostly important for partitioning where water is one of the phases, although the effect in other phases, e.g., water-saturated octanol, is possible. The present study only focusses on the partitioning of neutral organic chemicals. Chemical ionization is only considered in this work to identify experimental data where the measurement may be influenced by it and remove those data from model development and evaluation. Strong acids and bases are identified as acids with a pKa less than 4 and bases with a pKa greater than 10 and were removed. Experimental pKa were collected from the curated OPERA database. If a pKa was not available, a consensus value between ChemAxon estimates (available in the ChEMBL database [51]) and ACD Labs 2023.1.0 (Build 3666) was determined.

In this study upper boundaries have been set for VP, SO, and SW values. When a solute is miscible in water or octanol there is no limit for how much of the solute can be dissolved. This might be expressed as an S where the amount of the solute is greater than the amount of water or octanol, which is not measurable or physically reasonable in a real system. We propose as a reasonable upper boundary on all solubility values to use the inverse of the solute liquid MV, i.e., the concentration of solute in its own pure liquid phase. The liquid MV QSPR developed in this work is used to set the capped value for solubility predictions. A similar upper boundary can be defined for VP; in this case we use standard atmospheric pressure as the upper boundary, because in the context of modelling the natural environment the pressure of a chemical will not be greater than this value.
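The two upper boundaries described above amount to simple caps in log space, sketched here. The function names are hypothetical, and the units (MV in L/mol so that 1/MV is in mol/L, VP in Pa) are assumptions chosen to match the units listed in the Introduction.

```python
import math

STANDARD_ATMOSPHERE_PA = 101325.0  # standard atmospheric pressure in Pa

def cap_log_solubility(log_s_mol_per_l, mv_l_per_mol):
    """Cap log S at log(1/MV), the concentration of the pure liquid solute."""
    return min(log_s_mol_per_l, -math.log10(mv_l_per_mol))

def cap_log_vp(log_vp_pa):
    """Cap log VP at standard atmospheric pressure."""
    return min(log_vp_pa, math.log10(STANDARD_ATMOSPHERE_PA))

# Placeholder predictions for a miscible, volatile solute with MV = 0.1 L/mol.
capped_s = cap_log_solubility(2.0, 0.1)   # prediction above the cap
uncapped_s = cap_log_solubility(-3.0, 0.1)  # prediction below the cap
capped_vp = cap_log_vp(6.0)
```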

Metrics of model performance and predictivity

The RMSEP is calculated from experimental values of the external validation datasets and predicted values from IFSQSAR PPLFER based models using Eq. 5:

$${\text{RMSEP}}\,{ = }\,\left( {\frac{{\sum\nolimits_{{{\text{i}}\, = \,1}}^{{{\text{n}}_{{{\text{ext}}}} }} {\left( {{\text{y}}_{{\text{i}}} \, - \,\widehat{{\text{y}}}_{i} } \right)^{2} } }}{{{\text{n}}_{{{\text{ext}}}} }}} \right)^{0.5}$$
(5)

where n_ext is the number of data points in the validation dataset, y_i are the experimental values and ŷ_i are the predicted values. The RMSEP is then used to calculate an estimated 95% prediction interval (PI) using Eq. 6:

$${95}\% \,{\text{PI}}\, = \,\left[ {{\text{M}}\,{-}\,{\text{RMSEP}}*{1}.{96},\,{\text{M}}\, + \,{\text{RMSEP}}*{1}.{96}} \right]$$
(6)

where M is the predicted PC property value. In an ideal case the validation dataset of a QSPR is representative of the structural diversity of chemicals to which the model might be applied. In this ideal case the RMSEP calculated from the validation dataset would be a good estimate of global RMSEP and 95% of predictions would have the experimental value contained within their PI. However, in practice the data available for validating QSPRs is limited by the experimental methods used to measure the data and will not be representative of the diversity of chemicals to which the model may be applied, so the RMSEP and the PI will only be estimates.
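Eqs. 5 and 6 translate directly into code. The sketch below uses hypothetical function names and placeholder numbers; it is meant only to show the arithmetic, not to reproduce any result from this study.

```python
import numpy as np

def rmsep(y_obs, y_pred):
    """Eq. 5: root mean squared error of prediction over the validation data."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))

def prediction_interval_95(m, rmsep_value):
    """Eq. 6: 95% PI = [M - 1.96*RMSEP, M + 1.96*RMSEP]."""
    half_width = 1.96 * rmsep_value
    return m - half_width, m + half_width

# Placeholder validation data in log units.
validation_rmsep = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
lower, upper = prediction_interval_95(2.0, 0.5)
```

The factor of 1.96 assumes that prediction errors are approximately normally distributed with zero mean, which is one reason the resulting PI is only an estimate.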

In cases where predictions are made for chemicals that are well within the AD, the RMSEP is typically comparable to the goodness-of-fit quantified as the standard deviation between the experimental and fitted value of the training dataset, i.e., the same as Eq. 5 but between the experimental values of the training dataset and the fitted QSPR values. The further out of the AD a group of predictions are, the larger the real RMSEP will be. As stated above, individual predictions can always be good or bad regardless of whether they are in the AD or not, the RMSEP is a probabilistic metric.

During IFSQSAR model development each chemical in the external validation dataset is assigned a UL as discussed in “IFSQSAR description and AD” section, and then the RMSEP is calculated for all chemicals within each UL. ULs 0 to 3 almost always have an increasingly large RMSEP for investigated datasets [29, 32, 43]. UL 4 may have a high or low RMSEP depending on whether intercept-only predictions are considered within the AD, which depends on the property and structure of the model. Because UL 5 means that the chemical contains atoms or bonds not represented in the training dataset, the RMSEP cannot be reasonably estimated: the untrained atoms and bonds may have unexpected effects on the property. However, in practice the RMSEP of predictions for UL 5 has typically been comparable to the RMSEP for UL 3, provided that the chemicals are not inorganic. UL 6 means that the model has made a prediction outside of a boundary condition set at the time of model calibration. This UL is assigned after a normal prediction has been made and assigned a UL, and the RMSEP of the original UL is used. The same trends of RMSEP with aggregate UL are observed for the PPLFER based models in this work.
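The per-UL calculation described above amounts to grouping the validation residuals by uncertainty level and applying Eq. 5 within each group. The following is a sketch under that reading; the record layout, the function name, and the data are assumptions made for illustration and do not reflect the internal IFSQSAR implementation.

```python
import math
from collections import defaultdict


def rmsep_by_ul(records):
    """Group (ul, y_exp, y_pred) records by UL and return {ul: (n, rmsep)}."""
    sums = defaultdict(lambda: [0, 0.0])  # ul -> [count, sum of squared errors]
    for ul, y, y_hat in records:
        sums[ul][0] += 1
        sums[ul][1] += (y - y_hat) ** 2
    return {ul: (n, math.sqrt(sq / n)) for ul, (n, sq) in sorted(sums.items())}


# Hypothetical validation records: (UL, experimental value, predicted value)
records = [
    (0, 2.0, 2.1),
    (0, 3.0, 2.9),
    (1, 1.0, 1.5),
    (1, 4.0, 3.5),
    (3, 5.0, 7.0),
]

stats = rmsep_by_ul(records)
for ul, (n, value) in stats.items():
    print(f"UL {ul}: n = {n}, RMSEP = {value:.2f}")
```

With few chemicals in a group, as for UL 3 in this toy example, the RMSEP estimate is itself very uncertain, which is consistent with the variability noted for sparsely populated ULs.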

One major goal of this work is to assess the accuracy of the RMSEP estimates provided by IFSQSAR models when compared to data that are not in the training or validation datasets, i.e., novel chemicals. The RMSEP values (in log units) will then be adjusted for the partitioning properties log KOW, log KAW, log KOA, SA and SW based on this comparison. To do this, PIs are calculated from the RMSEP and then the actual fraction of predictions within the PI is calculated to assess the accuracy of the PI and the RMSEP. The RMSEPs of each partitioning property are adjusted until the 95% PIs contain at least 95% of the experimental values, by multiplying them by a factor that depends on the trends observed for different ULs or chemical states.
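The coverage check and the RMSEP adjustment can be sketched as follows. This is an illustrative procedure only: the stepwise search, the step size, the function names, and the synthetic residuals are all assumptions, and the text does not state that the adjustment factors were found by an automated search of this kind.

```python
def coverage(y_exp, y_pred, rmsep_value, z=1.96):
    """Fraction of experimental values inside [pred - z*RMSEP, pred + z*RMSEP]."""
    half_width = z * rmsep_value
    inside = sum(1 for y, y_hat in zip(y_exp, y_pred) if abs(y - y_hat) <= half_width)
    return inside / len(y_exp)


def scale_factor_for_coverage(y_exp, y_pred, rmsep_value,
                              target=0.95, step=0.05, max_factor=5.0):
    """Increase a multiplicative RMSEP factor in fixed steps until coverage >= target."""
    factor = 1.0
    while coverage(y_exp, y_pred, factor * rmsep_value) < target:
        factor = round(factor + step, 10)
        if factor > max_factor:
            raise ValueError("target coverage not reached within max_factor")
    return factor


# Synthetic residuals (illustration only): 18 small deviations and 2 larger ones
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9] * 2 + [1.25, 3.0]
y_pred = [0.0] * len(residuals)
y_exp = residuals           # experimental value = prediction + residual
base_rmsep = 0.5            # hypothetical RMSEP estimate from a validation set

initial = coverage(y_exp, y_pred, base_rmsep)
factor = scale_factor_for_coverage(y_exp, y_pred, base_rmsep)
print(f"coverage before adjustment: {initial:.2f}, factor needed: {factor:.2f}")
```

In this synthetic case the unadjusted interval captures 90% of the values, and the factor returned is the first step at which 19 of the 20 values fall inside the interval; the single extreme outlier remains outside, as expected for a 95% interval.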

Results and discussion

Evaluation of IFSQSAR partition ratio predictions

Figures 2A, B show the IFSQSAR predictions compared to experimental KOW data split into two subsets. Figure 2A includes chemicals that have experimental solute descriptors and are in the IFSQSAR validation dataset, but the plotted values are the IFSQSAR predictions. Figure 2B shows chemicals with no experimental solute descriptors, which are therefore entirely external to the IFSQSAR partition ratio and solubility models. The data points in Fig. 2 are colored by the aggregate UL of the predictions, with UL 0, the least uncertain, colored green and UL 1 to UL 3 colored blue, yellow and red, corresponding to increasing uncertainty. Data points with lower UL tend to fall closer to the 1:1 line, indicating more accurate predictions. UL 5 and 6 are colored in purple and have triangle and square shapes to distinguish their different AD types. As could be expected, chemicals which are external to IFSQSAR have more uncertainty and variability in the predictions (RMSEP 1.00) compared to chemicals in the IFSQSAR validation dataset (RMSEP 0.57). The external data span a larger range of log KOW values, from about −5 to 11, compared to the validation data, which span values from about −2 to 8. The chemicals in this expanded lower range tend to be flagged as out of the AD with UL 2 or UL 3 and are mostly identified as solids by the chemical state predictions. The chemicals in the middle of the range with over-predicted log KOW values, which are mostly UL 2 and UL 3, are also mostly identified as solids and are mostly very large and complex chemicals.

Fig. 2
Comparisons of predicted and experimental data. A: log KOW of IFSQSAR validation set (n = 704); B: log KOW of external set (n = 8416); C: log VP of IFSQSAR validation set (n = 495); D: log VP of external set (n = 1207); E: log SW of IFSQSAR validation set (n = 529); F: log SW of external set (n = 2809)

Figure S2 shows the data for wet and dry KOW. Strong acids and bases and salts were not included because these data were likely distribution ratios (DOW) rather than KOW. Data in the IFSQSAR training dataset are excluded from these figures; only data in the IFSQSAR validation dataset and data that are in neither set are included. The IFSQSAR model that applies the PPLFER equation for dry KOW shows poorer statistics (RMSEP 1.19) than the model that applies the PPLFER equation for wet KOW (RMSEP 0.98). As expected, the PPLFER for dry KOW tends to underestimate the experimental KOW values for more water-soluble chemicals, with the predictions skewing to lower values.

Figure S3 shows chemicals identified as liquids or solids plotted separately. Predicted KOW values for liquids are more accurate, with an overall RMSEP of 0.67 compared to 1.03 for solids. The ratio between the RMSEP for solids and the RMSEP for liquids tends to increase with increasing RMSEP. More of the liquids are within the AD: 64% have aggregate UL 0 or 1, compared to only 31% of solids. This means that solids are more likely to be out of the AD, and regardless of whether they are in the AD or not, the KOW predictions for solids are less accurate, though the difference is relatively small for solids that are UL 0, 1, or 2. There are a few different reasons why the predictions for solids may be less accurate. Solids tend to be larger chemicals than liquids, and fragment-based QSPR predictions, such as the IFSQSAR solute descriptor QSPRs which are used in the PPLFER based models in this work, are known to be less accurate for larger chemicals [43, 52]. The functional group counts in larger chemicals are more likely to be outside of the range of values in the training dataset, meaning that the QSPRs must be extrapolated outside of their training set. Extrapolation is always more uncertain than interpolation between values within the range of the training dataset. Larger chemicals have more opportunities for intramolecular interactions between functional groups, which can confound group contribution QSPRs such as those in IFSQSAR. Making experimental measurements for larger chemicals also tends to be more challenging because their solubility in some phases may be very low, so the experimental data may also be less accurate. For example, solids might be more likely to self-associate and undergo a phase transition at low concentrations in water or octanol, such as has been observed for perfluorinated alkyl substances [53], which would have a confounding effect on the interpretation of experimental concentrations in octanol and water. Another example is polymorphism, where a chemical has multiple solid forms, each with a different crystal structure and a different solubility [54]. This effect is well known in pharmaceutical science because it is an aspect of drug formulation but is not considered as much in environmental applications.

Table 3 summarizes statistics for the model evaluations and shows the fraction of each subset of data where the experimental values fall within the 95% PI calculated from the aggregate RMSEP estimates. For chemicals in the IFSQSAR validation dataset a little more than 95% of the chemicals fall within the PI, which is to be expected because these chemicals are a subset of the data used to estimate the RMSEP. For the data which are external to IFSQSAR only 90% of chemicals fall within the 95% PI. The results are about the same for liquids vs. solids at 90% overall. The results are quite consistent across the different ULs with no obvious trend. For liquids the fraction within the PI is more variable at UL 2 and UL 3 due to the small number of chemicals. Adjusted RMSEP estimates will be made for all QSPRs in “IFSQSAR Uncertainty Estimates” section.

Table 3 Validation statistics for log KOW, log KAW, and log KOA

There are far fewer data available for KAW and KOA than for KOW; therefore, the statistics are less reliable, but the results are consistent with the general trends observed for KOW. Figures analogous to Figs. 2A, B, and S3 are shown in the SI for KAW and KOA (Figure S4 through Figure S6). The prediction statistics are better for chemicals in the validation dataset than in the dataset of chemicals external to IFSQSAR as shown in Figure S4 for KAW and Figure S6 for KOA. Figure S5 shows the prediction statistics for liquids are better than for solids for KAW, while for KOA the external test set chemicals are all solids. Table 3 also summarizes the validation statistics for the performance of the KAW and KOA models. The fraction of experimental data which falls within the 95% PI is much more variable compared to the data for KOW, likely due to the limited amount of data, but again the overall trend is similar: chemicals in the validation set are within the PI more often than chemicals in the external set, and liquids are within the PI about as often as solids.

Evaluation of IFSQSAR VP and SW predictions

There are two IFSQSAR methods for predicting VP and SW: the indirect method developed in previous work [32, 34] and the direct method developed in the current study as described in Section SI-4. The indirect method predicts system parameters for VP and then calculates system parameters for SW by thermodynamic property cycle, while the direct method uses system parameters calibrated with experimental data for SW and uses a thermodynamic property cycle to calculate system parameters for VP. Table 4 shows the validation statistics for the VP and SW direct method predictions, and Table S1 and Table S2 in the SI compare the direct predictions to the indirect predictions and direct predictions with the Van’t Hoff correction applied. Section SI-4 briefly describes theoretical reasons that VP and SW will be different for liquids and solids. The indirect method was trained only on liquids and is not applicable to solid chemicals; the RMSEPs for predicting properties for solids, i.e., VP[s] and SW[s], are 5 to 6, respectively (results not shown). Figure S7 shows the indirect method gives good predictions for VP[l] and SW[l] with RMSEP values of 0.78 and 0.96, respectively, for chemicals which are external to IFSQSAR, i.e., are not in either the training or validation dataset of the IFSQSAR solute descriptor QSPRs.

Table 4 Validation statistics for log VP and log SW

The direct method predicts VP and SW specifically for liquids and supercooled liquids if the chemical is a solid at 25 °C. When applying the direct method to solids the predictions need to be converted to VP[s] and SW[s] using the Van’t Hoff equation and ΔSM and TM which can be predicted by QSPRs in the IFSQSAR software. These additional QSPR predictions will introduce more uncertainty and variability into the predicted values for solids and the predictions would be expected to be less accurate. Because of this additional uncertainty the prediction accuracy of VP[s] and SW[s] using the Van’t Hoff correction is no better than just using the supercooled liquid predictions when comparing to the experimental data. Nevertheless, we present the results here for thoroughness because comparing the supercooled predictions to experimental VP[s] and SW[s] is an end-point mismatch. Large predictions for VP and SW are capped to provide more reasonable values and assigned UL 6 corresponding to a boundary condition violation. Aside from the challenges for predicting properties for solids, much the same trends are observed in the data and model performance as observed for the partition ratios.
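To illustrate the liquid-to-solid conversion, the sketch below uses a common textbook form of the Van’t Hoff (fugacity ratio) correction, log F = −ΔSM (TM − T) / (2.303 R T), which assumes a constant entropy of melting and neglects heat capacity differences between the solid and the supercooled liquid. The exact form used in IFSQSAR is described in Section SI-4 and may differ; the function names and the example inputs here are hypothetical.

```python
import math

R = 8.314  # ideal gas constant, J/(mol K)


def log_fugacity_ratio(delta_s_m, t_m, t=298.15):
    """log10 of the solid/supercooled-liquid fugacity ratio.

    delta_s_m: entropy of melting in J/(mol K); t_m and t in K.
    Returns 0 when the chemical melts at or below the system temperature.
    """
    if t_m <= t:
        return 0.0
    return -delta_s_m * (t_m - t) / (math.log(10) * R * t)


def solid_from_supercooled(log_property_liquid, delta_s_m, t_m, t=298.15):
    """Convert log VP[l] or log SW[l] to log VP[s] or log SW[s]."""
    return log_property_liquid + log_fugacity_ratio(delta_s_m, t_m, t)


# Hypothetical chemical: TM = 398.15 K, entropy of melting from Walden's rule
log_f = log_fugacity_ratio(delta_s_m=56.5, t_m=398.15)
log_sw_solid = solid_from_supercooled(-3.0, delta_s_m=56.5, t_m=398.15)
print(f"log F = {log_f:.2f}, log SW[s] = {log_sw_solid:.2f}")
```

Because the correction scales with both ΔSM and TM − T, errors in either predicted input propagate directly into the solid-state prediction, which is consistent with the additional uncertainty discussed above.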

Figure S8 shows the predictions for VP using the direct method versus experimental values for chemicals external to IFSQSAR, comparing the effect of correcting with the Van’t Hoff equation or leaving the data uncorrected. Figures 2C, D show predictions corrected with the Van’t Hoff equation for data that are in the IFSQSAR validation dataset and data that are external to IFSQSAR. As is observed for the log K values, predictions for chemicals in the validation dataset (RMSEP 0.91) are more accurate than predictions for external chemicals (RMSEP 2.04). Predictions for liquids are again more accurate (RMSEP 0.71) than predictions for solids (RMSEP 2.59). Table 4 and Table S1 show the statistics for IFSQSAR log VP predictions. The trend is again the same for log SW, with predictions for chemicals in the validation dataset (RMSEP 1.28) more accurate than predictions for external chemicals (RMSEP 1.69) as shown in Fig. 2E, F, and predictions for liquids (RMSEP 0.88) more accurate than predictions for solids (RMSEP 1.81). Figure S9 shows the data with and without the Van’t Hoff correction, and Table 4 and Table S2 show the statistics for IFSQSAR log SW predictions.

The indirect and direct IFSQSAR methods for predicting log VP[l] and log SW[l] have comparable RMSEP and AD coverage; therefore, the direct method is preferable because the model has fewer inputs. For chemicals flagged as UL 0, 1, 2, 6 the IFSQSAR model predictions for solids with the Van’t Hoff correction applied have comparable or better RMSEP compared to the predictions with no correction applied. However, for chemicals flagged as being egregiously outside of the AD with UL 3 or UL 5 the IFSQSAR predictions with no Van’t Hoff correction applied have a better RMSEP. This can be interpreted to mean that if the IFSQSAR predictions are already very far outside of the AD adding further correction factors with their own AD and uncertainty is likely to only make the predictions worse.

IFSQSAR uncertainty estimates

Tables 3 and 4 show the IFSQSAR 95% PIs typically capture about 80–90% of the deviations from experimental data for the external dataset, indicating a slight underestimation of the standard error of prediction. Multiplying the estimated RMSEP by 1.25 for all IFSQSAR QSPRs brought the fraction within the 95% PI of the partition ratio models close to 95%. No further adjustments were required for the partition ratio QSPRs. For the VP and SW QSPRs there is a tendency for the 95% PIs to capture less than 95% of the predictions for solids; therefore, additional multiplicative adjustment factors of 1.67 and 1.25 were applied to the 95% PIs for the VP and SW QSPRs respectively for chemicals identified as maybe or likely solids by the IFSQSAR state classifier. After these adjustments there was still a tendency for the VP QSPR to capture less than 95% for chemicals with high UL; therefore, an additional multiplicative adjustment factor of 1.25 is applied to chemicals with UL 2, UL 3, and UL 5.
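The adjustment factors described above can be collected into a single helper, sketched below. Two points are assumptions rather than statements from the text: that the factors compose multiplicatively, and that the high-UL factor for VP applies regardless of physical state. The function name and arguments are hypothetical and are not part of the IFSQSAR interface.

```python
def adjusted_rmsep(rmsep_value, prop, is_solid=False, ul=0):
    """Apply the RMSEP adjustment factors described in the text.

    prop: property label, e.g. "KOW", "KAW", "KOA", "VP" or "SW".
    is_solid: True if the state classifier flags a maybe or likely solid.
    ul: aggregate uncertainty level of the prediction.
    """
    factor = 1.25                      # applied to all IFSQSAR QSPRs
    if is_solid and prop == "VP":
        factor *= 1.67                 # additional factor for VP of solids
    elif is_solid and prop == "SW":
        factor *= 1.25                 # additional factor for SW of solids
    if prop == "VP" and ul in (2, 3, 5):
        factor *= 1.25                 # additional factor for VP at high UL
    return rmsep_value * factor


# Illustration with a nominal RMSEP of 1 log unit
kow = adjusted_rmsep(1.0, "KOW", is_solid=True, ul=3)
sw_solid = adjusted_rmsep(1.0, "SW", is_solid=True, ul=0)
vp_solid_ul3 = adjusted_rmsep(1.0, "VP", is_solid=True, ul=3)
print(kow, sw_solid, round(vp_solid_ul3, 4))
```

Under these assumptions the combined factor ranges from 1.25 for partition ratios to about 2.6 for VP predictions of solids at UL 2, 3, or 5.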

The method proposed by Endo to calculate the prediction interval of PPLFER equations [35] was applied to see if the additional uncertainty of extrapolating outside of the PPLFER equation training dataset could explain why some chemicals were not within the estimated RMSEPs. As shown by Endo this is not a large source of additional uncertainty for PPLFERs with at least 100 chemicals in the training dataset, and all PPLFERs used in this work have hundreds of chemicals in their training datasets. The increase in RMSEP from applying this method rarely made any difference in the fraction of chemicals that were within the 95% PI.

Conclusions

In summary, by applying the methods outlined in this study reasonable PIs can be assigned to all IFSQSAR PPLFER predictions for partition ratios, typically even those which are flagged as out of the AD and assigned UL 2 or UL 3. The main exceptions where a PI cannot be reasonably estimated are cases where the experimental endpoint is not applicable to the chemical in question, e.g., log KOW at pH 7 of a strong acid. If a chemical is a valid target for the QSPR endpoint, then even if the prediction is out of the AD, the model predictions are still useful provided that the uncertainty indicated by the 95% PI is acceptable. The acceptable level of uncertainty in a property prediction is fundamentally specific to an end user’s judgement and decision-context. For example, for priority setting or screening-level application of the IFSQSAR models, a higher level of uncertainty may be more tolerable than for a definitive risk assessment scenario. Given that typical experimental variability is about 0.1 log units for log KOW, and standard errors for PPLFERs with experimental solute descriptors are about 0.2 log units, an RMSEP of about 0.5 for chemicals within the AD of the models is probably an acceptable level of uncertainty for many decision-contexts. Even predictions which are out of the AD will typically have an RMSEP that gives a PI which is smaller than the full range of possible values for a partitioning property.

In general, the methods presented here predict partition ratios as log K for novel chemicals with an overall RMSEP of about 1 log unit. The RMSEP of log KAW is a little larger and that of log KOA is a little smaller than 1 log unit. This may have to do with the relative difficulties in making the measurements, or in making predictions for them. The log KAW measurements have more experimental difficulties because of ionization and other effects specific to water, so the inherent variability may be larger; however, there are fewer log KOA measurements, so the dataset of log KOA values may not represent the full range of variability. VP and SW of liquids are also predicted with an RMSEP of about one log unit, but predictions for solids have a larger RMSEP, up to 2 log units or more depending on the subset. Many of these predictions are still good; for example, 85% of predictions which are out of the AD for solid chemicals in the external VP dataset are within ± 1.98 log units of the expected value, corresponding to the 95% PI of an RMSEP of 1. The high overall RMSEPs for VP and SW of solids are clearly heavily influenced by a relatively small group of outliers. These instances tend to be strongly under-predicted, apparently due at least in part to the liquid-to-solid correction done with the Van’t Hoff equation. This disparity in prediction accuracy between liquids and solids is also apparent even for K values, where it should theoretically not be an issue, and warrants further investigation, which will be a part of future work.

The new work described here advances the capacity for estimating uncertainty in PC property predictions, particularly for novel chemicals, and future work will show how these new methods and existing property prediction methods can be used to systematically address uncertainty in PC property data through integrated approaches to testing and assessment.

Availability of data and materials

The data and model predictions included in this study are available in a user-friendly online platform, the Exposure And Safety Estimation (EAS-E) Suite (www.eas-e-suite.com). The IFSQSAR source code is available on GitHub: https://github.com/tnbrowncontam/ifsqsar.

Abbreviations

AD: Applicability domain
EAS-E (Suite): Exposure And Safety Estimation Suite
H: Henry’s Law constant
IFSQSAR: Python package for chemical property prediction used in this work
K: Partition ratio
KAW: Air–water partition ratio (KAW = H/RT)
KOA: Octanol–air partition ratio (Henry’s Law constant for octanol)
KOW: Octanol–water partition ratio, mutually saturated (wet)
dry KOW: Octanol–water partition ratio, not mutually saturated (dry)
KO[w]O[d] or KOO: Hypothetical partition ratio between wet and dry octanol
MV[l] or MV: Molar volume of liquids or super-cooled liquids
MW: Molecular weight
OPERA: OPEn structure–activity/property Relationship App
PC (property): Physical–chemical property
PI: Prediction interval
PPLFER: Poly-parameter linear free energy relationship (Abraham solvation model)
R: Ideal gas law constant
RMSEP: Root mean squared error of prediction
SMILES: Simplified molecular input line entry system
S, A, B, V, L: PPLFER solute descriptors (Abraham descriptors)
s, a, b, v, l, c: PPLFER system parameters (Abraham equation system parameters)
SA: “Solubility” in air (SA = VP/RT)
SW[l] or SW: Solubility in water of liquids or super-cooled liquids
SW[s]: Solubility in water of solids
SO[w][l]: Solubility in wet octanol of liquids or super-cooled liquids
SO[d][l] or SO: Solubility in dry octanol of liquids or super-cooled liquids
ΔSM: Entropy of melting
T: System temperature
TB: Boiling point
TM: Melting point
UL: Uncertainty Level
VP[l] or VP: Vapor pressure of liquids or super-cooled liquids
VP[s]: Vapor pressure of solids
QSP(A)R: Quantitative structure–property (activity) relationship
UPPER: Unified physicochemical property estimation relationships

References

  1. Government of Canada (1999) Canadian Environmental Protection Act, 1999. Canada Gazette Part III, vol 22

  2. European Commission (2007) Regulation (EC) No 1907/2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH). Off J Eur Union L 136:3–280

  3. US Congress (2016) Frank R. Lautenberg Chemical Safety for the 21st Century Act. 114th Congress, Pub. L. No. 114–182

  4. ECHA (2017) Guidance on Information Requirements and Chemical Safety Assessment Chapter R.11 PBT/vPvB Assessment. European Chemicals Agency, Helsinki, Finland

  5. Wegmann F, Cavin L, MacLeod M, Scheringer M, Hungerbühler K (2009) The OECD software tool for screening chemicals for persistence and long-range transport potential. Environ Model Softw 24(2):228–237

  6. Meyer T, Wania F, Breivik K (2005) Illustrating sensitivity and uncertainty in environmental fate models using partitioning maps. Environ Sci Technol 39(9):3186–3196. https://doi.org/10.1021/Es048728t

  7. Armitage JM, Wania F, Arnot JA (2014) Application of mass balance models and the chemical activity concept to facilitate the use of in vitro toxicity data for risk assessment. Environ Sci Technol 48(16):9770–9779. https://doi.org/10.1021/es501955g

  8. Baskaran S, Wania F (2023) Applications of the octanol–air partitioning ratio: a critical review. Environ Sci Atmospheres 3(7):1045–1065. https://doi.org/10.1039/D3EA00046J

  9. Wania F, Lei YD, Baskaran S, Sangion A (2022) Identifying organic chemicals not subject to bioaccumulation in air-breathing organisms using predicted partitioning and biotransformation properties. Integr Environ Assess Manag 18(5):1297–1312. https://doi.org/10.1002/ieam.4555

  10. Buser AM, MacLeod M, Scheringer M, Mackay D, Bonnell M, Russell MH, DePinto JV, Hungerbuhler K (2012) Good modeling practice guidelines for applying multimedia models in chemical assessments. Integr Environ Assess Manage 8(4):703–708. https://doi.org/10.1002/ieam.1299

  11. Li L, Zhang Z, Men Y, Baskaran S, Sangion A, Wang S, Arnot JA, Wania F (2022) Retrieval, selection, and evaluation of chemical property data for assessments of chemical emissions, fate, hazard, exposure, and risks. ACS Environ Au 2(5):376–395. https://doi.org/10.1021/acsenvironau.2c00010

  12. Pontolillo J, Eganhouse RP (2001) The search for reliable aqueous solubility (Sw) and octanol-water partition coefficient (Kow) data for hydrophobic organic compounds: DDT and DDE as a Case Study. Water-Resources Investigations Report 01-4201. U.S. Geological Survey. https://doi.org/10.3133/wri014201

  13. Beyer A, Wania F, Gouin T, Mackay D, Matthies M (2002) Selecting internally consistent physicochemical properties of organic compounds. Environ Toxicol Chem 21(5):941–953. https://doi.org/10.1002/etc.5620210508

  14. Mackay D (2001) Multimedia environmental models: the fugacity approach, 2nd edn. Lewis Publishers, Boca Raton

  15. Cole JG, Mackay D (2000) Correlating environmental partitioning properties of organic compounds: the three solubility approach. Environ Toxicol Chem 19(2):265–270. https://doi.org/10.1002/etc.5620190203

  16. Li NQ, Wania F, Lei YD, Daly GL (2003) A comprehensive and critical compilation, evaluation, and selection of physical-chemical property data for selected polychlorinated biphenyls. J Phys Chem Ref Data 32(4):1545–1590. https://doi.org/10.1063/1.1562632

  17. Schenker U, MacLeod M, Scheringer M, Hungerbühler K (2005) Improving data quality for environmental fate models: a least-squares adjustment procedure for harmonizing physicochemical properties of organic compounds. Environ Sci Technol 39(21):8434–8441

  18. Egeghy PP, Judson R, Gangwal S, Mosher S, Smith D, Vail J, Cohen Hubal EA (2012) The exposure data landscape for manufactured chemicals. Sci Total Environ 414(1):159–166.

  19. Arnot JA, Gobas FAPC (2006) A review of bioconcentration factor (BCF) and bioaccumulation factor (BAF) assessments for organic chemicals in aquatic organisms. Environ Rev 14(4):257–297. https://doi.org/10.1139/a06-005

  20. Wetmore BA, Wambaugh JF, Ferguson SS, Sochaski MA, Rotroff DM, Freeman K, Clewell HJ, Dix DJ, Andersen ME, Houck KA, Allen B, Judson RS, Singh R, Kavlock RJ, Richard AM, Thomas RS (2012) Integration of dosimetry, exposure, and high-throughput screening data in chemical toxicity assessment. Toxicol Sci 125(1):157–174. https://doi.org/10.1093/toxsci/kfr254

  21. Judson R, Richard A, Dix DJ, Houck K, Martin M, Kavlock R, Dellarco V, Henry T, Holderman T, Sayre P, Tan S, Carpenter T, Smith E (2009) The toxicity data landscape for environmental chemicals. Environ Health Perspect 117(5):685–695. https://doi.org/10.1289/ehp.0800168

  22. Abraham MH (1993) Scales of solute hydrogen-bonding: their construction and application to physicochemical and biochemical processes. Chem Soc Rev 22:73–83.

  23. Goss K-U (2005) Predicting the equilibrium partitioning of organic compounds using just one linear solvation energy relationship (LSER). Fluid Phase Equilib 233(1):19–22. https://doi.org/10.1016/j.fluid.2005.04.006

  24. OECD (2007) Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships (QSAR) Models. OECD Environment Health and Safety Publications Series on Testing and Assessment No. 69. Organisation for Economic Cooperation and Development, Environment Directorate, Paris

  25. OECD (2004) OECD Principles for the validation, for regulatory purposes, of (quantitative) structure-activity relationship models. OECD, Paris

  26. OECD (2023) (Q)SAR assessment framework: guidance for the regulatory assessment of (Quantitative) structure − activity relationship models, predictions, and results based on multiple predictions. Series on Testing and Assessment No. 386. Organisation for Economic Cooperation and Development, Paris

  27. Brown TN, Arnot JA, Wania F (2012) Iterative fragment selection: a group contribution approach to predicting fish biotransformation half-lives. Environ Sci Technol 46:8253–8260. https://doi.org/10.1021/es301182a

  28. Arnot JA, Brown TN, Wania F (2014) Estimating screening-level organic chemical half-lives in humans. Environ Sci Technol 48:723–730. https://doi.org/10.1021/es4029414

  29. Brown TN, Armitage JM, Arnot JA (2019) Application of an Iterative Fragment Selection (IFS) method to estimate entropies of fusion and melting points of organic chemicals. Mol Inf 38(8–9):1800160. https://doi.org/10.1002/minf.201800160

  30. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comp Sci 28(1):31–36. https://doi.org/10.1021/ci00057a005

  31. Lian B, Yalkowsky SH (2014) Unified physicochemical property estimation relationships (UPPER). J Pharm Sci 103(9):2710–2723. https://doi.org/10.1002/jps.24033

  32. Brown TN (2022) QSPRs for predicting equilibrium partitioning in solvent-air systems from the chemical structures of solutes and solvents. J Solution Chem 51(9):1101–1132. https://doi.org/10.1007/s10953-022-01162-2

  33. Endo S, Goss K-U (2014) Applications of polyparameter linear free energy relationships in environmental chemistry. Environ Sci Technol 48(21):12477–12491. https://doi.org/10.1021/es503369t

  34. Brown TN (2021) Empirical regressions between system parameters and solute descriptors of polyparameter linear free energy relationships (PPLFERs) for predicting solvent-air partitioning. Fluid Phase Equilib 540:113035. https://doi.org/10.1016/j.fluid.2021.113035

  35. Endo S (2022) Applicability domain of polyparameter linear free energy relationship models evaluated by leverage and prediction interval calculation. Environ Sci Technol 56(9):5572–5579. https://doi.org/10.1021/acs.est.2c00865

  36. Ulrich N, Endo S, Brown TN, Watanabe N, Bronner G, Abraham MH, Goss KU (2017) UFZ-LSER database v 3.2.1. http://www.ufz.de/lserd. Accessed 25 Jan 2021

  37. Abraham MH, Smith RE, Luchtefeld R, Boorem AJ, Luo R, Acree WE Jr (2010) Prediction of solubility of drugs and other compounds in organic solvents. J Pharm Sci 99(3):1500–1515. https://doi.org/10.1002/jps.21922

  38. Abraham MH, Le J (1999) The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. J Pharm Sci 88(9):868–880. https://doi.org/10.1021/js9901007

  39. Abraham MH, Acree WE (2020) Estimation of vapor pressures of liquid and solid organic and organometallic compounds at 298.15K. Fluid Phase Equilib 519:112595. https://doi.org/10.1016/j.fluid.2020.112595

  40. Brown TN, Celsie A, Arnot JA, Parnis JM (2023) PPLFER paper #3 Mixtures. In Prep

  41. Abraham MH, Acree WE (2008) Comparison of solubility of gases and vapours in wet and dry alcohols, especially octan-1-ol. J Phys Org Chem 21(10):823–832. https://doi.org/10.1002/poc.1374

  42. Baskaran S, Lei YD, Wania F (2021) A database of experimentally derived and estimated octanol-air partition Ratios (KOA). J Phys Chem Ref Data. https://doi.org/10.1063/5.0059652

  43. Brown TN (2014) Predicting hexadecane-air equilibrium partition coefficients (L) using a group contribution approach constructed from high quality data. SAR QSAR Environ Res 25(1):51–71. https://doi.org/10.1080/1062936X.2013.841286

  44. Mansouri K, Grulke CM, Judson RS, Williams AJ (2018) OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 10:10. https://doi.org/10.1186/s13321-018-0263-1

  45. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26(5):694–701. https://doi.org/10.1002/qsar.200610151

  46. Gramatica P, Cassani S, Roy PP, Kovarich S, Yap CW, Papa E (2012) QSAR modeling is not “push a button and find a correlation”: a case study of toxicity of (Benzo-)triazoles on algae. Mol Inform 31:817–835. https://doi.org/10.1002/minf.201200075

  47. Zhang Z, Sangion A, Shenghong W, Gouin T, Brown TN, Arnot JA, Li L (2024) Chemical space covered by applicability domains of quantitative structure-property relationships and semi-empirical relationships in chemical assessments. Environ Sci Technol 58 (7):3386–3398. https://doi.org/10.1021/acs.est.3c05643

  48. US E.P.A. (2011) Estimation Programs Interface (EPI) Suite for Microsoft® Windows, Ver. 4.1., Released October, 2011 edn. U. S. Environmental Protection Agency, Washington, D.C.

  49. Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ (2016) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR QSAR Environ Res 27(11):911–937. https://doi.org/10.1080/1062936X.2016.1253611

  50. Schwarzenbach RP, Gschwend PM, Imboden DM (2016) Environmental organic chemistry, 3rd edn. Wiley, Hoboken

  51. Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Felix E, Magarinos MP, Mosquera JF, Mutowo P, Nowotka M, Gordillo-Maranon M, Hunter F, Junco L, Mugumbate G, Rodriguez-Lopez M, Atkinson F, Bosc N, Radoux CJ, Segura-Cabrera A, Hersey A, Leach AR (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47(D1):D930–D940. https://doi.org/10.1093/nar/gky1075

  52. Ulrich N, Ebert A (2022) Can deep learning algorithms enhance the prediction of solute descriptors for linear solvation energy relationship approaches? Fluid Phase Equilib 555:113349. https://doi.org/10.1016/j.fluid.2021.113349

  53. Hodges G, Eadsforth C, Bossuyt B, Bouvy A, Enrici M-H, Geurts M, Kotthoff M, Michie E, Miller D, Müller J, Oetter G, Roberts J, Schowanek D, Sun P, Venzmer J (2019) A comparison of log Kow (n-octanol–water partition coefficient) values for non-ionic, anionic, cationic and amphoteric surfactants determined using predictions and experimental methods. Environ Sci Eur 31(1):1. https://doi.org/10.1186/s12302-018-0176-7

  54. Pudipeddi M, Serajuddin ATM (2005) Trends in solubility of polymorphs. J Pharm Sci 94(5):929–939. https://doi.org/10.1002/jps.20302

Acknowledgements

Not applicable.

Funding

The authors acknowledge funding from the American Chemistry Council Long-Range Research Initiative. As this publication has not been formally reviewed by the American Chemistry Council, views expressed in this document are solely those of the authors.

Author information

Contributions

Trevor N. Brown: project conceptualization, data curation, model development, coding and testing, manuscript writing and editing. Alessandro Sangion: data curation, model deployment on EAS-E Suite, manuscript writing and editing. Jon A. Arnot: project conceptualization, management and funding, manuscript writing and editing.

Corresponding author

Correspondence to Trevor N. Brown.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Brown, T.N., Sangion, A. & Arnot, J.A. Identifying uncertainty in physical–chemical property estimation with IFSQSAR. J Cheminform 16, 65 (2024). https://doi.org/10.1186/s13321-024-00853-w

  • DOI: https://doi.org/10.1186/s13321-024-00853-w

Keywords