The CompTox Chemistry Dashboard: a community data resource for environmental chemistry

Despite an abundance of online databases providing access to chemical data, there is increasing demand for high-quality, structure-curated, open data to meet the various needs of the environmental sciences and computational toxicology communities. The U.S. Environmental Protection Agency’s (EPA) web-based CompTox Chemistry Dashboard is addressing these needs by integrating diverse types of relevant domain data through a cheminformatics layer, built upon a database of curated substances linked to chemical structures. These data include physicochemical, environmental fate and transport, exposure, usage, in vivo toxicity, and in vitro bioassay data, surfaced through an integration hub with link-outs to additional EPA data and public domain online resources. Batch searching allows for direct chemical identifier (ID) mapping and downloading of multiple data streams in several different formats. This facilitates fast access to available structure, property, toxicity, and bioassay data for collections of chemicals (hundreds to thousands at a time). Advanced search capabilities are available to support, for example, non-targeted analysis and identification of chemicals using mass spectrometry. The contents of the chemistry database, presently containing ~ 760,000 substances, are available as public domain data for download. The chemistry content underpinning the Dashboard has been aggregated over the past 15 years by both manual and auto-curation techniques within EPA’s DSSTox project. DSSTox chemical content is subject to strict quality controls to enforce consistency among chemical substance-structure identifiers, as well as list curation review to ensure accurate linkages of DSSTox substances to chemical lists and associated data. The Dashboard, publicly launched in April 2016, has expanded considerably in content and user traffic over the past year. 
It is continuously evolving with the growth of DSSTox into high-interest or data-rich domains relevant to EPA, such as chemicals on the Toxic Substances Control Act listing, while providing the user community with a flexible and dynamic web-based platform for integration, processing, visualization and delivery of data and resources. The Dashboard provides support for a broad array of research and regulatory programs across the worldwide community of toxicologists and environmental scientists.

Electronic supplementary material: The online version of this article (10.1186/s13321-017-0247-6) contains supplementary material, which is available to authorized users.


2.6.Date of model development and/or publication:
2016

2.7.Reference(s) to main scientific papers and/or software package:
[1] An automated curation procedure for addressing chemical errors and inconsistencies in public https://cfpub.epa.gov/si/si_public_record_Report.cfm?dirEntryId=311655
[7] The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study.

3.3.Comment on endpoint:
The logarithm of the ratio of a contaminant concentration in biota to its concentration in the surrounding medium (water).
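Written as a formula, the endpoint definition above reads as follows. The symbols C_biota and C_water are notation introduced here for illustration, and a base-10 logarithm is assumed.

```latex
\log \mathrm{BCF} = \log_{10}\!\left( \frac{C_{\text{biota}}}{C_{\text{water}}} \right)
```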

3.7.Endpoint data quality and variability:
The

4.Defining the algorithm - OECD Principle 2

4.2.Explicit algorithm:
Distance weighted k-nearest neighbors (kNN). This is a refinement of the classical k-NN classification algorithm where the contribution of each of the k neighbors is weighted according to its distance to the query point, giving greater weight to closer neighbors. The distance used is the Euclidean distance. kNN is an unambiguous algorithm that fulfills the transparency requirements of OECD principle 2 with an optimal compromise between model complexity and performance.
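The weighting scheme described above can be illustrated with a minimal sketch. It is not the OPERA implementation: the function name, the inverse-distance form of the weights, the default k=5, and the use of a weighted average for a continuous endpoint such as LogBCF are all assumptions made for illustration.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted kNN estimate of a continuous endpoint.

    Hypothetical helper: inverse-distance weights are an assumed choice;
    the report only states that closer neighbors receive greater weight.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Euclidean distance from the query to every training compound
    dists = np.linalg.norm(X_train - x_query, axis=1)

    # Indices of the k closest training compounds
    nearest = np.argsort(dists)[:k]

    # Inverse-distance weights; eps avoids division by zero for exact matches
    weights = 1.0 / (dists[nearest] + eps)

    # Weighted average of the neighbors' endpoint values
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))
```

With this form, a query that coincides with a training compound is dominated by that compound's value, while equidistant neighbors contribute equally.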

5.2.Method used to assess the applicability domain:
The applicability domain of the model is assessed in two independent levels using two different distance-based methods. First, a global applicability domain is determined by means of the leverage approach that checks whether the query structure falls within the multidimensional chemical space of the whole training set.
The leverage of a query chemical is proportional to its Mahalanobis distance measure from the centroid of the training set. The leverages of a given dataset are obtained from the diagonal values of the hat matrix.
This approach is associated with a threshold leverage of 3*p/n, where p is the number of model variables and n is the number of training compounds. A query chemical with a leverage higher than the threshold is considered outside the applicability domain (AD), and its prediction may be unreliable.
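The leverage check can be sketched as follows. This is an illustrative sketch rather than the OPERA code: the function name is hypothetical, and centering the descriptors on the training-set mean is an assumption made so that the leverage relates to the distance from the centroid as described above.

```python
import numpy as np

def leverage_check(X_train, x_query):
    """Return (leverage, threshold, inside_global_AD) for one query chemical.

    Hypothetical helper. Descriptors are centered on the training mean
    (an assumption) before computing h = x (X^T X)^-1 x^T.
    """
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_query, dtype=float)
    n, p = X.shape  # n training compounds, p model variables

    # Center on the training-set centroid
    centroid = X.mean(axis=0)
    Xc = X - centroid
    xc = x - centroid

    # Pseudo-inverse guards against a singular X^T X
    xtx_inv = np.linalg.pinv(Xc.T @ Xc)

    # Leverage of the query: the hat-matrix diagonal formula applied to x
    h = float(xc @ xtx_inv @ xc)

    # Threshold leverage 3*p/n as stated in the text
    threshold = 3.0 * p / n
    return h, threshold, h <= threshold
```

Applying the same function to each training compound in turn gives the diagonal of the hat matrix for the centered training set.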
The leverage approach has specific limitations, in particular with respect to gaps within the descriptor space of the model or at the boundaries of the training set. To obviate such limitations, a second tier of applicability domain assessment was added. This comprises a local approach that only investigates the vicinity of the query chemical. This local approach provides a continuous index ranging from 0 to 1, unlike the first approach, which only provides Boolean answers (yes/no). This local AD-index is relative to the similarity of the query chemical to its 5 nearest neighbors in the p-dimensional space of the model. The higher this index, the more likely the prediction is to be reliable.
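The report does not give the exact formula for the local AD-index, so the sketch below is only one plausible form. It assumes a similarity of 1/(1+d) for Euclidean distance d, averaged over the 5 nearest training neighbors; the function name and this similarity definition are hypothetical.

```python
import numpy as np

def local_ad_index(X_train, x_query, k=5):
    """Continuous local AD index in (0, 1] from the k nearest neighbors.

    Hypothetical helper: the similarity 1 / (1 + d) is an assumed form,
    not the formula used in OPERA.
    """
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_query, dtype=float)

    # Euclidean distances to all training compounds, smallest k kept
    dists = np.sort(np.linalg.norm(X - x, axis=1))[:k]

    # Map each distance to a similarity in (0, 1] and average
    return float(np.mean(1.0 / (1.0 + dists)))
```

Under this assumed form, the index equals 1 when the query coincides with its neighbors and approaches 0 as the neighbors move farther away.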

5.3.Software name and version for applicability domain assessment:
Implemented in OPERA V1.02. An implementation of a local similarity index and the leverage approach based on the work of

5.4.Limits of applicability:
The two AD methods described in Section 5.2 are complementary and can be interpreted in the following way:
- If a chemical is considered outside the global AD with a low local AD-index, the prediction can be unreliable.
- If a chemical is considered outside the global AD but the local AD-index is average or relatively high, the query chemical is on the boundaries of the training set but has quite similar neighbors. The prediction can be trusted.
- If a chemical is considered inside the global AD but the local AD-index is average or relatively low, the query chemical falls in a "gap" of the chemical space of the model but is still within the boundaries of the training set and surrounded by training chemicals. The prediction should be considered with caution.
- If a chemical is considered inside the global AD with a high local AD-index, the prediction should be considered reliable.

6.6.Pre-processing of data before modelling:
No preprocessing of the values.

RMSE=0.55
A plot of the experimental versus predicted values for the training set is provided in supporting information Section 9.3.

7.5.Other information about the external validation set:
The validation set consists of 161 chemicals.
The values range from approximately -0.3 to 5.

7.6.Experimental design of test set:
The structures are randomly selected to represent 25% of the available data, keeping a similar normal distribution of LogBCF values in both training and test sets using the Venetian blinds method. A plot of the distribution of LogBCF values is provided in the supporting information Section 9.3.
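A Venetian blinds split of this kind can be sketched as below. The sketch assumes the common form of the method, in which compounds are sorted by endpoint value and every fourth one is assigned to the test set, giving roughly 25%. The function name is hypothetical, the sketch is deterministic, and any random element of the original procedure is not reproduced.

```python
import numpy as np

def venetian_blinds_split(y, n_blinds=4):
    """Split indices into training and test sets by Venetian blinds.

    Hypothetical helper: compounds are ranked by endpoint value and every
    n_blinds-th one goes to the test set, so both sets span the same
    range of values with a similar distribution.
    """
    y = np.asarray(y, dtype=float)

    # Rank compounds by endpoint value (stable sort keeps ties in input order)
    order = np.argsort(y, kind="stable")

    # Mark every n_blinds-th compound in ranked order as a test compound
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[order[n_blinds - 1::n_blinds]] = True

    train_idx = np.where(~test_mask)[0]
    test_idx = np.where(test_mask)[0]
    return train_idx, test_idx
```

With n_blinds=4, one compound in four is held out, which matches the 25% proportion stated above.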

8.1.Mechanistic basis of the model:
The model descriptors were selected statistically but they can also be mechanistically interpreted.

8.2.A priori or a posteriori mechanistic interpretation:
A posteriori mechanistic interpretation.

8.3.Other information about the mechanistic interpretation:
For more details and full reference, see references in Section 4.3 and Section 9.2.

9.1.Comments:
This QSAR model for BCF prediction is part of the NCCT_Models Suite, a free and open-source standalone application for the prediction of physicochemical properties and environmental fate of chemicals. This application is available in the supporting information (Section 9.3) of this report and in the paper cited as ref 2 in Section 2.7.
The detailed results of this suite of models applied to more than 700k DSSTox chemicals are available on the iCSS chemistry dashboard.

To be entered by JRC

10.2.Publication date:
To be entered by JRC

10.3.Keywords:
To be entered by JRC

10.4.Comments:
To be entered by JRC