Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much—yet not enough—information at the same time to perform comprehensive exposomics research. Furthermore, the improvements in analytical technologies and computational mass spectrometry workflows coupled with the rapid growth in databases and increasing demand for high throughput “big data” services from the research community present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows, while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases, while increasing efficiency at the same time. In this article, these methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases to progressively improve large resources such as PubChem, and topic-specific subsets such as PubChemLite. 
PubChemLite is a living collection, updating as annotation content in PubChem is updated, and exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this are jointly hosted between the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.
Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Through the joint evolution over the last decade of high resolution mass spectrometry (HR-MS), cheminformatics techniques and openly available compound databases, a whole new world for identifying small molecules in complex samples has emerged. Despite many advances, chemical identification is still generally considered a bottleneck in many research fields (see e.g. [1, 2]). Interest in the exposome and the related exposomics field has grown alongside awareness of the influence of the external environment on health and disease. Exposomics requires researchers to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes [4,5,6], significantly adding to the identification challenge.
Scientific disciplines such as environmental science, metabolomics, forensics and exposomics are focusing increasingly on high throughput data exploration with HR-MS techniques [4, 7, 8]. Mass spectral libraries, which can be used to obtain rapid tentative identifications of relatively high confidence [9,10,11], still only cover a fraction of chemical information resources relevant in exposomics, metabolomics or in complex samples in general [13, 14]. This is especially true for HR-MS techniques, which are inherently limited by the availability of reference standards as well as the relative youth and lack of standardization in the field. Alternative methods to annotate detected exact masses in HR-MS studies beyond spectral library searching began emerging around 2010. These methods search compound (i.e., chemical) databases for possible candidates using the exact mass or calculated molecular formula, and then rank the candidates with in silico techniques that use the measured fragmentation information. The plethora of identification methods now available are described and compared in detail elsewhere [14,15,16,17]. A wide variety of (generally open) compound databases are typically used as information sources for these identification efforts, containing anything from tens to hundreds of thousands of structures (e.g. KEGG, HMDB [19, 20], CompTox) up to tens of millions of structures (e.g. ChemSpider and PubChem [23,24,25]). Most of these resources and, consequently, the number of candidates per exact mass/formula, are expanding significantly over time. Typical queries with smaller databases return tens to hundreds of candidates, whereas typical queries with large databases such as PubChem now return thousands to tens of thousands of candidates per exact mass/formula query.
For instance, querying HMDB, CompTox and PubChem with the formula C10H14N2 via the MetFrag [26, 27] web interface (12 August 2020) returns 4 (HMDB), 225 (CompTox) and 3704 (PubChem) candidates.
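To make the notion of an exact mass/formula query more concrete, the following minimal sketch computes the monoisotopic mass of C10H14N2 and the mass window that a ppm tolerance would define around it. The function names, the small element table and the 5 ppm tolerance are illustrative assumptions and are not taken from MetFrag.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; a small illustrative table.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
}

def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C10H14N2' (no brackets or charges)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[element] * (int(count) if count else 1)
    return total

def ppm_window(mass, ppm):
    """Return the (lower, upper) mass bounds for a relative tolerance given in ppm."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta

mass = monoisotopic_mass("C10H14N2")
low, high = ppm_window(mass, ppm=5.0)  # 5 ppm is an assumed tolerance
print(f"{mass:.4f} u, window {low:.4f} to {high:.4f}")
```

Every database entry whose neutral monoisotopic mass falls inside this window is a candidate, which is why the candidate count grows with database size.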
A major challenge in correctly identifying a chemical based on exact mass (or formula) and fragmentation information alone arises from the relatively little information conveyed in the fragmentation spectrum. During one open community evaluation approach, the 2016 Critical Assessment of Small Molecule Identification (CASMI) contest, participants were provided 208 challenges with fragmentation information and candidate query sets retrieved from ChemSpider. Using fragmentation information alone, participants were able to rank between 24 (11.5%) and 70 (33.7%) of these 208 challenges correctly in first place. However, combining this fragmentation information with other forms of information (e.g. references, retention time information) yielded up to 164 (78.8%) challenges correctly ranked in first place when combining all participant methods over the same ChemSpider candidate sets. Separately, a detailed evaluation of MetFrag combining retention time information with various scoring terms available via ChemSpider (5 different literature terms) and PubChem (PubMed Count and Patent Count) was performed for 473 environmentally relevant standards. This revealed that ranking results improved from 22 to 89% with ChemSpider and from 6 to 71% with PubChem (with 34 and 71 million entries respectively at the time). In summary, across these evaluations and others, better ranking performance is achieved with small, select databases, at the risk of missing the correct answer, while the use of additional metadata (expert knowledge, additional context) is necessary to improve the results for practical use, especially when using very large compound databases to search for candidates.
Another challenge, especially for exposomics, is database choice. Being a mix between metabolomics and environmental concepts and challenges, exposomics methods need, on the one hand, the biological context of pathway and metabolomics resources (generally small, specialist metabolite databases such as HMDB and KEGG) and, on the other hand, the wide coverage required to capture “chemical space” which, in environmental contexts, generally means PubChem or ChemSpider. Although recent works mention the need for an “exposomics database”, much of the necessary knowledge is already in the public domain to some extent, but under rapid development and scattered over an ever-growing number of resources. Notable recent developments include the CompTox Chemicals Dashboard, covering 882,000 (August 2020) environmentally and toxicologically-relevant compounds, and the Blood Exposome Database, which, although specifically designed for the blood matrix, still contains over 64,000 compounds. Large compound databases such as PubChem have content in common with many of the openly available smaller databases, but at a size of 109 million compounds (January 2021), PubChem also contains many (tens of) millions of entries that are not relevant to the exposomics context.
Beyond the database choice, common criticisms of small molecule identification coupled to compound databases arising from users over the years include the fact that newly-discovered and/or relevant compounds such as emerging chemicals, transformation products and metabolites are missing from, or hard to add to, these databases for a typical researcher. If these compounds are present, they tend to have very low metadata scores, and thus common environmental knowledge of transformations or emerging chemicals often cannot be found effectively during identification efforts. As a result (and also to increase efficiency), many groups in the environmental community have taken to compiling their own lists of relevant chemicals (commonly termed “suspect lists” within this community). The NORMAN Suspect List Exchange (NORMAN-SLE) is one initiative that arose to address NORMAN Network [31, 32] member needs to exchange this information as a result of a collaborative trial in 2014, and to date hosts over 73 specialised lists of chemicals of interest contributed by NORMAN members.
With a view on this “current state”, this article investigates how very large compound databases, or knowledge bases, such as PubChem, could be empowered to support HR-MS-based small molecule identification efforts in the context of exposomics. This article describes initial collaborative efforts on how to improve the performance of the PubChem integration into the in silico identification approach MetFrag. Since the first release of MetFrag in 2010, PubChem has grown from 25 million to now 109 million compounds, with an accompanying steadily worsening rank performance and increasing strain on resources due to the rapidly increasing candidate numbers. Three main aspects of these collaborative discussions are presented in this article: (1) the creation of a small, exposomics-relevant subset of PubChem–named PubChemLite–for efficient candidate queries, which has already been integrated into existing HR-MS workflows and teaching efforts; (2) progressive integration of environmentally-relevant expert knowledge to mitigate identified knowledge gaps in PubChem annotation content, based on analysis of previous benchmarking sets and the NORMAN-SLE content; and (3) how annotation content can be leveraged for easier interpretation of results. As a result, this article focuses heavily on PubChem, MetFrag and the NORMAN-SLE, with the view that the ideas presented here could be extended to other knowledge bases and other in silico identification approaches based on HR-MS.
Results and discussion
Creating “PubChemLite” for exposomics
Since a very large proportion of the PubChem database (> 60%) is sourced from purchasable screening libraries from chemical vendors, where the chemicals are generally produced in relatively small amounts (e.g. mg) in a laboratory setting, the vast majority of these chemicals are highly unlikely to be detectable in either the environment or biological samples. Thus, instead of the current status quo, i.e. searching the entire PubChem database and using metadata scores to “up-prioritize” interesting candidates (i.e., processing tens of thousands of candidates per mass, to only obtain tens to hundreds of interesting entries), the first step investigated the creation of relevant subsets of PubChem for more efficient queries. This was done by selecting relevant sections of the “PubChem Compound Table of Contents” (PubChem Compound TOC) Classification  as shown in Fig. 1. Further details are given in the "Methods" section.
Initially, two versions of PubChemLite were created. The environmental selection (PubChemLite tier0) was formed of the yellow-shaded categories in Fig. 1, shortened to “AgroChemInfo, DrugMedicInfo, FoodRelated, PharmacoInfo, SafetyInfo, ToxicityInfo, KnownUse”, whereas the exposomics selection (PubChemLite tier1) had the additional purple-shaded category, shortened to “BioPathway”, which contained the additional biological information categories relevant to metabolomics and exposomics. Entries were merged by InChIKey first block (the structural skeleton), and total Patent Counts and Literature Counts were calculated over the merged entries (full details in the "Methods" section). Each category was added as an additional column, where each entry was assigned a value that was a (merged) count of the sub-categories, and a total annotation count column was also added, summing the presence in top categories only (for further details, see "Methods"). Initial versions (20 November 2019/14 January 2020) contained 315,843/316,810 entries in tier0 (environmental collection) and 361,976/363,911 entries in tier1 (exposomics). In other words, the 103 M entries of PubChem (at the time) were collapsed down to two datasets of approximately 316 K and 360 K compounds. An RMarkdown file to visualize the content (categories and subcategories) of PubChemLite as an interactive sunburst plot (for a static version see Fig. 2) using the 14 January 2020 tier1 version is included as Additional file 1 and is also available on the ECI GitLab pages [37, 38]; further details are in the "Methods" section below.
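The merging step can be pictured with the minimal sketch below, which groups records by InChIKey first block and sums their patent and literature counts. The record layout, the placeholder InChIKeys, the CIDs and the counts are all hypothetical and serve only to illustrate the idea; the actual build scripts are those referenced in the "Methods" section.

```python
from collections import defaultdict

def collapse_by_first_block(records):
    """Merge records sharing an InChIKey first block, summing their counts."""
    merged = defaultdict(lambda: {"cids": [], "patent_count": 0, "literature_count": 0})
    for rec in records:
        first_block = rec["inchikey"].split("-")[0]  # first block = structural skeleton
        entry = merged[first_block]
        entry["cids"].append(rec["cid"])
        entry["patent_count"] += rec["patent_count"]
        entry["literature_count"] += rec["literature_count"]
    return dict(merged)

# Placeholder records: two stereoisomers of one skeleton plus one unrelated compound.
records = [
    {"cid": 1, "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "patent_count": 10, "literature_count": 5},
    {"cid": 2, "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N", "patent_count": 2, "literature_count": 1},
    {"cid": 3, "inchikey": "CCCCCCCCCCCCCC-UHFFFAOYSA-N", "patent_count": 7, "literature_count": 0},
]
merged = collapse_by_first_block(records)
print(len(records), "records collapsed to", len(merged), "entries")
```

Summing over the merged entries means that a skeleton is credited with the combined evidence for all of its stereoisomers rather than for a single CID.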
A benchmark dataset of 977 de-duplicated compounds (see Additional file 2) was created by merging chemicals from previous evaluations [16, 26] (predominantly environmentally relevant) as described in "Methods". MetFrag was run with different versions of PubChemLite as well as CompTox (7 March 2019 release ) using comparable scoring terms. A summary of the results shown in Fig. 3 includes calculations both without (green) and with (blue) the use of MS/MS information (in silico fragmentation score and MS library matching scores). Further parameter details are given in the "Methods" section, with tables included in Additional file 3. Overall, CompTox and PubChemLite perform comparably; initially CompTox had fewer missing entries (grey shading) due to their earlier concerted efforts to add compounds of environmental interest, including transformation products (these gaps may well be smaller with the new data release). These gaps were closed progressively in PubChemLite as described in the next section “Identifying and Filling Gaps in PubChem Annotation Content”. Furthermore, early results (see Additional file 3: Figures S1 and S2, Tables S1 and S2) showed that both versions of PubChemLite, tier0 and tier1, performed almost identically even on environmental substances of interest, such that finally, one “PubChemLite” for exposomics was created, equivalent to tier1 plus the two additional categories as shown in Fig. 1 . Results from this version are also shown in Fig. 3.
The results in Fig. 3 show that, while annotation information alone leads to good ranking performance (~ 70–73% ranked first, dark green shaded results), the MS/MS information is essential for further improvements (~ 79–83% ranked first, dark blue shaded results). This is discussed further below. The PubChemLite results on the two initial versions (20 November 2019 and 14 January 2020) also clearly show that ~ 8% of the benchmark dataset were missing from PubChemLite. A detailed interrogation of the benchmark set of 977 reference standards from Eawag and UFZ revealed that, as the community has commented over many years, detailed annotation information was missing for well-known relevant transformation products in PubChem. This accounted for 37 of the 57 missing entries in the 14 January 2020 tier0 version and is discussed further in the next section.
Identifying and Filling Gaps in PubChem Annotation Content
During previous evaluations of MetFrag specifically, and of in silico identification approaches for HR-MS in general (e.g. during CASMI), the focus has generally been on evaluating the methods themselves, aiming for objective evaluation. The use of identification approaches in typical real-life scenarios, however, often requires additional subjectivity to provide interpretation, not just identification. Thus, the material in this article should not be viewed as an evaluation of MetFrag itself (which has not changed), but rather as a demonstration of how improving the underlying database and associated functionality can help to improve outcomes for users (i.e. the ability to find relevant chemicals) in the context of exposomics. In other words, this has been an opportunity to investigate and improve the annotation content (i.e. information content beyond structural properties) in PubChem for exposomics.
As Fig. 3 reveals, 57 chemicals from the benchmark set were missing in the early versions of PubChemLite, many of which were well-known transformation products in environmental studies. Since adding annotation content also requires sufficient provenance and evidence to support the annotation, the NORMAN-SLE [30, 44], which now has its own Classification Browser in PubChem (see Fig. 4), was browsed for suitable suspect lists containing annotation content. Initial efforts concentrated on list S60 (SWISSPEST19), a list of pesticides and transformation products/metabolites documented by Kiefer et al. This list contained parent-transformation product mappings, plus the link to information about agrochemical use (since the focus was on pesticides). The list was modified into a “predecessor/successor” mapping form (to avoid terminology clashes within other sections of PubChem) and added, with full provenance, into a new “Transformations” section in the individual PubChem records (see Fig. 5). Accompanying statements on “Agrochemical Transformations” within the agrochemical sections were also added, for example “Folpet has known environmental transformation products that include Phthalimide, Phthalamic acid, and Phthalic acid”. The PubChemLite version created 22 May 2020 included these new annotations, with fewer missing entries and slightly better ranks (see Fig. 3). Since this only focused on the agrochemicals (pesticides), the many pharmaceutical (and other) transformation products among the Eawag dataset were still missing. While these are all present in MassBank (S1 in the NORMAN-SLE), this dataset does not come with appropriate annotation content or provenance. Instead, the Supporting Information from Schollee et al. provided suitable parent-TP mappings to create the predecessor-successor tables, which were merged with the Eawag classification information (with permission and support from Juliane Hollender) and added as list S66.
This collection, together with list S68 HSDBTPS, resulted in the greater coverage in the June 2020 and October 2020 versions (see Fig. 3), with only 16 missing entries (15 in October) remaining. These remaining 16 entries could not be clearly related to any specific NORMAN-SLE lists to add further annotation content at this stage, although annotation content is being progressively added in separate efforts, as is evident from the one fewer missing entry in October.
Leveraging annotation content in exposomics
The results presented in Fig. 3 detailed the use of rather generic metadata terms (literature counts, patent counts, total annotation counts). However, one aim of setting up PubChemLite was not only to merge several “useful” categories for exposomics, but to leverage the information within these categories (providing interpretation about candidates in candidate sets). The smallest annotation category in PubChemLite, the agrochemicals, was taken as an additional benchmarking dataset (1336 chemicals, 22 Jan. 2020, see Additional file 4) to investigate the influence of database size and the additional scoring terms on the ranking results. Since this was to mimic an environmental investigation interested in detecting agrochemicals (i.e. a “suspect screening” approach), the “agrochemical score”, i.e. how many agrochemical categories exist in PubChem for that chemical, was used as an additional scoring term in MetFrag (details in "Methods"). The results are shown as the green entries in Fig. 6; the exact numbers are given in Additional file 3: Table S3.
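The effect of adding a category-specific scoring term can be illustrated with the following minimal sketch. It assumes, purely for illustration, that each scoring term is scaled by its maximum over the candidate set and that the scaled terms are summed with user-defined weights; the function name, the equal weights and the two made-up candidates are assumptions and are not a description of MetFrag internals.

```python
def rank_candidates(candidates, weights):
    """Rank candidates by a weighted sum of max-normalised scoring terms."""
    maxima = {
        term: max(c[term] for c in candidates) or 1  # fall back to 1 if all values are 0
        for term in weights
    }

    def combined(candidate):
        return sum(w * candidate[term] / maxima[term] for term, w in weights.items())

    return sorted(candidates, key=combined, reverse=True)

# Two hypothetical candidates with the same mass; all counts are made up.
candidates = [
    {"name": "pharmaceutical", "literature": 900, "patents": 4000, "agro": 0},
    {"name": "pesticide", "literature": 600, "patents": 3000, "agro": 3},
]
generic = rank_candidates(candidates, {"literature": 1.0, "patents": 1.0})
targeted = rank_candidates(candidates, {"literature": 1.0, "patents": 1.0, "agro": 1.0})
print(generic[0]["name"], targeted[0]["name"])
```

With generic terms only, the better-documented pharmaceutical is ranked first; once the agrochemical term is included, the pesticide moves to first place, which is the behaviour a suspect screening for agrochemicals is looking for.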
With a full PubChem query and using only literature and patent information to score, only 58% of entries were correctly ranked in first place (which is not unexpected, as e.g. pharmaceuticals, industrial chemicals or even metabolites with the same mass may have larger literature or patent counts). When the database was restricted to the candidates in PubChemLite using the same scoring terms (literature and patent counts), this increased to 70%. However, adding the Agrochemical Score improved this further to 79.2%, demonstrating the potential usefulness of individual category-based scoring terms to help select relevant chemicals for further verification. In terms of computational efficiency, the last 101 queries (entries 1236–1336) of the Agrochemicals query took 11 min to complete with PubChemLite tier1 (query run 21 Jan. 2020), while the equivalent query with the full PubChem database and scoring terms took 164 min (query run 26 Jan. 2020). This corresponds to approx. 6.5 s per query for PubChemLite, versus 97 s per query for a full PubChem query (note: both queries were without fragmentation).
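The per-query timings follow directly from the reported run times, as the short calculation below shows. The function and variable names are illustrative only; the input numbers are those given above.

```python
def seconds_per_query(total_minutes, n_queries):
    """Average wall-clock time per query in seconds."""
    return total_minutes * 60 / n_queries

n_queries = 1336 - 1236 + 1  # entries 1236-1336 inclusive, i.e. 101 queries
lite = seconds_per_query(11, n_queries)   # PubChemLite tier1 run
full = seconds_per_query(164, n_queries)  # full PubChem run
print(f"PubChemLite: {lite:.1f} s, PubChem: {full:.1f} s, ratio: {full / lite:.1f}x")
# prints "PubChemLite: 6.5 s, PubChem: 97.4 s, ratio: 14.9x"
```

On these figures, the PubChemLite query is roughly 15 times faster per candidate search than the full PubChem query.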
Since this is purely annotation-based scoring, it is imperative to use additional experimental information such as fragmentation information and further verification with reference standards before any claims of higher confidence annotation are made. To address this, the benchmarking dataset (n = 977) used above (with MS/MS information available) was subset according to the availability of information in the Agrochemical Information category (creating a subset of n = 318), and evaluated with scoring terms relevant to the annotation type, as shown in the blue entry in Fig. 6. This mimics, to a certain extent, a typical suspect screening workflow where the main interest is in finding and confirming pesticides in an environmental sample. As shown, adding MS/MS information (MetFrag in silico fragmentation plus MoNA similarity score) increased the correctly ranked chemicals in first place to 90.6% for those agrochemicals that were also in the benchmarking set. If the database (in this case the PubChemLite tier0 12 Jun. 2020 version) had been restricted to agrochemicals only, this would have risen to 94.3%, as some non-agrochemical isomers still outscored several entries based on the literature and patent values. The performance would not be able to rise much higher than 94% with this dataset, however, since there are multiple agrochemical isomers present in the dataset where the less-well-known (but often structurally related) isomers ranked lower because of less supporting metadata. For instance, for secbutylazine (CID 23712), the candidate terbutylazine (CID 22206) was ranked first, another isomer, propazine (CID 4937), was second, and secbutylazine itself (CID 23712) was third. All three isomers were in the dataset.
In this case, both the in silico fragmenter and MoNA similarity scores placed these three isomers in the correct order (secbutylazine first, terbutylazine second, propazine third), showing that the experimental evidence is still crucial in distinguishing isomers, or in indicating whether they are indistinguishable on the given evidence. Terbutylazine was correctly ranked first for its corresponding entry (see Table 1).
Using this benchmarking dataset alone, taking PubChemLite and using the specific topic information for agrochemicals, most candidates were ranked 1st and the worst rank for a chemical was 3rd. Creating a similar pharmaceutical subset (as opposed to agrochemicals) using the “DrugMedicInfo” category yielded similar results (most ranked first, worst rank of 3rd) using either DrugMedicInfo or PharmacoInfo as scoring terms (see Additional file 3: Figure S3). For a more generic category such as ToxicityInfo, most were ranked 1st or 2nd, but the worst rank was 12th, indicating that this term may be less selective (see Additional file 3: Figure S3). Using patent and literature information alone (over the entire benchmark set), the worst rank was 27th, with 11 entries missing entirely. Thus, even though this dataset is of limited size (977 entries), the results indicate that there is a good chance that the correct candidate will be among the Top 3 using PubChemLite for highly specific categories such as agrochemicals or pharmaceuticals. On the other hand, more candidates will often have to be considered for less specific categories or questions (e.g. Toxicity Information) or when only the generic scoring terms are used. In the context of practical use of HR-MS for answering real life questions, e.g. the presence of well-known chemicals in environmental or patient samples, considering only a few candidates (e.g. 1–3) versus hundreds or even thousands of candidates per mass is a great step forward for higher throughput interpretation of non-target screening results and for reaching meaningful conclusions more quickly. It is expected that greater granularity in the annotation information will improve the interpretability and applicability of this information in the future (for instance, toxicity information is currently often only “information is present” and not “the substance is toxic”); efforts are being made to achieve this (beyond the scope of the current article).
Regular updates/deposition of relevant third-party data resources in PubChem such as HMDB, CompTox and the Blood Exposome database will help ensure that this content can be included and updated in PubChemLite.
As a future perspective, the addition of extra information, such as partitioning information (e.g. logP, logKow or logD) and collision cross section (CCS) values, will also help in candidate selection in specific cases (although for isobars/isomers that are very similar, predicted values will often be very close). Efforts are currently underway to include XlogP3 in future versions of PubChemLite to integrate within the retention time model already present in MetFrag. Further, an initial version of PubChemLite (14 January 2020 tier1) with CCS values contributed by CCSbase [56, 57] is also available on Zenodo and in the MetFrag web version, and is currently being evaluated in separate work.
The need to cover the “entire chemical space” in exposomics research is a huge challenge for researchers and database resources alike (and currently unachievable, owing to our inability to define chemical space completely). This article explores the use of annotation content of very large compound databases, i.e. compound knowledge bases, to create meaningful and efficient subsets relevant to specific use cases, specifically aimed at creating subsets of PubChem most relevant for exposomics. The resulting PubChemLite is a dynamic yet efficient database that grows as the respective (and relevant) annotation categories grow in PubChem, and is built and deposited regularly to allow integration with existing HR-MS identification approaches such as MetFrag [27, 59] and comprehensive MS workflows such as patRoon. The subcategories present in PubChemLite allow end users a certain degree of individual or sample-wide interpretation of the results, such that broad chemical categories become obvious amongst suggested candidates. These can be used as scoring terms or hard filters, depending on user choice, and subsets of the database could serve as large suspect lists if desired. PubChemLite is already in use in several research projects. Feedback on the approach and further integration into other resources and workflows is greatly welcomed. Further developments are being made behind the scenes to streamline the ideas presented in this manuscript for the community in other ways. The code and all necessary files are available (see availability statement), such that expert users can build and compile their own subsets of PubChem using any of the categories available in the PubChem Table of Contents Classification Browser by defining their own input “bit sets”.
To address the “data gap” issue of highly-relevant compounds missing in existing compound databases (a broadly acknowledged weakness and argument frequently applied against using compound databases for HR-MS-based tentative identification efforts), this article also explores how knowledge gaps can be assessed and filled, as exemplified with environmentally-relevant information from the NORMAN Network. A coupled deposition and annotation workflow has been set up between PubChem and the NORMAN-SLE, allowing the deposition of environmentally relevant substances into PubChem and the progressive integration of the accompanying (relevant) annotation content, with full traceability to the original data sources. The examples covered in detail here included transformation product and agrochemical use cases. Importantly, these integration efforts enhance both resources and help combine knowledge into a central location (thus increasing the FAIRness of the data) by reducing the isolation of the individual NORMAN-SLE lists while increasing the annotation (information) available in PubChem. The integration of content is occurring progressively with a focus on areas of high community interest and on those filling the largest gaps. Community input is very welcome to help focus these efforts to maximize the overall benefit. The content is available in a variety of formats across both resources for re-use.
While PubChemLite is an immediately accessible stepping-stone for HR-MS-based exposomics research, it is still only a small part of efforts towards a bigger picture solution for the exposomics challenge. Enhancing the annotation content of compound knowledge bases is clearly one way of improving the usability of very large knowledge bases. Dynamic and easy-to-use ways to subset and/or order the chemicals based on this annotation content (beyond creation of a MetFrag-specific output file) will be needed to improve the usability further. At some point, specialist users will need to be able to tell chemical knowledge bases what they want to find to improve their search results for their specific use case, rather than just taking the “best match” based on generic scores such as literature or annotation counts. Future efforts, beyond enhancing annotation content, will include continuing conversations with users and the community to develop functionality that can be applied either on the database side, or the workflow side, or both, to truly empower large compound knowledge bases for exposomics research and move from just identification towards more detailed interpretation of HR-MS datasets.
Creating PubChemLite for MetFrag
MetFrag currently has PubChem integrated via the RESTful API as well as a local mirror. Of the typically thousands of candidates that are retrieved using exact mass (with ppm error margin) or molecular formula queries, several are eventually discarded (e.g. disconnected structures, which cannot be observed at the input mass or formula in the mass spectrometer, or other structures that cannot be processed by MetFrag). Since high resolution mass spectrometry rarely yields information on stereochemistry (there are exceptions for some substances, e.g. when chiral chromatography is used), it is the default behaviour of MetFrag and many other approaches to merge candidates by the first block of the InChIKey (i.e. the structural skeleton) and present users with results displaying the stereoisomer with the highest score. For candidates merged by InChIKey first blocks, any ranking is usually driven by metadata rather than fragmentation, which does not usually contain sufficient information to distinguish stereoisomers, except for some tautomers. In MetFrag, this stereoisomer filtering can be switched on or off as desired. However, for larger (or complex) structures, the presence of stereoisomers can dramatically inflate candidate numbers and reduce calculation efficiency, often for little final gain.
To create subsets of PubChem by annotation content category, first a Table of Contents fingerprint (TOC FP) was created for each of the PubChem Compound TOC entries (each bit representing presence or absence of information in that category for a compound), along with metadata indicating the relationship between the bits (e.g. subcategories of a given annotation). Then, mapping files containing the desired TOC entries were created. Finally, the relevant data (compound information, patent and literature scores, plus the TOC fingerprints) were extracted by the compound identifier (CID) from the respective PubChem download files using scripts that have been made available at the Environmental Cheminformatics group GitLab pages.
Following this, and considering the current, established MetFrag behaviour, a set of rules was applied to the CIDs extracted from the TOC categories to generate a file that could be processed by MetFrag. Candidates that would be discarded later anyway (e.g. disconnected structures or other structures that cannot currently be processed by MetFrag) were discarded up front. Further, CIDs were collapsed by the first block to have one “best matching” CID and mappings to all related CIDs. The rules applied were the following:
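As a rough illustration of the fingerprint idea, the sketch below encodes category presence as a bit string and selects CIDs with at least one desired bit set. It is a simplified assumption, not the published scripts: the real fingerprints contain hundreds of bits plus subcategory metadata, and the two "Example Category" names, the CIDs and the function names are hypothetical.

```python
from typing import Dict, List, Set

CATEGORIES: List[str] = [
    "Identification",
    "Associated Disorders and Diseases",
    "Example Category A",  # hypothetical placeholder
    "Example Category B",  # hypothetical placeholder
]


def toc_fingerprint(present: Set[str], categories: List[str]) -> str:
    """Encode presence (1) or absence (0) of each category as one bit."""
    return "".join("1" if name in present else "0" for name in categories)


def select_cids(fingerprints: Dict[int, str], wanted: List[str],
                categories: List[str]) -> List[int]:
    """Return CIDs that have at least one of the wanted category bits set."""
    wanted_bits = [categories.index(name) for name in wanted]
    return sorted(
        cid for cid, fp in fingerprints.items()
        if any(fp[i] == "1" for i in wanted_bits)
    )


# Made-up CIDs and annotation content, for illustration only.
annotations = {
    101: {"Identification"},
    102: {"Example Category A", "Example Category B"},
    103: set(),
}
fingerprints = {cid: toc_fingerprint(p, CATEGORIES) for cid, p in annotations.items()}
selected = select_cids(fingerprints, ["Identification", "Example Category A"], CATEGORIES)
print(fingerprints, selected)
```

The `wanted` list plays the role of a mapping file that names the desired TOC entries.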
Retrieve all CIDs in PubChem with the desired annotation categories;
Map all CIDs to corresponding parent CIDs to obtain the neutral form, where available, imputing the annotation to the parent;
Collapse by InChIKey first block (IKFB), imputing total annotation to the IKFB, retaining the “best” CID (the most annotated CID for the given IKFB) and listing all related CIDs in a separate column, thus grouping all CIDs with annotation available;
Remove all entries containing the following elements: Kr, Dy, Ir, La, Lu, Nd, Nb, Os, Pd, Pt, Pu, Pr, Re, Rh, Ru, Sm, Sc, Ag, Ta, Tc, Tb, Th, Tm, Ti, W, Ac, Am, Er, Eu, Gd, Hf, Ho, Xe, Yb, Rn, Sr, Be, Cm, Cf, Cs, Md, Pm, Fr, Pa, Np, Bk, Es, Fm, No, Lr, Rf, Db, Sg, Bh, Hs, Mt, Ds, Rg, Cn, Nh, Fl, Mc, Lv, Ts, Og;
Remove disconnected structures, as these will not be observed at the mass/formula of the query;
Remove charges from charged molecular formulae (but not the corresponding structures).
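The rules above can be sketched in Python as follows. This is a simplified illustration under several assumptions rather than the authors' scripts: annotation is represented as a single count per CID and summed when imputed, disconnected structures are detected by a "." in the SMILES string, filters are applied to the retained "best" CID only, salts mapped to a parent are listed among the related CIDs, and the excluded element set is shortened. The record fields, CIDs and InChIKeys are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set

# Shortened for illustration; the full exclusion list is given in the rules above.
EXCLUDED_ELEMENTS = {"Kr", "Pt", "Ag", "Xe", "Sr", "Cs"}


class Record(NamedTuple):
    cid: int
    parent_cid: Optional[int]  # None when no parent CID is available
    inchikey: str
    formula: str
    smiles: str
    anno_count: int  # selected annotation categories present for this CID


def elements(formula: str) -> Set[str]:
    """Element symbols occurring in a molecular formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def strip_charge(formula: str) -> str:
    """Drop a trailing charge such as '+', '-' or '+2' from a formula."""
    return re.sub(r"[+-]\d*$", "", formula)


def build_subset(records: List[Record]) -> List[dict]:
    by_cid = {r.cid: r for r in records}
    imputed: Dict[int, int] = defaultdict(int)
    sources: Dict[int, Set[int]] = defaultdict(set)
    for r in records:
        if r.anno_count == 0:  # rule 1: annotated CIDs only
            continue
        # Rule 2: move annotation to the parent CID where one is available.
        target = r.parent_cid if r.parent_cid in by_cid else r.cid
        imputed[target] += r.anno_count
        sources[target].add(r.cid)

    # Rule 3: group by InChIKey first block (IKFB).
    groups: Dict[str, List[int]] = defaultdict(list)
    for cid in imputed:
        groups[by_cid[cid].inchikey.split("-")[0]].append(cid)

    rows = []
    for ikfb, cids in groups.items():
        best = by_cid[max(cids, key=lambda c: imputed[c])]  # most annotated CID
        if elements(best.formula) & EXCLUDED_ELEMENTS:  # rule 4
            continue
        if "." in best.smiles:  # rule 5: disconnected structure
            continue
        related = set(cids).union(*(sources[c] for c in cids))
        rows.append({
            "ikfb": ikfb,
            "best_cid": best.cid,
            "related_cids": sorted(related),
            "anno_total": sum(imputed[c] for c in cids),
            "formula": strip_charge(best.formula),  # rule 6
        })
    return rows


records = [
    Record(1, None, "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "C2H4O2", "CC(=O)O", 3),
    Record(2, 1, "BBBBBBBBBBBBBB-UHFFFAOYSA-M", "C2H3NaO2", "CC(=O)[O-].[Na+]", 2),
    Record(3, None, "AAAAAAAAAAAAAA-UHFFFAOYSA-M", "C2H3O2-", "CC(=O)[O-]", 1),
    Record(4, None, "CCCCCCCCCCCCCC-UHFFFAOYSA-N", "Cl2Pt", "Cl[Pt]Cl", 4),
    Record(5, None, "DDDDDDDDDDDDDD-UHFFFAOYSA-N", "C2H6O", "CCO", 0),
]
subset = build_subset(records)
print(subset)
```

In this toy input, the salt (CID 2) passes its annotation to its parent (CID 1), the charged form (CID 3) is collapsed into the same first block, the platinum-containing entry is removed, and the unannotated CID 5 never enters the subset.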
These rules were selected for maximum efficiency, resulting in the following behaviour that should be considered when interpreting the results. Firstly, collapsing all annotated CIDs by IKFB could result in the inclusion of different isotopic states and/or charges, which may not otherwise be included in MetFrag queries initiated by exact mass/formula, and could prevent these candidates from appearing in PubChemLite queries at their true exact mass/formula. In the context of efficient screening of masses for environmental, metabolomics or exposomics studies, matches with differing isotopic states are unlikely to be found in large amounts. In cases where isotopically labelled standards are used, or isotopically labelled experiments are performed, other data interrogation techniques are usually necessary/recommended to capture these peaks in advance of identification efforts. For differing charge states, since these are usually accounted for in the upstream workflow by adjusting the adduct state, the current behaviour ensures a consistent “base state” for adjustment of charge in other parts of the workflow. Secondly, mixtures are currently discarded from PubChemLite files, as retaining them would require an additional degree of manipulation (splitting and re-merging of the entries), which was not accounted for in the current version because it affects < 10K entries, of which a significant proportion are salts. It would be possible to address both issues in future versions should subsequent use cases deem this necessary. Finally, related CIDs are only included if that CID contains any annotation in at least one of the selected annotation categories.
For example, the InChIKey first block HXKKHQJGJAFBHI has 6 related CIDs in PubChemLite tier 0 (14 Jan 2020 version: 4, 111033, 439938, 446260, 7311736, 44150279), while 9 CIDs (4, 439938, 446260, 4631415, 7311735, 7311736, 16655457, 123598986, 140936702) match this InChIKey first block in the PubChem search interface (search date 22 May 2020).
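The two CID lists in this example can be compared with plain set operations. The short Python snippet below uses only the CIDs quoted above; the variable names are illustrative.

```python
# CIDs quoted in the text for InChIKey first block HXKKHQJGJAFBHI.
pubchemlite_tier0 = {4, 111033, 439938, 446260, 7311736, 44150279}
pubchem_search = {4, 439938, 446260, 4631415, 7311735, 7311736,
                  16655457, 123598986, 140936702}

shared = sorted(pubchemlite_tier0 & pubchem_search)
only_in_pubchemlite = sorted(pubchemlite_tier0 - pubchem_search)
only_in_pubchem_search = sorted(pubchem_search - pubchemlite_tier0)

print(shared)                  # [4, 439938, 446260, 7311736]
print(only_in_pubchemlite)     # [111033, 44150279]
print(only_in_pubchem_search)  # [4631415, 7311735, 16655457, 123598986, 140936702]
```

Four CIDs appear in both lists, two only in the PubChemLite entry and five only in the PubChem search results.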
As PubChem is changing daily, both in terms of numbers of chemicals and their annotation content, PubChemLite will not remain static. Initial evaluations in this paper were done on the first archived versions, generated November 18th, 2019, based on a fingerprint of 640 categories generated on October 2nd, 2019. There were approximately 33 M entries with TOC annotations at this stage (e.g. 33,766,782 on October 29th, 2019). A second archived version, with additional scoring, was created January 14th, 2020 for further evaluation. By this time the fingerprint consisted of 652 categories (January 9th, 2020) and there were 35 M entries with TOC annotations (35,800,159 on 21 January 2020). The third major version, PubChemLite for exposomics (31 October, 2020), was based on a fingerprint of 524 categories (29 October 2020), and there were 49 M entries with TOC annotations (49,493,641 on 2 November 2020). A breakdown of these files is given in Table 2. These datasets are archived as versions 0.1.0, 0.2.0 and 0.3.0 on Zenodo [35, 36, 40].
For the November 18, 2019 versions, an “FPSum” was calculated for all entries by adding the FP bits to give a maximum of 7 (tier0) or 8 (tier1). Individual columns for each annotation category were also created, so that the annotation categories could be used via the scoring term function in MetFrag, in addition to the patent and literature information. The resulting datasets (with preview) are available on Zenodo. For the January 14, 2020 and subsequent versions, “FPSum” was renamed to “AnnoTotalCount”, so that the column name better reflected the content, i.e. the availability of annotation categories for that entry. Additionally, individual columns were created for each annotation category, filled with values calculated by adding the category plus the number of subcategories present for that annotation, which ranged from 3 to 15 subcategories (Jan. 2020). The resulting datasets are on Zenodo and were integrated into the dropdown menu of local databases for MetFragWeb. PubChemLite was built approximately weekly following the January 14, 2020 format to test systems, with two versions used in this article to check additional annotation content (see results). During evaluations, it became clear that two additional categories would be useful, one being “Identification” (present but previously overlooked) and the second being “Associated Disorders and Diseases” (not present when PubChemLite was officially drafted). Based on the evaluations showing little difference between tier0 and tier1, one version equivalent to tier1 plus these two additional categories has been built and released as “PubChemLite for exposomics” version 0.3.0 and integrated into MetFragWeb and patRoon [60, 64]. Subsequent updates will be built and auto-committed to Zenodo (after passing build checks) to allow automatic updates for MetFragWeb, for any workflows/users of the MetFrag command line (MetFragCL) version, and for other workflows such as patRoon.
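One possible reading of this column calculation is sketched below in Python: “AnnoTotalCount” counts the annotation categories present, while each category column holds one for the category itself plus the number of its subcategories that are present. The category tree, the subcategory names and the function name are hypothetical, and the sketch is not the code used to build the released files.

```python
from typing import Dict, List, Set

# Hypothetical category -> subcategory layout, for illustration only.
CATEGORY_TREE: Dict[str, List[str]] = {
    "Category A": ["A1", "A2", "A3"],
    "Category B": ["B1", "B2", "B3", "B4"],
}


def annotation_columns(present: Set[str]) -> Dict[str, int]:
    """Build per-category columns plus AnnoTotalCount for one entry."""
    row: Dict[str, int] = {}
    for category, subcategories in CATEGORY_TREE.items():
        if category in present:
            # The category itself plus each of its subcategories that is present.
            row[category] = 1 + sum(1 for s in subcategories if s in present)
        else:
            row[category] = 0
    # Number of top-level annotation categories available for this entry.
    row["AnnoTotalCount"] = sum(1 for c in CATEGORY_TREE if c in present)
    return row


example_row = annotation_columns({"Category A", "A1", "A3"})
print(example_row)
```

For this entry the sketch gives a value of 3 for "Category A" (the category plus two subcategories), 0 for "Category B" and an AnnoTotalCount of 1.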
The performance of PubChemLite was assessed using datasets that had already been used to evaluate MetFrag performance: CASMI 2016 and MetFrag Relaunched (hereafter MetFragRL). The CASMI2016 dataset consisted of 208 compound-MS/MS spectra pairs. The MetFragRL evaluation sets consisted of four groups of spectra measured under different conditions (datasets EA, EQEx, EQExPlus and UF, with n = 473, 289, 310 and 226, where n refers to the number of compound-MS/MS spectrum pairs). The calculations performed on the individual datasets are presented in Additional file 3: Table S1 and Figure S1, alongside the previously published results. Since some compounds had mass spectra available in both modes, and there was some overlap between the different datasets, this corresponded to a total of 1298 (MetFragRL) and 1506 (MetFragRL + CASMI) compound-MS/MS pairs overall. Calculations performed on this set (comparing PubChemLite tiers and CompTox) are presented in Additional file 3: Table S2 and Figure S2. For the purpose of clarity in the main manuscript, this set of 1506 was de-duplicated down to a set of 977 unique compounds by InChIKey First Block after accounting for multiple tautomeric forms, to eliminate any confusion due to the presence of duplicate spectra/modes. The MS/MS spectrum record number (the first-matching entry in the case of multiple spectra) was used by an R script to automatically extract the corresponding MS/MS peaks and save them into the benchmarking file, using the MS/MS spectra provided as SI for the respective studies, downloaded from the journal pages [16, 26]. As all compounds were present in PubChem, additional compound information was filled in using PubChem web services via R functions. The final benchmarking file (hereafter “PCLite Benchmark” set) is available as Additional file 2 and on the ECI GitLab pages, along with all associated code.
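The dataset totals quoted above can be checked with simple arithmetic, and the de-duplication step can be illustrated with a small Python sketch. The original work used R scripts; the function below is an assumption for illustration, keeps the first-matching record per InChIKey first block, and does not handle the tautomer adjustment mentioned in the text. The InChIKeys and record names are placeholders.

```python
from typing import Dict, List, Tuple

# Numbers of compound-MS/MS pairs quoted in the text.
metfragrl = {"EA": 473, "EQEx": 289, "EQExPlus": 310, "UF": 226}
casmi2016 = 208

total_metfragrl = sum(metfragrl.values())  # 1298
total_all = total_metfragrl + casmi2016    # 1506


def deduplicate(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Keep the first spectrum record seen for each InChIKey first block."""
    first_seen: Dict[str, str] = {}
    for inchikey, record in pairs:
        first_seen.setdefault(inchikey.split("-")[0], record)
    return first_seen


# Placeholder InChIKeys and record names, not taken from the benchmark file.
pairs = [
    ("AAAAAAAAAAAAAA-UHFFFAOYSA-N", "record_001"),
    ("AAAAAAAAAAAAAA-UHFFFAOYSA-N", "record_002"),  # second mode, same compound
    ("BBBBBBBBBBBBBB-UHFFFAOYSA-N", "record_003"),
]
unique = deduplicate(pairs)
print(total_metfragrl, total_all, unique)
```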
The PCLite Benchmark set was used to evaluate various versions of PubChemLite (dates: 18/11/2019, 14/01/2020, 22/05/2020, 12/06/2020 and 31/10/2020) as well as the CompTox Chemicals Dashboard version from 7/03/2019 archived as MetFrag Local CSV (database) files [39, 65]. Files are not yet available from the most recent CompTox release (but have been requested). The “Select Metadata” version of CompTox was used, which contained 857,615 entries, corresponding to 773,561 DTXCID InChIKeys and 773,232 InChIKey First Blocks associated with DTXCIDs (the CompTox “MS-ready” form of information used in MetFrag). All CompTox files from the given release contain the same number of entries, just with varying metadata content. All queries were run with exact mass plus 5 ppm error, additional scoring terms and other parameters as detailed in Additional file 3: Table S4 and in the supporting scripts available on the ECI GitLab pages.
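To make the 5 ppm error margin concrete, the following Python sketch computes the mass window for a query and filters candidate masses against it. It assumes a symmetric window of mass × ppm / 10^6 around the query mass; MetFrag's internal handling may differ, and the function names and example masses are illustrative.

```python
from typing import List, Tuple


def ppm_window(mass: float, ppm: float = 5.0) -> Tuple[float, float]:
    """Lower and upper mass bounds for a relative error given in ppm."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def within_window(query_mass: float, candidate_masses: List[float],
                  ppm: float = 5.0) -> List[float]:
    """Candidate masses that fall inside the ppm window of the query mass."""
    low, high = ppm_window(query_mass, ppm)
    return [m for m in candidate_masses if low <= m <= high]


low, high = ppm_window(300.0)
hits = within_window(300.0, [299.9980, 299.9990, 300.0010, 300.0020])
print(low, high, hits)
```

For a query mass of 300.0 the window is roughly 299.9985 to 300.0015, so only the two middle candidate masses are retained.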
Availability of data and materials
All the files needed to generate PubChemLite are available and updated at least weekly on the PubChem FTP website (https://ftp.ncbi.nlm.nih.gov/pubchem/); all code to create PubChemLite with selected bit lists is available from the Environmental Cheminformatics group GitLab repository (https://git-r3lab.uni.lu/eci/pubchem/-/tree/master/pubchemlite). Fixed versions of PubChemLite mentioned in this manuscript are all archived on Zenodo [35, 36, 40, 49]. PubChemLite will be created and deposited to Zenodo at regular intervals following automatic checks, to allow integration with MetFrag and to offer download files for external users. The annotation content of the NORMAN-SLE (https://www.norman-network.com/nds/SLE/) is being progressively added to PubChem, with all data available on PubChem and Zenodo (https://zenodo.org/communities/norman-sle). The addition of new substances deposited to the NORMAN-SLE to PubChem is automated through mapping files and updated monthly (or more regularly if needed).
Abbreviations
CASMI: Critical Assessment of Small Molecule Identification
CCS: Collision Cross Section
CID: PubChem Compound Identifier
CompTox: US EPA CompTox Chemicals Dashboard
DAG: Directed Acyclic Graph
DTXCID: DSSTox Compound Identifier (from CompTox)
DTXSID: DSSTox Substance Identifier (from CompTox)
ECI: Environmental Cheminformatics group (at the University of Luxembourg)
FPSum: Addition of fingerprint bits to form a scoring term used in PubChemLite
HMDB: Human Metabolome Database
IKFB: InChIKey First Block
KEGG: Kyoto Encyclopedia of Genes and Genomes
MoNA: MassBank of North America
MS/MS: Tandem Mass Spectrum, MS2
NCBI: National Center for Biotechnology Information
NORMAN-SLE: NORMAN Suspect List Exchange
RAM: Random Access Memory
SLE: Suspect List Exchange (see NORMAN-SLE)
TOC: PubChem Compound Table of Contents
References
Ljoncheva M, Stepišnik T, Džeroski S, Kosjek T (2020) Cheminformatics in MS-based environmental exposomics: current achievements and future directions. Trends Environ Anal Chem 28:e00099. https://doi.org/10.1016/j.teac.2020.e00099 [cito:citesAsAuthority]
Wild CP (2005) Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 14:1847–1850. https://doi.org/10.1158/1055-9965.EPI-05-0456 [cito:citesAsAuthority]
Hollender J, Schymanski EL, Singer HP, Ferguson PL (2017) Nontarget screening with high resolution mass spectrometry in the environment: ready to go? Environ Sci Technol 51:11505–11512. https://doi.org/10.1021/acs.est.7b02184 [cito:citesAsAuthority]
Oberacher H, Sasse M, Antignac J-P et al (2020) A European proposal for quality control and quality assurance of tandem mass spectral libraries. Environ Sci Eur 32:43. https://doi.org/10.1186/s12302-020-00314-9 [cito:citesAsAuthority]
Schymanski EL, Jeon J, Gulde R et al (2014) Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 48:2097–2098. https://doi.org/10.1021/es5002105 [cito:citesAsAuthority]
Frainay C, Schymanski E, Neumann S et al (2018) Mind the gap: mapping mass spectral databases in genome-scale metabolic networks reveals poorly covered areas. Metabolites 8:51. https://doi.org/10.3390/metabo8030051 [cito:citesAsAuthority]
Cooper BT, Yan X, Simón-Manso Y et al (2019) Hybrid search: a method for identifying metabolites absent from tandem mass spectrometry libraries. Anal Chem 91(21):13924–13932. https://doi.org/10.1021/acs.analchem.9b03415 [cito:citesAsAuthority]
Blaženović I, Kind T, Ji J, Fiehn O (2018) Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8:31. https://doi.org/10.3390/metabo8020031 [cito:citesAsAuthority]
Blaženović I, Kind T, Torbašinović H et al (2017) Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93 % accuracy. J Cheminform 9:32. https://doi.org/10.1186/s13321-017-0219-x [cito:citesAsAuthority]
Schymanski EL, Ruttkies C, Krauss M et al (2017) Critical assessment of small molecule identification 2016: automated methods. J Cheminform 9:22. https://doi.org/10.1186/s13321-017-0207-1 ([cito:citesAsAuthority] [cito:usesMethodIn] [cito:extends] [cito:usesDataFrom])
Williams AJ, Grulke CM, Edwards J et al (2017) The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J Cheminform 9:61. https://doi.org/10.1186/s13321-017-0247-6 ([cito:citesAsDataSource] [cito:usesDataFrom])
Kim S, Chen J, Cheng T et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47:D1102–D1109. https://doi.org/10.1093/nar/gky1033 ([cito:citesAsDataSource] [cito:usesDataFrom])
Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49:D1388–D1395. https://doi.org/10.1093/nar/gkaa971 ([cito:citesAsDataSource] [cito:usesDataFrom])
Ruttkies C, Schymanski EL, Wolf S et al (2016) MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J Cheminform 8:3. https://doi.org/10.1186/s13321-016-0115-9 ([cito:citesAsAuthority] [cito:usesMethodIn] [cito:extends] [cito:usesDataFrom])
Barupal DK, Fiehn O (2019) Generating the blood exposome database using a comprehensive text mining and database fusion approach. Environ Health Perspect 127:097008. https://doi.org/10.1289/EHP4713 ([cito:citesAsDataSource] [cito:discusses])
Dulio V, van Bavel B, Brorström-Lundén E et al (2018) Emerging pollutants in the EU: 10 years of NORMAN in support of environmental policies and regulations. Environ Sci Eur 30:5. https://doi.org/10.1186/s12302-018-0135-3 [cito:citesAsAuthority]
Schymanski EL, Singer HP, Slobodnik J et al (2015) Non-target screening with high-resolution mass spectrometry: critical review using a collaborative trial on water analysis. Anal Bioanal Chem 407:6237–6255. https://doi.org/10.1007/s00216-015-8681-7 ([cito:citesAsAuthority] [cito:discusses] [cito:extends])
Kiefer K, Müller A, Singer H, Hollender J (2019) S60 | SWISSPEST19 | Swiss Pesticides and Metabolites from Kiefer et al 2019. https://doi.org/10.5281/zenodo.3544760 ([cito:usesDataFrom] [cito:citesAsDataSource])
Kiefer K, Müller A, Singer H, Hollender J (2019) New relevant pesticide transformation products in groundwater detected using target and suspect screening for agricultural and urban micropollutants with LC-HRMS. Water Res 165:114972. https://doi.org/10.1016/j.watres.2019.114972 ([cito:citesAsDataSource] [cito:citesAsAuthority])
Schollée JE, Schymanski EL, Stravs MA et al (2017) Similarity of high-resolution tandem mass spectrometry spectra of structurally related micropollutants and transformation products. J Am Soc Mass Spectrom 28:2692–2704. https://doi.org/10.1007/s13361-017-1797-6 ([cito:citesAsDataSource] [cito:citesAsAuthority])
LCSB-ECI, Krier J, Schymanski E et al (2020) S68 | HSDBTPS | Transformation Products Extracted from HSDB Content in PubChem. https://doi.org/10.5281/zenodo.3827487 ([cito:usesDataFrom] [cito:citesAsDataSource])
Cheng T, Zhao Y, Li X et al (2007) Computation of octanol–water partition coefficients by guiding an additive model with knowledge. J Chem Inf Model 47:2140–2148. https://doi.org/10.1021/ci700257y [cito:discusses]
Helmus R, ter Laak TL, van Wezel AP et al (2021) patRoon: open source software platform for environmental mass spectrometry based non-target screening. J Cheminform 13:1. https://doi.org/10.1186/s13321-020-00477-w ([cito:citesAsAuthority] [cito:discusses] [cito:extends])
McEachran AD, Mansouri K, Grulke C et al (2018) “MS-Ready” structures for non-targeted high-resolution mass spectrometry screening studies. J Cheminform 10:45. https://doi.org/10.1186/s13321-018-0299-2 [cito:citesAsAuthority]
ELS acknowledges discussions with Rick Helmus (University of Amsterdam), Herbert Oberacher (Medical University of Innsbruck), Juliane Hollender (Eawag) and the Environmental Cheminformatics team (LCSB-ECI, University of Luxembourg). ELS & SN are grateful for the hard work of Christoph Ruttkies and Sebastian Wolf (both formerly IPB Halle) on MetFrag over the years that has enabled this work. The work of all staff and contributors to PubChem, the NORMAN-SLE, and to open science in general is also gratefully acknowledged.
The work of EEB, PAT, and JZ was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. ELS and TK acknowledge funding support from the Luxembourg National Research Fund (FNR) for project A18/BM/12341006. SN acknowledges BMBF funding under grant number 031L0107.
Authors and Affiliations
Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, 4367, Belvaux, Luxembourg
Emma L. Schymanski & Todor Kondić
Bioinformatics and Scientific Data, Leibniz Institute of Plant Biochemistry (IPB Halle), 06120, Halle, Germany
German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Deutscher Platz 5e, 04103, Leipzig, Germany
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
EEB & ELS conceptualized PubChemLite and annotation gap analysis; EEB coded PubChemLite files, ELS the evaluation, SN integrated into the MetFrag infrastructure. EEB, ELS discussed and developed the manuscript and concepts, SN contributed. PAT developed the bit files; JZ, EEB, ELS and PAT integrated the NORMAN-SLE files, transformation and annotation content into PubChem; TK implemented regular builds and associated infrastructure at LCSB. All authors have contributed to the final manuscript. All authors read and approved the final manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Schymanski, E.L., Kondić, T., Neumann, S. et al. Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag. J Cheminform 13, 19 (2021). https://doi.org/10.1186/s13321-021-00489-0