MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics

Background In spite of its great promise, metabolomics has proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography–mass spectrometry (LC–MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases.
Description Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likely to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem, while returning far fewer candidates per spectrum than PubChem (46 vs. 1715 median candidates). Application of MINEs to LC–MS accurate mass data enabled the identity of an unknown peak to be confidently predicted.
Conclusions MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases whose results include irrelevant synthetic compounds. Furthermore, MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures.
Graphical abstract MINE database construction and access methods. The process of constructing a MINE database from the curated source databases is depicted on the left. The methods for accessing the database are shown on the right.
Electronic supplementary material The online version of this article (doi:10.1186/s13321-015-0087-1) contains supplementary material, which is available to authorized users.


Background
Metabolomics, the study of the population of small molecules in a cell, has drawn intense interest in fields from medicine to synthetic biology because it can provide a fine-grained representation of cellular state and activity [1][2][3][4]. Of particular interest is untargeted metabolomics, which seeks to measure as much of the metabolome as possible by limiting methodological detection bias. The dominant analysis technique for untargeted metabolomics is chromatography coupled with mass spectrometry (MS), but this method is hindered by a large number of unknown peaks [5] and the limited number of reference spectra available to identify the peaks [6]. A number of tools have been developed to propose structural matches for unannotated peaks [7][8][9][10][11], but in practice these tools either return too many candidates when drawing from large chemical databases such as PubChem [12] or miss compounds not yet present in curated biochemical databases [13,14]. This has the effect of locking untargeted metabolomics in an unfortunate paradox: compounds that are not present in biochemical databases are not identified, and in the absence of experimental identification, new compounds cannot be added to databases [15].
There is a growing consensus that many enzymes mediate undocumented side-reactions (known as promiscuous activities) as a result of exposure to diverse cellular metabolites [16,17]. These activities may explain unannotated peaks in metabolomics datasets [18,19] but are difficult to detect as they may be overshadowed by a known function [20] or be dependent on intracellular conditions [21]. Predicting novel chemical reactions based on broad enzyme specificity has been utilized by a number of tools for the prediction of new biochemical pathways [22][23][24]. Recently, this technique has also been used to expand structure databases for metabolomics by the MyCompoundID tool [25], the In Vivo/In Silico Metabolites Database (IIMDB) [15], LipidHome [26] and others [27,28].
Here we present Metabolic In silico Network Expansions (MINEs) that utilize the Biochemical Network Integrated Computational Explorer (BNICE) [29,30] to expand on general biochemical databases as well as organism-specific databases for Escherichia coli and yeast. The focus on endogenously present and organism-specific metabolites has been cited as critical to improving the confidence of compound matches [5] and thus we complement existing resources that focus on human metabolism. In principle, these predictions could also be made using Reaction Difference Matching (RDM) [23], machine learning methods [31,32], or other rule-based methods such as ChemAxon's Metabolizer. Each of these approaches has its benefits; the output depends largely on the quality and coverage of the reaction rules used in the analysis. We selected BNICE because we have a set of BNICE reaction rules that have been demonstrated to reproduce a large fraction of known biochemical reactions [24], as well as to predict enzyme reactions that were subsequently verified experimentally [33]. Importantly, we also have the right to re-distribute BNICE output. No license is required for academic users to access the website or APIs, and all BNICE-predicted compounds are available for download in SDF format from the website.

Construction and content
Construction of MINE databases follows the steps depicted in Fig. 1: BNICE expansion, Standardization and Annotation. The standardization and annotation procedure was guided by previous databases that combine reaction and compound data from various sources [34,35].
Compound information was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Release 68.0) [36], the Yeast Metabolome Database (YMDB) (Version 1.0) [37] and EcoCyc (Version 17.0) [38]. Generalized compounds (those containing R groups), inorganic compounds, and disconnected fragments were removed using the Pybel toolkit [39]. Generalized structures are of very limited utility, as they cannot be assigned an accurate mass or represented in a canonical form. Where possible, we encourage developers to avoid ambiguity by enumerating all possible structures in their databases. Additionally, biochemical databases often contain numerous duplicate compounds [40]; these were identified by Standard InChIKey [41] comparison and removed for computational efficiency.
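As a rough illustration of this clean-up step, the sketch below filters and de-duplicates a list of compound records in plain Python. It is not the Pybel-based procedure used to build the MINEs: it assumes each record already carries a SMILES string and a precomputed Standard InChIKey, it detects generalized structures and disconnected fragments with simple string checks (`*` wildcard atoms and `.` separators in SMILES), and it omits the inorganic-compound filter. The record layout, the function name and the placeholder keys are hypothetical.

```python
from typing import Iterable


def clean_compounds(records: Iterable[dict]) -> list[dict]:
    """Drop generalized or disconnected structures, then de-duplicate by InChIKey.

    Assumes each record has "smiles" and "inchikey" fields (hypothetical layout).
    """
    seen_keys = set()
    kept = []
    for record in records:
        smiles = record["smiles"]
        if "*" in smiles:  # wildcard atom, used here as a stand-in for an R group
            continue
        if "." in smiles:  # a dot separates disconnected fragments in SMILES
            continue
        key = record["inchikey"]
        if key in seen_keys:  # same Standard InChIKey: treat as a duplicate
            continue
        seen_keys.add(key)
        kept.append(record)
    return kept


# Illustrative input: ethanol twice (two SMILES spellings, one InChIKey),
# one generalized structure and one salt written as two fragments.
records = [
    {"name": "ethanol", "smiles": "CCO", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"name": "ethyl alcohol", "smiles": "OCC", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"name": "generic acid", "smiles": "*C(=O)O", "inchikey": "PLACEHOLDER-KEY-1"},
    {"name": "a salt", "smiles": "CC(=O)[O-].[Na+]", "inchikey": "PLACEHOLDER-KEY-2"},
]

cleaned = clean_compounds(records)
print([record["name"] for record in cleaned])
```

Keeping the first record seen for each key is an arbitrary choice made for the sketch; a real pipeline might instead merge names and cross-references from the duplicates.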
The BNICE framework has previously been used to explore alternate biosynthetic and xenodegradation pathways through the iterative application of generalized reaction rules. Unlike some approaches that model only a specific class of chemistry (e.g. cytochrome P450 metabolism) these reaction rules span the breadth of the Enzyme Commission (EC) classification system and have been hand curated by examining reactions at the third level of EC specificity. Figure 2 demonstrates the process of encoding the common reactive site motifs as well as the bonds that are broken or formed. 198 of these generalized chemical reaction rules were applied to all compounds in a given source database, resulting in a MINE database of predicted products and chemical reactions.
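The control flow of such an expansion can be sketched independently of the chemistry. The example below is a toy, not BNICE: compounds are plain text labels rather than structures, the two "rules" are dictionary lookups on compound classes rather than curated reactive-site patterns, and the names `expand`, `oxidize` and `reduce_` are hypothetical. It only shows how applying every rule to every compound yields a set of products and a set of predicted reactions, and how the process could be iterated for more than one generation.

```python
from typing import Callable, Iterable

# A rule takes one compound label and returns zero or more product labels.
Rule = Callable[[str], Iterable[str]]

# Toy class-level transformations standing in for real reaction rules.
OXIDATION = {"primary alcohol": "aldehyde", "aldehyde": "carboxylic acid"}
REDUCTION = {"aldehyde": "primary alcohol"}


def oxidize(compound: str) -> list[str]:
    return [OXIDATION[compound]] if compound in OXIDATION else []


def reduce_(compound: str) -> list[str]:
    return [REDUCTION[compound]] if compound in REDUCTION else []


def expand(seeds: Iterable[str], rules: dict[str, Rule], generations: int = 1):
    """Apply every rule to every compound in the frontier, once per generation.

    Returns the set of all compounds reached and the set of predicted
    reactions as (substrate, rule name, product) triples.
    """
    known = set(seeds)
    frontier = set(known)
    reactions = set()
    for _ in range(generations):
        new_compounds = set()
        for compound in frontier:
            for rule_name, rule in rules.items():
                for product in rule(compound):
                    reactions.add((compound, rule_name, product))
                    if product not in known:
                        new_compounds.add(product)
        known |= new_compounds
        frontier = new_compounds  # only newly found compounds are expanded next
    return known, reactions


RULES = {"oxidize": oxidize, "reduce": reduce_}
compounds, reactions = expand({"primary alcohol"}, RULES, generations=2)
print(sorted(compounds))
print(sorted(reactions))
```

Recording reactions that lead back to already-known compounds, as done here, keeps the predicted reaction network connected even when no new structure is produced.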
BNICE products may take a variety of tautomeric forms depending on the source structure and the nature of the operator applied. Therefore, products were processed with ChemAxon's Standardizer & Structure Checker (JChem 6.0.4, 2013) to ensure canonical valences and placement of charge. Natural Product Likeness scores [42] and estimated logP values were calculated with a standalone Java ARchive (JAR) package and ChemAxon's Calculator Plugins (JChem 6.0.4, 2013) respectively. Estimated Kováts Retention Indices were calculated using the NIST RI algorithm [43].
Compounds were matched to PubChem [44] and KEGG COMPOUND databases with the connectivity block of InChIKeys for annotation. Generated compounds are assigned identifiers based on a hash of the canonical SMILES [45] for internal use and a numeric MINE ID for human readability. Finally, the exact mass and chemical fingerprints of structures were calculated with Pybel.
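Matching on the connectivity block can be illustrated with a few lines of Python. A Standard InChIKey consists of three hyphen-separated blocks, and the first (14 characters) encodes the molecular skeleton, so comparing only that block ignores stereochemistry and protonation state. The sketch below assumes InChIKeys are already available as strings; the function names and the `example_1` identifier are hypothetical, and the small reference dictionary stands in for a full KEGG or PubChem index.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey, which encodes connectivity only."""
    return inchikey.split("-")[0]


def match_by_connectivity(generated: dict[str, str], reference: dict[str, str]) -> dict[str, list[str]]:
    """Map each generated compound ID to reference IDs sharing its connectivity block."""
    index: dict[str, list[str]] = {}
    for ref_id, key in reference.items():
        index.setdefault(connectivity_block(key), []).append(ref_id)
    return {
        compound_id: index.get(connectivity_block(key), [])
        for compound_id, key in generated.items()
    }


# Illustrative data: a stereo-unspecified alanine structure matched against
# L-alanine and ethanol reference entries.
generated = {"example_1": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"}
reference = {
    "C00041": "QNAYBMKLOCPYGJ-REOHZCLBSA-N",  # L-alanine
    "C00469": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
}

print(match_by_connectivity(generated, reference))
```

The trade-off of this looser comparison is that stereoisomers collapse onto the same match, which suits annotation by accurate mass, where stereoisomers cannot be distinguished anyway.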
Compound and reaction data are stored as collections in a MongoDB database (v2.6.2). A compound entry contains the chemical formula, exact mass, InChIKey, canonical SMILES [45], FP2 and FP4 fingerprints, and lists of reactions in which the compound is predicted to participate as a reactant or product. A compound may also be annotated with additional information such as common names or database links if it matches a KEGG or PubChem entry. Reactions are uniquely identified by an 'R' followed by the SHA1 hash of the sorted chemical reaction. Reaction entries contain arrays of reactants and products as tuples of the stoichiometric coefficient and the compound ID, as well as a list of the operators that predicted the reaction.
Table 1 summarizes a few key statistics to compare MINEs to other commonly used databases. The most conservative metabolite-prediction database is IIMDB [15], which utilizes a combination of absolute and relative reasoning rules [46] based on human xenometabolism to constrain the size of the database. Two other methods using computationally predicted metabolites, MyCompoundID [25] and Ridder et al.'s green tea metabolites [27], begin with much smaller metabolite starting sets than KEGG COMPOUND but utilize broader reaction rules and permit more sequential transformations. MINE operators specify reactant substructures but involve no relative likelihood calculations and therefore generate more compounds than IIMDB, but fewer than MyCompoundID. The relative increase between the starting metabolite set and the resulting MINE is dependent on the specific compounds present in the starting database. For example, YMDB contains more high-molecular-weight compounds than EcoCyc and thus contains more reaction sites and generates more derivatives. Like the IIMDB, the majority of compounds in MINE databases are not found in PubChem (when searching with the InChIKey connectivity block), which indicates MINEs are largely composed of novel structures.
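The hash-based identifiers described above for compounds and reactions can be sketched with Python's standard `hashlib` module. The source states only that compound identifiers are based on a hash of the canonical SMILES and that reaction identifiers are an 'R' followed by the SHA1 hash of the sorted reaction; everything else here is an assumption made for illustration, including the 'C' prefix, the use of SHA1 for compounds, the text serialization of a reaction, the field names of the entry, and the operator label.

```python
import hashlib


def compound_id(canonical_smiles: str) -> str:
    """Hypothetical compound ID: 'C' plus the SHA1 hex digest of the SMILES."""
    return "C" + hashlib.sha1(canonical_smiles.encode("utf-8")).hexdigest()


def reaction_id(reactants, products) -> str:
    """'R' plus the SHA1 hex digest of a reaction string with sorted sides.

    Each side is a list of (stoichiometric coefficient, compound ID) tuples.
    Sorting by compound ID makes the ID independent of input order.
    """
    def side(parts):
        ordered = sorted(parts, key=lambda part: part[1])
        return " + ".join(f"({coeff}) {cid}" for coeff, cid in ordered)

    text = f"{side(reactants)} => {side(products)}"
    return "R" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def reaction_entry(reactants, products, operators) -> dict:
    """Build a dictionary shaped like a reaction entry (field names assumed)."""
    return {
        "_id": reaction_id(reactants, products),
        "Reactants": [list(part) for part in reactants],
        "Products": [list(part) for part in products],
        "Operators": sorted(operators),
    }


# Illustrative use with SMILES for ethanol and acetaldehyde; cofactors omitted.
ethanol = compound_id("CCO")
acetaldehyde = compound_id("CC=O")
entry = reaction_entry([(1, ethanol)], [(1, acetaldehyde)], ["hypothetical_operator"])
print(entry["_id"])
```

Because the identifier is derived from the content of the reaction, the same reaction predicted by two different operators maps to one entry, and the operator list can simply be extended.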
An analysis of the overlap in compounds represented in IIMDB was not performed due to licensing restrictions.
Figure 3 displays the Natural Product (NP) Likeness scores [42] for 500,000 randomly sampled PubChem compounds, and the entirety of the KEGG COMPOUND and KEGG MINE databases. NP Likeness is calculated by scoring characteristic atomic signatures that are present in the query molecule. Scores range from −3 to 3, with higher scores indicating a compound that contains more natural than synthetic structural features. Despite being a common source of candidate structures for annotating metabolomics data, the PubChem sample is clearly skewed towards synthetic compounds. In contrast, KEGG consists primarily of natural product-like compounds, and the average KEGG MINE compound is even more so. This shift is due to the action of reaction rules in BNICE that mimic detoxification metabolism acting on the least natural compounds in KEGG and additional reactivity of operators with high NP likeness (see Additional file 1). This bias toward NP-like compounds makes the MINE a preferable source of candidate structures for unknown pathway intermediates and peaks in untargeted experiments.

Web interface description
The web interface for the MINE databases has been designed for a range of user needs such as (a) investigation of potential enzymatic transformations, (b) annotation of accurate masses and (c) chemical structure search. Users may access a compound of interest with a variety of identifiers such as InChIKeys, database IDs or common names, or with structure-based tools like substructure and structural similarity searching. Compound pages display a set of name, pathway and enzyme annotations inferred from KEGG as well as the in silico predicted reactions that a compound may take part in as a reactant or product. Additionally, we provide a web interface for the annotation of accurate mass LC–MS data as shown in Fig. 4. This utility gives users a way to search for potential matches for a large number of mass-to-charge ratios, along with a color-coded interface that enables them to rapidly focus on the most probable putative identifications.

Use case: annotation of accurate mass datasets
As a demonstration of the potential of MINEs for annotation of accurate mass data, a diverse test set of 667 unique compounds was compiled from MassBank [47] (Table 2). Using KEGG as the source database, structures were suggested for 84.5% of the m/z values. The KEGG MINE database annotated an additional 14% of compounds while maintaining a similar accuracy to the KEGG annotations. PubChem annotates a comparable number of these known compounds to the KEGG MINE but does so at the expense of returning a bin of candidates that is two orders of magnitude larger than that of the MINE. While the MINE database has a higher median number of structures per peak than the KEGG database, the number remains feasible to examine manually. The web interface facilitates this process by distinguishing compounds that are present in user-specified KEGG genome reconstructions from those generated by computational means, hence allowing users to consider the most probable isomers first. Additionally, users may restrict structures to a range of partition coefficients or Kováts retention index values. Candidate structures can then be downloaded as a Microsoft Excel compatible CSV file for further review.
Finally, to demonstrate the practical utility of MINE databases, we utilized the EcoCyc MINE to annotate [...] Of these 132 features, 79 matched at least one of the metabolites proposed in the MINEs by the BNICE method. We selected one of these features, which also exhibited statistically significant variation in peak height across our experimental samples, for further study. The EcoCyc MINE database returned one potential hit for this metabolite, a phosphoethanolamine (PE) lipid that we were not able to identify with our traditional workflow. LipidBlast [11] was used to confirm that the MS-MS fragmentation pattern, presented in Fig. 5, is consistent with PE (32:1), more specifically, PE (16:0/16:1), which is also present as a predicted but unidentified lipid in the LipidHome database [26].
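The accurate-mass lookup that underlies this kind of annotation can be sketched in a few lines. The example below is not the MINE web service or its API: it assumes an in-memory list of compounds with precomputed monoisotopic masses, supports only two singly charged adducts, and applies a parts-per-million tolerance relative to the observed m/z. The function names, the 5 ppm default and the three-compound list are illustrative assumptions.

```python
PROTON_MASS = 1.007276  # Da; mass of a proton (hydrogen atom minus an electron)

# Mass shift from neutral molecule to observed ion, for singly charged adducts.
ADDUCT_SHIFTS = {
    "[M+H]+": +PROTON_MASS,
    "[M-H]-": -PROTON_MASS,
}


def neutral_mass(mz: float, adduct: str) -> float:
    """Convert an observed m/z to the neutral monoisotopic mass it implies."""
    return mz - ADDUCT_SHIFTS[adduct]


def annotate(mz: float, adduct: str, compounds: list[dict], ppm: float = 5.0) -> list[dict]:
    """Return compounds whose mass lies within `ppm` of the observed peak.

    The tolerance is computed from the observed m/z, since that is the
    quantity carrying the measurement error.
    """
    target = neutral_mass(mz, adduct)
    tolerance = mz * ppm / 1e6
    return [c for c in compounds if abs(c["mass"] - target) <= tolerance]


# Illustrative candidate list with monoisotopic masses in Da.
COMPOUNDS = [
    {"name": "D-glucose", "formula": "C6H12O6", "mass": 180.063388},
    {"name": "D-fructose", "formula": "C6H12O6", "mass": 180.063388},
    {"name": "L-alanine", "formula": "C3H7NO2", "mass": 89.047678},
]

hits = annotate(181.0707, "[M+H]+", COMPOUNDS)
print([hit["name"] for hit in hits])
```

The two hexoses are both returned because isomers share an exact mass, which is why the additional filters described above (genome membership, partition coefficient, retention index) are useful for ranking candidates within a mass bin.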
Detection and verification of novel metabolites is ongoing but beyond the scope of this article.

Further development
In addition to the existing web tools, the underlying MINE databases are accessible through free, developer-friendly APIs. Clients are available for integration into Python, Perl and JavaScript frameworks at https://github.com/JamesJeffryes/MINE-API. This API allows the databases to be integrated into existing candidate ranking algorithms and pipelines. Future versions of these databases will incorporate transformation rules for spontaneous chemical reactions of metabolites, and improved filtering and prioritization of candidate structures.
In addition to expanding the scope of the metabolome, the MINE framework also offers a pipeline for illuminating the synthesis and degradation of poorly annotated secondary metabolites. While applied very broadly to nearly all of metabolism in this study, BNICE expansions may be focused on a region of interest in the metabolic network by adjusting the starting compounds and permissible transformations in a manner similar to that recently demonstrated by Ridder et al. [27]. These targeted MINEs will integrate the generation of plausible pathways by BNICE with the tools to detect the presence of predicted pathway intermediates with accurate mass spectrometry, thereby accelerating the process of proposing and evaluating hypothetical enzymatic synthesis routes for a number of compounds of interest.

Conclusions
Here we have presented Metabolic In silico Network Expansions (MINEs) that utilize generalized biochemical transformations to propose structures for use in untargeted metabolomics. The resulting compounds are rarely found in PubChem but are structurally similar to natural products. We have demonstrated the utility of these databases for proposing correct metabolite structures that stymied a standard annotation workflow. MINE data are accessible without licensing restrictions for non-commercial users through a user-friendly web interface and API for developers in several common scripting languages.