Determining the parent and associated fragment formulae in mass spectrometry via the parent subformula graph

Background: Identifying the molecular formula and fragmentation reactions of an unknown compound from its mass spectrum is crucial in areas such as natural product chemistry and metabolomics. We propose a method for identifying the correct candidate formula of an unidentified natural product from its mass spectrum. The method scores the plausibility of parent candidate formulae using a parent subformula graph (PSG) and two possible metrics relating to the number of edges in the PSG. The method is applicable to both electron-impact mass spectrometry (EI-MS) and tandem mass spectrometry (MS/MS) data. Additionally, this work introduces the two-dimensional fragmentation plot (2DFP) for visualizing PSGs.

Results: Our results suggest that incorporating information about the edges of the PSG improves the identification of parent formulae compared with the more widely accepted “MS/MS score”, both on the 2016 Critical Assessment of Small Molecule Identification (CASMI 2016) data set (76.3% vs 58.9% correct formula identification) and on the Research Centre for Toxic Compounds in the Environment (RECETOX) data set (66.2% vs 59.4% correct formula identification). When extended to complex EI-MS data of semiochemicals, our method again outperformed the MS/MS score (correct formula appearing in the top 4 candidates in 20/23 vs 7/23 cases) and enabled the rapid identification of both the correct parent ion mass and the correct parent formula with minimal expert intervention.

Conclusion: Our method reliably identifies the correct parent formula even when the mass information is ambiguous. Furthermore, should parent formula identification be successful, the majority of associated fragment formulae can also be correctly identified. Our method can also identify the parent ion and its associated fragments in EI-MS spectra where the identity of the parent ion is unclear due to low quantities and overlapping compounds. Finally, our method does not inherently require empirical fitting of parameters or statistical learning, meaning it is easy to implement and extend.

Scientific contribution: We developed, implemented and tested new metrics for assessing the plausibility of candidate molecular formulae obtained from HR-MS data.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00776-y.


CASMI-2016 Data Set
Figure 2: The rate of obtaining a rank r of 1, ≤ 2 and ≤ 4 for the s_LBJ scoring function as a function of δ_1 and the ratio δ_{2...N_peak}/δ_1. We observe only a weak dependence of the success rate on these parameters, and there is no great advantage in using a different δ value to generate likely candidate formula lists for fragment masses, as opposed to the parent peak.

Results
Although our method can in some cases successfully resolve the two distinct components of a mixed mass spectrum, the performance fell quite quickly as the maximum allowed mass increased (see Figure 3). The same trend occurs, slightly more strongly, when the samples in which one of the compounds is a subformula of the other are removed, given that two random compounds in CASMI are more likely to be subformulae of one another if they are both small.
This result is expected, given that the number of parent candidate formulae increases quite rapidly as a function of the mass of the molecular ion. In order to successfully resolve the two components, our method needs to rank both compounds higher than any alternative parent candidate formula for either compound, and higher than any spurious parent candidate formula that can explain portions of both spectra in the component. Another key limitation of our method is that it cannot resolve the components if the molecular formula of one of the candidates is a subformula of the other; our method will simply treat that candidate as a fragment of the first candidate.
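To give a feel for how quickly the candidate list grows with mass, the following Python sketch counts C/H/N/O formulae whose monoisotopic mass falls within a ppm window of a target mass. It is a brute-force illustration only: the element set, the 5 ppm default, the crude hydrogen bound and the function names are all assumptions made here, and the electron mass and any chemical plausibility filters are ignored. It is not the candidate generator used in the paper.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def mass_of(c, h, n, o):
    """Monoisotopic mass of the neutral formula C_c H_h N_n O_o."""
    return c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]

def count_candidates(target_mass, tol_ppm=5.0):
    """Count CHNO formulae within tol_ppm of target_mass (illustrative only)."""
    tol = target_mass * tol_ppm * 1e-6
    count = 0
    for c in range(1, int(target_mass // MONO["C"]) + 1):
        rest_c = target_mass - c * MONO["C"]
        for n in range(int((rest_c + tol) // MONO["N"]) + 1):
            rest_n = rest_c - n * MONO["N"]
            for o in range(int((rest_n + tol) // MONO["O"]) + 1):
                rest_o = rest_n - o * MONO["O"]
                # The tolerance is far smaller than one hydrogen mass,
                # so only the nearest hydrogen count can match.
                h = round(rest_o / MONO["H"])
                if h < 0 or h > 2 * c + n + 2:  # crude valence bound (assumption)
                    continue
                if abs(rest_o - h * MONO["H"]) <= tol:
                    count += 1
    return count

# Two formulae taken from the ORCHID captions, used here only as target masses.
for label, m in [("C9H14N2", mass_of(9, 14, 2, 0)),
                 ("C13H20N2O2", mass_of(13, 20, 2, 2))]:
    print(label, round(m, 4), count_candidates(m))
```

Widening `tol_ppm` or adding further elements to `MONO` enlarges the search space, which is one way to explore the effect of the element restrictions discussed in this section.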
Nevertheless, our results show that if one places suitable restrictions on the allowed elements, investigates compounds with very low molecular mass, or possesses a very high resolution mass spectrometer, then it is possible to resolve co-eluting compounds using our method, provided they are not subformulae of each other. In fact, we show an example of this in practice for Compound 10 in the ORCHID data set. However, our method is not specifically designed or optimised to resolve co-eluting compounds in mass spectra. Further investigation is required in order to adapt PSG construction and scoring specifically for this purpose.
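The subformula condition can be checked mechanically. The sketch below is a minimal illustration, assuming simple formula strings without brackets or charges; `parse_formula`, `is_subformula` and `resolvable_in_principle` are hypothetical helper names introduced here, not functions from the paper's implementation. A `True` result only means the subformula limitation does not apply; it says nothing about whether the scoring would actually separate the two compounds.

```python
import re
from collections import Counter

def parse_formula(text):
    """Parse a simple formula such as 'C10H16N2' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", text):
        counts[element] += int(number) if number else 1
    return counts

def is_subformula(small, big):
    """True if `small` has no more atoms of any element than `big`."""
    return all(big[element] >= n for element, n in small.items())

def resolvable_in_principle(a, b):
    """False when either formula is a subformula of the other."""
    fa, fb = parse_formula(a), parse_formula(b)
    return not (is_subformula(fa, fb) or is_subformula(fb, fa))

# Compound 12 and the co-eluting hydroquinone: neither contains the other.
print(resolvable_in_principle("C10H16N2", "C6H6O2"))    # True
# C10H16N2 is a subformula of C11H18N2, so it would be treated as a fragment.
print(resolvable_in_principle("C11H18N2", "C10H16N2"))  # False
```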

ORCHID Data Set - Analysis Results
Table 1: The performance of both the product score and the edge score on the semiochemical data set, as measured by the ranking of the correct molecular formula based on the scores. An asterisk is placed next to a rank when the molecular ion does not exist and a large fragment ion was annotated instead (Compound 18).

Columns: ID, Formula, Mass, Rank (s_LBJ, s_ne, s_v)

ORCHID Data Set - Example Usage of 2DFPs
This section details three examples of how representing the parent subformula graph (PSG) as a two-dimensional fragmentation plot (2DFP) may help the experimentalist determine the correct molecular formula of the analyte.
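As a reminder of the kind of object a 2DFP visualises, the following Python sketch builds a toy subformula graph: each annotated formula is a node, and a directed edge runs from one formula to another whenever the second is a strict subformula of the first. The dictionary representation, the function names, the edge rule and the fragment formulae are assumptions made for illustration; the actual PSG construction and the edge-based scores are defined in the main text of the paper.

```python
from itertools import permutations

def is_subformula(small, big):
    """True if every element count in `small` is <= the count in `big`."""
    return all(n <= big.get(element, 0) for element, n in small.items())

def build_psg(parent, fragments):
    """Return (nodes, edges) of a toy parent subformula graph.

    Nodes: the parent plus every fragment that is a strict subformula of it.
    Edges: index pairs (i, j) where node j is a strict subformula of node i.
    """
    nodes = [parent] + [f for f in fragments
                        if f != parent and is_subformula(f, parent)]
    edges = [(i, j)
             for i, j in permutations(range(len(nodes)), 2)
             if nodes[i] != nodes[j] and is_subformula(nodes[j], nodes[i])]
    return nodes, edges

parent = {"C": 9, "H": 14, "N": 2}   # Compound 10
fragments = [
    {"C": 8, "H": 11, "N": 2},       # hypothetical fragment
    {"C": 7, "H": 8, "N": 2},        # hypothetical fragment
    {"C": 6, "H": 6, "O": 2},        # not a subformula of the parent, so dropped
]
nodes, edges = build_psg(parent, fragments)
print(len(nodes), "nodes,", len(edges), "edges")  # 3 nodes, 3 edges
```

Counting edges per candidate parent, as done in the last line, is the raw quantity that an edge-based score would start from.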
Low S/N - Mass spectra of minor GC peaks: Compound 10

Our method can extract information from a mass spectrum that could be used to identify compounds even when the corresponding total ion chromatogram peak is extremely weak and difficult to distinguish from background noise. To illustrate this, we examine the mass spectrum of Compound 10, which has the molecular formula C9H14N2. Due to the presence of only trace amounts of the compound and/or low ionisation efficiency, no chromatographic peak is visible (Figure 5 B). In this situation, the analyst will typically need to trace manually, scan by scan, through the chromatogram in order to monitor the increase or decrease of certain ion masses hypothesised to be the molecular ion or a key fragment ion. This is very time consuming, requires considerable specialist knowledge regarding which masses are more or less likely to be notable, and is subject to confirmation bias. The molecular ion in this case (m/z = 150) is barely visible, while several other background ions are more abundant (i.e. m/z = 164, 169 and 200) (Figure 5 B). However, our method allows the immediate identification of the molecular formula C9H14N2 as the top ranked candidate (Figure 6).
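The m/z values quoted for the molecular ions can be checked against the formulae using integer (nominal) atomic masses. The helper below is an illustrative sketch rather than part of the published method; the function name and its keyword-argument interface are assumptions.

```python
# Nominal (integer) masses of the most abundant isotopes.
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16, "S": 32}

def nominal_mass(**atoms):
    """Nominal mass of a formula given as keyword counts, e.g. C=9, H=14, N=2."""
    return sum(NOMINAL[element] * count for element, count in atoms.items())

print(nominal_mass(C=9, H=14, N=2))   # 150, Compound 10
print(nominal_mass(C=10, H=16, N=2))  # 164, Compound 12
print(nominal_mass(C=6, H=6, O=2))    # 110, hydroquinone
```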

Overlapping chromatographic peaks: Compound 12
We also present as an example the identification of Compound 12, which has the chemical formula C10H16N2. In this case, the compound co-elutes with another compound, proposed to be hydroquinone (C6H6O2, M = 110), whose molecular ion and fragments interfere considerably with the mass spectrum of the target compound (Figure 5 A, C). Nonetheless, the annotation containing the correct molecular formula still obtained r = 3 using s_LBJ, compared to r = 30 using s_v. For the highest scoring parent formula (C6H15NO3; see Table 13 in the Supplementary Information), examination of its 2DFP reveals that although the formula possesses a very low ring and double bond equivalent (RDBE), the annotation contains a number of fragment formulae with spuriously high RDBEs, corresponding to theoretical losses of neutral fragments containing many hydrogen atoms (in the above example, H9O for the first formula). The second highest scoring parent formula (C6H13NO3) possesses a similar feature. In contrast, the 2DFP of the correct parent formula does not exhibit this feature (see Figure 7).

Figure 6: The 2DFP with mass peak M_1 = 150 annotated by formula C9H14N2. Despite the presence of numerous noise peaks, due to the very small peak height in the mass chromatogram, the formula alongside a number of fragments with sensible (possible) neutral losses can be seen, suggesting this to be a likely molecular formula for the compound.
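The RDBE argument made above for Compound 12 can be reproduced with a few lines of arithmetic. The sketch below uses the standard ring and double bond equivalent formula for C/H/N/O compounds, RDBE = C - H/2 + N/2 + 1; that the paper uses exactly this convention is an assumption, as is the function name. The fragment C6H6NO2 is simply the result of subtracting the H9O loss mentioned in the text from C6H15NO3.

```python
def rdbe(c=0, h=0, n=0, o=0):
    """Ring and double bond equivalents; divalent oxygen does not contribute."""
    return c - h / 2 + n / 2 + 1

print(rdbe(c=6, h=15, n=1, o=3))  # 0.0, top-scoring (incorrect) parent C6H15NO3
print(rdbe(c=6, h=6, n=1, o=2))   # 4.5, the same parent after losing H9O
print(rdbe(c=10, h=16, n=2))      # 4.0, correct parent C10H16N2
```

A fully saturated parent (RDBE of 0) giving rise to a fragment with an RDBE of 4.5 is the kind of jump described above as spuriously high.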

Figure 1: The number of test cases containing a given element in the CASMI-2016 data set (left) and the RECETOX data set (right).

Figure 3: Percentage of components resolved in a random sample (with replacement) of 100 pairs of mass spectra from the CASMI-2016 data set, each pair combined into a mixed mass spectrum, over all samples (blue) and over all samples where neither compound is a subformula of the other.

Figure 4: The twenty molecules comprising the ORCHID data set with their associated numberings (IDs). Names for these compounds are given in the supplementary material. Note that there are two spectra each for compounds 1, 14 and 15, referred to as 1(a) and 1(b), 14(a) and 14(b), and 15(a) and 15(b) respectively, for a total of 23 spectra in this data set.

Figure 5: A. Total ion chromatogram (TIC) and extracted ions for the molecular ions (m/z = 150, 164) and base peak (m/z = 122) for compounds 10 and 12 in a floral extract of Drakaea glyptodon. The molecular ion for 10 (m/z = 150) is not visible in the extracted ion chromatogram (inset). B. Mass spectrum for the chromatographic peak of compound 10 (five scans across the peak, with background subtraction). Ions corresponding to the target compound are marked with an asterisk. C. Mass spectrum for the chromatographic peak of compound 12 (five scans across the peak, with background subtraction). Ions corresponding to the target compound are marked with an asterisk. The two main ions from the co-eluting hydroquinone (m/z = 81, 110) are the dominant ions across the selected scans.

Figure 7: 2DFP generated from the PSG derived from the highest scoring (incorrect) parent candidate formula, C6H15NO3 (top), and from the correct parent candidate formula, C10H16N2 (bottom). The very high intensity mass peak at M = 110 corresponds to the molecular ion of the co-eluting hydroquinone.

Table 4: A list of candidate molecular formulae annotated to the mass spectrum of Compound 2. The correct molecular formula is C11H18O3, and its rank is 1 using the product score metric.

Table 5: A list of candidate molecular formulae annotated to the mass spectrum of Compound 3. The correct molecular formula is C11H18N2O, and its rank is 2 using the product score metric.

Table 6: A list of candidate molecular formulae annotated to the mass spectrum of Compound 4. The correct molecular formula is C9H12O3, and its rank is 3 using the product score metric.

Table 7: A list of candidate molecular formulae annotated to the mass spectrum of Compound 5. The correct molecular formula is C12H20N2, and its rank is 1 using the product score metric.

Table 8: A list of candidate molecular formulae annotated to the mass spectrum of Compound 6. The correct molecular formula is C13H20N2O2, and its rank is 1 using the product score metric.

Table 9: A list of candidate molecular formulae annotated to the mass spectrum of Compound 7. The correct molecular formula is C13H20N2O2, and its rank is 9 using the product score metric.

Table 10: A list of candidate molecular formulae annotated to the mass spectrum of Compound 8. The correct molecular formula is C10H20O, and its rank is 1 using the product score metric.

Table 11: A list of candidate molecular formulae annotated to the mass spectrum of Compound 9. The correct molecular formula is C9H10O2, and its rank is 4 using the product score metric.

Table 12: A list of candidate molecular formulae annotated to the mass spectrum of Compound 10. The correct molecular formula is C9H14N2, and its rank is 1 using the product score metric.

Table 13: A list of candidate molecular formulae annotated to the mass spectrum of Compound 11. The correct molecular formula is C8H12N2O, and its rank is 1 using the product score metric.

Table 14: A list of candidate molecular formulae annotated to the mass spectrum of Compound 12. The correct molecular formula is C10H16N2, and its rank is 4 using the product score metric.

Table 15: A list of candidate molecular formulae annotated to the mass spectrum of Compound 13. The correct molecular formula is C10H16N2O, and its rank is 1 using the product score metric.

Table 17: A list of candidate molecular formulae annotated to the mass spectrum of Compound 14(b). The correct molecular formula is C8H8O2S, and its rank is 2 using the product score metric.

Table 18: A list of candidate molecular formulae annotated to the mass spectrum of Compound 15(a). The correct molecular formula is C8H10O2S, and its rank is 17 using the product score metric.

Table 19: A list of candidate molecular formulae annotated to the mass spectrum of Compound 16. The correct molecular formula is C7H8O2S, and its rank is 3 using the product score metric.

Table 20: A list of candidate molecular formulae annotated to the mass spectrum of Compound 15(b). The correct molecular formula is C8H10O2S, and its rank is 1 using the product score metric.

Table 21: A list of candidate molecular formulae annotated to the mass spectrum of Compound 17. The correct molecular formula is C8H10N2O, and its rank is 4 using the product score metric.

Table 22: A list of candidate molecular formulae annotated to the mass spectrum of Compound 18. The correct molecular formula is C11H18N2. Although the molecular ion was not detected in the mass spectrum analysed, the rank of a large fragment ion, C10H16N2, is 1 using the product score metric.

Table 23: A list of candidate molecular formulae annotated to the mass spectrum of Compound 19. The correct molecular formula is C9H14N2O, and its rank is 1 using the product score metric.

Table 24: A list of candidate molecular formulae annotated to the mass spectrum of Compound 20. The correct molecular formula is C6H10O3, and its rank is 16 using the product score metric.