Automated fragment formula annotation for electron ionisation, high resolution mass spectrometry: application to atmospheric measurements of halocarbons

Background Non-target screening consists in searching a sample for all present substances, suspected or unknown, with very little prior knowledge about the sample. This approach has been introduced more than a decade ago in the field of water analysis, together with dedicated compound identification tools, but is still very scarce for indoor and atmospheric trace gas measurements, despite the clear need for a better understanding of the atmospheric trace gas composition. For a systematic detection of emerging trace gases in the atmosphere, a new and powerful analytical method is gas chromatography (GC) of preconcentrated samples, followed by electron ionisation, high resolution mass spectrometry (EI-HRMS). In this work, we present data analysis tools to enable automated fragment formula annotation for unknown compounds measured by GC-EI-HRMS. Results Based on co-eluting mass/charge fragments, we developed an innovative data analysis method to reliably reconstruct the chemical formulae of the fragments, using efficient combinatorics and graph theory. The method does not require the presence of the molecular ion, which is absent in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim$$\end{document}∼40% of EI spectra. Our method has been trained and validated on >50 halocarbons and hydrocarbons, with 3–20 atoms and molar masses of 30–330 g mol\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{-1}$$\end{document}-1, measured with a mass resolution of approx. 3500. For >90% of the compounds, more than 90% of the annotated fragment formulae are correct. Cases of wrong identification can be attributed to the scarcity of detected fragments per compound or the lack of isotopic constraint (no minor isotopocule detected). Conclusions Our method enables to reconstruct most probable chemical formulae independently from spectral databases. Therefore, it demonstrates the suitability of EI-HRMS data for non-target analysis and paves the way for the identification of substances for which no EI mass spectrum is registered in databases. We illustrate the performances of our method for atmospheric trace gases and suggest that it may be well suited for many other types of samples. The L-GPL licenced Python code is released under the name ALPINAC for ALgorithmic Process for Identification of Non-targeted Atmospheric Compounds. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-021-00544-w.


SUPPLEMENT
Supplement for: Automated fragment formula annotation for electron ionisation, high resolution mass spectrometry: application to atmospheric measurements of halocarbons Myriam Guillevic 1* , Aurore Guillevic 2 , Martin K. Vollmer 1 , Paul Schlauri 1 , Matthias Hill 1 , Lukas Emmenegger 1 and Stefan Reimann 1 * Correspondence: myriam.guillevic@empa.ch 1 Laboratory for Air Pollution /Environmental Technology, Empa, Swiss Federal Laboratories for Materials Science and Technology, Ueberlandstrasse 129, 8600 Dübendorf, Switzerland Full list of author information is available at the end of the article

Training set and validation set: SMILES codes
We provide here the SMILES codes for each of the substances in the training and validation set and if a mass spectrum is documented in the NIST spectral database [1].  and mass. In our cases, we use known masses produced by fragmentation of a mass calibration substance, perfluoroperhydrophenanthrene (PFPHP), of chemical formula C 14 F 24 . Only two types of atoms are present in this molecule, carbon (mass 12.000000) and fluorine (mass 18.99840316). The most abundant peaks can therefore be associated to a unique molecular formula. For example, the peak at integer mass 15 69 can be associated to CF + 3 only, of exact mass 68.99466108. Based on the NIST mass spectrum and our measured mass spectrum for PFPHP, we have chosen a list of 13 masses present with a sufficient abundance to be used for mass calibration. In addition, we use the masses of N 2 , O 2 , Ar, which are the most abundant air components and may slightly leak in our system, as well as Cl, 20 also observed to be always present in our detector. The masses are chosen to be evenly distributed between the minimum and maximum masses, covering a range from m/z = 28.0055994348 (N + 2 ) to m/z = 292.98188618 (C 7 F + 11 ). We have observed that our ToF detector is subject to mass drift of up to 100 ppm during a run. This drift is possibly due to temperature variation of our preceding GC. To correct for this drift, we perform a mass calibration every two minutes, using average data of the preceding and following two minutes. For each set of fourminutes-averaged data, all peaks in the mass domain near the exact masses of the selected list are detected. Note that several mass peaks can be detected where only one exact mass from the calibrant is expected. Each detected peak is fitted using a pseudo-Voigt function, which is a combination of a Gaussian and a Lorentzian function:

)F)(F)F)(F)F)(C(C(F)(F)F)(F)F)(F)F yes
and the full-with-at-half-maximum (FWHM) The parameter FWHM is used later on (Main article, Section 2.5.1) to generate candidate mass peaks with the appropriate peak broadness.

25
Then, all obtained centres of ToF are associated to the closed expected exact mass. The entire set of pairs (ToF; expected exact mass) is used to fit a calibration function of the form: with i ToF the time of flight index and m the exact mass. The parameters p 1 , p 2 and p 3 are optimised using the Python lmfit package. According to theory, p 3 should be equal to 0.5 [1] . However a better accuracy is obtained when p 3 is allowed to slightly vary. If one or several pairs are further away from the fit than a set maximum value (20 ppm in our case), the furthest away pair is eliminated and the optimisation 30 routine is repeated. This one-by-one pair elimination improves the robustness of the algorithm. This was proven necessary as when measuring real air from the industrial area where Empa is located, once in while a prominent pollution event occurs, producing outstanding mass peaks that may be in the vicinity of the expected peak, even masking it, therefore disturbing the mass calibration function.

35
Once all residuals are below the set value, the mass calibration is complete. Any ToF value can then be converted to a m/z value using: where p 1 , p 2 and p 3 are calculated at any given specific time as linear interpolation using their time-bracketing optimised values. After the optimisation, the obtained fit parameters are used to calculate the reconstructed m/z values; these values are then compared to the expected exact m/z, 40 for each time slice of four minutes. The obtained offsets, expressed in ppm over the mass domain, are displayed in Fig. 1. With our instrument, the observed mass accuracy is better for larger masses, with residuals usually below 5 ppm for masses higher than 100 m/z, while the accuracy can deteriorate to 20 ppm below 50 m/z. This is potentially due to the mass resolution of our instrument that is around 3000 45 for masses smaller than 50 m/z but around 4000 for larger masses.

Uncertainty of the mass calibration
To reflect this varying mass accuracy over the mass domain, for each used exact mass, we use as uncertainty the maximum observed offset at this mass. Then for any measured mass, its uncertainty is calculated as a linear interpolation between uncertainty at bracketing masses. This constitutes the mass calibration uncertainty.

Validation set: preparation of qualitative standards for compounds newly regulated by the Kigali amendment to the Montreal Protocol
Eighteen substances listed under the Kigali Amendment to the Montreal Protocol were part of the validation set.
The pure substances were bought from Synquest Laboratories (Florida, USA). For each group, the pure substances were spiked one after the other into synthetic air, and the mixture was pressurised into a flask. The two obtained mixtures were prepared at approximately 6.5 nmol·mol −1 .
Then, each mixture was measured by our preconcentration, gas chromatography, time-of-flight mass spectrometry instrumentation. Data analysis then followed the same procedure as explained in the main article.

Algorithmic improvements
We describe our algorithmic improvements to speed-up the running-time of some 70 critical steps.

Organisation of the chemical formulae in a directed acyclic graph (DAG)
After running the knapsack algorithm, many candidate chemical formulae are obtained, hereafter simply referred to as 'formulae'. We organise them in a graph. In this graph, a node n j is a descendant of a node n i if the node's fragment s j is a 75 sub-fragment of the fragment s i of the node n i . For example, CCl is a descendant of CCl 3 . Conversely, a node n i is an ancestor of a node n j if its fragment s i is a sup-fragment of the fragment s j . This define a partial order on the formulae that we formally define in Definition 1.
Definition 1 (Partial order) We define the following partial order on the formulae.

80
Let s i and s j be two formulae encoded as vectors of non-negative integers.
• The formula s j is smaller than the formula • otherwise s i and s j are incomparable.
Organising a set S of n items (here fragments) in a DAG (directed acyclic graph) can takes as many as n 2 comparisons of items, but since S can be quite large (e.g., n = 10000), it makes sense to reduce the number of comparisons. Moreover, the graph should have as less edges as possible, that is, two formulae s i ≥ s j are binded 90 with an edge if and only if there is no other formula s i ′ that could be inserted between them like s i ≥ s i ′ ≥ s j . For example with CCl, CCl 2 and CCl 3 , we will define the graph with minimal edges on the left, not the one of the right (Fig. 2). We now explain how we reduced the number of comparisons between fragments to set the edges, thus improving the complexity of building the graph. First the target 95 masses are sorted in decreasing order before the knapsack step. Hence the output of the knapsack is made of batches of chemical formulae, each one for a given target mass, in decreasing order. Because the mass uncertainty is far thinner than 0.5m/z, all chemical formulae for one target mass are incomparable: it is not possible to find that a formula is a sub-fragment of another, for the same target mass, otherwise the 100 mass difference between the two would be at least 1m/z. Therefore we have a list of formulae {s i } 1≤i≤#S such that for any formula s i , the formulae in the preceeding batches weight more and are either incomparable or contain s i as a sub-fragment, and the formulae in the forthcoming batches are lighter and either incomparable or sub-fragments of s i . The maximal fragments will 105 be the nodes at the "roots" of the DAG. They are made of the formulae of the first (heaviest) batch, and some other formulae from other batches.
We maintain a list of the root nodes (maximal fragments) of the graph. They have no ancestor, and they are incomparable to each other. To add a new formula in the graph, because of the ordering of the formulae, we know that it is either 110 incomparable, or a subfragment of any node of the graph. It cannot be a supfragment (a parent) of any node of the graph. We compare the new formula to each of the maximal fragments. If it is incomparable to any of them, we add it as as new maximal fragment. Otherwise, for each maximal fragment that has the new formula as sub-fragment, we compare the new one to its children. If it is incomparable to any 115 of the children, we add it as a new child of the maximal fragment (we add an edge toward it). Otherwise, we recursively explore the children of the children that have this new formula as sub-fragment. In this way, we avoid many useless comparisons: all children of incomparable nodes are omitted.
Thanks to the list of maximal fragments, the singletons are identified right away: 120 they are the maximal fragments without any child. Figure 2n the main article shows the graph obtained for CCl 4 .

Removing a node and updating the edges
A node n to be removed has parents (closest sup-fragments) connected with one edge, and children (closest sub-fragments) connected with one edge. We wrote this 125 procedure to remove a node and update the edges and list of maximal fragments (Alg. 1, example in Fig. 3).
Algorithm 1: remove the node n and update the edges for each parent p i of n do remove the edge from p i to n ; for each child c i of n do lists its own parents q j (without n) ; if none of q j is a sub-fragment of p i then add an edge from p i to c i (if there is one q j which is a sub-fragment of p i , do nothing: the edges are already fine) for each child c i of n do remove the edge from n to c i ; if c i has no more parent after removing n then add it in the list of maximal fragments if n is a maximal fragment (it has no parent) then remove it from the list of maximal fragments Remove n n p 1 yes n ′ no n ′′ Figure 3 Example for Algorithm 1: removing a node and updating the edges.

Enumeration of isotopocules
We now recall the computation of the relative intensities of the minor isotopocules (see [4] where e ranges over the elements, i ranges over the isotopes of an element, n e is the total number of an element (with all isotopes), n e,i is the number of occurences of one isotope. The relative intensity of a minor-isotope formula is the ratio (see [4]) One needs to enumerate all the possible combinations of isotopes of a given element. This is a classical problem in combinatorics. Denote by i the number of isotopes 130 (the abundant one included). Enumerate all the ways to sum at most i positive integers to obtain n e . It can be seen as a knapsack-like problem: given all isotopes each of "weight" 1, finds how to sum to n e . In particular, each isotope is allowed at most n e times. All ratios are relative to the abundant chemical formula, whose maximum possible 135 intensity is known: this is the intensity of the corresponding measured mass. To improve the running-time of the enumeration, we do not list the minor isotopocules whose intensity would be below the detection threshold of the ToF-MS. For this purpose, we consider the elements one after the other. We maintain a list of partial isotopocules with their partial relative intensity, made of the elements processed 140 so far. The list is ordered by decreasing relative intensity. Given a new element e and its occurence n e , we generate its isotopic patterns, the abundant one included, and sort them in decreasing order of relative intensity. Reading the two lists in decreasing order of relative intensity, we combine the new isotopes to the partial solutions and multiply together the intensities. A loop over a list stops as soon as 145 the product of intensities is below the partial threshold. Once the new list of partial solutions is computed, it is sorted in decreasing order of relative intensity. Then, the next element is processed in the same manner, until all elements are done. The first item of the resulting list is the isotopocule of highest relative intensity (it can be greater than one). For implementation purpose, we scale the list and divide all 150 numbers by this highest relative intensity, so that everything is in the interval [0, 1].

Full Numerical Example with carbon tetrachloride
For CCl 4 found at a retention time of 1708.27 s, nineteen masses are observed, listed in Table 7 with uncertainty and intensity. solutions with a negative DBE are discared. This first approach has a major drawback: many impossible sub-fragments are enumerated, im particular because of the Hydrogen which as a very small mass, and is mono-valent.
Here is a detailed example to compute the lists A and B for the first target mass m max = 34.97006625322406. First one computes multiples of each mass up 175 to m max . One obtains Tables 3 and 4. Then one combines at most one mass per column, starting the enumeration with the lightest mass of each column (the first row value). One obtains the fragments of the first row of Table 5.      To account for chemical formulae made of mono-valent atoms only, a final knapsack algorithm is run to find the chemical formulae with only one or two mono-valent 205 atoms whose mass fits in the uncertainty range (this gives the solution Cl for the mass 34.96878848).

Results for the validation set
Results for the validation set are given in Fig. 4 and Fig. 5.

Figure 4
Distribution of likelihood values of fragments (left) and maximal fragments (right) assigned to a measured mass, for the validation set (23 compounds). A likelihood value of 100 indicates that the chemical formula of the fragment or maximal fragment is highly likely. Blue: distribution for correctly identified fragments/maximal fragments assigned to a measured mass. Red: distribution for wrongly identified fragments/maximal fragments.

Figure 5
Distribution of ranking values for fragments and maximal fragments assigned to a measured mass, for the validation set (23 compounds). A ranking value of 1 means that the fragment/maximal fragments was ranked as most likely (maximum likelihood value within the set of fragments/maximal fragments). Blue: distribution of ranking for correctly identified fragments/maximal fragments assigned to a measured mass. Red: distribution of ranking for wrongly identified fragments/maximal fragments. in the EI mass spectrum. This module uses a pre-defined list; • ElCoCo: module to guess the most likely weight(s) of the molecular ion, with unit mass resolution. It can (or not) uses inputs form MSclass. Using the guessed molecular weight(s), all matching chemical formulae are generated, using a list of chemical elements that can be user-defined.

220
• MOLGEN: using all generated chemical formulae, generates all possible chemical strucutures. These structures are then fragmented and the obtained fragments are compared to the measured ones. All candidate structures are ranked, the most likely first. We used a 90-days free version kindly provided by Markus Meringer, version 225 1.0.1.5. We first converted all our data into a compatible format (.TRA). All masses were converted to unit mass resolution. We used the default settings in MOLGEN-MS and keeping user intervention at the minimum. We used the same chemical elements as in our software: H, C, N, O, F, S, Cl, Br, I. We do not use Si and P that occur very seldom in our type of samples. We tested MOLGEN-MS using two 230 different settings: with and without the MSclass module. When using the MSclass module, the output was then directly used in the ElCoCo module, without any user manipulation of the results. For the ElCoCo module, we used the default parameter for the confidence interval for guessing the molecular weight (98). We did not use a minimum match value to eliminate candidate formula with a match value lower 235 than a threshold. We then checked if the correct molecular weight was listed within the solutions.
We do not include the following substances in the discussion below because their true molecular wieght is outside our detection mass range: C 6 F 14 in the training set and TCHFB in the validation set.

240
In MOLGEN-MS, when using MSclass followed by ElCoCo, on the training set, only three substances have the correct molecular formula listed (COS, benzene and dichlorodifluoromethane). We actually obtain better results when not using the submodule MS-class. When using ElCoCo alone, with the default precision value for the molecular weight (98), and no minimum precision for the molecular formula, 245 the correct molecular formula is listed in 15 cases of the training set (out of 35, this is 43%), with rankings from 1 to 45. If a user manually puts in the correct molecular weight, then the correct molecular formula is listed in 28 cases (out of 35, this is 80%). Two cases with no correct solution require the Sulphur atom to have a set valence of 6, which may not be the default setting in MOLGEN-MS. In 6 cases, the 250 correct molecular formula was listed only if the correct number of each atom was given as input. We do not have an explanation for this.
MOLGEN-MS provides better results on the validation set: for 17 compounds out of 22 (77%), the correct molecular formula is listed (without using MSclass), with ranking from 1 to 46. On average, the correct molecular formula is listed using MOLGEN-MS-ElCoCo in 56% of the cases, in the worse conditions where no user intervention is provided at all.
We provide all input data file in a MOLGEN-MS compatible format as zip file. We believe that to identify the structure of an unknown compound, a promising setup may be to first run our ALPINAC software on the high resolution mass spec-260 trum, and use the suggested molecular ion(s) together with the unit mass resolution spectrum as input to MOLGEN-MS, where the MOLGEN module would be used.