Based on a definition of LC-HRMS signal series which can be caused by homologous compounds, series detection progresses in two stages. A first stage extracts the set S
3 of feasible 3-tupels (triplets) of peaks, while a second stage recombines them to larger tuples of n > 3 in a stepwise manner.
Series definition
A series k of length n ≥ n
min
is defined as the tuple S
n,k
= (p
1,k,
…, p
n,k
) of picked LC-HRMS signal peaks p = {m/z, RT, intensity}, ordered by increasing m/z of the series peaks. S
n
denotes the set of all such series tuples having length n. Peaks being adjacent in a tuple are assumed to only differ in a repetitive and possibly unknown chemical unit or functional group, e.g., CH2 or OH. As a result, changes in the mass differences Δm/z between any two adjacent series peaks p
j,k
and p
j+1,k
must remain within an error margin of [−4ε; 4ε]. ε here denotes a maximum ± m/z measurement error and may depend on m/z or peak intensity [36]. The Δm/z of all series in a LC-HRMS data set range within lower and upper bounds Δm/z
min
and Δm/z
max
, a priori set as the considered mass range of chemical units at given charges z.
Furthermore, Δm/z restrains feasible differences in the mass defect of adjacent series peaks, denoted Δm. The mass defect here refers to the deviation between an ion`s exact m/z value and its nearest integer [37]. For any monoisotopic chemical unit that could constitute a mass difference Δm/z, bounds γ
min
and γ
max
for minimum and maximum differences in Δm between a series peak p
j,k
and another peak p
j+1,k
can be determined by the mass defects of the isotopes of lowest mass for each of the elements contained in a unit. For example, and albeit lacking knowledge of the exact composition of a chemical unit but assuming only C, H, N, O, S, Cl and Br to be present, we can expect the value of Δm between any two series peaks differing by Δm/z to lie within [−0.0010 Δm/z; 0.0078 Δm/z]. The first factor γ
min
is determined by the ratio of mass defect to atomic mass of 79Br, the second factor γ
max
by the ratio for 1H. Factors for all of the other elements range in between these bounds. γ must be calculated over all chemical elements if no assumptions on the involved elements can be made. A mathematical definition of γ
min
and γ
max
is given in Additional file 1. Furthermore, one must account for the rounding involved in the calculation of mass defects: any Δm along a series leading to mass defect values above 0.5 consequently wrap them to Δm − 1, whereas values below −0.5 convert to Δm + 1. Thus, differences by Δm must be adapted accordingly.
Similar to the bounds for Δm/z and Δm, deviations in retention time RT between adjacent series peaks must also be restricted in order to reflect reasonable chromatographic characteristics caused by repeated introduction of chemical units [29, 30]. On the one hand, ΔRT
min
and ΔRT
max
hence define minimum and maximum bounds for differences in RT from one peak p
j,k
to its following tuple peak p
j+1,k
, respectively. On the other hand, changes in ΔRT across a series can be expected to be systematic [29, 30]. First, such changes in ΔRT from one pair of adjacent peaks in a tuple (p
j,k
, p
j+1,k
) to the next (p
j+1,k
, p
j+2,k
) must be smaller than a predefined value, denoted as ΔΔRT. Second, cubic smoothing splines are fitted to model RT as a function of m/z in each series tuple [38]. Briefly, the model fit of each series as determined by the coefficient of determination (R
2) has to be above a certain threshold, using a preset smoothing parameter λ ≥ 0.
Triplet detection
Constrained by the above outlined bounds for Δm/z, Δm and ΔRT, a first series detection stage uses k-dimensional (k-d) trees [39] as a metric data structure to enable a computationally fast extraction of peak 3-tuples which might be feasible sub-tuples of larger series tuples. In a k-d tree, each signal peak x from a LC-HRMS data set is represented by a vector \(a_{x} \in {\mathbb{R}}^{4}\)
$$a_{x} = \left( {m/z_{x} , \, \Delta m_{x} - \, \gamma_{min} m/z_{x} , \, \Delta m_{x} - \, \gamma_{max} m/z_{x} , \, RT_{x} } \right)$$
(1)
A geometrical depiction of the elements in a
x
is given by the blue lines in Fig. 1. The second and third elements \(a_{{x_{2} }}\) and \(a_{{x_{3} }}\) transform the minimum and maximum change in mass defect with changing peak mass to a metric scale that can be represented in a k-d tree. In the latter, each tree node (alias peak) splits the space and the therein contained peaks into two partitions, using the peak with the median value for one of the elements in a
x
. The resulting two partitions are in turn split using the next element in a
x
, each by another median peak contained in those partitions (cp. numbered splitting planes in the TOC for an arbitrary example in \({\mathbb{R}}^{3}\)). Starting with the first entry of a
x
for the root node and recursively cycling over entries of a
x
until partitions with a single peak (i.e., terminal nodes) are reached, a k-d tree supports fast range queries. These queries are conducted to extract all peaks from two subspaces of types L and H with lower and higher m/z repeatedly centered at each LC-HRMS peak. The extracted peaks are then successively recombined to screen for all unique peak combinations that can form 3-tuples in accordance with the above series definition and that include the current center peak as the second element of a triplet (cp. green and blue dots in Fig. 1). The peaks queried with these subspaces usually represent a minor fraction of all measured LC-HRMS peaks and the recombination to feasible triplets thus greatly improves over a check using all peaks. The mentioned subspace types L and H result from combining the intervals
$$I_{1} = [a_{{x_{1} }} + \Delta m/z_{min} ; a_{{x_{1} }} + \Delta m/z_{max} ]$$
(2)
$$I_{2} = [a_{{x_{2} }} - 2\varepsilon ;\infty ]$$
(3)
$$I_{3} = [ - \infty ;a_{{x_{3} }} + 2\varepsilon ]$$
(4)
$$I_{4} = \left[ {a_{{x_{4} }} + RT_{min} ;a_{{x_{4} }} + RT_{max} } \right]$$
(5)
$$I_{5} = [a_{{x_{1} }} - \Delta m/z_{max} ;a_{{x_{1} }} - \Delta m/z_{min} ]$$
(6)
$$I_{6} = [ - \infty ;a_{{x_{2} }} + 2\varepsilon ]$$
(7)
$$I_{7} = \left[ {a_{{x_{3} }} - 2\varepsilon ;\infty } \right]$$
(8)
$$I_{8} = [a_{{x_{4} }} - RT_{max} ;a_{{x_{4} }} - RT_{min} ]$$
(9)
via their Cartesian products to the queried subspaces
$$H = I_{1} \times I_{2} \times I_{3} \times I_{4}$$
(10)
$$L = I_{5} \times I_{6} \times I_{7} \times I_{8}$$
(11)
Intervals I
1–I
4 define bounds in each of the four dimensions of a
x
for a subspace H succeeding the queried center peak. In contrast, intervals I
5–I
8 define a subspace L preceding the queried peak, as indicated by black lines and polygons in Fig. 1. Further details on how to account for the mentioned rounding issue of Δm and to accelerate the computational retrieval of subspaces are provided in Additional file 2.
Tuple recombination
The second stage successively combines the extracted 3-tuples to larger tuples. To this end, all pairwise combinations of tuples x and y from a set S
n
which only differ in their first and last peak members, i.e.,
$$\left( {p_{1,x} , \ldots , \, p_{n - 1,x} } \right) = \left( {p_{2,y} , \ldots , \, p_{n,y} } \right)$$
(12)
or
$$\left( {p_{2,x} , \ldots , \, p_{n,x} } \right) = \left( {p_{1,y} , \ldots , \, p_{n - 1,y} } \right)$$
(13)
and conform to the above series definitions concerning ε, changes in ΔRT and λ are combined to a new (n + 1)-tuple in S
n+1. After having formed all combinations from S
n
, the resulting tuples in S
n+1 are in turn recombined to larger tuples in the next set S
n+2. This is repeated until an empty set is reached, wherein each n-tuple is free to combine to several different (n + 1)-tuples. Therefore, a peak cannot be included more than once in one tuple, but several times in several different tuples. In every recursion, n-tuples which can be combined to at least one new (n + 1)-tuple in S
n+1 are removed from S
n
; they otherwise remain in S
n
or are discarded when ranging below a minimum user-defined length n
min
. Moreover, redundant sub-tuples which form by a regular omission of peaks in larger n-tuples of sets S
n≥5 need to be filtered at each recursion.
Series pairing
As mentioned, a peak may be a member of more than one series, as its containing 3-tuples might have been incorporated into several different larger tuples instead of a single one. To elucidate the underlying reasons for such ambiguities, all unique series pairs that intersect in at least one peak of a LC-HRMS sample were extracted and their properties characterized twofold.
On the one hand, the intersection angle θ was used to approximate in how far two series x and y of such a pair were superjacent in the plane of RT and m/z. θ is defined as
$$\cos \theta = \frac{{u_{x} \cdot u_{y} }}{{\parallel u_{x}\parallel \parallel u_{y}\parallel }}$$
(14)
In this equation, numerator and denominator state the dot product and the product of the Euclidean norm of vectors with scaled mean values
$$u_{x} = \left( {\frac{{\overline{{\Delta RT_{x} }} }}{{c_{\Delta RT} }},\frac{{\overline{{\Delta m/z_{x} }} }}{{c_{\Delta m/z} }}} \right)$$
(15)
of each series, respectively. Here, \(c_{\Delta RT}\) and \(c_{\Delta m/z}\) are the range of \(\overline{\Delta RT}\) and \(\overline{\Delta m/z}\) over all series. The smaller the value of θ, the more do two paired series overlie with each other in the RT vs. m/z plane. At θ = 0π, they are fully superjacent.
On the other hand, a self-organizing map (SOM) was used to visualize and cluster common properties among the paired series to explain differences in θ [40, 41]. Being an unsupervised learning strategy, SOMs allow the mapping of a large set of m multidimensional input vectors v = (v
1
,…,v
j
,…,v
m
) of series pair properties onto a smaller two-dimensional grid of SOM nodes. The SOM can then be selectively displayed for the mapped properties; similar properties are herein mapped to close regions in the SOM while different ones are rather separated. In the given case, each input vector of series pair properties
$$v_{j} = \left( {\frac{{\overline{{\Delta RT_{x} }} }}{{\hat{c}_{\Delta RT} }},\frac{{\overline{{\Delta m/z_{x} }} }}{{\hat{c}_{\Delta m/z} }},\frac{{\overline{{\Delta RT_{y} }} }}{{\hat{c}_{\Delta RT} }},\frac{{\overline{{\Delta m/z_{y} }} }}{{\hat{c}_{\Delta m/z} }}} \right)$$
(16)
contains the mean values of m/z and RT differences present in paired series x and y, arranged by \(\overline{{\Delta m/z_{x} }} \ge \overline{{\Delta m/z_{y} }}\) and \(\hat{c}_{\Delta RT}\) and \(\hat{c}_{\Delta m/z}\) representing the mean expected measurement uncertainties of ΔRT and Δm/z, respectively. Based on these properties, an intersection angle θ can be calculated for each SOM node via Eq. (14) to estimate the superjacency of series mapped onto it. Further information on the training and quality of the SOM is provided in the Additional file 3. The SOM calculations were conducted with the R kohonen package, parameterized as listed in the Additional file 4: Table S1 [42].
Sampling and analysis
Evaluation was carried out on 24 h flow-proportional samples taken from the effluent of ten Swiss sewage treatment plants in February 2010, as used and detailed in Schymanski et al. [16]. In short, a sample volume of 0.25 L was each pH-adjusted, filtered, spiked with 103 isotope-labeled standards and enriched via a mixed-bed solid-phase extraction. After basic/acidic extraction, further enrichment under a nitrogen gas stream, reconstitution with HPLC water to 1 mL and a second filtering step, a final aliquot of 20 μL was analyzed with HPLC-ESI-HRMS. The chromatographic step comprised Waters XBridge C18 columns (Milford, USA) and a water/methanol gradient at a flow rate of 200 μL/min generated by a Rheos 2200 low pressure mixing pump (Flux instruments, Basel, Switzerland). A Q-Exactive (Thermo Fisher Scientific, San Jose, USA) was used for full-scan mass spectrometric analysis at a resolution of 140,000 at m/z = 200, following electrospray ionization in each positive and negative modes (spray voltage +4 and −3 kV, respectively; 350 °C capillary temperature). A blank measurement was run prior to each block of positive and negative sample aliquots, respectively. The data files are openly accessible via the MassIVE repository [43].
Data processing
LC-HRMS full-scan data were centroided and converted to open mzXML format files with ProteoWizard (version 3.0.7162) [44, 45]. All downstream analysis was then run in the R statistical environment [46]. Utilizing the R package enviPick (version 1.2) [47], ion chromatograms were extracted in each file and each extracted chromatogram screened for signal peaks, with parameters listed in the Additional file 5: Table S2. Upon peak-picking, series were detected with the above outlined algorithm, as parameterized in Additional file 6: Table S3. For each peak being part of a series, both a blank subtraction and a deisotoping was run with the enviMass v3.1 [48] and the nontarget v1.9 [49] packages, respectively (see Tables S4 and S5 for parameters in Additional files 7 and 8). In the first case, a peak-centered RT and m/z window was checked for each sample peak to not contain raw blank data points higher than 0.1 times the maximum sample peak intensity to certify its presence in the effluent. A majority rule, i.e., a fraction of ≥0.5 peaks per series, was used for a final assignment of a series to be of blank origin. For deisotoping, a comparison with quantized simulation data enabled a grouping of the isotopologue peaks of an unknown compound, within given measurement uncertainties. The peaks in the individual isotopologue groups of each series peak were then ranked by increasing m/z. A series was assumed to be monoisotopic if the most frequent rank over all peaks in a series equaled 1.