Precursor ion selection for MS/MS analysis
Using full scan mode, 6715 and 4666 features were measured in the NIST samples in positive and negative mode, respectively. After removal of peaks with a fold change smaller than three relative to the corresponding matrix samples and peaks with an RSD larger than 30%, 4711 and 3608 features remained in positive and negative mode, respectively, as potential precursor ions for MS/MS analysis.
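The two filters described above can be illustrated with a minimal Python sketch. This is not the R code used in the study: the function name, the per-feature input layout, and the toy intensities are assumptions made for illustration, and the fold change is read here as mean sample intensity divided by mean matrix (blank) intensity.

```python
from statistics import mean, stdev

def keep_feature(sample_intensities, blank_intensities, min_fold=3.0, max_rsd=30.0):
    """Decide whether one feature stays in the candidate precursor list.

    sample_intensities: replicate intensities in the samples (at least two values)
    blank_intensities:  intensities of the same feature in the matrix samples
    """
    sample_mean = mean(sample_intensities)
    blank_mean = mean(blank_intensities)
    if sample_mean <= 0:
        return False  # nothing measurable in the samples
    # Fold change of sample over matrix; an empty matrix signal counts as infinite.
    fold = float("inf") if blank_mean <= 0 else sample_mean / blank_mean
    # Relative standard deviation (%) across replicate injections.
    rsd = 100.0 * stdev(sample_intensities) / sample_mean
    return fold >= min_fold and rsd <= max_rsd

# Toy features (hypothetical values): (sample replicates, matrix replicates)
features = {
    "stable_and_above_matrix": ([100, 110, 90], [10, 12]),
    "too_variable": ([100, 200, 30], [5, 5]),
    "close_to_matrix": ([50, 52, 48], [40, 45]),
}
kept = [name for name, (s, b) in features.items() if keep_feature(s, b)]
print(kept)  # ['stable_and_above_matrix']
```

The thresholds are exposed as parameters so that the same sketch can be reused with cutoffs other than 3-fold and 30%.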
For PMDDA, the GlobalStd algorithm was used to remove redundant peaks [17]. To select precursors for targeted analysis, each retained independent peak was linked to its paired high frequency PMD ions to form an ion cluster, or pseudo-spectrum. Clusters were merged if their independent peaks could be linked to the same paired ions. In addition, since ions within a cluster should be highly correlated, paired ions with a Pearson correlation coefficient smaller than 0.9 were excluded as unrelated peaks that did not originate from the same compound. For each merged ion cluster, the peak with the highest intensity was selected as the precursor ion for MS/MS analysis. For the SRM samples, in positive mode, 849 independent peaks were selected by the GlobalStd algorithm, from which 780 precursor peaks were selected as annotations for targeted analysis after cluster analysis. In negative mode, 761 independent peaks generated 723 precursor peaks as annotations for targeted analysis. These annotations were also used as the preferred ion list in an iterative DDA strategy for comparison.
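The cluster merging and precursor selection steps can be sketched as follows. This is a simplified illustration in Python, not the implementation in the pmd package: the function name, the input format (a list of independent peak, paired ion, and Pearson r triples), and the peak identifiers are hypothetical, and clusters are merged with a generic union-find structure.

```python
def select_precursors(links, intensity, min_cor=0.9):
    """Pick one precursor per merged ion cluster.

    links:     iterable of (independent_peak, paired_ion, pearson_r)
    intensity: dict mapping every peak id to its MS1 intensity
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for peak, ion, r in links:
        find(peak)  # keep the independent peak even if all its links are dropped
        if r >= min_cor:  # weakly correlated ions are treated as unrelated
            union(peak, ion)

    # Clusters sharing a paired ion end up under the same root.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)

    # The most intense member of each merged cluster becomes the precursor.
    return sorted(max(members, key=lambda p: intensity[p]) for members in clusters.values())

# Hypothetical example: p1 and p2 both link to p4, so their clusters merge.
links = [("p1", "p4", 0.95), ("p1", "p5", 0.50), ("p2", "p4", 0.92), ("p3", "p6", 0.99)]
intensity = {"p1": 1000, "p2": 800, "p3": 300, "p4": 5000, "p5": 200, "p6": 100}
print(select_precursors(links, intensity))  # ['p3', 'p4']
```

In this toy case p5 is dropped because its correlation with p1 is below 0.9, and the merged cluster of p1, p2, and p4 is represented by p4, its most intense member.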
Precursor lists were also generated for CAMERA and RAMClustR using default settings. For CAMERA [20], peak cluster groups following annotation of the feature table were treated as pseudo-spectra, and the proposed molecular mass for each pseudo-spectrum was extracted. Then, the [M + H]+ ion for positive mode and the [M − H]− ion for negative mode were generated as precursor ions for targeted analysis. For the SRM samples using CAMERA, 862 and 710 precursor ions were generated for MS/MS annotation in positive and negative mode, respectively. Similarly, RAMClustR [21] generated the molecular mass of each pseudo-spectrum, and the corresponding molecular ions ([M + H]+ for positive mode and [M − H]− for negative mode) were generated for targeted MS/MS analysis. For the SRM samples using RAMClustR, 542 and 770 precursor ions were generated for targeted analysis in positive and negative mode, respectively.
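Converting a proposed neutral molecular mass into a precursor m/z is a single addition or subtraction of the proton mass. The sketch below assumes singly charged ions and uses a hypothetical helper name; the glucose mass serves only as a worked example and is not taken from the study data.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (hydrogen atom minus one electron)

def precursor_mz(neutral_mass, mode):
    """Return the m/z of [M + H]+ (positive) or [M - H]- (negative) for charge 1."""
    if mode == "positive":
        return neutral_mass + PROTON_MASS
    if mode == "negative":
        return neutral_mass - PROTON_MASS
    raise ValueError("mode must be 'positive' or 'negative'")

# Worked example: glucose, monoisotopic mass 180.06339 Da
print(round(precursor_mz(180.06339, "positive"), 4))  # 181.0707
print(round(precursor_mz(180.06339, "negative"), 4))  # 179.0561
```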
While several thousand features were measured in full scan, pseudo-spectra generation by PMDDA, CAMERA, and RAMClustR resulted in fewer than 1000 unique features for MS/MS precursor ion selection, covering approximately 15% and 20% of the total feature numbers in positive and negative mode, respectively (see Additional file 1: Figs. S1 and S2). Nevertheless, obtaining high quality MS/MS spectra for all of those features in a single injection with high sensitivity is challenging. Therefore, the precursor ions were randomly assigned to multiple injections so that no more than 6 ions were scanned within a retention time shift of 0.2 min of the original retention time from the full scan. Such repeated injections for PMDDA, CAMERA, and RAMClustR were intended to retain high sensitivity and compound coverage, and could be implemented in untargeted studies using pooled QC samples for untargeted MS/MS analysis.
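One way to realize such an assignment is a randomized first-fit scheme, sketched below in Python. The text does not specify the procedure at this level of detail, so the function name, the first-fit strategy, and the reading of the constraint (each ion may have at most 6 ions, itself included, within ±0.2 min in the same injection) are assumptions.

```python
import random

def assign_injections(rts, max_ions=6, rt_window=0.2, seed=0):
    """Split precursors into injections given their retention times (min).

    Returns a list of injections, each a list of indices into rts.
    """
    order = list(range(len(rts)))
    random.Random(seed).shuffle(order)  # random processing order
    injections = []

    def crowd(i, members):
        # Number of ions in members (ion i included) within rt_window of ion i.
        return sum(1 for j in members if abs(rts[j] - rts[i]) <= rt_window)

    def fits(i, members):
        trial = members + [i]
        # Only ion i and its neighbours see their local crowding change.
        neighbours = [j for j in trial if abs(rts[j] - rts[i]) <= rt_window]
        return all(crowd(j, trial) <= max_ions for j in neighbours)

    for i in order:
        for members in injections:
            if fits(i, members):
                members.append(i)
                break
        else:
            injections.append([i])  # no existing injection has room
    return injections

# Hypothetical example: 20 co-eluting precursors need four injections of at most 6.
plan = assign_injections([5.0] * 20)
print([len(inj) for inj in plan])  # [6, 6, 6, 2]
```

First-fit keeps the number of injections small but does not guarantee the minimum; a fixed seed makes the random assignment reproducible.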
Precursor selection comparison with CAMERA and RAMClustR
The chemical coverage of the different methods was compared based on molecular networks (spectra sets from related molecules, not necessarily matching any known compounds) found by GNPS, as well as compound annotation results from MS2 data alone for precursors collected in MS1. Here, only the molecular networking results that matched precursor ions found in the MS1 full scan were kept for comparison, as only those results would be valuable to the analysis and interpretation of the study. We also included iDDA, which uses automatic iterative MS/MS collection from the preferred list of PMDDA precursor ions. Since database-based annotation is biased towards compounds with available spectral data, and GNPS molecular networks may contain multiple spectra from the same compound, we also used the open source software xcms to extract MS2 spectra and compared the number of unique MS1 compounds for which MS2 spectral data were collected by CAMERA, RAMClustR, and PMDDA. Since NIST 1950 samples contain known compounds, we also compared the methods based on those compounds. Additional file 1: Figs. S1 and S2 visualize the MS1 full scan peaks covered by MS2 precursor ions using the different precursor selection methods.
For the molecular networking results from GNPS in positive mode, as shown in Fig. 2, PMDDA found 160 unique molecular networks and iDDA found 98 unique molecular networks, while the two methods shared 116 unique molecular networks. Both CAMERA and RAMClustR found fewer unique molecular networks than PMDDA, 19 and 29, respectively. While RAMClustR and CAMERA shared 39 and 31 networks with both PMDDA and iDDA, only 31 molecular networks were found by all four methods. Interestingly, RAMClustR and CAMERA shared more molecular networks with PMDDA (22 and 14, respectively) than with iDDA (18 and 7, respectively). Results for negative mode were similar. As shown in Fig. 2, PMDDA found 46 unique molecular networks and iDDA found 70 unique molecular networks. PMDDA and iDDA shared 168 unique molecular networks. Both CAMERA and RAMClustR found fewer unique molecular networks than PMDDA, with 16 and 12, respectively. However, only 22 unique molecular networks were found by all four methods. In summary, PMDDA found more molecular networks than RAMClustR and CAMERA.
We also compared the methods by the compound annotation results from GNPS. For positive mode, PMDDA found 73 compounds and iDDA found 77 compounds. Both CAMERA and RAMClustR annotated fewer compounds, 29 and 41, respectively. However, only 16 compounds were annotated by all four methods. PMDDA annotated 6 unique compounds, while another 23 compounds were shared between PMDDA and iDDA. RAMClustR annotated only three unique compounds, and no unique compounds were annotated with CAMERA. For negative mode, as shown in Additional file 1: Fig. S3, PMDDA annotated 36 unique compounds, iDDA found 45 unique compounds, CAMERA found 10 unique compounds, and RAMClustR found 16 unique compounds. PMDDA and iDDA shared 18 compounds while iDDA found 6 unique compounds. Interestingly, PMDDA did not annotate any unique compounds not shared by iDDA. Only 4 compounds overlapped between PMDDA, iDDA, CAMERA, and RAMClustR. Neither CAMERA nor RAMClustR found any unique compounds. In this case, PMDDA outperformed CAMERA and RAMClustR, and it would be helpful to perform iDDA to extend the coverage of compounds annotated by GNPS.
As for the MS2 spectra extracted by xcms, PMDDA could extract 293 spectra for unique MS1 compounds, more than CAMERA (34) or RAMClustR (163) for positive mode. For negative mode, again, PMDDA found 254 spectra matching to unique MS1 data, more than CAMERA (46) and RAMClustR (150).
Known compounds in NIST 1950 were also compared among the different methods. For positive mode, 6, 3 and 5 ions matched in the precursor ion lists of PMDDA, CAMERA and RAMClustR, respectively, while 12, 9 and 4 ions matched in negative mode. This suggests that PMDDA performs as well as or better than the other precursor selection algorithms for selecting biologically relevant compounds for MS/MS annotation.
Overall, PMDDA showed better coverage than either CAMERA or RAMClustR for untargeted MS2 collection and annotation of metabolites measured in the MS1 scan. This may be because CAMERA and RAMClustR use pre-defined paired mass distances for adducts or redundant peaks, which may not accurately represent the specific sample type. PMDDA, on the other hand, employs a data-driven process (the GlobalStd algorithm [17]) to find high frequency paired mass distances within the pseudo-spectra, which may cover more unknown adducts or redundant peaks [17]. As shown in Additional file 1: Figs. S4 and S5, some of the high frequency PMDs belong to known adducts (e.g. 21.98 Da for sodium adducts, 18.01 Da for neutral loss of water) while others might belong to unknown adducts, oligomers or combinations of known adducts. Another difference between PMDDA, CAMERA, and RAMClustR is the software design. The pmd package is designed to remove redundant peaks, while CAMERA and RAMClustR are designed for annotation directly from the feature peak table. As such, the latter algorithms have not been optimized for generating a precursor list for MS/MS, which may explain their lower performance compared to PMDDA.
When we include the results from iDDA with the PMDDA selected precursors as the preferred list, the annotation performance can be further improved. However, PMDDA contains some unique annotations missed by iterative DDA (see Additional file 1: Fig. S3). On the other hand, as shown in Additional file 1: Fig. S6, iDDA can cover compounds with lower MS1 full scan intensity that are missed by the other methods. A combination of PMDDA as the preferred ion list and iDDA data collection should be considered to reach a larger coverage of peaks found in the MS1 full scan when the hardware supports such a data acquisition mode.
Compounds identified in both negative and positive ionization modes
To expand metabolite coverage, the same sample is typically analyzed in both negative and positive electrospray ionization modes for a given chromatographic method, and statistical analysis is performed separately for each assay. However, compounds do not show the same ionization behavior in different modes, and their peaks may be present in only one ionization mode or in both. This causes challenges for statistical analysis methods, such as false discovery rate control, which are highly dependent on the total number of independent compounds [30]. To overcome this, connections between negative and positive mode can be built after MS/MS annotation or identification, which might introduce bias into downstream statistical analysis. A previous study used correlation analysis to screen for the same compounds in both modes [31], which can be influenced by redundant peaks from the same compounds. As an alternative, untargeted features present in both positive and negative mode can be determined using PMD.
Untargeted features present in both positive and negative mode can be linked by a paired mass distance of 2.02 Da, representing the difference between [M + H]+ and [M − H]− in the two modes. For the SRM samples, we found 100 peaks that could be linked by 2.02 Da within a retention time shift of 10 s (see Fig. 3). MS/MS annotation of those 100 peaks using PMDDA identified 35 unique compounds with GNPS, only 2 of which [PE(P-16:0/20:4) and PE(P-16:0/18:2)] had the same annotation in both negative and positive mode, because library spectra were absent in the opposite mode for the others. Since spectral annotation databases might have more expansive coverage of only one ionization mode for certain compounds, linking through PMD could both reduce potentially redundant annotations and facilitate annotation of unknowns. By linking features in positive and negative mode, the total number of independent metabolites is reduced, which informs the choice of downstream statistical analysis. A limitation of the current algorithm is that this linkage only works on data acquired with the same chromatography column and gradient.
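The linkage between modes can be sketched as a simple pairwise search. The Python example below is an illustration under stated assumptions rather than the pmd package implementation: the function name, the (m/z, retention time in seconds) input format, the toy features, and the 0.01 Da mass tolerance are hypothetical; only the 2.02 Da distance and the 10 s retention time shift come from the text.

```python
def link_modes(pos, neg, pmd=2.02, mz_tol=0.01, rt_tol=10.0):
    """Link positive and negative mode features that may be the same compound.

    pos, neg: lists of (mz, rt_seconds) tuples
    Returns a list of (pos_index, neg_index) pairs.
    """
    pairs = []
    for i, (mz_p, rt_p) in enumerate(pos):
        for j, (mz_n, rt_n) in enumerate(neg):
            mass_ok = abs((mz_p - mz_n) - pmd) <= mz_tol  # [M + H]+ minus [M - H]-
            rt_ok = abs(rt_p - rt_n) <= rt_tol            # same chromatographic peak
            if mass_ok and rt_ok:
                pairs.append((i, j))
    return pairs

# Hypothetical features: the first pair co-elutes, the second pair does not.
pos = [(181.0707, 62.0), (300.1000, 120.0)]
neg = [(179.0561, 65.0), (298.0854, 200.0)]
print(link_modes(pos, neg))  # [(0, 0)]
```

The nested loop is quadratic in the number of features, which is acceptable for a few thousand peaks; sorting by m/z first would be a natural refinement for larger tables.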
Reproducible research
We aimed to maximize the reproducibility of this research. Therefore, we used SRM samples that are commercially available and commonly used in metabolomics workflows, and made the raw data accessible online for future research. To provide full transparency on the data analysis, we chose a command line based script within a graphical user interface to make sure every step is recorded and reproducible by other researchers [27]. A Docker image, xcmsrocker, which pre-installs most of the R-based metabolomics and NTA data analysis software, was created based on the Rocker image [32]. This Docker image is available online and can be installed on any personal computer, workstation, or cloud computation platform with RStudio as the IDE [33]. Software used for this workflow, such as IPO, xcms, pmd, CAMERA, and RAMClustR, is pre-installed. The R package rmwf (https://github.com/yufree/rmwf) is also included, with the data processing script of this PMDDA workflow as a template, as well as other workflow templates such as peak picking, annotation, or statistical analysis for different software. xcmsrocker is freely available for download at https://hub.docker.com/r/yufree/xcmsrocker, and the source code is available on GitHub (https://github.com/yufree/xcmsrocker) [34].