Systems biology is the study of complex living organisms, and as such, analysis on a systems-wide scale involves the collection of information-dense data sets that are representative of an entire phenotype. To uncover dynamic biological mechanisms, bioinformatics tools have become essential to facilitating data interpretation in large-scale analyses. Global metabolomics is one such method for performing systems biology, as metabolites represent the downstream functional products of ongoing biological processes. We have developed XCMS Online, a platform that enables online metabolomics data processing and interpretation. A systems biology workflow recently implemented within XCMS Online enables rapid metabolic pathway mapping using raw metabolomics data for investigating dysregulated metabolic processes. In addition, this platform supports integration of multi-omic (such as genomic and proteomic) data to garner further systems-wide mechanistic insight. Here, we provide an in-depth procedure showing how to effectively navigate and use the systems biology workflow within XCMS Online without a priori knowledge of the platform, including uploading liquid chromatography (LC)–mass spectrometry (MS) data from metabolite-extracted biological samples, defining the job parameters to identify features, correcting for retention time deviations, conducting statistical analysis of features between sample classes and performing predictive metabolic pathway analysis. Additional multi-omics data can be uploaded and overlaid with previously identified pathways to enhance systems-wide analysis of the observed dysregulations. We also describe unique visualization tools to assist in elucidation of statistically significant dysregulated metabolic pathways. Parameter input takes 5–10 min, depending on user experience; data processing typically takes 1–3 h, and data analysis takes ∼30 min.
The goal of systems biology is to decipher complex and interdependent biochemical processes to understand how a biological system operates on a mechanistic level and how it reacts to external factors1,2. Toward this goal, genomic, proteomic and metabolomic technologies have evolved to provide an impressive amount of comprehensive information on genes, proteins and metabolites, with metabolomics being the most recent addition to this trilogy. Its late arrival is notable, because metabolomics measures the furthest downstream products of the genes and proteins: metabolites. As metabolites are the most downstream biochemical products, metabolomic data can provide a readout of gene and protein function, thus representing a logical starting point for deciphering their activity.
Advances in high-resolution mass spectrometry (HR-MS) have enabled metabolomics to be used on a global scale, allowing for the detection of low-abundance metabolites in an unbiased manner3,4,5. However, global metabolomics is known to generate large data sets with thousands of metabolic features of great chemical diversity, making identification and analysis of biological relevance challenging and time-consuming6. Development of bioinformatic tools, such as XCMS (an abbreviation for various forms (X) of chromatography mass spectrometry)7, has helped alleviate processing times for metabolic feature detection, retention time alignment and statistical analysis, but further interpretation of the acquired data is necessary to garner biological insight on a systems level.
XCMS Online8, originally a data-processing platform, has recently been expanded to include multi-omic technology in which raw MS metabolomics data are superimposed directly onto pathway maps9. In addition, XCMS can be used to integrate these pathways with proteomics and transcriptomics results. Although mapping of metabolites can give an indication of gene and protein activity, integrating these metabolomics results with proteomics and transcriptomics data provides a more comprehensive and validated characterization of a system under study. XCMS Online can now be used to perform metabolomics-guided systems biology analysis as a cohesive and intuitive workflow that harnesses cloud-based multi-omic technology.
Development of the protocol
XCMS Online began as an automated cloud-based method to process raw metabolomic data, generating a list of statistically significant features that could then be used for biological interpretation8,10. For identifying potential metabolites, an algorithm was used to match the accurate masses of significant features (e.g., P value ≤ 0.01) at a minimum specified fold change (e.g., fold change ≥1.5) with metabolites listed in the METLIN database11 as an additional output. This results table can then be used to perform further biological analysis to identify changes in metabolism. As manual curation of these pathways is extremely time-consuming, pathway enrichment analysis began to appear in independent software applications12,13,14,15,16,17. There was a desire to make pathway enrichment easier for users who were not familiar with bioinformatics, thus necessitating the development of a facile approach to processing large quantities of data. Our answer to this was to automate pathway analysis by incorporating the mummichog algorithm16 into the workflow, producing a list of enriched (dysregulated) pathways directly from the raw metabolomic data.
This algorithm deconvolves large amounts of metabolic features, on the basis of their accurate m/z values and matching adducts, into two lists: a 'significant list' and a 'reference list'. Using Fisher's exact test (FET), the matched features are overlaid onto known metabolic pathways, curated from the BioCyc database (v20)18, and compared with a random sampling of features from the reference list. This process is repeated over many iterations, resulting in a significance P value for a given pathway (see Box 1 for more details). The current platform represents this in both a tabulated format and as a Pathway Cloud Plot (discussed below) to interpret the data. Each metabolite that was identified in a dysregulated pathway provides links to information on the biological importance of that molecule and its position within the metabolic pathway. Genes and proteins that interact with that metabolite are also present within the linked information. This brought about the idea to incorporate even more data into XCMS Online by adding gene and protein data integration. Interpreted metabolomics results can now be cross-referenced by uploading genomic, transcriptomic and/or proteomic data using a list of gene symbols or protein accession numbers (see Box 1 for more details). This subsequent analysis feature of XCMS Online allows researchers to take advantage of collaborative efforts, data sharing and even literature-curated information to make mechanistic interpretation of the identified dysregulated pathways. Differentially expressed genes and proteins that are found to overlap with pathways can be observed with the specific metabolites that have also undergone significant changes, thereby confirming or generating new hypotheses regarding mode of action.
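The enrichment idea described above can be sketched in a few lines of Python. This is a simplified illustration and not the mummichog implementation: it assumes that features have already been mapped to metabolite identifiers, that the significant list is a subset of the reference list, and that a one-sided Fisher's exact test (computed from the hypergeometric distribution) is an adequate stand-in. The function names `fisher_right_tail` and `pathway_enrichment` are hypothetical.

```python
from math import comb
import random


def fisher_right_tail(k, n_sig, n_pathway, n_total):
    """One-sided Fisher's exact test P value.

    Probability of observing at least k pathway members among n_sig
    significant features, when n_pathway of the n_total reference
    features belong to the pathway (hypergeometric upper tail).
    """
    upper = min(n_sig, n_pathway)
    total = comb(n_total, n_sig)
    tail = sum(
        comb(n_pathway, i) * comb(n_total - n_pathway, n_sig - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def pathway_enrichment(pathway, significant, reference, n_iter=1000, seed=0):
    """Return (observed FET P value, permutation-adjusted P value)."""
    rng = random.Random(seed)
    reference = sorted(set(reference))
    pathway = set(pathway) & set(reference)
    significant = set(significant) & set(reference)
    n_sig, n_path, n_ref = len(significant), len(pathway), len(reference)

    observed = fisher_right_tail(len(pathway & significant), n_sig, n_path, n_ref)

    # Repeatedly draw random "significant" lists of the same size from the
    # reference list and count how often they look at least as enriched.
    as_extreme = 0
    for _ in range(n_iter):
        random_list = rng.sample(reference, n_sig)
        k = len(pathway.intersection(random_list))
        if fisher_right_tail(k, n_sig, n_path, n_ref) <= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_iter + 1)
```

For example, `pathway_enrichment(["m0", "m1"], ["m0", "m1", "m5"], [f"m{i}" for i in range(50)])` returns the raw test P value together with an estimate of how often random feature lists of the same size score at least as well.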
The systems biology platform was first used in a colon cancer study in human patients comparing normal versus tumor tissue samples19. The XCMS systems biology results linked tumor progression to biofilm development via polyamine biosynthesis9. This platform has also been used to study a phase I clinical trial drug for Parkinson's disease immunotherapy, which was found to target the tryptophan pathway, a finding later validated with targeted metabolomics20. More recently, exposure of breast cancer cells to a xenoestrogen was studied and found to alter tRNA charging and ribonucleoside salvage pathways21. In addition, our study of altering carbon sources in an Escherichia coli model system was crucial in the development of the XCMS Online systems biology platform and implicated glycolysis and amino acid biosynthesis pathways as significantly dysregulated9. Currently, the system has been optimized for >7,600 organisms and has been shown to be effective in cell culture and tissue-based studies9,21.
Comparison with other methods
Generation of raw metabolomics-integrated intensity matrices can be performed from a variety of different platforms22,23. These platforms perform peak detection, retention time alignment and statistical analysis. The results are typically produced in either information-dense tables or informative visualizations, the latter often including statistical plots such as principal component analysis (PCA) or box-and-whisker plots. Several freely available platforms that are commonly used to preprocess data include cloud-based XCMS Online8,10, or downloadable packages such as MZmine, which has its own user interface23, or mzMatch, which runs in R (ref. 22).
After detecting significantly dysregulated metabolic features, it is necessary to confirm metabolite identity and infer biological relevance by identifying the metabolic pathways in which they are involved. To expedite this process, pathway analysis can be achieved using algorithms that correlate dysregulated metabolites from an untargeted metabolomic analysis with known metabolic pathways in a biological model. The more identified metabolites in a pathway, the higher the confidence of that pathway being affected by the stressed condition under study. Some platforms that are capable of doing such analysis are MetaboAnalyst12,13, KEGG Mapper17 and MBRole24, but all require preprocessed data as input.
Further integration of metabolomic data with genomic, transcriptomic and proteomic data enables a systems-level understanding of the underlying biological mechanisms. Currently, there are concerted efforts in the field to build free-to-use multi-omic workflows, such as Galaxy25. This web-based platform was originally designed for genomic research, but now contains several bioinformatics tools for multi-omic integration and analysis. Galaxy metabolomics modules, such as Workflow4Metabolomics26 and Galaxy-M (ref. 27), can be used to analyze MS-based metabolomics data for systems biology interpretation, yet require XCMS to preprocess LC–MS data. At this stage, both of these are separate software installations that have not been integrated with other Galaxy modules to perform integration of multiple data types. There are also a series of stand-alone, web-based bioinformatics platforms that can perform multi-omics integration. IMPaLA maps dysregulated gene, protein and metabolite data onto pre-annotated pathways28; iPEAP integrates genomic, transcriptomic, proteomic and metabolomic data for pathway enrichment analysis29; iPATH2.0 is an interactive tool for visualization of cellular pathways14; MetExplore links metabolomics data within genome-scale metabolic networks30; Metscape is a Cytoscape plug-in for visualizing and interpreting metabolomic data within human metabolic networks31; and PIUMet takes untargeted metabolomics data (m/z values without IDs) and maps them onto biological networks to identify dysregulated pathways15. By contrast, the XCMS Systems Biology platform allows a fully integrated environment that requires no software installation and no previous programming experience, and has an intuitive workflow with no need to switch among multiple modules or platforms.
Support for systems biology analysis is currently available for pairwise and multigroup analyses (see Box 2 for current XCMS job types). For Systems Biology–supported jobs, the database for performing pathway analysis and multi-omic integration currently queries BioCyc18; pathways and networks from other sources (KEGG17, Reactome32 and Wikipathways33) will be included in the future to extend the pathway-mapping capabilities.
Statistical analysis to identify dysregulated metabolites is limited to a select number of univariate parametric and nonparametric hypothesis tests. Once the statistical test is chosen in the parameter settings, each feature group is compared between sample classes. There is no provision for differentiating technical replicates from biological replicates at this time, and we recommend the use of biological replicates over technical replicates. Inclusion of technical replicates in data analysis should be done with caution. Technical replicates are nevertheless useful for inspecting analytical reproducibility before committing data to the Systems Biology analysis. If the analytical system generating the metabolomics data shows poor reproducibility, every attempt should be made to improve it; otherwise, the biological interpretations from the Systems Biology analysis will be compromised.
Pairwise analyses are performed in a simplified manner when a single control and perturbed condition exist. In reality, there are often biochemical feedback loops and cyclical processes within biological systems, and changes in metabolite concentration within these can alter gene and protein function. It is important to note that any metabolomics experiment is a snapshot of a system at a given time. In some instances, time-course sampling can be performed to assess how a system changes over time. However, if sampling is not done at sufficient frequency, changes can be missed. Time-series data can be analyzed using the multigroup analysis job type, but there is no specific function to include time series in the data analysis. Multigroup analysis does permit the addition of quality control (QC) samples34, typically a set of pooled samples measured throughout the analytical sequence at regular intervals, which is then removed from the statistical analysis, yet provides a means to assess the data quality in multivariate PCA and as a control group in the box-and-whisker plots. It should be noted that reported fold change values are the natural log of the median fold change, to account for variation in normally distributed data, and therefore, P values are uncorrected. Users are encouraged to look at the q value, which denotes the false-discovery rate35, before analyzing pathway analysis results and performing multi-omic integration.
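To illustrate why an FDR-adjusted value is more informative than an uncorrected P value when thousands of features are tested, the sketch below applies the Benjamini-Hochberg step-up adjustment to a list of P values. This is a swapped-in, widely used procedure chosen for brevity; it is an assumption that it approximates the q value reported by the platform, which may be computed differently. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With four features tested, an uncorrected P value of 0.01 that ranks second smallest is adjusted to 0.02, showing how the correction becomes more severe as the number of tested features grows.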
The pathway analysis algorithm uses FET to evaluate statistical significance16. For metabolites that are significantly dysregulated in the identified pathways, it is recommended that further validation experiments be performed by MS/MS using the autonomous workflow36 or manually37. Users also need to pay attention to the interpretation of the predictive pathway analysis results and ensure that the raw data used in pathway analysis are accurate. Although FET is commonly used in pathway enrichment calculations38,39, we are also in the process of developing more sophisticated and advanced pathway prediction algorithms to replace FET. The systems biology platform uses the BioCyc database for predictive pathway analysis (https://biocyc.org). Metabolic pathway information archived in BioCyc is generated from literature-based curation (Tier 1 databases) or computational prediction (Tier 2 and Tier 3 databases). Neither approach is able to capture metabolic reactions that are not well defined, such as the complex biochemical interactions between organisms and diet, environmental exposures, xenobiotics and microbiota.
An important aspect of integrated omics in our systems biology platform is the preparation of transcriptomic40,41,42 and/or proteomic data43,44,45, which is not discussed in detail in this protocol. However, some considerations are given in the Experimental design section. Currently, there is no direct statistical analysis performed during data integration and only overlap is shown. In addition, there is no value, such as log2 differential expression, that can be uploaded with the gene/protein lists. Future development will include statistical assessment of the metabolic pathways that overlap with significant genes and/or proteins, as well as incorporate the degree of differential expression.
As job sizes increase and, consequently, uploading times increase, the dead time between data collection and processing becomes more substantial. One way to alleviate delays in processing time is to use our data-streaming application XCMStream46, which directly uploads data from the instrument computer as it is generated and automatically initiates the job once complete. Analyzing data off-site can also be challenging, particularly if there is limited computer access. To improve accessibility to results, the XCMS Mobile app47 has recently been released to give XCMS users the ability to analyze data from the cloud; we are currently working to implement systems biology analysis and results view on the mobile platform.
There are many more unique functions available within XCMS Online and, as such, the protocols outlined in this paper will not be able to describe all possible permutations of the workflow beneficial for systems biology analysis. Video tutorials on the systems biology analysis and many additional features can be found on the XCMS Institute page within XCMS Online (https://xcmsonline.scripps.edu/landing_page.php?pgcontent=institute).
Controls and replicates. An untargeted metabolomics workflow requires a robust experimental design with an appropriate metabolite extraction protocol, an optimized LC–MS or gas chromatography–MS method, and an effective data-analysis workflow to identify perturbations in metabolic pathways. When setting up the initial experimental conditions, metabolomic sample classes should be defined; these must include a control condition and at least one perturbation or treatment group. To determine the appropriate number of biological replicates, we typically recommend a rough statistical power estimation to ensure that there are enough samples48.
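A rough power estimation of the kind mentioned above can be sketched with the standard normal approximation for a two-group comparison. This is a generic textbook formula, not necessarily the approach of the cited reference; the effect size is expressed as a standardized difference (difference in means divided by the standard deviation), and the defaults of a two-sided significance level of 0.05 and 80% power are assumptions. The function name `samples_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate biological replicates needed per sample class.

    Normal approximation for a two-sided, two-group comparison:
    n = 2 * ((z_(1 - alpha/2) + z_power) / effect_size) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Under these assumptions, a large standardized effect of 1.0 calls for 16 replicates per class, whereas a moderate effect of 0.5 calls for 63. The approximation slightly underestimates the number needed for small groups analyzed with a t test.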
To ensure the quality of the MS data being generated, it is recommended to include a pooled sample (for QC purposes) containing an aliquot of all the samples, or, if the sample groups are large enough, a pooled sample can be prepared for each sample class. Pooled samples are run throughout the mass spectrometric sequence regularly (e.g., one in ten injections) as a QC check for signal intensity and retention time drift, but are also extremely valuable as a method for including preliminary metabolite validation using data-dependent or targeted MS/MS37. Further details on metabolomic sampling, extraction and chromatographic methods have been discussed elsewhere49,50,51,52.
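As a minimal sketch of how pooled QC injections might be interleaved into an acquisition sequence, the helper below places a QC injection at the start, after every block of study samples, and at the end. The block size, the label `QC_pool` and the function name `build_sequence` are hypothetical; the exact spacing should follow the laboratory's own QC policy.

```python
def build_sequence(samples, qc_label="QC_pool", every=10):
    """Interleave pooled QC injections into an ordered list of sample names."""
    sequence = [qc_label]  # open the run with a QC injection
    for i, sample in enumerate(samples, start=1):
        sequence.append(sample)
        # Add a QC after each full block, except where the closing QC follows.
        if i % every == 0 and i != len(samples):
            sequence.append(qc_label)
    sequence.append(qc_label)  # close the run with a QC injection
    return sequence
```

For 25 study samples and the default block size, this yields 29 injections, 4 of which are pooled QC samples.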
Biological experimental design. When preparing samples for metabolomic analysis, it is recommended that samples use biological material that best represents the system under study. Tissue samples taken from the expressing phenotype tend to give more meaningful results than plasma samples, which will be more representative of the whole body. Urine samples, although the easiest to collect, consist of mostly metabolic breakdown products, may be far from the phenotype of interest and can vary greatly depending on dilution. Samples should also be collected in large enough quantities that multi-omics analyses can be performed on the same biological replicates used for the metabolomic analysis. This reduces the complexity of the data set and will typically produce more reproducible results. Biological samples prepared separately for each 'omic' analysis often suffer from minor variances during sample preparation that may result in data sets with detectable differences that are not relevant to the study. Preparation of separate samples may allow researchers to detect differences that are a result of natural biological variation within larger sample cohorts. However, in our experience, it is more important to keep the experimental conditions the same whenever possible, particularly during preliminary studies with limited sample size. To better account for natural biological variation and to confirm integrated multi-omic results, we recommend repeating the whole experiment in an independent manner.
An alternative to generating transcriptomic and proteomic data in-house is obtaining the data from publicly available data sets in which experimental conditions are either the same or very similar. This is useful for studies in which a large quantity of curated data is available, such as human cancer research53. This was demonstrated successfully with a colon cancer study in which 30 paired tumor and normal tissue samples were analyzed using untargeted metabolomics19 and then compared with existing transcriptomic and proteomic data obtained from The Cancer Genome Atlas and the Clinical Proteomic Tumor Analysis Consortium, respectively. The integrated omics module in XCMS Online resulted in excellent agreement with the automated pathway analysis of the metabolomics data9. The overlapping pathways included many pathways previously identified in colon cancer, including 1,25-dihydroxyvitamin D3 biosynthesis54, bile acid biosynthesis55, zymosterol biosynthesis56, ubiquinol-10 biosynthesis57 and the spermine–spermidine pathway identified in the original study19.
Modes of analysis. Some final considerations for running the systems biology platform relate to the type of analysis to be performed with respect to the prepared samples. The most commonly performed XCMS Online job is a pairwise analysis ('Pairwise Job'), the purpose of which is to carefully contrast a control set of samples with a perturbed set in order to isolate the effects of a specific condition. This could include, for example, cell cultures exposed to a stressed condition, animal models perturbed by a specific drug or patients with a specific ailment compared with a healthy population. If more variables or time points are under study, a multigroup analysis ('Multigroup Job') can be performed. There are numerous parameters that can be defined when creating an XCMS Online job. In most cases, selecting the default parameters for the instrument platform used is adequate (e.g., time-of-flight detection after ultra-high-performance liquid chromatography separation); however, the user should have a good understanding of the function of the main parameters and when it is useful to change them; these will be highlighted in the protocol below.
Data smoothing. XCMS Online allows the user to select and optimize from a set of base parameters; some of these settings are more sensitive to change than others. Algorithms for smoothing, correcting and aligning data are embedded within these parameters and can be tuned to best detect features for a given data set. For feature detection in global metabolomics data, HR-MS should be used, although both low- and high-resolution (i.e., resolution >10,000) (ref. 58) data can be used in XCMS Online (Step 6). Low-resolution data, in either centroid or profile, should be used with the matchedFilter algorithm7, whereas high-resolution centroid data should be used with the centWave algorithm59. Conversion of data from profile to centroid before upload may provide more robust feature detection and will have faster upload speeds.
Retention time correction. For retention time correction (Step 10), there are two algorithms to choose from. Obiwarp (option A) is the standard algorithm to select. Peaks are warped into groups when compared between samples; this method tends to be more global and smooth. The peakGroups algorithm (option B) can be used to optimize data processing for better control of alignment, but is more difficult to tune. For most analyses, Obiwarp will be sufficient. However, if there is trouble detecting features, peakGroups should be used. We recommend using the nonlinear retention time alignment 'LOESS' (locally weighted scatterplot smoothing) in peakGroups, which will also allow you to select the smoothing method 'family'. This can be either 'gaussian' (if a normal distribution is expected and all the data are to be included), or 'symmetrical', which is based on a redescending M estimator used with a Tukey's biweight function60 to allow for outlier removal. The 'span' parameter is based on degrees of freedom, is also very sensitive to change and should be selected with caution; the default is 0.6 and going larger (i.e., closer to 1) would obtain a more global smoothing result, but alignment may be diminished, whereas going smaller (i.e., closer to 0.05) may produce better alignment, but will have more-stringent peak selection within a group.
Statistical analysis. There are a select number of univariate statistics (Step 12) used for the generation of the significantly dysregulated m/z feature list that is used in the predictive pathway analysis. For pairwise data analysis, a choice of four standard statistical tests is available. Parametric test options are Welch's t test61 and paired t test; nonparametric test options are Mann–Whitney U test62 and a paired Wilcoxon signed-rank test61. For multigroup analysis, ANOVA63 and Kruskal–Wallis64 tests are given as choices, depending on whether the data are parametric or nonparametric.
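The choice among these tests can be summarized as a small decision table. The sketch below only encodes the options listed above; the function name `choose_test` and its arguments are hypothetical and do not correspond to any XCMS Online setting name.

```python
def choose_test(job_type, parametric, paired=False):
    """Suggest a univariate test for a job type and data assumptions."""
    if job_type == "pairwise":
        if parametric:
            return "paired t test" if paired else "Welch's t test"
        return "paired Wilcoxon test" if paired else "Mann-Whitney U test"
    if job_type == "multigroup":
        if paired:
            raise ValueError("no paired option is listed for multigroup jobs")
        return "ANOVA" if parametric else "Kruskal-Wallis test"
    raise ValueError(f"unknown job type: {job_type!r}")
```

For instance, unpaired samples that cannot be assumed to be normally distributed in a two-class comparison map to the Mann-Whitney U test.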
When setting the thresholds for significant features from the statistical test, the P value and natural log of the median fold change must be entered based on the quality of the data and a user-chosen level of stringency. Features considered 'highly significant' will be used for the metabolomic cloud plot and PCA, whereas a secondary threshold of 'significant features' is set as a cutoff for plotting extracted ion chromatograms (EICs) and box-and-whisker plots, as well as for performing database matching for peak annotation. Default values are provided as a starting point, but may be adjusted depending on confidence in the data. The fold change value is set to a default of 1.5. It is not recommended to use a value <1.5, as these data tend to be artifacts; however, to obtain features with greater dysregulation, a fold change ≥2 can be used.
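The effect of these thresholds can be illustrated with a minimal filter. The sketch assumes that fold change is supplied as a positive ratio and that up- and downregulation are treated symmetrically (a ratio of 0.5 counts as a twofold change); the P value cutoff of 0.01 is only an example value, and the function name `is_significant` is hypothetical.

```python
def is_significant(p_value, fold_change, p_cutoff=0.01, fold_cutoff=1.5):
    """Apply a P value cutoff and a direction-agnostic fold change cutoff."""
    if fold_change <= 0:
        raise ValueError("fold change must be a positive ratio")
    # Treat a ratio of 1/x the same as a ratio of x.
    magnitude = max(fold_change, 1.0 / fold_change)
    return p_value <= p_cutoff and magnitude >= fold_cutoff
```

A feature with P = 0.005 and a ratio of 0.5 passes the default thresholds, whereas one with the same P value and a ratio of 1.2 does not.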
Normalization, which can affect the outcome of the deciphered pathway, should be chosen carefully. Two normalization methods are also available to apply to the data set to compensate for analytical variances (Step 12). The 'median fold change' is well suited for normalizing dilution effects by adjusting the median log fold change of peak intensities in each sample in the whole data set to approximately zero34. The 'LOESS' or locally weighted scatterplot smoothing is a stronger polynomial regression normalization method in which the local median of the log fold change between peak intensities in each sample is adjusted to approximately zero across the entire peak intensity range65. LOESS is often applied to compensate for batch effects or systematic variation.
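The median fold change idea can be sketched as follows. This is a simplified reading of the description above, not the platform's code: the reference profile is taken as the feature-wise median across all samples (an assumption), and each sample is rescaled so that the median of its log fold changes relative to that reference becomes zero. All intensities are assumed to be positive, and the function name is hypothetical.

```python
from math import exp, log
from statistics import median


def median_fold_change_normalize(matrix):
    """Normalize a samples x features intensity matrix (list of lists)."""
    n_features = len(matrix[0])
    # Reference profile: median intensity of each feature across samples.
    reference = [median(sample[j] for sample in matrix) for j in range(n_features)]
    normalized = []
    for sample in matrix:
        # Median log fold change of this sample relative to the reference.
        shift = median(log(x / r) for x, r in zip(sample, reference))
        scale = exp(-shift)
        normalized.append([x * scale for x in sample])
    return normalized
```

A sample that is uniformly twice as concentrated as the others is scaled back by a factor of two, which is the dilution-type effect this method is suited to correct.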
Multi-omics integration. The systems biology platform carries out predictive pathway analysis16 and multi-omic integration on metabolomic data sets using metabolic models from 7,627 unique organisms or biosources. To run predictive pathway analysis, a new parameter set must be defined and saved. The intensity filter in the predictive pathway analysis determines whether the signal intensity of a spectral peak is high enough to be considered as a confident and real metabolic feature for metabolite identification. This value can be determined by checking the signal-to-noise (S/N) ratio in the raw data, and we suggest setting it to at least 10×S/N ratio to avoid artifact metabolite identification from the background noise. When setting up the systems biology parameters, it is important to note that the mass tolerance and adduct forms set in Step 14 are applied only to identify potential metabolites within METLIN but not for predicting metabolic pathways. Metabolite identification in the predictive pathway analysis uses a different set of parameters, in which mass tolerance is defined in Step 14, and adduct forms are preset in the predictive pathway algorithm and cannot be changed. See Box 3 for the complete adduct list used in predictive pathway analysis. Currently, three different mass tolerance options are available: 5, 10 and 20 p.p.m., with 5 p.p.m. being selected for well-calibrated high-resolution MS data and generating the most precise results. A threshold value can also be set for minimal MS peak intensity for features to be considered in the pathway analysis. Users should also define a P value cutoff, which divides the entire metabolomic table into significant and nonsignificant metabolite lists. Typically, this value is the same or lower than the value for data processing; this field can also be left blank to use the default mode, which automatically assigns a P value cutoff based on the top quarter of statistically significant metabolic features.
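The mass tolerance options can be made concrete with the usual parts-per-million calculation. The sketch below is illustrative only; the m/z values in the usage note are hypothetical and the function names are not part of XCMS Online.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=5.0):
    """True if the observed m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
```

For a hypothetical theoretical m/z of 181.0707, an observed value of 181.0715 (about 4.4 p.p.m.) would match at the 5 p.p.m. setting, whereas 181.0725 (about 9.9 p.p.m.) would match only at 10 or 20 p.p.m., which is why wider tolerances admit more candidate metabolites.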
Multi-omics integration takes place after the LC–MS data are processed, and the pathway analysis algorithm has run. Omics data are uploaded separately using a subjob parameter page found within the results summary page. This algorithm matches gene and protein data to the metabolic pathways identified as dysregulated. During the interpretation of multi-omics results, it may be necessary to sort by gene or protein to find overlap in less significant pathways. If differentially expressed gene or protein data do not directly match the observed up/downregulation of a metabolite, other processes are likely at work. This can include inhibition/activation by a small molecule or metabolite, or a rate-limiting enzymatic process (i.e., low enzymatic catalytic constant) that is upstream or downstream from a dysregulated metabolite. Interpretation outside the obvious connections should be considered and may require expertise and intuition beyond the immediate results. Given the well-known disconnect between gene expression and protein dynamics66, combining transcriptomic data with metabolomics data may prove useful in elucidating upstream mechanisms as relative metabolite concentrations provide information on protein function67. However, the more orthogonal data that can be included, the better the biological interpretation will be on a systems-wide scale.
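Because the integration currently reports overlap only, its logic can be approximated by a set intersection between the uploaded identifiers and the genes associated with each dysregulated pathway. The sketch below uses toy pathway and gene names, assumes exact matching of identifiers, and ranks pathways by the number of overlapping genes; none of these details is taken from the platform's implementation.

```python
def overlap_by_pathway(pathway_genes, uploaded_genes):
    """Map each pathway to the sorted list of uploaded genes it contains."""
    uploaded = {g.strip() for g in uploaded_genes if g.strip()}
    hits = {}
    for pathway, genes in pathway_genes.items():
        shared = sorted(uploaded & set(genes))
        if shared:
            hits[pathway] = shared
    # Pathways with the most overlapping genes come first.
    return dict(sorted(hits.items(), key=lambda item: len(item[1]), reverse=True))
```

Sorting by overlap count mirrors the suggestion above to sort by gene or protein when looking for overlap in pathways that are less significant on the metabolite level.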
Browser requirements: XCMS Online supports many of the mainstream web browsers. For the best results, we recommend using the latest version of Google Chrome (v57+) or Mozilla Firefox (v51+).
Internet connection requirements: A fast upload connection is recommended, with a minimum of 5 Mbps, to upload files to data set storage. This can be done directly from the instrument computer or from a personal computer, provided there is adequate hard-drive space for data files. Physical Ethernet connections are normally preferred over wireless (wifi) connections.
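As a back-of-the-envelope check on upload times, the transfer time for a file can be estimated from its size and the sustained upload bandwidth. The sketch assumes decimal megabytes (1 MB = 8 megabits) and ignores protocol overhead and server load; the function name is hypothetical.

```python
def upload_seconds(file_size_mb, upload_mbps=5.0):
    """Estimate upload time in seconds for a file of file_size_mb megabytes."""
    if upload_mbps <= 0:
        raise ValueError("upload bandwidth must be positive")
    return file_size_mb * 8 / upload_mbps
```

Under these assumptions, a 500-MB data file takes about 800 s (roughly 13 min) at the minimum recommended 5 Mbps.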
Hardware requirements: To view and work with XCMS Online results, a minimum of a Pentium 4 processor with 8 GB of RAM and a screen resolution of 1,280 × 800 or higher is recommended.
XCMS Online currently supports upload of both raw data files and numerous converted MS data formats; see Box 4 for more details.
Gene and protein data format: Differentially expressed gene and protein data should be in the format of a comma-separated (.csv) or tab-separated (.tsv) file. Gene names should be in the format of gene symbols, and protein names should be in the format of gene symbols or UniProt68 accession numbers. If multiple data sets are available, they must be uploaded individually.
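Before upload, a gene or protein list can be checked locally with a short script. The sketch below reads identifiers from comma- or tab-separated text, removes blanks and duplicates, and flags strings that follow the published UniProt accession pattern. The exact column layout expected by XCMS Online is not specified here, so reading every cell as an identifier is an assumption, and the function names are hypothetical.

```python
import csv
import io
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def read_identifiers(text, delimiter=","):
    """Return unique, non-empty identifiers in order of first appearance."""
    seen, identifiers = set(), []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        for cell in row:
            ident = cell.strip()
            if ident and ident not in seen:
                seen.add(ident)
                identifiers.append(ident)
    return identifiers


def looks_like_uniprot(identifier):
    """True if the string has the form of a UniProt accession number."""
    return bool(UNIPROT_RE.match(identifier))
```

Use `delimiter="\t"` for .tsv content. Identifiers that are neither gene symbols nor accession-shaped strings can then be reviewed by hand before the file is uploaded.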
The results for example data discussed in the 'ANTICIPATED RESULTS' section (see below) can be accessed after logging in to XCMS Online (https://xcmsonline.scripps.edu), clicking on the 'XCMS Public' menu (https://xcmsonline.scripps.edu/landing_page.php?pgcontent=listPublicShares) and searching for the job number '1172567' or name 'Ecoli_glucose-vs-adenosine'. These data and multi-omics data files are also available online for download to users to test on their own using the two sample class data sets 'Glucose.zip' and 'Adenosine.zip' (MetaboLights, study identifier MTBLS572; https://www.ebi.ac.uk/metabolights/MTBLS572). For multi-omics integration, a demonstration transcriptomics data set is provided as 'Ecoli_genes.csv' (Supplementary Data 1) and a significant protein data set is provided as 'Ecoli_proteins.csv' (Supplementary Data 2). Information on the experimental design and XCMS Online parameter settings, including systems biology parameters and multi-omics integration settings, is provided in the Supplementary Methods and Supplementary Table 1.
Stage 1: data upload
Timing: ∼30 s–5 min per file, depending on the size of the data set
Logging in. Go to the XCMS Online home page (https://xcmsonline.scripps.edu) and log in with your e-mail and password, or click 'Sign Up' to create a free user profile.
Uploading data. It is recommended to upload data before starting a job. After logging in to XCMS Online, click 'Stored Datasets' from the top menu. Upload times may vary, depending on the type of Internet connection, file size, number of files, proximity to the XCMS server and how busy the servers are, but typically take ∼30 s–5 min per file. Create one data set per sample class.
To add data to a sample class, click 'Add Dataset(s)', as shown in Figure 1 (top). This opens the HTML5 uploader window (Fig. 1 (center)), in which files can be selected by clicking 'BROWSE' or can be dragged and dropped into the uploader window; follow Box 4 for acceptable file formats.
Give the data set a meaningful name representative of the sample class and click 'Save'. Once the files have finished uploading, click 'Save Dataset & Proceed'.
Click on the new data set name to open the 'View/Edit Datasets' window (Fig. 1 (bottom)). Check that every file has finished uploading and that each file size matches that of the original file.
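The file size check can be made less error prone by listing the local file sizes first and comparing them with the sizes shown in the 'View/Edit Datasets' window. This is a minimal sketch under stated assumptions: the uploaded sizes are typed in by hand from the web page (XCMS Online is not queried), sizes are compared in bytes, and all function and file names are hypothetical.

```python
from pathlib import Path


def local_file_sizes(folder: str) -> dict:
    """Map each file name in `folder` to its size in bytes."""
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(folder).iterdir())
        if path.is_file()
    }


def size_mismatches(local: dict, uploaded: dict) -> list:
    """Return names of files that are missing from `uploaded` or differ in size."""
    return [name for name, size in local.items() if uploaded.get(name) != size]


# Hypothetical sizes: the second file was truncated during upload
local = {"glucose_01.mzXML": 1048576, "glucose_02.mzXML": 2097152}
uploaded = {"glucose_01.mzXML": 1048576, "glucose_02.mzXML": 2000000}
print(size_mismatches(local, uploaded))  # ['glucose_02.mzXML']
```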
Start an XCMS job. Click 'Create Job' from the top menu and select a pairwise or multigroup job. Data sets can be loaded directly via 'Load New Dataset' and following Steps 2–5 or can be selected from the previously prepared data sets via 'Select Dataset' (recommended). The control condition should be placed in Dataset 1 and the perturbed condition in Dataset 2. In multigroup analysis, the user can also define a QC data set.
Stage 2: parameter settings
Timing: 5–10 min
Select a base parameter set for data processing. Select a default parameter set that best represents your sample data from the parameter dropdown box. It is recommended to modify the parameters to make them specific to your acquisition parameters. Click 'View/Edit' (Fig. 2a) to open the parameter method details. Click 'Create New' to be able to modify the existing parameters. There are nine tabs with details pertaining to how data are to be processed (see Steps 8–16 for details on the parameters specified in the individual tabs).
General. Give the parameter set a unique name. Once saved, this parameter set will be available only in the user's parameter files. The 'Retention time format' can be changed to minutes or seconds, and polarity can be either positive or negative.
Feature detection. The method for feature detection is based on either the centWave algorithm59 (high-resolution centroid data) or the matchedFilter algorithm7 (low-resolution centroid or profile data). Define the following parameters:
Parameter: Description
ppm (set mass accuracy): The deviation value should be slightly higher than that of the expected mass accuracy of the instrument. Guidelines are 5–15 p.p.m. for Orbitrap data, ∼5 p.p.m. for lock mass quadrupole time of flight (QTOF) data and 10–20 p.p.m. for other QTOF instruments.
minimum/maximum peak width: This depends mainly on the type of chromatographic separation performed. For standard reverse-phase separations, a general guideline is 20–60 s, whereas for hydrophilic interaction liquid chromatography (HILIC), in which run times tend to be longer with broader peaks, we recommend 25–90 s. When running with UPLC, these values should drop markedly because of shorter run times and higher resolution, with suggested starting values between 2 and 5 s to a maximum of 30 s. If in doubt about the values, check the raw chromatographic run for peak widths of some common compounds from each end of the trace. These values are not hard cutoffs; peaks may be detected slightly outside this range, depending on the quality of the peak data.
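To get a feel for what a given 'ppm' setting means in absolute terms, the tolerance can be converted into an m/z window. The sketch below shows only the standard parts-per-million arithmetic; the function name and the example m/z value are hypothetical, and it does not reproduce how the feature detection algorithms apply the tolerance internally.

```python
def ppm_window(mz: float, ppm: float) -> tuple:
    """Return the (low, high) m/z bounds for a tolerance given in p.p.m."""
    delta = mz * ppm / 1e6  # absolute tolerance in m/z units
    return mz - delta, mz + delta


# Hypothetical feature at m/z 200 with a 10 p.p.m. tolerance
low, high = ppm_window(200.0, 10)
print(f"{low:.4f} to {high:.4f}")  # 199.9980 to 200.0020
```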
Click on 'View Advanced Options' and define the following advanced parameters:
Parameter: Description
mzdiff: This is the m/z tolerance allowed for spectral features; the 'signal/noise' threshold for peaks is set to a default of 6 and can be increased if data are noisy
Integration method: This can be chosen as a filtered method by selecting 1, which uses noise-reduced data, or raw data by selecting 2, which is more exact but prone to noise
prefilter peaks: This can be selected to apply a prefilter to mass traces; it specifies the minimum number of peaks a mass trace must contain in order to be retained
prefilter intensity: This defines the minimum scan intensity required for each peak
Noise Filter: This value can be entered for a minimum value that peaks must reach to be kept for analysis
Retention time correction. Select either the Obiwarp (option A, for data correction with well-behaved peak groups) or the peakGroups algorithm (option B, for more options to detect peaks that require more in-depth grouping) from the dropdown box (see also Experimental design).
Retention time correction using Obiwarp
Set 'profStep', which defines the step size (in m/z) for profile generation from the raw data files. The default value is 0.5.
Retention time correction using peakGroups
Define the following peak-grouping parameters:
Parameter: Description
non-linear/linear alignment: Choose the alignment method from the dropdown box; this can be polynomial (nonparametric) using 'LOESS' (locally weighted scatterplot smoothing) or a 'linear' regression model; we recommend the LOESS method for most applications
extra/missing: These parameters are dependent on sample sizes and should be increased if the sample sizes are large (i.e., 1 if the data set contains five replicates or 5 if the data set has 25 replicates)
Click on 'View Advanced Settings' to define further optional parameters for initial grouping performed with peakGroups:
Parameter: Description
Ignore sample class: Select TRUE/FALSE for ignoring sample class; selecting FALSE will create bias to sample class
bw: The 'bw' or band width, is the peak width at half height, which describes the inclusiveness of the peak grouping in seconds; smaller values are less inclusive
mzwid: Enter the 'mzwid' or mass tolerance (m/z) between samples and across peak groups
minfrac: Enter the 'minfrac' or minimum fraction of samples required to accept a peak grouping, which is sample size–independent, whereas 'minsamp', the minimum number of samples required to accept a peak grouping, should be based on the sample size
family: The smoothing method 'family' should be selected when 'LOESS' alignment is performed and can be either 'gaussian', which will include outliers, or 'symmetrical', which will exclude them
span: Enter a 'span' value between 0 and 1. Again, this is only for 'LOESS' alignment, and values closer to 1 result in more global smoothing
Alignment. Once the peaks are grouped, align the peak features by defining the following parameters:
Parameter: Description
bw: This is the band width, or peak width at half height, and the default is 5 s; this should be set to <10 s for HPLC and to 2–5 s for UPLC data
minfrac: This is the minimum fraction of samples required for a set of peaks to be called a group
mzwid: This is the difference in mass accuracy between samples
Define the following additional parameters by clicking 'View Advanced Options':
Parameter: Description
minsamp: This is the minimum number of samples allowed for a set of peaks within the same m/z tolerance to be called a group
max: This is the maximum number of groups to be identified for a given m/z slice
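The relationship between the fractional 'minfrac' setting and an absolute sample count can be illustrated with a short calculation. This is a sketch under an explicit assumption: a peak group is taken to be accepted when at least minfrac × n samples of a sample class contribute a peak, so the smallest qualifying count is that product rounded up. The function name is hypothetical, and the code does not reproduce the grouping algorithm itself.

```python
import math


def min_samples_for_group(minfrac: float, n_samples: int) -> int:
    """Smallest number of samples that satisfies a given 'minfrac'.

    Assumption: a group qualifies when count >= minfrac * n_samples.
    """
    if not 0 <= minfrac <= 1:
        raise ValueError("minfrac must lie between 0 and 1")
    return math.ceil(minfrac * n_samples)


# With minfrac = 0.5, a class of five replicates needs three samples
print(min_samples_for_group(0.5, 5))  # 3
print(min_samples_for_group(0.5, 6))  # 3
```

Unlike 'minfrac', a fixed 'minsamp' count does not scale with the class size, which is why it should be chosen with the sample size in mind.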
Statistics. Select the statistical test to be performed on metabolite features. Select from the following:
Analysis: Description
For unpaired pairwise analysis: Choose between Welch's t test (parametric) and Mann–Whitney test (nonparametric)
For a paired pairwise analysis: Select the paired parametric t test. If a nonparametric test is required, a Wilcoxon signed-rank test can be selected. This enables a new button to appear: 'VIEW PAIRS2', which opens a new window to select the pairs by dragging each sample in the correct order onto the list
Multigroup job: Choose either an 'ANOVA' parametric test or 'Kruskal–Wallis' nonparametric test
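For readers who want to see what the unpaired parametric option computes, the following sketch calculates the Welch's t statistic and the Welch–Satterthwaite degrees of freedom for one feature using only the Python standard library. It is an illustration of the textbook formula, not the code run by XCMS Online; the function name and intensity values are hypothetical, and the P value is not computed here because that requires a t-distribution routine from a statistics package.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t test."""
    # Squared standard error of each group mean (sample variance / n)
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / (se_a + se_b) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical feature intensities in the control and perturbed classes
control = [1.0, 2.0, 3.0, 4.0, 5.0]
perturbed = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(control, perturbed)
print(f"t = {t:.3f}, df = {df:.3f}")  # t = -1.897, df = 5.882
```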
Define the following P value and threshold parameters:
Parameter: Description
P-value threshold (highly significant features): This generates plots such as the standard Cloud Plot and PCA
Fold-change threshold: This generates plots such as the standard Cloud Plot, EICs and box-and-whisker plots
P-value threshold (significant features): This generates EICs and box-and-whisker plots and performs database matching for peak annotation
Select additional options for how peaks are evaluated for statistical analysis:
Parameter: Description
Value: Select the type of intensity 'value' to be used for statistical tests. This can either be the feature peak maximum intensity value 'maxo' or peak area 'into'
Normalization: Select the normalization method, either 'median fold change' or 'LOESS'
Annotation. Define the parameters for matching isotopes and/or adducts to the features in the Results Table by selecting either 'isotopes' or 'isotopes and adducts' from the dropdown box, with the latter resulting in increased processing time but more identifications. Define the m/z tolerance in either absolute error or relative error values; the smaller deviation for each m/z feature will be used in the annotation process.
Parameter: Description
ppm: Set the 'ppm' tolerance for labeling metabolite annotations; this matches the m/z values in the Results Table to the accurate mass in the METLIN database. Set to the same value as for feature detection or lower
adducts: Highlight the ionized forms, such as [M–H]−, [M–H2O–H]−, and [M+Cl]−, to be considered for database search
Sample biosource: Click 'SELECT BIOSOURCE' (Fig. 2b) to open a separate browser window as in Figure 2c. Biosources can be found by browsing or by using the search field; press 'SELECT' to confirm the biosource choice. Click 'save' to remember the biosource selection in the data-processing method (Fig. 2d)
Pathway ppm deviation: Define the mass tolerance 'pathway deviation ppm' for matching spectral peaks against metabolites in the BioCyc database
Input intensity threshold: Enter an 'input intensity threshold' or leave it blank to include all the metabolic features for pathway analysis
Significant list P-value cutoff: Enter a 'significant list P-value cut-off' for defining significantly dysregulated features for pathway analysis
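The effect of the adduct selection and the 'ppm' tolerance on database matching can be illustrated with a small calculation. The sketch below computes the theoretical m/z of the three negative-mode ion forms named above from a neutral monoisotopic mass and reports which ones fall within a given tolerance of an observed m/z. It is not the matching code used by XCMS Online or METLIN; the function names and the observed value are hypothetical, and the neutral mass used is that of pyruvic acid (C3H4O3).

```python
PROTON = 1.007276     # mass of a proton (Da)
WATER = 18.010565     # monoisotopic mass of H2O (Da)
CHLORIDE = 34.969402  # monoisotopic mass of 35Cl plus one electron (Da)

# Theoretical m/z of singly charged negative ions from a neutral mass M
NEGATIVE_ADDUCTS = {
    "[M-H]-": lambda m: m - PROTON,
    "[M-H2O-H]-": lambda m: m - WATER - PROTON,
    "[M+Cl]-": lambda m: m + CHLORIDE,
}


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def matching_adducts(observed_mz: float, neutral_mass: float, tolerance_ppm: float) -> list:
    """Names of adducts whose theoretical m/z lies within the tolerance."""
    return [
        name
        for name, to_mz in NEGATIVE_ADDUCTS.items()
        if abs(ppm_error(observed_mz, to_mz(neutral_mass))) <= tolerance_ppm
    ]


# Hypothetical observed feature compared with pyruvic acid (M = 88.016044 Da)
print(matching_adducts(87.0090, 88.016044, 5))  # ['[M-H]-']
```

Selecting more adducts widens the set of candidate ion forms tested for each feature, which is consistent with the longer processing time noted for the 'isotopes and adducts' option.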
Visualization. Set the retention time window for visualization of the EIC of the statistically significant metabolic features. The default value of 200 s is recommended for HPLC data; this may be reduced to ∼120 s for UPLC data.
Miscellaneous. Leave the parameter settings in this tab unchecked for most data-processing cases. 'Correct mass calibration gaps' applies to MS data from Waters instruments to subtract lock mass scans from the data; 'Bypass file sanity check' disables the file completeness check to speed up data processing.
Stage 3: XCMS data processing and predictive pathway analysis
Timing: 1–3 h, depending on data set size and server queue
Job submission. Once all the parameter settings are defined, click 'Submit Job' to open the 'Confirm Job Specifications' window to view the job parameter summary. If all the information is correct, click 'Submit Job' in the window to launch the data processing, including pathway analysis.
Confirm job. After the XCMS job is submitted, a notification e-mail will be sent to the registered e-mail address. Check the processing status by clicking 'View Results' from the top menu to explore all the jobs submitted under the same user account. The status button indicates the data-processing state: 'Not Submitted', 'Queued', 'Processing' or 'Error'. The progress bar indicates the percentage completion of the job. Refreshing the webpage updates the progress percentage. Once the job is completed, the status button will change to 'View' and the progress bar will reach 100%.
Stage 4: interpretation of pathway analysis results
Timing: 30–40 min
Access results. Click 'View Results' from the top menu to open a list of in-progress or completed jobs. Refreshing the page will update the progress bar on in-progress jobs. Press 'VIEW' to view a finished job.
Predictive pathway analysis results. To view and interpret the pathway results table, click the 'Systems Biology Results' tab located on the left side of the Job Results Summary page. This table tabulates all the predicted pathways, together with the involved genes, proteins, metabolites and pathway P values (Fig. 3).
Click the pathway names to view detailed pathway descriptions on the BioCyc website. Gene and protein information for the pathway is listed in the 'Pathway Results Table', but no overlapping information will be displayed until multi-omic integration is processed (see Stage 5).
Click the number under 'All genes' to link to a list of the genes involved in the pathway, with their names, enzyme activity and BioCyc reaction identities. Both the gene names and the reaction identities link to their detailed information on the BioCyc website.
Similarly, click 'All proteins' to link to a list of proteins in the pathway with their names, the protein identifier as a UniProt accession number or gene symbol, and the number of pathways involved. The protein name links to the detailed information on the BioCyc website, whereas the protein identifier links to the protein information in the UniProt database. Clicking on the number of pathways involved opens a window to a new list detailing those pathways, each linking to detailed information on BioCyc.
Click the number of 'Overlapping putative metabolites' to show the list of dysregulated metabolites involved in the pathway and a pie chart showing the percentage coverage of dysregulated metabolites in the pathway. This view is illustrated in Figure 4. Detailed metabolic information is provided, including METLIN ID, KEGG ID, up/downregulation, feature fold change, feature P value, m/z, retention time, matched adduct form and feature details for each unique metabolic feature ID.
Click on a metabolite to open a new tab linking to detailed information on BioCyc.
Click on a METLIN ID to open a new tab linking to the METLIN database metabolite entry.
Click on a KEGG ID to open a new tab linking to the KEGG database metabolite entry.
Click a feature ID number under 'Feature Details' to open a separate window displaying the EICs, the MS spectrum of the average feature m/z value and the box-and-whisker plot of that metabolic feature.
Press the 'Back' button on the browser to return to the 'Metabolic Pathway Results'.
Click the number of 'All metabolites' to display a list of all metabolites found involved in that pathway.
Click the name of the metabolite to open a new tab linking to BioCyc information.
Click the numbers under 'METLIN ID' to open a new tab linking to the METLIN database metabolite entry.
Click on a KEGG ID to open a new tab linking to the KEGG database metabolite entry.
Press the 'Back' button on the browser to return to the 'Metabolic Pathway Results'.
Assess pathway significance by P value; typically, ≤0.05 indicates significant dysregulation and implies that this pathway is worth further investigation.
Predictive metabolite results. In the 'Metabolic Pathway Results' table, click the 'Predictive Metabolites Results' button (Fig. 3) to access information on all dysregulated putative metabolic identifications. This view is illustrated in Figure 5.
Click the name of the 'Pathway(s) Involved' to display a list of all the dysregulated metabolites that are involved in the pathway.
Click the column name 'Fold Change', 'P-value', 'm/z' or 'Retention Time' to sort the entire table by that column in an ascending order. Click the column name again to sort it in a descending order. Click the 'Reset' button above the table to return to the original view.
Click the 'Feature' number to open a separate window to view the LC chromatogram, MS spectrum and box-and-whisker plot of that metabolic feature.
Type the metabolite name in the 'Search' bar at the top right of the page to search for a specific metabolite in the metabolomics data set.
Predictive pathway results download. Download the results of the data processing and pathway analysis via the 'Download Result' button on the top right of the 'Job Results Summary' page. In the downloaded folder, the pathway analysis results can be found in the zipped 'results' folder. All the predicted metabolic pathways and associated metabolic information are stored in the 'mcg_pathwayanalysis_mummichog.tsv' file. Metabolites that contribute to the statistically significant pathways (default P value ≤ 0.05) are stored in the 'mcg_metabolite_worksheet_mummichog.tsv' file. Putative metabolic identifications for all the m/z values in the results table are in the 'tentative featurematch_mummichog.tsv' file.
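Once downloaded, the tab-separated result files can be filtered outside the browser. The sketch below reads tab-separated text and keeps pathways at or below a P value cutoff. The column names ('pathway' and 'p_value') and the example rows are assumptions made for illustration; check the header of the downloaded 'mcg_pathwayanalysis_mummichog.tsv' file and pass the actual column names.

```python
import csv
import io


def significant_pathways(tsv_text, name_column, p_column, cutoff=0.05):
    """Return (pathway name, P value) pairs with P <= cutoff, most significant first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [
        (row[name_column], float(row[p_column]))
        for row in reader
        if float(row[p_column]) <= cutoff
    ]
    return sorted(hits, key=lambda hit: hit[1])


# Made-up rows with hypothetical column names, for illustration only
example = "pathway\tp_value\nglycolysis I\t0.01\npathway B\t0.20\npathway C\t0.003\n"
print(significant_pathways(example, "pathway", "p_value"))
# [('pathway C', 0.003), ('glycolysis I', 0.01)]
```

For a real file, the text can be obtained with `open(path, newline="").read()` after unzipping the 'results' folder.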
Pathway cloud plot. Access the pathway cloud (Fig. 6) from either the job summary page or inside the pathway results table. The x axis of the plot represents the percentage of overlapped metabolites, and the y axis represents the negative log of the P values. Each metabolic pathway is represented as a circle in the plot. The radius of each circle is proportional to the total number of metabolites identified in that pathway. The interactive pathway cloud plot allows users to zoom in on any part of the plot by drawing a rectangle with the cursor for a detailed view. Users can reset to the original plot by clicking on the 'Reset zoom' box in the top right of the graph.
Adjust the P value threshold on the left side of the pathway cloud plot to refine and display pathways with P values smaller than that threshold.
Adjust the bubble radius multiplier to optimize the plot view by tuning the bar on the top left side of the plot.
Hover the cursor over the circle to show pathway name, P value, metabolite overlap percentage and total numbers of genes, proteins and metabolites involved in the pathway.
Click on a pathway circle to display specific pathway result details underneath the pathway cloud plot with overlapping gene, protein and metabolite information. If multiple pathways are on a single point, they will all be tabulated.
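The geometry of the pathway cloud plot described above can be reproduced from three numbers per pathway. The sketch below is an illustration under stated assumptions: the logarithm is taken as base 10 (the base is not specified in the text), the radius is the total metabolite count scaled by a multiplier, and the function name and example values are hypothetical.

```python
import math


def cloud_point(overlapping, total, p_value, radius_multiplier=1.0):
    """Return (x, y, radius) for one pathway in a pathway cloud plot.

    x: percentage of overlapping metabolites
    y: negative log10 of the pathway P value (base 10 is an assumption)
    radius: proportional to the total number of metabolites in the pathway
    """
    x = 100 * overlapping / total
    y = -math.log10(p_value)
    radius = radius_multiplier * total
    return x, y, radius


# Hypothetical pathway: 5 of 10 metabolites dysregulated, P = 0.01
x, y, radius = cloud_point(5, 10, 0.01)
print(f"x = {x:.1f}%, y = {y:.1f}, radius = {radius:.1f}")
# x = 50.0%, y = 2.0, radius = 10.0
```

Pathways toward the upper right of such a plot combine high metabolite coverage with low P values.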
Stage 5: multi-omic integration
Timing: <1 min
Multi-omic data upload. Press the '+' button beside 'Multi-Omics Data' to open the 'Systems Biology Matching Parameters' window (Fig. 7) to manage the subjob for omics integration. The XCMS job ID and name will be listed for the current job. Click 'Upload' to open the uploader window.
Uploader window. Ensure that gene data are listed as gene symbols and that protein data are listed as either gene symbols or UniProt accession numbers. Both must be uploaded in either .csv or .tsv format. Make sure that the names themselves contain no commas, as this may result in improper matching. After selecting or dragging and dropping the file, wait for the upload to complete and click 'Save and Proceed' to close the window. If another file is to be uploaded, repeat this process.
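A gene or protein list can be checked for stray commas and written out as tab-separated text before upload. This is a minimal sketch: the one-identifier-per-row layout, the function name and the example symbols are assumptions, so compare the output with the demonstration files (Supplementary Data 1 and 2) for the layout actually expected.

```python
import csv
import io


def identifier_list_tsv(identifiers):
    """Return tab-separated text with one identifier per row.

    Raises ValueError if any identifier contains a comma, because commas in
    names may lead to improper matching.
    """
    cleaned = [name.strip() for name in identifiers]
    bad = [name for name in cleaned if "," in name]
    if bad:
        raise ValueError(f"identifiers containing commas: {bad}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for name in cleaned:
        writer.writerow([name])
    return buffer.getvalue()


# Illustrative gene symbols only
print(identifier_list_tsv(["pfkA", "pykF", "eno"]), end="")
```

The returned text can then be saved with a .tsv extension, for example with `open("my_genes.tsv", "w", newline="").write(text)`, where the file name is hypothetical.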
On the main window, indicate whether the file is gene or protein data under 'List Type'. Check to make sure that all the files have the right designation (protein or gene). If the file type is a protein list, ensure that the correct format is selected. Click 'Run matching subjob'.
The progress bar will show 100% once the job is complete. Access the matched results by clicking 'View Results', and the run logs can also be checked for completion statistics or error messages by clicking 'View Log'.
Stage 6: interpretation of multi-omic integration results
Timing: ∼20 min
Systems biology results. Download the overlap list in .tsv or .pdf format at the top of the table. Differentially expressed genes and proteins that are overlapping with the predicted dysregulated pathways can be found in the 'Systems Biology Results' tab with the dysregulated pathways (Fig. 8).
Gene results. Click on the overlapping gene number in a pathway to open a new window showing the percentage overlap and the list of overlapped genes.
Click the gene name in order to link to the BioCyc page containing the gene information.
Click on the reaction in order to open a page describing the enzymatic reaction for the encoded protein related to that gene.
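The percentage overlap reported for a pathway is, in essence, the share of the pathway's genes that also appear in the uploaded list. The sketch below illustrates this with a set intersection. It assumes exact, case-sensitive matching of gene symbols, which may differ from the matching rules applied by XCMS Online; the function name and the gene lists are hypothetical.

```python
def overlap_summary(pathway_members, uploaded):
    """Return (sorted shared identifiers, percentage of pathway members covered)."""
    members = set(pathway_members)
    shared = sorted(members & set(uploaded))
    percent = 100 * len(shared) / len(members) if members else 0.0
    return shared, percent


# Hypothetical pathway gene set and uploaded differentially expressed genes
pathway_genes = ["pfkA", "pykF", "eno", "gapA"]
uploaded_genes = ["eno", "pykF", "lacZ"]
shared, percent = overlap_summary(pathway_genes, uploaded_genes)
print(shared, f"{percent:.0f}%")  # ['eno', 'pykF'] 50%
```

The same calculation applies to protein and metabolite overlap; for example, 12 overlapping genes out of 18 in a pathway corresponds to 67%.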
Protein results. In a similar manner to that used for gene overlap, click on the overlapping protein number to show the percentage overlap and a list of overlapped proteins.
Click on the proteins in the list in order to link to the BioCyc page that shows related protein information.
Click on the 'gene ID (accession)' in order to link to the UniProt protein information and the genes that encode for it.
Click on the 'pathways involved' in order to open a list of pathways in which the protein is involved. Each pathway in that list opens a BioCyc page with metabolic pathway information.
Initial biological interpretation. Correlate overlapping genes and proteins with up- and downregulation of the metabolites to determine if over- or underexpression is occurring as a result of the treatment applied in the experiment.
Troubleshooting advice can be found in Table 1.
Steps 1–6, Stage 1: data upload: 30 s–5 min per file, depending on the size of the data set
Steps 7–16, Stage 2: parameter settings: 5–10 min
Steps 17 and 18, Stage 3: XCMS data processing and predictive pathway analysis: 1–3 h, depending on data set size and server queue
Steps 19–46, Stage 4: interpretation of pathway analysis results: 30–40 min
Steps 47–50, Stage 5: multi-omic integration: <1 min
Steps 51–59, Stage 6: interpretation of multi-omic integration results: ∼20 min
This protocol allows users to quickly generate pathway data directly from raw MS data and further interpret the dysregulated pathways on a systems level by incorporating gene and/or protein information. An example was provided using E. coli K12 MG 1655 cultures grown on different carbon sources: glucose and adenosine9. Metabolomic data were acquired by HILIC–MS in ESI negative mode, and transcriptomic data were generated using mRNA-seq technology on the same sample set (MetaboLights, study identifier MTBLS572; https://www.ebi.ac.uk/metabolights/MTBLS572; Supplementary Data 1). Dysregulated protein data were compiled from a literature search on proteome dysregulation for the same E. coli strain grown on the same carbon sources (Supplementary Data 2). Predictive pathway analysis results were generated using XCMS Online, resulting in a list of dysregulated metabolic pathways (Fig. 3); among them, 16 metabolic pathways had predicted P values ≤0.05. Dysregulated metabolic features involved in these pathways, such as pyruvate, fructose 1,6-bisphosphate and 3-phospho-D-glycerate, were listed in a table linked to the pathway results, as shown in Figure 4. The most significantly disrupted pathways and metabolites are related to glucose and adenosine metabolism, reflecting the complex cellular response and modulation of major processes, particularly the TCA cycle, upon the change in media. All dysregulated metabolic pathways were plotted in a Pathway Cloud Plot as part of the workflow, illustrated in Figure 5, providing user-friendly visualization and interpretation. The pathway analysis results for the carbon source stress study can be accessed after logging in to XCMS Online (https://xcmsonline.scripps.edu), clicking on the 'XCMS Public' menu (https://xcmsonline.scripps.edu/landing_page.php?pgcontent=listPublicShares) and searching for the job number '1172567' or name 'Ecoli_glucose-vs-adenosine'.
All the displayed results can also be downloaded by clicking 'Download Results' in the 'Results Summary page' of the XCMS job. Upon the completion of pathway analysis, transcriptomic and proteomic data were uploaded for multi-omic integration (Fig. 6). Several dysregulated metabolic pathways were further confirmed by the integration with transcriptomic and proteomic data (Fig. 7). For example, in the glycolysis I pathway, 12 out of 18 genes (67%), 7 out of 18 proteins (39%) and 5 out of 10 metabolites (50%) were significantly dysregulated (Fig. 8). Our systems biology platform can capture metabolic regulation of how E. coli responds to a change in carbon source on a system-wide level. It has wide applicability in many different areas of study, including cell culture, toxicity screening, drug development and safety, epidemiological and exposome applications, and even personalized medicine. This platform provides a fast and efficient method to process MS-based metabolomics data, quickly assess pathway dysregulation and correlate with proteomic and genomic data for a comprehensive systems analysis.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The authors thank the following for funding assistance: Ecosystems and Networks Integrated with Genes and Molecular Assemblies (ENIGMA), a Scientific Focus Area Program at Lawrence Berkeley National Laboratory for the US Department of Energy, Office of Science, Office of Biological and Environmental Research under contract number DE-AC02-05CH11231 (G.S.); and the National Institutes of Health (grants R01 GM114368 (G.S.) and PO1 A1043376-02S1 (G.S.)).