Introduction

Humans and our microbial inhabitants have co-evolved to form an aggregate organism. In the distal gastrointestinal (GI) tract, where microbes are at the greatest abundance, the complex interactions between host and microbes define health and disease. This dynamic relationship differs between individuals (Costello et al., 2009) and changes over time due to numerous variables including diet (Turnbaugh et al., 2009), environment (Spor et al., 2011) and host genetics (Turnbaugh et al., 2009). Intensive effort has been applied toward characterizing the microbiota composition and functionality. The MetaHIT consortium (Qin et al., 2010) and the Human Microbiome Project (Consortium THMP, 2012) among other initiatives have charted the diversity of the gut microbiota across the modern world. Elegant work using mouse models has established mechanisms underlying complex host–microbiota interactions (Ivanov et al., 2009; Hsiao et al., 2013; Smith et al., 2013), although relatively few studies have been extended to include mechanistic connections to human biology (Nicholson et al., 2005; Koeth et al., 2013). Potentially significant differences in the microbiota between animal models and humans, co-evolved interactions between a microbiota and its resident species (Snel et al., 1995; Ley et al., 2008) and difficulties in monitoring host responses in humans non-invasively have resulted in a major gap in our understanding of human–microbiota interdependencies. To complement the wealth of sequencing-based microbiota characterization, discovery-based efforts are necessary to measure the numerous ways hosts respond to their resident microbiota and identify novel proteins and pathways that mediate interaction (Box 1).

Current methods of host response analysis

Most analyses of host responses to the microbiota have been conducted with targeted, single-protein level molecular approaches. Although methods, including western blotting, ELISA and quantitative PCR, reliably validate functions and interactions of candidate molecules, they must be individually tailored to specific hypotheses, and are thus inherently low-throughput and narrow in scope. In an extremely complex, ill-defined system like the human gut, targeting only a few molecules for measurement is likely to overlook key players. Discovering physiological responses in an untargeted fashion is necessary for generating new hypotheses in a manner commensurate with ongoing, global, microbiota sequencing efforts (Table 1).

Table 1 The strengths and weaknesses of the various widely-used, global platforms for monitoring host-responses to changes in the gut microbiota

Large-scale transcriptional assays (RNAseq, microarray) of intestinal tissue provide a highly sensitive analysis of microbiota-influenced gene expression that has been applied in both human and model organism contexts (Laukens et al., 2006; Athanasiadou et al., 2011; El Aidy et al., 2012). Although the sensitivity of sequencing is unmatched in the '-omics' space, the vast constellation of cell types in the gut could confound these results by obscuring the contribution of ecologically important but low-abundance cells. An ideal assay would independently measure the contribution of each cell type from a single biopsy (Box 1). For example, goblet cells have an essential role in mucus production, which creates a barrier between the host and microbiota, but these cells only represent 4–16% of the intestinal epithelium (Kim and Ho, 2010). Laser-capture microdissection (Cash et al., 2006) and flow cytometry (Habib et al., 2012) have provided partial solutions for this problem by isolating cell types of interest. However, laser-capture microdissection is technically challenging and flow cytometry only separates cell types that can be unambiguously labeled and requires tissue disaggregation. Cell type-specific transcriptional assays are subject to the same sample limitations common to all transcript-level measurements: they are generally crude indicators of protein abundance (Gygi et al., 1999; Ideker et al., 2001) and cannot describe protein localization, chemical modifications or functional interactions. In the GI tract most direct host–microbe interactions are mediated by host proteins that are either secreted or present on the apical epithelial surface, and may represent a small signal that would be obscured by less relevant readouts of abundant intracellular transcripts. Furthermore, measuring host transcription requires tissue samples obtained through invasive procedures in humans, or sacrifice of animal models, thus establishing a substantial barrier to obtaining samples, particularly for the purpose of monitoring an individual's response to therapeutic interventions.

Imaging approaches have proven essential in differentiating disease states and severity. Histological sections were used to demonstrate the improvement of colitis in mice receiving daily administration of Faecalibacterium prausnitzii (Sokol et al., 2008) and in concordance with molecular approaches to describe the improvement in inflammation via treatment with acetate (Maslowski et al., 2009). Histology also provides spatial resolution to protein expression, both across the length of the gut and within a particular tissue. Conversely, the human requirement for pathology scoring increases variability. Furthermore, as imaging primarily measures cell or tissue morphological changes, it is less sensitive than molecular approaches and often requires many samples to achieve statistically meaningful results. In addition, biopsies are highly invasive, and only represent a small region of the tissue. More recently, in situ hybridization with species-specific, fluorescent probes (Swidsinski et al., 2005, 2007; Swidsinski and Sydora, 2007) has localized microbes within the intestinal space, adding an important dimension to unraveling the host–microbe relationship.

Metabolomics has shown great promise in the characterization of host–microbiota interactions. Metabolites in serum and feces, measured by one-dimensional nuclear magnetic resonance, provided the first glimpse of host–species interactions and temporal variation in the fecal metabolome (Saric et al., 2008). Similar studies correlated fecal metabolites with inflammatory bowel diseases (Le Gall et al., 2011) and associated fecal and urine metabolites with antibiotic treatments (Yap et al., 2008). Parallel approaches using mass spectrometry have correlated shifts in metabolic profiles measured from serum, urine and feces to underlying changes in the microbiota, but are unable to assign >95% of the features to known metabolites (Wikoff et al., 2009; Marcobal et al., 2013). Metabolomics thus holds promise for global GI studies. However, the absence of compound databases necessary to identify metabolites by either nuclear magnetic resonance or mass spectrometry and the difficulty of deconvoluting host–microbiota co-metabolism are serious hurdles that need to be addressed.

As the focus of microbiome research in model organisms (and eventually humans) turns toward mechanistic experimentation and retrospective longitudinal assessments of large subject cohorts, non-invasive methods for assaying host health will be essential. Blood represents a commonly sampled biofluid from which host–microbe interactions can be measured. For example, serum cytokine profiles have suggested mechanisms by which gut microbes induce adaptive immune responses (Sokol et al., 2008). Cytokine measurements are sensitive and reliably represent systemic immune responses. However, as it is physically separated from the intestinal space and contacts many other body regions, serum cannot necessarily provide GI-specific immune profiles indicative of the current host–microbiota status. Furthermore, serum-based assays cannot quantify strictly gut-resident molecules, such as intestinal anti-microbial peptides, immunoglobulins, metabolic enzymes and mucus proteins.

Stool offers many advantages for measuring host responses to the microbiota within the GI tract. Importantly, stool is acquired non-invasively and contains molecules of both host and microbial origin from within the gut ecosystem. Whereas molecules identified from tissue biopsies and blood can serve as proxies for microbiota–host interaction in the gut, molecules from stool can directly describe these interactions without confounding non-GI contributions. Moreover, as proteins in frozen fecal specimens are amenable to analysis, previously conducted microbiota-focused experiments can be revisited to provide additional host-specific insight.

Secreted host proteins have a key role in the dynamics of host–microbial interactions (Vaishnava et al., 2011) and are conveniently measured from feces. Mass spectrometry-based proteomics is uniquely suited to discovering the proteins at the host–microbiota interface.

Proteomics in the gut

Proteomic studies of intestinal epithelial cells have characterized postnatal intestinal development (Hansson et al., 2011) and chronic inflammation (Shkoda et al., 2007). Unfortunately, these studies were subject to many of the same limitations as bulk transcriptomic analyses insofar as they had to surmount the noise of intracellular protein expression. Laser-capture microdissection of intestinal epithelial cells provides a limited amount of sample material, posing a significant constraint on the use of shotgun proteomic techniques. To identify and quantify the proteins secreted into the gut lumen, proteomic workflows had to adapt to the complexities of stool (Box 2).

In line with early proteomics approaches, the first studies of stool proteins relied on two-dimensional gel separation and spot excision (Klaassens et al., 2007). In 2009, the first global metaproteomic profile of human stool was conducted. Verberkmoes et al. (2009) used sequenced metagenomes from two previously surveyed individuals and 33 genomes of isolated commensal microbial species to enumerate candidate proteins in stool. Using this resource, they generated what, at the time, was the deepest coverage of a bacterial metaproteome (Table 2). By pelleting bacteria from fecal samples, bacterially associated host proteins and sloughed host epithelial cells contributed to the ~500 host-derived proteins they identified. In a subsequent study, these researchers improved their bacterial protein identification rates by using a matched metagenome–metaproteome approach: metagenomic data derived from individual fecal specimens were used to create matched protein sequence databases to aid in mass spectrometry-based protein identification (Erickson et al., 2012). Predicting the fecal proteome more accurately likely contributed to improved analytical sensitivity: >3000 bacterial and >1600 host proteins were identified across 12 individuals with Crohn's disease. Through functional annotations and statistical analysis of host protein expression, they identified impaired epithelial integrity and decreased epithelial absorption as hallmarks of ileal Crohn's disease.

Table 2 Workflow comparison of traditional metaproteomics and host-centric proteomics

This matched metagenome–metaproteome approach is the current state of the art in whole-feces metaproteomics and addresses the issues of protein diversity and variability (Box 2). However, many challenges still exist. First, deep sequencing and de novo genome assembly are expensive, time consuming and technically challenging. Second, these approaches do not address dietary protein sources, which contribute to the GI ecosystem as well as to protein diversity. Third, proteomic data are reliant on the quality of genome annotations, which has yet to keep pace with the 10 million identified genes from the gut microbiota (Li et al., 2014). Finally, these metaproteomic approaches fall victim to the unavoidable issue of dynamic range (Box 2), even when considering recent advancements in mass spectrometry instrumentation. Although modern proteomic techniques are capable of measuring thousands of proteins from just hundreds of nanograms of input material (for example, see Hughes et al. (2014)), the competition from host, microbial and dietary proteins could decrease analytical sensitivity by orders of magnitude. Thus, it remains to be seen for which applications proteomics of microbial proteins in the gut will contribute beyond the superior depth of RNAseq and metagenomics.

Proteomics as a tool for analyzing host responses

Host responses to the dense and dynamic intestinal ecosystem are often mediated by the secretion of proteins into the gut lumen (Vaishnava et al., 2011; Vaishnava and Hooper, 2007). Depleted of abundant intracellular proteins that often cloud metaproteomic studies, the extracellular component of the gut metaproteome offers a direct comparison of proteins produced by both the host and microbes that mediate their interactions. Importantly, none of the previously described host response analyses are capable of quantifying the relative abundance of these secreted host proteins in an untargeted and non-invasive way. We previously described a stool-based host-centric proteomics approach in which we identified and quantified thousands of host proteins present in the mouse intestinal lumen (Figure 1; Table 2). By removing intact cells (both microbial and host) and fecal debris, we were able to enrich for secreted proteins, many with well-characterized GI functions, such as digestion and immune response, and which serve as reporters of host physiological status in response to changing microbiota composition and function (Lichtman et al., 2013).

Figure 1
Gastrointestinal metaproteomics. Perturbations, like diet and infection, cause distinct changes to the host and microbiota. Protein-level changes throughout the gastrointestinal tract exist in the stool and can be assayed by mass spectrometry. Microbial-focused approaches pellet intact cells of both host and microbial origin, whereas the host-centric approach focuses on luminal proteins.

Many challenges remain in the analysis of host proteins in stool. The high concentration of molecules incompatible with sensitive liquid chromatography-mass spectrometry systems is an important issue to address for these methods to be widely accepted. Although the centrifugation strategy we previously described eliminates intact cells, sloughed epithelial cells that are broken down in the gut could still contribute intracellular proteins to the supernatant. Furthermore, identifying constitutive, stably expressed proteins in the gut will improve data normalization, much as is done with housekeeping genes in intracellular assays. Perhaps the greatest challenges are posed by clinical experiments, in which proteins will need to be associated with particular disease and microbiome states despite tremendous inter-individual variability in microbiota and overall stool composition. Normalization procedures that leverage stool-compatible quantitative labeling strategies (for example, reductive dimethylation (Hsu et al., 2003), tandem mass tags (Werner et al., 2014), isobaric tags for relative and absolute quantitation (iTRAQ) (Ross et al., 2004) and stable isotope labeling in mammals (Wu et al., 2004)) or hypothesis-driven, targeted mass spectrometry analysis methods (for example, multiple reaction monitoring (Kennedy et al., 2014) and sequential window acquisition of all theoretical fragment ion spectra (SWATH-MS) (Gillet et al., 2012)) stand to improve upon the label-free quantification we previously used (Lichtman et al., 2013).
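
The housekeeping-style normalization mentioned above can be illustrated with a minimal sketch. It assumes that a set of stably expressed reference proteins has already been identified, which is exactly the open problem noted in the text. The protein names (REF_A, REF_B, TARGET) and intensities are hypothetical, and dividing by the median reference intensity is only one of several plausible choices.

```python
from statistics import median

def normalize_to_reference(sample, reference_proteins):
    """Scale label-free intensities by the median intensity of reference proteins.

    sample: dict mapping protein name -> raw intensity for one stool specimen.
    reference_proteins: names assumed (hypothetically) to be stably expressed.
    """
    ref_values = [sample[p] for p in reference_proteins if p in sample]
    if not ref_values:
        raise ValueError("no reference proteins were quantified in this sample")
    scale = median(ref_values)
    if scale <= 0:
        raise ValueError("reference intensity must be positive")
    return {protein: intensity / scale for protein, intensity in sample.items()}

# Two hypothetical specimens that differ only by a twofold loading difference.
sample_a = {"REF_A": 100.0, "REF_B": 200.0, "TARGET": 50.0}
sample_b = {"REF_A": 200.0, "REF_B": 400.0, "TARGET": 100.0}
references = ["REF_A", "REF_B"]

norm_a = normalize_to_reference(sample_a, references)
norm_b = normalize_to_reference(sample_b, references)
print(norm_a["TARGET"], norm_b["TARGET"])
```

After scaling, the target protein has the same normalized value in both specimens, so the loading difference no longer masquerades as a biological change.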

The application of this host-centric perspective has implications for basic biomedical research. Just as sequencing provided a powerful method to survey the microbial community, a similar approach is needed to reveal novel aspects of gut biology such as the discovery of pathways and effectors, unique biological states and signatures. Host-focused proteomics of stool promises to point the field toward proteins with yet-to-be-defined roles in governing interaction within the GI tract. The combination of microbial community, metabolite and host protein analyses from stool coupled with the application of gnotobiotic animal models and time-course studies will greatly improve our understanding of the proteins that mediate harmony within this complex ecosystem, and those that are signals of interactions going astray.

In clinical research, host-centric proteomics provides an orthogonal approach that could directly affect precision patient care. A multi-dimensional definition of GI states will allow for increased power in differentiating closely related but discrete states, the stratification of patients, individualized treatment and the monitoring of disease progression and recovery. These molecular phenotypes also provide a first step toward biomarker discovery. Signatures may be distilled to one or a few proteins that provide a simple means for diagnosing a spectrum of GI diseases. Such markers may be directly related to the underlying disease mechanism and direct pharmaceutical development, or they may serve as a proxy for specific biological events. As the stool is an aggregate read-out of the entire GI tract, biological insight need not be confined to the colon. Mapping protein signatures back to specific regions from which they originate promises clinical rewards. For example, enzymes produced in the pancreas have been assayed in stool as a metric for pancreatic function for more than 20 years (Loser et al., 1996; Lankisch, 1993).

In summary, we are now armed with high-throughput means to elucidate host intestinal states without invasive testing procedures. Host-centric proteomics is therefore compatible with long time-course evaluations of human subjects. These host-centric methods can be directly applied to banked stool samples from previous microbial studies, enriching prior data without the need for new specimen procurement. Although individualized microbiota present challenges in establishing microbial signatures as markers of disease, host responses are likely to be more conserved. Rapidly improving mass spectrometry instrumentation, advancements in de novo peptide sequencing and multiplexed quantification tools are diminishing the problems associated with proteome complexity and dynamic range and thus greatly improving our abilities to dive deeper into the gut proteome (Box 2). Although discovery-based mass spectrometry techniques will not likely be a clinical tool for diagnosing patients, the markers identified could be readily transferred into targeted mass spectrometry (Addona et al., 2009; Whiteaker et al., 2011) or ELISA assays that are readily adopted by clinical labs.

Protein diversity

Standard shotgun proteomic workflows compare experimental fragmentation spectra of individual peptides to hypothetical spectra calculated from a database of potential peptide matches. Analyzing human stool from thousands of individuals across the world, metagenomic sequencing efforts have assembled between five and nine million unique open reading frames, corresponding to ~1 billion unique tryptic peptides. This is over 200 times larger than the human proteome and results in increased computational time and a greatly decreased ability to distinguish correct identifications from spurious matches. Single-nucleotide polymorphisms and post-translational modifications, though biologically relevant, further exacerbate the search space problem, and are best addressed through alternate means, including iterative database search procedures (Bern et al., 2012) and de novo sequencing (Ma et al., 2003).
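
To make the notion of a peptide search space concrete, the following is a minimal sketch of in silico tryptic digestion, the step that turns a protein sequence database into candidate peptides. It assumes the common simplified trypsin rule (cleavage after K or R, but not before P), no missed cleavages, and an arbitrary peptide length window. The function names and toy sequences are illustrative and are not taken from any of the cited studies. It relies on Python 3.7 or later, where `re.split` splits on zero-width matches.

```python
import re

# Assumed simplified trypsin rule: cleave C-terminal to K or R, but not before P.
_TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")

def tryptic_peptides(protein, min_len=7, max_len=30):
    """Return fully cleaved tryptic peptides within an assumed length window."""
    fragments = _TRYPSIN_SITE.split(protein)
    return [f for f in fragments if min_len <= len(f) <= max_len]

def unique_peptides(proteins, **kwargs):
    """Collect the set of distinct peptides across many protein sequences."""
    peptides = set()
    for protein in proteins:
        peptides.update(tryptic_peptides(protein, **kwargs))
    return peptides

# Hypothetical toy sequences, not real gut proteins.
toy_database = ["MKTAYIAKQRPLSDEEKGGR", "MKTAYIAKWWWR"]
print(sorted(unique_peptides(toy_database, min_len=1)))
```

Every additional open reading frame adds peptides to this set, and each peptide is a candidate that a search engine must score against every spectrum, which is why the search space grows with the size of the metagenomic database.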

Individual variability

Although the total sequenced metagenomic space is 5–9 million open reading frames, any single individual harbors only a subset of these genes. Accordingly, individual-to-individual variation in the microbiota is extensive (Costello et al., 2009). Paradoxically, considering all known metagenomic open reading frames would increase the frequency of false-positive identifications, while a more focused, though unmatched, sequence database would not contain every protein in the sample. In the former case, database search engines are more likely to make an incorrect assignment when they are presented with more candidate peptide matches to an input spectrum, even when the true peptide is considered (Resing et al., 2004). In the latter, database search engines almost always return their 'best guess' as to the peptide source of a mass spectrum, even if the true source is missing from the sequence database. In both cases, higher numbers of false-positive identifications greatly decrease the ability to clearly discern which identifications are actually correct, and therefore decrease the sensitivity of the metaproteome analysis.
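
The first case above can be illustrated with a deliberately simple toy model. It is not how any particular search engine scores spectra. Assume each incorrect candidate peptide independently has a small probability `p_outscore` of scoring higher than the true peptide. The chance that at least one of `n_wrong` incorrect candidates wins is then 1 - (1 - p_outscore)^n_wrong. Both parameter names and the values used are hypothetical.

```python
def prob_spurious_top_hit(n_wrong, p_outscore):
    """Probability that at least one wrong candidate outscores the true peptide.

    Toy assumption: wrong candidates outscore the true match independently,
    each with probability p_outscore.
    """
    if n_wrong < 0 or not 0.0 <= p_outscore <= 1.0:
        raise ValueError("n_wrong must be >= 0 and p_outscore within [0, 1]")
    return 1.0 - (1.0 - p_outscore) ** n_wrong

# Hypothetical per-candidate probability and candidate counts per spectrum.
for n_wrong in (10, 1_000, 100_000):
    print(n_wrong, round(prob_spurious_top_hit(n_wrong, 1e-5), 4))
```

Under this assumption the error probability rises steadily with the number of candidates, even though the true peptide is always present in the database.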

Dynamic range

If it can be assumed that the dynamic range of protein concentrations within a bacterium is on the order of 10^6 and the dynamic range of bacteria concentrations in the gut is, conservatively, on the order of 10^10, then the total bacterial protein dynamic range is 10^16. Compared with the quantitative dynamic range of 10^5 in shotgun proteomics, 10^16 is a daunting challenge. Cell- and tissue-based shotgun proteomics analyses tend to use fractionation to extend dynamic range by separating abundant and less-abundant proteins but this has only been shown to improve the dynamic range slightly and comes at the cost of more analysis time on the mass spectrometer. This problem is evident in metaproteomic studies where only the most abundant (and well-characterized) bacterial proteins have been identified.
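
The arithmetic behind this estimate can be written out explicitly, taking the stated orders of magnitude at face value (10^6 within a bacterium, 10^10 across bacteria, 10^5 for shotgun proteomics). The symbols D are introduced here only for illustration.

```latex
\[
D_{\text{total}} = D_{\text{protein}} \times D_{\text{bacteria}}
                 = 10^{6} \times 10^{10} = 10^{16},
\qquad
\frac{D_{\text{total}}}{D_{\text{shotgun}}} = \frac{10^{16}}{10^{5}} = 10^{11}
\]
```

Under these assumptions the sample spans roughly eleven orders of magnitude more than a single shotgun analysis can quantify.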

Solutions

Both matched metagenomics–metaproteomics and de novo peptide sequencing are viable solutions to the problems of diversity and variability but come with their own limitations. The dynamic range problem cannot be readily resolved with instrumentation or blind fractionation alone. The utilization of gnotobiotic animal models to decrease the microbial complexity and control the dynamic range may help, but collecting deep proteomic signatures of the entire gut microbiota is not technically feasible, and gnotobiotic simplification is not possible in humans. The tried-and-true principles of proteomics, fractionation and enrichment, are equally valuable in the metaproteome context. In the gut, this can include the rational enrichment of sub-proteomes, like those at the epithelial–luminal interface or the secreted proteome. Furthermore, these sub-proteomes may actually contain the proteins that actively define the host–microbiota relationship.