Advances in chemistry and massively parallel detection underlie DNA-sequencing platforms that are poised for application in personalized medicine. In stark contrast, systematic generation of protein-level data lags well behind genomics in virtually every aspect: depth of coverage, throughput, ease of sample preparation and experimental time. Here, to bridge this gap, we develop an approach based on simple detergent lysis and single-enzyme digest, extreme, orthogonal separation of peptides and true nanoflow liquid chromatography-tandem mass spectrometry that provides high peak capacity and ionization efficiency. This automated, deep efficient peptide sequencing and quantification mass spectrometry platform provides genome-scale proteome coverage equivalent to RNA-seq ribosomal profiling and accurate quantification for multiplexed isotope labels. In a model of the embryonic to epiblast transition in murine stem cells, we unambiguously quantify 11,352 gene products that span 70% of Swiss-Prot and capture protein regulation across the full detectable range of high-throughput gene expression and protein translation.
The field of DNA and RNA sequencing has progressed at a remarkable rate since release of the initial human genome draft sequences1,2. Genotype–phenotype association studies continue to elucidate large genomic regions that are amplified, deleted or otherwise altered in the context of human disease, although identification of specific causal gene elements has proven to be a significant challenge3. Despite these and other successes4, primary DNA sequence data does not capture downstream regulatory information for protein translation, degradation and post-translational modification status5,6,7. These data are critical components of studies designed to quantitatively monitor biological response to perturbation or build predictive models of cellular physiology8. As a result, there remains a clear and unmet need for methodologies that can provide systematic and scalable sequence characterization of proteins as an important functional complement to DNA and RNA data.
The wide dynamic range of protein expression and vast array of post-translational modifications, coupled with the lack of an analyte amplification strategy analogous to PCR, presents tremendous challenges for genome-wide protein characterization, particularly for signal transduction and other key regulatory factors that are often present in low abundance. As a result, the majority of shotgun sequencing approaches frequently rely on low-throughput pre-fractionation (for example, before liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis) of sub-cellular compartments or intact proteins to improve dynamic range. Although these techniques are subject to continued improvements, they have not yet achieved genome-scale protein identification and quantification. Moreover, in many cases the labour-intensive nature of pre-fractionation protocols hinders widespread adoption and standardization.
In this work, we forgo cellular- or protein-level pre-fractionation altogether and instead rely on direct detergent-based protein solubilization, followed by single-enzyme trypsin digest and extensive, fully automated temporal separation of peptides through multiple physicochemically orthogonal stages: high-pH reversed phase (RP) and strong anion exchange (SAX) dimensions coupled in series and directly with a narrow-bore, extended-length (25 μm × 100 cm) low-pH RP analytical column operated in a true nanoflow regime. The latter chromatographic stage provides high peak capacity separation in a third orthogonal dimension, along with increased electrospray ionization efficiency for improved detection. The figures of merit for these individual components combine to yield an automated DEep Efficient Peptide SEquencing and Quantification (DEEP SEQ) mass spectrometry platform that provides unprecedented separation capacity, rapid sequencing speed and quantification of proteins across the entire range of mammalian gene expression and protein translation. We utilized multiplexed stable isotope labels in the context of a model designed to profile early perturbations in the naive, self-renewing ground state of murine embryonic stem cells (mESCs) to quantify 211,535 unique peptide sequences that mapped unambiguously to 11,352 gene products. These results span ~70% of the highly curated Swiss-Prot database, capture a vast majority of known pluripotency factors and provide a depth and scale of proteome coverage commensurate with genome-wide analysis of protein translation by RNA-seq ribosomal profiling9.
Peptide quantification by DEEP SEQ mass spectrometry
Multiplexed isotope labels have emerged as an enabling technology to increase the throughput of quantitative proteomic studies and provide increased flexibility with respect to experimental variables (time course, dose response and so on)10. As these reagents have enjoyed wider use, it has become apparent that quantification accuracy degrades significantly as the complexity of target analyte mixtures increases. Simultaneous fragmentation of precursors that overlap in m/z leads to a compression of discrete peptide ratios towards the mean of all measured values11,12,13. We reasoned that the extreme temporal separation provided on our protein deep-sequencing platform would mitigate deleterious ratio compression effects. To explore this hypothesis, we created a mixed-species quantification model12,13 in which the contaminant peptides were present at a mass ratio of ~6- and ~32-fold greater as compared with the target species, respectively (Fig. 1a). We acquired data at a depth of 20 DEEP SEQ fractions (Methods) with an LTQ-Orbitrap XL and used multiplierz14,15 to compile and analyse extracted ion chromatograms for all detected peptides. We observed an average chromatographic peak width measured at half-height of 27.8 s (Fig. 1b), with over 97% of all detected peptides spanning no more than two adjacent fractions (Fig. 1c), yielding an empirical peak capacity16 of ~1.3 E4. Importantly, projection of peptide sequence identifications as a function of first-dimension organic and second-dimension salt concentrations (Fig. 1d) revealed that peptides were distributed throughout the entire separation space, confirming that our platform provided orthogonal and extensive peptide separation in these experiments. We next plotted isobaric tags for relative and absolute quantitation (iTRAQ) ratios for species-specific peptides and compared these with ratios observed in a conventional, single-dimension LC-MS/MS analysis of the same sample mixture. The median ratio for the mixed-species channels (116:114) in the standard LC-MS/MS analysis was compressed by some 40% as compared with the single-species ratios (117:115); however, the equivalent analysis with DEEP SEQ mass spectrometry revealed a relative ratio compression of only 8% (Fig. 1e and Supplementary Data 1). Inspection of individual peptide spectra (Fig. 1f) reveals the marked improvement in quantification data provided by our DEEP SEQ platform.
Genome-wide protein quantification comparable to ribo-seq
To further explore the performance of our protein deep-sequencing platform, we sought to quantify changes in mESC protein expression resulting from withdrawal of leukaemia inhibitory factor (LIF), a cytokine required to maintain self-renewal in these cells. After iTRAQ labelling, we acquired MS/MS data at a depth of 20 DEEP SEQ fractions (Methods) on a 5600 Triple TOF mass spectrometer (Fig. 2a). A single experiment that required approximately 1 day for sample preparation and another 8 days for data acquisition yielded nearly 2 million MS/MS spectra and 0.6 million peptide spectral matches (PSMs) based on a search against the UniProt database that is composed of Swiss-Prot (16,502 mouse genes) plus Trembl (9,769 mouse genes) and contains 26,271 mouse genes in total. These PSMs corresponded to 180,867 peptides (≤1% false discovery rate (FDR), including chemical modifications). Following the convention described by Qeli et al.17, we mapped the set of 128,513 non-redundant peptide sequences into the genome as follows: first, 107,722 peptides were uniquely assignable to 9,818 gene IDs; we refer to these as ‘Class I’ peptides and genes, respectively (Fig. 2b). Of the remaining 20,791 peptides, 2,819 sequences mapped to at least two of another 1,737 gene IDs (‘Class II’ peptides and genes, respectively, Supplementary Data 2); importantly, Class II peptides do not map to Class I genes. The remaining 17,972 peptides (‘Class III’ peptides) map to multiple genes, including one or more Class I genes. Class III peptides are included in the count of total peptide detections but are not otherwise considered for purposes of protein identification. Cumulatively across three biological replicate experiments, comprising 24 days of data acquisition, we obtained ~5.9 million MS/MS spectra, ~1.8 million PSMs, and identified 211,535 non-redundant peptide sequences (178,167 Class I, 3,496 Class II and 29,872 Class III peptides, respectively). This peptide set encompassed 13,075 and 2,824 Class I and II genes, respectively (Fig. 2c), with the former (Class I genes) spanning 50% of UniProt and nearly 70% of the manually curated, non-redundant Swiss-Prot mouse database (Fig. 2d).
Two recent reports18,19 that sought to analyse global protein expression in mESC utilized pre-fractionation of sub-cellular compartments, proteins or peptides (or some combination) before LC-MS/MS. We took the union of published data from these studies and found that results from our DEEP SEQ platform encompass ~73% and ≥95% of the reported peptide and gene IDs, respectively, while adding a significant quantity of new sequence information (Fig. 3a and Supplementary Data 3). Although the fractionation approaches, mass spectrometry instrumentation and database search engines varied across these three studies, we observed very similar physicochemical properties for peptides identified in each data set (Fig. 3b). To explore the dynamic range of proteins identified across these data, we next overlaid Class I and II genes derived from each peptide set with mRNA expression data for mESC20 (Fig. 3d and Supplementary Data 4). Class I genes identified in our data spanned the full dynamic range of gene expression, with new Class I peptides identified through additional replicate experiments generally mapping to genes expressed at lower levels. Moreover, peptides identified by the two previous studies18,19 were biased to high-expression genes as compared with those in our data. Strikingly, the addition of Class II genes from our DEEP SEQ analysis improved overall proteome coverage only incrementally as compared with the set of Class I genes alone, with a small bias towards low-expression genes. Given the limitations of microarray data as a surrogate for protein expression, we next sought to compare our proteomic data with that from ribosomal profiling by RNA-seq, a technique that monitors protein translation on a genome-wide scale21. In a recent ribosomal profiling study, Ingolia et al.9 measured translation of 12,674 transcripts from 19,022 protein-coding genes in mESC. Our set of Class I and II genes spanned an equivalent fraction of the total gene space (63%) and encompassed 81% of all translation events represented in the ribosomal profiling data (Fig. 3e and Supplementary Data 5). Again, we observed that our deep-sequencing data spanned the entire dynamic range of protein translation as represented by ribosomal profiling, including significant coverage (~42%) of low-frequency translation events (Fig. 3f). As was observed in the comparison with gene expression by microarray, our data exhibited significantly higher coverage of low-expression genes as compared with that from previous studies of the mESC proteome18,19, while inclusion of DEEP SEQ Class II genes provided only a negligible improvement in proteome coverage. The latter observation is particularly notable, given that Class II genes are defined by shared peptides; as a result, the data for these gene products are ambiguous with respect to identification and quantification. For these reasons, we focused the remainder of our analyses exclusively on Class I genes. Collectively, these results suggest that our protein deep-sequencing platform provides accurate quantitative data for multiplexed stable isotope reagents while simultaneously maximizing the discovery potential in the analysis of complex mammalian proteomes. In fact, using very stringent criteria for unambiguous protein identification (Class I peptides), our DEEP SEQ mass spectrometry approach provides data that span the full dynamic range of mammalian protein expression.
Functional proteome coverage by DEEP SEQ mass spectrometry
Regulatory proteins such as transcription factors, kinases and other signal transduction factors are often underrepresented in mass spectrometry-based whole-proteome studies. Hence, we next sought to evaluate the coverage of functional protein classes provided by our DEEP SEQ platform. We found that the set of Class I peptides mapped unambiguously to ≥40% of genes across gene ontology (GO) functional categories, regardless of the evidence filter used (Fig. 4a). We also observed that the coverage varied across categories, from nearly 100% for genes encoding ribosome and proteasome proteins to approximately 20% for transmembrane receptors; the same trend was observed for data derived from ribosomal profiling, suggestive of a general correlation between coverage and protein abundance. This hypothesis was further corroborated by mapping functional protein classes to gene expression in mESC (Fig. 4c). Consistent with these data, we found that replicate experiments tended to augment representation of proteins encoded by genes expressed at low levels. Overall, our set of stringently defined Class I genes spanned 52% of the functional proteome, on par with coverage provided by ribosomal profiling (56% coverage); importantly, our DEEP SEQ mass spectrometry data captured ≥70% of key regulatory protein classes, including kinases, phosphatases and transcription factors.
Quantification of LIF-dependent functional proteome in mESC
To explore the functional proteome characteristic of mESC, we next compiled two reference sets of pluripotent factors derived from genetic depletion screens22,23 and proteins biochemically associated with the master regulatory transcription factors Oct4 and Nanog24,25,26 (Supplementary Data 6). The set of Class I genes from our biological triplicate analysis encompassed ~81% (Fig. 5a) and ~90% (Fig. 5b) of these reference gene sets, respectively, suggesting that our DEEP SEQ mass spectrometry platform can capture regulatory events associated with key mediators of pluripotency in mESC. A plot of quantitative data (Fig. 5c) revealed numerous proteins (Table 1) whose expression level reproducibly increased or decreased in response to LIF withdrawal (Supplementary Data 7). Query of these proteins against the functional and biochemical reference sets demonstrated an enrichment (Fisher’s exact test, P-value=2.1 E-5) for pluripotent factors (Fig. 5d). Biochemical validation (Fig. 5e) of LIF-mediated expression for a sub-set of proteins confirms the power of DEEP SEQ mass spectrometry to provide accurate quantification for multiplexed isobaric labels in the context of genome-wide proteome profiling.
Continued advances in our understanding of genome variation2, gene expression27,28 and translation9, driven in part by the advent of next-generation DNA/RNA-sequencing technologies3, have effectively re-kindled interest in experiments designed to maximize discovery potential and simultaneously quantify known genomic or other biomolecular events. Establishing a parallel trajectory for this paradigm in proteomics is complicated by the broad range of protein abundance, diverse repertoire and stoichiometry of post-translational modifications, as well as the stochastic nature of discovery-driven or shotgun MS/MS acquisition methods. In fact today, nearly two decades after Wilkins et al.29 first coined the term ‘proteome’, the dynamic range of protein expression in mammalian systems has represented an insurmountable hurdle for genome-wide proteome quantification. Despite these obstacles, the functional content of protein-level data represents an important complement to genome-based studies, and hence there remains significant motivation to develop scalable platforms for proteomic analyses.
To achieve true, genome-scale sequence coverage along with high-fidelity protein quantification, we sought to significantly improve the analytical figures of merit for each component of our DEEP SEQ mass spectrometry platform. First, the use of physicochemically orthogonal high-pH RP and SAX separation stages16,30 coupled with a narrow-bore (25 μm i.d.), extended-length (100 cm) low-pH analytical column31 provides extreme temporal separation of peptides with an empirical peak capacity of ~1.3 E4. Our experience with online multidimensional separations16,30 indicates that further optimization of first- and second-dimension eluent concentrations can yield an improved distribution of peptides across each separation dimension, ultimately providing deeper proteome coverage beyond the ~70% achieved herein. Second, the final dimension column geometry maintains the integrity of chromatographic separation at ultra-low effluent flow rates (~5 nl min−1), thus maximizing electrospray ionization efficiency32,33. Third, all separation stages in the DEEP SEQ configuration are implemented in microcapillary format and coupled in series, with the final dimension interfaced directly to the mass spectrometer, providing for fully automated operation, along with efficient capture and transfer of peptides across all separation stages.
Based on a stringent peptide-to-gene ID criterion, our DEEP SEQ mass spectrometry analysis of mESC provided quantitative expression data for 11,352 out of 16,502 genes in Swiss-Prot (~69%, Fig. 2e), a depth of coverage equivalent to a recent study in which RNA-seq ribosomal profiling was used to measure translation of 12,674 out of 19,022 (~66%) protein-coding genes contained in the UCSC mouse database9. Importantly, our data provide significant coverage (~70–85%) for key regulatory protein families, including kinases, ubiquitin ligases and transcription factors (Fig. 4b). In fact, proteins quantified on our deep-sequencing platform span the full dynamic range of corresponding gene expression and protein translation profiles (Fig. 3d–f). Not surprisingly, improvements in proteome coverage across triplicate experiments were most pronounced for low-expression protein families (Fig. 4c).
The absence of protein-amplification strategies analogous to PCR, combined with the limited improvements in dynamic range offered by each generation of mass spectrometry hardware, places a disproportionate burden on separation techniques to achieve robust detection of low-abundance proteins. Despite numerous methodological studies, the choice of separation strategies that will provide the best combination of sample yield, experiment time and, ultimately, proteome coverage is unresolved. Our results refute the notion that analysis of tryptic peptides alone is insufficient to characterize proteins across a wide dynamic range of abundance. In practice, DEEP SEQ mass spectrometry achieves significantly higher proteome coverage (Fig. 3a) from only a fraction (<5%) of the input required by techniques that rely on protein- or cellular-level pre-fractionation18,19. In fact, although chromatographic and sub-cellular fractionation are slow and low-throughput compared with chemical amplification, the time required for our DEEP SEQ analysis (25 days for biological triplicates, including sample preparation) is not dramatically different from other recent attempts at protein deep sequencing by mass spectrometry (for example, 21 days for a study that relied on the use of protein-level fractionation and multiple enzymes34), or for that matter genome-wide RNA-seq ribosomal profiling (9–12 days as described in a recent reveiw35). Finally, it is worth noting that our DEEP SEQ approach provides the serendipitous benefit of a robust and streamlined sample-preparation protocol. Simple detergent solubilization, single-enzyme digestion, peptide desalting and iTRAQ labelling are somewhat reminiscent of the commoditized kits used in conjunction with next-generation DNA/RNA sequencing.
The quality of peptide fractionation provided by our protein deep-sequencing platform is also evident in the analysis of a mixed-species iTRAQ quantification model (Fig. 1a). Even with contaminant species present in >30-fold excess relative to the target peptides, our DEEP SEQ mass spectrometry platform provided sufficient separation peak capacity to yield accurate iTRAQ ratios (Fig. 1e). These data are important in light of discrepancies reported between the throughput advantages afforded by multiplexed reagents and the well-documented limitations encountered when using these labels to quantify proteins in complex mixtures11,12,13,36. In fact, results from three of these studies12,13,36 demonstrated unequivocally that two dimensions of peptide fractionation are insufficient to abrogate suppression of multiplexed ratios for analysis of samples intended to represent complex proteomes. Importantly, DEEP SEQ mass spectrometry provides a platform to reconcile these juxtaposed observations and fully leverage the advantages of isobaric reagents in studies designed to quantify proteome response to perturbation on a genome-wide scale. Moreover, recent reports37,38 suggest the possibility of significant increases in the degree of multiplexing for these stable isotope reagents. These advances, along with improvements in mass spectrometry instrumentation39,40 in addition to new chromatographic41,42,43 and electrophoretic44,45,46 separation platforms, will yield a concomitant increase in the throughput of DEEP SEQ mass spectrometry analysis.
Finally, our data provide insight into the transition of mESC from a naive ground state, similar to that observed in the embryo inner cell mass before implantation, to a post-implantation epiblast-like stage (mouse epiblast stem cell) that is poised for directed differentiation47. LIF is a critical cytokine that supports self-renewal in mESC through the Jak-Stat and Pi3k-Akt signalling axes, while the opposing pathways, including Wnt-Gsk3β and Mek-Erk, mediate transcription programs that degrade pluripotent potential48. How these and other exogenous stimuli coordinately influence the core pluripotent genes (Oct4 and Sox2) and other peripheral factors to enforce the transcriptional ground state or mediate lineage commitment is not yet fully resolved.
Strikingly, we detect regulated protein expression for downstream targets linked to each of these discrete pathways (Klf4, Klf5 (ref. 49), Lef1 (ref. 50), Esrrb, Tcfcp2l1 (ref 51) and Tbx3 (ref. 48)). Proteins having roles in other developmental contexts (Otx2 (ref. 52), Pou3f1/Oct6 (ref. 53) and CD9 (refs 54, 55), along with epigenetic factors (Dnmt3a/b) that are critical for high-fidelity gene expression56, were also regulated in a LIF-dependent manner. In addition to these studies that targeted specific genes or pathways, our DEEP SEQ analysis also quantified the majority of pluripotent factors previously defined by high-throughput functional genetic22,23 and biochemical interaction24,25,26 assays (Fig. 5a). Indeed, we reproducibly observed regulated expression for 50 proteins (Table 1) and found that these were enriched for pluripotent and developmental genes (Fig. 5d). Importantly, this set of putative LIF-dependent protein targets spans the full dynamic range of gene expression and protein translation in mESC (Fig. 5f). Thus, though results presented herein represent a single time point in the LIF-mediated transition between naive and primed epiblast states, the depth and scale of these data provide compelling evidence that our DEEP SEQ mass spectrometry platform can capture the vast majority of the mESC functional proteome in the context of more complex experiments57,58 designed to decipher individual contributions of the above pathways to self-renewal and lineage commitment. More generally, our DEEP SEQ mass spectrometry platform represents a powerful and scalable approach for genome-wide profiling of protein expression and post-translational modification status in mammalian systems.
Yeast cells were grown and processed under conditions similar to those described previously30. Saccharomyces cerevisiae, strain S288C, BY4741 (ATCC 201388, MATa his3Δ1 leu2Δ0met15Δ0 ura3Δ0) was grown in yeast extract peptone dextrose liquid medium to log phase, OD600 ≈0.7 at 30 °C. Cells were lysed by the addition of boiling SDS (50 mM Tris–HCl, pH 7.5, 7.5% SDS, 5% glycerol, 50 mM dithiothreitol (DTT) and 5 mM EDTA), followed by centrifugation, with the supernatant stored at −80 oC.
Mouse embryonic stem cell (mESC) line J1 was generously provided by Dr Stuart Orkin (Dana-Farber Cancer Institute and Children’s Hospital, Boston, MA). Initially, 10 cm Nunclon tissue culture dishes (Thermo Fisher Scientific, Waltham, MA) were coated with 0.1% gelatin (EMD Millipore, Billerica, MA) at room temperature for 30 min. Gelatin was aspirated and J1 mESC were plated at a density of ~6 × 104 cm−2 in DMEM with high glucose supplemented with L-glutamine (Gibco Life Technologies, Carlsbad, CA), 15% embryonic stem cells validated FCS (Stem Cell Technologies, Vancouver, BC, Canada), 2-mercaptoethanol, nucleosides (EMD Millipore), nonessential amino acids (Gibco Life Technologies) and 10 ng ml−1 murine LIF6 (EMD Millipore). Perturbation experiments were performed by establishing J1 mESC under the above conditions and then removing LIF6 for 48 h. At the time of collection, plates were washed with cold PBS to remove serum proteins, and adherent cells were lysed by the addition of a boiling SDS (50 mM Tris–HCl, pH 7.5, 7.5% SDS, 5% glycerol, 50 mM DTT and 5 mM EDTA). Lysed cells were centrifuged, and the supernatant extract was stored at −80 oC.
Sample preparation for DEEP SEQ analysis
Proteins were precipitated by adding six volumes of cold (−20 °C) acetone and resolubilized in digestion buffer containing 8 M urea and 0.1 M NH4HCO3. Total protein levels were measured by BCA. DTT was added to a final concentration of 10 mM and incubated for 30 min at 60 °C, followed by addition of methyl methanethiosulfonate (MMTS) (Thermo Fisher Scientific) to 20 mM. After 30 min incubation in the dark at room temperature, excess MMTS was quenched by addition of DTT to a final concentration of 20 mM. Reduced and alkylated proteins were diluted in 0.1 M ammonium bicarbonate, followed by addition of trypsin, with overnight digestion at 37 °C and end-over-end rotation. Digested sample solutions were loaded onto SepPak C18 reverse phase cartridges (Waters Corp., Milford, MA) to remove urea and other salts. Eluted peptides (45% acetonitrile in water with 0.1% trifluoroacetic acid) were lyophilized by vacuum centrifugation.
Peptides were labelled with 4-plex iTRAQ reagents (AB Sciex, Framingham, MA). Two aliquots of 0.2 μg yeast peptides were labelled with iTRAQ 114 and 115. Separately, two other aliquots of 1.0 μg yeast peptides were labelled with iTRAQ 116 and 117. In addition, two aliquots of 6.3 μg peptides from mESC were labelled with iTRAQ 114 and 116. For the mESC perturbation experiment, peptides from the cells incubated without LIF were labelled as technical replicates with iTRAQ 116 and 117, whereas peptides from the cells incubated with LIF were labelled as technical replicates with iTRAQ 114 and 115. For each reaction, peptides were resuspended in 500 mM triethylammonium bicarbonate and mixed with the appropriate iTRAQ reagent in ethanol. Labelling was allowed to proceed at room temperature for 1 h, and samples were then combined and dried by vacuum centrifugation. Fresh aliquots of mESC were processed as described above to provide biological triplicates.
DEEP SEQ multi-dimension separation
Three-dimension peptide separation was performed on a modified Waters Corp. NanoAcquity UHPLC system with binary and isocratic pumps, along with an autosampler and additional 6-port, 2-position valve (Valco, Austin, TX). The first-dimension RP column consisted of a 200-μm i.d. capillary packed with 20 cm of 5 μm diametre (dia.) XBridge C18 resin (Waters Corp.). An anion-exchange column (200 μm i.d. × 20 cm) was packed with 5 μm dia. SAX resin (Sepax Technologies, Neward, DE) and connected to the outlet of the first-dimension RP column. The third dimension consisted of RP pre- (100 μm i.d. × 4 cm of 10 μm dia. POROS 10R2 resin) and analytical (25 μm i.d. × 100 cm of 5 μm dia. Monitor C18 Column Engineering, Ontario, CA), with integrated 1 μm dia. emitter tip columns configured in a vented geometry59,60. The autosampler picked up and delivered either peptide samples or first (acetonitrile in 20 mM ammonium formate, pH 10) and second (KCl in 20 mM ammonium formate, pH 10) dimension eluents at 2 μl min−1 through the sample loop. Injection of each first- or second-dimension eluent constitutes a ‘DEEP SEQ fraction’ (Supplementary Data 8). The binary pump delivered 0.1% formic acid at 8 μl min−1 to dilute the organic content and acidify the first/second-dimension effluent before the third-dimension pre-column, or provided for gradient elution (2–50% B in 580 min, A=0.1% formic acid, B=acetonitrile with 0.1% formic acid) of peptides from the third-dimension RP columns for LC-MS/MS analysis at a flow rate of ~5 nl min−1. A Digital PicoView electrospray source platform (New Objective, Woburn, MA) was used on both the Orbitrap XL and 5600 Triple TOF mass spectrometers to automatically position the emitter tip at the source inlet during LC-MS/MS acquisition or beneath a gravity-driven drip station during injection of peptide samples or first/second-dimension eluents.
DEEP SEQ sample capacity and experiment time
Based on our published and unpublished studies, we estimate that the total loading capacity of our first-dimension RP column (200 μm i.d. × 20 cm) is currently ~100 μg. The DEEP SEQ platform is easily tailored to a variety of sample types. Generally, first- and second-dimension eluent concentrations in a range of 7–55% acetonitrile and 10–300 mM KCl represent the useful boundaries for peptide elution. Importantly, our experience to date suggests that these conditions are robust with respect to biological input, obviating the need to run repeated pilot experiments for every sample. Total time of analysis is another important consideration when using multidimensional fractionation techniques. For example, DEEP SEQ experiments performed at a depth of 20 fractions required 8 days for data acquisition. System reliability is particularly important, given that a single, DEEP SEQ mass spectrometry analysis will typically require several days of continuous instrument time. Importantly, all data presented herein were acquired using a single 25-μm i.d. × 100-cm resolving column, demonstrating the robustness of our platform.
DEEP SEQ mass spectrometry data-acquisition parameters
The LTQ-Orbitrap XL mass spectrometer (Thermo Fisher Scientific) was operated in data-dependent mode, such that the top 10 most abundant precursors in each MS scan were subjected to MS/MS in both collisionally activated dissociation (CAD) and higher energy collisional dissociation (HCD) mode (CAD in the linear trap, normalized collision energy=35%, precursor isolation width=1.9 Da, intensity threshold for precursor selection=20,000 HCD in the orbitrap, normalized collision energy=47%, precursor isolation width=1.0 Da and intensity threshold for precursor selection=20,000). Dynamic exclusion was enabled with a repeat count of 1 and exclusion duration set to 20 s. Electrospray voltage was 2.2 kV. We enabled the Lock Mass feature and selected m/z=445.120025 ([Si(CH3)2O]6) as the internal calibrant.
The 5600 Triple TOF (AB Sciex) was operated in information-dependent mode, with the top 50 precursors (charge state +2 to +5, >100 counts) in each MS scan (800 ms, scan range 350–1,500 m/z) subjected to MS/MS (minimum time 140 ms, scan range 100–1,400 m/z). A dynamic exclusion window of 20 s was used with unit resolution for precursor isolation. Electrospray voltage was 2.2 kV. Native DEEP SEQ mass spectrometry data files along with Mascot generic format (MGF) peak lists are available for download at: http://blaispathways.dfci.harvard.edu/mz/?id=doi:10.1038/ncomms3171.
DEEP SEQ mass spectrometry data processing
Our application programming interface (API) based multiplierz software framework was used to extract and format MS/MS data from the Orbitrap XL for subsequent search against the UniProt mouse database (downloaded on 02 November 2011) and a S. cerevisiae database (downloaded from http://www.yeastgenome.org/01/06/2010). MS/MS data were searched using Protein Pilot V4.4 (AB Sciex) with the following parameters: instrument=‘Orbi/FT MS (sub-p.p.m.), LTQ MS/MS’ for data acquired with the Orbitrap XL or ‘5600 Triple TOF’ for data acquired with the 5600 Triple TOF mass spectrometer. A fixed modification of +42 corresponding to MMTS of cysteine was also included. The sample type was set to ‘iTRAQ 4-plex (peptides labelling)’. All PSMs from bio-triplicate acquisitions were combined for the FDR assessment. Only those peptides with scores at or above a PSM FDR threshold of 1% were further considered. All peptides were mapped back to the genome without consideration of splice isoforms from the same gene. Following the procedure of Qeli and Ahrens17, peptides passing the 1% FDR filter were classified as either ‘unique’ or ‘shared’ based on whether they could be assigned to only one or more than one gene, respectively. Class I genes were defined as those identified only by uniquely assignable peptides. Class II genes were identified based on shared peptides that could not be assigned to any Class I gene. The results reported herein are based on Class I and Class II genes; no Class III genes were included in our analyses. Multiplierz scripts were used to programmatically generate extracted ion chromatograms (XICs) and calculate corresponding peak widths for all identified peptides. To estimate the orthogonality of peptide fractionation, the number of unique peptides identified in each third-dimension LC-MS/MS run was represented as a circle of proportional diameter and projected onto a 2D plot at the corresponding first-dimension acetonitrile (x axis) and second-dimension salt (y axis) concentrations used in the experiment. Linear regression was performed using the function stats.linregress from Numpy version 1.4.1, with each unique peptide sequence used as a separate data point.
In the mixed-species quantification experiments, the iTRAQ 116/iTRAQ 114 and iTRAQ 117/iTRAQ 115 ratios were first normalized based on all spectra identified (final normalization factor was 1.14). We used R scripts to create box plots for all iTRAQ ratios. For the quantitative analyses of mESC, we summed iTRAQ channels in each biological condition (±LIF) for Class I peptides to generate a final ratio for each Class I gene. Across three biological triplicate experiments, a protein was considered to be regulated in expression (Table 1) if the following criteria were met in two out of the three experiments: (i) the iTRAQ intensities exceeded 200 counts and (ii) the Log2 iTRAQ ratio was ≥1 or ≤−1. Enrichment of pluripotent factors within the set of genes whose expression was reproducibly regulated across bio-triplicate experiments was estimated using a Fisher’s exact test. A null distribution and two-sided P-value were calculated based on code freely available at: (http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio).
Comparisons with mESC microarray and ribosomal profiling data
Comparison of DEEP SEQ mass spectrometry data with that from ribosome profiling for mESC was performed as follows: the UCSC mouse genome database (mm9) (http://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/) was downloaded on 02 October 2012. We aligned the peptides identified in triplicate DEEP SEQ mass spectrometry analyses against sequences in the UCSC database. A similar alignment was performed for mESC ribosomal profiling data previously reported9. In 1,034 cases, we were unable to reconcile gene names from the UCSC and UniProt databases; these genes were removed from further consideration. Average gene expression levels across three wild-type mESC lines (B21, J1 and R1) were calculated using published data sets (accession codes GSM338369, GSM338371 and GSM338373) (ref. 20). We used the R/Bioconductor software environment to re-normalize all microarray data based on the Robust Multi-array Average (RMA) method and to re-map all probe sequences to Entrez Gene IDs using a custom Chip Definition File from the Michigan Microarray Lab (version 13). The average mRNA expression level (in Log2 space) between the three arrays was calculated, and the Entrez Gene IDs were converted to UniProt gene symbols. UniProt entries mapping to more than one Entrez Gene ID were excluded. The final list contained 15,705 entries of average Log2 expression value/UniProt gene symbol pairs.
GO and reference gene sets for pluripotency
Positive reference sets of pluripotent genes were created from RNAi screening data downloaded from two previous studies22,23. Similarly, pluripotency reference genes based on biochemical interactors of Nanog and Pou5f1 (Oct4) were created from three previous studies24,25,26.
The GO database was downloaded on 15 March 2012. The GO subcategories were filtered based on different identifiers: molecular function (GO:0003674), biological process (GO:0008150), signalling (GO:0023052), ribosome (GO:0005840), proteasome (GO:0000502), protein folding (GO:0006457), kinase (combination of GO:0004672 and GO:0016301), phosphatase (combination of GO:0016791 and GO:0004721), ubiquitin ligase (GO:0004842), transcription factor (combination of GO:0006351 and GO:0008134), stem cell maintenance (GO:0019827), chromatin remodelling (GO:0006338), membrane (GO:0016020), cell adhesion (GO:0007155) and transmembrane receptor (GO:0004888).
Protein lysates from ~3 × 105 murine ESC (about 30 μg total protein) incubated with or without LIF were separated on 4–12% Bis-Tris gels (Life Science) under reducing conditions and transferred onto polyvinylidene difluoride membranes. The blot was stained with antibodies against Lef1 (Cell Signaling Technology, 1:1,000), Pou3f1 (Abcam, 1:1,000), Esrrb (Cell Signaling Technology, 1:1,000), Otx2 (Abcam, 1:1,000), Klf4 (Cell Signaling Technology, 1:1,000), Cd9 (Abcam, 1:1,000) and Gapdh (Cell Signaling Technology, 1:6,000). The bands were visualized by enhanced chemiluminescence (SuperSignal West Femto Chemiluminescent Substrate; Thermo).
How to cite this article: Zhou, F. et al. Genome-scale proteome quantification by DEEP SEQ mass spectrometry. Nat. Commun. 4:2171 doi: 10.1038/ncomms3171 (2013).
Lander, E. S. Initial impact of the sequencing of the human genome. Nature 470, 187–197 (2011).
Ecker, J. R. et al. Genomics: ENCODE explained. Nature 489, 52–55 (2012).
Rizzo, J. M. & Buck, M. J. Key principles and clinical applications of ‘next-generation’ DNA sequencing. Cancer Prev. Res. (Phila) 5, 887–900 (2012).
Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Cravatt, B. F., Simon, G. M. & Yates, J. R. 3rd The biological impact of mass-spectrometry-based proteomics. Nature 450, 991–1000 (2007).
Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
Schwanhausser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).
Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
Evans, C. et al. An insight into iTRAQ: where do we stand now? Anal. Bioanal. Chem. 404, 1011–1027 (2012).
Ow, S. Y. et al. iTRAQ underestimation in simple and complex mixtures:“the good, the bad and the ugly”. J. Proteome Res. 8, 5347–5355 (2009).
Ting, L., Rad, R., Gygi, S. P. & Haas, W. MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics. Nat. Methods 8, 937–940 (2011).
Wenger, C. D. et al. Gas-phase purification enables accurate, multiplexed proteome quantification with isobaric tagging. Nat. Methods 8, 933–935 (2011).
Askenazi, M., Parikh, J. R. & Marto, J. A. mzAPI: a new strategy for efficiently sharing mass spectrometry data. Nat. Methods 6, 240–242 (2009).
Parikh, J. R. et al. multiplierz: an extensible API based desktop environment for proteomics data analysis. BMC Bioinformatics 10, 364 (2009).
Ficarro, S. B. et al. Online nanoflow multi-dimensional fractionation strategies for high efficiency phosphopeptide analysis. Mol. Cell. Proteomics 10, doi:10.1074/mcp.O111.011064 (2011).
Qeli, E. & Ahrens, C. H. PeptideClassifier for protein inference and targeted quantitative proteomics. Nat. Biotechnol. 28, 647–650 (2010).
Graumann, J. et al. Stable isotope labeling by amino acids in cell culture (SILAC) and proteome quantitation of mouse embryonic stem cells to a depth of 5111 proteins. Mol. Cell Proteomics 7, 672–683 (2008).
Li, Q. R. et al. Large scale phosphoproteome profiles comprehensive features of mouse embryonic stem cells. Mol. Cell. Proteomics 10, doi:10.1074/mcp.M110.001750 (2011).
Wirt, S. E. et al. G1 arrest and differentiation can occur independently of Rb family function. J. Cell. Biol. 191, 809–825 (2010).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Ivanova, N. et al. Dissecting self-renewal in stem cells with RNA interference. Nature 442, 533–538 (2006).
Hu, G. et al. A genome-wide RNAi screen identifies a new transcriptional module required for self-renewal. Genes Dev. 23, 837–848 (2009).
Wang, J. et al. A protein interaction network for pluripotency of embryonic stem cells. Nature 444, 364–368 (2006).
van den Berg, D. L. et al. An Oct4-centered protein interaction network in embryonic stem cells. Cell Stem Cell 6, 369–381 (2010).
Pardo, M. et al. An expanded Oct4 interaction network: implications for stem cell biology, development, and disease. Cell Stem Cell 6, 382–395 (2010).
Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
Lu, J. et al. MicroRNA expression profiles classify human cancers. Nature 435, 834–838 (2005).
Wilkins, M. R. et al. Progress with proteome projects: Why all proteins expressed by a genome should be identified and how to do it. Biotechnol. Genet. Eng. Rev. 13, 19–50 (1996).
Zhou, F., Sikorski, T. W., Ficarro, S. B., Webber, J. T. & Marto, J. A. Online nanoflow reversed phase-strong anion exchange-reversed phase liquid chromatography-tandem mass spectrometry platform for efficient and in-depth proteome sequence analysis of complex organisms. Anal. Chem. 83, 6996–7005 (2011).
Zhou, F., Lu, Y., Ficarro, S. B., Webber, J. T. & Marto, J. A. Nanoflow low pressure high peak capacity single dimension LC-MS/MS platform for high-throughput, in-depth analysis of mammalian proteomes. Anal. Chem. 84, 5133–5139 (2012).
Ficarro, S. B. et al. Improved electrospray ionization efficiency compensates for diminished chromatographic resolution and enables proteomics analysis of tyrosine signaling in embryonic stem cells. Anal. Chem. 81, 3440–3447 (2009).
Shen, Y. F. et al. High-efficiency nanoscale liquid chromatography coupled on-line with mass spectrometry using nanoelectrospray ionization for proteomics. Anal. Chem. 74, 4235–4249 (2002).
Nagaraj, N. et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 7, 548 (2011).
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).
Mertins, P. et al. iTRAQ labeling is superior to mTRAQ for quantitative global proteomics and phosphoproteomics. Mol. Cell Proteomics 11, doi:10.1074/mcp.M111.014423 (2012).
Dephoure, N. & Gygi, S. P. Hyperplexing: a method for higher-order multiplexed quantitative proteomics provides a map of the dynamic response to rapamycin in yeast. Sci. Signal 5, doi:10.1126/scisignal.2002548 (2012).
Everley, R. A., Kunz, R. C., McAllister, F. E. & Gygi, S. P. Increasing throughput in targeted proteomics assays: 54-plex quantitation in a single mass spectrometry run. Anal. Chem. 85, 5340–5346 (2013).
Sun, L., Zhu, G. & Dovichi, N. J. Comparison of the LTQ-orbitrap velos and the q-exactive for proteomic analysis of 1–1000, ng RAW 264.7 cell lysate digests. Rapid Commun. Mass Spectrom. 27, 157–162 (2013).
Ding, C. et al. A fast workflow for identification and quantification of proteomes. Mol. Cell. Proteomics doi:10.1074/mcp.O112.025023 (2013).
De Jong, E. P. & Griffin, T. J. Online nanoscale ERLIC-MS outperforms RPLC-MS for shotgun proteomics in complex mixtures. J. Proteome Res. 11, 5059–5064 (2012).
Luo, Q., Gu, Y., Wu, S. L., Rejtar, T. & Karger, B. L. Two-dimensional strong cation exchange/porous layer open tubular/mass spectrometry for ultratrace proteomic analysis using a 10 microm id poly(styrene- divinylbenzene) porous layer open tubular column with an on-line triphasic trapping column. Electrophoresis 29, 1604–1611 (2008).
Ritorto, M. S., Cook, K., Tyagi, K., Pedrioli, P. G. & Trost, M. Hydrophilic strong anion exchange (hSAX) chromatography for highly orthogonal peptide separation of complex proteomes. J. Proteome Res. 12, 2449–2457 (2013).
Zhu, G., Sun, L., Yan, X. & Dovichi, N. J. Single-shot proteomics using capillary zone electrophoresis-electrospray ionization-tandem mass spectrometry with production of more than 1250 Escherichia coli peptide identifications in a 50 min separation. Anal. Chem. 85, 2569–2573 (2013).
Heemskerk, A. A. et al. Ultra-low flow electrospray ionization-mass spectrometry for improved ionization efficiency in phosphoproteomics. Anal. Chem. 84, 4552–4559 (2012).
Wang, Y., Fonslow, B. R., Wong, C. C., Nakorchevsky, A. & Yates, J. R. 3rd Improving the comprehensiveness and sensitivity of sheathless capillary electrophoresis-tandem mass spectrometry for proteomic analysis. Anal. Chem. 84, 8505–8513 (2012).
Nichols, J. & Smith, A. Naive and primed pluripotent states. Cell Stem Cell 4, 487–492 (2009).
Niwa, H., Ogawa, K., Shimosato, D. & Adachi, K. A parallel circuit of LIF signalling pathways maintains pluripotency of mouse ES cells. Nature 460, 118–122 (2009).
Hall, J. et al. Oct4 and LIF/Stat3 additively induce Kruppel factors to sustain embryonic stem cell self-renewal. Cell Stem Cell 5, 597–609 (2009).
Huang, C. & Qin, D. Role of Lef1 in sustaining self-renewal in mouse embryonic stem cells. J. Genet. Genomics 37, 441–449 (2010).
Martello, G. et al. Esrrb is a pivotal target of the Gsk3/Tcf3 axis regulating embryonic stem cell self-renewal. Cell Stem Cell 11, 491–504 (2012).
Matsuo, I., Kuratani, S., Kimura, C., Takeda, N. & Aizawa, S. Mouse Otx2 functions in the formation and patterning of rostral head. Genes Dev. 9, 2646–2658 (1995).
Jaegle, M. et al. The POU factor Oct-6 is required for the progression of Schwann cell differentiation in peripheral nerves. Science 273, 507–510 (1996).
Le Naour, F., Rubinstein, E., Jasmin, C., Prenant, M. & Boucheix, C. Severely reduced female fertility in CD9-deficient mice. Science 287, 319–321 (2000).
Miyado, K. et al. Requirement of CD9 on the egg plasma membrane for fertilization. Science 287, 321–324 (2000).
Li, J. Y. et al. Synergistic function of DNA methyltransferases Dnmt3a and Dnmt3b in the methylation of Oct4 and Nanog. Mol. Cell. Biol. 27, 8748–8759 (2007).
Ying, Q. L. et al. The ground state of embryonic stem cell self-renewal. Nature 453, 519–523 (2008).
Wray, J. et al. Inhibition of glycogen synthase kinase-3 alleviates Tcf3 repression of the pluripotency network and increases embryonic stem cell resistance to differentiation. Nat. Cell Biol. 13, 838–845 (2011).
van der Heeft, E. et al. A microcapillary column switching HPLC-electrospray ionization MS system for the direct identification of peptides presented by major histocompatibility complex class I molecules. Anal. Chem. 70, 3742–3751 (1998).
Licklider, L. J., Thoreen, C. C., Peng, J. & Gygi, S. P. Automation of nanoscale microcapillary liquid chromatography-tandem mass spectrometry with a vented column. Anal. Chem. 74, 3076–3083 (2002).
Generous support for this work was provided (to J.A.M) by the Dana-Farber Cancer Institute and the National Institutes of Health (NINDS, P01NS047572). Y.L. acknowledges support from a National Institutes of Health training grant in transfusion medicine (T32HL66987-11). C.J.L. acknowledges support from the National Blood Foundation and the Department of Pathology at the Brigham and Women’s Hospital. We thank Dr Stuart Orkin (Dana-Farber Cancer Institute and Children’s Hospital, Boston, MA) for kindly providing the gel-adapted J1 murine embryonic stem cells used in this study. We also thank Dr Nicholas Ingolia (Carnegie Institution for Science, Baltimore, MD) for helpful discussions related to mESC ribosomal profiling data. In addition, we thank Christie Hunter and Sean Seymour of AB Sciex for support and technical assistance with the 5600 Triple TOF mass spectrometer and Protein Pilot search software.
The authors declare no competing financial interests.
Peptides detected by 1D LC-MS/MS and DEEP SEQ mass spectrometry in a mixed-species quantification model. (XLS 3178 kb)
Peptides and protein coding genes identified by DEEP SEQ mass spectrometry for mESC cultured (±) LIF. (XLS 18703 kb)
Peptides and protein coding genes identified by DEEP SEQ mass spectrometry and the union of protein coding genes from two previously published proteomic experiments (refs. 18 and 19) performed in mESC. (XLS 11480 kb)
List of protein coding genes identified by DEEP SEQ mass spectrometry and the union of protein coding genes from two previously published proteomic experiments (refs. 18 and 19) and their corresponding mRNA expression levels as determined by microarray analysis. (XLS 3659 kb)
List of protein coding genes identified by DEEP SEQ mass spectrometry and the protein coding genes previously identified by ribosomal profiling (ref. 9). (XLS 4882 kb)
List of protein coding genes identified by DEEP SEQ mass spectrometry and the overlap with pluripotent genes determined by biochemical interaction (refs. 24-26), and genetic depletion (refs. 22 and 23) assays. (XLS 1693 kb)
List of Class I protein coding genes identified by DEEP SEQ mass spectrometry (bio-triplicate experiments) and their relative expression in mESC cultured (±) LIF. (XLS 828 kb)
List of first and second dimension eluents used for 20 fraction DEEP SEQ mass spectrometry experiments. (XLS 18 kb)
About this article
Cite this article
Zhou, F., Lu, Y., Ficarro, S. et al. Genome-scale proteome quantification by DEEP SEQ mass spectrometry. Nat Commun 4, 2171 (2013). https://doi.org/10.1038/ncomms3171
Identification of novel cerebrospinal fluid biomarker candidates for dementia with Lewy bodies: a proteomic approach
Molecular Neurodegeneration (2020)
Nature Metabolism (2020)
Proteomic profiling of HIV-1 infection of human CD4+ T cells identifies PSGL-1 as an HIV restriction factor
Nature Microbiology (2019)
Cell Discovery (2019)
Deep multiomics profiling of brain tumors identifies signaling networks downstream of cancer driver genes
Nature Communications (2019)