Deep human proteome sequencing

In silico tryptic digestion of the ~21,030 reviewed canonical protein sequences of the human proteome (UniProtKB/Swiss-Prot) predicts 2.3 million tryptic peptides of suitable size for MS detection (7–35 amino acids, up to two missed cleavages). These peptides comprise 9.9 million amino acid residues of the 11.5 million total—that is, only 86% of the proteome. If we consider digestion of the same proteins using the six enzymes in our study (LysC, LysN, AspN, chymotrypsin, GluC and trypsin), 7.4 million peptides suitable for shotgun proteomics are generated. These peptides cover 99% of the amino acids contained in the human proteome.
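The in silico digestion described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the cleavage rules in `RULES` are simplified assumptions (for example, no proline exception for trypsin), and the function names `digest` and `covered_fraction` and the toy sequence are hypothetical.

```python
# Simplified cleavage rules (assumed for illustration): target residues and the
# side of the residue on which the protease cuts ("C" = after, "N" = before).
RULES = {
    "trypsin": ("KR", "C"),
    "LysC": ("K", "C"),
    "LysN": ("K", "N"),
    "AspN": ("D", "N"),
    "GluC": ("E", "C"),
    "chymotrypsin": ("FWYL", "C"),
}


def digest(sequence, protease, missed_cleavages=2, min_len=7, max_len=35):
    """Return sorted (start, end) spans of peptides within the length window."""
    residues, side = RULES[protease]
    cuts = {0, len(sequence)}
    for i, aa in enumerate(sequence):
        if aa in residues:
            cuts.add(i + 1 if side == "C" else i)
    cuts = sorted(cuts)
    spans = set()
    for i in range(len(cuts) - 1):
        # Joining up to `missed_cleavages` extra fragments models missed cuts.
        last = min(i + 1 + missed_cleavages, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            if min_len <= cuts[j] - cuts[i] <= max_len:
                spans.add((cuts[i], cuts[j]))
    return sorted(spans)


def covered_fraction(sequence, spans):
    """Fraction of residues covered by at least one peptide span."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return len(covered) / len(sequence)


toy = "AAKGGGGGGGGGGK"  # hypothetical 14-residue sequence
tryptic = digest(toy, "trypsin", missed_cleavages=0)
combined = tryptic + digest(toy, "LysN", missed_cleavages=0)
print(covered_fraction(toy, tryptic), covered_fraction(toy, combined))
```

In this toy case the short N-terminal tryptic fragment falls below the 7-residue limit, and adding the LysN digest recovers one further residue, which is the same effect that the multi-enzyme prediction relies on at proteome scale.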

To test the hypothesis that multi-protease digestion increases coverage of the human proteome, we selected six diverse human cell lines: hES1, an embryonic stem cell line; HeLa S3, from cervical carcinoma; HepG2, from liver carcinoma; GM12878, a blood lymphoblastoid line; K562, from chronic myeloid leukemia; and HUVEC, from umbilical vein endothelial cells (Fig. 1). Having been included in the Encyclopedia of DNA Elements (ENCODE) project, these cell lines have a large amount of publicly available genomic and transcriptomic data49. Proteins from each cell line were separately digested with the six proteases listed above. To maximize depth, the resultant peptides were heavily fractionated (24–80 fractions) and analyzed using nanoflow LC coupled with quadrupole-Orbitrap–linear ion trap hybrid MS systems. Dissociation for MS/MS was achieved using HCD, CAD and ETD. The resulting 2,491 raw files were analyzed together by database search to identify proteins and peptides using the Andromeda search engine50 inside MaxQuant51,52, and results were sequentially filtered to a 1% false discovery rate (FDR) at the peptide spectrum match (PSM) level and then at the protein level over the whole dataset.
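The FDR filtering step can be illustrated with a generic target-decoy estimate. This is a sketch under stated assumptions, not a reproduction of the MaxQuant procedure: it assumes each PSM carries a score where higher is better and a flag marking decoy hits, and the function name `filter_by_fdr` is hypothetical.

```python
def filter_by_fdr(psms, max_fdr=0.01):
    """Return target PSMs accepted at the given FDR.

    psms: list of (score, is_decoy) tuples; higher score is assumed better.
    FDR at each rank is estimated as decoys / targets among the PSMs so far.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for n, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest rank at which the estimated FDR is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = n
    return [p for p in ranked[:cutoff] if not p[1]]


example = [(10, False), (9, False), (8, True), (7, False)]
strict = filter_by_fdr(example, max_fdr=0.01)
lenient = filter_by_fdr(example, max_fdr=0.4)
print(len(strict), len(lenient))
```

Applying the same idea first to PSMs and then to proteins gives a sequential filter of the kind described in the text.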

Fig. 1: Deep proteome sequencing workflow. Six human cell lines were grown in parallel, their proteomes were isolated and then each of the six proteases was used to digest a separate aliquot of each proteome. Peptides resulting from each digestion were fractionated by high-pH RP chromatography and then analyzed separately with nLC–MS/MS using HCD, ETD and CAD. The resulting data were searched with MaxQuant51,52 against the human proteome database, and over 17,000 proteins were identified by peptides that produce a median coverage of approximately 80%. The high coverage achieved is illustrated on the sequence of hemoglobin subunit gamma-1, with color coding to illustrate the number of unique peptides that cover each amino acid position.

Figure 2 summarizes these data, showcasing the depth of coverage and gains achieved by the multi-enzyme approach. For each cell line, an average of 539,325 unique peptides, corresponding to ~16,000 proteins, were identified (Fig. 2a). The highest number of identified proteins was from the hES1 cell line (17,121), followed by HeLa S3 (16,399), GM12878 (16,344), HepG2 (16,328), HUVEC (16,158) and K562 (16,054). The trypsin dataset contributed the largest number of unique peptides (396,782), followed by LysN (194,506), LysC (193,956), GluC (162,784), AspN (152,259) and chymotrypsin (114,152). Properties of detected peptides, such as the number of missed cleavages, length distribution and cleavage motif, agree closely with previous multi-enzyme proteomics studies (Supplementary Fig. 1)26,27,37. Notably, within each cell line, data from each enzyme digestion alone identified over 10,000 protein groups. Data from tryptic peptides contributed the largest number of identifications and unique sequences, totaling 17,631 proteins with 56.5% median sequence coverage. Combining data from all proteases afforded only a modest increase in the number of identified proteins (17,717) but considerably boosted the median sequence coverage to 79.2%. In total, we identified 12,151,708 PSMs and 1,119,510 unique peptides at an FDR of 1%. Of the identified proteins, 790 had complete sequence coverage. The average number of unique peptides per protein was 97 (median 65). Only 54 proteins were identified by a single unique peptide, and only 1,122 proteins, or 6.3% of the total, were identified by ten or fewer unique peptides. Median sequence coverage for the combined dataset and the contribution from subsets is shown in Fig. 2b; for individual cell lines it ranges from 49.7% (HUVEC; 16,158 proteins) to 63.9% (HeLa S3; 16,399 proteins). Remarkably, nearly half of all identified proteins were observed with 80–100% sequence coverage (Supplementary Fig. 2a,b). Only 936 proteins, or 5.3% of the total, have sequence coverage below 25%.

Fig. 2: Overview of results from deep proteomics analysis. a, Number of proteins detected for each of the six cell lines and cumulative as a function of peptides from the various protease digests. b, Median sequence coverage of various cell line proteomes achieved by digests with individual proteases and by combining all protease results. Supplementary Fig. 2c shows sequence coverage distributions separately for all combinations of cell lines, proteases and fragmentation methods. c, Venn diagram of all amino acids observed in tryptic digests versus those observed with all other proteases combined. d, Sequence coverage for each of the detected proteins for the tryptic peptide data (red) and combined protease digests, including trypsin (gray). e, Observed (dark gray) and theoretical (light gray) distributions of sequence coverage achieved for various combinations of proteases. The top three combinations of 2, 3, 4 or 5 proteases are displayed. f, Protein coverage comparison of transmembrane and nonmembrane proteins. For e and f, the lower whisker, lower quartile, upper quartile and upper whisker show the 5th, 25th, 75th and 95th percentiles, respectively. g, Relative protein coverage of N terminus (left) and C terminus (right) transmembrane segments. Chymo., chymotrypsin.

The addition of enzymes other than trypsin provided a slight increase in the total number of proteins identified but produced a large increase in the nonredundant amino acids detected. The 17,717 detected human proteins comprise 12,006,700 amino acid residues, including those that arise from noncanonical proteins, that is, isoforms. In total, the unique peptides identified in the combined tryptic datasets from all cell lines detected approximately half of these amino acids (6,113,639). The number of covered amino acids rises to 8,291,681 when all protease data are used (Fig. 2c). Figure 2d illustrates the impact of these additional amino acids on protein sequence coverage. Next, we determined the optimal multi-protease combinations (Fig. 2e), noting that all top combinations included trypsin. Our total human proteome coverage is, to our knowledge, the largest to date, with 2.12 million more residues (a 34.4% increase) than the 6.17 million identified using exclusively tryptic peptides from the entire MassIVE data repository (Supplementary Fig. 2d)8. Finally, we compared the proteins identified in this study with the curated neXtProt database7, which categorizes proteins across five groups based on the strength of the evidence for their existence. As shown in Supplementary Fig. 3, most of our protein identifications (13,603 proteins) fall into the highest-confidence category (PE1), and 79 proteins can now be promoted to PE1 status from lower categories (Supplementary Table 1).
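The residue-level figures quoted above can be checked with a few lines of arithmetic. The numbers are taken directly from the text; only the variable names are introduced here for illustration.

```python
total_residues = 12_006_700        # residues in the 17,717 detected proteins
tryptic_residues = 6_113_639       # covered by tryptic peptides in this study
all_protease_residues = 8_291_681  # covered when all six proteases are combined
massive_tryptic = 6_170_000        # 6.17 million, tryptic-only MassIVE figure

tryptic_share = tryptic_residues / total_residues
combined_share = all_protease_residues / total_residues
gain = all_protease_residues - massive_tryptic
relative_gain = gain / massive_tryptic

print(f"tryptic share:  {tryptic_share:.1%}")
print(f"combined share: {combined_share:.1%}")
print(f"gain: {gain / 1e6:.2f} million residues ({relative_gain:.1%})")
```

The tryptic share comes out at roughly half of all residues, consistent with the statement in the text, and the gain over the tryptic-only repository figure reproduces the quoted 2.12 million residues and 34.4%.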

Alternative proteases have previously been utilized to uncover novel portions of the proteome, including membrane proteins53,54. These proteins—essential to many biological processes and representing important drug discovery targets55—remain under-represented in proteomics datasets due to their hydrophobic nature. This is also true of our dataset. Gene ontology cellular component pathway enrichment analysis of the proteins with sequence coverage below 25% revealed that these low-coverage proteins were primarily membrane proteins (Supplementary Fig. 2e). Indeed, we also observe a coverage reduction for transmembrane proteins across all studied proteases (Fig. 2f). To further explore the behavior of peptides generated from transmembrane-spanning sequences, we calculated the enzyme-specific coverage of aligned membrane-spanning regions to either the N or C terminus (Fig. 2g). These data demonstrate that because transmembrane regions are depleted for typical protease cleavage sites, peptides suitable for detection by shotgun proteomics are less likely to be observed. This conclusion is further supported by the strong relative performance of chymotrypsin, which is atypical in cleaving at hydrophobic residues, as compared with the other proteases.

De novo protein assembly

Protein inference is conceptually akin to reference transcriptome assembly in short-read sequencing, where a previously assembled proteome or genome database is required to map peptide sequences or nucleic acid reads, respectively. In proteomics, however, genome assemblies for proteome database generation are either unavailable or low-quality for many organisms. Several tools are available to assemble short sequencing reads without a reference genome, such as SOAPdenovo-Trans56. However, de novo assembly of nucleic acid sequences relies on the presence of randomly overlapping sequences. Such overlaps are uncommon in proteomic datasets, which typically use only a single enzyme (for example, trypsin).

With the data from six different proteases and the deep coverage presented above, we produce many peptides with partial overlap, which we hypothesized may enable de novo protein assembly. An excellent example of de novo assembly is the proteasome subunit alpha type-6, which is assembled with full sequence coverage (Supplementary Fig. 4a). Overall, the de novo assembly produced 35,480 scaffolds, of which 16,496 (~47%) correctly match 9,695 protein groups. Median sequence coverage from the de novo assembly was 18% compared with 79.2% for the reference assembly (Supplementary Fig. 4b,c). Assembled scaffolds range from 33 to 358 amino acids in length with a median of 45 (Supplementary Fig. 4d), and an average of two scaffolds were mapped to each protein (Supplementary Fig. 4e). These results demonstrate the feasibility of de novo proteome assembly using overlapping peptides from multiple protease digestions of the proteome; application of proteomics-specific assembly methods may improve this result in the future57.
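The idea of assembling scaffolds from partially overlapping peptides can be illustrated with a greedy suffix-prefix merge. This is a toy sketch, not the assembler used in the study: the function names, the `min_overlap` threshold and the example peptides are all assumptions chosen for illustration, and real data would need error handling for ambiguous or repeated overlaps.

```python
def overlap(a, b, min_overlap):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter
    than min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(peptides, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    # Drop duplicates and peptides fully contained in a longer peptide.
    scaffolds = []
    for s in sorted(set(peptides), key=len, reverse=True):
        if not any(s in t for t in scaffolds):
            scaffolds.append(s)
    while True:
        best = (0, None, None)
        for i, a in enumerate(scaffolds):
            for j, b in enumerate(scaffolds):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return scaffolds  # no remaining overlaps: assembly is finished
        merged = scaffolds[i] + scaffolds[j][k:]
        scaffolds = [s for n, s in enumerate(scaffolds) if n not in (i, j)]
        scaffolds.append(merged)


# Hypothetical peptides from different digests of one short protein, plus one
# unrelated peptide that cannot be joined to the others.
peptides = ["MKTAYIAK", "AYIAKQRQ", "QRQISFVK", "FVKSHFSRQ", "GGGGGGG"]
print(greedy_assemble(peptides))
```

With a single protease the peptides would abut rather than overlap, so no merges would occur; the overlaps arise only because different enzymes cut at different positions.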

Majority of hypothetical SAPs are confirmed in the proteome

SAPs are variations in the protein sequence which often arise from single nucleotide polymorphisms (SNPs) that result in nonsynonymous codon changes in genomic sequence. The HeLa S3 cell line used in this study contains ~4.5 million SNPs when compared with the hg38 reference human genome. Of these, ~30,000 occur in coding regions, and 4,740 result in nonsynonymous codon changes58. We assessed whether our deep proteomics data would afford the ability to determine whether these SNPs are translated into SAPs. To this end, we searched for SAPs with a MaxQuant module which is tailored for the identification of peptide evidence for the translation of genomic variations (Supplementary Fig. 5)59. From this analysis, we observe protein-level evidence for up to 2,179 SAPs in individual cell lines, or a total of 5,060 SAPs (Fig. 3a and Supplementary Table 2). To assess the quality of these SAP-containing peptide identifications, we performed a correlation analysis of all peptide spectral matches both with and without SAPs (mutated and reference peptides, respectively). Figure 3b demonstrates the distribution of correlation coefficients between observed and predicted MS/MS spectra using the machine learning-based tool DeepMass60 for mutated and reference peptides. The baseline is drawn for peptides with multiple fragmentation spectra, which are compared with each other. The distributions for reference and mutated peptides are similar, providing increased confidence that these peptide spectral matches are legitimate.

Fig. 3: Discovery of proteins with SAPs. a, Comparison of SAPs discovered in the ENCODE transcriptomic data (Trans) and the proteomics data presented here (Prot) for each of the cell lines. b, Distribution of correlation coefficients between observed spectra and spectra predicted by DeepMass60. The baseline distribution shows acquisition-to-acquisition variation, obtained by comparing repeated observed spectra of the same peptides. The white circle shows the median value. The lower and upper edges of the box show the 25th and 75th percentiles, respectively. The lower and upper whiskers show the 5th and 95th percentiles, respectively. The distributions are based on 5,128,969, 442,476, 16,516 and 4,969 comparisons (from left to right). c, Clustered binary heatmap of the detected SAPs row-grouped by cell line and omics platform (transcriptomics or proteomics). Blue rectangles highlight clusters specific to each cell line, and the green rectangle highlights SAPs that are conserved across all cell lines. d, Gene ontology (GO) enrichment of genes with SAPs detected or undetected by MS. Genes with a mixed population of SAPs were removed, and repeats collapsed. Blue dots highlight GO terms with the word ‘membrane’ in the name. e, SIFT-generated61 score distribution over four categories for detected and undetected SAPs. Applying the two-sided Wilcoxon rank sum test on the raw scores results in a P value of 2 × 10⁻⁸. f, The same as e, but for the PolyPhen-2 (ref. 62) tool. Applying the two-sided Wilcoxon rank sum test on the raw scores results in a P value of 1.1 × 10⁻¹².

For all cell lines except HUVEC, we observed high overlap between the mutations detected by transcriptomics and by proteomics (Supplementary Fig. 6a). Given HUVEC is the only primary cell line (that is, obtained directly from host tissue) in the study, this low overlap is expected as the transcriptomic and proteomic data were collected from cells originating from different donors. Therefore, we omitted HUVEC from further analysis. Figure 3a shows that most nonsynonymous SNPs that appear in the transcript also appear at the protein level (median 73% over all studied cell lines). Further, the multi-enzyme data led on average to a doubling of identified SAPs compared with when only trypsin was used (Supplementary Fig. 6a).

Figure 3c shows the presence of variants as a function of cell line and whether they are detected at the protein level. We note that there are primarily two types of SAP—those that are cell line specific (highlighted within a blue rectangle) and those that are conserved across the cell lines (highlighted within a green rectangle). Enrichment analysis of the SAPs found only at the transcriptomic level (Fig. 3d) revealed several gene ontology terms associated with membrane protein families—supporting our earlier conclusions that peptides for such proteins are less amenable to MS analysis.

To test whether some of the mutations that were undetected at the protein level, even though transcript evidence was present, caused protein instability, we leveraged the SIFT61 and PolyPhen-2 (ref. 62) tools. These software tools predict how an amino acid mutation can alter protein structure and function by classifying mutations as either benign or deleterious. As depicted in Fig. 3e,f, both algorithms predict a significant shift (P values of 2 × 10⁻⁸ and 1.1 × 10⁻¹², respectively, from a two-sided Wilcoxon rank sum test) in the fraction of deleterious mutations for the undetected SAP group. These data suggest that at least a subset of undetected SAPs arise from cases where the mutation induces protein instability.

Protein-level evidence for alternative splicing

The high proteome sequence coverage of our dataset provides an opportunity to globally detect protein isoforms arising from alternative splicing and affords a direct assessment of the degree to which this process contributes to proteomic complexity. As mentioned above, RNA-seq analyses of diverse human organs and cell lines have provided evidence that more than 95% of multi-exon genes produce alternatively spliced transcripts11,12. However, the extent to which alternative transcripts with the potential to encode different proteins are translated has been the subject of considerable debate63,64, in large part due to the lack of MS datasets with sufficiently deep coverage. Accordingly, using the high-coverage data generated here, we assessed the proportion of alternatively spliced transcript variants that are detected in the proteome.

To assess the extent to which it is possible to detect splicing within our dataset, we first determined the relative proportions of peptides that fall entirely within exons versus those that span exon–exon junctions. Approximately 30% of identified peptide sequences span junction sequences formed by splicing of protein-coding exons (Supplementary Fig. 7a). Notably, trypsin generates the lowest ratio of junction-spanning versus exon body peptides of all proteases used in this study (~25% versus 28–32%) (Supplementary Fig. 7a). This observation confirms in silico predictions of the limited utility of trypsin alone for detection of spliced junction sequences in shotgun proteomics data65. In particular, peptides from trypsin and LysC digestion that fully map within exons have a clear bias which coincides with the first or last amino acids encoded by exons (Supplementary Fig. 8b). Additionally, exon-spanning LysN peptides tend to overlap by a single amino acid at their C termini (Supplementary Fig. 8c). These data are also consistent with a high frequency of lysine residues overlapping splice sites65 and illustrate the importance of utilizing additional proteases (chymotrypsin, AspN, GluC and so on) when attempting to detect splice isoforms.

Figure 4 illustrates our strategy for detection of translated alternative splicing events. In the example provided, alternative splicing of a cassette exon (exon 8) of the amyloid precursor protein (APP) gene is detected by a combination of peptides spanning exons 7 and 9, the junction formed by skipping of the exon, and by peptides spanning exons 7 and 8 or exons 8 and 9, which are formed by inclusion of the exon. In total, we detect 11 unique peptides spanning these three junctions, thus confirming translation of isoforms resulting from inclusion and skipping of the exon. Figure 5a depicts the major classes of alternative splicing events and their detection frequencies in RNA-seq data49 generated from all six cell lines analyzed in this study, together with the numbers of these events detected at the proteomics level when considering peptides mapping to one or both possible resulting isoforms (Supplementary Table 3). With a requirement for expression of at least one of two isoforms, we detect 4,608 of 13,450 (34.3%) alternative splicing events (Fig. 5a). Notably, of 6,145 alternative splicing events with RNA-seq expression evidence for both alternatives, we detect 1,141 (18.6%) at the protein level, where junction-spanning peptides representing both alternative isoforms are identified.
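The logic for calling a cassette exon event from junction-spanning peptides can be sketched as follows. This is an illustrative simplification that assumes peptides have already been mapped to exon-exon junctions, represented here as (upstream exon, downstream exon) pairs; the function name and return labels are hypothetical.

```python
def classify_cassette_event(detected_junctions, upstream, cassette, downstream):
    """Classify protein-level evidence for a cassette exon event.

    detected_junctions: set of (exon_a, exon_b) pairs that have at least one
    junction-spanning peptide.
    """
    inclusion = {(upstream, cassette), (cassette, downstream)}
    exclusion = {(upstream, downstream)}
    has_inclusion = bool(inclusion & detected_junctions)  # either junction suffices
    has_exclusion = bool(exclusion & detected_junctions)
    if has_inclusion and has_exclusion:
        return "both isoforms"
    if has_inclusion:
        return "inclusion only"
    if has_exclusion:
        return "exclusion only"
    return "not detected"


# APP-like example from the text: exon 8 is the cassette exon.
print(classify_cassette_event({(7, 8), (7, 9)}, 7, 8, 9))

# Detection rates quoted in the text.
print(f"{4608 / 13450:.1%}", f"{1141 / 6145:.1%}")
```

Events labeled "both isoforms" correspond to the stricter count in the text, while any label other than "not detected" corresponds to detection of at least one alternative.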

Fig. 4: Example of proteomics data corroborating occurrence of an alternative splicing (AS) event in APP. The exons are first transcribed in their genomic order. Splicing then yields either the 7–9 or the 7–8–9 exon combination. Since all of these exons are part of the APP open reading frame, each combination can in principle be translated into protein sequence. The multi-enzyme shotgun MS approach described here allows detection of peptides specific to each isoform. Two of 42 total spectra corroborating these splicing events are shown.

Fig. 5: Properties of detected exon skipping AS events. a, Summary table of splicing events that are annotated, detected by transcriptomics and detected by proteomics. AS events are further subdivided into groups with expression evidence for at least one or both alternatives. b, Proteomics detection rate of exon skipping AS events as a function of expression. Each gene is grouped by expression level as obtained from RNA-seq data. c, Proportions of detected AS events with in-frame or out-of-frame properties. For in-frame AS events, the length of the included exon is divisible by 3. This is not the case for out-of-frame AS events, which therefore result in a frameshift. d, The same analysis as in b but based on frame-preserving isoform events only. e, Percentage of MS-identified splicing sites as a function of transcriptional coverage (reads per million, RPM). Three groups of splicing sites are displayed: constitutive (present in all isoforms of a specific gene), exclusion and inclusion splice sites. For more information, see Supplementary Fig. 8. f, The same as e, but by individual proteases used in this study or all combined (Total). g, Splice junction proteomic coverage achieved over all protease combinations. The top two combinations are displayed for 2–5 proteases. Only splice junctions with transcriptomics coverage of more than 1 RPM are included in this analysis. h, ROC curve of a binary XGBoost68 classifier trained to predict whether AS events are detected or not detected on the proteomics level. i, Features ranked by their importance for the XGBoost classifier. The bars and whiskers show the mean and 1 s.d., respectively. The visualized values were calculated over 100 random shuffles for each parameter. j, Proteomics detection rate as a function of percent spliced-in (PSI) value defined by RNA-seq data. AUC, area under the curve.

Several factors inherently limit the detection of transcript isoforms at the protein level. These include (1) relatively low transcript abundance arising from reduced levels of gene expression; (2) transcript turnover due to nonsense-mediated mRNA decay (NMD), triggered by premature termination codons introduced by frame-shifting alternative splicing events66 and other turnover processes; and (3) reduced levels of splicing, as measured using the metric PSI. Exemplifying these limitations, intron retention events, which often result in nuclear retention of transcripts or trigger NMD if the retained intron does not prevent transcript export67, are the most rarely detected at the protein level (that is, only 9 of 105). Furthermore, the rate of detection at the proteomics level gradually increases as the corresponding transcript levels for cassette alternative exons increase (Fig. 5b). Moreover, most of the events detected at the proteomics level derive from frame-preserving (that is, in-frame) alternative isoforms (Fig. 5c). Considering only frame-preserving alternative splicing events in relatively abundant transcripts (that is, ≥7 log2 RPKM), we observe 64% of alternative splicing events at the protein level (Fig. 5d).

To estimate the possible upper bound on detection rates for alternative splicing events at the proteomics level, we compared relative detection rates for alternatively spliced and constitutively spliced junctions in the same RNA transcripts, where constitutively spliced exon–exon junctions are defined as those present in all isoforms of a gene. Importantly, detection rates for constitutive and alternative exon–exon junctions were comparable over a range of transcript levels, in both cases plateauing at approximately 40% of total junctions detected at the highest levels of transcript abundance (Fig. 5e and Supplementary Fig. 9a–f). Consistent with these results, the maximum detection levels require combined data from all six proteases, since each enzyme alone resulted in substantially lower detection levels (Fig. 5f and Supplementary Fig. 9g–i). Additionally, the analysis of all protease combinations shows that proteases not directed at arginine or lysine (GluC, AspN and chymotrypsin) are highly complementary to trypsin in terms of splice site coverage (Fig. 5g).

Finally, to further evaluate factors contributing to the detection of spliced isoforms at the proteomics level, we trained a machine learning binary classifier68. Specifically, we classified cassette exon skipping events detected in both proteomics and transcriptomics data versus those events detected solely in the transcriptome. The classifier was trained on the following properties: transcript abundance, PSI value, exon length, protein coding sequence length, frame-preserving status and a minimum theoretical peptide coverage between isoforms for each studied protease. We evaluated performance using sevenfold cross-validation. The classifier achieves an area under the receiver operating characteristic (ROC) curve of 0.83 (Fig. 5h), well above the 0.5 expected for random guessing. We next used permutation importance69 to rank the properties by their influence on proteomic detection of alternative splicing events. The three most important parameters are transcript abundance, PSI and frame status (Fig. 5i), consistent with the results in Fig. 5b–d.
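Permutation importance can be sketched without any machine learning library. The example below is a generic illustration, not the analysis performed in the study: it uses accuracy as the score (an assumption; the metric used with the XGBoost classifier is not stated here), a hypothetical stand-in `model`, and made-up feature rows. The default of 100 shuffles mirrors the number quoted in the Fig. 5i caption.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=100, seed=0):
    """Mean drop in accuracy when one feature column is randomly shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column replaced.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: feature 0 drives the label, feature 1 is irrelevant.
rows = [(0.1, 5), (0.9, 3), (0.2, 8), (0.8, 1), (0.3, 7), (0.7, 2), (0.4, 6), (0.6, 4)]
model = lambda r: r[0] > 0.5  # stand-in for a trained classifier
labels = [model(r) for r in rows]
importances = permutation_importance(model, rows, labels)
print(importances)
```

A feature the model ignores has an importance of exactly zero, because shuffling it never changes a prediction, whereas shuffling an informative feature degrades the score.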

The PSI value reflects the percentage of the total transcript abundance that results in exon inclusion. Since the exon-included isoform contains two junctions for proteomic detection, while the exon-excluded isoform contains only one, exon-inclusion events have double the probability of detection when the two isoforms are equally abundant. The two isoforms are therefore equally likely to be detected when the excluded isoform is twice as abundant as the included one, which corresponds to an optimal PSI for proteomic detection of 33%. This is confirmed in Fig. 5j, where the highest proteomics detection rate for exon exclusion is close to 30%. Note that for extreme PSI values, for example, >90%, the spliced-in isoform is at least ninefold more abundant than the spliced-out isoform. This imbalance likely leaves one isoform at low protein abundance, adding to the challenge of its detection.
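The 33% figure follows from a simple balance argument, sketched below. The sketch assumes that the chance of detecting an isoform scales with its abundance multiplied by its number of diagnostic junctions; this proportionality and the function names are illustrative assumptions, not a model fitted in the study.

```python
def balanced_psi(inclusion_junctions=2, exclusion_junctions=1):
    """PSI (as a fraction) at which both isoforms are equally detectable.

    Solves inclusion_junctions * psi == exclusion_junctions * (1 - psi).
    """
    return exclusion_junctions / (inclusion_junctions + exclusion_junctions)


def isoform_ratio(psi):
    """Abundance of the spliced-in isoform relative to the spliced-out one."""
    return psi / (1 - psi)


print(round(balanced_psi(), 3))
print(round(isoform_ratio(0.9), 1))
```

With two inclusion junctions and one exclusion junction the balance point is one third, and at a PSI of 0.9 the spliced-in isoform is about nine times as abundant as the spliced-out isoform.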