Global detection of human variants and isoforms by deep proteome sequencing

An average shotgun proteomics experiment detects approximately 10,000 human proteins from a single sample. However, individual proteins are typically identified by peptide sequences representing a small fraction of their total amino acids. Hence, an average shotgun experiment fails to distinguish different protein variants and isoforms. Deeper proteome sequencing is therefore required for the global discovery of protein isoforms. Using six different human cell lines, six proteases, deep fractionation and three tandem mass spectrometry fragmentation methods, we identify a million unique peptides from 17,717 protein groups, with a median sequence coverage of approximately 80%. Direct comparison with RNA expression data provides evidence for the translation of most nonsynonymous variants. We have also hypothesized that undetected variants likely arise from mutation-induced protein instability. We further observe comparable detection rates for exon–exon junction peptides representing constitutive and alternative splicing events. Our dataset represents a resource for proteoform discovery and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.

The left column of histograms shows the relative distribution of missed cleavages for the six protease digests; the middle column shows the relative distribution of detected peptides (the red line indicates the average value, which is also stated as a number); the right column shows the occurrence of amino acids around the cleavage site (marked by a red line).

Figure S5. How to perform variant extraction in MaxQuant. A, Open MaxQuant and go to the "Tools/-Variant extraction" tab. B, Specify a list of BAM files with NGS data (RNA-seq, WGS, WES) and, if needed, change the mutation-calling parameters. C, Specify the location of the genomic DNA sequence and the genome annotation. Additionally, define folders for temporary and final files.

Figure S7. How to detect alternative splice events jointly for proteomics and transcriptomics data in Perseus. A, Open Perseus and go to the "Load/NGS data upload" activity. B, Specify the location of the peptide.txt files from MaxQuant, the transcriptomics BAM files, the genomic DNA sequence, and the annotation.

Figure S9. Properties of MS-detected peptides spanning spliced exon-exon junctions. A-F, Percentage of MS-identified splice junctions as a function of transcriptional coverage, measured as the logarithm of read count (reads per million, RPM). Splice junctions are further subdivided into constitutive sites, i.e., present in all isoforms of a specific gene, and exclusion/inclusion sites, involved in exon-skipping alternative splicing. Panels A-C show statistics for all exon-skipping events, and panels D-F for in-frame exon-skipping events. The percentage of identified splicing sites was calculated among events sorted by transcription coverage using sliding windows of various lengths: 100 (A and D), 500 (B and E), and 1000 (C and F) events. Note that panel E is identical to Figure 5D. G-I, The same as D-F, but for each protease used in this study, or all combined (Total). Note that panel H is identical to Figure 5E.
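The sliding-window calculation described in this caption can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name, the input layout (one `(rpm, identified)` pair per splice junction), the step size of one event, and the use of the mean log10 RPM as the x-coordinate of each window are all assumptions made for illustration.

```python
import math


def sliding_window_percent(events, window):
    """Percent of MS-identified junctions in windows over coverage-sorted events.

    events: list of (rpm, identified) pairs; rpm must be > 0 (assumed layout).
    window: number of events per window (e.g. 100, 500, or 1000).
    Returns a list of (mean_log10_rpm, percent_identified), one per window,
    with the window advanced by one event at a time (an assumption).
    """
    ordered = sorted(events, key=lambda e: e[0])  # sort by transcript coverage
    result = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        mean_log = sum(math.log10(rpm) for rpm, _ in chunk) / window
        percent = 100.0 * sum(1 for _, hit in chunk if hit) / window
        result.append((mean_log, percent))
    return result


# Toy example: four junctions with a window of two events.
toy = [(100, True), (1, False), (1000, True), (10, False)]
for x, pct in sliding_window_percent(toy, 2):
    print(f"mean log10(RPM) = {x:.2f}, identified = {pct:.0f}%")
```

A smaller window gives a noisier but more local estimate, which is presumably why the caption reports three window lengths.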

Protein sequence coverage of all identified proteins. A, Histogram showing the number of protein groups binned by observed sequence coverage. B, Pie chart showing the number of proteins observed in each of five 20% bins of sequence coverage. C, Series of violin plots for all measured combinations of cell lines, proteases, and fragmentation methods. D, Number of reported peptides and peptide spectral matches (PSMs) across large-scale proteomics studies. E, Cellular component gene ontology analysis shows that proteins with sequence coverage below 25% are significantly enriched for membrane proteins according to Fisher's exact test. ...
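For readers unfamiliar with the metric, sequence coverage and the 20% binning used in panel B can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, peptides are placed by exact substring matching, and the handling of the 100% edge case (assigned to the top bin) is an assumption rather than the study's actual procedure.

```python
def sequence_coverage(protein, peptides):
    """Percent of residues in `protein` covered by at least one peptide.

    Peptides are located by exact substring match (an assumption);
    every occurrence of a peptide counts toward coverage.
    """
    if not protein:
        raise ValueError("protein sequence must not be empty")
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


def coverage_bin(percent, width=20):
    """Label of the coverage bin; 100% falls into the top bin."""
    index = min(int(percent // width), 100 // width - 1)
    low = index * width
    return f"{low}-{low + width}%"


# Toy example with a made-up 10-residue sequence.
cov = sequence_coverage("MKTAYIAKQR", ["MKTAY", "AKQR"])
print(cov, coverage_bin(cov))
```

Overlapping peptides are counted once per residue, so digests with different proteases raise coverage only where they reach residues not already observed.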

Comparison with the neXtProt annotation. A, The current release of neXtProt (October 2022) was downloaded and cross-mapped to the peptides profiled in this study by first converting any proteins designated by UniProt identifiers to Ensembl protein (ENSP) identifiers. The UniProt-to-ENSP mapping was obtained from BioMart. Next, Ensembl protein identifiers were mapped to neXtProt accessions via the mapping scheme provided in the October 2022 release. Finally, the number of peptides per neXtProt group was summed across all cell lines used in this study. B, Unique neXtProt proteins delineated by protein existence (PE) rank, colored by the number of mapped peptides detected in this study. C, The relative proportion, within each PE rank, of neXtProt proteins with 0, 1, 2, or 3+ mapped peptides.
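The two-step identifier mapping and the per-rank proportions in panel C can be sketched in Python as below. This is a hedged illustration, not the authors' pipeline: the function names, the dictionary-based inputs, and all identifiers in the toy example are hypothetical, and peptide counts are assumed to be already summed across cell lines for each UniProt identifier.

```python
from collections import Counter, defaultdict


def peptides_per_nextprot(peptide_counts, uniprot_to_ensp, ensp_to_nextprot):
    """Sum peptide counts per neXtProt entry via UniProt -> ENSP -> neXtProt."""
    totals = Counter()
    for uniprot_id, n in peptide_counts.items():
        # Use a set so one UniProt entry is counted once per neXtProt entry,
        # even if several of its ENSP identifiers map to the same entry.
        targets = {
            ensp_to_nextprot[ensp]
            for ensp in uniprot_to_ensp.get(uniprot_id, ())
            if ensp in ensp_to_nextprot
        }
        for nextprot_id in targets:
            totals[nextprot_id] += n
    return totals


def proportions_by_pe(totals, pe_rank):
    """Fraction of entries with 0, 1, 2, or 3+ peptides within each PE rank."""
    bins = defaultdict(Counter)
    for nextprot_id, rank in pe_rank.items():
        n = totals.get(nextprot_id, 0)
        bins[rank]["3+" if n >= 3 else str(n)] += 1
    return {
        rank: {label: c / sum(counts.values()) for label, c in counts.items()}
        for rank, counts in bins.items()
    }


# Toy example with made-up identifiers.
totals = peptides_per_nextprot(
    {"P1": 2, "P2": 1, "P3": 5},
    {"P1": ["E1"], "P2": ["E2"], "P3": ["E3", "E4"]},
    {"E1": "NX_A", "E2": "NX_A", "E3": "NX_B", "E4": "NX_B"},
)
print(proportions_by_pe(totals, {"NX_A": 1, "NX_B": 1, "NX_C": 5}))
```

Entries listed in `pe_rank` but absent from `totals` fall into the 0-peptide bin, which is what makes the proportions within each PE rank sum to one.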