Ribosome profiling has been used to predict thousands of short open reading frames (sORFs) in eukaryotic cells, but it suffers from substantial levels of noise. PRICE (https://github.com/erhard-lab/price) is a computational method that models experimental noise to enable researchers to accurately resolve overlapping sORFs and noncanonical translation initiation. We experimentally validated translation using major histocompatibility complex class I (MHC I) peptidomics and observed that sORF-derived peptides efficiently enter the MHC I presentation pathway and thus constitute a substantial fraction of the antigen repertoire.
Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S. & Weissman, J.S. Science 324, 218–223 (2009).
Ingolia, N.T. Cell 165, 22–33 (2016).
Gao, X. et al. Nat. Methods 12, 147–153 (2015).
Ingolia, N.T., Lareau, L.F. & Weissman, J.S. Cell 147, 789–802 (2011).
Calviello, L. et al. Nat. Methods 13, 165–170 (2016).
Chun, S.Y., Rodriguez, C.M., Todd, P.K. & Mills, R.E. BMC Bioinformatics 17, 482 (2016).
Ingolia, N.T. et al. Cell Rep. 8, 1365–1379 (2014).
Bazzini, A.A. et al. EMBO J. 33, 981–993 (2014).
Ji, Z., Song, R., Regev, A. & Struhl, K. eLife 4, e08890 (2015).
Malone, B. et al. Nucleic Acids Res. 45, 2960–2972 (2017).
Fields, A.P. et al. Mol. Cell 60, 816–827 (2015).
Zhang, P. et al. Nat. Commun. 8, 1749 (2017).
Rutkowski, A.J. et al. Nat. Commun. 6, 7126 (2015).
Stern-Ginossar, N. et al. Science 338, 1088–1093 (2012).
Weekes, M.P. et al. Cell 157, 1460–1472 (2014).
Yewdell, J.W. Trends Immunol. 32, 548–558 (2011).
Bassani-Sternberg, M., Pletscher-Frankild, S., Jensen, L.J. & Mann, M. Mol. Cell. Proteomics 14, 658–673 (2015).
Karosiene, E., Lundegaard, C., Lund, O. & Nielsen, M. Immunogenetics 64, 177–186 (2012).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
Erhard, F. & Zimmer, R. Nucleic Acids Res. 43, e136 (2015).
Hong, Y. Comput. Stat. Data Anal. 59, 41–51 (2013).
Pachter, L. Preprint at http://arxiv.org/abs/1104.3889 (2011).
Kowalewski, D.J. & Stevanović, S. Methods Mol. Biol. 960, 145–157 (2013).
Cox, J. & Mann, M. Nat. Biotechnol. 26, 1367–1372 (2008).
This work was funded by the MRC (Clinical Fellowship grant G1002523 to L.D.), NHSBT (grant WP11-05 to L.D.), the European Research Council (grant ERC-2016-CoG 721016–HERPES to L.D.) and the Wellcome Trust (Senior Clinical Research Fellowship 108070/Z/15/Z to M.P.W.). R.Z. acknowledges partial funding from the DFG (SFB 1123) and from Bavaria (BioSysNet). We thank S. Gorsky for critical reading of the manuscript.
D.J.K. is an employee of Immatics Biotechnologies GmbH.
Integrated supplementary information
(a) The read length distribution of the HSV-1 Ribo-seq experiment (see Supplementary Table 1) shows that cleavage is stochastic. Single 5’ mismatches are frequent and shift the read-length distribution by +1. (b) In the HSV-1 data set the most abundant read class is 28 bp long and the annotated codon triplets start at positions divisible by three within the read (Frame=0). Shorter or longer reads with frame 0 are also highly abundant, indicating high variability of cleavage distances downstream of the P site. Additionally, a substantial fraction of 29 bp long reads is in frame 1, and many 27 bp long reads are in frame 2. This indicates that cleavage upstream of the ribosome exhibits significant variation, albeit less than downstream. A substantial fraction of reads has a mismatch at their 5’ end, which can be attributed to untemplated nucleotide additions during cDNA synthesis. (c) For each data set (see Supplementary Table 1), the fraction of reads with the most frequent read length among CDS-mapped reads is shown. In all data sets, only a minority of all reads have canonical length. (d) Only a fraction of all canonical reads correspond to footprints with the P site codon at a defined distance from their respective start position in the read alignment (e.g. position 12 within the read).
(a) Ribo-seq produces several distinct reads from a given codon (AGU) located in the P site of the ribosome. Besides the canonical read (1), both shorter (2) and longer (5) reads arise due to stochastic mRNA cleavage. Further variability is caused by observed (3) or unobservable (4) untemplated single nucleotide additions. The proportions of these reads are determined by the model parameters. (b) The deterministic approach of mapping reads to P site codons assigns all reads of a certain class (characterized by its length and 5’ mismatch state) to a specific codon at the same position within each read. In contrast, probabilistic mapping solves the inverse problem of the generative model to assign reads to codons. Finally, regularization can be used to overcome the fractional mappings that arise because the problem is over-parameterized.
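The contrast between deterministic and probabilistic mapping in (b) can be illustrated with a toy example. The sketch below is not the PRICE model: the read classes, the offset probabilities in OFFSET_PROBS and the function names are hypothetical, and a plain expectation-maximization loop stands in for the inference, without regularization.

```python
from collections import defaultdict

# Hypothetical P(offset | read class); the offset is the distance from the
# read's 5' end to the first base of the P-site codon. Values are made up.
OFFSET_PROBS = {
    "28": {12: 0.9, 11: 0.1},
    "29": {13: 0.6, 12: 0.4},
}

def deterministic_map(reads):
    """Assign every read of a class to that class's single most likely offset."""
    counts = defaultdict(float)
    for start, read_class, n in reads:
        probs = OFFSET_PROBS[read_class]
        best_offset = max(probs, key=probs.get)
        counts[start + best_offset] += n
    return dict(counts)

def probabilistic_map(reads, iterations=50):
    """Toy EM: split each read over its candidate positions in proportion to
    offset probability times the current activity estimate of that position."""
    positions = {s + o for s, c, _ in reads for o in OFFSET_PROBS[c]}
    activity = {p: 1.0 for p in positions}
    for _ in range(iterations):
        updated = defaultdict(float)
        for start, read_class, n in reads:
            weights = {start + o: pr * activity[start + o]
                       for o, pr in OFFSET_PROBS[read_class].items()}
            total = sum(weights.values())
            for pos, w in weights.items():
                updated[pos] += n * w / total
        activity = {p: updated[p] for p in positions}
    return activity

# (5' end position, read class, number of reads)
reads = [(100, "28", 10), (99, "29", 4)]
det = deterministic_map(reads)
prob = probabilistic_map(reads)
```

In this toy input both read classes are best explained by the same codon position, so the probabilistic estimate concentrates nearly all 14 reads there while keeping the total read count unchanged.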
(a) Potential start codons are predicted by comparing the activity of each codon to the mean activity of downstream codons. Data from several different kinds of samples such as translation start site profiling employing cells treated with Lactimidomycin or Harringtonine prior to cell lysis can be included in the start codon prediction using a logistic regression approach. A start codon score (s) is computed via logistic regression for each codon in an ORF candidate. Three examples (x, y and z) are shown. (b) For all annotated ORFs with at least 25 reads, start scores (see Figure 1e) were predicted for all codons. We computed the relative ranks for all codons (i.e. the maximal score gets rank 0) and plotted the distribution of ranks for the annotated start codon. For both the HCMV and HSV-1 data set, about three quarters of the start codons are top ranked.
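The relative rank used in (b) can be sketched as follows. The exact definition is an assumption here: the rank of a codon is taken as the fraction of the other codons in the ORF that score strictly higher, so that the maximal score gets rank 0. The function name and the example scores are hypothetical.

```python
def relative_rank(scores, index):
    """Fraction of the other codons scoring strictly higher than codon `index`.

    Assumed definition: 0.0 means top ranked, 1.0 means lowest ranked.
    """
    if len(scores) < 2:
        return 0.0
    higher = sum(1 for s in scores if s > scores[index])
    return higher / (len(scores) - 1)

# Made-up start codon scores for the codons of one ORF candidate.
orf_scores = [0.05, 0.92, 0.10, 0.40, 0.02]
annotated_start = 1  # index of the annotated start codon in this toy ORF

rank_of_start = relative_rank(orf_scores, annotated_start)
```

Collecting rank_of_start over all annotated ORFs would give the kind of rank distribution described in the legend.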
Read classes (characterized by read length and 5’ mismatch state) exhibit specific signal to noise ratios from slightly above the theoretical minimum (dashed horizontal line) to more than 8. Ignoring 5’ mismatches (pale orange) increases noise levels. For this experiment, the optimal deterministic mapping strategy utilizes four read classes, namely 28 and 29 bp without 5’ mismatch, and 29 and 30 bp long reads with 5’ mismatch.
(a) Deterministic mapping approaches trade off signal against signal to noise ratio (connected dots, from left to right) when considering reads of additional sizes. Probabilistic mapping and regularization (symbols for the regularization parameter δ) substantially improve Ribo-seq data quality for a broad range of published data sets (see Supplementary Table 1). (b) Aggregated reads around start codons demonstrate the improved signal to noise ratio of PRICE for both the HSV-1 and HCMV data set. Codon counts were normalized to the start codon for all well-translated ORFs (at least one read per codon on average) and aggregated per position relative to the start codon.
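The aggregation in (b) can be sketched in a few lines. The data layout (a per-codon count list with start and end indices) and the function name are hypothetical, and the normalized values are summed per position here; averaging would be an equally plausible reading of "aggregated".

```python
from collections import defaultdict

def aggregate_around_start(orfs, upstream=2, downstream=2):
    """Sum start-normalized codon counts per position relative to the start codon.

    orfs: iterable of (counts, start, end), where counts holds per-codon read
    counts along the transcript and counts[start:end] covers the ORF.
    """
    profile = defaultdict(float)
    for counts, start, end in orfs:
        orf_counts = counts[start:end]
        # Keep only well-translated ORFs: at least one read per codon on average.
        if not orf_counts or sum(orf_counts) / len(orf_counts) < 1:
            continue
        if counts[start] == 0:
            continue  # cannot normalize to an unobserved start codon
        for rel in range(-upstream, downstream + 1):
            idx = start + rel
            if 0 <= idx < len(counts):
                profile[rel] += counts[idx] / counts[start]
    return dict(profile)

toy_orfs = [
    ([0, 1, 10, 2, 3, 1], 2, 6),  # mean 4 reads per codon: kept
    ([0, 0, 1, 0, 0, 0], 2, 6),   # mean 0.25 reads per codon: skipped
    ([2, 0, 4, 4, 2, 2], 2, 6),   # mean 3 reads per codon: kept
]
profile = aggregate_around_start(toy_orfs)
```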
(a) The top-ranked uoORFs from ORF-scoring methods (RibORF, SPECtre, ORFScore, FLOSS) have significantly higher off-frame fractions (overlapping reads mapped out-of-frame divided by all overlapping reads) than uoORFs from ORF-identifying methods (PRICE, Rp-Bp, ORF-RATER). Data for PRICE with regularization (reg.) and using only the ORF-identifying component of PRICE with deterministic mapping (det.) are included. (b) Start codon distributions for the combined set of uORFs and uoORFs identified by the different methods are shown. Other refers to all other codons with one mismatch to AUG. Rp-Bp could only be run to identify NUG start codons. For each method, the data of the top 1000 uORFs/uoORFs according to the read density (number of reads from the HSV-1 data set mapped to codons divided by ORF length) are shown. Reproducibility of the respective ORFs between two Ribo-seq data sets from primary human fibroblasts is indicated. (c) / (d) Illustration of ORF classes analyzed in (e)-(h): uORFs (e) are defined to both start and end in the 5’-UTR of an annotated mRNA, whereas uoORFs (f) end within the downstream coding sequence. Both may either start with an AUG (g) or a non-canonical start codon (h). (e) - (h) The reproducibility (number of reproducible ORFs divided by total number of ORFs identified) is plotted against the total number of ORFs by varying their read density (translation rates). This is shown for each method and for different classes of ORFs (uORFs, uoORFs, uORFs/uoORFs starting with either an AUG or non-canonical start codon). PRICE with probabilistic mapping substantially outperforms all other methods in terms of reproducibility and total number of identified ORFs.
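The three ratios defined in this legend (off-frame fraction, read density and reproducibility) can be written out directly. The sketch below uses hypothetical function names and toy data, and it assumes that an ORF counts as reproducible when the same identifier is also called in the second data set.

```python
def off_frame_fraction(in_frame_reads, off_frame_reads):
    """Overlapping reads mapped out-of-frame divided by all overlapping reads."""
    total = in_frame_reads + off_frame_reads
    return off_frame_reads / total if total else 0.0

def read_density(reads, orf_length):
    """Number of reads mapped to codons divided by ORF length."""
    return reads / orf_length

def reproducibility_curve(densities_a, ids_b, thresholds):
    """For each read-density threshold, return (number of ORFs kept, fraction
    of those ORFs that are also called in the second data set)."""
    curve = []
    for t in thresholds:
        kept = [orf for orf, d in densities_a.items() if d >= t]
        if kept:
            reproducible = sum(1 for orf in kept if orf in ids_b)
            curve.append((len(kept), reproducible / len(kept)))
    return curve

# Toy data: read densities in data set A, ORF identifiers called in data set B.
densities_a = {"orf1": 5.0, "orf2": 2.0, "orf3": 0.5, "orf4": 0.1}
ids_b = {"orf1", "orf2", "orf4"}
curve = reproducibility_curve(densities_a, ids_b, thresholds=[1.0, 0.0])
```

Lowering the density threshold adds ORFs to the set, which is how the curves in (e) to (h) trade the number of identified ORFs against their reproducibility.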
Supplementary Figure 7 Intra-laboratory comparisons of HFF data sets (HSV-1 replicate 1 compared with HSV-1 replicate 2).
(a) This is the same as Supplementary Fig. 6b, except that all ORFs were used rather than only the top 1000. Note that the start codon distributions are virtually the same, but with a much larger fraction of reproducible ORFs for PRICE. The remaining panels are the same as in Supplementary Fig. 6, here for the comparison between two replicates of the HSV-1 data set.
(a) PRICE has the greatest sensitivity for identifying annotated ORFs as estimated by the total number of peptides detected in the total proteome. (b) Novel ORFs remain virtually undetectable in the whole proteome data.
(a) We used NetMHCcons [Karosiene et al., 2012] to predict the binding affinities of all identified peptides to the respective HLA allotypes for both MHC I peptidome data sets (published: HLA-A*03:01, HLA-A*23:01, HLA-B*08:01, HLA-B*15:01, HLA-C*07:02, HLA-C*07:04; new: HLA-A*01:01, HLA-A*03:01, HLA-B*08:01, HLA-B*51:01, HLA-C*07:01, HLA-C*01:02). Shown is the percentile rank of the IC50 value, as predicted by NetMHCcons. Strong binders are recognized by a percentile rank ≤ 1 (vertical dotted line) [Karosiene et al., 2012]. With the exception of Rp-Bp predicted peptides in the new data set, all predictions of novel ORFs result in peptides that strongly resemble the binding affinities of peptides from annotated proteins.
(a) Many of the novel ORFs from the HCMV re-annotation (see Fig. 2 in the main text) were filtered by PRICE due to too much noise off-frame, too many inactive codons, noise down- or upstream, no clear start codon, abortive translation or due to very weak expression. (b) In comparison to the 168 ORFs of the reference annotation and the 248 confirmed ORFs, about half of our novel ORFs were expressed at very low levels and thus might reflect not experimental but translational noise.
We used this shifted logistic function to compute a penalizing factor for start codon prediction. For each start codon candidate, we computed the fraction of reads downstream of this codon and multiplied the preliminary start codon prediction score (from the logistic regression) by the factor computed from the logistic function. The specific shape of the function is automatically chosen by our implementation such that (i) a preliminary score of 0.5 was still enough to call the start position (we used 0.1 as overall threshold), and (ii) that the score increases smoothly to 1 for positions where more than 60% of the reads belong to the translated region.
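A minimal sketch of such a penalizing factor is given below. The midpoint and steepness are hypothetical values chosen only so that the factor is close to 1 once about 60% of the reads lie downstream; the legend states that the implementation chooses the actual shape automatically, and that choice is not reproduced here. Only the overall threshold of 0.1 is taken from the text.

```python
import math

def downstream_penalty(fraction_downstream, midpoint=0.3, steepness=20.0):
    """Shifted logistic function of the fraction of reads downstream of a codon.

    midpoint and steepness are illustrative assumptions, not PRICE's values.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (fraction_downstream - midpoint)))

def call_start(preliminary_score, fraction_downstream, threshold=0.1):
    """Multiply the preliminary score by the penalty and apply the threshold."""
    final = preliminary_score * downstream_penalty(fraction_downstream)
    return final, final >= threshold

# A candidate with a preliminary score of 0.5 and 60% of reads downstream.
score_good, called_good = call_start(0.5, 0.6)
# The same preliminary score with no reads downstream.
score_bad, called_bad = call_start(0.5, 0.0)
```

With these illustrative parameters the first candidate keeps almost its full preliminary score and is called, while the second is penalized below the threshold.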
(a) Each panel scatters the mean read count per in-frame codon (in log space, discarding zero-count codons) against the fraction of in-frame (f=0) or off-frame (f=1 or f=2) codons above the cut-off of 10% of the mean log read count for all genes from the annotation. A smoothing spline is fit to each scatterplot (red lines). This spline is used as an estimate of the probability pf that a codon in frame f is observed (over this cut-off), when an ORF is actively translated with a specific strength (as estimated by the mean log read count per codon). To filter ORF candidates with too few observed codons, the variance of p0 is estimated by fitting another spline to the squared residuals for f = 0. The spline functions for mean fm and variance fv of p0 are used to construct a family of beta distributions with parameters (fm(a) · c(a), (1 − fm(a)) · c(a)), where c(a) = fm(a) · (1 − fm(a)) / fv(a) − 1, and their 5% quantiles q5(a). An ORF with a length of n amino acids and a mean log read count per in-frame codon of a is filtered if fewer than q5(a) · n reads are observed (i.e. below the blue line in the scatter plot).
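The beta parameters in this legend follow from matching the mean fm(a) and variance fv(a) of a beta distribution. The sketch below computes them for a single, made-up pair of values and estimates the 5% quantile by sampling with the standard library; the function names are hypothetical, the spline fits are not reproduced, and an exact quantile function from a statistics library could replace the sampling step.

```python
import random

def beta_parameters(mean, variance):
    """Method-of-moments beta parameters (mean * c, (1 - mean) * c) with
    c = mean * (1 - mean) / variance - 1."""
    if not 0.0 < mean < 1.0 or not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("need 0 < mean < 1 and 0 < variance < mean * (1 - mean)")
    c = mean * (1.0 - mean) / variance - 1.0
    return mean * c, (1.0 - mean) * c

def approx_quantile(alpha, beta, q=0.05, samples=20000, seed=0):
    """Approximate the q-quantile of Beta(alpha, beta) by sorted sampling."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(samples))
    return draws[int(q * samples)]

def is_filtered(observed, n, q5):
    """An ORF of length n is filtered if fewer than q5 * n are observed."""
    return observed < q5 * n

# Made-up spline values at one expression level a: fm(a) = 0.8, fv(a) = 0.01.
alpha, beta = beta_parameters(0.8, 0.01)
q5 = approx_quantile(alpha, beta)
```

Substituting these parameters into the standard beta mean and variance formulas returns fm(a) and fv(a), which is why c(a) takes the form given in the legend.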
About this article
Cite this article
Erhard, F., Halenius, A., Zimmermann, C. et al. Improved Ribo-seq enables identification of cryptic translation events. Nat Methods 15, 363–366 (2018). https://doi.org/10.1038/nmeth.4631