Improved Ribo-seq enables identification of cryptic translation events


Ribosome profiling has been used to predict thousands of short open reading frames (sORFs) in eukaryotic cells, but it suffers from substantial levels of noise. PRICE is a computational method that models experimental noise to enable researchers to accurately resolve overlapping sORFs and noncanonical translation initiation. We experimentally validated translation using major histocompatibility complex class I (MHC I) peptidomics and observed that sORF-derived peptides efficiently enter the MHC I presentation pathway and thus constitute a substantial fraction of the antigen repertoire.

Figure 1: The PRICE approach.
Figure 2: Re-decoding human cytomegalovirus.

Accession codes

Primary accessions

Proteomics Identifications Database

Referenced accessions

Proteomics Identifications Database


  1. Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S. & Weissman, J.S. Science 324, 218–223 (2009).
  2. Ingolia, N.T. Cell 165, 22–33 (2016).
  3. Gao, X. et al. Nat. Methods 12, 147–153 (2015).
  4. Ingolia, N.T., Lareau, L.F. & Weissman, J.S. Cell 147, 789–802 (2011).
  5. Calviello, L. et al. Nat. Methods 13, 165–170 (2016).
  6. Chun, S.Y., Rodriguez, C.M., Todd, P.K. & Mills, R.E. BMC Bioinformatics 17, 482 (2016).
  7. Ingolia, N.T. et al. Cell Rep. 8, 1365–1379 (2014).
  8. Bazzini, A.A. et al. EMBO J. 33, 981–993 (2014).
  9. Ji, Z., Song, R., Regev, A. & Struhl, K. eLife 4, e08890 (2015).
  10. Malone, B. et al. Nucleic Acids Res. 45, 2960–2972 (2017).
  11. Fields, A.P. et al. Mol. Cell 60, 816–827 (2015).
  12. Zhang, P. et al. Nat. Commun. 8, 1749 (2017).
  13. Rutkowski, A.J. et al. Nat. Commun. 6, 7126 (2015).
  14. Stern-Ginossar, N. et al. Science 338, 1088–1093 (2012).
  15. Weekes, M.P. et al. Cell 157, 1460–1472 (2014).
  16. Yewdell, J.W. Trends Immunol. 32, 548–558 (2011).
  17. Bassani-Sternberg, M., Pletscher-Frankild, S., Jensen, L.J. & Mann, M. Mol. Cell. Proteomics 14, 658–673 (2015).
  18. Karosiene, E., Lundegaard, C., Lund, O. & Nielsen, M. Immunogenetics 64, 177–186 (2012).
  19. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
  20. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
  21. Erhard, F. & Zimmer, R. Nucleic Acids Res. 43, e136 (2015).
  22. Hong, Y. Comput. Stat. Data Anal. 59, 41–51 (2013).
  23. Pachter, L. Preprint at (2011).
  24. Kowalewski, D.J. & Stevanović, S. Methods Mol. Biol. 960, 145–157 (2013).
  25. Cox, J. & Mann, M. Nat. Biotechnol. 26, 1367–1372 (2008).


This work was funded by the MRC (Clinical Fellowship grant G1002523 to L.D.), NHSBT (grant WP11-05 to L.D.), the European Research Council (grant ERC-2016-CoG 721016–HERPES to L.D.) and the Wellcome Trust (Senior Clinical Research Fellowship 108070/Z/15/Z to M.P.W.). R.Z. acknowledges partial funding from the DFG (SFB 1123) and from Bavaria (BioSysNet). We thank S. Gorsky for critical reading of the manuscript.

Author information




F.E. designed and implemented the computational approach. R.Z. supervised the development of the computational methods. A.H., C.Z., D.J.K. and S.S. provided the MHC I peptidome analysis. M.P.W. provided whole-proteome mass spectrometry data. A.L. provided Ribo-seq data used for the validation of this approach. F.E. and L.D. designed the experiments and wrote the paper.

Corresponding authors

Correspondence to Florian Erhard or Ralf Zimmer or Lars Dölken.

Ethics declarations

Competing interests

D.J.K. is an employee of Immatics Biotechnologies GmbH.

Integrated supplementary information

Supplementary Figure 1 Deterministic versus probabilistic mapping strategies.

(a) The read length distribution of the HSV-1 Ribo-seq experiment (see Supplementary Table 1) shows that cleavage is stochastic. Single 5’ mismatches are frequent and shift the read-length distribution by +1. (b) In the HSV-1 data set, the most abundant read class is 28 bp long, and the annotated codon triplets start at positions divisible by three within the read (Frame=0). Shorter or longer reads with frame 0 are also highly abundant, indicating high variability of cleavage distances downstream of the P site. Additionally, a substantial fraction of 29 bp long reads is in frame 1, and many 27 bp long reads are in frame 2. This indicates that cleavage upstream of the ribosome exhibits significant variation, albeit less than downstream. A substantial fraction of reads has a mismatch at their 5’ end, which can be attributed to untemplated nucleotide additions during cDNA synthesis. (c) For each of the data sets (see Supplementary Table 1), the fraction of reads that exhibit the most frequent read length among CDS-mapped reads is shown. In all data sets, only a minority of all reads have the canonical length. (d) Only a fraction of all canonical reads correspond to footprints with the P site codon at a defined distance from their respective start position in the read alignment (e.g. position 12 within the read).
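The read classes described in this legend can be tabulated with a few lines of code. The following is a minimal sketch, not part of PRICE: the function names, the tuple layout of a read and the toy coordinates are assumptions made for illustration. The frame is taken as the within-read position of the annotated codon starts modulo three, so frame 0 means the codons start at positions divisible by three within the read.

```python
from collections import Counter

def read_classes(reads, cds_start):
    """Count reads per (length, frame, has_5p_mismatch) class.

    reads: iterable of (start, length, has_5p_mismatch) in transcript coordinates.
    cds_start: transcript coordinate of the first base of the annotated start codon.
    """
    classes = Counter()
    for start, length, mismatch in reads:
        # Offset of the annotated codon starts within the read, modulo 3.
        frame = (cds_start - start) % 3
        classes[(length, frame, mismatch)] += 1
    return classes

def most_frequent_length(classes):
    """Return (length, fraction of all reads) for the most frequent read length."""
    by_length = Counter()
    for (length, _frame, _mismatch), n in classes.items():
        by_length[length] += n
    length, n = by_length.most_common(1)[0]
    return length, n / sum(by_length.values())

# Toy data (hypothetical): two 28 bp reads, one 29 bp read, one 27 bp read.
toy_reads = [(88, 28, False), (88, 28, False), (87, 29, False), (89, 27, True)]
toy_classes = read_classes(toy_reads, cds_start=100)
toy_length, toy_fraction = most_frequent_length(toy_classes)
```

In this toy input the 29 bp read starts one base further upstream and lands in frame 1, and the 27 bp read starts one base further downstream and lands in frame 2, mirroring the pattern described in panel (b).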

Supplementary Figure 2 Probabilistic mapping approach.

(a) Ribo-seq produces several distinct reads from a given codon (AGU) located in the P site of the ribosome. Besides the canonical read (1), both shorter (2) and longer (5) reads arise due to stochastic mRNA cleavage. Further variability is caused by observed (3) or unobservable (4) untemplated single nucleotide additions. The proportions of these reads are determined by the model parameters. (b) The deterministic approach of mapping reads to P site codons assigns all reads of a certain class (characterized by its length and 5’ mismatch state) to a specific codon at the same position within each read. In contrast, probabilistic mapping solves the inverse problem of the generative model to assign reads to codons. Finally, regularization can be used to overcome the problem of fractional mappings that are due to the over-parameterized problem.
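The idea of inverting a generative model to assign reads to codons can be illustrated with a generic expectation-maximization loop. This is a sketch under stated assumptions and not the PRICE model itself: the offset probabilities per read class are taken as known here (PRICE estimates its model parameters from the data), no regularization is applied, and all names and numbers are hypothetical.

```python
from collections import defaultdict

def em_assign(reads, offset_probs, n_iter=50):
    """Fractionally assign reads to P-site positions by expectation-maximization.

    reads: list of (start, read_class, count).
    offset_probs: {read_class: {offset: probability}}, i.e. the assumed
        probability that the P site lies `offset` bases downstream of the
        read start for reads of that class.
    Returns {position: expected number of reads}.
    """
    positions = {s + o for s, k, _ in reads for o in offset_probs[k]}
    activity = {p: 1.0 for p in positions}  # uniform starting guess
    for _ in range(n_iter):
        expected = defaultdict(float)
        for s, k, n in reads:
            # E-step: weight each candidate position by offset probability
            # times the current activity estimate of that position.
            weights = {s + o: pr * activity[s + o]
                       for o, pr in offset_probs[k].items()}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for p, w in weights.items():
                expected[p] += n * w / total
        # M-step: the new activity is the expected read count per position.
        activity = {p: expected.get(p, 0.0) for p in positions}
    return activity

# Toy example (hypothetical numbers): 100 unambiguous reads pin position 12;
# 10 ambiguous reads could stem from position 12 or 13.
toy_offsets = {"28": {12: 1.0}, "29": {11: 0.5, 12: 0.5}}
toy_reads = [(0, "28", 100), (1, "29", 10)]
toy_activity = em_assign(toy_reads, toy_offsets)
```

In the toy example, the ambiguous reads are progressively pulled towards the position that is already supported by unambiguous reads, which is the qualitative behaviour that distinguishes probabilistic from deterministic mapping in panel (b).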

Supplementary Figure 3 Start-codon prediction.

(a) Potential start codons are predicted by comparing the activity of each codon to the mean activity of downstream codons. Data from several different kinds of samples, such as translation start site profiling employing cells treated with lactimidomycin or harringtonine prior to cell lysis, can be included in the start codon prediction using a logistic regression approach. A start codon score (s) is computed via logistic regression for each codon in an ORF candidate. Three examples (x, y and z) are shown. (b) For all annotated ORFs with at least 25 reads, start scores (see Figure 1e) were predicted for all codons. We computed the relative ranks for all codons (i.e. the maximal scores get rank 0) and plotted the distribution of ranks for the annotated start codon. For both the HCMV and HSV-1 data sets, about three quarters of the start codons are top-ranked.
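The scoring and ranking described here can be sketched as follows. This is an illustration only: the single feature (log ratio of a codon's activity to the mean activity of a downstream window), the window size, the pseudocount and the regression coefficients are all assumptions, whereas PRICE fits its logistic regression on the data and can combine several sample types.

```python
import math

def start_score(counts, i, window=5, weight=1.0, bias=0.0, pseudo=1.0):
    """Logistic score for codon i from its activity relative to downstream codons.

    counts: per-codon read counts of an ORF candidate.
    window, weight, bias, pseudo: hypothetical settings, not fitted values.
    """
    downstream = counts[i + 1:i + 1 + window]
    if not downstream:
        return 0.0
    mean_down = sum(downstream) / len(downstream)
    x = math.log((counts[i] + pseudo) / (mean_down + pseudo))
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def relative_ranks(scores):
    """Relative rank in [0, 1] per codon; the maximal score gets rank 0."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = r / (n - 1) if n > 1 else 0.0
    return ranks

# Toy profile (hypothetical): a peak at codon 2 followed by uniform coverage.
toy_counts = [0, 0, 50, 5, 6, 4, 5, 5, 6, 4]
toy_scores = [start_score(toy_counts, i) for i in range(len(toy_counts))]
toy_ranks = relative_ranks(toy_scores)
```

For the toy profile, the codon carrying the initiation peak receives the highest score and therefore relative rank 0, which is the quantity whose distribution is plotted in panel (b).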

Supplementary Figure 4 Signal-to-noise ratio of read classes.

Read classes (characterized by read length and 5’ mismatch state) exhibit specific signal to noise ratios from slightly above the theoretical minimum (dashed horizontal line) to more than 8. Ignoring 5’ mismatches (pale orange) increases noise levels. For this experiment, the optimal deterministic mapping strategy utilizes four read classes, namely 28 and 29 bp without 5’ mismatch, and 29 and 30 bp long reads with 5’ mismatch.

Supplementary Figure 5 Improved signal-to-noise with PRICE.

(a) Deterministic mapping approaches trade off signal against signal to noise ratio (connected dots, from left to right) when considering reads of additional sizes. Probabilistic mapping and regularization (symbols for the regularization parameter δ) substantially improve Ribo-seq data quality for a broad range of published data sets (see Supplementary Table 1). (b) Aggregated reads around start codons demonstrate the improved signal to noise ratio of PRICE for both the HSV-1 and HCMV data set. Codon counts were normalized to the start codon for all well-translated ORFs (at least one read per codon on average) and aggregated per position relative to the start codon.

Supplementary Figure 6 Between-laboratory comparisons of HFF data sets (HCMV compared with HSV-1).

(a) The top-ranked uoORFs from ORF-scoring methods (RibORF, SPECtre, ORFScore, FLOSS) have significantly higher off-frame fractions (overlapping reads mapped out-of-frame divided by all overlapping reads) than uoORFs from ORF-identifying methods (PRICE, Rp-Bp, ORF-RATER). Data for PRICE with regularization (reg.) and using only the ORF-identifying component of PRICE with deterministic mapping (det.) are included. (b) Start codon distributions for the combined set of uORFs and uoORFs identified by the different methods are shown. Other refers to all other codons with one mismatch to AUG. Rp-Bp could only be run to identify NUG start codons. For each method, the data of the top 1000 uORFs/uoORFs according to the read density (number of reads from the HSV-1 data set mapped to codons divided by ORF length) are shown. Reproducibility of the respective ORFs between two Ribo-seq data sets from primary human fibroblasts is indicated. (c) / (d) Illustration of ORF classes analyzed in (e)-(h): uORFs (e) are defined to both start and end in the 5’-UTR of an annotated mRNA, whereas uoORFs (f) end within the downstream coding sequence. Both may either start with an AUG (g) or a non-canonical start codon (h). (e) - (h) The reproducibility (number of reproducible ORFs divided by total number of ORFs identified) is plotted against the total number of ORFs by varying their read density (translation rates). This is shown for each method and for different classes of ORFs (uORFs, uoORFs, uORFs/uoORFs starting with either an AUG or non-canonical start codon). PRICE with probabilistic mapping substantially outperforms all other methods in terms of reproducibility and total number of identified ORFs.
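The two ratios defined in this legend are simple to compute. The sketch below follows the definitions given in the text (off-frame fraction in panel a, reproducibility against number of ORFs in panels e to h); the function names, the input layout and the toy values are assumptions for illustration.

```python
def off_frame_fraction(in_frame_reads, off_frame_reads):
    """Overlapping reads mapped out-of-frame divided by all overlapping reads."""
    total = in_frame_reads + off_frame_reads
    return off_frame_reads / total if total else 0.0

def reproducibility_curve(orfs_a, orfs_b, thresholds):
    """Reproducibility versus number of ORFs when varying a read-density cut-off.

    orfs_a: {orf_id: read_density} for ORFs called in data set A.
    orfs_b: set of orf_ids called in data set B.
    Returns a list of (number of ORFs kept, fraction of them also found in B).
    """
    curve = []
    for t in thresholds:
        kept = [orf for orf, density in orfs_a.items() if density >= t]
        if not kept:
            continue
        reproducible = sum(1 for orf in kept if orf in orfs_b)
        curve.append((len(kept), reproducible / len(kept)))
    return curve

# Toy input (hypothetical ORF identifiers and densities).
toy_a = {"uORF1": 10.0, "uORF2": 5.0, "uoORF3": 1.0}
toy_b = {"uORF1", "uORF2"}
toy_curve = reproducibility_curve(toy_a, toy_b, thresholds=[0.0, 5.0, 10.0])
toy_off = off_frame_fraction(in_frame_reads=90, off_frame_reads=10)
```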

Supplementary Figure 7 Intra-laboratory comparisons of HFF data sets (HSV-1 replicate 1 compared with HSV-1 replicate 2).

(a) This is the same as Supplementary Fig. 6b, with the exception that not only the top 1000 ORFs were used but all. Note that the start codon distributions are virtually the same, however with a much larger fraction of reproducible ORFs for PRICE. The remaining panels are the same as Supplementary Fig. 6 for the comparison between two replicates of the HSV-1 data set.

Supplementary Figure 8 Sensitivity analysis using mass spectrometry data.

(a) PRICE has the greatest sensitivity for identifying annotated ORFs, as estimated by the total number of peptides detected in the whole proteome. (b) Novel ORFs remain virtually undetectable in the whole-proteome data.

Supplementary Figure 9 Sensitivity and specificity in MHC I peptidome analysis.

(a) Same analysis as in Fig. 1c of the main text, shown for MHC I peptidome data set 2. (b) Same analysis as in Fig. 1d of the main text, shown for MHC I peptidome data set 1.

Supplementary Figure 10 Predicted MHC I binding affinities.

(a) We used NetMHCcons [Karosiene et al., 2012] to predict the binding affinities of all identified peptides to the respective HLA allotypes for both MHC I peptidome data sets (published: HLA-A*03:01, HLA-A*23:01, HLA-B*08:01, HLA-B*15:01, HLA-C*07:02, HLA-C*07:04; new: HLA-A*01:01, HLA-A*03:01, HLA-B*08:01, HLA-B*51:01, HLA-C*07:01, HLA-C*01:02). Shown is the percentile rank of the IC50 value, as predicted by NetMHCcons. Strong binders are recognized by a percentile rank ≤ 1 (vertical dotted line) [Karosiene et al., 2012]. With the exception of Rp-Bp predicted peptides in the new data set, all predictions of novel ORFs result in peptides whose binding affinities strongly resemble those of peptides from annotated proteins.

Supplementary Figure 11 Rejected and confirmed novel HCMV ORFs.

(a) Many of the novel ORFs from the HCMV re-annotation (see Fig. 2 in the main text) were filtered by PRICE due to too much noise off-frame, too many inactive codons, noise down- or upstream, no clear start codon, abortive translation or due to very weak expression. (b) In comparison to the 168 ORFs of the reference annotation and the 248 confirmed ORFs, about half of our novel ORFs were expressed at very low levels and thus might reflect not experimental but translational noise.

Supplementary Figure 12 Penalizing factor of start-codon prediction.

We used this shifted logistic function to compute a penalizing factor for start codon prediction. For each start codon candidate, we computed the fraction of reads downstream of this codon and multiplied the preliminary start codon prediction score (from the logistic regression) by the factor computed from the logistic function. The specific shape of the function is automatically chosen by our implementation such that (i) a preliminary score of 0.5 was still enough to call the start position (we used 0.1 as overall threshold), and (ii) that the score increases smoothly to 1 for positions where more than 60% of the reads belong to the translated region.
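A shifted logistic penalty of this kind can be sketched as below. The midpoint and steepness values are purely hypothetical placeholders, since the legend states that the implementation chooses the shape automatically; only the overall threshold of 0.1 and the multiplication of the preliminary score by the factor are taken from the text.

```python
import math

def penalty_factor(downstream_fraction, midpoint=0.3, steepness=20.0):
    """Shifted logistic function of the fraction of reads downstream of a codon.

    midpoint and steepness are hypothetical shape parameters, not PRICE's values.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (downstream_fraction - midpoint)))

def penalized_start_score(preliminary_score, downstream_fraction, **shape):
    """Preliminary logistic-regression score multiplied by the penalty factor."""
    return preliminary_score * penalty_factor(downstream_fraction, **shape)

def is_start(preliminary_score, downstream_fraction, threshold=0.1, **shape):
    """Call a start codon if the penalized score reaches the overall threshold."""
    return penalized_start_score(preliminary_score, downstream_fraction, **shape) >= threshold

# With the placeholder shape, the factor is close to 1 once 60% of the reads
# lie downstream, so a preliminary score of 0.5 still passes the 0.1 threshold.
toy_factor = penalty_factor(0.6)
toy_call = is_start(0.5, 0.6)
```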

Supplementary Figure 13 Data-driven estimation of the noise model.

(a) Each panel scatters the mean read count per in-frame codon (in log space, discarding zero-count codons) against the fraction of in-frame (f = 0) or off-frame (f = 1 or f = 2) codons above the cut-off of 10% of the mean log read count for all genes from the annotation. A smoothing spline is fit to each scatterplot (red lines). This spline is used as an estimate of the probability p_f that a codon in frame f is observed (over this cut-off) when an ORF is actively translated with a specific strength (as estimated by the mean log read count per codon). To filter ORF candidates with too few observed codons, another spline is fit to the squared residuals of f = 0 as an estimate of the variance of p_0. The spline functions for the mean f_m and variance f_v of p_0 are used to construct a family of beta distributions with parameters (f_m(a) · c(a), (1 − f_m(a)) · c(a)), where c(a) = f_m(a) · (1 − f_m(a)) / f_v(a) − 1, and their 5% quantiles q_5(a). An ORF with length n amino acids and mean log read count per in-frame codon a is filtered if fewer than q_5(a) · n reads are observed (i.e. below the blue line in the scatter plot).
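The scaling factor c(a) is what results from matching the first two moments of a beta distribution to the spline estimates. A short derivation, using the shorthand m, v and c for the mean spline, the variance spline and the scaling factor at a given a (shorthand introduced here for readability, not notation from the paper):

```latex
\begin{aligned}
\alpha &= m\,c, \qquad \beta = (1-m)\,c, \qquad \alpha + \beta = c,\\
\mathbb{E}[p_0] &= \frac{\alpha}{\alpha+\beta} = m,\\
\operatorname{Var}[p_0] &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{m(1-m)}{c+1},\\
\operatorname{Var}[p_0] = v \;&\Longrightarrow\; c = \frac{m(1-m)}{v} - 1 .
\end{aligned}
```

The beta distribution therefore has mean m and variance v by construction, and its 5% quantile gives a lower bound on the expected fraction of observed in-frame codons at that expression strength.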

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Notes 1–3 and Supplementary Tables 1–2

Life Sciences Reporting Summary

Supplementary Table 3

Identified ORFs

Supplementary Software

Source code of PRICE version 1.0.1

Source data

Cite this article

Erhard, F., Halenius, A., Zimmermann, C. et al. Improved Ribo-seq enables identification of cryptic translation events. Nat Methods 15, 363–366 (2018).
