Article | Published:

Human 5′ UTR design and variant effect prediction from a massively parallel translation assay


The ability to predict the impact of cis-regulatory sequences on gene expression would facilitate discovery in fundamental and applied biology. Here we combine polysome profiling of a library of 280,000 randomized 5′ untranslated regions (UTRs) with deep learning to build a predictive model that relates human 5′ UTR sequence to translation. Together with a genetic algorithm, we use the model to engineer new 5′ UTRs that accurately direct specified levels of ribosome loading, providing the ability to tune sequences for optimal protein expression. We show that the same approach can be extended to chemically modified RNA, an important feature for applications in mRNA therapeutics and synthetic biology. We test 35,212 truncated human 5′ UTRs and 3,577 naturally occurring variants and show that the model predicts ribosome loading of these sequences. Finally, we provide evidence of 45 single-nucleotide variants (SNVs) associated with human diseases that substantially change ribosome loading and thus may represent a molecular basis for disease.
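As a rough illustration of the design loop the abstract describes, the sketch below pairs a scoring function with a simple evolutionary search (random point mutation plus truncation selection, no crossover) to steer 50-nucleotide sequences toward a specified score. Everything in it is an assumption made for illustration: `toy_score` is a hypothetical heuristic standing in for a trained model, not the published network, and the names `one_hot`, `mutate`, `evolve` and `variant_effect` do not come from the authors' code. Sequences use the DNA alphabet.

```python
import random

NUCLEOTIDES = "ACGT"
UTR_LENGTH = 50  # matches the 50-nucleotide random library named in Fig. 1


def one_hot(seq):
    """Encode a sequence as rows of 0/1 values in A, C, G, T order."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]


def toy_score(seq):
    """Hypothetical stand-in for a model's ribosome-loading prediction.

    Penalizes upstream ATG codons and high GC content. This is an
    illustrative heuristic only, not the published model.
    """
    upstream_atgs = sum(seq[i:i + 3] == "ATG" for i in range(len(seq) - 2))
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    return 8.0 - 2.0 * upstream_atgs - 3.0 * gc_fraction


def mutate(seq, rng):
    """Return a copy of seq with exactly one position substituted."""
    pos = rng.randrange(len(seq))
    choices = [n for n in NUCLEOTIDES if n != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def evolve(target, score_fn=toy_score, pop_size=50, generations=200, seed=0):
    """Search for a sequence whose score is as close as possible to target."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(NUCLEOTIDES) for _ in range(UTR_LENGTH))
        for _ in range(pop_size)
    ]

    def fitness(seq):
        # Higher is better: negative distance between score and target.
        return -abs(score_fn(seq) - target)

    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)


def variant_effect(seq, pos, alt, score_fn=toy_score):
    """Score change caused by substituting `alt` at 0-based position `pos`."""
    variant = seq[:pos] + alt + seq[pos + 1:]
    return score_fn(variant) - score_fn(seq)


if __name__ == "__main__":
    best = evolve(target=6.5)
    print(best, round(toy_score(best), 2))
```

In practice the scoring function would be replaced by predictions from a trained sequence-to-ribosome-loading model, and `variant_effect` shows how the same scorer can compare a reference sequence with a single-nucleotide variant.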

Data availability

The authors declare that all data supporting the findings of this study are available from Gene Expression Omnibus under accession GSE114002.

Code availability

The code for the Optimus 5-Prime model is provided in the Supplementary Code file. All code is also available at


  1.

    Araujo, P. R. et al. Before it gets started: regulating translation at the 5′ UTR. Comp. Funct. Genom. 2012, 475731 (2012).

  2.

    Jackson, R. J., Hellen, C. U. T. & Pestova, T. V. The mechanism of eukaryotic translation initiation and principles of its regulation. Nat. Rev. Mol. Cell Biol. 11, 113–127 (2010).

  3.

    Angermueller, C., Pärnamaa, T., Parts, L. & Stegle, O. Deep learning for computational biology. Mol. Syst. Biol. 12, 878 (2016).

  4.

    Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  5.

    Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  6.

    Kleftogiannis, D., Kalnis, P. & Bajic, V. B. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res. 43, e6 (2015).

  7.

    Liu, F., Li, H., Ren, C., Bo, X. & Shu, W. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci. Rep. 6, 28517 (2016).

  8.

    Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  9.

    Zhao, W. et al. Massively parallel functional annotation of 3′ untranslated regions. Nat. Biotechnol. 32, 387–391 (2014).

  10.

    Noderer, W. L. et al. Quantitative analysis of mammalian translation initiation sites by FACS-seq. Mol. Syst. Biol. 10, 748 (2014).

  11.

    Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).

  12.

    Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).

  13.

    Zuccotti, P. & Modelska, A. in Post-Transcriptional Gene Regulation (ed. Dassi, E.) 59–69 (Humana Press, 2016).

  14.

    Floor, S. N. & Doudna, J. A. Tunable protein synthesis by transcript isoforms in human cells. eLife 5, e10921 (2016).

  15.

    Wang, X., Hou, J., Quedenau, C. & Chen, W. Pervasive isoform‐specific translational regulation via alternative transcription start sites in mammals. Mol. Syst. Biol. 12, 875 (2016).

  16.

    Whiffin, N. et al. Characterising the loss-of-function impact of 5′ untranslated region variants in whole genome sequence data from 15,708 individuals. Preprint at (2019).

  17.

    Hinnebusch, A. G., Ivanov, I. P. & Sonenberg, N. Translational control by 5′-untranslated regions of eukaryotic mRNAs. Science 352, 1413–1416 (2016).

  18.

    Morris, D. R. & Geballe, A. P. Upstream open reading frames as regulators of mRNA translation. Mol. Cell. Biol. 20, 8635–8642 (2000).

  19.

    Johnstone, T. G., Bazzini, A. A. & Giraldez, A. J. Upstream ORFs are prevalent translational repressors in vertebrates. EMBO J. 35, 706–723 (2016).

  20.

    Lee, S. et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc. Natl Acad. Sci. USA 109, E2424–E2432 (2012).

  21.

    Reuter, K., Biehl, A., Koch, L. & Helms, V. PreTIS: a tool to predict non-canonical 5′ UTR translational initiation sites in human and mouse. PLoS Comput. Biol. 12, e1005170 (2016).

  22.

    Starck, S. R. et al. Translation from the 5′ untranslated region shapes the integrated stress response. Science 351, aad3867 (2016).

  23.

    Hinnebusch, A. G. The scanning mechanism of eukaryotic translation initiation. Annu. Rev. Biochem. 83, 779–812 (2014).

  24.

    Kozak, M. Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes. Cell 44, 283–292 (1986).

  25.

    Kozak, M. Influences of mRNA secondary structure on initiation by eukaryotic ribosomes. Proc. Natl Acad. Sci. USA 83, 2850–2854 (1986).

  26.

    Zadeh, J. N. et al. NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173 (2011).

  27.

    Ferreira, J. P., Overton, K. W. & Wang, C. L. Tuning gene expression with synthetic upstream open reading frames. Proc. Natl Acad. Sci. USA 110, 11284–11289 (2013).

  28.

    Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell in press (2019).

  29.

    Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).

  30.

    Ray, D. et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499, 172–177 (2013).

  31.

    Karikó, K. et al. Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Mol. Ther. 16, 1833–1840 (2008).

  32.

    Anderson, B. R. et al. Incorporation of pseudouridine into mRNA enhances translation by diminishing PKR activation. Nucleic Acids Res. 38, 5884–5892 (2010).

  33.

    Kierzek, E. et al. The contribution of pseudouridine to stabilities and structure of RNAs. Nucleic Acids Res. 42, 3492–3501 (2014).

  34.

    Seo, S. W. et al. Predictive design of mRNA translation initiation region to control prokaryotic translation efficiency. Metab. Eng. 15, 67–74 (2013).

  35.

    Jensen, M. K. & Keasling, J. D. Recent applications of synthetic biology tools for yeast metabolic engineering. FEMS Yeast Res. 15, 1–10 (2015).

  36.

    Salis, H. M., Mirsky, E. A. & Voigt, C. A. Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950 (2009).

  37.

    Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).

  38.

    Hernandez, R. D. et al. Singleton variants dominate the genetic architecture of human gene expression. Preprint (2018).

  39.

    Battle, A. et al. Impact of regulatory variation from RNA to protein. Science 347, 664–667 (2015).

  40.

    Cenik, C. et al. Integrative analysis of RNA, translation, and protein levels reveals distinct regulatory variation across humans. Genome Res. 25, 1610–1621 (2015).

  41.

    Wang, B. & Bissell, D. M. Hereditary Coproporphyria (University of Washington, 2012).

  42.

    Boria, I. et al. The ribosomal basis of Diamond–Blackfan anemia: mutation and database update. Hum. Mutat. 31, 1269–1279 (2010).

  43.

    Qin, Y. et al. Germline mutations in TMEM127 confer susceptibility to pheochromocytoma. Nat. Genet. 42, 229–233 (2010).

  44.

    Mignone, F. et al. Untranslated regions of mRNAs. Genome Biol. 3, reviews0004.1 (2002).

  45.

    Leppek, K., Das, R. & Barna, M. Functional 5′ UTR mRNA structures in eukaryotic translation regulation and how to find them. Nat. Rev. Mol. Cell Biol. 19, 158–174 (2018).

  46.

    Richner, J. M. et al. Vaccine mediated protection against Zika virus-induced congenital disease. Cell 170, 273–283 (2017).

  47.

    Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).

  48.

    Zhao, L., Liu, Z., Levy, S. F. & Wu, S. Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics 34, 739–747 (2017).

  49.

    Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from (2015).

  50.

    Smedley, D. et al. BioMart—biological queries made easy. BMC Genomics 10, 22 (2009).


Acknowledgements

We would like to thank A. Rosenberg and J. Linder for helpful discussions on data analysis and modeling. We would also like to thank M. Moore, A. Hsieh and Y. Lim for constructive comments on the manuscript. We are grateful to C. Wang for providing fluorescence data (ref. 27). This work was supported by a sponsored research agreement by Moderna and National Institutes of Health grant R01HG009892 to G.S.

Author information

P.J.S. and B.W. designed and performed experiments, performed data analysis and modeling, and wrote the manuscript. D.W.R. performed fluorescence validation experiments. V.P. and I.M. wrote the manuscript. D.R.M. helped design polysome profiling. G.S. designed experiments and wrote the manuscript.

Correspondence to Georg Seelig.

Ethics declarations

Competing interests

P.J.S., B.W., G.S. and D.R.M. declare no competing interests. D.W.R., V.P. and I.M. are employees and shareholders of Moderna.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figures 1–15, Supplementary Tables 4 and 5, and Supplementary Note 1

Reporting Summary

Supplementary Table 2: Data of eGFP expression tested for 10 UTRs in Fig. 2e.

Supplementary Table 3: Statistical details for the 16 box plots in Fig. 3b.

Supplementary Code: IPython notebooks for Optimus 5-Prime and its generalized version.

Fig. 1: A library of 280,000 random 50-nucleotide oligomers as 5′ UTRs for eGFP.
Fig. 2: Modeling 5′ UTR sequences and ribosome loading.
Fig. 3: Design of new 5′ UTRs.
Fig. 4: Model performance with human 5′ UTRs and generalization to 5′ UTRs of varying length.