Review Article

Emerging evidence for functional peptides encoded by short open reading frames

Nature Reviews Genetics volume 15, pages 193–204 (2014)

  • An Erratum to this article was published on 04 March 2014

This article has been updated

Abstract

Short open reading frames (sORFs) are a common feature of all genomes, but their coding potential has mostly been disregarded, partly because of the difficulty in determining whether these sequences are translated. Recent innovations in computing, proteomics and high-throughput analyses of translation start sites have begun to address this challenge and have identified hundreds of putative coding sORFs. The translation of some of these has been confirmed, although the contribution of their peptide products to cellular functions remains largely unknown. This Review examines this hitherto overlooked component of the proteome and considers potential roles for sORF-encoded peptides.

Key points

  • Short open reading frames (sORFs) of <100 codons in length are common and are distributed throughout the genome, but not all sORFs are biologically relevant.

  • sORFs are found on non-coding RNAs and within the 5′ leader and 3′ trailer regions of mRNAs. They can also overlap with the main protein-coding sequence of mRNAs.

  • The identification of sORFs that are translatable and that are likely to encode short peptides remains a major challenge. Three complementary approaches that are typically used to discover functional sORFs are bioinformatics, transcriptomics and proteomics.

  • Bioinformatic studies have identified a large pool of potentially translatable sORFs on the basis of sequence characteristics such as degree of conservation, coding potential and context of the initiation codon.

  • Global ribosome profiling has provided evidence of ribosome engagement at the start codon of many sORFs in various species, including yeast, insects, plants and mammals.

  • Proteomic studies using mass spectrometry on size-fractionated whole-cell lysates have identified several short peptides encoded by sORFs (sPEPs) in human tissues and cell lines.

  • Functional sPEPs have been identified in insects, plants and mammals, but only a small number of them have been fully characterized.
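The length-based definition of an sORF used in the key points above can be illustrated with a small scanner. This is a minimal sketch and not the method of any tool cited in the Review: the function name `find_sorfs`, the ATG-only start rule, the forward-strand-only search and the handling of the 100-codon cutoff are all assumptions made for illustration. It only enumerates candidate sORFs; as the key points stress, finding one says nothing about whether it is translated or functional.

```python
# Illustrative sketch only: names, the ATG-only start rule and the
# forward-strand restriction are assumptions, not the Review's method.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_codons=100, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand candidate sORFs.

    An ORF runs from an ATG to the first in-frame stop codon. The codon
    count includes the start codon and excludes the stop codon. Only ORFs
    with min_codons <= count < max_codons are reported. Coordinates are
    0-based, and `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon until a stop codon or the sequence end.
                j = i
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 > len(seq):
                    break  # no in-frame stop downstream; nothing more in this frame
                n_codons = (j - i) // 3
                if min_codons <= n_codons < max_codons:
                    hits.append((i, j + 3, frame))
                i = j + 3  # resume scanning after the stop codon
                continue
            i += 3
    return hits


if __name__ == "__main__":
    # ATG AAA TTT TAG, starting at offset 2 (reading frame 2)
    print(find_sorfs("CCATGAAATTTTAGGG"))
```

Because scanning resumes after each stop codon, in-frame ATGs nested inside a reported ORF are not listed separately; a real pipeline would also need the reverse strand, non-AUG starts and the conservation and initiation-context filters described above.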

Change history

  • 04 March 2014

    In Table 2 (page 200) of the above article, the gene “RanGAP” was corrected to “SclA and SclB”, where Scl refers to the Sarcolamban gene in Drosophila melanogaster. The corresponding footnote was also corrected. The article has been corrected online. The editors apologize for this error.

Acknowledgements

This work was supported by a grant to J.A.R. from the Australian National Health and Medical Research Council (ID631551).

Author information

Affiliations

  1. School of Chemistry and Molecular Biosciences, University of Queensland, St. Lucia, Queensland, 4072, Australia.

    • Shea J. Andrews
    • Joseph A. Rothnagel

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Joseph A. Rothnagel.

Glossary

Short open reading frames

(sORFs). Open reading frames that are usually <100 codons in length but that can also be longer.

Coding DNA sequence

(CDS). An open reading frame (ORF) that encodes a verified protein product. The CDS is typically the first ORF identified and characterized on an mRNA. It defines the end of the 5′ leader and the start of the 3′ trailer sequences.

Ka/Ks test

A ratio that compares the number of nonsynonymous substitutions per nonsynonymous site with the number of synonymous substitutions per synonymous site.
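Written out, the ratio defined above takes the following form. The symbols are notation assumed here for illustration and are not taken from the article: N_d and S_d are the observed numbers of nonsynonymous and synonymous substitutions, and N and S are the numbers of nonsynonymous and synonymous sites.

```latex
\[
\frac{K_a}{K_s} \;=\; \frac{N_d / N}{S_d / S}
\]
```

A ratio well below 1 is conventionally read as evidence of purifying selection on the encoded amino acid sequence, which is why the test is used to flag sORFs that may be protein coding.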

Transcription activator-like effector nucleases

(TALENs). Engineered enzymes that permit precise editing of genomes and that can be used to make specific sequence changes in model organisms such as Arabidopsis thaliana, zebrafish and mice.

Microproteins

Negative regulators of multiprotein complexes. In this case, micro refers to the mechanism of action of these proteins rather than to their sizes.

About this article

DOI

https://doi.org/10.1038/nrg3520