Analysis

Classification and function of small open reading frames

Nature Reviews Molecular Cell Biology volume 18, pages 575–589 (2017)

This article has been updated

Abstract

Small open reading frames (smORFs) of 100 codons or fewer are usually — if arbitrarily — excluded from proteome annotations. Despite this, the genomes of many metazoans, including humans, contain millions of smORFs, some of which fulfil key physiological functions. Recently, the transcriptome of Drosophila melanogaster was shown to contain thousands of smORFs of different classes that actively undergo translation, which produces peptides of mostly unknown function. Here, we present a comprehensive analysis of smORFs in flies, mice and humans. We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation and peptides with a propensity to function as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm. We suggest that the different smORF classes could represent steps in gene, peptide and protein evolution. Our analysis introduces a distinction between different peptide-coding classes of smORFs in animal genomes, and highlights the role of model organisms for the study of small peptide biology in the context of development, physiology and human disease.

Key points

  • Small peptides of 100 amino acids or fewer are encoded by small open reading frames (smORFs) and mediate key physiological functions in animals and humans.

  • smORFs constitute 99% of transcribed, but only 1% of annotated, coding sequences in flies, mice and humans.

  • Different smORF classes show distinctive and predictive markers of functionality at the RNA level and the protein sequence level.

  • The characteristics of different smORF classes are evolutionarily conserved across animal species, encouraging the use of Drosophila melanogaster and Mus musculus as model organisms for studies of peptide biology in the context of development, physiology and disease.

  • Different smORF classes may represent steps in the origin and evolution of new genes and proteins.
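
The 100-codon cut-off used above can be illustrated with a short scan of a nucleotide sequence. This is only a minimal sketch and not the pipeline used in the analysis: the function name `find_smorfs` is hypothetical, and the code assumes an AUG (ATG) start codon, the standard stop codons, forward-strand frames only, and a codon count that excludes the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
MAX_CODONS = 100                     # cut-off stated in the text


def find_smorfs(seq, max_codons=MAX_CODONS):
    """Return (start, end, n_codons) for each ATG...stop ORF found in the
    three forward frames that has at most `max_codons` sense codons.

    `start` and `end` are 0-based, end-exclusive, and include the stop codon.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None  # position of the first ATG of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # sense codons, stop excluded
                if n_codons <= max_codons:
                    hits.append((start, i + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence: ATG AAA TTT TAG embedded at offset 2.
    print(find_smorfs("CCATGAAATTTTAGGG"))
```

Counting sense codons only keeps the nucleotide cut-off in line with the "100 amino acids or fewer" definition of the encoded peptide. A real annotation effort would also need to scan the reverse strand and consider non-canonical start codons.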

Change history

  • 01 August 2017

    The original online version of this article contained four errors, which have now been corrected. The corrections included two typos in the main text, the addition of a missing point in the X axis in Figure 3b, and the exchange of the position of two column headers in Figure 5c.

Acknowledgements

The authors thank their colleagues J. Pueyo, E. Magny, S. Bishop and F. Casares for helpful suggestions about the manuscript. This work was funded by grants from the Spanish Ministerio de Economía, Industria y Competitividad (MINECO; ref. BFU/2016-77793-P) and the British Biotechnology and Biological Sciences Research Council (BBSRC; ref. BB/N001753/1) to J.-P.C.

Author information

Affiliations

  1. Centro Andaluz de Biologia del Desarrollo, CSIC-UPO, Sevilla 41013, Spain.

    • Juan-Pablo Couso
  2. Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK.

    • Juan-Pablo Couso
    • Pedro Patraquim

Authors

  1. Juan-Pablo Couso

  2. Pedro Patraquim

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Juan-Pablo Couso.

Glossary

Ribosome profiling

A technique that globally probes RNA molecules that are being actively translated by ribosomes by analysing ribosome-protected RNA fragments (ribosomal footprints).

Translation efficiency

A measure of the rate of translation for a given mRNA feature, obtained in ribosome profiling experiments. It usually consists of the ratio between ribosomal footprints and RNA sequencing reads generated by the mRNA region.
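
The ratio described in this entry can be sketched as a small calculation. This is an illustrative sketch rather than the authors' method: the function name and parameters are hypothetical, and scaling each count by its library size (reads per million) is an assumption added here so that the two sequencing libraries are comparable.

```python
def translation_efficiency(footprints, rna_reads,
                           total_footprints, total_rna_reads):
    """Ratio of ribosomal footprints to RNA sequencing reads for one
    mRNA region, each scaled to reads per million of its own library.

    Returns None when the region has no RNA sequencing reads, because
    the ratio is then undefined. Hypothetical helper for illustration.
    """
    if rna_reads == 0:
        return None
    footprint_rpm = footprints / total_footprints * 1e6
    rna_rpm = rna_reads / total_rna_reads * 1e6
    return footprint_rpm / rna_rpm


if __name__ == "__main__":
    # 50 footprints in a library of 1 million reads and 100 RNA reads
    # in a library of 2 million reads give equal densities.
    print(translation_efficiency(50, 100, 1_000_000, 2_000_000))
```

Because both counts come from the same region, a length normalization would cancel out of the ratio, so it is omitted here.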

Protein isoforms

Variants of a given protein generated by the translation of alternative mRNA sequences, in distinct mRNAs produced by the same gene.

ORF tagging

A technique to probe the translation of a specific open reading frame (ORF), whereby a reporter sequence without a start codon is cloned in-frame with the assessed ORF.

Helix–loop–helix (HLH)

A DNA-binding domain that characterizes members of a transcription factor family. It is composed of two α-helices connected by a short loop.

Pseudogene

A paralogue of a functional protein-coding gene, which has lost its gene expression and/or protein-coding capacities.

Paralogue

Homologous gene within a given species, usually generated by gene duplication.

About this article

DOI

https://doi.org/10.1038/nrm.2017.58
