Proto-genes and de novo gene birth

Novel protein-coding genes can arise either through re-organization of pre-existing genes or de novo [1, 2]. Processes involving re-organization of pre-existing genes, notably after gene duplication, have been extensively described [1, 2]. In contrast, de novo gene birth remains poorly understood, mainly because translation of sequences devoid of genes, or ‘non-genic’ sequences, is expected to produce insignificant polypeptides rather than proteins with specific biological functions [1, 3, 4, 5, 6]. Here we formalize an evolutionary model according to which functional genes evolve de novo through transitory proto-genes [4] generated by widespread translational activity in non-genic sequences. Testing this model at the genome scale in Saccharomyces cerevisiae, we detect translation of hundreds of short species-specific open reading frames (ORFs) located in non-genic sequences. These translation events seem to provide adaptive potential [7], as suggested by their differential regulation upon stress and by signatures of retention by natural selection. In line with our model, we establish that S. cerevisiae ORFs can be placed within an evolutionary continuum ranging from non-genic sequences to genes. We identify ~1,900 candidate proto-genes among S. cerevisiae ORFs and find that de novo gene birth from such a reservoir may be more prevalent than sporadic gene duplication. Our work illustrates that evolution exploits seemingly dispensable sequences to generate adaptive functional innovation.

At a glance


    Figure 1: From non-genic sequences to genes through proto-genes.

    a, Proto-genes mirror for gene birth the well-described pseudo-genes for gene death. Circular arrow indicates gene origination from pre-existing genes, such as through gene duplication. Pseudo-genes are highly related to existing genes but have accumulated disabling mutations and translation of functional proteins is no longer possible [14]. The premise that pseudo-gene formation represents irreversible gene death has been challenged by reports of pseudo-gene resurrection [14] (bidirectional arrow). After enough evolutionary time pseudo-gene decay renders them indistinguishable from non-genic sequences (unidirectional arrow). Whereas pseudo-genes resemble known genes, proto-genes resemble no known genes. Proto-genes arise in non-genic sequences and either revert to non-genic sequences or evolve into genes (bidirectional arrow). There can be no reversion of genes to proto-genes (unidirectional arrow) as gene decay engenders pseudo-genes. b, Details of the proposed model for the gradual emergence of protein-coding genes in non-genic sequences via proto-genes. Solid arrows indicate the reversible emergence of ORFs in non-genic transcripts, or of transcripts containing non-genic ORFs. Examples where transcript appearance precedes ORF appearance have been described [1, 2, 8], but the reverse order of events cannot be ruled out. Arrows representing expression level symbolize transcription (hidden genetic variation) or transcription and translation (exposed genetic variation). The variations in width of these arrows reflect changes in expression level resulting, at least in part, from changes in regulatory sequences. Sequence composition refers to codon usage, amino acid abundances and structural features. c, Assigning conservation levels to S. cerevisiae ORFs. Conservation levels of annotated ORFs were assigned according to comparisons along the reconstructed phylogenetic tree, by inferring their presence (filled circles) or absence (open circles) in the different species according to the phylostratigraphy principle (Supplementary Information) [1]. Top right, number of ORFs assigned to each conservation level (logarithmic scale). A. gossypii, Ashbya (Eremothecium) gossypii; A. nidulans, Aspergillus nidulans; C. albicans, Candida albicans; D. hansenii, Debaryomyces hansenii; K. lactis, Kluyveromyces lactis; K. waltii, Kluyveromyces (Lachancea) waltii; N. crassa, Neurospora crassa; S. pombe, Schizosaccharomyces pombe.
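
The phylostratigraphy principle mentioned for panel c can be illustrated with a minimal sketch. It assumes that homologue detection has already been done upstream and that species are simply ranked from the closest to the most distant relative of S. cerevisiae. The species subset, the ranks and the function name `conservation_level` are hypothetical placeholders, not the conservation levels or the tool described in the Supplementary Information.

```python
# Toy ordering of species from closest to most distant relative of S. cerevisiae.
# The ranks are illustrative placeholders, not the paper's conservation levels.
SPECIES_BY_DISTANCE = [
    "S. paradoxus",
    "S. mikatae",
    "S. bayanus",
    "K. lactis",
    "C. albicans",
    "S. pombe",
]
RANK = {name: i + 1 for i, name in enumerate(SPECIES_BY_DISTANCE)}


def conservation_level(species_with_homologue):
    """Return the rank of the most distant species with a detected homologue.

    An ORF with no detected homologue gets level 0. Absence in an
    intermediate species does not lower the level, which is the core of
    the phylostratigraphy principle.
    """
    return max((RANK[s] for s in species_with_homologue), default=0)


# An ORF found nowhere else, and one found in a close and a distant relative.
species_specific = conservation_level([])
patchy = conservation_level(["S. paradoxus", "K. lactis"])
```

In this toy setting the second ORF is assigned the rank of K. lactis even though no homologue is recorded in S. mikatae or S. bayanus, because the most distant detection determines the level.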

    Figure 2: Existence of an evolutionary continuum ranging from non-genic ORFs to genes through proto-genes.

    a, Length (top; error bars represent s.e.m.), RNA expression level (middle; error bars represent s.e.m.), and proximity to transcription factor binding sites (bottom; error bars represent standard error of the proportion) of ORFs correlate with conservation level (Supplementary Table 4). P and τ, Kendall’s correlation statistics. Estimation of RNA abundance from RNA-Seq [25] in rich conditions. The positive correlation between proximity to transcription factor binding sites and conservation level is shown for a window of 200 nucleotides and holds when considering windows of 300, 400 and 500 nucleotides (Kendall’s τ = 0.14, 0.16, 0.17, respectively; P < 2.2×10^−16 in each case). b, Codon bias increases with conservation level (Supplementary Table 4). Codon bias estimated using the codon adaptation index (Supplementary Information). P and τ, Kendall’s correlation statistics. Error bars represent s.e.m. The large s.e.m. observed for ORFs5 may be related to the whole genome duplication event (Supplementary Fig. 3). c, Relative amino acid abundances shift with increasing conservation level. For each encoded amino acid, the ratio between its frequency in ORFs1–4 and its frequency in ORFs5–10 (grey), or the ratio between its frequency in ORFs1–4 and its frequency in ORFs0 (black), is plotted. Enrichment of cysteine in proteins encoded by ORFs1–4 relative to those encoded by ORFs5–10 (P < 1.8×10^−150, hypergeometric test) corresponds to 3.6±0.1 residues (mean, s.e.m.) per translation product. d, Predicted structural features of ORF translation products correlate with conservation level. ORFs0 were not included in these analyses as their short length hinders the reliability of structural predictions. Error bars represent s.e.m.
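
The Kendall correlations reported in this caption relate a per-ORF feature to an ordinal conservation level, so many ORFs share the same level. The following is a minimal sketch of Kendall's τ in its tie-aware form (τ-b). The caption does not state which variant or software was used, so the choice of τ-b, the function name `kendall_tau_b` and the toy values are assumptions for illustration only.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences.

    Uses the direct O(n^2) pair comparison, which is fine for small
    examples but slow for genome-scale data.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1  # pair tied in x (possibly in y as well)
            if dy == 0:
                tied_y += 1  # pair tied in y (possibly in x as well)
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    total_pairs = n * (n - 1) // 2
    denominator = sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Invented toy data: conservation level and ORF length for six ORFs.
levels = [0, 0, 1, 3, 5, 10]
lengths = [45, 60, 90, 150, 900, 1500]
tau = kendall_tau_b(levels, lengths)
```

A positive value indicates that ORFs at higher conservation levels tend to be longer in the toy data; a significance test would be needed on top of this to obtain a P value.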

    Figure 3: Translation and adaptive potential of recently emerged ORFs.

    a, Example of an ORF0+ showing signatures of translation in starvation conditions. Syntenic regions in Saccharomyces sensu stricto species are aligned. Orange and black boxes indicate in-frame start and stop sites, respectively. SCER, S. cerevisiae; SPAR, S. paradoxus; SMIK, S. mikatae; SBAY, S. bayanus. b, Significance of the observed number of ORFs0+. Distribution of the number of ORFs0 expected to show signatures of translation if the ribosome footprinting assay were non-specific (as modelled by randomizing footprint read positions 100 times; squares), or if the presence of ribosomes on non-genic transcripts were not related to the presence of ORFs0 (as modelled by randomizing ORFs0 positions 100 times; circles). P, empirical P value. c, AUG context of ORFs with and without translation signatures. The presence of an adenine at position −3 from the start codon indicates optimum AUG context (Supplementary Information). P and τ, Kendall’s correlation statistics. Asterisks mark significant differences between ORFs with and without translation signatures (P < 0.05, Fisher’s exact test). d, Candidate proto-genes tend to undergo condition-specific translation. e, Signatures of intra-species purifying selection. The positive correlation (Supplementary Table 4) holds when only considering ORFs that are free from overlap with ORFs1–10 (Supplementary Fig. 7), and is not entirely driven by the interdependence between strength of purifying selection and expression level (Supplementary Information) [29, 30]. Asterisk marks a significant difference in proportion of ORFs under significant intra-species purifying selection between ORFs0+ and ORFs1 (P = 0.0001, hypergeometric test). P and τ, Kendall’s correlation statistics. Error bars represent standard error of the proportion in all panels.
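
The randomization test described for panel b can be sketched as follows. This is a simplified illustration and not the pipeline used in the paper: the genome is a toy linear coordinate system, an ORF counts as "covered" when at least one read position falls inside it, and the empirical P value uses a +1 correction. All of these choices, and the names `count_covered`, `shuffle_positions` and `empirical_p`, are assumptions; the actual criteria for translation signatures are given in the Supplementary Information.

```python
import random


def count_covered(orfs, reads, min_reads=1):
    """Count (start, end) intervals containing at least min_reads read positions."""
    return sum(
        1
        for start, end in orfs
        if sum(start <= r < end for r in reads) >= min_reads
    )


def shuffle_positions(orfs, genome_length, rng):
    """Move each interval to a uniformly random start, preserving its length."""
    shuffled = []
    for start, end in orfs:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def empirical_p(observed, null_values):
    """Share of randomized values at least as large as the observed one.

    The +1 in numerator and denominator avoids reporting exactly zero.
    """
    at_least = sum(v >= observed for v in null_values)
    return (at_least + 1) / (len(null_values) + 1)


# Invented toy data: three short ORFs and a handful of footprint read positions.
rng = random.Random(0)
genome_length = 10_000
orfs = [(100, 160), (900, 990), (4000, 4045)]
reads = [110, 120, 950, 4010, 4020, 7000]

observed = count_covered(orfs, reads)
null = [
    count_covered(shuffle_positions(orfs, genome_length, rng), reads)
    for _ in range(100)
]
p_value = empirical_p(observed, null)
```

Randomizing read positions instead of ORF positions, the other null model in the caption, would follow the same pattern with the roles of `orfs` and `reads` exchanged.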

    Figure 4: Identification of proto-genes in a continuum ranging from non-genic ORFs to genes.

    a, Characterization of candidate proto-genes (ORFs0+ and ORFs1–4). Venn diagram not drawn to scale. b, The binary model of annotation (top) and the proposed continuum (bottom).


  1. Tautz, D. & Domazet-Loso, T. The evolutionary origin of orphan genes. Nature Rev. Genet. 12, 692–702 (2011)
  2. Kaessmann, H. Origins, evolution, and phenotypic impact of new genes. Genome Res. 20, 1313–1326 (2010)
  3. Jacob, F. Evolution and tinkering. Science 196, 1161–1166 (1977)
  4. Siepel, A. Darwinian alchemy: human genes from noncoding DNA. Genome Res. 19, 1693–1695 (2009)
  5. Khalturin, K., Hemmrich, G., Fraune, S., Augustin, R. & Bosch, T. C. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 25, 404–413 (2009)
  6. Wilson, B. A. & Masel, J. Putatively noncoding transcripts show extensive association with ribosomes. Genome Biol. Evol. 3, 1245–1252 (2011)
  7. Jarosz, D. F., Taipale, M. & Lindquist, S. Protein homeostasis and the phenotypic manifestation of genetic diversity: principles and mechanisms. Annu. Rev. Genet. 44, 189–216 (2010)
  8. Cai, J., Zhao, R., Jiang, H. & Wang, W. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics 179, 487–496 (2008)
  9. Wu, D. D., Irwin, D. M. & Zhang, Y. P. De novo origin of human protein-coding genes. PLoS Genet. 7, e1002379 (2011)
  10. Ekman, D. & Elofsson, A. Identifying and quantifying orphan protein sequences in fungi. J. Mol. Biol. 396, 396–405 (2010)
  11. Lipman, D. J., Souvorov, A., Koonin, E. V., Panchenko, A. R. & Tatusova, T. A. The relationship of protein conservation and sequence length. BMC Evol. Biol. 2, 20 (2002)
  12. Wolf, Y. I., Novichkov, P. S., Karev, G. P., Koonin, E. V. & Lipman, D. J. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc. Natl Acad. Sci. USA 106, 7273–7280 (2009)
  13. Cai, J. J. & Petrov, D. A. Relaxed purifying selection and possibly high rate of adaptation in primate lineage-specific genes. Genome Biol. Evol. 2, 393–409 (2010)
  14. Zheng, D. & Gerstein, M. B. The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends Genet. 23, 219–224 (2007)
  15. Oliver, S. G. et al. The complete DNA sequence of yeast chromosome III. Nature 357, 38–46 (1992)
  16. Fisk, D. G. et al. Saccharomyces cerevisiae S288C genome annotation: a working hypothesis. Yeast 23, 857–865 (2006)
  17. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008)
  18. Boyer, J. et al. Large-scale exploration of growth inhibition caused by overexpression of genomic fragments in Saccharomyces cerevisiae. Genome Biol. 5, R72 (2004)
  19. Brar, G. A. et al. High-resolution view of the yeast meiotic program revealed by ribosome profiling. Science 335, 552–557 (2012)
  20. Li, Q. R. et al. Revisiting the Saccharomyces cerevisiae predicted ORFeome. Genome Res. 18, 1294–1303 (2008)
  21. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481–1488 (2000)
  22. Giacomelli, M. G., Hancock, A. S. & Masel, J. The conversion of 3′ UTRs into coding regions. Mol. Biol. Evol. 24, 457–464 (2007)
  23. Prat, Y., Fromer, M., Linial, N. & Linial, M. Codon usage is associated with the evolutionary age of genes in metazoan genomes. BMC Evol. Biol. 9, 285 (2009)
  24. Yomtovian, I., Teerakulkittipong, N., Lee, B., Moult, J. & Unger, R. Composition bias and the origin of ORFan genes. Bioinformatics 26, 996–999 (2010)
  25. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009)
  26. Vishnoi, A., Kryazhimskiy, S., Bazykin, G. A., Hannenhalli, S. & Plotkin, J. B. Young proteins experience more variable selection pressures than old proteins. Genome Res. 20, 1574–1581 (2010)
  27. Gao, L. Z. & Innan, H. Very low gene duplication rate in the yeast genome. Science 306, 1367–1370 (2004)
  28. Hayden, E. J., Ferrada, E. & Wagner, A. Cryptic genetic variation promotes rapid evolutionary adaptation in an RNA enzyme. Nature 474, 92–95 (2011)
  29. Pal, C., Papp, B. & Hurst, L. D. Highly expressed genes in yeast evolve slowly. Genetics 158, 927–931 (2001)
  30. Drummond, D. A., Raval, A. & Wilke, C. O. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23, 327–337 (2006)

Author information


  1. Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA

    • Anne-Ruxandra Carvunis,
    • Thomas Rolland,
    • Michael A. Calderwood,
    • Nicolas Simonis,
    • Benoit Charloteaux,
    • Justin Barbette,
    • Balaji Santhanam,
    • Michael E. Cusick &
    • Marc Vidal
  2. Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Anne-Ruxandra Carvunis,
    • Thomas Rolland,
    • Michael A. Calderwood,
    • Nicolas Simonis,
    • Benoit Charloteaux,
    • Justin Barbette,
    • Balaji Santhanam,
    • Michael E. Cusick &
    • Marc Vidal
  3. UJF-Grenoble 1/CNRS/TIMC-IMAG UMR 5525, Computational and Mathematical Biology Group, Grenoble F-38041, France

    • Anne-Ruxandra Carvunis &
    • Nicolas Thierry-Mieg
  4. Department of Systems Biology, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Ilan Wapinski
  5. Center for International Development and Harvard University, Cambridge, Massachusetts 02138, USA

    • Muhammed A. Yildirim
  6. Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liege, 4000 Liege, Wallonia-Brussels Federation, Belgium

    • Benoit Charloteaux
  7. The MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA

    • César A. Hidalgo
  8. Howard Hughes Medical Institute, Department of Cellular and Molecular Pharmacology, University of California, San Francisco, and California Institute for Quantitative Biosciences, San Francisco, California 94158, USA

    • Gloria A. Brar &
    • Jonathan S. Weissman
  9. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA

    • Aviv Regev
  10. Howard Hughes Medical Institute, Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

    • Aviv Regev
  11. Present address: Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Campus Plaine, Free University of Brussels, 1050 Brussels, Wallonia-Brussels Federation, Belgium.

    • Nicolas Simonis


A.-R.C., I.W., M.E.C. and M.V. conceived the project. A.-R.C. led the project and performed most of the analyses. T.R. evaluated cross-species transfer events, optimized the ribosome footprint analysis pipeline and assisted in other analyses. I.W. designed the conservation level tool and calculated most of the purifying selection statistics. M.A.C., C.A.H., A.R. and N.T.-M. advised on the research. M.A.Y. aligned the sequencing reads. B.S. predicted disordered and transmembrane regions and assisted in the cross-species transfer analyses. N.S. and B.C. assisted in analyses. G.A.B. and J.S.W. shared their expertise in ribosome footprinting data analysis and provided the meiosis ribosome footprinting raw and processed data. A.-R.C., T.R., M.E.C. and M.V. designed the figures. All authors contributed to writing the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Information (11.1M)

    This file contains Supplementary Figures 1-8, Supplementary Methods, Supplementary Tables 1-4 and additional references.
