Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth

  • Nature Ecology & Evolution 1, Article number: 0146 (2017)
  • doi:10.1038/s41559-017-0146


The phenomenon of de novo gene birth from junk DNA is surprising, because random polypeptides are expected to be toxic. There are two conflicting views about how de novo gene birth is nevertheless possible: the continuum hypothesis invokes a gradual gene birth process, whereas the preadaptation hypothesis predicts that young genes will show extreme levels of gene-like traits. We show that intrinsic structural disorder conforms to the predictions of the preadaptation hypothesis and falsifies the continuum hypothesis, with all genes having higher levels than translated junk DNA, but young genes having the highest level of all. Results are robust to homology detection bias, to the non-independence of multiple members of the same gene family and to the false positive annotation of protein-coding genes.

It has become clear that protein-coding genes can originate de novo from non-coding sequences1. This is surprising, because amyloid is a generic structural form of any polypeptide, making the expression of random polypeptides a dangerous affair2. De novo gene birth is a radical evolutionary transition, and two hypotheses have previously been presented to explain how it is possible. The ‘continuum’ view posits that there is a series of intermediate stages, or ‘proto-genes’, between non-genes and genes3. In contrast, the ‘preadaptation’ theory posits that de novo birth is an all-or-nothing transition to functionality, and the key to successful innovation is the imperative to avoid the most toxic ‘hopeless monster’ options in favour of ‘hopeful monsters’4,​5,​6; given a marker for such avoidance, newborn genes will therefore have exaggerated, rather than intermediate, gene-like characteristics. In other words, newborn gene birth occurs only from sequences that happen to be pre-adapted to ‘first, do no harm’ in the most direct way possible. Only later do they adapt to protect themselves against risks in subtler ways, increasing tolerance with respect to other more subtle characteristics7.

These two theories can be empirically distinguished, given a simple trait that systematically makes proteins less likely to be harmful. According to the continuum theory, the trait should be strongest in old genes, intermediate in young genes and weakest in non-coding sequences. According to the preadaptation theory, the trait should be strongest in young genes, intermediate in old genes and weakest in non-coding sequences (Fig. 1).

Figure 1: The continuum and preadaptation hypotheses make incompatible predictions about the properties of intergenic sequences relative to young versus old genes.

A good candidate trait is the degree of intrinsic structural disorder (ISD), that is, the degree to which a given peptide folds as a stable three-dimensional structure (ordered) versus as a rather flexible and unstructured entity (disordered)8. Predicted levels of ISD can conveniently be calculated from sequence alone9. Natural protein sequences are more intrinsically disordered than random sequences10. Meanwhile, there is conflicting evidence as to whether young genes are more or less disordered than old genes. Increased ISD is found in orphan domains and recent extensions to domains11,​12,​13,​14. Elevated ISD was found in complete orphan genes in Leishmania15 and in genes created de novo in alternative reading frames of existing viral genes16, but low ISD was found in Saccharomyces orphan genes3,13.

Here, we stratify genes in two distantly related eukaryotic organisms, the house mouse and baker’s yeast, by age, and use predicted ISD values to show that young genes do not behave like intermediates between non-genic sequences and older genes, as predicted by a continuum theory. Instead, young genes have exaggerated gene-like structural properties, as expected from sequences biased by the demands of preadaptation.

Results and discussion

We rigorously determine how ISD depends on gene age in mice using a phylostratigraphy approach17 to assign ages to genes. We exclude genes unique to a single species, as these are likely to be contaminated with many false positives that are not protein-coding genes at all. Mouse is a good choice of taxon because of the quality of its gene annotation and the large number of genome sequences available for closely related species7, which together provide temporal resolution among young protein-coding genes. The rarity of horizontal gene transfer in mouse ancestors is also an important consideration.

We find that young mouse genes have higher ISD (Fig. 2). Our statistical analysis avoids pseudoreplication in the form of phylogenetic confounding among multiple members of the same gene family, by controlling for gene family as a random effect in a linear model; this is not a technique we have previously encountered in related literature.
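The effect of treating gene families, rather than genes, as the unit of replication can be illustrated with a deliberately simplified sketch. This is not the linear mixed model used in the analysis: it only averages within each family before averaging across families, which is a crude stand-in for a random effect of gene family. The record layout and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def family_level_means(records):
    """Average ISD per phylostratum, weighting each gene family once.

    records: iterable of (gene_family, phylostratum, isd) tuples.
    This is a simplified stand-in for a mixed model with gene family
    as a random effect, not a reimplementation of one.
    """
    per_family = defaultdict(list)
    for family, stratum, isd in records:
        per_family[(stratum, family)].append(isd)

    per_stratum = defaultdict(list)
    for (stratum, _family), values in per_family.items():
        # One data point per family, however many paralogues it has.
        per_stratum[stratum].append(mean(values))

    return {stratum: mean(values) for stratum, values in per_stratum.items()}


# Hypothetical example: one large family and one singleton family.
records = [
    ("fam1", "young", 0.6),
    ("fam1", "young", 0.6),
    ("fam1", "young", 0.6),
    ("fam2", "young", 0.2),
]
by_family = family_level_means(records)
naive = mean(isd for _, _, isd in records)
```

In this toy input the gene-level mean is pulled towards the large family, whereas the family-level mean weights both families equally.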

Figure 2: Young genes have higher ISD (black circles) than old genes.

This result from the analysis of 15,347 mouse genes is unchanged by correction for evolutionary rate, and only becomes stronger after correction for length (green squares). Back-transformed central tendency estimates ± 1 s.e.m. come from a linear mixed model, where gene family, phylostratum and length are random, fixed and quantitative terms, respectively. Importantly, this means that we do not treat genes as independent data points, but instead take into account phylogenetic confounding and use gene families as independent data points. Length-corrected ISD values are with respect to a standardized length of 179 amino acids. Both young and old genes have higher ISD than intergenic sequences (blue diamond) and repeat-masked intergenic sequences (light blue triangle). Phylostrata on the x axis are labelled according to the clade in which the oldest detectable homologue of a gene can be found. To minimize homology detection bias, the oldest phylostrata have been condensed into a single pre-vertebrate phylostratum.

The validity of phylostratigraphy has been challenged on the grounds of homology detection bias18,​19,​20. Disordered proteins evolve faster15,21,22; if this makes homology undetectable, then high ISD could cause young gene status, rather than young gene status being the cause of high ISD23. Homology detection bias is minimized by focusing on the youngest genes20; we therefore collapsed all pre-vertebrate phylostrata into a single ‘old gene’ category, to focus on gene ages with the least homology detection bias. What is more, correcting for the influence of evolutionary rate, via a linear regression analysis, had no effect on the predictive power of gene age (P > 0.05), despite the fact that evolutionary rate and age are correlated (Fig. 3a). Both proponents24 and detractors18,19 of phylostratigraphy have used simulations of protein evolution to justify their position on the impact of homology detection bias; unfortunately, this line of argument relies on our ability to model protein evolution realistically. Our more empirical approach suggests low impact, in a more direct and less model-dependent fashion.

Figure 3: In agreement with many previous studies, young genes evolve faster and are shorter.

a,b, These properties are directly causal for homology detection bias, hence there is no way to produce bias-corrected values as for Fig. 2. However, the statistical insignificance of rate correction in Fig. 2 suggests that homology detection bias is negligible. Back-transformed central tendency estimates ± 1 s.e.m. come from a linear mixed model, where gene family and phylostratum are random and fixed terms, respectively. aa, amino acids.

Homology is also easier to detect for longer genes and, in agreement with previous findings7, length and age are correlated (Fig. 3b). However, longer genes have higher ISD as scored with IUPred, with quite large effect sizes—we find a Pearson correlation coefficient (following transformation for normality) of 0.17 for old genes and in the range of 0.32–0.44 in newer phylostrata. Using a linear model to correct for the length–ISD relationship therefore makes our ISD–age relationship stronger, not weaker (Fig. 2, green).

Note that our ISD scores strongly reflect amino acid composition, with low ISD representing hydrophobicity. In this light, our findings agree with previous results on length-dependent frequencies of particular amino acids25, but contradict previous studies, restricted to single-domain globular proteins, that showed no length-dependence for hydrophobicity as a whole26,27. Applying the same hydrophobicity measure26 to our more comprehensive protein set, we continue to find that long genes have more hydrophilic amino acids (Pearson correlation coefficients only slightly weaker, at −0.13 for old genes, and in the range of −0.24 to −0.45 in new phylostrata).

The high ISD of young genes is primarily due to amino acid composition (calculated as the ISD of scrambled versions of a gene; Fig. 4a, orange) rather than the exact order of the amino acids (calculated as the difference between the ISD of the real gene and that of a scrambled control; Fig. 4b). This suggests that genes are born with high ISD, driven by amino acid composition. Amino acid composition can therefore be seen as a preadaptation28, or ‘nonaptation’ in the terminology of ref. 29, for de novo gene birth. This does not imply any pre-adaptive process of the kind postulated elsewhere4,​5,​6,30. Instead, the term preadaptation simply refers to backward-time conditional probability; given that gene birth occurred, the non-coding sequence from which the gene was born is likely to have had more favourable characteristics for gene birth than the average non-coding sequence. Note that while higher guanine–cytosine (GC) content leads to higher ISD31, GC content is not the primary driver of the high ISD-promoting amino acid composition of young genes (Fig. 4a).

Figure 4: Elevated ISD can be broken down into contributions from amino acid composition and from exact amino acid order.

a, ISD in real proteins (black circles) relative to amino acid scrambled controls (orange squares) and controls generated to have matched GC content (yellow diamonds), with error bars showing the back transform of the central tendency estimates ± 1 s.e.m. derived from mixed models as in Fig. 2. Excess ISD is driven primarily by amino acid composition, not GC content or precise amino acid order. b, Paired comparisons show that the small excess in ISD relative to that predicted from amino acid composition is statistically significant (95% confidence intervals are shown) in all young genes except the very youngest, despite the broad confidence intervals in a that do not take into account the paired nature of the data.

In contrast, the contribution of amino acid order to ISD seems to be an adaptation rather than a preadaptation in young genes (those born in vertebrates), because it is not initially present but appears only after some time. This contributes an independent line of evidence that high ISD values are favoured in young genes.

Now that we have determined how ISD depends on gene age, the non-coding sequences from which de novo proteins must be born allow us to distinguish between the continuum and preadaptation hypotheses. The continuum theory predicts that non-coding sequences will resemble exaggerated versions of young genes, and hence have the highest ISD. In contrast, the preadaptation theory expects young genes to show the most extreme deviation from random sequences, predicting that non-coding sequences will have the lowest ISD.
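The logic of this test can be written down as a small sketch. The function name and the numerical values are hypothetical; the only thing taken from the text is the ordering each hypothesis predicts: under the continuum view young genes sit between non-coding sequences and old genes, whereas under preadaptation old genes sit between non-coding sequences and young genes.

```python
def supported_hypothesis(isd_noncoding, isd_young, isd_old):
    """Classify an ordering of mean ISD values.

    'continuum': young genes are intermediate between non-coding
    sequences and old genes.
    'preadaptation': young genes are the most extreme group, with old
    genes intermediate between non-coding sequences and young genes.
    """
    if min(isd_noncoding, isd_old) < isd_young < max(isd_noncoding, isd_old):
        return "continuum"
    if min(isd_noncoding, isd_young) < isd_old < max(isd_noncoding, isd_young):
        return "preadaptation"
    return "neither"


# Hypothetical mean ISD values, for illustration only.
print(supported_hypothesis(isd_noncoding=0.2, isd_young=0.5, isd_old=0.4))
print(supported_hypothesis(isd_noncoding=0.6, isd_young=0.5, isd_old=0.4))
```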

We sampled intergenic sequences near each mouse gene in our analysis as representative of the raw material from which de novo genes are born, rather than analysing randomly generated sequences matching only a subset of known variables, such as GC content31. Our intergenic controls reflect the subtleties found in a real genome with a complex evolutionary history, such as the avoidance of CpG sites. We find strong evidence refuting the continuum hypothesis and supporting the preadaptation hypothesis (Fig. 2). This result is not attributable to repetitive sequences; results are nearly identical when RepeatMasker is used to filter the intergenic control sequences (Fig. 2). The size of the gap between the average ISD of a translated intergenic sequence and that of young genes strikingly illustrates the nature of the filter applied during de novo gene birth, and the relevance of ISD to the process.

Why, then, did a previous yeast study3 find that young genes and proto-genes have low ISD? We believe that the difference lies in the annotation procedure. Our study included only genes with BLASTp homology across at least two species, whereas the previous study accepted BLASTn homology, which might result in homologous non-coding sequences being scored as protein-coding genes. When there is a mixture of true protein-coding genes combined with sequences that do not encode a functional polypeptide, the mean of the entire mixture will of course be intermediate between the means of the two groups. The overall mean will depend strongly on the ratio between the two components. Specifically, if the proportion of non-functional open reading frames (ORFs) decreases with conservation level, then a continuum will automatically be observed, and in the wrong direction as a function of apparent gene age. Figure 5 illustrates, in context, the statistical problem known as Simpson’s paradox32, which drives this effect.
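A toy calculation makes the mixture argument concrete. All numbers below are hypothetical and chosen only for illustration: true genes are given ISD that falls with age, non-genes are given low ISD, and the fraction of true genes in each age class is assumed to rise with age.

```python
# Hypothetical values, not estimates from any dataset.
MEAN_ISD_NONGENE = 0.20
TRUE_GENE_MEAN_ISD = {"young": 0.50, "intermediate": 0.45, "old": 0.40}
FRACTION_TRUE_GENES = {"young": 0.10, "intermediate": 0.50, "old": 1.00}


def mixture_mean(fraction_true, mean_gene, mean_nongene):
    """Mean of a mixture of true genes and non-genic ORFs."""
    return fraction_true * mean_gene + (1.0 - fraction_true) * mean_nongene


apparent = {
    age: mixture_mean(FRACTION_TRUE_GENES[age], TRUE_GENE_MEAN_ISD[age], MEAN_ISD_NONGENE)
    for age in TRUE_GENE_MEAN_ISD
}

for age in ("young", "intermediate", "old"):
    print(age, "true genes:", TRUE_GENE_MEAN_ISD[age], "mixture:", round(apparent[age], 3))
```

Within the true genes ISD declines with age, yet the mixture mean rises with age, because the youngest class is dominated by non-genes. The sign of the trend is reversed by aggregation, which is the signature of Simpson's paradox.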

Figure 5: Putative evidence for the continuum hypothesis can be explained as a statistical artefact known as Simpson’s paradox.

a, The continuum view posits the existence of proto-genes that have ‘characteristics intermediate between non-genic ORFs and genes’3. Candidate proto-genes were classified on the basis of being annotated as ORFs and having detectable sequence homology in sister species (without necessarily retention of approximate ORF boundaries), and the authors of ref. 3 claimed to show a continuum of properties as a function of conservation level, shown as a greyscale. b, The same data can be explained without resorting to the existence of such intermediates. Sequence homology for ORFs that are not protein-coding genes (white circles) becomes more difficult to detect as a function of age, such that the proportion of true genes (black circles) increases with age, giving rise to the same observations as a. The downward trend in ISD arises as an example of Simpson’s paradox32. c, By carefully excluding all non-genes, we see the true relationship between gene age and ISD, and compare it with intergenic control sequences that are definitely not protein-coding genes. Note that if true protein-coding genes were excluded in b (rather than excluding non-genes as in c), there would be no relationship with conservation levels.

To verify that this is responsible for the discrepancy, we repeated our mouse pipeline on yeast (Fig. 6a) and confirmed that following our methods, young yeast genes, like young mouse genes, have higher ISD. To pinpoint the source of the discrepancy, we used gene age classifications from the previous study (A.-R. Carvunis, personal communication of dataset) and, omitting gene families from the analysis, reproduced the previously reported trend of low ISD in young proto-genes (Fig. 6b, black). Details such as the treatment of disulfide bonds made no substantive difference (Fig. 6b, dark blue). However, filtering out potentially non-coding sequences eliminated the previously reported trend (Fig. 6b, light blue). Details of the elimination criteria are shown in Table 1; the preferential elimination of younger phylostrata is consistent with the operation of Simpson’s paradox. Figure 6 shows that the differences between our conclusions and those of a previous study3 are due to the categorization of what is a gene, not to the details of the ISD calculations or of how genes are assigned to phylostrata. In both our mouse and yeast analyses, we were careful to discard all possible non-genes, leaving us looking at a single group and not a mixture in each phylostratum. Mouse has more well-verified young protein-coding genes, allowing for clearer resolution of ISD on shorter timescales.

Figure 6: Young yeast genes, like the young mouse genes in Fig. 2, have higher ISD.

a, Back-transformed central tendency estimates ± 1 s.e.m. come from a linear mixed model, where gene family and phylostratum are random and fixed terms, respectively. Phylostrata are labelled according to the species most distantly related to S. cerevisiae in which a homologue is still found, except for the S. kudriavzevii group, which includes younger genes found in at least two species. The analysis includes 5,452 yeast genes that overlap with the genes used in ref. 3, with filtering indicated in Table 1. b, Using the age classifications of ref. 3 (Table 1, second column) and ignoring gene family, we reproduce the trend of low ISD in young proto-genes using our slightly different ISD measurement. Standard means ± 1 s.e.m. are reported for untransformed ISD estimates. This trend is insensitive to whether cysteines are included (black circles) or excluded (blue diamonds) from the protein primary sequence. This trend disappears when we screen out proto-genes that lack strong evidence for a functional protein product (light blue squares), by excluding genes whose age we could not classify or that were unique to S. cerevisiae, and those classified as ‘dubious’ in SGD (Table 1, last column). Correspondences between the ages assigned by the two phylostratigraphies are indicated with shaded triangles between the two figure parts.

Table 1: Number of genes assigned to each of the conservation levels annotated in ref. 3.

Note that if a continuum of increasing ISD with age was to take place only on short timescales (a less parsimonious hypothesis than ours, and one that Simpson’s paradox would make difficult to confirm), our mouse analysis restricts it to, at most, the last ~21–82 million years before the split of mouse and rat, and after the split of mouse and rabbit. In contrast, the continuum of ISD scores reported in ref. 3 is claimed to go all the way back to the split between Saccharomyces cerevisiae and Candida albicans (~300 million years), despite the much shorter generation times of yeast.

Figure 5 shows that the existence of intermediate proto-genes is not necessary to explain data on trends in mean properties. What is more, the very concept of a proto-gene as intermediate between gene and non-gene is problematic, with inappropriately teleological connotations. However, as a non-teleological definition, it may be useful to refer to slightly expressed but non-functional ORFs as proto-genes. Stretches between a start codon and a stop codon in a transcript (ORFs) occur frequently by chance. ORFs that encode highly deleterious polypeptides, and that are translated at low levels, are purged from a population more rapidly than relatively harmless ORFs, and this fact could help explain the phenomenon of de novo gene birth6. Pervasive transcription subject to rapid evolutionary turnover33 and leading to non-functional translation6 provides the raw materials for proto-genes defined in this fashion. However, it must be noted that even harmless ORFs, in the absence of selection for some beneficial property, are rapidly disrupted by mutation. There is therefore a discrete dichotomy between the states of ‘gene’ and ‘non-gene’, determined by whether the selection coefficient is greater than zero, and hence capable of sustaining their continued existence in the face of mutational onslaught34. A dichotomy of functionality (defined in evolutionary terms) is still compatible with the idea that some non-genes are under selection that weeds out the most deleterious of options, in a manner that promotes evolvability4,5.

In terms of the adaptive potential of non-coding sequences, although even the young genes have much higher ISD on average than found in sequences translated from randomly chosen junk DNA, these averages conceal considerable variation, with greater variation among gene families than among intergenic sequences35 (see Supplementary Fig. 1), suggestive of diversifying selection. Of our intergenic sequences, 12.7% yield ISD levels within the range of the highest 75% of all genes in the youngest phylostratum considered here, creating relatively little barrier to de novo gene birth. On the surface, protein length would seem to be a stronger constraint—only 25% of annotated young genes are less than 108 amino acids long, far longer than expected by chance in junk DNA—although biases in gene annotation may mean that a complete and unbiased set of young genes encode, in reality, even shorter proteins.

Once proteins are born with a given ISD, evolutionary tinkering and differential loss seem to change ISD only slowly, resulting in the consistent trend seen over hundreds of millions of years in Fig. 2. While gene birth is a sudden transition to functionality, subsequent descent with modification can generate extraordinarily slow trends.

Methods

M. musculus

Mus musculus proteins from Ensembl (v75)36 were subjected to a BLASTp37 search with an E-value threshold of 0.001 (ref. 20) against the National Center for Biotechnology Information (NCBI) non-redundant protein sequences (nr) database (June 2014). The most phylogenetically distant hit was used to place the gene into one of the 20 phylostrata (gene ages) listed in Supplementary Table 1, following Dollo’s parsimony and neglecting horizontal gene transfer. From 22,778 available protein sequences, 126 could not be successfully assigned to any phylostratum due to BLASTp-related problems such as too short queries, majority of query composed of low-complexity sequences, or a combination of both. These sequences were not considered for further analysis.
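The age-assignment rule can be sketched as follows. The list of phylostrata is a hypothetical, abbreviated stand-in for the 20 strata of Supplementary Table 1, the input format is assumed, and treating the threshold as inclusive is also an assumption. Each hit is taken to be already labelled with the phylostratum that its taxon shares with mouse.

```python
E_VALUE_THRESHOLD = 0.001

# Hypothetical, abbreviated ordering from youngest to oldest.
PHYLOSTRATA = [
    "Mus musculus",
    "Rodentia",
    "Mammalia",
    "Vertebrata",
    "Eukaryota",
    "Cellular organisms",
]


def assign_phylostratum(hits):
    """Return the oldest phylostratum among significant hits.

    hits: iterable of (phylostratum_of_hit_taxon, e_value) pairs.
    Returns None when no hit passes the threshold.
    """
    ranks = [
        PHYLOSTRATA.index(stratum)
        for stratum, e_value in hits
        if e_value <= E_VALUE_THRESHOLD  # inclusive cut-off is an assumption
    ]
    if not ranks:
        return None
    # Dollo parsimony: the gene is at least as old as its most distant hit.
    return PHYLOSTRATA[max(ranks)]


example_hits = [("Rodentia", 1e-30), ("Vertebrata", 1e-5), ("Eukaryota", 0.5)]
print(assign_phylostratum(example_hits))
```

The Eukaryota hit in the example is ignored because its E-value exceeds the threshold, so the gene is placed in Vertebrata.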

To remove dubious genes and perform evolutionary rate-controlled ISD estimates, dN/dS values were downloaded for all mouse proteins from the Ensembl BioMart38 (accessed on 18 February 2016) and mapped to our dataset using the Ensembl protein ID. Evolutionary rates were calculated using PAML by comparing all mouse proteins with their orthologues in rats. Genes with no rat orthologue of amino acid sequence identity greater than 50% were excluded, leaving 17,762 non-orphan mouse genes, all with dN/dS values, for further study. When rat had multiple orthologues meeting this quality filter, the one with the highest rate was taken (to prevent any further exclusion of genes with high evolutionary rate, beyond the low bar of detectable mouse–rat homology). Restricting analysis to one-to-one orthologues did not qualitatively change the results.

Pairwise paralogue information among non-orphan mouse genes was taken from Ensembl, from which gene families were constructed via a single-link cluster analysis. This yielded 8,124 gene families, 7,113 of which showed complete agreement among their member genes regarding age. Of the remainder, 824 gene families contained genes assigned to exactly two different phylostrata; 526 of these had only a single gene in the younger phylostratum, which was reassigned back to the older phylostratum. Of the remainder split across exactly two phylostrata, 150 had only a single older member, which we reassigned to be younger, leaving 148 gene families unclassified. In addition, 187 gene families contained genes split among more than two phylostrata, of which 86 could similarly easily be reconciled by discounting the gene age status of singletons, leaving another 101 gene families unclassified. Most of the 249 gene families with unclassified ages are split between multiple old phylostrata; as we had no shortage of data in these older phylostrata, and as this group includes complex scenarios such as gene fusion and repetitive sequences, these gene families were excluded from further analysis, leaving 15,347 total genes for our analysis.
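The family construction and age reconciliation described above can be sketched with a union-find structure. This is one reading of the procedure, not the original code: ages are represented as hypothetical integer ranks (larger means older), and for families split across more than two phylostrata the rule "discount singletons" is interpreted as accepting the family only if exactly one phylostratum contains more than one gene.

```python
from collections import Counter, defaultdict


def single_link_families(genes, paralogue_pairs):
    """Group genes into families: any chain of paralogue links joins a family."""
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in paralogue_pairs:
        parent[find(a)] = find(b)

    families = defaultdict(list)
    for gene in genes:
        families[find(gene)].append(gene)
    return list(families.values())


def reconcile_family_age(ages):
    """Return a consensus age rank for a family, or None if unclassified."""
    counts = Counter(ages)
    if len(counts) == 1:
        return ages[0]
    if len(counts) == 2:
        younger, older = sorted(counts)
        if counts[younger] == 1:
            return older      # lone young gene is reassigned to the older stratum
        if counts[older] == 1:
            return younger    # lone old gene is reassigned to the younger stratum
        return None
    non_singletons = [age for age, n in counts.items() if n > 1]
    return non_singletons[0] if len(non_singletons) == 1 else None


families = single_link_families(
    ["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]
)
```

The family counts reported in the text are internally consistent under this bookkeeping: 7,113 + 824 + 187 = 8,124 families, and 148 + 101 = 249 unclassified families.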

Twenty-six gene families, consisting of 29 M. musculus genes, did not originally return NCBI nr BLAST hits outside their own species, yet Ensembl reported dN/dS values relative to a rat orthologue that met our sequence identity filter. Fourteen of the genes also returned this rat orthologue from NCBI’s nr database as of August 2016. A sample of those that did not return the rat orthologue nevertheless passed manual inspection of the protein-coding status of the Ensembl-identified orthologue. These were therefore assigned to the Rodentia phylostratum.

Our single-link clustering plus cleanup procedure to construct gene families produced a much better fit to the data than treating genes as being independent, explaining far more variance than phylostratum, the property of interest (ΔAkaike’s information criterion (AIC) = 114 removing phylostratum from the model versus ΔAIC = 9,928 removing the random effect of gene family from the model).

We calculated ISD using IUPred9, after first excising all cysteines from the protein sequence (from Ensembl v73) because of uncertainty about their disulfide bond status combined with a profound impact of disulfide bond status on ISD39. For each gene, we averaged the ISD across all other amino acids and performed a Box–Cox transformation (Box–Cox exponent λ = 0.66, optimized using only coding genes, not controls) before linear model analysis. Central tendency estimates and confidence intervals were then back-transformed for the plots.
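The preprocessing and transformation steps can be sketched as below. IUPred itself is not called here; the per-residue scores are assumed to come from running the predictor on the cysteine-free sequence, and the function names and example values are hypothetical. The Box–Cox formula is the standard one.

```python
import math

LAMBDA_MOUSE = 0.66  # exponent reported for mouse in the text


def excise_cysteines(protein):
    """Remove all cysteines before disorder prediction."""
    return protein.replace("C", "")


def mean_isd(per_residue_scores):
    """Average per-residue disorder scores into one value per gene."""
    return sum(per_residue_scores) / len(per_residue_scores)


def box_cox(x, lam):
    """Standard Box-Cox transform for positive x."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Back-transform model estimates to the original ISD scale."""
    return math.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)


# Hypothetical example values.
sequence = excise_cysteines("MCKCA")
isd = mean_isd([0.2, 0.4, 0.6])
transformed = box_cox(isd, LAMBDA_MOUSE)
recovered = inverse_box_cox(transformed, LAMBDA_MOUSE)
```

Because the transform is nonlinear, back-transformed means and standard errors are generally asymmetric on the original scale, which is why the figures report back-transformed central tendency estimates rather than arithmetic means.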

Protein lengths were approximately log-transformed (Box–Cox λ = −0.0432). Hydrophobicities were calculated first for amino acids: leucine, isoleucine, valine, phenylalanine, methionine and tryptophan were scored as +1, and all other amino acids were scored as −1. Then the mean hydrophobicity for a protein was used to examine the length-dependence of amino acid composition.
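The hydrophobicity score defined above is simple enough to state directly in code. The function name is hypothetical; the scoring rule is the one given in the text.

```python
HYDROPHOBIC = set("LIVFMW")  # leucine, isoleucine, valine, phenylalanine, methionine, tryptophan


def mean_hydrophobicity(protein):
    """Mean of +1 (hydrophobic) and -1 (all other residues) over a protein."""
    return sum(1 if aa in HYDROPHOBIC else -1 for aa in protein) / len(protein)


print(mean_hydrophobicity("LLKK"))    # half hydrophobic, half not
print(mean_hydrophobicity("LIVFMW"))  # entirely hydrophobic
```

The score ranges from −1 (no hydrophobic residues) to +1 (only hydrophobic residues).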

For each gene, we generated scrambled controls by resampling amino acids without replacement. To generate GC-matched controls, the numbers of GC and adenine–thymine (AT) nucleotides were calculated excluding the stop codon, then GC versus AT identity was resampled without replacement, and then G versus C and A versus T were assigned at 50% probability. If a premature stop codon arose, one of the three stop codon nucleotides was switched with another nucleotide position chosen at random. This process was iterated until no premature stop codons remained, and then a stop codon was appended to the end.
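A minimal sketch of both control generators follows. It assumes an in-frame coding sequence in upper-case DNA letters that ends in a stop codon, and it re-appends the original stop codon, since the text does not say which stop codon was used. Function names are hypothetical.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}


def scramble_protein(protein, rng):
    """Resample amino acids without replacement (a random permutation)."""
    return "".join(rng.sample(protein, len(protein)))


def first_premature_stop(nts):
    """Index of the first in-frame stop codon, or None."""
    for i in range(0, len(nts) - 2, 3):
        if "".join(nts[i:i + 3]) in STOPS:
            return i
    return None


def gc_matched_control(cds, rng):
    """Random coding sequence with the same GC and AT counts as cds."""
    body, stop = cds[:-3], cds[-3:]
    # Permute GC-versus-AT identity, then pick G/C or A/T with equal odds.
    classes = ["GC" if nt in "GC" else "AT" for nt in body]
    rng.shuffle(classes)
    nts = [rng.choice(cls) for cls in classes]
    # Swap one nucleotide of each premature stop with a random position
    # until no in-frame stop remains.
    while (i := first_premature_stop(nts)) is not None:
        j = i + rng.randrange(3)
        k = rng.randrange(len(nts))
        nts[j], nts[k] = nts[k], nts[j]
    return "".join(nts) + stop


rng = random.Random(0)
cds = "ATGGCCAAGTTTGGGCCCTAA"
control = gc_matched_control(cds, rng)
scrambled = scramble_protein("MAKFGP", rng)
```

Swapping positions rather than mutating nucleotides keeps the GC count fixed while the stop codons are removed, so the control matches the original in GC content exactly.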

To generate one intergenic control per gene, we took one intergenic sequence 100 nucleotides downstream from the 3′ end of the Ensembl v80 annotation of the transcript, and progressed further, excising stop codons along the way, until a length match to the neighbouring protein-coding gene was obtained. We then obtained a second control sequence near each gene by repeating the process after starting a search 100 nucleotides farther downstream. For the RepeatMasked40 controls, intergenic sequences farther downstream were used as necessary to extract the control sequence from a contiguous non-masked intergenic sequence.
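The extraction of a length-matched intergenic control can be sketched as below. Reading the sequence in a fixed frame that starts at the offset is an assumption, as is the input format (a string holding the genomic sequence immediately downstream of the transcript's 3′ end). The function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def intergenic_control(downstream, n_codons, offset=100):
    """Collect n_codons non-stop codons starting offset nt downstream.

    downstream: genomic sequence beginning right after the transcript's
    3' end. Returns None if the sequence runs out before a length match.
    """
    codons = []
    for i in range(offset, len(downstream) - 2, 3):
        codon = downstream[i:i + 3]
        if codon in STOPS:
            continue  # excise stop codons along the way
        codons.append(codon)
        if len(codons) == n_codons:
            return "".join(codons)
    return None


# Hypothetical downstream region: 100 nt of padding, then five codons.
region = "A" * 100 + "ATGTAACCCTGAGGG"
first_control = intergenic_control(region, n_codons=3)
```

A second control per gene would correspond to calling the same function with `offset=200`.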

S. cerevisiae

Genes taken in June 2014 from the Saccharomyces Genome Database (SGD)41 were assigned gene ages according to the procedure described above for mouse. We supplemented our phylostratigraphic analysis of species supported by the NCBI taxonomy browser with a selection of more closely related yeast species; our youngest phylostratum contains any S. cerevisiae genes with a homologue found in S. kudriavzevii (in most cases), or in a still more closely related yeast species (for a handful of genes). As for M. musculus, we constructed gene families using single-link cluster analysis on pairwise paralogue information from Ensembl, and the ages of single discordant genes were reconciled as described above with the other age assignments within their gene family. Genes classified by us as specific to S. cerevisiae were excluded from many analyses, as were genes that we failed to classify using BLAST and those classified as ‘dubious’ in SGD. ‘Conservation level’ (an alternative phylostratigraphy that includes BLASTn homology detection) was provided (A.-R. Carvunis, personal communication) to reproduce the classification presented in ref. 3. ISD values were calculated as for mouse except with Box–Cox λ = 0.554.

Data availability

Source data for the statistical analyses and figures are provided in Supplementary Tables 2–7. Code associated with generating and analysing these tables is publicly available at

Additional information

How to cite this article: Wilson, B. A., Foy, S. G., Neme, R. & Masel, J. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat. Ecol. Evol. 1, 0146 (2017).



  1. 1.

    & New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Phil. Trans. R. Soc. B 370, 20140332 (2015).

  2. 2.

    & Prevention of amyloid-like aggregation as a driving force of protein evolution. EMBO Rep. 8, 737–742 (2007).

  3. 3.

    et al. Proto-genes and de novo gene birth. Nature 487, 370–374 (2012).

  4. 4.

    Cryptic genetic variation is enriched for potential adaptations. Genetics 172, 1985–1991 (2006).

  5. 5.

    & The evolution of molecular error rates and the consequences for evolvability. Proc. Natl Acad. Sci. USA 108, 1082–1087 (2011).

  6. 6.

    & Putatively noncoding transcripts show extensive association with ribosomes. Genome Biol. Evol. 3, 1245–1252 (2011).

  7. 7.

    & Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14, 117 (2013).

  8. 8.

    . et al. Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput. 1998, 437–448 (1998).

  9. 9.

    , , & IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21, 3433–3434 (2005).

  10. 10.

    et al. Natural protein sequences are more intrinsically disordered than random sequences. Cell. Mol. Life Sci. 73, 2949–2957 (2016).

  11. 11.

    , & Quantifying the mechanisms of domain gain in animal proteins. Genome Biol. 11, R74 (2010).

  12. 12.

    & The dynamics and evolutionary potential of domain loss and emergence. Mol. Biol. Evol. 29, 787–796 (2012).

  13. 13.

    & Identifying and quantifying orphan protein sequences in fungi. J. Mol. Biol. 396, 396–405 (2010).

  14. 14.

    & Dynamics and adaptive benefits of modular protein evolution. Curr. Opin. Struct. Biol. 23, 459–466 (2013).

  15. 15.

    , & Elucidating evolutionary features and functional implications of orphan genes in Leishmania major. Infect. Genet. Evol. 32, 330–337 (2015).

  16. 16.

    , , , & Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation. J. Virol. 83, 10719–10736 (2009).

  17.

    A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23, 533–539 (2007).

  18.

    Phylostratigraphic bias creates spurious patterns of genome evolution. Mol. Biol. Evol. 32, 258–267 (2015).

  19.

    Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Mol. Biol. Evol. 33, 1245–1256 (2016).

  20.

    On homology searches by protein Blast and the characterization of the age of genes. BMC Evol. Biol. 7, 53 (2007).

  21.

    The relationships among microRNA regulation, intrinsically disordered regions, and other indicators of protein evolutionary rate. Mol. Biol. Evol. 28, 2513–2520 (2011).

  22.

    Exploring the differences in evolutionary rates between monogenic and polygenic disease genes in human. Mol. Biol. Evol. 27, 934–941 (2010).

  23.

    Orphans and new gene origination, a structural and evolutionary perspective. Curr. Opin. Struct. Biol. 26, 73–83 (2014).

  24.

    et al. No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Mol. Biol. Evol. 34, 843–856 (2017).

  25.

    Amino acid preferences of small proteins. J. Mol. Biol. 227, 991–995 (1992).

  26.

    On hydrophobicity correlations in protein chains. Biophys. J. 79, 2252–2258 (2000).

  27.

    On hydrophobicity and conformational specificity in proteins. Biophys. J. 86, 23–30 (2004).

  28.

    Preadaptation and multiple evolutionary pathways. Evolution 13, 194–211 (1959).

  29.

    Exaptation—a missing term in the science of form. Paleobiology 8, 4–15 (1982).

  30.

    The look-ahead effect of phenotypic mutations. Biol. Direct 3, 18 (2008).

  31.

    Estimating intrinsic structural preferences of de novo emerging random-sequence proteins: is aggregation the main bottleneck? FEBS Lett. 586, 2468–2472 (2012).

  32.

    Simpson’s Paradox (ed. Zalta, E. N.) (2016).

  33.

    Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence. eLife 5, e09977 (2016).

  34.

    et al. On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE. Genome Biol. Evol. 5, 578–590 (2013).

  35.

    Organism complexity anti-correlates with proteomic β-aggregation propensity. Protein Sci. 14, 2735–2740 (2005).

  36.

    et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).

  37.

    et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  38.

    et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589–W598 (2015).

  39.

    Understanding protein non-folding. BBA-Proteins Proteom. 1804, 1231–1264 (2010).

  40.

    RepeatMasker Open-4.0 v. 4.0.5 (2013–2015).

  41.

    et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).



Work was supported by the John Templeton Foundation (39667), the National Institutes of Health (GM104040) and ERC grant NewGenes (322564). We thank D. Tautz and M. Cordes for discussions, R. Bakaric for assistance with phylostratigraphy and A.-R. Carvunis for comments on a draft of the manuscript and for sharing data.

Author information

Author notes

    • Scott G. Foy & Rafik Neme

    Present addresses: St. Jude Children's Research Hospital, Memphis, Tennessee 38105, USA (S.G.F.); Department of Biochemistry and Molecular Biophysics, Columbia University Medical Center, New York 10032, USA (R.N.).

    • Benjamin A. Wilson & Scott G. Foy

    These authors contributed equally to this work.


  1. Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA.

    • Benjamin A. Wilson, Scott G. Foy & Joanna Masel

  2. Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön SH 24306, Germany.

    • Rafik Neme




J.M. and R.N. conceived the approach, R.N. performed the phylostratigraphy, B.A.W. and S.G.F. completed all other data analyses, and J.M. wrote the paper.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Joanna Masel.

Supplementary information

PDF files

  1.

    Supplementary information

    Supplementary Figure 1 and Supplementary Table 1

CSV files

  1.

    Supplementary Table 2

    M. musculus proteins.

  2.

    Supplementary Table 3

    Nucleotide sequences from intergenic regions of M. musculus genome

  3.

    Supplementary Table 4

    Nucleotide sequences from intergenic regions of the masked M. musculus genome

  4.

    Supplementary Table 5

    Randomly generated nucleotide sequences

  5.

    Supplementary Table 6

    Scrambled amino acid sequences

  6.

    Supplementary Table 7

    S. cerevisiae proteins from Table 1