The phenomenon of de novo gene birth from junk DNA is surprising, because random polypeptides are expected to be toxic. There are two conflicting views about how de novo gene birth is nevertheless possible: the continuum hypothesis invokes a gradual gene birth process, whereas the preadaptation hypothesis predicts that young genes will show extreme levels of gene-like traits. We show that intrinsic structural disorder conforms to the predictions of the preadaptation hypothesis and falsifies the continuum hypothesis, with all genes having higher levels than translated junk DNA, but young genes having the highest level of all. Results are robust to homology detection bias, to the non-independence of multiple members of the same gene family and to the false positive annotation of protein-coding genes.
It has become clear that protein-coding genes can originate de novo from non-coding sequences1. This is surprising, because amyloid is a generic structural form of any polypeptide, making the expression of random polypeptides a dangerous affair2. De novo gene birth is a radical evolutionary transition, and two hypotheses have previously been presented to explain how it is possible. The ‘continuum’ view posits that there is a series of intermediate stages, or ‘proto-genes’, between non-genes and genes3. In contrast, the ‘preadaptation’ theory posits that de novo birth is an all-or-nothing transition to functionality, and the key to successful innovation is the imperative to avoid the most toxic ‘hopeless monster’ options in favour of ‘hopeful monsters’4,
These two theories can be empirically distinguished, given a simple trait that systematically makes proteins less likely to be harmful. According to the continuum theory, the trait should be strongest in old genes, intermediate in young genes and weakest in non-coding sequences. According to the preadaptation theory, the trait should be strongest in young genes, intermediate in old genes and weakest in non-coding sequences (Fig. 1).
A good candidate trait is the degree of intrinsic structural disorder (ISD), that is, the degree to which a given peptide folds as a stable three-dimensional structure (ordered) versus as a rather flexible and unstructured entity (disordered)8. Predicted levels of ISD can conveniently be calculated from sequence alone9. Natural protein sequences are more intrinsically disordered than random sequences10. Meanwhile, there is conflicting evidence as to whether young genes are more or less disordered than old genes. Increased ISD is found in orphan domains and recent extensions to domains11,
Here, we stratify genes in two distantly related eukaryotic organisms, the house mouse and baker’s yeast, by age, and use predicted ISD values to show that young genes do not behave like intermediates between non-genic sequences and older genes, as predicted by a continuum theory. Instead, young genes have exaggerated gene-like structural properties, as expected from sequences biased by the demands of preadaptation.
Results and discussion
We rigorously determine how ISD depends on gene age in mice using a phylostratigraphy approach17 to assign ages to genes. We exclude genes unique to a single species, as these are likely to be contaminated with many false positives that are not protein-coding genes at all. Mouse is a good choice of taxon because of the quality of its gene annotation and the large number of genome sequences available for closely related species7, which together provide temporal resolution among young protein-coding genes. The rarity of horizontal gene transfer in mouse ancestors is also an important consideration.
We find that young mouse genes have higher ISD (Fig. 2). Our statistical analysis avoids pseudoreplication in the form of phylogenetic confounding among multiple members of the same gene family, by controlling for gene family as a random effect in a linear model; this is not a technique we have previously encountered in related literature.
The validity of phylostratigraphy has been challenged on the grounds of homology detection bias18,
Homology is also easier to detect for longer genes and, in agreement with previous findings7, length and age are correlated (Fig. 3b). However, longer genes have higher ISD as scored with IUPred, with quite large effect sizes—we find a Pearson correlation coefficient (following transformation for normality) of 0.17 for old genes and in the range of 0.32–0.44 in newer phylostrata. Using a linear model to correct for the length–ISD relationship therefore makes our ISD–age relationship stronger, not weaker (Fig. 2, green).
Note that our ISD scores strongly reflect amino acid composition, with low ISD representing hydrophobicity. In this light, our findings agree with previous results on length-dependent frequencies of particular amino acids25, but contradict previous studies, restricted to single-domain globular proteins, that showed no length-dependence for hydrophobicity as a whole26,27. Applying the same hydrophobicity measure26 to our more comprehensive protein set, we continue to find that long genes have more hydrophilic amino acids (Pearson correlation coefficients only slightly weaker, at −0.13 for old genes, and in the range of −0.24 to −0.45 in new phylostrata).
The high ISD of young genes is primarily due to amino acid composition (calculated as the ISD of scrambled versions of a gene; Fig. 4a, orange) rather than the exact order of the amino acids (calculated as the difference between the ISD of the real gene and that of a scrambled control; Fig. 4b). This suggests that genes are born with high ISD, driven by amino acid composition. Amino acid composition can therefore be seen as a preadaptation28, or ‘nonaptation’ in the terminology of ref. 29, for de novo gene birth. This does not imply any pre-adaptive process of the kind postulated elsewhere4,
In contrast, the contribution of amino acid order to ISD seems to be an adaptation rather than a preadaptation in young genes (those born in vertebrates), because it is not initially present but appears only after some time. This contributes an independent line of evidence that high ISD values are favoured in young genes.
Now that we have determined how ISD depends on gene age, the non-coding sequences from which de novo proteins must be born allow us to distinguish between the continuum and preadaptation hypotheses. The continuum theory predicts that non-coding sequences will resemble exaggerated versions of young genes, and hence have the highest ISD. In contrast, the preadaptation theory expects young genes to show the most extreme deviation from random sequences, predicting that non-coding sequences will have the lowest ISD.
We sampled intergenic sequences near each mouse gene in our analysis as representative of the raw material from which de novo genes are born, rather than analysing randomly generated sequences matching only a subset of known variables, such as GC content31. Our intergenic controls reflect the subtleties found in a real genome with a complex evolutionary history, such as the avoidance of CpG sites. We find strong evidence refuting the continuum hypothesis and supporting the preadaptation hypothesis (Fig. 2). This result is not attributable to repetitive sequences; results are nearly identical when RepeatMasker is used to filter the intergenic control sequences (Fig. 2). The size of the gap between the average ISD of a translated intergenic sequence and that of young genes strikingly illustrates the nature of the filter applied during de novo gene birth, and the relevance of ISD to the process.
Why, then, did a previous yeast study3 find that young genes and proto-genes have low ISD? We believe that the difference lies in the annotation procedure. Our study included only genes with BLASTp homology across at least two species, whereas the previous study accepted BLASTn homology, which might result in homologous non-coding sequences being scored as protein-coding genes. When there is a mixture of true protein-coding genes combined with sequences that do not encode a functional polypeptide, the mean of the entire mixture will of course be intermediate between the means of the two groups. The overall mean will depend strongly on the ratio between the two components. Specifically, if the proportion of non-functional open reading frames (ORFs) decreases with conservation level, then a continuum will automatically be observed, and in the wrong direction as a function of apparent gene age. Figure 5 illustrates, in context, the statistical problem known as Simpson’s paradox32, which drives this effect.
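The mixture argument can be made concrete with a small numerical sketch. The mean ISD values and contamination fractions below are invented purely for illustration and are not taken from our data. They are chosen so that true genes have their highest ISD when young, while the share of non-coding ORFs falls with conservation level.

```python
def mixture_mean(frac_noncoding, mean_gene, mean_noncoding):
    """Mean ISD of an annotated class that mixes true genes with non-coding ORFs."""
    return (1 - frac_noncoding) * mean_gene + frac_noncoding * mean_noncoding

# Hypothetical values, for illustration only.
MEAN_NONCODING = 0.20  # translated non-coding ORFs
TRUE_GENE_MEAN = {"young": 0.45, "middle": 0.40, "old": 0.35}  # true genes: young highest
FRAC_NONCODING = {"young": 0.90, "middle": 0.40, "old": 0.00}  # contamination falls with age

observed = {
    age: mixture_mean(FRAC_NONCODING[age], TRUE_GENE_MEAN[age], MEAN_NONCODING)
    for age in ("young", "middle", "old")
}

for age, value in observed.items():
    print(f"{age:>6}: true gene mean {TRUE_GENE_MEAN[age]:.2f}, "
          f"observed mixture mean {value:.3f}")
```

With these assumed inputs, the observed class means rise with apparent age even though the means of the true genes fall with age, which is the reversal described in the text.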
To verify that this is responsible for the discrepancy, we repeated our mouse pipeline on yeast (Fig. 6a) and confirmed that following our methods, young yeast genes, like young mouse genes, have higher ISD. To pinpoint the source of the discrepancy, we used gene age classifications from the previous study (A.-R. Carvunis, personal communication of dataset) and, omitting gene families from the analysis, reproduced the previously reported trend of low ISD in young proto-genes (Fig. 6b, black). Details such as the treatment of disulfide bonds made no substantive difference (Fig. 6b, dark blue). However, filtering out potentially non-coding sequences eliminated the previously reported trend (Fig. 6b, light blue). Details of the elimination criteria are shown in Table 1; the preferential elimination of younger phylostrata is consistent with the operation of Simpson’s paradox. Figure 6 shows that the differences between our conclusions and those of a previous study3 are due to the categorization of what is a gene, not to the details of the ISD calculations or of how genes are assigned to phylostrata. In both our mouse and yeast analyses, we were careful to discard all possible non-genes, leaving us looking at a single group and not a mixture in each phylostratum. Mouse has more well-verified young protein-coding genes, allowing for clearer resolution of ISD on shorter timescales.
Note that if a continuum of increasing ISD with age were to take place only on short timescales (a less parsimonious hypothesis than ours, and one that Simpson’s paradox would make difficult to confirm), our mouse analysis restricts it to, at most, the last ~21–82 million years before the split of mouse and rat, and after the split of mouse and rabbit. In contrast, the continuum of ISD scores reported in ref. 3 is claimed to go all the way back to the split between Saccharomyces cerevisiae and Candida albicans (~300 million years), despite the much shorter generation times of yeast.
Figure 5 shows that the existence of intermediate proto-genes is not necessary to explain data on trends in mean properties. What is more, the very concept of a proto-gene as intermediate between gene and non-gene is problematic, with inappropriately teleological connotations. However, as a non-teleological definition, it may be useful to refer to slightly expressed but non-functional ORFs as proto-genes. Stretches between a start codon and a stop codon in a transcript (ORFs) occur frequently by chance. ORFs that encode highly deleterious polypeptides, and that are translated at low levels, are purged from a population more rapidly than relatively harmless ORFs, and this fact could help explain the phenomenon of de novo gene birth6. Pervasive transcription subject to rapid evolutionary turnover33 and leading to non-functional translation6 provides the raw materials for proto-genes defined in this fashion. However, it must be noted that even harmless ORFs, in the absence of selection for some beneficial property, are rapidly disrupted by mutation. There is therefore a discrete dichotomy between the states of ‘gene’ and ‘non-gene’, determined by whether the selection coefficient is greater than zero, and hence capable of sustaining their continued existence in the face of mutational onslaught34. A dichotomy of functionality (defined in evolutionary terms) is still compatible with the idea that some non-genes are under selection that weeds out the most deleterious of options, in a manner that promotes evolvability4,5.
In terms of the adaptive potential of non-coding sequences, although even the young genes have much higher ISD on average than found in sequences translated from randomly chosen junk DNA, these averages conceal considerable variation, with greater variation among gene families than among intergenic sequences35 (see Supplementary Fig. 1), suggestive of diversifying selection. Of our intergenic sequences, 12.7% yield ISD levels within the range of the highest 75% of all genes in the youngest phylostratum considered here, creating relatively little barrier to de novo gene birth. On the surface, protein length would seem to be a stronger constraint—only 25% of annotated young genes are less than 108 amino acids long, far longer than expected by chance in junk DNA—although biases in gene annotation may mean that a complete and unbiased set of young genes encode, in reality, even shorter proteins.
Once proteins are born with a given ISD, evolutionary tinkering and differential loss seem to change ISD only slowly, resulting in the consistent trend seen over hundreds of millions of years in Fig. 2. While gene birth is a sudden transition to functionality, subsequent descent with modification can generate extraordinarily slow trends.
Mus musculus proteins from Ensembl (v75)36 were subjected to a BLASTp37 search with an E-value threshold of 0.001 (ref. 20) against the National Center for Biotechnology Information (NCBI) non-redundant protein sequences (nr) database (June 2014). The most phylogenetically distant hit was used to place the gene into one of the 20 phylostrata (gene ages) listed in Supplementary Table 1, following Dollo’s parsimony and neglecting horizontal gene transfer. Of 22,778 available protein sequences, 126 could not be assigned to any phylostratum because of BLASTp-related problems: queries that were too short, queries composed mostly of low-complexity sequence, or both. These sequences were excluded from further analysis.
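The age assignment rule can be sketched as follows. This is an illustration rather than the pipeline itself: the lineage is a hypothetical, truncated stand-in for the 20 phylostrata of Supplementary Table 1, each hit is assumed to arrive already annotated with the youngest clade shared by the query and hit species, and treating the threshold as inclusive is also an assumption.

```python
# Hypothetical, truncated lineage ordered from youngest to oldest clade.
LINEAGE = ["Mus musculus", "Rodentia", "Mammalia", "Vertebrata",
           "Eukaryota", "Cellular organisms"]
EVALUE_THRESHOLD = 1e-3

def assign_phylostratum(hits, lineage=LINEAGE, threshold=EVALUE_THRESHOLD):
    """Return the oldest clade supported by any significant hit.

    hits: iterable of (clade, evalue) pairs, where clade is the youngest
    entry of `lineage` containing both the query and the hit species.
    Returns None when no hit passes the threshold.
    """
    # Assumption: a hit counts when its E-value is at or below the threshold.
    ranks = [lineage.index(clade) for clade, evalue in hits if evalue <= threshold]
    # Dollo parsimony: one origin at the most distant hit, with losses allowed.
    return lineage[max(ranks)] if ranks else None

example_hits = [("Rodentia", 1e-50), ("Vertebrata", 1e-5), ("Eukaryota", 0.5)]
print(assign_phylostratum(example_hits))
```

In the example, the Eukaryota hit fails the threshold, so the gene is placed in the Vertebrata phylostratum.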
To remove dubious genes and perform evolutionary rate-controlled ISD estimates, dN/dS values were downloaded for all mouse proteins from the Ensembl BioMart38 (accessed on 18 February 2016) and mapped to our dataset using the Ensembl protein ID. Evolutionary rates were calculated using PAML by comparing all mouse proteins with their orthologues in rats. Genes with no rat orthologue of amino acid sequence identity greater than 50% were excluded, leaving 17,762 non-orphan mouse genes, all with dN/dS values, for further study. When rat had multiple orthologues meeting this quality filter, the one with the highest rate was taken (to prevent any further exclusion of genes with high evolutionary rate, beyond the low bar of detectable mouse–rat homology). Restricting analysis to one-to-one orthologues did not qualitatively change the results.
Pairwise paralogue information among non-orphan mouse genes was taken from Ensembl, from which gene families were constructed via a single-link cluster analysis. This yielded 8,124 gene families, 7,113 of which showed complete agreement among their member genes regarding age. Of the remainder, 824 gene families contained genes assigned to exactly two different phylostrata; 526 of these had only a single gene in the younger phylostratum, which was reassigned back to the older phylostratum. Of the remainder split across exactly two phylostrata, 150 had only a single older member, which we reassigned to be younger, leaving 148 gene families unclassified. In addition, 187 gene families contained genes split among more than two phylostrata, of which 86 could similarly easily be reconciled by discounting the gene age status of singletons, leaving another 101 gene families unclassified. Most of the 249 gene families with unclassified ages are split between multiple old phylostrata; as we had no shortage of data in these older phylostrata, and as this group includes complex scenarios such as gene fusion and repetitive sequences, these gene families were excluded from further analysis, leaving 15,347 total genes for our analysis.
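A minimal sketch of the family construction and age reconciliation follows, assuming phylostrata are encoded as integers with larger values being older. The function names are hypothetical. The rule for families split across exactly two phylostrata follows the text; the rule for more than two phylostrata (keep the single phylostratum that has more than one member) is one possible reading of "discounting the gene age status of singletons" and should be treated as an assumption.

```python
from collections import Counter, defaultdict

def single_link_families(genes, paralogue_pairs):
    """Group genes into families: any chain of pairwise paralogy joins a family."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in paralogue_pairs:
        parent[find(a)] = find(b)

    families = defaultdict(set)
    for g in genes:
        families[find(g)].add(g)
    return list(families.values())

def reconcile_age(ages):
    """ages: list of phylostratum indices (larger = older) for one family.

    Returns the family age, or None if the family stays unclassified.
    """
    counts = Counter(ages)
    if len(counts) == 1:
        return ages[0]
    if len(counts) == 2:
        young, old = sorted(counts)
        if counts[young] == 1:   # lone young member: reassign to the older age
            return old
        if counts[old] == 1:     # lone old member: reassign to the younger age
            return young
        return None
    # Assumed generalization for families split across more than two phylostrata.
    kept = [age for age, n in counts.items() if n > 1]
    return kept[0] if len(kept) == 1 else None
```

For example, `reconcile_age([3, 3, 1])` returns 3, whereas `reconcile_age([3, 3, 1, 1])` returns `None` and the family would be excluded.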
Twenty-six gene families, consisting of 29 M. musculus genes, did not originally return NCBI nr BLAST hits outside their own species, yet Ensembl reported dN/dS values relative to a rat orthologue that met our sequence identity filter. Fourteen of the genes also returned this rat orthologue from NCBI’s nr database as of August 2016. A sample of those that did not return the rat orthologue nevertheless passed manual inspection of the protein-coding status of the Ensembl-identified orthologue. These were therefore assigned to the Rodentia phylostratum.
Our single-link clustering plus cleanup procedure to construct gene families produced a much better fit to the data than treating genes as independent, explaining far more variance than phylostratum, the property of interest (change in Akaike’s information criterion, ΔAIC = 114 on removing phylostratum from the model versus ΔAIC = 9,928 on removing the random effect of gene family from the model).
We calculated ISD using IUPred9, after first excising all cysteines from the protein sequence (from Ensembl v73), because their disulfide bond status is uncertain and has a profound impact on ISD39. For each gene, we averaged the ISD across all other amino acids and performed a Box–Cox transformation (Box–Cox exponent λ = 0.66, with λ optimized using only coding genes, not controls) before linear model analysis. Central tendency estimates and confidence intervals were then back-transformed for the plots.
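The transformation steps can be sketched as below, using the standard Box–Cox form (x^λ − 1)/λ and its inverse. The sketch does not call IUPred; the per-residue scores are placeholder numbers standing in for predictor output on the cysteine-free sequence, and the function names are hypothetical.

```python
import math

def excise_cysteines(protein):
    """Remove all cysteines before disorder prediction."""
    return protein.replace("C", "")

def mean_isd(per_residue_scores):
    """Average per-residue disorder scores into one value per gene."""
    return sum(per_residue_scores) / len(per_residue_scores)

def box_cox(x, lam):
    """Box-Cox transform of a positive value x."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def inverse_box_cox(y, lam):
    """Back-transform, e.g. for plotting estimates on the original scale."""
    return math.exp(y) if lam == 0 else (lam * y + 1) ** (1 / lam)

LAMBDA_MOUSE = 0.66  # exponent reported in the text for mouse

sequence = excise_cysteines("MKCLSCDE")        # -> "MKLSDE"
placeholder_scores = [0.2, 0.5, 0.4, 0.6, 0.7, 0.6]  # not real IUPred output
isd = mean_isd(placeholder_scores)
transformed = box_cox(isd, LAMBDA_MOUSE)
recovered = inverse_box_cox(transformed, LAMBDA_MOUSE)
```

Here `recovered` equals `isd` up to floating-point error, which is the property relied on when back-transforming estimates and confidence intervals.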
Protein lengths were approximately log-transformed (Box–Cox λ = −0.0432). Hydrophobicities were calculated first for amino acids: leucine, isoleucine, valine, phenylalanine, methionine and tryptophan were scored as +1, and all other amino acids were scored as −1. Then the mean hydrophobicity for a protein was used to examine the length-dependence of amino acid composition.
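The hydrophobicity score described above can be written as a short function. The function name is hypothetical; the six hydrophobic residues and the ±1 scoring are as stated in the text.

```python
HYDROPHOBIC = set("LIVFMW")  # Leu, Ile, Val, Phe, Met, Trp

def mean_hydrophobicity(protein):
    """Mean of per-residue scores: +1 if hydrophobic, -1 otherwise."""
    return sum(1 if aa in HYDROPHOBIC else -1 for aa in protein) / len(protein)

print(mean_hydrophobicity("LIVK"))
```

The score ranges from −1 (no hydrophobic residues) to +1 (only hydrophobic residues).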
For each gene, we generated scrambled controls by resampling amino acids without replacement. To generate GC-matched controls, the numbers of GC and adenine–thymine (AT) nucleotides were calculated excluding the stop codon, then GC versus AT identity was resampled without replacement, and then G versus C and A versus T were assigned at 50% probability. If a premature stop codon arose, one of the three stop codon nucleotides was switched with another nucleotide position chosen at random. This process was iterated until no premature stop codons remained, and then a stop codon was appended to the end.
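Both control types can be sketched as follows. This is an illustrative reading of the procedure, not the released code: the function names are hypothetical, the standard stop codons are assumed, and the choice of TAA as the appended stop codon is an assumption, since the text does not say which one was used.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}

def scramble_protein(protein, rng):
    """Resample amino acids without replacement, i.e. permute the sequence."""
    return "".join(rng.sample(protein, len(protein)))

def first_premature_stop(nts):
    """Index of the first in-frame stop codon in a list of nucleotides, or None."""
    for i in range(0, len(nts) - 2, 3):
        if "".join(nts[i:i + 3]) in STOPS:
            return i
    return None

def gc_matched_control(cds, rng, stop="TAA"):
    """Random coding sequence with the same GC and AT counts as `cds`."""
    body = cds[:-3]  # counts exclude the stop codon
    labels = ["S" if nt in "GC" else "W" for nt in body]
    rng.shuffle(labels)  # resample GC versus AT identity without replacement
    nts = [rng.choice("GC") if lab == "S" else rng.choice("AT") for lab in labels]
    # Swap one nucleotide of each premature stop with a random position
    # until no premature stop codons remain.
    while (i := first_premature_stop(nts)) is not None:
        j = i + rng.randrange(3)
        k = rng.randrange(len(nts))
        nts[j], nts[k] = nts[k], nts[j]
    return "".join(nts) + stop

rng = random.Random(0)
cds = "ATGGCCAAAGGTTTTCCCTAA"
control = gc_matched_control(cds, rng)
scrambled = scramble_protein("MAKGFP", rng)
```

Because the repair step only swaps positions, the GC content set by the shuffle is preserved exactly.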
To generate one intergenic control per gene, we took an intergenic sequence starting 100 nucleotides downstream from the 3′ end of the Ensembl v80 annotation of the transcript, and progressed further, excising stop codons along the way, until a length match to the neighbouring protein-coding gene was obtained. We then obtained a second control sequence near each gene by repeating the process after starting a search 100 nucleotides farther downstream. For the RepeatMasked40 controls, intergenic sequences farther downstream were used as necessary to extract the control sequence from a contiguous non-masked intergenic sequence.
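The extraction of a length-matched intergenic control can be sketched as below. The function name and arguments are hypothetical, and the sketch assumes that `downstream` is the genomic sequence beginning at the annotated 3′ end of the transcript and that codons are read in the frame set by the starting offset. Repeat masking is not handled.

```python
STOPS = {"TAA", "TAG", "TGA"}

def intergenic_control(downstream, n_codons, offset=100):
    """Collect n_codons non-stop codons starting `offset` nt downstream.

    Returns the nucleotide sequence, or None if the supplied sequence runs
    out before a length match is reached.
    """
    codons = []
    for i in range(offset, len(downstream) - 2, 3):
        codon = downstream[i:i + 3]
        if codon in STOPS:
            continue  # excise stop codons along the way
        codons.append(codon)
        if len(codons) == n_codons:
            return "".join(codons)
    return None

toy_downstream = "N" * 100 + "ATGTAAGCCTGAAAA"
print(intergenic_control(toy_downstream, 3))  # ATGGCCAAA
```

A second control could be requested by passing a larger `offset`; the exact starting point used for the second control is left as a parameter here.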
Genes taken in June 2014 from the Saccharomyces Genome Database (SGD)41 were assigned gene ages according to the procedure described above for mouse. We supplemented our phylostratigraphic analysis of species supported by the NCBI taxonomy browser with a selection of more closely related yeast species; our youngest phylostratum contains any S. cerevisiae genes with a homologue found in S. kudriavzevii (in most cases), or in a still more closely related yeast species (for a handful of genes). As for M. musculus, we constructed gene families using single-link cluster analysis on pairwise paralogue information from Ensembl, and the ages of single discordant genes were reconciled as described above with the other age assignments within their gene family. Genes classified by us as specific to S. cerevisiae were excluded from many analyses, as were genes that we failed to classify using BLAST and those classified as ‘dubious’ in SGD. ‘Conservation level’ (an alternative phylostratigraphy that includes BLASTn homology detection) was provided (A.-R. Carvunis, personal communication) to reproduce the classification presented in ref. 3. ISD values were calculated as for mouse except with Box–Cox λ = 0.554.
Source data for the statistical analyses and figures are provided in Supplementary Tables 2–7. Code associated with generating and analysing these tables is publicly available at https://github.com/MaselLab.
How to cite this article: Wilson, B. A., Foy, S. G., Neme, R. & Masel, J. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat. Ecol. Evol. 1, 0146 (2017).
Work was supported by the John Templeton Foundation (39667), the National Institutes of Health (GM104040) and ERC grant NewGenes (322564). We thank D. Tautz and M. Cordes for discussions, R. Bakaric for assistance with phylostratigraphy and A.-R. Carvunis for comments on a draft of the manuscript and for sharing data.
M. musculus proteins.
Nucleotide sequences from intergenic regions of M. musculus genome
Nucleotide sequences from intergenic regions of the masked M. musculus genome
Randomly generated nucleotide sequences
Scrambled amino acid sequences
S. cerevisiae proteins from Table 1