De novo emergence of adaptive membrane proteins from thymine-rich intergenic sequences

Recent evidence demonstrates that novel protein-coding genes can arise de novo from intergenic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of intergenic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Do intergenic translation events yield polypeptides with useful biochemical capacities? The answer to this question remains controversial. Here, we systematically characterized how de novo emerging coding sequences impact fitness. In budding yeast, overexpression of these sequences was enriched in beneficial effects, while their disruption was generally inconsequential. We found that beneficial emerging sequences have a strong tendency to encode putative transmembrane proteins, which appears to stem from a cryptic propensity for transmembrane signals throughout thymine-rich intergenic regions of the genome. These findings suggest that novel genes with useful biochemical capacities, such as transmembrane domains, tend to evolve de novo within intergenic loci that already harbored a blueprint for these capacities.

The molecular mechanisms and dynamics of de novo gene birth are poorly understood [1]. It is particularly unclear how non-genic sequences could spontaneously encode proteins with specific and useful biochemical capacities. To resolve this paradox, it has been proposed that pervasive translation of non-genic transcripts can expose genetic variation, in the form of novel polypeptides, to natural selection, thereby purging toxic sequences and providing adaptive potential to the organism [2,3]. The genomic sequences encoding these novel polypeptides have been called "proto-genes", to denote that they correspond to a distinct class of genetic elements that are intermediates between non-genic sequences and established genes [3]. In agreement, several studies reported that de novo emerging coding sequences tend to display lengths, transcript architectures, transcription levels, strength of purifying selection, sequence compositions, structural features and integration in cellular networks that are intermediate between those observed in non-genic sequences and those observed in established genes [3-8]. Furthermore, pervasive translation of non-genic sequences has been observed repeatedly by ribosome profiling and proteo-genomics [3,9-12], and studies have shown that random sequence libraries harbor bioactive effects [13-17]. Nonetheless, whether and how native proto-genes carry adaptive potential remains unknown.

We sought to formalize the predictions of adaptive proto-gene evolution. We define adaptive potential as the capacity to increase fitness by means of evolutionary change. While any sequence may in theory carry adaptive potential, changes in established genes are typically constrained by preexisting selected effects, that is, the specific physiological processes mediated by the gene products that lead to their evolutionary conservation [18].
In contrast, emerging proto-genes are expected to mostly lack such selected effects, leaving them more readily accessible to adaptive change and innovation [2,3]. We reasoned that the initial adaptive potential would give way, as proto-genes mature and the adaptive changes engender novel selected effects, in turn reducing the possibility of future change. This reasoning is akin to Sartre's "existence precedes essence" dictum [19], and predicts that proto-genes are enriched in adaptive potential and depleted in selected effects relative to established genes (Fig. 1A).

Fig. 1. The adaptive potential prediction: theory and empirical testing.
A. Theoretical model. The evolution from non-genic sequences to proto-genes to genes is represented as in ref 3; the transition from non-genic sequences to proto-genes is mediated by the act of translation; the transition from proto-genes to genes occurs along a continuum (left). Proto-genes are predicted to provide adaptive potential to the organism by exposing natural variation to selection, while being depleted in selected effects (right).
B. Operational classification of emerging and established ORFs. We confront the theoretical model by systematically assessing how emerging ORFs impact fitness in budding yeast. Emerging ORFs are young (no detectable homologues outside of the sensu stricto genus and no conserved syntenic homologue in S. kudriavzevii and S. bayanus) and do not display strong evidence that they encode a useful protein product under intraspecific purifying selection (Methods). Empirical testing of the theoretical prediction (A) involves experimentally measuring the fitness of deletion and overexpression alleles for both classes of ORFs. The numbers of emerging and established ORFs subjected to each analysis are indicated.

In what follows, we confronted these theoretical predictions with systematic measurements of how disruption and overexpression of open reading frames (ORFs) impact fitness in budding yeast as a function of the evolutionary emergence status of the ORFs. We classified annotated S. cerevisiae ORFs into emerging ORFs and established ORFs (Fig. 1B; Data S1). The established ORFs group contained ancestral ORFs with high interspecific conservation levels, as well as evolutionarily young ORFs that encode useful proteins under intraspecific purifying selection. The emerging ORFs group contained evolutionarily young ORFs whose recent evolution was determined by phylostratigraphy and syntenic alignments, and which lacked strong evidence of encoding a useful protein product (Methods). As expected, emerging ORFs tend to be short and weakly transcribed relative to established ORFs (Mann-Whitney U test P < 2.2 × 10^-16 in both cases). Most emerging ORFs (>95%) are annotated as Dubious or Uncharacterized by domain experts (Methods), as it is currently unclear whether they correspond to spurious non-genic ORFs or to young genes whose physiological implications remain to be discovered.

Selected effects
We compared estimated fitness costs of disrupting emerging and established ORFs. To this end, we first examined fitness estimates generated from a large collection of systematic deletion and hypomorphic alleles [20]. After removing ORFs with overlapping genomic locations, we obtained fitness estimates for 239 emerging and 4,410 established ORFs (Fig. 1B) Fig. 1A).
To investigate how the disruption of emerging ORFs impacts fitness in natural conditions, we analyzed intraspecific sequence variation across 1,011 S. cerevisiae isolates [21]. Counting the number of isolates in which the ORF structures (defined as start, stop and frame without considering sequence similarity) were intact in each group, we found ORF structures to be markedly more variable across isolates for emerging than established ORFs (Fig. 2B) (Fig. 2C). Altogether, our results confirmed that disrupting emerging ORFs is generally inconsequential for survival in both laboratory and natural settings, as expected for loci that lack evidence of encoding a useful protein product. These findings argue against the notion that emerging ORFs might correspond primarily to young established genes whose physiological implications remain to be discovered.
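The ORF-structure criterion described above (start, stop and frame, without considering sequence similarity) can be illustrated with a minimal Python sketch. The function names `orf_structure_intact` and `fraction_intact`, and the exact rules applied (ATG start, length divisible by three, terminal stop codon, no premature stop), are assumptions made for illustration and are not the pipeline used in this study.

```python
# Hypothetical illustration: is the start/stop/frame structure of an ORF
# preserved in a given isolate sequence? Sequence similarity is ignored.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_structure_intact(seq: str) -> bool:
    """Return True if seq starts with ATG, is in frame, ends with a stop
    codon and carries no premature in-frame stop codon."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False  # too short, or frame disrupted by an indel
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":
        return False  # start codon lost
    if codons[-1] not in STOP_CODONS:
        return False  # stop codon lost
    # Any stop codon before the last codon truncates the ORF.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def fraction_intact(isolate_seqs) -> float:
    """Fraction of isolates in which the ORF structure is intact."""
    seqs = list(isolate_seqs)
    return sum(orf_structure_intact(s) for s in seqs) / len(seqs)
```

Applied to the sequence of one ORF across all isolates, `fraction_intact` would give a per-ORF intactness value that could then be compared between emerging and established ORFs.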

Adaptive potential
Across kingdoms, one type of evolutionary change that typically accompanies de novo gene birth is an increase in expression level [22]. It follows that, according to our prediction (Fig. 1A), increasing the expression level of emerging ORFs should increase the organism's fitness more frequently than when the same perturbation is imposed on established ORFs (whose expression levels have presumably been optimized by natural selection). Alternatively, if emerging ORFs mostly correspond to spurious non-genic ORFs with no role in de novo gene birth, increasing their expression level should generally be neutral or toxic, and not provide fitness benefits. Systematic overexpression screens have been shown to recapitulate the outcomes of laboratory evolution experiments [23]. We thus developed a dedicated overexpression screening strategy to identify ORFs that increased relative fitness upon increased expression, whereby colony sizes of individual overexpression strains were compared with those of hundreds of replicates of a reference strain with the same genetic background on ultra-high-density arrays (Fig. 3A; Methods).

Fig. 3.
A. Strategy to screen for relative fitness effects of overexpression strains. Yeast strains overexpressing emerging and established ORFs, and reference strains, are arrayed at ultra-high density on plates containing agar media. Fitness is estimated from the distributions of colony sizes of technical replicates. Numbers of colonies are rounded to the nearest hundred. See Methods.
B. Fraction of emerging (blue) and established (white) ORFs displaying increased, decreased and unchanged fitness effects relative to the reference. Environmental condition was SC-URA+GAL+G418 media (Table S1). Error bars represent standard error of the proportion.
C. Emerging ORFs are 4.5 times more likely to increase relative fitness when overexpressed than established ORFs, and 3.1 times less likely to decrease relative fitness. Odds ratios derived from the data shown in panel B. Vertical error bars represent 95% confidence intervals. Horizontal dashed line indicates odds ratio of 1. All odds ratios are significantly different from 1 (P<0.00002).
D. Emerging ORFs are consistently more likely to increase fitness and less likely to decrease fitness than established ORFs when overexpressed in five different environments. "N": poor (-), complete (+) or rich (++) supplementation of amino-acids; "C": complete (+) or rich (++) supplementation of carbon sources (

We deployed our screening strategy in a plasmid-based overexpression collection [24] containing 285 emerging ORFs and 4,362 established ORFs (Fig. 1B), having verified that the presence of an overexpression plasmid did not lead to a detectable growth defect relative to a plasmid-free strain (Extended Data Fig. 2; Methods). Strains overexpressing 14 emerging and 49 established ORFs displayed increased relative fitness in complete media, representing 4.9% and 1.1% of the total number of emerging and established ORFs tested, respectively (Fig. 3B). Overall, overexpressing most ORFs did not significantly change colony sizes relative to the reference strain. Nevertheless, overexpression of emerging ORFs was 4.5 times more likely to increase relative fitness, and 3.1 times less likely to decrease relative fitness, than overexpression of established ORFs (Fig. 3C). The tendency of emerging ORFs to increase fitness when overexpressed was also observed in the context of a pooled competition in the same media (Mann-Whitney U test P = 5.5 × 10^-32) (Extended Data Fig. 3A). Emerging ORFs with increased relative fitness displayed effect sizes ranging from 7.9% to 19% (Extended Data Fig. 3B), which is remarkable since adaptive mutations resulting in a 10% fitness increase are estimated to reach 5% of the population in ~200 generations and fix in ~500 generations [23]. One of the beneficial emerging ORFs identified by our experiments was MDF1 (YCL058C), one of the best-studied examples of adaptive de novo origination [25,26].
Expanding our screening strategy to five environments of varying nitrogen and carbon composition (Table S1; Data S3), we found that strains overexpressing emerging ORFs were consistently 3- to 6-fold more likely to increase relative fitness and 3- to 4-fold less likely to decrease relative fitness, compared to strains overexpressing established ORFs, across all environments tested (Fig. 3D). Notably, while overexpression of only 2.9% of established ORFs increased relative fitness in at least one environment (n=126), this was the case for 9.8% of emerging ORFs (n=28) (Fig. 3E, Extended Data Fig. 4). Sixty percent (17/28) of these adaptive emerging ORFs provided fitness benefits across two or more environmental conditions (empirical P-value < 10^-5; Extended Data Fig. 4). The strong over-representation of adaptive effects in emerging ORFs relative to established ORFs (Fisher's exact test P = 1.2 × 10^-7; odds ratio: 3.7) could not be explained by their short length or low native expression levels (Fig. 3E).
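The odds ratios quoted above follow directly from the counts reported in the text (14 of 285 emerging versus 49 of 4,362 established ORFs in complete media; 28 of 285 versus 126 of 4,362 across the five environments). The short Python sketch below recomputes them. The function names are hypothetical, and the one-sided hypergeometric tail is only a simple stand-in for the two-sided Fisher's exact test reported in the paper, so its value is not expected to match the published P-value exactly.

```python
from math import comb


def odds_ratio(hits_a: int, total_a: int, hits_b: int, total_b: int) -> float:
    """Odds of a hit in group A divided by the odds of a hit in group B."""
    odds_a = hits_a / (total_a - hits_a)
    odds_b = hits_b / (total_b - hits_b)
    return odds_a / odds_b


def hypergeom_upper_tail(hits_a: int, total_a: int,
                         hits_b: int, total_b: int) -> float:
    """One-sided P(X >= hits_a) if hits were distributed between the two
    groups at random (hypergeometric null). Illustrative stand-in only."""
    hits = hits_a + hits_b
    total = total_a + total_b
    denominator = comb(total, total_a)
    numerator = sum(
        comb(hits, k) * comb(total - hits, total_a - k)
        for k in range(hits_a, min(hits, total_a) + 1)
    )
    return numerator / denominator


# Counts taken from the text.
or_complete_media = odds_ratio(14, 285, 49, 4362)      # about 4.5
or_any_environment = odds_ratio(28, 285, 126, 4362)    # about 3.7
p_any_environment = hypergeom_upper_tail(28, 285, 126, 4362)
```

Exact integer arithmetic is used for the tail probability so that no statistical library is required; a dedicated implementation of Fisher's exact test would normally be preferable.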
The 28 adaptive emerging ORFs we identified as increasing relative fitness when overexpressed did not seem any more required for survival than other emerging ORFs (P>0.05 when comparing fitness cost of deletion, ORF intactness and nucleotide diversity across isolates). Furthermore, these ORFs were never found to be toxic in any of the conditions we tested, in contrast with established ORFs, which can be toxic in one environment even when adaptive in another. It is thus unlikely that these adaptive emerging ORFs have already evolved useful physiological roles, in line with the adaptive proto-gene evolution prediction (Fig. 1A). Overall, our experiments show that expression of dispensable emerging ORFs can provide fitness benefits across multiple environments, and thus facilitate de novo gene birth.

Beneficial biochemical capacities
Although we predicted the adaptive potential of emerging ORFs, the molecular mechanisms that may mediate such beneficial effects remain mysterious. It has been suggested that high levels of intrinsic structural disorder may be associated with adaptive fitness effects [27], based on the observation that young de novo genes are highly disordered in many species [27-30]. However, in S. cerevisiae, recently evolved ORFs are predicted to be less disordered than conserved ones [3,7,29,31], and increasing the expression of disordered proteins causes deleterious promiscuous interactions [32]. The 28 adaptive emerging ORFs identified in our screens did not exhibit high intrinsic disorder (Fig. 4A). In fact, their translated products were predicted to be significantly less disordered than those of neutral and deleterious emerging ORFs (Mann-Whitney U test P=0.03 and P=0.02, respectively). Our data thus indicate that disorder is unlikely to be a beneficial biochemical capacity that promotes de novo gene birth in S. cerevisiae, although it may be in other lineages with more complex regulatory systems [33].
In S. cerevisiae, young ORFs display high GC content [7] and TM propensity, the latter presumably mediated by a sequence composition biased towards hydrophobic and aromatic residues [3,8,34]. We investigated whether these properties may promote beneficial fitness effects in S. cerevisiae, after verifying that adaptive, neutral and deleterious emerging ORFs presented indistinguishable ORF length distributions (Mann-Whitney U tests P>0.3 for all comparisons). GC content was slightly lower in adaptive than neutral emerging ORFs (Fig. 4B). Adaptive emerging ORFs however displayed strikingly high TM propensity as measured by both average TM residue content and fraction of ORFs with full TM domains according to two prediction algorithms (Figs. 4C-F). Though remarkably pronounced and robust in emerging ORFs, the association between TM propensity and fitness benefits was absent in established ORFs (Extended Data Figs. 5-6). Thus, our data suggest that emerging ORFs containing TM domains promote fitness in budding yeast.

Fig. 4: TM propensity is associated with beneficial fitness effects in emerging ORFs
ORF length predicted to encode TM residues (C-D) and the fraction of ORFs predicted to contain at least one full TM domain (E-F) according to TMHMM (

Previous analyses encompassing multiple species have shown that GC content negatively correlates with expected TM propensity [29] and that the TM domains of established membrane proteins consist of stretches of hydrophobic and aromatic residues encoded by thymine-rich codons [35]. We thus hypothesized that the high thymine content of yeast intergenic sequences (Extended Data Fig. 6) may facilitate the emergence of novel polypeptides containing TM domains with adaptive potential. In agreement with this hypothesis, a strong influence of thymine content on TM propensity was observed regardless of ORF type, ORF length, or whether the sequences were real or scrambled (Fig. 5B, Extended Data Fig. 7). Established ORFs displayed non-random TM propensities, while emerging ORFs and iORFs appeared enriched in TM domains relative to their thymine content (Figs. 5A-B, Extended Data Fig. 7). We also estimated the TM propensity of small unannotated ORFs that pervasively occur throughout the genome (sORFs). TM propensity in sORFs was also largely driven by their thymine content, and markedly increased when they occupied a larger portion of the intergenic region from which they were extracted (Figs. 5B-C).
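The link between thymine-rich codons and hydrophobic residues can be checked directly against the standard genetic code. The Python sketch below groups the 61 sense codons by their number of thymines and averages the Kyte-Doolittle hydropathy of the encoded residues. The use of the Kyte-Doolittle scale and the function name `mean_hydropathy_by_thymine_count` are assumptions chosen for illustration; the analyses in this paper relied on TM predictors (TMHMM and Phobius), not on this simple average.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Kyte-Doolittle hydropathy index (assumed scale; higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy_by_thymine_count() -> dict:
    """Mean hydropathy of the residues encoded by sense codons that
    contain 0, 1, 2 or 3 thymines."""
    scores = {n: [] for n in range(4)}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons encode no residue
        scores[codon.count("T")].append(KYTE_DOOLITTLE[aa])
    return {n: sum(values) / len(values) for n, values in scores.items()}


means = mean_hydropathy_by_thymine_count()
```

Under this scale the mean hydropathy rises with each additional thymine in the codon, which is consistent with the idea that thymine-rich sequences tend to encode hydrophobic stretches when translated.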
Altogether, these results showed that the yeast genome harbors a pervasive TM propensity, facilitated by a high thymine content, and further magnified by additional intergenic sequence signals which possibly reflect a form of preadaptation to the birth of novel proteins [36] or other constraints. This discovery converges with our finding that overexpressing emerging ORFs with TM domains tends to increase relative fitness (Fig. 4). Together, these two findings suggest that proto-genes with adaptive biochemical capacities, such as TM domains, are most likely to emerge when the blueprint for these capacities existed in the genome before the acquisition of translation capacities (Fig. 5D).

D. A new model for adaptive proto-gene evolution. Our data suggest that intergenic thymine content influences the TM propensity of novel translated products, which in turn influences their adaptive potential.

An emerging ORF with adaptive potential caught in the process of fixation
To further investigate this hypothesis, we sought to retrace the evolutionary history of a specific locus to determine whether an ORF-first or TM-first scenario was most likely. We focused on YBR196C-A, one of the 28 adaptive emerging ORFs identified in our screens that was predicted to contain a TM domain and whose ORF structure appears relatively stabilized within S. cerevisiae (intact ORF in 95% of isolates; one of six annotated ORFs in the genome with these characteristics).
Extensive sequence similarity searches across a broad phylogenetic range (Methods) failed to identify sequences similar to YBR196C-A in species beyond the Saccharomyces genus, consistent with a recent origin. Aligning syntenic sequences across six Saccharomyces species revealed that ORFs of varying lengths in different reading frames were present in some, but not all, species of the clade, with highly variable primary sequences (Extended Data Fig. 8). Ancestral reconstruction along the clade (Methods; Fig. 6C) showed that no potential ORF longer than 30 codons was present in the Saccharomyces ancestor, in any reading frame (Extended Data Fig. 8), confirming the de novo origination of YBR196C-A.
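The check that no potential ORF longer than 30 codons exists in any reading frame can be sketched as a six-frame scan. In the Python example below, the definition of a potential ORF (an ATG followed in frame by a stop codon, length counted in codons excluding the stop) and the function names are assumptions made for illustration; the text does not specify the exact definition used for the reconstructed ancestor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_codons(seq: str) -> int:
    """Length, in codons excluding the stop, of the longest ATG-to-stop
    ORF found in any of the six reading frames. Returns 0 if none."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = None  # codons counted since the first in-frame ATG
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon in STOP_CODONS:
                    if run is not None:
                        best = max(best, run)  # ORF closed by a stop
                    run = None
                elif run is not None:
                    run += 1
                elif codon == "ATG":
                    run = 1  # open a new ORF at the first ATG
    return best
```

A reconstructed ancestral sequence would pass the criterion described in the text if `longest_orf_codons(ancestor) <= 30`, where `ancestor` is a placeholder for the gap-free nucleotide sequence.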
The initial ORF that became YBR196C-A (YBR_IO) likely originated at the common ancestor of S. kudriavzevii, S. mikatae, S. paradoxus and S. cerevisiae and already encoded putative TM domains. In fact, the ancestral intergenic sequence at the base of the clade already contained a suite of codons that would have had the capacity to encode TM domains, had it not been interrupted by stop codons (Fig. 6C). This TM propensity persisted in most extant sequences despite substantial primary sequence changes. Consistent with our previous analyses (Fig. 5), YBR196C-A is extremely T-rich (48%, 99th percentile of all annotated ORFs) and so are its extant relatives and reconstructed ancestors. The inferred evolutionary history of the YBR196C-A locus was therefore largely consistent with a TM-first scenario.

Ybr196c-a-EGFP localization was assessed using confocal microscopy.
C. Ybr196c-a colocalizes at the ER, but not at the Golgi, with Sec13p. Chromosomally integrated Sec13-RFP and plasmid-borne Ybr196c-a-EGFP localization was assessed using confocal microscopy.
to encode a protein with multiple short α-helices (longest are 13 and 14 residues; between positions 18 and 31, and 58 to 72, respectively). YBR196C-A is predicted to encode a protein with a long α-helix (20 residues; positions 9 to 29 between Pro and Arg).
G. Ybr196c-a is predicted to stably integrate into membranes. Molecular dynamics simulation shows that, after 200 ns, the peptide has kept the helix intact, with N- and C-terminal tails interacting with the surface of the lipid bilayer.

Besides its predicted TM propensity, there is no published evidence that the protein encoded by YBR196C-A has the potential to integrate into cellular membranes. The predicted TM propensity could be an artifact of applying existing TM prediction methods, which were trained on established membrane proteins. To test this, we visualized cells overexpressing (Figs. 6A-B, Extended Data Fig. 9). In a fraction of the cells, the protein also localized to puncta, where colocalization was observed with Scs2p, but not Sec13p (Extended Data Fig. 9). We did not observe localization at the cell periphery, nor colocalization with mitochondrial, peroxisomal or vacuolar markers (Extended Data Fig. 9). The protein encoded by another emerging ORF, YAR035C-A, was observed specifically in the mitochondrial membrane when we visualized it using the same methods (Extended Data Fig. 10). Therefore, the YBR196C-A locus has the potential to encode a protein that associates with a select subset of cellular membranes.

YBR_IO displayed a strong TM propensity, but underwent major changes in primary sequence including frameshifts, truncations and elongation throughout Saccharomyces evolution (Fig. 6C). We further determined that adaptive mutations are actively shaping the molecular changes observed in the protein sequence, consistent with our model (Fig. 1). A positive selection test across the four Saccharomyces species containing ORFs yielded statistically significant results (P

To investigate the impact of this rapid sequence evolution at the protein level, we compared the predicted 3D structures of the putative proteins encoded by YBR_IO, YBR_SP and YBR196C-A. All three shared a conserved predicted TM domain in the N-terminal region, but this domain contained a Gly in YBR_IO and YBR_SP that mutated to Ala in YBR196C-A (Extended Data Fig. 8).
A second, Phe-rich, low-complexity TM domain was predicted for YBR_IO and YBR_SP in the C-terminal region of the alignment, again presenting Gly and Pro residues (Extended Data Fig. 8). Gly and Pro are known for their low helix propensity in solution [37] and in a membrane environment [38], and expectedly hindered the formation of α-helical structures in 3D models generated with the ab initio prediction server Robetta [39]. In fact, the models were almost all β-strands rather than helices for YBR_IO (Fig. 6D). Models of YBR_SP displayed short, broken α-helices relative to models of YBR196C-A, due to the extra Gly and Pro residues (Figs. 6E, F; Extended Data Fig. 8). The only structural models capable of spanning the membrane were those for YBR196C-A. Four out of five models predicted by Robetta for Ybr196c-a yielded a robust TM helical structure spanning 20 residues, flanked by a Pro at the N-terminus and a pair of charged Arg at the C-terminus, and explicit-solvent molecular dynamics simulations of these 3D structures in a membrane bilayer strongly supported that YBR196C-A has the potential to encode a single-pass membrane-spanning protein (Fig. 6G).

Discussion
Several theories predicted that emerging sequences must carry adaptive potential for de novo gene birth to be possible [2,3,40]. Our results validate this prediction and support an experiential model for de novo gene birth whereby a fraction of incipient proto-genes with adaptive potential can subsequently mature and, as adaptive changes engender novel selected effects, progressively establish themselves in genomes in a species-specific manner (Figs. 1, 6). This model, consistent with studies showing that de novo emerging sequences remain volatile for millions of years [5,22,41-44], emphasizes that the selective pressures acting on proto-genes are different from those acting on canonical protein-coding genes.

Our work suggests that incipient proto-genes that expose cryptic TM domains through translation are more likely to carry adaptive potential than others, and thus more likely to contribute to evolutionary innovation through positive selection in S. cerevisiae (Figs. 4-6). This novel mechanistic model may explain how sequences that were never translated previously can evolve into novel proteins with useful physiological capacities. Indeed, we show that a simple thymine bias in intergenic sequences suffices to generate a diverse reservoir of novel TM peptides (Fig. 5).
The membrane environment might provide a natural niche for these novel peptides, shielding them from degradation by the proteasome, and allowing subsequent evolution of specific local interactions while reducing the potential for deleterious promiscuous interactions throughout the cytoplasm. Our examination of the YBR196C-A locus illustrates how a thymine-rich intergenic sequence with high TM propensity can, upon acquisition of translation signals, be molded by positive selection into a genuine TM ORF with adaptive potential that acquires selected effects as it matures over millions of years.

Our discovery that emerging ORFs containing TM domains tend to increase fitness when overexpressed (Fig. 4) is surprising given the elaborate systems controlling the insertion and folding of TM domains and preventing their aggregation [45]

We also defined artificial intergenic ORFs (iORFs) and genomic small ORFs (sORFs) as follows. iORFs were generated as in ref 7: first, non-annotated genomic regions were extracted using bedtools subtract [67] and the annotation GFF file downloaded from SGD. Then, stop codons in the +1 reading frame were removed, and the sequences in that frame were translated and used to calculate the various properties. sORFs include all non-annotated ORFs that showed no signs of translation from ref 3, that were longer than 75 nt, and did not entirely overlap annotated ORFs in any strand (n=18,503).
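The iORF construction described above (remove stop codons in the +1 reading frame, then translate that frame) can be sketched in a few lines of Python. The function name `make_iorf_peptide`, the choice to discard a trailing partial codon, and the choice to drop codons containing ambiguous bases are assumptions made for illustration and may differ from the procedure of ref 7.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks stops.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def make_iorf_peptide(region: str) -> str:
    """Translate an intergenic region in its first (+1) reading frame
    after removing in-frame stop codons.

    Assumptions: a trailing partial codon is ignored, and codons with
    characters other than A, C, G, T are dropped along with the stops.
    """
    region = region.upper()
    codons = [region[i:i + 3] for i in range(0, len(region) - 2, 3)]
    sense = [c for c in codons if CODON_TABLE.get(c, "*") != "*"]
    return "".join(CODON_TABLE[c] for c in sense)
```

For example, `make_iorf_peptide("ATGTAATTTGG")` reads the codons ATG, TAA and TTT, drops the stop codon TAA, and returns the peptide "MF".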

Synteny analysis.
We identified syntenic blocks for young ORFs across four Saccharomyces species by using the downstream and upstream genes of the young ORFs as anchors. The orthologs of the anchor genes were downloaded from ref 68. In cases where a continuous syntenic block could not be constructed between two anchor genes, it was constructed by aligning the ±1 kb region surrounding the anchor gene with the largest number of orthologs (identified by ref 69; downloaded from SGD). Multiple alignments of the syntenic blocks were generated using MUSCLE [70] (Source Data 1) and secondary structure for extant and reconstructed YBR196C-A homologues was predicted using psiPred [74] (Extended Data Fig. 8).

Genome-wide overexpression screening on ultra-high-density colony arrays.
Using a Singer ROTOR robotic plate handler (Singer Instrument Co. Ltd), overexpression and reference strains were transferred from glycerol archives to agar plates and then robotically combined into 1536-density agar plates (SC-URA+GLU+G418; in Extended Data Table 1). Cells on these plates were then transferred with the same robot at the same density to SC-URA+GAL+G418 where they were incubated for a day. This process was repeated once, following which the cells were robotically transferred to 6144-density agar plates in the screening conditions (in Extended Data Table 1).
Throughout this transfer process, five 1536-density source plates were copied 4 times, yielding on testing using Q-value estimations as defined in [77]. ORFs were categorized as having an increased or decreased relative fitness effect when overexpressed if the Q-value was lower than 0.01 and the normalized colony size was higher than 95%, or lower than 5%, of the technical replicates reference, respectively. All other ORFs were classified as unchanged.
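The categorization rule stated above can be written as a small decision function. The Python sketch below is one possible reading of that rule: it interprets "higher than 95% of the technical replicates reference" as exceeding more than 95% of the reference colony sizes, and treats the decreased case symmetrically. This interpretation, the function name `classify_fitness_effect` and its parameters are assumptions for illustration, not the screening pipeline itself.

```python
def classify_fitness_effect(colony_size, q_value, reference_sizes,
                            q_cutoff=0.01, tail_fraction=0.95):
    """Label an overexpression strain as 'increased', 'decreased' or
    'unchanged' relative to the reference colony-size distribution.

    colony_size     : normalized colony size of the tested strain
    q_value         : multiple-testing-corrected significance estimate
    reference_sizes : normalized colony sizes of the reference replicates
    """
    if q_value >= q_cutoff:
        return "unchanged"  # not statistically distinguishable
    n = len(reference_sizes)
    frac_below = sum(r < colony_size for r in reference_sizes) / n
    frac_above = sum(r > colony_size for r in reference_sizes) / n
    if frac_below > tail_fraction:
        return "increased"  # larger than more than 95% of the reference
    if frac_above > tail_fraction:
        return "decreased"  # smaller than more than 95% of the reference
    return "unchanged"
```

Requiring both a significance threshold and an extreme position in the reference distribution means that a strain with a small but statistically detectable shift in colony size is still reported as unchanged.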
The results of the five screens presented in Fig. 3 are reported in Source Data 3. These results were integrated as follows: ORFs that increased fitness when overexpressed in at least 1/5 conditions were labelled "adaptive"; ORFs that decreased fitness in at least 1/5 conditions and

To reconstruct the ancestral state of YBR196C-A, we first identified and extracted its orthologous regions in all other Saccharomyces species. We exploited SGD's fungal alignment resource to download ORF DNA + 1 kb up/downstream for guiding analyses. A multiple alignment of these sequences was generated using MAFFT [89]. A second, codon-aware alignment was generated with MACSE [90]. Using the MAFFT alignment, a phylogenetic tree was generated with PhyML [91] with the following parameters "-d nt -m HKY85 -v e -o lr -c 4 -a e -b 0 -f e -u species_tree.nwk" where "species_tree.nwk" is the species topology. Ancestral reconstruction was performed with PRANK [92] (on an alignment performed by PRANK, and not the one generated by MAFFT) using the above-mentioned tree as a guide and the parameters "-showanc -showevents -F". The ancestral sequences were extracted from the alignment output file of PRANK, and gaps were removed to obtain the nucleotide sequences, which were then translated into amino acid sequences.

Pairwise dN/dS (omega) was calculated using yn00 from PAML [93] [94]. To verify that the EGFP signal was generated by our plasmid construct, we also visualized the mCherry-Scs2-TM, Sec13-RFP and Pex3-RFP strains in SC+GAL+G418 after a pre-culture in SC+GLU+G418.

Data and materials availability.
Data are available in the main text, in the extended data figures and tables, and in source data files 1-5 on GitHub: https://github.com/annerux/AdaptiveTMprotogenes. Strains are available upon request.

Comparing the fitness impact of loss of emerging ORFs and loss of established ORFs with matched expression levels and length distributions.
Established ORFs with length and expression level distributions matched to those of emerging ORFs were randomly sampled with replacement among the set of all established ORFs (Methods).
A. Fraction of ORFs for which experimental deletion leads to colonies with fitness <0.9, as in Fig. 2A.
B. Fraction of ORFs with fixed ORF structure in 90% of S. cerevisiae isolates analyzed in Fig. 2B.

Extended Data Fig. 2. No detectable growth defect in the neutral reference strain.
The reference strain was tested for growth defect by using the barcoded haploid yeast overexpression strain (Table S2) transformed with expression vector (pBY011) ("plasmid") and the same strain without vector ("no plasmid"). The strains were grown in SC-URA+GAL+G418 (Table S1). A flat-bottom 96-well plate was filled with replicates of diluted pre-cultures (125 µl at OD600nm 0.08) and used to collect OD600nm points every 15 min for 48 h using a plate reader (SpectraMax M2, Molecular Devices). For each strain, 5 biological replicates represented by 4 technical replicates each were included. Raw data was blank corrected to the respective media, path length corrected and normalized to time zero.

Extended Data Fig. 3. Overexpression of emerging ORFs can provide fitness benefits.
A. Emerging ORFs display higher competitive fitness when overexpressed than established ORFs (Mann-Whitney U test, P = 5.5 × 10⁻³²). Density distributions of competitive fitness measurements for emerging (blue) and established (black) ORFs in complete media after 20 generations, as measured and quantified through barcode signal intensity in ref. 24. Vertical dashed lines represent group means. Note that this experimental design did not allow for direct comparison with the fitness of a reference strain.
B. Distribution of the normalized sizes of individual replicate colonies for the reference strain (grey violin plot on the left) and for each emerging ORF identified as statistically increasing fitness relative to the reference in complete media (see Fig. 3B). The red dashed line represents the median normalized colony size of the reference strain. Each of the 14 emerging ORFs represented here presents a distribution of normalized colony sizes that is incompatible with the null hypothesis that it was randomly drawn from the reference distribution (and hence was detected by our pipeline as showing increased relative fitness).
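For readers unfamiliar with the Mann-Whitney U test cited in panel A, the following self-contained Python sketch shows how the statistic and a two-sided P value can be computed. It uses midranks for ties and a normal approximation with tie correction but no continuity correction. The legend does not state which software or variant was used, so this should be read as an illustration of the test and not as the authors' implementation; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = midranks(combined)
    # U counts the (x, y) pairs in which the x value is larger (ties count 0.5).
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u_x, 1.0  # all values identical: no evidence of a shift
    z = (u_x - n1 * n2 / 2) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Hypothetical fitness values for two small groups.
u_stat, p_value = mann_whitney_u([4, 5, 6], [1, 2, 3])
```

With the large sample sizes typical of genome-wide screens the normal approximation is generally adequate; for very small samples an exact test would be preferable.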

Extended Data Fig. 5. No association between TM propensity and beneficial fitness effects in established ORFs.
The same analyses presented in Fig. 4 were repeated on the group of established ORFs. None of the 4 measures of TM propensity (C-F) differs significantly between adaptive and neutral established ORFs. The association between high TM propensity and beneficial fitness effects is specific to emerging ORFs (Fig. 4). See also Extended Data Fig. 6.

Extended Data Fig. 6. Distribution of TM residue content.
Two prediction methods are compared: Phobius (left column) and TMHMM (right column). Two ORF classes are compared: emerging (top row) and established (bottom row) ORFs. In each case, the distribution (density plot) of TM residue content (fraction of amino acids predicted as TM over the length of the ORF) is shown for ORFs classified as adaptive (gold), deleterious (purple) and neutral (gray). Vertical dashed lines correspond to the mean of the distribution of the respective color.

Extended Data Fig. 7. Analysis of TM propensity without sampling to control for length.
A. Propensity of ORFs to form TM domains. Established ORFs, iORFs and sORFs were sampled (n=1,000) with replacement to follow the length distribution of emerging ORFs. Error bars represent the standard error of the proportion.
B. Thymine content influences TM propensity. Top panel: bar graph representing the fraction of sequences from (A) predicted to encode a putative TM domain, binned by thymine content (bin size = 0.1) and compared between real (pink) and scrambled (green) sequences. Error bars represent the standard error of the proportion. Overlaid density plots represent the distribution of all sequences in each category. Bottom panel: scatterplot showing the fraction of sequence length predicted to be TM residues as a function of thymine content. Only sequences from (A) predicted to encode a TM domain are included in the bottom panel plots. Individual real (pink) and scrambled (brown) sequences are shown in the scatterplot with transparency; points of higher intensity indicate sequences sampled multiple times. Linear fits with 95% confidence intervals are shown.

Synthetic complete (-uracil) + casamino acids + 2% Galactose + 100 µg/mL G418 ++/++ SC-URA+CASE+GAL+RAF+G418 Synthetic complete (-uracil) + casamino acids + 2% Galactose + 1% Raffinose + 100 µg/mL G418 -/++ SD-URA+GAL+RAF+G418 Synthetic defined (+Histidine, Lysine, Leucine) + 2% Galactose + 1% Raffinose + 100 µg/mL G418

SC (Synthetic complete) and SD (Synthetic defined) media are composed of 0.175% Yeast Nitrogen Base without amino acids and without ammonium sulfate, supplemented with 0.1% L-Glutamic Acid and 0.2% dropout amino acids. The dropout mixes were made with 10 g of Leucine, 3 g of Adenine, 0.2 g of para-aminobenzoic acid and 2 g of the remaining amino acids for each dropout mix.
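The quantities plotted in panel B of the legend above (thymine content, bins of width 0.1, scrambled control sequences, and the standard error of a proportion) can be illustrated with a short Python sketch. It assumes that scrambling means randomly shuffling the nucleotides of each sequence, which preserves thymine content by construction; the legend does not specify the scrambling procedure. TM prediction itself (Phobius or TMHMM) is not reproduced here and is represented only by a hypothetical list of True/False calls. All function names are illustrative.

```python
import math
import random


def thymine_content(seq):
    """Fraction of nucleotides that are thymine."""
    return seq.upper().count("T") / len(seq)


def thymine_bin(seq, n_bins=10):
    """Index of the thymine-content bin (n_bins=10 gives bins of width 0.1).

    Integer arithmetic avoids floating-point edge effects at bin boundaries;
    a sequence made only of thymine is placed in the last bin.
    """
    t_count = seq.upper().count("T")
    return min(t_count * n_bins // len(seq), n_bins - 1)


def scramble(seq, rng):
    """Shuffle nucleotides (assumed scrambling scheme; composition is unchanged)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def proportion_with_se(flags):
    """Proportion of True values and its standard error, sqrt(p * (1 - p) / n)."""
    n = len(flags)
    p = sum(flags) / n
    return p, math.sqrt(p * (1 - p) / n)


rng = random.Random(0)  # fixed seed so the example is reproducible
real_seq = "TTTTAACCGG"  # hypothetical sequence
scrambled_seq = scramble(real_seq, rng)

# Hypothetical TM-domain calls for four sequences falling in one bin.
fraction_tm, standard_error = proportion_with_se([True, False, True, False])
```

Because shuffling preserves nucleotide composition, a real sequence and its scrambled counterpart always fall in the same thymine-content bin, which is what makes the per-bin comparison between the two categories meaningful.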