Functional repurposing of regulatory element activity during mammalian evolution

The spatiotemporal control of gene expression exerted by promoters and enhancers is central for organismal development, physiology and behaviour. These two types of regulatory elements have long been distinguished from each other based on their function, but recent work highlighted common architectural and functional features. It also suggested that inheritable alterations in the epigenetic and sequence context of regulatory elements might underlie evolutionary changes of their principal activity, which could result in changes in the transcriptional profile of genes under their control or even facilitate the birth of new genes. Here, based on integrated cross-mammalian analyses of DNase hypersensitivity, chromatin modification and transcriptional data, we provide support for this hypothesis by detecting 449 regulatory elements with signatures of activity turnover in sister species from the primate and rodent lineages (termed “P/E” elements). Through the comparison with outgroup species, we defined the directionality of turnover events, which revealed that most instances represent transformations of ancestral enhancers into promoters, leading to the emergence of species-specific transcribed loci or 5’ exons. Notably, P/E elements have distinct GC sequence compositions and stabilizing 5’ splicing (U1) regulatory motif patterns, which may predispose them to functional repurposing during evolution. Moreover, we trace changes in the U1 and polyadenylation signal densities and distributions that accompanied and likely drove the evolutionary activity switches. Overall, our work suggests rather widespread evolutionary remodelling of regulatory element functions. Functional repurposing thus represents a notable mechanism that likely facilitated regulatory innovation and the origination of new genes and exons during mammalian evolution.


87
Given the structural and functional similarities between enhancers and promoters, changes in the U1-PAS 88 axis might in principle also lead to a switch in the activity of these regulatory elements. Inheritable 89 mutations at PAS and U1 sites might stabilize enhancer-associated transcripts, thus facilitating their 90 evolution into promoters. Similarly, mutations might destabilize promoter transcription, but not affect the 91 ability of these loci to interact with other regulatory elements and regulate the expression of other genes.

92
In this scenario, one might thus expect to observe orthologous regulatory elements that function as 93 enhancers in one species or organismal lineage but as promoters in another. Interestingly, recent work 94 reported the frequent evolutionary emergence and decay of promoters 26 and enhancers 27 in mammals.

95
Although the gain and loss of regulatory elements is largely driven by the insertion and deletion of 96 genomic sequences, such as repetitive elements, many regions align to orthologous loci in other species 97 not showing the same functionality 26,27 , raising the possibility that some of them might have experienced 98 changes in their activity during evolution -a process hereafter referred to as "functional repurposing" 99 (even if the detected events are not necessarily selectively preserved/of phenotypic relevance).

101
The occurrence of functional repurposing of regulatory elements has been proposed 12,25,28 , and suggestive 102 evidence for the existence of such events has been reported in mammals. We recently described an 103 enrichment of enhancer-associated chromatin marks at mouse loci orthologous to the putative promoters 104 of new rat-specific mRNA-derived gene duplicates 29 (retrocopies). Moreover, two separate studies 105 reported evidence of 11 mouse long noncoding RNAs (lncRNAs) whose promoter sequences were 106 orthologous to putative human regulatory regions marked by DNase hypersensitivity but not associated 107 to any stable transcript 30,31 . Nonetheless, beyond these potential individual candidates, a thorough 108 investigation of the prevalence of functional repurposing during mammalian evolution and of the 109 underlying molecular mechanisms has been lacking. Here we aim to fill this gap, based on detailed and 110 5 integrated cross-species analyses of mammalian DNase hypersensitivity, chromatin modification and 111 transcriptional data.

114
Regulatory element repurposing in primates and rodents

115
As only limited evidence of putative functional repurposing was available from previous studies, we first 116 sought to confirm its occurrence and study its prevalence in mammals. Towards this aim, we defined 117 genome-wide sets of putative enhancers in a mammalian reference species and investigated whether any 118 of these loci were orthologous to putative promoter regions from a closely related species (Fig. 1a) and 119 hence represented candidate repurposed elements -here referred to as "P/E" elements. We focused our 120 work on four species from two mammalian orders: human and rhesus macaque, as representatives of the 121 primate lineage, and mouse and rat from the rodent lineage. We chose these two species pairs for several  therefore the analyses of both datasets allow for overall optimal analyses. For example, while the low 131 mutation rate and resulting high sequence similarity in the primates may allow for higher confidence 132 inferences of early regulatory element evolution, the larger rodent sequence divergence (due to higher 133 mutations rates) and more efficient natural selection during rodent evolution may allow for easier and/or 134 more powerful detection of functional repurposing events.

136
We defined sets of putative promoters as the upstream regions of stably transcribed loci, assembled using 137 both our newly generated as well as recently published 32 strand-specific RNA-seq data from four adult 138 organs (brain, heart, kidney and liver; see Methods) ( Table 3). We thus identified 191 P/E 157 elements in primates and 258 elements in rodents (i.e., 449 in total). We subdivided the P/E elements 158 into two distinct categories based on their association to a species-specific (i.e., macaque-or rat-specific) 159 transcribed locus ("novel" P/E) or to a new (species-specific) 5' exon of a locus transcribed in both species

168
The detection of P/E elements in two closely related species could in principle result from the 169 independent evolution of distinct activities from an ancestral inactive region, rather than from a species-170 specific repurposing event of an ancestral regulatory region. We therefore investigated whether the 171 presence of enhancer activity at a specific locus would significantly increase the chance of observing 172 promoter activity at the orthologous locus in a closely related species, which would support the 173 repurposing scenario. We defined regions showing no signature of regulatory activity and no overlap with 174 any exonic sequence in human and mouse, and evaluated whether their orthologous regions in the sister 175 species were associated to the TSS of a stable transcript. Only 0.05% (87/187466) of the regions tested in 176 primates and 0.03% (23/78470) in rodents showed this behaviour. These numbers were significantly 177 lower compared to the fraction of P/E elements retrieved in primates (0.19%, >3.56-fold enrichment for 7 novel or extended P/E loci, Chi-squared test, P < 10 -15 ) and rodents (0.24%, >7.46-fold enrichment, Chi-179 squared test, P < 10 -15 ) ( Fig. 1e). Although we observed a difference in GC content between the inactive 180 regions and the putative enhancers tested in both species ( Supplementary Fig. 2), rat-and macaque-181 specific promoters were nonetheless more often orthologous to enhancers than to inactive loci with 182 matched sequence composition in their sister species ( Supplementary Fig. 3). These data corroborate the 183 hypothesis that P/E elements likely correspond to ancestral regulatory regions that experienced 184 evolutionary changes in their regulatory activity in the last 25-29 millions of years. In other words, our 185 analyses show that ancestral regulatory capacities of genomic sequences facilitated regulatory innovation 186 and suggest that the "de novo" origination of regulatory activity is rare.  Table 1). Similarly, ≈31% of rodent P/E elements correspond to putative enhancers in 197 rabbit, while only ≈2% overlapped a promoter (Table 1). The much higher fraction of ancestral P/E 198 elements with enhancer activity strongly suggests that most repurposed elements correspond to 199 ancestral enhancers that recently evolved species-specific promoter activities.

201
To evaluate whether the bias in the directionality of the repurposing events could be explained by the 202 higher evolutionary turnover of enhancers compared to promoters 27 , we compared the rates of 203 repurposing and loss of activity for ancestral enhancers and promoters in both lineages (Methods, 204 Supplementary Fig. 4). In the primate lineage, we identified 2,256 ancestral enhancers and 1,370 205 ancestral promoters that lost their activity in macaque. In contrast, 25 ancestral enhancers and 3 206 ancestral promoters changed their activity in human or macaque (≈5-fold enrichment, Fisher's exact test, 207 P < 10 -2 ). These observations suggest that the higher rate of enhancer repurposing cannot be simply 208 explained by the higher turnover rate (i.e., reduced selective preservation) of these elements compared 209 to promoters. The lack of a statistically significant similar pattern in rodents (≈5-fold enrichment, Fisher's 210 exact test, P = 0.08; Supplementary Fig. 4) is probably explained by genomic/evolutionary differences 211 between the two sets of species. That is, the considerably larger divergence time of the rodent species 8 and outgroup (mouse/rat-rabbit divergence time: ≈80 million years, my) compared to that in primates 213 (human/macaque-marmoset: ≈42 my), the higher mutation rates in glires (the clade including rodents and 214 lagomorphs), and/or the larger long-term effective population sizes (i.e., more efficient natural selection) 215 in glires may obscure actual rates of activity turnover in this lineage (e.g., elements that lost regulatory 216 activity may have been completely lost or have diverged too much to be aligned across species; mainly 217 beneficial events have been retained; there may have been multiple turnovers at the same locus).

219
Our analysis also revealed that between 25 and 34 (47% and 66% in primates and rodents, respectively) of 220 the P/E elements active in liver did not bear any signature of activity in the same organ from the outgroup 221 species, and we reasoned that some P/E elements might be active in a different organ. We thus used   islands and enhancers ( Fig. 3a-b, Supplementary Fig. 5a-b). Surprisingly, we observed an overall higher GC 242 content in P/E elements compared to both non CGI-associated promoters and other enhancers across all 243 species (except for macaque enhancers), indicating that the sequence composition distinguishes P/E 244 elements from other regulatory sequences -regardless of their activity (Fig. 3a-b, Supplementary Fig. 5a-245 b). A similar trend was observed for the CpG frequency, with significantly higher content of this 9 dinucleotide in P/E elements compared to enhancers and to non-CGI promoters across species (except for 247 non-CGI promoters in human and mouse) (Fig. 3c-d, Supplementary Fig. 5c-d), reinforcing the distinction 248 of this class of regulatory elements from other active loci.

250
Sequence compositional changes probably contributed to repurposing 251 Next we sought to trace whether compositional changes may underlie functional shifts of P/E elements.

252
Notably, the CpG content of regulatory elements has been proposed to influence their transcriptional 253 output 33,34 . CpG islands are usually associated to the promoter of broadly and highly expressed genes 1 , 254 where they favour gene expression by creating a nucleosome-free environment 33,34 . We then asked 255 whether P/E elements with promoter activity were associated to higher GC-and CpG-contents compared 256 to their orthologous regions. To do so, we compared the difference in GC and CpG frequencies between 257 orthologous P/E elements with that observed between orthologous inactive regions (likely not subjected 258 to selective sequence constraint) to control for potential global differences in the sequence composition 259 of each species pair. We found that although rat P/E elements showed somewhat higher GC-content 260 compared to their orthologous sequences in mouse, the observed difference was not significantly 261 stronger than that of the control regions ( Supplementary Fig. 6), in agreement with the previously 262 reported higher genome-wide GC-content in rat 35 . By contrast, we noted that the total content of CpG 263 dinucleotides in rat P/E elements increased significantly compared to the control regions ( Fig. 3e-f),

264
indicating that the activity turnover of P/E loci is mirrored by a change of CpG frequency in this lineage. In 265 primates, the GC-content did not differ significantly between the orthologous P/E elements, whereas the 266 frequency of the CpG dinucleotides was significantly higher in macaque (i.e., for P/E elements with 267 promoter activity) compared to human (i.e., for P/E enhancers) ( Supplementary Fig. 7). Notably, the small 268 enrichment in both GC and CpG content measured in macaque was statistically significant when 269 compared to the control regions ( Supplementary Fig. 7). This result reflects the slightly lower GC content   (Fig. 1b, lower panel). For each element, we extracted U1 and PAS motifs up-and downstream of the TSS 283 of their associated transcript in macaque and rat, as well as for the corresponding orthologous regions in 284 the respective sister species (Fig. 4a). In both macaque and rat, as expected, given the promoter function 285 of P/E elements (and association with stable transcripts) in these species, we observed a higher density of 286 U1 sites downstream of the TSS compared to the antisense orientation and a weak but significant 287 opposite trend for the PAS motifs (Fig. 4b, Supplementary Fig. 8). In human and mouse, where annotated 288 P/E elements have enhancer properties (and no stable transcription is detectable), we found no 289 difference in PAS distribution around each P/E element but noted a significantly higher density of U1 sites 290 downstream of the projected TSS (Fig. 4c, Supplementary Fig. 8). Therefore, P/E elements not associated 291 to stable transcripts are nonetheless characterized by a U1/PAS environment with mixed features 292 compared to typical promoters and enhancers, which -together with their unique sequence 293 composition (see above) -may predispose them to repurposing during evolution.

295
Evolutionary changes in the U1/PAS axis 296 Evolutionary changes in the U1/PAS axis have been proposed as a mechanism underlying the emergence 297 of new transcribed loci that may be selectively preserved and thus form new genes 25 . However, so far, 298 evidence supporting this hypothesis has been lacking. We therefore took advantage of our dataset of 299 orthologous P/E element pairs to test whether evolutionary changes in the U1 and PAS motif distribution 300 around these loci might underlie their regulatory activity transformation. By comparing the distribution of 301 U1 sites surrounding orthologous P/E elements in rodents, we found that their density increased 302 significantly over 1 kilobase (kb) downstream of the TSS of promoter-associated rat transcripts with 303 respect to the orthologous non-transcribed regions in mouse enhancers (mean of 3.21 vs. 2.65 U1 sites 304 per kb, Wilcoxon's test, Benjamini-Hochberg corrected P < 10 -4 , Fig. 4d), whereas no significant difference 305 was observed in the corresponding upstream regions. When comparing the PAS distribution for the same 306 regions, we observed only a weak decrease in PAS density downstream of the rat TSSs (mean of 1.33 vs 307 1.64 PAS sites per kb, Wilcoxon's test, Benjamini-Hochberg corrected P < 10 -2 , Supplementary Fig. 9) and 308 no difference in the antisense orientation. We further observed a significantly shorter distance separating 309 the TSS from the closest downstream U1 site in rat promoters compared to the orthologous mouse 310 enhancers (Wilcoxon's test, Benjamini-Hochberg corrected P < 10 -2 , Fig. 4e). Finally, U1 sites preceded a 311 PAS downstream of a P/E-associated TSS in rat in 84.5% of the cases, compared to 71.6% for the 312 orthologous mouse enhancers (Chi-squared test, P < 10 -2 ). On the other hand, we found no significant 313 difference in U1 and PAS motif distributions around P/E elements in primates ( Supplementary Fig. 10), probably due to the low sequence divergence and resulting lack of power 36 . Overall, our data revealed 315 evolutionary shifts in the distribution of U1 sites and, to a lesser extent, PAS motifs, which mirrored the 316 presence or absence of stable transcripts (i.e., promoter or enhancer activity) at P/E loci in rodents. Thus, 317 changes in the U1/PAS axis indeed seem to contribute to the origination of promoters (from enhancers) 318 and, as a consequence, the emergence of new transcribed loci in mammals.

350
The investigation of P/E element activity in outgroup species revealed that most turnover events seem to 351 involve the repurposing of ancestral enhancer elements into species-specific promoters, an observation 352 that we show not to be solely explained by the higher evolutionary turnover of enhancers (due to reduced 353 purifying selection) compared to promoters in mammals 27 . It should be noted that almost half of the 354 alignable P/E elements in each lineage had no detectable activity in the outgroup species. This is likely due 355 to the relatively large evolutionary distance that separates our core set of species from their evolutionary 356 outgroups, indicating that the regulatory activity of these loci might either have emerged after the split of 357 the outgroup lineages or that it was lost during the evolution of the outgroup species lineages. Although 358 we cannot exclude that the inferred directionality of the repurposing process is influenced by the lack of 359 definition of the ancestral state for part of the P/E loci, such a scenario is unlikely to fully explain the 360 biased enhancer-to-promoter conversion pattern, because the higher rate of enhancer turnover reduces 361 the likelihood of detecting enhancers conserved in more distantly related species, which should in 362 principle disfavour the detection of enhancer-to-promoter turnover events. Our results therefore suggest 363 the existence of differences in repurposing potential between enhancers and promoters, which could 364 involve their underlying DNA sequence and/or aspects of their chromatin composition. Future work, 365 involving more closely related species or different populations of the same species, will be necessary to 366 further explore the biased directionality of the repurposing process and uncover its mechanistic bases.

368
The sequence analysis of P/E elements revealed sequence features that distinguish these loci from other 369 regulatory loci, and it provided initial evidence for the potential mechanisms behind the repurposing 370 process. Notably, the high GC and CpG content could make P/E loci particularly prone to drive the 371 expression of neighbouring sequences, for example through the recruitment of CpG binding proteins such 372 as Cfp1 37 . This protein is known to deposit H3K4me3 marks over the bound sequence 38 , which in turn  Moreover, U1 site density shifts also seem to be involved in the repurposing process. A higher number 381 and a higher proximity of U1 sites characterize the region downstream of the TSS of P/E-associated 382 transcripts, compared to their transcriptionally inactive orthologous regions. U1 sites are thought to 383 promote transcript stability in mammals, suggesting that changes in the distribution of these motifs might 384 be responsible for the stabilization or destabilization of P/E associated RNAs. On the other hand, it is 385 unclear whether the distribution of polyadenylation signals (PASs) had a significant influence on the 386 turnover process. Although PASs are slightly depleted downstream of the TSS compared to the upstream 387 region specifically in macaque and rat (i.e., the species used to assess P/E promoter activity), we found no 388 significant differences in PAS distribution between orthologous P/E elements, suggesting that, at least in 389 this context, variation in U1 site distribution could be sufficient drive to the repurposing process.

401
We generated 78 single-end strand-specific RNA-seq libraries from brain, heart, kidney and liver samples 402 for 6 mammals (human, macaque, marmoset, mouse, rat and rabbit; Supplementary Table 5)   The chromatin data used in our studies derive from different sources. DNase, H3K4me3, H3K4me1 and 420 H3K27ac data for mouse brain, heart, kidney and liver (core dataset) were obtained from the Mouse 421 ENCODE database 46 (Supplementary Table 6). DNase, H3K4me3, H3K4me1 and H3K27ac data from human 422 brain, heart, kidney and liver were obtained from the ENCODE database 47 or from the human Epigenome

423
Roadmap database 48 (Supplementary Table 6). For both species, we also downloaded H3K4me3 data from 424 additional adult and developmental samples from the same databases (extended dataset)

425
(Supplementary Table 6). All processed data corresponding to an older genome assembly version from 426 mouse (mm9) were converted to the newest version (mm10) using LiftOver 49 . As data were processed in 427 different ways, we applied a common approach to have comparable datasets. Specifically, we 428 downloaded processed peaks (in narrowPeak format) from multiple replicates of all samples, and 429 subsampled the top 20'000 (for H3K4me3 data) or top 80'000 peaks (for H3K4me1 and H3K27ac), ranked 430 based on their peak score; all DNase hypersensitive site (DHS) peaks from each sample were retained as 431 their numbers did not differ significantly across samples. We created organ/tissue specific sets of 432 H3K4me3, H3K4me1 and H3K27ac peaks by considering loci shared by at least three replicates from each organ (or by both replicates if only two samples were available), except when peaks were already derived 434 from merged samples, as for the adult mouse organs. We finally resized the peaks to 1,000 nt centred on 435 the summit of the peak (or on the middle of the peak when the summit was not available).

438
In each species, we defined as promoters the 1,000 nt located upstream of a stable transcript. Putative 439 enhancers in human and mouse were initially defined as DHSs overlapping an H3K27ac and/or an 440 H3K4me1 peak. The resulting set of enhancers was further filtered to exclude loci located closer than 441 1,000 nt from any H3K4me3 peak from any organs/tissues (including the extended dataset), or 442 overlapping the 1000 nt upstream region or the exons of any (stable or unstable) transcript. We further 443 downloaded enhancer sets defined using CAGE from human and mouse 4 ("permissive enhancers phase 1 444 and 2" from http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/). These loci were subjected to 445 the same filtering process described above, and then included in the final list of putative enhancers.

447
To define the set of random inactive regions, we sampled from the human and mouse genome 1.5 million 448 non-overlapping 1000 nt loci, and then removed from this list all loci mapping closer than 1000 nt from: a) 449 any DHS or any H3K4me1, H3K4me3 or H3K27ac peak from all organs from the core and extended Coordinates of all regulatory and random regions from human and mouse were converted on the 455 macaque and rat genome, respectively, using LiftOver 49 (with -minMatch=0.6) to define their orthologous 456 loci. A two-way liftOver conversion (species A -> species B -> species A) was adopted to avoid errors in the 457 orthology definition that may result from genomic duplication events. We defined as P/E elements all 458 human and mouse enhancers whose orthologous loci in macaque and rat, respectively, overlapped the 459 500 nt upstream of the TSS of a stable transcript. As a control, we identified all random inactive regions in 460 human and mouse whose orthologous sites in their sister species overlapped the 500 nt upstream of the 461 TSS of a stable transcript, and then compared the frequency of these loci with the frequency of P/E 462 elements in each core sample with a Chi-squared test. P/E elements were analysed separately whether 463 the associated transcript in macaque or rat corresponded to a new isoform of a transcribed locus present 464 in the sister species (extended P/E element) or to a newly transcribed locus (novel P/E element). Shared 465 transcribed loci between orthologous species were defined by the overlap of the orthologous sequences 16 of macaque/rat transcripts (determined using liftOver) with reconstructed or GENCODE annotated 467 transcripts in human/mouse.

469
After analysing the sequence composition of the enhancers and random regions (with an ortholog in the 470 sister species) used in the aforementioned analysis, we observed a significant difference in GC-content 471 between the two sets. To control for the GC-content effect, we resampled sets of enhancers and random 472 regions with a similar GC distribution. To this aim, we considered only enhancers with a GC-content lower 473 than 46% (for human) or 41% (for mouse). These thresholds, roughly corresponding to the mode of the  To define the directionality of the turnover events, we evaluated the presence of putative promoters or 493 enhancers in regions syntenic to P/E elements active in liver in marmoset and rabbit. In marmoset and 494 rabbit, enhancers corresponded to H3K27ac peaks not overlapping any H3K4me3 peak or the 1000 nt 495 upstream and the exons of any (stable or unstable) assembled transcript; promoters were defined as the 496 described before. Macaque and rat liver P/E element coordinates were converted in their outgroup 497 species genome using LiftOver (with -minMatch=0.4), and we then evaluated the overlap between the 498 converted coordinates and the annotated regulatory regions.
To estimate the rate of regulatory element loss, we identified the orthologous loci of liver promoters and 500 enhancers from human and mouse in the corresponding outgroup species (marmoset and rabbit, 501 respectively), and kept only those loci that showed the same activity in the two species. Specifically, 502 ancestral promoters had to be associated (e.g. overlap the 500 nt upstream of the TSS) to a stable liver 503 transcript; ancestral enhancers had to overlap an H3K27ac peak and not be associated to a promoter or 504 an H3K4me3 peak. These loci were then aligned on the macaque/rat genome, and the loss of activity in 505 these species was defined by the lack of stable transcription or H3K27ac peaks for ancestral promoters or 506 enhancers, respectively.

508
To estimate the fraction of ancestral promoters corresponding to P/E elements, we further calculated the 509 overlap of primate and rodent P/E elements coordinates converted in marmoset and rabbit, respectively, 510 with the putative promoter of any stable transcript (from any organ).

513
We extracted and compared the GC content [using the nuc tool from BEDTools 53 ] and CpG dinucleotides 514 frequency of different classes of regulatory elements in human and mouse using a Mann-Whitney U test.

515
The same features were compared between orthologous P/E elements in both lineages with a Wilcoxon 516 signed rank test. Finally, we estimated whether the magnitude of the evolutionary change in GC content 517 or CpG frequency at P/E regions was higher than that measured at random regions (defined above) to 518 control for global skews in sequence composition. This was done by resampling 10000 times from the set 519 of random regions the same number of P/E elements, and then comparing the mean GC and CpG 520 difference in P/E elements with the distribution of differences from the resampled set.

523
We determined the genome-wide location of U1 and PAS sites with the scanMotifGenomeWide tool from 524 HOMER 54 (version 4.7). To compare the distribution of the U1 and PAS motifs around P/E elements, we 525 considered the leftmost TSS of all P/E associated transcripts in macaque and rat and projected their 526 location in the corresponding sister species using BLAT 55 . We considered only novel P/E elements, given 527 that U1/PAS motifs from downstream transcripts in extended P/E loci might have conflated the U1/PAS 528 signal in mouse and human. We compared the density of U1 and PAS motifs over 1 kb up-and 529 downstream of the P/E-associated TSS in each species using a Wilcoxon signed-rank test. The same 530 approach was used to compare the density of these motifs between sister species. The proximity of the 531 closest U1 or PAS site up-or downstream of the P/E-associated TSSs was calculated using closestBed from 532 BEDTools 53 .
18 534 Software used 535 All processing of genomic coordinates was performed using tools from BEDTools suite 53 (version 2.19.1).

536
All statistical analysis were performed using R 56 . In-house code used to perform all analyses is available 537 upon request.