
# Amphioxus functional genomics and the origins of vertebrate gene regulation

## Abstract

Vertebrates have greatly elaborated the basic chordate body plan and evolved highly distinctive genomes that have been sculpted by two whole-genome duplications. Here we sequence the genome of the Mediterranean amphioxus (Branchiostoma lanceolatum) and characterize DNA methylation, chromatin accessibility, histone modifications and transcriptomes across multiple developmental stages and adult tissues to investigate the evolution of the regulation of the chordate genome. Comparisons with vertebrates identify an intermediate stage in the evolution of differentially methylated enhancers, and a high conservation of gene expression and its cis-regulatory logic between amphioxus and vertebrates that occurs maximally at an earlier mid-embryonic phylotypic period. We analyse regulatory evolution after whole-genome duplications, and find that—in vertebrates—over 80% of broadly expressed gene families with multiple paralogues derived from whole-genome duplications have members that restricted their ancestral expression, and underwent specialization rather than subfunctionalization. Counter-intuitively, paralogues that restricted their expression increased the complexity of their regulatory landscapes. These data pave the way for a better understanding of the regulatory principles that underlie key vertebrate innovations.

## Main

All vertebrates share multiple morphological and genomic novelties1. The most prominent genomic difference between vertebrates and non-vertebrate chordates is the reshaping of the gene complement that followed the two rounds of whole genome duplication (WGD)—the 2R hypothesis—that occurred at the base of the vertebrate lineage2,3. These large-scale mutational events are hypothesized to have facilitated the evolution of vertebrate morphological innovations, at least in part through the preferential retention of ‘developmental’ gene families and transcription factors after duplication3,4. However, duplicate genes and their associated regulatory elements were initially identical and could not drive innovation without regulatory and/or protein-coding changes.

To date, the effect of vertebrate WGDs on gene regulation has remained poorly understood, both in terms of the fates of duplicate genes and the acquisition of the unique genomic traits that are characteristic of vertebrates. These traits include numerous features that are often associated with gene regulation, such as unusually large intergenic and intronic regions5,6, high global 5-methylcytosine (5mC) content and 5mC-dependent regulation of embryonic transcriptional enhancers7. To investigate these traits, appropriate species must be used for comparisons. Previous studies have largely focused on phylogenetic distances that are either too short (such as human versus mouse) or too long (such as human versus fly or nematode), resulting in limited insights. In the first case, comparisons among closely related species (for example, between mammals8,9,10,11), for which the orthology of non-coding regions can be readily determined from genomic alignments, have allowed fine-grained analyses of the evolution of transcription-factor binding. In the second case, three-way comparisons of human, fly and nematode by the modENCODE consortium revealed no detectable conservation at the cis-regulatory level12 and very little conservation of gene expression13. Moreover, the genomes of flies and nematodes are highly derived14,15,16. Thus, we lack comprehensive functional genomic data from a slow-evolving, closely related outgroup that would enable an in-depth investigation of the origins of the vertebrate regulatory genome and of the effect of WGDs on gene regulation.

Unlike flies, nematodes and most non-vertebrates, amphioxus belongs to the chordate phylum. Therefore, although amphioxus lacks the specializations and innovations of vertebrates, it shares with them a basic body plan and has multiple organs and structures homologous to those of vertebrates1. For these reasons, amphioxus has widely been used as a reference outgroup to infer ancestral versus novel features during vertebrate evolution. Here, we undertook a comprehensive study of the transcriptome and regulatory genome of amphioxus to investigate how the unique functional genome architecture of vertebrates evolved.

## Functional genome annotation of amphioxus

We generated an exhaustive resource of genomic, epigenomic and transcriptomic data for the Mediterranean amphioxus (B. lanceolatum), comprising a total of 52 sample types (Fig. 1a and Supplementary Data 2, datasets 1–5). These datasets were mapped to a B. lanceolatum genome that was sequenced and assembled de novo, with 150× coverage, a total size of 495.4 Mbp, a scaffold N50 of 1.29 Mbp and 4% gaps (Extended Data Fig. 1a–c). To facilitate access by the research community, we integrated these resources into a UCSC Genome Browser track hub (Fig. 1b; available at http://amphiencode.github.io/Data/), together with an intra-cephalochordate sequence conservation track and a comprehensive annotation of repetitive elements (Extended Data Fig. 1d–f) and long non-coding RNAs (Extended Data Fig. 1g and Supplementary Data 2, dataset 6). To enable broader evolutionary comparisons, we reconstructed orthologous gene families for multiple vertebrate and non-vertebrate species (Supplementary Data 2, dataset 7), generated several equivalent datasets for zebrafish and medaka (Extended Data Fig. 2a), and built a dedicated server for synteny comparisons (Extended Data Fig. 1h).
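For readers unfamiliar with the assembly statistic quoted above, the scaffold N50 is the length L such that scaffolds of length at least L together contain at least half of the assembled bases. The following is a minimal sketch of that definition; the scaffold lengths are toy values chosen for illustration, not the B. lanceolatum assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))
```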

A comprehensive functional annotation of the B. lanceolatum genome identified 88,391 putative cis-regulatory elements of DNA as defined by assay for transposase-accessible chromatin using sequencing (ATAC-seq) (these elements are hereafter referred to as APREs), as well as 20,569 protein-coding genes supported by orthology. We divided the APREs into promoters—around transcription start sites (TSSs), which were highly supported by cap analysis gene-expression sequencing (CAGE-seq) data, Extended Data Fig. 2b—and gene-body, proximal and distal APREs (Fig. 1c). Equivalent analyses using zebrafish data yielded 256,018 potential regulatory regions, with a significantly higher proportion of these being distal APREs (Fig. 1c; P < 2.2 × 10−16, one-sided Fisher’s exact test). A significantly larger global TSS distance in APREs was observed for all vertebrates compared to amphioxus (Fig. 1d), even after correcting for differences in average intergenic length among species (Extended Data Fig. 2c; P < 2.2 × 10−16 for all vertebrate-versus-amphioxus comparisons, one-sided Mann–Whitney tests). Amphioxus APREs showed enrichment for enhancer-associated chromatin marks (Extended Data Fig. 2d), which were highly dynamic during embryo development (Extended Data Fig. 2e-g), and consistently drove GFP expression in zebrafish or amphioxus transgenic assays (93% (14/15), Fig. 1e and Extended Data Fig. 2h, i). Moreover, 89% (32/36) of previously reported amphioxus enhancers overlapped APREs defined by our data. Therefore, a large fraction of APREs probably act as developmentally regulated transcriptional enhancers.

## Disentangling vertebrate bidirectional promoters

Analyses of core promoters, defined by CAGE-seq, at single-nucleotide resolution revealed that amphioxus promoters display a mixture of pan-metazoan, pan-vertebrate and unique features (Extended Data Fig. 3 and Supplementary Information). These analyses also identified that 25% (3,950/15,884) of neighbouring protein-coding genes were arranged in bidirectional promoters. Bidirectional promoters were most common among ubiquitous promoters (Extended Data Fig. 4a), displayed a marked periodicity in the distance between promoters (Extended Data Fig. 4b, c) and were associated with genes that were significantly enriched in housekeeping functions (Extended Data Fig. 4d). Notably, the fraction of bidirectional promoters defined by CAGE-seq decreased progressively from amphioxus to mouse (12.83% (1,752/13,654)) and to zebrafish (7.84% (1,098/14,014)), which suggests a disentanglement of ancestral bidirectional promoters after each round of WGD (two in tetrapods and three in teleosts). Consistently, the majority of a set of 372 putatively ancestral, bidirectional promoters were lost in vertebrates—particularly in stem vertebrates (54.5%)—with only very few amphioxus-specific losses (5.3%) (Extended Data Fig. 4e, f).

## Developmental DNA demethylation of APREs

Similar to other non-vertebrates17,18,19, the amphioxus genome exhibited very low levels of CpG methylation (Fig. 2a); nearly all of the 5mC occurred in gene bodies, in which the proportion of methylated CpGs correlated positively with gene-expression levels but negatively with the density of H3K27me3 and H3K4me3 histone marks and CpG dinucleotides (Extended Data Fig. 5a–c). However, as in zebrafish and frogs7, global levels of 5mC displayed a decrease during development (Extended Data Fig. 5d–g), coinciding with the onset of expression of the amphioxus orthologue of TET demethylase (Extended Data Fig. 5h).

To assess whether these 5mC dynamics may have regulatory potential, we identified adult hepatic diverticulum-specific APREs that are inactive during development. Unlike embryo-specific APREs (Extended Data Fig. 6a), the clustering of these adult APREs on the basis of 5mC content revealed two distinct subsets, one with hepatic-specific and one with constitutive hypomethylation (Fig. 2b). Differentially methylated APREs (cluster 1) also displayed robust hypomethylation in other adult tissues (Fig. 2c), which suggests that demethylation at these APREs occurs organism-wide. Both groups of hepatic-specific APREs were enriched for binding sites of liver-specific transcription factors—such as Hnf4a—as well as broadly expressed transcription factors such as Foxa (Extended Data Fig. 6b), which is a pioneer factor that participates in 5mC removal at regulatory regions in mammals20.

APREs from both clusters were preferentially associated with genes with metabolic functions (Extended Data Fig. 6c). However, only APREs with hepatic-specific hypomethylation (cluster 1) were primarily associated with genes that displayed steady widespread expression (Fig. 2d and Extended Data Fig. 6d, e); these APREs were mainly located within gene bodies (Fig. 2e). These data suggest that demethylation of these APREs may contribute to their identification as adult-specific, transcriptional cis-regulatory elements within continuously hypermethylated gene-body contexts, which is characteristic of non-vertebrate species. Fourteen zebrafish gene families contained differentially methylated APREs in introns that are orthologous to those identified in amphioxus—amongst these are four genes that encode components of the Hippo pathway, including the transcriptional effectors Yap (yap1 and wwtr1) and Tead (tead1a and tead3a) (Extended Data Fig. 6f, g).

## The hourglass model and chordate embryogenesis

Previous comparative analyses among vertebrate transcriptomes21,22 showed a developmental period of maximal similarity in gene expression that coincides with the so-called phylotypic period, consistent with the hourglass model23. However, similar comparisons with tunicates and amphioxus have thus far not resolved a phylotypic period shared across all chordates22. Pairwise comparisons of stage-specific RNA sequencing (RNA-seq) data from developmental time courses of amphioxus against zebrafish, medaka, frog (Xenopus tropicalis) and chicken revealed a consistent period of highest similarity (Fig. 3a, b and Extended Data Fig. 7) that occurred slightly earlier than those reported for vertebrates; in amphioxus, this corresponds to the neurula at the 4–7-somite stage (18–21 hours post fertilization (hpf)). At the regulatory level, pairwise comparisons between the relative enrichment of transcription-factor motifs in sets of dynamic APREs that were active at each stage were also consistent with an earlier hourglass model24 (Fig. 3c). By contrast, at a shorter timescale, comparisons between different species of amphioxus showed that the sequence conservation for the same APREs was higher after the putative chordate phylotypic period (Fig. 3d).

## Regulatory conservation shapes chordate body plan

Additional comparisons of embryo transcriptomes and neighbourhood analysis of conserved co-expression25 showed a high conservation of developmental and global expression patterns and of gene functions between amphioxus and vertebrates (Extended Data Fig. 8 and Supplementary Information). Further pairwise comparison of co-regulated gene modules across tissues between amphioxus and zebrafish revealed multiple pairs with highly significant levels of orthologue overlap (Fig. 4a). These included modules with conserved tissue-specific expression that were enriched for coherent Gene Ontology categories, including genes with high expression in organs with ciliated cells (for example, spermatozoa and gill bars) (labelled ‘1’ in Fig. 4a–c) as well as neural, muscle, gut, liver, skin and metabolism-related modules (Supplementary Data 1). We also found a significant positive correlation between relative motif-enrichment scores for many pairs of modules (Fig. 4b); the most-enriched transcription-factor motifs within each cluster were highly consistent between amphioxus and zebrafish (Fig. 4d).

## Higher regulatory information in vertebrate genomes

To investigate the effect of WGDs on the evolution of vertebrate gene regulation, we first asked whether the number of putative regulatory regions per gene is higher in vertebrates than in amphioxus. We observed significantly more APREs in the regulatory landscape of each gene (as defined by the ‘Genomic Regions Enrichment of Annotations Tool’ (GREAT)26) in zebrafish than in amphioxus (Fig. 5a). This difference is particularly evident for gene families that have retained multiple copies after WGD (known as ohnologues; Fig. 5b), for which the number of APREs is very uneven between copies, with marked regulatory expansions observed for some ohnologues (Fig. 5c). The same patterns were detected for all developmental stages of amphioxus and zebrafish, as well as for medaka and mouse genomes, and were highly robust to down-sampling of ATAC-seq coverage in vertebrates (Extended Data Fig. 9a–c). We also detected a higher number of peaks associated with regulatory genes (‘trans-dev’ genes that are involved in the regulation of embryonic development) compared to housekeeping genes in all species (Extended Data Fig. 9d), consistent with the higher frequency of retention of trans-dev genes in multiple copies after WGD3 (Fig. 5b). Comparison of regulatory landscapes—determined experimentally using circular chromosome conformation capture followed by sequencing (4C-seq)—for 58 genes from 11 trans-dev gene families in amphioxus, zebrafish and mouse showed similar results (Extended Data Fig. 9e).

As expected, the higher number of APREs in zebrafish was associated with larger intergenic regions in this species (Extended Data Fig. 9f). However, the differences in APRE complements were not attributable only to an increase in genome size in vertebrates, as subsets of amphioxus and zebrafish genes with matched distributions of GREAT or intergenic-region lengths also displayed a higher number of APREs in zebrafish (Extended Data Fig. 9g, h). Further investigation of matched distributions showed that these differences were particularly great in genes with large regulatory landscapes (>50 kb) (Fig. 5d). Thus, larger regions in amphioxus did not scale at the same rate as in vertebrates in terms of regulatory complexity (Fig. 5e), which is consistent with the overall lower proportion of distal APREs identified in this species (Fig. 1c, d). In summary, these analyses reveal a large increase in the number of regulatory regions during vertebrate evolution (and/or a decrease in these regions in amphioxus)—particularly of distal regulatory elements—and that this trend is enhanced for specific gene copies retained after the WGDs, pointing to unequal rates of regulatory evolution for different ohnologues.

## More-complex regulation in specialized ohnologues

The duplication–degeneration–complementation (DDC) model hypothesizes that the retention of duplicate genes could be driven by reciprocal loss of regulatory elements and restriction of paralogues to distinct subsets of the ancestral expression pattern27. In particular, the DDC model predicts that individual paralogues would each have more restricted expression than an unduplicated outgroup, but that their summation would not. To test this, we binarized the expression (‘on’ or ‘off’) of each gene in nine homologous expression domains in amphioxus, zebrafish, frog and mouse (Fig. 6a). When comparing genes that returned to single-copy status after WGDs, we detected no expression bias between amphioxus and vertebrates (Fig. 6a, b and Extended Data Fig. 10a, b). By contrast, when vertebrate ohnologues were compared to their single amphioxus orthologues, the distributions were strongly skewed and many vertebrate genes displayed far more restricted expression domains (Fig. 6b and Extended Data Fig. 10a, b; similar results were obtained by comparing τ values28, Extended Data Fig. 10c–e). The symmetrical pattern was fully recovered when the expression of all vertebrate members was combined, or when the raw expression values were summed for each member within a paralogy group (Fig. 6a, b and Extended Data Fig. 10a, b).
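The comparison described above can be sketched in a few lines. This is an illustrative sketch only: the expression threshold used to call a gene 'on' is a hypothetical value (the threshold is not stated in this section), the expression vectors are toy data, and `tau` implements the commonly used definition of the τ tissue-specificity index, which may differ in preprocessing from the analysis reported here.

```python
def binarize(expression, threshold=1.0):
    """Call each domain 'on' (True) or 'off' (False); threshold is hypothetical."""
    return [value >= threshold for value in expression]


def union_pattern(paralogue_patterns):
    """A domain is 'on' for the family if any paralogue is 'on' there."""
    return [any(column) for column in zip(*paralogue_patterns)]


def tau(expression):
    """Tissue-specificity index: 0 for uniform expression, 1 for one domain only."""
    n = len(expression)
    peak = max(expression)
    if n < 2 or peak <= 0:
        raise ValueError("need at least two domains and some expression")
    return sum(1 - value / peak for value in expression) / (n - 1)


# Toy data over nine domains: one amphioxus gene and two vertebrate ohnologues.
amphioxus = binarize([5.0] * 9)
ohnologue_1 = binarize([5.0, 5.0, 5.0, 0, 0, 0, 0, 0, 0])
ohnologue_2 = binarize([0, 0, 0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0])

# Each ohnologue is more restricted than the outgroup gene, but their union is not.
print(sum(amphioxus), sum(ohnologue_1), sum(ohnologue_2))
print(sum(union_pattern([ohnologue_1, ohnologue_2])))
```

In this toy case the individual ohnologues are 'on' in fewer domains than the amphioxus gene, while the union recovers all nine, which is the pattern that the DDC model predicts.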

Although the above findings are consistent with the DDC model, they are also compatible with an alternative model in which a subset of duplicate genes becomes more ‘specialized’ in expression pattern while one or more paralogues retain the broader ancestral expression29. To distinguish between these alternatives, we analysed a subset of multi-gene families in which both the single amphioxus orthologue and the union of the vertebrate ohnologues—and thus probably the ancestral gene—were expressed across all nine samples that we compared. We then identified (i) gene families in which all vertebrate paralogues were expressed in all domains (termed ‘redundancy’), (ii) gene families in which none of the vertebrate members had expression across all domains (termed ‘subfunctionalization’)27 and (iii) gene families in which one or more vertebrate ohnologues were expressed in all domains, but at least one ohnologue was not (termed ‘specialization’) (Fig. 6c). We obtained very similar results for the three vertebrate species we studied (Fig. 6d): between 80 and 88% of gene families were subfunctionalized or specialized, which implies that ancestral expression domains have been lost in at least one member. Moreover, specialization was consistently more frequent than subfunctionalization as a fate for ohnologues with broad ancestral expression.
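The three categories defined above can be expressed as a small decision rule. The sketch below assumes that expression has already been binarized per domain and that the family has passed the filter described in the text (the amphioxus orthologue and the union of vertebrate ohnologues are 'on' in all nine domains); the function name and the toy families are hypothetical.

```python
def classify_family(ohnologue_patterns):
    """Classify a family from the binarized patterns of its vertebrate ohnologues.

    Each pattern is a list of booleans, one per expression domain.
    """
    if len(ohnologue_patterns) < 2:
        raise ValueError("a multi-gene family needs at least two ohnologues")
    # The analysis is restricted to families whose union covers every domain.
    if not all(any(column) for column in zip(*ohnologue_patterns)):
        raise ValueError("union of ohnologues is not expressed in all domains")
    full = [all(pattern) for pattern in ohnologue_patterns]
    if all(full):
        return "redundancy"
    if not any(full):
        return "subfunctionalization"
    return "specialization"


broad = [True] * 9
front = [True] * 4 + [False] * 5
back = [False] * 4 + [True] * 5

print(classify_family([broad, broad]))  # every member keeps all domains
print(classify_family([front, back]))   # no member keeps all domains
print(classify_family([broad, front]))  # one broad member, one restricted member
```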

Ohnologues that have experienced strong specialization (≤2 remaining expression domains) retained expression more often in neural tissues (Fig. 6e and Extended Data Fig. 10f–i) and were generally not expressed in additional vertebrate-specific tissues (Supplementary Information). Furthermore, they showed the fastest rates of sequence evolution (Fig. 6f and Extended Data Fig. 10j–l), consistent with an optimization of their coding sequence to perform their function in a specific tissue and/or with the evolution of novel functions (neofunctionalization). Ohnologues from specialized families that have lost expression domains showed significantly more associated APREs than ohnologues with the full ancestral expression (Fig. 6g). We observed a strong positive relationship between the number of ancestral expression domains lost and the number of APREs associated with specialized ohnologues (Extended Data Fig. 10m). This implies that the specialization of gene expression after WGD does not occur primarily through loss of ancestral tissue-specific enhancers, but rather by a complex remodelling of regulatory landscapes that involves recruitment of novel, tissue-specific regulatory elements.

## Discussion

By applying functional genomics approaches to the cephalochordate amphioxus, we have deepened our understanding of the origin and evolution of chordate genomes. We identified APREs in amphioxus, the activation of which is tightly associated with differential DNA demethylation in adult tissues—a mechanism previously thought to be specific to vertebrates. Additional cases may be subsequently found in other non-vertebrate species when similar multi-omics datasets are analysed. In amphioxus, APREs of this type usually fall within gene bodies of widely expressed genes, which suggests that gene regulation by demethylation could have originated as a mechanism to allow better definition of enhancers in a hyper-methylated intragenic context. If so, this mechanism could have been co-opted into new genomic contexts—that is, distal intergenic enhancers—later in the evolution of vertebrate genomes, which are characterized by their pervasive, genome-wide hypermethylation.

We also found a consistently higher number of open chromatin regions per gene in vertebrates than in amphioxus. This pattern is observed at a genome-wide level, but is particularly evident for distal APREs and in gene families that retain multiple ohnologues after WGD; these families are enriched for regulatory genes with large regulatory landscapes. Finally, we detected a large degree of specialization in expression for retained ohnologues, with the vast majority of multi-gene families with broad ancestral expression having at least one member that restricted its expression breadth. Through this mechanism, vertebrates have increased their repertoire of tightly regulated genes, which has potentially contributed to tissue-specific evolution. Gene-expression specialization was accompanied by faster evolution of protein-coding sequences, and by an increase, rather than a decrease, in the number of regulatory elements. Taken together, these observations indicate that the two rounds of WGD not only caused an expansion and diversification of gene repertoires in vertebrates, but also allowed functional and expression specialization of the extra copies by increasing the complexity of their gene regulatory landscapes. We suggest that these changes to the gene regulatory landscapes underpinned the evolution of morphological specializations in vertebrates.

## Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

### Animal husbandry and embryo staging

Amphioxus gametes were obtained by heat stimulation as previously described30,31. Embryos were obtained by in vitro fertilization in filtered seawater and cultured at 19 °C. Staging was done based on previous publications32,33; the correspondence between developmental stages and hpf is provided in Supplementary Table 1. All protocols used for vertebrate species (zebrafish and medaka) have been approved by the Institutional Animal Care and Use Ethic Committee (PRBB–IACUEC, for CRG) or the Ethics Committee of the Andalusian Government (license numbers 450-1839 and 182-41106, for CABD-CSIC), and implemented according to national and European regulations. All experiments were carried out in accordance with the principles of the 3Rs (replacement, reduction and refinement).

### Genome sequencing and assembly

Genomic DNA was extracted from a single B. lanceolatum adult male collected in Argeles-sur-Mer, France. The genome was sequenced using a combination of Illumina libraries from a range of inserts at Genoscope (897 million reads in total, with a paired-end coverage of 150×; Supplementary Table 2). A diploid assembly was generated with the SOAPdenovo assembler34 and a k-mer size of 71. After gap closing, haplotypes were reconciled with Haplomerger35.

### Genome annotation

We generated deep-coverage RNA-seq for 16 developmental stages and 9 adult tissues (4.16 billion reads in total). The bulk of strand-specific transcriptomic data was assembled de novo with Trinity36, aligned and assembled into loci with the PASA pipeline37. De novo gene models were built using Augustus38 and subsequently refined with EVM39 using PASA assemblies and aligned proteins from other species. In parallel, all strand-specific RNA-seq reads were mapped to the genome using Tophat240, assembled using Cufflinks41 and open reading frames were predicted using Trans-decoder42. Models obtained using both these approaches were reconciled, yielding a total of 218,070 transcripts from 90,927 unified loci, of which 20,569 were protein-coding and had homologues in at least one of the other studied species (see ‘Comparative genomics’). Gene Ontology (GO) terms were assigned to amphioxus proteins based on their PFAM and Interpro domains, as well as blastp hits against human proteins (1 × 10−6).

Repeats were annotated and filtered with RepeatMasker using a custom library generated with RepeatModeller. Long non-coding RNAs were identified by filtering all transcripts for protein-coding potential using CPAT43 trained with zebrafish transcripts, and further discarding those that had a positive hit in a HMM search against the NR and PFAM databases (Extended Data Fig. 1g).

### Comparative genomics

We used OMA44 to reconstruct gene families and infer homology relationships based on well-established phylogenetic relationships between species45, and further merged families sharing Ensembl paralogues with ‘Euteleostomi’ or ‘Vertebrata’ ancestry. To define the set of high-confidence ohnologue families (Supplementary Data 2, dataset 9), we retained families with two to four copies in three out of five vertebrates (excluding teleosts) and subjected them to phylogenetic reconciliation.

To assess genome sequence conservation, reciprocal whole-genome alignments of Branchiostoma floridae, Branchiostoma belcheri and B. lanceolatum were performed using LASTZ and processed with phastCons46 to produce conservation scores. The distribution of phastCons scores in APREs was determined using ‘dynamic’ ATAC-seq peaks that showed no temporal discontinuity in activity.

### Comparative transcriptomics

To investigate the evolutionary conservation of chordate development at the molecular level, newly generated data from zebrafish, medaka and amphioxus, as well as available data from the SRA (frog and chicken), were compared (Supplementary Data 2, dataset 3 and Supplementary Table 3). Gene expression was estimated with Kallisto47 using Ensembl transcriptome annotations (Supplementary Table 4), summing transcripts per million (TPMs) from all transcript isoforms to obtain one gene-expression estimate per gene and sample. We used single-copy orthologues to pair genes and used the Jensen–Shannon distance metric after quantile normalization of TPMs to score the distance between pairs of transcriptomes:

$$\mathrm{JSD}_{s}=\sqrt{\frac{1}{2}\sum _{g=0}^{{n}_{\mathrm{og}}}{p}_{g}\times \log \left(\frac{{p}_{g}}{\frac{1}{2}\left({p}_{g}+{q}_{g}\right)}\right)+\frac{1}{2}\sum _{g=0}^{{n}_{\mathrm{og}}}{q}_{g}\times \log \left(\frac{{q}_{g}}{\frac{1}{2}\left({p}_{g}+{q}_{g}\right)}\right)}$$

Statistical robustness towards gene sampling was assessed by calculating transcriptomic distances based on 100 bootstrap replicates and estimating the standard deviation over these replicates.
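The distance and its bootstrap can be illustrated with a short sketch. Several points here are assumptions rather than details taken from the text: p and q are treated as the expression values of the paired orthologues in two samples, each rescaled to sum to one; the logarithm is natural; zero terms contribute nothing; and genes are resampled with replacement before rescaling. Quantile normalization is omitted, and the TPM vectors are toy values.

```python
import math
import random
import statistics


def normalize(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    if total <= 0:
        raise ValueError("expression vector sums to zero")
    return [value / total for value in values]


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    def half_kl(a, b):
        # Terms with a == 0 contribute zero by convention.
        return 0.5 * sum(x * math.log(x / (0.5 * (x + y)))
                         for x, y in zip(a, b) if x > 0)
    return math.sqrt(max(0.0, half_kl(p, q) + half_kl(q, p)))


def bootstrap_sd(p_tpm, q_tpm, n_replicates=100, seed=0):
    """Standard deviation of the distance over gene-level bootstrap replicates."""
    rng = random.Random(seed)
    n = len(p_tpm)
    distances = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]  # resample genes with replacement
        p = normalize([p_tpm[i] for i in idx])
        q = normalize([q_tpm[i] for i in idx])
        distances.append(js_distance(p, q))
    return statistics.stdev(distances)


# Toy TPMs for five single-copy orthologues in two samples (hypothetical values).
stage_a = [10.0, 20.0, 5.0, 40.0, 25.0]
stage_b = [12.0, 18.0, 30.0, 20.0, 20.0]
print(js_distance(normalize(stage_a), normalize(stage_b)))
print(bootstrap_sd(stage_a, stage_b))
```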

To obtain groups of genes with similar dynamics of expression during development, genes were clustered based on their cRPKMs48 using the Mfuzz package49. For this purpose, eight comparable stages were selected in amphioxus and zebrafish on the basis of conserved developmental landmarks such as fertilization, gastrulation and organogenesis (Supplementary Table 5). The statistical significance of the orthologous gene overlap between pairs of clusters was assessed using upper-tail hypergeometric tests.
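As an illustration of the overlap test, the sketch below computes an upper-tail hypergeometric probability with exact integer arithmetic. It assumes the usual convention that the upper tail includes the observed overlap, that is P(X ≥ k), and that the universe is the set of orthologous gene pairs assigned to any cluster; both are assumptions, and the counts are toy numbers.

```python
from math import comb


def hypergeom_upper_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn from a universe
    that contains size_a 'marked' genes (orthologues in cluster A)."""
    if not 0 <= size_a <= universe or not 0 <= size_b <= universe:
        raise ValueError("cluster sizes must lie between 0 and the universe size")
    numerator = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return numerator / comb(universe, size_b)


# Toy example: 10 orthologue pairs in total, two clusters of 5, all 5 shared.
print(hypergeom_upper_tail(5, 5, 5, 10))
```

A small probability indicates that two clusters share more orthologues than expected by chance, given their sizes.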

Modules of co-expressed genes across stages and adult tissues were inferred using WGCNA50 with default parameters in amphioxus (17 samples) and zebrafish (27 samples) (Supplementary Table 6). The statistical significance of the orthologous gene overlap between pairs of clusters was assessed using upper-tail hypergeometric tests. The numbers of transcription-factor binding-site motifs detected in APREs in the basal regions of genes from any given cluster were standardized using z-scores.

To obtain a general assessment of the extent of conservation or divergence in gene expression among chordates at adult stages, we used neighbourhood analysis of conserved co-expression (NACC)25, a method developed to compare heterogeneous, non-matched sample sets across species. NACC relies on comparisons of average distances between pairs of orthologues (genes A and B), the 20 genes with the closest transcriptomic distance ($$\bar{A}$$ and $$\bar{B}$$) and their reciprocal orthologues in the other species ($$\overline{AB}$$ and $$\overline{BA}$$), and is calculated as follows:

$$NACC=\frac{1}{2}\left[\left(\overline{AB}-\bar{A}\right)+\left(\overline{BA}-\bar{B}\right)\right]$$

NACC calculations were performed for each family that contained a single amphioxus member and up to eight members in zebrafish and were also performed with randomized orthology relationships as a control.
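The NACC formula can be made concrete with a toy implementation. This sketch reflects one reading of the definition: Ā is the mean distance from gene A to its k nearest neighbours in its own species, and the reciprocal term is the mean distance from B to the orthologues of those neighbours in the other species (and vice versa for B). The Euclidean distance, the one-to-one orthology maps and the small k are assumptions made for illustration; the text specifies 20 neighbours.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(gene, expr, k):
    """The k genes closest to `gene` in expression space (ties broken by name)."""
    others = sorted((g for g in expr if g != gene),
                    key=lambda g: (euclid(expr[gene], expr[g]), g))
    return others[:k]


def mean_dist(gene, neighbours, expr):
    return sum(euclid(expr[gene], expr[g]) for g in neighbours) / len(neighbours)


def nacc(gene_a, gene_b, expr_a, expr_b, a_to_b, b_to_a, k=20):
    nn_a = nearest(gene_a, expr_a, k)
    nn_b = nearest(gene_b, expr_b, k)
    a_bar = mean_dist(gene_a, nn_a, expr_a)
    b_bar = mean_dist(gene_b, nn_b, expr_b)
    # Orthologues of A's neighbours, measured from B in species B, and vice versa.
    ab_bar = mean_dist(gene_b, [a_to_b[g] for g in nn_a if g in a_to_b], expr_b)
    ba_bar = mean_dist(gene_a, [b_to_a[g] for g in nn_b if g in b_to_a], expr_a)
    return 0.5 * ((ab_bar - a_bar) + (ba_bar - b_bar))


# Toy expression profiles (two samples per species) with one-to-one orthology.
expr_a = {"a1": (1, 0), "a2": (1, 1), "a3": (5, 5), "a4": (0, 1)}
conserved = {"b1": (1, 0), "b2": (1, 1), "b3": (5, 5), "b4": (0, 1)}
diverged = {"b1": (5, 4), "b2": (1, 1), "b3": (5, 5), "b4": (0, 1)}
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
b_to_a = {b: a for a, b in a_to_b.items()}

print(nacc("a1", "b1", expr_a, conserved, a_to_b, b_to_a, k=2))
print(nacc("a1", "b1", expr_a, diverged, a_to_b, b_to_a, k=2))
```

Under this reading, values near zero indicate that a gene keeps the same co-expression neighbourhood in both species, and larger values indicate divergence.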

### Regulatory profiling

#### ATAC-seq

For amphioxus, medaka and zebrafish, ATAC-seq was performed in two biological replicates by directly transferring embryos into the lysis buffer, following the original protocol51,52. ATAC-seq libraries were sequenced to produce an average of 66, 83 and 78 million reads for amphioxus, zebrafish and medaka, respectively. Reads were mapped with Bowtie2 and nucleosome-free pairs (insert < 120 bp) were retained for peak-calling using MACS253, and the irreproducible discovery rate was used to assess replicability. Nucleosome positioning was calculated from aligned ATAC-seq data using NucleoATAC54.

#### Chromatin immunoprecipitation with sequencing (ChIP–seq)

Embryos of undetermined gender were fixed in 2% formaldehyde and ChIP was performed as previously described for other species55. Chromatin was sonicated and incubated with the corresponding antibody (H3K4me3: ab8580, H3K27ac: ab4729 and H3K27me3: ab6002, from Abcam). An average of 30 million reads per library was generated. Reads were mapped with Bowtie2 and peaks called with MACS253, using default parameters.

#### 4C-seq

Embryos of undetermined gender were fixed in 2% formaldehyde and chromatin was digested with DpnII and Csp6. Specific primers targeted the TSSs of the studied genes and included Illumina adapters. An average of 5 million reads was generated for each of the two biological replicates. After mapping, reads were normalized per digestion fragment cut and interactions were identified using peakC56 with low-coverage regions excluded.

#### MethylC-seq and RRBS

Genomic DNA was extracted as previously described57, sonicated, purified and end-repaired. Bisulfite conversion was performed with the MethylCode Bisulfite Conversion Kit (Thermo Fisher Scientific). After Illumina library construction, an average of 73 million reads per sample was sequenced. RRBS libraries were prepared similarly to those for MethylC-seq, but with restriction digestion with MspI instead of sonication and PCR amplification. An average of 46 million reads per sample was generated. Reads were mapped to an in silico, bisulfite-converted B. lanceolatum reference genome7,58. Differentially methylated regions in the CpG context were identified as previously described7. Differential transcription-factor motif enrichment was obtained with DiffBind from Bioconductor.

#### CAGE-seq

Libraries were constructed using the non-amplifying non-tagging Illumina CAGE protocol59. Mouse CAGE-seq data were obtained from FANTOM560. Reads were aligned using Bowtie. Nearby individual CAGE TSSs were combined using the distance-based clustering method in CAGEr61 to produce tag clusters, which summarize expression at individual promoters. Tag clusters were clustered across samples to produce comparable promoter regions, referred to as ‘consensus clusters’. The consensus clusters were then grouped by expression patterns using a self-organizing map62. We investigated the relative presence and enrichment of the following features: TATA box, YY1 motif, GC and AT content, SS and WW dinucleotides, first exons and nucleosome positioning signal. Heat maps were plotted for visualization by scanning either for exact dinucleotide matches or for position weight matrix matches at 80% of the maximum score. Position weight matrices for TATA and YY1 were taken from the JASPAR vertebrate collection.
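The motif-scanning step (position weight matrix matches at 80% of the maximum score) can be illustrated with a short sketch. It assumes that the threshold is a fraction of the highest score the matrix can reach, and it uses a made-up four-column matrix rather than the JASPAR TATA or YY1 matrices; the function names `max_score` and `scan` are hypothetical.

```python
def max_score(pwm):
    """Highest score the matrix can give: best base at every column."""
    return sum(max(column.values()) for column in pwm)


def scan(sequence, pwm, fraction=0.8):
    """Return (start, score) for every window scoring >= fraction * max score.

    sequence: uppercase DNA string; bases missing from a column score 0.
    pwm: list of dicts, one per motif position, mapping base -> score.
    """
    threshold = fraction * max_score(pwm)
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(base, 0.0) for col, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy TATA-like matrix with illustrative scores (not taken from JASPAR).
toy_pwm = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 2.0},
    {"A": 2.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 2.0},
    {"A": 2.0, "C": 0.0, "G": 0.0, "T": 1.0},
]
hits = scan("GGTATAGGTATTCC", toy_pwm)
```

Here the perfect match `TATA` and the near match `TATT` both exceed the threshold of 6.4 (80% of the maximum score of 8), whereas all other windows fall below it.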

### Cis-regulatory comparisons

Depending on the analysis, an APRE was associated with a specific gene if it was located within: (i) the ‘basal’ region of the gene (−5 kb to +1 kb of the TSS; for comparisons of enriched motif composition) or (ii) the GREAT region of the gene (up to ±1 Mb of the TSS unless another basal region was found; for comparing the number of APREs per gene)26. Stratification of gene sets by GREAT or intergenic-region size between amphioxus and zebrafish was done using the function stratify from the matt suite63, with a range of ±500 bp.
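As an illustration of the 'basal' association rule (−5 kb to +1 kb of the TSS), the following sketch assigns an APRE to a gene when the APRE midpoint falls inside the strand-aware basal window. Using the midpoint is an assumption made here for simplicity, and the function names are hypothetical; the GREAT extension rule is not implemented.

```python
def basal_region(tss, strand, upstream=5000, downstream=1000):
    """Return (start, end) of the basal window around a TSS.

    Upstream and downstream are relative to the direction of transcription,
    so the window is mirrored for genes on the minus strand.
    Coordinates are not clamped to the scaffold boundaries.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def in_basal_region(apre_start, apre_end, tss, strand):
    """True if the APRE midpoint lies inside the basal window (assumption)."""
    start, end = basal_region(tss, strand)
    midpoint = (apre_start + apre_end) // 2
    return start <= midpoint <= end


# Illustrative coordinates: the same APRE is basal for a plus-strand gene
# but lies outside the window if the gene is on the minus strand.
plus_hit = in_basal_region(6000, 6200, tss=10000, strand="+")
minus_hit = in_basal_region(6000, 6200, tss=10000, strand="-")
```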

The DNA-binding specificity of each transcription factor was predicted on the basis of the binding domain similarity to other transcription-factor family members, as previously performed64. Transcription-factor motifs from CIS-BP (version 1.02)64 were downloaded and clustered using GimmeMotifs65 (P ≤ 0.0001). Two hundred and forty-two clusters of motifs were assigned to one or more orthologous groups in both amphioxus and zebrafish and used for all analyses (Supplementary Data 2, dataset 10). These motifs were detected in APREs using the tools gimme threshold and gimme scan from GimmeMotifs65.

### Effect of WGDs on gene expression

Gene expression was binarized (1 if the normalized cRPKM > 5, and 0 otherwise) across nine comparable samples in amphioxus and three vertebrate species (mouse, frog and zebrafish) (Supplementary Table 7). Then, for each amphioxus gene and vertebrate orthologue, the expression bias was measured by subtracting the number of positive-expression domains in amphioxus from that of vertebrates (Fig. 6a). The amphioxus gene-expression pattern was also compared to the union of the ohnologues, as well as the pattern after binarizing the expression for the sum of cRPKM values of all family members. The analysis was restricted to families with a single member in amphioxus.
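The binarization and expression-bias calculation can be sketched as below. The cRPKM values are invented for illustration, and the function names (`binarize`, `expression_bias`, `union_pattern`, `summed_pattern`) are hypothetical; the sketch only follows the definitions given in the text.

```python
def binarize(values, cutoff=5.0):
    """1 if cRPKM > cutoff, 0 otherwise, for each expression domain."""
    return [1 if v > cutoff else 0 for v in values]


def expression_bias(amphioxus_values, vertebrate_values, cutoff=5.0):
    """Positive domains in the vertebrate gene minus those in amphioxus."""
    return sum(binarize(vertebrate_values, cutoff)) - sum(
        binarize(amphioxus_values, cutoff)
    )


def union_pattern(ohnologue_values, cutoff=5.0):
    """Domain is positive if any family member is expressed there."""
    patterns = [binarize(v, cutoff) for v in ohnologue_values]
    return [max(domain) for domain in zip(*patterns)]


def summed_pattern(ohnologue_values, cutoff=5.0):
    """Binarize after summing cRPKM values over all family members."""
    return binarize([sum(domain) for domain in zip(*ohnologue_values)], cutoff)


# Invented cRPKM values across nine domains.
amphioxus = [12, 8, 30, 6, 9, 15, 7, 22, 11]
ohnologue_a = [10, 0, 0, 7, 0, 3, 0, 1, 0]
ohnologue_b = [0, 3, 9, 0, 0, 4, 0, 0, 6]

bias_a = expression_bias(amphioxus, ohnologue_a)
union = union_pattern([ohnologue_a, ohnologue_b])
summed = summed_pattern([ohnologue_a, ohnologue_b])
```

In this toy family, the sixth domain is negative in the union (3 and 4 are each below the cutoff) but positive after summing (7 > 5), which shows why the two family-level comparisons can differ.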

Next, we selected those ohnologue families for which the ancestral expression included the nine studied domains, as inferred from having expression in the single amphioxus orthologue and in the union of the family. For each gene family, we then defined (Fig. 6c): (i) redundancy (all vertebrate paralogues were expressed in all domains), (ii) subfunctionalization (none of the vertebrate members had expression across all domains27), and (iii) specialization (one or more vertebrate ohnologues were expressed in all domains, but at least one ohnologue was not). Members of the latter type were subdivided into 'strong' specialization if they retained ≤ 2 expression domains and 'mild' specialization if they retained more than 2. We examined the transcript sequence similarity as well as the dN/dS between human and mouse (retrieved from Biomart), and the number of APREs associated with genes from different categories. Finally, we computed the τ tissue-specificity index as previously described28, to assess more broadly the tissue specificity of ohnologues.
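The three family categories can be expressed as a small classifier over binarized expression patterns. This is a sketch under stated assumptions: the input family is taken to have already passed the ancestral-expression filter, 'strong' specialization is read as retaining two or fewer domains, and the function names are hypothetical.

```python
def classify_family(ohnologue_patterns):
    """Classify a family from the binarized patterns of its ohnologues.

    ohnologue_patterns: list of 0/1 lists, one per vertebrate ohnologue,
    all of the same length (one entry per expression domain).
    """
    n_domains = len(ohnologue_patterns[0])
    expressed_everywhere = [sum(p) == n_domains for p in ohnologue_patterns]
    if all(expressed_everywhere):
        return "redundancy"
    if not any(expressed_everywhere):
        return "subfunctionalization"
    return "specialization"


def specialization_strength(pattern, max_strong=2):
    """'strong' if the restricted ohnologue kept <= max_strong domains."""
    return "strong" if sum(pattern) <= max_strong else "mild"


# Invented patterns across nine domains.
broad = [1] * 9
restricted = [1, 1, 0, 0, 0, 0, 0, 0, 0]
partial = [1, 1, 1, 1, 0, 0, 0, 0, 0]

family_class = classify_family([broad, restricted])
strength = specialization_strength(restricted)
```

Note that the strength label applies to the individual restricted ohnologues of a family classified as specialization, not to the family as a whole.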

### Transgenic assays in zebrafish and amphioxus

Enhancer reporter assays in zebrafish embryos were performed as previously described66. Selected peaks were first amplified, cloned into a pCR8/GW/TOPO vector and transferred into a detection vector (including a gata2 minimal promoter, a GFP reporter gene and a strong midbrain enhancer (z48) as an internal control)67. Transgenic embryos were generated using the Tol2 transposon and transposase method68. Three or more independent stable transgenic lines were generated for each construct as reported in Supplementary Table 8. For amphioxus reporter assays, selected peaks were amplified and transferred into a detection vector (including the Branchiostoma minimal actin promoter, a GFP reporter gene and piggyBac terminal repeats). Transgenic embryos were generated by the piggyBac transposase method.

### In situ hybridization

Gene fragments that were synthetically designed or amplified by PCR from cDNA were sub-cloned into pBluescript II SK and used as templates for probe synthesis using the DIG labelling kit (Roche) and T3 RNA polymerase. Embryos at different developmental stages were fixed in 4% PFA dissolved in MOPS–EGTA buffer and in situ hybridization was carried out as previously described69, using BCIP/NBT as a chromogenic substrate.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

### Code availability

Custom code is available at https://gitlab.com/groups/FunctionalAmphioxus.

## Data availability

Next-generation sequencing data have been deposited in Gene Expression Omnibus (GEO) under the following accession numbers: GSE106372 (ChIP-seq), GSE106428 (ATAC-seq), GSE106429 (CAGE-seq), GSE106430 (RNA-seq), GSE102144 (MethylC-seq and RRBS) and GSE115945 (4C-seq). Raw genome sequencing data and the genome assembly have been submitted to European Nucleotide Archive (ENA) under the accession number PRJEB13665. UCSC hub and annotation files are available at http://amphiencode.github.io/.

## References

1. Bertrand, S. & Escriva, H. Evolutionary crossroads in developmental biology: amphioxus. Development 138, 4819–4830 (2011).

2. Dehal, P. & Boore, J. L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3, e314 (2005).

3. Putnam, N. H. et al. The amphioxus genome and the evolution of the chordate karyotype. Nature 453, 1064–1071 (2008).

4. Holland, L. Z. et al. The amphioxus genome illuminates vertebrate origins and cephalochordate biology. Genome Res. 18, 1100–1111 (2008).

5. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

6. Nelson, C. E., Hersh, B. M. & Carroll, S. B. The regulatory content of intergenic DNA shapes genome architecture. Genome Biol. 5, R25 (2004).

7. Bogdanović, O. et al. Active DNA demethylation at enhancers during the vertebrate phylotypic period. Nat. Genet. 48, 417–426 (2016).

8. Berthelot, C., Villar, D., Horvath, J. E., Odom, D. T. & Flicek, P. Complexity and conservation of regulatory landscapes underlie evolutionary resilience of mammalian gene expression. Nat. Ecol. Evol. 2, 152–163 (2018).

9. Reilly, S. K. et al. Evolutionary changes in promoter and enhancer activity during human corticogenesis. Science 347, 1155–1159 (2015).

10. Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).

11. Vierstra, J. et al. Mouse regulatory DNA landscapes reveal global principles of cis-regulatory evolution. Science 346, 1007–1012 (2014).

12. Boyle, A. P. et al. Comparative analysis of regulatory information and circuits across distant species. Nature 512, 453–456 (2014).

13. Gerstein, M. B. et al. Comparative analysis of the transcriptome across distant species. Nature 512, 445–448 (2014).

14. Hendrich, B. & Tweedie, S. The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet. 19, 269–277 (2003).

15. Irimia, M. et al. Extensive conservation of ancient microsynteny across metazoans due to cis-regulatory constraints. Genome Res. 22, 2356–2367 (2012).

16. Simakov, O. et al. Insights into bilaterian evolution from three spiralian genomes. Nature 493, 526–531 (2013).

17. Wang, X. et al. Genome-wide and single-base resolution DNA methylomes of the Pacific oyster Crassostrea gigas provide insight into the evolution of invertebrate CpG methylation. BMC Genomics 15, 1119 (2014).

18. Albalat, R., Martí-Solans, J. & Cañestro, C. DNA methylation in amphioxus: from ancestral functions to new roles in vertebrates. Brief. Funct. Genomics 11, 142–155 (2012).

19. Huang, S. et al. Decelerated genome evolution in modern vertebrates revealed by analysis of multiple lancelet genomes. Nat. Commun. 5, 5896 (2014).

20. Zhang, Y. et al. Nucleation of DNA repair factors by FOXA1 links DNA demethylation to transcriptional pioneering. Nat. Genet. 48, 1003–1013 (2016).

21. Irie, N. & Kuratani, S. Comparative transcriptome analysis reveals vertebrate phylotypic period during organogenesis. Nat. Commun. 2, 248 (2011).

22. Hu, H. et al. Constrained vertebrate evolution by pleiotropic genes. Nat. Ecol. Evol. 1, 1722–1730 (2017).

23. Duboule, D. Temporal colinearity and the phylotypic progression: a basis for the stability of a vertebrate Bauplan and the evolution of morphologies through heterochrony. Development 1994 Suppl., 135–142 (1994).

24. Bogdanović, O. et al. Dynamics of enhancer chromatin signatures mark the transition from pluripotency to cell specification during embryogenesis. Genome Res. 22, 2043–2053 (2012).

25. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).

26. McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).

27. Force, A. et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999).

28. Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21, 650–659 (2005).

29. Sandve, S. R., Rohlfs, R. V. & Hvidsten, T. R. Subfunctionalization versus neofunctionalization after whole-genome duplication. Nat. Genet. 50, 908–909 (2018).

30. Fuentes, M. et al. Preliminary observations on the spawning conditions of the European amphioxus (Branchiostoma lanceolatum) in captivity. J. Exp. Zool. B Mol. Dev. Evol. 302B, 384–391 (2004).

31. Fuentes, M. et al. Insights into spawning behavior and development of the European amphioxus (Branchiostoma lanceolatum). J. Exp. Zool. B Mol. Dev. Evol. 308B, 484–493 (2007).

32. Hirakow, R. & Kajita, N. Electron microscopic study of the development of amphioxus, Branchiostoma belcheri tsingtauense: the gastrula. J. Morphol. 207, 37–52 (1991).

33. Hirakow, R. & Kajita, N. Electron microscopic study of the development of amphioxus, Branchiostoma belcheri tsingtauense: the neurula and larva. Kaibogaku Zasshi 69, 1–13 (1994).

34. Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).

35. Huang, S. et al. HaploMerger: reconstructing allelic relationships for polymorphic diploid genome assemblies. Genome Res. 22, 1581–1588 (2012).

36. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

37. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).

38. Keller, O., Kollmar, M., Stanke, M. & Waack, S. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics 27, 757–763 (2011).

39. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).

40. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

41. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

42. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protocols 8, 1494–1512 (2013).

43. Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013).

44. Roth, A. C., Gonnet, G. H. & Dessimoz, C. Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics 9, 518 (2008).

45. Altenhoff, A. M., Gil, M., Gonnet, G. H. & Dessimoz, C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS ONE 8, e53786 (2013).

46. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

47. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).

48. Labbé, R. M. et al. A comparative transcriptomic analysis reveals conserved features of stem cell pluripotency in planarians and mammals. Stem Cells 30, 1734–1745 (2012).

49. Kumar, L. & Futschik, M. E. Mfuzz: a software package for soft clustering of microarray data. Bioinformation 2, 5–7 (2007).

50. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).

51. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).

52. Fernández-Miñán, A., Bessa, J., Tena, J. J. & Gómez-Skarmeta, J. L. Assay for transposase-accessible chromatin and circularized chromosome conformation capture, two methods to explore the regulatory landscapes of genes in zebrafish. Methods Cell Biol. 135, 413–430 (2016).

53. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).

54. Schep, A. N. et al. Structured nucleosome fingerprints enable high-resolution mapping of chromatin architecture within regulatory regions. Genome Res. 25, 1757–1770 (2015).

55. Bogdanović, O., Fernández-Miñán, A., Tena, J. J., de la Calle-Mustienes, E. & Gómez-Skarmeta, J. L. The developmental epigenomics toolbox: ChIP-seq and MethylCap-seq profiling of early zebrafish embryos. Methods 62, 207–215 (2013).

56. Geeven, G., Teunissen, H., de Laat, W. & de Wit, E. peakC: a flexible, non-parametric peak calling package for 4C and Capture-C data. Nucleic Acids Res. 46, e91 (2018).

57. Bogdanović, O. & Veenstra, G. J. Affinity-based enrichment strategies to assay methyl-CpG binding activity and DNA methylation in early Xenopus embryos. BMC Res. Notes 4, 300 (2011).

58. Lister, R. et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471, 68–73 (2011).

59. Murata, M. et al. Detecting expressed genes using CAGE. Methods Mol. Biol. 1164, 67–85 (2014).

60. The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).

61. Haberle, V., Forrest, A. R., Hayashizaki, Y., Carninci, P. & Lenhard, B. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 43, e51 (2015).

62. Wehrens, R. & Buydens, L. M. C. Self- and super-organising maps in R: the kohonen package. J. Stat. Softw. 21, 1–19 (2007).

63. Gohr, A. & Irimia, M. Matt: Unix tools for alternative splicing analysis. Bioinformatics (2018).

64. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).

65. van Heeringen, S. J. & Veenstra, G. J. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics 27, 270–271 (2011).

66. Bessa, J. et al. Zebrafish enhancer detection (ZED) vector: a new tool to facilitate transgenesis and the functional analysis of cis-regulatory regions in zebrafish. Dev. Dyn. 238, 2409–2417 (2009).

67. Gehrke, A. R. et al. Deep conservation of wrist and digit enhancers in fish. Proc. Natl Acad. Sci. USA 112, 803–808 (2015).

68. Kawakami, K. Transgenesis and gene trap methods in zebrafish by using the Tol2 transposable element. Methods Cell Biol. 77, 201–222 (2004).

69. Somorjai, I., Bertrand, S., Camasses, A., Haguenauer, A. & Escriva, H. Evidence for stasis and not genetic piracy in developmental expression patterns of Branchiostoma lanceolatum and Branchiostoma floridae, two amphioxus species that have evolved independently over the course of 200 Myr. Dev. Genes Evol. 218, 703–713 (2008).

70. Tena, J. J. et al. Comparative epigenomics in distantly related teleost species identifies conserved cis-regulatory nodes active during the vertebrate phylotypic period. Genome Res. 24, 1075–1085 (2014).

71. Acemel, R. D. et al. A single three-dimensional chromatin compartment in amphioxus indicates a stepwise evolution of vertebrate Hox bimodal regulation. Nat. Genet. 48, 336–341 (2016).

## Acknowledgements

This research was funded primarily by the European Research Council (ERC) under the European Union’s Horizon 2020 and Seventh Framework Program FP7 research and innovation programs (ERC-AdG-LS8-740041 to J.L.G.-S., ERC-StG-LS2-637591 to M.I., a Marie Sklodowska-Curie Grant (658521) to I.M. and a FP7/2007-2013-ERC-268513 to P.W.H.H.), the Spanish Ministerio de Economía y Competitividad (BFU2016-74961-P to J.L.G.-S., RYC-2016-20089 to I.M., BFU2014-55076-P and BFU2017-89201-P to M.I. and BFU2014-55738-REDT to J.L.G.-S., M.I. and J.R.M.-M.), the ‘Centro de Excelencia Severo Ochoa 2013-2017’ (SEV-2012-0208), the ‘Unidad de Excelencia María de Maeztu 2017-2021’ (MDM-2016-0687), the People Program (Marie Curie Actions) of the European Union’s Seventh Framework Program FP7 under REA grant agreement number 607142 (DevCom) to J.L.G.-S., and the CNRS and the ANR (ANR16-CE12-0008-01) to H.E. O.B. was supported by an Australian Research Council Discovery Early Career Researcher Award (DECRA; DE140101962). We acknowledge the support of the CERCA Programme/Generalitat de Catalunya and of the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) to the EMBL partnership. Additional sources of funding for all authors are listed in Supplementary Information.

### Reviewer information

Nature thanks D. Duboule and P. Flicek for their contribution to the peer review of this work.

## Author information

### Contributions

F.M., P.N.F., I.M., J.J.T., O.B., M.P., B.L., P.W.H.H., H.E., J.L.G.-S. and M.I. contributed to concept and study design. F.M., P.N.F., I.M., J.J.T., O.B., M.P., C.D.R.W., R.D.A., S.J.v.H., C.H.-U., K.S., Y.M., A. Louis, P.J.B., P.E.D., M.T.W., J.G.-F., R.L., B.L., P.W.H.H., J.L.G.-S. and M.I. performed computational analyses and data interpretation. O.B., E.d.l.C.-M., S.B., D.B., R.D.A., S.N., S.J.-G., D.A., L.B., J.P., B.A.-C., Y.L.P., A. Leon, L.S., E.F., P.C., J.R.M.-M., R.L., B.L., H.E., J.L.G.-S. and M.I. obtained biological material and generated next-generation sequencing data. I.M., J.J.T., E.d.l.C.-M., I.K., R.D.A., Z.K. and J.L.G.-S. performed transgenic assays. J.-M.A., S.M. and P.W. sequenced the genome. R.A., E.B.-G., C.C., F.C., S.D., D.E.K.F., S.H., V.L., G.A.B.M., P.P., M.S., H.S., I.S., T.T., O.M., A.X. and J.-K.Y. contributed to genome sequencing and gene family curation. I.M., H.E., J.L.G.-S. and M.I. coordinated the project. F.M., I.M., P.W.H.H. and M.I. wrote the main text, with input from all authors. Detailed contributions are listed in Supplementary Information. Animal illustrations by J.J.T., released under a Creative Commons Attribution (CC-BY) Licence.

### Corresponding authors

Correspondence to Ignacio Maeso, Hector Escriva, Jose Luis Gómez-Skarmeta or Manuel Irimia.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data figures and tables

### Extended Data Fig. 1 Summary of genomic assembly and repeat annotation.

a, Spectrum of 25-mers in Illumina sequencing data that shows the bimodal distribution that is characteristic of highly polymorphic species. b, Heat map showing k-mer decomposition (y axis) across GC content (x axis). Both peaks show comparable GC content, which is consistent with them representing haploid versus diploid k-mers. c, Flow chart of the steps followed to obtain the B. lanceolatum assembly. d, Repeat landscape and its evolutionary history, shown by the proportion of repetitive elements with a given divergence (K2P) to their consensus in the repeat library (repeatScout). e, Percentage of methylated CpG dinucleotides within repetitive elements, at three developmental stages and in the adult hepatic diverticulum. f, Distribution of average levels of 5mC of different repeat families. Colour key indicates the percentage of repeats in each family with corresponding levels of average methylation. g, Computational pipeline to identify long non-coding RNAs (lncRNAs). Categories: antisense, lncRNA overlaps with a protein-coding gene in the reverse strand; intragenic, lncRNA overlaps with a protein-coding gene in the same strand; bidirectional, within 1 kbp of a TSS of a protein-coding gene in the antisense strand, probably a product of a bidirectional promoter; intergenic, lncRNA does not overlap with any protein-coding gene. The total number in each category is indicated, with the number of those that are multi-exonic in parentheses. h, Quadruple conserved synteny between amphioxus and human. Top, amphioxus scaffold Sc0000001 aligned against the four human chromosomes with which it shares the highest number of orthologues (chr1, chr5, chr9 and chr19). In this scaffold, 277 out of 551 genes have clear orthologues in human, and 203 of these have orthologues on at least one of the four mentioned chromosomes. The black horizontal line represents the amphioxus scaffold, and each vertical coloured box an orthologous gene on the corresponding human chromosome. 
Bottom, modified view from Genomicus that is centred on the BL22073 gene and spans Sc0000001: 7,736,434–8,850,041. On the top line, each amphioxus gene with at least one orthologue in the nine reference species is represented with an oriented coloured box. Human genes located in the four ohnologous chromosomes are aligned underneath, in boxes of colours that correspond to those of their amphioxus pro-orthologues. The Genomicus server dedicated to amphioxus can be accessed at http://genomicus.biologie.ens.fr/genomicus-amphioxus.

### Extended Data Fig. 2 Dynamics of chromatin marks on APREs and reporter assays.

a, Summary of the zebrafish and medaka RNA-seq and ATAC-seq datasets generated for this study. Dashed lines indicate equivalent developmental stages in the two species, based on a previous study70. The number of biological replicates is indicated for each experiment. Zebrafish 24-hpf ATAC-seq data are from a previous study67. b, Cumulative distribution of the distance between CAGE-seq peaks and the closest annotated TSSs for genes with expression cRPKM > 5 in any of the samples covered by CAGE-seq (see Fig. 1a). Only CAGE-seq peaks within 1 kbp of an annotated TSS were tested (amphioxus, 10,435 peaks; zebrafish, 23,326 peaks; and mouse, 23,443 peaks). c, Cumulative distribution of distances between each APRE and the closest annotated TSS normalized by the average intergenic distance of the species (amphioxus, 83,471; zebrafish, 252,774; medaka, 174,139; and mouse, 216,857 APREs, as per Fig. 1c). d, Signal distribution of different marks within functional-genomic regions in amphioxus. log10 of read counts of H3K4me3, H3K27ac and ATAC-seq, and raw read counts of CAGE-seq in promoters of homology-supported, protein-coding genes (n = 26,501), other APREs (‘O. APREs’, all APREs that do not overlap a TSS from any gene model; n = 48,341), proximal APREs (n = 24,622), distal APREs (n = 11,881), previously validated enhancers (n = 43; Supplementary Table 9), random regions (n = 88,413) and negative regions (excluding ATAC-seq peaks, n = 88,413). For region designation, see Fig. 1c. For clarity, whiskers and outliers are not displayed. e, k-means clustering of APREs based on H3K27ac signal in three developmental stages. Cluster 1 and 3 APREs were considered as active and inactive, respectively. Average H3K27ac profiles are represented in the top panels. The number of APREs per cluster and stage are provided in Supplementary Data 2, dataset 8. f, Alluvial plot that shows the dynamics of each APRE among the clusters described in e.
APREs that remained active (cluster 1 in all stages) along the three developmental stages are represented in blue, constitutively inactive APREs (cluster 3 in all stages) in dark grey and dynamic APREs in red or orange (if inactivated or activated, respectively, during development). Five groups of APREs of special interest are highlighted with stronger colours and named GR1–GR5. g, Representative enriched DNA motifs found in each of the groups described in f. GR1 APREs were enriched in early motifs (for example, Smad3 and Oct4, Sox2 and Nanog); GR3 APREs in motifs of transcription factors involved in the generation of the three germ layers (for example, Foxo3, Sox6 and Sox17); GR4 APREs in tissue-specific transcription factors (for example, Foxa2, Otx2 and Crx); and GR5 APREs in CTCF and CTCF-like (BORIS) motifs. q values as provided by Homer. h, Lateral views of embryos from stable transgenic zebrafish lines at 24 hpf (except for Foxa-243, at 48 hpf) showing GFP expression driven by the amphioxus APREs listed in Supplementary Table 8 and highlighted in Supplementary Fig. 1. The number of independent founders with the same expression were as follows: Six1/2-182 (5/5), Foxa-243 (3/3), Foxa-251 (4/4), FoxC-3067 (6/6) and Pax1/9-157 (3/3). Midbrain expression corresponds to the positive-control enhancer included in the reporter constructs. EN, endoderm; HB, hindbrain; MY, myotomes; PA, pharyngeal arch; SC, spinal cord. Scale bar, 250 μm. i, Lateral views of transient transgenic amphioxus embryos, showing GFP expression driven by the APREs highlighted in Supplementary Fig. 1a, b (Foxa-251 (n = 46 out of 52) and Foxc-3067 (n = 27 out of 35), respectively) and in a previous study71 (Hox-1655, n = 72 out of 80). For each element, left panels correspond to 3D rendering from sub-stacks and right panels to z-stack sagittal sections. Scale bar, 50 μm. Anterior is to the left and dorsal to the top.

### Extended Data Fig. 3 Features of amphioxus promoters derived from CAGE-seq.

a–c, Heat maps showing AT and CG signal, nucleosome positioning (derived from the NucleoATAC signal), promoter width (interquantile (IQ) range), first exon length and YY1 (a) or TATA box (b, c) motifs around ubiquitous (a, n = 3,710), embryonic-specific (b, n = 1,451) and tissue-specific (c, n = 4,154) promoters, sorted by promoter width. Position 0 corresponds to the main TSS. d, Ubiquitous promoters show strong evidence for a nucleosome positioned downstream of the CAGE TSS, as judged from the 10-bp periodicity of W and S nucleotide density. e, Per cent of promoters of each category that have associated TATA box or YY1 motifs. Number of promoters: embryo, 1,451; female gonads, 1,494; hepatic, 2,420; neural tube, 1,734; and ubiquitous, 3,710. f, IQ width distribution of ubiquitous promoters (n = 3,710) with and without an associated YY1 motif. P value corresponds to two-sided Wilcoxon rank-sum tests.

### Extended Data Fig. 4 Characteristics and evolution of bidirectional promoters.

a, Number of bidirectional and non-bidirectional promoters identified for each regulatory category. P values correspond to two-sided Fisher’s exact tests against ubiquitous promoters. b, Distribution of distance between bidirectional promoters in each species (amphioxus, 1,975; zebrafish, 549; and mouse, 876 pairs of promoters). The distance between amphioxus peaks closely corresponds to integral nucleosome spacing. c, Heat maps of TA, CG and nucleosome occupancy (derived from the NucleoATAC signal) around bidirectional promoter pairs in amphioxus (n = 1,975), mouse (n = 876) and zebrafish (n = 549), arranged by the distance between the two CAGE TSSs. In amphioxus, both TA and NucleoATAC signals indicate regions in which 0, 1 or 2 nucleosomes separate promoters. d, Enriched GO terms for genes associated with bidirectional promoters in amphioxus. Uncorrected P values correspond to two-sided Fisher’s exact tests as provided by topGO. e, Inferred evolutionary dynamics of 372 putatively ancestral bidirectional promoters among chordate groups. Red, number of inferred losses and disentanglements; black, number of detected bidirectional promoters by CAGE-seq (in brackets) or microsynteny (neighbouring genes in a 5′ to 5′ orientation) for each species. In parentheses, number of lost and disentangled (red) or retained (black) bidirectional promoters when considering only the cases supported by CAGE-seq. f, In vertebrates, disentanglement was not accompanied by a general increase in the fraction of bidirectional promoters with antisense non-coding transcription, as shown by the relative number of CAGE clusters identified as bidirectional promoters that are composed of two protein-coding genes (‘Prot-Prot’) or of one protein-coding and one non-coding or non-annotated locus (‘Prot-NC’). The total number of uniquely annotated, protein-coding-associated CAGE promoters was amphioxus, 11,789; mouse, 13,654; and zebrafish, 14,014.

### Extended Data Fig. 5 5mC dynamics in amphioxus.

a, 5mC levels across gene bodies (n = 20,569) from different expression deciles (0th, not expressed; 10th, highest expression). TTS, transcription termination site. b, Scatter plots of levels of 5mC and CpG density, H3K4me3, H3K27me3 and H3K27ac in 1-kbp genomic bins sorted on the basis of feature rank. The red line tracks anti-correlation between feature density and rank number (a low rank number implies high feature density). The golden line represents a smoothing spline of 5mC signal versus feature rank number. Pearson correlation coefficients (R) are displayed in the top right corner of each panel. c, UCSC browser excerpt of 5mC patterns for selected regions. d, Percentage of methylated CpG dinucleotides in 8-hpf (n = 19,657,388), 15-hpf (n = 21,247,615), 36-hpf (n = 21,702,000) and hepatic (adult, n = 19,240,245) amphioxus samples. Black line indicates the fraction between methylated and non-methylated CpGs at each stage. e, Box plots of average 5mC levels in different types of differentially methylated regions (DMRs) at each stage. ΔmCG denotes the change in the fraction of methylated CpGs between the two stages used for identification of DMRs (red (hyper) and blue (hypo) boxes). The number of DMRs were as follows: 8 hpf(+)–15 hpf(−), 768; 8 hpf(−)–15 hpf(+), 701; 15 hpf(+)–36 hpf(−),1,066; 15 hpf(−)–36 hpf(+), 1,025; 36 hpf(+)–liver(−), 22,333; and 36 hpf(−)–liver(+), 4,154. The coordinates for all DMRs are provided in Supplementary Data 2, dataset 11. f, Distribution of DMR sizes (in bp). g, Genomic distribution of DMRs identified for each sample. ‘Other trans.’, DMRs that overlap with gene models that were not defined as being supported by orthology. h, Expression (cRPKMs) of the amphioxus Tet orthologue in embryos and adult tissues. Error bars represent standard error of the mean (the number of replicates for each RNA-seq dataset is provided in Fig. 1a).

### Extended Data Fig. 6 Developmental 5mC dynamics at APREs in amphioxus.

a, k-means clustering (n = 2) of 5mC signal over embryo-specific open-chromatin regions (that is, APREs), assessed by ATAC-seq (Supplementary Table 10). b, The most significantly enriched transcription-factor binding-site motifs in APREs that display different developmental 5mC patterns in Fig. 2b. Uncorrected P values as provided by MEME. All plotted motifs had Benjamini-corrected q values of 0. c, GO enrichment for genes associated with cluster 1 (top) or cluster 2 (bottom) APREs from Fig. 2b. Uncorrected P values correspond to two-sided Fisher’s exact tests as calculated by topGO. d, Distribution of expression values (cRPKMs) across all samples for genes associated with cluster 1 (top, n = 1,114) or cluster 2 (bottom, n = 1,594) APREs from Fig. 2b. e, Distribution of the coefficients of variation for genes associated with cluster 1 or cluster 2 APREs from Fig. 2b, as well as all (n = 19,710), trans-dev (n = 357) and house-keeping (n = 862) amphioxus genes. f, Example of a potentially conserved (zebrafish to amphioxus) DMR associated with yap1, a major transcription factor of the Hippo pathway. The inset corresponds to the region highlighted in green. The two ohnologous genomic regions in zebrafish are shown in Supplementary Fig. 2. Additional cases included genes that contained APREs that are likely to regulate neighbouring liver-specific genes (‘bystander’ genes) (Supplementary Table 11). The number of replicates for each experiment displayed in each track is provided in Fig. 1a.

### Extended Data Fig. 7 Periods of maximal transcriptomic similarity across chordate development.

a, Stages of minimal transcriptomic distance obtained in the comparison between amphioxus and zebrafish for four alternative distance methods (Euclidean, Manhattan and Jensen–Shannon distances, and Spearman correlation). Values are normalized to minimal (0) and maximal (1) for each metric. b, Stages of minimal transcriptomic divergence shown as the smallest Jensen–Shannon distance between zebrafish stages and four chordate species. The shaded area surrounding the line that connects the stages is the standard deviation, derived from 100 bootstrap replicates of the orthologous gene set. The grey box outlines the ‘phylotypic’ period of minimal divergence; the corresponding periods are indicated for each species as the range provided by the two closest stages. c, d, Heat maps of pairwise transcriptomic distances (Jensen–Shannon distance metric) between pairs of chordate species, amphioxus and frog (c), and zebrafish and frog (d). In both heat maps, the smallest distance (red) indicates maximal similarity of the transcriptome. The periods of minimal divergence of the transcriptome are earlier for the amphioxus–frog comparison than for the zebrafish–frog comparison.
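As a rough illustration of the Jensen–Shannon distance used in panels b–d, the following minimal Python sketch compares one expression profile against a set of reference stages and reports the closest one. The function names, the toy values and the choice to normalize each profile to sum to 1 are assumptions made for illustration; the exact preprocessing used for the figure is described in the Methods and is not reproduced here.

```python
import math


def jensen_shannon_distance(p_raw, q_raw):
    """Jensen-Shannon distance (square root of the base-2 divergence)."""
    if len(p_raw) != len(q_raw):
        raise ValueError("profiles must have the same length")
    p_sum, q_sum = sum(p_raw), sum(q_raw)
    if p_sum <= 0 or q_sum <= 0:
        raise ValueError("each profile needs a positive total")
    # Assumption: profiles are rescaled to probability distributions.
    p = [x / p_sum for x in p_raw]
    q = [x / q_sum for x in q_raw]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; zero entries of `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def closest_stage(query, reference_stages):
    """Return (stage name, distance) of the reference stage nearest to `query`."""
    distances = {
        name: jensen_shannon_distance(query, profile)
        for name, profile in reference_stages.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical expression values for three orthologous genes.
reference = {"early": [10, 1, 1], "mid": [3, 3, 3], "late": [1, 1, 10]}
print(closest_stage([9, 1, 2], reference))
```

With base-2 logarithms the distance is bounded between 0 (identical relative profiles) and 1 (no shared expressed genes), which makes values comparable across species pairs.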

### Extended Data Fig. 8 Comparison of temporal gene expression profiles in amphioxus and zebrafish.

a, Heat map showing the significance of orthologous gene overlap between Mfuzz clusters across eight matched developmental stages in amphioxus and zebrafish as derived from an upper-tail hypergeometric test. Some clusters with highly significant overlap are highlighted, and their corresponding temporal expression profiles are shown. The profiles of all clusters for the two species are included in Supplementary Figs. 3, 4. Exact P values and sample sizes are provided in Supplementary Data 2, dataset 8. b, Distributions of NACC values for orthologous genes (in red) or random orthology assignments (blue) for each species against human. Lower NACC values imply higher conservation of relative expression. Solid lines show the median, and the dashed lines mark the interquartile range. The numbers of orthologue pairs were as follows: mouse, 15,109; zebrafish, 16,480; and amphioxus, 8,633. c, Differentially enriched GO terms among pairs of zebrafish and amphioxus Mfuzz clusters with significant orthologue overlap (P < 10^−10, upper-tail hypergeometric test) with homochronic (48 pairs) and heterochronic (35 pairs) patterns. The GO enrichment of a group was calculated as the number of cluster pairs with significant enrichment for that given term (Supplementary Data 2, dataset 12). d, Top, per cent of zebrafish genes from each developmental pathway we studied, based on the temporal similarity of their corresponding Mfuzz cluster (homochronic, heterochronic or intermediate). Only genes belonging to clusters with significant orthologue overlap were analysed; the number of genes is provided in parentheses below the pathway name. Bottom, pairwise comparisons between developmental pathway distributions. P values correspond to Bonferroni-corrected, two-sided, three-way Fisher’s exact tests.
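The upper-tail hypergeometric test referred to in panels a and c can be sketched as follows. This is a minimal standard-library illustration, not the code used for the figure: the function name, argument names and the example counts are hypothetical, and the gene universe is assumed to be the set of orthologues shared by the two species.

```python
from math import comb


def hypergeom_upper_tail(overlap, cluster_a, cluster_b, universe):
    """P(X >= overlap) for the overlap of two clusters drawn from one universe.

    cluster_a: number of orthologues in the first species' cluster
    cluster_b: number of orthologues in the second species' cluster
    universe:  total number of orthologues considered (assumed background)
    """
    if not 0 <= overlap <= min(cluster_a, cluster_b):
        raise ValueError("overlap must lie between 0 and the smaller cluster size")
    if max(cluster_a, cluster_b) > universe:
        raise ValueError("clusters cannot be larger than the universe")
    total = comb(universe, cluster_b)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(cluster_a, k) * comb(universe - cluster_a, cluster_b - k)
        for k in range(overlap, min(cluster_a, cluster_b) + 1)
    )
    return tail / total


# Hypothetical counts: 40 shared orthologues between clusters of 120 and 150
# genes, out of 5,000 orthologues in total.
print(hypergeom_upper_tail(40, 120, 150, 5000))
```

Exact integer arithmetic via `math.comb` avoids the rounding problems of summing very small floating-point terms, at the cost of speed for very large universes.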

### Extended Data Fig. 9 Higher regulatory content in vertebrate genomes.

a, Distribution of the number of APREs per the regulatory landscape of a gene (as determined by GREAT26), at different developmental stages or cell lines of four chordate species (amphioxus, zebrafish, medaka and mouse). Orthologous gene families are split according to the number of ohnologues that are retained per family (from 1 to 4, using mouse as a reference species for the ohnologue counts). The percentage of developmental regulatory genes (trans-dev, TD) in each category is indicated. b, P values of one-sided Mann–Whitney U tests against the amphioxus peak-number distribution using 100% of the minimum read coverage for different levels of down-sampling of the zebrafish and medaka samples. c, Distribution of the number of APREs in the GREAT region of the gene, called after down-sampling the reads of the two vertebrate samples to different fractions of the sample with the minimum effective coverage in our study (~21 reads per kbp for the 36-hpf sample in amphioxus). Asterisks correspond to the significance of the P values of Mann–Whitney U tests against the amphioxus peak-number distribution using 100% of the minimum-read coverage. The number of genes per box was as follows: amphioxus, 20,569; zebrafish, 20,053; and medaka, 15,978. d, As in a, but with gene families separated according to functional categories (housekeeping, trans-dev and others). e, Number of APREs per regulatory landscape determined using 4C-seq, for 58 members of 11 trans-dev families. The number of genes probed in each species is indicated on the x axis. f, Distribution of the length of the intergenic regions from the genes plotted in a for the indicated stages. g, Distributions of GREAT-region sizes (left) and number of APREs per gene (right) for a subset of 10,186 pairs of genes with matched GREAT-region size distributions (±500 bp) in amphioxus and zebrafish. 
h, Distributions of intergenic-region sizes (left) and number of APREs per gene (right) for a subset of 13,941 pairs of genes with matched intergenic-region size distributions (±500 bp) in amphioxus and zebrafish. P values correspond to Mann–Whitney U tests: *0.05 > P value ≥ 0.01, **0.01 > P value ≥ 0.001, ***P value < 0.001. In a and d, all comparisons between each distribution of a vertebrate species and the equivalent distribution in amphioxus produced significant P values (P value < 0.001); for simplicity, in these panels asterisks are not shown. Exact P values and sample sizes are provided in Supplementary Data 2, dataset 8.

### Extended Data Fig. 10 Regulatory evolution after vertebrate WGD.

a, b, For each mouse (a) or frog (b) gene, the number of positive-expression domains across nine equivalent samples is subtracted from the number of domains in which the single amphioxus orthologue is expressed. The distribution of the difference in domains between the amphioxus and the vertebrate species is plotted for 1-to-1 orthologues (2,450 and 2,484 gene pairs for mouse and frog, respectively; yellow), individual ohnologues (3,011 and 2,637 gene pairs in 1,212 and 1,094 families for mouse and frog, respectively; lilac) and the union of all vertebrate ohnologues in a family (purple). Bottom left, log2 of the ratio between the sum of all mouse (a) or frog (b) genes with negative versus positive scores for each orthology group. ‘Sum’ (black), binarization of family expression is performed after summing the raw expression values for all ohnologues. c–e, Density scatter plots of the τ values for pairs of mouse (c, n = 1,502), frog (d, n = 1,495) or zebrafish (e, n = 1,498) and amphioxus orthologues from multi-gene families in vertebrates. f, g, Number of ohnologues with strong specialization (≤2 remaining expression domains) in mouse (f) or frog (g) expressed in each tissue or developmental stage. h, i, Representative in situ hybridization assays in zebrafish embryos for different members of specialized families (right) and for the single amphioxus orthologue (left) (Chordc1 and Itgb1bp2 (h) and Rab11 (i)). Zebrafish image data for this paper were retrieved from the Zebrafish Information Network (ZFIN), University of Oregon, Eugene, OR 97403-5274; (http://zfin.org/, accessed May 2018) and are used with the permission of B. Thisse. Amphioxus in situ hybridization was performed once using 10 embryos per probe, all of which showed the same expression pattern. j, Distribution of the dN/dS ratio between human and mouse for different classes of ohnologues based on their fate after WGD.
k, l, Distribution of the percentage of nucleotide sequence similarity (k) or dN/dS ratio (l) between human and mouse for ohnologues grouped by the number of expression domains lost. m, Distribution of the number of APREs within GREAT regions for zebrafish ohnologues grouped by the number of expression domains lost. P values in j–m correspond to Wilcoxon rank-sum tests. *0.05 > P value ≥ 0.01; **0.01 > P value ≥ 0.001; ***P value < 0.001.
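The bookkeeping behind panels a–e can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the analysis code: the expression cutoff of 5 (in cRPKM-like units) is a hypothetical threshold for calling a domain positive, τ is assumed to be the widely used tissue-specificity index τ = Σ(1 − x_i/x_max)/(n − 1), and all function names and values are invented for the example. The sign convention follows the legend: vertebrate domains are subtracted from amphioxus domains.

```python
def expressed_domains(profile, cutoff=5.0):
    """Samples in which expression exceeds the (hypothetical) cutoff."""
    return {sample for sample, value in profile.items() if value > cutoff}


def domain_difference(amphioxus, vertebrate, cutoff=5.0):
    """Amphioxus domain count minus vertebrate domain count."""
    return len(expressed_domains(amphioxus, cutoff)) - len(
        expressed_domains(vertebrate, cutoff)
    )


def union_domains(ohnologues, cutoff=5.0):
    """Domains covered by at least one ohnologue of a family."""
    domains = set()
    for profile in ohnologues:
        domains |= expressed_domains(profile, cutoff)
    return domains


def tau_index(values):
    """Tissue-specificity index: 0 = uniform expression, 1 = a single domain."""
    values = list(values)
    if len(values) < 2:
        raise ValueError("tau needs at least two samples")
    x_max = max(values)
    if x_max <= 0:
        raise ValueError("tau is undefined when the gene is not expressed")
    return sum(1 - x / x_max for x in values) / (len(values) - 1)


# Hypothetical family: one amphioxus gene and two vertebrate ohnologues.
amphioxus = {"brain": 20, "muscle": 15, "gut": 30}
ohnologue_a = {"brain": 25, "muscle": 0, "gut": 1}
ohnologue_b = {"brain": 0.5, "muscle": 18, "gut": 2}

print(domain_difference(amphioxus, ohnologue_a))          # per-ohnologue score
print(sorted(union_domains([ohnologue_a, ohnologue_b])))  # family-level union
print(tau_index(ohnologue_a.values()))                    # close to 1: restricted
```

In this toy family each ohnologue keeps a single domain, while their union still falls short of the amphioxus pattern, which is the kind of outcome the legend summarizes as specialization rather than subfunctionalization.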

## Supplementary information

### Supplementary Information

This file contains Supplementary Text and Data, Supplementary References, Supplementary Figures 1-9 and full guides for Supplementary Datasets 1-17 and Supplementary Tables 1-19.

### Supplementary Data 1

This file contains the detailed annotation and expression patterns for each module of co-regulated genes identified by WGCNA in amphioxus and zebrafish. Section descriptions: 1 - Amphioxus and zebrafish module annotation and comparisons. a, Name assigned to each module (color) based on gene expression and/or GO enrichment. b, Same heatmaps as in Fig. 4a,b including the names of each module. 2 - Clustered heatmap of TF-motifs vs amphioxus & zebrafish modules. a, WGCNA modules from the two species are plotted against all motif clusters. The values visualized are the z-scores of each motif in each module. Modules and motifs are clustered based on the correlation of the visualized z-scores. b, Key for TF motif super-families in a, ordered by size of the motif name. 3 - Amphioxus module RNA-seq expression and GO terms. For each amphioxus module, boxplots without whiskers showing the median and interquartile range of gene expression levels (using the cRPKM metrics) across RNA-seq samples (top), and significantly enriched GO categories (bottom). P-values correspond to uncorrected p-values from two-sided Fisher's exact tests as calculated by topGO. Number of genes per module is provided in SI Dataset 8. 4 - Zebrafish module RNA-seq expression and GO terms. For each zebrafish module, boxplots without whiskers showing the median and interquartile range of gene expression levels (using the cRPKM metrics) across RNA-seq samples (top), and significantly enriched GO categories (bottom). P-values correspond to uncorrected p-values from two-sided Fisher's exact tests as calculated by topGO. Number of genes per module is provided in SI Dataset 8.

### Supplementary Data 2

This zipped file contains Supplementary Datasets 1-17 – see Supplementary Information document for a full Supplementary Dataset guide.

### Supplementary Tables

This file contains Supplementary Tables 1-19 – see Supplementary Information document for a full Supplementary Table guide.

## Rights and permissions

Marlétaz, F., Firbas, P.N., Maeso, I. et al. Amphioxus functional genomics and the origins of vertebrate gene regulation. Nature 564, 64–70 (2018). https://doi.org/10.1038/s41586-018-0734-6

• DOI: https://doi.org/10.1038/s41586-018-0734-6

### Keywords

• Amphioxus
• Ancestral Expression
• Whole-genome Duplication (WGD)
• Phylotypic Period
• Branchiostoma lanceolatum
