Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# Reconstruction of proto-vertebrate, proto-cyclostome and proto-gnathostome genomes provides new insights into early vertebrate evolution

### Subjects

A Publisher Correction to this article was published on 29 July 2021

## Abstract

Ancient polyploidization events have had a lasting impact on vertebrate genome structure, organization and function. Some key questions regarding the number of ancient polyploidization events and their timing in relation to the cyclostome-gnathostome divergence have remained contentious. Here we generate de novo long-read-based chromosome-scale genome assemblies for the Japanese lamprey and elephant shark. Using these and other representative genomes and developing algorithms for the probabilistic macrosynteny model, we reconstruct high-resolution proto-vertebrate, proto-cyclostome and proto-gnathostome genomes. Our reconstructions resolve key questions regarding the early evolutionary history of vertebrates. First, cyclostomes diverged from the lineage leading to gnathostomes after a shared tetraploidization (1R) but before a gnathostome-specific tetraploidization (2R). Second, the cyclostome lineage experienced an additional hexaploidization. Third, 2R in the gnathostome lineage was an allotetraploidization event, and biased gene loss from one of the subgenomes shaped the gnathostome genome by giving rise to remarkably conserved microchromosomes. Thus, our reconstructions reveal the major evolutionary events and offer new insights into the origin and evolution of vertebrate genomes.

## Introduction

The emergence of morphologically complex vertebrates from invertebrate chordates is considered a major evolutionary transition that led to the emergence of more than 70,000 vertebrate species (http://vgpdb.snu.ac.kr/splist), including humans. The common ancestor of vertebrates that originated during the Lower Cambrian1 diverged to give rise to the two extant lineages of vertebrates, the cyclostomes (jawless vertebrates) and gnathostomes (jawed vertebrates). Cyclostomes are a monophyletic group2 comprising lampreys and hagfishes, while gnathostomes include cartilaginous fishes (Chondrichthyes, represented by chimaeras, sharks and rays) and bony vertebrates (Osteichthyes, represented by ray-finned fishes and lobe-finned fishes, including tetrapods). Cyclostomes are sometimes thought to be morphologically primitive as compared to gnathostomes as they lack hinged jaws, paired appendages and nostrils, mineralized tissue, and a discrete pancreas3,4,5. However, recent studies suggest that the lamprey and hagfish lineages independently acquired their seemingly simplified as well as their specialized morphology, and that the ancestral cyclostome already had a complex morphology and physiology distinct from the gnathostome lineage6,7. For example, although cyclostomes lack the major histocompatibility complex and immunoglobulin-based adaptive immune system (AIS) of gnathostomes, they have independently evolved somatically diversifying variable lymphocyte receptors for antigen recognition8.

Evolutionary innovations at the origin of vertebrates have been proposed to be the result of ancient tetraploidization events that generated additional copies of the entire genome9,10. This view is now widely accepted because genome-wide synteny and paralogy analyses11,12,13,14,15 have provided convincing evidence for two rounds of tetraploidization (known as 1R and 2R, respectively) during early vertebrate evolution (see for review refs. 16,17; see also refs. 18,19). However, the timing of 1R and 2R relative to the cyclostome–gnathostome divergence has remained contentious—a significant gap in our knowledge considering the important implications for the genetic basis of the shared and derived features of these two lineages. Previous studies have produced conflicting results supporting each of the three possibilities18,19,20,21,22,23,24,25,26, i.e. divergence occurring prior to 1R, between 1R and 2R, and after 2R (Fig. 1). This uncertainty has been further compounded by the discovery of six Hox clusters in both lampreys and hagfish19,22,27 compared to four clusters in most gnathostome lineages, suggesting the possibility of an additional tetraploidization or chromosome-scale segmental duplications in the cyclostome ancestor18,19,22.

Resolving these alternative scenarios using gene trees has proved to be challenging. This is partly due to the presence of multiple ‘ohnologues' (paralogous genes generated by polyploidy) created by successive rounds of tetraploidization; lineage-specific secondary losses of some ohnologues28,29; as well as the confounding effects of asymmetric evolutionary divergence between ohnologues30. The possibility of delayed rediploidization after a tetraploidization event28,31,32 has further complicated the interpretation of gene trees, as it uncouples gene duplication time from the divergence time of the ohnologues. In addition, the tendency of lamprey ohnologues to cluster outside gnathostome gene clades due to high GC-content and consequent codon bias22,26,33 has impeded the use of gene trees for determining the timing of 1R and 2R.

An alternative and more effective strategy for the identification of ancient polyploidy is the macrosynteny-based reconstruction of ancestral genomes12,13,14,15. In particular, this strategy has the potential to reveal chromosome fusion/fission events that occurred in the interval between 1R and 2R and/or after 2R13,15. Whether such genome rearrangements are shared by cyclostomes and gnathostomes would be potentially informative for determining the timing of the cyclostome–gnathostome divergence in relation to 1R and 2R. A prerequisite for such comparisons is the high-resolution reconstruction of the proto-cyclostome and the proto-gnathostome genomes which require high-quality, chromosome-scale genome assemblies from the most basal vertebrate lineages such as cyclostomes and cartilaginous fishes.

In the present study, we generate de novo chromosome-scale genome assemblies of a cyclostome, the Japanese lamprey (Lethenteron japonicum; also known as the Arctic lamprey Lethenteron camtschaticum) and a cartilaginous fish, the elephant shark (Callorhinchus milii), based on long single-molecule reads and chromatin conformation capture (Hi-C) data. These two species represent two crucial divergence points in the evolution of vertebrates (Fig. 1). We use our recently developed probabilistic macrosynteny model34 to reconstruct the proto-vertebrate and proto-cyclostome genomes. The major advantage of our method is that it has a high tolerance to reconstruction uncertainty caused by small-scale rearrangements that have accumulated over a long evolutionary time34. Using our strategy, we are able to reconstruct the proto-cyclostome genome, in which we integrate information from the Japanese lamprey genome, the sea lamprey genome and the Pacific lamprey linkage markers19. In addition, using the elephant shark genome, we reconstruct the proto-gnathostome genome with a higher coverage of extant gnathostome genomes than previous reconstructions (including 19,343 human genes as compared to 12,137 human genes in ref. 13, and 8,434 human genes in ref. 15

Our high-resolution reconstructions resolve the number and timing of polyploidization events during early vertebrate evolution and provide new insights into the genetic basis underlying evolutionary innovations during the origin of early vertebrates. In addition, our reconstructions serve as a reliable reference for accurate annotation of ohnologues, which will be especially important for ohnologues with low sequence similarity30 which are difficult to identify by standard approaches.

## Results

### Genome sequencing, assembly and annotation

We generated de novo chromosome-scale genome assemblies for elephant shark and Japanese lamprey using a combination of PacBio single-molecule real-time (SMRT) sequencing (68- and 87-fold coverage, respectively), and ‘Chicago’35 and Hi-C data aided scaffolding (see Supplementary Note 1). The resultant genome assemblies of elephant shark and Japanese lamprey span 991 Mb (N50 contig, 1.6 Mb and N50 scaffold, 69 Mb) and 1.07 Gb (N50 contig, 1.6 Mb and N50 scaffold, 10.7 Mb), respectively. These assemblies contain a substantially higher amount of repetitive sequences (42 and 50%) compared to the previous short-read assemblies of elephant shark (28%)36 and Japanese lamprey (21%)22, presumably due to the higher contiguity of the long-read assemblies. Using the MAKER pipeline (v2.31.8)37 and evidence-based and ab initio gene predictions, we predicted 18,747 protein-coding genes in the elephant shark genome assembly and 19,455 protein-coding genes in the Japanese lamprey genome assembly, respectively.

### Reconstruction and validation of the proto-vertebrate genome

We reconstructed the proto-vertebrate genome structure by employing the probabilistic macrosynteny model34 and comparing the Japanese lamprey, sea lamprey (Petromyzon marinus)19, amphioxus (Branchiostoma floridae)14, and four gnathostome genomes including human, chicken38, spotted gar39 and elephant shark (see ‘Methods'). In our reconstruction procedure, the lamprey genomes were partitioned into segments of conserved macrosynteny (191 Japanese lamprey segments and 198 sea lamprey segments), where synteny breakpoints were detected using the Japanese lamprey, sea lamprey, elephant shark, spotted gar, chicken and human genomes. Then, our Bayesian inference algorithm reconstructed the proto-vertebrate genome, assuming that individual proto-vertebrate chromosomes (Pvcs) have distinct orthologue distributions over the lamprey segments34. The reconstructed proto-vertebrate genome comprises 18 putative chromosomes (designated as Pvc1–18, with Pvc18 exhibiting only weak macrosynteny conservation in the amphioxus genome) and is largely consistent with previous reconstructions with 17 putative chromosomes14,15 (Supplementary Fig. 3 and Supplementary Table 9 in Supplementary Note 3).

As a validation of our reconstruction we examined conserved macrosynteny with representative invertebrate and gnathostome genomes including the scallop Chlamys farreri40, the placozoan Trichoplax adhaerens41, human and elephant shark (Fig. 2, also see Supplementary Fig. 4). These lineages have been shown to possess relatively slow rates of genome structure evolution12,36,41,42,43. We therefore expect that a reliable reconstruction should show a highly non-random distribution of orthologues in these genomes. Indeed, we noted that the orthologues are not randomly scattered throughout the modern genomes, but are clustered into a small number of chromosomes (evident as concentration of blue dots in Fig. 2). The fact that we find strong macrosynteny conservation in these invertebrate genomes, which were not used in the proto-vertebrate reconstruction, supports the validity of our reconstruction and indicates that all 18 reconstructed chromosomes existed as separate chromosomes in early metazoan lineages.

### Reconstruction of the proto-cyclostome chromosomes and evidence for sixfold duplication of the genome

The generation of a long-read-based high-quality genome assembly for the Japanese lamprey, in addition to the existing ‘hybrid’ genome assembly of the sea lamprey19, permitted us to investigate unresolved issues in cyclostome genome evolution. In particular, the evolutionary steps between the proto-vertebrate and proto-cyclostome genomes have remained contentious, even after the sequencing of the sea lamprey genome18,19,26. For example, the presence of six Hox clusters in two species of lampreys and the inshore hagfish could be due to more than two rounds of tetraploidization (S5 in Fig. 1); alternatively, they could be the result of a single tetraploidization event followed by chromosome duplication events (S8 in Fig. 1). Another possibility is that the cyclostome lineage experienced a hexaploidization event (whole-genome triplication) in addition to a tetraploidization (whole-genome duplication) event (S6 in Fig. 1). These alternative evolutionary models have been discussed in previous studies18,19,22,26 but remained unresolved even with the chromosome-level assembly of the sea lamprey genome.

In the present study, we have generated the first reconstruction of the proto-cyclostome genome by combining lamprey segments (described in the previous section) into 104 proto-cyclostome chromosomes (see ‘Methods' for details). Our algorithm enumerates possible combinations of lamprey segments (see Supplementary Movie 1), and reconstructs proto-cyclostome chromosomes by choosing the combination with the most significant (i.e. non-random) distribution of paralogues and orthologues (partly illustrated in Fig. 3a–c). Importantly, the algorithm explores all alternative models including segmental duplications, chromosome duplications/losses, tetraploidization and hexaploidization, under the assumption that duplicated chromosomes share significantly large numbers of paralogues. The major advantage of this reconstruction method is its robustness against lineage-specific rearrangements and fragmentation of genome assemblies. For example, Japanese lamprey Scaffold2 was partitioned into two segments (Fig. 3a) because each of the segments showed conserved synteny with two different sea lamprey scaffolds; in our reconstruction (Fig. 3b), the two segments on Scaffold2 were assigned to different proto-cyclostome chromosomes because they share a significantly large number of paralogues (dots in Fig. 3c). Thus, our reconstruction-based analysis is more reliable than scaffold-based analyses used in previous studies18,19,26 and provides the first opportunity to conclusively resolve the controversy over the origin of the proto-cyclostome genome.

To distinguish between alternative polyploidization models (i.e. S5–S8 in Fig. 1), we followed ref. 13 and used a measure we have called multiplicity, i.e. the number of proto-cyclostome chromosomes originating from individual proto-vertebrate chromosomes (Fig. 3d), and counted the numbers of Japanese lamprey genes that map to these chromosomes. If the proto-cyclostome genome was shaped by three rounds of tetraploidization (S5 in Fig. 1), it should be covered by chromosomes of multiplicity eight. Instead, if it experienced a single tetraploidization with subsequent chromosomal duplications (S8 in Fig. 1), the multiplicity should peak at two with gradual decrease toward larger multiplicities. The third possibility is that if the genome went through a single tetraploidization and a hexaploidization (genome triplication) (S6 in Fig. 1) the majority of the genome should be covered by chromosomes of multiplicity six. Our analysis indicates that 9 out of the 18 proto-vertebrate chromosomes were duplicated into six paralogous proto-cyclostome chromosomes, and that the majority (60.3%) of the proto-cyclostome genome was covered by the sixfold duplicated chromosomes. In addition, we confirmed by statistical test (see ‘Methods') that the observed peak of multiplicity (Fig. 3d) is unlikely to have been created by accumulation of chromosome scale or segmental duplications after one ($$P\, < \, 4\times {10}^{-5}$$) or two ($$P\, < \, 0.05$$) tetraploidization events. Thus, the clear peak at multiplicity of six is compelling evidence of sixfold duplication of the entire genome, probably through a tetraploidization and a hexaploidization event.

Although the current lamprey genomes might still be incomplete and some chromosomes might be fragmented, such limitations are unlikely to have substantially biased our analysis. First, if the proto-cyclostome genome was shaped by three rounds of tetraploidization, that would additionally require a large number of subsequent chromosome fusions to explain the current genome arrangement (e.g., 45 post-tetraploidization fusions are required to obtain the chromosome number of sea lamprey germline cells: 18 × 8−45 = 99). However, we found that the lamprey lineage had remarkably low rates of inter-chromosomal rearrangement (Supplementary Fig. 5) over 500 million years44 of cyclostome evolution. Specifically, our proto-cyclostome genome reconstruction shows large-scale fusions and translocations affecting only 22 out of 141 Japanese lamprey scaffolds and only 19 out of 151 sea lamprey scaffolds that have at least 10 genes. The exceptionally low rate of inter-chromosomal rearrangement and the haploid chromosome number of 99 in the germline sea lamprey genome45 are consistent with our evolutionary scenario in which the lamprey chromosome number is explained approximately as 18 × 6 = 108 with several subsequent fusions. Second, even though some tiny chromosomes might be missing in the current proto-cyclostome reconstruction, large chromosomes (e.g. Hox-bearing chromosomes duplicated from Pvc1) are unlikely to be missing entirely; therefore, our reconstruction is particularly reliable for the largest five proto-vertebrate chromosomes (i.e. Pvc1, 3, 10, 13 and 17), which consistently exhibited a multiplicity of six. Thus, the high coverage (60.3%) of the Japanese lamprey genome by sixfold duplicated proto-cyclostome chromosomes suggests that extant cyclostome genomes are paleo-dodecaploids (i.e. the chromosome number increased as 18 × 6 due to tetraplodization and hexaploidization), which might be similar to the situation in sturgeon where a species (Acipenser brevirostrum) with 180 chromosomes is considered to be a hexaploid of a tetraploid ancestor with 60 chromosomes46,47,48.

### Proto-gnathostome genome and the origin of microchromosomes

Previous reconstructions of the proto-gnathostome genome12,13,14,15 included members of only bony vertebrates (Osteichthyes) and lacked representatives of its sister group, the cartilaginous fishes (Chondrichthyes). Here, we produced a substantially improved reconstruction of the proto-gnathostome genome structure with a higher coverage of modern genomes by taking advantage of our newly sequenced, chromosome-scale genome assembly of the elephant shark, in addition to the spotted gar, zebra finch, turkey, chicken, opossum, dog, mouse and human genomes (see ‘Methods', Fig. 4 and Supplementary Fig. 6). The reconstruction provided additional support for the previous finding of two rounds of tetraploidization between the proto-gnathostome and its invertebrate ancestor11,12,13,14,15.

Analysis of this proto-gnathostome genome also revealed the origin of microchromosomes found in some modern gnathostomes. Microchromosomes are tiny chromosomes (typically smaller than 20 Mb), characterized by high GC-content, high gene density and high recombination rate38,49. Although there are no microchromosomes in the human genome, they are present in other tetrapod lineages such as birds and reptiles. Whether microchromosomes were recently created by chromosome fission or were present in the gnathostome ancestor has been controversial (see Supplementary Note 4 for a short review). Although several recent studies supported the ancient origin of microchromosomes13,36,39,49,50,51,52, it was still unknown (1) if chromosomal features characteristic to modern avian microchromosomes (i.e. high GC-content, high gene density and high recombination rate) were already present in the ancestral gnathostome genome (cf. the chromosomal features were previously reported to be conserved between the spotted gar and chicken genomes39), and (2) why microchromosomes have been conserved in distantly related gnathostome species such as the chicken, spotted gar and elephant shark.

In the present study, our reconstruction shows that at least 15 proto-gnathostome chromosomes have remained intact as microchromosomes in some modern gnathostome genomes such as chicken, spotted gar and elephant shark (Supplementary Fig. 7) even after 450 million years of gnathostome evolution44,53. Furthermore, we observed that specific sequence features (namely, chromosome length and gene density) are shared by modern gnathostome chromosomal regions that were derived from such proto-gnathostome chromosomes (Fig. 5). First, the total length of segments originating from individual proto-gnathostome chromosomes is highly conserved in chicken, spotted gar and elephant shark, suggesting that the ancestral gnathostome already possessed the tiny microchromosomes and the large macrochromosomes (Fig. 5a). Second, smaller proto-gnathostome chromosomes tend to have higher gene densities in all species, suggesting that the ancestral gnathostome genome consisted of small chromosomes with high gene densities and large chromosomes with low gene densities (Fig. 5b). Third, smaller proto-gnathostome chromosomes tend to have higher ohnologue densities in individual species, suggesting that the ancestral gnathostome genome had small chromosomes with high ohnologue densities (Fig. 5c). These observations suggest that many of the proto-gnathostome chromosomes might have already exhibited distinctive features (e.g. diminutive chromosomes with high gene density) that are considered characteristics of avian microchromosomes38,49. Thus, the proto-gnathostome lineage might have already possessed many microchromosomes with high gene density, many of which are still retained in several modern gnathostome genomes due to low rates of inter-chromosomal rearrangement. On the other hand, macrochromosomes, large genome sizes and high rates of rearrangement are likely to be derived characteristics of lineages that experienced substantial expansion of repetitive sequences.

The persistence of intact microchromosomes in modern gnathostome genomes is intriguing, and raises questions about the possible mechanism and evolutionary forces maintaining them over such a long evolutionary time49. One possibility is the presence of a high density of genomic regulatory blocks (GRBs) comprising long-range interacting regulatory elements and/or topologically associating domains (TADs) that require long-range linkage to be maintained intact. To test this possibility we analysed the density of GRBs and TADs54 and observed no obvious difference between macrochromosomes and microchromosomes (see Supplementary Fig. 8 and Supplementary Note 4). An alternative possibility is that the persistent synteny conservation is a by-product of the small size and high gene density of microchromosomes49, which is corroborated by previous arguments that gene density and ohnologue density are major factors in decreasing the rates of evolutionary breakage55 and inter-chromosomal rearrangement56, respectively. Consistent with this hypothesis, we find evidence for high density of genes (including ohnologues, which were identified with the method described in Supplementary Note 2) in the proto-gnathostome chromosomes that gave rise to the modern microchromosomes (Fig. 5b, c).

### Timing of gnathostome–cyclostome divergence relative to 1R and 2R

The timing of gnathostome–cyclostome divergence relative to the two basal vertebrate tetraploidization events (i.e. 1R and 2R) remains an unresolved issue in the field of vertebrate genome evolution. In order to resolve the divergence timing conclusively, we searched our reconstructions of the proto-vertebrate, proto-cyclostome and proto-gnathostome genomes for evidence of large-scale genomic changes that help distinguish between three alternative divergence models, i.e., divergence before 1R, between 1R and 2R, or after 2R. Our reconstructions revealed nine major fusion events that occurred during the interval between 1R and 2R (see Supplementary Note 3 and Supplementary Fig. 6), but none of these fusions is shared with the proto-cyclostome lineage (Supplementary Fig. 15b, c and Fig. 2), suggesting that the two lineages diverged before the chromosome fusion events and thus before 2R. Furthermore, the orthologue distribution between proto-gnathostome and proto-cyclostome chromosomes demonstrates four-to-six correspondence and a quasi-random gene retention pattern (Supplementary Fig. 15a). This lack of one-to-one or two-to-three orthology relationships indicates that the two lineages diverged shortly after 1R but before rediploidization.

In order to verify the timing of duplications and the gnathostome–cyclostome divergence, we performed a gene tree analysis by inserting lamprey genes into Ensembl gene trees or re-computing the gene trees (see Supplementary Note 5). Then, we classified human and lamprey paralogue pairs by their duplication timing and plotted vertebrate paralogues (i.e. paralogues duplicated before the gnathostome–cyclostome split), gnathostome-specific paralogues and cyclostome-specific paralogues on the proto-gnathostome and proto-cyclostome genomes (Supplementary Figs. 915). Intriguingly, we observed a mixture of vertebrate paralogues and cyclostome-specific paralogues between most pairs of homoeologous proto-cyclostome chromosomes, making it difficult to conclusively determine the duplication timing of individual chromosomes. This observation may be explained by (1) difficulties in gene tree inference due to the high GC content and strong codon bias in the lamprey genomes22,26,33, (2) differential gene loss between cyclostome and gnathostome lineages29, (3) delayed rediploidization28,31,32 creating cyclostome-specific paralogues between proto-cyclostome chromosomes duplicated by 1R, and (4) tetraploidization through hybridization and doubling57,58,59, which may have created both vertebrate-specific and cyclostome-specific paralogues due to recurrent hybridization among genetically diverse subpopulations57,58 and subsequent genetic drift60. Although these factors may have obscured the duplication timing, the presence of chromosome pairs enriched either with vertebrate-specific paralogues or cyclostome-specific paralogues are consistent with the model that the proto-cyclostome lineage diverged from the proto-gnathostome lineage shortly after 1R.

### Inferred scenario of early vertebrate genome evolution

The findings described above can be brought together into a model describing the steps of early vertebrate genome evolution (Fig. 6). First, our reconstruction indicates that the proto-vertebrate genome (with 18 chromosomes) was similar in structure to the ancestral bilaterian animal genome, as suggested by the strong macrosynteny conservation between the proto-vertebrate and scallop genomes (Fig. 6a), e.g. Pvc2, 5, 6, 7, 9, 10, 16, 17 and 18 retain one-to-one correspondence with chr1, 11, 17, 8, 4, 7, 9, 3 and 13 in scallop, respectively. Second, our analysis suggests that the gnathostome–cyclostome divergence occurred shortly after 1R (Fig. 6b) but before rediploidization (Supplementary Fig. 15). This was followed by nine gnathostome-specific chromosome fusions, which were not shared with the proto-cyclostome lineage (Fig. 6c and Supplementary Fig. 6). Third, the 2R event in the proto-gnathostome was an allotetraploidization event, as our reconstruction shows biased gene loss/retention between duplicated chromosomes (Fig. 6e and Supplementary Note 4)61,62,63. Indeed, the ratio of retained genes between the two subgenomes in the proto-gnathostome genome is 2.25, which is considerably larger than previously reported ratios of paleo-allopolyploids: 1.47 for Brassica, 1.46 for maize, 1.24 for sorghum, 1.17 for Arabidopsis and 1.35 for Xenopus laevis61,64. A comparison with the modern gnathostome genomes (Fig. 6f and Supplementary Fig. 7) shows that a pair of chromosomes duplicated by 2R typically gave rise to a large chromosome (dashed lines) and a microchromosome (solid lines) in elephant shark and chicken, which suggests that the proto-gnathostome ancestor already possessed microchromosomes as a result of biased fractionation between the subgenomes. (A paper published after the submission of this manuscript suggested a similar evolutionary scenario65.) Fourth, we present evidence that there was a cyclostome-specific hexaploidization (Fig. 6g) that gave rise to the proto-cyclostome genome with 18 × 2 × 3 chromosomes, most of which are still retained in the modern lamprey genomes with ~99 chromosomes45 due to remarkably low rates of inter-chromosomal rearrangement.

## Discussion

To the best of our knowledge our reconstruction is the first reported genome-scale evidence for hexaploidy in the cyclostome lineage. There are several documented examples of hexaploidy giving rise to new evolutionary lineages. Perhaps the most well-known example is wheat, a domesticated crop with three subgenomes (A, B and D). The formation of hexaploid wheat is believed to be a multi-step process where there was an initial tetraploid genome formed by hybridization, and a subsequent hybridization of the tetraploid with a diploid, generating a hexaploid66. Hexaploidy has also been shown in early dicot plant evolution67, the shortnose sturgeon (Acipenser brevirostrum)46, and in the Prussian carp (Carassius gibelio)68,69. In most instances the mechanism of hexaploidy origin has been inferred to have been by serial hybridizations46,66.

These instances suggest that genome hybridization may have played a significant role in the origin and evolution of early vertebrates, as already discussed in previous studies9,70. A few possible mechanisms have been suggested for explaining the establishment of allopolyploid species. First, heterosis, or hybrid vigour, confers selective advantages to the newly formed allopolyploids71,72. Second, asymmetric and unequal contribution from the subgenomes has been reported to have facilitated the evolution of complex phenotypes in some allopolyploids: for example, it was suggested that the allotetraploid cotton produces high-quality fibres by combination of long fibres from the A-genome and short fibres from the D-genome73; in the allohexaploid wheat, the A-genome is responsible for the morphological traits, while the B- and D-genomes contain most genes for response to biotic and abiotic factors74. In line with this argument, previous studies have shown an example of asymmetric contribution from quadruple paralogous regions in the human genome75. Our reconstruction suggests that such asymmetric evolution may not be limited to specific gene clusters but is a genome-scale phenomenon due to the hybrid origin of the allopolyploid proto-gnathostome genome.

In particular, our reconstruction suggests that genome hybridization might have contributed to the origin of the adaptive immune system (AIS), which is a prime example of a major evolutionary innovation in early vertebrates. The human AIS is an intricate defence system characterized by the B cell and T cell receptors and the major histocompatibility complex (MHC), which are highly conserved throughout most gnathostomes, including cartilaginous fishes, but are missing in invertebrates, including the closest relatives of vertebrates, such as sea squirts and amphioxus76,77. The seemingly abrupt emergence of such a complex molecular machinery of AIS has been described as an evolutionary ‘Big Bang’ triggered by macroevolutionary events, including the two rounds of tetraploidization76,78,79,80.

Of particular interest with regard to the origin of the AIS is a previous hypothesis on the evolution of highly polymorphic genes encoded in the MHC, natural killer gene complex (NKC) and leucocyte receptor complex (LRC), which are essential for the mammalian AIS. It has been proposed that the precursors of MHC, NKC and LRC were physically linked on the proto-MHC chromosome, and the tight linkage facilitated co-evolution of highly polymorphic receptors and ligands, giving rise to a putative ‘immune supercomplex’ that was subsequently fragmented as MHC, NKC and LRC in the human genome76,78,79,80,81. However, our reconstruction shows that the hypothesized proto-MHC chromosome was an artefact due to inter 1R–2R chromosome fusions and additional rearrangements in the mammalian lineages. In our reconstruction, the putative proto-MHC chromosome is divided into several proto-vertebrate chromosomes (i.e. Pvc5, 11, 13, 14, 15, 17 and 18) that are clearly separated in the lamprey, amphioxus and scallop genomes (Fig. 2), and the MHC, NKC and LRC gene clusters are derived from these distinct proto-vertebrate chromosomes including Pvc5, 15 and 17.

Intriguingly, our reconstruction shows that MHC, NKC and LRC were located on microchromosomes in the proto-gnathostome genome (i.e. Pgc38, 12 and 27 in Supplementary Data 1). This observation suggests that the post-1R tetraploid species might have already had co-evolving genes encoding the precursors of MHC, NKC and LRC, and the post-2R allopolyploid preserved this interaction network within a subgenome despite the higher rate of gene loss in microchromosomes. This view is also consistent with the previous observation that functionally linked genes involved in ‘response to stimulus’ (e.g. genes involved in adaptive immunity) tend to be retained in cis after 2R, suggesting that interacting gene clusters were preserved despite extensive gene loss82,83. In addition, we observed functional biases between the two subgenomes: the human genes in the segment derived from the shorter subgenome were enriched with ‘defense/immunity protein’ in PANTHER Protein Class (FDR $$q=2.75\times {10}^{-13}$$, see Supplementary Note 4). Overall, our reconstruction suggests a possible role of asymmetric contribution from subgenomes for the emergence of gnathostome-like AIS, and corroborates the view that a primordial ‘adaptive’ immune system emerged in the ancestral vertebrate genome and later turned into the intricate gnathostome-like AIS through 2R76,77,80.

Finally, our reconstruction of the proto-gnathostome genome has implications for understanding the intrinsic evolutionary constraints on gnathostome genomes. It has previously been shown ohnologues are frequently dosage-sensitive and resistant to evolutionary duplication and loss84,85,86, and that the distribution of ohnologues constrains copy-number variations among human populations87. Interestingly, our reconstruction suggests that the high gene and ohnologue densities are conserved features (Fig. 5) associated with microchromosomes created by biased gene loss after 2R (Fig. 6), and that human chromosomal regions with high ohnologue densities originated from microchromosomes in the proto-gnathostome genome. Thus, ohnologue-rich regions that are susceptible to pathogenic copy-number variations may be regarded as a legacy from the allopolyploid proto-gnathostome genome and subsequent asymmetric evolution between the subgenomes. In addition, by referring to the evolutionary origins (Fig. 6), we can (1) identify ohnologue relationships between genes with low sequence similarity30 that might otherwise remain cryptic and (2) prioritize copy-number variations in personal genomes for potential pathogenicity.

In conclusion, we have generated high-quality, chromosome-scale genome assemblies for two phylogenetically opportune organisms, and inferred the genome structures of early vertebrate lineages. This is the first effort to reconstruct the proto-cyclostome genome, which was critical for determining the cryptic origins of the proto-cyclostome and proto-gnathostome genomes. Consequently, our reconstruction resolved several important issues including the number and relative timings of polyploidization events that occurred during the early origin of vertebrates. The resulting model offers unique perspectives on the origin and evolution of vertebrate genomes.

## Methods

### Probabilistic macrosynteny model

We reconstructed the proto-vertebrate genome by employing the probabilistic macrosynteny model34, which was previously used for inferring the structure of the pre-TGD genome (TGD stands for teleost-specific genome duplication). The details including the probability model, definitions of parameters/variables, algorithm and estimation accuracy can be found in ref. 34 (open access). In short, the macrosynteny model assumes that the individual pre-WGD chromosomes have distinct orthologue distributions over the present-day post-WGD genomes; then, the pre-WGD genome structure can be reconstructed by employing the variational Bayesian inference algorithm. In the present study, we used an algorithm called collapsed variational Bayes (CVB)88,89, which is more efficient than the variational Bayesian expectation-maximization algorithm described in ref. 34.

The CVB algorithm is derived as follows using the same model and variables as defined in ref. 34. In the framework of the probabilistic macrosynteny model, we infer the pre-WGD genome structure as the posterior (i.e. $${p}_{\Theta ,{X|Y}}$$) of the model parameters ($$\Theta$$) and latent variables ($$X$$) conditioned by the orthologue information ($$Y$$). Since exact computation of the posterior is infeasible, the posterior needs to be approximated by tractable probability density functions. For deriving the CVB algorithm, the posterior is approximated by $${q}_{\hat{\Theta} ,\hat{X}}(\theta ,x)$$ that can be factorized as

$${q}_{\hat{\Theta },\hat{X}}\left(\theta ,x\right)={q}_{\hat{\Theta }{{{\rm{|}}}}\hat{X}}\left(\theta {{{\rm{|}}}}x\right){\prod }_{s=1}^{S}{\prod }_{g=1}^{{G}_{s}}{q}_{{\hat{X}}_{s,g}}\left({x}_{s,g}\right),$$
(1)

where S is the number of non-WGD segments and Gs is the number of genes in segment s. Then, assuming this factorization and following the derivation of the CVB (or CVB0) algorithm88,89 for the probabilistic topic model90,91, we obtain Algorithm 1 with the following update formula:

$${{\log }}\left({q}_{{\hat{X}}_{s,g}}\left(k\right)\right)= \;{{\log }}\left({\hat{\alpha} }_{k}^{\left(s\right)}-{q}_{{\hat{X}}_{s,g}}\left(k\right)\right)+\mathop{\sum }\limits_{t=1}^{T}\mathop{\sum }\limits_{c=1}^{{C}_{t}}\mathop{\sum }\limits_{i=1}^{{n}_{t,c}^{s,g}}{{\log }}\left(i-1+{\hat{\beta} }_{c}^{\left(k,t\right)}-{q}_{{\hat{X}}_{s,g}}\left(k\right){n}_{t,c}^{s,g}\right)\\ -\mathop{\sum }\limits_{t=1}^{T}\mathop{\sum }\limits_{i=1}^{{n}_{t}^{s,g}}{{\log }}\left(i-1+\mathop{\sum }\limits_{c=1}^{{C}_{t}}{\hat{\beta }}_{c}^{\left(k,t\right)}-{q}_{{\hat{X}}_{s,g}}\left(k\right){n}_{t}^{s,g}\right)+C,$$
(2)

where $$\hat{\alpha }_{k}^{(s)}$$ and $$\hat{\beta }_{c}^{(k,t)}$$ are given by Eqs. (11) and (12) in ref. 34 and C is a constant that cancels by normalizing $${q}_{{\hat{X}}_{s,g}}(k)$$ so that $$\mathop{\sum }_{k=1}^{K}{q}_{{\hat{X}}_{s,g}}(k)=1$$. In the actual computation, we avoided early convergence of $${q}_{\hat{X}}$$ to suboptimal values by starting from a less extreme prior distribution with a slightly larger value of $${\alpha }_{k}$$ as follows: at the $$j$$th iteration, we replaced $${\alpha }_{k}$$ with $${\alpha }_{k}+{0.8}^{j-1}$$ while $$j\, < \, 98$$. We continued updating $${q}_{\hat{X}}$$ until after 100 iterations or until $${q}_{\hat{X}}$$ converges, satisfying

$$\begin{array}{c}\mathop{\sum }\limits_{s=1}^{S}\mathop{\sum }\limits_{g=1}^{{G}_{s}}\mathop{\sum }\limits_{k=1}^{K}\left|{q}_{{\hat{X}}_{s,g}}(k)-q^{{\prime} }_{{\hat{X}}_{s,g}}(k)\right|\, < \, 0.001,\end{array}$$
(3)

where q′ denotes the estimate in the previous iteration.

Below is a pseudocode for the CVB0 algorithm.

### Reconstruction of the proto-vertebrate genome

We reconstructed the structure of the proto-vertebrate genome in two steps: first, we partitioned the lamprey genomes into blocks of conserved synteny by comparing the lamprey genomes with each other and also with four gnathostome genomes (i.e. human, chicken, spotted gar and elephant shark); second, we inferred the structure of the pre-WGD genome by applying the macrosynteny model to the amphioxus and lamprey genomes. These steps are described below.

### Segmentation of the lamprey genomes

We partitioned the lamprey scaffolds (with at least ten genes) into blocks of conserved synteny as described in ref. 34. Specifically, we employed the Bayesian segmentation model92 and computed the optimal segmentation using a dynamic programming algorithm93. Segmentation was performed in two steps: first, we compared the Japanese lamprey and sea lamprey scaffolds, and identified lineage-specific synteny breakpoints; second, we compared the lamprey genomes with human, chicken, spotted gar and elephant shark genomes, and identified breakpoints occurring between the gnathostomes and cyclostomes. Then we merged the two sets of breakpoints and obtained 191 Japanese lamprey segments and 198 sea lamprey segments. These segments have homogeneous distributions of orthologues in the other genomes under comparison, and thus they are likely to have been unaffected by large-scale inter-chromosomal rearrangements in the cyclostome lineages.

### Inference of the pre-WGD genome structure

We analysed the 1-to-4 orthologue distribution among the amphioxus scaffolds14 and the Japanese lamprey and sea lamprey segments, by applying the macrosynteny model and CVB0 algorithm with the following parameter values: post-WGD species $$T=2$$, numbers of post-WGD segments $${C}_{1}=191$$ and $${C}_{2}=198$$, maximum number of co-orthologues $${D}^{(t)}=4$$ for all $$t$$, $$L=10$$, $${\alpha }_{k}=0.1$$ for all $$k$$, and $${\beta }_{c}^{(t)}=0.1$$ for all $$c$$ and $$t$$ (see ref. 34 for details of these parameters). As described in ref. 34, individual amphioxus scaffolds were associated with mixture distributions over the proto-vertebrate chromosomes, which represent reconstruction confidence scores. For the sake of simplicity in visualization, we assigned each amphioxus scaffold to the proto-vertebrate chromosome with the largest reconstruction confidence score (i.e. $${{{{\rm{argmax}}}}}_{k}{\mathbb{E}}[{\hat{U}}_{s,k}]$$, where $${\mathbb{E}}$$ denotes expectation), which is calculated by using Eq. (11) in ref. 34. In addition, we assigned each lamprey segment to the proto-vertebrate chromosome with the largest reconstruction confidence score (i.e. $${{{{\rm{argmax}}}}}_{k}{\mathbb{E}}[{\hat{V}}_{t,k,c}]$$), which is calculated by using Eq. (12) in ref. 34. See also Fig. 1 in ref. 34 for an intuitive explanation.

### Number of proto-vertebrate chromosomes

In the macrosynteny model, the number of proto-vertebrate chromosomes is treated as an input parameter ($$K$$) for inferring the optimal pre-WGD genome structure. The previous studies estimated the number of proto-vertebrate chromosomes to be 10–13 in refs. 12,13,18,19,94 or 17 in refs. 14,15, but the exact number is unknown. In order to decide the optimal number of $$K$$, we reconstructed the proto-vertebrate chromosomes with $$K=10,\ldots ,20$$, and evaluated the quality of those reconstructions by comparing their paralogue distributions as follows.

The underlying assumption is that most lamprey paralogues were created by WGDs (or by chromosome-scale duplications as proposed in ref. 18); then, the paralogue distribution should be highly non-random, with most paralogues found between lamprey segment pairs both deriving from the same proto-vertebrate chromosome. We quantified such non-randomness by using the hypergeometric distribution under the null hypothesis in which paralogues are randomly distributed over the entire genome as described below.

Let $$g$$ and $$p$$ be the number of gene pairs and paralogue pairs in the genome, respectively, and $$v$$ be the number of gene pairs both of which derive from the same proto-vertebrate chromosomes. Let $$X$$ be a random variable representing the number of paralogue pairs both of which derive from the same proto-vertebrate chromosome, and $$x$$ be the observed number of such paralogue pairs. Then, the significance of $$x$$ is given as follows:

$${\mathbb{P}}(X\ge x)=\mathop{\sum }\limits_{i=x}^{v}\left(\begin{array}{c}g-v\\ p-i\end{array}\right)\left(\begin{array}{c}v\\ i\end{array}\right)\bigg/\left(\begin{array}{c}g\\ p\end{array}\right)$$
(4)

where $${\mathbb{P}}$$ denotes the probability and () denotes the binomial coefficient.

In both the Japanese lamprey and sea lamprey genomes, the reconstruction with $$K=18$$ was the most significant in this criterion (Supplementary Table 8). Then we labelled the proto-vertebrate chromosomes as Pvc1–Pvc18 (although we observed that Pvc18 has no clear synteny in gnathostome and invertebrate genomes).

### Reconstruction of the proto-cyclostome genome

Although the 2R hypothesis was resolved by a genome-wide synteny analysis11 and reconstruction analyses12,13,14, the origins of the proto-gnathostome and proto-cyclostome genomes have remained contentious. In particular, the timing of gnathostome–cyclostome divergence and possibility of cyclostome-specific WGD have remained topics of debate even after sequencing of the sea lamprey genome18,22,26. For example, six Hox clusters were found in the cyclostome genomes19,22,26,27, but it was not clear if the number of Hox clusters should be explained by additional cyclostome-specific WGD followed by the loss of two entire clusters, or by chromosome-scale duplications in the cyclostome lineage18,19,22.

We considered that a reconstruction of the proto-cyclostome chromosomes would provide a conclusive answer to this question. In our macrosynteny model analysis, the Hox-bearing proto-vertebrate chromosome comprises 10 Japanese lamprey segments and 11 sea lamprey segments, which are likely to be parts of proto-cyclostome chromosomes fragmented due to inter-chromosomal rearrangements or limited scaffold length in the current genome assemblies. Thus, the reconstruction of proto-cyclostome chromosomes can be formulated as finding the correct combination from a large number of possible combinations of the lamprey segments. The enumeration of all combinations is called ‘set partitioning’ (i.e. partitioning of a set of segments into non-empty subsets), which is computationally infeasible because the number of all set partitions, known as the Bell number, can be extremely large: for example, the Bell number for the 21 lamprey segments is $${B}_{21}={{{\mathrm{474,869,816,156,751}}}}$$. To address this problem, we performed clustering of lamprey segments and reduced the number of set partitions as follows.

Step 1: Paralogous lamprey segments do not originate from the same proto-cyclostome chromosome. Therefore, we calculated the paralogue significance for each segment pair, and significant pairs were not allowed to be assigned to the same cluster in the subsequent steps.

Step 2: Orthologous segments between Japanese lamprey and sea lamprey originate from the same proto-cyclostome chromosome. Therefore, we performed a single linkage clustering of lamprey segments to make clusters of orthologous segments. First, we defined individual segments as initial clusters. Second, we sorted segment pairs by the significance of the number of orthologues between them. Third, we repeated choosing the most significant segment pair and merged the two clusters if they did not have paralogous segments.

Step 3: Some lamprey scaffolds are expected to be over-fragmented by the synteny segmentation algorithm or by lineage-specific rearrangements. In order to address such over-fragmentation of lamprey scaffolds, we merged two clusters if (i) they had segments on the same scaffold and (ii) they did not have paralogous segments.

Step 4: In addition to Step 3, we utilized the Pacific lamprey linkage markers19 and merged two clusters if (i) the clusters shared a pair of sea lamprey segments having linkage markers on the same Pacific lamprey linkage group and (ii) the clusters did not have paralogous segments.

Step 5: Reliable reconstruction is difficult for short segments having few orthologues and paralogues. Therefore, clusters of lamprey segments were excluded from the proto-cyclostome reconstruction if the clusters had fewer than five genes.

For each of Pvc1–Pvc17, we enumerated all set partitions of the clusters, and chose the optimal set partition with the most significant distribution of orthologues and paralogues as the proto-cyclostome chromosomes. During this analysis we found that some Japanese lamprey scaffolds are likely to be haplotype sequences that were not removed from the primary assembly by FALCON during the final stage of the assembly; we therefore excluded the following Japanese lamprey scaffolds from the proto-cyclostome reconstruction: Scaffolds 110, 190, 198, 105, 104, 82, 163, 69, 133, 74, 115, 139, 86, 70, 72, 171 and 192. We left Pvc18 as a single proto-cyclostome chromosome because computation of the optimal set partitioning was infeasible for Pvc18 consisting of 34 segments.

Significance of paralogues, orthologues and set partitions were calculated as follows.

### Significance of the number of paralogues

The significance of the number of paralogues between two lamprey segments is calculated as follows. Let $$g$$ and $$p$$ be the number of gene pairs and paralogue pairs in the genome, respectively, and $$n$$ be the number of gene pairs between the two segments. Let $$X$$ be a random variable representing the number of paralogue pairs between the two segments, and $$x$$ be the observed number of such paralogue pairs. Then, the probability of observing $$x$$ paralogue pairs between the two segments is given by

$${\mathbb{P}}(X=x)=\left(\begin{array}{c}g-n\\ p-x\end{array}\right)\left(\begin{array}{c}n\\ x\end{array}\right)\bigg/\left(\begin{array}{c}g\\ p\end{array}\right),$$
(5)

and the number of paralogue pairs was considered significant if $${\mathbb{P}}(X\, \ge\, x)\, < \, {10}^{-5}$$.

### Significance of the number of orthologues

The significance of the number of orthologues between two lamprey segments is calculated as follows. Let $$g$$ and $$o$$ be the number of gene pairs and orthologue pairs between the Japanese lamprey and sea lamprey genomes, respectively, and $$n$$ be the number of gene pairs between the two segments. Let $$X$$ be a random variable representing the number of orthologue pairs between the two segments, and $$x$$ be the observed number of such orthologue pairs. Then, the probability of observing $$x$$ orthologue pairs between the two segments is given by

$${\mathbb{P}}(X=x)=\left(\begin{array}{c}g-n\\ o-x\end{array}\right)\left(\begin{array}{c}n\\ x\end{array}\right)\bigg/\left(\begin{array}{c}g\\ o\end{array}\right)$$
(6)

and the number of orthologue pairs was considered significant if $${\mathbb{P}}(X\, \ge\, x)\, < \, {10}^{-5}$$.

### Significance of a reconstruction

For each proto-vertebrate chromosome $$c$$ ($$c=1,\ldots ,17$$ for Pvc1,…,17, respectively) and each species $$s$$ ($$s=1$$ for Japanese lamprey and $$s=2$$ for sea lamprey), the significance of paralogues was calculated as follows. Let $${g}_{c,s}$$ be the number of gene pairs and $${p}_{c,s}$$ be the number of paralogue pairs in species $$s$$ such that both genes derive from proto-vertebrate chromosome $$c$$. Let $${n}_{s}\, (\!\le g_{c,s})$$ be the number of gene pairs between different proto-cyclostome chromosomes (i.e. inter-chromosome gene pairs). Let $${X}_{s}(\!\le p_{c,s})$$ be a random variable representing the number of inter-chromosome paralogue pairs, and $${x}_{s}$$ be the observed number of such paralogue pairs. Then, the significance of $${x}_{s}$$ inter-chromosome paralogue pairs is given by

$${\mathbb{P}}({X}_{s}\ge {x}_{s})=\mathop{\sum }\limits_{i={x}_{s}}^{{p}_{c,s}}\left(\begin{array}{c}{g}_{c,s}-{n}_{s}\\ {p}_{c,s}-i\end{array}\right)\left(\begin{array}{c}{n}_{s}\\ i\end{array}\right)\bigg/\left(\begin{array}{c}{g}_{c,s}\\ {p}_{c,s}\end{array}\right)$$
(7)

In addition, we calculated the significance of the number of orthologues between species $$s$$ and $$t$$. Let $${g}_{c,s,t}$$ be the numbers of gene pairs and $${o}_{c,s,t}$$ be the number of orthologue pairs between the two species such that both genes derive from proto-vertebrate chromosome $$c$$. Let $${n}_{s,t}\,(\le {g}_{c,s,t})$$ be the number of gene pairs between species $$s$$ and $$t$$, where both genes derive from the same proto-cyclostome chromosome. Let $${X}_{s,t}\,(\le {o}_{c,s,t})$$ be a random variable representing the number of orthologue pairs, deriving from the same proto-cyclostome chromosome, and $${x}_{s,t}$$ be the observed number of such orthologue pairs. Then, the significance of $${x}_{s,t}$$ orthologue pairs is given by

$${\mathbb{P}}({X}_{s,t}\ge {x}_{s,t})=\mathop{\sum }\limits_{i={x}_{s,t}}^{{o}_{c,s,t}}\left(\begin{array}{c}{g}_{c,s,t}-{n}_{s,t}\\ {o}_{c,s,t}-i\end{array}\right)\left(\begin{array}{c}{n}_{s,t}\\ i\end{array}\right)\bigg/\left(\begin{array}{c}{g}_{c,s,t}\\ {o}_{c,s,t}\end{array}\right)$$
(8)

Finally, we defined the significance of the set partition for proto-vertebrate chromosome c as

$${\mathbb{P}}({X}_{1}\ge {x}_{1}){\mathbb{P}}({X}_{2}\ge {x}_{2}){\mathbb{P}}({X}_{1,2}\ge {x}_{1,2}).$$
(9)

This method is an extension of the reconstruction of post-2R chromosomes in ref. 13, which was developed for verifying if genome quadruplication occurred in the proto-vertebrate lineage: in the previous study, set partitioning into 2, 3, 4 and 5 post-2R chromosomes were enumerated for showing that quadruplication was the most significant; we extended it to also enumerating set partitions into more than five proto-cyclostome chromosomes.

### Reconstruction of the proto-gnathostome genome

We reconstructed the proto-gnathostome chromosomes by comparing the amphioxus14 and several gnathostome genomes including elephant shark. As illustrated in Fig. 4, we performed reconstruction in three steps: first, we partitioned the gnathostome chromosomes into blocks of conserved synteny (Fig. 4f); second, we applied the CVB0 algorithm and made groups of gnathostome segments that share large numbers of paralogues (Fig. 4g); third, segments in individual groups were further partitioned into several subgroups representing proto-gnathostome chromosomes (Fig. 4h).

### Segmentation of the gnathostome genomes

We partitioned the gnathostome genomes as described in ref. 34. Specifically, we performed genome segmentation twice for each gnathostome genome: one with four teleost genomes (i.e. zebrafish, stickleback, medaka and Tetraodon) to identify blocks of doubly conserved synteny, and the other with chicken, turkey, zebra finch, anole lizard and spotted gar to identify additional synteny breakpoints in individual lineages. Then we merged the two sets of synteny breakpoints to define 151 human segments. Similarly, we partitioned the mouse, dog, opossum, chicken, turkey, zebra finch, spotted gar and elephant shark genomes into 258, 212, 163, 70, 69, 60, 78 and 132 segments, respectively. These numbers of segments are slightly different from the previous study34, because we used the spotted gar genome in addition to the non-mammalian amniotes in the synteny segmentation step.

### Analysis with the macrosynteny model

We used the macrosynteny model and applied the CVB0 algorithm with the following parameter values: number of post-WGD species $$T=9$$, numbers of post-WGD segments $${C}_{t}={{{\mathrm{151,258,212,163,70,69,60,78}}}}$$ and $$132$$ for $$t=1,\ldots ,9$$ respectively, maximum number of co-orthologues $${D}^{(t)}=4$$ for all $$t$$, $$L=10$$, $${\alpha }_{k}=0.1$$ for all $$k$$, and $${\beta }_{c}^{(t)}=0.1$$ for all $$c$$ and $$t$$. Then we set $$K=7,\ldots ,20$$ and evaluated the reconstruction quality by comparing the significance of paralogue distribution as described for the reconstruction of proto-vertebrate genome. We found that clustering into 10 groups of segments was optimal, which is consistent with the previous study13. However, as we argue in Fig. 4, individual groups might represent multiple proto-vertebrate chromosomes due to inter-chromosomal rearrangements. Therefore, we reconstructed proto-gnathostome chromosomes by employing a rearrangement-aware method as follows.

### Reconstruction of proto-gnathostome chromosomes

In this step, we used segments from the human, mouse, dog, opossum, chicken, turkey, zebra finch, spotted gar and elephant shark genomes, and enumerated set partitions for each of the 10 groups after reducing the number of set partitions by clustering of gnathostome segments as in the reconstruction of proto-cyclostome chromosomes:

Step 1: Paralogous gnathostome segments do not originate from the same proto-gnathostome chromosome. Therefore, we calculated the paralogue significance for each segment pair, and significant pairs were not allowed to be assigned to the same cluster in the subsequent steps.

Step 2: Orthologous gnathostome segments originate from the same proto-gnathostome chromosome. Therefore, we performed a single linkage clustering of gnathostome segments to make clusters of orthologous segments. First, we defined individual segments as initial clusters. Second, we sorted segment pairs by the significance of the number of orthologues between them. Third, we repeated choosing the most significant segment pair and merged the two clusters if they did not have paralogous segments.

Step 3: Reliable reconstruction is difficult for short segments having few orthologues and paralogues. Therefore, clusters of gnathostome segments were excluded from the proto-gnathostome reconstruction if the clusters had fewer than five genes.

For the reconstruction of proto-gnathostome chromosomes, we assumed that individual gnathostome segments might have derived from multiple proto-vertebrate chromosomes, since it was reported that several rearrangements occurred between the two WGD events13. Figure 4a illustrates the case of a chromosome fusion occurring between the two WGD events. As the result of the fusion, the grey post-2R chromosomes share large numbers of ohnologues with the black and white chromosomes (represented by red regions in Fig. 4c); on the other hand, there are no ohnologues between black and white chromosomes (white regions). In addition to the case of a chromosome fusion between the two WGD events, our reconstruction method considered other rearrangement scenarios, namely, (A) a chromosome fission event occurring in the period between 1R and 2R and (B) a fusion or translocation after 2R. Scenario A results in the same paralogue distribution pattern as in the case of a fusion between the two WGD events, but the two scenarios can be distinguished by checking the orthologue distribution in invertebrate genomes. In Scenario B, the paralogue distribution is different from Scenario A, since the chromosome created by post-2R fusion is paralogous to the other six post-2R chromosomes. In general, we expect to see a large number of paralogues between a pair of proto-gnathostome chromosomes, only if the two chromosomes (1) are duplicated chromosomes or (2) inherit duplicated chromosomes or duplicated segments through rearrangements (fusions, fissions and translocations). These proto-gnathostome chromosome pairs are called ‘red chromosome pairs’ (as in Fig. 4c) in the subsequent texts.

Then, for each cluster $$c$$ ($$c=1,\ldots ,10$$ for the ten clusters) and for each species $$s$$ ($$s=1,\ldots ,9$$ for human, mouse, dog, opossum, chicken, turkey, zebra finch, spotted gar and elephant shark, respectively), the significance of paralogues was calculated as follows. Let $${g}_{c,s}$$ be the number of gene pairs and $${p}_{c,s}$$ be the number of paralogue pairs in species $$s$$ such that both genes are in cluster $$c$$. Let $${n}_{s}\,(\le g_{c,s})$$ be the number of gene pairs between red proto-gnathostome chromosome pairs. Let $${X}_{s}\,(\le p_{c,s})$$ be a random variable representing the number of paralogue pairs between red chromosome pairs, and $${x}_{s}$$ be the observed number of such paralogue pairs. Then, the significance of $${x}_{s}$$ intra-chromosome paralogue pairs is given by

$${\mathbb{P}}({X}_{s}\ge {x}_{s})=\mathop{\sum }\limits_{i={x}_{s}}^{{p}_{c,s}}\left(\begin{array}{c}{g}_{c,s}-{n}_{s}\\ {p}_{c,s}-i\end{array}\right)\left(\begin{array}{c}{n}_{s}\\ i\end{array}\right)\bigg/\left(\begin{array}{c}{g}_{c,s}\\ {p}_{c,s}\end{array}\right).$$
(10)

Significance of orthologues between two gnathostome species was given in the same way as in the proto-cyclostome reconstruction. Then, we calculated the significance of a set partition for cluster c by multiplying the significance values for all species and all species pairs:

$$\left({\prod }_{s}{\mathbb{P}}({X}_{s}\ge {x}_{s})\right)\left({\prod }_{s\ne t}{\mathbb{P}}({X}_{s,t}\ge {x}_{s,t})\right).$$
(11)

In the last step of the proto-gnathostome reconstruction, we chose the most significant set partition and filtered out small unreliable subgroups. This filtering step was necessary because reliable subgroups are expected to have segments from all (or most) of the gnathostome species, whereas reconstruction errors result in spurious small subgroups having short segments from few species. For this reason, we filtered out small subgroups with segments from only few (<3) species, which filtered out nine small subgroups consisting of four mouse segments, one chicken segment and nine elephant shark segments. This filtering step should have little influence on the analysis results because the number of genes on the filtered segments is only 288 in total. Finally, the remaining subgroups were defined as the proto-gnathostome chromosomes.

### The proto-cyclostome genome was shaped by sixfold genome duplication

Here, we introduce a framework for calculating the probability that multiplicities of independently duplicating chromosomes converge toward a given ploidy level, where the convergence is measured in terms of the deviation ($$\delta$$) from the given ploidy level. Application to the proto-cyclostome genome shows that the observed peak of multiplicity at six is unlikely to be created by chance through accumulation of chromosome-scale duplications.

Let us consider the following situation. The proto-vertebrate genome with $$K$$ chromosomes underwent one or two polyploidization events, producing $${X}_{k}(k=1,\ldots ,K)$$ duplicates for each proto-vertebrate chromosome ($${X}_{k}=2$$ for all $$k$$ after 1R or $${X}_{k}=4$$ after two rounds of tetraploidization). Subsequently, those $$X={\sum }_{k=1}^{K}{X}_{k}$$ chromosomes were duplicated by a series of independent chromosome-scale duplications, eventually creating $${Y}_{k}$$ duplicates for each proto-vertebrate chromosome $$(k=1,\ldots ,K)$$. As a measure of deviation from a polyploidization-only model, we define $$\delta ({Y}_{k})=\mathop{\sum }\nolimits_{k=1}^{K}\left|{Y}_{k}-M\right|$$, where $$M$$ is the expected multiplicity ($$M=6$$ in our model). Assuming that all chromosomes are equally likely to be duplicated, we calculate $${\mathbb{P}}\left(\delta \left({Y}_{k}\right)\le D\,|\mathop{\sum }\nolimits_{k=1}^{K}{Y}_{k}=Y\right)$$, the probability that the deviation is smaller than or equal to the observed deviation $$D$$ (i.e. $$D=13$$ in our reconstruction) conditioned by the total number of proto-cyclostome chromosomes $$Y$$ (i.e. $$Y=103$$ in our reconstruction).

The desired probability is calculated as follows. First, the total number of duplication scenarios is given by $$T=\varGamma (Y)/\varGamma (X)$$, where $$\varGamma \left(n\right)=\left(n-1\right)(n-2)\cdots 1$$ is the gamma function. Second, for given $${Y}_{k}(k=1,\ldots ,K)$$, the number of duplication scenarios in which individual proto-vertebrate chromosomes are eventually duplicated into $${Y}_{k}$$ proto-cyclostome chromosomes is given by

$$S\left({Y}_{1},\ldots ,{Y}_{K}\right)=\left({Y}_{1}-{X}_{1},\ldots ,{Y}_{K}-{X}_{K}\right)!{\prod }_{k=1}^{K}\varGamma ({Y}_{k})/\varGamma ({X}_{k}),$$
(12)

where $$\left(\cdot ,\ldots ,\cdot \right)!$$ is the multinomial coefficient. Then, by enumerating all $${Y}_{k}$$ values, we can calculate the desired probability (i.e. independently duplicating proto-vertebrate chromosomes converging to multiplicity $$M$$ by chance alone) as

$${\mathbb{P}}\left(\delta \left({Y}_{k}\right)\bigg|\mathop{\sum }\limits_{k=1}^{K}{Y}_{k}=Y\right)=\mathop{\sum }\limits_{\left\{{Y}_{k}\right\}}S\left({Y}_{1},\ldots ,{Y}_{K}\right)/T,$$
(13)

where the summation is taken over all $${Y}_{k}$$ that satisfy $$\delta \left({Y}_{k}\right)\le D$$ and $$\mathop{\sum }\nolimits_{k=1}^{K}{Y}_{k}=Y$$.

In our reconstruction, we have $$K=17$$, $$Y=103$$, $$D=13$$ and $$M=6$$ (see Supplementary Table 10 in Supplementary Note 3). We evaluated the following five evolutionary scenarios: (A) chromosome-scale duplications with no tetraploidization, (B) one tetraploidization followed by chromosome-scale duplications, (C) two tetraploidizations followed by chromosome-scale duplications, (D) chromosome-scale duplications followed by one tetraploidization, and (E) first tetraploidization followed by chromosome-scale duplication followed by second tetraploidization. In these scenarios we assume that $${X}_{k}=N$$ for all $$k$$, where we set $$N=1$$ and $$M=6$$ for Scenario A; $$N=2$$ and $$M=6$$ for Scenario B; $$N=4$$ and $$M=6$$ for Scenario C; $$N=1$$ and $$M=3$$ for Scenario D; and $$N=2$$ and $$M=3$$ for Scenario E. We set $$\left({Y}_{1},\ldots ,{Y}_{17}\right)=\left({{{\mathrm{6,5,6,6,7,7,6,6,4,6,8,5,6,4,9,6,6}}}}\right)$$ for Scenarios A/B/C and $$\left({Y}_{1},\ldots ,{Y}_{17}\right)=\left({{{\mathrm{3,2,3,3,3,3,3,3,2,3,4,2,3,2,4,3,3}}}}\right)$$ for Scenarios D/E, based on the proto-cyclostome genome reconstruction. In addition, we evaluated the case of $$K=18$$ by setting $${Y}_{18}={{\max }}(1,N)$$, since our model requires $${Y}_{k}\ge N$$ for all $$k=1,\ldots ,K$$; we also evaluated the case of $$K=5$$, $$Y=30$$ and $$D=0$$ since larger proto-vertebrate chromosomes are more reliable in our reconstruction and the largest five proto-vertebrate chromosomes have multiplicity six.

Supplementary Table 10 in Supplementary Note 3 shows small probabilities of observing convergence of multiplicities through independent chromosome-scale duplications. Thus, it is unlikely that the proto-cyclostome genome was shaped by a series of independently occurring chromosome-scale duplications.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The Japanese lamprey and elephant shark genome sequences generated in this study have been deposited at DDBJ/ENA/GenBank under the accession numbers WFAB00000000 and WEZY00000000, respectively. The reconstruction dataset including information of orthologues, paralogues and gene names in individual chromosomal segments is available as Supplementary Data 1.

## Code availability

The reconstruction software/code is available on request.

## References

1. 1.

Morris, S. C. & Caron, J. B. A primitive fish from the Cambrian of North America. Nature 512, 419–422 (2014).

2. 2.

Miyashita, T. et al. Hagfish from the Cretaceous Tethys Sea and a reconciliation of the morphological-molecular conflict in early vertebrate phylogeny. Proc. Natl Acad. Sci. USA 116, 2146–2151 (2019).

3. 3.

Osorio, J. & Retaux, S. The lamprey in evolutionary studies. Dev. Genes Evol. 218, 221–235 (2008).

4. 4.

Shimeld, S. M. & Donoghue, P. C. Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish). Development 139, 2091–2099 (2012).

5. 5.

Janvier, P. in Major Transitions in Vertebrate Evolution (eds. Anderson, J. S. & Sues, H. D.) (Indiana University Press, 2007).

6. 6.

Heimberg, A. M., Cowper-Sal-lari, R., Sémon, M., Donoghue, P. C. J. & Peterson, K. J. microRNAs reveal the interrelationships of hagfish, lampreys, and gnathostomes and the nature of the ancestral vertebrate. Proc. Natl Acad. Sci. USA 107, 19379–19383 (2010).

7. 7.

Donoghue, P. Evolution: divining the nature of the ancestral vertebrate. Curr. Biol. 27, R277–R279 (2017).

8. 8.

Boehm, T. et al. Evolution of alternative adaptive immune systems in vertebrates. Annu. Rev. Immunol. 36, 19–42 (2018).

9. 9.

Ohno S. Evolution by Gene Duplication (Springer, 1970).

10. 10.

Holland, P. W. H., Garcia-Fernàndez, J., Williams, N. A. & Sidow, A. Gene duplications and the origins of vertebrate development. Development 1994, 125–133 (1994).

11. 11.

Dehal, P. & Boore, J. L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3, e314 (2005).

12. 12.

Putnam, N. H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007).

13. 13.

Nakatani, Y., Takeda, H., Kohara, Y. & Morishita, S. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res. 17, 1254–1265 (2007).

14. 14.

Putnam, N. H. et al. The amphioxus genome and the evolution of the chordate karyotype. Nature 453, 1064–1071 (2008).

15. 15.

Sacerdot, C., Louis, A., Bon, C., Berthelot, C. & Roest Crollius, H. Chromosome evolution at the origin of the ancestral vertebrate genome. Genome Biol. 19, 166 (2018).

16. 16.

Muffato, M. & Roest Crollius, H. Paleogenomics in vertebrates, or the recovery of lost genomes from the mist of time. BioEssays 30, 122–134 (2008).

17. 17.

Van de Peer, Y., Maere, S. & Meyer, A. The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732 (2009).

18. 18.

Smith, J. J. & Keinath, M. C. The sea lamprey meiotic map improves resolution of ancient vertebrate genome duplications. Genome Res. 25, 1081–1090 (2015).

19. 19.

Smith, J. J. et al. The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution. Nat. Genet. 50, 270–277 (2018).

20. 20.

Fried, C., Prohaska, S. J. & Stadler, P. F. Independent Hox-cluster duplications in lampreys. J. Exp. Zool. 299B, 18–25 (2003).

21. 21.

Furlong, R. F. et al. A degenerate ParaHox gene cluster in a degenerate vertebrate. Mol. Biol. Evol. 24, 2681–2686 (2007).

22. 22.

Mehta, T. K. et al. Evidence for at least six Hox clusters in the Japanese lamprey (Lethenteron japonicum). Proc. Natl Acad. Sci. USA 110, 16044–16049 (2013).

23. 23.

Escriva, H., Manzon, L., Youson, J. & Laudet, V. Analysis of lamprey and hagfish genes reveals a complex history of gene duplications during early vertebrate evolution. Mol. Biol. Evol. 19, 1440–1450 (2002).

24. 24.

Stadler, P. F. et al. Evidence for independent Hox gene duplications in the hagfish lineage: a PCR-based gene inventory of Eptatretus stoutii. Mol. Phylogenet. Evol. 32, 686–694 (2004).

25. 25.

Kuraku, S., Meyer, A. & Kuratani, S. Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 26, 47–59 (2009).

26. 26.

Smith, J. J. et al. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nat. Genet. 45, 415–421 (2013).

27. 27.

Pascual-Anaya, J. et al. Hagfish and lamprey Hox genes reveal conservation of temporal colinearity in vertebrates. Nat. Ecol. Evol. 2, 859–866 (2018).

28. 28.

Furlong, R. F. & Holland, P. W. H. Were vertebrates octoploid? Philos. Trans. R. Soc. Lond. B Biol. Sci. 357, 531–544 (2002).

29. 29.

Kuraku, S. Impact of asymmetric gene repertoire between cyclostomes and gnathostomes. Semin Cell Dev. Biol. 24, 119–127 (2013).

30. 30.

Holland, P. W., Marletaz, F., Maeso, I., Dunwell, T. L. & Paps J. New genes from old: asymmetric divergence of gene duplicates and the evolution of development. Philos. Trans. R. Soc. Lond. B Biol. Sci. 372, 20150480 (2017).

31. 31.

Martin, K. J. & Holland, P. W. H. Enigmatic orthology relationships between Hox clusters of the African butterfly fish and other teleosts following ancient whole-genome duplication. Mol. Biol. Evol. 31, 2592–2611 (2014).

32. 32.

Robertson, F. M. et al. Lineage-specific rediploidization is a mechanism to explain time-lags between genome duplication and evolutionary diversification. Genome Biol. 18, 111 (2017).

33. 33.

Qiu, H., Hildebrand, F., Kuraku, S. & Meyer, A. Unresolved orthology and peculiar coding sequence properties of lamprey genes: the KCNA gene family as test case. BMC Genomics 12, 325 (2011).

34. 34.

Nakatani, Y. & McLysaght, A. Genomes as documents of evolutionary history: a probabilistic macrosynteny model for the reconstruction of ancestral genomes. Bioinformatics 33, i369–i378 (2017).

35. 35.

Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26, 342–350 (2016).

36. 36.

Venkatesh, B. et al. Elephant shark genome provides unique insights into gnathostome evolution. Nature 505, 174–179 (2014).

37. 37.

Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).

38. 38.

International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716 (2004).

39. 39.

Braasch, I. et al. The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nat. Genet. 48, 427–437 (2016).

40. 40.

Li, Y. et al. Scallop genome reveals molecular adaptations to semi-sessile life and neurotoxins. Nat. Commun. 8, 1721 (2017).

41. 41.

Srivastava, M. et al. The Trichoplax genome and the nature of placozoans. Nature 454, 955–960 (2008).

42. 42.

Simakov, O. et al. Insights into bilaterian evolution from three spiralian genomes. Nature 493, 526–531 (2013).

43. 43.

Wang, S. et al. Scallop genome provides insights into evolution of bilaterian karyotype and development. Nat. Ecol. Evol. 1, 0120 (2017).

44. 44.

Benton, M. J., Donoghue, P. C. J. & Asher, R. J. in The Timetree of Life (eds. Hedges, S. B. & Kumar, S.) (Oxford University Press, 2009).

45. 45.

Smith, J. J., Stuart, A. B., Sauka-Spengler, T., Clifton, S. W. & Amemiya, C. T. Development and analysis of a germline BAC resource for the sea lamprey, a vertebrate that undergoes substantial chromatin diminution. Chromosoma 119, 381–389 (2010).

46. 46.

Fontana, F. et al. Evidence of hexaploid karyotype in shortnose sturgeon. Genome 51, 113–119 (2008).

47. 47.

Havelka, M., Bytyutskyy, D., Symonová, R., Ráb, P. & Flajšhans, M. The second highest chromosome count among vertebrates is observed in cultured sturgeon and is associated with genome plasticity. Genet. Sel. Evol. 48, 12 (2016).

48. 48.

Trifonov, V. A. et al. Evolutionary plasticity of acipenseriform genomes. Chromosoma 125, 661–668 (2016).

49. 49.

Burt, D. W. Origin and evolution of avian microchromosomes. Cytogenet. Genome Res. 96, 97–112 (2002).

50. 50.

Voss, S. R. et al. Origin of amphibian and avian chromosomes by fission, fusion, and retention of ancestral chromosomes. Genome Res. 21, 1306–1312 (2011).

51. 51.

Louis, A., Roest Crollius, H. & Robinson-Rechavi, M. How much does the amphioxus genome represent the ancestor of chordates? Brief. Funct. Genomics 11, 89–95 (2012).

52. 52.

Uno, Y. et al. Inference of the protokaryotypes of amniotes and tetrapods and the evolutionary processes of microchromosomes from comparative gene mapping. PLoS ONE 7, e53027 (2012).

53. 53.

Inoue, J. G. et al. Evolutionary origin and phylogeny of the modern Holocephalans (Chondrichthyes: Chimaeriformes): a mitogenomic perspective. Mol. Biol. Evol. 27, 2576–2586 (2010).

54. 54.

Harmston, N. et al. Topologically associating domains are ancient features that coincide with metazoan clusters of extreme noncoding conservation. Nat. Commun. 8, 441 (2017).

55. 55.

Berthelot, C., Muffato, M., Abecassis, J., Roest & Crollius, H. The 3D organization of chromatin explains evolutionary fragile genomic regions. Cell Rep. 10, 1913–1924 (2015).

56. 56.

Lv, J., Havlak, P. & Putnam, N. H. Constraints on genes shape long-term conservation of macro-synteny in metazoan genomes. BMC Bioinformatics 12, S11 (2011).

57. 57.

Soltis, D. E. & Soltis, P. S. Polyploidy: recurrent formation and genome evolution. Trends Ecol. Evol. 14, 348–352 (1999).

58. 58.

Soltis, D. E., Visger, C. J. & Soltis, P. S. The polyploidy revolution then… and now: Stebbins revisited. Am. J. Bot. 101, 1057–1078 (2014).

59. 59.

Holloway, A. K., Cannatella, D. C., Gerhardt, H. C. & Hillis, D. M. Polyploids with different origins and ancestors form a single sexual polyploid species. Am. Nat. 167, E88–E101 (2006).

60. 60.

Wolfe, K. H. Yesterday’s polyploids and the mystery of diploidization. Nat. Rev. Genet. 2, 333–341 (2001).

61. 61.

Garsmeur, O. et al. Two evolutionarily distinct classes of paleopolyploidy. Mol. Biol. Evol. 31, 448–454 (2014).

62. 62.

Wendel, J. F., Lisch, D., Hu, G. & Mason, A. S. The long and short of doubling down: polyploidy, epigenetics, and the temporal dynamics of genome fractionation. Curr. Opin. Genet. Dev. 49, 1–7 (2018).

63. 63.

Cheng, F. et al. Gene retention, fractionation and subgenome differences in polyploid plants. Nat. Plants 4, 258–268 (2018).

64. 64.

Session, A. M. et al. Genome evolution in the allotetraploid frog Xenopus laevis. Nature 538, 336–343 (2016).

65. 65.

Simakov, O. et al. Deeply conserved synteny resolves early events in vertebrate evolution. Nat. Ecol. Evol. 4, 820–830 (2020).

66. 66.

Marcussen, T. et al. Ancient hybridizations among the ancestral genomes of bread wheat. Science 345, 1250092 (2014).

67. 67.

Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).

68. 68.

Luo, J. et al. Tempo and mode of recurrent polyploidization in the Carassius auratus species complex (Cypriniformes, Cyprinidae). Heredity 112, 415–427 (2014).

69. 69.

Liu, X. L. et al. Wider geographic distribution and higher diversity of hexaploids than tetraploids in Carassius species complex reveal recurrent polyploidy effects on adaptive evolution. Sci. Rep. 7, 5395 (2017).

70. 70.

Spring, J. Vertebrate evolution by interspecific hybridisation—are we polyploid? FEBS Lett. 400, 2–8 (1997).

71. 71.

Chen, Z. J. Molecular mechanisms of polyploidy and hybrid vigor. Trends Plant Sci. 15, 57–71 (2010).

72. 72.

Chen, Z. J. Genomic and epigenetic insights into the molecular bases of heterosis. Nat. Rev. Genet. 14, 471–482 (2013).

73. 73.

Renny-Byfield, S. & Wendel, J. F. Doubling down on genomes: polyploidy and crop plants. Am. J. Bot. 101, 1711–1725 (2014).

74. 74.

Feldman, M., Levy, A. A., Fahima, T. & Korol, A. Genomic asymmetry in allopolyploid plants: wheat as a model. J. Exp. Bot. 63, 5045–5059 (2012).

75. 75.

Abi-Rached, L., Gilles, A., Shiina, T., Pontarotti, P. & Inoko, H. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet. 31, 100–105 (2002).

76. 76.

Flajnik, M. F. & Kasahara, M. Origin and evolution of the adaptive immune system: genetic events and selective pressures. Nat. Rev. Genet. 11, 47–59 (2010).

77. 77.

Flajnik, M. F. A cold-blooded view of adaptive immunity. Nat. Rev. Immunol. 18, 438–453 (2018).

78. 78.

Kasahara, M. The 2R hypothesis: an update. Curr. Opin. Immunol. 19, 547–552 (2007).

79. 79.

Kaufman, J. Unfinished business: Evolution of the MHC and the adaptive immune system of jawed vertebrates. Ann. Rev. Immunol. 36, 383–409 (2018).

80. 80.

Ohta, Y., Kasahara, M., O’Connor, T. D. & Flajnik, M. F. Inferring the “primordial immune complex”: origins of MHC Class I and antigen receptors revealed by comparative genomics. J. Immunol. 203, 1882–1896 (2019).

81. 81.

Sambrook, J. G. & Beck, S. Evolutionary vignettes of natural killer cell receptors. Curr. Opin. Immunol. 19, 553–560 (2007).

82. 82.

Makino, T. & McLysaght, A. Interacting gene clusters and the evolution of the vertebrate immune system. Mol. Biol. Evol. 25, 1855–1862 (2008).

83. 83.

Makino, T. & McLysaght, A. Positionally biased gene loss after whole genome duplication: evidence from human, yeast, and plant. Genome Res. 22, 2427–2435 (2012).

84. 84.

Makino, T. & McLysaght, A. Ohnologs in the human genome are dosage balanced and frequently associated with disease. Proc. Natl Acad. Sci. USA 107, 9270–9274 (2010).

85. 85.

McLysaght, A. et al. Ohnologs are overrepresented in pathogenic copy number mutations. Proc. Natl Acad. Sci. USA 111, 361–366 (2014).

86. 86.

Rice, A. M. & McLysaght, A. Dosage sensitivity is a major determinant of human copy number variant pathogenicity. Nat. Commun. 8, 14366 (2017).

87. 87.

Makino, T., McLysaght, A. & Kawata, M. Genome-wide deserts for copy number variation in vertebrates. Nat. Commun. 4, 2283 (2013).

88. 88.

Teh, Y. W., Newman, D. & Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv. Neural Inf. Process. Syst. 19, 1353–1360 (2007).

89. 89.

Asuncion, A., Welling, M., Smyth, P. & Teh, Y. W. On smoothing and inference for topic models. In Proc. Twenty-Fifth Conference on Uncertainty in Artificial Intelligence 27–34 (2009).

90. 90.

Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).

91. 91.

Blei, D. M. Probabilistic topic models. Commun. ACM 55, 77–84 (2012).

92. 92.

Liu, J. S. & Lawrence, C. E. Bayesian inference on biopolymer models. Bioinformatics 15, 38–52 (1999).

93. 93.

Auger, I. E. & Lawrence, C. E. Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol. 51, 39–54 (1989).

94. 94.

Muffato, M. Reconstruction de génomes ancestraux chez les vertébrés. Université d’Evry-Val d’Essonne (2010).

## Acknowledgements

This work is supported by funding from the European Research Council Grant Agreements 309834 and 771419 to A.McL. and the Biomedical Research Council of A*STAR, Singapore to B.V. We acknowledge the National Supercomputing Centre of Singapore for providing computational resources for this project. We thank Karsten Hokamp for technical assistance for computational analysis and Anthony Redmond for critical reading of the manuscript.

## Author information

Authors

### Contributions

B.V. and A.McL. conceived and coordinated the project. P.S., V.R., N.E.P., A.P. and B.V. sequenced and annotated the genomes; Y.N. and A.McL. performed genome reconstruction. Y.N., A.McL. and B.V. wrote the manuscript.

### Corresponding authors

Correspondence to Aoife McLysaght or Byrappa Venkatesh.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information Nature Communications thanks Daniel Ocampo Daza, Jeramiah Smith and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Nakatani, Y., Shingate, P., Ravi, V. et al. Reconstruction of proto-vertebrate, proto-cyclostome and proto-gnathostome genomes provides new insights into early vertebrate evolution. Nat Commun 12, 4489 (2021). https://doi.org/10.1038/s41467-021-24573-z

• Accepted:

• Published: