The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution

Smith, Jeramiah J.; Timoshevskaya, Nataliya; Ye, Chengxi; Holt, Carson; Keinath, Melissa C.; Parker, Hugo J.; Cook, Malcolm E.; Hess, Jon E.; Narum, Shawn R.; Lamanna, Francesco; Kaessmann, Henrik; Timoshevskiy, Vladimir A.; Waterbury, Courtney K. M.; Saraceno, Cody; Wiedemann, Leanne M.; Robb, Sofia M. C.; Baker, Carl; Eichler, Evan E.; Hockman, Dorit; Sauka-Spengler, Tatjana; Yandell, Mark; Krumlauf, Robb; Elgar, Greg; Amemiya, Chris T.

doi:10.1038/s41588-017-0036-1

Article
Open access
Published: 22 January 2018

Nature Genetics volume 50, pages 270–277 (2018)

A Publisher Correction to this article was published on 17 September 2018

A Publisher Correction to this article was published on 19 April 2018

This article has been updated

Abstract

The sea lamprey (Petromyzon marinus) serves as a comparative model for reconstructing vertebrate evolution. To enable more informed analyses, we developed a new assembly of the lamprey germline genome that integrates several complementary data sets. Analysis of this highly contiguous (chromosome-scale) assembly shows that both chromosomal and whole-genome duplications have played significant roles in the evolution of ancestral vertebrate and lamprey genomes, including chromosomes that carry the six lamprey HOX clusters. The assembly also contains several hundred genes that are reproducibly eliminated from somatic cells during early development in lamprey. Comparative analyses show that gnathostome (mouse) homologs of these genes are frequently marked by polycomb repressive complexes (PRCs) in embryonic stem cells, suggesting overlaps in the regulatory logic of somatic DNA elimination and bivalent states that are regulated by early embryonic PRCs. This new assembly will enhance diverse studies that are informed by lampreys’ unique biology and evolutionary/comparative perspective.

Main

The sea lamprey is a member of an ancient lineage that diverged from the vertebrate stem approximately 550 million years ago (MYA). By virtue of this deep evolutionary perspective, lamprey has served as a critical model for understanding the evolution of several conserved and derived features that are relevant to broad fields of biology and biomedicine. Studies have used lampreys to provide perspective on the evolution of developmental pathways that define vertebrate embryogenesis^1,2, vertebrate nervous and neuroendocrine systems^2,3, genome structure⁴, immunity⁵, clotting⁶ and other features⁷. These studies show aspects of vertebrate biology that have been conserved over deep evolutionary time and identify evolutionary modifications that gave rise to novel features that emerged within the jawed vertebrate lineage (gnathostomes). Lampreys also possess several features that are not observed in gnathostomes, which could represent either aspects of ancestral vertebrate biology that have not been conserved in the gnathostomes or features that arose since the divergence of the ancestral lineages that gave rise to lampreys and gnathostomes. These include the ability to achieve full functional recovery after complete spinal cord transection, the deployment of evolutionarily independent yet functionally equivalent adaptive immune receptors, and the physical restructuring of the genome during development known as programmed genome rearrangement (PGR).

PGR results in the physical elimination of ~0.5 Gb of DNA from the organism’s ~2.3-Gb genome^8,9,10. The elimination events that mediate PGR are initiated at the 7th embryonic cell division and are essentially complete by 3 days post fertilization^11,12. As a result, lampreys are effectively chimeric, with germ cells possessing a full complement of genes and all other cell types possessing a smaller, reproducible fraction of the germline genome. Previous analyses support the idea that the somatic genome lacks several genes that contribute to the development and maintenance of germ cells but are potentially deleterious if misexpressed in somatic lineages. However, our understanding of the mechanisms and consequences of PGR remains incomplete, as only the smaller, somatically retained fraction of the genome (lacking ~0.5 Gb of sequence that is invariably specific to the germline) has been sequenced to date.

In contrast to the germline genome, the somatically retained portions of the genome are relatively well characterized. Because it was not known until 2009 that lampreys were subject to PGR⁸, sequencing efforts focused on somatic tissues from which DNA or intact nuclei could be readily obtained (e.g., blood and liver)¹³. Sequencing of the sea lamprey somatic genome followed an approach that had proven successful for other vertebrate genomes before the advent of next-generation sequencing technologies (Sanger sequencing of clone ends, fosmid ends and BAC ends). Because of the abundance of highly identical interspersed repetitive elements and moderately high levels of polymorphism (approaching 1%), assembly of the somatic genome resulted in a consensus sequence that was substantially more fragmentary than other Sanger-based vertebrate assemblies¹⁴. Nonetheless, this initial assembly yielded significant improvements in our understanding of the evolution of vertebrate genomes and fundamental aspects of vertebrate neurobiology, immunity and development^1,2,3,4,5,6,7.

Here we present the first assembly of the sea lamprey germline genome. Through extensive optimization of assembly pipelines, we identified a computational solution that allowed us to generate an assembly from next-generation sequence data (Illumina and Pacific Biosciences reads) that surpasses the existing Sanger-based somatic assembly. Analysis of the resulting assembly identifies several hundred genes that are eliminated from somatic tissues by PGR and sheds new light on the evolution of genes and functional elements in the wake of ancient large-scale duplication events.

Results

Assembly and annotation of the sea lamprey genome

Several shotgun-sequencing and scaffolding data sets were generated in order to permit assembly of the sea lamprey germline genome (>100× sequence coverage in Illumina paired-end reads, >300× physical coverage in 4-kb Illumina mate pairs and >600× physical coverage in 40-kb Illumina mate pairs). Previous analyses demonstrated that the lamprey genome is highly repetitive, and initial analysis of Illumina shotgun sequence data confirmed that the repeat content of lamprey (~60% high-identity repeats) is substantially higher than that of human (Fig. 1). To enable the development of a highly contiguous assembly, we also generated ~17× genome coverage in single-molecule long-read data (Pacific Biosciences XL/C2 chemistry, N₅₀ read length = 5,424 bp) and performed hybrid assembly using DBG2OLC¹⁵. This approach yielded an assembly with contiguity statistics (23,286 contigs, N₅₀ = 164,585 bp) that rivaled those of a previously published Sanger-based assembly of the lamprey somatic genome¹³. To further improve the large-scale structure of this assembly, we integrated scaffolding data (~56× coverage in BioNano optical mapping: >150 kb molecules, and 325 million Chicago (Dovetail) linked read pairs: 2 × 152 bp), as well as published meiotic mapping data⁴. Linkages identified through these three independent data sets were cross-validated and integrated using AllMaps (Fig. 2)¹⁶. This integrated scaffolding approach allowed us to further increase the contiguity of the assembly (12,077 contigs, N₅₀ = 12 Mb, N₅₀ contig number = 34). In total, 74.8% of the current germline genome assembly is anchored to one of 94 previously defined linkage groups⁴, and >80% of the assembly is present in super-scaffolds that are 1 Mb or longer. Given that the sea lamprey has 99 pairs of chromosomes in its germline, this integrated assembly appears to approach chromosome-scale contiguity.
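The contiguity statistics quoted above (N₅₀ and the N₅₀ contig number) can be illustrated with a minimal sketch. The function name `n50_stats` and the toy lengths are hypothetical and are not taken from the assembly; the sketch assumes the conventional definition in which N₅₀ is the length of the shortest sequence in the smallest set of longest sequences that together span at least half of the assembly.

```python
def n50_stats(lengths):
    """Return (N50, N50 contig number) for a list of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    length; the N50 contig number is the size of that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive numbers")


# Toy example with hypothetical lengths (not values from the assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, n50_count = n50_stats(toy_lengths)
```

Under this definition, a reported N₅₀ of 12 Mb with an N₅₀ contig number of 34 would mean that the 34 longest sequences together span at least half of the assembled length.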

**Fig. 1: Distribution of k-mer copy numbers in germline shotgun sequencing data.**

**Fig. 2: Long-range scaffolding and assessment of long-range contiguity of lamprey super-scaffolds.**

Our long-range scaffolding approach used three independent methods that extend and cross-validate one another (Fig. 2), and we consider strong agreement among these three methods as evidence that the large-scale structure of the assembly accurately reflects the structure of P. marinus chromosomes. For many vertebrates, it is possible to independently assess long-range contiguity by measuring conservation of gene orders with closely related species. Highly contiguous assemblies are not yet available for any other jawless vertebrate, although an unanchored draft assembly does exist for the Arctic lamprey (Lethenteron camtschaticum: syn. Lethenteron japonicum)¹⁷. To provide perspective on the chromosomal structure of a closely related species, we developed a meiotic map for the Pacific lamprey (Entosphenus tridentatus). The species is a representative of a clade of lampreys (genera Entosphenus, Lethenteron and Lampetra) that diverged from the lineage represented by Petromyzon ~40 MYA¹⁸, and embryos of known parentage are available through ongoing hatchery efforts aimed at restoring the species to its native waterways in the US Pacific Northwest¹⁹. Meiotic mapping was performed using restriction site–associated DNA (RAD) sequencing of 94 F₁ siblings generated from a controlled cross between two wild-captured individuals. The resulting meiotic map provides dense coverage of the genome and represents 83 linkage groups, covering 9,956 cM with an average intermarker distance of 3.4 cM (Supplementary Table 1). Alignment of RAD markers to the sea lamprey genome identified 1,733 homologous sequences, which show strong conservation of synteny and gene order (Fig. 3, Supplementary Table 1). This broad conservation of gene order is considered strong evidence that the sea lamprey assembly and Pacific lamprey meiotic map accurately reflect the chromosomal structure of their respective species.

**Fig. 3: Alignment of the Pacific lamprey (*E. tridentatus*) meiotic map to assembled sea lamprey (*P. marinus*) super-scaffolds.**

The repetitive nature of the lamprey genome presents challenges not only to its assembly but also to the identification of genes within assembled contigs. This is largely attributable to the interspersion of transposable coding sequences within and among the coding sequences of low-copy genes. To circumvent these issues, we used a two-tiered approach to gene prediction. Annotation and identification of repetitive elements was performed using RepeatModeler and RepeatMasker^20,21. The entire set of annotated repeats, published gene models and transcriptomic data sets^10,13 were integrated to generate a conservative set of 18,205 gene predictions using MAKER²². After generating initial gene calls, a second round of gene predictions was generated that permitted extraction of gene models that include low-copy repetitive sequences, yielding another 2,745 gene models, for a total of 20,950 MAKER gene models. In total, MAKER was able to assign 18,367 of these gene models to a likely vertebrate homolog on the basis of multispecies BLAST alignments, which included the vast majority of single-copy orthologs expected for lamprey (Supplementary Note)^23,24. An additional 2,583 genes (12%) could not be immediately assigned a homolog on the basis of multispecies alignments. Although these may represent lamprey-specific genes, careful manual curation is likely to be necessary to define their precise evolutionary origins. Such efforts will be enabled through the publicly available genome browser (see URLs). This annotation set was subsequently used to identify the location of 35,382 long noncoding RNA (lncRNA) transcripts in 18,857 lncRNA gene bodies (Supplementary Note, Supplementary Table 2 and Supplementary Fig. 1). These and other annotation sets, including RNA sequencing and genome re-sequencing tracks, are available through SIMRbase (see URLs).

Vertebrate genome evolution

Lamprey occupies a critical phylogenetic position with respect to reconstructing ancestral karyotypes and inferring the timing and mode of duplication events that occurred in ancestral vertebrate and gnathostome lineages. Alignment to the chicken²⁵ and gar²⁶ genomes (Supplementary Tables 3–5) permits reconstruction of ancestral orthology groups that are highly consistent with previous reconstructions based on the lamprey meiotic map⁴. Because these comparisons require resolution of homologies that are the product of duplication (i.e., 1:1 orthology is not expected) our operational definition of “orthology groups” is expanded to include higher-order relationships (see ref. ⁴ for more detail). Inclusion of comparative mapping data from the recently published gar genome assembly provides further support for the observation that the majority of ancestral vertebrate chromosomes experienced a single large-scale duplication event in the ancestral vertebrate lineage (Fig. 4, Supplementary Fig. 2). Most ancestral orthology groups correspond to two derived chicken chromosomes (6/11 chicken–lamprey orthology groups identified here). Three other orthology groups possess four derived chromosomes, suggesting that these groups have experienced an additional large-scale duplication: these include well-defined fourfold orthology regions harboring the HOX and MHC clusters in one orthology group, the NPYR locus and ParaHOX cluster in a second, and the RAR and ALDH1 loci in a third⁴ (Fig. 4). Two remaining orthology groups present more complex ratios of ancestral to derived chromosomes. Notably, though, comparative mapping with gar shows that chicken chromosome 26 and a portion of chicken chromosome 1 were likely fused in the bony vertebrate (Euteleostome) ancestor approximately 450 MYA and subsequently experienced a derived fission in the chicken lineage. 
Other deviations from 1:2 or 1:4 are interpreted as the product of derived fission/fusion events that occurred during the first 150 MY following divergence of basal lamprey and gnathostome lineages, derived fission/fusion events in the lamprey lineage, or misassembled regions of the lamprey genome. Although it is possible that the observed genome-wide patterns of conserved synteny could have arisen through two whole-genome duplication events (the 2R hypothesis)^27,28 accompanied by large numbers of chromosome losses^29,30, a previously proposed alternative scenario involving one whole-genome duplication preceded by three distinct chromosome-scale duplication events requires fewer evolutionary steps and is consistent with the data underlying all previous reconstructions⁴.

**Fig. 4: The distribution of conserved syntenies in chicken and lamprey shows patterns of ancient large-scale duplication.**

Lamprey HOX clusters: duplication and divergence

Historically, descriptions of genome duplications have relied heavily on the HOX gene clusters. This is partially due to their highly conserved organization with respect to gene order and orientation, which contributes to the generation of coordinated patterns of axial expression (collinearity) associated with their roles in embryonic development. Assembly of the Arctic lamprey genome led to the tentative prediction of (at least) six, and possibly eight, HOX clusters, suggesting that the duplication history of at least the lamprey HOX-bearing chromosomes differs from that in the jawed vertebrates¹⁷. We identify 42 HOX genes in the sea lamprey, which all fall within six HOX clusters that are highly similar in content to the HOX clusters predicted in the Arctic lamprey (Fig. 5a, Supplementary Figs. 3 and 4). Additionally, we are able to place these in their broader chromosomal context, showing that these six HOX clusters are embedded in larger chromosomal regions that share conserved synteny with the presumptive ancestral HOX-bearing chromosome (Fig. 4).

**Fig. 5: Structure and evolution of HOX clusters.**

In principle, several duplication scenarios could explain the existence of six paralogous HOX-bearing chromosomes. These include: (1) whole-genome duplication followed by triplication, or vice versa; (2) a gnathostome-like duplication history (either 2R accompanied by large numbers of chromosome losses^29,30 or one whole-genome duplication preceded by three chromosome-scale duplication events⁴) followed by a further round of whole-genome duplication (yielding eight ancestral HOX clusters) and loss of two entire paralogous chromosomes; (3) a gnathostome-like duplication history followed by duplication of two individual chromosomes. Initial synteny comparisons between lamprey and gnathostome HOX loci showed no clear orthology relationships, but showed substantial similarities in the gene content of lamprey HOX-ε and HOX-β clusters. Notably, phylogenetic analyses of paralogy groups with ≥4 retained copies (HOX4, HOX8, HOX9, HOX11 and HOX13) also show no clear orthology between lamprey and gnathostome clusters, but they reproducibly place members of HOX-ε and HOX-β clusters in sister clades with high bootstrap support (Fig. 5b, Supplementary Figs. 5–9). Taken at face value, this would seem to suggest that the ε and β clusters diverged from one another more recently than other paralogous clusters, apparently lending support to scenario 3. Alternatively, this might also reflect greater functional constraint with respect to the membership of these clusters.

To gain further perspective on the duplication history of lamprey HOX clusters, we extended the analyses to compare the chromosome-wide distribution of two-copy paralogs on all HOX-bearing chromosomes. Because post-duplication patterns of conserved synteny are strongly driven by paralog loss, we reasoned that more recent duplication events should yield pairs of chromosomes that share more two-copy duplications, exclusive of all other paralogous chromosomes (the latter of which would have experienced more extensive loss of redundant paralogs over time). Two pairs of chromosomes were observed to share more duplicates relative to all other pairwise combinations of HOX-bearing chromosomes. The strongest enrichment of two-copy paralogs was observed between super-scaffolds 5 and 16 (χ² = 14.22, P = 1.6 × 10⁻⁴, d.f. = 1, Fig. 5, Supplementary Table 6), which carry the HOX-ε and HOX-β clusters. In conjunction with the internal structure of HOX clusters and consistent phylogenetic clustering of ε and β HOX members, we interpret this as indicating that the ε- and β-bearing chromosomes trace their ancestry to a chromosome-scale duplication event that occurred substantially more recently than the genome- and chromosome-scale duplication events that define all other pairwise contrasts, perhaps within the last 200–300 MY. Only one other pair of chromosomes shows significant enrichment of two-copy paralogs relative to all other contrasts. The chromosomes bearing HOX-α and HOX-δ clusters are enriched in shared two-copy paralogs (χ² = 8.41, P = 3.7 × 10⁻³, d.f. = 1, Fig. 5, Supplementary Table 6), although α and δ HOX members show no consistent pattern of clustering within gene trees. This difference could be interpreted as indicating that these two chromosomes are the product of a slightly older duplication event, or alternatively it might reflect differential constraints relative to the retention of duplicates by individual pairs of paralogous chromosomes. 
However, it is unclear what processes might constrain the evolution of one pair of paralogous chromosomes relative to all others.
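As a worked illustration of how a χ² statistic with one degree of freedom maps to a P value, the sketch below computes the upper-tail probability using only the Python standard library. It assumes an ordinary upper-tail χ² test, which the text does not state explicitly; the function name `chi2_sf_df1` is hypothetical, and the underlying contingency tables (Supplementary Table 6) are not reproduced here.

```python
import math


def chi2_sf_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 d.f."""
    # With 1 d.f. the statistic is the square of a standard normal
    # deviate Z, so P(X > chi2) = P(|Z| > sqrt(chi2))
    #                           = erfc(sqrt(chi2 / 2)).
    if chi2 < 0:
        raise ValueError("chi2 must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


# Statistics quoted in the text for the two enriched chromosome pairs.
p_epsilon_beta = chi2_sf_df1(14.22)  # super-scaffolds 5 and 16
p_alpha_delta = chi2_sf_df1(8.41)    # HOX-alpha and HOX-delta chromosomes
```

Under this assumption the two values come out at roughly 1.6 × 10⁻⁴ and 3.7 × 10⁻³, in line with the P values quoted for the ε/β and α/δ contrasts.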

Programmed genome rearrangement

Identification of eliminated DNA

In lampreys approximately 20% of zygotically inherited DNA is eliminated from somatic cell lineages during early embryogenesis, being retained only by the germline^8,10,31. To identify germline-enriched (i.e., somatically eliminated) regions, we generated whole-genome shotgun sequence data for both sperm (73× coverage) and blood (80× coverage) DNAs that were isolated from the same individual. Analysis of read counts identified 1,077 super-scaffolds with enrichment scores (log₂(standardized sperm coverage/blood coverage)) exceeding 2, over more than 80% of the scaffold (Fig. 6, Supplementary Table 7). These presumptively germline-specific regions cover ~13 Mb of the genome assembly and contain 356 annotated protein coding genes. The distribution of enrichment scores also suggests that other regions with lower enrichment scores are likely to be affected by PGR. To further evaluate our predictions, we designed primers for the 96 longest super-scaffolds with enrichment scores of 2 or higher. In total, primers from 90 (94%) of these scaffolds yielded specific amplification in testes relative to blood, confirming that they are deleted during PGR (Supplementary Table 8).
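The enrichment score defined above, log₂(standardized sperm coverage/blood coverage), together with the requirement that the score exceed 2 over more than 80% of a scaffold, can be sketched as follows. This is a simplified illustration and not the DifCover implementation: it assumes that "standardized" means dividing each sample's coverage by its nominal sequencing depth (73× and 80×), that scaffolds are scored in equal-sized windows, and that a small pseudocount is added to avoid division by zero. All function and parameter names are hypothetical.

```python
import math


def enrichment_score(sperm_cov, blood_cov,
                     sperm_depth=73.0, blood_depth=80.0, pseudocount=0.5):
    """log2 ratio of depth-standardized sperm to blood coverage.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite for windows with no blood reads.
    """
    std_sperm = (sperm_cov + pseudocount) / sperm_depth
    std_blood = (blood_cov + pseudocount) / blood_depth
    return math.log2(std_sperm / std_blood)


def is_germline_enriched(window_scores, threshold=2.0, min_fraction=0.8):
    """True if more than min_fraction of windows score above threshold."""
    if not window_scores:
        return False
    n_enriched = sum(score > threshold for score in window_scores)
    return n_enriched / len(window_scores) > min_fraction


# Hypothetical per-window coverages (sperm, blood) for one scaffold.
windows = [(70, 4), (75, 6), (73, 5), (71, 3), (74, 5)]
scores = [enrichment_score(s, b) for s, b in windows]
flagged = is_germline_enriched(scores)
```

A window covered at the nominal depth in both samples scores close to 0, whereas a window that is well covered in sperm but nearly absent from blood scores well above the threshold of 2.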

**Fig. 6: Germline enrichment of single- and low-copy DNA sequences.**

Notably, the estimates above only account for single-copy DNA of sufficient complexity to yield unique alignments. Eliminated sequences with retained paralogs or that contain low-copy repetitive elements are expected to show relatively lower enrichment scores. To gain further insight into elimination of repetitive DNA, we performed similar analyses targeting repetitive sequences (Supplementary Note). These analyses identify an additional 102 Mb of eliminated sequence that can be directly assigned to assembly-amenable repetitive sequences and indicate that remaining fractions of the germline-specific subgenome likely consist of arrays of short or incomplex/simple repetitive sequence that are less amenable to sequencing, mapping or assembly (Supplementary Note and Supplementary Fig. 10).

Function of PGR

It has been proposed that PGR serves to prevent the expression of genes with beneficial functions in the germline and deleterious functions in soma (such as oncogenesis and aging)^8,10,12. To gain further insight into the functions of eliminated genes and the underlying evolutionary logic of PGR, we asked whether human homologs of eliminated genes are enriched for defined functional categories. In interpreting these ontology enrichment studies, it is important to recognize that these analyses define a single human or mouse ortholog for each lamprey gene. While this scenario does not accurately reflect duplication events that have structured lamprey and gnathostome genomes, or divergence in gene functions over more than 500 MY of independent evolution, such analyses are expected to provide some (albeit conservative) perspective on the likely function of lamprey genes. Despite this deep divergence, ontology analyses showed enrichment for several categories, notably pathways related to oncogenesis, including regulation of cell division, epithelial migration, adhesion and cell fate commitment (Supplementary Table 9, Supplementary Note).

While ontology analyses provide some insight into the likely functions of eliminated genes, it is important to recognize that curated ontology databases do not capture all of the biological functions that are encoded in the genome. To gain additional insight into the functional consequences of PGR, we searched for enrichment of eliminated orthologs among 645 chromatin immunoprecipitation (ChIP) experiments (ChEA 2016)^32,33 (Supplementary Table 10). To identify subcategories of enriched ChIP data sets, we performed two-way hierarchical clustering of presence–absence calls from the top 50 enriched ChIP data sets. These analyses showed two distinct categories of lamprey genes and ChIP experiments (Fig. 7). One cluster (denoted C1; Fig. 7) corresponds to the binding sites of PRC genes in mouse embryonic stem cells, apparently indicating that these genes may be marked by bivalent promoters in embryonic stem cells (ESCs) and then presumably released from silencing in germline at later developmental stages. To test this idea, we more closely examined a cluster of genes (denoted GS3) that was highly enriched within C1 ChIP experiments. Notably, all of these genes were previously found to be marked by bivalent (poised) promoters in murine ESCs and primordial germ cells³⁴ (bivalent in ESCs: 16/16, χ² = 77.0, P = 8.8 × 10⁻¹⁹, d.f. = 1; bivalent in primordial germ cells (PGCs): 15/16, χ² = 47.3, P = 3.1 × 10⁻¹², d.f. = 1). A second cluster of eliminated genes (denoted GS1) also showed strong enrichment for these two functional categories (bivalent in ESCs: 14/22, χ² = 34.6, P = 2.0 × 10⁻⁹, d.f. = 1; bivalent in PGCs: 14/22, χ² = 23.2, P = 7.5 × 10⁻⁷, d.f. = 1).

**Fig. 7: Enrichment analysis provides insight into the function of germline-specific sequences.**

Other enriched ChIP experiments (C2) correspond primarily to the binding targets of transcriptional modifiers in embryonic stem cells (N = 7), embryonic progenitor lineages (N = 7) and transcriptional activators in cancer (N = 15; Fig. 7). Notably, all but one (PCDHGB5) of the genes detected in C1 are present in one or more experiments in C2. Overall, comparisons with ChIP analyses performed in non-eliminating species lend further support to the idea that PGR acts to prevent misexpression of ‘germline’ genes and suggest that misexpression of orthologous genes may directly contribute to oncogenesis in a diverse range of cancers. Moreover, these comparative analyses provide new insight into the regulatory functions of PGR by finding overlap between early gene-silencing events that are achieved by PGR and those that are mediated by the PRC during differentiation of germline and soma.

Discussion

The lamprey genome presents an interesting target for sequencing because of its phylogenetic position and unique genome biology, yet a particularly challenging target given its high repeat content and divergence from other species with highly contiguous assemblies. In an attempt to resolve this complexity, we leveraged several complementary technologies to generate a highly contiguous assembly that approaches the scale of entire chromosomes. Moreover, we were able to validate the chromosome-scale contiguity of our assembly by generating a dense meiotic map for a related species. The high contiguity of our assembly provides critical context for understanding the evolution of gene content and genome structure in vertebrates. Here we highlighted the utility of this assembly in addressing fundamental questions related to understanding changes in large-scale structure of vertebrate genomes, specifically reconstructing the deep evolutionary origins of vertebrate chromosomes and understanding how PGR mediates genetic conflicts between germline and somatic tissues.

Our improved assembly permits robust resolution of a complement of ancestral chromosomes that existed before the divergence of ancestral gnathostome and agnathan lineages and prior to whole-genome duplication(s) within the shared ancestral lineage of all extant vertebrates. These reconstructions largely validate previous analyses that were performed using meiotic mapping data, but they provide improved resolution of ancestral homology groups. Analyses also lend further support to the idea that chromosome-scale duplication events may have been more common over the course of vertebrate ancestry than has been appreciated from the analysis of bony vertebrate genomes. Parallel lines of evidence supporting a relatively recent duplication that gave rise to the lamprey HOX-ε- and HOX-β-bearing chromosomes further highlight the potential for large-scale duplication outside the context of whole-genome duplication. It appears that two features of lamprey biology might favor the fixation of chromosomal duplications. First, lampreys possess a large number of small chromosomes, and consequently chromosomal duplications will generally impact fewer genes than similar events in human. Duplication events (in addition to a single presumptive whole-genome duplication) appear to have affected other groups of lamprey chromosomes, though not all (Supplementary Fig. 11). Second, individuals are highly fecund (~100,000 eggs per female), and therefore a single mutant can introduce thousands of carriers (including stable carriers) into a population^4,35,36,37. While it is likely that the reproductive biology and distribution of chromosome sizes have fluctuated over the course of vertebrate evolution, available evidence suggests that lampreys have possessed similar karyotypes and reproductive biologies for hundreds of millions of years.
As such, extant lampreys may represent a better model for conceptualizing phases of evolution during which ancestral vertebrates were characterized by higher fecundity and larger numbers of relatively gene-poor microchromosomes, in addition to providing phylogenetic perspective on early stages of vertebrate genome evolution.

The assembly also identifies a large number of genes that are reproducibly eliminated via PGR. Enrichment analyses reveal a strong overlap in the targets of PGR-mediated elimination and the targets of silencing via PRC proteins in embryonic stem cells. The PRC is a deeply conserved complex that plays roles in gene silencing related to the maintenance of stem cell identity, silencing of oncogene expression and X-chromosome inactivation, among other functions^38,39. These well-defined functions of the PRC mirror several aspects of PGR, particularly in that both act to achieve strong transcriptional silencing and both appear to target an overlapping subset of proto-oncogenes. It is interesting to speculate that the overlapping targets of PGR and the PRC may indicate that these two modes of silencing share common underlying mechanisms. However, it is notable that PRC repression is strongly associated with the deposition of, and binding to, trimethylated lysine 27 of histone H3 (H3K27me3), whereas previous studies have shown that this mark is absent prior to the onset of PGR in lamprey embryos¹¹. It therefore appears that PGR acts, in part, to regulate a subset of germline-expressed targets of the PRC and that it may work upstream of the PRC in lamprey embryos.

The analyses presented here address a focused set of topics that are specifically related to understanding the evolution and development of genome structure in lamprey and other vertebrates. We anticipate that this assembly will substantially improve our ability to use lamprey as a comparative evolutionary model. Because sequences are anchored to their broader chromosomal structure, the current assembly should enhance the ability to reconstruct the deep evolutionary history of the vast majority of genes within vertebrate genomes and perform robust tests of hypotheses related to historical patterns of duplication and divergence. Moreover, the availability of a highly contiguous assembly for an agnathan species should aid in the development and analysis of other genome assemblies from this highly informative vertebrate lineage.

URLs

SIMRbase/Lamprey Genome Browser, https://genomes.stowers.org/organism/Petromyzon/marinus; DifCover, https://github.com/timnat/DifCover; RepeatMasker, http://www.repeatmasker.org. Original data pertaining to the Chicago assembly (Dovetail) and HOX cluster curation can be accessed from the Stowers Original Data Repository at http://www.stowers.org/research/publications/LIBPB-1215.

Methods

Research animals

This study complied with all relevant ethical guidelines and was performed under protocol number 2011-0848 (University of Kentucky Institutional Animal Care and Use Committee).

Sequencing

Fragment libraries were prepared by Covaris shearing of sperm genomic DNA isolated from a single individual and size selected to achieve average insert sizes of ~205 and 231 bp. These libraries were sequenced on the Illumina HiSeq2000 platform. Two separate 4-kb mate pair libraries were generated. One 4-kb library was prepared and sequenced by the Genomic Services Laboratory at HudsonAlpha (Huntsville, AL) and another was prepared and sequenced using the standard Illumina mate-pair kit. Two 4-kb libraries were prepared and sequenced by Lucigen (Middleton, WI). Long reads were prepared by the University of Florida Interdisciplinary Center for Biotechnology Research (Gainesville, FL) and sequenced using Pacific Biosciences (Menlo Park, CA) XL/C2 chemistry on a Single Molecule, Real-Time (SMRT) Sequencing platform.

Hybrid assembly

Hybrid assembly of Illumina fragment reads and Pacific Biosciences single-molecule reads was performed using the programs SparseAssembler⁴² and DBG2OLC¹⁵. The first 159 Gb of the high-quality paired-end reads were used to construct short but accurate de Bruijn graph contigs using the program SparseAssembler⁴² with k-mer size 51 and a skip length of 15. The program DBG2OLC¹⁵ was then used to map short contigs to PacBio SMRT sequencing reads and generate a hybrid assembly. Each PacBio read was compressed using high-quality short-read contigs and aligned to all other reads for structural error correction wherein chimeric PacBio reads are identified and trimmed. A read-overlap-based assembly graph was generated and unbranched linear regions of the graph were output as the initial assembly backbones. Consensus sequences for the backbones were generated by joining overlapped raw sequencing reads and short-read contigs. In practice, many regions of the initial consensus sequences can be erroneous due to the high error rates of the PacBio reads. In order to polish each backbone, all related PacBio reads and contigs are first collected and realigned using Sparc⁴³ to calculate the most likely consensus sequence for the genome.
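As a rough illustration of the sparse k-mer idea behind the first step, the sketch below stores only a subset of the k-mers in a read rather than every one. The function name `sparse_kmers` and the toy read are hypothetical, and treating the skip length as "keep one k-mer every 15 positions" is an assumption made for illustration; it is not SparseAssembler's actual implementation. The defaults reuse the k-mer size (51) and skip length (15) reported above.

```python
def sparse_kmers(read, k=51, skip=15):
    """Return (start, k-mer) pairs taken every `skip` bases along a read.

    Illustrative assumption: one k-mer is kept per `skip` positions. This is a
    simplified reading of sparse sampling, not SparseAssembler's own code.
    """
    if k <= 0 or skip <= 0:
        raise ValueError("k and skip must be positive")
    last_start = len(read) - k  # negative when the read is shorter than k
    return [(i, read[i:i + k]) for i in range(0, last_start + 1, skip)]


# Toy 100-base read (hypothetical data).
read = "ACGT" * 25
sampled = sparse_kmers(read)
total = len(read) - 51 + 1
print(f"stored {len(sampled)} of {total} possible 51-mers")
```

Storing fewer k-mers reduces the memory needed to hold the graph, which is the motivation usually given for sparse sampling on large genomes.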

Scaffolding

Scaffolding of the hybrid assembly was performed using SSPACE 2.0⁴⁴ to incorporate mate pair data, followed by ALLMAPS version 0.5.3¹⁶ to incorporate optical mapping (BioNano), linked-read (Dovetail) and previously published meiotic mapping data⁴. Scaffolding by SSPACE imposed a stringent scaffolding threshold requiring 5 or more consistent linkages to support scaffolding of any pair of contigs. Scaffolding via ALLMAPS was implemented with default parameters and with equal weights assigned to all three types of mapping data with initial anchoring to meiotic maps. For scaffolds without linkage mapping data, additional ALLMAPS runs were performed using the remaining data sets. Conflicts among the three mapping methods were resolved by majority rule or by manually breaking contigs that could not be placed by majority rule.
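The two decision rules described above (a minimum of 5 consistent linkages for SSPACE joins, and majority rule among the three mapping data sets) can be sketched as follows. The function names, the dictionary layout and the example contig and linkage-group labels are all hypothetical; neither SSPACE nor ALLMAPS exposes these functions, and the sketch only mirrors the logic stated in the text.

```python
from collections import Counter

MIN_LINKS = 5  # linkage threshold stated in the text for SSPACE scaffolding


def supported_joins(link_counts, min_links=MIN_LINKS):
    """Keep contig pairs supported by at least `min_links` consistent links."""
    return {pair: n for pair, n in link_counts.items() if n >= min_links}


def majority_placement(placements):
    """Return the linkage group backed by a strict majority of maps, else None.

    `placements` maps a data type to the group it supports. None signals a
    conflict that would need manual inspection (for example, breaking a contig).
    """
    if not placements:
        return None
    group, count = Counter(placements.values()).most_common(1)[0]
    return group if count > len(placements) / 2 else None


# Hypothetical example data.
links = {("ctg1", "ctg2"): 7, ("ctg2", "ctg3"): 4, ("ctg3", "ctg4"): 5}
print(supported_joins(links))
print(majority_placement({"meiotic": "LG1", "optical": "LG1", "dovetail": "LG7"}))
print(majority_placement({"meiotic": "LG1", "optical": "LG2", "dovetail": "LG7"}))
```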

Meiotic mapping of E. tridentatus

A meiotic map was generated for E. tridentatus using a single outbred adult pair collected from Willamette Falls (Oregon City, OR, USA). Larvae from this pair were artificially propagated in May 2013 at the USGS Columbia River Cook Laboratory (Cook, WA, USA) and reared for 2 weeks until they were sacrificed after hatching at ~10 mm in total length. Restriction site–associated DNA sequencing (RAD-seq⁴⁵) Illumina libraries were prepared using a modified version of a previously published protocol⁴⁶. A total of 250 ng of DNA from each sample was added to a 100 μL restriction digest using the SbfI restriction enzyme (New England Biolabs, Ipswich, MA, USA). Each sample was then tagged by ligation of one of 96 unique barcoded adapters (P1 adaptor) to the SbfI site. Once barcoded, the samples were mixed together into three libraries of 96 individuals per library, and approximately 4 μg of each was sheared using a Bioruptor UCD-300 instrument (Diagenode, Denville, NJ, USA). Following sonication, each library was concentrated using the Qiagen MinElute PCR purification kit (Qiagen) in preparation for size selection by agarose gel electrophoresis. Prior to sequencing, RAD-seq libraries were quantified by qPCR with Illumina library quantification standards (Kapa Biosystems Inc, Woburn, MA, USA) on an ABI 7900HT Sequence Detection System (Life Technologies). Libraries were sequenced with single-end 100-bp reads on an Illumina HiSeq2000 sequencer (Illumina Inc., San Diego, CA, USA). Genotypes from the 94 individuals with the greatest marker densities were used to reconstruct a consensus meiotic map from maternal and paternal meiosis. Maximum-likelihood mapping and manual curation were performed using the JoinMap software package with default parameters for an outbred crossing design, except that the number of optimization rounds was increased to ten in order to better optimize the internal ordering of markers^47,48.

Annotation

Identification of repetitive elements

Repeats were identified within assembled scaffolds using RepeatModeler²⁰ and annotated using RepeatMasker version open-4.0.5²¹ (see URLs) and a library of vertebrate repeats from Repbase (repeatmaskerlibraries-20140131).

Identification of coding sequences

Genome annotations were produced using the MAKER^22,49,50 genome annotation pipeline, which supports re-annotation using pre-existing gene models as input. Previous Petromyzon marinus gene models (WUGSC 7.0/petMar2 assembly)⁵¹ were mapped against the new genome assembly into GFF3 format and were used as prior model input to MAKER for re-annotation. SNAP⁵² and Augustus^53,54 were also used with MAKER and were trained using the pre-existing lamprey gene models. Additional input to MAKER included previously published mRNA-seq reads derived from lamprey embryos and testes^10,12,13 and assembled using Trinity⁵⁵, as well as mRNA-seq reads (NextSeq 75–100 bp paired-end) that were derived from whole embryos and dissected heads at Tahara stage 20 and dissected embryonic dorsal neural tubes at Tahara stages 18, 20 and 21. The following protein data sets were also used: Ciona intestinalis (sea squirt)⁵⁶, Lottia gigantea (limpet)⁵⁷, Nematostella vectensis (sea anemone)⁵⁸, Takifugu rubripes (pufferfish)⁵⁹, Branchiostoma floridae (lancelet)⁶⁰, Callorhinchus milii (elephant shark)⁶¹, Xenopus tropicalis (western clawed frog)⁶², Drosophila melanogaster (fruit fly)⁶³, Homo sapiens (human)^64,65, Mus musculus (mouse)⁶⁶, Danio rerio (zebrafish)⁶⁷, Hydra magnipapillata⁶⁸, Trichoplax adhaerens⁶⁹, and the UniProt/Swiss-Prot protein database^70,71. Protein domains were identified in final gene models using the InterProScan domain identification pipeline^72,73,74, and putative gene functions were assigned on the basis of homology identified by BLASTP⁷⁵ searches against the UniProt/Swiss-Prot protein database.

lncRNA annotation

Putative lncRNAs were predicted from RNA-seq reads obtained from brain, heart, kidney, and ovary/testis sampled from two ripe adult individuals (one female, one male). In total, 8 libraries were produced using the Illumina stranded TruSeq mRNA kit (Illumina Inc.). Sequencing (single-end, directional 100 bp) was performed on a HiSeq 2000. The resulting reads were mapped to the germline genome assembly using GSNAP (v2017-04-24)⁷⁶; the resulting BAM files were then assembled into transcript models using StringTie (v1.3.3b)⁷⁷. The following parameters were optimized in order to maximize the number of predicted lncRNAs and reduce the number of assembly artifacts: (1) minimum isoform abundance of the predicted transcripts as a fraction of the most abundant transcript assembled at a given locus (lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts); (2) minimum read coverage allowed for the predicted transcripts; (3) minimum locus gap separation value (reads that are mapped less than 10 bp apart are merged together in the same processing bundle); (4) smallest anchor length (junctions that do not have spliced reads that align across them with at least 10 bases on both sides are filtered out); (5) minimum length allowed for the predicted transcripts (200 bp); (6) minimum number of spliced reads that align across a junction (i.e., junction coverage); (7) removal of monoexonic transcripts. The resulting transcriptomes from each library were then merged into a single GTF file (--merge option in StringTie).
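Two of the criteria listed above, the 200-bp minimum transcript length and the removal of monoexonic transcripts, are simple enough to express as a post hoc filter. The sketch below is only an illustration of those two rules applied to exon coordinates; in the analysis itself they were set as StringTie parameters, and the function names and example transcripts here are hypothetical.

```python
MIN_LENGTH = 200  # minimum transcript length stated in the text (bp)


def transcript_length(exons):
    """Summed exon length for 1-based, inclusive (start, end) pairs as in GTF."""
    return sum(end - start + 1 for start, end in exons)


def keep_transcript(exons, min_length=MIN_LENGTH):
    """True for multi-exonic transcripts that are at least `min_length` bp long."""
    return len(exons) >= 2 and transcript_length(exons) >= min_length


# Hypothetical transcript models: id -> list of exon intervals.
models = {
    "t1": [(100, 250), (400, 520)],   # two exons, 272 bp: kept
    "t2": [(1000, 1500)],             # monoexonic: removed
    "t3": [(10, 60), (200, 260)],     # two exons, 112 bp: too short
}
kept = [tid for tid, exons in models.items() if keep_transcript(exons)]
print(kept)
```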

Transcripts overlapping (in sense) exons of the annotated protein-coding genes were removed using the script FEELnc_filter.pl⁷⁸. The filtered gene models file was then used to compute the Coding Potential Score (CPS) for each of the candidate non-coding transcripts with the script FEELnc_codpot.pl⁷⁸. In the absence of a species-specific lncRNA set, as is the case for P. marinus, the implemented machine-learning strategy requires simulating non-coding RNA sequences to train the model by shuffling the set of mRNAs while preserving their 7-mer frequencies. This approach is based on the hypothesis that at least some lncRNAs are derived from “debris” of protein-coding genes⁷⁹. The simulated data were then used to calculate the CPS cutoff separating coding (mRNAs) from non-coding (lncRNAs) transcripts, using 10-fold cross-validation on the input training files in order to extract the CPS that maximizes both sensitivity and specificity.
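The idea of choosing a score cutoff that balances sensitivity and specificity can be illustrated with a small sketch. It assumes that "maximizes both" can be approximated by maximizing the sum of sensitivity and specificity, which is one common criterion and may differ from the procedure FEELnc implements; cross-validation is omitted, and the function name and scores are hypothetical.

```python
def best_cutoff(coding_scores, noncoding_scores):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.

    A transcript is called coding when its score is >= cutoff. Ties are
    resolved in favor of the lowest cutoff.
    """
    candidates = sorted(set(coding_scores) | set(noncoding_scores))
    best = None
    for cutoff in candidates:
        sens = sum(s >= cutoff for s in coding_scores) / len(coding_scores)
        spec = sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
        if best is None or sens + spec > best[1] + best[2]:
            best = (cutoff, sens, spec)
    return best


# Hypothetical coding potential scores for training sequences.
mrna_scores = [0.9, 0.8, 0.7, 0.4]
simulated_noncoding_scores = [0.1, 0.2, 0.3, 0.6]
print(best_cutoff(mrna_scores, simulated_noncoding_scores))
```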

Analysis of conserved synteny

Analyses of conserved synteny were performed as previously described⁴. Briefly, predicted protein sequences from the lamprey genome were aligned to proteins from the Gar (LepOcu1: GCA_000242695.1) and Chicken (Galgal4: GCA_000002315.2) genome assemblies⁸⁰. All alignments with bitscore ≥100 and ≥90% of the best match (within a species) were considered putative orthologs of each lamprey, chicken or gar gene. Groups of orthologs were filtered to remove those with more than 6 members in any given species. Enrichment of orthologs on chromosomes or chromosomal segments was assessed using χ² tests, incorporating Yates’ correction for continuity and Bonferroni corrections for multiple testing as previously described⁴.
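The ortholog thresholds and the enrichment test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and example values are hypothetical, the enrichment test is reduced to a single 2×2 table, and the P value uses the standard identity for a χ² distribution with one degree of freedom. It is not the code used for the published analyses.

```python
import math


def putative_orthologs(hits, min_bitscore=100.0, frac_of_best=0.90):
    """Filter (subject_id, bitscore) hits for one query within one species.

    Keeps hits with bitscore >= 100 that also reach >= 90% of the best hit.
    """
    passing = [(s, b) for s, b in hits if b >= min_bitscore]
    if not passing:
        return []
    best = max(b for _, b in passing)
    return [s for s, b in passing if b >= frac_of_best * best]


def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * max(0.0, abs(a * d - b * c) - n / 2) ** 2 / denom


def chi2_pvalue_1df(stat):
    """Upper-tail P value for a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical values.
print(putative_orthologs([("g1", 500), ("g2", 460), ("g3", 300), ("g4", 90)]))
stat = yates_chi2(20, 5, 5, 20)
print(stat, bonferroni(chi2_pvalue_1df(stat), n_tests=100))
```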

Identification and characterization of germline-specific/enriched sequences

Single-copy genes

To identify germline-specific regions, we separately aligned paired-end reads from blood and sperm DNA to the germline genome assembly using BWA-MEM (v.0.7.10)⁸¹ with default parameters and filtered to exclude unmapped reads and supplementary alignments (samtools v.1.2 with option: view -F2308)⁸². Initial coverage analysis was implemented using bedtools v2.23.0⁸³ and revealed that the modal coverage of reads from sperm DNA was slightly lower than the coverage of reads from blood, ~73× and ~80×, respectively, but contained a larger amount of low-copy DNA (Supplementary Fig. 12). To identify germline-enriched intervals, data were filtered to remove regions with coverage both from sperm and blood of <10 (underrepresented regions: computed with genomecov -bga, bedtools v2.23.0) and also regions with coverage exceeding three times the modal value in sperm or blood (high-copy regions). The remaining data were processed to generate coverage ratios for discrete intervals containing 1,000 bp (or >500 bp at contig ends) of approximately single-copy sequence. Identification of contiguous intervals and re-estimation of coverage ratios were performed using DNAcopy version 1.42.0⁸⁴ after removing trailing windows that were <500 bp in length. Ontology analyses used naming assignments that were generated using multispecies BLAST alignments via MAKER^22,49,50 and were performed using Enrichr³³.
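The filtering logic above can be sketched as follows. The first lines decompose the samtools flag value: 2308 sets the SAM bits for unmapped reads (4), secondary alignments (256) and supplementary alignments (2048), so the stated option also drops secondary alignments. The window filter and the mode-normalized log2 ratio are illustrative assumptions only; the function names, the pseudocount and the normalization are hypothetical and are not taken from DifCover.

```python
import math

# SAM flag bits combined in `samtools view -F 2308`.
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800
assert UNMAPPED | SECONDARY | SUPPLEMENTARY == 2308

SPERM_MODE, BLOOD_MODE = 73.0, 80.0  # approximate modal coverages from the text


def keep_window(sperm_cov, blood_cov, min_cov=10.0, max_fold=3.0):
    """Drop windows that are underrepresented in both samples or high-copy in either."""
    low = sperm_cov < min_cov and blood_cov < min_cov
    high = sperm_cov > max_fold * SPERM_MODE or blood_cov > max_fold * BLOOD_MODE
    return not (low or high)


def log2_ratio(sperm_cov, blood_cov, pseudo=1.0):
    """log2 sperm/blood coverage ratio after scaling each sample by its mode.

    The pseudocount (an assumption) avoids division by zero for windows
    with no blood coverage.
    """
    sperm = (sperm_cov + pseudo) / SPERM_MODE
    blood = (blood_cov + pseudo) / BLOOD_MODE
    return math.log2(sperm / blood)


# Hypothetical windows: (sperm coverage, blood coverage).
for sperm, blood in [(72, 79), (70, 2), (5, 5), (300, 80)]:
    if keep_window(sperm, blood):
        print(sperm, blood, round(log2_ratio(sperm, blood), 2))
```

Windows with a strongly positive ratio, such as the second example, are the kind of interval that would be considered germline-enriched.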

Repetitive sequences

High-identity repetitive elements were assembled de novo from k-mers (k = 31) that were abundant in sperm and blood reads, with k-mer counting via Jellyfish version 2.2.4⁸⁵ and assembly using Velvet version 1.2.10⁸⁶. Copy-number thresholds for abundant k-mers were set at 3× the modal copy number for 31-mers: 165 for sperm and 180 for blood. Abundant k-mers from sperm and blood were combined and used as single-end reads for Velvet, run with 29-mers. These analyses resulted in a de novo repeat library with 130,632 sequences (overall length ~11 Mb, with individual contig lengths ranging from 57 bases to 15.5 kb). These repeats were annotated using RepeatMasker version open-4.0.5²¹ (see URLs) and repeat libraries generated for the germline assembly and from Repbase (repeatmaskerlibraries-20140131: “vertebrate repeats”).
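The thresholding step can be illustrated with a small sketch. The thresholds of 165 and 180 correspond to three times modal 31-mer copy numbers of 55 (sperm) and 60 (blood); those modal values are implied by the text rather than stated. The function names and toy counts are hypothetical, counting is done here with the Python standard library rather than Jellyfish, and treating the threshold as inclusive is an assumption.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def modal_count(counts):
    """Most frequent copy number among distinct k-mers.

    Real data would typically exclude the low-count error peak first;
    that step is omitted in this toy version.
    """
    return Counter(counts.values()).most_common(1)[0][0]


def abundant_kmers(counts, fold=3):
    """k-mers whose copy number is at least `fold` times the modal copy number."""
    threshold = fold * modal_count(counts)
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Thresholds implied by the text: 3 x 55 = 165 (sperm), 3 x 60 = 180 (blood).
print(3 * 55, 3 * 60)

# Toy counts with 3-mers standing in for 31-mers (hypothetical data).
toy = Counter({"AAA": 2, "CCC": 2, "GGG": 2, "TTT": 6, "ACG": 5})
print(abundant_kmers(toy))
```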

For downstream analyses we used a set of model repeats representing the union of de novo repeats, those identified within assembled genomic sequences via RepeatModeler²⁰ and an updated assembly of the previously identified Germ1 element⁸. Enrichment analyses were performed by separately aligning paired-end reads from blood and sperm DNA to the repeat data set. As with single-copy sequence, alignments were pre-filtered to exclude unmapped reads and supplementary alignments. The remaining data were processed to generate average coverage ratios for intervals of ~100 bp.

Manual curation of HOX clusters

Manual curation of gene models was carried out using Apollo⁸⁷ implemented in JBrowse⁸⁸. Indels in the assembly were identified and corrected by comparison with RNA-seq and genomic DNA re-sequencing data. Gene predictions from MAKER were refined based on whole-embryo RNA-seq data from multiple developmental stages and homology with gene sequences from other vertebrates.

In addition to the 42 clustered HOX genes in the genome assembly, 6 further HOX genes were predicted that did not fall within the 6 HOX clusters. To investigate these genes further, the genomic scaffolds harboring these gene loci were extracted and used as queries for alignment against the assembly by BLAST⁸⁹. Five of these gene loci (homologs of hoxA3, D8, C9, B13 and B13a) were found to align with high sequence similarity (>97% identity) across long stretches of their sequence (>4 kb, containing predicted HOX coding sequence and flanking, non-coding sequence) to loci of individual members of the 42 clustered lamprey HOX genes (Supplementary Table 13). These loci could represent either recent duplications of HOX loci or assembly artifacts arising from the relatively high heterozygosity of the lamprey genome. Based on their exceptionally high levels of coding and non-coding sequence similarity to clustered HOX loci, we infer that these 5 loci are assembly artifacts due to polymorphism and that they do not represent additional singleton HOX genes in the lamprey genome. The sixth predicted singleton HOX gene shows equal levels of homology to ANTP-class homeobox genes of both HOX and non-HOX families, suggesting that it is a derived ANTP-class homeobox gene and not necessarily a HOX gene.
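The criterion used to flag these loci (>97% identity over >4 kb of aligned sequence) can be written as a simple filter over alignment summaries. The sketch below is illustrative only; the function name, the tuple layout and the example values are hypothetical and are not taken from Supplementary Table 13.

```python
def likely_redundant_locus(identity_pct, aln_len_bp,
                           min_identity=97.0, min_len_bp=4000):
    """True when an alignment exceeds both the identity and length thresholds."""
    return identity_pct > min_identity and aln_len_bp > min_len_bp


# Hypothetical alignment summaries: (query locus, % identity, aligned length in bp).
alignments = [
    ("singleton_1", 98.6, 5200),
    ("singleton_2", 99.1, 1800),
    ("singleton_3", 91.0, 6400),
]
flagged = [name for name, ident, length in alignments
           if likely_redundant_locus(ident, length)]
print(flagged)
```

Loci that pass such a filter still require the kind of manual interpretation described above, since high similarity alone cannot distinguish a recent duplication from an unmerged haplotype.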

Phylogenetic analysis of HOX genes

Phylogenetic analysis was performed on HOX paralog groups with 4 or more members in sea lamprey: groups 4, 8, 9, 11 and 13. For each paralog group, predicted sea lamprey HOX protein sequences were aligned against homologs from other vertebrate species and amphioxus, retrieved from GenBank. Our approach was informed by the experiences detailed by Kuraku et al.⁹⁰, Qiu et al.⁹¹, Mehta et al.¹⁷ and Manousaki et al.⁹². In selecting jawed vertebrate taxa for these analyses, we avoided teleost fish and Xenopus laevis as these lineages have undergone additional genome duplication events, which can lead to their co-orthologous genes/proteins being more derived than those from non-duplicated lineages. Thus, we opted for elephant shark (C. milii) and coelacanth (L. menadoensis) as chondrichthyan and ‘basal’ sarcopterygian representatives, respectively; both have slowly evolving protein-coding genes and well characterized HOX gene complements^93,94. Urochordates are the sister group of vertebrates, but the divergent nature of their HOX genes led us to favor the cephalochordate amphioxus as a source for outgroup sequences in our analyses. We chose to perform protein alignments rather than DNA alignments due to the high coding GC content in lamprey, which can result in artifactual clustering of lamprey genes in DNA trees. Nevertheless, the unique pattern of amino-acid composition in lamprey proteins is an unavoidable complicating factor that impinges on their phylogenetic analysis and can lead to artifactual clustering of lamprey proteins, as described in Qiu et al.⁹¹. The MEGA7⁴¹ software suite was used for sequence alignment, best-fit substitution model evaluation and phylogeny reconstruction. Protein alignments were performed with full available length protein sequences using MUSCLE⁴¹. Best-fit substitution models were evaluated and chosen for each alignment. Maximum likelihood, neighbor joining and maximum parsimony approaches were used for phylogenetic analysis, with 100 bootstrap replicates generated for node support. For each method, all positions in the alignment containing gaps and missing data were eliminated.
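The final step, removing every alignment position that contains a gap or missing data in any sequence, can be sketched as follows. The function name and the toy alignment are hypothetical, and the choice of `-` for gaps and `?` for missing data is an assumption about the symbols in use; the analyses themselves were run within MEGA7.

```python
def complete_deletion(alignment, missing="-?"):
    """Remove every column in which any sequence has a gap or missing symbol.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(length)
            if all(s[i] not in missing for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy protein alignment (hypothetical data).
toy = {"lamprey": "MK-LV", "shark": "MKTLV", "coelacanth": "M?TLV"}
print(complete_deletion(toy))
```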

Life Sciences Reporting Summary

Further information on experimental design is available in the Life Sciences Reporting Summary.

Code availability

Custom code (DifCover) is available on GitHub (see URLs).

Data availability

Accession numbers

Genome Assembly: NCBI Genome PIZI00000000.1; BioProject PRJNA357048. Raw sequence data used for genome assembly: NCBI SRA SRR5503831–SRR5503843. Re-sequencing data for detection of eliminated segments: NCBI SRA SRR5535434–SRR5535435. Previously published RNA-seq data were used for annotation: NCBI SRA SRX110029.2–SRX110035.2 (ref. ¹³), NCBI SRA SRX1483277–SRX1483282 (ref. ¹²) and NCBI SRA SRX104180 (ref. ¹⁰). Source data is available online for Figs. 3, 4, 6 and 7.

Change history

17 September 2018
When published, this article did not initially appear open access. This error has been corrected, and the open access status of the paper is noted in all versions of the paper. Additionally, affiliation 16 denoting equal contribution was missing from author Robb Krumlauf in the PDF originally published. This error has also been corrected.
19 April 2018
In the version of this article initially published, the present addresses for authors Dorit Hockman and Chris Amemiya were switched. The error has been corrected in the HTML and PDF versions of the article.

References

Parker, H. J., Bronner, M. E. & Krumlauf, R. A Hox regulatory network of hindbrain segmentation is conserved to the base of vertebrates. Nature 514, 490–493 (2014).
Green, S. A., Simoes-Costa, M. & Bronner, M. E. Evolution of vertebrates as viewed from the crest. Nature 520, 474–482 (2015).
Sower, S. A. et al. Emergence of an ancestral glycoprotein hormone in the pituitary of the sea lamprey, a basal vertebrate. Endocrinology 156, 3026–3037 (2015).
Smith, J. J. & Keinath, M. C. The sea lamprey meiotic map improves resolution of ancient vertebrate genome duplications. Genome Res. 25, 1081–1090 (2015).
Das, S. et al. Evolution of two prototypic T cell lineages. Cell. Immunol. 296, 87–94 (2015).
Doolittle, R. F. Bioinformatic characterization of genes and proteins involved in blood clotting in lampreys. J. Mol. Evol. 81, 121–130 (2015).
McCauley, D. W., Docker, M. F., Whyard, S. & Li, W. Lampreys as diverse model organisms in the genomics era. Bioscience 65, 1046–1056 (2015).
Smith, J. J., Antonacci, F., Eichler, E. E. & Amemiya, C. T. Programmed loss of millions of base pairs from a vertebrate genome. Proc. Natl. Acad. Sci. USA 106, 11212–11217 (2009).
Smith, J. J., Stuart, A. B., Sauka-Spengler, T., Clifton, S. W. & Amemiya, C. T. Development and analysis of a germline BAC resource for the sea lamprey, a vertebrate that undergoes substantial chromatin diminution. Chromosoma 119, 381–389 (2010).
Smith, J. J., Baker, C., Eichler, E. E. & Amemiya, C. T. Genetic consequences of programmed genome rearrangement. Curr. Biol. 22, 1524–1529 (2012).
Timoshevskiy, V. A., Herdy, J. R., Keinath, M. C. & Smith, J. J. Cellular and molecular features of developmentally programmed genome rearrangement in a vertebrate (sea lamprey: Petromyzon marinus). PLoS Genet. 12, e1006103 (2016).
Bryant, S. A., Herdy, J. R., Amemiya, C. T. & Smith, J. J. Characterization of somatically-eliminated genes during development: lamprey (Petromyzon marinus). Mol. Biol. Evol. 33, 2337–2344 (2016).
Smith, J. J. et al. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nat. Genet. 45, 415–421 (2013).
Speir, M. L. et al. The UCSC Genome Browser database: 2016 update. Nucleic Acids Res. 44 D1, D717–D725 (2016).
Ye, C., Hill, C. M., Wu, S., Ruan, J. & Ma, Z. S. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci. Rep. 6, 31900 (2016).
Tang, H. et al. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 16, 3 (2015).
Mehta, T. K. et al. Evidence for at least six Hox clusters in the Japanese lamprey (Lethenteron japonicum). Proc. Natl. Acad. Sci. USA 110, 16044–16049 (2013).
Kuraku, S. & Kuratani, S. Time scale for cyclostome evolution inferred with a phylogenetic diagnosis of hagfish and lamprey cDNA sequences. Zool. Sci. 23, 1053–1064 (2006).
Lampman, R. et al. in Jawless Fishes of the World Vol. 2 (eds. A. Orlov & R. Beamish) Ch. 22, 160–195 (Cambridge Scholars Publishing, Newcastle-upon-Tyne, UK, 2016).
Smit, A. F. A. & Hubley, R. RepeatModeler Open-1.0 (2015).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2015).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Hara, Y. et al. Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation. BMC Genomics 16, 977 (2015).
Warren, W. C. et al. A new chicken genome assembly provides insight into avian genome structure. G3 (Bethesda) 7, 109–117 (2017).
Braasch, I. et al. The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nat. Genet. 48, 427–437 (2016).
Ohno, S. Evolution by Gene Duplication (Springer-Verlag, New York, 1970).
Holland, P. W., Garcia-Fernàndez, J., Williams, N. A. & Sidow, A. Gene duplications and the origins of vertebrate development. Dev. Suppl. 1994, 125–133 (1994).
Nakatani, Y., Takeda, H., Kohara, Y. & Morishita, S. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res. 17, 1254–1265 (2007).
Murat, F., Van de Peer, Y. & Salse, J. Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes. Genome Biol. Evol. 4, 917–928 (2012).
Timoshevskiy, V. A., Lampman, R. T., Hess, J. E., Porter, L. L. & Smith, J. J. Deep ancestry of programmed genome rearrangement in lampreys. Dev. Biol. 429, 31–34 (2017).
Lachmann, A. et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 26, 2438–2444 (2010).
Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).
Sachs, M. et al. Bivalent chromatin marks developmental regulatory genes in the mouse embryonic germline in vivo. Cell Reports 3, 1777–1784 (2013).
Moore, C. M. & Best, R. G. Chromosomal Genetic Disease: Structural Aberrations (eLS, Chichester, UK, 2001).
Hardisty, M. W. Fecundity and speciation in lampreys. Evolution 17, 17–22 (1963).
Hardisty, M. W. & Cosh, J. Primordial germ cells and fecundity. Nature 210, 1370–1371 (1966).
Grossniklaus, U. & Paro, R. Transcriptional silencing by polycomb-group proteins. Cold Spring Harb. Perspect. Biol. 6, a019331 (2014).
Aloia, L., Di Stefano, B. & Di Croce, L. Polycomb complexes in stem cells and embryonic development. Development 140, 2525–2534 (2013).
Acemel, R. D. et al. A single three-dimensional chromatin compartment in amphioxus indicates a stepwise evolution of vertebrate Hox bimodal regulation. Nat. Genet. 48, 336–341 (2016).
Kumar, S., Stecher, G. & Tamura, K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33, 1870–1874 (2016).
Ye, C., Ma, Z. S., Cannon, C. H., Pop, M. & Yu, D. W. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 13 (Suppl. 6), S1 (2012).
Ye, C. & Ma, Z. S. Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads. PeerJ 4, e2016 (2016).
Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
Miller, M. R., Dunham, J. P., Amores, A., Cresko, W. A. & Johnson, E. A. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res. 17, 240–248 (2007).
Miller, M. R. et al. A conserved haplotype controls parallel adaptation in geographically distant salmonid populations. Mol. Ecol. 21, 237–249 (2012).
Stam, P. Construction of integrated genetic-linkage maps by means of a new computer package: Join Map. Plant J. 3, 739–744 (1993).
Van Ooijen, J. W. Multipoint maximum likelihood mapping in a full-sib family of an outbreeding species. Genet. Res. 93, 343–349 (2011).
Campbell, M. S. et al. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164, 513–524 (2014).
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
Hwang, J. Y., Smith, S. & Myung, K. The Rad1-Rad10 complex promotes the production of gross chromosomal rearrangements from spontaneous DNA damage in Saccharomyces cerevisiae. Genetics 169, 1927–1937 (2005).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (Suppl. 2), ii215–ii225 (2003).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Dehal, P. et al. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science 298, 2157–2167 (2002).
Simakov, O. et al. Insights into bilaterian evolution from three spiralian genomes. Nature 493, 526–531 (2013).
Putnam, N. H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007).
Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310 (2002).
Putnam, N. H. et al. The amphioxus genome and the evolution of the chordate karyotype. Nature 453, 1064–1071 (2008).
Venkatesh, B. et al. Elephant shark genome provides unique insights into gnathostome evolution. Nature 505, 174–179 (2014).
Hellsten, U. et al. The genome of the Western clawed frog Xenopus tropicalis. Science 328, 633–636 (2010).
Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Howe, K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496, 498–503 (2013).
Chapman, J. A. et al. The dynamic genome of Hydra. Nature 464, 592–596 (2010).
Srivastava, M. et al. The Trichoplax genome and the nature of placozoans. Nature 454, 955–960 (2008).
The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 39, D214–D219 (2011).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
Mulder, N. J. et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief. Bioinform. 3, 225–235 (2002).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Wu, T. D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Wucher, V. et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 45, e57 (2017).
Duret, L., Chureau, C., Samain, S., Weissenbach, J. & Avner, P. The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312, 1653–1655 (2006).
Yates, A. et al. Ensembl 2016. Nucleic Acids Res. 44, D710–D716 (2016).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657–663 (2007).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Lee, E. et al. Web Apollo: a web-based genomic annotation editing platform. Genome Biol. 14, R93 (2013).
Skinner, M. E., Uzilov, A. V., Stein, L. D., Mungall, C. J. & Holmes, I. H. JBrowse: a next-generation genome browser. Genome Res. 19, 1630–1638 (2009).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Kuraku, S., Meyer, A. & Kuratani, S. Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 26, 47–59 (2009).
Qiu, H., Hildebrand, F., Kuraku, S. & Meyer, A. Unresolved orthology and peculiar coding sequence properties of lamprey genes: the KCNA gene family as test case. BMC Genomics 12, 325 (2011).
Manousaki, T. et al. in Jawless Fishes of the World Vol. 1 (eds. Orlov, A. & Beamish, R.) 2–16 (Cambridge Scholars Publishing, Newcastle-upon-Tyne, UK, 2016).
Ravi, V. et al. Elephant shark (Callorhinchus milii) provides insights into the evolution of Hox gene clusters in gnathostomes. Proc. Natl. Acad. Sci. USA 106, 16327–16332 (2009).
Amemiya, C. T. et al. Complete HOX cluster characterization of the coelacanth provides further evidence for slow evolution of its genome. Proc. Natl. Acad. Sci. USA 107, 3622–3627 (2010).

Acknowledgements

Research reported in this publication was supported by the National Institute of General Medical Sciences of the US National Institutes of Health under award number R01GM104123 to J.J.S., the Stowers Institute under award number SIMR-1001 to H.J.P., M.E.C., L.M.W., S.M.C.R. and R.K., and the Bonneville Power Administration to J.E.H. and S.R.N. E.E.E. is an investigator of the Howard Hughes Medical Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Support and resources from the Center for High Performance Computing at the University of Utah are gratefully acknowledged. Additional computational support was provided by The University of Kentucky High Performance Computing complex.

Author information

Dorit Hockman
Present address: Division of Cell Biology, Department of Anatomy, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
Chris T. Amemiya
Present address: School of Natural Sciences, University of California Merced, Merced, CA, USA
These authors contributed equally: Nataliya Timoshevskaya, Chengxi Ye, Carson Holt, Melissa C. Keinath, Hugo J. Parker, Robb Krumlauf, Greg Elgar and Chris T. Amemiya.

Authors and Affiliations

Department of Biology, University of Kentucky, Lexington, KY, USA
Jeramiah J. Smith, Nataliya Timoshevskaya, Melissa C. Keinath, Vladimir A. Timoshevskiy, Courtney K. M. Waterbury & Cody Saraceno
Department of Computer Science, University of Maryland, College Park, MD, USA
Chengxi Ye
Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
Carson Holt & Mark Yandell
Stowers Institute for Medical Research, Kansas City, MO, USA
Hugo J. Parker, Malcolm E. Cook, Leanne M. Wiedemann, Sofia M. C. Robb & Robb Krumlauf
Columbia River Inter-Tribal Fish Commission, Portland, OR, USA
Jon E. Hess & Shawn R. Narum
Center for Molecular Biology of Heidelberg University (ZMBH), DKFZ-ZMBH Alliance, Heidelberg, Germany
Francesco Lamanna & Henrik Kaessmann
Department of Pathology and Laboratory Medicine, University of Kansas School of Medicine, Kansas City, KS, USA
Leanne M. Wiedemann
Department of Anatomy & Cell Biology, The University of Kansas School of Medicine, Kansas City, KS, USA
Sofia M. C. Robb
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Carl Baker & Evan E. Eichler
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
Evan E. Eichler
Radcliffe Department of Medicine, University of Oxford, Oxford, England
Dorit Hockman & Tatjana Sauka-Spengler
The Francis Crick Institute, London, England
Greg Elgar
Benaroya Research Institute, Seattle, WA, USA
Chris T. Amemiya

Contributions

J.J.S., R.K., C.T.A. and G.E. conceived of the study. J.J.S., N.T., C.Y., C.H., M.C.K., H.J.P., M.E.C., J.E.H., S.R.N., V.A.T., C.K.M.W., C.S., H.K., F.L., L.M.W., S.M.C.R., C.B., E.E.E., D.H., T.S.-S., M.Y. and R.K. contributed analyses. J.J.S., N.T., M.C.K., H.J.P. and R.K. wrote the manuscript.

Corresponding author

Correspondence to Jeramiah J. Smith.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 lncRNA subclasses identified in the sea lamprey genome.

a) Categories of annotated intergenic lncRNAs based on their location and transcriptional direction, relative to their closest neighboring gene. b) Categories of annotated genic lncRNAs based on their location and transcriptional direction, relative to an overlapping gene. c) Counts of genic and intergenic lncRNAs. d) Counts for subcategories of intergenic lncRNAs segregated on the basis of location and direction of transcription. e) Counts for subcategories of genic lncRNAs segregated on the basis of exon/intron overlaps. f) Counts for subcategories of genic lncRNAs segregated on the basis of location and direction of transcription relative to their overlapping gene.

Supplementary Figure 2 The distribution of conserved syntenies in lamprey and gar reveals patterns of ancient large-scale duplication.

These patterns are consistent with those from chicken/lamprey comparisons and reveal both chromosomal/segmental and whole genome duplications. Lamprey super-scaffolds are oriented along the y-axis and gar chromosomes (linkage groups) are oriented along the x-axis. Circles reflect counts of syntenic orthologs on the corresponding lamprey and gar chromosomes, with the size of each circle being proportional to the number of orthologs on that pair. The color of each circle represents the degree to which the number of observed orthologs deviates from null expectations under a uniform distribution across an identical number of lamprey and gar chromosomes with identical numbers of orthology-informative genes. Shaded regions of the plot designate homology groups that correspond to presumptive ancestral chromosomes and mark sets of lamprey chromosomes that are identical to those depicted in Fig. 4 in order to facilitate direct comparison. The ordering of lamprey super-scaffolds along the y-axis is provided in Supplementary Table 4. Altering the ordering of gar chromosomes or lamprey chromosomes within orthology groups or the relative ordering of orthology groups would not change the interpretation of patterns depicted herein.
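One way to picture the deviation from null expectations that the circle colors encode is a contingency-table calculation. The sketch below is an assumption, not the authors' procedure: it takes the expected count for each lamprey/gar pair to be the product of the two marginal totals divided by the grand total, and reports an observed/expected ratio. The function names and toy counts are hypothetical.

```python
def expected_counts(observed):
    """Expected ortholog counts per (lamprey, gar) pair, assuming orthologs
    are spread in proportion to the marginal gene totals (an assumption)."""
    row_totals = [sum(row) for row in observed]        # per lamprey super-scaffold
    col_totals = [sum(col) for col in zip(*observed)]  # per gar chromosome
    grand_total = sum(row_totals)
    if grand_total == 0:
        return [[0.0 for _ in col_totals] for _ in row_totals]
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def enrichment_ratios(observed):
    """Observed / expected for every pair; values above 1 indicate enrichment."""
    expected = expected_counts(observed)
    return [
        [(o / e) if e > 0 else 0.0 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Toy matrix (invented): rows are lamprey super-scaffolds, columns are gar chromosomes.
toy_counts = [[30, 2],
              [3, 25]]
toy_expected = expected_counts(toy_counts)
toy_ratios = enrichment_ratios(toy_counts)
```

In this toy matrix the diagonal pairs have ratios above 1 and the off-diagonal pairs fall below 1, which is the kind of contrast a color scale would display.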

Supplementary Figure 3 Hox gene loci in the sea lamprey showing the order of assembled scaffolds and gaps.

Hox gene exons and non-Hox flanking genes are represented by boxes. Loci are not drawn to scale. The six Hox clusters fall within separate super-scaffolds with the constituent scaffolds and gaps indicated as lines beneath each cluster. Positions of known micro-RNAs are shown. All Hox genes are transcribed from left to right unless indicated (black arrows). The assembled Hox-ε cluster contains a scaffold arrangement that results in a non-canonical order of Hox genes, with Hox-ε13 and Hox-ε14 exon2-3 being positioned outside of the cluster with respect to evx. Hox-ε14 exon2-3 displays an opposite direction of transcription to the majority of the genes in the cluster, including its first exon. Given that homologues of Hox-ε11, 13 and 14 show a canonical arrangement in the closely related L. camtschaticum Hox-ε cluster, which is based on contiguous BAC sequences, we tentatively infer that these scaffolds have been mis-ordered in the P. marinus assembly and their true positions are between Hox-ε14 exon1 and Hox-ε11 as indicated (red arrows). In the current assembly, a homeodomain-containing fragment of a PG8 Hox gene (8e2), with very high similarity to Hox-ε8, falls within the gap-spanning intron of Hox-ε7 and has an opposite direction of transcription to that of the majority of the genes in the cluster. We tentatively infer this to be an assembly error.

Supplementary Figure 4 A comparison between assembled Hox loci of the sea lamprey and the Japanese lamprey.

Orthologous sea lamprey (P.m.) and Japanese lamprey (L.c.) Hox loci are shown side-by-side. Hox genes are represented as boxes. Flanking non-Hox genes are depicted as arrowheads, which indicate their direction of transcription. The sea lamprey assembly supports the ε and ζ clusters that were tentatively predicted in the Japanese lamprey assembly, reveals an additional Hox gene (Hox-ζ1) and an additional micro-RNA (miR-196ε), and expands these loci to reveal the genes flanking the ε and ζ clusters. Hox-containing loci homologous to L.c. Hox-η and Hox-θ were not identifiable in the sea lamprey assembly. The white arrow downstream of the lamprey Hox-γ cluster represents PMZ_0048273, an uncharacterized non-Hox gene.

Supplementary Figure 5 Phylogenetic analysis of sea lamprey Hox4 genes.

Six predicted sea lamprey Hox4 protein sequences (full available length) were aligned with selected vertebrate Hox4 homologues, with amphioxus Hox4 as an outgroup. Trees generated by Maximum Likelihood (ML), Maximum Parsimony (MP) and Neighbor Joining (NJ) approaches are shown, based on 190 positions in the alignment. Support for each node is shown as a percentage from 100 bootstrap replicates. ML and NJ trees are drawn to scale, with the unit for evolutionary distance being the number of amino acid substitutions per site. Lamprey genes are highlighted (blue shading). Pm, Petromyzon marinus (Sea lamprey); Lc, Lethenteron camtschaticum (Japanese lamprey); Hs, Homo sapiens (Human); Cm, Callorhinchus milii (Elephant shark); Lm, Latimeria menadoensis (Coelacanth); Gg, Gallus gallus (Chicken); Bf, Branchiostoma floridae (Amphioxus).

Supplementary Figure 6 Phylogenetic analysis of sea lamprey Hox8 genes.

Four predicted sea lamprey Hox8 protein sequences (full available length) were aligned with selected vertebrate Hox8 homologues, with amphioxus Hox8 as an outgroup. Trees generated by Maximum Likelihood (ML), Maximum Parsimony (MP) and Neighbor Joining (NJ) approaches are shown, based on 124 positions in the alignment. Support for each node is shown as a percentage from 100 bootstrap replicates. ML and NJ trees are drawn to scale, with the unit for evolutionary distance being the number of amino acid substitutions per site. Lamprey genes are highlighted (blue shading). Pm, Petromyzon marinus (Sea lamprey); Lc, Lethenteron camtschaticum (Japanese lamprey); Hs, Homo sapiens (Human); Cm, Callorhinchus milii (Elephant shark); Lm, Latimeria menadoensis (Coelacanth); Gg, Gallus gallus (Chicken); Bf, Branchiostoma floridae (Amphioxus).

Supplementary Figure 7 Phylogenetic analysis of sea lamprey Hox9 genes.

Five predicted sea lamprey Hox9 protein sequences (full available length) were aligned with selected vertebrate Hox9 homologues, with amphioxus Hox9 as an outgroup. Trees generated by Maximum Likelihood (ML), Maximum Parsimony (MP) and Neighbor Joining (NJ) approaches are shown, based on 74 positions in the alignment. Support for each node is shown as a percentage from 100 bootstrap replicates. ML and NJ trees are drawn to scale, with the unit for evolutionary distance being the number of amino acid substitutions per site. Lamprey genes are highlighted (blue shading). Pm, Petromyzon marinus (Sea lamprey); Lc, Lethenteron camtschaticum (Japanese lamprey); Hs, Homo sapiens (Human); Cm, Callorhinchus milii (Elephant shark); Lm, Latimeria menadoensis (Coelacanth); Gg, Gallus gallus (Chicken); Bf, Branchiostoma floridae (Amphioxus).

Supplementary Figure 8 Phylogenetic analysis of sea lamprey Hox11 genes.

Four predicted sea lamprey Hox11 protein sequences (full available length) were aligned with selected vertebrate Hox11 homologues. Amphioxus Hox11 was not used as an outgroup due to its relatively short sequence impinging on the alignment. Hox-β11 from the Japanese lamprey was omitted from the alignment as full-length sequence is not available. Trees generated by Maximum Likelihood (ML), Maximum Parsimony (MP) and Neighbor Joining (NJ) approaches are shown, based on 229 positions in the alignment. Support for each node is shown as a percentage from 100 bootstrap replicates. ML and NJ trees are drawn to scale, with the unit for evolutionary distance being the number of amino acid substitutions per site. Lamprey proteins are highlighted (blue shading). Pm, Petromyzon marinus (Sea lamprey); Lc, Lethenteron camtschaticum (Japanese lamprey); Hs, Homo sapiens (Human); Cm, Callorhinchus milii (Elephant shark); Lm, Latimeria menadoensis (Coelacanth); Gg, Gallus gallus (Chicken).

Supplementary Figure 9 Phylogenetic analysis of sea lamprey Hox13 genes.

Five predicted sea lamprey Hox13 protein sequences (full available length) were aligned with selected vertebrate Hox13 homologues, with amphioxus Hox13 as an outgroup. Trees generated by Maximum Likelihood (ML), Maximum Parsimony (MP) and Neighbor Joining (NJ) approaches are shown, based on 74 positions in the alignment. Support for each node is shown as a percentage from 100 bootstrap replicates. ML and NJ trees are drawn to scale, with the unit for evolutionary distance being the number of amino acid substitutions per site. Lamprey proteins are highlighted (blue shading). Pm, Petromyzon marinus (Sea lamprey); Lc, Lethenteron camtschaticum (Japanese lamprey); Hs, Homo sapiens (Human); Cm, Callorhinchus milii (Elephant shark); Lm, Latimeria menadoensis (Coelacanth); Gg, Gallus gallus (Chicken); Bf, Branchiostoma floridae (Amphioxus).

Supplementary Figure 10 Diversity and abundance of eliminated repeats.

Assembly of repeats and comparative sequencing reveal germline enrichment of several high-copy sequences. A) The distribution of coverage ratios reveals a long tail corresponding to repetitive sequences with higher sequence coverage in sperm relative to blood. This excess is highlighted in red, assuming a symmetrical distribution of enrichment scores for non-eliminated regions and an absence of somatic-specific sequence. B) Competitive genomic hybridization of labeled CoT2 repetitive DNA from soma (liver: green) and germline (sperm: red) to elimination anaphases. Germline-enriched repeats are nearly absent from somatically retained chromatin (migrating toward the spindle poles) and highly enriched in lagging chromatin that is slated for elimination. DNA is counterstained with DAPI (blue). A similar pattern was observed on 22 additional anaphases across two slides.
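The symmetry assumption in this legend can be illustrated with a small calculation. The sketch below is a hypothetical reading of it, not the authors' code: enrichment scores are binned by their offset from the center of the distribution, the depleted (left) side is treated as pure background because somatic-specific sequence is assumed absent, and the germline excess in each enriched bin is whatever remains after subtracting the mirrored left-hand count. The function name and the toy histogram are invented.

```python
def germline_excess(hist):
    """Estimate excess counts in the sperm-enriched tail.

    hist maps an integer bin offset from the distribution center to a count.
    Negative offsets are bins with lower sperm/blood coverage ratios,
    positive offsets are bins with higher ratios.
    """
    excess = {}
    for offset, count in hist.items():
        if offset > 0:
            mirrored_background = hist.get(-offset, 0)  # symmetric null
            excess[offset] = max(0, count - mirrored_background)
    return excess


# Toy histogram (invented counts) with a long right-hand tail.
toy_hist = {-2: 5, -1: 40, 0: 100, 1: 45, 2: 20, 3: 12}
toy_excess = germline_excess(toy_hist)
total_excess = sum(toy_excess.values())
```

For this toy input the excess is concentrated in the outer bins, where the mirrored background is small or absent.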

Supplementary Figure 11 The relative locations of duplicated genes on lamprey chromosomes derived from four distinct orthology groups.

The locations of duplicate genes on lamprey super-scaffolds are shown for a subset of ancestral linkage orthology groups. a) Lamprey duplications on super-scaffolds corresponding to an ancestral orthology group that encompasses chicken chromosomes 2, 7, 27 and 29. b) Lamprey duplications on super-scaffolds corresponding to an ancestral orthology group that encompasses chicken chromosomes 14 and 18. c) Lamprey duplications on super-scaffolds corresponding to an ancestral orthology group that encompasses chicken chromosomes 17 and 8. Lamprey super-scaffolds 105 and 107 are not shown; these contain no duplicates on any of the other lamprey super-scaffolds that were assigned to this orthology group. d) Lamprey duplications on super-scaffolds corresponding to an ancestral orthology group that encompasses chicken chromosomes 15 and 19. Two-copy duplicates (referenced in Fig. 5 and Supplementary Table 6) are shown in bold. Higher-copy relationships are shown with thinner lines.

Supplementary Figure 12 Distribution of read depths from germline and somatic re-sequencing data.

Sequence reads from sperm and blood were mapped separately to the germline genome assembly. Values represent the number of base pairs at a given depth of coverage. These distributions reveal that the modal coverage of reads from sperm DNA was slightly lower than the coverage of reads from blood, ~73X and ~80X, respectively, and sperm sequences map to a larger proportion of the approximately single copy sequence within the germline genome.
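As a minimal sketch of how a modal coverage can be read from such a distribution, the code below takes a histogram of base pairs per depth and returns the depth with the most bases, ignoring uncovered positions. The histograms are invented toy values chosen only to echo the ~73X and ~80X modes stated above; the function name and the `min_depth` parameter are hypothetical.

```python
def modal_depth(depth_hist, min_depth=1):
    """Return the depth of coverage that accounts for the most base pairs.

    depth_hist maps depth -> number of base pairs at that depth.
    Depths below min_depth (for example uncovered bases) are ignored.
    """
    covered = {d: n for d, n in depth_hist.items() if d >= min_depth}
    return max(covered, key=covered.get)


# Toy histograms (invented counts), one per sample.
sperm_hist = {0: 900, 72: 480, 73: 500, 74: 470}
blood_hist = {0: 5000, 79: 610, 80: 640, 81: 600}

sperm_mode = modal_depth(sperm_hist)
blood_mode = modal_depth(blood_hist)

# Factor that would rescale sperm depth to the blood mode before
# comparing coverage between the two samples.
sperm_to_blood_scale = blood_mode / sperm_mode
```

Rescaling by the ratio of modes is one simple way to put the two samples on a common footing before computing per-interval coverage ratios.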

Supplementary information

Supplementary Figures, Text and table

Supplementary Figures 1–12, Supplementary Note and Supplementary Table 13.

Life Sciences Reporting Summary

Supplementary Table 1

The Pacific lamprey (Entosphenus tridentatus) meiotic map and alignment of RADseq markers to sea lamprey (Petromyzon marinus) super-scaffolds.

Supplementary Table 2

Noncoding RNAs detected in the germline genome assembly.

Supplementary Table 3

A sea lamprey/chicken comparative map.

Supplementary Table 4

Ordering of lamprey super-scaffolds along the y-axis of Fig. 4 (ordered from bottom to top).

Supplementary Table 5

A sea lamprey/gar comparative map.

Supplementary Table 6

Statistics for all pairwise comparisons of numbers of shared two-copy paralogs on all lamprey derivatives of the ancestral Hox-bearing chromosome. Probability estimates were generated using two-tailed tests and a total of n = 200 independent pairs of duplicated genes.

Supplementary Table 7

Depth of coverage ratios for all genomic intervals.

Supplementary Table 8

Primers used for PCR validation of candidate PGR targets and assay results.

Supplementary Table 9

Enrichment statistics for overlap between PGR-orthologs and curated "biological process" ontology terms. These comparisons reflect enrichment statistics derived from n = 102 independent lamprey/human orthologs. Statistics are as described in reference 25. P-value is calculated using Fisher’s exact test. Adjusted P-value is calculated using the Benjamini-Hochberg method for correction for multiple hypothesis testing. Z-score is calculated by comparing the observed ranking of Fisher’s exact tests relative to those derived from random sets of genes. Combined score is the product of the log P-value and the Z-score.

Supplementary Table 10

Enrichment statistics for overlap between PGR-orthologs and published chromatin immunoprecipitation (ChIP) experiments (ChEA 2016). These comparisons reflect enrichment statistics derived from n = 102 independent lamprey/human orthologs. Statistics are as described in reference 25. P-value is calculated using Fisher’s exact test. Adjusted P-value is calculated using the Benjamini-Hochberg method for correction for multiple hypothesis testing. Z-score is calculated by comparing the observed ranking of Fisher’s exact tests relative to those derived from random sets of genes. Combined score is the product of the log P-value and the Z-score.
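The statistics named in the legends for Supplementary Tables 9 and 10 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline of reference 25: the Fisher's exact test is taken to be one-sided (over-representation), the logarithm in the combined score is taken to be the natural log, and the Z-score is supplied by the caller because it depends on rankings from random gene sets. All function and parameter names are hypothetical.

```python
import math


def fisher_right_tail(overlap, set_size, list_size, universe):
    """One-sided Fisher's exact P-value: probability of observing at least
    `overlap` genes shared between a term of `set_size` genes and a query
    list of `list_size` genes drawn from `universe` genes."""
    total = math.comb(universe, list_size)
    tail = sum(
        math.comb(set_size, k) * math.comb(universe - set_size, list_size - k)
        for k in range(overlap, min(set_size, list_size) + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def combined_score(p_value, z_score):
    """Product of the log P-value (natural log assumed) and the Z-score."""
    return math.log(p_value) * z_score


# Toy check: a 5-gene list that fully overlaps a 5-gene term in a 10-gene universe.
toy_p = fisher_right_tail(overlap=5, set_size=5, list_size=5, universe=10)
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

With n = 102 orthologs as the query list, `list_size` would be 102; the term sizes and the size of the gene universe are not given in the legends and would have to come from the underlying gene-set library.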

Supplementary Table 11

Copy number estimates and depth of coverage ratios for intervals within all characterized repetitive elements.

Supplementary Table 12

Repeat classification for all repetitive elements with matches in RepBase. The best match for each element is shown here. Elements with no match to RepBase or RepeatModeler repeats are not shown here.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Smith, J.J., Timoshevskaya, N., Ye, C. et al. The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution. Nat Genet 50, 270–277 (2018). https://doi.org/10.1038/s41588-017-0036-1

Received: 14 August 2017
Accepted: 15 December 2017
Published: 22 January 2018
Issue Date: February 2018
DOI: https://doi.org/10.1038/s41588-017-0036-1
