Genome evolution in the allotetraploid frog Xenopus laevis

Journal name: Nature
Volume: 538
Pages: 336–343
DOI: doi:10.1038/nature19840

Abstract

To explore the origins and consequences of tetraploidy in the African clawed frog, we sequenced the Xenopus laevis genome and compared it to the related diploid X. tropicalis genome. We characterize the allotetraploid origin of X. laevis by partitioning its genome into two homoeologous subgenomes, marked by distinct families of ‘fossil’ transposable elements. On the basis of the activity of these elements and the age of hundreds of unitary pseudogenes, we estimate that the two diploid progenitor species diverged around 34 million years ago (Ma) and combined to form an allotetraploid around 17–18 Ma. More than 56% of all genes were retained in two homoeologous copies. Protein function, gene expression, and the amount of conserved flanking sequence all correlate with retention rates. The subgenomes have evolved asymmetrically, with one chromosome set more often preserving the ancestral state and the other experiencing more gene loss, deletion, rearrangement, and reduced gene expression.

At a glance

Figures

  1. Chromosome evolution in Xenopus.
    Figure 1: Chromosome evolution in Xenopus.

    a, Comparative cytogenetic map of XLA (Xenopus laevis) and XTR (Xenopus tropicalis) chromosomes. Magenta lines show relationships of chromosomal locations of 198 homoeologous gene pairs between XLA.L and XLA.S chromosomes, identified by FISH mapping using BAC clones (Supplementary Table 1 and Supplementary Note 3.1). Blue lines show relationships of chromosomal locations of orthologous genes between XTR chromosomes and (i) both XLA.L and XLA.S chromosomes (solid line) (lines between XLA.L and XLA.S are omitted), (ii) only XLA.L (dashed), or (iii) only XLA.S (dotted), which were taken from our previous studies (refs 14, 15). Light blue lines indicate positional relationships of actr3 and lypd1 on XTR9q and rpl13a and rps11 on XTR10q with those on XLA9_10LS chromosomes (Supplementary Note 6.2). Double-headed arrows on the right of XLA.S chromosomes indicate the chromosomal regions in which inversions occurred. Ideograms of XTR and XLA chromosomes were taken from our previous reports (refs 15, 16). b, Distribution of homoeologous genes (purple), singletons (grey) and subgenome-specific repeats across XLA1L (top) and XLA1S (bottom). Xl-TpL_harb is red, Xl-TpS_harb is blue, and Xl-TpS_mar is green. Purple lines mark homoeologous genes present in both L and S chromosomes; the black line marks the approximate centromere location on each chromosome. The homoeologous gene pairs, from left to right: rnf4, spcs3, intsl2, foxa1, sds, ap3s1, lifr, aqp7. Each bin is 3 Mb in size, with 0.5 Mb overlap with the previous bin. c, Chromosomal localization of the Xl-TpS_mar sequence with fluorescence in situ hybridization. Hybridization signals were only observed on the S chromosomes. Scale bar, 10 μm.

  2. Molecular evolution and allotetraploidy.
    Figure 2: Molecular evolution and allotetraploidy.

    a, The distribution of pseudogene ages, as described in Supplementary Note 9 (top). Phylogenetic tree illustrating the different epochs in Xenopus (bottom), with times based on protein-coding gene phylogeny of pipids, including Xenopus, Pipa carvalhoi, Hymenochirus boettgeri and Rana pipiens (only Xenopus depicted). We date the speciation of X. tropicalis and the X. laevis ancestor at 48 Ma, the L and S polyploid progenitors at 34 Ma and the divergence of the polyploid Xenopus radiation at 17 Ma. Using these times as calibration points, we estimate bursts of transposon activity at 18 Ma (mariner, blue star) and 33–34 Ma (harbinger, red star). The purple star is the time of hybridization, around 17–18 Ma. b, Phylogenetic tree based on protein-coding genes of tetrapods, rooted by elephant shark (not shown). Alignments were done by MACSE (multiple alignment of coding sequences accounting for frameshifts and stop codons) and the maximum-likelihood tree was built by PhyML. Branch length scale shown at the bottom for 0.08 substitutions per site. The difference in branch length between Xenopus laevis-L and Xenopus laevis-S is similar to that seen between mouse and rat. Both subgenomes of X. laevis have longer branch lengths than X. tropicalis.

  3. Structural response to allotetraploidy.
    Figure 3: Structural response to allotetraploidy.

    a, Distributions of consecutive retentions (left) and deletions (right) in the L (red) and S (blue) subgenomes. The distributions were fit using the equation y = a × e^(bx) + c × e^(dx). The y axis is shown on a log scale. Significant differences were seen between L and S subgenomes in both distributions (Student’s t-test, retention, P = 3.6 × 10^−22; deletion, P = 4.5 × 10^−84). b, Evolutionary conservation of the Xenopus major histocompatibility complex (MHC) and differential MHC silencing on the two X. laevis subgenomes. Selected gene names shown above. The ‘Adaptive MHC’ encodes tightly-linked essential genes involved in antigen presentation to T cells; this group of genes is the primordial linkage group and has been preserved in most non-mammalian vertebrates, including Xenopus. Differential gene silencing is particularly pronounced, as four genes around the class I gene are functional on the S chromosome, but are either absent (dma and dmb (MHC-class II domain alpha and beta)) or pseudogenes (ring3, really interesting new gene 3; lmp2, large multifunctional proteasome 2) on the L chromosome. The gene map is not to scale; pseudogenes are noted as indicated. HSA, Homo sapiens MHC; XLA, Xenopus laevis MHC; GGA, Gallus gallus (chicken) MHC. Refer to Supplementary Table 8 for a more detailed MHC map. TAPBP, TAP binding protein, or tapasin; TAP2, antigen peptide transporter 2; CFB, complement factor B; TNFa, tumor necrosis factor α. c, Hox gene clusters. X. laevis retains eight hox clusters, consisting of pairs of hoxa, b, c and d clusters, on L and S chromosomes. The even-skipped genes (evx1 or evx2) flank the hoxa and hoxd clusters. The hox genes are classified into four groups, namely the labial, proboscipedia, central and posterior groups. Note that hoxb2.L (2p, black) is a pseudogene. d, Syntenies around the mix gene family. Abbreviations for species and chromosome numbers: HSA1, H. sapiens; GGA3, G. gallus (chicken); XTR5, X. tropicalis; XLA5L and XLA5S, X. laevis (L and S subgenomes); DRE20, D. rerio (zebrafish). Each Xenopus (sub)genome experienced its own independent expansion of the family (see Extended Data Fig. 7 for details).

  4. Retention and functional differentiation.
    Figure 4: Retention and functional differentiation.

    a, Comparison of L and S gene loss by KEGG categories (left) and tissue-weighted gene co-expression network analysis (WGCNA) categories (right) (Supplementary Note 10.1). Blue line denotes expected L or S loss based on genome-wide average (56.4%). Red points denote functional categories showing a high degree of loss. Magenta points denote functional categories showing a high degree of retention (χ² test, P < 0.01). b, Box plot of log10(L_TPM/S_TPM) for homoeologous gene pairs, zoomed in to show medians. Ovary and maternally controlled developmental time points (left, light blue and dark blue bars, respectively), zygotically controlled developmental time points and adult tissues (right, red and green bars, respectively). Red line, equal ratio log10(1). On average, maternal datasets express the L gene of a homoeologous pair 12% more strongly than the S gene (median = 0%), whereas zygotic tissues and time points express the L gene of a homoeologous pair 25% more strongly than the S gene (median = 1.8%). The difference between the means and medians is explained by many genes with large differences between homoeologues (Extended Data Fig. 8c). c, d, Developmental expression plot (left) and epigenetic landscape (right) surrounding hoxb4 (c) and numbl (d). Right, genomic profiles of H3K4me3 (green), p300 (yellow), RNA polymerase II (RNAP II; d, purple) and H3K36me3 (d, blue) ChIP–seq tracks, as well as DNA methylation levels determined by whole-genome bisulfite sequencing (grey). Gene annotation track shows hoxb4 (c) and numbl (d) genes on L (top) and S (bottom). Grey denotes conservation between L and S genomic sequences. d, The small amount of expression seen in maternal numbl and numbl.L is consistent between replicates. Gene expression is measured in transcripts per million mapped reads (TPM). e, Representative embryos with GFP expression, as detected by in situ hybridization at stages 32–33, driven by six6.L-CNE or six6.S-CNE linked to a basal promoter-GFP cassette (six6.L-CNE:GFP and six6.S-CNE:GFP, respectively). Embryos were 4,250–4,450 μm. Semi-quantitative image analysis revealed a substantial difference in average expression level; the expression driven by six6.S-CNE (n = 27) was 0.6-fold weaker than that by six6.L-CNE in the eye region (n = 32). Given eye-specific patterns of their endogenous expression, the six6 genes probably have additional silencers for restricting enhancer activity of the CNEs in the eye.

  5. Allotetraploidy and assembly.
    Extended Data Fig. 1: Allotetraploidy and assembly.

    a, Scenarios for allotetraploid formation from distinct ancestral diploid species A and B. Horizontal single lines indicate normal gametes, horizontal double lines indicate unreduced gametes; black square represents fertilization; vertical double lines indicate spontaneous (somatic) genome doubling. (i) Fusion of unreduced gametes from species A and B. (ii) Interspecific hybridization followed by spontaneous doubling. (iii) Fusion of unreduced gametes produced by interspecific hybrids. (iv) Interspecific hybrids produce unreduced gametes, which fuse with normal gametes from species A. The resulting triploid again produces unreduced gametes, which fuse with normal gametes from species B. (v) Unreduced gamete from species A fuses with normal gamete from species B. The resulting AAB triploid produces unreduced gametes that are fertilized by normal gametes from species B. See Supplementary Note 1.1 for a more detailed discussion. b, History of the J strain. See Supplementary Note 2.1 for details. The years of events and generation numbers (such as frog transfer to another institute, establishment of homozygosity, construction of materials) are indicated in the scheme. Generation numbers are estimates due to loss of old breeding records. c, The nucleotide distance of orthologues (green), homoeologues (red) and alleles (blue) is discussed in Supplementary Note 8.7. The distances are shown on a log scale to differentiate between the distributions. d, Frequency histogram showing the number of 51-mers with specified count in the shotgun dataset. The prominent peak implies that each genomic locus is sampled 29× in 51-mers. Note the absence of a feature at twice this depth, indicating that homoeologous features with high identity are rare. e, Cumulative proportion of 51-mers as a function of relative depth (that is, depth/29). Relative depth provides an estimate of genomic copy number. The rapid rise at relative depth 1 implies that 70–75% of the X. laevis genome is single copy with respect to 51-mers. The remainder of the genome is primarily concentrated in repetitive sequences with copy number > 100. Note logarithmic scale. f, The contact map of 85,260 TCC read pairs for JGIv72.000090484.chr4S. Read pairs were binned at 10-kb intervals. For each read pair, the forward and reverse reads map with a map quality score of at least 20. g, The contact map of 85,260 Chicago read pairs for JGIv72.000090484.chr4S, a 3.1-Mb scaffold in the XENLA_JGI_v72 assembly. h, The insert distribution of TCC and Chicago read pairs that map to the same scaffold of XENLA_JGI_v72 with a map quality score of at least 20. The x axis is the read pair separation distance. The y axis is the counts for that bin divided by the total number of reads. The bins are 1 kb.
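
As a rough illustration of panels d and e above, the following sketch converts a k-mer count histogram into a cumulative proportion as a function of relative depth (depth divided by the single-copy peak). Only the peak depth of 29 comes from the caption; the function name, the toy histogram, and the choice to weight each distinct k-mer by its depth (so that fractions refer to sampled genomic positions) are assumptions made for illustration, not the authors' procedure.

```python
def cumulative_by_relative_depth(histogram, peak_depth):
    """Return a list of (relative_depth, cumulative_fraction) pairs.

    histogram maps an observed k-mer count (depth) to the number of
    distinct k-mers seen at that depth. Each k-mer is weighted by its
    depth, an assumption that makes the fractions refer to k-mer
    occurrences rather than to distinct k-mers.
    """
    total = sum(depth * n for depth, n in histogram.items())
    running = 0
    curve = []
    for depth in sorted(histogram):
        running += depth * histogram[depth]
        curve.append((depth / peak_depth, running / total))
    return curve


# Hypothetical toy histogram: mostly single-copy k-mers, a few at
# twice the peak depth, and a small high-copy repeat class.
toy_histogram = {29: 700, 58: 50, 2900: 2}
for relative_depth, fraction in cumulative_by_relative_depth(toy_histogram, 29):
    print(f"relative depth {relative_depth:g}: cumulative fraction {fraction:.3f}")
```

The cumulative fraction reached at relative depth 1 is what the caption reads as the single-copy share of the genome.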

  6. Chromosome structure.
    Extended Data Fig. 2: Chromosome structure.

    a, Structure of the sex chromosome of X. laevis (XLA2L) and comparison with XLA2S and XTR2. The W version of XLA2L harbours a W-specific sequence containing the female sex-determining gene dmw (red) while Z has a different Z-specific sequence (blue). Pentagon arrows and black triangles indicate genes and olfactory receptor genes, respectively. Their tips correspond to their 3′-ends. b, Alignment of the q-terminal regions of XTR9 and 10 with corresponding regions of XLA9_10L and XLA9_10S. Genes near the q-terminal regions of XTR9 and XTR10 were missing in the X. tropicalis genome assembly v9, but rps11, rpl13a, lypd1 and actr3 were expected to be located there based on the synteny with human chromosomes, and then verified by cDNA FISH (upper panels). Small triangles on XLA9_10L and S indicate the distribution of gene models showing both identity and coverage greater than 30%, against the human and chicken peptide sequences from Ensembl, in the region ±2 Mb from the prospective 9/10 junction. HSA, human chromosome; GGA, chicken chromosome. The magnified view represents syntenic genes to scale with colours corresponding to human genes. c, The orders of orthologous genes across XTR9, XTR10, XLA9_10L and XLA9_10S. Green arrowheads: positions of centromeres in XTR9 and 10 predicted by examination of the cytogenetic chromosome length ratio of p versus q arms (ref. 15). Blue arrowheads: positions of centromere repeats, frog centromeric repeat-1 (ref. 55), in XLA9_10L and S. Magenta and yellow ellipses, chromosomal locations of snrpn (magenta) and stau1 (yellow) from X. tropicalis v9 and X. laevis v9.1 assemblies. Red ellipses, chromosomal locations of four genes, rps11, rpl13a, lypd1 and actr3. XTR9 is inverted to facilitate comparison. Blue bidirectional arrows indicate the homologous regions where pericentric inversions may have occurred on proto-chromosomes (see Extended Data Fig. 2d).
d, Schematic representation for the two hypothetical processes of chromosomal rearrangements (fusion and inversion) that occurred between the hypothetical proto-XTR9 and 10 to produce proto-XLA9_10, and eventually XLA9_10L and S. The process of chromosome rearrangements is explained parsimoniously in two different ways (left and right panels), starting from proto-XTR9 and 10. Actual and hypothetical ancestral chromosomal locations of snrpn and stau1 are shown by magenta and yellow circles, respectively. Note that the chromosomal locations of these genes on the proto-XTR10 differ between the two models. Chromosome segments homologous to XTR9 and XTR10 are shown in red and blue, respectively. XTR9 is inverted to facilitate comparison. Bidirectional arrows indicate the regions where pericentric inversions may have occurred. Black arrows indicate the direction of chromosomal evolution.

  7. Transposons.
    Extended Data Fig. 3: Transposons.

    a, Density of the subgenome-specific transposons on each chromosome (coverage length of transposable element (bp)/chromosome length (Mbp)). The coverage lengths of transposons were calculated from the results of BLASTN search (E-value cutoff 10^−5) using the consensus sequences as queries. b, Jukes–Cantor distances across non-CpG sites, corrected as in Supplementary Note 7.5. Distances between X. tropicalis and X. laevis transposon consensus sequences are shown. The X. laevis-specific transposon distances are those of each individual transposon sequence against the consensus sequence for that subfamily. c, Phylogenetic tree of Xl-TpS_mar transposon expansions in the X. laevis genome, built using Jukes–Cantor corrected distances (Supplementary Note 7.5). Sub-clusters with enough members to determine accurate timings are highlighted. The scale bar represents the corrected Jukes–Cantor distance of 0.08 substitutions per site.
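
For readers unfamiliar with the Jukes–Cantor correction used in panels b and c, the sketch below computes the standard distance d = −(3/4) ln(1 − 4p/3) from the proportion p of mismatched sites between two aligned sequences. It is a minimal illustration only: the function name and example sequences are hypothetical, and it does not implement the non-CpG filtering or the additional correction described in Supplementary Note 7.5.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned, equal-length sequences.

    Only columns where both sequences carry A, C, G or T are compared;
    gaps and ambiguity codes are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned fragments differing at one of ten sites.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as the raw mismatch proportion, because it accounts for multiple substitutions at the same site.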

  8. Phylogeny.
    Extended Data Fig. 4: Phylogeny.

    a, Phylogenetic tree of pan-vertebrate conserved non-coding elements (pvCNEs), rooted by elephant shark. Alignments were done by MUSCLE, and the maximum-likelihood tree was built by PhyML. Branch length scale shown at the bottom. The difference in branch lengths of tetrapods follows the same topology as the protein-coding tree (Fig. 2b). b, Complete phylogenetic tree from Fig. 2a, with divergence times computed by r8s. c, Distribution of synonymous and non-synonymous rates Ks and Ka on specific subgenomes during the time between L and S speciation, before X. laevis and X. borealis speciation. We find accelerated mutation rates between T2 and T3 in Ks and Ka (P = 1.4 × 10^−5 (left), 8.6 × 10^−3 (right)). d, Distribution of Ks and Ka on specific subgenomes during the time after X. laevis and X. borealis speciation. We do not find significantly accelerated substitution rates (P = 0.10 (left) and P = 0.03 (right)). e, Table showing the number of homoeologues and singletons identified as homoeologues from the ancient vertebrate duplication (or ohnologues as they were historically called) (ref. 56). 79.9% of ohnologues retain both copies in X. laevis today, significantly more than the 54.3% of the rest of the genome after excluding ohnologues (χ² test, P = 4.44 × 10^−69). f, Table showing the branch lengths of bootstrapped maximum likelihood trees described in Supplementary Note 12.5. The columns refer to the X. tropicalis (XTR), L chromosome of X. laevis (XLA.L), S chromosome of X. laevis (XLA.S) and XLA.L/XLA.S branch lengths, respectively. The first row shows triplets where all genes show expression, the second row shows triplets where L is a thanagene, and the third row shows triplets where S is a thanagene. The L branch length is significantly smaller when all genes are expressed, or when S is a thanagene (Wilcoxon signed-rank test, P = 1.7 × 10^−216 and 6.4 × 10^−212, respectively). The S branch length is smaller when L is a thanagene (P = 2.4 × 10^−223). The ratio of branch lengths (L/S) is significantly different for either L or S thanagene datasets compared to when all genes are expressed (P = 3.55 × 10^−214 and 7.48 × 10^−220, respectively). The ratio is also different between the two thanagene datasets (P = 1.79 × 10^−217).

  9. Structural evolution.
    Extended Data Fig. 5: Structural evolution.

    a, The chromosomal location of the 45S pre-ribosomal RNA gene (rna45s), which encodes a precursor RNA for 18S, 5.8S and 28S rRNAs, was determined using pHr21Ab (5.8-kb for the 5′ portion) and pHr14E3 (7.3-kb for the 3′ portion) fragments as FISH probes. DNA fragments used for the probes were provided by the National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, and labelled with biotin-16-dUTP (Roche Diagnostics) by nick translation. After hybridization, the slides were incubated with FITC-avidin (Vector Laboratories). Hybridization signals (arrows) were detected on the short arm of XLA3L, but not XLA3S. Scale bar, 5 μm. b, A large deletion including an olfactory receptor gene (or) cluster. Schematic structures of or gene clusters and adjacent genes on the 8th chromosomes of X. tropicalis (XTR8) and X. laevis (XLA8L and XLA8S). Chromosomal locations: XTR8: 107,524,547–108,927,581; XLA8L: 105,062,063–106,610,199; XLA8S: 91,630,596–92,060,451. Horizontal bars, genomic DNA sequences; triangles, genes. Outside of the or gene cluster, only representative genes are shown. The size of the triangle is to scale. The orientation of triangles indicates 5′ to 3′ direction of genes. Thin lines connect orthologous/homoeologous genes. Magenta triangles, or genes; green triangles, pseudogenes (point-mutated or truncated or genes). The number of or genes is shown underneath gene clusters. Dotted lines, a deleted region in XLA8S compared to XLA8L. The centromere is located on the left side and the telomere is on the right. c, The relative frequency (left panel) and size (right panel) of genomic regions deleted in the S (blue) and L (green) chromosomes, respectively. Both subgenomes experienced sequence loss through deletions, but the deletions on the S subgenome are larger and have been more frequent. Deletions were called based on the progressive Cactus sequence alignment between the X. laevis L and S subgenomes and the X. tropicalis genome. Chromosome 9_10 of X. laevis was split into 9 and 10 on the basis of alignment with the X. tropicalis chromosomes. Sequences from L that were not present on S, but could at least partially be identified in X. tropicalis, and consisted of gaps for no more than 25% of their length, were called as deleted regions in S. The same procedure was followed for deleted regions in L. d, Identification of triplet loci is described in Supplementary Note 8.1. Loci were classified into groups based on the presence of gene 2 in both X. laevis subgenomes (homoeologue retained), versus those that had a pseudogene in the middle (pseudogene) or no remnant of the middle gene as assessed by Exonerate (deletion). To normalize the intergenic lengths, we divided the nucleotide distance between genes 1 and 3 in either X. laevis subgenome by the orthologous distance in X. tropicalis. The median of the normalized ratio distribution is plotted on the bar chart. On average, S deletions appear to be larger than L deletions (52.9% versus 80.2% of the size of the orthologous X. tropicalis region, respectively). e, The number of RNA-seq reads aligning ±1 kb of precursor miRNA loci (red) was compared to the read count for 10,000 random unannotated 2.1 kb regions of the genome (blue). All 83 homoeologous, intergenic miRNA pairs showed alignment within their regions, as opposed to 4,127 out of 10,000 (41.27%) of the randomly chosen intergenic sequences. The putative primary-miRNA loci also have a higher read count than the expressed randomly chosen regions (Wilcoxon signed-rank test, P = 1.4 × 10^−38). f, The Cactus alignment was parsed to identify flanking CNEs around each X. tropicalis gene. The number of CNEs >50 bp in length for singletons is shown in red, homoeologues in blue. Kolmogorov–Smirnov test, P = 10^−11. g, The average distance to the nearest gene was computed for each chromosomal locus in X. tropicalis. The average intergenic distance for those with a single X. laevis gene is shown in red, those with two shown in blue. Wilcoxon signed-rank test (P = 9.8 × 10^−24). h, The distribution of gene retention by genomic footprint of the X. tropicalis orthologue. We define genomic footprint as the genomic distance from the start signal of the coding sequence (CDS) to the stop signal, including introns. The x axis shows log10(genomic footprint), the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 2.4 × 10^−96). i, The distribution of gene retention by CDS length of the X. tropicalis orthologue. The x axis shows log10(CDS length), the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 1.7 × 10^−21). j, The distribution of gene retention by exon number of the X. tropicalis orthologue. The x axis shows number of exons; the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 3.2 × 10^−8).
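
The normalization described in panel d can be sketched as follows: the distance between flanking genes 1 and 3 in one X. laevis subgenome is divided by the orthologous distance in X. tropicalis, and the median of those ratios is reported. The function name and the distances in the example are hypothetical; how loci are identified and classified (Supplementary Note 8.1) is not reproduced here.

```python
from statistics import median


def median_normalized_distance(distance_pairs):
    """Median of (X. laevis distance / X. tropicalis distance) ratios.

    distance_pairs is an iterable of (laevis_bp, tropicalis_bp) tuples,
    one per triplet locus. Loci with a non-positive X. tropicalis
    distance are skipped because the ratio is undefined for them.
    """
    ratios = [xl / xt for xl, xt in distance_pairs if xt > 0]
    if not ratios:
        raise ValueError("no loci with a usable X. tropicalis distance")
    return median(ratios)


# Hypothetical distances (bp) for three loci on one subgenome.
example_loci = [(500, 1000), (800, 1000), (300, 1000)]
print(median_normalized_distance(example_loci))
```

A median ratio below 1 means the region between the flanking genes is typically shorter in that subgenome than in X. tropicalis, which is how the caption compares the sizes of S and L deletions.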

  10. Pseudogenes.
    Extended Data Fig. 6: Pseudogenes.

    a, Illustration of htt.S pseudogene alignment to X. tropicalis htt and the extant X. laevis htt.L, translated to amino acids. The amino acid position is shown at the beginning of each line. Missing codons are marked by dashes. Frameshifts and premature stops are marked by X and *, respectively (and pointed to with red arrows). The first exon of the pseudogene is completely missing from the S chromosome (top). The characteristic poly-Q region is maintained by both htt and htt.L. An exon with conservation in the pseudogene (bottom), illustrating that despite many frameshifts, premature stops, the lack of a proper start and insertions of new sequence, we identify many codons in the pseudogene that occur in large conserved blocks. b, Illustration of our model to compute pseudogene ages. The star represents the point of nonfunctionalization for a locus that is currently a pseudogene. We assume the expected rate of nonsynonymous changes can be estimated by the Ka of the extant gene and X. tropicalis. We then compare the Ks and Ka of the pseudogene sequence to estimate the time of nonfunctionalization. See Supplementary Note 9 for a more detailed discussion. c, Estimated epochs of pseudogenization for 430 genes are indistinguishable from a burst of pseudogenization >10 Ma (Ks > 0.03). See Supplementary Note 9 for a more detailed discussion. d, Correlation of pseudogene expression with its extant homoeologue. The little expression seen in pseudogenes tends to be uncorrelated with the extant homoeologue. e, Histogram of pseudogene expression values across all 28 tissues and developmental stages (red) compared to all extant genes (blue). The pseudogenes are rarely expressed and tend to be expressed at lower levels than extant protein-coding genes. f, Histograms of expression variance of pseudogenes (red) compared to extant genes (blue). The small amount of pseudogene expression observed does not tend to vary across tissues and developmental stages in the same way that extant genes do.
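
To make the reasoning in panel b concrete, here is a deliberately simplified two-phase sketch, not the procedure of Supplementary Note 9. It assumes that the pseudogene lineage first evolved under constraint with a Ka/Ks ratio ω (taken here from the extant gene, as a hypothetical input) and then neutrally with Ka/Ks = 1, so that the observed ratio is ω(1 − f) + f, where f is the fraction of the lineage's history spent as a pseudogene. All function and parameter names, and the example values, are illustrative assumptions.

```python
def fraction_of_time_as_pseudogene(ka_pseudo, ks_pseudo, omega_functional):
    """Solve observed Ka/Ks = omega*(1 - f) + f for f, clamped to [0, 1].

    ka_pseudo, ks_pseudo: rates measured on the pseudogene lineage.
    omega_functional: assumed Ka/Ks while the gene was still functional.
    """
    if ks_pseudo <= 0:
        raise ValueError("ks_pseudo must be positive")
    if not 0 <= omega_functional < 1:
        raise ValueError("omega_functional must be in [0, 1)")
    observed = ka_pseudo / ks_pseudo
    f = (observed - omega_functional) / (1.0 - omega_functional)
    return min(1.0, max(0.0, f))


def nonfunctionalization_time(ka_pseudo, ks_pseudo, omega_functional, lineage_age_ma):
    """Estimated time (Ma) since loss of function under the model above."""
    f = fraction_of_time_as_pseudogene(ka_pseudo, ks_pseudo, omega_functional)
    return lineage_age_ma * f


# Hypothetical values: observed Ka/Ks of 0.6 on a 34-Myr-old lineage,
# with an assumed functional Ka/Ks of 0.2.
print(nonfunctionalization_time(0.06, 0.10, 0.2, 34.0))
```

Under this model, a pseudogene whose Ka/Ks is close to that of the extant gene is inferred to have lost function recently, whereas a ratio near 1 implies an early loss.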

  11. Tandem duplications.
    Extended Data Fig. 7: Tandem duplications.

    a, Phylogenetic trees of the mix/bix cluster. Nucleotide sequences were aligned using MUSCLE and a phylogenetic diagram was generated by the ML method with 1,000 bootstraps (MEGA6). Circles with different colours represent X. laevis L genes (magenta), X. laevis S genes (blue) and X. tropicalis genes (green). The table shows the correspondence of bix gene names proposed in this study and previously used (synonyms). b, FISH analysis showing XLA3S-specific deletion of the nodal5 gene cluster. One unit of the nodal5 gene region, including exons, introns and an intergenic region was used as a probe for FISH (counterstained with Hoechst). Arrows indicate the hybridization signals of nodal5s. Scale bar, 5 μm. c, Comparison of the nodal5 gene cluster. Genome sequencing revealed that nodal5.e1.L~.e5.L (pink) and nodal6.L are clustered. Amplification of nodal5 gene in XLA3L and loss of this cluster in XLA3S were confirmed. Pseudogenes (nodal5p1.L~p4.L and nodal5p1.S) are indicated in black. The nodal5 cluster of X. tropicalis does not contain any pseudogene. d, The X. laevis L chromosome has four complete copies of nodal3 (nodal3.e1.L~.e4.L), whereas the gene cluster is lost from the X. laevis S chromosome. A truncated nodal3 gene (nodal3p1.L) is likely to be a pseudogene and highly degenerate pseudogenes (nodal3p2.L and nodal3p3.L) also exist on the L chromosome. e, Like nodal3, vg1 is lost from the S chromosome although there is a pseudogene (vg1p.S). vg1 is specifically amplified on the X. laevis L chromosome (vg1.e1.L~.e3.L) in comparison with X. tropicalis. An amino acid change (Ser20 to Pro20) in Vg1 protein has been shown to result in functional differences (Supplementary Note 13.9). vg1 and derrière are orthologous to mammalian gdf1. f, Fraction of all genes duplicated and retained to present epoch per 1 expected 4DTV (fourfold degenerate transversion) at different epochs (semi-log scale). 
Shown also are linear fits, which would be consistent with constant birth- and death-rate models (first epoch is omitted from both fitted datasets, as is second epoch from X. laevis). See Supplementary Note 11 for a more detailed discussion. g, Same as f, but for ‘short genes’ (CDS <600 bp) and ‘long genes’ (CDS >1,200 bp) separately. The loss rate of new duplicates appears to be similar. If the extra copy of a newly duplicated gene was lost when the first 100% disabling mutation occurred, we would expect, on average, the longer genes to be lost faster.

  12. Gene expression analysis.
    Extended Data Fig. 8: Gene expression analysis.

    a, Pairwise Pearson correlation distributions between homoeologous genes (red) and all genes (blue). Left histogram, stage data; right, adult data. The x axis shows the correlation; the y axis the percentage of data. The homoeologous genes have a correlation distribution closer to one owing to the fact that these were recently the same locus. X. laevis TPM values of 0.5 were lowered to 0. Any gene with no TPM >0 was removed from analysis. We then added 0.1 to all TPM values and log transformed (log10) them. b, Scatter plot comparing binned genes by their median X. tropicalis expression57 to the retention rate of their X. laevis (co)-orthologues. Error bars are the standard deviation for the whole dataset divided by the square root of the number of genes analysed in a bin. We assessed significance by a Wilcoxon signed-rank test of the homoeologous and singleton distributions, P = 6.31 × 10−113. c, Full version of the box plot shown in Fig. 4c. The difference between subgenomes is difficult to see at this magnification, illustrating that many loci deviate from the whole genome median of preferring the L homoeologue. There were some L outliers expressed 104 as much as their S homoeologues, whereas no S genes showed such a strong trend. These differences are discussed in more detail in Supplementary Note 12. d, Box plot of 4DTV by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (Wilcoxon signed-rank test, P < 10−5). The high correlation, similar expression (HCSE) group showed lower sequence change than other groups (P = 3.7 × 10−12) and the no correlation, different expression (NCDE) group showed high rates of sequence change (P = 5.6 × 10−14). e, Box plot of CDS length difference between X. laevis homoeologues by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (Wilcoxon signed-rank test, P < 10−5). 
The HCSE group showed smaller CDS length differences than other groups (P = 2.4 × 10−13) and the NCDE group showed large differences in homoeologue CDS length (P = 2.1 × 10−32). f, Box plot of Ka/Ks between X. laevis homoeologues by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (t-test P < 10−5). The HCSE group showed lower non-synonymous sequence change than other groups (P = 8.2 × 10−19) and the NCDE and no correlation, similar expression (NCSE) groups showed higher rates of non-synonymous sequence changes (P = 2.0 × 10−12 and P = 7.0 × 10−9 respectively). g, RNA-seq analysis of six6.L (red) and six6.S (blue) during X. laevis development (left) and in adult tissues (right). Expression levels of six6.S were lower than those of six6.L at most developmental stages and in adult tissues. h, Diagram of Homo sapiens, X. tropicalis and X. laevis six6 loci (upper panel). Magenta and black boxes indicate CNEs and exons, respectively. The phylogenetic tree analyses of H. sapiens, X. tropicalis and X. laevis six6 CNEs (lower left panel) and Six6 proteins (lower right panel). Notably, six6.S is more diverged from X. tropicalis six6 than six6.L, both in the encoded protein sequences and in CNEs within 3 kb of the transcription start sites. Materials, methods and the CNE locations on genome assemblies are described in Supplementary Note 13.1. i, On the basis of chromatin state properties, a Random Forest machine-learning algorithm can accurately predict L versus S expression bias. The classification is based on all genes with greater than threefold expression difference at NF stage 10.5 (a set of 1,129 genes). The mean (dotted black line) of the ROC area under the curve is 0.778 (tenfold cross-validation). Features were selected using Linear Support Vector Classification and are shown in j. j, Relative importance (based on Gini impurity) of selected features used in the Random Forest classification. 
All features used in the classification are shown. Among various variables, the ratios of H3K4me3 and DNA methylation at the promoter contributed most to the decision tree model. A difference in p300 binding in the genomic region surrounding the gene also contributed to the Random Forest classification, as did the presence or absence of a number of specific transcription factor motifs in the promoter.
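The expression preprocessing and correlation described in panel a can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `preprocess` and `pearson` are hypothetical, and the threshold is read here as a floor that sets TPM values below 0.5 to 0, which is an assumption about the legend.

```python
import math


def preprocess(tpm_values, floor=0.5, pseudocount=0.1):
    """Apply the legend's transform to one gene's TPM profile.

    Values below `floor` are set to 0 (assumed reading of the legend).
    Returns None when no value remains above 0 (gene removed from analysis),
    otherwise log10(TPM + pseudocount) for every sample.
    """
    floored = [0.0 if v < floor else v for v in tpm_values]
    if not any(v > 0 for v in floored):
        return None
    return [math.log10(v + pseudocount) for v in floored]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (NaN if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical TPM profiles for an L/S homoeologue pair across four samples.
l_copy = preprocess([0.2, 3.0, 12.0, 40.0])
s_copy = preprocess([0.1, 1.5, 7.0, 18.0])
if l_copy is not None and s_copy is not None:
    print(round(pearson(l_copy, s_copy), 3))
```

Adding the pseudocount before the log transform keeps zero-expression samples finite, so a gene that is silent in some stages can still be correlated with its homoeologue.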

  13. Examples of pathway responses.
    Extended Data Fig. 9: Examples of pathway responses.

    a, The Wnt pathway. Left panel, several key components of the canonical Wnt pathway in the X. laevis genome. The numbers in brackets show the number of paralogues. Components that have a homoeologous pair of genes or a singleton are shown in blue or red, respectively. Each gene (wnt: 21 genes, LRP: 2 genes, Fzd: 10 genes, Dvl: 3 genes, Frat(GBP): 1 gene, GSK3: 2 genes, Axin: 2 genes, β-catenin: 1 gene, APC: 2 genes, TCF/LEF: 4 genes) was classified into 4 groups according to subcellular localization, and the number of singleton and homoeologue retained genes is shown by pie charts. Right panel, syntenies around four singleton genes. b, Cell cycle regulation. Upper right panel, diagram of the cell cycle and regulatory proteins critical to each phase. Cyclin H (ccnh) and Cdk7 constitute Cdk-activating kinase (CAK), a key factor required for activation of all Cdks. Genes encoding Cyclin H and Cdk7 (red), but not other regulators (blue), became singletons. Upper left panel, pie charts show the numbers of homoeologous pairs (blue) and singletons (red) in each functional category as indicated. Lower left panel, syntenies of ccnh and cdk7 loci in X. tropicalis and X. laevis. Lower right table, individual genes used for drawing the pie charts. c, The Hippo pathway. Upper panel, Hippo pathway components and retention of their homoeologous gene pairs. All genes for Hippo pathway components as indicated were identified in the whole genome of X. laevis. Blue icons indicate that both of the homoeologous genes are expressed in normal development and adult organs. The red icon, Taz, indicates a singleton. Yap is interchangeable with Taz in most cases, although TAZ, and not YAP, serves as a mediator of Wnt signalling (broken line). Pie charts show the numbers of homoeologue pairs (blue) and singletons (red) in each category of the Hippo pathway components classified according to subcellular localization. 
Lower panel, comparative analysis of syntenies around the taz gene. X. tropicalis scaffold247 is not incorporated into the chromosome-scale assembly (v9) and hence its chromosomal location is not known yet. The p arm termini of XLA8L and XLA8S are on the left. See Supplementary Note 13 for further details.

  14. Pathways continued.
    Extended Data Fig. 10: Pathways continued.

    a, The TGFβ pathway. Pie charts indicate the ratio of differentially expressed homoeologous pairs (orange) and singletons (red). Many of the extracellular regulatory factors are either differentially regulated or became singletons. Genes for a type I receptor, co-receptors and an inhibitory Smad are also differentially regulated. Multicopy genes such as nodal3, nodal5 and vg1 are not counted as singletons, even though those genes are deleted on S chromosomes. Instead, these and duplicated chordin genes are categorized into differentially regulated genes. b, The sonic hedgehog pathway. Upper panel, simplified schematic of the hedgehog pathway as known from Shh signalling. Most signalling components are encoded by both homoeologous genes, whereas Hhat (shown in red) is encoded by a singleton gene. Where paralogues exist, the numbers of paralogues are shown in parentheses. In the left cell, the Shh precursor (Hh precursor) is matured through the process involving Hhat and Hhatl and secreted. In the right cell, the binding of Shh (Hh) to Ptch1 (Ptch) receptor inhibits Ptch1-mediated repression of Smo, leading to Smo activation and subsequent inhibition of PKA; otherwise PKA converts Gli activators to truncated repressors. As a consequence, Gli proteins activate target genes, such as Ptch1 and Hhip. The transmembrane protein Hhip binds Shh and suppresses Shh activity. Lower panel, schematic comparison of syntenies around hhat genes of X. tropicalis chromosome 5 (top) and X. laevis 5L chromosome (middle) and the corresponding region of X. laevis 5S chromosome (bottom). The diagram is not drawn to scale. c, Deletion rates on L (x axis) versus S (y axis) for different Pfam groups. For Pfam groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < 0.01). 
See Supplementary Table 5 for more information. d, Deletion rates on L (x axis) versus S (y axis) for different stage weighted gene correlation network analysis (WGCNA)54 groups (visualized as a heatmap in Fig. 4a). For stage WGCNA groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < 0.01). e, Deletion rates on L (x axis) versus S (y axis) for different GO groups. For GO groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < 0.01). See Supplementary Table 5 for more information.
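The per-group retention calculation in panels c to e can be sketched as below. The legend does not name the statistical test behind the P < 0.01 cut-off, so the exact two-sided binomial test used here is an assumption, as are the function names and the example counts; only the 56.4% genome-wide average comes from the legend.

```python
import math

GENOME_WIDE_RETENTION = 0.564  # genome-wide average quoted in the legend


def retention_fraction(pairs, singletons):
    """Fraction of genes in a group that are retained as homoeologue pairs."""
    return pairs / (pairs + singletons)


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial P value (assumed test, not stated in the legend).

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k out of n. Computed in log space so large groups do not
    overflow. Requires 0 < p < 1.
    """
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log(1 - p))

    probs = [math.exp(log_pmf(i)) for i in range(n + 1)]
    observed = probs[k]
    total = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical group: 90 genes kept as pairs, 30 reduced to singletons.
pairs, singletons = 90, 30
fraction = retention_fraction(pairs, singletons)
p_value = binomial_two_sided_p(pairs, pairs + singletons, GENOME_WIDE_RETENTION)
print(f"retained {fraction:.3f}, flagged: {p_value < 0.01}")
```

A group would be drawn as a red point under this sketch when its P value falls below 0.01, whether its retention is above or below the genome-wide line.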

References

  1. Van de Peer, Y., Maere, S. & Meyer, A. The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732 (2009)
  2. Holland, P. W., Garcia-Fernàndez, J., Williams, N. A. & Sidow, A. Gene duplications and the origins of vertebrate development. Development Suppl., 125–133 (1994)
  3. Muller, H. J. Why polyploidy is rarer in animals than in plants. Am. Nat. 59, 346–353 (1925)
  4. Orr, H. A. ‘Why polyploidy is rarer in animals than in plants’ revisited. Am. Nat. 136, 759–770 (1990)
  5. Berthelot, C. et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5, 3657 (2014)
  6. Woods, I. G. et al. The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Res. 15, 1307–1314 (2005)
  7. Glasauer, S. M. K. & Neuhauss, S. C. F. Whole-genome duplication in teleost fishes and its evolutionary consequences. Mol. Genet. Genomics 289, 1045–1060 (2014)
  8. Otto, S. P. The evolutionary consequences of polyploidy. Cell 131, 452–462 (2007)
  9. Ohno, S. Evolution by Gene Duplication (Springer, 1970)
  10. Kobel, H. R. & Du Pasquier, L. Genetics of polyploid Xenopus. Trends Genet. 2, 310–315 (1986)
  11. Harland, R. M. & Grainger, R. M. Xenopus research: metamorphosed by genetics and genomics. Trends Genet. 27, 507–515 (2011)
  12. Kuramoto, M. A list of chromosome numbers of anuran amphibians. Bull. Fukuoka Univ. Educ. 39, 83–127 (1990)
  13. Bisbee, C. A., Baker, M. A., Wilson, A. C., Haji-Azimi, I. & Fischberg, M. Albumin phylogeny for clawed frogs (Xenopus). Science 195, 785–787 (1977)
  14. Uno, Y., Nishida, C., Takagi, C., Ueno, N. & Matsuda, Y. Homoeologous chromosomes of Xenopus laevis are highly conserved after whole-genome duplication. Heredity 111, 430–436 (2013)
  15. Uno, Y. et al. Inference of the protokaryotypes of amniotes and tetrapods and the evolutionary processes of microchromosomes from comparative gene mapping. PLoS One 7, e53027 (2012)
  16. Matsuda, Y. et al. A new nomenclature of Xenopus laevis chromosomes based on the phylogenetic relationship to Silurana/Xenopus tropicalis. Cytogenet. Genome Res. 145, 187–191 (2015)
  17. Yoshimoto, S. et al. A W-linked DM-domain gene, DM-W, participates in primary ovary development in Xenopus laevis. Proc. Natl Acad. Sci. USA 105, 2469–2474 (2008)
  18. Zhang, X. et al. P instability factor: an active maize transposon system associated with the amplification of Tourist-like MITEs and a new superfamily of transposases. Proc. Natl Acad. Sci. USA 98, 12572–12577 (2001)
  19. Jurka, J. & Kapitonov, V. V. PIFs meet Tourists and Harbingers: a superfamily reunion. Proc. Natl Acad. Sci. USA 98, 12315–12316 (2001)
  20. Ahn, S. J., Kim, M.-S., Jang, J. H., Lim, S. U. & Lee, H. H. MMTS, a new subfamily of Tc1-like transposons. Mol. Cells 26, 387–395 (2008)
  21. Morin, R. D. et al. Sequencing and analysis of 10,967 full-length cDNA clones from Xenopus laevis and Xenopus tropicalis reveals post-tetraploidization transcriptome remodeling. Genome Res. 16, 796–803 (2006)
  22. Hellsten, U. et al. Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis. BMC Biol. 5, 31 (2007)
  23. Bewick, A. J., Chain, F. J. J., Heled, J. & Evans, B. J. The pipid root. Syst. Biol. 61, 913–926 (2012)
  24. Cannatella, D. Xenopus in space and time: fossils, node calibrations, tip-dating, and paleobiogeography. Cytogenet. Genome Res. 145, 283–301 (2015)
  25. Voss, S. R. et al. Origin of amphibian and avian chromosomes by fission, fusion, and retention of ancestral chromosomes. Genome Res. 21, 1306–1312 (2011)
  26. Ferguson-Smith, M. A. & Trifonov, V. Mammalian karyotype evolution. Nat. Rev. Genet. 8, 950–962 (2007)
  27. Langham, R. J. et al. Genomic duplication, fractionation and the origin of regulatory novelty. Genetics 166, 935–945 (2004)
  28. Haldane, J. B. S. The part played by recurrent mutation in evolution. Am. Nat. 67, 5–19 (1933)
  29. Birchler, J. A. & Veitia, R. A. Gene balance hypothesis: connecting issues of dosage sensitivity across biological disciplines. Proc. Natl Acad. Sci. USA 109, 14746–14753 (2012)
  30. Schnable, J. C., Springer, N. M. & Freeling, M. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proc. Natl Acad. Sci. USA 108, 4069–4074 (2011)
  31. Sankoff, D., Zheng, C. & Wang, B. A model for biased fractionation after whole genome duplication. BMC Genomics 13 (Suppl. 1), S8 (2012)
  32. Garsmeur, O. et al. Two evolutionarily distinct classes of paleopolyploidy. Mol. Biol. Evol. 31, 448–454 (2014)
  33. Sémon, M. & Wolfe, K. H. Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis. Proc. Natl Acad. Sci. USA 105, 8333–8338 (2008)
  34. Chain, F. J. J., Dushoff, J. & Evans, B. J. The odds of duplicate gene persistence after polyploidization. BMC Genomics 12, 599 (2011)
  35. Lee, A. P., Kerk, S. Y., Tan, Y. Y., Brenner, S. & Venkatesh, B. Ancient vertebrate conserved noncoding elements have been evolving rapidly in teleost fishes. Mol. Biol. Evol. 28, 1205–1215 (2011)
  36. Force, A. et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999)
  37. Meredith, R. W., Gatesy, J., Murphy, W. J., Ryder, O. A. & Springer, M. S. Molecular decay of the tooth gene Enamelin (ENAM) mirrors the loss of enamel in the fossil record of placental mammals. PLoS Genet. 5, e1000634 (2009)
  38. Kondrashov, F. A. & Koonin, E. V. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 20, 287–290 (2004)
  39. Aury, J.-M. et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444, 171–178 (2006)
  40. Gout, J.-F., Kahn, D., Duret, L. & Paramecium Post-Genomics Consortium. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6, e1000944 (2010)
  41. Yanai, I., Peshkin, L., Jorgensen, P. & Kirschner, M. W. Mapping gene expression in two Xenopus species: evolutionary constraints and developmental flexibility. Dev. Cell 20, 483–496 (2011)
  42. Langley, A. R., Smith, J. C., Stemple, D. L. & Harvey, S. A. New insights into the maternal to zygotic transition. Development 141, 3834–3841 (2014)
  43. Marcet-Houben, M. & Gabaldón, T. Beyond the whole-genome duplication: phylogenetic evidence for an ancient interspecies hybridization in the baker’s yeast lineage. PLoS Biol. 13, e1002220 (2015)
  44. McClintock, B. The significance of responses of the genome to challenge. Science 226, 792–801 (1984)
  45. Chapman, J. A. et al. Meraculous: de novo genome assembly with short paired-end reads. PLoS One 6, e23501 (2011)
  46. Chen, L. et al. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30, 90–98 (2011)
  47. Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26, 342–350 (2016)
  48. Chang, C. Y. & Witschi, E. Genic control and hormonal reversal of sex differentiation in Xenopus. Proc. Soc. Exp. Biol. Med. 93, 140–144 (1956)
  49. Gilchrist, M. J. From expression cloning to gene modeling: the development of Xenopus gene sequence resources. Genesis 50, 143–154 (2012)
  50. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org.
  51. Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015)
  52. Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014)
  53. Calvo, S. E., Clauser, K. R. & Mootha, V. K. MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins. Nucleic Acids Res. 44, D1251–D1257 (2016)
  54. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008)
  55. Edwards, N. S. & Murray, A. W. Identification of Xenopus CENP-A and an associated centromeric DNA repeat. Mol. Biol. Cell 16, 1800–1810 (2005)
  56. McLysaght, A. et al. Ohnologs are overrepresented in pathogenic copy number mutations. Proc. Natl Acad. Sci. USA 111, 361–366 (2014)
  57. Tan, M. H. et al. RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development. Genome Res. 23, 201–216 (2013)


Author information

  1. These authors contributed equally to this work.

    • Adam M. Session,
    • Yoshinobu Uno &
    • Taejoon Kwon
  2. Present address: Personalis Inc., 1330 O’Brien Drive, Menlo Park, California 94025, USA.

    • Christian Haudenschild

Affiliations

  1. University of California, Berkeley, Department of Molecular and Cell Biology and Center for Integrative Genomics, Life Sciences Addition #3200, Berkeley, California 94720-3200, USA

    • Adam M. Session,
    • Jessica B. Lyons,
    • Therese Mitros,
    • Darwin S. Dichmann,
    • Richard M. Harland &
    • Daniel S. Rokhsar
  2. US Department of Energy Joint Genome Institute, Walnut Creek, California 94598, USA

    • Adam M. Session,
    • Jarrod A. Chapman,
    • Uffe Hellsten,
    • Shengquiang Shu,
    • Joseph Carlson,
    • Jerry Jenkins,
    • Jane Grimwood,
    • Jeremy Schmutz &
    • Daniel S. Rokhsar
  3. Department of Applied Molecular Biosciences, Graduate School of Bioagricultural Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8601, Japan

    • Yoshinobu Uno &
    • Yoichi Matsuda
  4. Department of Molecular Biosciences, Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, Texas 78712, USA

    • Taejoon Kwon,
    • Edward M. Marcotte &
    • John B. Wallingford
  5. Department of Biomedical Engineering, School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 689-798, Republic of Korea

    • Taejoon Kwon
  6. Center for Information Biology, and Advanced Genomics Center, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan

    • Atsushi Toyoda &
    • Asao Fujiyama
  7. Amphibian Research Center, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8526, Japan

    • Shuji Takahashi &
    • Atsushi Suzuki
  8. Laboratory of Tissue and Polymer Sciences, Faculty of Advanced Life Science, Hokkaido University, N10W8, Kita-ku, Sapporo 060-0810, Japan

    • Akimasa Fukui
  9. Division of Human Sciences, Graduate School of Integrated Arts and Sciences, Hiroshima University, 1-7-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8521, Japan

    • Akira Hikosaka
  10. Misaki Marine Biological Station (MMBS), Graduate School of Science, The University of Tokyo, 1024 Koajiro, Misaki, Miura, Kanagawa 238-0225, Japan

    • Mariko Kondo
  11. Radboud University, Faculty of Science, Department of Molecular Developmental Biology, 259 RIMLS, M850/2.97, Geert Grooteplein 28, Nijmegen 6525 GA, the Netherlands

    • Simon J. van Heeringen,
    • Georgios Georgiou,
    • Sarita S. Paranjpe,
    • Ila van Kruijsbergen &
    • Gert Jan C. Veenstra
  12. Salk Institute, Molecular Neurobiology Laboratory, La Jolla, San Diego, California 92037, USA

    • Ian Quigley
  13. Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, San Diego, California 92037, USA

    • Sven Heinz
  14. Department of Animal Bioscience, Nagahama Institute of Bio-Science and Technology, 1266 Tamura, Nagahama, Shiga 526-0829, Japan

    • Hajime Ogino
  15. Institute for Promotion of Medical Science Research, Yamagata University Faculty of Medicine, 2-2-2 Iida-Nishi, Yamagata, Yamagata 990-9585, Japan

    • Haruki Ochi
  16. Molecular Genetics Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa 904-0495, Japan

    • Oleg Simakov &
    • Daniel S. Rokhsar
  17. Dovetail Genomics LLC, Santa Cruz, California 95060, USA

    • Nicholas Putnam &
    • Jonathan Stites
  18. Department of Genome Medicine, National Research Institute for Child Health and Development, NCCHD, 2-10-1, Okura, Setagaya-ku, Tokyo 157-8535, Japan

    • Yoko Kuroki
  19. Department of Life Science and Technology, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-8501, Japan

    • Toshiaki Tanaka
  20. Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1, Komaba, Meguro-ku, Tokyo 153-8902, Japan

    • Tatsuo Michiue
  21. Institute of Institution of Liberal Arts and Fundamental Education, Tokushima University, 1-1 Minamijosanjima-cho, Tokushima 770-8502, Japan

    • Minoru Watanabe
  22. Harry Perkins Institute of Medical Research and ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Perth, Western Australia 6009, Australia

    • Ozren Bogdanovic &
    • Ryan Lister
  23. Department of Life Science, Faculty of Science, Rikkyo University, 3-34-1 Nishi-Ikebukuro, Toshima-ku, Tokyo 171-8501, Japan

    • Tsutomu Kinoshita
  24. Department of Microbiology and Immunology, University of Maryland, 655 W Baltimore St, Baltimore, Maryland 21201, USA

    • Yuko Ohta &
    • Martin F. Flajnik
  25. Kitasato Institute for Life Sciences, Kitasato University, 5-9-1 Shirokane Minato-ku, Tokyo 108-8641, Japan

    • Shuuji Mawaribuchi
  26. HudsonAlpha Institute of Biotechnology, Huntsville, Alabama 35806, USA

    • Jerry Jenkins,
    • Jane Grimwood &
    • Jeremy Schmutz
  27. Department of Human Genetics, University of Chicago, 920 E. 58th St, CLSC 431F, Chicago, Illinois 60637, USA

    • Sahar V. Mozaffari
  28. Department of Computational Biology and Medical Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8568, Japan

    • Yutaka Suzuki
  29. Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), Central 5, 1-1-1 Higashi, Tsukuba, Ibaraki 305-8565, Japan

    • Yoshikazu Haramoto,
    • Yuzuru Ito &
    • Makoto Asashima
  30. Division of Morphogenesis, Department of Developmental Biology, National Institute for Basic Biology, 38 Nishigonaka, Myodaiji, Okazaki, Aichi 444-8585, Japan

    • Takamasa S. Yamamoto,
    • Chiyo Takagi &
    • Naoto Ueno
  31. University of California, Berkeley, Department of Molecular and Cell Biology, Life Sciences Addition #3200, Berkeley, California 94720-3200, USA

    • Rebecca Heald &
    • Kelly Miller
  32. Illumina Inc., 25861 Industrial Blvd, Hayward, California 94545, USA

    • Christian Haudenschild
  33. Department of Genome Sciences, University of Washington, Foege Building S-250, Box 355065, 3720 15th Ave NE, Seattle, Washington 98195-5065, USA

    • Jacob Kitzman &
    • Jay Shendure
  34. Department of Biology, University of Virginia, Charlottesville, Virginia 22904, USA

    • Takuya Nakayama
  35. Department of Biology, Faculty of Science, Niigata University, 8050, Ikarashi 2-no-cho, Nishi-ku, Niigata 950-2181, Japan

    • Yumi Izutsu
  36. Department of Microbiology & Immunology, University of Rochester Medical Center, Rochester, New York 14642, USA

    • Jacques Robert
  37. Division of Developmental Biology, Cincinnati Children's Research Foundation, Cincinnati, Ohio 45229-3039, USA

    • Joshua Fortriede,
    • Kevin Burns &
    • Aaron M. Zorn
  38. Department of Biological Sciences, University of Calgary, Alberta T2N 1N4, Canada

    • Vaneet Lotay,
    • Kamran Karimi &
    • Peter D. Vize
  39. Marine Genomics Unit, Okinawa Institute of Science and Technology Graduate University, 1919-1 Tancha, Onna-son, Okinawa 904-0495, Japan

    • Yuuri Yasuoka
  40. The University of Iowa, Department of Biology, 257 Biology Building, Iowa City, Iowa 52242-1324, USA

    • Douglas W. Houston
  41. Department of Zoology and Evolutionary Biology, University of Basel, Basel CH-4051, Switzerland

    • Louis DuPasquier
  42. Department of Biological Sciences, School of Science, Kitasato University, 1-15-1 Minamiku, Sagamihara, Kanagawa 252-0373, Japan

    • Michihiko Ito
  43. Department of Basic Biology, SOKENDAI (The Graduate University for Advanced Studies), 38 Nishigonaka, Myodaiji, Okazaki, Aichi 444-8585, Japan

    • Naoto Ueno
  44. Principles of Informatics, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

    • Asao Fujiyama
  45. Department of Genetics, SOKENDAI (The Graduate University for Advanced Studies), 1111 Yata, Mishima, Shizuoka 411-8540, Japan

    • Asao Fujiyama
  46. Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan

    • Masanori Taira

Contributions

R.M.H., M.T., D.S.R., G.J.C.V., A.Fuj., A.S., A.M.S., T.Kw., Y.U., A.Fuk., M.K. and H.Og. provided project leadership, with additional project management from Y.M., M.A., Y.Iz., N.U., J.Sh., J.B.W., E.M.M., J.Sc., A.M.Z., P.D.V. and M.I. Y.Iz. and J.R. inbred J strain frogs. A.T., C.H., A.Fuj., J.G., J.C., J.K., J.Sh., T.Mit. and J.B.L. generated genome sequence data. J.A.C., A.M.S., T.Kw., J.J., A.Fuk., M.T. and J.Sc. performed genome assembly and validation. S.T., T.Kw., A.M.S., Y.S., T.T., A.T., A.S. and M.T. generated and analysed the transcriptome data. A.M.S., T.Kw., S.J.v.H. and S.S. generated the annotations. Manual validation of annotation was done by H.Og., S.T., A.Fuk., A.S., M.K., H.Oc., T.T., T.Mic., M.W., T.Ki., Y.O., S.Ma., Y.H., T.N., Y.Y., J.F., K.B., V.L., D.W.D., M.T. and K.K. K.M., A.M.S. and R.H. generated the Hymenochirus transcriptome data. A.M.S. performed the phylogenetic analysis, with contributions from S.V.M. and U.H. M.W., A.Fuk., S.Ma., Y.U., Y.M. and M.T. performed the chromosome structure analysis. A.M.S., A.H., O.S., J.C. and Y.U. studied the transposable elements. BAC-FISH was performed by Y.U., A.Fuk., M.K., A.T., S.T., H.Og., H.Oc., Y.K., T.T., T.M., M.W., T.Ki., Y.O., Y.H., T.S.Y., C.T., T.N., A.S., Y.M., N.U., M.A., Y.Iz., A.Fuj. and M.T. I.Q., S.H., N.P. and J.St. generated and analysed the chromatin conformation capture data and their use in long-range scaffolding. H.Og. and H.Oc. performed the transgenic enhancer analysis. S.J.v.H., G.G., S.S.P., I.v.K., O.B., R.L. and G.J.C.V. generated and analysed the epigenetic data. A.S., A.M.S., T.Ki., M.K., M.T., Y.O., T.T., A.Fuk., M.W., T.Mic., D.W.H., T.N. and L.D. conducted the gene and pathway analysis. D.S.R., A.M.S., T.Kw., R.M.H., M.T., A.S., Y.U., G.J.C.V., M.K., U.H., S.J.v.H., A.Fuk., A.H., O.S., H.Og., T.T., I.Q., J.K., Y.O., S.T., M.W., T.Mic., A.T., H.Oc., T.Ki., S.Maw., Y.S., T.N., Y.Iz. and M.F.F. wrote the paper and supplementary notes, with input from all authors.

Competing financial interests

Dovetail Genomics LLC is a commercial entity developing genome assembly methods. N.P. and J.St. are employees of Dovetail Genomics, and D.S.R. is a scientific advisor to and minor investor in Dovetail.

Corresponding authors

Correspondence to:

NCBI (LYTH00000000). Sequence Read Archive (SRP071264, SRP070985). NCBI Gene Expression Omnibus (GSE73430, GSE73419, GSE76089, GSE76059, GSE76247). DDBJ/GenBank/EMBL (AP017316 and AP017317).

Reviewer Information Nature thanks C. Amemiya, S. Burgess and the other anonymous reviewer(s) for their contribution to the peer review of this work.


Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Allotetraploidy and assembly.

    ae, Scenarios for allotetraploid formation from distinct ancestral diploid species A and B. Horizontal single lines indicate normal gametes, horizontal double lines indicate unreduced gametes; black square represents fertilization; vertical double lines indicate spontaneous (somatic) genome doubling. a, (i) Fusion of unreduced gametes from species A and B. (ii) Interspecific hybridization followed by spontaneous doubling. (iii) Fusion of unreduced gametes produced by interspecific hybrids. (iv) Interspecific hybrids produce unreduced gametes, which fuse with normal gametes from species A. The resulting triploid again produces unreduced gametes, which fuse with normal gametes from species B. (v) Unreduced gamete from species A fuses with normal gamete from species B. The resulting AAB triploid produces unreduced gametes that are fertilized by normal gametes from species B. See Supplementary Note 1.1 for a more detailed discussion. b, History of the J strain. See Supplementary Note 2.1 for details. The years of events and generation numbers (such as frog transfer to another institute, establishment of homozygosity, construction of materials) are indicated in the scheme. Generation numbers are estimates due to loss of old breeding records. c, The nucleotide distance of orthologues (green), homoeologues (red) and alleles (blue) is discussed in Supplementary Note 8.7. The distances are shown on a log scale to differentiate between the distributions. d, Frequency histogram showing the number of 51-mers with specified count in the shotgun dataset. The prominent peak implies that each genomic locus is sampled 29× in 51-mers. Note the absence of a feature at twice this depth, indicating that homoeologous features with high identity are rare. e, Cumulative proportion of 51-mers as a function of relative depth (that is, depth/29). Relative depth provides an estimate of genomic copy number. The rapid rise at relative depth 1 implies that 70–75% of the X. laevis genome is a single copy with respect to 51-mers. 
The remainder of the genome is primarily concentrated in repetitive sequences with copy number > 100. Note logarithmic scale. f, The contact map of 85,260 TCC read pairs for JGIv72.000090484.chr4S. Read pairs were binned at 10-kb intervals. For each read pair, the forward and reverse reads map with a map quality score of at least 20. g, The contact map of 85,260 Chicago read pairs for JGIv72.000090484.chr4S, a 3.1-Mb scaffold in the XENLA_JGI_v72 assembly. h, The insert distribution of TCC and Chicago read pairs that map to the same scaffold of XENLA_JGI_v72 with a map quality score of at least 20. The x axis is the read pair separation distance. The y axis is the counts for that bin divided by the total number of reads. The bins are 1 kb.
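The k-mer depth analysis in panels d and e can be illustrated with a toy sketch. It assumes that the cumulative proportion is weighted by k-mer occurrences (the legend does not say whether distinct or total 51-mers are counted), and it uses k = 3 on made-up reads purely for readability; `kmer_counts` and `cumulative_proportion` are hypothetical names, not tools used in the study.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cumulative_proportion(counts, peak_depth, max_relative_depth):
    """Share of k-mer occurrences whose relative depth (count / peak_depth)
    does not exceed max_relative_depth. Occurrence weighting is an assumption."""
    total = sum(counts.values())
    within = sum(c for c in counts.values() if c / peak_depth <= max_relative_depth)
    return within / total


# Toy data: three short reads, k = 3, with an assumed single-copy peak depth of 2.
reads = ["ACGT", "ACGT", "ACGA"]
counts = kmer_counts(reads, k=3)
print(cumulative_proportion(counts, peak_depth=2, max_relative_depth=1.0))
```

In the legend's setting the peak depth would be 29 and k would be 51, so a k-mer seen about 29 times has relative depth near 1 and is treated as single copy.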

  2. Extended Data Figure 2: Chromosome structure.

    a, Structure of the sex chromosome of X. laevis (XLA2L) and comparison with XLA2S and XTR2. The W version of XLA2L harbours a W-specific sequence containing the female sex-determining gene dmw (red) while Z has a different Z-specific sequence (blue). Pentagon arrows and black triangles indicate genes and olfactory receptor genes, respectively. Their tips correspond to their 3′-ends. b, Alignment of the q-terminal regions of XTR9 and 10 with corresponding regions of XLA9_10L and XLA9_10S. Genes near the q-terminal regions of XTR 9 and XTR10 were missing in the X. tropicalis genome assembly v9, but rps11, rpl13a, lypd1 and actr3 were expected to be located there based on the synteny with human chromosomes, and then verified by cDNA FISH (upper panels). Small triangles on XLA9_10L and S indicate the distribution of gene models showing both identity and coverage greater than 30%, against the human and chicken peptide sequences from Ensembl, in the region ±2 Mb from the prospective 9/10 junction. HSA, human chromosome; GGA, chicken chromosome. The magnified view represents syntenic genes to scale with colours corresponding to human genes. c, The orders of orthologous genes across XTR9, XTR10, XLA9_10L and XLA9_10S. Green arrowheads: positions of centromeres in XTR9 and 10 predicted by examination of the cytogenetic chromosome length ratio of p versus q arms15. Blue arrowheads: positions of centromere repeats, frog centromeric repeat-1 (ref. 55), in XLA9_10L and S. Magenta and yellow ellipses, chromosomal locations of snrpn (magenta) and stau1 (yellow) from X. tropicalis v9 and X. laevis v9.1 assemblies. Red ellipses, chromosomal locations of four genes, rps11, rpl13a, lypd1 and actr3. XTR9 is inverted to facilitate comparison. Blue bidirectional arrows indicate the homologous regions where pericentric inversions may have occurred on proto-chromosomes (see Extended Data Fig. 2d). 
d, Schematic representation for the two hypothetical processes of chromosomal rearrangements (fusion and inversion) that occurred between the hypothetical proto-XTR9 and 10 to produce proto-XLA9_10, and eventually XLA9_10L and S. The process of chromosome rearrangements is explained parsimoniously in two different ways (left and right panels), starting from proto-XTR9 and 10. Actual and hypothetical ancestral chromosomal locations of snrpn and stau1 are shown by magenta and yellow circles, respectively. Note that the chromosomal locations of these genes on the proto-XTR10 differ between the two models. Chromosome segments homologous to XTR9 and XTR10 are shown in red and blue, respectively. XTR9 is inverted to facilitate comparison. Bidirectional arrows indicate the regions where pericentric inversions may have occurred. Black arrows indicate the direction of chromosomal evolution.

  3. Extended Data Figure 3: Transposons.

    a, Density of the subgenome-specific transposons on each chromosome (coverage length of transposable element (bp)/chromosome length (Mbp)). The coverage lengths of transposons were calculated from the results of a BLASTN search (E-value cutoff 10^−5) using the consensus sequences as queries. b, Jukes–Cantor distances across non-CpG sites, corrected as in Supplementary Note 7.5. Distances between X. tropicalis and X. laevis transposon consensus sequences are shown. The X. laevis-specific transposon differences are those of each individual transposon sequence against the consensus sequence for that subfamily. c, Phylogenetic tree of Xl-TpS_mar transposon expansions in the X. laevis genome, built using Jukes–Cantor corrected distances (Supplementary Note 7.5). Sub-clusters with enough members to determine accurate timings are highlighted. The scale bar represents the corrected Jukes–Cantor distance of 0.08 substitutions per site.
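For reference, the standard Jukes–Cantor correction converts an observed proportion of differing sites p into a distance d = −(3/4) ln(1 − 4p/3). The sketch below applies it to a pair of aligned sequences. It is illustrative only: the non-CpG filtering and the additional correction of Supplementary Note 7.5 are not reproduced, and the function names and example sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor distance in substitutions per site for a p-distance below 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical element copy compared against its subfamily consensus.
consensus = "ACGTACGTAC"
copy_seq = "ACGTACGTAA"
print(jukes_cantor(p_distance(consensus, copy_seq)))
```

The correction matters most for older copies: at small p the distance is close to p itself, whereas it grows faster than p as multiple substitutions at the same site become likely.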

  4. Extended Data Figure 4: Phylogeny. (280 KB)

    a, Phylogenetic tree of pan-vertebrate conserved non-coding elements (pvCNEs), rooted by elephant shark. Alignments were done by MUSCLE, and the maximum-likelihood tree was built by PhyML. Branch length scale shown at the bottom. The difference in branch lengths of tetrapods follows the same topology as the protein-coding tree (Fig. 2b). b, Complete phylogenetic tree from Fig. 2a, with divergence times computed by r8s. c, Distribution of synonymous and non-synonymous rates Ks and Ka on specific subgenomes during the time between L and S speciation, before X. laevis and X. borealis speciation. We find accelerated mutation rates between T2 and T3 in Ks and Ka (P = 1.4 × 10⁻⁵ (left), 8.6 × 10⁻³ (right)). d, Distribution of Ks and Ka on specific subgenomes during the time after X. laevis and X. borealis speciation. We do not find significantly accelerated substitution rates (P = 0.10 (left) and P = 0.03 (right)). e, Table showing the number of homoeologues and singletons identified as homoeologues from the ancient vertebrate duplication (or ohnologues as they were historically called)56. In total, 79.9% of ohnologues retain both copies in X. laevis today, significantly more than the 54.3% of the rest of the genome after excluding ohnologues (χ² test P = 4.44 × 10⁻⁶⁹). f, Table showing the branch lengths of bootstrapped maximum likelihood trees described in Supplementary Note 12.5. The columns refer to the X. tropicalis (XTR), L chromosome of X. laevis (XLA.L), S chromosome of X. laevis (XLA.S) and XLA.L/XLA.S branch lengths respectively. The first row shows triplets where all genes show expression, the second row shows triplets where L is a thanagene, and the third row shows triplets where S is a thanagene. The L branch length is significantly smaller when all genes are expressed, or when S is a thanagene (Wilcoxon signed-rank test, P = 1.7 × 10⁻²¹⁶ and 6.4 × 10⁻²¹² respectively). The S branch length is smaller when L is a thanagene (P = 2.4 × 10⁻²²³). 
The ratio of branch lengths (L/S) is significantly different for either L or S thanagene datasets compared to when all genes are expressed (P = 3.55 × 10⁻²¹⁴ and 7.48 × 10⁻²²⁰ respectively). The ratio is also different between the two thanagene datasets (P = 1.79 × 10⁻²¹⁷).

  5. Extended Data Figure 5: Structural evolution. (402 KB)

    a, Chromosomal locations of the 45S pre-ribosomal RNA gene (rna45s), which encodes a precursor RNA for 18S, 5.8S and 28S rRNAs, were determined using pHr21Ab (5.8-kb for the 5′ portion) and pHr14E3 (7.3-kb for the 3′ portion) fragments as FISH probes. DNA fragments used for the probes were provided by National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, and labelled with biotin-16-dUTP (Roche Diagnostics) by nick translation. After hybridization, the slides were incubated with FITC-avidin (Vector Laboratories). Hybridization signals (arrows) were detected on the short arm of XLA3L, but not XLA3S. Scale bar, 5 μm. b, A large deletion including an olfactory receptor gene (or) cluster. Schematic structures of or gene clusters and adjacent genes on the 8th chromosomes of X. tropicalis (XTR8) and X. laevis (XLA8L and XLA8S). Chromosomal locations: XTR8: 107,524,547–108,927,581; XLA8L: 105,062,063–106,610,199; XLA8S: 91,630,596–92,060,451. Horizontal bars, genomic DNA sequences; triangles, genes. Outside the or gene cluster, only representative genes are shown. The size of the triangle is to scale. The orientation of triangles indicates 5′ to 3′ direction of genes. Thin lines connect orthologous/homoeologous genes. Magenta triangles, or genes; green triangles, pseudogenes (point-mutated or truncated or genes). The number of or genes is shown underneath gene clusters. Dotted lines, a deleted region in XLA8S compared to XLA8L. The centromere is located on the left side and the telomere is on the right. c, The relative frequency (left panel) and size (right panel) of genomic regions deleted in the S (blue) and L (green) chromosomes respectively. Both subgenomes experienced sequence loss through deletions, but the deletions on the S subgenome are larger and have been more frequent. Deletions were called based on the progressive Cactus sequence alignment between the X. laevis L and S subgenomes and the X. tropicalis genome. Chromosome 9_10 of X. laevis was split into 9 and 10 on the basis of alignment with the X. tropicalis chromosomes. Sequences from L that were not present on S, but could at least partially be identified in X. tropicalis, and consisted of gaps for no more than 25% of their length, were called as deleted regions in S. The same procedure was followed for deleted regions in L. d, Identification of triplet loci is described in Supplementary Note 8.1. Loci were classified into groups based on the presence of gene 2 in both X. laevis subgenomes (homoeologue retained), versus those that had a pseudogene in the middle (pseudogene) or no remnant of the middle gene as assessed by Exonerate (deletion). To normalize the intergenic lengths, we divided the nucleotide distance between genes 1 and 3 in either X. laevis subgenome by the orthologous distance in X. tropicalis. The median of the normalized ratio distribution is plotted on the bar chart. On average, S deletions appear to be larger than L deletions (52.9% versus 80.2% of the size of the orthologous X. tropicalis region, respectively). e, The number of RNA-seq reads aligning ±1 kb of precursor miRNA loci (red) was compared to the read count for 10,000 random unannotated 2.1 kb regions of the genome (blue). All 83 homoeologous, intergenic miRNA pairs showed alignment within their regions, as opposed to 4,127 out of 10,000 (41.27%) of the randomly chosen intergenic sequences. The putative primary-miRNA loci also have a higher read count than the expressed randomly chosen regions (Wilcoxon signed-rank test, P = 1.4 × 10⁻³⁸). f, The Cactus alignment was parsed to identify flanking CNE around each X. tropicalis gene. The number of CNEs >50 bp in length for singletons is shown in red, homoeologues in blue. Kolmogorov–Smirnov test P = 10⁻¹¹. g, The average distance to the nearest gene was computed for each chromosomal locus in X. tropicalis. The average intergenic distance for those with a single X. laevis gene is shown in red, those with two shown in blue. Wilcoxon signed-rank test (P = 9.8 × 10⁻²⁴). h, The distribution of gene retention by genomic footprint of the X. tropicalis orthologue. We define genomic footprint as the genomic distance from the start signal of the coding sequence (CDS) to the stop signal, including introns. The x axis shows log10(genomic footprint), the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 2.4 × 10⁻⁹⁶). i, The distribution of gene retention by CDS length of the X. tropicalis orthologue. The x axis shows log10(CDS length), the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 1.7 × 10⁻²¹). j, The distribution of gene retention by exon number of the X. tropicalis orthologue. The x axis shows number of exons; the y axis the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homoeologues and singletons by a Wilcoxon signed-rank test (P = 3.2 × 10⁻⁸).
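
    The binned retention rates of panels h and i can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the bin width and the input layout are hypothetical. The error bar follows the caption wording (standard deviation of the whole dataset divided by the number of genes in the bin); the `sqrt_n` switch gives the square-root variant described for Extended Data Fig. 8b.

```python
import math
from collections import defaultdict
from statistics import pstdev


def retention_by_bin(genes, bin_width=0.5, sqrt_n=False):
    """Retention rate per log10(length) bin.

    genes: list of (length_bp, retained) pairs, where retained is True when
           both homoeologues are present and False for a singleton.
    Returns {bin_start_log10: (retention_rate, error_bar, n_genes)}.
    """
    indicators = [1.0 if retained else 0.0 for _, retained in genes]
    total_sd = pstdev(indicators)  # spread of the whole dataset

    bins = defaultdict(list)
    for (length_bp, _), flag in zip(genes, indicators):
        index = math.floor(math.log10(length_bp) / bin_width)
        bins[index].append(flag)

    result = {}
    for index in sorted(bins):
        flags = bins[index]
        n = len(flags)
        denominator = math.sqrt(n) if sqrt_n else n
        result[index * bin_width] = (sum(flags) / n, total_sd / denominator, n)
    return result
```

    With a bin width of 1.0, genes of 1,200, 1,500 and 2,000 bp fall into the same bin (log10 between 3 and 4), and the retention rate of that bin is the fraction of those genes retained as homoeologue pairs.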

  6. Extended Data Figure 6: Pseudogenes. (713 KB)

    a, Illustration of htt.S pseudogene alignment to X. tropicalis htt and the extant X. laevis htt.L, translated to amino acids. The amino acid position is shown at the beginning of each line. Missing codons are marked by dashes. Frameshifts and premature stops are marked by X and *, respectively (and pointed to with red arrows). The first exon of the pseudogene is completely missing from the S chromosome (top). The characteristic poly-Q region is maintained by both htt and htt.L. An exon with conservation in the pseudogene (bottom), illustrating that despite many frameshifts, premature stops, the lack of a proper start and insertions of new sequence, we identify many codons in the pseudogene that occur in large conserved blocks. b, Illustration of our model to compute pseudogene ages. The star represents the point of nonfunctionalization for a locus that is currently a pseudogene. We assume the expected rate of nonsynonymous changes can be estimated by the Ka of the extant gene and X. tropicalis. We then compare the Ks and Ka of the pseudogene sequence to estimate the time of nonfunctionalization. See Supplementary Note 9 for a more detailed discussion. c, Estimated epochs of pseudogenization for 430 genes are indistinguishable from a burst of pseudogenization >10 Ma (Ks > 0.03). See Supplementary Note 9 for a more detailed discussion. d, Correlation of pseudogene expression with its extant homoeologue. The little expression seen in pseudogenes tends to be uncorrelated with the extant homoeologue. e, Histogram of pseudogene expression values across all 28 tissues and developmental stages (red) compared to all extant genes (blue). The pseudogenes are rarely expressed and tend to be expressed at lower levels than extant protein-coding genes. f, Histograms of expression variance of pseudogenes (red) compared to extant genes (blue). 
The small amount of pseudogene expression observed does not tend to vary across tissues and developmental stages in the same way that extant genes do.

  7. Extended Data Figure 7: Tandem duplications. (332 KB)

    a, Phylogenetic trees of the mix/bix cluster. Nucleotide sequences were aligned using MUSCLE and a phylogenetic diagram was generated by the ML method with 1,000 bootstraps (MEGA6). Circles with different colours represent X. laevis L genes (magenta), X. laevis S genes (blue) and X. tropicalis genes (green). The table shows the correspondence of bix gene names proposed in this study and previously used (synonyms). b, FISH analysis showing XLA3S-specific deletion of the nodal5 gene cluster. One unit of the nodal5 gene region, including exons, introns and an intergenic region was used as a probe for FISH (counterstained with Hoechst). Arrows indicate the hybridization signals of nodal5s. Scale bar, 5 μm. c, Comparison of the nodal5 gene cluster. Genome sequencing revealed that nodal5.e1.L~.e5.L (pink) and nodal6.L are clustered. Amplification of nodal5 gene in XLA3L and loss of this cluster in XLA3S were confirmed. Pseudogenes (nodal5p1.L~p4.L and nodal5p1.S) are indicated in black. The nodal5 cluster of X. tropicalis does not contain any pseudogene. d, The X. laevis L chromosome has four complete copies of nodal3 (nodal3.e1.L~.e4.L), whereas the gene cluster is lost from the X. laevis S chromosome. A truncated nodal3 gene (nodal3p1.L) is likely to be a pseudogene and highly degenerate pseudogenes (nodal3p2.L and nodal3p3.L) also exist on the L chromosome. e, Like nodal3, vg1 is lost from the S chromosome although there is a pseudogene (vg1p.S). vg1 is specifically amplified on the X. laevis L chromosome (vg1.e1.L~.e3.L) in comparison with X. tropicalis. An amino acid change (Ser20 to Pro20) in Vg1 protein has been shown to result in functional differences (Supplementary Note 13.9). vg1 and derrière are orthologous to mammalian gdf1. f, Fraction of all genes duplicated and retained to present epoch per 1 expected 4DTV (fourfold degenerate transversion) at different epochs (semi-log scale). 
Shown also are linear fits, which would be consistent with constant birth- and death-rate models (first epoch is omitted from both fitted datasets, as is second epoch from X. laevis). See Supplementary Note 11 for a more detailed discussion. g, Same as f, but for ‘short genes’ (CDS <600 bp) and ‘long genes’ (CDS >1,200 bp) separately. The loss rate of new duplicates appears to be similar. If the extra copy of a newly duplicated gene was lost when the first 100% disabling mutation occurred, we would expect, on average, the longer genes to be lost sooner.
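
    A straight line on a semi-log plot, as in panels f and g, corresponds to exponential decay of the retained fraction with age. The sketch below fits such a line by ordinary least squares on the log-transformed values. It is an illustration only: the function name is hypothetical, and reading the negative slope as a constant loss rate and the intercept as a birth-rate term is an assumption made here, not the fitting procedure of Supplementary Note 11.

```python
import math


def fit_semilog(epochs, fractions):
    """Least-squares line through (x, ln y).

    epochs:    ages of the duplication epochs (for example, in 4DTV units)
    fractions: retained duplicate fraction per epoch (must be > 0)
    Returns (slope, intercept) of ln(fraction) = intercept + slope * epoch.
    """
    if len(epochs) != len(fractions) or len(epochs) < 2:
        raise ValueError("need at least two paired observations")
    log_fractions = [math.log(f) for f in fractions]
    n = len(epochs)
    mean_x = sum(epochs) / n
    mean_y = sum(log_fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in epochs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(epochs, log_fractions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

    Under this reading, a fitted slope of -3 would mean that the retained fraction falls by a factor of e for every 1/3 unit of 4DTV, and similar slopes for short and long genes would match the caption's observation that their loss rates appear similar.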

  8. Extended Data Figure 8: Gene expression analysis. (483 KB)

    a, Pairwise Pearson correlation distributions between homoeologous genes (red) and all genes (blue). Left histogram, stage data; right, adult data. The x axis shows the correlation; the y axis the percentage of data. The homoeologous genes have a correlation distribution closer to one owing to the fact that these were recently the same locus. X. laevis TPM values of 0.5 were lowered to 0. Any gene with no TPM >0 was removed from analysis. We then added 0.1 to all TPM values and log transformed (log10) them. b, Scatter plot comparing binned genes by their median X. tropicalis expression57 to the retention rate of their X. laevis (co)-orthologues. Error bars are the standard deviation for the whole dataset divided by the square root of the number of genes analysed in a bin. We assessed significance by a Wilcoxon signed-rank test of the homoeologous and singleton distributions, P = 6.31 × 10⁻¹¹³. c, Full version of the box plot shown in Fig. 4c. The difference between subgenomes is difficult to see at this magnification, illustrating that many loci deviate from the whole genome median of preferring the L homoeologue. There were some L outliers expressed 10⁴ times as much as their S homoeologues, whereas no S genes showed such a strong trend. These differences are discussed in more detail in Supplementary Note 12. d, Box plot of 4DTV by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (Wilcoxon signed-rank test, P < 10⁻⁵). The high correlation, similar expression (HCSE) group showed lower sequence change than other groups (P = 3.7 × 10⁻¹²) and the no correlation, different expression (NCDE) group showed high rates of sequence change (P = 5.6 × 10⁻¹⁴). e, Box plot of CDS length difference between X. laevis homoeologues by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (Wilcoxon signed-rank test, P < 10⁻⁵). 
The HCSE group showed smaller CDS length differences than other groups (P = 2.4 × 10⁻¹³) and the NCDE group showed large differences in homoeologue CDS length (P = 2.1 × 10⁻³²). f, Box plot of Ka/Ks between X. laevis homoeologues by homoeologue class defined in Supplementary Note 12.4. Significant differences are marked by a red asterisk (t-test P < 10⁻⁵). The HCSE group showed lower non-synonymous sequence change than other groups (P = 8.2 × 10⁻¹⁹) and the NCDE and no correlation, similar expression (NCSE) groups showed higher rates of non-synonymous sequence changes (P = 2.0 × 10⁻¹² and P = 7.0 × 10⁻⁹ respectively). g, RNA-seq analysis of six6.L (red) and six6.S (blue) during X. laevis development (left) and in adult tissues (right). Expression levels of six6.S were lower than those of six6.L at most developmental stages and in adult tissues. h, Diagram of Homo sapiens, X. tropicalis and X. laevis six6 loci (upper panel). Magenta and black boxes indicate CNEs and exons, respectively. The phylogenetic tree analyses of H. sapiens, X. tropicalis and X. laevis six6 CNEs (lower left panel) and Six6 proteins (lower right panel). Notably, six6.S is more diverged from X. tropicalis six6 than six6.L, both in the encoded protein sequences and in CNEs within 3 kb of the transcription start sites. Materials, methods and the CNE locations on genome assemblies are described in Supplementary Note 13.1. i, On the basis of chromatin state properties, a Random Forest machine-learning algorithm can accurately predict L versus S expression bias. The classification is based on all genes with greater than threefold expression difference at NF stage 10.5 (a set of 1,129 genes). The mean (dotted black line) of the ROC area under the curve is 0.778 (tenfold cross-validation). Features were selected using Linear Support Vector Classification and are shown in j. j, Relative importance (based on Gini impurity) of selected features used in the Random Forest classification. 
All features used in the classification are shown. Among various variables, the ratios of H3K4me3 and DNA methylation at the promoter contributed most to the decision tree model. A difference in p300 binding in the genomic region surrounding the gene also contributed to the Random Forest classification, as did the presence or absence of a number of specific transcription factor motifs in the promoter.
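
    The expression preprocessing described for panel a can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the sketch assumes that the caption's 0.5 TPM rule means values below 0.5 are set to 0. The pseudocount of 0.1 and the log10 transform follow the caption.

```python
import math


def preprocess_tpm(profile, floor_below=0.5, pseudocount=0.1):
    """Floor low TPM values, drop unexpressed genes, then log10-transform.

    Returns None when no value remains above 0 (gene removed from analysis).
    """
    floored = [0.0 if value < floor_below else value for value in profile]
    if not any(value > 0 for value in floored):
        return None
    return [math.log10(value + pseudocount) for value in floored]


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles; None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a flat profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

    In use, the L and S profiles of a homoeologous pair would each be passed through `preprocess_tpm` across the same stages or tissues, and `pearson` applied to the two transformed profiles when neither gene was removed.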

  9. Extended Data Figure 9: Examples of pathway responses. (329 KB)

    a, The Wnt pathway. Left panel, several key components of the canonical Wnt pathway in the X. laevis genome. The numbers in brackets show the number of paralogues. Components that have a homoeologous pair of genes or a singleton are shown in blue or red, respectively. Each gene (wnt: 21 genes, LRP: 2 genes, Fzd: 10 genes, Dvl: 3 genes, Frat(GBP): 1 gene, GSK3: 2 genes, Axin: 2 genes, β-catenin: 1 gene, APC: 2 genes, TCF/LEF: 4 genes) was classified into 4 groups according to subcellular localization, and the number of singleton and homoeologue retained genes is shown by pie charts. Right panel, syntenies around four singleton genes. b, Cell cycle regulation. Upper right panel, diagram of the cell cycle and regulatory proteins critical to each phase. Cyclin H (ccnh) and Cdk7 constitute Cdk-activating kinase (CAK), a key factor required for activation of all Cdks. Genes encoding Cyclin H and Cdk7 (red), but not other regulators (blue), became singletons. Upper left panel, pie charts show the numbers of homoeologous pairs (blue) and singletons (red) in each functional category as indicated. Lower left panel, syntenies of ccnh and cdk7 loci in X. tropicalis and X. laevis. Lower right table, individual genes used for drawing the pie charts. c, The Hippo pathway. Upper panel, Hippo pathway components and retention of their homoeologous gene pairs. All genes for Hippo pathway components as indicated were identified in the whole genome of X. laevis. Blue icons indicate that both of the homoeologous genes are expressed in normal development and adult organs. The red icon, Taz, indicates a singleton. Yap is interchangeable with Taz in most cases, although TAZ, unlike YAP, serves as a mediator of Wnt signalling (broken line). Pie charts show the numbers of homoeologue pairs (blue) and singletons (red) in each category of the Hippo pathway components classified according to subcellular localization. 
Lower panel, comparative analysis of syntenies around the taz gene. X. tropicalis scaffold247 is not incorporated into the chromosome-scale assembly (v9) and hence its chromosomal location is not yet known. The p arm termini of XLA8L and XLA8S are on the left. See Supplementary Note 13 for further details.

  10. Extended Data Figure 10: Pathways continued. (432 KB)

    a, The TGFβ pathway. Pie charts indicate the ratio of differentially expressed homoeologous pairs (orange) and singletons (red). Many of the extracellular regulatory factors are either differentially regulated or became singletons. Genes for a type I receptor, co-receptors and an inhibitory Smad are also differentially regulated. Multicopy genes such as nodal3, nodal5 and vg1 are not counted as singletons, even though those genes are deleted on S chromosomes. Instead, these and duplicated chordin genes are categorized into differentially regulated genes. b, The sonic hedgehog pathway. Upper panel, the simplified hedgehog pathway known in Shh signalling is schematically shown. Most signalling components are encoded by both homoeologous genes, whereas Hhat (shown in red) is encoded by a singleton gene. Where paralogues exist, the numbers of paralogues are shown in parentheses. In the left cell, the Shh precursor (Hh precursor) is matured through the process involving Hhat and Hhatl and secreted. In the right cell, the binding of Shh (Hh) to Ptch1 (Ptch) receptor inhibits Ptch1-mediated repression of Smo, leading to Smo activation and subsequent inhibition of PKA; otherwise PKA converts Gli activators to truncated repressors. As a consequence, Gli proteins activate target genes, such as Ptch1 and Hhip. The transmembrane protein Hhip binds Shh and suppresses Shh activity. Lower panel, schematic comparison of syntenies around hhat genes of X. tropicalis chromosome 5 (top) and X. laevis 5L chromosome (middle) and the corresponding region of X. laevis 5S chromosome (bottom). The diagram is not drawn to scale. c, Deletion rates on L (x axis) versus S (y axis) for different Pfam groups. For Pfam groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < .01). 
See Supplementary Table 5 for more information. d, Deletion rates on L (x axis) versus S (y axis) for different stage weighted gene correlation network analysis (WGCNA)54 groups (visualized as a heatmap in Fig. 4a). For stage WGCNA groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < .01). e, Deletion rates on L (x axis) versus S (y axis) for different GO groups. For GO groups we computed the number of X. laevis single-copy genes (singletons) versus homoeologue pairs and computed the fraction retained. The line of expected L/S loss is based on the genome-wide average (56.4%). Red points show groups with high or low rates of loss (P < 0.01). See Supplementary Table 5 for more information.
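
    The per-group retention fraction used in panels c to e can be sketched as follows. The genome-wide average of 56.4% comes from the caption; everything else is an assumption for illustration. In particular, the caption does not name the test behind P < 0.01, so the exact two-sided binomial test below is a stand-in, and the function names are hypothetical.

```python
import math

GENOME_WIDE_RETENTION = 0.564  # genome-wide average quoted in the caption


def fraction_retained(pairs: int, singletons: int) -> float:
    """Share of genes in a group retained as homoeologue pairs."""
    return pairs / (pairs + singletons)


def binomial_two_sided_p(k: int, n: int, p: float) -> float:
    """Exact two-sided binomial P value (sum of outcomes no more likely than k).

    Intended for modest group sizes; very large n can overflow float conversion.
    """
    probabilities = [
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(n + 1)
    ]
    observed = probabilities[k]
    tail = sum(q for q in probabilities if q <= observed * (1.0 + 1e-9))
    return min(1.0, tail)


def is_outlier_group(pairs: int, singletons: int, alpha: float = 0.01) -> bool:
    """Flag a group whose retention departs from the genome-wide average."""
    n = pairs + singletons
    return binomial_two_sided_p(pairs, n, GENOME_WIDE_RETENTION) < alpha
```

    A group with 95 retained pairs and 5 singletons would be flagged under these assumptions, whereas a group with 56 pairs and 44 singletons would not.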

Supplementary information

PDF files

  1. Supplementary Information (2.4 MB)

    This file contains Supplementary Notes 1-15, which detail analyses from the main text and additional references (see Contents for more details).

Zip files

  1. Supplementary Tables (31.5 MB)

    This zipped file contains Supplementary Tables 1–11.

Comments

  1. Comment #69037

    Miguel Romero said:

    Unexpected findings from Session et al. genomic study data

    Dr. Session and his colleagues performed a remarkably detailed analysis of the frog Xenopus laevis and X. tropicalis genomes. Pursuing their data as an exercise in bioinformatics and the construction of phylogenetic trees led to the discovery of homologies between their SRA data and HIV-1 nucleic acid sequences. BLASTN returned over 700 high-identity, low E-value alignments representing the coding nucleic acids for all the major structural, regulatory and accessory HIV-1 proteins. Data at goo.gl/GhCyNb.

    Session et al. used several cloning vectors including pKS145, pKS200, pKS300-IC, pCC1FOS and Bluescript II SK(-). An Internet search did not locate the pKS200 or pKS300-IC vectors, although the latter may be related to pKS300, an insert of the HTLV-I gp46 (env) gene (https://www.atcc.org/~/ps/39902.ashx). However, BLASTN analysis with HIV-1 HXB2 and several other HIV-1 sequences as subject against HTLV-I DNA as query did not return any significant identities. The other vectors were tested and none appear to be lentiviral.

    Although I have previously documented evidence for HIV-1 contamination of nucleic acid databases [1, 2], this does not readily fit with the data presented here. Is it possible that Dr. Session and his colleagues conducted their experiments at or near a laboratory involved in HIV research? Otherwise, how can the HIV-1 sequences be explained? If the HIV-1 sequences are excluded from the analysis, might this have some bearing on the authors' conclusions?

    1. Romero Fernández-Bravo M. Contamination of genomic databases by HIV-1 and its possible consequences. A study in Bioinformatics. 2014. http://openaccess.uoc.edu/webapps/o2/handle/10609/31361
    2. Romero Fernández-Bravo M. Readers comment on genome of Dr. James Watson. Nature 2014. 452:872-876. http://www.nature.com/nature/journal/v452/n7189/full/nature06884.html#comment-64495
    http://www.nature.com/nature/journal/v452/n7189/full/nature06884.html#comment-67241

Additional data