Genome evolution in the allotetraploid frog Xenopus laevis

To explore the origins and consequences of tetraploidy in the African clawed frog, we sequenced the Xenopus laevis genome and compared it to the related diploid X. tropicalis genome. We demonstrate the allotetraploid origin of X. laevis by partitioning its genome into two homeologous subgenomes, marked by distinct families of “fossil” transposable elements. Based on the activity of these elements and the age of hundreds of unitary pseudogenes, we estimate that the two diploid progenitor species diverged ~34 million years ago (Mya) and combined to form an allotetraploid ~17–18 Mya. 56% of all genes are retained in two homeologous copies. Protein function, gene expression, and the amount of flanking conserved sequence all correlate with retention rates. The subgenomes have evolved asymmetrically, with one chromosome set more often preserving the ancestral state and the other experiencing more gene loss, deletion, rearrangement, and reduced gene expression.

Here we prove the allotetraploid hypothesis by tracing the origins of the X. laevis genome from its extinct progenitor diploids. The two subgenomes are distinct and maintain separate recombinational identities. Despite sharing the same nucleus, we find that the subgenomes have evolved asymmetrically: one of the two subgenomes has experienced more intrachromosomal rearrangement, gene loss by deletion and pseudogenization, changes in levels of gene expression, and in histone and DNA methylation. Superimposed on these global trends are local gene family expansions and alteration of gene expression patterns.

Subgenome identity and timing of allotetraploidization
We reasoned that dispersed relicts of transposable elements specific to each progenitor would mark the descendent subgenomes in an allotetraploid (Fig. 2c, Extended Data Fig. 1). Three classes of DNA transposon relicts appear almost exclusively on either the L or S chromosomes (Supplementary Note 7). Xl-TpL_Harb and Xl-TpS_Harb are novel subfamilies of miniature inverted-repeat transposable elements (MITE) of the PIF/Harbinger superfamily 18,19 whose relicts are almost completely restricted to L or S chromosomes, respectively (Fig. 1b, Extended Data Fig. 3a). Similarly, sequence relicts of the Tc1/mariner superfamily member Xl-TpS_Mar (closely related to the fish MMTS subfamily 20 ) are found almost exclusively on the S chromosomes (Fig. 1b), as confirmed by FISH analysis using Xl-TpS_Mar as a probe (Fig. 1c, Supplemental Note 7.4; see Supplemental Note 7.3 for details on the rare elements that map to the opposite subgenome).
The L and S chromosome sets therefore represent the descendants of two distinct diploid progenitors, confirming the allotetraploid hypothesis even in the absence of extant progenitor species. Based on analysis of synonymous divergence of protein-coding genes, the L and S subgenomes diverged from each other ~34 Mya (T 2 ) and from X. tropicalis ~48 Mya (T 1 ) (Fig. 2a), consistent with prior gene-by-gene estimates from transcriptomes [21][22][23][24] (Supplementary Note 8, Extended Data Fig. 4; Online Methods). L-and S-specific transposable elements were active ~18-34 Mya, indicating that the two progenitors were independently evolving diploids during that period ( Fig. 2a; Supplementary Note 7.5; Extended Data Fig. 3). More recent transposon activity is more uniformly distributed across the L and S chromosomes (not shown). Finally, consistent with a common origin for tetraploid Xenopus species, we can clearly identify orthologs of L and S genes in whole genome sequences of another allotetraploid frog, X. borealis, and estimate the X. laevis-X. borealis divergence to be ~17 Mya (T 3 ). These considerations constrain the allotetraploid event to ~17-18 Mya (T*). This timing is consistent with other estimates of the radiation of tetraploid Xenopus species, which are presumed to emerge from the bottleneck of a shared allotetraploid founder population 23,24 .

Karyotype stability
Remarkably, with the exception of the chromosome 9/10 fusion, X. laevis and X. tropicalis chromosomes have maintained conserved synteny since their divergence ~48 Mya (Fig.  1a,b). The absence of inter-chromosomal rearrangements is consistent with the relative stability of amphibian and avian karyotypes compared to mammals 25 , which typically show dozens of inter-chromosome rearrangements 26 . It also contrasts with many plant polyploids, which can show considerable inter-subgenome rearrangement 27 . The distribution of L-and S-specific repeats along entire chromosomes implies the absence of crossover recombination between homeologs since allotetraploidization, presumably because the two progenitors were sufficiently diverged to avoid meiotic pairing between homeologous chromosomes, though we cannot rule out very limited localized inter-homeolog exchanges (Supplementary Note 7).
The extensive collinearity between homologous X. laevis L and X. tropicalis chromosomes ( Fig. 1a) implies that they represent the ancestral chromosome organization. In contrast, the S subgenome shows extensive intra-chromosomal rearrangements, evident in the large inversions of XLA2S, XLA3S, XLA4S, XLA5S and XLA8S, as well as shorter rearrangements (Fig. 1a). The S subgenome has also experienced more deletions. For example, the 45S pre-ribosomal RNA gene cluster is found on X. laevis XLA3Lp, but its homeologous locus on XLA3Sp is absent (Extended Data Fig. 5a). Extensive small-scale deletions (Extended Data Fig. 5b) reduce the length of S chromosomes relative to the L and X. tropicalis counterparts (see below).

Response of subgenomes to allotetraploidy
Redundant functional elements in a polyploid are expected to rapidly revert to single copy through the fixation of disabling mutations and/or loss 28 unless prevented by neofunctionalization 8 , subfunctionalization 26 , or selection for gene dosage 29 . Differential gene loss between homeologous chromosomes is sometimes referred to as "genome fractionation" [30][31][32] 5 h-j), broadly consistent with the idea that longer genes have more independently mutable functions and are therefore more susceptible to subfunctionalization and subsequent retention 36 .
Genes have been lost asymmetrically between the two subgenomes of X. laevis. Similar results have been reported for some plant polyploids 30 but not in rainbow trout 5 . For X. laevis protein-coding genes with clear 1:1 or 2:1 orthologs in X. tropicalis, we find that significantly more genes are lost on the S subgenome (31.5%) vs. the L subgenome (8.3%; Table 2), with the same trend for other types of functional elements, such as H3K4me3-enriched promoters and p300-bound enhancers (Table 1). Across most of the genome, genes appear to be lost independently of their neighbors, as the distribution of runs of gene losses are nearly geometrically distributed (Fig.  3a, right). We do observe some large block deletions (e.g., several olfactory clusters (Extended Data Fig. 5b) and a few unusually long blocks of functionally unrelated genes that are retained in two copies without loss (Fig 3a, left).
Many lost genes are simply deleted, as demonstrated by significantly shorter distances between conserved flanking genes. Both the size and number of deletions are greater on the S subgenome (Extended Data Fig. 5c). We identified 985 "unitary" (i.e., nonretrotransposed) pseudogenes out of 1,531 loci examined in detail. This 64% detection rate is similar between subgenomes in X. laevis and comparable to that reported in trout 5 . Based on the accumulation of non-synonymous mutations 37 we estimate that most of these pseudogenes escaped evolutionary constraint ~15 Mya (Fig. 2a, Extended Data Fig. 6), consistent with the onset of extensive redundancy in the allotetraploid, though the precision of our pseudogene age estimates is low (Supplementary Note 9). Most pseudogenes show no evidence of expression, but of 769 pseudogenes longer than 100 bp, 133 (17.2%) showed residual expression (Extended Data Fig. 6). Conversely, among homeologous gene pairs, we found 760 for which one member had little to no expression across our 28 sampled conditions. Although these retained some gene structure (start and stop codon, no frame shifts, good splice signals) they showed increased rates of amino acid change and appear to be under relaxed selection (Extended Data Fig. 5f). We call these nominally dying genes "thanagenes" (Supplementary Note 12.5). Reduced expression may be due to mutated cisregulatory elements, exemplified by the six6 gene pair ( Although tetraploidy created two "copies" of nearly every gene, additional gene copies are continually produced by tandem duplication (Fig. 3d; Extended Data Fig. 7). The number of tandem clusters is greater in X. tropicalis than in the X. laevis L subgenome, which in turn is greater than in S (Supplementary Note 11.1). Although tandem duplication is faster in X. tropicalis than in X. laevis, there is also a higher rate of loss. Since tandem duplications and deletions occur by unequal crossing over during meiosis, these differing rates are consistent with a shorter generation time of X. tropicalis (Extended Data Fig. 7 f, g). The mean time to loss of an old tandem duplicate is ~40 Mya in X. laevis (on either subgenome) compared with ~16 Mya in X. tropicalis. Homeologous gene loss and tandem duplication can combine to yield complex histories for some gene families. We discuss how these families contribute to the literature on whole genome duplication evolution in Supplemental Notes 10 and 13.

Functional patterns of gene retention and loss
We find preferential retention or loss of many functional categories (Fig 4a; Extended Data Figs. 4e, 9, 10; Supplemental Note 13). DNA binding proteins and components of developmentally-regulated signaling pathways (TGFβ, Wnt, Hh and Hippo) and cell cycle regulation are retained at a significantly higher rate (> 90%) than average (Extended Data Fig. 10). Genes retained in multiple copies after the ancient vertebrate genome duplications are also more likely to be retained as homeologs in X. laevis (Supplemental Note 10.4), as found for the teleost and trout genome duplications 5 . A notable example is the nearly complete retention of 37/38 duplicated genes in the four pairs of homeologous Hox clusters, with a single pseudogene (Fig. 3c). High rates of homeolog retention in most genes in these categories suggest that stoichiometrically controlled expression levels may be needed or subfunctionalization of homeologs may have occurred, either in their expression domain or target specificity.
Conversely, homeologous genes in other functional categories have been lost at a higher rate, presumably because of a corresponding lack of selection for dosage. For example, genes involved in DNA repair are lost at a high rate (79%) (Supplementary Note 10.1), consistent with reduced selection for repair in the immediate aftermath of allotetraploidy, when all genes were present in four copies per somatic cell 5 . Other metabolic categories are also prone to loss, presumably because single loci encoding enzymes are sufficient 38 . Genomic regions with notable loss include the major histocompatibility complex genes on the S subgenome (Fig. 3b) and several olfactory receptor clusters (Extended Data Fig. 5b). We hypothesize that homeologous genes may be functionally incompatible in these cases, leading to en bloc deletion in response to this selection pressure. Specific case studies of duplicate gene retention and loss are detailed in Extended Data Figures 9,10 and Supplemental Note 13.

Evolution of gene expression
Gene expression is also a predictor of retention, with more highly expressed genes more likely to be retained (Extended Data Fig. 8b), similar to results seen in Paramecium 39,40 .
Developmentally regulated genes whose expression levels peak at the maternal-zygotic transition (MZT) or during neural differentiation are retained at higher levels (p < 0.01), based on gene expression networks constructed from developmental and adult tissue expression (Online Methods; Fig. 4a  3). We speculate that the exceptional retention of developmentally regulated genes is due to selection for stoichiometric dosage of these factors, and in some cases higher expression in the physically larger allotetraploid cells and embryos relative to those of diploid frogs, although a propensity 36 of these genes for sub-or neo-functionalization could also contribute. In the adult, genes whose expression peaks in the brain and eye are also retained at higher levels ( Figure 4b).
In X. laevis, the expression of homeologs is highly correlated (Extended Data Fig. 8a), showing that the overall expression of homeologs diverges similarly to orthologs between Xenopus species 41 . Many homeologous pairs, however, are differentially expressed throughout development or across adult tissues, either in spatiotemporal pattern (a form of sub-functionalization 36 ; Supplemental Note 12.4; Extended Data Fig. 8d-f) or in the same pattern but with differing expression levels. When homeologous gene pairs are both expressed the average L copy expression level is ~25% higher than the S copy consistently across adult tissues, and after the MZT 42 ( Fig. 4b; Supplemental Note 12.2). Excess L expression, however, averages only ~12% in oocyte and early pre-MZT stages, suggesting that the two subgenomes are more evenly expressed as maternal transcripts but develop an increased asymmetry after MZT. Strikingly, we found 391 cases in which one homeolog had no detectable maternal mRNA (oocytes, egg and stage 8; Comparing with similar transcript data from X. tropicalis, we found cases of apparent loss of expression ("maternal subfunctionalization": that is, X. tropicalis and one X. laevis gene expressed, the other X. laevis gene silenced pre-MZT; 238 genes, e.g., numbl.S) as well as a surprising gain ("maternal neofunctionalization": that is, X. tropicalis gene not expressed maternally, but one X. laevis gene expressed; 153 genes, e.g., hoxb4.L). We do not see such a large divergence in other expression domains (Supplemental Note 12.2; Extended Data Fig. 8c), suggesting a high level of plasticity of maternal mRNA regulation between X. laevis homeologs, similar to the trend seen between Xenopus species 41 .
Overall, thousands of homeolog pairs have either divergent spatiotemporal patterns or similar patterns with differing expression levels. Such homeolog pairs differ in substitution rate, and CDS length difference, more than those that are similar in expression (Supplementary Note 12.4; Extended Data Fig. 8Fig. 8d-f), a pattern also found in trout homeologous pairs 5 . These expression differences can largely be attributed to changes in epigenetic regulation (Random Forest classification; ROC AUC 0.78), with changes in H3K4me3 and DNA methylation contributing the most explanatory power among our epigenetic variables (Supplementary Note 14). Detailed comparison of the two subgenomes will facilitate identification of specific sequences that control cis-regulatory differences between homeologs.

Conclusion
The two subgenomes of Xenopus laevis have evolved asymmetrically, with the Lsubgenome more consistently resembling the ancestral condition and the S-subgenome more disrupted by deletion and rearrangement. Asymmetric gene loss has been observed in allopolyploid plants 30 and yeast 43 at the segmental level, but it has not been shown directly that similarly "fractionated" segments derive from the same progenitor (Fig. 1c). Our results are consistent with the model that optimized gene expression levels are an important force affecting gene retention following polyploidy 39,40 . The asymmetry between L and S could have been the result of an intrinsic difference between their diploid progenitors. Alternately, the remodeling of the S genome could have been a response to the L-S merger itself, a "genomic shock," 44 resulting from the activation of transposable elements (Fig 2a; Supplemental Note 8.5). Xenopus' position as a premier model for the study of vertebrate development, cell biology, and immunology, and the existence of a number of related polyploids, will continue to provide rich material for the study of vertebrate polyploidy.

Notation and terminology
"Homeologous" chromosomes are anciently orthologous chromosomes that diverged by speciation but were reunited in the same nucleus by a polyploidization event. They are a special case of paralogs. Homeologous genes are sometimes called "alloalleles" to emphasize their role as alternate forms of a gene, but since homeologs are unlinked and assort independently, we do not use this terminology. Similarly, loss of homeologous genes is sometimes referred to as "diploidization." We prefer the simpler and more descriptive term "gene loss." Note that an allotetraploid like Xenopus laevis has two related subgenomes, but these subgenomes are each transmitted to progeny via conventional disomic inheritance. So immediately after allotetraploidization, the new species is already genetically diploid. This is clearly the case for X. laevis, since we find no evidence for recombination between homeologous chromosomes, which would create new sequences with mixed "L" and "S" type transposable elements.

Sequencing and assembly
DNA was extracted from the blood of a single female from the inbred J-strain for whole genome shotgun sequencing. We generated 4.6 billion paired-end Illumina reads from a range of inserts, and used Sanger dideoxy sequencing to obtain fosmid-and bacterial artificial chromosome (BAC)-end pairs and full BAC sequences. We used meraculous 45 as the primary genome assembler. See supplementary notes for more detailed information.

Chromosome scale organization
We identified 798 bacterial artificial chromosomes (BACs) containing genes of interest distributed across the Xenopus genome, and performed fluorescence in situ hybridization (FISH) to assign these BACs to specific chromosomes based on Hoechst 33258-stained latereplication banding patterns (Supplemental Table 1). "HiC" chromatin capture from animal caps was performed as previously reported 46 and assembled with HiRise 47 .

Characterization of sex locus
Sex determination in X. laevis follows a female heterogametic ZZ/ZW system 48 . We fully sequenced BAC clones representing both W and Z haplotypes, and identified both W-and Zspecific sequences (Extended Data Fig. 2a). The existence of the Z-specific sequence was unexpected and therefore verified by PCR analysis using specific primer sets and DNA from gynogenetic frogs having either W or Z loci.

Gene annotation
We made use of extensive previously generated transcriptome data for X. laevis and X. tropicalis, including 697,015 X. laevis EST sequences (see a review 49 ). In addition, more than 1 billion RNAseq reads were generated for this project from 14 oocyte/developmental stages and 14 adult tissues from J-strain X. laevis (see Supplementary Note 4). These data were combined with homology and ab initio predictions using the Joint Genome Institute's Integrated Gene Call pipeline (See Supplementary Note 4 and 8 for more details).

Characterization of subgenome-specific transposable elements
We found subgenome specific repeats using a RepeatMasker 50 result. The repeats were used to reconstruct full-length subgenome specific transposon sequences. The specific transposons, Xl-TpL_Harb, Xl-TpS_Harb, and Xl-TpS_Mar, were classified based on their target site sequence and terminal inverted repeat (TIR) sequences. The coverage lengths of the transposons on each chromosome were calculated from the results of BLASTN search (E-value < 1E-5) using the consensus sequences of the transposons as queries. The chromosomal distribution of the Xl-TpS_Mar was revealed by a FISH analysis (See Supplemental Note 7.4).

Phylogeny, divergence time, and evolutionary rates
We used Hymenochirus boettgeri, Pipa carvalhoi, and Rana pipiens sequences as outgroups to estimate the evolutionary rate of duplicated genes in X. laevis and their relationship to X. tropicalis. See Supplementary Note 7 and 8 for more detail.

Deletions and pseudogenes
Pseudogene sequences contain various defects including premature stop codons, frameshifts, disrupted splicing, and/or partial coding deletions. 985 pseudogenes were identified among 1,531 "2-1-2 regions", with the others deleted or rendered unidentifiable by mutation. 368/985 could be timed, based on the accumulation of non-synonymous and synonymous substitution between a pseudogene, its homeolog, and its ortholog in X. tropicalis, providing a time since the loss of constraint for each pseudogene 37 .

Functional annotation of genes
We used several bioinformatic methods and high throughput datasets to assign functional annotations to Xenopus genes. Protein domains were assigned using InterPro (including PFAM and Panther) 51 and KEGG 52 . Gene Ontology was assigned using InterPro2Go 51 . We identified genes that encode mitochondrial proteins by mapping the MitoCarta 53 database from mouse to the most recent X. tropicalis proteome. Xenopus genes associated with germ plasm were manually curated using the extensive Xenopus literature (See Supplement Note 13).

Gene expression
We analyzed transcriptome data generated for 14 oocyte/developmental stages and 14 adult tissues in duplicate except for oocyte stages (see Supplementary Note 4). Expression levels were measured by mapping paired-end RNA-seq reads to predicted full length cDNA and reporting transcripts per one million mapped reads (TPM). We consider the limit of detectable expression to be TPM > 0.5 Co-expression modules were defined by Weighted Gene Co-expression Network Analysis (WGCNA) clustering 54 (See Supplementary Note 12).

Epigenetic analysis
We determined DNA methylation levels (DNAme) by whole genome bisulfite sequencing, and used ChIP seq to generate profiles of the promoter mark histone H3 lysine 4 trimethylation (H3K4me3), the transcription elongation mark H3K36me3, as well as RNA polymerase II (RNAPII) and the enhancer-associated co-activator p300. To test which regulatory features would contribute most to the L versus S expression differences, we applied a Random Forest machine learning algorithm to analyze differential expression between the L and S homeologs (See Supplementary Note 14).

e.
Cumulative proportion of 51-mers as a function of relative depth (i.e., depth/29). Relative depth provides an estimate of genomic copy number. The rapid rise at relative depth 1 implies that 70-75% the X. laevis genome is single copy with respect to 51-mers. The remainder of the genome is primarily concentrated in repetitive sequences with copy number ≫100. Note logarithmic scale.

f.
The contact map of 85,260 Chicago read pairs for JGIv72.000090484.chr4S, a 3.1Mb scaffold in the XENLA_JGI_v72 assembly.

g.
The contact map of 85,260 HiC read pairs for JGIv72.000090484.chr4S. Read pairs were binned at 10kb intervals. For each read pair, the forward and reverse reads map with at least 20 map quality score.

h.
The insert distribution of HiC and Chicago read pairs that map to the same scaffold of XENLA_JGI_v72 with at least 20 map quality score. The x-axis is the read pair separation distance. The y-axis is the counts for that bin divided by the total number of reads. The bins are 1kb.
XTR10 were missing in the X. tropicalis genome assembly v9, but rps11, rpl13a, lypd1, and actr3 were expected to be located there based on the synteny with human chromosomes, and then verified by cDNA FISH (upper panels). Small triangles on XLA9_10L and S indicate the distribution of gene models showing both identity and coverage greater than 30%, against the human and chicken peptide sequences from Ensembl, in the region between ±2 Mb from the prospective 9/10 junction. HSA: human chromosome. GGA: chicken chromosome. The magnified view represents syntenic genes to scale with colors corresponding to human genes.

d.
Schematic representation for the two hypothetical processes of chromosomal rearrangements (fusion and inversion) that occurred between the hypothetical proto-XTR9 and 10 to produce proto-XLA9_10, and eventually XLA9_10L and S. The process of chromosome rearrangements is explained parsimoniously in two different ways (left and right panels), starting from proto-XTR9 and 10. Actual and hypothetical ancestral chromosomal locations of snrpn and stau1 are shown by magenta and yellow circles, respectively. Note that the chromosomal locations of these genes on the proto-XTR10 differ between the two models.

a.
Phylogenetic tree of pan-vertebrate conserved non-coding elements (pvCNEs), rooted by elephant shark. Alignments were done by MUSCLE, and the maximum-likelihood tree was built by PhyML. Branch length scale shown on bottom. The difference in branch lengths of tetrapods follows the same topology as the protein-coding tree (Fig. 2b).

b.
Complete phylogenetic tree from Fig. 2a, with divergence times computed by r8s.

c.
Distribution of K s and K a on specific subgenomes during the time between L and S speciation, before X. laevis and X. borealis speciation. We find accelerated mutations rates between T2 and T3 in Ks and Ka (p=1.4e-5 (left), 8.6e-3 (right)).

d.
Distribution of K s and K a on specific subgenomes during the time after laevis and borealis speciation. We do not find significantly accelerated substitution rates. (p= 0.10 (left) and 0.03 (right)).

e.
Table showing the number of homeologs and singletons identified as homeologs from the ancient vertebrate duplication (or ohnologs as they are historically called) 63 . 79.9% of ohnologs retain both copies in X. laevis today, significantly more than the 54.3% of the rest of the genome after excluding ohnologs (χ 2 test p-value= 4.44E-69).
f. Table showing the branch lengths of bootstrapped maximum likelihood trees described in Supplemental Note 12.5. The columns refer to the X. tropicalis (XTR), L chromosome of X. laevis (XLA.L), S chromosome of X. laevis (XLA.S), and XLA.L/XLA.S branch lengths respectively. The first row is triplets where all genes show expression, the second row is triplets where L is a thanagene, and the third row is triplets where S is a thanagene. The L branch length is significantly smaller when all genes are expressed, or when S is a thanagene (Wilcoxon p-value=1.7E-216 and 6.4E-212 respectively). The S branch length is smaller when L is a thanagene (p=2.4E-223). The ratio of branch lengths (L/S) is significantly different for either L or S thanagene datasets compared to when all genes are expressed (p=3.55E-214 and 7.48E-220 respectively). The ratio is different between the two thanagene datasets as well (p=1.79E-217).

a.
Chromosomal locations of the 45S pre-ribosomal RNA gene (rna45s), which encodes a precursor RNA for 18S, 5.8S, and 28S rRNAs, was determined using pHr21Ab (5.8-kb for the 5′ portion) and pHr14E3 (7.3-kb for the 3′ portion) fragments as FISH probes. DNA fragments used for the probes were provided by National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, and labeled with biotin-16-dUTP (Roche Diagnostics) by nick translation. After hybridization, the slides were incubated with FITC-avidin (Vector Laboratories). Hybridization signals (arrows) were detected to the short arm of XLA3L, but not XLA3S. Scale bar represents 5 μm.

b.
A large deletion including an olfactory receptor gene (or) cluster. Schematic structures of or gene clusters and adjacent genes on the 8th chromosomes of X. a deleted region in XLA8S in comparison with XLA8L. The centromere is located on the left side and the telomere is on the right.

c.
The relative frequency (left panel) and size (right panel) of genomic regions deleted in the S (blue) and L (green) chromosomes respectively. Both subgenomes experienced sequence loss through deletions, however, the deletions on the S subgenome are larger and have been more frequent. Deletions were called based on the progressive Cactus sequence alignment between the X.laevis L and S subgenomes and the X. tropicalis genome. Chromosome 9_10 of laevis was split into 9 and 10 on basis of alignment with the X. tropicalis chromosomes.
Sequences from L that were not present on S, but could at least partially be identified in X. tropicalis, and consisted of gaps for no more than 25% of their length were called as deleted regions in S. The same procedure was followed for deleted regions in L.

d.
Identification of triplet loci is described in Supplemental Note 8.1. Loci were classified into groups based on the presence of gene 2 in both X. laevis subgenomes (homeolog retained), versus those that had a pseudogene in the middle (pseudogene) or no remnant of the middle gene as assessed by Exonerate (deletion). To normalize the intergenic lengths we divided the nucleotide distance between genes 1 and 3 in either X. laevis subgenome by the orthologous distance in X. tropicalis. The median of the normalized ratio distribution is plotted on the bar chart. On average S deletions appear to be larger than L deletions (52.9% length vs 80.2% the size of the orthologous X. tropicalis region respectively).

e.
The number of RNA-seq reads aligning +/− 1kb of precursor miRNA loci (red) was compared to the read count for 10,000 random unannotated 2.1 kb regions of the genome (blue). All 83 homeologous, intergenic miRNA pairs showed alignment within their regions, as opposed to 4,127/10,000 (41.27%) of the randomly chosen intergenic sequences. The putative primary-miRNA loci have a higher read count than the expressed randomly chosen regions as well (Wilcoxon p=1.4E-38).

f.
The CACTUS alignment was parsed to identify flanking CNE around each X. tropicalis gene. The number of CNEs > 50bp in length for singletons is shown in red, homeologs in blue. Komologrov-Smirnov test p-value is 1E-11.

g.
The average distance to the nearest gene was computed for each chromosomal locus in X. tropicalis. The average intergenic distance for those with a single X.
laevis gene is shown in red, those with two shown in blue. Wilcoxon p-value= 9.8E-24.

h.
The distribution of gene retention by genomic footprint of the X. tropicalis ortholog. We define genomic footprint as the genomic distance from the start signal of the CDS to the stop signal, including introns. The x axis shows log 10 (genomic footprint), the y-axis is the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homeologs and singletons by a Wilcoxon test (p-value = 2.4E-96).

i.
The distribution of gene retention by CDS length of the X. tropicalis ortholog.
The x axis shows log 10 (CDS length), the y-axis is the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homeologs and singletons by a Wilcoxon test (p-value= 1.7E-21).

j.
The distribution of gene retention by exon number of the X. tropicalis ortholog.
The x axis shows number of exons; the y-axis is the retention rate of each bin. The error bars are the standard deviation of the total divided by the number of genes in each bin. We tested for significant differences in length between homeologs and singletons by a Wilcoxon test (p-value= 3.2E-8).  despite many frameshifts, premature stops, and the lack of a proper start, and insertions of new sequence, we identify many codons in the pseudogene that occur in large conserved blocks.

b.
Illustration of our model to compute pseudogene ages. The star represents the point of nonfunctionalization for a currently pseudogenized locus. We assume the expected rate of nonsynonymous changes can be estimated by the Ka of the extant gene and X. tropicalis. We then compare the Ks and Ka of the pseudogene sequence to estimate the time of nonfunctionalization. See Supplemental Note 9 for a more detailed discussion.

c.
Estimated epochs of pseudogenization for 430 genes are indistinguishable from a burst of pseudogenization > 10 Mya (K s > 0.03). See Supplemental Note 9 for a more detailed discussion.

d.
Correlation of pseudogene expression with its extant homeolog. The little expression seen in pseudogenes tends to be uncorrelated with the extant homeolog.

e.
Histogram of pseudogene expression values across all 28 tissues and developmental stages (red) compared to all extant genes (blue). The pseudogenes are rarely expressed, and tend to be expressed at lower levels than extant proteincoding genes.

f.
Histograms of expression variance of pseudogenes (red) compared to extant genes (blue). The small amount of pseudogene expression observed does not tend to vary across tissues and developmental stages in the same way that extant genes do.

EDF 7. TANDEM DUPLICATIONS a.
Phylogenetic trees of the mix/bix cluster. Nucleotide sequences were aligned using MUSCLE, and a phylogenetic diagram was generated by the ML method with 1,000 bootstraps (MEGA6). Circles with different colors represent X. laevis L genes (magenta), X. laevis S genes (blue), and X. tropicalis genes (green). The table shows the correspondence of bix gene names proposed in this study and previously used (synonyms).

b.
FISH analysis showing XLA3S-specific deletion of the nodal5 gene cluster. One unit of the nodal5 gene region, including exons, introns, and an intergenic region was used as a probe for FISH (counterstained with Hoechst). Arrows indicate the hybridization signals of nodal5s. Scale bar indicates 5 um.

c.
Comparison of the nodal5 gene cluster. Genome sequencing revealed that nodal5.e1.L~.e5.L (in pink) and nodal6.L are clustered. Amplification of nodal5 gene in XLA3L and loss of this cluster in XLA3S were confirmed. Pseudogenes (nodal5p1.L~p4.L and nodal5p1.S) are indicated in black. The nodal5 cluster of X. tropicails does not contain any pseudogene.

d.
X. laevis L chromosome has four complete copies of nodal3 (nodal3.e1.L~.e4.L), whereas the gene cluster is lost from the X. laevis S chromosome. A truncated nodal3 gene (nodal3p1.L) is likely to be a pseudogene, and highly degenerate pseudogenes (nodal3p2.L and nodal3p3.L) also exist on the L chromosome.

e.
Like nodal3, vg1 is lost from the S chromosome although there is a pseudogene (vg1p.S). vg1 is specifically amplified on the X. laevis L chromosome (vg1.e1.L~.e3.L) in comparison with X. tropicalis. An amino acid change (Ser20 to Pro20) in Vg1 protein has been shown to result in functional differences (Supplementary Note 13.9). vg1 and derrière are orthologous to mammalian gdf1.

f.
Fraction of all genes duplicated and retained to present epoch per 1 expected 4DTV(four-fold degenerate transversions) at different epochs (semi-log scale). Shown also are linear fits, which would be consistent with constant birth-and death rate models (first epoch is omitted from both fitted data sets, as is second epoch from X. laevis). See Supplemental Note 11 for a more detailed discussion.

g.
Same, but for "short genes" (CDS < 600 bp) and "long genes" (CDS > 1200 bp) separately. The loss rate of new duplicates appears to be similar. If the extra copy of a newly duplicated gene were lost when the first 100% disabling mutation occurred, we would expect, on average, the longer genes to be lost.

EDF 8. GENE EXPRESSION ANALYSIS a.
Pairwise Pearson correlation distributions between homeologous genes (red) and all genes (blue). The left histogram is for stage data; right is for adult data. The x-axis is the correlation; the y-axis is the percent of data. The homeologous genes have a correlation distribution closer to one due to their being the same locus recently. X. laevis TPM values 0.5 were lowered to 0. Any gene with no TPM > 0 was removed from analysis. We then added 0.1 to all TPM values and log transformed (log 10 ).

b.
Scatterplot comparing binned genes by their median X. tropicalis expression 64 to the retention rate of their X. laevis (co)-orthologs. Error bars are the standard deviation for the whole data set divided by the square root of the number of genes analyzed in a bin. We assessed significance by a Wilcoxon test of the homeologous and singleton distributions, p-value = 6.31E-113.

c.
Complete boxplot shown in Fig. 4c. The difference between subgenomes is difficult to see at this magnification, illustrating that many loci deviate from the whole genome median of preferring the L homeolog. There are some L outliers expressed 10 4 as much as their S homeologs, whereas no S genes shows such a strong trend. These differences are discussed in more detail in Supplemental Note 12.  Table 5 for more information.

d.
Deletions rates on L (x-axis), vs S (y-axis) for different stage WGCNA groups (visualized as a heatmap in Fig. 4a). For stage WGCNA groups we computed the number of X. laevis single-copy genes (singletons) vs homeolog pairs and computes a fraction retained. The line is expected L/S loss based on genomewide average (56.4%). Red points show groups with high or low rates of loss (p<.01).

e.
Deletion rates on L (x-axis), vs S (y-axis) for different GO groups. For GO groups we computed the number of X. laevis single-copy genes (singletons) vs homeolog pairs and computes a fraction retained. The line is expected L/S loss based on genome-wide average (56.4%). Red points show groups with high or low rates of loss (p<0.01). See Supplemental Table 5 for more information.   Phylogenetic tree based on protein-coding genes of tetrapods, rooted by elephant shark (not shown). Alignments were done by MACSE, the maximum-likelihood tree was built by PhyML. Branch length scale shown on bottom. The difference in branch length between Xenopus laevis-L and Xenopus laevis-S is similar to that seen between mouse and rat. Both subgenomes of X. laevis have longer branch lengths than X. tropicalis.  dark blue bars respectively), zygotically-controlled developmental time points and adult tissues are on the right (red and green bars respectively). The red line shows the equal ratio log 10 (1). On average maternal data sets express the L gene of a homeologous pair 12% higher than as S (median = 0%), while the zygotic tissues and time points express the L gene of a homeologous pair 25% higher than S (median = 1.8%). The difference between the mean and medians is explained by many genes with large differences between homeologs, illustrated by the full distribution in Extended Data Fig. 8c. Here, to illustrate the difference in median of zygotic expression, we zoom in on the center of the boxplot. In addition to the tracks described for c), the right panel shows RNA Polymerase II (RNAPII; purple) and H3K36me3 (blue) ChIP-seq profiles. e. Representative embryos with GFP expression driven by either six6.L-CNE or six6.S-CNE linked to a basal promoter-GFP cassette (six6.L-CNE:GFP and six6.S-CNE:GFP, respectively). GFP expression was detected by in situ hybridization. Semi-quantitative image analysis revealed a statistically significant difference in their average expression level (p < 0.01); the expression driven by six6.S-CNE (n = 27) was 0.6-fold weaker than that by six6.L-CNE in the eye region (n = 32). Given eye-specific patterns of their endogenous expression, the six6 genes likely have additional silencers for restricting enhancer activity of the CNEs in the eye.