Decelerated genome evolution in modern vertebrates revealed by analysis of multiple lancelet genomes

Vertebrates diverged from other chordates ~500 Myr ago and experienced successful innovations and adaptations, but the genomic basis underlying vertebrate origins are not fully understood. Here we suggest, through comparison with multiple lancelet (amphioxus) genomes, that ancient vertebrates experienced high rates of protein evolution, genome rearrangement and domain shuffling and that these rates greatly slowed down after the divergence of jawed and jawless vertebrates. Compared with lancelets, modern vertebrates retain, at least relatively, less protein diversity, fewer nucleotide polymorphisms, domain combinations and conserved non-coding elements (CNE). Modern vertebrates also lost substantial transposable element (TE) diversity, whereas lancelets preserve high TE diversity that includes even the long-sought RAG transposon. Lancelets also exhibit rapid gene turnover, pervasive transcription, fastest exon shuffling in metazoans and substantial TE methylation not observed in other invertebrates. These new lancelet genome sequences provide new insights into the chordate ancestral state and the vertebrate evolution.


Supplementary Table 4. Inferred substitution rates and divergence times of selected lineages
A. Estimates of amino acid substitution rates and divergence times of several pairs of closely related species        . **The purple sea urchin genome assembly is presented as diploid assembly and no haploid assembly is available. The gene number and total CDS length for the diploid assembly are 42420 and 54378780bp 2 . We assumed that the compeleness for the sea urchin haploid assembly is also ~85%, so we estimated the CDS length for haploid genome is 54378780/(0.85x2)=31987518bp.

Supplementary Note 1 Background of the Chinese amphioxus and its relationships with the Japanese and Malaysian amphioxus
The lancelet, also called amphioxus, belonging to the subphylum Cephalochordata, represents the living basal lineage of the phylum Chordata (which includes three subphyla, Cephalochordata, Urochordata and Vertebrata) 5 . Lancelets are filter-feeders dwelling in the shallow sandy sea floor along coastlines. The first species of lancelet was described by Pallas in 1774, and there are now 31 known species inhabiting seashores around tropical and temperate oceans 6,7 . Lancelets are currently widely distributed along the Chinese coastline, from Qingtao to Beihai (Supplementary Figure 1A). In fact, the entire coastal area, from Northern Japan to Southeast Asia, is the natural habitat for this genus (Supplementary Figure  1A).
The habitats of three amphioxus species (B. belcheri, B. japonicum and B. malayanum) overlap in the coastal area between Xiamen and Beihai.
Traditionally, all lancelets in Chinese seas are collectively referred to as Chinese amphioxus and were considered to comprise a single species, Branchiostoma belcheri; those distributed in Qingdao waters were treated as a sub-species named Branchiostoma belcheri tsingtauense. These concepts have recently proved inaccurate.
Xiamen, a coastal city in China, has long been famous for its abundance of lancelets 8 .
Branchiostoma belcheri was first reported in Xiamen waters in 1932 9 and was once believed to include all lancelets in the region. However, it was recently discovered that there are two similar but distinct lancelet species in the region 10,11 . One species is mainly distributed from Xiamen to Beihai, in the sub-tropical oceans, whereas the second species is genetically the same as lancelets from the northern Chinese (like QingDao waters) and Japanese seas, suggesting that the second species mainly inhabits temperate oceans (Supplementary Figure  1). Lancelets from Qingdao waters and Japanese waters were traditionally considered B. belcheri tsingtauense, a sub-species of B. belcheri. Now, according to the priority rules of nomenclature, the first species is entitled to the name Branchiostoma belcheri, and the second species has been renamed Branchiostoma japonicum 10,12 .
Studies of Cyt b gene sequences and 12S rRNA gene sequences have revealed significant divergence between B. belcheri and B. japonicum [13][14][15] . In addition, cytotaxonomic analyses found that the diploid chromosome numbers were 2n=40 in B. belcheri and 2n=36 in B. japonicum 16 (and, for comparison, 2n=38 in B. floridae). Further experiments confirmed that the two species are apparently reproductively isolated, which explains how the two species can dwell in the same habitat (Xiamen waters) but maintain independence. In addition, the two species in Xiamen show traces of morphological differences: 1) the rostral fin is slightly round with an obtuse end in B. belcheri but elliptical with a cuspate end in B. japonicum; 2) the number of preanal fin chambers is more than 80 in B. belcheri but normally less than 70 in B. japonicum, and the chambers are slender in the former but stout in the latter; 3) the caudal fin of B. belcheri is narrower than that of B. japonicum, and the angles between the dorsal and super-caudal fins and between the preanal and sub-caudal fins are obtuse in B. belcheri but acute in B. japonicum 10 . However, these morphological differences are not absolute because in reality, approximately 20% of individuals collected from the wild could not be unambiguously assigned to either species based on morphology.
A foreign species, Branchiostoma malayanum, which mainly dwells in Malaysian waters (Southeast Asia), is occasionally found along the southern China seashore, near Hong Kong, for example 11,17 , hence suggesting the frequent incursions of this tropical species.
Although B. belcheri has been shown to share more morphological similarities with B. japonicum than B. malayanum, molecular comparisons using 12S RNA genes and AFLP markers indicated that B. belcheri is less different from B. malayanum than from B. japonicum 11 .
Geographically, the Chinese amphioxus population is mainly distributed along the Southern China coastline from Xiamen to Beihai, which extends at least 1,200 kilometers (Supplementary Figure 1A).  Figure 1B). This suggests that the genetic structure of the natural population of the Chinese lancelet is weak or absent.
In this study, the species sequenced was B. belcheri from the Xiamen waters, a lancelet that shows typical characteristics of sub-tropical marine animals, such as a larger body size, faster developmental speed and a longer breeding season (relative to the temperate species B. japonicum) 18 . However, no comparisons of development or reproduction have been conducted between B. belcheri and another co-habitant, the tropical species B. malayanum.
B. belcheri was the first lancelet to be raised in captivity for multiple generations 18 . Methods for year-round reproduction and spawning induction have also recently been developed for B. belcheri 19,20 . Furthermore, a recent attempt suggests that TALENs can be used to induce mutagenesis at specific genomic loci in this species 21 . The availability of on-schedule embryonic materials and direct mutagenesis approaches will accelerate the process of establishing amphioxus as a model organism suitable for experimental biology.

Sample collection and DNA isolation
Specimens of outbred male adult B. belcheri were collected in July 2008 from Huangcuo (24°27′07″N, 118°10′27″E) in the Xiamen Rare Marine Creature Conservation Areas, China (Supplementary Figure 1). Animals were kept in filtered running seawater for 24 hours to facilitate cleaning of the body and emptying of the digestive tract. Ripe gonads full of sperm were harvested from a single large healthy male that was approximately 2-3 years old and 4 centimeters long. Genomic DNA was purified from the gonads using the DNeasy TM blood and tissue kit (QIAGEN). Quality and quantity were evaluated by Nanodrop and agarose electrophoresis. A total of 280 μg DNA was obtained from the single male, with a fragment size of 30-40 kb.

Construction and sequencing of shotgun and paired-end libraries
Two platforms of next-generation sequencing technologies, the 454 platform and the Illumina platform, were used to sequence the diploid genome of the selected male lancelet.
The 454 dataset was sequenced by GS FLX Titanium chemistry. Both shotgun and paired-end libraries were prepared using Roche's protocols and GS FLX Titanium series kits. The total reads dataset consisted of 17 million shotgun reads and 27 million paired-end reads, which were generated from multiple shotgun libraries and multiple paired-end libraries, with insert sizes of 2 kb, 3 kb, 8 kb and 20 kb (Supplementary Table 1).
The Illumina dataset consisted of four paired-end libraries with insert sizes of 340-600 bp. All libraries were constructed according to the Illumina protocols. Each library was subjected to 2x 115 bp paired-end sequencing on the Genome Analyzer IIx (GAIIx). A total of 145 million paired-end reads were obtained, yielding approximately 33 Gb (Supplementary Table 1).

Genome size estimation
The genome size was estimated by the k-mer method as described 22,23 . In brief, genome size (G) can be determined by dividing the total amount of sequenced bases (T) by the sequencing depth (D). The sequencing depth (D) can be estimated by the formula D=E*L/(L-K+1), where L is the average read length, K is the k-mer size, and E is the peak coverage depth for the given K. The peak coverage depth may decrease with longer k-mer size; therefore, an optimal combination of K and E should be inferred by analyzing the k-mer distribution profiles.
Quality-filtered 454 reads, including both shotgun and paired-end reads, were extracted with the Newbler assembler 24 . Both the filtered 454 reads and a subset of the Illumina reads were subjected to k-mer distribution profiling. The calculation revealed that the k-mer depth peaked at 35.1, with a k-mer size of 20, a total read length of 36 Gb and a mean read length of 143 bp, which therefore gave an estimate of 884 Mb for the sequenced genome.
Given that both haplotypes of the chosen individual were sequenced together and that the average heterozygosity between two haplotypes was 4-5% (Supplementary Note 4), the haploid genome size was assumed to be half the estimates, namely, 442 Mb, as a close estimate to the actual haploid genome size.
In addition, a cytometry analysis of the sperm cells yielded an estimated haploid genome size of 440 Mb.
This haploid genome size is considerably smaller than that reported for the Florida amphioxus, B. floridae (500~520 Mb) 1 . This size difference was also confirmed by our later intron size comparison between the two species (Supplementary Note 9).

Haploid genome assembly version 7: initial attempt
Lancelets are marine species with high allelic polymorphisms due to their large effective breeding population. The individual sequenced here exhibited ~5% heterozygosity in its diploid genome (Supplementary Note 4) plus a large quantity of repetitive sequences (~30% of the genome).
Whole-genome shotgun (WGS) assembly for highly polymorphic diploid genomes is generally difficult and does not reach the same level of quality as assemblies for haploid genomes or diploid genomes with low levels of polymorphism [25][26][27][28][29] . The difficulty is caused by sequencing two closely related haplotypes together. The process is even more challenging when the next generation sequencing (NGS) technologies are used in place of Sanger methods 30 because short read length, higher error rates and new types of sequencing errors exacerbate the problem 31 .
We deemed that haplotypes could be better resolved by longer reads, whereas base-level errors could be rectified by a high depth of short reads. Therefore, we generated 30x 454 reads and 70x Illumina reads (Supplementary Table 1).
SOAPdenovo assembly. SOAPdenovo2 is a de Bruijn graph-based assembler designed for short-read (next-generation) sequencing 32 . An early study showed that SOAPdenovo could not adequately assemble the polymorphic oyster diploid genome (with a polymorphism rate of 1.3%) from pure high depth (155x) of short reads, but when combined with fosmid pooling and other methods, SOAPdenovo managed to produce an oyster genome assembly with contig and scaffold N50 sizes of 19kb and 401kb, respectively 33 Table 2).
Reference haploid assembly version 7. The v7 diploid assembly contained redundant alleles, remained highly fragmented and was infested with numerous middle-to-large-scale mis-assemblies. Two notorious types of error in polymorphic diploid assemblies are the tandem mis-assembly of alleles and mis-joins of unrelated genomic portions that violate the large-scale (>100 kb) colinearity between alleles 37 . To automate and seek an optimized solution for these problems, we developed HaploMerger, a pipeline containing a series of algorithms and programs designed to remove assembly errors and infer reference haploid assemblies from a given diploid assembly 31 . The original HaploMerger (i.e., release_20110720) was used to process the v7 diploid assembly with default parameters. A total of 413 tandem mis-assemblies (>10 kb), accounting for 8.6 Mb, were removed, and a total of 132 major mis-joins (>50 kb) were detected. For each mis-join, one of the two implicated scaffolds was selected to break up based on a set of heuristics that prefer to preserve sequence continuity 31 . This breaking scheme is far from perfect because it may cause HaploMerger to erroneously break the correct scaffold. In the end, HaploMerger produced a reference haploid assembly spanning 416 Mb, with a scaffold N50 size of 833 kb and a contig N50 size of 104 kb (Supplementary Table 2).
Quality inspection of the v7 assembly. The v7 assembly provided an opportunity to assess our assembly strategy. (1) For polymorphic read data, a high error rate allowance is confused with true polymorphism, inevitably causing excessive allele collapsing, assembly errors and short scaffolds when further complicated by short read length, repeats, and sequence duplications.
(2) Linking the diploid assembly with 20 kb paired-end reads did not yield significant improvement. It appeared that excessive assembly errors and the presence of multiple alleles prevent effective scaffolding. (3) We observed many more small tandem mis-assemblies (<10 kb) than expected. (4) To avoid false positives, HaploMerger by default does not process potential small-scale mis-joins (<50 kb). However, the use of a high "utgErrorRate" seemed to cause excessive (potential) mis-joins, with a large portion classified as "small scale" (<50 kb) due to the short scaffold size. (5) When two scaffolds violate large-scale colinearity (e.g., >50 kb mis-join), the default action of HaploMerger (release_20110720) is to break one of the two scaffolds without consulting the mate-pair graph or the contig-scaffold layout. This may erroneously break the correct scaffold and conceal the problematic scaffold. (6) The huge contig N50 size suggests that GapCloser version 1.04 is too aggressive in closing N-gaps.

Haploid genome assembly version 15: hierarchical scaffolding
After inspection of assembly version 7, we redesigned our assembly strategy and created a new assembly.
Step 1. Diploid assembly. CABOG was used to create a diploid assembly from all 454 shotgun reads and paired-end reads with insert sizes of 2 kb, 3 kb and 8 kb.
Step 2. Haploid assembly. An expansion kit for HaploMerger was used to create a haploid assembly from the diploid assembly. The new HaploMerger module can detect and remove tandem-assembled alleles larger than 100 bp. A total of 1460 tandem mis-assemblies accounting for 7.9 Mb were removed from the diploid assembly. For large-scale (>50 kb) mis-joins, the new HaploMerger attempts to interrogate paired-end linking information to determine which scaffold should be broken up and to consult the contig-scaffold layout to decide the breakpoint position. Specifically, the scaffold receiving less link support (<1/3 of those of another scaffold) across the breakpoint within a 50 kb range is selected for breaking. If the paired-end links favor neither of the scaffolds, then both scaffolds are broken. A total of 167 mis-joins were processed. Step 3. Hierarchical scaffolding. Bambus 38 version 2.33 was used to further scaffold the haploid assembly (produced in Step 2) with all 20 kb paired-end reads. Before linking, the haploid assembly was first masked by WindowMasker 39 . Paired-end reads were then mapped to the scaffolds using GMAP 40 . After the mapping was performed, three consecutive steps of filtering were employed: (1) duplicated reads (both ends mapped to nearly the same positions of another read) were filtered; (2) reads of non-unique mapping were removed; and (3) reads overlapping with masked regions were discarded. Bambus was then used to link the scaffolds. Two special parameter settings were used to guarantee quality: at least 2 reads were required to link two scaffolds (default=2); and only 2 standard deviations of the insert size were allowed (default=3). Finally, a custom script was used to estimate the sizes of new N-gaps.
Step 4. N-gap filling. A new version of GapCloser (version 1.12) was used to close N-gaps in the derived assembly with all Illumina paired-end reads. According to the manual, GapCloser version 1.12 is less aggressive and more accurate than GapCloser version 1.04. Moreover, we filtered Illumina reads with QUAKE 41 before feeding them to GapCloser. By default, GapCloser tries to narrow a gap by extending sequences, even when there is no expectation of closing it. We found that these extended sequences were more error-prone than those in the closed gaps; therefore, we used a custom script to remove these sequences. In the end, we obtained a haploid assembly of 450 Mb, with a scaffold N50 size of 1.5 Mb and a contig N50 size of 25 kb (Supplementary Table 1).

Quality comparison of haploid assembly versions 7 and 15.
A statistical comparison of the two assembly versions is presented in Supplementary Table 2. A major improvement of version 2 over version 1 was the nearly doubled scaffold N50 size. The most notorious errors for polymorphic assembly are large-scale mis-joins. To detect large-scale discrepancies between the two assembly versions, we used LASTZ 42 and chainNet 43 to create pairwise whole-genome alignments between the two versions. A total of 320 large-scale (>100 kb) colinearity violations (or mis-joins) were identified from the alignments. To determine which version contained the mis-join, we also aligned each assembly version against the draft genome version 2 of the Florida amphioxus B. floridae 1 . The genome of Florida amphioxus served as an out-group due to its relatively conserved genomic structure and divergence time of ~110 million years from Chinese amphioxus. The 3-way comparison pinpointed 77 potential mis-joins in the v15 assembly and 66 in v7; and the remaining 177 potential mis-joins could not be determined (Supplementary Table 3). Nevertheless, these results suggested that despite the much longer scaffold size, the v15 assembly carried more potential mis-joins than the v7 assembly.

Haploid genome assembly version 18: hybrid methods
A comparative analysis of assembly versions 7 and 15 provided more information on how to fine-tune the assembly strategy described above. After testing a series of data and strategic combinations, we achieved a final assembly strategy (Supplementary Figure 2). The important changes are described below: Hybrid assembly with multi-platform data. In assembly versions 7 and 15, Illumina paired-end reads were used only for gap filling. After extensive tests, we found that a hybrid assembly involving both 454 reads and Illumina reads can be very effective at increasing assembly accuracy and continuity. We believe that several factors may contribute to this property: (1) higher depth; and (2) the fact that per base quality increases not only because of higher depth but because Illumina reads correct specific sequencing errors inherent in 454 reads and vice versa, which, in turn, helps paired-end reads be placed in the correct positions. Updates for HaploMerger. 1) The contig-scaffold layout was used to evaluate tandem mis-assemblies and suppress false positives; 2) in addition to large-scale mis-joins (>50 kb), middle-scale mis-joins (30-50 kb) were also detected and processed; 3) we no longer considered the mate-pair graph or the contig-scaffold layout because they can be misleading (we observed more potential mis-joins in version 15 than in version 7). On the other hand, we reasoned that the correct linkage of contigs should be re-linked in the second round of scaffolding; hence, the algorithm now breaks up both scaffolds involved in a possible mis-join; and 4) tandem assemblies on the haploid assembly can now be detected and processed.
Diploid assembly. The version 18 diploid assembly was created using both 454 reads (shotgun, 2 kb, 3 kb and 8 kb paired-end) and Illumina reads (340 bp and 500 bp paired-end). The new assembly pipeline is shown in Supplementary Figure 2. The hybrid diploid assembly spans 707 Mb, with a scaffold N50 size of 264 kb and a contig N50 size of 30 kb.
Reference (or primary) haploid assembly. The hybrid diploid assembly was processed by HaploMerger. A total of 2,149 tandem assemblies (>100 bp; in total accounting for ~17.90 Mb or ~2.5% assembly size) were detected and removed by consulting the self-alignments and contig-scaffold layout. A total of 159 events of middle-to large-scale mis-joins (>30 kb) were detected, and all implicated scaffolds were broken up. Finally, HaploMerger selected the longer alleles for the initial haploid assembly, which was further scaffolded using 20 kb paired-end reads as described above (except this time requiring at least 4 mate-pairs to establish a link, thereby suppressing false positives). After scaffolding, HaploMerger was used to remove newly derived tandem mis-assemblies (588 events, accounting for 2.89 Mb). Finally, GapCloser version 1.12 was used to close N-gaps in the assembly. By default, GapCloser tries to narrow a gap by extending sequences, even when there is no hope of closing the gap. These extended sequences were more error-prone than those in closed gaps and were therefore discarded. The "gap-filling" contigs created by GapCloser were generally less accurate and hence marked with the prefix "GF_" in the companion AGP file. The final reference (or primary) haploid assembly has better N50 sizes for both scaffold and contig, namely, 2.3 Mb and 46 kb, respectively (Supplementary Table 2).
Alternative haploid assembly. HaploMerger produced ~291 Mb alignments (>500 bp) after solving the allelic relationships in the diploid assembly. This means that approximately 65~70% of the loci in the primary haploid assembly have an alternative allele. This proportion is significantly lower than that of the draft diploid genome of B. floridae 1 , where 77~85% of genomic loci have an alternative allele. We believe that the difference reflects the nature of current next generation sequencing (NGS) technologies, where shorter read lengths and a higher sequencing error rate cause more severe allele sequence collapsing and demote many more sequences to degenerate status. We created an alternative haploid assembly by replacing corresponding loci in the primary haploid assembly with available alternative alleles.

Comparison of large-scale mis-joins in different haploid assembly versions.
Based on 3-way whole-genome alignments, a total of 384 large-scale (>100 kb) colinearity violations (or mis-joins) were identified between haploid assembly versions 18 and 15. Using the draft genome version 2 of Florida amphioxus (B. floridae) as a reference, we pinpointed 27 possible mis-joins in assembly version 18 and 130 in version 15; the remaining 227 potential mis-joins could not be determined (Supplementary Table 3). This result suggests that assembly version 18 is much better than version 15 in terms of large-scale mis-assemblies. We also compared version 18 with version 7, testing at the smaller scale size of 50 kb; these results showed a similar trend (Supplementary Table 3). Based on these results, we estimated that there was less than one potential mis-join (>100 kb) in every 6.5 Mb in the v18 assembly. Note that this estimate could be an overestimate due to mis-joins in the Florida lancelet genome assembly, alignment artifacts, true large-scale genetic variation, and other assembly errors, such as large indels taken for colinearity violations.

The completeness of the v18 assembly
Raw Illumina paired-end reads (~23 Gb) were aligned to the reference+alternative assembly and the reference assembly using GSNAP 44 , which was run in the DNA mode with the provided insert sizes and all other default parameters. As individual reads, 99.82% of the Illumina reads were successfully mapped to the reference+alternative assembly. As read pairs, 98.32% were successfully mapped in the correct direction and distance range (i.e., concordant match). On the other hand, though 99.31% of the individual reads could be mapped to the reference haploid assembly, only 89.30% of the read pairs were mapped in the correct direction and distance range. These results suggest a large quantity of allele-specific reads, which is a reflection of the high polymorphism of the amphioxus diploid genome.
In addition, EST contigs, which were assembled from ~3 million 454 FLX Titanium reads (Supplementary Note 8), were aligned to the v18 assembly using NCBI-BLASTN 45 . Of 52,961 contigs with at least 300 base pairs and exactly one apparent open reading frame (ORF), 98.85% had at least one alignment of at least 80% identity and 25% coverage to the diploid assembly (or 96.55% at 80% identity and 50% coverage). The reference haploid assembly showed similar completeness (96.13% at 80% identity and 50% coverage, or 98.59% at 80% identity and 25% coverage), suggesting that the reference haploid assembly inferred by HaploMerger was almost as complete as the original diploid assembly in terms of protein-coding gene content.

Supplementary Note 3 Divergence between two lancelet species Curation of multigene protein alignments
To evaluate the amino acid substitution rates and divergence times between two lancelets and other species, we extracted orthologous protein-coding gene families from fifteen selected species using a modified reciprocal best hit (RBH) method as suggested by Putnam et al 1 (see Supplementary Note 7 for more details). The initial ortholog families based on lancelets and humans were used as anchors to identify orthologs from 12 other species. At least 50% sequence coverage was required for every orthologous protein pair and only one species was allowed to be absent in each protein family. The final resulted dataset contains 729 ortholog families. CLUSTALW2 46 was used to create multiple alignments for each family, and all alignments were concatenated to form an all-in-one alignment (alignment 1), which contained 729 ortholog families and a total of 403,674 sites. Alignment 2 (with 245,205 sites) was further created by removing the less-conserved sites (158,469 sites) from alignment 1 using Gblocks 47 . Finally, we removed sites with indels from alignment 2 and retained only ortholog families with representatives in all 15 species, which gave alignment 3 (with 72,795 sites from 513 ortholog families). Alignment 3 was used to estimate phylogenetic relationships, amino acid substitution and divergence times, whereas alignment 1 and 2 were only used for the estimation of amino acid substitution.

Protein-based phylogenetic reconstruction
Both Bayesian and maximum likelihood analyses were used for phylogenetic reconstruction. Bayesian analysis were carried out on alignment 3 using Phylobayes 48 v3.3 with the CAT model, which is a mixture model especially devised to account for site-specific features of protein evolution and hence particularly well suited for large multigene alignments. We ran four independent chains with random starting trees for over 20,000 Monte Carlo iterations (with the first 10,000 burin-in cycles removed), and they converged to the same tree topology (Supplementary Figure 3A). For maximum likelihood analysis, we first ran ProtTest3 49 on alignment 3 to select the best-fit model. The recommended model is LG+I+G+F, namely, the LG 50 amino acid substitution matrix plus invariant sites, Gamma distribution (under four rate categories) and empirical amino acid frequencies. PhyML 51 v3.1 was then run on alignment 3 to infer tree topology and branch length. Special settings for PhyML included 200 bootstrap replicates and the BEST topology search method. PhyML produced the same tree topology as obtained by the Bayesian method (Supplementary Figure 3B). The estimated Gamma-shape parameter and the invariant fraction were 0.83 and 0.16 respectively. Finally, amino acid substitution per site (branch length) based on alignment 1 and 2 was also calculated using PhyML based on the tree topology obtained from alignment 3 (Supplementary Figure 3C-D).
The analysis shows that both Bayesian and maximum likelihood methods recovered the same tree topology (Supplementary Figure 3A-B). And the relative amino acid substitution rates (branch length divided by total tree length) inferred from alignment 1, 2 and 3 were consistent with each other (Supplementary Figure 3B-D). Branch length indicates that the protein divergence of two lancelets is comparable to the selected species pairs of worms (C. elegans and C. briggsae) and fruit flies (D. melanogaster and D. mojavensis), but is approximately half the divergence of those selected species pairs of tunicates (C. savignyi and C. instestinalis), fishes (tetraodon and stickleback) and human versus chicken (Supplementary Figure 3B-D). Since alignment 3 tends to bias to highly conserved amino acids and alignment 1 is overwhelmed by fast-evolving sites, we reasoned that the tree based on alignment 2 may provide a balanced estimation of amino acid substitution (Supplementary Figure 3C and Supplementary Table 4).

Protein-based divergence time analysis
We then used Bayesian methods to estimate the divergence time of two lancelet species between two lancelets based on alignment 3 and the inferred phylogenetic topology. Fossil-based divergence time constraints taken from literatures 52-54 were imposed on this dating analysis (human-chicken 312-331Mya, tetraodon-stickleback 98-151Mya, human-ray-finned fish 416-422Mya, tunicate-human >485Mya, lancelet-human >485Mya, echinoderm-vertebrate >521Mya, echinoderm-hemichordate >485Mya, drosophilanematode >541Mya, protostome-deuterostome >558Mya). Lower bounds were used as hard bounds, whereas upper bounds, if available, were increased by 10% to make them "softer". Two molecular dating programs and five parameter sets were run on the obtained tree topology and alignment 3 (Supplementary Figure 3E). PhyTime 55 v3.1 (from the package of PhyML) was run under the autocorrelated relaxed clock model with these parameters: the GBS rate model (Geometric Brownian + Stochastic), the LG matrix, Gamma distribution (16 categories), the multivariate normal approximation and 1,000,000 iterations (with 30% as burn-ins). Phylobayes 48 v3.3 was run for >100,000 iterations (with 30% as burn-ins) on four different parameter sets: the log-normal autocorrelated relaxed clock model with a uniform (1) or a birth-death (2) prior on divergence times, and the uncorrelated gamma model with a uniform (3) or a birth-death (4) prior on divergence times.
As expected, autocorrelation models (PhyTime and the Phylobayes parameter sets 1 and 3) produced similar results which were different from those of uncorrelation models (the Phylobayes parameter sets 2 and 4) (Supplementary Figure 3E). Anyway, two sets of estimates were largely consistent with each other and the differences were within the acceptable range. Both clock models agreed that the divergence time of two lancelet species should be in the range of 111-130 Myr (mean=120 Myr) (Supplementary Figure 3E). In consistence, the divergence time of two lancelets was estimated to be 112 Myr based on mitochondrial genome sequences 17,56 . And the geological separation time between Atlantic and Pacific oceans (the respective habitats of two lancelet species) is 100-130Myr 57 .
A comparison of the amino acid substitution rates and the divergence times in six selected pairs of species (two lancelets, two worms, two fruit flies, two tunicates, two fishes and human versus chicken) was provided in Supplementary Table 4A.
Furthermore, as shown in Supplementary Figure 6, the distribution of pairwise divergence for all 1:1 ortholgous protein pairs shows that the protein divergence between two lancelets is larger than human versus mouse (62-101 Mya) and human versus sheep (95-113 Mya), but smaller than human versus opossum (125-138 Mya). Therefore, both the amino acid substitution rate and the divergence time of two lancelet species are comparable to those of human versus sheep and human versus oppossum.

Amino acid substitution rates in different chordate lineages
According to the above phylogenetic analysis, tunicates are the fastest evolvers, while cephalochordates have the shortest branch length, and vertebrates fall in the intermediate range ( Supplementary Figure 3A-D). These observations are consistent with previous reports.
Vertebrates went through fast substitution rates in the period between the split of cephalochordates and vertebrates and the separation of jawed and jawless vertebrates. In modern vertebrates, protein evolution is actually in a similar pace as lancelets (Supplementary Table 4B). This is also confirmed in our later analysis (the section "Pairwise orthologous protein divergence" below).
Supposing that the lancelet lineage diverged from other chordates 623 Mya and that the separation of the Florida and Chinese lancelet occurred 120 Mya, then using the distances shown in Supplementary Figure 3C, we can calculate that the amino acid substitution were largely the same before and after the divergence of two lancelet species (Supplementary Table 4B).
Supposing that the vertebrate-urochordate lineage diverged from the lancelet lineage 623 Mya, and that the separation of jawless and jawed vertebrates occurred 420 Mya, then using the distances shown in Supplementary Figure 3C, we can calculate that the amino acid substitution rates before the separation of jawless and jawed vertebrates are 2-4 times higher than that after this point (Supplementary Table 4B).
Moreover, considering that human and chicken diverged 319 Mya, and that the Florida and Chinese lancelets diverged 120 Mya, then using the distance showed in Supplementary Figure 3C, we can calculate that average amino acid substitution rates in human or chicken after their divergence is lower than those in Florida or Chinese lancelets (Supplementary Table 4). This is also confirmed in our later analysis (the section "Pairwise orthologous protein divergence" below).

Protein sequence divergence between the two lancelet species
Pairwise alignments of 11,589 orthologous gene pairs (1:1), which covered at least 60% of the protein length, were used for protein sequence identity analysis (Supplementary Figure 4). These calculations revealed that the mean protein identity between two lancelets is 81.2%, with a median of 84.0%. The sequence identity of approximately 30% of the protein pairs is higher than 90%.

Coding sequence divergence between the two lancelet species
We also converted the protein alignments into the corresponding coding sequence alignments and re-calculated sequence identities (Supplementary Figure 4). The estimated mean coding DNA identity between the two lancelets is 79.5%, with a median of 83.0%. Approximately 16% of the coding sequence pairs have an identity of more than 90%.

Intron sequence divergence between the two lancelet species
Using the above-obtained 11,589 orthologous gene pairs for the two lancelets, we further extracted a set of 23,021 high-confidence orthologous intron pairs. For each intron pair, we performed a pairwise BLASTN analysis (with a penalty of -1 and a cutoff E-value of 10) and recorded the alignment identity and coverage. The results showed that only 1.5% of the intron pairs could produce an alignment covering more than 50% of the intron length (Supplementary Figure 5). A plot of identity against coverage confirmed that the intron sequences of the two lancelets have virtually no similarity (Supplementary Figure 5). In particular, among the 12,533 (54% of the total intron pairs) aligned intron pairs with at least 66% identity, 88% (11,024) could not produce an alignment covering >25% of the intron length.

Protein sequence divergence in different functional categories
To evaluate how protein sequence divergence varies between different functional categories, we classified proteins into functional categories (GO terms) and calculated their mean protein protein identities and d N /d S ratios (Supplementary Table 5). In the "cellular component" class, extracellular and cell membrane-bound proteins are the most divergent, whereas proteins within the nuclei, macromolecular complexes and membrane-enclosed lumens are the least divergent. In terms of molecular function, those involved in signaling and transducer activity evolve at the fastest pace. As to biological process, proteins associated with rhythm, metabolism, cellular component biogenesis and organization are highly conserved; in contrast, proteins related to signaling transduction, growth, immunity and anti-stimulus, reproduction and adhesion are the most divergent (Supplementary Table 5).

Pairwise orthologous protein divergence
In addition to the analysis of core protein-coding gene divergence, we wanted to understand the divergence of orthologous protein pairs between closely related species, which could yield a more complete picture of recent protein evolution. We selected six species pairs for comparison. The soft-masked genome sequences and the complete protein set of opossum and sheep were downloaded from ENSEMBL. We identified 1:1 orthologous protein-coding gene pairs that covered at least 50% of the protein length from each species pair. We then plotted their distance (simple computed by 100-Identity) against the normalized accumulated gene number (Supplementary Figure 6). The following species pairs were evaluated: The results show that the orthologous protein divergence between the two lancelet species is between that of human versus opossum and human versus mouse or sheep (Supplementary Figure 6). The divergence time between human and mouse (61.5-100.5 Myr) is shorter than that between human and sheep (95.3-113 Myr), but the mouse is known to evolve faster than the sheep or human. The divergence time between human and opossum is estimated to be 124. 6-138.4 Myr. Chinese and Florida lancelets are thought to have split 120 (111-130) Mya. We varied the parameters for the analysis but obtained similar results. Therefore, the protein evolutionary rate of lancelets is roughly as fast as, if not faster than, mammals.
By comparing the distribution of protein divergence in six species, we found that lancelets have relatively more divergent protein sequence pairs than human versus other tetrapods. This trend is most clear for human versus opossum: though the protein identity of human versus opossum is on average 3.5% higher than that of the two lancelets (see the above table), they have a similar fraction of protein pairs with an identity higher than 50%. This diversifying pattern is also reflected by the fact that though >90% of genes have homologs between the two lancelets (Supplementary Note 9), only ~50% formed stable orthologous pairs in this analysis.

Supplementary Note 4 Polymorphism within the population Global statistics
To characterize the heterozygosity, a hybrid assembly (version 18) was created from all 454 and Illumina read data using CABOG. The resulting diploid assembly spanned 707 Mb, with a scaffold N50 size of 264 kb, a contig N50 size of 30 kb and a contig sequencing depth of 30x coverage. Allelic relationships within the diploid assembly were reconstructed using HaploMerger 31 . HaploMerger also used the LASTZ and chainNet programs to generate pairwise reciprocal-best alignments for each allele pair. The chainNet alignments (>500 bp) spanned a total of 291 Mb. Approximately 272 Mb of long alignments (>10 kb) were further refined by MUSCLE 58 , of which 182 Mb alignment stretches (>1000 bp) free of both sequencing gaps and overlapping alignment gaps were used for SNP and indel polymorphism analysis.
The mean difference rate between the aligned allele sequences is 13.31%, with 4.02% as single nucleotide mismatches and 9.29% as small indel-caused length differences (indels of size 300 bp; 96.4% were <50 bp and accounted for 4.90% of the length differences). If indels were treated as point differences, the nucleotide substitution rate increased to 4.39% and the indel rate decreased to 0.98%, giving a total mean allelic polymorphism rate of 5.37% more than 50 times the rate in humans but comparable to the rates of the Florida lancelet (~4.0%) and the tunicate C. savignyi (~5.6%) 1,29 .
The alignments were analyzed using a 50 bp sliding window with a step size of 25 bp (Supplementary Figure 4), from which we observed a mean allelic difference of μ=2.67, with a variance of σ 2 =6.84. This level of polymorphism is closer to a geometric distribution (μ=2.67, σ 2 =9.79) than to a Poisson distribution (μ=σ 2 =2.67). Further analyses with window sizes of 100 bp and 200 bp showed similar variation patterns (Supplementary Figure 7-9). Therefore, the polymorphism rate between the two haplotypes is not only high but also highly variable across regions. In particular, 50% of the 50 bp regions contribute only 10% of the polymorphic sites, whereas 20% of the 50 bp regions account for over 50% of the polymorphic sites. From the point of genome assembly, such large regional variation will make the assembly of the lancelet polymorphic diploid genome much more difficult than the assembly of mixed data from two closely related species (e.g., human and rhesus).
According to the coalescent theory, divergence between species usually fits a Poisson distribution 59 , whereas divergence between haplotypes in a freely mixing population of constant size tends to be geometrically distributed. According to this theory, to produce a polymorphism rate of approximately 5.4%, an effective population size of millions of individuals is required (for a mutation rate of 1e-9: N e =θ/(4μ)≈0.054/(4*10 -9 )=13.5*10 6 ; for a mutation rate of 1e-8: N e =θ/(4μ)≈0.054/(4*10 -8 )=1.35*10 6 ). Such a size is immense but possible in the animal kingdom. For example, similar polymorphism rates and distribution patterns have been observed in other marine invertebrates, such as Florida amphioxus 1 and Ciona savignyi 29 .
Indels are common differences between haplotypes and affect one tenth of the alignment length despite an occurrence rate of only 0.98%. The size distribution of polymorphic indels obeys the power law (Supplementary Figure 10), consistent with indels between mammalian genomes 60 . Remarkably, 6-bp indels are excessively more common than expected and contribute the highest total sequence differences among all indel sizes (Supplementary Figure  10).
The two haplotypes also differ by many large indels that span hundreds and thousands of base pairs. We used the original chainNet alignment (the original net file created by HaploMerger) to analyze these large indels. In particular, we examined polymorphic indels 200-1500 bp in length and found a total of 28,652 indel events that contributed to 4.3% of the total alignment length difference ( Figure 10).
We also analyzed the distribution of length of ungapped alignments between haplotypes (Supplementary Figure 11). This analysis revealed that the mean length of ungapped alignments is 95 bp and that, in terms of total length, the most abundant length of ungapped alignments is 36 bp.
A total of 9,490 translocation events (>100 bp and inversions excluded) were detected between the two haplotypes using the chainNet alignments, accounting for over 12.5 Mb (~4.3%) of the alignments. The size distribution of these translocations roughly obeys the power law (Supplementary Figure 12), with 3,087 cases larger than 1000 bp.
We detected a total of 700 inversion events (>100 bp and other translocations excluded) that account for over 2.5 Mb (~0.85%) of the alignments. The size distribution of these inversions roughly obeys the power law (Supplementary Figure 13), with 255 cases larger than 1000 bp.
Finally, in another analysis (Supplementary Note 13), we show that the rate of structural variations (translocations and inversions) within Chinese lancelets is ~10 times higher than that between human and rhesus (~5% sequence divergence between human and rhesus) and ~30 times higher than that between human and chimpanzee (~1.5% sequence divergence between human and chimpanzee).

Large polymorphic indels and transposable element activity
We visually examined 310 random polymorphic indels (>200 bp) and found that 210 could be identified as potential transposable element (TE) insertions or deletions. This visualization also revealed that while many alignment gaps can be readily ascribed to TE activities (Supplementary Figure 14), many others are caused by N-gaps, reciprocal gaps and possibilities not readily explained (e.g., mis-assemblies).
We then created a new chainNet alignment between the two final haploid assemblies. Here, there were 36,859 events of large polymorphic indels (300-10000 bp), affecting 37,868,997 bp of the genome assembly. A comparison with the TE annotation showed that 65-77% (depending on different criteria) of indels are associated with TE activity (Supplementary Table 6).

Synonymous and nonsynonymous polymorphisms in Chinese lancelets
High-confidence pairwise DNA alignments for coding regions were created for each protein-coding gene pair from the two haploid assemblies. Gaps and Ns were removed from the alignments. Alignments of genes were treated separately or as concatenated alignments, depending on the analysis. PAML v4.5 was used to infer the synonymous diversity (dS) and non-synonymous diversity (dN) and their ratios. Both the Nei & Gojobori (1986) method and the Yang and Nielsen (2000) method were used.
There are some difficulties inherent in finding exact, genuine 1:1 orthologous gene pairs: 1) the gene models often fragment into pieces; 2) some gene families underwent multiple tandem duplications that make it difficult to determine orthologous gene pairs; 3) both polymorphic (selected) duplications, deletions and non-functionalization of genes are present in different haplotypes; and 4) there is some fraction of false predictions and false frame calling. We therefore focused on those gene pairs with clear hits to the gene ontology (GO) protein database. One advantage of this procedure is that it allows us to directly assess d N /d S ratios in different functional categories (GO terms).

Supplementary Note 5 Whole-genome re-sequencing of five Chinese lancelets Sample collection and re-sequencing
We collected five additional adult Chinese lancelets for whole-genome re-sequencing and bisulfite sequencing. The procedures of sample collection and DNA isolation was the same as described in Supplementary Note 2, except that here we purified DNA from the whole body without gonads. Samples were collected from two locations: Xiamen and Zhanjiang (near Beihai) that are ~1000 kilometers apart (Supplementary Figure 1; Supplementary Note 1).
Three animals from Zhanjiang were sequenced by the Illumina Hiseq2000 platform (2x 101 bp); approximately 30 G filtered data were generated for each animal (Supplementary Table  8).
Two animals were collected from Xiamen, the same place where the lancelet for the reference genome assembly was collected. These two animals were first sequenced by the Illumina  Hiseq2500 platform (2x 151 bp); approximately 45 G filtered data were generated for each animal (Supplementary Table 8).

Multiple whole-genome alignment of six individual genome sequences
The diploid genome status and the high heterozygosity of the Chinese lancelet genome posed great difficulties for our genome-wide comparison between individual genome sequences. For example, common tools for short-read mapping and SNP calling were designed based on genomes with lower polymorphism rates (e.g., the human genome). To work around this predicament, we used the multiple whole-genome alignment approach.
First, we used the Celera assembler 35 to create a de novo diploid genome assembly for each re-sequenced individual lancelet. The procedure and parameter settings were basically the same as described in Supplementary Note 2, except that we used the BOGART module, which is supposed to handle short Illumina reads better, and we did not perform hierarchical assembly and gap-filling. The obtained diploid assemblies (scaffolds plus degenerate contigs) range from 600-750 Mb, with a scaffold N50 length range between 2 and 6 kb and a contig depth of over 30x coverage. For comparison, we also used the SOAPdenovo assembler 32 for the task, which produced assemblies with smaller contig N50 lengths.
Third, we created six-way multiple whole-genome alignments including the reference genome and the five re-sequenced genomes. Multiple alignments were constructed using TBA (parameters: E, null, P, multic, a guide tree ((bbv18ref, bbe23a, bbe23f), (bbe01, bbe03, bbe06)) and all others as defaults) 65 . Finally, only those alignment blocks containing all six individuals were kept for further study. This procedure is similar to our previous work 66 .

SNP rates, population structure and natural selection
The six-way alignment contains ~50 Mb gap-free and N-free alignments. This is only slightly more than 1/8 of the total genome size. Potential causes responsible for this small set of alignments include: 1) the fragmented re-sequenced genome assemblies and our requirement of at least 1000 bp matches for each pairwise alignment; 2) the repeat-masking; 3) multiple alignment blocks for analysis were required to contain all six individuals and be longer than 200 bp; and 4) the exclusion of gap-containing alignments, N-containing alignments and alignments near the terminals of an alignment block (20 bp).
This six-way alignment contains three individuals from the Xiamen population (including Bbv18ref (the reference genome), Bbe23a and Bbe23f) and three individuals (Bbe01, Bbe03 and Bbe06) from the Zhanjiang population. Our analysis revealed that in this alignment, the SNP rates per nucleotide (=p-distance) between any two individuals were almost the same (Supplementary Table 34). Though Bbe23a and Bbe23f were slightly more similar to the reference genome than Bbe01/03/06, the difference is trivial. We also analyzed protein-coding regions (~3.2 Mb) and obtained similar results (Supplementary Table 9). These findings suggested that the genetic structure within Chinese lancelets is very weak, if not absent, despite their large habitat (Supplementary Note 1), consistent with the analysis of mitochondrial sequences (Supplementary Figure 1B; Supplementary Note 1).
As estimated for the ~50 Mb gap-free and N-free multiple alignments, the average SNP rates between two individuals (i.e., nucleotide diversity) were 4.86%, close to the estimate obtained from the individual used for the reference genome assembly (namely, between the reference genome assembly and the alternative assembly). This rate is dropped to 3.19% when we only counted protein-coding regions (Supplementary Table 9).
We performed a Tajima's Neutrality Test with the protein-coding region alignments (~3 Mb). The segregating sites per nucleotide is 0.031629, the nucleotide diversity E(π) is 0.031808, and Tajima's D is 0.043276. This suggests that under the neutral theory, there has not been a recent bottleneck crisis, and no recent population admixture has occurred in Chinese lancelet populations. The use of all alignment sites for this analysis yielded the same conclusion.
We also performed the d N /d S analysis using the protein-coding region alignments (~3.2 Mb), as described in Supplementary Note 4. This analysis revealed that the d N /d S ratio between individuals is approximately 0.083, i.e., very strong purifying selection (Supplementary Table  35) and consistent with the rate between the two haplotypes of a single individual (Supplementary Table 7).

Identification of transposable elements (TEs) in B. belcheri and B. floridae
Two de novo repeat family identification and modeling packages, RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) and REPET 67 , were used to identify novel repeat families from the diploid genome. RepeatModeler invokes three programs (RECON 68 , RepeatScout 69 and Tandem Repeat Finder 70 ) to identify novel repeats and then combines their results into a non-redundant set of repeat families. RepeatModeler initially reported 1178 candidate families. After manually filtering out protein-coding genes, we obtained a set of 481 novel repeat families. In addition, the TEdenovo pipeline from the REPET package produced a set of 7639 non-redundant clusters of novel repeat sequences.
The same procedure was applied to the genome sequence of B. floridae to obtain a set of candidate repeat families.
The two de novo datasets (B. belcheri and B. floridae) and the downloaded dataset were combined into a single library, which was then used to screen amphioxus genome sequences using RepeatMasker 72 . RMBlast and option -s were used in the homologous search.

Genome-wide profile of amphioxus TEs
Window-based analysis by WinMasker 39 was first employed to estimate the proportion of repetitive DNA sequences in the reference genomes of B. floridae and B. belcheri. This analysis showed that 29.5-30.2% of the reference genomes could be repetitive DNA sequences.
We then used RepeatMasker and the curated TE library to screen the B. belcheri reference genome. According to the search results, satellites, simple repeats, and TEs constitute at least 27.6% of the diploid assembly. In the reference assembly, TEs comprise of 26.9% of the genome. The discrepancy between the WinMasker results and the RepeatMasker results suggest that there are some unknown repetitive sequences not identified by RepeatMasker. Notably, DNA transposons (12.7%) are more abundant than retrotransposons (10.3%), which is different from mammals but similar to some invertebrates (Supplementary Figure 15).
The same procedure was applied to the B. floridae reference genome. Its composition pattern largely recapitulates that of the genome of B. belcheri, with ~26.6% TE content in total, 12.6% for DNA transposons and 9.5% for retrotransposons (Supplementary Figure 15), hence suggesting a conserved trend for TE evolution in the amphioxus lineage. Note that the genome sizes of the two lancelets differ by 15-20%.

The composition and classification of amphioxus TEs
We proceeded to identify and classify the lancelet TEs into superfamilies. However, because the lancelet TE content is a very complex entity, here we chose to mainly focus on TE families that have homologs in other species and encode proteins that are necessary for TE mobilization. In other words, non-autonomous TEs, including miniature inverted repeat TEs (MITEs) and short interspersed nuclear elements (SINEs), were not considered in this study if they do not clearly belong to any protein-coding TE families. In addition, though at least 4-7% of the genome content could consist of unknown TEs (Supplementary Table 10), the identification of novel TEs or lancelet-specific TEs was not our purpose here.
Though TEs account for 26-30% of the contents of both the Florida and the Chinese lancelet genome assemblies, only a few intact (complete) or nearly intact autonomous (protein-coding) TEs could be identified. Most TEs presented as fragments, overlapped or nested within each other and containing defective coding regions or completely devoid of coding sequences. As a side effect, this phenomenon not only makes it difficult to identify TEs but makes TE numbers and TE total length counting less reliable. Three possibilities may account for this phenomenon.
1. Errors in the genome assembly. However, this is unlikely to be the major reason because the phenomenon occurs in both the Florida and the Chinese lancelet genome assemblies, and the Florida lancelet assembly was created using the traditional Sanger method, and the Chinese lancelet assembly was sequenced to a depth of 100x coverage (containing both 454 reads and Illumina reads) and showed good continuity (contig N50 size=46 Kb). In fact, even in the well-resolved regions (no gaps, no N, no simple repeats) in the current lancelet assembly, intact TE copies are still rare. Our assembly's contig N50 size (~46 kb) is better than that of many non-polymorphic genome assemblies. The phenomenon thus appears not entirely attributable to the assembly quality. 2. Another possibility is that intact TEs hid in the "dark" regions of the genome, e.g., the heterochromatin and the regions crammed with TEs and simple repeats; these regions are intractable to the whole-genome shotgun strategy. We estimated that the dark matter could include 442 Mb-(426 Mb+416 Mb)/2=21 Mb. 3. A third possibility is that all of the TE content in lancelets is in fact produced by a few active intact autonomous TEs.
In Florida and Chinese lancelets, we, respectively identified 1,233 and 1,087 TEs containing complete or partial TE proteins. An analysis of the protein architectures of these elements let us unambiguously identify 19 In an analysis of large polymorphic indels, we identified DNA transposons that encode the recombination activating gene 1 and 2 (RAG1 and RAG2).
In total, we identified at least 40 TE superfamilies (18 retrotransposons and 22 DNA transposons) that are all conserved in the B. belcheri and B. floridae genomes (Supplementary Table 10) and could be found in other species. It is also apparent that no TE members or families underwent drastic expansions or contractions (Supplementary Table 10), as previously reported in B. floridae 73 . Taken together, lancelets have a higher TE diversity than vertebrates and other invertebrates. However, there are certain compositional differences in different TE superfamilies between the two lancelet species (Supplementary Table 10).
In addition to the 40 high-confidence superfamilies, by comparison with RepBase, we also detected some small DNA fragments with weak homology to other TE superfamilies, including Ambal, CRE, RandI, Proto1, Kiri, R4 and Tad1.

Expression of TE protein-encoding transcripts
We assembled ~300 million EST reads or read pairs using Cufflinks and Trinity (Supplementary Note 8). The assembled transcripts were compared with our curated TE protein set using BLASTX with a cutoff expectation value of 1e-5. Under these cutoff criteria, there were only four superfamilies with no detectable expression: ERV, Copia and Novisib (these three have no detectable protein coding sequences in the current Chinese lancelet genome assemblies) and ISL2EU. When we lifted the cutoff criteria to 40% coverage and 40% identity, a total of 1,251 TE protein-encoding transcripts from 28 superfamilies were identified. At 50% coverage and 50% identity, a total of 757 TE protein-encoding transcripts were identified, distributed in 26 of the 40 TE superfamilies (Supplementary Table 10). At 60% coverage and 60% identity, 411 transcripts from 26 TE superfamilies were identified.
We also attempted to identify retrotranscriptases and transposases from the reference haploid genome of B. belcheri (version 18) using RPS-BLAST with all retrotranscriptase/transpose Pfam domains. Under the cutoff E-value of 1e-5 and with at least 55% coverage and 50 amino acids, we obtained 2,300 retrotranscriptase gene fragments and 415 transposase gene fragments. Comparison of these gene fragments with the raw EST genome mapping data, we found that 68% of the retrotranscriptase gene fragments and 73% of the transposase gene fragments were transcribed.

Relationship between polymorphic indels and TEs
We identified 36,859 large polymorphic indels (300 bp and 10,000 bp) between the two Chinese lancelet haploid assemblies. These indel sequences were compared to the curated TE library using BLASTN. At an E-value of 1e-10 and a coverage of 30%, 28,481 (77%) indels could be ascribed to TE insertions, whereas at an E-value of 1e-10 and a coverage of 50%, 23,836 (65%) indels could be ascribed to TE insertions (Supplementary Table 6; Supplementary Note 4). Further analyses showed that only three TE superfamilies have no representative in these indel sequences: Merlin, Novosib and ERV. This analysis also led to the identification of the long-lost, legendary RAG transposon.

Supplementary Note 7 Genome rearrangement
The amphioxus genome has been shown to share deep conservation of global architecture with vertebrate genomes 1 . To understand the pattern of the evolution of the amphioxus genome architecture, we compared genome rearrangements in eight species pairs.

Orthologous gene/protein families for each species pair
We selected seven pairs of species for gene (protein)-based genome rearrangement analysis (Supplementary Table 12). Genome sequences, proteins and GFF files for B. floridae were downloaded from the JGI website (http://genome.jgi-psf.org/Brafl1/Brafl1.home.html). Data for the other species were downloaded from ENSEMBL (http://www.ensembl.org/), release 64. Orthologous gene families for each pair of species were identified using a modified reciprocal best hit (RBH) method similar to the protocol previously described 1 .
First, for each pair of species, A and B, all-against-all reciprocal BLASTP was performed on all protein sequences for both directions (species A to species B and vice versa). For a gene with multiple protein variants, all variants were subjected to BLASTP but only the best hit among all variants was selected to represent the gene. Segments of alignments between two genes were concatenated, and the cutoff criteria were set to 60% identity and 40% coverage.
Second, orthologous gene pairs between each species pair were identified using the RBH method. If the best hit of gene A1 in species Third, we used a C-value of 0.7 to include the second best hit. For a pair of genes S(A1,B2), the C-value is calculated as follows: C(A1,B2)=S(A1,B2)/max(S(A1),S(B2)). If C(A1,B2)>0.7 and S(A1)>S(B2), B2 will join with A1 and B1 to form a larger orthologous gene group. This process was continued until there was no more new joining.

Oxford grams
An Oxford gram showing gene rearrangements was drawn according to the orthologous gene relationships. The X-axis shows the position of genes from species A, and the Y-axis shows the position of genes from species B. Each gene pair in the orthologous gene groups corresponds to a point in the Oxford gram. Note that a gene may have multiple points to show its second-best hits.
Because the draft genomes of B. belcheri and B. floridae are only available at the scaffold level, we used a method to cluster the orthologous scaffolds. For each pair of scaffolds (FA1 in species A, FB2 in species B), each orthologous gene pair (A1,B1) was assigned a pair of values (a1,b1): a1=1 if A1 is in FA1 and 0 otherwise; b1=1 if B1 is in FB1 and 0 otherwise. A Fisher's exact test with Bonferroni correction was applied on all pairs of values to generate a p-value for each pair of scaffolds or chromosomes. The dissimilarity is defined as -log(P), where P is the p-value for the pair of scaffolds. The scaffolds were then bidirectionally clustered using a hierarchical cluster method, implemented by the function 'hclust' in R (http://r-project.org).

Double cut and join (DCJ) distances
Genome rearrangement events can be measured using the edit distance, which is defined as the minimum number of rearrangement events necessary to transform one genome into another. Double cut and join (DCJ) distance and its efficient calculation were introduced by Yancopoulos (2005) 74 . DCJ distance differs from other distance metrics in that it includes chromosomal fusion, fission, inversion, translocation, and block interchange in a single model and allows simpler algorithms for calculations.
The orthologous gene pairs for each species pair were used to infer DCJ distances. The calculation followed the standard algorithm 74 and was implemented in our software AliquotG 75 (http://mosas.sysu.edu.cn/genome/download_softwares.php#). For the amphioxus genomes, only those scaffolds containing >30 genes were used for calculation.

Patterns of gene rearrangement in amphioxus
The urochordate pair, C. intestinalis and C. savignyi, shows the most drastic changes in genome architecture, with a DCJ distance up to 0.4 (Supplementary Table 11), whereas the human-chicken pair and the fish pair show the lowest genome rearrangement rates relative to their protein divergences (Supplementary Table 11). The amphioxus pair, the worm pair (C. elegans and C. briggsae) and the fly pair (D.melanogaster and D. mojavensis) show similar DCJ distances and protein divergences (Supplementary Table 11), suggesting that the genome rearrangement rate of the amphioxus lineage is similar to those of the protostome invertebrates. Because both amphioxus genomes were separated into hundreds of scaffolds, the rearrangement rates for the amphioxus lineage could be overestimated. However, the scaffold number used for amphioxus is approximately (186/2+195/2)=191, or one tenth of the total rearrangement events observed between the two lancelets (Supplementary Table 11), suggesting that the overestimation will not be more than 10%.
DCJ distance does not discriminate between large-scale and small-scale rearrangements. Large-scale rearrangements, including chromosome fusion, fission, and genes translocating to a distant site (e.g., another chromosome), often tend to shatter the original gene synteny, whereas small-scale rearrangements usually scramble the local gene order and hence leave syntenic relationships maintained on a large scale. Therefore, we further visually compared the syntenic relationships of these closely related species pairs by plotting their chromosomal homology on Oxford grids ( Figures S16-21).
The results show that rearrangements in worms and fruit flies are highly restricted within chromosome arms, whereas in urochordates, both large-scale and small-scale rearrangements are common. Vertebrates have low rates of rearrangements, but a substantial number of genes have translocated outside their original chromosomes.
Through Oxford grams and hierarchical clustering of scaffolds, we observed that the rearrangement pattern in the amphioxus lineage is more similar to that of worms and fruit flies than those of vertebrates: more rearrangements are restricted within a small scale. This pattern partly explains why amphioxus still shares a high degree of synteny conservation with vertebrate genomes, despite their divergence time of over 550 million years.

Vertebrate rearrangement rates slowed down after the 2R-WGD
We next wanted to estimate how many DCJ gene rearrangement events occurred in vertebrates after the two rounds of whole-genome duplication (2R-WGD) in the early evolution of this lineage. This problem can be addressed by solving the genome aliquoting problem with double cut and join metrics 76 . We have developed an improved heuristic algorithm (i.e., AliquotG) for the genome aliquoting problem for 2 rounds of duplication (R=4) 75 , but AliquotG version 1 can only handle those genes with all four ohnologs (duplicates from a whole-genome duplication) retained, i.e., AliquotG cannot use genes with only 2-3 ohnologs left. Here we developed an upgraded AliquotG algorithm (version 1.5, available on demand), which can use genes with 2-4 duplicates in a single genome to infer the pre-2R-WGD gene order.
The procedure is similar to the previous report 75 . Briefly, all proteins (including variants) from a vertebrate were BLASTed (the RBH method described above) against themselves and against the lancelet proteins (E-value: 1e-5; -F "m S"). The lancelet proteins were used to identify the 1:2, 1:3 and 1:4 paralogs. These paralogs were used to estimate the number of rearrangement events that have occurred since the 2R-WGD. Three vertebrate species were analyzed: human, mouse and chicken. Zebrafish and other teleost fishes were not used because they underwent another lineage-specific whole-genome duplication that aliquotG cannot handle. The rearrangement distances between human and mouse, mouse and chicken and human and chicken were also calculated.
These DCJ rearrangement rates were then used for distance tree reconstruction with the Neighbor-join method (Supplementary Figure 22). Considering that human and chicken diverged 312-33 Mya, human and mouse diverged 61-102 Mya, and the 2R-WGD event happened 450-500 Mya, we estimate that the relative rearrangement rate between the 2R-WGD and the human-chicken divergence was 4-6 times faster than the rates after the human-chicken divergence (p<1e-10, chi-square test) (Supplementary Figure 22).

Synteny of the Hox and the protoMHC gene clusters in lancelets
As shown above, the lancelet genome displays an average genome-wide gene rearrangement rate (0.23 per gene) close to those of other invertebrates and a local gene order scrambling pattern that is also similar to other examined invertebrates.
However, the local rearrangement rate is highly variable in lancelets. One remarkable example is the Hox gene cluster, which contains 17 genes (Hox1-15 and EvxA and B) and shows no rearrangement between the two lancelet species.
Another notable example is the protoMHC region. The origin of the vertebrate MHC region, or the MHC big bang, represents a critical event in vertebrate evolution and the rise of adaptive immunity [77][78][79] . Here, by comparative analysis of different assembly versions of the Chinese lancelet genome, we identified the entire protoMHC region (in three corrected scaffolds). This region contains 269 MHC-related genes that are conserved between the human and lancelet genomes. Remarkably, though this protoMHC region shares good synteny with the four human MHC paralogous regions, the gene rearrangement rate in the lancelet protoMHC region is as high as 120/269=0.45 per gene, twice as high as the average genome-wide rearrangement rate in the lancelet genomes. In addition to the 269 MHC-related genes, we also used all genes in this region to recalculate the local gene rearrangement distance, or 382/816=0.47. Taken together, there is active local gene order scrambling in this protoMHC region. This active rearrangement was most likely important for the so-called MHC big bang in vertebrates, from which the MHC type I &II molecules and the Ig C1 domain derive 79 .

RNA preparation and sequencing
Due to the small body size of lancelets, we collected RNA samples from multiple individuals. Total RNA samples purified from harvested tissues using the QIAGEN RNeasy plus midi kit were treated with Promega DNaseI and then used for mRNA isolation with the Oligotex mRNA mini kit (QIAGEN). The obtained polyA mRNA samples were analyzed by Nanodrop and Agilent Bioanalyzer 2100 (using RNA Nano 6000 chip) to ensure an OD260/280 above 1.8 and an RIN above 8.5.
For 454 sequencing, three random-primed cDNA libraries (2 from adult bodies and 1 from embryos of various developmental stages) were prepared using random hexamers and the Roche cDNA synthesis system. Sequencing was performed following the GS FLX Titanium protocol and yielded approximately 3 million high-quality titanium reads (Supplementary Table 1).
In addition, eleven cDNA libraries were synthesized using the Truseq TM RNA sample preparation kit and sequenced by an Illumina Genome Analyzer IIx. Eight of the libraries were derived from different developmental stages (from eggs to the adult stage), and the other three were from adult guts challenged by different bacteria. A total of ~291 million high-quality read pairs were obtained (Supplementary Table 1).

De novo transcript assembly
Three 454 titanium reads of expression sequence tags (ESTs) were assembled into ~90,000 non-redundant EST contigs using Newbler. Contigs shorter than 300 bp were discarded. FrameDP 80 was used to correct frameshifts and identified 52,961 contigs with exactly one protein-encoding open reading frame (ORF). These EST contigs were used to assess the completeness of the genome assembly (Supplementary Note 2).
Illumina RNA-seq data were also assembled using Trinity 81 . These data were compared with the genome-based transcript assembly and used for gene identification.

Genome-based transcript assembly
One of the state-of-the-art algorithms for genome-based transcriptome assembly is the combination of Bowtie2 82 , Tophat 83 version 2 and Cufflinks 84 version 2. However, far less than 40% of the 2x 115 bp Illumina read pairs (with unpaired and discordant alignments excluded) could be mapped to the genome using this pipeline, indicating that the pipeline is not tuned for highly polymorphic genomes.
One way to increase the successful mapping ratio is to trim the read length down to 50 bp, which led to over 70% concordant matches. However, this practice gives up virtually all advantages of the long read length.
Another way to accommodate high polymorphism rates is to relax the alignment parameters. First, we tweaked the Tophat parameters: min-anchor-length 8; splice-mismatches=1; genome-read-mismatches=50; segment-length=27; segment-mismatches=3; and read-mismatches=50. We then modified the Bowtie parameters: -L 18 -N 0 -i C, 1, 0 score-min L, -0.6, -1.0 rdg 4, 2 rfg 4, 2 -D 20 -R 3, which increased the alignment sensitivity and allowed for low-scored alignments with more mismatches and indels. Tophat does not relay all Bowtie parameters and does not implement custom Bowtie parameters in the segment search stage; therefore, we had to work around the problem by wrapping up the executable "bowtie2-align" with a shell script that enforced the custom parameters. These tweaks significantly slowed the mapping speed by over 10-fold but did successfully map over 70% of the full-length Illumina read pairs to the genome concordantly. The alignments were fed to Cufflinks for genome-based transcript assembly and reference annotation-based transcript (RABT) assembly.
The statistics of the EST mapping against genome is shown in Supplementary Figure 23.

Ab initio prediction and evidence-based prediction
We aligned the 52,961 EST contigs with exactly one ORF to the haploid assembly version 2 using PASA 85 version 2011_05_20. PASA reported a total of 2883 high-quality full-length transcripts. The ab initio gene finders, Augustus 86 and GlimmerHMM 87 , were trained on this exon set to achieve sensitivity and specificity of 78-81% and 78-81%, respectively.
The protein set of B. floridae was first aligned to the haploid assembly using GenBlastA 88 . GeneWise 88 version 2 was used to refine the initial protein alignments and to predict the corresponding gene structures.
The Bowtie2-Tophat-Cufflinks and GMAP/GSNAP-Cufflinks pipelines were used to create a genome-based transcript assembly for the RNA-seq dataset. Reference-based and de novo-assembled transcripts were incorporated into the Augustus prediction using the Augustus protocol (http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.IncorporateESTs).
Finally, multiple prediction sets, including PASA alignments, protein alignments, Cufflinks alignments, ab initio datasets from Augustus and GlimmerHMM, and RNA-seq-based predictions by Augustus, were combined into a non-redundant gene set using EVidenceModeler 85 version r03062010.

Gene model refinement using RNA-seq
We extracted the intron splice site data from the initial gene set and used them to guide a second round of RNA-seq mapping. The new mapping data and the initial gene set were combined using the Cufflinks protocol for reference annotation-based transcript (RABT) assembly. Finally, the obtained RABT assembly was fed to Augustus for a new round of evidence-based prediction. In this round of prediction, Augustus was allowed to predict evidence-based alternatively spliced isoforms.

Annotation and characterization of the predicted gene set
The refined gene set consists of 30,392 gene models. The use of a large quantity of RNA-seq data helped to predict 7,254 evidence-based alternatively spliced isoforms from 4,399 gene models. These numbers could be an overestimation because of pseudogenes, over-prediction, unrecognized transposable elements, fragmented genes and gene fragmentation at contig or scaffold boundaries. However, we also estimated that at least a thousand protein-coding genes were supported by ESTs but failed to be correctly captured by our prediction pipeline because of overlapping with the other gene models. The basic statistics are provided in Supplementary  Table 13.
Gene models were annotated by homology comparison with other proteins in the InterPro database 89 . Under the default setting, 22,008 proteins (68%) were annotated by InterProScan 90 , of which 18,650 proteins (57%) were assigned at least one gene ontology (GO) term 91 . In addition, 19,537 proteins (60%) were assigned to KEGG pathways 92 (Supplementary Figure  24).
Gene models were also compared with proteins from B. floridae and other model organisms (C. elegans, D. melanogaster, zebrafish, chicken, mouse and human) using BLASTP. Of the 30,392 models, 27,581 models have at least one hit with an E-value of 1e-5, 26,863 models have at least one hit with an E-value of 1e-10, and 25,363 models have at least one hit with an E-value of 1e-20. Finally, up to 18,167 models have high-confidence nominal orthologs in the B. floridae genome.

Intron sizes and genome sizes differ between the two lancelet species
K-mer methods suggest that the genome size of B. belcheri is approximately 442 Mb, consistent with the range of the v18 reference haploid assembly. Therefore, the genome of B. belcheri is 15-20% smaller than that of B. floridae. However, this difference might be an artifact due to several factors. First, the k-mer method provides a coarse estimation, especially for highly polymorphic genomes. Second, the size of the haploid assembly is affected by the completeness and the precision of N-gap size estimation. Third, short read length and higher error rates may cause more repeat collapsing, hence making the assembly smaller. Finally, if allelic redundancy remains in the haploid assembly, it may inflate the assembly size.
We assumed that major size differences between the two lancelet species should lie in the intergenic regions and introns. As the intergenic regions are not easy to compare, we focused on introns. We determined that the mean length of all introns in B. floridae is approximately 400 bp longer than in B. belcheri (Supplementary Table 13), but this is still a very coarse estimation.
We next sought a strict pairwise intron comparison. We first obtained 9,961 highly reliable reciprocal best-hit (RBH) orthologous gene pairs with an identity >60% and a length coverage >60%, from which 3,976 RBH pairs were further filtered because of inconsistent exon-intron configuration. Using these orthologous gene pairs, we identified a set of high-confidence orthologous intron pairs, of which intron pairs containing N-gaps were excluded from further analysis. The comparison revealed that on average, 60% of B. floridae introns are longer than their corresponding introns in B. belcheri. This pattern is consistent in different intron positions (Supplementary Figure 25). Therefore, we concluded that in accordance with a smaller genome size, the intron size of B. belcheri is on average smaller than that of B. floridae (p<1e-16, pairwise t-test).

Chinese lancelets General design
Treatment of genomic DNA with sodium bisulfite (BS) causes massive cytosine-to-adenosine conversions, posing a considerable challenge for accurately mapping short BS-seq reads to the genome 93 . The task is even more difficult for the Chinese lancelet genome because with an average difference of 5% between the reference genome and the bisulfite-sequenced genome, short BS-seq reads simply cannot be correctly mapped to the reference genome.
To overcome this difficulty, we produced both re-sequencing data and bisulfite-sequencing data for the same lancelet individual. We first created a de novo assembly for the diploid genome of the selected individual. We then mapped the short BS-seq reads to this genome assembly using a wild-card bisulfite aligner (only uniquely mapped read pairs were retained) and called methylated cytosines using the default procedure of Bis-SNP 94,95 . Finally, we created a whole-genome alignment between the reference haploid genome and the re-sequenced genome assembly (Supplementary Note 5). This alignment permitted us to project the methylation patterns onto the reference genome.
To provide a biological duplicate and to reveal variation in methylation patterns between individuals, we selected two unrelated adult lancelet individuals for this study. Instead of using certain tissues and organs, we measured the average methylation level of the whole animal body.

Sample collection and whole-genome bisulfite sequencing
The two animals collected from Xiamen for re-sequencing were also used for whole-genome bisulfite sequencing. The procedures of sample collection and DNA isolation were the same as described in Supplementary Note 2, except that here we purified DNA from the whole body without gonads. The purified genomic DNA of these two animals was used for re-sequencing by the Illumina Hiseq2500 platform (2x 150 bp); approximately 45 G filtered data were generated for each animal (Supplementary Table 8 Table 8).

General methylation patterns in lancelet genomes
We obtained ~16-fold read coverage for the two individual diploid genomes (~610 Mb). We assessed the methylation levels in three sequence contexts: CG, CHG and CHH (where H is A, C or T). In both genomes, the overall genome-wide methylation levels were 21% for CG, 0.36% for CHG and 0.37% for CHH (Supplementary Table 14). The methylation level for CGs is similar to those observed in the plant Arabidopsis (24%) and urochordate Ciona intestinalis (21.6%) 96,97 . Of ~31 million callable CG sites, ~30% have detectable methylation (i.e., passed the default Bis-SNP filtering). The level varies from 1-100% (Supplementary Figure 26). Because we used entire animal bodies for analysis, this variation suggests highly differential methylation in different cell types or even in different cells of the same type. Nevertheless, ~55% of mCGs are highly methylated (80-100%) (Supplementary Figure 26).
However, the non-CG methylation was rather weak and may represent false-positive signals.
To evaluate the false-positive methylation rates, we analyzed the unmethylated mitochondrial genome. With an over 1000-fold read depth, the mitochondrial genome shows false-positive rates of 0.29-0.31% (CG), 0.32-0.33% (CHG) and 0.25-0.27% (CHH) (Supplementary Table  14). Therefore, we concluded that the observed non-CG methylation in the two lancelet genomes is too weak to be distinguishable from false-positive rates. There are two explanations for the low non-CG methylation. One is that non-CG methylation is supposed to be absent in the adult body, as observed in human fetal lung fibroblast IMR90 cells 98 , the body of the pufferfish Tetraodon nigroviridis and the muscle tissue of the urochordate Ciona intestinalis 97 . Another explanation is that lancelets might lack the mechanism (i.e., CHG methyltransferase (CMT)) for non-CG methylation.

Methylation patterns in TEs and other functional regions
We analyzed the relative CG methylation level (total methylation level divided by the number of CG sites) and the absolute CG methylation level (total methylation level divided by sequence length) in different functional regions (Supplementary Figure 27).
In contrast to the genome-wide relative methylation level of 21% and the genome-wide absolute methylation level of 11%, intergenic regions showed a relative methylation level of 10% and an absolute methylation level of 5.6%, which we here considered the background methylation level in the lancelet genome.
Coding DNA sequences (CDS) displayed the highest methylation levels (33% relative and 31% absolute). This finding is consistent with observations in vertebrates and plants. It has been proposed that high methylation may prevent aberrant transcriptional initiation within gene bodies 99 . However, the methylation level of introns (~23%) was much lower than that of CDS, and introns contain many fewer CG sites (1 in 25 bp versus 1 in 10 bp in CDS and 1 in 19 bp genome-wide), which is also reflected in the even lower absolute methylation level of introns (9%). This suggests that aberrant transcription could be more frequently initiated from within introns.
The lowest methylation level was observed upstream of gene bodies (=CDS+intron), where transcriptional initiation and protein binding are known to be allowed.
Interestingly, we observed an elevated methylation level in transposable elements (TEs) that was much higher than that of intergenic regions and introns and just second to that of CDS (p<1e-16, chi-square test) (Supplementary Figure 27). This indicates that CG methylation likely plays a role in TE suppression in lancelets. Note that this pattern is very different from the patterns observed in other invertebrates (including the urochordate C. intestinalis and insects), which show hypomethylation in TE regions 97 .

Sizes of proteome and transcriptome
The 30,392 gene models collectively account for a total of ~48 Mb coding DNA sequences (CDS), larger than any other examined vertebrate or invertebrate genome (Supplementary  Table 15). Though this number could be an overestimate of the true gene content because of pseudogenes, over-prediction, and transposable elements (as TE proteins affected 709 gene models), 27,581 and 25,363 models have homologs from other species with E-values of at least 1e-5 and 1e-20, respectively. Up to 18,167 models have nominal orthologs in the B. floridae genome.
To assess how many CDS are supported by ESTs, we mapped all RNA-seq data (~300 million reads or paired-end reads) to the genome using GMAP/GSNAP 40 and the default settings except counting best hits only. This analysis suggested that 70% of the genomic loci could be transcribed (i.e., were covered by at least one EST).
Because the high polymorphism and the relatively short read length (2x 115 bp) for Illumina paired-end reads may cause false alignments, to minimize false positives and obtain a lower bound, in the following statistics, we only considered those paired-end reads that could be mapped to the genome with the correct orientation and distance, or, in other words, concordantly mapped mate-pairs. The new analysis showed that 62% of the genomic loci were covered by ≥1 EST, 50% by ≥2 ESTs, and 34% by ≥5 ESTs. Furthermore, 94%, 91% and 84% of total CDS nucleotides were covered by ≥1, ≥2 and ≥5 ESTs, respectively (Supplementary Figure 23). In terms of CDS number, 95%, 92% and 86% of all CDS sequences were covered by ≥1, ≥2 and ≥5 ESTs, respectively (Supplementary Figure 23). These statistics suggest that a substantial proportion of the predicted genes could be transcribed.
We next divided the Chinese lancelet reference genome into five regions: CDS, the 2000 bp upstream of genes, the 2000 bp downstream of genes, introns and intergenic regions. We determined that 67%, 2%, 20%, 6% and 5% of the EST reads mapped to these regions, respectively (Supplementary Figure 23). Note that we assigned ESTs to a certain region following this priority: CDS, intron, downstream, upstream and then intergenic regions, which clearly biased the counts to CDS and genic regions. However, this analysis confirms the pervasive transcription of the Chinese lancelet genome. Similarly pervasive transcription has been observed in humans (~62% of the genome covered by processed mRNAs) but not in fruit flies 100,101 , suggesting that pervasive transcription could be a chordate-specific feature. Compared with the much higher sequencing depth and better designs for pervasive transcription analyses in humans and fruit flies 100,101 , our observation for lancelets was based on a much smaller RNA-seq dataset (~120× of the genome and ~300 million reads or read pairs restricted to a few tissues and developmental stages [Supplementary Table 1]). Because this smaller dataset covers only 50-70% of the lancelet reference genome, we speculate that lancelets might have an even more pervasive transcription pattern than humans.
Notably, the total CDS length of the draft genome of the Florida lancelet is ~8 Mb smaller but still larger than that of other invertebrates and vertebrates (except zebrafish). By searching the non-coding regions of the Florida lancelet reference genome with orthologous proteins from both lancelets, we identified an additional ~15 Mb coding sequence fragments (though some could be pseudogene fragments); therefore, we believe that the smaller proteome size of the Florida lancelet is attributable to assembly errors, under-prediction, lack of sufficient EST evidence, and the lower completeness of the draft genome of the Florida lancelet.

Pfam domain catalogs
We compared the Pfam domain catalogs of 16 species. We observed that many novel domains were only present in certain protein isoforms. Hence, for a gene with multiple protein isoforms available, all sequence isoforms were used in this study.
All deduced B. belcheri proteins from the gene models were compared with the Pfam database 102 Table 16). This domain type number and total sequence length are much higher than those of other known invertebrates (Supplementary Table 16). The total sequence length is also greater than all examined vertebrates (human, mouse, chicken, Xenopus, pufferfish) except zebrafish (Supplementary Table 16). The zebrafish has more domain sequences because it underwent an extra round of WGD during its early evolution and retains many duplicated gene copies from that WGD.
As some potential domains may not be covered by the Pfam-A dataset, we included the Pfam-B dataset for further analysis and discovered similar patterns (Supplementary Table 17). Notably, this analysis demonstrated that approximately 22,927 predicted genes have at least one detectable domain type (Pfam-A plus Pfam-B).

Ancient domain type preservation and loss
The Pfam database is biased towards vertebrates, particularly mammals, and includes many vertebrate-specific domains. To reduce this bias, we focused on ancient protein domain types, which in this study refer to the domain types present in any of the following eight invertebrates: the sea anemone Nematostella vectensis, the worm C. elegans, the insects D. melanogaster and Anopheles gambiae, the sea urchin Strongylocentrotus purpuratus, the oyster C. gigas, and the urochordates C. intestinalis and C. savignyi. We compared the occurrence of ancient domain types in amphioxus and vertebrates and determined that lancelets preserved more ancient domain types than vertebrates (at the cutoff E-values of 1 and 1e-5; Supplementary Table 18).
We also queried how many ancient domain types are preserved in lancelet species but lost in vertebrate species and vice versa. We first compared lancelets with the mouse and human lineages and found ~193 ancient domain types preserved in the amphioxus lineage but lost in the mouse and human lineages. In reverse, ~112 ancient domain types were lost in the amphioxus lineage but retained in the mouse and human lineages. We then extended the comparison to six representative vertebrates, including the pufferfish, zebrafish, Xenopus, chicken, mouse and human. This analysis revealed that ~144 ancient domain types were preserved in the amphioxus lineage but lost in all six vertebrates. In contrast, ~122 ancient domain types were lost in amphioxus but preserved in at least one of the six vertebrates. These domain types are listed in Supplementary Table 19-20 (the default cutoff E-value was applied).

Direct assessment of protein domain diversity
Our analysis showed that the lancelet genomes contain many more Pfam-A domain types than other invertebrates (N. vectensis, C. elegans, D. melanogaster, A. gambiae, S. purpuratus, C. gigas, C. intestinalis and C. savignyi). Moreover, the lancelet genomes contain even more Pfam-A domain types than some vertebrates (Xenopus, chicken and Tetraodon) (Supplementary Table 17). We suspected that the lancelet genomes would indeed have maintained more domain types than vertebrates because according to our analysis of the Pfam-A domain dataset, there were ~460 domain types present in any of the six examined vertebrates (human, mouse, chicken, Xenopus, Tetraodon and zebrafish) but absent in the eight examined invertebrates (the sea anemone N. vectensis, the worm C. elegans, the insects D. melanogaster and Anopheles gambiae, the sea urchin Strongylocentrotus purpuratus, the oyster C. gigas, the urochordates C. intestinalis and C. savignyi, and the two lancelets). After removing these vertebrate-specific domains, we found that the lancelet genomes contained more domain types than any single vertebrate (Supplementary Table 16).
There are two major ways to give rise to a novel domain type: one is for a novel domain type to arise from unstructured protein sequence; the other is that a pre-existing domain accumulates mutations and finally becomes sufficiently divergent to form a novel domain type. The latter accounts for many vertebrate-specific domains, such as the IGV from IG types and the PYRIN domain derived from the Death/CARD/DED domains. In other words, the domain diversity in a proteome can reflect the sequence divergence.
We directly compared domain diversity between human, mouse, zebrafish, ascidians and amphioxus using Blastclust. We extracted domain sequences (all Pfam-A domain types included) and used Blastclust to cluster them. More clusters for a proteome may indicate higher diversity. To minimize artifacts caused by small domains/motifs and fragmented sequences, we used only domain types with at least 60 amino acids and required a protein sequence to cover 55% of the domain length. Two combinations of parameters were used for Blastclust: 1) -L 0.8 (min-coverage>80%) -S 50 (identity 50%) -b T (require coverage on both neighbors); and 2) -L 0.8 -S 40 -b T.
The first result (all Pfam-A domain types, i.e., both ancient domain types and vertebrate-specific domain types) shows that both lancelet species have more domain clusters than human, mouse or zebrafish (Supplementary Figure 28). Because the Pfam-A dataset is severely biased towards vertebrates, to reduce the bias, we re-performed the clustering analysis using only ancient domain types. This new analysis gave similar results (Supplementary Figure 29). Note that we also analyzed the human+mouse dataset, which shows nearly the same number of clusters as those of human and mouse, suggesting that the clustering analysis is quite stable. For B. floridae, we used the diploid assembly instead of the haploid assembly because many domain types are missing in the haploid assembly for B. floridae. This operation nearly doubled the sequence number for B. floridae, but no obvious inflation occurred in the cluster number and thus further confirmed the effectiveness of this clustering analysis.
Taken together, our results suggest that lancelet genomes contain higher protein sequence diversity than those of vertebrates or invertebrates.

De novo identification of novel domains in amphioxus
We reasoned that because the Pfam domain database 104 is biased towards vertebrate proteins, the amphioxus proteome should contain many domains that have not yet been included in the Pfam database. These novel domains can be classified into two groups: one that is conserved in lancelets and other invertebrates and another that is conserved in the lancelet lineage only.
Here we attempted to glimpse of the unknown domain repertoire of amphioxus (mainly focusing on the second group). We used a de novo method to identify novel domains shared between the two lancelet species. We identified all Pfam-A domain sequence segments in the haploid B. belcheri proteome and the diploid B. floridae proteome using HMMER3.0 103 . These segments were removed from the protein set. By doing so, the remaining sequences were also broken down into segments.
The protein segments from the two lancelet species were pooled together and subjected to all-against-all BLASTP with the filter on. The results were used by Blastclust to group homologous segments into clusters. The custom parameters for Blastclust were 50% identity (-S 50) and 80% coverage (-L 0.8) for both sides (-b T). We also required that an acceptable cluster should contain at least 40 amino acids and have 2 representative sequences from B. belcheri and 3 from B. floridae. This method should be effective for the two lancelets because of their divergence time of 100-130 Myr and the fact that the two species have basically no similarity in neutral sequences (Supplementary Note 3 and Supplementary Figure 5). The resulting dataset contains a total of 941 clusters, or candidate novel domain families. CLUSTALW2 was used to create multiple alignments for each cluster.
Each cluster was compared with the NCBI NR database and the Pfam database (Pfam-A+Pfam-B). Among the 941 clusters, 553 hit proteins of other species (E-value>1), 89 to Pfam-A and 213 to Pfam-B. These clusters were excluded from further analysis. In addition, we also removed clusters containing only signal peptides (detected by SignalIP 4.0) and transmembrane regions (detected by TMHMM2.0). After this filtering, we obtained a set of 375 candidate families of novel domains.
Of these 375 candidate families, 138 (with copies in 774 proteins in B. belcheri) were annotated as intrinsically unstructured or disordered protein sequences using IUPred 105 . This fraction (~30%) is consistent with the early observation that >30% of proteins in eukaryotic cells can be classified as intrinsically unstructured 106 . Unstructured protein sequences may function in protein-protein interactions and/or give rise to novel domains 107 . The other 237 candidate families (with copies in 1,070 proteins in B. belcheri) mostly co-existed with other known domains. In Supplementary Figure 30 and 31, we show the positions and related protein architectures for the 20 longest and the 10 largest domain families, respectively.
We should note that the method used here only provided a glimpse of the novel domain repertoire because one can expect that many potential novel domains fail the detection process. For example, novel domains that occur once in the genome would not be reported through this design. In addition, as a reference, in many invertebrate genomes, over 50% of domain types occur only once; in mammalian genomes, this proportion is approximately 40%.

Protein evolution and the immune and stress repertoire
Protein identity between the two lancelet species varies between different functional categories (Supplementary Table 5). The most divergent categories include extracellular components, adhesion, signaling, death and the immune system; these proteins interact with microenvironments and microorganisms and thus could be under strong diversifying/positive selection. The categories reproduction and growth were also among the most divergent. Because lancelets occupy a relatively stable ecological habitat, reproduce by mass spawning and usually live as a large population, we speculate that a major drive for protein divergence in lancelets is the intense intraspecies competition in growth rates and reproductive capability. In line with this, in the Xiamen waters, a habitat shared by Chinese and Japanese lancelets, the two species differ significantly in reproductive behavior (Supplementary Note 1). An analysis of the d N /d S ratios in the Chinese lancelet showed a similar trend: the highest ratios were present in extracellular components, adhesion, signaling, death and the immune system (Supplementary Table 5). Interestingly, the categories of reproduction and growth were not among the top rank of d N /d S ratios but rather showed higher rates in both d N and d S for most of the other categories, suggesting their evolution has indeed accelerated. Overall, these protein divergence patterns are basically consistent with those of vertebrates (e.g., human versus mouse) (Supplementary Table 5).
In Florida lancelets, many protein families display species-specific expansion and diversification 108,109 , as also observed in Chinese lancelets. However, there are substantial differences between the two lancelets in the expansion magnitude, the proportions of orthologous pairs and the protein divergence of different protein families. Here we focused on the immune and stress repertoire that has expanded to comprise over 10% of the lancelet protein-coding genes 108,110 . Because of the limited experimental evidence and vague demarcation, immune/stress families are hereby defined by sequence similarity and may include proteins involved in apoptosis, signaling, adhesion and the extracellular matrix. While many families have similar gene numbers in the two lancelets, others, such as LRRIG, FBG and PGRP, display different levels of expansion (Supplementary Figure 32). Moreover, some families include mostly orthologous genes, some contain mostly species-specific genes, and the others consist of half orthologous and half species-specific genes ("half-half"). At one extreme, transcription factors, kinases, and certain signal transducers and oxidative defense enzymes (e.g., NOX, GPX and PRDX) predominantly consist of orthologous genes. At the opposite extreme, TLR, NLR, SRCR, FBG and CTL contain a large proportion of species-specific genes. In vertebrates, SRCR, FBG and especially CTL proteins are implicated in many functions, such as pattern recognition, effector, stress response, adhesion and the extracellular matrix, whereas TLRs are dedicated innate receptors. In most cases, each vertebrate has exactly one ortholog for every vertebrate TLR lineage 111 . Unlike the situation in vertebrates, TLRs in lancelets include a large proportion (~85%) of species-specific genes. This contrast indicates that lancelet TLRs are neither conventional innate receptors as functionally fixed as vertebrate TLRs nor somatically diversified receptors like vertebrate BCRs/TCRs. We tentatively define lancelet TLRs as a type of "diversified innate receptor". As for the "half-half" catalog, examples include TNF/TNFR, TIR adaptors, Death-fold domain genes and complement-related proteins (e.g., C1q, MASP and CCP). In lancelets, protein divergence is also highly variable across different families (Supplementary Figure 32). If the immune process is divided into sequential phases, the protein divergence shows a fast-narrowing trend from extracellular spaces to nuclei. Taken together, these gene expansions, diversifications, evolutionary dynamics and conservation patterns may collectively provide the necessary plasticity for host defense in the lancelet.  103 . We observed that, especially in vertebrates, many novel domain combinations were only present in certain protein isoforms, which suggests that creating multiple alternative splice isoforms is an important way of generating novel domain combinations. Hence, for a gene with multiple protein isoforms available, all sequence isoforms were used in this study.

Supplementary Note 12 Domain combinations
To suppress artifacts, we attempted to filter non-reliable hits by setting difference cutoff E-values. After multiple tests, we chose to use two values, 1 for a relaxed search and 1e-5 for a stringent search. These methods provided similar results. Any tandem array of the same domain type was compressed into one. For short domains or motifs (usually containing <20 aa), we required at least two consecutive modules or >40 aa hit length to justify the existence of the domain/motif. Signal peptides and transmembrane regions were not considered in this analysis. Finally, we took into account the domain order in a gene, which means that different orders of two adjacent domains were considered two different combinations.

Phylogenetic reconstruction based on domain combinations
At the cutoff E-values of 1 and 1e-5, we identified a total of 12,652 and 10,901 cases of two-domain combinations from the fifteen species, respectively. If the clan mode was used instead, the numbers were 8,993 and 8,271, respectively. In addition, we analyzed three-and four-domain combinations. To determine whether the gain and loss of domain combinations reflects evolution, we converted the presence and absence of the domain combinations of a species into a sequence and then used it for phylogenetic reconstruction with MEGA4 112 and the Maximum Evolution (ME) method.
From these results, we drew several conclusions (Supplementary Figure 33). First, the gain and loss of domain combinations are largely consistent with the evolution of species. The only violation was caused by urochordates (C. savignyi, C. intestinalis), which were clustered with other protostomes due to the short branch length. The short branch length is obviously ascribed to massive gene losses in this lineage. Second, the E-value had little effect on the tree topologies. Third, the domain mode and the domain clan mode yielded similar results.

Gain and loss (or turnover) rates of domain combinations
We used the well-recognized species tree to guide the estimation of turnover rates of domain combinations along different speciation paths. The baseml (which implements the maximum likelihood method) program from PAML 113 was used for this purpose. The simplest model, JC69, was used for calculation. Two E-value settings (1 and 1e-5) and both the domain mode and the domain clan mode were attempted.
The results showed that both the vertebrate ancestor and lancelets experienced elevated turnover rates (long branches). This pattern was later slowed down in modern vertebrates. The results also showed that the turnover rate is much higher in the amphioxus lineage than in other lineages and more than twice that of the vertebrate lineage (Supplementary Figure  34). Furthermore, the difference is even larger for three-and four-domain combinations, which is mathematically expected if lancelets do have elevated domain rearrangement rates.

Novel domain combinations in lancelets
We proceeded to calculate the number of novel domain combinations specifically contained in a given lineage but not found in any other lineage. These numbers were manually counted and marked on the species trees (Supplementary Figure 35). For internal branches, we required that a novel domain combination would be counted only if it could be found in all of its directly linked subordinate branches. For example, for the branch leading to the lancelet, urochordates and vertebrates, a novel domain combination should be present in both the amphioxus branch and the vertebrate-urochordate branches. As an exception, for the branch leading to the six vertebrates, a novel domain combination had to be simultaneously present in the fish group, the mammal group and the chicken-Xenopus group. Note that many domain combinations could arise independently in different lineages rather than be inherited from ancestors, and these combinations should therefore be considered "novel". However, our method would exclude these combinations, leaving only unambiguous lineage-specific novel combinations.
Our results showed that in the early evolution of deuterostomes, chordates and vertebrates, there was a rapid accumulation of novel domain combinations. Both lancelets have two-fold more novel two-domain combinations (or domain pairs) than any of the six vertebrates (Supplementary Figure 35A). The difference is even more prominent for three-and four-domain combinations (Supplementary Figure 35B-C), suggesting a dramatically elevated rate of domain reshuffling in the amphioxus lineage.
We then excluded the vertebrate-specific domain types and recalculated the number of novel domain pairs (Supplementary Figure 35D-F), which resulted in a significant decrease in the vertebrate lineage. Therefore, a large proportion (33-50%) of novel domain pairs in vertebrates were considered "novel" only because of vertebrate-specific domain types.
We then focused on novel domain pairs in lancelets. The lancelet B. belcheri has 1,874 domain pairs never found in other examined lineages, of which 638 were shared between the two lancelet species (Supplementary Table 21). As the divergence between lancelets and vertebrates occurred approximately 550 million years ago, and the two lancelets diverged approximately 100-130 million years ago, we inferred that the average novel domain pair gain before the divergence of the two lancelets was 638/(540-100)=1.5 per million years, and the average novel domain pair gain after the divergence of the two lancelets was (1236+1173)/(2*101)=11.9 per million years. The difference in these two rates suggests a very high turnover rates for the newly derived domain pairs, or, in other words, new domain pairs were not only gained quickly but also lost quickly.

The most used domains in novel domain pairs
We further investigated which domains are most actively involved in creating novel domain pairs, or, in other words, the most promiscuous domains in novel domain pairs. The top 50 most promiscuous domains in novel domain pairs in several important lineages are listed in Supplementary Table 22. For all examined lineages, the most promiscuous domains include EGF, Sushi, LRR, IG, Fn, Ank, TPR, and Pkinase. Different lineages also have their own favorable domains, for example, lancelets tend to use Lectin_C, Death/CARD/DED, F5_F8_type_C, and Kringle to form their novel domain pairs.
We found that the novel pairs shared between the two lancelets are 2 to 3-fold more abundant in immunity-related domains than other lineages (Supplementary Figure 36), which is consistent with previous studies 108, 109 . This suggests that a large proportion of the conserved novel domain pairs in amphioxus were produced for host defense purposes. Interestingly, we also found relatively fewer immunity-related domains in those novel domain pairs restricted to one lancelet species (Supplementary Figure 36). A possible explanation is that these species-specific novel domain pairs were newly created by unbiased or less-biased selection of domain types and that natural selection has not yet had enough time to effectively reshape the composition. In addition, these patterns suggest that natural selection plays an important role in shaping the domain combination repertoire.
Among the most commonly used immunity-related domains, lancelets tend to use Lectin_C, Fibrinogen_C, LRR, Gal_lectin and Death-fold domains to create novel domain pairs. SRCR is also frequently used by the lancelet but not as frequently as the sea urchin or the deuterostome ancestor (Supplementary Figure 37). IG domains are most used by vertebrates. However, IG domains are actually the only domain type that is frequently used by all examined lineages (Supplementary Figure 37).
The top 50 promiscuous domains in novel domain pairs were then classified according to their molecular functions (Supplementary Figure 38). The two largest categories are signal transducers and receptors. In lancelets, these two domain categories are used even more frequently to create novel domain pairs (Supplementary Figure 38). In addition, relatively more catalytic domains were also used by lancelets (Supplementary Figure 38).
The top 50 promiscuous domains in novel domain pairs were next classified according to their cellular locations (Supplementary Figure 39). This analysis revealed that amphioxus uses more extracellular domains, whereas vertebrates tend to create novel domain pairs with intracellular domains (Supplementary Figure 39).
Finally, we observed that the common ancestor of chordates, the common ancestor of deuterostomes and the amphioxus lineage used a similar set of promiscuous domains for novel domain pairs, while the vertebrate lineage used a different set. This observation became more evident after we performed a series of Pearson correlation tests on the usage patterns for the promiscuous domain sets between different lineages (Supplementary Table 23). We infer that the amphioxus lineage is more similar to the chordate and deuterostome ancestors in terms of gaining novel domain pairs.

Rearrangement rates at the exon level
The excessive novel domain combinations in the lancelet genomes prompted us to wonder whether excessive rearrangements occur at the sub-genic level but failed to be reflected by the rearrangement rates calculated based on genes.
Because of the lack of independent function and regulatory elements that are usually associated with complete genes, shuffled or rearranged exons are under a very different selection regime from that of rearranged genes and may show disparate patterns.
We first analyzed the rates of exon-level rearrangements in eight species pairs, including human versus chicken, human versus rhesus (Rhesus macaque), pufferfish (Tetraodon nigroviridis) versus stickleback (Gasterosteus aculeatus), C. elegans versus C. briggsae, Ciona intestinalis versus C. savignyi, Drosophila melanogaster versus D. mojavensis, rat versus mouse, and B. belcheri versus B. floridae. The rate of exon rearrangements between the two haplotypes of the B. belcheri genome was also calculated for comparison.
All coding exon sequences, or coding DNA sequences (CDS), were extracted from each genome. BLASTN and the RBH method (described in Supplementary Note 9) were used to identify nominal orthologous CDS sequences for each species pair. Special parameters for BLASTN included "-q 2" and "-F m D". The nominal orthologous CDS pairs for each species pair were used to calculate the number of double-cut-and-join (DCJ) rearrangement events, as described in Supplementary Note 7. To obtain a baseline for comparison, the ORF (open reading frame) sequences were also extracted and used to calculate rearrangements at the genic level.
The results are summarized in Supplementary Table 24. The relative DCJ distances estimated by ORF sequences were not unusual but rather consistent with those calculations based on protein sequences (Supplementary Table 12). On the other hand, a comparison of the two lancelet species revealed thousands of rearrangements at the exon level, many more than any other species pair (Supplementary Table 24).
To distinguish individual exon rearrangements from rearrangements involving entire genes, we subtracted the number of ORF rearrangements from the number of exon rearrangements. This calculation gave an estimate of the number of the rearrangement events that occurred at the exon level. Divided by the total number of coding exons, we obtained the relative DCJ distance contributed by exon-level rearrangements.
Supplementary Figure 40 shows that the relative DCJ distance specific to coding exons is approximately 0.1. This number is 10-100 times higher than two mammal pairs (human-rhesus and rat-mouse), and 3-20 times higher than that calculated for C. elegans and C. briggsae, which have nearly the same divergence time and global gene rearrangement rate as the two lancelet species. The urochordate C. intestinalis and C. savignyi are well known for their rapid genome rearrangements, but the estimated exon rearrangement distance between them was still not comparable to that between lancelets (p<1e-16, chi-square test).
The relatively short length of exon sequences might cause false rearrangement events. To reduce this effect, we filtered the orthologous exon pairs by their alignment length and re-performed the analysis. The relative DCJ distances between the two lancelets were consistent under different cutoff alignment lengths, i.e., 100, 150 and 200 bp. However, for other species pairs, the distance sharply dropped to near zero when filtering was applied (Supplementary Figure 40 and Supplementary Table 24), which can be explained by two mutually compactible possibilities: a) shorter exons are more prone to false positives; and b) rearrangements of longer exons are scarce and even unfavorable in species other than lancelets.
Finally, to assess exon rearrangements at the population level, we compared the two haploid assemblies of the B. belcheri genome. The estimated relative DCJ distance contributed solely by exon rearrangements is 0.022-0.025, much higher than that for human versus rhesus or human versus chimpanzee (Supplementary Figure 40 and Table 24).

Global patterns of exon phase and exon expansion
Based on the phase of the flanking introns, there are nine different types of exon phases, including symmetrical 0-0, 1-1 and 2-2, and asymmetrical 0-1, 0-2, 1-0, 1-2, 2-0 and 2-1. When exon translocations (cut-and-paste or copy-and-paste), exon tandem duplications or deletions occur, exons of symmetrical phase are favored by natural selection because no immediate frame-shifts are introduced. Furthermore, early studies showed that shuffled exons encoding protein domains are significantly biased to the 1-1 phase combination in the human genome 3 .
We analyzed the phase types of internal exons from eight species and found that the internal exons of lancelets are significantly biased to the 1-1 phase type. If restricted to exons encoding Pfam domains and at least 100 bp long, the 1-1 phase type accounts for over 28% of all exons, nearly twice the exon number of the 0-0 phase type and the exon number of the 1-1 phase type in human (Supplementary Figure 41). This suggests that in lancelets, natural selection favors domain exons over non-domain exons for shuffling and expansion. The logical explanation is that domain exons are more "useful" in producing proteomic diversity and plasticity (e.g., diverse domain combinations).
We then examined the composition of domain types that are significantly biased towards 1-1 phased exons (Supplementary Table 25) and 0-0 phased exons (Supplementary Table 26). Lancelets and humans share the same set of common domain types in 1-1 phased exons, suggesting that the pattern of domain type usage in 1-1 phased exons could be an ancient, conserved feature in the chordate phylum. We found that the top 10 most common symmetrical (1-1 phased) domain families involved in domain shuffling in the human genome 3 are also ranked high on the promiscuous domain list (Supplementary Table 22). This suggests that promiscuous domains tend to disseminate via 1-1 phase exons. These families include EGF, Sushi, CUB, VWA, VWC, F5_F8_type_C, PAN_1, MAM, TIG, Kringle and Ldl_recept_a. Notably, nearly all internal IG domains from both vertebrates and lancelets are encoded in 1-1 phased exons, suggesting that the large expansion of IG domains in chordates (especially in vertebrates) is mainly disseminated through 1-1 phased exons.
Finally, it is worth noting that several domain families active in novel domain pairs show no apparent bias in phase types, including LRR, 7tm_1 and P450.

Analysis of exon-level shuffling events between lancelet species
We chose the chainNet method to find confident genomic rearrangements. This method was previously used to identify genomic rearrangements between the mouse and human genomes 43 .
Unlike the RBH method, which intends to find the best hit between individual exon or gene sequences, the whole-genome chainNet method takes into account both non-exon sequences and syntenic information. Hence, the chainNet method generally reports fewer but higher-confidence rearrangements. In addition, the chainNet method is not affected by errors in gene and domain annotations that can occur in draft genomes.
We did not distinguish between cut-and-paste and copy-and-paste mechanisms because both can create novel gene structures and novel domain combinations.
From the pairwise alignments between the two lancelet genomes, we detected 6,782 translocation events (inversions excluded) 100-50,000 bp in length, of which 3,097 events contained coding exons and 1,047 contained domain-encoding exons (Supplementary Table  27). The 3,097 translocations harbor a total of 14,280 exons, of which 10,592 are middle exons and biased towards the 1-1 phase (with 21.0% exons as 1-1 phase versus 18.5% as 0-0 phase). The bias is even stronger for domain-encoding middle exons, with 31.0% exons as 1-1 phase and 17.7% as 0-0 phase.
It is difficult to separate exon shuffling from gene shuffling that happened between two species, especially when the gene-based DCJ distance is as high as 0.23 (Supplementary  Table 12). However, because there should be no mechanistic boundary between the two types of shuffling except for sequence length, we posited that the smaller the translocation size, the more likely it is an exon-level shuffling event. We thus identified all translocation events that contain 10 exons and filtered for exons at least 100 bp long with 80% alignment coverage. We then compared the extent of exon phase bias between different translocation sizes (Supplementary Table 28). These results showed that the smaller the translocation size, the higher the phase bias towards 1-1. For translocations containing single domain-encoding exons, 50% of the exons showed the 1-1 phase combination. Supplementary Table 29 lists the common domain types encoded in these translocations, which are basically the same set of domains actively involved in novel domain pairs (i.e., promiscuous domains; Supplementary Table 22).

Analysis of exon-level shuffling events within the B. belcheri diploid genome
The same chainNet method and parameters were used to create reciprocal-best single-coverage chainNet alignments for the two haploid sequences of the B. belcheri diploid genome (the v18 reference assembly versus the v18 alternative assembly). From the alignments, we identified 6,244 translocation events (inversions excluded), of which 5,713 were within the range of 100-50,000 bp (Supplementary Table 27). Among the 5,713 translocation events, 1,056 events (18.5%) contained coding exons, and 293 (5.0%) contained domain-encoding coding exons.
For comparison, we also identified translocations by applying the same method to two species pairs, human versus rhesus and human versus chimpanzee (Supplementary Table 27).
Human and rhesus have a genomic sequence divergence of ~5%, close to that between the two B. belcheri haploid sequences. The number of translocation events (4,981) between human and rhesus is close to that of the lancelet diploid genome, though they have ~10-fold more total alignable sequences than the lancelet alleles. Moreover, only 310 of the 4,981 translocations contain coding exons, and only 173 contain domain-encoding exons (Supplementary Table 27). These statistics are even smaller between human and chimpanzee, likely due to their more recent sequence divergence (Supplementary Table 27). Based on the statistics presented in Supplementary Table 27, we estimate that the potential exon shuffling rate is over 10 times higher in lancelets compared with humans.
We identified all exons involved in these translocation events and compared their phase bias and composition (Supplementary Table 28 and 29). Shuffled exons in the B. belcheri genome are clearly biased to the 1-1 phase combination, whereas those from the human genome show no such bias.

Association of transposases/retrotranscriptases and micro-translocations
We also looked into the relationship between TEs and micro-translocations in the Chinese lancelet genome.
Using RPS-BLAST and Pfam domains for transposases and retrotranscriptases (55% coverage and an E-value of <1e-5), we identified 415 transposase gene fragments and 2,300 retrotranscriptase gene fragments in the B. belcheri genome. Using the same method, we identified 2,926, 2,861 and 2,416 transposase gene fragments and 20,883, 18,845 and 16,164 retrotranscriptase fragments in the human, chimpanzee and rhesus genomes, respectively. We next assessed whether these TE fragments co-localized with the identified micro-translocations (plus the upstream and downstream 1,000 bp). We found that in the B. belcheri genome, transposases and retrotranscriptases were significantly enriched around micro-translocations (Chi-square test, p<1e-16), whereas such enrichment was not observed between human and rhesus or human and chimpanzee (Supplementary Table 30). This suggests that in the B. belcheri genome, TE activity might play a role in driving micro-translocations.
It has been suggested that in vertebrates, gene fusion is much more important than transposition for domain gains 114 . However, in the Chinese lancelet genome, we observed a high enrichment of transposase (12%) and retrotranscriptase (16%) gene fragments in the translocation regions -10-fold higher than the enrichment in the translocation regions between human versus rhesus (Chi-square test, p<1e-16) (Supplementary Table 30). This is further evidence that TEs might have a more important role in micro-translocations in lancelets.

Methods to identify conserved non-coding elements
We used the reciprocal-best whole-genome alignment method to identify conserved non-coding elements (CNEs) between two genome sequences. The aforementioned LASTZ-chainNet method 43 was used for this task. Unlike the reciprocal-best BLAST method, the reciprocal-best LASTZ-chainNet method takes synteny into account and permits translocations and inversions, which is conservative but increases the search sensitivity.
The obtained CNEs contained cis-regulatory elements, microRNAs and long non-coding RNAs. We did not distinguish between these categories in this study. These three types of CNEs control the timing, quantity, and regions of gene expression as well as post-transcriptional regulation.
We identified CNEs between Chinese and Florida lancelets. For comparison, we performed parallel searches for CNEs in other four species pairs: human versus mouse, human versus opossum, the worms C. elegans versus C. briggsae, and the insects D. melanogaster versus D. mojavensis.

Identification of CNE candidates from five species pairs
Using the method described above, we found that non-repeating, non-CDS, conserved, and reciprocal-best alignments comprised 11% (45.4 Mb) of the lancelet genome, 3% (3 Mb) of the C. elegans genome, 4% (6.7 Mb) of the D. melanogaster genome, and 3.4% (106 Mb; versus mouse) or 1.1% (33.5 Mb; versus opossum) of the human genome (Supplementary  Table 31A). The relative lack of conserved non-coding regions in worms and fruit flies agrees with previous reports. However, it was surprising to observe that the lancelet genome contains such an abundance of CNE contents, not only in terms of relative abundance (11% of the genome) but also in the absolute amount (45.4 Mb)even more than the amount (33.5 Mb) in the human genome, as identified by comparison with the opossum genome. Note that the human genome is six times larger than that of the lancelet. For the record, the divergence of the two lancelets (100-130 Myr) falls between the divergence of human versus mouse (62-100 Myr) and human versus opossum (125-138 Myr) (see Supplementary Note 3). A similar trend was obtained using a less stringent alignment scheme (Supplementary Table  31B).
We then refined the CNE candidate sets by removing sequences shorter than 75 bp, with a sequence identity lower than 70%, adjacency to coding sequences, or blast hits to proteins or rRNAs/tRNAs/snoRNAs, etc. (Supplementary Table 32).