Centromere evolution and CpG methylation during vertebrate speciation

Centromeres and large-scale structural variants evolve and contribute to genome diversity during vertebrate speciation. Here, we perform de novo long-read genome assembly of three inbred medaka strains that are derived from geographically isolated subpopulations and undergo speciation. Using single-molecule real-time (SMRT) sequencing, we obtain three chromosome-mapped genomes of length ~734, ~678, and ~744Mbp with a resource of twenty-two centromeric regions of length 20–345kbp. Centromeres are positionally conserved among the three strains and even between four pairs of chromosomes that were duplicated by the teleost-specific whole-genome duplication 320–350 million years ago. The centromeres do not all evolve at a similar pace; rather, centromeric monomers in non-acrocentric chromosomes evolve significantly faster than those in acrocentric chromosomes. Using methylation sensitive SMRT reads, we uncover centromeres are mostly hypermethylated but have hypomethylated sub-regions that acquire unique sequence compositions independently. These findings reveal the potential of non-acrocentric centromere evolution to contribute to speciation.

Each leaf has a cluster with >10 monomers; for example, the top leaf labeled with "HSOK chr23 2 15" indicates the 2 nd largest cluster among all clusters in HSOK chromosome 23 with 15 monomers,.
The second top leaf with "Hd-rR chr2 2 137" has 137 monomers and is the 2 nd largest cluster in Hd-rR chromosome 2. We partitioned all clusters into four groups named SF1-4 enclosed in boxes according to the hierarchical clustering, where SF denotes Supra-chromosomal Family. b. From each of the four groups, we selected a representative (Methods). To highlight the difference, we generated a multiple sequence alignment of the representatives using CLUSTAL 2.1. c. The four boxplots illustrate the distribution of sequence identity between each representative of SF1-4 and all monomers in individual chromosomes that had centromeric repeats in its genomic sequence (Supplementary Table 12). The x-axis shows chromosome numbers with approximate genomic positions (metacentric, submetacentric, acrocentric, or subtelocentric) of their centromeres in parentheses, and the y-axis displays sequence identity (in percentage). d. . Red and green dots indicate forward and reverse matches, respectively. Red blocks indicate contigs gaps. We can observe multiple patterns of higher order repeats that are represented by lines parallel to the diagonal in most of chromosomes (Hd-rR chr. 1, 3,6,8,9,12,13,14,18,19,20,21,22,HSOK chr.2,4,7,8,11,12,15,20,23), uncovering broad divergence in higher order repeats. (Bottom) Snapshots of genome browser in centromeric regions. The yellow bars represent Hd-rR contigs, green bars HSOK contigs, and red bars centromeric repeats. Below the track for centromeric repeats, we display tracks for regional methylation prediction from PacBio reads (+, methylated; -, unmethylated), CpG-wise methylation from bisulfite reads, coverage of PacBio reads, coverage of bisulfite reads, and PacBio subreads alignments (red, forward; blue, reverse) by using BLASR. As bisulfite data are unavailable for the HSOK genome, we generated two tracks for the methylation status calculated from PacBio subreads and for PacBio subread coverage at each CpG site. On chromosomes 12, 18, and 20 in the Hd-rR genome, we could not make methylation calls on centromeric repeats from PacBio subreads, and we labelled these regions with "No methylation calls" to distinguish them from unmethylated regions.
We observed hypomethylated centromeric repeats in six Hd-rR chromosomes (chr. 1, 2, 6, 12, 13,   Table 21). The dot plot illustrates the difficulty in distinguishing the underlying sequence compositions of hypermethylated and hypomethylated centromeric regions. c,d. We evaluated the classification by a five-fold cross validation. The data set was randomly partitioned into five subsets, and one subset was used as the test data and the rest were used as the training data. We set k=6 (c) and k=3 (d) for comparison because setting k=6 slightly overfitted the training data. Each curve colored differently represents one classification of five cross validation tests. An average ROCauc of five cross validation shown in the graph implies the presence of HMD-specific k-mers within centromeric repeats in the three HSOK chromosomes. The top 10 sequence motifs of length k=6 and k=3 are put into individual graphs (c and d).
e,f. The evaluation of the classification by using different chromosomes for the test and training data.
Among chr 2, 4, and 23, two chromosomes were used as the training data and the remaining one was used as the test data. k=6 (b) and k=3 (c) were used for comparison. For each curve, the chromosome used as the test data is labelled in the graph legend. An average ROCauc of three classifications is indicated within the graph. Both of true positive and false positive rates are low, implying that the sequence compositions of HMDs differ in individual chromosomes.
g. The table shows all 3-mers with their weights that were used for the prediction in the graphs (d and f). To highlight the difference in the weights, high, medium, and low weights are colored red, white, and green, respectively.

Supplementary Figure 7: Examples of large-scale structural variants.
We categorized mid-sized structural variants (SVs) into three classes:
d. Breakdown of mid-sized SVs that were identified between medaka Hd-rR and HNI strains.
e. An extremely large inversion (>15 Mbp) in chromosome 11 was evident when Hd-rR and HNI were compared. The presence of the inversion was suggested by the Sanger-sequence genome assembly; however, the contigs assigned to chromosomes were not of sufficient length to reveal the boundaries of the inversion. In the present study, when we anchored contigs onto HNI and HSOK chromosome 11, we identified two pairs of contigs that had two sets of distal genetic markers that were separated by ~16Mb while we found no such pairs in Hd-rR, indicating that the inversion had occurred in the Hd-rR lineage. Contigs surrounding the breakpoints of the inversion are associated with their contig identifiers (e.g., 83F and 481F). In the HNI genome, the two breakpoints are located at 7F and 143F, whereas in the Hd-rR genome, one breakpoint lies between 83F and 481F, and the other is between 240F and 138F.
This is partly because the breakpoints lie in the long repetitive regions shown in Figure f.

f.
We determined the inversion breakpoints in focal HNI and HSOK contigs by aligning these contigs with the corresponding region of Hd-rR. Dot plots comparing the four pairs of Hd-rR and HNI regions that contain the two breakpoints of the inversion. The inversion was surrounded by highly repetitive regions of ~200 kb and ~10 kb in size, which were difficult to detect using short read sequencing technology.

a.
We used HSOK as an outgroup of Hd-rR and HNI determine whether a given event was an insertion into, or a deletion from, the focal genome. We mapped (to HSOK) the two 2,500-bp regions upstream and downstream from the positions in HNI (or Hd-rR, respectively) where the insertion/deletion events had occurred in Hd-rR (HNI). We measured the distance between the alignments of the two 2,500-bp regions in the HSOK genome.
b. The histogram shows the frequency distribution of these distances, and exhibits two peaks around 0 and >1000. As the two peaks were thus clearly separated, we classified events using the heuristic whereby events in the peak around 0 were insertions, and those around the other peak were deletions. The peak around 0 is broad because the three strains collected mutations during evolution.

c.
The frequency distribution of lengths of insertions. The schematic picture shows a telomere bouquet, a site on nuclear envelope where telomeres are clustered during meiotic prophase I. Centromeres of acrocentric chromosomes (blue) are brought into proximity, which may facilitate exchanges of centromeric repeats and preserve their sequence composition. By contrast, non-acrocentric centromeres (orange) are likely to be apart from each other, which may accelerate centromere evolution in individual chromosomes.

Analysis of gaps in the assembled genomes
Despite dramatic improvements in genome assembly methods, hundreds of gaps remain in the assembled chromosomes, and we recorded ~1000 unanchored contigs. We attempted to extend contigs to the telomeric regions of individual chromosomes; however, this was successful for only a few chromosomes of each of the three strains. We also approached to the problem of how to sequence centromeres (a challenging issue in vertebrate genomics) and we found a number of contigs bearing centromeric tandem repeats could be anchored to chromosomes. At the moment, some centromeric tandem repeats still remained in unanchored contigs. We tested BioNano optical mapping technology, but the mapping data were partial, and failed to cover most of the assembled genomes. Because many contigs anchored to chromosomes were not connected by BAC-end pairs, longer reads (> 200 kbp) are required to fill the remaining gaps and allow us to identify hitherto unknown long repetitive regions.

Near-identical copies of the innate autonomous transposon Tol2 and the Y-specific region
To demonstrate the comprehensive nature of our current sequences, we examined the distributions of Tol2 element insertions. Tol2 is an example of an early innate autonomous transposon in a vertebrate genome 1 ; Tol2 is 4682bp in length and is estimated to be present in ~20 near-identical copies in the medaka genome 2 . Although no full Tol2 matches were found in the previous Sanger-sequence genome assembly (version 1), we identified 15, 5, and 16 full matches in the new Hd-rR, HNI and HSOK genomes (version 2.2.4); BLASTn showed that all were >99.8% identical to the reference Tol2 sequence, except for two with identities of 99.7% and 99.4%. Supplementary Table 6 shows the locations of all Tol2 sequences and their levels of identity. We also observed copies of a known Tol2 variant with a 117-bp deletion (1793-1909bp) in an intron 3 . We found one such copy in Hd-rR and eight copies in HNI; the latter fact is remarkable because the presence of this variant was thought to occur at low frequencies 3 . In HSOK, we identified a novel variant with a 68-bp deletion (1487-1554bp) in another intron. The identity levels of individual deletion variants were quite high (> 99.7%). We confirmed, by manual inspection, that the three genomes bore Tol2 in different positions.
This means that individual Tol2 copies have only recently been incorporated into each medaka lineage as a result of horizontal transfer after divergence of Hd-rR and HNI 2 .
As another example of the high quality of the new Hd-rR assembly, we examined the Y-specific region in the medaka linkage group 1 (LG1, 33.7 Mb), carrying DMY, the male determining gene, first non-mammalian equivalent of SRY 4 . When LG1 carries an insertion of a 250-kb Y-specific region, it serves as the Y chromosome, whereas LG1 without the insert serves as the X chromosome.
In the new assembly, we obtained a single contig bearing the male-determining gene, DMY, which had mapped to three scaffolds (with gaps) in the earlier Hd-rR genome (version 1) presumably because the Y-specific region carries a number of repetitive elements proximal to DMY 5 . The Y-specific region of chromosome 1 is thought to be a duplicate of a region in chromosome 9 6 .
Indeed, the DMY mRNA sequence from HNI mapped partially to chromosomes 9 of all three genomes (Supplementary Table 7).

Centromere specification is epigenetic but not defined by the underlying DNA sequences.
In non-centromeres, underlying sequence compositions were shown to largely determine hypomethylated domains (HMDs) 7 by using a support vector machine (SVM) equipped with spectrum kernel 8,9 . SVM has been proven to be useful in predicting HMDs 7 and in listing relevant k-mers (strings of length k) that underlie in HMDs (Methods). We hypothesized that relevant k-mers in HMDs of non-centromeres could also characterize HMDs in centromeric repeats, but this hypothesis did not hold true ( Supplementary Fig. 6a,b). We then searched for HMD-specific k-mers within centromeric repeats using HSOK non-acrocentric chromosomes 2, 4, and 23, which had large hypo-and hyper-methylated domains. We indeed obtained such k-mers ( Supplementary Fig. 6c-d), but each chromosome had a unique set of relevant k-mers that were unlikely to occur in the other two chromosomes ( Supplementary Fig. 6e,f,g). This is consistent with centromeric repeats in non-acrocentric chromosomes evolving rapidly and independently of each other, as we have seen in the analysis of centromere evolution. We confirmed the speculation that centromere specification is primarily epigenetic but not defined by the underlying DNA sequences 10 .

Telomere regions
We also sought to characterize telomeres in the medaka genome contigs by using Tandem

Categorization of mid-sized SVs into three categories
We categorized SVs into groups according to three mechanisms of formation, non-homologous end-joining (NHEJ) (e.g., simple insertions and deletions of mobile elements); non-allelic homologous recombination (NAHR) (e.g., genome duplications); and inversion ( Supplementary Fig.   7a-c). The positions of NHEJ, NAHR, and inversion events in contigs are detailed in Supplementary   Tables 14, 15, and 16, respectively. Supplementary Fig. 7d breaks down the numbers of each group and shows that the majority of mid-sized SVs (96.9%) was caused by NHEJ. This led us to focus on this class and determined whether each NHEJ event was an insertion into, or a deletion from, the Hd-rR and HNI genomes using the HSOK genome as an outgroup ( Fig. 4a; Methods). A possible scenario for this centromere evolution is that the orange repeat jumped into the blue repeat by gene conversion to create a basic pattern, and the pattern was duplicated multiple times by unequal crossover.

Transposable elements inserted into regions upstream of TSSs
Transposable elements (TEs) could be responsible for variation in genome size and regulatory circuits 11 . We thus examined whether inserted fragments upstream of TSSs in medaka genomes were derived from TEs. Of these 101 insertions, only 25 matched LTR, LINE, or DNA transposons (Supplementary Fig. 10c; Methods). We then examined if 76 remaining insertions were repetitive by searching the genomes for their occurrences with >90% identity, and 73 were qualified. In spite of incomplete collection of repeats in RepBase, we identified 34 (46.6% of the remaining 73 insertions) as repeats with >10 occurrences.