Linear Assembly of a Human Y Centromere using Nanopore Long Reads

The human genome reference sequence remains incomplete due to the challenge of assembling long tracts of near-identical tandem repeats, or satellite DNAs, that are highly enriched in centromeric regions 1. Efforts to resolve these regions capitalize on a small number of sparsely arranged sequence variants that offer unique markers to break the repeat monotony and ensure proper overlap-layout-consensus assembly DNAs 2–4. Identifying and spanning sequence variants that may be spaced hundreds of kilobases away within a given array requires long and highly accurate sequence reads. Achieving this requires an advancement in standard single-molecule sequencing, which to date has been error-prone and offers a low throughput of sufficiently long-reads (100 kb+)5,6. Here we present a strategy that generates long-reads capable of spanning the complete sequence insert of bacterial artificial chromosomes (BACs) that are hundreds of kilobases in length (∼100-300kb). We demonstrate that these reads are sufficient to resolve the linear ordering of repeats within a single satellite array on the Y chromosome, allowing the first complete sequence characterization of a human centromere.

the enrichment of a divergent, AT-rich~171-bp tandem repeat, known as alpha satellite DNA 6,7 .
The majority of alpha satellite DNAs are organized into higher order repeats (HORs), where a chromosome-specific subset of alpha satellite repeat units, or monomers, is reiterated as a single repeat structure hundreds or thousands of times with high (>99%) sequence conservation to form extensive arrays 8,9 . The sequence composition of individual HOR structures and the extent of repeat variation within the context of each chromosome-assigned HOR array are important to establish kinetochore assembly and ensure centromere identity 1,10,11 . Yet, despite the functional significance of the genomic organization and structure, sequences within human centromeric regions remain absent from even the most complete chromosome assemblies 10,12,13 . To date, no sequencing technology or collection of sequencing technologies has been capable of assembling through centromeric regions due to the requirements for extremely high-quality, long reads to confidently traverse informative, low-copy sequence variants within a given array 3,14 . To this end, we have implemented a nanopore long-read sequencing strategy to generate high-quality reads capable of spanning hundreds of kilobases of highly repetitive DNAs. We have focused on the haploid satellite array that spans the Y centromere (DYZ3) as it is particularly suitable for assembly due to its tractable array size, well-characterized HOR structure and previous physical mapping data [15][16][17][18][19] .
We employed a transposase-based method ('1D Longboard Strategy') to generate high-read coverage of full-length BAC DNA with nanopore sequencing (MinION sequencing device, Oxford Nanopore Technologies). This method is designed to linearize the BAC with a single cut-site, followed by addition of the necessary sequencing adaptors (as described in Fig. 1a and Online Methods). The BAC DNA is then read in its entirety through the pore, resulting in complete, end-to-end read coverage of the BAC insert sequence. Plots of read length versus megabase yield revealed enrichment for full length BAC DNA sequences ( Fig. 1b and Supplementary Fig. 1). In total, we generated over >3500 full-length 1D reads that span the entirety of 10 BACs (one control BAC from Xq24 4,5 and nine BACs that mapped to the DYZ3 locus 20 ) with MinION sequencing (Supplementary Table 1).
BAC-based assembly across the DYZ3 locus requires overlap among a few informative sequence variants, thus placing great importance on the accuracy of base-calls. However, individual reads (MinION R9.4 chemistry, 1D reads) provide inadequate sequence identity to ensure proper assembly 4,5 . In our experiments, we observed a median alignment identity of 84.8% for individual reads obtained from a control BAC (Xq24; RP11-482A22). Further, we determined insertions, deletions, and mismatches rates of 3.6%, 4.6%, and 3.4% respectively, which are consistent with previous genome-wide estimates 5 . To improve overall base quality, we derived a consensus from multiple alignments of 1D reads that span the full insert length for each BAC (Online Methods). We found that we were able to improve the consensus quality with modest coverage increase and sampling (multiple alignments from 60 randomly sampled full-length reads, with 10 iterations) (Fig. 1c). Additional polishing steps were performed using re-alignment of all full-length nanopore reads for each BAC to improve consensus sequence base quality (99.2% observed for control BAC, RP11-482A22; and an observed range of 99.4 -99.8% for vector sequences in DYZ3-containing BACs; Online Methods).
To validate satellite sequence variants and to evaluate inherent nanopore sequence biases, we performed Illumina high-coverage BAC resequencing (with coverage median range: 419-1984).
We compared counts of 5-mers between corresponding Illumina and nanopore sequence libraries derived from each BAC. Although the 5 -mer frequency profiles between the two datasets were largely concordant (Supplemental Figure 2a,b), we found that poly(dA) and poly(dT) homopolymers were overrepresented in our initial nanopore read datasets, a finding that is consistent with genome-wide observations 5 . These poly(dA) and poly(dT) over-representations were reduced in our quality corrected consensus sequences especially for 6mers and 7mers (Supplemental Figure 2c,d). In addition to detecting sequence biases, we used Illumina to validate single-copy sequence variants. In doing so, we used Illumina sequence coverage the BAC cloning vector in each dataset, a region on the BAC expected to be single-copy, to confidently identify single-copy sites within the DYZ3 array that are useful for overlap methods of assembly (illustrated for RP11-718M18; Supplemental Fig. 3b,c; Online Methods). After eliminating reads with multiple best alignments, Illumina base coverage profiles were used to determine informative HOR sequence variants. Further, using a k-mer strategy (where k=21 bp) that identified exact matches between the Illumina and each BAC consensus sequence, we observed an average positive prediction value of 95.8. This allowed us to identify and mask all sites not supported by Illumina reads as false positive variants. Finally, standard quality polishing with pilon 21 was applied strictly to unique (that is, non-satellite DNA) sequences on the proximal p and q arms to improve final quality. Alignment of polished consensus sequences from our control BAC from Xq24 (RP11-482A22) and non-satellite DNA in the p-arm adjacent to the centromere (Yp11.2, RP11-531P03), revealed base-quality improvement to >99% identity. Using this strategy, we generated nine BAC sequences with high-quality, illumina sequence validated variants and long-range repeat structure, (e.g. 217 kb for RP11-718M18, Fig. 1d) to guide the ordered assembly of BACs from p-arm to q-arm, spanning an entire Y centromere.
We ordered the DYZ3-containing BACs using 38 single-copy, Illumina-validated variants, resulting in 365 kb of assembled alpha satellite DNA (Fig 2, with variant overlap between BACs spanning the p-arm and q-arm shown in Supplemental Fig. 4). The majority of the centromeric locus is defined by a 301 kb array that is comprised entirely of the DYZ3 higher-order repeat (HOR), with a 5.8 kb consensus sequence, repeated in an uninterrupted head-to-tail orientation without repeat inversions or transposable element interruptions 15,19,22 . The assembled length of the RP11 DYZ3 array is consistent with estimates for 96 individuals from the same Y haplogroup (R1b) (Supplemental Fig. 5; mean: 315 kb; median: 350 kb) 23, 24 . This finding is in general agreement with pulse-field gel electrophoresis (PFGE) DYZ3 size estimates presented in previous physical mapping of the Y centromeric region 15,16,18 . Using a Y-haplogroup matched cell line 25 , we demonstrate concordant PFGE array size estimates across six restriction digests with our RP11 Y centromere length measurement (Supplemental Fig 6).
Pairwise comparisons among the 52 HOR repeats in the assembled DYZ3 array reveal limited sequence divergence between copies (mean 99.7% pairwise identity), as expected for highly homogenized HORs 9,14,17 . Further, in agreement with previous assessment of sequence variation within the DYZ3 array 15,19 , we detected instances of a 6.0 kb HOR structural variant and provide evidence for nine copies within the RP11 DYZ3 array that are, in all but one instance, found in tandem 15 . The variant 6.0 kb DYZ3 units are present in two clusters that are separated by 110 kb, as roughly predicted by previous restriction map estimates 17 . Sequence characterization of the DYZ3 array revealed nine HOR haplotypes, defined by linkage between variant bases that are frequent in the array (Supplemental Table 2; Supplemental Fig. 7). These HOR-haplotypes are organized into three local blocks that are enriched for distinct haplotype groups, consistent with previous demonstrations of short-range homogenization of satellite DNA sequence variants 14,15,26 .
Directly adjacent to the 301 kb HOR array, we identified a brief transition zones on both p-arm (6.1 kb) and q-arm (4.9 kb) where we observed interspersed monomers with high sequence identity (~98-100%) to the canonical DYZ3 HOR unit, yet do not have the same multi-monomeric repeat structure (Supplemental Figure 8). Distal from the HOR array, sequence identity with DYZ3 markedly decreases (80-85%), and we observe a shift in monomer orientation before the satellite junction with the p-arm (Supplemental Figure 8b). We observe accumulation of transposable elements in the divergent satellite at the ends of the HOR array, in support of the accretion model for HOR array turnover 1,27 . The DYZ3 HOR sequence and chromosomal location of the active centromere on the human chromosome Y is not shared among closely related great apes (Supplemental Figure 9a) 28,29 . However, previous evolutionary dating of specific transposable element subfamilies (notably, L1PA3 9.2-15.8 MYA 30 ) within the divergent satellite DNAs, as well as shared synteny of 11.9 kb of alpha satellite DNA in the chimpanzee genome Yq assembly indicate that the locus was present in the last common ancestor with chimpanzee (Supplemental Figure 9b) 28,31 . In conclusion, we have implemented a long-read strategy to advance sequence characterization of tandemly repeated satellite DNAs. Despite their repetitive content, our analysis provides the necessary directed genomic approach to map, sequence and assemble centromere regions. In doing so, we report the array repeat organization and structure of a human centromere on chromosome Y. Complete, haploid resolved linear assembly of centromeric regions, as shown in our analysis, are expected to have evolutionary and functional implications. We expect that this work will be applicable to ongoing efforts to complete of the human genome.  defined by the DYZ3 conical 5.8 kb higher-order repeat (HOR) (light blue), that is observed in a head to tail orientation from p-arm to q-arm, for a total of 301 kb. Nine HOR variants (6.0 kb, shown in purple) have been identified, with all but one identified in tandem. DYZ3 HORs were classified into nine haplotypes using four frequent satellite DNA variants in the array (haplotype (H)1 red, H2 orange, H3 yellow, H4 green, H5 blue, H6 dark orange; H7 purple, H8 dark purple, H9 grey). We identified three predominant blocks: H1 proximal to the p-arm (red), H4 in the middle of the array (green), and H5 adjacent to the q-arm (blue). The genomic location of the functional Y centromere is defined by the enrichment of centromere protein A (CENP-A), where enrichment (~5-6x) is attributed predominantly to the DYZ3 HOR array.