Telomere-to-telomere assembly of a complete human X chromosome

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist1,2. Here we present a human genome assembly that surpasses the continuity of GRCh382, along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome3, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.


Spectral karyotyping (SKY)
Spectral imaging was performed using In laser-scanning confocal microscopes LSM-710 and LSM-780 (Carl Zeiss Microimaging, Jena, Germany). Both microscopes were equipped with a QUASAR detection unit that can acquire with a single scan an entire range of emission wavelengths (in 10 nm increments) for subsequent spectral unmixing. For spectral imaging, 3 excitation laser lines were utilized: 488, 561, and 633 nm. Images were collected with 3 different dichroics: the first passing 488 nm excitation, the second passing 488 nm and 561 nm excitation, and the third passing all 3 laser lines. In addition, a 405 nm laser was used to acquire a Hoechst 33342-stained DNA image for segmentation, with emission collected at ~450 nm. All images were acquired with either a 40× or 63× Plan Apochromat objective (Carl Zeiss Microimaging). Pinhole settings were optimized for background reduction and signal-to-noise ratio. Image processing and karyotyping of the CHM13 line were performed with a set of custom open source ImageJ (NIH, Bethesda, MD) plugins called Karyotype Identification via Spectral Separation (KISS), freely available at http://research.stowers.org/imagejplugins/KISS_analysis.html . Briefly, the plugins perform interactive background subtraction, spectral unmixing, interactive chromosome segmentation, and interactive karyotyping based on dye composition. Chromosome segmentation is performed using a semi-automated method based on the Hoechst image. First, the image is smoothed with a Gaussian blur with a 1 pixel standard deviation and then segmented with a manually chosen fractional threshold and object area limits to eliminate dirt and intact nuclei. Next, chromosomes too close to be separated by thresholding are manually separated. Finally overlapping chromosomes are separated into non-overlapping parts and then linked together for karyotyping. A total of 10 SKY images were evaluated to assess the stability of the CHM13 line.

Supplementary Note 2. CHM13 admixture analysis
We ran the software ADMIXTURE v1.3.0 1 with 10-fold cross validation (CV) on a diversity panel of 1,964 unrelated individuals from the 1000 Genomes Project (1KG, 20140818 release) 2 and Simons Genome Diversity Panel (SGDP) 3 . SNVs were called previously and data from the SGDP and 1KG were prepared by left normalizing variants with bcftools v1.9 4 (to standardize indels), followed by filtering for <10% missing genotypes within an individual and across individuals for a given SNP (<50% missing genotypes across a SNP), minor allele frequency (>5%), and LD pruning with PLINK 1.90 5 . Sites were then converted from GRCh37 to GRCh38 using UCSC liftover. This resulted in a total of ~155,000 SNP sites. Based on the smallest CV error, we determine an optimal value for the K parameter (K=9), but also considered other values (K=6 to K=14) since the CV errors were marginally different. Since CHM13 is comprised of only one haplotype, we used the allele frequencies learned by ADMIXTURE from the diversity panel to assess the ancestry of CHM13 in a supervised manner by using ADMIXTURE to project its ancestry. Missing genotypes were assumed to be reference as CHM13 was not jointly-genotyped with the reference samples. We assigned CHM13 to the cluster which contributed the largest proportion of ancestry and then grouped clusters into super populations according to membership of known populations from the reference cohort. We assessed our inference by visualizing the proportion of ancestry from each cluster for a random subset of 10 individuals from each known population as well as CHM13. While this analysis is preliminary and ADMIXTURE is not optimized to handle data from a genome of a single haplotype, the analysis consistently predicted an admixed haplotype predominantly of European ancestries.

Supplementary Note 3. Library preparation and sequencing
Oxford Nanopore Library preparation and nanopore sequencing was performed as previously described 6 , with the following updates. Generation of ultra-long reads employs the Rapid Sequencing Kit (Oxford Nanopore Technologies, UK) and comprises two steps: tagmentation of DNA by a transposase complex followed by attachment of the sequencing adapter. Previous work was performed using kit SQK-RAD002 which was replaced by SQK-RAD003 in Jun 2017. Testing performed on this kit indicated difficulty generating ultra-long reads was due to a protocol change which doubled the standard input required from 200 ng to 400 ng and a reformulation of the FRM reagent (now called FRA). This protocol resulted in low efficiency libraries when using HMW DNA input. Testing showed that reducing the volume of fragmentation reagent from 5 ul to 1.5 ul and the addition of 0.02% Triton-X100 final concentration could restore library performance. The modifications are included in the 'Ultra-long read sequencing protocol for RAD004' (dx.doi.org/10.17504/protocols.io.mrxc57n) used here. High-molecular-weight genomic DNA from the CHM13hTERT cell line was obtained using a modified Sambrook and Russell DNA extraction method before preparing ultra-long read sequencing libraries using the protocol above. Briefly, 16 μl of DNA from the Sambrook extraction at approximately 1 μg/μl, manipulated with a wide-bore P20 pipette tip, was placed in a 0.2 ml PCR tube, with 1 μl removed to confirm quantification value. 3.5ul EB and 1.5 μl FRA (SQK-RAD004, ONT) was added and mixed slowly ten times by gentle pipetting with a wide-bore pipette tip moving only 18 μl. After mixing, the sample was incubated at 30 °C for 1 min followed by 80 °C for 1 min on a thermocycler. After this, 1 μl RAP (SQK-RAD004, ONT) was added and mixed slowly ten times by gentle pipetting with a cut-off pipette tip moving only 14 μl. The library was then incubated at room temperature for 30 min to allow adapter attachment. Libraries are divided, diluted and incubated for 48 hours (as discussed in updates above). To load the library, 34 μl SQB (SQK-RAD004, ONT) was mixed with 20 μl nuclease-free water, and this was added to the library. Using a P100 wide-bore tip set to 75 μl, this library was mixed by pipetting slowly five times. This extremely viscous sample was loaded onto the "spot on" port and entered the flow cell by capillary action. The standard loading beads were omitted from this protocol owing to excessive clumping when mixed with the viscous library. GridION sequencing was performed as per manufacturer's guidelines using R9/R9.4 flow cells (FLO-MIN106 or FLO-MIN106D, ONT), and controlled using Oxford Nanopore Technologies MinKNOW (version 3.4.5 ) software. The specific versions of the software used varied from run to run but can be determined by inspection of the provided fast5 files. This generated the rel1 dataset. Reads from all sites were copied to the NIH Biowulf HPC cluster, where base calling was performed using Guppy (flip-flop version 2.3.1) to generate the updated dataset (referred to as rel2).

10x Genomics
A linked read genomic library was prepared from one nanogram of high molecular weight genomic DNA using a 10x Genomics Chromium device and Chromium Reagent Kit v2 according to manufacturer's protocol. The library was sequenced on a NovaSeq 6000 DNA sequencer (Illumina, Inc.) on an S4 flow cell, generating 586M paired-end 151 base reads. The raw data was processed using RTA3.3.3 and bwa0.7.12. The resulting molecule size was calculated to be 130.6 kb from a Supernova assembly.
Bionano optical mapping DNA was prepared using the 'Bionano Prep Cell Culture DNA Isolation Protocol'. After cells were harvested, they were put through a number of washes before embedding in agarose. A proteinase K digestion was performed, followed by additional washes and agarose digestion. From this point, the DNA was drop dialyzed and allowed to equilibrate at room temperature for two days. The DNA was assessed for quantity and quality using a Qubit dsDNA BR Assay kit and CHEF gel. A 750 ng aliquot of DNA was labeled and stained following the Bionano Prep Direct Label and Stain (DLS) protocol. Once stained, the DNA was quantified using a Qubit dsDNA HS Assay kit and run on the Saphyr chip.

Hi-C sequencing
Hi-C libraries were generated, in replicate, by Arima Genomics using a modified version of the Arima-HiC kit. Briefly, the current Arima-HiC kit (P/N: A510008) utilizes 2 restriction enzymes for simultaneous chromatin digestion. In the modified protocol, 4 restriction enzymes were deployed to enable more uniform per base coverage of the genome while maintaining the highest long-range contiguity signal, thereby benefiting analyses such as base polishing, scaffolding, and phasing. After the modified chromatin digestion, digested ends were labelled, proximally ligated, and then proximally-ligated DNA was purified. After the Arima-HiC protocol, Illumina-compatible sequencing libraries were prepared by first shearing purified Arima-HiC proximally-ligated DNA and then size-selecting DNA fragments using SPRI beads. The size-selected fragments containing ligation junctions were enriched using Enrichment Beads provided in the Arima-HiC kit, and converted into Illumina-compatible sequencing libraries using the Swift Accel-NGS 2S Plus kit (P/N: 21024) reagents. After adapter ligation, DNA was PCR amplified and purified using SPRI beads. The purified DNA underwent standard QC (qPCR and Bioanalyzer) and sequenced on the HiSeq X following manufacturer's protocols.

10x Genomics whole-genome assembly and validation
The 10x data was assembled with Supernova v2.1.1 using the command run --maxreads=all --id=CHM13 --fastqs=Chromium . This resulted in a 2.95 Gbp assembly with a contig NG50 of 209.7 kbp and scaffold NG50 of 38.5 Mbp for pseudohaplotype 1 and a 2.95 Gbp assembly with a contig NG50 of 209.7 kbp and scaffold NG50 of 38.5 Mbp for pseudohaplotype 2. Prior to optical mapping, 10x Genomics / Illumina data was mapped to the full assembly using Long Ranger v2.2.2 with the options longranger align --jobmode=slurm --localcores=32 --localmem=60 --maxjobs=500 --jobinterval=5000 --disable-ui --nopreflight . Any regions with ≥10-fold coverage were marked as supported. Adjacent supported regions were merged if they were within 500 bp of each other. This list of supported regions was inverted and any unsupported regions within 2 kbp of each other were merged. Finally, the assembly was split at any low-coverage region ≥50 kbp.

Bionano optical map assembly and scaffolding
The raw data was assembled with the Bionano Solve data analysis software. This software generated whole genome map assemblies, along with alignments to the reference sequences. In this case, the CHM13 assembly was aligned with CHM13 optical map. After breaking potential mis-assemblies identified by the 10x data, hybrid scaffolding was run using the optical map data using the command hybridScaffold.pl -n $ASM -b DLE1.cmap -c hybridScaffold_DLE1_config.xml -r avx/RefAligner -B 2 -N 2 -f -o $PWD/scaffold .

Assembly chromosome assignment
The final scaffolds were assigned to chromosomes by NCBI. Briefly, the automated assembly alignment pipeline was used to build a list of scaffold to chromosome mappings. This list was manually reviewed to assign additional scaffolds or re-assign scaffolds to chr Un as necessary. Based on this assignment, there are 6 chromosomes with 90% of their length covered by 2 or fewer contigs (3, 6, 10, 12, 18, X) and 10 chromosomes (3, 5, 6, 8, 10, 12, 17, 18, 20, X) covered by 2 or fewer scaffolds.

Telomere Analysis
We re-used the telomere identification scripts from the vertebrate genome project (https://github.com/VGP/vgp-assembly/tree/master/pipeline/telomere). Windows of 1000 bp where at least 50% of the sequence matches the canonical vertebrate telomere (TTAGGG) in either strand were identified. Overlapping windows were merged and windows within 5kb of the scaffold end marked as putative telomeres. The X chromosome contig had two telomeres at both ends, as expected. The first from 0-1800 bp and and the second from 154,267,200-154,268,800 bp. In total, we identified 41 of 46 expected telomeres in the assembly. The chromosomes with two telomeres on assigned scaffolds are: 1, 2, 4, 6,7,8,11,12,16,17,20, and X. Chromosomes 3,5,9,10,13,14,15,18,19,21 have one telomere on an assigned scaffold, and chromosome 22 has none. The remaining seven telomere arrays were in short contigs which we could not assign to a chromosome. The average telomere length in the assembly is 3,215 bp.
We used the same strategy to identify telomeres in ONT reads >50 kbp and all HiFi PacBio reads. Reads with a telomere within 5 kbp of the read start or end are marked as telomeric reads. Telomere lengths (based on the merged overlapping windows) in both the HiFi (mean=2,448 bp) and ONT (mean=2,368 bp) reads are concordant in size with each other (Extended Data Fig4).

Chromosome X validation and fixes
The assembled optical map was used to call high-confidence structural variants on the entire assembly, including the candidate X chromosome. This identified four structural variants (Supplementary Table 4). These SVs were confirmed by discordantly mapping reads later identified in the rel2 ultra-long dataset. To correct these assembly errors, reads over 100 kb with breaks near the variant site were extracted for each SV, making four sets of reads. Each read set was then assembled separately with default parameters by both Canu 1.8 and Flye 2.4 9,10 . The two assemblers had good agreement and the Flye contigs were aligned to the chromosome X draft and used as patches to replace the incorrect sequence in the original assembly. The patched assembly was once again validated by the optical map, which now reported no discrepancies. PacBio HiFi reads were aligned to the X chromosome and potential repeat collapses identified using a previously described method 11 . This analysis identified the GAGE locus (48.7-48.9 Mbp), cenX (57-61 Mbp), 122 kb segmental duplication containing CXorf49 gene copies (69.5-71.2 Mbp), and CT45 (138.6-139.7 Mbp) as regions of potential collapse (Extended Data Fig7). Manual inspection as well as optical map support suggested these regions were not typical repeat collapses, but residual consensus errors due to uneven polishing of large repeat arrays, which were later resolved using a novel polishing strategy as described below.

Chromosome X long-read polishing
Unique k -mers were identified as those having a copy number in the Illumina read set roughly equal to the expected depth of coverage (between 5 and 58, Extended Data Fig8a) using Meryl 12 from Canu snapshot v1.8 +298 changes (r9508 aab8e5dc15c6b20addccd809c2cc6a62c1fa9c46). In brief, k -mers were counted with meryl count k=21 output 10x.meryl $FASTQ and filtered with meryl greater-than 5 output 10x.gt5.meryl 10x.meryl and meryl less-than 58 output 10x.gt5.lt58.meryl 10x.gt5.meryl . Those k -mers having both the expected copy number in the 10x data and occuring once in the assembled genome were selected as putative unique markers. meryl equal-to 1 output asm_1.meryl [ count k=21 asm.fasta ] was run to collect single-copy kmers in the assembly, and it was intersected with meryl intersect output 10x_asm_single.meryl 10x.gt5.lt58.meryl asm_1.meryl . Reads were mapped using Minimap2 v2.71-941 with the parameters -N 50 -r 10000 -ax map-ont . These parameters increase the number of candidate sites reported for a read and tolerate larger gaps within a read without breaking to better allow correction of larger indels in repeat arrays. The Minimap2 alignments were converted to sequence, replacing any mis-matched or missing bases in the read with Ns, and these sequences were scored using the unique markers and placed in the location maximizing the unique marker matches. This generated a new SAM file with all uniquely placed reads assigned a Phred mapping quality (MQ) value of 60. This SAM was filtered to exclude short CIGAR strings (<50 kb for Nanopore, <10 kb for PacBio), and those below a minimum length / identity threshold (25 kb at 75% identity for Nanopore and 5 kb at 75% identity for PacBio). Racon used the parameters -w 5000 -e 0.2 . Nanopolish v0.11.0 ran with minimap2 -ax map-ont -N 50 -r 10000 for mapping and nanopolish variants --methylation-aware=cpg --consensus --min-candidate-frequency 0.01 --fix-homopolymers for consensus. Arrow v2.2.2 ran with minimap2 -ax map-pb -N 50 -r 1000 for mapping and -x 10 -q 0 -X120 -v --algorithm=arrow for consensus.

Chromosome X mapping and variant identification
The three available long-read technologies (PacBio HiFi, PacBio CLR, Nanopore UL) were mapped to the final chrX contig using the same unique marker based filtering as used for polishing. The coverage distribution matched the expected normal distribution (Extended Data  Fig 8b, UL: mean: 26.08, sd: 6.05; CLR: mean: 44.87, sd: 8.66; HiFi: mean: 22.99, sd: 6.27) with a low fraction of bases above or below 3 standard deviations (0.44% for UL, 0.77% for PacBio CLR, 2.24% for PacBio HiFi). The low-coverage HiFi regions were enriched for low frequency of unique markers (mean spacing 3,342 bp vs 66 for the whole X) which we attribute to their relatively short length and lower coverage.
We ran the variant caller Sniffles 13 v1.0.11 with the command: sniffles --genotype -t 16 -m chrX.bam -v chrX.vcf We then counted the number of variants from each data type with allele frequency ≥75%. As CHM13 is haploid, we expect true errors to have high read support. No variants meeting the threshold were identified in the HiFi or CLR data. 79 variants were identified in the UL data but they were short (mean: 130.0 bp) and enriched for simple sequence repeats or homopolymers (75.93% masked by sdust versus 4.66% in the entire X chromosome) and are likely base calling errors in the UL data.

Assembly quality estimation
We estimated final assembly QV and completeness using previously sequenced CHM13 BACs targeting segmental duplications (VMRC59 library), as well as concordance with the 10x Genomics data. All nucleotide sequences matching VMRC59 with "complete" in the name were downloaded from NCBI. This gave a total of 341 complete BACs. The BACs were mapped with minimap2 with the command minimap2 --secondary=no -ax asm20 -r 2000 and evaluated using the pipeline available from https://github.com/skoren/bacValidation . For 10x Genomics, both Supernova haplotypes were combined and a BAC was considered resolved if either pseudo-haplotype assembly captured it. Out of these 341 BACs, 280 mapped over 99.5% of their length to our CHM13 assembly, which compares favorably to previous assemblies (main text, Table 1 We also estimated assembly quality by measuring concordance of the consensus sequence with mapped 10x Genomics / Illumina data. Using the 10x mapping procedure described above, the bam file was filtered for mapping quality >20 using samtools v1.9 with the command samtools view -hb -q20 . Variants were called using Freebayes v1.3.1 with the command freebayes --skip-coverage 648 asm.bam -v asm.bayes.vcf -f asm.fasta , excluding regions with excessive read coverage (12 x mean = 648) . Calls genotyped as 0/1 (with support for the assembly allele) were filtered out and the total bases changed (added/deleted/substituted) B was summed. Total bases with at least 3-fold and less than 648-fold coverage, T, were also tabulated and the QV computed as , resulting in an average consensus quality estimate of 99.9896% (Q39.83). Note that these FreeBayes parameters are more aggressive and will call more variants than those used for polishing (e.g. the FreeBayes polishing only corrects indels), but this validation is still somewhat circular and we view the BAC validation as more reliable. Using the same criteria, measuring the QV on the X chromosome resulted in 99.9953% (Q43.31).

Supplementary Note 5. Structural variant analysis
To compare our CHM13 assembly to GRCh38 as a reference for calling structural variation, contigs from several human assemblies were aligned to each of the two references with MUMmer version 3.23 14 (l=100, c=500), and structural variants were called using Assemblytics 15 . The four assemblies shown in Extended Data Fig3 are: (1) the maternal haplotype of NA12878 12 (2) TrioCanu assemblies of the maternal haplotypes of the Puerto Rican son HG00733 and the Yoruba son NA19240 16 and (3) a haplotype-phased assembly of a Korean individual 17 . When aligned to GRCh38, the four assemblies yield the following numbers of insertions/deletions: NA12878: 6785/4265, HG00733: 7861/4667, NA19240: 7993/5886, and AK1: 8176/5781. Aligned to the CHM13 assembly, the four assemblies give the following number of insertions/deletions: NA12878: 4129/4345, HG00733: 5018/4898, NA19240: 5707/6578, and AK1: 5657/6113. This excess of insertion calls with respect to GRCh38 exists across a wide size range, and is absent in calls against CHM13.
In addition to insertions and deletions, inversions were called against both the GRCh38 reference and our CHM13 assembly with SVrefine v0.34 (https://github.com/nhansen/SVanalyzer) using MUMmer alignments to the same four assemblies as used to call large insertions and deletions. SVrefine predicts a total of 102 inversions against GRCh38 (NA12878:16, HG00733:35, NA19240:26, and AK1:25), and 41 inversions against the CHM13 assembly (NA12878:5, HG00733:14, NA19240:16, and AK1:6). After merging equivalent calls with SVmerge, we manually curated 63 calls and genotyped them in all four of the assemblies by inspecting alignments to the inverted region of the reference. Assemblies were classified as matching the GRCh38 reference allele (REF), matching the inverted alternate allele (INV), having no coverage across the region (NoCov), or exhibiting alignments that neither match the GRCh38 reference nor the inversion (Complex). Supplementary Table 5 lists all confirmed inversion calls, of which 19 are unique to GRCh38 (possible reference errors) and 1 is unique to CHM13 (possible assembly error).