Nature | Letter Open
De novo assembly and phasing of a Korean human genome
- Journal name: Nature
- Volume: 538
- Pages: 243–247
- DOI: doi:10.1038/nature20098
Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing (ref. 2), next-generation mapping (ref. 3), microfluidics-based linked reads (ref. 4), and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, revealing thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region and demonstrated the allele configuration in clinically relevant genes such as CYP2D6.
This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.
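The N50 values quoted above (contig, scaffold and phased block) can be illustrated with a minimal sketch. It assumes the common definition of N50, namely the length L such that sequences of length at least L together contain at least half of all assembled bases. The function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the common definition: the largest length L such that sequences
    of length >= L together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in Mb (hypothetical, not data from the AK1 assembly).
toy_contigs_mb = [40, 30, 20, 5, 3, 2]
print(n50(toy_contigs_mb))
```

The same function applies unchanged to scaffold lengths or phased block lengths, since only the list of lengths differs.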
Figures
Figure 1: AK1 de novo assembly scaffolds compared to GRCh38. a, Scaffold coverage over GRCh38 per chromosome. The blue shading represents scaffold size, with darker segments for longer scaffolds. Eight chromosomal arms are spanned by single scaffolds. Closed euchromatic gaps are labelled in red on each chromosome, with the total number of gaps in grey. b, Number of gaps closed using the AK1 assembly (blue), local assembly of long reads (light blue), and long reads alone (red). Gaps extended with the AK1 assembly are shown in yellow, gaps extended with long reads in green, and open gaps in grey. The 65 dot plots of gaps closed with the AK1 assembly can be found in the AK1 genome browser (http://211.110.34.36/gbrowse2). c, The AK1 assembly, together with BACs and the optical map, resolves two gaps and suggests that gap_367 and both its edges (red and black bars) shrink to zero, whereas gap_368 expands to 144 kb (yellow bar). d, Three dot plots show how unique sequences have been added to the reference genome. Reference–reference (top left), reference–AK1 assembly (top right) and AK1–AK1 (bottom right). A and B indicate deleted GRCh38 sequence around gap_367.
Figure 2: AK1 SV distribution and Asian-specific variants. a, Distribution of insertions (red/orange) and deletions (cyan/dark blue) between AK1 and GRCh37, compared to SVs identified from previous studies. In total, 47% and 76% of the insertions and deletions, respectively, were previously unreported. b, Allele frequency of 45 Asian-specific insertions (≥0.3 allele frequency difference; ≤0.5 non-Asian allele frequency). The coverage for the genic insertions was calculated from 38 whole-genome high-coverage samples by dividing the read depth by the median genome coverage across individuals with the same ancestry. c, In ANO2, the Asian-specific insertion occurs within an East Asian (EAS) linkage disequilibrium (LD) block, sharing a similar population allele frequency with the adjacent AK1 SNPs. AFR, African; EUR, European; SAS, South Asian.
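The thresholds and the coverage normalization described for panel b can be sketched as follows. This is a minimal illustration under stated assumptions: the allele frequency difference is taken to be Asian minus non-Asian, and the function names and toy numbers are hypothetical rather than part of the authors' pipeline.

```python
from statistics import median


def is_asian_specific(asian_af, non_asian_af, min_diff=0.3, max_non_asian=0.5):
    """Apply the two thresholds quoted in the caption to one insertion.

    Assumption: the 'allele frequency difference' is Asian minus non-Asian.
    """
    return (asian_af - non_asian_af) >= min_diff and non_asian_af <= max_non_asian


def normalized_coverage(read_depth, ancestry_genome_coverages):
    """Divide an insertion's read depth by the median genome coverage of
    the individuals that share the same ancestry."""
    return read_depth / median(ancestry_genome_coverages)


# Toy values (hypothetical): one insertion and three same-ancestry genomes.
print(is_asian_specific(asian_af=0.6, non_asian_af=0.1))
print(normalized_coverage(read_depth=30, ancestry_genome_coverages=[28, 30, 32]))
```

Normalizing by the ancestry-level median makes depths comparable between samples sequenced to different overall coverage.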
Figure 3: Circular visualization of phased blocks with phase-specific expression and two phased regions of MHC class II and CYP2D6. a, Genome-wide map of highly heterozygous regions and expression levels of haplotypes A and B in log scale. b, HLA genes in the MHC class II region. This highly variable, complex region contained many SVs, making it difficult to phase against the reference genome, but it was fully resolved through the de novo approach. For detailed comparison, see Extended Data Fig. 7. c, Both haplotypes of CYP2D6 and CYP2D7. A duplicated copy of CYP2D6 was fused with the last exon of CYP2D7 on haplotype B.
Extended Data Fig. 1: Global overview of data generation and sequencing throughput. Flowchart of the data generation, processing and analysis for the de novo assembly and haplotype phasing of the AK1 diploid genome. *The SMRT platform sequencing throughput is described in Supplementary Table 1. †The number of reads and the sequencing throughput from the Illumina platform are 1,635,192,864 and 249,914,122,464 bp, respectively. ‡The AK1 BAC library was sequenced using Sanger capillary end sequencing (single end: 22,563, paired end: 62,758), Illumina (31,719) and SMRT (307) platforms. §Linked-read data were additionally generated with the GemCode platform to produce 1,153,598,732 reads from high molecular mass DNA with an average insert size of 100 kb.
Extended Data Fig. 2: Length distribution of SMRT subreads and FALCON parameter optimization for assembly. a, The y axis on the left shows the number of subreads with a given length (bin size = 100 bp) on the x axis, whereas the y axis on the right shows the sum of the lengths of subreads longer than or equal to the given length on the x axis. b, Effects of the length cutoff parameters on contig N50 in de novo assembly by FALCON are shown on the right. The contig N50 depends on two parameters related to the amount of error-corrected reads used for the final assembly, length_cutoff and length_cutoff_pr; the former was fixed at 10 kb, whereas the latter was varied from 10 to 16 kb. Black and green lines indicate the changes in N50 for the 72× and 101× sequencing datasets, respectively.
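The two quantities plotted in panel a can be computed with a short sketch: a histogram of subread lengths in 100-bp bins, and the total number of bases in subreads at or above a given length. The function names and toy lengths are hypothetical; this is not the code used to produce the figure.

```python
from collections import Counter


def length_histogram(subread_lengths, bin_size=100):
    """Count subreads per length bin; keys are the lower edge of each bin."""
    return Counter((length // bin_size) * bin_size for length in subread_lengths)


def bases_at_or_above(subread_lengths, cutoff):
    """Sum the lengths of subreads that are at least `cutoff` bp long."""
    return sum(length for length in subread_lengths if length >= cutoff)


# Toy subread lengths in bp (hypothetical).
toy_lengths = [950, 1020, 1099, 12000, 15500]
print(sorted(length_histogram(toy_lengths).items()))
print(bases_at_or_above(toy_lengths, cutoff=10_000))
```

The second function gives the number of bases that would remain after applying a length cutoff, which is the amount of data that a choice of cutoff trades against read length.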
Extended Data Fig. 3: Graphical representation of hybrid assembly and statistics for next generation map and genome map. a, The hybrid assembly approach aligns in silico generated maps from sequence contigs with genome maps. When genome maps bridge two contigs, a scaffold is produced. The comparison is visualized between the genome maps and contigs in the Iris Viewer. b, Examples of edited contigs due to conflicts between the contig and the genome maps. The matches between the in silico map and the genome map are highlighted in red, and mismatches are indicated by absence of the red lines.
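The idea of an in silico map in panel a can be sketched as locating every occurrence of a recognition motif on either strand of a contig and recording the spacing between consecutive sites, which is the pattern that would be compared with a genome map. The motif `GCTCTTC` used here is a hypothetical choice, since the caption does not name the enzyme or motif, and all function names are illustrative.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_map(contig, motif):
    """Return sorted 0-based start positions of `motif` on either strand."""
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(contig) - k + 1) if contig[i:i + k] in targets]


def label_intervals(positions):
    """Distances between consecutive labelled sites."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy contig (hypothetical) with one forward and one reverse-strand site.
toy_contig = "AAGCTCTTCAAAAGAAGAGCTT"
sites = in_silico_map(toy_contig, motif="GCTCTTC")
print(sites, label_intervals(sites))
```

A real comparison must also tolerate sizing error and missing or extra labels in the optical data, which this sketch does not attempt.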
Extended Data Fig. 4: Assessment of assembly accuracy with homopolymer and read depth coverage generated with short reads. a, Distribution of corrections in homopolymers. Pilon mostly corrected single-base deletions in the assembly, and the corrections are enriched in regions with long homopolymer stretches. b, The read-depth distribution against the AK1 assembly, the AK1 assembly with scaffolds ≥500 kb, GRCh37 and GRCh38. As the mean coverage depth of short reads was 72×, a peak appears around this value, representing the autosomal fraction. Another peak appears at ~36×, half of the mean coverage depth, representing the contigs derived from chromosomes X and Y. The fluctuating long tail corresponds to 3 and 4 copies of a haplotype, and is more clearly observed with the AK1 long scaffolds. The overall pattern shows that more SVs are reflected in the AK1 long contigs than in the reference. The short contigs (<500 kb) total only 120.4 Mb, comprising a small fraction of the AK1 assembly. c, Density plot of the homozygously altered allele read depth from long scaffolds (≥500 kb). Most variants are skewed towards low allelic read depth, suggested to be mainly due to sequencing artefacts or mapping bias.
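The depth peaks described in panel b follow from simple arithmetic if read depth is assumed to scale linearly with copy number: a 72× diploid mean implies about 36× per haplotype copy. The sketch below encodes that assumption. The function names are hypothetical, and reading the long tail as 108× and 144× is an interpretation of the caption rather than a value it reports.

```python
def expected_depth(copy_number, diploid_mean_depth=72.0):
    """Expected read depth for a region present in `copy_number` copies,
    assuming depth scales linearly with copy number."""
    per_copy_depth = diploid_mean_depth / 2
    return per_copy_depth * copy_number


def nearest_copy_number(observed_depth, diploid_mean_depth=72.0):
    """Round an observed depth to the closest whole copy number."""
    return round(observed_depth / (diploid_mean_depth / 2))


# One copy (X or Y in this genome), two copies (autosomes), then three and four.
for copies in (1, 2, 3, 4):
    print(copies, expected_depth(copies))
```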
Extended Data Fig. 5: An example of filled sequence that matches perfectly with the patch sequence (KN538365.1). a, One AK1 scaffold (Super-scaffold3_99) closes a 100-bp gap in chromosome 10, reducing the size of this gap to zero while it also removes a 4.7 kb false duplication found left of the gap. This information corresponds perfectly to the GRCh38 fix patch (KN538365.1) sequence covering this region, thus validating our assembly and gap closing accuracy. b, Six dot plots show the comparison between GRCh38, KN538365.1 and the AK1 assembly. The dot plots are organized in the following manner: Reference-reference (top left), KN538365.1-reference (top middle), AK1-reference (top right), KN538365.1-KN538365.1 (centre middle), AK1-KN538365.1 (middle right) and AK1-AK1 (bottom right).
Extended Data Fig. 6: Number of SVs and repeat composition. a, Overall distribution of SVs. By direct comparison between the AK1 assembly and the GRCh37 reference genome, deletion (red), insertion (blue), inversion (green), and complex (grey) variants were detected. The outer pie chart represents new variants for each SV type. In total, 65% (11,927) of the SVs were previously unreported. b, Repeat composition of AK1 insertions and deletions. Both insertions and deletions are mostly composed of mobile elements or tandem repeats. Complex SVs are defined as those having either several annotated repeat elements, or at least 30% of the remaining sequence not annotated as repeat.
Extended Data Fig. 7: MHC class II haplotig alignment on chromosome 6 and dot plots. a, MHC haplotigs A and B aligned on GRCh37 chr6. The complex regions shown in Fig. 3a are marked with green bars. b, Dot plots of haplotigs A and B against the reference genome. The region highlighted in red yields many SVs when aligned to the reference, owing to the different sequence context in the haplotigs. c, Dot plots of haplotigs A and B against the alternative loci (ALT) patches of the MHC region in hg38. Haplotig A was most similar to chr6_GL000255v2_alt for the region highlighted in b. The blank vertical lines indicate ‘N’ bases in the reference ALT sequence.