Efficient hybrid de novo assembly of human genomes with WENGAN

Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24–80.64 Mb), few assembly errors (contig NGA50: 11.8–59.59 Mb), good consensus quality (QV: 27.84–42.88) and high gene completeness (BUSCO complete: 94.6–95.2%), while consuming low computational resources (CPU hours: 187–1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).


WENGAN assemblies of non-human genomes
The commands used for the WENGAN assemblies of non-human genomes are shown. All non-human genome assemblies were performed using 20 CPUs. The Flye assemblies of NA12878 were polished using RACON and NTEDIT: two rounds of long-read polishing with RACON were performed, followed by three rounds of short-read polishing with NTEDIT. The commands executed were the following:
#Polishing of the Flye assembly with 40X Nanopore reads (rel5) and 50X of short Illumina reads.
LG50 is the minimum number of contigs that produce half of the reference length.
LGA50 is similar to LG50, but aligned blocks are counted instead. Assembly errors correspond to the number of positions in the assembled contigs where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference or the two overlap by more than 1 kbp (relocation), or where the flanking sequences align on different strands (inversion) or on different chromosomes (translocation).
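The three assembly-error classes defined above can be illustrated with a minimal Python sketch. It is not the QUAST implementation: the `Flank` record, the function name, the order in which the rules are tested and the assumption that both flanks are given in forward reference coordinates (left flank first) are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flank:
    """Reference alignment of one flanking sequence (hypothetical record)."""
    chrom: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


def classify_breakpoint(left: Flank, right: Flank, max_dist: int = 1000) -> Optional[str]:
    """Classify a breakpoint between two flanking alignments.

    Returns "translocation", "inversion", "relocation", or None when the
    flanks are consistent with the reference under the 1 kbp threshold.
    """
    if left.chrom != right.chrom:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Positive value: gap on the reference; negative value: overlap.
    gap = right.ref_start - left.ref_end
    if abs(gap) > max_dist:
        return "relocation"
    return None


if __name__ == "__main__":
    a = Flank("chr1", 0, 10_000, "+")
    print(classify_breakpoint(a, Flank("chr1", 15_000, 20_000, "+")))  # relocation
    print(classify_breakpoint(a, Flank("chr1", 10_200, 20_000, "-")))  # inversion
    print(classify_breakpoint(a, Flank("chr2", 10_200, 20_000, "+")))  # translocation
```

A breakpoint whose flanks lie within 1 kbp of each other on the same chromosome and strand returns `None` and is not counted as an assembly error.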
Genome fraction (%) is the total number of reference bases aligned, divided by the reference size. The QUAST (version 5.0.2) analysis was run with the options min-identity 80 and fragmented, using the autosomes plus the X and Y chromosomes of GRCh38 as reference ("quast -r GRCh38 chrom no alt.fa -large -min-identity 80 -fragmented"). Additionally, we ran a QUAST analysis using as reference the curated CHM13 Canu assembly (chm13.draft v0.7, 2.9384 Gb) generated by the T2T consortium. Assembly errors overlapping centromeres or segmental duplications of GRCh38 were discounted. A second QUAST run using a minimum alignment length of 50 kb was performed to discount assembly errors overlapping problematic regions (segmental duplications and centromeres) of the curated CHM13 assembly.
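The discounting step described above amounts to dropping assembly errors whose reference position falls inside an annotated problematic region. The following Python sketch shows one way to do this. The function name and the input layout (error positions as `(chrom, pos)` pairs, regions as half-open `(chrom, start, end)` intervals) are assumptions made for illustration, not the procedure used in the study.

```python
def discount_errors(errors, regions):
    """Return the assembly errors that do not overlap any problematic region.

    errors:  iterable of (chrom, pos) reference positions of assembly errors
    regions: iterable of (chrom, start, end) half-open intervals, for example
             centromeres and segmental duplications
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        (chrom, pos)
        for chrom, pos in errors
        if not any(start <= pos < end for start, end in by_chrom.get(chrom, []))
    ]


if __name__ == "__main__":
    errors = [("chr1", 500), ("chr1", 5_000), ("chr2", 500)]
    problematic = [("chr1", 0, 1_000)]  # toy annotation
    kept = discount_errors(errors, problematic)
    print(len(errors), len(kept))  # 3 2
```

Reporting `len(errors)` and `len(kept)` side by side gives the before and after counts. For genome-scale annotations, an interval tree or sorted intervals with binary search would replace the linear scan.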
Assembly errors before and after discounting problematic regions are shown.
("-r chrX.t2t.fa -f one-to-one -q asm.fa -s 10000 -pi 85"). Anchored contigs (Figure 7) were masked using REPEATMASKER version 4.1.0 ("-species human -gff -xm -dir=asm.rm asm.anchored.fa"). The REPEATMASKER report (*.tbl) was used to collect the amount of sequence masked by repeat class in each assembly. The percentages are computed relative to the amount of sequence of each repeat class masked in the curated T2T-X chromosome (v0.7).
NGA50 is similar to NG50, but the lengths of aligned blocks are counted instead of the contig lengths.
LG50 is the minimum number of contigs that produce half of the reference length.
LGA50 is similar to LG50 but aligned blocks are counted instead.
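The contiguity metrics defined here can be computed with a few lines of Python. This is a minimal sketch, not the QUAST implementation; the function name and the toy lengths are hypothetical. Passing contig lengths gives NG50 and LG50, and passing aligned-block lengths (contigs split at assembly errors) gives NGA50 and LGA50.

```python
def ng50_lg50(lengths, genome_size):
    """Return (NG50, LG50) for a list of lengths and a reference genome size.

    NG50 is the length of the piece at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half of the genome size.
    LG50 is the number of pieces needed to get there.
    """
    target = genome_size / 2
    total = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return length, count
    return 0, None  # the pieces cover less than half of the genome


if __name__ == "__main__":
    genome_size = 200
    contigs = [50, 40, 30, 20, 10]
    # Same assembly after splitting the 50-unit contig at one assembly error.
    aligned_blocks = [25, 25, 40, 30, 20, 10]
    print(ng50_lg50(contigs, genome_size))         # (30, 3)  -> NG50, LG50
    print(ng50_lg50(aligned_blocks, genome_size))  # (25, 4)  -> NGA50, LGA50
```

In the toy example a single assembly error lowers the value from 30 to 25 and raises the count from 3 to 4, which is why NGA50 is never larger than NG50.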
Assembly errors correspond to the number of positions in the assembled contigs where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference or the two overlap by more than 1 kbp (relocation), or where the flanking sequences align on different strands (inversion) or on different chromosomes (translocation). Genome fraction (%) is the total number of reference bases aligned, divided by the reference size. The QUAST (version 5.0.2) analysis was run with the options min-identity 80 and fragmented, using the autosomes plus the X and Y chromosomes of GRCh38 as reference ("quast -r GRCh38 chrom no alt.fa -large -min-identity 80 -fragmented"). Assembly errors overlapping centromeres or segmental duplications of GRCh38 were discounted.
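Genome fraction, as defined above, can be sketched in Python by merging the aligned intervals on the reference so that no base is counted twice. The function name and the half-open `(chrom, start, end)` interval layout are assumptions for illustration; QUAST derives these intervals from its own alignments.

```python
def genome_fraction(aligned_blocks, reference_size):
    """Percentage of reference bases covered by at least one aligned block.

    aligned_blocks: iterable of (chrom, start, end) half-open reference intervals
    reference_size: total length of the reference in bases
    """
    covered = 0
    last_end = {}  # per chromosome: end of the region already counted
    for chrom, start, end in sorted(aligned_blocks):
        start = max(start, last_end.get(chrom, 0))  # skip bases already counted
        if end > start:
            covered += end - start
            last_end[chrom] = end
    return 100.0 * covered / reference_size


if __name__ == "__main__":
    blocks = [("chr1", 0, 100), ("chr1", 50, 150), ("chr2", 0, 50)]
    print(genome_fraction(blocks, 400))  # 50.0
```

The two overlapping chr1 blocks contribute 150 bases rather than 200, so 200 of the 400 reference bases are covered.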
Assembly errors before and after discounting problematic regions are shown. The SHASTA assemblies were generated and polished using only Nanopore reads. The total elapsed time for the WENGAND assemblies (using 44 cores) was [...]
[...] genomes. A) The MHC sequence was aligned to the genome assemblies, and the aligned blocks ≥ 30 kb with a minimum identity of 95% were kept. The alignment breakpoints (vertical black lines) indicate a contig switch, an alignment error or a gap in the assembly. B) The NGA50 and the number of contigs spanning the MHC region of each diploid assembly are depicted. NGA50 is NG50 corrected for assembly errors. The NGA50 was computed using a genome size equal to the length of the MHC region (n = 4.97 Mb).
[Figure panel labels: coverage 10X, 15X, 20X, 25X, 30X; data types ONT, MGI, ILL]
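The MHC evaluation described in this caption (keep aligned blocks of at least 30 kb and 95% identity, then compute NGA50 against the 4.97 Mb region length) can be sketched as follows. The function name, the `(length, identity)` input layout and the example block values are hypothetical; only the thresholds and the region length come from the text.

```python
MHC_LENGTH = 4_970_000  # length of the MHC region stated in the caption (4.97 Mb)


def mhc_nga50(blocks, min_len=30_000, min_identity=95.0, region_size=MHC_LENGTH):
    """NGA50 over the aligned blocks that pass the length and identity filters.

    blocks: iterable of (aligned_length_bp, percent_identity) pairs
    """
    kept = sorted(
        (length for length, identity in blocks
         if length >= min_len and identity >= min_identity),
        reverse=True,
    )
    total = 0
    for length in kept:
        total += length
        if total >= region_size / 2:
            return length
    return 0  # filtered blocks cover less than half of the region


if __name__ == "__main__":
    # Made-up blocks for illustration only.
    blocks = [
        (2_000_000, 99.8),
        (1_500_000, 99.5),
        (900_000, 94.0),   # dropped: identity below 95%
        (25_000, 99.9),    # dropped: shorter than 30 kb
        (400_000, 98.0),
    ]
    print(mhc_nga50(blocks))  # 1500000
```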