
One of the major challenges of de novo mammalian genome assembly arises from the presence of large, interspersed segmental duplications with high levels of sequence identity. These regions are particularly difficult to assemble using current short-read high-throughput sequencing methods. Combining long-read single-molecule, real-time (SMRT) sequencing with a hierarchical genome-assembly process (HGAP) and the consensus and variant caller Quiver has now enabled such complex genomic regions to be resolved in a more cost- and time-effective manner than was previously possible.

Huddleston et al. applied a high-throughput method that was developed for finishing microbial genomes to a 1.3 Mb complex region of human chromosome 17q21.31. They then compared their results with those obtained using traditional Sanger-based approaches, which are accurate but low throughput. Overall, 99.994% sequence identity was achieved between the two assemblies, which shows that long-read SMRT sequencing enables a highly accurate genome assembly.
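To give a sense of scale for the identity figure, the following minimal Python sketch computes percent identity over a pairwise alignment and estimates how many differing positions 99.994% identity would imply across 1.3 Mb. It is an illustration only: the function names are hypothetical, the toy sequences are invented, gap columns are counted as differences, and the estimate assumes that the identity was measured over the full 1.3 Mb, which the study's actual comparison may not have done.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Assumption: the inputs are already aligned (equal length, '-' for gaps),
    and any column containing a gap counts as a difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def implied_differences(region_bp: int, identity_pct: float) -> float:
    """Approximate number of differing positions implied by an identity value.

    Assumption: identity was computed over the whole region length.
    """
    return region_bp * (1.0 - identity_pct / 100.0)


# Toy alignment (invented): one mismatch in ten columns.
toy_identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")

# Figures quoted in the text: 1.3 Mb region, 99.994% identity.
approx_diffs = implied_differences(1_300_000, 99.994)
```

Under these assumptions, the reported identity corresponds to roughly 78 differing positions across the 1.3 Mb region.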

Next, the researchers applied the long-read SMRT technique to a previously uncharacterized 766 kb region in the chimpanzee genome, which consists of complex lineage-specific duplications. Comparison of the assembled sequence with the current chimpanzee genome assembly (panTro4) revealed that 241 kb of sequence was entirely missing from the whole-genome assembly.
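As a rough back-of-the-envelope check, the short Python sketch below expresses the missing sequence as a share of the assembled region. The function name is hypothetical, and the calculation assumes that all 241 kb of missing sequence falls within the 766 kb region, which is an interpretation of the figures quoted above rather than a result reported in this form.

```python
def missing_share(region_kb: float, missing_kb: float) -> float:
    """Percentage of the assembled region that is absent from the reference.

    Assumption: the missing sequence lies entirely within the region.
    """
    if region_kb <= 0:
        raise ValueError("region length must be positive")
    return 100.0 * missing_kb / region_kb


# Figures quoted in the text: 766 kb region, 241 kb missing from panTro4.
share = missing_share(766, 241)
```

On that assumption, nearly one-third (about 31%) of the newly assembled region was absent from panTro4.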

Taken together, these findings introduce a cost-effective strategy for rapidly resolving the structure and organization of complex genomic regions during the final stages of mammalian genome assembly.