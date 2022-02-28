

Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

Nature Biotechnology (2022)

Abstract

Although most existing genome assemblers are based on de Bruijn graphs, the construction of these graphs for large genomes and large k-mer sizes has remained elusive. This algorithmic challenge has become particularly pressing with the emergence of long, high-fidelity (HiFi) reads that have been recently used to generate a semi-manual telomere-to-telomere assembly of the human genome. To enable automated assemblies of long, HiFi reads, we present the La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads by three orders of magnitude, constructs the de Bruijn graph for large genomes and large k-mer sizes and transforms it into a multiplex de Bruijn graph with varying k-mer sizes. Compared to state-of-the-art assemblers, our algorithm not only achieves five-fold fewer misassemblies but also generates more contiguous assemblies. We demonstrate the utility of LJA via the automated assembly of a human genome that completely assembled six chromosomes.
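For readers unfamiliar with the underlying data structure, the following is a minimal sketch of a plain de Bruijn graph built from reads, where nodes are (k-1)-mers and each k-mer contributes one edge. It is not the LJA or jumboDBG implementation: the function names `build_de_bruijn` and `count_branching_nodes` and the toy reads are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors, Bloom filters and memory efficiency entirely.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not LJA code).

    Nodes are (k-1)-mers; every k-mer in a read adds an edge from its
    (k-1)-length prefix to its (k-1)-length suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branching_nodes(graph):
    """Count nodes with more than one outgoing edge (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy reads, invented for this example.
reads = ["ACGTAC", "GTACGA"]

small_k_graph = build_de_bruijn(reads, k=4)
large_k_graph = build_de_bruijn(reads, k=5)

print(count_branching_nodes(small_k_graph))
print(count_branching_nodes(large_k_graph))
```

In this toy input, the node `ACG` has two possible successors at k = 4, while at k = 5 no node branches. This is a small-scale illustration of why larger k-mer sizes can untangle repeats, and why constructing such graphs for large k on large genomes is the challenge the abstract refers to.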

Fig. 1: JumboDBG pipeline.
Fig. 2: LJA pipeline.
Fig. 3: Chromosome-by-chromosome LGA95 (left) and completeness (center) metrics as well as centromere-by-centromere LGA95 metric (right) for LJA (blue), hifiasm (green) and HiCanu (red) assemblies of the T2T read-set.
Fig. 4: Constructing a multiplex de Bruijn graph.

Data availability

All assemblies generated by LJA are available at https://zenodo.org/record/5552696#.YV3MkVNBxH4. All described datasets are publicly available through the corresponding repositories. All HiFi data were obtained from the National Center for Biotechnology Information Sequence Read Archive (SRA). The SRA access codes for all datasets are specified in Supplementary Note 2 ‘Information about datasets’. The CHM13 reference (version 0.9) generated by the T2T consortium (referred to as T2TGenome) can be found at https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.9.fasta.gz.

Code availability

The LJA code uses the open-source libraries spoa (version 4.0.5) and ksw2 (version 4e0a1cc) and is available at https://github.com/AntonBankevich/LJA. All software tools used in the analysis and their versions and parameters are specified in the text of the paper and in Supplementary Note 9 ‘LJA parameters’.

Acknowledgements

A. Bankevich, A. Bzikadze and P.A.P. were supported by National Science Foundation EAGER award 2032783. D.A. was supported by Saint Petersburg State University (grant ID PURE 73023672). We thank A. Korobeynikov and A. Mikheenko for useful suggestions on assembling HiFi reads using SPAdes and benchmarking.

Author information

Affiliations

  1. Department of Computer Science and Engineering, University of California, San Diego, San Diego CA, USA

    Anton Bankevich & Pavel A. Pevzner

  2. Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego CA, USA

    Andrey V. Bzikadze

  3. Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz CA, USA

    Mikhail Kolmogorov

  4. Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia

    Dmitry Antipov

Authors

  Anton Bankevich
  Andrey V. Bzikadze
  Mikhail Kolmogorov
  Dmitry Antipov
  Pavel A. Pevzner

Contributions

All authors contributed to developing the LJA algorithms and writing the paper. A. Bankevich (jumboDBG and mowerDBG), A. Bzikadze (multiplexDBG) and D.A. (LJApolish) implemented the LJA algorithm. A. Bankevich benchmarked LJA and other assembly tools. A. Bankevich and P.A.P. directed the work.

Corresponding authors

Correspondence to Anton Bankevich or Pavel A. Pevzner.

Ethics declarations

Competing interests

The authors declare no competing interests.

