Date: 2022-02-28
Abstract
Although most existing genome assemblers are based on de Bruijn graphs, the construction of these graphs for large genomes and large k-mer sizes has remained elusive. This algorithmic challenge has become particularly pressing with the emergence of long, high-fidelity (HiFi) reads that have been recently used to generate a semi-manual telomere-to-telomere assembly of the human genome. To enable automated assemblies of long, HiFi reads, we present the La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads by three orders of magnitude, constructs the de Bruijn graph for large genomes and large k-mer sizes and transforms it into a multiplex de Bruijn graph with varying k-mer sizes. Compared to state-of-the-art assemblers, our algorithm not only achieves five-fold fewer misassemblies but also generates more contiguous assemblies. We demonstrate the utility of LJA via the automated assembly of a human genome in which six chromosomes were completely assembled.
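For readers unfamiliar with the underlying data structure, the following is a minimal sketch of a plain de Bruijn graph built from reads, in which each distinct k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. It is an illustration only and not the LJA implementation: the function names and toy reads are hypothetical, and the sketch omits everything that makes the problem hard at scale (reverse complements, error correction, Bloom filters, sparse graphs, disjointigs and varying k-mer sizes).

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers. Each distinct k-mer contributes one edge from
    its prefix (first k-1 characters) to its suffix (last k-1 characters).
    Illustrative helper only; not taken from the LJA code base.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy example the node ACG has two outgoing edges, which is the kind of branching that a larger k-mer size can resolve when a read spans the repeat. A dictionary of sets like this is only practical for small inputs.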
Data availability
All assemblies generated by LJA are available at https://zenodo.org/record/5552696#.YV3MkVNBxH4. All described datasets are publicly available through the corresponding repositories. All HiFi data were obtained from the National Center for Biotechnology Information Sequence Read Archive (SRA). The SRA access codes for all datasets are specified in Supplementary Note 2 ‘Information about datasets’. The CHM13 reference (version 0.9) generated by the T2T consortium (referred to as T2TGenome) can be found at https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.9.fasta.gz.
Code availability
The LJA code uses the open-source libraries spoa (version 4.0.5) and ksw2 (version 4e0a1cc) and is available at https://github.com/AntonBankevich/LJA. All software tools used in the analysis and their versions and parameters are specified in the text of the paper and in Supplementary Note 9 ‘LJA parameters’.
Acknowledgements
A. Bankevich, A. Bzikadze and P.A.P. were supported by National Science Foundation EAGER award 2032783. D.A. was supported by Saint Petersburg State University (grant ID PURE 73023672). We thank A. Korobeynikov and A. Mikheenko for useful suggestions on assembling HiFi reads using SPAdes and benchmarking.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Ergude Bao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Notes 1–15
Cite this article
Bankevich, A., Bzikadze, A.V., Kolmogorov, M. et al. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-022-01220-6
DOI: https://doi.org/10.1038/s41587-022-01220-6