Abstract
Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
All described datasets are publicly available through the corresponding repositories. The supplementary files, including the assemblies generated by Flye, are available at https://doi.org/10.5281/zenodo.1143753; NCTC PacBio reads: http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/; PacBio metagenome dataset: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun; PacBio C. elegans dataset: https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set; PacBio/ONT S. cerevisiae dataset: https://github.com/fg6/YeastStrainsStudy. The ONT reads from the HUMAN and HUMAN+ datasets are available at https://github.com/nanopore-wgs-consortium/NA12878. The matching Illumina reads are available as SRA project ERP001229. The Canu HUMAN+ assembly was downloaded from https://genomeinformatics.github.io/na12878update. MaSuRCA assemblies are available from http://masurca.blogspot.com/.
Code availability
The Flye code used in this study is available in the online version of the paper. The most recent Flye version is freely available at http://github.com/fenderglass/Flye.
References
Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).
Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl Acad. Sci. USA 113, E8396–E8405 (2016).
Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. & David, N. T. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756 (2017).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Nowoshilow, S. et al. The axolotl genome and the evolution of key tissue formation regulators. Nature 554, 50–55 (2018).
Ghurye, J., Pop, M., Koren, S., Bickhart, D. & Chin, C. S. Scaffolding of long read assemblies using long range contact information. BMC Genomics 18, 527 (2017).
Weissensteiner, M. H. et al. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27, 697–708 (2017).
Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Jiang, Z. et al. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 (2007).
Pu., L., Lin, Y. & Pevzner, P. A. Detection and analysis of ancient segmental duplications in mammalian genomes. Genome Res. 28, 901–909 (2018).
Bao, Z. & Eddy, S. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 8, 1269–1276 (2002).
Schmid, M. D. et al. Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats. Nucleic Acids Res. 46, 8953–8965 (2018).
Tischler, G. Haplotype and repeat separation in long reads. Preprint at bioRxiv https://doi.org/10.1101/145474 (2017).
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
Edmonds, J. & Johnson, E. L. Matching, Euler tours and the Chinese postman. Math. Program. 5, 88–124 (1973).
Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2015).
Giordano, F. et al. De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Sci. Rep. 7, 3935 (2017).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792 (2017).
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407 (2017).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS ONE 9, e112963 (2014).
Lin, Y., Nurk, S. & Pevzner, P. A. What is the difference between the breakpoint graph and the de Bruijn graph? BMC Genomics 15, S6 (2014).
Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 51, 608–611 (2015).
Nattestad, M. S. et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA 2 and RNA sequencing of a breast cancer cell line. Genome Res. 28, 1126–1135 (2018).
Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
Gibbs, A. J. & McIntyre, G. A. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1–11 (1970).
Edmonds, J. Paths, trees, and flowers. Canad. J. Math. 17, 449–467 (1965).
Acknowledgements
We are indebted to S. Nurk for his multiple rounds of critique and suggestions that have improved the paper. We are also grateful to A. Mikheenko, B. Behsaz, L. Pu, and G. Tesler for their comments. This work is supported by NSF/MCB-BSF grant no. 1715911.
Author information
Authors and Affiliations
Contributions
All authors contributed to developing the Flye algorithms and writing the paper. M.K., Y.L., and J.Y. implemented the Flye algorithm. M.K. benchmarked Flye and other assembly tools. P.A.P. directed the work.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 A comparison of the Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset.
(Left) The Flye and Hinge assembly graphs of the KP9657 dataset. There is a single unique edge entering into (and exiting) the unresolved “yellow” repeat and connecting it to the rest of the graph. Thus, this repeat can be resolved if one excludes the possibility that it is shared between a chromosome and a plasmid. In contrast to HINGE, Flye does not rule out this possibility and classifies the yellow repeat as unresolved. (Right) The Flye and Hinge assembly graphs of the EC10864 dataset show a mosaic repeat of multiplicity four formed by yellow, blue, red and green edges (the two copies of each edge represent complementary strands). HINGE reports a complete assembly into a single chromosome.
Supplementary Figure 2 The assembly graph of the YEAST-ONT dataset.
Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the YEAST-ONT dataset into a graph with 21 unique and 34 repeat edges and generated 21 contigs as unambiguous paths in the assembly graph. A path v1, …vi, vi+1 …vn in the graph is called unambiguous if there exists a single incoming edge into each vertex of this path before vi+1 and a single outgoing edge from each vertex after vi. Each unique contig is formed by a single unique edge and possibly multiple repeat edges, while repetitive contigs consist of the repetitive edges which were not covered by the unique contigs. The visualization was generated using the graphviz tool (http://graphviz.org).
Supplementary Figure 3 The assembly graph of the WORM dataset.
Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the WORM dataset into a graph with 127 unique and 61 repeat edges and generated 127 contigs as unambiguous paths in the assembly graph. The visualization was generated using the graphviz tool (http://graphviz.org).
Supplementary Figure 4 Dot plots showing the alignment of reads against the Flye assembly, the Miniasm assembly and the reference C. elegans genome.
(a) The reference genome contains a tandem repeat of length 1.9 kb (10 copies) on chromosome X with the repeated unit having length ≈190 nucleotides. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 5.5 kb (27 copies) and 2.8 kb (13 copies), respectively. 15 reads that span over the tandem repeat support the Flye assembly (the mean length between the flanking unique sequence matches the repeat length reconstructed by Flye) and suggests that the Flye length estimate is more accurate. (b) The reference genome contains a tandem repeat of length 2 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 10 kb and 5.6 kb, respectively. A single read that spans over the tandem repeat supports the Flye assembly. Since the mean read length in the WORM dataset is 11 kb, it is expected to have a single read spanning a given 10.0 kb region but many more reads spanning any 5.6 kb region (as implied by the Miniasm assembly) or 2.0 kb region (as implied by the reference genome). Six out of 23 reads cross the “left” border (two out of 18 reads cross the “right” border) of this tandem repeat by more than 5.6 kb, thus contradicting the length estimate given by Miniasm and suggesting that the Flye length estimate is more accurate. (c) The reference genome contains a tandem repeat of length 3 kb on chromosome X. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of lengths 13.6 kb and 8 kb, respectively. A single read that spans over the tandem repeat reveals the repeat cluster to be of length 12.2k, which suggests that the Flye length estimate is more accurate. (d) The reference genome contains a tandem repeat of length 1.5 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest tandem repeats of length 17 kb and 4.3 kb, respectively. One read that spans over the tandem repeat reveals the repeat cluster to be of length 18.0 kb, which suggests that the Flye length estimate is more accurate.
Supplementary information
Supplementary Figures and Text
Supplementary Figures 1–4, Supplementary Tables 1 and 2, and Supplementary Notes 1–16
Rights and permissions
About this article
Cite this article
Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019). https://doi.org/10.1038/s41587-019-0072-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41587-019-0072-8
This article is cited by
-
Genome sequence and assembly of the amylolytic Bacillus licheniformis T5 strain isolated from Kazakhstan soil
BMC Genomic Data (2024)
-
Maptcha: an efficient parallel workflow for hybrid genome scaffolding
BMC Bioinformatics (2024)
-
De novo assembly and annotation of Popillia japonica’s genome with initial clues to its potential as an invasive pest
BMC Genomics (2024)
-
Comparative transcriptomics reveals a highly polymorphic Xanthomonas HrpG virulence regulon
BMC Genomics (2024)
-
Recovering high-quality bacterial genomes from cross-contaminated cultures: a case study of marine Vibrio campbellii
BMC Genomics (2024)