Article | Published:

Assembly of long, error-prone reads using repeat graphs

Abstract

Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.

Access optionsAccess options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Data availability

All described datasets are publicly available through the corresponding repositories. The supplementary files, including the assemblies generated by Flye, are available at https://doi.org/10.5281/zenodo.1143753; NCTC PacBio reads: http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/; PacBio metagenome dataset: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun; PacBio C. elegans dataset: https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set; PacBio/ONT S. cerevisiae dataset: https://github.com/fg6/YeastStrainsStudy. The ONT reads from the HUMAN and HUMAN+ datasets are available at https://github.com/nanopore-wgs-consortium/NA12878. The matching Illumina reads are available as SRA project ERP001229. The Canu HUMAN+ assembly was downloaded from https://genomeinformatics.github.io/na12878update. MaSuRCA assemblies are available from http://masurca.blogspot.com/.

Code availability

The Flye code used in this study is available in the online version of the paper. The most recent Flye version is freely available at http://github.com/fenderglass/Flye.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. 1.

    Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).

  2. 2.

    Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).

  3. 3.

    Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).

  4. 4.

    Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

  5. 5.

    Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  6. 6.

    Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl Acad. Sci. USA 113, E8396–E8405 (2016).

  7. 7.

    Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. & David, N. T. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756 (2017).

  8. 8.

    Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  9. 9.

    Nowoshilow, S. et al. The axolotl genome and the evolution of key tissue formation regulators. Nature 554, 50–55 (2018).

  10. 10.

    Ghurye, J., Pop, M., Koren, S., Bickhart, D. & Chin, C. S. Scaffolding of long read assemblies using long range contact information. BMC Genomics 18, 527 (2017).

  11. 11.

    Weissensteiner, M. H. et al. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27, 697–708 (2017).

  12. 12.

    Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).

  13. 13.

    Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).

  14. 14.

    Jiang, Z. et al. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 (2007).

  15. 15.

    Pu., L., Lin, Y. & Pevzner, P. A. Detection and analysis of ancient segmental duplications in mammalian genomes. Genome Res. 28, 901–909 (2018).

  16. 16.

    Bao, Z. & Eddy, S. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 8, 1269–1276 (2002).

  17. 17.

    Schmid, M. D. et al. Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats. Nucleic Acids Res. 46, 8953–8965 (2018).

  18. 18.

    Tischler, G. Haplotype and repeat separation in long reads. Preprint at bioRxiv https://doi.org/10.1101/145474 (2017).

  19. 19.

    Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).

  20. 20.

    Edmonds, J. & Johnson, E. L. Matching, Euler tours and the Chinese postman. Math. Program. 5, 88–124 (1973).

  21. 21.

    Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2015).

  22. 22.

    Giordano, F. et al. De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Sci. Rep. 7, 3935 (2017).

  23. 23.

    Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  24. 24.

    Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792 (2017).

  25. 25.

    Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407 (2017).

  26. 26.

    Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS ONE 9, e112963 (2014).

  27. 27.

    Lin, Y., Nurk, S. & Pevzner, P. A. What is the difference between the breakpoint graph and the de Bruijn graph? BMC Genomics 15, S6 (2014).

  28. 28.

    Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 51, 608–611 (2015).

  29. 29.

    Nattestad, M. S. et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA 2 and RNA sequencing of a breast cancer cell line. Genome Res. 28, 1126–1135 (2018).

  30. 30.

    Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).

  31. 31.

    Gibbs, A. J. & McIntyre, G. A. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1–11 (1970).

  32. 32.

    Edmonds, J. Paths, trees, and flowers. Canad. J. Math. 17, 449–467 (1965).

Download references

Acknowledgements

We are indebted to S. Nurk for his multiple rounds of critique and suggestions that have improved the paper. We are also grateful to A. Mikheenko, B. Behsaz, L. Pu, and G. Tesler for their comments. This work is supported by NSF/MCB-BSF grant no. 1715911.

Author information

All authors contributed to developing the Flye algorithms and writing the paper. M.K., Y.L., and J.Y. implemented the Flye algorithm. M.K. benchmarked Flye and other assembly tools. P.A.P. directed the work.

Competing interests

The authors declare no competing interests.

Correspondence to Pavel A. Pevzner.

Integrated supplementary information

Supplementary Figure 1 A comparison of the Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset.

(Left) The Flye and Hinge assembly graphs of the KP9657 dataset. There is a single unique edge entering into (and exiting) the unresolved “yellow” repeat and connecting it to the rest of the graph. Thus, this repeat can be resolved if one excludes the possibility that it is shared between a chromosome and a plasmid. In contrast to HINGE, Flye does not rule out this possibility and classifies the yellow repeat as unresolved. (Right) The Flye and Hinge assembly graphs of the EC10864 dataset show a mosaic repeat of multiplicity four formed by yellow, blue, red and green edges (the two copies of each edge represent complementary strands). HINGE reports a complete assembly into a single chromosome.

Supplementary Figure 2 The assembly graph of the YEAST-ONT dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the YEAST-ONT dataset into a graph with 21 unique and 34 repeat edges and generated 21 contigs as unambiguous paths in the assembly graph. A path v1, …vi, vi+1vn in the graph is called unambiguous if there exists a single incoming edge into each vertex of this path before vi+1 and a single outgoing edge from each vertex after vi. Each unique contig is formed by a single unique edge and possibly multiple repeat edges, while repetitive contigs consist of the repetitive edges which were not covered by the unique contigs. The visualization was generated using the graphviz tool (http://graphviz.org).

Supplementary Figure 3 The assembly graph of the WORM dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the WORM dataset into a graph with 127 unique and 61 repeat edges and generated 127 contigs as unambiguous paths in the assembly graph. The visualization was generated using the graphviz tool (http://graphviz.org).

Supplementary Figure 4 Dot plots showing the alignment of reads against the Flye assembly, the Miniasm assembly and the reference C. elegans genome.

(a) The reference genome contains a tandem repeat of length 1.9 kb (10 copies) on chromosome X with the repeated unit having length ≈190 nucleotides. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 5.5 kb (27 copies) and 2.8 kb (13 copies), respectively. 15 reads that span over the tandem repeat support the Flye assembly (the mean length between the flanking unique sequence matches the repeat length reconstructed by Flye) and suggests that the Flye length estimate is more accurate. (b) The reference genome contains a tandem repeat of length 2 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 10 kb and 5.6 kb, respectively. A single read that spans over the tandem repeat supports the Flye assembly. Since the mean read length in the WORM dataset is 11 kb, it is expected to have a single read spanning a given 10.0 kb region but many more reads spanning any 5.6 kb region (as implied by the Miniasm assembly) or 2.0 kb region (as implied by the reference genome). Six out of 23 reads cross the “left” border (two out of 18 reads cross the “right” border) of this tandem repeat by more than 5.6 kb, thus contradicting the length estimate given by Miniasm and suggesting that the Flye length estimate is more accurate. (c) The reference genome contains a tandem repeat of length 3 kb on chromosome X. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of lengths 13.6 kb and 8 kb, respectively. A single read that spans over the tandem repeat reveals the repeat cluster to be of length 12.2k, which suggests that the Flye length estimate is more accurate. (d) The reference genome contains a tandem repeat of length 1.5 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest tandem repeats of length 17 kb and 4.3 kb, respectively. One read that spans over the tandem repeat reveals the repeat cluster to be of length 18.0 kb, which suggests that the Flye length estimate is more accurate.

Supplementary information

Supplementary Figures and Text

Supplementary Figures 1–4, Supplementary Tables 1 and 2, and Supplementary Notes 1–16

Reporting Summary

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark
Fig. 1: Flye outline.
Fig. 2: Constructing the approximate repeat graph from local self-alignments.
Fig. 3: Resolving an unbridged repeat.
Fig. 4: An SD from the Flye assembly of the HUMAN dataset and the distribution of the lengths and complexities of all SDs from the same assembly.
Fig. 5: Constructing the repeat plot of a tour in the graph and constructing the repeat graph from a repeat plot.
Supplementary Figure 1: A comparison of the Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset.
Supplementary Figure 2: The assembly graph of the YEAST-ONT dataset.
Supplementary Figure 3: The assembly graph of the WORM dataset.
Supplementary Figure 4: Dot plots showing the alignment of reads against the Flye assembly, the Miniasm assembly and the reference C. elegans genome.