Computational tools

Piercing the dark matter: bioinformatics of long-range sequencing and mapping

Abstract

Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.

Fig. 1: De novo genome assembly.
Fig. 2: Structural variant detection with long-read sequencing.
Fig. 3: Phasing concepts and requirements.
Fig. 4: Example of novel isoforms discovered using long-read sequencing.
Fig. 5: Detecting methylated nucleotides using single-molecule sequencing.

Acknowledgements

The authors thank A. Phillippy, W. Timp, W. R. McCombie, S. Goodwin and R. Gibbs for helpful discussions. This work was supported, in part, by awards from the National Science Foundation (DBI-1350041) and from the National Institutes of Health (R01-HG006677 and UM1-HG008898). Also, this work was completed in part while H.L. was visiting the Simons Institute for the Theory of Computing, University of California, Berkeley, USA.

Author information

Contributions

All authors contributed to all aspects of this manuscript, including researching data, discussing content and writing, reviewing and editing the manuscript before submission.

Corresponding author

Correspondence to Michael C. Schatz.

Ethics declarations

Competing interests

M.C.S. and F.J.S. have participated in Pacific Biosciences (PacBio) sponsored meetings over the past few years and have received travel reimbursement and honoraria for presenting at these events. PacBio had no role in decisions relating to the study and/or work to be published, data collection and analysis or the decision to publish.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Reviewer information

Nature Reviews thanks Heng Li, René Warren and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Related Links

10X Genomics: https://www.10xgenomics.com/

Bionano Genomics: https://bionanogenomics.com/

Illumina: https://www.illumina.com/

Oxford Nanopore Technologies: https://nanoporetech.com/

Pacific Biosciences: http://www.pacb.com/

Proposed solutions to corruption of long-read sequencing files: https://github.com/samtools/hts-specs/issues/40

Glossary

Mate pairs

A molecular technique to generate a pair of sequencing reads separated by an approximately known distance. The typical separation distance for mate pairs is a few kilobases, as opposed to paired-end sequencing, which separates the reads by a few hundred bases at most.

Optical mapping

A microscopy technique used to visualize the characteristics of DNA, especially the physical lengths or the position of fluorescent probes.

Indels

A type of DNA sequence variation marked by the insertion or deletion of nucleotides.

Structural variants (SVs)

DNA sequence variants that are 50 bp or larger, including insertions, deletions, inversions, duplications and translocations.

Linked reads

Also known as a read cloud. A set of barcoded short reads derived from the same DNA molecule and therefore highly localized in the genome.

Phased

Grouping together variants located on the same molecule, such as to identify variants from the maternal or the paternal genome in a diploid sample.

Scaffolding

The process of assembling sequences of DNA into a scaffold. A scaffold is similar to a contig but may contain gaps, typically represented as Ns in the sequence.

Contigs

Contiguously assembled sequences of DNA.
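To make the relationship between scaffolds and contigs concrete, the following Python sketch splits a scaffold sequence back into its contigs at the gaps. It assumes gaps are written as runs of N (upper or lower case), as described in the Scaffolding entry; the function name `split_scaffold` and the example sequence are hypothetical and not taken from any particular tool.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into contigs at runs of N (gap) characters.

    Assumption: every run of one or more N/n characters marks a gap
    between two contigs. Empty pieces (from leading or trailing gaps)
    are discarded.
    """
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


if __name__ == "__main__":
    # Hypothetical scaffold with two gaps of different sizes.
    example = "ACGTNNNNGGCCNTT"
    print(split_scaffold(example))
```

The gap length is discarded here; a fuller treatment would keep it, because the number of Ns is often used to record the estimated distance between contigs.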

N50

A length-weighted median; specifically, the N50 length is the length such that 50% of the genome has been assembled into contig or scaffold sequences of this length or longer.
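As a rough illustration of this definition, the following Python sketch sorts sequence lengths from longest to shortest and reports the length at which the running total first reaches half of the total. It assumes the total assembled length stands in for the genome size; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumption: the sum of the given lengths is used as the genome
    size, so this measures the assembly against itself.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases, and
    # stop at the first length where at least half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths totalling 300 bases.
    contigs = [80, 70, 50, 40, 30, 20, 10]
    print(n50(contigs))  # 80 + 70 = 150, half of 300, so this prints 70
```

Using an independently estimated genome size in place of the assembly total gives a related statistic that is comparable across assemblies of the same genome.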

Cis-regulation

Any molecular interaction that regulates the transcription of nearby genes on the same DNA molecule, such as the role of a gene promoter.

Trans-regulation

Any molecular interaction that regulates the transcription of genes on a different DNA molecule, such as a transcription factor regulating both alleles of a target gene or genes.

Topologically associating domains (TADs)

Regions of the genome that are enriched for interactions with other elements within the same domain.

Synteny blocks

Genomic regions that are conserved among multiple species.

Fragile sites

Regions of the DNA molecule that are prone to physical shearing, especially when multiple nicking sites targeted by a nicking enzyme are located in close proximity.

Nested

Two or more adjacent or even overlapping variants in the same region of the genome, such as a deletion within the middle of a larger inverted sequence.

Chromothripsis

A phenomenon by which many chromosomal rearrangements occur in a single event in a localized region of the genome. Also called chromosome shattering.

Chromoplexy

A complex mutation in which genetic material from multiple chromosomes is broken and ligated together in a new configuration, especially in cancer.

Polyploid

Cells and organisms that contain more than two paired (homologous) sets of chromosomes.

Polymorphisms

Variants observed in the genome that are present to some appreciable degree within a population (for example, >1%).

Compound heterozygous

Two different mutant alleles at a particular gene locus, one on each chromosome of a pair.

Hemizygous mutations

Two or more heterozygous mutations, especially loss-of-function mutations, occurring on the same chromosome so that they disrupt one copy of the gene but leave one functional copy.

Private mutations

Rare variants observed only within a single person or family.

NP-hard

In computational complexity theory, a class of problems that are at least as hard as the hardest problems in NP and for which no fast (polynomial-time) algorithms are known.

Metagenomes

The genomes of all the species present in a sample, studied without culturing or otherwise isolating any individual.

About this article

Cite this article

Sedlazeck, F.J., Lee, H., Darby, C.A. et al. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 19, 329–346 (2018). https://doi.org/10.1038/s41576-018-0003-4
