Abstract

Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We thank W.R. McCombie, S. Wheelan, S. Goodwin, H. Li, and B.Q. Minh for helpful discussions. This work was supported by the National Science Foundation (DBI-1350041, IOS-1732253, and IOS-1445025 to M.C.S.) and the US National Institutes of Health (R01-HG006677 and UM1 HG008898 to M.C.S. and F.J.S.). P.R. acknowledges support from DK RNA Biology (W1207-B09). A.v.H. and M.S. acknowledge financial support from the University of Vienna and the Medical University of Vienna.

Author information

Author notes

  1. These authors contributed equally: Fritz J. Sedlazeck, Philipp Rescheneder.

Affiliations

  1. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA

    • Fritz J. Sedlazeck
  2. Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Vienna, Austria

    • Philipp Rescheneder, Moritz Smolka & Arndt von Haeseler
  3. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA

    • Han Fang, Maria Nattestad & Michael C. Schatz
  4. Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria

    • Arndt von Haeseler
  5. Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA

    • Michael C. Schatz

Contributions

F.J.S., P.R., and M.S. developed the software. F.J.S., P.R., M.S., H.F., and M.N. performed analysis. F.J.S., P.R., M.C.S., and A.v.H. wrote the manuscript. M.C.S. and A.v.H. directed the project. All authors read and approved the final manuscript.

Competing interests

M.C.S. and F.J.S. have participated in PacBio-sponsored meetings over the past few years and have received travel reimbursement and honoraria for presenting at these events. Since the initial submission of this paper, P.R. has become an employee of Oxford Nanopore. PacBio and Oxford Nanopore had no role in decisions related to the study/work, data collection, or analysis of data described in this paper.

Corresponding authors

Correspondence to Fritz J. Sedlazeck or Michael C. Schatz.

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Notes 1–5

  2. Reporting Summary

  3. Supplementary Table 1

    Raw statistics over the mapper evaluation

  4. Supplementary Table 2

    SV caller statistics over simulated reads

  5. Supplementary Table 3

    Mapping comparison over simulated reference and real reads

  6. Supplementary Table 4

    SV caller comparison over simulated reference and real reads

  7. Supplementary Table 5

    Used real datasets and accessions

  8. Supplementary Table 6

    GiaB trio comparison

  9. Supplementary Table 7

    Comparison of existing NA12878 datasets

  10. Supplementary Table 8

    NA12878 indel assessments using Illumina short-read data

  11. Supplementary Table 9

    Analysis of potential biases in short-read calling

  12. Supplementary Table 10

    Runtime comparisons over NA12878

  13. Supplementary Table 11

    Insertion and deletion assessment for simulated data

About this article

DOI

https://doi.org/10.1038/s41592-018-0001-7
