Accurate detection of complex structural variations using single-molecule sequencing

Sedlazeck, Fritz J.; Rescheneder, Philipp; Smolka, Moritz; Fang, Han; Nattestad, Maria; von Haeseler, Arndt; Schatz, Michael C.

doi:10.1038/s41592-018-0001-7

Article
Published: 30 April 2018

Nature Methods volume 15, pages 461–468 (2018)

Abstract

Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.

**Fig. 1: The main steps implemented in NGMLR and Sniffles.**

**Fig. 2: Improved alignment by NGMLR for a 228-bp deletion and a 150-bp inversion.**

**Fig. 3: Tool evaluation with simulated data.**

**Fig. 4: Systematic error in short-read-based SV calling.**

**Fig. 5: Nested SVs in the SKBR3 cancer cell line.**

**Fig. 6: Dependence of SV detection accuracy on the level of coverage.**

Acknowledgements

We thank W.R. McCombie, S. Wheelan, S. Goodwin, H. Li, and B.Q. Minh for helpful discussions. This work was supported by the National Science Foundation (DBI-1350041, IOS-1732253, and IOS-1445025 to M.C.S.) and the US National Institutes of Health (R01-HG006677 and UM1 HG008898 to M.C.S. and F.J.S.). P.R. acknowledges support from DK RNA Biology (W1207-B09). A.v.H. and M.S. acknowledge financial support from the University of Vienna and the Medical University of Vienna.

Author information

These authors contributed equally: Fritz J. Sedlazeck, Philipp Rescheneder.

Authors and Affiliations

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Fritz J. Sedlazeck
Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Vienna, Austria
Philipp Rescheneder, Moritz Smolka & Arndt von Haeseler
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Han Fang, Maria Nattestad & Michael C. Schatz
Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
Arndt von Haeseler
Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
Michael C. Schatz

Contributions

F.J.S., P.R., and M.S. developed the software. F.J.S., P.R., M.S., H.F., and M.N. performed analysis. F.J.S., P.R., M.C.S., and A.v.H. wrote the manuscript. M.C.S. and A.v.H. directed the project. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Fritz J. Sedlazeck or Michael C. Schatz.

Ethics declarations

Competing interests

M.C.S. and F.J.S. have participated in PacBio-sponsored meetings over the past few years and have received travel reimbursement and honoraria for presenting at these events. Since the initial submission of this paper, P.R. has become an employee of Oxford Nanopore. PacBio and Oxford Nanopore had no role in decisions related to the study/work, data collection, or analysis of data described in this paper.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures: Supplementary Notes 1–5
Reporting Summary
Supplementary Table 1: Raw statistics over the mapper evaluation
Supplementary Table 2: SV caller statistics over simulated reads
Supplementary Table 3: Mapping comparison over simulated reference and real reads
Supplementary Table 4: SV caller comparison over simulated reference and real reads
Supplementary Table 5: Used real datasets and accessions
Supplementary Table 6: GiaB trio comparison
Supplementary Table 7: Comparison of existing NA12878 datasets
Supplementary Table 8: NA12878 indel assessments using Illumina short-read data
Supplementary Table 9: Analysis of potential biases in short-read calling
Supplementary Table 10: Runtime comparisons over NA12878
Supplementary Table 11: Insertion and deletion assessment for simulated data

About this article

Cite this article

Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15, 461–468 (2018). https://doi.org/10.1038/s41592-018-0001-7

Received: 11 August 2017
Accepted: 16 March 2018
Published: 30 April 2018
Issue Date: June 2018
DOI: https://doi.org/10.1038/s41592-018-0001-7
