Indel variant analysis of short-read sequencing data with Scalpel

This article has been updated

Abstract

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, and we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in 5 h after read mapping.
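The abstract mentions indel normalization without describing it. As a rough illustration only, the following Python sketch shows one widely used convention: shifting an indel to its leftmost position and trimming the alleles to their shortest form. The function name `normalize_indel` and the toy reference strings are assumptions made for this example; this is not Scalpel's own code and not a command from the protocol.

```python
def normalize_indel(reference: str, pos: int, ref: str, alt: str):
    """Left-align and trim a variant (hypothetical helper, for illustration).

    reference: the reference sequence as a plain string
    pos:       1-based position of the first base of `ref` in `reference`
    ref, alt:  reference and alternate alleles
    Returns the normalized (pos, ref, alt) tuple.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                # Simplification: a full implementation would anchor on the
                # base to the right instead of giving up here.
                raise ValueError("cannot left-extend past the sequence start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    # A CA deletion inside a CA repeat can be written at several positions;
    # normalization picks the leftmost one.
    print(normalize_indel("GGGCACACACAGGG", 9, "ACA", "A"))
```

Normalizing calls this way before comparing call sets means that the same indel reported at different positions inside a repeat is counted once rather than as several distinct variants.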

Figure 1: High accuracy of indel detection using Scalpel on WGS data.
Figure 2: Main steps in the Scalpel protocol.
Figure 3: Overview of the indel variant filtering cascade.
Figure 4: Higher coverage can improve Scalpel's sensitivity performance for indel detection with WGS data.
Figure 5: Comparison of standard WGS and PCR-free data based on indel quality.
Figure 6: Whole-genome mutational concordance.
Figure 7: Size distribution of inherited and de novo indels.
Figure 8: Histograms of low-quality homopolymer indels by category.
Figure 9: Variant allele fractions (VAF %) of the inherited indels.
Figure 10: Filtering cascade of inherited and de novo indel calls.
Figure 11: Frame-preserving indels are more abundant within coding sequences.
Figure 12: Screenshot of the alignment of the de novo deletion in the IGV browser.
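
Two quantities named in the captions above, the variant allele fraction (Figure 9) and frame-preserving indels (Figure 11), have simple standard definitions. The sketch below illustrates them in Python. The function names and the read counts in the usage example are hypothetical, and the code is not taken from the Scalpel protocol.

```python
def variant_allele_fraction(alt_reads: int, ref_reads: int) -> float:
    """Percentage of reads supporting the alternate allele (VAF %)."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return 100.0 * alt_reads / total


def is_frame_preserving(ref: str, alt: str) -> bool:
    """True if the indel changes the sequence length by a nonzero multiple of 3.

    Such indels add or remove whole codons and so keep the downstream
    reading frame intact when they fall inside a coding sequence.
    """
    delta = abs(len(alt) - len(ref))
    return delta != 0 and delta % 3 == 0


if __name__ == "__main__":
    # Hypothetical counts: 5 reads support the indel, 15 support the reference.
    print(variant_allele_fraction(5, 15))
    # A 3-bp deletion keeps the frame; a 2-bp deletion does not.
    print(is_frame_preserving("GCAT", "G"), is_frame_preserving("GCA", "G"))
```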

Change history

  • 08 December 2016

    In the version of this article initially published, the affiliations of two authors, Esra Dikoglu and Vaidehi Jobanputra, were incorrectly reported. Corrected affiliations are as follows: Esra Dikoglu is affiliated with the New York Genome Center, New York, New York, USA. Vaidehi Jobanputra is affiliated with the New York Genome Center, New York, New York, USA; and Columbia University Medical Center, New York, New York, USA.

Acknowledgements

The project was supported in part by grants from the US National Institutes of Health (R01-HG006677 and U01-CA168409) and the US National Science Foundation (DBI-1350041) to M.C.S. and by grants from the Cold Spring Harbor Laboratory (CSHL) Cancer Center Support (5P30CA045508), the Stanley Institute for Cognitive Genomics and the Simons Foundation (SF51 and SF235988) to M.W.

Author information

Contributions

G.N. is the lead developer of Scalpel. M.C.S. contributed to the development of Scalpel and wrote the microsatellite detector scripts. H.F. contributed enhancements to Scalpel, compiled the Scalpel resource bundle and generated the figures in this article. E.D. and V.J. performed the Sanger validation. G.N., M.C.S. and M.W. conceived the Scalpel software project. G.N., M.C.S. and H.F. wrote the initial draft of the manuscript. M.C.S. is the principal investigator. All authors contributed to the development and approval of the final manuscript.

Corresponding author

Correspondence to Giuseppe Narzisi.

Ethics declarations

Competing interests

G.J.L. serves on advisory boards for GenePeeks and Omicia. The remaining authors declare no competing financial interests.

Supplementary information

Supplementary Methods and Supplementary Results

Supplementary Methods. Descriptions of Sanger validation experiments. Supplementary Results. Screenshots of images from Mutation Surveyor for each Sanger validation experiment. (PDF 1576 kb)

About this article

Cite this article

Fang, H., Bergmann, E., Arora, K. et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc 11, 2529–2548 (2016). https://doi.org/10.1038/nprot.2016.150

  • DOI: https://doi.org/10.1038/nprot.2016.150
