As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software package for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an example family-based de novo study, and we also describe the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ∼5 h after read mapping.
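The indel normalization step mentioned above can be illustrated with a minimal sketch. It assumes the common left-align-and-trim convention for VCF-style records and is not taken from Scalpel's code or from the protocol's own commands. The function name `normalize_indel` and its arguments are hypothetical, and the reference is passed as a plain string for simplicity.

```python
def normalize_indel(ref_seq, pos, ref, alt):
    """Left-align and trim one indel (illustrative sketch, not Scalpel's code).

    ref_seq: reference sequence as a string (hypothetical in-memory input)
    pos:     1-based position of the first base of `ref`
    ref/alt: reference and alternate alleles
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base so the variant can shift left.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pad both with the preceding reference base.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot left-extend at the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim redundant shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    # A CA deletion inside a CA repeat, reported at its rightmost position.
    print(normalize_indel("GGGCACACACAGGG", 9, "ACA", "A"))
```

Because a deletion or insertion inside a repeat can be written at several equivalent positions, reducing every call to a single left-aligned representation makes calls from different samples or tools directly comparable.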
The project was supported in part by grants from the US National Institutes of Health (R01-HG006677 and U01-CA168409) and the US National Science Foundation (DBI-1350041) to M.C.S. and by grants from the Cold Spring Harbor Laboratory (CSHL) Cancer Center Support (5P30CA045508), the Stanley Institute for Cognitive Genomics and the Simons Foundation (SF51 and SF235988) to M.W.
G.J.L. serves on advisory boards for GenePeeks and Omicia. The remaining authors declare no competing financial interests.
Supplementary Methods. Descriptions of Sanger validation experiments. Supplementary Results. Screenshots of images from Mutation Surveyor for each Sanger validation experiment. (PDF 1576 kb)
Fang, H., Bergmann, E., Arora, K. et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc 11, 2529–2548 (2016). https://doi.org/10.1038/nprot.2016.150