Recent developments in sequencing techniques have enabled rapid and high-throughput generation of sequence data, allowing individual laboratories to compile information on large numbers of genetic variants. However, there is a growing gap between the generation of raw sequencing data and the extraction of meaningful biological information. Here, we describe a protocol that uses the ANNOVAR (ANNOtate VARiation) software to perform fast and easy variant annotation, including gene-based, region-based and filter-based annotation, on a variant call format (VCF) file generated from human genomes. We further describe a protocol for gene-based annotation of a newly sequenced nonhuman species. Finally, we describe how to use wANNOVAR, a user-friendly and easily accessible web server, to prioritize candidate genes for a Mendelian disease. The variant annotation protocols take 5–30 min of computer time, depending on the size of the variant file, and 5–10 min of hands-on time. In summary, through the command-line tool and the web server, these protocols provide a convenient means to analyze genetic variants generated in humans and other species.
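As a rough illustration of how gene-based, region-based and filter-based annotations can be requested together on a VCF file, the Python sketch below assembles an argument list for ANNOVAR's `table_annovar.pl` script without running it. The file name `example.vcf`, the database directory `humandb/`, the output prefix `myanno` and the database names (`refGene`, `cytoBand`, `exac03`) are assumptions chosen for illustration, not values taken from the protocol; the helper `build_table_annovar_command` is hypothetical. Check the flags against the ANNOVAR version you have installed.

```python
from typing import List, Sequence, Tuple

# Operation codes passed to table_annovar.pl:
# g = gene-based, r = region-based, f = filter-based.
VALID_OPERATIONS = {"g", "r", "f"}


def build_table_annovar_command(
    vcf_path: str,
    database_dir: str,
    annotations: Sequence[Tuple[str, str]],
    build: str = "hg19",
    out_prefix: str = "myanno",
) -> List[str]:
    """Return an argument list for table_annovar.pl (nothing is executed).

    `annotations` pairs each database name with its operation code, so the
    -protocol and -operation lists cannot fall out of step.
    """
    if not annotations:
        raise ValueError("at least one (database, operation) pair is required")
    for database, operation in annotations:
        if operation not in VALID_OPERATIONS:
            raise ValueError(f"unknown operation {operation!r} for {database!r}")
    protocols = ",".join(database for database, _ in annotations)
    operations = ",".join(operation for _, operation in annotations)
    return [
        "table_annovar.pl", vcf_path, database_dir,
        "-buildver", build,
        "-out", out_prefix,
        "-remove",              # delete intermediate files when done
        "-protocol", protocols,
        "-operation", operations,
        "-nastring", ".",       # placeholder for missing values
        "-vcfinput",            # input is a VCF rather than ANNOVAR format
    ]


if __name__ == "__main__":
    # Database names below are illustrative assumptions.
    command = build_table_annovar_command(
        "example.vcf",
        "humandb/",
        [("refGene", "g"), ("cytoBand", "r"), ("exac03", "f")],
    )
    print(" ".join(command))
```

The returned list can be handed to `subprocess.run` once ANNOVAR and the chosen databases are installed; keeping each database paired with its operation code avoids mismatched `-protocol` and `-operation` lists.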
Development of the ANNOVAR/wANNOVAR tool is supported by US National Institutes of Health (NIH) grant R01 HG006465. We thank X. Chang for the initial development of the wANNOVAR server. We thank all ANNOVAR and wANNOVAR users for their helpful suggestions, comments and bug reports to improve the software tools and web servers.
K.W. is a shareholder and board member of Tute Genomics.
Integrated supplementary information
Supplementary Figure 1: The expected results for discovering candidate genes of the ‘hemolytic anemia’ example in the Phenolyzer website.
Each ball represents one of the top 50 ranked genes. The larger the ball, the higher the ranking. The blue balls represent disease genes reported before and the yellow ones represent predicted disease genes. For detailed explanations on each color and shape, and on how the algorithm works to find disease genes, please visit http://phenolyzer.usc.edu/FAQ.php
This is a sample output with default parameters. The first 5 columns represent the original information on the input variants. The following 5 columns give gene-based annotations on each variant. The following columns give region-based and filter-based annotations on each variant.
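The column layout described above (5 input columns, then 5 gene-based columns, then region-based and filter-based columns) can be sketched in Python as follows. This is a minimal sketch that assumes a tab-delimited output with one header line; the header names and the data row in the usage example are made-up placeholders for illustration, not results from the protocol, and `group_columns` is a hypothetical helper.

```python
from typing import Dict, List

N_INPUT = 5  # columns holding the original input-variant information
N_GENE = 5   # columns holding gene-based annotations


def group_columns(header: List[str], row: List[str]) -> Dict[str, Dict[str, str]]:
    """Split one output row into the three column groups described above."""
    if len(header) != len(row):
        raise ValueError("header and row have different numbers of fields")
    if len(header) < N_INPUT + N_GENE:
        raise ValueError("expected at least 10 columns")
    pairs = list(zip(header, row))
    return {
        "input": dict(pairs[:N_INPUT]),
        "gene_based": dict(pairs[N_INPUT:N_INPUT + N_GENE]),
        "region_and_filter": dict(pairs[N_INPUT + N_GENE:]),
    }


if __name__ == "__main__":
    # Placeholder header and row; real column names depend on the databases used.
    header_line = ("Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\t"
                   "GeneDetail.refGene\tExonicFunc.refGene\tAAChange.refGene\t"
                   "cytoBand\tExAC_ALL")
    row_line = "1\t100\t100\tA\tG\texonic\tGENE1\t.\tsynonymous SNV\t.\t1p36.33\t."
    groups = group_columns(header_line.split("\t"), row_line.split("\t"))
    for name, columns in groups.items():
        print(name, columns)
```

Grouping by position rather than by column name keeps the sketch independent of which region-based and filter-based databases were selected.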
About this article
Cite this article
Yang, H., Wang, K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc 10, 1556–1566 (2015). https://doi.org/10.1038/nprot.2015.105