Review Article

Genotype and SNP calling from next-generation sequencing data

Nature Reviews Genetics volume 12, pages 443–451 (2011)

Abstract

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.

Key points

  • Converting next-generation sequencing (NGS) image files into a set of called SNPs involves a number of steps including image analysis, alignment and assembly, SNP calling and genotype calling.

  • Genotype probabilities for a single individual can be calculated from alignments using recalibrated quality scores.

  • SNP calling and genotype calling are best done using information from multiple individuals simultaneously. The pattern of linkage disequilibrium should be used to call SNPs and genotypes when possible.

  • Analyses of low coverage data can proceed by taking uncertainty in the genotype calls into account, rather than assuming any particular genotype call is correct.

  • The methods used for calling SNPs and for taking uncertainty in SNP genotypes into account can have a strong effect on downstream analyses, including association mapping analyses.
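The second key point, computing genotype probabilities for one individual from aligned bases and their quality scores, can be illustrated with a minimal sketch. It assumes a simple error model that is not spelled out on this page: reads are independent, each read samples either allele of a diploid genotype with probability 1/2, and a base-calling error is equally likely to produce any of the three other bases. The function names and the example pileup are hypothetical.

```python
import math

# The 10 unordered diploid genotypes at a single site.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

def phred_to_error(q):
    """Convert a phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)

def base_prob(observed, allele, err):
    """P(observed base | true allele); errors are spread evenly over 3 bases (assumption)."""
    return 1 - err if observed == allele else err / 3

def genotype_log_likelihoods(bases, quals):
    """Return log P(observed bases | genotype) for each genotype at one site."""
    loglik = {}
    for g in GENOTYPES:
        total = 0.0
        for b, q in zip(bases, quals):
            e = phred_to_error(q)
            # Each read is drawn from either allele with probability 1/2.
            total += math.log(0.5 * base_prob(b, g[0], e) + 0.5 * base_prob(b, g[1], e))
        loglik[g] = total
    return loglik

# Hypothetical pileup: five reads covering one site, with (recalibrated) quality scores.
bases = ["A", "A", "G", "A", "G"]
quals = [30, 30, 30, 20, 30]
ll = genotype_log_likelihoods(bases, quals)
best = max(ll, key=ll.get)
print(best)
```

With three A reads and two G reads of good quality, the heterozygote AG has the highest likelihood under this model. At low coverage the differences between genotypes shrink, which is why the remaining key points recommend carrying the uncertainty forward rather than fixing a single call.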

Acknowledgements

This work was supported in part by NIH grants NIGMS R01-HG003229–05 and R01-HG003229–0551 to R.N., an NSF CAREER grant DBI-0846015 to Y.S.S. and an NIH National Research Service Award Trainee appointment on T32-HG00047 to J.S.P.

Author information

Affiliations

  1. Department of Integrative Biology, University of California, Berkeley, California 94720, USA.

    • Rasmus Nielsen

  2. Centre for Bioinformatics, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen Ø, Denmark.

    • Rasmus Nielsen
    • Anders Albrechtsen

  3. Department of Statistics, University of California, Berkeley, California 94720, USA.

    • Rasmus Nielsen
    • Yun S. Song

  4. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA.

    • Joshua S. Paul
    • Yun S. Song

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Rasmus Nielsen or Yun S. Song.

Glossary

Likelihoods

Functions expressing the probability of observing the data — for example, next-generation sequencing data — given a parameter, such as a genotype or an allele frequency.

Posterior probabilities

In this context, these are the probabilities of a particular genotype given observed data: they are calculated by incorporating the information from the next-generation sequencing data as well as some prior information.

Hashing

A procedure for creating a data structure that helps to accelerate alignment. The structure records which reads contain a particular substring or subsequence, or where in the reference genome it occurs. Some hash-based aligners hash the reads, whereas others hash the reference genome.
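As a rough illustration of the reference-hashing variant, the sketch below indexes every k-mer of a toy reference and uses the first k bases of a read as a seed. It is not the algorithm of any particular aligner; the function names, the seed length and the mismatch threshold are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align(read, reference, index, k, max_mismatches=2):
    """Look up the read's first k bases, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = hamming(read, window)
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits

# Toy reference and a read carrying one mismatch relative to it.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, 4)
hits = align("TAGCCGTT", reference, index, 4)
print(hits)
```

The lookup replaces a scan of the whole reference with a single dictionary access, so only a handful of candidate positions need a base-by-base comparison.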

Paired-end reads

Sequencing of both the forward and reverse template of a DNA molecule, which is possible by inserting a primer sequence between the two ends of the read. The use of this technique greatly helps to increase assembly and alignment accuracy.

CEU individuals

One of the 11 populations in HapMap phase three. It consists of Utah residents with Northern and Western European ancestry from the Centre d'Etude du Polymorphisme Humain (CEPH) collection.

Bayes' formula

A mathematical expression showing that a posterior probability can be found as the prior probability multiplied by the likelihood divided by a constant.
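A small worked example may help here. The sketch assumes genotypes are coded as the number of alternate alleles (0, 1 or 2) and that the prior comes from Hardy-Weinberg proportions at a known allele frequency; both the coding and the numbers are illustrative assumptions rather than values from the Review.

```python
def hwe_prior(freq_alt):
    """Hardy-Weinberg prior over genotypes coded as the count of alternate alleles."""
    p = 1 - freq_alt
    return {0: p * p, 1: 2 * p * freq_alt, 2: freq_alt * freq_alt}

def posterior(likelihoods, prior):
    """Bayes' formula: prior times likelihood, divided by a normalizing constant."""
    unnormalized = {g: likelihoods[g] * prior[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Hypothetical genotype likelihoods P(data | genotype) at one site.
likelihoods = {0: 0.010, 1: 0.125, 2: 0.0001}
post = posterior(likelihoods, hwe_prior(0.05))
call = max(post, key=post.get)
print(call, post)
```

Although the data favour the heterozygote by more than tenfold, the rare-allele prior pulls a substantial share of the posterior probability back to the reference homozygote, so the call remains uncertain.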

Correlated errors

Errors that do not occur independently of each other. An error that is observed in one position might increase the chance of observing another error in a neighbouring position.

Prior probability

In the context of this Review, the probability of a genotype calculated without incorporating information from the next-generation sequencing data. Prior probabilities can be obtained from a set of reference data.

Maximum likelihood

The statistical principle of estimating a parameter by finding the value of the parameter that maximizes the likelihood function.

Imputation

The substitution of some value for a missing data point. In this context, it is the use of a set of reference haplotypes to infer a genotype for an individual, when data are missing or incomplete.

Likelihood ratio test

A method for testing statistical hypotheses based on comparing the maximum likelihood under two different models. In this context, the allele frequency in one model equals zero, whereas the frequency in the second model might be larger than zero.
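The test can be sketched for a single site as follows. The sketch assumes per-individual genotype likelihoods for 0, 1 or 2 copies of the alternate allele, Hardy-Weinberg proportions and a simple grid search for the maximum likelihood estimate; these choices, the function names and the input values are assumptions for illustration. The likelihood values must be strictly positive for the logarithms to be defined.

```python
import math

def log_likelihood(f, gls):
    """Log-likelihood of alternate allele frequency f, summing over unknown genotypes."""
    p = 1 - f
    prior = (p * p, 2 * p * f, f * f)  # Hardy-Weinberg genotype proportions
    return sum(math.log(sum(l * w for l, w in zip(gl, prior))) for gl in gls)

def mle_frequency(gls, steps=1000):
    """Maximum likelihood estimate of f by grid search over [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda f: log_likelihood(f, gls))

def likelihood_ratio_statistic(gls):
    """Twice the log-likelihood difference between f = f_hat and f = 0."""
    f_hat = mle_frequency(gls)
    return f_hat, 2 * (log_likelihood(f_hat, gls) - log_likelihood(0.0, gls))

# Hypothetical genotype likelihoods (hom-ref, het, hom-alt) for four individuals.
gls = [
    (0.90, 0.05, 0.001),
    (0.02, 0.50, 0.010),
    (0.80, 0.10, 0.001),
    (0.85, 0.08, 0.001),
]
f_hat, lrt = likelihood_ratio_statistic(gls)
print(f_hat, lrt)
```

A large statistic supports the model in which the site is variable. Choosing a significance threshold is left out of the sketch, because the null value of zero lies on the boundary of the parameter space and the reference distribution needs care.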

About this article

DOI

https://doi.org/10.1038/nrg2986
