A general approach to single-nucleotide polymorphism discovery


Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits (ref. 1). The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2, 3, 4, 5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence (refs 6, 7) as a template on which to layer often unmapped, fragmentary sequence data (refs 8, 9, 10, 11) and to use base quality values (ref. 12) to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
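To make the role of base quality values concrete, the following is a minimal sketch of a Bayesian polymorphism score for a single alignment column. It is not the POLYBAYES implementation and does not reproduce the published formula; it is a simplified illustration under stated assumptions. The function names (`phred_to_error`, `base_likelihood`, `snp_posterior`), the default prior of 0.003, the uniform spread of the polymorphic prior across all non-monomorphic assignments, and the equal split of an error among the three alternative bases are all assumptions made for this example. The only standard ingredient is the Phred convention that a quality value q corresponds to an error probability of 10^(-q/10).

```python
from itertools import product

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled quality value into an error probability."""
    return 10 ** (-q / 10)


def base_likelihood(called, true_base, q):
    """P(called base | true base), assuming errors fall equally on the
    three other nucleotides (an illustrative assumption)."""
    e = phred_to_error(q)
    return 1 - e if called == true_base else e / 3


def snp_posterior(column, prior_poly=0.003):
    """Posterior probability that one alignment column is polymorphic.

    column     -- list of (called_base, phred_quality), one entry per read
    prior_poly -- hypothetical prior probability of a polymorphic site

    Every assignment of true bases to the reads is enumerated (4**n of
    them), so this sketch is only practical for shallow columns.
    """
    n = len(column)
    if n < 2:
        return 0.0  # a single read cannot show disagreement
    n_mono = 4
    n_poly = 4 ** n - n_mono
    mono_prior = (1 - prior_poly) / n_mono  # per monomorphic assignment
    poly_prior = prior_poly / n_poly        # per polymorphic assignment

    mono = poly = 0.0
    for truth in product(BASES, repeat=n):
        lik = 1.0
        for (called, q), t in zip(column, truth):
            lik *= base_likelihood(called, t, q)
        if len(set(truth)) == 1:
            mono += mono_prior * lik
        else:
            poly += poly_prior * lik
    return poly / (mono + poly)


if __name__ == "__main__":
    confident = [("A", 40), ("A", 40), ("G", 40), ("G", 40)]
    doubtful = [("A", 40), ("A", 40), ("A", 40), ("G", 5)]
    print(snp_posterior(confident))
    print(snp_posterior(doubtful))
```

In this sketch a disagreement supported by high-quality base calls yields a posterior close to 1, whereas a lone discrepant call with a low quality value is explained more cheaply as a sequencing error and scores far below the prior threshold one would use to report a candidate SNP.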

Figure 1: Application of the POLYBAYES procedure to EST data.
Figure 2: Paralogue discrimination.
Figure 3: SNP probability scores.
Figure 4: Sensitivity of the SNP detection algorithm.
Figure 5: SNP detection with assembled shotgun genomic reference sequence.

  1. Collins, F.S., Guyer, M.S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).

  2. Wang, D.G. et al. Large-scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).

  3. Taillon-Miller, P., Gu, Z., Hillier, L. & Kwok, P.-Y. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8, 748–754 (1998).

  4. Picoult-Newberg, L. et al. Mining SNPs from EST databases. Genome Res. 9, 167–174 (1999).

  5. Buetow, K.H., Edmondson, M.N. & Cassidy, A.B. Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet. 21, 323–325 (1999).

  6. The Sanger Centre & The Washington University Genome Sequencing Center. Toward a complete human genome sequence. Genome Res. 8, 1097–1108 (1998).

  7. Venter, J.C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).

  8. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828 (1996).

  9. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373–380 (1993).

  10. Hudson, T.J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).

  11. Marra, M., Weinstock, L.A. & Mardis, E.R. End sequence determination from large insert clones using energy transfer fluorescent primers. Genome Res. 6, 1118–1122 (1996).

  12. Durbin, R. & Dear, S. Base qualities help sequencing software. Genome Res. 8, 161–162 (1998).

  13. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated traces using Phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).

  14. Ewing, B. & Green, P. Base-calling of automated traces using Phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).

  15. Bayes, T. An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. 53, 370–418 (1763). Reprinted in Biometrika 45, 293–315 (1958).

  16. Aaronson, J. et al. Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 6, 829–845 (1996).

  17. Kwok, P.-Y., Carlson, C., Yager, T., Ankener, W. & Nickerson, D.A. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23, 138–144 (1994).

  18. Taillon-Miller, P. et al. The homozygous complete hydatidiform mole: a unique resource for genome studies. Genomics 46, 307–310 (1997).

  19. Collins, F.S. et al. New goals for the U.S. Human Genome Project: 1998–2003. Science 282, 682–689 (1998).

  20. Nickerson, D.A. et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet. 19, 233–240 (1998).

  21. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).

  22. Halushka, M.K. et al. Patterns of single-nucleotide polymorphisms in candidate genes regulating blood-pressure homeostasis. Nature Genet. 22, 239–247 (1999).

  23. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).



We thank T. Blackwell and S. Eddy for informative discussions during the development of the mathematical framework of the technique. This work was supported by NIH grants P50HG01458 (L.H. and W.R.G.), R01HG1720 (P.-Y.K.) and T32AR07284 (Z.G.), and an equipment loan from Compaq Computer Corporation.

Author information

Corresponding authors

Correspondence to Gabor T. Marth or Pui-Yan Kwok.

About this article

Cite this article

Marth, G., Korf, I., Yandell, M. et al. A general approach to single-nucleotide polymorphism discovery. Nat Genet 23, 452–456 (1999).
