SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries


High-density single-nucleotide polymorphism (SNP) arrays have revolutionized the ability of genome-wide association studies to detect genomic regions harboring sequence variants that affect complex traits. Extensive numbers of validated SNPs with known allele frequencies are essential to construct genotyping assays with broad utility. We describe an economical, efficient, single-step method for SNP discovery, validation and characterization that uses deep sequencing of reduced representation libraries (RRLs) from specified target populations. Using nearly 50 million sequences generated on an Illumina Genome Analyzer from DNA of 66 cattle representing three populations, we identified 62,042 putative SNPs and predicted their allele frequencies. Genotype data for these 66 individuals validated 92% of 23,357 selected genome-wide SNPs, with a genotypic and sequence allele frequency correlation of r = 0.67. This approach for simultaneous de novo discovery of high-quality SNPs and population characterization of allele frequencies may be applied to any species with at least a partially sequenced genome.
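The abstract compares allele frequencies predicted from sequence reads with those derived from genotypes and summarizes their agreement as a correlation coefficient. The sketch below illustrates that kind of comparison only; it is not the authors' discovery algorithm, which is not described in this excerpt. It assumes the simplest possible estimator (the fraction of reads carrying the alternate allele at a site), and the function names and input encoding are hypothetical.

```python
from math import sqrt


def allele_frequency_from_reads(ref_reads, alt_reads):
    """Estimate alternate-allele frequency as the fraction of reads carrying it.

    Assumption: every read is an independent draw from the pooled DNA,
    which real sequencing data only approximates.
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def allele_frequency_from_genotypes(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0, 1 or 2
    (the number of alternate alleles each individual carries)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    return sum(genotypes) / (2 * len(genotypes))


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of frequencies."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Applied per SNP, the first two functions would give one sequence-derived and one genotype-derived frequency each; passing the two resulting lists to `pearson_r` gives a single agreement statistic of the kind reported in the abstract.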

Figure 1: Distribution of sequence and genotype derived allele frequencies (r = 0.67) in the SNP discovery populations.



J.F.T. and R.D.S. were supported by National Research Initiative grants 2005-35205-15448, 2005-35604-15615, 2006-35205-16701 and 2006-35616-16697 from the US Department of Agriculture Cooperative State Research, Education and Extension Service. C.P.V.T., T.S.S., and L.K.M. were supported by National Research Initiative grant 2006-35205-16888 from the US Department of Agriculture Cooperative State Research, Education, and Extension Service and by Projects 1265-31000-081D and 1265-31000-090-00D from the United States Department of Agriculture Agricultural Research Service. T.P.L.S. was supported by Project 5438-31000-073D from the US Department of Agriculture Agricultural Research Service. L.K.M. was also supported by National Research Initiative grant 2006-35205-17878 from the US Department of Agriculture Cooperative State Research, Education and Extension Service. We gratefully acknowledge the early prepublication access under the Fort Lauderdale conventions to the draft bovine genome sequence provided by the Baylor College of Medicine Human Genome Sequencing Center and the Bovine Genome Sequencing Project Consortium.

Author information

C.P.V.T. and L.K.M. developed and implemented the SNP discovery algorithm; J.F.T. and C.P.V.T. performed SNP discovery modeling; C.P.V.T., T.S.S. and L.K.M. performed in silico genome analysis; W.C.W. suggested the reduced representation strategy; T.P.L.S. constructed the RRLs; J.F.T., R.D.S., T.P.L.S. and T.S.S. identified cows for DNA pools; T.S.S. managed the DNA collection; C.T.L. genotyped the discovery animals and managed the assay synthesis; C.D.H. sequenced the RRLs; L.K.M., T.S.S., R.D.S., S.S.M. and T.P.L.S. conducted pilot validations; and C.P.V.T., J.F.T., T.S.S. and T.P.L.S. coordinated manuscript writing and editing.

Corresponding author

Correspondence to Curtis P Van Tassell.

Ethics declarations

Competing interests

C.T.L. and C.D.H. are employees of Illumina, Inc.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–3, Supplementary Table 1, Supplementary Methods (PDF 1163 kb)

About this article

Cite this article

Van Tassell, C., Smith, T., Matukumalli, L. et al. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods 5, 247–252 (2008).
