A general approach to single-nucleotide polymorphism discovery

Marth, Gabor T.; Korf, Ian; Yandell, Mark D.; Yeh, Raymond T.; Gu, Zhijie; Zakeri, Hamideh; Stitziel, Nathan O.; Hillier, LaDeana; Kwok, Pui-Yan; Gish, Warren R.

doi:10.1038/70570

Letter
Published: December 1999

Gabor T. Marth¹,
Ian Korf¹,
Mark D. Yandell¹,
Raymond T. Yeh¹,
Zhijie Gu²,
Hamideh Zakeri²,
Nathan O. Stitziel¹,
LaDeana Hillier¹,
Pui-Yan Kwok² &
Warren R. Gish¹

Nature Genetics volume 23, pages 452–456 (1999)

Abstract

Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits¹. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2–5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence⁶,⁷ as a template on which to layer often unmapped, fragmentary sequence data⁸⁻¹¹ and to use base quality values¹² to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
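The idea of weighing base quality values in a Bayesian calculation can be illustrated with a minimal sketch. This is not the POLYBAYES implementation, whose model is not given in this abstract; it is a deliberately simplified two-hypothesis calculation for a single alignment column. The function names, the uniform spread of errors over the three other bases, the equal-mixture two-allele model and the prior of 0.003 are all assumptions made for illustration. Only the Phred conversion, error probability = 10^(-Q/10), is standard.

```python
from itertools import combinations

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled base quality Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def base_likelihood(observed, true_base, error):
    # Probability of the observed call given the true base. Assumption for
    # this sketch: a miscall is equally likely to be any of the other 3 bases.
    return 1 - error if observed == true_base else error / 3


def polymorphism_posterior(calls, prior=0.003):
    """Posterior probability that one alignment column is polymorphic.

    calls: list of (base, phred_quality) pairs, one per aligned read.
    prior: assumed prior probability of a polymorphic site (hypothetical value).
    """
    errs = [(b, phred_to_error(q)) for b, q in calls]

    # Hypothesis 1 (monomorphic): every read derives from one true base,
    # with a uniform prior over the four bases.
    mono = 0.0
    for allele in BASES:
        p = 1.0
        for b, e in errs:
            p *= base_likelihood(b, allele, e)
        mono += p / len(BASES)

    # Hypothesis 2 (polymorphic): two alleles segregate, and each read is
    # assumed equally likely to derive from either one.
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a1, a2 in pairs:
        p = 1.0
        for b, e in errs:
            p *= 0.5 * base_likelihood(b, a1, e) + 0.5 * base_likelihood(b, a2, e)
        poly += p / len(pairs)

    # Bayes' rule over the two hypotheses.
    numerator = prior * poly
    return numerator / (numerator + (1 - prior) * mono)


if __name__ == "__main__":
    confident_split = [("A", 40)] * 3 + [("G", 40)] * 3
    one_poor_mismatch = [("A", 40)] * 5 + [("G", 5)]
    print(polymorphism_posterior(confident_split))
    print(polymorphism_posterior(one_poor_mismatch))
```

Under these assumptions, a column in which high-quality reads disagree yields a posterior close to 1, whereas a single low-quality mismatch is explained away as a likely sequencing error. No depth cutoff is needed, because each read contributes in proportion to its quality.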

**Figure 1: Application of the POLYBAYES procedure to EST data.**

**Figure 4: Sensitivity of the SNP detection algorithm.**

**Figure 5: SNP detection with assembled shotgun genomic reference sequence.**

Accession codes

Accessions

GenBank/EMBL/DDBJ

References

1. Collins, F.S., Guyer, M.S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).
2. Wang, D.G. et al. Large-scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).
3. Taillon-Miller, P., Gu, Z., Hillier, L. & Kwok, P.-Y. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8, 748–754 (1998).
4. Picoult-Newberg, L. et al. Mining SNPs from EST databases. Genome Res. 9, 167–174 (1999).
5. Buetow, K.H., Edmondson, M.N. & Cassidy, A.B. Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet. 21, 323–325 (1999).
6. The Sanger Centre & The Washington University Genome Sequencing Center. Toward a complete human genome sequence. Genome Res. 8, 1097–1108 (1998).
7. Venter, J.C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
8. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828 (1996).
9. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373–380 (1993).
10. Hudson, T.J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).
11. Marra, M., Weinstock, L.A. & Mardis, E.R. End sequence determination from large insert clones using energy transfer fluorescent primers. Genome Res. 6, 1118–1122 (1996).
12. Durbin, R. & Dear, S. Base qualities help sequencing software. Genome Res. 8, 161–162 (1998).
13. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated traces using Phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
14. Ewing, B. & Green, P. Base-calling of automated traces using Phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
15. Bayes, T. An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. 53, 370–418 (1763). Reprinted in Biometrika 45, 293–315 (1958).
16. Aaronson, J. et al. Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 6, 829–845 (1996).
17. Kwok, P.-Y., Carlson, C., Yager, T., Ankener, W. & Nickerson, D.A. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23, 138–144 (1994).
18. Taillon-Miller, P. et al. The homozygous complete hydatidiform mole: a unique resource for genome studies. Genomics 46, 307–310 (1997).
19. Collins, F.S. et al. New goals for the U.S. Human Genome Project: 1998–2003. Science 282, 682–689 (1998).
20. Nickerson, D.A. et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet. 19, 233–240 (1998).
21. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
22. Halushka, M.K. et al. Patterns of single-nucleotide polymorphisms in candidate genes regulating blood-pressure homeostasis. Nature Genet. 22, 239–247 (1999).
23. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).

Acknowledgements

We thank T. Blackwell and S. Eddy for informative discussions during the development of the mathematical framework of the technique. This work was supported by NIH grants P50HG01458 (L.H. and W.R.G.), R01HG1720 (P.-Y.K.) and T32AR07284 (Z.G.), and an equipment loan from Compaq Computer Corporation.

Author information

Authors and Affiliations

Washington University Department of Genetics and Genome Sequencing Center, St. Louis, Missouri, USA
Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Nathan O. Stitziel, LaDeana Hillier & Warren R. Gish
Washington University Division of Dermatology, St. Louis, Missouri, USA
Zhijie Gu, Hamideh Zakeri & Pui-Yan Kwok

Corresponding authors

Correspondence to Gabor T. Marth or Pui-Yan Kwok.

About this article

Cite this article

Marth, G., Korf, I., Yandell, M. et al. A general approach to single-nucleotide polymorphism discovery. Nat Genet 23, 452–456 (1999). https://doi.org/10.1038/70570

Received: 17 August 1999
Accepted: 18 October 1999
Issue Date: December 1999
DOI: https://doi.org/10.1038/70570
