Nature Genetics
23, 452 - 456 (1999)
doi:10.1038/70570
A general approach to single-nucleotide polymorphism discovery
Gabor T. Marth1, Ian Korf1, Mark D. Yandell1, Raymond T. Yeh1, Zhijie Gu2, Hamideh Zakeri2, Nathan O. Stitziel1, LaDeana Hillier1, Pui-Yan Kwok2 & Warren R. Gish1
1 Washington University Department of Genetics and Genome Sequencing Center, St. Louis, Missouri, USA.
2 Washington University Division of Dermatology, St. Louis, Missouri, USA.
Correspondence should be addressed to Gabor T. Marth (gmarth@watson.wustl.edu) or Pui-Yan Kwok (kwok@im.wustl.edu).

Single-nucleotide polymorphisms (SNPs) are the most abundant form of human
genetic variation and a resource for mapping complex genetic traits (ref. 1).
The large volume of data produced by high-throughput sequencing projects is
a rich and largely untapped source of SNPs (refs 2, 3, 4, 5). We present here a unified approach to the discovery
of variations in genetic sequence data of arbitrary DNA sources. We propose
to use the rapidly emerging genomic sequence (refs 6, 7) as a template
on which to layer often unmapped, fragmentary sequence data (refs 8, 9, 10, 11)
and to use base quality values (ref. 12) to discern true allelic variations
from sequencing errors. By taking advantage of the genomic sequence we are
able to use simpler yet more accurate methods for sequence organization: fragment
clustering, paralogue identification and multiple alignment. We analyse these
sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate
the probability that a given site is polymorphic. Rigorous treatment of base
quality permits completely automated evaluation of the full length of all
sequences, without limitations on alignment depth. We demonstrate this approach
by accurate SNP predictions in human ESTs aligned to finished and working-draft
quality genomic sequences, a data set representative of the typical challenges
of sequence-based SNP discovery.
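
The role of base quality values in separating allelic variation from sequencing error can be illustrated with a minimal sketch. It is not the POLYBAYES implementation, whose model is not specified in this abstract. It assumes a toy two-hypothesis model for a single alignment column (monomorphic versus biallelic with equal allele frequencies), independent reads, errors spread evenly over the three alternative bases, and a prior polymorphism rate of 0.003. The function names `phred_to_error`, `base_likelihood` and `snp_posterior` and the prior value are hypothetical choices made for illustration. The only fixed ingredient is the standard relation between a Phred-scaled quality Q and the error probability, P = 10^(-Q/10).

```python
from itertools import combinations

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled base quality Q to an error probability."""
    return 10 ** (-q / 10)


def base_likelihood(observed, true_base, error):
    """P(observed base | true base), with errors spread evenly over 3 bases."""
    return 1.0 - error if observed == true_base else error / 3.0


def snp_posterior(column, prior=0.003):
    """Posterior probability that one alignment column is polymorphic.

    column: list of (base, phred_quality) pairs, one per aligned read.
    prior:  assumed prior probability of a polymorphic site (illustrative).
    """
    calls = [(base, phred_to_error(q)) for base, q in column]

    # Hypothesis 1: monomorphic site, true base unknown (uniform over ACGT).
    mono = 0.0
    for true_base in BASES:
        lik = 1.0
        for obs, err in calls:
            lik *= base_likelihood(obs, true_base, err)
        mono += lik / len(BASES)

    # Hypothesis 2: biallelic site; each read samples either allele
    # with probability 0.5 (a simplifying assumption).
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        lik = 1.0
        for obs, err in calls:
            lik *= 0.5 * base_likelihood(obs, a, err) \
                 + 0.5 * base_likelihood(obs, b, err)
        poly += lik / len(pairs)

    # Bayes' rule over the two hypotheses.
    numerator = prior * poly
    return numerator / (numerator + (1.0 - prior) * mono)


if __name__ == "__main__":
    confident = [("A", 40)] * 3 + [("G", 40)] * 3
    doubtful = [("A", 40)] * 5 + [("G", 5)]
    print("three high-quality G reads:", snp_posterior(confident))
    print("one low-quality G read:   ", snp_posterior(doubtful))
```

Under these assumptions, a discrepant base with a low quality value is largely explained as a sequencing error, whereas several high-quality discrepant bases push the posterior towards the polymorphic hypothesis. Because every read is weighted by its own quality value, the calculation applies to columns of any alignment depth.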