DNA variation and the future of human genetics
Nature Biotechnology 16, 33–39 (1998). doi:10.1038/nbt0198-33
AU  - Schafer, Alan J.
AU  - Hawkins, J. Ross
1Hexagen, 214 Cambridge Science Park, Milton Rd., Cambridge CB4 4WA, UK.
2Box 116, Department of Pediatrics, University of Cambridge, Addenbrooke's Hospital, Cambridge, CB2 2QQ, UK.
*e-mail: alan.schafer@hexagen.co.uk

The use of DNA variants in the mapping of the human genome and in the positional cloning of monogenic disease genes is well established. Determining the genetic bases of the more common "multifactorial" diseases, however, presents a major challenge. The genetics of these diseases are complicated by the interplay between many genes and the environment. These investigations will require large numbers of DNA markers and the technology to screen large populations with these markers. The systematic identification of the common DNA polymorphisms in the human genome, coupled with the development of high-throughput screening methods, should ultimately allow the elucidation of the genetic component of most clinical and nonclinical phenotypes.

Keywords: genomics, genetic mapping, polymorphism
It is the phenotypic differences between individuals that drive genetic studies, but it is the ability to detect polymorphism at the DNA level that has so profoundly changed human genetic analysis and promises to continue to do so. DNA sequence variants are used in all aspects of genetic investigation, including evolutionary and population structure studies, forensics, and the analysis and diagnosis of genetic disease. DNA sequence polymorphisms result from mutation, and the changes may or may not have functional consequences. Some mutations in a single gene alter function sufficiently to result in disease (monogenic disease), but many of the common human diseases appear to be polygenic, the result of complex interactions of multiple genes. In these cases, the alteration of a single gene may not be detrimental but, in combination with certain variants of other genes, may contribute to a disease phenotype. The variant genes might be sufficient to cause a disease phenotype, but in many cases an environmental component such as smoking, nutrition, or infection is also required.
DNA variants leading to monogenic diseases are usually rare in a population due to the process of natural selection. As variants in genes involved in polygenic disease do not act alone to produce the phenotype, selection against them will only occur when they are present in the disease-causing combination. Thus, these variants may exist at a high frequency in the population. Neutral DNA variants are under no selective pressure and occur at variable frequencies within populations as a result of genetic drift. Sequence variants that are present at a frequency of less than 1% in a population are arbitrarily designated as mutations, and those at a higher frequency are referred to as polymorphisms.
One of the greatest impacts of DNA polymorphism detection has been in its use to assay markers for mapping, cloning, and identification of disease genes. The immediate goal is to identify the causative or contributory DNA sequence variation that underlies all human phenotypes that have a genetic component. The current challenge is to most effectively use emerging technologies, along with abundant sequence information, to identify DNA polymorphisms and use them to elucidate the genetic components of complex human disease.
Molecular polymorphism. The first molecular polymorphism to be observed and analyzed in humans was the ABO blood group polymorphism, identified by Landsteiner in 1900 (ref. 1). Although its basis was obscure at the time, the antigenic variation reflects DNA sequence polymorphism, and we now know that the variability of the A, B, and O alleles is due to a few single-base DNA substitutions2. Initially, the study of molecular polymorphism was largely limited to the field of immunogenetics until the development of starch gel electrophoresis by Smithies in 1955, which allowed polymorphism studies to be extended to a wider selection of proteins3. Electrophoretic surveys of a variety of plasma proteins and red-cell enzymes indicated that protein variation between individuals, and thus genetic polymorphism, was not infrequent. The development of the Southern blot in 1975 made it possible to examine DNA polymorphism directly, and to analyze variation both within and outside gene coding regions4. The first DNA variants were detected in the late 1970s as restriction fragment length polymorphisms (RFLPs) on Southern blots of genomic DNA. The difference in restriction fragment size was due to the cleavage or noncleavage of the DNA at particular sites, caused by single nucleotide polymorphisms (SNPs) creating or abolishing the restriction enzyme recognition sites. SNPs are distributed at different densities across the genome, for example, at lower density in protein-coding than in noncoding DNA. The actual number of heterozygous SNPs in an individual genome is not known, with estimates ranging from 0.5 to 10 SNPs per 1000 base pairs when comparing any two chromosomes5,6. Restriction enzymes only recognize specific short sequence combinations (usually 4–6 bases in length) and, as a result, RFLPs represent only a small subset of DNA sequence variability. A second class of common RFLP was recognized by Jeffreys in 1985 (ref. 7), in which the restriction fragment length variability is due to a "variable number of tandem repeats" (VNTR)8. The likeness of these repeats to the much larger satellite DNA repeats led to them being named minisatellites. In contrast to the conventional class of RFLP, which generally has only two alleles, the minisatellites show tremendous size variability and can have hundreds (or more) of alleles per locus. Not long after the discovery of minisatellites, Weber identified another subclass of VNTR polymorphism in which the repeat unit consisted of only two base pairs9. These dinucleotide repeats or "microsatellites" are present in many copies in the genome (perhaps several thousand), show a high level of length variability per locus, and are distributed approximately randomly throughout the genome. Unlike minisatellites, microsatellites are easily scored by the polymerase chain reaction (PCR). The combination of number, variability, and relative ease of scoring has made the microsatellites excellent for genetic analysis (see below). It is these markers that have been the workhorse of mammalian genetic analysis in recent years.
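The way a single-base change creates or abolishes a restriction site, and so changes the fragment pattern seen as an RFLP, can be illustrated with a minimal Python sketch. The two allele sequences and the function name are invented for illustration; only the EcoRI recognition sequence (GAATTC, cut after the first G) is a real property of the enzyme.

```python
def fragment_lengths(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths after cutting a linear sequence at every
    occurrence of `site`; the cut falls `cut_offset` bases into the site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


ECORI_SITE = "GAATTC"  # EcoRI cuts between G and AATTC, hence offset 1

# Hypothetical alleles differing at one base (A -> G inside the site).
allele_1 = "ATGCCGTAGAATTCCGGATTACGT"  # site present
allele_2 = "ATGCCGTAGAGTTCCGGATTACGT"  # site abolished by the SNP

for name, allele in (("allele 1", allele_1), ("allele 2", allele_2)):
    print(name, fragment_lengths(allele, ECORI_SITE, 1))
```

Allele 1 is cut into two fragments, whereas allele 2 remains a single fragment; on a Southern blot this difference would appear as bands of different size.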
DNA polymorphisms in family-based genetic studies. Genes on a common chromosome are physically linked. In principle, two alleles of two loci on the same chromosome should be co-transmitted from one generation to the next, and one can serve as a marker for the other. In practice, the two loci may be far apart, and meiotic recombination between homologous pairs of chromosomes leads to segregation of the alleles to different germ cells, so that an individual may inherit a new combination of alleles (Fig. 1). As the probability of recombination occurring in a region between two loci is a function of the physical distance between the loci, linkage mapping can be performed by testing the frequency of co-segregation of markers within large pedigrees. A likelihood of less than about 1 in 2000 of coincidental co-inheritance of the two markers is generally accepted as proof of linkage10. Linkage mapping originally was performed using overt phenotypes in organisms, but later was performed using antigenic variants and protein isoforms as molecular markers, with the first evidence of linkage to a human disease (myotonic dystrophy) occurring in 1954 (ref. 11).
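Evidence for linkage of this kind is conventionally summarized as a LOD score, the base-10 logarithm of the ratio between the likelihood of the data at a recombination fraction θ and the likelihood under free recombination (θ = 0.5). The sketch below is a minimal illustration for the simplest case of phase-known, fully informative meioses; the function name and the example counts are assumptions, and real pedigree analysis requires specialized software.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score for phase-known meioses at recombination fraction theta.

    Compares L(theta) = theta^r * (1 - theta)^(n - r) with L(0.5) = 0.5^n.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    nonrecombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical data: 2 recombinants observed in 20 informative meioses.
for theta in (0.05, 0.10, 0.20, 0.50):
    print(f"theta={theta:.2f}  LOD={lod_score(2, 20, theta):.2f}")
```

By widely used convention (not stated in the text above), a LOD score of 3 or more, corresponding to odds of 1000:1 in favor of linkage, is taken as significant evidence.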
Figure 1. Meiotic segregation of DNA variants. A/a and B/b are alleles at two loci. A and B are physically linked on the one chromosome, and a and b are physically linked on the other. Recombination between chromosome pairs during meiosis can result in chromosomes with A and b on one chromosome, and a and B on the other. The probability of a recombination event occurring between the two loci is a function of distance, with recombination occurring less frequently between close loci. When recombination fails to cause segregation of the alleles, they are said to be in "linkage disequilibrium."
In a seminal proposal, Botstein and colleagues outlined the theoretical utility of DNA sequence variation, in the manner of alleles of genes, as markers for linkage studies12. To detect recombination, polymorphic markers at two loci are necessary for the different chromosomal origins to be recognized. The authors (correctly) asserted that RFLPs, tested in a sufficiently large number of individuals, would allow genetic mapping of a gene responsible for a phenotypic trait, thus beginning the era of DNA polymorphism–based mapping.
Classical linkage analyses require exact specification of the mode of inheritance of the loci. This method works well for simple Mendelian traits because there are few allowable models, and these are easily tested. In many cases, however, a precise model that adequately explains the mode of inheritance of a phenotype cannot be derived13. Alternative linkage analysis methods therefore have been developed that do not require assumptions about the model or mode of inheritance of the phenotype being studied. This nonparametric approach of allele-sharing analysis avoids the requirement for a model. Rather, it uses DNA polymorphisms to search for chromosomal regions that are identical between affected relatives, with the expectation that allele-sharing frequencies will be higher for a marker that is closely linked to a disease allele. One version of this approach, sib pair analysis14,15, is based on the frequencies at which pairs of siblings are alike for two markers being studied. Multiple affected sibships are tested at a large number of loci to determine how often they share a common ancestral allele. These alleles are said to be identical-by-descent (IBD) and indicate linkage to the disease when co-inherited at a higher frequency than would be expected by chance13. Although most IBD studies have used affected sib-pairs, alternative versions of IBD analysis allow the use of affected relative pairs other than siblings16. Using affected individuals exclusively is particularly advantageous for diseases that exhibit incomplete penetrance of a phenotype or have a delayed disease onset. However, the occurrence of phenocopies (where the same phenotype is mediated by variation at other loci) drastically reduces the power of the affected pairs methods17. IBD analysis has been applied to recessive monogenic diseases such as Hallervorden-Spatz syndrome and Bloom's syndrome to reveal the map position and gene, respectively18,19. The power of IBD analysis is demonstrated by its ability to map multiple disease susceptibility loci in complex polygenic diseases, such as type 1 diabetes20,21.
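The comparison of observed allele sharing with chance expectation can be made concrete with a simple "mean test" on affected sib pairs. In the absence of linkage, a sib pair shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, so the expected proportion of alleles shared is 0.5. The following Python sketch is one common form of the test, not necessarily the one used in the cited studies; the function name and the counts are hypothetical.

```python
import math


def mean_ibd_test(pairs_sharing_0: int, pairs_sharing_1: int,
                  pairs_sharing_2: int) -> tuple[float, float]:
    """Return (mean proportion of alleles shared IBD, z statistic).

    Under no linkage the proportion shared per pair has mean 0.5 and
    variance 0.125 (number shared ~ Binomial(2, 0.5), divided by 2).
    """
    n_pairs = pairs_sharing_0 + pairs_sharing_1 + pairs_sharing_2
    if n_pairs == 0:
        raise ValueError("at least one sib pair is required")
    mean_share = (pairs_sharing_1 + 2 * pairs_sharing_2) / (2 * n_pairs)
    z = (mean_share - 0.5) / math.sqrt(0.125 / n_pairs)
    return mean_share, z


# Hypothetical marker: 100 affected sib pairs sharing 0, 1, or 2 alleles IBD.
mean_share, z = mean_ibd_test(15, 50, 35)
print(f"mean sharing = {mean_share:.2f}, z = {z:.2f}")
```

A large positive z indicates excess sharing at the marker, which is the signal of linkage that the affected sib-pair design looks for.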
Linkage methods are most effective for monogenic diseases, and there have been some spectacular successes such as cloning of genes responsible for cystic fibrosis, fragile-X syndrome, and Huntington's chorea22–25. Linkage approaches have been widely applied to polygenic diseases10,26 but have been of limited success in some of these diseases, such as multiple sclerosis and schizophrenia27–29. There are various explanations for the lack of detectable linkage. In both monogenic and polygenic disease, genes exist that modify the degree of the disease phenotype and may interfere with linkage analysis. It is not possible to control environmental factors strictly, so environment-gene interactions that cause or augment a disease can confound linkage studies. Additional difficulties arise with poor phenotype definition, delayed disease onset (leading to misassignment of disease status), and occurrence of phenocopies.
The role of linkage analysis is to provide genetic map information, facilitating gene cloning with no prior knowledge of the biochemical function of a gene. Physical cloning of a linkage region provides DNA sequences in which to identify candidate genes for the phenotype30. These genes are then tested to identify DNA variants that affect gene function and lead to disease. Hence this approach to identifying genes responsible for a phenotype is called positional cloning31. The identification of genes within a cloned region, and subsequent polymorphism testing of these genes in many individuals, is tedious and laborious. Narrowing a phenotype locus to as small a region as possible, prior to cloning, greatly reduces the work necessary to identify the underlying (causative) genetic variant.
Box 1. Variation detection techniques       
I. Variation scanning methods
Established methods
Single Strand Conformation Polymorphism analysis (SSCP)60. PCR products from the region to be tested are heat denatured and rapidly cooled to impede reassociation of complementary strands. The single strands form sequence dependent conformations that influence gel mobility. The same region of DNA is compared between individuals and differential mobilities indicate sequence differences that exist between the individuals in this region.
Heteroduplex Analysis (HA)61. The DNA region to be tested is amplified, denatured, and renatured to itself or wild-type DNA. Heteroduplexes between different alleles contain DNA "bubbles" at mismatched basepairs that can affect mobility through a gel. The same region of DNA is compared between individuals and differential mobilities indicate sequence differences.
Denaturing Gradient Gel Electrophoresis (DGGE)62. A gel-based system that examines the point at which double-stranded DNA melts into single-strand DNA, and which varies between SNP alleles. Recent refinements of the method have focused on the replacement of chemical denaturants in the gel with a temperature gradient as the DNA denaturant63.
DNA Sequencing. A gel-based system in which the base at each position of a DNA fragment is characterized. Heterozygous changes appear as two bases at a single position, and homozygous variants are found by comparison to a control sequence.
RNase cleavage64. Ribonucleases are used to cleave mismatches in RNA:RNA or RNA:DNA heteroduplexes. Cleavage is detected by the presence of smaller sized fragments on gels.
Chemical Cleavage of Mismatch (CCM)65,66. Similar in principle to RNase cleavage, but the mismatches in DNA:DNA heteroduplexes are bound by chemicals and then cleaved with piperidine.
New scanning methods
T4 Endonuclease VII Cleavage. A cleavage method exploiting bacteriophage resolvases67. Although the method is free of toxic chemicals, its efficiency and reliability are yet to be proved. It is currently used in gel assays; future developments may reduce the background noise, allowing for conversion to a plate assay.
Multi-Photon Detection. This radioactive detection system may allow the pooling of many DNA samples into a single gel track of an established method such as DGGE.
Cleavase fragment length polymorphism assay. This method exploits a thermostable structure-specific nuclease to cleave stem-loop structures in single-stranded DNA, in which variation is detected as an altered banding pattern on a gel. Although the system can handle relatively large fragments of DNA, its efficacy has yet to be demonstrated.
E. coli mismatch repair enzymes. The MutH, MutL, and MutS enzymes have been used to recognize heteroduplex mismatches68, an approach that could be converted to a plate-based assay. Even if success is limited, this approach has the potential for genetic modification to improve the desired activities of the proteins.
Denaturing High Performance Liquid Chromatography. DHPLC uses partial heat denaturation and a linear acetonitrile column to sensitively scan DNA fragments for variation69. Akin to DGGE, this system offers the significant advantage of automation potential. Throughput, cost, and ability to cope with DNA fragments containing multiple melting domains are yet to be evaluated.
Mass Spectrometry. Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS)70 compares DNA fragments by sensitive mass determination. Potential is good in both the scanning and scoring of DNA variants.
II. Variation scoring methods
Established scoring methods
Single Nucleotide Primer Extension (SNuPE). Also known as Mini-Sequencing, this technique involves the single base extension of an immobilized primer, in which the added base corresponds to the SNP. The use of all four bases, each with a different fluorescent label, allows the scoring of the SNP on the basis of the resulting color of fluorescence. Conversion into a high density array format could make this the scoring method of choice53,54.
5' Nuclease Assay72. Scores SNP alleles on the basis of the degree of hybridization of an allele-specific oligonucleotide to a PCR product by fluorescent emission or quenching of fluorescence. Apparent drawbacks of the technique, however, are its expense and the need to establish conditions for each locus.
New scoring methods
DNA Microchips. The technology that has the greatest potential and which is generating widespread interest. One format uses the light-directed (photolithographic) synthesis of microarrays of oligonucleotides on glass supports. Fluorescently labeled PCR products are hybridized to the oligonucleotide arrays, and the sequence-specific hybridization signal is detected by scanning confocal microscopy and analyzed automatically. The technology enables rapid re-sequencing, allowing the scoring of DNA variants as predictable differences in the hybridization pattern. Microchips can also be applied to gene expression studies, simultaneously quantitating the levels of many transcripts. The technique has already been demonstrated in the scoring of mutations in the mitochondrial and HIV genomes as well as mutations in the CFTR cystic fibrosis gene, the BRCA1 breast cancer gene, and the p53 tumor suppressor gene. Although DNA microchips show great promise in the scoring of SNPs, it is not yet clear whether they will be effective for scanning for unknown polymorphisms.
DNA polymorphism in population-based genetic studies. Linkage analysis locates the disease locus to a chromosomal region that can be many megabases in size. Given that scores or hundreds of genes likely occupy such a large region, it is not always feasible to clone and analyze each gene. Association studies are an alternative form of genetic analysis; they offer opportunities to more finely map linkage regions, to map loci that are refractory to linkage analysis, to map unknown predisposition loci, and even to assign function to anonymous genes10,26,32.
In contrast to linkage, which shows coexistence of a variant and a disease through inheritance in families, association shows coexistence of a variant and a disease in a population. Association studies are based upon linkage disequilibrium, which occurs between a marker and disease if the marker polymorphism is situated in close proximity to the functional (disease contributing) variant. Due to the close physical proximity, many generations are required for the two to be separated by recombination. As a result, they are present together on the same haplotype (Fig. 2) at higher frequency than expected even in very distantly related people.
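Linkage disequilibrium between two biallelic sites is commonly quantified by the coefficient D, the difference between the observed frequency of a two-site haplotype and the frequency expected if the sites were independent, and by its normalized form r². The measures themselves are standard, but they are not named in the text; the function name and haplotype counts in this sketch are hypothetical.

```python
def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int,
                           n_ab: int) -> tuple[float, float]:
    """Return (D, r_squared) from counts of the four two-site haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at site 2
    p_AB = n_AB / total           # observed frequency of the A-B haplotype
    d = p_AB - p_A * p_B          # excess over the independent expectation
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("both sites must be polymorphic")
    return d, d * d / denominator


# Hypothetical sample of 100 chromosomes in which A and B tend to travel together.
d, r2 = linkage_disequilibrium(40, 10, 10, 40)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

When the four haplotypes occur at the frequencies expected from independent assortment, both D and r² are zero; values above zero indicate that the marker carries information about the neighboring site.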
Following mapping of a disease gene by linkage, additional polymorphic sites can be identified in the linkage region and used in association studies to narrow the linkage region. A marker (such as a DNA polymorphism) is said to be associated with a particular phenotype when its frequency is significantly higher among affected than nonaffected individuals. The closer a marker is to the polymorphic site that contributes to the disease (the functional polymorphism), the stronger the association; the functional variant at that polymorphic site however will have the maximum association. Association studies can thus be used to screen for functionally significant variation and to test candidate genes for involvement in a phenotype. These may be genes cloned from a linkage region, but the approach can also be used for genes which are candidates for other reasons, such as differential expression patterns in diseased versus unaffected individuals.
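Whether a marker allele is "significantly" more frequent among affected individuals is typically assessed with a chi-square test on a 2×2 table of allele counts in cases and controls. The sketch below is a minimal, assumption-laden illustration: the counts are invented, the function name is hypothetical, and the p-value uses the one-degree-of-freedom identity p = erfc(sqrt(χ²/2)) without a continuity correction.

```python
import math


def allele_association(case_variant: int, case_other: int,
                       control_variant: int, control_other: int
                       ) -> tuple[float, float]:
    """Pearson chi-square (1 d.f.) and p-value for a 2x2 allele-count table."""
    a, b, c, d = case_variant, case_other, control_variant, control_other
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must contain observations")
    chi_square = n * (a * d - b * c) ** 2 / margins
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: variant allele on 60 of 100 case chromosomes
# and on 40 of 100 control chromosomes.
chi_square, p_value = allele_association(60, 40, 40, 60)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

In a screen of many markers the threshold for significance would need to account for multiple testing, which this sketch does not attempt.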
Figure 2. Haplotypes and DNA polymorphism. The horizontal lines represent a homologous region of a chromosome in four different individuals, with an X or filled circle indicating a variant nucleotide. Unique combinations of variants define a haplotype. The disease-causing variant arose on a chromosome carrying variant B and, because of their close physical proximity, the two are rarely separated by recombination. The disease-causing variant and B are therefore in linkage disequilibrium. Variants A and C are located further away, and recombination occurs between them frequently. In this example, four haplotypes exist in the population. Not all polymorphic sites need to be tested, as genotypic scoring of alleles A and C differentiates between the four haplotypes. As two homologous chromosomes, and thus two haplotypes, are present in an individual, polymorphism scoring gives a combined result of the two haplotypes. Individual haplotypes are constructed by using family transmission data.
Association studies are not new, with the association between the HLA-B27 allele and ankylosing spondylitis well established, in which more than 90% of disease sufferers have -B27 but less than 10% of the general population have the allele33,34. It is still unclear, however, whether the -B27 allele predisposes to the disease or whether it lies close to the true functional variant (in linkage disequilibrium). Best known, perhaps, is the association between Alzheimer's disease and the Apolipoprotein-E e4-allele, which appears to be a true functional variant35,36. Association studies are not without their pitfalls. The major problem lies with the ethnic history of the control population, as analysis of a trait in a mixed population can yield false positive associations. If the trait is present at a higher frequency in a migrant population relative to the rest of the population under study, this ethnic group will constitute an elevated proportion of the case population under study. By virtue of their differing ethnic origin, the migrant population will have different allele frequencies at many loci relative to the rest of the population in the study, even though these loci do not contribute to the phenotype. Any variant allele more common in the migrant population will therefore appear (falsely) to be associated with the disease. One approach to reducing heterogeneity is to perform case-control association studies on relatively homogeneous populations with a small founder population such as certain regions of Finland, Iceland, or island populations37. The use of small populations is not always feasible, as the disease in question may not be present at a sufficiently high frequency in that population. Case-control associations using unaffected siblings as controls, rather than unrelated controls, may protect against bias due to population heterogeneity38.
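The false association produced by a mixed population can be reproduced with simple arithmetic. In the Python sketch below, the allele has no effect on disease within either subpopulation, yet it is enriched among cases once the groups are pooled. All numbers and the function name are hypothetical and chosen only to make the effect visible.

```python
def pooled_allele_frequencies(subpopulations: list[tuple[int, float, float]]
                              ) -> tuple[float, float]:
    """Expected allele frequency in pooled cases and pooled controls.

    Each subpopulation is (size, disease_prevalence, allele_frequency).
    Within a subpopulation the allele is assumed independent of disease.
    """
    cases = sum(size * prev for size, prev, _ in subpopulations)
    controls = sum(size * (1 - prev) for size, prev, _ in subpopulations)
    case_freq = sum(size * prev * freq
                    for size, prev, freq in subpopulations) / cases
    control_freq = sum(size * (1 - prev) * freq
                       for size, prev, freq in subpopulations) / controls
    return case_freq, control_freq


# Hypothetical mixed population: a majority group and a migrant group with
# both a higher disease prevalence and a higher frequency of a neutral allele.
majority = (9000, 0.01, 0.10)
migrant = (1000, 0.10, 0.50)

case_freq, control_freq = pooled_allele_frequencies([majority, migrant])
print(f"allele frequency in cases:    {case_freq:.3f}")
print(f"allele frequency in controls: {control_freq:.3f}")
```

Analyzing either group on its own gives identical allele frequencies in cases and controls, which is why homogeneous populations or family-based controls remove the artifact.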
Following the establishment of an association, family-based linkage-disequilibrium tests, such as the transmission-disequilibrium test (TDT)39–41, offer an independent means to assess the association and control for the confounding effects of population stratification or admixture that plague population-based association tests. TDT uses data derived from two parents and an affected offspring. The TDT statistic tests for equal numbers of transmissions of marker alleles from heterozygous parents to affected offspring. Significantly different transmission frequencies provide evidence that the marker is linked to the disease locus. This test was originally devised to test for linkage between a complex disease and a marker where a disease association had already been found, but TDT is also valid in the absence of association. Given sufficient numbers of markers and appropriate DNA samples, TDT analyses can be used as the initial test, an approach currently best suited to candidate gene studies. One would think it logical to apply TDT studies to the families of affected sib-pair collections, but unfortunately most collections have not included parental DNA samples. The requirement for parental DNAs is also problematic in the study of late-onset diseases such as non-insulin-dependent diabetes mellitus, Alzheimer's disease, and some cancers. By the time the disease manifests itself, a large fraction of the parents have died, and thus DNA samples are not available. Recent advances in association analysis approaches to the study of discordant sib-pairs, however, hold promise for alleviating the requirement of parental DNA42,43.
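For a biallelic marker, the TDT is usually written as a McNemar-type statistic: if heterozygous parents transmit one allele to affected offspring b times and the other allele c times, then (b − c)²/(b + c) is approximately chi-square distributed with one degree of freedom when transmission is equal. The sketch below assumes this standard form; the transmission counts and the function name are hypothetical.

```python
import math


def tdt_statistic(transmitted_allele_1: int,
                  transmitted_allele_2: int) -> tuple[float, float]:
    """TDT chi-square (1 d.f.) and p-value from transmission counts.

    Counts are transmissions from heterozygous parents to affected
    offspring; homozygous parents are uninformative and excluded.
    """
    b, c = transmitted_allele_1, transmitted_allele_2
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi_square = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical data: 100 heterozygous parents, allele 1 transmitted 60 times.
chi_square, p_value = tdt_statistic(60, 40)
print(f"TDT chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```

Because each transmitted allele is compared with the untransmitted allele of the same parent, the comparison is made within families and is not distorted by differences in allele frequency between subpopulations.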
The use of SNPs in genetic studies. The availability of microsatellite markers distributed throughout the genome, together with the development of PCR-based semi-automated scoring techniques, has made linkage studies for human genetic disease a practical approach. Despite many notable successes, linkage approaches are not entirely satisfactory. Linkage studies have often suffered from an insufficient number of affected individuals, of family DNA samples, and of highly polymorphic markers. In addition, microsatellite scoring procedures are labor intensive. A way to increase the number of markers is to use SNPs, which are plentiful. Technological advances promise to make SNP screens highly automated, faster, and cheaper to perform than microsatellite analysis. As each SNP site is biallelic, and therefore less informative than a variable microsatellite, which has multiple alleles, more SNP loci are needed to be equally informative. The high frequency of SNPs in the genome provides enough polymorphic sites to more than compensate for the lost information content. Genome scans commonly test 300–400 polymorphic microsatellites, spaced at 10 cM (10% recombination) intervals. The minimum number of moderately variant sites to get equivalent linkage power is 700–900, and a preliminary genome screen with a marker density of one per cM would require in the order of 1500–3000 SNPs44. The fine mapping and eventual definition of the causative site would probably require in the region of 300,000 scorable variant sites throughout the genome (i.e., 1 marker per 10,000 bp of DNA), although it would only be necessary to test at high density in regions which have shown linkage in a lower density screen. With a high enough density of markers, though, linkage studies will rapidly become of decreasing importance (and usefulness) as it will be possible to go directly to linkage disequilibrium studies.
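Two pieces of arithmetic underlie this comparison. The informativeness of a marker can be approximated by its expected heterozygosity, 1 − Σp², which cannot exceed 0.5 for a biallelic SNP but approaches 1 for a marker with many alleles. The marker count follows from dividing genome length by marker spacing. The sketch below assumes a genome of 3 × 10⁹ bp, the figure implied by 300,000 markers at one per 10,000 bp; the allele frequencies and function names are illustrative.

```python
def expected_heterozygosity(allele_frequencies: list[float]) -> float:
    """Probability that two randomly drawn alleles differ: 1 - sum(p_i^2)."""
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_frequencies)


def markers_required(genome_bp: int, spacing_bp: int) -> int:
    """Number of evenly spaced markers needed to cover the genome."""
    return genome_bp // spacing_bp


snp = [0.5, 0.5]               # most informative possible biallelic SNP
microsatellite = [0.1] * 10    # hypothetical marker with 10 equally common alleles

print("SNP heterozygosity:           ", expected_heterozygosity(snp))
print("microsatellite heterozygosity:", round(expected_heterozygosity(microsatellite), 2))
print("markers at 1 per 10,000 bp:   ", markers_required(3_000_000_000, 10_000))
```

Heterozygosity is only a rough proxy for linkage information, so the ratio between the two values should not be read as the exact number of SNPs needed to replace one microsatellite.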
In the extreme, if all human DNA variants were known (including but not restricted to SNPs) this set would include all functional polymorphisms, and if they could be analyzed in all individuals, comparison of phenotypes and correlation with genotype might make possible the assignment of function to every gene that predisposes to disease of any kind, and also to nonclinical phenotypes including behavioral traits. The sheer size of the task is overwhelming and may possibly never be practical. But by limiting our immediate efforts to ascertaining and testing variants in candidate genes, and using well-defined populations, we can consider using linkage disequilibrium studies as the first course of investigation to identify the molecular bases of disease. Even when limiting the polymorphisms to be studied to those at a candidate gene locus, the number to be tested will be very large. An approach to reducing the number of sites to be tested per gene is to define haplotypes, and use a subset of polymorphisms within the gene that distinguish between haplotypes (Fig. 2). This allows fewer markers to be used to test across a larger region of DNA.
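Choosing a subset of polymorphisms that still distinguishes every haplotype can be framed as a small search problem. The Python sketch below tries site subsets in order of increasing size and returns the first one under which all haplotypes remain distinct. It is an exhaustive search suited only to a handful of sites, and the haplotypes shown are invented rather than taken from Fig. 2.

```python
from itertools import combinations


def minimal_distinguishing_sites(haplotypes: dict[str, str]) -> tuple[int, ...]:
    """Smallest set of site indices whose alleles separate all haplotypes.

    `haplotypes` maps a label to a string of alleles, one character per site.
    """
    n_sites = len(next(iter(haplotypes.values())))
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(alleles[i] for i in sites)
                          for alleles in haplotypes.values()}
            if len(signatures) == len(haplotypes):
                return sites
    raise ValueError("haplotypes are not all distinct")


# Hypothetical haplotypes over three sites (0 = common allele, 1 = variant).
example_haplotypes = {
    "H1": "000",
    "H2": "100",
    "H3": "011",
    "H4": "101",
}

print(minimal_distinguishing_sites(example_haplotypes))
```

In this example two of the three sites are enough to tell the four haplotypes apart, so the remaining site need not be scored.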
Preferably, polymorphism identification and haplotype definition could be done on a single population, with the defined polymorphism subset being used on multiple disease populations. For polygenic disease, the disease-contributing allele will be present in a normal population, but it may be difficult to define a single normal population that accurately reflects the diversity of all populations that are to be tested, and important haplotypes may be missed. Therefore, polymorphisms should ideally be both defined and tested in populations with the same genetic history, or polymorphisms should be initially defined on a set of individuals from various populations. In addition, it cannot be determined a priori at what frequency the disease-contributing alleles will occur in the population, so there is uncertainty as to the numbers of individuals to test to define polymorphisms and haplotypes.
Techniques for identifying and scoring polymorphisms
The human genome project will generate a representative sequence of the human genome, but this sequencing effort does not deliberately or systematically detect polymorphism across the whole genome. Efficient techniques are therefore required for the simple and rapid initial identification of SNPs (scanning methods), and also for the subsequent testing of the SNP alleles (scoring methods).
SNP scanning. All of the well-established SNP scanning methods45–48 are gel-based assays. Single-strand conformation polymorphism (SSCP), heteroduplex analysis (HA), and denaturing gradient gel electrophoresis (DGGE) (Box 1) rely on conformation-induced mobility differences; RNase cleavage and chemical cleavage of mismatch (CCM) (Box 1) act via DNA fragment cleavage; and sequencing (Box 1) characterizes each sequential base of the DNA fragment under investigation. In applying these techniques, the following criteria need to be considered: ease of use, efficacy, and throughput.
Ease of use. SSCP and HA are the simplest assays to perform, both involving PCR and heat denaturation of the product (without and with re-annealing, respectively), followed by gel electrophoresis. The combination of few steps, absence of chemical or enzymatic treatment, and uncomplicated gel systems for analysis makes these the simplest polymorphism detection methods. Sequencing has become fairly routine for many groups as protocols are well established and generally robust. Although sample processing has been simplified, sequencing still requires substantial manipulation. RNase cleavage requires substantial sample processing, and in addition requires the in vitro synthesis of RNA probes. A simplified and improved nonisotopic development of the technique by Ambion has improved its usability. The complexity of the CCM protocol and its requirement for very toxic chemicals make it unpopular. The recent discovery that osmium tetroxide, the most toxic of the chemicals involved, can be replaced with potassium permanganate, together with the application of magnetic bead and fluorescence technology, has made this method substantially easier to perform. DGGE, which requires specialized gel electrophoresis equipment, is moderately simple to perform, but only after the optimal experimental conditions specific for each DNA fragment have been established. This method is therefore poorly suited to the analysis of many different fragments of DNA.
Efficacy of detection. RNase cleavage and HA are inefficient detection methods, with SNP detection rates of approximately 70% and 80%, respectively, and as such are rarely used as a sole technique for testing DNA for polymorphisms. In both methods the sequence surrounding the heteroduplex mismatch influences the detection rate of a given mismatch. Several factors influence the efficacy of detection of DNA sequence variation by SSCP, including temperature, gel composition, and DNA fragment size, but by using defined conditions and short fragment size (e.g., 200 bp), detection frequencies of greater than 90% can be obtained49. DGGE is probably more effective still, but is not as effective as CCM, which has a near-100% SNP detection rate50,51. As sequencing examines each base directly, this method is often used when it is crucial that variants are not missed. This is the method chosen by Myriad Genetics for the routine detection of mutations in the BRCA1 and BRCA2 breast cancer genes. Nevertheless, the quality of the data is not impeccable: sequencing errors occur at frequencies which are not negligible and, in particular, the detection of heterozygous SNPs can be problematic52.
Throughput. Fluorescence technology has made a major impact on the throughput of gel-based SNP detection methods. Each of the techniques using standard polyacrylamide gels (i.e., SSCP, HA, and CCM) has been adapted to this technology. The use of fluorescently labeled DNA avoids the requirement for radiolabeling of DNA, and the availability of a variety of fluors permits the loading of multiple samples in each lane of a gel, allowing high-throughput detection of DNA variation. Although high-volume sequencing can be performed, larger fragments of DNA and multiple samples can be tested using one of the other scanning methods, giving severalfold higher throughput than sequencing.
SNP scoring. All of the techniques used for scanning for unknown DNA variation can also be used for scoring known mutations. Initially, a mobility pattern associated with the variant is defined, allowing samples to be scored according to which known mobility pattern they match. For large studies, however, it is necessary to be able to score many alleles rapidly and cheaply in a large number of individuals, and for this, specialized technologies are required. A primary focus of polymorphism analysis development has been the efficient, high throughput scoring of known SNPs, both for research and in anticipation of the increase in demand for diagnostic DNA testing. Several of the most promising technologies are described in Box 1. All of the major scoring methods (apart from allele-specific PCR) are non-gel-based, and all have the potential for high throughput, but it is the techniques of single nucleotide primer extension (SNuPE)53,54 and DNA microchips55,56 which stand out as having the greatest potential for development to the ultra-high throughputs required.
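Scoring a known variant by its mobility pattern amounts to a lookup against previously defined reference patterns. The following is a minimal sketch of that idea, not a description of any particular instrument or software. The band positions, the genotype labels, the tolerance, and the function name are all hypothetical values chosen for illustration.

```python
def score_sample(observed_bands, reference_patterns, tolerance=0.5):
    """Return the genotype whose reference pattern matches the observed bands.

    observed_bands: band positions for one sample (arbitrary units).
    reference_patterns: dict mapping genotype label to its band positions.
    Returns None when no reference pattern matches within the tolerance,
    which flags the sample for rescanning or sequencing.
    """
    observed = sorted(observed_bands)
    for genotype, reference in reference_patterns.items():
        expected = sorted(reference)
        if len(expected) != len(observed):
            continue
        if all(abs(o - e) <= tolerance for o, e in zip(observed, expected)):
            return genotype
    return None


# Hypothetical reference patterns, defined once from samples of known genotype.
REFERENCE_PATTERNS = {
    "A/A": [41.0, 47.0],
    "A/G": [41.0, 43.5, 47.0, 49.0],
    "G/G": [43.5, 49.0],
}

print(score_sample([47.2, 40.8], REFERENCE_PATTERNS))
print(score_sample([43.4, 49.1, 41.1, 46.9], REFERENCE_PATTERNS))
print(score_sample([30.0, 35.0], REFERENCE_PATTERNS))
```

Returning None instead of the nearest pattern is a deliberate choice in this sketch: a sample that matches no known pattern may carry a new variant and should not be forced into an existing class.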
The future of SNP scanning and scoring
That polymorphism scanning technologies will have a large impact on SNP-based genetic studies in the immediate future is clear. In the longer term, new detection and scoring technologies will be required if association studies are to be done on thousands of loci in tens of thousands of individuals.
Future scanning techniques will need to have very high detection efficacies, be easily applied to many samples and many fragments of DNA, and be inexpensive. Existing techniques require PCR amplification of each polymorphic region to be tested, so progress will require either substantial improvements in the multiplexing of PCR primers or the development of a PCR-free polymorphism detection system. It is unclear what form these techniques will take, but they may require technologies fundamentally different from gel mobility, mismatch cleavage, and hybridization. The use of techniques that measure physical characteristics, such as DNA mass, is one such possibility.
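The scale of the PCR requirement can be made concrete with simple arithmetic. The sketch below counts the amplification reactions needed for a study as a function of the number of loci, the number of individuals, and the number of loci amplified together in one reaction. The study sizes used are illustrative values at the low end of "thousands of loci in tens of thousands of individuals", and the multiplex level of 10 is a hypothetical figure, not one taken from the text.

```python
def pcr_reactions(n_loci, n_individuals, multiplex_level=1):
    """Number of PCR reactions needed to amplify every locus in every individual.

    multiplex_level is the number of loci co-amplified in a single reaction.
    """
    if n_loci < 0 or n_individuals < 0:
        raise ValueError("counts must be non-negative")
    if multiplex_level < 1:
        raise ValueError("multiplex_level must be at least 1")
    # Integer ceiling division: a partly filled multiplex still costs one reaction.
    reactions_per_individual = (n_loci + multiplex_level - 1) // multiplex_level
    return reactions_per_individual * n_individuals


# Illustrative study: 1,000 loci typed in 10,000 individuals.
single_plex = pcr_reactions(1_000, 10_000)
ten_plex = pcr_reactions(1_000, 10_000, multiplex_level=10)
print(f"single-plex reactions: {single_plex:,}")
print(f"10-plex reactions:     {ten_plex:,}")
```

Even a tenfold multiplex leaves the reaction count in the millions for a study of this size, which is why the text treats a PCR-free detection system as a serious alternative.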
Assessing gene function
Once a gene has been implicated in a phenotype by linkage or association studies, determining the function of the gene and proving the association are usually complicated tasks that can be approached in many different ways. For genes belonging to well-studied gene families or encoding well-characterized protein domains, a good indication of function may be derived from the DNA sequence alone. Evidence of interaction with other proteins or genes may be derived from cell biological or biochemical studies, but any full understanding of mammalian gene function requires studies of DNA variation, and in particular the selective disruption of a gene in a complex mammalian in vivo test system, such as a mouse. Studies in human subjects are hindered by the inability to study specific DNA changes in multiple individuals in a constant genetic background. In mice, by contrast, it is possible to generate mutations in specific genes and study the associated phenotypes. In addition, breeding mice with different mutations (at the same or different loci) allows the dissection of complex traits. The mouse therefore offers many opportunities that the human does not.
There are many ways in which mutations can be generated in mice, but they can generally be divided into targeted and nontargeted approaches. The targeted approach generally uses partial deletion and inactivation of the gene to be studied. This is achieved via homologous recombination in embryonic stem (ES) cells, which can then be used to generate mice57. Homologous recombination techniques also allow subtle DNA lesions to be created, and although the procedure is very difficult and laborious, it enables the production of mice carrying defined mutations.
Nontargeted gene disruption does not allow alteration directed to a particular base pair of a gene, but does permit the relatively rapid production of many mutants. This may be performed by viral insertional mutagenesis or by chemical mutagenesis, the latter of which is currently generating much interest. Ethylnitrosourea (ENU) is a potent point mutagen that efficiently mutagenizes premeiotic spermatogonia, enabling the production of F1 offspring each containing phenotype-inducing mutations in as many as 100 genes58. Dominant functional mutations may produce an observable phenotype in the first-generation mice, and recessive mutations can be bred to homozygosity. Examination of the phenotypes resulting from mutagenesis allows the study of mutations both in genes whose associated phenotype is already known and in genes for which it is not. New phenotypes with unknown genetic causes can be mapped by breeding with mice carrying defined chromosomal deletions, a technology which has been used for several decades by the Drosophila community and therefore should serve mammalian functional genomics well.
It is often desirable to identify and study the (unknown) phenotypic outcome of mutations in a gene being investigated. Mice carrying such mutations can be generated by coupling ENU mutagenesis with high throughput mutation scanning. ENU-mutagenized male mice are mated with isogenic females to generate a large colony of F1 mice carrying a large number of heterozygous mutations. The DNA from these mice is then screened for heterozygous mutations (without regard to phenotype), identifying animals that can be bred and studied for the phenotypic consequence of the identified mutation. As mouse mutagenesis technologies develop, it will become routine to introduce multiple variants into a single genetic background and in various combinations. This high mutation/variant repertoire will facilitate the dissection of complex traits and help determine which variants identified in human association studies are of functional significance.
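A rough sense of how large such an F1 colony must be can be obtained from a simple probability sketch. It takes the figure of up to 100 mutated genes per F1 mouse from the text and combines it with a total gene count that is purely a hypothetical placeholder, not a value from this article. It also assumes that mutations fall uniformly across genes and independently between mice, which ignores differences in gene size and mutability. All function and variable names are illustrative.

```python
import math


def prob_at_least_one(p_hit, n_mice):
    """Chance that at least one of n_mice carries a mutation in the target gene."""
    return 1.0 - (1.0 - p_hit) ** n_mice


def mice_needed(p_hit, confidence=0.95):
    """Smallest colony size giving the requested chance of at least one hit."""
    if not 0.0 < p_hit < 1.0:
        raise ValueError("p_hit must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Solve 1 - (1 - p_hit)**n >= confidence for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hit))


MUTATED_GENES_PER_MOUSE = 100   # upper figure quoted in the text
TOTAL_GENES = 25_000            # hypothetical placeholder, not from the article

# Per-mouse chance of a hit in one chosen gene, assuming uniform targeting.
p_hit = MUTATED_GENES_PER_MOUSE / TOTAL_GENES
colony_size = mice_needed(p_hit, confidence=0.95)
print(f"per-mouse hit probability: {p_hit:.4f}")
print(f"F1 mice to screen for a 95% chance of one hit: {colony_size}")
```

Because the per-mouse figure is an upper bound and the gene count is a placeholder, the result is best read as an order-of-magnitude guide to why high throughput mutation scanning is a prerequisite for this approach.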
Exploiting lack of DNA variation
While the cloning of sheep from adult cells59 was something of a revelation, it should not be forgotten that inbred mice, although not clones, are virtually genetically identical. Using model systems devoid of DNA variation allows us to dissect the environmental components of complex traits. Indeed, twin studies are well established in determining the genetic vs. environmental components of human disease. Studies of naturally occurring mouse mutants in inbred backgrounds have shed light on phenotypes such as obesity and diabetes. Mice have the advantage of (relative) ease of handling and short generation times, which allow the derivation of isogenic strains. Some defined captive populations of higher mammals, such as nonhuman primates, show high incidences of polygenic disease, but production of isogenic strains for the study of environmental effects on the disease is hugely impractical. Subjecting multiple clones of those individuals to different environments should therefore offer a way of dissecting the nongenetic aspects of complex diseases such as cardiovascular disease. Thus, the combination of variation and cloning studies should provide an understanding of complex disease only dreamed of today.
Conclusion
The revolution in human molecular genetics continues. Incredible progress has been made in a short time in discovering the bases of many human diseases. To date, most of these have been monogenic diseases affecting a relatively small proportion of the population. For the identification of the genetic bases of complex disease the path forward is apparent, and the obstacles and challenges are clear. Identification of all DNA variants in the human genome should ultimately make it possible to link all genetic phenotypes to their genic basis. With this understanding it will be possible to diagnose disease and disease risk effectively, to develop and selectively apply therapeutics to relieve disease symptoms, to treat disease progression, and ultimately to prevent disease onset.
1. Landsteiner, K. 1900. Kenntnis der antifermentativen, lytischen und agglutinierenden Wirkungen des Blutserums und der Lymphe. Zbl. Bakt. 27: 357–362.
2. Yamamoto, F., Clausen, H., White, T., Marken, J., and Hakomori, S. 1990. Molecular genetic basis of the histo-blood group ABO system. Nature 345: 229–233.
3. Smithies, O. 1957. Variations in human serum β-globulins. Nature 180: 1482–1483.
4. Southern, E.M. 1975. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol. 98: 503–517.
5. Jeffreys, A.J. 1979. DNA sequence variants in the Gγ-, Aγ-, δ- and β-globin genes of man. Cell 18: 1–10.
6. Cooper, D.N., Smith, B.A., Cooke, H.J., Niemann, S., and Schmidtke, J. 1985. An estimate of unique DNA sequence heterozygosity in the human genome. Hum. Genet. 69: 201–205.
7. Jeffreys, A.J., Wilson, V., and Thein, S.L. 1985. Hypervariable 'minisatellite' regions of human DNA. Nature 314: 67–73.
8. Nakamura, Y., Leppert, M., O'Connell, P., Wolff, R., Holm, T., Culver, M., et al. 1987. Variable number of tandem repeat (VNTR) markers for human gene mapping. Science 235: 1616–1622.
9. Weber, J.L. and May, P.E. 1989. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet. 44: 388–396.
10. Lander, E.S. and Schork, N.J. 1994. Genetic dissection of complex traits. Science 265: 2037–2048.
11. Mohr, J. 1954. A study of linkage in man. Opera ex Domo Biologiae Hereditariae Humanae Universitatis Hafniensis, vol. 33. Munksgaard, Copenhagen.
12. Botstein, D., White, R.L., Skolnick, M., and Davis, R.W. 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32: 314–331.
13. Ott, J. 1991. Analysis of human genetic linkage (Revised Edition). The Johns Hopkins Press, Baltimore, MD.
14. Penrose, L.S. 1935. The detection of autosomal linkage in data which consists of pairs of brothers and sisters of unspecified parentage. Ann. Eugen. 6: 133–138.
15. Suarez, B.K., Rice, J., and Reich, T. 1978. The generalized sib pair IBD distribution: its use in the detection of linkage. Ann. Hum. Genet. 42: 87–94.
16. Risch, N. 1990. Linkage strategies for genetically complex traits: II. The power of affected relative pairs. Am. J. Hum. Genet. 46: 229–241.
17. Bishop, D.T. and Williamson, J.A. 1990. The power of identity-by-state methods for linkage analysis. Am. J. Hum. Genet. 46: 254–265.
18. Taylor, T.D., Litt, M., Kramer, P., Pandolfo, M., Angelini, L., Nardocci, N., et al. 1996. Homozygosity mapping of Hallervorden-Spatz syndrome to chromosome 20p12.3–p13. Nat. Genet. 14: 479–481.
19. Ellis, N.A., Groden, J., Ye, T.Z., Straughen, J., Lennon, D.J., Ciocci, S., Proytcheva, M., and German, J. 1995. The Bloom's syndrome gene product is homologous to RecQ helicases. Cell 83: 655–666.
20. Davies, J.L., Kawaguchi, Y., Bennett, S.T., Copeman, J.B., Cordell, H.J., Pritchard, L.E., et al. 1994. A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371: 130–136.
21. Cordell, H.J. and Todd, J.A. 1995. Multifactorial inheritance in type 1 diabetes. Trends Genet. 11: 499–504.
22. Riordan, J.R., Rommens, J.M., Kerem, B., Alon, N., Rozmahel, R., et al. 1989. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245: 1066–1073.
23. Yu, S., Pritchard, M., Kremer, E., Lynch, M., Nancarrow, J., Baker, E., et al. 1991. Fragile X genotype characterized by an unstable region of DNA. Science 252: 1179–1181.
24. Verkerk, A.J., Pieretti, M., Sutcliffe, J.S., Fu, Y.H., Kuhl, D.P., Pizzuti, A., et al. 1991. Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell 65: 905–914.
25. The Huntington's Disease Collaborative Research Group. 1993. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell 72: 971–983.
26. Weeks, D.E. and Lathrop, G.M. 1995. Polygenic disease: methods for mapping complex disease traits. Trends Genet. 11: 513–519.
27. Sawcer, S., Jones, H.B., Feakes, R., Gray, J., Smaldon, N., Chataway, J., et al. 1996. A genome screen in multiple sclerosis reveals susceptibility loci on chromosome 6p21 and 17q22. Nat. Genet. 13: 464–468.
28. Haines, J.L., Ter-Minassian, M., Bazyk, A., Gusella, J.F., Kim, D.J., Terwedow, H., et al. 1996. A complete genomic screen for multiple sclerosis underscores a role for the major histocompatability complex. Nat. Genet. 13: 469–471.
29. Moldin, S.O. 1997. The maddening hunt for madness genes. Nat. Genet. 17: 127–129.
30. Collins, F.S. 1995. Positional cloning moves from perditional to traditional. Nat. Genet. 9: 347–350.
31. Nelson, D.L. 1995. Positional cloning reaches maturity. Curr. Opin. Genet. Dev. 5: 298–303.
32. Owen, M.J. and McGuffin, P. 1993. Association and linkage: complementary strategies for complex disorders. J. Med. Genet. 30: 638–639.
33. Schlosstein, L., Terasaki, P.I., Bluestone, R., and Pearson, C.M. 1973. High association of an HLA antigen, W27, with ankylosing spondylitis. N. Engl. J. Med. 288: 704–706.
34. Brewerton, D.A., Caffrey, M., Hart, F.D., James, D.C.D., Nicholls, A., and Sturrock, R.D. 1973. Ankylosing spondylitis and HLA 27. Lancet 1: 904–907.
35. Strittmatter, W.J., Saunders, A.M., Schmechel, D., Pericak-Vance, M., Enghild, J., Salvesen, G.S., and Roses, A.D. 1993. Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proc. Natl. Acad. Sci. USA 90: 1977–1981.
36. Saunders, A.M., Strittmatter, W.J., Schmechel, D., St George-Hyslop, P.H., Pericak-Vance, M.A., Joo, S.H., et al. 1993. Association of apolipoprotein E allele epsilon 4 with late-onset familial and sporadic Alzheimer's disease. Neurology 43: 1467–1472.
37. Jorde, L.B. 1995. Linkage disequilibrium as a gene-mapping tool. Am. J. Hum. Genet. 56: 11–14.
38. Curtis, D. 1997. Use of siblings in case-control association studies. Ann. Hum. Genet. 61: 319–333.
39. Spielman, R.S., McGinnis, R.E., and Ewens, W.J. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52: 506–516.
40. Schaid, D.J. and Sommer, S.S. 1994. Comparison of statistics for candidate-gene association studies using cases and parents. Am. J. Hum. Genet. 55: 402–409.
41. Spielman, R.S. and Ewens, W.J. 1996. The TDT and other family-based tests for linkage disequilibrium and association. Am. J. Hum. Genet. 59: 983–989.
42. Langefeld, C.D., Pericak-Vance, M.A., Saunders, A.M., and Boehnke, M. 1997. Family-based tests for association using discordant sib pairs. Am. J. Hum. Genet. 61S: 1643.
43. Ewens, W.J. and Spielman, R.S. 1997. The sib-TDT (S-TDT): a TDT (transmission/disequilibrium test) without parents. Am. J. Hum. Genet. 61S: 1600.
44. Kruglyak, L. 1997. The use of a genetic map of biallelic markers in linkage studies. Nat. Genet. 17: 21–24.
45. Grompe, M. 1993. The rapid detection of unknown mutations in nucleic acids. Nat. Genet. 5: 111–117.
46. Mashal, R.D. and Sklar, J. 1996. Practical methods of mutation detection. Curr. Opin. Genet. Dev. 6: 275–280.
47. Hawkins, J.R. 1997. Finding mutations. IRL Press at Oxford University Press.
48. Cotton, R.G.H. 1997. Mutation detection. Oxford University Press.
49. Hayashi, K. and Yandell, D.W. 1993. How sensitive is PCR-SSCP? Hum. Mutat. 2: 338–346.
50. Naylor, J.A., Green, P.M., Rizza, C.R., and Giannelli, F. 1993. Analysis of factor VIII mRNA reveals defects in every one of hemophilia A patients. Hum. Mol. Genet. 2: 11–17.
51. Roberts, R.G., Bobrow, M., and Bentley, D.R. 1992. Point mutations in the dystrophin gene. Proc. Natl. Acad. Sci. USA 89: 2331–2335.
52. Khurshid, F. and Beck, S. 1993. Error analysis in manual and automated DNA sequencing. Anal. Biochem. 208: 138–143.
53. Shumaker, J.M., Metspalu, A., and Caskey, C.T. 1996. Mutation detection by solid-phase primer extension. Hum. Mutat. 7: 346–354.
54. Pastinen, T., Kurg, A., Metspalu, A., Peltonen, L., and Syvanen, A.C. 1997. Minisequencing: a specific tool for DNA analysis and diagnostics on oligonucleotide arrays. Genome Res. 7: 606–614.
55. Hoheisel, J.D. 1997. Oligomer-chip technology. Trends Biotechnol. 15: 465–469.
56. Editorial. 1996. To affinity ... and beyond! Nat. Genet. 14: 367–370.
57. Joyner, A. (ed.) 1993. Gene Targeting. A practical approach. Oxford University Press.
58. Russell, W.L., Kelly, E.M., Hunsicker, P.R., Bangham, J.W., Maddux, S.C., and Phipps, E.L. 1979. Specific-locus test shows ethylnitrosourea to be the most potent mutagen in the mouse. Proc. Natl. Acad. Sci. USA 76: 5818–5819.
59. Wilmut, I., Schnieke, A.E., McWhir, J., Kind, A.J., and Campbell, K.H.S. 1997. Viable offspring derived from fetal and adult mammalian cells. Nature 385: 810–813.
60. Orita, M., Iwahana, H., Kanazawa, H., Hayashi, K., and Sekiya, T. 1989. Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms. Proc. Natl. Acad. Sci. USA 86: 2766–2770.
61. White, M.B., Carvalho, M., Derse, D., O'Brien, S.J., and Dean, M. 1992. Detecting single base substitutions as heteroduplex polymorphisms. Genomics 12: 301–306.
62. Fischer, S.G. and Lerman, L.S. 1983. DNA fragments differing by single base pair substitutions are separated in denaturing gradient gels: correspondence with melting theory. Proc. Natl. Acad. Sci. USA 80: 1579–1583.
63. Riesner, D., Steger, G., Zimmat, R., Owens, R.A., Wagenhofer, M., Hillen, W., et al. 1989. Temperature-gradient gel electrophoresis of nucleic acids: analysis of conformational transitions, sequence variations, and protein-nucleic acid interactions. Electrophoresis 10: 377–389.
64. Myers, R.M., Larin, Z., and Maniatis, T. 1985. Detection of single base substitutions by ribonuclease cleavage at mismatches in RNA:DNA duplexes. Science 230: 1242–1246.
65. Rowley, G., Saad, S., Giannelli, F., and Green, P.M. 1995. Ultrarapid mutation detection by multiplex, solid-phase chemical cleavage. Genomics 30: 574–582.
66. Roberts, E., Deeble, V.J., Woods, C.G., and Taylor, G.R. 1997. Potassium permanganate and tetraethylammonium chloride are a safe and effective substitute for osmium tetroxide in solid-phase fluorescent chemical cleavage of mismatch. Nucl. Acids Res. 25: 3377–3378.
67. Youil, R., Kemper, B., and Cotton, R.G.H. 1996. Detection of 81 of 81 known mouse beta-globin promoter mutations with T4 endonuclease VII: the EMC method. Genomics 32: 431–435.
68. Smith, J. and Modrich, P. 1996. Mutation detection with MutH, MutL, and MutS mismatch repair proteins. Proc. Natl. Acad. Sci. USA 93: 4374–4379.
69. Underhill, P.A., Jin, L., Zemans, R., Oefner, P.J., and Cavalli-Sforza, L.L. 1996. A pre-Columbian Y-chromosome-specific transition and its implications for human evolutionary history. Proc. Natl. Acad. Sci. USA 93: 196–200.
70. Roskey, M.T., Juhasz, P., Smirnov, I.P., Takach, E.J., Martin, S.A., and Haff, L.A. 1996. DNA sequencing by delayed extraction-matrix-assisted laser desorption/ionization time of flight mass spectrometry. Proc. Natl. Acad. Sci. USA 93: 4724–4729.
71. Livak, K.J., Marmaro, J., and Todd, J.A. 1995. Towards fully automated genome-wide polymorphism screening. Nat. Genet. 9: 341–342.
72. Pease, A.C., Solas, D., Sullivan, E.J., Cronin, M.T., Holmes, C.P., and Fodor, S.P.A. 1994. Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. USA 91: 5022–5026.
