Accurate determination of microsatellite allele frequencies in pooled DNA samples

Schnack, Hugo G; Bakker, Steven C; van't Slot, Ruben; Groot, Bart M; Sinke, Richard J; Kahn, Rene S; Pearson, Peter L

doi:10.1038/sj.ejhg.5201234

Article
Published: 11 August 2004

European Journal of Human Genetics volume 12, pages 925–934 (2004)

Abstract

Pooling of DNA samples instead of individual genotyping can speed up genetic association studies. However, for microsatellite markers, the electrophoretic pattern of DNA pools can be complex, and procedures for deriving allele frequencies are often confounded by PCR-induced stutter artefacts. We have developed a mathematical procedure to remove stutter noise and accurately determine allele frequencies in pools. A stutter correction model can be reliably derived from one standard ‘training set’ of the same 10 individual DNA samples for each marker, which can also include heterozygous patterns with partially overlapping peaks. Compared with earlier methods, this reduces the number of genotypes needed in the training set considerably, and allows standardization of analyses for different markers. Moreover, the use of a procedure that fits all data simultaneously makes the method less sensitive to aberrant data. The model was tested with 34 markers, 18 of which were newly defined from human sequence data. Allele frequencies derived from stutter-corrected DNA pool patterns were compared with the summed individual genotyping results of all the individuals in the pools (n=109 and n=64). We show that the model is robust and accurately extracts allele frequencies from pooled DNA samples for 32 of the 34 microsatellite markers tested. Finally, we performed a case–control study in celiac disease and found that weakly associated disease alleles, identified by individual genotyping, were only detectable in pools after stutter correction. This efficient method for correcting stutter artefacts in microsatellite markers enables large-scale genetic association studies using DNA pools to be performed.

Introduction

It has been cogently argued that population-based genetic association studies will have a greater power than linkage studies to localize genes contributing moderately or only a little to the phenotype of complex diseases.¹ However, the detection of association, or linkage disequilibrium between a genetic marker and a disease locus in outbred populations is only possible over small genetic distances.^{2, 3, 4, 5, 6} For screening large genomic regions, or even comprehensive whole-genome association studies, this implies that hundreds or thousands of markers have to be genotyped for each subject. Such studies are barely feasible using currently available genotyping technology.

Pooling of DNA samples for genetic marker analysis is a method to reduce the amount of genotyping required in allelic association studies.^{7, 8, 9, 10, 11, 12, 13, 14, 15, 16} This technique involves combining equal amounts of DNA from patients and controls into separate pools, and comparing the pools for differences in allele distributions of genetic markers. In the absence of haplotype information, which is the situation encountered in a typical association study based on pooled case–control comparisons, the biallelic variation of single-nucleotide polymorphisms (SNPs) contains far less polymorphic information than microsatellite markers. Therefore, microsatellites provide a more powerful tool on a marker-by-marker basis than SNPs.^{17, 18} However, in the case of microsatellite markers, the overall genotype patterns of pooled samples are often distorted by PCR artefacts such as stutter and preferential amplification, which prevent an accurate determination of the allele frequencies by simple procedures. Several methods have been proposed to handle these artefacts. Some studies compared summed differences in patterns between two pools without correction for PCR artefacts, and without allotting the individual allelic contributions to the differences.^{8, 9, 10, 11}

A fundamentally different way to compare pool patterns is to correct the pool signal for predicted PCR artefacts, in order to derive more accurate estimates of the allele frequencies. Advantages of this approach are that it allows the comparison of frequencies for individual marker alleles, and that results from different experiments can be summated and analyzed using regular statistics such as χ² tests, since the entire pool signal is deconvoluted into individual allele counts.¹² All recent correction methods use information derived from a training set of individual genotype patterns to obtain information about the stutter behavior of the marker under investigation. One approach is to build a matrix of stutter patterns for individual alleles.^{7, 14, 15} This requires a set of well-distributed homozygous or well-separated (nonoverlapping) heterozygous individual genotype patterns, and interpolation or extrapolation has to be invoked to complete the matrix for missing alleles. These methods are sensitive to one or more nonrepresentative patterns caused by, for example, measurement errors.

Alternatively, a stutter model can be derived from individual genotypes, which is used to correct for stutter and permits interpolation of stutter for allele sizes not encountered in the training set.¹² The advantage of a model is that it partly removes the influence of aberrant patterns. On the other hand, it interprets the stutter peaks according to a fixed behavior, which can yield a less accurate description. The model approaches presented thus far also require well-distributed homozygous or well-separated heterozygous individual patterns for each marker to define the model parameters. In both types of correction procedure, a rather large set (at least 20 to 50) of individual patterns has been considered necessary^{7, 8} to provide sufficient data to obtain the necessary stutter information.¹² The search for and analysis of informative marker data often make these approaches tedious and highly interactive.

We have developed a stutter correction method that fits a model to one small set of genotype data from 10 individuals. This standard training set is identical for all markers, and can be of any allelic composition, since it does not need to include particularly defined homozygous or heterozygous individuals. The accuracy of the stutter correction model has been tested on 34 different microsatellite markers and used in a case–control study for celiac disease.

Materials and methods

Definitions

Uncorrected pool: allele frequencies derived from pool signals uncorrected for stutter.
Corrected pool: allele frequencies derived from pool signals corrected for stutter.
True pool: allele frequencies obtained by individual genotyping of all samples present in a pool, and summing the allele counts.

Preparation of DNA pools, marker selection, PCR, and analysis

Genomic DNA was obtained from peripheral blood lymphocytes using established procedures. Stock solutions were diluted to approximately 25 ng/μl, vortexed gently, and measured with Pico Green (Molecular Probes, Leiden, the Netherlands) on a Genios plate reader (Tecan, Männedorf). Subsequently, samples were diluted to 10 ng/μl and final concentrations were measured in triplicate. Each sample was tested for adequate PCR amplification. Volumes containing 100 ng of DNA from individual samples were pooled. Pools, as well as a set of 10 random individual samples, were purified by phenol extraction, and diluted with water to 10 ng/μl. Characterized microsatellite markers were obtained from the Genome Database (GDB) and Marshfield database. New microsatellite markers were identified by searching a 4 Mb ADHD linkage region on chromosome 15 for microsatellite repeats¹⁹ using the Tandem Repeat Finder program (TRF). PCR primers flanking the repeats were designed with the Primer3 program (sequences are available on request). A so-called pig-tail sequence extension was added to one of the primers in order to reduce plus-A artefact during PCR.²⁰ The other primer was labeled with 6-FAM, HEX, or NED fluorescent dyes (Biolegio, Malden, the Netherlands, and Applied Biosystems, Foster City, CA, USA).

Individual samples and triplicate pools were amplified simultaneously as described elsewhere,¹⁹ but with 27 instead of 33 cycles. Up to three products were pooled, and analyzed on an ABI 3700 sequencer.¹⁹ Sample files were analyzed using Genescan 3.5 and Genotyper 3.6 for Windows NT and the heights of all peaks were labeled. Samples with allelic peak heights below 200 or above 6000 were not labeled. A computer program called PoolFitter (freely available from our web site), which is a user interface invoking our stutter correction algorithm, then processed the tables with allele sizes and peak heights (see below). The pool patterns were corrected for stutter by applying the model parameters derived from the individual genotypes (see below). For marker D7S2422 only, preferential amplification of shorter alleles was compensated in the PoolFitter program, by dividing the peak heights of both individual data and pooled data before model fitting by a function fitted to the corrected heterozygous patterns without compensation for preferential amplification. Estimates from corrected and uncorrected pool patterns (averages of triplicate measurements) were compared with true pools using the program CLUMP.²¹

The model

The basic concept is that, for pooled DNA, any electrophoretic microsatellite marker pattern (see Figure 1a) is the sum of its constituent parts comprising a mixture of homozygous and heterozygous individual patterns. Peaks may represent individual alleles, or individual alleles plus a stutter component, or only stutter. We describe a pattern by Y(a), where Y is the height of the signal at fragment length a. In all figures, the signal height has been scaled to facilitate comparison with calculated quantities later on. The length a can assume discrete values differing by multiples of the repeat length Δa. For a dinucleotide marker, Δa=2 base pairs. Looking at a pattern for a single allelic peak at length a₀, one expects stutter peaks at a₀−Δa, a₀−2Δa, etc. and possibly ‘up-stutter’ peaks at a₀+Δa, etc. In general, peaks are located at a=a₀−mΔa, with m integer. The modeled peaks are described by y(a), representing the peak height at length a. This peak y(a) can have contributions from an allelic peak y₀(a), and from stutter peaks of alleles located m repeat units (mΔa base pairs) away, each contributing a term y_m(a+mΔa). Thus, the index m refers to the order of the (stutter) peak: y₀(.) is the main, allelic peak, y₁(.) is the first stutter peak, and so on. The argument of y_m(.) refers to the location of the main, allelic peak. Our aim is to describe the total set of peaks by a model with as few parameters as possible. The values of the parameters will depend on the marker, PCR conditions, settings of the electrophoresis apparatus, etc. Knowing the stutter parameters, it is possible to deconvolve the measured signal Y(a) of a DNA pool into the contributions of individual alleles and hence calculate the frequency of each allele in the pool.
The ratio between allele and stutter signal appears to be marker specific, and since pool patterns do not contain enough information to obtain reliable estimates for the stutter parameters used in the model, these parameters have to be derived by fitting the model to a number of individual test patterns for each marker.

Stutter pattern of a homozygous individual

It can be demonstrated that the heights of stutter peaks decay exponentially with the number of stutters, as clearly shown in a logarithmic plot (Figure 1b), in which a straight line can be drawn through the tops of the stutter peaks. A few simple theoretical assumptions about the nature of DNA amplification predict this exponential behavior.²² From many of such plots we found empirically that the ratios of the heights of successive stutter peaks are roughly constant for all samples of the same marker and amplification condition, but that the constant differs between markers and conditions. We denote this constant by the ratio r. The first stutter peak is usually found to be proportionally higher compared to the main peak; in Figure 1b this is observed as a deviation of the stutter straight line with the top of the allelic peak. We therefore use a different ratio to describe the relationship between the first stutter peak and the allele peak:

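$$y_1(a_0) = \lambda\, r\, y_0(a_0), \tag{1a}$$

$$y_m(a_0) = r\, y_{m-1}(a_0), \quad m \ge 2, \tag{1b}$$

with 0<r<1, and λ>1, normally. This is for the ‘normal’ downward stutter. For the upward stutter, we take only one peak into account, as it is rare to see more up-stutter peaks; however, the model can easily be extended to more, if necessary:

$$y_{-1}(a_0) = \mu\, y_0(a_0), \tag{1c}$$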

with 0<μ≪1, normally.

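For all other positions, that is, positions at larger lengths:

$$y_m(a_0) = 0, \quad m \le -2. \tag{1d}$$

We can combine and rewrite Equations (1a), (1b), (1c) and (1d) as follows:

$$y_m(a_0) = y_0(a_0) \times \begin{cases} \lambda\, r^{m}, & m \ge 1 \\ 1, & m = 0 \\ \mu, & m = -1 \\ 0, & m \le -2 \end{cases} \tag{2}$$

The fragment length of (stutter) peak y_m(a₀) is

$$a = a_0 - m\,\Delta a. \tag{3}$$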

It is usually observed that stutter is more severe for longer alleles than shorter alleles. This can be understood, at least qualitatively, by realizing that a larger number of repeats offers more chances for the PCR process to stutter. We therefore introduce an a dependence of the stutter ratio r:

$$r(a) = \exp(b_0 + b_1 a). \tag{4}$$

For positive values of b₁, this formula yields an increasing stutter for increasing a.

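The true amount of signal at the allelic peak, that is, if no allele signal had been dissipated into stutter peaks, is represented by

$$y_t(a_0) = \sum_{m \ge -1} y_m(a_0) = y_0(a_0)\Bigl[\,1 + \mu + \lambda \sum_{m \ge 1} r^{m}\Bigr]. \tag{5}$$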

We have now described the set of stutter peaks by four parameters: b₀, b₁, λ, μ. In the trivial case of a pattern of a homozygous individual, Equation (2) can be fitted directly to the Y(a) data, with y₀(a₀) as a fifth fit parameter. The length of the allele, a₀, is read directly from the pattern; y₀(a₀) is just the height of the measured main peak Y(a₀); μ is the ratio of the up-stutter peak at a₀+Δa to the main peak. The factor r is determined from the slope of the logarithm of the heights of the stutter peaks y₁(a₀), y₂(a₀), …, which are taken directly from the measured peaks Y(a₀−Δa), Y(a₀−2Δa), … Then λ follows from y₁(a₀)/y₀(a₀), with the value of r inserted in Equation (2). In this example, r is kept constant; in a more realistic situation involving alleles of several lengths (such as in a pooled DNA sample), b₀ and b₁ can be fitted instead of a constant r.
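The direct read-off described for a homozygous pattern can be sketched in a few lines of Python. This is an illustrative sketch only, not the PoolFitter implementation: the function names, the dictionary layout keyed by stutter order m, and the synthetic peak heights are assumptions made for the example, and r is treated as constant, as in the paragraph above.

```python
import math


def homozygous_pattern(y0, lam, r, mu, n_stutter=4):
    """Synthetic single-allele pattern keyed by stutter order m.

    m = -1 is the up-stutter peak, m = 0 the allelic peak and
    m >= 1 the down-stutter peaks, following the combined stutter relation.
    """
    pattern = {-1: mu * y0, 0: y0}
    for m in range(1, n_stutter + 1):
        pattern[m] = y0 * lam * r ** m
    return pattern


def estimate_parameters(pattern):
    """Recover (y0, lam, r, mu) from a homozygous pattern.

    r comes from the least-squares slope of log(height) against m for the
    down-stutter peaks; lam then follows from y1 / (y0 * r).
    """
    y0 = pattern[0]
    mu = pattern.get(-1, 0.0) / y0
    orders = sorted(m for m in pattern if m >= 1)
    if len(orders) < 2:
        raise ValueError("need at least two down-stutter peaks to estimate r")
    logs = [math.log(pattern[m]) for m in orders]
    m_mean = sum(orders) / len(orders)
    log_mean = sum(logs) / len(logs)
    slope = sum((m - m_mean) * (v - log_mean) for m, v in zip(orders, logs))
    slope /= sum((m - m_mean) ** 2 for m in orders)
    r = math.exp(slope)
    lam = pattern[1] / (y0 * r)
    return y0, lam, r, mu


# Illustrative values only (not measured data).
synthetic = homozygous_pattern(y0=1000.0, lam=1.5, r=0.4, mu=0.03)
print(estimate_parameters(synthetic))
```

On noise-free synthetic input the estimates reproduce the generating values; with measured peaks the log-linear slope averages over the available stutter peaks instead of relying on a single ratio.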

Stutter pattern of a heterozygous individual

In the case of a heterozygous individual, there are more measured peaks to fit, and there is one extra fit parameter, namely the y₀ of the second allelic peak. We will refer to the two y₀s as 0S and 0L, located at a_0S and a_0L, respectively, with a_0S< a_0L (see inset of Figure 2). Heterozygous patterns often overlap to a large extent, and pool patterns always do. For two alleles close together, the measured peak heights in the overlapping region are the sum of two contributing peaks, one for the (shorter) S-allele, and one for the (longer) L-allele. For instance, the peak at the left arrow in the inset of Figure 2 is made up of the first stutter peak of the S-allele and the third stutter peak of the L-allele and is represented by

$$y(151) = y_1(a_{0S}) + y_3(a_{0L}), \tag{6}$$

with a_0S=153 and a_0L=157. This overlap makes the fit procedure more challenging, and numerical solution algorithms have to be invoked. We used the Levenberg–Marquardt method.²³ The result of such a fit is shown in Figure 2. The model fits the data well, and the relatively large contribution of the stutter peaks of the L-allele to both the allelic peak and stutter peaks of the S-allele is clearly seen.
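To make the overlap concrete, the following Python sketch separates the two allelic heights of a heterozygous pattern when the stutter parameters are held fixed. Under that assumption the problem is linear in the two heights and can be solved with ordinary least squares, which is a deliberate simplification: the paper fits all parameters jointly with the Levenberg–Marquardt method. The function names, the constant r, and the synthetic heights are assumptions for illustration.

```python
def stutter_weight(m, lam, r, mu):
    """Height of the order-m peak relative to the allelic peak (m = 0)."""
    if m == 0:
        return 1.0
    if m == -1:
        return mu
    if m >= 1:
        return lam * r ** m
    return 0.0  # no signal two or more repeats above the allele


def fit_two_alleles(measured, a_short, a_long, lam, r, mu, repeat=2):
    """Least-squares estimate of the two allelic heights.

    measured maps fragment length -> peak height; all lengths are assumed
    to lie on the repeat grid, so (a0 - length) is a multiple of `repeat`.
    """
    s_ss = s_sl = s_ll = s_sy = s_ly = 0.0
    for length, height in measured.items():
        w_s = stutter_weight((a_short - length) // repeat, lam, r, mu)
        w_l = stutter_weight((a_long - length) // repeat, lam, r, mu)
        s_ss += w_s * w_s
        s_sl += w_s * w_l
        s_ll += w_l * w_l
        s_sy += w_s * height
        s_ly += w_l * height
    det = s_ss * s_ll - s_sl * s_sl
    y_short = (s_sy * s_ll - s_sl * s_ly) / det
    y_long = (s_ss * s_ly - s_sl * s_sy) / det
    return y_short, y_long


# Synthetic heterozygote with alleles at 153 and 157 (illustrative heights).
LAM, R, MU = 1.4, 0.35, 0.02
measured = {
    a: 800.0 * stutter_weight((153 - a) // 2, LAM, R, MU)
    + 600.0 * stutter_weight((157 - a) // 2, LAM, R, MU)
    for a in range(147, 161, 2)
}
print(fit_two_alleles(measured, 153, 157, LAM, R, MU))
```

In this synthetic pattern the peak at 151 is the first stutter peak of the 153 allele plus the third stutter peak of the 157 allele, which is the situation described for the left arrow in the inset of Figure 2.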

Pattern of a pooled sample

The generalization to fitting a pattern of a pooled DNA sample containing alleles of n individuals is straightforward. At every measured fragment length a, the following peaks can contribute to y(a), depending on the presence of alleles in the pooled sample:

a)
the allelic peak y₀(a) of the allele at a₀=a;
b)
the up-stutter peak y₋₁(a−Δa) of the allele just left of it, at a₀=a−Δa;
c)
the first-order stutter peak y₁(a+Δa) of the allele just right of it, at a₀=a+Δa;
d)
higher-order stutter peaks y_m(a+mΔa) of alleles more to the right (m=2,3,…).

In a formula, this can be written as

$$y(a) = y_0(a) + y_{-1}(a - \Delta a) + \sum_{m \ge 1} y_m(a + m\,\Delta a), \tag{7}$$

with y_m(a+mΔa)=y₀(a+mΔa)λ exp((b₀+b₁a)m) for m≥1 the mth stutter peak of the allele at a₀=a+mΔa, and y₋₁(a−Δa)=y₀(a−Δa)μ the up-stutter peak of the allelic peak just left, at a₀=a−Δa. The arguments of y(.) in Equation (7) must lie in the measured range of allele lengths (a_min, a_max).

The y_t(a)s are now calculated as follows:

$$y_t(a) = \sum_{m \ge -1} y_m(a), \tag{8}$$

These values are proportional to the number of individuals n contributing to that allele a. To obtain estimates of the true allelic frequencies F(a) in the pool, one calculates:

$$F(a) = \frac{y_t(a)}{\sum_{a'=a_{\min}}^{a_{\max}} y_t(a')}, \tag{9}$$

where the summation is carried out over the full range (a_min, a_max) of the pool pattern.
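Once the four stutter parameters are known, the pool relation is linear in the allelic heights y₀(a), so the correction step can be illustrated as a linear solve followed by normalization. The Python sketch below is written under several assumptions: the function names are hypothetical, the stutter ratio is evaluated at the peak position as in the expression exp((b₀+b₁a)m), the total signal per allele is taken as the sum of its peaks inside the measured range, and a plain Gaussian elimination stands in for the fitting procedure used in PoolFitter. Negative height estimates that noisy data could produce are not handled.

```python
import math


def forward_matrix(lengths, lam, mu, b0, b1, repeat=2):
    """M[i][j]: signal at lengths[i] per unit allelic height at lengths[j]."""
    n = len(lengths)
    matrix = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(lengths):
        for j, a0 in enumerate(lengths):
            m = (a0 - a) // repeat  # stutter order of allele a0 seen at a
            if m == 0:
                matrix[i][j] = 1.0
            elif m == -1:
                matrix[i][j] = mu
            elif m >= 1:
                matrix[i][j] = lam * math.exp((b0 + b1 * a) * m)
    return matrix


def solve_linear(a_rows, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a_rows)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def deconvolve_pool(lengths, heights, lam, mu, b0, b1, repeat=2):
    """Return (allelic heights y0, allele frequencies F) for a pool pattern."""
    matrix = forward_matrix(lengths, lam, mu, b0, b1, repeat)
    y0 = solve_linear(matrix, heights)
    n = len(lengths)
    # Total signal per allele: its allelic, up-stutter and stutter peaks.
    y_total = [y0[j] * sum(matrix[i][j] for i in range(n)) for j in range(n)]
    grand_total = sum(y_total)
    freqs = {a: y_total[j] / grand_total for j, a in enumerate(lengths)}
    return y0, freqs


# Synthetic pool on a dinucleotide grid (illustrative values only).
lengths = [121, 123, 125, 127, 129, 131, 133]
true_y0 = [0.0, 300.0, 0.0, 1200.0, 500.0, 0.0, 200.0]
M = forward_matrix(lengths, 1.5, 0.03, -2.2, 0.01)
pool = [sum(M[i][j] * true_y0[j] for j in range(7)) for i in range(7)]
estimated_y0, frequencies = deconvolve_pool(lengths, pool, 1.5, 0.03, -2.2, 0.01)
```

The synthetic pool has signal at lengths 121 and 125 even though no allele is present there; after the solve, those positions receive a frequency of essentially zero, which is the effect the stutter correction is meant to achieve.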

To correct a pattern of a pooled DNA sample, one has to fit Equation (7) to the measured data Y(a). Values for the four model parameters b₀, b₁, λ, μ could be found from fitting the model to the genotype patterns of a small number of representative individuals one at a time, and deriving n_i values for each of the fit parameters. These n_i values could then be averaged to obtain a good estimate for each of the parameters. A much more efficient way is to perform the model fitting to all individual patterns simultaneously. The total number of data points is n_i m_i, with m_i the average number of measured peaks per individual. The total number of fit parameters is 4 + n_i(1+h), with h the calculated heterozygote frequency of the marker (0≤h≤1). A set of n_i=10 individuals, each with on average m_i=7 data points and a heterozygote frequency of h=0.5, requires fitting a model with 4 + 10×1.5 = 19 parameters to a combined data set of 10×7 = 70 data points, which, as shown in Figure 3 for marker D6S273, yields a very stable fit.

Comparison with a deconvolution method

The most robust method previously published is the deconvolution method described by Perlin et al.¹⁵ Like our method, it uses a set of individual patterns to obtain the stutter behavior. The main difference is that we fit a model to the data to describe the stutter behavior, which makes our method potentially much more robust, so that fewer individual patterns should be needed to train it. This is tested below.

Results

Minimum number of genotypes required in the training set

For marker D6S273, we investigated the influence of training set size on the reproducibility of the results by fitting models based on sets varying in size from 2 to 30 individuals, which were taken at random from the n individuals in the pool. For each chosen set size, a random selection of individuals was taken 20 times to derive the model parameters and to correct the pool data. Figure 4 shows the effect of training set size plotted against the spread in the corrected peak height of one of the alleles (a=127; see Figure 3) in the pool. We chose to show this allele because of its low frequency (3%), in which adequate correction is crucial. For sets smaller than about five individuals, the variation in the results was relatively large, but for 10 individuals or more, the gain in reproducibility was limited. The effect of training set size was also tested for two other dinucleotide markers, with similar results (not shown). A set of n_i=10 was found to give reliable results with a coefficient of variation of about 1%.
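The resampling scheme described above (repeated random training sets of a given size, with the spread of the corrected result summarized as a coefficient of variation) can be outlined as follows. This Python sketch shows only the bookkeeping: `estimate` is a hypothetical callback standing in for "fit the model on this training set and return the corrected peak height of the allele of interest", and the seed and default of 20 repeats are choices made for the example.

```python
import random
import statistics


def resampling_cv(individuals, set_size, estimate, repeats=20, seed=0):
    """Coefficient of variation of `estimate` over random training sets.

    individuals: sequence of individual genotype patterns (any objects)
    set_size:    number of individuals drawn without replacement per repeat
    estimate:    hypothetical callable mapping a training set to one number
    """
    rng = random.Random(seed)
    values = [
        estimate(rng.sample(individuals, set_size)) for _ in range(repeats)
    ]
    return statistics.stdev(values) / statistics.mean(values)


def cv_by_set_size(individuals, sizes, estimate, repeats=20, seed=0):
    """Spread for each training set size, e.g. sizes 2 to 30."""
    return {
        size: resampling_cv(individuals, size, estimate, repeats, seed)
        for size in sizes
    }
```

Plotting the returned values against set size would give a curve of the kind shown in Figure 4, provided `estimate` wraps an actual model fit.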

This test was also carried out for Perlin et al's method.¹⁵ For all alleles and values of n_i, the variation in estimated frequencies from this method was at least three times as large. For n_i=30, the variation was still twice as high as our method's variation for n_i=10.

Robustness to atypical training sets and measurement errors

Using data for marker D6S273, the robustness of the algorithm was checked in the following way: various sets of n_i=10 individuals were used as a training set to fit the model to the pooled data depicted in Figure 3: (i) a set of 10 homozygous individuals; (ii) a set of 10 heterozygous individuals; (iii) a set of 10 individuals whose alleles were closely packed together in a certain region, leaving part of the allelic range of the pool uncovered; (iv) a set of eight regular individuals plus two measurement errors: patterns containing an allele exhibiting a completely different stutter behavior to the others (but with peaks in the same molecular weight range).

All tests yielded good results that hardly differed from the ‘normal’ pool fit results of Figure 3.

A comparison of test (ii) with test (i) shows that no prechosen homozygous (or well-separated heterozygous) individual patterns are needed to derive good parameter estimates. Further, test (iii) shows that there is no need for training data to cover the full molecular range of alleles. Only in the extreme case of having only data points at one extreme of the molecular range in the training set do the pool results at the other end become less reliable. Test (iv) simulates the presence of measurement errors. If one or two of the 10 individual patterns are dissimilar to the others, for example, because of an artefact in the PCR process or a measurement error, the fit procedure does not appear to be misguided. The test showed that the fits derived from a training set of eight normal and two abnormal patterns were nearly as good as those based on 10 good patterns.

Validation of the model

For 34 different microsatellite markers, correction models were derived from the same training set of 10 individuals, and both uncorrected and corrected pools were compared with the true pools (for definitions see Materials and methods section). An example is shown in Figure 3. In total, five genotypes from four different markers could not be determined reliably, and these were discarded. Correlation coefficients of uncorrected and corrected pools vs true pools for all 34 markers are given in Table 1. A graphical representation of the data for 16 characterized markers can be found on our web site (Figure C).

Table 1 Statistical comparison of allele frequencies obtained by individual genotyping and frequency estimates from uncorrected and corrected pool patterns

The only markers in which uncorrected pools approached true pools were the four tetranucleotide markers. For the dinucleotide markers, uncorrected pools were generally very different from the true pools, whereas corrected and true pools did not differ significantly, with the exception of markers D11S1760, D7S2422, and kk9. For marker D11S1760, there was a large overestimation of the frequency of the shortest allele in both uncorrected and corrected pools. Analysis of all individual genotype patterns for this marker revealed that stutter did not increase with allele length in a regular fashion (see web Figure D), which is an underlying assumption in the correction model. Marker D7S2422 showed a systematic overestimation of the peak height of the shorter allele in heterozygotes in the PoolFitter program, which persisted after correction for stutter. This suggested preferential amplification of shorter alleles, and after applying a simple compensation in the program, the differences between corrected and true pools were no longer significant (data not shown). No evidence for preferential amplification was found in the other markers (see web Figure E). Marker kk9 had two extra alleles (together accounting for 18% of all alleles in the true pool), with a size exactly between alleles at the regular 2 bp intervals. These aberrant alleles were discarded from the analysis, since the correction method ignores alleles at irregular intervals.

Case–control study

We investigated the application of the correction method in a case–control study in celiac disease (CD). DNA from 50 CD patients and 100 healthy controls was combined into two pools. Five microsatellite markers that had previously been used in association studies of CD patients were blinded and analyzed in CD and control pools. For three markers, allele frequencies did not differ significantly between cases and control pools, in either individual genotyping or pooled analysis. The other two markers showed significant differences between cases and controls. In each marker, one allele was very strongly associated, and already detectable in uncorrected pools, but after stutter correction, both markers also showed a much weaker but significant association with a second allele (see Figure 5). Both weak and strong associations were also demonstrable in the summed individual analysis.
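Because stutter correction yields a frequency for each allele, the frequencies can be converted into allele counts and compared between case and control pools with a standard contingency-table statistic. The Python sketch below computes a plain Pearson χ² statistic for a 2×k table; it is not the Monte Carlo procedure of the CLUMP program used in this study, and the function names and numbers are illustrative assumptions rather than data from the celiac disease pools.

```python
def allele_counts(frequencies, n_individuals):
    """Expected allele counts in a pool of diploid individuals."""
    total_alleles = 2 * n_individuals
    return [f * total_alleles for f in frequencies]


def pearson_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    grand_total = n_cases + n_controls
    statistic = 0.0
    used_columns = 0
    for case, control in zip(case_counts, control_counts):
        column_total = case + control
        if column_total == 0:
            continue  # allele absent from both pools
        used_columns += 1
        for observed, row_total in ((case, n_cases), (control, n_controls)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_columns - 1


# Illustrative frequencies for a three-allele marker (not study data).
cases = allele_counts([0.50, 0.30, 0.20], n_individuals=50)
controls = allele_counts([0.35, 0.40, 0.25], n_individuals=100)
statistic, dof = pearson_chi_square(cases, controls)
```

Alleles with small expected counts make the χ² approximation unreliable, which is one reason a simulation-based test may be preferred for multi-allelic markers.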

Discussion

Although the use of pooled DNA enormously reduces the amount of genotyping in comparing cases and controls, it suffers from the inability to generate haplotype information. As a result, microsatellite markers, with their high information content, are much more suitable than SNPs for use in pooled DNA samples. The number of potentially polymorphic microsatellites in the genome is much higher than the number of characterized markers in public databases. For example, in 11 schizophrenia candidate genes, we have tested 19 polymorphic microsatellites, eight of which were intragenic, while flanking markers were on average at 45 kb distance from the gene (max 130 kb). In a schizophrenia candidate region, we found nearly 250 potentially polymorphic microsatellites with an average spacing of 55 kb (max 168 kb). However, the widespread application of microsatellite markers in DNA pooling may have been prevented by uncertainties induced by stutter artefacts and the consequent distortion of allele frequency estimates.

We have developed a novel method, which enables accurate extraction of allele frequencies from microsatellite pool signals. A prerequisite for the application to large studies is that the correction method does not entail much additional analysis time. Our method meets this requirement, since only the same training set of 10 independent DNA samples, plus the pool samples, is required to carry out an analysis for a given marker. An apparent advantage of our approach is that there is no requirement for stutter and allelic peak signals of heterozygous individuals to be clearly separated, which greatly reduces the number of individuals required. There was little gain in accuracy when more than 5–10 individual genotypes were used, and accordingly, we chose one set of the same 10 independent individuals for all analyses, to allow for occasional dropouts. Other advantages of our fit algorithm are the simultaneous fitting of all data, which decreases the sensitivity to aberrant data, and the fact that the size distribution of alleles and stutter, or alleles with an anomalous stutter height, had little influence on the predictive accuracy of the model.

The model was tested on DNA pool patterns with 34 different microsatellite markers, 18 of which were newly defined from human sequence data, since well-characterized markers could have been selected for their accuracy in genotyping. Our results with tetranucleotide markers confirm previous reports that stutter is low in these markers (generally <5%) and that no stutter correction is required.^{7, 12, 24} Significantly, for the two dinucleotide markers in which correction remained inaccurate, the presence of an aberration was readily detected in the PoolFitter program, even though it could not correct the stutter distortion.

In a case–control study involving celiac disease, marker alleles that were weakly associated in individual genotyping were also found to be associated in the pool analysis, but only after stutter correction. These two exceptionally strongly associated markers in the HLA region would have been detected even without correction, but would have been missed if only the weakly associated alleles had been present. This clearly demonstrates the benefit of stutter correction in DNA pooling.

Taken together, stutter correction generally resulted in accurate estimates of true allele frequencies in DNA pools. Compared with methods that use uncorrected pool patterns, several important advantages are apparent. The recently proposed ΔAIP and ΔTAC methods compare overall differences in peak area or peak height between pool patterns.^{8, 9} However, both methods assume a single fixed stutter profile for all markers and simulate large numbers of pool patterns to determine what proportion by chance will deviate significantly. Since the heights as well as the number of stutter peaks can differ greatly between markers, these methods raise the question whether realistic significance levels can be calculated in this way. In any case, such an approach prevents ascribing differences between pools to single alleles and summing results from different subpools or different experiments.¹² These drawbacks are not evident in our method.

We found that technical measures, such as reducing the number of PCR cycles, and adding pig-tail sequences to primers to eliminate plus-A artefacts, and separation on a capillary sequencer instead of a slab gel machine,²¹ consistently improved the accuracy of DNA pool measurements. However, the nature of DNA pooling will inevitably result in some loss of sensitivity compared to individual genotyping.²⁵ Furthermore, a four-parameter model is not a perfect description of reality.

Despite these and other limitations, such as the lack of haplotype information, until cheap and rapid large-scale individual genotyping becomes technically feasible, DNA pooling methods allow efficient initial screening of candidate regions and candidate gene systems. In pooled DNA, microsatellites are much more informative than single SNPs. In a second phase, associated microsatellites could then be followed up by individual genotyping of high-density SNP markers, and haplotype analysis. Even if cases and controls were divided into pools of only 100 individuals each, as recently advocated,^{16, 26} and all amplified in triplicate, DNA pooling decreases genotyping by a factor of 30 in studies involving 500 cases and 1000 controls.
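One way to arrive at a reduction of roughly this size is to count reactions per marker. The Python sketch below does so under stated assumptions: each individual genotype and each pool replicate counts as one reaction, pools are filled to 100 individuals, and the optional training set of 10 individuals is added to the pooled design. The function name is hypothetical and the calculation is an illustration, not the authors' own derivation.

```python
import math


def pooled_reactions(n_cases, n_controls, pool_size, replicates, training_set=0):
    """Reactions per marker when cases and controls are pooled separately."""
    n_pools = math.ceil(n_cases / pool_size) + math.ceil(n_controls / pool_size)
    return n_pools * replicates + training_set


individual = 500 + 1000                                   # one reaction each
pools_only = pooled_reactions(500, 1000, 100, 3)          # 15 pools in triplicate
with_training = pooled_reactions(500, 1000, 100, 3, 10)   # plus training set

print(individual / pools_only)      # about 33-fold fewer reactions
print(individual / with_training)   # about 27-fold fewer reactions
```

Both ratios bracket the quoted factor of 30, depending on whether the training set is counted.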

Our results confirm that the accuracy of analyzing corrected pool patterns generated from microsatellites approaches that of individual genotyping. Particularly in complex disorders, where the association of marker alleles with disease loci is likely to be only moderate or weak, a gain in sensitivity with stutter correction in pooled analyses justifies the limited amount of extra genotyping required to create a small training set.

In conclusion, we have demonstrated that accurate estimates of microsatellite allele frequencies from DNA pools are feasible with a novel stutter correction method requiring one standard training set of only 10 additional individual genotypes. This method opens the way for realistic large-scale genetic association studies using microsatellite markers.

References

1. Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996; 273: 1516–1517.
2. Dunning AM, Durocher F, Healey C et al: The extent of linkage disequilibrium in four populations with distinct demographic histories. Am J Hum Genet 2000; 67: 1544–1554.
3. Jorde LB: Linkage disequilibrium and the search for complex disease genes. Genome Res 2000; 10: 1435–1444.
4. Abecasis GR, Noguchi E, Heinzmann A et al: Extent and distribution of linkage disequilibrium in three genomic regions. Am J Hum Genet 2001; 68: 191–197.
5. Innan H, Padhukasahasram B, Nordborg M: The pattern of polymorphism on human chromosome 21. Genome Res 2003; 13: 1158–1168.
6. Salisbury BA, Pungliya M, Choi JY, Jiang RH, Sun XJ, Stephens JC: SNP and haplotype variation in the human genome. Mutat Res 2003; 526: 53–61.
7. Barcellos LF, Klitz W, Field et al: Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet 1997; 61: 734–747.
8. Collins HE, Li H, Inda SE et al: A simple and accurate method for determination of microsatellite total allele content differences between DNA pools. Hum Genet 2000; 106: 218–226.
9. Daniels J, Holmans P, Williams N et al: A simple method for analyzing microsatellite allele image patterns generated from DNA pools and its application to allelic association studies. Am J Hum Genet 1998; 62: 1189–1197.
10. Fisher PJ, Turic D, Williams NM et al: DNA pooling identifies QTLs on chromosome 4 for general cognitive ability in children. Hum Mol Genet 1999; 8: 915–922.
11. Plomin R, Hill L, Craig IW et al: A genome-wide scan of 1842 DNA markers for allelic associations with general cognitive ability: a five-stage design using DNA pooling and extreme selected groups. Behav Genet 2001; 31: 497–509.
12. Kirov G, Williams N, Sham P, Craddock N, Owen MJ: Pooled genotyping of microsatellite markers in parent-offspring trios. Genome Res 2000; 10: 105–115.
13. LeDuc C, Miller P, Lichter J, Parry P: Batched analysis of genotypes. PCR Methods Appl 1995; 4: 331–336.
14. Lipkin E, Mosig MO, Darvasi A et al: Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk protein percentage. Genetics 1998; 149: 1557–1567.
15. Perlin MW, Lancia G, Ng S-K: Toward fully automated genotyping: genotyping microsatellite markers by deconvolution. Am J Hum Genet 1995; 57: 1199–1210.
16. Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA pooling: a tool for large scale association studies. Nat Rev Genet 2002; 3: 862–869.
17. Sham PC, Zhao JH, Curtis D: The effect of marker characteristics on the power to detect linkage disequilibrium due to single or multiple ancestral mutations. Ann Hum Genet 2000; 64: 161–169.
18. Morris RW, Kaplan NL: On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol 2002; 23: 221–233.
19. Bakker SC, van der Meulen EM, Buitelaar JK et al: A whole-genome scan in 164 Dutch sib pairs with attention-deficit/hyperactivity disorder: suggestive evidence for linkage on chromosomes 7p and 15q. Am J Hum Genet 2003; 72: 1251–1260.
20. Brownstein MJ, Carpenter JD, Smith JR: Modulation of non-templated nucleotide addition by Taq DNA polymerase: primer modifications that facilitate genotyping. BioTechniques 1996; 20: 1004–1010.
21. Sham PC, Curtis D: Monte Carlo tests for associations between disease and alleles at highly polymorphic loci. Ann Hum Genet 1995; 59 (Part 1): 97–105.
22. Miller MJ, Yuan B-Z: Semiautomated resolution of overlapping stutter patterns in genomic microsatellite analysis. Anal Biochem 1997; 251: 50–56.
23. Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical recipes in C – the art of scientific computing, 2nd edn. Cambridge: Cambridge University Press, 1992.
24. Shaw SH, Carrasquillo MM, Kashuk C, Puffenberger EG, Chakravarti A: Allele frequency distributions in pooled DNA samples: applications to mapping complex disease genes. Genome Res 1998; 8: 111–123.
25. Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann Hum Genet 2002; 66: 393–405.
26. Sawcer S, Maranian M, Setakis E et al: A whole genome screen for linkage disequilibrium in multiple sclerosis confirms disease associations with regions previously linked to susceptibility. Brain 2002; 125: 1337–1347.

Acknowledgements

We acknowledge the special contribution of Lodewijk Sandkuijl in formulating some of the basic concepts embodied in this work; he died shortly before completion of the manuscript. We also thank Martine van Belzen for providing the celiac disease samples and individual genotypes, and Jackie Senior for critically reading the manuscript.

Author information

Hugo G Schnack and Steven C Bakker: Both authors contributed equally to this work

Authors and Affiliations

Department of Psychiatry, University Medical Center Utrecht, The Netherlands
Hugo G Schnack & Rene S Kahn
Department of Medical Genetics, University Medical Center Utrecht, The Netherlands
Steven C Bakker, Richard J Sinke & Peter L Pearson
Department of Veterinary Medicine, Utrecht University, Utrecht, The Netherlands
Ruben van't Slot
Institute of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
Bart M Groot

Corresponding author

Correspondence to Hugo G Schnack.

Additional information

Electronic Database Information

The PoolFitter program and additional illustrative figures are available at our website: http://www.smri.nl/microsatellites
CLUMP (DOS version): http://www.mds.qmw.ac.uk/statgen/dcurtis/software.html
Primer3: http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi
Tandem Repeat Finder: http://c3.biomath.mssm.edu/trf.html
Genome Database (mirror site): http://gdbwww.dkfz-heidelberg.de/
Marshfield Center for Medical Genetics: http://research.marshfieldclinic.org/genetics/

About this article

Cite this article

Schnack, H., Bakker, S., van't Slot, R. et al. Accurate determination of microsatellite allele frequencies in pooled DNA samples. Eur J Hum Genet 12, 925–934 (2004). https://doi.org/10.1038/sj.ejhg.5201234

Received: 01 October 2003
Revised: 29 April 2004
Accepted: 05 May 2004
Published: 11 August 2004
Issue Date: 01 November 2004
DOI: https://doi.org/10.1038/sj.ejhg.5201234

Keywords

microsatellite marker; pool; stutter; PCR
