Introduction

The Centre d’Etude du Polymorphisme Humain (CEPH) genotype database consists of 61 families with large sibships that have served as a reference panel for the construction of human genetic maps by the international scientific community (Dausset et al. 1990; White et al. 1985). As a result of this collaboration, information has accumulated in these families on thousands of genetic markers, including restriction fragment length polymorphisms, minisatellites, and microsatellites (CEPH genotype database, The Utah Marker Development Group 1995).

Utah CEPH pedigrees constitute the majority (47 pedigrees) of the CEPH reference panel. In recent years, family members have been invited to participate in new clinical interviews and sample collection with support from the Keck Foundation. Multiple ongoing collaborative efforts will continue to yield extensive phenotyping of both quantitative and qualitative traits of biochemical, physiological, and clinical relevance. Furthermore, additional genotyping of a large set of highly informative markers in all Utah CEPH families will soon reach completion, providing a refined map and an extensive database of genotypes on this set of families. This extensive genotyping and phenotyping should make the Utah CEPH families a unique and invaluable resource for analysis of the genetic diversity that underlies individual variation.

The pedigree structures sampled in Utah consisted of a large sibship with at least eight children, their parents, and their living grandparents at the time of sampling. These structures were selected to optimize the probability of identifying genetic linkage: (1) the large sibship size affords replication of segregating events in informative families, and (2) the inclusion of grandparents can yield information on the phase of the loci under investigation (Ott 1999). Linkage analysis of all these traits will require large amounts of computer resources and analysis time. Furthermore, the likelihood that linkage will be detected for any particular trait will depend on the degree and the nature of genetic influences. Therefore, before undertaking such a task, it would be of interest to identify the statistical properties of a trait that would increase the chance of detecting linkage. This can be accomplished by estimating the power to detect linkage depending on these characteristics of the trait.

The investigation presented here was restricted to the analysis of quantitative traits. It has often been advocated that unraveling the genetics of complex traits such as common diseases would be markedly facilitated by the genetic analysis of intermediate phenotypes, typically physiological or biochemical parameters closer to gene action. These intermediate phenotypes are generally quantitative variables. Quantitative traits have been the staple of genetics in animal and plant breeding (Lynch and Walsh 1997), and their use in human genetics has long been advocated (Elston 1979; Haseman and Elston 1972). Analytical methods for the genetic analysis of quantitative traits in human genetics have received increased attention in recent years (Blangero et al. 2000).

In the design of an experiment, it is essential for an investigator to evaluate its statistical power, which in the present case is the probability of detecting linkage given that linkage actually exists (Haines and Pericak-Vance 1998; Risch 1997; Sham 1998). Due to the complexity of the likelihood calculation for family data, power is not readily assessed analytically (Boehnke 1986; Ploughman and Boehnke 1989). Therefore, simulations are used; a large number of replicates of data are simulated based on a priori specified conditions followed by linkage analysis of each of these replicates. The power is estimated as the proportion of replicates displaying lod scores above a given threshold (Ott 1999). A number of published power analyses have addressed similar issues (Duggirala et al. 1997; Greenberg et al. 1998; Rijsdijk et al. 2001; Shugart et al. 2002; Wijsman and Amos 1997; Williams and Blangero 1999). Wijsman and Amos (1997) and Williams and Blangero (1999) compared the power to detect linkage in nuclear families versus extended pedigrees; and another group compared two-point (the use of a single marker) with multipoint (the simultaneous use of multiple markers) linkage analysis (Ekstrom 2001). These studies have documented the expected increase in power as a function of pedigree size and the number of markers used in linkage analysis. Such information has proven extremely useful in interpreting discrepancies or insignificant results caused by the lack of power.

The following question was addressed in our study: for what characteristics of a trait measured on the Utah CEPH pedigrees is there enough information to detect linkage of a marker to a quantitative trait locus (QTL) based on the marker information available? This question was answered by simulating data using single locus models and then performing linkage analysis on the simulated data using parametric single locus (PSL) or variance components (VC) models.

Subjects and methods

The Utah CEPH pedigrees

The Utah CEPH families consist of 47 pedigrees with large sibship size, their parents, and their living grandparents. Since 1997, members of 45 Utah CEPH pedigrees (two families are not participating) have been measured for a large number of qualitative and quantitative traits as part of the Utah Genetic Reference Project (UGRP). In its initial phase, completed at the time the present study began, the project had involved the participation of 30 of the 45 Utah CEPH families. (The Utah CEPH pedigree individual numbers can be obtained from the author upon request.) There were 418 individuals in the 30 pedigrees ranging in size from nine to 16 individuals. Of these, 312 were phenotyped, including 16 grandparents, 42 parents, and 254 offspring. Power calculations were performed for this sample set. Extrapolations to the entire set of families are presented in the Results and discussion section. All UGRP study subjects gave informed consent under University of Utah IRB approved protocol number 6090–96.

Assumptions pertaining to marker data for the simulations

The large number of markers that has been genotyped in the Utah CEPH pedigrees have allowed us to make the following assumptions for the simulation of marker data: first, we simulated a marker with heterozygosity of 75% and a recombination fraction of 0.05 between marker and QTL; this was done by assuming four alleles with equal frequencies at the marker locus. Second, as an approximation to multipoint linkage analysis, simulations were performed for a fully informative marker generating four different alleles among the parents with the assumption of a recombination fraction of 0.005. We refer subsequently to these two models as two-point and fully informative, respectively.

Simulations

We defined a large number of simulation models, with a wide range of parameters, to include most of the characteristics that vary among quantitative traits. Phenotypic data were simulated under single locus models. Each simulated QTL was assumed to have two alleles in Hardy-Weinberg equilibrium. Phenotypes were simulated varying the following conditions: (1) dominant or recessive inheritance; (2) displacement (the difference between the two homozygote means in standard deviation units of the trait distribution conditional on major genotype) of either 2, 2.5, or 3 ; (3) residual heritability (the proportion of the within-genotype variance due to genetic background) equal to 0.1, 0.3, or 0.5; and (4) prevalence (the percentage of individuals who are gene carriers) of 10, 20, or 30%. Specifically, prevalence was equal to q2 for recessive models and 2pq + q2 for dominant models. With all combinations of parameter values, 27 models were generated for each mode of inheritance (dominant or recessive). Recombination fractions were assumed to be equal in males and females.

The pedigree structures were fixed, but both phenotypic and genotypic data were simulated in all cases using the Pedigree Analysis Package (PAP) (Hasstedt 2002). The following steps were used to generate the data: (1) using the two-point (trait and marker) genotype frequencies for a given model, genotypes were assigned to the founders in accordance with Hardy-Weinberg proportions based on selecting a random number; (2) in a similar manner, offspring were assigned genotypes conditional on the parental genotypes, the recombination fraction between the marker and trait locus, and Mendelian transmission probabilities; (3) using the assumed means and standard deviations, quantitative phenotypes were assigned to each individual using the normal density function. Assignment of genotypes and phenotypes to all individuals in the 30 pedigrees completed one replicate. Five hundred replicates were generated for each model.

Linkage analysis and power estimation

Upon completion of the simulations, linkage analysis was performed on 500 replicates for each model using two methods: PSL and VC. In the PSL method, the model underlying a QTL is specified in terms of mean genotypic effects (under dominant or recessive modes of inheritance), total phenotypic variance, and gene frequency. In the VC method, on the other hand, genetic modeling involves partitioning of the variance into genetic and residual environmental components (Ott 1999). Both the PSL and VC analyses utilize maximum likelihood estimation methods.

For PSL analyses, we assumed either dominant or recessive inheritance with a prevalence of 10% and 30% and a displacement of 2.5. This yielded one VC and four PSL analyses for each replicate. The program LINKAGE (Lathrop et al. 1985) was used for the PSL analyses, and GENEHUNTER (Kruglyak et al. 1996; Pratt et al. 2000) was used for VC analyses. After LINKAGE was used to estimate lod scores for a range of recombination fractions (0, 0.01, 0.05, 0.1, 0.2, 0.3, and 0.4), the maximum lod score across the different recombination fractions for each replicate was used to estimate the power. Power was defined as the percentage of replicates above a given lod score threshold. We estimated power for lod >2 (suggestive linkage) and lod >3 (significant linkage). Power estimates are presented only for the latter situation.

Results and discussion

Power estimates

Tables 1, 2, 3, and 4 summarize analyses of the different simulation models, displaying a wide range of estimated power to achieve a lod score greater than 3. Table columns summarize (1) parameter values assigned in the simulation model, (2) power estimates obtained for PSL analysis assuming prevalence of either 10% or 30%, and (3) power estimate obtained by VC analysis. The power reported for the parametric analyses was based on maximum lod scores across a range of recombination fractions for the simulated two-point and fully informative models.

Table 1 Power estimates (%) for lod >3. Simulation model: recombination fraction of θ=0.05, two-point, dominant; analysis model: dominant. QTL quantitative trait locus
Table 2 Power estimates (%) for lod >3. Simulation model: recombination fraction of θ=0.05, two-point, recessive; analysis model: recessive. QTL quantitative trait locus
Table 3 Power estimates (%) for lod >3: Simulation model: recombination fraction of θ=0.005, fully informative, dominant; analysis model: dominant. QTL quantitative trait locus
Table 4 Power estimates (%) for lod >3. Simulation model: recombination fraction of θ = 0.005, fully informative, recessive; analysis model: recessive. QTL quantitative trait locus

The power estimates varied widely as a function of the underlying prevalence, displacement, and QTL variance (and therefore total heritability) assumed in the data-generating models. As anticipated, displacement is a critical determinant of the power for both PSL and VC models. For the two-point models, when the simulation and analysis models were both dominant with the same prevalence, power for lod >3 ranged from 44% to 92% for a displacement of 2.5 with QTL variance ranging from 0.36 to 0.57 (total heritability 0.42–0.78). For a displacement of 3, power for lod >3 ranged from 60% to 98% with a QTL variance range of 0.45–0.65 (total heritability 0.50–0.83) (Tables 1, 2). The power of multipoint linkage analysis was approximated using a tightly linked, fully informative marker (Tables 3, 4). As expected from formal analyses, full marker information led to a marked gain in power in all PSL and VC analyses. When the simulation and analysis models were both dominant with the same prevalence, power for lod >3 ranged from 64% to 99% for a displacement of 2.5 with QTL variance ranging from 0.36 to 0.57 (total heritability 0.42–0.78). For a displacement of 3, power for lod >3 ranged from 81.4% to 100% with a QTL variance range of 0.45–0.65 (total heritability 0.50–0.83).

Comparing data generated under a dominant or a recessive model for the same displacement and QTL variance, the lower the population prevalence, the lower was the power when a recessive model was assumed for both the simulation and linkage analysis. This reflects the fact that, at lower population prevalences, the proportion of informative families and the proportion of segregating offspring is greater for a dominant than for a recessive trait. As prevalence increases, the behavior of the two models becomes more similar due to the manner in which these two proportions are differentially affected by these two modes of inheritance. Misspecifying the dominance of the trait is known to markedly reduce the statistical power of parametric methods of analysis (Haines and Pericak-Vance 1998). In the present instance, we found that the power was less than 20% in all cases where this relationship was misspecified in the analysis (data not shown).

Parametric models not only require an assumption about dominance but also depend on assumptions about the prevalence. Our analyses consistently showed that the loss of power was more pronounced when prevalence was overestimated than when it was underestimated in the analysis (Tables 1, 2, 3, 4). A similar observation was made by Pal et al. (2001).

The relative power of PSL and VC models could also be evaluated from Tables 1, 2, 3 and 4. As expected, PSL models had markedly greater power than VC models when the parameters used in the analysis were close to those used to simulate the data. The robustness of VC models, however, was reflected in the greater power they achieved compared to PSL models when the dominance relationship was misspecified. A note of caution is in order regarding these conclusions, however, since the distribution of type I and type II errors may not be similar for the two approaches.

Another feature revealed by inspection of Tables 1, 2, 3 and 4 was that residual heritability, while it had little impact on the power of PSL models, led to a proportional increase in power of VC methods, particularly when a dominant model was assumed in the simulation models.

Extrapolation of power for the full set of CEPH families

Whereas simulations were performed for the 30 pedigrees sampled in the initial phase of the reinterview process, 45 families are expected to participate in the UGRP. To estimate the gain in power that could be expected from this sample size increase, the empirical distribution of lod scores for each set of replicates was examined. For PSL models of analysis, the square root of the lod scores followed approximately a normal distribution. We therefore used a normal approximation to assess the gain in power; the VC lod scores did not distribute normally and were not included in the approximation. Since all the Utah CEPH families have similar pedigree structures, it was reasonable to assume that the families in the current sample of 30 pedigrees are representative of the 15 more to be added.

The mean and variance of the lod score distribution for a given analysis model were multiplied by 1.5 (since the sample size will increase 1.5 times) using the principles of statistics that state E(cX)=cE(X) and Var(cX)=c2Var(X), where c is a constant, E(X) is the expected value (mean) of X, and Var(X) is the variance of X. Therefore, in our analysis, the mean was multiplied by 1.5 and the variance was multiplied by (1.5)2 . Using these 45-pedigree means (μ) and variances (σ2), a standard normal equation (√lod − μ)/σ (where √lod = √cut-off lod score which, in our case, equals 3) was then referred to a table of the normal density function. The results of these power calculations for an extended sample of 45 pedigrees are presented in Table 5. The average power increase varied from 17% to 26% with maximum values ranging from 34% to 40%.

Table 5 Average and maximum increase (%) in power (lod >3) with the addition of 15 pedigrees to the sample. Results are based on parametric analysis models

Conclusion

The analyses presented here estimate the power that can be achieved for linkage studies performed in the Utah CEPH pedigrees as a function of the parameters that characterize the distribution of a quantitative trait. The tables show parameter values likely to provide adequate power prior to analysis whether PSL or VC methods are to be used. Although a QTL variance <26%, total heritability <35%, and a displacement <2 gave low power estimates for lod >3, values above these limits yielded adequate power (a large number of models giving power >85%). Trait parameters that play a large role in detecting linkage are prevalence, displacement, QTL variance, and total heritability of the trait. As the focus shifts to intermediate phenotypes in the genetic analysis of common disease, this analysis documents that the use of a reference set with large sibship size such as the Utah CEPH families can provide considerable statistical power to detect linkage for common traits.