Introduction

For a common trait, prevalence is easily estimated from a random sample of the population. However, this is prohibitively expensive for a rare disease, which is often ascertained through probands [1]. Population prevalence must then be estimated via the ascertainment probability.

Specialized indirect methods have been devised that do not rely on complete enumeration of all cases; they work with probands but use additional information that obviates the need for ascertainment corrections. In this paper, we are concerned with recessively inherited traits. For these, when q denotes the population frequency of the disease allele, q² is the disease incidence under random mating. Assuming an equal life expectancy for affected and unaffected individuals, q² is also the disease prevalence. A common method of estimating q is due to Dahlberg [2] and relies on the observation that recessive traits tend to occur more frequently among offspring of consanguineous matings than among offspring of unrelated parents. This method and extensions of it have been covered by Li [3] and applied, for example, to cystic fibrosis in Italy [4]. Below, we propose a new map-based method for estimating q and compare its efficiency with that of the Dahlberg method and of simple population sampling.

For the two methods, probands are defined as follows. In our map-based method, a proband is an affected individual whose parents are first cousins, whereas in the Dahlberg method, a proband is any affected individual. Thus, as will be outlined in the discussion, the cost of ascertaining probands differs between the two methods. For each method, N denotes the number of probands.

Methods

Map-Based Method

Consider a recessive trait with disease allele frequency q. For a random individual, the probability of being affected (homozygous) is given by Fq + (1 - F)q², where F is the individual’s inbreeding coefficient [3]. The first term is the probability of being autozygous, that is, homozygous due to having inherited the two disease alleles as copies of the same ancestral allele (identical by descent), while the second term is the probability of being allozygous. Therefore, for an affected individual whose parents are first cousins (F = 1/16), the conditional probability of being autozygous is given by

$$p = \frac{Fq}{Fq + (1 - F)q^2} = \frac{1}{1 + 15q}.$$
(1)
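
The simplification in (1) may be easier to follow when written out. Dividing numerator and denominator by Fq and substituting F = 1/16 gives

$$p = \frac{Fq}{Fq + (1 - F)q^2} = \frac{1}{1 + \frac{1 - F}{F}\,q}, \qquad \frac{1 - F}{F} = \frac{15/16}{1/16} = 15.$$

As an illustration, q = 0.05 gives p = 1/1.75 ≈ 0.57, so that somewhat more than half of the affected offspring of first cousins would be expected to be autozygous.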

If the disease gene has a known genomic position and is located in a dense map of marker loci, marker typing of parents and grandparents will allow one to determine whether a proband is autozygous or allozygous. In other words, inheritance of alleles for markers tightly linked with the disease locus will show whether the two disease alleles in a proband are copies of one disease allele in one of the great-grandparents (autozygosity) or whether the two disease alleles have entered the pedigree separately (allozygosity). Consider N such probands of which an observed proportion, p̂, is autozygous. Based on (1), this estimate of p can be translated into a maximum likelihood estimate, q̂, of the allele frequency. Because small values of p̂ would lead to values of q̂ exceeding 1, we define

$$\hat{q} = \begin{cases} (1 - \hat{p})/(15\hat{p}) & \text{if } \hat{p} > 1/16 \\ 1 & \text{if } \hat{p} \le 1/16 \end{cases}$$
(2)
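
The mapping between (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names `autozygosity_prob` and `estimate_q` are hypothetical, and the sample counts in the usage lines are invented.

```python
def autozygosity_prob(q, F=1 / 16):
    """Equation (1): probability that an affected individual is autozygous.

    Assumes q > 0; F defaults to 1/16 (parents are first cousins).
    """
    return F * q / (F * q + (1 - F) * q**2)


def estimate_q(p_hat):
    """Equation (2): allele frequency estimate, truncated at 1."""
    if p_hat <= 1 / 16:
        return 1.0
    return (1 - p_hat) / (15 * p_hat)


# Hypothetical sample: 12 of 20 probands are found to be autozygous.
print(estimate_q(12 / 20))
```

Applying `estimate_q` to the exact value of p returned by `autozygosity_prob` recovers q, which is a convenient check that (2) inverts (1).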

The variance of the estimate, V(q̂), is computed numerically as follows. For a given sample size, N, and population allele frequency, q, each possible outcome (number of autozygous probands), i = 0, …, N, occurs with binomial probability, B(i; N, p), where p is given by (1). For each i, there is an associated p̂ = i/N and a corresponding q̂ as given by (2). The variance of the allele frequency estimate is then obtained as V(q̂) = E(q̂²) - [E(q̂)]², where E stands for expectation (mean) over the N + 1 outcomes.
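
A minimal Python sketch of this enumeration is given below. It follows the description above (sum over all N + 1 binomial outcomes), but the function names are hypothetical and the direct use of `math.comb` is an implementation choice that is only suitable for moderate N.

```python
from math import comb, sqrt


def estimate_q(p_hat):
    """Equation (2): allele frequency estimate, truncated at 1."""
    if p_hat <= 1 / 16:
        return 1.0
    return (1 - p_hat) / (15 * p_hat)


def map_based_variance(q, n):
    """V(q_hat) by exact enumeration over i = 0, ..., n autozygous probands."""
    p = 1 / (1 + 15 * q)  # equation (1)
    mean = 0.0
    mean_sq = 0.0
    for i in range(n + 1):
        prob = comb(n, i) * p**i * (1 - p) ** (n - i)  # binomial probability
        q_hat = estimate_q(i / n)
        mean += prob * q_hat
        mean_sq += prob * q_hat**2
    return mean_sq - mean**2


# Standard error for one combination of q and N.
print(sqrt(map_based_variance(0.05, 100)))
```

Because the estimate is truncated at 1 and is a nonlinear function of p̂, the enumeration captures effects that a first-order approximation would miss, especially for small N.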

Dahlberg’s Method

To compare the variance of our estimate for q to that of the conventional (Dahlberg’s) estimate, we applied Dahlberg’s method as follows. Assume that among all matings in a population, a known proportion c is between first cousins while all other matings are between unrelated individuals (in practice, the latter category includes the rare matings between individuals of other relationships). Then, the proportion of recessive cases born to cousin marriages among all recessive cases in the population is

$$k = \frac{c(1 + 15q)}{c(1 - q) + 16q}$$
(3)

[2], which may be viewed as being analogous to (1). Consider a sample of N probands of which an observed proportion, k̂, has parents who are first cousins. Based on (3), this estimate of k can be translated into a maximum likelihood estimate, q̂D, of the allele frequency. Because small values of k̂ would lead to values of q̂D exceeding 1, we define

$$\hat{q}_D = \begin{cases} c(1 - \hat{k})/[\hat{k}(16 - c) - 15c] & \text{if } \hat{k} > c \\ 1 & \text{if } \hat{k} \le c. \end{cases}$$
(4)
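
Equation (3) can be recovered by weighting the risk of an affected child, q/16 + 15q²/16 for first-cousin matings and q² for unrelated matings, by the mating proportions c and 1 - c. The Python sketch below checks this against the closed form and shows that (4) inverts (3). All function names are hypothetical and serve only as illustration.

```python
def cousin_case_fraction(q, c):
    """Equation (3): share of recessive cases born to first-cousin matings."""
    return c * (1 + 15 * q) / (c * (1 - q) + 16 * q)


def cousin_case_fraction_direct(q, c, F=1 / 16):
    """Same quantity from the per-mating risks, without simplification."""
    risk_cousins = F * q + (1 - F) * q**2  # offspring of first cousins
    risk_unrelated = q**2                  # offspring of unrelated parents
    return c * risk_cousins / (c * risk_cousins + (1 - c) * risk_unrelated)


def estimate_q_dahlberg(k_hat, c):
    """Equation (4): Dahlberg-type estimate, truncated at 1."""
    if k_hat <= c:
        return 1.0
    return c * (1 - k_hat) / (k_hat * (16 - c) - 15 * c)


# Round trip: the exact k for q = 0.05, c = 0.001 maps back to q.
k = cousin_case_fraction(0.05, 0.001)
print(k, estimate_q_dahlberg(k, 0.001))
```

Note that k equals c when q = 1, which is why the estimate is set to 1 whenever k̂ does not exceed c.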

The variance, V(q̂D), is calculated in the same way as V(q̂) above, with k from (3) as the binomial probability and q̂D from (4) as the estimate associated with each outcome.

Results

One of the best-studied recessive traits, phenylketonuria, has a population frequency in the US of approximately 1 in 12,000 [5], that is, a disease allele frequency of about 0.01, while other recessive traits appear to be more common. For a range of population allele frequencies, table 1 shows the standard error (square root of the variance) of our allele frequency estimate, q̂, as a function of the number of probands, N. Clearly, standard errors are quite high, and large samples are needed for reasonably accurate allele frequency estimates. For example, with q = 0.05 and N = 100, the 95% confidence interval ranges approximately from 0.03 to 0.07.
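
The two numerical statements in this paragraph can be checked with a short Python calculation. The allele frequency follows directly from q = √incidence. For the confidence interval, the sketch uses a first-order (delta-method) approximation of the standard error, which is an assumption made here for brevity and not the exact enumeration on which table 1 is based; the normal-theory factor 1.96 is likewise an assumption.

```python
from math import sqrt

# Phenylketonuria: incidence of about 1 in 12,000 implies q = sqrt(incidence).
incidence = 1 / 12_000
q_pku = sqrt(incidence)

# Approximate 95% interval for q = 0.05, N = 100 (delta method, assumed here).
q, n = 0.05, 100
p = 1 / (1 + 15 * q)          # equation (1)
se_p = sqrt(p * (1 - p) / n)  # binomial standard error of p_hat
se_q = se_p / (15 * p**2)     # |dq/dp| = 1 / (15 p^2), from q = (1 - p) / (15 p)
lower, upper = q - 1.96 * se_q, q + 1.96 * se_q

print(q_pku, se_q, lower, upper)
```

For small N the approximation is expected to deteriorate, because the truncation in (2) and the curvature of q̂ as a function of p̂ are ignored.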

Table 1. Standard error of map-based estimate

For Dahlberg’s method, the population frequency of first cousin matings among all marriages must be known. In western societies, this rate is on the order of c = 0.001 [6], while in some eastern countries, it can reach c = 0.200 or more [7]. For a range of these values, table 2 shows the efficiency of our map-based estimate versus the Dahlberg estimate, where efficiency is defined in the customary manner, that is, as the variance ratio, V(q̂D)/V(q̂), so that values above 1 favor the map-based estimate. This ratio expresses the relative accuracy of the two estimates but, as outlined in the discussion, does not take into account the costs associated with sampling probands. For rare diseases (q ≤ 0.02) and small sample sizes (N ≤ 20), map-based estimation is always more efficient than the Dahlberg method (i.e., for any of the c values considered). For cousin marriage rates of c ≤ 0.10, the map-based estimate is generally more efficient except for small N and very large q. The differences in accuracy can be quite dramatic. For example, for a relatively high allele frequency of q = 0.05 and c = 0.001, efficiency is 7.1 for N = 10, and 1,427 for N = 100. Thus, in most situations, the map-based estimate is clearly superior to Dahlberg’s estimate.
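
The efficiency defined above can be computed by applying the same enumeration to both estimates. The Python sketch below is one possible implementation under the definitions in (1) to (4); the function names are hypothetical, and it has not been checked against the full contents of table 2.

```python
from math import comb


def estimator_variance(estimator, success_prob, n):
    """Variance of estimator(i / n) when i is binomial(n, success_prob)."""
    mean = 0.0
    mean_sq = 0.0
    for i in range(n + 1):
        w = comb(n, i) * success_prob**i * (1 - success_prob) ** (n - i)
        est = estimator(i / n)
        mean += w * est
        mean_sq += w * est**2
    return mean_sq - mean**2


def efficiency(q, c, n):
    """Variance ratio V(q_hat_D) / V(q_hat); values above 1 favor map-based."""
    p = 1 / (1 + 15 * q)                           # equation (1)
    k = c * (1 + 15 * q) / (c * (1 - q) + 16 * q)  # equation (3)

    def q_map(p_hat):                              # equation (2)
        return 1.0 if p_hat <= 1 / 16 else (1 - p_hat) / (15 * p_hat)

    def q_dahlberg(k_hat):                         # equation (4)
        if k_hat <= c:
            return 1.0
        return c * (1 - k_hat) / (k_hat * (16 - c) - 15 * c)

    return estimator_variance(q_dahlberg, k, n) / estimator_variance(q_map, p, n)


for n in (10, 100):
    print(n, efficiency(0.05, 0.001, n))
```

With q = 0.05 and c = 0.001, the output should be close to the two efficiencies quoted in the text. The large ratio for N = 100 arises mainly because, with so few cousin marriages, most samples contain no proband with first-cousin parents, in which case q̂D is set to 1.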

Table 2. Relative efficiency (accuracy) of map-based versus Dahlberg’s method of allele frequency estimation

Discussion

As shown above, for small to moderate sample sizes and traits that are not too common, our new method is more efficient than Dahlberg’s method. In practice, grandparents are often not typed and the number of probands in linkage studies of recessive traits tends to be rather small. Thus, it is fortunate that small samples are the situation in which our method performs best.

Clearly, obtaining probands suitable for the map-based method requires more resources than obtaining probands for Dahlberg’s method. In principle, all relevant ancestors of the former probands must be genotyped but such data are difficult or costly to obtain. With very highly polymorphic markers, identity by state is almost equivalent to identity by descent so that it may be possible to obtain approximate solutions based on homozygosity versus heterozygosity at closely linked marker loci. Such approaches are under investigation.

For Dahlberg’s method, the population proportion of cousin marriages must be known. If it is unknown or inaccurate, this method presumably furnishes biased results. There is no such requirement for the map-based method.

Our approach focuses on one proband per family. In linkage studies of recessive traits, there is often more than one affected individual per sibship. We have not yet investigated how information from multiplex sibships could be used in the map-based method. This does not appear to be a simple problem.

The method introduced in this paper is tailored to a specific relationship of the parents. This relationship, first cousins, is the most common one among related parents in western civilizations. In other populations, for example, eastern ones, other parental relationships may be more common, and our method is not directly applicable to them.