Introduction

Variance components (VC) models1, 2, 3, 4, 5, 6, 7, 8 have received much attention and are widely applied in studies of quantitative genetic traits, because the method requires few model assumptions. The model has been extended to various forms for different data structures, algorithms and model assumptions: Lange and Boehnke9 extended it to multivariate traits, Duggirala et al10 applied it to dichotomous traits, Amos et al11 studied a least squares algorithm for it, and Andrade et al12 extended it to longitudinal pedigree data. The model and its variants have been used extensively in genetic linkage analysis. However, most existing VC models assume, explicitly or implicitly, Hardy–Weinberg and/or linkage equilibrium. These fundamental assumptions are not always easy to justify, and in practice they are often violated to some degree. In linkage analysis the latter assumption may be inappropriate, since putative disease loci are usually in linkage disequilibrium (LD) with the flanking marker loci.13 Almasy et al14 proposed a combined linkage/disequilibrium analysis in which LD is incorporated into the VC model. Although there are some VC models for combined linkage and association studies,15 a VC model that incorporates both disequilibria is of practical value and has not appeared in the literature. Here we consider such a model, allowing for Hardy–Weinberg disequilibrium and/or LD, as an extension of the existing VC models. In our model the LD is parameterized via the trait-marker composite genotype, unlike in Almasy et al,14 where the LD is parameterized via the trait-marker alleles. The corresponding variance components are computed for some commonly used relative pairs conditional on the observed marker identity-by-descent (IBD) data. Parameters can be estimated by traditional methods such as maximum likelihood estimation (MLE) under the normal model assumption.
This extended VC model is expected to estimate parameters more accurately, can be used with pedigree data for linkage mapping and for combined linkage and LD mapping (association study), and is expected to have more power for such analyses.

The common VC model

We first describe the likelihood of the commonly used variance components model, for example as in Amos.5 Since the total likelihood is a product of likelihood over all the families under study, we only present the model for a given family for the sake of simplicity.

Let Yi be the trait value of the ith individual in the family. The VC model describing the trait value is

where μ is the overall mean, gi is the unobserved random major gene effect at the trait locus with alleles denoted by A and B, Gi is the unobserved polygenic effects,

where the ηj's are effects associated with the covariates xij's, and ei is the residual random error. The usual assumption is that gi, Gi and ei are uncorrelated and E(gi)=E(Gi)=E(ei)=0. Let p be the population proportion of allele A. Under the Hardy–Weinberg assumption one has E(gi)=a(2p−1)+2p(1−p)d=0. The covariance between individuals i and j is

where $\sigma_a^2=2p(1-p)[a-d(2p-1)]^2$ is the additive genetic variance due to the locus, $\sigma_d^2=4p^2(1-p)^2d^2$ is the dominance genetic variance, $\Phi_{ij}=\Delta_{7ij}/2+\Delta_{8ij}/4$ is the kinship coefficient16 between individuals i and j, and $\Delta_{7ij}$, $\Delta_{8ij}$, $\Delta_{9ij}$ are the condensed kinship coefficients17 between individuals i and j. The $\Delta_{kij}$'s (k=1, …, 9) are the probabilities of the nine possible condensed IBD states defined by Jacquard,17 of which $\Delta_{7ij}$, $\Delta_{8ij}$ and $\Delta_{9ij}$ are commonly used in practice. They are the population probabilities that individuals (i, j) share 2, 1 and 0 genes IBD; under Mendelian inheritance they depend only on the kinship relationship of (i, j), not on their particular genotypes. Also, $2\Phi_{ij}$ is the expected proportion of genes shared IBD by individuals (i, j) at this locus.
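As a small numerical illustration of these definitions, the sketch below computes the additive and dominance variances and the kinship coefficient. It assumes the genotype values a, d and −a for AA, AB and BB (the coding implied by the expression for E(gi) above); the function names and the values a=1, d=0.5, p=0.55 are hypothetical and chosen only for illustration. The last line uses the standard decomposition of the major-gene covariance, 2Φσ2a+Δ7σ2d, which is assumed here rather than taken from the text.

```python
def major_gene_variances(a, d, p):
    """Additive and dominance variances for a biallelic locus.

    Assumes genotype values a (AA), d (AB), -a (BB), allele A with
    frequency p, and Hardy-Weinberg equilibrium.
    """
    q = 1.0 - p
    sigma2_a = 2.0 * p * q * (a - d * (2.0 * p - 1.0)) ** 2
    sigma2_d = 4.0 * p ** 2 * q ** 2 * d ** 2
    return sigma2_a, sigma2_d


def kinship(delta7, delta8):
    """Kinship coefficient Phi = Delta7/2 + Delta8/4."""
    return delta7 / 2.0 + delta8 / 4.0


# Full sibs share 2, 1, 0 alleles IBD with probabilities 1/4, 1/2, 1/4.
phi_sibs = kinship(0.25, 0.5)

# Hypothetical effect sizes and allele frequency, for illustration only.
s2a, s2d = major_gene_variances(a=1.0, d=0.5, p=0.55)

# Major-gene part of the sib covariance (standard form, assumed here).
major_gene_cov = 2.0 * phi_sibs * s2a + 0.25 * s2d
```

A useful sanity check is that s2a + s2d equals the variance of the genotype values computed directly from the genotype frequencies p², 2p(1−p) and (1−p)².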

For linkage analysis, the IBD sharing data {πij} (πij=0, 1, 2) at the marker locus are usually available for a relative pair (i, j). Amos5 proposed the following model for the conditional covariance

where θ is the recombination fraction between the trait and the marker loci. The values of f(θ, πij) and g(θ, πij) are given in Amos.5 Note that g(θ, πij)=0 for most human relative pairs except full sibs, and it is related to the probability of sharing two alleles IBD.

VC model with disequilibria

In this section we derive VC models with disequilibria in different settings, by incorporating these parameters into the covariances (2).

Hardy–Weinberg disequilibrium at trait locus

We first consider incorporating the Hardy–Weinberg disequilibrium at the trait locus into the VC model, without marker information. Let Ak denote allele k at the trait locus (k=1, …, K), pk its proportion in the population, and pkl the corresponding proportion of the genotype AkAl. One way to deal with the deviation from the Hardy–Weinberg assumption is to use the within-population inbreeding coefficient18, 19 f at the trait locus, which is the probability that, at any gene, both alleles of the gene pair were inherited from the same ancestor. Let I(·) be the indicator function. Given f we have

Here 0≤f≤1, and f=0 corresponds to Hardy–Weinberg equilibrium. Let p(kl)(km) be the conditional probability that two individuals have genotypes (AkAl, AkAm) or (AlAk, AmAk) at the trait locus given that they share Ak IBD. (Assuming random mating and known phase, these are the only cases in which they share Ak IBD; the probabilities of the cases (AkAl, AmAk) or (AlAk, AkAm) are negligible.) Let Y be the trait value of a general individual and g be his/her genotype, and let $\mu_{kl}=E(Y\mid g=A_kA_l)$. Following Fisher1 and Lange,16 let the αk's be the optimal additive genetic effects in the sense that they minimize the sum of squared residuals $\sum_k\sum_l\delta_{kl}^2p_{kl}$, where $\delta_{kl}=\mu_{kl}-\alpha_k-\alpha_l$. We show in Appendix A that

and

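where $\gamma_7(f)=(1+f/2)\sigma_a^2+(1-f)\sigma_d^2+f\sigma_0^2$, $\gamma_8(f)=\frac{(1+f)^2}{2}\sigma_a^2$, $\sigma_a^2=2\sum_k\alpha_k^2p_k$ is the part of variance explained by the optimal additive genetic effects, $\sigma_d^2=\sum_k\sum_l\delta_{kl}^2p_kp_l$, $\sigma_0^2=\sum_k\delta_{kk}^2p_k$, and $\alpha_k=\sum_l\mu_{kl}p_{kl}/[(1+f)p_k]$ for all k.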

Note that if f=0, (6) reduces to (2). The αk's and δkl's are the optimal additive major gene effects and the residual effects, respectively.16
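The role of f in the genotype frequencies can be checked numerically. The sketch below builds the ordered-genotype frequency matrix from the relation pkl=(1−f)pkpl+fpkI(l=k) used in this paper and confirms that its margins recover the allele frequencies. The function name is hypothetical; the values p=(0.55, 0.45) and f=0.12 are those used later in the simulation study.

```python
def genotype_frequencies(p, f):
    """Ordered-genotype frequencies P[k][l] = (1-f) p_k p_l + f p_k I(k == l).

    p is a list of allele frequencies summing to one; f is the
    within-population inbreeding coefficient (f = 0 gives Hardy-Weinberg).
    """
    K = len(p)
    return [
        [(1.0 - f) * p[k] * p[l] + (f * p[k] if k == l else 0.0)
         for l in range(K)]
        for k in range(K)
    ]


P = genotype_frequencies([0.55, 0.45], 0.12)

# Row sums recover the allele frequencies, so the matrix is a proper
# distribution over ordered genotypes.
row_sums = [sum(row) for row in P]
```

With f>0 the diagonal (homozygote) entries are inflated relative to the Hardy–Weinberg products pk², while the margins stay fixed at pk.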

Linkage to marker

Now we consider the case in which marker information is available in addition to the trait locus data. Let πij (=0, 1, 2) be the number of alleles shared IBD by individuals i and j at the marker locus, π′ij be the corresponding unobserved number at the trait locus, and θ be the recombination fraction between the two loci. Expressions for Cov(Yi, Yj | πij=k) can be found by the formula

Usually, the IBD data πij for a relative pair are not directly available. However, the probabilities P(πij=k) (k=0, 1, 2) can be computed from the corresponding observed marker genotypes. So the covariance between the individual pair (i, j) in a given family is

Covariance with Hardy–Weinberg disequilibrium at trait given marker IBD

Earlier we derived the variance components under Hardy–Weinberg disequilibrium at the trait locus. Here we give these components with the linked marker information, that is, conditional on the trait-marker IBD data. In this case the variance components are

where $\Delta_{7ij}(\pi_{ij})=P(\pi'_{ij}=2\mid\pi_{ij})$, $\Delta_{8ij}(\pi_{ij})=P(\pi'_{ij}=1\mid\pi_{ij})$ and $\Delta_{9ij}(\pi_{ij})=P(\pi'_{ij}=0\mid\pi_{ij})$ are the conditional IBD sharing probabilities at the trait locus given the IBD sharing at the marker locus, for individuals (i, j). The derivation is the same as that for Cov(Yi, Yj | f), with Δ7ij and Δ8ij replaced by Δ7ij(πij) and Δ8ij(πij), whose values are obtained from the relationships

and the known values of P(π′ij=0, πij) listed in the literature cited above. Note that given π′ij=0, Yi and Yj are independent and Cov(Yi, Yj | f, π′ij=0)=0, so there is no term for Δ9ij(πij).
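For full sibs the conditional probabilities Δ7ij(πij), Δ8ij(πij) and Δ9ij(πij) have a well-known closed form in terms of ψ=θ²+(1−θ)², the probability that the sharing status from one parent is the same at the two loci. The sketch below tabulates them; it assumes full sibs, no interference and unrelated founders, and the function names are hypothetical. The second function averages the table over marker IBD probabilities P(πij=k), as is needed when πij is not observed directly.

```python
def sib_trait_ibd_given_marker(theta):
    """P(trait IBD = c | marker IBD = r) for full sibs.

    Rows index the marker IBD count r = 0, 1, 2 and columns the trait
    IBD count c = 0, 1, 2. psi is the per-parent probability that the
    sharing status is identical at the two loci.
    """
    psi = theta ** 2 + (1.0 - theta) ** 2
    return [
        [psi ** 2, 2.0 * psi * (1.0 - psi), (1.0 - psi) ** 2],
        [psi * (1.0 - psi), psi ** 2 + (1.0 - psi) ** 2, psi * (1.0 - psi)],
        [(1.0 - psi) ** 2, 2.0 * psi * (1.0 - psi), psi ** 2],
    ]


def expected_deltas(theta, marker_probs):
    """Average (Delta7, Delta8, Delta9) over marker IBD probabilities.

    marker_probs = (P(pi=0), P(pi=1), P(pi=2)) for the pair.
    """
    table = sib_trait_ibd_given_marker(theta)
    delta9, delta8, delta7 = (
        sum(marker_probs[r] * table[r][c] for r in range(3))
        for c in range(3)
    )
    return delta7, delta8, delta9


# theta = 0.25 is the value used in the simulation study of this paper.
d7, d8, d9 = expected_deltas(0.25, (0.0, 1.0, 0.0))
```

With θ=0 the table is the identity (the marker is fully informative), and with θ=0.5 every row equals the unconditional sib probabilities (1/4, 1/2, 1/4).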

Since in real data the set {πij} is unobservable and only the computed set of probabilities {P(πij=k)} is available, the covariance is

Covariance with LD between trait and marker

In linkage analysis, LD between the trait locus and the genotyped marker locus should be taken into consideration. In this section we compute the covariances between relative pairs when, in addition to Hardy–Weinberg disequilibrium, LD is also present between the trait and marker loci. Let ak and akal denote the alleles and genotypes at the marker locus, and qk and qkl the corresponding population frequencies. Since the within-population inbreeding coefficient f is common to every locus in the genome of the given population, f describes the relationship between the marker genotype frequencies qkl's and the allele frequencies qk's in the same way as it does between the pkl's and the pk's at the trait locus. That is, we have

Let G be a general notation for the trait-marker composite genotype. We assume

It is easy to check that under (11), ΣkΣlp(kl, rs)=prs and ΣrΣsp(kl, rs)=pkl, that is, the probabilities of the composite genotypes are consistent with their marginal probabilities. Here 0≤ζ≤1 is the LD parameter; it should not be confused with the definitions of LD used in some texts, such as Weir20 or Almasy et al.14 Note that ζ=0 corresponds to linkage equilibrium. Also, ζ captures the vertical connection between the trait and marker loci, while the recombination fraction describes the horizontal link between the alleles.

For a relative pair, let $p_{(kl)(mn)\mid\pi_{ij}}=P(A_kA_l, A_mA_n\mid\pi_{ij})$ be the conditional probability that individual i has trait genotype AkAl and individual j has trait genotype AmAn given their IBD value πij at this locus; let $p_{klm\mid\pi_{ij}}=P(A_kA_l, A_kA_m\mid\pi_{ij})+\tfrac{1}{2}P(A_kA_l, A_mA_k\mid\pi_{ij})$ be the probability when they also share one allele identical by state (IBS) at the trait locus; and let $p_{kl\mid\pi_{ij}}=P(A_kA_l, A_kA_l\mid\pi_{ij})$ be the probability when they share both alleles IBS at the trait locus. We have (Appendix B)

where γ7(f, ζ, πij), γ8(f, ζ, πij) and γ9(f, ζ, πij) denote respectively Cov(gi, gj | f, ζ, πij, π′ij=2), Cov(gi, gj | f, ζ, πij, π′ij=1) and Cov(gi, gj | f, ζ, πij, π′ij=0). Note that when conditioning on the IBD values at both the trait and marker loci, we cannot assert Cov(gi, gj | f, ζ, πij, π′ij=0)=0 as we did in the previous section. We have γ7(f, ζ, πij)≡γ7(f),

and

where

which is also written as

Since the genetic covariance between the relative pair can be written as

by (12), when π′ij=2 or πij=0, the expression for the genetic covariance between a relative pair is the same regardless of whether LD is present. In fact, from the derivation in Appendix B, this conclusion holds for any consistent composite genotype specification: under random mating and any consistent specification P(G) of the composite genotype, the IBD status (πij, π′ij) of a relative pair (i, j) contributes information on LD to their genetic covariance at the trait locus only if π′ij≤1 and πij≥1.

Again in practice, given the estimated IBD probabilities, the covariance is computed as

Parameter estimation

Let β=(μ, η1, …, ηj)T be the parameters in the mean, α=(θ, f, ζ, σ2a, σ2d, σ2G, σ2e, σ20, σ21, σ22, σ23, σ11, σ12, σ13)T be the parameters in the covariance matrices, yk be the observations of all the members in the kth family, and μk=μk(β)=E(Yk)=Xkβ, where Xk is the covariate matrix for the kth family and nk is the total number of individuals in this family. The commonly used model-based estimation method is maximum likelihood, and the common model for a quantitative trait is the normal distribution. Under these assumptions, the likelihood of the kth family is Lk(α, β | Yk)=φ(Yk | μk, Ωk), where φ(Y | μ, Ω) is the density of the nk-dimensional normal N(μ, Ω) distribution and Ωk is the covariance matrix of the kth family, with

as specified in (12) in the most general case. The P(πij=r | gij)'s can be obtained by common IBD computation methods. The covariances can also take any of the more specific forms (8), (6), (3) and (2) in the equilibrium case. Here we write (Yi, Yj) for (Yki, Ykj), the (i, j)th relative pair in the kth family. The total likelihood is thus $L(\alpha,\beta\mid Y)=\prod_{k=1}^{K}L_k(\alpha,\beta\mid Y_k)$, and the log-likelihood, omitting the normalizing constant, is

The MLE is the parameter value (α̂, β̂) that maximizes (14), and it has many desirable optimality properties.
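For sibpair data each family contributes a bivariate normal term, so the log-likelihood can be evaluated without matrix libraries. The sketch below assumes the standard Gaussian form, −½[log|Ωk|+(yk−μk)TΩk−1(yk−μk)] summed over families with the constant dropped; the function names are hypothetical and the construction of μk and Ωk from (α, β) is left to the caller.

```python
import math


def sibpair_loglik(y, mu, omega):
    """Log-likelihood of one sib pair, normalizing constant omitted.

    y and mu are length-2 sequences; omega is a 2x2 covariance matrix
    given as [[w11, w12], [w21, w22]].
    """
    (w11, w12), (w21, w22) = omega
    det = w11 * w22 - w12 * w21
    if w11 <= 0.0 or det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    r1, r2 = y[0] - mu[0], y[1] - mu[1]
    # Quadratic form r' Omega^{-1} r using the closed-form 2x2 inverse.
    quad = (w22 * r1 * r1 - (w12 + w21) * r1 * r2 + w11 * r2 * r2) / det
    return -0.5 * (math.log(det) + quad)


def total_loglik(families):
    """Sum the contributions of independent families.

    families is an iterable of (y, mu, omega) triples.
    """
    return sum(sibpair_loglik(y, mu, omega) for y, mu, omega in families)
```

In practice this function would be passed, with its sign reversed, to a numerical optimizer over (α, β), with Ωk rebuilt from the current parameter values at each evaluation.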

Power

The power of the method can be easily estimated and, as will be shown, depends only on the parameters α in the covariance matrix. Let H0: α=α0 and H1: α=α1 (H0⊂H1, or α0 is part of α1) be the null and alternative hypotheses considered in the previous sections, dim(H1)−dim(H0)=k, and f(· | α, β) be the density of the model considered. Let α̂0 and α̂1 be the MLEs of α under H0 and H1, respectively. Note that our hypotheses involve only α, not the parameters β in the mean specification. Let

be the relative entropy (Kullback–Leibler divergence) between the two densities f(· | α1, β1) and f(· | α0, β0). It is known that D(α1‖α0)≥0, with equality if and only if α1=α0. Assuming homogeneous familial structures for all the families, for a given level γ>0, the asymptotic power qn of the likelihood ratio test of H0 vs H1, with a dataset of size (number of families) n, is (Appendix C)

where Vk is the χ2 random variable with k degrees of freedom and χ2k(1−γ) is its 1−γ upper quantile.

Given f(· | ·, ·), α1 and α0, D(α1‖α0) can be easily computed. In fact, since our model f(· | ·, ·) is multivariate normal, it is easy to see that

where d=dim(μ), and Ω(·) denotes the Ωk's with the elements given in (12), in which the τij's take their theoretical mean values. To plot the power surface, we fix the parameter values at their MLEs, except those for f and ζ. Then, for a set of selected (f, ζ) values, we can compute qn=qn(γ, f, ζ) for different γ, f, ζ and n.
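For sibpairs (d=2) the divergence can be evaluated directly. The sketch below assumes the standard closed form for two normal densities with a common mean, D=½[tr(Ω0−1Ω1)−d+log(|Ω0|/|Ω1|)], where Ω1 and Ω0 are the covariance matrices under α1 and α0. The function name and the example matrices are hypothetical, and the power formula itself is not reproduced here.

```python
import math


def kl_normal_2x2(omega1, omega0):
    """Kullback-Leibler divergence D(N(mu, omega1) || N(mu, omega0)).

    Both arguments are 2x2 covariance matrices [[a, b], [c, d]]; the two
    distributions are assumed to share the same mean.
    """
    (a1, b1), (c1, d1) = omega1
    (a0, b0), (c0, d0) = omega0
    det1 = a1 * d1 - b1 * c1
    det0 = a0 * d0 - b0 * c0
    if det1 <= 0.0 or det0 <= 0.0:
        raise ValueError("covariance matrices must be positive definite")
    # trace(omega0^{-1} omega1) from the closed-form 2x2 inverse.
    trace = (d0 * a1 - b0 * c1 - c0 * b1 + a0 * d1) / det0
    return 0.5 * (trace - 2.0 + math.log(det0 / det1))


# Hypothetical covariance matrices under the alternative and the null.
omega_alt = [[2.0, 0.6], [0.6, 2.0]]
omega_null = [[2.0, 0.4], [0.4, 2.0]]
divergence = kl_normal_2x2(omega_alt, omega_null)
```

The divergence is zero when the two matrices coincide and grows as the covariance structure under H1 moves away from that under H0, which is why the power depends on the data only through α and n.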

Application

Simulation study

Data for 10 000 sibpairs are simulated in our study. We describe in some detail how the two levels of disequilibria are incorporated in the simulation process, which consists of the following three steps.

Step 1

For each sibpair we simulate their trait genotypes gi's and the marker IBD values πij's. Let Gi=(aras)/(AkAl) be the composite genotype of the trait and marker for the ith individual, with lower-case letters aras for the marker genotype. We simulate (Gi, Gj) for each sibpair, and πij is generated along with them. We first generate the composite genotypes Gf of the father and Gm of the mother from the probabilities given in (11) with ζ=0.1, where pkl and qrs are given by (4) and (10) with f=0.12, p1=0.55, p2=0.45, q1=0.65 and q2=0.35. Although (Gf, Gm) are not part of the data used in the computation, they are needed to generate the sibs' composite genotypes. Given (Gf, Gm), we generate Gi, Gj and πij as follows. Let Gf=(af1 af2)/(Af1 Af2) and Gm=(am1 am2)/(Am1 Am2). During meiosis, if there is no recombination (with probability 1−θ, θ=0.25), Gf splits into the two gametes (af1/Af1) and (af2/Af2); if there is recombination, it splits into (af1/Af2) and (af2/Af1). One of the gametes is then selected with probability 0.5 to pass to the next generation. Here we only consider recombination at the marker, since we want the IBD πij at the marker. Recombination at the trait is similar, and we omit it for simplicity, since it does not affect the probabilities of the Gi's. Similarly, Gm splits into (am1/Am1) and (am2/Am2), or into (am2/Am1) and (am1/Am2), and one of the gametes is selected with probability 0.5 to pass to the next generation. For example, if for the father there is recombination during meiosis and (af1/Af2) is selected, and for the mother there is no recombination during meiosis and (am1/Am1) is selected, then Gi=(af1 am1)/(Af2 Am1) and gi=(Af2Am1). Repeating the above process gives, say, Gj=(af2 am1)/(Af1 Am1) and gj=(Af1Am1). Since at the marker locus the sibpair (i, j) has the genotypes (af1am1, af2am1), we have πij=1, which comes from the common maternal allele am1.
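The transmission part of Step 1 can be sketched as follows. The code tracks only which parental copy (first or second) each sib receives at the trait and marker loci, which is enough to count IBD sharing; it treats the four parental haplotypes as distinct by descent, an assumption of this sketch. Drawing the parental composite genotypes from (11) is not shown, and the function names are hypothetical.

```python
import random


def transmit(theta, rng):
    """Simulate one meiosis in one parent.

    Returns (marker_copy, trait_copy), each 0 or 1, indicating which of
    the parent's two haplotype copies is passed on at each locus. With
    probability theta the marker and trait copies differ (recombination).
    """
    trait_copy = rng.randrange(2)
    recombined = rng.random() < theta
    marker_copy = 1 - trait_copy if recombined else trait_copy
    return marker_copy, trait_copy


def simulate_sibpair_ibd(theta, rng):
    """Return (marker IBD, trait IBD) counts for one sib pair."""
    marker_ibd = 0
    trait_ibd = 0
    for _parent in ("father", "mother"):
        m_i, t_i = transmit(theta, rng)  # gamete received by sib i
        m_j, t_j = transmit(theta, rng)  # gamete received by sib j
        marker_ibd += int(m_i == m_j)
        trait_ibd += int(t_i == t_j)
    return marker_ibd, trait_ibd


rng = random.Random(2024)  # fixed seed so the run is reproducible
pairs = [simulate_sibpair_ibd(0.25, rng) for _ in range(1000)]
```

With θ=0 the two counts always agree, and as θ approaches 0.5 the marker IBD count carries no information about the trait IBD count.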

Step 2

Simulate each pair's covariates. The mean μi of the ith individual is given by (1). Specifically, we take μ=23, and gi=1 if individual i has genotype A1A1, gi=0 if A1A2, and gi=−1 if A2A2. We take Gi∼N(0, σ2G) with σ2G=0.2. Two covariates are generated, xi1 and xi2, which stand for the age (years) and sex indicator of the ith individual, with xi2=1 for female and xi2=0 for male. The coefficient for age is η1=0.2 and that for sex is η2=1.5. The random error ei is drawn from the N(0, 1) distribution. We always assume the first sib is younger, with xi1∼U[10, 60], and for the second sib xj1=xi1+z with z∼U[1, 10]. For xi2, using the gender ratio from the real data, we sample z∼U(0, 1) and let xi2=1 (female) if z≤0.54 and xi2=0 (male) otherwise.
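Step 2 translates almost line by line into code. The sketch below draws the ages and sexes for one sibpair and evaluates the fixed part of the mean, μ+gi+η1xi1+η2xi2, with the parameter values quoted above; the function names are hypothetical, and the polygenic and error terms are handled through the covariance matrix in Step 3 rather than here.

```python
import random


def simulate_covariates(rng, p_female=0.54):
    """Draw (age, sex) for both sibs; the first sib is the younger one."""
    age_i = rng.uniform(10.0, 60.0)
    age_j = age_i + rng.uniform(1.0, 10.0)
    sex_i = 1 if rng.random() <= p_female else 0  # 1 = female, 0 = male
    sex_j = 1 if rng.random() <= p_female else 0
    return (age_i, sex_i), (age_j, sex_j)


def mean_trait(g, age, sex, mu=23.0, eta1=0.2, eta2=1.5):
    """Fixed part of the trait mean for major-gene value g in {1, 0, -1}."""
    return mu + g + eta1 * age + eta2 * sex


rng = random.Random(7)  # fixed seed so the run is reproducible
(age_i, sex_i), (age_j, sex_j) = simulate_covariates(rng)
mu_i = mean_trait(1, age_i, sex_i)   # sib i with genotype A1A1
mu_j = mean_trait(-1, age_j, sex_j)  # sib j with genotype A2A2
```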

Step 3

Simulate the sibpair covariance matrices Ωij=Cov(Yi, Yj)=(ωij) and the final observed trait values. By (3.9), $\omega_{11}=\omega_{22}=(1+f/2)\sigma_a^2+(1-f)\sigma_d^2+f\sigma_0^2+\sigma_G^2+\sigma_e^2$, where $\sigma_a^2=2\sum_k\alpha_k^2p_k$, $\sigma_d^2=\sum_{k,l}\delta_{kl}^2p_kp_l$, $\sigma_0^2=\sum_k\delta_{kk}^2p_k$, pk is the population proportion of allele Ak, $\delta_{kl}=\mu_{kl}-\alpha_k-\alpha_l$, $\alpha_k=\sum_l\mu_{kl}p_{kl}/[(1+f)p_k]$, $p_{kl}=(1-f)p_kp_l+fp_kI(l=k)$ is the population proportion of genotype AkAl, and $\mu_{kl}=E(Y\mid g=A_kA_l)=\mu+g_{kl}+\eta_1E(x_{i1})+\eta_2E(x_{i2})=\mu+g_{kl}+40\eta_1+0.54\eta_2$; for sibpairs Φij=1/4. Δkij(πij) is defined after (8) and can be found in Wright,18 where the values are expressed in terms of the recombination fraction θ. The marker IBD data πij's are generated above; the trait IBD values π′ij are unknown, but only the conditional probabilities P(π′ij | πij) are used, which are easily derived.20 The γk(f, ζ, πij)'s are defined after (12). The definition of σ1,2 involves p(kl)(km), which is given in the definition of the γk(f, ζ, πij)'s. Now we have constructed Ωij and are ready to simulate the yi's. We simulate the data pairwise. For a sibpair (yi, yj), denote Y=(yi, yj) and μ=(μi, μj). We sample Z∼N(0, I2), the two-dimensional standard normal distribution, let $Y=\mu+\Omega_{ij}^{1/2}Z$, and simulate such Y 10 000 times.

For γ8(f, ζ, πij) in the case πij=2, σ1, 1, σ1, 2 and σ1, 3 are not independently estimable, so in this case we write γ8(f, ζ, 2)=(1−ζ)γ8(f)=σ42, where σ42=−ζ(1+f)σ1, 1+2ζσ1, 2+ζ2σ1, 3 is viewed as a single parameter to be estimated.
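The last part of Step 3, drawing a correlated pair from N(μ, Ωij), can be sketched as below. The code uses the lower-triangular Cholesky factor of Ωij instead of a symmetric square root; any factor L with LLT=Ωij gives the same distribution for μ+LZ. The function names and the example mean and covariance are hypothetical.

```python
import math
import random


def cholesky_2x2(omega):
    """Lower-triangular factor (l11, l21, l22) with L L' = omega."""
    (w11, w12), (_, w22) = omega
    l11 = math.sqrt(w11)
    l21 = w12 / l11
    l22 = math.sqrt(w22 - l21 * l21)
    return l11, l21, l22


def sample_pair(mu, omega, rng):
    """Draw (y_i, y_j) from the bivariate normal N(mu, omega)."""
    l11, l21, l22 = cholesky_2x2(omega)
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2


rng = random.Random(11)  # fixed seed so the run is reproducible
# Hypothetical mean vector and covariance matrix for one sib pair.
y_i, y_j = sample_pair((31.0, 32.5), [[4.0, 1.0], [1.0, 3.0]], rng)
```

In the full simulation the mean vector and Ωij would be rebuilt for each sibpair from its covariates and marker IBD value before the draw.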

Table 1 displays the true values of the parameters of interest used in the simulation and their MLEs (estimated standard deviations in brackets) under H0: f=ζ=0 and under H1: all parameters free.

Table 1 Parameter estimates for the simulated data under H0 and H1

The difference 2(log likelihood(H1)−log likelihood(H0)) is 20.9934, with a P-value of 0.000106 under a χ2 distribution with two degrees of freedom; that is, the evidence against H0 is very strong. This example shows that incorporating the disequilibria into the variance components model can improve the inference significantly when such disequilibria are present.

Real data application

We used the AADM (African-American Diabetes Mellitus) data to illustrate the method. The data are from an international collaboration between West African and US investigators to map type II diabetes susceptibility genes in West African ancestral populations of African-Americans. Affected sib-pairs along with unaffected spouse controls were enrolled. Eligible participants were invited to study clinics to obtain detailed epidemiological, familial and medical history information. For a detailed description of the data, see Rotimi et al.21 For these data we computed the model parameter estimates under VC model (2), that is, under the hypothesis of equilibria, H0: f=ζ=0, and under the VC model with Hardy–Weinberg disequilibrium and LD (12), H1: f and ζ are free parameters. The response variable is BMI and the covariate is age. The results are shown in Table 2, where the estimated standard deviations are listed in brackets.

Table 2 Parameter estimates for the AADM data under H0 and H1

The −2 log-likelihood difference is 12.5076 with a P-value of 0.0058, which is highly significant, so the inference should be based on H1. We see a large Hardy–Weinberg disequilibrium at the trait locus, suggesting that the genetic background of the sample under study is not as simple as assumed by the existing VC model. The low recombination rate (0.0016) indicates that the trait and marker loci are tightly linked, and the LD between the trait and marker is non-negligible. The overall mean BMI of this sample is 23.58 and the age effect is 0.053, which are quite common for normal populations.
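The tail probabilities of the two likelihood ratio statistics reported above (20.9934 for the simulated data and 12.5076 for the AADM data) can be checked with the closed-form χ2 survival functions for two and three degrees of freedom. In this sketch the function name is hypothetical and only those two cases are implemented. Evaluated this way, the quoted P-values (0.000106 and 0.0058) agree with a reference distribution with three degrees of freedom; with two degrees of freedom the tail probabilities are smaller, roughly 2.8e-5 and 1.9e-3.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(V > x) for a chi-square variable V.

    Only df = 2 and df = 3 are implemented, using their closed forms.
    """
    if x < 0.0:
        raise ValueError("x must be non-negative")
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 2 and df = 3 are implemented in this sketch")


for statistic in (20.9934, 12.5076):
    print(statistic, chi2_sf(statistic, 2), chi2_sf(statistic, 3))
```

Under either choice of degrees of freedom both tests reject H0 at conventional levels, so the conclusions drawn in the text are unaffected.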

The power depends on all the parameters in the model; we highlight its dependence on (f, ζ). Using (15) and the parameters above, Figure 1 shows the power of the likelihood ratio test of H0 vs H1 for various combinations of f, ζ and n.

Figure 1

Powers of the likelihood ratio test for H0 vs H1. (a)–(c) Power for the real data, with parameters set at the MLE values. (a) H0: f=0 vs H1: f≠0. Horizontal axis for f, vertical axis for power, n=8 000 000; (b) H0: ζ=0 vs H1: ζ≠0. Horizontal axis for ζ, vertical axis for power, n=675; (c) (the third on the first row): H0: (f, ζ)=(0, 0) vs H1: (f, ζ)≠(0, 0). n=675. (d)–(f) Power for the simulated data. The parameters used are α=(σ4, σ3, σ2, σg, σ0, σa, σd, θ)=(3.08, 16.3, 43.5, 0.45, 21.8, 7.24, 20.8, 0.2468), and the sample size is n=250 for the three panels. (d) H0: f=0 vs H1: f≠0; (e) H0: ζ=0 vs H1: ζ≠0; (f) (the third on the second row): H0: (f, ζ)=(0, 0) vs H1: (f, ζ)≠(0, 0).

Since the LD depends on the unobservable trait genotype, it needs a larger sample size to detect. For the real data, with the observations and the estimated parameter setting, it is easy to detect the Hardy–Weinberg disequilibrium with a reasonable sample size, while detecting the LD is very difficult or requires a very large sample size to achieve high power. For the simulated data and parameter setting, the powers are high for the Hardy–Weinberg disequilibrium, the LD, and the joint Hardy–Weinberg and LD disequilibria.

The software for this extended VC model is written in SAS; the current version is for sibpair familial structure only, and is available upon request from the second author at gchen@genomecenter.howard.edu. The CPU time to compute the parameter estimates depends on the machine, data size, number of regressors, pedigree structure and starting values for the parameters etc. For the two examples above, with suitably chosen starting values, the CPU times for computing the MLEs are 27.24 and 27.33 s on our machine.

Discussion

We have generalized the VC model to allow for Hardy–Weinberg disequilibrium, LD, or both, which broadens the practical application of this popular model. In some applications, the equilibrium assumptions are not justified. In these cases the existing VC model is clearly inadequate, and our generalized VC model may give more accurate estimates and enhance the power of inference on the parameters of interest. The generalized model can also be used to test for these disequilibria by forming the corresponding likelihood ratio statistic along with the parameter estimates. Other inferences on one or both of the two disequilibria, which are sometimes of direct interest, are now also available under this generalized VC model.

We computed the variance components for some common relative pairs. The cases of other relative pairs are similar and straightforward. We considered the parameter estimation in several ways and computed the IBD under some common cases.

Further extensions and modifications to implement more features can be made similarly, such as multivariate traits,9 the multipoint VC model, dichotomous traits, the robust LOD score correction7 and the conditioning adjustment.21