Introduction

Population-based marker–phenotype association studies suffer from confounding due to population structure and cryptic relatedness (residual dependencies) that have not been observed or accounted for among the study subjects (Lander and Schork, 1994; Yu et al., 2006; Iwata et al., 2007). The same applies to expression–phenotype association studies (Gibson, 2003; Kraft and Horvath, 2003) and clinical quantitative trait locus (cQTL) studies, where genotypes and gene expressions are used simultaneously to study the association with the phenotype (Hoti and Sillanpää, 2006; Bhattacharjee and Sillanpää, 2009; Sillanpää and Noykova, 2008). The significance of this problem in human association studies is currently a subject of considerable debate (Marchini et al., 2004; Devlin et al., 2004; Hinds et al., 2004; Helgason et al., 2005; Clayton et al., 2005; Voight and Pritchard, 2005; Setakis et al., 2006; Zhao et al., 2007). If no pedigree/ancestry information is available, there are different approaches to estimating the unobserved structure of the population or of the pedigree using neutral molecular markers (Pritchard et al., 2000; Blouin, 2003; Excoffier and Heckel, 2006; Weir et al., 2006; Gasbarra et al., 2007; Bink et al., 2008). In many cases, however, exact information specifying the interrelations between individuals may be available. This is so, for example, when data have been ascertained specifically from families or from pedigrees (see Visscher et al., 2008).

Robust methods have been developed especially for family-based association studies (Gauderman et al., 1999; Zhao, 2000; Knapp and Becker, 2003; Chen and Abecasis, 2007) and case–control association testing with related individuals (Thornton and McPeek, 2007). To correct for population structure in population-based association analyses, one can, for example, adjust the P-value (Devlin and Roeder, 1999), include a population term in the association model (Yu et al., 2006; Zhao et al., 2007) or use a principal component approach (Price et al., 2006). Similarly, for known or estimated relatedness, there are many ways to include such information in the association model. One approach to correcting for cryptic relatedness is to include the relationships in the form of a (covariance) matrix in the association model, either in studies of marker–phenotype association (see George and Elston, 1987; Kennedy et al., 1992; Jannink et al., 2001; Yu et al., 2006) or in studies of expression–phenotype association (see Lu et al., 2004). One can incorporate an additive relationship matrix (the covariance structure of a multivariate normal distribution) either in the residuals or through a specific random term (arising from the infinite polygenic model) in the regression model. In linkage studies the same term appears in the role of the genetic background. Such a covariance structure takes care of the dependencies between the study subjects. Another approach approximates this structure by including the phenotypes of the parents, sibs and spouse of the subject as covariates in the regression model and assuming independent residuals (Bonney, 1986). This autoregressive structure does not model the true underlying dependence structure, but it has been shown to perform well and to account for confounding in single-locus models. These two correction terms (polygenic and covariate) have previously been studied and used only in single-gene association models within a testing framework. Thus, their properties in Bayesian multilocus association models (for example, Kilpikari and Sillanpää, 2003; Sillanpää and Bhattacharjee, 2005) are largely unknown.

Modelling the phenotype with both gene expression data and marker data could be advantageous and provide more information, because marker data are stable in comparison with time- and tissue-dependent gene expression data (O'Hara, 2006; West et al., 2006). Although this view has been confirmed in simulations (Hoti and Sillanpää, 2006; Sillanpää and Noykova, 2008), it remains arguable with real data (Bhattacharjee et al., 2008; Bhattacharjee and Sillanpää, 2009). We compare here how these two relatedness corrections work together with Bayesian model-based multilocus association using both marker and gene expression data. To do this, we have modified the cQTL-model of Hoti and Sillanpää (2006) for SNPs and pedigree data. Our emphasis is on a collection of small pedigrees, as cryptic relatedness has a negligible effect in large outbred populations, especially when the sample size increases (Voight and Pritchard, 2005). We consider only a small amount (5%) of missing data here (cf. Sillanpää and Noykova, 2008).

Model

cQTL model

Let $N_M$ be the number of SNP (single nucleotide polymorphism) markers and $N_E$ be the number of gene expression transcripts. Our data consist of continuous phenotypes $y=(y_1,\dots,y_n)^t$, SNP marker genotypes and gene expression measurements from $n$ individuals in known pedigrees collected from a single population. We assume that each individual has its own observation (array) on gene expression made at a single time point. For cases with multiple populations, see the Discussion section. Here, we let $y_i$ denote the observed continuous phenotype of the $i$th individual and summarize the genotypes as $m=(m_{i,j})$ and

$$z_{i,j,k}=\begin{cases}1, & \text{if } m_{i,j}=k,\\ 0, & \text{otherwise,}\end{cases}$$

where $z_{i,j,k}$ denotes the indicator of the $k$th genotype at SNP $j$ for individual $i$. The gene expression measurement $j$ for individual $i$ is denoted by $x_{i,j}$. We emphasize that gene expression levels are assumed to be available (with some missing entries) for each study subject. We closely follow the notation of Hoti and Sillanpää (2006) and assume that gene expression measurements have been normalized (Quackenbush, 2001; Butte, 2002) and suitably transformed beforehand, so that the sample distribution of the majority of the genes is approximately standard normal. Moreover, we assume that $N_E$ and $N_M$ are relatively small (a few hundred at most). To form candidates for genotype × expression interactions, we assume that some markers are a priori associated (that is, possibly have a regulatory effect) with some gene expression measurements. This prior information on the pairing of markers and expressions may be obtained from previous, independent studies, or could be based on known pathways or the proximity of their genomic locations. We refer to them as marker–gene pairs (see Hoti and Sillanpää, 2006). We allow multiple expressions to be associated with a single marker, but not the other way around. Let $\tilde{x}_{i,j}$ and $\tilde{z}_{i,j,k}$ be the expression measurement and genotype indicator for some pair $(g_j,s_j)$, so that for individual $i$, $\tilde{x}_{i,j}$ is the gene expression measurement of gene $g_j$ and $\tilde{z}_{i,j,k}$ is the indicator of genotype $k$ at SNP $s_j$. The number of these previously assigned pairs is $N_{ME}$. We consider the following linear model for a continuous phenotype

$$y_i=\mu+F_i+\sum_{j=1}^{N_M} I_j^{M}\sum_{k=1}^{3}\alpha_{j,k}\,z_{i,j,k}+\sum_{j=1}^{N_E} I_j^{E}\,\beta_j\,x_{i,j}+\sum_{j=1}^{N_{ME}} I_j^{ME}\sum_{k=1}^{3}\gamma_{j,k}\,\tilde{z}_{i,j,k}\,\tilde{x}_{i,j}+\varepsilon_i, \tag{1}$$

where $\mu$ is the population mean and $\varepsilon_i \sim N(0,\sigma_0^2)$ is a normally distributed residual term with mean zero and variance $\sigma_0^2$. $F_i$ denotes a correction term, which takes into account the family structure and the dependence between family members. The linear regression coefficient (effect) of genotype $k$ at marker $j$ is $\alpha_{j,k}$, the coefficient of the expression effect is $\beta_j$ and the coefficient of the interaction effect is $\gamma_{j,k}$. Unlike in Hoti and Sillanpää (2006), each genetic component (marker, expression or interaction) has its own indicator variable, $I_j^{M}$, $I_j^{E}$ or $I_j^{ME}$, respectively. For our motivation for the use of indicators, see the Discussion section. For the indicators, the value one corresponds to the inclusion and the value zero to the exclusion of the genetic component in the model. SNP markers exhibit three genotypes and we use an over-parameterized model, so for each marker and for each marker–gene pair (that is, marker × expression interaction), there is a single indicator variable and three effect coefficients. (Unlike Hoti and Sillanpää (2006), we leave the first coefficients ($\alpha_{j,1}$, $\gamma_{j,1}$) at each locus $j$ unconstrained.) Differences (genotypic contrasts) can be identified afterwards as functions of the posteriors, from the Markov chain Monte Carlo (MCMC) sample or from the MCMC point estimates. As in Hoti and Sillanpää (2006), we can write the genetic data of individual $i$ as the vector

$$X_i=\left(z_{i,1,1},z_{i,1,2},z_{i,1,3},\dots,z_{i,N_M,3},\;x_{i,1},\dots,x_{i,N_E},\;\tilde{z}_{i,1,1}\tilde{x}_{i,1},\dots,\tilde{z}_{i,N_{ME},3}\tilde{x}_{i,N_{ME}}\right)$$

and the vector containing $N=3N_M+N_E+3N_{ME}$ unknown effects is denoted by

$$\theta=\left(\alpha_{1,1},\alpha_{1,2},\alpha_{1,3},\dots,\alpha_{N_M,3},\;\beta_1,\dots,\beta_{N_E},\;\gamma_{1,1},\dots,\gamma_{N_{ME},3}\right)^t.$$

In addition there are $N_M+N_E+N_{ME}$ indicator variables. To create the vector that contains the indicator variables for all genetic effects, we arrange the indicators into a vector containing $N$ elements, repeating each marker and interaction indicator three times:

$$I^{*}=\left(I_1^{M},I_1^{M},I_1^{M},\dots,I_{N_M}^{M},\;I_1^{E},\dots,I_{N_E}^{E},\;I_1^{ME},I_1^{ME},I_1^{ME},\dots,I_{N_{ME}}^{ME}\right).$$

Now we can rewrite the linear cQTL-model (1) as

$$y_i=\mu+F_i+\sum_{j=1}^{N} I_j^{*}\,\theta_j\,X_{i,j}+\varepsilon_i.$$
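As an illustration of the vector form of model (1), the following minimal Python sketch expands the component-level indicators to one indicator per effect and evaluates the indicator-weighted linear predictor for one individual. The function names, the array layout and the `pairs` argument are assumptions made for this example; they are not taken from the original implementation.

```python
import numpy as np

def expand_indicators(I_M, I_E, I_ME):
    """Expand N_M + N_E + N_ME indicators to length N = 3*N_M + N_E + 3*N_ME.

    Marker and interaction indicators are repeated three times because one
    indicator switches all three genotype coefficients on or off.
    """
    return np.concatenate([np.repeat(I_M, 3), I_E, np.repeat(I_ME, 3)])

def genetic_data_row(z, x, pairs):
    """Genetic data vector of one individual.

    z     : (N_M, 3) array of 0/1 genotype indicators
    x     : (N_E,) array of expression measurements
    pairs : list of (gene index, SNP index) marker-gene pairs (hypothetical layout)
    """
    interactions = [z[s] * x[g] for g, s in pairs]
    return np.concatenate([z.ravel(), x] + interactions)

def linear_predictor(mu, F_i, I_star, theta, row):
    """mu + F_i + sum_j I*_j * theta_j * X_ij (model (1) without the residual)."""
    return mu + F_i + np.sum(I_star * theta * row)

# Toy example: 2 SNPs, 2 expressions, 1 marker-gene pair, so N = 6 + 2 + 3 = 11.
z = np.array([[1, 0, 0], [0, 1, 0]])
x = np.array([0.5, -1.0])
row = genetic_data_row(z, x, pairs=[(0, 0)])
I_star = expand_indicators(np.array([1, 0]), np.array([0, 1]), np.array([0]))
eta = linear_predictor(10.0, 0.5, I_star, np.ones(11), row)
```

In the toy example only the first marker and the second expression are switched on, so the excluded components contribute nothing to `eta` regardless of their coefficients.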

Infinite polygenic model

One approach to taking family structure into account in the model is to add a random individual effect whose correlation structure follows the degree of relationship between individuals. In the cQTL-model (1), the term $F_i$ represents the additive effects of the polygenes on individual $i$, which arise from the combined action of infinitely many loci whose individual contributions cannot be distinguished (Yi and Xu, 2000; Jannink et al., 2001). The additive polygenic effects $F=(F_1,\dots,F_n)^t$ are distributed as multivariate normal with known covariance structure,

$$F\sim N\!\left(\mathbf{0},\,2\Phi\sigma_F^2\right).$$

Here, $\mathbf{0}$ is an $n\times 1$ vector of zeros, $\Phi$ is an $n\times n$ matrix of kinship coefficients among individuals based on pedigree information and $\sigma_F^2$ is the additive variance of the polygenes (George and Elston, 1987; Kennedy et al., 1992; Jannink et al., 2001; Monks et al., 2004). The kinship coefficient between two individuals is the probability that homologous genes taken randomly from their genomes are identical by descent from common ancestors in the given pedigree (Lynch and Walsh, 1998). In the breeding literature, the $F_i$ are called breeding values and $2\Phi$ the additive relationship matrix (Henderson, 1976). The matrix $\Phi$ is block-diagonal once the individuals are arranged by family and no remote shared ancestry among the families is assumed. For simplicity, we assume no inbreeding and omit the dominance component.
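To make the covariance structure concrete, the sketch below computes kinship coefficients for a small pedigree with the standard recursive (tabular) method and draws one realization of the polygenic effects from a multivariate normal distribution with covariance 2Φσ_F². The pedigree encoding (parent indices, `-1` for founders, parents listed before their children, both parents known or both unknown) and the variance value are assumptions of this example only.

```python
import numpy as np

def kinship(father, mother):
    """Kinship matrix Phi from parent indices (-1 marks a founder).

    Assumes parents precede their children and founders are unrelated.
    """
    n = len(father)
    phi = np.zeros((n, n))
    for i in range(n):
        f, m = father[i], mother[i]
        # Self-kinship is 1/2 without inbreeding; (1 + phi[f, m]) / 2 in general.
        phi[i, i] = 0.5 if f < 0 else 0.5 * (1.0 + phi[f, m])
        for j in range(i):
            if f >= 0:
                # Kinship with an earlier individual is the average over i's parents.
                phi[i, j] = phi[j, i] = 0.5 * (phi[f, j] + phi[m, j])
    return phi

# Nuclear family: individuals 0 and 1 are the parents of sibs 2 and 3.
father = [-1, -1, 0, 0]
mother = [-1, -1, 1, 1]
phi = kinship(father, mother)

sigma_F2 = 1.5                       # illustrative additive polygenic variance
rng = np.random.default_rng(1)
F = rng.multivariate_normal(np.zeros(4), 2.0 * phi * sigma_F2)
```

For several unrelated families, the per-family matrices would simply be placed along the diagonal of a larger matrix, which gives the block-diagonal structure described above.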

Regression covariates

Another approach to describing family dependencies is to add the phenotypes of the relatives as covariates (fixed effects) into the model. Then for individual i we can write Fi in the cQTL-model (1) as

where y′ denotes deviations of the phenotypes from their empirical mean, subscript fi refers to the father, mi to the mother and si to the spouse if he/she appears earlier in the data set, and osi is the set of sibs of individual i appearing earlier in the data set (Bonney, 1986; Thomas, 2004). Here, ρf, ρm, ρs and ρos are the respective regression coefficients, which can be written in vector form as ρ=(ρf, ρm, ρs, ρos). Bonney (1986) referred to this as the class D model. This kind of model structure can be seen as an approximation of the polygenic background (Thomas, 2004).

Hierarchical model

Prior distributions

We need to specify prior distributions for the unknown parameters. We allow each genetic effect in the vector $\theta=(\theta_1,\theta_2,\dots,\theta_N)$ to have its own variance parameter (Xu, 2003; Hoti and Sillanpää, 2006). For the genetic effects $\theta$, we assign the prior

$$p(\theta\mid\sigma^2)=\prod_{j=1}^{N}p(\theta_j\mid\sigma_j^2),$$

where the functional form of $p(\theta_j\mid\sigma_j^2)$ is a normal density with mean zero and effect-specific variance $\sigma_j^2$. We assigned to $\sigma_j^2$ the Jeffreys' prior $p(\sigma_j^2)\propto 1/\sigma_j^2$, which together with the effect-specific variances induces sparseness in the model (Xu, 2003; Hoti and Sillanpää, 2006). By sparseness, we mean that most of the effects are zero or almost zero. For details of implementation, see the Estimation section in Appendix A. There is also another source of sparseness in our model: the indicator variables. For the indicator variables $I$, we assign the Bernoulli distribution with parameter $s=P(I_j=1)\le 0.5$, which is the prior selection probability for a candidate to be included in the model (that is, $I_j=1$). For the parameter $s$ we use the values $1/N_M$, $1/N_E$ or $1/N_{ME}$ for markers, expressions and their interactions, respectively. This is equivalent to assuming a priori that there is one selected effect for each type of genetic component. We treat the priors $p(\theta)$ and $p(I)$ as independent (Kuo and Mallick, 1998; Sillanpää and Bhattacharjee, 2005; Sillanpää and Noykova, 2008). The prior for $\mu$ is $p(\mu)\propto 1$, and the prior density for $\sigma_0^2=\mathrm{var}(\varepsilon_i)$ is $p(\sigma_0^2)\propto 1/\sigma_0^2$. As a prior for the polygenic effects, we use the multivariate normal density. That is

$$p(F\mid\sigma_F^2)=(2\pi)^{-n/2}\,\left|2\Phi\sigma_F^2\right|^{-1/2}\exp\!\left(-\tfrac{1}{2}\,F^{t}\left(2\Phi\sigma_F^2\right)^{-1}F\right),$$

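where $F=(F_1,\dots,F_n)^t$ is a vector of polygenic effects, $\Phi$ is a matrix of kinship coefficients among individuals, $\sigma_F^2$ is the additive polygenic variance with prior $p(\sigma_F^2)\propto 1/\sigma_F^2$ and $\left|2\Phi\sigma_F^2\right|$ is the determinant of the covariance matrix. Now, under the additive polygenic model, the joint prior is

$$p(\theta,I,F,\mu,\sigma^2)=p(\theta\mid\sigma^2)\,p(I)\,p(F\mid\sigma_F^2)\,p(\mu)\,p(\sigma^2),$$

where $\sigma^2=(\sigma_0^2,\sigma_1^2,\dots,\sigma_N^2,\sigma_F^2)$ and $p(\sigma^2)=p(\sigma_0^2)\,p(\sigma_F^2)\prod_{j=1}^{N}p(\sigma_j^2)$. For the regression covariate model we replace $p(F\mid\sigma_F^2)$ with $p(\rho)=\prod_{j\in\{f,m,s,os\}}p(\rho_j)$, where $p(\rho_j)$ is a normal density function with mean zero and variance 1000, and $\sigma^2=(\sigma_0^2,\sigma_1^2,\dots,\sigma_N^2)$.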
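The sparseness induced by combining a zero-mean normal prior with a Jeffreys' prior on the effect-specific variance can be seen by integrating the variance out. The short derivation below is a standard calculation added for illustration (it is not reproduced from the original text) and uses the substitution $u=\theta_j^2/(2\sigma_j^2)$:

```latex
\begin{aligned}
p(\theta_j)
  &\propto \int_0^\infty \frac{1}{\sqrt{2\pi\sigma_j^2}}
     \exp\!\left(-\frac{\theta_j^2}{2\sigma_j^2}\right)\frac{1}{\sigma_j^2}\,d\sigma_j^2 \\
  &= \frac{1}{\sqrt{\pi}\,|\theta_j|}\int_0^\infty u^{-1/2}e^{-u}\,du
   = \frac{\Gamma(1/2)}{\sqrt{\pi}\,|\theta_j|}
   = \frac{1}{|\theta_j|}.
\end{aligned}
```

The implied marginal prior is improper and has an infinite spike at zero together with heavy tails, so small effects are shrunk strongly towards zero while large effects are penalized comparatively little.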

Missing data model

We assume data are missing at random (Rubin, 1976) and treat missing values as unknown random variables in Bayesian inference. Thus, we need to specify a prior distribution for missing observations. Denote the complete genetic data with no missing values by $D=\{m,x\}$. The observed genetic data, with possibly some missing values, are denoted by $\bar{D}=\{\bar{m},\bar{x}\}$. Recall that the gene expression measurements were assumed to be normalized beforehand. The prior distribution for a missing gene expression measurement is assumed simply to be a standard normal distribution (cf. the major gene model of Sillanpää and Noykova, 2008). Even though a polygenic basis could be assumed for gene expressions, we omit (genetic) dependencies on the parents here. In the prior distribution for missing genotypes we take into account the genotypes of the individual's parents, but omit the recombination aspect because we do not utilize linkage information in our association model here (see the Discussion section). The joint probability distribution of marker $j$ over individuals is given by

$$p(m_j)=\prod_{i\,\in\,\text{founders}}p(m_{i,j})\prod_{i\,\in\,\text{non-founders}}p(m_{i,j}\mid m_{m,j},m_{f,j}),$$

where $m_j=(m_{1,j},\dots,m_{n,j})^t$ is the genotype pattern at marker $j$. The first product is over the prior probabilities of the genotypes of founders, and the second is over the transmission probabilities of the genotypes of non-founders, where $m_{m,j}$ and $m_{f,j}$ are the genotypes of the mother and father of individual $i$, respectively. The transmission probabilities $p(m_{i,j}\mid m_{m,j},m_{f,j})$ follow the Mendelian rules of inheritance. Note that although transmission appears to create dependencies only downwards in the pedigree, in practice there are also upward dependencies, because the genotypes of offspring are informative about those of their parents. The genotypes of the founders are thought of as being drawn from a population with uniform allele frequencies. Then, the prior density function of the genetic data is

$$p(D)=\prod_{j=1}^{N_M}p(m_j)\;\prod_{i=1}^{n}\prod_{j=1}^{N_E}p(x_{i,j}),$$

where $p(x_{i,j})$ is the standard normal density. For details of implementation, see the Estimation section in Appendix A.
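The genotype prior for one marker can be sketched in a few lines of Python. Genotypes are coded here as 0, 1 and 2 copies of one allele, and founder genotypes are given probabilities 1/4, 1/2 and 1/4, which is what equal allele frequencies imply under random mating; both the coding and the random-mating step are assumptions of this example, as are the function names.

```python
import numpy as np

# Founder genotype probabilities under equal allele frequencies (assumed
# Hardy-Weinberg proportions): P(AA), P(AB), P(BB).
FOUNDER_PRIOR = np.array([0.25, 0.5, 0.25])

def transmission(g_mother, g_father):
    """Mendelian probabilities of child genotype 0/1/2 given parental genotypes."""
    pm, pf = g_mother / 2.0, g_father / 2.0   # probability each parent passes allele B
    return np.array([(1 - pm) * (1 - pf),
                     pm * (1 - pf) + (1 - pm) * pf,
                     pm * pf])

def marker_log_prior(genotypes, mother, father):
    """log p(m_j): founder priors times transmission probabilities.

    mother[i], father[i] are parent indices, or -1 for founders.
    """
    logp = 0.0
    for i, g in enumerate(genotypes):
        if mother[i] < 0:
            logp += np.log(FOUNDER_PRIOR[g])
        else:
            probs = transmission(genotypes[mother[i]], genotypes[father[i]])
            logp += np.log(probs[g])
    return logp

# Trio: mother BB (2), father AA (0), child AB (1).
trio_logp = marker_log_prior([2, 0, 1], mother=[-1, -1, 0], father=[-1, -1, 1])
```

A genotype configuration that violates Mendelian inheritance receives probability zero (a log prior of minus infinity), which is how pedigree consistency is enforced when missing genotypes are sampled.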

Posterior distributions

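In Bayesian analysis, marginal posterior distributions for the parameters are derived from the prior distributions and the likelihood of the data. Using Bayes' formula, the joint posterior density of the model parameters conditional on the phenotypic and genetic data is given by

$$p(\theta,I,F,\mu,\sigma^2,D\mid y,\bar{D})\propto p(\theta,I,F,\mu,\sigma^2)\,p(D)\,p(\bar{D}\mid D)\,p(y\mid\theta,I,F,\mu,\sigma_0^2,D),$$

where $p(\theta,I,F,\mu,\sigma^2)$ is the density function of the joint prior distribution of the parameters $\{\theta,I,F,\mu,\sigma^2\}$, $p(D)$ is the prior density function of the complete genetic data, $p(\bar{D}\mid D)$ is the probability mass function of the observed genetic data $\bar{D}$ conditional on the complete genetic data $D$ (that is, it is an indicator function that takes value 1 only when $\bar{D}$ is consistent with $D$ and is zero otherwise) and $p(y\mid\theta,I,F,\mu,\sigma_0^2,D)=\prod_{i=1}^{n}p(y_i\mid\theta,I,F_i,\mu,\sigma_0^2,D)$ is the likelihood of the phenotype data, where

$$y_i\mid\theta,I,F_i,\mu,\sigma_0^2,D\;\sim\;N\!\left(\mu+F_i+\sum_{j=1}^{N_M}I_j^{M}\sum_{k=1}^{3}\alpha_{j,k}z_{i,j,k}+\sum_{j=1}^{N_E}I_j^{E}\beta_jx_{i,j}+\sum_{j=1}^{N_{ME}}I_j^{ME}\sum_{k=1}^{3}\gamma_{j,k}\tilde{z}_{i,j,k}\tilde{x}_{i,j},\;\sigma_0^2\right).$$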

Examples of cQTL analysis with family data

To compare corrections for family data with cQTL-model (1), we analyse a few data sets with three-generation pedigrees in the presence of missing data. First, we analyse two simulated data sets with known genetic effects and then a real CEPH family data that have been used in previous studies (Kraft et al., 2003; Schadt et al., 2003). We also consider average performance (assessed by analysing 25 data replicates). The simulated data is an example of a large data set (210 individuals) with loosely correlated genetic components and real data is an example of a small sample size (58 individuals) with highly correlated genetic components. We first compare how the two correction terms (infinite polygenic model and covariate model) perform against no correction term (model for unrelated individuals) with family data, which has either single or multiple simulated trait-influencing components and compare two correction terms with the real CEPH data. Finally, in simulated data replicates, we consider only marker–phenotype association and compare three methods using 25 marker data sets with three trait-loci.

Simulations

We simulated family data consisting of molecular markers, gene expression level measurements and a continuous phenotype. Our simulation procedure follows that of Hoti and Sillanpää (2006), where expression levels are first generated conditionally on markers, and phenotypes are then generated conditionally on both. The main difference is that we use real SNP marker data on families as a starting point. We emphasize that this approach is able to generate realistic dependence structures for markers as well as expressions. Real marker data were obtained from the CEPH genotype database (Dausset et al., 1990). Fifteen families from the CEPH/Utah family collection were selected, with the family identifiers 1334, 1340, 1345, 1346, 1349, 1350, 1358, 1362, 1375, 1377, 1408, 1418, 1421, 1424 and 1477. The selection criteria were a large number of children and a large number of genotypes available for all three generations (cf. Monks et al., 2004). In total, the families represent 210 individuals. We selected 52 SNPs from eight different chromosomes, based on the availability of genotypes (not too many missing values) and the requirement that the selected markers were not highly dependent on (closely linked to) one another (Table 1). We also required that the markers be in Hardy–Weinberg equilibrium and that the minor allele frequency (MAF) be at least 5%.
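The two marker filters mentioned above can be sketched as follows. The text does not state which Hardy–Weinberg test was used, so the one-degree-of-freedom chi-square goodness-of-fit statistic, the 3.841 critical value (5% level) and the genotype coding (0/1/2 copies of one allele, -1 for missing) are all assumptions of this example; note also that such a test treats individuals as independent, which holds only approximately in family data.

```python
import numpy as np

def maf_and_hwe_chi2(genotypes):
    """Minor allele frequency and HWE chi-square statistic for one SNP.

    genotypes: 1-D sequence coded 0/1/2, with -1 for missing values;
    assumes at least one observed genotype.
    """
    g = np.asarray(genotypes)
    g = g[g >= 0]                                   # drop missing genotypes
    n = g.size
    p = g.sum() / (2.0 * n)                         # frequency of the counted allele
    maf = float(min(p, 1.0 - p))
    if maf == 0.0:                                  # monomorphic: test undefined
        return maf, float("nan")
    observed = np.array([(g == k).sum() for k in range(3)], dtype=float)
    expected = n * np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    chi2 = float(np.sum((observed - expected) ** 2 / expected))
    return maf, chi2

def keep_marker(genotypes, min_maf=0.05, chi2_crit=3.841):
    """True if the SNP passes both the MAF and the HWE filter."""
    maf, chi2 = maf_and_hwe_chi2(genotypes)
    return bool(maf >= min_maf and chi2 < chi2_crit)

in_hwe = keep_marker([0] * 25 + [1] * 50 + [2] * 25)   # exact HWE proportions
no_het = keep_marker([0] * 50 + [2] * 50)              # no heterozygotes at all
```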

Table 1 ID-numbers of SNPs selected from 8 chromosomes. There were a couple of hundreds of SNPs available on each chromosome in the database

Simulating SNP genotypes

First, we needed to complete the missing genotypes in the CEPH families, as our simulation procedure for expression and phenotype data (below) requires complete SNP data. Fewer than 5% of the genotypes were missing among the set of selected markers. Genotypes were missing entirely for two individuals, and some SNPs were not available for a couple of families. We sampled the missing genotypes of the founders from a population with equal allele frequencies, conditionally on the progeny. This allowed us to fill in the missing genotypes consistently downwards through the pedigree. Genotypes were drawn according to the Mendelian transmission probabilities. In this process, we omitted recombination probabilities, but took fully into account that every missing genotype depends on the genotypes of the parents at the same SNP marker.

Simulating expression levels

Conditionally on an individual's genotype on each marker (mj), we simulated three gene expression measurements (x3 × j−2, x3 × j−1, x3 × j). The first two of these (x3 × j−2, x3 × j−1) had constant probability to have in cis effect and the third gene (x3 × j) was set to have no regulatory effects with probability one. A priori (before simulating actual values) we divided markers into three in cis effect groups. One-third of the markers (mj) were assigned an in cis effect on gene expressions if the genotype was homozygote (AA), one-third had an in cis effect if genotype was heterozygote (AB) and final third had an in cis effect if the genotype was homozygote (BB). The decision that the marker actually exhibits the pre-specified in cis effect was made with probability 0.3, which is in line with previous estimates (Jansen and Nap, 2004; Morley et al., 2004). Gene expression measurements (x3 × j) at positions with no in cis effect, and gene expression measurements (x3 × j−2 and x3 × j−1), in absence of in cis effect, were simulated from the distribution N(0,1). In presence of in cis effect, the expression value of the one (in cis) gene (x3 × j−2) assigned on current marker j was simulated from the distribution N(2,1) and expression value of another gene (x3 × j−1) assigned on the same marker was simulated from the distribution N(−2,1) (see Figure 1).

Figure 1 Distributions from which the gene expression measurements for different genotypic groups were generated. Probability of in cis effect is 0.3.
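The expression simulation can be sketched as below. The sketch reads the description as follows: each marker becomes active with probability 0.3, and for an active marker only the individuals carrying the marker's pre-assigned genotype have their first two genes shifted to means +2 and -2. That reading, the assignment of genotype groups by marker index modulo 3, and the genotype coding 0/1/2 for AA/AB/BB are assumptions of this example.

```python
import numpy as np

def simulate_expression(genotypes, p_cis=0.3, rng=None):
    """Simulate three expression measurements per marker.

    genotypes : (n, n_markers) integer array coded 0 (AA), 1 (AB), 2 (BB)
    Returns the (n, 3 * n_markers) expression matrix and the boolean vector
    of markers that actually exhibit an in cis effect.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, n_markers = genotypes.shape
    # Baseline: every measurement is N(0, 1); the third gene of each marker stays so.
    x = rng.normal(0.0, 1.0, size=(n, 3 * n_markers))
    cis_genotype = np.arange(n_markers) % 3        # pre-assigned genotype group
    active = rng.random(n_markers) < p_cis         # effect realized with prob. p_cis
    for j in np.flatnonzero(active):
        carriers = genotypes[:, j] == cis_genotype[j]
        x[carriers, 3 * j] += 2.0                  # first gene:  N(2, 1)
        x[carriers, 3 * j + 1] -= 2.0              # second gene: N(-2, 1)
    return x, active

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(210, 52))          # stand-in for the real SNP data
expr, active = simulate_expression(geno, rng=rng)
```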

Simulating phenotypes

Excluding simultaneously active components at each marker–gene pair, genetic components can be divided into six subtypes, depending on their effect on phenotype and whether an in cis effect is present or absent in marker–gene pairs. Following Hoti and Sillanpää (2006), we denote these subtypes as genotype effect without in cis effect (G), genotype effect with in cis effect (iG), gene expression effect without in cis effect (E), gene expression effect with in cis effect (iE), genotype × gene expression effect without in cis effect (GE) and genotype × gene expression effect with in cis effect (iGE). Continuous phenotypes are constructed as a linear combination of six underlying genetic components and the polygene.

where s1,…, s6 are indexes of influential marker–gene pairs of types G, iG, E, iE, GE and iGE, respectively, zi,j,k is the indicator of genotype k at the marker j for the individual i, xi,l is the gene expression value of gene l for the individual i, and the environmental residual ɛi is assumed to be normally distributed with mean 0 and variance 1. Polygenic terms gi are simulated jointly from the multivariate normal distribution with zero mean vector and covariance–variance structure 2Φσ2, where 2Φ describes the additive relationships between CEPH family members and σ2 is the additive polygenic variance. Here, σ2 is fixed to some value for the desired degree of heritability due to polygenes. See Table 2 for details of the simulation including values of the coefficients (a1,k, a2,k, a3, a4, a5,k, a6,k) and variability due to given effects and the polygene. To test how well our method behaves in case of missing data, we randomly discarded 5% of the data on phenotypes, genotypes and expressions.

Table 2 Simulated genetic components and their effect sizes

We also simulated phenotypic data with one underlying genetic component and the polygene. The starting point for the phenotypic simulation was that the genotypic data was identical with the earlier data set and expression levels were simulated similarly. The only influential genetic component was SNP marker s1=36 without in cis effect. Simulated effects were a1,1=−2 and a1,3=6 for the homozygotes and a1,2=2 for the heterozygote. The simulated polygenic component explained approximately 7% and the simulated SNP effects approximately 24% of the phenotypic variance. We also randomly discarded 5% of the phenotypes, genotypes and expression measurements in this data set.
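A minimal sketch of the single-component phenotype simulation, assuming genotypes coded 0/1/2 so that the stated effects (-2 and 6 for the two homozygotes, 2 for the heterozygote) can be looked up by genotype. The additive relationship matrix and the polygenic variance used in the example are placeholders, not the values used in the study.

```python
import numpy as np

def simulate_phenotype(genotype, two_phi, sigma2_poly,
                       effects=(-2.0, 2.0, 6.0), rng=None):
    """y_i = g_i + a_{1,k(i)} + e_i for a single influential SNP.

    genotype    : (n,) integer array coded 0/1/2 at the trait SNP
    two_phi     : (n, n) additive relationship matrix 2*Phi
    sigma2_poly : additive polygenic variance (placeholder value in the example)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(genotype)
    g = rng.multivariate_normal(np.zeros(n), two_phi * sigma2_poly)  # polygene
    eps = rng.normal(0.0, 1.0, size=n)                               # residual, variance 1
    snp_effect = np.asarray(effects)[np.asarray(genotype)]
    return g + snp_effect + eps, snp_effect

# Four unrelated individuals as a stand-in for the CEPH pedigrees.
y, snp_effect = simulate_phenotype(np.array([0, 1, 2, 1]), np.eye(4), 0.5,
                                   rng=np.random.default_rng(2))
```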

Simulating replicated data sets

To evaluate the average performance of the methods, we simulated 25 phenotypic data sets with the same marker effects and the polygene in each. The genotypic family data were the same as in the earlier simulations (210 individuals and 52 SNP markers) and were kept unchanged in all simulations. We simulated three trait loci (SNPs 7, 29 and 36), each with its own effect sizes. For SNP s1=7, we simulated effects a1,1=1 and a1,3=9 for the homozygotes and a1,2=5 for the heterozygote; for SNP s2=29, we simulated effects a2,1=−3, a2,3=1 and a2,2=−1; and for SNP s3=36, we simulated effects a3,1=−2, a3,3=4 and a3,2=1. The simulated polygenic component was approximately 17% of the phenotypic variance but varied due to sampling variation in different realizations. Simulated overall heritability likewise varied, from 0.34 to 0.52, between replicates. Replicated analyses were done for the complete data sets with no missing values.

Real data

We analysed gene expression data of the lymphoblastoid cell lines of 58 individuals from four CEPH families (CEPH/Utah pedigrees 1362, 1375, 1377 and 1408). The data set was originally described by Schadt et al. (2003), where technical details about measuring gene expression can also be found. The sibship data from the same families have been used earlier by Kraft et al. (2003) as test data to examine the performance of the FEXAT statistic, which represents a sort of correlation coefficient for family data. The CEPH lymphoblastoid cell lines had been cultured and maintained in the log phase of cell growth for at least 2 days before harvest (Schadt et al., 2003). At the time of measuring the expression, the WNT pathway would be expected to be active, because it has been shown to regulate B lymphocyte proliferation (Reya et al., 2000). Following Kraft et al. (2003), we chose the expression of β-catenin (CTNNB1, NM_001904) as a clinical quantitative trait, and expect that in the presence of WNT, levels of β-catenin (the trait) will be associated with factors that can lead to the formation and stabilization of the β-catenin/TCF complex. On the other hand, in the absence of WNT, β-catenin levels will be associated with genes making up the β-catenin destruction complex (Seidensticker and Behrens, 2000).

Gene expression measurements were obtained from the NCBI GenBank. Locations of genes are based on reference assembly. For every gene, we additionally searched the closest available SNP, which is genotyped for these same four CEPH families, using the same criteria as in the simulation analysis. Genotypes were obtained from the CEPH genotype database. Maximum distance between a gene and the closest SNP was 2 361 528 bp, whereas there was no minimum distance, because one SNP was found inside the gene region (Table 3). We omitted individuals who did not have expression measurements at all.

Table 3 List of putative genes and their closest available SNP markers

Results

Simulated data

Analysis details and effect summaries

For data sets with 5% of the data missing, we ran our models with WinBUGS 1.4.1 using four separate MCMC chains, each of length 10 000. For each chain, the burn-in was 1000 and the thinning 10 (that is, only every 10th MCMC sample was stored), and samples from all chains were combined in the MCMC estimation of the parameters. To check the convergence of each chain, we visually inspected the MCMC paths of several parameters. We summarize our results as posterior genetic occupancy probabilities for genetic component $j$, $P(\text{occupancy at location } j\mid\text{data})$, obtained as the proportion of MCMC rounds where the indicator variable $I_j$ is 1, indicating that genetic component $j$ is included in the model. Note that there are as many indicator variables as genetic components ($N_M+N_E+N_{ME}$), with continuous indexing. We also calculated conditional probabilities $Q_{j,k}=P(I_j=1\mid I_k=1,\text{data})$ for all pairs $(j,k)$ of indicator variables that showed elevated posterior probabilities (cf. Hoti and Sillanpää, 2006). $Q_{j,k}$ is the posterior probability that genetic component $j$ is included in the model given that genetic component $k$ is included in the model. In our preliminary analysis (results not shown), we found that the expression effect may sometimes be captured by the interaction term in which the same expression measurement is involved. The same kind of complementary behaviour of the effects was also present in Hoti and Sillanpää (2006) and in Sillanpää and Noykova (2008). We therefore also calculated $Q$ summaries for expression components that correspond to elevated interaction terms, and vice versa. The number of genetic components in the model was summarized (in each MCMC round) as the number of indicator variables that were simultaneously 1. Heritability was estimated by using the formula

$$\hat{h}^2=\frac{1}{r}\sum_{t=1}^{r}\frac{\sigma_y^2(t)-\sigma_e^2(t)}{\sigma_y^2(t)},$$

where $\sigma_y^2(t)$ is the phenotypic variance and $\sigma_e^2(t)$ is the residual variance at round $t$, and $r$ is the total number of MCMC rounds after burn-in. Note that the estimated phenotypic variance depends on imputed values.
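The posterior summaries described in this section are simple functions of the stored MCMC output. The sketch below assumes the indicator samples are available as a 0/1 array with one row per stored MCMC round and one column per genetic component, and that per-round phenotypic and residual variances are available as vectors; the array layout and function names are assumptions of this example.

```python
import numpy as np

def occupancy(I):
    """Posterior occupancy probability of each component: mean of I_j over rounds."""
    return I.mean(axis=0)

def conditional_inclusion(I, j, k):
    """Q_{j,k} = P(I_j = 1 | I_k = 1, data), estimated from rounds with I_k = 1."""
    selected = I[:, k] == 1
    return float(I[selected, j].mean()) if selected.any() else float("nan")

def n_components(I):
    """Number of components included simultaneously in each MCMC round."""
    return I.sum(axis=1)

def heritability(var_y, var_e):
    """Average over rounds of (phenotypic - residual) / phenotypic variance."""
    var_y, var_e = np.asarray(var_y, float), np.asarray(var_e, float)
    return float(np.mean((var_y - var_e) / var_y))

# Toy output: 4 stored rounds, 2 components.
I = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
occ = occupancy(I)
q01 = conditional_inclusion(I, 0, 1)
counts = n_components(I)
h2 = heritability([2.0, 4.0], [1.0, 1.0])
```

A small value of `conditional_inclusion(I, j, k)` for an expression component and its matching interaction component indicates the complementary behaviour discussed above: the two rarely appear in the model in the same round.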

Analysis results

Throughout the paper, we used a posterior occupancy probability threshold of 0.1 to determine significant components. For the data with six simulated effects, the model with the infinite polygenic correction term found five, and the covariate model six, genetic components with elevated posterior occupancy probabilities (Figure 2). One of these ($j=31$), which was found with both models, was actually a false positive that cannot be explained by any of the simulated effects. The expression effects were partly captured by interaction terms, so that the expression effect was rarely in the model at the same time as the corresponding interaction term. This can be seen from the $Q$ summaries (Table 4), where the conditional probabilities were less than 0.07 and 0.04 (for the infinite polygenic model and the covariate model, respectively) for all such cases. The same can also be seen from the MCMC paths of the indicator times the effect (Figure 3). Here, $I_j^{*}\times\theta_j$ (the product of the indicator and the effect size) shows that the expression effect ($j=167$) and the genotype × expression interaction ($j=323$) are clearly complementary, indicating that only one of them contributes to the model at a time. Even though the occupancy probabilities for the simulated components of types E and GE were both smaller than 0.1, they are clearly higher than the occupancy probabilities of the other components of similar type (Figure 2). As illustrated in Table 5, the highest posterior probability was obtained for the correct number of genetic components in both the infinite polygenic model, $P(n_c=6\mid\text{data})\approx 0.22$, and the covariate model, $P(n_c=6\mid\text{data})\approx 0.23$. However, posterior support was obtained for a wide range of values, from 2 to 11 components in the infinite polygenic model and from 3 to 11 in the covariate model. When 5% of the data was artificially coded as missing, the run time was approximately 15 h (4 chains) for every 1000 rounds of iterations for both models.

Figure 2 A summary of analysis with the covariate model. The panels contain the estimated posterior occupancy probability for the genotype effects (top), gene expression effects (middle) and genotype × expression effects (bottom). The positions of simulated effects are indicated by a shortcut notation of the effect subtype. The corresponding genetic components are vertically levelled.

Table 4 Pairwise conditional summaries

Figure 3 The MCMC paths of the effect × I for the expression component 167 (top) and the interaction component 323 corresponding to genotype AB (bottom). The burn-in period is 1000 with thinning 10 and has been removed from the MCMC sample before drawing the figure.

Table 5 Number of components

The same data were also analysed with a model that did not include any correction term for the pedigree structure (that is, $F_i=0$ for all $i$). Surprisingly, the same six effects with elevated occupancy probabilities were found here as in the covariate model analysis. The false positive ($j=31$) showed a slightly higher probability in this analysis than with the other two models (Table 6). Again, the highest posterior probability was obtained for the correct number of simulated genetic components ($P(n_c=6\mid\text{data})\approx 0.24$). The posterior mean of the number of genetic components included simultaneously in the model was slightly higher for this model than for the other two models (Table 7).

Table 6 Effect estimates and occupancy probabilities
Table 7 Point estimates of the number of components

In the infinite polygenic model analysis of data with one simulated effect, the true simulated component was always captured correctly in the model. However, the number of genetic components in the model was clearly overestimated, even though there was no strong evidence for any false positives. There were only two genetic components with occupancy probabilities larger than 0.1 and one of them was false positive, although there was some probability mass for as many as nine influential components (Table 5). The covariate model performed slightly better than the infinite polygenic model and was able to include the true simulated component in the model with probability one, whereas there were no other components with occupancy probabilities larger than 0.1. The model without correction term found the true simulated genotypic component and one genotypic component with occupancy probability 0.1 and one interaction component with occupancy probability 0.098.

Heritability and effect estimates

For the data with six simulated effects, the posterior point estimate from the infinite polygenic model underestimated the heritability, and the corresponding estimate from the covariate model was even smaller (Table 8). The 95% credible interval (CI) for the infinite polygenic model included the true simulated heritability, but it was wider than the CI for the covariate model or for the model with no correction. The effect sizes were also underestimated. As expected, there was a clear dependence: the higher the posterior occupancy probability of a genetic component, the more accurate its effect estimate. This was true especially for the expression effects. Table 6 presents the comparison between simulated and estimated genetic effects. For effects, we show only the model-averaged estimate I × θ, because it is more robust (Ball, 2001) and I always appears together with θ in the model (cf. Sillanpää and Bhattacharjee, 2005). In the table, the effect of the genotype AA is constrained to zero to make the values more comparable. The posterior estimate of the additive polygenic variance was much smaller than the true simulated value, and in the trace plot it appeared to be nearly zero for most of the MCMC iterations. It is likely that other genetic components (SNPs, expressions and their interactions) captured some of the polygenic variance, with each component explaining a small share of it. Thus the correction term estimates were relatively modest (Table 9). It is also likely that the heritability estimate suffered because some simulated components remained unselected for most of the MCMC iterations.
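The model-averaged estimate I × θ can be illustrated with a short Python sketch. It assumes that the indicator and effect draws of one component are stored as parallel lists; all names and numbers are hypothetical. Within the sample, the mean of I × θ equals the occupancy probability multiplied by the mean of θ over the iterations in which the component is included, so a component that is often left out has its averaged effect shrunk towards zero. This is one way to read the dependence between occupancy probability and accuracy noted above.

```python
def model_averaged_effect(indicators, effects):
    """Posterior mean of I * theta over all retained MCMC draws."""
    products = [i * t for i, t in zip(indicators, effects)]
    return sum(products) / len(products)


def conditional_effect(indicators, effects):
    """Posterior mean of theta over the draws where the component is included."""
    included = [t for i, t in zip(indicators, effects) if i == 1]
    return sum(included) / len(included) if included else None


def occupancy(indicators):
    """Share of retained draws in which the component is included."""
    return sum(indicators) / len(indicators)


# Hypothetical draws for a single component (indicator I and effect theta).
indicators = [1, 0, 1, 1]
effects = [0.5, 2.0, 0.7, 0.6]

averaged = model_averaged_effect(indicators, effects)   # about 0.45
conditional = conditional_effect(indicators, effects)   # about 0.60
```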

Table 8 Heritability estimates and the 95% credible intervals around the posterior mean for two simulated data sets with three competing models
Table 9 Correction term estimates

Analyses with all three models for the data with one simulated effect also underestimated heritability (Table 8), but the 95% CI included the true simulated value in all models. The estimated polygenic variance behaved the same way as in the data with six simulated effects. The effect estimates were slightly better with the covariate model though the infinite polygenic model and the model without correction also gave good estimates. In general, these estimates were more accurate here than for the data with six simulated components (results not shown).

Simulated data replicates

Analysis results

Replicated analysis of marker data sets gave quite similar results with all three methods. All methods found the same trait loci in almost every data set, but the degree of evidence (the magnitude of the signals) differed slightly. The infinite polygenic model underestimated the polygenic variance, and for some data sets it had difficulty finding a single mode (converging value) and thus had identifiability problems. As a whole, the infinite polygenic model estimated heritability better than the other two models, but it still underestimated the true heritability almost every time. The model without the correction term found the simulated effects more frequently (that is, had better power) than the other two models, and the infinite polygenic model had the lowest false-positive rate (FPR) and false-discovery rate (FDR), although FPR and FDR were quite similar for all three models (Table 10). The covariate model was not superior with respect to any summary statistic, but its performance was still comparable to that of the other models. As before, the higher the posterior occupancy probability of a genetic component, the more accurate its effect estimate.
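For readers who wish to compute comparable summaries, the Python sketch below derives power, FPR and FDR for one replicate from a vector of occupancy probabilities. The selection threshold, the standard textbook definitions used for FPR and FDR, and the toy numbers are all assumptions made for illustration; the exact rule used to produce Table 10 is not restated here.

```python
def selection_summary(occupancy, true_components, threshold):
    """Power, FPR and FDR when components with occupancy > threshold are selected.

    `true_components` holds the indices of the simulated (non-null) components.
    """
    selected = {j for j, p in enumerate(occupancy) if p > threshold}
    truth = set(true_components)
    n_null = len(occupancy) - len(truth)

    true_positives = len(selected & truth)
    false_positives = len(selected - truth)

    power = true_positives / len(truth) if truth else 0.0
    fpr = false_positives / n_null if n_null else 0.0
    fdr = false_positives / len(selected) if selected else 0.0
    return {"power": power, "fpr": fpr, "fdr": fdr}


# Hypothetical replicate: six candidates, of which 0, 2 and 3 are simulated.
summary = selection_summary(
    occupancy=[0.9, 0.05, 0.6, 0.3, 0.02, 0.7],
    true_components=[0, 2, 3],
    threshold=0.5,  # assumed cut-off, not taken from the article
)
```

In this toy replicate two of the three simulated components are selected together with one null component, so power is 2/3 and both FPR and FDR are 1/3. Averaging such summaries over replicates would give figures of the kind compared across the three models.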

Table 10 Averaged effect estimates and occupancy probabilities of replicated data analysis: Simulated and estimated effects (posterior means) of trait loci under three competing models when constraining the effect of the genotype AA to zero

Real data

Analysis details

When analysing the real data from the CEPH families, we ran four MCMC chains, each of length 50 000, and we allowed only the closest marker to have an interaction with the corresponding gene expression in the model. In the prior, we restricted all variance components to be less than our empirically estimated phenotypic variance (σ̂²≈0.007). The MCMC sampler under the infinite polygenic model showed poor mixing, which resulted in unreliable (non-converged) estimates. The MCMC chains of several parameters were stuck in some parts of the parameter space for many iterations, and the posterior estimates differed between MCMC runs, depending on their initial values. In addition, the infinite polygenic model had clear difficulties in separating (identifying) the polygenic variance and the residual variance from each other. For most MCMC iterations the polygenic variance dominated the residual variance, which was zero or almost zero, but sometimes the roles were reversed. Both issues probably arise from the small number of individuals in the data, and therefore it is safest not to estimate the variance components from such small data sets (see Misztal, 1996; Burton et al., 1999).
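Dependence of the estimates on the initial values, as described above, is the kind of behaviour that multi-chain diagnostics are designed to flag. As an illustration only, the Python sketch below computes the Gelman-Rubin potential scale reduction factor for one scalar parameter from several chains. This diagnostic is not reported in this article, and the chains shown are made up.

```python
from math import sqrt
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for one scalar parameter.

    `chains` is a list of equal-length lists of post-burn-in draws.
    Values well above 1 suggest that the chains have not mixed.
    """
    m = len(chains)
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    grand_mean = mean(chain_means)

    # W: average of the within-chain sample variances.
    within = mean(variance(chain) for chain in chains)
    if within == 0:
        raise ValueError("within-chain variance is zero; R-hat is undefined")

    # B: between-chain variance, scaled by the chain length.
    between = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)

    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Hypothetical chains: four that explore the same region ...
mixed = [
    [0.1, 0.3, 0.2, 0.4],
    [0.2, 0.4, 0.1, 0.3],
    [0.3, 0.1, 0.4, 0.2],
    [0.4, 0.2, 0.3, 0.1],
]
# ... and four that stay near their different starting values.
stuck = [
    [0.0, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 1.0],
    [0.9, 1.0, 0.9, 1.0],
]

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
```

In this toy example the chains that remain near their starting values give a much larger factor than the well-mixed chains.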

Analysis results

The MCMC estimation under the covariate model did not show any problems with mixing, but it did not capture any significant genetic effects either. Every genetic effect occurred in the model with almost equal probability; the largest probability (≈0.068) was found for the SNP close to the gene GSK3B.

In a roundtable discussion (Kass et al., 1998), Neal stated that prior constraints may cause convergence problems for Markov chains, so we loosened our prior restriction on the genetic variance components in the MCMC estimation and allowed them to take values larger than the phenotypic variance as well. After this change, the covariate model produced a slightly elevated posterior probability (≈0.130) for the marker × gene expression interaction effect for the gene LEF1. The probabilities for the rest of the effects varied in the range (0.019, 0.072). In earlier studies (Behrens et al., 1996; Huber et al., 1996), LEF1 has been shown to interact with β-catenin, which is an important effector of the WNT-signalling pathway. Together these two proteins mediate a transcriptional response to WNT signalling (Reya et al., 2000).

Discussion

In the population-based association analysis of quantitative traits, the use of relatives provides a competitive alternative to a sample of unrelated individuals (Visscher et al., 2008). In such cases, the use of a correction term is important in single-gene models to avoid false positives due to the resemblance of individuals (Yu et al., 2006; Iwata et al., 2007). Two approaches for taking the pedigree structure into account in model-based multilocus association analysis were presented and compared here with the approach of no correction. In principle, one can easily include a large pedigree in a covariate model. To allow larger pedigrees in the infinite polygenic model, Damgaard (2007) has suggested prior transformation of the kinship matrix to improve the mixing properties of the WinBUGS sampler. However, because we have concentrated on reasonably small pedigrees, we did not apply such a transformation here. The approaches of Lin (1999) and Thomas (1992) also provide natural samplers for larger pedigrees (see Waldmann, 2009).

Use of indicator variables

Initially, we added a correction term, which takes the pedigree structure into account, to the model of Hoti and Sillanpää (2006), which does not include any indicator variables. Generally, the model found genetic components quite well, but the heritability estimate tended to become highly inflated (being almost one). We found that this overestimation was due to the cumulative effect of many negligible genetic effects (at insignificant components), each of which contributed very little to the cumulative variance of genetic effects (results not shown). When we added indicator variables to the model (as explained in the Model section), the heritability estimate was affected only by the genetic variances of significant components, whereas the other variances were truly zero (cf. method BayesB in Meuwissen et al., 2001). This change in the model structure brought the heritability estimates down from one. It is important to note that Hoti and Sillanpää (2006) obtained good estimates of heritability with their model even without indicators. One reason for the different behaviour in Hoti and Sillanpää (2006) and in our implementation might be that we carried out our analysis with WinBUGS, where we had to restrict our flat priors to a certain region, which had to be narrow to prevent computational overflows and maintain numerical stability (see Appendix A). This restriction led to a situation in which the variance parameters cannot be exactly zero.
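The cumulative effect described above can be made concrete with a small numerical sketch in Python. It assumes, for illustration only, that heritability is the summed genetic variance of the components divided by that sum plus the residual variance, and it ignores the pedigree correction term. The component counts and variances are hypothetical and are not the values used in the simulations.

```python
def heritability(component_variances, residual_variance, indicators=None):
    """Genetic share of the total variance (illustrative definition).

    If `indicators` is given, only components with indicator 1 contribute.
    """
    if indicators is None:
        indicators = [1] * len(component_variances)
    genetic = sum(i * v for i, v in zip(indicators, component_variances))
    return genetic / (genetic + residual_variance)


# Hypothetical setting: 6 real components and 500 negligible ones whose
# variances are small but cannot be exactly zero.
n_true, n_null = 6, 500
variances = [0.05] * n_true + [0.002] * n_null
indicators = [1] * n_true + [0] * n_null
residual = 0.7

h2_without_indicators = heritability(variances, residual)           # about 0.65
h2_with_indicators = heritability(variances, residual, indicators)  # about 0.30
```

Although each negligible component contributes only 0.002, together they more than double the estimate in this toy setting. Once their indicators are zero, only the six real components enter the sum.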

Analyses of simulated data

When analysing the data with six simulated effects using the infinite polygenic model, our estimate of the additive polygenic variance was much smaller than the true simulated value. On the other hand, there was some support for the number of influential genetic components being larger than the true number of simulated effects. We found that these additional effects were all small. We suppose that this phenomenon occurs because our model approximates polygenic variance in a similar way to the finite polygenic model (FPM). The FPM was first proposed by Thompson and Skolnick (1977), and it describes the genetic (polygenic) covariance among pedigree members by a finite number of unlinked small-effect quantitative trait loci (Du et al., 1999; Du and Hoeschele, 2000). Briefly, the correctly identified genetic components and a few extra components together seem to fit (explain) most of the polygenic structure of the data, leaving only a small amount of polygenic variance to be explained by the infinite polygenic component. Our model is more flexible than the FPM, because the FPM assumes a constant number of equal-sized genetic effects when approximating the polygenic structure, whereas our model estimates the number of components and their effects simultaneously from the data. Running the covariate model on the same data led to a slightly smaller heritability estimate and the same number of significant genetic effects. As with the infinite polygenic model, here too the multiple markers (and expressions) took the role of polygenic inheritance. Moreover, this behaviour of the marker effects is closely related to genomic selection (see Meuwissen et al., 2001; Calus and Veerkamp, 2007), where the sum of the marker effects is used to model polygenic variation.

In the data with one simulated effect, both the infinite polygenic model and the covariate model favoured more than one influential component in each MCMC iteration. However, these additional components had negligible effects. This gave further support to the view that polygenic inheritance is captured mostly by extra loci in multilocus association models, where the effects of multiple loci/components are considered simultaneously.

Analysis of simulated data replicates

These replicated marker data analyses showed that, when estimating several variance components, the infinite polygenic model is very sensitive to the particular data (in the sense of sometimes providing good estimates and sometimes poor estimates). All models provided quite similar results, which makes it difficult to identify the best-performing method. However, the unpredictable performance of the infinite polygenic model makes it less attractive.

Analysis of real data

Real data analysis with the infinite polygenic model had difficulties in separating the polygenic and residual variances during estimation. This may imply that individual components here also explain/approximate polygenic variance quite well and that the remaining variability cannot be partitioned into two distinct variance components. In addition, the amount of data was rather small, so the estimation of the variance components is not reliable (see Burton et al., 1999; Misztal, 1996).

There are several reasons why our approach did not find any significant genetic effects in the real data analysis. One is the amount of data (four families), which was quite small. In addition, it is likely that the heritability of the expression trait is also small. Usually, a small amount of data does not matter if (1) heritability is large enough, and vice versa, and if (2) components/candidates are independent. Here, the candidates were specifically selected as members of the single WNT pathway, in which case they are evidently highly correlated, which in turn makes it difficult to do model selection among them using multilocus association models. Our method tries to find a sparse set of trait-associated components simultaneously, but because of the correlation among candidates, the selected components may vary from one MCMC iteration to the next. This may have been the cause of the almost equal posterior probabilities for all genetic effects.

In contrast, Kraft et al. (2003) used a single-gene test, in which the association of one gene was tested at a time. With such highly correlated candidates, one being significant could amount to all of them being significant, which may explain the differences between the results. On the other hand, group-based testing would have provided an interesting alternative (Goeman et al., 2004).

Model extensions and MCMC estimation

The linkage information of SNPs provided by the pedigree was omitted in our model for the missing data. Linkage can be added to our model so that the genotypes of closely linked loci have a dependency structure, by modelling haplotypes and their recombinations as in oligogenic analysis (for example, Heath, 1997; Uimari and Sillanpää, 2001). In principle, the pedigree information also allows an extension to models with combined association and linkage (Fulker et al., 1999; George et al., 1999; Abecasis et al., 2000; Perez-Enciso, 2003). When there are many missing genotypes in a single family, one must keep in mind that WinBUGS uses the single-site Gibbs sampler in updating missing genotypes. Missing genotypes in consecutive generations can cause the single-site Gibbs sampler to be reducible (Cannings and Sheehan, 2002). In that case, one configuration can never be reached once the other configuration has been assigned. Our model for missing expressions could be extended by utilizing information on cis- and trans-acting markers (Sillanpää and Noykova, 2008) or by assuming correlated expressions among related individuals to an extent that reflects the heritability. Our model could also be extended to include a correction term for population structure as in Yu et al. (2006). However, WinBUGS analysis with multiple populations may be too demanding, so other implementations may need to be considered. On the other hand, in light of our results, it is likely that multilocus association models can self-correct for population structure much as they did for family structure here. Indeed, Setakis et al. (2006) found this to be true for population structure in their study using a binary phenotype and logistic regression, as did Iwata et al. (2007) for the multilocus association analysis of quantitative traits (see also Iwata et al., 2009). In any case, further inspection is needed on the role (importance) of the correction terms in multilocus association models.

There are two sources of sparseness in our model. One results from using Jeffreys' prior and the other from the use of indicator variables (see O'Hara and Sillanpää, 2009). Based on our experiments, Jeffreys' prior dominates the other source of sparseness. It seems that the prior selection probability s has only modest influence on the posterior and that the degree of sparseness here is similar to that which would be obtained from Jeffreys' prior alone. The great benefit of using indicator variables is that they can produce occupancy probabilities directly. Xu (2003) also used Jeffreys' prior to induce sparseness in his model, which did not produce occupancy probabilities for the components. In the model of Hoti and Sillanpää (2006) occupancy probabilities were calculated afterwards for standardized effects using a pre-specified threshold value.

Based on the limited experiments carried out here, Bayesian multilocus models without correction seem to be a flexible tool in association analysis even if there are dependencies among study individuals. When there are many candidate components, they can automatically take residual dependencies into account without producing a large number of false positives. However, further inspection is needed to clarify when there are enough candidates and data for it to be safe to leave the correction term out of the cQTL model. For population structure, Iwata et al. (2007) found that use of a correction term (in two-genotype data) systematically seemed to provide some additional advantages over self-correction (the use of a multilocus model without a correction term). It seems that the model without the correction term performs quite similarly to the models that take the pedigree structure into account. If, however, one wants to use a model with correction, we found that the covariate model is preferable when the heritability or the number of individuals is quite small. In addition, the covariate model provides a framework for including phenotype information from ungenotyped parents in the analysis (cf. Purcell et al., 2005). Nevertheless, the model without the correction term gives satisfactory results when several candidate components are studied in the model simultaneously.

The model specification codes (written in WinBUGS) used in this article are freely available for research purposes from the authors upon request.