Correcting for relatedness in Bayesian models for genomic data association analysis

Pikkuhookana, P; Sillanpää, M J

doi:10.1038/hdy.2009.56

Original Article
Published: 20 May 2009

Heredity volume 103, pages 223–237 (2009)

Abstract

For small pedigrees, the issue of correcting for known or estimated relatedness structure in population-based Bayesian multilocus association analysis is considered. Two such relatedness corrections, (1) a random term arising from the infinite polygenic model and (2) a fixed covariate following the class D model of Bonney, are compared with the case of no correction using both simulated and real marker and gene-expression data from lymphoblastoid cell lines from four CEPH families. This comparison is performed with clinical quantitative trait locus (cQTL) models, that is, multilocus association models where marker data and expression levels of gene transcripts as well as possible genotype × expression interaction terms are jointly used to explain quantitative trait variation. We found that, regardless of having a correction term in the model, the cQTL-models fit a few extra small-effect components (similar to finite polygenic models), which in itself serves as a relatedness correction. For small data and small heritability one may use the covariate model, which clearly outperforms the infinite polygenic model in small data examples.

Introduction

Population-based marker–phenotype association studies suffer from confounding due to population structure and cryptic relatedness (residual dependencies) that have not been observed or accounted for among the study subjects (Lander and Schork 1994; Yu et al., 2006; Iwata et al., 2007). The same applies to expression–phenotype association studies (Gibson, 2003; Kraft and Horvath, 2003) and clinical quantitative trait locus (cQTL) studies where genotypes and gene expressions are simultaneously used to study the association with the phenotype (Hoti and Sillanpää, 2006; Bhattacharjee and Sillanpää, 2009; Sillanpää and Noykova, 2008). The significance of this problem in human association studies is currently a subject of considerable debate (Marchini et al., 2004; Devlin et al., 2004; Hinds et al., 2004; Helgason et al., 2005; Clayton et al., 2005; Voight and Pritchard, 2005; Setakis et al., 2006; Zhao et al., 2007). If no pedigree/ancestry information is available, there are different approaches to estimate the unobserved structure of population or of the pedigree using neutral molecular markers (Pritchard et al., 2000; Blouin, 2003; Excoffier and Heckel, 2006; Weir et al., 2006; Gasbarra et al., 2007; Bink et al., 2008). In many cases, however, exact information specifying the interrelations between individuals may be available. This is so, for example when data has been ascertained specifically from families or from pedigrees (see Visscher et al., 2008).

Robust methods have been developed especially for family-based association studies (Gauderman et al., 1999; Zhao, 2000; Knapp and Becker, 2003; Chen and Abecasis, 2007) and case–control association testing with related individuals (Thornton and McPeek, 2007). To correct for population structure in population-based association analyses, one can for example adjust the P-value (Devlin and Roeder, 1999), include the population term in the association model (Yu et al., 2006; Zhao et al., 2007) or use a principal component approach (Price et al., 2006). Similarly, for known or estimated relatedness, there are many ways to include such information in the association model. One approach to correct for cryptic relatedness is to include relationships in the form of a (covariance) matrix into the association model in the studies of marker–phenotype association (see George and Elston, 1987; Kennedy et al., 1992; Jannink et al., 2001; Yu et al., 2006) or in the studies of expression–phenotype association (see Lu et al., 2004). One can incorporate an additive relationship matrix (the covariance structure of a multivariate normal distribution) either to residuals or use a specific random term (arising from the infinite polygenic model) in the regression model. In linkage studies the same term appears in the role of the genetic background. Such covariance structure takes care of the dependencies between the study subjects. Another approach approximates such structure by having the phenotype of the parents, sibs and the spouse of the subject as covariates in the regression model and assumes independence for residuals (Bonney, 1986). This autoregressive structure does not model the true underlying dependence structure, but has been shown to perform well and to account for confounding with single locus models. These two (polygenic and covariate) correction terms have been studied and used earlier only in a single-gene association model and the testing framework. 
Thus their properties for Bayesian multilocus association models (for example, Kilpikari and Sillanpää, 2003; Sillanpää and Bhattacharjee, 2005) are largely unknown.

Modelling phenotype with both gene expression data and marker data could be advantageous and provides more information because marker data is stable in comparison with time- and tissue-dependent gene expression data (O'Hara, 2006; West et al., 2006). Although this view has been confirmed in simulations (Hoti and Sillanpää, 2006; Sillanpää and Noykova, 2008) it remains arguable with real data (Bhattacharjee et al., 2008; Bhattacharjee and Sillanpää 2009). We compare here how these two relatedness corrections work together with Bayesian model-based multilocus association using both marker and gene expression data. To do this, we have modified the cQTL-model of Hoti and Sillanpää (2006) for SNPs and pedigree data. Our emphasis is on a collection of small pedigrees, as cryptic relatedness has a negligible effect in large outbred populations, especially when the sample size increases (Voight and Pritchard, 2005). We consider only a small amount (∼5%) of missing data here (cf. Sillanpää and Noykova, 2008).

Model

cQTL model

Let N_M be the number of SNP (single nucleotide polymorphism) markers and N_E be the number of gene expression transcripts. Our data consist of continuous phenotypes y=(y₁, …, y_n)^t, SNP marker genotypes and gene expression measurements from n individuals in the known pedigrees collected from a single population. We assume that each individual has its own observation (array) on gene expression made at a single time point. For cases with multiple populations, see the Discussion section. Here, we let y_i denote the observed continuous phenotype of the ith individual and summarize the genotypes as m={m_i,j} and z={z_i,j,k}, where z_i,j,k denotes the indicator of the kth genotype at SNP j for individual i. The gene expression measurement j for individual i will be denoted by x_i,j. We want to emphasize that we assume that gene expression levels are available (with some missing entries) for each study subject. We closely follow the notation in Hoti and Sillanpää (2006) and assume that gene expression measurements are normalized (Quackenbush, 2001; Butte, 2002) and transformed suitably beforehand, so that the sample distribution of the majority of the genes is approximately standard normal. Moreover, we assume that N_E and N_M are relatively small (a few hundred at most). To form candidates for genotype × expression interactions, we assume that some markers are a priori associated (that is, possibly have a regulatory effect) with some gene expression measurements. This prior information on the pairing of markers and expressions may be obtained from previous, independent studies, or could be based on known pathways or proximity of their genomic location. We refer to them as marker–gene pairs (see Hoti and Sillanpää, 2006). We allow multiple expressions to be associated with a single marker, but not the other way around. Let x̃_i,j and z̃_i,j,k be the expression measurement and genotype indicator for some pair (g_j,s_j), so that for individual i, x̃_i,j is the gene expression measurement of gene g_j and z̃_i,j,k is the indicator of genotype k at SNP s_j. The number of these previously assigned pairs is N_ME. We consider the following linear model for a continuous phenotype

$$y_i = \mu + \sum_{j=1}^{N_M} I_j^{M} \sum_{k=1}^{3} \alpha_{j,k}\, z_{i,j,k} + \sum_{j=1}^{N_E} I_j^{E} \beta_j\, x_{i,j} + \sum_{j=1}^{N_{ME}} I_j^{ME} \sum_{k=1}^{3} \gamma_{j,k}\, \tilde{z}_{i,j,k}\, \tilde{x}_{i,j} + F_i + \varepsilon_i, \qquad (1)$$

where μ is the population mean and ɛ_i∼N(0, σ₀²) is a normally distributed residual term with mean zero and variance σ₀². F_i denotes a correction term, which takes into account the family structure and the dependence between family members. The linear regression coefficient (effect) of genotype k at marker j is α_j,k, coefficient of expression effect is β_j and coefficient of the interaction effect is γ_j,k. Unlike in Hoti and Sillanpää (2006), each genetic component, marker, expression or interaction, has its own indicator variable, I_j^M, I_j^E or I_j^ME, respectively. For our motivation for the use of indicators, see the Discussion section. For indicators, the value one corresponds to the inclusion and value zero to the exclusion of the genetic component in the model. Obviously SNP markers exhibit three genotypes and we use an over-parameterized model, so for each marker and for each marker–gene pair (that is, marker × expression interaction), there is a single indicator variable and three effect coefficients. (We allow the first coefficients (α_j,1, γ_j,1) at each locus j to be unconstrained in our model unlike that in Hoti and Sillanpää (2006)). We can identify differences (genotypic contrasts) as functions of posteriors afterwards from the Markov chain Monte Carlo (MCMC) sample or from the MCMC point estimates. As in Hoti and Sillanpää (2006) we can write the genetic data of individual i as the vector

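$$X_i = (z_{i,1,1}, z_{i,1,2}, z_{i,1,3}, \ldots, z_{i,N_M,3},\; x_{i,1}, \ldots, x_{i,N_E},\; \tilde{z}_{i,1,1}\tilde{x}_{i,1}, \ldots, \tilde{z}_{i,N_{ME},3}\tilde{x}_{i,N_{ME}})$$

and the vector containing N=3N_M+N_E+3N_ME unknown effects is denoted by

$$\theta = (\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \ldots, \alpha_{N_M,3},\; \beta_1, \ldots, \beta_{N_E},\; \gamma_{1,1}, \ldots, \gamma_{N_{ME},3})^t.$$

In addition there are N_M+N_E+N_ME indicator variables. To create the vector that contains the indicator variables for all genetic effects, we need to arrange the indicators into the vector containing N elements

$$I^* = (I_1^{M}, I_1^{M}, I_1^{M}, \ldots, I_{N_M}^{M},\; I_1^{E}, \ldots, I_{N_E}^{E},\; I_1^{ME}, I_1^{ME}, I_1^{ME}, \ldots, I_{N_{ME}}^{ME}).$$

Now we can rewrite the linear cQTL-model (1) as

$$y_i = \mu + \sum_{j=1}^{N} I_j^{*}\, \theta_j\, X_{i,j} + F_i + \varepsilon_i. \qquad (2)$$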
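The vectorized form of the cQTL-model can be illustrated with a short numerical sketch. It is not the authors' implementation; the function names and the toy dimensions (N_M=2, N_E=1, N_ME=1) are hypothetical, and the ordering of the elements (markers, then expressions, then interactions) is an assumption that follows the count N=3N_M+N_E+3N_ME.

```python
import numpy as np

def expand_indicators(i_m, i_e, i_me):
    """Repeat marker and interaction indicators three times (one per genotype)
    so that the result has N = 3*N_M + N_E + 3*N_ME elements."""
    return np.concatenate([np.repeat(i_m, 3), i_e, np.repeat(i_me, 3)])

def genetic_data_vector(z, x, z_pair, x_pair):
    """Genetic data of one individual, in the same order as the effects.
    z: (N_M, 3) genotype indicators, x: (N_E,) expression levels,
    z_pair: (N_ME, 3) indicators at the paired SNPs, x_pair: (N_ME,) paired expressions."""
    interactions = z_pair * x_pair[:, None]   # genotype x expression products
    return np.concatenate([z.ravel(), x, interactions.ravel()])

def linear_predictor(mu, i_star, theta, data_vec, f_i):
    """Mean of y_i: mu + sum_j I*_j * theta_j * X_ij + F_i."""
    return mu + data_vec @ (i_star * theta) + f_i

# Toy example (hypothetical values): N_M = 2, N_E = 1, N_ME = 1, hence N = 10.
i_star = expand_indicators(np.array([1, 0]), np.array([1]), np.array([0]))
row = genetic_data_vector(np.array([[1, 0, 0], [0, 1, 0]]), np.array([0.5]),
                          np.array([[0, 1, 0]]), np.array([2.0]))
theta = np.arange(1.0, 11.0)
eta = linear_predictor(10.0, i_star, theta, row, 0.2)
```

Only the first marker and the expression are switched on in this toy case, so the excluded components contribute nothing to the mean regardless of their effect values.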

Infinite polygenic model

One approach to taking family structure into account in the model is to add a random individual effect, whose correlation structure would follow the degree of relationship between individuals. In the cQTL-model (1), the term F_i represents the additive effects of the polygenes on individual i, which arise from the combined action of infinitely many loci whose individual contributions cannot be distinguished (Yi and Xu, 2000; Jannink et al., 2001). The additive polygenic effects F=(F₁,…,F_n)^t are distributed as multivariate normal with known covariance structure,

$$F \sim \mathrm{MVN}(\mathbf{0},\, 2\Phi\sigma_F^2).$$

Here, 0 is an n × 1 vector of zeros, Φ is an n × n matrix of kinship coefficients among individuals based on pedigree information and σ_F² is the additive variance of the polygenes (George and Elston, 1987; Kennedy et al., 1992; Jannink et al., 2001; Monks et al., 2004). The kinship coefficient between two individuals is the expected probability that homologous genes taken randomly from their genomes are identical by descent from common ancestors in the given pedigree (Lynch and Walsh, 1998). In the breeding literature, the F_i's are called breeding values and 2Φ the additive relationship matrix (Henderson, 1976). The structure of the matrix Φ is block-diagonal once the individuals are arranged by the families and no remote shared ancestry among the families is assumed. For simplicity, we assume no inbreeding and omit the dominance component.
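As an illustration of how the kinship matrix Φ can be obtained from a known pedigree, the following sketch uses the standard recursive rule for kinship coefficients. It assumes that individuals are ordered so that parents precede their children and that founders are unrelated; the function name and the pedigree encoding (parent index, or -1 for a founder) are hypothetical.

```python
import numpy as np

def kinship_matrix(father, mother):
    """Kinship coefficients for a pedigree in which parents precede children.
    father[i], mother[i]: row index of the parent, or -1 for a founder."""
    n = len(father)
    phi = np.zeros((n, n))
    for i in range(n):
        f, m = father[i], mother[i]
        for j in range(i):
            # Kinship with an earlier individual is the average over the parents.
            val = 0.0
            if f >= 0:
                val += 0.5 * phi[f, j]
            if m >= 0:
                val += 0.5 * phi[m, j]
            phi[i, j] = phi[j, i] = val
        parents_kinship = phi[f, m] if (f >= 0 and m >= 0) else 0.0
        phi[i, i] = 0.5 * (1.0 + parents_kinship)
    return phi

# Nuclear family: father (0), mother (1) and two children (2, 3).
phi = kinship_matrix([-1, -1, 0, 0], [-1, -1, 1, 1])
additive_relationship = 2.0 * phi
```

For this family the additive relationship matrix 2Φ has ones on the diagonal, 0.5 between parent and child and between full sibs, and zero between the unrelated parents.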

Regression covariates

Another approach to describing family dependencies is to add the phenotypes of the relatives as covariates (fixed effects) into the model. Then for individual i we can write F_i in the cQTL-model (1) as

where y′ denotes deviations of the phenotypes from their empirical mean, subscript f_i refers to the father, m_i denotes the mother, s_i denotes the spouse if she/he appears earlier in the data set and os_i is the set of sibs of individual i appearing earlier in the data set (Bonney, 1986; Thomas, 2004). Here, ρ_f, ρ_m, ρ_s and ρ_os are the respective regression coefficients, which can be written in vector form as ρ=(ρ_f, ρ_m, ρ_s, ρ_os). Bonney (1986) referred to this as the class D model. This kind of model structure can be seen as an approximation of the polygenic background (Thomas, 2004).
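A minimal sketch of how such covariates could be assembled is given below. Two choices in it are assumptions made only for illustration and are not taken from the text: the sib covariate is the sum of the centred phenotypes of earlier sibs, and a relative who is absent (or appears later in the data set) contributes zero. The function name and the pedigree encoding are hypothetical.

```python
import numpy as np

def class_d_covariates(y, father, mother, spouse, sibs):
    """Build an (n, 4) matrix of centred relative phenotypes:
    columns are father, mother, earlier spouse, earlier sibs (summed; an assumption)."""
    y_c = np.asarray(y, dtype=float) - np.mean(y)   # deviations from the empirical mean
    n = len(y_c)
    cov = np.zeros((n, 4))
    for i in range(n):
        if father[i] >= 0:
            cov[i, 0] = y_c[father[i]]
        if mother[i] >= 0:
            cov[i, 1] = y_c[mother[i]]
        if 0 <= spouse[i] < i:                      # spouse only if listed earlier
            cov[i, 2] = y_c[spouse[i]]
        cov[i, 3] = sum(y_c[j] for j in sibs[i] if j < i)
    return cov

# Nuclear family: parents 0 and 1 (spouses), children 2 and 3 (sibs).
cov = class_d_covariates([1.0, 2.0, 4.0, 5.0],
                         father=[-1, -1, 0, 0], mother=[-1, -1, 1, 1],
                         spouse=[1, 0, -1, -1], sibs=[[], [], [3], [2]])
rho = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical (rho_f, rho_m, rho_s, rho_os)
F = cov @ rho                           # correction term F_i for each individual
```

Because only relatives listed earlier are used, the first individual in each family always receives a zero correction term in this sketch.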

Hierarchical model

Prior distributions

We need to specify prior distributions for the unknown parameters. We allow each genetic effect in the vector θ=(θ₁, θ₂, …, θ_N) to have its own variance parameter (Xu, 2003; Hoti and Sillanpää, 2006). For the genetic effects θ, we assign the prior

$$p(\theta \mid \sigma^2) = \prod_{j=1}^{N} p(\theta_j \mid \sigma_j^2),$$

where the functional form of p(θ_j∣σ_j²) is a normal density with mean zero and effect-specific variance σ_j². We assigned to σ_j² the Jeffreys' prior p(σ_j²)∝1/σ_j², which together with the effect-specific variances induces sparseness in the model (Xu, 2003; Hoti and Sillanpää, 2006). By sparseness, we mean that most of the effects are zero or almost zero. For details of implementation, see the Estimation section in Appendix A. There is also another source of sparseness in our model: the indicator variables. For indicator variables I, we assign the Bernoulli distribution with parameter s=P(I_j=1)≪0.5, which is the prior selection probability for a candidate to be included in the model (that is, I=1). For parameter s we give values 1/N_M, 1/N_E or 1/N_ME, for markers, expressions and their interactions, respectively. This is equivalent to assuming a priori that one effect is expected to be selected for each type of genetic component. We treat priors p(θ) and p(I) independently (Kuo and Mallick, 1998; Sillanpää and Bhattacharjee, 2005; Sillanpää and Noykova, 2008). The prior for μ is p(μ)∝1, and the prior density for σ₀²=var(ɛ_i) is p(σ₀²)∝1/σ₀². As a prior for polygenic effects, we use the multivariate normal density. That is

$$p(F \mid \sigma_F^2) = (2\pi)^{-n/2}\, |2\Phi\sigma_F^2|^{-1/2} \exp\left\{-\tfrac{1}{2}\, F^{t} (2\Phi\sigma_F^2)^{-1} F\right\},$$

where F=(F₁, …, F_n) is a vector of polygenic effects, Φ is a matrix of kinship coefficients among individuals, σ_F² is the additive polygenic variance with prior p(σ_F²)∝1/σ_F² and ∣2Φσ_F²∣ is the determinant of the covariance matrix. Now, under the additive polygenic model, the joint prior is

$$p(\theta, I, F, \mu, \sigma^2) = p(\theta \mid \sigma^2)\, p(I)\, p(F \mid \sigma_F^2)\, p(\mu)\, p(\sigma^2),$$

where σ²=(σ₀², σ_F², σ₁², …, σ_N²) and p(σ²)=p(σ₀²) p(σ_F²) ∏_j p(σ_j²). For the regression covariate model we replace p(F∣σ_F²) with p(ρ)=∏_j p(ρ_j), where p(ρ_j) is a normal density function with mean zero and variance 1000, and σ²=(σ₀², σ₁², …, σ_N²).
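The effect of the indicator prior can be checked with a small sketch: with selection probabilities 1/N_M, 1/N_E and 1/N_ME, the prior expected number of included components is one per type, three in total. The function names are hypothetical, and the component counts used in the example are illustrative only (the value for N_ME in particular is an assumption).

```python
import numpy as np

def prior_inclusion_probs(n_m, n_e, n_me):
    """Prior P(I_j = 1) for markers, expressions and interactions, in that order."""
    return np.concatenate([np.full(n_m, 1.0 / n_m),
                           np.full(n_e, 1.0 / n_e),
                           np.full(n_me, 1.0 / n_me)])

def sample_indicators(probs, rng):
    """One draw of independent Bernoulli indicators from the prior."""
    return (rng.random(probs.shape) < probs).astype(int)

probs = prior_inclusion_probs(n_m=52, n_e=156, n_me=104)   # illustrative sizes
expected_model_size = probs.sum()                           # one per component type

rng = np.random.default_rng(0)
draw = sample_indicators(probs, rng)                        # a sparse 0/1 vector
```

Most prior draws contain only a handful of ones, which is the sense in which the indicators act as a second source of sparseness.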

Missing data model

We assume data are missing at random (Rubin, 1976) and treat missing values as unknown random variables in Bayesian inference. Thus, we need to specify a prior distribution for missing observations. Denote the complete genetic data with no missing values by D={m, x}. The observed genetic data with possibly some missing values is denoted by D⁻={m⁻, x⁻}. Recall that the gene expression measurements were assumed to be normalized beforehand. Prior distribution for missing gene expression measurement is assumed simply to be a standard normal distribution (cf. for a major gene model, see Sillanpää and Noykova, 2008). Even if in this model the polygenic basis is assumed for gene expressions, we omit (genetic) dependencies from parents. In the prior distribution for missing genotypes we take into account the genotypic values of individuals' parents, but omit the recombination aspect because we do not utilize linkage information in our association model here (see the Discussion section). The joint probability distribution of the marker j over individuals is given by

$$p(m_j) = \prod_{i \in \text{founders}} p(m_{i,j}) \prod_{i \in \text{non-founders}} p(m_{i,j} \mid m_{m,j}, m_{f,j}),$$

where m_j=(m_1,j,…,m_n,j)^t is the genotype pattern at marker j. The first product is over the prior probabilities of the genotypes of founders, and the second is over transmission probabilities of genotypes of non-founders, and m_m,j and m_f,j are the genotypes of the mother and father of individual i, respectively. Transmission probabilities p(m_i,j∣m_m,j,m_f,j) follow the Mendelian rules of inheritance. Note that although it seems that there are dependencies in transmission only downwards in the pedigree, in practice there are also upward dependencies due to total probability. The genotypes of the founders are thought of as being drawn from the population with uniform allele frequencies. Then, the prior density function of the genetic data is

$$p(D) = p(m)\, p(x) = \prod_{j=1}^{N_M} p(m_j) \prod_{i=1}^{n} \prod_{j=1}^{N_E} p(x_{i,j}),$$

where p(x_i,j) is the standard normal density.

For details of implementation, see the Estimation section in Appendix A.
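The genotype prior described above (founder probabilities times Mendelian transmission probabilities, without recombination) can be sketched as follows. The coding of genotypes as the number of B alleles (0=AA, 1=AB, 2=BB), the function names and the pedigree encoding are assumptions for illustration; founder genotypes are taken to be formed from two independently drawn alleles with frequency 0.5.

```python
def transmit_b(parent_genotype):
    """Probability that a parent passes on allele B (genotype = number of B alleles)."""
    return parent_genotype / 2.0

def transmission_prob(child, mother, father):
    """Mendelian probability of the child's genotype given both parental genotypes."""
    pm, pf = transmit_b(mother), transmit_b(father)
    probs = [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf]
    return probs[child]

def founder_prob(genotype, freq_b=0.5):
    """Founder genotype probability from two independent allele draws."""
    probs = [(1 - freq_b) ** 2, 2 * freq_b * (1 - freq_b), freq_b ** 2]
    return probs[genotype]

def marker_prior(genotypes, father, mother):
    """Joint prior probability of one marker's genotypes over a pedigree.
    father[i], mother[i]: index of the parent, or -1 for a founder."""
    p = 1.0
    for i, g in enumerate(genotypes):
        if father[i] < 0 or mother[i] < 0:
            p *= founder_prob(g)
        else:
            p *= transmission_prob(g, genotypes[mother[i]], genotypes[father[i]])
    return p

# Trio: two heterozygous founders and an AA child.
p_trio = marker_prior([1, 1, 0], father=[-1, -1, 0], mother=[-1, -1, 1])
```

A genotype configuration that violates Mendelian inheritance, such as a BB child of two AA parents, receives prior probability zero, which is how the pedigree constrains imputation of missing genotypes.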

Posterior distributions

In Bayesian analysis, marginal posterior distributions for the parameters are derived from the prior distributions and the likelihood of the data. Using Bayes formula, the joint posterior density of the model parameters conditional on phenotypic and genetic data is given by

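$$p(\theta, I, F, \mu, \sigma^2, D \mid y, D^-) \propto p(\theta, I, F, \mu, \sigma^2)\, p(D)\, p(D^- \mid D)\, p(y \mid \theta, I, F, \mu, \sigma^2, D),$$

where p(θ,I,F,μ,σ²) is the density function of the joint prior distribution of parameters {θ,I,F,μ,σ²}, p(D) is the prior density function of the complete genetic data, p(D⁻∣D) is the probability mass function of the observed genetic data D⁻ conditional on the complete genetic data D (that is, it is an indicator function that takes the value 1 only when D⁻ is consistent with D and is zero otherwise) and p(y∣θ,I,F,μ,σ²,D) is the likelihood of the phenotype data, where

$$p(y \mid \theta, I, F, \mu, \sigma^2, D) = \prod_{i=1}^{n} (2\pi\sigma_0^2)^{-1/2} \exp\left\{-\frac{(y_i - \eta_i)^2}{2\sigma_0^2}\right\}$$

and η_i=y_i−ɛ_i is the mean of y_i under the cQTL-model (1).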

Examples of cQTL analysis with family data

To compare corrections for family data with cQTL-model (1), we analyse a few data sets with three-generation pedigrees in the presence of missing data. First, we analyse two simulated data sets with known genetic effects and then a real CEPH family data that have been used in previous studies (Kraft et al., 2003; Schadt et al., 2003). We also consider average performance (assessed by analysing 25 data replicates). The simulated data is an example of a large data set (210 individuals) with loosely correlated genetic components and real data is an example of a small sample size (58 individuals) with highly correlated genetic components. We first compare how the two correction terms (infinite polygenic model and covariate model) perform against no correction term (model for unrelated individuals) with family data, which has either single or multiple simulated trait-influencing components and compare two correction terms with the real CEPH data. Finally, in simulated data replicates, we consider only marker–phenotype association and compare three methods using 25 marker data sets with three trait-loci.

Simulations

We simulated family data consisting of molecular markers, gene expression level measurements and a continuous phenotype. Our simulation procedure follows the procedure of Hoti and Sillanpää (2006), where expression levels are first generated conditionally on markers, and phenotypes are then generated conditionally on them both. The main difference is that we use real SNP marker data on families as a starting point. We want to emphasize that this approach is able to generate realistic dependence structures for markers as well as expressions. Real marker data were obtained from the CEPH genotype database (Dausset et al., 1990). Fifteen families from the CEPH/Utah family collection were selected with the family identifiers 1334, 1340, 1345, 1346, 1349, 1350, 1358, 1362, 1375, 1377, 1408, 1418, 1421, 1424 and 1477. Selection criteria were a large number of children and a large number of genotypes available for all three generations (cf. Monks et al., 2004). In total, the families represent 210 individuals. We selected 52 SNPs from eight different chromosomes, based on the availability of genotypes (not too many missing values) and the property that the selected markers were not highly dependent on (closely linked to) one another (Table 1). We also required that the markers are in Hardy–Weinberg equilibrium and that the minor allele frequency (MAF) was not less than 5%.

Table 1 ID-numbers of SNPs selected from 8 chromosomes. There were a couple of hundreds of SNPs available on each chromosome in the database

Simulating SNP genotypes

First, we needed to complete missing genotypes in CEPH families, as our simulation procedure for expression and phenotype data (below) necessitates complete SNPs. There were less than 5% of genotypes missing among the set of selected markers. Genotypes were missing entirely on two individuals and some SNPs were not available for a couple of families. We sampled the missing genotypes of the founders from the population of equal allele frequencies conditionally on the progeny. This allowed us to consistently fill data (missing genotypes) downwards through the pedigree. Genotypes were drawn according to the Mendelian transmission probabilities. In this process, we omitted recombination probabilities, but took fully into account that every missing genotype depends on genotypes of the parents on the same SNP marker.

Simulating expression levels

Conditionally on an individual's genotype on each marker (m_j), we simulated three gene expression measurements (x_{3 × j−2}, x_{3 × j−1}, x_{3 × j}). The first two of these (x_{3 × j−2}, x_{3 × j−1}) had a constant probability of being subject to an in cis effect and the third gene (x_{3 × j}) was set to have no regulatory effects with probability one. A priori (before simulating actual values) we divided markers into three in cis effect groups. One-third of the markers (m_j) were assigned an in cis effect on gene expressions if the genotype was homozygote (AA), one-third had an in cis effect if the genotype was heterozygote (AB) and the final third had an in cis effect if the genotype was homozygote (BB). The decision that the marker actually exhibits the pre-specified in cis effect was made with probability 0.3, which is in line with previous estimates (Jansen and Nap, 2004; Morley et al., 2004). Gene expression measurements (x_{3 × j}) at positions with no in cis effect, and gene expression measurements (x_{3 × j−2} and x_{3 × j−1}) in the absence of an in cis effect, were simulated from the distribution N(0,1). In the presence of an in cis effect, the expression value of the first (in cis) gene (x_{3 × j−2}) assigned to the current marker j was simulated from the distribution N(2,1) and the expression value of the other gene (x_{3 × j−1}) assigned to the same marker was simulated from the distribution N(−2,1) (see Figure 1).
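The expression simulation can be sketched as below. This is an illustration rather than the original simulation code: the function name is hypothetical, genotypes are coded 0=AA, 1=AB, 2=BB, indices are zero-based (so columns 3j, 3j+1, 3j+2 correspond to x_{3 × j−2}, x_{3 × j−1}, x_{3 × j}), and the assignment of markers to the three in cis groups by j mod 3 is an assumption, because the text does not state how the groups were formed.

```python
import numpy as np

def simulate_expression(genotypes, rng, p_cis=0.3):
    """genotypes: (n, n_markers) integer array with 0=AA, 1=AB, 2=BB.
    Returns an (n, 3*n_markers) expression matrix and the per-marker cis flags."""
    n, n_markers = genotypes.shape
    x = rng.standard_normal((n, 3 * n_markers))   # N(0,1) baseline for every gene
    trigger = np.arange(n_markers) % 3            # assumed grouping of trigger genotypes
    active = rng.random(n_markers) < p_cis        # does the marker act in cis?
    for j in np.flatnonzero(active):
        carriers = genotypes[:, j] == trigger[j]
        x[carriers, 3 * j] += 2.0                 # shifts N(0,1) to N(2,1)
        x[carriers, 3 * j + 1] -= 2.0             # shifts N(0,1) to N(-2,1)
    return x, active

genotypes = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 2]])
x_null, _ = simulate_expression(genotypes, np.random.default_rng(1), p_cis=0.0)
x_cis, active = simulate_expression(genotypes, np.random.default_rng(1), p_cis=1.0)
shift = x_cis - x_null    # same seed, so only the cis shifts differ
```

Running the function twice with the same seed, once with the in cis probability set to zero and once set to one, isolates the mean shifts of ±2 that carriers of the trigger genotype receive.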

Simulating phenotypes

Excluding simultaneously active components at each marker–gene pair, genetic components can be divided into six subtypes, depending on their effect on phenotype and whether an in cis effect is present or absent in marker–gene pairs. Following Hoti and Sillanpää (2006), we denote these subtypes as genotype effect without in cis effect (G), genotype effect with in cis effect (iG), gene expression effect without in cis effect (E), gene expression effect with in cis effect (iE), genotype × gene expression effect without in cis effect (GE) and genotype × gene expression effect with in cis effect (iGE). Continuous phenotypes are constructed as a linear combination of six underlying genetic components and the polygene.

where s₁,…, s₆ are indexes of influential marker–gene pairs of types G, iG, E, iE, GE and iGE, respectively, z_i,j,k is the indicator of genotype k at the marker j for the individual i, x_i,l is the gene expression value of gene l for the individual i, and the environmental residual ɛ_i is assumed to be normally distributed with mean 0 and variance 1. Polygenic terms g_i are simulated jointly from the multivariate normal distribution with zero mean vector and covariance–variance structure 2Φσ², where 2Φ describes the additive relationships between CEPH family members and σ² is the additive polygenic variance. Here, σ² is fixed to some value for the desired degree of heritability due to polygenes. See Table 2 for details of the simulation including values of the coefficients (a_1,k, a_2,k, a₃, a₄, a_5,k, a_6,k) and variability due to given effects and the polygene. To test how well our method behaves in case of missing data, we randomly discarded 5% of the data on phenotypes, genotypes and expressions.
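Fixing the polygenic variance for a desired share of phenotypic variance is a one-line calculation, sketched here together with the joint draw of the polygenic terms. The sketch assumes no inbreeding, so that the diagonal of 2Φ equals one and the polygene contributes variance σ² to each individual; the function names and the example numbers are hypothetical.

```python
import numpy as np

def polygenic_variance_for_share(other_variance, share):
    """Solve sigma2 / (sigma2 + other_variance) = share for sigma2.
    other_variance: variance from all remaining terms, including the residual."""
    return share * other_variance / (1.0 - share)

def simulate_polygene(phi, sigma2, rng):
    """Draw g ~ MVN(0, 2 * Phi * sigma2) jointly for all pedigree members."""
    n = phi.shape[0]
    return rng.multivariate_normal(np.zeros(n), 2.0 * phi * sigma2)

# Kinship matrix of a nuclear family (two unrelated parents, two children).
phi = np.array([[0.50, 0.00, 0.25, 0.25],
                [0.00, 0.50, 0.25, 0.25],
                [0.25, 0.25, 0.50, 0.25],
                [0.25, 0.25, 0.25, 0.50]])
sigma2 = polygenic_variance_for_share(other_variance=3.0, share=0.25)
g = simulate_polygene(phi, sigma2, np.random.default_rng(0))
```

In a finite sample the realized share differs from the target because of sampling variation, so the calculation gives only the expected proportion.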

Table 2 Simulated genetic components and their effect sizes

We also simulated phenotypic data with one underlying genetic component and the polygene. The starting point for the phenotypic simulation was that the genotypic data was identical with the earlier data set and expression levels were simulated similarly. The only influential genetic component was SNP marker s₁=36 without in cis effect. Simulated effects were a_1,1=−2 and a_1,3=6 for the homozygotes and a_1,2=2 for the heterozygote. The simulated polygenic component explained approximately 7% and the simulated SNP effects approximately 24% of the phenotypic variance. We also randomly discarded 5% of the phenotypes, genotypes and expression measurements in this data set.

Simulating replicated data sets

To evaluate the average performance of the methods, we simulated 25 phenotypic data sets with the same marker effects and the polygene in each. The genotypic family data were the same as in earlier simulations (210 individuals and 52 SNP markers) and were kept unchanged in all simulations. We simulated three trait loci (SNPs 7, 29 and 36) with their own effect sizes. For SNP s₁=7, we simulated effects a_1,1=1 and a_1,3=9 for the homozygotes and a_1,2=5 for the heterozygote, for SNP s₂=29 we simulated effects a_2,1=−3, a_2,3=1 and a_2,2=−1 and for SNP s₃=36 we simulated effects a_3,1=−2, a_3,3=4 and a_3,2=1. The simulated polygenic component was approximately 17% of the phenotypic variance but varied due to sampling variation in different realizations. Simulated overall heritability also varied, from 0.34 to 0.52, across replicates. Replicated analyses were done for the complete data sets with no missing values.

Real data

We analysed gene expression data of the lymphoblastoid cell lines of 58 individuals from four CEPH families (CEPH/Utah pedigrees 1362, 1375, 1377 and 1408). The original article about the data set is Schadt et al. (2003). The sibship data from the same families has been used earlier in Kraft et al. (2003) as test data to examine the performance of the FEXAT statistic, which represents a sort of correlation coefficient for family data. Technical details about measuring gene expression in this data set can be found in Schadt et al. (2003). CEPH lymphoblastoid cell lines had been cultured and maintained in the log phase of cell growth at least 2 days before harvest (Schadt et al., 2003). At the time of measuring the expression, it would be expected that the WNT pathway would be active, because the WNT pathway has been shown to regulate B lymphocyte proliferation (Reya et al., 2000). Following Kraft et al. (2003), we chose the expression of β-catenin (CTNNB1, NM_001904) as a clinical quantitative trait, and expect that in the presence of WNT, levels of the β-catenin (trait) will be associated with factors that can lead to the formation and stabilization of the β-catenin/TCF complex. On the other hand, in the absence of WNT, β-catenin levels will be associated with genes making up the β-catenin destruction complex (Seidensticker and Behrens, 2000).

Gene expression measurements were obtained from the NCBI GenBank. Locations of genes are based on reference assembly. For every gene, we additionally searched the closest available SNP, which is genotyped for these same four CEPH families, using the same criteria as in the simulation analysis. Genotypes were obtained from the CEPH genotype database. Maximum distance between a gene and the closest SNP was 2 361 528 bp, whereas there was no minimum distance, because one SNP was found inside the gene region (Table 3). We omitted individuals who did not have expression measurements at all.

Table 3 List of putative genes and their closest available SNP markers

Results

Simulated data

Analysis details and effect summaries

For data sets with 5% of the data missing, we ran our models with WinBUGS 1.4.1 using four separate MCMC chains each of length 10 000. For each chain, burn-in was 1000 and thinning 10 (that is, only every 10th MCMC sample was stored), and samples from all chains were combined in MCMC estimation of the parameters. For checking the convergence of each chain, we visually inspected MCMC paths of several parameters. We summarize our results as posterior genetic occupancy probabilities for genetic component j, P(occupancy at location j∣data), obtained as the proportion of MCMC rounds where the indicator variable I_j is 1, indicating that genetic component j is included in the model. Note that there are as many indicator variables as genetic components (N_M+N_E+N_ME) with continuous indexing. We also calculated conditional probabilities Q_j,k=P(I_j=1∣I_k=1, data) for all pairs (j,k) of indicator variables that showed elevated posterior probabilities (cf. Hoti and Sillanpää, 2006). Q_j,k is the posterior probability that the genetic component j is included in the model on the condition that the genetic component k is included in the model. In our preliminary analysis (results not shown), we found that sometimes the expression effect may be captured by the interaction term where the same expression measurement is involved. The same kind of complementary behaviour of the effects was also present in Hoti and Sillanpää (2006) and in Sillanpää and Noykova (2008). So, we calculated Q summaries also for expression components that correspond to elevated interaction terms and vice versa. The number of genetic components in the model was summarized (in each MCMC round) as the number of indicator variables that were simultaneously 1. Heritability was estimated by using the formula

$$\hat{h}^2 = \frac{1}{r} \sum_{t=1}^{r} \frac{\sigma_y^{2(t)} - \sigma_e^{2(t)}}{\sigma_y^{2(t)}},$$

where σ_y^2(t) is phenotypic variance and σ_e^2(t) is residual variance at round t, and r is the total number of MCMC rounds after burn-in. Note that the estimated phenotypic variance depends on imputed values.
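The posterior summaries described above can be computed directly from stored MCMC output. The sketch below assumes the indicator draws are held in a 0/1 array with one row per stored round and one column per genetic component; the function names are hypothetical, and the heritability function assumes the estimator is the average over rounds of one minus the ratio of residual to phenotypic variance.

```python
import numpy as np

def occupancy(ind):
    """P(I_j = 1 | data): proportion of rounds in which component j is included."""
    return ind.mean(axis=0)

def conditional_occupancy(ind, j, k):
    """Q_jk = P(I_j = 1 | I_k = 1, data), estimated over rounds with I_k = 1."""
    mask = ind[:, k] == 1
    return float(ind[mask, j].mean()) if mask.any() else float("nan")

def n_components_distribution(ind):
    """Posterior distribution of the number of simultaneously included components."""
    counts = ind.sum(axis=1)
    values, freq = np.unique(counts, return_counts=True)
    return dict(zip(values.tolist(), (freq / len(counts)).tolist()))

def heritability(var_y, var_e):
    """Average over rounds of (phenotypic - residual) / phenotypic variance."""
    var_y, var_e = np.asarray(var_y, float), np.asarray(var_e, float)
    return float(np.mean((var_y - var_e) / var_y))

# Four stored rounds, three components (toy values).
ind = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
occ = occupancy(ind)
q_0_given_2 = conditional_occupancy(ind, 0, 2)
sizes = n_components_distribution(ind)
h2 = heritability([2.0, 4.0], [1.0, 1.0])
```

A low value of Q_j,k for an expression effect j and its interaction term k indicates that the two components rarely appear in the model together, that is, that they are complementary.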

Analysis results

Throughout the paper, we used the threshold 0.1 to determine significant components. For the data with six simulated effects, the model with infinite polygenic correction term found five and the covariate model found six genetic components with elevated posterior occupancy probabilities (Figure 2). One of these (j=31), which was found with both models, was actually a false positive, which cannot be explained, with any simulated effects. The expression effects were partly captured by interaction terms so that the expression effect was rarely in the model at the same time as the corresponding interaction term. This can be seen from the Q summaries (Table 4) where the conditional probabilities were less than 0.07 and 0.04 (for an infinite polygenic model and a covariate model, respectively) for all such cases. The same can be seen also from the MCMC paths of the indicator times the effect (Figure 3). Here, I_j^* × θ_j (product of the indicator and effect size) shows, that the expression effect (j=167) and the genotype × expression interaction (j=323) are clearly complementary, indicating that only one of them contributes to the model at a time. Even though the occupancy probabilities for the simulated components of the type E and GE were both smaller than 0.1, they are clearly higher than the occupancy probabilities of the other similar type of components (Figure 2). As illustrated in the Table 5, the highest posterior probability was obtained for the correct number of genetic components in both the infinite polygenic model P(n_c=6∣data)≈0.22 and in the covariate model P(n_c=6∣data)≈0.23. However, posterior support was obtained for a wide range of values varying from 2 to 11 and 3 to 11 components in the infinite polygenic model and in the covariate model, respectively. When 5% of the data was artificially coded as missing, run time was approximately 15 h (4 chains) for every 1000 rounds of iterations for both models.

Table 4 Pairwise conditional summaries


Table 5 Number of components


The same data was also analysed with a model that did not include any correction term for the pedigree structure (that is, F_i=0 for all i). Surprisingly, the same six effects with elevated occupancy probabilities were found here as in the covariate model analysis. The false positive (j=31) had a slightly higher probability in this analysis than with the other two models (Table 6). Again, the highest posterior probability was obtained for the correct number of simulated genetic components (P(n_c=6∣data)≈0.24). The posterior mean of the number of genetic components included simultaneously in the model was slightly higher for this model than for the other two (Table 7).

Table 6 Effect estimates and occupancy probabilities


Table 7 Point estimates of the number of components


In the infinite polygenic model analysis of the data with one simulated effect, the true simulated component was always captured correctly in the model. However, the number of genetic components in the model was clearly overestimated, even though there was no strong evidence for any false positives. Only two genetic components had occupancy probabilities larger than 0.1, and one of them was a false positive, although there was some probability mass for as many as nine influential components (Table 5). The covariate model performed slightly better than the infinite polygenic model: it included the true simulated component in the model with probability one, and no other component had an occupancy probability larger than 0.1. The model without a correction term found the true simulated genotypic component, one further genotypic component with occupancy probability 0.1 and one interaction component with occupancy probability 0.098.

Heritability and effect estimates

For the data with six simulated effects, the posterior point estimate from the infinite polygenic model underestimated the heritability, and the corresponding estimate from the covariate model was even smaller (Table 8). The 95% credible interval (CI) for the infinite polygenic model included the true simulated heritability, but it was wider than the CI for the covariate model or for the model with no correction. The effect sizes were also underestimated. As expected, there was a clear dependence: the higher the posterior occupancy probability of a genetic component, the more accurate its effect estimate. This was especially true for the expression effects. Table 6 compares the simulated and estimated genetic effects. For effects, we show only the model-averaged estimate I × θ, because it is more robust (Ball, 2001) and I always appears together with θ in the model (cf. Sillanpää and Bhattacharjee, 2005). In the table, the effect of the genotype AA is constrained to zero to make the values more comparable. The posterior estimate of the additive polygenic variance was much smaller than the true simulated value, and in the trace plot it appeared to be nearly zero for most of the MCMC iterations. It is likely that the other genetic components (SNPs, expressions and their interactions) captured some of the polygenic variance, with each component explaining a small share of it. Consequently, the correction term estimates were relatively modest (Table 9). It is also likely that the heritability estimate suffered because some simulated components remained unselected for most of the MCMC iterations.
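A posterior mean and an equal-tailed 95% interval of the kind reported in Table 8 can be computed from per-iteration variance draws. The sketch below assumes that heritability is the ratio of total genetic variance to total phenotypic variance and that the total genetic variance has already been computed for each iteration; the paper's exact definition is given in its Model section, so the function and argument names here are hypothetical.

```python
import numpy as np

def heritability_summary(genetic_var, residual_var, level=0.95):
    """Posterior mean and equal-tailed credible interval of h^2.

    genetic_var, residual_var: 1-D arrays of per-iteration variance draws
    (hypothetical inputs; total genetic variance is assumed precomputed).
    """
    g = np.asarray(genetic_var, dtype=float)
    e = np.asarray(residual_var, dtype=float)

    # Heritability is evaluated iteration by iteration, then summarised.
    h2 = g / (g + e)

    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(h2, [tail, 1.0 - tail])
    return float(h2.mean()), (float(lower), float(upper))
```

Summarising the ratio per iteration, rather than taking the ratio of posterior means, keeps the interval consistent with the joint posterior of the variance components.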

Table 8 Heritability estimates and the 95% credible intervals around the posterior mean for two simulated data sets with three competing models


Table 9 Correction term estimates


Analyses with all three models for the data with one simulated effect also underestimated heritability (Table 8), but the 95% CI included the true simulated value in all models. The estimated polygenic variance behaved in the same way as in the data with six simulated effects. The effect estimates were slightly better with the covariate model, although the infinite polygenic model and the model without correction also gave good estimates. In general, these estimates were more accurate here than for the data with six simulated components (results not shown).

Simulated data replicates

Analysis results

Replicated analyses of the marker data sets gave quite similar results with all three methods. All methods found the same trait loci in almost every data set, but the degree of evidence (the magnitude of the signals) differed slightly. The infinite polygenic model underestimated the polygenic variance, and for some data sets it had difficulty finding a single mode (converging value) and thus had identifiability problems. On the whole, the infinite polygenic model estimated heritability better than the other two models, but it still underestimated the true heritability almost every time. The model without the correction term found the simulated effects more frequently (that is, had better power) than the other two models, and the infinite polygenic model had the lowest false-positive rate (FPR) and false-discovery rate (FDR), although FPR and FDR were quite similar for all three models (Table 10). The covariate model was not superior with respect to any summary statistic, but its performance was still comparable to that of the other models. As before, the higher the posterior occupancy probability of a genetic component, the more accurate its effect estimate.
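For a single replicate, power, FPR and FDR can be computed by thresholding the occupancy probabilities and comparing the selected set with the simulated one. This is a minimal sketch using the conventional definitions of these rates and the 0.1 threshold; the paper's exact definitions are not restated in this section, and the function name is hypothetical.

```python
def selection_rates(occupancy, true_components, threshold=0.1):
    """Power, false-positive rate and false-discovery rate for one replicate.

    occupancy: sequence of posterior occupancy probabilities, one per component
    true_components: indices of the simulated (true) components
    """
    selected = {j for j, p in enumerate(occupancy) if p > threshold}
    truth = set(true_components)
    n_null = len(occupancy) - len(truth)

    true_pos = len(selected & truth)
    false_pos = len(selected - truth)

    # Power: share of true components that were selected.
    power = true_pos / len(truth) if truth else float("nan")
    # FPR: share of null components that were selected.
    fpr = false_pos / n_null if n_null else float("nan")
    # FDR: share of selected components that are null (0 if none selected).
    fdr = false_pos / len(selected) if selected else 0.0
    return power, fpr, fdr
```

Averaging the three rates over replicates gives summary statistics comparable across the competing models.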

Table 10 Averaged effect estimates and occupancy probabilities of replicated data analysis: Simulated and estimated effects (posterior means) of trait loci under three competing models when constraining the effect of the genotype AA to zero


Real data

Analysis details

When analysing the real data from the CEPH families, we ran four MCMC chains, each of length 50 000, and allowed only the closest marker to interact with the corresponding gene expression in the model. In the prior, we restricted all variance components to be less than our empirically estimated phenotypic variance (ς̂²≈0.007). The MCMC sampler under the infinite polygenic model showed poor mixing, which resulted in unreliable (non-converged) estimates. The chains of several parameters were stuck in parts of the parameter space for many iterations, and the posterior estimates differed between MCMC runs depending on the initial values. In addition, the infinite polygenic model had clear difficulties in separating (identifying) the polygenic variance and the residual variance from each other. For most MCMC iterations the polygenic variance dominated the residual variance, which was zero or almost zero, but sometimes the roles were reversed. Both issues probably arise from the small number of individuals in the data, and it is therefore safest not to estimate the variance components from such small data sets (see Misztal, 1996; Burton et al., 1999).
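Disagreement between chains started from different initial values is what a multi-chain diagnostic is designed to flag. The sketch below computes the potential scale reduction factor described in Gelman et al. (2004) for one scalar parameter. The paper does not state which diagnostic was used, so this is an illustrative assumption, and the function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin potential scale reduction factor for one parameter.

    chains: (n_chains, n_iterations) array of post-burn-in draws.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape

    # Between-chain variance B: n times the variance of the chain means.
    between = n * chains.mean(axis=1).var(ddof=1)
    # Within-chain variance W: average of the per-chain sample variances.
    within = chains.var(axis=1, ddof=1).mean()

    # Pooled estimate of the marginal posterior variance.
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))
```

Values close to one are consistent with the chains sampling the same distribution; chains stuck in different regions give values well above one.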

Analysis results

The MCMC estimation under the covariate model did not show any mixing problems, but it did not capture any significant genetic effects either. Every genetic effect occurred in the model with almost equal probability; the largest probability (≈0.068) was found for the SNP close to the gene GSK3B.

In a roundtable discussion (Kass et al., 1998), Neal stated that prior constraints may cause convergence problems for Markov chains, so we loosened our prior restriction on the genetic variance components in the MCMC estimation and allowed them to take values larger than the phenotypic variance. After this change, the covariate model produced a slightly elevated posterior probability (≈0.130) for the marker × gene expression interaction effect for the gene LEF1. The probabilities for the rest of the effects ranged from 0.019 to 0.072. In earlier studies (Behrens et al., 1996; Huber et al., 1996), LEF1 has been shown to interact with β-catenin, which is an important effector of the WNT-signalling pathway. Together, these two proteins mediate a transcriptional response to WNT signalling (Reya et al., 2000).

Discussion

In the population-based association analysis of quantitative traits, a sample of relatives provides a competitive alternative to a sample of unrelated individuals (Visscher et al., 2008). In such cases, a correction term is important in single-gene models to avoid false positives due to the resemblance among individuals (Yu et al., 2006; Iwata et al., 2007). Two approaches for taking the pedigree structure into account in model-based multilocus association analysis were presented and compared here with the approach of no correction. In principle, one can easily include a large pedigree in a covariate model. To allow larger pedigrees in the infinite polygenic model, Damgaard (2007) has suggested a prior transformation of the kinship matrix to improve the mixing properties of the WinBUGS sampler. However, because we concentrated on reasonably small pedigrees, we did not apply such a transformation here. The approaches of Lin (1999) and Thomas (1992) also provide natural samplers for larger pedigrees (see Waldmann, 2009).
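For a small, known pedigree, the additive relationship matrix that underlies a polygenic term can be built with the standard tabular method. The sketch below is illustrative only and is not the authors' implementation: the function name and the parent-index encoding are hypothetical, and the kinship matrix is taken to be half of the additive relationship matrix.

```python
import numpy as np

def relationship_matrix(sires, dams):
    """Additive (numerator) relationship matrix by the tabular method.

    sires, dams: parent index for each individual, or None for founders.
    Individuals must be ordered so that parents precede their offspring.
    """
    n = len(sires)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sires[i], dams[i]
        # Diagonal: one plus the inbreeding coefficient of individual i.
        both_known = s is not None and d is not None
        A[i, i] = 1.0 + (0.5 * A[s, d] if both_known else 0.0)
        for j in range(i):
            # Off-diagonal: average relationship of j with the parents of i.
            a = 0.0
            if s is not None:
                a += 0.5 * A[j, s]
            if d is not None:
                a += 0.5 * A[j, d]
            A[i, j] = A[j, i] = a
    return A
```

For two unrelated founders and their two offspring, the parent-offspring and full-sib entries are both 0.5, as expected for a non-inbred nuclear family.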

Use of indicator variables

We began by adding a correction term that takes the pedigree structure into account to the model of Hoti and Sillanpää (2006), which does not include any indicator variables. In general, this model found genetic components quite well, but the heritability estimate tended to become highly inflated (almost one). We found that this overestimation was due to the cumulative effect of many negligible genetic effects (at insignificant components), each of which contributed very little to the cumulative variance of the genetic effects (results not shown). When we added indicator variables to the model (as explained in the Model section), the heritability estimate was affected only by the genetic variances of significant components, whereas the other variances were truly zero (cf. method BayesB in Meuwissen et al., 2001). This change in the model structure brought the heritability estimates down from one. It is important to note that Hoti and Sillanpää (2006) obtained good estimates of heritability with their model even without indicators. One reason for the different behaviour in Hoti and Sillanpää (2006) and in our implementation might be that we carried out our analysis with WinBUGS, where we had to restrict our flat priors to a certain region, which had to be narrow to prevent computational overflows and maintain numerical stability (see Appendix A). This restriction led to a situation where the variance parameters cannot be exactly zero.
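The inflation mechanism described above is simple arithmetic, which the following sketch illustrates. All numbers are hypothetical and chosen only for illustration; they are not taken from the analyses, and heritability is again assumed to be genetic variance divided by total variance.

```python
def heritability(component_vars, residual_var, indicators=None):
    """h^2 from per-component genetic variances, optionally gated by indicators."""
    if indicators is None:
        indicators = [1] * len(component_vars)
    genetic = sum(i * v for i, v in zip(indicators, component_vars))
    return genetic / (genetic + residual_var)

# Hypothetical numbers: 3 real components plus 300 negligible ones whose
# variances are bounded away from zero by a restricted prior.
component_vars = [0.10, 0.10, 0.10] + [0.01] * 300
residual_var = 0.30

# Without indicators every small variance adds to the genetic total.
without_indicators = heritability(component_vars, residual_var)

# With indicators the negligible components are switched off exactly.
with_indicators = heritability(component_vars, residual_var,
                               indicators=[1, 1, 1] + [0] * 300)
```

In this toy setting the 300 small variances sum to ten times the real genetic variance, so the estimate is pushed towards one unless the indicators remove them.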

Analyses of simulated data

When analysing the data with six simulated effects using the infinite polygenic model, our estimate of the additive polygenic variance was much smaller than the true simulated value. On the other hand, there was some posterior support for the number of influential genetic components being larger than the true number of simulated effects. We found that these additional effects were all small. We suppose that this phenomenon occurs because our model approximates polygenic variance in a similar way to the finite polygenic model (FPM). The FPM was first proposed by Thompson and Skolnick (1977) and describes the genetic (polygenic) covariance among pedigree members by a finite number of unlinked small-effect quantitative trait loci (Du et al., 1999; Du and Hoeschele, 2000). Briefly, the correctly identified genetic components and a few extra components together seem to fit (explain) most of the polygenic structure of the data, leaving only a small amount of polygenic variance to be explained by the infinite polygenic component. Our model is more flexible than the FPM, because the FPM assumes a constant number of equal-sized genetic effects when approximating the polygenic structure, whereas our model estimates the number of components and their effects simultaneously from the data. Running the covariate model on the same data led to a slightly smaller heritability estimate and the same number of significant genetic effects. As with the infinite polygenic model, here too the multiple markers (and expressions) took the role of polygenic inheritance. Moreover, this behaviour of the marker effects is closely related to genomic selection (see Meuwissen et al., 2001; Calus and Veerkamp, 2007), where the sum of the marker effects is used to model polygenic variation.

In the data with one simulated effect, both the infinite polygenic model and the covariate model favoured more than one influential component in each MCMC iteration. However, these additional components had negligible effects. This gives further support to the idea that polygenic inheritance is captured mostly by extra loci in multilocus association models, where the effects of multiple loci/components are considered in the model simultaneously.

Analysis of simulated data replicates

These replicated marker data analyses showed that, when estimating several variance components, the infinite polygenic model is very sensitive to the particular data set (in the sense of sometimes providing good estimates and sometimes poor ones). All models provided quite similar results, which makes it difficult to identify the best-performing method. However, the unpredictable performance of the infinite polygenic model makes it less attractive.

Analysis of real data

The real data analysis with the infinite polygenic model had difficulties in separating the polygenic and residual variances during estimation. This may imply that the individual components here also explain/approximate polygenic variance quite well and that the remaining variability cannot be partitioned into two distinct variance components. In addition, the amount of data was rather small, so the variance component estimates are not reliable (see Burton et al., 1999; Misztal, 1996).

There are several reasons why our approach did not find any significant genetic effects in the real data analysis. One is the amount of data (four families), which was quite small. In addition, it is likely that the heritability of the expression trait is also small. Usually, a small amount of data does not matter if (1) heritability is large enough (and vice versa) and (2) the components/candidates are independent. Here, the candidates were specifically selected as members of a single pathway (WNT), in which case they are evidently highly correlated, which in turn makes it difficult to do model selection among them using multilocus association models. Our method tries to find a sparse set of trait-associated components simultaneously, but because of the correlation among candidates, the selected components may vary from one MCMC iteration to the next. This may have been the cause of the almost equal posterior probabilities for all genetic effects.

In contrast, Kraft et al. (2003) used a single-gene test, in which the association of one gene was tested at a time. With highly correlated candidates, one being significant may amount to all of them being significant, which may explain the differences between the results. On the other hand, group-based testing would have provided an interesting alternative (Goeman et al., 2004).

Model extensions and MCMC estimation

The linkage information of SNPs provided by the pedigree was omitted in our model for the missing data. Linkage can be added to our model so that the genotypes of closely linked loci have a dependency structure, by modelling haplotypes and their recombinations as in oligogenic analysis (for example, Heath, 1997; Uimari and Sillanpää, 2001). In principle, the pedigree information also allows an extension to models with combined association and linkage (Fulker et al., 1999; George et al., 1999; Abecasis et al., 2000; Perez-Enciso, 2003). When there are many missing genotypes in a single family, one must keep in mind that WinBUGS uses the single-site Gibbs sampler to update missing genotypes. Missing genotypes in consecutive generations can cause the single-site Gibbs sampler to be reducible (Cannings and Sheehan, 2002). In that case, one configuration can never be reached once the other configuration has been assigned. Our model for missing expressions could be extended by utilizing information on cis- and trans-acting markers (Sillanpää and Noykova, 2008) or by assuming correlated expressions among related individuals to an extent that reflects the heritability. Our model could also be extended to include a correction term for population structure as in Yu et al. (2006). However, WinBUGS analysis with multiple populations may be too demanding, so other implementations may need to be considered. On the other hand, in light of our results, it is likely that multilocus association models can self-correct for population structure much as they did for family structure here. Indeed, Setakis et al. (2006) found this to be true for population structure in their study using a binary phenotype and logistic regression, as did Iwata et al. (2007) for the multilocus association analysis of quantitative traits (see also Iwata et al., 2009). In any case, further inspection is needed on the role (importance) of the correction terms in multilocus association models.

There are two sources of sparseness in our model. One results from using Jeffreys' prior and the other from the use of indicator variables (see O'Hara and Sillanpää, 2009). Based on our experiments, Jeffreys' prior dominates the other source of sparseness. It seems that the prior selection probability s has only modest influence on the posterior and that the degree of sparseness here is similar to that which would be obtained from Jeffreys' prior alone. The great benefit of using indicator variables is that they can produce occupancy probabilities directly. Xu (2003) also used Jeffreys' prior to induce sparseness in his model, which did not produce occupancy probabilities for the components. In the model of Hoti and Sillanpää (2006) occupancy probabilities were calculated afterwards for standardized effects using a pre-specified threshold value.

Based on the limited experiments carried out here, Bayesian multilocus models without correction seem to be a flexible tool in association analysis even if there are dependencies among study individuals. When there are many candidate components, they can automatically take residual dependencies into account without producing a large number of false positives. However, further inspection is needed to clarify when there are enough candidates and data for it to be safe to leave the correction term out of the cQTL-model. For population structure, Iwata et al. (2007) found that the use of a correction term (in two-genotype data) seemed to provide some systematic advantages over self-correction (the use of a multilocus model without a correction term). It seems that the model without the correction term performs quite similarly to the models that take the pedigree structure into account. If one nevertheless wants to use a model with correction, we found that the covariate model is preferable when the heritability or the number of individuals is quite small. In addition, the covariate model provides a framework for including phenotype information from ungenotyped parents in the analysis (cf. Purcell et al., 2005). Nevertheless, the model without the correction term gives satisfactory results when several candidate components are studied in the model simultaneously.

The model specification codes (written in WinBUGS) used in this article are freely available for research purposes from the authors upon request.

References

Abecasis GR, Cardon LR, Cookson WOC (2000). A general test of association for quantitative traits in nuclear families. Am J Hum Genet 66: 279–292.
Ball RD (2001). Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion. Genetics 159: 1351–1364.
Behrens J, von Kries JP, Kühl M, Bruhn L, Wedlich D, Grosschedl R et al. (1996). Functional interaction of β-catenin with the transcription factor LEF-1. Nature 382: 638–642.
Bhattacharjee M, Botting CH, Sillanpää MJ (2008). Bayesian biomarker identification based on marker-expression-proteomics data. Genomics 92: 384–392.
Bhattacharjee M, Sillanpää MJ (2009). Bayesian joint disease-marker-expression analysis applied to clinical characteristics of chronic fatigue syndrome. In: McConnell P, Lim S, Cuticchia AJ (eds). Methods of Microarray Data Analysis VI. CreateSpace Publishing: Scotts Valley, California. pp 15–34.
Bink MCAM, Anderson AD, van de Weg WE, Thompson EA (2008). Comparison of marker-based pairwise relatedness estimators on a pedigreed plant population. Theor Appl Genet 117: 843–855.
Blouin MS (2003). DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol Evol 18: 503–511.
Bonney GE (1986). Regressive logistic models for familial disease and other binary traits. Biometrics 42: 611–625.
ter Braak CJF, Boer MP, Bink MCAM (2005). Extending Xu's Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics 170: 1435–1438.
Burton P, Tiller K, Gurrin L, Cookson W, Musk A, Palmer LJ (1999). Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genet Epidemiol 17: 118–140.
Butte A (2002). The use and analysis of microarray data. Nat Rev Drug Discov 1: 951–958.
Calus MPL, Veerkamp RF (2007). Accuracy of breeding values when using and ignoring the polygenic effect in genomic breeding value estimation with a marker density of one SNP per cM. J Anim Breed Genet 124: 362–368.
Cannings C, Sheehan NA (2002). On a misconception about irreducibility of the single-site Gibbs sampler in a pedigree application. Genetics 162: 993–996.
Cemgil AT, Févotte C, Godsill SJ (2007). Variational and stochastic inference for Bayesian source separation. Digital Signal Process 17: 891–913.
Chen W-M, Abecasis GR (2007). Family-based association tests for genomewide association scans. Am J Hum Genet 81: 913–926.
Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM et al. (2005). Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37: 1243–1246.
Damgaard LH (2007). Technical note: how to use Winbugs to draw inferences in animal models. J Anim Sci 85: 1363–1368.
Dausset J, Cann H, Cohen D, Lathrop M, Lalouel JM, White R (1990). Centre d'etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6: 575–577.
Devlin B, Bacanu SA, Roeder K (2004). Genomic control to the extreme. Nat Genet 36: 1129–1130.
Devlin B, Roeder K (1999). Genomic control for association studies. Biometrics 55: 997–1004.
Du F-X, Hoeschele I (2000). Estimation of additive, dominance and epistatic variance components using finite locus models implemented with a single-site Gibbs and a descent graph sampler. Genet Res 76: 187–198.
Du F-X, Hoeschele I, Gage-Lahti KM (1999). Estimation of additive and dominance variance components in finite polygenic models and complex pedigrees. Genet Res 74: 179–187.
Excoffier L, Heckel G (2006). Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet 7: 745–758.
Fulker DW, Cherny SS, Sham PC, Hewitt JK (1999). Combined linkage and association sib-pair analysis for quantitative traits. Am J Hum Genet 64: 259–267.
Gasbarra D, Pirinen M, Sillanpää MJ, Salmela E, Arjas E (2007). Estimating genealogies from unlinked marker data: A Bayesian approach. Theor Pop Biol 72: 305–322.
Gauderman WJ, Witte JS, Thomas DC (1999). Family-based association studies. J Natl Cancer Inst 26: 31–37.
Gelman A, Carlin JB, Stern HS, Rubin DB (2004). Bayesian Data Analysis 2nd edn. Chapman and Hall, London.
George V, Elston RC (1987). Testing the association between polymorphic markers and quantitative traits in pedigrees. Genet Epidemiol 4: 193–201.
George V, Tiwari HT, Zhu X, Elston RC (1999). A test of transmission/disequilibrium for quantitative traits in pedigree data, by multiple regression. Am J Hum Genet 65: 236–245.
Gibson G (2003). Population genomics: celebrating individual expression. Heredity 90: 1–2.
Gilks WR, Thomas A, Spiegelhalter DJ (1994). A language and program for complex Bayesian modelling. Statistician 43: 169–178.
Goeman JJ, van de Geer SA, de Kort F, Houwelingen HJ (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20: 93–99.
Heath SC (1997). Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am J Hum Genet 61: 748–760.
Helgason A, Yngvadottir B, Hrafnkelsson B, Gulcher J, Stefansson K (2005). An Icelandic example of the impact of population structure on association studies. Nat Genet 37: 90–95.
Henderson CR (1976). A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32: 69–83.
Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR et al. (2004). Matching strategies for genetic association studies in structured populations. Am J Hum Genet 74: 317–325.
Hobert JP, Casella G (1996). The effect of improper priors on Gibbs sampling in hierarchical mixed models. J Am Stat Assoc 91: 1461–1473.
Hoti F, Sillanpää MJ (2006). Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits. Heredity 97: 4–18.
Huber O, Korn R, McLaughlin J, Ohsugi M, Herrmann BG, Kemler R (1996). Nuclear localization of beta-catenin by interaction with transcription factor LEF-1. Mech Dev 59: 3–10.
Iwata H, Ebana K, Fukuoka S, Jannink J-L, Hayashi T (2009). Bayesian multilocus association mapping on ordinal and censored traits and its application to the analysis of genetic variation among Oryza sativa L. germplasms. Theor Appl Genet 118: 865–880.
Iwata H, Uga Y, Yoshioka Y, Ebana K, Hayashi T (2007). Bayesian association mapping of multiple quantitative trait loci and its application to the analysis of genetic variation among Oryza sativa L. germplasms. Theor Appl Genet 114: 1437–1449.
Jannink J-L, Bink MCAM, Jansen RC (2001). Using complex plant pedigrees to map valuable genes. Trends Plant Sci 6: 337–342.
Jansen RC, Nap J-P (2004). Regulating gene expression: surprises still in store. Trends Genet 20: 223–225.
Kass RE, Carlin BP, Gelman A, Neal RM (1998). Markov Chain Monte Carlo in practice: A roundtable discussion. Am Stat 52: 93–100.
Kennedy BW, Quinton M, van Arendonk JAM (1992). Estimation of effects of single genes on quantitative traits. J Anim Sci 70: 2000–2012.
Kilpikari R, Sillanpää MJ (2003). Bayesian analysis of multilocus association in quantitative and qualitative traits. Genet Epidemiol 25: 122–135.
Knapp M, Becker T (2003). Family-based association analysis with tightly linked markers. Hum Hered 56: 2–9.
Kraft P, Horvath S (2003). The genetics of gene expression and gene mapping. Trends Biotechnol 21: 377–378.
Kraft P, Schadt E, Aten J, Horvath S (2003). A family-based test for correlation between gene expression and trait values. Am J Hum Genet 72: 1323–1330.
Kuo L, Mallick B (1998). Variable selection for regression models. Sankhyâ, Series: B 60: 65–81.
Lander ES, Schork NJ (1994). Genetic dissection of complex traits. Science 265: 2037–2048.
Lin S (1999). Monte Carlo Bayesian methods for quantitative traits. Comp Stat Data Anal 31: 89–108.
Lu Y, Liu P-U, Liu Y-J, Xu F-H, Deng H-W (2004). Quantifying the relationship between gene expressions and trait values in general pedigrees. Genetics 168: 2395–2405.
Lynch M, Walsh B (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates: Sunderland, MA.
Marchini J, Cardon LR, Phillips MS, Donnelly P (2004). The effects of human population structure on large genetic association studies. Nat Genet 36: 512–517.
Meuwissen THE, Hayes BJ, Goddard ME (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Misztal I (1996). Estimation of variance components with large-scale dominance models. J Dairy Sci 80: 965–974.
Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, Edwards S et al. (2004). Genetic inheritance of gene expression in human cell lines. Am J Hum Genet 75: 1094–1105.
Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS et al. (2004). Genetic analysis of genome-wide variation in human gene expression. Nature 430: 743–747.
O'Hara RB (2006). Wholesale analysis of genes, traits and microarrays. Heredity 97: 253.
O'Hara RB, Sillanpää MJ (2009). A Review of Bayesian variable selection methods: What, how and which. Bayesian Analysis 4: 85–118.
Perez-Enciso M (2003). Fine mapping of complex trait genes combining pedigree and linkage disequilibrium information: a Bayesian unified framework. Genetics 163: 1497–1510.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Purcell S, Sham P, Daly MJ (2005). Parental phenotypes in family-based association analysis. Am J Hum Genet 76: 249–259.
Quackenbush J (2001). Computational analysis of microarray data. Nat Rev Genet 2: 418–427.
Reya T, O'Riordan M, Okamura R, Devaney E, Willert K, Nusse R et al. (2000). Wnt signalling regulates B lymphocyte proliferation through a LEF dependent mechanism. Immunity 13: 15–24.
Rubin DB (1976). Inference and missing data. Biometrika 63: 581–592.
Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V et al. (2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297–302.
Seidensticker M, Behrens J (2000). Biochemical interactions in the Wnt pathway. Biochim Biophys Acta 1495: 168–182.
Setakis E, Stirnadel H, Balding DJ (2006). Logistic regression protects against population structure in genetic association studies. Genome Res 16: 290–296.
Sillanpää MJ, Bhattacharjee M (2005). Bayesian association-based fine mapping in small chromosomal segments. Genetics 169: 427–439.
Sillanpää MJ, Noykova N (2008). Hierarchical modelling of clinical and expression quantitative trait loci. Heredity 101: 271–284.
Spiegelhalter DJ, Thomas A, Best NG (1999). WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit: Cambridge, UK.
Thomas DC (1992). Fitting genetic data using Gibbs sampling: an application to nevus counts in 38 Utah kindreds. Cytogenet Cell Genet 59: 228–230.
Thomas DC (2004). Statistical Methods in Genetic Epidemiology. Oxford University Press: New York.
Thompson EA, Skolnick MH (1977). Likelihoods on complex pedigrees for quantitative traits. In: Pollack E, Kempthorne O, Bailey Jr TB (eds). Proceedings of the International Conference on Quantitative Genetics. Iowa State University Press: Ames. pp 815–818.
Thornton T, McPeek MS (2007). Case-control association testing with related individuals: A more powerful quasi-likelihood score test. Am J Hum Genet 81: 321–337.
Uimari P, Sillanpää MJ (2001). Bayesian oligogenic analysis of quantitative and qualitative traits in general pedigrees. Genet Epidemiol 21: 224–242.
Visscher PM, Andrew T, Nyholt DR (2008). Genome-wide association studies of quantitative traits with related individuals: little (power) lost but much to be gained. Eur J Hum Genet 16: 387–390.
Voight BF, Pritchard JK (2005). Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1: e32.
Waldmann P (2009). Easy and flexible Bayesian inference of quantitative genetic parameters. Evolution (in press).
Weir BS, Anderson AD, Hepler AB (2006). Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet 7: 771–780.
West M, Ginsburg GS, Huang AT, Nevins JR (2006). Embracing the complexity of genomic data for personalized medicine. Genome Res 16: 559–566.
Xu S (2003). Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.
Yi N, Xu S (2000). Bayesian mapping of quantitative trait loci under the Identity-by-Descent-based variance component model. Genetics 156: 411–422.
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 38: 203–208.
Zhao H (2000). Family-based association studies. Stat Methods Med Res 9: 563–587.
Zhao K, Aranzana MJ, Kim S, Lister C, Shindo C, Tang C et al. (2007). An Arabidopsis example of association mapping in structured samples. PLoS Genet 3: e4.

Acknowledgements

We are grateful to Bob O'Hara, Andrew Thomas, Petri Koistinen and Crispin Mutshinda Mwanza for discussions and constructive comments on the paper. This work was supported by a research grant (202324) from the Academy of Finland.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, Rolf Nevanlinna Institute, University of Helsinki, Helsinki, Finland
P Pikkuhookana & M J Sillanpää

Corresponding author

Correspondence to M J Sillanpää.

Additional information

Electronic-database information

CEPH genotype database, http://www.cephb.fr/cephdb/

NCBI GenBank, http://www.ncbi.nlm.nih.gov/Genbank/

Gene expression omnibus, http://www.ncbi.nlm.nih.gov/geo/

Appendix A

Estimation

To induce sparseness in the model, we use Jeffreys' prior p(σ²)∝1/σ² for the variances. This is, however, an improper prior (it does not integrate to a finite value), which can lead to an improper posterior (for example, Hobert and Casella, 1996; ter Braak et al., 2005). As we use the WinBUGS software (Gilks et al., 1994; Spiegelhalter et al., 1999) for implementation, some adjustments are needed. WinBUGS uses a nonstandard parameterization of distributions in terms of their precision (that is, precision=τ=1/variance). We apply the transformation φ=log(τ) to the precision parameters. Note that Jeffreys' prior has the same form for the variance and for the precision, so p(σ²)∝1/σ² is equivalent to p(τ)∝1/τ. The prior for the transformed parameter can then be derived as

$$p(\varphi) = p(\tau)\left|\frac{d\tau}{d\varphi}\right| \propto \frac{1}{\tau}\,\tau = 1$$

(see Gelman et al., 2004, p. 65). This flat prior p(φ) is also improper, but restricting it to a finite range makes it proper. We restricted the precision to a finite range whose bounds are set by two constants, b̂ and a, where b̂ is an empirical approximation of the phenotypic variance and a is very close to zero (10⁻¹⁸). For the precision parameter we also tried a Gamma prior with shape parameters chosen so that its shape resembles that of Jeffreys' prior (see Cemgil et al., 2007). We found that such a prior was sensitive to the shape parameters and that it easily produced numerical instability (‘trap messages’) in WinBUGS. Thus, we decided to use a restricted Jeffreys' prior in our examples. The prior for μ is also an improper flat prior. We approximate it with a diffuse normal density with zero mean and a sufficiently large variance.
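The restricted Jeffreys' prior can be illustrated with a short Python sketch. It draws φ=log(τ) uniformly on a finite interval and checks that the resulting τ follows a density proportional to 1/τ on the matching interval. The bounds `lower` and `upper` and both function names are hypothetical choices for illustration; they are not the range used in the paper.

```python
import math
import random


def sample_precision(lower, upper, rng):
    """Draw tau with density proportional to 1/tau on [lower, upper].

    This is done by drawing phi = log(tau) from a flat prior on
    [log(lower), log(upper)] and transforming back.
    """
    phi = rng.uniform(math.log(lower), math.log(upper))
    return math.exp(phi)


def truncated_jeffreys_cdf(tau, lower, upper):
    """Exact CDF of the density proportional to 1/tau on [lower, upper]."""
    return (math.log(tau) - math.log(lower)) / (math.log(upper) - math.log(lower))


rng = random.Random(1)
lower, upper = 1e-3, 1e3  # hypothetical bounds, NOT the paper's range
draws = [sample_precision(lower, upper, rng) for _ in range(20000)]

# Empirical share of draws below tau = 1, to compare with the exact CDF.
frac_below_one = sum(t < 1.0 for t in draws) / len(draws)
```

Because the flat prior lives on the log scale, equal prior mass falls on each order of magnitude of τ, which is why roughly half of the draws fall below τ=1 for these symmetric bounds. In BUGS-style code the same idea would presumably be written as a uniform prior on φ followed by a deterministic exponential transformation.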

The prior for the missing data of founders is constructed as follows: we create two hypothetical extra individuals who are the parents of all the founders. These artificial individuals are heterozygous at all their markers. Thus, we can give the same prior p(m_i,j∣m_m,j,m_f,j) to all individuals, whether founders or non-founders, which allows us to use WinBUGS. In this way, the data remain structured by the pedigrees, and the procedure is equivalent to assuming uniform allele frequencies. We also assume that missing phenotypic data are missing at random (Rubin, 1976). WinBUGS follows this assumption, and thus the posterior distributions of the parameters are influenced only by the observed records of the outcome variable.
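The equivalence between heterozygous artificial parents and uniform allele frequencies can be checked with a small Python sketch. It assumes a biallelic marker with alleles coded 1 and 2 and ordinary Mendelian transmission; the function names are hypothetical and only serve the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotype_probs(mother, father):
    """Mendelian transmission prior for a child's unordered genotype.

    Each parent passes on one of its two alleles with probability 1/2,
    so each of the four allele combinations has probability 1/4.
    """
    probs = Counter()
    for a, b in product(mother, father):
        probs[tuple(sorted((a, b)))] += Fraction(1, 4)
    return dict(probs)


def hardy_weinberg(p):
    """Genotype probabilities when allele 1 has population frequency p."""
    return {(1, 1): p * p, (1, 2): 2 * p * (1 - p), (2, 2): (1 - p) ** 2}


# Artificial parents of all founders: heterozygous at the marker.
het = (1, 2)
founder_prior = offspring_genotype_probs(het, het)
```

Under these assumptions `founder_prior` assigns probabilities 1/4, 1/2 and 1/4 to the three genotypes, which matches `hardy_weinberg(Fraction(1, 2))`, that is, both alleles having frequency 0.5.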

For the infinite polygenic model, we tested both the multivariate normal distribution and the conditional factorization of Lin (1999) and Thomas (1992) as a prior for the ‘breeding values’. In WinBUGS, our experience confirmed the expectation that both of these methods are practically as efficient as block updating of the ‘breeding values’, which maintains well-mixing samplers. Its faster computational speed favors the use of the conditional factorization.
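The following Python sketch illustrates why a conditional factorization can stand in for the multivariate normal prior. It uses the standard textbook form, which is an assumption here and not necessarily the exact parameterization of the paper: founders have variance σ², and a non-founder's breeding value is the mean of its parents' values plus an independent term with variance σ²/2. The sketch assumes a non-inbred pedigree in which each individual has either both parents known or none, with parents listed before offspring. The pedigree and all function names are hypothetical.

```python
def relationship_matrix(father, mother):
    """Additive relationship matrix A by the tabular method.

    father[i] and mother[i] are parent indices, or None for founders.
    """
    n = len(father)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        f, m = father[i], mother[i]
        known = f is not None and m is not None
        A[i][i] = 1.0 + (0.5 * A[f][m] if known else 0.0)
        for j in range(i):
            value = 0.5 * (A[j][f] + A[j][m]) if known else 0.0
            A[i][j] = A[j][i] = value
    return A


def factorized_covariance(father, mother):
    """Covariance (in units of sigma^2) implied by the conditional prior.

    Each breeding value is written as a = T e with independent terms e,
    where Var(e_i) is 1 for founders and 1/2 for non-founders.
    """
    n = len(father)
    T = [[0.0] * n for _ in range(n)]
    d = [1.0] * n
    for i in range(n):
        T[i][i] = 1.0
        f, m = father[i], mother[i]
        if f is not None and m is not None:
            d[i] = 0.5
            for k in range(i):
                T[i][k] = 0.5 * (T[f][k] + T[m][k])
    return [[sum(T[i][k] * d[k] * T[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


# Hypothetical pedigree: 0 and 1 are founders with children 2 and 3;
# 4 is a married-in founder; 5 is the child of 2 and 4.
father = [None, None, 0, 0, None, 2]
mother = [None, None, 1, 1, None, 4]
A = relationship_matrix(father, mother)
C = factorized_covariance(father, mother)
```

For this pedigree the two matrices coincide, so a sequence of univariate conditional normals yields the same joint prior as the multivariate normal with covariance σ²A, without forming or inverting A.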

About this article

Cite this article

Pikkuhookana, P., Sillanpää, M. Correcting for relatedness in Bayesian models for genomic data association analysis. Heredity 103, 223–237 (2009). https://doi.org/10.1038/hdy.2009.56

Received: 08 September 2008
Revised: 09 April 2009
Accepted: 14 April 2009
Published: 20 May 2009
Issue Date: September 2009
DOI: https://doi.org/10.1038/hdy.2009.56
