Introduction

Population-based association analysis, enhanced by projects such as HapMap (The International HapMap Consortium, 2003, 2005), is one of the most promising gene mapping techniques (Risch and Merikangas, 1996; Lohmueller et al, 2003). The selection of a trait-associated subset of markers among a large number of candidates is a challenging model selection problem in its own right (Broman and Speed, 2002; Sillanpää and Corander, 2002; Kilpikari and Sillanpää, 2003). In addition to measuring the genotype at selected marker points along the chromosome, it is currently possible to measure gene expression (mRNA abundance) levels for a large number of genomic locations simultaneously. The availability of such data has made it possible to base candidate selection on the associations between a phenotype and gene expression levels (Quackenbush, 2001; Wayne and McIntyre, 2002; Kraft et al, 2003; Goeman et al, 2004; Lu et al, 2004). Further, if both phenotype–marker and phenotype–expression associations are analyzed, it is possible to study the overlap between the genomic locations of the resulting significant markers and genes (Aune et al, 2004).

Gene expression and marker data are combined in genetical genomics – to map gene regulators or modifier genes with respect to a marker map. This is carried out by treating the expression levels as phenotypes in a similar manner as quantitative traits in standard quantitative trait locus (QTL) mapping (for alternative procedures, see Jansen and Nap, 2001; Jansen, 2003; Darvasi, 2003; Heighway et al, 2005). Such practice has led to accumulating evidence that genetic variants can control or introduce quantitative variation in gene expression levels, and that this variation follows the Mendelian inheritance similar to the genetic variants themselves (Brem et al, 2002; Yan et al, 2002; Lo et al, 2003; Schadt et al, 2003; Jansen and Nap, 2004; Knight, 2004; Morley et al, 2004; Auger et al, 2005; Bystrykh et al, 2005; Chesler et al, 2005; Hubner et al, 2005).

The treatment of gene expression levels as genetic markers, expression level polymorphisms, in QTL studies has been suggested by Doerge (2002) and Kell (2002). The utilization of gene expression profiling to define genetically homogeneous groups or to improve the definition of a phenotype for gene mapping purposes was proposed by Watts et al (2002), Schadt et al (2003), and Kraft and Horvath (2003). Instead of studying each expression phenotype individually in genetical genomics, Perez-Enciso et al (2003) addressed the problem of combining gene expression phenotypes and QTL mapping by first predicting the value of an underlying liability variable using partial least squares. In their approach, the hypothetical liability (underlying the observed phenotype) is defined as a linear combination of individual gene expression levels. This predicted liability is then used instead of the individual expression phenotypes in QTL mapping. The use of principal component analysis and hierarchical clustering for a similar purpose has been proposed by Lan et al (2004).

The large size of modern gene expression and molecular marker data sets combined with the goal of finding a small subset of trait-associated candidate genes underlines the need for computationally efficient methods especially designed to detect sparse candidates. Recently, sophisticated shrinkage and sparse methods have been proposed to study phenotype–expression association (Shevade and Keerthi, 2003; West, 2003; Lopes and West, 2004) and phenotype–marker association (Meuwissen et al, 2001; Devlin et al, 2003; Kopp et al, 2003; Xu, 2003; Sillanpää and Bhattacharjee, 2005; Zhang and Xu, 2005; Zhang et al, 2005).

In this paper, we present a novel method where the phenotype is modeled as a linear combination of marker genotypes, gene expression levels, and genotype × gene expression interaction terms. The method is implemented for inbred line cross data by using a Bayesian shrinkage approach in a similar manner as in Meuwissen et al (2001) and Xu (2003). Our method is computationally efficient and easy to use, since no tuning parameters are required. Also similar to Xu (2003), our method provides good heritability estimates. We emphasize the use of standardized regression coefficients in interpreting the results and introduce summary statistics that enable us to effectively identify complementary genetic determinants (submodels).

Model

Genetic model

Here, we consider a population with only two segregating genotypes resulting from an inbred line cross experiment, for example a backcross or a double haploid population. The generalization to situations with multiple genotypes, that is, general population samples of outbred human populations, is also discussed. We assume that the sample consists of both marker data and gene expression measurements (mRNA abundance levels), as well as a quantitative or a qualitative trait phenotype, all collected from every study individual. In addition, we assume that a proportion of the marker measurements are a priori associated with some of the gene expression measurements, allowing the identification of genotype-specific gene expression effects, that is, genotype × expression interactions with respect to the phenotype. We allow multiple markers to be associated with a single gene and vice versa. In summary, our genetic data consist of three data subtypes of the following forms:

  1. marker data ($N_M$ markers),

  2. gene expression data ($N_E$ genes),

  3. link data: data allowing the identification of marker–expression pairs whose interaction term is to be considered in our model ($N_{ME}$ marker–gene pairs).

The gene expression data are assumed to have gone through suitable transformation and normalization steps (Quackenbush, 2001; Butte, 2002) so that the sample distribution of the majority of the genes is approximately standard normal. The link data, which can originate from previous genetical genomics studies (cis- and trans-acting variation) or are based on known pathways, enable incorporating cross terms into the model. If no prior external knowledge or hypothesis is available, the link data can be constructed solely based on the genetic distances between the markers and the genes. Thus, one assumes in cis effects between the marker and all genes within a given genetic distance of the marker. Also, oligonucleotide arrays can provide simultaneous genotype and gene expression measurements directly (Ronald et al, 2005).

Given genetic data as described above, we propose modeling a quantitative phenotype $y_i$ of individual $i$ with the following linear model:

$$y_i = \mu + \sum_{j=1}^{N_M} \sum_{k=1}^{2} \alpha_{j,k}\, z_{i,j,k} + \sum_{j=1}^{N_E} \beta_j\, x_{i,j} + \sum_{j=1}^{N_{ME}} \sum_{k=1}^{2} \gamma_{j,k}\, z_{i,s_j,k}\, x_{i,g_j} + \varepsilon_i, \tag{1}$$

where $\mu$ is the population mean and $\varepsilon_i \sim N(0, \sigma_0^2)$ is a normally distributed residual term with mean zero and variance $\sigma_0^2$. For a binary trait, we use model (1) to model an underlying continuous liability, which then gives rise to the binary observation according to the Bayesian probit model (see Appendix A1 for details). The first summation runs over all markers and is designed to capture the genotype-specific main effects (cf Xu, 2003). For individual $i$ at marker $j$, the indicator for genotype $k$ is denoted by $z_{i,j,k}$, and $\alpha_{j,k}$ is the coefficient of the corresponding genotype-specific main effect. The second summation runs over the $N_E$ genes: $x_{i,j}$ denotes the gene expression measurement of gene $j$ for individual $i$, and $\beta_j$ is the coefficient of the corresponding linear gene expression effect. In the last summation, the genotype and gene expression data of the $N_{ME}$ marker–gene pairs are indexed by $j=1, \ldots, N_{ME}$, where pair $j$ consists of gene $g_j$ and marker $s_j$ for some pair $(g_j, s_j)$ given by the link data. Thus, for individual $i$, $x_{i,g_j}$ is the gene expression measurement of gene $g_j$, $z_{i,s_j,k}$ is the indicator of genotype $k$ for the corresponding marker $s_j$, and $\gamma_{j,k}$ is the coefficient of the corresponding genotype × expression interaction effect. The extension of model (1) to include genotype × genotype or expression × expression interaction terms is considered in the Discussion section.
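For the binary case, Appendix A1 is not reproduced in this section, so the following is only a hedged sketch of the usual data-augmentation step for a probit model: given the current linear predictor, the liability is drawn from a unit-variance normal distribution truncated to the side indicated by the observed binary phenotype. The threshold at zero, the function name `sample_liability`, and the clamping constant are assumptions made for illustration.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_liability(y_binary, linear_predictor, rng=random):
    """Draw a liability from N(linear_predictor, 1), truncated to be
    positive when y_binary == 1 and non-positive otherwise
    (assumed threshold: zero)."""
    # Probability that an untruncated liability falls at or below zero.
    p_neg = _STD_NORMAL.cdf(-linear_predictor)
    if y_binary == 1:
        u = rng.uniform(p_neg, 1.0)
    else:
        u = rng.uniform(0.0, p_neg)
    # Numerical guard (an assumption): keep inv_cdf away from 0 and 1.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return linear_predictor + _STD_NORMAL.inv_cdf(u)
```

The inverse-CDF construction keeps the sketch within the standard library; for linear predictors far from zero a dedicated truncated-normal sampler would be numerically safer.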

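In order to ensure that the model parameters are identifiable, we introduce the constraints

$$\alpha_{j,1} = 0, \quad j = 1, \ldots, N_M, \tag{2}$$

$$\gamma_{j,1} = 0, \quad j = 1, \ldots, N_{ME}. \tag{3}$$

Thus, the first genotype at each marker is identified as a baseline, and its effects are absorbed into the terms $\mu$ and $\sum_{j=1}^{N_E} \beta_j x_{i,j}$. The genotype-specific contrasts (differences) are then modeled using the two remaining terms in model (1), which, by taking into account the above constraints, are $\sum_{j=1}^{N_M} \alpha_{j,2}\, z_{i,j,2}$ and $\sum_{j=1}^{N_{ME}} \gamma_{j,2}\, z_{i,s_j,2}\, x_{i,g_j}$.

Next, for individual $i$, the genetic data are gathered into the vector

$$u_i = \left(z_{i,1,2}, \ldots, z_{i,N_M,2},\; x_{i,1}, \ldots, x_{i,N_E},\; z_{i,s_1,2}\, x_{i,g_1}, \ldots, z_{i,s_{N_{ME}},2}\, x_{i,g_{N_{ME}}}\right),$$

and the vector containing the $N = N_M + N_E + N_{ME}$ unknown effects (coefficients) is denoted by

$$\theta = \left(\alpha_{1,2}, \ldots, \alpha_{N_M,2},\; \beta_1, \ldots, \beta_{N_E},\; \gamma_{1,2}, \ldots, \gamma_{N_{ME},2}\right).$$

Now, by taking into account constraints (2) and (3), the linear model (1) can be rewritten as

$$y_i = \mu + \sum_{j=1}^{N} \theta_j\, u_{i,j} + \varepsilon_i. \tag{4}$$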
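As a minimal sketch of how the per-individual data vectors could be assembled under the two-genotype coding, the following Python fragment stacks the marker contrasts, the expression levels, and the marker × expression products into a single design matrix with $N_M + N_E + N_{ME}$ columns. The function name `build_design`, the 0/1 coding of the non-baseline genotype, and the `(marker, gene)` ordering of the link list are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np


def build_design(markers, expression, links):
    """Stack marker, expression, and interaction columns.

    markers:    (n, NM) array; 1 if the individual carries the second
                (non-baseline) genotype at the marker, else 0
    expression: (n, NE) array of normalized expression levels
    links:      list of (marker_index, gene_index) pairs (the link data)
    """
    markers = np.asarray(markers, dtype=float)
    expression = np.asarray(expression, dtype=float)
    marker_idx = [m for m, _ in links]
    gene_idx = [g for _, g in links]
    # One interaction column per linked marker-gene pair.
    interactions = markers[:, marker_idx] * expression[:, gene_idx]
    return np.hstack([markers, expression, interactions])


# Toy example: 3 individuals, 2 markers, 3 genes, 2 linked pairs.
markers = np.array([[0, 1], [1, 1], [0, 0]])
expression = np.array([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-2.0, 1.0, 0.0]])
U = build_design(markers, expression, links=[(0, 0), (1, 2)])
print(U.shape)  # (3, 7)
```

With this layout, the linear predictor for all individuals is simply `mu + U @ theta`.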

Hierarchical model

Prior distributions

Bayesian approaches require the specification of prior distributions for the unknown parameters. We follow the work of Xu (2003) and adopt the following prior densities, where each effect is assigned its own variance term. For $j=1, \ldots, N$, the effect prior $p(\theta_j \mid \sigma_j^2)$ is the density function of a normal distribution with mean zero and effect-specific variance $\sigma_j^2$, and $p(\sigma_j^2) \propto 1/\sigma_j^2$ is the (Jeffreys' scale-invariant) prior density function of the effect-specific hyperparameter $\sigma_j^2$. The prior density function of the mean $\mu$ is $p(\mu) \propto 1$, and the prior density function of the variance $\sigma_0^2 = \mathrm{var}(\varepsilon_i)$, for $i=1, \ldots, n$, is $p(\sigma_0^2) \propto 1/\sigma_0^2$. Now, by the use of appropriate (conditional) independence assumptions, the joint prior density function of the model parameters $\theta$, $\mu$, and $\sigma^2$, where $\sigma^2 = (\sigma_0^2, \ldots, \sigma_N^2)$, is $p(\theta, \mu, \sigma^2) = p(\theta \mid \sigma^2)\, p(\mu)\, p(\sigma^2)$, where $p(\theta \mid \sigma^2) = \prod_{j=1}^{N} p(\theta_j \mid \sigma_j^2)$ and $p(\sigma^2) = \prod_{j=0}^{N} p(\sigma_j^2)$. It has been demonstrated (see Figueiredo, 2003; Xu, 2003) that the above use of effect-specific variance parameters induces sparseness. Thus, our prior information states that most terms in the sums of model (1) are expected to be zero or almost zero, and the degree of sparseness depends adaptively on the data at hand.
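To see why this prior is convenient for sampling, the following short derivation (a standard conjugacy argument added for illustration, not taken from the original text) gives the conditional distribution of an effect-specific variance given its effect, combining the normal effect prior with the prior proportional to $1/\sigma_j^2$:

```latex
\begin{align*}
p(\sigma_j^2 \mid \theta_j)
  &\propto p(\theta_j \mid \sigma_j^2)\, p(\sigma_j^2) \\
  &\propto (\sigma_j^2)^{-1/2}
     \exp\!\left(-\frac{\theta_j^2}{2\sigma_j^2}\right)
     \cdot \frac{1}{\sigma_j^2} \\
  &= (\sigma_j^2)^{-(1/2 + 1)}
     \exp\!\left(-\frac{\theta_j^2}{2\sigma_j^2}\right).
\end{align*}
% This is the kernel of a scaled inverse chi-square distribution with
% one degree of freedom and scale \theta_j^2, so a draw can be obtained as
% \sigma_j^2 = \theta_j^2 / \chi^2_1 .
```

When $\theta_j$ is close to zero, the drawn $\sigma_j^2$ is also small, which in turn pulls $\theta_j$ further toward zero on the next update; effects with strong support in the data escape this loop.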

Model for missing values

In Bayesian inference, missing values are handled in a similar manner as any other unknown parameter (random variable). Thus, prior distributions are assigned to all missing values. The prior density function p(xi,j) of a missing gene expression measurement xi,j is chosen to be that of a standard normal distribution. Recall that the gene expression level measurements are assumed to be approximately normally distributed.
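As a hedged illustration of how a missing expression value can be updated like any other unknown, the sketch below computes the conditional posterior of one missing $x_{i,j}$ under the standard normal prior, assuming the phenotype of individual $i$ depends on it linearly with a total coefficient `coef` (the main expression effect plus any active interaction effects) and residual variance `error_var`. The function name and argument names are hypothetical; the actual update used by the method is described in Appendix A1.

```python
def missing_expression_posterior(partial_residual, coef, error_var):
    """Posterior mean and variance of a missing expression value.

    partial_residual: phenotype minus all model terms that do not
                      involve the missing value
    coef:             total coefficient multiplying the missing value
    error_var:        residual variance (sigma_0^2)
    """
    # Normal prior N(0, 1) combined with a normal likelihood in x.
    post_var = 1.0 / (1.0 + coef ** 2 / error_var)
    post_mean = post_var * coef * partial_residual / error_var
    return post_mean, post_var
```

If the gene has no effect on the phenotype (`coef == 0`), the function returns the prior moments (mean 0, variance 1), so the imputation is then driven by the prior alone.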

Next we define the prior distribution for the marker data, in a backcross or a double haploid situation, by taking into account the probability of recombination, which in turn is determined by the genetic distances between the markers. Following Sillanpää and Arjas (1998), the joint probability of the marker data for individual $i$ is given by

$$P(m_i) = P(m_{i,1}) \prod_{j=2}^{N_M} P(m_{i,j} \mid m_{i,j-1}), \tag{5}$$

where $m_{i,j}$ is the genotype of individual $i$ at marker $j$, $P(m_{i,1})$ is the probability of genotype $m_{i,1}$ at marker 1, and $P(m_{i,j} \mid m_{i,j-1})$ is the probability of genotype $m_{i,j}$ at marker $j$ conditional on genotype $m_{i,j-1}$ at marker $j-1$. The conditional probability $P(m_{i,j} \mid m_{i,j-1})$ is $1 - r_j$ if genotypes $m_{i,j}$ and $m_{i,j-1}$ are the same (no recombination) and $r_j$ otherwise, where $r_j$ is derived from the genetic distance $d_j$ (in Morgans) between markers $j$ and $j-1$ by the Haldane map function $r_j = \tfrac{1}{2}\left(1 - e^{-2 d_j}\right)$. A simpler way to proceed, which also works in more general setups, is to assume independence between markers and take the prior probability of each genotype to be equal. However, in many cases, it is possible to derive more informative prior distributions similar to equation (5); for various crosses from two inbred lines, see Jiang and Zeng (1997).
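The marker prior can be illustrated with a short sketch that converts map distances to recombination fractions with the Haldane map function, $r = \tfrac{1}{2}(1 - e^{-2d})$, and accumulates the log-probability of one individual's genotypes along a single chromosome. The function names and the equal probability of the two genotypes at the first marker are assumptions made for this example.

```python
import math


def haldane_recombination(d_morgans):
    """Recombination fraction for a map distance given in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def log_marker_prior(genotypes, distances):
    """Log prior probability of one individual's marker genotypes.

    genotypes: sequence of genotype codes (e.g. 0/1), one per marker
    distances: distances[j - 1] is the distance in Morgans between
               markers j - 1 and j on the same chromosome
    """
    # Assumption: both genotypes are equally likely at the first marker.
    logp = math.log(0.5)
    for j in range(1, len(genotypes)):
        r = haldane_recombination(distances[j - 1])
        same = genotypes[j] == genotypes[j - 1]
        logp += math.log(1.0 - r if same else r)
    return logp


r = haldane_recombination(0.04)  # 4 cM spacing
print(round(r, 4))  # 0.0384
```

For markers on different chromosomes the chain would be restarted, so that each chromosome contributes its own product of transition probabilities.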

Posterior distributions

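Next we derive the posterior distribution of the model parameters $\theta$, $\mu$, and $\sigma^2$, where $\sigma^2 = (\sigma_0^2, \ldots, \sigma_N^2)$. Denote by $D = \{m, x\}$ the complete genetic data, that is, the combined marker and gene expression data, with no missing values. Further, let $D^{\mathrm{obs}} = \{m^{\mathrm{obs}}, x^{\mathrm{obs}}\}$ denote the observed genetic data with possibly some entries missing. By the use of the Bayes formula, the density function of the joint posterior distribution (see Figure 1) of the model parameters and the genetic data is given by

$$p(\theta, \mu, \sigma^2, D \mid y, D^{\mathrm{obs}}) \propto p(y \mid \theta, \mu, \sigma^2, D)\; p(D^{\mathrm{obs}} \mid D)\; p(D)\; p(\theta, \mu, \sigma^2), \tag{6}$$

where $p(\theta, \mu, \sigma^2)$ is the density function of the joint prior distribution of the parameters $(\theta, \mu, \sigma^2)$, $p(D)$ is the prior density function of the complete genetic data $D$, $p(D^{\mathrm{obs}} \mid D)$ is the probability mass function of the observed genetic data $D^{\mathrm{obs}}$ conditional on the complete genetic data $D$, and $p(y \mid \theta, \mu, \sigma^2, D)$ is the likelihood of the phenotype data $y$. Note that $p(D^{\mathrm{obs}} \mid D)$ is an indicator function and takes value 1 only when $D$ is consistent with $D^{\mathrm{obs}}$ and 0 otherwise. The prior density function of the genetic data $D$ is proportional to $\prod_{i=1}^{n} \left[ P(m_i) \prod_{j=1}^{N_E} p(x_{i,j}) \right]$, and the likelihood function can be factorized into $p(y \mid \theta, \mu, \sigma^2, D) = \prod_{i=1}^{n} p(y_i \mid \theta, \mu, \sigma_0^2, D)$, where each factor is the density of a normal distribution with variance $\sigma_0^2$ and mean given by the linear predictor of model (1).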

Figure 1

Graphical representation of the hierarchical structure of the model. Boxes refer to prespecified values or observed data and ellipses to unknown random variables. This graph (directed acyclic graph, DAG) gives a graphical summary of the conditional independence assumptions and directions of hierarchical (solid arrows) and deterministic (dotted arrows) dependencies.

Markov chain Monte Carlo estimation

We apply Gibbs sampling (Geman and Geman, 1984) and Metropolis–Hastings (Hastings, 1970) algorithms to draw dependent samples from the joint posterior distribution of the unknowns (equation (6)). The specific choices of the prior distributions (conjugate distributions) allow us to generate samples directly from the full conditional posterior distributions of $\theta$, $\mu$, and $\sigma^2$. A detailed description of the adopted algorithm is given in Appendix A1. Possible point estimates for the unknowns include the maximum a posteriori (MAP) estimate, the median, and the expected value of the marginal posterior distributions. We assume that the number of markers and genes in the data set is such that it is computationally reasonable to attempt to estimate the complete posterior distribution; for example, their number may have been reduced by some preliminary feature selection algorithm, or the features may have been chosen based on known pathways (see Thomas, 2005). Zhang and Xu (2005) were able to handle a model where the number of effects was 15 times larger than the sample size. In our opinion, it is preferable to keep this ratio lower, say at most 10, by gathering more samples or by reducing the number of effects considered in the model. If the number of markers or genes is very large (several thousands), then even when the data contain enough information to estimate the effects, the time needed to perform the calculations can be overwhelming. In such a case, one can forgo the estimation of the whole distribution and concentrate on directly estimating some summary statistic of the distribution (eg MAP via an EM algorithm in Figueiredo (2003), and MAP via a penalized ML method in Zhang and Xu (2005)).
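The full algorithm, including the updates for missing data and for the probit liability, is the one given in Appendix A1. As a simplified sketch only, the following single-site Gibbs sampler handles a continuous phenotype with complete data. Its conditional distributions are derived here from the normal likelihood, the zero-mean normal effect priors with effect-specific variances, and the priors proportional to $1/\sigma^2$ on all variances. The function name `gibbs_shrinkage`, the default run lengths, and the small numerical floor on the effect variances are assumptions of this example.

```python
import numpy as np


def gibbs_shrinkage(y, U, n_iter=2000, burn_in=500, seed=0):
    """Return post-burn-in draws of the effects theta, shape (kept, N)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    U = np.asarray(U, dtype=float)
    n, p = U.shape
    theta = np.zeros(p)          # effects start at zero
    sigma2 = np.full(p, 0.5)     # effect-specific variances
    sigma2_0 = 0.5               # residual variance
    mu = y.mean()
    col_ss = (U ** 2).sum(axis=0)
    resid = y - mu - U @ theta
    draws = []
    for t in range(n_iter):
        # mu | rest: normal around the mean of the partial residuals.
        resid += mu
        mu = rng.normal(resid.mean(), np.sqrt(sigma2_0 / n))
        resid -= mu
        # theta_j | rest: normal, shrunk according to sigma2[j].
        for j in range(p):
            resid += U[:, j] * theta[j]
            prec = col_ss[j] + sigma2_0 / sigma2[j]
            mean = U[:, j] @ resid / prec
            theta[j] = rng.normal(mean, np.sqrt(sigma2_0 / prec))
            resid -= U[:, j] * theta[j]
        # sigma2_j | theta_j: scaled inverse chi-square with 1 df.
        # The floor is a numerical guard, not part of the model.
        sigma2 = np.maximum(theta ** 2 / rng.chisquare(1, size=p), 1e-12)
        # sigma2_0 | rest: scaled inverse chi-square with n df.
        sigma2_0 = resid @ resid / rng.chisquare(n)
        if t >= burn_in:
            draws.append(theta.copy())
    return np.array(draws)
```

Keeping a running residual vector avoids recomputing the full linear predictor for each effect, so one sweep costs on the order of $nN$ operations.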

Simulations

We simulated backcross data consisting of molecular markers, gene expression level measurements, and both a binary and a continuous phenotype. First, linked marker data were simulated, then gene expression data were generated conditionally on the marker data. Next, the phenotype data were generated conditionally on both the marker and gene expression data. Finally, missing values were introduced by randomly removing a given proportion of the marker and gene expression measurements. Note that our simulation strategy differs from some others, which use real gene expression data as a starting point (Perez-Enciso et al, 2003; Perez-Enciso, 2004). By conditioning on the expression measurement, Perez-Enciso et al (2003) simulated case–control data (QTLs and binary phenotypes) using partial least squares. Subsequent linked marker data were then generated around each QTL using the exponential decay model for linkage disequilibrium. In Perez-Enciso (2004), the data, a set of linked marker loci and phenotypes, were simulated from an outbred population based on coalescent techniques and gene dropping. Our main reason for not adopting any existing simulation method was the need to have realistic linked marker data for offspring resulting from a backcross design of two inbred lines. Also, our approach (see below), although including simplifying assumptions, allows us to fully validate the performance of the proposed estimation method.

Genetic data

Linked marker data for a population of 200 backcross individuals were simulated using the QTL Cartographer software (Basten et al, 1994, 2003). Altogether 100 markers were simulated, 50 equally spaced markers on each of two chromosomes. The inter-marker distance on both chromosomes was taken to be 4 cM and the length of the genetic material outside the boundary markers on each chromosome was 2 cM.
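The marker data in this study were produced with QTL Cartographer. Purely as an illustration of what such a simulation involves, the sketch below generates backcross genotypes with the same layout (200 individuals, two chromosomes, 50 markers each, 4 cM spacing) by treating recombination between adjacent markers as independent events with the Haldane recombination fraction. The function name and the no-interference assumption are choices made for this example and do not describe the software actually used.

```python
import numpy as np


def simulate_backcross(n_ind=200, n_chrom=2, markers_per_chrom=50,
                       spacing_cm=4.0, seed=0):
    """Return an (n_ind, n_chrom * markers_per_chrom) array of 0/1 genotypes."""
    rng = np.random.default_rng(seed)
    # Haldane map function; spacing converted from cM to Morgans.
    r = 0.5 * (1.0 - np.exp(-2.0 * spacing_cm / 100.0))
    blocks = []
    for _ in range(n_chrom):
        # First marker: both genotypes equally likely.
        first = rng.integers(0, 2, size=(n_ind, 1))
        # Independent recombination events between adjacent markers.
        recomb = rng.random((n_ind, markers_per_chrom - 1)) < r
        # The genotype flips whenever an odd number of events has occurred.
        flips = np.cumsum(recomb, axis=1) % 2
        blocks.append(np.hstack([first, (first + flips) % 2]))
    return np.hstack(blocks)


G = simulate_backcross(seed=1)
print(G.shape)  # (200, 100)
```

Markers on different chromosomes are generated independently, which corresponds to free recombination between chromosomes.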

Next, three genes (gene expression measurements) were assigned to each marker, resulting in 300 genes. We assume that the genetic distances from the three gene loci to the marker are so small that any effect between the marker and the genes is of in cis nature (this is our link data). Then, for each marker, one gene (the middle one) was randomly assigned a value $\varphi_j \in \{0, 1\}$, where $\varphi_j = 1$ indicates the presence of an in cis effect, and for the remaining two genes we assume no in cis effect ($\varphi_j = 0$). For the middle genes, the probability that $\varphi_j = 1$ was taken to be 0.3, which is in line with current estimates (Jansen and Nap, 2004; Morley et al, 2004).

To mimic allele-specific expression, the gene expression value xi,j of gene j for individual i was generated from the mixture distribution

where $N(a, b)$ is the normal distribution with mean $a$ and variance $b$, and the indicator appearing in the mixture is that of genotype $k$ for individual $i$ at the marker linked to gene $j$. Although gene expression values are generated independently of each other, dependence between markers (with in cis effects) will also imply some dependence between expression levels.

Phenotype data

The phenotype data were constructed as a linear combination of genetic components, that is, genotype–expression pairs, of six different subtypes (Figure 2). The possibility of a single genotype–expression pair simultaneously having more than one active phenotype effect was excluded in the simulation. Thus, we first divide the genetic components into three classes depending on the mechanism by which they affect the phenotype: genotype effect (G), gene expression effect (E), and genotype × expression effect (GE). In addition, we distinguish between marker–gene pairs with and without in cis effects, which gives the six subtypes. We add an i to denote the presence of an in cis effect. So, for example, a marker–gene pair of type iE contributes to the phenotype only through the gene expression, although there exists an in cis effect at the genotype–expression level.

Figure 2

Different genetic effects used in the simulation studies. The mechanism of the effect from the genotypes (G) and the gene expressions (E) to the phenotype (Y) is described using directed graphs. The arrow from G to E indicates genotype-specific gene expressions (in cis) and the absence of the arrow means the genotype does not regulate the gene expression. Starting from the left, the graphs indicate the presence of a genotype–phenotype effect, an expression–phenotype effect, and a genotype × expression–phenotype effect, respectively. Also, in each graph, the shortcut notation of the effect type is given using italic fonts.

Based on the above genetic data, a continuous and a binary phenotype were generated. The phenotypes were designed to study which of the above six effect types our model is able to recapture. Also, we wanted to do comparison studies against more traditional models, that is, models based solely on either marker or gene expression data. With this task in mind, a continuous phenotype for individual i was generated as

where the subscripts s1, …, s6 are randomly chosen indexes of marker–gene pairs of types G, iG, E, iE, GE, and iGE, respectively, and ɛi is normally distributed with mean 0 and variance 1. The factors a1, …, a6 are the inverses of the sample standard deviations of the six genetic terms, respectively, and their function is to ensure that each term contributes an equal amount of variation to the phenotype. Further, the binary phenotype was defined based on the continuous phenotype as follows:

For realizations of the above phenotype, the heritability, which is a measure of the proportion of the phenotypic variation explained by the genetic components, is typically about 0.6–0.7.

Analyses

We analyzed a realization of the phenotype where the genetic effect components were about equally distributed along the genome and the heritability was 0.69. The proportion of missing values in both the molecular marker data and the gene expression data was taken to be 0.01. These proportions can vary in practice; in addition to reducing the information content of the data, missing values slow down the actual estimation process. The continuous phenotype and the binary phenotype were analyzed separately using combined marker and gene expression data, marker data alone, and gene expression data alone. Because no trans-acting variation was included in the simulation, we can easily monitor false positives arising from in cis effects in conventional analyses. These six different analyses were implemented using Matlab software on a Pentium IV 2.8 GHz processor. The initial values for the effects $\theta_j$, $j=1, \ldots, N$, were taken to be zero, and the variance terms $\sigma_j^2$, $j=0, \ldots, N$, were initialized to 0.5. The mean value $\mu$ was initialized to the sample mean of the phenotypes, and the missing values were randomly assigned initial values from their empirical distributions. The Markov chain Monte Carlo (MCMC) algorithm was run for 50 000 rounds (≈2 h) in all simulated examples. In each case, the first 10% of the rounds were considered to be ‘burn-in’ rounds and were thus discarded from the analysis. Also, to reduce serial correlation, only every 10th round was stored and used in the final summaries. Convergence was assessed by visually monitoring the chains for several different parameters, mainly the effect coefficients and the error variance. Although the number of effect coefficients can be very large, in practice one needs to consider only a few chains, as after the burn-in rounds the majority of the effect coefficients are constantly zero. In our simulation studies, the convergence was very fast. In fact, very similar results to those reported are achieved when using samples from the first 20% of the MCMC rounds only.
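The burn-in and thinning scheme described above can be made concrete with a few lines of Python. The helper name `retained_rounds` is hypothetical; the numbers are those stated in the text (50 000 rounds, 10% burn-in, every 10th round stored).

```python
def retained_rounds(n_rounds, burn_in_fraction, thin):
    """Indices (0-based) of the MCMC rounds kept after burn-in and thinning."""
    start = round(n_rounds * burn_in_fraction)
    return range(start, n_rounds, thin)


kept = retained_rounds(50_000, 0.10, 10)
print(len(kept))  # 4500
```

Each posterior summary in the simulated examples is therefore based on 4500 stored rounds per analysis.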

Results

Continuous trait

Combined data analysis

In Figure 3, the results of the analysis combining marker and gene expression data are summarized by the posterior probabilities that the standardized effect size exceeds the threshold 0.1. The standardized effect (an analog of the standardized regression coefficient found in statistical textbooks) for the genetic component $j$ is given by $\theta_j^* = \theta_j \sigma_j / \sigma_p$, where $\sigma_j$ is the standard deviation of the genetic component $j$ and $\sigma_p$ is the standard deviation of the phenotype. In the presence of missing values, $\theta_j^*$ is calculated on every MCMC round using $\sigma_j$ of the imputed data. Further, in the binary case, $\sigma_p$ is the standard deviation of the liability phenotype, which changes on every MCMC round. The above posterior probability for the genetic component $j$ can be written as $P_j(0.1) = P(\theta_j^* > 0.1 \mid \text{data})$, where ‘data’ refers to the observed genetic data and the phenotype. In summary, altogether nine genetic components in Figure 3 had elevated posterior probabilities $P_j(0.1)$; this probability was distinctive (mostly 1.0) for all six simulated components and at most 0.4 for the others. The number of nonzero effects is controlled adaptively in the analysis (their number depends on the data at hand).
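As a sketch of how these exceedance probabilities can be estimated from stored MCMC output, the function below rescales each sampled effect by the ratio of the component's standard deviation to the phenotype's standard deviation and reports the fraction of rounds in which the result exceeds the threshold. The function and argument names are hypothetical, and the standardization $\theta_j \sigma_j / \sigma_p$ is assumed to be the usual standardized regression coefficient.

```python
import numpy as np


def exceedance_probabilities(theta, comp_sd, pheno_sd, threshold=0.1):
    """Estimate P_j(threshold) for every genetic component.

    theta:    (rounds, N) sampled effects
    comp_sd:  (N,) or (rounds, N) standard deviations of the components
              (per round when missing data are imputed)
    pheno_sd: scalar, or (rounds,) for a liability phenotype
    """
    theta = np.asarray(theta, dtype=float)
    pheno_sd = np.asarray(pheno_sd, dtype=float).reshape(-1, 1)
    theta_std = theta * np.asarray(comp_sd, dtype=float) / pheno_sd
    return (theta_std > threshold).mean(axis=0)


theta = [[0.5, 0.0], [0.4, 0.3], [0.6, 0.0], [0.5, 0.0]]
P = exceedance_probabilities(theta, comp_sd=[1.0, 2.0], pheno_sd=2.0)
print(P)  # [1.   0.25]
```

In the toy example the first component exceeds the threshold in every round, whereas the second does so in one round out of four.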

Figure 3

A summary of the combined analysis for the continuous phenotype using both marker and gene expression data. The panels contain the estimated posterior probabilities $P_j(0.1) = P(\theta_j^* > 0.1 \mid \text{data})$ for the genotype effects (top), gene expression effects (middle), and genotype × expression effects (bottom). In all three panels, the positions of the simulated effects are indicated by triangles together with the shortcut notation of the subtype. Note that each panel contains the same genetic section and therefore the corresponding genetic components are vertically aligned.

To summarize the number of influential components in the genetic model, Table 1 presents the posterior probabilities for different numbers of components whose standardized effect size simultaneously exceeds the threshold 0.1, that is, $P(I_c(0.1) = n \mid \text{data})$. Note that the distribution only supports numbers in the range [6, 10] and that the support is clearly highest at the correct number six. In Table 2, we have calculated the conditional probability $Q_{j,k}(T)$ for all pairs $\{j, k\}$ formed by genetic components whose probability $P_j(0.1)$ exceeds 0.1. Here $Q_{j,k}(T)$ is the posterior probability that the standardized effect of the genetic component $j$ exceeds the threshold $T$, conditional on the standardized effect of the genetic component $k$ exceeding the same threshold. These Q-summaries allow the detection of alternating components in the genetic model. From Table 2 and by visually studying the MCMC paths of the standardized effects (Figure 4), it can be seen that the expression effect ($j=246$) and the genotype × expression effect ($j=546$), associated with the genetic component of type iE, are complementary, and thus only one of the two at a time contributes a nonzero effect to the genetic model. The threshold value $T=0.1$ was chosen subjectively. Experiments with different threshold values indicated that the above summary statistics are robust to the chosen value. The use of a smaller threshold value $T<0.1$, although increasing the number of components with nonzero $P_j(T)$ values, rarely had an effect on the number of components with higher $P_j(T)$ values, say greater than 0.1. Also, the effect on the distribution of the number of influential components was negligible.
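Both summaries can be computed directly from a boolean array that records, for each stored MCMC round, which components have a standardized effect above the threshold. The sketch below is one possible implementation; the function names are hypothetical, and the Q-summary is estimated as the fraction of rounds in which component `j` exceeds the threshold among the rounds in which component `k` does.

```python
import numpy as np


def component_count_distribution(exceeds):
    """Map n -> estimated P(I_c = n | data) from a (rounds, N) boolean array."""
    counts = exceeds.sum(axis=1)
    values, freq = np.unique(counts, return_counts=True)
    return {int(v): float(f) / len(counts) for v, f in zip(values, freq)}


def q_summary(exceeds, j, k):
    """Estimate Q_{j,k}: P(component j exceeds | component k exceeds)."""
    given = exceeds[:, k]
    if not given.any():
        return float("nan")  # conditioning event never observed
    return float(exceeds[given, j].mean())


# Toy chain: components 0 and 1 alternate, component 2 is mostly present.
exceeds = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=bool)
print(component_count_distribution(exceeds))  # {1: 0.25, 2: 0.75}
print(q_summary(exceeds, 0, 1))               # 0.0
```

A value of `q_summary` near zero in both directions, as for components 0 and 1 here, is the signature of complementary components. The summary is not symmetric in `j` and `k`, so both orderings should be inspected.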

Table 1 The distributions of the number of influential components

Table 2 Pairwise conditional summaries

Figure 4

The MCMC sample paths of the standardized effects of the genetic components 246 and 546. The vertical dashed line indicates the end of the burn-in period.

Sole marker analysis

In the top-left panel of Figure 5, the standardized genotype effects are summarized by the component probabilities Pj(0.1). From the results, we can locate four markers (genetic components), which all satisfy the condition Pj(0.1)>0.1. Further, by studying the Q-summaries, the pairwise probabilities Qj,k(0.1) (table not shown), it can be concluded that the genetic components 83 and 86 are complementary. Thus, the results suggest that we are able to locate three putative markers only, two correct ones having main genotype effects (types iG and G) and a ‘false-positive’ one (j=45) located between the type GE component and the type iE component. These conclusions are further supported by Table 1, where the highest probability 0.60 is assigned to the case n=3.

Figure 5

A summary of the analysis for both the continuous phenotype (left panels) and the binary phenotype (right panels) using only marker data (upper panels) and only gene expression data (lower panels). The notation in the panels is as in Figure 3.

Sole gene expression analysis

In the lower-left panel of Figure 5, the standardized gene expression effects are summarized by the component probabilities $P_j(0.1)$. We were able to locate all the genes (genetic components) that contribute to the continuous phenotype through their expression (types E, iE, GE, and iGE). Also, there is some evidence ($P_j(0.1)>0.1$) for effects near the two simulated components with no expression effects: two components ($j=50$ and $j=53$) close to the component of type G and one ($j=248$) at the component of type iG. This can be explained by the correlation between the gene expression values and the phenotype, which is induced by the high correlation between close markers and their in cis effects (recall that about 30% of the markers have in cis effects). By studying the MCMC sample paths of components 50 and 53 in Figure 6, the strong interaction between their effect sizes becomes apparent; if either of the components obtains a zero effect, the other compensates for its absence by taking on higher values. Thus, we can again conclude that both effects attempt to model the same genetic effect (type G). This conclusion cannot be drawn from the Q-summaries, as $Q_{50,53}(0.1)=0.50$ and $Q_{53,50}(0.1)=0.33$. In summary, we were able to locate six putative genes, with one ($j=248$) having a posterior probability value of only 0.1. This result is also supported by Table 1, where, although some evidence is assigned to the case $n=6$, the highest probabilities are obtained for the cases $n=4$ and $n=5$.

Figure 6

The MCMC paths of the standardized effects of the genetic components 50 and 53 from the analysis of the continuous phenotype using gene expression data only. The vertical dashed line indicates the end of the burn-in period. Note that if one of the paths vanishes, the other compensates for it by taking on a higher value.

Binary trait

Combined data analysis

In Figure 7, the results of the analysis utilizing both marker data and gene expression data are summarized by the probabilities $P_j(0.1)$. Eight genetic components obtained $P_j(0.1)$ values higher than 0.1, although the $P(I_c(0.1) = n \mid \text{data})$ values in Table 1 strongly supported the existence of only four or five influential components. From the Q-summaries (Table 3), it is apparent that components 15 and 16 as well as 84 and 85 are complementary and represent alternative signals from the same underlying component. Thus, the model suggests the existence of six genetic components, of which the two with the smallest $P_j(0.1)$ values are false positives. Further candidates can be unveiled by considering genetic components with $P_j(0.1)$ values smaller than 0.1. However, their existence is not supported by the $P(I_c(0.1) = n \mid \text{data})$ values. Not surprisingly, the performance of the method using a binary phenotype is slightly worse than that of the continuous counterpart, which can be explained by the loss of information in the dichotomization process.

Figure 7

A summary of the combined analysis for the binary phenotype using both marker and gene expression data. The panels and the notation are as in Figure 3.

Table 3 Pairwise conditional summaries

Sole marker analysis

In the analysis of the binary phenotype, several high $P_j(0.1)$ values showed up in the vicinity of the two markers with main genotype effects (types iG and G) (see the upper-right panel of Figure 5). High values were also found around the component of type iGE. The $P(I_c(0.1) = n \mid \text{data})$ values in Table 1 and the Q-summaries (table not shown) support the existence of two or three putative markers.

Sole gene expression analysis

In this analysis of the binary phenotype, six genetic components (genes) were assigned $P_j(0.1)$ values greater than 0.1, with two having values close to 1 (components 53 and 269). Here component 269 is a real simulated expression effect (type iGE) and component 53 is an artifact arising from the nearby simulated effect of type iG. Component 146 (type iE) was assigned a $P_j(0.1)$ value of 0.39 and the remaining three putative components had values of about 0.2 or less (see the lower-right panel of Figure 5). The $P(I_c(0.1) = n \mid \text{data})$ values in Table 1 supported the existence of two to four putative genes.

Heritability and effect estimation

In addition to producing location estimates of putative genetic components, the analysis provides posterior heritability and effect estimates. If the model is able to successfully capture the phenotypic variation due to the genetic components, then accurate heritability estimates are obtained from the formula $h^2 = \frac{1}{r} \sum_{t=1}^{r} \frac{\sigma_p^2(t) - \sigma_0^2(t)}{\sigma_p^2(t)}$, where $\sigma_p^2(t)$ is the phenotypic variance, $\sigma_0^2(t)$ is the error variance at round $t$, and $r$ is the total number of MCMC rounds. In the continuous case, the phenotypic variance is constant over the MCMC rounds, and in the binary case, it is calculated using the liability phenotype. Note also that in the binary case, the error variance is 1 by definition. The posterior heritability estimates based on all six analyses are given in Table 4. We emphasize the large drop in the accuracy of the estimates when only a subset of the original genetic components used to simulate the phenotype is included as data in the analysis. Therefore, some caution should be taken when reporting or drawing conclusions from heritability estimates based on analyses of real data sets.
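A minimal sketch of the heritability calculation is given below, assuming the estimate is the average over stored MCMC rounds of the proportion of phenotypic variance not attributed to the residual term, that is, $(\sigma_p^2(t) - \sigma_0^2(t)) / \sigma_p^2(t)$. The function name and the toy numbers are illustrative only.

```python
import numpy as np


def posterior_heritability(pheno_var, error_var):
    """Average of (phenotypic variance - error variance) / phenotypic variance.

    pheno_var: scalar (continuous trait) or per-round array (liability)
    error_var: per-round array of sampled error variances, or the
               constant 1.0 in the binary (probit) case
    """
    pheno_var = np.asarray(pheno_var, dtype=float)
    error_var = np.asarray(error_var, dtype=float)
    return float(np.mean((pheno_var - error_var) / pheno_var))


# Continuous trait: fixed phenotypic variance, sampled error variances.
h2 = posterior_heritability(3.2, [1.0, 0.9, 1.1])
print(round(h2, 4))  # 0.6875
```

For a binary trait the same function can be called with the per-round liability variances as `pheno_var` and `error_var=1.0`.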

Table 4 Heritability estimates

The accuracy of the effect estimates provided by the model (results not shown) is similar to that of the heritability estimates. For example, the posterior distributions of standardized genetic effects with Pj(0.1) values close to 1, in the combined analysis of the continuous phenotype, are all highly concentrated around 1, which is the correct value of the standardized simulated effects. Figures 4 and 6 are typical examples of the MCMC paths of standardized effects for influential genetic components, whose Pj(0.1) value is smaller than 1. As indicated in the figures, their distributions have multiple modes, which need to be taken into account when conclusions are drawn.

Real data example

Additionally, in order to experiment on real data, we applied our method to the publicly available data set of Brem et al (2002). These Saccharomyces cerevisiae data consist of gene expression measurements and marker genotypes in 40 haplotypes (segregants) from a cross between a laboratory and a wild strain of yeast. As a starting point of our analysis, we chose 102 genes that were reported by Brem et al (2002) to belong to one of the first five gene groups, which contain the known members of different pathways. Based on the 570 expression QTLs (gene–marker pairs) reported by Brem et al (2002), we identified 12 markers to which at least one of our chosen 102 genes was linked. To perform genetical genomics analysis, we treated one of the genes, namely gene YLR089C, as the expression phenotype and explained its variation by using the remaining 101 genes, 12 markers, and 101 marker–gene pairs as covariates in model (1). Note that a noninformative prior (omitting inter-marker distances) was assumed for missing genotype data.

We ran four separate MCMC chains (runs I–IV), each of length 100 000 (≈1 h), using different starting values. The first 50 000 rounds of each chain were discarded as ‘burn-in’, and every 50th sampled value was stored and used in the actual MCMC estimation. The convergence of each chain was checked by visually inspecting MCMC paths for several different parameters. Although all chains seemed to have converged well, we found some variation between the results. Such behavior is evidently owing to the high correlation between genes within the same pathway and the fact that the sample size (number of individuals) was quite small.
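The burn-in and thinning scheme can be illustrated with a short sketch. It assumes that the stored sample is the last round of each block of 50 after burn-in; which round within a block is kept is not stated in the text, and the function name is hypothetical.

```python
def thin_chain(chain, burn_in, step):
    """Drop the first `burn_in` rounds, then keep every `step`-th later round."""
    return chain[burn_in + step - 1::step]

# Stand-in for one chain: the round numbers 1..100000 instead of parameter values.
rounds = list(range(1, 100_001))
kept = thin_chain(rounds, burn_in=50_000, step=50)
```

With these settings each chain contributes 1000 stored samples, so four chains give 4000 samples in total.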

In three (runs I–III) out of the four cases, we located a significant effect at component 44 (gene YMR108W), with probability P44(0.1)=0.69 in run I and with probability P44(0.1)=0.98 in runs II and III. Based on the Q-summaries (results not shown), the small probability value 0.69 at component 44 in run I can be explained by the existence of a complementary effect at component 49, which has probability P49(0.1)=0.33. In the remaining case (run IV), another gene located in the same pathway was identified, namely component 42 (gene YHR208W), with probability P42(0.1)=1.0. Both identified genes are in the same pathway as the expression phenotype. In addition to the main terms, a genotype–expression interaction term arose at gene YMR108W in run I and at gene YER073W in runs II and III. Not surprisingly, gene YER073W is also located in the same pathway as the expression phenotype, and all the interaction terms are linked to the same marker, ‘2435_at_ × 00’.

This example illustrates difficulties that can occur when the method is applied to real-life data sets: the method seems to suffer from convergence problems owing to the small sample size (number of individuals) and the highly correlated candidates (and perhaps a noisy phenotype). Note that Wang et al (2005) also recognized potential mixing problems of their method in the presence of highly correlated genetic components (closely linked markers) and a small sample size. To alleviate these problems in extremely small data sets such as this one, we suggest that future studies run several different MCMC chains and base their estimates on the pooled MCMC samples, in which samples from the different chains are combined.
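Pooling can be sketched as follows. The example assumes, for illustration only, that the reported probability for a component is the share of stored samples in which the absolute standardized effect exceeds 0.1; the function names and all sample values are hypothetical.

```python
def exceedance_probability(samples, threshold=0.1):
    """Share of samples whose absolute value exceeds `threshold`."""
    return sum(abs(s) > threshold for s in samples) / len(samples)

def pool_chains(chains):
    """Concatenate the stored samples of several chains into one list."""
    return [s for chain in chains for s in chain]

# Hypothetical standardized-effect samples for one component in three chains.
chains = [
    [0.0, 0.0, 0.9, 1.1],    # chain that visits the effect half of the time
    [1.0, 0.9, 1.1, 1.0],    # chain that stays at the effect
    [0.0, 0.0, 0.0, 0.05],   # chain that never picks the effect up
]
per_chain = [exceedance_probability(c) for c in chains]
pooled = exceedance_probability(pool_chains(chains))
```

The per-chain values disagree strongly, whereas the pooled value averages over the modes that the individual chains became stuck in. With chains of equal length, the pooled value equals the mean of the per-chain values.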

Discussion

Effect components

We have presented a novel Bayesian sparse method, which allows us to simultaneously utilize measurements from multiple data sources (molecular markers and gene expression microarrays) to model a phenotype. The benefit of the method compared to conventional phenotype–marker or phenotype–expression association methods is the possibility to consider genotype × expression interactions by introducing marker–gene pairs as link data. Also, available environmental (nongenetic) covariates of discrete or continuous type can be included in the model by coding them in a similar manner as ‘markers’ with multiple variants or ‘gene expression’ measurements, respectively. Protein-expression measurements (Sellers and Yates, 2003) can be included as ‘gene expressions’ in a similar fashion. In the presence of readily available database information (eg GO, KEGG; see Ashburner et al, 2000) about gene × gene and expression × expression interactions involved in known pathways (Thomas, 2005), the model can be easily expanded to include known pairs of epistatic markers (Conti et al, 2003) and known expression × expression interaction determinants. Genetic × nongenetic interactions can also be considered. If the sample size is large, one can even search through all possible pairwise combinations by incorporating epistatic effects into the oversaturated model following Zhang and Xu (2005).

Resolution

The mapping resolution can be increased by introducing new marker points, the so-called pseudo-markers, along a discrete grid (Sen and Churchill, 2001) or by adding a random QTL position into each marker interval (Wang et al, 2005). The genotype information at these new markers is handled as missing values, and their patterns are predicted based on the genetic distances and the observed genetic configuration of the surrounding markers. In addition to marker data, one could also use cis- and trans-acting gene expression information to impute/predict pseudo-marker genotypes. Note that one can still include the original gene expression measurements as putative candidates in model (1). However, these treatments as such are applicable only to controlled crosses. With small data sets, introducing pseudo-markers on small intervals may increase the correlation (collinearity) between candidates, which in turn may lead to mixing problems of the sampler.
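As an illustration of how a pseudo-marker genotype can be predicted from its flanking markers, the following sketch computes the conditional genotype probability in a backcross. It assumes Haldane's map function (no interference), 0/1 genotype coding, and fully observed flanking markers; these choices and the function names are assumptions made for the example, not details given above.

```python
import math

def haldane_recombination(distance_morgans):
    """Recombination fraction under Haldane's map function (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))

def pseudo_marker_prob(left, right, d_left, d_right):
    """P(pseudo-marker genotype = 1 | flanking genotypes) in a backcross.

    Genotypes are coded 0/1; d_left and d_right are the distances (Morgans)
    from the pseudo-marker to the left and right flanking markers.
    """
    r1 = haldane_recombination(d_left)
    r2 = haldane_recombination(d_right)

    def joint(q):
        # P(Q = q | left) * P(right | Q = q); the two intervals recombine
        # independently under the no-interference assumption.
        p_q = (1.0 - r1) if q == left else r1
        p_right = (1.0 - r2) if q == right else r2
        return p_q * p_right

    return joint(1) / (joint(0) + joint(1))

p_same = pseudo_marker_prob(1, 1, 0.05, 0.05)    # both flanks carry genotype 1
p_mixed = pseudo_marker_prob(1, 0, 0.05, 0.05)   # flanks disagree, midpoint
```

When both flanking markers agree and the interval is short, the pseudo-marker is almost a copy of its neighbours, which is the source of the collinearity mentioned above.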

Alternative model considered

An intuitive picture of the mechanism leading to sparseness in the Bayesian model is provided by the following reasoning (see Figueiredo, 2003). Replace Jeffreys' prior $p(\sigma_j^2) \propto 1/\sigma_j^2$ by the double exponential prior $p(\sigma_j^2 \mid \gamma) = \frac{\gamma}{2}\exp(-\gamma\sigma_j^2/2)$. It is then possible to analytically integrate out $\sigma_j^2$ from the prior distribution of the effect $\theta_j$, resulting in the well-known Laplacian distribution $p(\theta_j \mid \gamma) = \frac{\sqrt{\gamma}}{2}\exp(-\sqrt{\gamma}\,|\theta_j|)$. Thus, replacing Jeffreys' prior with the double exponential prior is identical to assuming a priori that the effects $\theta_j$ have a Laplacian (sparse) distribution with a common parameter γ. A drawback of such an approach is that the Laplacian prior does not allow the application of Gibbs sampling steps in updating the effects, and tuning the extra parameter γ can be difficult and time consuming. We performed various experiments using the Laplacian prior with different Metropolis–Hastings steps, but we finally abandoned the approach owing to serious mixing and convergence problems. However, if these problems are overcome in the future, the Laplacian prior would give the user a way to control the number of influential components (nonzero effects), that is, the degree of sparseness in the model; the lack of such a parameter can be regarded as a drawback of the present adaptive approach. Note that there are various other approaches that allow the user to control the amount of sparseness in the model (eg Tibshirani, 1996; Meuwissen et al, 2001; Sillanpää and Bhattacharjee, 2005; Zhang et al, 2005).
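The integration step can be checked numerically. The sketch below assumes the parameterization of Figueiredo (2003): a zero-mean normal prior for the effect given its variance, the hyperprior (γ/2)exp(−γσ²/2) on the variance, and the resulting Laplacian density (√γ/2)exp(−√γ|θ|). The function names, the chosen values of θ and γ, and the simple midpoint rule are illustrative assumptions.

```python
import math

def normal_pdf(theta, variance):
    """Density of N(0, variance) evaluated at theta."""
    return math.exp(-theta * theta / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def variance_prior_pdf(variance, gamma):
    """Assumed hyperprior on the variance: (gamma / 2) * exp(-gamma * variance / 2)."""
    return 0.5 * gamma * math.exp(-0.5 * gamma * variance)

def laplace_pdf(theta, gamma):
    """Closed-form marginal: (sqrt(gamma) / 2) * exp(-sqrt(gamma) * |theta|)."""
    root = math.sqrt(gamma)
    return 0.5 * root * math.exp(-root * abs(theta))

def marginal_by_quadrature(theta, gamma, upper=60.0, n=100_000):
    """Midpoint-rule integral of N(theta | 0, s) * p(s | gamma) over s in (0, upper)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h
        total += normal_pdf(theta, s) * variance_prior_pdf(s, gamma)
    return total * h

numeric = marginal_by_quadrature(theta=1.0, gamma=2.0)
closed = laplace_pdf(theta=1.0, gamma=2.0)
```

For a nonzero θ the integrand vanishes at both ends of the range, so even this crude quadrature should agree closely with the closed form.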

Human data sets

The method was presented for controlled line crossing experiments of inbred animals or plants with two genotype combinations. However, the method can be easily generalized to human data sets (population-based samples) using single nucleotide polymorphism (SNP) markers, where there are only three genotype combinations (two coefficients, each with its own variance component, are then required in the model). The difference between using this method for analyzing an inbred line cross F2 and human population-based samples of SNPs is that in an F2 one can use information from other markers in the prior for missing marker imputation. For microsatellite markers having multiple genotypes at a single locus, one can alternatively assume exchangeability and fit a common variance for all the effects at a single location (see Meuwissen et al, 2001). An additional complexity that is expected to occur in simultaneous marker and expression analysis in human studies is population stratification, that is, confounding owing to unobserved population substructure. This problem has been considered in marker association studies (Lander and Schork, 1994; Sillanpää and Bhattacharjee, 2005) and in expression association studies (Gibson, 2003; Kraft and Horvath, 2003; Kraft et al, 2003; Lu et al, 2004).
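One way to obtain two coefficients per SNP is to code each genotype as two covariates. The additive/heterozygote coding below is only one possible choice and is an assumption of this sketch; the text does not specify the coding, and the function name is hypothetical.

```python
def snp_covariates(genotype):
    """Map a SNP genotype (0, 1 or 2 copies of one allele) to two covariates.

    Returns (additive, dominance), so that each SNP enters the regression
    with two coefficients.
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    additive = genotype - 1                  # -1, 0, 1
    dominance = 1 if genotype == 1 else 0    # heterozygote indicator
    return additive, dominance

coded = [snp_covariates(g) for g in (0, 1, 2)]
```

Each of the two columns would then receive its own effect and its own variance component.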

Comparing results between experiments

The use of standardized effects enables direct comparisons of the effect sizes of genetic components with different variances and of experiments with different phenotypic variances. An alternative approach would be to normalize the genetic components and the phenotype in advance. However, in the binary case and when missing data are present, the normalization needs to be performed on every MCMC round. In studies utilizing data from a single source, this problem is seldom present, as the components naturally have somewhat similar variances. For example, the component variance of marker data from a backcross study is about 0.25, and gene expression measurements are of equal scale after preprocessing. Further, the use of standardized effects makes the comparison of separate studies easier and more informative. This is especially important now that combining data in meta-analyses has become popular (see Sillanpää and Auranen, 2004). In our study, the use of Jeffreys' scale-invariant prior allows us to perform the actual analyses without normalizing the genetic components in advance. However, if some other prior is used, the scale and thus the normalization of the data can prove to be extremely important.
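The backcross variance of about 0.25 and the effect of standardization can be reproduced with a small sketch. It assumes 0/1 genotype coding with both genotypes equally frequent, and it assumes that a standardized effect is the raw effect multiplied by the ratio of the component's standard deviation to the phenotypic standard deviation; the function name and all numbers are hypothetical.

```python
from statistics import pstdev, pvariance

# Balanced backcross marker coded 0/1: variance p * (1 - p) = 0.5 * 0.5.
genotypes = [0.0, 1.0] * 50
marker_var = pvariance(genotypes)

def standardized_effect(effect, component_values, phenotype_values):
    """Scale a raw effect by sd(component) / sd(phenotype) (assumed definition)."""
    return effect * pstdev(component_values) / pstdev(phenotype_values)

# Hypothetical phenotype with standard deviation 2 and a raw marker effect of 2.
phenotype = [0.0, 4.0] * 50
std_effect = standardized_effect(2.0, genotypes, phenotype)
```

Under this definition the standardized effect is unchanged when the component or the phenotype is rescaled and the raw effect is adjusted accordingly, which is what makes effects comparable across components and experiments.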

A Matlab implementation of the method is available from the authors upon request.