Introduction

Expression quantitative trait locus (eQTL) studies (Jansen and Nap, 2001, 2004) have been conducted recently in man, mouse and other organisms (Schadt et al., 2003; Morley et al., 2004; Sladek and Hudson, 2006). In such studies both marker and gene-expression data need to be available from each study individual. These studies use conventional QTL mapping to analyze the genetic patterns (eQTLs) underlying gene expression and regulatory networks (Bystrykh et al., 2005; Chessler et al., 2005). A similar strategy has recently been applied to study genetic patterns of protein expression (Foss et al., 2007). An eQTL can be cis- or trans-acting. Cis-acting means that the eQTL maps to the same (or a very close) genome position as the gene whose variation it explains. Trans-acting eQTLs, in contrast, map to genome locations distant from the genes and regulate their expression remotely. As in the case of expression profiling (Aune et al., 2004), it is also possible to study colocalization of eQTLs with genome positions explaining clinical phenotype(s) (Mehrabian et al., 2005; Schadt et al., 2005).

The data collected for eQTL studies are two dimensional. The number of investigated gene transcripts determines the first dimension (typically containing thousands of measurements), whereas the other dimension (typically containing dozens of measurements) refers to the number of genotyped individuals, forming the sample size. However, eQTL studies today still suffer from low reproducibility in microarray measurements (Draghici et al., 2006) and small sample size in terms of individuals (de Koning and Haley, 2005), both of which give some cause for concern. To alleviate this problem, strategies for optimal design (Bueno Filho et al., 2006) and selective phenotyping (Jin et al., 2004; Jannink, 2005; Xu et al., 2005; Fu and Jansen, 2006) have been developed.

Typically in an eQTL mapping study, the data are screened over hundreds (or thousands) of different gene expressions (that is, expression phenotypes). The high dimensionality of the data may lead to serious computational problems. This encourages the use of exploratory or preliminary screening methods, or of database information concerning interesting pathways, to reduce the number of candidates (Thomas, 2005). The more detailed modeling efforts are then targeted only at the resulting subsets of data. An exception to this design is the marker-based approach, where the data at a given marker are first divided into genotypic subgroups and each subgroup is then searched for differentially expressed genes using standard methods (see Kendziorski et al., 2006; for a review see Parmigiani et al., 2003). However, all of these approaches suffer from certain flaws. The repeated application of statistical tests raises serious concerns about the appropriate significance threshold, because of the dependency between the tests and the multiple testing problem. The subset of differentially expressed genes may not necessarily represent a functionally important subset of genes (Yanai et al., 2006). Moreover, when candidates are selected with dimension reduction techniques (Perez-Enciso et al., 2003; Lan et al., 2004), the new variables are difficult to interpret. Finally, interesting pathways may still include hundreds of genes, which necessitates the development of new effective analysis methods. For some new developments, see Chen and Kendziorski (2007), Gelfond et al. (2007), Jia and Xu (2007), Perez-Enciso et al. (2007).

The same kind of data (markers and expression jointly) can be used to explain variation in quantitative traits, which is called clinical quantitative trait locus (cQTL) analysis (Hoti and Sillanpää, 2006; Bhattacharjee and Sillanpää, 2008). It is evident in this setup that the expression measurements of the joint data can provide additional information for explaining and predicting the phenotype (West et al., 2006). Because of financial constraints, the current problem in cQTL analysis is that the sample size for the joint data is too small, even if marker data may be available for a much larger group of individuals. Here we address the problem of predicting expression values for genotyped individuals by integrating the eQTL model, as a missing data model, into the cQTL model. Thus, we present a method to estimate the parameters underlying the frequency distribution of gene expression among a prespecified set of marker gene pairs. Information from previous eQTL and allelic expression studies, as well as known pathways, can be used to form such input data (marker gene pairs; Hoti and Sillanpää, 2006). The suggested method provides posterior estimates and predictions for parameters (for example, missing data) including: (1) the proportions and occupancy probabilities of the eQTLs (the markers regulating the expression) and the cQTLs (the marker or transcript variation explaining the phenotype) as well as their eQTL and cQTL effects, (2) the predicted values of gene expression based on the genotypes at a regulatory locus and (3) the genotype predictions based on the expression values and the genotypes at linked (adjacent) loci. Because of these properties, the suggested method can also be regarded as a multi-trait eQTL analysis that can simultaneously handle hundreds of expression phenotypes.

The model

We first introduce the eQTL model for molecular marker and expression data and then present a large hierarchical model for quantitative phenotypes, the cQTL model, as an extension of our eQTL model.

Input data

Let us assume that offspring data from an inbred line cross experiment (backcross, double haploids or F2) consist of paired marker and gene-expression measurements, marker–gene pairs, collected from each individual separately. Such data were called link data in Hoti and Sillanpää (2006) and are here considered to represent earlier eQTL findings from allele expression studies, genetical genomics experiments or known pathways that are to be validated. Both cis- and trans-acting pairs can be included, but the validation data set cannot be the same set in which the original findings were made (to avoid selection bias). However, trans-acting effects are known to be small and thus more difficult to identify (Ren et al., 2000; Sladek and Hudson, 2006). Moreover, trans-acting eQTLs often occur in clusters (Mueller et al., 2006). Therefore we focus mainly on the larger cis-acting effects, which can be successfully identified from the current small data sets in the presence of missing data. If there are no earlier findings, suitable data for the method would be the oligonucleotide array data of Ronald et al. (2005), where the marker and the gene-expression measurements are produced simultaneously at every position. Alternatively, to study cis-acting regulation, one can form putative link data (presented as pairs) based solely on the genomic proximity between the markers and the genes.

Here we assume that only a single marker (a major gene/major-effect eQTL) is controlling each expression phenotype, which means that a gene-expression distribution can depart from a normal distribution (Gibson and Weir, 2005). Note, however, that under this assumption, the same marker can simultaneously regulate two or more expressions. See ‘Discussion’ for multilocus modeling of expression phenotype.

The cis- and trans-acting effects and the corresponding marker gene pairs are illustrated in Figure 1. The same figure also describes the form and indexing of the input data, which are the chosen cis- and trans-acting pairs. This indexing is required as a first step of the statistical data description and follows the order of the markers on the chromosomes. If there is no information about the expression value related to some particular marker, this marker is included in the input data as a pair with missing information about the corresponding gene expression. When two gene expressions are both regulated by the same marker, we assume that this marker is represented twice, with the distance between the two copies specified to be extremely small (approximately 0).

Figure 1

An example of ordering of known cis- and trans-acting marker gene pairs. In this example 4000 expression measurements and information from six markers are available. On the basis of the previous independent experiment only three cis-acting pairs with clear one-to-one correspondence and two trans-acting effect pairs are expected. There is no prior information about cis-acting effects between the marker M4 and the gene-expression E3, but this putative pair (M4, E3) is included in the statistical analysis because M4 and E3 are very close to each other on the genome. There is no information about any Ej expression, connected with the marker M2. Therefore (M2, •) is a pair with missing information. There are two expressions (E5 and E3700), both corresponding to the marker M6. In this case the pairs {(M6, E5), (M6, E3700)} are formed. The ordering of all pairs in the input data follows the chromosomal ordering of the markers.

Expression QTL model

Let us assume that backcross or double haploid data have been collected from N individuals at Np marker gene pairs. See Appendix for the F2 intercross. For convenience, the two genotypes are denoted AA and Aa for the backcross, and AA and aa for double haploid data. Conditionally on the underlying parameters explained below, the following bimodal mixture distribution is assumed for the expression data, where i indexes individuals (i=1…N) and j indexes marker gene pairs (j=1…Np):

$$E_{i,j} \mid G_{i,j}, I_j, \mu_j, A_j, \alpha_j, \sigma_j^2 \sim N\left(\alpha_j + I_j \mu_j A_j G_{i,j},\; \sigma_j^2\right) \qquad (1)$$

This is equivalent to assuming (simultaneously for each pair) a linear eQTL model Ei,j = αj + IjμjAjGi,j + ɛi,j, where the residuals ɛi,j (that is, the expression values after correcting for the regulatory effects) follow a normal distribution with mean 0 and variance σj2. After normalization and transformation of the data (Quackenbush, 2001) we assume that the overall mean and the expression variance are equal in each pair and in each mixture component: αj=α0 and σj2=σ02 for all j. Moreover, we assume that the data have been centered, that is, αj=α0=0, and that the residuals ɛi,j are uncorrelated, even though centering may induce dependence between residuals in practice (Qu and Xu, 2006; Jia and Xu, 2007). Here, the value of the indicator variable Ij controls the presence (Ij=1) or absence (Ij=0) of a regulatory effect for pair j. The variable μj ≥ 0 is the effect size, and the variable Gi,j is the genotype value of individual i at marker j, which is 1 for genotype AA and −1 for the other genotype. The value of the assignment variable Aj for pair j defines the sign of the regulatory effect: Aj=1 corresponds to a positive and Aj=−1 to a negative effect (Figure 2).

Figure 2

Backcross. The gene-expression distribution for the possible values of the assignment variable Aj, corresponding to the left (Aj=1) and the right (Aj=−1) ordering of the genotypes Aa and AA. Here Aj=1 means that there is a positive regulatory effect, and Aj=−1 means a negative regulatory effect.

Note that, in the case of a backcross, one can alternatively learn the values of the assignment variable Aj by taking one extra microarray from a pooled sample of individuals from one of the parental lines (all with genotype AA) and possibly one from F1 individuals (Aa). From Figure 2, it becomes clear that an individual's genotype (at the regulatory locus) can be predicted by knowing only the values of the assignment variable and the gene expression. This immediately suggests a strategy for producing genotype predictions for known marker gene pairs based on the gene expressions from the offspring and one of the parents.
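
The prediction rule suggested by Figure 2 can be sketched in Python. This is a minimal illustration rather than the method used in the analyses: it assumes centered expression values (α0=0), the genotype coding of the eQTL model (1 for AA, −1 for the other genotype) and a plain sign rule. The function name `predict_genotype` is hypothetical.

```python
import numpy as np

def predict_genotype(expression, assignment):
    """Predict genotype codes (+1 = AA, -1 = other genotype) from centered expression.

    expression : centered expression values for one marker gene pair
    assignment : +1 (positive regulatory effect) or -1 (negative effect)
    """
    expression = np.asarray(expression, dtype=float)
    # Under the model the mean of E is mu * A * G with mu >= 0,
    # so the sign of E * A points to the more likely genotype code.
    return np.where(expression * assignment >= 0, 1, -1)

calls = predict_genotype([2.1, -1.7, 0.4], assignment=-1)
```

A sign rule of this kind is only informative when the two mixture components are well separated; for small μj the components overlap and the prediction approaches a coin flip.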

Hence the expression data Ei,j are described as a mixture of two normal distributions, centered at −μj and +μj according to the genotype (AA or Aa). Thus the gene effect (assuming co-dominance) is represented by the quantity 2Ijμj, where the product βj=Ijμj is called the regulatory effect of the gene.

If we allow the effect size μj to be negative as well, μj ∈ (−∞, +∞), and fix the assignment variable Aj=1, we obtain practically the same model (1) as described above. The information from the assignment variable Aj is then carried by the sign of μj, because only the product μj* = μjAj is involved in the description of the bimodal distribution (1). The separate sign variable Aj is used here simply for illustration and to emphasize that knowledge of it may be useful for genotype prediction.

Hierarchical eQTL model

Denote the data vector as D=(EO, GO), where the observed gene expression (EO) and marker data (GO) may both have some missing entries. The eQTL parameter vector is denoted as θe=(I, μ, A, G, E, σ02), where E and G represent the complete forms of the data. Mutual independence is assumed between and among the variables I, μ and A. According to Bayes' rule,

$$p(\theta_e \mid D) = c\, p(D \mid \theta_e)\, p(\theta_e),$$

where c=1/p(D) is a normalizing constant. Here the posterior distribution p(θe | D) is proportional to the joint distribution of the parameters and data, p(D, θe), which can be expressed as a product of the likelihood p(D | θe) and the prior p(θe). That is, p(θe | D) ∝ p(D, θe) = p(D | θe) p(θe), which, under the given conditional independence assumptions, can be further factorized as:

$$p(\theta_e \mid D) \propto p(E_O, G_O \mid E, G)\, p(E \mid G, I, \mu, A, \sigma_0^2)\, p(G)\, p(I \mid s_e)\, p(\mu)\, p(A)\, p(\sigma_0^2)$$

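Here p(EO, GO | E, G) is an indicator function that equals 1 only when the complete data are consistent with the observations, and 0 otherwise. The complete expression data likelihood is

$$p(E \mid G, I, \mu, A, \sigma_0^2) = \prod_{i=1}^{N} \prod_{j=1}^{N_p} p(E_{i,j} \mid G_{i,j}, I_j, \mu_j, A_j, \sigma_0^2),$$

where the likelihood function can be written for individual i and pair j (equation 1) as:

$$p(E_{i,j} \mid G_{i,j}, I_j, \mu_j, A_j, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{\left(E_{i,j} - I_j \mu_j A_j G_{i,j}\right)^2}{2\sigma_0^2}\right\}$$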

Prior distributions

Next we define the functional forms of the prior distributions p(A), p(G), p(μ), p(I | se) and p(σ02), which reflect our prior beliefs about the parameters. We assume that the assignment variables Aj are mutually independent and have the prior distribution p(A) = ∏j p(Aj), where p(Aj) is a Bernoulli (πj) distribution with parameter πj=½ at each locus. This means that both assignments for Aj are a priori equally likely. The prior density function of the effect size can be expressed as p(μ) = ∏j p(μj), where p(μj) is the density function of the right (positive) tail of a normal distribution (truncated at 0) with mean 0 and variance 100. The prior for the indicator variables I is p(I | se) = ∏j p(Ij | se), where p(Ij | se) is a Bernoulli (se) distribution with known parameter se=P(Ij=1). The parameter se > ½ represents our prior expectation for the proportion of pairs with a regulatory effect, that is, it controls how strongly the value 1 is preferred over 0. Its value is assumed to be greater than ½ because it is probable that a large proportion of the pairs (to be validated) actually have a regulatory effect. The prior p(σ02) is assumed to be an inverse Gamma (1,1) restricted to the range (0.5, 10 000) (cf. Sillanpää and Bhattacharjee, 2005, 2006). For discussion of alternative priors, see Gelman (2006) and Van Dongen (2006). Note that the restriction of the inverse Gamma distribution was imposed for computational reasons, to maintain numerical stability in OpenBUGS.
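
To make the prior specification concrete, the following Python sketch draws one value from each prior described above. It is an illustration only and not the OpenBUGS code used in the analyses; the number of pairs, the value of se, the coding of Aj as ±1 and the rejection sampler for the restricted inverse Gamma are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5          # hypothetical number of marker gene pairs
s_e = 0.9            # hypothetical prior proportion of pairs with an effect

# A_j: both signs equally likely a priori
A = rng.choice([-1, 1], size=n_pairs)
# mu_j: positive half of N(0, 100), i.e. |Z| with standard deviation 10
mu = np.abs(rng.normal(0.0, 10.0, size=n_pairs))
# I_j ~ Bernoulli(s_e)
I = rng.binomial(1, s_e, size=n_pairs)

def sample_truncated_inverse_gamma(rng, low=0.5, high=10_000.0):
    """Draw from inverse Gamma(1, 1) restricted to (low, high) by rejection."""
    while True:
        draw = 1.0 / rng.gamma(shape=1.0, scale=1.0)
        if low < draw < high:
            return draw

sigma2_0 = sample_truncated_inverse_gamma(rng)
```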

Model for missing genotypes

The prior distribution of the marker data is defined in the same way as in Sillanpää and Arjas (1998) and Hoti and Sillanpää (2006). We assume that the genotype measurements are conditionally independent between individuals (given the parents), because all individuals are equally related:

$$p(G) = \prod_{i=1}^{N} \left[ p(G_{i,1}) \prod_{j=2}^{N_p} p(G_{i,j} \mid G_{i,j-1}) \right],$$

where p(Gi,1) is the prior probability (expected frequency) of genotype Gi,1 at marker 1 and p(Gi,j | Gi,j−1) is the between-loci transition probability for individual i. The actual values depend on the genotypes, the map distance (the recombination fraction) and the design considered (for details, see Jiang and Zeng, 1997; Sillanpää and Arjas, 1998). When the genetic map is unknown, unlinked loci and expected genotype frequencies can be assumed in p(G).
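
As an illustration of how the between-loci transition probabilities depend on the map distance, the sketch below builds the two-state transition matrix for a backcross. The text does not state which map function is used, so the Haldane map function here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def haldane_recombination(distance_cm):
    """Recombination fraction from map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - np.exp(-2.0 * distance_cm / 100.0))

def backcross_transition(distance_cm):
    """2x2 matrix P[g_prev, g_next] for backcross genotypes (index 0 = AA, 1 = Aa)."""
    r = haldane_recombination(distance_cm)
    return np.array([[1.0 - r, r],
                     [r, 1.0 - r]])

P = backcross_transition(3.0)  # two markers 3 cM apart
```

For unlinked loci the recombination fraction tends to ½, so every row of the matrix approaches the expected genotype frequencies (½, ½).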

Model selection and interpretation

Bayesian model selection of pairs with a regulatory effect is performed using indicator variables. A similar technique is commonly used for variable selection in QTL and association models (Uimari et al., 1996; Uimari and Hoeschele, 1997; Yi et al., 2003; Sillanpää and Bhattacharjee, 2005, 2006). According to the above prior assumptions, the parameters μj and Ij are a priori independent, that is, p(μ, I | se) = p(μ) p(I | se). This formulation is analogous to that of Kuo and Mallick (1998). Thus, to conclude from the evidence in the data whether pair j has a regulatory effect, it is more robust to monitor the posterior of the product βj = μj × Ij than of each variable (μj or Ij) separately (Sillanpää and Bhattacharjee, 2005). Alternatively one can assume a hierarchical prior p(μ, I | se) = p(μ | I) p(I | se), which gives better identifiability for the individual parameters, but additional computational problems such as the adjustment of tuning parameters (pseudo priors) may appear unless μj and Ij are updated together as a block during the Markov Chain Monte Carlo (MCMC) sampling (Geweke, 1996; Meuwissen et al., 2001).
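
Monitoring the product βj rather than μj or Ij alone is straightforward once MCMC output is available. The sketch below assumes that the sampled values of I and μ have been exported as arrays of shape (iterations, pairs); the function name and the toy chains are hypothetical.

```python
import numpy as np

def summarize_regulatory_effect(I_samples, mu_samples):
    """Posterior summaries of beta_j = I_j * mu_j from MCMC output.

    Both inputs have shape (n_iterations, n_pairs).
    """
    beta = np.asarray(I_samples) * np.asarray(mu_samples)
    return {
        "inclusion_prob": np.mean(I_samples, axis=0),   # posterior mean of I_j
        "beta_median": np.median(beta, axis=0),
        "beta_ci95": np.percentile(beta, [2.5, 97.5], axis=0),
    }

# Toy chains standing in for sampler output (hypothetical values)
I_s = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])
mu_s = np.array([[2.0, 5.0], [2.2, 4.0], [1.8, 0.1], [2.0, 6.0]])
summary = summarize_regulatory_effect(I_s, mu_s)
```

In the second toy pair the sampled μj values are large while Ij is mostly 0, so μj alone would suggest an effect that the product correctly suppresses.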

Modeling dependence between transcripts

So far we have assumed independence between expression levels at different pairs. In reality, however, the values of gene expressions may be correlated for many different reasons. In the following, we hypothesize two kinds of dependencies: (1) spatial dependencies due to the genomic proximity of the gene transcripts, meaning that the expression distributions of two genes whose positions are close together in the genome depend on each other according to the distance between them, and (2) dependencies due to the membership of the genes in the same pathway/gene set. To account for dependence between expressions, we model it at the level of their underlying distributions, in the effect sizes μj. That is, we model the effect size vector μ using a multivariate normal distribution, μ ∼ MVN(0, S), with the mean vector 0=(0, 0, …, 0)T and the covariance matrix S.

The existence of spatial dependence between expression distributions may not be well justified biologically, but it provides a helpful way to share information horizontally between transcripts. In the spatial dependence model, the elements of the covariance matrix are given by the exponential decay function sj,j+1=τ0 exp(−λdj), which depends on the given smoothing parameters (τ0 and λ) and the physical or genetic distance dj between the transcripts j and j+1. The parameter τ0 controls the overall level of smoothing (say τ0=100) and λ defines the degree of spatial dependence (Conti and Witte, 2003; Sillanpää and Bhattacharjee, 2005). Note that this model is especially suitable for densely spaced transcripts (spanning candidate regions of a few cM) because dependence is a decreasing function of genomic distance, with a rate that depends on τ0 and λ (cf., Sillanpää and Bhattacharjee, 2005).
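
A covariance matrix of this exponential-decay form can be assembled as follows. The text defines only the entries between neighbouring transcripts; extending the same decay to non-adjacent transcripts through the summed distance between them is an assumption of this sketch, as are the value λ=0.1 and the function name.

```python
import numpy as np

def spatial_covariance(gaps, tau0=100.0, lam=0.1):
    """Covariance matrix with entries tau0 * exp(-lam * distance).

    gaps : distances d_j between consecutive transcripts j and j+1
    """
    positions = np.concatenate([[0.0], np.cumsum(gaps)])
    distance = np.abs(positions[:, None] - positions[None, :])
    return tau0 * np.exp(-lam * distance)

S = spatial_covariance([3.0, 3.0, 6.0])
```

With this construction the diagonal equals τ0, and the matrix is positive definite as long as the transcript positions are distinct.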

In the pathway membership model, the connectivity matrix S (with all elements being τ0 or 0) is constructed from database knowledge about pathway memberships, from a list of differentially expressed genes, or simply from the pairwise correlations between the gene expressions. In the last case, two genes j and k are said to be connected if the linear pairwise expression correlation between them is greater than or equal to a predefined threshold T, that is, if ρj,k ≥ T then sj,k=τ0 and otherwise sj,k=0. The parameter τ0 is the prior variance/covariance assumed for the effect size among the pathway members.
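
The correlation-based variant of the connectivity matrix can be sketched as below. The function name, the simulated toy expressions and the threshold value are hypothetical; only the thresholding rule itself comes from the text.

```python
import numpy as np

def correlation_connectivity(expression, threshold, tau0=100.0):
    """Connectivity matrix with s_jk = tau0 when corr(E_j, E_k) >= threshold, else 0.

    expression : array of shape (n_individuals, n_genes)
    """
    rho = np.corrcoef(expression, rowvar=False)
    return np.where(rho >= threshold, tau0, 0.0)

rng = np.random.default_rng(1)
base = rng.normal(size=50)
# Genes 0 and 1 are strongly correlated, gene 2 is independent of both
E = np.column_stack([base, base + 0.1 * rng.normal(size=50), rng.normal(size=50)])
S = correlation_connectivity(E, threshold=0.8)
```

A thresholded matrix of this kind is not guaranteed to be positive definite, so in practice it may need a small adjustment (for example, an inflated diagonal) before it can serve as a covariance matrix.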

Clinical QTL model

Hoti and Sillanpää (2006) presented a cQTL model where a phenotype Y=(Yi) was described as a linear combination of the marker genotypes G=(Gi,j), the gene-expression levels E=(Ei,j) and possible genotype × expression interactions. The genotype × expression interactions are allowed to occur only between the members (genotype and expression) of a single marker gene pair. Because co-dominance must be assumed in a backcross, these genotype × expression interactions should be interpreted as allele-specific expression effects. Here we use a model similar to that of Hoti and Sillanpää (2006), except that the phenotype-associated subset of terms is determined by indicator variables (cf. Bhattacharjee and Sillanpää, 2008; see ‘Model selection and interpretation’ above). The generic term cQTL is used for the trait-associated components. We assume the following cQTL model for the quantitative phenotype Yi of individual i:

$$Y_i = a + \sum_{j=1}^{N_p} \left[ I_j^{M} \beta_j^{M} G_{i,j} + I_j^{E} \beta_j^{E} E_{i,j} + I_j^{ME} \beta_j^{ME} G_{i,j} E_{i,j} \right] + e_i$$

Here a is an overall mean and the residuals ei (=observed−estimated trait value) are assumed to be normally distributed with mean 0 and variance σe2. Let us denote the indicator variables for the marker and the transcript at each pair j as IjM and IjE, respectively. Similarly, let us denote the indicator variable for the genotype × expression interaction component (at pair j) as IjME. These indicator variables are collected into a single vector of triplets, Ic = {(IjM, IjE, IjME)}. For pair j, the genotype, expression, and genotype × expression interaction effects on the phenotype are determined by βjM, βjE and βjME. The cQTL effects are jointly denoted in vector form as βc = {(βjM, βjE, βjME)}. For F2 intercross, see Appendix. To model binary phenotypes, see Hoti and Sillanpää (2006) and Bhattacharjee and Sillanpää (2008).
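
The expected phenotype under a cQTL model of this kind can be computed as in the sketch below. It assumes the model is additive in the marker, expression and interaction terms of each pair, each switched on or off by its indicator, as described above; the function and argument names are hypothetical.

```python
import numpy as np

def cqtl_mean(G, E, a, I_M, b_M, I_E, b_E, I_ME, b_ME):
    """Expected phenotype for each individual under the cQTL model.

    G, E : arrays of shape (n_individuals, n_pairs)
    The remaining arguments are the overall mean and per-pair indicators/effects.
    """
    marker_part = G @ (I_M * b_M)
    expression_part = E @ (I_E * b_E)
    interaction_part = (G * E) @ (I_ME * b_ME)
    return a + marker_part + expression_part + interaction_part

G = np.array([[1, -1], [-1, -1]])
E = np.array([[0.5, 2.0], [1.0, -1.0]])
y_hat = cqtl_mean(G, E, a=1.0,
                  I_M=np.array([1, 0]), b_M=np.array([0.3, 9.0]),
                  I_E=np.array([0, 1]), b_E=np.array([9.0, 0.5]),
                  I_ME=np.array([0, 0]), b_ME=np.array([9.0, 9.0]))
```

Effects whose indicator is 0 (the deliberately large values of 9.0 here) drop out of the prediction, which is how the indicators perform variable selection.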

Hierarchical cQTL model

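Let us denote the cQTL model parameters as θc=(Ic, βc, σc2, a, σe2) and recall that the eQTL model parameters were denoted as θe=(I, μ, A, G, E, σ02). Their joint posterior distribution is proportional to the joint distribution (of data and parameters) and can be further factorized as

$$p(\theta_c, \theta_e \mid Y, D) \propto p(Y \mid \theta_c, E, G)\, p(I^c \mid s_c)\, p(\beta^c \mid \sigma_c^2)\, p(\sigma_c^2)\, p(a)\, p(\sigma_e^2)\; p(E_O, G_O \mid E, G)\, p(E \mid G, I, \mu, A, \sigma_0^2)\, p(G)\, p(I \mid s_e)\, p(\mu)\, p(A)\, p(\sigma_0^2).$$

Here the likelihood function for all individuals jointly is

$$p(Y \mid \theta_c, E, G) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\left\{-\frac{\left(Y_i - a - \sum_{j=1}^{N_p}\left[I_j^{M} \beta_j^{M} G_{i,j} + I_j^{E} \beta_j^{E} E_{i,j} + I_j^{ME} \beta_j^{ME} G_{i,j} E_{i,j}\right]\right)^2}{2\sigma_e^2}\right\}.$$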

We make several (conditional) independence assumptions in the construction of the prior distributions. Given sc=p(Ic=1), which is a small prior probability for a candidate to be associated with the trait, we assume the following independence prior for the indicator variables: p(Ic | sc) = ∏j ∏K p(IjK | sc). Here p(IjK | sc) for each component j and K={M,E,ME} is a Bernoulli (sc) distribution with parameter sc. Note that, unlike se in the eQTL model, sc < ½ is taken to be very small. Similarly, we assume the prior p(βc | σc2) = ∏j ∏K p(βjK | σK(j)2) for the genetic effects, where p(βjK | σK(j)2) for each (coefficient at) component j and K={M,E,ME} is a normal distribution with mean 0 and variance σK(j)2. The prior for the genetic variances is assumed to be p(σc2) = ∏j ∏K p(σK(j)2), where p(σK(j)2) for each component j and K={M,E,ME} is an inverse Gamma (1,1) without any boundaries. The prior p(a) is assumed to be a flat normal distribution with mean 0 and variance 10 000. The prior for the residual variance p(σe2) is assumed to be an inverse Gamma (1,1) restricted to the range (0.5, 10 000). It is useful to note that our earlier eQTL model constitutes a missing data model within the large hierarchical cQTL model specified above. Moreover, the likelihood of the eQTL model is the prior in the cQTL model.

Applications

Before presenting analyses using the complete cQTL model with simulated data, we focus on the eQTL model as an independent model structure. Thus, in the following, we first present several trials that use only the eQTL model. With simulated (complete) data, we consider both the performance under a single realization of a data set and the average performance over 50 data replicates. In addition, we consider the performance for two realizations of data sets in the presence of missing values. Then we present the eQTL model analyses of previously analyzed real double haploid data on Saccharomyces cerevisiae (Brem et al., 2002) in the original and transformed scales, as well as an accuracy assessment of the predicted expression values. Finally, we study the performance of the eQTL model with simulated data in the presence of dependence between transcripts. For data simulation and for the MCMC estimation of the eQTL model (1) parameters in these experiments, we have systematically used the OpenBUGS 2.2.0 software (Spiegelhalter et al., 2005; Thomas et al., 2006) unless stated otherwise. In the analyses, the first 10 000 MCMC iterations were discarded from the chain as ‘burn-in’. The posterior estimates are based on the next 100 000 MCMC iterations. To summarize the results for continuous parameters, we have preferred the posterior median to the posterior mean (both available in OpenBUGS) as our point estimate approximating the posterior mode (Hazelton and Gurrin, 2003). For discrete parameters (and products of discrete and continuous parameters), we have used the posterior mean. In the eQTL analyses, the key criteria for assessing performance have been the estimation error and the prediction error.
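
The summarization step (discarding the burn-in and then taking a median or a mean) can be written compactly. The sketch below assumes the chain has been exported from the sampler as an array; the function name is hypothetical, and the toy chain uses a shortened burn-in purely for illustration.

```python
import numpy as np

def posterior_point_estimate(chain, burn_in=10_000, n_keep=100_000, discrete=False):
    """Discard burn-in, then return the mean (discrete) or median (continuous)."""
    kept = np.asarray(chain)[burn_in:burn_in + n_keep]
    return kept.mean(axis=0) if discrete else np.median(kept, axis=0)

# Small toy chain with shortened burn-in, for illustration only
chain = np.array([50.0, 40.0, 1.0, 2.0, 3.0, 10.0])
estimate = posterior_point_estimate(chain, burn_in=2, n_keep=4)
```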

Simulated eQTL data

Simulating markers

The linked marker data G (over 102 marker points) was simulated using the WinQTL Cartographer program (Wang et al., 2006) for 200 backcross individuals resulting from an inbred line cross experiment. The marker data spanned three chromosomes of length 99 cM, so that there were 34 evenly spaced markers on every chromosome. The distance between every two markers was 3 cM.
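
A marker data set with the same dimensions can be generated without WinQTL Cartographer, for instance as in the following sketch. It is a stand-in rather than a reproduction of that program: recombination between adjacent markers is assumed to follow the Haldane map function without interference, and the function name and genotype coding (1 for AA, −1 for Aa) are choices of this sketch.

```python
import numpy as np

def simulate_backcross(n_ind=200, n_chrom=3, n_markers=34, spacing_cm=3.0, seed=0):
    """Simulate backcross genotypes coded 1 (AA) and -1 (Aa), markers in map order."""
    rng = np.random.default_rng(seed)
    r = 0.5 * (1.0 - np.exp(-2.0 * spacing_cm / 100.0))  # Haldane map function
    chromosomes = []
    for _ in range(n_chrom):
        geno = np.empty((n_ind, n_markers), dtype=int)
        geno[:, 0] = rng.choice([1, -1], size=n_ind)       # 1:1 segregation
        for m in range(1, n_markers):
            recombined = rng.random(n_ind) < r
            geno[:, m] = np.where(recombined, -geno[:, m - 1], geno[:, m - 1])
        chromosomes.append(geno)
    return np.hstack(chromosomes)

G = simulate_backcross()
```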

Simulating expressions

The gene-expression value Ei,j, for each individual i and for each marker gene pair j, was simulated in OpenBUGS conditionally on the marker data Gi,j and the parameters according to the eQTL model (1). At each locus j, the selection indicator Ij was generated from a Bernoulli distribution with parameter se=P(Ij=1)=0.9, which means that the majority (90%) of the pairs are likely to have a regulatory effect. The residual variance was set to σ02=1 and the overall mean to α0=0. For every pair j, the effect size μj was generated from a truncated flat normal distribution with mean 2 and variance 100, restricted to the range [0,4]. Thus only moderate positive values of μj were possible. All regulatory effects were set to be positive by having Aj=1 for each pair j.
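
The same generating scheme can be sketched outside OpenBUGS. The Python code below follows the settings given above (se=0.9, σ02=1, α0=0, μj from N(2,100) restricted to [0,4], Aj=1); the rejection sampler, the function name and the placeholder genotype matrix are assumptions of this sketch.

```python
import numpy as np

def simulate_expression(G, s_e=0.9, sigma2_0=1.0, seed=0):
    """Simulate E[i, j] = I_j * mu_j * A_j * G[i, j] + noise, with alpha_0 = 0."""
    rng = np.random.default_rng(seed)
    n_ind, n_pairs = G.shape
    I = rng.binomial(1, s_e, size=n_pairs)
    # mu_j ~ N(2, 100) restricted to [0, 4], drawn by rejection sampling
    mu = np.empty(n_pairs)
    for j in range(n_pairs):
        draw = rng.normal(2.0, 10.0)
        while not 0.0 <= draw <= 4.0:
            draw = rng.normal(2.0, 10.0)
        mu[j] = draw
    A = np.ones(n_pairs, dtype=int)            # all effects positive
    noise = rng.normal(0.0, np.sqrt(sigma2_0), size=(n_ind, n_pairs))
    E = (I * mu * A) * G + noise
    return E, I, mu, A

# Placeholder genotypes (unlinked, coded 1/-1) standing in for the marker data
G = np.random.default_rng(1).choice([1, -1], size=(200, 102))
E, I, mu, A = simulate_expression(G)
```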

Most of the pairs (G,E) in the simulated data follow the typical bimodal gene-expression frequency distribution (Figure 3a). In some cases (Figure 3b) the resulting frequency distribution does not strictly follow the bimodal shape because of some overlapping between the two mixture components.

Figure 3

Three typical cases of the simulated gene-expression (Ej) data: (a) the well-distinguished bimodal frequency distribution when μj is high (μj=3.586); (b) the case when the simulated Ej data components overlap, but it is still possible to distinguish two parts (μj=1.315); (c) two parts of the distribution overlap almost completely when μj is small (μj=0.071). The expression values are shown on the x-axes and the frequencies on the y-axes.

In the cases when μj is near to 0, the tails of the two mixture components of Ej almost completely overlap (Figure 3c).

Analysis of the single simulated eQTL data set

Note that the known values of σ02=1 and α0=0 were assumed in the analysis of the simulated backcross data. A weakly informative prior (a truncated normal distribution) was considered for the effect size, μj ∼ N(2,100) restricted to [0,∞[. From the eQTL model (1) it becomes clear that when there is no regulatory effect (βj=0), the parameter Aj does not have any interpretation. Excluding such positions, the estimated values of Aj match the true simulated values perfectly. In Figure 4a, based on the posterior estimates (the median βjmed and the 95% credible interval) for the regulatory effect βj=Ij × μj of pair j, we present the estimation error qj = βjmed − βjt (that is, the deviation from the true simulated value βjt) and the corresponding credible interval as a summary of the analysis.

Figure 4

Summary of the estimation error of the posterior estimated regulatory effect βj=Ij × μj over 102 marker gene pairs. (a) The analysis of the single simulated data set. (b) The analysis of 50 simulation replicates. (c) The analysis of the single simulated data set with 10% of both the markers and the expressions missing. (d) The analysis of the single simulated data set with 10% of the markers and 50% of the expressions missing. In (a, c, and d) the estimation error (of the posterior median and 2.5 and 97.5% quantiles) around the true simulated value βj is presented at each position j. In (b) the mean Mj(qj), the median mj(qj) and the standard deviation (around the mean) Mj(qj)±SDj(qj) of the estimation error are shown for every marker gene pair j over data sets. The pairs are shown on the x-axis and the estimation error on the y-axis.

Analysis of 50 simulated eQTL data replicates

Next, 50 replicated data sets (simulated replicates) of size N=200 were simulated using the same generating eQTL model as above. Every data set (d=1,…,50) was analyzed using OpenBUGS similarly as above. For data set d, let us denote the posterior median of the estimated regulatory effect at position j as βjmed(d) and its estimation error as qj(d)=βjmed(d)−βjt(d) (viz. a deviation from the true simulated value βjt(d)). Figure 4b presents the mean Mj(qj), the median mj(qj), and the standard deviation SDj(qj) of the estimation error for every marker j over 50 simulation replicates (note the scale of the y-axis). These summaries indicate that the estimation errors are very small supporting the conclusion that our method can provide reliable eQTL effect estimates (when σ02 and α0 are known).
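
The replicate summaries shown in Figure 4b correspond to simple column-wise statistics of the error matrix. The sketch below assumes that the posterior medians and the true values are stored as arrays of shape (replicates, pairs); the function name is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption, since the text does not specify the divisor.

```python
import numpy as np

def error_summary(beta_med, beta_true):
    """Summaries of q_j(d) = beta_med_j(d) - beta_true_j(d) over data sets d.

    Both inputs have shape (n_replicates, n_pairs).
    """
    q = np.asarray(beta_med) - np.asarray(beta_true)
    return q.mean(axis=0), np.median(q, axis=0), q.std(axis=0, ddof=1)

# Toy values for three replicates and two pairs (hypothetical)
beta_true = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
beta_med = np.array([[2.1, 0.0], [0.9, 0.0], [3.3, 0.0]])
mean_q, median_q, sd_q = error_summary(beta_med, beta_true)
```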

Analysis of the single simulated eQTL data set in presence of missing data

To check the sensitivity of the method to randomly occurring missing values we have analyzed here the same simulated backcross data (explained above) in two different cases: (1) when 10% of backcross data were coded (in random locations) as missing among both the marker (Gi,j) and the expression (Ei,j) measurements, and (2) when 10% of the marker data (Gi,j) and 50% of the expression data (Ei,j) in the simulated backcross were coded (in random locations) as missing.
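
Coding a fixed fraction of entries as missing in random locations can be done as in the sketch below, shown here for case (2). The function name, the seeds and the placeholder data matrices are hypothetical; missing values are represented as NaN.

```python
import numpy as np

def mask_at_random(values, fraction, seed=0):
    """Return a float copy with the given fraction of entries set to NaN."""
    rng = np.random.default_rng(seed)
    masked = np.array(values, dtype=float)
    n_missing = int(round(fraction * masked.size))
    idx = rng.choice(masked.size, size=n_missing, replace=False)
    masked.flat[idx] = np.nan
    return masked

# Placeholder matrices with the dimensions of the simulated data
G = np.ones((200, 102))
E = np.zeros((200, 102))
G_case2 = mask_at_random(G, 0.10, seed=1)   # 10% of markers missing
E_case2 = mask_at_random(E, 0.50, seed=2)   # 50% of expressions missing
```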

The two data sets with missing values were analyzed using OpenBUGS in the same way as above. We assume that values of the outcome variable (here the gene expressions) are missing at random. This is the default assumption in OpenBUGS, and it implies that the posterior distributions of the eQTL model parameters are influenced only by the observed part of the outcomes EO (Rubin, 1976).

From Figure 4c, it becomes clear that the obtained estimates are reliable in the presence of 10% missingness. The same conclusion also holds when 50% of the expression measurements and 10% of the marker data are missing (Figure 4d). As expected, the credible interval (for the estimation error) becomes wider with an increasing amount of missingness. Although the error of the posterior median visibly has a larger amplitude in Figure 4d than in Figures 4a and 4c, this was not the case for the relative estimation error (calculated for non-zero coefficients), which stays at a practically constant level across these three cases (results not shown). Thus the results suggest that the performance of the method is generally robust to the presence of missing observations.

Real yeast data

We selected the publicly available data from a double haploid experiment on S. cerevisiae, described in Brem et al. (2002) and used as test data in Hoti and Sillanpää (2006). The data contain the gene expressions (a dye swap pair of arrays) and the marker genotypes measured from 40 individuals (segregant samples) obtained from a cross between a laboratory strain (BY4716) and a wild strain of yeast. The expression data represent background-corrected and normalized log ratios, which have been centered over all the microarray spots, that is, the mean of the expression data is 0. For each gene we simply took the average of the two expression values (a dye swap pair of gene expressions) if both values were available and marked it as missing otherwise; this procedure did not noticeably shift the centering away from 0.
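
The averaging rule for the dye swap pair can be sketched as follows. It assumes that the two arrays have already been put on the same orientation (so that their log ratios are directly comparable) and that missing values are stored as NaN; the function name is hypothetical.

```python
import numpy as np

def average_dye_swap(array_a, array_b):
    """Average the two arrays of a dye swap pair; NaN if either value is missing."""
    a = np.asarray(array_a, dtype=float)
    b = np.asarray(array_b, dtype=float)
    # NaN propagates through the arithmetic, so a missing value in either
    # array yields a missing average.
    return (a + b) / 2.0

avg = average_dye_swap([0.2, np.nan, -1.0], [0.4, 0.5, np.nan])
```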

Analysis of yeast data using eQTL model

Brem et al. (2002) found 570 eQTLs, which we simply took together with the appropriate expressions as our input data (marker gene pairs) here. (Note that we are aware of the potential selection bias that may appear as a consequence of using data twice.) To validate these eQTLs, the four different analyses with the eQTL model were executed for the data: (1) the eQTL analysis of the original data assuming the known values of σ02=1 and α0=0, (2) the eQTL analysis of the transformed data (the expression values of each gene were rescaled to have an unit variance by the common scaling factor) again assuming the known values of σ02=1 and α0=0, (3) the eQTL analysis as 1 above with unknown σ02 and (4) the eQTL analysis of the transformed data (the expression values of each gene were rescaled to have an unit variance by the locus-specific scaling factor) assuming the known values of σ02=1 and α0=0. The uninformative prior (a uniform distribution; unlinked loci) was assumed for the genotype data in p(G). In addition, two different priors (truncated normal distributions) were considered for the effect size in all four analyses: μjN+(0,100) (a neutral prior) and μjN+(2,100) (a nearly neutral prior), where the subscript + indicates the positive [0,∞[ region of support. This resulted in the eight different analyses in total. A data transformation was performed using the formula in analysis 2 and the formula in analysis 4. Here ς̂0≈1.29 is the common empirical standard deviation of all the expressions and ς̂Ej is the empirical standard deviation of the expressions at gene j. By including the missing outcomes in the OpenBUGS analysis, one obtains the posterior predicted values for them based on the posterior distributions of the parameters. In Table 1, we show the posterior (median) estimated proportion of eQTLs, for all eight analyses corresponding to different prior assumptions for se=P(Ij=1). 
In the table, the observed proportion of non-zero posterior (median) estimated regulatory effects is also shown. The Monte Carlo error of the quantity ∑Ij was estimated to be around 0.12 to 0.13, resulting in very accurate estimates of the posterior proportion ∑Ij/570, that is, an error of around 0.12/570. The Monte Carlo error for βj varied and was usually smaller than 0.003, but for the observed proportion this error should be multiplied by the number of non-zero positions. In other words, the posterior proportions are much more accurate than the observed proportions in Table 1.
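The two rescalings used in analyses 2 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the rescaling is taken to be a plain division by the empirical standard deviation, the population form of the standard deviation is used, and the function names and toy data are hypothetical.

```python
import statistics
from typing import List, Optional

Matrix = List[List[Optional[float]]]  # one row per gene, None marks a missing value


def scale_common(expr: Matrix) -> Matrix:
    """Divide every observed value by one standard deviation pooled over all genes."""
    pooled = [v for gene in expr for v in gene if v is not None]
    sd_all = statistics.pstdev(pooled)  # assumption: population standard deviation
    return [[None if v is None else v / sd_all for v in gene] for gene in expr]


def scale_gene_specific(expr: Matrix) -> Matrix:
    """Divide the observed values of each gene by that gene's own standard deviation."""
    scaled: Matrix = []
    for gene in expr:
        observed = [v for v in gene if v is not None]
        sd_gene = statistics.pstdev(observed)
        scaled.append([None if v is None else v / sd_gene for v in gene])
    return scaled


# Toy data: gene 0 varies strongly, gene 1 only weakly.
expr: Matrix = [[1.0, -1.0, 3.0, None], [0.1, -0.1, 0.2, -0.2]]
common = scale_common(expr)
per_gene = scale_gene_specific(expr)
```

In the toy data the gene-specific version forces both genes to the same spread, whereas the common version preserves the difference in variability between them.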

Table 1 Summary of the posterior and the observed proportions of eQTLs for different values of prior proportion se=P(Ij=1) in four analyses of yeast data (each evaluated under two priors for μ): the original data with known σ0²=1 (top left), the transformed data (using a common scaling factor) with known σ0²=1 (top right), the original data with unknown σ0² (bottom left), and the transformed data (using a gene-specific scaling factor) with known σ0²=1 (bottom right)

In Table 1, the analyses with the original data and the transformed data (by a common scaling factor) seem to lead to a large deviation from the prior, so that the posterior/observed proportion of eQTLs is always much smaller than the prior proportion. This means that the data strongly support the conclusion of the absence of regulation for a large number of pairs. In contrast, the analyses with the transformed data (by a gene-specific scaling factor) lead to estimated proportions extremely close to the prior, indicating a lack of information in the transformed data.

The estimated (posterior/observed) proportion of eQTLs is slightly smaller for the model with the prior assumption μj ∼ N+(0,100) than for the model with μj ∼ N+(2,100), the exception being the case of se=0.80 (Table 1). Because the two priors are almost equal, this result can be explained by the fact that the model assuming μj ∼ N+(0,100) prefers to ‘find’ neutral (or extremely small effect) eQTLs rather than requiring them to be any larger. For this to be true, the information content of the data also needs to be extremely low and/or supportive of small-effect eQTLs. Moreover, the observed proportion is always smaller than the posterior proportion, which may indicate that it is easier to have the posterior median of βj=Ij × μj equal to 0 for a few j (downweighting the observed proportion) than to downweight the posterior median of ∑Ij (which influences the posterior proportion). Note also that the estimated (posterior/observed) proportion of eQTLs is somewhat larger for the analysis of the transformed data (by a common scaling factor) than for the analyses of the original data (Table 1). This may indicate a better fit (perhaps even overfit) of the model to the data, because the scaled data perfectly correspond to the model assumption σ0²=1.

Model assessment

To assess the goodness-of-fit of the model, we want to assess how well one can predict values of the gene expressions based on the model and the posteriors of the parameters. These posterior predictions Ei,j* are then compared to the observed data of each individual Ei,j, and one obtains the prediction error PEi,j = Ei,j* − Ei,j as a simple difference between the two. In general, for robust predictions, instead of using the best-case scenario (that is, evaluating the posterior predictive distribution only at a point estimate, for example the posterior mode or median), one should use the whole posterior predictive distribution (that is, include the uncertainty of the whole posterior distribution) and thus utilize Bayesian model averaging (West et al., 2006). However, we are here more interested in checking the best-case scenario, that is, sampling a gene-expression value for each individual from its posterior predictive distribution p(E* | I, μ, A, G, σ0²) conditionally on the genotype data and the posterior estimates of the parameters (I, μ, A, σ0²) using posterior medians of the continuous parameters. To handle missing genotype data, we again assumed a uniform prior p(G). We can then calculate the mean and the variance of the prediction errors under the different models considered. Such summaries are presented in Figure 5. Because the observed gene-expression values contained some missing observations, the mean and the variance were calculated only over individuals with observations.
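The prediction error summaries can be sketched for a single individual and marker gene pair as follows. The function name and the toy numbers are hypothetical, the posterior predictive draws are assumed to be already available (for example, exported from an MCMC run), and the sample variance is an assumed choice.

```python
import statistics
from typing import Optional, Sequence, Tuple


def prediction_error_summary(
    draws: Sequence[float], observed: Optional[float]
) -> Optional[Tuple[float, float]]:
    """Mean and variance of PE = E* - E over posterior predictive draws.

    Returns None when the observed expression is missing, so that such
    individuals are left out of the summaries.
    """
    if observed is None:
        return None
    errors = [draw - observed for draw in draws]
    # Assumption: sample variance (n - 1 denominator).
    return statistics.mean(errors), statistics.variance(errors)


# Toy example: three predictive draws for one individual at one pair.
summary = prediction_error_summary([1.0, 2.0, 3.0], 1.5)
skipped = prediction_error_summary([1.0, 2.0, 3.0], None)
```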

Figure 5

The mean and the variance of the individual-specific prediction error, which is the difference between the predicted and the observed gene expression of individual i at pair j. The quantities are calculated based on 1000 Markov Chain Monte Carlo (MCMC) samples from the posterior predictive distribution evaluated at the median (point estimate) of the posterior distribution of the model parameters from the model with μ ∼ N+(0,100). (a) The mean prediction error from the analysis of the transformed data (by using a gene-specific scaling factor) with the known σ0²=1. (b) The mean prediction error from the analysis of the original data with the known σ0²=1. (c) The prediction error variance from the analysis of the original data with the known σ0²=1. (d) The prediction error variance from the analysis of the original data with unknown σ0². (e) The prediction error variance from the analysis of the transformed data (by using a common scaling factor) with the known σ0²=1.

It becomes evident that the analysis of the transformed data (with a gene-specific scaling factor) resulted in predictions where all the existing information is lost (Figure 5a). The same was true for the prediction error variance (picture not shown). Because the mean prediction errors from the analyses of the original data with both the known and unknown σ0², as well as of the transformed data (by using a common scaling factor) with the known σ0², all resulted in very similar pictures with minor numerical differences in the mean prediction errors, only one of them is shown in Figure 5b. In Figures 5c, d, and e, one can see how the prediction error variance decreases by treating σ0² as an unknown variable or by using a data transformation with a common scaling factor. Even if the results suggest that the data transformation (scaling) is a reasonable way to proceed in this type of analysis, one should proceed with caution because some biological interactions may be lost or destroyed by the scaling (Jansen, 2003; Vormfelde and Brockmöller, 2007).

Simulated eQTL data with dependence between transcripts

Here we study prediction of the missing gene expressions based on the linked marker data and the observed values of the gene expressions on the flanking transcripts. We use the eQTL model with spatial dependencies here. Although this model may appear unrealistic, the results presented here arguably correspond to a more realistic case (for example, the analysis of a pathway membership model).

Simulating expressions

We use the same simulated marker data described above, but instead of utilizing all 102 markers we use only the 34 markers (evenly spaced at every 3 cM) on the first chromosome. Conditionally on the markers, we simulated 100 data sets with correlated expression data, using the mean vector Õ=(2, 2, …, 2)ᵀ and the smoothing parameter values τ0=4 and λ=10. This resulted in an average correlation of 0.7417 between any adjacent pair of effect sizes (μj, μj+1) in the data sets. Similarly, we obtained an average correlation of 0.9722 by changing the smoothing parameter to λ=1. In the following we refer to these two generating models as ‘a weak dependence (λ=10)’ model and ‘a strong dependence (λ=1)’ model. All simulations were carried out in OpenBUGS using the eQTL model with spatial dependencies.

Analysis of simulated eQTL data using the spatial dependence model for transcripts

Two simulated data sets were analyzed: a single realization generated with the ‘weak dependence (λ=10)’ model, and another obtained with the ‘strong dependence (λ=1)’ model. In the analysis stage, all the expression values at every other marker (the even-numbered ones) in both data sets were coded as missing. Both data sets were analyzed using two eQTL models differing in the structure of the prior p(μ): the spatial dependence model (with the values Õ=(0, 0, …, 0)ᵀ, τ0=100, and the known λ={1,10}) and the independence model (1) (with the uninformative prior μj ∼ N(0,100) restricted to the positive range [0,∞[, and the known values of σ0²=1 and α0=0).

In addition, to depict an interval of maximum estimation error, all the expressions were deleted from the first data set, which was then analyzed using the independence model. The estimation using the spatial dependence model requires markedly more computational effort because the model is more complicated. (Thus, for any practical settings, one should seriously consider some computational tool other than OpenBUGS.) All these analyses are summarized in Figure 6, except the independence model analysis of weakly correlated pair data with 100% of the expression measurements missing at every other marker (the picture is almost identical to d).

Figure 6

Summary of the estimation error of the posterior estimated regulatory effect βj over 34 marker gene pairs. (a) The spatial dependence model analysis of weakly correlated pair data with 100% of the expression measurements missing at every other marker. (b) The spatial dependence model analysis of strongly correlated pair data with 100% of the expressions missing at every other marker. (c) The independence model analysis of weakly correlated pair data with all the expression measurements missing at every marker. (d) The independence model analysis of strongly correlated pair data with all the expression measurements missing at every other marker. The estimation error (of the posterior median and 2.5 and 97.5% quantiles) around the true simulated value βjt is presented at each position j. The pairs are shown on the x-axis and the estimation error on the y-axis.

In general, the credible intervals (of the estimation error) in Figure 6 are consistently wide at every other marker in all cases because 100% of the expressions were missing at those positions (that is, all the information comes through the dependence structure). However, it becomes clear that the predictive properties of the spatial dependence model are slightly better than those of the independence model and that its accuracy improves with the increasing amount of dependence in the data (cf. the scales of the y-axis). On the other hand, the predictive accuracy of the independence model seems to stay practically constant while the amount of dependence in the data increases.

Simulated cQTL data

To test how well the cQTL model can handle missing values among genotypes and expressions, we simulated eQTL data on a backcross (N=200, Np=102) as before (see details above) except that μ ∼ N(0, 100) and Aj=1. On the basis of the complete data and the cQTL model (2), nine components (1 marker, 5 expressions, and 3 genotype × expression interactions) were used to generate the phenotypic values (that is, those components should exhibit non-zero cQTL effects with respect to the phenotype in the analysis). We used the fixed value σe²=1/0.065≈15.3846, resulting in a joint heritability of the trait of approximately 0.68. The actual effect sizes and types of the components are shown in Table 2. In the analysis stage, we again deleted (at random locations) 5% of the marker genotypes (Gi,j) and 50% of the gene expressions (Ei,j) from the complete data set. All phenotypes were assumed to be available and the genotypes and the expressions were assumed to be missing at random.
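The deletion step can be mimicked with a short sketch. It assumes that an exact fraction of the cells is removed uniformly at random over the whole matrix (a per-cell random draw would be an equally plausible reading), and the function name, the seed and the placeholder data are hypothetical.

```python
import random
from typing import List, Optional


def delete_at_random(
    matrix: List[List[Optional[float]]], fraction: float, rng: random.Random
) -> List[List[Optional[float]]]:
    """Return a copy of the matrix with the given fraction of cells set to None."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    masked = [list(row) for row in matrix]  # leave the complete data untouched
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n_individuals, n_pairs = 200, 102
genotypes = [[0.0] * n_pairs for _ in range(n_individuals)]    # placeholder values
expressions = [[0.0] * n_pairs for _ in range(n_individuals)]  # placeholder values

masked_genotypes = delete_at_random(genotypes, 0.05, rng)
masked_expressions = delete_at_random(expressions, 0.50, rng)
```

With 200 × 102 = 20 400 cells per matrix, this removes 1020 genotypes and 10 200 expression values.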

Table 2 Posterior estimated (mean) and true cQTL effects under two models (MD1 and MD2) for pairs where the true or estimated effect was nonnegligible (all values less than 0.05 are set to 0 or not shown)

Analysis of simulated cQTL data

The data set was analyzed using two cQTL models, differing in the complexity of the missing data model for the missing values of expressions. The first model (MD1) includes the eQTL model (1) as the missing data model, as presented in this article, and the second one (MD2) uses a much simpler model to handle the missing expressions, Ei,j ∼ N(0, σ0²), where p(E | I, μ, A, G, σ0²) is replaced simply by p(E | σ0²). Note that the latter is close to the missing data model of Hoti and Sillanpää (2006), and it follows by assuming an additive polygenic (infinitely many loci) basis for gene expression. For the first model, the truncated normal prior μ ∼ N(0,100) restricted to the positive range [0,∞[ is assumed, and for both models, α0=0 and the prior p(σ0²) is assumed to be an inverse Gamma (1,1) restricted to the range [0.5, 10 000]. In both cases, 306 candidate terms (102 markers, 102 expressions, and 102 marker × expression interactions) were considered in the model. For the Bernoulli parameter, we set sc=0.0033≈1/306, which roughly corresponds to a single a priori associated component among the candidates. Note that because the phenotypes represent the outcome variable in the large cQTL model, the posterior distributions of the eQTL model (1) parameters are now influenced by all (the missing and observed) expression values. This is contrary to modeling expressions as the outcome variable in the plain eQTL model (see Equation (1) above). To estimate the parameters in MD1 and MD2, OpenBUGS 2.2.0 was run for 110 000 MCMC iterations, with 10 000 burn-in. Surprisingly, we encountered a slight mixing problem in the sense that the locations of the false positive cQTL signals, with small effect sizes, varied somewhat from one analysis to the next. However, there were only a few such locations. It seemed that running longer chains did not influence this property much. The convergence was inspected by comparing the results of different smaller runs.
This was complicated by the fact that both analyses took more than a week to run on a personal computer. Unlike Hoti and Sillanpää (2006), we did not consider standardized effect sizes here. In Table 2, one can see the posterior weighted cQTL effects found for different genetic components under the two models. To define what is a cQTL, we had to choose a rather high noise level (0.05 in the analyses of Table 2). Among the markers, MD1 found weak cQTL evidence for the correct locus (the pair 31) but it also strongly supported a locus that is a false positive (the pair 15). For comparison, all putative cQTL findings of MD2 were false (the pairs 14, 15, 24, and 57), except a weak signal near the noise level (the pair 31). Because of the huge false positive signal at pair 14, we further checked the proportion of missing data at pair 14, which was unexpectedly less than the average (4.5% of the marker data and 47% of the expression data). Among the expression effects, MD1 correctly identified three out of five gene expressions (the pairs 14, 15 and 57) where, however, the cQTL evidence for the pair 57 was negligible. In addition, although negligible, the cQTL at position 100 was also correctly estimated to have a non-zero effect (−0.011). Among the same candidates, MD2 found only some negligible cQTL evidence (0.014) for an incorrect position (the pair 88, which was simulated to have an interaction effect). Finally, MD1 correctly identified two out of three genotype × expression interactions (the pairs 50 and 88) while MD2 found none of them. It is worth emphasizing here that although some marker gene pairs (14, 15 and 57) were interpreted as false positives among marker effects above, the same pairs originally had expression effects. Thus, they actually are false positives only in the sense of their effect types rather than their positions.
To further experiment with MD1, we analyzed the same data with a different missing data pattern (5% of the marker data and 30% of the expressions missing at random). These data were run for 110 000 MCMC rounds with 10 000 burn-in. The results were quite similar to those of MD1 in Table 2, except for a huge marker effect (−26.46) and no expression effect at pair 14, similar to MD2 above (results not shown). In conclusion, it becomes clear that the more complicated model, MD1, outperforms the simpler model, MD2, and leads to better identification of cQTLs. Actually the poor performance of MD2 indicates that the amount of missingness is very large, and it is helpful in such cases to utilize marker and phenotype information jointly to predict the missing values of expressions.
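The choice of the prior inclusion probability sc used in these analyses can be checked with a short calculation. The sketch assumes that the 306 model selection indicators are a priori independent Bernoulli(sc) variables; the variable names are hypothetical.

```python
n_candidates = 306  # 102 markers + 102 expressions + 102 interactions
s_c = 0.0033        # prior inclusion probability of a single candidate term

# A priori expected number of terms included in the model.
expected_included = n_candidates * s_c

# A priori probability that no candidate term is included at all,
# under the independence assumption stated above.
prob_empty_model = (1.0 - s_c) ** n_candidates
```

The expected number of included terms is very close to one, in line with the description of a single a priori associated component.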

Discussion

We have presented here a new method for simultaneously estimating cis- and trans-acting eQTL effects as well as the cQTL effects among a preselected set of marker gene pairs. The method is based on hierarchical modeling so that the eQTL model is a part of the larger cQTL model. Both (eQTL and cQTL) models were tested in separate analyses in the presence of missing data by assuming missing at random (Rubin, 1976). However, there is one important difference between these two analyses that needs more attention. Namely, in the plain eQTL analysis, the posterior distributions of the eQTL model parameters are influenced only by the observed part of the expression data, whereas in the cQTL analysis imputed expression values also influence the posterior. Therefore, in cQTL analysis, one should be careful that the amount of missing data does not become larger than the observed part of the data (cf. Kilpikari and Sillanpää, 2003). Even if not detected here, there may still be some unwanted biases present in the estimates when the amount of missing data exceeds 50%. The presented method is, to our knowledge, the first attempt to model these two tasks simultaneously within a single modeling framework. Therefore, we want to briefly discuss here different future directions that we feel are central in this context.

Multiple trait analysis

In this article we have considered only a single phenotype at a time. However, using several traits simultaneously would be an interesting extension to consider in the future, one that can provide more information on locating the cQTLs and eQTLs as well as on separating pleiotropy from close linkage. In the case of pleiotropy, we can further consider separating marker effects from expression effects at the same location. In addition, an interesting issue here is comorbidity, the association of two or more traits, which would provide insight into direct and indirect genetic effects (Smoller et al., 2000; Robins et al., 2001; Corander and Sillanpää, 2002; Grünewald, 2004; D Remington, North Carolina, personal communication). To study this issue, Li et al. (2006) presented a structural equation model, where hierarchical regression relationships between variables are determined. Verzilli et al. (2005) considered a seemingly unrelated regressions model, where different sets of single nucleotide polymorphisms can be taken as explanatory variables for each trait. Because of their flexibility, these models could provide an adequate framework for future extensions of our setup to multiple traits.

The central technical issue (for MCMC estimation and convergence of the Bayesian approach) arising from the small sample size (number of individuals) is the efficient parametrization of the multiple trait model. A useful parametrization, which restricts the (between-trait) effect ratio to be constant over alleles at each locus and the gene correlations always to be either −1 or 1, has been proposed by Goddard (2001). For an implementation, see Meuwissen and Goddard (2004). Also, an application of Bayesian variable selection for estimating the non-zero elements of the covariance matrix has been suggested (Smith and Kohn, 2002).

Multiple gene models and model choice

The presented method is based on a single-eQTL model, which may limit the application of the method for eQTL mapping purposes, but it provides a new source of information for handling missing gene expressions in the cQTL mapping context. In their real data application, Hoti and Sillanpää (2006) considered the eQTL model for a single expression phenotype, where an associated subset of multiple markers and expression levels (as well as their interaction components) was determined using Bayesian adaptive model selection. Perez-Enciso et al. (2007) proposed the use of support vector machines and a stepwise regression approach for a similar purpose. They also showed how the use of other expression levels as potential covariates in the model can improve the performance of eQTL mapping. Bhattacharjee and Sillanpää (2008) proposed a Bayesian cQTL model with indicator variables (for model selection) to study stratified allele and expression effects on the phenotype using different clinical variables (for example, sex and onset) as stratifying factors. Recently, Jia and Xu (2007) presented a new Bayesian eQTL approach to simultaneously analyze hundreds of expression levels using a multiple marker model and model selection. Further studies are needed in this area, especially from the viewpoint of small sample size (number of individuals). This is because a small sample size has a direct impact on the practical identifiability (multimodality of the posterior distribution) of the parameters and it largely determines what is a reasonable number of putative candidates and effects to be considered in the model (Hoti and Sillanpää, 2006). In such a (sample size) assessment, one should also account for collinearity (correlation) between candidates.

Pathway/dependence information

The pathway information is usually utilized to reduce the number of candidates in the cQTL analysis (Thomas, 2005), but we have presented here another way to incorporate dependence information between transcripts into the eQTL and cQTL analyses. If the eQTL model is omitted from the cQTL model, it is possible to model dependence between transcripts directly in their values of gene expressions. This again represents an alternative formulation of the missing data model for expressions in the cQTL model context. Apart from the missing data model, the dependence between candidates (markers/gene expressions) of the cQTL model can also be modeled indirectly by introducing a dependence prior for the model selection indicators (Sillanpää and Bhattacharjee, 2005).

See also the approach of Malo et al. (2008). To model dependence due to pathway membership in the cQTL analysis, Hung et al. (2004) have suggested the approach where the effects, of the markers (genes) being members of the same pathway, are exchangeable and arise from a common distribution. To consider more about pathway-based approaches, see Wang et al. (2007), and Luan and Li (2008).

The model specification code (written in OpenBUGS) is freely available for research purposes at http://www.rni.helsinki.fi/~mjs/.