Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits

Hoti, F; Sillanpää, M J

doi:10.1038/sj.hdy.6800817

Original Article
Published: 03 May 2006

F Hoti¹ &
M J Sillanpää¹

Heredity volume 97, pages 4–18 (2006)

Abstract

A novel Bayesian gene mapping method, which can simultaneously utilize both molecular marker and gene expression data, is introduced. The approach enables a quantitative or qualitative phenotype to be expressed as a linear combination of the marker genotypes, gene expression levels, and possible genotype × gene expression interactions. The interaction data, given as marker–gene pairs, contains possible in cis and in trans effects obtained from earlier allelic expression studies, genetical genomics studies, biological hypotheses, or known pathways. The method is presented for an inbred line cross design and can be easily generalized to handle other types of populations and designs. The model selection is based on the use of effect-specific variance components combined with Jeffreys' noninformative prior – the method operates by adaptively shrinking marker, expression, and interaction effects toward zero so that non-negligible effects are expected to occur only at very few positions. The estimation of the model parameters and the handling of missing genotype or expression data is performed via Markov chain Monte Carlo sampling. The potential of the method including heritability estimation is presented using simulated examples and novel summary statistics. The method is also applied to a real yeast data set with known pathways.

Introduction

Population-based association analysis, enhanced with projects like HAPMAP (The International HapMap Consortium, 2003, 2005), is one of the most promising gene mapping techniques (Risch and Merikangas, 1996; Lohmueller et al, 2003). The selection of a trait-associated subset of markers among a large number of candidates is a challenging model selection problem in its own right (Broman and Speed, 2002; Sillanpää and Corander, 2002; Kilpikari and Sillanpää, 2003). In addition to measuring the genotype at selected marker points along the chromosome, it is currently possible to measure gene expression (mRNA abundance) levels for a large number of genomic locations simultaneously. The availability of such data has made it possible to base candidate selection on the associations between a phenotype and gene expression levels (Quackenbush, 2001; Wayne and McIntyre, 2002; Kraft et al, 2003; Goeman et al, 2004; Lu et al, 2004). Further, if both phenotype–marker and phenotype–expression associations are analyzed, it is possible to study the overlap between the genomic locations of the resulting significant markers and genes (Aune et al, 2004).

Gene expression and marker data are combined in genetical genomics to map gene regulators or modifier genes with respect to a marker map. This is carried out by treating the expression levels as phenotypes in a similar manner as quantitative traits in standard quantitative trait locus (QTL) mapping (for alternative procedures, see Jansen and Nap, 2001; Jansen, 2003; Darvasi, 2003; Heighway et al, 2005). Such practice has led to accumulating evidence that genetic variants can control or introduce quantitative variation in gene expression levels, and that this variation follows Mendelian inheritance similar to the genetic variants themselves (Brem et al, 2002; Yan et al, 2002; Lo et al, 2003; Schadt et al, 2003; Jansen and Nap, 2004; Knight, 2004; Morley et al, 2004; Auger et al, 2005; Bystrykh et al, 2005; Chesler et al, 2005; Hubner et al, 2005).

The treatment of gene expression levels as genetic markers (expression level polymorphisms) in QTL studies has been suggested by Doerge (2002) and Kell (2002). The utilization of gene expression profiling to define genetically homogeneous groups or to improve the definition of a phenotype for gene mapping purposes was proposed by Watts et al (2002), Schadt et al (2003), and Kraft and Horvath (2003). Instead of studying each expression phenotype individually in genetical genomics, Perez-Enciso et al (2003) addressed the problem of combining gene expression phenotypes and QTL mapping by first predicting the value of an underlying liability variable using partial least squares. In their approach, the hypothetical liability (underlying the observed phenotype) is defined as a linear combination of individual gene expression levels. This predicted liability is then used instead of the individual expression phenotypes in QTL mapping. The use of principal component analysis and hierarchical clustering for a similar purpose has been proposed by Lan et al (2004).

The large size of modern gene expression and molecular marker data sets combined with the goal of finding a small subset of trait-associated candidate genes underlines the need for computationally efficient methods especially designed to detect sparse candidates. Recently, sophisticated shrinkage and sparse methods have been proposed to study phenotype–expression association (Shevade and Keerthi, 2003; West, 2003; Lopes and West, 2004) and phenotype–marker association (Meuwissen et al, 2001; Devlin et al, 2003; Kopp et al, 2003; Xu, 2003; Sillanpää and Bhattacharjee, 2005; Zhang and Xu, 2005; Zhang et al, 2005).

In this paper, we present a novel method where the phenotype is modeled as a linear combination of marker genotypes, gene expression levels, and genotype × gene expression interaction terms. The method is implemented for inbred line cross data by using a Bayesian shrinkage approach in a similar manner as in Meuwissen et al (2001) and Xu (2003). Our method is computationally efficient and easy to use, since no tuning parameters are required. Also similar to Xu (2003), our method provides good heritability estimates. We emphasize the use of standardized regression coefficients in interpreting the results and introduce summary statistics that enable us to effectively identify complementary genetic determinants (submodels).

Model

Genetic model

Here, we consider a population with only two segregating genotypes resulting from an inbred line cross experiment, for example a backcross or a double haploid population. The generalization to situations with multiple genotypes, that is, general population samples of outbred human populations, is also discussed. We assume that the sample consists of both marker data and gene expression measurements (mRNA abundance levels), as well as a quantitative or a qualitative trait phenotype, that have been collected from all study individuals. In addition, we assume that a proportion of the marker measurements are a priori associated with some of the gene expression measurements, allowing the identification of genotype-specific gene expression effects, that is, genotype × expression interactions with respect to the phenotype. We allow multiple markers to be associated with a single gene and vice versa. In summary, our genetic data consist of three data subtypes of the following forms:

1. marker data (N_M markers),
2. gene expression data (N_E genes),
3. link data: data allowing the identification of marker–expression pairs whose interaction term is to be considered in our model (N_ME marker–gene pairs).

The gene expression data are assumed to have gone through suitable transformation and normalization steps (Quackenbush, 2001; Butte, 2002) so that the sample distribution of the majority of the genes is approximately standard normal. The link data, which can originate from previous genetical genomics studies (cis- and trans-acting variation) or are based on known pathways, enable incorporating cross terms into the model. If no prior external knowledge or hypothesis is available, the link data can be constructed solely based on the genetic distances between the markers and the genes. Thus, one assumes in cis effects between the marker and all genes within a given genetic distance of the marker. Also, oligonucleotide arrays can provide simultaneous genotype and gene expression measurements directly (Ronald et al, 2005).
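The distance-based construction of link data described above can be sketched as follows. This is a minimal illustration only: the function name `build_cis_links`, the tuple layout for positions, and the 1 cM window are assumptions made for the example and do not come from the paper.

```python
def build_cis_links(marker_pos, gene_pos, max_dist_cm):
    """Return (marker_index, gene_index) pairs lying within max_dist_cm.

    marker_pos and gene_pos are lists of (chromosome, position_in_cM) tuples.
    A pair is linked only when both loci sit on the same chromosome.
    """
    links = []
    for s, (m_chr, m_cm) in enumerate(marker_pos):
        for g, (g_chr, g_cm) in enumerate(gene_pos):
            if m_chr == g_chr and abs(m_cm - g_cm) <= max_dist_cm:
                links.append((s, g))
    return links


# Hypothetical toy map: three markers and four genes on two chromosomes.
markers = [(1, 0.0), (1, 4.0), (2, 0.0)]
genes = [(1, 0.5), (1, 3.8), (1, 10.0), (2, 0.2)]
links = build_cis_links(markers, genes, max_dist_cm=1.0)
```

Each returned pair would contribute one genotype × expression interaction term, so the length of `links` plays the role of N_ME.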

Given genetic data as described above, we propose modeling a quantitative phenotype y_i of individual i with the following linear model:

$$ y_i = \mu + \sum_{j=1}^{N_M}\sum_{k=1}^{2} \alpha_{j,k}\, z_{i,j,k} + \sum_{j=1}^{N_E} \beta_j\, x_{i,j} + \sum_{j=1}^{N_{ME}}\sum_{k=1}^{2} \gamma_{j,k}\, z_{i,s_j,k}\, x_{i,g_j} + \varepsilon_i, \tag{1} $$

where μ is the population mean and ɛ_i∼N(0, σ₀²) is a normally distributed residual term with mean zero and variance σ₀². For a binary trait, we use model (1) to model an underlying continuous liability, which then gives rise to the binary observation according to the Bayesian probit model (see Appendix A1 for details). The first summation runs over all markers and is designed to capture the genotype-specific main effects (cf Xu, 2003). For individual i at marker j, the indicator for genotype k is denoted by z_i,j,k, and α_j,k is the coefficient of the corresponding genotype-specific main effect. The second summation runs over the N_E genes and x_i,j denotes the gene expression measurement of gene j for individual i, and β_j is the coefficient of the corresponding linear gene expression effect. In the last summation, genotype and gene expression data of the N_ME marker–gene pairs are gathered into pairs j=1, …, N_ME, where the jth term uses gene g_j and marker s_j for some pair (g_j, s_j) given by the link data. Thus, for individual i, $x_{i,g_j}$ is the gene expression measurement of gene g_j and $z_{i,s_j,k}$ is the indicator of genotype k for the corresponding marker s_j, and γ_j,k is the coefficient of the corresponding genotype × expression interaction effect. The extension of model (1) to include also genotype × genotype or expression × expression interaction terms is considered in the Discussion section.

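In order to ensure that the model parameters are identifiable, we introduce the constraints

$$ \alpha_{j,1} = 0, \quad j = 1, \ldots, N_M, \tag{2} $$

$$ \gamma_{j,1} = 0, \quad j = 1, \ldots, N_{ME}. \tag{3} $$

Thus, the first genotype, at each marker, is identified as a baseline, and their effects are included into the terms μ and $\sum_{j=1}^{N_E}\beta_j x_{i,j}$. The genotype-specific contrasts (differences) are then modeled using the two remaining terms in model (1), which, by taking into account the above constraints, are $\sum_{j=1}^{N_M}\alpha_{j,2}\, z_{i,j,2}$ and $\sum_{j=1}^{N_{ME}}\gamma_{j,2}\, z_{i,s_j,2}\, x_{i,g_j}$.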

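Next, for individual i, the genetic data are gathered into the vector

$$ \mathbf{w}_i = \left(z_{i,1,2}, \ldots, z_{i,N_M,2},\; x_{i,1}, \ldots, x_{i,N_E},\; z_{i,s_1,2}\, x_{i,g_1}, \ldots, z_{i,s_{N_{ME}},2}\, x_{i,g_{N_{ME}}}\right), $$

and the vector containing the N=N_M+N_E+N_ME unknown effects (coefficients) is denoted by

$$ \theta = \left(\alpha_{1,2}, \ldots, \alpha_{N_M,2},\; \beta_1, \ldots, \beta_{N_E},\; \gamma_{1,2}, \ldots, \gamma_{N_{ME},2}\right). $$

Now, by taking into account constraints (2) and (3), the linear model (1) can be rewritten as

$$ y_i = \mu + \sum_{j=1}^{N} \theta_j\, w_{i,j} + \varepsilon_i. \tag{4} $$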
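As a rough illustration of how the marker, expression, and interaction covariates of one individual line up with the N=N_M+N_E+N_ME effects, the sketch below builds the covariate row and evaluates the model mean. The helper names `design_row` and `linear_predictor`, the 0/1 coding of the non-baseline genotype, and the toy numbers are assumptions for this example, not the authors' implementation.

```python
def design_row(z2, x, links):
    """Build the covariate vector of one individual.

    z2    : 0/1 indicators of the non-baseline genotype, one per marker
    x     : normalized expression values, one per gene
    links : (marker_index, gene_index) pairs taken from the link data
    """
    interactions = [z2[s] * x[g] for s, g in links]
    return list(z2) + list(x) + interactions


def linear_predictor(mu, theta, row):
    """Phenotype mean: mu plus the effect-weighted covariates."""
    return mu + sum(t * w for t, w in zip(theta, row))


z2 = [1, 0]                 # N_M = 2 markers
x = [0.5, -1.0, 2.0]        # N_E = 3 genes
links = [(0, 1), (1, 2)]    # N_ME = 2 marker-gene pairs
row = design_row(z2, x, links)
theta = [1.0, 0.0, 0.0, 2.0, 0.0, 3.0, 0.0]   # mostly zero, as sparseness suggests
mean_y = linear_predictor(10.0, theta, row)
```

The interaction entry is zero whenever the individual carries the baseline genotype at the linked marker, which is how the baseline constraint shows up in the covariates.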

Hierarchical model

Prior distributions

Bayesian approaches require the specification of prior distributions for the unknown parameters. We follow the work of Xu (2003) and adopt the following prior densities, where each effect is assigned its own variance term. For j=1, …, N, the effect prior p(θ_j∣σ_j²) is the density function of normal distribution with mean zero and effect-specific variance σ_j², and p(σ_j²)∝1/σ_j² is the (Jeffreys' scale invariant) prior density function of the effect-specific hyperparameter σ_j². The prior density function of the mean μ is p(μ)∝1, and the prior density function of the variance σ₀²=var(ɛ_i), for i=1, …, n, is p(σ₀²)∝1/σ₀². Now, by the use of appropriate (conditional) independence assumptions, the joint prior density function of the model parameters θ, μ, and σ², where σ²=(σ₀², …, σ_N²), is p(θ,μ,σ²)=p(θ∣σ²)p(μ)p(σ²), where $p(\theta\mid\sigma^2)=\prod_{j=1}^{N}p(\theta_j\mid\sigma_j^2)$, and $p(\sigma^2)=\prod_{j=0}^{N}p(\sigma_j^2)$. It has been demonstrated (see Figueiredo, 2003; Xu, 2003) that the above use of effect-specific variance parameters induces sparseness. Thus, our prior information states that most terms in the sums of model (1) are expected to be zero or almost zero and the degree of sparseness adaptively depends on the data at hand.
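To see how the effect-specific variances can pull an effect toward zero, it may help to write out the conditional posterior of σ_j² given θ_j that follows from the two priors above. The short derivation below is a sketch based only on the stated normal prior and Jeffreys' prior; it is not copied from the paper's appendix.

```latex
\begin{align*}
p(\sigma_j^2 \mid \theta_j)
  &\propto p(\theta_j \mid \sigma_j^2)\, p(\sigma_j^2) \\
  &\propto (\sigma_j^2)^{-1/2}
     \exp\!\left(-\frac{\theta_j^2}{2\sigma_j^2}\right) \cdot \frac{1}{\sigma_j^2} \\
  &= (\sigma_j^2)^{-3/2}
     \exp\!\left(-\frac{\theta_j^2}{2\sigma_j^2}\right).
\end{align*}
```

This is an inverse-gamma density with shape 1/2 and scale θ_j²/2, so a draw can be written as σ_j² = θ_j²/χ²₁, where χ²₁ is a chi-square variate with one degree of freedom. A small θ_j therefore produces a small σ_j², which in turn tightens the normal prior on θ_j in the next round.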

Model for missing values

In Bayesian inference, missing values are handled in a similar manner as any other unknown parameter (random variable). Thus, prior distributions are assigned to all missing values. The prior density function p(x_i,j) of a missing gene expression measurement x_i,j is chosen to be that of a standard normal distribution. Recall that the gene expression level measurements are assumed to be approximately normally distributed.

Next we define the prior distribution for the marker data, in a backcross or a double haploid situation, by taking into account the probability of a recombination, which again is defined by the genetic distances between the markers. Following Sillanpää and Arjas (1998), the joint probability of the marker data for individual i is given by

$$ P(m_{i,1}, \ldots, m_{i,N_M}) = P(m_{i,1}) \prod_{j=2}^{N_M} P(m_{i,j} \mid m_{i,j-1}), \tag{5} $$

where m_i,j is the genotype of individual i at marker j, $P(m_{i,1})$ is the probability of genotype m_i,1 at marker 1, and P(m_i,j∣m_i,j−1) is the probability of genotype m_i,j at marker j conditional on genotype m_i,j−1 at marker j−1. The conditional probability P(m_i,j∣m_i,j−1) is 1−r_j if genotypes m_i,j and m_i,j−1 are the same (no recombination) and is r_j otherwise, where r_j is derived from the genetic distance d_j (in Morgans) between markers j and j−1, by the Haldane map function $r_j = \tfrac{1}{2}\left(1 - e^{-2 d_j}\right)$. A simpler way to proceed, which works also in more general setups, is to assume independence between markers and take the prior probability of each genotype to be equal. However, in many cases, it is possible to derive more informative prior distributions similar to equation (5); for various crosses from two inbred lines, see Jiang and Zeng (1997).
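The marker prior described above can be illustrated with a few lines of code: the Haldane map function turns a map distance into a recombination fraction, and the chain of conditional probabilities gives the probability of one individual's genotype sequence. The function names and the default first-marker probability of 0.5 (equal genotype frequencies, as would be expected in a backcross) are assumptions made for this sketch.

```python
import math


def haldane(d_morgans):
    """Recombination fraction for a map distance given in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def marker_chain_prob(genotypes, distances_morgans, p_first=0.5):
    """Probability of one individual's genotype sequence along a chromosome.

    genotypes         : genotype codes at consecutive markers
    distances_morgans : distance between marker j-1 and marker j (one fewer entry)
    p_first           : assumed probability of the genotype at the first marker
    """
    prob = p_first
    for j in range(1, len(genotypes)):
        r = haldane(distances_morgans[j - 1])
        # Same genotype means no recombination (1 - r), otherwise recombination (r).
        prob *= (1.0 - r) if genotypes[j] == genotypes[j - 1] else r
    return prob


r_4cm = haldane(0.04)                                # 4 cM between markers
p_seq = marker_chain_prob([1, 1, 2], [0.04, 0.04])   # one recombination event
```

With 4 cM spacing the recombination fraction is just under 0.04, so sequences with few genotype switches receive most of the prior mass.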

Posterior distributions

Next we derive the posterior distribution of the model parameters θ, μ, and σ², where σ²=(σ₀², …, σ_N²). Denote by D={m,x} the complete genetic data, that is, the combined marker and gene expression data, with no missing values. Further, let D⁻={m⁻, x⁻} denote the observed genetic data with possibly some entries missing. By the use of the Bayes formula, the density function of the joint posterior distribution (see Figure 1) of the model parameters and the genetic data is given by

$$ p(\theta, \mu, \sigma^2, D \mid y, D^{-}) \propto p(y \mid \theta, \mu, \sigma^2, D)\; p(D^{-} \mid D)\; p(D)\; p(\theta, \mu, \sigma^2), \tag{6} $$

where p(θ, μ, σ²) is the density function of the joint prior distribution of the parameters (θ, μ, σ²), p(D) is the prior density function of the complete genetic data D, p(D⁻∣D) is the probability mass function of the observed genetic data D⁻ conditional on the complete genetic data D, and p(y∣θ, μ, σ², D) is the likelihood of the phenotype data y. Note that p(D⁻∣D) is an indicator function and takes value 1 only when D⁻ is consistent with D and 0 otherwise. The prior density function of the genetic data D is proportional to $\prod_{i=1}^{n}\left[P(m_{i,1}, \ldots, m_{i,N_M})\prod_{j=1}^{N_E} p(x_{i,j})\right]$ and the likelihood function can be factorized into $p(y\mid\theta,\mu,\sigma^2,D)=\prod_{i=1}^{n} p(y_i\mid\theta,\mu,\sigma_0^2,D)$, where

$$ p(y_i \mid \theta, \mu, \sigma_0^2, D) = (2\pi\sigma_0^2)^{-1/2} \exp\!\left\{-\frac{\varepsilon_i^2}{2\sigma_0^2}\right\} $$

and ɛ_i is the residual of individual i in model (1).

Markov chain Monte Carlo estimation

We apply Gibbs sampling (Geman and Geman, 1984) and Metropolis–Hastings (Hastings, 1970) algorithms to draw dependent samples from the joint posterior distribution of the unknowns (equation (6)). The specific choices of the prior distributions (conjugate distributions) allow us to generate samples directly from the fully conditional marginal posterior distributions of θ, μ, and σ². A detailed description of the adopted algorithm is given in Appendix A1. Possible point estimates for the unknown distributions include the maximum a posteriori (MAP), the median, and the expected value of the marginal distributions. We assume that the number of markers and genes in the data set is such that it is computationally reasonable to attempt to estimate the complete posterior distribution, for example, their number is reduced by some preliminary feature selection algorithm or the features are chosen based on known pathways (see Thomas, 2005). Zhang and Xu (2005) were able to handle a model where the number of effects was 15 times larger than the sample size. In our opinion, it is preferable to reduce this ratio (upper limit) even further, say down to 10, by gathering more samples or by reducing the number of considered effects in the model. If the number of markers or genes is very large (several thousands), even though the data contain enough information to estimate the effects, the time needed to perform the calculations can be overwhelming. In such a case, one can postpone the estimation of the whole distribution and concentrate on directly estimating some summary statistic of the distribution (eg MAP via an EM algorithm in Figueiredo (2003), and MAP via a penalized ML method in Zhang and Xu (2005)).
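A minimal Gibbs sampler in the spirit of the shrinkage approach described above might look like the sketch below. It is deliberately simplified and is not the algorithm of Appendix A1: it assumes a quantitative trait, complete data (no missing genotypes or expression values), and a ready-made covariate matrix `W`. The function name, the starting values, and the small floor on σ_j² (a numerical guard) are assumptions of this example.

```python
import math
import random


def gibbs_shrinkage(y, W, n_iter, rng):
    """Gibbs sampler for y_i = mu + sum_j theta_j * W[i][j] + e_i (complete data)."""
    n, N = len(y), len(W[0])
    mu = sum(y) / n
    theta = [0.0] * N
    var_eff = [0.5] * N          # effect-specific variances sigma_j^2
    var_res = 0.5                # residual variance sigma_0^2
    resid = [y[i] - mu for i in range(n)]            # y - mu - W theta
    ssq = [sum(W[i][j] ** 2 for i in range(n)) for j in range(N)]
    draws = []
    for _ in range(n_iter):
        # mu | rest ~ N(mean of (y - W theta), sigma_0^2 / n)
        new_mu = rng.gauss(mu + sum(resid) / n, math.sqrt(var_res / n))
        resid = [r - (new_mu - mu) for r in resid]
        mu = new_mu
        for j in range(N):
            # theta_j | rest is normal; add the old contribution back first
            xr = sum(W[i][j] * (resid[i] + W[i][j] * theta[j]) for i in range(n))
            prec = ssq[j] / var_res + 1.0 / var_eff[j]
            new_t = rng.gauss(xr / var_res / prec, math.sqrt(1.0 / prec))
            resid = [resid[i] - W[i][j] * (new_t - theta[j]) for i in range(n)]
            theta[j] = new_t
            # sigma_j^2 | theta_j = theta_j^2 / chi2(1); the floor is a numerical guard
            chi2_1 = rng.gammavariate(0.5, 2.0)
            var_eff[j] = max(theta[j] ** 2 / chi2_1, 1e-12)
        # sigma_0^2 | rest = residual sum of squares / chi2(n)
        var_res = sum(r * r for r in resid) / rng.gammavariate(n / 2.0, 2.0)
        draws.append((mu, list(theta), var_res))
    return draws


# Toy data: one true effect (coefficient 2) among five covariates.
rng = random.Random(1)
n, N = 60, 5
W = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(n)]
y = [1.0 + 2.0 * W[i][0] + rng.gauss(0.0, 1.0) for i in range(n)]
draws = gibbs_shrinkage(y, W, n_iter=600, rng=rng)
kept = draws[100:]               # drop early rounds as burn-in
post_mean_theta0 = sum(d[1][0] for d in kept) / len(kept)
```

Handling missing genotypes or expression values would add further sampling steps for the missing entries, and the binary case would add a step for the latent liability; both are left out here.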

Simulations

We simulated backcross data consisting of molecular markers, gene expression level measurements, and both a binary and a continuous phenotype. First, linked marker data were simulated, then gene expression data were generated conditionally on the marker data. Next, the phenotype data were generated conditionally on both the marker and gene expression data. Finally, missing values were introduced by randomly removing a given proportion of the marker and gene expression measurements. Note that our simulation strategy differs from some others, which use real gene expression data as a starting point (Perez-Enciso et al, 2003; Perez-Enciso, 2004). By conditioning on the expression measurement, Perez-Enciso et al (2003) simulated case–control data (QTLs and binary phenotypes) using partial least squares. Subsequent linked marker data were then generated around each QTL using the exponential decay model for linkage disequilibrium. In Perez-Enciso (2004), the data, a set of linked marker loci and phenotypes, were simulated from an outbred population based on coalescent techniques and gene dropping. Our main reason for not adopting any existing simulation method was the need to have realistic linked marker data for offspring resulting from a backcross design of two inbred lines. Also, our approach (see below), although including simplifying assumptions, allows us to fully validate the performance of the proposed estimation method.

Genetic data

Linked marker data for a population of 200 backcross individuals were simulated using the QTL Cartographer software (Basten et al, 1994, 2003). Altogether 100 markers were simulated, with 50 equally spaced markers on each of two chromosomes. The inter-marker distance on both chromosomes was taken to be 4 cM and the length of the genetic material outside the boundary markers on each chromosome was 2 cM.

Next, three genes (gene expression measurements) were assigned around each marker, resulting in 300 genes. We assume that the genetic distances from the three gene loci to the marker are so small that any effect between the marker and the genes is of in cis nature (this is our link data). Then, for each marker, one gene (the middle one) was randomly assigned a value φ_j∈{0, 1}, where φ_j=1 indicates the presence of an in cis effect and for the remaining two genes we assume no in cis effect (φ_j=0). For the middle genes, the probability that φ_j=1 was taken to be 0.3, which is in line with current estimates (Jansen and Nap, 2004; Morley et al, 2004).
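The gene assignment just described can be mimicked with a short sketch. The function name `assign_genes`, the gene indexing scheme (genes 3s, 3s+1, 3s+2 belong to marker s), and the random seed are assumptions made for illustration; the original simulation code is not reproduced here.

```python
import random


def assign_genes(n_markers, p_cis, rng):
    """Attach three genes to each marker and flag cis effects on the middle one.

    Returns (links, phi): links is a list of (marker, gene) index pairs and
    phi[g] is 1 when gene g has an in cis effect from its marker, else 0.
    """
    links, phi = [], []
    for s in range(n_markers):
        for slot in range(3):
            links.append((s, 3 * s + slot))
            is_middle = slot == 1
            # Only the middle gene can carry a cis effect, with probability p_cis.
            phi.append(1 if is_middle and rng.random() < p_cis else 0)
    return links, phi


links, phi = assign_genes(n_markers=100, p_cis=0.3, rng=random.Random(7))
```

With 100 markers this yields 300 genes and 300 marker–gene pairs, of which roughly 30 middle genes are expected to carry an in cis effect.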

To mimic allele-specific expression, the gene expression value x_i,j of gene j for individual i was generated from the mixture distribution

where N(a, b) is the normal distribution with mean a and variance b, and is the indicator of genotype k for individual i at the marker linked to gene j. Although gene expression values are generated independently from each other, dependence between markers (with in cis effects) will imply also some dependence between expression levels.

Phenotype data

The phenotype data were constructed as a linear combination of genetic components, that is, genotype–expression pairs, of six different subtypes (Figure 2). The possibility of a single genotype–expression pair simultaneously having more than one active phenotype effect was excluded in the simulation. Thus, we divide the genetic components into three subtypes depending on the mechanism by which they have an effect on the phenotype: genotype effect (G), gene expression effect (E), and genotype × expression effect (GE). Also, we distinguish between marker–gene pairs with and without in cis effects. We add an i to denote the presence of an in cis effect. So, for example, a marker–gene pair of type iE contributes to the phenotype only through the gene expression, although there exists an in cis effect on the genotype–expression level.

Based on the above genetic data, a continuous and a binary phenotype were generated. The phenotypes were designed to study which of the above six effect types our model is able to recapture. Also, we wanted to do comparison studies against more traditional models, that is, models based solely on either marker or gene expression data. With this task in mind, a continuous phenotype for individual i was generated as

where the subscripts s₁, …, s₆ are randomly chosen indexes of marker–gene pairs of types G, iG, E, iE, GE, and iGE, respectively, and ɛ_i is normally distributed with mean 0 and variance 1. The factors a₁, …, a₆ are the inverses of the sample standard deviations of the six genetic terms, respectively, and their function is to ensure that each term contributes an equal amount of variation to the phenotype. Further, the binary phenotype was defined based on the continuous phenotype as follows:

For realizations of the above phenotype, the heritability, which is a measure of the proportion of the phenotypic variation explained by the genetic components, is typically about 0.6–0.7.
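As a small worked example of the heritability described above, the sketch below computes the ratio of the variance of the genetic values to the variance of the phenotype. The function names and the toy numbers are assumptions; the residuals are chosen to be uncorrelated with the genetic values so that the two variances add up.

```python
def variance(values):
    """Sample variance with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def heritability(genetic_values, phenotypes):
    """Share of phenotypic variance accounted for by the genetic values."""
    return variance(genetic_values) / variance(phenotypes)


g = [0.0, 1.0, 2.0, 3.0]          # genetic part of each individual's phenotype
e = [0.5, -0.5, -0.5, 0.5]        # residuals, uncorrelated with g in this toy case
y = [gi + ei for gi, ei in zip(g, e)]
h2 = heritability(g, y)
```

Here the genetic variance is 5/3 and the phenotypic variance is 2, so the ratio is 5/6.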

Analyses

We analyzed a realization of the phenotype where the genetic effect components were about equally distributed along the genome and the heritability was 0.69. The proportion of missing values in both the molecular marker data and the gene expression data was taken to be 0.01. These proportions can vary in practice and, in addition to reducing the information content in the data, missing values slow down the actual estimation process. The continuous phenotype and the binary phenotype were analyzed separately using combined marker and gene expression data, marker data alone, and gene expression data alone. Because no trans-acting variation was included into the simulation, we can easily monitor false positives arising from in cis effects in conventional analyses. These six different analyses were implemented using Matlab software on a Pentium IV 2.8 GHz processor. The initial values for the effects θ_j, j=1, …, N, were taken to be zero, and those of the variance terms σ_j², j=0, …, N, were initialized to 0.5. The mean value μ was assigned to the sample mean of the phenotypes and the missing values were randomly assigned initial values from their empirical distributions. The Markov chain Monte Carlo (MCMC) algorithm was run for 50 000 rounds (≈2 h) in all simulated examples. In each case, the first 10% of the rounds were considered to be ‘burn-in’ rounds and were thus discarded from the analysis. Also, to reduce serial correlation, only every 10th round was stored and used in the final summaries. The convergence assessment of the method was made by visually monitoring the chains for several different parameters, mainly the effect coefficients and the error variance. Although the number of the effect coefficients can be very large, in practice one needs to consider only a few chains, as after the burn-in rounds the majority of the effect coefficients are constantly zero. In our simulation studies, the convergence was very fast. In fact, very similar results to those reported are achieved when using samples from the first 20% of the MCMC rounds only.
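The burn-in and thinning scheme described above amounts to simple index arithmetic, sketched below. The helper name `burn_and_thin` and the use of a plain list as a stand-in for the stored chain are assumptions of this example.

```python
def burn_and_thin(chain, burn_in_percent, thin):
    """Drop the initial burn-in rounds, then keep every `thin`-th remaining round."""
    start = len(chain) * burn_in_percent // 100
    return chain[start::thin]


rounds = list(range(50000))             # stand-in for 50 000 MCMC rounds
kept = burn_and_thin(rounds, burn_in_percent=10, thin=10)
```

Discarding the first 5000 rounds and keeping every 10th of the remaining 45 000 leaves 4500 rounds for the summaries.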

Results

Continuous trait

Combined data analysis

In Figure 3, the results of the analysis combining marker and gene expression data are summarized by the posterior probabilities that the standardized effect size exceeds the given threshold 0.1. The standardized effect (an analog to the standardized regression coefficient found in statistical textbooks) for the genetic component j is given by $\theta_j^{*} = \theta_j\,\sigma_j/\sigma_p$, where σ_j is the standard deviation of the genetic component j and σ_p is the standard deviation of the phenotype. In the presence of missing values, θ_j^* is calculated on every MCMC round using σ_j of the imputed data. Further, in the binary case, σ_p is the standard deviation of the liability phenotype, which changes every MCMC round. The above posterior probability for the genetic component j can be written as P_j^(0.1)=P(θ_j^*>0.1∣data), where ‘data’ refers to the observed genetic data and the phenotype. In summary of Figure 3, altogether nine genetic components had elevated posterior probabilities P_j^(0.1), where this probability was distinctive (mostly 1.0) for all six simulated components, and it was less than or equal to 0.4 for the others. The number of nonzero effects is controlled adaptively in the analysis (their number depends on the data at hand).
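The two summaries used above, the standardized effect and the posterior probability that it exceeds a threshold, can be estimated from stored MCMC rounds roughly as follows. The function names and the toy sample values are assumptions; the comparison is written without an absolute value, following the expression P(θ_j^*>0.1∣data) as printed.

```python
import math


def std_dev(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def standardized_effect(theta_j, component_values, phenotypes):
    """theta_j scaled by sd(genetic component j) / sd(phenotype)."""
    return theta_j * std_dev(component_values) / std_dev(phenotypes)


def exceedance_probability(std_effects, threshold=0.1):
    """Fraction of stored MCMC rounds whose standardized effect exceeds the threshold."""
    return sum(1 for t in std_effects if t > threshold) / len(std_effects)


# Hypothetical standardized effects of one component over eight stored rounds.
samples = [0.0, 0.25, 0.3, 0.0, 0.12, 0.05, 0.4, 0.0]
p_j = exceedance_probability(samples)
one_round = standardized_effect(2.0, [0, 1, 0, 1], [0, 2, 0, 2])
```

In the toy chain four of the eight rounds exceed 0.1, giving an estimate of 0.5 for this component.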

To summarize the number of influential components, in the genetic model, Table 1 presents the posterior probabilities for different numbers of components, whose standardized effect size simultaneously exceeds the threshold 0.1, that is, P(I_c^(0.1)=n∣data). Note that the distribution only supports numbers in the range [6, 10] and that the support is clearly highest at the correct number six. In Table 2, we have calculated the conditional probability for all pairs {j, k}, formed by genetic components whose probability P_j^(0.1) exceeds 0.1. Q_j,k^(T) is the posterior probability that the standardized effect of the genetic component j exceeds the threshold T conditional on the event that the standardized effect of the genetic component k exceeds the same threshold. These Q-summaries allow the detection of alternating components in the genetic model. From Table 2 and by visually studying the MCMC paths of the standardized effects (Figure 4), it can be seen that the expression effect (j=246) and the genotype × expression effect (j=546), associated with the genetic component of type iE, are complementary, and thus only one of the two at a time contributes a nonzero effect into the genetic model. The threshold value T=0.1 was chosen subjectively. Experiments with different threshold values indicated that the above summary statistics are robust to the chosen value. The use of a smaller threshold value T<0.1, although increasing the number of components with nonzero P_j^(T) values, rarely had an effect on the number of components with higher P_j^(T) values, say greater than 0.1. Also, the effect on the distribution of the number of influential components was negligible.
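The Q-summaries and the distribution of the number of influential components can both be estimated by counting over stored MCMC rounds, as in the sketch below. The function names and the toy chains, which imitate two alternating (complementary) components, are assumptions made for illustration.

```python
from collections import Counter


def q_summary(std_j, std_k, threshold=0.1):
    """Estimate P(effect j > T | effect k > T) from paired MCMC rounds."""
    j_when_k = [a for a, b in zip(std_j, std_k) if b > threshold]
    if not j_when_k:
        return float("nan")       # component k never exceeds the threshold
    return sum(1 for a in j_when_k if a > threshold) / len(j_when_k)


def influential_count_distribution(std_effects_by_round, threshold=0.1):
    """Posterior distribution of the number of components exceeding T."""
    counts = Counter(sum(1 for t in row if t > threshold)
                     for row in std_effects_by_round)
    total = len(std_effects_by_round)
    return {n: c / total for n, c in sorted(counts.items())}


# Two components that alternate: when one is active the other is near zero.
eff_a = [0.3, 0.0, 0.4, 0.0, 0.35, 0.0]
eff_b = [0.0, 0.3, 0.0, 0.2, 0.00, 0.3]
q_ab = q_summary(eff_a, eff_b)
dist = influential_count_distribution(list(zip(eff_a, eff_b)))
```

A Q value near zero for a pair whose members each have a sizeable marginal probability is the signature of complementary components, and the count distribution then concentrates on one component rather than two.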

Table 1 The distributions of the number of influential components

Table 2 Pairwise conditional summaries

Sole marker analysis

In the top-left panel of Figure 5, the standardized genotype effects are summarized by the component probabilities P_j^(0.1). From the results, we can locate four markers (genetic components), which all satisfy the condition P_j^(0.1)>0.1. Further, by studying the Q-summaries, the pairwise probabilities Q_j,k^(0.1) (table not shown), it can be concluded that the genetic components 83 and 86 are complementary. Thus, the results suggest that we are able to locate three putative markers only, two correct ones having main genotype effects (types iG and G) and a ‘false-positive’ one (j=45) located between the type GE component and the type iE component. These conclusions are further supported by Table 1, where the highest probability 0.60 is assigned to the case n=3.

Sole gene expression analysis

In the lower-left panel of Figure 5, the standardized gene expression effects are summarized by the component probabilities P_j^(0.1). We were able to locate all the genes (genetic components) that contribute to the continuous phenotype through their expression (types E, iE, GE, and iGE). Also, there is some evidence (P_j^(0.1)>0.1) for the two simulated components with no expression effects: two components (j=50 and j=53) close to the component of type G and one (j=248) at the component of type iG. This can be explained by the correlation between the gene expression values and the phenotype, which is induced by the high correlation between close markers and their in cis effects (recall that about 30% of the markers have in cis effects). By studying the MCMC sample paths of components 50 and 53 in Figure 6, the strong interaction between their effect sizes becomes apparent; if either of the components obtains a zero effect, the other compensates for its absence by taking on higher values. Thus, again we can conclude that both effects attempt to model the same genetic effect (type G). This conclusion cannot be made from the Q-summaries, as Q_50,53^(0.1)=0.50 and Q_53,50^(0.1)=0.33. In summary, we were able to locate six putative genes, with one (j=248) having a posterior probability value of 0.1 only. This result is also supported by Table 1, where although some evidence is assigned to the case n=6, the highest probabilities are obtained for the cases n=4 and n=5.