Nature Biotechnology  Computational Biology  Primer
Analyzing 'omics data using hierarchical models
Hongkai Ji^{1} & X Shirley Liu^{2}
Nature Biotechnology 28, 337–340
DOI: 10.1038/nbt.1619
Hierarchical models provide reliable statistical estimates for data sets from high-throughput experiments where measurements vastly outnumber experimental samples.
Introduction
Interpreting 'omics data often involves statistical analysis of large numbers of loci such as genes, binding sites or single-nucleotide polymorphisms (SNPs). Although the data set as a whole may be rich in information, each individual locus is typically associated with only a limited amount of data. Statistical inference in this context is challenging. A hierarchical model is a useful statistical tool for analyzing such data more efficiently, and it is increasingly used in computational genomics.
A motivating example
Consider a hypothetical microarray experiment with ten genes. For each gene, log_{2} expression fold-changes (hereafter referred to simply as 'expression') are observed between tumor and normal tissues in three biological replicates (Table 1). Suppose you want to select, for follow-up study, a gene that is differentially expressed in tumor compared with normal cells. Which gene should be the top candidate?
A simple solution is to rank the genes by t-statistics

t_{i} = x̄_{i} / √(s_{i}^{2}/n)

Here n (= 3) is the number of replicates, x̄_{i} is the average expression of gene i, and s_{i}^{2} is the sample variance. Based on the absolute values of the t-statistics, gene 2 is the top candidate (Table 1).
The data in this example, however, are simulated, with each gene having a 'true' expression μ_{i} whose measurement is confounded by experimental or biological variability represented by the parameter σ_{i}^{2}. (In fact, each expression measurement was randomly drawn from a bell curve–shaped normal distribution with mean μ_{i} and variance σ_{i}^{2}.) The true values of μ_{i} and σ_{i}^{2}, which are unknown to you, are shown in Table 1. It turns out the only truly differentially expressed gene is gene 10, which has a nonzero μ_{i}. Gene 2 therefore represents a false-positive call.
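The simulation and ranking just described can be sketched in a few lines of Python. The parameter values below are illustrative stand-ins, not the exact Table 1 data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for Table 1: ten genes, three replicates each.
# Only gene 10 is truly differentially expressed (mu = 2); the other
# genes have mu = 0 but nonzero gene-specific variances.
mu = np.array([0.0] * 9 + [2.0])
sigma2 = np.ones(10)
n = 3

# Each row i holds n draws from Normal(mu[i], sigma2[i]).
x = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(10, n))

xbar = x.mean(axis=1)         # sample mean per gene
s2 = x.var(axis=1, ddof=1)    # sample variance per gene
t = xbar / np.sqrt(s2 / n)    # one-sample t-statistic

ranking = np.argsort(-np.abs(t)) + 1   # gene indices ranked by |t|
print(ranking)
```

With many genes and few replicates, a null gene whose sample variance happens to be tiny can outrank the truly differential gene, as in the example above.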
What causes this mistake? Small sample size and the multiplicity of the problem are the reasons. To understand why, it may be helpful to briefly review the key ideas behind statistical inference. The first concept to understand is that of the 'distribution'. Briefly, in the presence of biological or experimental noise and variability, repeated biological measurements are unlikely to be identical, giving rise to a collection, or distribution, of data values. This distribution can be characterized by parameters, such as its mean (or average value) and variance (which quantifies how far the measurements are expected to be from the mean). The parameters are properties associated with infinitely many measurements. In a real scenario, when only a finite number of measurements are available, the true parameter value cannot be obtained. Statistical inference seeks to make statements about the true, also referred to as 'unobserved', parameter value based on the observed data, which statisticians call 'samples' drawn from the distribution.
In a t-statistic, the sample mean x̄_{i} represents an estimate of the true mean μ_{i} of the distribution from which gene i's data are sampled, and the sample variance s_{i}^{2} represents an estimate of the true variance σ_{i}^{2}. If the true mean is zero (that is, gene i is not differentially expressed), one is unlikely to obtain a t-statistic with a large magnitude.
When the sample size is small, however, the observed sample variance is an unreliable estimate of the true variance of the system. To see why, imagine randomly selecting three data points from a normal distribution with mean 0 and variance 1, which results in the values 0.1, 0.09 and 0.11 (Fig. 1a, blue dots). The observed variance is then 0.0001 (approximately 0), even though the true variance is 1 (that is, much bigger than 0). Another random draw of three data points from the same distribution may give you −1.1, −0.2 and 0.7 (Fig. 1a, orange dots) and a totally different observed variance of 0.81. Although the probability that the observed variance deviates substantially from the true variance is small for each individual gene, in a genomic study with many genes the chance of encountering such deviations for some genes is high.
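This instability is easy to reproduce. The sketch below draws three points from Normal(0, 1) for each of 10,000 hypothetical genes (the number is illustrative) and examines the spread of the resulting sample variances:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample variance of n = 3 draws from Normal(0, 1), repeated for
# 10,000 hypothetical genes. The true variance is 1 for every gene.
s2 = rng.standard_normal((10_000, 3)).var(axis=1, ddof=1)

# With this many genes, some sample variances land very close to
# zero purely by chance, even though none of the true variances do.
print(s2.min(), s2.max())
print((s2 < 0.01).sum())   # genes with a near-zero sample variance
```

Any one of those near-zero variances would produce a spuriously huge t-statistic for its gene.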
Small sample variances obtained by chance give rise to large t-statistics, which can incorrectly rank non-differentially expressed genes at the top. This is what happened in our example. The true variance of gene 2 is 1, but the sample variance is 0.005 (Table 1); as a result, the t-statistic incorrectly ranked gene 2 (t_{2} = 17.5) above the truly differentially expressed gene 10 (t_{10} = 3.42). In general, when data analysis involves estimating many parameters or testing many hypotheses but the sample size is small, it is difficult to reliably estimate all parameters or to make correct decisions for all tests simultaneously. This problem is less serious if more samples are available, as more reliable estimates of the parameters can be obtained for each gene.
Real gene expression microarray experiments with tens of thousands of genes are examples of a 'large p, small n' problem, where p refers to the number of genes and n refers to the number of samples. In addition to the multiplicity issue mentioned above, another potential problem is that if the data are not normally distributed, applying a t-test can be invalid when the sample size is small^{1}. This problem, however, is not the focus of this primer; the data in our example are assumed to be normally distributed.
What is a hierarchical model?
One statistical tool for handling large-p, small-n problems is a hierarchical model. Such a model describes hierarchical relationships among various sources of data variation. The model structure effectively makes it possible to 'borrow' information from all genes to make more reliable statistical inferences about a particular gene. Hierarchical models are conceptually related to regularization techniques, which include the Lasso and ridge regression and represent a broad class of methods for handling large-p, small-n problems (reviewed in refs. 2,3).
In our example, a hierarchical model can be built by assuming that the unobserved mean and variance parameters (that is, μ_{i} and σ_{i}^{2}) of different genes are themselves sampled from a distribution (denoted F_{0}). This distribution is characterized by parameters, such as the mean and variance of infinitely many μ_{i} and σ_{i}^{2} hypothetically collected from different genes. Accordingly, one can imagine that the observed expression data are generated hierarchically: first, the mean and variance parameters for each gene are drawn from F_{0}; then, expression measurements for each gene are drawn from a gene-specific distribution (that is, a normal distribution with mean μ_{i} and variance σ_{i}^{2}) (Fig. 1b).
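The two-level sampling scheme can be sketched directly. The particular forms chosen for F_{0} below (normal means, inverse-gamma variances) are illustrative assumptions, not the only possibility:

```python
import numpy as np

rng = np.random.default_rng(2)

n_genes, n_reps = 1000, 3

# Top level: gene-specific parameters (mu_i, sigma2_i) drawn from an
# assumed F0 -- normal means and inverse-gamma-distributed variances.
mu_i = rng.normal(0.0, 0.5, size=n_genes)
sigma2_i = 1.0 / rng.gamma(3.0, 1.0, size=n_genes)

# Bottom level: replicate measurements for gene i drawn from
# Normal(mu_i, sigma2_i).
x = rng.normal(mu_i[:, None], np.sqrt(sigma2_i)[:, None],
               size=(n_genes, n_reps))
print(x.shape)
```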
Naturally, this model describes two sources of variation in the observed expression data. At the top of the hierarchy, the intrinsic similarities and differences between the expression of different genes are mathematically modeled using a distribution (that is, F_{0}) of the unobserved gene-specific parameters. At the bottom of the hierarchy, the cross-sample variability within a single gene is modeled using a gene-specific distribution with parameters generated from the top-level distribution (Fig. 1c). In effect, the top-level distribution describes which gene-specific parameter values are common and which are unusual. The data contain information about the distributions at both levels because there are several replicates for each gene over many different genes.
Although the top-level distribution is usually unknown, it can be estimated using data from the thousands of genes available. Then, using this distribution, the hierarchical model allows one to 'borrow' information across genes to facilitate inference. How much information to borrow is determined by how similar the genes are relative to the cross-sample variability. The intuition is that if the heterogeneity across genes is small, then data from all genes can be informative about the parameters of a particular gene (Fig. 1c). Borrowing information across genes essentially increases the effective sample size for making inferences about individual genes^{4}. In contrast, the t-statistic approach uses only information from a single gene to estimate the mean and variance of the bottom-level distribution for that gene.
Inference using the hierarchical model
The first step in using the hierarchical model is to find a top-level distribution that fits the data (Fig. 1d). This process can be intuitively interpreted as learning the cross-gene heterogeneity from the data. The top-level distribution is usually assumed to be a member of a broad family of distributions. In other words, a large number of candidate distributions with the same mathematical form but different parameter values are considered. By varying the parameter values, members of the family are able to describe a variety of distribution patterns of the gene-specific parameters. The analysis starts by finding the distribution in the family (through identifying the parameter value) that fits the data well, and then uses the identified distribution to help infer the gene-specific parameters. Commonly used top-level distribution families include 'conjugate priors' and mixtures of simple distributions (e.g., a mixture of normal distributions)^{5}. The former is typically used when a simple computational algorithm is desired; the latter, when one needs the flexibility to describe very complex cross-gene variation patterns.
Next, the top-level distribution is used to adjust the parameter estimate of every gene (Fig. 1e). If cross-gene heterogeneity is small, the adjustment will make the parameter estimates of different genes more similar to each other. Here, the hierarchical model borrows from Bayesian inference, a general approach to statistical inference that combines prior information with observed data^{5, 6}, with the top-level distribution treated as the prior knowledge about the unobserved mean and variance parameters of individual genes.
Algorithmically, finding the top-level distribution and inferring gene-specific parameters can be implemented simultaneously using standard Bayesian or empirical Bayes techniques^{5, 6}, which sometimes require advanced and computation-intensive methods such as Markov chain Monte Carlo^{5}.
In our example, applying the hierarchical model yields a new estimate of the variance parameter of each gene. The new estimate of σ_{i}^{2} is a weighted average between the sample variance s_{i}^{2} and the estimated mean variance of all genes (that is, the mean of all variances σ^{2} in the estimated F_{0}, denoted σ_{0}^{2}) (ref. 7). The sample variance is an estimate of σ_{i}^{2} based on gene i's data, and σ_{0}^{2} represents a shared property of all genes. These two pieces of information are combined using a weight determined automatically by comparing the magnitude of cross-gene variation (with respect to σ_{i}^{2}) with that of the within-gene sampling variability (with respect to s_{i}^{2}). If the variability among genes is low relative to the sampling variability within a gene, the mean variance σ_{0}^{2} will receive a high weight. On the other hand, if the cross-gene variation is high compared with the within-gene sampling variability, more weight will be given to s_{i}^{2}.
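A minimal sketch of this weighted average, in the spirit of empirical Bayes variance moderation: the prior degrees of freedom d0 and common variance s0_sq below are illustrative assumptions, whereas in practice they are estimated from all genes.

```python
import numpy as np

def moderated_variance(s2, n, d0, s0_sq):
    """Shrink per-gene sample variances toward the common value s0_sq.

    s2    : array of per-gene sample variances (n - 1 degrees of freedom)
    d0    : prior degrees of freedom (strength of shrinkage; assumed here)
    s0_sq : common variance shared across genes (assumed here)
    """
    d = n - 1
    # Weighted average: more replicates (larger d) -> more weight on s2.
    return (d0 * s0_sq + d * s2) / (d0 + d)

# A gene-2-like tiny variance obtained by chance, plus two typical ones.
s2 = np.array([0.005, 1.2, 0.8])
shrunk = moderated_variance(s2, n=3, d0=4.0, s0_sq=1.0)
print(shrunk)   # the near-zero variance is pulled up toward 1
```

Replacing s_{i}^{2} with the shrunken estimate in the t-statistic denominator deflates the spuriously large t of genes whose sample variance is small by chance.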
The new estimates shift the sample variances toward the common population mean of σ_{i}^{2}, pulling variances that are small by chance away from zero. Compared with the old estimates s_{i}^{2}, the sum of squared errors of the new estimates from the true values is much smaller (3.50 versus 19.46). When the sample variances s_{i}^{2} in the t-statistics are replaced by the new estimates, the new t-statistics correctly rank gene 10 above gene 2 (Table 1). This weighted-average technique for estimating the variance is called variance stabilization. It is widely used in analyses of gene expression microarrays^{4, 8} and chromatin immunoprecipitation on tiling microarrays (ChIP-chip)^{7} to detect differentially expressed genes and protein-DNA binding sites, respectively. Naturally, real microarray experiments are more complicated and contain more sources of variation than our example; thus, they can benefit from more sophisticated hierarchical models that capture those types of variation.
The validity of model assumptions, such as those on the hierarchical structure and the distributions at the top and bottom levels, is crucial for the successful application of hierarchical models. When the assumptions hold, the model brings additional power. Otherwise, the model may not use the information optimally, or may introduce bias that leads to misleading results. Therefore, it is always wise to check the model assumptions by exploring characteristics of the raw data and testing the analysis results using independent information or cross-validation^{2}.
Other applications
Hierarchical models can be applied to many other problems besides gene expression microarrays and ChIP-chip. For example, in genome-wide association studies, hundreds of thousands of SNPs are tested for association with a phenotype. In a simple scenario, the association can be studied using the linear regression "phenotype = α_{i} + β_{i}*genotype + noise," where a nonzero coefficient β_{i} (i indexes SNPs) indicates association. With a limited number of samples and many SNPs to evaluate, this approach often lacks the power to distinguish relevant SNPs from random associations. Because SNPs with similar characteristics, such as those that reside in genes in the same pathway or that show a similar degree of evolutionary conservation, have similar potentials to be associated with the phenotype, one can build a hierarchical model that borrows information from similar SNPs to increase the statistical power of association studies^{9}. To use this information, one can assume that the β_{i} from different SNPs follow a top-level distribution N(μ + η*z_{i}, τ^{2}), where z_{i} is an observed characteristic of SNP i, such as a conservation score. Here, μ + η*z_{i} describes the relationship between a SNP's characteristic and its potential association with the phenotype, and τ^{2} describes the heterogeneity among SNPs with the same characteristic. The model can be generalized to incorporate multiple characteristics. One can use data from all SNPs to estimate this top-level distribution (that is, μ, η and τ^{2}), and then make inferences based on new estimates of β_{i} that combine the top-level distribution with the SNP-specific data.
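A minimal sketch of this posterior combination: given a per-SNP estimate b_{i} with sampling variance v_{i} and the prior N(μ + η*z_{i}, τ^{2}), the posterior mean of β_{i} is a precision-weighted average of the two. The values of μ, η and τ^{2} below are illustrative; in practice they are estimated from all SNPs.

```python
import numpy as np

def posterior_beta(b, v, z, mu, eta, tau2):
    """Posterior mean of beta_i under the prior N(mu + eta*z_i, tau2).

    b : per-SNP regression estimates; v : their sampling variances;
    z : observed SNP characteristics (e.g., conservation scores).
    """
    prior_mean = mu + eta * z
    w = tau2 / (tau2 + v)          # weight on the SNP's own data
    return w * b + (1.0 - w) * prior_mean

b = np.array([0.5, 0.5])           # identical point estimates...
v = np.array([0.04, 0.40])         # ...but SNP 2 is far less precise
z = np.array([1.0, 1.0])

post = posterior_beta(b, v, z, mu=0.0, eta=0.1, tau2=0.04)
print(post)  # the noisier estimate is pulled harder toward mu + eta*z
```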
Application of hierarchical models is not limited to large-p, small-n data. The models are useful in a broad spectrum of large-p problems where the amount of information per locus is limited, with small sample size being a special case. For example, predicting transcription factor binding sites in DNA sequences can be viewed as a problem of probabilistically assigning a 0–1 label to each locus by matching the sequence to a motif model as opposed to a background model. If the sequences are long, there can be random matches to the motif, which lead to false-positive predictions. However, functional transcription factor binding sites tend to cluster in the genome to form regulatory modules. One can build a hierarchical model by assuming that the input sequences consist of background and modules, and that the modules in turn consist of background and binding sites, so that binding sites occur only within modules; given the binding site locations, nucleotides are generated according to either the motif or the background probability model. Using this hierarchical model, one can first infer the top-level module status by examining sequences from nearby genomic loci, and then combine the module status, as a prior, with the DNA sequence at each locus to infer its binding status. The module status estimated using information across loci helps eliminate many false-positive binding site predictions. In ref. 10, it was shown that the improved estimates of binding site locations increase the power of de novo motif discovery.
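The effect of the module-level prior can be sketched with a toy two-level calculation: the posterior probability that a locus is a binding site combines a motif-versus-background likelihood with a prior that depends on whether the locus sits inside an inferred module. All numbers below are illustrative assumptions.

```python
def site_posterior(lik_motif, lik_bg, prior_site):
    """Posterior probability that a locus is a binding site,
    given the likelihoods of its sequence under the motif and
    background models and a (module-dependent) prior."""
    num = prior_site * lik_motif
    return num / (num + (1.0 - prior_site) * lik_bg)

# The same sequence match (same likelihood ratio) at two loci.
lik_motif, lik_bg = 1e-6, 2e-7

inside = site_posterior(lik_motif, lik_bg, prior_site=0.10)    # within a module
outside = site_posterior(lik_motif, lik_bg, prior_site=0.001)  # isolated locus
print(inside, outside)  # the match is far more credible inside a module
```

The module prior thus suppresses isolated random matches while preserving clustered, likely functional sites.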
We conclude with two other examples where hierarchical models might be useful but have not been fully explored. First, if you want to estimate the fold enrichment at ChIP-seq binding loci, but each ChIP and control library has only one replicate sequenced at modest depth, you may estimate a more robust background read count at one locus by borrowing information from other loci. Second, if you want to estimate the binding motif matrices for several transcription factors in the same protein family, but have only a handful of known binding sites for each factor, you can estimate more robust motif matrices by borrowing information across the family. What are other examples? Looking at your own data might reveal the answer.
References
1. Ramsey, F.L. & Schafer, D.W. The Statistical Sleuth: A Course in Methods of Data Analysis (Duxbury/Thomson Learning, 2002).
2. Hastie, T., Tibshirani, R. & Friedman, J.H. The Elements of Statistical Learning, edn. 2 (Springer, 2009).
3. Tibshirani, R. J. R. Stat. Soc. B 58, 267–288 (1996).
4. Smyth, G.K. Stat. Appl. Genet. Mol. Biol. 3, 3 (2004).
5. Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. Bayesian Data Analysis, edn. 2 (Chapman & Hall/CRC, 2004).
6. Beaumont, M.A. & Rannala, B. Nat. Rev. Genet. 5, 251–261 (2004).
7. Ji, H. & Wong, W.H. Bioinformatics 21, 3629–3636 (2005).
8. Sartor, M.A. et al. BMC Bioinformatics 7, 538 (2006).
9. Chen, G.K. & Witte, J.S. Am. J. Hum. Genet. 81, 397–404 (2007).
10. Zhou, Q. & Wong, W.H. Proc. Natl. Acad. Sci. USA 101, 12114–12119 (2004).
Author information
Affiliations

Hongkai Ji is in the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA.

X. Shirley Liu is in the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, Massachusetts, USA.
Competing financial interests
The authors declare no competing financial interests.