Interpreting 'omics data often involves statistical analysis of large numbers of loci such as genes, binding sites or single-nucleotide polymorphisms (SNPs). Although the data set as a whole may be rich in information, each individual locus is typically only associated with a limited amount of data. Statistical inference in this context is challenging. A hierarchical model is a useful statistical tool to more efficiently analyze the data, and it is increasingly being used in computational genomics.

A motivating example

Consider a hypothetical microarray experiment with ten genes. For each gene, log2 expression fold-changes (hereafter referred to as simply 'expression') are observed between tumor and normal tissues in three biological replicates (Table 1). To select a gene for follow-up study that is differentially expressed in tumor compared with normal cells, which gene should be the top candidate?

Table 1 Statistical analysis of example data using either t-statistics or a hierarchical model

A simple solution is to rank the genes by t-statistics:

$$t_i = \frac{\bar{x}_i}{\sqrt{s_i^2 / n}}$$

Here n (= 3) is the number of replicates, x̄i is the average expression of gene i, and si2 is the sample variance. Based on the absolute values of t-statistics, gene 2 is the top candidate (Table 1).
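As a rough illustration of this ranking, the following sketch computes a one-sample t-statistic (sample mean divided by its standard error) for each gene and sorts genes by absolute value. The replicate values and gene names are hypothetical and are not the Table 1 data.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t-statistic for testing whether the true mean is zero."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.variance(values)  # sample variance (n - 1 denominator)
    return mean / math.sqrt(var / n)


# Hypothetical log2 fold-changes, three replicates per gene (not Table 1).
genes = {
    "gene_a": [0.70, 0.65, 0.79],  # small sample variance
    "gene_b": [2.10, 0.40, 3.50],  # larger mean, larger sample variance
}

# Rank genes by the absolute value of their t-statistics, largest first.
ranked = sorted(genes, key=lambda g: abs(t_statistic(genes[g])), reverse=True)
for name in ranked:
    print(name, round(t_statistic(genes[name]), 2))
```

Note that a gene with a very small sample variance can outrank one with a much larger average fold-change, because the variance sits in the denominator.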

The data in this example, however, are simulated, with each gene having a 'true' expression μi whose measurement is confounded by experimental or biological variability represented by the parameter σi2. (In fact, each expression measurement was randomly drawn from a bell curve–shaped normal distribution with a mean μi and variance σi2). The true values of μi and σi2, which are unknown to you, are shown in Table 1. It turns out the only truly differentially expressed gene is gene 10, which has a nonzero μi. Gene 2 thereby represents a false-positive call.

What causes this mistake? Small sample size and the multiplicity of the problem are the reasons. To understand why, it may be helpful to briefly review the key ideas behind statistical inference. The first concept to understand is that of the 'distribution'. Briefly, in the presence of biological or experimental noise and variability, repeated biological measurements are unlikely to be identical, giving rise to a collection, or distribution, of data values. This distribution can be characterized by parameters, such as its mean (or average value) and variance (which quantifies how far the measurements are expected to be from the mean). The parameters are properties associated with infinitely many measurements. In a real scenario, when only a finite number of measurements are available, the true parameter value cannot be obtained. Statistical inference seeks to make statements about the true (also referred to as 'unobserved') parameter value based on the observed data, which statisticians call 'samples' drawn from the distribution.

In a t-statistic, the sample mean x̄i represents an estimate of the true mean μi of the distribution from which gene i's data are sampled, and the sample variance si2 represents an estimate of the true variance σi2. If the true mean is zero (that is, gene i is not differentially expressed), one is unlikely to obtain a t-statistic with a large magnitude.

When the sample size is small, however, the observed sample variance is an unreliable estimate of the true variance of the system. To see why, imagine randomly selecting three data points from a normal distribution with mean 0 and variance 1 and obtaining the values 0.1, 0.09 and 0.11 (Fig. 1a, blue dots). The observed variance is then 0.0001 (or approximately 0) even though the true variance is 1 (that is, much bigger than 0). Another random draw of three data points from the same distribution may give you −1.1, −0.2 and 0.7 (Fig. 1a, orange dots) and a totally different observed variance of 0.81. Although the probability that the observed variance deviates substantially from the true variance is small for each individual gene, in a genomic study with many genes, the chance of encountering such deviations for at least some genes is high.
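A small simulation can make this point concrete. The sketch below draws three values per gene from a normal distribution with mean 0 and variance 1 and reports the fraction of genes whose sample variance falls below a threshold. The threshold of 0.05, the seed and the function name are arbitrary choices for illustration. For n = 3, twice the sample variance follows a chi-square distribution with 2 degrees of freedom, so roughly 1 − exp(−0.05) ≈ 5% of genes are expected to fall below this threshold.

```python
import random
import statistics


def fraction_tiny_variance(n_genes, n_reps=3, threshold=0.05, seed=0):
    """Fraction of simulated genes whose sample variance is below `threshold`.

    Every gene has true mean 0 and true variance 1, so any tiny sample
    variance arises purely by chance.
    """
    rng = random.Random(seed)
    tiny = 0
    for _ in range(n_genes):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_reps)]
        if statistics.variance(sample) < threshold:
            tiny += 1
    return tiny / n_genes


# The two triplets discussed in the text.
print(statistics.variance([0.1, 0.09, 0.11]))
print(statistics.variance([-1.1, -0.2, 0.7]))

# With many genes, chance underestimates of the variance become common.
print(fraction_tiny_variance(20000))
```

A few percent of genes sounds small, but among tens of thousands of genes it amounts to hundreds of loci with misleadingly small variance estimates.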

Figure 1: Hierarchical modeling.

(a) Many analysis techniques, such as t-statistics, consider each gene separately. Owing to different sources of biological and experimental variation, if triplicate measurements of the expression of the same gene are collected twice (blue dots and orange dots), the measurements may yield different estimates of the mean and variance of the true distribution that describes the gene's expression (gray). (b) A hierarchical model helps produce more reliable estimates of the mean and variance by considering all genes together. It models different sources of biological variation hierarchically. A top-level distribution (F0) models variation across genes and a bottom-level distribution models variation of the same gene between samples. Data are described by first drawing μ and σ2 from F0 for each gene and then drawing expression fold-changes for each gene. (c) If different genes have similar mean and variance, data from one gene are informative about the mean and variance of another gene. It is not known a priori whether genes are similar (left, F0 is tightly clustered) or not (right, F0 is more spread out). However, this can be learned by looking at the data of many different genes. If genes are similar, the observed gene-to-gene differences can be largely explained by the sampling variability within a gene (bottom, left); otherwise genes are heterogeneous (bottom, right). (d) The hierarchical model is applied by first using the observed data to estimate cross-gene variation (that is, F0), then comparing it to within-gene sampling variability to determine a rule to combine the characteristics shared by all genes with the data specific to each gene for estimating μ and σ2 (solid lines). In our example, this yields an adjusted variance estimate in the form of a weighted average between the sample variance and the mean of variances σ2 in F0 (that is, σ02) (dashed lines). The model was not applied to estimate the gene-specific mean μ. 
(e) The genes' true variances in our example are similar (as in the left side of c), which is perceived by the model. As a result, the adjusted variance estimates (red) are closer than the original variance estimates (blue) to the mean variance σ02 (dotted line), which incorporates data from all genes. Overall, the adjusted variance estimates are also closer to the unobserved true variances listed in Table 1 (black '+').

Small sample variances obtained by chance give rise to large t-statistics, which can incorrectly rank nondifferentially expressed genes at the top. This is what happened in our example. The true variance of gene 2 is 1, but the sample variance is 0.005 (Table 1); as a result, the t-statistic incorrectly ranked gene 2 (t2 = 17.5) above the truly differentially expressed gene 10 (t10 = 3.42). In general, when data analysis involves estimating many parameters or testing many hypotheses but the sample size is small, it is difficult to reliably estimate all parameters or to make correct decisions for all tests simultaneously. This problem is less serious if more samples are available, as more reliable estimates of parameters can be obtained for each gene.

Real gene expression microarray experiments with tens of thousands of genes are examples of a 'large p, small n' problem, where p refers to the number of genes and n refers to the number of samples. In addition to the multiplicity issue mentioned before, another potential problem is that if the data are not normally distributed, applying a t-test can be invalid when the sample size is small (ref. 1). However, this problem is not the focus of the current primer, in which the data in our example are assumed to be normally distributed.

What is a hierarchical model?

One statistical tool for handling large-p, small-n problems is a hierarchical model. Such a model describes hierarchical relationships between various sources of data variation. The model structure effectively makes it possible to 'borrow' information from all genes to make more reliable statistical inferences about a particular gene. Hierarchical models are conceptually related to regularization techniques, which include Lasso and ridge regression and represent a broad class of methods for handling large-p, small-n problems (reviewed in refs. 2,3).

In our example, a hierarchical model can be built by assuming that the unobserved mean and variance parameters (that is, μi and σi2) of different genes are also sampled from a distribution (denoted as F0). The distribution is characterized by parameters, such as mean and variance of infinitely many μi and σi2 hypothetically collected from different genes. Accordingly, one can imagine that the observed expression data are generated hierarchically by first drawing the mean and variance parameters for each gene from F0, and then drawing expression measurements for each gene from a gene-specific distribution (that is, a normal distribution with mean μi and variance σi2) (Fig. 1b).
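The two-stage generative process can be sketched in a few lines. The text does not specify the form of F0, so the choices below are assumptions made purely for illustration: gene means are drawn from a normal distribution and gene variances from an inverse-gamma distribution with mean 1. The function name and all numeric settings are hypothetical.

```python
import random


def simulate_hierarchy(n_genes, n_reps, seed=0):
    """Simulate expression data from a two-level hierarchical model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_genes):
        # Top level: draw gene-specific parameters from an assumed F0.
        mu = rng.gauss(0.0, 0.5)
        # 1 / Gamma(shape=5, scale=1/4) is inverse-gamma with mean 4 / (5 - 1) = 1.
        sigma2 = 1.0 / rng.gammavariate(5.0, 1.0 / 4.0)
        # Bottom level: draw replicate measurements for this gene.
        reps = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_reps)]
        data.append({"mu": mu, "sigma2": sigma2, "x": reps})
    return data


genes = simulate_hierarchy(n_genes=10, n_reps=3)
for i, gene in enumerate(genes, start=1):
    print(i, round(gene["mu"], 2), round(gene["sigma2"], 2),
          [round(v, 2) for v in gene["x"]])
```

In a real analysis only the replicate values are observed; the per-gene mean and variance stored alongside them here are exactly the quantities that inference tries to recover.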

Naturally, this model describes two sources of variation in the observed expression data. At the top of the hierarchy, the intrinsic similarities and differences between the expression of different genes are mathematically modeled using a distribution (that is, F0) of the unobserved gene-specific parameters. At the bottom of the hierarchy, the cross-sample variability within a single gene is modeled using a gene-specific distribution with parameters generated from the top-level distribution (Fig. 1c). In effect, the top-level distribution describes which gene-specific parameter values are common and which are unusual. The data contain information about the distributions at both levels because there are several replicates for each gene over many different genes.

Although the top-level distribution is usually unknown, it can be estimated using data from the thousands of genes available. Then, using this distribution, the hierarchical model allows one to 'borrow' information across genes to facilitate inference. How much information to borrow is determined by how similar the genes are relative to the cross-sample variability. The intuition is that if the heterogeneity across genes is small, then data from all genes could be informative about the parameters of a particular gene (Fig. 1c). Borrowing information across genes essentially increases the effective sample size for making inferences about individual genes (ref. 4). In contrast, the t-statistic approach only uses information from a single gene to estimate the mean and variance of the bottom-level distribution for that gene.

Inference using the hierarchical model

The first step in using the hierarchical model is to find a top-level distribution that fits the data (Fig. 1d). This process can be intuitively interpreted as learning the cross-gene heterogeneity from the data. The top-level distribution is usually assumed to be a member of a broad family of distributions. In other words, a large number of candidate distributions with the same mathematical form but different parameter values are considered. By varying the parameter values, members in the family are able to describe a variety of distribution patterns of the gene-specific parameters. The analysis starts by finding the distribution (through identifying the parameter value) in the family that fits the data well, and then using the identified distribution to help infer the gene-specific parameters. Commonly used top-level distribution families include 'conjugate priors' and mixtures of simple distributions (e.g., a mixture of normal distributions) (ref. 5). The former is typically used if developing a simple computational algorithm is required, and the latter is used if one needs flexibility to describe very complex cross-gene variation patterns.

Next, the top-level distribution is used to adjust the parameter estimate of every gene (Fig. 1e). If cross-gene heterogeneity is small, the adjustment will make the parameter estimates of different genes more similar to each other. Here, the hierarchical model borrows from Bayesian inference, a general approach for making statistical inferences by combining prior information with observed data (refs. 5,6), with the top-level distribution being treated as the prior knowledge about the unobserved mean and variance parameters of individual genes.

Algorithmically, finding the top-level distribution and inferring gene-specific parameters can be implemented simultaneously using standard Bayesian or empirical Bayes techniques (refs. 5,6). This sometimes requires advanced and computation-intensive methods such as Markov chain Monte Carlo (ref. 5).

In our example, applying the hierarchical model yields a new estimate of the variance parameter of a gene. The new estimate of σi2 is a weighted average between the sample variance si2 and the estimated mean variance of all genes (that is, the mean of all variances σ2 in the estimated F0, also denoted as σ02) (ref. 7). The sample variance is an estimate of σi2 based on gene i's data, and σ02 represents a shared property of all genes. These two pieces of information are combined using a weight determined automatically by comparing the magnitude of cross-gene variation (with respect to σi2) with that of the within-gene sampling variability (with respect to si2). If the variability among genes is low relative to the sampling variability within a gene, the mean variance σ02 will receive a high weight. On the other hand, if the cross-gene variation is high compared to the within-gene sampling variability, more weight will be given to si2.
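The weighted average can be sketched as follows. This is not necessarily the estimator used in ref. 7. It is a minimal illustration assuming a common conjugate-prior form in which the weight on the shared variance is d0 / (d0 + d), where d = n − 1 is the within-gene degrees of freedom and d0 is a 'prior degrees of freedom' that expresses how similar the genes are. In practice d0 would be estimated from the data. Here it is a hypothetical fixed input, σ02 is approximated by the mean of the sample variances, and the sample variances themselves are made up.

```python
import math
import statistics


def shrink_variances(sample_vars, n_reps, prior_df):
    """Shrink each sample variance toward the mean variance across genes.

    prior_df is an assumed, user-supplied value: larger values mean that
    genes are believed to be more similar, so the shared variance gets
    more weight.
    """
    d = n_reps - 1                          # within-gene degrees of freedom
    s0_sq = statistics.fmean(sample_vars)   # simple stand-in for sigma_0^2
    w = prior_df / (prior_df + d)           # weight on the shared variance
    return [w * s0_sq + (1.0 - w) * s2 for s2 in sample_vars]


def adjusted_t(mean, adj_var, n_reps):
    """t-like statistic using the adjusted variance instead of s_i^2."""
    return mean / math.sqrt(adj_var / n_reps)


# Hypothetical sample variances for four genes with three replicates each.
sample_vars = [0.005, 1.2, 0.9, 1.5]
adjusted = shrink_variances(sample_vars, n_reps=3, prior_df=4.0)
for s2, a in zip(sample_vars, adjusted):
    print(s2, round(a, 3))
```

Setting prior_df to 0 recovers the original sample variances, whereas a very large prior_df makes every gene share nearly the same variance. The data-driven weight described in the text sits between these two extremes.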

The new estimates shift the sample variances toward the common population mean of σi2 and pull variances that are small by chance away from zero. Compared with the old estimates si2, the sum of squared errors of the new estimates relative to the true values is much smaller (3.50 versus 19.46). When the sample variances si2 in the t-statistics are replaced by the new estimates, the new t-statistics correctly rank gene 10 before gene 2 (Table 1). This weighted average technique to estimate the variance is called variance stabilization. It is widely used in analyses of gene expression microarrays (refs. 4,8) and chromatin immunoprecipitation on tiling microarrays (ChIP-chip; ref. 7) to detect differentially expressed genes and protein-DNA binding sites, respectively. Naturally, real microarray experiments are more complicated and contain more sources of variation than our example; thus, they can benefit from more sophisticated hierarchical models that capture those types of variation.

The validity of model assumptions, such as those on the hierarchical structure and the distributions at the top and bottom levels, is crucial for the successful application of hierarchical models. When the assumptions hold true, the model brings additional power. Otherwise, the model may not use the information optimally, or may introduce bias that leads to misleading results. Therefore, it is always wise to check the model assumptions by exploring characteristics of the raw data and testing the analysis results using independent information or cross-validation (ref. 2).

Other applications

Hierarchical models can be applied to many other problems besides gene expression microarrays and ChIP-chip. For example, in genome-wide association studies, hundreds of thousands of SNPs are tested for association with a phenotype. In a simple scenario, the association can be studied in a linear regression “phenotype = αi + βi * genotype + noise,” where a nonzero coefficient βi (i indexes SNPs) indicates association. With a limited number of samples and many SNPs to evaluate, this approach often lacks the power to distinguish relevant SNPs from random associations. Because SNPs with similar characteristics, such as those that reside in genes in the same pathway or that show a similar degree of evolutionary conservation, have similar potentials to be associated with the phenotype, one can build a hierarchical model to borrow information from similar SNPs to increase the statistical power of association studies (ref. 9). To use this information, one can assume that βi from different SNPs follow a top-level distribution N(μ + η*zi, τ2), where zi is an observed characteristic of SNP i, such as conservation score. Here, μ + η*zi describes the relationship between a SNP's characteristic and its potential association with the phenotype, and τ2 describes the heterogeneity among SNPs with the same characteristic. The model can be generalized to incorporate multiple characteristics. One can use data from all SNPs to estimate this top-level distribution (that is, μ, η, τ2), and make an inference based on new estimates of βi that combine the top-level distribution with the SNP-specific data.
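One way to picture the final combining step is the standard normal-normal update, sketched below. It assumes that the per-SNP regression estimate of βi is approximately normal with a known standard error, and that μ, η and τ2 have already been estimated from all SNPs. The function name, argument names and numeric values are hypothetical, and ref. 9 may use a different procedure.

```python
def shrink_beta(beta_hat, se, z, mu, eta, tau2):
    """Posterior mean of beta_i under the prior N(mu + eta * z, tau2).

    beta_hat and se are the SNP-specific regression estimate and its
    standard error; mu, eta and tau2 describe the top-level distribution
    and are treated as known here.
    """
    prior_mean = mu + eta * z
    # Weight on the SNP-specific estimate: high when the estimate is
    # precise (small se) or when SNPs are heterogeneous (large tau2).
    w = tau2 / (tau2 + se ** 2)
    return w * beta_hat + (1.0 - w) * prior_mean


# Hypothetical SNP: noisy estimate 0.8, conservation score z = 2.0.
print(shrink_beta(beta_hat=0.8, se=0.5, z=2.0, mu=0.1, eta=0.2, tau2=0.25))
```

The structure mirrors the variance example: the adjusted estimate is a weighted average of locus-specific data and a value predicted from characteristics shared with similar loci.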

Application of hierarchical models is not limited to large-p-small-n data. The models are useful in a broad spectrum of large-p problems where the amount of information per locus is limited, with small sample size being a special case. For example, predicting transcription factor binding sites in DNA sequences can be viewed as a problem that probabilistically assigns a 0–1 label to each locus by matching the sequence to a motif model as opposed to a background model. If the sequences are long, there could be random matches to the motif, which leads to false-positive predictions. However, functional transcription factor binding sites tend to cluster in the genome to form regulatory modules. One can build a hierarchical model by assuming that the input sequences consist of background and modules, and the modules in turn consist of background and binding sites, hence binding sites only occur within modules; given the binding site locations, nucleotides are generated according to either the motif or background probability models. Using this hierarchical model, one can first infer the top-level module status by checking sequences from nearby genomic loci, and then combine the module status as prior and the DNA sequence at each locus to infer its binding status. The module status estimated using information across loci helps eliminate many false-positive binding site predictions. In ref. 10, it was shown that the improved estimates of binding site locations increase the power of de novo motif discovery.

We conclude by providing two other examples where hierarchical models might be useful yet have not been fully explored. First, if you want to estimate the fold enrichment at ChIP-seq binding loci, but each ChIP and control library has only one replicate, sequenced at shallow depth, you may estimate a more robust background read count at one locus by borrowing information from other loci. Second, if you want to estimate the binding motif matrices for several transcription factors in the same protein family, but have only a handful of known binding sites for each factor, you can estimate more robust motif matrices by borrowing information across the family. What are other examples? Looking at your own data might reveal the answer.