Analyzing 'omics data using hierarchical models

Ji, Hongkai; Liu, X Shirley

doi:10.1038/nbt.1619

Primer
Published: April 2010

Nature Biotechnology volume 28, pages 337–340 (2010)

Hierarchical models provide reliable statistical estimates for data sets from high-throughput experiments where measurements vastly outnumber experimental samples.

Interpreting 'omics data often involves statistical analysis of large numbers of loci such as genes, binding sites or single-nucleotide polymorphisms (SNPs). Although the data set as a whole may be rich in information, each individual locus is typically only associated with a limited amount of data. Statistical inference in this context is challenging. A hierarchical model is a useful statistical tool to more efficiently analyze the data, and it is increasingly being used in computational genomics.

A motivating example

Consider a hypothetical microarray experiment with ten genes. For each gene, log₂ expression fold-changes (hereafter referred to as simply 'expression') are observed between tumor and normal tissues in three biological replicates (Table 1). To select a gene for follow-up study that is differentially expressed in tumor compared with normal cells, which gene should be the top candidate?

Table 1 Statistical analysis of example data using either t-statistics or a hierarchical model

A simple solution is to rank the genes by t-statistics:

t_i = x̄_i / √(s_i² / n)

Here n (= 3) is the number of replicates, x̄_i is the average expression of gene i, and s_i² is the sample variance. Based on the absolute values of t-statistics, gene 2 is the top candidate (Table 1).
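As a minimal sketch, the per-gene t-statistic can be computed as follows, assuming the usual one-sample form in which the average expression is divided by its estimated standard error, √(s_i²/n). The function name and the replicate values are hypothetical and are not taken from Table 1.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t-statistic for testing whether the true mean is zero."""
    n = len(values)
    mean = statistics.mean(values)
    s2 = statistics.variance(values)  # unbiased sample variance (divides by n - 1)
    return mean / math.sqrt(s2 / n)


# Hypothetical log2 fold-changes for one gene in three replicates.
replicates = [1.0, 2.0, 3.0]
print(t_statistic(replicates))  # about 3.4641, i.e. 2 / sqrt(1 / 3)
```

Ranking genes then amounts to sorting them by the absolute value of this quantity.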

The data in this example, however, are simulated, with each gene having a 'true' expression μ_i whose measurement is confounded by experimental or biological variability represented by the parameter σ_i². (In fact, each expression measurement was randomly drawn from a bell curve–shaped normal distribution with a mean μ_i and variance σ_i²). The true values of μ_i and σ_i², which are unknown to you, are shown in Table 1. It turns out the only truly differentially expressed gene is gene 10, which has a nonzero μ_i. Gene 2 thereby represents a false-positive call.

What causes this mistake? Small sample size and the multiplicity of the problem are the reasons. To understand why, it may be helpful to briefly review the key ideas behind statistical inference. The first concept to understand is that of the 'distribution'. Briefly, in the presence of biological or experimental noise and variability, repeated biological measurements are unlikely to be identical, giving rise to a collection, or distribution, of data values. This distribution can be characterized by parameters, such as its mean (or average value) and variance (which quantifies how far the measurements are expected to be from the mean). The parameters are properties associated with infinitely many measurements. In a real scenario, when only a finite number of measurements are available, the true parameter value cannot be obtained. Statistical inference seeks to make statements about the true, also referred to as 'unobserved', parameter value based on the observed data, which statisticians call 'samples' drawn from the distribution.

In a t-statistic, the sample mean x̄_i represents an estimate of the true mean μ_i of the distribution from which gene i's data are sampled, and the sample variance s_i² represents an estimate of the true variance σ_i². If the true mean is zero (that is, gene i is not differentially expressed), one is unlikely to obtain a t-statistic with a large magnitude.

When the sample size is small, however, the observed sample variance is an unreliable estimate of the true variance of the system. To see why, imagine randomly selecting three data points from a normal distribution with mean 0 and variance 1, which results in the values 0.1, 0.09 and 0.11 (Fig. 1a, blue dots). As a result, the observed variance is 0.0001 (or approximately 0) even though the true variance is 1 (that is, much bigger than 0). Another random draw of three data points from the same distribution may give you −1.1, −0.2 and 0.7 (Fig. 1a, orange dots) and a totally different observed variance of 0.81. Although the probability that the observed variance significantly deviates from the true variance is small for each individual gene, in a genomic study with many genes, the chance of encountering such deviations for some genes is high.
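The two sample variances quoted above can be checked directly. The short sketch below uses the unbiased sample variance (dividing by n − 1); the variable names are chosen here for illustration only.

```python
import statistics

# The two hypothetical draws of three points described in the text.
blue = [0.1, 0.09, 0.11]
orange = [-1.1, -0.2, 0.7]

var_blue = statistics.variance(blue)      # sum of squared deviations / (n - 1)
var_orange = statistics.variance(orange)

print(round(var_blue, 6))    # 0.0001
print(round(var_orange, 6))  # 0.81
```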

Small sample variances obtained by chance give rise to large t-statistics, which can incorrectly rank nondifferentially expressed genes at the top. This is what happened in our example. The true variance of gene 2 is 1, but the sample variance is 0.005 (Table 1); as a result, the t-statistic incorrectly ranked gene 2 (t₂ = 17.5) on top of the truly differentially expressed gene 10 (t₁₀ = 3.42). In general, when data analysis involves estimating many parameters or testing many hypotheses but the sample size is small, it is difficult to reliably estimate all parameters or to make correct decisions for all tests simultaneously. This problem is less serious if more samples are available, as more reliable estimates of parameters can be obtained for each gene.
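The effect of multiplicity can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the simulation behind Table 1: every gene is null (true mean 0, true variance 1), each has three replicates, and the number of genes and the random seed are arbitrary choices. The threshold 17.5 is the t-statistic reported for gene 2 in the text.

```python
import math
import random
import statistics


def simulate_null_t(num_genes, n=3, seed=0):
    """Return (sample variance, t-statistic) pairs for genes with true mean 0."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_genes):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s2 = statistics.variance(x)
        t = statistics.mean(x) / math.sqrt(s2 / n)
        results.append((s2, t))
    return results


results = simulate_null_t(10_000)
smallest_s2 = min(s2 for s2, _ in results)
largest_abs_t = max(abs(t) for _, t in results)
num_extreme = sum(abs(t) > 17.5 for _, t in results)

print("smallest sample variance:", smallest_s2)
print("largest |t| among null genes:", largest_abs_t)
print("null genes with |t| > 17.5:", num_extreme)
```

Even though no gene is differentially expressed, with this many genes some sample variances land very close to zero and the corresponding t-statistics become very large.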

Real gene expression microarray experiments with tens of thousands of genes are examples of a 'large p, small n' problem, where p refers to the number of genes and n refers to the number of samples. In addition to the multiplicity issue mentioned before, another potential problem is that if the data are not normally distributed, applying a t-test can be invalid when the sample size is small¹. However, this problem is not the focus of the current primer, in which the data in our example are assumed to be normally distributed.

What is a hierarchical model?

One statistical tool for handling large-p, small-n problems is a hierarchical model. Such a model describes hierarchical relationships between various sources of data variation. The model structure effectively makes it possible to 'borrow' information from all genes to make more reliable statistical inferences about a particular gene. Hierarchical models are conceptually related to regularization techniques, which include Lasso and ridge regression and represent a broad class of methods for handling large-p, small-n problems (reviewed in refs. 2,3).

In our example, a hierarchical model can be built by assuming that the unobserved mean and variance parameters (that is, μ_i and σ_i²) of different genes are also sampled from a distribution (denoted as F₀). The distribution is characterized by parameters, such as mean and variance of infinitely many μ_i and σ_i² hypothetically collected from different genes. Accordingly, one can imagine that the observed expression data are generated hierarchically by first drawing the mean and variance parameters for each gene from F₀, and then drawing expression measurements for each gene from a gene-specific distribution (that is, a normal distribution with mean μ_i and variance σ_i²) (Fig. 1b).
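The two-stage generative process can be sketched in a few lines. The text does not specify the form of F₀, so the choices below are purely illustrative assumptions: a gene is differentially expressed with probability `prob_de`, nonzero means are drawn from a normal distribution, and variances are drawn from an exponential distribution. All function and parameter names are hypothetical.

```python
import random


def draw_gene_parameters(rng, prob_de=0.1, effect_sd=2.0, mean_variance=1.0):
    """Top level: draw (mu_i, sigma2_i) for one gene from an assumed F0."""
    mu = rng.gauss(0.0, effect_sd) if rng.random() < prob_de else 0.0
    # Exponential distribution with mean `mean_variance` (rate = 1 / mean).
    sigma2 = rng.expovariate(1.0 / mean_variance)
    return mu, sigma2


def draw_expression(rng, mu, sigma2, n=3):
    """Bottom level: draw n replicate measurements from N(mu, sigma2)."""
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]


def simulate(num_genes=10, n=3, seed=1):
    """Generate data hierarchically: parameters first, then measurements."""
    rng = random.Random(seed)
    genes = []
    for _ in range(num_genes):
        mu, sigma2 = draw_gene_parameters(rng)
        genes.append({"mu": mu, "sigma2": sigma2,
                      "x": draw_expression(rng, mu, sigma2, n)})
    return genes


for gene in simulate():
    print(gene)
```

In a real analysis only the measurements `x` are observed; `mu` and `sigma2` are the unobserved quantities to be inferred.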

Naturally, this model describes two sources of variation in the observed expression data. At the top of the hierarchy, the intrinsic similarities and differences between the expression of different genes are mathematically modeled using a distribution (that is, F₀) of the unobserved gene-specific parameters. At the bottom of the hierarchy, the cross-sample variability within a single gene is modeled using a gene-specific distribution with parameters generated from the top-level distribution (Fig. 1c). In effect, the top-level distribution describes which gene-specific parameter values are common and which are unusual. The data contain information about the distributions at both levels because there are several replicates for each gene over many different genes.

Although the top-level distribution is usually unknown, it can be estimated using data from the thousands of genes available. Then, using this distribution, the hierarchical model allows one to 'borrow' information across genes to facilitate inference. How much information to borrow is determined by how similar the genes are relative to the cross-sample variability. The intuition is that if the heterogeneity across genes is small, then data from all genes could be informative about the parameters of a particular gene (Fig. 1c). Borrowing information across genes essentially increases the effective sample size for making inferences about individual genes⁴. In contrast, the t-statistic approach only uses information from a single gene to estimate the mean and variance of the bottom-level distribution for that gene.

Inference using the hierarchical model

The first step in using the hierarchical model is to find a top-level distribution that fits the data (Fig. 1d). This process can be intuitively interpreted as learning the cross-gene heterogeneity from the data. The top-level distribution is usually assumed to be a member of a broad family of distributions. In other words, a large number of candidate distributions with the same mathematical form but different parameter values are considered. By varying the parameter values, members in the family are able to describe a variety of distribution patterns of the gene-specific parameters. The analysis starts by finding the distribution (through identifying the parameter value) in the family that fits the data well, and then using the identified distribution to help infer the gene-specific parameters. Commonly used top-level distribution families include 'conjugate priors' and mixtures of simple distributions (e.g., mixture of normal distributions)⁵. The former is typically used if developing a simple computational algorithm is required, and the latter is used if one needs flexibility to describe very complex cross-gene variation patterns.

Next, the top-level distribution is used to adjust the parameter estimate of every gene (Fig. 1e). If cross-gene heterogeneity is small, the adjustment will make the parameter estimates of different genes more similar to each other. Here, the hierarchical model borrows from Bayesian inference, a general approach to making statistical inferences by combining prior information with observed data⁵,⁶, with the top-level distribution being treated as the prior knowledge about the unobserved mean and variance parameters of individual genes.

Algorithmically, finding the top-level distribution and inferring gene-specific parameters can be implemented simultaneously using standard Bayesian or empirical Bayes techniques⁵,⁶, which sometimes require advanced and computation-intensive methods such as Markov chain Monte Carlo⁵.

In our example, applying the hierarchical model yields a new estimate of the variance parameter of a gene. The new estimate of σ_i² is a weighted average between the sample variance s_i² and the estimated mean variance of all genes (that is, the mean of all variances σ² in the estimated F₀, also denoted as σ₀²) (ref. 7). The sample variance is an estimate of σ_i² based on gene i's data, and σ₀² represents a shared property of all genes. These two pieces of information are combined using a weight determined automatically by comparing the magnitude of cross-gene variation (with respect to σ_i²) with that of the within-gene sampling variability (with respect to s_i²). If the variability among genes is low relative to the sampling variability within a gene, the mean variance σ₀² will receive a high weight. On the other hand, if the cross-gene variation is high compared to the within-gene sampling variability, more weight will be given to s_i².
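The weighted average can be sketched as follows. The text does not give the exact weight formula, so the moment-based weight below is an assumption of this sketch and is not necessarily the estimator of ref. 7: for normal data with d = n − 1 degrees of freedom, the average within-gene sampling variance of s_i² is estimated as 2·mean(s_i⁴)/(d + 2), and the weight on σ₀² is the ratio of that quantity to the observed spread of the s_i² across genes, capped at 1. The function name and input values are hypothetical.

```python
import statistics


def shrink_variances(sample_vars, n):
    """Shrink per-gene sample variances toward their common mean.

    Returns (shrunk variances, weight on the common mean, common mean).
    """
    d = n - 1                                   # degrees of freedom per gene
    sigma0_sq = statistics.mean(sample_vars)    # estimate of the shared mean variance
    mean_sq = statistics.mean([v * v for v in sample_vars])
    sampling = 2.0 * mean_sq / (d + 2)          # estimated within-gene sampling variance of s_i^2
    spread = statistics.pvariance(sample_vars)  # observed variation of s_i^2 across genes
    # High sampling noise relative to the spread means more weight on the common mean.
    weight = 1.0 if spread == 0 else min(1.0, sampling / spread)
    shrunk = [weight * sigma0_sq + (1.0 - weight) * v for v in sample_vars]
    return shrunk, weight, sigma0_sq


# Hypothetical sample variances for five genes with three replicates each.
sample_vars = [0.005, 0.5, 1.0, 1.5, 2.0]
shrunk, weight, sigma0_sq = shrink_variances(sample_vars, n=3)
print(weight, sigma0_sq)
print(shrunk)
```

Replacing s_i² in the t-statistic with the shrunken value pulls variances that are small by chance away from zero, which is what tempers the inflated t-statistics.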

The new estimates shift the sample variances toward the common population mean of σ_i², and pull variances that are small by chance away from zero. Compared with the old estimates s_i², the sum of squared error of the new estimates from the true values is much smaller (3.50 versus 19.46). When the sample variances s_i² in the t-statistics are replaced by the new estimates, the new t-statistics correctly rank gene 10 before gene 2 (Table 1). This weighted average technique to estimate the variance is called variance stabilization. It is widely used in analyses of gene expression microarrays⁴,⁸ and chromatin immunoprecipitation on tiling microarrays (ChIP-chip)⁷ to detect differentially expressed genes and protein-DNA binding sites, respectively. Naturally, real microarray experiments are more complicated and contain more sources of variation than our example; thus, they can benefit from more sophisticated hierarchical models that capture those types of variation.

The validity of model assumptions, such as those on the hierarchical structure and the distributions at the top and bottom levels, is crucial for the successful application of hierarchical models. When the assumptions hold true, the model brings additional power. Otherwise, the model may not use the information optimally, or may introduce bias that leads to misleading results. Therefore, it is always wise to check the model assumptions by exploring characteristics of the raw data and testing the analysis results using independent information or cross-validation².

Other applications

Hierarchical models can be applied to many other problems besides gene expression microarrays and ChIP-chip. For example, in genome-wide association studies, hundreds of thousands of SNPs are tested for association with a phenotype. In a simple scenario, the association can be studied in a linear regression “phenotype = α_i + β_i × genotype + noise,” where a nonzero coefficient β_i (i indexes SNPs) indicates association. With a limited number of samples and many SNPs to evaluate, this approach often lacks the power to distinguish relevant SNPs from random associations. Because SNPs with similar characteristics, such as those that reside in genes in the same pathway or that show a similar degree of evolutionary conservation, have similar potentials to be associated with the phenotype, one can build a hierarchical model to borrow information from similar SNPs to increase the statistical power of association studies⁹. To use this information, one can assume that β_i from different SNPs follow a top-level distribution N(μ + η × z_i, τ²), where z_i is an observed characteristic of SNP i, such as conservation score. Here, μ + η × z_i describes the relationship between a SNP's characteristic and its potential association with the phenotype, and τ² describes the heterogeneity among SNPs with the same characteristic. The model can be generalized to incorporate multiple characteristics. One can use data from all SNPs to estimate this top-level distribution (that is, μ, η, τ²), and make an inference based on new estimates of β_i that combine the top-level distribution with the SNP-specific data.
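To make the last step concrete, the sketch below combines a SNP-specific estimate with the top-level distribution under two simplifying assumptions that are not stated in the text: the regression estimate for SNP i is treated as normally distributed around the true β_i with a known standard error, and the top-level parameters μ, η and τ² are taken as already estimated. Under those assumptions the posterior mean of β_i is a precision-weighted average. The function name and all numerical values are hypothetical.

```python
def posterior_mean_beta(beta_hat, se, z, mu, eta, tau2):
    """Posterior mean of beta_i when beta_hat ~ N(beta_i, se^2)
    and beta_i ~ N(mu + eta * z, tau2)."""
    prior_mean = mu + eta * z
    # Weight on the SNP-specific estimate: large when the prior is diffuse
    # (tau2 large) or the estimate is precise (se small).
    w = tau2 / (tau2 + se ** 2)
    return w * beta_hat + (1.0 - w) * prior_mean


# A noisy estimate of 1.0 for a SNP whose characteristic predicts a small effect.
print(posterior_mean_beta(beta_hat=1.0, se=1.0, z=0.5, mu=0.0, eta=0.2, tau2=1.0))  # about 0.55
```

When τ² is small relative to the squared standard error, the new estimate is dominated by μ + η × z_i; when τ² is large, it stays close to the SNP-specific estimate.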

Application of hierarchical models is not limited to large-p-small-n data. The models are useful in a broad spectrum of large-p problems where the amount of information per locus is limited, with small sample size being a special case. For example, predicting transcription factor binding sites in DNA sequences can be viewed as a problem that probabilistically assigns a 0–1 label to each locus by matching the sequence to a motif model as opposed to a background model. If the sequences are long, there could be random matches to the motif, which leads to false-positive predictions. However, functional transcription factor binding sites tend to cluster in the genome to form regulatory modules. One can build a hierarchical model by assuming that the input sequences consist of background and modules, and the modules in turn consist of background and binding sites, hence binding sites only occur within modules; given the binding site locations, nucleotides are generated according to either the motif or background probability models. Using this hierarchical model, one can first infer the top-level module status by checking sequences from nearby genomic loci, and then combine the module status as prior and the DNA sequence at each locus to infer its binding status. The module status estimated using information across loci helps eliminate many false-positive binding site predictions. In ref. 10, it was shown that the improved estimates of binding site locations increase the power of de novo motif discovery.
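The way module status acts as a prior can be illustrated with Bayes' rule at a single locus. This is a deliberately simplified sketch, not the algorithm of ref. 10: it treats the locus in isolation, takes the module probability inferred from nearby loci as given, and summarizes the sequence match by a likelihood ratio of motif to background. All names and numbers are hypothetical.

```python
def binding_posterior(likelihood_ratio, p_module, p_site_given_module):
    """Posterior probability that a locus is a binding site.

    Sites occur only inside modules, so the prior probability of a site
    is P(module) * P(site | module).
    """
    prior = p_module * p_site_given_module
    numerator = prior * likelihood_ratio
    return numerator / (numerator + (1.0 - prior))


# The same motif match (likelihood ratio 100) in two genomic contexts.
inside = binding_posterior(100.0, p_module=0.9, p_site_given_module=0.05)
outside = binding_posterior(100.0, p_module=0.01, p_site_given_module=0.05)
print(inside)   # about 0.82
print(outside)  # about 0.048
```

An identical sequence match is therefore accepted when the surrounding region looks like a module and discounted when it does not, which is how the top-level information suppresses isolated false-positive matches.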

We conclude by providing two other examples where hierarchical models might be useful yet have not been fully explored. First, if you want to estimate the fold enrichment at ChIP-seq binding loci, but each ChIP and control library has only one replicate, sequenced at shallow depth, you may estimate a more robust background read count at one locus by borrowing information from other loci. Second, if you want to estimate the binding motif matrices for several transcription factors in the same protein family, but have only a handful of known binding sites for each factor, you can estimate more robust motif matrices by borrowing information across the family. What are other examples? Looking at your own data might reveal the answer.

References

1. Ramsey, F.L. & Schafer, D.W. The Statistical Sleuth: A Course in Methods of Data Analysis (Duxbury/Thomson Learning; 2002).
2. Hastie, T., Tibshirani, R. & Friedman, J.H. The Elements of Statistical Learning, edn. 2 (Springer; 2009).
3. Tibshirani, R. J. Roy Stat. Soc. B 58, 267–288 (1996).
4. Smyth, G.K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, 3 (2004).
5. Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. Bayesian Data Analysis, edn. 2 (Chapman & Hall/CRC; 2004).
6. Beaumont, M.A. & Rannala, B. Nat. Rev. Genet. 5, 251–261 (2004).
7. Ji, H. & Wong, W.H. Bioinformatics 21, 3629–3636 (2005).
8. Sartor, M.A. et al. BMC Bioinformatics 7, 538 (2006).
9. Chen, G.K. & Witte, J.S. Am. J. Hum. Genet. 81, 397–404 (2007).
10. Zhou, Q. & Wong, W.H. Proc. Natl. Acad. Sci. USA 101, 12114–12119 (2004).


Author information

Authors and Affiliations

Hongkai Ji is in the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA.
X. Shirley Liu is in the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, Massachusetts, USA.

Corresponding authors

Correspondence to Hongkai Ji or X Shirley Liu.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Ji, H., Liu, X. Analyzing 'omics data using hierarchical models. Nat Biotechnol 28, 337–340 (2010). https://doi.org/10.1038/nbt.1619
