Finding correlations in big data

Journal name: Nature Biotechnology
Volume: 30
Pages: 334–335
DOI: doi:10.1038/nbt.2182

A new statistical method called MIC can find diverse types of correlations in large data sets. Nature Biotechnology asked eight experts to weigh in on its utility.

Introduction

In today's era of large data sets, statistical methods that facilitate exploratory analyses to detect patterns and generate hypotheses are critical to progress in biology. Last year, David Reshef and colleagues published a new approach to such analysis, called the maximal information coefficient, or MIC (Science 334, 1518–1524, 2011). Nature Biotechnology solicited comments from several practitioners versed in data-intensive biological research. Their responses not only highlight the appeal of methods like MIC for biological research, but also raise some important reservations about its widespread use and statistical power.

What is MIC?

Gustavo Stolovitzky, manager, Functional Genomics and Systems Biology, IBM Computational Biology Center, IBM, Yorktown Heights, New York.

Gustavo Stolovitzky: MIC is a quantity between 0 and 1 that measures the association between a pair of variables. If one variable deterministically dictates the value of another variable, the MIC coefficient will be 1. But if noise influences the relationship, the MIC will be smaller than 1, with larger amounts of noise resulting in a bigger deviation of MIC from 1.

Peng Qiu, assistant professor, Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, Texas.

Peng Qiu: When examining many potential pair-wise associations between variables measured in a large data set, MIC produces a ranked list of pairs ordered by the strength of associations.

Why is this approach important?

Eran Segal, associate professor, Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.

Eran Segal: The main attraction of MIC is its ability to identify diverse types of relationships between variables without a priori favoring one relationship over the other. In contrast, commonly used statistics, such as the Pearson correlation, are either limited in the relationships they can identify (e.g., linear) or they favor certain types of relationships.

Bill Noble, professor, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, Seattle, Washington.

Bill Noble: Methods for detecting nonlinear relationships are most useful in the context of exploratory knowledge discovery from large data sets, when the structure of the data itself is not yet well understood.

PQ: The most appealing property of MIC is 'equitability', meaning that it is equally sensitive to different types of association relationships (linear and nonlinear).

How important is 'equitability'?

PQ: Equitability ensures that the top of the list of ranked associations is not dominated or biased by certain types of associations. As a simple comparison, if Pearson correlation is used to rank-order pair-wise associations, the top of the list will be dominated by linear relationships, and many types of nonlinear associations may receive insignificant rankings.
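The point about Pearson correlation can be illustrated with a minimal sketch. The data below are synthetic and chosen purely for illustration (they are not from the article): a noiseless straight line and a noiseless parabola, both fully determined by x. Pearson correlation scores the line at essentially 1 and the parabola at essentially 0, so the parabola would sink to the bottom of a Pearson-ranked list despite being a perfect association.

```python
import numpy as np

# Evenly spaced x values, symmetric around zero (illustrative data only).
x = np.linspace(-1.0, 1.0, 201)

y_linear = 2.0 * x + 1.0   # noiseless linear relationship
y_parabola = x ** 2        # noiseless but nonlinear, non-monotonic relationship


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_linear = pearson(x, y_linear)
r_parabola = pearson(x, y_parabola)

print(f"linear:   r = {r_linear:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

An equitable measure, in the sense described above, would be expected to give both of these noiseless relationships similar scores.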

GS: The MIC equitability property ensures that the least noisy, associated pairs will be ranked higher regardless of the specific nature of the association, followed by pairs whose noise strength increases as we go down the list.

When might equitability be useful?

Olga Troyanskaya, associate professor, Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey.

Olga Troyanskaya: The most common situation is in analyzing a collection of data sets, where many different types of pair-wise relationships might occur. Another reason to use such metrics is when comparing measurements at different biological levels of control (e.g., epigenetic and transcriptional regulation) or between different biomolecules (e.g., proteins and metabolites), especially if there's a reason to suspect highly nonlinear relationships between these variables.

ES: MIC might be able to identify relationships that are the result of a superposition of functions. This is particularly useful in uncovering relationships that are determined by multiple distinct factors, for example, in the regulation of gene expression.

What types of biological data would be amenable to MIC analysis?

ES: MIC could be applied to analyze the relationship between the expression of a repressor and its target gene, as the dependence of both on global cellular resources such as polymerases enforces a positive relationship, whereas a negative relationship is enforced by their specific interaction. Thus, the positive and negative relationships are superimposed to produce a more complex relationship.

BN: MIC or related methods could be used for identifying pairwise relationships between transcription factors. For example, given a large collection of candidate regulatory regions identified by an open chromatin assay, such as DNaseI hypersensitivity, one could characterize each region with a vector of random variables indicating the presence of known regulatory motifs. MIC might then be used to identify pairs of factors that correlate with one another or with measured mRNA levels of proximal genes.

How could MIC be applied to proteomics data?

GS: Suppose we wish to infer causal connections in Escherichia coli between regulatory proteins and their target genes by perturbing the organism with different stresses, nutrients and chemical compounds (typically n = 50–1,000 perturbations). For each perturbation, we quantitatively measure the transcriptome and the proteome of the cells. To identify a regulatory network of proteins and transcripts that are causally linked, we use MIC to study their joint response to the perturbations, resulting in a list of all the pairs (protein concentration, transcript concentration) ordered from higher to lower MIC values. For the ~4,400 genes in E. coli, there might be 20 million protein-transcript pairs, so we should hope that the top of the list, say the first 5,000 pairs, are highly enriched in true-positive associations. As long as the number of perturbations is sufficiently large (say n = 1,000), the equitability property ensures that pairs least associated with noise will be at the top.
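The arithmetic behind the "20 million pairs" figure, and the idea of keeping only the top of a ranked list, can be sketched as follows. The score matrix here is random placeholder data standing in for real association scores (it is not computed by MIC), and the names `scores` and `k` are assumptions introduced for the example.

```python
import numpy as np

n_genes = 4400                 # approximate E. coli gene count quoted above
n_pairs = n_genes * n_genes    # every protein paired with every transcript
print(f"{n_pairs:,} protein-transcript pairs")   # 19,360,000, i.e. roughly 20 million

# Placeholder association scores in [0, 1) for a small 50 x 50 example.
# In a real analysis each entry would be the score for one (protein, transcript) pair.
rng = np.random.default_rng(0)
scores = rng.random((50, 50))

k = 10                                   # size of the "top of the list" to keep
flat = scores.ravel()
top = np.argpartition(flat, -k)[-k:]     # indices of the k largest scores, unordered
top = top[np.argsort(flat[top])[::-1]]   # order them from highest to lowest score
protein_idx, transcript_idx = np.unravel_index(top, scores.shape)

ranked = list(zip(protein_idx.tolist(), transcript_idx.tolist(), flat[top].tolist()))
for p, t, s in ranked[:3]:
    print(f"protein {p:2d}, transcript {t:2d}: score {s:.3f}")
```

Using `argpartition` avoids sorting all pairs when only a short top list is needed, which matters at the scale of millions of pairs.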

What about DNA sequence data or other types of biological data?

BN: Any sequence analysis task that involves characterizing poorly understood sequence patterns might benefit from such a metric. One could characterize a region of DNA by a vector of k-mer frequencies and search for significant correlations with respect to nucleosome binding or replication origin activity.

ES: MIC is particularly well suited for unraveling relationships where many variables may interact in complex and unanticipated ways, such as in the firing rates of neurons over time, where we wish to identify neurons that interact in meaningful ways.

PQ: MIC can also be applied in genomics studies that look for associations, such as the mapping of expression quantitative trait loci or the integration of multiple data types (e.g., gene expression, methylation, copy number and single-nucleotide polymorphisms).

GS: One application I can think of is discovering associations between kinases and substrates. For example, this could be done by studying the associations among the phosphorylation levels of the complement of phospho-proteins in stimulated cells.

What drawbacks are there to using MIC?

Noah Simon, PhD student, Department of Statistics, Stanford University, Stanford, California.

Rob Tibshirani, professor, Department of Health Research and Policy and the Department of Statistics, Stanford University, Stanford, California.

Noah Simon & Rob Tibshirani: MIC has a potentially serious drawback: as a test that strives to be equitable, it can have low statistical power in many important situations. We ran simulations to compare the power of MIC to that of standard Pearson correlation and another recently proposed method, distance correlation (dcor) (Ann. of Appl. Stat. 3/4, 1233–1303, 2009) (http://www-stat.stanford.edu/~tibs/reshef/comment.pdf).

We found that MIC has lower power than dcor in every case but one, and substantially lower power than Pearson correlation for linear trends. Hence, when MIC is used for large-scale exploratory analysis, it will produce too many false positives. We believe that dcor is a more powerful technique that is simple, easy to compute and should be considered for general use.
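For readers unfamiliar with distance correlation, the following is a minimal sketch of the sample statistic for two one-dimensional variables, written from the standard definition (double-centered pairwise distance matrices). It is not the code used in the simulations cited above, and the function name `distance_correlation` is an assumption made for this example. The test data are synthetic.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")

    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar_x = (A * A).mean()   # squared sample distance variance of x
    dvar_y = (B * B).mean()   # squared sample distance variance of y

    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0            # a constant variable carries no association
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Illustrative comparison on a noiseless parabola (synthetic data).
x = np.linspace(-1.0, 1.0, 201)
pearson_parabola = float(np.corrcoef(x, x ** 2)[0, 1])
dcor_parabola = distance_correlation(x, x ** 2)
print(f"Pearson: {pearson_parabola:.3f}, dcor: {dcor_parabola:.3f}")
```

This version builds full n x n distance matrices, so memory grows quadratically with sample size; that is acceptable for a sketch but may need a more careful implementation for large n.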

OT: There's a cost for the convenience of such a general correlation measure: namely, a likely higher false-positive rate, as hinted at by results in Reshef et al. (Science 334, 1518–1524, 2011).

GS: The associations found by MIC are not necessarily causal. As is the case with other association measures such as mutual information and different types of correlations, MIC is a symmetric measure. Therefore, for example, a protein-transcript pair (P,T) will have the same MIC value as (T,P), and MIC does not indicate whether protein P is regulating transcript T or vice versa. The lack of statistical power for MIC, while a possible drawback for a MIC-only analysis with small sample size, shouldn't pose a problem in practice if we renounce the notion of a 'best' metric, and embrace the idea that more than one metric has to be used to capture the nuances of complex data. MIC should be welcomed as a good addition to our toolbox of association measures.

What types of biological data might be less amenable to MIC?

ES: When we know the expected type of relationship, direct functional tests may have more statistical power than MIC.

OT: Biology researchers often have an idea of what sort of relationship they are looking at, such as a linear relationship or a step-function response to a perturbation in time series data. In such situations, a more specialized measure of pairwise dependency (e.g., Pearson correlation) is more appropriate.

What are the next steps to better understanding the utility of MIC?

Edward Dougherty, professor, Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas.

Edward Dougherty: Complex methods are often very sensitive to sample size. This could be assessed by performing a careful simulation study of the behavior of MIC relative to properties of the sample.
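One way such a simulation study might be set up is sketched below. It estimates the power of an association statistic at several sample sizes by comparing its values on noisy linear data against a null distribution obtained by shuffling one variable. The statistic used here is the absolute Pearson correlation, only because it is simple to compute; any other pairwise measure could be passed in its place. The function names, the noise level and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in association statistic."""
    return abs(float(np.corrcoef(x, y)[0, 1]))


def estimate_power(statistic, n, noise_sd, n_sim=500, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which `statistic` exceeds the null cutoff."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_sim)
    alt_stats = np.empty(n_sim)
    for i in range(n_sim):
        x = rng.uniform(0.0, 1.0, size=n)
        y = x + rng.normal(0.0, noise_sd, size=n)        # linear signal plus noise
        alt_stats[i] = statistic(x, y)
        # Shuffling y breaks the pairing with x, giving a draw from the null.
        null_stats[i] = statistic(x, rng.permutation(y))
    cutoff = np.quantile(null_stats, 1.0 - alpha)        # rejection threshold at level alpha
    return float(np.mean(alt_stats > cutoff))


# Power at a fixed noise level for three illustrative sample sizes.
power_by_n = {n: estimate_power(abs_pearson, n, noise_sd=1.0) for n in (20, 50, 200)}
for n, power in power_by_n.items():
    print(f"n = {n:3d}: estimated power = {power:.2f}")
```

Repeating the same loop over different functional forms and noise levels would show how a given statistic's behavior depends on both the sample size and the shape of the underlying relationship.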

BN: MIC should be compared against distance correlation, as has been done to some degree by Noah Simon and Rob Tibshirani.

GS: For small numbers of experiments (say n = 50), the equitability property does not strictly hold, and the deviation of MIC from 1 will depend not only on the strength of the noise but also on the functional form of the underlying noiseless relationships, possibly leading to false negatives. This should be characterized in greater detail. Also, MIC could be extended to more than two variables (Science 334, 1502–1503, 2011).
