Synopsis

Subject Categories: Functional genomics | Computational methods

Molecular Systems Biology 1 Article number: 2005.0002  doi:10.1038/msb4100005
Published online: 29 March 2005
Citation: Molecular Systems Biology 1:2005.0002

Integrative analysis of genome-wide experiments in the context of a large high-throughput data compendium

Amos Tanay1, Israel Steinfeld1, Martin Kupiec2 & Ron Shamir1

  1. School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  2. Department of Molecular Biology and Biotechnology, Tel Aviv University, Tel Aviv, Israel

Correspondence to: Amos Tanay1, School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel. E-mail: amos@post.tau.ac.il

Received 3 January 2005; Accepted 24 February 2005; Published online 29 March 2005


Article highlights

A new methodology is proposed for the analysis of gene expression experiments in the context of a large and diverse compendium of previously published data. Examples from yeast show how such an approach allows detailed dissection of single experiments from a system-level perspective.

  • By proposing a method for analyzing the results of a single microarray study in the context of all prior studies, deeper and more detailed analysis of the conditions in the study is possible.
  • By allowing the routine analysis of new experiments in the context of large functional genomics compendia, the communication of new results and the agglomeration of diverse datasets into one body of knowledge are simplified.
  • The new methodology can accelerate the evolution of functional genomics into a collaborative community effort, by proposing a common language for describing and analyzing the accumulated experimental data.


Synopsis

Some of the greatest "success stories" in modern biology can be attributed to coordinated community efforts that tackled an overwhelmingly large problem using a web of semi-independent efforts. In the most prominent and recent of these efforts, the emergence of genomics was facilitated by the ability to share and compare sequence data, by the availability of these data to extensive search, and by the aggregation of data into one body of knowledge. A major challenge of today's biology is the functional characterization of biological systems. This problem (in any of several alternative forms) is probably one of the largest ever attempted by biologists, and is thus a natural candidate for being tackled by such a community-based scheme. With this long-term goal in mind, the present study proposes a methodology that can help to exploit large-scale compendia of functional genomics data as part of the routine analysis of high-throughput experiments. Using a large collection of different types of data obtained for the baker's yeast S. cerevisiae, we demonstrate how fruitful such a combined approach may be in characterizing responses to specific conditions from a system-level perspective.

We focus on a relatively simple building block of biological systems: the functional module. Following the pioneering studies of gene expression profiles (Eisen et al., 1998), researchers have extensively used clusters of co-expressed genes to gain insights into the organization of regulatory processes. Clustering, in its simple form, partitions the genome into disjoint gene sets (possibly obeying hierarchical organization), such that each set manifests a different characteristic expression pattern across all the experimental conditions. A natural generalization of a co-expressed gene cluster is a transcriptional module (Ihmels et al., 2002): a set of genes that are co-expressed in some (but not necessarily all) experimental conditions. Transcriptional modules are a more flexible and realistic building block for biological systems. A given gene may belong to more than one transcriptional module, as it can be expressed (or may exhibit different genetic and physical interactions) under different conditions. Transcriptional modules can be detected using bicluster analysis of gene expression datasets. In bicluster analysis, the output is not a set of disjoint clusters, but a collection of (possibly overlapping) transcriptional modules that can represent phenomena like pleiotropy or context-dependent regulation. Finally, a functional module (FM) generalizes a transcriptional module by taking into account other heterogeneous sources of biological information in addition to gene expression (e.g., protein interactions and synthetic lethality). A functional module is thus a set of genes that are correlated with each other across a set of biological properties. Biological properties can represent any source of information on genes and their products, including gene expression, phenotype and protein interactions. In previous work (Tanay et al., 2004) we introduced the SAMBA algorithm for detecting FMs in very large-scale and highly heterogeneous datasets.
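The distinction between a cluster and a transcriptional module can be illustrated with a small sketch. The code below is not the SAMBA algorithm and does not search for modules; it only scores a given candidate gene set by its mean pairwise Pearson correlation, first across all conditions and then across a chosen subset. The expression values, gene names and the helper names `pearson` and `coherence` are invented for illustration.

```python
from itertools import combinations
from math import sqrt

# Toy expression matrix (invented values): gene -> log-ratios in conditions 0..5.
expression = {
    "g1": [1.0, 2.0, 3.0, 3.0, -3.0, 0.0],
    "g2": [1.1, 2.1, 2.9, -3.0, 3.0, 0.0],
    "g3": [0.9, 2.2, 3.1, 0.0, 3.0, -3.0],
}

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coherence(expression, genes, conditions):
    """Mean pairwise correlation of `genes`, restricted to `conditions`."""
    restricted = [[expression[g][c] for c in conditions] for g in genes]
    pairs = list(combinations(restricted, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)

genes = ["g1", "g2", "g3"]

# Cluster view: co-expression is required across every condition.
score_all = coherence(expression, genes, range(6))

# Module view: co-expression is required only in a subset of conditions.
score_module = coherence(expression, genes, [0, 1, 2])

print(f"all conditions: {score_all:.2f}, conditions 0-2 only: {score_module:.2f}")
```

In this toy case the three genes look unrelated when all six conditions are considered, yet they are tightly co-expressed in conditions 0 to 2. A clustering method that requires similarity across all conditions would tend to miss such a group, whereas a biclustering method can report it together with the conditions that support it.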

What can be gained from dissecting biological systems using FMs? FMs simplify the understanding of biological systems by representing cellular processes in terms of the activity of a modest number of modules instead of thousands of genes. As we show here, a comprehensive set of FMs for a model system, built by integrating data from many different studies and sources, may form a valuable foundation when analyzing the results of a new experiment. For example, O'Rourke and Herskowitz (2004) studied the response of several key S. cerevisiae mutant strains to variable levels of hyperosmotic stress. By analyzing the resulting gene expression dataset using standard clustering and extensive expert analysis, the process of hyperosmotic adaptation was dissected into several clusters containing hundreds of genes each. These clusters represent groups of genes that exhibit typical response patterns in the osmotic shock treatments. On the other hand, by adding the O'Rourke-Herskowitz gene expression profiles to the vast compendium of available yeast functional properties accumulated so far (including almost 2000 different conditions from 60 different studies) and analyzing the combined dataset using the methods described here, we can characterize the osmo-adaptation process in terms of the activity of a small number of well-defined, highly specific FMs (Fig. 4a, b). By using more data, which shed light on different aspects of the biological system, we can better separate modules that seem to respond alike under the limited number of particular conditions of the original study.
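Describing an experiment in terms of module activity amounts to replacing per-gene profiles with per-module summaries such as the mean expression and its standard deviation at each time point. The following is a minimal sketch under stated assumptions: the module memberships, gene names, time points and expression values are invented, and `module_activity` is a hypothetical helper, not part of any published tool.

```python
from statistics import mean, stdev

# Invented log-ratio time courses (three time points) for five hypothetical genes.
expression = {
    "geneA": [0.0, -2.0, -0.2],
    "geneB": [0.0, -1.6, 0.0],
    "geneC": [0.0, -1.8, -0.1],
    "geneD": [0.0, 1.0, 0.2],
    "geneE": [0.0, 1.4, 0.0],
}

# Hypothetical module definitions; in practice these would come from the compendium.
modules = {
    "repressed_toy": ["geneA", "geneB", "geneC"],
    "induced_toy": ["geneD", "geneE"],
}

def module_activity(expression, modules):
    """Return {module: [(mean, stdev) per time point]} for each module.

    Genes missing from `expression` are ignored; modules with fewer than
    two measured genes are skipped because a standard deviation is undefined.
    """
    activity = {}
    for module_id, genes in modules.items():
        profiles = [expression[g] for g in genes if g in expression]
        if len(profiles) < 2:
            continue
        per_timepoint = zip(*profiles)  # regroup values by time point
        activity[module_id] = [(mean(v), stdev(v)) for v in per_timepoint]
    return activity

activity = module_activity(expression, modules)
for module_id, course in activity.items():
    print(module_id, [round(m, 2) for m, _ in course])
```

Five gene profiles are reduced here to two module time courses. The same reduction applied to thousands of genes yields a handful of curves that can be compared across strains and treatments.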

Figure 4: Revisiting the response to hyperosmotic stress

(A) Outline of the hyperosmotic stress signaling pathway. Two Hog-dependent (Ssk1, Ste11) and one Hog-independent (Msn2/4) pathways mediate the hyperosmotic stress signal. (B) Response of selected modules to osmotic stress. We plot the average expression of several modules that our algorithm associated with osmotic stress conditions, in several strains knocked out for key players in the HOG pathway. The graphs show the modules' mean expression time courses after treatment with 0.5 M KCl. In general, modules #232 (Ribosomal proteins), #524 (RNA processing), #686 (Amino-acid biosynthesis), #503 (Purines) and #985 (Ergosterol biosynthesis) are repressed as part of the environmental stress response (ESR), with peak response observed at 20 min and re-establishment of normal transcription after 40–60 min. Modules #536 (Respiration) and #1215 (Gluconeogenesis) are induced with similar kinetics. Specific modules show particular deviations from these two general trends. (C) Multiple signals additively regulate module #524. We plot the mean expression of module #524 and its standard deviations in four strains (wt, hog1, ste11, ssk1) under two levels of hyperosmotic shock (0.5 and 0.125 M KCl). There is a marked difference between the ssk1 and hog1 strains and the wt and ste11 strains, suggesting the existence of two regulatory mechanisms. An osmotic stress-specific, Ssk1/Hog1-mediated signal represses the module at both low and high levels of osmotic shock. At high osmotic shock, a second, Hog1-independent signal (which is probably related to the general ESR) is active in parallel to the Hog1 signal and contributes additively to the repression of the module. (D) A two-phase regulatory program for module #536. We show the time courses of the mean expression of module #536 (Respiration) and its main regulator Hap4, when treated with 0.5 M KCl in the wt strain. The module exhibits weak and poorly correlated induction, which is Hap4 independent, during the primary phase of the osmoregulation program (0–40 min). A second phase is observed at 60–180 min, where a tightly correlated induction is facilitated by an increase in HAP4 expression.


We envision a framework (Fig. 5) in which the set of currently available functional profiles for an organism will be continuously updated in one of today's public repositories (www.ncbi.nlm.nih.gov/geo/, www.ebi.ac.uk/arrayexpress/). In addition to the data themselves, a set of characterized and annotated functional modules will be maintained and incrementally refined. The analysis of a new dataset will be performed in light of the entire compendium and annotated modules (using, for example, the algorithms we describe here). In this way, researchers performing even modest-size experiments will be able to probe the system-level effects of the conditions they study, and benefit from the cumulative community knowledge base. Upon publication, each new dataset will become part of the compendium, contributing to the robustness of future analyses by refining the compendium and the FMs.

Figure 5: A new paradigm for analyzing functional genomic experiments

According to the current prevalent paradigm (top part), novel data are analyzed in isolation, typically using clustering and expert manual analysis of specific clusters. We suggest a new approach (lower part) in which the community maintains the current publicly available data sets and the set of biological modules revealed by them. Modules may cover all aspects of biological processes and their regulation, as revealed, for example, by our biclustering algorithm. Using this resource, novel data sets can be represented in terms of the behavior of known and novel modules, providing an objective and transparent method for understanding, communicating and reusing high-throughput data.


Functional genomics is naturally evolving into a multidisciplinary collaborative effort, and the development of tools that facilitate the communication and the use of published data has become an active field of research in recent years. Several efforts to perform genome-wide characterization of biological systems using heterogeneous data are underway (a beautiful example is the integrated analysis of cancer gene expression studies by Segal et al. (2004)). We believe that the methods we present here, and more importantly, their implementation, make progress in this direction. They may prove useful for dissecting functional genomics experiments and for integrating results obtained by several different types of genome-wide methodologies. A prototype website that demonstrates this methodology is available at www.cs.tau.ac.il/~samba.


Acknowledgements

AT was supported by a Horvitz complexity fellowship. MK was supported in part by the ISF, the Recanati Fund and the Israeli Ministry of Health. RS holds the Raymond and Beverly Sackler Chair for Bioinformatics at Tel Aviv University, and was supported by the Israel Science Foundation (Grant 309/02).


References

  1. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868
  2. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N (2002) Revealing modular organization in the yeast transcriptional network. Nat Genet 31: 370–377
  3. O'Rourke SM, Herskowitz I (2004) Unique and redundant roles for HOG MAPK pathway components as revealed by whole-genome expression analysis. Mol Biol Cell 15: 532–542
  4. Segal E, Friedman N, Koller D, Regev A (2004) A module map showing conditional activity of expression modules in cancer. Nat Genet 36: 1090–1098
  5. Tanay A, Sharan R, Kupiec M, Shamir R (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101: 2981–2986
