A discussion of recent advances and limitations in functional annotation and network reconstruction based on gene expression microarray data
Systems approaches for understanding biological complexity and studying diseases rely on iterative and extensive characterization of genes, transcripts, proteins and their interactions, generation of hypotheses about how they functionally inter-relate within subsystems, conversion of these hypotheses into formal mathematical models and their experimental testing (Auffray et al, 2003a). In this context, modeling of gene regulatory networks from functional annotations is currently performed top-down, studying global network architecture and performance (Bray, 2003), and bottom-up, identifying modular subsystems from functional genomics data (Alon, 2003).
Because transcription is the first step of gene expression and is subject to extensive regulation by internal and external factors, systems approaches rely heavily on gene expression data. Microarray technology has developed steadily over the past decade to allow measurement of expression levels for thousands of genes in different biological contexts, and a wealth of such data is now available in public repositories (Ball et al, 2004). The expectation is that microarray analysis will help elucidate what the genes do and when, where and how they are expressed as elements of an orchestrated system under the effects of perturbations, and thus reveal the underlying transcriptional regulatory networks.
Two recent papers report attempts to develop an optimized framework for functional annotation and reconstruction of regulatory networks using large-scale expression data sets combined with protein interaction and phenotypic data in yeast (Tanay et al, 2005; Zhou et al, 2005). These approaches are designed to identify genes that have similar functions but are not necessarily coexpressed, and to extract essential features of regulatory networks through analysis of independent data sets. Zhou et al first identified, on the basis of expression correlation and functional annotation, a collection of coexpressed gene pairs (doublets) representing functional modules in individual data sets; most of these doublets were functionally homogeneous. Then, using this first-order meta-information, they conducted a second-order expression analysis, assembling pairs of doublets (quadruplets) found to be tightly coregulated across multiple data sets into context-dependent regulatory modules. Similarly, Tanay et al used biclustering within a large microarray data compendium to identify relevant functional modules. These approaches generated functional predictions consistent with experimental studies, identified novel cross-doublet gene pairs missed in standard analyses and allowed assignment of novel functions to a number of previously uncharacterized genes. These achievements represent significant improvements over previous studies, since only half of the globally coexpressed gene pairs identified by standard methods are functionally homogeneous, and analysis of a compendium of yeast expression profiles yielded only a handful of functional assignments (Hughes et al, 2000; Auffray et al, 2003b).
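The two-level analysis described above can be illustrated with a minimal sketch. It is one possible reading of the description, not the implementation of Zhou et al: it assumes that the first-order profile of a doublet is its Pearson correlation computed separately in each data set, and that the second-order correlation is the Pearson correlation between two such profiles. The gene names, the toy values and the function names (`pearson`, `first_order_profile`, `second_order_correlation`) are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def first_order_profile(datasets, gene_a, gene_b):
    """Correlation of one doublet within each data set, collected across data sets."""
    return [pearson(d[gene_a], d[gene_b]) for d in datasets]


def second_order_correlation(datasets, doublet_1, doublet_2):
    """Correlation between the first-order profiles of two doublets (assumed definition)."""
    p1 = first_order_profile(datasets, *doublet_1)
    p2 = first_order_profile(datasets, *doublet_2)
    return pearson(p1, p2)


# Toy compendium: three independent data sets, each mapping gene -> expression values.
# Both doublets are coexpressed in data sets 1 and 3 and anti-correlated in data set 2.
datasets = [
    {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2], "g4": [8, 2, 6, 4]},
    {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, 3, 2, 4], "g4": [4, 2, 3, 1]},
    {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 4], "g3": [2, 1, 4, 3], "g4": [2, 1, 4, 3]},
]

if __name__ == "__main__":
    print(first_order_profile(datasets, "g1", "g2"))
    print(second_order_correlation(datasets, ("g1", "g2"), ("g3", "g4")))
```

In this toy compendium the two doublets switch their coexpression on and off in the same data sets, so they would be grouped into a quadruplet even though the expression patterns of g1 and g3 differ within each data set.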
In addition, Zhou et al assembled gene regulatory networks by using the first-order expression correlation of target gene modules as an activity profile for the transcription factor regulating them, and second-order expression correlations between the activity profiles of transcriptional modules to measure cooperativity between transcription factors. Through integration with protein–protein and protein–DNA interaction data, functionally consistent transcription modules controlled by distinct transcription factors and displaying high second-order correlations were shown to participate in transcription cascades. Thus, high-order clustering of transcriptional modules identifies potential interconnectivity between groups of genes participating in the same biological processes, and provides indirect assignment of transcription factors to these processes despite their low expression levels. This represents another improvement over conventional approaches, which are limited by their inability to reconstruct the hidden organization of the regulatory signals (Wei et al, 2004): high-order analyses go one step further to capture combinatorial coregulation of genes that do not exhibit identical expression patterns.
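The idea of inferring transcription factor activity from target modules can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the activity of a factor in one data set is taken to be the mean pairwise Pearson correlation among its target genes, and cooperativity between two factors is taken to be the Pearson correlation of their activity profiles across data sets. The factor names, target lists, toy values and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def activity_profile(datasets, module):
    """Mean pairwise correlation of a factor's target genes, one value per data set."""
    pairs = list(combinations(module, 2))
    return [sum(pearson(d[a], d[b]) for a, b in pairs) / len(pairs) for d in datasets]


def cooperativity(datasets, module_1, module_2):
    """Second-order correlation between two activity profiles (assumed definition)."""
    return pearson(activity_profile(datasets, module_1),
                   activity_profile(datasets, module_2))


# Hypothetical target modules of three transcription factors.
modules = {"TF_A": ["a1", "a2", "a3"], "TF_B": ["b1", "b2", "b3"], "TF_C": ["c1", "c2", "c3"]}

up, down = [1, 2, 3], [3, 2, 1]
coherent = [up, [2, 4, 6], [3, 6, 9]]   # all targets move together
mixed = [up, down, up]                  # targets partly anti-correlated


def toy_dataset(state_a, state_b, state_c):
    """Build one data set (gene -> values) from a target pattern per module."""
    data = {}
    for prefix, state in (("a", state_a), ("b", state_b), ("c", state_c)):
        for i, values in enumerate(state, start=1):
            data[f"{prefix}{i}"] = values
    return data


# TF_A and TF_B modules are coherent in data sets 1 and 3; TF_C only in data set 2.
datasets = [
    toy_dataset(coherent, coherent, mixed),
    toy_dataset(mixed, mixed, coherent),
    toy_dataset(coherent, coherent, mixed),
]

if __name__ == "__main__":
    for name, module in modules.items():
        print(name, activity_profile(datasets, module))
    print(cooperativity(datasets, modules["TF_A"], modules["TF_B"]))
    print(cooperativity(datasets, modules["TF_A"], modules["TF_C"]))
```

Note that the factors themselves never appear in the expression data used here, which reflects the point made above: activity is assigned indirectly from the behaviour of the targets, not from the (often low) expression level of the factor.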
Despite the significant progress that such data-driven network assembly methods represent, the complexity of the underlying networks makes it extremely difficult to reconstruct complete regulatory networks exclusively from the information available from microarrays, even when combined with other types of data in higher order analyses (Wei et al, 2004; Papin et al, 2005). This currently limits our ability to understand the biological significance of the topological properties of the reconstructed networks, which typically have scale-free and small-world architectures (Grigorov, 2005). Combinatorial expansion in the number of potential network structures and comprehensive evaluation of their consistency are key challenges that the approach developed by Zhou et al does not entirely overcome, particularly since the number of possible alternative genetic regulatory networks depends strongly on the size and type of the data sets and on the maximum number of regulatory inputs per gene (Orrell et al, 2005). Using a Bayesian modeling approach that imposes severe constraints on network architecture, several groups have successfully overcome some of these limitations, inferring transcriptional regulatory modules through a high-order analysis of microarray data combined with genotyping and phenotypic data in recombinant inbred mice (Bystrykh et al, 2005; Chesler et al, 2005; Hubner et al, 2005; Li et al, 2005).
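The scale of this combinatorial expansion can be made concrete with a small counting sketch. It rests on a deliberately simple assumption that is not taken from the cited studies: each of n genes independently receives at most k regulatory inputs chosen among the same n genes, and only the wiring is counted, not the regulatory logic or its sign. The function names are hypothetical.

```python
from math import comb


def regulator_set_count(n_genes, max_inputs):
    """Number of ways one gene can pick at most max_inputs regulators among n_genes."""
    return sum(comb(n_genes, i) for i in range(max_inputs + 1))


def wiring_diagram_count(n_genes, max_inputs):
    """Number of wiring diagrams if every gene picks its regulators independently."""
    return regulator_set_count(n_genes, max_inputs) ** n_genes


if __name__ == "__main__":
    for n_genes in (5, 10, 20):
        for max_inputs in (1, 2, 3):
            count = wiring_diagram_count(n_genes, max_inputs)
            # Report the number of decimal digits; the counts quickly become very large.
            print(n_genes, max_inputs, len(str(count)))
```

Even under these restrictive assumptions, ten genes with at most two inputs each already give 56 candidate regulator sets per gene and 56 to the power of 10 wiring diagrams, which illustrates why constraints on network architecture are needed to keep the search tractable.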
However, a great deal of biological information is most likely contained in the absolute expression levels, including the many low-magnitude levels that are subject to chaotic fluctuations and trigger the emergence of self-organization in complex biological systems (Auffray et al, 2003b). Such fluctuations are unlikely to be captured by high-order expression analysis when it considers only functional links that are simultaneously turned on or off over various conditions; the analysis is also limited by the current inability of high-throughput technologies to provide the accurate and consistent data required (Jarvinen et al, 2004). Owing to insufficient standardization in experiment description, including array element description and annotation, and to irregularities in data integrity (Brazma et al, 2001; Grunenfelder and Winzeler, 2002), microarrays remain a not yet fully mature technology that relies on a variety of platforms and analysis tools, which are often difficult to compare. Thus, poorly documented variations exist within any given microarray data set, especially when different generations of microarrays are considered together (Hwang et al, 2004; Shi et al, 2004). These variations are likely to influence both first-order and high-order analyses significantly, as shown by the influence of RNA integrity on expression level measurements (Imbeaud et al, 2005). They should therefore be documented using vigilant experimental and data processing pipelines rather than masked, as is currently done in most microarray studies.


