Pan- and core- network analysis of co-expression genes in a model plant

Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fractions of individual networks, respectively. We showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.

indiscriminately combining multiple samples may not be universally good. The combined samples need to be biologically comparable 18 . Furthermore, batch effects may give rise to false positive and spurious correlations between genes when microarray data from different labs are combined 19,20 .
Another problem with combined co-expression networks is that it may miss rare gene interactions formed under specific conditions such as a particular disease 21 . Increasing amount of evidence indicates that different gene networks operate in different biological contexts 22,23 . Thus, it is increasingly important to compare and contrast coexpression networks generated from individual datasets [24][25][26] . Experimental results suggest that more than one third of genetic interactions are condition-specific 27 . Several studies have also demonstrated that coexpression of genes varies in different conditions. Southworth et al. studied the difference between coexpression networks of young and old mouse brains and found genes involved in memory have more network connections in the young than in the old animals 28 . By leveraging the concept of differential rewiring, Hudson et al. captured the phenotypic differences between two breeds of cows 29 . Compared with normal tissue, many coexpression relationships were lost in cancer tissue 30 .
A published study of changes in gene expression usually has its own experimental design created in order to answer a specific biological question or several related questions, such as, to understand the mechanisms of plant heat shock response 31 or biological function of a plant hormone 32 . It might not be universally good to detect coexpression based on a combined dataset of microarray samples from different labs. Here we construct and analyze a comprehensive collection of 134 coexpression networks each based on expression samples from an individual published study, thus preserving context-specific network structure. We assume the network generated from each Gene Expression Omnibus (GEO) series represent a particular regulatory response specific to the experimental design and biological query of that study. Therefore, expression datasets from individual studies are ideal for the detection of condition-specific networks. We found most coexpression edges can only be detected in a few networks, while only a small fraction of edges can be repeatedly detected in many networks. Those two types of edges are enriched for distinctly different biological functions. Our analysis may provide guide for future coexpression network analysis in plants.

Results
A large collection of Arabidopsis coexpression networks. Previous work has combined expression datasets from many labs to build coexpression networks for animals or plants 1,13,14 . Although this strategy has been intensively used by plant scientists to assist gene characterization 33 , it preferentially captures the relationships between genes that are conserved across most contexts. In contrast to this, in this study we built coexpression networks based on individual expression datasets from the model plant Arabidopsis in order to capture network aspects that appeared in a specific experimental setup 21 .
Thousands of gene expression profiling datasets are available for Arabidopsis in public repositories such as GEO. We focused on the Affymetrix GPL198 platform, since it is the most widely used platform and its annotation is continuously updated. Many of those datasets are not suitable for network inference simply because the number of samples are not enough for a robust inference of correlation. Similar to a previous study performed in humans, we limited our analysis to datasets with at least 20 samples; 134 such datasets were acquired from GEO. Each of them was normalized in the same manner, and genes with very low mRNA abundance or genes without any significant changes were removed (see Methods). The top 0.1% most coexpressed gene pairs within each dataset were then used to build individual networks 34 . We used GSE series number to identify individual published studies, thus each of our 134 networks is labelled with the GSE number of the corresponding GEO experiment/ series (Fig. 1).
The number of nodes in most of the series-based networks was between 500 and 5,000, while the number of edges was between 50,000 and 100,000 ( Fig. 2a,b). The Pearson Correlation Coefficient (PCC) cutoff for each series-based network was about 0.73 ( Figure S1). Since the number of samples for each coexpressed gene pair is at least 20, the correlation is statistically significant (p-value < 0.001 for PCC = 0.73 and sample size ≥ 20). About 60% of these series-based networks were inferred from leaf, seedling and root tissues (Fig. 2c). The other 40% included other tissues such as flowers, seeds and shoots (Fig. 2c). As can be seen in Fig. 2d, these series-based networks were inferred from samples in a variety of experimental contexts, such as the effect of gene mutations or abiotic stress. The breadth of data sources were included to better capture of coexpression networks in different conditions.
Coexpression networks in Arabidopsis are highly context-dependent. Similar to the idea of 'pan/ core genome' in bacteria 35,36 and plants [37][38][39] , we propose to use the term 'pan-network' for the union of all the 134 series-based networks, while 'core-network' to represent the intersection of a considerable fraction (10 or more out of 134) of these networks. Indeed, unlike a large number of core genes shared across the entire networks there were no edges present in all 134 of our networks. This is similar in spirit to a soft cutoff used by one of us 40 in detecting the core ("basic") genome of a bacterial species (E. coli).
The pan-network representing the union of all 134 of our individual networks contains 2,294,175 non-redundant edges and 18301 nodes/genes (Supplemental File 1). Every edge in the pan-network is characterized by its 'universality number' , U, defined as the total number of our networks in which this edge was observed. More than 80% of the edges were observed in only one network (universality, U = 1), suggesting that gene networks in Arabidopsis operating under different conditions are drastically different from each other 25 . Compared to a more conventional approach in which the samples from all of 134 experiments are combined in one large table based on which a single coexpression network is calculated, our approach detects more edges ( Figure S2a). In addition, our pan network has a better quality as it contains less false-positive interactions due to e.g. the batch effect. Indeed a higher fraction of edges in our pan-network than in the network based on the coexpression of all 134 samples are supported by experiments other than expression data ( Figure S2b and c).  Co-expression edges between kinases and transcription factors (TF) are often of a particular biological interest as they may indicate gene regulation triggered by signaling pathways. To illustrate this below we discuss several examples of edges with experimental evidence of physical interactions 41 . Firstly, we observed that a guanylate kinase (AT3G57550) is coexpressed with a MYB TF (AT1G18570) in only one out of the 134 experiments (GSE40354, Supplemental File 1). The experimental context of the coexpression between those two genes involved treatment with bacterial elicitor 42 . Another edge deserving further investigation connects an F-box protein (AT1G47340) and SKP-1 (AT5G42190) (Supplemental File 1). F-box proteins are well known to interact with SKP-1 to degrade unwanted proteins 43 , however, hundreds of F-box genes are encoded in the Arabidopsis genome 44 . It is critical to determine the specificity of the interactions between those F-box genes and their interacting partners. It is also important to understand under which environmental conditions the interaction happens 45 . Although previous experiments have already detected physical interactions between the edges discussed above 41 , the results from our analysis suggest the design of further experiments to reveal the specificity of these interactions. In contrast to the edges that were observed in only one dataset (i.e. U = 1), the edges with larger values of U (Universality) were mostly formed between members of large multi-protein complexes (Supplemental File 1).
The core-network connects components of large molecular machines. A family of core networks of progressively increasing universality can be extracted from the pan-network by applying a strict cutoff on the universality of edges (e.g. a core-network formed by all edges existing in at least 5 datasets). As the cutoff value increases, the resulting network becomes smaller but more modular (Fig. 3). Modularity measures how well are these network modules separated from each other, while the clustering coefficient measures how tightly the neighbors of a node within a module are connected with each other. Both parameters are frequently reported for all types of biological networks, including protein-protein networks, metabolic networks and transcriptional networks 46 but (to the best of our knowledge) ours is the first study of their systematic dependence on edge universality.
We used U = 10 as the cutoff to determine the core-network used in the rest of our study (marked red in Fig. 4), which contains 7326 non-redundant edges among 935 genes. The exact value of the cutoff defining the core-network is always somewhat arbitrary. Our choice was inspired by the network analysis shown in Fig. 3. Both clustering coefficient and especially modularity of the core network appear to saturate above the U = 10 threshold. Interacting genes in thus defined core network are likely to be statistically significantly correlated even in datasets where they didn't make the PCC cutoff. Indeed, the average PCC of core-network edges in datasets where they did not pass the PCC cutoff is still significantly higher than the average PCC of edges within the pan-network (student's t-test, p-value = 0, see Figure S3). We refer to the set of edges present in the pan-network but not universal enough to be included in the core network as "condition-specific" (marked green in Fig. 4). The degree distribution of the core-network approximately follows a power-law (scale-free) pattern with the exponent  Fig. 4). The modules of this network were also highly enriched in genes characterized by functional categories 'translation' or 'photosynthesis' (p-value < 10 −10 , Supplemental File 4). These facts are consistent with earlier observation that the genes involved in large molecular machines tend be coexpressed under many conditions 13 .
The pan-network is enriched in condition-specific biological processes. Since each expression dataset usually has its own experimental design addressing a specific biological question 50 , an edge detected in one dataset may not be detected in another 21 . As more and more evidence supports condition-specific networks in animals [28][29][30] , we hypothesized that biological processes that are active in a small set of specific contexts would form the bulk of our pan-network. First of all, although the edges with smaller values of U contained fewer direct physical Protein-Protein Interactions (PPIs) compared with the core-network, PPIs are still overrepresented among pan-network edges (p-value < 0.001, Fig. 4). This suggests that biologically meaningful connections exist among co-expressed genes which are less likely to be detected by the traditional methods 16 . The degree distribution for the pan-network approximately followed a power-law pattern (exponent = − 0.5 transitioning to the exponential cutoff above 500) (Fig. 5, Supplemental File 2). Besides the support from physical interactions, we wondered if more evidence could be found to reveal the biological significance of the pan-network.
With an average degree more than 200 (Supplemental File 2), an overview of the pan-network showed a densely connected large central component. However, we were able to detect network communities (i.e. modules)  using a scalable algorithm 51 implemented in Gephi 0.9 graph visualization software package 52 . In fact, the modules from the core-network and pan-network were identified using the same method to allow for an apples-to-apples comparison. We found two (out of six) modules in the pan-network enriched in 'regulation of cell cycle' (p-value = 5.39 × 10 −15 ) and 'regulation of cell communication' (2.93 × 10 −6 ), respectively (Fig. 6). Interestingly, those biological processes were not detected among core-network modules (p-value > 0.01, Supplemental File 4).
The most connected nodes, i.e. hubs, are generally the focus of the network analysis. For instance, hubs in protein-protein interaction networks are more likely to be essential genes 53,54 . Hubs in coexpression networks are often considered to be the most informative genes 13,55,56 . Approximate power-law degree distribution observed in our analysis confirms the existence of hubs in both pan-network and core-network (Fig. 5). Most of the hubs in core-network are ribosomal genes, while the hubs in the pan-network represent a broad spectrum of functional categories, such as aminotransferase (AT3G49680), or ferredoxin (AT1G10960) ( Table 1, Supplemental File 2). Genes involved in 'response to abiotic stimulus' were enriched among the top 200 most connected genes in the pan-network (p-value = 1.6 × 10 −10 ). In addition, chaperonin genes were also enriched (p-value < 0.01). Chaperonins play a critical role in helping plants fight against environmental stresses by reestablishing the normal conformation of proteins. This may explain their potential ability to interact with many different genes under different conditions 57 . For instance, AT1G55490, encoding a subunit of chloroplasts chaperonins, was coexpressed with 1605 genes in the pan-network. These instances of coexpression were from 49 different experiments in total (Supplemental file 5). In conclusion, we demonstrated that a broad spectrum of condition-specific biological processes can be revealed by the pan-network analysis.

Discussion
Networks of interactions between different genes are key to our understanding of cellular mechanisms. Large-scale network data for plants are still very limited and are expensive to generate experimentally. Coexpression networks inferred from gene expression profiling data allows one to study interactions between genes (or proteins they encode) albeit indirectly. Based on 'guilt-by-association' , coexpression network analysis provides great power in providing clues for gene functions in plants 58 . It also suggests candidates for the design of both high-throughput 59,60 as well as more focused low-throughput experiments 3,4,7,8 . Thousands of expression profiling experiments are available for Arabidopsis in the GEO 50 . In our previous study we reported a principle component analysis of ~7000 expression samples from more than 300 publications in Arabidopsis 61 . The developmental stage, growth conditions and the tissues of the mRNA samples used in each study are highly variable, since each of these experiments was designed to answer a specific biological question 50 . In this study, we assume the coexpression network inferred from a given experiment represents a part of the overall gene regulatory network perturbed by these specific changes in environmental or intrinsic conditions. We found relatively small number of edges (core network) that can be repeatedly detected in multiple conditions. Differences between functional enrichment within modules in pan-and core-networks emphasize the importance of biological context in coexpression analysis.
Recent studies have shown that coexpression networks in animals are highly condition-specific [28][29][30] . For example, by comparing mice of different ages, Southworth et al. found many coexpression gene modules cannot be detected in older mice 28 . Network rewiring was first revealed through comparisons between different cellular states, such as healthy and cancerous tissues 21 . Our study showed plant coexpression networks are also under dramatic changes under different conditions. To complement our pan/core-network analysis focused on individual edges s, we further calculated the preservation of co-expressed gene modules in each dataset 62 . Consistent with the results shown in the above, the most preserved modules are enriched in 'photosynthesis' (p-value < 10 −10 , Supplemental File 6). The modules which can only be detected in one dataset are enriched in more specific biological processes, such as 'pollen exine formation' (p-value < 10 −10 ) and 'nucleosome organization' (p-value = 3.9*10 −7 ) (Supplemental File 6). In the plant community, the context-specificity of gene coexpression has been usually ignored 5,8,13,63,64 . Our search of existing literature revealed that most previous meta-analysis studies of plant expression data (except for a few notable examples discussed below) combined datasets from different labs to detect pairs of universally co-expressed genes (Supplemental File 7). Using 15 rice gene expression datasets, Childs et al. compared coexpression networks generated from the combined expression data against those from individual datasets. They found networks from individual datasets to contain specific but potentially informative gene modules 65 . Lee et al. detected coexpression relationships based on individual datasets instead of one combined dataset for Arabidopsis as well as for rice 9,66 . Although Lee et al. successfully predicted gene functions based on the individual networks they constructed, the differences between networks from different labs were not systemically evaluated in their study.
Our collection of 134 co-expression networks in Arabidopsis based on individual experimental series in GEO database can be used to answer the question of whether one should combine multiple expression datasets before constructing the co-expression network and if yes, which datasets can be best grouped together. In principle, by combining together similar series one can get the best of both worlds: increased statistical power to detect significantly correlated genes can be gained without losing condition-specific edges. To shed light on this problem we constructed a 134 × 134 matrix of similarities between our set of networks. The similarity was estimated using two different measures. The first similarity matrix shown in Fig. 7a and made available for download as Supplemental File 8 was constructed in the spirit of pan-and core-network analysis. It quantifies the fraction of edges shared between a pair of networks (See Methods). While clusters of similar networks, corresponding mostly to identical tissue types (empirical p-value < 10 −5 based on permutation tests), are visible already in this measure, the contrast between similar and different networks can be made even sharper (Fig. 7) if one uses an alternative similarity measure based on shared modules of co-expressed gene across a pair of networks detected by the WGCNA algorithm's 62 method 'modulePreservation' (Supplemental File 9). We used this similarity measure to construct the 'network of networks' connecting pairs of networks with average module similarity score above 10 (Fig. 7). Densely interconnected modules in this 'network of networks' represent good candidates for series that can be integrated without significant loss of condition-specific edges. We plan to investigate pros and cons of this approach in a follow up study.
Our analysis demonstrated that coexpression networks inferred from different microarray datasets share relatively small number of common edges, while at the same time maintaining a large number of condition-specific edges. We constructed a pan-network to represent the union of all detected co-expressed edges among 134 datasets. We also proposed the concept of core-network representing edges detected in multiple datasets. Compared to the pan-network, the core-network is more modular and enriched in genes from large multi-protein complexes. The hubs of the pan-network include genes that play a role in response to a variety of environmental stimuli. In comparison, the modules within the pan-network are enriched in signaling and regulatory functions. We also considered several measures of similarities between individual coexpression networks and constructed 'network of networks' connecting similar networks to each other. We anticipate concepts of pan-and core-coexpression networks to provide a useful description of gene regulation architecture in a variety of species and we are currently working on extending our analysis to model organisms other than Arabidopsis.

Methods
Microarray data. All the expression datasets in this study were based on the Affymetrix platform GPL198.
Only datasets containing at least 20 samples were used 67 . The CEL files of 134 expression datasets were downloaded from GEO (see Supplemental File 10) and normalized by MAS5.0 68 . The probesets were converted into TAIR gene locus ID based on the annotation file for GPL198. We only used 21678 probesets each of which has a unique mapping on a single Arabidopsis gene.
Filtering biologically relevant genes. We manually grouped samples within each dataset into different replicate groups based on the metadata provided in the dataset. Then we applied ANOVA (Analysis of variance) to identify genes that are differentially expressed between replicate groups (p-value < 0.01). If there were fewer than 3000 differentially expressed genes, the top 3000 genes ranked by p-value were used. We then applied the following standards to exclude genes with low expression. 1) For a gene to be included, at least one of its expression abundance values in a dataset was identified as expressed (i.e. present) by MAS5.0; and 2) For a gene to be included, at least one of its expression abundance values in a dataset was higher than the 90% percentile of the abundance of transposable element on the same array 69 . Those biologically relevant genes were then used to build series-based coexpression networks.
Building the series-based network for each GEO dataset. For each GEO dataset, we calculated the Pearson Correlation Coefficient (PCC) between the expression profiles of two genes. All possible pairs were calculated. Then, the top 0.1% pairs with the highest correlations were used to build the network for each dataset (i.e. p-value < 0.001) 34 . A recent study showed that genes within the same metabolic pathway tend to have positive correlations 70 . In addition, taking into account negatively correlated gene pairs complicates the interpretation of our results. Therefore, we focused our analysis only on positively correlated gene pairs. Enrichment test. The GO annotation data were downloaded from GeneOntology web site (http://geneontology.org/gene-associations/gene_association.tair.gz, July 18, 2014). The annotations inferred from the expression profile (i.e. IEP) were removed to avoid the possibility of 'self-fulfilling prophecy' . All the daughter nodes were recursively mapped to the mother node based on 'is_a' relationship. Only the GO terms that are not broadly associated with too many genes were used according to the Bonferroni correction:            < . # of genes associated with the GO term # of genes in the genome 0 05 # of GO terms in the genome (1) 2 The Fisher's exact test was utilized to calculate the significance of the enrichment for each GO term, followed by Benjamini-Hochberg correction.
Calculation of network similarity between different datasets. We first used the fraction of overlapped edges between two networks to measure their similarity (Supplemental File 8) which was based on Jaccard index,

∩ ∪
Ei Ej Ei Ej (2) where Ei and Ej are the edges in network i and j, respectively. Another measure of network similarity was calculated as follows (Supplemental File 9). First, densely interconnected network modules were detected by Weighted Gene Co-Expression Network Analysis (WGCNA) software for each one of our 134 networks 71 . Second, the method, 'modulePreservation' within WGCNA was utilized to calculate the preservation of each module in another dataset 62 . Then the average Z-summary score of all the modules shared between a pair of networks was used to represent their similarity. This score was normalized between 0 and 1 for visualization purpose by, where S i,j is the similarity (i.e average Z-summary) between a pair of datasets. max and min represent the largest and smallest similarity, respectively. A cutoff of S i,j > 10 was applied in order to keep the network pairs with the strongest similarity when be visualized 62 . If a network has more than 10 neighbors, only the first 10 neighbors were shown. If a network has no neighbor, it was not shown. For more information on the calculation of network similarity using WGCNA, see Supplemental File 11.