Introduction

Gene coexpression network analysis is an attractive method for gene function annotation and has been used in many model organisms, including yeast, mouse, human, Arabidopsis, and grapevine1–5. In a gene coexpression network, nodes represent genes and edges represent significant correlations between the expression patterns of the connected genes6. After network construction, highly connected genes are clustered into modules. Genes within one module tend to participate in similar biological processes. Therefore, the function of unannotated genes can be hypothesized based on the “guilt-by-association” principle7.

Since the sequencing of citrus genomes8, gene function annotation has become a new challenge. For citrus, large amounts of data from microarray and RNA-seq experiments are available in public databases9–13. These data make it possible to construct gene coexpression networks for citrus. Several papers on citrus gene coexpression networks have been published14–17. Most of these studies focused on specific areas and used small data sets. Only one study used 297 citrus microarrays and covered both the general area and several specific areas17. However, a limitation of that study was that its coexpression networks were built from probe sets rather than from genes, the unit used in many coexpression studies18–20. There are also some protein–protein interaction (PPI) networks, but these networks were inferred from PPI networks of Arabidopsis21–23.

In this study, we first made a customized Chip Definition File (CDF) with AffyProbeMiner to map probes to gene loci. Then, seven gene coexpression networks were constructed with RMTGeneNet using all or part of 230 citrus microarrays. These networks were partitioned into modules, and the functional coherence of the modules was assessed by Gene Ontology (GO) and KEGG pathway enrichment analyses. Finally, RNA-seq data of 371 genes were used to test the validity of these networks.

Materials and methods

Data collection and preprocessing

The sweet orange (Citrus sinensis) microarray data used in this study were downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO)24. A total of 231 CEL files were obtained from the platform GPL5731. The raw CEL data were preprocessed with RMA normalization using the affy package of R 3.1.025. One sample (GSM825502) that failed more than one test of arrayQualityMetrics26 was removed, leaving 230 samples for network construction (Table S1). Based on hierarchical cluster analysis, sub-data sets were drawn from these 230 microarrays (called “all data”) by experimental condition, “citrus canker” (30 arrays) and “HLB” (36 arrays), and by organ, “leaves” (63 arrays), “flavedo” (40 arrays), “albedo” (31 arrays), and “flesh” (43 arrays). These sub-data sets were also used for network construction. To map the microarray probes to citrus genes, a customized CDF was generated by AffyProbeMiner27 using the C. x clementina v1.0 annotation as the reference8. Probes mapping to multiple citrus loci and probe sets containing fewer than five probes were discarded. Data for the NICCE networks were downloaded from http://citrus.adelaide.edu.au/nicce/home.aspx.
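The probe-filtering rule described above can be illustrated with a short Python sketch. This is not AffyProbeMiner's implementation; the input format (a dictionary from probe ID to the set of loci it aligns to), the function name, and the toy data are assumptions made only for illustration.

```python
from collections import defaultdict

def build_probe_sets(probe_to_loci, min_probes=5):
    """Group unambiguous probes by gene locus and drop small probe sets.

    probe_to_loci: dict mapping a probe ID to the set of loci it aligns to
    (a hypothetical input format, not AffyProbeMiner's own).
    """
    by_locus = defaultdict(list)
    for probe, loci in probe_to_loci.items():
        if len(loci) == 1:  # keep only probes that map to exactly one locus
            by_locus[next(iter(loci))].append(probe)
    # discard probe sets with fewer than min_probes members
    return {locus: sorted(probes)
            for locus, probes in by_locus.items()
            if len(probes) >= min_probes}

# Toy data: geneA has five unambiguous probes; geneB has only one.
toy = {f"p{i}": {"geneA"} for i in range(5)}
toy["p5"] = {"geneA", "geneB"}  # ambiguous probe, discarded
toy["p6"] = {"geneB"}           # geneB keeps a single probe, so it is dropped
probe_sets = build_probe_sets(toy)
```

In this toy case only geneA survives, because geneB is left with a single unambiguous probe.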

Coexpression network construction and topological analysis

The coexpression networks were constructed using the RMTGeneNet package28, which requires a minimum of 25 input microarrays. First, a gene expression correlation matrix was constructed using pair-wise Pearson correlation coefficients (PCC). Then, a PCC threshold was determined from the transition of the nearest neighbor spacing distribution of the eigenvalues from the Gaussian distribution to the Poisson distribution (p = 0.001). Coexpression networks were visualized using Cytoscape 2.8.329, and all topological analyses were performed using the NetworkAnalyzer package30 for Cytoscape 2.8.3.
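The correlation and thresholding steps can be sketched in a few lines of Python. This is a minimal illustration and not RMTGeneNet itself: the Random Matrix Theory search for the threshold is not reproduced, so the cutoff is passed in as a fixed number (0.882 is the “all data” value reported in this study). Applying the cutoff to the absolute value of the PCC is an assumption, and the function names and toy expression values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expr, threshold):
    """Return {(gene1, gene2): r} for all pairs with |r| >= threshold.

    expr: dict mapping a gene ID to its expression values across arrays.
    """
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:  # assumption: cutoff applied to |PCC|
            edges[(g1, g2)] = r
    return edges

# Toy expression profiles over five arrays (illustrative values only).
toy_expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10.5],
    "geneC": [5, 1, 4, 2, 3],
}
toy_edges = coexpression_edges(toy_expr, threshold=0.882)
```

Only the geneA/geneB pair passes the cutoff here; geneC is weakly and negatively correlated with both.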

Module clustering and functional enrichment analysis

The Markov Cluster (MCL) algorithm31, an efficient graph clustering algorithm based on the simulation of random walks, was used to partition each network into modules. The inflation parameter (I) was scanned from 1.2 to 5.0 in increments of 0.2, and its value was chosen according to area fraction, mass fraction, and efficiency. The GO terms and Arabidopsis homologs for C. x clementina genes were downloaded from Phytozome v1032. The KEGG annotations of Arabidopsis genes were obtained through the KEGGREST package of Bioconductor33. GO biological process term enrichment analysis was carried out using the topGO package of Bioconductor33. KEGG pathway enrichment was performed in R 3.1.0. Terms enriched with a Fisher’s test p-value < 0.05 were considered significant.
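As a rough illustration of the clustering step, the following Python sketch implements the two alternating MCL operations, expansion and inflation, on a small unweighted graph. It is not the mcl program used in this study: pruning, edge weights, and the area fraction, mass fraction, and efficiency diagnostics are all omitted, and the function name, defaults, and toy graph are illustrative assumptions.

```python
def mcl(n, edges, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering of an undirected graph with nodes 0..n-1."""
    # Adjacency matrix with self-loops, stored as columns of a flow matrix.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0

    def normalize(mat):
        # Rescale every column so that it sums to 1 (column-stochastic).
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat

    m = normalize(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two steps of the random walk).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to the power I, then renormalize columns.
        nxt = normalize([[v ** inflation for v in row] for row in expanded])
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break

    # Each row that still carries flow defines a cluster of its nonzero columns.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)

# Two triangles with no connecting edge fall into separate modules.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
modules = mcl(6, triangles)
```

Higher inflation values generally yield more and smaller clusters, which is why the parameter was scanned over a range rather than fixed in advance.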

Genome synteny analysis and network validation

To use the RNA-seq data of the C. sinensis annotation project (CAP)34, synteny analysis between the C. sinensis genome and the C. x clementina genome was conducted locally using a method similar to that developed for the Plant Genome Duplication Database35,36. First, BLASTP37 was run with all C. sinensis proteins as queries to search for potential anchors (E < 1e−10, top 5 matches) in the C. x clementina genome. Afterwards, MCscan was employed to identify homologous regions38. Finally, syntenic blocks were evaluated by ColinearScan39. Alignments with an E value < 1e−10 were considered significant matches. The expression data of 371 C. sinensis genes were downloaded from CAP, and the correlation coefficients between them were calculated using R 3.1.0.
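The anchor-selection rule (E < 1e−10, top 5 matches per query) can be sketched as a filter over BLAST results. The sketch below assumes the standard 12-column tabular BLAST output, with the E value in column 11 and the bit score in column 12, and one line per query-subject pair; the output format actually used in this study is not stated, and ranking hits by bit score is likewise an assumption. The function names and toy rows are hypothetical.

```python
from collections import defaultdict

def top_hits(lines, max_evalue=1e-10, top_n=5):
    """Keep, per query, the top_n subjects with E value below max_evalue."""
    hits = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bits = float(fields[10]), float(fields[11])
        if evalue < max_evalue:
            hits[query].append((bits, subject))
    # Assumption: "top" matches are those with the highest bit score.
    return {q: [s for _, s in sorted(h, key=lambda t: -t[0])[:top_n]]
            for q, h in hits.items()}

def row(query, subject, evalue, bits):
    # Build one 12-column tabular line; unused columns get placeholder values.
    return "\t".join([query, subject, "90.0", "100", "0", "0",
                      "1", "100", "1", "100", str(evalue), str(bits)])

toy_rows = [
    row("q1", "s1", 1e-50, 200),
    row("q1", "s2", 1e-5, 40),   # fails the E-value cutoff
    row("q1", "s3", 1e-20, 90),
]
anchors = top_hits(toy_rows)
```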

Results

Network construction

As shown in Figure 1, 231 Affymetrix citrus microarrays were downloaded from the NCBI GEO. After quality checks, 230 high-quality microarrays (Table S1) were chosen for downstream analyses. Based on the hierarchical cluster analysis (Figure S1), these microarrays were first distributed into seven organ groups: flower, stem, leaves, fruit, seed, roots, and epicotyls. The fruit group was further divided into flavedo, albedo, flesh, and vascular core (also called central core) subgroups. The data from albedo and vascular core clustered together first and then clustered with data from other parts of the fruit. This is reasonable considering that both albedo and vascular core are composed of a colorless, spongy network of parenchymatous cells. These two data sets were combined into one group, labeled “albedo”, because neither was large enough for RMTGeneNet analysis. Five groups (flower, stem, seed, roots, and epicotyls), which had fewer microarrays than the minimum required by RMTGeneNet (Table 1), were not included in the condition-dependent coexpression analysis. Within groups, microarrays of the same treatment clustered together. Two major diseases of citrus40, citrus canker and HLB, constituted 38.3% of the experiments (citrus canker: 30, HLB: 58), or 81.5% of the experiments if controls were not included. Other treatments were not included in network construction because of insufficient numbers of microarrays. For citrus canker, all microarrays were in the leaves group. However, the HLB data covered five groups (stem, leaves, fruit, seed, and roots). Only the 36 HLB microarrays in the fruit group were used to construct the “HLB” coexpression network. Therefore, these 230 microarrays (called “all data”) were divided into sub-data sets of “citrus canker”, “HLB”, “leaves”, “flavedo”, “albedo”, and “flesh” based on their experimental conditions or organ types. The “all data” set and these six sub-data sets were analyzed individually to construct seven coexpression networks.

Figure 1

Workflow used for network construction and clustering in the present study.

Table 1 Composition of the 230 microarrays according to the experiment conditions and organs.

The Affymetrix citrus microarray contains 30 217 probe sets and 341 730 probes. To map the probes to citrus gene loci, a customized CDF was generated by AffyProbeMiner27 using the C. x clementina v1.0 annotation as the reference8. After removing ambiguous probes that mapped to multiple gene loci and probe sets with fewer than five probes, 158 557 probes belonging to 12 005 gene loci were kept in the customized CDF. Therefore, the expression of 48.9% of citrus genes (12 005/24 533) can be measured accurately using the Affymetrix citrus microarray. According to the NICCE network study, 47.6% (14 020/29 445) of C. sinensis genes can be measured by this citrus microarray17. A similar result was reported in maize, where only 56.5% of genes could be detected by maize microarrays41.

The coexpression networks were constructed using RMTGeneNet28, which first calculates pair-wise Pearson correlation coefficients (PCC) for all genes and then identifies a PCC threshold using Random Matrix Theory. The PCC thresholds for these coexpression networks are shown in Table 2, ranging from 0.882 for “all data” to 0.968 for “HLB”. At these relatively stringent thresholds, only the top 0.24% to 1.06% of all possible edges were retained. The number of nodes in these networks ranges from 1137 to 2263, accounting for 9.47%–18.85% of the measurable genes on the citrus microarray.

Table 2 Topological characteristics of seven coexpression networks

Network topology

Figure 2 displays the coexpression network of “all data” drawn with Cytoscape 2.8.329. Although these networks have different numbers of nodes and edges, they have similar topological characteristics (Table 2). Each network is composed of one major component and several small components. All nodes within one component are directly or indirectly connected. Nodes in the major components account for 62.56% to 82.28% of the nodes in the corresponding networks. The average path length of these networks ranges from 6.66 to 10.75, implying small-world properties. The node degree distribution of these networks fits a power law with the degree exponent (r) ranging from 1.13 to 2.10, indicating that these networks are scale free. These networks show modular and hierarchical characteristics, with an average clustering coefficient ranging from 0.20 to 0.40, which is more than 36 times higher than that of random networks of the same size (data not shown). Table S2 provides a list of all edges in these networks. Taken together, these seven networks contain 37 633 edges among 6256 nodes (genes, Table S3), which account for 52.11% of the measurable genes on the microarray or 25.50% of the total genes in the C. x clementina v1.0 genome. Table 3 shows the intersections between the nodes and edges of these networks. Generally, the intersections among them are relatively low. In total, 3304 nodes (52.81%) and 34 860 edges (92.63%) were found in only one network. The intersection among the “leaves”, “all data”, and “citrus canker” networks is higher than that among the other networks.
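For readers unfamiliar with the average clustering coefficient, the following Python sketch computes it for a small undirected graph. It is not the NetworkAnalyzer implementation; the convention that nodes with fewer than two neighbors contribute a coefficient of 0 is an assumption, and the function name and toy graph are illustrative.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    total = 0.0
    for node, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            continue  # assumed convention: coefficient 0 for degree < 2
        # Count edges that exist between pairs of this node's neighbors.
        links = sum(1 for u, v in combinations(ns, 2) if v in nbrs[u])
        total += 2 * links / (k * (k - 1))
    return total / len(nbrs)

# A triangle (a, b, c) with one pendant node d attached to c.
toy_graph = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
avg_cc = average_clustering(toy_graph)
```

In the toy graph the local coefficients are 1, 1, 1/3, and 0, so the average is 7/12.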

Figure 2

Layout of the citrus “all data” coexpression network. The most overrepresented GO terms are shown for the 12 largest color-coded modules.

Table 3 Intersection between edges/nodes (upper/lower triangular) of networks

Network clustering and functional enrichment

The MCL algorithm was used to identify sets of nodes (i.e., coexpression modules) that are more densely connected with each other than with the remaining nodes of the network4. The inflation parameter (I), the most important parameter of MCL, was chosen according to area fraction, mass fraction, and efficiency. In the present study, more than 80% of the entire edge mass could be captured using less than 3% of the network area (Table 4). A total of 2338 modules were detected in these seven networks (Table S4), with 525 of them containing five or more nodes. The size of the largest module in each network ranges from 47 to 200. Functional enrichment analyses of these 525 modules were performed using terms from the GO biological process and KEGG pathway (Tables S5 and S6). Only terms enriched within a module with a Fisher’s p-value of 0.05 or less were considered. Only 343 modules in these networks had some degree of GO enrichment. Some GO terms were commonly enriched in these networks, such as gene expression (GO: 0010467), translation (GO: 0006412), and photosynthesis (GO: 0015979). However, the number of genes associated with these common GO terms varied among networks. For example, 33 and 28 genes were associated with photosynthesis (GO: 0015979) in the “all data” and “leaves” networks, respectively, whereas zero and five genes were related to photosynthesis in the “albedo” and “flesh” networks, respectively. A total of 132 GO terms (28.5%) were enriched exclusively in one network, such as polysaccharide catabolic process (GO: 0000272) and trehalose metabolic process (GO: 0005991) in the “citrus canker” network.
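The enrichment test for a single term in a single module can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The Python below is a minimal illustration and not the topGO procedure used in this study; the choice of background gene set, the function name, and the toy counts are assumptions.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value, P(X >= k).

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the module, k: module genes annotated with the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy case: all 3 genes of a 3-gene module carry a term held by 3 of 10 genes.
p_toy = enrichment_p(k=3, n=3, K=3, N=10)
```

For the toy counts the p-value is 1/120, because only one of the 120 possible 3-gene modules contains all three annotated genes.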

Table 4 Network clustering and functional enrichment of modules

Because only a small portion of all nodes (25.15%, 1574/6256) was annotated with KEGG orthology identifiers in the C. x clementina annotation file, their homologs in Arabidopsis were used for KEGG enrichment. A total of 60 modules were detected with significantly enriched KEGG pathways, and 36 KEGG pathways were enriched in at least one module. Some pathways were commonly enriched in these networks, such as ribosome (ath03010) and photosynthesis (ath00195). A clear correspondence was observed between the GO and KEGG enrichment analyses.

Predominant function of selected modules

Four modules are presented below to illustrate the correspondence of modules with defined biological functions and the methods that can be used to explore functional modules in these gene coexpression networks.

(1) Citrus lateral organ boundaries 1 in “citrus canker” network

The guide-gene approach is commonly used to explore functional modules in gene coexpression networks. A lateral organ boundaries 1 (CsLOB1) gene has recently been identified as a citrus canker disease susceptibility gene in sweet orange42. The precise function of CsLOB1 is still not clear. Using its homolog in C. x clementina (Ciclev10033956m) as a guide, 25 coexpressed genes were identified in module 1 of the “citrus canker” network (Figure 3). Six of them are involved in cell wall metabolism: Ciclev10005888m (plant pectin methylesterase inhibitor superfamily protein), Ciclev10016123m (xyloglucan endotransglucosylase/hydrolase 5), Ciclev10021623m (expansin B2), Ciclev10007670m (proline-rich extensin-like receptor kinase), Ciclev10014994m (glycosyl hydrolase), and Ciclev10019941m (pectin lyase-like superfamily protein). Similar results were reported in the NICCE networks17. Interestingly, three minichromosome maintenance family genes (Ciclev10007588m, Ciclev10027769m, and Ciclev10019324m) were coexpressed with Ciclev10033956m, suggesting a role for LOB1 in DNA replication. Another candidate target of TAL effectors, CsSWEET1 (Ciclev10002276m)42, was also included in module 1 of the “citrus canker” network. It encodes a sugar transporter for pathogen nutrition and is linked to Ciclev10033956m through three nodes (the shortest path).

Figure 3

Graph showing coexpressed genes of the C. x clementina homologs of citrus LOB1 (Ciclev10033956m) and SWEET1 (Ciclev10002276m) in canker-module 1.

(2) Module 25 in “citrus canker” network (canker-module 25): plant hormone signal transduction

Canker-module 25 was selected based on functional enrichment analyses. It has 10 nodes, 14 edges, and a density of 0.311 (Figure 4). The highest ranked (lowest p value) GO term of this module was response to oxidative stress (GO: 0006979, p = 0.05). The highest ranked KEGG pathway of this module was plant hormone signal transduction (ath04075, p = 0.00024). Increased ethylene production was reported in citrus leaves inoculated with Xanthomonas campestris pv. citri (Hasse) Dye (Xc), a strain of bacteria that causes citrus canker43. However, the ethylene signal transduction pathway is not clear in citrus. Three nodes in this module, Ciclev10019132m (ERS1, ethylene response sensor 1), Ciclev10021170m (MAP kinase kinase), and Ciclev10005820m (ERF1, ethylene response factor 1), may be involved in ethylene signal transduction. The hub gene of this module is Ciclev10019132m (ERS1). In Arabidopsis, the ethylene signal is first perceived by endoplasmic reticulum-localized receptors (including ERS1) and then transduced to ERF and downstream targets through MAPK cascades44,45. A jasmonic acid-amido synthetase gene (Ciclev10019459m) and a protein phosphatase 2C gene (Ciclev10004981m) were also included in this module, implying cross-talk among the ethylene, JA, and ABA signaling pathways. Other genes may also be involved in plant hormone signal transduction, such as Ciclev10024032m (cysteine-rich receptor-like protein kinase) and Ciclev10001726m (peroxidase gene). Therefore, canker-module 25 is likely to carry out functions in plant hormone signal transduction, and unannotated genes in this module can be hypothesized to be related to plant hormone signal transduction.
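The module densities quoted in this section are consistent with the usual definition of density for an undirected graph, 2E / (N(N − 1)), where N is the number of nodes and E the number of edges. The text does not state the formula, so treating it as the one used here is an assumption; the short Python check below simply shows that it reproduces the reported values.

```python
def density(nodes, edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * edges / (nodes * (nodes - 1))

# Node and edge counts as reported for the three modules in this section.
reported = {
    "canker-module 25": (10, 14),
    "flesh-module 19": (11, 11),
    "HLB-module 6": (28, 75),
}
densities = {name: round(density(n, e), 3) for name, (n, e) in reported.items()}
```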

Figure 4

Genes and edges in canker-module 25.

(3) Module 19 in “flesh” network (flesh-module 19): fruit ripening

Flesh-module 19 was also selected based on functional enrichment analyses. It has 11 nodes, 11 edges, and a density of 0.2 (Figure 5). The highest ranked KEGG pathway of this module was the citrate cycle (TCA cycle) (ath00020, p = 0.00026). The citrate cycle is the major pathway for the synthesis of citric acid, the most abundant organic acid in citrus46. At least three nodes of this module are related to the citrate cycle: Ciclev10008189m (dihydrolipoamide succinyltransferase gene), Ciclev10025308m (dihydrolipoamide acetyltransferase gene), and Ciclev10013692m (acyl-activating enzyme 5 gene). Two nodes are involved in the biosynthesis of polyphenol compounds: Ciclev10019346m (UDP-glycosyltransferase gene) and Ciclev10011175m (phenylalanine ammonia lyase gene). One node, Ciclev10028195m (glucose-1-phosphate adenylyltransferase gene), is involved in glycogen biosynthesis. All these nodes are linked by Ciclev10006509m, which encodes a subunit of a RUB (Related to Ubiquitin)-activating enzyme. The proteins encoded by these genes may be subject to similar post-translational modifications.

Figure 5

Genes and edges in flesh-module 19.

(4) Module 6 in “HLB” network (HLB-module 6): programmed cell death

HLB-module 6 has 28 nodes, 75 edges, and a density of 0.198 (Figure 6). This module was selected because 13 of these 28 genes (46.43%) were included only in the “HLB” network. Seventeen genes were assigned to specific GO terms. The highest ranked GO term of this module was programmed cell death (PCD, GO: 0012501, p = 0.005). PCD is widely observed in plants in response to pathogenic infection. At least eight genes in this module are related to PCD. The Bcl-2-associated athanogene gene (Ciclev10018596m) plays a critical role in PCD47. It can suppress PCD via its interaction with Hsc70 and Hsp40 (Ciclev10000372m)48. However, the up-regulation of genes involved in the ubiquitin-proteasome system can activate PCD49. Ciclev10008240m (polyubiquitin 10) and Ciclev10005221m (RING finger E3 ubiquitin ligase) are part of the ubiquitin-proteasome system. Other genes related to PCD include Ciclev10005800m (myosin heavy chain-related), Ciclev10032432m (sphingoid base hydroxylase), Ciclev10021281m (LAG1 longevity assurance homolog 3), and Ciclev10032631m (glutaredoxin family protein). Their functions in HLB still need to be determined.

Figure 6

Genes and edges in HLB-module 6.

Comparison with NICCE network

While this manuscript was being prepared, a citrus gene coexpression network (called the “NICCE network” in this study) based on publicly available microarray data sets was reported17. There are several differences between the NICCE networks and the networks in this study.

First, probe sets, rather than genes, were used to construct the NICCE networks. Of the 30 217 nodes of the NICCE networks, 5960 (19.7%) were not mapped to any citrus transcript and 9336 (30.9%) belonged to the “one probe set per transcript” group. The remaining 14 921 probe sets (49.4%) represented 5775 transcripts (38.2%) (Table S7). As a result, 5.9% of the edges of the NICCE networks were between probe sets of the same transcript/gene. Probe sets representing the same transcript are expected to have similar expression levels and to appear in the same cluster of a network. However, this is not the case in the NICCE networks. One example (Cs1g07330.1) is shown in Table S8.

Second, when the NICCE networks were constructed, PCC values between probe sets were transformed into highest reciprocal ranks (HRR), and the top 100 HRR for a given probe set were considered. As a result, most PCC values between nodes of the NICCE networks are very low. Cs5g33560 is given as an example on the NICCE website (http://citrus.adelaide.edu.au/nicce/home.aspx). However, the PCC values between Cs5g33560 and its coexpressed genes in the condition-independent network range from 0.39 to 0.68. More attention should be paid to assessing gene pairs with low PCC values.

Third, only sweet orange microarrays were used in this study, and they were classified into six condition-dependent data sets: citrus canker, HLB, leaves, flavedo, albedo, and flesh. In the NICCE networks, 297 microarrays from different species of citrus (including mandarin, sweet orange, lemon, and pummelo) were used, and they were classified into four condition-dependent data sets: sweet orange, fruit, leaf, and stress17.

In order to compare our networks with the NICCE networks, C. x clementina gene IDs from our networks were transformed to C. sinensis gene IDs. C. sinensis orthologs were not identified for 1504 C. x clementina genes in our networks. Therefore, only 26 191 edges in our networks were used in the comparison with the NICCE networks, whose nodes were also transformed to C. sinensis gene IDs. Only 3868 edges were found in common between the two networks. About 85% of edges in our networks were not included in the NICCE networks. This may be due to different classification methods for microarray data sets. Most edges (72.84%) in our networks were exclusively found in condition-dependent networks.
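The edge comparison amounts to a set intersection over undirected gene pairs once both networks use the same gene IDs. The Python sketch below illustrates the idea on toy data; the function names and gene IDs are hypothetical, and the ID conversion step is not shown. The last line checks the reported counts: 3868 of 26 191 edges is about 14.8% shared, which matches the statement that about 85% of edges were not found in the NICCE networks.

```python
def undirected(edges):
    """Represent each edge as an unordered pair so (a, b) equals (b, a)."""
    return {frozenset(e) for e in edges}

def shared_fraction(ours, theirs):
    """Fraction of edges in `ours` that also occur in `theirs`."""
    a, b = undirected(ours), undirected(theirs)
    return len(a & b) / len(a)

# Toy networks with hypothetical gene IDs.
net_a = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")]
net_b = [("g2", "g1"), ("g5", "g6")]
toy_shared = shared_fraction(net_a, net_b)

# Reported counts from the comparison with the NICCE networks.
reported_shared_pct = round(100 * 3868 / 26191, 1)
```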

Validation of coexpression networks using RNA-seq data

To validate the coexpression networks in this study, 500 edges among 371 genes were randomly selected from the “all data” network. The expression of these genes was examined using another gene expression data set (Table S9) in CAP34, and the correlation coefficients (r) between them were computed. The distribution of these correlation coefficients was highly skewed, as shown in Figure 7. For 353 edges (70.6%), the r values were higher than the PCC threshold used to construct the “all data” network (0.882). The r values of 385 edges (77.0%) were higher than 0.8. These results suggest that the coexpression networks in this study are reliable.
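The validation logic can be sketched as follows: for each sampled edge, recompute the correlation in an independent expression data set and count how many edges exceed the cutoff. The Python below is an illustration only; the original calculation was done in R, the use of the absolute value of r follows the Figure 7 caption but is otherwise an assumption, and the function names and toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fraction_supported(edges, expr, threshold):
    """Fraction of edges whose |r| in an independent data set exceeds threshold."""
    hits = sum(abs(pearson(expr[a], expr[b])) > threshold for a, b in edges)
    return hits / len(edges)

# Toy independent data set (illustrative values only).
rnaseq = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.1],
    "g3": [4, 1, 3, 2],
}
sampled_edges = [("g1", "g2"), ("g1", "g3")]
supported = fraction_supported(sampled_edges, rnaseq, threshold=0.882)
```

In the toy data one of the two sampled edges is supported, giving a fraction of 0.5; in the study the corresponding figure was 353 of 500 edges.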

Figure 7

Distribution of the absolute values of the correlation coefficients.

Discussion

In this study, 230 citrus microarrays from a diverse collection of experiments were used to construct seven coexpression networks. The number of nodes in these networks ranges from 1137 to 2263, accounting for 9.47%–18.85% of the measurable genes on the citrus microarray. This is consistent with Ficklin’s work on rice20, which also employed the RMT method to select a threshold for a rice coexpression network and included 10% of the measurable genes on the rice microarray. The percentage is relatively low compared with other studies using empirical thresholds. For example, in the Arabidopsis coexpression network, the PCC cutoff value was set to 0.75 and 38% of the measurable genes were retained4. The RMT method was taken from the field of particle physics and has been used to construct gene coexpression networks for Escherichia coli, yeast, human, Arabidopsis, rice, and maize20,28,50,51. It has been demonstrated to be a reliable method for generating networks across a wide range of data sets50. It should be mentioned that after combining the seven coexpression networks, the nodes captured in our study reached 52.11% of the measurable genes on the microarray.

Both condition-independent and condition-dependent analyses were employed to ensure that genes coexpressed only under specific conditions were not lost. In our networks, 77.77% of nodes and 72.84% of edges were found exclusively in condition-dependent networks. Functional analysis of modules yielded similar results: 66.31% of enriched GO terms were identified only in condition-dependent networks, such as programmed cell death in the “citrus canker”, “HLB”, and “albedo” networks. Condition-independent analysis is considered suitable for identifying globally coexpressed genes7, such as genes in photosynthesis, ribosome, and DNA metabolism. In this study, we found that condition-independent analysis was not sufficient to identify all the genes in these pathways. For example, 159 ribosome genes could be measured on the citrus microarray (Table S10). Thirty-seven ribosome genes were included in the condition-independent network (“all data”), far fewer than the 147 ribosome genes in the “leaves” network. It has been demonstrated that gene coexpression analysis using too many microarray samples can result in the loss of information52. Therefore, condition-dependent analysis is necessary even for identifying globally coexpressed genes.

According to the present annotation of the C. x clementina genome, 2485 (39.72%) and 4682 (74.84%) genes in these networks were not assigned to a specific GO term or KEGG pathway term, respectively8. The function of these genes could be predicted based on well-annotated genes within the same module. For example, 28 genes were included in HLB-module 6. Eleven of them were not labeled with a specific GO term, and only four genes were assigned to a specific KEGG pathway. Based on the above analysis, HLB-module 6 is likely to carry out functions in programmed cell death, and unannotated genes in this module could be hypothesized to be related to programmed cell death. In addition to gene function prediction, gene coexpression analysis is also helpful for hypothesis generation and testing7. For example, several genes encoding transcription factors were also included in HLB-module 6, such as ERF and a KH domain-containing putative RNA-binding protein. It has been demonstrated in Arabidopsis that a KH domain-containing putative RNA-binding protein is critical for HSF and HSP regulation53. Therefore, it would be reasonable to hypothesize that those transcription factors regulate the expression of other genes within the same module.