Parsimonious Gene Correlation Network Analysis (PGCNA): a tool to define modular gene co-expression for refined molecular stratification in cancer

Cancers converge onto shared patterns that arise from constraints placed by the biology of the originating cell lineage and microenvironment on programs driven by oncogenic events. Here we define consistent expression modules reflecting this structure in colon and breast cancer by exploiting expression data resources and a new computationally efficient approach that we validate against other comparable methods. This approach, Parsimonious Gene Correlation Network Analysis (PGCNA), allows comparison of network structures between these cancer types identifying shared modules of gene co-expression reflecting: cancer hallmarks, functional and structural gene batteries, copy number variation and biology of originating lineage. These networks along with the mapping of outcome data at gene and module level provide an interactive resource that generates context for relationships between genes within and between such modules. Assigning module expression values (MEVs) provides a tool to summarize network level gene expression in individual cases illustrating potential utility in classification and allowing analysis of linkage between module expression and mutational state. Exploiting TCGA data thus defines both recurrent patterns of association between module expression and mutation at data-set level, and exemplifies the polarization of mutation patterns with the leading edge of module expression at individual case level. We illustrate the scalable nature of the approach within immune response related modules, which in the context of breast cancer demonstrates the selective association of immune subsets, in particular mast cells, with the underlying mutational pattern. Together our analyses provide evidence for a generalizable framework to enhance molecular stratification in cancer.

Analysis of different edge reduction approaches clustered using the FastUnfold method for the single data-set GSE39582. Each edge reduction method was clustered 10,000 times by FastUnfold and the 100 best clusterings (judged by the modularity score) analysed for gene signature enrichment. (a) shows density plots of the module size vs module SCES (Scaled Cluster Enrichment Score; see supplemental methods) across the 100 best clusterings for EPG3 (Edge Per Gene 3) or the WGCNA sigmoid adjacency function varying the shift (μ, 0.1-0.9). (b)  Fig.10. Network visualization. Accompanies, Figure 2 and Figure 5.
Detailed annotation of the network generated by PGCNA for BRCA displaying the optimal clustering as shown in Fig. 2 Fig.14. Assessment of network module co-occurrence across all samples in training array data sets and TCGA RNA-seq data. Accompanies Figure 6.
Correlation heatmaps of the co-occurrence of network modules in array training data (left panels) and TCGA RNAseq data (right panels) for BRCA (upper panels) and CRC (lower panels).
These heatmaps illustrate the results of gene signature and ontology term enrichment analysis for the module neighborhood analysis shown in Figure 7

Motivation for method
Our overall motivation was based on the following requirements for the method: Visualize/contextualize the biology contained in large gene expression data-sets. Generate modules (also referred to as clusters) that have distinct and meaningful biology using an unsupervised approach (no prior assumptions). Be able to map the discovered modules onto gene expression data-sets to generate module 'fingerprints' per sample/patient. Use these derived 'fingerprints' to study relationships between module biology and external factors (not expression based; e.g. mutations). Use module gene/signature membership to study recurrent features between cancers.
However, we wanted to achieve these goals whilst only introducing complexity where it was proven to be required. This was a core motivation for the development of the Parsimonious Gene Correlation Network Analysis (PGCNA) technique. There were several challenges that we needed to overcome during the implementation of PGCNA: Generation of robust correlations Edge reduction Clustering methods Clustering selection (ranking success) These will be addressed in Method development and further specific method details given in Additional method details.

Method development
This section gives an overview of the development path of PGCNA.

Generation of robust correlations
See Supp. Fig.1 part 2 and 3 (For data-sets and data preparation see Normalization and re-annotation of data and Expression data sets) In order to generate networks we required the correlations of all gene pairs within breastcancer (BRCA) and colorectal-cancer (CRC). Sample biases may exist within a single gene expression data-set; in order to minimize such effects we decided to harness as many expression data-sets as possible. Merging the data-sets, given the diversity of array types and the resulting normalization issues would likely introduce noise. The simplest way to harness multiple data-sets was to initially analyze them independently and then merge the correlations. Per data-set the 80% most variant genes were retained, and Spearman's rank correlations calculated (Python scipy.stats package). Given that each data-set was analyzed independently (different ranking of gene variance) and that the data-sets spanned array platforms (different gene content) the resulting correlation matrices whilst largely overlapping in terms of gene content, additionally included variable subsets of genes. The resultant p-values and correlations matrices were merged across all data sets for a given cancer by taking the median values (across the sets in which the gene pairs were contained) to give a final median correlation matrix and its corresponding p-value matrix. Genes present in < 9 and <4 data sets were removed from the BRCA and CRC matrices respectively. This gave a final matrix size of 17,805 and 18,896 for BRCA and CRC respectively. Finally, all correlations with a p-value > 0.05 were set to 0 to reduce noise.

Edge reduction
See Supp. Fig.1 part 4 One of the biggest challenges with gene network analysis is edge reduction (where edges are the gene pair correlations); the removal of less-informative correlations. As the (number of edges) = (number of genes) 2 it quickly becomes intractable for both visualization and memory footprint (see Table 1). Using a hard-threshold, where correlations below a cutoff are removed, can help reduce the total number of edges but at the cost of generating many orphan genes (disconnected from any other gene) thus failing our requirement to visualize/contextualize overall patterns of gene co-expression. Furthermore, in order to generate visually interpretable networks the generation of modules (finding modules within the correlation matrix) is sometimes separated from the visualization step, utilizing all edges for clustering while only visualizing a small subset (e.g. WGCNA; again accompanied by the problem of orphans) 1 . We aimed if possible to visualize all the data used in module discovery.
We decided to test the simplest possible edge reduction technique -For each gene (row) in a correlation matrix only the N most correlated Edges Per Gene (EPG; EPG3 means retaining 3 edges per gene) were retained, with N ranging from 3 to 10 (<3 gives orphan modules). The resulting matrix M, with entries written as M = (mij) was made symmetrical by setting mij = mji for all indices i and j so that M = M T (its transpose). The idea being that related genes would join to form communities, in a fashion analogous to the assembly of a daisy-chain flower-garland.
We wanted to contrast the EPG approach with other edge refinement approaches. One such approach is applying an iterative Pearson's correlation coefficient (iPCC) where correlations are iteratively calculated using the previous iteration as input (1 st -n th order) until these converge on a correlation matrix that contains only -1/1 values 2 . In addition, we compared with two methods utilized by the WGCNA package: a power based (WGCNA adjacency function; referred here as PowerST) and sigmoid based (WGCNA sigmoidAdjacencyFunction; referred here as Sigmoid) function.
These different edge refinement approaches have different characteristics. While the iPCC method has been shown to aid module discovery it removes all substructure within modules and removes the opportunity for edge reduction (correlations all -1/1). While this may generate useful modules, by retaining such a large number of edges, and by removing information that may guide substructure within the modules, it fails our visualize/contextualize requirement. However, this was included in our analysis to test if clustering, at the cost of no visualization, would yield more informative modules.
The two WGCNA approaches (PowerST/Sigmoid) have user defined parameters, selection of these is guided by generating networks over a range of parameter values and quantifying how well each resultant network satisfies a scale-free topology. We analyzed PowerST/Sigmoid across the scale-free range to see how sensitive the results are to input parameters.
The EPG approach has two interesting properties -firstly even with just EPG3 no gene is an orphan and rarely is any set of genes disconnected from the remainder. By contrast at EPG2 orphan modules are often generated. Secondly, as some genes are common partners of many other genes the total number of edges < #genes x EPG, giving a degree (#edges per gene) that follows the power-law and is a scale-free network (see Supp. Fig.  2d, unlike hard-thresholding Supp. Fig. 2e )   Clustering methods See Supp. Fig.1 part 5 The EPG edge reduction approach lends itself to network module discovery methods. We chose the 'Fast unfolding of communities in large networks'/Louvain modularity method (herein referred to as FastUnfold; version 0.3) as it scales to very large networks and runs fast enough to allow many iterations 4 . However, we wanted to make sure that our drastic edge reduction technique was not throwing away useful information. To this end we compared FastUnfold to two standard clustering approaches: hierarchical clustering and kmeans clustering (R packages fastcluster and kmeans respectively), comparing these for EPG3 -EPG10 and the total unpruned correlation matrices. FastUnfold was run 10,000 times at each EPG level and the 100 best (judged by the FastUnfold modularity score) were used for downstream analysis. The FastUnfold algorithm automatically converges on a module number and therefore does not require a user defined module number. Interestingly, the median number of modules generated by FastUnfold decreased with EPG level (Supp. Fig.2a), converging on 3-5 modules for the unpruned total correlation matrix. Thus, applying FastUnfold was not particularly effective when considering the unpruned data. By contrast the generation of discrete modules by FastUnfold, which penalizes network solutions that retain connectivity between modules, is favored by the EPG edge reduction technique. The latter intrinsically selects for hub nodes and favors separation of gene clusters by ensuring that common partner genes at the given EPG level retain more edges. This could be seen as effectively pre-focusing the input correlation matrix prior to application of the network module discovery method.
For the k-means clustering k was set to ± 1 around the module number from the best FastUnfold solution (see Cluster selection) and for each k and EPG 50 iterations were run.
For hierarchical clustering 8 different linkage methods (average, centroid, complete, Mcquitty, Median, Single, WardD and WardD2) were used and the resultant dendrograms cut at ± 10 around the module number from the best FastUnfold solution, giving 21 results for every input matrix (note: only the 2 best linkage methods, WardD/WardD2, are shown in Fig. 1a).

Clustering selection
See Supp. Fig.1 part 6 To judge the success of the clustering approaches we wanted a metric that assessed 3 things: Biological enrichment in each module. Purity of biological enrichment. Rewarding clustering approaches that separate genes into modules with distinct biological function, and in turn punishing those that have functional redundancy across modules. Skewing of modules sizes. While we don't expect/require all modules to be the same size we want to avoid skewing as much as possible (i.e. a few modules that contain many genes/functions).
Gene signature analysis was carried out for each module, from each clustering of the data. Then to generate a total enrichment score for a given clustering: 1. Signatures were filtered to retain only those with 5 and 1000 genes with a false discovery rate (FDR; Benjamini Hochberg) of < 0.05. 2. For each module within a clustering, the enriched signatures were ranked by FDR and the top 15 added to a global list of signatures for that clustering. 3. A matrix was generated that contained all the z-scores for every signature (rows) in the global list across all the modules (columns). 4. For each signature a fractional contribution was calculated as the row-max-zScore/row-sum-zScores (where 1 = enrichment of signature in only 1 module). Across all signatures a median factional contribution (MFC) was calculated. 5. The sum of the maximum z-score per signature (row) was calculated (ZScoreMS). 6. Module size skewing was assessed by calculating the normalized Shannon entropy: ( ) = log log of the module sizes. This gave a score that ranged from 1 (even module sizes) towards 0 with increasing skewing.
Since both MFC and normalizedEntropy have a range of 0-1 the simplest way to scale the ZScoreMS was the product of the three values, thus: Scaled cluster enrichment score (SCES) = ZScoreMS MFC normalizedEntropy.
This allowed the selection of the best FastUnfold clustering ( Fig.1a; Gene Signature Enrichment: FastUnfold). This was then used to set the module number range in the kmeans/hierarchical approaches. The FastUnfold method outperformed the kmeans/hierarchical clustering methods across all EPG, with only the Ward-linkage hierarchical clustering approaching a similar enrichment when using the Total data. With increasing EPG there was a corresponding decrease in module number with no trade-off of increased biological enrichment (Supp Fig.2a and Fig. 1a). Thus, for all downstream analysis we chose the optimal FastUnfold EPG3 result for both cancers. However, it should be noted that most of the recovered modules were broadly retained across the 100 FastUnfold clustering results (see Module Stability and Supp. Fig. 2b/c). The combination of FastUnfold and EPG3 we term a Parsimonious Gene Correlation Network Analysis (PGCNA).  Fig. 11a/b show visualizations of the optimal BRCA/CRC gene signature results. As before these show the top 15 signatures per module (with 5 and 1000 genes) but are filtered with the more lenient p-value < 0.01.

Comparison of edge refinement methods
The EPG edge reduction strategy combined with clustering with FastUnfold (termed PGCNA) generated modules with high Scaled cluster enrichment scores (SCES). Given the simplicity of the EPG approach it might be surprising that it generates such good results. We therefore used 3 other edge refinement approaches (iPCC, PowerST and Sigmoid; see Edge Reduction) in combination with FastUnfold as a comparison.
As the PowerST and Sigmoid function utilized WGCNA, and we wanted to also compare against the whole WGCNA package (Comparison with WGCNA) we could not use the median correlation matrices for this analysis. This is because WGCNA is mostly aimed at analyzing individual data-sets starting from gene expression arrays. While the median correlation matrix we used for the PGCNA analysis could be imported into WGCNA this then precluded many of the downstream features. Thus, we selected single representative data-sets for BRCA/CRC to analyze, selecting two data-sets that were on the Affymetrix Human Genome U133 Plus 2.0 platform and had been used successfully to generate classifications of BRCA/CRC (GSE20686/GSE39582 containing 327 and 566 samples respectively).
For both data-sets, probes were reannotated and merged (see Normalization and reannotation of data) and the 80% most variant genes retained, giving data-sets containing 16,807 genes.
For both the BRCA and CRC data-sets the edge reduction strategies: Were clustered by FastUnfold 10,000 and the best 100 (judged by FastUnfold modularity score) were retained for downstream analysis.
The different methods were clustered into varying numbers of modules across a range of SCES ( Fig. 1. b/c and Supp. Fig. 3 and 4). The iPCC, retaining > 70 million edges, always clustered into 2 modules, with the lowest level of SCES. PowerST generated between 2-4 modules, with marginally more SCES than iPCC. The Sigmoid function generated 3-280.5 and 3.5-263 median modules for BRCA and CRC respectively. As can be seen in Supp. Fig. 2/3 as the sigmoid increased a larger portion of very small modules was produced, so that by 0.9 only 38 of the 280.5 and 33 of the 263 modules had > 5 genes for BRCA/CRC respectively. In addition, if the resulting correlation matrices were cut at a hard threshold of 0.01 then only 32/61 % of genes were connected in the Sigmoid0.9 networks for BRCA/CRC respectively.
The EPG3 method generated networks that had the highest SCES and lowest edge numbers whilst maintaining 100% gene connectivity. For example, BRCA EPG3 generated a median of 31 modules with an SCES of 6,754 with only 46,776 edges, in contrast  Sigmoid0.9 generated a median of 280.5 modules (38 with >5 genes) with an SCES of 5,044 with 189,401 edges, but with only 32% of genes connected.
Thus, the EPG edge reduction approach seems to be particularly suited to clustering by FastUnfold.
Comparison with WGCNA In order to assess any potential added value of the PGCNA approach we considered its relative merits in relation to the popular gene network analysis approach, the Weighted Gene Correlation Network Analysis (WGCNA) tool 1 . It tackles the edge reduction problem in two ways -firstly the correlation matrix is raised to a power (soft-thresholding; default approach), with the power chosen to try and maximize the scale-free nature of the network, this has the effect of reducing low correlations to near zero. Secondly, after softthresholding the correlation matrix is converted to a topological overlap measure matrix (TOM). This ranges from 1 (direct link between two genes with one set of direct neighbors being a subset of the other) to 0 (no direct link between two genes). TOM has been shown to aid module discovery and help with spurious/missing correlations between genes. The TOM is used for module discovery using a hierarchical clustering approach followed by merging of similar modules (via clustering of eigengenes) to end up with a representative set of modules.
Where possible we used the default settings for WGCNA: Network-type: signed Cut-method: tree (cutreeDynamic) Soft-Power: 2-7 (covering the scale free range for both BRCA/CRC; see Supp. Methods Fig. 1 a/ For the PGCNA analysis the same expression data-sets were used to generate Spearman correlation matrices. These were pruned by EPG 3-10 and clustered 10,000 times with FastUnfold, the best 100 (judged by FastUnfold modularity score) were retained for downstream analysis (see Supp. Methods Fig. 2).
Comparing SCES levels across the 100 PGCNA EPG3 clusterings and 6 WGCNA softthreshold levels (Supp. Fig. 5) showed that the PGCNA approach generated better overall biological enrichment, which is less redundant across modules and had modules that were more even in size (Supp. Fig. 5. a). There was a clear separation between the distribution of enrichments when comparing these methods, such that even the worst of the 100 iterations of the FastUnfold clusterings outperformed the best WGCNA derived clustering.
Comparing modules obtained across the different run parameters addresses to what extent gene membership of modules is a recurrent feature. For these comparisons a hypergeometric test was used to compare the overlap of module gene membership generated by the indicated networking solution (WGCNA and PGCNA). For WGCNA the comparisons shown are between different soft threshold levels (ST1-ST7) (Supp. Fig. 6/7), while for PGCNA either different FastUnfold solutions based on the input EPG level are compared (EPG3-EPG10) or for EPG3 the network solutions with Best, Median and Worst resolution of gene ontologies as judged by the scaled cluster enrichment score (SCES) are displayed (Supp. Fig. 8/9). WGCNA has a greater disparity between soft-threshold levels Supp. Methods Fig.2. PGCNA run information. Top table shows results for the merged data-set analyses, the bottom table the results for the single data-set analyses over a range of EPG. For the single data-set analyses to make the results more comparable with WGCNA the Time-Taken is for performing 1,000 FastUnfold clusterings, while 10,000 were used in the final analysis (~ 2h run time). In addition, the Time-Take and RAM-used are not shown for the Merged data-sets (top table) as these were generated before the final PGCNA python script was produced (see software for details on python script).
than PGCNA does for either EPG or GSE levels, which is shown by the WGCNA histograms having a skewing to low level -log10 p-values (high p-values) and PGCNA skewing to high -log10 p-values. PGCNA generates very similar sets of modules across EPG levels, with higher EPG modules being supersets of those generated with lower EPG. In addition, the difference between the best/worst PGCNA clustering (judged by SCES; see grey boxes in Supp. Fig. 8/9) is less than seen between WGCNA soft-threshold levels, and again reflects modules being split/joined rather than completely distinct modules being identified.
This analysis suggests that for the data-sets tested PGCNA outperforms the popular WGCNA method. It is likely that the WGCNA approach could be improved with fine tuning of parameters (e.g. different cut/merge heights), selection of a more robust clustering method (biweight mid-correlation) etc. However, we note that a strength of the PGCNA approach is again provided by simplicity, which by minimizing user options removes the onus to fine-tune parameters, while still providing efficient network solutions. PGCNA users may choose to retain more edges as one variable input parameter, we note that increasing EPG level of the parsimonious correlation matrix primarily has the effect of merging modules that separate at more stringent EPG levels, rather than leading to the identification of entirely new gene clustering.
As an extension of this concept FastUnfold resolves substructure within network modules. Such substructure can be further explored within individual modules or across merged combinations of modules, or indeed by focusing around a single index gene by limiting the content of the input correlation matrix. This is illustrated in the main manuscript in what we refer to as a neighborhood analysis. Here the computational efficiency of the EPG and FastUnfold combination again proves valuable.
Thus, overall, the EPG method of edge reduction, while simple, when combined with FastUnfold yields surprisingly robust results that: Outperform analyzing all edges with existing clustering methods (hierarchicalclustering/k-means; see Fig.1 a). Outperform other edge refinement methods combined with FastUnfold (see Fig. 1. b/c and Supp. Fig. 3/4). Outperform more computationally expensive networking approaches (e.g. WGCNA, see Supp. Fig. 5-9). Allows visualization/storage of very large networks by reducing to a minimal edge set. Allows visualization of the same data that was used for clustering. To visualize WGCNA networks there is often a requirement to carry out additional hard thresholding (removing edges < correlation-cutoff) to reduce to a tractable edge number, thus the clustered/visualized data are different. Generates networks with scale-free properties, with hub nodes that seem to be recurrent between cancers (e.g. TNFSF13B, ARHGAP9) and stable across analysis of individual data-sets.