Integrating gene regulatory pathways into differential network analysis of gene expression data

The advent of next-generation sequencing has introduced new opportunities in analyzing gene expression data. Research in systems biology has taken advantage of these opportunities by gleaning insights into gene regulatory networks through the analysis of gene association networks. Contrasting networks from different populations can reveal the many different roles genes fill, which can lead to new discoveries in gene function. Pathologies can also arise from aberrations in these gene-gene interactions. Exposing these network irregularities provides a new avenue for understanding and treating diseases. A general framework for integrating known gene regulatory pathways into a differential network analysis between two populations is proposed. The framework importantly allows for any gene-gene association measure to be used, and inference is carried out through permutation testing. A simulation study investigates the performance in identifying differentially connected genes when incorporating known pathways, even if the pathway knowledge is partially inaccurate. Another simulation study compares the general framework with four state-of-the-art methods. Two RNA-seq datasets are analyzed to illustrate the use of this framework in practice. In both examples, the analysis reveals genes and pathways that are known to be biologically significant along with potentially novel findings that may be used to motivate future research.


Methods problem formulation.
A differential network analysis is considered when gene expression samples are collected from two distinct populations or "groups". The analysis compares the set of inter-connections of genes between the groups. Additionally, the existence of gene pathways is assumed. For our purpose, a pathway is considered to be a subset of genes that act together to carry out a specific molecular function.
Let  ∈ × X k n m k denote an observed gene expression profile containing m genes for n k samples in group ∈ k {1, 2}. The rows and columns of X k are indexed by  ∈ ⋅ X i k m and  ∈ ⋅ X j k n k , respectively. The samples ⋅ X i k are assumed independent and identically distributed, conditioned on the group k. Let G P ⊂ … m ({1, , }) denote a collection of pathways of interest, where  is the power set. ∈ G  contains the indices for the genes in pathway G, and  ∈ ×| | X G ( ) k n G k is the observed expression profile with columns subset on those genes. For a given pathway ∈ G , our interest is in estimating the gene-gene association networks for both groups = k 1, 2, which is represented by the real symmetric matrix  ∈ | |×| | S G ( ) www.nature.com/scientificreports www.nature.com/scientificreports/ Estimation of association networks. The first step is to estimate the gene-gene association network within each group. There is no restriction on the meaning of "association" in the problem formulation other than symmetry. This choice of which association measure to use is left to the practitioner and will likely depend on the biological problem being studied, the amount of data available, and other factors. In this section, we review a variety of approaches that can be considered.
The most common approach is to use simple pair-wise correlations. Let Σ = ⋅ ⋅ E X X ( ( ) ) T 1 1 denote the covariance matrix between genes in a fixed group, where X 1· is a (m-dimensional) random variable. Here, the first row of X can be taken without loss of generality, since we assume that observations are i.i.d. within the group. The marginal correlation between genes i and j is, These correlations can be estimated using the sample covariance matrix Σ = − − n X X ( 1) ( ) T 1 , where n is the number of observations in the group. Alternatively, a robust estimator can be used, such as Spearman's rank-based correlation, Kendall's correlation, and Gaussian rank correlation 37 .
Marginal correlation is used in weighted correlation network analysis (WGCNA) 38 , which is motivated from the scale-free topology observed in biological networks. In WGCNA, the estimates r ij are computed for each pair of genes and soft thresholding is used to send smaller values towards zero.
ij ij where β ≥ 1 is the soft-thresholding parameter. This is in contrast to hard thresholding, ij ij ij in which correlations below a hard threshold parameter, γ > 0, are set to zero. An alternative measure is partial correlation, which gives the conditional correlation rather than marginal. That is, the linear association between two genes is considered after accounting for the presence of the remaining genes in the pathway. Let Ω = Σ −1 denote the inverse of the covariance matrix, also called the precision matrix. Partial correlations between genes can be computed from the precision matrix by If a gene expression profile follows a multivariate normal distribution, then zero partial correlation implies conditional independence of genes; this is called the Gaussian graphical model (GGM). In this model, the precision matrix can be estimated from the log-likelihood function by where  Ω ∈ × m m and Ω  0 denotes the set of positive definite matrices. Alternatively, a penalized log-likelihood function can be used to enforce desirable properties of the network. For example, the graphical lasso enforces sparsity in the network by shrinking some off-diagonal elements to zero through an L 1 penalty 39 .
The problem of estimating large precision matrices has been studied extensively; a review of estimation procedures implemented in R is available 40 , as well as an overview with a concentration on rank based and factor model based methods 41 .
We estimate partial correlations using a shrinkage approach 42 . This estimator was chosen because of its favorable computational properties 42 . The method uses a mixture of a low-dimensional estimate, T , and unconstrained estimate, Σ . In this case, Σ is the sample correlation matrix for group k, and T is the diagonal matrix with = ΣT ii ii , for = … i m 1, , , and zeros on the off-diagonal. The shrinkage estimator, Σ ⁎ , is defined as the linear combination where λ is a shrinkage parameter. It has been shown 43 that there exists an analytical solution for the optimal λ, denoted by λ*, which minimizes Σ − Σ F 2 , where F denotes the Frobenious norm. The solution λ* for various choices of low-dimensional spaces containing T have been derived 42 . In this study, the shrinkage target is the space containing uncorrelated genes allowing for unequal variances. The corresponding optimal shrinkage parameter is The estimate for the precision matrix is obtained from the inverse of the shrinkage covariance estimate, Ω = Σ −ˆ⁎ ( ) 1 .
Differential connectivity score. The differential network analysis measures the change in a set of connections , } for a given pathway  ∈ G . We generalize the differential connectivity score proposed in earlier work 44   where ≥ p 1 is fixed and  | | = | | | | − G G ( 1)/2 denotes the number of possible connections in the pathway G; the weight 1/|E| accounts for the varying sizes of pathways. For < < p 0 1 , the same expression is used to define δ E but with the 1/p exponent removed.
The elements in E can be chosen to test different components of the network: for differential connectivity of the whole pathway, ( ) }; and for the differential connectivity of a single association between genes i and j in G, with < i j, E is the singleton = E i j {( , )}. Note, in this last case the choice of p is inconsequential since the sum in δ E is over a single element.
Tests for significance. For a given set of connections E, we consider the hypothesis test, . The null hypothesis says that the connections in E are consistent across both groups. Under this null, the group labels, ∈ k {1, 2}, for each observation are immaterial when computing S k (G). This sets up a permutation testing procedure that can be used to estimate a p-value for d under the null, whereby permutations of the group labels are used, i.e. the observations are shuffled between groups. The total number of distinct permutations will often be quite large even for moderate sample sizes. In this case, the exact p-value is estimated from B randomly sampled permutations. This estimated p-value is adjusted to ensure it is positive 45 . The following algorithm outlines the permutation procedure.
3. Permute the rows of X to obtain a permuted matrix X*. Use the first n 1 rows for ⁎ X 1 and the remaining n 2 rows for ⁎ X 2 .

Estimate the association networks
In the case of multiple hypothesis testing, i.e. multiple sets E to be tested within a pathway, the Westfall-Young step-down p-values can be used to help control the false discovery rate 46,47 . This procedure monotonizes the p-values with respect to the original test statistics; larger differential connectivity scores will always correspond with lower p-values. The algorithm to compute these monotonized p-values is provided in the Supplementary Materials section S1. simulation studies. Two simulation studies are performed. The first is used to assess the network-wide performance of the proposed framework in detecting DC pathways, DC genes, and DC edges when pathway information is used. The second study compares the performance within a single pathway to four alternative methods, including DGCA 15 , DINGO 20 , INDEED 19 , and JDINAC 27 .
DGCA is available the CRAN R package 'DGCA' (version 1.0.1); DINGO is implemented in the CRAN R package 'iDINGO' (version 1.0.2); an R package for INDEED is available on GitHub at https://github.com/ressomlab/INDEED (version 0.99.19); and there is no package for JDINAC, but the R source code is available on GitHub at https://github.com/jijiadong/JDINAC (accessed on November 27, 2018). The default settings for each of these methods are used. In DGCA, several options for multiple testing correction are available; we use the permutation option with 100 permutations. In DINGO, we lowered the suggested number of permutations from 100 to 20 due to computation constraints. For INDEED, the sparsity parameters are selected by cross validation using one standard error rule, and the number of permutations is set to 1000 as recommended. In JDINAC, the weight threshold is set to 4, with 10 splits and 5 folds used.
Both studies follow the same overall procedure for simulating data. Two GGMs are created to represent the underlying gene-gene network for two distinct groups. In the first study, the networks contain 500 genes, 9 of which are hub genes, and 20 pathways. In the second study, the networks contain 100 genes with one hub gene and a single pathway. A network is generated as follows: 1. For each pathway, generate a random pathway size from a negative binomial distribution with mean 20 and standard deviation 10. 2. Initialize each pathway by randomly selecting nodes from the network to populate the pathway, then generate a scale-free structure to connect these nodes using the Watts-Strogatz method 48 . 3. From the union of nodes selected for the pathways, randomly select nine to rewire as hub nodes. In each pathway containing one of these hub nodes, the hub node has a 50% chance of being connected to each www.nature.com/scientificreports www.nature.com/scientificreports/ node in that pathway. 4. At this point the creation of the first network is complete. The second network is initially identical to the first with the following modifications: one third of the hub nodes are turned off (all connections removed), another third are rewired, and the remaining third are left unchanged. In addition, 2.5% of the non-hub nodes are rewired.
These steps result in two distinct graphs (network structures) with several differentially connected genes. A graphical representation of the differential network (and the individual pathways) that was generated for the simulation is shown in Supplementary Figs S1 and S2.
The edges in these graphs correspond to nonzero partial correlations, i.e. nonzero values in the precision matrix, Ω. The next step is to generate values for these partial correlations.
, is initialized as an identity matrix. 2. The non-zero partial correlations in the lower triangle of Ω k ( ) are generated from a uniform distribution on ∪ − − . . ( 1, 0 5) (0 5, 1). Edges common to both networks are set to have the same partial correlation. The entries in the upper-triangle are set to ensure symmetry. , respectively. The value c (k) controls the condition number of Ω k ( ) and ensures numerical stability when computing its inverse 49 , and the maximum over both networks is used to ensure that common edges have the same partial correlations after the adjustment.

Positive definiteness is enforced by increasing the diagonal entries by
Performance is assessed through sensitivity, specificity, true discovery rate (TDR), true non-discovery rate (TNDR), F1 score, and Matthews correlation coefficient (MCC). Let TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively. Then, each performance measure is defined as follows: TP FN  TN TN FP  TP TP FP  TN TN FN   TP TN FP FN  TP FP TP FN TN FP TN The F1 score is the harmonic mean of sensitivity and TDR. It conveys the balance between detecting many true differential connections while keeping the false discovery rate low. This measure completely ignores the number of true negatives, which can be quite large when dealing with sparse networks. MCC is a summary measure that reflects the overall performance; it includes the number of true negatives without being heavily influenced by the large imbalance of positives to negatives 50 .
The performance under misspecified pathways is also assessed. Pathway misspecification is simulated by introducing errors into our pathway knowledge; in particular, we mimic the scenario that pathways have both missing genes and genes wrongly included. At 100% knowledge, the pathway information used for analysis is identical to the true pathways used to generate the data. For misspecification, we introduce errors into the pathway information by replacing a portion of genes in each pathway. For example, with 90% pathway knowledge, 10% of the genes in each pathway are removed and replaced with genes outside of the pathway. This perturbed pathway information is then used in the differential network analysis as usual, and the effect on performance can be evaluated.
When no pathway information is being considered, the differential network is carried out using a single "pathway" containing all genes in the network.

RNA-seq datasets.
Craniofacial data of E14.5 mice. RNA-seq data from eight palatal regions of E14.5 (embryonic day 14.5) mice are analyzed 51 . E14.5 is a time during development when the palatal shelves are fused together. The eight regions are organized into four pairs that are analyzed in turn; these include the anterior and posterior domain of the Lateral, Medial, Nasal, and Oral compartments.
The differential network analysis attempts to identify changes in gene-gene interactions between the two domains; this type of behavior may, for example, be indicative of transcription factors involved in orchestrating gene expression during this phase of craniofacial development 52,53 . However, this investigation must be considered as exploratory, and the top DC genes may be considered for further validation.
The RNA-seq BAM files are available from the FaceBase Consortium 54,55 repository using accession numbers FB00000753.01, FB00000754.01, FB00000757.01, FB00000758.01, FB00000761.01, FB00000762.01, FB00000765.01, and FB00000767.01. The reads are aligned and annotated using the mm9 reference genome from the UCSC Genome Browser 56 . There were 21585 genes mapped that had both an Entrez gene ID and MGI symbol. Genes on the Y chromosomes were removed, leaving 21094 genes. The read counts were normalized using transcripts per kilo-base million (TPM) normalization 57 followed by a + x log (1 ) 2 transformation. Genes with zeros in more than one third of the samples were filtered out; this threshold was set fairly low since only 3-4 samples are available per region. (2019) 9:5479 | https://doi.org/10.1038/s41598-019-41918-3 www.nature.com/scientificreports www.nature.com/scientificreports/ Neuroblastoma tumor samples. Data for 498 neuroblastoma tumors are analyzed. These data contain heterogeneous gene expression profiles with diverse clinical outcomes. It is assumed that the differences in patient outcomes are largely a consequence of the differences in the somatic mutations present in their tumors 58,59 . Mutations in a gene may alter the function of the gene product, which may in turn alter the interaction of the gene with other genes 60 . The goal of a differential network analysis is to compare two clinically distinct subgroups of tumors and identify any differentially connected genes. This produces a set of postulated genes that may explain the disparity in prognosis. Further investigation into specific connections or altered pathways may contribute to improving risk stratification or motivate potential therapeutic targets for new treatments 61,62 .
The neurobalstoma data are obtained from the GEO database with accession number GSE49711 63,64 . Several clinical variables for each patient are available, including a high-risk (HR) indicator. We use this label to partition patients into two groups: HR versus non-HR. Normalization of the RNA-seq data was previously performed 63 , which is retained, unmodified in this analysis. There are 17115 genes mapped with both an Entrez gene ID and HGNC symbol. Incidentally, all genes had non-zero expression in at least 20% of the samples; no additional filters to remove lowly expressed genes were applied.
Reactome pathway database. Pathway information is obtained for both mice (mus musculus) and humans (homo sapiens) from the Reactome database 32 . Only pathways containing between 10 and 100 genes were considered. Some of the pathways have significant overlap, sometimes differing by only one gene. This is due to the fact each biological process is broken down into separate events in Reactome, and at the lowest levels a series of events will often contain the same sets of genes. In our analysis, these specialized events containing a significant overlap of genes will be represented by a single pathway. This grouping can be implemented by hierarchical clustering using 1 minus the Jaccard Index as a distance measure, A B A B ( , ) 1 / . By trimming the resulting dendrogram fairly low, for example at 0.1, the specialized events will be grouped together, and the higher-level pathways will remain separated. A concise review of hierarchical clustering is available 65 .
After clustering, we obtain 918 distinct pathways for mus musculus and 1160 distinct pathways for homo sapiens. In application, it is desirable to further remove any inactive pathways to help avoid spurious associations. This can be done in two ways: using domain knowledge to remove irrelevant pathways, or using the gene expression profiles by assuming that pathways containing many unexpressed genes are likely to be inactive. In this study, we take the latter approach -pathways containing over 20% unexpressed genes are considered inactive. Unexpressed genes include those that have zero counts in all samples, have been previously filtered out, or are otherwise not present in the gene expression profile. A threshold of 50% was also considered, but no substantial changes in the top results were found.

Results
simulation study. The first simulation study is used to assess the performance in detecting DC edges, DC genes, and DC pathways. Figure 1 provides the results for DC edges using monotonized p-values with partial correlation as the association measure. Results for other settings can be found in the supplementary materials.
The incorporation of pathway information into the analysis provides a boost in sensitivity when dealing with smaller samples ( < n 250). Even if the pathways are partially misspecified, the overall performance remains comparable. In Fig. 1, the gap in sensitivity between complete pathway knowledge versus partial knowledge begins to widen with large samples ( > n 500); for other simulation settings, for example if Pearson correlations are used instead of partial correlations, this gap can occur earlier. However, in all settings considered, the lack of complete pathway knowledge maintained high specificity (i.e. low type-I error rate) and true discovery rate.
The proposed framework has a parameter p for the norm used in the DC score. A sensitivity analysis on the choice of p is carried out. The simulation results suggest that the performance increases with larger p but Figure 1. Simulation results from 100 generated datasets testing for DC edges based on partial correlation. Monotonized p-values from 100 permutations were compared to a 0.05 significance threshold. Results show the differential network analysis conducted without any pathway information (black), with complete pathway information (red), with 90% correct pathway information (green), and 80% correct pathway information (blue). The specificity and TNDR were approximately 1 for each setting and sample size, hence those graphs are not shown here.
www.nature.com/scientificreports www.nature.com/scientificreports/ eventually reaches a plateau. For detecting DC genes with = n 250 observations in both groups, the plateau is reached at = p 2 (see Supplementary Fig. S5). With = n 1000, this point is shifted to a point > p 2, but increasing p also expands the disparity between complete and partial pathway knowledge (see Supplementary Fig. S6). Based on these results, a robust choice appears to be = p 2, which results in an L 2 norm. In the second simulation study, the proposed method (using monotonized p-values) is compared to four modern approaches: DGCA 15 , DINGO 20 , INDEED 19 , and JDINAC 27 . Figure 2 shows the average estimated differential network by each method on 20 random samples generated from a single pathway.
It is clear from Fig. 2 that not all methods are estimating the same associations, so it is unreasonable to assume they are all directly comparable. The top-left matrix shows the true differential network, but this is with respect to partial correlations. If a different association measure was considered, then this "true" differential network would change. For partial correlations, we see that the proposed method is able to correctly identify DC edges without many false discoveries.
The DGCA method is based on pair-wise correlation. The estimated network from DGCA appears to be noisy, but that is only because it is being compared to a "true" network of partial correlations. If the true differential network is constructed using correlation instead, we find that many of the edges that currently appear to be false discoveries are actually true differential correlations (see Supplementary Fig. S7).
On the other hand, DINGO and INDEED are both based on partial correlation, so they should be relatively comparable to the proposed method. DINGO attempts to find group-specific conditional dependencies by decomposing the GGM into global and local (group-specific) components. In this simulation, it has a higher sensitivity but also a much higher false discovery rate. INDEED uses graphical lasso to estimate the precision matrix for each group. We would expect the performance to be comparable, but surprisingly most of the DC edges identified are false discoveries.
JDINAC is a non-parametric approach that is designed to detect differences in nonlinear associations. Since this method is not based on partial correlations, a direct comparison to the proposed method may not be reasonable. However, since this simulation is based on a GGM, the changes in conditional dependencies detected by JDINAC are comparable to the changes in partial correlations detected by the proposed method. In this setting, JDINAC has a very low sensitivity but high TDR. A more fair comparison could be made if a measure of nonlinear association was used in the proposed method, but this doesn't escape the fact that the two methods will still be operating under different definitions of associations, and any comparison between them may inherently favor one over the other.
Application to RNA-seq datasets. Two RNA-seq datasets are used to illustrate how an exploratory analysis can be carried out using the proposed framework. In both examples, a single pathway is analyzed in depth; a similar analysis can be conducted for any of the significantly DC pathways. Ideally, the practitioner will use domain knowledge when selecting from the list of DC pathways those that should be investigated further. To www.nature.com/scientificreports www.nature.com/scientificreports/ help guide the selection, other criteria can be applied, such as requiring the pathways to also be highly expressed (compared to the average pathway, for example) or differentially expressed between groups.
In both of the real datasets, the partial correlation is chosen for the association measure since we are interested in identifying changes in the direct connections between genes. Based on the simulation study results, we set = p 2 for the DC scores. Monotonized p-values are used to identify DC pathways and DC genes.
Craniofacial data of E14.5 mice. Within each of the four palatal compartments (lateral, medial, nasal, and oral), we explore the differential connections in gene-gene associations between the anterior and posterior domains. The top DC pathways and DC genes for each compartment are shown in Supplementary Tables S1 and S2, respectively. For the permutation test, the smallest obtainable p-value with the given sample size ( = n 3 in each group) is 0.1, hence this is the threshold used for significance. These DC pathways and DC genes are also required to be highly expressed, i.e. among the top 50% in terms of expression level; this helps us identify the most active pathways. Furthermore, pathways or genes that are also significantly DE at the 0.1 significance level are moved to the top of the list. In the remainder of this section, results for the lateral compartment are discussed.
One of the top differentially connected Reactome pathways is Signaling by NOTCH1, which is a highly conserved pathway in developmental biology 66 . Within this pathway, the receptor gene Jag2 is differentially expressed -there is a 3.0 fold change in expression from the anterior domain to the posterior. Furthermore, we find supporting evidence that Jag2 may also be differentially connected; it appears to lose many of its connections to other genes in the posterior domain (see Fig. 3). On the other hand, the regulator gene Kat2b, though not differentially expressed, is differentially connected and is acting in consort with different sets of genes in the two domains. Jag2 is a gene that has been linked to the development of a cleft palate in mice 67 , and Jag2-Notch1 signaling has been shown to be a regulator in palate development 68 . Our findings suggest that Kat2b may also have an important, undiscovered role in the fusion of the palatal shelves. Indeed, in a very recent publication Kat2a and Kat2b are suggested to be epigenetic regulators required for craiofacial bone and cartilage growth and differentiation 69 .
Five of the top ten DC genes between the anterior and posterior domains of the lateral region have been shown to be involved in craniofacial development: Ep300 is linked to abnormal facial morphology 70 ; Ngfr is associated with abnormal molar crown morphology 71 ; Nras is an oncogene that is connected to abnormal cranium morphology 72 ; Hprt is linked to abnormal pharyngeal arch mesenchyme morphology 73 ; Sirt1 is associated with abnormal palatal rugae morphology 74 .
Neuroblastoma tumor samples. The gene expression profiles of clinically high-risk (HR) neuroblastoma patients are compared to non-HR patients. The permutation test is performed using = B 100 permutations, and the threshold for significance is set to 0.01. The top 10 results for DC pathways and DC genes are given in Supplementary Tables S3 and S4, respectively. Several of the top pathways are directly involved in cell proliferation: "Mitotic Telophase/Cytokinesis" is involved in the pinching of the cell into two daughter cells during the final phase of mitosis; "Chk1/Chk2(Cds1) mediated inactivation of Cyclin B:Cdk1 complex" is an event that can occur during the monitoring of the genome for damage to prevent the transition into the next cell cycle; and "ERKs are inactivated" is a MAPK pathway that has a role in several fundamental cellular functions, including proliferation, cell survival, and apoptosis. A rewiring in any of these pathways might explain how a tumor becomes malignant or resistant to treatment in the high-risk group.
The differential network for the pathway "InlA-mediated entry of Listeria monocytogenes into host cells" is shown in Fig. 4. This pathway contains the proto-oncogene SRC, which is also one of the top DC genes. The over-expression of this gene in colon cancer has been associated with accelerated metastatic growth and resistance to chemotherapeutic treatments 75 . One of the strongest differential connections in this pathway is between SRC and CTNND1 -the pair have a stronger connection in non-HR patients. It has been suggested that CTNND1 can modulate anchorage-independent growth induced by SRC 76 . This type of growth is a characteristic of metastatic www.nature.com/scientificreports www.nature.com/scientificreports/ potential 77 . However, it was suggested that the CTNND1 modulation may be reversed by the downstream ROCK cascade 76 . We checked to see if any pathways containing any ROCK genes were differentially connected and found one such pathway: "EPHB-mediated forward signaling. " The top differential connection in this pathway is between ROCK2 and EPHB6. Incidentally, a recent study suggests that treating certain breast cancer patients with SRC inhibitors may be more effective in the cases where EPHB6 is under-expressed 78 . We do not find differential expression in any of these genes in the neuroblastoma dataset, however the differential connectivity of EPHB6, ROCK2, and CTNND1 may be a sign of possible somatic mutations that are affecting their functionality and possibly impeding their ability to regulate SRC, resulting in metastatic growth.
Many of the top DC genes, shown in Supplementary Table S4, have a known relation to cancer. As for example, FADD is involved in the mediation of cell apoptotic signals, TMEM219 is a cell death receptor for IGFBP3, and TNFRSF10B is a receptor for apoptosis signaling. As demonstrated with SRC, by looking at the specific differential connections of these genes we may find a biological mechanism, novel to neuroblastoma, that explains the difference between the HR and non-HR patients.

Discussion
A differential network analysis of gene-gene co-expression networks is used to explore the differences between the underlying biochemical processes of two groups. In this study, we propose a framework for incorporating knowledge of gene regulatory pathways into this type of analysis. The practitioner is able to choose any association measure that is deemed appropriate to address the question at hand. Differential connectivity scores are computed to find DC pathways, DC genes, and DC edges, using the proposed measure, δ E , and a monotonized permutation test provides p-values for these differential connectivity scores.
An exploratory analysis is demonstrated on two datasets, and in both examples we find biologically meaningful genes and pathways that are differentially connected. The utility of analyzing the co-expression networks at different levels is emphasized: DC pathways provide the biological context of the results, DC genes are possible candidates for further investigation, and DC edges identify the specific interactions that may motivate hypotheses to test. Furthermore, the results are automatically partitioned by pathway, so domain knowledge can be used to help select and sort the results that make most biological sense.
Incidentally, the penultimate version of this manuscript found Kat2b as a potential novel finding in craniofacial development. An article that was very recently published supports the validity of this finding 69 , which is the first study in the literature to describe the role of Kat2b in craniofacial development. The results of our analysis suggest this gene may also have a more specific role in the fusion of palatal shelves.
Another benefit of using pathways is the savings in computation time. In fact, many modern methods for differential network analysis have computational restrictions that implicitly require genes to be subset on pathways, or for most genes in an expression profile to be filtered out in some other way prior to analysis. The proposed framework formalizes this act of using pathway knowledge.
A simulation study shows that including pathway information gives comparable performance to no pathway information, even if the pathway information is incomplete. This result is important since our knowledge of pathways is continuously growing, and any pathway may be missing genes or contain extraneous ones. The simulation results suggest that even if the pathway information is imperfect, the differential network analysis is still able to find DC pathways, DC genes, and DC edges without compromising specificity or true discovery rate. A second simulation study considers the performance of four modern approaches. However, this study underlines the fact that different methods will typically use different notions of association. When two approaches are estimating different things, their performance is not directly comparable.
There are a few avenues for future research. Currently, pathways are treated and analyzed independently, but pathways are dependent and often have overlapping genes. For example, the Reactome pathways have a hierarchical structure; pathways with a general function are broken down into specialized events. If a gene is differentially connected due to a mutation, then perhaps it should be differentially connected in all pathways it's involved in. Incorporating this kind of dependency may be a way to increase sensitivity. www.nature.com/scientificreports www.nature.com/scientificreports/ Another concern is that by sub-setting on pathways, important genes that are not included in any pathway could be missed; these could be genes with no known functionality that lead to a novel discovery. One possible solution is to add a preliminary step in which the pathways are inferred from the data; an unsupervised clustering algorithm that allows for overlapping clusters may be able to approximate the known pathways while incorporating all of the genes that are expressed. To make this approach feasible for smaller samples, a semi-supervised approach that also incorporates known pathways could be devised. These ideas are left for future investigation.