Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules

A relation exists between the network proximity of molecular entities in interaction networks, their functional similarity, and their association with diseases. The identification of network regions associated with biological functions and pathologies is a major goal in systems biology. We describe a network diffusion-based pipeline for the interpretation of different types of omics data in the context of molecular interaction networks. We introduce the network smoothing index, a network-based quantity that makes it possible to jointly quantify the amount of omics information in genes and in their network neighbourhood, using network diffusion to define network proximity. The approach is applicable to both descriptive and inferential statistics calculated on omics data. We also show that network resampling, applied to gene lists ranked by quantities derived from the network smoothing index, indicates the presence of significantly connected genes. As a proof of principle, we identified gene modules enriched in somatic mutations and transcriptional variations observed in samples of prostate adenocarcinoma (PRAD). In line with the local hypothesis, the network smoothing index and network resampling underlined the existence of a connected component of genes harbouring molecular alterations in PRAD.


Supplementary Notes and Figures
Supplementary Note

Network diffusion
We are interested in studying the dependence of the stationary distribution X* on the altered nodes, which for each sample are summarized by the vector X0, whose i-th component is 1 if the i-th molecular entity is altered and 0 otherwise. We consider the network propagation equation

X(t+1) = αW X(t) + (1 − α) X0.    (1)

In equation (1), during each iteration each node receives information from its neighbours and also retains its initial information, while self-reinforcement is avoided. Moreover, the information is spread symmetrically, since W is a symmetric matrix. The convergence of the algorithm is demonstrated using the power expansion method: we set X(0) = X0 and apply the first iterations,

X(1) = αW X0 + (1 − α) X0,
X(2) = (αW)² X0 + (1 − α) αW X0 + (1 − α) X0.

Iterating this procedure, at step t we get

X(t) = (αW)^t X0 + (1 − α) Σ_{i=0}^{t−1} (αW)^i X0.

Since 0 < α < 1 and the eigenvalues of W lie in [−1, 1], taking the limit for t → ∞ we get

X* = (1 − α)(I − αW)^{−1} X0.    (2)

This method is successfully exploited by Hofree et al. [1] in their network-based approach applied to somatic mutation profiles. We now give a physical interpretation of this diffusive algorithm, in order to inherit some helpful concepts from an associated physical model.
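As a minimal numerical sketch (a toy 4-node path graph with a single altered node; the graph and values are illustrative, not from the paper), the iteration of equation (1) can be compared with the closed-form limit of equation (2):

```python
import numpy as np

# Toy symmetric adjacency for a 4-node path graph; W = D^{-1/2} A D^{-1/2}
# is the symmetrically normalized adjacency used in the propagation.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
W = A / np.sqrt(np.outer(d, d))

alpha = 0.7
X0 = np.array([1.0, 0.0, 0.0, 0.0])  # node 0 is the only altered entity

# Iterate equation (1): X(t+1) = alpha * W X(t) + (1 - alpha) * X0
X = X0.copy()
for _ in range(1000):
    X_new = alpha * W @ X + (1 - alpha) * X0
    if np.max(np.abs(X_new - X)) < 1e-12:
        X = X_new
        break
    X = X_new

# Closed-form stationary solution, equation (2)
X_star = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * W, X0)
print(np.allclose(X, X_star))  # the iteration converges to X*
```

Since the eigenvalues of αW lie strictly inside the unit circle for 0 < α < 1, the geometric series converges and the loop reaches the fixed point after a few dozen iterations.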

Physical model
The network propagation algorithm can be interpreted as the discrete implementation of a continuous linear dynamical system. We define Lα = I − αW, the symmetrically normalized Laplacian matrix with perturbation parameter α, and consider the following system of ODEs, written in vector form:

dX(t)/dt = −Lα X(t) + (1 − α) X0.    (3)

We can rewrite equation (3) by defining a new sink parameter γ > 0 such that γ = (1 − α)/α, so that, after rescaling time by the parameter α:

dX(t)/dt = −(L + γI) X(t) + γ X0,    (4)

where L = I − W is the symmetrically normalized Laplacian matrix. Equation (4) is essentially the physical model cited by Vandin [3] for defining a gene prioritization algorithm on the basis of the work of Qi et al. [2]. Qi defines a continuous-time model for the distribution of a hypothetical fluid in the network. Initially, no node contains any fluid. Query nodes are then selected to serve as sources, where fluid is pumped in at a constant rate. Fluid diffuses from node to node through the network according to the edge connections. The source input is balanced by fluid loss out of each node at the constant first-order rate γ. The stationary solution of the dynamical system is of course equation (2), which, expressed in terms of the constant sink parameter γ, reads:

X* = γ(L + γI)^{−1} X0.    (5)

The choice of α and γ

The choice of the parameter α (or, equivalently, γ) influences the behaviour of the diffusive algorithm, since α controls how much information is retained in the nodes versus how much is spread through the network. From a physical point of view it is reasonable to take α > 0.5, in order to make the network topology relevant. Throughout the analysis of several real and toy datasets we found that the specific value chosen in 0.5 ≤ α < 1 has a somewhat negligible impact, in the sense that the results are very similar for different choices of α. In practice, α = 0.7 offers a good trade-off between diffusion rate and computational cost (which increases as α → 1).
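The equivalence between the α-parameterization of equation (2) and the γ-parameterization of equation (5) can be verified numerically; the sketch below (toy complete graph, illustrative values) uses γ = (1 − α)/α:

```python
import numpy as np

# Toy complete graph on 5 nodes; W is the symmetrically normalized adjacency.
n = 5
A = np.ones((n, n)) - np.eye(n)
d = A.sum(axis=1)
W = A / np.sqrt(np.outer(d, d))
L = np.eye(n) - W  # symmetrically normalized Laplacian

alpha = 0.7
gamma = (1 - alpha) / alpha  # the sink parameter of equation (4)
X0 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # two altered nodes (toy profile)

# Equation (2): alpha-form of the stationary solution
X_alpha = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * W, X0)
# Equation (5): gamma-form of the same stationary solution
X_gamma = gamma * np.linalg.solve(L + gamma * np.eye(n), X0)

print(np.allclose(X_alpha, X_gamma))
```

Algebraically, γ(L + γI)^{−1} = (1 − α)(αL + (1 − α)I)^{−1} = (1 − α)(I − αW)^{−1}, so the two solvers return the same profile up to floating-point error.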

Discussion
The connection of the discrete model described by equation (1) to the continuous model allows us to point out that we are dealing with an open system, in which the amount of information in each node depends on the constant sink rate γ and on the distribution of the query nodes.
The system is open in the sense that neither the conservation of the fluid amount, from a macroscopic point of view, nor the conservation of probability, from a microscopic point of view, is guaranteed. In this non-conservative system, X* represents the quantity of fluid remaining after the flow stabilizes: in each node, the fluid pumped in from the altered nodes is balanced by the constant sinking of fluid at rate γ. In principle, the overall amount of fluid on the network could also differ between cases and controls, but we observe that on large networks such differences are not appreciable; on the other hand, the fact that we are dealing with an open system implies that the final overall amount of fluid depends on both the distribution and the number of initially altered species. Hofree et al. [1] normalize each diffused profile so that the sum of the fluid in each sample is constant, thereby effectively interpreting equation (1) as a random walk with restart. In our work we chose not to normalize each profile in this fashion; instead, we use the non-normalized profiles to define the S scores and the ∆S scores, in which the actual amount of fluid on each node is taken into account. This makes it possible to better highlight the nodes that gain more fluid in one class versus the other, even for patients in the same class that have a different number of altered entities.
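The open-system behaviour can be illustrated with a small sketch (toy graph, hypothetical profiles): the total diffused fluid depends on the number of initially altered nodes, whereas a Hofree-style per-sample normalization would erase this difference:

```python
import numpy as np

# Toy symmetric adjacency; W is the symmetrically normalized version.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
W = A / np.sqrt(np.outer(d, d))
alpha = 0.7

def diffuse(X0):
    """Stationary solution of equation (2) for an initial profile X0."""
    return (1 - alpha) * np.linalg.solve(np.eye(len(X0)) - alpha * W, X0)

one_alt = diffuse(np.array([1.0, 0.0, 0.0, 0.0]))  # one altered entity
two_alt = diffuse(np.array([1.0, 1.0, 0.0, 0.0]))  # two altered entities

# The totals differ: the system is open and fluid scales with the input.
print(one_alt.sum(), two_alt.sum())
# A per-sample normalization (random walk with restart interpretation)
# would force every profile to the same total:
print((one_alt / one_alt.sum()).sum())
```

By linearity of equation (2), the profile with two altered nodes carries strictly more total fluid, which is the information the non-normalized S and ∆S scores retain.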

Network resampling procedure
The definition of the network resampling p values (pnr) is performed by comparing the objective function Ω with a sample of its local perturbations Ωσ. We first define, as in the main text, the objective function Ω:

Ω(n) = ∆S^T(n) · A_n · ∆S(n),    (6)

where A_n is the adjacency matrix between the first n top-scoring entities and ∆S(n) contains the first n scores in decreasing order. Equation (6) can be seen as a scalar product between the n highest ∆S scores, where the matrix A_n ensures that only the scores associated with entities that have at least one connection to the remaining n − 1 entities contribute positively to the calculation. The Ω function is non-decreasing in n, since only positive scores are considered. A single perturbation of function (6) is defined as follows:

Ω_σ(n) = ∆S^T(n) · A_{σ(n)} · ∆S(n),    (7)

where σ(n) is a random permutation of the labels of the first n molecular entities, so that A_{σ(n)} is simply a random resampling of the existing connections between the top-scoring genes. We specify that the permutations are constructed as follows: we first randomly permute the indices of all the genes and then use that random reassignment to define a single Ω_σ(n) perturbation; in this fashion we construct perturbations that behave similarly to equation (6), with the advantage of a graphical feedback (for example, see Supplementary Fig. S3).
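A sketch of equation (6) and of a single perturbation (equation (7)) on a hypothetical list of six genes already sorted by decreasing ∆S (adjacency and scores are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: six genes, already ordered by decreasing Delta-S.
s = np.array([0.9, 0.7, 0.5, 0.4, 0.2, 0.1])
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0]], dtype=float)

def omega(adj, scores, n):
    """Equation (6): quadratic form over the subnetwork of the top n genes."""
    return scores[:n] @ adj[:n, :n] @ scores[:n]

# Equation (7): one global random relabelling sigma, reused for every n,
# resamples the existing connections among the top-scoring genes.
sigma = rng.permutation(len(s))
A_sigma = A[np.ix_(sigma, sigma)]

Omega = [omega(A, s, n) for n in range(1, len(s) + 1)]
Omega_sigma = [omega(A_sigma, s, n) for n in range(1, len(s) + 1)]
print(Omega)  # non-decreasing in n, since all scores are positive
```

Because A has a zero diagonal and all scores are positive, each increment of n can only add non-negative terms, which is why Ω(n) is non-decreasing.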
In order to define the network resampling p-values we produce a set of k perturbations and, for each value of n, compute the fraction of times in which the perturbations are above or equal to the objective function Ω. As long as for some of the k permutations it holds that Ωσ(n) ≥ Ω(n), the first n top-scoring genes are connected enough that a resampling of the connections among them does not substantially alter the strength of the subnetwork; conversely, it is reasonable to expect an appreciable deviation of the permutations from the objective function when top-scoring genes that are not connected to the previous n − 1 ones enter the top of the list. Therefore, for each value of n, we take k different permutations of the indices {σ1, σ2, ..., σk} and compute:

pnr(n) = (1 + #{i : Ω_{σi}(n) ≥ Ω(n)}) / (k + 1),    (8)

where we add 1 both at the numerator and at the denominator so that the smallest p value is never null. At this point, according to the model's assumptions, in principle any local minimum of equation (8) could be an interesting choice for cutting the top of the ∆S list; however, in our applications we find it reasonable to cut at the first local minimum of pnr, since it often corresponds to the value of n where the perturbations (equation (7)) start to appreciably deviate from the objective function (equation (6)).
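The p-value of equation (8), including the +1 correction and a first-local-minimum cut, can be sketched as follows (same hypothetical six-gene example as above; k is kept small for illustration, and the local-minimum rule is a simple assumed implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical genes ordered by decreasing Delta-S, and their adjacency.
s = np.array([0.9, 0.7, 0.5, 0.4, 0.2, 0.1])
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0]], dtype=float)

def omega(adj, scores, n):
    return scores[:n] @ adj[:n, :n] @ scores[:n]  # equation (6)

def pnr(adj, scores, k=200):
    """Equation (8): (1 + #{i : Omega_sigma_i(n) >= Omega(n)}) / (k + 1)."""
    N = len(scores)
    obj = np.array([omega(adj, scores, n) for n in range(1, N + 1)])
    counts = np.zeros(N)
    for _ in range(k):
        sigma = rng.permutation(N)           # one global relabelling
        pert = adj[np.ix_(sigma, sigma)]
        vals = np.array([omega(pert, scores, n) for n in range(1, N + 1)])
        counts += vals >= obj
    return (counts + 1) / (k + 1)

p = pnr(A, s)
# Cut the Delta-S list at the first local minimum of pnr (fallback: full list).
cut = next((i for i in range(1, len(p) - 1)
            if p[i] < p[i - 1] and p[i] <= p[i + 1]), len(p) - 1) + 1
print(p, cut)
```

The +1 in numerator and denominator keeps every p-value in (0, 1], so the smallest attainable value is 1/(k + 1) rather than zero.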