Identifying intracellular signaling modules and exploring pathways associated with breast cancer recurrence

Exploring complex modularization of intracellular signal transduction pathways is critical to understanding aberrant cellular responses during disease development and drug treatment. IMPALA (Inferred Modularization of PAthway LAndscapes) integrates information from high-throughput gene expression experiments and genome-scale knowledge databases to identify aberrant pathway modules, thereby providing a powerful sampling strategy to reconstruct and explore pathway landscapes. Here IMPALA identifies pathway modules associated with breast cancer recurrence and Tamoxifen resistance. Focusing on estrogen-receptor (ER) signaling, IMPALA identifies alternative pathways from gene expression data of Tamoxifen-treated, ER-positive breast cancer patient samples. These pathways were often interconnected through cytoplasmic genes such as IRS1/2, JAK1, YWHAZ, CSNK2A1, MAPK1 and HSP90AA1 and were significantly enriched with ErbB, MAPK, and JAK-STAT signaling components. Characterization of the pathway landscape revealed key modules associated with ER signaling and with cell cycle and apoptosis signaling. We validated IMPALA-identified pathway modules using data from four different breast cancer cell lines, including Tamoxifen-sensitive and Tamoxifen-resistant models. Results showed that a majority of genes in cell cycle/apoptosis modules that were up-regulated in breast cancer patients with short survival (< 5 years) were also over-expressed in drug-resistant cell lines, whereas the transcription factors JUN, FOS, and STAT3 were down-regulated in both patient samples and drug-resistant cell lines. Hence, IMPALA-identified pathways were associated with Tamoxifen resistance and an increased risk of breast cancer recurrence. The IMPALA package is available at https://dlrl.ece.vt.edu/software/.

GibbsOS-identified transcription factors that are up-regulated (+) or down-regulated (-) in the early-recurrence group of the Loi data.

Building flow network
GIST applies Gibbs sampling to a regularized structure that we refer to as a "flow network". Given the source and target gene(s), we build a directed pathway flow network of L layers from the original PPI network, as shown in Fig. S6. First, we start from the source genes (the genes in the first layer) and search their neighbors in the PPI network.
The direct neighbors of the source genes form the second layer, from which we successively define the third layer, the fourth layer, and so on. This is called the "forward search" of the PPI network, and the target gene will appear in the L-th layer (the target gene can also appear in earlier layers if there is a path between source and target that is shorter than L). Meanwhile, we also perform a "backward search" that starts from the target gene and rebuilds the L layers in the reverse direction. For each layer, we keep only the genes that are present in both the "forward" and "backward" networks, which yields the final flow network.
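The forward and backward searches described above can be sketched as follows. This is a minimal illustration, not the GIST implementation: it assumes the PPI network is given as a symmetric adjacency dictionary, and the function name `build_flow_network` and the toy gene names are hypothetical.

```python
def build_flow_network(ppi, sources, target, n_layers):
    """Return a list of n_layers gene sets (an illustrative flow network).

    ppi      : dict mapping each gene to the set of its PPI neighbors
               (assumed symmetric, i.e. an undirected network)
    sources  : iterable of source genes (first layer)
    target   : the target gene (last layer)
    n_layers : number of layers L
    """
    # Forward search: layer k+1 holds the direct neighbors of layer k.
    forward = [set(sources)]
    for _ in range(n_layers - 1):
        forward.append({nb for g in forward[-1] for nb in ppi.get(g, ())})

    # Backward search: start at the target and rebuild the layers in reverse.
    backward = [{target}]
    for _ in range(n_layers - 1):
        backward.append({nb for g in backward[-1] for nb in ppi.get(g, ())})
    backward.reverse()

    # Keep only genes that are present in both searches, layer by layer.
    return [f & b for f, b in zip(forward, backward)]


# Toy PPI network (hypothetical genes): S reaches T through A or E in two steps.
toy_ppi = {
    "S": {"A", "B", "E"},
    "A": {"S", "T", "C"},
    "B": {"S", "D"},
    "C": {"A"},
    "D": {"B"},
    "E": {"S", "T"},
    "T": {"A", "E"},
}
layers = build_flow_network(toy_ppi, sources=["S"], target="T", n_layers=3)
```

In this toy example, genes B, C, and D are dropped because they do not lie on any three-layer route from S to T, so only A and E remain in the middle layer.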

Sampling flow network with a modified Gibbs sampler and Markov property
We illustrate the sampling procedure in Fig. S7, where the current pathway is highlighted by the shaded area. A Gibbs sampler updates one gene at a time while holding the other pathway members fixed, and then moves on to the next member. Suppose we want to update the third gene in the pathway. Based on the flow network, three genes in the third layer (marked 1, 2, and 3) are potential candidates that connect the existing genes in the second and fourth layers. We calculate the pathway probabilities for the three corresponding pathways and then probabilistically accept one of the candidates to replace the current gene (in Fig. S7, gene 2 is selected). We also update the edges of the new gene accordingly. This procedure is applied sequentially to the fourth, fifth, and subsequent layers up to the L-th layer, and is repeated for many iterations until convergence.
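A single layer update of this kind can be sketched as below. This is only an illustration under stated assumptions: the pathway probability is supplied by the caller as a hypothetical function `path_score`, the toy scoring (a product of made-up per-gene weights) is not the paper's potential function, and the PPI network is again assumed to be a symmetric adjacency dictionary.

```python
import random


def gibbs_update_layer(path, k, layers, ppi, path_score, rng=random):
    """Resample the gene at layer index k of `path`, keeping the rest fixed."""
    prev_gene = path[k - 1] if k > 0 else None
    next_gene = path[k + 1] if k + 1 < len(path) else None

    # Candidates: genes in layer k connected to both neighbors in the path.
    candidates = [
        g for g in sorted(layers[k])
        if (prev_gene is None or g in ppi.get(prev_gene, ()))
        and (next_gene is None or next_gene in ppi.get(g, ()))
    ]

    # Score the pathway obtained with each candidate, then draw one candidate
    # with probability proportional to its score.
    weights = [path_score(path[:k] + [g] + path[k + 1:]) for g in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return path[:k] + [chosen] + path[k + 1:]


# Toy flow network and PPI network (hypothetical genes and weights).
toy_layers = [{"S"}, {"A", "E"}, {"T"}]
toy_ppi = {"S": {"A", "E"}, "A": {"S", "T"}, "E": {"S", "T"}, "T": {"A", "E"}}
gene_weight = {"S": 1.0, "A": 1.0, "E": 3.0, "T": 1.0}


def toy_score(path):
    """Illustrative pathway score: product of per-gene weights."""
    score = 1.0
    for g in path:
        score *= gene_weight[g]
    return score


new_path = gibbs_update_layer(["S", "A", "T"], 1, toy_layers, toy_ppi, toy_score)
```

A full sweep would call `gibbs_update_layer` for each layer in turn and repeat the sweep for many iterations.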
For the estimated distribution to converge, after enough sampling iterations, to a stationary distribution that is independent of the initial state, the Gibbs sampler must define an irreducible and ergodic Markov chain. Unfortunately, these properties cannot always be satisfied when we sample pathways from the PPI network and allow only one layer of the current path to change at a time. We therefore propose a simple modification of the Gibbs sampler: we assign a small baseline sampling frequency δ to states (pathway configurations) that are not allowed by the initial flow network, so that any two states in the enlarged state space communicate with each other.
Edges connecting genes in two adjacent layers that are not connected in the original PPI network are referred to as "pseudo-edges" (undirected lines in Fig. S8). With pseudo-edges added, genes in any two adjacent layers are mutually connected regardless of whether they are truly connected in the original PPI network. The Gibbs sampling process on the modified flow network defines an irreducible Markov chain that draws samples (states) from an enlarged state space $\tilde{\Theta}$ (Fig. S8(b)) with $\Theta \subset \tilde{\Theta}$, where $\Theta$ is the state space of all "valid pathways" defined by the original flow network (Fig. S8(a)).
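The effect of the baseline frequency δ can be illustrated with a small sketch. The exact way δ enters the conditional distribution is not spelled out in this section, so the rule used here (a candidate reached only through pseudo-edges receives weight δ, all others receive their pathway score) is an assumption, and the names `layer_update_probabilities` and `path_score` are hypothetical.

```python
def layer_update_probabilities(path, k, layers, ppi, path_score, delta=0.01):
    """Return {gene: probability} for resampling layer index k of `path`.

    Every gene in layer k is a candidate. Candidates linked to both path
    neighbors by true PPI edges are weighted by path_score; candidates that
    need a pseudo-edge get the small baseline weight delta (an assumption).
    """
    def connected(a, b):
        return b in ppi.get(a, ())

    weights = {}
    for g in sorted(layers[k]):
        ok_prev = k == 0 or connected(path[k - 1], g)
        ok_next = k == len(path) - 1 or connected(g, path[k + 1])
        if ok_prev and ok_next:
            weights[g] = path_score(path[:k] + [g] + path[k + 1:])
        else:
            weights[g] = delta  # state outside the original flow network

    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Toy example (hypothetical genes): two disjoint routes S-A-C-T and S-B-D-T.
toy_layers = [{"S"}, {"A", "B"}, {"C", "D"}, {"T"}]
toy_ppi = {
    "S": {"A", "B"}, "A": {"S", "C"}, "B": {"S", "D"},
    "C": {"A", "T"}, "D": {"B", "T"}, "T": {"C", "D"},
}
probs = layer_update_probabilities(
    ["S", "A", "C", "T"], 1, toy_layers, toy_ppi, lambda p: 1.0, delta=0.01
)
```

In this toy network, changing one layer at a time can never move the chain from S-A-C-T to S-B-D-T without passing through an invalid configuration such as S-B-C-T. Giving that configuration a small nonzero weight lets the two valid pathways communicate.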

Truncation of pathway samples to estimate edge probability
When the number of genes is large, the sample space $\Theta$ becomes huge. In this case, it is not computationally feasible to sample all possible pathways to calculate the posterior probability of every directed edge. In practice, we are interested in the top-ranked pathway samples with the highest inter-correlation. Therefore, we define a non-normalized edge probability $p_{ij}^{K}$ based on truncated samples (Eq. (S1)), where $\Theta^{K}$ denotes the top K truncated pathway samples.
Note that $p_{ij}^{K}$ is not a probability but a score function for ranking valid pathway samples. In Eq. (S1) we used only gene and edge potentials because samples from $\Theta^{K}$ have already been well constrained by prior knowledge (e.g., cellular locations). The confidence of edge direction was defined as follows:
We summarize these into two major types of alternative pathway structures: type I, alternative pathways between a single source and a single target; and type II, alternative pathways among multiple sources and multiple targets.
A schematic diagram of the two pathway structures is shown in Fig. S9. The type I structure is a special case of the type N (nested) pathway between a single source gene and a single target gene, and it includes type O (single/multiple) pathways as sub-components. The type II structure is a more general case of the type N (nested) structure among multiple source genes and multiple target genes. It is a mixture of type A (divergent), type V (convergent), and type M (multiple/multiple) pathways.
Both type I and type II structures are designed to study alternative signal transduction, while the latter is also used to model crosstalk among multiple pathways. To generate simulation pathways for each structure, a subnetwork centered at some putative hub genes within the human PPI network was selected as the base topology. Genes involved in canonical pathways (MAPK, ERBB, JAK/STAT, etc.) were also extracted from knowledge databases [2,3] as the candidate pool for building ground truth pathways. We also collected subcellular location information for the human proteome.
To keep the models simple and logical for this study, we assumed that a valid path

[Figure: pathway gene identification and pathway interaction identification at gene expression noise levels of 0.2, 0.5, and 0.8.]