Cellular network entropy as the energy potential in Waddington's differentiation landscape

Differentiation is a key cellular process in normal tissue development that is significantly altered in cancer. Although molecular signatures characterising pluripotency and multipotency exist, there is, as yet, no single quantitative mark of a cellular sample's position in the global differentiation hierarchy. Here we adopt a systems view and consider the sample's network entropy, a measure of signaling pathway promiscuity, computable from a sample's genome-wide expression profile. We demonstrate that network entropy provides a quantitative, in-silico, readout of the average undifferentiated state of the profiled cells, recapitulating the known hierarchy of pluripotent, multipotent and differentiated cell types. Network entropy further exhibits dynamic changes in time course differentiation data, and in line with a sample's differentiation stage. In disease, network entropy predicts a higher level of cellular plasticity in cancer stem cell populations compared to ordinary cancer cells. Importantly, network entropy also allows identification of key differentiation pathways. Our results are consistent with the view that pluripotency is a statistical property defined at the cellular population level, correlating with intra-sample heterogeneity, and driven by the degree of signaling promiscuity in cells. In summary, network entropy provides a quantitative measure of a cell's undifferentiated state, defining its elevation in Waddington's landscape.

In addition, we also obtained the normalised gene expression data (Illumina HT12v3 expression arrays, GSE30652) of a subset of the samples used in the SCM2 compendium, consisting of 107 hESC lines, 52 iPSC samples and 32 samples from differentiated tissue (2).

hESC, MSC and iPSC data sets
We obtained the normalised data from GEO (3) of two studies profiling human embryonic stem cell (hESC) lines and derived mesenchymal stem cells (MSC). One study (4) profiled a hESC stem cell line (H1) at 3 different passages, giving 3 replicates, as well as two MSC precursors and a bone marrow derived MSC population. These 6 samples were measured with Affymetrix HG-U133A arrays. The other study (5) profiled two hESC cell lines and MSCs derived from these two, all 4 samples done in triplicate leading to a total of 12 samples. These were generated using the Affymetrix HG-U133 Plus 2 platform.
In addition, we downloaded the normalised data for three studies profiling hESC samples (9)(10)(11), again all done on the same Affymetrix HG-U133 Plus 2 arrays to allow comparisons between them and to the MSC-BM samples from above. The number of hESC samples per set were 5, 5 and 10 for Set1 (GSE7896) (9), Set2 (GSE13828) (10) and Set3 (GSE15148) (11), respectively. Set2 (GSE13828) also profiled induced pluripotent stem (iPS) cells from skin fibroblast samples taken from a child with spinal muscular atrophy (10). Specifically, in addition to the 5 hESC samples, there were 3 iPSC samples and 2 skin fibroblast samples. Set3 (GSE15148) contained 16 induced pluripotent stem cell lines (iPSC) derived using episomal vectors from 2 foreskin samples (11). Another data set comparing iPSCs (n=12) to parental fibroblasts (n=6), including hESCs (n=20) was obtained from (12). The raw data, generated with Affy HG HT U133A arrays, was quality checked and RMA normalised. Thus, all these data sets allowed a three-way comparison between adult diffentiated cells, the iPSCs derived from them, and hESCs. (18). The array for this study was the Illumina Human WG-6 v2 Expression beadchip. However, both arrays led to integrated networks of similar size encompassing effectively the same genes, allowing direct comparison.

Time course de-differentiation and re-differentiation experiment of retinal pigment epithelial (RPE) cells
The RPE dedifferentiation and redifferentiation normalised data set was obtained from ArrayExpress (E-MTAB-854), where a detailed experimental protocol can be found. Briefly, a single cell suspension of RPE was derived from hESCs and plated in 96 well plates at two densities in triplicate, a high density (100,000 cells/cm 2 ) and a low density (8000 cells/cm 2 ), and cultured for 5 weeks. RNA was extracted from the starting RPE cell suspension and from the plated cells at 8 subsequent time points (1,2,3,7,15,19,29 and 35 days after plating). In the high density plates cells proliferate and dedifferentiate (acquiring a mesenchymal morphology) during the first 5 days before re-differentiating to RPE by the end of the 5 weeks. In the low density plates cells proliferate and de-differentiate for the first 7 days maintaining their mesenchymal morphology by the end of the 5 weeks. RNA was profiled on Illumina HumanHT12v4 arrays.

Time course HL60 neutrophil expression data
We obtained from GEO, the raw CEL files for the microarray (Affymetrix Human Genome U95 Version 2 Array) dataset collected by Huang et al (19) describing HL60 progenitor cells differentiating into neutrophils. The dataset consists of 25 samples; the first sample is of HL60 progenitor cells during proliferation and the remaining 24 samples correspond to two time courses, each of 12 time points (2 hrs, 4 hrs, 8 hrs, 12 hrs, 18 hrs, 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days), describing differentiation of HL60 progenitors into neutrophils. Each time course corresponds to differentiation initiated via stimulation with a given medium; either DMSO or ATRA. The dataset underwent quality control assessment via the arrayQualityMetrics package in R, and was normalised using RMA.

Normal and cancer tissue, and cancer cell line data sets
We obtained from GEO the normalised data of four gene expression data sets, all using Affymetrix U133 Plus 2 arrays, profiling sufficient numbers of both normal and cancer tissue specimens. Tissues included liver (20), pancreas (21), colon (22) and stomach (23). The colon set also included colon cancer cell lines. For the other tissue types, we obtained corresponding cancer cell lines, profiled on the same Affymetrix arrays, from the Cancer Cell Line Encyclopedia (24).

Normal and cancer stem cell data sets
We collected the normalised data from GEO of two neural and gioma stem cell gene expression data sets, both using Affymetrix U133 Plus 2 arrays. One data sets profiled a neural stem cell line (in duplicate) plus replicates from 4 different glioma stem cell lines (Materials and Methods, (25)). The second data set profiled normal human fetal (hf) neural stem cells and tumorigenic glioma neural stem cell lines, which had been expanded using growth factors EGF and FGF in adherent culture conditions (26). Under these conditions apoptosis and differentiation are suppressed resulting in more homogeneous populations of stem cells (26).
We collected the normalised data from GEO of 12 hematopoietic and 12 leukemic stem cell (LSC) samples (27). The LSCs were from CD34+ chronic myelogenous leukemia patients. All samples were profiled on Affymetrix U133 Plus 2 arrays.

Cancer stem cell and parental cancer cell data set
In order to compare putative cancer stem cells (CSC) to their parental tumour cell (PTC) lines we used the normalised data (GEO) from (28). Specifically, we focused on five tissues: breast with 2 CSCs and 3 PTCs, brain with 5 CSCs and 4 PTCs, lung with 5 CSCs and 3 PTCs, oral cavity with 8 CSCs and 5 PTCs and colon with 7 CSCs and 4 PTCs. In the case of colon we used positivity of CD133, a colon stem cell marker to assign CSC/PTC status (samples on GEO are likely to have been mislabeled). Reported results across all tissue types is however independent of inclusion or exclusion of the colon subset. All samples were profiled on Affymetrix HG-U133 Plus2 arrays.

Pluripotency signature(s) and the TPSC pluripotency score
We considered two pluripotency gene expression signatures: (1) A 19-gene pluripotency signature, consisting of 11 up-and 8 down regulated genes in pluripotent cells (29). For this signature, the pluripotency score of a given sample was derived as the t-statistic comparing the expression levels of the upregulated genes to the downregulated ones. We call this score construction method, the tstatistic based pluripotency score, abbreviated as TPSC. (2) A 189-gene pluripotency expression signature derived in Palmer et al (30). This signature only consists of a list of genes and is not a single sample predictor. To construct one, we followed the principal component analysis (PCA) procedure outlined in (30). Thus, we performed a focused PCA on the stem cell matrix (SCM) compendium on the genes from the signature present in that array (a total of 157 genes). The top PC from the PCA correlated with pluripotency status (Wilcoxon rank sum, P<10 -10 ), thus validating the signature in the SCM data. Thus, a pluripotency score of an independent sample can be obtained by correlation of the loadings of the top principal component with the sample's gene expression profile. Because this method is sensitive to the normalisation strategy, we also constructed an alternative score based on the t-test procedure. Specifically, we used the sign of the loadings in the top principal components to assign the 157 genes into up and downregulated categories, having fixed the overall sign of the principal component also from the SCM data. By comparing, in a given sample, the expression levels of the up and down regulated genes using a t-test we obtain the sample-specific TPSC score.

Protein Interaction Network
We downloaded the full Protein Interaction Network (PIN) from Pathway Commons (www.pathwaycommons.org) (31) (date stamp 13th June 2012). Using this comprehensive resource, we built an integrated network including the Human Protein Reference Database (32), the National Cancer Institute Nature Pathway Interaction Database (NCI-PID) (pid.nci.nih.gov), the Interactome (Intact) http://www.ebi.ac.uk/intact/ and the Molecular Interaction Database (MINT) http://mint.bio.uniroma2.it/mint/. Protein interactions in this network include physical stable interactions such as those defining protein complexes, as well as transient interactions such as posttranslational modifications and enzymatic reactions found in signal transduction pathways, including 20 highly curated immune and cancer signaling pathways from NetPath (www.netpath.org) (33). Redundant interactions were removed and only genes with an Entrez gene ID annotation were retained.
To remove network edges which may likely represent false positives, we adopted a sparsification procedure based on imposing a signaling hierarchy (albeit bi-directional) on the network. This procedure removes edges between proteins whose main cellular localisations are not adjacent in the context of the layers defining the signaling hierarchy. Imposing a signaling hierarchy on a PIN is a procedure which has already been successfully used in other contexts (34). Briefly, each node (protein) in the PIN is assigned a main cellular localisation ("a signaling layer") by using the Gene Ontology annotation (34). Specifically, we first selected all Entrez ID genes annotated to the extracellular space/region using the following set of GO-terms: GO:0005102, GO:0008083, GO:0005125, GO:0005615, GO:0005576, GO:0044421. Next, we identified all Entrez ID genes annotated to transmembrane receptor activity (GO:0004888, GO:0004872, GO:0005886), and finally all genes annotated to the intracellular domain or to biological functions associated with the intracellular region (GO:0005622, GO:0044424, GO:0005634, GO:0005737, GO:0005829, GO:0000139, GO:0035556, GO:0007243, GO:0006468). This included GO terms such as cell-nucleus, cytoplasm, cytosol, golgi membrane, intracellular signal transduction, kinase cascade, protein phosphorylation. Because genes may be annotated to multiple domains, we constructed mutually exclusive sets by favouring the more "external" domain, so that a gene annotated to both extracellular and transmembrane domains was allocated to the extracellular domain only, and a gene annotated to both transmembrane and intracellular domains was assigned to the transmembrane domain only. Thus, all Entrez ID genes were annotated to one of the extracellular ("EC"), transmembrane receptor ("MR") or intracellular ("IC") domains. Next, we selected those nodes in the PIN which could be assigned to one of these three domains. The resulting PIN was then pruned by removing edges (interactions) which were not consistent with the signaling hierarchy structure. Thus, only edges with corresponding end nodes in the following combinations were allowed: EC-EC, EC-MR (or MR-EC), MR-IC (or IC-MR) and IC-IC. This resulted in a maximally connected PIN of 8,434 nodes and over 300,000 interactions.

Integration of PIN with gene expression data
To integrate the gene expression data with the PIN, expression profiles of probesets mapping to the same Entrez gene identifier in the PIN were averaged. Proteins in the PIN without probesets representing their coding genes on the array, were removed from the PIN. The resulting maximally connected component of the PIN defined a sparse network with the number of nodes (genes) depending on the arrays used. In the case of Affymetrix U133 Plus 2 arrays (the most widely used array in this work), the resulting maximally connected integrated PIN consisted of 8290 nodes and 299459 edges. For the Affymetrix Human Genome U95 v2 arrays, the integrated PIN consisted of 5555 nodes and 175640 interactions. For the Affymetric U133A arrays, the PIN was of size 7027 nodes and 246005 edges. Finally, for the Illumina Human WG-6 v2 expression beadchip, the integrated maximally connected PIN consisted of 8135 nodes and 290360 edges, i.e very similar in size to the Affymetrix U133 Plus 2 arrays.

Gene Set Enrichment Analysis (GSEA)
Functional annotation and GSEA was performed using the DAVID Bioinformatics Resources 6.7 (35). The proteins in the PIN were used as a background gene set and P-value estimated using Fisher's exact test. Adjusted P-values used the Benjamini-Hochberg correction as implemented in DAVID.

Construction of the sample specific stochastic matrix and network entropy rate
In this work we invoke the mass action principle in order to be able to define a stochastic matrix for each individual sample. In detail, let E i denote the normalised expression level of gene i in a given sample. For a given neighbour jN(i) (where N(i) labels the neighbours of i in the PIN), the mass-action principle states that the probability of interaction with i is approximated by the product E i E j . Normalising this to ensure that  j p ij =1, we get Clearly, if j∉N(i), then p ij =0. This then defines a sample-specific stochastic matrix. From this stochastic matrix one can then construct a local network entropy for each gene i in the PIN, as which reflects the level of uncertainty or redundancy in the local interaction probabilities around gene i. We note that the above expression for the local entropy is not normalised so that the maximum possible entropy depends on the degree (k i ) of the node i. In fact, Thus, it is convenient to also define a normalised local entropy as (see (36)), We stress again that this local network entropy can be computed for each gene i in each given sample.
When defining a global network entropy (i.e. for the whole network) one can, in principle, consider the average of these normalised local entropies. This average however is a nonequilibrium entropy, in contrast to the global entropy rate, S R , which is defined in terms of the stationary distribution, , of the stochastic matrix p, i.e. through p=. Specifically, the global entropy rate, S R , is defined by (37) where S i are the unnormalised local entropies.
We note that the network entropy rate is bounded between 0 and a positive maximum value that depends only on the adjacency matrix of the network (38). Indeed, it can be shown that the maximum possible entropy rate is attained by a stochastic matrix, p ij , defined by where A ij is the adjacency matrix (i.e. unweighted) of the PIN, and v and  are the dominant eigenvector and eigenvalue of this adjacency matrix, respectively. The maximum attainable entropy rate, M R , will thus depend on the specifics of the network, including total number of nodes and edges. Thus, if desired, the network entropy rate, S R , can be scaled relative to the maximum attainable value, S  R S R /M R , so that S  R is bounded between 0 and 1.
We also note that there is an alternative construction of the stochastic matrix. In fact, the construction of the local stochastic matrix around a given gene i given above does not depend on the expression level of gene i. This means that a gene can have a low network entropy despite the fact that it is not expressed. Since this may not be desirable we also consider an alternative construction of the network entropy which "enforces" a high local network entropy for genes that are not expressed. This is achieved by replacing p ij in the equation above with where  is a function of the gene i's expression value, E is , taking values on the compact support [0,1]. We define the function (E is ) as follows: We choose  l and  h as the 10th and 90th quantile of the sample's genomewide expression profile. In this work we report the S R values obtained from the construction in equation 1 above. However, using the alternative construction (equation 2) leads to similar results.

Simulation and entropy rate perturbation analysis
As a proof of concept that network entropy provides a measure of the degree of signalling promiscuity in a network, we performed a simulation study. Without loss of generality, we randomly sampled 2000 genes from the complete integrated PIN (~7800 nodes) obtained by integrating the gene expression data from the SCM with our constructed PIN. This resulted in a maximally connected component consisting of ~1500 genes. This reduced network size allowed faster computation of the network entropy as required for our simulation study. To simulate a ground state, representing a pluripotent poised state, the corresponding stochastic matrix was defined as one where the associated random walk is unbiased, i.e.
i ij otherwise. Next, we considered two types of perturbation. In one case, we picked all nodes of degree>=2 in the network, and for each such node, a randomly picked edge was then assigned a large weight (in the range 0.8-0.95), with the rest of edges assigned a low weight (~0.1), ensuring that the sum of weights equals 1, as required for a stochastic matrix. This was done for each node in the network separately, resulting in >1000 perturbations and associated entropy rates. In the second perturbation analysis, we generated whole signal transduction pathways, in which a specific path was generated at random, which the weights of the visited nodes in the path modified as described earlier. Path lengths were chosen to be of maximum length 9, corresponding to the diameter of the network, although generated paths often led to shorter paths (since a requirement was not to visit the same node twice). In this perturbation analysis, we generated 100 distinct activation pathways and computed the resulting entropy rates. Thus, by using these two types of perturbation and comparing the entropy rates before and after the perturbation, we can assess the impact of single gene perturbations (activation of single genes) and activation of complete pathways.

Supplementary Figures:
Supplementary Figure S1: Scatterplots of the maximum possible entropy rate (maxSR) against the number of nodes (left panel) and edges (right panel) in the respective integrated PINs for six different arrays considered in this work. Observe how the maxSR scales with the size of the resulting network. We note that the integrated networks from the different arrays may differ in the precise genes represented in them, hence why the network topology might be slightly different, and hence why deviations from a linear relation may be expected.
Supplementary Figure S3: Comparison of the normalised entropy rates in the SCM compendium comparing pluripotent to non-pluripotent cell types, where the entropy rate (SR) was computed after removing cellproliferation and cell-cycling genes defined in Ben-Porath et al (39). Wilcoxon rank sum test P-value is given.
Supplementary Figure S4: Comparison of the normalised entropy rates in the SCM2 compendium comparing pluripotent to differentiated cell types, where the entropy rate (SR) was computed after removing cellproliferation and cell-cycling genes defined in Ben-Porath et al (39). Wilcoxon rank sum test P-value is given. Right panel shows the corresponding ROC curve and AUC.  (12). Right panel: comparison of normalised network entropy values of two adult differentiated foreskin samples (green), their induced pluripotent cells (iPSCs, orange), and hESC controls (magenta). All samples done on same arrays as part of the same study (11). Wilcoxon rank sum test Pvalues between the respective groups are given, as indicated, except for the comparison between iPSCs and the two skin samples in the right panel for which a t-test was used.

Supplementary
Supplementary Figure S7: Comparison of normalised network entropy values of two differentiated spinal muscular atrophy samples (green), induced pluripotent cells (iPSCs, orange) derived from these, and hESC controls (magenta). Expression data is from study Ebert et al (10) , which used Affymetrix HG-U133 Plus2 arrays for all samples. values and associated P-values from a linear regression. In both A) and B), the network entropy was computed after removing the cell-proliferation and cell-cycling genes of Ben-Porath et al (39).
Supplementary Figure S13: Normalised network entropy rates of human embryonic stem cells (hESCs,n=48) and combined teratocarcinoma (n=5) / germ tumour (n=3) cell lines from the stem cell matrix compendium (all deemed pluripotent), as well as of 12 hematopoietic stem cells (HSC) and 12 CD34+ chronic myeloid leukemic stem cells (LSC) (Supplementary Materials). Wilcoxon rank sum test P-value between the pluripotent normal (hESC) and cancer (GTC/TCC) cells is given, as well as between the HSCs and LSCs. Finally, also included are the normalised network entropy rates (SR/maxSR) of a neural stem cell line (NSC) (in duplicate, blue) compared to that of four different MGG glioma stem cells (red, with replicates) from (25), as indicated.
Supplementary Figure S14: Normalised network entropy rates of a set of glioma (red) and neural stem cell (blue, NSC) lines (with replicates), plus samples from normal differentiated brain cortex (green). Expression data is from Pollard et al (26) and was generated using Affymetrix HG-U133 Plus 2 arrays. P-value is from a Wilcoxon rank sum test comparing the 3 different NSCs to the 6 independent normal brain samples.