Over the last decade, the genome-wide study of both heritable and somatic human variability has gone from a theoretical concept to a broadly implemented, practical reality, covering the entire spectrum of human disease. Although several findings have emerged from these studies1, the results of genome-wide association studies (GWAS) have been mostly sobering. For instance, although several genes showing medium-to-high penetrance within heritable traits were identified by these approaches, the majority of heritable genetic risk factors for most common diseases remain elusive2,7. Additionally, due to impractical requirements for cohort size8 and lack of methodologies to maximize power for such detections, few epistatic interactions and low-penetrance variants have been identified9. At the opposite end of the germline versus somatic event spectrum, considering that tumor cells abide by the same evolutionary fitness principles but on accelerated timescales due to mutator phenotypes, extensive somatic genomic rearrangements in solid tumors10 yield so many alterations that distinguishing 'drivers' from 'passengers' has been challenging.

This raises the question of whether GWAS data sets could yield additional insight when combined with other data modalities. Indeed, a number of previous studies have integrated significant genotype-phenotype associations with databases of gene annotations, such as the Gene Ontology (GO)11, MSigDB12 or the Kyoto Encyclopedia of Genes and Genomes (KEGG)13. The goal of these studies is to recognize higher-order structure within the data through the aggregation of loci in genes with similar functions or that are in the same pathway.

The context-specific networks of molecular interactions that determine cell behavior provide a particularly relevant framework for the integration of data from multiple 'omics'. The rationale is straightforward: within the space of all possible genetic and epigenetic variants, those contributing to a specific trait or disease likely have some coalescent properties, allowing their effect to be functionally canalized via the cell communication and cell regulatory machinery that allows distinct cells to interact and regulates their behavior. Notably, contrary to random networks, whose output is essentially unconstrained, regulatory networks produced by adaptation to specific fitness landscapes are optimized to produce only a finite number of well-defined outcomes as a function of a very large number of exogenous and endogenous signals. Thus, if a comprehensive and accurate map of all intra- and intercellular molecular interactions were available, then genetic and epigenetic events implicated in a specific trait or disease should cluster in subnetworks of closely interacting genes.

Accordingly, if regulatory networks controlling cell pathophysiology were known a priori, one could systematically reduce the number of statistical association tests between genomic variants and the trait or disease of interest by considering only events that cluster within regulatory networks, as topologically related events would be more likely to produce related phenotypic effects. Such a pathway-wide association study (PWAS) strategy14 may improve our ability to distinguish signals from background noise by reducing the number of hypotheses that must be corrected for in multiple testing. In general, however, the molecular pathways governing physiological and disease-related traits are poorly characterized. Indeed, the classical notion of a relatively linear and interpretable set of regulatory pathways should be revisited in light of the dynamic, multiscale, context-specific nature of gene regulatory networks. We thus favor an alternative approach requiring the simultaneous reconstruction of context-specific gene regulatory networks15 as well as of the genetic and epigenetic variability they harbor. We call this second strategy integrative network-based association studies (INAS) and suggest that INAS will become increasingly valuable as the context-specific logic of gene regulatory networks is further elucidated.
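The statistical benefit of restricting tests to a network can be illustrated with a simple Bonferroni correction. The sketch below is not taken from any of the cited studies: the function name and both variant counts are hypothetical, and Bonferroni is used only because it is the simplest family-wise correction to reason about.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical counts, chosen only for illustration.
N_GENOME_WIDE = 1_000_000  # variants tested in an unrestricted scan
N_IN_NETWORK = 5_000       # variants mapping to a candidate subnetwork

genome_wide = bonferroni_threshold(0.05, N_GENOME_WIDE)
restricted = bonferroni_threshold(0.05, N_IN_NETWORK)

print(f"genome-wide threshold:        {genome_wide:.1e}")
print(f"network-restricted threshold: {restricted:.1e}")
```

The looser threshold is only legitimate if the network used to restrict the tests was defined independently of the association data being analyzed.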

In this Perspective, we explore current advances in PWAS and INAS research, inspired by a regulatory network–oriented view of traits and disease, and examine future directions that are being pursued within the emerging community of systems geneticists. We explore how networks (and pathway motifs within them) can be reconstructed and validated and how they may provide a valuable integrative framework within which to interpret GWAS results as well as other data on genetic and epigenetic variation.

This is not my beautiful pathway

An increasing body of evidence suggests that canonical pathways are incomplete and largely inaccurate models for studying the complex interplay of signal transduction, transcriptional, post-transcriptional, metabolic and other regulatory events that determine cell behavior. Even today, entirely new classes of molecular entities (for example, long intergenic non-coding RNAs (lincRNAs))16 and interactions (for example, microRNA-mediated interactions)17 are being discovered and shown to have critical impact on cell regulation. Pathway models represented as linear chains of events provide ready visualization and the opportunity for intuitive predictions that can be experimentally tested in a manageable number of experiments. Unfortunately, cell regulation is anything but linear and is instead determined by complex, multivariate interactions that are not amenable to visual interpretation. For instance, individual transcription factors may regulate hundreds to thousands of cell context–dependent targets18,19, with functional specificity achieved by combinatorial transcription factor interactions20,21. In human B cells, FOXM1 and MYB each regulate the transcription of more than 1,000 distinct genes. Yet, the 100 targets they co-regulate are exquisitely specific to germinal center formation20 (Fig. 1a), in contrast to those uniquely regulated by each transcription factor. Similarly, transcription factor activity is modulated by hundreds of signal transduction proteins22, whose availability is again context specific. A map of expressed transcription factors in human B cells and of their computationally inferred modulators is shown in Figure 1b. Many of these interactions were experimentally validated, indicating that such a level of complexity is realistic.
Additionally, recent large-scale screens for protein-protein interactions in human cells23 suggest that the number of such interactions is orders of magnitude larger than the few thousand captured in canonical pathways. Finally, adding yet another level of complexity, causal dependencies between the genetic, regulatory and functional layers provide insight into the mechanisms by which genetic variation may affect the activity of entire constellations of transcription factors, which in turn regulate thousands of genes11,24,29 (Fig. 2).

Figure 1: Examples of transcriptional and post-translational regulatory networks in human B cells.

(a) FOXM1 and MYB co-regulation network from the Human B Cell Interactome. Red and blue represent over- and underexpression of genes, respectively, in centroblast versus naïve germinal centers (t test, false discovery rate < 0.05). Blue arcs represent protein-protein interactions. Adapted from reference 20. (b) Visualization of the molecular interaction network of signaling molecules and transcription factors in mature human B cells. Adapted from reference 22.

Figure 2: Genetic subnetwork controlled by Zfp90 (black node) as a central node in the liver transcriptional network.

This subnetwork was obtained from a full liver expression network by identifying all nodes that were downstream of the Zfp90 node, within a path length of 3. Nodes highlighted in green represent genes that were validated as causal for fat mass. Adapted from ref. 26.
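The extraction step described in this legend, collecting every node downstream of a chosen regulator within a fixed path length, amounts to a depth-limited breadth-first search on a directed graph. The following is a minimal sketch under that reading, not the procedure used in ref. 26; the function name, the toy edge list and all gene labels other than Zfp90 are hypothetical.

```python
from collections import deque


def downstream_within(graph, source, max_depth=3):
    """Return {node: shortest path length} for nodes reachable from source
    along directed edges in at most max_depth steps (source included at 0)."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the path-length limit
        for child in graph.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Toy directed network (regulator -> targets); g1..g5 are placeholder names.
toy_network = {
    "Zfp90": ["g1", "g2"],
    "g1": ["g3"],
    "g3": ["g4"],
    "g4": ["g5"],  # g5 lies four steps from Zfp90 and is therefore excluded
}

subnetwork = downstream_within(toy_network, "Zfp90", max_depth=3)
print(sorted(subnetwork.items(), key=lambda item: item[1]))
```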

As discussed, such intrinsic complexity is made even more daunting by the context-specific nature of cell regulation. For instance, the oncogenic effect of genetic lesions depends both on cell type and microenvironment30. Finally, the paracrine and endocrine molecular interactions that allow distinct cell types and even whole organs to communicate form the highest-order networks in living organisms, directly affecting their physiological and pathological states and forcing the study of some diseases in their non–cell autonomous context. For instance, obesity and type 2 diabetes may result from failures in distinct organ systems. Similarly, insulin signaling in osteoblasts has been shown to be necessary for whole-body glucose homeostasis31. Thus, examination of networks spanning multiple tissues becomes critical to highlight interactions that would be otherwise invisible within individual tissue networks15. These examples suggest that molecular networks capable of predicting whole-system behavior will require both de novo reconstruction of molecular interactions within each cellular context of interest and novel modeling approaches that explicitly represent interactions within a hierarchy of scales and across the full range of cellular compartments that define the physiological states relevant to a disease phenotype32.

Reverse engineering of cellular networks

Until recently, experimental elucidation of a protein kinase substrate or transcription factor target may have required a year of bench work. As regulatory networks in eukaryotes seem to comprise hundreds of thousands of interactions23,33,34—both context-specific35 and dynamic34,36,37—dissecting them with sufficient accuracy, coverage and context specificity may thus seem to be an unrealistic goal. Yet, the field of high-throughput computational and experimental reverse engineering was born precisely to address this challenge.

Experimentally, over the last few years, large-scale, high-throughput efforts have already produced critical data sets. These have been used as a scaffold for the assembly of molecular interaction networks, thus providing the first insight into the architecture of the cell, tissues and even whole systems38. For example, protein-protein interactions have been dissected using the yeast two-hybrid (Y2H) system or tandem affinity purification and mass spectrometry (TAP–MS)23. Similarly, transcription factor–binding sites have been mapped using genome-wide chromatin immunoprecipitation approaches (ChIP-chip39 and ChIP-seq40). Physical interactions can also be measured in vitro with DNA or protein arrays, which have been used to identify transcription factor–binding sites41,42 and the substrates of kinases43. Although interactions characterized by high-throughput experimental methods generally have high false positive and false negative rates and are unlikely to generalize to cellular contexts other than the one in which they were ascertained, they nonetheless provide an initial, albeit sparse, snapshot of regulatory networks, especially when integrated with other types of data that can help contextualize individual interactions28.

Complementing such high-throughput approaches, computational reverse-engineering algorithms have recently achieved accuracy and sensitivity comparable with those obtained by their experimental counterparts, at a fraction of the cost and time. Computational methods for reverse engineering cellular networks were first developed for the study of prokaryotes and lower eukaryotes44,46 and have more recently become highly successful in reconstructing the transcriptional33, post-translational34,47,48, post-transcriptional49, metabolic50 and protein-complex20 logic of human cells, as well as in elucidating the dependence of such logic on the genetic information and variability encoded in the DNA molecule26,27,28,51,52. Moreover, the combined use of multiple evidence sources has been particularly effective in reconstructing accurate, high-coverage regulatory models20,57,58 and in integrating multiple layers of regulation within cellular networks. Taken together, these computational and experimental approaches are paving the road to regulatory network–based studies of human disease27,53,56.

Computational methods all rely, in one way or another, on measuring changes in distinct molecular moieties (for example, RNAs or proteins) as a response to either endogenous or exogenous perturbations. The former include, for instance, differences in kinetic constants caused by the genotypic variability between individuals or the different spectra of genetic lesions associated with particular tumor phenotypes53. The latter include small-molecule59, RNA interference (RNAi) and environmental perturbations60, such as differences in temperature, nutrients or culture medium, among many others. In fact, several methods have been described that specifically use perturbations to infer regulatory networks60,61 or to interrogate them to infer drug sensitivity62, resistance63 and mechanism of action35,46. Monitoring network states over time provides another systematic variability source for causal inference64,65.

Finally, functional rather than physical interactions, such as the genetic interactions that define the combinatorial relationships between genes and phenotypes, constitute another valuable knowledge layer. In model organisms, such as yeast, genetic interaction networks are being systematically measured through synthetic lethality screens66, while, in higher eukaryotes, genetic interactions can be explored by a variety of combinatorial RNAi67 and RNAi-based screening approaches68. In the absence of previous information, however, de novo identification of such epistatic interactions from GWAS data is greatly limited by lack of statistical power, although emerging methods are beginning to address this limitation9,69.

Examples of PWAS and INAS approaches

In the following, we discuss a few illustrative examples of PWAS and INAS approaches that have successfully identified genes whose genetic alteration or functional dysregulation induces specific phenotypes.

Canonical pathway analysis. Canonical pathways are compact representations of literature-based knowledge about regulatory interactions. Although their representation is largely incomplete and lacks context specificity, it provides visual access to a collection of molecular interaction facts that have led to the elucidation of important biological mechanisms.

Some of the most accurate pathway models represent immunology-related signaling cascades. These have been used to identify genetic alterations in lymphomagenesis. For instance, integration of the nuclear factor (NF)-κB pathway and targets with GWAS data from a large collection of diffuse large B-cell lymphoma (DLBCL) samples led to the identification of the NF-κB nuclear complex as the key integrator of a spectrum of upstream genetic alterations that distinguish the more aggressive activated B cell–like (ABC) subtype of the disease from its germinal center B cell–like (GC) counterpart70,71. These included several genes in the B-cell receptor (BCR) and other signal transduction pathways, such as CARD11, TNFAIP3, TRAF2, TRAF5, MAP3K7 and TRANK1, among others. Unexpectedly, whereas NFKB1, NFKB2, RELA, RELB and REL harbor no genetic alterations in ABC DLBCL tumors, the NF-κB nuclear complex constitutes a key non-oncogene addiction for this subtype70.

Pathways assembled by automated literature data mining approaches have also been useful in the study of genetic predisposition to several human diseases72.

Integrative genomics. There is abundant literature on cellular network analysis, including of protein-protein and protein-DNA interactions, to identify 'expression-activated modules' from gene expression data9,20,73,75. These are sets of proteins enriched for both network interaction and co-expression across several conditions; they allow the thousands of interactions in a typical cellular network to be reduced to a handful of small, differentially activated modules. Dysregulated gene set analysis via subnetworks (DEGAS) and interactome dysregulation enrichment analysis (IDEA) represent recent examples of tools for identifying connected subnetworks enriched in genes or interactions that are dysregulated in a disease or following chemical perturbations35,76. In Parkinson's disease, DEGAS identified mRNA splicing, cell proliferation and the 14-3-3 complex as candidate disease-progression mediators. In B-cell lymphoma, IDEA identified validated genetic alterations in chronic lymphocytic leukemia and follicular lymphoma.

In parallel, related methods have been developed for integrating protein networks with genome-wide linkage and association studies. For instance, Lage et al.77 identified protein complexes encoded by genes that were associated with similar phenotypes, using a protein interaction network assembled with both human and model organism data. Proteins were ranked by the phenotypic similarity score of diseases associated with them and with their directly interacting proteins. In dense module searching for GWAS (dmGWAS)78, dense subnetworks of protein-protein interactions were tested for enrichment in genes harboring SNPs with low P values in GWAS studies.
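A dense module search of this kind can be sketched as a greedy expansion over a protein interaction graph. The code below is only written in the spirit of such methods and is not the dmGWAS implementation: it assumes each gene has already been assigned a single P value, converts it to a z-score, scores a module as the sum of z-scores divided by the square root of the module size, and adds the best neighbor while the score improves by a hypothetical relative margin `min_gain`. All gene names and P values are invented.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def module_score(genes, gene_p):
    """Combine gene-level P values into one module z-score: sum(z_i) / sqrt(k)."""
    z = [_STD_NORMAL.inv_cdf(1.0 - gene_p[g]) for g in genes]
    return sum(z) / sqrt(len(z))


def grow_module(seed, ppi, gene_p, min_gain=0.1):
    """Greedily add the interacting gene that raises the module score most,
    stopping when the relative improvement falls below min_gain."""
    module = {seed}
    score = module_score(module, gene_p)
    while True:
        frontier = {n for g in module for n in ppi.get(g, ()) if n in gene_p} - module
        best = max(frontier, key=lambda n: module_score(module | {n}, gene_p), default=None)
        if best is None:
            break  # no neighbors left to try
        new_score = module_score(module | {best}, gene_p)
        if new_score <= score * (1.0 + min_gain):
            break  # improvement too small to justify a larger module
        module.add(best)
        score = new_score
    return module, score


# Toy undirected interaction network and invented gene-level P values.
toy_ppi = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
toy_p = {"A": 0.001, "B": 0.002, "C": 0.8, "D": 0.01}

module, score = grow_module("A", toy_ppi, toy_p)
print(sorted(module), round(score, 2))
```

In practice, the significance of a module score would still need to be calibrated, for instance against modules grown on permuted gene labels, because greedy growth inflates scores by construction.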

A similar approach integrated genes linked to ataxia within a human protein interaction network, showing potential gains in statistical power38. Further attempts to boost statistical power in GWAS include the identification of SNP pairs, whose joint state was associated with the phenotype79. A biclustering method was used to cluster SNP-SNP interactions, first across genomic regions and then across a protein interaction network (Fig. 3). The analysis showed strong enrichment of GWAS genetic interactions among interacting proteins. This GWAS-based method suggested that the INO80 chromatin-remodeling complex is functionally linked to transcription elongation via RNA polymerase II and vacuolar protein degradation. Finally, related approaches were developed for using previous knowledge in the inference of epistatic interactions from GWAS39.

Figure 3: Genetic networks extracted from GWAS elucidate pathway architecture.

(a) A global map of the top GWAS genetic interactions between protein interaction complexes. Each node represents a protein complex, and each interaction represents a significant number of genetic interactions. Node sizes are proportional to the number of proteins in the complex. (b,c) Genetic interactions mined from GWAS data are shown in greater detail for the interaction between the synaptonemal complex and the RNA polymerase II complex (b) and the interaction between the mannan polymerase II complex, the TIM9-TIM10 complex and the TRAPP complex (c). Adapted from ref. 79.

Genetics of gene expression. Systems genetics represents a broad class of approaches that integrate germline or somatic genetic variants and phenotypic data to infer causal gene-gene and gene-phenotype relationships. Variations in DNA can directly affect gene expression and protein activity and can thus be viewed as the naturally occurring counterpart of the artificial perturbations commonly employed to establish causal relationships. However, because common forms of human disease and physiological differences are caused by such variation, they constitute a more relevant context in which to elucidate causal mechanisms related to disease risk assessment, initiation, progression and therapy.

DNA variation can be effectively used to infer causal relationships among molecular phenotypes24,26,27 and to reconstruct entire gene networks by systematically assessing its effect on gene, protein and metabolite expression and interactions28,51. Gene networks dissected from DNA variability data can elucidate gene subnetworks driven by common genetic factors in an unbiased, data-driven fashion. For instance, Zhong et al. identified such a subnetwork by studying islets isolated from a population of mice segregating with a type 2 diabetes (T2D) phenotype29. More than half of the genes that were predicted to be causal for T2D in this population were members of this sub-network. Furthermore, human SNPs associated with genes in the mouse-derived T2D network were more than eightfold enriched for statistically significant associations with T2D in GWAS data. Notably, no enrichments were observed using established GO and KEGG pathways11.
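Enrichment statements of this type are usually backed by a fold-enrichment estimate and an over-representation test. The sketch below shows one common way to compute both, using a hypergeometric tail probability; it is not the analysis performed by Zhong et al., and the counts are invented so that the fold enrichment comes out at eight purely for illustration.

```python
from math import comb


def fold_enrichment(hits_in_set, set_size, hits_total, universe):
    """Ratio of the hit rate inside the gene set to the hit rate overall."""
    return (hits_in_set / set_size) / (hits_total / universe)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n items without replacement from N items,
    K of which are hits."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Invented counts: 1,000 testable SNPs, 50 of them trait-associated;
# 40 SNPs map to the network, 16 of which are trait-associated.
UNIVERSE, HITS_TOTAL, SET_SIZE, HITS_IN_SET = 1000, 50, 40, 16

fold = fold_enrichment(HITS_IN_SET, SET_SIZE, HITS_TOTAL, UNIVERSE)
p_value = hypergeom_upper_tail(HITS_IN_SET, HITS_TOTAL, SET_SIZE, UNIVERSE)
print(f"fold enrichment = {fold:.1f}, P = {p_value:.2e}")
```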

Along similar lines, module-based network approaches44 were extended to identify genetic determinants of differential regulation of genetic modules80 as well as to identify genetic alterations causally related to the presentation of a tumor phenotype81.

Regulatory network analysis. Causal regulatory networks have also been successfully used to identify disease-relevant genes that were then experimentally validated. In these networks, interactions are directed (causal) rather than undirected, as in protein interaction networks. Thus, if networks are sufficiently accurate and comprehensive, they may allow traversing back regulatory event cascades to identify 'master regulator' genes that are necessary and/or sufficient to induce specific disease-related molecular signatures. This method was originally proposed for networks reconstructed from DNA-binding signatures of transcription factors, without experimental validation82. More recently, master regulator genes were inferred and experimentally validated, both in disease, for human high-grade glioma53, and for normal physiological formation of germinal centers20. In high-grade glioma, for instance, the master regulator inference algorithm (MARINa) identified two transcription factors, C/EBP (including both the β and δ subunits) and STAT3, as master regulators of the mesenchymal subtype, which is associated with the worst prognosis in this disease. Ectopic expression of both transcription factors, but not of either one individually, was sufficient to reprogram neural stem cells along an aberrant mesenchymal lineage. Simultaneous silencing in high-grade glioma lines, but not individual silencing of either gene, was sufficient to abrogate the mesenchymal phenotype and tumorigenesis in vivo. Direct exploration of GWAS data from The Cancer Genome Atlas (TCGA) study on glioblastoma in the context of genes upstream of these master regulators has identified genetic alterations responsible for most mesenchymal cases.

Diseasome approaches. Genes and proteins work within highly coordinated programs. Thus, another approach for the analysis of GWAS data exploits previous biological knowledge of gene similarities and dissimilarities across diseases.

For example, although the immune system is implicated in many pathophysiological phenotypes, suggesting that autoimmune disorders may share causal genetic variants with them, there are also notable differences. The G allele of the rs2076530 polymorphism in BTNL2 (encoding butyrophilin-like 2, a major histocompatibility complex (MHC) class II–associated factor) is more frequent among individuals with type 1 diabetes and rheumatoid arthritis than in healthy controls, whereas the A allele is more frequent in individuals with systemic lupus erythematosus than in healthy individuals83. One way to use disease relationships is to compare multiple GWAS data sets to find risk alleles and SNPs associated with disease sets, whether as predisposing or protective factors. The identification of such 'toggleSNPs' was used to study molecular mechanisms in actual human disease incidence, providing a key advantage over similar studies in animal models84.

Phenotype canalization. Many diseases, including cancer, present a seeming paradox. Whereas the number of genetic and epigenetic dysregulation patterns associated with disease etiology is generally large, the number of distinct molecular subtypes from gene expression profiling analysis is substantially smaller. For instance, in high-grade glioma, dozens of genetic alterations have been reported85, yet there are only three or four distinct molecular subtypes86,87. This suggests the existence of an integrative logic, usually at the level of transcriptional regulation, canalizing aberrant signals from complex genetic and epigenetic alteration patterns into a few molecular phenotypes. The existence of this integrative logic has been uncovered in several tumor types, including in lymphoma70 and in high-grade glioma53. These observations suggest yet another approach to INAS, based on the identification of candidate genes in the regulatory modules that control the disease subtype and in their upstream pathways. This handful of genes can then be directly assessed for genetic and/or epigenetic variation, thus dramatically increasing statistical power by reducing the number of multiple hypotheses tested.

Conclusions

Regulatory network models are emerging as powerful integrative frameworks to understand and interpret the roles of genetics and epigenetics in disease predisposition and etiology. By providing the backbone of molecular interactions through which signals are transduced and gene expression is regulated, they dramatically limit the search space of allele variants and alterations that can be causally linked to the presentation of a phenotype. In addition, by providing accurate regulatory models of the cellular machinery that integrates signals that are dysregulated in disease, they yield valuable hypotheses for diagnostic and prognostic biomarkers, for therapeutic targets and for the understanding of context-specific synthetic lethality.

For regulatory network models to yield their full potential, however, we must understand both the mechanistic and statistical implications of their variability across cellular context, their dependence on the genetic and epigenetic layers of regulation and their dynamics over time. The latter is particularly important for diseases where the underlying cellular pathophysiology cannot be considered to be close to steady state, such as metabolic and neurological diseases. We note that, in leveraging network models reflecting multiple conditions or multiple contexts to identify key drivers of phenotypes of interest, particular attention must be paid to assessing the significance of drivers predicted in one context after searching a diversity of contexts. Controlling for false discovery rates in this setting demands that one account for all of the models queried across the different contexts.
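The point about accounting for every model queried can be made concrete with the Benjamini-Hochberg procedure. The sketch below assumes that each context yields a P value per candidate driver and that all of them are pooled into a single family before correction; the context names, gene labels and P values are invented, and Benjamini-Hochberg is one reasonable choice of procedure rather than the one prescribed here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    at false discovery rate alpha (Benjamini-Hochberg step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def pooled_discoveries(per_context, alpha=0.05):
    """Correct across ALL (context, driver) tests at once."""
    labels = [(ctx, drv) for ctx, res in per_context.items() for drv in res]
    pvals = [per_context[ctx][drv] for ctx, drv in labels]
    keep = benjamini_hochberg(pvals, alpha)
    return [label for label, k in zip(labels, keep) if k]


# Invented driver P values from three hypothetical tissue contexts.
results = {
    "liver": {"geneA": 0.001, "geneB": 0.04},
    "islet": {"geneA": 0.03, "geneC": 0.5},
    "adipose": {"geneD": 0.02, "geneE": 0.9},
}

pooled = pooled_discoveries(results)
liver_only = benjamini_hochberg(list(results["liver"].values()))
print("pooled across contexts:", pooled)
print("liver analyzed in isolation:", liver_only)
```

In this toy example a driver that passes when one context is analyzed in isolation no longer passes once the tests from all contexts are pooled, which is the inflation the paragraph above warns against.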

Unexpectedly, even rough regulatory models that are largely inaccurate and incomplete are starting to show substantial value in dissecting the genetics of disease. Thus, we expect that, as these models progress and become better able to deal with the dynamic, cell context–specific nature of biological process regulation, they will dramatically increase their ability to yield key insight into both normal cell physiology and its dysregulation in disease. We herald network reverse engineering and interrogation as one of the most critical challenges of quantitative biology.

Assembling these models will require efforts that transcend individual laboratories and even institutions. Yet, until very recently, nearly all of the historic studies that drive the current understanding of disease were performed by single laboratories. This process fails to recognize that the value of data is multiplied when it can be easily accessed and leveraged in ways that were not originally envisioned. While efforts like TCGA85, the database of Genotypes and Phenotypes (dbGaP), Gene Expression Omnibus (GEO) and GWAS meta-analysis have shown the usefulness of sharing data on a large scale, the absence of a culture of appropriate data sharing remains perhaps the single greatest impediment to the rapid development of the integrative techniques described here. Even in cases where substantial effort has gone into providing data in the most comprehensive fashion (for example, in the TCGA projects), the reproduction of results derived from such data by others often remains elusive88.

Author contributions

A.C. and E.S. wrote the manuscript, and A.J.B., S.F. and T.I. provided specific examples and editorial comments.