Introduction

Genome-wide association studies (GWAS) of psychiatric disorders (schizophrenia, bipolar disorder (BD), major depressive disorder and others) have suggested a highly polygenic architecture1 with a high degree of heterogeneity. Given the relative lack of replicated common risk variants2, 3, 4 with a large effect size, interest has turned to other potential explanations (including rare variants and epistasis5, 6, 7, 8, 9) for the presumed missing heritability.10, 11 Recent analyses have suggested that a substantial proportion of additive genetic variability is in fact well tagged by common variants when considered in aggregate, for example, explaining 37–40% of the genetic variability for BD.12, 13 These analyses have also suggested that the remaining missing heritability may be a function of imperfect linkage disequilibrium with rare causal risk variants. Although a large degree of additive genetic variance is supported both theoretically and empirically, it is important to note that a large additive contribution to genetic variance does not preclude the contribution of models involving epistasis between single-nucleotide polymorphisms (SNPs).14, 15 The variation encoded in the nodes and edges of an epistasis network may be used to estimate the amount of additional variation accounted for by epistasis. However, the goal of the current study is to demonstrate the sensitivity of epistasis networks for discovering new susceptibility genes in GWAS.

The recognition that numerous variants act together to increase disease susceptibility has also led to the development of gene-set or pathway enrichment approaches, which aggregate association evidence at the level of a single gene or biological pathway.16, 17, 18 As applied to SNP data, these approaches typically rely on association evidence calculated marginally for each SNP, thus ignoring potential effects due to interactions.19, 20 Here, we consider a network approach that prioritizes genes and pathways based on the aggregation of effects due to gene–gene interactions as well as marginal (main) effects. This approach consists of four main steps, summarized in Figure 1: (1) filtering to remove noise SNPs from consideration, (2) representing association evidence in terms of an epistasis network, (3) prioritizing SNPs/genes in the network using an eigenvector centrality algorithm and (4) pathway enrichment based on epistasis network centrality prioritization. We first remove noise SNPs with an optimized version of the evaporative cooling machine learning (ECML) filter. We have shown that the ECML filter, which is based on the combination of Relief-F and Random Forests, has the power to detect both epistatic and main effects, whereas Random Forest alone has very weak power to detect epistatic effects in high dimensional data.21

Figure 1

Epistasis network analysis flowchart. Overview of the data analysis workflow used to identify variants due to epistasis network centrality and test for replication of pathways. The analysis steps in the dotted frame are carried out for the three GWAS at the top (WTCCC, NIMH and, as a secondary analysis, the two GWAS combined). On the bottom left, the enriched pathways are compared between the WTCCC and NIMH GWAS, and replication is defined when a pathway has an FDR-adjusted P-value less than 0.05 for both. On the bottom right, tables are created for the top genes based on their epistasis network centrality for each of the data combinations.

We have previously used information theory to construct epistasis networks (which we label as a genetic association interaction network, itGAIN); however, in the present study, we rely upon regression models, primarily to be able to assign statistical significance to nodes and edges inferred in the network (which we label as a regression-based genetic association interaction network (reGAIN) to differentiate it from an information theory-based approach). Other groups have recently investigated the graph properties of epistasis networks, illustrating, for example, that hub (highly connected) SNPs do not necessarily correspond to SNPs with large main effects.22 For the final step in our approach, we prioritize edges and nodes in the epistasis network using an eigenvector centrality algorithm we have developed called SNPrank.21, 23 SNPrank can be understood by the analogy of a random SNP surfer circulating through the network, accumulating bits of interaction and main effect information from each SNP regarding association with the phenotype. In a previous application of this approach to a genetic association study of the immune response to smallpox vaccine, we identified an intronic SNP in the retinoid X receptor α (RXRA) gene, which is known to be a mediator of vitamin D signaling and has recently been shown to be involved in innate immune response.23

Here, we apply the combined approach of ECML+reGAIN+SNPrank to two previous GWAS of BD: the Wellcome Trust Case Control Consortium (WTCCC)24 and a more recent National Institute of Mental Health (NIMH) GWAS.25 The original WTCCC study of BD, consisting of 1868 cases and 2938 controls, did not find any single SNP associations surpassing commonly accepted thresholds for genome-wide significance (P<5 × 10^−8). However, a recent collaborative analysis of BD, which combined the WTCCC data with other studies for an overall sample of 4387 cases and 6209 controls, found a strong association for the imputed SNP rs10994336 (ANK3) on chromosome 10q21 (P=9.1 × 10^−9).26 In the recent NIMH GWAS of European ancestry (EA) and African ancestry (AA) individuals, no SNP reached genome-wide significance. However, in the EA samples (1001 cases and 1033 controls), which we analyze in the current study, a sliding-window analysis yielded a high proportion of haplotypes with P<0.05 in the ANK3 region. In the current work, we observe a highly connected SNP in ANK3 that is ranked third by SNPrank in our epistasis network analysis of the original WTCCC GWAS, and the network rank of this variant is second when the WTCCC and NIMH-EA GWAS are merged. The network analysis of the merged data yields a top-10 ranking to a SNP in diacylglycerol kinase eta (DGKH), which was implicated for BD in a previous study,27 and top 15 for ODZ4, which has been identified in Sklar et al.28 The top genes based on epistasis network analysis for the merged GWAS are given in Table 3.

The epistasis network prioritization also results in enrichment of plausible biological pathways for BD that replicate between the WTCCC24 and the NIMH BD GWAS.25 Using the epistasis network centrality for gene prioritization based on the Reactome FI database,29 we find replication of enrichment of the cadherin signaling pathway and evidence consistent with replication in the Wnt signaling pathway. Genes in the cadherin pathway have been implicated in BD pathophysiology.30 In addition, it has been suggested that BD is affected by genes in the Wnt signaling pathway as well as the circadian rhythm pathway,31 which are both enriched in the WTCCC GWAS by this approach. Other enriched pathways include axon guidance and neuroactive ligand-receptor interaction. The identification of replicated pathways suggests that network aggregation of gene–gene interactions and main effects can provide statistical power to expose hidden variation associated with complex diseases. These results also indicate the importance of taking into account the information concerning epistasis as well as main effects when prioritizing genes for pathway analysis.

Materials and methods

Study samples and initial filtering

For the primary/discovery epistasis network analysis, we used the WTCCC-BD GWAS, which included bipolar I, bipolar II and schizoaffective bipolar in the case diagnosis.24 Samples (including 1868 cases and 2938 controls after exclusions) were genotyped on the Affymetrix 500K array (Santa Clara, CA, USA). For replication, we used the NIMH-BD GWAS genotyped on the Affymetrix 6.0 platform.25 The NIMH BD study comprised two samples: one of individuals of EA (n=1001 cases; n=1033 controls) and one of individuals of AA (n=345 cases; n=670 controls). We focus on the EA individuals from the NIMH study because the effect of admixture on these machine learning and network techniques has not been fully investigated. The case diagnosis included bipolar I and schizoaffective bipolar. For both studies, we removed SNPs with call rates <95%, minor allele frequency <1%, or with evidence of deviation from Hardy–Weinberg equilibrium (P<0.001). As a secondary analysis, we merged the top SNPs from the WTCCC and NIMH-EA cohorts. In the merged data, we include only SNPs that overlap between the Affymetrix 6.0 and 500K arrays rather than imputing missing SNPs. Imputation may allow for the discovery of additional genes and pathways.
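The three SNP-level filters above can be illustrated with a minimal Python sketch. The thresholds (call rate 95%, minor allele frequency 1%, Hardy–Weinberg P<0.001) come from the text; the function name `snp_qc`, the 0/1/2 genotype coding with `None` for a missing call, and the use of a chi-square approximation for the Hardy–Weinberg test are assumptions made for illustration, and the original studies may have used an exact test in dedicated software.

```python
import math


def snp_qc(genotypes, min_call_rate=0.95, min_maf=0.01, hwe_alpha=0.001):
    """Return True if one SNP passes the call-rate, MAF and HWE filters.

    genotypes: list of allele counts per sample (0, 1 or 2), None = missing.
    Hypothetical helper; the HWE test is a 1-df chi-square approximation.
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False

    n = len(called)
    count0, count1, count2 = called.count(0), called.count(1), called.count(2)
    p = (2 * count0 + count1) / (2 * n)  # frequency of the allele coded 0
    q = 1.0 - p
    if min(p, q) < min_maf:
        return False

    # Compare observed genotype counts with Hardy-Weinberg expectations.
    observed = (count0, count1, count2)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_hwe = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival, 1 df
    return p_hwe >= hwe_alpha
```

Applying such a function to every SNP and keeping those that return `True` gives the input set for the subsequent ECML filter.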

We now detail the methods used in the steps of the analysis pipeline, which is summarized in Figure 1. To limit the number of noise (irrelevant) SNPs used in the network analysis, we filtered SNPs based on ECML, which has power to detect main and interaction effects.21 We used the 1000 SNPs with the top ECML score to construct a reGAIN, as described below. Any filter increases the risk of excluding pure interaction effects that exhibit negligible marginal effects as well as excluding some weak main effects. However, filtering reduces the number of pairwise interactions that must be calculated, eliminates many irrelevant variants and improves interpretability of the network. The filter used herein retains many more potential interaction effects, and retains approximately two orders of magnitude more SNPs, than the threshold used by WTCCC to define moderate associations in their Supplementary data (P<0.0001).

Regression-based epistasis network construction (reGAIN)

From the 1000 SNPs remaining after the ECML filter, we construct a GAIN/epistasis network composed of main effects and gene–gene interactions between all pairs. Our previous data-driven GAIN network approach for GWAS used Shannon information theory for epistasis calculations and network construction.21, 23 However, casting the network in the statistical framework of a general linear model has some advantages over information theory. For example, use of a general linear model framework provides the flexibility to handle environmental covariates, longitudinal data, missing data, censoring and cluster structure (for example, family studies) through the inclusion of appropriate random effects. For the BD GWAS, we use a likelihood ratio test of association between disease and a genetic locus, allowing for the possibility that the genetic effect may be modified by another genetic factor.
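For a pair of SNPs, an interaction model consistent with the coefficients discussed in this section can be written as follows. The logistic link and the additive genotype coding are assumptions, since the text specifies only a general linear model framework with a likelihood ratio test; the symbol names SNP_1 and SNP_2 are likewise illustrative.

```latex
\operatorname{logit} \Pr(D = 1 \mid \mathrm{SNP}_1, \mathrm{SNP}_2)
  = b_0 + b_1\,\mathrm{SNP}_1 + b_2\,\mathrm{SNP}_2
  + b_{12}\,\mathrm{SNP}_1 \times \mathrm{SNP}_2
```

Under this form, the likelihood ratio test for the interaction compares the full model with the reduced model in which b_{12} = 0.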

The coefficient b0 gives the baseline risk of disease and coefficients b1 and b2 correct for main effects in the interaction regression model. For defining gene–gene edge weights b12 in the reGAIN, we are interested in the b12 regression coefficients that are statistically different from zero. The statistical framework also allows false discovery rate procedures to be applied to correct for multiple gene–gene hypotheses. The diagonal element bii of the reGAIN is simply the main effect regression coefficient without interactions. These interaction and main effect regression coefficients for all SNPs in the filter become matrix elements in the SNPrank Markov transition matrix, discussed next.
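To make the edge-weight calculation concrete, the following Python sketch estimates the interaction coefficient for one SNP pair and tests it with a likelihood ratio test. It assumes a logistic model for case-control status and fits it with a small hand-written Newton-Raphson routine so that the example has no dependencies; the function names (`fit_logistic`, `interaction_edge`) are hypothetical, and this is not the authors' reGAIN implementation.

```python
import math


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(rows, y, iters=25):
    """Newton-Raphson logistic fit; returns (coefficients, log-likelihood)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(rows, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * x[j]
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        beta = [b + s for b, s in zip(beta, _solve(hess, grad))]
    ll = 0.0
    for x, yi in zip(rows, y):
        eta = sum(b * v for b, v in zip(beta, x))
        ll += yi * eta - math.log(1.0 + math.exp(eta))
    return beta, ll


def interaction_edge(snp_a, snp_b, status):
    """Return (b12, likelihood-ratio P-value) for one SNP pair."""
    full = [[1.0, a, b, a * b] for a, b in zip(snp_a, snp_b)]
    reduced = [[1.0, a, b] for a, b in zip(snp_a, snp_b)]
    beta_full, ll_full = fit_logistic(full, status)
    _, ll_reduced = fit_logistic(reduced, status)
    lrt = max(0.0, 2.0 * (ll_full - ll_reduced))
    p_value = math.erfc(math.sqrt(lrt / 2.0))  # chi-square survival, 1 df
    return beta_full[3], p_value
```

Looping `interaction_edge` over all SNP pairs would fill the off-diagonal elements of the matrix, and a main-effect-only fit per SNP would fill the diagonal. In practice an established statistics package would replace the hand-written fitting routine.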

Eigenvector network centrality (SNPrank) for gene prioritization in pathway enrichment

We use the SNPrank23 network centrality/importance score to prioritize the 1000 SNP nodes in the reGAIN for pathway enrichment. This score accounts for main effects and gene–gene interactions encoded in the reGAIN matrix. Briefly, SNPrank constructs a stochastic transition matrix from the reGAIN matrix B (see above). The matrix accounts for single-locus effects through the main effects along the diagonal bii and accounts for pair-wise interactions through the interaction coefficients bij on the off-diagonal elements. Higher-order interactions (linear combinations of multiple pair-wise interactions and main effects) are incorporated through a recursive power method to calculate the dominant eigenvector of the transition matrix. The elements of the dominant eigenvector are the SNPrank scores of each genetic node in the reGAIN. The eigenvector is normalized so the elements sum to one, like a probability field. Thus, we use a QQ plot to estimate the number of genes to include in pathway enrichment below; we use the top n=200 genes for both GWAS (WTCCC and NIMH).
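The idea of ranking nodes by the dominant eigenvector of a stochastic matrix can be sketched with a PageRank-style power iteration. This is a simplified illustration and not the published SNPrank formula: the damping parameter `gamma`, the use of absolute coefficient values, and the choice to let the diagonal main effects set the restart distribution are all assumptions, as is the function name `snprank_sketch`.

```python
def snprank_sketch(B, gamma=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on a column-stochastic matrix built from reGAIN matrix B.

    B is a square list of lists: B[i][i] holds main effects, B[i][j] holds
    interaction effects. Returns scores that sum to one.
    """
    n = len(B)
    main = [abs(B[i][i]) for i in range(n)]
    total_main = sum(main)
    # Restart distribution weighted by main effects (uniform if all are zero).
    restart = [m / total_main for m in main] if total_main > 0 else [1.0 / n] * n
    # Total interaction weight leaving each node.
    out_weight = [sum(abs(B[i][j]) for i in range(n) if i != j) for j in range(n)]

    r = [1.0 / n] * n
    for _ in range(max_iter):
        new = [0.0] * n
        for j in range(n):
            if out_weight[j] > 0:
                for i in range(n):
                    if i != j:
                        new[i] += gamma * abs(B[i][j]) / out_weight[j] * r[j]
                    new[i] += (1.0 - gamma) * restart[i] * r[j]
            else:
                # A node without interactions passes all its mass to the restart.
                for i in range(n):
                    new[i] += restart[i] * r[j]
        delta = sum(abs(a - b) for a, b in zip(new, r))
        r = new
        if delta < tol:
            break
    return r
```

For a small matrix such as `[[1, 2, 0], [2, 1, 1], [0, 1, 1]]`, the middle node, which interacts with both others, receives the highest score even though all three main effects are equal.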

Pathway enrichment analysis

To identify enriched pathways from the n=200 top genes, we used the Reactome FI database29 of expert-curated human biological pathways. Reactome pathways are described as a series of molecular events that transform one or more input entities into one or more output entities catalyzed or regulated by other entities. Entities include small molecules, proteins, complexes, post-translationally modified proteins and nucleic acid sequences. SNPs are assigned to genes based on proximity to the 5′ and 3′ ends of the first and last exons. For SNPs whose proximity is greater than 20 kb, we look for linkage disequilibrium information that may inform gene assignment.32 If a SNP is not easily assigned, we do not use it in pathway analysis. We use this conservative approach to limit false positive assignments and false positive enriched pathways. Genes are not repeated in the enrichment if more than one SNP from a gene is found in the top list. We calculated the P-value for the significance of the overrepresentation of a biological pathway πi with the hypergeometric distribution

where N is the number of background genes (genes annotated to any pathway), n is the number of top genes prioritized by SNPrank, M(i) is the total number of genes in pathway πi, whereas m(i) is the number of top SNPrank genes that intersect the set of pathway genes πi.
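With the symbols defined above, and assuming the usual upper-tail (over-representation) form of the test, the enrichment P-value is $P(X \ge m(i)) = \sum_{k=m(i)}^{\min(n, M(i))} \binom{M(i)}{k}\binom{N-M(i)}{n-k} / \binom{N}{n}$. A minimal Python sketch of this calculation follows; the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(N, n, M, m):
    """Upper-tail hypergeometric probability P(X >= m).

    N: background genes, n: top genes, M: genes in the pathway,
    m: top genes that fall in the pathway.
    """
    total = comb(N, n)
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, min(n, M) + 1))
    return tail / total
```

For example, with N=10 background genes, n=3 top genes and a pathway of M=4 genes, observing m=2 pathway genes in the top list gives (36+4)/120 = 1/3.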

Two corrective measures were taken to reduce false positive pathway enrichments. The first is correction due to multiple hypothesis testing. All pathways tested for enrichment were sorted in ascending order and the corrected P-value was given by

where P is the total number of pathways and R(x) is the rank order of pathway x. Second, we generated pathway-specific and GWAS-specific enrichment distributions to correct for gene-size bias. Gene length can bias pathway enrichment,33 which can be particularly significant for large brain-function genes.8 We select n=200 SNPs randomly from the GWAS, map SNPs to genes and calculate mrand(i) (the number of the randomly selected genes that intersect the set of genes in pathway πi). We repeat this sampling 1000 times to create a null distribution of mrand for each pathway. If a pathway has a gene-size bias, this should be reflected in the random distribution of mrand. We use the mean and standard deviation of mrand(i) to calculate a z-score and P-value for each observed m(i) (from the epistasis network centrality ranking of the GWAS). The gene-size-corrected P-value for Wnt signaling is P=0.000337 for the WTCCC data and P=0.06 for the NIMH data; and for cadherin signaling P=0.032 for both WTCCC and NIMH. Cadherin signaling meets our replication criteria when corrected for multiple tests and gene length. Although Wnt signaling does not technically replicate when corrected for gene length, the consistency of high significance in WTCCC and near significance in NIMH makes this pathway very suggestive for involvement in BD.
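Both corrective measures can be sketched in a few lines of Python. The rank-based adjustment assumes a Benjamini-Hochberg style form, $P_{\mathrm{corr}}(x) = P(x) \cdot P / R(x)$, which matches the symbols defined above but is an assumption about the exact formula. The gene-size correction follows the resampling procedure in the text; the one-sided normal tail used to convert the z-score into a P-value, the SNP-to-gene dictionary, and the function names are also assumptions.

```python
import math
import random


def rank_adjusted_p_values(p_values):
    """Multiply each P-value by (number of pathways / ascending rank), capped at 1."""
    total = len(p_values)
    order = sorted(range(total), key=lambda i: p_values[i])
    adjusted = [0.0] * total
    for rank, idx in enumerate(order, start=1):
        adjusted[idx] = min(1.0, p_values[idx] * total / rank)
    return adjusted


def gene_size_z_test(observed_m, snp_to_gene, pathway_genes,
                     n_snps=200, n_draws=1000, seed=0):
    """z-score and one-sided P-value of observed_m against random SNP draws.

    snp_to_gene: dict mapping SNP id -> gene symbol (None if unassigned).
    pathway_genes: set of gene symbols in the pathway.
    """
    rng = random.Random(seed)
    snps = list(snp_to_gene)
    counts = []
    for _ in range(n_draws):
        genes = {snp_to_gene[s] for s in rng.sample(snps, n_snps)}
        genes.discard(None)
        counts.append(len(genes & pathway_genes))
    mean = sum(counts) / n_draws
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / (n_draws - 1))
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed_m - mean) / sd
    return z, 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal P-value
```

Because long genes carry more SNPs, they are drawn more often in the random samples, so a pathway rich in long genes has a higher null mean and needs a larger observed overlap to reach the same z-score. Note that the standard Benjamini-Hochberg procedure additionally enforces monotonicity of the adjusted values, which this sketch omits because the text does not mention it.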

Network pruning with edge significance for visualization of network

For SNPrank gene ranking, we used the full network of ECML-filtered SNPs because we suspect multiple small interactions with potentially weaker significance will contribute to the overall expression of the phenotype. False connections have the potential to bias the network, but we expect the false edges to be randomly distributed. We did not observe a gene length bias that might artificially inflate the network importance of longer genes. For improved interpretation of the network, we pruned the network based on edge strength. We used an edge strength threshold of bij=0.575 to highlight the gene nodes and edges that have the strongest effects and to reduce the obscuring effect (network hairball) of many weak connections. The maximum threshold was chosen (edges below this threshold were pruned) subject to the constraint of minimizing the number of network islands. Gene symbols are used to label nodes. If more than one SNP from a gene is found in the network, then the SNP with the highest SNPrank score represents the gene and its interactions.
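One reading of the threshold rule above is: choose the largest edge-strength cutoff for which pruning does not create more disconnected islands than the unpruned network has. The sketch below implements that reading in Python; counting islands as connected components over all nodes is an assumption, and the function names are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components, using a simple union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b, _ in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in nodes})


def choose_threshold(nodes, edges):
    """Return (threshold, kept_edges) for edges given as (node, node, weight).

    Tries candidate thresholds from strongest to weakest and stops at the
    first one that leaves the component count unchanged.
    """
    baseline = count_components(nodes, edges)
    for t in sorted({abs(w) for _, _, w in edges}, reverse=True):
        kept = [e for e in edges if abs(e[2]) >= t]
        if count_components(nodes, kept) == baseline:
            return t, kept
    return 0.0, list(edges)
```

On a toy network with edges A–B (0.9), B–C (0.6), A–C (0.3) and C–D (0.7), the chosen threshold is 0.6: the weak A–C edge is hidden while all four nodes remain in a single connected network.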

Results

The Materials and methods section contains details of the regression-based epistasis-network pathway-enrichment analysis as well as descriptions of the WTCCC-BD24 and NIMH-BD25 GWAS data sets. In brief, the WTCCC-BD GWAS was used for discovery and NIMH-BD for replication. We retained the top 1000 SNPs based on ECML feature selection, which has demonstrated power to detect both main effects and gene–gene interactions in GWAS.21 From these top 1000 SNPs, we constructed an epistasis network of main effects and gene–gene interactions between all pairs using the reGAIN method described in Materials and methods and in McKinney et al.20 We then applied SNPrank23 to the epistasis network to further remove noise SNPs and enrich the top list of SNPs for main effects and interactions. We retained the top genes for pathway enrichment analysis based on the QQ plot of the SNPrank eigenvector scores, which resulted in a cutoff of approximately 200 genes. This cutoff removes network nodes whose SNPrank scores are consistent with a uniform distribution in the range (0,1). We used the same cutoff for both the discovery and replication data sets to define the number of top genes for use in the hypergeometric distribution for pathway enrichment. We used pathway annotations from the Reactome FI pathway database.29

We list the most significant epistasis network pathway enrichment results in Tables 1, 2 for the WTCCC and NIMH GWAS of BD. We find replication evidence of enrichment of the cadherin signaling pathway (P=0.004 in WTCCC and P=0.0094 in NIMH-EA) and evidence of replication in the Wnt signaling pathway (P=0.0008 in WTCCC and P=0.06 in NIMH-EA). Genes in the cadherin pathway as well as protein partners in the Wnt pathway have been implicated as possible components of a molecular pathway in susceptibility to BD pathophysiology.30 It has also been suggested separately that BD is affected by genes in the Wnt signaling pathway as well as the circadian rhythm pathway,31 both enriched in the WTCCC GWAS by the epistasis network approach. These pathways are not significantly enriched when SNPs are prioritized by single-locus statistics as observed for example in the WTCCC-BD in Torkamani et al.19 Other enriched pathways of note based on epistasis networks include axon guidance (NIMH-EA (P=0.028)) and neuroactive ligand-receptor interaction (WTCCC (P=0.0008)), which is also the most significantly enriched when the WTCCC and NIMH-EA GWAS are merged. Genes and edges for the WTCCC reGAIN network in Figure 2 are annotated by pathway membership for the replicated pathways.

Table 1 WTCCC pathway enrichment
Table 2 NIMH-EA pathway enrichment
Figure 2

Epistasis network for WTCCC GWAS of bipolar disorder. Network inferred following ECML feature selection and regression-based genetic association interaction network (reGAIN) for the WTCCC GWAS of bipolar disorder, annotated by top enriched pathways. An edge threshold (0.575) was chosen as described in Materials and methods; interactions below this threshold are hidden. The 146 nodes are colored based on membership of the genes in the pathways with evidence of enrichment replication (Tables 1 and 2): red diamond (membership in both Wnt signaling pathway and cadherin signaling pathway), green square (Wnt signaling pathway only) and magenta triangle (Neuroactive ligand-receptor interaction pathway). The weight of an edge is proportional to the gene–gene interaction strength. The 183 edges are colored based on connection of a gene node to a gene in the given pathway using the scheme above (red squiggle, green dashed, magenta solid). The size of a node is proportional to its degree (number of edges). Note, ANK3 in the middle is the most connected.

Epistasis network centrality (SNPrank) results of the top individual SNPs for the WTCCC, NIMH and merged data sets may be found in Supplementary Table 1. There is consistent evidence in the GWAS literature for the role of ANK3 for BD susceptibility, yet no ANK3 SNPs are ranked higher than 600 in a single-locus analysis of the WTCCC data unless the data is merged with other studies to create a larger sample size.26 Without pooling additional samples, the epistasis network centrality analysis of the WTCCC data yields a variant in ANK3 (rs10509126) that is ranked third by SNPrank. The network centrality rank (SNPrank) of this variant moves higher in the rankings when the WTCCC and NIMH-EA GWAS are merged (rank second). As shown in Figure 2, this ANK3 SNP has the largest number of gene–gene interaction connections in the WTCCC GWAS data. The merged network analysis yields a top-10 SNPrank (rank seventh) to a SNP in DGKH, which was implicated for BD in a previous study27 but not in the WTCCC and NIMH data sets. The merged analysis yields a rank of 15 for a variant in ODZ4, which was identified in Sklar et al.28

Discussion

Motivated by the complex, interconnected nature of biological pathways involved in biological processes such as mood regulation, we infer epistasis network signatures of BD from two published GWAS. An underlying assumption of pathway and gene-set approaches is that genes influence phenotypic expression as part of a biological network; however, most gene-set and pathway studies use statistical gene prioritization limited to the individual effect of each gene or variant. The goal of the current work was to use pathway replication evidence for the hypothesis that epistasis network signatures contain information about the underlying biological pathways that regulate phenotypic expression of BD. Our approach used ECML filtering and reGAIN to create a data-driven BD-specific network consisting of statistical gene–gene interactions and single-locus associations. We then used SNPrank to integrate these effects and prioritize genes for pathway enrichment analysis.

Direct replication of a network signature poses a statistical challenge due to the complexity of the models that are to be tested.19, 20 We chose a level of replication that uses pathway enrichment statistics as evidence for network effects in independent GWAS. We constructed filtered epistasis networks and used SNPrank network centrality scores to prioritize genes for subsequent pathway enrichment analysis. In the current study, we replicated the enrichment of the cadherin signaling pathway based on the prioritization of genes through an epistasis network analysis of the WTCCC and NIMH GWAS of BD. Other enriched pathways of interest were identified, including Wnt signaling, axon guidance and neuroactive ligand-receptor interaction (see Tables 1 and 2).

The enrichment of genes in the cadherin, Wnt and axon guidance signaling pathways is suggestive of a developmental origin for BD. The Wnt/β-catenin pathway is the canonical pathway controlling cell proliferation and differentiation during embryonic development.34 Cadherins guide neuronal migration during development and are involved in neuronal differentiation and synaptogenesis. Interestingly, the schizophrenia susceptibility gene, DISC1, appears to have a role in the regulation of cell–cell adhesion and neurite outgrowth via the expression of N-cadherins.35 Wnt pathway genes may also have a role in synaptic plasticity and adult neurogenesis, possibly explaining why lithium36 and perhaps valproate37 increase gray matter volumes in patients with BD; lithium inhibits GSK3B, thereby upregulating Wnt signaling.38 Although the cadherin/Wnt pathway has not generally been the focus of genetic studies, a number of genes within this pathway, including FAT30, 39 and PPARD,40 have been implicated in the development of BD.

In addition to pathways, we find evidence for increased sensitivity to detect SNPs relevant to BD susceptibility by aggregating network effects, including the main effect of nodes. A notable example of this boost in sensitivity is ANK3 (rs10509126). When ranked by univariate statistical significance in the WTCCC GWAS, ANK3 SNPs are outside the top 600 SNPs. However, the epistasis network procedure ranks this ANK3 SNP third in the WTCCC data, and the rank is second when the WTCCC data is merged with the NIMH-EA data (see Table 3 and Supplementary Table 1). The ability to identify this SNP in the WTCCC data is significant because of the growing body of support for ANK3 for BD susceptibility since the WTCCC study. The top SNPrank SNP in the WTCCC data is in the ARAP2 gene, which contains ankyrin repeats. Both ANK3 and ARAP2 are highly connected in the reGAIN in Figure 2 and interact with genes in the neuroactive ligand-receptor interaction pathway. The DGKH region, implicated in a previous study,27 lacks a strong signal in the WTCCC data by itself, but when merged with the NIMH data, the epistasis network approach ranks one of the DGKH SNPs seventh.

Table 3 Top genes from epistasis network centrality of combined WTCCC+NIMH GWAS

Baum et al.27 reported the first association between a SNP in DGKH and BD in the context of a GWAS. The association with DGKH was recently replicated in a Han-Chinese population.41 Moreover, a DGKH haplotype consisting of the SNPs rs994856, rs9525580 and rs9525584 was recently associated with BD, unipolar depression and attention deficit hyperactivity disorder (ADHD),42 psychiatric disorders that share substantial overlap with respect to clinical symptomatology. Interestingly, DGKH is a key protein in the phosphatidyl-inositol pathway that is also regulated by lithium.43 A recent large-scale analysis (11 974 BD cases and 51 792 controls) identified a new variant in ODZ4.28 The epistasis network analysis of the present study also yielded variants in the ODZ4 gene for the smaller WTCCC and NIMH GWAS data sets, and the merged analysis yielded a rank of 15 for a variant in ODZ4. With the growing number of large-scale GWAS, it may be possible to identify novel variants of biological importance through an epistasis network approach.

The general linear model used in reGAIN provides a statistical framework to assign confidence to edges and nodes in the network. In addition, the SNPrank eigenvector centrality scores computed from the reGAIN are well suited to prioritizing genes for pathway enrichment calculations. The SNPrank scores are more difficult to interpret than an odds ratio or a P-value; however, the scores have an interpretation as probabilities because the scores come from the elements of a normalized eigenvector so that the scores sum to unity. Thus, we can identify a significance threshold for pathway enrichment by comparing the observed SNPrank score distribution with a uniform probability as a theoretical null.

These results suggest that some of the missing heritability may be due to the neglect of the context of disease-specific networks of epistatic and main effects. A future challenge is to quantify the amount of heritability that may be accounted for in these networks. A strategy toward this end may be to use the variation in the edge and node regression coefficients of the network to estimate the heritability. These data-driven network techniques offer an additional tool to identify new biological pathways, network signatures and markers relevant to phenotypes due to network interactions.