Introduction

Genome-wide association studies (GWAS) have cataloged many significant and reproducible associations between common genetic variants, notably single-nucleotide polymorphisms (SNPs), and diverse human complex traits1. However, it remains challenging2 to translate these findings into biological mechanisms and clinical applications, because most trait-associated variants have individually small effects and map to non-coding sequences.

One hypothesis is that non-coding variants cumulatively affect complex traits through cell type- or tissue-specific3 gene regulation4. To test this hypothesis, large-scale epigenomic5,6 and transcriptomic7,8,9,10 data have been made available spanning diverse human cell types and tissues. Exploiting these data, many studies have shown enrichments of trait-associated SNPs in chromatin regions11,12,13 and genes14,15,16 that are active in trait-relevant cell types or tissues. These studies overlap regulatory maps with GWAS data and often ignore functional interactions among loci within regulatory programs.

Gene regulatory networks17,18,19,20 have proven useful in mining functional interactions of genes from genomic data. Transcriptional regulatory interactions, rather than gene expression alone, drive tissue specificity19. Further, context-specific regulatory networks have emerged as promising tools to dissect the genetics of complex traits21,22,23. Network-connectivity analyses in GWAS have shown that trait-associated genes are more interconnected than expected18 and highly interconnected genes are enriched for trait heritability24. However, these analyses do not leverage observed enrichments to further enhance trait-associated gene discovery.

To unleash the potential of regulatory networks in GWAS, we develop a Bayesian framework for simultaneous genome-wide network enrichment and gene prioritization analysis. Through extensive simulations we show several advantages of the method such as flexibility in various genetic architectures, robustness to a wide range of model mis-specification and improved performance over existing methods. Applying the method to 18 human traits and 38 regulatory networks, we identify strong enrichments of genetic associations in network topology specific to trait-relevant cell types or tissues. By prioritizing variants within enriched networks we identify trait-associated genes that were not implicated by the same GWAS. Many of these previously undescribed genes have strong support from multiple lines of external evidence; some are further validated by follow-up GWAS of the same traits with increased sample sizes. Together, these results demonstrate the potential for our method to yield additional biological and therapeutic insights from existing data.

Results

Method overview

Figure 1 shows the method schematic. In brief, we develop a model dissecting the total effect of a single SNP on a trait into effects of multiple (nearby and distal) genes through a regulatory network, and we combine it with a multiple-SNP regression likelihood25 based on GWAS summary statistics to perform Bayesian inference.

Fig. 1: Schematic of RSS-NET.

a Decomposition of the total effect of a common SNP on a complex trait through multiple nearby and distal genes. b Gene regulatory network defined as a weighted and directed bipartite graph linking TFs to TGs. c RSS-NET exploits the topology of a TF-TG network to decompose the total genetic effect into cis and trans-regulatory components. Both the SNP-gene (cjg) and TF-TG (vgt) weights in this decomposition are assumed known and are specified by existing omics data (Methods). In addition to TF-TG networks, RSS-NET also requires d GWAS summary statistics and e ancestry-matching LD estimates as input. f Bayesian hierarchical model underlying RSS-NET. An in-depth description is provided in Methods. g Given a network, RSS-NET produces a Bayes factor comparing the baseline (M0) and enrichment (M1) models to summarize the evidence for network enrichment. h RSS-NET prioritizes loci within an enriched network by computing P1, the posterior probability that at least one SNP j in a locus is trait-associated (βj ≠ 0). Differences between P1 under M0 and M1 reflect the influence of a regulatory network on genetic associations, highlighting previously undescribed trait-associated genes.

Conceptually, we decompose the total effect of a common SNP on a complex trait into three components: a cis-regulatory component through nearby genes, a trans-regulatory component through distal genes that are regulated by genes near this SNP, and a remaining component due to other factors (Fig. 1a). Since common genetic variation contributes to complex traits primarily via gene regulation22, we find this decomposition a sensible approximation to the genetic basis of complex traits.

Although there are various ways to model the regulatory components, here we use cell type- or tissue-specific regulatory networks18,20 linking transcription factors (TFs) to target genes (TGs). Specifically, we define a regulatory network as a directed bipartite graph with weighted edges from TFs to TGs (Fig. 1b). Given a TF-TG network, we use its topology to decompose the total effect of each SNP into effects of multiple interconnected genes. As shown in Fig. 1c, we approximate the effect of SNP j using a weighted sum of cis effects of three nearby genes (outside-network gene k, TG u and TF g) and trans effects of three TGs (u and t on the same chromosome, and n on a different chromosome) that are directly regulated by TF g near SNP j. For identifiability, we assume the SNP-gene (cjg) and TF-TG (vgt) weights in the decomposition are known, specified by existing omics data (Methods).
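As a rough illustration of the decomposition in Fig. 1c, the Python sketch below approximates the total effect of one SNP as a weighted sum of cis and trans terms. The function name `snp_total_effect`, the gene-level effect values, the toy weights, and the choice to scale each trans term by the product of the two weights are all assumptions made for illustration. The remaining component due to other factors is omitted, and the actual model is specified in Methods.

```python
def snp_total_effect(cis_weights, tf_targets, gene_effects):
    """Approximate the total effect of one SNP (illustrative only).

    cis_weights:  {gene: c_jg} SNP-gene weights for genes near the SNP
    tf_targets:   {tf: {tg: v_gt}} TF-TG edge weights of the network
    gene_effects: {gene: effect} hypothetical gene-level effects on the trait
    """
    total = 0.0
    for gene, c_jg in cis_weights.items():
        # cis component: effect through the nearby gene itself
        total += c_jg * gene_effects.get(gene, 0.0)
        # trans component: effects through TGs regulated by a nearby TF
        # (empty for nearby genes that are not TFs in the network)
        for tg, v_gt in tf_targets.get(gene, {}).items():
            total += c_jg * v_gt * gene_effects.get(tg, 0.0)
    return total


# Toy version of Fig. 1c: nearby genes k (outside network), u (TG), g (TF);
# TF g regulates TGs u, t and n. All numbers are made up.
cis = {"k": 0.2, "u": 0.5, "g": 1.0}
net = {"g": {"u": 0.3, "t": 0.6, "n": 0.1}}
effects = {"k": 0.1, "u": 0.2, "g": 0.4, "t": 0.5, "n": -0.3}
beta_j = snp_total_effect(cis, net, effects)
```

In this toy setup the outside-network gene k contributes only a cis term, whereas TF g contributes both its own cis term and trans terms through u, t and n.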

To implement this regulatory decomposition in GWAS, we formulate a network-induced prior for SNP-level effects (β), and combine it with a regression likelihood25 of β based on single-SNP association statistics from a GWAS (Fig. 1d) and linkage disequilibrium (LD) estimates from a reference panel with ancestry matching the GWAS (Fig. 1e). We refer to the resulting Bayesian framework (Fig. 1f) as Regression with Summary Statistics exploiting NEtwork Topology (RSS-NET).

RSS-NET accomplishes two tasks simultaneously: (1) testing if a network is enriched for genetic associations (Fig. 1g); (2) identifying which genes within this network drive the enrichment (Fig. 1h). Specifically, RSS-NET estimates two independent enrichment parameters (θ and σ2) that measure, respectively, the extent to which SNPs near network genes and regulatory elements (REs) are more likely to be associated with the trait, and the extent to which SNPs near network edges have larger effect sizes. To assess network enrichment, RSS-NET computes a Bayes factor (BF) comparing the “enrichment model” (M1: θ > 0 or σ2 > 0) against the “baseline model” (M0: θ = 0 and σ2 = 0). To prioritize genes within enriched networks, RSS-NET contrasts posterior distributions of β estimated under M0 and M1.
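To make the BF concrete, the following Python sketch computes a log10 BF from log marginal likelihoods evaluated on grids of hyperparameter values under M1 and M0. It assumes a uniform prior over grid points and natural-log likelihood inputs. The function names are hypothetical, and this is not presented as the RSS-NET implementation (see Methods and the software repository for that).

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def log10_bayes_factor(loglik_m1, loglik_m0):
    """log10 BF of M1 versus M0.

    loglik_m1, loglik_m0: natural-log marginal likelihoods, one per grid
    point of hyperparameters under each model (uniform grid prior assumed).
    """
    return (log_mean_exp(loglik_m1) - log_mean_exp(loglik_m0)) / math.log(10)
```

Working on the log scale avoids overflow when likelihoods differ by many orders of magnitude, as the BFs reported for real traits often do.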

RSS-NET improves upon its predecessor RSS-E16. Specifically, RSS-NET exploits the full network topology, whereas RSS-E ignores the edge information. By explicitly modeling regulatory interconnections, RSS-NET outperforms RSS-E on both simulated and real data. Despite different treatments of network information, RSS-NET and RSS-E share computation schemes (Box 1; Supplementary Notes 1–3), allowing us to reuse the efficient algorithm of RSS-E. Software is available at https://github.com/suwonglab/rss-net.

Method comparison through simulations

The key contribution of RSS-NET is a unified framework that leverages network topology to infer enrichments from whole-genome association statistics and prioritizes loci in light of inferred enrichments automatically. We are not aware of any published method with the same features. However, one could ignore topology and simply annotate SNPs based on their proximity to network genes and REs (Methods). For these SNP-level annotations there are methods to assess global enrichments or local associations on GWAS summary data. Here we use Pascal26, LDSC13,27, and RSS-E16 to benchmark RSS-NET.

Given a network, we first simulated SNP effects (β) from either RSS-NET or mis-specified models, and then combined them with real genotypes to simulate phenotypes from a genome-wide multiple-SNP model. We computed the single-SNP association statistics, on which we compared RSS-NET with other methods (Figs. 2–4; Supplementary Figs. 1–9). Since RSS-NET is model-based, we designed a large array of simulation scenarios for both correctly- and mis-specified β. To reduce computation of this large-scale design, we mainly used genotypes28 of 348,965 genome-wide common SNPs and a whole-genome regulatory network inferred for human B cells (436 TFs, 3,018 TGs)20,29. We obtained similar results from simulations based on genotypes30 of 1 million common SNPs31 (Supplementary Fig. 9) or a different network (Supplementary Figs. 2 and 8).

We started with simulations where RSS-NET modeling assumptions were satisfied. We considered two genetic architectures: a sparse scenario with most SNPs being null and a polygenic scenario with most SNPs being trait-associated. For each architecture, we created negative datasets by simulating SNP effects (β) from M0 and positive datasets by simulating β from three M1 patterns (only θ > 0; only σ2 > 0; both θ > 0 and σ2 > 0) of the target network, and applied the methods to detect M1 from all datasets (Fig. 2; Supplementary Figs. 1, 2). Existing methods tend to perform well in select settings. For example, Pascal and LDSC perform poorly when genetic signals are very sparse (Fig. 2b); RSS-E performs poorly when enrichment patterns are inconsistent with its modeling assumptions (Fig. 2c). Except for datasets with weak genetic signals on the network (Fig. 2d), RSS-NET performs consistently well in all scenarios. This is expected because the flexible model underlying RSS-NET can capture various genetic architectures and enrichment patterns. In practice, one rarely knows beforehand the correct architecture, which makes the flexibility of RSS-NET appealing.

Fig. 2: Flexibility of RSS-NET to identify network-level enrichments from GWAS summary statistics.

We used a B cell-specific regulatory network and real genotypes of 348,965 genome-wide SNPs to simulate negative and positive individual-level data under two genetic architectures (“sparse” and “polygenic”). We simulated SNP effects (β) for negative datasets from the baseline model (M0: θ = 0 and σ2 = 0). We simulated β for positive datasets from the enrichment model (M1: θ > 0 or σ2 > 0) for the target network under three scenarios: a θ > 0, σ2 = 0; b θ = 0, σ2 > 0; c θ > 0, σ2 > 0. Using the simulated individual-level data we computed single-SNP association statistics, on which we compared RSS-NET with RSS-E16, LDSC-baseline13, LDSC-baselineLD27, and Pascal26 using their default setups (Methods). Pascal includes two gene (“max”: maximum-of-χ2; “sum”: sum-of-χ2) and two pathway (“chi”: χ2 approximation; “emp”: empirical sampling) scoring options. For each dataset, Pascal and LDSC methods produced P-values, whereas RSS-E and RSS-NET produced BFs; these statistics were used to rank the significance of enrichments. A false and true positive occurs if a method identifies enrichment of the target network in a negative and positive dataset respectively. Each panel displays the trade-off between false and true positives via receiver operating characteristics (ROC) curves for all methods in 200 negative and 200 positive datasets of a simulation scenario, and also reports the corresponding areas under ROC curves (AUROCs, higher value indicating better performance). Dashed diagonal lines denote random ROC curves (AUROC = 0.5). d RSS-NET, as well as other methods, does not perform well when the target network harbors weak genetic associations. Simulation details and additional results are provided in Supplementary Figs. 1, 2.
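The ranking-based evaluation described in this caption can be sketched in a few lines of Python. The AUROC below is computed as the probability that a positive dataset receives a larger score than a negative one. The helper names are hypothetical, and mapping P-values to −log10 P so that larger always means stronger evidence is an assumption of this sketch. BFs can be used directly or on the log10 scale, since only ranks matter.

```python
import math


def auroc(pos_scores, neg_scores):
    """Probability that a positive dataset outscores a negative one.

    Ties count as one half. Equivalent to the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def score_from_pvalue(p):
    """Map a P-value to a 'larger is stronger' score."""
    return -math.log10(p)
```

An AUROC of 0.5 corresponds to the dashed diagonal (random ranking), and 1.0 to perfect separation of positive and negative datasets.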

Genetic associations of complex traits are enriched in regulatory regions5,6. Since a regulatory network is a set of genes linked by REs, it is important to confirm that network enrichments identified by RSS-NET are not driven by general regulatory enrichments. To this end, we simulated negative datasets with enriched associations in random SNPs that are near genes (Fig. 3a; Supplementary Fig. 3) or REs (Fig. 3b; Supplementary Fig. 4). The results show that RSS-NET is unlikely to yield false discoveries due to arbitrary enrichments in regulatory regions, and yet it is more powerful than other methods.

Fig. 3: Robustness of RSS-NET to model mis-specification in enrichment analyses.

Here positive datasets were generated from M1 with θ > 0 and σ2 > 0 (Fig. 2c). Negative datasets were simulated from four scenarios where genetic associations were enriched in: a a random set of near-gene SNPs; b a random set of near-RE SNPs; c SNPs with MAF- and LD-dependent effects; d a random edge-altered network. By this design, RSS-NET was mis-specified in all four scenarios. Similar to positive datasets, the simulated false enrichments in all negative datasets manifested in both association proportion (more frequent) and magnitude (larger effect). RSS-E was excluded here because of its poor performance shown in Fig. 2c. The rest is the same as Fig. 2. Simulation details and additional results are provided in Supplementary Figs. 3–6.

Minor allele frequency (MAF)- and LD-dependent genetic architectures are identified in complex traits27. To assess the impact of MAF- and LD-dependence on RSS-NET results, we simulated MAF- and LD-dependent SNP effects (β) from an additive model of 10 MAF bins and 6 LD-related annotations27, which were then used to create negative datasets (Fig. 3c; Supplementary Fig. 5). Similarly, enrichments identified by RSS-NET are unlikely to be false positives induced by MAF- and LD-dependence.

Interconnections within regulatory programs play key roles in driving context specificity19 and propagating disease risk22, but existing methods often ignore the edge information. In contrast, RSS-NET leverages the full topology of a given network. The topology-aware feature increases the potential of RSS-NET to identify the most relevant network for a trait among candidates that share many nodes but differ in edges. To illustrate this feature, we designed a scenario where a real target network and random candidates had the same nodes and edge counts, but different edges. We simulated positive and negative datasets where genetic associations were enriched in the target network and random candidates respectively, and then tested enrichment of the target network on all datasets. As expected, only RSS-NET can reliably distinguish true enrichments of the target network from enrichments of its edge-altered counterparts (Fig. 3d; Supplementary Fig. 6).
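One simple way to build such an edge-altered candidate is sketched below in Python. It keeps the TF and TG sets of the target network fixed and draws the same number of TF-TG pairs at random. The function name is hypothetical, edge weights are ignored, and a random draw may leave some nodes without edges or reuse a few original edges, so this is an illustration rather than the procedure used in the simulations (Supplementary Fig. 6).

```python
import random


def edge_altered_network(edges, seed=0):
    """Draw a random TF-TG edge set with the same edge count as `edges`.

    edges: set of (tf, tg) pairs from the target network.
    The candidate TFs and TGs are those of the target network.
    """
    rng = random.Random(seed)
    tfs = sorted({tf for tf, _ in edges})
    tgs = sorted({tg for _, tg in edges})
    all_pairs = [(tf, tg) for tf in tfs for tg in tgs]
    return set(rng.sample(all_pairs, len(edges)))
```

A method that only annotates SNPs by proximity to network nodes would see nearly the same annotation for the target and such a candidate, which is why edge information is needed to tell them apart.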

To benchmark its prioritization component, we compared RSS-NET with gene-based association modules in RSS-E16 and Pascal26 (Fig. 4; Supplementary Figs. 7–9). Consistent with previous work16, RSS methods outperform Pascal methods even without network enrichment (Fig. 4a). This is because RSS-NET and RSS-E exploit a multiple regression framework25 to learn the genetic architecture from data of all genes and assess their effects jointly, whereas Pascal only uses data of a single gene to estimate its effect. Similar to enrichment simulations (Fig. 2), RSS-NET outperforms RSS-E in prioritizing genes across different architectures (Fig. 4b–d). This again highlights the flexibility of RSS-NET.

Fig. 4: Power of RSS-NET to identify gene-level associations from GWAS summary statistics.

We used a B cell-specific regulatory network and real genotypes of 348,965 genome-wide SNPs to simulate individual-level GWAS data under four scenarios: a θ = 0, σ2 = 0; b θ > 0, σ2 = 0; c θ = 0, σ2 > 0; d θ > 0, σ2 > 0. Using the simulated individual-level data we computed single-SNP association statistics, on which we compared RSS-NET with gene-level association components of RSS-E16 and Pascal26. RSS-E is a special case of RSS-NET assuming σ2 = 0, and RSS-E-baseline is a special case of RSS-E assuming θ = 0. Pascal includes two gene scoring options: maximum-of-χ2 (“max”) and sum-of-χ2 (“sum”). Given a network, Pascal and RSS-E-baseline do not leverage any network information, RSS-E ignores the edge information, and RSS-NET exploits the full topology. Each scenario contains 200 datasets and each dataset contains 16,954 autosomal protein-coding genes for testing. We defined a gene as “trait-associated” if at least one SNP j within 100 kb of the transcribed region of this gene had non-zero effect (βj ≠ 0). For each gene in each dataset, RSS methods produced posterior probabilities that the gene was trait-associated (P1), whereas Pascal methods produced association P-values; these statistics were used to rank the significance of gene-level associations. The first row of each panel displays ROC curves and AUROCs for all methods, with dashed diagonal lines indicating random performance (AUROC = 0.5). The second row of each panel displays precision-recall (PRC) curves and areas under PRC curves (AUPRCs) for all methods, with dashed horizontal lines indicating random performance. For both AUROC and AUPRC, higher value indicates better performance. Simulation details and additional results are provided in Supplementary Figs. 7, 8.
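The definition of a “trait-associated” gene used as ground truth in these simulations can be expressed as a short Python sketch. The data layout (dictionaries and tuples) and the function name are assumptions for illustration, and the window is taken to extend 100 kb on either side of the transcribed region.

```python
def trait_associated_genes(genes, snps, window=100_000):
    """Label each gene as trait-associated or not.

    genes: {name: (chrom, start, end)} transcribed regions
    snps:  list of (chrom, pos, beta) with simulated SNP effects
    A gene is labeled True if any SNP within `window` bp of its
    transcribed region has a non-zero effect.
    """
    labels = {}
    for name, (chrom, start, end) in genes.items():
        labels[name] = any(
            s_chrom == chrom
            and start - window <= pos <= end + window
            and beta != 0
            for s_chrom, pos, beta in snps
        )
    return labels
```

These binary labels are what the P1 values and Pascal P-values are ranked against when computing the ROC and precision-recall curves.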

Finally, since RSS-NET uses a network as is and most networks to date are algorithmically inferred, we performed simulations to assess the robustness of RSS-NET under noisy networks. Specifically, we simulated datasets from a real target network, created noisy networks by randomly removing edges from this real target, and then fed the noisy networks (rather than the real one) to RSS-NET. By exploiting retained true nodes and edges, RSS-NET produces reliable results in identifying both network enrichments and genetic associations, and unsurprisingly, its performance drops as the noise level increases (Supplementary Fig. 10).
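The noisy-network construction can be sketched as follows in Python. The function name and the `noise_level` parameter (the fraction of edges removed) are hypothetical, and the exact noise levels used in the simulations are given in Supplementary Fig. 10.

```python
import random


def remove_random_edges(edges, noise_level, seed=0):
    """Return a noisy copy of a network with a fraction of edges removed.

    edges:       set of (tf, tg) pairs from the real target network
    noise_level: fraction of edges to drop, between 0 and 1
    """
    rng = random.Random(seed)
    edge_list = sorted(edges)  # fixed order so that the seed is reproducible
    n_keep = round(len(edge_list) * (1.0 - noise_level))
    return set(rng.sample(edge_list, n_keep))
```

Because every retained edge is a true edge of the target, this kind of noise reduces the information available to the method without introducing false regulatory links.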

In conclusion, RSS-NET is adaptive to various genetic architectures and enrichment patterns, it is robust to a wide range of model mis-specification, and it outperforms existing related methods. To further investigate its real-world utility, we applied RSS-NET to analyze 18 complex traits and 38 regulatory networks.

Enrichment analyses of 38 networks across 18 traits

We first inferred20 whole-genome regulatory networks for 38 human cell types and tissues (Methods; Supplementary Data 1) from public data29 of paired expression and chromatin accessibility (PECA). On average each network has 431 TFs, 3,298 TGs, and 93,764 weighted TF-TG edges. Clustering showed that networks recapitulated context similarity, with immune cells and brain regions grouping together as two units (Fig. 5a; Supplementary Fig. 11).

Fig. 5: RSS-NET analyses of 18 complex traits and 38 regulatory networks.

a Clustering of 38 regulatory networks based on t-distributed stochastic neighbour embedding. Details are provided in Supplementary Fig. 11. b Similarity between a given tissue-specific PECA-based network and 394 CAGE-based networks for various cell types and tissues (a: adult samples; c: cell lines; f: fetal samples). The similarity between a PECA- and CAGE-based network is summarized by Jaccard indices of their node sets (x-axis) and edge sets (y-axis). To simplify visualization, only labels of top four CAGE-based networks with the highest edge similarity are shown for each PECA-based network. See Supplementary Fig. 12 for additional results. c Ternary diagram showing, for each trait, percentages of the “best” enrichment model (with the largest BF) as M11: θ > 0, σ2 = 0, M12: θ = 0, σ2 > 0 and M13: θ > 0, σ2 > 0 across networks. See Supplementary Table 4 for numerical values. Shown are 16 traits having multiple networks more enriched than the near-gene control. d Comparison of context-matched PECA-based (y-axis) and CAGE-based (x-axis) network enrichments on the same GWAS. Dashed lines have slope 1 and intercept 0. See Supplementary Fig. 14 for additional results. e Median proportion of genes with \({P}_{1}^{{\mathsf{bma}}}\) higher than reference estimates (\({P}_{1}^{{\mathsf{base}}}\) or \({P}_{1}^{{\mathsf{near}}}\)), among genes with reference estimates higher than a given cutoff. Medians are evaluated among 16 traits in c. See Supplementary Table 5 for numerical values. Overlap of RSS-NET prioritized genes (\({P}_{1}^{{\mathsf{bma}}}\ge 0.9\)) with genes implicated in f knockout mouse phenotypes47 and g human Mendelian diseases49,50. An edge indicates that a category of knockout mouse or Mendelian genes is significantly enriched for genes prioritized for a GWAS trait (FDR ≤ 0.1). Thicker edges correspond to stronger enrichments. To simplify visualization, only top-ranked categories are shown for each trait (f 3; g 2). 
See Supplementary Data 4, 5 for full results. Trait abbreviations are defined in Supplementary Table 1.

As a validation, we assessed the pairwise similarity between the 38 PECA-based networks and 394 human cell type- and tissue-specific regulatory networks18 reconstructed from independent cap analysis of gene expression (CAGE) data7,8. Reassuringly, PECA- and CAGE-based networks often reached maximum overlap when they were derived from biosamples of matched cell types or tissues (Fig. 5b; Supplementary Fig. 12), showing that the context specificity of regulatory networks is replicable.
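The similarity summary used in this comparison (Jaccard indices of node sets and edge sets, Fig. 5b) is simple to compute. The Python sketch below assumes each network is given as a set of (TF, TG) pairs. The function names are hypothetical and edge weights are ignored.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def network_similarity(edges_a, edges_b):
    """Return (node Jaccard, edge Jaccard) for two TF-TG edge sets."""
    nodes_a = {node for edge in edges_a for node in edge}
    nodes_b = {node for edge in edges_b for node in edge}
    return jaccard(nodes_a, nodes_b), jaccard(edges_a, edges_b)
```

Two networks can share most of their nodes while differing substantially in edges, which is why the two indices are reported on separate axes.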

On the 38 networks, we applied RSS-NET to analyze 1.1 million common SNPs31 for 18 traits, using GWAS summary statistics from 20,883 to 253,288 European-ancestry individuals (Supplementary Table 1) and LD estimates16 from the European panel of 1000 Genomes Project30. For each trait-network pair we computed a BF assessing network enrichment. Full results of 684 trait-network pairs are available online (Data availability).

To check whether observed enrichments could be driven by general regulatory enrichments, we created a “near-gene” control network with 18,334 protein-coding autosomal genes as nodes and no edges, and analyzed this control with RSS-NET on the same GWAS data. For most traits, the near-gene control has substantially weaker enrichment than the actual networks. In particular, 512 out of 684 trait-network pairs (one-sided binomial P = 2.2 × 10⁻⁴⁰) showed stronger enrichments than their near-gene counterparts (average log10 BF increase: 13.94; one-sided P = 5.1 × 10⁻¹⁵), and 16 out of 18 traits had multiple networks more enriched than the near-gene control (minimum: 5; one-sided Wilcoxon P = 1.2 × 10⁻⁴). In contrast, LDSC and Pascal methods identified fewer trait-network pairs passing the near-gene enrichment control (LDSC maximum: 389, one-sided χ2 P = 1.7 × 10⁻¹²; Pascal maximum: 69, P = 2.0 × 10⁻¹²⁹; Supplementary Table 2). Consistent with simulations (Fig. 3a, b), these results indicate that network enrichments identified by RSS-NET are unlikely to be driven by arbitrary enrichments harbored in the vicinity of genes.
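A one-sided binomial test of this kind can be reproduced in outline with exact integer arithmetic. The Python sketch below assumes a null success probability of 0.5 (a pair is equally likely to be more or less enriched than its near-gene counterpart), which is an assumption of this illustration rather than a detail stated in the text. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5), using exact integer counts."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n


# 512 of 684 trait-network pairs exceeded the near-gene control
p_value = binomial_upper_tail(684, 512)
```

Summing binomial coefficients as integers before dividing avoids the underflow that can occur when multiplying many small floating-point probabilities.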

Among 512 trait-network pairs passing the near-gene enrichment control, we further examined whether the observed enrichments could be confounded by network properties or genomic annotations. We did not observe any correlation between BFs and three network features (proportion of SNPs in a network: Pearson R = −3.0 × 10⁻², two-sided P = 0.49; node counts: R = −5.4 × 10⁻², P = 0.23; edge counts: R = −9.2 × 10⁻³, P = 0.84). To check confounding effects of genomic annotations, we computed the correlation between BFs and proportions of SNPs falling into both a network and each of 73 functional categories27, and we did not find any significant correlation (−0.13 < R < −0.01, P > 0.05/73). Similar patterns hold for all 684 trait-network pairs (Supplementary Table 3 and Data 2). Together, the results suggest that observed enrichments are unlikely to be driven by generic network or genome features.

For each trait-network pair, we also computed BFs comparing the baseline (M0) against three disjoint models where enrichment (M1) was contributed by (1) network genes and REs only (M11: θ > 0, σ2 = 0); (2) TF-TG edges only (M12: θ = 0, σ2 > 0); (3) network genes, REs and TF-TG edges (M13: θ > 0, σ2 > 0). We found that M13 was the most supported model by data (with the largest BF) for 411 out of 512 trait-network pairs (one-sided binomial P = 1.2 × 10⁻⁴⁵), highlighting the key role of TF-TG edges in driving enrichments. To further confirm this finding, we repeated RSS-NET analyses by fixing all TF-TG edge weights as zero (vgt = 0) and we observed substantially weaker enrichments (average log10 BF decrease: 30.46; one-sided P = 8.6 × 10⁻³⁵; Supplementary Fig. 13). Altogether the results corroborate the “omnigenic model” that genetic signals of complex traits are distributed across the genome via regulatory interconnections22.

Enrichment patterns varied considerably among traits (Fig. 5c; Supplementary Table 4). For type 2 diabetes (T2D), two of five networks passing the near-gene enrichment control showed the strongest support for M11. Many networks showed the strongest support for M12 in breast cancer (10), body mass index (BMI, 14), waist-hip ratio (37), and schizophrenia (38). Since one rarely knows the true enrichment patterns a priori, and M1 includes {M11, M12, M13} as special cases, we used M1-based BFs throughout this study. Collectively, these results highlight the heterogeneity of network enrichments across traits, which can be potentially learned from data by flexible approaches like RSS-NET.

Top-ranked enrichments recapitulated many trait-context links reported in previous GWAS. Genetic associations with BMI were enriched in the networks of pancreas (BF = 2.07 × 10¹³), bowel (BF = 8.02 × 10¹²), and adipose (BF = 4.73 × 10¹²), consistent with the roles of obesity-related genes in insulin biology and energy metabolism. Networks of immune cells showed enrichments for rheumatoid arthritis (RA, BF = 2.95 × 10⁶⁰), inflammatory bowel disease (IBD, BF = 5.07 × 10³⁵) and Alzheimer’s disease (BF = 8.31 × 10²⁶). Networks of cardiac and other muscle tissues showed enrichments for coronary artery disease (CAD, BF = 9.78 × 10²⁸), atrial fibrillation (AF, BF = 8.55 × 10¹⁴), and heart rate (BF = 2.43 × 10⁷). Other examples are brain network with neuroticism (BF = 2.12 × 10¹⁹), and liver network with high- and low-density lipoprotein (HDL, BF = 2.81 × 10²¹; LDL, BF = 7.66 × 10²⁷).

Some top-ranked enrichments were not identified in the original GWAS, but they are biologically relevant. For example, natural killer (NK) cell network showed the strongest enrichment among 38 networks for BMI (BF = 3.95 × 10¹³), LDL (BF = 5.18 × 10³⁰), and T2D (BF = 1.49 × 10⁷⁷). This result supports a recent mouse study32 revealing the role of NK cells in obesity-induced inflammation and insulin resistance, and adds to the considerable evidence unifying metabolism and immunity in many pathological states33. Other examples include adipose network with CAD34 (BF = 1.67 × 10²⁹), liver network with Alzheimer’s disease16,35 (BF = 1.09 × 10²⁰) and monocyte network with AF36,37 (BF = 4.84 × 10¹²).

Some networks show enrichments in multiple traits. To assess network co-enrichments among traits, we tested correlations for all trait pairs using their BFs of 38 networks (Supplementary Data 3). In total 29 of 153 trait pairs had significant correlations (two-sided Pearson P < 0.05/153). Reassuringly, subtypes of the same disease showed strongly correlated enrichments, as in IBD (R = 0.96, P = 1.3 × 10⁻²⁰) and CAD subtypes (R = 0.90, P = 3.3 × 10⁻¹⁴). The results also recapitulated known genetic correlations including RA with IBD (R = 0.79, P = 5.3 × 10⁻⁹) and neuroticism with schizophrenia (R = 0.73, P = 1.6 × 10⁻⁷). Network enrichments of CAD were correlated with network enrichments of known CAD risk factors such as heart rate (R = 0.75, P = 5.1 × 10⁻⁸), BMI (R = 0.71, P = 5.1 × 10⁻⁷), AF (R = 0.65, P = 9.2 × 10⁻⁶) and height (R = 0.64, P = 1.6 × 10⁻⁵). Network enrichments of Alzheimer’s disease were strongly correlated with network enrichments of LDL (R = 0.90, P = 2.6 × 10⁻¹⁴) and IBD (R = 0.78, P = 8.3 × 10⁻⁹), consistent with roles of lipid metabolism and inflammation in Alzheimer’s disease35. Genetic correlations among traits are not predictive of correlations based on network enrichments (Pearson R = 0.12, two-sided P = 0.18), suggesting the additional explanatory power from regulatory networks to reveal trait similarities in GWAS.
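The co-enrichment analysis above amounts to a Pearson correlation per trait pair with a Bonferroni-corrected threshold. The Python sketch below shows where the 153 comes from (all unordered pairs of 18 traits) and how a correlation between two traits' enrichment profiles could be computed. The function name is hypothetical, and the use of log10 BFs as input is an assumption of this sketch.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


n_traits = 18
trait_pairs = list(combinations(range(n_traits), 2))
n_pairs = len(trait_pairs)             # number of unordered trait pairs
bonferroni_threshold = 0.05 / n_pairs  # per-pair significance level
```

Each trait contributes a vector of 38 values (one per network), so every correlation is computed over 38 paired observations.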

To show that RSS-NET can be applied more generally, we analyzed the CAGE-based networks18 of 20 cell types and tissues that were present in 38 PECA-based networks (Fig. 5d; Supplementary Fig. 14). PECA-based networks often produced larger BFs than their CAGE-based counterparts on the same GWAS data (average log10 BF increase: 17.36; one-sided P = 1.4 × 10⁻¹¹), suggesting that PECA-based networks are more enriched in genetic signals. Reassuringly, PECA- and CAGE-based networks consistently highlighted known trait-context links (e.g., immune cells and autoimmune diseases, muscle tissues and heart diseases). For some traits PECA-based networks produced more informative results. For example, CAGE-based analysis of HDL showed a broad enrichment pattern across cell types and tissues (which is consistent with previous connectivity analysis18 of the same data), whereas PECA-based analysis identified liver as the top-enriched context by a wide margin. Although not our main focus, these results highlight the potential for RSS-NET to systematically evaluate different network inferences in GWAS.

Enrichment-informed prioritization of network genes

A key feature of RSS-NET is that inferred network enrichments automatically contribute to prioritization of network genes (Methods). Specifically, for each locus RSS-NET produces \({P}_{1}^{{\mathsf{base}}}\), \({P}_{1}^{{\mathsf{near}}}\) and \({P}_{1}^{{\mathsf{net}}}\), the posterior probabilities that at least one SNP in the locus is associated with the trait, assuming M0, M1 for the near-gene control network and M1 for a given network, respectively. When multiple networks are enriched, RSS-NET produces \({P}_{1}^{{\mathsf{bma}}}\) by averaging \({P}_{1}^{{\mathsf{net}}}\) over all networks passing the near-gene control, weighted by their BFs. This allows us to assess genetic associations in light of enrichment without having to select a single enriched network. Differences between enrichment estimates (\({P}_{1}^{{\mathsf{net}}}\) or \({P}_{1}^{{\mathsf{bma}}}\)) and reference estimates (\({P}_{1}^{{\mathsf{base}}}\) or \({P}_{1}^{{\mathsf{near}}}\)) reflect the impact of network on a locus.
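The BF-weighted averaging described above can be sketched in Python as follows. The sketch assumes that weights are directly proportional to the BFs (equivalent to a uniform prior over the networks being averaged) and takes BFs on the log10 scale to avoid overflow. The function name and inputs are hypothetical, and the exact formula is given in Methods.

```python
def bma_p1(p1_net, log10_bf):
    """Average P1 across networks with weights proportional to their BFs.

    p1_net:   list of P1 values for one locus, one per enriched network
    log10_bf: list of log10 BFs for the same networks, in the same order
    """
    top = max(log10_bf)
    # Rescale by the largest BF before exponentiating to avoid overflow
    weights = [10.0 ** (b - top) for b in log10_bf]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, p1_net)) / total
```

Because BFs often differ by many orders of magnitude across networks, the average is typically dominated by the few most enriched networks.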

RSS-NET enhances genetic association detection by leveraging inferred enrichments. To quantify this improvement, for each trait we calculated the proportion of genes with higher \({P}_{1}^{{\mathsf{bma}}}\) than reference estimates (\({P}_{1}^{{\mathsf{base}}}\) or \({P}_{1}^{{\mathsf{near}}}\)), among genes with reference P1 passing a given cutoff (Fig. 5e). When using \({P}_{1}^{{\mathsf{base}}}\) as reference, we observed high proportions of genes with \({P}_{1}^{{\mathsf{bma}}}\, > \, {P}_{1}^{{\mathsf{base}}}\) (median: 82–98%) across a wide range of \({P}_{1}^{{\mathsf{base}}}\)-cutoffs (0–0.9), and as expected, the improvement decreased as the reference cutoff increased. When using \({P}_{1}^{{\mathsf{near}}}\) as reference, we observed fewer genes with improved \({P}_{1}^{{\mathsf{bma}}}\) than when using \({P}_{1}^{{\mathsf{base}}}\) (one-sided Wilcoxon P = 9.8 × 10⁻⁴), suggesting the observed improvement might be partially due to general near-gene enrichments, but proportions of genes with \({P}_{1}^{{\mathsf{bma}}}\, > \, {P}_{1}^{{\mathsf{near}}}\) remained high (median: 74–94%) nonetheless. Similar patterns occurred when we repeated the analysis with \({P}_{1}^{{\mathsf{net}}}\) across 512 trait-network pairs (Supplementary Table 5). Together the results demonstrate the strong influence of network enrichments on nominating additional trait-associated genes.
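The proportion reported in Fig. 5e can be computed per trait as in the Python sketch below. The function name is hypothetical, and treating “passing a cutoff” as strictly greater than the cutoff follows the wording of the Fig. 5 caption and is an assumption here.

```python
def proportion_improved(p1_enrich, p1_ref, cutoff):
    """Share of genes whose enrichment-informed P1 exceeds the reference P1.

    p1_enrich: enrichment estimates per gene (e.g. BMA-based P1)
    p1_ref:    reference estimates per gene (baseline or near-gene P1)
    cutoff:    only genes with reference P1 above this value are counted
    Returns None when no gene passes the cutoff.
    """
    kept = [(e, r) for e, r in zip(p1_enrich, p1_ref) if r > cutoff]
    if not kept:
        return None
    return sum(e > r for e, r in kept) / len(kept)
```

Raising the cutoff restricts the comparison to genes that already have strong reference evidence, which leaves less room for improvement and is consistent with the decreasing trend described above.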

RSS-NET tends to promote more genes in networks with stronger enrichments. For each trait, the proportion of genes with \({P}_{1}^{{\mathsf{net}}}\, > \, {P}_{1}^{{\mathsf{near}}}\) in a network is often positively correlated with the network enrichment BF (R: 0.28−0.91; Supplementary Table 6). When a gene belongs to multiple networks, the highest \({P}_{1}^{{\mathsf{net}}}\) often occurs in the top-enriched networks (Fig. 6). We illustrate this coherent pattern with MT1G, a liver-active9 gene prioritized for HDL by RSS-NET and also implicated in a recent multi-ancestry genome-wide interaction analysis of HDL38. Although MT1G belongs to regulatory networks of 18 contexts, only the top enrichment in liver informs a strong association between MT1G and HDL (\({P}_{1}^{{\mathsf{net}}}=0.98\)), and remaining networks with weaker enrichments yield minimal improvement (\({P}_{1}^{{\mathsf{base}}}=0.10\), \({P}_{1}^{{\mathsf{net}}}:0.14\!-\!0.19\)).

Fig. 6: RSS-NET gene prioritization results of select trait-network pairs.

Shown are four trait-network pairs: a body mass index and pancreas; b rheumatoid arthritis and B cell; c high-density lipoprotein cholesterol and liver; d neuroticism and putamen. In the first column of each panel, each point represents a member gene of a given network (blue circle: TF; orange triangle: TG). Dashed lines have slope 1 and intercept 0. In the second and third columns, each point represents a cell type- or tissue-specific network to which a select gene belongs. Numerical values of P1 and BF are available online (Data availability) and are provided as a Source Data file.

RSS-NET recapitulates many genes implicated in the same GWAS. For each analyzed dataset we downloaded the GWAS-implicated genes from the GWAS Catalog1 and computed the proportion of these genes with high \({P}_{1}^{{\mathsf{bma}}}\). With a stringent cutoff \({P}_{1}^{{\mathsf{bma}}}\ge 0.9\), we observed a significant overlap (median across traits: 69%; median two-sided Fisher exact P = 1.2 × 10−26; Supplementary Table 7). Reassuringly, many recapitulated genes are well-established for the traits (Supplementary Table 8), such as CACNA1C for schizophrenia, TCF7L2 for T2D, APOB for lipids, and STAT4 for autoimmune diseases.

RSS-NET also uncovers putative associations that were not reported in the same GWAS. To demonstrate that many of these previously undescribed associations are potentially real, we exploited 15 analyzed traits that each had an updated GWAS with larger sample size. In each case, we obtained newly implicated genes from the GWAS Catalog1 and computed the proportion of these genes that were identified by RSS-NET (\({P}_{1}^{{\mathsf{bma}}}\ge 0.9\)). The overlap proportions remained significant (median: 12%; median two-sided Fisher exact P = 1.9 × 10−5; Supplementary Table 7), showing the potential of RSS-NET to identify trait-associated genes that can be validated by later GWAS with additional samples. Among these validated genes, many are strongly supported by multiple lines of external evidence (Table 1). A particular example is NR0B2, a liver-active9 gene prioritized for HDL by RSS-NET (\({P}_{1}^{{\mathsf{base}}}=0.84\), \({P}_{1}^{{\mathsf{net}}}=0.98\)) but not identified by standard GWAS39 of the same data (minimum single-SNP association P = 1.4 × 10−7 within 100 kb, n = 99,900). NR0B2 was associated with mouse lipid traits40,41,42 and human obesity43, and identified in a later GWAS of HDL44 with doubled sample size (P = 9.7 × 10−16, n = 187,056).

Table 1 Examples of RSS-NET highlighted genes that were not reported in GWAS of the same data but were implicated in later GWAS with increased sample sizes (genome-wide significance threshold: single-SNP association P < 5 × 10−8).

Biological and clinical relevance of prioritized genes

Besides looking up overlaps with GWAS publications, we cross-referenced RSS-NET prioritized genes (\({P}_{1}^{{\mathsf{bma}}}\ge 0.9\)) with multiple orthogonal databases to systematically assess their biological and therapeutic themes.

Mouse phenomics provides important resources to study genetics of human traits45. Here we evaluated overlap between RSS-NET prioritized genes and genes implicated in 27 categories of knockout mouse phenotypes46. Network-informed genes (\({P}_{1}^{{\mathsf{bma}}}\ge 0.9\)) were significantly enriched in 128 mouse-human trait pairs (FDR ≤ 0.1; Supplementary Data 4). Fewer significant pairs were identified without network information (119 for \({P}_{1}^{{\mathsf{near}}}\ge 0.9\); 80 for \({P}_{1}^{{\mathsf{base}}}\ge 0.9\)). For many human traits, top enrichments of network-prioritized genes occurred in closely related mouse phenotypes (Fig. 5f). Genes prioritized for schizophrenia were strongly enriched in nervous, neurological and growth phenotypes (OR: 1.77–2.04). Genes prioritized for autoimmune diseases were strongly enriched in immune and hematopoietic phenotypes (OR: 2.05–2.35). The cardiovascular system showed strong enrichments of genes prioritized for heart conditions (OR: 2.45–2.92). The biliary system showed strong enrichments of genes prioritized for lipids, BMI, CAD, and T2D (OR: 2.16–10.78). The phenotypically matched cross-species enrichments strengthen the biological relevance of RSS-NET results.

Genes causing Mendelian diseases often contribute to complex traits47. Here we quantified overlap between RSS-NET prioritized genes and genes causing 19 categories48 of Mendelian disorders49. Leveraging regulatory networks (\({P}_{1}^{{\mathsf{bma}}}\ge 0.9\)), we observed 47 significantly enriched Mendelian-complex trait pairs (FDR ≤ 0.1; 44 for \({P}_{1}^{{\mathsf{near}}}\ge 0.9\); 31 for \({P}_{1}^{{\mathsf{base}}}\ge 0.9\); Supplementary Data 5), among which the top-ranked ones were often phenotypically matched (Fig. 5g). Genes prioritized for schizophrenia were strongly enriched in Mendelian development and psychiatric disorders (OR: 2.22–2.23). Genes prioritized for AF and heart rate were strongly enriched in arrhythmia (OR: 7.16–8.28). Genes prioritized for autoimmune diseases were strongly enriched in monogenic immune dysregulation (OR: 3.11–4.32). Monogenic cardiovascular diseases showed strong enrichments of genes prioritized for lipids and heart conditions (OR: 2.69–3.70). We also identified pairs where Mendelian and complex traits seemed unrelated but were indeed linked. Examples include Alzheimer’s disease with immune dysregulation35 (OR = 7.32) and breast cancer with insulin disorders50 (OR = 9.71). The results corroborate the continuum between Mendelian and complex traits.

Human genetics has proven valuable in therapeutic development51. To evaluate their potential in drug discovery, we examined whether RSS-NET prioritized genes are pharmacologically active and clinically relevant52. We identified genes with drug indications matching GWAS traits. One identical match is EDNRA, a gene that is prioritized for CAD (\({P}_{1}^{{\mathsf{base}}}=0.57\), \({P}_{1}^{{\mathsf{net}}}=0.82\) in aorta) and also a successful target of approved drugs for cardiovascular diseases (Table 1). We also identified genes with drug indications closely related to GWAS traits. For example, TTR is prioritized for Alzheimer’s disease (\({P}_{1}^{{\mathsf{base}}}=0.64\), \({P}_{1}^{{\mathsf{bma}}}=0.94\)) and also a successful target of approved drugs for amyloidosis (Table 2). For early-stage development, overlaps between drug indications and GWAS traits may provide additional genetic confidence. For example, HCAR3 is prioritized for HDL (\({P}_{1}^{{\mathsf{base}}}=0.85\), \({P}_{1}^{{\mathsf{bma}}}=0.92\)) and also a clinical trial target for lipid metabolism disorders (Table 2). Other examples include CASP8 with cancer, NFKB2 with IBD, and DLG4 with stroke (Tables 1, 2). For some genes we found mismatches between drug indications and GWAS traits, which could suggest drug repurposing opportunities53. For example, CSF3 is prioritized for AF (\({P}_{1}^{{\mathsf{base}}}=0.56\), \({P}_{1}^{{\mathsf{bma}}}=0.88\)) and also a successful target of an approved drug for aplastic anemia (AA). Since CSF3 is associated with various blood cell traits in mouse54 and human55, and inflammation plays a role in both AA and AF etiology36,37,56, it is tempting to assess effects of the approved AA drug on AF. Mechanistic evaluations are required to understand the prioritized therapeutic genes, but they could form a useful basis for future studies.

Table 2 Examples of RSS-NET highlighted genes that have not reached genome-wide significance in the GWAS Catalog1 at the time of analysis.

Discussion

We present RSS-NET, a topology-aware method for integrative analysis of regulatory networks and GWAS summary data. We demonstrate the improvement of RSS-NET over existing methods through extensive simulations, and illustrate its potential to yield biological and therapeutic insights via analyses of 38 networks and 18 traits. With multi-omics integration becoming routine in GWAS, we expect that researchers will find RSS-NET useful.

Compared with existing integrative approaches, RSS-NET has several key strengths. First, unlike many methods that require loci passing a significance threshold11,12,17, RSS-NET uses data from genome-wide common variants. This potentially allows RSS-NET to identify subtle enrichments even in studies with few significant hits. Second, RSS-NET models enrichments directly as increased rates (θ) and sizes (σ2) of SNP-level associations, and thus bypasses the issue of converting SNP-level summary data to gene-level statistics17,18,26. Third, RSS-NET inherits from RSS-E16 an important feature that inferred enrichments automatically highlight which network genes are most likely to be trait-associated. This prioritization component, though useful, is missing in current polygenic analyses13,15,24,27. Fourth, by making flexible modeling assumptions, RSS-NET is adaptive to unknown genetic architectures.

RSS-NET allows us to study complex trait genetics through the lens of regulatory topology. Complementing previous connectivity analyses17,18,19,24, RSS-NET highlights a consistent pattern that genetic signals of complex traits often distribute across the genome via regulatory topology. RSS-NET further leverages topology enrichments to enhance trait-associated gene discovery. The topology awareness of RSS-NET in both enrichment and prioritization analyses is enabled by a model that decomposes the effect of a single SNP into effects of multiple (cis or trans) genes through a regulatory network.

RSS-NET depends critically on the quality of input networks. The more accurate the networks are, the better RSS-NET performs. Currently, our understanding of regulatory networks remains incomplete, and most of the available networks are algorithmically inferred17,18,19,20. Artifacts in inferred networks can bias RSS-NET results; however, our simulations confirm the robustness of RSS-NET when input networks do not deviate severely from the ground truth. The modular design of RSS-NET enables systematic assessment of various networks in the same GWAS and provides interpretable performance metrics, as illustrated in our comparison of PECA- and CAGE-based networks. As more accurate networks become available in diverse cellular contexts, the performance of RSS-NET will be markedly enhanced.

Like any method, RSS-NET has several limitations in its current form. First, despite its prioritization feature, RSS-NET does not attempt to pinpoint associations to causal SNPs within prioritized loci. For this task, we recommend off-the-shelf fine-mapping methods57. Second, the computation time of RSS-NET increases as the total number of analyzed SNPs increases, and thus our simulations and analyses focused on 0.35–1.19 million genome-wide common SNPs28,31. Relaxing the complexity will allow RSS-NET to analyze more SNPs jointly. Third, RSS-NET uses a simple method to derive SNP-gene relevance (cjg) from expression quantitative trait loci (eQTL). A more principled approach would be applying the RSS likelihood25 to eQTL summary data (as we did in GWAS) and using the estimated SNP effects to specify cjg. However, our initial assessments indicated that the model-based approach was limited by the small sample sizes of current eQTL studies9,10. With eQTL studies reaching large sample sizes58 comparable to current GWAS1, this approach may improve cjg specification in RSS-NET. Fourth, RSS-NET analyzes one network at a time. Since a complex disease typically manifests in various sites, multiple cellular networks are likely to mediate disease risk jointly. To extend RSS-NET to incorporate multiple networks, an intuitive idea would be representing the total effect of a SNP as an average of its effect in each network, weighted by network relevance for a disease. Fifth, RSS-NET does not include known SNP-level13,24,27 or gene-level14,15,16 annotations. Although our mis-specification simulations and near-gene control analyses confirm that RSS-NET is robust to generic enrichments of known features, accounting for known annotations can help interpret observed network enrichments24. Our preliminary experiments showed that incorporating additional networks or annotations in RSS-NET increased computation costs. Hence, we view developing computationally efficient multi-network, multi-annotation methods as an important area for future work.

In summary, improved understanding of complex trait genetics requires biologically informed models beyond the standard one employed in GWAS. By modeling context-specific regulatory topology, RSS-NET is a step forward in this direction.

Methods

Gene and SNP information

This study used genes and SNPs from the human genome assembly GRCh37. This study used 18,334 protein-coding autosomal genes (http://ftp.ensembl.org/pub/grch37/release-94/gtf/homo_sapiens, accessed January 3, 2019). Simulations used 348,965 genome-wide SNPs28 (https://www.wtccc.org.uk), and data analyses used 1,289,786 genome-wide HapMap331 SNPs (https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2, accessed November 27, 2018). As discussed later, these SNP sets were chosen to reduce computation. This study excluded SNPs on sex chromosomes, SNPs with MAF less than 1%, and SNPs in the human leukocyte antigen region.

Gene regulatory networks

In this study a regulatory network is a directed bipartite graph {VTF, VTG, ETF→TG}, where VTF and VTG denote the node sets of TFs and TGs respectively, and ETF→TG denotes the set of TF-to-TG edges, summarizing how TFs regulate TGs through REs (Fig. 1b; Supplementary Note 4). Each edge has a weight between 0 and 1, measuring the relative regulation strength of a TF on a TG.

We inferred 38 regulatory networks from context-matched sequencing data of gene expression (e.g., RNA-seq) and chromatin accessibility (e.g., DNase-seq or ATAC-seq). We obtained these PECA data from ENCODE29 (https://www.encodeproject.org, accessed December 14, 2018) and GTEx9 (https://gtexportal.org, accessed July 13, 2019); see Supplementary Data 1. The network-construction software and TF-motif information are available at https://github.com/suwonglab/PECA. The 38 networks are available at https://github.com/suwonglab/rss-net, with descriptive statistics provided in Supplementary Tables 9–11.

We first constructed an “omnibus” network from PECA data of 201 biosamples across 80 cell types and tissues, using a regression-based method20. In brief, by modeling the distribution of TG expression levels conditional on RE accessibility levels and TF expression levels, we estimated a regression coefficient for each TF-TG pair. We selected a TF-TG pair as the network edge if this estimated coefficient was significantly non-zero, and divided the estimate by the maximum of estimates for all TF-TG pairs to set a (0, 1)-scale edge weight. We also estimated a regression coefficient for each RE-TG pair, which reflected the regulating strengths of REs on TGs and was later used to construct context-specific networks, i.e., {Iit} in Eq. (1). Here we defined REs as open chromatin peaks called from accessibility sequencing data by MACS259 (https://github.com/macs3-project/MACS, accessed July 12, 2018).

With the omnibus network in place, we then constructed context-specific networks for 5 immune cell types, 5 brain regions and 27 non-brain tissues. For each context (tissue or cell type), we computed a trans-regulation score (TRS) between TF g and TG t:

$${\text{TRS}}_{gt}={2}^{| {R}_{gt}| }\cdot \sqrt{{\widetilde{\text{TF}}}_{g}\cdot {\widetilde{\text{TG}}}_{t}}\cdot \mathop{\sum}\limits_{i}({\widetilde{\text{RE}}}_{i}\cdot {B}_{gi}\cdot {I}_{it}),$$
(1)

where Rgt is the correlation of TF g and TG t expression levels across all contexts; \(\{{\widetilde{\text{TF}}}_{g},{\widetilde{\text{TG}}}_{t},{\widetilde{\text{RE}}}_{i}\}\) are normalized context-specific expression (TF g, TG t) and accessibility (RE i) levels (\(\tilde{y}={y}^{2}/{y}_{\text{med}}\), where y denotes the actual accessibility or expression level in a given context, and ymed denotes median level across all contexts); Bgi reflects the motif binding strength of TF g on RE i, defined as the sum of motif position weight matrix-based log-odds probabilities of all binding sites on RE i and calculated by HOMER60 (http://homer.ucsd.edu/homer/, accessed July 12, 2018); and Iit reflects the overall regulating strength of RE i on TG t, provided by the omnibus network. TRS naturally ranks and selects context-specific TF-TG edges because a larger value of TRSgt indicates a stronger regulating strength of TF g on TG t in the given context. We set (0, 1)-scale TF-TG edge weights by computing \({\mathrm{log}}_{2}(1+{\text{TRS}}_{gt})/{\mathrm{max}}_{(i,j)}\{{\mathrm{log}}_{2}(1+{\text{TRS}}_{ij})\}\).
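As a rough illustration of Eq. (1) and the edge-weight scaling, the following sketch computes TRS for two TF-TG pairs and converts the scores to (0, 1)-scale weights. All function names and numerical inputs are hypothetical; the actual network construction is performed by the PECA software cited above.

```python
import math

def normalize(level, median_level):
    """y_tilde = y^2 / y_med for an expression or accessibility level y."""
    return level ** 2 / median_level

def trs(r_gt, tf_norm, tg_norm, re_terms):
    """Eq. (1); re_terms holds (RE_tilde_i, B_gi, I_it) triples for one TF-TG pair."""
    re_sum = sum(re * b * i for re, b, i in re_terms)
    return 2 ** abs(r_gt) * math.sqrt(tf_norm * tg_norm) * re_sum

def edge_weights(trs_by_edge):
    """log2(1 + TRS) divided by its maximum over all TF-TG pairs."""
    logs = {edge: math.log2(1 + score) for edge, score in trs_by_edge.items()}
    top = max(logs.values())  # assumes at least one pair has TRS > 0
    return {edge: value / top for edge, value in logs.items()}

# Hypothetical inputs for two TF-TG pairs in one context.
scores = {
    ("TF1", "TG1"): trs(0.9, normalize(4.0, 2.0), normalize(3.0, 3.0),
                        [(1.5, 2.0, 0.4), (0.5, 1.0, 0.2)]),
    ("TF1", "TG2"): trs(0.1, normalize(4.0, 2.0), normalize(1.0, 2.0),
                        [(0.8, 1.0, 0.1)]),
}
weights = edge_weights(scores)
```

The pair with the largest TRS receives weight 1, and the logarithm compresses the remaining scores so that weaker edges are not driven to zero by a few very strong ones.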

To validate PECA-based networks and illustrate RSS-NET as a generally applicable tool, we also analyzed 394 cell type- and tissue-specific TF-TG circuits18 inferred from independent CAGE data7,8 (http://regulatorycircuits.org/, accessed May 8, 2019). When evaluating the similarity between PECA- and CAGE-based networks (Fig. 5b; Supplementary Fig. 12), we used their full node and edge sets to compute Jaccard indices. When running RSS-NET on context-matched PECA- and CAGE-based networks (Fig. 5d; Supplementary Fig. 14), we selected top-ranked CAGE-based edges to match PECA-based edge counts (Supplementary Table 10) and normalized CAGE-based edge weights (\(\tilde{x}={\mathrm{min}} \{1,{x}^{1/6}\}\), where x denotes original weight) to match the scale of PECA-based edge weights (Supplementary Table 11).
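The two comparisons described above can be sketched as follows. The function names are hypothetical, and ranking CAGE-based edges by their original weight is an assumption about what "top-ranked" means here.

```python
def jaccard(set_a, set_b):
    """Jaccard index of two node sets or two edge sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def match_cage_to_peca(cage_edges, n_peca_edges):
    """Keep the top-weighted CAGE edges, then rescale weights with min(1, x**(1/6))."""
    ranked = sorted(cage_edges.items(), key=lambda kv: kv[1], reverse=True)
    return {edge: min(1.0, w ** (1 / 6)) for edge, w in ranked[:n_peca_edges]}

# Hypothetical edge sets and weights.
peca_edges = {("TF1", "TG1"), ("TF1", "TG2"), ("TF2", "TG3")}
cage_edges = {("TF1", "TG1"): 64.0, ("TF2", "TG3"): 0.5, ("TF3", "TG4"): 1e-6}
similarity = jaccard(peca_edges, set(cage_edges))
matched = match_cage_to_peca(cage_edges, n_peca_edges=2)
```

The sixth-root transform lifts small weights toward the upper part of the unit interval, and the cap at 1 keeps any weight above 1 on the same scale as the PECA-based weights.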

External databases for cross-reference

To validate and interpret RSS-NET results, we used the following external databases (accessed November 28, 2019): GWAS Catalog1 (https://www.ebi.ac.uk/gwas/), Mouse Genome Informatics46 (http://www.informatics.jax.org/), Mendelian gene sets48 (https://github.com/bogdanlab/gene_sets/), Online Mendelian Inheritance in Man49 (https://www.omim.org/), Therapeutic Target Database52 (http://db.idrblab.net/ttd/).

When quantifying overlaps between RSS-NET prioritized genes and mouse or Mendelian genes, we used all genes for each GWAS trait. We repeated the overlap analysis under the same significance cutoff (FDR ≤ 0.1) after excluding genes implicated in the same or later GWAS (Supplementary Table 7). Since GWAS-implicated genes overlap significantly with phenotypically-matched mouse and Mendelian genes (median two-sided Fisher exact P = 7.1 × 10−7), we identified fewer discoveries as expected (mouse-human pairs: 26, Mendelian-complex pairs: 4; Supplementary Data 4–5), but we obtained consistent effect sizes nonetheless (mouse R = 0.78, two-sided P = 8.6 × 10−73; Mendelian R = 0.89, P = 9.0 × 10−74; Supplementary Fig. 15).

Network-induced effect size distribution

We model the total effect of SNP j on a given trait βj as

$${\beta }_{j} \sim {\pi }_{j}\cdot {\mathcal{N}}({\mu }_{j},\,{\sigma }_{0}^{2})+(1-{\pi }_{j})\cdot {\delta }_{0},$$
(2)

where πj denotes the probability that SNP j is associated with the trait (βj ≠ 0), \({\mathcal{N}}({\mu }_{j},\,{\sigma }_{0}^{2})\) denotes a normal distribution with mean μj and variance \({\sigma }_{0}^{2}\) specifying the effect size of a trait-associated SNP j, and δ0 denotes point mass at zero (βj = 0).

We model the trait-association probability πj as

$${\mathrm{log}}_{10}\left(\frac{{\pi }_{j}}{1-{\pi }_{j}}\right)={\theta }_{0}+{a}_{j}\cdot \theta ,$$
(3)

where θ0 < 0 captures the genome-wide background proportion of trait-associated SNPs, θ > 0 reflects the increase in probability, on the log10-odds scale, that a SNP near network genes and REs is trait-associated, and aj reflects the proximity of SNP j to a network. Following previous analyses15,16,24, we let aj = 1 if SNP j is within 100 kb of any member gene (TF, TG) or RE for a given network. Equation (3) suggests that if a cell type or tissue plays an important role in a trait then genetic associations may occur more often in SNPs involved in the corresponding network genes and REs than expected by chance.
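To make the log10-odds scale of Eq. (3) concrete, the sketch below inverts the equation to obtain πj. The function name and the example hyper-parameter values are illustrative assumptions only.

```python
def assoc_prob(theta0, theta, a_j):
    """Invert Eq. (3): pi_j from the log10-odds theta0 + a_j * theta."""
    odds = 10.0 ** (theta0 + a_j * theta)
    return odds / (1.0 + odds)

# Hypothetical hyper-parameters: background odds of 1e-4, tenfold higher near the network.
background = assoc_prob(-4.0, 1.0, a_j=0)
near_network = assoc_prob(-4.0, 1.0, a_j=1)
```

With these values, a one-unit increase in θ multiplies the odds of association by ten for SNPs with aj = 1 and leaves all other SNPs at the background rate.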

We model the mean effect size μj as

$${\mu }_{j}=\mathop{\sum}\limits_{g\in {{\bf{O}}}_{j}}{w}_{jg}\cdot {\gamma }_{jg},$$
(4)

where Oj is the set of all nearby or distal genes contributing to the total effect of SNP j, wjg measures the relevance between SNP j and gene g, and γjg denotes the effect of SNP j on a trait due to gene g. Equation (4) provides a general decomposition of total SNP effect into gene effects through {Oj, wjg}.

Here we use a TF-TG network to specify {Oj, wjg} in Eq. (4):

$$\mu_{j} = \underbrace{\mathop{\sum}\limits_{g\in {\bf{G}}_{j}}\left[\right.c_{jg}}_{\text{cis}} \cdot (\gamma_{jg} + \underbrace{\mathop{\sum}\limits_{t\in {\bf{T}}_g}v_{gt}\cdot\gamma_{jt}}_{\text{trans}})\left.\right],$$
(5)

where Gj is the set of all genes within 1 Mb window of SNP j (a standard window size used in cis-eQTL studies9,10,58), cjg measures the relative impact of a SNP j on gene g, Tg is the set of all genes directly regulated by TF g in a given network (Tg is empty if gene g is not a TF), and vgt measures the relative impact of a TF g on its TG t. Since a genome-wide analysis typically involves many SNPs and genes, we fix {Tg, vgt, cjg} to ensure the identifiability of Eq. (5). We use inferred edges and weights of a context-specific TF-TG network20,29 to specify Tg and vgt respectively. We use context-matched cis-eQTL9,10,58 to specify cjg (Supplementary Note 5 and Tables 12, 13). Equation (5) suggests that the total effect of a SNP may fan out through some regulatory network of multiple (nearby or distal) genes to affect the trait22.
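Matching Eq. (5) to the general form of Eq. (4) amounts to collecting the coefficient of each gene effect. The sketch below does this for a single SNP, assuming that a gene reached both in cis and in trans contributes one shared effect whose coefficients add; the function name and inputs are hypothetical, and the exact specification is given in Supplementary Note 5.

```python
from collections import defaultdict

def snp_gene_weights(cis, targets):
    """cis: {gene g: c_jg} for genes within 1 Mb of SNP j.
    targets: {TF g: {TG t: v_gt}} taken from the network.
    Returns {gene: w_jg} by collecting the coefficient of each gamma in Eq. (5)."""
    w = defaultdict(float)
    for g, c_jg in cis.items():
        w[g] += c_jg                      # cis term: c_jg * gamma_jg
        for t, v_gt in targets.get(g, {}).items():
            w[t] += c_jg * v_gt           # trans term: c_jg * v_gt * gamma_jt
    return dict(w)

# Hypothetical SNP near a TF (TF_A) and one of its target genes (GENE_B).
cis = {"TF_A": 0.5, "GENE_B": 0.2}
targets = {"TF_A": {"GENE_B": 0.4, "GENE_C": 0.8}}
weights = snp_gene_weights(cis, targets)
```

In this toy case GENE_C lies outside the cis window yet still receives a nonzero weight through TF_A, which is how the network lets a SNP effect reach distal genes.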

We model the random effect γjg of SNP j due to gene g as

$${\gamma }_{jg}\mathop{ \sim }\limits^{\text{i.i.d.}\,}{\mathcal{N}}(0,\,{\sigma }^{2}),$$
(6)

where the SNP-level subscript j in γjg ensures the exchangeability of βj in Eq. (2); see Supplementary Note 6. Equation (6) uses a constant σ2 for computational convenience. Equation (6) could be modified by letting σ2 depend on functional annotations13,27 of SNP j and context-specific expression14,15,16 of gene g, though possibly at higher computational cost.

Equations (2), (4), and (6) imply a variance decomposition for SNP effect:

$${\text{Var}}({\beta }_{j})={\pi }_{j}\cdot ({\sigma }_{0}^{2}+{\sigma }^{2}\cdot \mathop{\sum}\limits_{g\in {{\bf{O}}}_{j}}{w}_{jg}^{2}).$$
(7)

We hypothesize that Eq. (7) may provide an alternative approach to heritability analyses13,24,27 and we plan to investigate it elsewhere.
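A brief sketch of why Eq. (7) holds may be useful, assuming that the gene effects γjg are independent of one another and of whether SNP j is trait-associated. Because μj in Eq. (4) is a weighted sum of independent normal effects from Eq. (6), integrating out γjg in Eq. (2) gives

$${\mu }_{j} \sim {\mathcal{N}}\left(0,\,{\sigma }^{2}\cdot \mathop{\sum}\limits_{g\in {{\bf{O}}}_{j}}{w}_{jg}^{2}\right),\qquad {\beta }_{j}\,| \,({\beta }_{j}\ne 0) \sim {\mathcal{N}}\left(0,\,{\sigma }_{0}^{2}+{\sigma }^{2}\cdot \mathop{\sum}\limits_{g\in {{\bf{O}}}_{j}}{w}_{jg}^{2}\right).$$

The marginal mean of βj is therefore zero, and its variance equals its second moment under the two-component mixture:

$${\text{Var}}({\beta }_{j})={\pi }_{j}\cdot \left({\sigma }_{0}^{2}+{\sigma }^{2}\cdot \mathop{\sum}\limits_{g\in {{\bf{O}}}_{j}}{w}_{jg}^{2}\right)+(1-{\pi }_{j})\cdot 0,$$

which is Eq. (7).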

Bayesian hierarchical modeling

Consider a GWAS with n unrelated individuals measured on p SNPs. In practice we do not know the true SNP-level effects \({\boldsymbol{\beta }}:= {({\beta }_{1},\ldots ,{\beta }_{p})}^{\prime}\) in Eq. (2), but we can infer them from GWAS summary statistics and LD estimates. Specifically, we perform Bayesian inference for β by combining the network-based prior defined by Eqs. (2)–(6) with the RSS likelihood25:

$$\widehat{{\boldsymbol{\beta }}} \sim {\mathcal{N}}(\widehat{{\bf{S}}}\widehat{{\bf{R}}}{\widehat{{\bf{S}}}}^{-1}{\boldsymbol{\beta }},\,\widehat{{\bf{S}}}\widehat{{\bf{R}}}\widehat{{\bf{S}}}),$$
(8)

where \(\widehat{{\boldsymbol{\beta }}}:= {({\hat{\beta }}_{1},\ldots ,{\hat{\beta }}_{p})}^{\prime}\), \(\widehat{{\bf{S}}}:= {\rm{diag}}(\widehat{{\bf{s}}})\) is a p × p diagonal matrix with \(\widehat{{\bf{s}}}:= {({\hat{s}}_{1},\ldots ,{\hat{s}}_{p})}^{\prime}\), \(\{{\hat{\beta }}_{j},{\hat{s}}_{j}\}\) are estimated single-SNP effect size of each SNP j and its standard error from the GWAS, and \(\widehat{{\bf{R}}}:= [{\hat{r}}_{ij}]\) is the p × p LD matrix estimated from a reference panel with ancestry matching the GWAS.
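The mean and covariance in Eq. (8) can be written out directly for a small number of SNPs. The sketch below uses plain Python lists for clarity and hypothetical inputs; it is not the implementation used by RSS-NET, which must handle genome-wide data.

```python
def rss_moments(beta, s_hat, ld):
    """Mean S R S^{-1} beta and covariance S R S of Eq. (8) for small dense inputs."""
    p = len(beta)
    mean = [s_hat[i] * sum(ld[i][j] * beta[j] / s_hat[j] for j in range(p))
            for i in range(p)]
    cov = [[s_hat[i] * ld[i][j] * s_hat[j] for j in range(p)] for i in range(p)]
    return mean, cov

# Hypothetical pair of SNPs in LD (r = 0.8); only the first has a true effect.
beta = [0.1, 0.0]
s_hat = [0.02, 0.02]
ld = [[1.0, 0.8], [0.8, 1.0]]
mean, cov = rss_moments(beta, s_hat, ld)
```

In this example the expected single-SNP estimate at the second SNP is about 0.08 even though its true effect is zero, showing how the likelihood accounts for signal shared through LD.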

RSS-NET, defined by Eqs. (2)–(6), and (8), consists of four unknown hyper-parameters \(\{{\theta }_{0},\theta ,{\sigma }_{0}^{2},{\sigma }^{2}\}\). To specify hyper-priors, we first introduce two free parameters {η, ρ} to re-parameterize \(\{{\sigma }_{0}^{2},{\sigma }^{2}\}\):

$${\sigma }_{0}^{2}=\eta \cdot (1-\rho )\cdot {\left(\mathop{\sum }\limits_{j = 1}^{p}\frac{{\pi }_{j}}{n{\hat{s}}_{j}^{2}}\right)}^{-1},\,\,{\sigma }^{2}=\eta \cdot \rho \cdot {\left(\mathop{\sum }\limits_{j = 1}^{p}\frac{{\pi }_{j}\cdot {\sum }_{g\in {{\bf{O}}}_{j}}{w}_{jg}^{2}}{n{\hat{s}}_{j}^{2}}\right)}^{-1},$$
(9)

where, roughly, η represents the proportion of the total phenotypic variation explained by p SNPs, and ρ represents the proportion of total genetic variation explained by network annotations {Oj, wjg}. Because \(n{\hat{s}}_{j}^{2}\) approximates the ratio of phenotype variance to genotype variance, Eq. (9) ensures that SNP effects (β) do not rely on sample size n and have the same measurement unit as the trait. See Supplementary Note 7 for derivation of Eq. (9).
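The re-parameterization in Eq. (9) can be checked numerically: substituting the resulting variances into Eq. (7) and summing the per-SNP variances scaled by the same factor used in Eq. (9) should recover η. The sketch below uses hypothetical inputs for three SNPs, and the function name is an assumption.

```python
def effect_variances(eta, rho, pi, s_hat, sum_w2, n):
    """Eq. (9): map (eta, rho) to (sigma0^2, sigma^2).
    sum_w2[j] is the sum of w_jg^2 over genes in O_j; at least one entry must be positive."""
    base = sum(p / (n * s ** 2) for p, s in zip(pi, s_hat))
    net = sum(p * w2 / (n * s ** 2) for p, s, w2 in zip(pi, s_hat, sum_w2))
    return eta * (1 - rho) / base, eta * rho / net

# Hypothetical inputs (not from the paper).
pi = [0.01, 0.02, 0.01]
s_hat = [0.010, 0.020, 0.015]
sum_w2 = [0.0, 0.5, 1.2]
n = 10000
sigma0_sq, sigma_sq = effect_variances(0.3, 0.4, pi, s_hat, sum_w2, n)

# Sum of Var(beta_j) / (n * s_j^2) using Eq. (7); this should equal eta.
explained = sum(p * (sigma0_sq + sigma_sq * w2) / (n * s ** 2)
                for p, s, w2 in zip(pi, s_hat, sum_w2))
```

The check also shows the role of ρ: a fraction 1 − ρ of the total comes from the background variance term and a fraction ρ from the network-induced term.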

We then place independent uniform grid priors on {θ0, θ, η, ρ} (Supplementary Table 14). These simple hyper-priors produce accurate posterior estimates for hyper-parameters in simulations (Supplementary Fig. 16). RSS-NET results are robust to grid choice on both simulated and real data (Supplementary Figs. 17–18). (If one had specific information about {θ0, θ, η, ρ} in a given setting, then this could be incorporated in the hyper-priors.)

Network enrichment

To assess whether a regulatory network is enriched for genetic associations with a trait, we evaluate a Bayes factor (BF):

$${\text{BF}}=\frac{f(\hat{{\boldsymbol{\beta }}}\,| \,\hat{{\bf{S}}},\widehat{{\bf{R}}},{\bf{a}},{\bf{O}},{\bf{W}},{M}_{1})}{f(\hat{{\boldsymbol{\beta }}}\,| \,\hat{{\bf{S}}},\widehat{{\bf{R}}},{\bf{a}},{\bf{O}},{\bf{W}},{M}_{0})},$$
(10)

where f(·) denotes probability densities, a is defined in Eq. (3), {O, W} are defined in Eq. (4), M1 denotes the enrichment model with θ > 0 or σ2 > 0, and M0 denotes the baseline model with θ = 0 and σ2 = 0. The observed data are BF times more likely under M1 than under M0, and so the larger the BF, the stronger evidence for network enrichment. See Supplementary Note 2 for computation details. To compute BFs used in Fig. 5c, we replace M1 in Eq. (10) with three restricted enrichment models (M11, M12, M13). Unless otherwise specified, all BFs reported in this work are based on M1.

Given a BF cutoff, false positive rates vary considerably across genetic architectures and enrichment patterns in simulations (Supplementary Table 15). As the genetic basis of most complex traits remains unknown, we find it impractical to fix some significance threshold. Instead we recommend an adaptive approach. Specifically, for a given GWAS we run RSS-NET on a near-gene control network containing all genes as nodes and no edges (i.e., aj = 1 for all SNPs within 100 kb of any gene and vgt = 0 for all TF-TG pairs), and we use the resulting BF as the enrichment threshold in this GWAS. Our analyses show three advantages of this approach. First, it is adaptive to study heterogeneity such as trait differences and sample sizes (Supplementary Table 1). Second, it accounts for generic enrichments of genetic signals residing near genes. Third, it facilitates comparisons with non-Bayesian methods based on P-values (Supplementary Table 2).
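The adaptive threshold can be summarized in a few lines. In the sketch below, BFs are stored on the log10 scale, and the network names and values are hypothetical.

```python
def networks_passing_control(log10_bf, control):
    """Names of networks whose BF exceeds the near-gene control BF, strongest first."""
    threshold = log10_bf[control]
    passing = [name for name, value in log10_bf.items()
               if name != control and value > threshold]
    return sorted(passing, key=lambda name: log10_bf[name], reverse=True)

# Hypothetical log10 BFs for one GWAS (not results from the paper).
log10_bf = {"near_gene_control": 12.0, "liver": 25.3, "adipose": 14.1, "putamen": 9.7}
enriched = networks_passing_control(log10_bf, "near_gene_control")
```

Because the threshold is recomputed for each GWAS, a BF that counts as enriched in one study may fall below the control in another with a larger sample size or a different trait.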

Locus association

To identify the association between a locus and a trait, we compute P1, the posterior probability that at least one SNP in the locus is associated with the trait:

$${P}_{1}=1-\Pr ({\beta }_{j}=0,\forall j\in \,\text{locus}\,\,| \,{\bf{D}},\,\text{model}\,),$$
(11)

where D is a shorthand for the input data of RSS-NET including GWAS summary statistics \(\{\widehat{{\boldsymbol{\beta }}},\widehat{{\bf{S}}}\}\), LD estimates \(\widehat{{\bf{R}}}\) and network annotations {a, O, W}. See Supplementary Note 3 for computation details. For a locus, \({P}_{1}^{{\mathsf{base}}}\), \({P}_{1}^{{\mathsf{near}}}\), and \({P}_{1}^{{\mathsf{net}}}\) correspond to P1 evaluated under the baseline model M0, the enrichment model M1 for the near-gene control network, and M1 for a given TF-TG network. In this study, we defined a locus as the transcribed region of a gene plus 100 kb upstream and downstream, and we used “locus” and “gene” interchangeably.
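For intuition about Eq. (11), consider the simplified case in which the posterior treats SNPs in a locus as independent, each with its own posterior probability of association. This independence is an assumption made here for illustration (the actual computation is described in Supplementary Note 3), and the function name is hypothetical.

```python
import math

def locus_p1(inclusion_probs):
    """P1 = 1 - prod_j (1 - alpha_j), assuming independent per-SNP posterior probabilities."""
    if any(a >= 1.0 for a in inclusion_probs):
        return 1.0
    # Work in log space so that long products of values near 1 stay accurate.
    log_none = sum(math.log1p(-a) for a in inclusion_probs)
    return 1.0 - math.exp(log_none)

# Hypothetical per-SNP posterior probabilities within one locus.
p1 = locus_p1([0.5, 0.5])
```

Under this simplification, many SNPs with individually modest probabilities can still yield a high locus-level P1, which is why P1 is reported per locus rather than per SNP.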

For K networks with enrichments stronger than the near-gene control, we use Bayesian model averaging (BMA) to compute \({P}_{1}^{{\mathsf{bma}}}\) for each locus:

$${P}_{1}^{{\mathsf{bma}}}=\frac{\mathop{\sum }\nolimits_{k = 1}^{K}{P}_{1}^{{\mathsf{net}}}(k)\cdot {\text{BF}}(k)}{\mathop{\sum }\nolimits_{k = 1}^{K}{\text{BF}}(k)},$$
(12)

where \({P}_{1}^{{\mathsf{net}}}(k)\) and BF(k) are enrichment P1 and BF for network k. The ability to average across networks in Eq. (12) is an advantage of our Bayesian framework, because it allows us to assess associations in light of network enrichment without having to select a single enriched network.
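Equation (12) is a BF-weighted average, which can be sketched as below. Supplying BFs on the log10 scale and subtracting the maximum before exponentiating is a numerical-stability choice made for this illustration, and the input values are hypothetical.

```python
def p1_bma(p1_net, log10_bf):
    """Eq. (12), with BFs supplied on the log10 scale to avoid overflow."""
    top = max(log10_bf)
    weights = [10.0 ** (b - top) for b in log10_bf]
    return sum(p * w for p, w in zip(p1_net, weights)) / sum(weights)

# Hypothetical locus in K = 3 networks that pass the near-gene control.
averaged = p1_bma([0.95, 0.30, 0.20], [30.0, 20.0, 18.0])
```

When one network has a much larger BF than the others, as in this example, the averaged value is essentially the P1 from that network; when BFs are comparable, each network contributes in proportion to its BF.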

In this study we used P1 ≥ 0.9 as the significance cutoff, yielding a median false positive rate 1.24 × 10−4 and a median false discovery rate 6.43 × 10−2 in simulations (Supplementary Tables 16, 17). We also highlighted genes with \({P}_{1}^{{\mathsf{net}}}\, > \, {P}_{1}^{{\mathsf{near}}}\) (Fig. 6 and Tables 1, 2), because they showcase the influence of context-specific regulatory topology on prioritizing genetic associations.

Computation time

The total computation time of RSS-NET to analyze a pair of trait and network is determined by the number of genome-wide SNPs analyzed, the size of the hyper-parameter grid, and the number of variational iterations until convergence, all of which can vary considerably among studies. It is thus hard to make general statements about computation time. However, to give a specific example, we finished the analysis of 1,032,214 HapMap3 SNPs and the liver network for HDL within 12 hours on a standard computer cluster (60 nodes, 8 CPUs, and 32 GB memory per node).

The number of genome-wide SNPs analyzed (p) affects the computation time of RSS-NET in two distinct ways. First, the per-iteration complexity of RSS-NET is linear in p (Box 1; Supplementary Note 1). Second, a large p defines a large optimization problem, often requiring many iterations to converge. To quantify the impact of p on computation time, we simulated datasets from different sets of genome-wide SNPs, analyzed them with RSS-NET on identical computers, and compared the computation time (Supplementary Fig. 9). When p increased from 348,965 to 1,030,397, on average the total computation time was four times longer (one-sided Wilcoxon P = 8.0 × 10−132).

Simulation overview

To assess the network-induced model for SNP effects (β) in RSS-NET, we simulated a large array of correctly- and mis-specified β for a given target network. Specifically, we generated “positive” datasets where the underlying β was simulated from M1 for the target network, and “negative” datasets where β was simulated from either M0 or the following scenarios: (1) random enrichments of near-gene SNPs; (2) random enrichments of near-RE SNPs; (3) MAF- and LD-dependent effect sizes; (4) M1 for edge-altered copies of the target network. For a fair comparison in each scenario, we matched positive and negative datasets by both the number of trait-associated SNPs and the proportion of phenotypic variation explained by all SNPs. See Supplementary Figs. 1–9 for details.

We combined the simulated β with genotypes of 348,965 genome-wide SNPs from 1,458 individuals28 to simulate phenotypes using an additive multiple-SNP model with Gaussian noise. We performed the standard single-SNP analysis of simulated individual-level datasets to generate GWAS summary statistics, on which we compared RSS-NET with external methods.
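
The phenotype simulation and single-SNP analysis described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names, the toy genotype generator, and the convention of setting the noise variance from a target proportion of variance explained (`pve`) are assumptions made for illustration only.

```python
import math
import random


def simulate_phenotype(genotypes, beta, pve, rng):
    """Additive multiple-SNP model: y = X * beta + Gaussian noise.

    The noise variance is set so that the genetic component explains a
    fraction `pve` of phenotypic variance (an illustrative convention).
    """
    genetic = [sum(x * b for x, b in zip(row, beta)) for row in genotypes]
    mean_g = sum(genetic) / len(genetic)
    var_g = sum((g - mean_g) ** 2 for g in genetic) / (len(genetic) - 1)
    noise_sd = math.sqrt(var_g * (1.0 - pve) / pve)
    return [g + rng.gauss(0.0, noise_sd) for g in genetic]


def single_snp_summary(genotypes, phenotype):
    """Marginal least-squares regression of the phenotype on each SNP.

    Returns one (effect estimate, standard error) pair per SNP.
    """
    n = len(phenotype)
    y_mean = sum(phenotype) / n
    yc = [y - y_mean for y in phenotype]
    stats = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        x_mean = sum(col) / n
        xc = [x - x_mean for x in col]
        sxx = sum(x * x for x in xc)
        if sxx == 0.0:  # monomorphic SNP: no estimate available
            stats.append((float("nan"), float("nan")))
            continue
        beta_hat = sum(x * y for x, y in zip(xc, yc)) / sxx
        rss = sum((y - beta_hat * x) ** 2 for x, y in zip(xc, yc))
        stats.append((beta_hat, math.sqrt(rss / (n - 2) / sxx)))
    return stats


# Toy data (hypothetical): 300 individuals, 20 independent SNPs, 2 causal SNPs.
rng = random.Random(0)
n, p = 300, 20
mafs = [rng.uniform(0.1, 0.5) for _ in range(p)]
X = [[sum(rng.random() < f for _ in range(2)) for f in mafs] for _ in range(n)]
beta = [0.0] * p
beta[3], beta[11] = 0.5, -0.4
y = simulate_phenotype(X, beta, pve=0.3, rng=rng)
sumstats = single_snp_summary(X, y)
```

The toy genotypes here are independent draws, so they carry no LD; the simulations in this study instead used real genotypes, which preserve LD among SNPs.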

External software for benchmarking

To benchmark RSS-NET, this study used the following software: RSS-E (https://github.com/stephenslab/rss, accessed October 19, 2018), Pascal (https://www2.unil.ch/cbg/index.php?title=Pascal, accessed October 5, 2017) and LDSC with two sets of baseline annotations as covariates (version 1.0.0, https://github.com/bulik/ldsc; baseline model v1.1, https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_baseline_v1.1_ldscores.tgz; baselineLD model v2.1, https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_baselineLD_v2.1_ldscores.tgz; accessed November 27, 2018). Versions of all packages and files were up to date at the time of analysis.

Given a context-specific TF-TG network, RSS-E and LDSC methods use the same binary SNP-level annotations {aj} defined in Eq. (3). The interface design of Pascal does not allow direct usage of {aj}. Here we supplied the Pascal program with a GMT file containing all member genes of a network and set the SNP-to-gene window sizes to 100 kb (“--up=100000 --down=100000”). In this study all external methods were used with their default setups, which did not include the edge information of a network.
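
As an illustration of the gene set input described above, the sketch below collapses a TF-TG edge list into a single line of the tab-delimited GMT format (set name, description, then member genes). The edge list, gene identifiers, set name and helper functions are hypothetical placeholders, and the identifier type that Pascal expects should be checked against its documentation.

```python
def network_member_genes(edges):
    """Collect every TF and TG that appears in a list of (tf, tg) edges."""
    genes = set()
    for tf, tg in edges:
        genes.add(tf)
        genes.add(tg)
    return sorted(genes)


def format_gmt_line(set_name, genes, description="NA"):
    """One GMT record: name, description and member genes, tab-separated."""
    return "\t".join([set_name, description, *genes])


def write_gmt(path, gene_sets):
    """Write a dict {set_name: [genes]} as a GMT file, one set per line."""
    with open(path, "w") as handle:
        for set_name, genes in gene_sets.items():
            handle.write(format_gmt_line(set_name, genes) + "\n")


# Placeholder edges; a real network would list actual TF and TG identifiers.
edges = [("TF1", "GENE_A"), ("TF1", "GENE_B"), ("TF2", "GENE_A")]
genes = network_member_genes(edges)
line = format_gmt_line("liver_network", genes)
```

Flattening the edge list into a gene set discards which TF regulates which TG, which is why this input carries no edge information.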

RSS-E outputs the same statistics as RSS-NET (BF and P1). Pascal implements two gene scoring methods (maximum-of-χ² and sum-of-χ²) to produce gene-based association P-values. Given gene scores, Pascal provides two gene set scoring options (χ² approximation and empirical sampling) to produce enrichment P-values. LDSC methods output enrichment P-values and coefficient Z-scores, which yielded consistent results in our simulations (LDSC-baseline: R = 0.98, two-sided P = 1.2 × 10⁻⁶⁷; LDSC-baselineLD: R = 0.98, P = 9.1 × 10⁻⁶³; Supplementary Fig. 19). Because enrichment P-values showed higher power in simulations (LDSC-baseline: average AUROC increase = 0.012, one-sided P = 4.0 × 10⁻³; LDSC-baselineLD: average AUROC increase = 0.023, one-sided P = 1.5 × 10⁻⁵), we used enrichment P-values from LDSC in this study.
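
For readers unfamiliar with the AUROC used in these power comparisons, the following sketch computes it from enrichment P-values on positive and negative datasets. The P-values and function name are made up for illustration, and the pairwise formulation shown here is one standard way to compute AUROC, not necessarily the routine used in this study.

```python
import math


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive dataset
    scores higher; ties count as one half."""
    wins = 0.0
    for pos in pos_scores:
        for neg in neg_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical enrichment P-values; smaller P means stronger evidence,
# so -log10(P) is used as the score.
pos_p = [1e-6, 1e-3, 0.2]   # datasets simulated with enrichment
neg_p = [0.5, 0.04, 0.9]    # datasets simulated without enrichment
pos_scores = [-math.log10(v) for v in pos_p]
neg_scores = [-math.log10(v) for v in neg_p]
area = auroc(pos_scores, neg_scores)  # 8 of 9 pairs are ranked correctly
```

An AUROC of 0.5 corresponds to random ranking of positive and negative datasets, and 1.0 to perfect separation.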

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.