A method to estimate the contribution of regional genetic associations to complex traits from summary association statistics

Despite considerable efforts, known genetic associations only explain a small fraction of predicted heritability. Regional associations combine information from multiple contiguous genetic variants and can improve variance explained at established association loci. However, regional associations are not easily amenable to estimation using summary association statistics because of sensitivity to linkage disequilibrium (LD). We now propose a novel method, LD Adjusted Regional Genetic Variance (LARGV), to estimate phenotypic variance explained by regional associations using summary statistics while accounting for LD. Our method is asymptotically equivalent to a multiple linear regression model when no interaction or haplotype effects are present. It has several applications, such as ranking of genetic regions according to variance explained or comparison of variance explained by two or more regions. Using height and BMI data from the Health Retirement Study (N = 7,776), we show that most genetic variance lies in a small proportion of the genome and that previously identified linkage peaks have higher than expected regional variance.

component models using genetic relatedness as the variance-covariance matrix of the random effect. An implementation of this approach uses REML 1 to estimate the genetic effect, and modifications have been reported to either take into account LD between SNPs 9 or to handle large datasets 10 . While very useful and informative, all of these approaches require individual-level data. This latter pitfall has been overcome by the development of LDScore 11 , which uses summary-level association statistics as inputs and has been shown to be equivalent to Elston regression 12 . However, LDScore is ill suited for estimation of regional heritability because it is based on regressing genetic effects on the sum of linkage disequilibrium and requires large number of SNPs for precise estimations. A multi-SNP locus-association method has been described that uses summary association statistics, but it necessitates to first prune SNPs for p-value (p < 0.01) and LD (r 2 < 0.1) 13 . Finally, alternative methods 14,15 estimate genetic variance from the distribution of effect sizes and are not appropriate for regional genetic variance because the small number of SNPs will lead to an imprecise estimation and linkage equilibrium is assumed. There is thus a need for a method to estimate the regional contribution of common genetic variants using summary association statistics while simultaneously taking LD into account.

Results
Comparison of genetic variance estimated using summary statistics and variance component models. Our method estimates the regional contribution to complex trait variance using summary data by adjusting the variance explained by each SNP for its LD with neighboring SNPs. Simulations using 1000 Genomes Project 16 (1000G) data showed that genetic regional variance is accurately estimated by our method when no haplotype or interaction effects are present ( Figure S1), as predicted by theoretical derivations (see Methods). We sought to compare estimation of overall genetic variance by our novel method to variance component models. We divided the genome into SNP blocks of median size 250 Kb (85-95 contiguous SNPs) minimizing inter-block LD and applied our method to BMI and height in the Health Retirement Study 17 (HRS; N = 7,776) for which individual-level genotypes were available. Using only summary association statistics for our method and corresponding individual-level data for variance component models, both methods provided consistent estimates of genome-wide genetic variance (Fig. 1). We explored the impact of adjustment for genetic principal components. The first 20 components provided adequate protection against population stratification while the inclusion of fewer components led to the inflation in genetic variance, especially for height. Using 20 components, genetic variance was estimated at 0.12 (95% CI 0.01-0.24) for BMI with our novel method and 0.14 (95% CI 0.05-0.23) with variance component models 18 . The corresponding estimates for height were 0.28 (95% CI 0.17-0.39) and 0.30 (95% CI 0.20-0.39).
Using summary association statistics to identify regional associations. We next sought to compare the ability of our method to identify regional associations with other approaches also using summary association statistics. We ranked SNP blocks (median size of 250 Kb) according to decreasing regional genetic variance by applying three different methods on Genetic Investigation of Anthropometric Traits consortium (GIANT) summary association statistics 19,20 : (1) our proposed approach (LARGV), (2) LDScore and (3) the multi-SNP locus-association proposed by Ehret and colleagues 13 . We then used SNP block ranks derived from GIANT by each of the three methods to estimate genetic variance in HRS (which is not part of GIANT) with variance component models, successively adding a higher proportion of SNP blocks. As illustrated in Fig. 2, the genetic variance estimated by LARGV increased more rapidly with the proportion of top SNP blocks included as compared

Method
Advantages Disadvantages Multi-SNP locus association 13 -Can estimate genetic variance of medium to large regions (i.e. small regions might not have SNPs remaining after LD and p-value pruning) -Need for LD pruning as it assumes uncorrelated markers -Need for p-value pruning -Assumes population structure has been entirely adjusted for LDScore 11 -Includes all SNPs irrespective of LD -Adjusts for potential bias such as population stratification -Can estimate genetic covariance between two traits -Need for a reference LD structure -Best suited for genome-wide analysis or large regions Distribution of effect size 3,14,15 (AVENGEME and ABPA) -Estimates the proportion of markers affecting a trait -Can estimate genetic covariance between two traits -Need for LD pruning as it assumes uncorrelated markers -Assumes population structure has been entirely adjusted for -Best suited for genome-wide analysis or large regions LD Adjusted Regional Genetic Variance (LARGV) -Includes all SNPs irrespective of LD -Can estimate genetic variance of any region, small or large -Need for a reference LD structure -Assumes population structure has been entirely adjusted for Table 1. Advantages and disadvantages of existing methods to estimate regional genetic variance from summary association statistics.
to other two methods. The multi-SNP locus-association method performed well, but a large proportion of SNP blocks did not have any SNP left after LD and p-value pruning (68.4% for BMI and 26.7% for height), highlighting the disadvantage of pruning instead of adjusting for LD. We obtained consistent results when using LD data from 1000G data instead of HRS ( Figure S2), changing SNP block size ( Figure S3), or changing the number of neighboring SNPs included in LD calculations ( Figure S4). A relatively small proportion of blocks contributed disproportionately to genetic variance. The top 25% SNP blocks explained 0.94 of BMI genetic variance (i.e. = 0.132/0.140) and 0.83 of height genetic variance when using LARGV on GIANT data to rank the genetic variance of each SNP block. These results could potentially be explained by the presence of one or more very strong associations in each of these top SNP blocks. To explore this possibility, we recorded the minimum univariate association SNP p-value for each block in both GIANT and HRS (Fig. 3). Median minimum univariate p-value was 2.0 × 10 −2 for BMI in GIANT, with 1% of blocks having one or more genome-wide significant associations (p < 5 × 10 −8 ). On the other hand, the median minimum p-value was 1.9 × 10 −3 for height in GIANT, with 9% of blocks having one or more genome-wide significant associations. The median minimum univariate p-values were 1.8 × 10 −2 and 1.7 × 10 −2 for BMI and height in HRS. No SNP reached genome-wide significance in HRS.

Analysis of known linkage peaks.
A unique application of our method is the estimation of genetic variance over extended genomic regions using summary association statistics. We therefore tested the hypothesis that previously identified linkage peaks are enriched for regional associations. Based on results from the largest linkage study of height and BMI, three peaks with suggestive (LOD>2.0) evidence of linkage in Europeans were identified, all for height 21 . The only peak with LOD > 3.0 showed a significant (p = 0.002) enrichment in regional association within a distance of + /− 7.5 Mb from the linkage marker, corresponding to an estimated excess regional genetic variance of 0.0044 over the genome-wide average ( Table 2). Upon a closer inspection, the region encompasses several sub-regions with genome-wide significant associations.

Discussion
We propose a novel method to estimate regional genetic variance from summary association statistics. Using this method, we confirmed a major role of regional associations in complex trait heritability, whereby the aggregation of genetic associations contributes disproportionately to phenotypic variance. Selecting the top SNP blocks from the GIANT meta-analysis, we showed that 25% of the genome is responsible for up to 0.94 and 0.83 of BMI and height genetic variance. A large proportion of these blocks had unremarkable minimum univariate p-values, suggesting the presence of multiple weak associations underlies their impact on phenotypic variance, especially for BMI. The concentration of genetic associations within these regions supports the existence of critical nodes in the genetic regulation of complex traits such as height and BMI, with implications not only for association testing but also for population genetics and natural selection. These results also suggest that a combination of strong genetic associations and regional associations contribute to complex traits variance, with the relative proportions varying across traits. For instance, a higher proportion of genetic variance was found in the top 25% blocks for BMI yet these blocks had less significant minimum univariate p-value than height.
Our method can also be used to estimate the genetic variance explained by extended regions. We therefore tested the hypothesis that some of the previously identified linkage peaks are the result of regional associations. The only known linkage peak with LOD > 3.0 for height showed a marked and significant enrichment in regional association. This region had been previously identified in multiple linkage studies 21,22 . Genetic variance explained by the region was estimated at 0.0044, which is unlikely to explain the linkage peak by itself. Nonetheless, the juxtaposition of linkage and regional associations points towards concentration of functional variants as a potential explanation for the observed linkage. The lack of regional association at other peaks can be explained by false-positive linkage results, rare variant associations undetected by genome-wide association studies or by differences in genetic architecture between studied populations.
Our proposed method has several advantages and is complementary to other methods. First, it provides results that are highly consistent with the widely used variance component models requiring individual-level data. Second, it is computationally straightforward and can be applied to large datasets. Third, it is agnostic and therefore complementary to functional annotations of the genome. Fourth, since SNPs are not pruned by either LD or significance, every SNP block or region can be evaluated.
A few limitations are worth mentioning. First, the assumption that SNPs contribute to genetic variance without any interaction or haplotype effects can lead to an underestimation of genetic variance. While our estimates were consistent with variance component models, there is a need for statistical models that better capture genetic variance when strong haplotype effects are expected 23 . Second, estimation of overall genome-wide genetic variance using summary association statistics is dependent on both the accuracy of SNP effect size estimates and the correct specification of LD structure. For instance, differences in adjustment for population stratification (e.g. principal components) in individual studies that participate in a meta-analysis could potentially influence the results. Ideally, LD data would come from the same population used to derived the summary association statistics, which could be included as summary statistics for each SNP (i.e. η d , see Methods). While this information is currently not available, consistent results were observed when using LD data from either HRS or 1000G, demonstrating the robustness of our approach to LD misspecification. Third, we have only tested continuous traits. Nevertheless, our method can be easily adapted to other outcome types through the use of generalized linear models.
In this report, we establish a novel method to estimate the regional contribution of common variants to complex traits variance using summary association statistics. Our method has several applications, such as ranking of genetic regions for genetic variance and identification of key regions contributing to genetic variance. Our Scientific RepoRts | 6:27644 | DOI: 10.1038/srep27644 method can also be used to perform network analysis using summary association statistics, or to combine summary association statistics with other types of genetic annotations as we have shown with linkage results.

Methods overview.
We have previously shown that large-region joint association, where multiple genetic variants are included as independent variables in a linear model, is a simple and powerful way to test for regional associations when individual-level data are available 4 . However, such regional joint associations are not easily amenable to estimation using summary data because of sensitivity to linkage disequilibrium. To evaluate the contribution of large regions joint associations to the genetic variance of complex traits, we first devised an algorithm to divide the genome in blocks of SNPs in such a way to minimize inter-block linkage disequilibrium and thus "spillage" associations. We then derived a method to estimate regional associations using summary data and showed this method to be equivalent to a multiple linear regression model when genetic effects are strictly additive (i.e. no haplotype or interaction effect). Using regional variance estimates from summary association statistics for height and BMI from the Genetic Investigation of Anthropometric Traits (GIANT) consortium, we estimated the distribution of genetic effects across the genome and validated results in the Health Retirement Blue lines represent estimates of genetic variance using summary association statistics with LARGV; with 95% confidence intervals illustrated as blue shaded area. Corresponding estimates for variance component models are in red.

Figure 2. Genetic variance as a function of proportion of top SNP blocks.
Genetic variance in HRS based on SNP block ranking derived from GIANT summary association statistics. Three methods were tested to rank SNP blocks: LARGV (red), LD Score (blue) and multi-SNP locus-association (green). The median SNP block size was 250 Kb (i.e. 85-95 SNPs). Genetic variance was calculated in HRS using variance component models for BMI (a) and height (b).
Scientific RepoRts | 6:27644 | DOI: 10.1038/srep27644 Study (HRS), which is not part of GIANT. A user-friendly software is available to produce SNP blocks and LD adjustment upon request at pareg@mcmaster.ca.
Dividing the genome into SNP blocks. We first divided the genome into regions of contiguous SNPs varying in size (e.g. from 195 SNPs to 205 SNPs), herein referred to as SNP blocks and used as units for regional associations. To minimize inter-block LD and thus "spillage" associations, we devised a greedy algorithm optimizing the choice of block boundary sequentially from one end of a chromosome to the other. Briefly, using a user-defined minimum and maximum block size (in number of SNPs) and starting at one end of a chromosome arm, each possible "cut-point" between the first and second block are tested and maximal LD (r 2 ) between pairs of SNPs crossing block boundary is calculated. The cut-point that minimizes the maximal LD is chosen, thus defining the first block, and the procedure is repeated for each subsequent block until all SNPs on a chromosome arm have been assigned to a block. We empirically determined that SNP blocks of size 85 SNPs to 95 SNPs (median 90 SNPs) had a median physical size of 250 Kb.  Estimating the contribution of regional genetic associations with individual-level genotypes. The use of adjusted R 2 lends itself nicely to estimation of regional variance explained when individual-level genotypes are available 4 . In this context, SNPs comprised in a given SNP block are included as independent variables in a multiple linear regression model and the goodness of fit statistic, adjusted R 2 calculated. Because the adjusted R 2 accounts for the number of SNPs included in each block, the expected adjusted R 2 is zero under the null hypothesis of no association and the expected sum of adjusted R 2 over all SNP blocks is also zero. The overall contribution of regional associations to complex traits variance can be estimated by simply summing the adjusted R 2 over all (or selected) SNP blocks. Furthermore, the distribution of adjusted R 2 under the null hypothesis of no association has been previously described 24 and can be used to derive the distribution of the sum of adjusted R 2 .
Estimating the contribution of regional genetic associations with summary association statistics. Estimating the contribution of regional genetic associations from summary association statistics is challenging when the exact SNP linkage disequilibrium structure of source populations is unknown. While approaches have been described to perform joint or conditional associations 23,25 using estimated SNP covariance matrices, they do not perform well when estimating regional variance explained because of sensitivity to misspecification of linkage disequilibrium (data not shown) and ensuing an overestimation of regional associations. We therefore created a simple procedure to estimate regional variance explained from summary association statistics data, adjusting for linkage disequilibrium. Without loss of generalizability, we assume a quantitative trait (Y) standard normally distributed and genotypes normalized to have mean = 0 and standard deviation = 1 throughout. Given an n × m genotype matrix X representing genotypes at m SNPs in n individuals and the pairwise linkage disequilibrium (r 2 ) between two SNPs k and l as r 2 k,l , for a SNP d, the following LD adjustment (η d ) can be defined as the summation of LD between the d th SNP and 100 SNPs upstream and downstream: with a distance of 100 SNPs assumed sufficient to ensure linkage equilibrium (other values might be used). Only including SNPs with summary GWAS statistics in the sum, variance explained by each SNP d is given by: where b d denotes the univariate regression coefficient commonly reported in GWAS results (assuming genotypes have been standardized to have mean zero and SD = 1). Regional variance explained is then given by the sum of R d 2 over SNPs in a given region. Assuming a strictly additive genetic model where each SNP contributes additively to a trait without any interaction or haplotype effects, we demonstrate the expected total variance explained over a region ∑ E R ( ) is an m by m symmetric matrix, where D k is a m × 1 vector whose entries represent the k th column of D. We will make use of the following properties: where N is the total number of individuals in the sample. Estimation of regional genetic variance with multiple linear regression models. Suppose the genotype matrix is fixed while the true genetic effect is a random vector β, whose individual components, i.e. the SNPs, i = 1, 2, … , m, have mean zero and variance σ 2 . The size of the variance σ 2 is on the scale of M −1 , where M is the number of genome-wide SNPs. The genetic model can be expressed as: where ε is a vector of standard normal error with identity variance covariance matrix. Then, the vector of estimated multiple linear regression coefficients B is given by: The multivariate variance explained R 2 can be written in terms of the true effect and the error term: Scientific RepoRts | 6:27644 | DOI: 10.1038/srep27644 and since the error term has identity variance covariance and the true effect β has variance covariance matrix σ 2 I, the expected variance explained is simplified to 2 and the variance can be calculated accordingly: Estimation of regional genetic variance using summary association statistics. The univariate regression coefficients, denoted by lower case b and directly obtained from GWAS summary statistics, are given by The total variance explained over a region ∑ R d d 2 can be calculated using only the univariate regression coefficients from GWAS: We can rewrite the expression by defining:  as LD tends to be weak 100 SNPs upstream and downstream away from the index SNP, we can simplify the expected total variance explained using the summary statistics to (Equation 11): Health Retirement Study. We conducted large region joint association analysis for height using genome-wide data from the publicly available Health Retirement Study (HRS; dbGaP Study Accession: phs000428.v1.p1). HRS quality control criteria were used for filtering of both genotype and phenotype data,