Efficient variance components analysis across millions of genomes

Pazokitoroudi, Ali; Wu, Yue; Burch, Kathryn S.; Hou, Kangcheng; Zhou, Aaron; Pasaniuc, Bogdan; Sankararaman, Sriram

doi:10.1038/s41467-020-17576-9

Download PDF

Article
Open access
Published: 11 August 2020

Efficient variance components analysis across millions of genomes

Ali Pazokitoroudi¹,
Yue Wu¹,
Kathryn S. Burch²,
Kangcheng Hou²,
Aaron Zhou¹,
Bogdan Pasaniuc^3,4,5 &
…
Sriram Sankararaman ORCID: orcid.org/0000-0003-1586-9641^1,4,5

Nature Communications volume 11, Article number: 4020 (2020) Cite this article

Subjects

Abstract

While variance components analysis has emerged as a powerful tool in complex trait genetics, existing methods for fitting variance components do not scale well to large-scale datasets of genetic variation. Here, we present a method for variance components analysis that is accurate and efficient: capable of estimating one hundred variance components on a million individuals genotyped at a million SNPs in a few hours. We illustrate the utility of our method in estimating and partitioning variation in a trait explained by genotyped SNPs (SNP-heritability). Analyzing 22 traits with genotypes from 300,000 individuals across about 8 million common and low frequency SNPs, we observe that per-allele squared effect size increases with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD) consistent with the action of negative selection. Partitioning heritability across 28 functional annotations, we observe enrichment of heritability in FANTOM5 enhancers in asthma, eczema, thyroid and autoimmune disorders.

Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits

Article Open access 30 November 2021

Evaluating and improving heritability models using summary statistics

Article 23 March 2020

Widespread signatures of natural selection across human complex traits and functional genomic categories

Article Open access 19 February 2021

Introduction

Variance components analysis¹ has emerged as a versatile tool in human complex trait genetics, enabling studies of the genetic contribution to variation in a trait² as well as its distribution across genomic loci^3,4, allele frequencies³, and functional annotations^3,5,6. There is increasing interest in applying methods for variance components analysis to large-scale genetic datasets with the goal of uncovering novel insights into the genetic architecture of complex traits^4,7. A prominent example of the utility of these methods is in the estimation of SNP heritability (${h}_{{\mathrm{{SNP}}}}^{2}$)², the variance in a trait explained by a given set of genotyped SNPs. Variance components methods for estimating SNP heritability typically assume a genetic variance component that represents the fraction of phenotypic variation explained by the SNPs included in the study and a residual variance component. Recent studies have shown that these “single-component” methods yield biased estimates of SNP heritability due to the linkage disequilibrium (LD) and minor allele frequency (MAF)-dependent architecture of complex traits^8,9. On the other hand, flexible models with multiple variance components^3,4 that allows for SNP effects to vary with MAF and LD, have been shown to yield more accurate SNP heritability estimates^8,9. Recent work has shown that SNP heritability can be estimated with minimal assumptions about the genetic architecture¹⁰; however, this method cannot partition heritability across categories of SNPs of interest such as functional or population genomic annotations. Partitioning heritability requires fitting multiple variance components, thus creating the need for accurate and scalable methods that can fit tens or even hundreds of variance components to large-scale genomic data to obtain accurate and novel insights into genetic architecture.

While the ability to fit flexible variance component models to large-scale datasets is essential to obtain accurate and novel insights into genetic architecture, fitting such models requires scalable algorithms. Approaches for estimating variance components typically search for parameter values that maximize the likelihood or the restricted maximum likelihood (REML)¹¹. Despite a number of algorithmic improvements^{2,4,12,13,14,15,16}, computing REML estimates of the variance components on data sets such as the UK Biobank¹⁷ (≈500,000 individuals genotyped at nearly one million SNPs) remains challenging. The reason is that methods for computing these estimators typically perform repeated computations on the input genotypes.

We propose a method that can jointly estimate multiple variance components efficiently. Our proposed method, RHE-mc, is a randomized multi-component version of the classical Haseman–Elston regression for heritability estimation^18,19. RHE-mc builds on our previously proposed method, RHE-reg²⁰, which uses a randomized algorithm to estimate a single variance component. RHE-mc can simultaneously estimate multiple variance components, as well as variance components associated with continuous and overlapping annotations. Further, unlike REML estimation algorithms, RHE-mc requires only a single pass over the input genotypes that results in a highly memory efficient implementation. The resulting computational efficiency permits RHE-mc to jointly fit 300 variance components in less than an hour on a dataset of about 300,000 individuals and 500,000 SNPs, about two orders of magnitude faster than state-of-the-art methods. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in about 12 h.

To demonstrate its utility, we first show that RHE-mc can accurately estimate genome-wide and partitioned SNP heritability under realistic genetic architectures (the functional dependence of SNP effect sizes on MAF and LD). We applied RHE-mc to 22 traits measured across 291,273 individuals genotyped at 459,792 common SNPs (MAF > 1%) in the UK Biobank to obtain estimates of genome-wide SNP heritability. We then used RHE-mc to partition heritability for the 22 traits across seven million imputed SNPs (MAF > 0.1%) into 144 bins defined based on MAF and LD. We observe that the per-allele squared effect size tends to increase with lower MAF and LD across the traits considered. Finally, we partitioned heritability for SNPs with MAF > 0.1% across 28 functional annotations. We recover previously reported enrichment of heritability in annotations corresponding to conserved regions⁷ and also document enrichment of heritability in FANTOM5 enhancers in eczema, asthma, autoimmune disorders, and thyroid disorders.

Results

Methods overview

RHE-mc aims to fit a variance component model that relates phenotypes y measured across N individuals to their genotypes over M SNPs X:

$$ {\boldsymbol{y}}| {\boldsymbol{\epsilon }},{{\boldsymbol{\beta }}}_{1},\ldots ,{{\boldsymbol{\beta }}}_{K} =\mathop{\sum }\limits_{k=1}^{K}{{\boldsymbol{X}}}_{k}{{\boldsymbol{\beta }}}_{k}+{\boldsymbol{\epsilon }}\\ \hskip 62pt {\boldsymbol{\epsilon }} \sim {\mathcal{D}}({\bf{0}},{\sigma }_{e}^{2}{{\bf{I}}}_{N})\\ \hskip 20pt {{\boldsymbol{\beta }}}_{k} \sim {\mathcal{D}}\left({\bf{0}},\frac{{\sigma }_{k}^{2}}{{M}_{k}}{{\bf{I}}}_{{M}_{k}}\right),k\in \{1,\ldots ,K\}$$

where ${\mathcal{D}}(\boldsymbol{\mu} ,\boldsymbol{\Sigma})$ is an arbitrary distribution with mean μ and covariance Σ. Each of the M SNPs is assigned to one of K non-overlapping categories so that X_k is the N × M_k matrix consisting of standardized genotypes of SNPs belonging to category k (note that the expected heritability is constant within categories when we use standardized genotypes). β_k denotes the effect sizes of SNPs assigned to category k which are drawn from a zero-mean distribution with covariance parameter $\frac{{\sigma }_{k}^{2}}{M_k}{{\bf{I}}}_{{M}_{k}}$ (the variance component of category k) while ${\sigma }_{e}^{2}$ is the residual variance.

In this model, the genome-wide SNP heritability is defined as: ${h}_{{\mathrm{{SNP}}}}^{2}=\frac{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2} \, + \, {\sigma }_{e}^{2}}$ while the SNP heritability of category k is defined as: ${h}_{k}^{2}=\frac{{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2} \, + \, {\sigma }_{e}^{2}}$. By choosing categories to represent genomic annotations of interest, e.g., chromosomes, allele frequencies, or functional annotations, these models can be used to estimate the phenotypic variation that can be attributed to the relevant annotation.

The key inference problem in this model is the estimation of the variance components: $({\sigma }_{1}^{2},\ldots ,{\sigma }_{K}^{2},{\sigma }_{e}^{2})$. These parameters are typically estimated by maximizing the likelihood or the restricted likelihood. Instead, RHE-mc uses a scalable method-of-moments estimator, i.e., finding values of the variance components such that the population moments match the sample moments^{18,19,21,22,23}. RHE-mc uses a randomized algorithm that avoids explicitly computing N × N genetic relatedness matrices that are required by method-of-moments estimators. Instead, it operates on a smaller matrix formed by multiplying the input genotype matrix with a small number of random vectors (see “Methods” section). The application of a randomized algorithm for SNP heritability estimation using a single variance component was proposed in our previous work, RHE-reg²⁰. RHE-mc extends our previous work in several directions. RHE-mc can efficiently fit multiple variance components (both non-overlapping and overlapping) and can also handle continuous annotations. The resulting algorithm has scalable runtime as it only requires operating on the genotype matrix one time. Further, RHE-mc uses a streaming implementation that does not require all the genotypes to be stored in memory leading to scalable memory requirements (Supplementary Notes). Finally, RHE-mc uses an efficient implementation of a block Jackknife to estimate standard errors with little computational overhead (Supplementary Notes).

Accuracy of genome-wide SNP heritability estimates in simulations

We assessed the accuracy of RHE-mc in estimating genome-wide SNP heritability as previous attempts at estimating SNP heritability have been shown to be sensitive to assumptions about how SNP effect size varies with MAF and LD⁸. Starting with genotypes of M = 593,300 array SNPs over N = 337,205 unrelated white British individuals in the UK Biobank, we simulated phenotypes according to 64 MAF and LD-dependent architectures by varying the SNP heritability, the proportion of variants that have non-zero effects (causal variants or CVs), the distribution of CVs across minor allele frequencies (CVs distributed across all minor allele frequency bins or CVs restricted to either common or low-frequency bins), and the form of coupling between the SNP effect size and MAF as well as LD. For RHE-mc, we partitioned the SNPs into 24 variance components based on six MAF bins as well as four LD bins (see “Methods” section). The key parameter in applying RHE-mc is the number of random vectors B which we set to 10. RHE-mc estimates were relatively insensitive when we increased the number of random vectors B to 100 (Supplementary Figs. 1 and 2, Supplementary Table 1). Across these 64 architectures, RHE-mc is relatively unbiased (a two-sided t-test of the hypothesis of no bias is not rejected across any of the architectures at a p-value < 0.05) with the largest relative bias observed to be 0.5% of the true SNP heritability (Supplementary Fig. 3). We used a block Jackknife (number of blocks = 100) to estimate the standard errors of RHE-mc and confirmed that the estimated standard errors are close to the true SE (Supplementary Table 2).

We compared the accuracy of RHE-mc to state-of-the-art methods for heritability estimation that can be applied to large datasets (across architectures where the true SNP heritability was fixed at 0.5). These methods, LDSC²⁴, SumHer²⁵, S-LDSC²⁶, and GRE¹⁰, all leverage summary statistics while RHE-mc requires individual genotype data. We found that estimates from the summary-statistic methods tend to be sensitive to the underlying genetic architecture: across 16 architecture relative biases range from −31% to 27% for LDSC, −27% to 5% for S-LDSC, and −5% to 9% for SumHer (Fig. 1). We also compared to a recently proposed method (GRE¹⁰) that only estimates genome-wide SNP heritability (without partitioning by MAF/LD) and observed that relative biases ranged from 1% to 1.4% for GRE and from −1.5% to 0.5% for RHE-mc. We also considered architectures in which only rare variants are causal and found RHE-mc is accurate relative to other methods (Supplementary Fig. 4). These results further emphasize that RHE-mc can accurately estimate SNP-heritability through fitting multiple variance components.

Fig. 1: Comparison of estimates of genome-wide SNP heritability from RHE-mc with LDSC, GRE, S-LDSC, and SumHer in large-scale simulations (N = 337,205 unrelated individuals, M = 593,300 array SNPs).

We compared RHE-mc to the state-of-the-art REML-based variance component estimation method, GCTA-mc (multi-component GREML^8,27,28) and to exact multi-component Haseman–Elston Regression (HE-mc) as implemented in GCTA²⁷. We ran each of these methods by partitioning SNPs into 24 variance components (6 MAF bins by 4 LD bins, see “Methods” section). To make these experiments computationally feasible, we simulated phenotypes starting from a smaller set of genotypes (M = 593,300 array SNPs and N = 10,000 white British individuals). Across 16 architectures where the true SNP heritability was fixed at 0.25, the relative biases for RHE-mc range from −3.2% to 3.6%, and from −3.2% to 5% for GCTA-mc (Fig. 2). On average, RHE-mc has standard errors that are 1.1 times larger than GCTA-mc (which range from 0.97 to 1.24) and 1.08 times larger than HE-mc (which range from 1.00 to 1.21).

Fig. 2: Comparison of SNP heritability estimates from RHE-mc with GCTA-mc (GCTA with multiple variance components) and HE-mc (HE with multiple variance components) (N = 10,000 unrelated individuals, M = 593,300 array SNPs).

Accuracy of heritability partitioning in simulations

We also evaluated the accuracy of RHE-mc in partitioning SNP heritability in both small-scale (M = 593,300 SNPs, N = 10,000 individuals) (Supplementary Fig. 5) and large-scale settings (M = 593,300 SNPs, N = 337,205 individuals) (see Supplementary Fig. 6). For these experiments, we restrict our attention to architectures for which the CVs are chosen to lie within a narrow range of MAF. Since the variance components correspond to bins of MAF and LD, a subset of the variance components would have no causal SNPs and hence have a heritability of zero. We assess the accuracy of estimates of heritability aggregated over these components (termed the non-causal bin) as well as the heritability aggregated over the remaining genetic components (termed the causal bin). For example, variance components that correspond to MAF ∈ [0.01, 0.05] would be included in the causal bin for an architecture that restricts the MAF of CVs to lie in the range [0.01, 0.05]. For the small-scale simulations, we compared RHE-mc to GCTA-mc. We ran both methods by partitioning the SNPs into 24 variance components based on six MAF bins as well as four LD bins defined by quartiles of the measure of LDAK weight at a SNP (see “Methods” section). Across the genetic architectures tested, estimates of heritability within each of the causal and non-causal bins are highly concordant between RHE-mc and GCTA-mc (Supplementary Fig. 5, Supplementary Table 3): for the causal bin, the relative bias ranges from −4% to 0.4% for RHE-mc and −3.6% to 2% for GCTA-mc while, for the non-causal bin, the bias ranges from 0 to 0.7% for RHE-mc and 0 to 1.4% for GCTA-mc (Supplementary Table 3). For the large-scale settings, RHE-mc remains accurate: the relative bias ranges from −2.6% to 3.2% (causal bin) and −0.5% to 0.2% (non-causal bin) over the genetic architectures considered (Supplementary Fig. 6, Supplementary Table 4).

Heritability partitioning has been used to estimate heritability attributed to functional genomic annotations⁷. However, some of these annotations (such as FANTOM5 enhancers) are quite small covering <1% of the genome. We explored the ability of RHE-mc to accurately estimate heritability as a function of the size of the annotation. To this end, we performed simulations using N = 291,273 unrelated white British individuals and M = 459,792 common SNPs. We defined eight annotations (four MAF bins and two LD bins) in which we fixed the enrichment of a selected bin and varied the proportion of SNPs in the selected category. RHE-mc obtained accurate estimates of enrichment even when the selected bin only contained 0.4% of the genome-wide SNPs (comparable to the size of FANTOM5 enhancers). RHE-mc estimates are well-calibrated: when the bin has zero enrichment, RHE-mc rejected the null hypothesis of no enrichment in 5% of the simulations, while attaining high power to reject the null hypothesis even when the bin contained <1% of the SNPs (Supplementary Notes).

Computational efficiency

We benchmarked the runtime and memory usage of RHE-mc as a function of number of individuals, SNPs and variance components (Fig. 3, Table 1). We ran RHE-mc with B = 10 random vectors and 22 variance components where each chromosome forms a distinct component. On a dataset of ≈300,000 individuals and ≈500,000 SNPs, RHE-mc can fit 22 variance components in less than an hour and ≈300 variance components (corresponding to bins of size 10 Mb) with little increase in its runtime. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in a few hours. Further, due to its use of a streaming implementation that only requires the genotypes to be operated on once, the memory requirement of RHE-mc is modest: all experiments required <60 GB. We compared the run time and memory usage of RHE-mc with REML-based methods (GCTA²⁷ and BOLT-REML⁴) on the UK Biobank genotypes consisting of around 500,000 SNPs over varying sample sizes and observed that RHE-mc achieves several orders-of-magnitude reduction in runtime. Summary-statistic methods such as S-LDSC requires pre-computed inputs which depend on the runtimes of other softwares making a direct comparison of speed difficult. Thus, we have restricted our comparison to individual-level methods where the benchmarking can be done in a comparable manner.

**Fig. 3: Comparison of running time of RHE-mc, GCTA-mc, and BOLT-REML.**

Table 1 Comparison of running time of RHE-mc, GCTA-mc, and BOLT-REML.

Full size table

Estimating total SNP heritability in the UK Biobank

We applied RHE-mc to estimate genome-wide SNP heritability for 22 complex traits (6 quantitative and 16 binary traits) measured in the UK Biobank. We analyzed N = 291,273 unrelated white British individuals and M = 459,792 SNPs genotyped on the UK Biobank Axiom array (see “Methods” section). We ran RHE-mc with B = 10 and with SNPs divided into eight bins based on two MAF bins (0.01 ≤ MAF < 0.05, MAF ≥ 0.05) and quartiles of the LD-scores. We compared the estimates from RHE-mc to those from LDSC, S-LDSC, SumHer, and GRE. Restricting our analysis to 18 traits for which the point estimate of genome-wide SNP heritability from RHE-mc is >0.05, the estimates from S-LDSC, GRE, SumHer, and LDSC were on average 2.5%, 10%, 25%, and 67% higher than RHE-mc (Fig. 4). Relative to the simulation results, the estimates from S-LDSC are generally consistent with those from RHE-mc. This is likely due to the fact that, in simulations, our application of S-LDSC used only MAF bins. On the other hand, in real data, we used S-LDSC with the recommended baseline-LD annotations (including functional annotations).

**Fig. 4: Estimates of genome-wide SNP heritability from RHE-mc, LDSC, S-LDSC, GRE, and SumHer for 22 complex traits and diseases in the UK Biobank.**

We then applied RHE-mc to estimate genome-wide heritability attributable to imputed variants. The genome-wide estimates of SNP heritability from RHE-mc on imputed SNPs (MAF > 1%) are concordant with the estimates from array SNPs (2.8% higher on average). We then analyzed M = 7,774,235 imputed genotypes with MAF > 0.1% using 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). Genome-wide SNP heritability estimates from RHE-mc on imputed SNPs (MAF > 0.1%) are 11.4% higher than RHE-mc on imputed SNPs (MAF > 1%) (Fig. 4, Supplementary Fig. 7). Following previous work¹⁰, we have removed the MHC region to enable a systematic comparison since the estimation of LD in the MHC region can be challenging; it would be of interest to compare methods when the MHC is included.

Partitioning SNP heritability across allele frequency and LD bins

We used RHE-mc to partition SNP heritability of 22 complex traits across MAF and LD bins. We analyzed M = 7,774,235 imputed SNPs with MAF > 0.1%. We used 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). We compute the per-allele squared effect size of SNPs in bin k as $\frac{{h}_{k}^{2}}{2{f}_{k}(1\,-\,{f}_{k}){M}_{k}}$, where ${h}_{k}^{2}$ is the heritability estimated in bin k, f_k is the mean MAF in bin k, and M_k is the number of SNPs in bin k. We observe that allelic effect size increases with lower MAF and LD. For height, in the lowest quartile of LD scores, SNPs with MAF ≈ 0.1% have allelic effect sizes ≈27x ± 8 larger than SNPs with MAF ≈ 50%. Similarly, among SNPs with MAF ≈50%, SNPs in the lowest quartile of LD scores have allelic effect sizes ≈5x ± 1 larger than SNPs in the highest quartile (Fig. 5 for height; other traits in Supplementary Fig. 9). While these trends have been observed in previous studies^9,29,30, the ability of RHE-mc to jointly fit multiple variance components allows us to estimate effect sizes at SNPs with MAF as low as 0.1%. We caution that negative heritability estimates in bins of lowest MAF and high LD score could arise due to one or more of the following factors: low number of SNPs in this bin (we did not constrain our variance components estimates to be non-negative), the inadequacy of the assumed heritability model, and errors in the imputed genotypes used for the analysis.

**Fig. 5: Per-allele squared effect size of height as a function of MAF.**

Partitioning heritability by functional annotations

The ability of RHE-mc to estimate variance components associated with a large number of overlapping annotations enables us to explore the contribution of a variety of functional genomic annotations to trait heritability using individual-level data in the UK Biobank. We applied RHE-mc to jointly partition heritability of 22 complex traits across 28 functional annotations as defined in ref. ⁷ restricting our analysis to N = 291,273 unrelated white British individuals and M = 5,670,959 imputed SNPs (we restrict to SNPs with MAF > 0.1% which are also present in 1000 Genomes Project). We grouped the traits into five categories (autoimmune, diabetes, respiratory, anthropometric, cardiovascular); for a representative trait from each category, we report enrichment of each of the 28 functional annotations in Fig. 6 (see “Methods” section; for all traits see Supplementary Fig. 8). Our results are largely concordant with previous studies^7,9: we observe enrichment of heritability across traits in conserved regions (Z-score > 3 in 15 traits). We also observe enrichment of heritability at FANTOM5 enhancers (labeled Enhancer_Andersson in Fig. 6) in asthma, eczema, autoimmune disorders (broad), hypothyroidism, and thyroid disorders (Z-score > 3) even though these annotations cover only 0.4% of the analyzed SNPs.

**Fig. 6: Enrichment of heritability across 28 functional annotations.**

Discussion

We have presented RHE-mc, an algorithm that can efficiently estimate multiple variance components on large-scale genotype data. In light of increasing evidence for SNP effect sizes that vary as a function of covariates, such as MAF and LD and the bias associated with methods that fit only a single variance component⁸, the ability to define flexible models endowed with multiple variance components is important to obtain unbiased estimates of fundamental quantities such as SNP heritability. We confirm that RHE-mc yields accurate genome-wide SNP heritability estimates under diverse genetic architectures. In applications to 22 complex traits in the UK Biobank, RHE-mc yields heritability estimates on array SNPs that are lower on average relative to S-LDSC and SumHer. We have explored the utility of RHE-mc in heritability partitioning analyses. These analyses show that per-allele squared effect sizes tend to increase with a decrease in MAF and LD consistent with previous studies⁹. We also partitioned heritability across functional annotations to reveal enrichment of heritability at FANTOM5 enhancers in specific traits such as asthma and eczema.

We discuss several limitations of RHE-mc as well as directions for future work. First, the method-of-moments estimator underlying RHE-mc tends to yield slightly larger standard errors, on average, relative to REML estimators. The relative performance of the two methods likely depends on a number of aspects of the study design such as sample size, number of SNPs, the LD structure, relatedness patterns, and the underlying genetic architecture. Nevertheless, our method is designed to be applicable to massive datasets for which the heritability estimates are relatively precise. Developing scalable variance components estimators that are as efficient as REML-based methods is an important direction for future work. Second, this work has primarily explored the partitioning of heritability across discrete annotations. While we have shown how the methodology can be extended to continuous-valued annotations (see “Methods” section and Supplementary Notes), it would be of interest to explore variation in trait heritability as a function of the value of an annotation. On the other hand, the ability of RHE-mc to fit many annotations allows the annotation to be divided into a sufficiently large number of bins. Third, we have applied RHE-mc to binary traits available in the UK Biobank treating these traits as continuous. Methods that explicitly model binary traits as well as the underlying ascertainment involved in case-control studies are likely to lead to more accurate heritability estimates^23,31. For example, the PCGC method²³ is an extension of HE regression and it would be of interest to develop a scalable randomized PCGC estimator. Fourth, RHE-mc requires access to individual-level genotype and phenotype data. Methods that only require summary statistic data (GRE¹⁰, LDSC²⁴, and SumHer²⁵) have the advantage of being applicable to datasets where acquiring access to individual-level data can be challenging¹⁰. Finally, our method could potentially lead to improvements in association testing, trait prediction, and understanding of polygenic selection.

Methods

Multi-component linear mixed model

RHE-mc attempts to fit the following variance component model:

$$ {\boldsymbol{y}}| {\boldsymbol{\epsilon }},{{\boldsymbol{\beta }}}_{1},\ldots ,{{\boldsymbol{\beta }}}_{K} =\mathop{\sum }\limits_{k=1}^{K}{{\boldsymbol{X}}}_{k}{{\boldsymbol{\beta }}}_{k}+{\boldsymbol{\epsilon }}\\ \hskip 62pt {\boldsymbol{\epsilon }} \sim {\mathcal{D}}({\bf{0}},{\sigma }_{e}^{2}{{\bf{I}}}_{N})\\ \hskip 20pt {{\boldsymbol{\beta }}}_{k} \sim {\mathcal{D}}\left({\bf{0}},\frac{{\sigma }_{k}^{2}}{{M}_{k}}{{\bf{I}}}_{{M}_{k}}\right),k\in \{1,\ldots ,K\}$$

(1)

Here y is a N-vector of centered phenotypes and each of the M SNPs is assigned to one of K non-overlapping categories. Each category k contains M_k SNPs, k ∈ {1, …, K}, ∑_kM_k = M. X_k is a N × M_k matrix, where x_k,n,m denotes the standardized genotype for individual n at SNP m in category k. We have ∑_nx_k,n,m = 0 and ${\sum }_{n}{x}_{k,n,m}^{2}=N$ for m ∈ {1, 2, …, M_k}. β_k denote the M_k-vector of SNP effect sizes for the kth category where ${\mathcal{D}}(\boldsymbol{\mu} ,\boldsymbol{\Sigma})$ is an arbitrary distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. In the above model, ${\sigma }_{e}^{2}$ is the residual variance, and ${\sigma }_{k}^{2}$ is the variance component of the kth category. The total SNP heritability is defined as

$${h}_{{\mathrm{{SNP}}}}^{2}=\frac{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}+{\sigma }_{e}^{2}}$$

(2)

The SNP heritability of category k is defined as

$${h}_{k}^{2}=\frac{{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}+{\sigma }_{e}^{2}},k\in \{1,\ldots ,K\}$$

(3)

Enrichment in bin k is defined as

$${e}_{k}=\frac{{h}_{k}^{2}/{h}_{{\mathrm{{SNP}}}}^{2}}{{M}_{k}/M},k\in \{1,\ldots ,K\}$$

(4)

Method-of-moments for estimating multiple variance components

To estimate the variance components, RHE-mc uses a Method-of-Moments (MoM) estimator that searches for parameter values so that the population moments are close to the sample moments³². Since ${\mathbb{E}}[{\bf{y}}]=0$, we derived the MoM estimates by equating the population covariance to the empirical covariance. The population covariance is given by

$${\mathrm{{cov}}}({\boldsymbol{y}})=E[{\boldsymbol{y}}{{\boldsymbol{y}}}^{{\mathrm{{T}}}}]-E[{\boldsymbol{y}}]E[{{\boldsymbol{y}}}^{{\mathrm{{T}}}}]=\sum _{k}{\sigma }_{k}^{2}{{\boldsymbol{K}}}_{k}+{\sigma }_{e}^{2}{{\boldsymbol{I}}}_{N}$$

(5)

Here ${{\boldsymbol{K}}}_{k}=\frac{{{\boldsymbol{X}}}_{k}{{\boldsymbol{X}}}_{k}^{{\mathrm{{T}}}}}{{M}_{k}}$ is the genetic relatedness matrix (GRM) computed from all SNPs of kth category. Using yy^T as our estimate of the empirical covariance, we need to solve the following least-squares problem to find the variance components.

$$(\tilde{{\sigma }_{1}^{2}},\ldots ,\tilde{{\sigma }_{K}^{2}},\tilde{{\sigma }_{e}^{2}})={\mathop {\mathrm{{argmi{n}}}}\limits_{({\sigma }_{1}^{2},\ldots ,{\sigma }_{K}^{2},{\sigma }_{e}^{2})}}| | {\boldsymbol{y}}{{\boldsymbol{y}}}^{{\mathrm{{T}}}}-\left({\mathop{\sum }\limits_{k}}{\sigma }_{k}^{2}{{\boldsymbol{K}}}_{k}+{\sigma }_{e}^{2}{\boldsymbol{I}}\right)| {| }_{F}^{2}$$

(6)

The MoM estimator satisfies the following normal equations:

$$\left[\begin{array}{ll}{\boldsymbol{T}}&{\boldsymbol{b}}\\ {{\boldsymbol{b}}}^{{\mathrm{{T}}}}&N\end{array}\right]\left[\begin{array}{l}\tilde{{\sigma }_{g}^{2}}\\ {\tilde{\sigma }}_{e}^{2}\end{array}\right]=\left[\begin{array}{l}{\boldsymbol{c}}\\ {{\boldsymbol{y}}}^{{\mathrm{{T}}}}{\boldsymbol{y}}\end{array}\right]$$

(7)

Here $\tilde{{\sigma }_{g}^{2}}=\left[\begin{array}{l}\tilde{{\sigma }_{1}^{2}}\\ \vdots \\ \tilde{{\sigma }_{K}^{2}}\end{array}\right]$, T is a K × K matrix with entries T_k,l = tr(K_kK_l), k, l ∈ {1, …, K}, b is a K-vector with entries b_k = tr(K_k) = N (because X_ks is standardized), and c is a K-vector with entries c_k = y^TK_ky. Each GRM K_k can be computed in time ${\mathcal{O}}({N}^{2}{M}_{k})$ and ${\mathcal{O}}({N}^{2})$ memory. Given K GRMs, the quantities T_k,l, c_k, k, l ∈ {1, …, K}, can be computed in ${\mathcal{O}}({K}^{2}{N}^{2})$. Given the quantities T_k,l, c_k, the normal Eq. (7) can be solved in ${\mathcal{O}}({K}^{3})$. Therefore, the total time complexity for estimating the variance components is ${\mathcal{O}}({N}^{2}M+{K}^{2}{N}^{2}+{K}^{3})$.

RHE-mc: Randomized estimator of multiple variance components

The key bottleneck in solving the normal Eq. (7) is the computation of T_k,l, k, l ∈ {1, …, K} which takes ${\mathcal{O}}({N}^{2}M)$. Instead of computing the exact value of T_k,l, we use an unbiased estimator of the trace³³ based on the following identity: for a given N × N matrix C, z^TCz is an unbiased estimator of tr(C) (E[z^TCz] = tr[C]), where z be a random vector with mean zero and covariance I_N. Hence, we can estimate the values T_k,l, k, l ∈ {1, …, K} as follows:

$${T}_{k,l}={\mathrm{{tr}}}({{\boldsymbol{K}}}_{k}{{\boldsymbol{K}}}_{l})\approx \widehat{{T}_{k,l}}=\frac{1}{B}\frac{1}{{M}_{k}{M}_{l}}{\mathop {\sum }\limits_{b}}{{\boldsymbol{z}}}_{b}^{{\mathrm{{T}}}}{{\boldsymbol{X}}}_{k}{{\boldsymbol{X}}}_{k}^{{\mathrm{{T}}}}{{\boldsymbol{X}}}_{l}{{\boldsymbol{X}}}_{l}^{{\mathrm{{T}}}}{{\boldsymbol{z}}}_{b}$$

(8)

Here z₁, …, z_B are B independent random vectors with zero mean and covariance I_N. We draw these random vectors independently from a standard normal distribution. Computing T_k,l using the unbiased estimator involves four multiplications of sub-matrices of the genotype matrix with a vector, repeated B times. Therefore, the total running time for estimating the matrix T is ${\mathcal{O}}(NMB+{K}^{2}NB)$.

Moreover, we can leverage the structure of the genotype matrix which only contains entries in {0, 1, 2}. For a fixed genotype matrix X_k, we can improve the per iteration time complexity of matrix–vector multiplication from ${\mathcal{O}}(NM)$ to ${\mathcal{O}}(\frac{NM}{{\mathrm{{max}}}({\mathrm{log}\,}_{3}N,{\mathrm{log}\,}_{3}M)})$ by using the Mailman algorithm³⁴. Solving the normal equations takes ${\mathcal{O}}({K}^{3})$ time so that the overall time complexity of our algorithm is ${\mathcal{O}}(\frac{NMB}{\max ({\mathrm{log}\,}_{3}(N),{\mathrm{log}\,}_{3}(M))}+{K}^{2}(K+NB))$.

RHE-mc uses a block Jackknife to estimate standard errors. In Supplementary Notes, we show how the block Jackknife estimates can be computed with little additional computational overhead. Further, we also show how covariates can be efficiently included in the model (Supplementary Notes).

Multi-component LMM with overlapping annotations

RHE-mc can also be applied in the setting where annotations overlap. Following ref. ⁷, the heritability of SNPs belong to annotation k is defined as

$${h}_{k}^{2}=\frac{{\sum }_{i\in {S}_{k}}{\sum }_{j:i\in {S}_{j}}\frac{{\sigma }_{j}^{2}}{{M}_{j}}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}+{\sigma }_{e}^{2}},k\in \{1,\ldots ,K\}$$

(9)

where S_k is the set of SNPs in kth annotation and M_k = ∣S_k∣. Enrichment in bin k is defined as ${e}_{k}=\frac{{h}_{k}^{2}/{h}_{{\mathrm{{SNP}}}}^{2}}{{M}_{k}/M}$.

Multi-component LMM with continuous annotations

We have described the derivation of RHE-mc using binary annotations. Following ref. ²⁹, we can extend RHE-mc to support continuous-value annotations as follows:

$$ {\boldsymbol{y}}| {\boldsymbol{\epsilon }},{{\boldsymbol{\beta }}}_{1},\ldots ,{{\boldsymbol{\beta }}}_{K} =\mathop{\sum }\limits_{k=1}^{K}{{\boldsymbol{X}}}_{k}{{\boldsymbol{\beta }}}_{k}+{\boldsymbol{\epsilon }}\\ \hskip 55pt {\boldsymbol{\epsilon }} \sim {\mathcal{D}}({\bf{0}},{\sigma }_{e}^{2}{{\bf{I}}}_{N})\\ \hskip -5pt {{\boldsymbol{\beta }}}_{k} \sim {\mathcal{D}}\left({\bf{0}},\frac{{\sigma }_{k}^{2}}{{M}_{k}}{\mathrm{{diag}}}({\bf{a}}_{k})\right),k\in \{1,\ldots ,K\}$$

(10)

This model is similar to the model in Eq. (1) except that here we assume that the variance of effect sizes depend on continuous-valued annotation. Let ${\mathbf{a}}$_k be a M_k-vector where a_k,m is the value of kth annotation at SNP m (the elements of ${\mathbf{a}}_{k}$ must be non-negative). Let S_k be the set of SNPs belong to annotation k. In this model, the SNP heritability of annotation k is defined as:

$${h}_{k}^{2}=\frac{{\sum }_{i\in {S}_{k}}\frac{{\sigma }_{k}^{2}}{{M}_{k}}{a}_{k,i}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sum }_{i\in {S}_{k}}\frac{{\sigma }_{k}^{2}}{{M}_{k}}{a}_{k,i}+{\sigma }_{e}^{2}},k\in \{1,\ldots ,K\}$$

(11)

To estimate the variance components of this new model, we only need to replace X_k with ${{\boldsymbol{X}}}_{k}{\mathrm{{diag}}}(\sqrt{{{\bf{a}}}_{k}})$ in Eq. (5) for every annotation k. We assessed the accuracy of RHE-mc in estimating variance components with continuous annotation in Supplementary Notes.

Simulations

We performed simulations to compare the performance of RHE-mc with several state-of-the-art methods for heritability estimation that cover the spectrum of methods that have been proposed.

We considered two simulation settings. In the large-scale simulation setting, we simulated phenotypes for the full set of UK Biobank genotypes consisting of M = 593,300 array SNPs and N = 337,205 individuals. We obtained the individuals by keeping unrelated white British individuals which are >3rd degree relatives (defined as pairs of individuals with kinship coefficient <1/2^(9/2))¹⁷, and removing individuals with putative sex chromosome aneuploidy. The small-scale setting was designed so that we could compare the accuracies of RHE-mc to REML methods. In this setting, we simulated phenotypes from a subsampled set of genotypes from the UK Biobank data genotypes used in large-scale simulation³⁵. Specifically, we randomly chose a subset of N = 10,000 individuals from the large-scale data so that we have M = 593,300 array SNPs and N = 10,000 individuals. We simulated phenotypes from genotypes using the following model which is used in refs. ^8,10:

$$ {\sigma }_{m}^{2} =S{c}_{m}{w}_{m}^{b}{[{f}_{m}(1-{f}_{m})]}^{a}\\ \hskip -48pt {({{\boldsymbol{\beta }}}_{1},{{\boldsymbol{\beta }}}_{2},..,{{\boldsymbol{\beta }}}_{m})}^{T} \sim {\mathcal{N}}({\bf{0}},{\mathrm{{diag}}}({\sigma }_{1}^{2},{\sigma }_{2}^{2},...,{\sigma }_{m}^{2}))\\ y| {\boldsymbol{\beta }} \sim {\mathcal{N}}({\bf{X}}\beta ,(1-{h}^{2}){{\bf{I}}}_{N})$$

(12)

where S is a normalizing constant chosen so that $\mathop{\sum }\nolimits_{m = 1}^{M}{\sigma }_{m}^{2}={h}^{2}$. Here h² ∈ [0, 1], a ∈ {0, 0.75}, b ∈ {0, 1}. β_m, f_m, and w_m are the effect size, the minor allele frequency, and LDAK score of mth SNP, respectively. Let c_m ∈ {0, 1} be an indicator variable for the causal status of SNP m. The LD score of a SNP is defined to be the sum of the squared correlation of the SNP with all other SNPs that lie within a specific distance, and the LDAK score of a SNP is computed based on local levels of LD such that the LDAK score tends to be higher for SNPs in regions of low LD³⁶. The above models relating genotype to phenotype are commonly used in methods for estimating SNP heritability: the GCTA Model (when a = b = 0 in Eq. (12)), which is used by the software GCTA²⁷ and LD Score regression (LDSC)²⁴, and the LDAK Model (where a = 0.75, b = 1 in Eq. (12)) used by software LDAK³⁶. Moreover, under each model, we varied the proportion and minor allele frequency (MAF) of CVs. Proportion of CVs were set to be either 100% or 1%, and MAF of CVs drawn uniformly from [0, 0.5] or [0.01, 0.05] or [0.05, 0.5] to consider genetic architectures that are either infinitesimal or sparse, as well genetic architectures that include a mixture of common and rare SNPs as well as ones that consist of only rare or common SNPs. The true heritability were chosen from {0.1, 0.25, 0.5, 0.8}.

We generated 100 sets of simulated phenotypes for each setting of parameters and report accuracies averaged over these 100 sets.

Comparisons

For the large-scale simulations, we compared RHE-mc to methods that rely on summary statistics for estimating heritability. Among the summary statistic methods, LD score regression (LDSC)²⁴ uses the slope from the GWAS χ² statistics regressed on the LD scores to estimate heritability. Stratified LD score regression (S-LDSC)⁷ is an extension of LDSC for partitioning heritability from summary statistics. SumHer is the summary statistic analog of LDAK²⁵. We ran S-LDSC with 10 binary MAF bin annotations defined such that each bin contains exactly 10% of the typed SNPs; this is intended to mirror the 10 MAF bin annotations in the S-LDSC “baseline-LD model”²⁹ (see Supplementary Table 5). To run SumHer, we used the LDAK software to compute the default “LDAK weights” using in-sample LD ^25,36,37. We then computed “LD tagging” using 1-Mb windows centered on each SNP as recommended²⁵. To do a fair comparison we computed LD scores for LDSC, S-LDSC, GRE, and SumHer by using in-sample LD among the M SNPs, and in all simulations we aim to estimate the SNP-heritability explained by the same set of M SNP. We described the parameter settings of summary statistic methods in Supplementary Notes.

For the small-scale simulations, we compared RHE-mc to GCTA-mc and HE-mc²⁷. GCTA-mc and HE-mc are the extensions of GCTA and HE to a multi-component LMM, respectively, where the variance components are typically defined by binning SNPs according to their MAF as well as local LD⁸. We ran GCTA-mc, HE-mc and RHE-mc using 24 bins formed by the combination of six bins based on MAF (MAF ≤ 0.01, 0.01 < MAF ≤ 0.02, 0.02 < MAF ≤ 0.03, 0.03 < MAF ≤ 0.4, 0.04 < MAF ≤ 0.05, MAF > 0.05) as well as four bins based on quartiles of the LDAK score of a SNP. We ran both GCTA-mc and RHE-mc allowing for estimates of a variance component to be negative.

For comparisons of runtime, we compared RHE-mc to GCTA²⁷ and BOLT-REML⁴ which is a computationally efficient approximate method to compute the REML estimator. We ran all methods with 22 components (one for each chromosome). We also ran RHE-mc with ≈300 components (corresponding to 10 Mb bins) on the UK Biobank genotype (Supplementary Fig. 10). To create our largest dataset, we replicate individuals from the UK Biobank and a subset of the imputed SNPs to obtain a dataset with one million individuals and SNPs. We use the latest versions of BOLT-REML (Version 2.3.2) and GCTA (Version 1.92.1) in our comparison. All comparisons are performed on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM.

Heritability estimates in the UK Biobank

We estimated SNP-heritability for 22 complex traits (6 quantitative, 16 binary) in the UK Biobank¹⁷. In this study, we restricted our analysis to SNPs that were present in the UK Biobank Axiom array used to genotype the UK Biobank. SNPs with >1% missingness and minor allele frequency <1% were removed. Moreover, SNPs that fail the Hardy–Weinberg test at significance threshold 10⁻⁷ were removed. We restricted our study to self-reported British white ancestry individuals who are >3rd degree relatives defined as pairs of individuals with kinship coefficient <1/2^(9/2)¹⁷. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. Finally, we obtained a set of N = 291,273 individuals and M = 459,792 SNPs to use in the real data analyses. We included age, sex, and the top 20 genetic principal components (PCs) as covariates in our analysis for all traits. We used PCs precomputed by the UK Biobank from a superset of 488,295 individuals. Additional covariates were used for waist-to-hip ratio (adjusted for BMI) and diastolic/systolic blood pressure (adjusted for cholesterol-lowering medication, blood pressure medication, insulin, hormone replacement therapy, and oral contraceptives).

Heritability partitioning

In our initial analysis, we removed SNPs with >1% missingness and minor allele frequency <1%. Moreover, we removed SNPs that fail the Hardy–Weinberg test at significance threshold 10⁻⁷ as well as SNPs that lie within the MHC region (Chr6: 25–35 Mb) to obtain 4,824,392 SNPs. We restricted our study to self-reported British white ancestry individuals who are >3rd degree relatives defined as pairs of individuals with kinship coefficient <1/2^(9/2)¹⁷. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. Finally, we obtained 291,273 individuals . We partitioned SNPs into eight bins based on two MAF bins (MAF ≤ 0.05, MAF > 0.05) and quartiles of the LD-scores. For each bin k, we computed the heritability enrichment as the ratio of the percentage of heritability explained by SNPs in bin k to the the percentage of SNPs in bin k.

We considered an additional analysis in which we included SNPs with MAF > 0.1% resulting in N = 291,273 unrelated white British individuals and M = 7,774,235 imputed SNPs (MAF > 0.1%). We defined 144 bins based on 4 LD bins and 36 MAF bins. The 4 LD bins are defined based on quartile of LD-scores, and 36 MAF bins are defined based on 9-quantile of the following four intervals: 0.001 ≤ MAF ≤ 0.01, 0.01 < MAF ≤ 0.05, 0.05 ≤ MAF ≤ 0.10, 0.10 < MAF ≤ 0.50.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Access to the UK Biobank resource is available via application at: http://www.ukbiobank.ac.uk.

Code availability

RHE-mc software is open-source software freely available at: https://github.com/sriramlab/RHE-mc

References

McCulloch, C. E. & Searle, S. R. Generalized, Linear, and Mixed Models (John Wiley & Sons, 2004).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet.42, 565 (2010).
Article CAS PubMed PubMed Central Google Scholar
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common snps. Nat. Genet.43, 519 (2011).
Article CAS PubMed PubMed Central Google Scholar
Loh, P.-R. et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet.47, 1385 (2015).
Article CAS PubMed PubMed Central Google Scholar
Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common snps. Nat. Genet.44, 247 (2012).
Article CAS PubMed PubMed Central Google Scholar
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet.95, 535–552 (2014).
Article CAS PubMed PubMed Central Google Scholar
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet.47, 1228 (2015).
Article CAS PubMed PubMed Central Google Scholar
Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet.50, 737 (2018).
Article CAS PubMed PubMed Central Google Scholar
Gazal, S. et al. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet50, 1600–1607 (2018).
Article CAS PubMed PubMed Central Google Scholar
Hou, K. et al. Accurate estimation of snp-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. https://doi.org/10.1038/s41588-019-0465-0. https://www.biorxiv.org/content/early/2019/01/23/526855.full.pdf (2019).
Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika58, 545–554 (1971).
Article MathSciNet Google Scholar
Kuk, A. Y. & Cheng, Y. W. The Monte Carlo Newton–Raphson algorithm. J. Stat. Comput. Simul.59, 233–250 (1997).
Article Google Scholar
Liu, J. S. & Wu, Y. N. Parameter expansion for data augmentation. J. Am. Stat. Assoc.94, 1264–1274 (1999).
Article MathSciNet Google Scholar
Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics51, 1440–1450 (1995).
Article Google Scholar
Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strandén, I. & Thompson, R. Employing a Monte Carlo algorithm in Newton-type methods for restricted maximum likelihood estimation of genetic parameters. PLoS ONE8, e80821 (2013).
Article ADS PubMed PubMed Central Google Scholar
Runcie, D. E. & Crawford, L. Fast and exible linear mixed models for genome-wide genetics. PLoS Genet.15, e1007978 (2019).
Article CAS PubMed PubMed Central Google Scholar
Bycroft, C. et al. The uk biobank resource with deep phenotyping and genomic data. Nature562, 203–209 (2018).
ADS CAS PubMed PubMed Central Google Scholar
Haseman, J. & Elston, R. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet.2, 3–19 (1972).
Article CAS PubMed Google Scholar
Zhou, X. A unified framework for variance component estimation with summary statistics in genomewide association studies. Ann. Appl. Stat.11, 2027 (2017).
Article MathSciNet PubMed PubMed Central Google Scholar
Wu, Y. & Sankararaman, S. A scalable estimator of snp heritability for biobank-scale data. Bioinformatics34, i187–i194 (2018).
Article CAS PubMed PubMed Central Google Scholar
Ge, T., Chen, C.-Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the uk biobank. PLoS Genet.13, e1006711 (2017).
Article PubMed PubMed Central Google Scholar
Visscher, P. M. et al. Statistical power to detect genetic (co) variance of complex traits using snp data in unrelated samples. PLoS Genet.10, e1004269 (2014).
Article PubMed PubMed Central Google Scholar
Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA111, E5272–E5281 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Bulik-Sullivan, B. K. et al. Ld score regression distinguishes confounding from polygenicity in genomewide association studies. Nat. Genet.47, 291 (2015).
Article CAS PubMed PubMed Central Google Scholar
Speed, D. & Balding, D. J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet.51, 277–284 (2019).
Article CAS PubMed Google Scholar
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet.47, 1228 (2015).
Article CAS PubMed PubMed Central Google Scholar
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. Gcta: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet.88, 76–82 (2011).
Article CAS PubMed PubMed Central Google Scholar
Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet.47, 1114 (2015).
Article CAS PubMed PubMed Central Google Scholar
Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet.49, 1421 (2017).
Article CAS PubMed PubMed Central Google Scholar
Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at 588020 (2019).
Weissbrod, O., Flint, J. & Rosset, S. Estimating snp-based heritability and genetic correlation in casecontrol studies directly and with summary statistics. Am. J. Hum. Genet.103, 89–99 (2018).
Article CAS PubMed PubMed Central Google Scholar
Henderson, C. R. Estimation of variance and covariance components. Biometrics9, 226–252 (1953).
Article MathSciNet Google Scholar
Hutchinson, M. A stochastic estimator of the trace of the inuence matrix for Laplacian smoothing splines. Commun. Stat.-Simul. Comput.18, 1059–1076 (1989).
Article Google Scholar
Liberty, E. & Zucker, S. W. The mailman algorithm: a note on matrix–vector multiplication. Inf. Process. Lett.109, 179–182 (2009).
Article MathSciNet Google Scholar
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med.12, e1001779 (2015).
PubMed PubMed Central Google Scholar
Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genomewide SNPs. Am. J. Hum. Genet.91, 1011–1021 (2012).
Article CAS PubMed PubMed Central Google Scholar
Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet.49, 986 (2017).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This research was conducted using the UK Biobank Resource under applications 33127 and 33297. We thank the participants of UK Biobank for making this work possible. We thank Rob Brown, Steven Gazal, and members of the Sankararaman and Pasaniuc labs for feedback on this manuscript. This work was funded by NIH grants R01HG009120 (B.P. and K.S.B.), R35GM125055 (S.S.), an Alfred P. Sloan Research Fellowship (S.S.), and a NSF grant III-1705121 (A.P., Y.W., and S.S.).

Author information

Authors and Affiliations

Department of Computer Science, UCLA, Los Angeles, CA, 90095, USA
Ali Pazokitoroudi, Yue Wu, Aaron Zhou & Sriram Sankararaman
Bioinformatics Interdepartmental Program, UCLA, Los Angeles, CA, 90095, USA
Kathryn S. Burch & Kangcheng Hou
Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
Bogdan Pasaniuc
Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
Bogdan Pasaniuc & Sriram Sankararaman
Department of Computational Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
Bogdan Pasaniuc & Sriram Sankararaman

Authors

Ali Pazokitoroudi
View author publications
You can also search for this author in PubMed Google Scholar
Yue Wu
View author publications
You can also search for this author in PubMed Google Scholar
Kathryn S. Burch
View author publications
You can also search for this author in PubMed Google Scholar
Kangcheng Hou
View author publications
You can also search for this author in PubMed Google Scholar
Aaron Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Bogdan Pasaniuc
View author publications
You can also search for this author in PubMed Google Scholar
Sriram Sankararaman
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A.P. and S.S. conceived and designed the experiments. A.P. performed the experiment and statistical analyses. Y.W., K.S.B., and K.H. collected and managed the data. Y.W., K.S.B., K.H., and A.Z. assisted with the experiments. B.P. consulted on analysis and interpretation of the data. A.P., K.S.B., B.P., and S.S. wrote the manuscript.

Corresponding author

Correspondence to Sriram Sankararaman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review informationNature Communications thanks Doug Speed, Bjarni Vilhjalmsson, and the other, anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Pazokitoroudi, A., Wu, Y., Burch, K.S. et al. Efficient variance components analysis across millions of genomes. Nat Commun 11, 4020 (2020). https://doi.org/10.1038/s41467-020-17576-9

Download citation

Received: 04 April 2020
Accepted: 03 July 2020
Published: 11 August 2020
DOI: https://doi.org/10.1038/s41467-020-17576-9

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.