Abstract
While variance components analysis has emerged as a powerful tool in complex trait genetics, existing methods for fitting variance components do not scale well to large-scale datasets of genetic variation. Here, we present a method for variance components analysis that is accurate and efficient: capable of estimating one hundred variance components on a million individuals genotyped at a million SNPs in a few hours. We illustrate the utility of our method in estimating and partitioning variation in a trait explained by genotyped SNPs (SNP-heritability). Analyzing 22 traits with genotypes from 300,000 individuals across about 8 million common and low-frequency SNPs, we observe that per-allele squared effect size increases with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD), consistent with the action of negative selection. Partitioning heritability across 28 functional annotations, we observe enrichment of heritability in FANTOM5 enhancers in asthma, eczema, thyroid and autoimmune disorders.
Introduction
Variance components analysis^{1} has emerged as a versatile tool in human complex trait genetics, enabling studies of the genetic contribution to variation in a trait^{2} as well as its distribution across genomic loci^{3,4}, allele frequencies^{3}, and functional annotations^{3,5,6}. There is increasing interest in applying methods for variance components analysis to large-scale genetic datasets with the goal of uncovering novel insights into the genetic architecture of complex traits^{4,7}. A prominent example of the utility of these methods is in the estimation of SNP heritability (\({h}_{{\mathrm{{SNP}}}}^{2}\))^{2}, the variance in a trait explained by a given set of genotyped SNPs. Variance components methods for estimating SNP heritability typically assume a genetic variance component that represents the fraction of phenotypic variation explained by the SNPs included in the study and a residual variance component. Recent studies have shown that these “single-component” methods yield biased estimates of SNP heritability due to the linkage disequilibrium (LD)- and minor allele frequency (MAF)-dependent architecture of complex traits^{8,9}. On the other hand, flexible models with multiple variance components^{3,4} that allow SNP effects to vary with MAF and LD have been shown to yield more accurate SNP heritability estimates^{8,9}. Recent work has shown that SNP heritability can be estimated with minimal assumptions about the genetic architecture^{10}; however, this method cannot partition heritability across categories of SNPs of interest such as functional or population genomic annotations. Partitioning heritability requires fitting multiple variance components, thus creating the need for accurate and scalable methods that can fit tens or even hundreds of variance components to large-scale genomic data to obtain accurate and novel insights into genetic architecture.
Fitting such flexible variance component models to large-scale datasets requires scalable algorithms. Approaches for estimating variance components typically search for parameter values that maximize the likelihood or the restricted maximum likelihood (REML)^{11}. Despite a number of algorithmic improvements^{2,4,12,13,14,15,16}, computing REML estimates of the variance components on datasets such as the UK Biobank^{17} (≈500,000 individuals genotyped at nearly one million SNPs) remains challenging. The reason is that methods for computing these estimators typically perform repeated computations on the input genotypes.
We propose a method that can jointly estimate multiple variance components efficiently. Our proposed method, RHE-mc, is a randomized multi-component version of the classical Haseman–Elston regression for heritability estimation^{18,19}. RHE-mc builds on our previously proposed method, RHE-reg^{20}, which uses a randomized algorithm to estimate a single variance component. RHE-mc can simultaneously estimate multiple variance components, as well as variance components associated with continuous and overlapping annotations. Further, unlike REML estimation algorithms, RHE-mc requires only a single pass over the input genotypes, which results in a highly memory-efficient implementation. The resulting computational efficiency permits RHE-mc to jointly fit 300 variance components in less than an hour on a dataset of about 300,000 individuals and 500,000 SNPs, about two orders of magnitude faster than state-of-the-art methods. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in about 12 h.
To demonstrate its utility, we first show that RHE-mc can accurately estimate genome-wide and partitioned SNP heritability under realistic genetic architectures (the functional dependence of SNP effect sizes on MAF and LD). We applied RHE-mc to 22 traits measured across 291,273 individuals genotyped at 459,792 common SNPs (MAF > 1%) in the UK Biobank to obtain estimates of genome-wide SNP heritability. We then used RHE-mc to partition heritability for the 22 traits across about 7.7 million imputed SNPs (MAF > 0.1%) into 144 bins defined based on MAF and LD. We observe that the per-allele squared effect size tends to increase with lower MAF and LD across the traits considered. Finally, we partitioned heritability for SNPs with MAF > 0.1% across 28 functional annotations. We recover previously reported enrichment of heritability in annotations corresponding to conserved regions^{7} and also document enrichment of heritability in FANTOM5 enhancers in eczema, asthma, autoimmune disorders, and thyroid disorders.
Results
Methods overview
RHE-mc aims to fit a variance component model that relates phenotypes y measured across N individuals to their genotypes over M SNPs X:
\[{\bf{y}}=\sum_{k=1}^{K}{{\boldsymbol{X}}}_{k}{\boldsymbol{\beta }}_{k}+{\boldsymbol{\epsilon }},\qquad {\boldsymbol{\beta }}_{k} \sim {\mathcal{D}}\left({\bf{0}},\frac{{\sigma }_{k}^{2}}{{M}_{k}}{{\bf{I}}}_{{M}_{k}}\right),\qquad {\boldsymbol{\epsilon }} \sim {\mathcal{D}}\left({\bf{0}},{\sigma }_{e}^{2}{{\bf{I}}}_{N}\right)\qquad (1)\]
where \({\mathcal{D}}(\boldsymbol{\mu} ,\boldsymbol{\Sigma})\) is an arbitrary distribution with mean μ and covariance Σ. Each of the M SNPs is assigned to one of K non-overlapping categories so that X_{k} is the N × M_{k} matrix consisting of standardized genotypes of SNPs belonging to category k (note that the expected heritability is constant within categories when we use standardized genotypes). β_{k} denotes the effect sizes of SNPs assigned to category k, which are drawn from a zero-mean distribution with covariance parameter \(\frac{{\sigma }_{k}^{2}}{M_k}{{\bf{I}}}_{{M}_{k}}\) (\({\sigma }_{k}^{2}\) being the variance component of category k), while \({\sigma }_{e}^{2}\) is the residual variance.
In this model, the genome-wide SNP heritability is defined as: \({h}_{{\mathrm{{SNP}}}}^{2}=\frac{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2} \, + \, {\sigma }_{e}^{2}}\) while the SNP heritability of category k is defined as: \({h}_{k}^{2}=\frac{{\sigma }_{k}^{2}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\sigma }_{k}^{2} \, + \, {\sigma }_{e}^{2}}\). By choosing categories to represent genomic annotations of interest, e.g., chromosomes, allele frequencies, or functional annotations, these models can be used to estimate the phenotypic variation that can be attributed to the relevant annotation.
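The two definitions above reduce to simple ratios of the variance components. The following minimal sketch illustrates them; the function name `snp_heritability` and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np

def snp_heritability(sigma2_g, sigma2_e):
    """Return (genome-wide h2_SNP, per-category h2_k) from variance components.

    sigma2_g : the K genetic variance components sigma_k^2
    sigma2_e : the residual variance sigma_e^2
    """
    sigma2_g = np.asarray(sigma2_g, dtype=float)
    total = sigma2_g.sum() + sigma2_e       # total phenotypic variance in the model
    return sigma2_g.sum() / total, sigma2_g / total

# Illustrative (made-up) variance components for K = 3 categories.
h2_snp, h2_k = snp_heritability([0.10, 0.05, 0.15], 0.70)
```

By construction, the per-category heritabilities sum to the genome-wide SNP heritability when the categories do not overlap.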
The key inference problem in this model is the estimation of the variance components: \(({\sigma }_{1}^{2},\ldots ,{\sigma }_{K}^{2},{\sigma }_{e}^{2})\). These parameters are typically estimated by maximizing the likelihood or the restricted likelihood. Instead, RHE-mc uses a scalable method-of-moments estimator, i.e., finding values of the variance components such that the population moments match the sample moments^{18,19,21,22,23}. RHE-mc uses a randomized algorithm that avoids explicitly computing the N × N genetic relatedness matrices that are required by method-of-moments estimators. Instead, it operates on a smaller matrix formed by multiplying the input genotype matrix with a small number of random vectors (see “Methods” section). The application of a randomized algorithm for SNP heritability estimation using a single variance component was proposed in our previous work, RHE-reg^{20}. RHE-mc extends our previous work in several directions. RHE-mc can efficiently fit multiple variance components (both non-overlapping and overlapping) and can also handle continuous annotations. The resulting algorithm has scalable runtime as it only requires operating on the genotype matrix one time. Further, RHE-mc uses a streaming implementation that does not require all the genotypes to be stored in memory, leading to scalable memory requirements (Supplementary Notes). Finally, RHE-mc uses an efficient implementation of a block Jackknife to estimate standard errors with little computational overhead (Supplementary Notes).
Accuracy of genome-wide SNP heritability estimates in simulations
We assessed the accuracy of RHE-mc in estimating genome-wide SNP heritability, as previous attempts at estimating SNP heritability have been shown to be sensitive to assumptions about how SNP effect size varies with MAF and LD^{8}. Starting with genotypes of M = 593,300 array SNPs over N = 337,205 unrelated white British individuals in the UK Biobank, we simulated phenotypes according to 64 MAF- and LD-dependent architectures by varying the SNP heritability, the proportion of variants that have nonzero effects (causal variants or CVs), the distribution of CVs across minor allele frequencies (CVs distributed across all minor allele frequency bins or CVs restricted to either common or low-frequency bins), and the form of coupling between the SNP effect size and MAF as well as LD. For RHE-mc, we partitioned the SNPs into 24 variance components based on six MAF bins as well as four LD bins (see “Methods” section). The key parameter in applying RHE-mc is the number of random vectors B, which we set to 10. RHE-mc estimates were relatively insensitive when we increased the number of random vectors B to 100 (Supplementary Figs. 1 and 2, Supplementary Table 1). Across these 64 architectures, RHE-mc is relatively unbiased (a two-sided t-test of the hypothesis of no bias is not rejected across any of the architectures at a p-value < 0.05) with the largest relative bias observed to be 0.5% of the true SNP heritability (Supplementary Fig. 3). We used a block Jackknife (number of blocks = 100) to estimate the standard errors of RHE-mc and confirmed that the estimated standard errors are close to the true SE (Supplementary Table 2).
We compared the accuracy of RHE-mc to state-of-the-art methods for heritability estimation that can be applied to large datasets (across architectures where the true SNP heritability was fixed at 0.5). These methods, LDSC^{24}, SumHer^{25}, S-LDSC^{26}, and GRE^{10}, all leverage summary statistics while RHE-mc requires individual genotype data. We found that estimates from the summary-statistic methods tend to be sensitive to the underlying genetic architecture: across 16 architectures, relative biases range from −31% to 27% for LDSC, −27% to 5% for S-LDSC, and −5% to 9% for SumHer (Fig. 1). We also compared to a recently proposed method (GRE^{10}) that only estimates genome-wide SNP heritability (without partitioning by MAF/LD) and observed that relative biases ranged from 1% to 1.4% for GRE and from −1.5% to 0.5% for RHE-mc. We also considered architectures in which only rare variants are causal and found RHE-mc is accurate relative to other methods (Supplementary Fig. 4). These results further emphasize that RHE-mc can accurately estimate SNP-heritability through fitting multiple variance components.
We compared RHE-mc to the state-of-the-art REML-based variance component estimation method, GCTA-mc (multi-component GREML^{8,27,28}) and to exact multi-component Haseman–Elston Regression (HE-mc) as implemented in GCTA^{27}. We ran each of these methods by partitioning SNPs into 24 variance components (6 MAF bins by 4 LD bins, see “Methods” section). To make these experiments computationally feasible, we simulated phenotypes starting from a smaller set of genotypes (M = 593,300 array SNPs and N = 10,000 white British individuals). Across 16 architectures where the true SNP heritability was fixed at 0.25, the relative biases range from −3.2% to 3.6% for RHE-mc and from −3.2% to 5% for GCTA-mc (Fig. 2). On average, RHE-mc has standard errors that are 1.1 times larger than those of GCTA-mc (the ratio ranges from 0.97 to 1.24) and 1.08 times larger than those of HE-mc (the ratio ranges from 1.00 to 1.21).
Accuracy of heritability partitioning in simulations
We also evaluated the accuracy of RHE-mc in partitioning SNP heritability in both small-scale (M = 593,300 SNPs, N = 10,000 individuals) (Supplementary Fig. 5) and large-scale settings (M = 593,300 SNPs, N = 337,205 individuals) (see Supplementary Fig. 6). For these experiments, we restrict our attention to architectures for which the CVs are chosen to lie within a narrow range of MAF. Since the variance components correspond to bins of MAF and LD, a subset of the variance components would have no causal SNPs and hence have a heritability of zero. We assess the accuracy of estimates of heritability aggregated over these components (termed the non-causal bin) as well as the heritability aggregated over the remaining genetic components (termed the causal bin). For example, variance components that correspond to MAF ∈ [0.01, 0.05] would be included in the causal bin for an architecture that restricts the MAF of CVs to lie in the range [0.01, 0.05]. For the small-scale simulations, we compared RHE-mc to GCTA-mc. We ran both methods by partitioning the SNPs into 24 variance components based on six MAF bins as well as four LD bins defined by quartiles of the LDAK weight at a SNP (see “Methods” section). Across the genetic architectures tested, estimates of heritability within each of the causal and non-causal bins are highly concordant between RHE-mc and GCTA-mc (Supplementary Fig. 5, Supplementary Table 3): for the causal bin, the relative bias ranges from −4% to 0.4% for RHE-mc and −3.6% to 2% for GCTA-mc while, for the non-causal bin, the bias ranges from 0 to 0.7% for RHE-mc and 0 to 1.4% for GCTA-mc (Supplementary Table 3). For the large-scale settings, RHE-mc remains accurate: the relative bias ranges from −2.6% to 3.2% (causal bin) and −0.5% to 0.2% (non-causal bin) over the genetic architectures considered (Supplementary Fig. 6, Supplementary Table 4).
Heritability partitioning has been used to estimate heritability attributed to functional genomic annotations^{7}. However, some of these annotations (such as FANTOM5 enhancers) are quite small, covering <1% of the genome. We explored the ability of RHE-mc to accurately estimate heritability as a function of the size of the annotation. To this end, we performed simulations using N = 291,273 unrelated white British individuals and M = 459,792 common SNPs. We defined eight annotations (four MAF bins and two LD bins) in which we fixed the enrichment of a selected bin and varied the proportion of SNPs in the selected category. RHE-mc obtained accurate estimates of enrichment even when the selected bin only contained 0.4% of the genome-wide SNPs (comparable to the size of FANTOM5 enhancers). RHE-mc estimates are well-calibrated: when the bin has zero enrichment, RHE-mc rejected the null hypothesis of no enrichment in 5% of the simulations, while attaining high power to reject the null hypothesis even when the bin contained <1% of the SNPs (Supplementary Notes).
Computational efficiency
We benchmarked the runtime and memory usage of RHE-mc as a function of the number of individuals, SNPs, and variance components (Fig. 3, Table 1). We ran RHE-mc with B = 10 random vectors and 22 variance components where each chromosome forms a distinct component. On a dataset of ≈300,000 individuals and ≈500,000 SNPs, RHE-mc can fit 22 variance components in less than an hour and ≈300 variance components (corresponding to bins of size 10 Mb) with little increase in its runtime. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in a few hours. Further, due to its use of a streaming implementation that only requires the genotypes to be operated on once, the memory requirement of RHE-mc is modest: all experiments required <60 GB. We compared the runtime and memory usage of RHE-mc with REML-based methods (GCTA^{27} and BOLT-REML^{4}) on the UK Biobank genotypes consisting of around 500,000 SNPs over varying sample sizes and observed that RHE-mc reduces runtime by several orders of magnitude. Summary-statistic methods such as S-LDSC require precomputed inputs whose cost depends on the runtimes of other software, making a direct comparison of speed difficult. Thus, we have restricted our comparison to individual-level methods, where the benchmarking can be done in a comparable manner.
Estimating total SNP heritability in the UK Biobank
We applied RHE-mc to estimate genome-wide SNP heritability for 22 complex traits (6 quantitative and 16 binary traits) measured in the UK Biobank. We analyzed N = 291,273 unrelated white British individuals and M = 459,792 SNPs genotyped on the UK Biobank Axiom array (see “Methods” section). We ran RHE-mc with B = 10 and with SNPs divided into eight bins based on two MAF bins (0.01 ≤ MAF < 0.05, MAF ≥ 0.05) and quartiles of the LD scores. We compared the estimates from RHE-mc to those from LDSC, S-LDSC, SumHer, and GRE. Restricting our analysis to 18 traits for which the point estimate of genome-wide SNP heritability from RHE-mc is >0.05, the estimates from S-LDSC, GRE, SumHer, and LDSC were on average 2.5%, 10%, 25%, and 67% higher, respectively, than those from RHE-mc (Fig. 4). In contrast to the simulation results, the estimates from S-LDSC are generally consistent with those from RHE-mc. This is likely due to the fact that, in simulations, our application of S-LDSC used only MAF bins. On the other hand, in real data, we used S-LDSC with the recommended baseline-LD annotations (including functional annotations).
We then applied RHE-mc to estimate genome-wide heritability attributable to imputed variants. The genome-wide estimates of SNP heritability from RHE-mc on imputed SNPs (MAF > 1%) are concordant with the estimates from array SNPs (2.8% higher on average). We then analyzed M = 7,774,235 imputed genotypes with MAF > 0.1% using 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). Genome-wide SNP heritability estimates from RHE-mc on imputed SNPs (MAF > 0.1%) are 11.4% higher than those from RHE-mc on imputed SNPs (MAF > 1%) (Fig. 4, Supplementary Fig. 7). Following previous work^{10}, we have removed the MHC region to enable a systematic comparison since the estimation of LD in the MHC region can be challenging; it would be of interest to compare methods when the MHC is included.
Partitioning SNP heritability across allele frequency and LD bins
We used RHE-mc to partition SNP heritability of 22 complex traits across MAF and LD bins. We analyzed M = 7,774,235 imputed SNPs with MAF > 0.1%. We used 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). We compute the per-allele squared effect size of SNPs in bin k as \(\frac{{h}_{k}^{2}}{2{f}_{k}(1\,-\,{f}_{k}){M}_{k}}\), where \({h}_{k}^{2}\) is the heritability estimated in bin k, f_{k} is the mean MAF in bin k, and M_{k} is the number of SNPs in bin k. We observe that allelic effect size increases with lower MAF and LD. For height, in the lowest quartile of LD scores, SNPs with MAF ≈ 0.1% have allelic effect sizes ≈27x ± 8 larger than SNPs with MAF ≈ 50%. Similarly, among SNPs with MAF ≈ 50%, SNPs in the lowest quartile of LD scores have allelic effect sizes ≈5x ± 1 larger than SNPs in the highest quartile (Fig. 5 for height; other traits in Supplementary Fig. 9). While these trends have been observed in previous studies^{9,29,30}, the ability of RHE-mc to jointly fit multiple variance components allows us to estimate effect sizes at SNPs with MAF as low as 0.1%. We caution that negative heritability estimates in bins of lowest MAF and high LD score could arise due to one or more of the following factors: low number of SNPs in this bin (we did not constrain our variance components estimates to be nonnegative), the inadequacy of the assumed heritability model, and errors in the imputed genotypes used for the analysis.
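As a worked illustration of the per-allele squared effect size \({h}_{k}^{2}/(2{f}_{k}(1-{f}_{k}){M}_{k})\), the sketch below evaluates the formula for two hypothetical bins with equal heritability and SNP count but different mean MAF. The function name and all numeric inputs are invented for illustration and are not estimates from the analysis.

```python
def per_allele_sq_effect(h2_k, mean_maf, n_snps):
    """Per-allele squared effect size: h2_k / (2 f (1 - f) M_k)."""
    return h2_k / (2.0 * mean_maf * (1.0 - mean_maf) * n_snps)

# Two hypothetical bins: same heritability and size, different mean MAF.
rare = per_allele_sq_effect(h2_k=0.01, mean_maf=0.001, n_snps=1000)
common = per_allele_sq_effect(h2_k=0.01, mean_maf=0.5, n_snps=1000)
ratio = rare / common  # how much larger the per-allele effect is in the rare bin
```

Because the denominator scales with 2f(1 − f), a rare bin that explains as much heritability as a common bin of the same size implies much larger per-allele effects.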
Partitioning heritability by functional annotations
The ability of RHE-mc to estimate variance components associated with a large number of overlapping annotations enables us to explore the contribution of a variety of functional genomic annotations to trait heritability using individual-level data in the UK Biobank. We applied RHE-mc to jointly partition heritability of 22 complex traits across 28 functional annotations as defined in ref. ^{7}, restricting our analysis to N = 291,273 unrelated white British individuals and M = 5,670,959 imputed SNPs (we restrict to SNPs with MAF > 0.1% that are also present in the 1000 Genomes Project). We grouped the traits into five categories (autoimmune, diabetes, respiratory, anthropometric, cardiovascular); for a representative trait from each category, we report enrichment of each of the 28 functional annotations in Fig. 6 (see “Methods” section; for all traits see Supplementary Fig. 8). Our results are largely concordant with previous studies^{7,9}: we observe enrichment of heritability across traits in conserved regions (Z-score > 3 in 15 traits). We also observe enrichment of heritability at FANTOM5 enhancers (labeled Enhancer_Andersson in Fig. 6) in asthma, eczema, autoimmune disorders (broad), hypothyroidism, and thyroid disorders (Z-score > 3) even though these annotations cover only 0.4% of the analyzed SNPs.
Discussion
We have presented RHE-mc, an algorithm that can efficiently estimate multiple variance components on large-scale genotype data. In light of increasing evidence for SNP effect sizes that vary as a function of covariates such as MAF and LD, and the bias associated with methods that fit only a single variance component^{8}, the ability to define flexible models endowed with multiple variance components is important to obtain unbiased estimates of fundamental quantities such as SNP heritability. We confirm that RHE-mc yields accurate genome-wide SNP heritability estimates under diverse genetic architectures. In applications to 22 complex traits in the UK Biobank, RHE-mc yields heritability estimates on array SNPs that are lower on average relative to S-LDSC and SumHer. We have explored the utility of RHE-mc in heritability partitioning analyses. These analyses show that per-allele squared effect sizes tend to increase with a decrease in MAF and LD, consistent with previous studies^{9}. We also partitioned heritability across functional annotations to reveal enrichment of heritability at FANTOM5 enhancers in specific traits such as asthma and eczema.
We discuss several limitations of RHE-mc as well as directions for future work. First, the method-of-moments estimator underlying RHE-mc tends to yield slightly larger standard errors, on average, relative to REML estimators. The relative performance of the two methods likely depends on a number of aspects of the study design such as sample size, number of SNPs, the LD structure, relatedness patterns, and the underlying genetic architecture. Nevertheless, our method is designed to be applicable to massive datasets for which the heritability estimates are relatively precise. Developing scalable variance components estimators that are as efficient as REML-based methods is an important direction for future work. Second, this work has primarily explored the partitioning of heritability across discrete annotations. While we have shown how the methodology can be extended to continuous-valued annotations (see “Methods” section and Supplementary Notes), it would be of interest to explore variation in trait heritability as a function of the value of an annotation. On the other hand, the ability of RHE-mc to fit many annotations allows the annotation to be divided into a sufficiently large number of bins. Third, we have applied RHE-mc to binary traits available in the UK Biobank treating these traits as continuous. Methods that explicitly model binary traits as well as the underlying ascertainment involved in case-control studies are likely to lead to more accurate heritability estimates^{23,31}. For example, the PCGC method^{23} is an extension of HE regression and it would be of interest to develop a scalable randomized PCGC estimator. Fourth, RHE-mc requires access to individual-level genotype and phenotype data. Methods that only require summary statistic data (GRE^{10}, LDSC^{24}, and SumHer^{25}) have the advantage of being applicable to datasets where acquiring access to individual-level data can be challenging^{10}.
Finally, our method could potentially lead to improvements in association testing, trait prediction, and understanding of polygenic selection.
Methods
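Multi-component linear mixed model
RHE-mc attempts to fit the following variance component model:
\[{\bf{y}}=\sum_{k=1}^{K}{{\boldsymbol{X}}}_{k}{\boldsymbol{\beta }}_{k}+{\boldsymbol{\epsilon }},\qquad {\boldsymbol{\beta }}_{k} \sim {\mathcal{D}}\left({\bf{0}},\frac{{\sigma }_{k}^{2}}{{M}_{k}}{{\bf{I}}}_{{M}_{k}}\right),\qquad {\boldsymbol{\epsilon }} \sim {\mathcal{D}}\left({\bf{0}},{\sigma }_{e}^{2}{{\bf{I}}}_{N}\right)\]
Here y is an N-vector of centered phenotypes and each of the M SNPs is assigned to one of K non-overlapping categories. Each category k contains M_{k} SNPs, k ∈ {1, …, K}, ∑_{k}M_{k} = M. X_{k} is an N × M_{k} matrix, where x_{k,n,m} denotes the standardized genotype for individual n at SNP m in category k. We have ∑_{n}x_{k,n,m} = 0 and \({\sum }_{n}{x}_{k,n,m}^{2}=N\) for m ∈ {1, 2, …, M_{k}}. β_{k} denotes the M_{k}-vector of SNP effect sizes for the kth category and \({\mathcal{D}}(\boldsymbol{\mu} ,\boldsymbol{\Sigma})\) is an arbitrary distribution with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\). In the above model, \({\sigma }_{e}^{2}\) is the residual variance, and \({\sigma }_{k}^{2}\) is the variance component of the kth category. The total SNP heritability is defined as
\[{h}_{{\mathrm{{SNP}}}}^{2}=\frac{\sum_{k=1}^{K}{\sigma }_{k}^{2}}{\sum_{k=1}^{K}{\sigma }_{k}^{2}+{\sigma }_{e}^{2}}\]
The SNP heritability of category k is defined as
\[{h}_{k}^{2}=\frac{{\sigma }_{k}^{2}}{\sum_{k^{\prime}=1}^{K}{\sigma }_{k^{\prime}}^{2}+{\sigma }_{e}^{2}}\]
Enrichment in bin k is defined as
\[{e}_{k}=\frac{{h}_{k}^{2}/{h}_{{\mathrm{{SNP}}}}^{2}}{{M}_{k}/M}\]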
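Method-of-moments for estimating multiple variance components
To estimate the variance components, RHE-mc uses a method-of-moments (MoM) estimator that searches for parameter values so that the population moments are close to the sample moments^{32}. Since \({\mathbb{E}}[{\bf{y}}]=0\), we derive the MoM estimates by equating the population covariance to the empirical covariance. The population covariance is given by
\[{\mathrm{cov}}({\bf{y}})={\mathbb{E}}[{\bf{y}}{{\bf{y}}}^{{\mathrm{{T}}}}]=\sum_{k=1}^{K}{\sigma }_{k}^{2}{{\boldsymbol{K}}}_{k}+{\sigma }_{e}^{2}{{\bf{I}}}_{N}\qquad (5)\]
Here \({{\boldsymbol{K}}}_{k}=\frac{{{\boldsymbol{X}}}_{k}{{\boldsymbol{X}}}_{k}^{{\mathrm{{T}}}}}{{M}_{k}}\) is the genetic relatedness matrix (GRM) computed from all SNPs of the kth category. Using yy^{T} as our estimate of the empirical covariance, we need to solve the following least-squares problem to find the variance components.
\[(\tilde{{\sigma }_{g}^{2}},\tilde{{\sigma }_{e}^{2}})=\mathop{{\mathrm{argmin}}}\limits_{{\sigma }_{g}^{2},{\sigma }_{e}^{2}}{\left\Vert {\bf{y}}{{\bf{y}}}^{{\mathrm{{T}}}}-\left(\sum_{k=1}^{K}{\sigma }_{k}^{2}{{\boldsymbol{K}}}_{k}+{\sigma }_{e}^{2}{{\bf{I}}}_{N}\right)\right\Vert }_{F}^{2}\]
The MoM estimator satisfies the following normal equations:
\[\left[\begin{array}{ll}{\boldsymbol{T}}&{\boldsymbol{b}}\\ {{\boldsymbol{b}}}^{{\mathrm{{T}}}}&N\end{array}\right]\left[\begin{array}{l}\tilde{{\sigma }_{g}^{2}}\\ \tilde{{\sigma }_{e}^{2}}\end{array}\right]=\left[\begin{array}{l}{\boldsymbol{c}}\\ {{\bf{y}}}^{{\mathrm{{T}}}}{\bf{y}}\end{array}\right]\qquad (7)\]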
Here \(\tilde{{\sigma }_{g}^{2}}=\left[\begin{array}{l}\tilde{{\sigma }_{1}^{2}}\\ \vdots \\ \tilde{{\sigma }_{K}^{2}}\end{array}\right]\), T is a K × K matrix with entries T_{k,l} = tr(K_{k}K_{l}), k, l ∈ {1, …, K}, b is a K-vector with entries b_{k} = tr(K_{k}) = N (because each X_{k} is standardized), and c is a K-vector with entries c_{k} = y^{T}K_{k}y. Each GRM K_{k} can be computed in time \({\mathcal{O}}({N}^{2}{M}_{k})\) and \({\mathcal{O}}({N}^{2})\) memory. Given K GRMs, the quantities T_{k,l}, c_{k}, k, l ∈ {1, …, K}, can be computed in \({\mathcal{O}}({K}^{2}{N}^{2})\). Given the quantities T_{k,l}, c_{k}, the normal Eq. (7) can be solved in \({\mathcal{O}}({K}^{3})\). Therefore, the total time complexity for estimating the variance components is \({\mathcal{O}}({N}^{2}M+{K}^{2}{N}^{2}+{K}^{3})\).
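The exact (non-randomized) estimator described above can be sketched on toy data as follows. This is a minimal illustration that forms each GRM explicitly, so it is only practical for small N; the helper names (`standardize`, `mom_exact`), the simulated genotypes, and the chosen variance components are all assumptions made for the example and are not part of the published software.

```python
import numpy as np

def standardize(G):
    """Center and scale each SNP so that sum_n x = 0 and sum_n x^2 = N."""
    G = G - G.mean(axis=0)
    return G / G.std(axis=0)  # population SD (ddof=0) gives sum of squares N

def mom_exact(X_list, y):
    """Solve the normal equations using explicitly formed GRMs."""
    N, K = y.shape[0], len(X_list)
    grms = [X @ X.T / X.shape[1] for X in X_list]          # K_k = X_k X_k^T / M_k
    # tr(K_k K_l) equals the elementwise sum of K_k * K_l because GRMs are symmetric.
    T = np.array([[np.sum(Ka * Kb) for Kb in grms] for Ka in grms])
    b = np.array([np.trace(Kk) for Kk in grms])            # each entry equals N
    c = np.array([y @ Kk @ y for Kk in grms])
    A = np.block([[T, b[:, None]], [b[None, :], np.array([[float(N)]])]])
    rhs = np.append(c, y @ y)
    sol = np.linalg.solve(A, rhs)
    return sol[:K], sol[K], (A, rhs)

# Toy data: two categories of simulated genotypes (illustrative only).
rng = np.random.default_rng(0)
N, M_k = 300, 150
freqs = rng.uniform(0.1, 0.5, size=(2, M_k))
X_list = [standardize(rng.binomial(2, f, size=(N, M_k)).astype(float)) for f in freqs]
beta = [rng.standard_normal(M_k) * np.sqrt(0.25 / M_k) for _ in X_list]
y = sum(X @ bk for X, bk in zip(X_list, beta)) + rng.standard_normal(N) * np.sqrt(0.5)
y = y - y.mean()                                           # centered phenotype

sigma2_g, sigma2_e, (A, rhs) = mom_exact(X_list, y)
```

With only 300 individuals the estimates are noisy; the point of the sketch is the structure of T, b, and c, not statistical accuracy. Note that the last normal equation forces the estimated components to sum to y^{T}y/N.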
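RHE-mc: Randomized estimator of multiple variance components
The key bottleneck in solving the normal Eq. (7) is the computation of T_{k,l}, k, l ∈ {1, …, K}, which takes \({\mathcal{O}}({N}^{2}M)\). Instead of computing the exact value of T_{k,l}, we use an unbiased estimator of the trace^{33} based on the following identity: for a given N × N matrix C, z^{T}Cz is an unbiased estimator of tr(C) (E[z^{T}Cz] = tr[C]), where z is a random vector with mean zero and covariance I_{N}. Hence, we can estimate the values T_{k,l}, k, l ∈ {1, …, K} as follows:
\[{T}_{k,l}={\mathrm{tr}}({{\boldsymbol{K}}}_{k}{{\boldsymbol{K}}}_{l})\approx {\widehat{T}}_{k,l}=\frac{1}{B}\sum_{b=1}^{B}{{\bf{z}}}_{b}^{{\mathrm{{T}}}}{{\boldsymbol{K}}}_{k}{{\boldsymbol{K}}}_{l}{{\bf{z}}}_{b}=\frac{1}{B}\frac{1}{{M}_{k}{M}_{l}}\sum_{b=1}^{B}{{\bf{z}}}_{b}^{{\mathrm{{T}}}}{{\boldsymbol{X}}}_{k}{{\boldsymbol{X}}}_{k}^{{\mathrm{{T}}}}{{\boldsymbol{X}}}_{l}{{\boldsymbol{X}}}_{l}^{{\mathrm{{T}}}}{{\bf{z}}}_{b}\]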
Here z_{1}, …, z_{B} are B independent random vectors with zero mean and covariance I_{N}. We draw these random vectors independently from a standard normal distribution. Computing T_{k,l} using the unbiased estimator involves four multiplications of sub-matrices of the genotype matrix with a vector, repeated B times. Therefore, the total running time for estimating the matrix T is \({\mathcal{O}}(NMB+{K}^{2}NB)\).
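The randomized trace estimate can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the streaming implementation: it holds all genotypes in memory, uses hypothetical names (`standardize`, `randomized_T`), and runs on simulated toy genotypes. It relies on the symmetry of each GRM, so that z^{T}K_{k}K_{l}z = (K_{k}z)^{T}(K_{l}z), and it never forms an N × N matrix.

```python
import numpy as np

def standardize(G):
    """Center and scale each SNP so that sum_n x = 0 and sum_n x^2 = N."""
    G = G - G.mean(axis=0)
    return G / G.std(axis=0)

def randomized_T(X_list, B, rng):
    """Estimate T[k, l] = tr(K_k K_l) from B standard normal random vectors."""
    N = X_list[0].shape[0]
    Z = rng.standard_normal((N, B))                  # columns are z_1, ..., z_B
    # U[k] = K_k Z, computed as X_k (X_k^T Z) / M_k without forming K_k.
    U = [X @ (X.T @ Z) / X.shape[1] for X in X_list]
    K = len(X_list)
    T_hat = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            # Average over b of (K_k z_b)^T (K_l z_b) = z_b^T K_k K_l z_b.
            T_hat[k, l] = np.sum(U[k] * U[l]) / B
    return T_hat

# Toy genotypes for two categories (illustrative only).
rng = np.random.default_rng(1)
N, M_k = 200, 100
freqs = rng.uniform(0.1, 0.5, size=(2, M_k))
X_list = [standardize(rng.binomial(2, f, size=(N, M_k)).astype(float)) for f in freqs]

# A large B is used here only to make the toy comparison tight.
T_hat = randomized_T(X_list, B=1000, rng=rng)
```

On data of this size the estimate can be checked against the exact traces; on biobank-scale data only the randomized route is feasible, and the text above reports that B = 10 was sufficient in the simulations.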
Moreover, we can leverage the structure of the genotype matrix, which only contains entries in {0, 1, 2}. For a fixed genotype matrix X_{k}, we can improve the per-iteration time complexity of matrix–vector multiplication from \({\mathcal{O}}(NM)\) to \({\mathcal{O}}(\frac{NM}{{\mathrm{{max}}}({\mathrm{log}\,}_{3}N,{\mathrm{log}\,}_{3}M)})\) by using the Mailman algorithm^{34}. Solving the normal equations takes \({\mathcal{O}}({K}^{3})\) time so that the overall time complexity of our algorithm is \({\mathcal{O}}(\frac{NMB}{\max ({\mathrm{log}\,}_{3}(N),{\mathrm{log}\,}_{3}(M))}+{K}^{2}(K+NB))\).
RHE-mc uses a block Jackknife to estimate standard errors. In Supplementary Notes, we show how the block Jackknife estimates can be computed with little additional computational overhead. Further, we also show how covariates can be efficiently included in the model (Supplementary Notes).
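For orientation, a generic delete-one-block Jackknife standard error can be sketched as below. This is the textbook estimator, not the efficient variant described in the Supplementary Notes; the function name `block_jackknife_se`, the callback interface, and the toy data are assumptions made for illustration.

```python
import numpy as np

def block_jackknife_se(estimate_without_block, n_blocks):
    """Delete-one-block Jackknife standard error.

    estimate_without_block(j) must return the estimate recomputed with
    block j left out (a scalar or an array of estimates).
    """
    loo = np.array([estimate_without_block(j) for j in range(n_blocks)], dtype=float)
    J = n_blocks
    return np.sqrt((J - 1) / J * np.sum((loo - loo.mean(axis=0)) ** 2, axis=0))

# Toy check with blocks of size one and the sample mean as the estimator.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
se = block_jackknife_se(lambda j: np.delete(x, j).mean(), len(x))
```

For the sample mean with single-observation blocks, this reduces to the usual standard error s/√n, which makes the sketch easy to sanity-check.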
Multi-component LMM with overlapping annotations
RHE-mc can also be applied in the setting where annotations overlap. Following ref. ^{7}, the heritability of SNPs belonging to annotation k is defined as
where S_{k} is the set of SNPs in the kth annotation and M_{k} = ∣S_{k}∣. Enrichment in bin k is defined as \({e}_{k}=\frac{{h}_{k}^{2}/{h}_{{\mathrm{{SNP}}}}^{2}}{{M}_{k}/M}\).
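The enrichment statistic is the share of heritability in an annotation divided by the share of SNPs in that annotation. A minimal worked example follows; the function name and the numbers are hypothetical and serve only to illustrate the ratio.

```python
def enrichment(h2_k, h2_snp, n_snps_k, n_snps_total):
    """Enrichment e_k = (h2_k / h2_SNP) / (M_k / M)."""
    return (h2_k / h2_snp) / (n_snps_k / n_snps_total)

# Hypothetical annotation: 0.4% of SNPs explaining 2% of the SNP heritability.
e_k = enrichment(h2_k=0.01, h2_snp=0.5, n_snps_k=4_000, n_snps_total=1_000_000)
```

A value above 1 indicates that the annotation explains more heritability than expected from its size alone; in this made-up example the ratio is 0.02 / 0.004 = 5.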
Multi-component LMM with continuous annotations
We have described the derivation of RHE-mc using binary annotations. Following ref. ^{29}, we can extend RHE-mc to support continuous-valued annotations as follows:
This model is similar to the model in Eq. (1) except that here we assume that the variance of effect sizes depends on a continuous-valued annotation. Let \({\mathbf{a}}\)_{k} be an M_{k}-vector where a_{k,m} is the value of the kth annotation at SNP m (the elements of \({\mathbf{a}}_{k}\) must be nonnegative). Let S_{k} be the set of SNPs belonging to annotation k. In this model, the SNP heritability of annotation k is defined as:
To estimate the variance components of this new model, we only need to replace X_{k} with \({{\boldsymbol{X}}}_{k}{\mathrm{{diag}}}(\sqrt{{{\bf{a}}}_{k}})\) in Eq. (5) for every annotation k. We assessed the accuracy of RHE-mc in estimating variance components with continuous annotations in Supplementary Notes.
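The substitution of \({{\boldsymbol{X}}}_{k}{\mathrm{{diag}}}(\sqrt{{{\bf{a}}}_{k}})\) for X_{k} amounts to rescaling each SNP column by the square root of its annotation value before forming the GRM. The sketch below shows this on random toy data; the function name `annotation_weighted_grm` is hypothetical, and the GRM is formed explicitly only to make the equivalence easy to verify.

```python
import numpy as np

def annotation_weighted_grm(X, a):
    """GRM after replacing X with X diag(sqrt(a)): X diag(a) X^T / M_k."""
    a = np.asarray(a, dtype=float)
    if np.any(a < 0):
        raise ValueError("annotation values must be nonnegative")
    Xw = X * np.sqrt(a)            # column-wise scaling, same as X @ diag(sqrt(a))
    return Xw @ Xw.T / X.shape[1]

# Toy inputs (random numbers standing in for standardized genotypes).
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 20))
a = rng.uniform(0.0, 2.0, size=20)
K_weighted = annotation_weighted_grm(X, a)
```

Setting every annotation value to 1 recovers the ordinary GRM, so a binary annotation is a special case of this construction.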
Simulations
We performed simulations to compare the performance of RHE-mc with several state-of-the-art methods for heritability estimation that cover the spectrum of methods that have been proposed.
We considered two simulation settings. In the largescale simulation setting, we simulated phenotypes for the full set of UK Biobank genotypes consisting of M = 593,300 array SNPs and N = 337,205 individuals. We obtained the individuals by keeping unrelated white British individuals which are >3rd degree relatives (defined as pairs of individuals with kinship coefficient <1/2^{(9/2)})^{17}, and removing individuals with putative sex chromosome aneuploidy. The smallscale setting was designed so that we could compare the accuracies of RHEmc to REML methods. In this setting, we simulated phenotypes from a subsampled set of genotypes from the UK Biobank data genotypes used in largescale simulation^{35}. Specifically, we randomly chose a subset of N = 10,000 individuals from the largescale data so that we have M = 593,300 array SNPs and N = 10,000 individuals. We simulated phenotypes from genotypes using the following model which is used in refs. ^{8,10}:
where S is a normalizing constant chosen so that \(\mathop{\sum }\nolimits_{m = 1}^{M}{\sigma }_{m}^{2}={h}^{2}\). Here h^{2} ∈ [0, 1], a ∈ {0, 0.75}, b ∈ {0, 1}. β_{m}, f_{m}, and w_{m} are the effect size, the minor allele frequency, and LDAK score of mth SNP, respectively. Let c_{m} ∈ {0, 1} be an indicator variable for the causal status of SNP m. The LD score of a SNP is defined to be the sum of the squared correlation of the SNP with all other SNPs that lie within a specific distance, and the LDAK score of a SNP is computed based on local levels of LD such that the LDAK score tends to be higher for SNPs in regions of low LD^{36}. The above models relating genotype to phenotype are commonly used in methods for estimating SNP heritability: the GCTA Model (when a = b = 0 in Eq. (12)), which is used by the software GCTA^{27} and LD Score regression (LDSC)^{24}, and the LDAK Model (where a = 0.75, b = 1 in Eq. (12)) used by software LDAK^{36}. Moreover, under each model, we varied the proportion and minor allele frequency (MAF) of CVs. Proportion of CVs were set to be either 100% or 1%, and MAF of CVs drawn uniformly from [0, 0.5] or [0.01, 0.05] or [0.05, 0.5] to consider genetic architectures that are either infinitesimal or sparse, as well genetic architectures that include a mixture of common and rare SNPs as well as ones that consist of only rare or common SNPs. The true heritability were chosen from {0.1, 0.25, 0.5, 0.8}.
We generated 100 sets of simulated phenotypes for each setting of parameters and report accuracies averaged over these 100 sets.
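The simulation procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the study: it assumes that the per-SNP variance is proportional to c_m w_m^b [f_m(1 − f_m)]^a (a form inferred from the definitions above), that genotypes are standardized, and that causal SNPs are picked uniformly at random. The function name `simulate_phenotype` and its arguments are hypothetical.

```python
import numpy as np

def simulate_phenotype(X, maf, ldak, h2, a, b, causal_frac, rng):
    """Draw one phenotype vector from standardized genotypes X (N x M)."""
    n, m = X.shape
    # Causal indicators c_m: a fixed fraction of SNPs chosen at random.
    n_causal = max(1, int(round(causal_frac * m)))
    c = np.zeros(m)
    c[rng.choice(m, size=n_causal, replace=False)] = 1.0
    # Unnormalized per-SNP variances c_m * w_m^b * [f_m (1 - f_m)]^a (assumed form).
    raw = c * ldak**b * (maf * (1.0 - maf)) ** a
    # The constant S rescales the variances so that they sum to h2.
    sigma2 = h2 * raw / raw.sum()
    beta = rng.normal(0.0, np.sqrt(sigma2))
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return X @ beta + noise, sigma2

# Toy usage on synthetic genotypes (not UK Biobank data).
rng = np.random.default_rng(0)
maf = rng.uniform(0.05, 0.5, size=200)
G = rng.binomial(2, maf, size=(300, 200)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
y, sigma2 = simulate_phenotype(X, maf, np.ones(200), h2=0.5, a=0.75, b=1.0,
                               causal_frac=0.1, rng=rng)
```

Repeating the call with fresh random draws would give the replicate phenotype sets over which accuracies are averaged.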
Comparisons
For the large-scale simulations, we compared RHEmc to methods that rely on summary statistics for estimating heritability. Among the summary statistic methods, LD score regression (LDSC)^{24} estimates heritability from the slope of the regression of the GWAS χ^{2} statistics on the LD scores. Stratified LD score regression (SLDSC)^{7} is an extension of LDSC for partitioning heritability from summary statistics. SumHer is the summary statistic analog of LDAK^{25}. We ran SLDSC with 10 binary MAF bin annotations defined such that each bin contains exactly 10% of the typed SNPs; this is intended to mirror the 10 MAF bin annotations in the SLDSC “baselineLD model”^{29} (see Supplementary Table 5). To run SumHer, we used the LDAK software to compute the default “LDAK weights” using in-sample LD^{25,36,37}. We then computed “LD tagging” using 1 Mb windows centered on each SNP, as recommended^{25}. For a fair comparison, we computed LD scores for LDSC, SLDSC, GRE, and SumHer using in-sample LD among the M SNPs, and in all simulations we aimed to estimate the SNP-heritability explained by the same set of M SNPs. We describe the parameter settings of the summary statistic methods in the Supplementary Notes.
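As a rough illustration of the regression that LDSC relies on, the sketch below fits an unweighted least-squares line of χ^{2} statistics on LD scores and converts the slope to a heritability estimate via h^{2} = slope × M / N. This is a simplification: the LDSC software uses a weighted regression with block-jackknife standard errors, and the function name `ldsc_h2` and its arguments are hypothetical.

```python
import numpy as np

def ldsc_h2(chi2, ld_scores, n_samples):
    """Unweighted LD score regression: chi2_j ~ intercept + (N * h2 / M) * l_j."""
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)
    m = chi2.shape[0]
    # Design matrix with an intercept column and the LD scores.
    design = np.column_stack([np.ones(m), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    # The slope equals N * h2 / M, so rescale it to obtain h2.
    return slope * m / n_samples, intercept
```

Stratified variants replace the single LD score with one LD score per annotation and read off one coefficient per annotation.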
For the small-scale simulations, we compared RHEmc to GCTAmc and HEmc^{27}. GCTAmc and HEmc are the extensions of GCTA and HE, respectively, to a multi-component LMM, where the variance components are typically defined by binning SNPs according to their MAF as well as local LD^{8}. We ran GCTAmc, HEmc, and RHEmc using 24 bins formed by the combination of six bins based on MAF (MAF ≤ 0.01, 0.01 < MAF ≤ 0.02, 0.02 < MAF ≤ 0.03, 0.03 < MAF ≤ 0.04, 0.04 < MAF ≤ 0.05, MAF > 0.05) and four bins based on quartiles of the LDAK score of a SNP. We ran both GCTAmc and RHEmc allowing estimates of a variance component to be negative.
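The 6 × 4 binning can be expressed compactly with `np.digitize`. The sketch below is illustrative only: it takes the fourth MAF edge to be 0.04 so that the bins are contiguous, computes the LDAK quartiles across all SNPs, and uses the hypothetical name `assign_bins`.

```python
import numpy as np

# Upper edges of the first five MAF bins; the sixth bin is MAF > 0.05.
MAF_EDGES = np.array([0.01, 0.02, 0.03, 0.04, 0.05])

def assign_bins(maf, ldak):
    """Map each SNP to one of 24 bins (6 MAF bins x 4 LDAK-score quartiles)."""
    maf = np.asarray(maf, dtype=float)
    ldak = np.asarray(ldak, dtype=float)
    # right=True yields intervals of the form lo < MAF <= hi.
    maf_bin = np.digitize(maf, MAF_EDGES, right=True)       # values 0..5
    quartiles = np.quantile(ldak, [0.25, 0.5, 0.75])
    ld_bin = np.digitize(ldak, quartiles, right=True)       # values 0..3
    return maf_bin * 4 + ld_bin                             # values 0..23
```

Each bin index then defines one variance component, so SNPs sharing an index contribute to the same genetic relationship matrix.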
For comparisons of runtime, we compared RHEmc to GCTA^{27} and BOLTREML^{4}, a computationally efficient approximate method for computing the REML estimator. We ran all methods with 22 components (one for each chromosome). We also ran RHEmc with ≈300 components (corresponding to 10 Mb bins) on the UK Biobank genotypes (Supplementary Fig. 10). To create our largest dataset, we replicated individuals from the UK Biobank and used a subset of the imputed SNPs to obtain a dataset with one million individuals and one million SNPs. We used the latest versions of BOLTREML (Version 2.3.2) and GCTA (Version 1.92.1) in our comparison. All comparisons were performed on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM.
Heritability estimates in the UK Biobank
We estimated SNP-heritability for 22 complex traits (6 quantitative, 16 binary) in the UK Biobank^{17}. In this study, we restricted our analysis to SNPs that were present on the UK Biobank Axiom array used to genotype the UK Biobank. SNPs with >1% missingness and minor allele frequency <1% were removed. Moreover, SNPs that failed the Hardy–Weinberg test at significance threshold 10^{−7} were removed. We restricted our study to self-reported British white ancestry individuals who are more distantly related than 3rd-degree relatives (defined as pairs of individuals with kinship coefficient <1/2^{(9/2)})^{17}. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. After these filters, we obtained a set of N = 291,273 individuals and M = 459,792 SNPs to use in the real data analyses. We included age, sex, and the top 20 genetic principal components (PCs) as covariates in our analysis for all traits. We used PCs precomputed by the UK Biobank from a superset of 488,295 individuals. Additional covariates were used for waist-to-hip ratio (adjusted for BMI) and diastolic/systolic blood pressure (adjusted for cholesterol-lowering medication, blood pressure medication, insulin, hormone replacement therapy, and oral contraceptives).
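The SNP-level filters and the kinship cutoff can be illustrated with a short sketch. It assumes an in-memory N × M matrix of 0/1/2 allele counts with `np.nan` marking missing calls, which is not how biobank-scale data would be handled in practice. It also treats a SNP as removed if it fails either the missingness or the MAF threshold, and it omits the Hardy–Weinberg test. The names `snp_qc_mask` and `KINSHIP_CUTOFF` are hypothetical.

```python
import numpy as np

# Relatedness cutoff 1 / 2^(9/2), roughly 0.0442.
KINSHIP_CUTOFF = 2.0 ** -4.5

def snp_qc_mask(G, max_missing=0.01, min_maf=0.01):
    """Return a boolean mask of SNPs (columns of G) that pass QC."""
    G = np.asarray(G, dtype=float)
    # Fraction of missing calls per SNP.
    missing = np.isnan(G).mean(axis=0)
    # Allele frequency from the non-missing calls, folded to the minor allele.
    freq = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return (missing <= max_missing) & (maf >= min_maf)
```

Pairs of individuals whose estimated kinship coefficient is at or above `KINSHIP_CUTOFF` would be treated as related, and one member of each such pair would be dropped.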
Heritability partitioning
In our initial analysis, we removed SNPs with >1% missingness and minor allele frequency <1%. Moreover, we removed SNPs that failed the Hardy–Weinberg test at significance threshold 10^{−7} as well as SNPs that lie within the MHC region (Chr6: 25–35 Mb) to obtain 4,824,392 SNPs. We restricted our study to self-reported British white ancestry individuals who are more distantly related than 3rd-degree relatives (defined as pairs of individuals with kinship coefficient <1/2^{(9/2)})^{17}. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. After these filters, we obtained 291,273 individuals. We partitioned SNPs into eight bins based on two MAF bins (MAF ≤ 0.05, MAF > 0.05) and quartiles of the LD-scores. For each bin k, we computed the heritability enrichment as the ratio of the percentage of heritability explained by SNPs in bin k to the percentage of SNPs in bin k.
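The enrichment definition translates directly into code. The following sketch assumes per-bin heritability estimates and per-bin SNP counts are already available; the function name `heritability_enrichment` is hypothetical.

```python
import numpy as np

def heritability_enrichment(h2_per_bin, snps_per_bin):
    """Enrichment of bin k = (share of heritability in k) / (share of SNPs in k)."""
    h2 = np.asarray(h2_per_bin, dtype=float)
    m = np.asarray(snps_per_bin, dtype=float)
    return (h2 / h2.sum()) / (m / m.sum())
```

For example, two equally sized bins that carry 25% and 75% of the heritability have enrichments of 0.5 and 1.5. Values above 1 indicate that a bin explains more heritability than its share of SNPs would suggest.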
We considered an additional analysis in which we included SNPs with MAF > 0.1%, resulting in N = 291,273 unrelated white British individuals and M = 7,774,235 imputed SNPs. We defined 144 bins based on 4 LD bins and 36 MAF bins. The 4 LD bins are defined by quartiles of the LD-scores, and the 36 MAF bins are defined by splitting each of the following four intervals into 9 quantile bins: 0.001 ≤ MAF ≤ 0.01, 0.01 < MAF ≤ 0.05, 0.05 < MAF ≤ 0.10, 0.10 < MAF ≤ 0.50.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Access to the UK Biobank resource is available via application at: http://www.ukbiobank.ac.uk.
Code availability
RHEmc is open-source software, freely available at: https://github.com/sriramlab/RHEmc
References
1. McCulloch, C. E. & Searle, S. R. Generalized, Linear, and Mixed Models (John Wiley & Sons, 2004).
2. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565 (2010).
3. Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519 (2011).
4. Loh, P.-R. et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet. 47, 1385 (2015).
5. Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 44, 247 (2012).
6. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
7. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228 (2015).
8. Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 50, 737 (2018).
9. Gazal, S. et al. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018).
10. Hou, K. et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. https://doi.org/10.1038/s4158801904650. https://www.biorxiv.org/content/early/2019/01/23/526855.full.pdf (2019).
11. Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).
12. Kuk, A. Y. & Cheng, Y. W. The Monte Carlo Newton–Raphson algorithm. J. Stat. Comput. Simul. 59, 233–250 (1997).
13. Liu, J. S. & Wu, Y. N. Parameter expansion for data augmentation. J. Am. Stat. Assoc. 94, 1264–1274 (1999).
14. Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440–1450 (1995).
15. Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strandén, I. & Thompson, R. Employing a Monte Carlo algorithm in Newton-type methods for restricted maximum likelihood estimation of genetic parameters. PLoS ONE 8, e80821 (2013).
16. Runcie, D. E. & Crawford, L. Fast and flexible linear mixed models for genome-wide genetics. PLoS Genet. 15, e1007978 (2019).
17. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
18. Haseman, J. & Elston, R. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19 (1972).
19. Zhou, X. A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann. Appl. Stat. 11, 2027 (2017).
20. Wu, Y. & Sankararaman, S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics 34, i187–i194 (2018).
21. Ge, T., Chen, C.-Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017).
22. Visscher, P. M. et al. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genet. 10, e1004269 (2014).
23. Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA 111, E5272–E5281 (2014).
24. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291 (2015).
25. Speed, D. & Balding, D. J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet. 51, 277–284 (2019).
26. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228 (2015).
27. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
28. Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114 (2015).
29. Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421 (2017).
30. Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at 588020 (2019).
31. Weissbrod, O., Flint, J. & Rosset, S. Estimating SNP-based heritability and genetic correlation in case-control studies directly and with summary statistics. Am. J. Hum. Genet. 103, 89–99 (2018).
32. Henderson, C. R. Estimation of variance and covariance components. Biometrics 9, 226–252 (1953).
33. Hutchinson, M. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. Simul. Comput. 18, 1059–1076 (1989).
34. Liberty, E. & Zucker, S. W. The mailman algorithm: a note on matrix–vector multiplication. Inf. Process. Lett. 109, 179–182 (2009).
35. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
36. Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 91, 1011–1021 (2012).
37. Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986 (2017).
Acknowledgements
This research was conducted using the UK Biobank Resource under applications 33127 and 33297. We thank the participants of UK Biobank for making this work possible. We thank Rob Brown, Steven Gazal, and members of the Sankararaman and Pasaniuc labs for feedback on this manuscript. This work was funded by NIH grants R01HG009120 (B.P. and K.S.B.), R35GM125055 (S.S.), an Alfred P. Sloan Research Fellowship (S.S.), and an NSF grant III1705121 (A.P., Y.W., and S.S.).
Author information
Contributions
A.P. and S.S. conceived and designed the experiments. A.P. performed the experiment and statistical analyses. Y.W., K.S.B., and K.H. collected and managed the data. Y.W., K.S.B., K.H., and A.Z. assisted with the experiments. B.P. consulted on analysis and interpretation of the data. A.P., K.S.B., B.P., and S.S. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Doug Speed, Bjarni Vilhjalmsson, and the other, anonymous reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pazokitoroudi, A., Wu, Y., Burch, K.S. et al. Efficient variance components analysis across millions of genomes. Nat Commun 11, 4020 (2020). https://doi.org/10.1038/s41467020175769
DOI: https://doi.org/10.1038/s41467020175769