Efficient variance components analysis across millions of genomes

While variance components analysis has emerged as a powerful tool in complex trait genetics, existing methods for fitting variance components do not scale well to large-scale datasets of genetic variation. Here, we present a method for variance components analysis that is accurate and efficient: capable of estimating one hundred variance components on a million individuals genotyped at a million SNPs in a few hours. We illustrate the utility of our method in estimating and partitioning variation in a trait explained by genotyped SNPs (SNP-heritability). Analyzing 22 traits with genotypes from 300,000 individuals across about 8 million common and low frequency SNPs, we observe that per-allele squared effect size increases with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD) consistent with the action of negative selection. Partitioning heritability across 28 functional annotations, we observe enrichment of heritability in FANTOM5 enhancers in asthma, eczema, thyroid and autoimmune disorders.


.1 Computing the Standard Errors of the RHE-mc estimates
We obtain standard errors for RHE-mc using a block jackknife [1]. A jackknife subsample is created by leaving out a subset of observations from a dataset. The jackknife estimate of a parameter can be found by estimating the parameter for each subsample omitting the i-th jackknife block. A naive way to compute jackknife estimate requires computing the estimator of the parameters for every sub-sample. For instance, in our problem, if we define J jackknife blocks, then we need to run RHE-mc for every sub-sample which takes O(J( N M B max(log 3 (N ),log 3 (M )) + K 2 (K + N B))). We propose an efficient way to compute the jackknife estimate in time O( N M B max(log 3 (N ),log 3 (M )) + JK 2 (K + N B)). Let X be a N × M matrix of standardized genotypes where N and M are the number of individuals and SNPs respectively. To generate J jackknife subsamples, we partition X into J non-overlapping blocks X (1) , . . . , X (J) such that X = [X (1) , X (2) , . . . , X (J) ] . Note that for every j, X (j) is a N × M j matrix where M j is the number of SNPs in the j-th block.
We create the j-th jackknife subsample by removing the j-th block X (j) from X. To estimate the variance components of the j-th jackknife subsample, we need to compute the corresponding quantities of the jth subsample in the normal equations (Methods). Let K

Including covariates in RHE-mc
We can extend the LMM to include covariates as follows: y| , β 1 , . . . , β k = W α + k X k β k + Here W is a N × C matrix of covariates while α is a C-vector of fixed effects. It is easy to see that the matrix V = I N − W (W T W ) −1 W T is symmetric and idempotent (V 2 = V ) of rank N − C. Therefore, we consider the eigendecomposition of V = EDE T , where D is a diagonal matrix with N − C ones and C zeros on the diagonal (we can assume that first N − C elements are one). Now let the matrix U N ×(N −C) represent the first N − C columns of E. It is not hard to see that U satisfies U T U = I N −C , U U T = V , U T W = 0. Now we multiplying by U T on both sides of the above equation: The matrix U T is constant and the vector y is random. Therefore, we have The MoM estimator is obtained by solving the following ordinary least squares problem: We need to solve the following normal equations to estimate the variance components.
Commonly, the number of covariates C is small (tens to hundreds) so that including covariates does not significantly affect the computational cost. The cost of computing the elements of the normal equations 8 includes the cost of inverting W T W which is a C × C matrix and multiplying W by a real-valued N -vector which can be done in O(C 3 + N C).

Streaming version
Here we describe the streaming version of RHE-mc algorithm. In Methods section, we showed that our MoM estimator satisfy the following normal equation.
, and c is a K-vector with entries c k = y T K k y. Here we estimate T k,l as follows : Here z 1 , . . . , z B are B independent random vectors with zero mean and covariance I N .
We read genotype matrix X k for every k ∈ {1, . . . , K} block by block. We define J blocks over X k by partitioning the columns of X k to J groups such that X k = [X k (1) . . . X k (J) ].
for every pairs of genotype matrices k and l ∈ {1, .., K} do Solve the normal equation for j th sub-sample (j = 0 corresponds to the original genotype matrix used for computing the point estimates) end Compute the jackknife SE from the point estimates of J sub-samples.
In the above algorithm, the 3-D matrices Z and U need O(JKBN ) memory, the 2-D matrices V and H need O(JN ) memory. So the total space complexity will be O(JKBN ). The total running time of this implementation is O( N M B max(log 3 (N ),log 3 (M )) + JK 2 (K + N B))). For simplicity we assume that the streaming blocks are same as jackknife blocks. However, we can set the size of the streaming blocks to be different from the jackknife blocks to make the algorithm more efficient in terms of memory usage.

Parameter settings for summary statistics methods
For running LDSC we computed the LD score of each SNP within 2-Mb windows centered on the SNP. We ran LD score regression with an unconstrained intercept and with regression weights that account for correlations between association statistics at SNPs in LD and heteroscedasticity [2]. For preventing the LDSC software from dropping high-effect SNPs we used the following flags -not-M-5-50 and -chisq-max 99999. In simulations, we ran S-LDSC with 10 binary MAF bins which are defined such that each bin contains 10% of the typed SNPs; this is done to reflect the 10 MAF bin annotations in the S-LDSC baseline-LD model [3] (see Table 5 for the details of MAF bins). In analyzing the 22 real complex traits, we run S-LDSC with baseline-LD model [3]. To run SumHer, first we computed the default LDAK weights using in-sample LD [4]. After that we computed LD tagging using 1-Mb windows centered on each SNP and setting α = −0.25 as recommended [5]. We used default values for the other parameter settings for running SumHer.
To do a direct comparison among LDSC, S-LDSC, and SumHer, we ran an in-sample LD version of each method meaning that we used same set of SNPs to compute LD scores and LDAK weights, perform the regression, and estimate SNP-heritability.

Continuous annotations
We assessed the accuracy of RHE-mc in estimating variance components with continuous annotation. We simulated a phenotype with true heritability 0.5 from 9K individuals and 15k SNPs under the GCTA model. We ran RHE-mc with single component, no annotations, and standardized genotypes. We next ran RHE-mc with single component, non-standardized genotypes, where we added a continuous annotation defined as 1/var(i) for SNP i where var(i) is the variance of SNP i across individuals. We obtain concordant estimate of genome-wide SNP heritability 0.45 ± 0.03 in the first case and 0.46 ± 0.03 in the second case.

Power as a function of annotation size
To quantify the power of RHE-mc as a function of the size of an annotation, we performed simulations using N = 291, 273 unrelated white British individuals and M = 459, 792 common SNPs. We defined 8 annotations (4 MAF bins and 2 LD bins) in which we fixed the heritability of a selected bin and varied the proportion of SNPs in the selected category. We then plotted probability of rejection; the results are displayed in Supplementary Figure 11 . Furthermore, we simulated phenotypes in which we fixed the enrichment of a selected bins and varied the size of the selected bin, the results are displayed in Supplementary Table 6.

Supplementary Figures
Supplementary Figure 1: Comparison of RHE-mc heritability estimates with B = 10 and B = 100 random vectors on large-scale simulated data (M=590K array SNPs and N=337K individuals): We ran RHE-mc with 24 bins( based on 6 MAF bins and 4 LDAK bins, see Methods). Here x-axis represents the bins (i.j denotes the bin defined based on i-th ldak bin and j-th MAF bin) and y-axis represents the heritability. Boxplot whiskers extend to the minimum and maximum estimates located within 1.5× interquartile range (IQR) from the first and third quartiles, respectively. Each box plot represents estimates from 100 simulations. Diamond points and error bars represent the mean and ±2 SE centered on estimated heritability respectively. Mean and standard errors (SE's) are computed from 100 replicates. Here, we run RHE-mc using 24 bins formed by the combination of 6 bins based on MAF as well as 4 bins based on quartiles of the LDAK score of a SNP (see Methods). We run S-LDSC with 10 MAF bins (see Supplementary Table S5 ). To do a fair comparison, for every method, we computed LD scores and LDAK weights by using in-sample LD, and in all simulations we aim to estimate the SNP-heritability explained by the same set of M SNPs. Boxplot whiskers extend to the minimum and maximum estimates located within 1.5× interquartile range (IQR) from the first and third quartiles, respectively. Each box plot represents estimates from 100 simulations. Diamond points and error bars represent the mean and ±2 SE centered on estimated heritability respectively. Mean and standard errors (SE's) are computed from 100 simulations. . SNPs were partitioned based on 28 functional annotations that were defined in a previous study [6]. We grouped 22 traits in the UK Biobank into five categories (autoimmune, diabetes, respiratory, anthropometric, cardiovascular). Black bars mark ±2 standard errors centered on estimated enrichment. Annotations are ordered by the proportions of SNPs in that annotation (given in parentheses)

Supplementary
Supplementary Figure 9: Per-allele effect size squared of 22 traits as a function of MAF: We applied RHE-mc to N = 291, 273 unrelated white British individuals and M = 7, 774, 235 imputed SNPs. SNPs were partitioned into 144 bins based on LD score (4 bins based on quartiles of the LD score with i denoting the i th quartile) and MAF (36 MAF bins) (see Methods). Per allele heritability for bin k is defined as where h 2 k is the heritability attributed to bin k, M k is the number of SNPs in bin k, and f k is the average MAF in bin k. Points represent estimated per-allele heritability. Bars mark ±2 standard errors centered on estimated per-allele heritability.     . We run RHE-mc with 8 bins defined based on two MAF bins (MAF≤ 0.05, MAF> 0.05) and quartiles of the LD-scores. Furthermore, we run RHE-mc with 22 bins defined based on chromosome number. On average, partitioning based on chromosome numbers leads 21% higher estimates of genome-wide SNP heritability for 22 traits than partitioning based on MAF and LD. For instance, it leads 18% and 13% higher estimates of heritability for height and BMI respectively. We also partitioned SNP based on 10 Mb genomic regions (300 variance components).