Explicating heterogeneity of complex traits has strong potential for improving GWAS efficiency

The common strategy of genome-wide association studies (GWAS), which relies on large samples, faces difficulties that raise concerns that GWAS have exhausted their potential, particularly for complex traits. Here, we examine the efficiency of the traditional sample-size-centered strategy in GWAS of these traits, and its potential for improvement. The paper focuses on the results of the four largest GWAS meta-analyses of body mass index (BMI) and lipids. We show that merely increasing sample size may not make the p-values of genetic effects in large (N > 100,000) samples smaller but can instead make them larger. The efficiency of these GWAS, defined as the ratio of the log-transformed p-value to the sample size, was larger in the larger samples than in the smaller samples for only a small fraction of loci. These results emphasize the important role of heterogeneity in genetic associations with complex traits such as BMI and lipids. They highlight the substantial potential for improving GWAS by explicating this role (affecting 11–79% of loci in the selected GWAS), especially the effects of biodemographic processes, which are heavily underexplored in current GWAS and which are important sources of heterogeneity in the various study populations. Further progress in this direction is crucial for the efficient use of genetic discoveries in health care.


Supplementary Notes
Supplementary Note 1. Asymptotic expression for the efficiency for a normally distributed random variable and homogeneity/heterogeneity of targeted effects

Consider two estimates of the same regression equation from two separate sample populations, P₁ and P₂, of sizes N₁ and N₂, where N₁ > N₂; denote the respective estimates of the two regression effects as b̂₁ and b̂₂. Consider the null hypothesis (H0) that b₁ = b₂ = b₀, where b₀ is some prespecified value which, for simplicity, we assume to be b₀ = 0. Because b̂₁ and b̂₂ are sample estimates, we extend the null hypothesis by assuming that they are drawn from closely related normal populations having variance σ₀²/Nᵢ, i = 1, 2, where σ₀² is a common parameter termed the "unit variance" and where the division by Nᵢ follows from well-known properties of regression estimators (see p. 226 in Ref. 37). The assumption that σ₀² is a common parameter implies that the variances of the predictors in the regression equations are equal in the two samples, which will be true if the samples are drawn from a common parent population using the same sampling methods.
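The σ₀²/Nᵢ scaling of the estimator variance can be checked with a minimal numerical sketch (the effect size and residual SD below are illustrative values, not values from the paper): repeatedly fitting the same regression at two sample sizes drawn from one parent population should give N·Var(b̂) roughly constant across sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
b_true, sigma_eps = 0.3, 2.0  # illustrative effect size and residual SD

def bhat_variance(n, reps=4000):
    """Empirical variance of the OLS slope over `reps` replicated samples of size n."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.standard_normal(n)                     # predictor with unit variance
        y = b_true * x + sigma_eps * rng.standard_normal(n)
        slopes[r] = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # OLS slope estimate
    return slopes.var()

n1, n2 = 4000, 1000
v1, v2 = bhat_variance(n1), bhat_variance(n2)
# N * Var(bhat) estimates the common "unit variance" sigma_0^2 in both samples;
# here sigma_0^2 = sigma_eps^2 / Var(x) = 4 by construction
print(n1 * v1, n2 * v2)
```

Both products should land near 4, illustrating that the "unit variance" is shared when the two samples come from the same parent population.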
Define the t-transformation as t̂ᵢ = b̂ᵢ√Nᵢ/σ₀, i = 1, 2. It follows from the above assumptions that the estimator t̂ᵢ is distributed as a standard normal variate with variance 1. The probability that a hypothetical estimate, say t̃, is larger in magnitude than t̂ᵢ is then given by:

  p(b̂ᵢ) = ∫_{|t̃| ≥ |t̂ᵢ|} φ(t̃) dt̃,   (1)

where φ(t̃) = (1/√(2π)) exp(−t̃²/2) is the standard normal probability density function. Equation (1) is the probability of an observed effect of size b̂ᵢ or larger in standardized units in a sample of size Nᵢ, assuming that the null hypothesis is true. This is the expression for the quantity typically reported as the p-value for a given effect in GWAS.
For simplicity, we consider only the typical GWAS case where t̂ᵢ ≫ 0. Hence, we focus on the upper-tail probability of the standard normal distribution, i.e., the Q-function:

  Q(t̂ᵢ) = ∫_{t̂ᵢ}^{∞} φ(t̃) dt̃,   (2)

which can be approximated (Ref. 38) as:

  Q(t̂ᵢ) ≈ α exp(−β t̂ᵢ²),   (3)

with pseudo-parameters α = 0.0559 and β = 0.5030 estimated numerically by minimizing the squared relative differences between the probability density function for the standard normal distribution and the approximating solution for Q(t̂ᵢ) within the range 4 ≤ t̂ᵢ ≤ 20 typical for GWAS.
The figure above shows that the second approximation to the Q-function is virtually indistinguishable from the exact Q-function on a logarithmic scale over the indicated range [4, 20].
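This closeness can be cross-checked numerically. The sketch below assumes the approximation takes the exponential-quadratic form α·exp(−βt²) with the pseudo-parameters quoted above (our reading of the garbled display equation, not a formula quoted verbatim from Ref. 38) and compares it with the exact Q-function, computed via scipy's survival function, on the log₁₀ scale.

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 0.0559, 0.5030   # pseudo-parameters quoted in the text
t = np.linspace(4, 20, 200)    # range typical for GWAS

log10_exact = np.log10(norm.sf(t))                        # exact upper-tail Q(t)
log10_approx = (np.log(alpha) - beta * t**2) / np.log(10)  # assumed form of (3)

rel_err = np.abs(log10_approx / log10_exact - 1)
print(rel_err.max())   # worst relative error on the log10 scale over [4, 20]
```

Under this assumed form, the relative error of the log-transformed p-value stays within a few percent across the range, largest near t = 4 and negligible by t = 20, consistent with the curves being visually indistinguishable on a logarithmic scale.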
Hence, taking the natural logarithm of (3), changing sign, replacing the parameters by their estimators, substituting Nᵢb̂ᵢ²/σ̂₀² for t̂ᵢ², and dividing by Nᵢ, we obtain for population Pᵢ:

  êᵢ = −log₁₀ p(b̂ᵢ) / Nᵢ ≈ (−ln α̂ + β̂ Nᵢ b̂ᵢ² / σ̂₀²) / (Nᵢ ln(10)),   (4)

where, except for the scaling parameter ln(10) in the denominator accounting for the change from natural to base-10 logarithms, eᵢ is the efficiency parameter introduced in the "Efficiency measure" section of the main text and êᵢ is its estimator. Thus, the efficiency represents the ratio of the log-transformed probability of the effect bᵢ to the sample size Nᵢ. Equivalently, the efficiency can be interpreted as the log-transformed p-value per unit of observation, i.e., per person in this case.
If the null hypothesis (H0) is true, then p(b̂ᵢ) follows the uniform distribution over the unit interval [0, 1], independent of sample size Nᵢ, and êᵢ follows the exponential distribution over [0, ∞) with mean 1/(Nᵢ ln(10)).
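This null behaviour is easy to reproduce: drawing uniform p-values directly (a minimal sketch, with an arbitrary illustrative sample size) gives an efficiency whose mean matches 1/(N ln 10), since −ln p is exponential with mean 1 when p is Uniform(0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000                      # arbitrary sample size for illustration
p = rng.uniform(size=200_000)   # under H0, p-values are Uniform(0, 1)

e_hat = -np.log10(p) / N        # efficiency: log-transformed p-value per observation
print(e_hat.mean(), 1 / (N * np.log(10)))   # empirical vs theoretical mean
```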
Small p-values are generally interpreted as providing evidence against the null hypothesis, leading to consideration of two alternative hypotheses: (H1) b₁ = b₂ = b₁₂ > 0; and (H2) b₁ ≠ b₂. The variance of êᵢ, under an extension of H1 or H2 for which each b̂ᵢ, i = 1, 2, is assumed to be drawn from a normal distribution having mean bᵢ and variance σ₀²/Nᵢ, i = 1, 2, can be approximated using the delta method (e.g., Volume 1, p. 232 in Ref. 39).
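The delta-method step can be checked against simulation. The sketch below assumes equation (3) has the exponential-quadratic form α·exp(−βt²) with the pseudo-parameters quoted above (an assumption on our part), so that ê = (−ln α + βNb̂²/σ₀²)/(N ln 10); the first-order delta-method variance is then Var(ê) ≈ 4β²b²/(σ₀²·N·ln²(10)), which is compared with a Monte Carlo estimate for illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 0.0559, 0.5030
b, sigma0, N = 0.1, 1.0, 10_000       # illustrative parameters, not from the paper

# Monte Carlo: draw bhat ~ Normal(b, sigma0^2/N), map through the efficiency formula
bhat = rng.normal(b, sigma0 / np.sqrt(N), size=400_000)
e_hat = (-np.log(alpha) + beta * N * bhat**2 / sigma0**2) / (N * np.log(10))

var_mc = e_hat.var()
# First-order delta method: Var(e) ~ (de/db)^2 * Var(bhat), de/db = 2*beta*b/(sigma0^2*ln 10)
var_delta = (2 * beta * b / (sigma0**2 * np.log(10)))**2 * sigma0**2 / N
print(var_mc, var_delta)
```

For these parameters the standard error of b̂ is small relative to b, so the first-order approximation agrees with the simulation to within about a percent; the agreement degrades as σ₀/√N approaches b.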
Expressions (5–7) were derived under the assumption that the variables b̂₁ and b̂₂ are independent, whereas the samples used in our analyses were not independent. To generalize these expressions to this case, we included a correlation term (e.g., Volume 1, p. 232 in Ref. 39). Because information on the covariance between the smaller and larger samples was unavailable from the selected studies, we take into account that the larger sample P₁ is a superposition of two independent subsets, P₂ and P₀, where P₀ = P₁ − P₂. Then the effect size for P₁ can be defined as the weighted mean of the effect sizes in these two subsets, i.e., b̂₁ = [b̂₂·N₂ + Δb̂·(N₁ − N₂)]/N₁, where Δb̂ denotes the effect-size estimate in subset P₀. From this decomposition one can derive the expression for the covariance, cov(b̂₁, b̂₂) = (N₂/N₁)·Var(b̂₂) = σ₀²/N₁.

The 95% confidence intervals (CIs) for ρ evaluated using (8, 9) are given in the last two columns of Supplementary Tables 1–4. Hypothesis H1 should be rejected if the value of ρ̂ is not within the 95% CI. We also used bootstrapping to generate 95% CIs for ρ̂ = ê₁/ê₂ as a second test of hypothesis H1. For each locus in Supplementary Tables 1–4, we generated a mega sample (N = 10⁷) of two random variables, X and Y. Variable X was distributed as a standard normal variate with zero mean and unit variance. Variable Y was constructed as the dependent variable in the regression model Y ~ a + bX + ε, with intercept a, effect size b, and residual ε ~ N(0, σ). To simulate statistics for ρ̂ = ê₁/ê₂, the standard deviation of ε was defined based on the larger sample P₁ of size N₁: σ = σ̂₁ = b√N₁/Φ⁻¹(p₁), where Φ⁻¹ is the quantile function of the normal distribution and p₁ denotes the p-value for sample P₁. Next, we randomly selected a sample P₁ of size N₁ from the mega sample. Given that the larger and smaller samples in the GWAS selected for our study are not independent, we randomly generated the smaller sample P₂ of size N₂ from the larger sample P₁. For simplicity, we fixed b = 1 and a = 0.1 in the regression model for all simulations; this does not compromise the generality of the method.
Next, we conducted linear regression analysis to generate p-values p₁ and p₂ for the selected samples P₁ and P₂. Given p₁ and p₂, we evaluated ρ̂ = ê₁/ê₂. We conducted 100 and 1000 simulations for each locus to generate the c.d.f. of ρ̂ and to evaluate the 95% CIs. We did not conduct more simulations because the c.d.f. had asymptotically converged. The 95% CIs for ρ evaluated using this method with 1000 simulations confirm our analytical results given in the last two columns of Supplementary Tables 1–4.
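A scaled-down version of this simulation scheme can be sketched as follows. The mega sample here is 10⁶ rather than 10⁷, only 100 replicates are run, and N₁, N₂, and the residual SD are illustrative placeholders rather than locus-specific values from the supplementary tables; per-sample p-values come from scipy.stats.linregress.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
a, b = 0.1, 1.0                 # intercept and effect size fixed across simulations
sigma = 28.0                    # illustrative residual SD (locus-specific in the paper)
N_mega, N1, N2 = 1_000_000, 50_000, 25_000

# Mega sample: X standard normal, Y from the regression model Y ~ a + b*X + eps
X = rng.standard_normal(N_mega)
Y = a + b * X + sigma * rng.standard_normal(N_mega)

def efficiency(x, y):
    """-log10(p) per observation for the regression slope."""
    p = linregress(x, y).pvalue
    return -np.log10(p) / len(x)

rhos = []
for _ in range(100):                              # 100 replicates (1000 in the paper)
    idx1 = rng.choice(N_mega, N1, replace=False)  # larger sample P1 from the mega sample
    idx2 = rng.choice(idx1, N2, replace=False)    # nested smaller sample P2 drawn from P1
    rhos.append(efficiency(X[idx1], Y[idx1]) / efficiency(X[idx2], Y[idx2]))

ci_low, ci_high = np.percentile(rhos, [2.5, 97.5])
print(ci_low, ci_high)   # bootstrap 95% CI for rho = e1/e2
```

Drawing P₂ from within P₁ reproduces the overlap between the smaller and larger GWAS samples; under H1 (a homogeneous effect), the resulting CI for ρ should cover values near 1.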

Supplementary Tables
Supplementary Table 3 in Ref. 21. * The sample size for locus RASA2 was larger in the 2010 Nature Genetics study 20 than in the 2015 Nature study 21. Accordingly, the results for this locus for rs2035935 were taken from Ref. 20 and those for rs16851483 from Ref. 21. Δp = −(log₁₀(p₂₀₁₅) − log₁₀(p₂₀₁₀)) is the difference in the log-transformed p-values reported in the larger sample 21.