Introduction

Large-scale association analyses using whole genome sequence (WGS) data on thousands of participants are now underway, through programs such as the NHLBI’s Trans-Omics for Precision Medicine (TOPMed) program and NHGRI’s Genome Sequencing Program. Unlike earlier Genome-Wide Association Studies (GWASs), where data were combined using meta-analyses of summary statistics, in WGS analyses participant-level data from multiple studies are often pooled, and results are obtained from a single analysis. Pooled analysis of WGS is useful due to its computational tractability and its ability to control for genetic relatedness across the pooled datasets. However, it is sensitive to a form of population stratification that is not well known. Population stratification in genetic association analysis1,2 typically refers to situations where the mean phenotype value and the allele frequency both differ across population subgroups. Unless appropriately accounted for in the analysis, e.g., by using regression-based adjustments for ancestry such as principal components or genetic relatedness matrices in linear mixed models, or their combination, it can lead to false-positive associations3,4,5. Population stratification more generally refers to differences in phenotype distribution and allele frequency across population subgroups6, and hence can also manifest as differences in phenotype variances by subgroup, combined with differences in allele frequencies. In practice, this phenomenon is common in pooled analysis of multi-study data: small differences in allele frequencies are prevalent, and the studies being pooled often have different measurement protocols, environmental exposures, and inclusion criteria, all of which can lead to different phenotype variances among studies.

Previous studies have considered the effect of combining groups with different phenotypic variances. Haldar and Ghosh6 studied the effect of population stratification due to mean differences, variance differences, and, more generally, phenotypic heterogeneity across subpopulations on false positive detections when testing variant associations with a quantitative trait. Conomos et al.7 showed that when testing variant associations in a pooled sample of Hispanics/Latinos from different Hispanic background groups, the statistical properties of the tests are improved when the model allows for different residual variances in the different background groups. Musharoff et al.8, in a preprint, studied population variance structure using statistical models of both population means and variances, and developed statistical tests for the association of genetic variants with phenotypic variability.

In this manuscript, we develop variant-specific inflation factors λvs, which quantify the degree of inflation/deflation in association testing of a single genetic variant due to population stratification at the variance level. We develop an algorithm to compute approximate variant-specific inflation factors based on allele frequencies and variances in the groups pooled together for analysis, demonstrate their usage for assessing model fit, and demonstrate the implications of population stratification at the variance level in simulations and in analyses of WGS data from TOPMed. To account for population stratification at the variance level, we use the computationally efficient and scalable approach proposed in Conomos et al.7 and implemented in GENESIS9, and show in simulations that it indeed addresses the variance stratification problem in scenarios based on Musharoff et al.8.

Results

Simulation studies

Our simulations comprised 576 settings defined by various combinations of parameters. We compared a few ways to estimate the variance parameters used in computing λvs: empirical variances based on the homogeneous and stratified variance models, and model-based variances from the heterogeneous variance model. The estimated λvs were essentially the same regardless of the method. Figure 1 compares the estimated λvs to the observed λgc in each of the simulation settings and in each of the two modeling approaches (homogeneous versus stratified variance). Settings are divided according to patterns determining whether variance stratification is expected, including same or different MAFs between the two studies, same or different error variances, and whether or not the PC affects the genetic variance. The top three rows in Fig. 1 show settings in which both the MAF and the total variances differ between the two combined studies, including settings in which both the error and genetic variance components are the same but the PC affects the genetic variance, resulting in different total variances between the studies because the mean of the PC differs between them. In these settings, variance stratification is observed when using the homogeneous variance model, in that the observed inflation can be substantially higher or lower than 1, with exact values depending on the specific parameters used in each simulation. Indeed, the computed λvs and the observed λgc are highly correlated. In contrast, the stratified variance model was robust to variance stratification across all settings, with observed inflation around 1 in all simulations. The bottom two rows of Fig. 1 show settings in which either the MAF or the variances are the same in the two combined studies. In these settings, the expected inflation computed by λvs is always 1 (no inflation). As expected, the observed inflation is the same in the homogeneous and stratified variance models. The spread seen in the values of the observed inflation, with some values higher and some lower than the desired 1, is consistent with what is expected given the number of simulation replications in each setting (10,000); see Supplementary Information for more details.

Fig. 1: Estimated variant-specific inflation factors versus observed inflation in simulations.

The figure compares variant-specific inflation factors λvs estimated in each of many simulation settings with the corresponding observed inflation λgc averaged across 10,000 repetitions of each simulation setting. Observed inflation values are provided based on a homogeneous variance model, in which a single variance parameter is estimated using the aggregated data, and based on a stratified variance model, which fits a different variance parameter to each of the two simulated studies. Each simulation setting corresponds to a single point on this figure, and the simulations are grouped (denoted by different colors and symbols) by the characteristics stated in the legend. Within each group of simulation settings, the simulations differ by specific parameter values, including MAFs, variance components, and sample sizes, while still satisfying the broad conditions of the grouped simulation settings. The dashed horizontal lines correspond to the 2.5% and 97.5% quantiles of the distribution of λgc based on 10,000 variants under the null of no inflation/deflation, obtained from simulations.

Genetic association analysis of BMI and hemoglobin concentration in TOPMed

We demonstrate the variance stratification problem in analyses of hemoglobin concentrations (HGB, N = 7596; three analysis groups) and body mass index (BMI, N = 9807; eight analysis groups) in the TOPMed freeze 4 dataset. In both analyses we computed approximate variant-specific inflation factors λvs. We investigated the inflation/deflation problems resulting from variance stratification, and verified that the patterns of inflation and deflation in the homogeneous variance analysis agree, across the different variants, with those obtained from the formula and the provided code. Figures 2 and 3 provide quantile-quantile (QQ)-plots for the HGB and BMI analyses, respectively, for three categories of variants, where theory predicts inflation \((\lambda_{vs}\ge 1.01)\), deflation \((\lambda_{vs}\le 0.99)\), or approximately no inflation (\(0.99 < {\lambda }_{vs} < 1.01\)), as well as across all variants. The plots overlay the results from the four analysis methods. While the homogeneous variance model clearly produces inflated and deflated QQ-plots in line with the theoretical expectation, when looking at all tested variants together, the inflation and deflation mask each other. These problems do not “cancel out”: one creates more Type I errors and the other more Type II errors, yet the plot of all results may lead investigators to conclude that the analysis is well calibrated. In contrast, the stratified residual variance model provides good control of Type I errors, as seen in the QQ-plots, with the exception of the bottom left panel in Fig. 3, which provides QQ-plots for the set of variants that are expected to have deflated test statistics under the pooled variance model when studying BMI: here the stratified residual variance model was also somewhat deflated. Figure 4 provides the genomic control inflation factors λgc computed over each of the variant sets provided in the QQ-plots and for each of the traits. The completely stratified and MetaCor models performed better in terms of overall QQ-plots and computed λgc values in the two analyses, in that λgc values were always closer to 1. MetaCor performed slightly better than the completely stratified model under independence, likely because it accounts for a low degree of relatedness between the strata.

Fig. 2: QQ-plots comparing observed and expected p values (−log10 transformed) from the analysis of hemoglobin concentrations.

The analyses used four approaches: a “homogeneous variance” model, which assumes that all groups in the analysis have the same variances; a “stratified variance” model, which allows for different residual variances across analysis groups; a “completely stratified indep” model, in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined in a meta-analysis under an independence assumption; and “MetaCor”, a procedure that accounts for relatedness across strata in the meta-analysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as “Approx. no inflation” when their estimated λvs was between 0.99 and 1.01, “Deflated” when their estimated λvs was lower than 0.99, and “Inflated” when their estimated λvs was higher than 1.01.

Fig. 3: QQ-plots comparing observed and expected p values (−log10 transformed) from the analysis of BMI.

The analyses used four approaches: a “homogeneous variance” model, which assumes that all groups in the analysis have the same variances; a “stratified variance” model, which allows for different residual variances across analysis groups; a “completely stratified indep” model, in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined in a meta-analysis under an independence assumption; and “MetaCor”, a procedure that accounts for relatedness across strata in the meta-analysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as “Approx. no inflation” when their estimated λvs was between 0.99 and 1.01, “Deflated” when their estimated λvs was lower than 0.99, and “Inflated” when their estimated λvs was higher than 1.01.

Fig. 4: Estimated genomic control inflation factors (λgc) across compared analyses.

The figure provides estimated λgc from the various analyses of BMI and hemoglobin concentrations, computed across sets of variants classified by their inflation/deflation patterns according to the algorithm for approximate variant-specific inflation factors (λvs). We categorized variants as “Approx. no inflation” when their estimated λvs was between 0.99 and 1.01, “Deflated” when their estimated λvs was lower than 0.99, and “Inflated” when their estimated λvs was higher than 1.01. Genomic control inflation factors λgc were computed as the ratio of the median \({\chi }_{1}^{2}\) test statistic across variants in the set to the theoretical median of the test statistic under the null hypothesis of no association.

Table 1 describes the inflation/deflation patterns of variants according to their MAF. One can see that the inflation/deflation problem is ubiquitous for rare variants, but less so for common variants. In fact, among variants with frequency <0.05, only ~4% have λvs falling in the “approximately no inflation” category. This is because the ratio between allele frequencies has a strong effect on inflation/deflation, and such ratios can become quite large when variants are rare. In the Supplementary Information, Figure S2 shows the distribution of the inflation, deflation, and “approximately no inflation” categories across variants in the two analyses, and demonstrates how similar the deflation/inflation categories are between them. Most variants stay in the same category between analyses, but some rare variants (in the figure, defined as MAF < 0.05) can be inflated in one analysis and deflated in the other. These differences arise because λvs coefficients are affected by sample sizes, variances, and allele frequencies, which all differ to some extent between the analyses due to different samples and trait characteristics.

Table 1 Variant inflation/deflation characteristics by categories of MAF.

Discussion

A standard tool for the analysis of quantitative traits is linear or linear mixed model regression. In its widely used default version, linear regression is fitted under the assumption that the phenotype’s residual variance is the same for all individuals in the analyzed sample. The extent of the consequences when the variances are not equal can be computed exactly under simplifying assumptions. Broadly, using default approaches, if a specific subgroup has a larger phenotypic variance than the other subgroups in the pooled analysis, the estimated variance of the association estimate will understate the contribution from that subgroup. The result is inflation (too many false positives) for variants whose allele frequency is greater in this subgroup than in the others, or deflation (loss of power) for variants whose allele frequency is lower in this subgroup than in the others.

While default linear regression methods assume the same variance for all subgroups, which leads to mis-calibrated tests when the assumption does not hold, standard computational tools can be adapted to allow for a stratified variance model, yielding better calibrated tests. Specifically, the problem can be alleviated by fitting a different residual variance for each study or, more generally, for each appropriately defined “analysis group” (e.g., all African Americans of a specific study). This can be viewed as fitting a different variance component for noise within each study, or as a weighted least squares approach in which the group-dependent weights are estimated. This approach is implemented in some standard genetic analysis software packages (e.g., GENESIS9). Our mathematical derivation and code can be used to assess the degree of miscalibration of association tests. The code uses an additive model, with a Binomial distribution for allele counts, as is common in GWAS. Inflation/deflation trends should be similar between additive and dominant models, though the specific values estimated under the two models would not be identical.
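
To make the weighted least squares view concrete, the following is a minimal base-R sketch on simulated data. Object names are illustrative, and this two-step version is only meant to convey the idea; it is not the GENESIS implementation, which estimates the group variances and regression parameters jointly.

```r
# Minimal base-R sketch of the stratified (group-weighted) variance idea on
# simulated data; illustrative only, not the GENESIS implementation.
set.seed(1)
n1 <- 2000; n2 <- 2000
study <- factor(rep(c("A", "B"), c(n1, n2)))
g <- c(rbinom(n1, 2, 0.05), rbinom(n2, 2, 0.25))   # different allele frequencies
y <- c(rnorm(n1, sd = 1), rnorm(n2, sd = 2))       # different residual variances, null variant

fit0 <- lm(y ~ g + study)                          # homogeneous variance fit
v_group <- tapply(resid(fit0)^2, study, mean)      # per-group residual variances
fit1 <- lm(y ~ g + study,                          # weighted refit: stratified variance model
           weights = 1 / v_group[as.character(study)])

summary(fit0)$coefficients["g", ]                  # Wald test under homogeneous variance
summary(fit1)$coefficients["g", ]                  # Wald test under stratified variance
```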

In linear regression, the stratified residual variance model allows every analysis group to have its own residual variance parameter. In the mixed model setting, where the variance is decomposed into genetic and residual variances, this model keeps the genetic variance component the same across groups but allows the residual variance to differ. Analysis groups can be defined by study, race/ethnicity, combinations of these, or any other sample characteristics that affect trait variance and may also correlate with allele frequencies. Our mathematical derivation and code for computing λvs are developed under simplifying assumptions of no covariate effects and independent observations, and therefore make no distinction between genetic and residual variance components. While in the linear regression setting (independent observations) the stratified variance model clearly suffices to account for variance heterogeneity, in the mixed model setting a residual variance stratification model may not be optimal, because it may not fully account for stratification in the genetic variance, which could be the result of study design. For example, in Fig. 5, the estimated genetic variance component of the Cleveland Family Study is much higher than those of the other studies, and than the residual variance component of the same study, perhaps because study participants were selected from families with obstructive sleep apnea, which is highly associated with obesity.

Heterogeneity in genetic variance is addressed by the “completely stratified model”, but such a model requires that individuals are independent between different groups (strata). We also used MetaCor, a method that allows for complete stratification of analysis groups while keeping genetically related individuals across these groups15. MetaCor was shown to have good statistical properties and performed well in the BMI and HGB analyses. However, it is currently computationally costlier than a pooled analysis because individual-level data are used both in the per-group computations and when computing covariances between effect size estimates across analysis groups. Computational efficiency is critical when testing the large number of variants observed in WGS studies. In addition, the MetaCor approach has not yet been extended to tests of sets of rare variants (rather than the single-variant tests studied in the current manuscript). While more difficult to assess, variance stratification likely affects tests of rare variant sets as well, and methods that use a score test based on a null model fit once, such as the stratified variance approach implemented in GENESIS, extend straightforwardly to such settings.

As the sample sizes of TOPMed grow, pooling together more diverse studies and populations, variance stratification problems may become more severe. Models allowing for pooled analysis with both group-specific residual and genetic variances, or robust variance estimates, may be needed for better control of Type I errors and increased efficiency. Until such methods are developed, we recommend first using the stratified variance approach, because it is computationally efficient, it can account for relatedness across the entire sample, and the same null model can be used to test variant sets. As a second step, we recommend computing approximate λvs and assessing whether observed inflation/deflation remains for test statistics within groups of variants predicted to be inflated/deflated based on their λvs values. If inflation/deflation is observed despite residual variance stratification, the analyst would ideally move forward with a meta-analytic approach such as MetaCor (which does not discard data but is computationally more demanding), or with standard meta-analysis after removing individuals to generate genetically independent strata.

Fig. 5: Estimated variance components across compared analyses.

The figure provides the estimated variance components corresponding to residual and genetic relatedness in the analyses of BMI and hemoglobin concentration (HGB). For each analysis group, the estimated variance components were computed based on the analysis of the group alone, and were extracted from the second null model in the fully-adjusted two-stage rank-normalization procedure, to match the procedure used for association analysis.

Methods

The linear model

For a total sample size of n, we assume that the data follow a linear model denoted as

$${y}_{i}={\beta }_{0}+{g}_{i}\beta +{{\epsilon }}_{i},\,1 \le i \le n,$$
(1)

where yi is the trait or phenotype value of person i, gi is their count of coded alleles (i.e., genotype), β0 denotes the mean outcome in those with no copies of the coded allele, β denotes the effect on the mean trait of each additional copy of the coded allele, and the \({{\epsilon }}_{i}\) are residual errors, which, as a simplifying assumption, we take for now to be independent.

To provide intuition for the variance stratification problem, we first present a mathematical derivation in simplified settings. We assume that the genotype effect is null (β = 0) and that the errors follow a normal distribution \({{\epsilon }}_{i}\sim N(0,\,{\sigma }_{i}^{2})\). We further assume that the phenotypes are centered and that the genotypes are centered and follow a dominant mode of inheritance, i.e., we use \({g}_{i}=({\tilde{g}}_{i}-p)\), where \({\tilde{g}}_{i}\) is the genotype under a dominant mode (taking values 1 or 0), p is the frequency of carrying any copy of the variant allele, and gi is used in the analysis.

Implication of variance stratification on the Wald test

The Wald test quantifies the strength of the genetic association by dividing a regression-based estimate of β by its corresponding estimated standard error. The linear regression estimate of the effect (written in the general regression form) is

$${\hat{\beta}}=\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}\right)^{-1}\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}{y}_{i}\right).$$
(2)

Denoting the residual variance of individual i by \({\sigma }_{i}^{2}\) (which may differ across individuals), the variance of \(\hat{\beta }\) is

$${\rm{var}}({\hat{\beta }})=\,\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}\right)^{-2}\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}{\sigma }_{i}^{2}\right).$$
(3)

When the variance of the residuals is homogeneous across all individuals, this is

$${\rm{var}}({\hat{\beta}})=\,\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}\right)^{-2}\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}\right){\sigma }^{2}=\,\left(\mathop{\sum }\limits_{i=1}^{n}{g}_{i}^{2}\right)^{-1}{\sigma }^{2},$$
(4)

where σ2 is the common variance parameter. To illustrate how this approach can mislead under variance stratification, we consider the situation where two studies are present, of sizes n1 and n2, respectively, such that \({n}_{1}+{n}_{2}=n\). Each study is internally homogeneous, with error variances \({\sigma }_{1}^{2}\) and \({\sigma }_{2}^{2}\), and we write p1 and p2 for the study-specific frequencies of the variant of interest (under the dominant mode). Because the genotypes are centered, for study \(j=1,2\) we have \({\rm{E}}\big[\mathop{\sum}\limits_{i\in {S}_{j}}{g}_{i}^{2}\big]={n}_{j}{\rm{E}}[{g}_{i}^{2}]={n}_{j}\left[{p}_{j}{(1-{p}_{j})}^{2}+(1-{p}_{j}){(0-{p}_{j})}^{2}\right]={n}_{j}{p}_{j}(1-{p}_{j}).\) We can re-write Eq. (3) as:

$${\rm{var}}({\hat{\beta}})=\frac{{n_{1} p_{1}(1-p_{1})\sigma_{1}^{2}+ n_{2} p_{2}(1- p_{2})\sigma_{2}^{2}}}{{[n_{1}{p}_{1}(1- p_{1})+ n_{2} p_{2}(1- p_{2})]^{2}}}$$
(5)

We see that the actual variance is a linear combination of the variance parameters \({\sigma }_{1}^{2}\) and \({\sigma }_{2}^{2}\), with the weight assigned to each depending on the minor allele frequency and sample size in the corresponding group. When the minor allele frequencies (MAFs) are equal, p1 = p2, the two forms (4) and (5) are equal: the genotype distribution is then the same in the two studies, and no confounding with the variance structure occurs. But when \({p}_{1}\;\ne\; {p}_{2}\), the variance of the estimator upweights the residual variance in the group where the variant is more common, which the homogeneous-variance form (4) does not capture. This result generalizes straightforwardly to M studies.
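
As a numerical illustration of Eqs. (4) and (5), the short R snippet below evaluates both forms for assumed example values; for Eq. (4), the common σ2 is taken, purely for illustration, to be the sample-size-weighted average of the two study variances.

```r
# Numerical illustration of Eqs. (4) and (5); example parameter values are assumed.
var_beta <- function(n1, n2, p1, p2, s1sq, s2sq) {
  w1 <- n1 * p1 * (1 - p1)      # expected sum of squared centered genotypes, study 1
  w2 <- n2 * p2 * (1 - p2)      # expected sum of squared centered genotypes, study 2
  het <- (w1 * s1sq + w2 * s2sq) / (w1 + w2)^2                # Eq. (5)
  hom <- ((n1 * s1sq + n2 * s2sq) / (n1 + n2)) / (w1 + w2)    # Eq. (4), pooled variance
  c(heterogeneous = het, homogeneous = hom)
}

var_beta(5000, 5000, p1 = 0.01, p2 = 0.05, s1sq = 2, s2sq = 1)  # p1 != p2: forms differ
var_beta(5000, 5000, p1 = 0.05, p2 = 0.05, s1sq = 2, s2sq = 1)  # p1 == p2: forms coincide
```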

In some studies, researchers use mixed models in GWAS to account for genetic relationships between individuals. The variance is then usually assumed to decompose into error and genetic variance components, so that \({\rm{var}}({\epsilon })=\,{\sigma }_{e}^{2}+{\sigma }_{g}^{2}\). When using unrelated individuals and not accounting for genetic relatedness via a genetic relationship matrix, the two variance components are not identifiable, and accounting for differences in error variances is the same as accounting for differences in total variance (the sum of the two variance components). Musharoff et al.8 introduced a model in which the variance depends on individual-specific genetic components. For example, it could depend on a principal component (PC) of the genetic data, with \({\rm{var}}({\epsilon })={\sigma }_{e}^{2}+\,{\theta }_{g}^{2}{\sigma }_{g}^{2}\), where θg is the PC value, varying across individuals. We address this setting in simulations.

Computing approximate variant-specific inflation factors

We can use mathematical derivations under homogeneity and heterogeneity, relaxing the restrictive assumptions made earlier, to compute variant-specific inflation factors. These make use of the standard “sandwich” formula for large-sample approximations of the variance of estimators; for a minimally technical summary see Result 2.1 in Wakefield10, or for more detail Sections 5.2–5.3 of Van der Vaart11. We now allow for an additive genetic model and do not assume that the genotypes are standardized. The variance estimator used by the Wald test is as follows:

$${\rm{var}}({({\hat{\beta }}_{0},{\hat{\beta }})}^{T}\,)=\,\left[\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)^{T}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\right]^{-1}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)^{T}\\ \times cov({\boldsymbol{y}})\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\left[{\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)}^{T}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\right]^{-1}$$

which simplifies, if \({\rm{var}}({y}_{i})=\,{\sigma }^{2}\) for all \(i=1,\,\ldots ,\,n\), to:

$${\rm{var}}({({\hat{\beta }}_{0},\hat{\beta })}^{T}\,)=\,{\left[{\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)}^{T}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\right]}^{-1}{\sigma }^{2},$$

but allowing for different variances per study, it becomes:

$${\rm{var}}(({\hat{\beta}}_{0},{\hat{\beta}})^{T})= \, \left[\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)^{T}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\right]^{-1}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)^{T}\\ \, \times\left(\begin{array}{ccc}{\sigma }_{1}^{2} & 0 & 0\\ 0 & \ddots & 0\\ 0 & 0 & {\sigma }_{M}^{2}\end{array}\right)\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\left[\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)^{T}\left(\begin{array}{cc}1 & {g}_{1}\\ \vdots & \vdots \\ 1 & {g}_{n}\end{array}\right)\right]^{-1}$$

Based on these two expressions, we propose an algorithm to compute an approximate variant-specific inflation factor. For computational purposes, we further simplify these expressions by taking advantage of the fact that there are repeated rows (e.g., people who have gi = 1 and are from the same study, and therefore have the same residual variance). The algorithm below uses the additional assumption that the phenotypic variance within each study does not vary with genotype, which must hold under the strong null hypothesis of no association in any subpopulation. It also uses the simplifying assumption that variants are in Hardy-Weinberg Equilibrium (HWE) within each study population; testing HWE is a standard preprocessing step for genotype data.

Algorithm for computing variant-specific inflation factors

Suppose that an analyst wishes to estimate a vector of regression parameters \({({\beta }_{g},{\beta }_{1},\ldots ,{\beta }_{M})}^{T}\), where βg is a variant association measure and \({\beta }_{1},\ldots ,\,{\beta }_{M}\) are intercepts for M analysis groups. Denote the genotype of the ith individual in the mth analysis group by gmi. The design matrix for estimating these parameters in linear regression is of the form

$$\left(\begin{array}{ccccc}{{\bf{g}}}_{1} & {{\bf{1}}}_{{n}_{1}} & {{\bf{0}}}_{{n}_{1}} & \ldots & {{\bf{0}}}_{{n}_{1}}\\ {{\bf{g}}}_{2} & {{\bf{0}}}_{{n}_{2}} & {{\bf{1}}}_{{n}_{2}} & \ldots & {{\bf{0}}}_{{n}_{2}}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {{\bf{g}}}_{M} & {{\bf{0}}}_{{n}_{M}} & {{\bf{0}}}_{{n}_{M}} & \ldots & {{\bf{1}}}_{{n}_{M}}\end{array}\right)$$

where \({{\bf{g}}}_{m}={({g}_{m1},\ldots ,{g}_{m{n}_{m}})}^{T}\), \({{\bf{1}}}_{{n}_{m}}\) is a vector of length nm with all entries equal to 1, and similarly \({{\bf{0}}}_{{n}_{m}}\) is a vector of length nm with all entries equal to 0. Let \({\rm{V}}={\rm{var}}({\boldsymbol{y}})\) be the diagonal matrix of error variances of the outcomes. The estimator of the variances and covariances of the vector of regression parameters is \({\rm{var}}(\hat{{\boldsymbol{\beta }}})={({{\rm{W}}}^{T}{\rm{W}})}^{-1}{{\rm{W}}}^{T}{\rm{VW}}{({{\rm{W}}}^{{\rm{T}}}{\rm{W}})}^{-1}\). From the matrix \({\rm{var}}(\hat{{\boldsymbol{\beta }}})\) we are interested in the leading diagonal entry, which is the variance of \({\hat{\beta }}_{g}\). Suppose first that one constructs the matrix W using the actual data. Then:

$$({{\rm{W}}}^{T}{\rm{W}})=\left(\begin{array}{ccccc}\mathop{\sum }\limits_{m=1}^{M}{{\bf{g}}}_{m}^{T}{{\bf{g}}}_{m} & {{\bf{g}}}_{1}^{T}{{\bf{1}}}_{{n}_{1}} & {{\bf{g}}}_{2}^{T}{{\bf{1}}}_{{n}_{2}} & \ldots & {{\bf{g}}}_{M}^{T}{{\bf{1}}}_{{n}_{M}}\\ {{\bf{g}}}_{1}^{T}{{\bf{1}}}_{{n}_{1}} & {n}_{1} & 0 & \ldots & 0\\ {{\bf{g}}}_{2}^{T}{{\bf{1}}}_{{n}_{2}} & 0 & {n}_{2} & \ldots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {{\bf{g}}}_{M}^{T}{{\bf{1}}}_{{n}_{M}} & 0 & 0 & \ldots & {n}_{M}\end{array}\right)$$

Now, instead of using the genotypes themselves, we use their large sample limits under HWE: we replace \({{\bf{g}}}_{m}^{T}{{\bf{1}}}_{{n}_{m}}\) by \({n}_{m}\times \left(0\times {p}_{m}^{2}+1\times 2{p}_{m}(1-{p}_{m})+2\times {(1-{p}_{m})}^{2}\right)\), where \({n}_{m}{p}_{m}^{2}\), \(2{n}_{m}{p}_{m}(1-{p}_{m})\), and \({n}_{m}{(1-{p}_{m})}^{2}\) are the numbers of individuals from analysis group m expected to have 0, 1, and 2 effect alleles under HWE. Similarly, we replace \({{\bf{g}}}_{m}^{T}{{\bf{g}}}_{m}\) by its large sample limit under HWE.

Notice that the quantity \(0\times {p}_{m}^{2}+1\times 2{p}_{m}(1-{p}_{m})+2\times {(1-{p}_{m})}^{2}\) is a product of two vectors: \((0,1,2)\times {({p}_{m}^{2},\,2{p}_{m}(1-{p}_{m}),\,{(1-{p}_{m})}^{2})}^{T}\). Thus, we define a matrix X and a matrix P such that \(({\rm{W}}^{{\rm{T}}}{\rm{W}})={\rm{X}}^{{\rm{T}}}{\rm{PX}}\). In matrix X, the left column has values \({(0,1,2,\ldots ,0,1,2)}^{T}\) (\({(0,1,2)}^{T}\) repeating for each study) in place of the observed genotypes \({({g}_{11},\ldots ,{g}_{M{n}_{M}})}^{T}\), and the other columns represent study-specific intercepts; the matrix P is a diagonal matrix providing the HWE genotype probabilities for each study, scaled by the proportion of individuals that the analysis group contributes to the pooled sample. We use the matrices X, P, and V = var(y) to similarly replace WTVW by its large sample limit under HWE. Specifically, define:

  • \({\rm{X}}=({\rm{G}}\,{\rm{D}})\), where G is a vector of length 3M of the form \({(0,1,2,\ldots ,0,1,2)}^{T}\), and D is a 3M × M design matrix modeling study-specific intercepts, whose i, j element Dij is

    $$\left\{\begin{array}{ll}1 & {\rm{if}}\;3j-2\le i\le 3j\\ 0 & {\rm{otherwise}}.\end{array}\right.$$
  • P, a 3M × 3M diagonal matrix, in which each entry gives the population proportion in each combination of genotype and study, i.e.,:

$${\rm{P}}=diag\left(\frac{{n_{1}}}{n}{p}_{1}^{2},\,\frac{{n_{1}}}{n}2{p}_{1}(1-{p}_{1}),\,\frac{{n_{1}}}{n}{(1-{p}_{1})}^{2},\,\ldots\right.$$
$$\left.\frac{{n_{M}}}{n}{p}_{M}^{2},\,\frac{{n_{M}}}{n}2{p}_{M}(1-{p}_{M}),\,\frac{{n_{M}}}{n}{(1-{p}_{M})}^{2}\right)$$

  • V, a 3M × 3M diagonal matrix, in which each entry gives the outcome variance in each combination of genotype and study:

$${\rm{V}}=diag({\sigma }_{1}^{2},{\sigma }_{1}^{2},{\sigma }_{1}^{2},\,\ldots ,\,{\sigma }_{M}^{2},\,{\sigma }_{M}^{2},\,{\sigma }_{M}^{2})$$

Define now \({\rm{B}}={{\rm{X}}}^{T}{\rm{PX}}\) and \({\rm{A}}={{\rm{X}}}^{T}{\rm{PVX}}\), which give the large sample limits of \(({{\rm{W}}}^{T}{\rm{W}})\) and \({{\rm{W}}}^{T}{\rm{VW}}\). Under heterogeneity, the variance of the slope estimate \({\hat{\beta }}_{g}\) is proportional to the leading diagonal entry of \({{\rm{B}}}^{-1}{{\rm{AB}}}^{-1}\). Under homogeneity, the variance of \({\hat{\beta }}_{g}\) is proportional to the leading entry of \({{\rm{B}}}^{-1}\times {\rm{sum}}(diag({\rm{PV}}))\), with the same constant of proportionality. The ratio of the former leading entry to the latter gives the large-sample value of λgc, the genomic control inflation factor12 that would be obtained by comparing the median Wald test statistic to the median of the \({\chi }_{1}^{2}\) reference distribution, if all tested variants shared this variant’s study-specific MAF values. Because the formula gives different results for each variant, depending on its allele frequencies, we denote this ratio (the heterogeneous-model value relative to the homogeneous-model value) by λvs, for “variant specific”. Note that this computation requires estimates of the group-specific variances (for constructing matrix V) and allele frequencies (for constructing matrix P, under the HWE assumption), which are readily obtained.
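
The following is a minimal R sketch of the matrix computation described above. It is not the released function in the GitHub repository, and the example input values are assumed purely for illustration.

```r
# Minimal sketch of the lambda_vs matrix computation described above (not the
# released function at https://github.com/tamartsi/Variant_specific_inflation).
# Inputs: per-group sample sizes n, frequencies p (ordered so that the HWE genotype
# probabilities are p^2, 2p(1-p), (1-p)^2, as in matrix P), and group variances sigma_sq.
lambda_vs <- function(n, p, sigma_sq) {
  M <- length(n)
  G <- rep(c(0, 1, 2), times = M)                          # genotype column of X
  D <- kronecker(diag(M), matrix(1, nrow = 3, ncol = 1))   # group-specific intercepts (3M x M)
  X <- cbind(G, D)
  P <- diag(rep(n / sum(n), each = 3) *
            as.vector(sapply(p, function(q) c(q^2, 2 * q * (1 - q), (1 - q)^2))))
  V <- diag(rep(sigma_sq, each = 3))
  B <- t(X) %*% P %*% X
  A <- t(X) %*% P %*% V %*% X
  Binv <- solve(B)
  var_het <- (Binv %*% A %*% Binv)[1, 1]                   # heterogeneous variance model
  var_hom <- Binv[1, 1] * sum(diag(P %*% V))               # homogeneous variance model
  var_het / var_hom                                         # values > 1 indicate expected inflation
}

# Example with assumed values: two groups, where the larger-variance group also has
# a larger frequency of the coded allele (coded allele frequency is 1 - p here).
lambda_vs(n = c(5000, 5000), p = c(0.99, 0.95), sigma_sq = c(1, 2))
```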

An R function implementing these matrix calculations is provided, together with a tutorial that includes a coding example. These are also provided on GitHub at https://github.com/tamartsi/Variant_specific_inflation, and the function will be integrated into the GENESIS R package.

Simulation studies

We performed simulations to study the appropriateness of the proposed λvs, in terms of how well it approximates the standard genomic control coefficient λgc obtained from a “homogeneous variance” model that estimates a single variance parameter across data from all studies. We also studied whether a “stratified variance” model, allowing for different variance parameters across two studies, improves upon the homogeneous variance model. To this end, we simulated unassociated genetic variants and outcomes in a range of settings combining two studies. We simulated \({n}_{1}\) and \({n}_{2}\) individuals in study 1 and study 2, with \({n}_{1}+{n}_{2}=n\). Let yi be the outcome value of person \(i,\,i=1,\,\ldots ,\,n\), and \({\theta }_{i}\) the PC value for this person; let \({\alpha }_{1}=1\) and \({\alpha }_{2}=2\) be study-specific intercepts for studies 1 and 2, \({\sigma }_{1}^{2},\,{\sigma }_{2}^{2}\) study-specific error variances, \({\sigma }_{g}^{2}\) a common genetic variance parameter, and \({\alpha }_{\theta }=1\) the linear association of the PC with the outcome. The PC was simulated from a normal distribution with variance 1 and mean \({\mu }_{1}=2\) in study 1, and mean μ2 in study 2 computed such that the overall PC mean in the two studies together is equal to zero (i.e., \({n}_{1}{\mu }_{1}+{n}_{2}{\mu }_{2}=0\)). The outcome model was specified as:

$${y}_{i}={\alpha }_{1}{1}_{stud{y}_{1}}+{\alpha }_{2}{1}_{stud{y}_{2}}+{\theta }_{i}{\alpha }_{\theta }+{{\epsilon }}_{i},$$
(6)

with

$${{\epsilon }}_{i}\sim N(0,\,{\sigma }_{1}^{2}{1}_{stud{y}_{1}}+{\sigma }_{2}^{2}{1}_{stud{y}_{2}}+{\sigma }_{g}^{2}),$$
(7)

or

$${{\epsilon }}_{i}\sim N(0,\,{\sigma }_{1}^{2}{1}_{stud{y}_{1}}+{\sigma }_{2}^{2}{1}_{stud{y}_{2}}+{\theta }_{i}^{2}{\sigma }_{g}^{2}).$$
(8)

In model (7) the PC does not affect the genetic outcome variance, while in model (8) it does. Some of the parameters were the same in all simulations (as reported above). We varied the following parameters: \({n}_{1},{n}_{2}\in \{1000,\,5000\}\), \({\sigma }_{1}^{2},{\sigma }_{2}^{2},\,{\sigma }_{g}^{2}\in \{1,\,2\}\), and we simulated bi-allelic independent genetic variants with MAFs \({p}_{1},{p}_{2}\in \{0.01,\,0.05,\,0.5\}\) in the two studies.
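
For concreteness, below is a sketch of how a single replicate can be generated under models (6)-(8); parameter values are drawn from the ranges above, and object names are illustrative.

```r
# Sketch of one simulated replicate under models (6)-(8); illustrative names and
# one choice of parameter values from the ranges described above.
set.seed(42)
n1 <- 1000; n2 <- 5000; n <- n1 + n2
alpha1 <- 1; alpha2 <- 2; alpha_theta <- 1
sigma1_sq <- 1; sigma2_sq <- 2; sigmag_sq <- 1
p1 <- 0.05; p2 <- 0.5

study <- rep(1:2, c(n1, n2))
mu2 <- -n1 * 2 / n2                                            # so the overall PC mean is zero
theta <- rnorm(n, mean = ifelse(study == 1, 2, mu2), sd = 1)   # simulated PC
g <- ifelse(study == 1, rbinom(n, 2, p1), rbinom(n, 2, p2))    # unassociated variant

# Error variance under model (7) (PC does not affect the genetic variance):
v7 <- ifelse(study == 1, sigma1_sq, sigma2_sq) + sigmag_sq
# Error variance under model (8) (PC scales the genetic variance):
v8 <- ifelse(study == 1, sigma1_sq, sigma2_sq) + theta^2 * sigmag_sq

# Outcome following model (6); replace v7 with v8 to simulate under model (8):
y <- ifelse(study == 1, alpha1, alpha2) + theta * alpha_theta + rnorm(n, sd = sqrt(v7))
```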

We performed 10,000 simulations for each combination of parameters and, for each such setting, computed λgc as the ratio between the median observed value of the \({\chi }_{(1)}^{2}\) test statistic and its expected value under the null. We computed λvs in each of the 10,000 simulations based on the estimated variances and observed allele frequencies in each of the two simulated studies, and averaged these estimates across the simulations. We compared three approaches to estimating the variances: (1) fit a homogeneous variance model, obtain residuals \(\hat{{\epsilon }}\), and for each study estimate the variance as the average \(\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{\hat{{\epsilon }}}_{i}^{2}\), where n is the number of individuals in the study (empirical variance); (2) fit a “stratified variance” linear regression model allowing for different residual variances by study (as implemented in the R/Bioconductor package GENESIS9); (3) use the same model with stratified variances, but take the variance estimates obtained by the AI-REML algorithm (model variance). In the Supplementary Information, we provide the distribution of λgc values that would be seen under random variability alone, obtained by simulating 10,000 independent null test statistics (matching the number used in the simulations) and computing the corresponding inflation factors λgc.
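
A small self-contained sketch of the λgc computation and of the empirical (approach 1) variance estimation is shown below; the data are toy stand-ins, not the simulation settings above.

```r
# Sketch of lambda_gc from 10,000 null Wald statistics, and of the empirical
# per-study variance estimates (approach 1); toy stand-in data.
set.seed(7)
z <- rnorm(10000)                                  # stand-in for 10,000 null Wald statistics
lambda_gc <- median(z^2) / qchisq(0.5, df = 1)     # observed vs. expected median chi-square

study <- rep(1:2, each = 1000)
y <- rnorm(2000, sd = ifelse(study == 1, 1, sqrt(2)))
g <- rbinom(2000, 2, 0.2)
fit_hom <- lm(y ~ g + factor(study))               # homogeneous variance model
v_emp <- tapply(resid(fit_hom)^2, study, mean)     # per-study empirical variances
```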

Whole genome sequencing in TOPMed

For the present analysis, we used Whole Genome Sequencing (WGS) data from freeze 4 of TOPMed. WGS was performed on DNA samples extracted from blood. Sequencing was performed by the Broad Institute of MIT and Harvard (FHS and Amish) and by the Northwest Genome Center (JHS). PCR-free libraries were constructed using commercially available kits from KAPA Biosystems (Broad) or Illumina TruSeq (NWGC). Libraries were pooled for clustering and sequencing, and later de-multiplexed using barcodes. Cluster amplification and sequencing were performed according to the manufacturer’s protocols using the Illumina cBot and HiSeq X sequencer, to a read depth of >30X. Base calling was performed using Illumina’s Real Time Analysis 2 (RTA2) software. Read alignment, variant detection, genotype calling, and variant filtering were performed by the TOPMed Informatics Research Center (University of Michigan). Reads were aligned to the 1000 Genomes hs37d5 decoy reference sequence. Variant detection and genotype calling were performed jointly for several TOPMed studies (including the three analyzed here) using the GotCloud pipeline. Mendelian consistency was used to train a Support Vector Machine variant quality classifier, which was used for variant filtering. Additional quality control (pedigree checks, gender checks, and concordance with prior array data), performed by the TOPMed Data Coordinating Center, was used to detect and resolve sample identity issues. Further details (including software versions) are provided online (see: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/document.cgi?study_id=phs000964.v2.p1&phv=251960&phd=6969&pha=&pht=4838&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1).

TOPMed analyses were performed in agreement with study participants’ consent, as verified via an approval process by the parent studies’ PIs in TOPMed and the TOPMed publication committee.

Variant-specific inflation and genetic association analysis of BMI and hemoglobin concentration in TOPMed

To demonstrate the variance stratification problem, we used datasets of hemoglobin concentrations (HGB) and body mass index (BMI) from TOPMed freeze 4. For each of the traits, we computed a Genetic Relationship Matrix (GRM)13 using all available variants for the corresponding trait with minor allele frequency of at least 0.001, which was used to control for genetic relatedness in mixed models. Because some studies had individuals with different genetic backgrounds (leading to differences in allele frequencies), we defined “analysis groups” to use for assessment of variance stratification. An analysis group was defined either as all individuals from a single study (e.g., Amish), or further by both study and race/ethnic group (e.g., European and African Americans from the Cleveland Family Study were separate analysis groups). Thus, analysis groups capture multiple potential sources of trait variance, including differences in allele frequencies due to genetic ancestry, differences in environment and social/cultural factors, and differences in trait measurement by study. For both BMI and HGB, we performed single-variant association analysis for all variants with a minor allele count of at least 20. A detailed breakdown of the studies and populations used in these analyses is provided in Tables 2 and 3. The analysis strategy for both traits was to use the fully-adjusted two-stage procedure for rank-normalization of residuals, because it was shown to have better statistical properties (Type I error control and power), especially when testing possibly rare genetic variants14. Thus, we first fit a mixed linear regression model, with fixed effects for sex, age (also age² for BMI), and group defined by study and race/ethnicity, and allowing for genetic relatedness by including a variance component proportional to the GRM. We then took the residuals generated by this model, rank-normalized them, and re-fit the same model with the rank-normalized residuals as the trait. For both traits, we compared four analyses: first, a “homogeneous variance” analysis that estimates a single residual variance parameter across all individuals; second, a “stratified residual variance” model that allows a different residual variance parameter for each analysis group; third, a “completely stratified” approach that fits models and performs tests in each analysis group separately, and then combines the results via inverse-variance fixed-effects meta-analysis; and fourth, a “MetaCor” analysis15 that performs stratified analyses followed by fixed-effects meta-analysis while accounting for potential correlations due to genetic relationships between individuals in different analysis groups. The “completely stratified” and “MetaCor” analyses are slightly more flexible than the stratified variance model because they allow for different genetic variance components across analysis groups, in addition to different residual variance components. For BMI, we removed eight individuals from the “completely stratified” analysis to ensure that individuals in different groups were unrelated (no third-degree or closer relatedness across groups). All analyses other than MetaCor used the GENESIS R package.
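
For reference, the following is a schematic of how the homogeneous and stratified residual variance null models and single-variant tests are typically specified with GENESIS; the object names (pheno, grm, seq_iterator) are placeholders, the fully-adjusted two-stage rank-normalization step is omitted, and argument details may vary by package version.

```r
# Schematic of GENESIS null models and single-variant tests; pheno, grm, and
# seq_iterator are placeholder objects, not data distributed with this paper.
library(GENESIS)

# Stratified residual variance model: group-specific residual variances
nullmod_strat <- fitNullModel(pheno,
                              outcome = "bmi",
                              covars = c("sex", "age", "analysis_group"),
                              cov.mat = grm,                 # genetic relationship matrix
                              group.var = "analysis_group")  # residual variance per group

# Homogeneous variance model: same call without group.var
nullmod_hom <- fitNullModel(pheno,
                            outcome = "bmi",
                            covars = c("sex", "age", "analysis_group"),
                            cov.mat = grm)

# Single-variant score tests over a genotype iterator (e.g., a SeqVarBlockIterator)
assoc <- assocTestSingle(seq_iterator, null.model = nullmod_strat)
```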

Table 2 Analysis groups/strata participating in the BMI analysis.
Table 3 Analysis groups/strata participating in the analysis of hemoglobin concentration.

Computing variant-specific inflation factors in mixed models with residual rank-normalization

We studied the calibration of the various analyses of HGB and BMI by computing approximate variant-specific inflation factors λvs and, for diagnostics, generated QQ-plots as described below. Notably, λvs were developed assuming independent data, and applying them in the mixed model setting provides only an approximation, for two reasons: the effective sample size is smaller than the number of individuals (e.g., two full siblings have similar genetic data, so their effective sample size is <2), and there is more than a single variance parameter, so it is not straightforward to decide which variance estimates to use in computing λvs. To see this, consider the mixed-model analysis. We modeled both an error and a genetic variance component, so that, in matrix form, the model assumes that:

$${\boldsymbol{y}}={\boldsymbol{X}}\beta +{{\boldsymbol{g}}}_{{\boldsymbol{j}}}{\alpha }_{j}+\epsilon ,\,{\rm{with}}\,cov(\epsilon )={\sigma }_{e}^{2}{\boldsymbol{I}}+{\sigma }_{g}^{2}{\boldsymbol{G}}$$

where G is the GRM, and \({\sigma }_{e}^{2},\,{\sigma }_{g}^{2}\) are error and genetic variance components, respectively. Thus, the variance depends on \({\sigma }_{e}^{2}\), \({\sigma }_{g}^{2}\), and G.

In addition, we applied the fully-adjusted two-stage procedure for rank-normalization of residuals, another procedure not accounted for by the algorithm. Therefore, different possible models will yield quite different variance estimates to be used in the λvs computations, owing to the changes in the residual distributions caused by rank-normalization. Because we are alerting readers to the problems arising from assuming that variances are the same across all studies, we used variances computed based on the “homogeneous variance” null model (the same residual and genetic variance components for all analysis groups). We extracted the marginal residuals (as distinguished from the conditional residuals that can also be computed in mixed models) for each group, and computed the empirical variance for group j as \({v}_{j}=\frac{1}{{n}_{j}}\mathop{\sum}\limits_{i\in {S}_{j}}{\hat{{\epsilon }}}_{i}^{2}\). Note that this estimator does not account for relatedness. We used the residuals from the second null model of the two-stage procedure.
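
As an illustration of this computation, the sketch below evaluates the per-group empirical variances on toy residuals; resid_marginal and group stand in for the marginal residuals and analysis group labels extracted from the homogeneous-variance null model.

```r
# Toy illustration of v_j = (1/n_j) * sum of squared marginal residuals in group j;
# resid_marginal and group stand in for quantities extracted from the null model.
set.seed(3)
group <- rep(c("study1", "study2_AA", "study2_EA"), times = c(300, 200, 500))
resid_marginal <- rnorm(1000, sd = c(study1 = 1, study2_AA = 1.5, study2_EA = 0.8)[group])
v_group <- tapply(resid_marginal^2, group, mean)
```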

Assessing population stratification at the variance level through QQ-plots

Once λvs are computed for each of the variants of interest, we propose generating QQ-plots across sets of variants to visualize whether population stratification at the variance level is appropriately addressed. A function is available in the GitHub repository to generate QQ-plots stratifying variants into categories: “Inflated”, “Deflated”, and “Approx. no inflation”. The categories can be manually defined, so that a variant is assigned to the “Inflated” category if its λvs is larger than a user-specified value, e.g., 1.01. Similarly, a variant is assigned to the “Deflated” category if its λvs is lower than a user-specified value, e.g., 0.99. Variants are assigned to the “Approx. no inflation” category if they are in neither the “Inflated” nor the “Deflated” category, i.e., their λvs is close to the desired value of 1.
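
The sketch below illustrates such stratified QQ-plots with base-R graphics; it is not the repository function, and lambda_vs_values and pvalues are toy placeholders.

```r
# Toy illustration of QQ-plots stratified by lambda_vs category (base-R graphics).
set.seed(5)
lambda_vs_values <- exp(rnorm(5000, sd = 0.02))    # placeholder lambda_vs values
pvalues <- runif(5000)                             # placeholder association p values
category <- cut(lambda_vs_values, breaks = c(0, 0.99, 1.01, Inf),
                labels = c("Deflated", "Approx. no inflation", "Inflated"))

par(mfrow = c(1, 3))
for (cl in levels(category)) {
  p <- sort(pvalues[category == cl])
  plot(-log10(ppoints(length(p))), -log10(p), main = cl,
       xlab = expression(-log[10](expected~italic(p))),
       ylab = expression(-log[10](observed~italic(p))))
  abline(0, 1)
}
```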

Characterizing variants by inflation patterns

To study how common the variant inflation/deflation problem is, and how it relates to variant frequencies, we computed the proportion of variants at each level of λvs within each allele frequency category: <0.01, 0.01–0.05, 0.05–0.2, and 0.2–0.5. We also studied the similarity of inflation/deflation patterns between the BMI and HGB analyses.
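
A toy cross-tabulation in the spirit of Table 1 (placeholder values, not the TOPMed results) could be computed as follows.

```r
# Toy illustration of a Table 1-style cross-tabulation: proportion of variants in
# each lambda_vs category within MAF bins (placeholder values, not TOPMed results).
set.seed(8)
maf <- runif(5000, 0.001, 0.5)
lambda_vs_values <- exp(rnorm(5000, sd = 0.03 / sqrt(maf)))  # rarer variants drift further from 1
category <- cut(lambda_vs_values, breaks = c(0, 0.99, 1.01, Inf),
                labels = c("Deflated", "Approx. no inflation", "Inflated"))
maf_bin <- cut(maf, breaks = c(0, 0.01, 0.05, 0.2, 0.5), include.lowest = TRUE)
round(prop.table(table(maf_bin, category), margin = 1), 2)
```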

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.