Abstract
In modern Whole Genome Sequencing (WGS) epidemiological studies, participantlevel data from multiple studies are often pooled and results are obtained from a single analysis. We consider the impact of differential phenotype variances by study, which we term ‘variance stratification’. Unaccounted for, variance stratification can lead to both decreased statistical power, and increased false positives rates, depending on how allele frequencies, sample sizes, and phenotypic variances vary across the studies that are pooled. We develop a procedure to compute variantspecific inflation factors, and show how it can be used for diagnosis of genetic association analyses on pooled individual level data from multiple studies. We describe a WGSappropriate analysis approach, implemented in freelyavailable software, which allows studyspecific variances and thereby improves performance in practice. We illustrate the variance stratification problem, its solutions, and the proposed diagnostic procedure, in simulations and in data from the TransOmics for Precision Medicine Whole Genome Sequencing Program (TOPMed), used in association tests for hemoglobin concentrations and BMI.
Introduction
Largescale association analyses using whole genome sequence (WGS) data on thousands of participants are now underway, through programs such as the NHLBI’s TransOmics for precision Medicine (TOPMed) program and NHGRI’s Genome Sequencing Program. Unlike earlier GenomeWide Association Studies (GWASs), where data were combined using metaanalyses of summary statistics, in WGS analyses participantlevel data from multiple studies are often pooled, and results are obtained from a single analysis. Pooled analysis of WGS is useful due to its computational tractability and its ability to control for genetic relatedness across the pooled datasets. However, it is sensitive to a form of population stratification that is not well known. Population stratification in genetic association analysis^{1,2} typically refers to situations where the mean phenotype value and allele frequency both differ across population subgroups. Unless appropriately accounted for in the analysis, e.g., by using regressionbased adjustments for ancestry such as principal components or genetic relatedness matrices in linear mixed models, or their combination, it can lead to false–positive associations^{3,4,5}. Population stratification more generally can refer to differences in phenotype distribution and allele frequency across population subgroups^{6}, and hence can also manifest as differences in phenotype variances by subgroup, combined with differences in allele frequencies. In practice, this phenomenon is common in pooled analysis of multistudy data, as small differences in allele frequencies are prevalent, and different studies being pooled often have different measurement protocols, environmental exposures and inclusion criteria, all of which can lead to different phenotype variances among studies.
Previous studies considered the effect of combining together groups with different phenotypic variances. Haldar and Ghosh^{6} studied the effect of population stratification due to mean differences, variance differences, and more generally, phenotypic heterogeneity, across subpopulations, on false positive detections when testing variant associations with a quantitative trait. Conomos et al.^{7} showed that when testing variant associations in a pooled sample of Hispanics/Latinos from different Hispanic background groups, statistical properties of tests are improved when the model allows for different residual variances in the different background groups. Musharoff et al.^{8}, in a preprint, studied population variance structure using statistical models of both population means and variances, and developed statistical tests for the association of genetic variants with phenotypic variability.
In this manuscript, we develop variantspecific inflation factors λ_{vs}, which quantify the degree of inflation/deflation in association testing of a single genetic variant due to population stratification at the variance level. We develop an algorithm to compute approximate variantspecific inflation factors based on allele frequencies and variances in groups pooled together for analysis, demonstrate their usage for assessing model fit, and demonstrate the implications of the population stratification at the variance level in simulations and in analyses of WGS data from TOPMed. To account for population stratification at the variance level we use the computationally efficient and scalable approach proposed in Conomos et al.^{7} and implemented in GENESIS^{9}, and show in simulations that it indeed addresses the variance stratification problem in scenarios based on Musharoff et al.^{8}.
Results
Simulation studies
Our simulations consisted of 576 simulation settings according to various combinations of parameters. We compared a few ways to estimate variance parameters to be used in computing λ_{vs}: empirical variances based on homogeneous and stratified variance models, and modelbased variances from the heterogeneous variance model. The estimated λ_{vs} were essentially the same regardless of the method. Figure 1 compares the estimated λ_{vs} to the observed λ_{gc} in each of the simulation settings and in each of the two modeling approaches (homogeneous versus stratified variance). Settings are divided according to patterns determining whether variance stratification will be expected, including same or different MAFs between the two studies, the same or different error variances, and whether the PC affects the genetic variance or not. The top three rows in Fig. 1 demonstrate settings in which both the MAF and the total variances differ between the two combined studies, including settings in which both the error and genetic variance components are the same, but the PC affects the genetic variance, resulting in different total variance between the studies because the mean of the PC differs between them. In these settings the variance stratification is observed when using the homogeneous variance model, in that the observed inflation can be substantially higher or lower than 1, with exact values depending on the specific parameters used in each simulation. Indeed, the computed λ_{vs} and the observed λ_{gc} are highly correlated. In contrast, the stratified variance model was robust to variance stratification across all settings, with observed inflation around 1 in all simulations. The bottom two rows of Fig. 1 demonstrate settings in which either the MAF or the variances are the same in the two combined studies. In these settings, the expected inflation computed by λ_{vs} is always 1 (no inflation). As expected, the observed inflation is the same in the homogeneous and stratified variance models. The spread seen in the values of the observed inflation, with some values higher and some lower than the desired 1, are consistent with that expected based on the number of replication of simulations in each setting (10,000); see Supplementary Information for more details.
Genetic association analysis of BMI and hemoglobin concentration in TOPMed
We demonstrate the variance stratification problem in analyzes of hemoglobin concentrations (HGB, N = 7596; three analysis groups) and body mass index (BMI, N = 9807; eight analysis groups) in the TOPMed freeze 4 dataset. In both analyses we computed approximate variantspecific inflation factors λ_{vs}. We investigated the inflation/deflation problems resulting from variance stratification, and verified that the patterns of inflation and deflation in the homogeneous variance analysis agree, across the different variants, with those obtained from the formula and the provided code. Figures 2 and 3 provide quantilequantile (QQ)plots for variants from three categories of variants, where theory predicts inflation \(( \lambda_{vs}\ge 1.01)\), deflation \((\lambda_{vs}\le 0.99)\) and variants with λ_{vs} “Approx. no inflation” (\(0.99 < {\lambda }_{vs} < 1.01\)), and across all variants, for HGB and BMI analyses respectively. The plots overlay the results from the four analyses methods together. While the homogeneous variance model clearly produces inflated and deflated QQplots in line with the theoretical expectation, when looking at all tested variants together, this inflation and deflation (i.e., Type I and Type II errors) mask each other, alarmingly. Despite appearances, these problems do not “cancel out”; one creates more Type I errors, one creates more Type II errors, yet the plot of all results may lead investigators to conclude that the analysis is wellcalibrated. In contrast, the stratified residual variance model provides good control of Type I errors, as seen in the QQplots, with the exception of the bottom left panel in Fig. 3, which provides QQplots for the set of variants that are expected to have deflated test statistics under the pooled variance model when studying BMI: here the stratified residual variance model was also somewhat deflated. Figure 4 provides the genomic control inflation factors λ_{gc} computed over each of the variant sets provided in the QQplots and for each of the traits. The completely stratified and MetaCor models performed better in terms of overall QQplots and computed λ_{gc} values in the two analyses, in that λ_{gc} values were always closer to 1. MetaCor performed slightly better than the completely stratified model under independence, likely because it accounts for a low degree of relatedness between the strata.
Table 1 describes the inflation/deflation patterns of variants according to their MAF. One can see that the inflation/deflation problem is ubiquitous for rare variants, but less so for common variants. In fact, for variants with frequency <0.05, only ~4% of variants have λ_{vs} falling in the “approximately no inflation” category. This is because the ratio between allele frequencies has a strong effect on inflation/deflation, and ratios can become quite high when variants are rare. In the Supplementary Information, Figure S2 shows the distribution of inflation, deflation, and “approximately no inflation” categories across variants in the two analyses, and demonstrates how similar the deflation/inflation categories are between them. Most variants stay in the same category between analyses, but some rare variants (in the figure defined as MAF < 0.05) can be inflated in one analysis and deflated in the other. These differences are because λ_{vs} coefficients are affected by sample sizes, variances, and allele frequencies, which all differ to some extent between analyses due to different samples and trait characteristics.
Discussion
A standard tool for analysis of quantitative traits is linear or linear mixed model regression. In its widelyused default version, linear regression is fitted under the assumption that the phenotype’s residual variance is the same for all individuals in the analyzed sample. The extent of the consequences if the variances are not equal sized can be computed exactly given simplifying assumptions. Broadly, using default approaches, if a specific subgroup has a larger phenotypic variance than that of other subgroups in the pooled analysis, the estimated precision of the association signal will understate the contribution from such a subgroup. The result is deflation (loss of power) for variants where allele frequency is greater in this subgroup compared to other subgroups, or inflation (too many false positives) for variants with lower allele frequency in this subgroup compared to others.
While default linear regression methods assume the same variance for all subgroups, which leads to miscalibrated tests if the assumption does not hold, standard computational tools can be adapted to allow for a stratified variance model, yielding better calibrated tests. Specifically, by fitting different residual variances for each study, or more generally, appropriately defined “analysis group” (e.g., all African Americans of a specific study) the problem can be alleviated. This can be viewed as fitting a different variance component for noise within each study, or as a weighted least squares approach, in which the groupdependent weights are estimated. This approach is implemented in some standard genetic analysis software packages (e.g., GENESIS^{9}). Our mathematical derivation and code can be used to assess the degree of miscalibration of association tests. The code uses an additive model, using a Binomial distribution for allele counts, which is commonly used in GWAS. Inflation/deflation trends should be similar between additive and dominant models, though specific values estimated using each of the two models would not be identical.
In linear regression, the stratified residual variance model allows every analysis group to have its own residual variance parameter. In the mixed model setting, where the variance is decomposed into genetic and residual variances, this model keeps the genetic variance component the same but allows for the residual variance to differ across groups. Analysis groups can be defined as study, race/ethnicity, combinations of these, or any other sample characteristics that affect trait variance and may also correlate with allele frequencies. Our mathematical derivation and code for computing λ_{vs} are under simplifying assumptions of no covariate effects and independent observations. Therefore, these make no distinction between genetic and residual variance components. While in the linear regression setting (independent observations) the variance stratification model clearly suffices to account for variance heterogeneity, in the mixed model setting, a residual variance stratification model may not be optimal, because it may not fully account for stratification in the genetic variance, which could be the result of study design. For example, in Fig. 5, the estimated genetic variance component of the Cleveland Family Study is much higher than those of other studies, and of the residual variance component of the same study, perhaps because study participants were selected from families with obstructive sleep apnea, which is highly associated with obesity. Heterogeneity in genetic variance is addressed in the “completely stratified model”, but such a model requires that individuals are independent between different groups (strata). We also used MetaCor, a method that allows for complete stratification of analysis groups, while keeping genetically related individuals across these groups^{15}. MetaCor was shown to have good statistical properties and performed well in the BMI and HGB analysis. However, it is currently computationally costlier than a pooled analysis because individual level data are used both at the individual analysis group computations, and when computing covariances between effect size estimates of all analysis groups. Computational efficiency is critical when testing the large number of variants observed in WGS studies. In addition, the MetaCor approach is not yet extended to tests of sets of rare variants (rather than single rare variants tests studied in the current manuscript). While more difficult to assess, variance stratification likely affects tests of rare variant sets as well, and methods that use a Score test based on a null model that is fit once, such as the stratified variance approach implemented in GENESIS, straightforwardly extend to such settings. As sample sizes of TOPMed grow, pooling together more diverse studies and populations, variance stratification problems may be more severe. Models allowing for pooled analysis with both groupspecific residual and genetic variances or robust variance estimates may be needed for better control of Type I errors and increased efficiency. Until such methods are developed, we recommend to first use the stratified variance approach, because it is computationally efficient, it can account for relatedness across the entire sample, and the same null model can be used to test variant sets. As a second step, we recommend computing approximate λ_{vs}, and assessing whether observed inflation/deflation remains for test statistics within groups of variants predicted to be inflated/deflated based on λ_{vs} values. If inflation/deflation are observed despite residual variance stratification, the analyst would ideally move forward with a metaanalytic approach such as MetaCor (does not discard data but computationally more demanding), or standard metaanalysis after removing individuals to generate genetically independent strata.
Methods
The linear model
For a total sample size of n, we assume that the data follow a linear model denoted as
where y_{i} is the trait or phenotype value of person i, g_{i} is their count of coded alleles (i.e., genotype), β_{0} denotes the mean outcome in those with no copies of the coded allele, β denotes the effect on the mean trait of each additional copy of the coded allele, and the \({{\epsilon }}_{i}\) are residual errors, which we for now assume are independent, as a simplifying assumption.
To provide intuition for the variance stratification problem, we first demonstrate a mathematical derivation in simplified settings. We assume that the genotype effect is null (β = 0), and that the errors follow a normal distribution \({{\epsilon }}_{i}\sim N(0,\,{\sigma }_{i}^{2})\). We further assume that the phenotypes are centered, the genotypes are centered, and follow a dominant mode of inheritance, i.e., we are using \({g}_{i}=({\tilde{g}}_{i}p)\), where \({\tilde{g}}_{i}\) is the genotype under a dominant mode (having values 1 or 0), p is the frequency of having any copies of the variant allele, and g_{i} is used in analysis.
Implication of variance stratification on the Wald test
The Wald test quantifies the strength of the genetic association by dividing a regressionbased estimate of β by its corresponding estimated standard error. The linear regression estimate of the effect (written in the general regression form) is
Denoting the estimated residual variance of individual i by \({\sigma }_{i}^{2}\) (which may differ across individuals), the variance of \(\hat{\beta }\) is
When the variance of the residuals is homogeneous across all individuals, this is
where σ^{2} is the common variance parameter. To illustrate how this approach can mislead under variance stratification, we consider the situation where two studies are present, of sizes n_{1} and n_{2}, respectively, such that \({n}_{1}+{n}_{2}=n.\) Further, each study is internally homogeneous with error variances \({\sigma }_{1}^{2}\) and \({\sigma }_{2}^{2}\), and it is also useful to write p_{1} and p_{2} for the frequency of the variant of interest (under dominant mode). Because we assume that the variant was centered in the pooled population, we have that \({\rm{E}}{[\mathop{\sum}\limits_{{i\in {S_{j}}}} g_{i}^{2}]}={n}_{i}E{[{g}_{i}^{2}]}={n}_{i}{[p_{i} (1 p_{i})^{2}+(1 p_{i}) (0p_{i})^{2}]}\), or \(E{[\mathop{\sum}\limits_{{i\in {{S}_{j}}}} g_{i}^{2}]}={n}_{i}{p}_{i}(1 p_{i}).\) We can rewrite Eq. (3) as:
We see that the actual variance is a linear combination of the variance parameters \({\sigma }_{1}^{2},\,{\sigma }_{2}^{2}\), and the weight assigned to each depends on the minor allele frequency and sample size in each group. When the minor allele frequencies (MAFs) are equal, p_{1} = p_{2}, the two forms^{4} and^{5} are equal, as there is no association between genotype and outcome, and no confounding occurs. But when \({p}_{1}\;\ne\; {p}_{2}\) then the variance of the estimator upweights the residual variance in the group where the variant is more common, which does not happen under homogeneity. This result straightforwardly generalizes for M studies.
In some studies, researchers use mixed models in GWAS to account for genetic relationships between individuals. Then, it is usually assumed that the variance is decomposed to an error and genetic variances, so that \({\rm{var}}({\epsilon })=\,{\sigma }_{e}^{2}+{\sigma }_{g}^{2}\). When using unrelated individuals and not accounting for genetic relatedness via a genetic relationship matrix, the two variance components are not identifiable and it is clear that accounting for differences in error variances is the same as accounting for differences in total variance (the sum of the two variance components). Musharoff et al.^{8} introduced a model where the variances depends on individualspecific genetic components. For example, it could depend on a principal component (PC) of the genetic data, with: \({\rm{var}}({\epsilon })={\sigma }_{e}^{2}+\,{\theta }_{g}^{2}{\sigma }_{g}^{2}\) where θ_{g} is a PC, with values varying across individuals. We address this setting in simulations.
Computing approximate variantspecific inflation factors
We can use mathematical derivations under homogeneity and heterogeneity, relaxing the restrictive assumptions provided earlier, to compute variantspecific inflation factors. These make use of standard “sandwich” formula for largesample approximations of the variance of estimators; for a minimallytechnical summary see Result 2.1 in Wakefield^{10}, or for more detail Sections 5.2–5.3 of Van der Vaart^{11}. We now allow for additive genetic model, and do not assume that the genotypes are standardized. The variance estimator used by the Wald test is now provided as follows:
which simplifies if \({\rm{cov}}({y}_{i})=\,{\sigma }^{2}\) for all \(i=1,\,\ldots ,\,n\), to:
but allowing for different variances per study, it becomes:
Based on these two expressions, we propose an algorithm to compute an approximate variantspecific inflation factor. For computational purposes, we further simplify these arguments taking advantage of the fact that there are repeated rows (e.g., people who have g_{i} = 1 and are from the same study, having the same residual variance). The algorithm below uses the additional assumption that phenotypic variance within each study does not vary with genotype—which must hold under the strong null hypothesis of no association in any subpopulation. It also uses the simplifying assumption that variants are in Hardy Weinberg Equilibrium (HWE) within each study population; testing HWE is a standard preprocessing step for genotype data.
Algorithm for computing variantspecific inflation factors
Suppose that an analyst wished to estimate a vector of regression parameters \({({\beta }_{g},{\beta }_{1},\ldots ,{\beta }_{M})}^{T}\), where β_{g} is a variant association measure, and \({\beta }_{1},\ldots ,\,{\beta }_{M}\) are intercepts for M analysis groups. Denote the genotype of the ith individual in the m analysis group by g_{mi}. The design matrix for estimating these parameters in linear regression would be of the form
where \({{\bf{g}}}_{m}={({g}_{m1},\ldots ,{g}_{m{n}_{m}})}^{T}\), \({1}_{{n}_{m}}\) is a vector of length n_{m} with all entries being equal to 1, and similarly \({0}_{{n}_{m}}\) is a vector of length n_{m} with all entries being equal to 0. Let \(V=var(y)\) be the diagonal matrix with error variances of the outcomes. The estimator of the variances and covariances of the vector of regression parameters is \({\rm{var}}(\hat{{\boldsymbol{\beta }}})={({{\rm{W}}}^{T}{\rm{W}})}^{1}{{\rm{W}}}^{T}{\rm{VW}}{({{\rm{W}}}^{{\rm{T}}}{\rm{W}})}^{1}\). From the matrix \({\rm{var}}(\hat{{\boldsymbol{\beta }}})\) we are interested in the leading diagonal value, which is the variance of \(\widehat{{\beta }_{g}}\). Suppose first that one construct the matrix W using the actual data. Then:
Now, instead of using the genotype themselves, we use the large sample limit of the expected genotype under HWE to replace the expression \({\bf{g}}_{m}^{T} {1}_{{n_{m}}}\) by \({n}_{m}\times {(0\times {p}_{m}^{2}+1\times 2{p}_{m}(1{p}_{m})+2\times (1{p}_{m})^{2})}\), where \({n}_{m}{p}_{m}^{2}\), \(2{n}_{m} {p}_{m} {(1p_{m})}\), and \({n}_{m} {(1p_{m})}^{2}\) are the number of individuals from analysis group m expected to have 0, 1, and 2 effect alleles under HWE. Similarly, we can replace \({\bf{g}}_{m}^{T} {\bf{g}}_{m}\) by its large sample limit under HWE.
Notice that the quantity \(0 \times p_{m}^{2}+1\times 2{p}_{m} {(1p_{m})} +2\times {(1p_{m})}^{2}\) is a multiplication of two vectors: \({(0,1,2)}\times {(p_{m}^{2}, 2 p_{m} {(1p_{m})}, {(1p_{m}^{T})})}^{T}\). Thus, we now define a matrix X and a matrix p, such that \(({\rm{W}}^{{\rm{T}}}{\rm{W}})={\rm{X}}^{{\rm{T}}}{\rm{PX}}\). In matrix X the left column having values \({(0,1,2,\ldots ,0,1,2)}^{T}\) (\({(0,1,2)}^{T}\) repeating for each study), instead of the actual observed genotypes \({(g_{11},\ldots , g_{{M {n_{M}}}})}^{T}\), other columns represent studyspecific intercepts, and the matrix P is a diagonal matrix providing the HWE probabilities, for each study, further scaled by the proportion of individual that each analysis group contributes to the study. We use the matrices X, P, and V = var(y) to similarly replace W^{T}VW by its large sample limit under HWE. Specifically, define:

\({\rm{X}}=({\rm{G}}\,{\rm{D}})\) where G is a vector of length 3M of the form \({(0,1,2,\ldots ,0,1,2)}^{T}\), and D is a 3M × M design matrix modeling studyspecific intercepts where the i, j element D_{ij} is
$$\left\{\begin{array}{lll}1 & {\rm{if}} & i=3m,3m1\,or\,3m2,\,j=m\\ 0 & {\,\!} & {\rm{otherwise}}.\end{array}\right.$$ 
P, a 3M × 3M diagonal matrix, in which each entry gives the population proportion in each combination of genotype and study, i.e.,:
V, is a 3M × 3M diagonal matrix, in which each entry gives the outcome variance in each combination of genotype and study.
Define now \({\rm{B}}={{\rm{X}}}^{T}{\rm{PX}}\) and \({\rm{A}}={{\rm{X}}}^{T}{\rm{PVX}}\), which give the large sample limits of \(({{\rm{W}}}^{T}{\rm{W}})\) and \({{\rm{W}}}^{T}{\rm{VW}}\). Under heterogeneity the variance of the slope estimate \(\widehat{{\beta }_{g}}\,\) is proportional to the leading diagonal entry \({{\rm{B}}}^{1}{{\rm{AB}}}^{1}\). Under homogeneity the variance \(\widehat{{\beta }_{g}}\,\) is proportional to the leading entry of \({{\rm{B}}}^{1}\times {\rm{sum}}(diag({\rm{PV}}))\), with the same constant of proportionality. The ratio of these two leading entries, squared, gives the largesample value of λ_{gc}, the genomic control inflation factor^{12} that would be obtained by comparing the median Wald test statistic to the median of the \({\chi }_{1}^{2}\) reference distribution, if all variants had the same MAF values across the studies. Because this formula provides different results for each variant, depending on the allele frequencies, we denote the ratio between the estimated values under homogeneous variance and the heterogeneous variance models λ_{vs}, for “variant specific”. Note that this function requires estimation of variances (for constructing matrix P, under HWE assumption) and allele frequencies (for constructing matrix V), which are readily obtained.
An R function implementing these matrix calculations is provided, together with a tutorial that includes a coding example. These are also provided on GitHub on https://github.com/tamartsi/Variant_specific_inflation, and the function will be integrated into the GENESIS R package.
Simulation studies
We performed simulations to study the appropriateness of the proposed λ_{vs}, in terms of how it approximates the standard genomic control coefficient λ_{gc} obtained from a “homogeneous variance” model that estimates a single variance parameter across data from all studies. We also studied whether a “stratified variance” model, allowing for different variance parameters across two studies, improves upon the homogeneous variances model. In this vein, we simulated unassociated genetic variants and outcomes in a range of settings combining two studies. We simulated \({n}_{1},\,{n}_{2}\) individuals in study 1 and study 2, \({n}_{1}+{n}_{2}=n\). Let y_{i} be the outcome value of person \(i,\,i=1,\,\ldots ,\,n\), and \({\theta }_{i}\) a PC value for this person, \({\beta }_{1}=1,\,{\beta }_{2}=2\) be studyspecific intercepts for studies 1 and 2, \({\sigma }_{1}^{2},\,{\sigma }_{2}^{2}\)studyspecific error variances, \({\sigma }_{g}^{2}\) a common genetic variance parameter, and \({\beta }_{\theta }=1\) models the linear association of the PC with the outcome. The PC was simulated from a normal distribution with variance 1, and mean \({\mu }_{1}=2\) in study 1, and mean μ_{2} in study 2 computed such that the overall PC mean in the two studies together is equal to zero (i.e., \((\mathop{\sum}\limits_{i=1}^{{n_{2}}}\theta_{i})/n_{2}\,=(\mathop{\sum}\limits_{j=1}^{{n_{1}}}\theta_{j})/{n}_{1}\)). The outcome model specified as:
With
Or
In^{7} the PC does not affect the genetic outcome variance, while in^{8} it does. Some of the parameters were the same in all simulations (as reported above). We varied the following parameters: \({n}_{1},{n}_{2}\in \{1000,\,5000\}\), \({\sigma }_{1}^{2},{\sigma }_{2}^{2},\,{\sigma }_{g}^{2}\in \{1,\,2\},\) and simulated biallelic independent genetic variants with MAFs \({p}_{1},{p}_{2}\in \{0.01,\,0.05,\,0.5\}\) in the two studies.
We performed 10,000 simulations for each combination of parameters and, for each such setting, computed λ_{gc} as the ratio between median observed and expected value of the \({\chi }_{(1)}^{2}\) test statistic (under the null). We computed λ_{vs} in each of the 10,000 simulations based estimated variances and observed allele frequencies in each of the two simulated studies, and averaged these estimates across the simulations. We compared three approaches to estimate variances^{1}: fit a homogeneous variances model, obtain residuals \(\hat{{\epsilon }}.\) For each study, estimate the variance as the average \(1/n\mathop{\sum}\limits_{i=1}^{n}\widehat{\epsilon_{i}^{2}}\) where n is the number of study individuals (empirical variance)^{2}; fit a “stratified variance” linear regression model allowing for different residual variances by study (as implemented in the R/Bioconductor package GENESIS^{9});^{3} use the same model with stratified variances, but use the variance estimates obtained by the AIREML algorithm (model variance). In the Supplementary Information, we provide a distribution of λ_{gc} values that would be seen under random variability, using 10,000 independent test statistics (as was used in simulations), and simulating test statistics under the null and computing inflation factors λ_{gc}.
Whole genome sequencing in TOPMed
For the present analysis, we used Whole Genome Sequencing (WGS) data from freeze 4 of TOPMed. WGS was performed on DNA samples extracted from blood. Sequencing was performed by the Broad Institute of MIT and Harvard (FHS and Amish) and by the Northwest Genome Center (JHS). PCRfree libraries were constructed using commercially available kits from KAPA Biosystems (Broad) or Illumina TruSeq (NWGC). Libraries were pooled for clustering and sequencing, and later demultiplexed using barcodes. Cluster amplification and sequencing were performed according to manufacturer’s protocols using the Illumina cBot and HiSeq X sequencer, to a read depth of >30X. Base calling was performed using Illumina’s Real Time Analysis 2 (RTA2) software. Read alignment, variant detection, genotype calling and variant filtering were performed by the TOPMed Informatics Research Center (University of Michigan). Reads were aligned to the 1000 Genomes hs37d5 decoy reference sequence. Variant detection and genotype calling were performed jointly for several TOPMed studies (including the three analyzed here), using the GotCloud pipeline. Mendelian consistency was used to train a variant quality classifier using a Support Vector Machine, used for variant filtering. Additional quality control (pedigree checks, gender checks, and concordance with prior array data), performed by the TOPMed Data Coordinating Center, were used to detect and resolve sample identity issues. Further details (including software versions) are provided online (see: https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/document.cgi?study_id=phs000964.v2.p1&phv=251960&phd=6969&pha=&pht=4838&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1).
TOPMed analyses were performed in agreement with study participants’ consent, as verified via an approval process by parent studies PIs in TOPMed and TOPMed publication committee.
Variantspecific inflation and genetic association analysis of BMI and hemoglobin concentration in TOPMed
To demonstrate the variance stratification problem, we used datasets of hemoglobin concentrations (HGB) and body mass index (BMI) in the TOPMed freeze 4. For each of the traits, we computed a Genetic Relationship Matrix (GRM^{13}) on all available variants for the corresponding trait with minor allele frequency at least 0.001, which was used to control for genetic relatedness in mixed models. Because some studies had individuals with different genetic backgrounds (leading to differences in allele frequencies), we defined “analysis groups” to use for assessment of variance stratification. An analysis group was as either all individuals from a single study (e.g., Amish), or further defined by both study and race/ethnic group (e.g., European and African Americans from the Cleveland Family Study were separate analysis groups). Thus, analysis groups capture multiple potential sources of trait variance, including differences in allele frequencies due to genetic ancestry, differences in environment and social/cultural factors, and differences in trait measurement by study. For both BMI and HGB, we performed singlevariant association analysis for all variants with minor allele count of at least 20. Detailed breakdown of the studies and populations used in these analyses are provided in the Tables 2 and 3. The analysis strategy for both traits was to use the fullyadjusted twostage procedure for ranknormalization of residuals, because it was shown to have better statistical properties (type 1 error control and power), especially when testing possibly rare genetic variants^{14}. Thus, we first fit a mixed linear regression model, with fixed effects for sex, age (also age^{2} for BMI), group defined by study and race/ethnicity, and allowing for genetic relatedness by including a variance component proportional to the GRM. Then we took the residuals generated by this model, ranknormalized them, and then refit the same model but with the ranknormalized residuals as the trait. For both traits, we compared four analyses: first, a ‘homogeneous variance’ analysis that estimates a single residual variance parameter across all individuals; second, a “stratified residual variance” model that allows a different residual variance parameter for each analysis group; third, a “completely stratified” approach which fits models and performs tests in each analysis group separately, and then combines the results via inversevariance fixedeffects metaanalysis; and forth, a “MetaCor” analysis^{15} that perform stratified analyses followed by fixedeffects metaanalysis while accounting for potential correlations due to genetic relationships between individuals in different analysis groups. The ‘completely stratified’ and the ‘MetaCor’ analyses are slightly more flexible than the stratified variance model because they allow for different genetic variance components across analysis groups, in addition to different residual variance components. For BMI, we removed eight individuals from the “completely stratified” analysis to ensure individuals were unrelated across groups, defined as less than thirddegree relatedness. All analyses, other than MetaCor, used the GENESIS R package.
Computing variantspecific inflation factors in mixed models with residual ranknormalization
We studied the calibration of the various analyses of HGB and BMI by computing approximate variantspecific inflation factors λ_{vs} and, for diagnostics, generated QQplots as describe later. Notably, λ_{vs} were developed assuming independent data, and applying them in the mixed model settings provides only an approximation, as both the sample size is inaccurate (e.g., two full siblings have similar genetic data, so their effective sample size is <2), and there is more than a single variance parameter, and thus it is not straight forward to decide which variance estimates to use in computing λ_{vs}. To see that, consider the mixed model analysis. We modeled both an error and a genetic variance component, so that for each observation, the model, in matrix form, assumes that:
Where G is the GRM, and \({\sigma }_{e}^{2},\,{\sigma }_{g}^{2}\) are error and genetic variance components, respectively. Thus, the variance depends on \({\sigma }_{e}^{2}\), \({\sigma }_{g}^{2}\), and G
In addition, we applied the fullyadjusted twostage procedure for ranknormalization of residuals, another procedure unaccounted for by the algorithm. Therefore, different possible models will yield quite different variance estimates to be used in the λ_{vs} computations, due to changes to the residual distributions due to ranknormalization. Because we are alerting the readers to the problems arising from assuming that variances are the same across all studies, we used variance computed based on the ‘homogeneous variance’ null model (the same residual and genetic variance components for all analysis groups). We extracted marginal residuals (distinguished from conditional residuals that can also be computed in mixed models) for each group, and computed empirical variance for group j by \(v_{j}=\,1/n_{j}\mathop{\sum}\limits_{{i\in {S_{j}}}} \widehat{\epsilon_{i}^{2}}\). Note that this estimator does not account for relatedness. We used the residuals from the second null model from the twostage procedure.
Assessing population stratification at the variance level through QQplots
Once λ_{vs} are computed for each of the variants of interest, we propose to generate QQplots across sets of variants to visualize whether population stratification at the variance level is appropriately addressed. A function is available on the GitHub repository to generate QQplots stratifying variants to categories: “Inflated”, “Deflated” and “Approx. no inflation”. The categories can be manually defined, so that a variant can be assigned to the “Inflated” category if its λ_{vs} is larger than a userspecified value, e.g., 1.01. Similarly, a variant is assigned to the “Deflated” category if its λ_{vs} is lower than a userspecified, e.g., 0.99. Variants are assigned to the “Approx. no inflation” category if they are not in the “Inflated” or “Deflated” categories, i.e., their λ_{vs} is close to the desired value of 1.
Characterizing variants by inflation patterns
To study how common the variant inflation/deflation problem is, and how it relates to variant frequencies, we computed the proportion of variants in levels of λ_{vs} for each allele frequency category: <0.01, 0.01–0.05, 0.05–0.2, and 0.2–0.5. We also studied the similarity of inflation/deflation patterns between BMI and HGB.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
TOPMed (https://www.nhlbiwgs.org/) whole genome sequencing data are available, from TOPMed Freeze 5b and Freeze 8, on dbGaP by application to each of the studies used in this manuscript. Phenotypes can also be obtained through application to dbGaP. Study dbGaP accessions are: phs000956 (Amish; ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?id=phs000956); phs000954 (CFS; https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?id=phs000954); phs000951 (COPDGene; https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?id=phs000951); phs000988 (CRA; https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000988.v4.p1); phs000974 (FHS; https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000974.v1.p1); phs000964 (JHS; https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000964.v4.p1). Source data are provided with this paper. Because Figs. 2 and 3 are based on results from testing tens of millions of variants, source data are provided after randomly sampling a smaller subset of data points out of those with p value > 0.01. Source data are provided with this paper.
Code availability
Statistical analyses were performed using the freely available R software, version 4.0.0. Association testing used the GENESIS package version 2.18.0, available on R/Bioconductor, or using the MetaCor R package available on GitHub https://github.com/tamartsi/MetaCor. Code for computing variantspecific inflation factors is available on GitHub https://github.com/tamartsi/Variant_specific_inflation with a tutorial, code, and example simulated data provided also in Supplementary Software 1. The code will also become available as part of GENESIS in a future release. Figures were generated using the ggplot2 R package version 3.3.0.
References
 1.
Hellwege, J. N. et al. Population stratification in genetic association. Stud. Curr. Protoc. Hum. Genet.95, 1.22.1–1.3 (2017).
 2.
Cardon, L. R. & Palmer, L. J. Population stratification and spurious allelic association. Lancet361, 598–604 (2003).
 3.
Price, A. L. et al. Principal components analysis corrects for stratification in genomewide association studies. Nat. Genet.38, 904–909 (2006).
 4.
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. Genomewide complex trait analysis (GCTA): methods, data analyses, and interpretations. Methods Mol. Biol.1019, 215–236 (2013).
 5.
Conomos, M. P., Reiner, A. P., Weir, B. S. & Thornton, T. A. Modelfree estimation of recent genetic relatedness. Am. J. Hum. Genet.98, 127–148 (2016).
 6.
Haldar, T. & Ghosh, S. Effect of population stratification on false positive rates of populationbased association analyses of quantitative traits. Ann. Hum. Genet.76, 237–245 (2012).
 7.
Conomos, M. P. et al. Genetic diversity and association studies in US Hispanic/Latino populations: applications in the Hispanic Community Health Study/study of Latinos. Am. J. Hum. Genet.98, 165–184 (2016).
 8.
Musharoff S., et al. 2018. Existence and implications of population variance structure. bioRxiv:439661
 9.
Gogarten S. M., et al. 2019. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics.
 10.
Wakefield J. 2013. Bayesian and frequentist regression methods. Springer Science & Business Media.
 11.
Van der Vaart A. W. 2000. Asymptotic statistics. Cambridge University Press.
 12.
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics55, 997–1004 (1999).
 13.
Amin, N., Van Duijn, C. M. & Aulchenko, Y. S. A genomic background based method for association analysis in related individuals. PloS One2, e1274 (2007).
 14.
Sofer T., et al. 2019. A fully adjusted twostage procedure for ranknormalization in genetic association studies. Genet. Epidemiol.
 15.
Sofer, T. et al. Metaanalysis of genomewide association studies with correlated individuals: application to the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). Genet Epidemiol.40, 492–501 (2016).
Acknowledgements
T.S. was supported by National Heart, Lung, and Blood Institute (NHLBI; R01HL120393‐03S1, 1R35HL135818, and 1R21HL145425). The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services. Study acknowledgements are provided in Supplementary File 1.
Author information
Affiliations
Consortia
Contributions
T.S. and K.M.R. conceptualized the work and drafted the manuscript. T.S., X.Z., S.M.G., M.C., T.A.T., A.S. and K.M.R. developed, studied, and implemented the genetic analysis algorithm that incorporates different residual variances by group. T.S. performed simulation studies. X.Z., C.A.L., S.M.G. performed quality control on the genetic sequencing data. X.Z., C.A.L., J.A.B., and J.B. harmonized and performed quality control for the phenotypes used in the analysis. B.M.P., C.C.L. and K.M.R. supervised quality control and phenotype harmonization procedures. T.S., X.Z., C.A.L., S.M.G., J.A.B., M.P.C., J.C.B., T.A.T., A.S., J.R.O., E.M.L., Y.G., L.A.C., B.M.P. and K.M.R. interpreted the data, reviewed and approved the final manuscript. The NHLBI TransOmics for Precision Medicine (TOPMed) Consortium authors contributed to the TOPMed data collection, joint processing and quality controls, and establishment of analysis procedures.
Corresponding author
Ethics declarations
Competing interests
Bruce M. Psaty serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. All other authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Jacklyn Hellwege and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sofer, T., Zheng, X., Laurie, C.A. et al. Variantspecific inflation factors for assessing population stratification at the phenotypic variance level. Nat Commun 12, 3506 (2021). https://doi.org/10.1038/s41467021236552
Received:
Accepted:
Published:
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.