Variant-specific inflation factors for assessing population stratification at the phenotypic variance level

In modern Whole Genome Sequencing (WGS) epidemiological studies, participant-level data from multiple studies are often pooled and results are obtained from a single analysis. We consider the impact of differential phenotype variances by study, which we term ‘variance stratification’. Unaccounted for, variance stratification can lead to both decreased statistical power, and increased false positives rates, depending on how allele frequencies, sample sizes, and phenotypic variances vary across the studies that are pooled. We develop a procedure to compute variant-specific inflation factors, and show how it can be used for diagnosis of genetic association analyses on pooled individual level data from multiple studies. We describe a WGS-appropriate analysis approach, implemented in freely-available software, which allows study-specific variances and thereby improves performance in practice. We illustrate the variance stratification problem, its solutions, and the proposed diagnostic procedure, in simulations and in data from the Trans-Omics for Precision Medicine Whole Genome Sequencing Program (TOPMed), used in association tests for hemoglobin concentrations and BMI.

L arge-scale association analyses using whole genome sequence (WGS) data on thousands of participants are now underway, through programs such as the NHLBI's Trans-Omics for precision Medicine (TOPMed) program and NHGRI's Genome Sequencing Program. Unlike earlier Genome-Wide Association Studies (GWASs), where data were combined using meta-analyses of summary statistics, in WGS analyses participant-level data from multiple studies are often pooled, and results are obtained from a single analysis. Pooled analysis of WGS is useful due to its computational tractability and its ability to control for genetic relatedness across the pooled datasets. However, it is sensitive to a form of population stratification that is not well known. Population stratification in genetic association analysis 1,2 typically refers to situations where the mean phenotype value and allele frequency both differ across population subgroups. Unless appropriately accounted for in the analysis, e.g., by using regression-based adjustments for ancestry such as principal components or genetic relatedness matrices in linear mixed models, or their combination, it can lead to false-positive associations [3][4][5] . Population stratification more generally can refer to differences in phenotype distribution and allele frequency across population subgroups 6 , and hence can also manifest as differences in phenotype variances by subgroup, combined with differences in allele frequencies. In practice, this phenomenon is common in pooled analysis of multi-study data, as small differences in allele frequencies are prevalent, and different studies being pooled often have different measurement protocols, environmental exposures and inclusion criteria, all of which can lead to different phenotype variances among studies.
Previous studies considered the effect of combining together groups with different phenotypic variances. Haldar and Ghosh 6 studied the effect of population stratification due to mean differences, variance differences, and more generally, phenotypic heterogeneity, across subpopulations, on false positive detections when testing variant associations with a quantitative trait. Conomos et al. 7 showed that when testing variant associations in a pooled sample of Hispanics/Latinos from different Hispanic background groups, statistical properties of tests are improved when the model allows for different residual variances in the different background groups. Musharoff et al. 8 , in a preprint, studied population variance structure using statistical models of both population means and variances, and developed statistical tests for the association of genetic variants with phenotypic variability.
In this manuscript, we develop variant-specific inflation factors λ vs , which quantify the degree of inflation/deflation in association testing of a single genetic variant due to population stratification at the variance level. We develop an algorithm to compute approximate variant-specific inflation factors based on allele frequencies and variances in groups pooled together for analysis, demonstrate their usage for assessing model fit, and demonstrate the implications of the population stratification at the variance level in simulations and in analyses of WGS data from TOPMed. To account for population stratification at the variance level we use the computationally efficient and scalable approach proposed in Conomos et al. 7 and implemented in GENESIS 9 , and show in simulations that it indeed addresses the variance stratification problem in scenarios based on Musharoff et al. 8 .

Results
Simulation studies. Our simulations consisted of 576 simulation settings according to various combinations of parameters. We compared a few ways to estimate variance parameters to be used in computing λ vs : empirical variances based on homogeneous and stratified variance models, and model-based variances from the heterogeneous variance model. The estimated λ vs were essentially the same regardless of the method. Figure 1 compares the estimated λ vs to the observed λ gc in each of the simulation settings and in each of the two modeling approaches (homogeneous versus stratified variance). Settings are divided according to patterns determining whether variance stratification will be expected, including same or different MAFs between the two studies, the same or different error variances, and whether the PC affects the genetic variance or not. The top three rows in Fig. 1 demonstrate settings in which both the MAF and the total variances differ between the two combined studies, including settings in which both the error and genetic variance components are the same, but the PC affects the genetic variance, resulting in different total variance between the studies because the mean of the PC differs between them. In these settings the variance stratification is observed when using the homogeneous variance model, in that the observed inflation can be substantially higher or lower than 1, with exact values depending on the specific parameters used in each simulation. Indeed, the computed λ vs and the observed λ gc are highly correlated. In contrast, the stratified variance model was robust to variance stratification across all settings, with observed inflation around 1 in all simulations. The bottom two rows of Fig. 1 demonstrate settings in which either the MAF or the variances are the same in the two combined studies. In these settings, the expected inflation computed by λ vs is always 1 (no inflation). As expected, the observed inflation is the same in the homogeneous and stratified variance models. The spread seen in the values of the observed inflation, with some values higher and some lower than the desired 1, are consistent with that expected based on the number of replication of simulations in each setting (10,000); see Supplementary Information for more details.
Genetic association analysis of BMI and hemoglobin concentration in TOPMed. We demonstrate the variance stratification problem in analyzes of hemoglobin concentrations (HGB, N = 7596; three analysis groups) and body mass index (BMI, N = 9807; eight analysis groups) in the TOPMed freeze 4 dataset. In both analyses we computed approximate variant-specific inflation factors λ vs . We investigated the inflation/deflation problems resulting from variance stratification, and verified that the patterns of inflation and deflation in the homogeneous variance analysis agree, across the different variants, with those obtained from the formula and the provided code. Figures 2 and 3 provide quantile-quantile (QQ)-plots for variants from three categories of variants, where theory predicts inflation ðλ vs ≥ 1:01Þ, deflation ðλ vs ≤ 0:99Þ and variants with λ vs "Approx. no inflation" (0:99<λ vs <1:01), and across all variants, for HGB and BMI analyses respectively. The plots overlay the results from the four analyses methods together. While the homogeneous variance model clearly produces inflated and deflated QQ-plots in line with the theoretical expectation, when looking at all tested variants together, this inflation and deflation (i.e., Type I and Type II errors) mask each other, alarmingly. Despite appearances, these problems do not "cancel out"; one creates more Type I errors, one creates more Type II errors, yet the plot of all results may lead investigators to conclude that the analysis is well-calibrated. In contrast, the stratified residual variance model provides good control of Type I errors, as seen in the QQ-plots, with the exception of the bottom left panel in Fig. 3, which provides QQplots for the set of variants that are expected to have deflated test statistics under the pooled variance model when studying BMI: here the stratified residual variance model was also somewhat deflated. Figure 4 provides the genomic control inflation factors λ gc computed over each of the variant sets provided in the QQplots and for each of the traits. The completely stratified and MetaCor models performed better in terms of overall QQ-plots and computed λ gc values in the two analyses, in that λ gc values were always closer to 1. MetaCor performed slightly better than the completely stratified model under independence, likely because it accounts for a low degree of relatedness between the strata.  Observed inflation values are provided based on a homogeneous variance model, in which a single variance parameter is estimated using the aggregated data; and based on a stratified variance model, that fits a different variance parameter to each of the two simulated studies. Each simulation set corresponds to a single point on this figure, and the simulations are grouped (denoted by different colors and symbols) by the characteristics stated in the legend. Within each group of simulation settings, the simulation parameters differ by specific parameter values, including MAFs, variance components, and sample sizes, while still satisfying the broad conditions of the grouped simulation settings. The dashed horizontal lines correspond to the 2.5% and 97% quantiles of the distribution of λ gc based on 10,000 variants under the null of no inflation/deflation, obtained from simulations. problem is ubiquitous for rare variants, but less so for common variants. In fact, for variants with frequency <0.05, only~4% of variants have λ vs falling in the "approximately no inflation" category. This is because the ratio between allele frequencies has a strong effect on inflation/deflation, and ratios can become quite high when variants are rare. In the Supplementary Information, Figure S2 shows the distribution of inflation, deflation, and "approximately no inflation" categories across variants in the two analyses, and demonstrates how similar the deflation/inflation categories are between them. Most variants stay in the same category between analyses, but some rare variants (in the figure defined as MAF < 0.05) can be inflated in one analysis and deflated in the other. These differences are because λ vs coefficients are affected by sample sizes, variances, and allele frequencies, which all differ to some extent between analyses due to different samples and trait characteristics.

Discussion
A standard tool for analysis of quantitative traits is linear or linear mixed model regression. In its widely-used default version, linear regression is fitted under the assumption that the phenotype's residual variance is the same for all individuals in the analyzed sample. The extent of the consequences if the variances are not equal sized can be computed exactly given simplifying assumptions. Broadly, using default approaches, if a specific subgroup has a larger phenotypic variance than that of other subgroups in the pooled analysis, the estimated precision of the association signal will understate the contribution from such a subgroup. The result is deflation (loss of power) for variants where allele frequency is greater in this subgroup compared to other subgroups, or inflation (too many false positives) for variants with lower allele frequency in this subgroup compared to others.
While default linear regression methods assume the same variance for all subgroups, which leads to mis-calibrated tests if the assumption does not hold, standard computational tools can be adapted to allow for a stratified variance model, yielding better calibrated tests. Specifically, by fitting different residual variances for each study, or more generally, appropriately defined "analysis group" (e.g., all African Americans of a specific study) the problem can be alleviated. This can be viewed as fitting a different variance component for noise within each study, or as a weighted The analyses used four approaches: "homogeneous variance" model, that assumes that all groups in the analysis have the same variances; "stratified variance" model, that allows for different residual variances across analysis groups; a "completely stratified indep" model in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined together in meta-analysis under independence assumption, and "MetaCor", a procedure that accounts for relatedness across strata in the meta-analysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as "Approx. no inflation" when they had estimated λ vs between 0.99 and 1.01, "Deflated" when estimated λ vs lower than 0.99, and "Inflated" when they had estimated λ vs higher than 1.01.
least squares approach, in which the group-dependent weights are estimated. This approach is implemented in some standard genetic analysis software packages (e.g., GENESIS 9 ). Our mathematical derivation and code can be used to assess the degree of miscalibration of association tests. The code uses an additive model, using a Binomial distribution for allele counts, which is commonly used in GWAS. Inflation/deflation trends should be similar between additive and dominant models, though specific values estimated using each of the two models would not be identical.
In linear regression, the stratified residual variance model allows every analysis group to have its own residual variance parameter. In the mixed model setting, where the variance is decomposed into genetic and residual variances, this model keeps the genetic variance component the same but allows for the residual variance to differ across groups. Analysis groups can be defined as study, race/ethnicity, combinations of these, or any other sample characteristics that affect trait variance and may also correlate with allele frequencies. Our mathematical derivation and code for computing λ vs are under simplifying assumptions of no covariate effects and independent observations. Therefore, these make no distinction between genetic and residual variance components. While in the linear regression setting (independent observations) the variance stratification model clearly suffices to account for variance heterogeneity, in the mixed model setting, a residual variance stratification model may not be optimal, because it may not fully account for stratification in the genetic variance, which could be the result of study design. For example, in Fig. 5, the estimated genetic variance component of the Cleveland Family Study is much higher than those of other studies, and of the residual variance component of the same study, perhaps because study participants were selected from families with obstructive sleep apnea, which is highly associated with obesity. Heterogeneity in genetic variance is addressed in the "completely stratified model", but such a model requires that individuals are independent between different groups (strata). We also used MetaCor, a method that allows for complete stratification of analysis groups, while keeping genetically related individuals across these groups 15 . MetaCor was shown to have good statistical properties and performed well in the BMI and HGB analysis. However, it is currently computationally costlier than a pooled analysis because individual level data are used both at the "homogeneous variance" model, which assumes that all groups in the analysis have the same variances; "stratified variance" model, which allows for different residual variances across analysis groups; a "completely stratified indep" model in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined together in meta-analysis under independence assumption, and "MetaCor", a procedure that accounts for relatedness across strata in the meta-analysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as "Approx. no inflation" when they had estimated λ vs between 0.99 and 1.01, "Deflated" when estimated λ vs lower than 0.99, and "Inflated" when they had estimated λ vs higher than 1.01. individual analysis group computations, and when computing covariances between effect size estimates of all analysis groups. Computational efficiency is critical when testing the large number of variants observed in WGS studies. In addition, the MetaCor approach is not yet extended to tests of sets of rare variants (rather than single rare variants tests studied in the current manuscript). While more difficult to assess, variance stratification likely affects tests of rare variant sets as well, and methods that use a Score test based on a null model that is fit once, such as the stratified variance approach implemented in GENESIS, straightforwardly extend to such settings. As sample sizes of TOPMed grow, pooling together more diverse studies and populations, variance stratification problems may be more severe. Models allowing for pooled analysis with both group-specific residual and genetic variances or robust variance estimates may be needed for better control of Type I errors and increased efficiency. Until such methods are developed, we recommend to first use the stratified variance approach, because it is computationally efficient, it can account for relatedness across the entire sample, and the same null model can be used to test variant sets. As a second step, we recommend computing approximate λ vs , and assessing whether observed inflation/deflation remains for test statistics within groups of variants predicted to be inflated/deflated based on λ vs values. If inflation/deflation are observed despite residual variance stratification, the analyst would ideally move forward with a meta-analytic approach such as MetaCor (does not discard data but computationally more demanding), or standard meta-analysis after removing individuals to generate genetically independent strata.

Methods
The linear model. For a total sample size of n, we assume that the data follow a linear model denoted as where y i is the trait or phenotype value of person i, g i is their count of coded alleles (i.e., genotype), β 0 denotes the mean outcome in those with no copies of the coded allele, β denotes the effect on the mean trait of each additional copy of the coded  . We categorized variants as "Approx. no inflation" when they had estimated λ vs between 0.99 to 1.01, "Deflated" when estimated λ vs were lower than 0.99, and "Inflated" when they had estimated λ vs higher than 1.01. Genomic control inflation factors λ gc were computed as the ratio between the median χ 2 1 test statistic across variants in the set to the theoretical median of the test statistic under the null hypothesis of no association. In each of the analyses (BMI and HGB), for each allele frequency category we provide the number of variants in this category, and from these, the proportion of variants with computed λ vs in each of multiple categories of inflation/deflation values.
allele, and the ϵ i are residual errors, which we for now assume are independent, as a simplifying assumption.
To provide intuition for the variance stratification problem, we first demonstrate a mathematical derivation in simplified settings. We assume that the genotype effect is null (β = 0), and that the errors follow a normal distribution ϵ i $ Nð0; σ 2 i Þ. We further assume that the phenotypes are centered, the genotypes are centered, and follow a dominant mode of inheritance, i.e., we are using g i ¼ ðg i À pÞ, whereg i is the genotype under a dominant mode (having values 1 or 0), p is the frequency of having any copies of the variant allele, and g i is used in analysis.
Implication of variance stratification on the Wald test. The Wald test quantifies the strength of the genetic association by dividing a regression-based estimate of β by its corresponding estimated standard error. The linear regression estimate of the effect (written in the general regression form) iŝ Denoting the estimated residual variance of individual i by σ 2 i (which may differ across individuals), the variance ofβ is When the variance of the residuals is homogeneous across all individuals, this is where σ 2 is the common variance parameter. To illustrate how this approach can mislead under variance stratification, we consider the situation where two studies are present, of sizes n 1 and n 2 , respectively, such that n 1 þ n 2 ¼ n: Further, each study is internally homogeneous with error variances σ 2 1 and σ 2 2 , and it is also useful to write p 1 and p 2 for the frequency of the variant of interest (under dominant mode). Because we assume that the variant was centered in the pooled population, we have that E½ ∑ i2S j We can re-write Eq. (3) as: varðβÞ ¼ n 1 p 1 ð1 À p 1 Þσ 2 1 þ n 2 p 2 ð1 À p 2 Þσ 2 2 ½n 1 p 1 ð1 À p 1 Þ þ n 2 p 2 ð1 À p 2 Þ 2 We see that the actual variance is a linear combination of the variance parameters σ 2 1 ; σ 2 2 , and the weight assigned to each depends on the minor allele frequency and sample size in each group. When the minor allele frequencies (MAFs) are equal, p 1 = p 2 , the two forms 4 and 5 are equal, as there is no association between genotype and outcome, and no confounding occurs. But when p 1 ≠ p 2 then the variance of the estimator upweights the residual variance in the group where the variant is more common, which does not happen under homogeneity. This result straightforwardly generalizes for M studies.
In some studies, researchers use mixed models in GWAS to account for genetic relationships between individuals. Then, it is usually assumed that the variance is decomposed to an error and genetic variances, so that varðϵÞ ¼ σ 2 e þ σ 2 g . When using unrelated individuals and not accounting for genetic relatedness via a genetic relationship matrix, the two variance components are not identifiable and it is clear that accounting for differences in error variances is the same as accounting for differences in total variance (the sum of the two variance components). Musharoff et al. 8 introduced a model where the variances depends on individual-specific genetic components. For example, it could depend on a principal component (PC) of the genetic data, with: varðϵÞ ¼ σ 2 e þ θ 2 g σ 2 g where θ g is a PC, with values varying across individuals. We address this setting in simulations.
Computing approximate variant-specific inflation factors. We can use mathematical derivations under homogeneity and heterogeneity, relaxing the restrictive assumptions provided earlier, to compute variant-specific inflation factors. These make use of standard "sandwich" formula for large-sample approximations of the variance of estimators; for a minimally-technical summary see Result 2.1 in Wakefield 10 , or for more detail Sections 5.2-5.3 of Van der Vaart 11 . We now allow for additive genetic model, and do not assume that the genotypes are standardized. The variance estimator used by the Wald test is now provided as follows: varððβ 0 ;βÞ T Þ ¼

HGB BMI
A m is h _ E u r o p e a n _ A m e r ic a n C F S _ A fr ic a n _ A m e r ic a n C F S _ E u r o p e a n _ A m e r ic a n C O P D G e n e _ A fr ic a n _ A m e r ic a n C O P D G e n e _ E u r o p e a n _ A m e r ic a n C R A _ C o s ta _ R ic a n F H S _ E u r o p e a n _ A m e r ic a n J H S _ A fr ic a n _ A m e r ic a n  Based on these two expressions, we propose an algorithm to compute an approximate variant-specific inflation factor. For computational purposes, we further simplify these arguments taking advantage of the fact that there are repeated rows (e.g., people who have g i = 1 and are from the same study, having the same residual variance). The algorithm below uses the additional assumption that phenotypic variance within each study does not vary with genotype-which must hold under the strong null hypothesis of no association in any subpopulation. It also uses the simplifying assumption that variants are in Hardy Weinberg Equilibrium (HWE) within each study population; testing HWE is a standard preprocessing step for genotype data.

Algorithm for computing variant-specific inflation factors
Suppose that an analyst wished to estimate a vector of regression parameters ðβ g ; β 1 ; ; β M Þ T , where β g is a variant association measure, and β 1 ; ; β M are intercepts for M analysis groups. Denote the genotype of the ith individual in the m analysis group by g mi . The design matrix for estimating these parameters in linear regression would be of the form ; g mn m Þ T , 1 n m is a vector of length n m with all entries being equal to 1, and similarly 0 n m is a vector of length n m with all entries being equal to 0. Let V ¼ varðyÞ be the diagonal matrix with error variances of the outcomes. The estimator of the variances and covariances of the vector of regression parameters is varðβÞ ¼ ðW T WÞ À1 W T VWðW T WÞ À1 . From the matrix varðβÞ we are interested in the leading diagonal value, which is the variance of b β g . Suppose first that one construct the matrix W using the actual data. Then: Now, instead of using the genotype themselves, we use the large sample limit of the expected genotype under HWE to replace the expression g T m 1 n m by n m ð0 p 2 m þ 1 2p m ð1 À p m Þ þ 2 ð1 À p m Þ 2 Þ, where n m p 2 m , 2n m p m ð1 À p m Þ, and n m ð1 À p m Þ 2 are the number of individuals from analysis group m expected to have 0, 1, and 2 effect alleles under HWE. Similarly, we can replace g T m g m by its large sample limit under HWE.
Notice that the quantity 0 p 2 m þ 1 2p m ð1 À p m Þ þ 2 ð1 À p m Þ 2 is a multiplication of two vectors: ð0; 1; 2Þ ðp 2 m ; 2p m ð1 À p m Þ; ð1 À p T m ÞÞ T . Thus, we now define a matrix X and a matrix p, such that ðW T WÞ ¼ X T PX. In matrix X the left column having values ð0; 1; 2; ; 0; 1; 2Þ T (ð0; 1; 2Þ T repeating for each study), instead of the actual observed genotypes ðg 11 ; ; g Mn M Þ T , other columns represent study-specific intercepts, and the matrix P is a diagonal matrix providing the HWE probabilities, for each study, further scaled by the proportion of individual that each analysis group contributes to the study. We use the matrices X, P, and V = var(y) to similarly replace W T VW by its large sample limit under HWE. Specifically, define: • X ¼ ðG DÞ where G is a vector of length 3M of the form ð0; 1; 2; ; 0; 1; 2Þ T , and D is a 3M × M design matrix modeling study-specific intercepts where the i, j element D ij is otherwise: • P, a 3M × 3M diagonal matrix, in which each entry gives the population proportion in each combination of genotype and study, i.e.,: , is a 3M × 3M diagonal matrix, in which each entry gives the outcome variance in each combination of genotype and study.
Define now B ¼ X T PX and A ¼ X T PVX, which give the large sample limits of ðW T WÞ and W T VW. Under heterogeneity the variance of the slope estimate b β g is proportional to the leading diagonal entry B À1 AB À1 . Under homogeneity the variance b β g is proportional to the leading entry of B À1 sumðdiagðPVÞÞ, with the same constant of proportionality. The ratio of these two leading entries, squared, gives the large-sample value of λ gc , the genomic control inflation factor 12 that would be obtained by comparing the median Wald test statistic to the median of the χ 2 1 reference distribution, if all variants had the same MAF values across the studies. Because this formula provides different results for each variant, depending on the allele frequencies, we denote the ratio between the estimated values under homogeneous variance and the heterogeneous variance models λ vs , for "variant specific". Note that this function requires estimation of variances (for constructing matrix P, under HWE assumption) and allele frequencies (for constructing matrix V), which are readily obtained.
An R function implementing these matrix calculations is provided, together with a tutorial that includes a coding example. These are also provided on GitHub on https://github.com/ tamartsi/Variant_specific_inflation, and the function will be integrated into the GENESIS R package.
Simulation studies. We performed simulations to study the appropriateness of the proposed λ vs , in terms of how it approximates the standard genomic control coefficient λ gc obtained from a "homogeneous variance" model that estimates a single variance parameter across data from all studies. We also studied whether a "stratified variance" model, allowing for different variance parameters across two studies, improves upon the homogeneous variances model. In this vein, we simulated unassociated genetic variants and outcomes in a range of settings combining two studies. We simulated n 1 ; n 2 individuals in study 1 and study 2, n 1 þ n 2 ¼ n. Let y i be the outcome value of person i; i ¼ 1; ; n, and θ i a PC value for this person, β 1 ¼ 1; β 2 ¼ 2 be study-specific intercepts for studies 1 and 2, σ 2 1 ; σ 2 2 studyspecific error variances, σ 2 g a common genetic variance parameter, and β θ ¼ 1 models the linear association of the PC with the outcome. The PC was simulated from a normal distribution with variance 1, and mean μ 1 ¼ 2 in study 1, and mean μ 2 in study 2 computed such that the overall PC mean in the two studies together is equal to zero (i.e., ð ∑ n 2 i¼1 θ i Þ=n 2 ¼ ð∑ n 1 j¼1 θ j Þ=n 1 ). The outcome model specified as: With Or In 7 the PC does not affect the genetic outcome variance, while in 8 it does. Some of the parameters were the same in all simulations (as reported above). We varied the following parameters: n 1 ; n 2 2 f1000; 5000g, σ 2 1 ; σ 2 2 ; σ 2 g 2 f1; 2g; and simulated bi-allelic independent genetic variants with MAFs p 1 ; p 2 2 f0:01; 0:05; 0:5g in the two studies.
We performed 10,000 simulations for each combination of parameters and, for each such setting, computed λ gc as the ratio between median observed and expected value of the χ 2 ð1Þ test statistic (under the null). We computed λ vs in each of the 10,000 simulations based estimated variances and observed allele frequencies in each of the two simulated studies, and averaged these estimates across the simulations. We compared three approaches to estimate variances 1 : fit a homogeneous variances model, obtain residualsε: For each study, estimate the variance as the average 1=n ∑ n i¼1 b ϵ 2 i where n is the number of study individuals (empirical variance) 2 ; fit a "stratified variance" linear regression model allowing for different residual variances by study (as implemented in the R/Bioconductor package GENESIS 9 ); 3 use the same model with stratified variances, but use the variance estimates obtained by the AI-REML algorithm (model variance). In the Supplementary Information, we provide a distribution of λ gc values that would be seen under random variability, using 10,000 independent test statistics (as was used in simulations), and simulating test statistics under the null and computing inflation factors λ gc .
Whole genome sequencing in TOPMed. For the present analysis, we used Whole Genome Sequencing (WGS) data from freeze 4 of TOPMed. WGS was performed on DNA samples extracted from blood. Sequencing was performed by the Broad Institute of MIT and Harvard (FHS and Amish) and by the Northwest Genome Center (JHS). PCR-free libraries were constructed using commercially available kits from KAPA Biosystems (Broad) or Illumina TruSeq (NWGC). Libraries were pooled for clustering and sequencing, and later de-multiplexed using barcodes. Cluster amplification and sequencing were performed according to manufacturer's protocols using the Illumina cBot and HiSeq X sequencer, to a read depth of >30X. Base calling was performed using Illumina's Real Time Analysis 2 (RTA2) software. Read alignment, variant detection, genotype calling and variant filtering were performed by the TOPMed Informatics Research Center (University of Michigan).
Reads were aligned to the 1000 Genomes hs37d5 decoy reference sequence. Variant detection and genotype calling were performed jointly for several TOPMed studies (including the three analyzed here), using the GotCloud pipeline. Mendelian consistency was used to train a variant quality classifier using a Support Vector Machine, used for variant filtering. Additional quality control (pedigree checks, gender checks, and concordance with prior array data), performed by the TOPMed Data Coordinating Center, were used to detect and resolve sample identity issues. Further details (including software versions) are provided online (see: https://www.ncbi.nlm. nih.gov/projects/gap/cgi-bin/document.cgi?study_id=phs00096 4.v2.p1&phv=251960&phd=6969&pha=&pht=4838&phvf=&phd f=&phaf=&phtf=&dssp=1&consent=&temp=1).
TOPMed analyses were performed in agreement with study participants' consent, as verified via an approval process by parent studies PIs in TOPMed and TOPMed publication committee.
Variant-specific inflation and genetic association analysis of BMI and hemoglobin concentration in TOPMed. To demonstrate the variance stratification problem, we used datasets of hemoglobin concentrations (HGB) and body mass index (BMI) in the TOPMed freeze 4. For each of the traits, we computed a Genetic Relationship Matrix (GRM 13 ) on all available variants for the corresponding trait with minor allele frequency at least 0.001, which was used to control for genetic relatedness in mixed models. Because some studies had individuals with different genetic backgrounds (leading to differences in allele frequencies), we defined "analysis groups" to use for assessment of variance stratification. An analysis group was as either all individuals from a single study (e.g., Amish), or further defined by both study and race/ethnic group (e.g., European and African Americans from the Cleveland Family Study were separate analysis groups). Thus, analysis groups capture multiple potential sources of trait variance, including differences in allele frequencies due to genetic ancestry, differences in environment and social/cultural factors, and differences in trait measurement by study. For both BMI and HGB, we performed single-variant association analysis for all variants with minor allele count of at least 20. Detailed breakdown of the studies and populations used in these analyses are provided in the Tables 2 and 3. The analysis strategy for both traits was to use the fully-adjusted two-stage procedure for ranknormalization of residuals, because it was shown to have better statistical properties (type 1 error control and power), especially when testing possibly rare genetic variants 14 . Thus, we first fit a mixed linear regression model, with fixed effects for sex, age (also age 2 for BMI), group defined by study and race/ethnicity, and allowing for genetic relatedness by including a variance component proportional to the GRM. Then we took the residuals generated by this model, rank-normalized them, and then re-fit the same model but with the rank-normalized residuals as the trait. For both traits, we compared four analyses: first, a 'homogeneous variance' analysis that estimates a single residual variance parameter across all individuals; second, a "stratified residual variance" model that allows a different residual variance parameter for each analysis group; third, a "completely stratified" approach which fits models and performs tests in each analysis group separately, and then combines the results via inverse-variance fixed-effects meta-analysis; and forth, a "MetaCor" analysis 15 that perform stratified analyses followed by fixed-effects meta-analysis while accounting for potential correlations due to genetic relationships between individuals in different analysis groups. The 'completely stratified' and the 'MetaCor' analyses are slightly more flexible than the stratified variance model because they allow for different genetic variance components across analysis groups, in addition to different residual variance components.
For BMI, we removed eight individuals from the "completely stratified" analysis to ensure individuals were unrelated across groups, defined as less than third-degree relatedness. All analyses, other than MetaCor, used the GENESIS R package.
Computing variant-specific inflation factors in mixed models with residual rank-normalization. We studied the calibration of the various analyses of HGB and BMI by computing approximate variant-specific inflation factors λ vs and, for diagnostics, generated QQ-plots as describe later. Notably, λ vs were developed assuming independent data, and applying them in the mixed model settings provides only an approximation, as both the sample size is inaccurate (e.g., two full siblings have similar genetic data, so their effective sample size is <2), and there is more than a single variance parameter, and thus it is not straight forward to decide which variance estimates to use in computing λ vs . To see that, consider the mixed-model analysis. We modeled both an error and a genetic variance component, so that for each observation, the model, in matrix form, assumes that: Where G is the GRM, and σ 2 e ; σ 2 g are error and genetic variance components, respectively. Thus, the variance depends on σ 2 e , σ 2 g , and G In addition, we applied the fully-adjusted two-stage procedure for rank-normalization of residuals, another procedure unaccounted for by the algorithm. Therefore, different possible models will yield quite different variance estimates to be used in the λ vs computations, due to changes to the residual distributions due to rank-normalization. Because we are alerting the readers to the problems arising from assuming that variances are the same across all studies, we used variance computed based on the 'homogeneous variance' null model (the same residual and genetic variance components for all analysis groups). We extracted marginal residuals (distinguished from conditional residuals that can also be computed in mixed models) for each group, and computed empirical variance for group j by i . Note that this estimator does not account for relatedness. We used the residuals from the second null model from the two-stage procedure.
Assessing population stratification at the variance level through QQ-plots. Once λ vs are computed for each of the variants of interest, we propose to generate QQ-plots across sets of variants to visualize whether population stratification at the variance level is appropriately addressed. A function is available on the GitHub repository to generate QQ-plots stratifying variants to categories: "Inflated", "Deflated" and "Approx. no inflation". The categories can be manually defined, so that a variant can be assigned to the "Inflated" category if its λ vs is larger than a user-specified value, e.g., 1.01. Similarly, a variant is assigned to the "Deflated" category if its λ vs is lower than a user-specified, e.g., 0.99. Variants are assigned to the "Approx. no inflation" category if they are not in the "Inflated" or "Deflated" categories, i.e., their λ vs is close to the desired value of 1.
Characterizing variants by inflation patterns. To study how common the variant inflation/deflation problem is, and how it relates to variant frequencies, we computed the proportion of variants in levels of λ vs for each allele frequency category: <0.01, 0.01-0.05, 0.05-0.2, and 0.2-0.5. We also studied the similarity of inflation/deflation patterns between BMI and HGB.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.  For each group, we report parent study, race/ethnic group, number of participants, number and percentage of males, and age and hemoglobin's means and standard deviations. For each group, we report parent study, race/ethnic group, number of participants, number and percentage of males, and age and BMI's means and standard deviations.

Code availability
Statistical analyses were performed using the freely available R software, version 4.0.0. Association testing used the GENESIS package version 2.18.0, available on R/ Bioconductor, or using the MetaCor R package available on GitHub https://github.com/ tamartsi/MetaCor. Code for computing variant-specific inflation factors is available on GitHub https://github.com/tamartsi/Variant_specific_inflation with a tutorial, code, and example simulated data provided also in Supplementary Software 1. The code will also become available as part of GENESIS in a future release. Figures were generated using the ggplot2 R package version 3.3.0.