Accuracy of heritability estimations in presence of hidden population stratification

The heritability of a trait is the proportion of its variance explained by genetic factors; it has historically been estimated using familial data. However, new methods have appeared for estimating heritabilities using genomewide data from unrelated individuals. A drawback of this strategy is that population stratification can bias the estimates. Indeed, an environmental factor associated with the phenotype may differ among population subgroups. This factor being associated both with the phenotype and the genetic variation in the population would be a confounder. A common solution consists in adjusting on the first Principal Components (PCs) of the genomic data. We study this procedure on simulated data and on 6000 individuals from the Three-City Study. We analyse the geographical coordinates of the birth cities, which are not genetically determined, but the heritability of which should be overestimated due to population stratification. We also analyse various anthropometric traits. The procedure fails to correct the bias in geographical coordinates heritability estimates. The heritability estimates of the anthropometric traits are affected by the inclusion of the first PC, but not by the following PCs, contrarily to geographical coordinates. We recommend to be cautious with heritability estimates obtained from a large population.

The heritability of a quantitative phenotype is the proportion of its variance explained by genetic factors. The concept of heritability can be traced back to the pioneering works of Galton in the nineteenth century 1 ; its modern definition is due to Fisher in 2 . Most recent works focus on the narrow-sense heritability which is the proportion of variance explained by additive genetic effects.
Heritability estimates have long been based on family data 3 . Twin studies 4 were very popular, as they provide an easy way to take into account the shared environment in families, which can bias the estimates if unaccounted for in the analyses. However the possibility of a bias due to difference in shared environment in monozygotic and dizygotic twins remained 4,5 .
The research of the genetic polymorphisms causing the variability of the phenotypes with a strong genetic component was carried notably through numerous Genome-Wide Association Studies (GWAS). Hundreds of Single Nucleotide Polymorphisms (SNPs) have been found associated with complex traits 6,7 . However the variance explained by these SNPs is lower than the genetic variance predicted by the family studies; the unexplained variance was called the "missing heritability" [8][9][10][11][12][13] .
The missing heritability problem triggered the development of methods using genome-wide data to obtain heritability estimates from unrelated individuals 13,14 . These methods have now mostly superseded family based methods; they are often advocated and perceived as providing unbiased estimates, in particular because, as the individuals are unrelated, there is no shared environment 15 . It was however demonstrated early that the presence of population stratification inflates heritability estimates 16 . A common practice to correct this bias is to include a few Principal Components (PCs) of the estimated kinship matrix in the model with fixed effects [17][18][19][20][21] .
The common usage is to include 10 or 20 PCs in the model with fixed effects. There are however no theoretical grounds on which this number of PCs is chosen; more PCs could be included without damaging the estimates. In this paper, we explore the variability of genomic heritability estimates with the number of PCs included as fixed effects in the model, and the ability of this method to effectively correct for population stratification.
We use simulated data to assess the properties of the method, which we also apply on the Three-City (3C) data 22 . The 3C study is a longitudinal study including 9294 French individuals aged 65 years or older, recruited between 1999 and 2001 in the cities of Bordeaux, Dijon and Montpellier. The longitude and latitude of the birth cities provide examples of "purely stratified" traits, which are not determined by the genome but should be very correlated to the first PCs 23 . Thus, the "naive" genomic heritability estimates of these traits should be artificially high; including PCs in the model should reduce these estimates. Our aim is to find how many PCs have to be included to get a genomic heritability estimate close to zero.
We also analyse anthropometric phenotypes: height which is the historical example of a highly heritable trait 2,24 , weight, Body Mass Index (BMI), head circumference and waist-to-hip ratio which are also expected to display significant heritability. GWAS studies discovered many variants associated with human height [25][26][27] , and with obesity related traits such as weight, BMI, waist or hip circumference and waist-to-hip ratio [28][29][30][31][32][33] . The heritability of all these traits have been estimated multiple times through twin studies, familial studies, or genomic data; Table 1 summarizes some estimates from the literature. We compute their genomic heritability on the 3C data set, and consider how it evolves when including PCs with fixed effects.

Results
Using linear mixed models, we estimate variance components and heritabilities h 2 of several traits. The decompositions of phenotypic variance are presented in graphics with number of Principal Components included with fixed effects in the model as abscissae and the proportion of explained variance as ordinates. The estimated proportion of variance explained by the fixed effects is displayed in dark gray. The light gray and white colors are respectively the estimated proportions of genetic and residual variances.
For the 3C phenotypes, we also give Tables with numerical value of the parameter estimates, their standard errors, and the likelihood ratio test for H 0 : h 2 = 0. Simulated data. Phenotypes were simulated under the linear mixed model, with a non-zero effect on the first 10 PCs (see details in Methods section). They were analysed with a number p of PCs included in the model with fixed effects varying from 0 to 2000. Figure 1 displays the mean of estimated variance proportions, and their standard deviations, computed on 100 simulation replicates. From 0 to 2000 PCs are included with fixed effects. We note that as soon as 10 PCs or more are included in the model, the true proportions of variance (20% for the PCs, 40% for each of the random terms) are well estimated (with a standard error close to 0.05). In contrast, when the number of included PCs is lower than 10, the proportion of genetic variance is overestimated. When up to 2000 PCs are included, the means of estimates are stable, while the standard deviation increase slowly.
Three-City data. Variance estimations have also been made on several quantitative traits from the 3C study. First of all, we considered as quantitative phenotypes the longitude and the latitude of the birth cities. Geographical coordinates of all birth cities are displayed on the map of France in the first panel of Supplementary  Fig. S1. We also estimated the heritability of several anthropometric phenotypes: height, weight, BMI, waist-to-hip ratio and head circumference.
The mean and the standard deviation of each analyzed variable are given in Table 2, stratified on the sex. The sex has been included as a covariate when estimating the heritability of the anthropometric phenotypes.
Longitude and latitude. The variance decompositions of longitude and latitude are displayed in the two panels of Fig. 2, where we included up to p = 2000 PCs in the model with fixed effects. When no PCs are included in the model, it is estimated that 100% of the variance is genetic both for longitude and latitude. When adding PCs to the model, this estimated proportion decreases slowly for the longitude (first panel of Fig. 2). It is estimated close to 54% for p = 50 PCs, and to 26% for p = 2000. In contrast, the estimated proportion of genetic variance of latitude  Tables 3 and 4 for longitude and latitude heritability respectively. These Tables give, for various values of p, the estimates of σ 2 , τ, and h 2 , together with their standard error, and the likelihood ratio   test for the null hypothesis H 0 : h 2 = 0. We first note that for all values of p up to p = 2000, the Likelihood Ratio Test (LRT) is significant for the heritability of both longitude and latitude (note that the LRT asymptotically follows a χ χ (0) : mixture 34 ; the 5% significance threshold is 2.70). We also note that the standard error of the various estimates increases with p, but within reasonable bounds: for example the estimated heritability of the longitude is = Anthropometric traits. We analysed height, weight, BMI, head circumference and waist-to-hip ratio. For these traits, sex and age were included in the model as covariates. In Fig. 3, the estimated proportions of variance for height, head circumference and waist-to-hip ratio are displayed. Sex and age alone are responsible for 52%, 24% and 41% of the variance of height, head circumference and waist-to-hip ratio respectively. Without population stratification correction (p = 0), the genetic variance is 22%, 13% and 12% of the total variance; the inclusion of a single PC in the model (p = 1) decreases these values to 19%, 8% and 8% respectively. Including up to p = 50 PCs does not modify much these proportions. For p up to 2000, the variance proportion estimates fluctuate around the previously cited values.
The estimated proportions of variance for weight and BMI are displayed in Supplementary Fig. S3: the inclusion of PCs with fixed effect does not impact much the estimated proportions of genetic variance, which are 16% and 19% respectively. Table 5 gives for the five anthropometric phenotypes and various values of p, the estimates of σ 2 , τ, and h 2 , together with their standard error, and the likelihood ratio test for the null hypothesis H 0 :   Without stratification correction, height heritability is estimated to 46% with standard error of 6.1%; as soon as one PC is included in the model, this value drops to 39% (standard error 6.6%). Inclusion of more PCs (up to p = 20) almost does not change the estimated values nor their standard errors. The likelihood ratio tests for h 2 = 0 are significant.
Heritability of weight and BMI seem to be almost unaffected by population stratification correction. In the two cases, heritability is significantly positive. It is estimated close to 22% (standard error 6%) and 19% (standard error 6%).
Head circumference heritability is estimated to 17% (standard error 5.7%) without population stratification correction. With inclusion of p = 1 PC in the model, it is only 10%, which drops to 9% for p = 2 (with a standard error of 6.3% or 6.4%). The likelihood ratio test is no longer significant.
Waist-to-hip ratio heritability is estimated to 20% (standard error 6.1% without population stratification correction. This value drops to 13% (standard error 6.7%) with the inclusion of one PC in the model. The likelihood ratio tests stay significant.

Discussion
Our analyses of simulated data with up to 2000 PCs included as fixed effect show that the heritability estimates precision are not much impacted when an important number of PCs are included in the model with fixed effects. Of course, the size of the sample matters, however as soon as a few thousands individuals are included in the analysis, taking p = 100 or 500 is not harmful for the precision of the estimates. There is no practical reason to limit to small values of p.
We were surprised to obtain a genomic heritability estimate of 100% for latitude and longitude: as they are correlated to the first few PCs, we were expecting a positive heritability estimate, but not that large. We were also expecting the estimated heritability to vanish (or at least to become very small) after inclusion of a few PCs in the model. Instead of that, if the latitude heritability drops to 68% after inclusion of the first two PCs, adding more PCs only results in a slow decrease, and the heritability remains highly significant. The longitude behaves in a worse manner, as there is no initial drop as observed for the longitude. In both cases, the slow decrease of heritability is accompanied by a slow increase of the proportion of total variance due to the PCs included with fixed effects. The Likelihood Ratio Test (LRT) shows that the heritability is significantly positive for all values of p.
Browning et al. 16 obtained similar results with simulated case/control data based on the WTCCC case/control data, with an extremely disequilibrated ascertainment scheme in which 90% of individuals from Scotland and Wales were assigned to be cases, and only 10% of individuals from England, the controls being the remaining individuals. In their response, Goddard et al. 15 pointed out that this was an extreme scenario; it is however realistic to imagine that a quantitative trait is under the influence of an environmental factor which varies with latitude or longitude. The heritability estimate of such a trait would be severely overestimated.
Heritability estimates obtained for anthropometric traits are globally compatible with the results from the literature. It is interesting to note that height and waist-to-hip ratio seem sensitive to population stratification, even if marginally. Head circumference is more affected, as its estimated heritability drops from 17% to 9% with the inclusion of the first two PCs, and the LRT is no longer significant. All these traits are weakly but significantly correlated with the geographical coordinates of the birth cities (cf Supplementary Table S5), but so is BMI which does not display this behaviour. We note that when including more PCs in the analysis of these traits, there is no linear trend comparable to the one observed in the case of the geographical coordinates. This can bring back some confidence in the method, which was dented by the previous analyses.
We performed additional analyses were we included the geographical coordinates as covariates when analysing height, head circumference, and waist-to-hip ratio ( Supplementary Fig. S4 and Supplementary Table S1): the drop in heritability when the first PCs are added is no longer observed, which seems to indicate that the effect of the first PCs was due to a geographical stratification. Heritability of height drops to 37% (instead of 39%), which is a marginal change; as previously, heritability of head circumference is not significantly positive. Heritability of waist-to-hip ratio is estimated close to 8%, and is no longer significantly positive, while it was previously estimated to 13%, significant with p = 2: this falls in line with our previous findings on the fact that the inclusion of the first PCs may not be sufficient to fully correct for population stratification. One should consider seriously to include the geographical coordinates of the place of birth with fixed effect in the model, when they are available.
Epidemiologists use to take into account possible center effects by including indicator variables for the centers. We performed an additional analysis of height, head circumference and waist-to-hip ratio with fixed effects for the centers (Supplementary Fig. S5 and Supplementary Table S2). As observed for the previous analysis incorporating geographical coordinates, the drop in heritability after including the first PCs almost disappears (although not completely). The final value after inclusion of 10 PCs is however somewhat higher than the one obtained with the geographical coordinates. We also performed an analysis on the individuals from the largest center, Dijon, alone ( Supplementary Fig. S6 and Supplementary Table S3). The drop in heritability after including the first PC is then quite noticeable. This discards the hypothesis that the impact of the PCs on the heritability estimates is due  to a center effect. Note however that it is impossible to tell whether the first PCs effect is due to some geographic environment, or to some genes with a north-south or west-east allelic gradient. We made the choice of using LD-pruned data to compute the PCs included for population stratification. We could have used the whole data to this aim; this does not alter significantly our conclusions. The results of such analyses are displayed in the Supplementary Information 2. A procedure has been proposed to detect the presence of cryptic relatedness and population stratification 35 . It consists in computing, in one hand, the 22 heritabilities due to each chromosome, one at time, and on the other hand, the same quantities by analysing all the chromosomes together. In absence of cryptic relatedness and of population stratification, these two estimates should be (roughly) the same. The procedure then regresses the difference between those two estimates on the length of the chromosomes. A significantly positive intercept is interpreted as due to the presence of cryptic relatedness, while a significantly positive slope is interpreted as due to population stratification. We applied this procedure on all real data (Supplementary Table S4). This procedure diagnoses well the presence of a population stratification for longitude, latitude, height, head circumference, and waist-to-hip ratio. However, for anthropometric traits, when one PC is added in the model, the procedure would conclude that the population stratification is correctly taken into account. Also, when analysing the latitude, after the inclusion of 10 PCs, the slope is only borderline significant. This suggests that heritability estimates of a trait depending on the latitude, for example through the sun exposure, could have an important bias which would be difficult to detect by this method.
A recent work argued that considering the genotype matrix as fixed in the linear mixed model, while in reality the genotypes are sampled from the population, is a cause of unreliability of genomic heritability estimates, and that this is aggravated by the presence of population stratification 36 ; the conclusions of this work are debatable 37 . However, and more generally, the assumptions of the linear mixed model are unrealistic 38 . They certainly are not fulfilled when it comes to the geographical coordinates of the birth cities; we do not know to what extent they are more realistic for other traits. This alone should compel us to take any genomic heritability estimate with a grain of salt.

Methods
Estimation of the heritability. We consider the linear mixed model n a vector of phenotypes,  ∈ × X n p the matrix of covariates included with fixed effects,  α ∈ p the fixed effect vector,  ∈ × Z n q the standardized genotype matrix, the error vector. The variance components τ and σ 2 are estimated by Restricted Maximum Likelihood (REML) [39][40][41][42] . We used the R package Gaston 43 and GCTA 44 . Example of commands are given in Supplementary Methods (Paragraph 1).
The heritability is usually estimated by = τ τ σ + h 2 2 -which ignores the variance of the phenotype which is explained by the covariates included in X. This is tantamount to defining the heritability as the proportion of genetic variance in the phenotype variance yet unexplained by known covariates, as in ref. 9. However, it is also possible to split the variance of Y in three parts: the variance explained by the covariates in X, the genetic variance τ, and the remaining variance σ 2 .
Estimating the variance due to the covariates by the empirical variance of the components of β  X would result in upward biased estimates. We show in Supplementary Methods (Paragraph 2) how to estimate it without bias. We will report both h 2 and this decomposition of the variance in three components.
Correcting for population stratification. To correct for population stratification, we include the first p principal components of the kinship matrix estimated only on a submatrix Z 1 of Z, obtained by pruning SNPs from Z, to retain approximately 50000 SNPs in low linkage disequilibrium 45 .
with PC i the i th principal component vector (n × 1) and β i the fixed effect of the i th PC.
We will use values of p varying from 0 to 2000 (for n = 5793).
The Three-City genomic data. The 3C study 22 is an ongoing French population-based longitudinal study which started in 1999. Participants were randomly selected from electoral rolls of the cities of Bordeaux, Dijon and Montpellier. To be eligible, they had to be aged 65 years or older, living in one of the recruitment cities, and not institutionalized. A total of 9294 individuals are included. Participants were genotyped with Illumina Human610-Quad BeadChip in the Centre National de Génotypage as described in ref. 46. The 3C data can be shared for an ancillary study, subject to approval by the 3C-Study Steering Committee (http://www. three-city-study.com/genetic-studies.php). Quality Control was performed using PLINK 47 as described hereafter. Duplicate individuals and individuals with a discordance between genetic and clinical sex were discarded. In the sequel, only autosomal SNPs were considered. Individuals with more than 5% of missing genotypes, or with a proportion of heterozygous genotypes more than 3 standard deviations away from the observed mean were removed. Individuals with non-European ancestry were excluded using Principal Component Analysis on Hapmap individuals. Individuals known to be born outside mainland France were removed. To eliminate cryptic relatedness, we removed an individual from Scientific RepoRts | 6:26471 | DOI: 10.1038/srep26471 all pairs with a genetic relatedness superior to 0.025. Finally, SNPs with a call-rate below 99% or Hardy-Weinberg threshold of 10 −8 were removed.
After Quality Control, there are 5793 individuals (1499, 3676 and 618 from Bordeaux, Dijon and Montpellier respectively) and 509931 autosomal SNPs left.
The Principal Components for population stratification correction are computed on LD-pruned data, where we kept only SNPs with a minor allele frequency higher than 5%, and in mutual LD inferior to 0.1. We also removed SNPs in the long-range LD regions defined in ref. 48. The final LD-pruned dataset includes 49277 SNPs. The distribution of the first two PC depending of recruitment cities are given in Supplementary Fig. S1; the first PC correlates with the collection center. This observation is not surprising, as the geographical coordinates differ from one center to another.
Two types of traits are available; the longitude and the latitude of the birth cities which were retrieved from their zipcode (see first panel of Supplementary Fig. S1) and several anthropometric phenotypes: height, weight, BMI, waist-to-hip ratio and head circumference.
Simulated phenotypes. We simulated phenotypes according to the model (1), using the 3C genomic data: Z is the 5793 × 509931 matrix of standardized genotypes, and the PCs are computed on the LD-pruned data. The p = 10 first principal components are included. The coefficients β 1 , … , β p have been chosen such that each PC explains 2% of the phenotype variance, (thus, the 10 PCs together explain 20% of the variance). The remaining variance is equally distributed between genetic and environmental effects (thus, σ 2 = τ, h 2 = 0.5, and the proportion of total variance is 40% for each).