Common genetic variants and health outcomes appear geographically structured in the UK Biobank sample: Old concerns returning and their implications

The inclusion of genetic data in large studies has enabled the discovery of genetic contributions to complex traits and their use in applied analyses, including those that use genetic risk scores (GRS) to predict phenotypic variance. If genotypes are structured by location and the trait of interest is structured in the same way, analyses can be biased. Having illustrated structure in an apparently homogeneous collection, we aimed to a) test for geographical stratification of genotypes in UK Biobank and b) assess whether stratification might induce bias in genetic association analysis. We found that single genetic variants are associated with birth location within UK Biobank and that geographic structure in genetic data could not be accounted for by routine adjustment for study centre and principal components (PCs) derived from genotype data. We found that GRS for complex traits appear geographically structured and that analyses using GRS can yield biased associations. We discuss the likely origins of these observations and potential implications for analysis within large-scale population-based genetic studies.



Main
Many recent and ongoing research programmes aim to systematically identify genetic contributions to complex traits and undertake applied epidemiological analyses using genotype data. Irrespective of its source, latent structure within a dataset matters when performing these analyses, because structural alignment between ancestry and genotypes, health outcomes and geography has the potential to induce artefactual relationships 1. Current methods to account for structure include proxy measurement and adjustment for latent structure within datasets (mainly using PCs or measures of actual geographic location 2-4).
Recent developments in resources, applications and understanding warrant a re-exploration of latent structure in datasets. Prior to 2015, very large samples were only achieved by aggregation of smaller studies whose structural properties and geographical footprints were neither detectable within single studies nor coordinated across the collection of studies. Now analysis can be undertaken in very large individual collections with the capacity to capture a single geographical footprint, such as UK Biobank 5. With increased sample size and statistical power, there is now potential to discover a broader range of genetic effects that might conceivably capture characteristics of the structural properties or geographical footprint of the dataset. This sits in the context of a growing appreciation of fine-scale population structure within the British population 6.
These changing circumstances are relevant for applied epidemiological analyses, which have developed substantially by exploiting reliable genetic association results. A good example is Mendelian randomization, which aims to escape confounding in observational associations by using genetic variation to proxy risk factors of interest 7. Recent literature has focused on maximising the use of the current wave of genetic association evidence and accounting for undesirable pleiotropic effects of single variants 8. This activity, however, has largely assumed that structure is addressed during the discovery of associated genetic variants. Under-appreciated structure in genetic datasets challenges the assumption that genetic instruments are not related to potentially confounding features 9.
As an exemplar, we examined whether there is previously under-appreciated structure in a well understood, ethnically and geographically homogeneous resource. In the Avon Longitudinal Study of Parents and Children (ALSPAC) 10,11, we studied mothers who were recruited during pregnancy in the Bristol area (South West UK) in the early 1990s. We undertook chromosome painting 12 to describe fine-scale relatedness between each mother and each of the regions of the Peopling of the British Isles (PoBI) project 6. We summarised each mother's ancestral lineage as a mixture of the PoBI regions, allowing us to estimate the educational attainment that those regions would have, were the ALSPAC mothers' education levels explained by this variation. Doing so revealed a pattern of lower educational attainment in lineages originating from the regions immediately surrounding Bristol (Figure 1) and higher educational attainment in more geographically distant lineages. Distant lineages are likely only represented in ALSPAC by individuals or families who had migrated, and we anticipate that the educational attainment of people who migrate for economic reasons differs from that of people who do not. Educational attainment is therefore aligned to subtle genetic differences even in this apparently geographically and ethnically homogeneous population, and this alignment is coincident with axes of ancestry.
The structure in ALSPAC was detected here using a method which is highly sensitive to ancestry. With greater power, it is entirely possible that the same phenomena may become detectable in more routine analytical procedures. We therefore turned to UK Biobank, an exceptional resource containing a catalogue of health, disease and genotype data of almost half a million participants 5,13. Conceptually, UK Biobank is analogous to a superimposition of multiple ALSPACs, each of which recruited participants living near a study assessment centre. This design gives UK Biobank the capacity to represent a broad spectrum of UK ancestry and structure, but is also sensitive to important sampling phenomena including self-
GWAS for birth location showed that single variants are associated with geography within UK Biobank. An unadjusted model produced distorted and inflated plots with evidence for association at variants across the autosomes. After adjustment for genotyping array, 40 PCs and a factor variable representing UK Biobank assessment centre, single variants remained associated with birth location (Figure S1).
Rather than using single genetic variants, empirical epidemiological analyses often use genetic risk scores (GRS) 19,20. As exemplars, we took genetic variants and weightings associated with educational attainment, height and body mass index (BMI) from published genome-wide meta-analyses 21-23. Using an approach that is widespread in applied analyses, we derived weighted and unweighted GRS for the three traits based on variants with p<5e-08 and p<1e-05 in the discovery sample. We used generalised additive models 24 in the 'mgcv' package (version 1.8) 25 within R (version 3.3.1) 26 to test for non-linear relationships between GRS and geographical terms. All GRS tested were associated with birth location in an unadjusted model and in a model that adjusted only for genotyping array.
These associations attenuated but were not extinguished in models incorporating adjustment for 40 PCs and study centre. This was especially true for educational attainment and north location at birth, where statistical adjustment had little impact on the fitted geographical distribution of the GRS (Figure 2, Table 1).
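The construction of weighted and unweighted GRS at the two p-value thresholds can be sketched as follows. This is a minimal illustration rather than the pipeline used here: the variant identifiers, effect sizes, p-values and dosages are invented, the function names are hypothetical, and aligning each variant to its trait-increasing allele is an assumption about a common convention, not a detail stated above.

```python
from typing import Dict, List, Tuple

# Hypothetical discovery summary statistics: variant id -> (effect size, p-value).
# These values are made up for illustration and are not taken from refs 21-23.
DISCOVERY: Dict[str, Tuple[float, float]] = {
    "rs_a": (0.10, 1e-12),
    "rs_b": (-0.05, 3e-09),
    "rs_c": (0.02, 4e-06),
    "rs_d": (0.01, 0.03),
}


def select_variants(stats: Dict[str, Tuple[float, float]], p_threshold: float) -> List[str]:
    """Return ids of variants whose discovery p-value is below the threshold."""
    return sorted(v for v, (_, p) in stats.items() if p < p_threshold)


def genetic_risk_score(
    dosages: Dict[str, int],
    stats: Dict[str, Tuple[float, float]],
    variants: List[str],
    weighted: bool = True,
) -> float:
    """Sum trait-increasing allele dosages (0, 1 or 2), optionally weighted by |beta|."""
    score = 0.0
    for v in variants:
        beta, _ = stats[v]
        dose = dosages[v]
        # Assumed convention: count the trait-increasing allele, so flip the
        # dosage when the reported effect is negative.
        if beta < 0:
            dose = 2 - dose
        score += abs(beta) * dose if weighted else dose
    return score


# One hypothetical participant's allele dosages.
person = {"rs_a": 2, "rs_b": 0, "rs_c": 1, "rs_d": 1}

strict = select_variants(DISCOVERY, 5e-08)   # genome-wide significant variants
relaxed = select_variants(DISCOVERY, 1e-05)  # more permissive threshold

weighted_strict = genetic_risk_score(person, DISCOVERY, strict, weighted=True)
unweighted_relaxed = genetic_risk_score(person, DISCOVERY, relaxed, weighted=False)
```

In practice the score would be computed for every participant and then standardised before being related to geographical terms; that step is omitted here.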
Having found evidence for association between genotypic variation and geography, we used generalised additive models to test for non-linear relationships between four exemplar complex traits and geography. Reported household income, measured BMI, reported age at completion of full-time education and reported number of siblings showed strong evidence for geographical stratification (p<2e-16 for non-linear relationship between observed traits and axes of birth location).
We noted that structure in genotypes and phenotypes appeared geographically coincident (example Figure S2), which led us to explore the potential role of geography in confounding applied analysis. We tested for linear association between GRS and complex traits and examined whether the inclusion of non-linear terms for birth location as covariates altered the results, again using generalised additive models. These relationships changed in magnitude with the addition of non-linear terms for birth location (Table 2), suggesting a role for residual confounding by geographical location. For example, the relationship between genetically predicted BMI and household income (pounds sterling per year per 1 standard deviation (SD) increase in GRS for BMI) changed from -335 in the unadjusted model to -251 (adjusted for 40 PCs and study location) to -229 (adjusted for 40 PCs, study location and non-linear terms for birth location). Birth location captures neither the full extent of variation in fine ancestral structure (which predicts GRS) nor the full extent of geographically structured social and economic differences (which predict income). It is therefore possible that these adjusted estimates contain residual confounding and that the true impact of biases within this sample is larger than these results suggest.
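To make the size of this shift concrete, the three estimates quoted above can be expressed as a percentage attenuation relative to the unadjusted model: roughly a quarter after adjustment for PCs and study location, and just under a third once birth location is added. The short calculation below uses only the figures given in the text; the dictionary keys and the helper name are illustrative labels, not terms from the analysis.

```python
# Estimates quoted in the text: change in household income (pounds sterling per
# year) per 1 SD increase in the GRS for BMI, under three adjustment sets.
estimates = {
    "unadjusted": -335.0,
    "pcs_and_centre": -251.0,
    "pcs_centre_and_birth_location": -229.0,
}


def percent_attenuation(reference: float, adjusted: float) -> float:
    """Percentage by which `adjusted` is closer to zero than `reference`."""
    return 100.0 * (1.0 - adjusted / reference)


base = estimates["unadjusted"]
for label, value in estimates.items():
    print(f"{label}: {value:.0f} ({percent_attenuation(base, value):.1f}% attenuation)")

# Additional attenuation attributable to the non-linear birth-location terms,
# relative to the model already adjusted for 40 PCs and study location.
extra = percent_attenuation(
    estimates["pcs_and_centre"], estimates["pcs_centre_and_birth_location"]
)
print(f"further attenuation from birth-location terms: {extra:.1f}%")
```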
As an alternative way to demonstrate the potential impact of such bias, we analysed simulated geographically stratified complex traits which preserved coarse geographical variance in observed traits whilst removing direct genotype-phenotype effects. This analysis produced associations between GRS and complex traits even in the absence of direct genetic effects on biology, suggesting that GRS predict geographical location within the UK Biobank sample (online methods and Table S1).
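The logic of this kind of simulation can be illustrated with a toy example. It is not the simulation described in the online methods: the number of regions, the gradients and the noise levels are arbitrary assumptions, and adjustment is done by removing region means (equivalent to a linear model with a region factor) rather than by the generalised additive models used in the analysis. The trait is generated from location alone, so any association with the GRS arises purely from shared geography.

```python
import random
from statistics import fmean


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean, i.e. adjust for a categorical covariate."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


rng = random.Random(1)           # fixed seed so the sketch is reproducible
n_regions, n_per_region = 10, 2000  # arbitrary illustrative sizes

region, grs, trait = [], [], []
for r in range(n_regions):
    northing = r / (n_regions - 1)  # 0 = most southerly, 1 = most northerly
    for _ in range(n_per_region):
        region.append(r)
        # GRS mean drifts with location (assumed gradient of 0.5 SD).
        grs.append(rng.gauss(0.5 * northing, 1.0))
        # Trait depends on location only: no direct genetic effect is simulated.
        trait.append(rng.gauss(-3.0 * northing, 1.0))

# Naive regression picks up the shared geographic gradient.
naive = ols_slope(grs, trait)

# Adjusting both variables for region removes the induced association.
adjusted = ols_slope(demean_by_group(grs, region), demean_by_group(trait, region))

print(f"naive slope: {naive:.3f}, region-adjusted slope: {adjusted:.3f}")
```

Under these assumptions the naive slope is clearly negative while the region-adjusted slope is close to zero. In real data, location is measured far more coarsely than the structure it stands in for, so adjustment would be expected to remove only part of the induced association.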
The presence of structure within the genetic data of UK Biobank has several potential explanations, including a legacy of ancient ancestral groups that are not fully admixed 6,27, a consequence of non-random mating or polygenic selection 28-30, a study artefact induced by selection bias 17, or a combination of these. Regardless of origin, unaddressed structure in this sample is sufficient for predictions based on GRS to induce associations where there is little or no direct effect. Recent evidence from an investigation in the USA 31 also illustrates associations between GRS and complex traits at the ecological level. Now manifest, this property should be added to the growing list of limitations to naïve use of GRS, including horizontal pleiotropy 7, high false discovery rate 32, association with coarse ancestral groups 33 and prediction of inter-generational phenotypes, which complicates interpretation 34.
The ability of very large studies to detect effects indistinguishable from artefactual biases or ancestral differences demands reworked approaches to exploit 35, or at least account for, structure. Exciting recent developments aim to improve statistical models 36 or leverage information from family-based study designs for unbiased inference 37. Until such methods have developed further, a thorough understanding of the properties of genotypic and phenotypic data and of the impact of study design will remain critical in allowing reasonable inference.

References:
on and spurious allelic association. The Lancet 361, 598-604 (200
Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533, 539-+ (201

Tables
Table 1 - Relationship between GRS and birth location within UK Biobank. P value for association between GRS and geographical ter