The transferability of lipid loci across African, Asian and European cohorts

Most genome-wide association studies are based on samples of European descent. We assess whether the genetic determinants of blood lipids, a major cardiovascular risk factor, are shared across populations. Genetic correlations for lipids between European-ancestry and Asian cohorts are not significantly different from 1. A genetic risk score based on LDL-cholesterol-associated loci has consistent effects on serum levels in samples from the UK, Uganda and Greece (r = 0.23–0.28, p < 1.9 × 10⁻¹⁴). Overall, there is evidence of reproducibility for ~75% of the major lipid loci from European discovery studies, except triglyceride loci in the Ugandan samples (10% of loci). Individual transferable loci are identified using trans-ethnic colocalization. Ten of fourteen loci not transferable to the Ugandan population have pleiotropic associations with BMI in Europeans; none of the transferable loci do. The non-transferable loci might affect lipids by modifying food intake in environments rich in certain nutrients, which suggests a potential role for gene-environment interactions.


Introduction
coefficients had a range of 0.27-0.28, 0.23-0.28 and 0.20-0.24, respectively. In APCDR-Uganda, the strongest association was observed for LDL (r = 0.28, SE = 0.01, p = 1.9 × 10⁻¹⁰⁷). The HDL association was attenuated compared to the European samples (r = 0.12, SE = 0.01, [...]). The effect of the TG score was markedly weaker (r = 0.06, SE = 0.01, p = 4.5 × 10⁻⁷). We also assessed associations between a given score and levels of each of the other biomarkers (Supplementary Table 2). In line with the trans-ethnic genetic correlation results, we observed inverse associations between the HDL score and TG levels and vice versa in all studies, except APCDR-Uganda.

Differences in LD structure, MAF and sample size make it difficult to assess replication for individual loci. Therefore, we propose a new strategy to assess evidence for shared causal variants between two populations: trans-ethnic colocalization. For this we repurposed a method that was originally developed for colocalization of GWAS and eQTL results: Joint Likelihood Mapping (JLIM) [16]. To assess its performance for GWAS results from samples with different ancestry, we carried out a simulation study. UK Biobank (UKB) was used as a European ancestry reference and compared to CKB and APCDR-Uganda. Phenotypes were simulated. In the simulations of distinct causal variants in the non-European and the reference group, the frequencies of false positives were close to 0.05 as expected (Table 3), with an almost uniform distribution of p-values (Supplementary Figure 1). The power to detect shared associations was good for both populations: 73.1% for APCDR-Uganda and 93.1% for CKB. To investigate whether the lower power for APCDR-Uganda could be due to its smaller sample size, we reran the analyses for CKB using a random subset of samples matching the sample size of APCDR-Uganda.
The results were similar to the ones for the full CKB set, suggesting that the power of this trans-ethnic colocalization method decreases somewhat with greater genetic distance between the populations that are compared.

We applied trans-ethnic colocalization for established lipid loci to each study with UKHLS as the reference. There was evidence for significant (p_JLIM < 0.05) colocalization with at least one of the target studies for about half of the major lipid loci (Supplementary Table 3). For several major TG loci, such as GCKR at 2p23.3 or LPL at 8p21.3, strong evidence of replication in the Asian studies was observed while there was no evidence of association in APCDR-Uganda (Figure 3b,c).

We compared major lipid loci showing evidence of replication in APCDR-Uganda with those not displaying any suggestion of replication. The proximal genes of replicating loci were enriched for lipid pathways including lipoprotein metabolism, lipid digestion, mobilisation and transport, chylomicron-mediated lipid transport and metabolism of lipids and lipoproteins. The proximal genes of the non-replicating loci were enriched for several other pathways in addition to lipid metabolism, including SHP2 signalling, ABV3 integrin pathway, cytokine signalling in immune system, cytokine-cytokine receptor interaction and transmembrane transport of small molecules (Supplementary Figures 2 and 3). We also assessed the associations of these loci with BMI in samples with European ancestry using publicly available summary statistics from the GIANT consortium [17] (N 484,680) (Table 4). Ten of the fourteen non-replicating lipid loci had pleiotropic associations with BMI at a Bonferroni-adjusted threshold of p < 0.0024. None of the seven replicating lipid loci were associated with BMI.
We also assessed four additional loci that were not significant in the trans-ethnic colocalization but displayed [...]

Estimates of trans-ethnic genetic correlations between European, Chinese and Japanese samples were close to 1. Associations of polygenic scores for LDL were not attenuated in the Ugandan population compared to the UK samples. All PRS associations in the two Greek isolated populations were also highly consistent with those in the UK samples.

Previous studies that compared the direction of effect of established loci or assessed associations of polygenic scores reported differing degrees of consistency [18-29]. However, most of them were conducted in American samples with diverse ancestry, had smaller sample sizes and applied a single-variant look-up or PRS for a limited number of genetic variants. The high degree of consistency for cholesterol biomarkers we observed also contrasts with previously reported trans-ethnic genetic correlations for other traits, such as major depression, rheumatoid arthritis, or type 2 diabetes, which were substantially different from 1 [30,31]. In a recent application using data from individuals with European and Asian ancestry from the UK and USA, the average genetic correlation across multiple traits was 0.55 (SE = 0.14) for GERA and 0.54 (SE = 0.18) for UK Biobank [32].

Differences in LD structure, MAF and sample size make it difficult to assess replication of individual loci. We therefore propose a new approach: trans-ethnic colocalization. Simulations showed consistent control of type I error rates as well as good power to detect associations. Colocalization identified shared causal variants even at loci where none of the individual variants were associated at stringent p-value thresholds.
However, for many of the major lipid loci, more than one independent association signal was identified in discovery GWAS [15]. When these are located in close proximity to each other, they can interfere with the trans-ethnic colocalization analysis because JLIM assumes a single causal variant (Figure 3d). Therefore, future work should extend this approach to accommodate loci harbouring multiple causal variants.

Using trans-ethnic colocalization, we showed that many established loci for triglycerides did not affect levels of this biomarker in Ugandan samples. This included loci associated at genome-wide significance in all the other studies, such as GCKR at 2p23.3 or LPL at 8p21.3. The polygenic score for triglycerides had a weak effect on measured levels in APCDR-Uganda [...] component in this sample (SNP heritability of 0.25, SE = 0.05 [8]) and there were some genome-wide significant associations (Supplementary Figure 6e). It is also unlikely that this can be explained purely by differences in LD and MAF because they would affect the analyses of the other two biomarkers as well. Instead, these discrepancies could be caused by gene-environment interactions. Most of the lipid loci that did not replicate in the Ugandans had pleiotropic associations with BMI in European ancestry samples while none of the replicating loci were linked to BMI. It is possible that the non-replicating variants affect the amount of food intake with downstream consequences for lipid levels. This might require an environment offering diets that are rich in certain nutrients. While the replicating genes were almost exclusively linked to pathways of lipid metabolism, the non-replicating genes were involved in diverse pathways, which is in line with this hypothesis.
An alternative explanation could be that the non-replicating loci are involved in metabolising nutrients given a particular diet that is not common in Uganda with downstream consequences for weight.

Overall, this could suggest an important role of environmental factors in modifying which genetic variants affect lipid levels. Studying the causes for discordant loci between groups has promise to further elucidate the biological mechanisms of lipid regulation and other complex traits. Applying genetic risk prediction within clinical settings is receiving increasing attention. Our findings demonstrate that the transferability of genetic associations across different ancestry groups and environmental settings should be assessed comprehensively for medically relevant traits. This is important in order to maximise the potential of precision medicine to yield health benefits that are widely shared within and across populations. Ongoing programs in underrepresented countries [33] [...]

Each study underwent standard quality control. Details of the genome-wide association analyses with lipid traits have been previously described for GLGC [14], BBJ [13], HELIC [10], and UKHLS [12]. The association analysis for APCDR-Uganda was carried out within a mixed model framework using GEMMA [38]. Rank-based inverse normal transformation was applied to the lipid biomarkers after adjusting for age and gender. In China Kadoorie Biobank, lipid levels were regressed against eight principal components, region, age, age², sex, and, for LDL and TG, fasting time². LDL levels were derived using the Friedewald formula. After rank-based inverse normal transformation, the residuals were used as the outcomes in the genetic association analyses using linear regression. In eMERGE, biomarkers were adjusted for age, gender, kidney disease, statin use, type 2 diabetes status and disorders relating to growth hormones.
Associations were carried out within a mixed model framework using [...] identified all variants in LD (r² > 0.6) based on the European ancestry 1000 Genomes data. We assessed whether the lead or any of the correlated variants, henceforth called the credible set, displayed evidence of association in the target study. We used a p-value threshold of p < 10⁻³. If this was not the case, we tested whether there was any other variant with evidence of association within a 50Kb window. While this p-value threshold might not be appropriate to provide conclusive evidence of replication for individual loci, we used this to test evidence of replication across sets of loci. As a benchmark, we computed the minimum p-value in 1000 random windows of 50Kb for each study. Fewer than 5% of random windows had a minimum p < 10⁻³ for the non-European ancestry studies.
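The replication check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names, the array-based input layout and the choice to centre random windows on observed variants are all hypothetical. Only the p < 10⁻³ threshold and the 50Kb window (25Kb either side of the lead variant) come from the text.

```python
import numpy as np

P_THRESHOLD = 1e-3    # replication threshold stated in the text
HALF_WINDOW = 25_000  # 25 kb either side of the lead variant (50 kb window)


def classify_locus(positions, pvalues, lead_pos, in_credible_set):
    """Return 'credible_set', 'region' or 'none' for one locus.

    positions, pvalues and in_credible_set are parallel arrays describing
    target-study variants on the chromosome of the lead variant
    (hypothetical input layout).
    """
    positions = np.asarray(positions)
    pvalues = np.asarray(pvalues, dtype=float)
    in_credible_set = np.asarray(in_credible_set, dtype=bool)

    # Step 1: is the lead variant or a correlated variant associated?
    if np.any(pvalues[in_credible_set] < P_THRESHOLD):
        return "credible_set"

    # Step 2: otherwise, is any other variant in the window associated?
    in_window = np.abs(positions - lead_pos) <= HALF_WINDOW
    if np.any(pvalues[in_window & ~in_credible_set] < P_THRESHOLD):
        return "region"
    return "none"


def random_window_hit_rate(positions, pvalues, n_windows=1000, seed=0):
    """Benchmark: share of random 50 kb windows whose minimum p-value
    falls below the threshold. Windows are centred on randomly drawn
    variants (an assumption) so that no window is empty."""
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions)
    pvalues = np.asarray(pvalues, dtype=float)
    centres = rng.choice(positions, size=n_windows)
    hits = 0
    for centre in centres:
        window = np.abs(positions - centre) <= HALF_WINDOW
        hits += bool(pvalues[window].min() < P_THRESHOLD)
    return hits / n_windows
```

A locus classified as 'region' corresponds to the case where no credible-set variant is associated but another variant in the window is; the benchmark rate gives a rough sense of how often such a hit would arise by chance.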

Trans-ethnic genetic correlations
We used the Popcorn software [30] to estimate trans-ethnic genetic correlations between studies while accounting for differences in LD structure. This provides an indication of the correlation of causal-variant effect sizes across the genome at SNPs common in both populations. Variant LD scores were estimated for ancestry-matched 1000 Genomes data for each study combination. The estimation of LD scores failed for chromosome 6 for some groups. We therefore left out chromosome 6 from all comparisons. Variants with imputation accuracy r² < 0.8 or MAF < 0.01 were excluded. Popcorn did not converge for any of the studies with fewer than 20,000 samples. Therefore, results are presented for comparisons between GLGC2013, CKB and BBJ. We estimated effect rather than impact correlations.
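One simple way to judge whether an estimated genetic correlation differs from 1 is a one-sided Wald test on the estimate and its standard error. The sketch below is an assumption for illustration only: it is not taken from the Popcorn software, the function name is hypothetical, and the input values are made up.

```python
import math


def p_value_rg_below_one(rg, se):
    """One-sided Wald test of H0: rg = 1 against H1: rg < 1.

    rg is the estimated genetic correlation and se its standard error.
    """
    z = (1.0 - rg) / se
    # Upper-tail probability of a standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical estimate: rg = 0.95 with SE = 0.05 (one SE below 1)
p = p_value_rg_below_one(0.95, 0.05)
print(f"p = {p:.3f}")
```

An estimate one standard error below 1, as in this made-up example, would not be considered significantly different from 1 at conventional thresholds.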

Polygenic scores
We created polygenic scores based on the established lipid loci and assessed their associations with lipid levels in UKHLS, the HELIC cohorts, and APCDR-Uganda, as it was not possible to compute trans-ethnic genetic correlations for these studies. For HELIC and UKHLS, extreme values (±3 SD, sex stratified) were filtered. Age, age² and sex were adjusted for by regressing the biomarker values on them and using the residuals as outcomes for subsequent analyses. For each biomarker in each sample set, we checked normality and homoscedasticity. HDL and LDL were approximately normally distributed. For TG levels, a Box-Cox transformation was used to normalize the data. APCDR-Uganda phenotype data were rank-based inverse normally transformed.

To make sure PRS were comparable across studies, we excluded variants that were absent, rare (MAF < 0.01) or badly imputed (r² < 0.8) in any of the studies and variants that had different alleles from those in the GLGC. The variant with the larger discovery p-value from each correlated pair of SNPs (r² > 0.1) was also removed. This left 120, 103 and 101 variants for HDL, LDL and TG, respectively. We created trait-specific weighted PRS. The β regression coefficients from SNP-trait associations in GLGC2017 [15] were used as weights. All biomarkers and scores were scaled to mean = 0 and standard deviation = 1 for each study so that the regression coefficients represent estimates of the correlation between scores and biomarkers.

We carried out association analyses between each polygenic score and each biomarker using a linear mixed model with random polygenic effect implemented in GEMMA [38] in order to account for relatedness and population structure. We used a Bonferroni correction to adjust for multiple testing of three PRS with three different biomarker outcomes (p < 0.05/9 = 0.0056).
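The score construction and scaling described above can be illustrated with a short Python sketch. It is a simplified stand-in, not the analysis itself: it uses ordinary least squares instead of the GEMMA linear mixed model, so it ignores relatedness and population structure, and all genotypes, weights and function names are simulated or hypothetical. The point it shows is that once score and biomarker are both scaled to mean 0 and SD 1, the regression slope equals their correlation.

```python
import numpy as np


def weighted_prs(dosages, betas):
    """Weighted score: sum over variants of effect-allele dosage x weight."""
    return np.asarray(dosages, dtype=float) @ np.asarray(betas, dtype=float)


def standardize(x):
    """Scale to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def score_trait_slope(score, trait):
    """OLS slope of the standardized trait on the standardized score.

    With both variables standardized, this slope equals Pearson's r.
    """
    s, t = standardize(score), standardize(trait)
    return float(s @ t / (s @ s))


# Simulated toy data (hypothetical sample size, variant count and weights)
rng = np.random.default_rng(1)
n_samples, n_variants = 2000, 50
freqs = rng.uniform(0.05, 0.5, size=n_variants)
dosages = rng.binomial(2, freqs, size=(n_samples, n_variants))
betas = rng.normal(0.0, 0.05, size=n_variants)

prs = weighted_prs(dosages, betas)
trait = prs + rng.normal(0.0, 1.0, size=n_samples)
slope = score_trait_slope(prs, trait)

# Bonferroni threshold for three scores tested against three biomarkers
bonferroni_threshold = 0.05 / 9
```

In a real analysis the weights would be the published discovery effect sizes aligned to the same effect allele as the dosages, which is why variants with mismatched alleles are dropped before scoring.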
Differences in allele frequency, LD structure and sample size make it difficult to assess whether a given GWAS hit replicates in samples with different ancestries. Therefore, we applied trans-ethnic colocalization. Colocalization methods test whether the associations in two studies can be explained by the same underlying signal even if the specific causal variant is unknown. The joint likelihood mapping (JLIM) statistic was developed by Chun and colleagues to estimate the posterior probabilities for colocalization between GWAS and eQTL signals and compare them to probabilities of distinct causal variants [16]: [...]

JLIM explicitly accounts for LD structure. Therefore, we assessed whether it is suitable for trans-ethnic colocalization. For samples with summary statistics, LD scores were estimated using ancestry-matched samples from the 1000 Genomes Project v3. JLIM assumes only one causal variant within a region in each study. We therefore used a small window of 50Kb for each known locus to minimise the risk of interference from additional association signals. Distinct causal variants were defined by separation in LD space by r² < 0.8 from each other. We excluded loci within the major histocompatibility region due to its complex LD structure. We used a significance threshold of p < 0.05 given the evidence of association of the established lipid loci in Europeans and the overall evidence for shared causal genetic architecture across populations for most lipid traits from our other analyses. We compared each target study to UKHLS because of its high level of homogeneity in terms of ancestry, biomarker quantification and study design.

Simulation
To test the power of trans-ethnic colocalization to detect associations shared between pairs of populations with different ancestry, we ran JLIM on two sets of simulated traits with realistic effect size and environmental noise level. The first set of simulations used the same causal variant in both populations, whereas the second set of simulations used discordant causal variants, chosen randomly. We sampled 10,000 randomly chosen biallelic variants with MAF > 0.05 and simulated random phenotypes in CKB, APCDR-Uganda and 50,000 individuals with British ancestry from UK Biobank. For each data set relatives were excluded. We also sub-sampled CKB to match the smaller number of individuals in APCDR-Uganda in order to test whether different performance is due to ancestry or sample size. We used a simple linear model to generate the phenotype for each individual i:

$$y_i = \beta x_i + \epsilon_i$$

where $y_i$ is the phenotype value, $\beta$ is the effect size, $x_i$ is the number of alternate alleles carried at the locus and $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma^2$ is the variance of the environmental noise and $\mathrm{cov}(x, \epsilon) = 0$. We used $\beta = 0.25$ and $\sigma^2 = 1$.
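The phenotype model above is simple enough to sketch directly. The following Python snippet is a minimal illustration, assuming genotypes drawn from a binomial distribution at a made-up allele frequency rather than real cohort data; the function name and the frequency of 0.2 are hypothetical, while β = 0.25 and σ² = 1 are the values stated in the text.

```python
import numpy as np


def simulate_phenotype(genotypes, beta=0.25, sigma2=1.0, rng=None):
    """Generate y_i = beta * x_i + eps_i with eps_i ~ N(0, sigma2).

    The noise is drawn independently of the genotypes, so cov(x, eps) = 0
    in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(genotypes, dtype=float)
    noise = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    return beta * x + noise


rng = np.random.default_rng(42)
maf = 0.2  # hypothetical allele frequency (above the MAF > 0.05 filter)
x = rng.binomial(2, maf, size=50_000)  # alternate-allele counts 0, 1 or 2
y = simulate_phenotype(x, rng=rng)

# Sanity check: the regression slope of y on x should recover beta
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(f"estimated effect size: {beta_hat:.3f}")
```

For the shared-variant scenario the same variant would be used to generate phenotypes in both populations; for the distinct-variant scenario a different randomly chosen variant would be used in each.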

Comparison of replicating with non-replicating loci
We aimed to assess whether there are systematic differences between loci that are shared between European ancestry samples and APCDR-Uganda and loci that are not. We identified all loci with evidence of replication based on the above definition that also had significant (p < 0.05) colocalization. We only kept one variant per region. We contrasted them with loci where none of the evidence suggested replication: p > 0.05 for colocalization, no variant with a lipid association at p < 10⁻³ in the region and the lead variant from the discovery study was not rare in APCDR-Uganda. We identified the nearest protein-coding gene for each locus and carried out a pathway analysis for the two sets using FUMA [40]. We also assessed the associations of the lead variants with body mass index (BMI) in European ancestry samples using results from a meta-analysis between the GIANT consortium and UK [...]

Data availability
The UKHLS EGA accession number is EGAD00010000918. Genotype-phenotype data access for UKHLS is available by application to Metadac (www.metadac.ac.uk) [...]

Table 1. Percentage of established lipid-associated loci with evidence of replication in each target study. Results shown separately by strength of association (whether p < 10⁻¹⁰⁰) in the discovery study (GLGC). Only one SNP was kept for each locus with multiple associated variants in close proximity. Regions were defined as 25Kb either side of the lead variant.
b No variant in the credible set associated in the target set at p < 10⁻³ but an uncorrelated variant in the region is associated in target set at p < 10⁻³
c A variant in the credible set is associated in the target set at p < 10⁻³