A comprehensive reanalysis of publicly available GWAS datasets reveals an X chromosome rare regulatory variant associated with high risk for type 2 diabetes

The reanalysis of publicly available GWAS data represents a powerful and cost-effective opportunity to gain insights into the genetics and pathophysiology of complex diseases. We demonstrate this by gathering and reanalyzing public type 2 diabetes (T2D) GWAS data for 70,127 subjects, using an innovative imputation and association strategy based on multiple reference panels (1000G and UK10K). This approach led us replicate and fine map 50 known T2D loci, and identify seven novel associated regions: five driven by common variants in or near LYPLAL1, NEUROG3, CAMKK2, ABO and GIP genes; one by a low frequency variant near EHMT2; and one driven by a rare variant in chromosome Xq23, associated with a 2.7-fold increased risk for T2D in males, and located within an active enhancer associated with the expression of Angiotensin II Receptor type 2 gene (AGTR2), a known modulator of insulin sensitivity. We further show that the risk T allele reduces binding of a nuclear protein, resulting in increased enhancer activity in muscle cells. Beyond providing novel insights into the genetics and pathophysiology of T2D, these results also underscore the value of reanalyzing publicly available data using novel analytical approaches.

70KforT2D: Identification of novel T2D loci with publicly available GWAS data 2 1 6 . P r  o  g  r  a  m  s  i  n  M  e  t  a  b  o  l  i  s  m  a  n  d  M  e  d  i  c  a  l  &  P  o  p  u  l  a  t  i  o  n  G  e  n  e  t  i  c  s  ,  B  r  o  a  d  I  n  s  t  i  t  u  t  e  o  f   H  a  r  v  a  r  d  a  n  d  M  I  T  ,  C  a  m  b  r  i  d  g  e  ,  M  a  s  s  a  c  h  u  s  e  t  t  s  ,  U  S  A  .   1  7  .  D  i  a  b  e  t  e  s  U  n  i  t  a  n  d  C  e  n  t  e  r  f  o  r  G  e  n  o  m  i  c  M  e  d  i  c  i  n  e  ,  M  a  s  s  a  c  h  u  s  e  t  t  s  G  e  n  e  r  a  l  H  o  s  p  i  t  a  l  ,   B  o  s  t  o  n  ,  M  a  s  s  a  c  h  u  s  e  t  t  s  ,  U  S  A  .   1  8  .  D  e  p  a  r  t  m  e  n  t  o  f  M  o  l  e  c  u  l  a  r  B  i  o  l  o  g  y  ,  H  a  r  v  a  r  d  M  e  d  i  c  a  l  S  c  h  o  o  l  ,  B  o  s  t  o   During the last decade, hundreds of genome-wide association studies (GWAS) have been performed with the aim of providing a better understanding of the biology of complex diseases, improving their risk prediction, and ultimately discovering novel therapeutic targets 1 . However, the majority of the published GWAS just report the primary findings, which generally explain a small fraction of the estimated heritability. In order to better uncover the missing heritability, most strategies usually involve the generation of new genetic and clinical data. Very rarely, new studies are based on the revision and reanalysis of existing genetic data by applying more powerful analytic techniques and resources generated after the primary GWAS findings are published. These cost-effective reanalysis strategies are now possible, given the existence of (1) different data-sharing initiatives that gather large amounts of primary genotype and sequencing data for multiple human genetic diseases, as well as (2)  and rare variants (MAF<0.01). This increases the power to identify novel associations, as well as to refine known associated loci through fine-mapping. Moreover, the availability of public primary genetic data further allows the homogeneous integration of multiple datasets with different origins providing more accurate meta-analysis results, particularly at the low ranges of allele frequency. In contrast, common meta-analytic strategies usually involve the aggregation of summary statistics generated independently by different centers, each applying their own internal quality-control, data cleaning, genotype imputation, and association testing methodologies. Each of these steps represents a potential source of additional heterogeneity, which may ultimately result in loss of statistical power. Finally, the vast majority of reported GWAS analyses omits the X chromosome, even though it 70KforT2D: Identification of novel T2D loci with publicly available GWAS data 5 represents 5% of the genome and encodes for more than 1,500 genes 3,4 . The reanalysis of publicly available data also enables interrogation of this chromosome.
We hypothesized that a unified reanalysis of multiple publicly available datasets, applying homogeneous standardized QC, genotype imputation and association methods, as well as novel and denser sequence-based reference panels for imputation would provide new insights into the genetics and the pathophysiology of complex diseases. To test this we focused on type 2 diabetes (T2D), one of the most prevalent complex diseases for which many GWAS have been performed during the past decade 5,6,7,8,9,10,11 . These studies have allowed the identification of more than 100 independent loci, most of them driven by common variants, with a few exceptions 11,12,13 . Despite all these efforts, only a small fraction of the genetic heritability can be explained by established T2D loci, and the role of low-frequency variants, although recently proposed to be minor 14 , has still not been fully explored. The availability of large T2D genetic datasets in combination with larger and more comprehensive genetic variation reference panels 15,16,17,18 , gives us the opportunity to impute a significant increased fraction of low-frequency and rare variants, and to study their contribution to the risk of developing this disease. At the same time, this strategy also allows to fine map known associated loci, increasing the chances of finding causal variants and understanding their functional impact. We therefore gathered publicly available T2D GWAS cohorts with European ancestry, comprising a total of 13,857 T2D cases and 62,126 controls, to which we first applied harmonization and quality control protocols across the whole genome (including the X chromosome) and across samples, to later carry out imputation using 1,000 Genomes Project (1000G) 15 and UK10K 16,17,18 reference panels, as well as association testing. By using this strategy, we identified novel associated regions driven by common, low-frequency, and rare variants, fine-mapped and functionally annotated the existing and novel ones, and 70KforT2D: Identification of novel T2D loci with publicly available GWAS data 6 confirmed experimentally a regulatory mechanism disrupted by a novel rare and large effect variant identified at the X-chromosome.

Overall analysis strategy
As shown in Figure 1 This strategy also allowed us to impute of 1,357,753 indels with good quality ( Supplementary   Figure 1 B).
In order to take full advantage of publicly available genetic data, we used three main metaanalytic approaches to adapt to three of the most common strategies for genetic data sharing:   In addition, variant set enrichment analysis of the T2D-associated credible sets across regulatory elements defined in isolated human pancreatic islets showed a significant enrichment for active regulatory enhancers (Supplementary Figure 4), suggesting that causal SNPs within associated region have a regulatory function, as previously reported 23 .

Identification, fine-mapping and functional characterization of novel and previously known loci
The three association strategies allowed us to finally identify 57 genome-wide significant associated loci (p≤5×10 -8 ), of which seven were not previously reported as associated with T2D ( Table 1). The remaining 50 loci were already known, and included, for example, two low-frequency variants recently discovered in Europeans, one located within one of the CCND2 introns (rs76895963) and a missense variant within the PAM 13 gene. Furthermore, as a quality control of these results, we confirmed that the magnitude and direction of the effect of all the associated variants (p≤0.001) were highly consistent with those reported previously (Rho=0.92, p=1×10 -248 , Supplementary Figure 5). In fact, the direction of effect was consistent with all 139 previously reported variants except three that were discovered in east and south Asian populations (Supplementary Table 4).

Identification of new signals driven by common variants
Beyond the detailed characterization of the known T2D associated regions, we also identified novel loci, among which, five were driven by common variants linked to modest effect sizes (1.06 < OR <1.12; Table 1 Within the first novel T2D-associated locus in chromosome 1q41 (LYPLAL1-ZC3H11B,  Table 9).
The fifth novel locus driven by common variants is located within 17q21.32 (rs12453394, OR=1.07 [1.05-1.10], p=3.23×10 -8 ). It involves three missense variants located within the CALCOCO2, SNF8 and GIP genes. GIP encodes for glucose-dependent insulinotropic peptide, a hormonal mediator of enteral regulation of insulin secretion 48 . Variants in the GIP receptor (GIPR) have previously been associated with insulin response to oral glucose challenge and beta-cell function 49 . GIP is thus a plausible candidate effector gene of this locus 50 .

Identification of a new signal driven by a low-frequency variant
We selected all low-frequency (0.01≤MAF<0.05) variants with p≤1×10 -4 in the 70KforT2D meta-analysis that were annotated as altering protein-coding variants according to VEP. This resulted in 15 coding variants that were meta-analyzed with exome array and whole-exome sequencing data from a total of ~97,000 individuals 19,20,21 after excluding the overlapping cohorts between the different datasets. This analysis highlighted a novel genome-wide association driven by a low-frequency missense variant (Ser58Phe) within the EHMT2 gene 70KforT2D: Identification of novel T2D loci with publicly available GWAS data 1 2 at chromosome 6p21.33 (rs115884658, OR=1.21 [1.14-1.29], p=3.00×10 -10 ; Figure 2,

Supplementary Figures 6 and 7). EHMT2 is involved in the mediation of FOXO1
translocation induced by insulin 51 . Since this variant was less than 1 Mb away from HLA-DQA1, which was recently reported to be associated with T2D 52 , we performed a series of reciprocal conditional analyses and excluded the possibility that our analysis was capturing previously reported T2D 52 or T1D 10,53,54 signals (Supplementary has been reported as a direct target of metformin by mediating the anti-proliferative effect of this drug in human glioblastoma 55 . All these findings support CLIC1, as an additional possible effector transcript, likely driven by rs115333512.

Identification of a novel rare variant in the X chromosome associated with 2.7-fold increased risk for T2D
As for many other complex diseases, the majority of published large-scale T2D GWAS studies have omitted the analysis of the X chromosome, with the notable exception of the identification of a T2D associated region near the DUSP9 gene in 2010 56 . To fill this gap, we tested the X chromosome genetic variation for association with T2D. To account for heterogeneity of the effects and for the differences in imputation performance between males and females, the association was stratified by sex and tested separately, as well as together by meta-analyzing the results from males and females. This analysis was able to replicate the DUSP9 locus, not only through the known rs5945326 variant (OR=1.15, p=0.049), but also 70KforT2D: Identification of novel T2D loci with publicly available GWAS data  Figure 8).
Whether this association is specific to men or whether it also affects female carriers remains to be clarified, since the imputation for this variant in females was not accurate enough.
To further validate this association and to discard potential imputation artifacts, we next analyzed two independent cohorts by performing imputation with the UK10K reference panel (SIGMA 57 , INTERACT 58 ), and a third cohort by de-novo genotyping the rs146662075 variant in several Danish sample sets. The initial meta-analysis, including the three replication datasets did not reach genome-wide significance (OR=1.72, p=1.6×10 -4 ) ( Figure   3B), and revealed a strong degree of heterogeneity (heterogeneity p het =0.002), which appeared to be driven by the replication cohorts. Within one of the case-control studies there was a nested cohort study; this study, the Inter99 prospective cohort, consisted of 1,652 non-70KforT2D: Identification of novel T2D loci with publicly available GWAS data 1 4 diabetic male subjects, of whom 158 developed T2D after eleven years of follow-up.
Analysis of incident diabetes in this cohort, confirmed the association with the same allele as previously seen in the case-control studies, with carriers of the rare T allele having increased risk of developing incident diabetes, compared to the C carriers (Cox-proportional Hazards Ratio (HR)=3.17 [1.3-7.7], p=0.011, Figure 3C): in fact, nearly 30% carriers of the T risk allele developed incident T2D during 11 years of follow-up, compared to only ~10% of noncarriers.
In order to gain further support for this association, we thoroughly compared the clinical and demographic characteristics of the discovery and replication datasets in an attempt to understand the strong degree of heterogeneity observed when adding the replication datasets.
We found that some of the replication datasets contained control subjects that were significantly younger than the average age at onset of T2D reported in this study and in  Figure 9). Since in the Inter99 prospective cohort we observed that 30% of subjects developed incident T2D over 11 years of follow-up, we performed an additional analysis using a stricter definition of controls, in order to avoid the presence of pre-diabetes or individuals that may further develop diabetes after reaching the average age at onset. We therefore applied two additional exclusion criteria in order to obtain a stricter set of controls: (i) excluding subjects younger than 55 years old, and (ii) restricting the analysis, when possible, to individuals with measured 2 hours plasma glucose values during oral glucose tolerance test (OGTT) below 7.8 mmol/l, which is employed to diagnose impaired glucose tolerance (pre-diabetes), a strong risk factor for developing T2D 60 . We repeated the metaanalysis using the first filtering criterion for controls for both the discovery and replication 70KforT2D: Identification of novel T2D loci with publicly available GWAS data 1 5 datasets, including only controls older than 55. While the application of the first filter did not yet provide genome-wide significant results (Supplementary Figure 10), when we added the second filter (only possible in the Danish cohort), we obtained replication results consistent with the initial discovery results. Moreover, we also integrated the Cox-proportional hazards results into the meta-analysis by using a method that accounts for overlapping subjects (MAOS) 61 allowing us to integrate the longitudinal (follow-up) and the case-control study.
This final meta-analysis confirmed the association at rs146662075 resulting in genome-wide significance and without significant heterogeneity (OR=2.7 (1.91, 3.81), p=1.7×10 -08 , p het =0.51, Figure 3D). These results therefore support the existence of a genetic association with T2D is driven by a rare variant at X chromosome.
The rs146662075 T risk allele is associated with 5-fold greater enhancer activity and disruption of allele specific nuclear protein binding.
We next explored the possible molecular mechanism behind this association by using  signal of both H3K27ac and AGTR2 RNA-seq (but not SLC6A14) expression in fetal muscle ( Figure 4A).
We next studied whether the region encompassing the rs146662075 variant could act as a transcriptional enhancer and whether its activity was allele-specific. For this, we linked the DNA region, with either the T (risk) or the C (common) allele, to a minimal promoter and performed luciferase assays in a mouse myoblast cell line. The luciferase analysis showed an average 4.4 fold increased activity for the disease-associated T allele, compared to the expression measured with the common C allele, suggesting an activating function of the T allele, or a repressive function of the C allele ( Figure 4B). Consistent with these findings, electrophoretic mobility shift assays using nuclear protein extracts from mouse myoblast cell lines, differentiated myotubes, and human fetal muscle cell line, revealed sequence-specific binding activity of the C allele, but not of the rare T allele ( Figure 4C). Overall, these data indicate that the risk T allele prevents binding by a nuclear protein that is associated with decreased activity of an AGTR2-linked enhancer.

Discussion
To date, a large fraction of GWAS datasets have been made publicly available through repositories such as dbGaP and EGA. However, how these resources are useful to increase the knowledge of the genetic factors contributing to disease susceptibility has not been fully explored. In this study, we demonstrate the utility of these publicly available resources. By harmonizing and reanalyzing existing and publicly available T2D GWAS data, and by performing genotype imputation with two whole-genome sequence-based reference panels, we have been able to explore deeply the genetic architecture of T2D. This strategy allowed us to impute and test for association with T2D more than 15 million of high quality imputed variants, including low-frequency, rare, and small insertions and deletions across chromosomes 1 to 22 and X.
70KforT2D: Identification of novel T2D loci with publicly available GWAS data 1 7 The reanalysis of these data confirmed a large fraction of already known T2D loci, for which we identify and propose novel potential causal variants by fine-mapping and functionally annotating each locus. Our in-depth characterization of each of these loci reached an improved accuracy over prior efforts, providing for the first time a comprehensive coverage of structural variants, which point to previously unobserved candidate causal variants in known and novel loci.
This reanalysis also allowed us to identify seven novel associations driven by common variants in or near LYPLAL1, NEUROG3, CAMKK2, ABO and GIP; a low-frequency variant in EHMT2, and one rare variant in the X chromosome. This rare variant identified in Xq23 chromosome, was located near the AGTR2 gene, and showed a nearly three-fold increased risk for T2D in males, which represents, to our knowledge, the largest effect size identified so far in Europeans, and complements other large effect size variants identified in other specific populations 11,12 .
This study also highlights the importance of a strict classification of both cases and controls in order to identify rare variants associated with disease, as our initial discovery for the Xq23 locus was not replicated when using the standard classification of cases and controls (based on fasting plasma glucose or glycated hemoglobin, HbA1c), but only when reclassifying the control group by restricting it to non-diabetic individuals older than 55 years (average age at onset of T2D), and with confirmed normal glucose tolerance using the stricter OGTT measure. This is in line with previous results obtained for a T2D population specific variant found in Inuit within the TBC1D4 gene, which was only significant when using OGTT as criteria for classifying cases and controls, but not when using HbA1c 11 . The use of this strict criterion for T2D diagnosis was in fact very clinically relevant, as 32% carriers of the TB1CD4 variant would remain undiagnosed T2D or prediabetes patients when OGTT was not assessed 66 . Despite the need for a stricter definition of controls in our replication case-70KforT2D: Identification of novel T2D loci with publicly available GWAS data 1 8 control study, our results were further confirmed by the observation that 30% of the rs146662075 risk allele carriers developed T2D over 11 years of follow-up, compared to 10% of non-carriers, and suggest that early identification of these subjects by genotyping this variant may be useful to tailor pharmacological or lifestyle intervention in order to prevent or delay their onset of T2D.
Using binding and gene-reporter analyses we demonstrate a functional role of this variant and propose a possible mechanism behind the pathophysiology of T2D in T risk allele carriers, where this rare variant could favor a gain of function of AGTR2, which has been previously associated with insulin resistance 62 . AGTR2 appears, therefore, as a potential therapeutic target for this disease, which would be in line with previous studies that showed that the blockade of the renin-angiotensin system in mice 67 and in humans 68 prevents the onset of type 2 diabetes and restores normoglycemia 69,70 .
Overall, beyond our significant contribution to the understanding of the molecular basis of T2D, our study also highlights the potential of the reanalysis of public data, as a complement to large studies using newly-generated data. This study will inform the open debate in favor of pushing data-sharing and democratization initiatives 71,72 , highlighting their importance for the study of the genetics and pathophysiology of complex diseases, which may lead to new preventive and therapeutic applications.

Methods
Online Content Methods, along with any additional Extended Data display items, are available in the online version of the paper; references unique to these sections appear only in the online paper. 70KforT2D: Identification of novel T2D loci with publicly available GWAS data 2  2   3  7  .  L  e  i  ,  X  .  ,  C  a  l  l  a  w  a  y  ,  M  .  ,  Z  h  o  u  ,  H  .  ,  Y  a  n  g  ,  Y  .  &  C  h  e  n  ,  W  .  O  b  e  s  i  t  y  a  s  s  o  c  i  a  t  e  d  L  y  p  l  a  l  1  g  e  n  e   i  s  r  e  g  u  l  a  t  e  d  i  n  d  i  e  t  i  n  d  u  c  e  d  o  b  e  s  i  t  y  b  u  t  n  o  t  r  e  q  u  i  r  e  d  f  o  r  a  d  i  p  o  c  y  t  e   d  i  f  f  e  r  e  n  t  i  a  t  i  o  n  .  M  o  l  C  e  l  l  E  n  d  o  c  r  i  n  o  l   4  1  1   ,  2  0  7  -1  3  (  2  0  1  5  )  .   3  8  .  W  e  s  s  e  l  ,  J  .  e  t  a  l  .  L  o  w  -f  r  e  q  u  e  n  c  y  a  n  d  r  a  r  e  e  x  o  m  e  c  h  i  p  v  a  r  i  a  n  t  s  a  s  s  o  c  i  a  t  e  w  i  t  h   f  a  s  t  i  n  g  g  l  u  c  o  s  e  a  n  d  t  y  p  e  2  d  i  a  b  e  t  e  s  s  u  s  c  e  p  t  i  b  i  l  i  t  y  .  N  a  t  C  o  m  m  u  n   6   ,  5  8  9  7  (  2  0  1   Publicly available GWAS datasets representing a total of 12,931 cases and 57,196 controls (70KforT2D) were first quality controlled, phased and imputed using 1000G and UK10K separately. For those variants that were present in the DIAGRAM trans-ethnic meta-analysis we used the summary statistics to meta-analyze our results with the cohorts that had no overlap with any of the cohorts included in the DIAGRAM trans-ethnic meta-analysis. With this first meta-analysis we discovered four novel loci (within magenta panels). For the rest of the variants we meta-analyzed all the 70KforT2D datasets, which resulted in two novel loci (in blue panels). All the variants that were coding and that showed a p-value ≤ 1x10 -4 were tested for replication by interrogating the summary statistics in the Type 2 Diabetes Knowledge Portal (T2D Portal) (http://www.type2diabetesgenetics.org/). This resulted in a novel low-frequency variant in the EHMT2 gene (highlighted with a green panel).

Figure 2:
Manhattan and Quantile-Quantile plot (QQ-plot) of the discovery and replication genome-wide meta-analysis. The upper corner represents the QQ-plot. Expected -log 10 pvalues under the null hypothesis are represented in the x-axis while observed -log 10 p-values are represented in the y-axis. Observed p-values were obtained according to the suitable replication dataset used (as shown in Figure 1) and were depicted using different colors. HapMap variants were meta-analyzed using the Trans-Ethnic summary statistics from the DIAGRAM study and our meta-analysis based on the Genetic Epidemiology Research on Aging (GERA) cohort and the Northwestern NuGENE Project and that resulted in novel associations depicted in magenta. The rest of non-HapMap variants meta-analyzed using the full 70KforT2D cohort are represented in grey, highlighting in light blue the fraction of novel GWAS-significant variants. Coding low-frequency variants meta-analyzed using the 70KforT2D and the T2D Portal data that resulted in novel GWAS-significant associations are depicted in green. The shaded area of the QQ-plot indicates the 95% confidence interval under the null and a density function of the distribution of the p-values was plotted using a dashed line. The λ is a measure of the genomic inflation and corresponds to the observed median χ 2 test statistic divided by the median expected χ 2 test statistic under the null hypothesis. The Manhattan plot, representing the -log 10 p-values were colored as explained in the QQ-plot. All known GWAS-significant associated variants within known T2D genes are also depicted in red. X-chromosome results for females (F), males (M) and all individuals (A) are also included.

Figure 3:
Discovery and replication of rs14666075 association signal. Forest plot for the rs146662075 variant using (A) the discovery cohorts and when including also (B) the replication datasets. Cohort-specific odds ratios are denoted by boxes proportional to the size of the cohort and 95% CI error bars. The combined OR estimate for all the datasets is represented by a green diamond, where the diamond width corresponds to 95% CI bounds. The p-value for the meta-analysis (Meta P) and for the heterogeneity (Het P) of odds ratio is shown. C) Plot showing the cumulative incidence of type 2 diabetes for an 11 years followup. The red line represents the T carriers and in light blue, C carriers are represented (n=1,652, cases=158.). D) Forest plot after excluding controls younger than 55 years old and OGTT>7.8 mmol/l in both the discovery and replication cohorts when available.