Re-analysis of public genetic data reveals a rare X-chromosomal variant associated with type 2 diabetes

The reanalysis of existing GWAS data represents a powerful and cost-effective opportunity to gain insights into the genetics of complex diseases. By reanalyzing publicly available type 2 diabetes (T2D) genome-wide association study (GWAS) data for 70,127 subjects, we identify seven novel associated regions: five driven by common variants (LYPLAL1, NEUROG3, CAMKK2, ABO, and GIP genes), one by a low-frequency variant (EHMT2), and one by a rare variant in chromosome Xq23, rs146662075, associated with a twofold increased risk for T2D in males. rs146662075 is located within an active enhancer associated with the expression of the Angiotensin II Receptor type 2 gene (AGTR2), a modulator of insulin sensitivity, and exhibits allele-specific activity in muscle cells. Beyond providing insights into the genetics and pathophysiology of T2D, these results also underscore the value of reanalyzing publicly available data using novel genetic resources and analytical approaches.

a)-p) Pathway analysis from GWAS results using DEPICT. Expanded representation of each of the network clusters that were significantly enriched (FDR < 0.05). The correlation between pathways is represented by the width of the edges.
Supplementary Fig. 7. a)-g) Signal plots representing the 99% credible sets of SNPs at the 7 novel loci. In each plot, each point represents a variant within the 99% credible set, with the Bayes factor (y axis, on a log10 scale) as a function of genomic position (hg19). The lead SNP is represented by the purple symbol. The color-coding scheme indicates the r² with the lead SNP, based on 1000G r² values from the European population. Recombination rates were estimated from Phase II HapMap, and gene annotations were taken from the UCSC genome browser. Only the SNPs that fall within the 99% credible set are plotted.
Supplementary Fig. 8. Comparison of imputation quality in males and females and with the UK10K and 1000G phase 1 reference panels. To evaluate genotype imputation quality, we imputed genotypes into the 58C cohort from the WTCCC, a dataset of ~3,000 individuals who were genotyped on both the Affymetrix v6.0 (Affy) and Illumina 1.2M (IL) platforms. We performed genotype imputation independently using either the Affy or the IL genotypes as the backbone. The quality of the imputed variants was evaluated using the allelic dosage R² coefficient (see Supplementary Methods) between the genotype dosages estimated when imputing with Affy or with IL as the backbone. We show the imputation results for males with the a) 1000G and b) UK10K reference panels, and for females with the c) 1000G and d) UK10K reference panels. Genotype imputation is of higher quality in males imputed with UK10K.
Supplementary Fig. 9. Discovery and replication of the rs146662075 association signal. Forest plots for rs146662075 using data from the discovery and replication datasets a) without applying any additional filter to the control samples and b) when excluding only controls younger than 55 years old. Cohort-specific odds ratios (95% CIs) are denoted by blue boxes (blue lines). The combined OR estimate for all the datasets is represented by a green diamond, where the diamond width corresponds to the 95% CI bounds. The p-values for the meta-analysis (Meta P) and for the heterogeneity of odds ratios (Het P) are shown.
Supplementary Fig. 10. Boxplot representing the distribution of ages in cases and controls across cohorts. The red line represents 55 years old, which is the average age at onset of T2D in European populations.
Supplementary Fig. 11. Characterizing the transcriptional regulatory activity of the rs146662075 enhancer element. a) UCSC screenshot showing representative ChIP-Seq datasets for transcription factor binding and chromatin marks associated with active enhancer elements in human islets within the TAD in which rs146662075 is located (highlighted in blue). b) Representation of the rs146662075 enhancer activity according to the -log10 MACS2 q-value from H3K27ac narrow peaks across multiple tissues from the NIH Roadmap Epigenomics Mapping Consortium consolidated epigenomes dataset. c) Gene expression levels for candidate target genes of the rs146662075 enhancer variant in those tissues in which significant enhancer activity was observed in b). d) Association between enhancer activity and gene expression for each of the candidate target genes. For each candidate target gene, a contingency table shows the tissue counts for each of the 4 scenarios, and the estimated odds ratio (OR) and the p-value from Fisher's exact test are also provided.

Details of independent discovery GWAS datasets
We collected all publicly available genetic individual-level data for Type  Approximately 30% were explicitly recruited as part of multiplex sibships 2 and ~25% were offspring in parent-offspring 'trios' or 'duos' (that is, families comprising only one parent complemented by additional sibs) 3 . The remainder were recruited as isolated cases, but these cases were (compared to population-based cases) of relatively early onset and had a high proportion of T2D parents and/or siblings 4 . Cases were ascertained across the UK but were centralized on the main collection centres (Exeter, London, Newcastle, Norwich, Oxford). The remaining subjects, not coded as T2D patients, were considered controls.

DIAGRAM Trans-Ethnic meta-analysis
We used the summary statistics for the trans-ethnic T2D GWAS meta-analysis 7 from the DIAGRAM consortium, which comprises the following ancestry-specific meta-analyses: Asian, South Asian, European and Hispanic 8,9 . Additionally, we used the summary statistics from the exome chip analysis of 75,670 individuals of European ancestry. This dataset integrates the efforts of the DIAGRAM consortium, the GoT2D project and the T2D-GENES project 8 . These data were accessed in June 2016.

InterAct
The InterAct consortium 10  From the SNP- and sample-QCed data, we extracted the male samples, corresponding to 6,763 individuals, which were re-analysed using genotype imputation with the UK10K reference panel. Association with T2D was evaluated using an additive logistic model with SNPTEST v2.5.2, adjusted for age and body-mass index.

Slim Initiative in Genomic Medicine for the Americas (SIGMA) T2D Genetics Consortium
Genotyping of study participants using the Illumina OMNI2.5 array has been described previously 11 . After SNP and sample QC, these cohorts were imputed using the UK10K reference panel, and the association with T2D was tested in male samples only, under an additive logistic model with SNPTEST v2.5.2, adjusted for age and body-mass index.

Danish cohort
The Danish replication data consisted of data from five sample sets: 1) Inter99, a population- were genotyped together with the study samples to estimate mismatch between genotyping and sequencing. All genotypes (5 heterozygous and 5 homozygous for reference allele) were concordant. Furthermore, 1,602 study samples were genotyped in duplicate and no mismatches were observed. Moreover, general call rate was 98%. Genotype distribution was in accordance with Hardy-Weinberg equilibrium.
The Kaplan-Meier method was used to plot cumulative incidence of T2D against time of follow-up in the Inter99 cohorts, which were followed for 11 years on average. Cox proportional hazards regression models were used to address the risk of incident T2D.
Individuals with self-reported diabetes at the baseline examination and individuals present in the Danish National Diabetes Registry before the baseline examination were excluded from the present analyses of incident T2D. The follow-up analyses were restricted to male individuals younger than 45 years old at baseline, who would be up to 56 years old after 11 years of follow-up.
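As an illustration of the cumulative incidence curve mentioned above, the following is a minimal sketch of the Kaplan-Meier estimator in plain Python. It is not the code used in the study; the function name and the input layout (one follow-up time and one event flag per individual) are assumptions made for this example, and the Cox regression step is not covered.

```python
def kaplan_meier_cumulative_incidence(times, events):
    """Return [(t, 1 - S(t))] at each distinct time with at least one event.

    times  -- follow-up time per individual (hypothetical input layout)
    events -- True if incident T2D was observed at that time, False if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)      # individuals still under observation
    survival = 1.0           # Kaplan-Meier estimate of S(t)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        leaving = 0          # individuals leaving the risk set at time t
        diagnosed = 0        # of those, the ones with an observed event
        while i < len(data) and data[i][0] == t:
            leaving += 1
            if data[i][1]:
                diagnosed += 1
            i += 1
        if diagnosed > 0:
            # Product-limit update: S(t) *= 1 - d_t / n_t
            survival *= 1.0 - diagnosed / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= leaving
    return curve
```

Censored individuals contribute to the risk set until their last follow-up time and then drop out without changing the estimate, which is what distinguishes this curve from a simple proportion of diagnosed individuals.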

Partners Biobank
The Partners HealthCare Biobank 14
1) Individuals determined by the "curated disease" algorithm employed above to have no history of type 2 diabetes with NPV of 99%.
3) Individuals with HbA1c less than 5.7
b) Case selection criteria.
1) Individuals determined by the "curated disease" algorithm employed above to have type 2 diabetes with PPV of 99%.
2) Individuals at least age 30 given the higher rate of false positive diagnoses in younger individuals.
Genomic data for 15,061 participants were generated with the Illumina Multi-Ethnic Genotyping Array, which covers more than 1.7 million markers, including content from over 36,000 individuals, and is enriched for exome content, with >400,000 markers comprising missense, nonsense, indel, and synonymous variants.

UK Biobank
The UK Biobank is a prospective cohort of ~500,000 individuals aged 40 to 69 years when recruited in 2006-2010 16  prioritized keeping all the individuals genotyped by the UK BiLEVE array in a single subset and we also respected the different batches defined by UK Biobank. We performed a two-stage imputation procedure, first pre-phasing the genotypes into whole-chromosome haplotypes and then imputing genotypes with the UK10K reference panel (http://www.uk10k.org/). Phasing was performed with SHAPEIT2, and IMPUTE2 was used for genotype imputation. During the imputation step we excluded indels, variants whose pairs of alleles were either A/T or C/G, variants with MAF < 1% and variants deviating from Hardy-Weinberg equilibrium with p < 1 × 10^-20. In addition, from those pairs of relatives reported to be third-degree or higher according to UK Biobank, we excluded from each pair the individual with the lowest call rate. We tested the rs146662075 variant for association with SNPTEST v2.5.1 using the threshold method and including 7 principal components, body mass index (BMI), age at recruitment and batch information as covariates.
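The variant exclusions applied during the imputation step can be sketched as a simple filter. This is an illustrative example only, not the pipeline used in the study: the `Variant` record, its field names and the function name are hypothetical, and the thresholds are the ones stated in the text (MAF < 1%, Hardy-Weinberg p < 1 × 10^-20).

```python
from dataclasses import dataclass

# Allele pairs that read the same on both strands (A/T and C/G).
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Variant:
    """Hypothetical per-variant record used only for this sketch."""
    ref: str      # reference allele
    alt: str      # alternate allele
    maf: float    # minor allele frequency
    hwe_p: float  # Hardy-Weinberg equilibrium test p-value


def passes_imputation_filters(v, min_maf=0.01, min_hwe_p=1e-20):
    """Return True if the variant survives the exclusions listed in the text."""
    if len(v.ref) != 1 or len(v.alt) != 1:
        return False  # indel
    if frozenset((v.ref.upper(), v.alt.upper())) in STRAND_AMBIGUOUS:
        return False  # A/T or C/G allele pair
    if v.maf < min_maf:
        return False  # MAF below 1%
    if v.hwe_p < min_hwe_p:
        return False  # strong deviation from Hardy-Weinberg equilibrium
    return True
```

A/T and C/G variants are commonly removed because their strand cannot be resolved from the alleles alone, which can lead to mismatches against the reference panel.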
To build our case-control analysis we used the following criteria:
a) Control selection criteria.
(2) Individuals without family history of diabetes mellitus (father, mother or siblings).
(4) Individuals without reported age at onset of diabetes mellitus.
b) Case selection criteria.
(1) Individuals with a primary or secondary ICD-10 diagnosis from hospitalization events included in the E11 (Non-insulin-dependent diabetes mellitus) disease category.
(2) Individuals without any primary or secondary ICD-10 diagnosis from hospitalization events included in the following disease categories: E10 (Insulin-dependent diabetes mellitus), E13 (Other specified diabetes mellitus) and E14 (Unspecified diabetes mellitus).
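The two case selection criteria above can be expressed as a short rule over an individual's hospitalization diagnoses. The sketch below is illustrative only: the function name is hypothetical, and it assumes that each diagnosis is available as an ICD-10 code string beginning with its three-character category (for example "E11.9" or "E119"). The control criteria are not sketched.

```python
CASE_CATEGORY = "E11"                         # Non-insulin-dependent diabetes mellitus
EXCLUDED_CATEGORIES = ("E10", "E13", "E14")   # other diabetes categories


def is_t2d_case(icd10_codes):
    """Apply the case criteria to one individual's primary and secondary codes.

    icd10_codes -- iterable of ICD-10 code strings (assumed input format)
    """
    codes = [code.strip().upper() for code in icd10_codes]
    # Criterion (1): at least one diagnosis in the E11 category.
    has_e11 = any(code.startswith(CASE_CATEGORY) for code in codes)
    # Criterion (2): no diagnosis in the E10, E13 or E14 categories.
    has_excluded = any(code.startswith(EXCLUDED_CATEGORIES) for code in codes)
    return has_e11 and not has_excluded
```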
Finally, we used inverse-variance fixed-effect meta-analysis to obtain the final effect size, standard error and p-value across the association results from each of the 6 subsets.
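For reference, an inverse-variance fixed-effect meta-analysis weights each subset's effect estimate by the reciprocal of its squared standard error. The following is a minimal sketch of that standard calculation, not the software used in the study; the function name and the assumption that the inputs are per-subset log odds ratios with their standard errors are ours.

```python
import math


def fixed_effect_meta(betas, ses):
    """Combine per-subset effect estimates by inverse-variance weighting.

    betas -- per-subset effect sizes (assumed here to be log odds ratios)
    ses   -- corresponding standard errors
    Returns (combined effect size, combined standard error, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value
```

Under this scheme, subsets with smaller standard errors (typically the larger ones) dominate the combined estimate, and the combined odds ratio is obtained as `math.exp(beta)`.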

Supplementary Note 3
The SIGMA Type 2 Diabetes Genetics Consortium Genetic analyses: