Novel susceptibility loci and genetic regulation mechanisms for type 2 diabetes

We conducted a meta-analysis of genome-wide association studies (GWAS) with ∼16 million genotyped/imputed genetic variants in 62,892 type 2 diabetes (T2D) cases and 596,424 controls of European ancestry. We identified 139 common and 4 rare (minor allele frequency < 0.01) variants associated with T2D, 42 of which (39 common and 3 rare variants) were independent of the known variants. Integration of the gene expression data from blood (n = 14,115 and 2,765) and other T2D-relevant tissues (n = up to 385) with the GWAS results identified 33 putative functional genes for T2D, three of which were targeted by approved drugs. A further integration of DNA methylation (n = 1,980) and epigenomic annotations data highlighted three putative T2D genes (CAMK1D, TP53INP1 and ATP5G1) with plausible regulatory mechanisms whereby a genetic variant exerts an effect on T2D through epigenetic regulation of gene expression. We further found evidence that the T2D-associated loci have been under purifying selection.


Introduction
Type 2 diabetes (T2D) is a common disease with a worldwide prevalence that increased rapidly from 4.7% in 1980 to 8.5% in 2014 1 . It is primarily caused by insulin resistance (failure of the body's normal response to insulin) and/or insufficient insulin production by beta cells 2 . Genetic studies using linkage analysis and candidate gene approaches have led to the discovery of an initial set of T2D-associated loci (e.g., PPARG, KCNJ11 and TCF7L2) [3][4][5] . Over the past decade, genome-wide association studies (GWAS) with increasing sample sizes have identified 144 genetic variants (not completely independent) at 129 loci associated with T2D [6][7][8] .
Despite the large number of variants discovered using GWAS, the associated variants in total explain only a small proportion (~10%) of the heritability of T2D 9,10 . This well-known "missing heritability" problem is likely due to the presence of common variants (minor allele frequencies (MAF) > 0.01) that have small effects that have not yet been detected and/or rare variants that are not well tagged by common SNPs 9 . The contribution of rare variants to genetic variation in the occurrence of common diseases is under debate 11 , and a recent study suggested that the contribution of rare variants to the heritability of T2D is likely to be limited 12 . If most T2Dassociated genetic variants are common in the population, continual discoveries of variants with small effects are expected from large-scale GWAS using the current experimental design.
Furthermore, limited progress has been made in understanding the regulatory mechanisms of the genetic loci identified by GWAS. Thus, the etiology and the genetic basis underlying the development of the disease remain largely unknown. Recent methodological advances have provided us with an opportunity to identify functional genes and their regulatory elements by combining GWAS summary statistics with data from molecular quantitative trait loci studies with large sample size [13][14][15] .
In this study, we performed a meta-analysis of GWAS with the largest sample size for T2D to date (62,892 cases and 596,424 controls), by combining three large GWAS data sets: DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) 7 , Genetic Epidemiology Research on Aging (GERA) 16 and the full cohort release of the UK Biobank (UKB) 17 . We then integrated the GWAS meta-analysis results with gene expression and DNA methylation data to identify genes that might be functionally relevant to T2D and to infer plausible mechanisms whereby genetic variants affect T2D risk through gene regulation by DNA methylation 15 . We further estimated the genetic architecture of T2D using whole-genome estimation approaches.

Meta-analysis identifies 39 previously unknown loci
We meta-analyzed 5,053,015 genotyped or imputed autosomal SNPs (MAF > 0.01) in 62,892 T2D cases and 596,424 controls from the DIAGRAM (12,171 cases vs. 56,862 controls in stage 1 and 22,669 cases vs. 58,119 controls in stage 2), GERA (6,905 cases and 46,983 controls) and UKB (21,147 cases and 434,460 controls) data sets after quality controls (Supplementary Fig. 1 and Methods). Summary statistics in DIAGRAM were imputed to the 1000 Genomes Project 18 (1KGP) phase 1 using a summary-data-based imputation approach, ImpG 19 (Supplementary Note 1), and we used an inverse-variance method 20 to meta-analyze the imputed DIAGRAM data with the summary data from GWAS analyses of GERA (1KGP imputed data) and UKB (Haplotype Reference Consortium 21 or HRC imputed data) (Methods and Fig. 1a). All the individuals except for a Pakistani cohort in DIAGRAM stage 2 (see Methods) were of European ancestry. We demonstrated by linkage disequilibrium (LD) score regression analysis 22,23 that the inflation in test statistics due to population structure was negligible in each data set, and there was no with a model in which the genomic inflation in test statistics is driven by polygenic effects 22,24 .
After clumping the SNPs using LD information from the UKB genotypes (clumping r 2 threshold = 0.01 and window size = 1 Mb), there were 139 near-independent variants at P < 5×10 -8 (Supplementary Table 2). All of the loci previously reported by DIAGRAM were still genomewide significant in our meta-analysis results. The most significant association was at rs7903146 (P = 1.3×10 -347 ) at the known TCF7L2 locus 5,25 . Among the 139 variants, 39 are not in LD with the known variants ( Fig. 1 and Table 1). The result remained unchanged when the GERA cohort was imputed to HRC (Supplementary Fig. 2). We regarded these 39 variants as novel discoveries; more than half of them passed a more stringent significance threshold at P < 1×10 -8 (Table 1), a conservative control of genome-wide false positive rate (GWFPR) suggested by a recent simulation study 26 . The functional relevance of some novel gene loci to the disease is supported by existing biological or molecular evidence related to insulin and glucose (Supplementary Note 3). Forest plots showed that the effect directions of the 39 novel loci were consistent across the three GWAS data sets (Supplementary Fig. 3). Regional association plots show that some loci have complicated LD structures, and it is largely unclear which genes are responsible for the observed SNP-T2D associations (Supplementary Fig. 4). We also performed gene-based analysis by GCTA-fastBAT 27  Of all the 139 T2D-associated loci identified in our meta-analysis, 16 and 25 were significant in insulin secretion and sensitivity GWAS, respectively, from the MAGIC consortium 29,30 (URLs) after correcting for multiple tests (i.e., 0.05 / 139), with only one locus showing significant associations with both insulin secretion and sensitivity. The limited number of overlapping associations observed might be due to the relatively small sample sizes in the insulin studies. We further estimated the genetic correlation (rg) between insulin secretion (or sensitivity) and T2D by the bivariate LD score regression approach 23 using summary-level data. The estimate of rg between T2D and insulin secretion was -0.15 (s.e. = 0.10), and that between T2D and insulin sensitivity was -0.57 (s.e. = 0.10).

Rare variants associated with T2D
Very few rare variants associated with T2D have been identified in previous studies [31][32][33][34][35] . We included 10,849,711 rare variants (0.0001 < MAF < 0.01) in the association analysis in UKB and detected 11 rare variants at P < 5×10 -8 and 4 of them were at P < 5×10 -9 ( Fig. 1b and Supplementary Table 6). We focused only on the 4 signals at P < 5×10 -9 because a recent study suggested that a P-value threshold of 5×10 -9 is required to control a GWFPR at 0.05 in GWAS including both common and rare variants imputed from a fully sequenced reference 26 . Three of the rare variants were located at loci with significant common variant associations. The rs78408340 (OR = 1.33, P = 4.4×10 -14 ) is a missense variant that encodes a p.Ser539Trp alteration in PAM and was reported to be associated with decreased insulin release from pancreatic beta cells 32 . Variant rs146886108 (odds ratio (OR) = 0.72, P = 4.4×10 -9 ) is a novel locus, which showed a protective effect against T2D, is a missense variant that encodes p.Arg187Gln in ANKH 36 . Variant rs117229942 (OR = 0.70, P = 4.0×10 -11 ) is an intron variant in TCF7L2 5 . Variant rs527320094 (OR = 2.74, P = 4.6×10 -9 ), located in LOC105378797, is also novel rare-variant association, with no other significant SNP (either common or rare) within a ±1 Mb window. We did not observe any substantial difference in association signals for these four variants between the results from BOLT-LMM 37 and logistic regression considering the difference in sample size (Supplementary Table 6).

Sex or age heterogeneity analysis
To examine sex or age heterogeneity in the SNP effects, we performed a GWAS analysis within each sex (male or female) and by age (two age categories separated by the median year of birth) in UKB and tested the difference in the estimated SNP effects between the two sex (or age) groups using a heterogeneity test (Supplementary Note 6). There was no evidence for sex heterogeneity ( Supplementary Fig. 6), consistent with the observation that the male-female genetic correlation estimated by bivariate LDSC 23 was not significantly different from 1 (̂7 = 0.94, . . = 0.042, and >?@@ABACDA = 0.17). There was only one genome-wide significant signal (rs72805579 at the TMEM17 locus with Pheterogeneity = 2.1×10 -9 ) with age heterogeneity (Supplementary Fig. 6).
The estimates of SNP effects were of opposite directions in the two age groups, but the effect was not genome-wide significant in either age group (Supplementary Table 7). In addition, the

Gene expression and DNA methylation associated with T2D
Most previous studies have reported the gene in closest physical proximity to the most significant SNP at a GWAS locus. However, gene regulation can be influenced by genetic variants that are physically distal to the genes 38 . To prioritize genes identified through the genome-wide significant loci that are functionally relevant to the disease, we performed an SMR analysis 39 to test for association between the expression level of a gene and T2D using summary data of GWAS from our meta-analysis and expression quantitative trait loci (eQTL) from the eQTLGen (n = 14,115) and CAGE consortia (n = 2,765) 40 Supplementary Fig. 7). Identification of DNAm sites and their target genes relies on consistent association signals across omics levels 15 .
To demonstrate this, we conducted the SMR analysis to test for associations between the 235 T2D-associated DNAm sites and the 33 T2D-associated genes and identified 22 DNAm sites associated with 16 genes in eQTLGen (Supplementary

SMR associations in multiple T2D-relevant tissues
To replicate the SMR associations in a wider range of tissues relevant to T2D, we performed SMR analyses based on cis-eQTL data from four tissues in GTEx (i.e., adipose subcutaneous tissue, adipose visceral omentum, liver and pancreas). We denoted these four tissues as GTEx-AALP. Of the 27 putative T2D genes identified by SMR and HEIDI using the eQTLGen data, 10 had a cis-eQTL at PeQTL < 5×10 -8 in at least one of the four GTEx-AALP tissues (Supplementary Table 13).
Note that the decrease in eQTL detection power is expected given the much smaller sample size of GTEx-AALP (n = 153 to 385) compared to that of eQTLGen (n = 14,115). As a benchmark, 17 of the 27 genes had a cis-eQTL at PeQTL < 5×10 -8 in GTEx blood (n = 369). We first performed the SMR analysis in GTEx-blood and found that 12 of the 17 genes were replicated at PSMR < 2.9×10 -3 (i.e., 0.05 / 17) (Supplementary Table 13). We then conducted the SMR analysis in GTEx-AALP.
The result showed that 8 of the 10 genes showed significant SMR associations at PSMR < 1.3×10 -3 (i.e., 0.05 / (10 × 4)) in at least one of the four GTEx-AALP tissues, a replication rate comparable to that found in GTEx-blood. Among the 8 genes, CWF19L1, for which the cis-eQTL effects are highly consistent across different tissues, was significant in all the data sets ( Supplementary Fig.   8).
The replication analysis described above depends heavily on the sample sizes of eQTL studies. A less sample-size-dependent approach is to quantify how well the effects of the top associated cis-eQTLs for all the 27 putative T2D genes estimated in blood (i.e., the eQTLGen data) correlate with those estimated in the GTEx tissues, accounting for sampling variation in estimated SNP effects 42 .
This approach avoids the need to use a stringent P-value threshold to select cis-eQTLs in the GTEx tissues with small sample sizes. We found that the mean correlation of cis-eQTL effects between eQTLGen blood and GTEx-AALP was 0.47 (s.e. = 0.16), comparable to and not significantly different from the value of 0.64 (s.e. = 0.16) between eQTLGen and GTEx blood. We also found that the estimated SMR effects of 18 genes that passed the SMR test and were not rejected by the HEIDI test in either eQTLGen or GTEx were highly correlated (Pearson's correlation r = 0.80) ( Supplementary Fig. 9). Note that this correlation is not expected to be unity because of differences in the technology used to measure gene expression (Illumina gene expression arrays for eQTLGen vs. RNA-seq for GTEx).
These results support the validity of using eQTL data from blood for the SMR and HEIDI analysis; using this method, we can make use of eQTL data from very large samples to increase the statistical power, consistent with the conclusions of a recent study 42 . In addition, blood is also considered to be a T2D-relevant tissue, and tissue-specific effects that are not detected in blood will affect the power of the SMR and HEIDI analysis rather than generating false positive associations.

Putative regulatory mechanisms for three T2D genes
Here, we use the genes CAMK1D, TP53INP1 and ATP5G1 as examples to hypothesize possible mechanisms of how genetic variants affect T2D risk by controlling DNAm for gene regulation 15 .

Functional gene annotation information was acquired from the Roadmap Epigenomics Mapping
Consortium 43 (REMC).
The significant SMR association of CAMK1D with T2D was identified in both eQTL data sets ( Table   2 and Supplementary Tables 10-11). The top eQTL, rs11257655, located in the intergenic region (active enhancer) between CDC123 and CAMK1D, was also a genome-wide significant SNP in our meta-analysis (P = 2.0×10 -17 ). It was previously shown that rs11257655 is located in the binding motif for FOXA1/FOXA2 and that the T allele of this SNP is a risk allele that increases the expression level of CAMK1D through allelic-specific binding of FOXA1 and FOXA2 44 . Another functional study demonstrated that increasing the expression of FOXA1 and its subsequent binding to enhancers was associated with DNA demethylation 45 Tables 8-9, 11). Moreover, rs11257655 was also the top mQTL (Fig. 2); the T allele of this SNP is associated with decreased methylation at the site cg03575602 in the promoter region of CAMK1D, suggesting that the T allele of rs11257655 up-regulates the transcription of CAMK1D by reducing the methylation level at cg03575602. Leveraging all the information above, we proposed the following model of the genetic mechanism at CAMK1D for T2D risk (Fig. 3). In the presence of the T allele at rs11257655, FOXA1/FOXA2 and other transcription factors bind to the enhancer region and form a protein complex that leads to a decrease in the DNAm level of the promoter region of CAMK1D and recruits the RNA polymerase to the promoter, resulting in an increase in the expression of CAMK1D (Fig. 3). A recent study showed that the T risk allele is correlated with reduced DNAm and increased chromatin accessibility across multiple islet samples 46 and that it is associated with disrupted beta cell function 47 . Our inference highlights the role of promoterenhancer interaction in gene regulation; this interaction was analytically indicated by the integrative analysis using the SMR and HEIDI approaches.
The second example is TP53INP1, the expression level of which is positively associated with T2D as indicated by the SMR analysis ( Table 2). This is supported by previous findings that the protein encoded by TP53INP1 regulates the TCF7L2-p53-p53INP1 pathway in such a way as to induce apoptosis and that the survival of pancreatic beta cells is associated with the level of expression of TP53INP1 48 . TP53INP1 was mapped as the target gene for three DNAm sites (cg13393036, cg09323728 and cg23172400) by SMR (Fig. 4). The third example involves two proximal genes, ATP5G1 and UBE2Z, the expression levels of which were significantly associated with T2D according to the SMR analysis ( Table 2). A methylation probe (cg16584676) located in the promoter region of UBE2Z was associated with the expression levels of both ATP5G1 and UBE2Z (Supplementary Fig. 10a), suggesting that these two genes are co-regulated by a genetic variant through DNAm. The effect of cg16584676 on gene expression was negative (Supplementary Tables 10-11), implying the following plausible mechanism. A genetic variant near ATP5G1 exerts an effect on T2D by increasing the DNAm levels of the promoters for ATP5G1 and UBE2Z; this decreases the binding efficiency of the transcription factors that recruit RNA polymerase, resulting in down-regulation of gene expression and ultimately leading to an increase in T2D risk (Supplementary Fig. 10b). ATP5G1 has been shown to encode a subunit of mitochondrial ATP synthase, and UBE2Z is a ubiquitinconjugating enzyme. Insulin receptors could be degraded by SOCS proteins during ubiquitinproteasomal degradation, and ATP5G1 and UBE2Z are likely to be involved in this pathway 51 . The function of insulin receptors is to regulate glucose homeostasis through the action of insulin and other tyrosine kinases, and dysfunction of these receptors leads to insulin resistance and increases T2D risk. Interestingly, in addition to cg16584676, there were four other DNAm sites in the vicinity that were significantly associated with T2D (passed SMR and not rejected by HEIDI).
These four methylation sites are located in the promoter regions of ATP5G1 (cg11715999), GIP (cg20551517) and SNF8 (cg26022315 and cg07967210). GIP has been reported to be associated with T2D 52 . SNF8 is a component of a complex that regulates ubiquitin-proteasomal degradation.
These observations imply that these four genes (ATP5G1, UBE2Z, GIP and SNF) are probably coexpressed through promoter-promoter interactions.
The three examples above provide hypotheses for how genetic variants may affect T2D risk through regulatory pathways and demonstrate the power of integrative analysis of omics data for this purpose. These examples describe putative candidates that could be prioritized in future functional studies.

Potential drug targets
In the SMR analysis described above, we identified 33 putative T2D genes. We matched these genes in the DrugBank database (URLs) and found that three genes (ARG1, LTA and P2RX4) are the targets of several approved drugs (drugs that have been approved in at least one jurisdiction).
ARG1 (UniProt ID: P05089), whose expression level was negatively associated with T2D risk, is targeted by three approved drugs: ornithine (DrugBank ID: DB00129), urea (DrugBank ID: DB03904) and manganese (DrugBank ID: DB06757), but the pharmacological mechanism of action of these drugs remains unknown. Arginase (ARG1 is an isoform in liver) is a manganesecontaining enzyme that catalyzes the hydrolysis of arginine to ornithine and urea. Arginase in vascular tissue might be a potential therapeutic target for the treatment of vascular dysfunction in diabetes 53 . Metformin, an oral antidiabetic drug that is used in the treatment of diabetes, was reported to increase ARG1 expression in a murine macrophage cell line 54 , consistent with our SMR result that increased expression of ARG1 is associated with decreased T2D risk (Supplementary Table 8). There is also evidence for an interaction between ARG1 and metformin (Comparative Toxicogenomics Database, URLs). The likely mechanism is that metformin activates AMP- antagonist for P2RX4). Eslicarbazepine acetate is an anticonvulsant that inhibits repeated neuronal firing and stabilizes the inactivated state of voltage-gated sodium channels; its pharmacological action makes it useful as an adjunctive therapy for partial-onset seizures 58 .
Antagonists of P2RX4 inhibit high glucose, prevent endothelial cell dysfunction 59 , and are useful in the treatment of diabetic nephropathy 60 .
To explore whether any of these three genes have potential adverse effects, we checked the associations of the lead variants at the three loci with other traits from previous studies, including two insulin-related GWAS (insulin sensitivity 30 and insulin secretion 29 ) and four lipid traits (HDL cholesterol, LDL cholesterol, triglycerides and total cholesterol) 61 (Supplementary Table 14).
We did not observe any significant association with insulin traits after correcting for multiple testing (i.e., 0.05 / (3 × ), where t is the number of traits). However, the risk allele of the lead T2D-associated variant at the LTA locus was associated with increased LDL cholesterol, total cholesterol and triglycerides. The risk allele of the lead T2D-associated variant at the ARG1 locus was associated with decreased HDL cholesterol and total cholesterol.
In addition to the three genes that are currently targeted by approved drugs, we found two additional genes that are targeted by an approved veterinary drug and a nutraceutical drug, respectively (URLs and Supplementary Note 8).

Enrichment of genetic variation in functional regions and tissue/cell types
Recent studies have indicated that different functional regions of the genome contribute disproportionately to total heritability 62 . We applied a stratified LD score regression method 62 to dissect the contributions of the functional elements to the SNP-based heritability (ℎ 0 123 4 ) for T2D.
We further used MAGMA 65 to test the enriched gene sets. In total, 305 gene sets in GO_BP terms and 20 gene sets in KEGG pathways were significantly enriched (Supplementary Table 17). The top pathway enrichment was "glucose homeostasis" (P = 6.0×10 -8 ) in GO_BP, and "maturity onset diabetes of the young" (P = 3.2×10 -7 ) in KEGG. To further investigate the molecular connections of T2D-associated genes, a protein-protein interaction network was analyzed using STRING 66 ( Supplementary Fig. 13). Among the functional enrichment (Supplementary Table 18) in this network, there are four genes (HHEX, HNF1A, HNF1B, and FOXA2) involved in the KEGG pathway of "maturity onset diabetes of the young", and four genes (ADCY5, CAMK2G, KCNJ11, and KCNU1) were enriched in "insulin secretion".

Natural selection of T2D-associated variants
We performed an LD-and MAF-stratified GREML analysis 67 Table 19).
Under an evolutionary neutral model and a constant population size 68 , the explained variance is uniformly distributed as a function of MAF, which means that the variance explained by variants with MAF ≤ 0.1 equals that explained by variants with MAF > 0.4. However, in our results, the MAF bin containing low-MAF and rare variants (MAF < 0.1) showed a larger estimate than any other MAF bin ( Fig. 6a and Supplementary Table 19), consistent with a model of negative (purifying) selection or population expansion 69 . To further distinguish between the two models (negative selection vs. population expansion), we performed an additional analysis using a recently developed method, BayesS 70 , to estimate the relationship between variance in effect size and MAF (Methods). The method also allowed us to estimate ℎ 0 123 4 and polygenicity (π) on each chromosome. The results (Fig. 6b) showed that the ℎ 0 123 4 of each chromosome was highly correlated with its length (r = 0.92), consistent with the results of previous studies for height and schizophrenia 71,72 . The mean estimate of π, i.e., the proportion of SNPs with non-zero effects, was 1.75% across all chromosomes ( Fig. 6c and Supplementary Table 20), suggesting a high degree of polygenicity for T2D. The sum of per-chromosome ℎ 0 123 alleles (Fig. 6d) and inconsistent with a recent study 12 concluding that T2D-associated loci have not been under natural selection. Our conclusion regarding negative selection is also consistent with the observation that the minor alleles of 9 of the 11 rare variants at P < 5E-8 were T2D risk alleles (Supplementary Table 6). The signal of negative selection implies that a large number of rare variants will be discovered in future GWAS in which appropriate genotyping strategies are used.

Polygenic risk score (PRS) analysis
We used DIAGRAM and UKB as the discovery set and GERA as a validation set in the PRS analysis 73 .
To avoid sample overlap between the discovery and validation sets, we re-ran the meta-analysis excluding the GERA cohort and identified 109 near-independent common SNPs at P < 5×10 -8 .
These SNPs were then used to derive prediction equations for individuals in GERA (Methods).
We divided GERA into ten subsets to acquire the sampling variance of the estimated classification in GERA was much lower than that in UKB.

Discussion
In this study, we sought to identify novel genetic loci associated with T2D by a meta-analysis of GWAS with the largest sample size to date and to infer plausible genetic regulation mechanisms at known and novel loci by an integrative analysis of GWAS and omics data. We identified 139 near-independent common variants ( < 5 × 10 WX ) and 4 rare variants ( < 5 × 10 WY ) for T2D in the meta-analysis. Of the 139 common loci, 39 were novel compared with the results of all 49 previous T2D GWAS from the GWAS Catalog (URLs) 76 , including the two recent studies by DIAGRAM 52 and Zhao et al. 77 . By integrating omics data, we have inferred the genetic mechanisms for the three genes CAMK1D, TP53INP1 and ATP5G1; the inferred mechanisms suggest that enhancer-promoter interactions with DNA methylation play an important role in mediating the effects of genetic variants on T2D risk. These findings provide deeper insight into the etiology of T2D and suggest candidate genes for functional studies in the future. Furthermore, our estimation of genetic architecture suggests that T2D is a polygenic trait for which both rare and common variants contribute to the genetic variation and indicates that rarer variants tend to have larger effects on T2D risk (Fig. 7). Assuming that most new mutations are deleterious for fitness, our result is consistent with a model in which mutations that have larger effects on T2D (and thereby on fitness through pleiotropy) are more likely to be maintained at low frequencies in the population by purifying selection.
This study has a number of limitations. First, the SNP-T2D associations identified by the metaanalysis might be biased by misdiagnosis of T1D (type 1 diabetes) and LADA (latent autoimmune diabetes in adults) 78 . Previous studies found that biases in SNP-T2D associations due to misdiagnosis were likely to be very modest 7,52,79 . We showed by two additional analyses based on known T1D loci that most of the novel SNP-T2D associations identified in this study are unlikely to be driven by misdiagnosed T1D cases (Supplementary Note 9 and Supplementary Table   23). Second, some of the T2D-associated SNPs might confer T2D risk through mediators such as obesity or dyslipidemia. To explore this possibility, we performed a summary-data-based conditional analysis of the 139 T2D-associated SNPs conditioning on body mass index (BMI) or dyslipidemia by GCTA-mtCOJO 80 using GWAS data for these two traits from UKB. It appeared that the effect sizes of most T2D-associated SNPs, with the exception of a few outliers (e.g., FTO, MC4R, POCS and TFAP2B), were not affected by BMI or dyslipidemia (Supplementary Fig. 14). These loci were among those showing the strongest associations with BMI 81 , consistent with the finding from a previous T2D study 82 . Third, among the 39 novel loci, there was only one locus (ARG1/MED23, Supplementary Fig. 15) at which the association between gene expression and T2D risk was significant in SMR and not rejected by HEIDI ( Table 2). This is because the power of the SMR test depends primarily on the SNP effect from GWAS 13 , which is small for the novel loci. Finally, we employed the SMR and HEIDI methods to map CpG sites to their target genes and to identify the CpG sites associated with T2D because of pleiotropy. The SMR approach uses genome-wide significant mQTL as an instrumental variable for each CpG site, which requires a large sample size for the mQTL discovery. In this study, we used mQTL data based on Illumina HumanMethylation450 arrays because of the relatively large sample size (n = 1,980).
Unfortunately, we did not have access to mQTL data from whole-genome bisulfite sequencing (WGBS) in a large sample. Nevertheless, it is noteworthy that there are three T2D-associated variants at the CAMK1D/CDC123, ADCY5, and KLHDC5 loci that show hypomethylation and allelic imbalance as identified by Thurner et al. 46 using WGBS data (n = 10), all of which were genomewide significant in our mQTL-based SMR analysis. Despite these limitations, our study highlights the benefits of integrating multiple omics data to identify functional genes and putative regulatory mechanisms driven by local genetic variation. Future applications of integrative omics data analyses are expected to increase our understanding of the biological mechanisms underlying human disease.

Summary statistics of DIAGRAM, GERA, and UKB
The data used in this study were derived from 659,316 individuals of European ancestry and a small cohort from Pakistan and were obtained from three data sets: DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) 7 , Genetic Epidemiology Research on Adult Health and Aging (GERA) 16 The UKB phenotype   was acquired from self-report, ICD10 main diagnoses and ICD10 secondary diagnoses (field IDs: 20002, 41202 and 41204). The GWAS analysis in UKB was conducted in BOLT-LMM 37 with sex and age fitted as covariates. In the BOLT-LMM analysis, we used 711,933 SNPs acquired by LD pruning (r 2 < 0.9) from Hapmap3 SNPs to control for relatedness, population stratification and polygenic effects. We transformed the effect size from BOLT-LMM on the observed 0-1 scale to the odds ratio (OR) using LMOR 88 .

Inverse variance based meta-analysis
Before conducting the meta-analysis, we performed several analyses in which we examined genetic heterogeneity and sample overlap among data sets (Supplementary Note 2). We performed a two-stage meta-analysis. The first stage combined DIAGRAM stage 1 (GWAS chip) data with GERA and UKB. The second stage combined DIAGRAM stage 1 and 2 (GWAS chip and metabolism chip) with GERA and UKB. We extracted the SNPs common to the three data sets (5,526,193 SNPs in stage 1 and 5,053,015 million SNPs in stage 2) and performed the metaanalyses using an inverse-variance based method in METAL 20 . The stage 1 meta-analysis data were only used to estimate the SNP-based heritability, and the stage 2 meta-analysis data were used in the follow-up analyses.

Summary-data-based Mendelian Randomization (SMR) analysis
We performed an SMR and HEIDI analysis 39 to identify genes whose expression levels were associated with a trait due to pleiotropy using summary statistics from GWAS and eQTL/mQTL studies. The HEIDI test 39 uses multiple SNPs in a cis-eQTL region to distinguish pleiotropy from linkage. In the SMR analysis, we used eQTL summary data from the eQTLGen Consortium (n = 14,115 in whole blood), the CAGE (n = 2,765 in peripheral blood) 40 and the GTEx v7 release (n = 385 in adipose subcutaneous tissue, n = 313 in adipose visceral omentum, n = 153 in liver, n = 220 in pancreas and n = 369 from whole blood) 89 . In CAGE and eQTLGen, gene expression levels were measured using Illumina gene expression arrays; in GTEx, gene expression levels were measured by RNA-seq. The SNP genotypes in all three cohorts were imputed to 1KGP. The mQTL summary data were obtained from genetic analyses of DNA methylation measured on Illumina HumanMethylation450 arrays (n = 1,980 in peripheral blood) 41 .

Estimating the genetic architecture for T2D
The MAF-and LD-stratified GREML (GREML-LDMS) is a method for estimating SNP-based heritability that is robust to model misspecification 67,90 . For ease of computation, we limited the analysis to a subset of unrelated UKB individuals (15,767

Polygenic risk score (PRS) analysis in GERA
We used DIAGRAM and UKB as the discovery set and GERA as a validation set for the PRS analysis.
To avoid sample overlap, we re-ran the meta-analysis excluding GERA and clumped significant SNPs from the meta-analysis (excluding GERA) using UKB as the reference for LD estimation (Pvalue threshold = 5×10 -8 , LD r 2 threshold = 0.01 and window size = 1 Mb). After clumping, there were 109 independent SNPs. These SNPs were used to generate PRS for each individual in GERA.
We then calculated the area under the curve 74 (AUC) as a measure of classification accuracy. To quantify the sampling variance in classification accuracy, the GERA data set was divided evenly into ten groups, each with sample size ~6,000 and similar sample prevalence. We also applied the GCTA-SBLUP (Summary-based Best Linear Unbiased Prediction) method 75

Declaration of Interests
We declare that all authors have no competing interests.

Data availability
Summary statistics from the meta-analysis will be available at http://cnsgenomics.com/data.html when the paper has been formally accepted for publication.  Columns are eQTL data set, probe ID, probe chromosome, gene name, probe position, SNP name, SNP position, effect allele, other allele, frequency of the effect allele in the reference sample, GWAS P-value, eQTL P-value, SMR P-value and HEIDI P-value.

Figure 5
Hypothesized mechanism of how TP53INP1 affects T2D risk. When the promoter region is highly methylated, which prevents binding of repressor protein (red rounded rectangle) to the promoter region, RNA polymerase II (green ellipsoid), transcription factor protein (orange ellipsoid) and mediator proteins (gray circles) will form a transcription initiation complex that increases the transcription. However, when the methylation level of the promoter region is low, repressor protein can more efficiently bind to the promoter, blocking the binding of the transcription initiation complex to the promoter, which decreases the transcription of TP53INP1.