Gene-level analysis of rare variants in 379,066 whole exome sequences identifies an association of GIGYF1 loss of function with type 2 diabetes

Sequencing of large cohorts offers an unprecedented opportunity to identify rare genetic variants and to find novel contributors to human disease. We used gene-based collapsing tests to identify genes associated with glucose, HbA1c and type 2 diabetes (T2D) diagnosis in 379,066 exome-sequenced participants in the UK Biobank. We identified associations for variants in GCK, HNF1A and PDX1, which are known to be involved in Mendelian forms of diabetes. Notably, we uncovered novel associations for GIGYF1, a gene not previously implicated by human genetics in diabetes. GIGYF1 predicted loss of function (pLOF) variants associated with increased levels of glucose (0.77 mmol/L increase, p = 4.42 × 10–12) and HbA1c (4.33 mmol/mol, p = 1.28 × 10–14) as well as T2D diagnosis (OR = 4.15, p = 6.14 × 10–11). Multiple rare variants contributed to these associations, including singleton variants. GIGYF1 pLOF also associated with decreased cholesterol levels as well as an increased risk of hypothyroidism. The association of GIGYF1 pLOF with T2D diagnosis replicated in an independent cohort from the Geisinger Health System. In addition, a common variant association for glucose and T2D was identified at the GIGYF1 locus. Our results highlight the role of GIGYF1 in regulating insulin signaling and protecting from diabetes.


Identification of genes with a biological role in diabetes.
Variants in two genes, GCK and GIGYF1, significantly associated with glucose, HbA1c and T2D diagnosis, strongly suggesting a biological role in diabetes; GCK is involved in Mendelian forms of diabetes while GIGYF1 has not previously been implicated by genetics in the disease. Both GCK and GIGYF1 are located on chromosome 7 but are 56 Mb apart, strongly suggesting that these signals are independent; this independence was confirmed by conditional analysis (Supplementary  Table 13). Two additional variant sets, HNF1A pLOF and TNRC6B pLOF, had genome-wide associations with both T2D diagnosis and HbA1c levels while G6PC2 damaging missense variants associated with decreased levels of both glucose and HbA1c but not T2D diagnosis (Table 3).
To see which other significant genes were likely to have a role in diabetes we looked at all variant sets with a significant glucose, HbA1c, or T2D association and examined whether they had associations with additional diabetes traits (p ≤ 0.0016, correcting for 32 sets tested). Damaging missense variants in PDX1 and PFAS, which significantly associated with HbA1c levels in our primary analysis, associated with T2D diagnosis using this threshold (Table 3 and Supplementary Table 14).
Many HbA1c associations appeared to be secondary to effects on red blood cells. 22 out of 31 variant sets associated with HbA1c did not show effects on glucose levels or T2D diagnosis (Supplementary Table 14) and were not implicated in Mendelian forms of diabetes. Out of these 22 variant sets, 12 were in genes implicated in Mendelian disorders affecting red blood cells (for example EPB42 and TFR2; see Supplementary Table 15) www.nature.com/scientificreports/ and an additional five had highly significant associations with red blood cell traits in our data (p ≤ 7.82 × 10 -7 ; Supplementary Table 16). We focused on the variant sets associated with multiple diabetes traits as these are strong candidates for regulating glucose homeostasis and/or causally contributing to T2D risk. The genes fall into three main groups; known MODY (maturity-onset diabetes of the young) genes (GCK, HNF1A and PDX1) 14 , known genes reported in previous exome-wide analyses of glucose levels or T2D (G6PC2 and PAM) 5,15 , and novel genes not previously implicated by genetics in diabetes (GIGYF1, TNRC6B and PFAS).
Because obesity is linked to the development of T2D, we adjusted for body mass index (BMI) in the regression and found that the association of these genes with diabetes-related traits remained significant (Supplementary  Tables 17 and 18).
We used the generalized linear mixed model implemented by SAIGE-Gene which accounts for relatedness and adjusts for unbalanced case-control ratios 16 to verify association of our variant sets of interest with glucose, HbA1c, and T2D diagnosis. SAIGE-Gene was run in the European ancestry population including related individuals (n = 398,574). Using the p-value thresholds previously employed, all associations were statistically significant using this method apart from the associations of TNRC6B pLOF with HbA1c (p = 6.85 × 10 -6 ) and T2D diagnosis (p = 4.77 × 10 -5 ) which were less significant (Supplementary Table 19).
To maximize power to detect associations for rare variants, our original analysis of glucose and HbA1c included individuals with a diabetes diagnosis. Associations for all variant sets of interest were at least nominally significant when such individuals were excluded from the analysis (Supplementary Table 20). For GIGYF1 Table 1. Gene-level associations with glucose and HbA1c levels. pLOF or damaging missense variants (CADD score ≥ 25) were aggregated per gene. The effect is shown in standard deviations (SD) of transformed values as well as in International Federation of Clinical Chemistry (IFCC) units. CI confidence interval.  HNF1A and PDX1, we also tested pathogenic or likely pathogenic variants annotated in ClinVar 17 for association with glucose, HbA1c and T2D-either alone or combined with the variants used in our primary analysis. Pathogenic and likely pathogenic variants in GCK and HNF1A associated with T2D diagnosis and markers of glycemic control and added signal beyond what was seen in the primary analysis (Supplementary Tables 21 and 22).
Validation of GIGYF1 pLOF associations using independent datasets. We sought to use independent measurements of glucose and HbA1c to perform sensitivity analysis and verify the associations of interest seen in our primary analysis which used measurements taken as part of the UKBB assessment. To do this  www.nature.com/scientificreports/ we extracted lab test values for glucose and HbA1c from primary care data, which is available for approximately half of the cohort. In gene-based burden tests all variant sets showed a direction of effect consistent with that seen in the primary analysis and 10 out of 12 of these were significant when correcting for the number of tests performed (p ≤ 0.004). This included the association of GIGYF1 pLOF with glucose (p = 2.10 × 10 -6 , effect = 0.65 SD) and HbA1c (p = 1.19 × 10 -5 , effect = 0.74 SD) levels (Supplementary Figure 2 and Supplementary Table 23). We then assessed whether rare variants in GIGYF1 and the other novel genes associated with T2D replicated in a completely independent exome-sequencing cohort. Gene-based tests in European ancestry individuals from the Geisinger Health System (GHS; 25,846 T2D cases and 63,749 controls) confirmed the association of GIGYF1 pLOF with T2D (p = 0.01, OR = 1.8). We did not replicate the association of TNRC6B pLOF with T2D. We tested an expanded PFAS variant set (pLOF + deleterious missense) and did not detect an association with T2D (Supplementary Table 24). Notably variant set definitions varied somewhat from those used in our primary analysis (see "Methods").

Multiple variants contribute to associations with diabetes diagnosis and biomarkers.
To examine whether specific variants were driving the associations with diabetes traits we conducted "leave-oneout" burden tests. The association of PAM missense variants with T2D diagnosis was driven entirely by a previously reported variant Ser539Trp (rs78408340; p = 0.43 when Ser539Trp is excluded). For all other variant sets, multiple variants contributed to the associations observed (Supplementary Figure 3). Notably, when singleton variants were excluded, half of the associations no longer reached significance including those for GCK pLOF and glucose (p = 0.0015 without singletons versus p = 1.56 × 10 -9 ) and GIGYF1 pLOF and T2D (p = 2.9 × 10 -5 without singletons versus p = 6.14 × 10 -11 ) (Supplementary Table 25), demonstrating the power of including singletons in gene-based tests.
For the variants contributing to our novel discovered associations, GIGYF1 pLOF, TNRC6B pLOF and PFAS damaging missense variants, we examined the quality scores, sequencing depth, transcripts affected and presence of contributing variants in the gnomAD database. We found that for GIGYF1 and PFAS the variants contributing most to the associations had good quality scores and depth and were present in gnomAD. In contrast, TNRC6B is a highly constrained gene and the most common pLOF variant is not present in gnomAD (see "Supplementary Note" and Supplementary Figure 4). This observation along with the fact that the association of TNRC6B pLOF with T2D did not replicate in GHS leads us to view this association with suspicion.

Replication of published gene-level associations with T2D and associations for T2D drug target genes.
The association between predicted damaging variants in PAM and T2D diagnosis was previously reported in an exome-sequencing study performed by Flannick et al. 5 . We examined whether the other two significant genes in the study, SLC30A8 and MC4R, associated with diabetes traits in our analysis. Both pLOF and damaging missense variants in SLC30A8 associated with reduced levels of HbA1c and glucose and suggestively associated with decreased incidence of T2D diagnosis (Supplementary Table 26). Combining SLC30A8 pLOF and missense variants resulted in more significant associations with glucose (p = 2.71 × 10 -6 ), HbA1c (p = 8.64 × 10 -10 ) and T2D diagnosis (p = 0.005) (Supplementary Table 26). There were no MC4R high confidence pLOF variants in our dataset and MC4R predicted damaging missense variants did not associate with diabetesrelated traits in our study (all p > 0. 19). We note that the MC4R Ile269Asn variant driving the association in Flannick and colleagues' analysis is absent from our dataset, consistent with the fact that it is absent from European populations in gnomAD.
We also examined whether we detect associations for the 8 genes encoding T2D drug targets (GLP1R, IGF1R, PPARG, INSR, SLC5A2, DPP4, KCNJ11, ABCC8). Variant sets in three of these genes, DPP4, GLP1R and KCNJ11 significantly associated with either T2D diagnosis or HbA1c levels (p ≤ 0.003 correcting for 15 variant sets tested) and an additional 4 genes had a nominally significant association with T2D and/or HbA1c (Supplementary  Figure 5 and Supplementary Table 27). Table 3. Genes and variant sets associated with multiple diabetes-related traits. Variant sets significant for at least one trait in our primary analysis that are also associated with additional diabetes traits (p ≤ 0.0016, 32 sets tested) are shown. Effect is shown in SD of transformed values or as an odds ratio (OR). www.nature.com/scientificreports/ PheWAS of GIGYF1 pLOF reveals associations with cholesterol levels, hypothyroidism and complications of diabetes. The most significant novel associations were seen for GIGYF1 pLOF which associated with increased glucose and HbA1c levels as well as increased incidence of T2D diagnosis. To give additional insight into the biological roles of GIGYF1 we performed a phenome-wide association study (PheWAS) testing GIGYF1 pLOF for association with 142 quantitative traits and 262 ICD10-coded diagnoses (Fig. 3). GIGYF1 pLOF strongly associated with decreased levels of total cholesterol (p = 2.44 × 10 -12 , effect = − 0.61 SD) which was, in large part, driven by LDL cholesterol (p = 2.40 × 10 -10 , effect = − 0.56 SD) although an effect on HDL cholesterol was also observed (Table 4). To understand the extent to which this is influenced by the use of cholesterol-lowering medication in diabetics, we adjusted for medication use in the regression and also performed a separate analysis excluding those on cholesterol-lowering medication. The association between GIGYF1 pLOF and LDL cholesterol levels was significant in both analyses (Supplementary Table 28). GIGYF1 pLOF also associated with decreased grip strength and decreased peak expiratory flow. Notably, GIGYF1 pLOF also associated with increased levels of the kidney injury biomarker cystatin c (p = 6.65 × 10 -6 , effect = 0.36 SD) and increased diagnosis of urinary system disorders (p = 7.32 × 10 -5 , OR = 2.71) ( Tables 4 and 5).
After diabetes, the most significant disease association of GIGYF1 pLOF was with increased risk of hypothyroidism (p = 1.25 × 10 -9 , OR = 4.53). 21 out of the 131 GIGYF1 pLOF carriers had a diagnosis of unspecified hypothyroidism and seven of these also had a diagnosis of T2D. Given the autoimmune component in hypothyroidism and type 1 diabetes (T1D), we examined the association of GIGYF1 pLOF with T1D diagnoses but did not detect a significant association (p = 0.1). GIGYF1 pLOF also significantly associated with increased risk of syncope and collapse (p = 1.92 × 10 -6 , OR = 3.75) ( Table 5).
Other phenome-wide significant associations with quantitative traits included waist circumference, total protein and mean corpuscular hemoglobin as well increased time to complete a cognitive test (Table 4). To ensure that the association of GIGYF1 pLOF with HbA1c was independent of effects on hemoglobin we adjusted for mean corpuscular hemoglobin level and verified that the association remained highly significant (p = 4.10 × 10 -12 ). GIGYF1 pLOF also associated with increased diagnosis of emphysema and anemia (Table 5) www.nature.com/scientificreports/ Common variants at GIGYF1 associate with glucose, T2D and GIGYF1 expression. We looked for more common variants that could further implicate the GIGYF1 locus in diabetes. We tested array genotyped and imputed variants at the GIGYF1 locus for association with glucose levels in 294,042 unrelated European ancestry individuals with measurements available. We found a cluster of variants in a linkage disequilibrium block covering GIGYF1 and EPO significantly associating with glucose levels (Fig. 4). This signal is represented by rs221783, an intergenic variant whose minor T allele associated with decreased glucose (p = 1.8 × 10 -11 , effect = − 0.03 SD) and HbA1c (p = 3.6 × 10 -7 , effect = − 0.02 SD) levels as well as increased cholesterol (p = 7.0 × 10 -12 , effect = 0.03 SD). This variant also associated with a decreased risk of T2D (p = 0.005, OR = 0.96) and hypothyroidism (p = 6.95 × 10 -7 , OR = 0.92) ( Table 6). rs221783 is the best eQTL (R 2 > 0.8) for GIGYF1 in several tissues including pancreas, adipose and thyroid 18 (Supplementary Table 29). In all tissues, the T allele associating with decreased glucose and decreased T2D risk associated with increased GIGYF1 expression. Conditional analysis showed that the glucose and HbA1c associations of GIGYF1 pLOF and rs221783 are independent of each other (Supplementary Table 30). The association of rs221783 with glucose levels replicated in Biobank Japan (p = 1.7 × 10 -4 , effect = − 0.05 SD for T allele) 19 whilst in FinnGen, rs221783 showed a nominal association with T2D diagnosis (p = 0.02, OR = 0.96 for T allele) (Supplementary Table 31). The association with thyroid disease has been replicated elsewhere 20 .
The independent glucose and T2D associations at the GIGYF1 locus and their replication in other datasets further support the hypothesis that decreasing GIGYF1 levels predisposes to diabetes while increasing GIGYF1 levels may protect from diabetes.

Identification of causal genes at GWAS loci.
Given the fact that the GIGYF1 locus harbors both rare and common variants associated with T2D we examined whether our study points to the causal gene at additional GWAS loci. For 558 variants associated with T2D in a recent study by Vujkovic et al. 9 we tested whether either of the two closest genes associated with T2D or HbA1c levels in our study. Just nine genes close to these  Figure 6 and Supplementary Table 32). Most of these genes are already known to be causal for T2D including GCK, HNF1A, SLC30A8, IRS2 and HNF4A. Given that there is a common variant association with T2D at TNRC6B but conflicting results for TNRC6B pLOF in UKBB and GHS, further study of this locus may be warranted.

Discussion
Our results highlight the power of whole exome sequencing to make novel discoveries relevant to human disease and to detect known associations of Mendelian disease genes. Gene-level aggregation and burden testing of rare pLOF and predicted damaging missense variants identified genes associating with T2D and biomarkers of glycemic control. These included several genes not previously implicated in diabetes, GIGYF1, TNRC6B and PFAS, as well as GCK, HNF1A and PDX1, known MODY genes 14,[21][22][23] . We also identified PAM and G6PC2, genes highlighted by other rare-variant studies of T2D and glucose levels 5,15 . Gene-level tests were needed to detect the majority of these associations owing to the rarity of the variants. For example, out of 363,977 individuals, just 40 carried a pLOF variant in GCK and 131 carried a pLOF variant in GIGYF1. In general, singleton variants contributed a large part of the signal arguing strongly, as others have done 4 , for including such variants in gene-based collapsing tests. One limitation of this study is its focus on individuals of European ancestry. Although we also   www.nature.com/scientificreports/ performed gene-level tests in the other sub-populations of UKBB we were underpowered to detect associations as they represent a small proportion of the cohort. The only significant association in non-European ancestry populations was between MYH6 damaging missense variants and HbA1c levels in the East Asian ancestry population. Given the small number of carriers and known function of MYH6 in cardiac function 24,25 this may be a false positive and replication is required. An important area for further study will be assessing the impact of rare variants on T2D risk in larger multi-ethnic cohorts. Our finding that rare variants in the MODY genes, GCK, HNF1A and PDX1 associate with common T2D are consistent with results reported by Bonneford et al. 26 . Bonneford et al. defined pathogenic variants in MODY genes according to the American Society of Human Genetics criteria 27 . In contrast, we used bioinformatic methods to select variants likely to have functional impact in all genes in the genome. This allowed us to identify associations for genes not previously implicated in diabetes such as GIGYF1. The fact that we detected associations for MODY genes strongly suggests that we are enriching for variants with functional impact. However, for GCK and HNF1A, adding known pathogenic variants did increase the significance of detected associations.
Test statistic inflation can be a challenge when testing rare variants as statistical assumptions break down when the number of carriers expected to have the disease of interest is low 4,28 . To avoid false positives in our analysis of diabetes, we initially examined associations with glucose and HbA1c because quantitative traits are less susceptible to inflation. All the variant sets that associated with T2D also affected HbA1c and/or glucose levels giving us confidence in these associations. A targeted analysis of the genes encoding T2D drug targets revealed HbA1c and/or T2D associations for variants in several of these genes. The lack of association for variants in some of these drug target genes may in part be due to a lack of statistical power for genes with small numbers of rare variant carriers. However, for some of these genes such as SLC5A2 (encoding SGLT2) we do not detect associations with diabetes traits despite having a reasonable number of variant carriers.
We uncovered novel associations with T2D and biomarkers of glycemic control for aggregated variants in GIGYF1, TNRC6B and PFAS and attempted replication of these associations in exome-sequenced individuals from GHS. The association of GIGYF1 pLOF with T2D replicated in GHS but we did not replicate associations for TNRC6B and PFAS. There are differences between these two cohorts; UKBB is a population-based cohort with T2D diagnoses obtained from inpatient records while GHS is a health system-based cohort and includes both inpatient and outpatient diagnoses. There is a larger effect size for GIGYF1 pLOF in UKBB compared to GHS which may be due to these differences in ascertainment. Differences in the definition of the variant sets tested or the frequency of the relevant variants may have contributed to the failure to replicate the TNRC6B and PFAS associations. Alternatively, this may suggest that the TNRC6B and PFAS associations are false positives.
We focused our analysis on understanding the consequences of GIGYF1 pLOF as it strongly associated with glucose, HbA1c and T2D and the T2D association replicated in GHS. GIGYF1 encodes a protein that was initially identified for its binding to the adapter protein GRB10 (GRB10 interacting GYF protein 1) which negatively regulates both the insulin and IGF-1 receptors 29,30 . Transfection of cells with GRB10-binding fragments of GIGYF1 led to greater activation of both the insulin and IGF-1 receptors 30 . This supports a hypothesis whereby GIGYF1 enhances insulin signaling by reducing the negative regulation of the insulin receptor by GRB10. When GIGYF1 is reduced, as is the case in individuals carrying pLOF variants, GRB10 presumably inhibits insulin signaling to a greater degree thereby reducing the action of insulin in its target tissues and leading to increased risk of T2D. However, the exact mechanistic details of these interactions remain to be determined. GRB10 variants have also been reported to associate with T2D and glycemic traits although interpretation of these results is complicated by imprinting 31,32 . GIGYF1 is broadly expressed with high levels observed in endocrine tissues, pancreas and brain 18,33 . GIGYF1 and the related protein GIGYF2 have also been implicated in translational repression 34 and translation-coupled mRNA decay 35 suggesting biological roles beyond regulation of insulin and IGF-1 receptor signaling.
PheWAS of GIGYF1 pLOF revealed a strong association with decreased cholesterol levels reflecting altered energy homeostasis in carriers. An inverse relationship between glucose and cholesterol levels has been observed for variants in other genes 36 . We also observed several associations that could reflect complications of diabetes in GIGYF1 pLOF carriers including increased cystatin c levels and increased diagnosis of urinary disorders, suggesting renal complications, as well as syncope and collapse which may be a side-effect of hyperglycemia and/ or hypoglycemia in diabetics. Other associations may reflect poor health in carriers including decreased grip strength and decreased peak expiratory flow.
GIGYF1 pLOF associated with a 4.5-fold increased risk of hypothyroidism and GIGYF1 is highly expressed in thyroid 18,33 consistent with a biological function in this tissue. IGF-1 and insulin have been implicated in the proliferation of thyroid cells which may, in part, explain the association with thyroid dysfunction [37][38][39] . An alternative possibility is that GIGYF1 contributes to thyroid function by affecting secretion of thyroid stimulating hormone in the anterior pituitary gland. Another explanation is that shared autoimmune mechanisms contribute to thyroid dysfunction and diabetes in pLOF carriers and that some of the carriers diagnosed with T2D have features of latent autoimmune diabetes in adults 40 . Damaging variants in GIGYF1 have recently been implicated in conferring risk for developmental delay and autism spectrum disorders 41 . Consistent with this, we see an association of GIGYF1 pLOF with increased time to complete a cognitive test. It may be that metabolic aberrations in carriers affect cognitive performance, that brain development is altered due to perturbation of IGF-1 signaling, or that other functions of GIGYF1 such as regulation of mRNA expression and decay are responsible for cognitive phenotypes.
In addition to replicating the association of GIGYF1 pLOF with T2D in an independent cohort we also used common genetic variants to further investigate the role of the GIGYF1 locus in diabetes. A non-coding variant at the GIGYF1 locus associated with glucose levels and T2D, and this replicated in independent datasets. This variant associated with increased GIGYF1 expression but a lower risk of T2D. This direction of effect is consistent www.nature.com/scientificreports/ with what we see for the pLOF variants-reduced levels of GIGYF1 increases diabetes risk but increased levels of GIGYF1 are protective. We observed an intersection of rare and common variant associations at GIGYF1 as well as at MODY genes such as GCK, HNF1A and HNF4A. However, in general, our gene-level analysis of rare variants did not identify many additional causal genes at GWAS loci; out of 558 variants associated with T2D 9 just nine had rare variant associations at a nearby gene.
We assessed the impact of pLOF and predicted damaging missense variants on glycemic traits and uncovered a hitherto unappreciated role for GIGYF1 in regulating blood sugar and protecting from T2D. By highlighting the importance of GIGYF1 and GRB adapter proteins in modulating insulin signaling this finding may lead to new therapeutic approaches for the treatment of diabetes. Discoveries such as this are only possible by combining health-related data with the sequencing of rare variants on a biobank scale.

Methods
The UK Biobank resource and data access. The UK Biobank (UKBB) recruited ~ 500,000 participants in England, Wales, and Scotland between 2006 and 2010 42 . Written informed consent was obtained from all participants. Phenotypic data available includes age, sex, biomarker data and self-reported diseases collected at the time of baseline assessment as well as disease diagnoses from inpatient hospital stays, the cancer registry and death records obtained through the NHS. Approximately half of the participants also have diagnoses from primary care available. Array genotypes are available for nearly all participants and exome sequencing data is available for 454,787 participants. The data used in this study were obtained from the UKBB through application 26041.
Population definition and PC calculation for subjects with exome data. Subject quality control was performed by Regeneron Genetics Center (RGC) and removed subjects with evidence of contamination, unresolved duplications, sex discrepancies and discordance between exome sequencing and genotyping data. Genetic relationships between participants were determined by RGC using the PRIMUS program 43 . For the unrelated subset all first-and second-degrees relatives and some third-degree relatives were excluded.
Populations were defined through a combination of self-reported ethnicity and genetic principal components. We selected the unrelated individuals who identify as White (European), Black (African), Asian (South Asian) or Chinese (East Asian) (Field 21000) and ran an initial principal component analysis (PCA) on high quality common variants using eigenstrat 44 . SNPs were filtered for missingness across individuals < 2%, MAF > 1%, regions of known long range LD 45 , and pruned to independent markers with pairwise LD < 0.1. We then projected the principal components (PCs) onto related individuals and removed all individuals ± 3 standard deviations from the mean of PCs 1-6. A final PC estimation was performed in eigenstrat 44 using unrelated subjects. We then projected related individuals onto the PCs.
Exome sequencing and variant calling. DNA was extracted from whole blood and was sequenced by the RGC as described elsewhere 46 . Briefly, the xGen exome capture was used and reads were sequenced using the Illumina NovaSeq 6000 platform. Reads were aligned to the GRCh38 reference genome using BWA-mem 47 . Duplicate reads were identified and excluded using the Picard MarkDuplicates tool (Broad Institute). Variant calling of SNVs and indels was done using the WeCall variant caller (Genomics Plc.) to produce a GVCF for each subject. GVCFs were combined to using the GLnexus joint calling tool 48 . Post-variant calling filtering was applied using the Goldilocks pipeline 46 . Variants were annotated using the Ensembl Variant Effect Predictor v95 49 which includes a LOFTEE plug-in to identify high confidence (HC) pLOF variants 13 . Combined Annotation Dependent Depletion (CADD) scores were generated using the Whole Genome Sequence Annotator (WGSA) AMI version 0.8.

Phenotype definitions.
Blood biochemistry values were obtained for glucose (Field 30740) and HbA1c (Field 30750) from UKBB and inverse rank normalized using the RNOmni R package 50 , resulting in an approximately normal distribution. For sex-stratified analyses, glucose and HbA1c values were inverse rank normalized separately in males and females.
For disease diagnoses, ICD10 codes were obtained from inpatient hospital diagnoses (Field 41270), causes of death (Field 40001 and 40002) and the cancer registry (Field 40,006) from UKBB. Diagnoses also included additional hospital episode statistics (HESIN) and death registry data made available by UKBB in July 2020. T2D was defined as ICD10 E11. For the purposes of excluding diagnosed diabetics from the glucose and HbA1c analysis we defined diabetes as ICD10 codes E10-E14 which includes both T1D and T2D diagnoses.
For phenome-wide analyses, a selection of quantitative traits was obtained from other fields, encompassing anthropometric measurements, blood counts, as well as blood and urine biochemistry. Beyond these measurements, we selected additional quantitative traits found to be heritable (h 2 significance flagged as at least "nominal" with a confidence level flagged as "medium" or "high") by the Neale lab 28 , using PHESANT to transform values to quantitative traits when necessary as they describe. These included the results of cognitive tests. All quantitative traits were inverse rank normalized using the RNOmni R package 50 . For burden testing, we required at least ten carriers to have measurements. We also tested associations with ICD10-coded diagnoses (using three character codes) that had more than 500 cases in the European ancestry participants with exome data and at least one expected case carrier based on variant frequency and disease prevalence.
Burden testing was performed using glm in R, using a gaussian model for quantitative traits and a binomial model for case-control analyses. Genotype was coded according to a dominant model; 0 (no variant) or 1 (any number of variants). We adjusted for age, sex and the first 12 PCs of genetic ancestry in the regression. Additionally, when testing for association with disease diagnoses, we included country of recruitment as a covariate as the time of available follow-up differs between England, Scotland and Wales. Recruitment country was defined using the location of the relevant UKBB recruitment center (Field 54). Associations were later confirmed using just participants recruited in England. Follow-up duration was defined as the time from recruitment to UKBB until censoring, death, or the last date on which inpatient diagnoses were obtained. For case-control analyses we only ran tests where there was at least one expected case carrier based on variant frequency and disease prevalence. For quantitative traits we required at least ten carriers to have measurements.
For glucose and HbA1c, to convert effect sizes from normalized values back to measured units, the estimates from the regression were multiplied by the standard deviation of these traits in the entire cohort.
For sex-stratified analyses, males and females were defined using a combination of recorded sex (Field 31) and genetic sex (Field 22001). Meta-analysis of males and females was performed using the standard error weighted method in METAL and heterogeneity was tested using Cochrane's Q test 51 .
SAIGE-Gene was run using the SAIGE R package (v0.36.5) 52 using settings recommend by the developers and an additive model, related individuals were included.
Array association testing. Genotypes were obtained through array genotyping and imputation as described previously 53 . Population definition and PC estimation for individuals with array data was performed as previously described 54 . We tested all variants with imputation quality score (info) ≥ 0.8 and minor allele frequency (MAF) ≥ 0.1% in a 200 Mb region around GIGYF1 for association with glucose, HbA1c, T2D and hypothyroidism. Association analyses were performed using an additive model in PLINK adjusting for age at recruitment to UKBB, sex and the first 12 PCs of genetic ancestry. We also adjusted for country of recruitment where appropriate. The most significant variant with info > 0.95 was selected as the lead variant at the locus. We replicated the association of rs221783 with glucose using available summary statistics for Biobank Japan for the trait "blood sugar" (http:// jenger. riken. jp/ en/ result) 19 . We replicated the association of this variant with T2D diagnosis using summary statistics from FinnGen release 3 for the phenotype "E4_DM2" (https:// www. finng en. fi/ en/ access_ resul ts). The effect allele in these datasets was the alternate allele "C". For consistency with the UKBB associations we have shown the effect for the "T" allele.
Meta-analysis of the UKBB and replication dataset association results was performed with the METAL software package using the standard error weighted method 51 .
Region plots were created using LocusZoom 55 . LD calculations were performed in the European ancestry population for array variants in a 500 kb sliding window as follows; we extracted genotypes with info > 0.9, rounded them to whole numbers, mean-imputed missing genotypes and used the R "cor" function to compute R which was then squared to get an R 2 value.
Gene expression and eQTL analysis. The expression of GIGYF1 in various tissues was assessed using the GTEx portal (accessed 08/04/2020) 18 and Human Protein Atlas (http:// www. prote inatl as. org) 33 . eQTL data for rs221783 was obtained from GTEx v8. For each tissue of interest, the best eQTL for GIGYF1 was identified (GTEx v8 "eGene"). R 2 for rs221783 and the best GIGYF1 eQTL was calculated as described above.
Replication analysis in GHS. The GHS MyCode Community Health Initiative study is a health systembased cohort and has been described previously 56 . A subset of participants sequenced as part of the GHS-Regeneron Genetics Center DiscovEHR partnership were included in this study. T2D status was defined based on meeting at least one of the following criteria: (1) clinical encounters due to or problem-lists diagnosis code for type 2 diabetes (ICD-10 code E11), or (2) HbA1c greater than 6.5%, or (3) use of diabetic oral hypoglycemic medicine. Controls were participants who did not meet any of the criteria for case definition. Individuals were excluded from the analysis if they had clinical encounters due to or problem-lists diagnosis code for type 1 diabetes (ICD-10 code E10), or if they were treated with insulin but not with oral hypoglycemic medicines.
Exome-sequencing, variant calling, quality control and gene-based tests were performed as previously described 57  www.nature.com/scientificreports/ predicted to be deleterious by 5/5 algorithms (PFAS) with MAF < 1%. The following variants were classified as pLOF variants: frameshift-causing indels, variants affecting splice acceptor and donor sites, variants leading to stop gain, stop loss and start loss. The five missense deleterious algorithms used were SIFT 58 , PolyPhen2 (HDIV), PolyPhen2 (HVAR) 59 , LRT 60 , and MutationTaster 61 . Association testing was performed in the European ancestry population using the Firth logistic regression test implemented in REGENIE 62 .
Identification of potential causal genes at GWAS loci. For 558 variants identified as associating with T2D 9 we mapped the two closest protein coding genes using bedtools. This resulted in 1118 genes for which we had tested 2071 variant sets (pLOF and/or damaging missense) in our primary analysis. Genes with p < 2.41 × 10 -5 (correcting for 2071 variant sets tested) for HbA1c or T2D were considered significant.
Ethics statement. The UK Biobank study was approved by the National Health Service National Research Ethics Service and all participants provided written informed consent to participate in the study. The UK Biobank resource is an approved Research Tissue Bank and is registered with the Human Tissue Authority, which means that researchers who wish to use it do not need to seek separate ethics approval (unless re-contact of participants is required). Information about ethics oversight in the UK Biobank can be found at https:// www. ukbio bank. ac. uk/ ethics/. This research has been conducted using the UK Biobank resource, application 26041. Research in GHS was approved by the GHS IRB, approval number 2006-0258. Written informed consent was obtained from all participants in GHS.

Data availability
For the primary analysis in UKBB, all phenotypic data and array genotypes used in this study are accessible through application to UKBB. Currently, exome sequencing data for ~ 200,000 participants is available 63 ; the remainder of the exome data used is scheduled for release by UKBB in 2021. Full summary statistics for genelevel tests performed in UKBB will be made available upon publication. For the replication analysis in Geisinger Health System, the full summary statistics for all replication analyses performed are supplied in Supplemental  Table 24. Summary statistics for Biobank Japan are available at http:// jenger. riken. jp/ en/ result and summary statistics for FinnGen are available at https:// www. finng en. fi/ en/ access_ resul ts.