Abstract
To explore the complex genetic architecture of common diseases and traits, we conducted a comprehensive PheWAS of ten diseases and 34 quantitative traits in the community-based Taiwan Biobank (TWB). We identified 995 significantly associated loci, including 135 novel loci specific to the Taiwanese population. Further analyses highlighted the genetic pleiotropy of loci related to complex diseases and associated quantitative traits. Extensive analysis of glycaemic phenotypes (T2D, fasting glucose and HbA1c) identified 115 significant loci with four novel genetic variants (HACL1, RAD21, ASH1L and GAK). Transcriptomics data also strengthen the relevance of the findings to metabolic disorders, thus contributing to a better understanding of pathogenesis. In addition, genetic risk scores were constructed and validated for absolute risk prediction of T2D in the Taiwanese population. In conclusion, our data-driven approach without an a priori hypothesis is useful for novel gene discovery and validation, as well as disease risk prediction, in a unique non-European population.
Introduction
Genetic epidemiological methodologies such as genome-wide association studies (GWAS), phenome-wide association studies (PheWAS), conditional GWAS, genetic correlation1, and Mendelian randomization (MR)2 have helped elucidate the convoluted interplay between genetics and phenotypes of complex diseases in humans. Conventionally, genetic epidemiological studies have focused on a specific phenotype or disease trait and have, to date, reported more than 370,000 associations over 1800 traits3. However, complex diseases require more in-depth parallel analysis owing to their heterogeneity and genetic pleiotropy. The abundance of both genotype and phenotype data from biobanks such as the UK Biobank (UKBB)4 and the Biobank Japan (BBJ)5 has allowed extensive phenome-wide genome-wide association studies of quantitative traits and diseases. However, most previous discoveries are based on subjects with European ancestry, and the gradual establishment and maturation of biobanks with non-European ancestry have revealed the importance of diversity in genetic epidemiology6.
The population-based biobank of Taiwan (Taiwan Biobank, TWB) was launched in 2012 and encompasses 0.61% of the total population of Taiwan with more than 144,000 participants, which is comparable to the population coverage of BBJ (0.16%) and UKBB (0.74%). The majority of Taiwanese (over 99%) are of Han Chinese ancestry who migrated from mainland China7,8. Analyses based on phenotypes in TWB may reveal significant genetic effects of critical health-related traits in Taiwanese and in extension, other populations with East Asian ancestry.
In this paper, we report a comprehensive PheWAS of 10 diseases and 34 quantitative traits from TWB with complementary conditional analysis, genetic correlation and MR. As most of the available traits are cardio-metabolically related, we focused our analysis on type 2 diabetes (T2D). T2D has been one of the leading causes of mortality and morbidity in Taiwan with a high socioeconomic burden9. The T2D prevalence in the Taiwanese population was estimated to be 11.6% in 20169 with the average age of onset around 59.5 years old10. In addition, the mean annual cost for a patient with T2D-related major complications was estimated to be USD $418911. T2D is a heritable trait compounded by various degrees of gene–gene and gene–environment interactions with heritability estimates ranging from 20 to 80%12. The polygenic risk score (PRS), which aggregates the effects of multiple disease-associated genetic variants, has previously been used for T2D risk prediction in various populations. To further improve the applicability of PRS for identifying high-risk individuals for early intervention, specifically among Taiwanese, we constructed absolute risk models that incorporate various risk factors to estimate the probability that an individual free of T2D at a given age will develop T2D in an upcoming time interval.
Our findings of novel candidate genes (HACL1, RAD21, ASH1L, and GAK) related to T2D in Asian population provided insights into the pathophysiology of T2D, and could be potential targets for clinical diagnosis and therapeutic interventions.
Results
Phenome-wide association analysis for 10 binary and 34 quantitative traits in TWB
Supplementary Data 1 summarizes the PheWAS results for ten binary and 34 quantitative traits in TWB. Detailed demographic data are presented in Supplementary Data 2. About 6 million imputed autosomal SNPs passed our QC criteria and were tested for associations with each of the 44 traits. The 44 traits were grouped into nine categories (Supplementary Data 1): anthropometric (n = 5), metabolic (n = 8), cardiovascular (n = 6), hematological (n = 5), kidney-related (n = 6), liver-related (n = 6), stress-related (n = 5), pulmonary-immunological (n = 2), and articular-skeletal (n = 1).
In total, we found 995 significantly associated loci (P ≤ 5 × 10−8), of which more than 100 were specific to the TWB population (Supplementary Data 1 and Supplementary Data 3). The TWB PheWAS results are summarized in the Fuji plot5 (Fig. 1), with each layer corresponding to a phenotype; only significant loci were plotted. We identified several pleiotropic regions associated with multiple phenotypic categories (Supplementary Data 4). For instance, chromosome 2 contains a highly pleiotropic region (positions 27,508,345 through 27,527,678), associated with 13 traits in four categories (hematological, kidney-related, liver-related, and metabolic). This genomic region contains a T2D-associated glucokinase regulator (GCKR). Another highly pleiotropic region on chromosome 12 (positions 111,792,215 through 112,136,812) was associated with eight traits in five different categories (anthropometric, kidney-related, liver-related, vascular-metabolic, metabolic, and FVC). This pleiotropic region on chromosome 12 contains two genes (ALDH2, TRAFD1); genetic variants of ALDH2 have been shown to be associated with the risk of T2D and of micro-vascular and macro-vascular complications13,14. Further details of the PheWAS results are available in Supplementary Data 4.
To control for the confounding effect of LD, conditional association analyses were carried out with adjustment for the corresponding lead SNPs. We found a further 115 independent significant signals (lead SNP r2 ≤ 0.1; P ≤ 5 × 10−8). Of these signals, 38 were mapped to genes different from those of their corresponding lead SNPs (Supplementary Data 5). Of the 15 TWB-unique signals (Supplementary Data 5), 6 were mapped to genes different from those of their lead SNPs in conditional analyses. This result demonstrates the power of conditional analysis to resolve confounding effects due to LD within an associated region, and to discover putative candidate genes that might be missed by marginal association analyses.
Shared genetic architectures between traits
To elucidate the underlying mechanisms of identified associations, we estimated genetic correlations between each pair of quantitative traits and binary diseases using bivariate LD score regression1 as shown in Fig. 2, Supplementary Fig. 1, Supplementary Data 6, and Supplementary Data 7. Of the 101 genetic associations identified with FDR ≤ 0.05, the strongest signal was found between hypertension and mean arterial pressure (rg = 0.846, FDR = 4.93 × 10−102). For the vascular-metabolic traits, T2D was found to be significantly associated with 27 quantitative traits, such as HbA1c, BMI, WC (Supplementary Data 6 and Supplementary Data 7). As expected, HbA1c has the strongest correlation (rg = 0.756, FDR = 1.06 × 10−22) with T2D. Moreover, we identified several association signals previously reported by BBJ5 as well as other studies. For example, we observed significant correlations between T2D and several kidney-related traits, such as microalbumin15 (rg = 0.661, FDR = 1.90 × 10−3), uric acid16 (rg = 0.260, FDR = 9.61 × 10−5), and BUN17 (rg = 0.171, FDR = 1.85 × 10−2).
Mendelian randomization
MR analyses were carried out for BMI, Waist circumference, Body fat percentage, Waist-hip ratio, Hip circumference, triglyceride, HDL-C, and VLDL-C against glycemia-related traits (T2D, HbA1c, and FG) (Supplementary Data 9–11). We identified 180, 186, and 185 instrumental variables (IVs) at a genome-wide significance level and clumping threshold of r2 = 0.01 associated with T2D, HbA1c, and FG, respectively. Of these 551 IVs, none of them were found to have horizontal pleiotropy effects on glycemia-related traits by MR-Egger. All of the 551 IVs have passed the SNP outlier test (unknown pleiotropic SNPs) through MR-PRESSO.
In the two-sample MR analysis on T2D, the overall causal estimate (IVW odds ratio (OR) estimate) for T2D per unit increase in BMI was 1.2566 (P = 0.0011), for the effect of a 1-unit increase in HDL-C on the risk of T2D was 0.9762 (P = 0.0087), and for the effect of a 1-unit increase in VLDL-C on the risk of T2D was 0.9832 (P = 0.01) (Supplementary Data 9). In the two-sample MR analysis on HbA1c, the overall causal estimate for HbA1c per unit increase in BMI was 1.0317 (P = 0.0071) (Supplementary Data 10). As for the two-sample MR analysis on FG, the overall causal estimate for FG per unit increase in HDL-C was 0.9472 (P = 0.0229) (Supplementary Data 11). By contrast, Waist circumference, Body fat percentage, Waist-hip ratio, Hip circumference, and triglyceride were not found to be significantly associated with T2D, HbA1c, and FG (Supplementary Data 9–11). After the Bonferroni correction (P = 0.0021 (0.05/24)), the causal relationship between BMI and T2D remained from our MR analyses. In our study, lipid profiles such as HDL-C and VLDL-C were found to be superior to anthropometric measurements in predicting the risk of glycemia-related traits.
GWAS of T2D and glycemia-related phenotypes
GWAS of FG (N = 75,627) (Supplementary Fig. 2) identified 29 significantly associated loci (such as CTBP1-DT, STEAP2-AS1, NOM1/MNX1, and GAD2) (Supplementary Data 1), while GWAS of HbA1c (N = 76,171) (Supplementary Fig. 2) found 26 significantly associated loci including HACL1 (Fig. 5), CTBP1-DT, and C5orf67. GWAS of 63,177 non-diabetic controls (94.3%) and 3844 T2D subjects (5.7%) (Supplementary Fig. 2) revealed seven SNPs significantly associated with T2D (Supplementary Data 1, Supplementary Data 3), mapped to the following genes: CDKAL1, MIR129-1, LEP, SLC30A8, MED30, CDKN2B-AS1, DMRTA1, CDC123, CAMK1D, KIF11, HHEX, and KCNQ1. Collectively, we identified 41 genes significantly associated with T2D and glycemia-related phenotypes, with some SNPs shared among the glycemic traits and T2D (e.g., MED30/SLC30A8, CDC123) (Supplementary Data 8 and Fig. 3). A summary of SNPs and genes mapped by FUMA SNP2GENE is shown in Supplementary Data 12–14, Supplementary Table 1, and Supplementary Figs. 3–4. Most of the SNPs identified are located in intronic and intergenic regions; only a minuscule number are located in exonic regions (T2D: 0.3%, HbA1C: 1.4%, Fasting glucose: 0.8%) (Supplementary Table 1 and Supplementary Fig. 3). The mapped genes were not statistically significantly expressed in any specific tissue type (Supplementary Fig. 4). However, we noticed upregulation of differentially expressed genes in glycaemic-related tissue types such as kidney, pancreas, stomach and adipose tissues. A summary of the functional annotation of the mapped genes is available in Supplementary Data 12–14.
Conditional association analyses
Conditional association analysis using the lead SNPs as covariates was performed for all SNPs in each locus. Ten SNPs were significantly associated with FG independent of their lead SNPs. Of these, eight were mapped to the same genes as their lead SNPs despite low r2, while the other two were located in different genes (Supplementary Data 5). For instance, rs742763 and its lead SNP rs9380826 were both mapped to GLP1R. In contrast, rs2632372 and its lead SNP rs1402837 were mapped to NOSTRIN and G6PC2, respectively. Fourteen SNPs were associated with HbA1c independent of their lead SNPs, with eight SNPs mapped to the same genes as their respective lead SNPs (Supplementary Data 5). For example, an independent SNP rs75151020 and its lead SNP rs742761 were both mapped to GLP1R. In contrast, rs1326821916 and its lead SNP rs72501962 were mapped to GAK and CTBP1-DT, respectively.
For T2D, the associations of two SNPs (rs11994747 and rs115894051) at different loci remained locus-wide significant (P ≤ 1 × 10−5) after adjustment (Supplementary Data 5). In particular, rs11994747 was associated with T2D independent of its lead SNP rs35859536. The potential genes for lead SNP rs35859536 are SLC30A8/MED30, whereas rs11994747 is located in the intronic region of RAD21 (Supplementary Data 5). RAD21 has never been reported to be associated with T2D or any glycemia-related phenotypes.
The conditional association analyses revealed 26 additional independent loci (such as RAD21, CAMKMT/LINC01833, NOSTRIN, ASH1L, GAK, POLD2/MYL7, SND1, STARD13, and LUC7L). NOSTRIN, POLD2/MYL7, CAMKMT/LINC01833, and SND1 were previously reported to be associated with glycemia-related phenotypes18,19,20,21. STARD13 and LUC7L were associated with hemoglobin22 and thus not considered to be associated with glycemic traits.
Absolute risk of developing T2D in the Taiwanese population
The absolute risk modeling for T2D was based on variables including PRS quintile, family history of T2D, BMI, and sex. The 10-year absolute T2D risk showed a significant risk separation across different combinations of PRS, BMI and sex in the Taiwanese population aged 30 to 50, with or without a family history of T2D (Fig. 4 and Supplementary Data 15). For instance, a 40-year-old male without T2D family history, with BMI > 28 and qPRS = 4 (the 4th PRS quintile), has an estimated 10-year absolute T2D risk of 22.6%, compared to 7.0% for a male of the same age and qPRS but with normal BMI, and to 7.1% for a male of the same age and BMI but the lowest qPRS. In this comparison, BMI had the largest effect, raising the absolute risk by 15.6 percentage points relative to normal BMI when the other risk factors were held the same. PRS had a similar effect of 15.5 percentage points. Validation results based on prospective cohort samples showed that the model had good calibration of relative risk for all sub-groups except males older than 45 years, as assessed by the chi-square goodness-of-fit test (Supplementary Fig. 5). Absolute risk (AR) had good calibration for the sub-group of males younger than 45, while the observed AR in the TWB cohort sample was lower than the model-projected AR, as assessed by the Hosmer-Lemeshow goodness-of-fit test.
Discussion
Here we present a large-scale PheWAS of ten binary and 34 quantitative traits in 77,072 Taiwanese participants from TWB, identifying 995 association signals. This information, in combination with data from other ancestries or geographic regions, will provide more insight into the genetic architecture of cardiometabolic traits for people from different parts of the world. Wei et al.23 showed that the TWB cohort represented the diverse ancestry of different provinces of mainland China and that 99% of the TWB cohort were Han Chinese. The genetic architecture of Han Chinese is relatively similar to that of other East Asian populations such as the JPT and KHV populations from the 1000 Genomes Project. For diabetes, comparisons of the genetic data from different populations may help to decipher the genetic mechanisms underlying the pathophysiology of T2D for individuals of different ancestries.
As demonstrated in the Fuji plot, we identified several highly pleiotropic loci and discovered putative shared genetic effects for several diseases or traits. For example, GCKR shows interesting pleiotropic effects. Previously, it has been shown to be associated with T2D24, gestational diabetes25, triglyceride26, and fatty liver24. GCKR encodes a regulatory protein for glucokinase (GCK), regulating the subcellular localization and allosteric switch of GCK, the rate-limiting enzyme for cellular glucose uptake27,28. These results clearly demonstrate the strength of PheWAS in identifying genes with pleiotropism on T2D and its comorbid traits. The lead SNP rs6547692 for GCKR gene has a similar prevalence across TWB (MAF = 0.485), BBJ (MAF = 0.4387) and UKBB (MAF = 0.44). The regional association plot for rs6547692 (GCKR) is shown in Fig. 5.
Our GWAS analysis revealed a novel association of HACL1 with HbA1C (rs1481559294, P = 4.42 × 10−8) in 77,072 individuals. The recent largest trans-ancestral GWAS of glycemic traits showed marginal significance (P < 1 × 10−3) of several single-nucleotide variations within HACL1 (chromosome 3 from genomic position 15,560,699 to 15,601,569) in a sub-population of around 10,000 individuals29. Their meta-analysis results consist of 281,416 individuals, 13% of whom were East Asian. From the Genome Aggregation Database (gnomAD)30, we noticed that rs1481559294 for HACL1 was a relatively rare variant (African n = 42,024 MAF = 0.00002, East Asian n = 3132 MAF = 0.0006) as compared to the MAF in our population of 0.0013 (n = 77,072). The regional association plot for rs1481559294 (HACL1) is shown in Fig. 5. HACL1 encodes the enzyme 2-hydroxyacyl-CoA lyase 1, which is involved in catalyzing the conversion of even-chain fatty acids into odd-chain fatty acids by cleaving C1 in peroxisomal fatty acid α-oxidation31. Jenkins et al. showed that Hacl1 knockout mice had lower plasma and liver C17:0 fatty acid, but did not observe a significant difference in adipose tissue32. Gene expression of HACL1 has low tissue specificity; however, transcriptomics data showed tissue expression of HACL1 clustered in the intestine and liver, associated with lipid metabolism (Supplementary Fig. 6). Kocarnik et al. previously reported that SNP rs73148185, which was mapped to HACL1, was associated with C-reactive protein level in a multi-ethnic population33. Chronic systemic inflammation signified by elevated C-reactive protein level is a key underlying pathophysiology in patients with T2D34. PheWAS of rs1481559294 showed association with the glycemic traits HbA1C and FG while displaying nominally significant association with known cardiometabolic risk factors such as waist circumference and other anthropometric traits (Supplementary Fig. 7). The tissue expression of HACL1, its association with known T2D risk factors and its relevance to lipid metabolism could indicate an indirect, yet important genetic variant associated with glycemic traits specifically in the Taiwanese population.
Another intriguing finding from our GWAS was the association of GAD2 (Glutamate decarboxylase 2) (rs61839365, P = 1.37 × 10−11) with fasting glucose. GAD2 (rs2839671, P = 8.95 × 10−9) was recently reported in the same trans-ancestral GWAS of glycaemic traits by Chen et al.29. Despite having a smaller sample size, we managed to identify the association of GAD2 with fasting glucose, highlighting the importance of genotype data from diverse ancestries such as Taiwan Han Chinese. The lead SNP rs61839365 for GAD2 is more prevalent in East Asian with MAF of 0.293 in TWB and 0.416 in BBJ as compared to 0.18 in UKBB. The regional association plot for rs61839365 (GAD2) is shown in Fig. 5. GAD2 is a major autoantigen in autoimmune-associated type 1 diabetes and in a subset of T2D, latent autoimmune diabetes in adults35. Glutamate decarboxylase 2 catalyzes the formation of gamma-aminobutyric acid (GABA). GABA is an inhibitory neurotransmitter that is a critical component of neurophysiologic function. Upon the stimulation of glucose, GABA, co-secreted with insulin, has been shown to inhibit glucagon secretion via the activation of GABAA-receptor chloride channels of α cells36. It has also been documented that beta cells secrete GABA in a pulsatile manner in synchrony with insulin secretion37. The storage and secretion of GABA in beta cells are defective in islets of type 1 and type 2 diabetic patients37. Taken together, it is plausible that GAD2 may modulate the blood glucose level by regulating glucagon secretion. GAD2 is highly expressed in brain tissues (especially the hypothalamus) (Supplementary Fig. 6), strengthening the understanding of key role of the hypothalamic–pituitary–adrenal axis in neuroendocrine dysregulation of T2D38.
The conditional association analyses revealed three novel genetic associations with glycaemic traits (RAD21, ASH1L and GAK). The regional association plots (before and after conditional analysis) for rs11994747 (RAD21), rs371382391 (ASH1L) and rs1326821916 (GAK) are shown in Supplementary Fig. 8. All three genes have low tissue specificities, with almost equal gene expression across different tissue types (Supplementary Fig. 9). ASH1L and GAK were previously reported to be associated with obesity traits such as BMI and waist-hip ratio in GWAS of UK Biobank39. Obesity traits were shown to have significant genetic correlation with, and to be causally associated with, glycaemic traits in our pairwise genetic correlation analysis and MR. Furthermore, Klarins et al. reported that GAK was associated with lipid traits (HDL and TG) in their genome-wide meta-analysis. Lipid traits were also shown to have significant genetic correlation with, and to be causally associated with, glycaemic traits in our pairwise genetic correlation analysis and MR40. Together, these findings imply that ASH1L and GAK are crucial cardiometabolic genes in T2D. Inhibition of RAD21 has been demonstrated to increase insulin secretion in a MIN6 mouse beta cell line17 and RAD21 was associated with reduced hematopoietic stem cell self-renewal in aging and inflammation41. Rare mutations of RAD21 have been reported in Cornelia de Lange syndrome with premature physiological aging and gastrointestinal tract difficulties42. However, RAD21 has never been reported to be linked to any cardiometabolic traits in any GWAS. Further functional work in both cellular and animal models will be required to confirm the role of RAD21 in T2D and the link between insulin secretion, physiological aging and T2D.
As discussed in ref. 43, it is preferable to consider the ancestry-trait-specific Bonferroni-corrected significance threshold. Our study considers only the Taiwanese population, and the maximum number of tested SNPs is 5,981,581 across all traits. Therefore, the most stringent ancestry-trait-specific Bonferroni-corrected significance threshold would be 8.36 × 10−9. Among the three highlighted genes that were identified based on the traditional threshold of 5 × 10−8, only the HACL1 (rs1481559294)-HbA1c association has a P value (4.24 × 10−8) larger than the most stringent Bonferroni-corrected significance threshold in our study. However, as mentioned above, the study of Hacl1 knockout mice supports the potential involvement of HACL1 in fatty acid metabolism; we therefore believe that HACL1 is still a promising candidate gene for metabolic traits.
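The threshold arithmetic in this paragraph can be reproduced with a short sketch. The function name and constants below are illustrative and not part of the original analysis; the numeric inputs are the ones quoted in the text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Values quoted in the paragraph above.
N_TESTED_SNPS = 5_981_581
CONVENTIONAL_THRESHOLD = 5e-8
HACL1_P_VALUE = 4.24e-8

strict_threshold = bonferroni_threshold(0.05, N_TESTED_SNPS)
print(f"strict threshold: {strict_threshold:.2e}")
print("passes conventional threshold:", HACL1_P_VALUE <= CONVENTIONAL_THRESHOLD)
print("passes strict threshold:", HACL1_P_VALUE <= strict_threshold)
```

Under these inputs the HACL1 signal clears the conventional threshold but not the stricter one, which matches the statement in the text.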
Our absolute risk model for T2D using PRS and risk factors should be applicable to any Asian population. The strength of the absolute risk model for T2D is that combining the polygenic risk of SNPs with risk factor measurements for all subjects allowed joint evaluation of the effects of PRS and family history of T2D. Our findings could be useful in global efforts to generate trans-ancestry PRS. The good model calibration of relative risk demonstrated the validity of the model. The main limitation of our PRS model is that self-reported disease status often underestimates the true disease prevalence/incidence. As TWB is an on-going project, we expect to have a sufficiently large follow-up dataset on T2D in the near future, which will allow us to validate our prediction model and evaluate its applicability for population-wide screening on T2D to identify high-risk individuals for early intervention.
Methods
Study population
We used individual genotype and phenotype data of subjects recruited to the Taiwan Biobank (TWB; https://www.twbiobank.org.tw/) from 2012 to 2019 for subsequent data analysis. The population of Taiwan is mostly of East Asian ancestry, specifically Han Chinese, and is therefore suitable as a representative study sample for the Asian population. Detailed information on the TWB dataset is available in the previous publication23. This study has been approved by the internal review board of the Academia Sinica (Num: AS-IRB02-109063) and the research ethics committee of Taiwan University Hospital (No. 201507020RINB), and Taiwan Biobank. All participants gave informed consent when joining TWB, which allows for sharing of all anonymized data with authorized researchers. Participants can withdraw consent to sharing their data at any stage of their participation in TWB.
GWAS
We conducted GWAS through logistic regression model (for binary traits) and linear regression model (for continuous traits) under the assumption of additive allelic effects of the SNP dosages via PLINK v2.0. The regression models were adjusted for age, gender and the first ten genetic principal components.
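To illustrate what "additive allelic effects of the SNP dosages" means for a continuous trait, the following is a minimal sketch of a single-SNP linear regression. It is not the PLINK implementation: covariates (age, gender, principal components) are omitted for brevity, and the function name and toy data are hypothetical.

```python
def additive_effect(dosages, trait):
    """Least-squares slope and intercept of trait on allele dosage (0 to 2).

    Simplification: no covariates are included, unlike the adjusted
    models described in the text.
    """
    n = len(dosages)
    if n != len(trait) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx                    # per-allele effect on the trait
    intercept = mean_y - beta * mean_x  # expected trait value at dosage 0
    return beta, intercept


# Toy data: each extra copy of the allele raises the trait by 0.5 units.
toy_dosages = [0, 1, 2, 0, 1, 2]
toy_trait = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
beta, intercept = additive_effect(toy_dosages, toy_trait)
print(beta, intercept)
```

The slope is the per-allele effect size; for binary traits the same additive coding of the dosage would enter a logistic model instead.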
Genotyping and imputation
Detailed genotyping and imputation procedures have been described earlier23. For this study 95,252 subjects were genotyped with either the customized TWB1 array (NTWB1 = 27,737 DNA variants) or TWB2 array (NTWB2 = 68,978) or both (Nboth = 1496) and the last group was also typed by whole genome sequencing (WGS).
Quality control
Binary traits
We first homogenized control individuals by removing comorbid individuals for each trait. Comorbid diseases are defined by a data-driven method using the Partitioning Around Medoids (PAM)44 algorithm in the cluster package of R (version 3.6) and φ-correlation as our distance matrices. Best-fit group numbers were selected by maximizing the silhouette score45. The final sample sizes for each trait are shown in Supplementary Data 1.
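As a sketch of the similarity measure named above, the φ-correlation between two binary disease indicators can be computed from their 2 × 2 contingency table. The clustering step itself (PAM with silhouette-based selection) is not reproduced here, and how φ was converted into a distance is not stated in the text, so no conversion is assumed. The function name and toy vectors are hypothetical.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Phi correlation between two equal-length 0/1 indicator vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a margin of the table is zero")
    return (n11 * n00 - n10 * n01) / denom


# Toy disease indicators for four individuals.
print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))  # perfectly co-occurring
print(phi_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # unrelated
```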
Quantitative traits
For each trait, outliers beyond three standard deviations (two-tailed) were excluded. Individuals with missing values in any trait were dropped from the analysis. A total of 34 quantitative traits were used in this study (Supplementary Data 1).
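The outlier rule can be sketched as follows. This is an illustration only: the use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and the function name and toy values are hypothetical.

```python
from statistics import mean, stdev


def drop_outliers(values, n_sd=3.0):
    """Keep values within n_sd standard deviations of the mean (two-tailed).

    Assumption: the sample standard deviation is used.
    """
    if len(values) < 2:
        return list(values)
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - centre) <= n_sd * spread]


# Toy trait: nineteen typical measurements and one extreme value.
toy_values = [10.0] * 19 + [100.0]
kept = drop_outliers(toy_values)
print(len(kept))
```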
Genotype data
The genotype data from the TWB1 and TWB2 arrays were merged using the GRCh38 assembly and annotation provided by TWB. After filtering out samples with a call rate <0.99 and sex mismatch in either of the TWB1 and TWB2 datasets, 95,215 samples and 95,673 variants remained. For kinship estimation, computation of principal components (PCs), and genomic relation matrix, SNPs were extracted by the following criteria: (1) SNP IDs, chromosomes, physical positions, minor alleles and major alleles had to be all identical in both datasets; (2) call rate >0.95; (3) MAF > 0.01; (4) deviation from Hardy–Weinberg equilibrium (HWE) P > 0.001; and (5) no INDELs. For sample filtering, arrays with generated genotypes for <95% of the loci were excluded. PLINK v1.9 software was used to identify samples with genetic relatedness indicating that they were from the same individual or from first-, second- or third-degree relatives. These determinations were based on evidence for cryptic relatedness from identity-by-descent status (pi-hat cutoff of 0.125). After removing first-, second- and third- degree relatives, 77,072 independent samples and 59,521 SNPs remained.
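The five SNP criteria listed above for kinship estimation and principal components can be expressed as a single predicate. The sketch below is not the PLINK workflow used in the study; the record type, field names, and example SNPs are hypothetical, while the thresholds are those given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpSummary:
    """Per-SNP summary; field names are illustrative, not a real file schema."""
    snp_id: str
    consistent_across_arrays: bool  # criterion (1): identical ID, position, alleles
    call_rate: float
    maf: float
    hwe_p: float
    is_indel: bool


def passes_relatedness_qc(snp: SnpSummary) -> bool:
    """Apply criteria (1) to (5) with the thresholds quoted in the text."""
    return (
        snp.consistent_across_arrays
        and snp.call_rate > 0.95
        and snp.maf > 0.01
        and snp.hwe_p > 0.001
        and not snp.is_indel
    )


candidates = [
    SnpSummary("snp_a", True, 0.99, 0.20, 0.50, False),   # passes all criteria
    SnpSummary("snp_b", True, 0.99, 0.005, 0.50, False),  # fails MAF
    SnpSummary("snp_c", True, 0.99, 0.20, 0.50, True),    # INDEL
]
kept_ids = [s.snp_id for s in candidates if passes_relatedness_qc(s)]
print(kept_ids)
```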
Imputation data
For our analysis, we merged the imputed TWB1 and TWB2 genotype data and selected SNPs according to the merged imputation data released by TWB. Low-quality variants were filtered out using PLINK if an SNP met any of the following criteria: (1) MAF ≤ 0.001; (2) imputation INFO score ≤ 0.8; (3) call rate ≤ 0.95; and (4) deviation from HWE (P ≤ 10−10). Supplementary Table 1 shows the number of SNPs included in the analyses for each trait.
For building PRS model
An additional typical QC procedure was applied to the SNPs used to build PRS models. Multiallelic SNPs and SNPs with ambiguous strands were removed from the analysis. SNPs with MAF ≤ 0.01, low imputation quality (info <0.3) or deviation from HWE (P < 10−6) were also excluded.
Statistics and reproducibility
PheWAS
Significant signals for all binary traits were first screened using a genome-wide significance threshold of P ≤ 5.0 × 10−8 with PLINK2 (https://www.cog-genomics.org/plink/2.0). Linear regression models were used to evaluate the association of all SNPs with each of the 34 quantitative traits under the assumption of additive allelic effects by PLINK. Unless described otherwise, both binary and quantitative traits were adjusted for age and sex with the first ten principal components (PCs) estimated by EIGENSOFT (version 6). Because the standard threshold of 5 × 10−8 has been used in many PheWASs, both in homogeneous populations20 and in trans-ancestral analysis29, we adopted it for our GWAS.
Conditional association analysis
Conditional analyses were performed for each locus defined above using the PLINK2 “--condition” flag. Association tests were conducted based on the generalized linear model with adjustment for all covariates listed in the PheWAS section and, additionally, the lead SNP genotype. Linkage disequilibrium (LD) was estimated as the pairwise squared correlation (r2) within each locus with a window size of 4000 SNPs using the WGS data of 1496 TWB samples. A SNP was considered to be independent of the lead SNP if r2 ≤ 0.1.
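The independence criterion can be illustrated with a genotype-based r2, computed as the squared Pearson correlation between two dosage vectors. This is a simplified stand-in for the LD estimation performed on the WGS data, and the function names and toy genotypes are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    if n != len(g2) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    s11 = sum((a - m1) ** 2 for a in g1)
    s22 = sum((b - m2) ** 2 for b in g2)
    if s11 == 0 or s22 == 0:
        raise ValueError("r2 is undefined for a monomorphic SNP")
    s12 = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    return (s12 * s12) / (s11 * s22)


def is_independent_of_lead(g_snp, g_lead, r2_cutoff=0.1):
    """Apply the r2 <= 0.1 rule stated in the text."""
    return genotype_r2(g_snp, g_lead) <= r2_cutoff


lead = [0, 1, 2, 0, 1, 2]
print(is_independent_of_lead([0, 1, 2, 0, 1, 2], lead))  # same pattern as lead
print(is_independent_of_lead([0, 1, 2, 2, 1, 0], lead))  # uncorrelated with lead
```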
Gene mapping and functional annotation
Post-GWAS analysis of gene mapping, functional annotation, and tissue-expression analysis of prioritized genes was conducted using the FUMA SNP2GENE and GENE2FUNC functions46. Independent significant SNPs were defined as those with P < 5 × 10−8 and r2 < 0.6, and lead SNPs as those with pairwise r2 < 0.1. The maximum distance between LD blocks to merge into a genomic locus was set to 250 kb. The genetic data of East Asian populations in 1000 G phase 347 were used as reference data to conduct LD analyses. Gene expression in different tissues was estimated with gene expression data of 54 tissue types from GTEx v848. Consensus transcript expression levels for HACL1 and GAD2 in 55 human tissues were also generated from Protein Atlas based on transcriptomics data from the two sources HPA and GTEx49.
Phenotype–phenotype genetic correlation
We used bivariate LD score regression1 to calculate genetic correlations between all phenotype pairs. TWB WGS data of 1496 samples were used to compute LD scores with a one cM window size. Summary statistics of the SNPs passing QC criteria described in the above methods section were utilized to perform this calculation. The false discovery rate (FDR) calculated by the Benjamini–Hochberg method50 was used to adjust for multiple testing of 946 combinations of pairs of traits using the Python module statsmodels (www.statsmodels.org). A significant genetic correlation between phenotypes was considered if FDR ≤ 0.05. Data visualization was performed through the ldsc-corrplot-rg script (https://github.com/mkanai/ldsc-corrplot-rg) and R.
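The multiple-testing step can be sketched without external dependencies. The study itself used the statsmodels module; the standalone function below is an illustrative reimplementation of the Benjamini–Hochberg adjustment, and its name and the toy P values are hypothetical. The count of 946 is the number of unordered pairs among the 44 traits.

```python
from math import comb


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


N_TRAIT_PAIRS = comb(44, 2)  # unordered pairs among the 44 traits
print(N_TRAIT_PAIRS)

toy_p = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_p)
significant = [fdr <= 0.05 for fdr in toy_fdr]
print(toy_fdr, significant)
```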
Mendelian randomization
Mendelian randomization (MR) methods utilize common genetic variants to estimate the causal relationship of risk factors with disease outcomes2,51. In our MR analysis, we first filtered horizontal pleiotropic instrumental variables (IVs)52 using MR-Egger53 and identified outliers through MR-PRESSO54. Finally, the valid IVs were applied through IVW55 to analyze causal relationships among eleven clinical measurements: eight exposures (BMI, waist circumference, body fat percentage, waist-hip ratio, hip circumference, triglyceride, HDL-C, and VLDL-C) and three glycemia-related outcomes (T2D, FG, and HbA1c).
TWB samples were randomly split into two sub-groups: 3/4 (G group) for GWAS and the remaining 1/4 (MR group) for MR analysis using the IVW method. For the IV selection, GWAS were carried out in the same manner described above using the G group, and significant SNPs for the eight exposures were selected. The causal associations of the eight exposures with T2D were investigated using the MR group with logistic regression models. The resulting associated SNPs (IVs), their regression coefficients and the effect estimates of the exposures on the outcome were obtained by pooling all MR estimates using the random-effects IVW method. To ensure independence of IVs, strict clumping was performed with an r2 threshold of 0.01 and a physical distance threshold of 10 Mb through the clump command of the TwoSampleMR package56 in R.
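The pooling step can be illustrated with a generic summary-statistic IVW estimator. This sketch is not the TwoSampleMR code used in the study: it assumes first-order weights and a multiplicative random-effects standard error, and the function name and toy summary statistics are hypothetical.

```python
from math import exp, sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted causal estimate from per-SNP associations.

    Assumptions: first-order weights (beta_x^2 / se_y^2) and a multiplicative
    random-effects standard error that is never smaller than the
    fixed-effect one.
    """
    n = len(beta_exposure)
    if n < 2 or n != len(beta_outcome) or n != len(se_outcome):
        raise ValueError("need at least 2 SNPs and equal-length inputs")
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_w = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total_w
    # Cochran's Q measures heterogeneity among the per-SNP Wald ratios.
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    dispersion = max(1.0, q / (n - 1))
    se = sqrt(dispersion / total_w)
    return beta, se


# Toy instruments whose Wald ratios all equal 0.2 (log-odds per unit exposure).
bx = [0.1, 0.2, 0.3]
by = [0.02, 0.04, 0.06]
se_y = [0.01, 0.01, 0.01]
beta_ivw, se_ivw = ivw_estimate(bx, by, se_y)
print(beta_ivw, se_ivw, exp(beta_ivw))  # exp() converts log-odds to an odds ratio
```

For a binary outcome such as T2D, exponentiating the pooled estimate gives an odds ratio per unit increase in the exposure, which is the scale on which the results are reported.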
Data preprocessing for absolute risk modeling
To build an absolute risk model for T2D based on PRS and risk factors in T2D-free individuals, we divided the process into three stages: construction of PRS, absolute risk modeling, and validation analysis. For the data preprocessing, we first identified the 15,664 individuals who were free of T2D at baseline and had more than one visit record, to be utilized subsequently in the validation analysis (TWB-for-val). For the PRS model construction, we randomly selected one third of the remaining TWB samples, that is, those left after excluding the 15,664 individuals reserved for validation (TWB-for-PRS). The remaining two thirds of the samples (TWB-for-AR) were used to build the absolute risk model.
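The partition described here can be sketched as follows. The function name, the seed, and the integer sample IDs are hypothetical, and the eligibility criteria for the validation set are assumed to have been applied beforehand.

```python
import random


def split_for_prs_and_ar(sample_ids, validation_ids, seed=0):
    """Set aside the validation IDs, then split the rest 1/3 vs 2/3 at random."""
    validation = set(validation_ids)
    remaining = [s for s in sample_ids if s not in validation]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(remaining)
    n_prs = len(remaining) // 3
    return remaining[:n_prs], remaining[n_prs:]


# Toy cohort of 100 IDs, ten of which are reserved for validation.
prs_ids, ar_ids = split_for_prs_and_ar(range(100), range(10))
```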
Constructing polygenic risk score (PRS) model
A PRS was calculated as a weighted sum of the number of alleles of SNPs. To estimate the weights for PRS models, we used GWAS summary statistics from BBJ: estimates of regression coefficients (\({\hat{\beta }}_{j}\)), their standard errors (\({\hat{\sigma }}_{j}\)), and associated P values (\({p}_{j}\)) for each SNP j. The QC procedure for SNPs is available in methods.
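In code, the weighted sum amounts to a dot product between one individual's allele counts and the per-SNP weights. The sketch below uses made-up weights and a hypothetical function name; it assumes allele counts are coded as 0, 1 or 2 copies of the effect allele.

```python
def polygenic_risk_score(allele_counts, weights):
    """Sum of effect-allele counts multiplied by their per-SNP weights."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(g * b for g, b in zip(allele_counts, weights))


# Toy example: three SNPs with illustrative weights (not BBJ estimates).
score = polygenic_risk_score([0, 1, 2], [0.1, -0.2, 0.3])
```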
To calculate PRS, two methodologies were utilized. The first method was the standard clumping and thresholding (C + T) method. The hyperparameters for this method were the thresholds for the correlation \({r}^{2}\) and P value p. The parameter space was the Cartesian product of the candidate values of \({r}^{2}\) and p, where \({r}^{2}\in\) {0.01,0.1} and \(p\in\) {0.05,0.001,0.005,1e−4,5e−4,1e−5,5e−5,5e−6}. For each pair of (\({r}^{2},p\)), we used PLINK with a window size of 10 Mb to select SNPs. For model selection, we used the TWB-for-PRS sample to choose optimal tuning parameters. The second method was the C + T method with winner’s curse correction proposed by Shi et al.57, with the same strategy used to select an optimal PRS model.
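The hyperparameter grid and the thresholding half of C + T can be sketched as below. Clumping itself needs LD information and was done in PLINK, so it is not reproduced here. The helper names, the `evaluate` callback, and the choice of an inclusive comparison (`<=`) are assumptions for illustration.

```python
from itertools import product

R2_THRESHOLDS = [0.01, 0.1]
P_THRESHOLDS = [0.05, 0.001, 0.005, 1e-4, 5e-4, 1e-5, 5e-5, 5e-6]

# Cartesian product of the two threshold lists: every (r2, p) pair to try.
grid = list(product(R2_THRESHOLDS, P_THRESHOLDS))


def threshold_snps(snp_pvalues, p_threshold):
    """Keep SNP IDs whose GWAS P value is at or below the threshold."""
    return [snp for snp, p in snp_pvalues.items() if p <= p_threshold]


def select_best(candidates, evaluate):
    """Return the (r2, p) pair with the highest score from `evaluate`.

    `evaluate` is a hypothetical callback, for example a function that builds
    the PRS for one pair and returns its performance in the tuning sample.
    """
    return max(candidates, key=evaluate)


kept = threshold_snps({"rs1": 1e-6, "rs2": 0.01, "rs3": 0.2}, 0.001)
```

The grid contains 2 × 8 = 16 candidate models, each of which would be scored in the TWB-for-PRS sample.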
Absolute risk modeling and validation analysis for T2D
To build the absolute risk model, we utilized the R package iCARE58 to project individual risk. We used the best PRS model described above and transformed it into 4 quintile variables to facilitate interpretation, where the first quintile represents the lowest 20% of the PRS distribution in our sample population and so forth for the other quintiles. Similarly, the continuous BMI values were categorized into four classes (BMI < 18.5, 18.5 ≤ BMI < 24, 24 ≤ BMI < 28, 28 ≤ BMI). To apply iCARE for absolute risk estimation, relative risk (RR) estimates are required for all variables in the model, including PRS quintiles, categorical BMI, and family history of T2D. We used TWB-for-AR samples to estimate the odds ratios (ORs) associated with the aforementioned variables by logistic regression. As the prevalence of T2D in the Taiwanese population is about 10%, the OR gives a reasonable approximation to the RR. Absolute risk estimation was also based on the age- and sex-specific T2D incidence rate and mortality rate from causes other than T2D in the Taiwanese population as recommended in ref. 9 and data from the Taiwan Ministry of Health and Welfare (https://www.mohw.gov.tw/mp-2.html). Note that the data on the incidence and mortality rates from these references are based on diabetes mellitus cases; however, they should provide a reasonably good approximation to T2D as T2D represents the majority (>95%) of diabetes mellitus cases in our data.
Lastly, we used the R package iCARE to conduct validation analysis on the TWB-for-val sample. Model calibration was assessed by comparing the model-projected absolute and relative risk estimates to the observed values in the TWB-for-val sample. The Hosmer-Lemeshow and Chi-square tests were used to judge goodness of fit for the absolute and relative risk estimates, respectively. In addition, model discrimination was assessed by the area under the curve (AUC).
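The categorization steps and the OR-to-RR approximation can be made concrete with a small sketch. All names are hypothetical, the quintiles are assigned by rank within the sample (an assumption about tie handling), and the conversion uses the standard identity RR = OR / (1 − p0 + p0 × OR), where p0 is the outcome risk in the reference group.

```python
from bisect import bisect_right

BMI_CUTS = [18.5, 24.0, 28.0]


def bmi_class(bmi):
    """0: BMI < 18.5, 1: 18.5 <= BMI < 24, 2: 24 <= BMI < 28, 3: BMI >= 28."""
    return bisect_right(BMI_CUTS, bmi)


def quintile_labels(scores):
    """Label each score 1..5 by its rank-based quintile within the sample."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // n + 1
    return labels


def odds_ratio_to_relative_risk(odds_ratio, baseline_risk):
    """Convert an OR to an RR given the risk in the reference group."""
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)


labels = quintile_labels([0.3, -1.2, 2.5, 0.0, 1.1, -0.4, 0.9, -2.0, 1.8, 0.5])
rr = odds_ratio_to_relative_risk(2.0, 0.10)
```

With a baseline risk of 10%, an OR of 2.0 corresponds to an RR of about 1.82, which shows the size of the gap that the approximation accepts.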
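Discrimination by AUC has a simple pairwise interpretation: it is the proportion of case-control pairs in which the case received the higher projected risk, with ties counted as one half. The sketch below implements that definition directly on toy values; it is not the iCARE validation routine, and the function name is hypothetical.

```python
def auc(projected_risks, outcomes):
    """Pairwise AUC: share of case-control pairs ranked correctly."""
    cases = [r for r, y in zip(projected_risks, outcomes) if y == 1]
    controls = [r for r, y in zip(projected_risks, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(cases) * len(controls))


# Toy data: two controls followed by two cases.
value = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```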
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
GWAS summary statistics of T2D, HbA1c, and Fasting Glucose have been provided to the NHGRI-EBI GWAS Catalog, and the study accession numbers are GCST90161239, GCST90161237, GCST90161236. Summary statistics were downloaded from the UK Biobank (UKBB) and the Biobank Japan Project. The Biobank Japan Project: Summary statistics of T2D, HbA1c, and Blood sugar were acquired from the Biobank Japan Project website (http://jenger.riken.jp/en/result). The UK Biobank: Summary statistics of T2D, HbA1c and Fasting Glucose were acquired from Neale’s lab website (GWAS round 2) (http://www.nealelab.is/uk-biobank). Supplementary Data 4 contains source data underlying Fig. 1a. Supplementary Data 15 contains source data underlying Fig. 4. Other data used in this study were obtained from Taiwan Biobank; these data are available on request, but we are not authorized to redistribute them. Analysis results can be shared on request by contacting the corresponding authors for reasonable use.
Code availability
No custom computer code was used in this study. We used publicly available software (URLs listed below) in this research. Genetic association analyses were performed using PLINK2 (https://www.cog-genomics.org/plink/2.0). The Mendelian randomization analyses were done using the R package MendelianRandomization (https://cran.r-project.org/web/packages/MendelianRandomization/index.html). Polygenic risk scores were calculated using PLINK (https://www.cog-genomics.org/plink/) and R (https://www.r-project.org), and absolute risk estimation was conducted with the R package iCARE (https://www.bioconductor.org/packages/release/bioc/html/iCARE.html). SNP heritability and genetic correlations were estimated using LD score regression (https://github.com/bulik/ldsc) and LD Hub (http://ldsc.broadinstitute.org/). Functional annotations were done using FUMA (https://fuma.ctglab.nl/). Other tools used were LocusZoom (https://github.com/Geeketics/LocusZoms), UpSet plot (https://github.com/hms-dbmi/UpSetR), and LiftOver (https://genome.sph.umich.edu/wiki/LiftOver).
References
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236 (2015).
Davey Smith, G. & Hemani, G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum. Mol. Genet. 23, R89–R98 (2014).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Kanai, M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 50, 390–400 (2018).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, https://doi.org/10.1126/science.aay5012 (2020).
Yang, H. C. et al. A comparison of major histocompatibility complex SNPs in Han Chinese residing in Taiwan and Caucasians. J. Biomed. Sci. 13, 489–498 (2006).
Chen, C. H. et al. Population structure of Han Chinese in the modern Taiwanese population based on 10,000 participants in the Taiwan Biobank project. Hum. Mol. Genet. 25, 5321–5331 (2016).
Sheen, Y. J. et al. Trends in prevalence and incidence of diabetes mellitus from 2005 to 2014 in Taiwan. J. Formos. Med. Assoc. 118, S66–S73 (2019).
Hsu, C. C. et al. 2019 Diabetes Atlas: Achievements and challenges in diabetes care in Taiwan. J. Formos. Med. Assoc. 118, S130–S134 (2019).
Cheng, S. W., Wang, C. Y., Chen, J. H. & Ko, Y. Healthcare costs and utilization of diabetes-related complications in Taiwan: a claims database analysis. Medicine 97, e11602 (2018).
Ali, O. Genetics of type 2 diabetes. World J. Diabetes 4, 114–123 (2013).
Li, G. Y. et al. Meta-analysis on the association of ALDH2 polymorphisms and type 2 diabetic mellitus, diabetic retinopathy. Int. J. Environ. Res. Public Health 14, https://doi.org/10.3390/ijerph14020165 (2017).
Morita, K. et al. Association between aldehyde dehydrogenase 2 polymorphisms and the incidence of diabetic retinopathy among Japanese subjects with type 2 diabetes mellitus. Cardiovascular Diabetol. 12, 132 (2013).
Groop, L. et al. Insulin resistance, hypertension and microalbuminuria in patients with type 2 (non-insulin-dependent) diabetes mellitus. Diabetologia 36, 642–647 (1993).
Gao, Z. et al. Renal impairment markers in type 2 diabetes patients with different types of hyperuricemia. J. Diabetes Investig. 10, 118–123 (2019).
Wang, N. et al. Long noncoding RNA Meg3 regulates mafa expression in mouse beta cells by inactivating Rad21, Smc3 or Sin3α. Cell Physiol. Biochem. 45, 2031–2043 (2018).
Vujkovic, M. et al. Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nat. Genet. 52, 680–691 (2020).
Spracklen, C. N. et al. Identification of type 2 diabetes loci in 433,540 East Asian individuals. Nature 582, 240–245 (2020).
Ishigaki, K. et al. Large-scale genome-wide association study in a Japanese population identifies novel susceptibility loci across different diseases. Nat. Genet. 52, 669–679 (2020).
Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
Chen, M. H. et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell 182, 1198–1213.e1114 (2020).
Wei, C. Y. et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. npj Genom. Med. 6, 10 (2021).
Dupuis, J. et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 42, 105–116 (2010).
Guo, F. et al. FTO, GCKR, CDKAL1 and CDKN2A/B gene polymorphisms and the risk of gestational diabetes mellitus: a meta-analysis. Arch. Gynecol. Obstet. 298, 705–715 (2018).
Saxena, R. et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316, 1331–1336 (2007).
de la Iglesia, N., Veiga-da-Cunha, M., Van Schaftingen, E., Guinovart, J. J. & Ferrer, J. C. Glucokinase regulatory protein is essential for the proper subcellular localisation of liver glucokinase. FEBS Lett. 456, 332–338 (1999).
Choi, J. M., Seo, M.-H., Kyeong, H.-H., Kim, E. & Kim, H.-S. Molecular basis for the role of glucokinase regulatory protein as the allosteric switch for glucokinase. Proc. Natl Acad. Sci. USA 110, 10171–10176 (2013).
Chen, J. et al. The trans-ancestral genomic architecture of glycemic traits. Nat. Genet. 53, 840–860 (2021).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Kitamura, T., Seki, N. & Kihara, A. Phytosphingosine degradation pathway includes fatty acid α-oxidation reactions in the endoplasmic reticulum. Proc. Natl Acad. Sci. USA 114, E2616–E2623 (2017).
Jenkins, B., de Schryver, E., Van Veldhoven, P. P. & Koulman, A. Peroxisomal 2-hydroxyacyl-CoA lyase is involved in endogenous biosynthesis of heptadecanoic acid. Molecules 22, https://doi.org/10.3390/molecules22101718 (2017).
Kocarnik, J. M. et al. Discovery, fine-mapping, and conditional analyses of genetic variants associated with C-reactive protein in multiethnic populations using the Metabochip in the Population Architecture using Genomics and Epidemiology (PAGE) study. Hum. Mol. Genet. 27, 2940–2953 (2018).
Pradhan, A. D., Manson, J. E., Rifai, N., Buring, J. E. & Ridker, P. M. C-reactive protein, interleukin 6, and risk of developing type 2 diabetes mellitus. J. Am. Med. Assoc. 286, 327–334 (2001).
Li, Q. et al. Relationship between serum GAD-Ab and the genetic polymorphisms of GAD2 and type 2 diabetes mellitus. Genet. Mol. Res. 14, 3002–3009 (2015).
Rorsman, P. et al. Glucose-inhibition of glucagon secretion involves activation of GABAA-receptor chloride channels. Nature 341, 233–236 (1989).
Menegaz, D. et al. Mechanism and effects of pulsatile GABA secretion from cytosolic pools in the human beta cell. Nat. Metab. 1, 1110–1126 (2019).
Rosmond, R. et al. A 5-year follow-up study of disease incidence in men with an abnormal hormone pattern. J. Intern. Med. 254, 386–390 (2003).
Zhu, Z. et al. Shared genetic and experimental links between obesity-related traits and asthma subtypes in UK Biobank. J. Allergy Clin. Immunol. 145, 537–549 (2020).
Klarin, D. et al. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat. Genet. 50, 1514–1523 (2018).
Chen, Z. et al. Cohesin-mediated NF-κB signaling limits hematopoietic stem cell self-renewal in aging and inflammation. J. Exp. Med. 216, 152–175 (2019).
Gimigliano, A. et al. Proteomic profile identifies dysregulated pathways in Cornelia de Lange syndrome cells with distinct mutations in SMC1A and SMC3 genes. J. Proteome Res. 11, 6111–6123 (2012).
Smith, S. P. et al. Enrichment analyses identify shared associations for 25 quantitative traits in over 600,000 individuals from seven diverse ancestries. Am. J. Hum. Genet. 109, 871–884 (2022).
Aldenderfer, M. S. & Blashfield, R. K. Cluster Analysis (Sage Publications, 1984).
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Battle, A. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B (Methodol.) 57, 289–300 (1995).
Jin, H., Lee, S. & Won, S. Causal evaluation of laboratory markers in type 2 diabetes on cancer and vascular diseases using various mendelian randomization tools. Front. Genet. 11, 597420 (2020).
Burgess, S., Foley, C. N., Allara, E., Staley, J. R. & Howson, J. M. M. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nat. Commun. 11, 376 (2020).
Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015).
Verbanck, M., Chen, C. Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693–698 (2018).
Burgess, S., Butterworth, A. & Thompson, S. G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 37, 658–665 (2013).
Hemani, G. et al. The MR-base platform supports systematic causal inference across the human phenome. eLife 7, https://doi.org/10.7554/eLife.34408 (2018).
Shi, J. et al. Winner’s curse correction and variable thresholding improve performance of polygenic risk modeling based on genome-wide association study summary-level data. PLoS Genet. 12, e1006493 (2016).
Pal Choudhury, P. et al. iCARE: an R package to build, validate and apply absolute risk models. PLoS ONE 15, e0228198 (2020).
Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R. & Pfister, H. UpSet: visualization of intersecting sets. IEEE Trans. Vis. Comput. Graph. 20, 1983–1992 (2014).
Acknowledgements
We acknowledge the staff at Taiwan Biobank for their hard work in collecting and distributing the data, and give special thanks to Dr. Te-Chang Lee. We also acknowledge Benjamin Neale’s team, UK Biobank, and the Biobank Japan Project for making their analysis results publicly available. This work was supported by the following grants: MOST 109-2314-B-001-006-MY2 and MOST 111-2314-B-001-008. We thank the Institute of Biomedical Sciences, Academia Sinica of Taiwan, and the National Science and Technology Council of Taiwan.
Author information
Contributions
Conceptualization: C.-J.L., T.-H.C., A.-R.H., and C.S.-J.F.; data curation: C.-J.L., J.-J.S., C.-C.C., S.-J.W., and C.-L.H.; formal analysis: C.-J.L., T.-H.C., A.M.-W.L., J.-J.S., C.-C.C., S.-J.W., and C.-L.H.; funding acquisition: C.S.-J.F.; software: C.-J.L.; supervision: A.-R.H., W.-S.Y., and C.S.-J.F.; visualization: C.-J.L., T.-H.C., A.M.-W.L., and J.-J.S.; writing – original draft: C.-J.L., T.-H.C., A.M.-W.L., A.-R.H., W.-S.Y., and C.S.-J.F.; writing – review & editing: S.-W.C.; biological interpretation: A.M.-W.L., P.-L.C., and W.-S.Y.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Chani Hodonsky and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Kaoru Ito and George Inglis. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lee, CJ., Chen, TH., Lim, A.M.W. et al. Phenome-wide analysis of Taiwan Biobank reveals novel glycemia-related loci and genetic risks for diabetes. Commun Biol 5, 1175 (2022). https://doi.org/10.1038/s42003-022-04168-0
DOI: https://doi.org/10.1038/s42003-022-04168-0