Abstract
The genetic contribution of protein-coding variants to immune-mediated diseases (IMDs) remains underexplored. Through whole exome sequencing of 40 IMDs in 350,770 UK Biobank participants, we identified 162 unique genes in 35 IMDs, among which 124 were novel. Several genes, including FLG, which is associated with atopic dermatitis and asthma, showed converging evidence from both rare and common variants. Ninety-one genes exerted significant effects on longitudinal outcomes (interquartile range of Hazard Ratio: 1.12-5.89). Mendelian randomization identified five causal genes, of which four were approved drug targets (CDSN, DDR1, LTA, and IL18BP). Proteomic analysis indicated that mutations associated with specific IMDs might also affect protein expression in other IMDs. For example, DXO (celiac disease-related gene) and PSMB9 (alopecia areata-related gene) could modulate CDSN (autoimmune hypothyroidism-, psoriasis-, asthma-, and Graves’ disease-related gene) expression. The identified genes predominantly affect immune and biochemical processes, and can be clustered into immune-related, urate metabolism, and antigen processing pathways. Our findings identify protein-coding variants that are key to IMD pathogenesis and provide new insights into tailored, innovative therapies.
Introduction
Characterized by high mortality rates1,2 and rising prevalence3,4, immune-mediated diseases (IMDs) pose a major challenge to human health, yet a significant proportion lack effective treatments5,6. Because they exhibit substantial heritability (around 50%)7, unraveling the genetic architecture of IMDs might illuminate potential therapeutic strategies. However, current genome-wide association studies (GWASs) cannot fully depict the human hereditary landscape8, particularly for rare IMDs. This limitation is accentuated because many GWAS-identified alleles fall outside protein-coding regions, complicating the identification of direct gene-disease associations. In contrast to GWAS, whole-exome sequencing (WES) focuses on protein-coding regions, potentially unmasking variants directly associated with IMDs9,10.
Identifying causal variants and their underlying biological insights can deepen our understanding of diseases and help tailor therapeutic targets. However, three gaps remain. First, several previous exome studies have revealed mutations in thrombotic thrombocytopenic purpura11, psoriasis12, and sarcoidosis (SD)13, yet they were limited by small sample sizes, which affected their statistical robustness. This has left significant gaps in the understanding of protein-coding variants, and the broader landscape of IMDs remains uncharted. Second, previous studies have primarily focused on individual variant-disease associations rather than gene-disease relationships14. Although exonic variants lie in protein-coding regions, individual mutations often exert minimal functional consequences15,16. A strategy has therefore emerged that collapses variants at the gene level, especially predicted loss-of-function (pLOF) or predicted deleterious missense (pmis) variants, to identify promising gene-disease associations17. Third, even large-scale sequencing studies, such as those on Crohn’s disease18 or osteoporosis19, could not fully capture the multifaceted pathophysiology and clinical implications of these variants. The pathophysiological characteristics of these diseases are crucial, as they can guide tailored, innovative therapies.
The UK Biobank (UKB), rich in multi-omic data, stands as an ideal platform for such endeavors and has been widely used for sequencing studies of human diseases and traits7,8,9,10,11,12. Prior investigations within the UKB, primarily through GWAS, have focused on genetic risk factors20,21, ethnic disparities22, identification of IMD therapeutic targets23,24,25, associated comorbidities26,27, and their unique or overlapping genetic landscape28. However, WES studies of IMDs in the UKB have been sparse. Some have targeted specific regions, such as the HLA region for 11 autoimmune diseases29, or have investigated asthma risk mutations among predetermined variants30. Additionally, many of these studies were constrained by their sample sizes; for instance, one identified the TET2 mutation as a risk factor for gout among only 170,000 participants31.
In this study, by leveraging the large-scale WES data of 350,770 UKB adults, our objectives are to: (1) identify putative variants across 40 IMDs at an exome-wide level; and (2) elucidate the clinical impacts and underlying biological pathways of these variants, and identify potential therapeutic targets, through time-to-event analysis, Mendelian randomization (MR) analysis, proteomic analysis, and phenome-wide association analysis (PheWAS). A detailed study design is depicted in Fig. 1.
Results
Population characteristics
We analyzed a total of 350,770 European-descent individuals (mean age, 56.9 years; female sex, 162,210 [46.2%]) from the UKB for whom both WES data and phenotype data for IMDs were available (Supplementary Data 1). A total of 20,155,842 distinct autosomal genetic variants were available from the exome sequencing data after quality control (QC), 20,056,064 of which displayed a minor allele frequency (MAF) below 0.1%. We annotated 744,255 pLOF variants and 1,437,627 pmis variants.
Exome-wide rare variant analysis of 30 IMDs identified 92 genes
We performed a gene-based, whole-exome-wide association analysis using the SKAT method for rare mutations (MAF < 1%) across 40 IMDs, adjusting for age, sex, the first 10 genomic principal components (PCs), and a sparse genetic relationship matrix. The Q-Q plots are presented in Supplementary Fig. 1. We incorporated four max-MAF cutoffs (maximum MAF: 1%, 0.1%, 0.001%, and 0.0001%) and two functional annotation sets (pLOF only, and pLOF+pmis) to identify new gene-phenotype associations32. In total, we identified 92 significant genes across 30 IMDs at an exome-wide significance threshold of P < 2.5 × 10−6 (Fig. 2A and Supplementary Data 2), with 83 being unreported. Our analysis thus recovered previously established genes and identified many novel ones. Consistent with previous research33, we found that FLG (OR: 1.01, P = 1.79 × 10−6) was associated with atopic dermatitis (AD), and we also identified a novel gene, STAT5B, with a more pronounced coefficient (1.16, P = 1.18 × 10−6). In line with a prior study34, we found that VEGFA was associated with multiple sclerosis (MS, 1.10, P = 6.04 × 10−7), and our study also uncovered two novel genes, LRRC74A (1.08, P = 4.20 × 10−7) and ZNF266 (1.08, P = 1.10 × 10−7). Most gene-disease pairs displayed deleterious associations, except IL33-asthma (0.99, P = 8.09 × 10−16) and IFIH1-psoriasis (0.98, P = 1.02 × 10−6).
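The eight collapsing masks implied above (four maximum-MAF cutoffs crossed with two annotation sets) can be enumerated as in the following minimal sketch. It is illustrative only and is not the pipeline used in the study: the variant records, the field names, and the helper `build_masks` are hypothetical, and the strict `<` comparison is an assumption. The last line shows the exome-wide threshold as 0.05 divided by 20,000 genes, which is the conventional derivation of 2.5 × 10−6 and is likewise an assumption, since the text does not state the divisor.

```python
from itertools import product

# Maximum-MAF cutoffs quoted in the text, expressed as proportions:
# 1%, 0.1%, 0.001%, and 0.0001%.
MAX_MAF_CUTOFFS = [1e-2, 1e-3, 1e-5, 1e-6]

# The two functional annotation sets quoted in the text.
ANNOTATION_SETS = {
    "pLOF": {"pLOF"},
    "pLOF+pmis": {"pLOF", "pmis"},
}


def build_masks(variants):
    """Group qualifying variant IDs by (gene, annotation set, max-MAF cutoff).

    `variants` is a list of dicts with hypothetical keys
    'id', 'gene', 'annotation', and 'maf'.
    """
    masks = {}
    for (set_name, allowed), cutoff in product(ANNOTATION_SETS.items(), MAX_MAF_CUTOFFS):
        for v in variants:
            if v["annotation"] in allowed and v["maf"] < cutoff:
                masks.setdefault((v["gene"], set_name, cutoff), []).append(v["id"])
    return masks


# Hypothetical toy variants for a single gene.
toy_variants = [
    {"id": "v1", "gene": "GENE_A", "annotation": "pLOF", "maf": 5e-4},
    {"id": "v2", "gene": "GENE_A", "annotation": "pmis", "maf": 5e-7},
    {"id": "v3", "gene": "GENE_A", "annotation": "synonymous", "maf": 1e-4},
]
toy_masks = build_masks(toy_variants)

# Assumed derivation of the exome-wide threshold: 0.05 / ~20,000 genes.
EXOME_WIDE_ALPHA = 0.05 / 20_000
```

Each resulting mask would then be passed to a gene-based test; a gene is typically reported if any of its masks passes the threshold.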
We next sought to determine whether the signals from the detected rare variants stood independently from nearby GWAS signals. Initial analysis was carried out to identify leading signals located within ±500 kilobases (kb) from the gene. Following this, we conducted rare variant association assessments by integrating the leading signals as supplementary covariates (“Methods” section). In general, the magnitude of effects and associated P-values demonstrated minimal attenuation after accounting for GWAS signals (Supplementary Data 3). Notably, the correlation of ABCG2 with gout (P = 5.39 × 10−8) and DXO with celiac disease (P = 1.40 × 10−7) displayed enhanced significance.
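The windowing step described above can be sketched as follows. This is a hedged illustration rather than the study's code: the GWAS hit records, their field names, and the function `lead_signal` are hypothetical, and only the ±500 kb window comes from the text. The returned lead variant would then be added as a covariate in the rare variant association model.

```python
WINDOW_BP = 500_000  # ±500 kb around the gene, as stated in the text


def lead_signal(chrom, gene_start, gene_end, gwas_hits, window=WINDOW_BP):
    """Return the most significant GWAS hit within the window, or None.

    `gwas_hits` is a list of dicts with hypothetical keys
    'id', 'chrom', 'pos', and 'p'.
    """
    lo, hi = gene_start - window, gene_end + window
    nearby = [
        h for h in gwas_hits
        if h["chrom"] == chrom and lo <= h["pos"] <= hi
    ]
    # The lead signal is the nearby hit with the smallest P-value.
    return min(nearby, key=lambda h: h["p"]) if nearby else None


# Hypothetical toy hits around a gene at chr4:1,000,000-1,050,000.
toy_hits = [
    {"id": "snp1", "chrom": "4", "pos": 600_000, "p": 1e-9},
    {"id": "snp2", "chrom": "4", "pos": 1_400_000, "p": 1e-12},
    {"id": "snp3", "chrom": "4", "pos": 2_000_000, "p": 1e-30},  # outside window
    {"id": "snp4", "chrom": "6", "pos": 1_000_000, "p": 1e-40},  # other chromosome
]
toy_lead = lead_signal("4", 1_000_000, 1_050_000, toy_hits)
```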
To validate the gene-based associations found in the UKB, we queried the FinnGen summary statistics reported by Kurki et al.35 (“Methods” section). Of the 30 disease phenotypes identified in the UKB, 24 were available in FinnGen, which covered 69 of the 92 identified genes. This search yielded 13 associations (19% replicated) at Bonferroni-corrected significance (P < 1.43 × 10−3; Supplementary Data 2). Notably, FLG for AD (OR: 2.09, P = 6.54 × 10−53), ETV7 for primary biliary cirrhosis (PBC, 1.98, P = 2.21 × 10−4), and ABCG2 for gout (1.69, P = 2.73 × 10−76) emerged with the most pronounced coefficients, while protective effects were also discerned for IL33-asthma (0.91, P = 4.12 × 10−25) and IFIH1-psoriasis (0.70, P = 1.81 × 10−8). The multi-ancestry results are presented in Supplementary Data 4, with 10 association pairs replicated in the Asian population (including ABCG2-gout and DXO-celiac disease) and 5 in the Black population (including FLG-AD and IFIH1-psoriasis).
To offer a clinical perspective on mutations, we classified variants into pLOF and four pmis categories based on REVEL scores (75–100, 50–75, 25–50, 0–25). Using case-control enrichment as our core metric, we derived ORs using penetrance and prevalence, favoring its straightforward computation and alignment with clinical indicators over burden tests (“Methods” section). We found a significant enrichment of mutations in IMD cases within the pLOF category, exemplified by STAT5B and CD28 (Fig. 2B). Relatively deleterious pmis mutations with a REVEL score of >50 were also notably enriched, such as LRRC74A and AJAP1 (Fig. 2B). Collectively, 80% of the observed case enrichment was attributed to pLOF or more damaging pmis variants (Fig. 2C). The distribution pattern underscores the importance of focusing on pLOF and pmis variants with a high REVEL score (Fig. 2D). Detailed results could be found in Supplementary Data 5.
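The variant categories above can be expressed as a small classifier, shown here together with a plain 2×2 case-control odds ratio. This is a sketch under stated assumptions: REVEL is treated on the 0–100 scale used in the text, lower bin edges are assumed to be inclusive, and the odds ratio is the textbook cross-product ratio, which may differ from the penetrance- and prevalence-based computation described in the Methods. The function names are hypothetical.

```python
def variant_category(annotation, revel=None):
    """Assign a variant to pLOF or one of four pmis REVEL bins (0-100 scale).

    Returns None for variants outside the categories (for example,
    synonymous variants or pmis variants without a REVEL score).
    """
    if annotation == "pLOF":
        return "pLOF"
    if annotation == "pmis" and revel is not None:
        # Assumption: lower edges are inclusive, so 75 falls in 75-100.
        if revel >= 75:
            return "pmis REVEL 75-100"
        if revel >= 50:
            return "pmis REVEL 50-75"
        if revel >= 25:
            return "pmis REVEL 25-50"
        return "pmis REVEL 0-25"
    return None


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Cross-product odds ratio from a 2x2 carrier-by-status table."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


# Hypothetical counts: 10/1,000 cases and 20/4,000 controls carry a variant.
toy_or = odds_ratio(10, 990, 20, 3980)
```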
Exome-wide common variant analysis of 20 IMDs identified 73 genes
Exome-wide common variant (MAF ≥ 1%) association analyses were conducted using PLINK v236. We identified significant hits for 20 of the 40 IMDs for common mutations under the conventional threshold (P < 5 × 10−8). A total of 73 genes (115 gene-disease associations; Fig. 3A) were identified, among which 78 gene-disease associations were novel (Supplementary Data 6). Celiac disease had the most single nucleotide polymorphisms (SNPs, N = 30), followed by psoriasis (N = 16) and asthma (N = 13). The mean values for disease-predisposing and disease-protective alleles were 1.47 and 0.70, respectively. Among the risk genes, LTA notably increased the risk of celiac disease (OR: 1.66, P = 3.30 × 10−43) and CDSN significantly elevated the risk of psoriasis (1.43, P = 1.20 × 10−145). Among the protective genes, BTN3A2 (0.80, P = 4.60 × 10−8) and DDR1 (0.79, P = 4.37 × 10−9) protected against SD and Graves’ disease, respectively. In addition to the positional annotation conducted with ANNOVAR, we combined positional mapping with eQTL and chromatin interaction mapping in FUMA to identify the genes that the variants may regulate. A total of 489 genes were mapped by positional, eQTL, or chromatin interaction mapping, and 109 gene-disease associations were consistently mapped (94.8% overlap with ANNOVAR; Supplementary Data 7 and 8).
The FinnGen dataset supported 69 of the 115 associations under the threshold of 1.43 × 10−3 (Supplementary Data 6). Of the 51 variants that could be assessed directly without neighboring substitutes, 42 (82.4%) were corroborated in the FinnGen dataset. For instance, CFB was observed to reduce the risk of celiac disease (ORUKB: 0.46, ORFinnGen: 0.59), while RNF5 heightened the risk of celiac disease (ORUKB: 3.41, ORFinnGen: 3.74). The results from the Asian population replicated 16 associations (including BTNL2-asthma), and the Black population replicated 7 associations (including RNF5-celiac disease; Supplementary Data 9).
We further ascertained the convergence between evidence from common WES variants and GWAS signals (i.e., the consistency in their identification of the same genes). Since GWAS typically implicates variants rather than specific genes, we mapped trait-linked SNPs to their candidate effector genes using refGene (“Methods” section). Of the 115 gene-disease associations that achieved exome-wide significance, 42 have been identified in previous GWASs. Before clumping, a total of 74 gene-disease associations also had significant GWAS signals. Notably, genes identified by common variants for conditions such as vitiligo, necrotizing vasculopathies (NV), lichen planus, and Graves’ disease exhibited 100% convergence with those identified by GWAS (Fig. 3B and Supplementary Data 10). This suggests that these GWAS signals might be influenced by independent WES variants (clumped by PLINK) in the proximal region.
Notably, many individual genes demonstrated associations with multiple IMDs, as depicted in Fig. 3C. In a demonstrative example, BTNL2 exhibited associations with well-established IMDs phenotypes (asthma and Behcet’s disease) and novel ones (celiac disease and ulcerative colitis [UC]). We further detected unique genetic overlap patterns across diseases. For example, TYK2 exerted protective effects on both autoimmune hypothyroidism (HPT) and psoriasis.
Genetic correlations among IMDs and their pleiotropic implications
Enlightened by the overlap of genes across IMDs revealed by exome-wide association analyses (Fig. 3C), we then systematically assessed the genetic correlations between IMDs. First, we estimated the genetic heritability of the IMDs based on rare variants using burden heritability regression (BHR)37. The contribution of the newly identified variants to heritability was estimated for 30 IMDs, with h2 ranging from 0.2% to 22% (Fig. 4A and Supplementary Data 11). Among the IMDs, in terms of heritability, Behcet’s disease (22.0%) ranked first, followed by allergic purpura (AP, 19.5%) and narcolepsy (18.7%). Ultra-rare pLOF variants accounted for the largest heritability contribution. Then, the genetic correlations across all pairwise combinations of IMDs were estimated37. A total of 395 BHR correlations between the disease pairs were identified (Fig. 4B and Supplementary Data 12), revealing the existence of widespread genetic correlations across IMDs, among which 181 were negative, and 214 were positive. The strongest association pairs were AD-gout (R = 0.82), and psoriasis-asthma (R = −0.92). These findings suggest that IMDs might share the mechanisms for their genetic liability.
Having revealed the shared genetic architecture among the genes and diseases, we next investigated the pleiotropic effects of the identified coding variants. To assess this comprehensively, we performed PheWAS across a total of 175 clinical outcomes identified by ICD-10 codes38 in the UKB. For genes identified by rare variants, we identified 78 gene-disease associations involving 10 disease classes at FDR-corrected P < 0.05 (Supplementary Data 13 and Supplementary Fig. 2a). For example, TET2 (idiopathic thrombocytopenic purpura [ITP]-related gene) was associated with Hodgkin’s lymphoma, leukemia, pneumonia, and purpura. For common variants, we identified 1820 associations involving 12 disease classes (Supplementary Data 14 and Supplementary Fig. 2b). For instance, NOTCH4, a highly heterogeneous gene and also a target for immune checkpoint inhibitor therapy39, was associated with multiple IMDs (celiac disease, Graves’ disease, HPT, psoriasis, and scleroderma).
Protein-coding variants confer substantial risks for IMDs by time-to-event analysis
To further characterize the disease relevance and clinical significance of the identified genes, we sought to quantify longitudinal disease risks for putatively pathogenic variations or annotated genes using Cox proportional hazards regression (Supplementary Data 15). Of the 92 significant genes in the rare variant analysis, 48 (52.2%) were significantly related to the risks of the corresponding diseases at FDR-corrected P < 0.05 (Supplementary Fig. 3). The median hazard ratio (HR) for risk mutations was 5.71 (interquartile range [IQR]: 3.12–10.70), and that for protective mutations was 0.75. Of the 73 common variant genes, 32 (43.8%) significantly conferred altered risks of diseases, involving 42 gene-disease associations. The corresponding median HR was 1.16 (1.09–1.34) for risk mutations and 0.96 (0.91–0.97) for protective mutations. As expected, the HRs for disease risk were generally higher for rare variants than for common variants. Of note, genetic evidence was consistent between time-to-event analyses and WES analyses. For instance, survival analysis also revealed that TET2 (ORburden: 1.17, PSKAT = 1.49 × 10−9) increased the susceptibility to ITP (HR: 12.75, P = 6.12 × 10−14).
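The FDR correction applied to these P-values can be illustrated with the following sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up method is assumed here; the function name and the toy P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values
    # stay monotone non-decreasing in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four gene-disease pairs.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
toy_significant = [q < 0.05 for q in toy_adjusted]
```

An association would be retained when its adjusted value falls below 0.05, which mirrors the "FDR-corrected P < 0.05" criterion used throughout.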
Protein expression and druggability assessment
Comparing protein expression between mutation carriers and non-carriers, we found that 12 of the 16 available proteins were significantly altered (Fig. 5A and Supplementary Data 16). For example, significantly increased expression of LTA (P = 2.20 × 10−16) in celiac disease, IL18BP (P = 3.28 × 10−10) in Crohn’s disease, and DDR1 (P = 2.20 × 10−16) in Graves’ disease was observed. We subsequently used the MR approach to investigate causality, with the IVW method as the main model (Fig. 5B and Supplementary Data 17). Of the 12 altered proteins, 5 protein-disease causal relationships were supported by MR, i.e., LTA-celiac disease (PIVW = 1.29 × 10−3), IL18BP-Crohn’s disease (PIVW = 4.04 × 10−2), DDR1-Graves’ disease (PIVW = 3.37 × 10−3), CDSN-psoriasis (PIVW = 4.02 × 10−7), and BTN3A2-SD (PIVW = 3.93 × 10−5).
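For readers unfamiliar with the IVW estimator, the following is a minimal sketch of the fixed-effect inverse-variance-weighted calculation from per-variant summary statistics. Whether the study used a fixed-effect or random-effects variant is not stated, so the fixed-effect form is an assumption; the function name and the toy effect sizes are hypothetical, and the P-value uses a two-sided normal approximation.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from summary statistics.

    Each variant contributes the Wald ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided P-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p


# Hypothetical instruments: three variants with a common ratio of 0.5.
toy_beta, toy_se, toy_p = ivw_estimate(
    beta_exposure=[0.1, 0.2, 0.4],
    beta_outcome=[0.05, 0.1, 0.2],
    se_outcome=[0.01, 0.01, 0.02],
)
```

In practice, sensitivity analyses that relax the IVW assumptions are usually reported alongside the main estimate.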
By annotating potential amino-acid alterations, we found that a significant proportion of the mutations changed amino acids to termination (stop) codons, as depicted in Fig. 5C (four protein entities: CBLB, DHX3, CIITA, and CAPN9) and Supplementary Data 18 (the remaining proteins). This suggests that these mutations are likely to alter protein expression. For genes without available protein data, we employed a proteome-wide assessment to investigate whether their mutations altered the expression of other proteins. At a statistical threshold of FDR-corrected P < 0.05, we discerned 80 associations between rare mutations and protein expression (Supplementary Fig. 4a and Supplementary Data 19), and 14,592 associations for common mutations (Supplementary Fig. 4b and Supplementary Data 20). Remarkably, genes linked to a specific IMD phenotype could modulate proteins associated with other IMDs. For instance, CHI3L1 was influenced by both PHACTR1 and ZNF311, CDSN by DXO and PSMB9, and LTA by GSTM5, PSMB9, and ENSG00000244255. Among them, PSMB9 had 22 significant associations, predominantly with immune-related proteins, especially IL10 (Supplementary Fig. 5). Among the associations between common mutations and proteins, the 25 most pronounced (down-regulating) associations involved the MICA/MICB protein, of which the association with C6orf15 was the most significant (OR: 0.21, P < 1.00 × 10−350). The shared regulatory and interaction network between different proteins (genes) suggests a high genetic overlap among these diseases, which has significant implications for drug repurposing.
Finally, we explored the druggability of the genes by querying GeneCards and its associated DrugBank, HMDB, and Tocris databases. We found that 87 of 164 (53.0%) identified genes are druggable, including 4 causal genes (LTA, DDR1, CDSN, and IL18BP) supported by MR (Supplementary Data 21 and 22). They have been widely used to develop immunosuppressants or immunomodulators, including LTA (for Etanercept), DDR1 (for Fostamatinib, Imatinib, Nilotinib, and Pralsetinib), and CDSN (for Carboplatin and Gemcitabine).
Biological insights into the identified gene-disease associations
To refine our understanding of how genetic variations confer risk for IMDs, we performed a series of bioinformatic analyses. First, we delved into the linkage with a series of biological indicators via PheWAS. The identified genes were significantly related with a range of biochemistry (especially, cholesterol-related indexes), inflammatory (white blood cell and neutrophil), spirometry (FEV1-related indexes), brain MRI (insula and posterior cingulate), and cognition (fluid intelligence and pair matching) traits (Fig. 6A, B and Supplementary Data 23 and 24). For instance, LTA was associated with a list of inflammatory (white blood cell, lymphocyte, neutrophil, and monocyte counts), and lipid-related (total cholesterol, triglycerides, and LDL cholesterol) indicators, highlighting the biological relevance with clinical celiac disease. Next, we adopted protein-protein interaction (PPI) analysis to construct an interaction network containing 164 genes and their disease-associated pathway clusters (Fig. 6C, Supplementary Fig. 6, and Supplementary Data 25). Generally, three cluster enrichment was observed under the FDR P < 0.05, basically covering the major functional pathways involved in the IMDs, supporting the biological validity of the genetic associations. Specifically, cluster 1 mainly involves immune-related pathways, cluster 2 is related to the urate metabolic process, and cluster 3 was of antigen processing and presentation. Besides focusing on the specific biological pathways, the target tissue and cell types should also be noted for the precision treatment of IMDs. We next characterized the specific cell types where the identified genes showed altered expression using single-cell RNA sequencing (scRNA-seq) data, which revealed diverse expression levels across varying cell and tissue types for a multitude of genes (Fig. 6D and Supplementary Fig. 7). 
For instance, LRP1 (asthma-related gene) is mainly expressed on neutrophils in the blood, but mainly expressed on macrophages and fibroblasts in the bladder. TET2 (ITP-related gene) is widely expressed on various inflammatory cells (neutrophils, monocytes, and NK cells) in the blood, but is mainly expressed on macrophages in the bladder.
Discussion
Here, we conducted a large-scale and comprehensive WES study of protein-coding variants for IMDs from 350,770 UKB individuals. We implicated 162 significant risk genes for 35 IMDs across the protein-coding allelic frequency spectrum, among which 124 were novel. Several genes showed independent convergence of rare and common variants evidence, reinforcing their vital value in the corresponding diseases. Longitudinal time-to-event analysis revealed that most of the observed genes (52.2%) significantly influenced disease onset at the population level. MR analysis further provided the causal evidence of 5 gene-disease associations, among which 4 were approved drug targets. By focusing on coding variants, we found mutations for specific IMDs not only significantly affected the protein expression in these IMDs, but also influenced the protein expression in other IMDs, revealing the shared network of mutual regulatory mechanisms across IMDs. Functional annotation provided biological insights into the gene-phenotype linkages by demonstrating the involvement of immune and metabolic pathways in these IMDs.
Traditional GWAS signals are not driven by loss-of-function variants40. Protein-coding variants, by contrast, have demonstrably greater translational potential, given that their functional impacts can be interpreted41 and their effects on disease pathogenesis are clearer42. Our WES study is the most comprehensive to date in identifying protein-coding genetic mutations across 40 IMDs. First, the congruence between our findings and previous research underscores the rigor and robustness of our study. Thirty-eight of the 164 genes identified through protein-coding variants have been reported in previous studies. For instance, we replicated the finding that IFIH1, a gene that plays an important role in type I interferon production and signaling, is associated with psoriasis43. The proteome analysis further revealed an association between IFIH1 and lower blood levels of the protein IFNL1. IFNL1 was recently reported to be involved in the expression and production of all IFN-λ44, and might contribute to the pathogenesis of psoriasis by regulating the biological potential of IFN-λ signaling45. SLC22A11 was previously reported to be associated with the risk of gout46. Through collapsing analysis combined with conditional analysis adjusting for nearby common variants, we replicated this gene for the risk of gout, a finding that was also supported by the FinnGen cohort. Our PheWAS analysis further revealed associations between SLC22A11 and the blood levels of uric acid and cystatin C. Second, we deciphered 124 novel putative pathogenic genes spanning 35 IMDs. TET2 was a novel gene for ITP identified through protein-coding variants, and the association was robust in longitudinal analyses. Our proteome analysis revealed an association between TET2 and lower blood levels of heme oxygenase 1 (HMOX1), which is involved in erythroid differentiation. We further found leukemia, purpura, and a series of blood-related parameters to be associated with this gene. Together, these analyses suggest that TET2, which has previously been targeted by ascorbic acid, might be a novel drug target for ITP.
Among gene-level signals for which an individual variant also achieved significance, we highlighted the examples where both gene-based and single-variant-based genes contributed to disease burden. For instance, FLG was found in both gene-based (OR = 1.01) and single-variant-based (OR = 1.22) analyses for asthma. Time-to-event analysis revealed that FLG was correlated with a higher risk of asthma, with HRs ranging from 1.06 to 1.12. The consistency of exome-wide gene-based and single-variant-based associations, as well as the longitudinal disease risk at the population level, which also translates to an independent exome-sequenced cohort (FinnGen), increases the confidence that our gene findings are indeed biologically relevant. It is also worth mentioning that the majority of rare mutations are disease-specific (91 of 93 gene-disease associations), while many common mutations are shared across IMDs (around 25%). Previous studies have also revealed the overlap of common mutations, for example, inflammatory bowel disease (IBD) shares their genetics not only with dermatological immune-mediated disorders but also with autoimmune endocrine disorders such as Type I Diabetes mellitus (T1D)47. Despite the shared genetic overlap among the IMDs, we further ascertained that many shared genes exhibit opposite effects. For example, mutations in CFB potentially confer protection against celiac disease but increase susceptibility to ankylosing spondylitis (AS) and UC. Previous research also observed such discordances (a shared locus for which the same haplotype increases risk for one disease but is protective for another), with a discordance rate of 14% between IBD, AS, and celiac disease in a sample of 416 instances48. The specificity between the IMDs was also reflected in that many genes appear to be more restricted to particular cellular contexts, which is consistent with one previous report49. 
For example, there is an over-representation of celiac disease loci expressed selectively in monocytes50; asthma-associated loci are preferentially expressed in CD4+ T cells51. Collectively, these empirical observations underscore the inherent heterogeneity concerning the influence of genetic mutations on IMDs phenotypic expressions. Future research should also focus on these disease-specific factors, as they may yield important clues to the disease mechanisms and may provide avenues for prevention and treatment.
The development of biological therapies targeting specific inflammatory proteins has transformed the clinical management of IMDs52. First, we pinpointed that the expression level of several genes significantly altered between the normal population and patients with the specific IMDs, i.e., decreased LTA protein levels in patients of celiac disease, highlighting the protein-coding functions of these genes and their clinical readabilities. Next, our proteomic analysis showed that LTA could significantly affect the blood levels of two cancer-related proteins-MUC1653 and KLK1154. Previous studies have pointed out the possible associations between celiac disease and neoplasms, especially malignancies55. The above evidence reinforced the involvement of LTA in celiac disease. Furthermore, evidence from our analyses yielded clues to a gene’s biological mechanisms and relevance to diseases. Our PheWAS showed that LTA was associated with a list of inflammatory (white blood cell, lymphocyte, neutrophil, and monocyte counts), and lipid-related (total cholesterol, triglycerides, and LDL cholesterol) indicators, suggesting that the biological relevance of the gene is consistent with the well-established genetic relationships of clinical phenotypes. Monitoring clinical mutation-related biomarkers might help differentiate etiologies and guide individualized treatments9, as well as contribute to genetic therapy. However, more importantly, understanding whether the proteins are drivers of disease is of vital importance for the development of treatments23. To this end, we used MR to evaluate the causal contributions of proteins to different IMDs. MR revealed four proteins including LTA, which is highlighted above, and it has already been approved for developing immunomodulators (Abacavir, Etanercept, and Carbamazepine), which mainly participates in the regulation of cellular apoptosis critical in the development of IMDs56. 
It is interesting and meaningful to explore whether there are activators targeting LTA that might be developed for the treatment of specific IMDs. CDSN is another causal gene supported by the MR evidence. It is a gene of considerable heterogeneity, located in the major histocompatibility complex (MHC) class I region on chromosome 6, and encodes a protein found in corneodesmosomes57. It is highly polymorphic in the human population, and in the present study its variation is associated with T1D, polymyalgia rheumatica, psoriatic/enteropathic arthropathies, and UC. The proteomic analysis demonstrated that DXO (celiac disease- and T1D-related gene) and PSMB9 (alopecia areata-related gene) significantly altered the blood expression of CDSN, hinting at a network of mutual regulatory mechanisms among these autoimmune-related genes. Moreover, CDSN has been used in the development of immunosuppressants (Carboplatin and Gemcitabine), which supports the utility of this approach and highlights new potential therapeutic targets.
The major strength of this study lies not only in the systematic investigation of protein-coding variants across 40 IMDs, but also in the identification of a list of novel genes and the exploration of their biological relevance. By virtue of focusing on coding variants, the observed associations more often provide a direct causal link between variants in a gene and a specific IMD58, with implications for identifying or validating drug targets. However, the results should be viewed in light of several limitations. First, the participants were mainly of European descent, which makes generalization to other ethnicities challenging. Although the sub-population analyses in Black and Asian participants partly supported our findings and provided additional gene labels, larger cohorts of other ancestral groups would be valuable and could provide further information. Second, the sample sizes of some specific diseases were too small, resulting in insufficient statistical power, which may have caused genes to be missed. As shown in Supplementary Fig. 1, the actual P-values were lower than the expected P-values for some IMDs. Third, the UK Biobank population may reflect volunteer bias and survivor bias, with a sample of healthier individuals than the general UK population59, and may therefore show a lower frequency of putative pathogenic variants and lower penetrance. Fourth, the observed effect sizes for genes identified through collapsing analysis may be lower than those reported in prior research. This discrepancy can be attributed to SAIGE’s methodological rigor in correcting inflated type I error rates in binary traits with imbalanced case-control ratios and rare variants, thereby yielding more precise and biologically credible beta estimates16,17,32,60. Consequently, compared to earlier approaches, SAIGE’s estimated effect sizes are less susceptible to inflation. Last, FinnGen was not a true replication cohort.
We acknowledge that this analysis was framed not as replication but as supporting evidence for our results. Altogether, this study comprehensively examines the contribution of protein-coding variants to the genetic architecture of a series of complex IMDs, demonstrating the potential for gene-based analyses in large sequenced biobank cohorts. Investigations of the functional relevance of protein-coding variants might improve our understanding of the pathobiology of IMDs, lead to improved identification of affected pathways, and eventually pave the way for the development of personalized therapies.
Methods
Study participants and disease phenotypes
The UKB is a population-based study that enrolled more than 500,000 participants aged 40–69 years at recruitment across the UK. In-depth phenotypic, health-related, genetic, and proteomic data were collected from a baseline assessment and subsequent follow-up visits. UKB has obtained ethics approval from the Research Ethics Committee (REC; approval number: 06/MRE08/65) and informed consent from all participants. For this study, we included a total of 350,770 participants with available WES and clinical data after quality control (QC) under project application number 19542.
The IMD phenotypes were ascertained and classified according to the in- and out-patient International Classification of Disease, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) codes61 and Read codes62,63,64,65,66. Data sources included the UKB first occurrences of health outcomes (Category 1712), hospital admission data (Field ID 41270, 41271), and primary care data (Category 3000). The diagnosis codes for IMDs are provided in Supplementary Data 26. We defined participants with a specific IMD as cases and all others as controls. Case status was determined at the last follow-up.
Whole-exome sequencing and quality control
WES data were processed using the Regeneron Genetics Center (RGC) SPB pipeline14. Briefly, genomic DNA was extracted from blood samples, transferred to the RGC, and stored in an automated sample biobank at −80 °C. Sequencing was performed with dual-indexed 75 × 75-bp paired-end reads using the IDT xGen v1 capture kit on the Illumina NovaSeq 6000 platform.
The OQFE WES pVCF files provided by the UKB (https://biobank.ctsu.ox.ac.uk/showcase/label.cgi?id=170), which were aligned to the human reference genome GRCh3867, were used. In addition to the standard QC that was performed centrally, we performed additional genotype-, variant- and sample-based QC procedures to ensure a high-quality dataset for analyses16 (Supplementary Methods). In brief, we first refined the preliminary genotype calls in the pVCF files using Hail. Multi-allelic sites were split into distinct bi-allelic representations. Any calls failing to meet the hard filtering criteria were removed. We then performed variant-based QC by excluding variants that had a call rate of less than 90%, deviated from Hardy–Weinberg equilibrium (P < 1 × 10−15), or were located in regions of low complexity. Finally, at the sample level, we excluded participants who had withdrawn their consent, duplicated samples, samples with a mismatch between genetically inferred sex and self-reported gender, and samples whose Ti/Tv, Het/Hom, SNV/indel, or singleton count values deviated from the mean by more than 8 standard deviations (SDs).
The relatedness between samples was defined using the kinship coefficient computed by KING software. A kinship coefficient threshold of 0.0884, which corresponds to second-degree relatedness, was used to define unrelated samples. We restricted our main analysis to the European-descent genetic ethnic group (field ID 22006) and used high-quality variants to compute the ancestry PCs. Detailed methods are given in the Supplementary Methods.
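The variant- and sample-level filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study (which relied on Hail and PLINK); the function names, argument names, and constants are hypothetical, and only the thresholds (call rate 90%, HWE P < 1 × 10−15, mean ± 8 SDs) come from the text.

```python
from statistics import mean, stdev

# Thresholds taken from the text; the names are illustrative only.
CALL_RATE_MIN = 0.90
HWE_P_MIN = 1e-15
N_SD = 8.0


def keep_variant(call_rate, hwe_p, in_low_complexity):
    """Return True if a variant passes the three variant-level filters."""
    return (
        call_rate >= CALL_RATE_MIN
        and hwe_p >= HWE_P_MIN
        and not in_low_complexity
    )


def outlier_indices(values, n_sd=N_SD):
    """Indices of samples whose metric lies outside mean +/- n_sd * SD.

    `values` holds one per-sample metric (e.g. Ti/Tv or singleton count).
    """
    mu = mean(values)
    sd = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sd * sd]
```

In practice the outlier filter would be applied once per metric (Ti/Tv, Het/Hom, SNV/indel, singleton count) and the flagged samples pooled before exclusion.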
Variant annotation
Rare variants (MAF < 1%) were annotated using SnpEff68 v5.1 against Ensembl Build 38. Gene regions were defined using Ensembl Release. Predicted loss-of-function (pLOF) variants were those annotated as frameshift insertion/deletion, splice acceptor, splice donor, stop gain, start loss, or stop loss. Predicted deleterious missense (pmis) variants were defined as those consistently predicted to be deleterious by five in silico prediction tools: SIFT69, LRT70, PolyPhen2 HDIV and PolyPhen2 HVAR71, and MutationTaster72. Common variants (MAF ≥ 1%) were annotated with ANNOVAR73 using refGene (https://annovar.openbioinformatics.org/en/latest/user-guide/download/) as a reference panel. We also combined positional mapping with eQTL and chromatin interaction mapping in Functional Mapping and Annotation of Genetic Associations (FUMA; https://fuma.ctglab.nl/) to find the genes that these variants regulate. Variants annotated as exonic were included in the subsequent analyses.
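The pLOF and pmis definitions amount to a simple classification rule, sketched below. The consequence labels, predictor keys, and the "deleterious" string are hypothetical placeholders rather than the actual SnpEff or predictor output vocabulary; the rule itself (any listed pLOF consequence, or a missense variant called deleterious by all five tools) follows the text.

```python
# Hypothetical labels; real annotation output uses tool-specific vocabularies.
PLOF_CONSEQUENCES = {
    "frameshift",
    "splice_acceptor",
    "splice_donor",
    "stop_gained",
    "start_lost",
    "stop_lost",
}
PREDICTORS = ("SIFT", "LRT", "PolyPhen2_HDIV", "PolyPhen2_HVAR", "MutationTaster")


def classify_variant(consequence, predictions):
    """Return 'pLOF', 'pmis', or 'other' for one rare variant.

    `predictions` maps each predictor name to its call for this variant.
    A missense variant is pmis only if all five predictors agree.
    """
    if consequence in PLOF_CONSEQUENCES:
        return "pLOF"
    if consequence == "missense" and all(
        predictions.get(tool) == "deleterious" for tool in PREDICTORS
    ):
        return "pmis"
    return "other"
```

Requiring unanimity across predictors is conservative: a single missing or discordant call sends the variant to "other".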
Rare variant burden analysis
Gene-level collapsing analysis was performed for rare variants. It encompassed 8 variant criteria, which varied in MAF (<1 × 10−5, <1 × 10−4, <0.001, and <0.01) and predicted consequences (pLOF and a composite of pLOF and pmis). To identify genes relevant to IMDs, a generalized mixed-effects model was used. All association analyses incorporated sex, age, and the first 10 PCs as fixed-effect covariates to reduce confounding and mitigate potential population stratification. The sparse genetic relationship matrix, which was used for the random effect variance ratio estimation in the model, was constructed using high-quality variants with the recommended relatedness coefficient cutoff of 0.05. For each collapsed association, effect sizes and the associated P-values were obtained from the SKAT74, burden75, and SKAT-O76 tests implemented in SAIGE-GENE+ v1.1.6.277, with SKAT reported in the main article. The conventional threshold for collapsing analysis (PSKAT < 2.5 × 10−6) was applied to determine significant genes. Additionally, this analysis was extended to the Asian and Black populations (Field ID: 21000).
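The 8 variant criteria are the cross product of four MAF cutoffs and two consequence sets. A minimal sketch of how such masks could be enumerated and applied is given below; the dictionary keys and function names are hypothetical, and the actual tests were run in SAIGE-GENE+, not with this code.

```python
from itertools import product

MAF_CUTOFFS = (1e-5, 1e-4, 1e-3, 1e-2)
CONSEQUENCE_SETS = {
    "pLOF": {"pLOF"},
    "pLOF+pmis": {"pLOF", "pmis"},
}
GENE_P_THRESHOLD = 2.5e-6  # threshold quoted in the text


def build_masks():
    """All (consequence set, MAF cutoff) combinations: 2 x 4 = 8 masks."""
    return list(product(CONSEQUENCE_SETS, MAF_CUTOFFS))


def variants_in_mask(variants, mask):
    """Select the variants of one gene that qualify for a given mask.

    Each variant is a dict with hypothetical keys 'maf' and 'class'.
    """
    name, maf_cutoff = mask
    allowed = CONSEQUENCE_SETS[name]
    return [v for v in variants if v["maf"] < maf_cutoff and v["class"] in allowed]


def is_significant(p_skat):
    """Apply the gene-level significance threshold."""
    return p_skat < GENE_P_THRESHOLD
```

Because the MAF cutoffs are nested, every variant in a stricter mask also appears in each looser mask of the same consequence set.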
Case-control enrichment of rare variants across consequence categories
To offer a more direct, clinically relevant perspective on mutations, we stratified variants into pLOF and four discrete pmis classifications using the variant effect predictor software78. Case-control enrichment was central to our approach because it shows more clearly which mutation types are of greatest clinical relevance. Instead of employing burden tests via SAIGE, we derived the odds ratios (ORs) from both penetrance and prevalence. Compared with traditional burden tests, this approach is easier to compute, which enhances its clinical practicality.
The pLOF was delineated as per the prior analysis. The pmis variants were characterized using the Rare Exome Variant Ensemble Learner (REVEL) score79. We established five deleteriousness thresholds by sequentially categorizing variants based on decreasing levels of predicted deleteriousness: pLOF, REVEL ≥ 75, 75 ≥ REVEL > 50, 50 ≥ REVEL > 25, and 25 ≥ REVEL > 0. Based on established methods, we computed the case-control enrichment across consequence categories37. Let NCase be the number of cases and NMcase be the number of minor alleles in cases. The MAF of variants in cases (AFCase) was calculated by:
$$AF_{Case} = \frac{N_{Mcase}}{2 \times N_{Case}} \tag{1}$$
Let NCtrl be the number of controls and NMctrl be the number of minor alleles in controls. The MAF of variants in the total population (AF) was calculated by:
$$AF = \frac{N_{Mcase} + N_{Mctrl}}{2 \times (N_{Case} + N_{Ctrl})} \tag{2}$$
We calculated the per-sd coefficients using AF, AFCase, and prevalence:
$$\beta_{per\text{-}sd} = \frac{2 \times prevalence \times (AF_{Case} - AF)}{\sqrt{2 \times AF \times (1 - AF) \times prevalence \times (1 - prevalence)}} \tag{3}$$
where prevalence equals NCase/(NCase + NCtrl).
Variant variances were calculated by:
$$variance = 2 \times AF \times (1 - AF) \tag{4}$$
Finally, by combining (3) and (4), we converted beta per-sd (i.e., sqrt(variance explained)) to per-allele (i.e., in units of phenotypes):
$$\beta_{per\text{-}allele} = \frac{\beta_{per\text{-}sd}}{\sqrt{variance}} \tag{5}$$
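The five steps can be sketched in Python. The formulas below are an assumption: they follow the case-control conversion commonly used to prepare input for burden heritability regression37 (allele frequencies from allele counts over twice the sample size, a per-sd effect scaled by prevalence, and division by the square root of 2 × AF × (1 − AF)), and the function and argument names are hypothetical.

```python
from math import sqrt


def per_allele_beta(n_case, n_ctrl, nm_case, nm_ctrl):
    """Per-allele effect size from case and control minor allele counts.

    Assumed formulation (see lead-in); not verified against the authors' code.
    """
    af_case = nm_case / (2 * n_case)                        # step 1: MAF in cases
    af = (nm_case + nm_ctrl) / (2 * (n_case + n_ctrl))      # step 2: MAF overall
    prevalence = n_case / (n_case + n_ctrl)
    beta_per_sd = (2 * prevalence * (af_case - af)) / sqrt(
        2 * af * (1 - af) * prevalence * (1 - prevalence)
    )                                                       # step 3: per-sd effect
    variance = 2 * af * (1 - af)                            # step 4: variant variance
    return beta_per_sd / sqrt(variance)                     # step 5: per-allele effect


# Example: 1,000 cases and 9,000 controls; the allele is twice as frequent in
# cases (2%) as in the pooled sample would suggest, so the effect is positive.
example = per_allele_beta(n_case=1000, n_ctrl=9000, nm_case=40, nm_ctrl=180)
```

When the allele frequency in cases equals the overall frequency, the numerator of the per-sd effect vanishes and the per-allele effect is zero, which is a quick sanity check on any implementation.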
Single association analysis for common variants
For exonic SNPs with an MAF of 1% or higher, single-variant association analysis was conducted in the unrelated European-descent cohort using PLINK236. Lead SNPs were defined as independent significant SNPs with r2 ≤ 0.1 within a 1 Mb window. Age, sex, and the first ten PCs were included as covariates. The significance threshold was set at the conventional threshold for GWAS (P < 5 × 10−8).
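One common way to obtain independent lead SNPs under these criteria is greedy clumping: rank significant SNPs by P-value and keep a SNP only if it is not in linkage disequilibrium (r2 > 0.1) with an already selected lead within the window. The sketch below illustrates that idea under stated assumptions; the data layout, the `r2` lookup callable, and the function name are hypothetical, and the study itself used PLINK2 rather than this code.

```python
def select_lead_snps(snps, r2, p_threshold=5e-8, r2_max=0.1, window_bp=1_000_000):
    """Greedy selection of independent lead SNPs.

    `snps` is a list of dicts with hypothetical keys 'id', 'chrom', 'pos', 'p'.
    `r2` is a callable returning the LD r2 between two SNP ids.
    """
    candidates = sorted(
        (s for s in snps if s["p"] < p_threshold), key=lambda s: s["p"]
    )
    leads = []
    for snp in candidates:
        linked = any(
            lead["chrom"] == snp["chrom"]
            and abs(lead["pos"] - snp["pos"]) <= window_bp
            and r2(lead["id"], snp["id"]) > r2_max
            for lead in leads
        )
        if not linked:
            leads.append(snp)
    return leads
```

Processing SNPs in order of increasing P-value guarantees that the strongest signal in each LD block is the one retained.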
External replication in FinnGen
To validate the identified associations in an independent cohort, we used summary statistics from the FinnGen Consortium online results (version 8)35. The FinnGen study combines genotype data with national health registry data of Finnish citizens. The summary statistics are publicly available online (see “Data availability” section). Genotype-, sample- and variant-wise QC and filtering procedures can be found in previous studies. For the validation of rare mutations, coefficients and P-values were first calculated, and gene-disease associations were then obtained by selecting the strongest association (i.e., lowest P-value) per gene. For common mutations, we obtained the coefficients and P-values for the corresponding variant if available. If the variant had been filtered in QC or was not imputed, we used variants within 50 kb as a substitute. Associations passing a Bonferroni-corrected threshold of P < 1.43 × 10−3 (0.05/35 IMDs) were considered supported. The precise diagnosis codes, along with the case and control distribution for each phenotype, are given in Supplementary Data 27.
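The per-gene selection and the Bonferroni threshold can be illustrated with a short sketch. The record layout and function names are hypothetical; only the rule (lowest P-value per gene, then P < 0.05/35) is taken from the text.

```python
N_IMDS = 35
BONFERRONI_ALPHA = 0.05 / N_IMDS  # approximately 1.43e-3


def best_association_per_gene(records):
    """Keep the record with the lowest P-value for each gene.

    Each record is a dict with hypothetical keys 'gene' and 'p'.
    """
    best = {}
    for rec in records:
        gene = rec["gene"]
        if gene not in best or rec["p"] < best[gene]["p"]:
            best[gene] = rec
    return best


def is_supported(p_value):
    """True if an association passes the Bonferroni-corrected threshold."""
    return p_value < BONFERRONI_ALPHA
```

Taking the minimum P-value per gene is itself a form of selection, which is one reason such results are better read as supporting evidence than as formal replication.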
Genome-wide association study
Variants that did not deviate from HWE (P > 1 × 10−12) and that had a per-variant missing rate below 1% and an imputation quality score >0.8 were selected. We carried out linear mixed model association analyses, adjusted for the genotyping array and 10 ancestry PCs, to assess the associations between the traits and imputed genotype dosages under an additive genetic model using BOLT-LMM v2.3. Details are given in the Supplementary Methods.
Heritability and proportion of variance-explained estimates
Burden heritability regression (BHR)37 was applied to estimate the heritability explained by the rare variants using the R package BHR v0.1.0 (accessible at: https://github.com/ajaynadig/bhr). BHR regresses gene burden statistics on gene burden scores and estimates the burden heritability from the regression slope. We computed per-allele effect sizes from case and control allele counts as the input to BHR, as recommended by the previous publication37. Our investigation focused on pLOF and pmis variants. The univariate burden heritability was computed separately across four distinct MAF intervals, namely (0, 10−5), [10−5, 10−4), [10−4, 10−3), and [10−3, 0.01), and then aggregated into the total heritability. Bivariate BHR was further performed to calculate the genetic correlations between phenotypes.
Associations with clinical outcomes by Cox analysis
For rare mutations, we defined individuals carrying pLOF or deleterious pmis mutations in the identified gene as cases and all others as controls. For common mutations, the analysis was stratified by the number of identified mutations (0, 1, or 2) that a subject carried. We tested the identified likely-causal variants for associations with the corresponding disease outcomes. Time zero corresponded to the birth year of each individual, and follow-up time was calculated as the years from birth to the date of first diagnosis, death, or the final date with accessible information from hospital admission, whichever came first. After FDR correction, P < 0.05 was considered significant.
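The follow-up definition and the multiple-testing step can be sketched as follows. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment shown here is an assumption, as are the function names and the use of exact dates (the text anchors time zero to the birth year). Fitting the Cox model itself is left to a survival analysis package.

```python
from datetime import date


def follow_up_years(birth, diagnosis, death, censor):
    """Years from birth to the earliest of diagnosis, death, or end of records.

    `diagnosis` and `death` may be None. Returns (years, event), where
    `event` is True when follow-up ended with the diagnosis.
    """
    end = min(d for d in (diagnosis, death, censor) if d is not None)
    event = diagnosis is not None and end == diagnosis
    return (end - birth).days / 365.25, event


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An association would then be called significant when its adjusted P-value falls below 0.05.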
Gene expression analysis
T-tests were used to compare protein expression between mutation carriers and non-carriers. Mutation carriers were defined as individuals carrying the specific identified variants in the common variant spectrum or carrying any of the pLOF or deleterious pmis variants in the rare variant spectrum. A density plot was drawn to illustrate the differential expression. Note that the UKB provided data for only 1900 proteins; as such, only 16 genes had corresponding protein expression data. The population was restricted to the European-descent ethnic group, and protein expression levels more than 5 SDs from the mean were excluded.
Mendelian randomization analysis
We explored the potential causal relationships between protein expression alterations and the corresponding IMDs via a two-sample MR approach. This method uses genetic variants as instrumental variables to assess causal links between exposures and outcomes. Adjusting for age, sex, and the first 10 PCs of ancestry, we ran the GWAS (Supplementary Methods) in PLINK2. For GWAS of IMDs, we used summary statistics from the FinnGen cohort to avoid sample overlap.
We employed several analytical models, namely inverse-variance weighted (IVW), weighted median, simple median, and weighted mode, all implemented in the R package TwoSampleMR (https://mrcieu.github.io/TwoSampleMR/). We selected genetic instruments with P-values less than 1 × 10−5 and subsequently performed Steiger filtering to ensure that the instrumental variables were genuinely associated with the exposure. We clumped SNPs by excluding those with r2 > 0.01 within a 5000 kb window to avoid linkage disequilibrium. Cochran’s Q-statistic was used to evaluate heterogeneity in the IVW estimates. Radial MR and leave-one-out analyses were used to remove SNPs with significant heterogeneity until the Q-statistic indicated none. We further adopted MR-Egger to assess potential pleiotropy, and it detected no sign of pleiotropy in our MR. Since neither heterogeneity nor pleiotropy was observed, we prioritized the results from the IVW method.
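For intuition, the IVW estimate and Cochran’s Q can be written out directly from per-SNP summary statistics. The sketch below uses the fixed-effect formulation with first-order weights (each SNP’s ratio estimate weighted by beta_x² / se_y²); this is an illustrative assumption and may differ in detail from the TwoSampleMR implementation used in the study. All names are hypothetical.

```python
def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate, its standard error, and Cochran's Q.

    beta_x: SNP-exposure effects; beta_y: SNP-outcome effects;
    se_y: standard errors of the SNP-outcome effects.
    """
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]      # Wald ratios
    weights = [bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y)]
    w_sum = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / w_sum
    se = (1.0 / w_sum) ** 0.5
    # Q measures dispersion of the per-SNP ratios around the pooled estimate.
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    return beta, se, q
```

Under the null of no heterogeneity, Q is compared with a chi-squared distribution with (number of SNPs − 1) degrees of freedom; a large Q suggests that some instruments give inconsistent causal estimates.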
Amino-acid signals and Proteomic-wide analysis
We annotated the amino-acid alterations with ANNOVAR and delineated the structural framework and domains based on the SMART database. We subsequently conducted a proteomic-wide analysis to assess the associations between the putative mutations and the proteomic data using linear regression. The analysis was restricted to participants of European descent, and observations more than 5 SDs from the mean were omitted. Age, sex, and the first 10 PCs were included as covariates. An FDR-corrected P-value of less than 0.05 was used as the criterion for statistical significance.
Phenome-wide association study
A PheWAS approach was used to determine the clinical measures and comorbidities associated with the identified rare/common genes and loci. We included a total of 153 biochemistry, inflammatory, spirometry, cognition, and MRI (brain and heart) traits (Supplementary Data 28). We also assessed the pleiotropic effects of IMDs-associated genes on a total of 187 clinical outcomes identified by ICD-10 codes38 (Supplementary Data 29). Biological indicators or disease phenotypes were tested for associations with the identified genes or single variants. The regressions (SAIGE for rare mutations and logistic/linear regressions for common mutations) were adjusted for age, sex, and the first 10 PCs. The results were deemed significant at an FDR level of 5%.
Protein-protein interaction analysis
The Search Tool for the Retrieval of Interacting Genes database (STRING) (Version 10.0, http://string-db.org) was used to predict the relationships among the screened genes. Based on experimental data, database entries, and co-expression, PPI node pairs with a combined score > 0.4 (medium confidence) were considered significant. A machine learning method (K-means) was used to group the genes into clusters.
Functional enrichment analysis
To identify the biological pathways underlying the identified genes, we employed Multi-marker Analysis of GenoMic Annotation (MAGMA)80 within the FUMA platform81 to query three of the most robust biological annotation and pathway compendiums: Gene Ontology (GO)80, Kyoto Encyclopedia of Genes and Genomes (KEGG)82, and Reactome83. The top ten significantly enriched pathways of the total gene set and the significantly enriched pathways of each PPI cluster were displayed. FDR-adjusted P < 0.05 was considered statistically significant.
Single-cell expression
A large single-cell RNA sequencing dataset by Tabula Sapiens Consortium84 was obtained to identify the expression level of the genes in each cell type. The Tabula Sapiens was generated with data from 15 human donors, comprising ~500,000 cells from 24 different organs or tissues84. The clustered and annotated scRNA-seq datasets of blood, bone marrow, bladder, and kidney were obtained, and the R package Seurat was used for analysis and visualization85.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The individual-level data used in this study were accessed from the UKB database under accession code 19542 (https://www.ukbiobank.ac.uk/). The exome sequencing data are available under restricted access owing to the data restriction policy of the UKB cohort, but access can be obtained by following the application instructions at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. FinnGen summary statistics are publicly accessible (http://r8.FinnGen.fi). The raw whole-exome sequencing data are protected and are not available due to data privacy laws. The processed whole-exome sequencing data are available at https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=170. The gene-level and single-variant association summary statistics of the exome association study have been made accessible through https://doi.org/10.6084/m9.figshare.25420873.
Code availability
The open-source R package SAIGE-GENE+ v1.1.6.2 was used to run gene-based collapsing tests for rare variants, and the code is available from GitHub (https://github.com/saigegit/SAIGE). PLINK v1.9 (https://www.cog-genomics.org/plink/1.9/) and v2.0 (https://www.cog-genomics.org/plink/2.0/) were used to perform association tests for common variants, variant quality control, and sample quality control. Hail was used to perform genotype quality control (https://hail.is/). KING 2.3.1 was used to identify duplicated samples (https://www.kingrelatedness.com/). Rare variants were annotated by SnpEff v5.1 (https://pcingola.github.io/SnpEff/se_introduction/) and Ensembl variant effect predictor v101.0 (https://www.ensembl.org/info/docs/tools/vep/), and common variants were annotated by ANNOVAR. The burden heritability regression and genetic associations were performed using BHR v0.1.0 (https://github.com/ajaynadig/bhr). Protein-protein interactions and pathway enrichment were performed by STRING v12.0 (https://cn.string-db.org/), and function annotation was performed by FUMA. The code for the main analysis and visualization of single-nucleus RNA-seq data was adapted from the R package Seurat version 4.0 and is available from https://satijalab.org/seurat/index.html. Mendelian randomization was performed using the R package TwoSampleMR. The figures were generated using the ‘ggplot2’ and ‘ggbreak’ packages. Codes are available at https://github.com/Sirius-Yang/IMDs_WES86.
References
Collins, P. Y. et al. Grand challenges in global mental health. Nature 475, 27–30 (2011).
Wigerblad, G. & Kaplan, M. J. Neutrophil extracellular traps in systemic autoimmune and autoinflammatory diseases. Nat. Rev. Immunol. 23, 274–288 (2023).
Punga, A. R., Maddison, P., Heckmann, J. M., Guptill, J. T. & Evoli, A. Epidemiology, diagnostics, and biomarkers of autoimmune neuromuscular junction disorders. Lancet Neurol. 21, 176–188 (2022).
DiMeglio, L. A., Evans-Molina, C. & Oram, R. A. Type 1 diabetes. Lancet 391, 2449–2462 (2018).
Seror, R., Nocturne, G. & Mariette, X. Current and future therapies for primary Sjogren syndrome. Nat. Rev. Rheumatol. 17, 475–486 (2021).
Grunewald, J. et al. Sarcoidosis. Nat. Rev. Dis. Prim. 5, 45 (2019).
Polderman, T. J. et al. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat. Genet. 47, 702–709 (2015).
Sawcer, S., Franklin, R. J. & Ban, M. Multiple sclerosis genetics. Lancet Neurol. 13, 700–709 (2014).
Sun, B. B. et al. Genetic associations of protein-coding variants in human disease. Nature 603, 95–102 (2022).
Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022).
Basu, M. K. et al. Exome sequencing identifies abnormalities in glycosylation and ANKRD36C in patients with immune-mediated thrombotic thrombocytopenic purpura. Thromb. Haemost. 121, 506–517 (2021).
Zhou, X. et al. Whole exome sequencing in psoriasis patients contributes to studies of acitretin treatment difference. Int. J. Mol. Sci. https://doi.org/10.3390/ijms18020295 (2017).
Lahtela, E., Kankainen, M., Sinisalo, J., Selroos, O. & Lokki, M. L. Exome Sequencing identifies susceptibility loci for sarcoidosis prognosis. Front. Immunol. 10, 2964 (2019).
Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
Holstege, H. et al. Exome sequencing identifies rare damaging variants in ATP8B4 and ABCA1 as risk factors for Alzheimer’s disease. Nat. Genet. 54, 1786–1794 (2022).
Jurgens, S. J. et al. Analysis of rare genetic variation underlying cardiometabolic diseases and traits among 200,000 individuals in the UK Biobank. Nat. Genet. 54, 240–250 (2022).
Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).
Sazonovs, A. et al. Large-scale sequencing identifies multiple genes and rare variants associated with Crohn’s disease susceptibility. Nat. Genet. 54, 1275–1283 (2022).
Zhou, S. et al. Converging evidence from exome sequencing and common variants implicates target genes for osteoporosis. Nat. Genet. 55, 1277–1287 (2023).
Chia, R. et al. Identification of genetic risk loci and prioritization of genes and pathways for myasthenia gravis: a genome-wide association study. Proc Natl Acad Sci USA https://doi.org/10.1073/pnas.2108672119 (2022).
Burren, O. S. et al. Genetic feature engineering enables characterisation of shared risk factors in immune-mediated diseases. Genome Med. 12, 106 (2020).
Sharma-Oates, A. et al. Early onset of immune-mediated diseases in minority ethnic groups in the UK. BMC Med. 20, 346 (2022).
Zhao, J. H. et al. Genetics of circulating inflammatory proteins identifies drivers of immune-mediated disease risk and therapeutic targets. Nat. Immunol. 24, 1540–1551 (2023).
Lin, J., Zhou, J. & Xu, Y. Potential drug targets for multiple sclerosis identified through Mendelian randomization analysis. Brain 146, 3364–3372 (2023).
Yuan, S. et al. Mendelian randomization and clinical trial evidence supports TYK2 inhibition as a therapeutic target for autoimmune diseases. EBioMedicine 89, 104488 (2023).
Fromme, M. et al. Comorbidities in lichen planus by phenome-wide association study in two biobank population cohorts. Br. J. Dermatol. 187, 722–729 (2022).
Yuan, S. et al. Phenome-wide Mendelian randomization analysis reveals multiple health comorbidities of coeliac disease. EBioMedicine 101, 105033 (2024).
Shirai, Y. et al. Multi-trait and cross-population genome-wide association studies across autoimmune and allergic diseases identify shared and distinct genetic component. Ann. Rheum. Dis. 81, 1301–1312 (2022).
Butler-Laporte, G. et al. HLA allele-calling using multi-ancestry whole-exome sequencing from the UK Biobank identifies 129 novel associations in 11 autoimmune diseases. Commun. Biol. 6, 1113 (2023).
Wjst, M. Exome variants associated with asthma and allergy. Sci. Rep. 12, 21028 (2022).
Agrawal, M. et al. TET2-mutant clonal hematopoiesis and risk of gout. Blood 140, 1094–1103 (2022).
Zhou, W. et al. SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat. Genet. 54, 1466–1469 (2022).
van den Oord, R. A. & Sheikh, A. Filaggrin gene defects and risk of developing allergic sensitisation and allergic disorders: systematic review and meta-analysis. BMJ 339, b2433 (2009).
Punyte, V., Vilkeviciute, A., Gedvilaite, G., Kriauciuniene, L. & Liutkeviciene, R. Association of VEGFA, TIMP-3, and IL-6 gene polymorphisms with predisposition to optic neuritis and optic neuritis with multiple sclerosis. Ophthalmic Genet. 42, 35–44 (2021).
Kurki, M. I. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Weiner, D. J. et al. Polygenic architecture of rare coding variation across 394,783 exomes. Nature 614, 492–499 (2023).
Wu, P. et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med. Inf. 7, e14325 (2019).
Long, J. et al. Identification of NOTCH4 mutation as a response biomarker for immune checkpoint inhibitor therapy. BMC Med. 19, 154 (2021).
Bouzid, H. et al. Clonal hematopoiesis is associated with protection from Alzheimer’s disease. Nat. Med. 29, 1662–1670 (2023).
MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
Johar, A. S. et al. Candidate gene discovery in autoimmunity by using extreme phenotypes, next generation sequencing and whole exome capture. Autoimmun. Rev. 14, 204–209 (2015).
Prinz, J. C. Autoimmune aspects of psoriasis: Heritability and autoantigens. Autoimmun. Rev. 16, 970–979 (2017).
Wack, A., Terczyńska-Dyla, E. & Hartmann, R. Guarding the frontiers: the biology of type III interferons. Nat. Immunol. 16, 802–809 (2015).
Clark, R. A. Human skin in the game. Sci. Transl. Med. 5, 204ps213 (2013).
Vávra, J. et al. Examining the association of rare allelic variants in urate transporters SLC22A11, SLC22A13, and SLC17A1 with hyperuricemia and gout. Dis. Markers 2024, 5930566 (2024).
Lees, C. W., Barrett, J. C., Parkes, M. & Satsangi, J. New IBD genetics: common pathways with other diseases. Gut 60, 1739–1753 (2011).
Parkes, M., Cortes, A., van Heel, D. A. & Brown, M. A. Genetic insights into common pathways and complex relationships among immune-mediated diseases. Nat. Rev. Genet. 14, 661–673 (2013).
Tajuddin, S. M. et al. Large-scale exome-wide association analysis identifies loci for white blood cell traits and pleiotropy with immune-mediated diseases. Am. J. Hum. Genet. 99, 22–39 (2016).
Zanoni, G. et al. In celiac disease, a subset of autoantibodies against transglutaminase binds toll-like receptor 4 and induces activation of monocytes. PLoS Med. 3, e358 (2006).
Jeong, J. & Lee, H. K. The role of CD4(+) T cells and microbiota in the pathogenesis of asthma. Int. J. Mol. Sci. https://doi.org/10.3390/ijms222111822 (2021).
Schett, G., McInnes, I. B. & Neurath, M. F. Reframing immune-mediated inflammatory diseases through signature cytokine hubs. N. Engl. J. Med. 385, 628–639 (2021).
Li, X., Pasche, B., Zhang, W. & Chen, K. Association of MUC16 mutation with tumor mutation load and outcomes in patients with gastric cancer. JAMA Oncol. 4, 1691–1698 (2018).
Mukama, T. et al. Prospective evaluation of 92 serum protein biomarkers for early detection of ovarian cancer. Br. J. Cancer 126, 1301–1309 (2022).
Marafini, I., Monteleone, G. & Stolfi, C. Association between celiac disease and cancer. Int. J. Mol. Sci. https://doi.org/10.3390/ijms21114155 (2020).
Yoon, K. W. et al. Control of signaling-mediated clearance of apoptotic cells by the tumor suppressor p53. Science 349, 1261669 (2015).
Levy-Nissenbaum, E. et al. Hypotrichosis simplex of the scalp is associated with nonsense mutations in CDSN encoding corneodesmosin. Nat. Genet. 34, 151–153 (2003).
Nag, A. et al. Effects of protein-coding variants on blood metabolite measurements and clinical biomarkers in the UK Biobank. Am. J. Hum. Genet. 110, 487–498 (2023).
Schoeler, T. et al. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat. Hum. Behav. 7, 1216–1227 (2023).
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Yokoyama, J. S. et al. Association between genetic traits for immune-mediated diseases and Alzheimer disease. JAMA Neurol. 73, 691–697 (2016).
de Lusignan, S. et al. Atopic dermatitis and risk of autoimmune conditions: population-based cohort study. J. Allergy Clin. Immunol. 150, 709–713 (2022).
Rafiq, M. et al. Allergic disease, corticosteroid use, and risk of Hodgkin lymphoma: a United Kingdom nationwide case-control study. J. Allergy Clin. Immunol. 145, 868–876 (2020).
Persson, M. S. M. et al. Validation study of bullous pemphigoid and pemphigus vulgaris recording in routinely collected electronic primary healthcare records in England. BMJ Open 10, e035934 (2020).
Cipolletta, E. et al. Association between gout flare and subsequent cardiovascular events among patients with gout. JAMA 328, 440–450 (2022).
Schonmann, Y. et al. Inflammatory skin diseases and the risk of chronic kidney disease: population‐based case–control and cohort analyses. Br. J. Dermatol. 185, 772–780 (2021).
Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. SIFT missense predictions for genomes. Nat. Protoc. 11, 1–9 (2016).
Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).
Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet. Chapter 7, Unit7.20 (2013).
Schwarz, J. M., Rödelsperger, C., Schuelke, M. & Seelow, D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7, 575–576 (2010).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).
Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775 (2012).
Kurki, M. I. et al. FinnGen: unique genetic insights from combining isolated population and national health register data. Preprint at medRxiv https://doi.org/10.1101/2022.03.03.22271360 (2022).
Kim, H. Y., Jeon, W. & Kim, D. An enhanced variant effect predictor based on a deep generative model and the Born-Again Networks. Sci. Rep. 11, 19127 (2021).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. Gene Ontol. Consort. Nat. Genet. 25, 25–29 (2000).
Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–d361 (2017).
Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2014).
The Tabula Sapiens Consortium et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Yang, L., Ou, Y. & Wu, B. Large-scale whole exome sequencing analyses identified protein-coding variants associated with immune-mediated diseases in 350770 adults (Version 1.0) [Data set]. https://doi.org/10.5281/zenodo.11307851 (2024).
Acknowledgements
The authors gratefully thank all the participants and professionals contributing to the UKB. We also acknowledge the participants and investigators of the FinnGen study. This study was supported by grants from the Science and Technology Innovation 2030 Major Projects (2022ZD0211600), National Natural Science Foundation of China (82071201, 82071997), Shanghai Municipal Science and Technology Major Project (2018SHZDZX01), Research Start-up Fund of Huashan Hospital (2022QD002), Excellence 2025 Talent Cultivation Program at Fudan University (3030277001), Shanghai Talent Development Funding for The Project (2019074), Shanghai Rising-Star Program (21QA1408700), 111 Project (B18015), ZHANGJIANG LAB, Tianqiao and Chrissy Chen Institute, the State Key Laboratory of Neurobiology and Frontiers Center for Brain Science of the Ministry of Education, and Shanghai Center for Brain Science and Brain-Inspired Technology, Fudan University.
Author information
Contributions
L.Y., Y.N.O., and B.S.W. organized data and carried out the statistical analysis. Y.N.O. and L.Y. participated in writing the first draft of the manuscript. L.Y. and W.S.L. designed and drew the figures. Y.T.D., X.Y.H., Y.L.C., J.J.K., C.J.F., and Y.Z. participated in the revision of the manuscript. L.T., Q.D., J.F.F., W.C., and J.T.Y. participated in the study design and in reviewing and editing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, L., Ou, YN., Wu, BS. et al. Large-scale whole-exome sequencing analyses identified protein-coding variants associated with immune-mediated diseases in 350,770 adults. Nat Commun 15, 5924 (2024). https://doi.org/10.1038/s41467-024-49782-0
DOI: https://doi.org/10.1038/s41467-024-49782-0