Introduction

Clostridioides difficile (C. diff.) infection (CDI), formerly known as Clostridium difficile infection, is the leading infectious cause of nosocomial diarrhea in North America and Europe and is associated with a high global burden of disease1. Once acquired, this reemerging, Gram-positive, spore-forming bacterium secretes a toxin that causes watery diarrhea, sometimes progressing to severe pseudomembranous colitis, toxic megacolon, and sepsis2. In the early 2000s, the emergence of C. diff. strain NAP1/BI/027 led to increased incidence, prevalence, morbidity, and mortality associated with CDI3,4. This epidemic strain produces more toxin, has a higher resistance to common treatments, and causes more recurrent infections than other common C. diff. strains. Despite aggressive antibiotic treatment (e.g. vancomycin, metronidazole, and fidaxomicin) and fecal transplant5,6, outcomes of NAP1/BI/027 CDI include significant morbidity across all age groups, 5% mortality in individuals older than 65 years of age, and an estimated $1.1 billion per year in healthcare costs2.

Asymptomatic colonization with C. diff. is common among patients in healthcare settings, with an estimated prevalence of 3–26% in adults admitted to acute care hospitals and 5–7% in adults at long-term care facilities7. Progression from C. diff. colonization to acute CDI is generally associated with one or more risk factors8, including new exposure to C. diff., older age, hospitalization or nursing home residency, chemotherapy, severe comorbid illness, proton pump inhibitor or immunosuppressant medication use, or prior use of high-risk antibiotics such as fluoroquinolones or cephalosporins9,10,11. Antibiotic use and proton pump inhibitor use are also risk factors for recurrent CDI12. Despite having one or more risk factors, some people colonized with C. diff. either do not develop CDI or successfully clear an initial infection, while other individuals are burdened by severe and/or recurrent CDI. This differential susceptibility may have a genetic component, given that host genetic variation underlies susceptibility for other infections, including enteric infections such as Helicobacter pylori13. Identification of host genetic susceptibility loci could yield methods for prevention and/or treatment of this important pathogen14,15.

Previous studies have identified candidate risk loci for primary and recurrent CDI in small patient populations using a combination of genetic and clinical data. Apewokin et al.16 performed a genome-wide logistic regression analysis of CDI in 646 patients (57 cases; 589 controls) undergoing stem cell transplantation for multiple myeloma, and found several single nucleotide variants (SNVs) in the RLBP1L1, ASPH, and P7B genes that were associated with higher risk of CDI. Shen et al.17 identified two alleles in the extended major histocompatibility complex (MHC; HLA-DRB1*07:01 and HLA-DQA1*02:01) that were associated with a reduction in CDI recurrence among 704 patients who achieved initial clinical cure with bezlotoxumab treatment in the MODIFY clinical trials. Several studies have also suggested that common SNVs in the promoter region of the interleukin-8 (IL-8) gene may confer increased risk for recurrent CDI by altering neutrophil recruitment during disease pathogenesis18,19. While these results are collectively suggestive of genetic involvement in CDI risk, the aforementioned studies had small sample sizes and did not always control for major risk factors such as previous antibiotic use or corticosteroid use in their association models. Genome-wide association studies (GWAS) that properly control for known risk factors and include a large number of participants are needed to identify risk loci with sufficient power and reliability. One such study identified 16,464 patients (1160 cases; 15,304 controls) from the Geisinger MyCode cohort20 using a C. diff. phenotyping algorithm developed by the Electronic Medical Records and Genomics (eMERGE) Network21. While no variants reached genome-wide significance in the full case–control dataset, one variant (rs114751021) in the small nucleolar RNA SNORD117 gene, located in the MHC region, reached genome-wide significance in a subset of cases and controls with recent exposure to antibiotics (P = 4.50 × 10–8; OR 2.42; 587 cases; 3166 controls). Additional validation studies in other large patient cohorts are needed to evaluate the role of genetic factors in CDI risk.

To identify common genetic variants associated with susceptibility to CDI, we performed joint and ancestry-stratified GWAS and human leukocyte antigen (HLA) fine-mapping using phenotypes extracted from electronic medical records (EMRs) of participants aged two years or older from the eMERGE Network. The eMERGE Network is a National Human Genome Research Institute (NHGRI)-funded consortium of twelve study sites across the United States (U.S.) that supports research for furthering the implementation of genomic medicine22. At the time of this study, the network included a multi-ethnic cohort of roughly 99,000 U.S. participants with linked genetic and EMR data.

Results

Demographics

After all exclusions, there were 1349 cases and 18,512 controls identified via the eMERGE C. diff. phenotyping algorithm (Table 1). Approximately 74% of cases and controls self-identified as White, and 19% self-identified as Black or African American. Although older age is a known risk factor for C. diff. infection11, controls tended to be older than cases (z = 14.37, P = 2.20 × 10–16), which reflected the patient populations of the participating eMERGE study sites. Controls also tended to have higher BMIs than cases (z = 14.58, P = 2.20 × 10–16). Cases had slightly higher exposure to Class 1 (high-risk) antibiotics than controls (28% vs. 21%), yet they had much less exposure to Class 2 (moderate risk) antibiotics than controls (11% vs. 26%). More cases received chemotherapy outside of the exclusionary time period than did controls. It is worth noting that while 14 cases were identified from Cincinnati Children’s Medical Hospital, no controls were identified from this site. These cases were 57% female, with a median age of 4.0 (IQR 3.0–12.5) years and a median BMI of 16.09 (IQR 14.90–17.00). Approximately 93% of these cases were of European ancestry (genetically determined) and tended to be at high risk for C. diff. infection, with 50% having recent exposure to Class 1 or Class 2 antibiotics and 43% having recent exposure to transplant medications.
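The z statistics reported above compare sample means between cases and controls. As a minimal sketch of how such a two-sample Z-test can be computed, the Python below uses only the standard library; the function name `two_sample_z` and the age values are hypothetical and are not study data. The final line shows that for |z| near 14 the exact two-sided normal tail probability lies many orders of magnitude below 10^-16.

```python
import math
from statistics import mean, variance


def two_sample_z(sample_a, sample_b):
    """Return (z, two-sided p) for the difference in means of two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error of the difference, using each sample's own variance.
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    # Two-sided tail of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical ages in years, for illustration only (not study data).
control_ages = [62, 70, 58, 66, 73, 61, 69, 64]
case_ages = [51, 47, 60, 55, 49, 58, 53, 50]

z, p = two_sample_z(control_ages, case_ages)
print(f"z = {z:.2f}, two-sided p = {p:.2e}")

# Exact normal tail for a z statistic of the magnitude reported in the text.
tail_at_14 = math.erfc(14.37 / math.sqrt(2))
```

With real cohorts the same function would be applied to the full age and BMI vectors for cases and controls.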

Table 1 Summary statistics of demographic data and phenotypes for C. diff cases and controls selected using the C. diff phenotyping algorithm.

After finding the intersection of self-reported ancestry and genetically determined ancestry, there were 3700 African participants, 14,620 European participants, and 135 Asian participants. Table 2 summarizes the demographic and phenotype characteristics of the African ancestry cases (n = 192) and controls (n = 3508) and European ancestry cases (n = 988) and controls (n = 13,632), which were used to conduct ancestry-stratified association tests. Cases in the African sample tended to be younger than those in the European sample (median age 50.8 vs. 59.6 years) and had higher rates of diabetes (37% vs. 20%) and HIV (14% vs. 0.8%). There was a higher proportion of female participants among controls in the African sample than in the European sample (66% vs. 52%), and controls in the African sample had higher exposure to high-risk antibiotics (30% vs. 18%) and moderate risk antibiotics (46% vs. 20%) than those in the European sample, as well as higher rates of diabetes (33% vs. 22%). The demographic and risk characteristics of the European sample tended to mirror those of the full study population, but a higher proportion of cases in the European sample identified as not Hispanic or Latino (98% vs. 88%).

Table 2 Summary of demographic data and phenotypes for C. diff cases and controls in the African ancestry (n = 3700) and European ancestry (n = 14,620) samples.

GWAS

Table 3 summarizes the logistic regression association results that reached genome-wide significance in the combined and European ancestry-only samples, with corresponding summary statistics for those findings in the African ancestry-only sample. A strong association in the human leukocyte antigen (HLA) region was found in the European and joint ancestry samples (Fig. 1, Supplementary Fig. S2) but was not found in the African ancestry sample. The lack of association in the African ancestry sample could be due to either insufficient detection power as a result of small sample size or different haplotype or linkage disequilibrium (LD) structures compared to individuals of European ancestry. Manhattan plots and corresponding QQ plots for the European, joint, and African ancestry GWAS analyses are provided (Supplementary Figs. S1–S5). The five most significantly associated SNVs driving the association in the European sample (rs68148149, P = 8.06 × 10–14; rs3828840, P = 9.96 × 10–14; rs35882239, P = 8.18 × 10–12; rs71534541, P = 5.12 × 10–11; rs35222480, P = 9.88 × 10–11) mapped to the intergenic region between the HLA-DRB5 and HLA-DRB1 genes in the beta block of the MHC Class II region. Three of the five most significant SNVs (rs3828840, rs35882239, and rs35222480), with minor allele frequencies (MAFs) of 0.17, 0.17, and 0.20, respectively, also mapped to the 3ʹ end of the HLA-DRB6 pseudogene. A review of the NHGRI-European Bioinformatics Institute (NHGRI-EBI) GWAS Catalog23 and dbSNP24 revealed that rs3828840 has been previously associated with multiple sclerosis, an autoimmune inflammatory disease that impacts the central nervous system25.

Table 3 Index SNV results from logistic regression-based genome wide analysis for joint ancestry (n = 19,861), European ancestry (n = 14,620), and African ancestry (n = 3700) samples.
Figure 1

Manhattan plot of P-values generated using logistic regression analysis in the European ancestry sample (n = 14,620). An additive model was used to assess the disease susceptibility impact of the minor (coded) allele at each position, while controlling for age, BMI, sex, ancestry, nursing home status, chemotherapy, diabetes, HIV, transplant medications, corticosteroids, and medium or high-risk antibiotic exposure as covariates. Genomic coordinates are displayed along the X-axis, and the negative logarithm of logistic regression P-values are displayed on the Y-axis. Each dot represents a SNV in the regression model, with associated P-values plotted accordingly, while the triangle represents the most significantly associated SNV. The dotted line represents the negative logarithm of the genome-wide significance threshold (P < 5 × 10–8). Colors are used to distinguish between SNVs in adjacent chromosomes.
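To make the Y-axis of the Manhattan plot concrete, the short Python sketch below converts P-values to their negative base-10 logarithms and compares them with the genome-wide significance line. The P-values are the five lead European-ancestry SNVs quoted in the text; the variable and function names are illustrative only.

```python
import math

GENOME_WIDE_P = 5e-8  # significance threshold stated in the figure legend


def neg_log10(p):
    """Height of a point on a Manhattan plot for P-value p."""
    return -math.log10(p)


# P-values of the five lead SNVs in the European-ancestry GWAS (from the text).
lead_snvs = {
    "rs68148149": 8.06e-14,
    "rs3828840": 9.96e-14,
    "rs35882239": 8.18e-12,
    "rs71534541": 5.12e-11,
    "rs35222480": 9.88e-11,
}

threshold_y = neg_log10(GENOME_WIDE_P)  # height of the dotted line
heights = {snv: neg_log10(p) for snv, p in lead_snvs.items()}
significant = [snv for snv, y in heights.items() if y > threshold_y]
index_snv = max(heights, key=heights.get)  # tallest point (the triangle)

for snv, y in heights.items():
    print(f"{snv}: -log10(P) = {y:.2f}")
```

The dotted line therefore sits at roughly 7.3 on the Y-axis, and every SNV plotted above it has P < 5 × 10–8.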

Given the well-known presence of high LD within the HLA region26, a regional LD plot with reference to the index SNV (rs68148149) was generated using P-values from the European logistic regression analysis and using the 2014 1000 Genomes European superpopulation as a reference group (Fig. 2). This step was taken to assess the possibility that variants other than the index SNV might better explain disease association in terms of functional impact. While the next two most significant SNVs were in high LD with the index SNV (R2 > 0.8), the index SNV had the highest regulatory potential among the most significantly associated SNVs, as annotated by RegulomeDB27. To assess the possibility that the lack of disease association in the African ancestry sample is a result of different regional LD structures, a regional LD plot with reference to the index SNV was generated using the 1000 Genomes African superpopulation as a reference (Supplementary Fig. S6). The next two most significant SNVs in the European-ancestry sample were also in high LD with the index SNV in the African-ancestry superpopulation, but higher LD was observed with more SNVs in the HLA-DRB1/5 intergenic region in the African superpopulation (R2 > 0.4) than in the European superpopulation (R2 > 0.2). On the other hand, lower LD was observed with SNVs in the region spanning HLA-DRB1 and HLA-DQA1 in the African superpopulation (R2 > 0.6) than in the European superpopulation (R2 > 0.8). Differences in regional LD patterns between the European-ancestry and African-ancestry samples could therefore have contributed to the observed differences in gene-disease association patterns, in addition to insufficient detection power.
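The R2 values discussed above measure pairwise LD between two variants. As a hedged illustration of the underlying quantity (not the reference-panel pipeline used in the study), the Python sketch below computes r² from phased haplotypes at two biallelic sites. The function name `ld_r2` and the toy haplotypes are hypothetical.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic sites.

    `haplotypes` is a list of (a, b) tuples, one per phased chromosome,
    where a and b are 0/1 allele codes at the first and second site.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # allele-1 frequency, site 1
    p_b = sum(b for _, b in haplotypes) / n          # allele-1 frequency, site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Hypothetical toy panels of 20 haplotypes each (not 1000 Genomes data).
tight = [(1, 1)] * 9 + [(0, 0)] * 10 + [(1, 0)] * 1
loose = [(1, 1)] * 2 + [(1, 0)] * 3 + [(0, 1)] * 3 + [(0, 0)] * 12

print(f"tight pair r^2 = {ld_r2(tight):.3f}")
print(f"loose pair r^2 = {ld_r2(loose):.3f}")
```

Running the same calculation against two different reference panels is what allows the same pair of variants to show different R2 values in European and African superpopulations.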

Figure 2

Regional LD plot of SNVs evaluated in the European-ancestry logistic regression analysis, using the European 1000 Genomes superpopulation as a reference group. Genomic coordinates spanning the HLA-DRB region and surrounding genes are shown on the X-axis in both subplots. Negative logarithms of P-values from the European-ancestry logistic regression analysis are shown on the Y-axis in the upper subplot, and annotated gene transcripts are distributed along the Y-axis in the lower subplot. Each dot represents a SNV in the regression model, with associated P-values plotted accordingly. SNVs in highest LD with reference to the index SNV (rs68148149) are colored in red. The LD plot was generated with the LocusZoom68 tool using default parameters and the 1000 Genomes Project 2014 EUR reference panel.

A follow-up GWAS using the index SNV as a covariate revealed several new SNVs associated at genome-wide significance (rs116603449, P = 4.54 × 10–9; rs9270896, P = 6.09 × 10–9; rs9270894, P = 1.12 × 10–8; rs9270895, P = 2.32 × 10–8; rs618095, P = 3.71 × 10–8) (Table 3, Supplementary Figs. S7, S8). While suggestive peaks were observed in chromosomes 14 and 22 using the unadjusted model, the elimination of these peaks in models that included the genome-wide significant index SNVs suggests that they were spuriously associated with the tagged region in chromosome 6. However, no SNVs of interest on chromosomes 14 or 22 were in high LD with any of the index SNVs on chromosome 6, so the nature of the association remains unknown.

HLA association analyses

All 14,620 European ancestry participants had high quality imputed HLA genotypes available for association analyses. Table 2 summarizes the number of participants in each ancestry stratified case–control group possessing at least one HLA-DRB3, 4 and/or 5 gene (corresponding to haplotype families HLA-DR52, 53 and 51, respectively)28 (Supplementary Fig. S11). The most significant SNVs from the GWAS reached genome-wide significance among individuals with at least one DRB3, 4 or 5 gene collectively (rs68148149, P = 1.26 × 10–13; rs3828840, P = 1.49 × 10–13; rs35882239, P = 2.37 × 10–11; rs71534541, P = 1.67 × 10–11; rs35222480, P = 3.17 × 10–11), and among individuals with at least one DRB5 gene only, or DR51 haplotype carriers (rs68148149, P = 1.55 × 10–11; rs3828840, P = 1.72 × 10–11; rs35882239, P = 2.62 × 10–10; rs71534541, P = 1.56 × 10–11; rs35222480, P = 4.68 × 10–11) (Table 4, Supplementary Fig. S9). Among DR51 haplotype carriers, the most significantly associated SNVs only reached genome-wide significance among carriers of the DR15 haplotype (rs68148149, P = 2.08 × 10–11; rs3828840, P = 2.27 × 10–11; rs35882239, P = 4.14 × 10–10; rs71534541, P = 1.75 × 10–12; rs35222480, P = 5.81 × 10–12), and more specifically, carriers of the HLA-DRB1*15:01 allele (rs68148149, P = 7.45 × 10–11; rs3828840, P = 8.11 × 10–11; rs35882239, P = 1.42 × 10–9; rs71534541, P = 7.37 × 10–12; rs35222480, P = 1.43 × 10–11). No SNVs reached genome-wide significance among participants with at least one DRB3 or DRB4 gene only, suggesting that the HLA-DR51 haplotype in combination with variants in the HLA-DRB1/5 intergenic region may singularly drive genetic risk for CDI in the European ancestry population. However, examining the risk allele frequencies of the index SNV (rs68148149) in cases and controls across DR51, DR52, and DR53 haplotype-enriched groups showed that the risk allele frequency was higher in European-ancestry cases than controls in all haplotype groups, suggesting that the SNV may indeed drive risk in all HLA-DR haplotype groups but that the low frequency in the DR52 and DR53 haplotype groups limits the power to detect the association in these groups (Supplementary Fig. S12). The same pattern was not observed in African-ancestry cases and controls, indicating that haplotype differences between ancestry groups may indeed play a role in differentially conferring risk.

Table 4 Index SNV results from logistic regression-based analysis of the HLA region in European samples enriched for each HLA-DRB haplotype or haplotype family: DR51, DR52, DR53, DR15, DRB1*15:01, and any of the above.

To assess the possibility that one or more HLA alleles themselves were driving the risk association in the European ancestry sample, rather than the most significantly associated SNVs identified in the GWAS, we performed a separate logistic regression analysis using the HIBAG-imputed HLA genotypes in the European ancestry sample. None of the imputed HLA alleles reached genome-wide significance. Using the classical HLA tags identified by de Bakker et al.29 and the NCI LDMatrix tool30, it was also confirmed that none of the GWAS-identified SNVs were in high LD (R2 > 0.5) with any classical HLA alleles in either the European ancestry or African ancestry 1000 Genomes superpopulations. The index SNV was in moderate LD with the tag SNV for the DRB1*15:01-DRB5*01:01 haplotype in the European ancestry superpopulation (rs3135388; R2 = 0.186) and low LD with the tag SNV in the African ancestry superpopulation (rs443623; R2 = 0.002).

Discussion

Using a robust EMR-based phenotyping algorithm, we identified a large, multi-institutional corpus of patients with a history of at least one episode of CDI and controls without CDI. Our results suggest that genetic variation in the (HLA-)DRB locus of the HLA region may increase risk of infection in European ancestry populations. In this study, European participants who possessed the minor allele among the most significantly associated SNVs had 56% greater odds of having at least one episode of CDI. As the key beta-subunits of MHC Class II surface receptors on antigen presenting cells (APCs), the proteins encoded by DRB genes play a critical role in stimulating the host adaptive immune response against foreign peptides and are therefore excellent candidates for future studies of host immunity to C. diff.31.

The MHC (HLA) Class I and II loci are among the most polymorphic coding regions in the human genome, and DRB genes are particularly variable in copy number and combination. Although there is only one monomorphic DRA gene per (HLA-)DR haplotype, there are five common DR haplotype families composed of different combinations of protein coding DRB genes (DRB1, DRB3, DRB4 and DRB5) and pseudogenes (DRB2, DRB6, DRB8 and DRB9)28. DRB1 is present in all haplotypes, but any given individual may have as few as two protein coding DRB genes (2 copies of DRB1), or as many as four genes (2 copies of DRB1 + 1 or 2 copies of DRB3, 4 or 5) between homologs. The unique combination of DRB genes on each haplotype is remarkably conserved and has been maintained in ancestral DNA since before the divergence of human and gorilla lineages over five million years ago32. Although having a diverse set of MHC II molecules may confer a selective advantage against infection33, each additional DRB gene is nonetheless susceptible to intragenic and/or regulatory mutations in the highly polymorphic HLA region and may paradoxically increase susceptibility to other diseases. In the case of gastrointestinal infections, protective effects of the DRB1*04:05 allele against enteric infection caused by Salmonella typhi or Salmonella paratyphi have been observed in Vietnamese and Nepalese patients34. Conversely, the DRB1 gene has also been implicated in increasing host susceptibility to a number of inflammatory diseases, including Crohn’s disease, type I diabetes mellitus, rheumatoid arthritis, multiple sclerosis (MS), ulcerative colitis and Alzheimer’s disease, primarily in European populations35,36,37,38,39,40.

Haplotype effects appear to play a critical role in conferring risk for CDI. In this study, the risk association only reached genome-wide significance in individuals carrying at least one copy of the DRB1*15:01-DRB5*01:01 haplotype41, and individuals in this group had 200% higher odds of developing CDI on average. These results indicate that the DRB1*15:01-DRB5*01:01 haplotype is involved in conferring CDI risk among individuals with common genetic variants in the tagged DRB1-DRB5 intergenic region (Supplementary Fig. S10). This haplotype is most strongly associated with susceptibility to multiple sclerosis42,43,44,45, but has also been associated with susceptibility to other autoimmune conditions including anti-glomerular basement membrane disease in European ancestry populations46,47, and both systemic lupus erythematosus and adult onset Still’s disease in Japanese populations48.

One possible explanation for increased CDI risk among these individuals is that differential MHC II gene expression impacts the baseline composition of their gut microbiota, thereby influencing colonization resistance to opportunistic enteric pathogens like C. diff. Secretory Immunoglobulin A (IgA) antibodies play an essential role in shaping an individual’s gut microbial community and maintaining a homeostatic balance of microbes within the mucosal immune system49, and the interactions between APCs and CD4+ T-follicular helper (Tfh) cells are key to driving the production of IgA by plasma cells50. Studies in mouse models have previously demonstrated that MHC II polymorphisms directly affect antibody-mediated microbiota composition, and that the unique microbial communities formed under the influence of different MHC genotypes can impact an organism’s susceptibility to opportunistic pathogens like Salmonella enterica typhimurium when treated with antibiotics51,52. Understanding the unique interactions between commensal microbe antigens presented by APCs, the MHC II molecules encoded by the DRB1*15:01-DRB5*01:01 haplotype, and Tfh cells may provide valuable insights into how host genetics impact the composition of gut microbial communities in individuals susceptible to enteric infection, compared with those who are resistant to infection.

Alternatively, increased CDI risk among these individuals may be driven by differential T-cell mediated responses to the TcdA and TcdB toxins produced by C. diff. bacteria. In addition to sculpting the host microbiota, high affinity IgA helps to neutralize bacterial toxins53. Unique interactions between T-cells and C. diff. toxins specifically bound by DRB1*15:01-DRB5*01:01 MHC II molecules may impact the host anti-toxin IgA response differently than other T-cell-MHC II interactions, thus influencing the host’s ability to clear circulating toxins. Recent Phase III, placebo-controlled clinical trials of the monoclonal antibody treatments actoxumab (anti-TcdA) and bezlotoxumab (anti-TcdB) showed that TcdB toxin neutralization alone could decrease CDI recurrence by 38% among patients receiving standard antibiotic therapy for initial or recurrent CDI54. Naturally occurring anti-TcdB antibodies in the placebo group also conferred protection against recurrent CDI, recapitulating the importance of neutralizing TcdB in controlling infection55. However, other studies have failed to replicate these results when comparing healthy controls with CDI patients, suggesting that anti-toxin antibody concentrations may not fully explain susceptibility to initial and/or recurrent infection56.

Although the MHC II region is strongly associated with CDI in this study, the SNVs that confer risk are neither located in coding regions, nor in high LD with SNVs in coding regions, suggesting that the mechanism for altered gene expression may be regulatory. One possible mechanism for altered expression of the DRB1*15:01-DRB5*01:01 haplotype is allele-specific DNA methylation of the DRB1 and/or DRB5 regulatory regions, given that targeted bisulfite sequencing has previously identified the DRB1-DRB5 intergenic space as a differentially methylated region57. Disruptions to normal DNA methylation patterns, and to resulting gene expression, have been known to modulate susceptibility to a number of human diseases58. For example, in the case of DRB1*15:01-DRB5*01:01-associated multiple sclerosis, DNA hypermethylation in exon 2 of DRB1 confers protection against the major risk allele and is driven by several SNVs in high LD with one another that overlap with CpG sites59. It is possible that disrupted methylation patterns at or near the regulatory regions of DRB1*15:01 and/or DRB5*01:01 also contribute to differential expression of these MHC II proteins, thus impacting the landscape of the host adaptive immune response via microbiome-mediated and/or toxin-mediated mechanisms. Additional gene expression analyses, such as expression quantitative trait loci (eQTL) analysis, could be used to explore whether the top SNVs regulate expression levels of nearby genes.

This study has several important limitations. First, sample size and statistical power were severely limited among non-European ancestry samples, which may have contributed to the lack of significant associations in the African ancestry analyses. It is also possible that within the European sample, the comparatively low frequency of the risk allele in the DR52 and DR53 haplotype groups, compared to DR51, limited the power to detect a true risk association in other DR haplotype groups. Second, replicate studies are needed to confirm the identified association. However, the large, multi-site biobank of linked EMR and genotype data used in this study supports the replicability and reliability of these results, and future association studies would benefit immensely from these types of biobanks. While the gene associations in this study do not align exactly with those identified in the previous C. diff. GWAS conducted by Li et al. using the MyCode cohort, they do support the hypothesis that immune molecules encoded within the MHC region are involved in CDI pathogenesis. Third, C. diff. cases were not stratified by primary and recurrent CDI, and it is possible that the genetic variants driving pathogenesis are different between these two forms of infection. For example, Shen et al. identified alleles in DRB1 and DQA1 that were different from those identified in this study and were protective against CDI recurrence, suggesting that the genetic factors involved in initial vs. recurrent infection could be distinct from one another. Fourth, the length and severity of infection were not considered in the current study, but future analyses would benefit from continuous trait regression analyses to identify genetic variants associated with increased CDI length and/or severity, rather than susceptibility. Additionally, C. diff. cases in this study included individuals with a positive antigen test as their only criterion for infection. The C. diff. antigen test cannot accurately distinguish between toxigenic and non-toxigenic strains and may falsely identify asymptomatic carriers as C. diff. cases. Finally, the specific toxigenic ribotype that each case was exposed to was not included in the analysis, and it is possible that different C. diff. ribotypes are associated with different genetically determined host responses.

Our findings suggest that genetic variation in the MHC II locus of the HLA region drives susceptibility to CDI and highlight the importance of the adaptive immune response in combating opportunistic pathogens. To better understand how host genetics might confer microbiome-mediated risk for opportunistic enteric infections, future studies should explore the mechanisms of interaction between commensal microbe antigens presented by APCs and the MHC II molecules encoded by the DRB1*15:01-DRB5*01:01 haplotype. Interactions between DRB1*15:01-DRB5*01:01 MHC II molecules, C. diff. exotoxins and T-cells may alternately play a critical role in CDI pathogenesis, and additional work is needed to understand whether and how the host IgA response is differentially impacted by the combined effects of haplotype and transcriptional modifications. Finally, future work should address the possibility that allele-specific DNA methylation is a driver of epigenetic transcriptional regulation of the DRB1 and/or DRB5 genes. If this mechanism is experimentally validated, therapeutics that modulate MHC II molecule transcription levels could potentially be developed to decrease the incidence of CDI among individuals who carry the risk genotype.

Methods

Participants

Cases and controls were selected from among the ~ 99,000 participants of the eMERGE Network. Participating sites included the following: 1. The Children’s Hospital of Philadelphia, Philadelphia, PA; 2. Cincinnati Children’s Medical Hospital, Cincinnati, OH; 3. Columbia University, New York, NY; 4. Geisinger, Danville, PA; 5. Mass General Brigham, Boston, MA; 6. Kaiser Permanente Washington (formerly Group Health Cooperative) and University of Washington partnership, Seattle, WA; 7. Marshfield Clinic, Marshfield, WI; 8. Mayo Clinic, Rochester, MN; 9. Meharry Medical College, Nashville, TN; 10. Mount Sinai, New York, NY; 11. Northwestern University, Evanston, IL; and 12. Vanderbilt University, Nashville, TN. Informed consent was obtained from participants by each eMERGE site. The eMERGE study was approved by each participating site’s institutional review board, and all methods were performed in accordance with the relevant guidelines and regulations at each institution.

Case–control selection using Clostridioides difficile phenotyping algorithm

Clostridioides difficile cases and controls were selected using a variety of information contained in the EMR, including International Classification of Disease (ICD) Clinical Modification (CM) codes 9th and 10th editions, lab and medication data, and clinician progress notes. The C. diff. phenotyping algorithm used in this study was designed collaboratively by the University of Washington, Group Health and Vanderbilt as part of the eMERGE Network and was published in the Phenotyping KnowledgeBase (PheKB) in 201260,61. Case/control selection and exclusion criteria are depicted as a flowchart in Fig. 3.

Figure 3

eMERGE Clostridioides difficile phenotyping algorithm flowchart.

For participants aged two years or older, there were four combinations of EMR data considered for case selection. First, individuals with a positive C. diff. antigen or toxin test were selected. Second, those with one or more inpatient or outpatient diagnoses of C. diff. (ICD-9-CM code 008.45; ICD-10-CM code A047), followed by one or more days of medication for treatment (metronidazole, oral vancomycin, fidaxomicin, or linezolid), followed by another inpatient or outpatient C. diff. diagnosis code, were selected. Third, individuals with at least one C. diff. ICD-CM code combined with at least one affirmative mention (unqualified by negation, uncertainty, or historical reference) of C. diff. infection in a clinical progress note as identified through natural language processing (NLP), were selected. The C. diff. mentions used by the NLP algorithm are listed in Supplementary Table S1. Finally, individuals with two or more affirmative mentions of C. diff. infection on separate calendar days in clinical progress notes, identified by NLP, were selected. To exclude severely immune-compromised participants from the test population, participants meeting one of the four above criteria were excluded from being cases if they had a diagnosis of bone marrow cancer in the 2-year period prior to their C. diff. case index date (i.e., the first positive lab test, diagnosis code or progress note mention), or within 7 days following their index date. Participants were also excluded from being cases if they had received chemotherapy in the 180-day period prior to their C. diff. index date, or within 7 days following their index date. Using these criteria, 1598 cases were selected.
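The case definition above can be read as four alternative inclusion rules followed by two exclusion windows around an index date. The Python sketch below is one possible encoding under stated assumptions, not the published PheKB implementation: the `PatientRecord` fields are hypothetical, the 2-year window is approximated as 730 days, and "followed by" is interpreted as same-day-or-later ordering with the second diagnosis strictly after the first.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class PatientRecord:
    """Hypothetical EMR extract for one participant aged two years or older."""
    positive_test_dates: List[date] = field(default_factory=list)
    diagnosis_dates: List[date] = field(default_factory=list)
    treatment_dates: List[date] = field(default_factory=list)
    nlp_mention_dates: List[date] = field(default_factory=list)
    bone_marrow_cancer_dates: List[date] = field(default_factory=list)
    chemo_dates: List[date] = field(default_factory=list)


def meets_case_criteria(r: PatientRecord) -> bool:
    # 1. Any positive antigen or toxin test.
    if r.positive_test_dates:
        return True
    # 2. Diagnosis, then treatment, then another diagnosis (assumed ordering).
    if any(d1 <= t <= d2 and d1 < d2
           for d1 in r.diagnosis_dates
           for t in r.treatment_dates
           for d2 in r.diagnosis_dates):
        return True
    # 3. At least one diagnosis code plus at least one affirmative NLP mention.
    if r.diagnosis_dates and r.nlp_mention_dates:
        return True
    # 4. Affirmative NLP mentions on two or more separate calendar days.
    return len(set(r.nlp_mention_dates)) >= 2


def in_window(events: List[date], index: date, days_before: int,
              days_after: int = 7) -> bool:
    lo = index - timedelta(days=days_before)
    hi = index + timedelta(days=days_after)
    return any(lo <= d <= hi for d in events)


def is_case(r: PatientRecord) -> bool:
    if not meets_case_criteria(r):
        return False
    # Index date: first positive test, diagnosis code, or note mention.
    index = min(r.positive_test_dates + r.diagnosis_dates + r.nlp_mention_dates)
    if in_window(r.bone_marrow_cancer_dates, index, days_before=730):
        return False
    if in_window(r.chemo_dates, index, days_before=180):
        return False
    return True
```

For example, `is_case(PatientRecord(positive_test_dates=[date(2015, 3, 1)]))` returns `True`, while adding a chemotherapy date 45 days before the test makes the same record ineligible.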

Controls were selected from eMERGE participants two years of age or older who had no known test for and no diagnosis codes for C. diff. in their records. Since C. diff. toxin tests have sensitivities ranging from 60 to 70%62, a single test does not rule out disease, and multiple tests could signal a concern that disease exists. Additionally, controls must have had at least one hospital admission with a prior exposure to a high- or moderate-risk antibiotic (Supplementary Table S2) in the 7- to 62-day period before admission. Alternatively, they must have had exposure to a high- or moderate-risk antibiotic and had 5 or more years of documented clinical visits following exposure with no mention of C. diff. infection in their progress notes. Participants meeting the control criteria were excluded if they had chemotherapy or bone marrow cancer in the 180-day period prior to the C. diff. control index date (i.e., the earliest hospital admission with antibiotic exposure or earliest antibiotic exposure with 5 years of follow-up), or within seven days following the index date. These criteria resulted in the selection of 23,061 eMERGE participants as controls.
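The admission-based control criterion can be illustrated with a minimal sketch. The function name and inputs are hypothetical, and treating both ends of the 7 to 62-day window as inclusive is an assumption; the alternative route based on 5 years of follow-up is not covered here.

```python
from datetime import date
from typing import List, Optional


def earliest_qualifying_admission(
    admissions: List[date], antibiotic_dates: List[date]
) -> Optional[date]:
    """Earliest admission preceded by an antibiotic exposure 7 to 62 days earlier.

    `antibiotic_dates` is assumed to hold only high- or moderate-risk exposures.
    """
    for adm in sorted(admissions):
        if any(7 <= (adm - abx).days <= 62 for abx in antibiotic_dates):
            return adm
    return None
```

Under these assumptions, the returned date would serve as the control index date for the admission-based route.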

We excluded 202 cases and 2723 controls that were missing genotype data. An additional 31 cases and 889 controls were excluded because the genotype imputation quality failed to meet our quality control (QC) threshold (mean R2 > 0.3)63.

Cryptic relatedness was assessed in all participants by calculating the probabilities of sharing alleles identical by descent (IBD), where Z0 is the probability of sharing zero alleles IBD and Z1 is the probability of sharing one allele IBD. Families were constructed when sample pairs had Z0 < 0.83 and Z1 > 0.163. When study participants were found to be in the same family, we prioritized the inclusion of cases. In situations where two or more cases or two or more controls were found to be in the same family, one participant was selected at random, and the others were excluded. For participants selected via the C. diff. phenotyping algorithm, 9 cases and 937 controls were excluded due to cryptic relatedness. Two-sample Z-tests were used to identify significant differences in the sample means of distributions for continuous variables (age and BMI) between cases and controls.
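As an illustration of the two-sample Z-test used for age and BMI, the following sketch computes the statistic and a two-sided p-value from raw values. The function name is hypothetical, and the use of unpooled sample variances is an assumption, since the text does not specify how the standard error was formed.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def two_sample_z_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (z, two-sided p) for the difference between the means of x and y."""
    # Standard error of the difference, from unpooled sample variances.
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

With thousands of participants per group, as in this study, the normal approximation underlying the Z-test is reasonable.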

Covariates identified for phenotyping algorithm sample

The following covariates were identified for all cases and controls using structured EMR data: 1. Age at index date (index age); 2. Body mass index (BMI); 3. Sex; 4. Genetically determined ancestry; 5. Nursing home status (y/n); 6. Chemotherapy (y/n); 7. Diabetes mellitus (y/n); 8. Human immunodeficiency virus (HIV) positive status (y/n); 9. Any transplant medications (y/n); 10. Any corticosteroid medications (y/n); and 11. Any medium or high-risk antibiotic exposure (y/n). We used the median BMI record for the age year that matched most closely to the participant’s index age. Nursing home status was determined either by structured data on skilled nursing facility residence, or by mentions of nursing home status in social work and case management notes, as identified by NLP (Supplementary Table S3). We flagged chemotherapy using Current Procedural Terminology (CPT) codes 96400, 96408, 96409, 96411–96425, 96520, and 96530. We flagged participants as having diabetes mellitus if they had at least two of the following three indications: 1. An ICD-CM code from ICD-9-CM 250.* or ICD-10-CM E08-E13.*; 2. Prescriptions for diabetes medications including insulin (Supplementary Table S4); or 3. A hemoglobin A1C (HbA1C) reading > 6.5% or a glucose reading of > 200 mg/dL. Participants were flagged as having HIV infection if they had one instance of ICD-9-CM 042.*, ICD-10-CM B20-B24.* or Z21.*. Patients were flagged as having been exposed to transplant or corticosteroid medications if any medication listed in Supplementary Table S4 was administered outside of the exclusionary time range.
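The two-of-three rule for the diabetes mellitus flag can be sketched as follows. The function names and argument layout are hypothetical; the code prefixes and laboratory thresholds are those given above, and the medication indicator is assumed to have been derived separately from Supplementary Table S4.

```python
from typing import Iterable, Optional

# ICD-10-CM category prefixes E08 through E13.
ICD10_DM_PREFIXES = tuple(f"E{n:02d}" for n in range(8, 14))


def has_diabetes_code(codes: Iterable[str]) -> bool:
    """True if any code falls in ICD-9-CM 250.* or ICD-10-CM E08-E13.*."""
    return any(
        c == "250" or c.startswith("250.") or c.startswith(ICD10_DM_PREFIXES)
        for c in codes
    )


def flag_diabetes(
    codes: Iterable[str],
    on_diabetes_medication: bool,
    max_hba1c_pct: Optional[float],
    max_glucose_mg_dl: Optional[float],
) -> bool:
    """Flag diabetes when at least two of the three indications are present."""
    lab = (max_hba1c_pct is not None and max_hba1c_pct > 6.5) or (
        max_glucose_mg_dl is not None and max_glucose_mg_dl > 200
    )
    return sum([has_diabetes_code(codes), on_diabetes_medication, lab]) >= 2
```

Requiring two concordant indications reduces misclassification from a single rule-out code or an isolated elevated glucose reading.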

Genotyping and imputation

Genotypes for all participant samples from eMERGE-I, eMERGE-II and eMERGE-III were imputed using the Michigan Imputation Server64. The server uses the Minimac3 algorithm to impute missing genotypes and uses the Haplotype Reference Consortium reference panels65 (HRC1.1) as the reference set. The majority of samples from the 13 eMERGE sites were genotyped on the Human 660 Quad (eMERGE-I). Other genotyping platforms included the CytoSNP-850K BeadChip, the OmniExpress chip, the Affymetrix 6.0 array, and the Illumina MEGA among others. In this analysis, variants with an allelic R2 ≥ 0.3 and minor allele frequency (MAF) ≥ 0.05 were included. Additional QC filters were applied as described in case–control selection.
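The variant inclusion thresholds can be expressed as a small filter. This is a sketch only: the function names are hypothetical, and the minor allele frequency is assumed to be derived from the alternate allele frequency reported with the imputed data.

```python
def minor_allele_frequency(alt_allele_freq: float) -> float:
    """MAF is the frequency of whichever allele is rarer."""
    return min(alt_allele_freq, 1.0 - alt_allele_freq)


def passes_variant_qc(
    r2: float, alt_allele_freq: float, min_r2: float = 0.3, min_maf: float = 0.05
) -> bool:
    """Keep variants with allelic R2 >= 0.3 and MAF >= 0.05."""
    return r2 >= min_r2 and minor_allele_frequency(alt_allele_freq) >= min_maf
```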

Genetically determined ancestry

The set of ~ 99,000 unique imputed samples was analyzed by Principal Component Analysis (PCA) using the PLINK 2.0 software66. Variants with ≥ 0.05 MAF, missingness of ≤ 0.1 and LD-pruned R2 threshold of 0.7 were included in the multisample analysis. K-means clustering of Principal Component (PC) 1 and PC2, which identified three groups (corresponding to African ancestry, Asian ancestry and European ancestry), was used to assign the genetically determined ancestry of each sample. Genetically determined and self-identified ancestry were checked for concordance, and samples were ultimately grouped into African ancestry, Asian ancestry, and European ancestry clusters. IBD was calculated for all pairwise sample comparisons using the PLINK --genome function, and cryptic relatedness between samples was assessed as described in case/control selection.
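The clustering step can be illustrated with a plain implementation of Lloyd's K-means algorithm on (PC1, PC2) pairs. The text does not state which software performed the clustering, so this is a generic sketch; the function name and the caller-supplied starting centroids are assumptions.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (PC1, PC2) for one sample


def kmeans_2d(
    points: Sequence[Point], centroids: Sequence[Point], n_iter: int = 100
) -> List[int]:
    """Return a cluster label per point, starting from the given centroids."""
    centers = list(centroids)
    labels: List[int] = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels
```

With three starting centroids, the returned labels partition the samples into three groups, which would then be compared against self-identified ancestry as described above.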

GWAS

To identify genetic variants associated with CDI, we performed logistic regression-based association analyses for the case/control curated phenotype using PLINK 1.9067. All covariates and genotypes were used in the joint analysis of all participants, whereas the PC1 and PC2 covariates for the African and European ancestry-stratified analyses were derived from ancestry specific PCA analyses. An additive genotypic model of SNV genotypes coded as 0, 1 or 2 copies of the minor allele was used. The regional LD plots of the index SNV were created using the LocusZoom web-based tool68. Following the initial stratified analyses, an additional logistic regression-based association analysis was performed in the European sample using the index SNV as a covariate to determine whether this SNV was truly driving the risk association.
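For reference, the additive logistic model described here can be written out explicitly. The notation is introduced for illustration and is not taken from the original analysis code: $Y_i$ is case status, $G_i$ is the minor-allele count at the tested SNV, and $x_{ij}$ are the covariates (including PC1 and PC2).

```latex
\operatorname{logit} \Pr(Y_i = 1)
  = \beta_0 + \beta_G G_i + \sum_{j} \gamma_j x_{ij},
  \qquad G_i \in \{0, 1, 2\}
```

Under this model, $\exp(\beta_G)$ is the odds ratio per additional copy of the minor allele. The conditional analysis corresponds to adding a term $\beta_{\mathrm{idx}} G_i^{\mathrm{idx}}$ for the index SNV genotype and testing whether $\beta_G$ for other variants remains different from zero.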

HLA association analyses

Classical HLA alleles were imputed against four ancestry-specific reference panels (African, Asian, European and Hispanic) using the HIBAG software69. HLA-DRB3, 4 and 5 gene dosages were inferred based on the HLA-DRB1 alleles present in each individual, as described in Habets et al.70. Calls were quality-filtered for a HIBAG posterior probability of > 0.5.
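The inference of DRB3, DRB4 and DRB5 gene dosage from HLA-DRB1 calls can be sketched as a lookup on the DRB1 allele group. The mapping below reflects the commonly described linkage of DRB1 allele groups with the DR51, DR52 and DR53 haplotype families and is an assumption made for illustration; it is not reproduced from Habets et al., and the function names are hypothetical.

```python
from typing import Dict, Iterable

# Assumed DRB1 allele-group linkage: DR52 -> DRB3, DR53 -> DRB4, DR51 -> DRB5.
# DRB1*01, *08 and *10 haplotypes are assumed to carry none of the three genes.
DRB1_GROUP_TO_GENE = {
    "03": "DRB3", "11": "DRB3", "12": "DRB3", "13": "DRB3", "14": "DRB3",
    "04": "DRB4", "07": "DRB4", "09": "DRB4",
    "15": "DRB5", "16": "DRB5",
}


def drb345_dosage(drb1_alleles: Iterable[str]) -> Dict[str, int]:
    """Count DRB3/4/5 copies implied by DRB1 alleles such as '15:01'."""
    dosage = {"DRB3": 0, "DRB4": 0, "DRB5": 0}
    for allele in drb1_alleles:
        # Accept either '15:01' or 'DRB1*15:01' and keep the two-digit group.
        group = allele.split("*")[-1].split(":")[0]
        gene = DRB1_GROUP_TO_GENE.get(group)
        if gene:
            dosage[gene] += 1
    return dosage


def passes_hibag_filter(posterior_prob: float, threshold: float = 0.5) -> bool:
    """Keep calls whose posterior probability exceeds the threshold."""
    return posterior_prob > threshold
```

For instance, a participant typed as DRB1*15:01 and DRB1*03:01 would be assigned one copy each of DRB5 and DRB3 under this mapping.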

To test for haplotype-specific effects of the most significantly associated SNVs, four overlapping participant subgroups were selected from the European ancestry sample based on the presence of at least one of the following: 1. DRB3 gene; 2. DRB4 gene; 3. DRB5 gene; or 4. any of the above genes in each participant. Haplotype subgroups were further divided into DR15 and DR16 haplotype carriers (stemming from the DRB5 gene carriers, or DR51 haplotype family), and DRB1*15:01 carriers (stemming from the DR15 haplotype). Logistic regression-based association analysis was performed separately in each haplotype subgroup, using the same covariates described in “Methods: GWAS” for the European ancestry sample.

To test for HLA alleles driving the association, case–control logistic regression-based association analysis was performed in the European ancestry sample for 276 classical HLA alleles, using the same covariates described in “Methods: GWAS” for the European ancestry sample. The CEU Chromosome 6 LD dataset from the HapMap 3 project was used to assess LD of the most significantly associated SNVs among classical HLA alleles.