Introduction

Disease syndromes caused by infectious agents have occurred throughout the history of modern humans1. As a result of our continued interactions with pathogens, our genomes have been shaped through processes of co-evolution, with pathogen-imposed selection pressures leading to selection signatures in ancient and modern human genomes2. As one of several illustrative examples, genetic diversity involving human red blood cell structure and function is being impacted by an evolutionary arms race with malaria that is reciprocally seen in the parasite genome3. In the era of modern medicine with antimicrobial drugs and powerful organ support systems, we are curbing the selection pressure exerted through the traditional high mortality associated with many of these infectious agents in resource-rich settings, but the worldwide health-care burden of infectious diseases remains substantial4. This is exemplified by the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)5,6 and the rapid evolution into a global COVID-19 pandemic requiring coordinated global responses. Additionally, the ongoing and dynamic challenges of managing malaria, tuberculosis (TB) and human immunodeficiency virus (HIV) infection alongside the constant threat of global influenza and other outbreaks such as Ebola virus disease, all particularly afflicting underserved populations, remind us of the importance of advancing infectious disease medicine.

A seminal study of adoptees in the 1980s reported increased risk of death from infectious disease in children whose biological parents succumbed to an infectious disease7, highlighting the significance of human genetic background in susceptibility to infections. As genetic technology has developed, mirroring other disease syndromes, a raft of genetic loci influencing susceptibility to infectious diseases have been discovered through genome-wide association studies (GWAS), including for HIV infection, TB, hepatitis and malaria8 (Table 1). Such studies aim to identify common variants on the basis of a polygenic model of complex multifactorial traits. However, it has become clear that, although human genetics play a role in disease susceptibility, in contrast to other syndromes, GWAS of case–control design may not be the most powerful method to tease out complex host–pathogen relationships. Alongside developments in genotyping technologies enabling GWAS, the in-depth understanding of single-gene disorders facilitated by next-generation sequencing technologies has enabled us to better understand the functional mechanisms of host defence to infection through experiments of nature9 resulting in loss of function (LOF) or, in increasingly recognized instances, gain of function (GOF) of discrete genes (Fig. 1). Here, rare mutations of variable penetrance result in single-gene inborn errors of immunity and lead to life-threatening infectious disease (Table 2). More recently, the development and availability of multi-omics experimental techniques promise to deliver similar mechanistic understanding of the functional importance of genome-level variation at genomic loci implicated in disease pathogenesis.

Table 1 Overview of genome-wide association studies involving major infectious diseases
Fig. 1: Signalling pathways crucial to the immune response and consequences of inborn errors of immunity for infectious disease.

Examples of specific proteins are shown (highlighted in colour), which when present as mutants give rise to monogenic inborn errors of immunity, with the main infectious disease phenotypes noted. a | Pattern recognition receptors (PRRs) responsible for detecting pathogen-associated molecular patterns (PAMPs) and damage-associated molecular patterns (DAMPs) with examples of PRR pathways illustrated for retinoic acid-inducible gene I protein (RIG-I)-like receptor (RLR) and Toll-like receptor (TLR). b | Receptor-interacting protein kinase (RIPK) signalling. RIPK1 and RIPK3 regulate inflammation and cell death (necroptosis). c | Interferon pathways. Type I interferons (for example, interferon-α (IFNα) and IFNβ) and type II interferons (for example, IFNγ) regulate the immune response to viral and bacterial infections. d | Antigen presentation pathways. Major histocompatibility (MHC) class I molecules present antigens (derived from intracellular proteins such as viruses and some bacteria) to cytotoxic CD8+ T cells via the endogenous pathway (left). MHC class II molecules present antigens from bacteria, parasites and other extracellular pathogens endocytosed into antigen-presenting cells to CD4+ T cells via the exogenous pathway (right). 
CARD, caspase activation and recruitment domain; CLIP, class II-associated invariant chain peptide; CMV, cytomegalovirus; CTD, carboxy-terminal domain; ERK, extracellular signal-regulated kinase; GAS, gamma-activated sequence; HSV, herpes simplex virus; IFNAR, interferon-α/β receptor; IKK, IκB kinase; IRAK, interleukin-1 receptor-associated kinase; IRF, interferon regulatory factor; ISGs, interferon-stimulated genes; ISRE, interferon-stimulated response element; JNK, JUN amino-terminal kinase; MAPK, mitogen-activated protein kinase; MAPKKK, mitogen-activated protein kinase kinase kinase; NF-κB, nuclear factor-κB; TAB, transforming growth factor-β-activated kinase 1 (MAP3K7)-binding protein; TAK, transforming growth factor-β-activated kinase; TAP, transporter associated with antigen processing; TCR, T cell receptor; TIRAP, Toll–interleukin-1 receptor domain-containing adaptor protein; TRAFs, tumour necrosis factor receptor-associated factors; TRAM, Toll–interleukin-1 receptor domain-containing adapter-inducing interferon-β (TRIF)-related adaptor molecule; TRIF, Toll–interleukin-1 receptor domain-containing adapter-inducing interferon-β.

Table 2 Inborn errors of immunity including primary immunodeficiency disorders and Mendelian/monogenic infection susceptibilities

This Review aims to summarize the insights gained from the wide range of -omics approaches being used to understand genetic susceptibility to infectious disease. We begin by discussing recent findings and limitations of GWAS applied to infectious diseases, in particular the need for application in diverse global populations. We focus on lessons learnt from different infections and on a region of the genome consistently linked to infectious disease susceptibility: the major histocompatibility (MHC) locus encoding the vital immune human leukocyte antigen (HLA) genes (Box 1). We then proceed to discuss novel approaches, many of which involve large-scale sequencing, and highlight the importance of monogenic architecture leading to infectious disease susceptibility, although only selected examples will be covered as this topic was reviewed comprehensively recently9. This is followed by exploration of the potential of future integrated analyses of human–pathogen variation and application of multi-omics platforms. We postulate how these findings may inform our understanding of the genetic determinants of infectious disease susceptibility, how this knowledge may in turn improve our understanding of individual immune and vaccine responses, and how this might be directed more universally for the benefit of human health. Finally, we discuss strategic approaches to dissecting host genetic factors in COVID-19 as a case study in how such work is being implemented as part of global efforts to address this emerging public health threat.

Common-variant associations: HLA and beyond

For major infectious diseases, including HIV-1 infection, TB, malaria, leprosy and viral hepatitis, genetic evidence has been accrued primarily through GWAS, aiming to identify disease-associated common variants in populations on the basis of a polygenic model of complex multifactorial traits. Recent work involving these traits has highlighted the progress and challenges that continue to be associated with mapping and functionally interrogating genetic associations through GWAS.

Genetic factors are hypothesized to contribute to the observed heterogeneity in response to infectious disease, and twin studies are an important source of information to estimate heritability while recognizing the risks of overestimation10. It is also important to note that while ongoing work investigating rare and structural genomic variants may help further address the ‘missing heritability’ conundrum — the disparity between overall trait heritability and observed effects through GWAS — this has been reconciled mostly by the recognition that genetic predisposition in most common traits is spread and shared across large numbers of variants (in the thousands), each with a typically modest effect11. Understanding of the genetic architecture of disease predisposition continues to develop, with an omnigenic model for complex traits proposing genetic contribution from both core genes with strong genetic evidence and very many networked peripheral genes of very low effect size involving regulatory circuits12,13.
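As a rough illustration of how twin data can yield a heritability estimate, the sketch below uses Falconer's classical approximation, which doubles the difference between monozygotic (MZ) and dizygotic (DZ) twin correlations. This estimator is not named in the text and is assumed here for illustration only; the function names and the twin correlations are hypothetical.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's approximation of heritability: h^2 = 2 * (r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)


def shared_environment(r_mz: float, r_dz: float) -> float:
    """Shared-environment component under the same model: c^2 = 2 * r_DZ - r_MZ."""
    return 2.0 * r_dz - r_mz


# Hypothetical trait correlations within MZ and DZ twin pairs (not real data)
r_mz, r_dz = 0.60, 0.40
h2 = falconer_heritability(r_mz, r_dz)
c2 = shared_environment(r_mz, r_dz)
print(f"h2 = {h2:.2f}, c2 = {c2:.2f}")
```

The estimate relies on assumptions such as MZ and DZ pairs sharing their environments to the same degree; if MZ twins share more, the formula will tend to overestimate heritability, which is one reason such figures are interpreted cautiously.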

Of particular interest in infectious disease GWAS has been the role of genetic variation at the locus encoding HLA genes in determining the individual response to infection, given the role of encoded MHC molecules in binding and presenting antigens and thus determining which antigens are ‘seen’ by the immune system (Box 1). However, the extreme polymorphism of this region, high gene density, complex linkage disequilibrium and regions of homology make fine mapping genetic associations involving HLA very challenging14.
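To make the notion of linkage disequilibrium concrete, the following minimal sketch computes the pairwise r² statistic between two biallelic loci from phased haplotypes. It is a toy calculation under stated assumptions (phased 0/1 alleles, hypothetical haplotypes), not a representation of how HLA fine mapping is performed in practice.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) tuples, where a and b are 0/1 alleles
    carried on the same phased haplotype at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical haplotypes: alleles always co-inherited versus fully independent
perfect = [(1, 1)] * 4 + [(0, 0)] * 6
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3
r2_perfect = ld_r_squared(perfect)
r2_independent = ld_r_squared(independent)
```

When r² approaches 1, as in the first toy example, association signals at the two loci are statistically indistinguishable, which is why tightly co-inherited variants are difficult to separate by association alone.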

Across the genome, a substantial majority of genetic variants identified by GWAS are located in non-coding regions of the genome, requiring complementary functional genomic annotations to link such variants to modulation of individual genes or pathways. This is key to understanding the functional basis of such associations, in particular the specific modulated genes or pathways that may represent potential drug targets. Knowledge of network connectivity substantially enhances the ability to then identify nodal genes that may be the best therapeutic targets15. Prior genetic evidence for targets substantially increases the success rate in late-stage clinical trials16. As well as supporting the development of new drugs, it provides important opportunities to support repurposing existing drugs where a genetic association with a target is shared across traits, indicating new indications for such approved drugs17.

Genetics can also enable analytical approaches such as Mendelian randomization that leverage genetic variation as instrumental variables to establish causal relationships between non-genetic (modifiable) risk factors and disease18. A further example of translational relevance is the opportunity for genetic evidence to potentially stratify risk; for example, to identify high-risk individuals for whom early or pre-emptive intervention may be appropriate. Aggregated genetic risk through polygenic risk scores and consideration of large-effect rare variants are gaining traction as useful clinical tools19, with reports, for example, of application to dengue fever20.
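The idea of aggregating genetic risk can be sketched as a weighted sum of risk-allele dosages. The example below assumes the simplest additive form of a polygenic risk score, with per-variant weights taken to be effect sizes such as log odds ratios from a GWAS; the variant names, weights and genotypes are hypothetical placeholders.

```python
def polygenic_risk_score(dosages, weights):
    """Additive score: sum over variants of (effect size x risk-allele dosage).

    dosages: variant id -> number of risk alleles carried (0, 1 or 2)
    weights: variant id -> per-allele effect size (for example, a log odds ratio)
    """
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    return sum(weights[v] * dosages[v] for v in weights)


# Hypothetical effect sizes and one individual's genotypes (not real data)
weights = {"variant_1": 0.10, "variant_2": -0.05, "variant_3": 0.20}
person = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
score = polygenic_risk_score(person, weights)
print(f"polygenic risk score = {score:.2f}")
```

A raw score of this kind is usually only meaningful relative to the distribution of scores in a reference population of matching ancestry.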

HIV-1 infection

The two genomic regions most consistently associated with HIV infection outcomes are the MHC21 and the CCR5 locus (CCR5 is a coreceptor for the virus)22,23,24. Arguably, HIV-1 infection is the disease with the greatest insights into HLA–disease association pathogenesis, including seminal early advances supporting the hypothesis of ‘heterozygous advantage’, with individuals with greater heterozygosity of class I alleles (HLA-A, HLA-B and HLA-C) having delayed progression to AIDS25. A more recent discovery has demonstrated that differential expression of HLA-A contributes to disease progression through HLA-E-mediated effects, implicating the interaction between HLA-E and the NKG2A natural killer cell inhibitory receptor as a therapeutic target26. Both HLA-B and HLA-C have been consistently implicated in HIV control, but strong coinheritance or linkage disequilibrium in the MHC region has made disentangling the contributions of each difficult.

The repeated identification of HLA associations in HIV-1 GWAS has contributed, in part, to a catalogue of work focusing on the killer immunoglobulin-like receptor (KIR) genes that encode a family of highly polymorphic activating and inhibitory receptors expressed mainly by natural killer cells that interact with HLA. There has been a surprising lack of KIR gene associations in GWAS to date, which may be due to a multitude of reasons, including substantial diversification of KIR gene haplotypes, incomplete knowledge of the nature and diversity of such haplotypes across populations, and poor coverage of current genotyping arrays for this region, making it difficult to pinpoint specific genetic variants across populations. Thus, this region could represent an archetypal complex region that we have yet to accurately tag in even the most comprehensive of modern GWAS. Moreover, the effects of KIR gene variants may be detected only in the presence of their specific HLA ligand. As evidence for this effect, KIR3DL1 allotypes expressed highly in combination with specific HLA-B alleles (Bw4) were found to slow disease progression when compared with a low-expression allotype or lack of the Bw4 allele27. By contrast, homozygosity for the CCR5Δ32 mutation has been the most robustly confirmed genetic background for protection against HIV28, and antisense long non-coding RNA CCR5AS, which affects CCR5 mRNA stability and cell surface expression, has recently been shown to influence HIV infection susceptibility29. Furthermore, knowledge of the CCR5 pathway has led to the successful development of a class of antiviral agents targeting the entry of HIV-1 into target host cells.

Tuberculosis

Despite some of the earliest evidence for TB susceptibility being driven by genetic heterogeneity30,31, a consistent genetic signal of association at any one locus has yet to be demonstrated. The largest study to date was undertaken in a combined sequencing and genotyping effort including 8,162 cases of combined presentations of TB in the Icelandic population32. This study identified variants in the MHC class II region as being associated with TB susceptibility, perhaps emphasizing the need to undertake GWAS in highly homogeneous populations when tackling ancient pathogens such as Mycobacterium tuberculosis. By contrast, other GWAS applied to TB of varied organ involvement in diverse populations have demonstrated association of pulmonary TB with variants in ESRRB and TGM6 in a Chinese population33 and ASAP1 in a Russian population34, although the biological basis by which these genes mediate an effect in TB is not always clear. For example, TB mouse models with CRISPR–Cas9 deletion of TGM6 show increased cytokine transcripts in the lungs, greater tissue damage and higher bacterial burden, but whether the cytokine levels are directly due to altered TGM6 levels is yet to be established.

The lack of consistent signals in TB GWAS highlights the enormous challenges in undertaking multinational studies, where discrepant observations may be attributed to population structure at the host or pathogen level, or to technical differences such as the genotyping array used. A meta-analysis of these studies that might help explain the observed inconsistencies and inform future research in the area is eagerly awaited.

Leprosy

By contrast, GWAS of leprosy, which is caused by the related species Mycobacterium leprae35, have identified many susceptibility genes that, unsurprisingly, implicate the immune system, most notably the HLA-DR region36. Whereas a GWAS in an Indian cohort found HLA-DR, HLA-DQ and TLR1 (encoding Toll-like receptor 1) to be associated with susceptibility to leprosy37, a further GWAS of Chinese patients failed to replicate the TLR1 association despite adequate power, implicating population-specific effects38. This is again in line with the hypothesis that the genetic variation responsible for variability in the infectious disease trait (human genetic architecture) has been subjected to selection pressure from pathogens, as different populations would have been exposed to different infective agents. The large size of the case–control cohorts has enabled the identification and downstream functional characterization of some of the non-HLA loci; for example, for NOD2, where the single-nucleotide polymorphism (SNP) rs1981760 is associated with differential NOD2 gene expression by being an expression quantitative trait locus (eQTL) in neutrophils39. The SNP acts by affecting binding of the transcription factor STAT3 and influencing neutrophil interferon-β (IFNβ) responses39, an effect that has also been reported in TB40.

Malaria

This parasitic infection is perhaps the most famous example of host genetics influencing infectious disease susceptibility, with haemoglobinopathies — such as sickle haemoglobin (HbS), thalassaemia and glucose 6-phosphate dehydrogenase (G6PD) deficiency — long being associated with protection against malaria. As one example of an underlying biological mechanism, haemoglobin oxidation products in HbS affect the ability of the Plasmodium falciparum parasite to hijack red blood cell cytoskeletal remodelling41. Balancing selection, illustrated by the heterozygote advantage conferred by HbS, provides an important mechanism for maintaining advantageous genetic diversity in human populations42.
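How heterozygote advantage maintains an otherwise deleterious allele can be illustrated with a textbook single-locus selection model. In the sketch below, the fitness of the heterozygote is set to 1, the common homozygote to 1 - s (for example, a cost of malaria) and the variant homozygote to 1 - t (for example, a cost of sickle cell disease). The model, the selection coefficients and the function names are illustrative assumptions and are not estimates from the studies cited.

```python
def next_allele_frequency(q, s, t):
    """One generation of selection on the variant allele (frequency q).

    Relative fitnesses: common homozygote 1 - s, heterozygote 1, variant homozygote 1 - t.
    """
    p = 1.0 - q
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    # Variant alleles passed on by heterozygotes (p*q) and variant homozygotes (q*q)
    return (p * q + q * q * (1.0 - t)) / mean_fitness


def simulate(q0, s, t, generations):
    """Iterate the recursion from starting frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_allele_frequency(q, s, t)
    return q


# Hypothetical selection coefficients (not fitted to data)
s, t = 0.1, 0.8
q_final = simulate(0.01, s, t, 2000)
q_expected = s / (s + t)  # analytical equilibrium of this model
print(f"simulated q = {q_final:.4f}, expected q = {q_expected:.4f}")
```

In this model the allele neither fixes nor disappears: it settles at s / (s + t), so both alleles persist for as long as the selective pressure from the pathogen remains.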

Malaria is an example where modern GWAS have confirmed several loci previously identified largely by candidate gene or protein studies, albeit still not most of them; most notably, despite many early observations of associations with HLA, GWAS projects have not found equivalent evidence. The Malaria Genomic Epidemiology Network (MalariaGEN)43 has successfully identified or confirmed genetic loci associated with severe Plasmodium falciparum malaria across Africa, Asia and Australasia, including HBB (β-globin gene), ABO (gene for glycosyltransferase enzyme that determines ABO blood groups), ATP2B4 (gene for a calcium transporter on the erythrocyte membrane), G6PD (activity levels of the enzyme affect oxidative stress in red blood cells) and CD40LG (a key gene in B cell immune responses). A notable finding was that among both different phenotypes and different populations, the heterogeneity of effect of risk alleles was substantial, possibly explaining the inability to replicate some of the previous genetic findings, together with pathogen diversity. The consortium further identified a locus neighbouring glycophorin genes (encoding erythrocyte cell surface receptors that allow Plasmodium falciparum invasion)44 to be implicated in malaria susceptibility. This finding highlights the possibility that loci conferring resistance to malaria involve polymorphisms with features of ancient balancing selection based on haplotype sharing between humans and chimpanzees45, in keeping with the hypothesis of a protracted evolutionary arms race between host and pathogen leading to the host and pathogen genetic diversity that we see today. The link between glycophorin genes and malaria resistance was further elucidated in a study using next-generation sequencing data and reference haplotypes that identified a complex structural rearrangement involving GYPA and GYPB gene copy number that reduced the risk of severe malaria by 40% and that explained the previous association signal46.

Viral hepatitis

Whereas three-quarters of patients with acute hepatitis C virus (HCV) infections eventually reach chronicity, the remaining quarter clear the virus spontaneously. As a key example of a strong GWAS signal not related to HLA with major translational impact, polymorphisms in the gene encoding IFNλ3 (IFNL3; formerly known as IL28B) have been associated with resolution of HCV infection47 and with differences in response to HCV drug treatment48 (viral clearance following treatment with pegylated interferon and ribavirin) in individuals of both European and African ancestry, enabling development of a precision medicine approach to therapy49 (Fig. 2). HLA class II associations with spontaneous HCV clearance have been reported by GWAS50. In a separate recent GWAS of African, European and Hispanic populations, HLA class II signals were replicated, while loci around the IFNλ genes and GPR158 were also implicated in spontaneous viral clearance51.

Fig. 2: Precision medicine approaches in infectious disease informed by human genetics.

Examples of how understanding of human genetic information can be or has begun to be leveraged for improved patient care. Strategies include identifying specific molecular targets based on genetic understanding, stratifying patients to decide on use of certain drugs and using genetic knowledge to predict severe adverse reactions to medications. GWAS, genome-wide association studies; HCV, hepatitis C virus; HIV, human immunodeficiency virus; JAK, Janus kinase.

Novel approaches and technologies

Human population diversity

Lessons learnt in infectious disease susceptibility GWAS to date highlight several principles for increasing the chances of success. A key emerging area is the extent of population-specific associations genome-wide (Table 1). This in turn emphasizes the relative paucity of GWAS undertaken in diverse worldwide populations, many of which have the highest burden of infectious disease. Greater efforts will be needed to focus GWAS and other genetic studies on such populations, especially where ethnic background may be associated with differential risk52. It is thus imperative that genotyping arrays are designed for worldwide populations and that the availability of large imputation reference panels from diverse populations enables effective statistical inference of additional non-genotyped SNPs (thereby substantially increasing the informativeness of GWAS at no additional cost)53,54,55. New technologies such as those based on low-coverage whole-genome sequencing may provide an alternative cost-effective approach for genotyping common and rarer genetic variants56.

As is apparent from the studies reported so far herein, infections can afflict populations limited by geography and local ecology, and it is therefore crucial that genetic associations are tested for in non-European cohorts. However, such analyses come with technical challenges owing to the need to adjust the data for potential confounding by differences in population structure or relatedness that may cause an excess of type I errors. These aspects have been appreciated for some time, and many methods have been developed to facilitate such adjustments. One method is principal component analysis (a mathematical method to summarize the main sources of variance in data) to model ancestry differences and account for differing allele frequencies in different populations57. A further development is in statistical modelling using linear mixed model algorithms that can account for population structure and cryptic relatedness simultaneously58.
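The use of principal component analysis to capture ancestry can be sketched on simulated data. The example below generates genotypes for two hypothetical populations with very different allele frequencies and extracts the first principal component across individuals by power iteration. It is a deliberately small, standard-library-only illustration; the function names, sample sizes and allele frequencies are assumptions, and real analyses would use dedicated software on far larger data.

```python
import random


def simulate_genotypes(n_per_pop, n_snps, freq_a, freq_b, seed=0):
    """Simulate 0/1/2 allele counts for two populations with different allele frequencies."""
    rng = random.Random(seed)
    rows = []
    for freq in (freq_a, freq_b):
        for _ in range(n_per_pop):
            rows.append([int(rng.random() < freq) + int(rng.random() < freq)
                         for _ in range(n_snps)])
    return rows


def first_principal_component(genotypes, iterations=200):
    """Return one PC1 score per individual (rows of the genotype matrix)."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre each SNP by subtracting its mean allele count
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Individual-by-individual covariance matrix
    cov = [[sum(x * y for x, y in zip(centred[i], centred[k])) / m
            for k in range(n)] for i in range(n)]
    # Power iteration converges to the leading eigenvector of cov
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(cov[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


# Two hypothetical populations of 10 individuals each, typed at 30 SNPs
genotypes = simulate_genotypes(n_per_pop=10, n_snps=30, freq_a=0.1, freq_b=0.9)
pc1 = first_principal_component(genotypes)
print([round(x, 2) for x in pc1])
```

In this toy setting the sign of the PC1 score separates the two populations; in an association analysis, the leading components would typically be included as covariates so that allele-frequency differences between ancestries are not mistaken for disease associations.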

The Population Architecture Using Genomics and Epidemiology (PAGE) Consortium has demonstrated both the power and the importance of accounting for multi-ethnic and diverse populations in various non-infectious disease and physical trait settings; a GWAS conducted by the consortium demonstrated that in different populations, the effect size of implicated loci differed, exemplifying the need to account for population structure59. Any initiatives planning to develop personalized medicine in infectious diseases based on GWAS will therefore have to depend on the availability of genetic studies that are commensurately precise in a population context and that include an appreciation of the genomic complexity of regions such as the MHC, KIR and immunoglobulin heavy chain (IGH) regions.

Leveraging self-report

The maturation of methods for undertaking GWAS has been paralleled by improvements in methods for recruiting and testing associations. A key study from the personal genetic provider 23andMe used self-report to assess the influence of host genetic background on susceptibility to a range of common infectious diseases of low mortality (for example, common cold, rubella and cold sores)60. Including more than 200,000 individuals of European descent, this study reported many genetic loci related to the immune system and embryonic development, including genes both within and outside the MHC. Despite the large number of associations identified, the use of surveys requiring retrospective recall to identify cases will require follow-up replication and validation. A similar method was used for whooping cough, where self-report of the characteristic cough in childhood was associated with variation across the MHC61.

Phenome-wide association studies

Another development in recent years has been to establish the phenome-wide association study (PheWAS) approach. This has been facilitated by the use of electronic health records for clinical phenotyping. Instead of taking a single disease phenotype and scanning genome-wide as in GWAS, this approach takes a number of genetic loci (usually based on a priori biological knowledge) and looks for associations with any of a large number of phenotypes. Although revealing for many immune-related diseases, it has had limited use thus far in the field of infectious diseases. A PheWAS on HLA predictably highlighted autoimmune diseases, but found only a relatively small number of associations with infectious diseases62. Given the known key role of HLA genes in infectious diseases, it is likely that this was due to a lack of statistical power or challenges in defining cases from controls using hospital data.
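The inversion of the GWAS design described above can be sketched as a loop over phenotypes for a single variant. The example below assumes a simple carrier versus non-carrier comparison, a Pearson chi-square test with one degree of freedom and a Bonferroni correction for the number of phenotypes tested. The counts and phenotype names are hypothetical, and real PheWAS analyses typically use regression models with covariates.

```python
import math


def chi2_p_value(a, b, c, d):
    """Pearson chi-square test (1 d.f., no continuity correction) on a 2x2 table.

    a, b: carriers with / without the phenotype
    c, d: non-carriers with / without the phenotype
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def phewas(tables, alpha=0.05):
    """Test one variant against every phenotype; keep Bonferroni-significant hits."""
    threshold = alpha / len(tables)
    p_values = {name: chi2_p_value(*counts) for name, counts in tables.items()}
    return {name: p for name, p in p_values.items() if p < threshold}


# Hypothetical counts derived from electronic health records (not real data)
tables = {
    "phenotype_A": (60, 140, 300, 1500),
    "phenotype_B": (20, 180, 180, 1620),
    "phenotype_C": (25, 175, 200, 1600),
}
hits = phewas(tables)
print(hits)
```

Because the significance threshold tightens with every additional phenotype tested, modest effects or imprecisely defined cases can easily fall short of it, consistent with the limited power noted for infectious disease phenotypes.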

Nevertheless, one example of a PheWAS in infectious disease genetics identified an association between the rs3211783 SNP in the factor X gene, encoding an enzyme of the coagulation cascade, and infection-related phenotypes63. Recent evidence for the role of factor X in antibacterial immunity has emerged64, and more studies will be needed to confirm this role, but the study provides an example of what may come in the field of infectious disease genetics through future PheWAS.

Genome sequencing and large-scale bioresources

Standard GWAS and imputation methods are unable to test the effects of rare, large-effect variants, as these variants are not covered by genotyping arrays and are not common enough in haplotype references. To overcome this challenge, direct whole-genome sequencing (WGS) or whole-exome sequencing (WES) approaches have been used. In one example, WES was first performed to identify rare alleles that are relevant in West Nile neuroinvasive disease, enabling subsequent imputation of the rare alleles into a primary cohort. The findings were then extended to larger cohorts for genome-wide association analysis. Loci revealed as being important included HERC5, an interferon-stimulated gene, and an intergenic region between CD83 and JARID2, which included a site for STAT5A transcription factor binding. With very few other such examples in the literature for any trait, it is becoming increasingly clear that the successful application of these methods will require hundreds of thousands to millions of individuals65.

To this end, several population-level bioresources and personal genetic services such as the UK Biobank, the China Kadoorie Biobank and the 100,000 Genomes Project — which are large repositories of samples and data — are enabling the use of novel approaches and enhancing the quality of data analysis using existing methods. In light of early discoveries afforded by the availability of genetic data from these studies66, it will be exciting to see how they may provide further insights into infectious disease as they mature.

Long-read sequencing technologies and highly polymorphic loci

A GWAS of rheumatic heart disease in Pacific Island populations recently identified a signal in the IGH locus67. This disease, which occurs following Streptococcus pyogenes infection, used to have a more widespread worldwide prevalence and is now rather limited to underserved populations across the Pacific. This intriguing finding in the IGH region has long been suspected for many diseases but has been infrequently observed. Studying the IGH region has been challenging owing to its high level of complexity (including substantial copy number variation68) and the sparse coverage of this locus on genotyping arrays. However, with the advent of new long-read sequencing and other technologies, it is likely that we will better capture the extent of IGH diversity and possible disease associations, similarly to what has been achieved for regions such as the MHC and KIR regions69.

New insights from inborn errors of immunity

Another key route to understanding infectious disease susceptibility is through the study of inborn errors of immunity that disrupt host defence against infection. In contrast to more common variants with small to moderate effect sizes identified by GWAS, rare mutations (of variable penetrance) typically result in a large phenotypic effect with substantial infections for the individual that may involve multiple pathogens. Such analysis has been highly informative in revealing mechanisms of infectious disease pathogenesis (Fig. 1; Table 2). Intersection with GWAS is seen: for example, single-gene defects in complement components result in increased susceptibility to meningococcal disease70,71, while GWAS identified SNPs within complement factor H (CFH) and CFH-related protein 3, consolidating evidence for the role of complement in host defence against meningococcal infection72. Application of WGS and WES has now identified more than 400 genetic causes73 of inborn errors of immunity, many of which cause primary immunodeficiencies (PIDs), including genes of known and novel mediators of innate and adaptive immunity (Table 2) mutated in various manners leading to the full range of mechanisms for Mendelian inheritance patterns. One notable advancement in this area has been the recognition of disease-causing genes that do not necessarily lead to overt immunological abnormalities, with differences in penetrance and specificities in clinical infection phenotype among other features9 (Table 2).

The link between certain immune pathways and specific clinical infection phenotypes has been well documented. For example, the TLR1/2/6–Toll–interleukin-1 receptor domain-containing adaptor protein (TIRAP)–nuclear factor-κB (NF-κB) pathway influences invasive pneumococcal disease (Fig. 1a), whereas the TLR3–interferon pathway affects susceptibility to herpes simplex virus encephalitis8,74,75,76,77,78 (Fig. 1b,c). However, other studies are highlighting non-canonical immune pathways.

As an example of a PID, recent studies have demonstrated a critical role for RIPK1 in susceptibility to infections. RIPK1 is a key regulator of cell death and survival and, in the case of cell death, mediates either apoptosis or necroptosis. In a study of a small group of patients with recurrent infections, early-onset inflammatory bowel disease and progressive polyarthritis, WES identified homozygous LOF mutations79, while biallelic mutations in RIPK1 were found in a further series of patients with very early onset inflammatory bowel disease; the patients all had recurrent bacterial and viral infections and in some instances sepsis80.

Haploinsufficient mechanisms have also been identified, for example, involving BACH2 (ref.81), which is key in controlling B and T cell maturation and functional specification. In a case of a PID with recurrent upper respiratory tract infections and early-onset colitis, a heterozygous non-synonymous mutation in BACH2 resulted in reduced ability of the protein to dimerize or localize to the nucleus. On the other hand, GOF mutations leading to infectious disease susceptibility highlight opportunities for precision medicine. In patients with heterozygous GOF mutations in STAT1 — who have fungal (persistent oral candidiasis) and viral (CMV) infection and in one case TB (Fig. 1c) — the resultant excessive CD4+ T cell STAT1 phosphorylation after IFNβ stimulation highlighted the opportunity for therapeutic suppression by the Janus kinase (JAK) inhibitor ruxolitinib82. A separate study of patients affected by STAT1 GOF mutations and chronic mucocutaneous candidiasis showed successful treatment with ruxolitinib83 (Fig. 2).

Chronic granulomatous disease arises owing to the inability of phagocytes to mount a reactive oxygen species burst in response to pathogens and is characterized by severe recurrent bacterial and fungal infections84. This PID is often caused by mutations in the genes encoding the components of the NADPH oxidase complex. A homozygous LOF mutation in CYBC1 was identified as a further cause85, revealing a novel function for CYBC1 as a chaperone for one of the components of the NADPH oxidase complex.

Studies of patients with extremes of response to viral infections have provided mechanistic insights. In a case of recurrent viral infections requiring intensive care admission86, WES revealed a homozygous mutation in IFIH1 (encoding a retinoic acid-inducible gene I protein (RIG-I)-like helicase receptor responsible for sensing viral double-stranded RNA in the cytosol), whereas, in cases of life-threatening influenza, WES identified compound heterozygous mutations in IRF7, which reduce both type I and type III interferon production87. Influenza virus pneumonia may also arise owing to rare inborn errors of immunity involving IRF9 and TLR3 deficiencies88,89, whereas homozygous CCR5-null and FUT2-null alleles protect against CCR5-tropic HIV and norovirus, respectively22,23,24,90.

Mendelian susceptibility to mycobacterial disease can arise owing to mutations in genes encoding proteins involved in the production of, or response to, the cytokine IFNγ. Such mutations have important treatment implications; for example, the deficiency of IFNγ signalling can be ameliorated by therapeutic supplementation with IFNγ or haematopoietic stem cell transplantation, dependent on the degree of functional pathway responsiveness91. WES has revealed a novel cause of Mendelian susceptibility to mycobacterial disease in a case of homozygous splice-site mutations in SPPL2A (required to cleave CD74, the invariant chain of HLA class II); the mutation is proposed to disrupt functional polarization of T cells owing to altered priming by dendritic cells after mycobacterial antigen presentation92 (Fig. 1d).

One hurdle in monogenic disorders is incomplete penetrance. Polymorphism of TIRAP is known to influence susceptibility to invasive pneumococcal disease, TB, malaria and other bacterial infections93,94,95,96. In a study of a family with TIRAP deficiency, seven of eight individuals with the genetic defect did not have the severe staphylococcal infection that the index case did97. A possible explanation was that the index patient, unlike the other family members, lacked antibodies to lipoteichoic acid; such antibodies may allow monocytes in the unaffected family members to respond to lipoteichoic acid stimulation of TLR2 despite deficiency in TIRAP.

Integrating host and pathogen genetic variation

Although the impact of host genetic variation on susceptibility to infection is often viewed separately from that of pathogen genetics, it is highly likely that a major limitation in identifying novel genetic signals in GWAS applied to common infections will be attributable to heterogeneity at the level of the infectious agent. Approaches attempting to simultaneously interrogate host and pathogen transcriptomic responses (dual RNA sequencing) have been developed and applied to infectious diseases98,99,100; this joint characterization has also occurred at the genetic level to account for variability in both the host genome and the pathogen genome. In a Dutch cohort, a GWAS of both the host and the pathogen in pneumococcal meningitis allowed delineation of the contributions of host genetic background to the susceptibility and severity of disease versus the effects of pneumococcal genetic variation on invasive potential but not disease severity101. The study revealed an intronic SNP in a host ubiquitin-converting enzyme gene, UBE2U, to be significantly associated, with possible interaction with other host genes, including PGM1 (encoding phosphoglucomutase 1) and ROR1 (encoding a tyrosine-protein kinase transmembrane receptor). Similarly, a joint genetic study of HCV infection with human genome-wide genotyping and genome sequencing of HCV genotype 3 infections led to more biological insights, including understanding of reciprocal effects (for example, of human genetic variation in driving viral genome polymorphism)102. Rather than association tests using disease status, this study examined the associations between the host genetic background and variable sites of the viral proteome. Host HLA alleles were found to be associated with HCV amino acids leaving multiple ‘footprints’ in the HCV genome; and, in addition, a specific interaction between the HCV NS5A protein and human IFNL4 (encoding IFNλ4) genotype was found to affect viral load.
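
As an illustration of the kind of test that underlies such host–pathogen analyses, the following minimal sketch asks whether a viral amino acid variant is enriched among carriers of a host allele, using a one-sided Fisher's exact test on a 2×2 table. It is not the method of the studies cited above, which account for viral phylogeny and population structure; the function name and all counts are hypothetical.

```python
from math import comb

def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test on a 2x2 table (hypothetical helper).

    a: host-allele carriers whose virus has the variant residue
    b: host-allele carriers whose virus has the reference residue
    c: non-carriers whose virus has the variant residue
    d: non-carriers whose virus has the reference residue
    Returns (odds_ratio, p_value) for enrichment of the variant in carriers.
    """
    carriers, variant_total, n = a + b, a + c, a + b + c + d
    denom = comb(n, variant_total)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    p_value = sum(
        comb(carriers, k) * comb(n - carriers, variant_total - k)
        for k in range(a, min(carriers, variant_total) + 1)
    ) / denom
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

# Toy counts, invented purely for illustration.
odds_ratio, p_value = fisher_enrichment(8, 2, 1, 5)
```

In a real analysis, one such test would be run for every pairing of host allele and viral site, which is the source of the multiple testing burden discussed later in this section.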

Simultaneous evaluation of host and pathogen genetics was also used in HIV datasets to reveal the influence of HLA genes. By considering how a virus mutates throughout the course of its infection, the relevance of the host genetic background can be considered. A recent study analysed a large number of existing HIV datasets, using host HLA class I genotypes and pathogen reverse transcriptase and protease sequences and modelling the processes of within-host evolution of the virus103. Known allele–epitope combinations (for example, escape at codons 173/195 of reverse transcriptase when paired with the HLA-B*51 allele) were replicated, but many epitopes highlighted to be HLA associated were not previously reported. Deeper analysis of viral sequence subtypes showed that differential selection pressures between the same HLA allele (HLA-B*58) and the different viral subtypes occurred at a reverse transcriptase codon that differed between the two subtypes. HLA-B*58 selection pressure may have consolidated the divergence in viral sequence subtypes.

The selective pressure of other pathogens on human populations has been delineated by genome-wide analyses of natural selection. For example, for cholera, targets of positive selection were identified in a population from the Ganges River Delta, supporting strong selective pressure by the pathogen on innate immune pathways, including NF-κB and inflammasome signalling104.

Studies of dual-pronged interrogation of host and infectious agent genetics in malaria have yet to be conducted at a large scale, but there have been projects to characterize heterogeneity in Plasmodium falciparum, Plasmodium knowlesi and Plasmodium malariae, as well as in the malarial vector Anopheles gambiae, at genetic and gene expression levels105,106,107. These studies will be informative for drug resistance and vector control; for example, one study highlighted the need to collect vector genomic data to establish which strategies lead to insecticide resistance107.

Several major challenges remain for studies considering both host and pathogen genetic variation. Most notably, the multiple testing burden means that substantially larger sample sizes will be needed in future studies, although an important analytical strategy is dimensionality reduction, which takes advantage of genome-wide correlation between variants101. Additionally, understanding the impact of complex spatio-temporal and ecological scales requires careful acquisition of comprehensive metadata that can be used with genomic data108.
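
The scale of the multiple testing burden can be made concrete with a simple Bonferroni calculation. The sketch below is illustrative only: the function name and the variant counts are hypothetical, and Bonferroni correction is used here as a conservative stand-in rather than as the correction applied in any of the cited studies.

```python
def bonferroni_threshold(n_host_variants, n_pathogen_features, alpha=0.05):
    """Per-test significance threshold when every host variant is tested
    against every pathogen feature (Bonferroni correction)."""
    n_tests = n_host_variants * n_pathogen_features
    if n_tests <= 0:
        raise ValueError("need at least one test")
    return alpha / n_tests

# Hypothetical counts, chosen only for illustration.
pairwise = bonferroni_threshold(1_000_000, 1_000)  # every host-pathogen variant pair
reduced = bonferroni_threshold(1_000_000, 10)      # pathogen variation summarized as 10 components
```

Summarizing correlated pathogen variants as a small number of components reduces the number of tests and therefore relaxes the per-test threshold, which is the rationale for dimensionality reduction in this setting.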

Genetics of immune function phenotypes

In the evaluation of any disease, accounting for phenotype heterogeneity is of paramount importance. The use of various -omics technologies is a strategy to address this (Fig. 3). For example, transcriptomic analysis in sepsis has been able to stratify patients into different disease endotypes regardless of the source of infection109,110,111,112. Crucially, these endotypes have been shown to have interaction effects with steroid treatment, with evidence that patients of a relatively immunocompetent endotype have increased mortality on being given steroids (Fig. 2). This finding highlights the need for accurate phenotyping in future clinical trials where the effect of treatment may be diluted by heterogeneity in patient cohorts, and such improved phenotyping will also enable more successful application of GWAS.

Fig. 3: Omics and intermediate phenotypes as part of the toolkit for investigating the basis of infectious disease susceptibility.

a | Traditional case–control genome-wide association study (GWAS) approaches compare allele frequencies of genetic variants in cases versus controls. b | Mendelian disease mapping with pedigree analysis (including case–parent trio analyses) and use of whole-exome or whole-genome sequencing. c | Multi-omics approaches, which enable intermediate phenotypes to be quantified by various -omics technologies. d | Leveraging genetic information to interrogate or leverage intermediate phenotypes. Differences in intermediate phenotypes such as gene expression can be mapped to genetic variation by quantitative trait locus (QTL) mapping. Mendelian randomization methods can use intermediate phenotypes that are risk factors for disease, with genetic variants that affect the intermediate phenotype allocated randomly to allow confounders to also be randomly distributed.

Thus, aside from clinical disease phenotypes, innovative approaches to examining infectious disease genetics have investigated intermediate phenotypes that may be relevant to the disease (Fig. 3). For example, differential blood cell types are under substantial genetic influence113. Furthermore, levels of immunoglobulin are influenced not only by genetic variants in IGH genes, related immunoglobulin genes and the MHC but also by RUNX3 variants altering isoform shifting and FCGR2B variants impacting IgG binding to its receptor114. These approaches are now also being translated in the context of antigen specificity, including the role of genetic variation in agent-specific antibody responses, with a role for MHC identified in various responses, including anti-Epstein–Barr virus and anti-rubella virus IgG levels115,116. It will be interesting to determine what these markers of previous exposure may contribute to the understanding of acute infection risk.

In a different attempt to examine the role of genetic variants in controlling the human immune system, Roederer et al. performed detailed immunophenotyping in 669 female twins117. The top 151 heritable immune traits from extensive flow cytometry were analysed. For example, FCGR2A variants were associated with multiple immune phenotypes, including expression levels of T cell markers such as CD27 and CD161, and FCGR2A has been associated with HIV infection progression.
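
To indicate how twin data can be used to rank traits by heritability, the sketch below applies Falconer's classical approximation, h² ≈ 2(r_MZ − r_DZ), to trait correlations in monozygotic and dizygotic twin pairs. This is an assumption made for illustration and is not necessarily the estimator used in the study above; the function name and the correlations are hypothetical.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's approximation of heritability from twin-pair correlations.

    r_mz: trait correlation within monozygotic twin pairs
    r_dz: trait correlation within dizygotic twin pairs
    The estimate is clipped to the interval [0, 1].
    """
    h2 = 2.0 * (r_mz - r_dz)
    return min(max(h2, 0.0), 1.0)

# Hypothetical correlations for a single immune trait.
h2 = falconer_heritability(0.8, 0.5)
```

Traits could then be sorted by this estimate to select the most heritable ones for association testing.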

The use of microbial stimuli to produce intermediate phenotypes has also led to interesting insights. One study used 528 lymphoblastoid cell lines and subjected them to challenges with eight different kinds of pathogens, selected carefully for their high health-care burden and the wide range of host responses elicited against them118. Genome-wide association tests revealed 17 significant SNPs; for example, a SNP in ZBTB20 (which functions as a transcriptional repressor) was associated with both increased salmonella-induced pyroptosis and chlamydia replication. A PheWAS approach was further used with clinical phenotypes from the eMERGE dataset119, which highlighted the same SNP, this time relevant in viral hepatitis.

Ultimately, the link from gene to biological function requires characterization of candidate loci for mechanistic understanding of genetic contributions. The vast majority of loci found in GWAS reside in non-coding regions, implicating gene regulation in disease pathogenesis. Much effort in recent years has therefore been focused on understanding the genetic basis of various forms of molecular traits (molecular quantitative trait locus (QTL) mapping) (Fig. 3). Such associations are often highly context specific, dependent, for example, on cell type120 and activation state121,122,123, with effects specific to exposure to particular pathogens124, and dependent on the disease state; for example, in patients with sepsis due to community acquired pneumonia109. This characterization of molecular traits can inform GWAS interpretation; for example, in a GWAS of non-typhoidal Salmonella bacteraemia, a SNP acting as an eQTL for STAT4 was found, with its effect reported to be a reduction in IFNγ production by natural killer cells125. Genetic modulators of gene expression may also operate at the level of alternative splicing126, including after stimulation by influenza A virus127, with evidence that genetic regulation of isoform use was mostly distinct from that of gene expression. Immune cell-specific protein QTLs have also been demonstrated128, including effects on expression levels of HLA-DRB1 in specific cell types that may be relevant in diseases such as sepsis, where reduced HLA-DR expression is known to correlate with poor outcome129. A further study mapped cytokine production to genetic variants (cQTL mapping) in response to microbial stimuli, identifying QTLs involving pathways containing pattern recognition receptors, cytokine and complement inhibitors, and the kallikrein system130.
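
At its simplest, QTL mapping of a molecular trait regresses the trait on genotype dosage. The following sketch fits an ordinary least-squares line to expression values for a single SNP; it omits covariates, normalization and significance testing that a real eQTL analysis would include, and the function name and data are hypothetical.

```python
def eqtl_effect(dosages, expression):
    """Least-squares slope and intercept of expression on allele dosage.

    dosages: copies of the alternative allele per individual (0, 1 or 2)
    expression: expression level of one gene in the same individuals
    """
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need matched vectors with at least two individuals")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    beta = sxy / sxx                   # expression change per allele copy
    intercept = mean_e - beta * mean_g
    return beta, intercept

# Toy data, invented for illustration.
beta, intercept = eqtl_effect([0, 0, 1, 1, 2, 2], [1.0, 1.2, 1.5, 1.7, 2.0, 2.2])
```

Context specificity can be explored by fitting the same model separately to samples from different cell types or stimulation conditions and comparing the slopes.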

The maturation of technologies has seen multi-omics approaches applied to increasing numbers of diseases to identify and link genomic alterations to biological mechanisms. The infectious disease field has seen recent examples emerge: the combination of the plasma proteome, metabolome and lipidome and peripheral blood mononuclear cell transcriptome in analysing Ebola virus disease pathogenesis is one such example131. Macrophages and neutrophils were found to be particularly relevant cell types. Although this study itself did not tie the findings to the host genetic background, it is nevertheless an important example of the use of multi-omics functional characterization, and it is likely that we will soon be seeing the linking of multi-omics data back to genetic background.

Genetics of responses to vaccinations

As vast amounts of data continue to accrue on the relationship between the human host, its immune responses and infectious agents, the question as to how these data are best used is timely. Arguably, vaccination has offered one of the most effective public health efforts of modern medicine and is the most cost-effective opportunity to substantially reduce the incidence of many important infections, such as malaria, HIV infection and TB. Thus, given the large amount of evidence of human genetics influencing both non-specific and specific immune responses and susceptibility to disease, it is curious that there is still a relative lack of investigations looking at the effect of host genetics on vaccine response132, especially as the study of responses against conserved epitopes offers a means to control for pathogen diversity. Here, we review the state of the field in terms of the genetics of response to vaccination and its potential utility.

The vaccine targeting hepatitis B virus (HBV) remains the most studied, with many genetic associations identified in the MHC region, and there is controversy over whether it is HLA-DR, HLA-DP or class III MHC (involving the C4A gene) that plays independent or linked roles133,134,135,136. It is clear, however, that while HLA-DRB1*03 and HLA-DRB1*07 are linked to lower antibody responses, the opposite is true of HLA-DRB1*01, HLA-DRB1*13 and HLA-DRB1*15 (refs134,137,138), and the associations between HLA-DP and vaccine response against HBV are robust, as they are for other clinical phenotypes associated with the infection, particularly in Asian populations.

Rubella vaccination has seen similar HLA associations, although evidence is again conflicting. The HLA-DPB1 locus has been associated with rubella vaccination antibody responses139,140, but SNPs in the region have been associated in opposite directions, highlighting the need for further replication and functional studies to identify plausible biological mechanisms. Increased antibody responses to measles vaccination have been associated with the HLA-DQA1*0201 allele141.

Although a twin study has shown that the contribution of HLA genes is relatively small compared with non-HLA genes in antibody-inducing vaccines142, there has been relatively little robust and replicated evidence of what the non-HLA genes may be. A recent two-stage GWAS looking at data from more than 3,600 children revealed both HLA and non-HLA associations on studying a combination of capsular group C meningococcal, Haemophilus influenzae type b and tetanus toxoid vaccines143. This study found that SNPs in the locus coding for signal-regulatory proteins SIRPA, SIRPB and SIRPG were associated with greater antibody titres after group C meningococcal vaccination. Notably, the association was present only for serum bactericidal antibodies (functional antibodies as assessed by rabbit/human complement assays) and not for total meningitis C-specific IgG, emphasizing the need for caution about the agent-specific readout used. The study also identified four HLA class II alleles associated with tetanus toxoid vaccine IgG concentrations, with the lead SNP being an eQTL for HLA-DRB1 and HLA-DRB5 (ref.144).

Inborn errors of immunity have been reported to occasionally result in life-threatening disease following administration of live attenuated vaccines, for example, live poliovirus vaccine (vaccine-associated paralytic polio in patients with agammaglobulinaemia145), whereas defects in interferon immunity may result in severe illness following administration of the yellow fever or MMR vaccine (measles strain) (IFNAR1 (ref.146) and IFNAR2 (ref.147), and STAT1 (ref.148) and STAT2 (ref.149) deficiencies, respectively) (Fig. 1c).

Overall, current work highlights that new cohorts and mechanistic studies to further explore the effect of host genetics on vaccine response are needed to address this important area.

Genetic strategies for a new disease: COVID-19

Emerging infections such as SARS-CoV-2 pose a major threat to human health, and the early observation of striking heterogeneity in clinical presentations and outcomes following infection has highlighted the ongoing need to apply genetic and genomic approaches to understand drivers of individual susceptibility to, and severity and outcomes of, this disease. Only a small minority of those infected with SARS-CoV-2 develop severe disease, with recognized risk factors including older age, male sex and co-morbidities, including obesity150. Genetic factors are hypothesized to contribute to the observed heterogeneity in response, and early reports such as from TwinsUK indicate relatively high heritability estimates for self-reported COVID-19 symptoms such as anosmia151. Genetic approaches provide the opportunity to gain novel insights into disease pathogenesis through the identification of specific disease associations involving particular genes or pathways, to define novel drug targets or to develop personalized medicine approaches with early intervention or therapy tailored to the individual. The opportunity to identify repurposing opportunities for approved medications based on shared genetic associations across traits is a particularly important consideration in a pandemic situation such as COVID-19; although not based on genetic evidence, the RECOVERY trial has demonstrated the therapeutic utility of existing drugs such as dexamethasone for treatment of severe disease152.

The international response to answering the question of whether human genetics alters susceptibility to or outcome from SARS-CoV-2 infection is an exemplar of modern genomic collaboration. The establishment of large appropriately phenotyped cohorts for GWAS in a pandemic situation has been enabled by prospective collection through existing or hibernating studies, rapid deployment of new studies and leveraging existing population biobank studies with large numbers of already genotyped individuals. Monthly data releases from the UK Biobank of COVID-19 results together with outcome and phenotypic information early in the pandemic highlight the opportunity provided by such population biobanks, while application of meta-analysis methods across cohorts will maximize power for genetic discovery. International collaborative efforts are key to facilitate genetic studies in infectious diseases such as COVID-19. For example, the COVID-19 Host Genetics Initiative153 was rapidly established to support sharing of relevant tools and data with an emphasis on GWAS. The COVID Human Genetic Effort154, by contrast, is focused on identifying monogenic cases with rare, highly penetrant mutations through analysis of young patients (less than 50 years of age) who were previously well and developed life-threatening disease, as well as those naturally resistant despite repeated exposure. Integrative analysis, considering both host and viral genetic variation, will likely be highly informative, although early reports suggest that viral genetic variation did not significantly affect outcomes in COVID-19 (ref.155).
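
The gain in power from combining cohorts can be illustrated with an inverse-variance-weighted fixed-effect meta-analysis of per-cohort effect estimates for a single variant. This is a minimal sketch under the assumption of a common true effect across cohorts; it is not a description of the pipeline used by any consortium named above, and the function name and inputs are hypothetical.

```python
from math import erfc, sqrt

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    betas: per-cohort effect estimates (for example, log odds ratios)
    standard_errors: per-cohort standard errors of those estimates
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = sqrt(1.0 / total)
    z = pooled_beta / pooled_se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return pooled_beta, pooled_se, p_value

# Two hypothetical cohorts with equal precision.
pooled_beta, pooled_se, p_value = fixed_effect_meta([0.2, 0.4], [0.1, 0.1])
```

The pooled standard error is smaller than that of either cohort alone, which is why meta-analysis across cohorts increases power for discovery.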

An early exemplar of GWAS during the COVID-19 pandemic is a study by the Severe Covid-19 GWAS Group, who conducted a GWAS of 1,980 patients with severe disease (hospitalized with respiratory failure requiring oxygenation or mechanical ventilation) in Italian and Spanish epicentres of the pandemic versus population controls. The European meta-analysis revealed genome-wide significant associations at 3p21.31 and 9q34.2 (ref.156). The limitations of self-reported data notwithstanding, a large GWAS of more than 1.05 million individuals from the 23andMe research platform replicated the associations at these two loci for disease severity157, while the GenOMICC investigators recruited patients with COVID-19 from intensive care units only and robustly reproduced the 3p21.31 signal in addition to finding associations at other loci, including ones with genes of well-known antiviral function such as 12q24.13 (OAS1, OAS2 and OAS3) and 21q22.1 (IFNAR2)158.
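
For a single variant, the core of a case–control comparison such as those described above is a test of whether allele counts differ between cases and controls. The sketch below uses a Pearson chi-square test with one degree of freedom on a 2×2 table of allele counts; it ignores covariates and ancestry adjustment, which real GWAS handle with regression models, and the function name and counts are hypothetical.

```python
from math import erfc, sqrt

def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic chi-square test (1 d.f.) on counts of alternative and
    reference alleles in cases and controls.
    Returns (odds_ratio, chi_square, p_value)."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    margins = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    chi_square = n * cross ** 2 / margins
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi_square / 2.0))
    return odds_ratio, chi_square, p_value

# Toy allele counts, invented for illustration.
odds_ratio, chi_square, p_value = allelic_association(300, 700, 200, 800)
```

The same calculation is repeated for every genotyped or imputed variant, and only associations surviving a stringent genome-wide threshold are reported.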

Despite the important achievements, these results exemplify current challenges in GWAS biology. First, the 3p21.31 association, which was also supported by preliminary results from the COVID-19 Host Genetics Initiative, spans multiple genes, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6 and XCR1, but will require careful functional genomic and other annotations to establish the causal genes underlying the genomic signal, as non-coding variants may modulate gene expression and other regulatory mechanisms at a distance159. Intriguingly, there is evidence that the risk haplotype, spanning ~50 kb at 3p21.31, was inherited from Neanderthals and shows substantial variation in frequency between populations, being carried by ~50% of people in South Asia, while being almost absent in East Asia, suggesting it may have been affected by selection in the past160. Second, the association at 9q34.2 coincides with the ABO blood group locus, and increased risk with blood group A was shown, but the mechanism for this association remains unclear and will require further study especially as this locus is frequently a site of type I error resulting from population stratification. Despite the apparent lack of significant signal of association at the locus encoding HLA genes in these initial GWAS of COVID-19, there is considerable interest in the importance of such variation in the response to infection, with evidence, for example, of differing capacity between HLA alleles in presentation of highly conserved SARS-CoV-2 peptides to immune cells161.

There is also early evidence from the COVID Human Genetic Effort to support the role of rare variants in the risk of severe disease on the basis of a study of 659 patients (0.1–99 years of age) hospitalized with critical disease due to COVID-19 and requiring mechanical ventilation or organ support in intensive care units162. Following genome and exome sequencing, the study authors tested the hypothesis that inborn errors of TLR3- and IRF7-dependent type I interferon immunity contribute to critical disease risk, analysing 13 genomic regions comprising loci previously shown to be mutated in critical influenza pneumonia and connected loci mutated in patients with other viral illnesses. Significant enrichment of rare variants predicted to result in LOF was found at these loci relative to people with asymptomatic or benign infection, with experimental validation including demonstration of inborn errors at eight loci in up to 23 patients (3.5%) of different ages (17–77 years) and population ancestries. The importance of type I interferons in protective immunity against SARS-CoV-2 was underlined by an accompanying article showing evidence for a high level of neutralizing autoantibodies to type I interferons in other patients with critical COVID-19, together highlighting the opportunities for therapeutic intervention based on screening at-risk individuals and developing targeted interventions163. Indeed, in the GenOMICC study described above, beyond uncovering key potential genetic contributions to COVID-19 severity, the group further used Mendelian randomization to demonstrate a link between low IFNAR2 expression and high TYK2 expression and life-threatening disease, the former further evidencing the importance of type I interferons in COVID-19 pathogenesis158. 

Altogether, the emerging insights from this fast-moving field of research exemplify the complex nature of the genetic architecture of susceptibility to infectious disease that may have relevance not only for the agent causing the largest pandemic of our generation but also potentially for a range of other infections affecting public health worldwide.
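
The logic of the Mendelian randomization analysis mentioned above can be sketched with the single-instrument Wald ratio, in which the effect of a variant on disease is divided by its effect on the intermediate phenotype (here, gene expression). This is an illustrative assumption and not necessarily the estimator used in the cited study; the function name and all numbers are hypothetical.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument Mendelian randomization estimate.

    beta_exposure: effect of the variant on the exposure (e.g. expression, in SD units)
    beta_outcome: effect of the same variant on the outcome (e.g. log odds of disease)
    se_outcome: standard error of beta_outcome
    The standard error is a first-order approximation that ignores
    uncertainty in beta_exposure.
    """
    if beta_exposure == 0:
        raise ValueError("instrument has no effect on the exposure")
    estimate = beta_outcome / beta_exposure
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error

# Hypothetical variant that lowers expression and raises disease risk.
estimate, standard_error = wald_ratio(-0.5, 0.2, 0.05)
```

A negative estimate in this toy example would indicate that higher expression is associated with lower risk, the direction of effect reported for IFNAR2.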

Conclusions and perspectives

A huge wealth of data is becoming available that confirms the influence of our host genetic variation on how we become susceptible to and respond to diseases caused by infectious agents. Although traditional case–control GWAS have yielded many insights, a large amount of information is also coming from other methods, such as intermediate phenotype mapping and multi-omics approaches. Issues of heterogeneity in host and pathogen genetic diversity and difficulty in arriving at consensus for disease case definition both hamper power in GWAS approaches. However, as recent studies emphasize, these problems are not insurmountable with appropriate power and will be aided by greater precision in defining disease phenotypes and application in a range of populations.

The challenge now comes with bringing all of these findings and approaches together to translate them into a clinical benefit. Moving forwards, it will be critical to consider context specificity. The cellular studies and vaccine responses emphasize the importance of considering pathogen-specific immune responses in the context of both time and appropriate sample type. The extent and nature of pathogen diversity is a key context, with large gaps remaining in our understanding. Our ability to leverage genetic information will also improve as we further appreciate the overlaps between monogenic syndromes with a high infection risk and the more common infectious diseases with similar pathogens. All of these opportunities are likely to come as data from cohorts such as the UK Biobank and the China Kadoorie Biobank mature.

The clinical benefit of the large body of genetic work has already been demonstrated through associations translated to clinical implementation, such as abacavir hypersensitivity eliminated by screening for HLA-B*5701 (Fig. 2), and HCV treatment influenced by IL28B. We should be inspired by these lessons learnt as we move forwards into an era where large-scale genomics may help predict our risks of disease or facilitate future vaccine development and deployment through an understanding of genetic diversity at both the individual level and the population level to enable more tailored application in a precision medicine approach that maximizes effectiveness for a given person or population group.