arising from: COVID-19 Host Genetics Initiative. Nature https://doi.org/10.1038/s41586-021-03767-x (2021)

Investigating the role of host genetic factors in COVID-19 severity and susceptibility can inform our understanding of the underlying biological mechanisms that influence adverse outcomes and drug development1,2. Here we present a second updated genome-wide association study (GWAS) on COVID-19 severity and infection susceptibility to SARS-CoV-2 from the COVID-19 Host Genetics Initiative (data release 7). We performed a meta-analysis of up to 219,692 cases and over 3 million controls, identifying 51 distinct genome-wide significant loci, 28 more than in the previous data release2. The increased number of candidate genes at the identified loci helped to map three major biological pathways that are involved in susceptibility and severity: viral entry, airway defence in mucus and type I interferon.

We conducted a meta-analysis for 3 phenotypes across 82 studies from 35 countries, including 36 studies of individuals with non-European ancestry (Fig. 1, Supplementary Figs. 1 and 2 and Supplementary Table 1): critical illness (respiratory support or death; 21,194 cases), hospitalization (49,033 cases) and SARS-CoV-2 infection (219,692 cases). Most of the studies were collected before the widespread introduction of COVID-19 vaccination. We found 30, 40 and 21 loci that are associated with critical illness, hospitalization and infection due to SARS-CoV-2, respectively, for a total of 51 distinct genome-wide significant loci across all three phenotypes (P < 5 × 10−8; Fig. 2, Supplementary Fig. 3 and Supplementary Table 2), adding 28 genome-wide significant loci to the 23 previously identified by the COVID-19 Host Genetics Initiative (HGI; data release 6)1,2. We observed a median increase of 2.9-fold in statistical power across lead variants owing to a median increase of 1.6-fold in effective sample sizes from the previous release (Supplementary Table 3). After correcting for the number of phenotypes examined, 46 loci remained significant (P < 1.67 × 10−8). Of the 28 additional loci, 6 loci were originally reported by the GenOMICC study3, which also contributed to the current meta-analysis, and 9 other loci were identified by the new GenOMICC meta-analysis4 during the preparation of this paper. We found nine more loci that reached genome-wide significance, but we excluded them as they were probably false positives, as determined using a leave-most-significant-biobank-out analysis (Supplementary Table 4 and Supplementary Note). Comparing the effect sizes and statistical significance between the previous2 and current analysis indicated that all of the previously identified loci were replicated and showed an increase in statistical significance (Supplementary Fig. 4).
Using our previously developed two-class Bayesian model for classifying loci as being more likely involved in infection susceptibility or severity2, we determined that 36 loci are substantially more likely (higher than 99% posterior probability) to affect disease severity (hospitalization) and 9 loci are substantially more likely to influence susceptibility to SARS-CoV-2 infection, while the remaining 6 loci could not be classified (Supplementary Fig. 5, Supplementary Table 5 and Supplementary Note). We observed that the 1q22 locus (lead variant: rs12752585:G>A) showed significant effect-size heterogeneity across ancestries (Phet < 9.80 × 10−4 = 0.05/51), whereas the previously reported heterogeneous locus (FOXP4) remained at the same level of significance as before2, despite an increase in sample size (Phet = 2.01 × 10−3; Supplementary Fig. 6 and Supplementary Table 6). We found significant observed-scale single-nucleotide polymorphism heritabilities for all three phenotypes (1.2–8.2%, P < 0.0001). We also estimated liability-scale heritabilities for a range of population prevalences (Supplementary Fig. 7, Supplementary Table 7 and Supplementary Note).
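The two corrected thresholds quoted above (P < 1.67 × 10−8 for the three phenotypes and Phet < 9.80 × 10−4 for the 51 loci) are consistent with a simple Bonferroni division. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative and do not come from the analysis pipeline.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Genome-wide threshold (5e-8) corrected for the 3 phenotypes examined.
phenotype_corrected = bonferroni_threshold(5e-8, 3)

# Nominal 0.05 level corrected for the 51 loci tested for heterogeneity.
heterogeneity_threshold = bonferroni_threshold(0.05, 51)

print(f"{phenotype_corrected:.2e}")      # 1.67e-08
print(f"{heterogeneity_threshold:.2e}")  # 9.80e-04
```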

Fig. 1: Overview of the contributing studies in the HGI data release 7.

a, Geographical overview of the studies contributing to the COVID-19 HGI and the composition by major ancestry groups. Populations are defined as Middle Eastern (MID), South Asian (SAS), East Asian (EAS), African (AFR), admixed American (AMR), European (EUR) and other (OTH). b, A principal component analysis (PCA) highlights the population structure and the sample ancestry of the individuals participating in the COVID-19 HGI. Per-cohort PCA results are provided in Supplementary Fig. 2. This figure was reproduced from the original publication by the COVID-19 HGI1 with modifications reflecting the updated analysis from data release 7.

Fig. 2: GWAS results for COVID-19.

a, The results of a GWAS analysis of hospitalized individuals with COVID-19 (n = 49,033 cases and n = 3,393,109 controls) (top), and the results for individuals with reported SARS-CoV-2 infection (n = 219,692 cases and n = 3,001,905 controls) (bottom). The loci highlighted in yellow (top) represent regions that are associated with severity of COVID-19. The loci highlighted in green (bottom) are regions associated with susceptibility to SARS-CoV-2 infection. Lead variants for the loci that were identified in this data release are annotated with their respective rsID. The y axis is on the −log10 (P) scale up to 10, after which it switches to a 10 × log10[−log10(P)] scale to aid presentation. b, Results of gene prioritization using different evidence measures of gene annotation. For the genes in a linkage-disequilibrium (LD) region, genes with coding variants and eGenes (fine-mapped cis-expression quantitative trait locus (cis-eQTL) variant with posterior inclusion probability (PIP) >0.1 in GTEx Lung) are annotated as such if they are in linkage disequilibrium with a COVID-19 lead variant (r2 > 0.6). V2G, the highest gene prioritized by OpenTargetGenetics V2G score. The pink circle indicates SARS-CoV-2 infection susceptibility, the green triangle indicates COVID-19 severity and the blue cross indicates unclassified. This figure was reproduced from the original publication by the COVID-19 HGI1 with modifications reflecting the updated analysis from data release 7.
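The piecewise y-axis scale described in the caption of panel a can be written as a small function. This is a sketch based only on the caption text, not the plotting code used for the figure; the name `manhattan_y` is hypothetical.

```python
import math


def manhattan_y(p: float) -> float:
    """Map a P value to the plotted y coordinate described in the caption.

    Up to -log10(P) = 10 the axis is linear in -log10(P); beyond that it
    switches to 10 * log10(-log10(P)) to compress very small P values.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    neg_log_p = -math.log10(p)
    if neg_log_p <= 10.0:
        return neg_log_p
    return 10.0 * math.log10(neg_log_p)


for p in (1e-5, 1e-10, 1e-100):
    print(p, round(manhattan_y(p), 6))
```

Both branches give 10 at −log10(P) = 10, so the axis is continuous at the switch point, and a P value of 10−100 is drawn at a height of 20 rather than 100.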

To better understand the biological mechanisms underlying COVID-19 susceptibility and severity, we further characterized candidate causal genes by mapping them onto biological pathways and performing a phenome-wide association analysis (Extended Data Fig. 1, Supplementary Fig. 8 and Supplementary Tables 2, 8 and 9). In total, 15 out of 51 loci could be linked to three major pathways involved in susceptibility and severity defined by expert-driven classification (Supplementary Note): (1) viral entry; (2) entry defence in airway mucus; and (3) type I interferon response. Moreover, the phenome-wide association analysis identified nine loci involved in the upkeep of healthy lung tissue.

First, five loci include candidate causal genes involved in the viral entry pathway (Extended Data Fig. 1a), such as previously reported SLC6A20 (3p21.31), ABO (9q34.2), SFTPD (10q22.3) and ACE2 (Xp22.2), as well as TMPRSS2 (21q22.3), which was also identified in data release 7. We found that the lead variant rs9305744:G>A, an intronic variant of TMPRSS2, is protective against critical illness (odds ratio (OR) = 0.92, 95% CI = 0.89–0.95, P = 1.4 × 10−8) and is in linkage disequilibrium with the missense variant rs12329760:C>T (p.Val197Met; r2 = 0.68). SARS-CoV-2 uses the serine protease TMPRSS2 for viral spike protein priming and the previously reported ACE2, which functionally interacts with SLC6A20, for host cell entry (refs. 5,6). Notably, the previously reported association between ABO blood groups and susceptibility could be attributed to the binding of anti-A and anti-B antibodies to the spike protein, potentially interfering with viral entry7. Furthermore, the previously reported SFTPD encodes pulmonary surfactant protein D (SP-D), which contributes both to the lung’s innate immune defence and to the viral-entry response in pulmonary epithelia8,9, along with other genes for airway defence.

Second, four loci contain candidate causal genes for entry defence in the airway mucus (Extended Data Fig. 1b), such as previously reported MUC1/THBS3 (1q22) and MUC5B (11p15.5) as well as novel MUC4 (3q29) and MUC16 (19p13.2). We found that the novel lead variants rs2260685:T>C in MUC4 (intronic variant; in linkage disequilibrium (r2 = 0.65) with a missense variant rs2259292:C>T (p.Gly4324Asp)) and rs73005873:G>A in MUC16 (intronic variant) increase the risk of SARS-CoV-2 infection (OR = 1.03 and 1.03, 95% CI = 1.02–1.04 and 1.02–1.04, P = 4.1 × 10−8 and 6.4 × 10−10, respectively). Moreover, the previously reported locus 1q22 contains an intergenic lead variant rs12752585:G>A that decreases the risk of infection (OR = 0.98, 95% CI = 0.97–0.98, P = 1.5 × 10−11) and increases MUC1 expression in the oesophagus mucosa in GTEx v8 (P = 5.2 × 10−9). Notably, the 1q22 locus also contains an independent lead variant, rs35154152:T>C, a missense variant (p.Ser279Gly) of THBS3, that decreases the risk of hospitalization (OR = 0.88, 95% CI = 0.86–0.90, P = 5.6 × 10−22) but not infection (P = 5.7 × 10−4), suggesting potentially distinct mechanisms in the locus. Consistent with these association patterns, MUC1, MUC4 and MUC16 are three known major transmembrane mucins of the respiratory tract that prevent microbial invasion, whereas previously reported MUC5B, together with nearby MUC5AC, are primary structural components of airway mucus that enable mucociliary clearance of pathogens10.

Third, six loci contain candidate causal genes that are linked to the type I interferon pathway (Extended Data Fig. 1c), such as previously reported IFNAR2 (21q22.11), OAS1 (12q24.13) and TYK2 (19p13.2), as well as additionally identified JAK1 (1p31.3), IRF1 (5q31.1) and IFNα-coding genes (9p21.3). Previous studies have reported additional genes in this pathway: TLR7 (refs. 11,12) and DOCK2 (ref. 13). Here we found that the lead variant rs28368148:C>G, a missense variant (p.Trp164Cys) of IFNA10 located within the IFNα gene cluster, increases the risk of critical illness (OR = 1.56, 95% CI = 1.38–1.77, P = 3.7 × 10−12). IFNα is one of the type I interferons that binds specifically to the IFNα receptor consisting of IFNAR1–IFNAR2 chains, in which mutations are also known to increase the risk of hospitalization and critical illness. In the genes that enable signalling downstream of IFNAR, we identified that the lead variant rs11208552:G>T, an intronic variant of JAK1, is protective against critical illness and hospitalization (OR = 0.92 and 0.95, 95% CI = 0.89–0.94 and 0.93–0.96, P = 5.5 × 10−10 and 2.2 × 10−9, respectively). This variant was previously reported to decrease lymphocyte counts14 (β = −0.016, P = 5.5 × 10−15) and increase JAK1 expression in the thyroid in GTEx15 (P = 6.1 × 10−23). JAK1 and previously reported TYK2 are Janus kinases (JAKs) that are required for type I interferon-induced JAK–STAT signalling. JAK inhibitors are used to treat patients with severe COVID-19 (ref. 16). Furthermore, downstream of JAK–STAT signalling, we found that the lead variant rs10066378:T>C, located 67 kb upstream of IRF1, increases the risk of critical illness and hospitalization (OR = 1.09 and 1.07, 95% CI = 1.06–1.13 and 1.05–1.09, P = 2.7 × 10−9 and 3.74 × 10−10, respectively).
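As a reading aid, a reported odds ratio and 95% confidence interval can be converted back into an approximate P value. The sketch below assumes a Wald-type interval that is symmetric on the log-odds scale and a normal approximation for the test statistic; these are assumptions made here for illustration, not a description of the meta-analysis software. Because published ORs and CIs are rounded to two decimals, the result is only expected to match the reported P value in order of magnitude. The function name is hypothetical.

```python
import math

Z_975 = 1.959964  # standard normal quantile for a two-sided 95% interval


def approx_p_from_or_ci(odds_ratio: float, ci_low: float, ci_high: float):
    """Approximate the z statistic and two-sided P value from an OR and its 95% CI."""
    log_or = math.log(odds_ratio)
    # The CI width on the log scale equals 2 * 1.96 * standard error.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    z = log_or / se
    # Two-sided normal tail probability: P = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# IFNA10 lead variant rs28368148 for critical illness, as reported in the text.
z, p = approx_p_from_or_ci(1.56, 1.38, 1.77)
print(f"z = {z:.2f}, approximate P = {p:.1e}")
```

For this variant the approximation gives a P value on the order of 10−12, the same order as the reported 3.7 × 10−12. For variants with ORs close to 1, rounding of the OR and CI dominates and the back-calculated value can differ from the reported one by an order of magnitude or more.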

Furthermore, the phenome-wide association analysis identified nine loci previously associated with lung function and respiratory diseases. These loci contain genes involved in the upkeep of healthy lung tissue such as previously reported FOXP4 (6p21.1), SFTPD (10q22.3), MUC5B (11p15.5) and DPP9 (19p13.3), as well as additionally identified CIB4 (2p23.3), NPNT (4q24), ZKSCAN1 (7q22.1), ATP11A (13q34) and PSMD3 (17q21.1). For example, we found that three lead variants, rs1662979:G>T (intronic variant of CIB4), rs34712979:G>A (splice region variant of NPNT) and rs2897075:C>T (intronic variant of ZKSCAN1), are significantly associated with hospitalization (OR = 1.05, 0.94 and 1.05, 95% CI = 1.03–1.07, 0.92–0.96 and 1.03–1.07, P = 5.6 × 10−9, 3.8 × 10−8 and 8.9 × 10−9, respectively) and lung function (FEV1/FVC)17, similar to the previously reported lead variant rs3934643:G>A (intronic variant of SFTPD). Notably, whereas the alleles of rs1662979 and rs3934643 that are associated with increased risk of COVID-19 severity decrease lung function (β = −0.013 and −0.025, P = 5.3 × 10−8 and 6.3 × 10−10, respectively), those of rs34712979 and rs2897075 increase lung function (β = 0.068 and 0.023, P = 4.2 × 10−134 and 1.6 × 10−20, respectively). Likewise, we found lead variants that were significantly associated with hospitalization and idiopathic pulmonary fibrosis18,19, such as the aforementioned rs2897075 and rs12585036:C>T (intronic variant of ATP11A; OR = 1.10, 95% CI = 1.08–1.12, P = 3.2 × 10−21), in addition to the previously reported rs35705950:G>T (promoter variant of MUC5B). Whereas the COVID-19 severity risk-increasing alleles of rs2897075 and rs12585036 increase the risk of idiopathic pulmonary fibrosis (OR = 1.12 and 1.27, P = 3.0 × 10−14 and 7.0 × 10−9, respectively), that of rs35705950 decreases the risk (OR = 0.50, P = 3.9 × 10−80). These results highlight the complex pleiotropic relationships between COVID-19 severity, lung function and respiratory diseases.

We used genetic correlations and Mendelian randomization analyses to identify potential causal effects of risk factors on COVID-19 phenotypes (Supplementary Fig. 9 and Supplementary Tables 10 and 11). In total, 14 novel genetic correlations and 10 novel robust exposure–COVID-19 trait pairs showed evidence of causal associations (Supplementary Note). In particular, smoking initiation and the number of cigarettes per day were positively correlated with severity and susceptibility phenotypes; Mendelian randomization indicated that smoking was causally associated with increased risk of COVID-19, further highlighting the role of healthy lung tissue in COVID-19 severity. Moreover, genetically instrumented higher glomerular filtration rate (indicative of better kidney function) was associated with a lower risk of COVID-19 critical illness, whereas genetically predicted chronic kidney disease was associated with an increased risk of COVID-19 critical illness, suggesting that better kidney function may reduce the risk of severe COVID-19.

In summary, we have substantially expanded the current knowledge of host genetics for COVID-19 susceptibility and severity by further doubling the case numbers from the previous data release2 and identifying 28 additional loci. The increased number of loci enables us to map genes to pathways that are involved in viral entry, airway defence and immune system response. Notably, we observed that severity loci mapped to the type I interferon pathway, whereas susceptibility loci mapped to the viral entry and airway defence pathways, with notable exceptions for the severity-classified TMPRSS2 and MUC5B loci. Further investigation of how such susceptibility and severity loci map to different pathways would provide mechanistic insights into the human genetic architecture of COVID-19.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.