The COVID-19 pandemic continues to pose a major public health threat, especially in countries with low vaccination rates. To better understand the biological underpinnings of SARS-CoV-2 infection and COVID-19 severity, we formed the COVID-19 Host Genetics Initiative1. Here we present a genome-wide association study meta-analysis of up to 125,584 cases and over 2.5 million control individuals across 60 studies from 25 countries, adding 11 genome-wide significant loci compared with those previously identified2. Genes at new loci, including SFTPD, MUC5B and ACE2, reveal compelling insights regarding disease susceptibility and severity.

Here we present meta-analyses bringing together 60 studies from 25 countries (Fig. 1 and Supplementary Table 1) for three COVID-19-related phenotypes: (1) individuals critically ill with COVID-19, defined as those who required respiratory support in hospital or who died as a consequence of the disease (9,376 cases, of which 3,197 are new in this data release, and 1,776,645 control individuals); (2) individuals with moderate or severe COVID-19, defined as those hospitalized due to symptoms associated with the infection (25,027 cases, 11,386 new and 2,836,272 control individuals); and (3) all cases with reported SARS-CoV-2 infection regardless of symptoms (125,584 cases, 76,022 new and 2,575,347 control individuals). Most studies reported their results before the roll-out of the COVID-19 vaccination campaign. An overview of the study design is provided in Supplementary Fig. 1. We found a total of 23 genome-wide significant loci (P < 5 × 10−8), of which 20 loci remain significant after correction for multiple testing (P < 1.67 × 10−8) to account for the number of phenotypes examined (Fig. 2, Supplementary Fig. 2 and Supplementary Table 2). We compared the effects of these loci between the previous2 and current analyses and found that only one locus did not replicate (rs72711165). All of the other loci showed the expected increase in statistical significance (Supplementary Fig. 3).
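The stricter threshold quoted above is consistent with a Bonferroni correction of the conventional genome-wide threshold for the three phenotypes examined. The following minimal sketch shows the arithmetic; the function and variable names are illustrative and are not taken from the study's analysis code.

```python
# Conventional genome-wide significance threshold and the number of
# phenotypes analysed (critical illness, hospitalization, reported infection).
GENOME_WIDE_ALPHA = 5e-8
N_PHENOTYPES = 3


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


phenotype_threshold = bonferroni_threshold(GENOME_WIDE_ALPHA, N_PHENOTYPES)
print(f"{phenotype_threshold:.2e}")  # prints 1.67e-08
```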

Fig. 1: Overview of contributing studies in Host Genetics Initiative data freeze 6.
a, Geographical overview of the contributing studies to the COVID-19 Host Genetics Initiative and composition by major continental ancestry groups. Ancestry groups are defined as Middle Eastern (MID), south Asian (SAS), east Asian (EAS), African (AFR), admixed American (AMR) and European (EUR). b, Principal components analysis highlighting the population structure and the sample ancestry of the individuals participating in the COVID-19 Host Genetics Initiative. This figure is reproduced from the original publication by the COVID-19 Host Genetics Initiative2 with modifications reflecting the updated analysis from data freeze 6.

Fig. 2: Genome-wide association results for COVID-19.
a, The results of the genome-wide association study of hospitalized COVID-19 (n = 25,027 cases and n = 2,836,272 control individuals) (top), and the results of reported SARS-CoV-2 infection (n = 125,584 cases and n = 2,575,347 control individuals) (bottom). Loci highlighted in yellow (top) represent regions associated with the severity of COVID-19 manifestation. Loci highlighted in green (bottom) are regions associated with SARS-CoV-2-reported infection. Lead variants for the loci identified in this data release are annotated with their respective rs ID. Horizontal lines denote genome-wide significant thresholds. b, The results of gene prioritization using different evidence measures of gene annotation. Genes in regions of linkage disequilibrium (LD), genes with coding variants and eGenes (fine-mapped cis-eQTL variant PIP > 0.1 in GTEx Lung) are annotated if in linkage disequilibrium with a COVID-19 lead variant (r2 > 0.6). V2G denotes the highest gene prioritized by OpenTargetGenetics’ V2G score. The asterisk (*) indicates SARS-CoV-2 reported infection and the plus symbol (+) indicates COVID-19 severity. The transparent loci were reported in the previous freeze (data release 5), and loci in bright blue were identified in the current freeze (data release 6). This figure is reproduced from the original publication by the COVID-19 Host Genetics Initiative2 with modifications reflecting the updated analysis from data freeze 6.

Across the genome-wide significant loci, we observed clear patterns of association with the different phenotypes under study. We therefore developed a two-class Bayesian model for classifying loci based on the patterns of association across the two better-powered phenotypes (COVID-19 hospitalization and SARS-CoV-2 reported infection). Intuitively, loci that are associated with susceptibility will also be associated with severity as, to develop COVID-19, SARS-CoV-2 infection needs to first occur. By contrast, those genetic effects that solely modify the course of illness should be associated with severity of illness and not show any association with reported infection except through preferential ascertainment of hospitalized cases in a cohort (Supplementary Methods). We identified 16 loci that are substantially more likely (>99% posterior probability) to affect the risk of COVID-19 hospitalization and 7 loci that clearly influence susceptibility to SARS-CoV-2 infection (Supplementary Table 3 and Supplementary Fig. 4).
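The intuition behind this classification can be illustrated with a minimal sketch. It is not the model detailed in the Supplementary Methods. It assumes that the effect estimates for the two phenotypes are independent (in practice the case sets overlap), places a zero-mean normal prior with a hypothetical standard deviation tau on the true log odds ratio, gives the two classes equal prior weight by default, and ignores the ascertainment adjustment mentioned above. Under the susceptibility class the same true effect acts on both phenotypes; under the severity class the effect on reported infection is zero.

```python
import math


def log_bivariate_normal(x, y, var_x, var_y, cov_xy):
    """Log density at (x, y) of a zero-mean bivariate normal distribution."""
    det = var_x * var_y - cov_xy * cov_xy
    quad = (var_y * x * x - 2.0 * cov_xy * x * y + var_x * y * y) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def posterior_severity(b_inf, se_inf, b_hosp, se_hosp, tau=0.2, prior_severity=0.5):
    """Posterior probability that a locus belongs to the severity class.

    b_inf, b_hosp: estimated log odds ratios for reported infection and
    hospitalization; se_inf, se_hosp: their standard errors.
    tau and prior_severity are assumed values, not taken from the study.
    """
    # Susceptibility class: one shared true effect b ~ N(0, tau^2), so the
    # two estimates are correlated through b.
    log_sus = log_bivariate_normal(
        b_inf, b_hosp, se_inf**2 + tau**2, se_hosp**2 + tau**2, tau**2
    )
    # Severity class: no effect on infection, effect b ~ N(0, tau^2) on
    # hospitalization only.
    log_sev = log_bivariate_normal(
        b_inf, b_hosp, se_inf**2, se_hosp**2 + tau**2, 0.0
    )
    a = math.log(prior_severity) + log_sev
    b = math.log(1.0 - prior_severity) + log_sus
    m = max(a, b)  # subtract the maximum for numerical stability
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


# Hypothetical loci: one with an effect on hospitalization only, one with
# the same effect on both phenotypes.
severity_like = posterior_severity(0.00, 0.01, 0.15, 0.02)
susceptibility_like = posterior_severity(0.15, 0.01, 0.15, 0.02)
```

With these made-up inputs, the first locus is assigned to the severity class and the second to the susceptibility class with high posterior probability; real data would also require modelling sample overlap between the two phenotypes.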

We observed that several loci had a significant heterogeneous effect across studies (6 out of 23 loci with a P value for heterogeneity of <2.2 × 10−3; Supplementary Table 2). Owing to an increased diversity in our study population (Supplementary Fig. 5), we were able to examine whether such heterogeneity was due to effect differences across continental ancestry groups. Only one locus (FOXP4) showed a significantly different effect across ancestries (P value for heterogeneity of <7 × 10−5; Supplementary Table 4 and Supplementary Fig. 6), although even at this locus all of the ancestry groups showed a positive effect estimate. This suggests that factors related to between-study heterogeneity (such as variable definitions of COVID-19 severity owing to different thresholds for testing, hospitalization and patient recruitment), rather than differences across ancestries, are the more likely explanation for the observed heterogeneity in effect sizes across studies.
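Heterogeneity of this kind is commonly quantified with Cochran's Q statistic computed around a fixed-effect, inverse-variance-weighted pooled estimate. The sketch below assumes that approach, which this section does not specify, and uses hypothetical per-study log odds ratios. The threshold shown assumes that the reported cut-off corresponds to 0.05 divided by the 23 loci tested.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, math.sqrt(1.0 / total)


def cochran_q(betas, ses):
    """Return Cochran's Q, its degrees of freedom and the I^2 statistic.

    Under the null hypothesis of no heterogeneity, Q follows a chi-square
    distribution with (number of studies - 1) degrees of freedom.
    """
    pooled, _ = fixed_effect_meta(betas, ses)
    q = sum(((b - pooled) / se) ** 2 for b, se in zip(betas, ses))
    df = len(betas) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared


# Assumed origin of the heterogeneity threshold: 0.05 / 23 loci.
heterogeneity_alpha = 0.05 / 23

# Hypothetical per-study estimates for a single locus.
q, df, i_squared = cochran_q([0.0, 0.2], [0.05, 0.05])
```

Converting Q to a P value requires the chi-square survival function, which is available in standard statistical libraries and is omitted here to keep the example dependency-free.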

For the 23 genome-wide significant loci, we examined candidate causal genes and performed a phenome-wide association study to better understand their potential biological mechanisms (Supplementary Tables 2, 5 and 6 and Supplementary Fig. 7). Several of these loci with previous and direct connections to lung disease and SARS-CoV-2 infection mechanisms are highlighted here.

Several loci involved in COVID-19 severity implicate lung surfactant biology. A missense variant rs721917:A>G (p.Met31Thr) in SFTPD (10q22.3) confers risk for hospitalization (odds ratio (OR) = 1.06, 95% confidence interval (CI) = 1.04–1.08, P = 1.7 × 10–8) and has been previously associated with increased risk of chronic obstructive pulmonary disease3 (OR = 1.08, P = 2.0 × 10–8) and decreased lung function4 (FEV1/FVC; β = –0.019; P = 2.0 × 10–15). SFTPD encodes surfactant protein D (SP-D), which participates in innate immune response, protecting the lungs against inhaled microorganisms. The recombinant fragment of SP-D binds to the S1 spike protein of SARS-CoV-2 and potentially inhibits binding to ACE2 receptor and SARS-CoV-2 infection5. Another missense variant rs117169628:G>A (p.Pro256Leu) in SLC22A31 (16q24.3) also confers risk of hospitalization (OR = 1.09, 95% CI = 1.06–1.13, P = 2.6 × 10–8). SLC22A31 belongs to the family of solute carrier proteins that facilitate transport across membranes6 and is co-regulated with other surfactant proteins7.

We found that the variant rs35705950:G>T located in the promoter of MUC5B (11p15.5) is protective against hospitalization (OR = 0.83, 95% CI = 0.86–0.93, P = 6.5 × 10–9). This well-studied promoter variant increases the expression of MUC5B in lung in GTEx (P = 6.7 × 10–16) and is the strongest known variant associated with an increased risk of developing idiopathic pulmonary fibrosis (IPF)8,9, but also improves survival in patients with IPF carrying this mutation10.

Finally, we found that rs190509934:T>C, which is located 69 bp upstream of ACE2 (Xp22.2), is associated with decreased susceptibility risk (OR = 0.69, 95% CI = 0.63–0.75, P = 3.6 × 10–18). ACE2 is the SARS-CoV-2 receptor and functionally interacts with SLC6A19 and SLC6A20 (ref. 11), one of which also showed a significant association with susceptibility (rs73062389:G>A at SLC6A20; OR = 1.18, 95% CI = 1.16–1.20, P = 2.5 × 10–74). Notably, rs190509934 is ten times more common in south Asian populations (minor allele frequency (MAF) = 0.027) than in European populations (MAF = 0.0024), demonstrating the importance of diversity for variant discovery. Recent results have shown that the rs190509934:T>C variant lowers ACE2 expression, which in turn confers protection against SARS-CoV-2 infection12.

We applied Mendelian randomization to infer potential causal relationships between COVID-19-related phenotypes and their genetically correlated traits (Supplementary Methods; Supplementary Tables 7–9 and Supplementary Fig. 8). A causal association was observed between genetic liability to type 2 diabetes and SARS-CoV-2 reported infection (OR = 1.02, 95% CI = 1.01–1.03, P = 1.6 × 10−3), and COVID-19 hospitalization (OR = 1.06, 95% CI = 1.03–1.1, P = 1.4 × 10−4). Multivariable Mendelian randomization was used to estimate the direct effect of liability to type 2 diabetes on COVID-19-related phenotypes that was not mediated through body mass index. This analysis indicated that the observed causal association of liability to type 2 diabetes on COVID-19 phenotypes is mediated by body mass index (Supplementary Table 10).
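As an illustration of how a Mendelian randomization estimate of this form is obtained, the sketch below implements the inverse-variance-weighted (IVW) estimator, a commonly used method. This section does not state which estimator was applied, so IVW is an assumption here, and the variant-level effects are hypothetical numbers rather than study data.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect (log odds ratio scale).

    beta_exposure: variant effects on the exposure (e.g. type 2 diabetes liability)
    beta_outcome:  effects of the same variants on the outcome
    se_outcome:    standard errors of the outcome effects
    """
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


def to_odds_ratio(estimate, se, z=1.96):
    """Convert a log odds ratio and its standard error to an OR with a 95% CI."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


# Hypothetical instruments: three variants with a common 0.05 ratio of
# outcome effect to exposure effect.
estimate, se = ivw_estimate(
    [0.1, 0.2, 0.3], [0.005, 0.010, 0.015], [0.002, 0.002, 0.002]
)
odds_ratio, ci_low, ci_high = to_odds_ratio(estimate, se)
```

The multivariable analysis described above extends this idea by regressing the outcome effects on the variant effects for several exposures jointly (here, type 2 diabetes liability and body mass index), which is not shown in this sketch.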

We have substantially expanded the genetic analysis of SARS-CoV-2 infection and COVID-19 severity by doubling the number of cases and identifying 11 new loci. We developed an approach to systematically assign the 23 discovered loci to either disease susceptibility (7 loci) or disease severity (16 loci). Although distinguishing between the two phenotypes is challenging because progression to a severe form of the disease requires susceptibility to infection in the first place, it is now evident that the genetic mechanisms involved in these two aspects of the disease can be differentiated. Among the new loci associated with disease susceptibility, ACE2 represents an expected, albeit interesting, finding. MUC5B, SFTPD and SLC22A31 are the three most interesting new loci associated with COVID-19 severity. Their relationship with lung function and lung diseases is consistent with loci previously associated with disease severity. The surfactant proteins secreted by alveolar cells, representing an emerging biological mechanism, maintain healthy lung function and facilitate the clearance of pathogens13. The protective effect of the MUC5B variant is unexpected given the otherwise risk-increasing, concordant effect between IPF and COVID-19 observed for other variants9. Nonetheless, this result aligns with the MUC5B promoter variant association that shows a twofold higher survival rate among patients with IPF10. In mice, Muc5b seems to be essential for effective mucociliary clearance and for controlling infection14, which suggests that therapies to control mucin secretion may be beneficial in patients with COVID-19.

Expanding genomic research to include participants from around the world enabled us to test whether the effect of COVID-19-related genetic variants was markedly different across ancestry groups. We did not detect obvious heterogeneity between ancestry groups, and we attribute the observed heterogeneity in the effect of COVID-19-related genetic variants to the diverse inclusion criteria across studies in terms of COVID-19 severity. However, we also note that ascertainment differences across studies might mask true underlying differences in effect sizes between ancestry groups.

The biological insights gained by this expansion of the COVID-19 Host Genetics Initiative show that increasing sample size and diversity remains a fruitful way to better understand the human genetic architecture of COVID-19.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.