Genome-wide analysis provides genetic evidence that ACE2 influences COVID-19 risk and yields risk scores associated with severe disease

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) enters human host cells via angiotensin-converting enzyme 2 (ACE2) and causes coronavirus disease 2019 (COVID-19). Here, through a genome-wide association study, we identify a variant (rs190509934, minor allele frequency 0.2–2%) that downregulates ACE2 expression by 37% (P = 2.7 × 10−8) and reduces the risk of SARS-CoV-2 infection by 40% (odds ratio = 0.60, P = 4.5 × 10−13), providing human genetic evidence that ACE2 expression levels influence COVID-19 risk. We also replicate the associations of six previously reported risk variants, of which four were further associated with worse outcomes in individuals infected with the virus (in/near LZTFL1, MHC, DPP9 and IFNAR2). Lastly, we show that common variants define a risk score that is strongly associated with severe disease among cases and modestly improves the prediction of disease severity relative to demographic and clinical factors alone.


SARS-CoV-2 infection aggregated from 4 studies (Geisinger Health
System (GHS), Penn Medicine BioBank (PMBB), UK Biobank (UKB) and AncestryDNA; Supplementary Table 1) and 5 continental ancestries. Of the cases with COVID-19, 6,911 (13.1%) were hospitalized and 2,184 (4.1%) had severe disease; hospitalized patients were more likely to be older, of non-European ancestry and to have preexisting cardiovascular and lung disease (Supplementary Table 2). Using these data, we defined five case-control comparisons related to the risk of infection and two others related to disease severity among cases with COVID-19 (Table 1 and Supplementary  Table 3). For each comparison, we performed ancestry-specific GWAS in each study using REGENIE (Methods) and then combined the results using a fixed-effects meta-analysis. Genomic inflation factors (λ GC ) for the meta-analyses were <1.05, suggesting no substantial impact of population structure or unmodeled relatedness (Supplementary Table 4). Unless otherwise noted, all association P values reported henceforth are from Firth (disease traits) or linear (quantitative traits) regression tests performed in REGENIE.
Our analysis provides independent support for several risk variants reported in previous GWAS of COVID-19 [9][10][11] (Supplementary  Table 5), including those recently reported by the COVID-19 Host Genetics Initiative (HGI) 12 , to which we contributed an earlier version of these data (Supplementary Table 6). Details for these replicated loci follow below, but first we looked for new genetic associations that might have been missed by the HGI. Across the seven risk and severity phenotypes, considering both common (minor allele frequency (MAF) > 0.5%, up to 13 million) and rare (MAF < 0.5%, up to 76 million) variants, we observed one previously unreported association at a conservative P < 8 × 10 −11 (Bonferroni correction for seven phenotypes × 89 million variants). This association was between a lower risk of SARS-CoV-2 infection (52,630 cases positive for COVID-19 versus 704,016 COVID-19 negative or unknown controls) and rs190509934:C on the X chromosome (MAF = 0.3%, odds ratio (OR) = 0.60, 95% confidence interval (CI) = 0.52-0.69, P = 4.5 × 10 −13 ; Fig. 1). This rare variant is located 60 base pairs (bp) upstream of the ACE2 gene (Fig. 2a), the primary cell entry receptor for SARS-CoV-2 16 .
Given the potential significance of these findings, we studied the association between the ACE2 variant rs190509934 and COVID-19 outcomes in greater detail. We found that the variant was well imputed (imputation info score > 0.5 for all studies) and that there was no evidence for differences in effect size (heterogeneity test P > 0.05) across studies (Fig. 2b) or ancestries (Supplementary Table 7). However, a significantly stronger association with SARS-CoV-2 infection (heterogeneity test P = 0.009) was observed in males (OR = 0.49, P = 7.0 × 10 −11 , explaining 0.085% of the variance in disease liability 17 , h 2 ) when compared to females (OR = 0.72, P = 5 × 10 −4 ; h 2 = 0.017%). There were no associations between rs190509934 and 6 clinical risk factors for COVID-19 after multiple test correction (all with P > 0.05/6 = 0.008; Supplementary Table 8), suggesting these did not likely confound the analysis. We then investigated the association between rs190509934 and severity among cases with COVID-19 and found that carriers of rs190509934:C had a numerically (but not significantly) lower risk of worse disease outcomes when compared to non-carriers (for example, OR = 0.69, P = 0.16 when comparing 6,779 cases hospitalized with COVID-19 versus 44,968 cases not hospitalized with COVID-19; Supplementary Table 9). These results demonstrate that rs190509934 near ACE2 confers protection against SARS-CoV-2 infection and potentially also modulates disease severity among individuals infected with the virus; since the variant is relatively uncommon, a definitive account of its role in disease severity requires assessing larger numbers of severe cases.
We speculated that the protective rare variant near ACE2 (rs190509934:C) might regulate ACE2 expression. This variant was not characterized by the Genotype-Tissue Expression (GTEx) consortium 18 or 51 other gene expression studies we queried (Supplementary Table 10). Thus, to test its association with ACE2 expression, we analyzed RNA sequencing (RNA-seq) data from liver tissue available in a subset of 2,035 individuals from the GHS study, including 8 heterozygous and 1 hemizygous carrier for rs190509934:C. After adjusting for potential confounders (for example, body mass index (BMI), liver disease), we found that rs190509934:C reduced ACE2 expression by 0.87 s.d. units  Fig. 3a). When considering raw, prenormalized ACE2 expression levels, rs190509934:C was associated with a 37% reduction in expression relative to non-carriers (Fig. 3b). There was no association with the expression of 8 other nearby genes (within 500 kilobases (kb), with detectable expression in our dataset) after accounting for multiple testing. These results are consistent with rs190509934:C lowering ACE2 expression, which in turn confers protection from SARS-CoV-2 infection. In addition to its role in viral infections, the normal physiological role of ACE2 involves its hydrolysis and clearance of angiotensin II, a vasoconstrictive peptide that can lead to higher vascular tone or blood pressure 19 . Therefore, we investigated if rs190509934:C was associated with higher systolic blood pressure in the UKB study but found no significant association (Beta = 0.009 s.d. units, P = 0.56; Supplementary Table 11). There was a trend for higher blood pressure among carriers of ultrarare coding variants in ACE2 that are predicted to be full loss of function (Beta = 0.219 s.d. units, P = 0.09; Supplementary Table 11) and which were assayed through exome sequencing 20 . These results need to be confirmed in larger datasets but suggest that ACE2 loss of function may modestly increase blood pressure. This should be considered if ACE2 blockade is to be developed for COVID-19 treatment, although pharmacological inhibition of ACE2 in such a setting would be expected to be short term and elevations in blood pressure could be managed with antihypertensives. Of note, ACE2 expression in the airways was reported to be higher in smokers and patients with chronic obstructive pulmonary disease (COPD) 21 and to increase with age 22 . Collectively, these observations and our genetic findings are consistent with the hypothesis that ACE2 levels play a key role in determining COVID-19 risk.
There was no evidence for heterogeneity in effect sizes across studies (all with P > 0.05; Supplementary Table 12)  (all with P > 0.05; Supplementary Table 13) for any of the six variants. We also explored the possibility that the association between these six variants and COVID-19 could have been confounded by disease status for relevant comorbidities. We found that only two of the six variants were associated with a clinical risk factor: the MHC variant was associated with asthma (P = 6.8 × 10 −9 ) and type 2 diabetes (T2D) (P = 1.5 × 10 −5 ), while the ABO variant was associated with kidney disease (P = 1.4 × 10 −4 ) and T2D (P = 9.7 × 10 −5 ; Supplementary Associations with disease severity among cases with COVID-19. We then investigated which replicated variants were associated with severity among cases with COVID-19. Among the 6 replicated variants (in/near LZTFL1, SLC6A20, MHC, ABO, DPP9 and IFNAR2), 4 were significantly (P < 0.05) associated with worse outcomes among infected individuals (in/near LZTFL1, MHC, DPP9 and IFNAR2), while those in ABO and near SLC6A20 were not associated with COVID-19 severity (Extended Data Fig. 1 and Supplementary Table 15). Collectively, these results highlight four variants associated with both COVID-19 risk and worse disease outcomes, including respiratory failure and death. These variants may be used to identify individuals at risk of severe COVID-19 and guide the search for genes involved in the pathophysiology of COVID-19. Next, we evaluated whether variants identified by the COVID-19 HGI, a large worldwide effort to identify genetic risk factors for COVID-19, could augment this set of four disease severity variants. The latest HGI analyses 12 include data from 49,562 individuals infected with SARS-CoV-2 and use >1.7 million individuals with no record of infection as controls (Supplementary Table 16). To identify additional variants associated with severity, we started with variants associated with the phenotype 'reported infection' (infected versus no record of infection) which, despite the sample overlap between the HGI and our analyses, was statistically independent from severity among infected individuals because infection status (positive cases versus negative or unknown controls) is uncorrelated with hospitalization status once infected (hospitalized versus non-hospitalized cases). We found that two variants were nominally associated with the risk of severe disease among cases (rs11919389 near RPL24, P = 0.029 and rs1886814 near FOXP4, P = 0.018; Supplementary Table 16), suggesting that these loci also modulate disease severity after infection with SARS-CoV-2.

Likely effector genes of variants associated with COVID-19.
Collectively, our association analyses highlighted six common variants identified in previous GWAS or by the HGI-in/near LZTFL1, MHC, DPP9, IFNAR2, RPL24 and FOXP4-that are associated with COVID-19 as well as disease severity among cases. To help identify genes that might underlie the observed associations, we searched for functional protein-coding variants (missense or predicted loss of function) in high LD (r 2 > 0.80) with each variant. We found eight functional variants in five genes (Supplementary Table 17): IFNAR2, a cytokine receptor component in the antiviral type 1 IFN pathway, which is activated by SARS-CoV-2 and is dysregulated in cases with severe COVID-19 14,24 ); CCHCR1, a P-body protein associated with cytoskeletal remodeling and messenger RNA turnover 25,26 ; TCF19, a transcription factor associated with hepatitis B 27 ; and C6orf15 and PSORS1C1, two functionally uncharacterized genes in the MHC. These data indicate that the variants identified may have functional effects on these five genes.
We then asked if any of the 6 sentinel variants colocalized (that is, were in high LD, r 2 > 0.80) with published sentinel expression quantitative trait loci (eQTLs) across 52 studies (considering eQTLs associated with gene expression at a P < 2.5 × 10 −9 in the original studies; Supplementary Table 10), specifically focusing on 114 genes in cis (±500 kb). We found colocalization with sentinel eQTLs for eight genes (Supplementary Table 18): SLC6A20 (eQTLs from lung), a proline transporter that binds the host SARS-CoV-2 receptor, ACE2 28 ; NXPE3 (esophagus), a gene of unknown function; SENP7 (blood), a SUMO-specific protease that promotes IFN signaling and that in mice is essential for innate defense against herpes simplex virus 1 infection 29 ; IFNAR2 and TCF19 (multiple tissues), both discussed above; LST1 (blood), an immunomodulatory protein that inhibits lymphocyte proliferation 30 and is upregulated in response to bacterial ligands 31 ; HLA-C (adipose tissue), a natural killer cell ligand, which is associated with HIV infection 32 and autoimmunity 33 ; and IL10RB (multiple tissues), a pleiotropic cytokine receptor associated with persistent hepatitis B and autoimmunity 34,35 . Collectively, analysis of missense variation and eQTL catalogs suggests 12 potential effector genes in COVID-19 loci (ACE2, C6orf15, CCHCR1, HLA-C, IFNAR2, IL10RB, LST1, NXPE3, PSORS1C1, SENP7, SLC6A20 and TCL19), although functional studies are required to confirm these predictions.
Using GRS to predict severe disease. Next, we proceeded to evaluate if common genetic variants can help identify individuals at high risk of severe COVID-19 once infected with SARS-CoV-2. To this end, we created a weighted GRS for individuals with a record of SARS-CoV-2 infection and then compared the risk of hospitalization (hospitalized versus non-hospitalized cases) and severe disease (severe versus non-hospitalized cases) between those with a high GRS and all other cases, after adjusting for established risk factors. We considered different approaches to select variants for inclusion in the GRS. First, we reasoned that variants most informative for prediction of severe disease were those associated with worse disease outcomes among infected individuals; thus, this was the approach taken for our primary GRS analysis. Of all published genetic risk factors for COVID-19, only one variant was associated with worse outcomes among infected individuals at P < 5 × 10 −8 in our analysis (rs73064425 in LZTFL1) but this likely reflects low power due to the small number of patients with severe illness that were available for analysis. To address this limitation, we also included in the GRS five additional variants (in/near MHC, DPP9, IFNAR2, RPL24 and FOXP4) that (1) had an association with risk of infection at P < 5 × 10 −8 in published GWAS or by the HGI; and (2) were associated with worse disease outcomes among infected individuals in our data (Supplementary Tables 15 and 16), albeit at the suggestive level with current sample sizes. The combination of a genome-wide significant association with risk of infection in previous GWAS and a suggestive association with worse outcomes among infected individuals in the current analysis minimizes the chance that these loci represent false positive associations for disease severity. Of note, we did not include in the GRS five additional variants discovered by the HGI for risk of hospitalization or severe disease (Supplementary Table 16) because the HGI analysis for those two phenotypes was not statistically independent from our analysis of disease outcomes among infected individuals (due to sample overlap). To calculate the GRS, the weights used for each of the six variants corresponded to the effect size (log of the OR) reported in previous GWAS. P values reported in this section were obtained from a logistic regression test (Methods), unless otherwise noted. When considering cases with COVID-19 of European ancestry (n = 44,958), we found that having a high GRS (top 10%) was associated with a 1.38-fold increased risk of hospitalization (95% CI = 1.26-1.53, P = 6 × 10 −11 ; Fig. 4a) and 1.58-fold increased risk of severe disease (95% CI = 1.36-1.82, P = 7 × 10 −10 ; Fig. 4b). In other ancestries, a high GRS also appeared to predict risk of hospitalization-including among individuals of African ancestry (n = 2,598, 1.70-fold risk for high GRS, 95% CI = 1.03-2.81, P = 0.038), Hispanic or Latin American ancestry (n = 3,752, 1.56-fold risk, 95% CI = 1.00-2.43, P = 0.05) and South Asian ancestry (n = 760, 1.42-fold risk, 95% CI = 0.72-2.82, P = 0.32; Supplementary Table 19). A similar pattern was observed in non-European ancestries for risk of severe disease, although sample sizes were considerably smaller (Supplementary Table 20).
We then compared the effect of the GRS between individuals with and without established risk factors for severe COVID-19. In Europeans of both the AncestryDNA and UKB studies, we found that a high GRS (top 10%) was associated with risk of severe disease both among individuals with and without established clinical risk factors for severe COVID-19 (Fig. 5). In the meta-analysis of the two studies, a high GRS was associated with a 1.65-fold (95% CI = 1.39-1.96, P = 1 × 10 −8 ) and 1.75-fold (95% CI = 1.28-2.40, P = 4 × 10 −4 ) higher risk of severe disease, respectively among individuals with (n = 22,045) and without (n = 22,913) established risk factors (Supplementary Table 21), with no evidence for heterogeneity of GRS effect with clinical risk factor status (P = 0.30). Similar results were observed for risk of hospitalization (1.35-fold versus 1.39-fold; Supplementary Table 21 and Extended Data Fig. 2). We also performed this stratified analysis in individuals of Hispanic or Latin American ancestry (but not other ancestries due to small sample size) and found that a high GRS was associated with higher risk of severe disease in individuals with (n = 1,341; OR = 3.35, 95% CI = 1.56-7.21, P = 0.002) but not without (n = 2,411; OR = 0.88, 95% CI = 0.19-4.07, P = 0.873) clinical risk factors (Extended Data Fig. 3).
Next, we performed sensitivity analyses to understand the extent to which the GRS composition affected the association results described above. First, we expanded the GRS to include all 12 variants reported to associate with the risk of COVID-19 in previous GWAS (8 variants) and by the HGI (4 new variants associated with reported infection). We found that associations between the 12-SNP GRS and both risk of hospitalization and severe disease were similar to those obtained with the 6-SNP GRS (Extended Data Fig. 4). For example, using the 12-SNP GRS, we found that cases with COVID-19 in the top 10% of genetic risk had a 1.38-fold (95% CI = 1.26-1.52, P = 4 × 10 −11 ) and 1.64-fold (95% CI = 1.43-1.90, P = 6 × 10 −12 ) higher risk of severe disease, compared to 1.38-fold and 1.58-fold, respectively obtained with the 6-SNP GRS (above). Second, we expanded the GRS to include a larger set of variants associated with risk of infection but this resulted in weaker associations when compared to the 6-SNP GRS (Extended Data Fig. 5).
Overall, these results suggest that a GRS calculated using variants associated with disease risk and severity can potentially be used to identify cases with COVID-19 at high risk of developing poor disease outcomes.
To formally address this possibility, we assessed the value of using the 6-SNP GRS to predict the risk of severe disease in addition to demographic and clinical risk factors. For this analysis, each study was split 50:50 into a training set, which was used to estimate associations between disease severity and demographic, clinical and genetic risk factors, and a validation set, where risk scores were calculated based on the effect estimates from the training set and then used to predict disease severity (Methods). We found that the ability to predict disease severity improved somewhat when the 6-SNP GRS was added to a baseline model that considered only age and sex, with the area under the receiving operator characteristic curve (AUC) improving by 0.7% in the AncestryDNA study and 0.5% in the UKB study (Fig. 6). This magnitude of  improvement in the AUC was comparable to that observed with some clinical risk factors individually, such as cardiovascular disease (CVD) (0.6% and 0.5%, respectively in AncestryDNA and UKB) and respiratory disease (1% and 0.8%, respectively). Similar results were observed when the 6-SNP GRS was added to a model that considered all non-genetic risk factors (Fig. 6), with the AUC for disease severity improving by 0.8% and 0.5%, respectively in the AncestryDNA and UKB studies. Overall, in our analyses, age and sex were the strongest predictors of poor outcomes in individuals with COVID-19 and an elevated GRS enabled a modest improvement in predictions similar to that contributed by individual clinical risk factors.

Discussion
In summary, we performed a GWAS including 756,646 individuals aggregated across 4 cohorts and used both clinical and self-reported phenotypes to define risk and severity groups for COVID-19. Our analysis identified a new association between a rare variant near the ACE2 gene that decreases expression of the SARS-CoV-2 receptor and COVID-19 risk. This finding provides human genetic support  for the hypothesis that ACE2 expression plays a key role in SARS-CoV-2 infection and may constitute an attractive therapeutic target for prevention of COVID-19. We also confirmed six common variant associations with risk of infection and further showed that four of these variants modulate disease severity among cases. Lastly, we demonstrated that a GRS based on common variants validated in this study modestly improves the prediction of poor disease outcomes among individuals with COVID-19.
The following caveats should be considered when interpreting the results from this study. First, our study had greater power to identify associations with disease risk than with severity outcomes, given the relatively small sample size for the latter. Second, there was phenotypic heterogeneity among cases with COVID-19 and controls and associated risk factors across our studies. One likely reason for this is that survey respondents from the AncestryDNA study were enriched for healthier individuals and cases with milder COVID-19 compared to participants of the UKB, GHS and PMBB studies, who were ascertained in clinical settings and so were enriched for hospitalized cases and cases with severe COVID-19. Other sources of heterogeneity may include regional and temporal availability of COVID-19 testing and the inability to control for viral exposure among controls. While our meta-analysis collectively spans a broad phenotypic spectrum, these individual differences may account for variability in results across reported studies. Third, we used expression levels measured in the liver to assess the impact of the ACE2 risk variant on gene expression. The liver is not the most disease-relevant tissue to assess ACE2 expression but we note that cis eQTLs are often shared across tissues 18,36 and so our findings are likely predictive of decreased ACE2 expression in other tissues. Fourth, the association between GRS and risk of severe disease was strongest in European individuals of the AncestryDNA (OR = 1.72, P = 2 × 10 −6 ) and UKB (OR = 1.65, P = 6 × 10 −6 ) studies when compared to the smaller GHS study (OR = 1.03, P = 0.877). The lower effect size in the latter may be due to differences in ascertainment of COVID-19-positive cases, as discussed above, or stochastic, given the smaller sample size. We also noted that the impact of the GRS on risk of hospitalization was attenuated in comparison to severe disease, which may be a reflection of the weighting schema for the variants comprising the score; the four largest GRS weights were derived from an analysis of critically ill individuals 10 .
To date, SARS-CoV-2 has infected >230 million people globally, disproportionately affecting older, male individuals and those of non-European ancestry or with underlying cardiovascular and respiratory comorbidities with severe COVID-19 and death. Host genetic analysis, primarily of hospitalized cases and clinical data, have uncovered over a dozen loci associated with increased odds of severe COVID-19 12 . Our approach of coupling human genetics with both electronic health records (EHRs) and self-reported COVID-19 data has strengthened our knowledge of COVID-19 host genetics and uncovered an additional COVID-19 locus in ACE2. Further analysis, including additional rare variants, may further elucidate the host genetic contribution to COVID-19 and sequelae.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-021-01006-7.
PMBB study. Appropriate consent was obtained from each participant regarding the storage of biological specimens, genetic sequencing and genotyping, and access to all available EHR data. This study was approved by the institutional review board of the University of Pennsylvania and complied with the principles set out in the Declaration of Helsinki. Written informed consent was obtained for all study participants.
Participating studies. AncestryDNA COVID-19 research study. AncestryDNA customers over the age of 18, living in the USA, who had consented to the research, were invited to complete a survey assessing COVID-19 outcomes and other demographic information. These included SARS-CoV-2 swab and antibody test results, COVID-19 symptoms and severity, brief medical history, household and occupational exposure to SARS-CoV-2 and blood type. A total of 163,650 AncestryDNA survey respondents were selected for inclusion in this study. Respondents selected for this study included all individuals with a positive COVID-19 test together with age-and sex-matched controls. DNA samples were genotyped on an Illumina array containing 730,000 SNPs. Sample quality control (QC) involved removing individuals with discordant sex (based on reported and genetically determined sex) and those with <98% sample call rate, as described previously 38 Variant QC involved removing array variants with a difference in allele frequency >0.1 between any pair of array versions used, as well as variants with a call rate <98%. Genotype data for variants not included in the array were then inferred using imputation to the Haplotype Reference Consortium (HRC) reference panel. Briefly, samples were imputed to HRC v.1.1, which consists of 27,165 individuals and 36 million variants. The HRC reference panel does not include indels; consequently, indels are not present in the imputed data. We determined best-guess haplotypes with Eagle v.2.4.1 and performed imputation with Minimac4 v.1.0.1. We used 1,117,080 unique variants as input; 8,049,082 imputed variants were retained in the final dataset. Variants with a Minimac4 r 2 < 0.30 were filtered from the analysis.
GHS. The GHS MyCode Community Health Initiative is a health system-based cohort from central and eastern Pennsylvania (USA) with ongoing recruitment since 2006 39 . A subset of 144,182 MyCode participants sequenced as part of the GHS-Regeneron Genetics Center DiscovEHR partnership were included in this study. Information on COVID-19 outcomes was obtained through the GHS COVID-19 registry. Patients were identified as eligible for the registry based on relevant laboratory results and International Classification of Diseases, Tenth Revision (ICD-10) diagnosis codes. Patient charts were then reviewed to confirm the COVID-19 diagnoses. The registry contains data on outcomes, comorbidities, medications, supplemental oxygen use, and intensive care unit admissions. DNA from participants was genotyped on either the Illumina Infinium OmniExpressExome or Global Screening Array (GSA) and imputed to the TOPMed reference panel (stratified by array) using the TOPMed Imputation Server. Before imputation, we retained variants that had a MAF ≥ 0.1%, missingness <1% and Hardy-Weinberg equilibrium test P > 10 −15 . After imputation, data from the Infinium OmniExpressExome and GSA datasets were merged for subsequent association analyses, which included an Infinium OmniExpressExome/GSA batch covariate, in addition to other covariates described below.
PMBB study. The PMBB contains approximately 70,000 study participants, all recruited through the University of Pennsylvania Health System (UPHS). Participants donate blood or tissue and allow access to EHR information 40 . The PMBB participants with COVID-19 infection were identified through the UPHS COVID-19 registry, which consists of quantitative PCR (qPCR) results of all patients tested for SARS-CoV-2 infection within the health system. We then used EHRs to classify patients with COVID-19 into hospitalized and severe (ventilation or death) categories. DNA genotyping was performed with the Illumina GSA and imputation performed using the TOPMed reference panel as described for the GHS study.
UKB study. We studied the host genetics of SARS-CoV-2 infection in participants of the UKB study, which took place between 2006 and 2010 and includes approximately 500,000 adults aged 40-69 at recruitment. In collaboration with the UK health authorities, the UKB has made available regular updates on COVID-19 status for all participants, including results from four main data types: qPCR test for SARS-CoV-2; anonymized EHRs; primary care; and death registry data. We report results based on phenotype data downloaded on the 4 January 2021 and excluded from the analysis 28,547 individuals with a death registry event before 2020. DNA samples were genotyped as described previously 41 using the Applied Biosystems UK BiLEVE Axiom Array (n = 49,950) or the closely related (95% variant overlap) Applied Biosystems UKB Axiom Array (n = 438,427). Genotype data for variants not included in the arrays were inferred using the TOPMed reference panel, as described above.
COVID-19 phenotypes used for the genetic association analyses. We grouped participants from each study into three broad COVID-19 disease categories (Supplementary Table 1): (1) positive, that is, those with a positive qPCR or serology test for SARS-CoV-2 or with a COVID-19-related ICD-10 code (U07), hospitalization or death; (2) negative, that is, those with only negative qPCR or serology test results for SARS-CoV-2 and with no COVID-19-related ICD-10 code (U07), hospitalization or death; and (3) unknown, that is, those with no qPCR or serology test results and no COVID-19-related ICD-10 code (U07), hospitalization or death. We then used these broad COVID-19 disease categories, in addition to hospitalization and disease severity information, to create seven COVID-19-related phenotypes for genetic association analyses, as detailed in Supplementary Table 3. SARS-CoV-2 infection status (positive, negative or unknown) was determined based on a qPCR test for SARS-CoV-2 in the UKB, GHS and PMBB studies and self-reported results for qPCR or serology test for SARS-CoV-2 in the AncestryDNA study.
Hospitalization status (positive, negative or unknown) was determined based on the COVID-19-related ICD-10 codes U071, U072 and U073 in variable 'diag_ icd10' (table 'hesin_diag') in the UKB study, self-reported hospitalization due to COVID-19 in the AncestryDNA study and medical records in the GHS and PMBB studies.
For association analysis in the AncestryDNA study, we excluded from the COVID-19 unknown group individuals who had (1) a first-degree relative who was COVID-19-positive or (2) flu-like symptoms.
Genetic association analyses. Association analyses in each study were performed using the genome-wide Firth logistic regression test implemented in REGENIE V2.0.1 (ref. 37 ). In this implementation, Firth's approach is applied when the P value from the standard logistic regression score test is below 0.05. We included in step 1 of REGENIE (that is, prediction of individual trait values based on the genetic data) directly genotyped variants with an MAF > 1%, <10% missingness, Hardy-Weinberg equilibrium test P > 1 × 10 −15 and LD pruning (1,000 variant windows, 100 variant sliding windows and r 2 < 0.9). The association model used in step 2 of REGENIE included as covariates age, age 2 , sex, age-by-sex and the first 10 ancestry-informative principal components (PCs) derived from the analysis of a stricter set of LD-pruned (50 variant windows, 5 variant sliding windows and r 2 < 0.5) common variants from the array (imputed for the GHS study) data.
Within each study, association analyses were performed separately for five different continental ancestries defined based on the array data: African (AFR), Hispanic or Latin American (HLA; originally referred to as ' AMR' by the 1000 Genomes Project; a subsequent study recommended the use of HLA to refer to this ancestral group 42 ); European (EUR); and South Asian (SAS). We determined continental ancestries by projecting each sample onto reference PCs calculated from the HapMap3 reference panel. Briefly, we merged our samples with HapMap3 samples and kept only SNPs in common between the two datasets. We further excluded SNPs with MAF < 10%, genotype missingness >5% or Hardy-Weinberg equilibrium test P < 10 −5 . We calculated PCs for the HapMap3 samples and projected each of our samples onto those PCs. To assign a continental ancestry group to each non-HapMap3 sample, we trained a kernel density estimator (KDE) using the HapMap3 PCs and used the KDEs to calculate the likelihood of a given sample belonging to each of the five continental ancestry groups. When the likelihood for a given ancestry group was >0.3, the sample was assigned to that ancestry group. When two ancestry groups had a likelihood >0.3, we arbitrarily assigned AFR over EUR, HLA over EUR, HLA over EAS, SAS over EUR and HLA over AFR. Samples were excluded from analysis if no ancestry likelihoods were >0.3 or if more than three ancestry likelihoods were >0.3.
Results were subsequently meta-analyzed across studies and ancestries using an inverse variance-weighted fixed-effects meta-analysis.
Identification of putative targets of GWAS variants based on colocalization with eQTLs. We identified as a likely target of a sentinel GWAS variant any gene for which a sentinel eQTL colocalized (that is, had an LD r 2 > 0.80) with the sentinel GWAS variant. That is, we only considered genes for which there was strong LD between a sentinel GWAS variant and a sentinel eQTL, which reduces the chance of spurious colocalization. Sentinel eQTLs were defined across 174 published datasets (Supplementary Table 10), considering only eQTLs associated with gene expression in cis (±1 Mb) at a conservative P < 2.5 × 10 −9 threshold as described previously 43 . We did not use statistical approaches developed to distinguish colocalization from shared genetic effects because these have very limited resolution at high LD levels (r 2 > 0.80) (ref. 44 ).
Gene expression analysis in participants of the GHS study. For a subset of individuals from the GHS study (n = 2,035, ascertained through the Geisinger Bariatric Surgery Clinic), RNA was extracted from liver biopsies conducted during bariatric surgery to evaluate liver disease. Individuals had class 3 obesity (BMI > 40 kg m −2 ) or class 2 obesity (BMI 35-39 kg m −2 ) with an obesity-related comorbidity (for example, T2D, hypertension, sleep apnea, non-alcoholic fatty liver disease). RNA libraries were prepared using poly(A) extraction and then sequenced with 75-bp paired-end reads with two 10-bp index reads on the Illumina NovaSeq 6000 on S4 flow cells. RNA-seq data were then analyzed using the GTEx v.8 workflow 18 , using STAR v.2.7.3a (ref. 45 ) and rnaSeqQC v.1.2 (Code availability), except that GENCODE v.32 was used in lieu of v.26. Briefly: (1) raw expression counts were normalized with trimmed mean of M values (TMM) as implemented in edgeR v.3.13 (ref. 46 ); (2) a rank-based inverse normal transformation was applied to the normalized expression values; (3) PC analysis was performed on data from 25,078 genes with transcripts per million > 0.1 in >20% samples to identify latent factors accounting for variation in gene expression; (4) gene expression levels were adjusted for the top 100 PCs to improve power to identify cis-regulatory effects. The association between adjusted ACE2 expression and the imputed genotypes of rs190509934 was then tested using linear regression, with the following variables included as covariates: age, age 2 , four ancestry-informative PCs, steatosis status, fibrosis status, diabetes status and BMI at the time of bariatric surgery.

GRS analysis of COVID-19 hospitalization and severity.
First, in each study (AncestryDNA, GHS, UKB and PMBB), we created a GRS for each COVID-19positive individual based on variants that were reported to associate with risk of COVID-19 in previous GWAS and that we (1) independently replicated (except variants identified by the HGI) and (2) found to be associated with COVID-19 severity outcomes. We used as weights the effect (Beta) reported in previous GWAS (Supplementary Table 5). Second, we ranked individuals with COVID-19 based on the GRS and created a new binary GRS predictor by assigning each individual to a high (top 5%) or low (rest of the population) percentile group. Third, for studies with >100 hospitalized cases, we used logistic regression to test the association between the binary GRS predictor and risk of hospitalization (hospitalized cases versus all other cases), including as covariates age, sex, age-by-sex interaction and ten ancestry-informative PCs. In addition to age and sex, we included as additional covariates established clinical risk factors for COVID-19 that are outlined in the Emergency Use Authorisation treatment guidelines for casirivimab and imdevimab: BMI; chronic kidney disease (CKD); diabetes; immunosuppressive disease; COPD or other chronic respiratory disease; CVD; and hypertension. We repeated the association analysis (1) using different percentile cutoffs for the GRS (5, 10, 20, 30 and 40%) and (2) to test the association with disease severity (severe cases versus all other cases). We then stratified COVID-19 cases by clinical risk (high versus lower) and evaluated the association between the top 10% by GRS (that is, high genetic risk) and risk of hospitalization or severe disease. The stratified analyses were performed with logistic regression, with sex and ancestry-informative PCs included as covariates. High clinical risk was defined as any one of the following: (1) age ≥65; (2) BMI ≥ 35; (3) CKD, diabetes or immunosuppressive disease; (4) age ≥55 and presence of COPD/other chronic respiratory disease, CVD or hypertension.
In populations with >100 hospitalized cases, we also evaluated the impact of the GRS relative to other non-genetic risk factors associated with increased risk of hospitalization and severe disease (for example, COPD, diabetes). The datasets were randomly split 50:50 into training and test datasets. In the training dataset, a logistic regression model with age, sex and ancestry covariates was fitted. The coefficients for age and sex from this model were then used to calculate a risk score in the other half of the population, which was fitted in a second model along with ancestry covariates. From this model, the AUC from a receiver operating characteristic curve (and 95% CI) was estimated. The process was repeated iteratively, adding other demographic and clinical risk factors one at a time to the baseline model with age, sex and ancestry covariates. Models were then fitted with just the baseline model plus GRS, all factors except GRS and a final model with all demographic/clinical risk factors plus the GRS.

Statistics and reproducibility.
No statistical method was used to predetermine sample size. Individuals were excluded for the following reasons: if they were not assigned to one of the five continental ancestry groups based on principal component analysis (Methods), had previously passed away before January 2020 (near the beginning of the COVID-19 pandemic), had an unknown COVID-19 status but did have confirmed cases in their household or if the continental ancestry group had fewer than 25 cases and 25 controls (Methods). The experiments were not randomized. The investigators were not blinded to allocation during the experiments and outcome assessment. Unless otherwise noted, the association P values reported in this manuscript are from (1) Firth (disease traits) or linear (quantitative traits) regression tests performed in REGENIE for GWAS and (2) logistic regression, for the GRS analyses.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All genotype-phenotype association results reported in this study are available for browsing using the Regeneron Genetic Center (RGC) COVID-19 Results Browser (https://rgc-covid19.regeneron.com). Data access and use is limited to research purposes in accordance with the terms of use (https://rgc-covid19.regeneron.com/ terms-of-use). Gene expression levels derived from the liver RNA-seq data for ACE2 and the eight nearby genes analyzed in this study, as well as genotypes for the ACE2 variant associated with the risk of SARS-CoV-2 infection (rs190509934), are provided in Supplementary Data 1.

Code availability
REGENIE v.2.0.1 can be accessed at https://github.com/rgcgithub/regenie. The GWAS analyses were performed with REGENIE using automated pipelines. An R script that exemplifies how the genetic risk score analyses were performed is available at https://doi.org/10.5281/zenodo.5700998 and https://doi.org/10.5281/ zenodo.5748168. R can be found at https://www.r-project.org/. rnaSeqQC is available from GitHub (https://github.com/oicr-gsi/rnaSeqQC). approved the final manuscript. All authors: contributed to securing the funding, study design and oversight; reviewed the final version of the manuscript; performed and were responsible for sample genotyping and exome sequencing; conceived and were responsible for laboratory automation, sample tracking and the library information management system; were responsible for the development and validation of the clinical phenotypes used to identify study participants and (when applicable) controls; performed and were responsible for the analysis needed to produce exome and genotype data; provided the computing infrastructure development and operational support; provided variant and gene annotations and their functional interpretation of variants and conceived and were responsible for creating, developing and deploying the analysis platforms and computational methods used to analyze the genomic data; developed the statistical analysis plans; contributed to the quality control of the genotype and phenotype files and the generation of the analysis-ready datasets; developed the statistical genetics pipelines and tools and use thereof in the generation of the association results; contributed to the quality control of the review and the interpretation of the results and generated and formatted the results to create the manuscript figures; contributed to the development of the study design and analysis plans and the quality control of the phenotype definitions; quality-controlled, reviewed and interpreted the association results; developed the in vivo and in vitro experimental biology and interpretation, contributed to the management and coordination of all research activities, planning and execution and managed the review of the project.