Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) enters human host cells via angiotensin-converting enzyme 2 (ACE2) and causes coronavirus disease 2019 (COVID-19). Here, through a genome-wide association study, we identify a variant (rs190509934, minor allele frequency 0.2–2%) that downregulates ACE2 expression by 37% (P = 2.7 × 10−8) and reduces the risk of SARS-CoV-2 infection by 40% (odds ratio = 0.60, P = 4.5 × 10−13), providing human genetic evidence that ACE2 expression levels influence COVID-19 risk. We also replicate the associations of six previously reported risk variants, of which four were further associated with worse outcomes in individuals infected with the virus (in/near LZTFL1, MHC, DPP9 and IFNAR2). Lastly, we show that common variants define a risk score that is strongly associated with severe disease among cases and modestly improves the prediction of disease severity relative to demographic and clinical factors alone.
Coronavirus disease 2019 (COVID-19) is caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which enters human host cells via angiotensin-converting enzyme 2 (ACE2)1. COVID-19 ranges from asymptomatic infection to severe disease, including respiratory failure and death2,3,4, and has led to more than 5 million deaths worldwide since December 20195. Reported risk factors for severe COVID-19 include male sex, older age, ethnicity, obesity and cardiovascular and respiratory diseases6,7,8, among others. Host genetic factors have also been shown to modulate the risk of infection and disease severity9,10,11,12. The largest human genetics study performed so far included data from 49,562 individuals infected with SARS-CoV-2 and >1.7 million individuals with no record of infection as controls, and identified 13 independent common risk variants12, many located in or near immune-related genes, such as IFNAR2 and CXCR6. Genetic studies of rare variation assayed through exome or genome sequencing have also suggested a role in COVID-19 for genes in the type 1 interferon (IFN) pathway, including TLR713,14,15. Still, genetic susceptibility to SARS-CoV-2 infection and progression to severe COVID-19, and the applicability of these findings for risk prediction, remain incompletely understood. In this study, we performed a genome-wide association study (GWAS) meta-analysis to identify additional genetic variants associated with COVID-19, since these may help identify new therapies. We also tested the utility of genetic risk scores (GRS) to identify individuals at the highest risk of severe disease, who could be prioritized for vaccination or therapeutic interventions, which globally are in short supply.
GWAS of SARS-CoV-2 infection identifies ACE2 association
We performed GWAS of COVID-19 outcomes across 52,630 individuals with COVID-19 and 704,016 individuals with no record of SARS-CoV-2 infection aggregated from 4 studies (Geisinger Health System (GHS), Penn Medicine BioBank (PMBB), UK Biobank (UKB) and AncestryDNA; Supplementary Table 1) and 5 continental ancestries. Of the cases with COVID-19, 6,911 (13.1%) were hospitalized and 2,184 (4.1%) had severe disease; hospitalized patients were more likely to be older, of non-European ancestry and to have preexisting cardiovascular and lung disease (Supplementary Table 2). Using these data, we defined five case-control comparisons related to the risk of infection and two others related to disease severity among cases with COVID-19 (Table 1 and Supplementary Table 3). For each comparison, we performed ancestry-specific GWAS in each study using REGENIE (Methods) and then combined the results using a fixed-effects meta-analysis. Genomic inflation factors (λGC) for the meta-analyses were <1.05, suggesting no substantial impact of population structure or unmodeled relatedness (Supplementary Table 4). Unless otherwise noted, all association P values reported henceforth are from Firth (disease traits) or linear (quantitative traits) regression tests performed in REGENIE.
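The fixed-effects combination step can be illustrated with a minimal inverse-variance-weighted sketch. This is not the authors' pipeline: the function name `fixed_effects_meta` is hypothetical, and the per-study log odds ratios and standard errors are made-up values chosen only to show the arithmetic.

```python
import math

def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis.

    betas: per-study effect estimates (for example, log odds ratios)
    ses:   matching standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]      # weight = 1 / variance
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal P value
    return pooled_beta, pooled_se, z, p

# Hypothetical per-study estimates, for illustration only.
betas = [-0.55, -0.40, -0.62, -0.48]
ses = [0.15, 0.30, 0.12, 0.10]
beta, se, z, p = fixed_effects_meta(betas, ses)
print(f"pooled OR = {math.exp(beta):.2f}, SE = {se:.3f}, P = {p:.1e}")
```

Studies with smaller standard errors dominate the pooled estimate, which is why the largest cohorts tend to drive a meta-analysis result.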
Our analysis provides independent support for several risk variants reported in previous GWAS of COVID-199,10,11 (Supplementary Table 5), including those recently reported by the COVID-19 Host Genetics Initiative (HGI)12, to which we contributed an earlier version of these data (Supplementary Table 6). Details for these replicated loci follow below, but first we looked for new genetic associations that might have been missed by the HGI. Across the seven risk and severity phenotypes, considering both common (minor allele frequency (MAF) > 0.5%, up to 13 million) and rare (MAF < 0.5%, up to 76 million) variants, we observed one previously unreported association at a conservative P < 8 × 10−11 (Bonferroni correction for seven phenotypes × 89 million variants). This association was between a lower risk of SARS-CoV-2 infection (52,630 cases positive for COVID-19 versus 704,016 COVID-19 negative or unknown controls) and rs190509934:C on the X chromosome (MAF = 0.3%, odds ratio (OR) = 0.60, 95% confidence interval (CI) = 0.52–0.69, P = 4.5 × 10−13; Fig. 1). This rare variant is located 60 base pairs (bp) upstream of the ACE2 gene (Fig. 2a), the primary cell entry receptor for SARS-CoV-216.
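The Bonferroni threshold quoted above follows directly from the number of tests. A short sketch of the arithmetic, using only the counts stated in the text (the function name is illustrative):

```python
ALPHA = 0.05
N_PHENOTYPES = 7
N_VARIANTS = 89_000_000  # up to 13 million common + 76 million rare variants

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests

threshold = bonferroni_threshold(ALPHA, N_PHENOTYPES * N_VARIANTS)
print(f"threshold = {threshold:.1e}")  # 8.0e-11

# The reported ACE2 association clears this threshold.
print(4.5e-13 < threshold)  # True
```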
Given the potential significance of these findings, we studied the association between the ACE2 variant rs190509934 and COVID-19 outcomes in greater detail. We found that the variant was well imputed (imputation info score > 0.5 for all studies) and that there was no evidence for differences in effect size (heterogeneity test P > 0.05) across studies (Fig. 2b) or ancestries (Supplementary Table 7). However, a significantly stronger association with SARS-CoV-2 infection (heterogeneity test P = 0.009) was observed in males (OR = 0.49, P = 7.0 × 10−11, explaining 0.085% of the variance in disease liability17, h2) when compared to females (OR = 0.72, P = 5 × 10−4; h2 = 0.017%). There were no associations between rs190509934 and 6 clinical risk factors for COVID-19 after multiple test correction (all with P > 0.05/6 = 0.008; Supplementary Table 8), suggesting that these factors were unlikely to have confounded the analysis. We then investigated the association between rs190509934 and severity among cases with COVID-19 and found that carriers of rs190509934:C had a numerically (but not significantly) lower risk of worse disease outcomes when compared to non-carriers (for example, OR = 0.69, P = 0.16 when comparing 6,779 cases hospitalized with COVID-19 versus 44,968 cases not hospitalized with COVID-19; Supplementary Table 9). These results demonstrate that rs190509934 near ACE2 confers protection against SARS-CoV-2 infection and potentially also modulates disease severity among individuals infected with the virus; since the variant is relatively uncommon, a definitive account of its role in disease severity requires assessing larger numbers of severe cases.
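As a rough plausibility check on the sex difference, the male and female estimates can be compared with a two-group z test. The sketch below makes several assumptions that are not taken from the paper: standard errors are backed out from the rounded ORs and P values under a normal approximation, the two estimates are treated as independent, and the helper names are hypothetical. The study's own heterogeneity test may have been computed differently.

```python
import math
from statistics import NormalDist

def se_from_or_and_p(odds_ratio, p_two_sided):
    """Approximate SE of log(OR) implied by a two-sided P value."""
    z = -NormalDist().inv_cdf(p_two_sided / 2.0)  # positive z score
    return abs(math.log(odds_ratio)) / z

def heterogeneity_p(or_a, p_a, or_b, p_b):
    """Two-sided P value for a difference between two independent log ORs."""
    se_a = se_from_or_and_p(or_a, p_a)
    se_b = se_from_or_and_p(or_b, p_b)
    diff = math.log(or_a) - math.log(or_b)
    z = diff / math.sqrt(se_a ** 2 + se_b ** 2)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Males: OR = 0.49, P = 7.0e-11; females: OR = 0.72, P = 5e-4 (from the text).
p_het = heterogeneity_p(0.49, 7.0e-11, 0.72, 5e-4)
print(f"approximate heterogeneity P = {p_het:.3f}")
```

Because the inputs are rounded, the result should only be expected to land in the same range as the reported heterogeneity P value, not to match it exactly.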
We speculated that the protective rare variant near ACE2 (rs190509934:C) might regulate ACE2 expression. This variant was not characterized by the Genotype-Tissue Expression (GTEx) consortium18 or 51 other gene expression studies we queried (Supplementary Table 10). Thus, to test its association with ACE2 expression, we analyzed RNA sequencing (RNA-seq) data from liver tissue available in a subset of 2,035 individuals from the GHS study, including 8 heterozygous and 1 hemizygous carrier for rs190509934:C. After adjusting for potential confounders (for example, body mass index (BMI), liver disease), we found that rs190509934:C reduced ACE2 expression by 0.87 s.d. units (95% CI = −1.18 to −0.57, linear regression test P = 2.7 × 10−8; Fig. 3a). When considering raw, prenormalized ACE2 expression levels, rs190509934:C was associated with a 37% reduction in expression relative to non-carriers (Fig. 3b). There was no association with the expression of 8 other nearby genes (within 500 kilobases (kb), with detectable expression in our dataset) after accounting for multiple testing. These results are consistent with rs190509934:C lowering ACE2 expression, which in turn confers protection from SARS-CoV-2 infection.
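The reported effect size, confidence interval and P value for ACE2 expression can be checked against each other with a normal approximation. This is only a consistency sketch: the function name is hypothetical, a 95% interval of ±1.96 standard errors is assumed, and the published P value comes from the actual regression, so small differences are expected.

```python
import math

def normal_p_from_ci(estimate, ci_low, ci_high, z_crit=1.96):
    """Back out SE, z and a two-sided P value from an estimate and its 95% CI."""
    se = (ci_high - ci_low) / (2.0 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p

# Effect of rs190509934:C on ACE2 expression, in s.d. units (from the text).
se, z, p = normal_p_from_ci(-0.87, -1.18, -0.57)
print(f"SE = {se:.3f}, z = {z:.2f}, approximate P = {p:.1e}")
```

The approximate P value falls within the same order of magnitude as the reported 2.7 × 10−8.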
In addition to its role in viral infections, the normal physiological role of ACE2 involves its hydrolysis and clearance of angiotensin II, a vasoconstrictive peptide that can lead to higher vascular tone or blood pressure19. Therefore, we investigated if rs190509934:C was associated with higher systolic blood pressure in the UKB study but found no significant association (Beta = 0.009 s.d. units, P = 0.56; Supplementary Table 11). There was a trend for higher blood pressure among carriers of ultrarare coding variants in ACE2 that are predicted to be full loss of function (Beta = 0.219 s.d. units, P = 0.09; Supplementary Table 11) and which were assayed through exome sequencing20. These results need to be confirmed in larger datasets but suggest that ACE2 loss of function may modestly increase blood pressure. This should be considered if ACE2 blockade is to be developed for COVID-19 treatment, although pharmacological inhibition of ACE2 in such a setting would be expected to be short term and elevations in blood pressure could be managed with antihypertensives. Of note, ACE2 expression in the airways was reported to be higher in smokers and patients with chronic obstructive pulmonary disease (COPD)21 and to increase with age22. Collectively, these observations and our genetic findings are consistent with the hypothesis that ACE2 levels play a key role in determining COVID-19 risk.
Replication of previously reported associations
As noted, our GWAS also identified associations at several loci reported in previous GWAS of COVID-19 outcomes. To explore previously reported signals in detail, we first attempted to replicate 8 independent associations (linkage disequilibrium (LD) r2 < 0.05) with disease risk (Supplementary Table 5) reported in 3 recent GWAS9,10,11 that included >1,000 cases (Supplementary Table 6). After accounting for multiple testing, 6 variants had a significant (P < 0.0012) and directionally consistent association in at least 1 of our 5 disease risk analyses (Supplementary Table 12): rs73064425:T in LZTFL1 (published OR = 2.14; strongest in our analysis of cases with severe COVID-19 versus COVID-19-negative or unknown controls; MAF = 7%, OR = 1.58, P = 2 × 10−18); rs2531743:G near SLC6A20 (published OR = 0.92; COVID-19-positive versus COVID-19-negative; MAF = 42%, OR = 0.94, P = 3 × 10−12); rs143334143:A in the major histocompatibility complex (MHC) (published OR = 1.85; COVID-19-positive versus COVID-19-negative; MAF = 7%, OR = 1.06, P = 2 × 10−4); rs879055593:T in ABO (published OR = 1.17; COVID-19-positive versus COVID-19-negative or unknown; MAF = 24%, OR = 1.10, P = 7 × 10−34); rs2109069:A in DPP9 (published OR = 1.36; cases hospitalized with COVID-19 versus COVID-19-negative or unknown; MAF = 31%, OR = 1.10, P = 3 × 10−7); and rs2236757:A in IFNAR2 (published OR = 1.28; cases hospitalized with COVID-19 versus COVID-19-negative or unknown; MAF = 29%, OR = 1.08, P = 7 × 10−5). The variants in LZTFL1 and SLC6A20 are located 63 kb apart at the 3p21.31 locus first reported by Ellinghaus et al.9, which contains a core risk haplotype that includes 13 variants in high LD with each other23. 
However, in individuals of European ancestry, this haplotype block (indexed by rs35044562) is in high LD with the LZTFL1 variant rs73064425 (r2 = 0.99) but not the SLC6A20 variant rs2531743 (r2 = 0.02), indicating that these two signals—for severe COVID-19 among infected individuals and for risk of SARS-CoV-2 infection compared with individuals who did not test positive for COVID-19, respectively—are likely independent.
There was no evidence for heterogeneity in effect sizes across studies (all with P > 0.05; Supplementary Table 12) or ancestries (all with P > 0.05; Supplementary Table 13) for any of the six variants. We also explored the possibility that the association between these six variants and COVID-19 could have been confounded by disease status for relevant comorbidities. We found that only two of the six variants were associated with a clinical risk factor: the MHC variant was associated with asthma (P = 6.8 × 10−9) and type 2 diabetes (T2D) (P = 1.5 × 10−5), while the ABO variant was associated with kidney disease (P = 1.4 × 10−4) and T2D (P = 9.7 × 10−5; Supplementary Table 8). Importantly, however, for both variants the association with COVID-19 was essentially unchanged after adjusting for the associated clinical risk factors (MHC: OR = 1.09 versus OR = 1.08; ABO: OR = 1.08 versus OR = 1.07; Supplementary Table 14). Therefore, we conclude that the association between the six variants and COVID-19 is unlikely to be explained by these underlying comorbidities.
Associations with disease severity among cases with COVID-19
We then investigated which replicated variants were associated with severity among cases with COVID-19. Among the 6 replicated variants (in/near LZTFL1, SLC6A20, MHC, ABO, DPP9 and IFNAR2), 4 were significantly (P < 0.05) associated with worse outcomes among infected individuals (in/near LZTFL1, MHC, DPP9 and IFNAR2), while those in ABO and near SLC6A20 were not associated with COVID-19 severity (Extended Data Fig. 1 and Supplementary Table 15). Collectively, these results highlight four variants associated with both COVID-19 risk and worse disease outcomes, including respiratory failure and death. These variants may be used to identify individuals at risk of severe COVID-19 and guide the search for genes involved in the pathophysiology of COVID-19.
Next, we evaluated whether variants identified by the COVID-19 HGI, a large worldwide effort to identify genetic risk factors for COVID-19, could augment this set of four disease severity variants. The latest HGI analyses12 include data from 49,562 individuals infected with SARS-CoV-2 and use >1.7 million individuals with no record of infection as controls (Supplementary Table 16). To identify additional variants associated with severity, we started with variants associated with the phenotype ‘reported infection’ (infected versus no record of infection) which, despite the sample overlap between the HGI and our analyses, was statistically independent from severity among infected individuals because infection status (positive cases versus negative or unknown controls) is uncorrelated with hospitalization status once infected (hospitalized versus non-hospitalized cases). We found that two variants were nominally associated with the risk of severe disease among cases (rs11919389 near RPL24, P = 0.029 and rs1886814 near FOXP4, P = 0.018; Supplementary Table 16), suggesting that these loci also modulate disease severity after infection with SARS-CoV-2.
Likely effector genes of variants associated with COVID-19
Collectively, our association analyses highlighted six common variants identified in previous GWAS or by the HGI (in/near LZTFL1, MHC, DPP9, IFNAR2, RPL24 and FOXP4) that are associated with COVID-19 as well as disease severity among cases. To help identify genes that might underlie the observed associations, we searched for functional protein-coding variants (missense or predicted loss of function) in high LD (r2 > 0.80) with each variant. We found eight functional variants in five genes (Supplementary Table 17): IFNAR2, a cytokine receptor component in the antiviral type 1 IFN pathway, which is activated by SARS-CoV-2 and is dysregulated in cases with severe COVID-1914,24; CCHCR1, a P-body protein associated with cytoskeletal remodeling and messenger RNA turnover25,26; TCF19, a transcription factor associated with hepatitis B27; and C6orf15 and PSORS1C1, two functionally uncharacterized genes in the MHC. These data indicate that the variants identified may have functional effects on these five genes.
We then asked if any of the 6 sentinel variants colocalized (that is, were in high LD, r2 > 0.80) with published sentinel expression quantitative trait loci (eQTLs) across 52 studies (considering eQTLs associated with gene expression at a P < 2.5 × 10−9 in the original studies; Supplementary Table 10), specifically focusing on 114 genes in cis (±500 kb). We found colocalization with sentinel eQTLs for eight genes (Supplementary Table 18): SLC6A20 (eQTLs from lung), a proline transporter that binds the host SARS-CoV-2 receptor, ACE228; NXPE3 (esophagus), a gene of unknown function; SENP7 (blood), a SUMO-specific protease that promotes IFN signaling and that in mice is essential for innate defense against herpes simplex virus 1 infection29; IFNAR2 and TCF19 (multiple tissues), both discussed above; LST1 (blood), an immunomodulatory protein that inhibits lymphocyte proliferation30 and is upregulated in response to bacterial ligands31; HLA-C (adipose tissue), a natural killer cell ligand, which is associated with HIV infection32 and autoimmunity33; and IL10RB (multiple tissues), a pleiotropic cytokine receptor associated with persistent hepatitis B and autoimmunity34,35. Collectively, analysis of missense variation and eQTL catalogs suggests 12 potential effector genes in COVID-19 loci (ACE2, C6orf15, CCHCR1, HLA-C, IFNAR2, IL10RB, LST1, NXPE3, PSORS1C1, SENP7, SLC6A20 and TCF19), although functional studies are required to confirm these predictions.
Using GRS to predict severe disease
Next, we proceeded to evaluate if common genetic variants can help identify individuals at high risk of severe COVID-19 once infected with SARS-CoV-2. To this end, we created a weighted GRS for individuals with a record of SARS-CoV-2 infection and then compared the risk of hospitalization (hospitalized versus non-hospitalized cases) and severe disease (severe versus non-hospitalized cases) between those with a high GRS and all other cases, after adjusting for established risk factors. We considered different approaches to select variants for inclusion in the GRS. First, we reasoned that variants most informative for prediction of severe disease were those associated with worse disease outcomes among infected individuals; thus, this was the approach taken for our primary GRS analysis. Of all published genetic risk factors for COVID-19, only one variant was associated with worse outcomes among infected individuals at P < 5 × 10−8 in our analysis (rs73064425 in LZTFL1) but this likely reflects low power due to the small number of patients with severe illness that were available for analysis. To address this limitation, we also included in the GRS five additional variants (in/near MHC, DPP9, IFNAR2, RPL24 and FOXP4) that (1) had an association with risk of infection at P < 5 × 10−8 in published GWAS or by the HGI; and (2) were associated with worse disease outcomes among infected individuals in our data (Supplementary Tables 15 and 16), albeit at the suggestive level with current sample sizes. The combination of a genome-wide significant association with risk of infection in previous GWAS and a suggestive association with worse outcomes among infected individuals in the current analysis minimizes the chance that these loci represent false positive associations for disease severity. 
Of note, we did not include in the GRS five additional variants discovered by the HGI for risk of hospitalization or severe disease (Supplementary Table 16) because the HGI analysis for those two phenotypes was not statistically independent from our analysis of disease outcomes among infected individuals (due to sample overlap). To calculate the GRS, the weights used for each of the six variants corresponded to the effect size (log of the OR) reported in previous GWAS. P values reported in this section were obtained from a logistic regression test (Methods), unless otherwise noted.
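A weighted GRS of this kind is a sum of risk-allele dosages multiplied by log(OR) weights. The sketch below is illustrative rather than the authors' implementation. It uses the four published ORs quoted in the replication results (LZTFL1, MHC, DPP9 and IFNAR2), and it is an assumption that these exact values served as weights. The RPL24 and FOXP4 weights are not given in this text and are left out instead of guessed. Function names and the example dosages are hypothetical.

```python
import math

# Published ORs quoted in the replication results for four of the six GRS variants.
PUBLISHED_OR = {
    "rs73064425": 2.14,   # LZTFL1
    "rs143334143": 1.85,  # MHC
    "rs2109069": 1.36,    # DPP9
    "rs2236757": 1.28,    # IFNAR2
}
WEIGHTS = {rsid: math.log(odds) for rsid, odds in PUBLISHED_OR.items()}

def weighted_grs(dosages):
    """Sum of risk-allele dosages (0 to 2) times log(OR) weights.

    dosages: dict mapping rsID to dosage; missing variants count as 0.
    """
    return sum(w * dosages.get(rsid, 0.0) for rsid, w in WEIGHTS.items())

def top_fraction_flags(scores, fraction=0.10):
    """Flag scores at or above the cutoff marking the top `fraction`.

    Ties at the cutoff are all flagged, so slightly more than `fraction`
    of individuals can be marked as high risk.
    """
    ranked = sorted(scores, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    cutoff = ranked[n_top - 1]
    return [s >= cutoff for s in scores]

# Hypothetical individuals, for illustration only.
people = [
    {"rs73064425": 1, "rs2109069": 2},
    {"rs2236757": 1},
    {},
]
scores = [weighted_grs(d) for d in people]
print([round(s, 3) for s in scores])
print(top_fraction_flags(scores))
```

Using log(OR) weights means a variant with a large published effect, such as the LZTFL1 variant, contributes far more to the score than one with an OR close to 1.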
When considering cases with COVID-19 of European ancestry (n = 44,958), we found that having a high GRS (top 10%) was associated with a 1.38-fold increased risk of hospitalization (95% CI = 1.26–1.53, P = 6 × 10−11; Fig. 4a) and 1.58-fold increased risk of severe disease (95% CI = 1.36–1.82, P = 7 × 10−10; Fig. 4b). In other ancestries, a high GRS also appeared to predict risk of hospitalization—including among individuals of African ancestry (n = 2,598, 1.70-fold risk for high GRS, 95% CI = 1.03–2.81, P = 0.038), Hispanic or Latin American ancestry (n = 3,752, 1.56-fold risk, 95% CI = 1.00–2.43, P = 0.05) and South Asian ancestry (n = 760, 1.42-fold risk, 95% CI = 0.72–2.82, P = 0.32; Supplementary Table 19). A similar pattern was observed in non-European ancestries for risk of severe disease, although sample sizes were considerably smaller (Supplementary Table 20).
We then compared the effect of the GRS between individuals with and without established risk factors for severe COVID-19. In Europeans of both the AncestryDNA and UKB studies, we found that a high GRS (top 10%) was associated with risk of severe disease both among individuals with and without established clinical risk factors for severe COVID-19 (Fig. 5). In the meta-analysis of the two studies, a high GRS was associated with a 1.65-fold (95% CI = 1.39–1.96, P = 1 × 10−8) and 1.75-fold (95% CI = 1.28–2.40, P = 4 × 10−4) higher risk of severe disease, respectively among individuals with (n = 22,045) and without (n = 22,913) established risk factors (Supplementary Table 21), with no evidence for heterogeneity of GRS effect with clinical risk factor status (P = 0.30). Similar results were observed for risk of hospitalization (1.35-fold versus 1.39-fold; Supplementary Table 21 and Extended Data Fig. 2). We also performed this stratified analysis in individuals of Hispanic or Latin American ancestry (but not other ancestries due to small sample size) and found that a high GRS was associated with higher risk of severe disease in individuals with (n = 1,341; OR = 3.35, 95% CI = 1.56–7.21, P = 0.002) but not without (n = 2,411; OR = 0.88, 95% CI = 0.19–4.07, P = 0.873) clinical risk factors (Extended Data Fig. 3).
Next, we performed sensitivity analyses to understand the extent to which the GRS composition affected the association results described above. First, we expanded the GRS to include all 12 variants reported to associate with the risk of COVID-19 in previous GWAS (8 variants) and by the HGI (4 new variants associated with reported infection). We found that associations between the 12-SNP GRS and both risk of hospitalization and severe disease were similar to those obtained with the 6-SNP GRS (Extended Data Fig. 4). For example, using the 12-SNP GRS, we found that cases with COVID-19 in the top 10% of genetic risk had a 1.38-fold (95% CI = 1.26–1.52, P = 4 × 10−11) higher risk of hospitalization and a 1.64-fold (95% CI = 1.43–1.90, P = 6 × 10−12) higher risk of severe disease, compared to 1.38-fold and 1.58-fold, respectively, obtained with the 6-SNP GRS (above). Second, we expanded the GRS to include a larger set of variants associated with risk of infection but this resulted in weaker associations when compared to the 6-SNP GRS (Extended Data Fig. 5). Overall, these results suggest that a GRS calculated using variants associated with disease risk and severity can potentially be used to identify cases with COVID-19 at high risk of developing poor disease outcomes.
To formally address this possibility, we assessed the value of using the 6-SNP GRS to predict the risk of severe disease in addition to demographic and clinical risk factors. For this analysis, each study was split 50:50 into a training set, which was used to estimate associations between disease severity and demographic, clinical and genetic risk factors, and a validation set, where risk scores were calculated based on the effect estimates from the training set and then used to predict disease severity (Methods). We found that the ability to predict disease severity improved somewhat when the 6-SNP GRS was added to a baseline model that considered only age and sex, with the area under the receiver operating characteristic curve (AUC) improving by 0.7% in the AncestryDNA study and 0.5% in the UKB study (Fig. 6). This magnitude of improvement in the AUC was comparable to that observed with some clinical risk factors individually, such as cardiovascular disease (CVD) (0.6% and 0.5%, respectively, in AncestryDNA and UKB) and respiratory disease (1% and 0.8%, respectively). Similar results were observed when the 6-SNP GRS was added to a model that considered all non-genetic risk factors (Fig. 6), with the AUC for disease severity improving by 0.8% and 0.5%, respectively, in the AncestryDNA and UKB studies. Overall, in our analyses, age and sex were the strongest predictors of poor outcomes in individuals with COVID-19 and an elevated GRS enabled a modest improvement in predictions similar to that contributed by individual clinical risk factors.
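The AUC comparison can be sketched as follows. The example covers only the evaluation step on a validation set: it does not reproduce the model fitting on the training half, and the risk scores and outcomes are hypothetical values chosen for illustration. The AUC is computed from its rank interpretation, as the probability that a randomly chosen severe case receives a higher score than a randomly chosen non-severe case.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: predicted risk, higher means more likely severe
    labels: 1 for severe cases, 0 for non-severe cases
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical validation-set outcomes and risk scores.
labels       = [1, 1, 1, 0, 0, 0, 0, 0]
baseline     = [0.60, 0.40, 0.30, 0.50, 0.35, 0.20, 0.10, 0.30]  # age + sex only
baseline_grs = [0.65, 0.45, 0.40, 0.50, 0.30, 0.20, 0.10, 0.25]  # age + sex + GRS

delta = auc(baseline_grs, labels) - auc(baseline, labels)
print(f"AUC change from adding the GRS: {100 * delta:+.1f} percentage points")
```

The pairwise implementation is quadratic in sample size, which is fine for a sketch but would be replaced by a rank-based computation for cohorts of the size analyzed here.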
In summary, we performed a GWAS including 756,646 individuals aggregated across 4 cohorts and used both clinical and self-reported phenotypes to define risk and severity groups for COVID-19. Our analysis identified a new association between a rare variant near the ACE2 gene that decreases expression of the SARS-CoV-2 receptor and COVID-19 risk. This finding provides human genetic support for the hypothesis that ACE2 expression plays a key role in SARS-CoV-2 infection and may constitute an attractive therapeutic target for prevention of COVID-19. We also confirmed six common variant associations with risk of infection and further showed that four of these variants modulate disease severity among cases. Lastly, we demonstrated that a GRS based on common variants validated in this study modestly improves the prediction of poor disease outcomes among individuals with COVID-19.
The following caveats should be considered when interpreting the results from this study. First, our study had greater power to identify associations with disease risk than with severity outcomes, given the relatively small sample size for the latter. Second, there was phenotypic heterogeneity among cases with COVID-19 and controls and associated risk factors across our studies. One likely reason for this is that survey respondents from the AncestryDNA study were enriched for healthier individuals and cases with milder COVID-19 compared to participants of the UKB, GHS and PMBB studies, who were ascertained in clinical settings and so were enriched for hospitalized cases and cases with severe COVID-19. Other sources of heterogeneity may include regional and temporal availability of COVID-19 testing and the inability to control for viral exposure among controls. While our meta-analysis collectively spans a broad phenotypic spectrum, these individual differences may account for variability in results across reported studies. Third, we used expression levels measured in the liver to assess the impact of the ACE2 risk variant on gene expression. The liver is not the most disease-relevant tissue to assess ACE2 expression but we note that cis eQTLs are often shared across tissues18,36 and so our findings are likely predictive of decreased ACE2 expression in other tissues. Fourth, the association between GRS and risk of severe disease was strongest in European individuals of the AncestryDNA (OR = 1.72, P = 2 × 10−6) and UKB (OR = 1.65, P = 6 × 10−6) studies when compared to the smaller GHS study (OR = 1.03, P = 0.877). The lower effect size in the latter may be due to differences in ascertainment of COVID-19-positive cases, as discussed above, or stochastic, given the smaller sample size. 
We also noted that the impact of the GRS on risk of hospitalization was attenuated in comparison to severe disease, which may be a reflection of the weighting schema for the variants comprising the score; the four largest GRS weights were derived from an analysis of critically ill individuals10.
To date, SARS-CoV-2 has infected >230 million people globally, with severe COVID-19 and death disproportionately affecting older, male individuals and those of non-European ancestry or with underlying cardiovascular and respiratory comorbidities. Host genetic analyses, primarily of hospitalized cases and clinical data, have uncovered over a dozen loci associated with increased odds of severe COVID-1912. Our approach of coupling human genetics with both electronic health records (EHRs) and self-reported COVID-19 data has strengthened our knowledge of COVID-19 host genetics and uncovered an additional COVID-19 locus in ACE2. Further analysis, including of additional rare variants, may better elucidate the host genetic contribution to COVID-19 and its sequelae.
Ethical approval for the UKB study was previously obtained from the North West Centre for Research Ethics Committee (no. 11/NW/0382). The work described in this study was approved by the UKB under application no. 26041.
Approval for the DiscovEHR analyses was provided by the GHS institutional review board under project no. 2006-0258.
All data for this research project were from individuals who provided prior informed consent to participate in AncestryDNA’s Human Diversity Project, as reviewed and approved by our external institutional review board, Advarra (formerly Quorum). All data were de-identified before use.
Appropriate consent was obtained from each participant regarding the storage of biological specimens, genetic sequencing and genotyping, and access to all available EHR data. This study was approved by the institutional review board of the University of Pennsylvania and complied with the principles set out in the Declaration of Helsinki. Written informed consent was obtained for all study participants.
AncestryDNA COVID-19 research study
AncestryDNA customers over the age of 18, living in the USA, who had consented to the research, were invited to complete a survey assessing COVID-19 outcomes and other demographic information. These included SARS-CoV-2 swab and antibody test results, COVID-19 symptoms and severity, brief medical history, household and occupational exposure to SARS-CoV-2 and blood type. A total of 163,650 AncestryDNA survey respondents were selected for inclusion in this study. Respondents selected for this study included all individuals with a positive COVID-19 test together with age- and sex-matched controls. DNA samples were genotyped on an Illumina array containing 730,000 SNPs. Sample quality control (QC) involved removing individuals with discordant sex (based on reported and genetically determined sex) and those with <98% sample call rate, as described previously38. Variant QC involved removing array variants with a difference in allele frequency >0.1 between any pair of array versions used, as well as variants with a call rate <98%. Genotype data for variants not included in the array were then inferred using imputation to the Haplotype Reference Consortium (HRC) reference panel. Briefly, samples were imputed to HRC v.1.1, which consists of 27,165 individuals and 36 million variants. The HRC reference panel does not include indels; consequently, indels are not present in the imputed data. We determined best-guess haplotypes with Eagle v.2.4.1 and performed imputation with Minimac4 v.1.0.1. We used 1,117,080 unique variants as input; 8,049,082 imputed variants were retained in the final dataset. Variants with a Minimac4 r2 < 0.30 were filtered from the analysis.
The GHS MyCode Community Health Initiative is a health system-based cohort from central and eastern Pennsylvania (USA) with ongoing recruitment since 200639. A subset of 144,182 MyCode participants sequenced as part of the GHS-Regeneron Genetics Center DiscovEHR partnership were included in this study. Information on COVID-19 outcomes was obtained through the GHS COVID-19 registry. Patients were identified as eligible for the registry based on relevant laboratory results and International Classification of Diseases, Tenth Revision (ICD-10) diagnosis codes. Patient charts were then reviewed to confirm the COVID-19 diagnoses. The registry contains data on outcomes, comorbidities, medications, supplemental oxygen use, and intensive care unit admissions. DNA from participants was genotyped on either the Illumina Infinium OmniExpressExome or Global Screening Array (GSA) and imputed to the TOPMed reference panel (stratified by array) using the TOPMed Imputation Server. Before imputation, we retained variants that had a MAF ≥ 0.1%, missingness <1% and Hardy–Weinberg equilibrium test P > 10−15. After imputation, data from the Infinium OmniExpressExome and GSA datasets were merged for subsequent association analyses, which included an Infinium OmniExpressExome/GSA batch covariate, in addition to other covariates described below.
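The pre-imputation variant filter described above can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the `VariantStats` record, its field names and the example variants are hypothetical, and in practice such filters are typically applied with dedicated genotype QC software, not hand-written code.

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    variant_id: str
    maf: float          # minor allele frequency, as a fraction
    missingness: float  # fraction of samples with a missing genotype
    hwe_p: float        # Hardy-Weinberg equilibrium test P value

def passes_pre_imputation_qc(v, min_maf=0.001, max_missing=0.01, min_hwe_p=1e-15):
    """Keep variants with MAF >= 0.1%, missingness < 1% and HWE P > 1e-15."""
    return v.maf >= min_maf and v.missingness < max_missing and v.hwe_p > min_hwe_p

# Hypothetical variants, for illustration only.
variants = [
    VariantStats("var_common", maf=0.25, missingness=0.002, hwe_p=0.40),
    VariantStats("var_too_rare", maf=0.0004, missingness=0.001, hwe_p=0.80),
    VariantStats("var_high_missing", maf=0.10, missingness=0.03, hwe_p=0.50),
    VariantStats("var_hwe_fail", maf=0.30, missingness=0.001, hwe_p=1e-20),
]
kept = [v.variant_id for v in variants if passes_pre_imputation_qc(v)]
print(kept)  # ['var_common']
```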
The PMBB contains approximately 70,000 study participants, all recruited through the University of Pennsylvania Health System (UPHS). Participants donate blood or tissue and allow access to EHR information40. The PMBB participants with COVID-19 infection were identified through the UPHS COVID-19 registry, which consists of quantitative PCR (qPCR) results of all patients tested for SARS-CoV-2 infection within the health system. We then used EHRs to classify patients with COVID-19 into hospitalized and severe (ventilation or death) categories. DNA genotyping was performed with the Illumina GSA and imputation performed using the TOPMed reference panel as described for the GHS study.
We studied the host genetics of SARS-CoV-2 infection in participants of the UKB study, which took place between 2006 and 2010 and includes approximately 500,000 adults aged 40–69 at recruitment. In collaboration with the UK health authorities, the UKB has made available regular updates on COVID-19 status for all participants, including results from four main data types: qPCR test for SARS-CoV-2; anonymized EHRs; primary care; and death registry data. We report results based on phenotype data downloaded on 4 January 2021 and excluded from the analysis 28,547 individuals with a death registry event before 2020. DNA samples were genotyped as described previously41 using the Applied Biosystems UK BiLEVE Axiom Array (n = 49,950) or the closely related (95% variant overlap) Applied Biosystems UKB Axiom Array (n = 438,427). Genotype data for variants not included in the arrays were inferred using the TOPMed reference panel, as described above.
COVID-19 phenotypes used for the genetic association analyses
We grouped participants from each study into three broad COVID-19 disease categories (Supplementary Table 1): (1) positive, that is, those with a positive qPCR or serology test for SARS-CoV-2 or with a COVID-19-related ICD-10 code (U07), hospitalization or death; (2) negative, that is, those with only negative qPCR or serology test results for SARS-CoV-2 and with no COVID-19-related ICD-10 code (U07), hospitalization or death; and (3) unknown, that is, those with no qPCR or serology test results and no COVID-19-related ICD-10 code (U07), hospitalization or death. We then used these broad COVID-19 disease categories, in addition to hospitalization and disease severity information, to create seven COVID-19-related phenotypes for genetic association analyses, as detailed in Supplementary Table 3.
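The three broad categories can be expressed as a simple decision rule. The following Python sketch is illustrative only; the function name, argument names and the representation of test results as a list of booleans are assumptions and do not describe the study's actual data model.

```python
def covid_category(test_results, has_u07_code=False,
                   covid_hospitalization=False, covid_death=False):
    """Assign a broad COVID-19 category.

    test_results: list of booleans, one per qPCR or serology test
                  (True = positive). An empty list means no test on record.
    """
    # Any positive test, U07 code, COVID-19 hospitalization or death -> positive.
    if any(test_results) or has_u07_code or covid_hospitalization or covid_death:
        return "positive"
    # At least one test on record, all negative, and no other evidence -> negative.
    if test_results:
        return "negative"
    # No tests and no other evidence -> unknown.
    return "unknown"
```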
SARS-CoV-2 infection status (positive, negative or unknown) was determined based on a qPCR test for SARS-CoV-2 in the UKB, GHS and PMBB studies and self-reported results for qPCR or serology test for SARS-CoV-2 in the AncestryDNA study.
Hospitalization status (positive, negative or unknown) was determined based on the COVID-19-related ICD-10 codes U071, U072 and U073 in variable ‘diag_icd10’ (table ‘hesin_diag’) in the UKB study, self-reported hospitalization due to COVID-19 in the AncestryDNA study and medical records in the GHS and PMBB studies.
Disease severity status (severe (ventilation or death) or not severe) was determined in the UKB study based on: (1) respiratory support ICD-10 code Z998 in variable ‘diag_icd10’ (table ‘hesin_diag’); (2) the following respiratory support ICD-10 codes in variable ‘oper4’ (table ‘hesin_oper’): E85, E851, E852, E853, E854, E855, E856, E858, E859, E87, E871, E872, E873, E874, E878, E879, E89, X56, X561, X562, X563, X568, X569, X58, X581, X588 and X589; or (3) the COVID-19-related ICD-10 codes U071, U072 and U073 in cause of death (variable ‘cause_icd10’ in table ‘death_cause’). In the AncestryDNA study, disease severity was determined based on self-reported ventilation or need for supplementary oxygen due to COVID-19. In the GHS and PMBB studies, it was determined based on ventilator or high-flow oxygen use.
For association analysis in the AncestryDNA study, we excluded from the COVID-19 unknown group individuals who had (1) a first-degree relative who was COVID-19-positive or (2) flu-like symptoms.
Genetic association analyses
Association analyses in each study were performed using the genome-wide Firth logistic regression test implemented in REGENIE V2.0.1 (ref. 37). In this implementation, Firth’s approach is applied when the P value from the standard logistic regression score test is below 0.05. We included in step 1 of REGENIE (that is, prediction of individual trait values based on the genetic data) directly genotyped variants with an MAF > 1%, <10% missingness, Hardy–Weinberg equilibrium test P > 1 × 10−15 and LD pruning (1,000 variant windows, 100 variant sliding windows and r2 < 0.9). The association model used in step 2 of REGENIE included as covariates age, age2, sex, age-by-sex and the first 10 ancestry-informative principal components (PCs) derived from the analysis of a stricter set of LD-pruned (50 variant windows, 5 variant sliding windows and r2 < 0.5) common variants from the array (imputed for the GHS study) data.
Within each study, association analyses were performed separately for five different continental ancestries defined based on the array data: African (AFR); East Asian (EAS); Hispanic or Latin American (HLA; originally referred to as ‘AMR’ by the 1000 Genomes Project; a subsequent study recommended the use of HLA to refer to this ancestral group42); European (EUR); and South Asian (SAS). We determined continental ancestries by projecting each sample onto reference PCs calculated from the HapMap3 reference panel. Briefly, we merged our samples with HapMap3 samples and kept only SNPs in common between the two datasets. We further excluded SNPs with MAF < 10%, genotype missingness >5% or Hardy–Weinberg equilibrium test P < 10−5. We calculated PCs for the HapMap3 samples and projected each of our samples onto those PCs. To assign a continental ancestry group to each non-HapMap3 sample, we trained a kernel density estimator (KDE) using the HapMap3 PCs and used the KDEs to calculate the likelihood of a given sample belonging to each of the five continental ancestry groups. When the likelihood for a given ancestry group was >0.3, the sample was assigned to that ancestry group. When two ancestry groups had a likelihood >0.3, we arbitrarily assigned AFR over EUR, HLA over EUR, HLA over EAS, SAS over EUR and HLA over AFR. Samples were excluded from analysis if no ancestry likelihoods were >0.3 or if more than three ancestry likelihoods were >0.3.
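The assignment rule applied to the KDE likelihoods can be sketched as follows. This Python example is a hedged illustration, not the code used in the study: the function name and input dictionary are hypothetical, and the handling of cases that the text does not specify (exactly three groups above the threshold, or a pair of groups without a stated tie-break) is an assumption, with such samples left unassigned here.

```python
THRESHOLD = 0.3

# Tie-break rules quoted in the text, as (preferred group, other group).
TIE_BREAKS = [
    ("AFR", "EUR"),
    ("HLA", "EUR"),
    ("HLA", "EAS"),
    ("SAS", "EUR"),
    ("HLA", "AFR"),
]


def assign_ancestry(likelihoods):
    """Return an ancestry label, or None if the sample is left unassigned.

    likelihoods: hypothetical mapping of ancestry group -> KDE likelihood.
    """
    above = {group for group, value in likelihoods.items() if value > THRESHOLD}
    if len(above) == 1:
        return next(iter(above))
    if len(above) == 2:
        for preferred, other in TIE_BREAKS:
            if above == {preferred, other}:
                return preferred
    # No group above the threshold, more than three groups above it, or a
    # combination not covered by the text (assumption: leave unassigned).
    return None
```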
Results were subsequently meta-analyzed across studies and ancestries using an inverse variance-weighted fixed-effects meta-analysis.
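For reference, an inverse variance-weighted fixed-effects estimate combines per-study effects (log odds ratios) with weights equal to the reciprocal of the squared standard error. The Python sketch below shows the standard calculation; it is not the software used in the study, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ivw_fixed_effects(betas, ses):
    """Combine per-study effect estimates with inverse-variance weights.

    betas: per-study effect sizes (for example, log odds ratios).
    ses:   matching standard errors.
    Returns (pooled beta, pooled standard error, two-sided P value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, p_value
```

With two studies of equal precision, the pooled effect is simply the average of the two estimates and the pooled standard error shrinks by a factor of √2.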
Identification of putative targets of GWAS variants based on colocalization with eQTLs
We identified as a likely target of a sentinel GWAS variant any gene for which a sentinel eQTL colocalized (that is, had an LD r2 > 0.80) with the sentinel GWAS variant. That is, we only considered genes for which there was strong LD between a sentinel GWAS variant and a sentinel eQTL, which reduces the chance of spurious colocalization. Sentinel eQTLs were defined across 174 published datasets (Supplementary Table 10), considering only eQTLs associated with gene expression in cis (±1 Mb) at a conservative P < 2.5 × 10−9 threshold as described previously43. We did not use statistical approaches developed to distinguish colocalization from shared genetic effects because these have very limited resolution at high LD levels (r2 > 0.80) (ref. 44).
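The LD criterion can be illustrated with a short calculation. In this Python sketch, r2 is approximated as the squared Pearson correlation between genotype dosages (0, 1 or 2) at the two sentinel variants, which is an assumption made for illustration; the study does not state how LD was estimated here, and the function names are hypothetical.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        # Monomorphic variant: r2 is undefined; treated as no LD (assumption).
        return 0.0
    return cov * cov / (var_a * var_b)


def colocalizes(gwas_geno, eqtl_geno, threshold=0.80):
    """Apply the r2 > 0.80 rule to a sentinel GWAS variant and a sentinel eQTL."""
    return ld_r2(gwas_geno, eqtl_geno) > threshold
```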
Gene expression analysis in participants of the GHS study
For a subset of individuals from the GHS study (n = 2,035, ascertained through the Geisinger Bariatric Surgery Clinic), RNA was extracted from liver biopsies conducted during bariatric surgery to evaluate liver disease. Individuals had class 3 obesity (BMI > 40 kg m−2) or class 2 obesity (BMI 35–39 kg m−2) with an obesity-related comorbidity (for example, T2D, hypertension, sleep apnea, non-alcoholic fatty liver disease). RNA libraries were prepared using poly(A) extraction and then sequenced with 75-bp paired-end reads with two 10-bp index reads on the Illumina NovaSeq 6000 on S4 flow cells. RNA-seq data were then analyzed using the GTEx v.8 workflow18, using STAR v.2.7.3a (ref. 45) and rnaSeqQC v.1.2 (Code availability), except that GENCODE v.32 was used in lieu of v.26. Briefly: (1) raw expression counts were normalized with trimmed mean of M values (TMM) as implemented in edgeR v.3.13 (ref. 46); (2) a rank-based inverse normal transformation was applied to the normalized expression values; (3) PC analysis was performed on data from 25,078 genes with transcripts per million > 0.1 in >20% samples to identify latent factors accounting for variation in gene expression; (4) gene expression levels were adjusted for the top 100 PCs to improve power to identify cis-regulatory effects. The association between adjusted ACE2 expression and the imputed genotypes of rs190509934 was then tested using linear regression, with the following variables included as covariates: age, age2, four ancestry-informative PCs, steatosis status, fibrosis status, diabetes status and BMI at the time of bariatric surgery.
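Step (2) above, the rank-based inverse normal transformation, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the GTEx workflow code: the Blom offset of 3/8 and the use of average ranks for ties are choices made here, and the function name is hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.375):
    """Map values to normal quantiles according to their ranks.

    offset: rank offset (3/8 is the Blom constant; an assumption here).
    Tied values receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

The transformed values keep the ordering of the input but follow an approximately standard normal distribution, which limits the influence of outlying expression values on the regression.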
GRS analysis of COVID-19 hospitalization and severity
First, in each study (AncestryDNA, GHS, UKB and PMBB), we created a GRS for each COVID-19-positive individual based on variants that were reported to associate with risk of COVID-19 in previous GWAS and that we (1) independently replicated (except variants identified by the HGI) and (2) found to be associated with COVID-19 severity outcomes. We used as weights the effect (Beta) reported in previous GWAS (Supplementary Table 5). Second, we ranked individuals with COVID-19 based on the GRS and created a new binary GRS predictor by assigning each individual to a high (top 5%) or low (rest of the population) percentile group. Third, for studies with >100 hospitalized cases, we used logistic regression to test the association between the binary GRS predictor and risk of hospitalization (hospitalized cases versus all other cases), including as covariates age, sex, age-by-sex interaction and ten ancestry-informative PCs. In addition to age and sex, we included as additional covariates established clinical risk factors for COVID-19 that are outlined in the Emergency Use Authorisation treatment guidelines for casirivimab and imdevimab: BMI; chronic kidney disease (CKD); diabetes; immunosuppressive disease; COPD or other chronic respiratory disease; CVD; and hypertension. We repeated the association analysis (1) using different percentile cutoffs for the GRS (5, 10, 20, 30 and 40%) and (2) to test the association with disease severity (severe cases versus all other cases). We then stratified COVID-19 cases by clinical risk (high versus lower) and evaluated the association between the top 10% by GRS (that is, high genetic risk) and risk of hospitalization or severe disease. The stratified analyses were performed with logistic regression, with sex and ancestry-informative PCs included as covariates. 
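The first two steps of this procedure, computing a weighted GRS and converting it into a binary high-risk indicator, can be sketched as follows. The Python code is illustrative only and is not the R script referenced under Code availability; the function names, variant identifiers and the rule of flagging at least one individual are assumptions.

```python
def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages.

    dosages: hypothetical mapping of variant ID -> risk-allele dosage (0 to 2).
    weights: mapping of variant ID -> effect size (beta) from previous GWAS.
    """
    return sum(weights[variant] * dosages[variant] for variant in weights)


def top_percent_flags(scores, top_fraction=0.05):
    """Flag individuals whose GRS falls in the top fraction of the cases.

    Individuals tied with the cutoff score are all flagged, so slightly more
    than the requested fraction can be marked as high risk.
    """
    n_top = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[n_top - 1]
    return [score >= cutoff for score in scores]
```

The resulting indicator would then enter a logistic regression as the predictor of interest, alongside the covariates listed above.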
High clinical risk was defined as any one of the following: (1) age ≥65; (2) BMI ≥ 35; (3) CKD, diabetes or immunosuppressive disease; (4) age ≥55 and presence of COPD/other chronic respiratory disease, CVD or hypertension.
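This definition translates directly into a boolean rule. The Python sketch below encodes the four criteria as written; the function and argument names are hypothetical, and BMI is assumed to be in kg m−2.

```python
def high_clinical_risk(age, bmi, ckd=False, diabetes=False,
                       immunosuppressed=False, chronic_respiratory=False,
                       cvd=False, hypertension=False):
    """Return True if any of the four high-clinical-risk criteria is met."""
    return (
        age >= 65                                   # criterion (1)
        or bmi >= 35                                # criterion (2)
        or ckd or diabetes or immunosuppressed      # criterion (3)
        # criterion (4): age >= 55 plus a cardiorespiratory condition
        or (age >= 55 and (chronic_respiratory or cvd or hypertension))
    )
```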
In populations with >100 hospitalized cases, we also evaluated the impact of the GRS relative to other non-genetic risk factors associated with increased risk of hospitalization and severe disease (for example, COPD, diabetes). The datasets were randomly split 50:50 into training and test datasets. In the training dataset, a logistic regression model with age, sex and ancestry covariates was fitted. The coefficients for age and sex from this model were then used to calculate a risk score in the other half of the population, which was fitted in a second model along with ancestry covariates. From this model, the AUC from a receiver operating characteristic curve (and 95% CI) was estimated. The process was repeated iteratively, adding other demographic and clinical risk factors one at a time to the baseline model with age, sex and ancestry covariates. Models were then fitted with just the baseline model plus GRS, all factors except GRS and a final model with all demographic/clinical risk factors plus the GRS.
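As a reminder of what the AUC measures, it equals the probability that a randomly chosen case with the outcome receives a higher predicted risk than a randomly chosen case without it. The Python sketch below computes this quantity by direct pairwise comparison; it is a simple illustration, not the procedure used in the study, it does not estimate the 95% CI, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for individuals with the outcome (for example, hospitalized),
            0 otherwise.
    scores: predicted risk scores from the fitted model.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of individuals, so a rank-based formulation would be preferable for cohorts of the size analyzed here.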
Statistics and reproducibility
No statistical method was used to predetermine sample size. Individuals were excluded for the following reasons: if they were not assigned to one of the five continental ancestry groups based on principal component analysis (Methods), had died before January 2020 (near the beginning of the COVID-19 pandemic), had an unknown COVID-19 status but did have confirmed cases in their household or if the continental ancestry group had fewer than 25 cases and 25 controls (Methods). The experiments were not randomized. The investigators were not blinded to allocation during the experiments and outcome assessment. Unless otherwise noted, the association P values reported in this manuscript are from (1) Firth (disease traits) or linear (quantitative traits) regression tests performed in REGENIE for GWAS and (2) logistic regression, for the GRS analyses.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
REGENIE v.2.0.1 can be accessed at https://github.com/rgcgithub/regenie. The GWAS analyses were performed with REGENIE using automated pipelines. An R script that exemplifies how the genetic risk score analyses were performed is available at https://doi.org/10.5281/zenodo.5700998 and https://doi.org/10.5281/zenodo.5748168. R can be found at https://www.r-project.org/. rnaSeqQC is available from GitHub (https://github.com/oicr-gsi/rnaSeqQC).
1. Yan, R. et al. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science 367, 1444–1448 (2020).
2. Bai, Y. et al. Presumed asymptomatic carrier transmission of COVID-19. JAMA 323, 1406–1407 (2020).
3. Guan, W.-J. et al. Clinical characteristics of coronavirus disease 2019 in China. N. Engl. J. Med. 382, 1708–1720 (2020).
4. Kimball, A. et al. Asymptomatic and presymptomatic SARS-CoV-2 infections in residents of a long-term care skilled nursing facility—King County, Washington, March 2020. MMWR Morb. Mortal. Wkly Rep. 69, 377–381 (2020).
5. Zhu, N. et al. A novel coronavirus from patients with pneumonia in China, 2019. N. Engl. J. Med. 382, 727–733 (2020).
6. Atkins, J. L. et al. Preexisting comorbidities predicting COVID-19 and mortality in the UK Biobank community cohort. J. Gerontol. A 75, 2224–2230 (2020).
7. Cummings, M. J. et al. Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study. Lancet 395, 1763–1770 (2020).
8. Zhou, F. et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. Lancet 395, 1054–1062 (2020).
9. Ellinghaus, D. et al. Genomewide association study of severe Covid-19 with respiratory failure. N. Engl. J. Med. 383, 1522–1534 (2020).
10. Pairo-Castineira, E. et al. Genetic mechanisms of critical illness in COVID-19. Nature 591, 92–98 (2021).
11. Shelton, J. F. et al. Trans-ancestry analysis reveals genetic and nongenetic associations with COVID-19 susceptibility and severity. Nat. Genet. 53, 801–808 (2021).
12. Niemi, M. E. K. et al. Mapping the human genetic architecture of COVID-19. Nature 600, 472–477 (2021).
13. van der Made, C. I. et al. Presence of genetic variants among young men with severe COVID-19. JAMA 324, 663–673 (2020).
14. Zhang, Q. et al. Inborn errors of type I IFN immunity in patients with life-threatening COVID-19. Science 370, eabd4570 (2020).
15. Kosmicki, J. A. et al. Pan-ancestry exome-wide association analyses of COVID-19 outcomes in 586,157 individuals. Am. J. Hum. Genet. 108, 1350–1355 (2021).
16. Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).
17. Pawitan, Y., Seng, K. C. & Magnusson, P. K. E. How many genetic variants remain to be discovered? PLoS ONE 4, e7969 (2009).
18. Aguet, F. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
19. Keidar, S., Kaplan, M. & Gamliel-Lazarovich, A. ACE2 of the heart: from angiotensin I to angiotensin (1–7). Cardiovasc. Res. 73, 463–469 (2007).
20. Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
21. Saheb Sharif-Askari, N. et al. Airways expression of SARS-CoV-2 receptor, ACE2, and TMPRSS2 is lower in children than adults and increases with smoking and COPD. Mol. Ther. Methods Clin. Dev. 18, 1–6 (2020).
22. Bunyavanich, S., Do, A. & Vicencio, A. Nasal gene expression of angiotensin-converting enzyme 2 in children and adults. JAMA 323, 2427–2429 (2020).
23. Zeberg, H. & Pääbo, S. The major genetic risk factor for severe COVID-19 is inherited from Neanderthals. Nature 587, 610–612 (2020).
24. Hadjadj, J. et al. Impaired type I interferon activity and inflammatory responses in severe COVID-19 patients. Science 369, 718–724 (2020).
25. Ling, Y. H. et al. CCHCR1 interacts with EDC4, suggesting its localization in P-bodies. Exp. Cell Res. 327, 12–23 (2014).
26. Tervaniemi, M. H. et al. Intracellular signalling pathways and cytoskeletal functions converge on the psoriasis candidate gene CCHCR1 expressed at P-bodies and centrosomes. BMC Genomics 19, 432 (2018).
27. Kim, Y. J. et al. A genome-wide association study identified new variants associated with the risk of chronic hepatitis B. Hum. Mol. Genet. 22, 4233–4238 (2013).
28. Vuille-dit-Bille, R. N. et al. Human intestine luminal ACE2 and amino acid transporter expression increased by ACE-inhibitors. Amino Acids 47, 693–705 (2015).
29. Cui, Y. et al. SENP7 potentiates cGAS activation by relieving SUMO-mediated inhibition of cytosolic DNA sensing. PLoS Pathog. 13, e1006156 (2017).
30. Rollinger-Holzinger, I. et al. LST1: a gene with extensive alternative splicing and immunomodulatory function. J. Immunol. 164, 3169–3176 (2000).
31. Mulcahy, H., O’Rourke, K. P., Adams, C., Molloy, M. G. & O’Gara, F. LST1 and NCR3 expression in autoimmune inflammation and in response to IFN-γ, LPS and microbial infection. Immunogenetics 57, 893–903 (2006).
32. Apps, R. et al. Influence of HLA-C expression level on HIV control. Science 340, 87–91 (2013).
33. Kulkarni, S. et al. Genetic interplay between HLA-C and MIR148A in HIV control and Crohn disease. Proc. Natl Acad. Sci. USA 110, 20705–20710 (2013).
34. Begue, B. et al. Defective IL10 signaling defining a subgroup of patients with inflammatory bowel disease. Am. J. Gastroenterol. 106, 1544–1555 (2011).
35. Frodsham, A. J. et al. Class II cytokine receptor gene cluster is a major locus for hepatitis B persistence. Proc. Natl Acad. Sci. USA 103, 9148–9153 (2006).
36. Qi, T. et al. Identifying gene targets for brain-related traits using transcriptomic and methylomic data from blood. Nat. Commun. 9, 2282 (2018).
37. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
38. Roberts, G. H. L. et al. AncestryDNA COVID-19 host genetic study identifies three novel loci. Preprint at medRxiv https://doi.org/10.1101/2020.10.06.20205864 (2020).
39. Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354, aaf6814 (2016).
40. Park, J. et al. A genome-first approach to aggregating rare genetic variants in LMNA for association with electronic health record phenotypes. Genet. Med. 22, 102–111 (2020).
41. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
42. Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 21 (2018).
43. Ferreira, M. A. et al. Shared genetic origin of asthma, hay fever and eczema elucidates allergic disease biology. Nat. Genet. 49, 1752–1757 (2017).
44. Chun, S. et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 49, 600–605 (2017).
45. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
46. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
47. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
This research was conducted using the UKB Resource (project no. 26041). The PMBB is funded by a gift from the Smilow family, the National Center for Advancing Translational Sciences of the National Institutes of Health under CTSA award no. UL1TR001878 and the Perelman School of Medicine at the University of Pennsylvania. We thank the participants and investigators of the FinnGen study. We thank the AncestryDNA customers who voluntarily contributed information in the COVID-19 survey.
J.E.H., J.A.K., A.D., D. Sharma, N.B., A.Y., A.M., R.L., E.M., X.B., D. Sun, F.S.P.K., J.D.B., C.O.D., A.J.M., D.A.T., A.H.L., J. Mbatchou, K.W., L.G., S.E.M., H.M.K., L.D., E.S., M.J., S.B., K.S., W.J.S., A.R.S., A.E.L., J. Marchini, J.D.O., L.H., M.N.C., J.G.R., A. Baras, G.R.A. and M.A.R.F. are current and/or former employees and/or stockholders of RGC or Regeneron Pharmaceuticals. G.H.L.R., M.V.C., D.S.P., S.C.K., A. Baltzell, A.R.G., S.R.M., R.P., M.Z., K.A.R., E.L.H. and C.A.B. are current and/or former employees of AncestryDNA and may hold equity in AncestryDNA. The other authors declare no competing interests.
Peer review information
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Comparison of effect sizes across COVID-19 risk and severity outcomes for six previously reported risk variants that validated in this study.
Six variants were reported to associate with risk of COVID-19 in previous studies and replicated in our analysis. Of these, four variants also associated with disease severity among COVID-19 cases (in/near LZTFL1, CCHCR1, DPP9 and IFNAR2), whereas two variants did not (in ABO and SLC6A20). Sample size for each of the seven phenotypes is shown in Supplementary Table 3. Data are presented as odds ratio ± 95% confidence interval.
Extended Data Fig. 2 Association between a 6-SNP genetic risk score (GRS) and risk of hospitalization among COVID-19 cases of European ancestries after stratifying by the presence of clinical risk factors.
a, Rate of hospitalization in the AncestryDNA study (n = 25,353 COVID-19 cases, including 1,484 hospitalized). b, Rate of hospitalization in the UK Biobank study (n = 14,320 COVID-19 cases, including 3,878 hospitalized). High genetic risk (red bars): top 10% of the GRS. Low genetic risk (grey bars): bottom 90% of the GRS (that is, all other COVID-19 cases). Data are presented as percent of individuals hospitalized ± standard error (SE).
Extended Data Fig. 3 Association between a 6-SNP genetic risk score (GRS) and risk of hospitalization and severe disease among COVID-19 cases of Hispanic or Latin American ancestries (n = 3,752).
a, Rate of hospitalization. b, Rate of severe disease. High genetic risk (red bars): top 10% of the GRS. Low genetic risk (grey bars): bottom 90% of the GRS (that is, all other COVID-19 cases). Data are presented as percent of individuals hospitalized (a) or with severe disease (b) ± standard error (SE).
Extended Data Fig. 4 Association between a 6- and 12-SNP genetic risk score (GRS) and risk of hospitalization and severe disease among COVID-19 cases of European ancestries.
a, Associations with risk of hospitalization (n = 44,958 COVID-19 cases). b, Associations with risk of severe disease (n = 39,673). To evaluate if the association between the GRS and worse disease outcomes was dependent on the list of variants selected for analysis, we compared results between GRS calculated using different sets of variants. We considered a GRS calculated using: the six variants that were reported in previous GWAS of COVID-19 and that we further showed were associated with risk of hospitalization or severe disease among COVID-19 cases (four variants in/near LZTFL1, MHC, DPP9 and IFNAR2, see Extended Data Fig. 1; and two variants discovered by the HGI in/near RPL24 and FOXP4, see Supplementary Table 16). Analyses were performed separately in the UK Biobank, AncestryDNA and GHS studies (risk of hospitalization only) after stratifying COVID-19 cases by the presence of clinical risk factors, considering individuals with lower clinical risk (blue circles), high clinical risk (green triangles) or all individuals (grey squares). Association results were then meta-analyzed across studies. Data are presented as odds ratio ± 95% confidence interval.
Extended Data Fig. 5 Association between risk of severe disease among COVID-19 cases of European ancestries and genetic risk scores (GRS) determined based on different criteria.
a, Association results in the AncestryDNA study (n = 25,353 COVID-19 cases). b, Association results in the UK Biobank study (n = 14,320 COVID-19 cases). In each study, we compared GRS based on (i) variants that were reported in the literature and validated in this study (Literature.HGI.1var: rs73064425 in LZTFL1; Literature.HGI.5var: variants from our 6-SNP model, with the exception of rs73064425 in LZTFL1; Literature.HGI.6var: all six variants from our 6-SNP model; in green); and variants associated with the risk of infection phenotype reported by the HGI and obtained through (ii) approximate conditional analysis using GCTA-COJO, considering two association P-value thresholds (5 × 10−8 and 5 × 10−7; in orange); (iii) pruning and thresholding (P&T), using different association P-value and LD r2 thresholds (in purple); and (iv) the LDpred approach47, considering different 𝝔 parameters (in teal).
Supplementary Tables 1–21.
Supplementary Data 1
Individual-level data used to test the association between the ACE2 variant rs190509934 and ACE2 gene expression in the GHS cohort, including (1) genotypes for rs190509934 and (2) normalized gene expression levels for ACE2 and eight nearby genes.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Horowitz, J.E., Kosmicki, J.A., Damask, A. et al. Genome-wide analysis provides genetic evidence that ACE2 influences COVID-19 risk and yields risk scores associated with severe disease. Nat Genet 54, 382–392 (2022). https://doi.org/10.1038/s41588-021-01006-7