Whole-genome sequencing reveals host factors underlying critical COVID-19

Critical COVID-19 is caused by immune-mediated inflammatory lung injury. Host genetic variation influences the development of illness requiring critical care1 or hospitalization2–4 after infection with SARS-CoV-2. The GenOMICC (Genetics of Mortality in Critical Care) study enables the comparison of genomes from individuals who are critically ill with those of population controls to find underlying disease mechanisms. Here we use whole-genome sequencing in 7,491 critically ill individuals compared with 48,400 controls to discover and replicate 23 independent variants that significantly predispose to critical COVID-19. We identify 16 new independent associations, including variants within genes that are involved in interferon signalling (IL10RB and PLSCR1), leucocyte differentiation (BCL11A) and blood-type antigen secretor status (FUT2). Using transcriptome-wide association and colocalization to infer the effect of gene expression on disease severity, we find evidence that implicates multiple genes—including reduced expression of a membrane flippase (ATP11A), and increased expression of a mucin (MUC1)—in critical disease. Mendelian randomization provides evidence in support of causal roles for myeloid cell adhesion molecules (SELE, ICAM5 and CD209) and the coagulation factor F8, all of which are potentially druggable targets. Our results are broadly consistent with a multi-component model of COVID-19 pathophysiology, in which at least two distinct mechanisms can predispose to life-threatening disease: failure to control viral replication; or an enhanced tendency towards pulmonary inflammation and intravascular coagulation. We show that comparison between cases of critical illness and population controls is highly efficient for the detection of therapeutically relevant mechanisms of disease.


Article
Using microarray genotyping in 2,244 cases, we previously discovered that critical COVID-19 is associated with genetic variation in the host immune response to viral infection (OAS1, IFNAR2 and TYK2) and the inflammasome regulator DPP9 1 .In collaboration with international groups, we extended these findings to include a variant near TAC4 (rs77534576) 3 .Several variants have been associated with milder phenotypes, including the ABO blood-type locus 2 , a pleiotropic inversion in chr17q21.31 9and associations in five additional loci, including the T lymphocyte-associated transcription factor, FOXP4 3 .An enrichment of rare loss-of-function variants in candidate interferon signalling genes has been reported 4 , but this has yet to be replicated at genome-wide significance thresholds 10,11 .
In partnership with Genomics England, we performed whole-genome sequencing (WGS) to improve the resolution and deepen the fine-mapping of significant signals and thereby provide further biological insight into critical COVID-19.Here we present results from a cohort of 7,491 critically ill patients from 224 intensive care units, compared with 48,400 control individuals, describing the discovery and validation of 23 gene loci for susceptibility to critical COVID-19 (Extended Data Fig. 1).

Genome-wide association study analysis
After quality control procedures, we used a logistic mixed model regression, implemented in SAIGE 12 , to perform association analyses with unrelated individuals (critically ill cases, n = 7,491; controls, n = 48,400 (100,000 Genomes Project (100k) cohort, n = 46,770; mild COVID-19, n = 1,630) (Methods, Supplementary Table 2).A total of 1,339 of these cases were included in the primary analysis for our previous report 1 .Genome-wide association studies (GWASs) were performed separately for genetic ancestry groups (n cases /n controls : European (EUR) 5,989/42,891; South Asian (SAS) 788/3,793; African (AFR) 440/1,350; East Asian (EAS) 274/366), and combined by inverse-variance-weighted fixed effects meta-analysis using METAL (Methods).We established the independence of signals using GCTA-cojo, and we validated this with conditional analysis using individual-level data with SAIGE (Methods, Supplementary Table 6).To reduce the risk of spurious associations arising from genotyping or pipeline errors, we required supporting evidence from variants in linkage disequilibrium (LD) for all genome-wide-significant variants: observed z-scores for each variant were compared with imputed z-scores for the same variant, with discrepant values being excluded (see Methods, Supplementary Fig. 2).Variants and the reference and alternative allele are reported according to GRCh38.The three variants discovered in multi-ancestry meta-analysis but not in the European ancestry GWAS are labelled with ‡, and † indicates genome-wide significant heterogeneity.REF and ALT columns indicate the reference and alternative alleles; an asterisk (*) indicates the risk allele.For each variant, we report the risk allele frequency in Europeans (RAF), the odds ratio and 95% confidence interval (OR and OR CI ), and the association P value.'Consequence' indicates the predicted worst consequence type across GENCODE basic transcripts predicted by VEP (v.104), and 'Gene' indicates the VEP-predicted gene, but not necessarily the causal mediator.For the HLA locus, the gene that was identified by HLA allele analysis is displayed.An asterisk (*) next to the replication P value (P hgib2.23m-HGI B2 and 23andMe; or P reg -Regeneron) indicates that the lead signal (from multi-ancestry meta-analysis) is replicated with a Bonferroni-corrected P < 0.002 (0.05/25) with a concordant direction of effect.The 'Cit.' column lists citation numbers for the first publication of confirmed genome-wide associations with critical illness or (in brackets) any COVID-19 phenotype.
To assess the sensitivity of our results to mismatches of demographic characteristics between cases and controls (Supplementary Figs. 9, 10), we performed an age-, sex-and body mass index (BMI)-matched casecontrol analysis (Supplementary Figs.18-21).As there is a theoretical risk of mismatch between cases and 100,000 Genomes Project participants in risk factors for exposure (for example, shielding behaviour) or susceptibility to critical COVID-19 (for example, immunosuppression), we performed a sensitivity analysis using only the cohort with mild COVID-19 (see above; Supplementary Table 10).In both of these analyses, allele frequencies and directions of effect were concordant for all lead signals.
We inferred credible sets of variants using Bayesian fine-mapping with susieR 13 , by analysing the GWAS summaries of 17 regions of genomic length 3 Mb that were flanking groups of lead signals.We obtained 22 independent credible sets of variants for EUR and an additional 2 from the trans-ancestry meta-analysis with a posterior inclusion probability greater than 0.95 (Extended Data Table 1, Supplementary Information).Fine-mapping of the association signals revealed putative causal variants for both previously reported and novel association signals (see Supplementary Information, Extended Data Table 1).In 12 out of the 24 fine-mapped signals, the credible sets included 5 or fewer variants, and for 8 signals we detected variants with predicted missense or worse consequence across each credible set (Extended Data Table 1).We were able to fine-map multiple independent signals at previously identified loci (Fig. 3, Extended Data Figs.2, 4).For example, the signal in the 3p21.31region 2 , was fine-mapped into two independent associations, with the credible set for the first refined to a single variant in the 5′ untranslated region (UTR) of SLC6A20 (chr3:45796521:G:T, rs2271616, odds ratio (OR): 1.29, 95% confidence interval (CI):1.21-1.37),and the second credible set including multiple variants in downstream and intronic regions of LZTFL1 (Fig. 3).Among the novel signals, at 3q24 and 9p21.3 we detected missense variants that affect PLSCR1 and IFNA10, respectively (chr3:146517122:G:A, rs343320, p.His262Tyr, OR: 1.24, 95% CI: 1.15-1.33,CADD: 22.6; chr9:21206606:C:G, rs28368148, p.Trp164Cys, OR:1.74, 95% CI: 1.45-2.09,CADD: 23.9).Both are predicted to be deleterious by the Combined Annotation Dependent Depletion (CADD) tool 14 .Structural predictions for these variants suggest functional effects (Extended Data Fig. 5).We assessed whether the main signals of this study were underlain by rarer variants with a lower minor allele frequency (MAF) (less than 0.02%) than our GWAS default threshold (less than 0.5%), by including rarer variant summaries when fine-mapping, but no additional variants were added to the main credible sets (Supplementary Table 9).
Consistent with our expectation that genetic susceptibility has a stronger role in younger individuals, age-stratified analysis (individuals of younger than 60 years old versus individuals of 60 years old or above) in the EUR group revealed a signal in the 3p21.31region with a significantly stronger effect in the younger age group (chr3:45801750:G:A, rs13071258, OR: 3.34, 95% CI: 2.98-3.75versus OR: 2.1, 95% CI 1.88-2.34),which is in strong LD (r 2 = 0.947) with the main GWAS signal indexed by rs73064425.Sex-specific analysis did not reveal significant effects (Supplementary Fig. 17).

Replication
For replication, we performed a meta-analysis of summary statistics generously shared by 23andMe and the COVID-19 Host Genetics Initiative (HGI) data freeze 6 (B2).As a previous analysis of GenOMICC 1 contributes a substantial part of the signal at each locus in HGI v. 6, and leave-one-out analyses were not available, we removed the signal

Article
from GenOMICC cases in HGI v.6 using mathematical subtraction to ensure independence (Methods).Using LD clumping to find variants genotyped in both the discovery and replication studies, we required P < 0.002 (0.05/25) and concordant direction of effect (Table 1, Supplementary Table 8) for replication.We interrogated two variants that failed replication in this set in a second GWAS meta-analysis of hospitalized patients with COVID-19 from UK Biobank, AncestryDNA, Penn Medicine Biobank and Geisinger Health Systems, which included a total of 9,937 individuals who were hospitalized with COVID-19 and 1,059,390 control individuals.This led to a further successful replicated finding, in IFNA10 (Table 1).We replicated 23 of the 25 significant associations that were identified in the population-specific and/or multi-ancestry GWASs.One of the non-replicated signals (rs4424872) corresponds to a rare variant that may not be well represented in the replication datasetswhich are dominated by single-nucleotide polymorphism (SNP) genotyping data-but which also had significant heterogeneity among ancestries.The second non-replicated signal is within the human leukocyte antigen (HLA) locus, which has complex LD (see below).

HLA region
The lead variant in the HLA region, rs9271609, lies upstream of the HLA-DQA1 and HLA-DRB1 genes.To investigate the contribution of specific HLA alleles to the observed association in the HLA region, we imputed HLA alleles at a four-digit (two-field) level using HIBAG 15 .The only allele that reached genome-wide significance was HLA-DRB1*04:01 (OR: 0.80, 95% CI: 0.75-0.86,P = 1.6 × 10 −10 in EUR), which has a stronger P value than the lead SNP in the region (OR: 0.88, 95% CI: 0.84-0.92,P = 3.3 × 10 −9 in EUR) and is a better fit to the data (Akaike information criterion (AIC): AIC DRB1*04:01 = 30,241.34;AIC leadSNP = 30,252.93)(Extended Data Fig. 6).HLA-DRB1*04:01 has been previously reported to confer protection against severe disease in a small cohort of European ancestry 16 .

Gene burden testing
To assess the contribution of rare variants to critical illness, we performed gene-based analysis using SKAT-O as implemented in SAIGE-GENE 17 on a subset of 12,982 individuals from our cohort (7,491 individuals with critical COVID-19 and 5,391 control individuals), for which the genome-sequencing data were processed with the same alignment and variant calling pipeline.We tested the burden of rare (MAF < 0.5%) variants considering the predicted variant consequence type (tested variant counts provided in the Supplementary Information).We assessed burden using a strict definition for damaging variants (high-confidence putative loss-of-function (pLoF) variants as identified by LOFTEE 18 ) and a lenient definition (pLoF plus missense variants with CADD ≥ 10) 14 , but found no significant associations at a gene-wide-significance level.Moreover, all individual rare variants included in the tests had P values greater than 10 −5 .
Consistent with other recent work 11 , we did not find any significant gene burden test associations among the 13 genes previously reported from an interferon-pathway-focused study 4 (tests for all genes had P > 0.05; Supplementary Information), and we did not replicate the reported association [19][20][21] in TLR7 (EUR P = 0.30 for pLoF and P = 0.075 for missense variants).

Transcriptome-wide association study analysis
To infer the effect of genetically determined variation in gene expression on disease susceptibility, we performed a transcriptome-wide association study (TWAS) using gene expression data (GTEx v.8; ref. 22 ) for two disease-relevant tissues: lung and whole blood.We found significant associations between critical COVID-19 and predicted expression in lung (14 genes) and blood (6 genes) (Supplementary Fig. 23) and in an all-tissue meta-analysis (GTEx v.8; 51 genes) (Supplementary Fig. 24).Expression signals for 16 genes significantly colocalized with susceptibility (Fig. 2).As the LD structure of the HLA is complex, we only assessed colocalization for the significant association, HLA-DRB1.Although it was not significant in our TWAS analysis, expression quantitative trait loci (eQTLs) in the proximity of the association significantly colocalize with the GWAS signal for both blood and lung (both PP H4 > 0.8; Supplementary Information).
We repeated the TWAS analysis using models of intron excision rate from GTEx v.8 to obtain a splicing TWAS, which revealed significant signals in lung (16 genes) and whole blood (9 genes), and in an all-tissue meta-analysis (33 genes); 11 of these had strongly colocalizing splicing signals (Supplementary Information).

Mendelian randomization
We performed generalized summary-data-based Mendelian randomization (GSMR) 23 in a replicated outcome study design using the protein quantitative trait loci (pQTLs) from the INTERVAL study 24 .GSMR incorporates information from multiple independent SNPs and provides stronger evidence of a causal relationship than single-SNP-based approaches.Of 16 proteome-wide-significant associations in this study, 8 were replicated in an external dataset at a Bonferroni-corrected P value threshold of P < 0.0031 (P < 0.05/16; Extended Data Table 2, Extended Data Fig. 7) .

Discussion
We report 23 replicated genetic associations with critical COVID-19, which were discovered in only 7,491 cases.This demonstrates the efficiency of the design of the GenOMICC study, an open-source 25 international research programme (https://genomicc.org) that focuses on extreme phenotypes: patients with life-threatening infectious disease, sepsis, pancreatitis and other critical illness phenotypes.GenOMICC detects greater heritability and stronger effect sizes than other study designs across all variants (Supplementary Figs.22, 14).In COVID-19, critical illness is not only an extreme susceptibility phenotype, but also a more homogeneous one: we have shown previously that critically ill patients with COVID-19 are more likely to have the primary disease processhypoxaemic respiratory failure 5 -and that patients in this group have a divergent response to immunosuppressive therapy compared to other hospitalized patients 8 .We detect distinct signals at several of the associated loci, in some cases implicating different biological mechanisms.
Five of the variants associated with critical COVID-19 have direct roles in interferon signalling and broadly concordant predicted biological effects.These include a probable destabilizing amino acid substitution in a ligand, IFNA10 (Trp164Cys, Extended Data Fig. 5), and-as we reported previously 1 -reduced expression of a subunit of its receptor IFNAR2 (Fig. 2).IFNAR2 signals through a kinase that is encoded by TYK2 1 .Although the lead variant in TYK2 in WGS is a protein-coding variant with reduced STAT1 phosphorylation activity 26 , it is also associated with significantly increased expression of TYK2 (Fig. 2, Methods).Fine-mapping reveals a significant association with an independent missense variant in IL10RB, a receptor for type III (lambda) interferons (rs8178521; Table 1).Finally, we detected a lead risk variant in phospholipid scramblase 1 (chr3:146517122:G:A, rs343320; PLSCR1) which disrupts a nuclear localization signal that is important for the antiviral effect of interferons 27 (Extended Data Fig. 5).PLSCR1 controls the replication of other RNA viruses, including vesicular stomatitis virus, encephalomyocarditis virus and influenza A virus 27,28 .
Although our genome-wide gene-based association tests did not replicate any findings from a previous pathway-specific study of rare deleterious variants 4 , our results provide robust evidence implicating reduced interferon signalling in susceptibility to critical COVID-19.Notably, systemic administration of interferon in two large clinical trials, albeit late in disease, did not reduce mortality 29,30 .
We found significant associations in genes that are implicated in lymphopoesis and in the differentiation of myeloid cells.BCL11A is essential for B and T lymphopoiesis 31 and promotes the differentiation of plasmacytoid dendritic cells 32 .TAC4, reported previously 3 , encodes a regulator of B cell lymphopoesis 33 and antibody production 34 , and promotes the survival of dendritic cells 35 .Finally, although the strongest

Article
fine-mapping signal at 5q31.1 (chr5:131995059:C:T, rs56162149) is in an intron of ACSL6 with significant effects on expression (Supplementary Information), the credible set includes a missense variant in CSF2 (encoding granulocyte-macrophage colony stimulating factor; GM-CSF) of uncertain significance (chr5:132075767:T:C; Extended Data Table 1).We have previously shown that GM-CSF is strongly up-regulated in critical COVID-19 36 , and it is already under investigation as a target for therapy 37 .Mendelian randomization results are consistent with a direct link between the plasma levels of a closely related cytokine receptor subunit, IL3RA, and critical COVID-19 (Extended Data Table 2).Fine-mapping, colocalization and TWAS analyses provide evidence for increased expression of MUC1 as the mediator of the association with rs41264915 (Supplementary Table 12).This suggests that mucins could have a therapeutically important role in the development of critical illness in COVID-19.
Mendelian randomization provides genetic evidence in support of a causal role for coagulation factors (F8) and platelet activation (PDGFRL) in critical COVID-19 (Extended Data Table 2, Extended Data Fig. 7), consistent with autopsy 6 , proteomic 38 and therapeutic 39 evidence.Perhaps more importantly, we identify specific and closely related intercellular adhesion molecules that have known roles in the recruitment of inflammatory cells to sites of inflammation, including E-selectin (SELE), intercellular adhesion molecule 5 (ICAM5) and DC-SIGN (dendritic-cell-specific ICAM3-grabbing non-integrin; CD209), which may provide additional therapeutic targets.DC-SIGN (CD209) mediates pathogen endocytosis and antigen presentation, and is known to be involved in multiple viral infections, including SARS-CoV and influenza A virus.It has affinity for SARS-CoV-2 40,41 .
Our previous report of an association between the OAS gene cluster and severe disease was robustly replicated in an external cohort 1 , but does not meet genome-wide significance in the present analysis (Supplementary Table 7).This may indicate a change in the observed effect size because any effect that is detected in GWASs is more likely to have been sampled from the larger end of the range of possible effect sizes -the 'winner's curse'.Alternatively, it may indicate either a change in the population of patients (cases or controls) or a change in the pathogen.For example it is possible that-as with the other coronaviruses that are known to infect humans 42 -more recent variants of SARS-CoV-2 have evolved to overcome this host antiviral defence mechanism.

Limitations
In contrast to microarray genotyping, WGS is a rapidly evolving and relatively new technology for GWASs, with relatively few sources of population controls.We selected a control cohort from the 100,000 Genomes Project, which was sequenced and analysed using a different platform and bioinformatics pipeline compared with the case cohort (Extended Data Fig. 1).However, to minimize the risk of false-positive associations due to technical artifacts, extensive quality measures were used (Methods).In brief, we masked low-quality genotypes, filtered for genotype signal using a low threshold for missingness and performed a control-control relative allele frequency filter using a subset of samples processed with both bioinformatics pipelines.Finally, we required all significant associations to be supported by local variants in LD, which may be excessively stringent (Methods).Although this approach may remove some true associations, our priority is to maximize confidence in the reported signals.Of 25 variants that meet this requirement, 23 are externally replicated, and the remaining 2 may be true associations that are yet to be replicated owing to a lack of coverage or power in the replication datasets.
The design of our study incorporates genetic signals for every stage in the disease progression into a single phenotype.This includes establishment of infection, viral replication, inflammatory lung injury and hypoxaemic respiratory failure.Although we can have considerable confidence that the replicated associations with critical COVID-19 we report are robust, we cannot determine at which stage in the disease process, or in which tissue, the relevant biological mechanisms are active.

Conclusions
These genetic associations identify biological mechanisms that may underlie the development of life-threatening COVID-19, several of which may be amenable to therapeutic targeting.Furthermore, we demonstrate the value of WGS for fine-mapping loci in a complex trait.In the context of the ongoing global pandemic, translation to clinical practice is an urgent priority.As with our previous work, biological and molecular studies-and, where appropriate, large-scale randomized trials-will be essential before our findings can be translated into clinical practice.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-022-04576-6.

Ethics
GenOMICC study: GenOMICC was approved by the following research ethics committees: Scotland 'A' Research Ethics Committee (15/SS/0110) and Coventry and Warwickshire Research Ethics Committee (England, Wales and Northern Ireland) (19/WM/0247).Current and previous versions of the study protocol are available at https://genomicc.org/protocol/.100,000 Genomes project: the 100,000 Genomes project was approved by the East of England-Cambridge Central Research Ethics Committee (REF 20/EE/0035).Only individuals from the 100,000 Genomes project for whom WGS data were available and who consented for their data to be used for research purposes were included in the analyses.UK Biobank study: ethical approval for the UK Biobank was previously obtained from the North West Centre for Research Ethics Committee (11/NW/0382).The work described herein was approved by UK Biobank under application number 26041.Geisinger Health Systems (GHS) study: approval for DiscovEHR analyses was provided by the GHS Institutional Review Board under project number 2006-0258.AncestryDNA study: all data for this research project were from individuals who provided prior informed consent to participate in AncestryDNA's Human Diversity Project, as reviewed and approved by our external institutional review board, Advarra (formerly Quorum).All data were de-identified before use.Penn Medicine Biobank study: appropriate consent was obtained from each participant regarding the storage of biological specimens, genetic sequencing and genotyping, and access to all available EHR data.This study was approved by the institutional review board of the University of Pennsylvania and complied with the principles set out in the Declaration of Helsinki.Informed consent was obtained for all study participants.23andMe study: participants in this study were recruited from the customer base of 23andMe, a personal genetics company.All individuals included in the analyses provided informed consent and answered surveys online according to the 23andMe protocol for research in humans, which was reviewed and approved by Ethical and Independent Review Services, a private institutional review board (http:// www.eandireview.com).

Recruitment of cases (patients with COVID-19)
Patients were recruited to the GenOMICC study in 224 UK intensive care units (https://genomicc.org).All individuals had confirmed COVID-19 according to local clinical testing and were deemed, in the view of the treating clinician, to require continuous cardiorespiratory monitoring.In UK practice this kind of monitoring is undertaken in high-dependency or intensive care units.

Recruitment of control individuals
Mild or asymptomatic control individuals.Participants were recruited to the mild COVID-19 cohort on the basis of having experienced mild (non-hospitalized) or asymptomatic COVID-19.Participants volunteered to take part in the study via a microsite and were required to self-report the details of a positive COVID-19 test.Volunteers were prioritized for genome sequencing on the basis of demographic matching with the critical COVID-19 cohort considering self-reported ancestry, sex, age and location within the UK.We refer to this cohort as the COVID-19 mild cohort.
Control individuals from the 100,000 Genomes project.Participants were enrolled in the 100,000 Genomes Project from families with a broad range of rare diseases, cancers and infection by 13 regional NHS Genomic Medicine Centres across England and in Northern Ireland, Scotland and Wales.For this analysis, participants for whom a positive SARS-CoV-2 test had been recorded as of March 2021 were not included owing to uncertainty in the severity of COVID-19 symptoms.
Only participants for whom genome sequencing was performed from blood-derived DNA were included and participants with haematological malignancies were excluded to avoid potential tumour contamination.

DNA extraction
For severe cases of COVID-19 and mild cohort controls, DNA was extracted from whole blood either manually using a Nucleon Kit (Cytiva) and resuspended in 1 ml TE buffer pH 7.5 (10 mM Tris-Cl pH 7.5, 1 mM EDTA pH 8.0), or automated on the Chemagic 360 platform using the Chemagic DNA blood kit (PerkinElmer) and re-suspended in 400 μl elution buffer.The yield of the DNA was measured using Qubit and normalized to 50 ng μl −1 before sequencing.For the 100,000 Genomes Project samples, DNA was extracted from whole blood at designated extraction centres following sample handling guidance provided by Genomics England and NHS England.

WGS
Sequencing libraries were generated using the Illumina TruSeq DNA PCR-Free High Throughput Sample Preparation kit and sequenced with 150-bp paired-end reads in a single lane of an Illumina Hiseq X instrument (for 100,000 Genomes Project samples) or a NovaSeq instrument (for the COVID-19 critical and mild cohorts).
Sequencing data quality control.All genome sequencing data were required to meet minimum quality metrics and quality control measures were applied for all genomes as part of the bioinformatics pipeline.The minimum data requirements for all genomes were: more than 85 × 10 −9 bases with Q ≥ 30 and at least 95% of the autosomal genome covered at 15× or higher calculated from reads with mapping quality greater than 10 after removing duplicate reads and overlapping bases, after adaptor and quality trimming.Assessment of germline cross-sample contamination was performed using VerifyBamID and samples with more than 3% contamination were excluded.Sex checks were performed to confirm that the sex reported for a participant was concordant with the sex inferred from the genomic data.

WGS alignment and variant calling
COVID-19 cohorts.For the critical and mild COVID-19 cohorts, sequencing data alignment and variant calling were performed with Genomics England pipeline 2.0, which uses the DRAGEN software (v.3.2.22).Alignment was performed to genome reference GRCh38 including decoy contigs and alternative haplotypes (ALT contigs), with ALT-aware mapping and variant calling to improve specificity.A subset of the genomes from the cancer program of the 100,000 Genomes Project were reprocessed (alignment and variant calling) using the same pipeline used for the COVID-19 cohorts (DRAGEN v.3.2.22) for equity of alignment and variant calling.

Aggregation
Aggregation was conducted separately for the samples analysed with Genomics England pipeline 2.0 (severe cohort, mild cohort, cancer-realigned 100,000 Genomes Project) and those analysed with the Illumina North Star Version 4 pipeline (100,000 Genomes Project).
For the first three, the WGS data were aggregated from single-sample gVCF files to multi-sample VCF files using GVCFGenotyper (GG) v.3.8.1, which accepts gVCF files generated by the DRAGEN pipeline as input.GG Extended Data Fig. 1 | Analysis workflow for GWAS and AVT analyses of this study.The cohorts displayed in yellow and green in the top box were processed with Genomics England Pipeline 2.0 and Illumina NSV4, respectively (see Methods on WGS Alignment and variant calling for details on differences between pipelines).We used individuals that were processed with either pipeline for the GWAS analyses and individuals processed only with Genomics England Pipeline 2.0 for the AVT analyses.The definition of the cases and controls was the same for GWAS and AVT, cases were the COVID-19 severe individuals for both, and controls included individuals from the 100,000 Genomes Project (100,000 Genomes Project) and also COVID-19 positive individuals that were recruited for this study and experienced only mild symptoms (COVID-mild).Distance (in) between selected atoms (PLSCR1 His262 N 2 and Importin Glu107 carboxyl O) is indicated.A hydrogen bond between PLSCR1 His262 and Importin Glu107 is indicated with a dashed line.The risk variant is predicted to eliminate this bond, disrupting nuclear import, an essential step for effect on antiviral signalling 27 and neutrophil maturation 66 .(c) Because there is very strong sequence conservation between IFNA10 and the gene encoding IFNω, we used existing crystal structure data (Protein Data Bank ID 3SE4 (ref. 67)) for IFNω (cyan) to display a ternary complex with interferon α/β receptor IFNAR1 (blue), IFNAR2 (red).The side chain of Trp164 is shown as spheres and indicated with a black line.(d) The hydrophobic core of IFNω with Trp164 shielded from the solvent in the center.Trp164-surrounding residues of IFNω are numbered and correspond to UniProt entry P05000.Trp164 and surrounding residues are conserved in IFNA10 (UniProt ID P01566) and share the same numbering as in IFNω (P05000).Side chains of four residues are shown as sticks.Carbon and nitrogen atoms coloured in cyan and blue, respectively.The critical COVID-19associated mutation, Trp164Cys, would replace an evolutionarily conserved, bulky side chain in the hydrophobic core of IFNA10 with a smaller one, which may destabilize IFNA10.13.

Fig. 1 |
Fig. 1 | GWAS results for the EUR ancestry group, and multi-ancestry meta-analysis.Manhattan plots are shown on the left and quantile-quantile (QQ) plots of observed versus expected P values on the right, with genomic inflation (λ) displayed for each analysis.Highlighted results in blue in the Manhattan plots indicate variants that are LD-clumped (r 2 = 0.1, P 2 = 0.01, EUR LD) with the lead variants at each locus.Gene name annotation indicates genes that are affected by the predicted worst consequence type of each lead variant (annotation by Variant Effect Predictor (VEP)).For the HLA locus, the gene that was identified by HLA allele analysis is annotated.The GWAS was performed using logistic regression and meta-analysed by the inverse variant method.The red dashed line shows the Bonferroni-corrected P value: P = 2.2 × 10 −8 .

8 Fig. 2 |
Fig. 2 | Gene-level Manhattan plot showing results from the TWAS meta-analysis and highlighting genes that colocalize with GWAS signals or have strong metaTWAS associations.The highlighting colour is different for the lung and blood tissue data that were used for colocalization, and we also distinguish loci that were significant in both.Results are grouped according to two classes for the posterior probability of colocalization (PP H4 ): P > 0.5 and P > 0.8.If a variant is placed in both classes, then the colour that corresponds to the higher probability class is shown.Arrowheads indicate the direction of change in gene expression associated with an increased disease risk.The red dashed line shows the Bonferroni-corrected significance threshold for the metaTWAS analysis at P = 2.3 × 10 −6 .

Fig. 3 |
Fig. 3 | Regional detail showing fine-mapping to identify two adjacent independent signals on chromosome 3. Top two panels, variants in LD with the lead variants shown.The variants that are included in two independent credible sets are displayed with black outline circles.The r 2 values in the key

Extended Data Fig. 7 | 19 .
and HLA associations: conditioned on DRB1*04:01 Extended Data Fig. 6 | Manhattan plot of HLA and GWAS signal across the extended MHC region for the EUR cohort.Grey circles mark the GWAS (small variant) associations and diamonds represent the HLA each allele association, coloured by locus.The lead variant from the GWAS and lead allele from HLA are labelled.The left-panel shows the raw association −log 10 (P values) per variantprior to conditional analysis.The right-panel shows the −log 10 (P values) per variant following conditioning on DRB1*04:01.The dashed red line shows the Bonferroni-corrected genome-wide significance threshold for Europeans.Effect-effect plots for Mendelian randomization analyses to assess causal evidence for circulating proteins in critical COVID-Each plot shows effect size (β) of variants associated with protein concentration (x axis) and critical COVID-19 (y axis).A full list of instruments is found in Supplementary Table