Main

The SARS-CoV-2 pandemic has had profound economic, social and public health impact with over 11 million confirmed cases and over 530,000 deaths globally. Widely reported age-dependent mortality rates associated with infection appear to be driven by viral replication and comorbidities that influence immune-mediated pathology1,2. Understanding how molecular virus–host interactions manifest as risk factors can delineate regulatory programs that mediate immune pathology3, provide valuable clues about disease determinants, facilitate clinical management and help guide choice of therapeutic interventions and setting of appropriate social and public health measures. As part of a separate study, we mapped over 140 cellular proteins that are structurally mimicked by coronaviruses (CoVs) and identified complement and coagulation pathways as targets of this strategy across all CoV strains4. The complement system is a critical defense against pathogens, including viruses5 and when dysregulated (by germline variants or acquired through age-related effects or excessive tissue damage) can contribute to pathologies mediated by inflammation5,6,7. Similarly, inflammation-induced coagulatory programs, which can themselves be regulated by the complement system, are pivotal in controlling pathogenesis associated with infections. So, virally encoded structural mimics of complement and coagulation factors may contribute to CoV-associated immune-mediated pathology and indicate sensitivities in antiviral defenses. For example, dysfunctions associated with complement (for example, early-onset and age-related macular degeneration (AMD)6,7,8,9,10) and/or coagulation (for example, thrombocytopenia, thrombosis and hemorrhage) may impact clinical outcome of SARS-CoV-2 infection. 
Guided by protein structure–function analysis and observations that CoV infections result in hypercoagulative phenotypes11,12, we set out to explore the role of complement or coagulatory function in SARS-CoV-2 infection and clinical outcome. In a retrospective observational study of 11,116 patients who presented with suspected SARS-CoV-2 infection, we found that history of macular degeneration (a proxy for complement activation disorders) and history of coagulation disorders (thrombocytopenia, thrombosis and hemorrhage) are risk factors for morbidity and mortality in SARS-CoV-2-infected patients, effects that could not be explained by age, sex or history of smoking. Transcriptional profiling of nasopharyngeal swabs (NPs) from 650 control and SARS-CoV-2-infected patients demonstrated that infection results in robust engagement and activation of complement and coagulation pathways. Finally, a candidate-driven genetic association study of severe SARS-CoV-2 disease identified missense, expression quantitative trait loci (eQTL) and splicing quantitative trait loci (sQTL) variants in critical regulators of complement and coagulation. In addition to providing evidence that these cascades modulate SARS-CoV-2 infection outcome, the data point to putative transcriptional and genetic markers of susceptibility. The results highlight the value of a multimodal analytical approach, combining molecular information from virus protein structure–function analysis with clinical informatics, transcriptomics and genomics to reveal determinants and predictors of immunity, susceptibility and clinical outcome associated with infection.

Between 1 February 2020 and 25 April 2020, 11,116 patients presented to New York-Presbyterian/Columbia University Irving Medical Center with suspected SARS-CoV-2 infection, of whom 6,398 tested positive (Table 1). Electronic health records (EHRs) were used to define sex, age and smoking history status as well as histories of macular degeneration, coagulatory disorders (thrombocytopenia, thrombosis and hemorrhage), hypertension, type 2 diabetes (T2D), coronary artery disease (CAD) and obesity (see Methods). We identified 88 patients with history of macular degeneration, 4 with complement deficiency disorders and 1,179 with coagulatory disorders (other comorbidity frequencies are shown in Table 1 and covariances between them are shown in Extended Data Fig. 1). We observed a 35% mortality rate among patients who were placed on mechanical ventilation, and 31% of deceased patients had been on mechanical respiration.

Table 1 Patient cohort demographic details, broken down by SARS-CoV-2 infection and disease status

We used survival analysis and Cox proportional hazards regression modeling to estimate univariate and age- and sex-corrected risk associated with clinical history of previously reported SARS-CoV-2 risk factors as well as coagulation and complement disorders. We identified significant risk of mechanical respiration and mortality associated with age and sex, as well as history of hypertension, obesity, T2D and CAD (Fig. 1 and Table 1). Notably, we did not find evidence that smoking status (past or present) is a significant risk factor for either mechanical respiration or mortality. However, patients with AMD (a proxy for complement activation disorders) and coagulation disorders (thrombocytopenia, thrombosis and hemorrhage) were at significantly increased risk of adverse clinical outcomes (including mechanical respiration and death) following SARS-CoV-2 infection (Fig. 1 and Table 1). Moreover, as shown in Fig. 1b, patients with AMD succumb to disease more rapidly than others. Critically, the contribution of age and sex was not sufficient to explain the increased risks associated with history of macular degeneration or coagulation disorders (Fig. 1 and Table 1). Conversely, albeit in a small number of individuals, we observed that among patients with complement deficiency disorders, who are normally at increased risk of complications associated with infections, none required mechanical respiration or succumbed to their illness (Table 1 and Fig. 1a,b). Notably, while the correlation between AMD or coagulopathies and established covariates included in this study is low (Extended Data Fig. 1 and Extended Data 1), further study will be necessary to rule out comorbidities that may be associated with AMD and coagulopathies. Together, these data suggest that hyperactive complement and coagulative states predispose individuals to adverse outcomes associated with SARS-CoV-2 infection and that deficiencies in complement components may be protective. 
However, given the low incidence rate of deficiencies in either complement or coagulation pathways, further analysis with larger clinical cohorts is warranted.

Fig. 1: History of macular degeneration and coagulation disorders are associated with adverse outcomes after confirmed SARS-CoV-2 infection.

a, Kaplan–Meier curves (line, proportion of subjects that did not require intubation; error band, 95% CI (Greenwood’s)) for ten binary conditions: age > 65 years, male sex, macular degeneration (macula), complement deficiency disorders (CD), coagulation, hypertension, T2D, obesity, CAD and cough. The survival rates for the patients with the named condition are shown in orange (sample size is given as nE). The shaded region indicates the 95% CI. The blue survival line is for patients without the named condition (sample size is given as nT). Note that none of the four patients with CD required mechanical ventilation. b, Kaplan–Meier curves (line, proportion of subjects that survived; error band, 95% CI (Greenwood’s)) for the same ten conditions as in a. All four patients with CD survived (not statistically significant). c, Intubation rates across the binary conditions. Intubation was highest in patients with a history of macular degeneration (n = 14), followed by T2D (n = 111) and hypertension (n = 210). Error bars indicate binomial 95% CIs. d, Mortality rates across the binary conditions. Patients with a history of macular degeneration saw the highest mortality rates (n = 22), followed by age ≥ 65 (n = 513) and T2D (n = 191). Error bars indicate binomial 95% CI. e, HRs, estimated using a Cox proportional hazards model, for risk of intubation (as a validated proxy for requiring mechanical respiration). Sample sizes are the same as in a, and error bars represent 95% CI derived from the s.e.m. f, Similarly, HRs for mortality, estimated using a Cox proportional hazards model. HRs and statistical significances are shown in Table 1. Sample sizes are the same as in b, and error bars represent 95% CI derived from the s.e.m.

We leveraged whole-genome RNA sequencing (RNA-seq) profiles to identify differentially regulated genes and pathways in 650 NP swabs from control and SARS-CoV-2-infected patients who presented to Weill-Cornell Medical Center. Gene set enrichment analysis (GSEA) of Hallmark gene sets found that SARS-CoV-2 infection (as defined by presence of SARS-CoV-2 RNA and stratified into ‘positive’, ‘low’, ‘medium’ or ‘high’ based on viral load; see Methods) induces genes related to pathways with known immune modulatory functions (Fig. 2a). Moreover, among the most enriched gene sets, SARS-CoV-2 infection induces robust activation of the complement cascade (false discovery rate (FDR) P < 0.001), with increasing enrichment and significance with viral load (FDR P < 0.0001). We extended the analysis to include all complement- and coagulation-associated gene sets in MsigDB and identified ‘KEGG_Complement_and_Coagulation_Cascades’, ‘GO_Coagulation’ and ‘Reactome_initial_triggering_of_complement’ to be enriched in expression profiles of SARS-CoV-2-infected samples (Q < 0.05; representative GSEA profiles are shown in Fig. 2b and a full list of enriched pathways and gene sets can be found at https://covidgenes.weill.cornell.edu/). The pathway-level transcriptional regulation induced by SARS-CoV-2 identified by GSEA is also observed at the individual gene level for upregulated and downregulated transcripts as well as those that are particularly upregulated in the context of high viral load (Fig. 2c–f). Taken together, the data demonstrate that in addition to immune factors such as type I interferons and dysregulation of interleukin (IL)-6-dependent inflammatory responses, which have been linked to poor clinical outcome11, transcriptional control of complement and coagulation cascades is a feature of SARS-CoV-2 infection.

Fig. 2: SARS-CoV-2 infection engages robust transcriptional regulation of complement and coagulation cascades.

a, GSEA of Hallmark gene sets was applied to RNA-seq profiles of NP swabs from 650 control and SARS-CoV-2-infected patients stratified by SARS-CoV-2 positive (green) or low (yellow), medium (orange), high (red) viral load (significantly enriched gene sets highlighted in blue). b, Leading edge enrichment plots from GSEA of MsigDB-wide gene sets are shown for Hallmark_Complement and KEGG_Complement_and_Coagulation_Cascade gene sets with SARS-CoV-2 stratification indicated by color. c, Hierarchical clustering of z score normalized mRNA profiles of complement and coagulation components that undergo significant (FDR corrected P < 0.01) transcriptional regulation in response to SARS-CoV-2 infection (cold and hot color scale reflects downregulated or upregulated expression, respectively). d–f, Violin plots indicating median and quartiles as well as minima and maxima bounds (TPM, transcripts per million, shown on y axis) of highlighted differentially regulated genes are shown for upregulated (d), downregulated (e) or particularly upregulated expression in the context of high viral load (f). Normalized enrichment scores (NES) and FDR-corrected P values are shown. Two-tailed Mann–Whitney U-test P values are reported.

Dysfunction of complement or coagulation cascades can result from acquired dysregulation, genetically encoded variants or both. We conducted a candidate-driven study to evaluate whether genetic variation within a 60-kb window around 102 genes with known roles in regulating complement or coagulation cascades (2,888 of the 805,426 genetic variants profiled in the UK Biobank fulfill these criteria) is associated with poor SARS-CoV-2 clinical outcome (the results that follow are robust to varying the window size between 40 and 80 kb; Extended Data Fig. 2 and Fig. 3a,b; Methods). Single-nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) below 1% were excluded, and we used an empirical permutation analysis to set the study-wide significance alpha (α) threshold for each analysis described below (Fig. 3 and Extended Data Figs. 2–7; Methods). The initial analysis, applied to the April 2020 UK Biobank data release10,13 (excluding subjects with third-degree and above relatedness or aneuploidy, which resulted in 388 positive and 332 positive and hospitalized patients), identified 11 loci representing seven genes with study-wide significance (α = 0.001; Fig. 3c and Extended Data Fig. 2). Among these, and proximal to coagulation factor III (F3), is variant rs72729504, which we find to be associated with increased risk of adverse clinical outcome associated with SARS-CoV-2 infection (odds ratio (OR) 1.93). Mutations in F3 have the strongest association with fibrin fragment D-dimer levels, the most widely used clinical marker of activated blood coagulation14. Importantly, increased D-dimer levels were recently reported to correlate with poor clinical outcome in SARS-CoV-2-infected patients11. Though lacking in power to determine statistical significance, we find that rs12029080, the lead common variant identified to be associated with increased D-dimer levels, is also associated with increased COVID-19 severity (Extended Data Fig. 2). 
So, while effect sizes are modest and a functional role of rs72729504 remains to be elucidated, our observations suggest that this locus may represent a genetic marker of SARS-CoV-2 disease outcomes.

Fig. 3: Targeted genetic association study identifies SNPs in complement and coagulation pathway components associated with clinical outcome of SARS-CoV-2 infection.

a,b, P values from a negative binomial distribution fit to permutation of SNPs sampled (left) and case–control phenotypes (center) generated under the null hypothesis are shown for the April 2020 (a) or May 2020 (b) cohort (α and distance pairs as indicated; for more information see Methods). Also shown are the number of hits that pass the corresponding α study-wide significance threshold by distance (right) for April 2020 (a) or May 2020 (b) cohorts. c, Manhattan plots of 2,888 variants within 60 kb of complement and coagulation pathway genes for analyses using the April 2020 cohort (top) and May 2020 cohort (bottom). Study-wide significance threshold shown as dashed green lines, nominal significance threshold is shown as black dashed line, and SNP colors alternate by chromosome. Significant SNPs are shown as colored markers and annotated with the nearest gene by base-pair distance. SNPs shown in green are study-wide significant in both April 2020 and May 2020. SNPs shown as diamonds are also study-wide significant in haplotype-based analysis (see Methods). Logistic regression was used to assess statistical significance; multiple hypothesis testing was controlled using an empirical permutation analysis to set a study-wide α. eQTLs are further highlighted in d and e. d, eQTL relationship for rs61821114 and CD55 in thyroid19. The T allele of rs61821114 is associated with significantly lower expression of CD55. e, eQTL relationship for rs669 and A2M19. The C allele of rs669 is associated with significantly lower expression of A2M in 17 tissues, including the esophageal mucosa (shown) and lung. Box plot inlays show median, 25th and 75th percentiles in each group. eQTL-associated P values were calculated as described previously and show minima and maxima bounds16.

The analysis also identified that four variants previously reported to be associated with AMD (rs45574833, rs61821114, rs61821041 and rs12064775)15 predispose carriers to hospitalization following SARS-CoV-2 infection (OR, 2.13–2.65; Extended Data Fig. 2). A fifth variant, rs2230199, which maps to complement C3 (a gene shown to be linked to AMD in an independent genome-wide association study (GWAS)) has not been associated with increased AMD risk in the UK population. The three SNPs that map to C3 (rs1047286, rs2230203 and rs2230199) each seem to confer some protection associated with SARS-CoV-2 infection (OR, 0.66–0.68). Two of the identified variants (rs61821114 and rs61821041) map to eQTLs associated with complement decay-accelerating factor (CD55)16. CD55 accelerates the decay of complement proteins, thereby disrupting the cascade and preventing immune-mediated damage5. These eQTLs result in decreased expression of CD55 (Fig. 3d), thereby relieving the restraining function of this protein. In agreement, we observed that these variants are associated with increased risk of adverse clinical outcome following SARS-CoV-2 infection (OR, 2.34–2.4).

Genetic association studies performed on relatively small cohorts can be prone to false positives. While we implemented permutation analyses to empirically determine statistical significance thresholds (see Methods), we also repeated the analysis using updated UK Biobank data released in May 2020 (651 SARS-CoV-2-positive and 500 positive and hospitalized patients). The analysis recapitulated 6 of the 11 findings from April 2020 and identified 16 additional loci with study-wide significance (α = 0.0025, Extended Data Fig. 2 and Fig. 3c). The scan revealed five variants proximal to α-2-macroglobulin (A2M), a regulator of fibrin clot formation and inflammatory cascades17. rs10842898, rs669 and rs4883215 are eQTLs associated with significant downregulation of A2M (and concomitant upregulation of A2M-AS1, the antisense RNA of A2M; data available on https://gtextportal.org) in multiple tissues, including mucosa of the esophagus (P value = 1.9 × 10⁻¹⁵; Fig. 3e). In addition, rs10842898 and rs669 are sQTLs for mannose-6-phosphate receptor (M6PR), a P-type lectin that regulates lysosomal cargo loading and participates in cellular responses to wound healing, cell growth and viral infection18, suggesting that these SNPs may contribute to complex regulation of transcripts with immunological and antiviral roles.

Analysis of 936 SNPs in haplotype blocks (Extended Data Fig. 2; see Methods) resulted in 16 study-wide significant SNPs (α = 0.01, Extended Data Fig. 8) using the April 2020 cohort, of which 8 repeated at study-wide significance (α = 0.0075) using the May 2020 dataset. These include rs45574833, a missense variant in C4BPA, which negatively regulates the classical complement pathway19. We also identified a link between rs731034 (an eQTL in collectin subfamily member 11; COLEC11) and poor clinical outcome in both April 2020 (OR, 1.27) and May 2020 (OR, 1.33) cohorts. COLEC11 binds carbohydrate antigens on microorganisms, including viruses, facilitating their recognition and removal. This eQTL results in significant upregulation of COLEC11 across multiple tissues including lung (P value = 1 × 10⁻¹¹) and suggests that sugar moieties on viral proteins may serve as antigenic targets of immunological responses to SARS-CoV-2 infection. Though experimental validation and functional interrogation are required to elucidate the precise pathophysiology, taken together, the observations point to genetic variation in complement and coagulation components as contributing factors in SARS-CoV-2-mediated disease.

In summary, we present evidence that complement and coagulatory functions may play a role in SARS-CoV-2 infection outcome. We found that AMD and coagulatory dysfunctions predispose patients to poor clinical outcomes following SARS-CoV-2 infection and, though their low incidence rates invite further investigation, complement deficiencies appear to be protective. Nevertheless, in an orthogonal analysis of transcriptional responses to infection, we demonstrate that SARS-CoV-2 engages robust activation of complement and coagulation cascades. Moreover, our candidate-driven analysis of genetic variation and severe SARS-CoV-2 disease yielded putative loci including missense, eQTL and sQTL variants of critical regulators of the complement and coagulation cascades. While the results appear to be, at least functionally, consistent with a recent GWAS that identified A/B blood types as risk factors for COVID-19 (ref. 20) (A and B blood groups have been associated with increased risk of coagulopathies21,22,23), they must be interpreted with caution. However, pathology associated with dysregulation and genetic variation in complement and coagulation pathways is not without precedent and has been associated with dengue virus infection, where it is correlated with disease severity and mirrors that of acute SARS-CoV-2 disease24,25, suggesting that complement and coagulatory dysfunctions may represent risk factors for a broad range of pathogens.

Site-specific clinical care decisions, ancestral homogeneity, population stratification and socioeconomic status of affected populations can all impact the findings, particularly when limited by sample size. Similarly, retrospective studies have notable limitations in data completeness, selection biases and methods of data capture. So, claims on causality cannot be made, nor can we definitively rule out other clinical factors as possible drivers of disease. Still, our findings highlight the value of combining molecular information from virus protein structure–function analysis with orthogonal clinical data analysis to reveal determinants and predictors of immunity, susceptibility and clinical outcome associated with infection. Such frameworks can help refine and power large-scale genomics efforts based on informed biological and clinical conjectures. While identification of CoV-encoded structural mimics guided the clinical studies, a molecular link between those observations and the findings presented herein remains to be elucidated. Nevertheless, the observations advance our understanding of how SARS-CoV-2 infection leads to disease and can help explain variability in clinical outcomes. Among the implications, the data warrant heightened public health awareness for the most vulnerable individuals and further investigation into an existing menu of complement and coagulation targeting therapies that were recently shown to be beneficial in a small cohort of patients with SARS-CoV-2 infection26,27.

Methods

Ethics and data governance approval

The study is approved by the Columbia University Irving Medical Center Institutional Review Board (IRB) no. AAAL0601 and the requirement for informed consent was waived. A data request associated with this protocol was submitted to the Tri-Institutional Request Assessment Committee of New York-Presbyterian/Columbia and Cornell and approved. The research on the UK Biobank data was conducted using the UK Biobank Resource under application no. 41039. The transcriptomics analysis samples were collected and processed through the Weill Cornell Medicine IRB protocol 19-11021069.

Retrospective clinical study

Cohort and study description

In this observational cohort study, we used a data warehouse derived from EHRs from 11,116 patients treated at New York-Presbyterian/Columbia University Irving Medical Center for suspected cases of SARS-CoV-2 infection. For these patients we collected contemporary data from their current encounter (the encounter associated with their suspected SARS-CoV-2 infection) as well as historical data, if available, from their previous encounters. Contemporary data (collected between 1 February 2020 and 12 April 2020) included insurance billing information, laboratory measurements, procedures and SARS-CoV-2 diagnostic test results. These data were derived from the data warehouse tables in Epic. Overall, 6,927 patients have historical data (data collected before 24 September 2019) available from an Observational Medical Outcomes Partnership (OMOP) v.5 instance stored using MySQL, which included all of the standard tables for recording condition, procedure, medication and measurement data (among others). Of these, we used insurance billing information from the condition occurrence table and demographics from the person table. See Preparation of data for modeling for further details on data preparation.

We used contemporary data to define inclusion criteria and outcomes (requiring mechanical respiration and mortality) and used historical data to define patient comorbidities. We defined three hypothesized comorbidity covariates: macular degeneration, CD disorders and disorders of coagulation. We used historical data to define comorbidities, age and sex. We did not include race and ethnicity data in the modeling as we have previously found issues with data quality28. The race/ethnicity data that we had is included in the tables for reference. We also modeled other comorbidities previously associated with morbidity and mortality (Zhou et al.11 and others), including history of cardiovascular disease, hypertension, obesity, and diabetes (Table 1 and Extended Data Fig. 1), all derived from the historical data. Coded covariate definitions, as well as lists of which diagnosis codes are most common in each group, are available in the Extended Data and methods. We used established institutional procedures and an institutional clinical data warehouse to extract all data from the EHRs.

Defining patient outcomes

Outcomes were defined using data derived from the EHRs between 1 February 2020 and 12 April 2020. Mortality is derived from a death note filed by a resident or primary provider that records the date and time of death. Intubation was used as an intermediary end point and is a proxy for a patient requiring mechanical respiration. We used note types that were developed for patients with SARS-CoV-2 infection to record that this procedure was completed. We validated outcome data derived from notes against the patient’s medical record using manual review.

Preparation of data for modeling

We used MySQL and Python libraries (pymysql, pandas) to extract and prepare data for modeling. The code for data preparation is available in GitHub (https://github.com/tatonetti-lab/complementcovid) as a Jupyter Notebook entitled Data Setup. We began by creating a master list of patients with suspected CoV. These were patients who were either diagnosed with the disease, as indicated by an ICD10 code for SARS-CoV-2 infection in their billing data, or tested for the presence of the virus using RT–PCR, as indicated by a laboratory order for the test. We found 2,821 patients using the former method and 11,116 patients using the latter. We then extracted birth dates, death dates (if the patient had died or a null value otherwise) and sex codes (1 for female, 2 for male). Patients who had sex codes for nonbinary genders were excluded from our analysis. We then defined a ‘first diagnosis date’ for each patient as either their first diagnosis date (by billing code) or the first date that they tested positive for SARS-CoV-2, whichever came first. Next, we calculated each patient’s age at the time of the ‘first diagnosis date.’ Each of the outcomes and covariates was extracted from its respective table as detailed in GitHub. Whenever possible, we used the highest-level ancestor code (from the structured vocabulary in OMOP) that represented the concept we wanted to model. We then used the concept ancestor tables to retrieve all the descendant codes. Note that diabetic kidney disease was considered for inclusion and so is represented in the data preparation script; however, it was never modeled. Cough is included as a covariate as a reference symptom for comparison. The last step in the preparation process was to compute the censor dates. To do so, we iterated through each patient in our master list and computed their time (in days) to intubation (if they required mechanical respiration) or death (if they died). Otherwise, the study end date (25 April 2020) was used to compute the patient’s censored time (in days). For any patients that were not positive for SARS-CoV-2, their time-to-event values were set to a null indicator to be dropped from the dataset later. Finally, the data were all combined in a pandas (v.1.0.3) dataframe and saved to disk as a pickle file for efficient loading.
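
The censoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in the Data Setup notebook: the function name `time_to_event`, its arguments and the decision to treat events recorded after the study end date as censored are assumptions introduced here; only the study end date (25 April 2020) comes from the text.

```python
from datetime import date
from typing import Optional, Tuple

# Study end date stated in the text; used as the censoring date.
STUDY_END = date(2020, 4, 25)


def time_to_event(first_dx: date,
                  event_date: Optional[date],
                  study_end: date = STUDY_END) -> Tuple[int, int]:
    """Return (duration_in_days, event_observed) for one patient.

    first_dx   -- the patient's 'first diagnosis date'
    event_date -- date of intubation or death, or None if it never occurred
    Assumption: an event dated after study_end is treated as censored.
    """
    if event_date is not None and event_date <= study_end:
        return (event_date - first_dx).days, 1
    # No event observed within the study window: censor at study end.
    return (study_end - first_dx).days, 0


# Hypothetical patients, for illustration only.
observed = time_to_event(date(2020, 3, 1), date(2020, 3, 11))
censored = time_to_event(date(2020, 3, 1), None)
```

Each patient then contributes one (duration, event flag) pair per outcome, which is the input format expected by standard survival estimators.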

Statistical model

Our patient timelines may be censored as our study cohort included patients that were being treated at the time of analysis. We performed survival analysis on the intubation orders and death using a Cox proportional-hazards model and visualized the risk using Kaplan–Meier curves using the lifelines Python package (v.0.24.4). Error estimates on the Kaplan–Meier curves were estimated using Greenwood’s exponential formula29. We fitted both univariate models and models that included the covariate together with age and sex, and used log-likelihood to assess significance. We reported Cox proportional hazards coefficients and their 95% confidence intervals (Table 1). We modeled whether or not a patient had macular degeneration, a complement deficiency disorder or a coagulation disorder as binary variables (1 = yes, 0 = no). Code definitions are provided in Extended Data Fig. 1. We also included other significant comorbidities suggested by previous studies (CAD, hypertension, T2D, obesity and smoking status) as binary variables (1 = yes, 0 = no), sex as a binary variable (0 = female, 1 = male), age as a quantitative variable, older age over 65 years (note that age over 65 was used only for illustrative purposes and was not used in multivariate modeling; in the multivariate model, age as a quantitative variable was used) and outcome as a binary variable (1 = yes, 0 = no). The outcome of interest was coded as 0 until the day it occurred (the date of the first intubation order following admission or the death date) or the date of analysis, whichever occurred first. Survival curves were generated for the indicated variables by setting all other variables to their respective averages within the training data. Note that we dropped patients who experienced the outcome before their initial diagnosis. This was either due to patients being hospitalized before infection (in the case of intubation) or errors in the coded data. We dropped 121 patients for intubation before infection and 12 patients for death. We also restricted the study to 90 d from the start date. One patient was removed for having an event outside of this range.
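
To make the Kaplan–Meier estimate and its Greenwood-based error band concrete, the following is a minimal pure-Python sketch. It is not the lifelines implementation used in the study: the function name `kaplan_meier`, the toy data and the use of z = 1.96 for a 95% interval are assumptions made here, and the interval is computed with the log-log ("exponential Greenwood") transform. Cox regression and covariate adjustment are not shown.

```python
import math
from typing import List, Tuple


def kaplan_meier(durations: List[float], events: List[int],
                 z: float = 1.96) -> List[Tuple[float, float, float, float]]:
    """Return (time, survival, ci_lower, ci_upper) at each distinct event time.

    durations -- follow-up time per subject (e.g. in days)
    events    -- 1 if the outcome was observed, 0 if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival, greenwood_sum, rows = 1.0, 0.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - n_events / at_risk
        if 0.0 < survival < 1.0:
            # Greenwood's sum approximates the variance of log S(t).
            greenwood_sum += n_events / (at_risk * (at_risk - n_events))
            # Log-log transform keeps the interval inside (0, 1).
            w = z * math.sqrt(greenwood_sum) / math.log(survival)
            bounds = (survival ** math.exp(w), survival ** math.exp(-w))
            lower, upper = min(bounds), max(bounds)
        else:
            # Survival has reached 0; the transform is undefined there.
            lower = upper = survival
        rows.append((t, survival, lower, upper))
    return rows


# Toy data, for illustration only: four subjects, one censored at day 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

The censored subject leaves the risk set without lowering the curve, which is why survival drops from 0.75 to 0.375 at day 3 rather than to 0.5.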

Covariate correlations

Using the data prepared as discussed above, we computed pairwise statistical correlations between age and sex as well as history of macular degeneration, CD disorders, coagulation disorders, hypertension, T2D, obesity and CAD. We computed them using data from all suspected patients (tested both positive and negative) as well as only those patients who tested positive. We used Spearman’s rho and the Tanimoto coefficients (1 − Jaccard distance) as our measures of correlation. For the comparison using the Tanimoto coefficient we binarized age as greater than or equal to 65 years.
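
As a small illustration of the Tanimoto coefficient on binary covariates, including the age binarization at 65 years, consider the sketch below. The function names `tanimoto` and `binarize_age` and the example values are hypothetical, and returning 0.0 when neither covariate is ever present is a convention assumed here rather than one stated in the text.

```python
from typing import List, Sequence


def binarize_age(ages: Sequence[float], cutoff: float = 65) -> List[int]:
    """Code age as 1 if greater than or equal to the cutoff, else 0."""
    return [1 if age >= cutoff else 0 for age in ages]


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient (1 - Jaccard distance) of two binary vectors."""
    if len(a) != len(b):
        raise ValueError("covariate vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero vectors have similarity 0.
    return both / either if either else 0.0


# Hypothetical patients, for illustration only.
older = binarize_age([72, 64, 65, 30])   # [1, 0, 1, 0]
hypertension = [1, 0, 0, 1]
similarity = tanimoto(older, hypertension)
```

Because the coefficient ignores patients in whom both covariates are absent, it is better suited than a simple matching rate to rare conditions such as macular degeneration.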

Statistical software

We used Jupyter Notebooks (jupyter-client v.5.3.4 and jupyter-core v.4.6.1) running Python 3.7 and fitted all models using the Python lifelines package (v.0.24.4).

Transcriptomic analysis of NP swabs

Sample collection and processing

Patient specimens were collected with patients’ consent at New York Presbyterian Hospital and then processed for RT–PCR as described previously30. NP swab specimens were collected using the BD Universal Viral Transport Media system (Becton, Dickinson and Company) from symptomatic patients.

Extraction of viral RNA and RT–PCR detection

Total viral RNA was extracted from deactivated samples using automated nucleic acid extraction on the QIAsymphony and the DSP Virus/Pathogen Mini kit (Qiagen). One-step reverse transcription to complementary DNA and real-time PCR amplification of viral targets, E (envelope) and S (spike) genes and internal control, was performed using the Rotor-Gene Q thermocycler (Qiagen).

Human transcriptome analysis

RNA-seq reads that mapped unambiguously to the human reference genome via Kraken2 were used to detect transcriptional responses to SARS-CoV-2 infection as described previously30. Briefly, reads were trimmed with TrimGalore and aligned with STAR (v.2.6.1d) to the human reference build GRCh38 and the GENCODE v.33 transcriptome reference, and gene expression was quantified using featureCounts, stringTie and salmon using the nf-core RNA-seq pipeline. Sample quality control was reported using fastqc, RSeQC, qualimap, dupradar, Preseq and MultiQC. Read counts, as reported by featureCounts, were normalized using the variance-stabilizing transform (vst) in the DESeq2 package in R, and DESeq2 was used to call differential expression for either positive versus negative cases or viral load (high/medium/low/none) as reported by RT–PCR cycle threshold (Ct) values. Transcripts per million (TPM) values were used to rank genes and perform gene set enrichment analysis (GSEA) as described previously31,32.

Reverse transcriptase, quantitative real-time PCR

The presence of SARS-CoV-2 in clinical samples was determined by RT–PCR. Briefly, we used primers for the E (envelope) gene (which detect all members of lineage B of the β-CoVs) and the S (spike) gene (which specifically detect SARS-CoV-2). Samples were stratified by the RT–PCR Ct value for the SARS-CoV-2 primers as follows: Ct ≤ 18, ‘high viral load’; Ct of 18–24, ‘medium viral load’; Ct of 24–40, ‘low viral load’; and Ct > 40, negative.
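The stratification can be written as a small lookup, sketched below. The stated ranges share their end points (18 and 24), so this sketch assumes each boundary value belongs to the higher-load category except at 18, which the text assigns explicitly to ‘high’; that tie-breaking rule and the function name are assumptions, not part of the described protocol.

```python
def viral_load_category(ct: float) -> str:
    """Map an RT-PCR Ct value to a viral load stratum.

    Assumed boundary handling: Ct <= 18 high, (18, 24] medium,
    (24, 40] low, > 40 negative.
    """
    if ct <= 18:
        return "high"
    if ct <= 24:
        return "medium"
    if ct <= 40:
        return "low"
    return "negative"


# Toy Ct values for illustration only.
labels = [viral_load_category(ct) for ct in (15.2, 21.0, 33.7, 41.5)]
```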

Genetic analysis of UK Biobank

Data source

UK Biobank subjects who were of White British descent, were included in the UK Biobank principal component analysis calculations (and were therefore without third-degree or closer relatedness) and were without aneuploidy were used in this study, totaling 337,147 subjects (181,032 females and 156,115 males)10. Of the nearly 500,000 participants, approximately 50,000 subjects were genotyped on the UK BiLEVE Array by Affymetrix, whereas the rest were genotyped using the Applied Biosystems UK Biobank Axiom Array, with over 800,000 markers using build GRCh37 (hg19). The arrays share 95% marker coverage. We extracted markers with a minor allele frequency (MAF) > 0.005, INFO score > 0.3 and Hardy–Weinberg equilibrium test mid-P value > 10^−10 using PLINK2 (ref. 33). UK Biobank v.3 imputation combined the Haplotype Reference Consortium with the UK10K haplotype resource using the software IMPUTE4 (UK Biobank white paper). Association analyses were performed using a logistic regression model with additive gene dosage and covariates including age at 2018, sex, the first ten principal components (provided by the UK Biobank) and the genotyping array on which the sample was genotyped. We determined the α threshold for study-wide significance using an empirical permutation analysis (see below). We performed a study-wide association analysis comparing variants for subjects who were SARS-CoV-2 positive and required hospitalization against the entire population of 337,147 subjects.

Targeted gene set definition

The union of coagulation- and complement-related gene sets (with immunoglobulin genes removed) that are part of MSigDB was used to define the set of 102 genes used in this study. For each gene, we used the transcriptional start and stop sites from the hg19 build of the human genome to define a catchment window of 80 kbp. Of the 805,426 variants profiled in the UK Biobank genotyping data, after applying quality control filters using PLINK2 (see above), 3,540 variants are within the transcribed region of the genes of interest or within 80 kbp flanking the transcribed region, 2,888 are within 60 kbp, 2,292 are within 40 kbp and 936 are located in haplotype blocks with study genes.
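The window-based selection amounts to an interval test per variant, which can be sketched as follows. The sketch assumes the flank is applied on both sides of the transcribed region; the `Gene` type, the function names and the toy coordinates are hypothetical and do not come from the study data.

```python
from typing import Iterable, NamedTuple, Sequence, Tuple


class Gene(NamedTuple):
    chrom: str
    start: int  # transcriptional start site (hg19 coordinate)
    end: int    # transcriptional stop site (hg19 coordinate)


def in_window(chrom: str, pos: int, gene: Gene, flank_bp: int) -> bool:
    """True if the variant lies in the transcribed region or within flank_bp of it."""
    return (chrom == gene.chrom
            and gene.start - flank_bp <= pos <= gene.end + flank_bp)


def count_variants(variants: Iterable[Tuple[str, int]],
                   genes: Sequence[Gene], flank_bp: int) -> int:
    """Count variants that fall in the catchment window of at least one gene."""
    return sum(1 for chrom, pos in variants
               if any(in_window(chrom, pos, g, flank_bp) for g in genes))


# Toy data: one gene and four variants at increasing distance from it.
genes = [Gene("1", 100_000, 150_000)]
variants = [("1", 120_000), ("1", 50_000), ("1", 225_000), ("2", 120_000)]
counts = {kb: count_variants(variants, genes, kb * 1_000) for kb in (80, 60, 40)}
```

Narrowing the flank can only remove variants, which is why the counts reported for the 80, 60 and 40 kbp windows decrease monotonically.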

Empirical permutation evaluation to set study-wide α thresholds

We used permutation to estimate null distributions of the number of hits expected at nine α thresholds ranging from 5 × 10^−5 to 0.05 and at distance thresholds ranging from 40 kb to 80 kb. As shown previously, 80% of GWAS hits are within 60 kb of the nearest gene34. Further, as shown in Extended Data Fig. 9, we empirically determined that the majority of eQTLs (>70%) are within 60 kb of gene bodies. We performed two sets of permutation analyses: (1) we permuted the initial set of genes from which the included variant loci were chosen and (2) we permuted the case–control labels. We repeated each 100 times and used the resulting data to fit a negative binomial distribution as our estimate of the null distribution. Additionally, we evaluated each of the sampled SNP variant sets from (1) and compared their MAF distribution with that of the complement and coagulation set. We removed any sets that were significantly different (nominal P < 0.05) according to a Mann–Whitney U-test (52 of 100 sets were removed due to this criterion; see Extended Data Fig. 10). We found that the negative binomial fitted the data best according to a goodness-of-fit test (Extended Data Figs. 27). We used this distribution to assess statistical significance for each combination of α and distance values. The result is two estimates of the significance for each α, distance (d) pair, P^(1)_{α,d} and P^(2)_{α,d}, from permutation analyses (1) and (2) above, respectively. Specifically, for permutation analysis i:

$$\begin{array}{l}{\mathrm{X}}_{\alpha ,d}^{(i)} \sim {\mathrm{NB}}\left( {r,p} \right)\\ {\mathrm{P}}_{\alpha ,d}^{(i)} = 1 - {\mathrm{CDF}}_{{\mathrm{NB}}(r,p)}\left( {k_{\alpha ,d}} \right)\end{array}$$

where X^(i)_{α,d} is the number of permutation loci with a P value under the threshold α, and k_{α,d} is the corresponding number of loci observed in the actual (unpermuted) analysis. The parameters r and p of the negative binomial represent the number of successes/failures and the probability of success, respectively. Both r and p are fitted using nonlinear least squares (the curve_fit function in scipy.optimize) on X^(i)_{α,d}, the count data from the permutation analyses for the given α and d. The P value is then calculated using the cumulative distribution function (CDF) of the fitted negative binomial distribution.
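The tail probability in the second line of the equation can be sketched in pure Python once r and p are known. The fitting step with curve_fit is not reproduced here, and the parameter values in the example are hypothetical. The parameterization assumed is the one used by scipy.stats.nbinom (number of failures k before r successes, each with success probability p).

```python
from math import exp, lgamma, log


def nbinom_pmf(k: int, r: float, p: float) -> float:
    """P(X = k) for X ~ NB(r, p), with 0 < p < 1 and r > 0 (r may be non-integer)."""
    log_pmf = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + r * log(p) + k * log(1.0 - p))
    return exp(log_pmf)


def nbinom_tail(k_obs: int, r: float, p: float) -> float:
    """1 - CDF(k_obs): probability of more than k_obs hits under the fitted null."""
    cdf = sum(nbinom_pmf(k, r, p) for k in range(k_obs + 1))
    return max(0.0, 1.0 - cdf)  # guard against tiny negative rounding error


# Hypothetical fitted parameters and observed hit count, for illustration only.
p_value = nbinom_tail(k_obs=2, r=1.0, p=0.5)
```

With r = 1 the distribution reduces to a geometric distribution, which gives a convenient hand check: the probability of more than two failures before the first success at p = 0.5 is 0.5^3 = 0.125.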

For the gene set permutation analysis (indicated by (1) above), we evaluated each of the 100 replicates to confirm that the MAF distribution was statistically indistinguishable from that of the complement and coagulation gene set variants. We did so by performing a Mann–Whitney U-test between the two distributions and excluded any replicates that showed a significant difference (nominal P < 0.05). Overall, 52 replicates were excluded because of this requirement. This MAF distribution analysis is not necessary for the case–control permutation analysis (shown in (2) above) because the loci are the same in each replicate and it is the case–control labels that are permuted.

Finally, to set the study-wide α for each study, we chose the greatest threshold value that gave a P value of less than 0.05 for both permutation analysis methods:

$${\mathrm{max}}\,\alpha \,{\mathrm{s}}{\mathrm{.t}}{\mathrm{.}}\,{\mathrm{P}}^{(1)}_{\alpha ,d} < 0.05\,{\mathrm{and}}\,{\mathrm{P}}^{(2)}_{\alpha ,d} < 0.05.$$
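For a fixed distance d, this selection rule reduces to a filter followed by a maximum, as in the sketch below. The function name and the P values in the example are hypothetical; only the rule itself (largest α for which both permutation-based P values fall below 0.05) is taken from the text.

```python
from typing import Dict, Optional


def choose_alpha(p_gene_set: Dict[float, float],
                 p_case_control: Dict[float, float],
                 cutoff: float = 0.05) -> Optional[float]:
    """Largest alpha whose P value is below cutoff in both permutation analyses.

    Each dict maps a candidate alpha threshold to the P value obtained from
    permutation analysis (1) or (2) at one fixed distance d.
    Returns None if no candidate passes.
    """
    passing = [alpha for alpha in p_gene_set
               if alpha in p_case_control
               and p_gene_set[alpha] < cutoff
               and p_case_control[alpha] < cutoff]
    return max(passing) if passing else None


# Hypothetical P values for four candidate thresholds.
p1 = {0.0005: 0.010, 0.001: 0.020, 0.0025: 0.040, 0.005: 0.200}
p2 = {0.0005: 0.030, 0.001: 0.045, 0.0025: 0.060, 0.005: 0.300}
alpha_star = choose_alpha(p1, p2)
```

In this toy case α = 0.0025 passes analysis (1) but not analysis (2), so the rule falls back to the next largest threshold that passes both.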

This entire process was repeated for two cohorts of patients: (a) the initial COVID cohort released by the UK Biobank in April 2020 and (b) the updated COVID cohort released in May 2020. The chosen α for April was 0.001 and the chosen α for May was 0.0025. A data file of all of the distribution fit results and their resulting chi-squared goodness-of-fit statistics is available in the Extended Data materials.

We also performed the permutation significance estimation for the haplotype-derived SNP sets, although the distances for all loci chosen using that method are below the 40 kb minimum used in this analysis, so those results are constant with regard to distance (Supplementary Fig. 3a,b). The chosen α values for the linkage disequilibrium (LD)-derived SNP sets were 0.01 and 0.0075 for April and May, respectively.

Haplotype block-based selection of SNPs

We identified haplotype blocks based on linkage disequilibrium within the UK Biobank genotype data of the 337,147 subjects using PLINK1.9, requiring that the lower bound of the 90% confidence interval of D′ be greater than 0.70 and the upper bound be at least 0.98. We identified blocks of interest, and subsequently the variants within those blocks, as those that contain any part of the genes of interest as denoted by the transcriptional start and end sites from the hg19 build of the human genome. From the 805,426 variants profiled in the UK Biobank genotype data, we identified 7,281 variants within the genes of interest. After applying additional quality control filters using PLINK2, 936 variants remained for analysis.

Software

We used PLINK v.2.00a2LM 64-bit Intel (26 August 2019) to run the genetic association analysis. We used PLINK v.1.90b6.10 64-bit (17 June 2019) to identify haplotype blocks based on linkage disequilibrium. We used Jupyter Notebooks (jupyter-client v.5.3.4 and jupyter-core v.4.6.1) running Python 3.7, numpy 1.18.1 and scipy 1.4.1 for the permutation analyses.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.