Hospitalization and mortality associated with SARS-CoV-2 viral clades in COVID-19

The COVID-19 epidemic of 2019–20 is due to the novel coronavirus SARS-CoV-2. Following first case description in December, 2019 this virus has infected over 10 million individuals and resulted in at least 500,000 deaths world-wide. The virus is undergoing rapid mutation, with two major clades of sequence variants emerging. This study sought to determine whether SARS-CoV-2 sequence variants are associated with differing outcomes among COVID-19 patients in a single medical system. Whole genome SARS-CoV-2 RNA sequence was obtained from isolates collected from patients registered in the University of Washington Medicine health system between March 1 and April 15, 2020. Demographic and baseline clinical characteristics of patients and their outcome data including their hospitalization and death were collected. Statistical and machine learning models were applied to determine if viral genetic variants were associated with specific outcomes of hospitalization or death. Full length SARS-CoV-2 sequence was obtained 190 subjects with clinical outcome data. 35 (18.4%) were hospitalized and 14 (7.4%) died from complications of infection. A total of 289 single nucleotide variants were identified. Clustering methods demonstrated two major viral clades, which could be readily distinguished by 12 polymorphisms in 5 genes. A trend toward higher rates of hospitalization of patients with Clade 2 infections was observed (p = 0.06, Fisher’s exact). Machine learning models utilizing patient demographics and co-morbidities achieved area-under-the-curve (AUC) values of 0.93 for predicting hospitalization. Addition of viral clade or sequence information did not significantly improve models for outcome prediction. In summary, SARS-CoV-2 shows substantial sequence diversity in a community-based sample. Two dominant clades of virus are in circulation. Among patients sufficiently ill to warrant testing for virus, no significant difference in outcomes of hospitalization or death could be discerned between clades in this sample. Major risk factors for hospitalization and death for either major clade of virus include patient age and comorbid conditions.

www.nature.com/scientificreports/ SARS-CoV-2 is ~ 30 kb in length and contains 16 open reading frames (ORFs) 7,8 . Despite its very recent emergence as a viral pathogen, SARS-CoV-2 has undergone rapid mutation. The Nextstrain sequencing consortium 9 has identified 2785 variants, while the GISAID initiative has collected over 50,000 full length sequences as of 6-28-20 [10][11][12] . Analysis of these sequences reveals 5 (Nextstrain) or 6 (GISAID) sequence clusters, with as many as 18 new mutations in each cluster distinct from the original SARS-CoV-2 virus 6 . Viral genetic variation plays an important role in pathogenicity and virulence in other viruses, such as influenza 13 . In this study, we sought to determine if specific viral sequence variants are associated with better or worse clinical outcomes in COVID-19, by analyzing the clinical course of a cohort of patients within a single medical system for whom both full-length viral sequence data and clinical outcome data were available.

Results
Study population characteristics. Full length SARS-CoV-2 sequence was obtained from 283 patients; clinical history within the UW Medicine system was available from 190 of these. 35 (18.4%) were hospitalized and 14 (7.4%) died from complications of the infection. Clinical characteristics of this cohort are summarized in Tables 1 and 2. When stratified between hospitalized and non-hospitalized patients, advanced age, admission from a skilled nursing facility, and a history of either hypertension (HTN), CHF, CVD, CKD, and a history of DVT or cancer were significantly associated with hospitalization (p values < 0.001-0.02). Other characteristics that were associated with hospitalization included anti-coagulated status, utilization of an ACE inhibitor or ARB, and use of corticosteroid or immunomodulatory therapy (p values < 0.001-0.04). Notably, comorbidities of diabetes and known tobacco history were not significantly associated with hospitalized status (Tables 1, 2). Table 1. Baseline demographics and medical histories of hospitalized and non-hospitalized patients. DVT Deep Venous Thrombosis, COPD chronic obstructive pulmonary disease, MI myocardial infarction, ACEI angiotensin-converting enzyme inhibitors, ARB angiotensin receptor blocker, IMT immunomodulatory therapy. a Cardiovascular disease defined as history of MI, CVD/stroke, CHF, valvular diseases (s/p CABG). b Reference value. All diseases were defined using ICD-9 and ICD-10 codes.  Fig. 1). Most variants were present with frequency less than 5%. 163 of the sequence variants were missense mutations, 84 were synonymous mutations, and the remainder were not in protein coding regions. UPGMA hierarchical clustering produced two clear clades of sequence variants (Fig. 1), determined by 12 single nucleotide polymorphisms (Table 3). Ninety-seven samples corresponded to what we refer to as 'Clade 1′ and 91 corresponded to 'Clade 2′. Two of 190 samples did not fall into either of the two major clades (and are listed as 'Clade 3' in Table 4). When mapped onto GISAID and NextStrain clades, we find in clade 1 that 89 correspond to clades GH/20C, 6 map to G/20A, and 2 map to G/20B. In clade 2 we found that 86 correspond to S/19B, and 5 mapped onto L/19A (Table 4). The 2 of 190 samples that did not fall into either of the major clades corresponded to GH/20C and S/19B.
Mapping the 190 sequence variants onto the 2563 available full-length sequences in NCBI Virus (05/18/2020) showed that the sequence variants found in Seattle in this study represented a substantial fraction of sequence variation seen globally (Fig. 1B). Of note, the two major clades identified in our Seattle-based samples are found in approximately equal proportion among global samples. No variants were observed in the current series that did not map on an existing clade. In Seattle, there was underrepresentation of one clade appearing in the larger dataset, which included sequences collected from 20 countries across 5 continents including the USA, China, Australia, Italy, Pakistan, and Brazil and contained the reference sequence (NC_045512.2). Overall, the sequence variants obtained in this study did not appear unique to Seattle, and represented a substantial proportion of variation noted globally. Table 2. Baseline demographics and medical histories of patients stratified by mortality. a 2 sequences were excluded as they were not characteristic of either clade. b Reference value. c Cardiovascular disease defined as history of MI, CVD/stroke, CHF, valvular diseases (s/p CABG). All diseases were defined using ICD-9 and ICD-10 codes. DVT Deep Venous Thrombosis, COPD chronic obstructive pulmonary disease, MI myocardial infarction, ACEI angiotensin-converting enzyme inhibitors, ARB angiotensin receptor blocker, IMT immunomodulatory therapy. www.nature.com/scientificreports/ To determine if differential selective pressure was driving mutagenesis toward either major clade, the number of missense to silent mutations was compared between clades. Overall, the ratio of missense to sense was 1.851 in Clade 1 and 1.889 in Clade 2. In each case, this ratio varied significantly from expected random mutation which would have predicted a ratio of 3.46 (p < 0.001 for each). Thus, both clades appear to be under selective pressure but neither appears under differential pressure.  www.nature.com/scientificreports/ Correlation of sequence variants with clinical outcomes. Viral clade appeared to correlate with several baseline clinical characteristics as shown in Table 5. When stratified between clade 1 and clade 2 infections, there was a significant difference in patients with a history of CVD, malignancy, steroid/IMT use, and anti-coagulation as well as in patients with a smoking history (p values: 0.005-0.03). Comorbidity with CVD and cancer history was associated with clade 2 infection (OR: 3.1, 3.1, respectively) ( Table 6). History of steroid/ IMT and anticoagulation use was associated with clade 2 infection (OR: 2.7, 5.0 respectively). Notably, patients with clade 2 infections were more likely to have never smoked tobacco (OR: 2.0), while patients with clade 1 infections were more likely to be active smokers (OR: 3.6). In multivariable analyses, a history of malignancy was  . Further analysis of individual viral sequence polymorphism revealed that no single polymorphism was significantly associated with outcomes of hospitalization or death, even without statistical adjustment for multiple comparisons.
Machine learning (ML) was applied to probe for cooperative effects between viral genotype and host risk factors in determining which patients would be hospitalized. Death was too infrequent an outcome to allow machine learning modeling from our dataset. Using a Random Forest paradigm (scikit-learn 14 v0.22.2) we trained models using patient demographics, clinical features, and viral clade either separately or in combination on 160 cases, and tested the model on a hold-out set of 30 cases. For our initial analysis, we utilized the most recent 30 cases (which had 3 hospitalized patients out of 30). As shown in Fig. 3, the model using solely patient demographics achieved an AUROC of 0.66. Addition of clinical information to demographics resulted in a substantially improved AUROC of 0.93. This model correctly predicted hospitalization status in 26 of 30 patients. Addition of clade data to either demographics-only or demographics + clinical models resulted in minimal improvement (AUROC 0.72 and 0.93, respectively). A final model, in which all genetic polymorphism data was added to the Because hospitalization was relatively rare (18% in total data set), machine learning models could succeed by generally predicting non-hospitalization. The best trained model in our initial analysis (demographic + clinical) correctly predicted hospitalization status in 26 of the 30 subjects. The negative predictive value of this model was excellent (0.96), but the pre-test probability of non-hospitalization in this cohort was already 0.9. To test the generalizability of the machine learning approach for a group with higher risk of hospitalization, we re-trained the model on the same data, except holding out the most recently collected 15 hospitalized and 15 non-hospitalized patients (Fig. 4). Again, AUROC of the demographics-only model was relatively high (0.78). Performance was again improved with addition of clinical data (AUROC 0.86), and no further improvement was seen with addition of clade or genetic data to the demographics + clinical dataset. Interestingly, with the 50% hospitalization holdout set, a demographics + clade model did appear to out-perform demographics-only (AUROC 0.86), suggesting that there may be some outcome information associated with clade, with the model correctly predicting two additional hospitalizations and one non-hospitalization. However, addition of clade information to the demographics + clinical did not improve performance further, suggesting minimal interaction between viral clade and comorbid conditions.  www.nature.com/scientificreports/ We further tested the machine learning models for sensitivity by training the models on datasets in which a specific clade had a 'spiked' association with hospitalization, ranging from 100% to random assignment (Fig. 4C). Spiked models achieved AUROC of 1.0 for clade assignments ranging from perfect correlation to ~ 10% noise, but then degraded toward demographic-only level AUC performance with between 25 and 50% noise. This demonstrates that had viral clade been a dominant determinant of outcome the machine learning models would have had sufficient power to detect this effect.

Discussion
The COVID-19 pandemic of 2019-2020 has had a dramatic impact on health world-wide, with 7.8 million cases and over 400,000 attributed deaths world-wide as of June 14, 2020. The first case reported in the United States was in the state of Washington in January, 2020. In the ensuing 6 months, over 20,000 cases have been reported in Washington. The overall hospitalization rate within Washington State as of 7/31/20 is approximately 15%, with an overall mortality rate of 5% (https ://coron aviru s.jhu.edu/map.html), although these numbers are based on Department of Health confirmed cases and likely overestimate true population rates (which would include asymptomatic and minimally symptomatic infected individuals).
Several possibilities exist for the heterogeneity of outcomes associated with SARS-CoV-2 infection. Morbidity may be associated with pre-existing illness or its treatment; individuals already ill from other causes may be more likely to require hospitalization or succumb from infection. Access to quality healthcare may also influence outcomes. Age itself is a predictor of hospitalization and mortality which may significantly influence outcome. Host genetic factors are likely to play a role as well, and recent results suggest two host susceptibility loci that influence outcome in COVID-19 15 .
For some infectious processes, viral strain may play a large role in determination of pathogenicity. The influenza pandemic of 1918-the last widespread highly morbid global pandemic-was caused by a specific strain of influenza virus (H1N1). Many other viruses have significant sequence variation which influences clinical outcomes [16][17][18] . Hepatitis C genomic variants, for example, have been significantly associated with outcomes in hepatic disease 19 . www.nature.com/scientificreports/ The current study is the first to attempt to link SARS-CoV-2 viral sequence variants with COVID-19 outcomes. We found significant variation in viral sequence in our sample, with 289 sequence variants found in the 190 sequenced samples. However, the majority of variants occurred with a frequency of less than 10%. All of the frequently encountered variants have been previously described, and several clades have been identified in the literature [20][21][22][23][24] .
Within our data, two clear clades emerged from UPGMA hierarchical clustering of the sequence variants. The clades are determined by 12 base variants as noted in Table 3. 97 samples corresponded to what we refer to as 'Clade 1′ and 91 corresponded to 'Clade 2′. When mapped onto GISAID and Nextstrain clades, we find in clade 1 that 90 correspond to clades GH/20C (GISAID/NextStrain), while in clade 2, 86 correspond to S/19B. In analyzing the ratio of synonymous to non-synonymous mutations, we find that both major clades appear to be under substantial negative selection, with significantly more synonymous than non-synonymous mutations observed than would be observed by chance. However, relative to each other, neither strain appeared to be under differential selective pressure.
We find that risk factors for hospitalization for patients with COVID-19 include advanced age and presentation from skilled nursing facility. In addition, we found that histories of hypertension, cardiovascular disease, deep venous thrombosis, and chronic renal disease were associated with hospitalization. Even though we found several baseline clinical factors to be significantly associated with clade 2 in univariate analyses and history of malignancy in the multivariate model, rates of hospitalization were not significantly different between patients infected with the two major clades of virus in our study (p = 0.063), nor were mortality rates (p = 0.58). Given the relatively low number of fatalities in our study, we were not powered to detect subtle strain-level differences in mortality outcome.
Machine learning approaches allowed us to model the predictability of hospitalization. Demographics alone was sufficient to allow some prediction of hospitalization with an AUROC for the model for the most recent 30 cases of 0.66. However, addition of clinical data improved the AUROC to 0.93. Addition of clade or individual viral sequence data to the model did not further improve performance suggesting that viral sequence variants do not independently contribute significantly to risk of hospitalization. Sensitivity analysis suggests that had a viral variant had > 50% impact on hospitalization risk this would have been detected by the machine learning algorithm and resulted in higher AUROC.
This study has some unique strengths and weaknesses. Our data were derived from a single health-care system encompassing three hospitals in a major metropolitan area. By using a single medical system, we had access to substantial medical history on these subjects as well as reduced concern regarding the influence of hospital system on outcomes (i.e. we assume that decisions for hospitalization and quality of care of hospitalized patients will be more consistent in patients treated within a single system). Our system served as the primary site for COVID-19 testing particularly in March and April, 2020, which gave us access to a substantial number of patients with linked outcome data. However, our outcomes at present are limited to hospitalization and death. Use of hospitalization as outcome represents a useful dichotomous outcome that is a proxy for disease severity. However, the decision to admit may be influenced by factors other than the patient's immediate status, and may be biased toward admission of patients with significant comorbidities, advanced age, or socio-economic considerations. It is conceivable that viral sequence variants might be associated with differential outcomes looking at more granular and direct disease features such as pulmonary radiologic outcomes or specific complications. The use of machine learning produced predictive models with excellent overall performance, particularly for predicting those patients who would not require hospitalization. Although we took significant steps to limit over-learning by models, including testing on two substantial hold-out sets and performing sensitivity analysis with 'spiked' datasets, it is possible that these machine learning models might not be generalizable to patients in other geographic regions or in other health systems.
Overall, our results demonstrate substantial sequence variation in SARS-CoV-2 within a single metropolitan area, where the observed sequences represent a substantial fraction of sequence clades that have been observed globally. Viral clade showed a trend toward worse outcomes for patients infected with virus from clade 2 but this result was of borderline statistical significance in our cohort of 190 patients, and potentially confounded by imbalanced distribution of comorbidities between patients infected with the two major clades. Patient demographics and clinical history were strongly predictive of hospitalization, and viral clade information did not substantially improve predictions, suggesting that it contributes minimally to determination of outcome. Further analysis on larger datasets will be needed to determine if viral clade has significant influence on patient outcome.

Methods
Subjects, samples, and sequencing. Institutional Review Board approval for this study was obtained from the University of Washington, and all research was conducted in compliance with the Declaration of Helsinki. As a retrospective chart outcome study, the University of Washington IRB exempted this study from informed consent. The subset of nasopharyngeal samples collected at University of Washington Medicine (UW Medicine) clinical sites between March 5 and April 8, 2020 that tested positive for SARS-CoV2 by quantitative PCR with Ct < 32 were subjected to whole viral genome sequencing as described previously 25,26 . In brief extracted RNA from positive specimens was converted to cDNA using random hexamers and sequencing libraries were prepared using Nextera XT or Flex kits (Illumina). Libraries were sequenced on MiSeq, NextSeq or NovaSeq instruments (Illumina) using 1 × 185, 1 × 75, or 1 × 100 runs respectively. Raw reads were processed to generate consensus sequences using a custom bioinformatics pipeline (https ://githu b.com/proyc hou/hCoV1 9) that combines de novo assembly and read mapping. Raw reads and consensus sequences were deposited to NCBI SRA and Genbank respectively under BioProject PRJNA610428. www.nature.com/scientificreports/ Patients with medical records in the UW Medicine health care system had their health records extracted using manual chart review. Data extracted included demographics (age, sex, ethnicity) as well as co-morbid conditions identified on the documented clinical notes. Finally, clinical outcome measures such as hospitalization and mortality were collected as well. Patients without adequate documentation on comorbidities were coded as unknown (n = 4) and not included in calculations for significance.
Demographic and clinical characteristic data were summarized with descriptive statistics including frequencies, percentages and two-sided Pearson chi-square tests with clade type and hospitalization status as outcomes of interest. Fisher's exact test with mid-p correction and Student's t-test were used when appropriate in univariate analysis of demographic and clinical characteristics to determine significance. Clopper-Pearson confidence intervals were considered for binomial confidence intervals. In an exploratory analysis, four models of variable selection for multivariate logistic regression were used to determine significance: 1) stepwise Akaike Information Criteria (AIC) 27 ; 2) random forests with area under the receiver operating characteristic curve (AUC) as the parameter of interest 28 ; 3) all univariate significant variables with a p value < 0.1; and 4) all covariates. LASSO regression 29 using mean squared error and AUC as the lambda tuner was also performed to understand important predictor variables.
Missing data. Missing data was handled using multiple imputation by chained equations (MICE) after a sensitivity analysis revealed that the missingness of the data was not completely random (i.e. not MCAR). Recent literature has concluded that the number of imputations should be similar to the percentage of incomplete cases, which in our data is 3.4% 30,31 . Taken together with the computational expense, a number of 10 imputed datasets was chosen with 20 cycles to reach convergence of the sampling distribution of imputed values 32 . Finally, all analytic variables with continuous, dichotomous, and categorical data were modelled using predictive meanmatching, Bayesian logistic regression, and Bayesian polytomous regression, respectively.
Machine learning. For machine learning, several models were built with datasets consisting of combinations of demographics, clinical, clade, and genetic data. Genetic data was included as a vector of sequence variants for each sample. Each dataset was split into train and test sets with 160 and 30 samples respectively. Model selection was run using a nested cross validation (CV) format, with 5 outer folds and leave-one-out CV run on each fold. Model tuning was accomplished using 10,000 random sets of hyperparameters for each of 4 model architectures (AdaBoost, Extra Trees, Gradient Boosting, Random Forest) from scikit-learn (v0.22.2). Composite precision-recall curves were produced by merging the 5 outer fold predictions, the area under the precisionrecall curves (AUPRC) and area under the receiver operating characteristic (AUROC) were compared, and the parameters that produced the lowest validation bias and highest validation score were chosen. Top 1 performance was similar across the models with the Random Forest slightly outperforming the others. The chosen hyperparameters and model were then used for subsequent testing on the holdout test set.