The COVID-19 outbreak was first identified in Wuhan and appeared to be linked to Huanan Seafood Wholesale Market (HSWM). The causal agent, SARS-CoV-21,2, is closely related to a bat coronavirus (RaTG13)2, although its receptor binding domain is more similar to that of pangolin coronaviruses3. Currently, several questions remain regarding the origin, evolution and host interactions of SARS-CoV-2. First, although HSWM has been widely proposed to be the original outbreak site of SARS-CoV-2, a significant number of the initial cases did not have contact with this market4. This casts doubt on the idea of a singular event of zoonotic spillover to humans in the initial outbreak. Second, additional data are required to discern whether the virulence of SARS-CoV-2 has altered as a result of genomic sequence evolution during the spread of the disease. Third, although SARS-CoV-2 infection can cause life-threatening respiratory disease, most cases manifest only mild pneumonia5. The factors associated with disease outcome have yet to be fully characterized. We have systematically analysed key immunological parameters spanning the course of infection in patients, obtained viral genomes directly from clinical samples, and delineated factors associated with prognosis and epidemiological features.

Overview of enrolment

The basic clinical and epidemiological features of the cohort (326 patients in Shanghai between 20 January and 25 February 2020) are summarized in Extended Data Table 1. Four categories of infected case were defined. Five individuals were asymptomatic; that is, they had no obvious fever, respiratory symptoms or radiological manifestations. Most patients (293) had mild disease with fever and radiological manifestations of pneumonia. Twelve patients who had symptoms of dyspnoea and signs of expanding ground-glass opacity in the lung within 24–48 h of admission were defined as severe cases. The remaining 16 patients deteriorated into acute respiratory distress syndrome and required mechanical ventilation or extracorporeal membrane oxygenation; these patients were categorized as critical (Extended Data Table 1). As of 1 April 2020, 315 (96.63%) of the patients had been discharged, and 6 (1.84%) had died.

Nucleotide variation in viral genomes

Sequencing data from 112 samples (sputum or oropharyngeal swab) passed quality control and were used for nucleotide variation calling (Extended Data Fig. 1). Compared to the first-released genome (Wuhan-Hu-1), we identified 66 synonymous and 103 nonsynonymous variants in 9 protein-coding regions (Extended Data Fig. 2a, b). Substitution rates in most genes (ORF1ab, S, ORF3a, E, M and ORF7a) were similar (around 3.5 × 10⁻⁴ per site per year), whereas variation rates in ORF8 (9.51 × 10⁻⁴ per site per year) and N (1.05 × 10⁻³ per site per year) were higher (Extended Data Fig. 2a, b). The recurrence of variations in the viral genome is similar between samples from Shanghai and the GISAID dataset (Extended Data Fig. 2c).

Genomic phylogeny analysis

We next used the viral genomes from 94 patients (which were more than 90% complete) together with 221 sequences of SARS-CoV-2 from the GISAID database for phylogeny analysis. Two major clades were identified (Fig. 1, Extended Data Fig. 3a, b), both of which included cases diagnosed in early December 20191,2. Clade I included several subclades, such as those characterized by ORF3a: p.251G>V (subclade V), or S: p.614D>G (subclade G). Clade II is distinguished from clade I by two linked variations, ORF8: p.84L>S (28144T>C) and ORF1ab: p.2839S (8782C>T) (Fig. 1, Extended Data Fig. 3a). Both major clades and their subclades were found in the Shanghai cohort, suggesting that there were multiple origins of transmission into Shanghai. We did not observe significant expansion of clades or subclades in Shanghai.

Fig. 1: Phylogenetic analysis of the assembled SARS-CoV-2 genomes.

We used 94 SARS-CoV-2 genome sequences and 221 published sequences to construct a time-resolved phylogeny tree. Clades I and II are marked and variations that distinguish branches of the tree are indicated. Concentric circles represent sampling dates. Each tip circle represents a single sample; colours indicate case locations (key). Cases with a history of contact with HSWM are highlighted.

Additionally, the viral genomes from six patients with a clear history of contact with HSWM1,2, the suspected initial outbreak site, were all clustered into clade I, whereas those from three patients diagnosed at the same time without a history of contact with HSWM6,7 were clustered into clade II (Fig. 1). We analysed the sequences around nucleotides 8,782 and 28,144 of SARS-CoV-2 in samples from patients with or without a history of contact with HSWM and in the bat coronavirus Bat-SARS-CoV-RaTG13. Virus genomes found in patients without contact with HSWM were identical to Bat-SARS-CoV-RaTG13 at these two sites (Extended Data Fig. 3c).

We compared the clinical manifestations of patients infected with viruses of either clade I or clade II. We found no statistical difference in disease severity (P = 0.88), lymphocyte count (P = 0.79), CD3 T cell count (P = 0.21), C-reactive protein level (P = 0.83) or D-dimer level (P = 0.19), or in the duration of virus shedding after onset (P = 0.79) (Extended Data Table 2). Thus, these two clades of virus exhibited similar pathogenic effects despite their genome sequence variations. Likewise, we found no significant association between disease severity and the 13 most frequent genetic variations (synonymous and non-synonymous) (Extended Data Fig. 4).

Host factors associated with disease severity

A notable feature of our cohort was that some infected individuals (five cases; 1.53%) did not develop obvious symptoms even though substantial virus shedding could be detected. As shown in Extended Data Fig. 5a, an asymptomatic patient showed no obvious lesions in the lungs upon admission or five days later. By contrast, unilateral and bilateral opacity lesions were observed in patients with mild (Extended Data Fig. 5b) or critical COVID-19, and the latter deteriorated quickly over just two days (Extended Data Fig. 5c).

We further analysed the immunological and biochemical parameters of the patients (Extended Data Table 3). A prominent feature of COVID-19 was progressive lymphocytopenia, particularly in patients categorized as severe or critical (initial test result after admission, P = 6 × 10⁻⁶). Detailed analysis of lymphocyte subtypes revealed that CD3+ T cells were most significantly affected (P < 10⁻⁶), with CD4+ and CD8+ T cells sharing similar trends (CD4+ T cell, P < 10⁻⁶; CD8+ T cell, P = 1 × 10⁻⁵). Notably, the changes in T lymphocytes were statistically significant not only in critical cases but also in the other three categories (asymptomatic, mild and severe; CD3+ T cells, P = 0.013; CD8+ T cells, P = 0.004). By contrast, for CD19+ B cells, although a significant decline was found in critical cases (P = 1 × 10⁻⁵), patients in the other categories showed no obvious changes (P = 0.47). We further examined the longitudinal cell counting data for each group. It was clear that the CD3+ T lymphocytes exhibited a gradual decline (P < 0.05 on days 7, 8, 11, 14–18, 22–25, 28 and 29 after onset, Kruskal–Wallis test) as the disease deteriorated (Fig. 2a), a trend that was also seen in CD4+ and CD8+ T cells (Fig. 2b, c). However, it was not found for natural killer (NK) (CD16+CD56+) or B (CD19+) cells (Fig. 2d, e).

Fig. 2: Lymphocyte numbers in patients during hospitalization.

a–e, Temporal changes in CD3+ (a), CD4+ (b), CD8+ (c), CD16+CD56+ (d) and CD19+ (e) cell counts in each group. Data are shown as median ± 95% confidence interval and the normal range for each cell type is indicated with dashed lines. a–c, n = 325; d, e, n = 220.

We next compared the clinical parameters grouped by comorbidity and found a significantly higher risk for disease progression when the disease was complicated by co-existing conditions (P = 0.01) (Extended Data Table 4), although the median age of the comorbidity group was also higher (P = 0.02). Indeed, univariate logistic regression analysis indicated that age (P < 0.0001), lymphocyte counts upon admission (P < 0.0001), comorbidities (P = 0.01) and gender (P = 0.014) (higher risk for male) were the main factors associated with disease severity (Extended Data Table 5). Multivariate analysis showed that age (P = 0.002) and lymphocytopenia (P = 0.002) were two major independent factors, whereas comorbidities did not reach statistical significance.

The levels of eleven cytokines (IFN-α, IFN-γ, IL-1β, IL-2, IL-4, IL-5, IL-6, IL-8, IL-10, IL-12 and IL-17) in serum were measured upon admission and during treatment. Among them, IL-6 (P < 10⁻⁶) and IL-8 (P = 1 × 10⁻⁵) (Extended Data Table 3) showed the most significant changes. Notably, the levels of these two cytokines were inversely correlated with lymphocyte count (Fig. 3a, b, Extended Data Table 5). Furthermore, we combined the longitudinal cytokine data of each group and plotted their fluctuation patterns against the time point from onset. We aggregated the highest IL-6 data from each patient from day 6 to day 10 after onset and compared patients classed as critical with those classed as non-critical. Patients categorized as critical showed significantly higher levels of IL-6 (P = 0.001, two-sided Mann–Whitney U test) (Fig. 3c). There was a similarly significant difference in IL-8 level when data were aggregated from day 16 to day 20 after onset (P = 0.006) (Fig. 3d). These data suggest that there is a strong link between inflammatory cytokines and the pathogenesis of SARS-CoV-2 infection.
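The per-patient aggregation described above (taking each patient's highest IL-6 value within a window of days after onset) can be illustrated with a minimal sketch. The record layout, the function name `peak_in_window` and all values below are hypothetical and chosen only for illustration; they are not taken from the study data, and the window is assumed to be inclusive at both ends.

```python
def peak_in_window(records, first_day, last_day):
    """Return {patient_id: highest value} using only measurements taken
    between first_day and last_day after onset (inclusive, by assumption)."""
    peaks = {}
    for patient_id, day, value in records:
        if first_day <= day <= last_day:
            if patient_id not in peaks or value > peaks[patient_id]:
                peaks[patient_id] = value
    return peaks


# Hypothetical (patient_id, day_after_onset, cytokine_level) tuples.
records = [
    ("P1", 5, 40.0),   # outside the window, ignored
    ("P1", 7, 12.5),
    ("P1", 9, 30.0),
    ("P2", 6, 3.2),
    ("P2", 12, 50.0),  # outside the window, ignored
    ("P3", 11, 8.0),   # patient has no measurement in the window
]

peaks = peak_in_window(records, 6, 10)
for patient_id in sorted(peaks):
    print(patient_id, peaks[patient_id])
```

Patients without any measurement inside the window simply do not appear in the result, so the two groups being compared may each contain fewer patients than the full cohort.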

Fig. 3: Correlation between inflammatory cytokines and lymphocyte counts.

a, b, Levels of serum IL-6 (a) and IL-8 (b) upon admission plotted against lymphocyte count. Two-sided Spearman’s correlation analysis with no adjustment of multiple comparisons. c, d, Temporal changes in IL-6 (c, n = 230) and IL-8 (d, n = 149) in each group during hospitalization. Data are shown as median ± 95% confidence interval and the normal range for each cytokine is indicated with dashed lines.


Our analysis of some recently treated patients provides further evidence that the viral genome of SARS-CoV-2 is largely stable. Consistent with recently published results8, we found that the observed small sequence variations divided the viral genomes into two major clades. We noted that six sequences recovered from patients with a history of contact with HSWM all fell into clade I, whereas three genomes from patients diagnosed in the same period but without exposure to HSWM were clustered into clade II. Thus, these two major haplotypes are likely to represent two lineages derived from a common ancestor that evolved independently in early December 2019 in Wuhan, only one of which (clade I) was spawned within the HSWM, where a high density of stalls, vendors and customers might have facilitated human-to-human transmission. Consistent with this idea, epidemiological investigations of the earliest cases found in Wuhan before 18 December 2019 identified two patients that were linked to HSWM and five that were not4. Our time-resolved phylogeny analysis suggests that the earliest zoonotic spillover event might have occurred in late November 2019, which is in agreement with a previous analysis8.

Nevertheless, we found no significant differences in clinical features, mutation rate or transmissibility between patients infected with clade I or II virus. Our data are in agreement with a lack of selection against either clade, as suggested9, but are at odds with a previous conclusion, the L/S-type classification of which was based on the same two linked polymorphisms10. The presumed difference in transmissibility might be due to sampling bias, as the early uploaded sequences in the GISAID database were recovered from a limited number of critically ill patients and duplicate assemblies from the same patients were not uncommon1,2,11.

A recent analysis of 1,099 cases of COVID-19 in China found lymphocytopenia to be one of the most common features in laboratory tests5. Here, we have confirmed this observation and further shown that CD3+ T cells were the major cell type that was suppressed in infected patients, whereas CD19+ B cells and CD16+CD56+ NK cells exhibited less suppression. Indeed, lymphopenia and, in particular, reduced CD4+/CD8+ cell counts, are also a major manifestation of SARS-CoV infection12. Furthermore, our longitudinal monitoring of major cytokines indicated that IL-6 and IL-8 were negatively correlated with lymphocyte count and that IL-6 kinetics was highly related to disease severity. At present, the relationships between virological activity, cytokine release and lymphocytopenia remain unclear. We hypothesize that the immunopathological response against SARS-CoV-2, involving a cytokine storm and loss of CD3+ T lymphocytes, could constitute—at least in part—an underlying mechanism for disease progression and fatality. The macrophages in the lung could serve as the first driver of the cytokine storm in the early phase of COVID-19 pneumonia13, and subsequent lymphocyte infiltration mobilized by the cytokines, as observed in infected patients14,15 and Rhesus macaques16, may explain the lymphocytopenia, although probable cytokine-induced T cell depletion cannot be ruled out.

In conclusion, by closely monitoring the molecular and immunological data in 326 patients with COVID-19, we find evidence that adverse outcome is associated with depletion of CD3+ T lymphocytes, which is tightly linked to bursts of cytokines such as IL-6 and IL-8. Targeted sequencing of 94 individuals who were infected during late January to February indicated limited variation in the viral genome, which suggests stable evolution. Two major lineages of the virus derived from one common ancestor may have originated independently from Wuhan in December 2019 and contributed to the current pandemic, although we find no major difference in clinical manifestation or transmissibility between them. Our data provide further evidence for the respective roles played by viral and host factors in disease mechanism and underscore the importance of early intervention in therapy.


Ethics statement

This study was approved by the Shanghai Public Health Clinical Center Ethics Committee (no. YJ-2020-S015-01). Informed consent was obtained from all enrolled patients.


This study involved 326 patients, who had tested positive for SARS-CoV-2 RNA and were admitted to the Shanghai Public Health Clinical Center (the designated hospital receiving all COVID-19 cases in Shanghai) between 20 January and 25 February 2020. In addition to routine clinical tests, measurement of serum cytokines was performed on 228 patients. Their basic demographic, epidemiological and clinical characteristics are shown in Extended Data Table 1. The median age of the patients was 51 years (range 15–88) with a male:female sex ratio of 1.10:1. Among these 326 patients, 125 (38.34%) had at least one comorbidity; the most common were hypertension (76 patients), diabetes (24), coronary heart disease (13), chronic hepatitis B (10), chronic obstructive pulmonary disease (2), chronic renal disease (2) and cancer (3). Disease severity was categorized into four stages—asymptomatic, mild, severe and critical—according to the guidelines on the Diagnosis and Treatment of COVID-19 issued by the National Health Commission, China17. In brief, asymptomatic disease was defined as normal body temperature, lack of respiratory symptoms and no pulmonary radiological manifestation; mild disease as having fever, respiratory symptoms and radiological evidence of pneumonia; severe disease as meeting one of the following manifestations: respiratory rate >30/min, oxygen saturation levels (SpO2) <93%, arterial partial pressure of oxygen (PaO2)/fraction of inspired oxygen (FiO2)(PaO2/FiO2 ratio) ≤ 300 mm Hg or pulmonary imaging with multi-lobular lesions or lesion progression exceeding 50% within 48 h; and critical disease as one of the following: acute respiratory distress syndrome requiring mechanical ventilation, shock, or complications with other organ failure.
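The four-stage categorization above can be read as a small decision rule. The sketch below is only an illustration of that reading, not the guideline's or the authors' implementation: the `Findings` fields, their default values and the function name `classify` are hypothetical, the most severe category is assumed to be checked first, and a patient with any one of fever, respiratory symptoms or radiological pneumonia is assumed to fall into the mild category.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    # All field names and defaults are illustrative assumptions.
    fever: bool = False
    respiratory_symptoms: bool = False
    radiological_pneumonia: bool = False
    respiratory_rate: float = 16.0   # breaths per minute
    spo2: float = 98.0               # oxygen saturation, %
    pao2_fio2: float = 400.0         # PaO2/FiO2 ratio, mm Hg
    multilobular_or_progressing: bool = False  # multi-lobular lesions, or >50% progression within 48 h
    ards_ventilated: bool = False    # ARDS requiring mechanical ventilation
    shock: bool = False
    other_organ_failure: bool = False


def classify(f: Findings) -> str:
    """Assign one of the four categories, checking the most severe first."""
    if f.ards_ventilated or f.shock or f.other_organ_failure:
        return "critical"
    if (f.respiratory_rate > 30 or f.spo2 < 93
            or f.pao2_fio2 <= 300 or f.multilobular_or_progressing):
        return "severe"
    # Assumption: any single one of these findings rules out "asymptomatic".
    if f.fever or f.respiratory_symptoms or f.radiological_pneumonia:
        return "mild"
    return "asymptomatic"


print(classify(Findings(fever=True, respiratory_symptoms=True,
                        radiological_pneumonia=True)))
print(classify(Findings(fever=True, spo2=92.0)))
```

The thresholds are copied from the text (rate above 30 per minute, SpO2 below 93%, PaO2/FiO2 at or below 300 mm Hg), so a respiratory rate of exactly 30 does not by itself meet the severe criterion in this sketch.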

Nucleic acid extraction, molecular screening and genome sequencing

Swabs and sputum samples were collected for nucleic acid extraction using an automatic magnetic extraction device and accompanying kit (Shanghai Bio-Germ) and screened using a semiquantitative RT–PCR kit (Shanghai Bio-Germ) with amplification targeting the ORF1a/b and N genes. Deep sequencing was done using the nucleic acid extracted from patients confirmed as having COVID-19 by RT–PCR in Shanghai Public Health Clinical Center. We used a multiplexed amplicon strategy as described18 and the primers were synthesized as described ( The primers were split into 10 subpools each containing 9–10 pairs for specific amplification of 400-bp viral sequence using the remaining cDNA from the diagnostic test. The PCR amplicons were purified using AMPure DNA cleanup steps. The amplicon libraries were generated using a NanoPrep for Illumina kit (IDT) according to the manufacturer’s instructions. In brief, the procedures included end-repair, 3′ end adenylation, adaptor ligation and PCR amplification, followed by assessing DNA library quality. Amplicon sequencing was performed with established Illumina protocols on MiSeq platform (Illumina) according to a 2 × 300-bp protocol in the National Research Center for Translational Medicine (Shanghai).

Viral genomic sequence variation calling

All clean reads were mapped to the SARS-CoV-2 genome (Wuhan-Hu-1, GenBank accession number MN908947) using BWA (version 0.7.17)19. Variations were called with mpileup tools in samtools20. Low-quality variations with depth lower than 10 and Qual score lower than 50 were filtered using bcftools (version 1.9).
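The depth and quality thresholds above can be illustrated with a plain-Python filter over VCF-style records. This is a sketch under stated assumptions, not the bcftools command used in the study: it assumes that read depth is stored in the `DP` tag of the INFO column, and that a variant is retained only when depth is at least 10 and QUAL is at least 50 (the text could also be read as removing only variants that fail both thresholds). The example records and their values are invented.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=35;MQ=60' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes(vcf_line, min_depth=10, min_qual=50.0):
    """Keep a record only if both depth and QUAL reach the thresholds
    (one possible reading of the filtering rule; an assumption)."""
    cols = vcf_line.rstrip("\n").split("\t")
    if cols[5] == ".":  # missing QUAL is treated as failing
        return False
    qual = float(cols[5])
    depth = int(parse_info(cols[7]).get("DP", 0))
    return depth >= min_depth and qual >= min_qual


# Hypothetical records: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
lines = [
    "MN908947\t8782\t.\tC\tT\t228.0\t.\tDP=120",
    "MN908947\t28144\t.\tT\tC\t45.0\t.\tDP=80",   # QUAL below 50
    "MN908947\t100\t.\tA\tG\t99.0\t.\tDP=7",      # depth below 10
]

kept = [line for line in lines if passes(line)]
for line in kept:
    print(line)
```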

Phylogenetic analysis

Sequencing reads were trimmed using Trimmomatic (version 0.39)21 to remove low-quality regions, adaptor sequences and sequencing primers. Clean reads were used to build virus genome assemblies with VirGenA (version 1.4)22. A post-assembly procedure was manually performed to remove low-quality content and potential sequencing artefacts. Ninety-four assemblies with coverage above 90% qualified for phylogeny analysis. MAFFT (version 7.453)23 made the multi-sequence alignment after trimming off Ns on both ends of the genome sequences. The computation and visualization platform used for the phylogeny analysis was Nextstrain (version 1.15.0)24. The module we selected for phylogenetic tree building was IQ-TREE (version 1.6.12)25. Automatic substitution model selection was performed and the TIM+F+I model was selected to build the maximum likelihood phylogeny tree based on Bayesian information criteria (BIC) score. TreeTime (version 0.7.3)26 was used for time-resolved phylogeny analysis. The resulting phylogeny tree was visualized using auspice from the Nextstrain package. All bioinformatics analyses were performed using the ASTRA supercomputing platform (Sugon) with Optane memory technology in the National Research Center for Translational Medicine (Shanghai).
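Two of the simpler steps above, trimming terminal Ns before alignment and selecting assemblies with coverage above 90%, can be sketched in a few lines. The function names are hypothetical, and coverage is assumed here to mean the fraction of reference positions with a called (non-N) base, which may differ from the definition used in the actual pipeline. The toy sequences are far shorter than a real genome.

```python
def trim_terminal_ns(seq):
    """Remove runs of N (either case) from both ends; internal Ns are kept."""
    return seq.strip("Nn")


def coverage(seq, reference_length):
    """Fraction of reference positions with a called base (assumed definition)."""
    called = sum(1 for base in seq if base not in "Nn")
    return called / reference_length


def qualifies(seq, reference_length, threshold=0.9):
    """True when coverage is strictly above the threshold."""
    return coverage(seq, reference_length) > threshold


# Toy assemblies against a toy reference of length 20.
good = "ACGTACGTACGTACGTACGN"   # 19 of 20 positions called
poor = "NNNNACGTNNNNACGTNNNN"   # 8 of 20 positions called

print(trim_terminal_ns(poor))
print(qualifies(good, 20), qualifies(poor, 20))
```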

Cytokine quantification and lymphocyte subset counting

A Becton Dickinson (BD) cytometric bead array (human Th1/Th2/Th17 cytokine kit and Human Inflammatory Cytokine Kit) was used to quantify serum cytokines (IFNα, IFNγ, IL-1β, IL-2, IL-4, IL-5, IL-6, IL-8, IL-10, IL-12 and IL-17). CD3+ T, CD4+ T, CD8+ T, CD19+ B, and CD16+CD56+ NK cells were stained using BD Multitest 6-colour TBNK reagent in Trucount tubes and analysed using the BD FACSCanto II flow cytometer. The longitudinal plots of cytokines and cell count data were visualized using the geom_smooth tool in the ggplot2 R package.

Statistical analysis

Two-sided Mann–Whitney U tests and Kruskal–Wallis tests were used to compare two and more than two groups of variables, respectively. χ² and Fisher’s exact test were used for analysing contingency tables. Spearman’s rank correlation test was used to evaluate correlations. No statistical methods were used to predetermine sample size. Investigators were not blinded to patient group during experiments and outcome assessment.
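As an illustration of the rank-based correlation mentioned above, the following sketch computes Spearman's coefficient as the Pearson correlation of average ranks, with tied values sharing the mean of their rank positions. It is a pure-Python sketch for exposition only: it returns the coefficient without a P value, it is not the software used in the study, and the paired values are invented.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired values: lymphocyte count and a cytokine level.
lymphocytes = [0.4, 0.8, 1.1, 1.5, 2.0]
cytokine = [95.0, 40.0, 22.0, 9.0, 3.0]

rho = spearman_rho(lymphocytes, cytokine)
print(round(rho, 6))
```

Because the invented cytokine values fall strictly as the lymphocyte counts rise, the coefficient here is at its negative extreme; real data would give an intermediate value.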

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.