Host–pathogen dynamics in longitudinal clinical specimens from patients with COVID-19

Rapid dissemination of SARS-CoV-2 sequencing data to public repositories has enabled widespread study of viral genomes, but studies of longitudinal specimens from infected persons are relatively limited. Analysis of longitudinal specimens enables understanding of how host immune pressures drive viral evolution in vivo. Here we performed sequencing of 49 longitudinal SARS-CoV-2-positive samples from 20 patients in Washington State collected between March and September of 2020. Viral loads declined over time with an average increase in RT-QPCR cycle threshold of 0.87 per day. We found that there was negligible change in SARS-CoV-2 consensus sequences over time, but identified a number of nonsynonymous variants at low frequencies across the genome. We observed enrichment for a relatively small number of these variants, all of which are now seen in consensus genomes across the globe at low prevalence. In one patient, we saw rapid emergence of various low-level deletion variants at the N-terminal domain of the spike glycoprotein, some of which have previously been shown to be associated with reduced neutralization potency from sera. In a subset of samples that were sequenced using metagenomic methods, differential gene expression analysis showed a downregulation of cytoskeletal genes that was consistent with a loss of ciliated epithelium during infection and recovery. We also identified co-occurrence of bacterial species in samples from multiple hospitalized individuals. These results demonstrate that the intrahost genetic composition of SARS-CoV-2 is dynamic during the course of COVID-19, and highlight the need for continued surveillance and deep sequencing of minor variants.

www.nature.com/scientificreports/ co-occurrences, which may be associated with recovery 12,13 . We found negligible change in viral consensus sequences over time, but detectable changes in variant allele frequencies that are only weakly predictive of future consensus changes across the globe. We further observed rapid emergence of deletion variants in the N-terminus domain of the spike glycoprotein in one patient, potentially suggesting within-host SARS-CoV-2 evasion of NTD-directed antibodies. Taken together our results support the limited emergence and fixation of escape variants during a typical infection, and also highlight the need to monitor minor variants due to their potential impact on vaccine and therapeutic efficacy.

Viral load dynamics in longitudinal samples from SARS-CoV-2 infected individuals. Resid-
ual clinical specimens were obtained from the University of Washington (UW) Virology Lab after testing for SARS-CoV-2 14 . By reviewing our laboratory information system, we identified 20 individuals who had two or more positive or inconclusive samples collected between March and September 2020 (Table 1, Supplementary  Table S1). Inconclusive samples had one of two PCR targets detected. This commonly occurs in samples with small amounts of viral DNA, and thus these were considered presumptive positive for SARS-CoV-2, albeit with low viral load. A majority of samples came from inpatients who received care within the UW Medicine system, which includes UW Medical Center, Harborview Medical Center, and Northwest Hospital. Samples came from individuals with a mean age of 70 (range 42-99), many with severe disease given the availability of multiple samples from these patients. Consistent with other reports 15 , we observed that viral load declined over time in most patients with two or more positive or inconclusive samples with an average increase in RT-QPCR cycle threshold (Ct) of 0.87 per day (Fig. 1, Supplementary Fig. S1). A total of 49 samples (47 nasopharyngeal and 2 oropharyngeal swabs) were sequenced with sufficient reads to be included in this study ( Supplementary Fig. S2A). Ct values for these samples ranged between 14.6 and 36.6. The length of time between collection dates for sequenced samples from the same individual ranged from 0 to 22 days. Samples collected on the same date were sequenced for three individuals. We sequenced nasopharyngeal and oropharyngeal samples from P004 collected at the time of autopsy, two samples from P006 collected at the same time during a hospital admission, and samples collected 9 h apart from P012 during an emergency room visit.
Negligible change in consensus sequences over time. We obtained full-length viral genome sequences with less than 2% unknown bases (Ns) for two or more time points in 14 out of 20 individuals, Low frequency variants detected across the genome. We analyzed intrahost viral genetic variation by examining all sites with > 100 × locus depth, masking known problematic sites and filtering for high-confidence variants (see Methods). We examined sites in 47 samples from 20 different patients and found a total of 103 unique non-synonymous variants relative to the Wuhan-Hu-1 (NC_045512.2) reference genome present at frequencies between 5 and 95% ( Fig. 2A). nsp12 had the highest number of variant sites (26, 25.24%), followed by nsp3 (16,15.533%) and the spike glycoprotein (13, 12.62%). Even when adjusted for gene length, mutations were most prevalent in nsp12, at 0.009 variant sites per nucleotide. For samples that were sequenced multiple times (n = 10), variant frequencies were reproducible across replicates, particularly among samples sequenced with the same library preparation method (Fig. 2B). Intra-host mutations present in multiple longitudinal samples of the same patient were mostly fixed and highly clonal, with some low-level variants ( Supplementary Fig. S2B).
Of the seven most commonly observed variants in our dataset (Table 3), the three most frequent define the Washington state outbreak clade 16,17 and the rest of the variants are clade-defining mutations in Nextstrain clades 20A and 20C. Nine out of 20 patients had the spike protein mutation D614G (A23403G), which has been associated with increased transmissibility and higher viral loads 18,19 . While this variant was rare at the beginning of the pandemic, it reached near fixation in the global SARS-CoV-2 population by June 2020 20 . This rapid rise in prevalence is reflected in our data, as this D614G mutation is present in all three patients with samples collected during or after June 2020. In 10 out of the 11 patients with the 614D variant, no alternate alleles were detected at this position. In the second sample from P007, 614G was detected with a variant allele frequency of 6.1%, but the read depth at this locus (82X) was insufficient to reach our QC standards.
Variants exhibiting intra-host evolution are limited in prediction of future global consensus changes and highlight SARS-CoV-2 antibody evasion. We further examined variants that underwent a maximum allele frequency change of ≥ 20% across timepoints within each patient. The derived alleles for all 25 non-synonymous amino acid changes meeting this criteria in our dataset were also observed among consensus genomes deposited in GISAID 21 by April 2021 (range: 1-2171, mean: 467.8, median: 86.5). Only 8 variants exhibited a maximum allele frequency change of ≥ 40%. The derived alleles for these variants were present in very few GISAID consensus sequences (range: 1-1751), which at the time of analysis represented a mere 0.0001-0.17% of all sequences deposited in GISAID (Fig. 3A). Though the number of consensus sequences with these derived alleles was relatively low, these sequences were diverse with respect to collection date and geographic origin (Fig. 3B).
Three variants with a ≥ 20% within-host change in allele frequency localized to the spike glycoprotein (Fig. 4A). None of these had an allele frequency change of ≥ 50%. In patient P016, a S:143-145 6-nucleotide deletion was observed at an allele frequency of 20.7% in patient sample 2, but was not observed in samples collected www.nature.com/scientificreports/ 3 days prior and 3 days later. This patient was immunocompetent and not receiving any COVID-19 treatment. Interestingly, numerous other deletions arose at low frequencies in this patient, with the largest number present at day 0 (Fig. 4B). The most prevalent deletion variant in the day 0 sample (collected 16 days after initial symptom onset) was S:∆141-144 at 1.87% allele frequency. This deletion was the second most common variant in the day 3 sample at 1.2% allele frequency, but was not observed at all in the day 6 sample. Deletions in this region, including S:∆141-144, have previously been observed in chronically infected immunocompromised patients, and some are associated with escape from NTD-specific neutralizing antibodies or polyclonal sera [8][9][10][11]22 . Notably, a S:∆142-144 deletion is a hallmark of the currently circulating Omicron variant.

Longitudinal RNAseq analysis illustrates loss of ciliated epithelium during infection. For sam-
ples that were sequenced metagenomically, we pseudo-aligned reads to the human transcriptome to perform differential expression analysis comparing initial (t = 0) timepoints to later timepoints. Samples with more than 900,000 pseudo-aligned reads (n = 7 initial, 3 later timepoints) were included in the analysis to determine vari- Table 2. Consensus sequence analysis of SARS-CoV-2 in longitudinal specimens. All patients with less than 2% unknown bases (Ns) are included. The last column indicates nucleotide differences compared to the first sample collected for each respective patient. One asterisk (*) indicates symptoms were present at first time point but exact date of symptom onset is unknown. Two asterisks (**) indicate days since first sample.  www.nature.com/scientificreports/ ation in host gene expression over time. We observed a dramatic downregulation of several cytoskeletal genes, particularly dynein heavy chain (DNAH 2, 3,5,6,7,9,10,11,12), as well as WDRs, MAP1A, and others (Fig. 5A, Supplementary Table S3). Gene Ontology analysis (Fig. 5B) confirmed that downregulated genes are involved in biological processes associated with microtubule-based motility. This is consistent with the death of ciliated epithelial cells, which are enriched for transcripts encoding microtubule transport machinery 23 , following SARS-CoV-2 infection. We observed upregulation of some actin cytoskeleton-related transcripts like VAV1, VASP, and RhoF 24 . We also observed downregulation of several interferon-stimulated genes, but these did not reach statistical significance in this small sample set.
Metagenomic analysis shows high levels of clinically relevant bacteria in three samples. We used a previously described metagenomic pipeline (CLOMP 25 ) to perform taxonomic assignment of preprocessed reads. We excluded samples that had fewer than 10,000 reads after trimming and any samples that underwent enrichment via probe-capture or amplicon-sequencing for SARS-CoV-2. A total of 24 samples from 11 patients were included in further analysis ( Supplementary Fig. S3, Supplementary Fig. S4). No viruses aside from SARS-related coronaviruses met the required cutoffs to be classified as co-infections (see "Methods"). Multiple samples had detectable numbers of bacterial reads, in particular Staphylococcus aureus and Moraxella catarrhalis. Samples from two patients (P004, P005) had high detectable levels (> 100 RPM) of both bacteria at one or more time point(s) (Supplementary Fig. S3B). In patient P006, who had paired nasopharyngeal (NP) and oropharyngeal (OP) swabs collected at the same time point, we detected Capnocytophaga gingivalis, Capnocytophaga leadbetteri, and Streptococcus parasanguinis in the OP swab at 21,960, 9475, and 4476 RPM, respectively. All three species of bacteria commonly colonize the oropharynx. In contrast, in the NP swab, the predominant species of bacteria was a common skin colonizer, Cutibacterium acnes, at 824 RPM. In P010, we found a large number of reads corresponding to Corynebacterium spp. at one time point. Upon review of medical records, we found no mention of bacterial co-infections or of positive bacterial cultures in the charts for P004 or P006. P005 had a nares culture that grew methicillin-resistant Staphylococcus aureus three months prior to SARS-CoV-2 infection.

Discussion
In this study, we performed high-throughput sequencing of longitudinal clinical specimens that were positive for SARS-CoV-2 by RT-QPCR. Most samples were sequenced using a metagenomic approach, which enabled us to simultaneously derive information about viral evolution, host transcription, and the presence of other organisms within patient samples. We showed that although the viral consensus sequence remains largely unchanged over the course of infection, there is a relative abundance of genome-wide low-frequency variants. Similar to other studies, we saw a wide range in the number of variants detected across samples 26 and distribution of variants across the genome, though some positions appeared to be more prone to variation 4,26 . Studies of SARS-CoV-2 and other respiratory viruses 2,3,8,27 have demonstrated the transmission of minor variants and the role of these population bottlenecks on viral evolution, underscoring the importance of studying within-host viral variation. All variants demonstrating significant longitudinal evolution in our sample set collected March-September 2020 have been observed in consensus sequences from around the globe 21 , albeit at relatively low prevalence.
In one patient, we observed rapid turnover of multiple deletion variants in the N-terminal domain of the spike glycoprotein, which has previously been seen in persistent infection in immunocompromised individuals and has been associated with viral escape of neutralizing antibodies 8,10,22,28 . Deletions in the NTD are of particular significance due to their presence in currently circulating lineages of concern. Here we show the emergence of a deletion in this genomic region in an immunocompetent background. It is unclear if the absence of this mutation at day 6 is due to successful clearance of the NTD variant or lack of detection of the minor allele associated with lower copy numbers. In addition, while some of these low frequency deletion alleles have been previously shown to arise independently in different patients in response to similar selection pressures 8 , the presence of multiple low frequency deletion alleles within the same patient may be the product of parallel within-host microevolutionary processes suggestive of some selective advantage, particularly with its similarity to the S:∆142-144 deletion in the currently circulating Omicron variant. Notably, we did not find any evolving www.nature.com/scientificreports/ variants selected for in the RBD, the main target for neutralizing activity of human plasma 29,30 . Additionally, we found two highly dynamic mutations within the Ubl2 and PL2Pro domains of the multifunctional nsp3 protein 31 and three mutations within the nsp12 polymerase. While we and others have previously linked a specific nsp12 mutation to remdesivir resistance 32 , each of these mutations has no phenotypic associations to date and awaits further biochemical and virological characterization. Individual host factors, such as the immune response and respiratory tract microbiome, may play an important role in viral persistence. In particular, because SARS-CoV-2 infection is slow to resolve, the adaptive immune response could drive within-host viral evolution as variants that can escape T-cell and antibody responses develop. Although we were underpowered to see specific evidence of an adaptive immune response being mounted against SARS-CoV-2 in the nasopharynx, detailed studies evaluating antibody and T-cell receptor repertoire changes throughout the course of infection could shed light on the role of immune pressure in the development of minor variants. Similarly, the relationship between bacterial colonization of the nasopharynx  . Variants that exhibit intra-host evolution in the spike protein across all patients. (A) All nonsynonymous variants located in the spike protein with a ≥ 20% change in allele frequency among timepoints for any patient. (B) Enumeration of deletions that arose between residues 138-149 of the spike protein in P016 at ≥ 1% relative frequency reveals a rapidly changing complement of low frequency alleles present over a 6-day period. The reference nucleotide sequence (NC_045512) is located at the top of the sequence alignment. Above the reference is the corresponding amino acid sequence with associated residue numbers. Alleles that match the reference are in gray, and deletions are shown in black. To the right of the sequence alignment is a bar graph showing the square root of the relative frequency of each variant, for visualization purposes, labeled with the allele frequency in percentage and read depth in parentheses. www.nature.com/scientificreports/ and the development or suppression of inflammation in response to SARS-CoV-2 infection remains poorly understood.
As viral load decreases during recovery, it becomes more challenging to recover viral genomes. As a result, one of the limitations of our study is the variability in sequencing depth across samples and the difficulty in ensuring similar sequencing depth for samples from different time points. We used an amplicon sequencingbased approach described previously 33 to obtain near full-length genomes from low viral load samples (up to Ct values of 36). We also used multiple library preparations and performed re-sequencing to ensure the accuracy of variant calls.
Taken together, our results suggest that low frequency genomic variants emerge in immunocompetent individuals, but that these variants are unlikely to reach fixation. Given the emergence of rapidly spreading variants of concern over the past several months, the limited intra-host evolution observed in our dataset highlights the critical impact that a select few individual intra-host evolutionary events may have on the course of the global pandemic and the need for continual genomic surveillance.

Methods
Sample collection and clinical testing for SARS-CoV-2. Specimens were obtained as part of clinical testing for SARS-CoV-2 ordered by local healthcare providers or collected at drive-through testing sites. RNA was extracted and the presence of SARS-CoV-2 was detected by RT-QPCR as previously described using either the emergency use-authorized UW CDC-based laboratory-developed test, Hologic Panther Fusion or Roche cobas SARS-CoV-2 tests 34,35 . Supplemental Table 1 contains clinically relevant details of these specimens.
Chart review and ethics approval. We received approval from the University of Washington Institutional Review Board (UW IRB) to use residual clinical specimens for sequencing and to review clinical records of patients who received care within the UW network. We also received a waiver of informed consent from the UW IRB for the use of residual clinical specimens and retrospective chart review to perform this work. Information obtained from medical records included sex, age, comorbidities, medication, hospital or critical care admission, and discharge status. All methods were carried out in accordance with relevant guidelines and regulations. 18 of the 20 total patients had one or more known comorbidities (Supplementary Table 1).
Sequencing and bioinformatic analysis. Sequencing was attempted on all samples with a positive RT-QPCR assay result that had a Ct ≤ 36 using one of three methods available at the time: (1) a shotgun metagenomic approach using Illumina Nextera XT described previously 36 for samples with a Ct less than 24; (2) an oligo-nucleotide probe capture-based approach from IDT (xGen NGS hybridization capture, Integrated DNA Technologies) similar to previous work 37 for samples with Ct between 24 and 28, or (3) using Swift Biosciences' Normalase Amplicon Panel library preparation workflow 38 for samples with Ct between 28 and 36. Libraries were sequenced on Illumina MiSeq, NextSeq, or NovaSeq instruments using 300, 150, 100, or 75 bp reads. Consensus sequences were assembled using TAYLOR 38 , a custom bioinformatics pipeline (https:// github. com/ greni nger-lab/ covid_ swift_ pipel ine) with or without an additional primer clipping step depending on library preparation method.
Consensus sequences from each individual were aligned with the reference sequence NC_045512 using MAFFT v7 39 . Clade assignments were generated using Pangolin (http:// github. com/ cov-linea ges/ pango lin) and Nextstrain 40 in December 2020. Consensus sequences with < 5% Ns across the length of the genome were considered for further analysis.
Variants were also called with TAYLOR from aligned reads. Variants leading to coding changes with a sequencing depth of > 100 and an allele frequency > 0.01 were subjected to further analysis. For variants that had an allele frequency of < 0.4, we included an additional quality-control step to only include mutations that were present across multiple samples. We also excluded mutations in the first 100 and last 50 bases, as well as variants determined to be due to sequencing error. Most samples were re-prepped and sequenced multiple times to ensure accuracy of variant calls. Variants at positions 6700, 11081-83, 19989, and 29056 were observed in a large number of samples but were determined to be the result of homopolymer sequencing error and were excluded.

RNAseq analysis.
For samples that were prepared with no target enrichment step (metagenomic sequencing), reads were adapter and quality trimmed with Trimmomatic v0.39 41 using the call "leading 3 trailing 3 slidingwindow:4:15 minlen 20", then pseudoaligned to the hg38-derived human transcriptome using Kallisto v0.46 42 . Only samples with more than 900,000 reads pseudo-aligned to the human genome were used for analysis. Differential expression analysis using the Wald test was performed using DEseq2 43 and deemed significant at a Benjamini-Hochberg adjusted p value < 0.1. Statistical enrichment of Gene Ontology Biological Processes was performed on all significant genes using the R package clusterProfiler 44 . Raw counts were submitted to the Gene Expression Omnibus, accession number GSE173310.
Metagenomic analysis. Raw FASTQ files were analyzed using CLOMP v0.1.4 (https:// github. com/ FredH utch/ CLOMP) as previously described 25 . Samples with more than 10 million reads were randomly down-sampled to 10 million reads before analysis using the "sample" command in seqtk (https:// github. com/ lh3/ seqtk). The pipeline output was visualized using the Pavian metagenomic explorer 45 , and reads per million (RPM) calculations were done using a custom R script. Results were filtered to highlight RPM counts for a shortlist of clinically relevant taxa (Supplementary Table S4). Samples were determined to be positive if the species level RPM was at least 30 for viruses, and 100 for bacteria.