High-precision and cost-efficient sequencing for real-time COVID-19 surveillance

COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September, 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ Nasopharyngeal/Oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who were COVID-19 test positive during April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.

Precision and accuracy of CorvGenSurv. We measured the accuracy of CorvGenSurv by sequencing the genomic RNA of the USA-WA1/2020 control strain (GenBank: MN985325.1) provided by BEI Resources (NR-52285). The assembled near full-length SARS-CoV-2 sequence obtained by CorvGenSurv matched 100% to the sequence of this control strain. This accuracy was achieved by our error correction step during consensus sequence building. To assess how many reads are required to produce an accurate consensus sequence, we performed a bootstrap test, randomly sampling a given number of reads and building the consensus sequence of those reads. As shown in Fig. 1b, when three reads were sampled, only 67% [52.4-78.9%] of the 1000 bootstrap runs' resulting consensus sequences were consistent with the reference sequence of USA-WA1/2020. When we generated a consensus sequence from 31 or more reads, the consensus sequence of all 1000 bootstrap runs was identical to the control sequence. This 31-fold sequencing depth ensured the production of a SARS-CoV-2 whole genome sequence that perfectly matched the control sequence.
We also spot-checked the precision of CorvGenSurv by Sanger sequencing, the current gold standard for precision checking 34 . USA/CA-LAC-USC1 in Table 1 was selected, and a near full-length SARS-CoV-2 sequence was obtained by assembling a total of 49 overlapping segments that were sequenced by Sanger sequencing. The resulting Sanger sequence matched 100% to the corresponding USA/CA-LAC-USC1 sequence obtained by CorvGenSurv.
We observed two putative deletions in one sequence (USA/CA-LAC-USC2) that potentially originated from homopolymer sequencing errors. However, these errors were rare as we observed only two deletions out of 754,702 base calls (2.65 × 10 -6 per base). We have thus corrected these putative errors.

COVID-19 specimen profiling. We produced whole genome sequences of 25 Los Angeles County
COVID-19 patients who tested positive at Keck Medicine of USC Clinical Laboratories by accessing their remnant NP/OP swab specimens. This study (HS-20-00326) was approved by the Institutional Review Board of the University of Southern California. Table 1 presented each specimen's collection date from April to June 2020 and COVID-19 test cycle threshold (C t ) values. diagnostic testing were subject to SARS-CoV-2 RNA extraction. Viral RNA was amplified via three overlapping RT-PCRs (~ 10,000 base long each) and pooled SARS-CoV-2 amplicons of indexed COVID-19 specimens were then sequenced by long-read high-throughput single-molecule sequencing. The output fasta file was de-multiplexed and processed to produce the consensus sequence of each segment. Each COVID-19 specimen's three overlapping segments were assembled into a SARS-CoV-2 whole genome sequence. (b) CorvGenSurv's precision was tested by comparing the consensus sequence from a given number of reads with the USA-WA1/2020 control strain (GenBank: MN985325.1). When a consensus sequence was built from three reads, only 67% [52.4-78.9%] of the 1000 bootstrap runs' resulting consensus sequences were consistent with the correct sequence. When the number of the reads was greater or equal to 31, all 1000 bootstrap runs resulted in the correct sequence. www.nature.com/scientificreports/ Figure 2a presented the maximum likelihood tree of these 25 Los Angeles sequences along with 1,215 sequences from the state of California archived in GISAID 20,21 as of July 27th, 2020. The most recent common ancestor sequence of the 25 strains (grey circle in Fig. 2a) had three nucleotide substitutions from the reference sequence, Wuhan-Hu-1, and two amino acid changes, P323L in the RNA-dependent RNA polymerase (RdRP, NSP12) and D614G in the S protein 12,35 . As expected, sequences of specimens collected from April to May (purple numbers in Fig. 2a) were located closer to the root of tree than those of specimens collected in June (blue numbers in Fig. 2a). Ten of the 25 sequences shared additional mutations of R203K and G204R in the N (nucleocapsid) protein (skyblue circle in Fig. 2a) as part of the GR-clade 20,21 . Eight other sequences had mutations of T85I in the NSP2 protein and Q57H in the ORF3a protein (red circle in Fig. 2a) as part of the GH-clade 20,21 . Five other sequences shared the mutation S194L in the N protein (purple circle in Fig. 2a) as part of the G-clade 20,21 . One sequence (USA/CA-LAC-USC15) had a unique mutation lineage amongst the sampled 25 sequences as part of the G-clade 20,21 (Fig. 2a). Another sequence (USA/CA-LAC-USC16) had a unique mutation lineage as part of the GH clade 20,21 . These 25 sequences reflect the high viral diversity observed in California within the G, GR, and GH clades 35,36 . Figure 2b marked the amino acid changes of each of our 25 sequences in reference to the Wuhan-Hu-1 sequence. We reported a total of nine new amino acid mutations that were not observed among 28,176 global SARS-CoV-2 sequences in GISAID 20,21 . These include M755I in RdRP (NSP12) and V117F in the ORF8 protein (Table 1).
We compared the prevalence of each amino acid mutation from Wuhan-Hu-1 that had greater than 2% frequency either globally, in the USA, or in California as of July, 2020 (Fig. 2c). Mass circulation of G clade strains Table 1. Amino acid mutations, clade and lineage of 25 whole genome sequences from COVID-19 remnant NP/OP specimens. Novel amino acid mutations that were not observed among 28,176 global SARS-CoV-2 sequences in GISAID are in bold. C t -1 targeted the ORF1/a-b non-structural region and C t -2 a conserved region in the structural protein envelope E-gene. qRT-PCR was performed using the Roche COBAS system. www.nature.com/scientificreports/ were confirmed by around 80% prevalence of P323L in RdRP and D614G in the S protein globally. In USA and California, the prevalence of T85I in NSP2 and Q57H in ORF3a were observed to be above 40%, as shown in Fig. 2c. The prevalence of P1263L mutation in the S protein was sixfold greater in California than the global prevalence. Additionally, S194L mutation in the N protein was much more prevalent in California, compared to other regions.
CorvGenSurv's supply cost. The development of a cost-effective SARS-CoV-2 genotyping protocol is crucial to expanding COVID-19 surveillance efforts. The per-specimen supply cost of our SARS-CoV-2 whole genome sequencing method was estimated to be $33.8. This includes RNA extraction from NP/OP swab media ($4.58), long-range RT-PCR ($25.8), index PCR ($2.6), and long-read high-throughput sequencing ($0.82). We estimated the per-specimen high-throughput sequencing cost by assessing the maximum number of specimens ( N m ) we can process in a single sequencing run. We observed that 31 or more reads would be required to produce an accurate consensus sequence (Fig. 1b). Therefore, we estimated how many reads are required on average to obtain a minimum of 31 reads per consensus sequence. The expected minimum value of N g Poisson random variables is given by ∞ k=1 [γ (k, )/ Ŵ(k)] N g , where is the mean of the Poisson distribution, γ (k, ) is the lower incomplete gamma function, and Ŵ(k) is the gamma function. Since we need three segments per specimen, N g is given by 3 × N m . We obtained around 1.58 million circular consensus sequence (CCS) reads  Table 1 were denoted by 1 to 25 in this tree. All 25 sequences were classified as G clade with mutations P323L in NSP12 (RdRP) and D614G in S protein (grey circle). Different ancestral sequences were presented by circles in different colors with common mutations of each lineage presented in the box. The unit branch length (one nucleotide base substitution) was denoted as "HD = 1". (b) Each of our 25 sequences' amino acid mutations from Wuhan-Hu-1 (MN908947) were marked using Highlighter (https:// www. hiv. lanl. gov/ conte nt/ seque nce/ HIGHL IGHT/ highl ighter_ top. html). The regions of NSP2, NSP12 (RdRP), S, E, M, and N were presented by colored boxes. (c) The prevalence of each amino acid mutation with greater than 2% frequency either globally, in the USA, or in California. A total of 28,176 global sequences were downloaded from GISAID 20,21 . www.nature.com/scientificreports/ with 99% accuracy for a library size of ~ 10 kb from a single sequencing run. We estimated that 8779 specimens can be processed in a single sequencing run, where 60 reads per consensus sequence can be obtained on average to recover a minimum of 31 reads per consensus sequence. Similarly, if we aim to recover a minimum of 100 reads per consensus sequence as a conservative approach, around 3657 specimens can be processed in a single sequencing run. Assuming a sequencing cost of $3000, this yields a per specimen sequencing cost of $0.82 for large-scale sequencing efforts. This estimated low-sequencing cost suggests a cost advantage over current short-read based SARS-CoV-2 whole genome sequencing methods 8,10-12,26-30 . The required time for CorvGenSurv is around 15 h for sample library preparation and 30 h for sequencing. Our cost-effective workflow may therefore enhance the implementation of real-time massive genomic SARS-CoV-2 surveillance.

SARS-CoV-2 evolution rate.
We estimated the rate of SARS-CoV-2 evolution from the 25 genomes we obtained using a Bayesian phylogenetic reconstruction method 37,38 . The SARS-CoV-2 nucleotide evolution rate was estimated to be 7.51 The rate of SARS-CoV-2 evolution was also estimated by measuring the rate of divergence from the reference Wuhan-Hu-1 sequence. We plotted our sequences' number of nucleotide base differences from the Wuhan-Hu-1 sequence against sample collection time difference in Fig. 3. The nucleotide substitution rate in reference to Wuhan-Hu-1 was measured to be 8.62 × 10 -4 substitutions per site per year (95% confidence interval: 7.96 × 10 -4 to 9.24 × 10 -4 ). This rate was consistent with the above Bayesian phylogenetic estimate.

Influenza vaccination and evolution. It has been reported that several SARS-CoV-2 variants of concern
can potentially escape from vaccine-induced immunity 16,17 . The emergence and circulation of such variants has prompted efforts to continuously monitor SARS-CoV-2 evolution via real-time genomic surveillance 39,40 . To further demonstrate its importance, we analyzed influenza evolution in response to vaccination. Influenza has also been shown to escape from vaccine-induced immunity, as confirmed by serologic testing 41 . The WHO changed the 2019-2020 H1N1 Northern Hemisphere Flu vaccine strain from A/Michigan/45/2015 (grey diamond in Fig. 4a-c) to A/Brisbane/02/2018 (purple diamond in Fig. 4a-c) 42 . Our maximum likelihood tree analysis showed that influenza H1NI Hemagglutinin (HA) sequences evolved away from this new vaccine strain. As shown in Fig. 4a, a total of 255 H1N1 Hemagglutinin (HA) sequences sampled in April 2019 were relatively close to the 2019-2020 H1N1 vaccine strain (purple diamond in Fig. 4a). However, we found that HA sequences collected in January 2020 (1140 sequences total) had evolved away from this vaccine strain. This rapid evolution was visualized in a two-dimensional map obtained by multidimensional scaling of the pairwise HA sequence distance matrix (Fig. 4b,c). As plotted in Fig. 4d, the nucleotide base difference from the vaccine sequence was significantly higher among the January 2020 sequences, compared to the April 2019 sequences (median distance 18 vs. 23, p < 0.001). www.nature.com/scientificreports/ This observed evolution away from the vaccine strain may explain the 2019-2020 seasonal influenza vaccine's low effectiveness against influenza A (H1N1). Influenza vaccine efficacy was estimated to be 37% against H1N1 strains during the 2019-20 flu season 43 . Viral evolution in response to vaccination remains a challenge for vaccine design, and the 2019-20 flu season highlights the need for real-time surveillance of viral sequences. Like influenza, SARS-CoV-2 has the potential to evolve in response to vaccination 44 , and thus SARS-CoV-2 evolution must be continuously monitored in real time to prevent vaccine failure and guide future vaccine strain selection.

Discussion
We developed CorvGenSurv (Coronavirus Genomic Surveillance), a streamlined, high-resolution, and costeffective SARS-CoV-2 whole genome sequencing method in which viral RNA is directly amplified, indexed, and sequenced. CorvGenSurv's precision and accuracy have been validated by Sanger sequencing and control genomic RNA sequencing, respectively. Our targeted amplification of the whole genome in three segments minimized the risk of genome coverage dropouts. Additionally, our streamlined long-read sequencing protocol significantly reduced workflow complexity and thus has a potential cost advantage over currently available shortread sequencing approaches 8,[10][11][12][26][27][28]31 .
We obtained 25 whole genome sequences from NP/OP remnant COVID-19 positive test specimens that were collected from April to June 2020 at Los Angeles County, California, USA. The Los Angeles County pandemic accounted for 33% of all COVID-19 cases in California with around 3500 new cases per day in July 2020 45,46 . Our maximum likelihood tree analysis showed that the 25 strains were highly diverse within the G, GR and GH clades 35 . The G clade and tis lineage have been dominant since late March, 2020 and its D614G mutation in the spike protein has been associated with increased transmissibility 14 . In addition to the G clade's other common mutation, NSP12-P323L, frequent mutations we observed include N-S194L, N-R203K, N-G204R, NSP2-T85I, and ORF3a-Q57H. From this wide spectrum of viral mutations, we estimated SARS-CoV-2 evolutionary rate by a Bayesian phylogenetic reconstruction method 37,38 and divergence measures. These two estimates were 7.51 × 10 -4 and 8.62 × 10 -4 substitutions per site per year respectively, which were consistent with recent reports on the SARS-CoV-2 evolutionary rate [47][48][49] .
We identified new amino acid mutations in the NSP3, NSP4, RdRP (NSP12), ORF8, and N proteins. Mutations in these proteins have the potential to trigger immune escape, alter viral replication capacity, or modulate viral immune suppression. Numerous T cell epitopes have been annotated in these proteins and thus mutations can lead to viral escape from cellular immune responses 50,51 . While mutations in the N, NSP3, NSP4 or NSP12 proteins can alter viral replication capacity 13,52-55 , those in the ORF8 or N proteins may dysregulate host innate www.nature.com/scientificreports/ immune responses via type I interferon signal modulation 54 . It is therefore of great importance to monitor mutations from a diverse array of proteins by whole genome sequencing to pinpoint those that are relevant to viral pathogenicity and immune escape. SARS-CoV-2's S protein is the key target for many of the vaccines currently in development 18 and thus mutations in this region are being closely monitored for their potential to render these vaccines ineffective. The observed influenza HA sequence evolution away from this year's vaccine strain clearly indicates the necessity of screening mutations in viral surface proteins. Genomic surveillance data can also inform SARS-CoV-2 drug resistance. Mutations in RdRP (NSP12) can potentially result in decreased binding affinity with nucleoside analogs such as Remdesivir and Favipiravir 56 . Therefore, genomic surveillance on circulating pandemic strains is crucial for guiding strategies for COVID-19 vaccination and therapeutic intervention and our simple and cost-efficient pipeline has the potential to advance real-time COVID-19 surveillance.
CorvGenSurv can accurately survey new mutations in real-time and thus provide crucial data for disease transmission, new outbreak detection, and vaccine candidate selection. As an alternative to the more common short-read sequencing approaches 8,[10][11][12][26][27][28][29][30] , CorvGenSurv can facilitate the widespread adoption of real-time COVID-19 genomic surveillance and can therefore serve as an important tool for public health decision-making. www.nature.com/scientificreports/