COVseq is a cost-effective workflow for mass-scale SARS-CoV-2 genomic surveillance

While mass-scale vaccination campaigns are ongoing worldwide, genomic surveillance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is critical to monitor the emergence and global spread of viral variants of concern (VOC). Here, we present a streamlined workflow—COVseq—which can be used to generate highly multiplexed sequencing libraries compatible with Illumina platforms from hundreds of SARS-CoV-2 samples in parallel, in a rapid and cost-effective manner. We benchmark COVseq against a standard library preparation method (NEBNext) on 29 SARS-CoV-2 positive samples, reaching 95.4% of concordance between single-nucleotide variants detected by both methods. Application of COVseq to 245 additional SARS-CoV-2 positive samples demonstrates the ability of the method to reliably detect emergent VOC as well as its compatibility with downstream phylogenetic analyses. A cost analysis shows that COVseq could be used to sequence thousands of samples at less than 15 USD per sample, including library preparation and sequencing costs. We conclude that COVseq is a versatile and scalable method that is immediately applicable for SARS-CoV-2 genomic surveillance and easily adaptable to other pathogens such as influenza viruses.

S ince the identification of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) as the causative agent of coronavirus disease 2019 (COVID- 19) 1 , thousands of SARS-CoV-2 genomes have been sequenced worldwide and the sequences have been made publicly available on the global initiative on sharing all influenza data (GISAID) 2 (https://www. gisaid.org/). This has enabled a phylogenetic reconstruction of the viral spread and evolution across different countries and continents at an unprecedented scale 3 , allowing the rapid identification of new viral variants, including the UK 4 , South African 5 , and Brazilian 6 variants of concern (VOC), which have now been reclassified by the World Health Organization (WHO) as Alpha, Beta, and Gamma, respectively. SARS-CoV-2 whole-genome sequencing (WGS) based on next-generation sequencing (NGS) has been used for genomic surveillance to track infections in hospitals and community settings 7,8 , as well as to monitor viral outbreaks in breeding farms, such as mink farms 9 . More recently, with the advent of mass-scale vaccination campaigns worldwide, SARS-CoV-2 genomic surveillance has become vital to rapidly trace the emergence and spread of VOC with potentially reduced susceptibility to the current vaccines 10,11 . In this context, the availability of scalable and cost-effective methods for centralized sequencing of hundreds or potentially thousands of samples per week would be greatly beneficial.
Various NGS-based approaches have been developed to perform SARS-CoV-2 WGS using different sequencing platforms. These include direct RNA sequencing and metagenomics [12][13][14] , amplicon-based methods 13,15,16 , and oligonucleotide capturebased methods 13,[16][17][18] . Recently, numerous commercial kits based on the above methods have become available on the market and are being deployed in SARS-CoV-2 genomic surveillance 19 . However, existing commercial solutions are very costly and/or difficult to scale up largely because, typically, one sequencing library must be prepared for each individual sample and multiple indexed libraries must be carefully quantified and balanced before pooling them together prior to sequencing. This limits the number of samples that can be sequenced on a weekly basis, increasing the risk of missing emerging variants of potential concern. In addition to its applications for WGS, NGS has also been used for mass-scale SARS-CoV-2 testing, such as in the SwabSeq method 20 ; however, the latter does not provide full genome coverage. To counteract these limitations, here we present a versatile, scalable, and cost-effective workflow-COVseq-that can be used to prepare multiplexed WGS libraries from many SARS-CoV-2 samples in parallel. We validate COVseq using RNA extracted from a SARS-CoV-2 viral culture as well as 274 diagnostic samples collected in three different phases of the ongoing pandemic at two hospitals in Italy. We demonstrate that the genome sequences obtained by COVseq are compatible with downstream phylogenomic analyses, including detailed phylogenetic reconstruction of a COVID-19 nosocomial outbreak that occurred in January 2021 at a single hospital in Italy. Lastly, we perform a real-life cost analysis based on our experience with the genomic surveillance program that we recently initiated for the Piemonte Region in North-West Italy, demonstrating that COVseq is a highly costeffective method for SARS-CoV-2 genomic surveillance, including in low-income countries.

Results
COVseq enables near-complete coverage of the SARS-CoV-2 genome. The CUTseq method, which we previously described 21 , enables a cost-effective preparation of highly multiplexed DNA sequencing libraries, by using restriction enzymes to barcode multiple samples before pooling them together into a single library. When the SARS-CoV-2 pandemic started, we sought to adapt CUTseq to sequence many SARS-CoV-2 genomes in parallel at an affordable cost. To this end, we designed a workflow-COVseq-that begins with a multiplexed PCR assay developed by the U.S. Centers for Disease Control and Prevention (CDC) to amplify the whole SARS-CoV-2 genome, followed by CUTseq on the resulting purified amplicons (Fig. 1a, b, Supplementary Data 1, 2, and "Methods"). A step-by-step COVseq protocol is available in Supplementary Methods and at Protocol Exchange 22 . We first determined the breadth of coverage of the SARS-CoV-2 genome that could theoretically be achieved using two frequently cutting restriction enzymes compatible with CUTseq 21 . In silico simultaneous digestion with two 4-base cutters, MseI and NlaIII, predicted that 96.1% of the SARS-CoV-2 genome and 100% of the S region that is targeted by all current vaccines would be covered by 150 nucleotides (nt) single-end (SE) sequencing, while 98.8% of the whole genome and 100% of the S region would be covered by SE300 sequencing (Table 1 and "Methods"). In the latter scenario, only one out of 32 single-nucleotide variants (SNVs) in the UK, South African, and Brazilian VOC falls into a so-called "dark region" and thus would not be covered by COVseq ( Supplementary Fig. 1a). We then tested the COVseq workflow by preparing a single library from the RNA extracted from the supernatant of a SARS-CoV-2 culture, using MseI and NlaIII either alone or in combination to digest the pre-amplified SARS-CoV-2 genome ("Methods"). In line with our theoretical expectations, double digestion and SE150 sequencing resulted iñ 95% of the whole SARS-CoV-2 genome and S region covered at 10 sequencing depth and~99% covered at 1 (Fig. 1c, d and Supplementary Data 3).
Next, we sought to apply COVseq to left-over SARS-CoV-2positive RNA samples extracted from nasopharyngeal swabs taken for routine SARS-CoV-2 testing. We prepared a single COVseq library from 29 SARS-CoV-2-positive samples collected during Phase 1 of the pandemic at a hospital in Turin, Italy (OAS-29 samples in Supplementary Data 4 and "Methods"). To minimize the volume of reagents used and hence the cost per sample, we performed all the barcoding reactions in nanoliter volumes, by leveraging on a contactless nanodispensing device (I-DOT One), which we previously used for high-throughput CUTseq 21 ("Methods"). As a reference, we prepared individual sequencing libraries from each of the 29 samples using a commercial kit (NEBNext), which is compatible with SARS-CoV-2 WGS ( Supplementary Fig. 1b, c, Supplementary Data 3, and "Methods"). The number of reads per sample inversely scaled with the corresponding cycle threshold (Ct) value in both COVseq and NEBNext and was highly correlated between them (Pearson's correlation coefficient, PCC: 0.87) (Fig. 1e, f and Supplementary  Fig. 2a). Furthermore, the breadth of SARS-CoV-2 genome coverage at 10 was highly correlated (PCC: 0.95) between the two methods, independently of the sequencing platform and modality (NextSeq 500 PE150 vs. MiSeq PE300) ( Fig. 1g and Supplementary Fig. 2b). Of note, in samples with high Ct value (> 35), a sizable fraction of the reads was aligned to the human genome, independently of the method used to prepare the libraries (Fig. 1h). To further validate COVseq, we assessed how many SNVs identified by NEBNext in 20 samples with low Ct (≤ 35)-a threshold that allows reliable SNV calling-were also detected by COVseq (Supplementary Data 5 and "Methods"). The number of SNVs per sample was highly correlated (PCC: 0.99) between COVseq and NEBNext, and 205 out of 217 SNVs (95.4%) were detected by both methods (Fig. 1i, j and Supplementary Fig. 2c). The remaining SNVs that were detected by only one method had a substantially lower depth of coverage ( Supplementary Fig. 2d). Importantly, most of the genomic locations of the SNVs in the UK, South African, and Brazilian VOC were covered by COVseq at a depth (≥15 reads) that would be sufficient to identify them, if they were present in these samples (Fig. 1k).
COVseq can reproducibly identify SNVs including those found in emerging VOC. Having demonstrated the feasibility and analytical validity of COVseq, we sought to assess the reproducibility and sensitivity of our method in detecting known VOC. To this end, we prepared three replicate (  COVseq data are compatible with phylogenetic reconstruction of COVID-19 clusters. Next, we explored whether the sequencing data generated by COVseq are compatible with downstream phylogenomic analyses. To this end, we used Nextstrain 23 and a random selection of SARS-CoV-2 genomes from GISAID 2 in order to build phylogenetic trees visualizing the hierarchical distribution of the 179 samples that we sequenced with COVseq ("Methods"). As expected, these samples formed several distinct clusters in different branches of the phylogenetic tree, reflecting their time and location of collection ( Fig. 3a, b). The seven OAS-95 samples, which were assigned to the B.1.1.7 (UK) variant based on COVseq, clustered in the 20I operational taxonomic unit clade together with all the other GISAID samples that were classified as UK variant (Fig. 3a, b). Notably, 87 of the 95 OAS-95 samples, which were collected in Jan 2021 during a nosocomial COVID-19 outbreak at a hospital in Turin, Italy, formed a clearly distinct cluster ( Fig. 3a, b). Closer inspection of this cluster suggested that the outbreak had most likely originated in one ward (internal medicine, case index #1) and then spread to two other wards (orthopedics and cardiology) (Fig. 3c). These results demonstrate that COVseq generates high-quality genome sequences that are compatible with downstream phylogenomic analyses, including phylogenetic reconstructions of COVID-19 clusters.
COVseq is suitable for mass-scale SARS-CoV-2 genomic surveillance. Last, we performed a comparative cost analysis to assess the applicability of COVseq in mass-scale genomic surveillance programs. Considering only library preparation costs and assuming to pool 96 samples into the same COVseq library, we determined that the cost per sample would be~$37, $22, and $20 for 1000, 10,000, and 100,000 samples ( Fig. 4a-    scalable, and highly cost-effective method that is suitable for mass-scale SARS-CoV-2 genomic surveillance.

Discussion
As the COVID-19 pandemic continues to rage worldwide, the use of genomic surveillance to monitor SARS-CoV-2 outbreaks in communities, healthcare settings as well as in farms where thousands of susceptible animals live in close proximity has become increasingly important [7][8][9] . Moreover, with the emergence of new viral variants with potentially higher infectivity and/or pathogenicity, there is a cogent need for streamlined and costeffective approaches that could be deployed for sequencing thousands of viral samples per week. This is particularly relevant in the context of the ongoing mass-scale vaccination campaigns,  55  39  13  53  37  67  41  47  51  42  28  21  35  33  34  27  20  25  29  30  31  22  26  6  8  7  16  14  4  19  59  10  90  80  1  54  68  88  91  57  82  78  75  70  49  44  66  69  73  71  83  52  85  81  77  63  64  79  65  76  86  74  92  58  45  9  17  60  38  61  12  23  15  32  5  87  3  43  95  18  50  11  in order to quickly detect and respond to the possible emergence of variants conferring resistance to the existing vaccines 10,11 . The main bottleneck toward this goal is that existing commercial solutions for preparing sequencing libraries from thousands of SARS-CoV-2 samples are costly and time-consuming, mainly because the reagent volumes used are high (microliter range) and because, typically, a single library must be prepared from each sample and quantified before sequencing. In contrast, the COVseq method that we have described here allows constructing highly multiplexed sequencing libraries starting from small volumes of purified RNA samples, and it only requires a nanodispensing device to drastically reduce reagent volumes and therefore the cost per sample. The I-DOT nanodispensing device that we have used here is a versatile bench-top instrument that requires minimal maintenance and training, while drastically reducing plastic consumption by operating in contactless mode. Our cost analysis and ongoing experience in the Piemonte Region in Italy indicates that COVseq is a highly cost-effective approach that could be readily and widely adopted for genomic surveillance of the ongoing pandemic, even in low-income countries. In this context, COVseq is complementary to other sequencing-based methods for mass-scale SARS-CoV-2 testing, which do not provide whole-genome sequence information, such as SwabSeq 20 . Although COVseq is designed to achieve SARS-CoV-2 WGS, the fact that the primers in the CDC or ARTIC pools can be ordered separately allows, in principle, to sequence only a fraction of the SARS-CoV-2 genome, therefore increasing the number of samples that can be sequenced in parallel for the same total cost. This could be applied, for example, to sequence the S genewhich encodes the Spike protein that is targeted by all existing SARS-CoV-2 vaccines-in a much larger number of samples than would actually be possible by WGS, in order to promptly detect the emergence of new amino acid changing variants potentially impacting viral transmission and/or vaccine efficacy.
Since COVseq relies on restriction enzymes that cut the genome non-randomly, some parts of the SARS-CoV-2 genome might not be covered at a depth sufficiently high to reliably call mutations in these regions. However, our data demonstrate that a combination of two restriction enzymes (MseI and NlaIII) and sufficiently long sequencing reads (PE150 or PE300) results in near-complete (98.8%) SARS-CoV-2 genome coverage and allows detecting the recently emerged VOC.
Although COVseq is tailored for SARS-CoV-2, it can be easily adapted to other RNA viruses, such as influenza viruses, as well as to DNA viruses. Indeed, a quick survey of existing multiplexed PCR assays for viruses other than SARS-CoV-2 shows that CUTseq could be readily implemented to sequence the genome of Influenza A and B viruses as well as Dengue, using the same restriction enzymes as in COVseq (Supplementary Fig. 6 and Supplementary Table 1). Different enzyme combinations could also be tested to achieve optimal genome coverage, depending on the pathogen of interest. In conclusion, we envision that COVseq will play an important role in the genomic surveillance of the ongoing and future pandemics.  Data 4). Since the study was conducted on anonymous left-over samples and no clinical and personal information was collected, no informed consent subscription was required. The study was approved by the Ethical Committee of the CCI (permit no. 57/2021) and by the Swedish Ethical Review Authority (permit no. 2020-06694).
Real-time PCR. The differences below reflect the fact that each institution providing the samples adopted different approved diagnostic kits in different phases of the pandemic.  CCI-95 samples. We performed SARS-CoV-2 real-time PCR on these samples using the TaqPath COVID-19 CE-IVD RT-PCR kit (Thermo Fisher Applied Biosystems, cat.no. A48067) or the SARS-CoV-2 ELITe MGB kit (ELITechGroup, cat. no. RTS170ING) following the manufacturer's instructions.
To generate single-stranded cDNA, we added (in order) 4 μL of SuperScript IV buffer (Thermo Fisher Scientific, cat. no. 18090050), 1 μL of 0.1 M DTT (Thermo Fisher Scientific, cat. no. 18090050), 1 μL of RNase OUT (Thermo Fisher Scientific, cat. no. 10777-019), and 1 μL of SSIV reverse-transcriptase enzyme (Thermo Fisher Scientific, cat. no. 18090050) to the RT mix and incubated it in a thermocycler using the following program: 23°C 10 min, 50°C for 10 min, 85°C for 10 min and hold at 4°C. Afterwards, we added 1 μL of RNAse H (Thermo Fisher Scientific, cat. no. 18021071) to the sample and incubated it for 20 min at 37°C. This step is optional and was not performed when using the ARTICv3 protocol. For multiplexed PCR using the CDC approach, we first mixed equal volumes of the corresponding forward (F) and reverse (R) primers (Integrated DNA Technologies) diluted at 50 μM in Nuclease-Free Water (see Supplementary Data 1 for all primer sequences). We then prepared six primer pools according to the aforementioned CDC protocol, by mixing an equal volume of each F + R primer pair in a pool (see Supplementary Data 1 for the primer pairs contained in each pool). To perform the multiplexed PCR, for each primer pool, we aliquoted 3 μL of each cDNA prepared as described above in six separate PCR tubes prefilled with the following reaction mix: 15 μL of NEBNext Q5 Hot Start HiFi PCR Master Mix (NEB, cat. no. M0543L), 9.2 μL of Nuclease-Free Water (Thermo Fisher Scientific, cat. no. AM9932), 1 μL of 4 SYBR Green (Thermo Fisher Scientific, cat. no. S7563), and 1.8 μL of primer pool at 10 μM. We then performed the PCR reaction using a thermocycler (Biometra GmBH) and the following program: (i) 98°C for 30 s, (ii) 40 cycles of 98°C for 15 s and 65°C for 5 min, (iii) hold at 4°C. We then pooled an equal volume (20 μL) from each of the six amplicon pools into a 1.5 mL tube and purified DNA with a 1.0 vol/vol ratio of Ampure XP (Beckman Coulter, cat. no. A63881) beads and eluted the purified PCR in 80 μL of Nuclease-Free Water. To measure the DNA concentration in the sample, we used the Qubit dsDNA BR kit (Thermo Fisher Scientific, cat. no. Q32850) according to the manufacturer's instructions.
For multiplexed PCR using the ARTIC (v3) approach, we first diluted the IDT ARTIC nCoV-2019 V3 panel pools 100 μM (IDT, cat. no. 10006788) to a final concentration of 10 μM. To perform the multiplexed PCR, for each of the two primer pools, we aliquoted 6 μL of each cDNA prepared as described above into a PCR mix containing the following reagents: 12.5 μL of NEBNext Q5 Hot Start HiFi PCR Master Mix (NEB, cat. no. M0543L), 2.9 μL of Nuclease-Free Water (Thermo Fisher Scientific, cat. no. AM9932), and 3.6 μL of primer pool at 10 μM. We then performed the PCR reaction with the following program: (i) 98°C for 30 s, (ii) 35 cycles of 98°C for 15 s and 63°C for 5 min, (iii) hold at 4°C. We then pooled an equal volume (20 μL) from each of the two amplicon pools and purified the DNA with a 0.8 vol/vol ratio of Ampure XP beads and eluted the purified PCR in 40 μL of Nuclease-Free Water. To measure the DNA concentration in the sample, we used the Qubit dsDNA BR kit (Thermo Fisher Scientific, cat. no. Q32850) according to the manufacturer's instructions.
COVseq. A detailed step-by-step COVseq protocol is available in the Supplementary Information and at Protocol Exchange 22 . Below, we briefly describe two COVseq workflows, depending on the number of samples to be processed.
Workflow I (suitable for <10 samples). To test the feasibility of COVseq, we initially applied the standard CUTseq protocol to few RNA samples prepared as described above and processed in individual 0.5 mL tubes. Briefly, we mixed 7 μL (300 ng) of purified pooled PCR product with 1 μL of NlaIII (NEB, cat. no. R0125L), 1 μL of MseI (NEB, cat.no. R0525L), and 1 μL of 10 CutSmart Buffer (NEB, cat. no. B7204S) and incubated the sample for 3 h at 37°C followed by inactivation for 20 min at 65°C. Afterwards, we added the following reagents to the same sample (without purifying it) to reach a final volume of 30 μL: 1 μL of NlaIII and 1 μL of MseI adapters (both at 0.33 μM and prepared as we previously described 21 ), 1 μL of T4 ligase (Thermo Fisher Scientific, cat. no. EL0011), 3 μL of T4 ligase buffer 10 (Thermo Fisher Scientific, cat. no. EL0011), 2.4 μL of ATP 10 mM (Thermo Fisher Scientific, cat. no. R0441), 0.6 μL BSA 50 mg/ml (Thermo Fisher Scientific, cat. no. AM2616), and 11 μL of Nuclease-Free Water. We incubated the sample for 16 h at 16°C followed by inactivation for 10 min at 65°C on the next day. We purified the sample with a 1.2 vol/vol ratio of Ampure XP beads and eluted the purified DNA in 10 μL of Nuclease-Free Water. We performed in vitro transcription (IVT) with the MEGAscript T7 Transcription kit (Thermo Fisher Scientific, cat. no. AM1334) using 8 μL of purified DNA in a final volume of 20 μL and incubated the reaction for 14 h at 37°C. After IVT, we purified the amplified RNA with a 1.8 vol/vol ratio of RNAClean XP (Beckman Coulter, cat. no. A63987) beads and eluted the purified RNA in 10 μL of Nuclease-Free Water. We then ligated the RA3 adapters by preheating 1 μL of RA3 adapter at 10 μM for 2 min at 70°C, followed by the addition of 7. Workflow II (for >10 samples). To process multiple samples in parallel, we performed all reactions until IVT in 384-well plates, leveraging on the I-DOT One nanodispensing device (Dispendix GmBH), which we previously deployed for high-throughput CUTseq 21 , to reduce the volume of each reagent and therefore the cost per sample. However, since our I-DOT machine could not be placed inside a BSL-2 lab-which is required to safely handle potentially infectious RNA samples-we used it only for the CUTseq step, while the RT and multiplexed PCR steps were done in a BSL-2 lab using standard multichannel pipettes. In principle, however, all the steps could be implemented on I-DOT, provided that the machine can be placed inside a BSL-2 lab. Briefly, after having manually prefilled each well of a 384-well plate with 5 μL of mineral oil (Sigma-Aldrich, cat. no. M5904) to prevent evaporation during the following steps, we dispensed from 10 to 50 nL of purified pooled PCR amplicons, depending on the Ct values of the samples and then brought up each well to 350 nL with Nuclease-Free Water. After dispensing for each step, we briefly vortexed the plate on a thermomixer (Eppendorf) at 1000 rpm for 1 min and then centrifuged the plate at 3220 g for 5 min before each incubation. For digestion, we dispensed 150 nL per well of a digestion mix containing 50 nL of NlaIII, 50 nL of MseI, and 50 nL of 10 CutSmart Buffer and incubated the plates at 37°C for 1 h followed by 65°C for 20 min to inactivate the enzymes. After digestion, we dispensed 150 nL of NlaIII adapters and 150 nL of MseI adapters (each at 33 nM and prepared as previously described 21 ) into each well, followed by 700 nL of a ligation mix containing 200 nL of T4 rapid DNA ligase (Thermo Fisher, cat. no. K1423) or 150 nL of T4 standard DNA ligase (Thermo Fisher Scientific, cat. no. EL0011), 300 nL of 5 T4 ligase buffer (Thermo Fisher Scientific, cat. no. K1423) or 150 nL of 10 T4 ligase buffer (Thermo Fisher Scientific, cat. no. EL0011), 120 nL of ATP 10 mM, 30 nL of BSA 50 mg/mL, and 50 nL of Nuclease-Free Water when using rapid ligase or 250 nL when using standard ligase. We incubated the plates at 22°C for 30 min when using rapid ligase or 1 h for standard ligase, followed by inactivation at 70°C for 5 min, after which we manually dispensed 5 μL of Nuclease-Free Water/33 nM EDTA (for a final concentration of 25 nM) into each well. We then pooled the contents of multiple wells in the same plate manually or by centrifuging the plate upside down at 117 g for 1 min, which forces the contents into a collection plate placed at the bottom, and transferred the solution to a 1.5-mL tube. Last, we purified the pooled samples with a 1.2 vol/vol ratio of Ampure XP beads and eluted each pool in 10 μL of Nuclease-Free Water. We prepared sequencing libraries in the same way as described above for COVseq in single tubes. The number of samples per library depends on the number of differently barcoded adapters available. We have designed 384 different NlaIII and MseI adapters (see Supplementary Data 2), allowing a maximum of 384 samples to be pooled into the same library. However, higher multiplexing could be easily achieved by using a larger number of sequence barcodes when designing COVseq adapters.
Preparation of SARS-CoV-2 sequencing libraries using a standard approach. To validate COVseq, we generated individual libraries from 29 left-over samples (OAS-29 samples in Supplementary Data 4) using the NEBNext Ultra II FS DNA Library Prep Kit (NEB, cat. no. E7805L) following the manufacturer's instructions. Briefly, we used 250 ng of the pooled purified amplicons from each sample as input to prepare a library. First, we enzymatically fragmented the amplicons for 7 min at 37°C followed by incubation for 30 min at 65°C to achieve a target size around 200 bp. After fragmentation, we performed end-repair and adapter ligation in the same tube followed by purification of the fragments using a 0.8 vol/vol ratio of Ampure XP beads, following the manufacturer's instructions. We amplified adapter-ligated DNA fragments by three PCR cycles with barcoded primers (NEB, cat. no. E7500S) following the manufacturer's instructions and purified the PCR product with a 0.9 vol/vol ratio of Ampure XP beads. We assessed the size distribution and concentration of the libraries on a Bioanalyzer 2100 (Agilent Technologies, cat. no. G2943CA) using the High Sensitivity DNA kit (Agilent Technologies, cat. no. 5067-4626).
In silico coverage prediction. We extracted all the cut sites from the SARS-CoV-2 reference genome (NC_045512.2) using a custom Python script. Following this, we predicted the COVseq breadth of coverage by extending known cut site locations in the SARS-CoV-2 genome by the effective read length (theoretical sequence read length minus 20 bp of adapter sequence) on both sides. Sequencing data pre-processing and variant calling. We demultiplexed raw sequence reads to fastq files based on index sequences using the BaseSpace Sequence Hub cloud service of Illumina. We then further demultiplexed individual libraries to fastq files for each sample using a custom Python script. Following this, we processed the samples using a Nextflow 24 (version 20.10.0) based analysis pipeline from nf-core 25 called viralrecon 26 (version 1.1.0). In short, we trimmed the adapters from the fastq reads using fastp 27 (version 0.20.1) and aligned them to the SARS-CoV-2 reference genome (NC_045512.2) using bowtie 2 28 (version 3.5.1). Following this, we sorted and indexed the reads using samtools 29 (version 1.9), we trimmed amplicon primer sequences using ivar 30 (version 1.2.2), called variants, and generated the subsequent consensus sequence also using ivar. To determine the percentage of reads that mapped to different organisms and common contaminants, we used FastQ-screen 31 (version 0.14.1). Briefly, 100,000 reads were sampled from the fastq files and aligned to 14 reference sequences using bowtie 2 28 (version 3.5.1) (see Supplementary Table 2). We performed all subsequent analyses using custom R scripts.
Phylogenetic analyses. We downloaded sequences and sequence metadata from GISAID 2 (https://www.gisaid.org/ 2021-03-31) and added all the COVseq libraries and relevant metadata from the OAS-29, CCI-55, and OAS-95 samples. We then used the ncov tool (https://github.com/nextstrain/ncov) built on nextstrain 23 to generate a temporal and spatial phylogenetic tree. In addition, we randomly sampled 909 samples from around the world available in GISAID. We analyzed and visualized the resulting newick tree in R using ggtree 32 (version 2.2.4).
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The BAM files used to generate all the plots in the main Figures and Supplementary  Figures have been deposited to the European Nucleotide Archive (ENA) and are available at the following link: https://www.ebi.ac.uk/ena/browser/view/PRJEB42601. All reference sequences used in this study are listed in Supplementary Table 2. All the GISAID data used in this study are described in Supplementary Data 7 and are available at https:// www.gisaid.org.