Main

The coronavirus disease 2019 (COVID-19) pandemic has again demonstrated the crucial role of genomic sequencing in combatting infectious disease outbreaks globally. Monitoring the emergence of pathogens and the spread of variants of concern has become commonplace in government, academic and private laboratories1,2. Genomics data provide insights into the diversity, evolution and transmission of a virus, and serve as a critical guide for public health interventions ranging from contact tracing and identifying cases of reinfection to understanding the basis of resistance to clinical interventions3,4,5,6. Laboratories are now performing viral genomic sequencing of SARS-CoV-2 at an unprecedented scale7,8, highlighting the pressing need to ensure the integrity of genomes.

Multiplexed amplicon-based genome sequencing methods have accelerated the massive scale of SARS-CoV-2 genomic surveillance due to their improved sensitivity, cost and speed over other, lower-amplification RNA sequencing approaches, such as unbiased metagenomic sequencing9. However, the risk of contamination during the amplification stage is especially high as the 35+ cycles of virus-specific PCR produce trillions of SARS-CoV-2 amplicons in a single reaction. Other high-risk modes of contamination, including sample swaps, cross-contamination of samples, or aerosolization, can occur throughout the sample processing pipeline. With many laboratories performing viral sequencing by processing multiple large batches in parallel, the potential for contamination increases10. Even small amounts of sample mixing or contamination of amplicons could potentially confound studies where detection is sensitive to tens of molecules10,11. Moreover, as SARS-CoV-2 has relatively low genetic diversity and often spreads in local outbreaks or clusters11,12, genomes are often identical at the consensus level11,15,16,17, a pattern that could also occur with contamination. The risk of contamination and the challenges in detecting it can confound a wide array of genomic analyses, including estimates of the frequencies of variants, lineage dynamics and transmission events. Additionally, methods to address the critical risk of sample processing errors in clinical sequencing could enable its use more widely in clinical decision-making.

While inclusion of internal standards is commonplace in many experimental approaches13,14,15 and some technical assay controls exist for DNA sequencing16,17,18, the use of internal controls is currently rare in amplicon-based genomic surveillance. Here we developed and extensively tested a sample identification method using 96 synthetic DNA spike-ins (SDSIs) for amplicon-based sequencing approaches. Using the widely used open-access ARTIC tiled primer design (https://artic.network/), we implemented these SDSIs for SARS-CoV-2 genomic sequencing in thousands of residual diagnostic (clinical) samples. The resulting user-friendly and highly versatile SDSI + AmpSeq protocol can be easily implemented to improve the quality of genomic data generated for epidemiological and clinical investigations of human pathogens (Fig. 1 and Supplementary Data Table 1).

Fig. 1: SDSI + AmpSeq protocol.

Illustrative workflow for 96 samples through the SDSI + AmpSeq amplicon-sequencing pipeline. A unique synthetic DNA spike-in (SDSI) is added to each cDNA sample to allow for contamination tracking and accurate sample identification in analysis. Red asterisks indicate steps added to the standard ARTIC pipeline.

Results

Design and characterization of SDSIs for amplicon sequencing

We designed a simple, flexible system for sample tracking and contamination tracing using a uniquely identifiable core DNA sequence flanked by constant priming regions satisfying several design requirements. This design allows in-sample tracking through the addition of a different SDSI to each sample during sample processing. Following sequencing, the data can be analysed for both the presence of the expected or any other SDSI, illuminating sample misassignment or contamination with high resolution and accuracy (Fig. 1). Our design focused on highly stable DNA oligos that could be added to sample complementary DNA (cDNA) to capture contamination at or after the critical viral amplification step, including contamination generated during amplification and in handling amplified material. In contrast to a short barcode system, SDSIs are compatible with both tagmentation and ligation-based sequencing approaches18. The constant priming regions mean that only a single primer pair needs to be added into the existing multiplexed PCR step to co-amplify all SDSIs with the primary reaction target(s) (Fig. 2a).

Fig. 2: Synthetic DNA oligos spiked into amp-seq reactions designed to flag contamination and sample swaps.

a, Schematic of SDSI design. Each oligo contains 140 bp of unique sequence flanked by common primer binding sites. Primers designed to amplify all SDSIs are added to ARTIC primer pools, and a unique SDSI is added to each clinical sample. Identification of multiple SDSIs in the same sample indicates contamination. b, Percentage of SDSI reads mapping to each of the 96 SDSIs (horizontal axis), quantified for each of the 96 SDSIs (vertical axis). Any off-diagonal signal would indicate non-specific identification of SDSIs.

The core portions of the SDSIs comprise 96 distinct DNA sequences from the genomes of diverse uncommon Archaea, precluding false detection and cross-identification (Supplementary Data File 1 and Methods). By using extremophilic Archaea, our designs maximized evolutionary distance from common human pathogens. To avoid false positive results, the core SDSI sequences should be sufficiently distinct from one another, as well as from sequences commonly found in laboratories and clinical samples. A permissive BLASTn search performed against the entire National Center for Biotechnology Information (NCBI) database confirmed that the SDSI core sequences had limited homology outside the domain Archaea, specifically to genera unlikely to be found in laboratories (Extended Data Fig. 1a). While this limited homology outside of the domain Archaea maximized the potential for broad applications, we also confirmed that none of the core sequences shared homology (defined as >90% identity over 50 base pairs (bps)) with Homo sapiens or known viral genomes (Methods). Similarly, we confirmed that all SDSIs were substantially different from each other to prevent misidentification (Extended Data Fig. 1b). Sequencing of the SDSIs confirmed that each of the 96 constructs resulted in a robust and specific signal of mapped reads (Fig. 2b).

We selected a pair of primers and priming regions that are highly specific and show reliable amplification across SDSIs and under standard PCR conditions. Using Primer-BLAST, we predicted that these sequences had limited homology to common organisms and thus were unlikely to amplify off-target sequences. Experimentally we confirmed that the SDSI primers did not produce any non-specific amplification, including in the presence of cDNA from a nasopharyngeal (NP) swab sample (Extended Data Fig. 2a). The primer pair also had a common length (24 bps), guanine-cytosine (GC) content (45.8%) and melting temperature (62 °C and 63 °C, respectively, in our SDSI + AmpSeq protocol), ensuring their compatibility with many multiplexed PCR reactions, including the most widely used SARS-CoV-2 amplicon sequencing strategy (https://artic.network/) (Extended Data Fig. 2b). Each SDSI was identically sized, shared a priming region and had similar GC content (range: 33–65%) to promote similar amplification rates across different SDSIs and to viral amplicons (for example, the SARS-CoV-2 genome is 37 ± 5% GC)19 (Extended Data Fig. 2c). All SDSIs amplified in an ARTIC SARS-CoV-2 PCR reaction with SDSI primers included, in each case yielding a single clean product of the expected size (Extended Data Fig. 2d). Furthermore, we observed that GC content did not significantly bias the number of SDSI reads detected in clinical samples (Extended Data Fig. 2e).
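The length and GC-content comparison described above is simple to reproduce. The following is a minimal sketch, not the authors' Geneious-based workflow; the function name and the example sequence are hypothetical (the sequence is a made-up 24-nt string, not an actual SDSI primer). It illustrates that a GC content of 45.8% for a 24-nt primer corresponds to 11 G or C bases.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Hypothetical 24-nt sequence with 11 G/C bases (NOT a real SDSI primer).
example_primer = "ATGCATGCATGCATGCATGCATGA"

if __name__ == "__main__":
    print(len(example_primer), round(gc_content(example_primer), 1))
```

The same function can be applied to full amplicon sequences to compare their GC distribution with that of the primary amplicon panel.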

Validation of an SDSI + AmpSeq SARS-CoV-2 sequencing approach

The addition of SDSIs into the ARTIC PCR did not detrimentally alter SARS-CoV-2 cDNA amplification from clinical samples. To prevent SDSIs from overtaking the amplification and sequencing of SARS-CoV-2 amplicons, we titrated the amount of SDSI added to the reaction. We found that the highest concentration tested, 600 copies per μl, resulted in reliable SDSI detection with >96% of reads mapping to SARS-CoV-2 and no apparent alteration in coverage across the genome (Extended Data Fig. 3a,b). We assessed the specificity of the 96 SDSIs in clinical samples to confirm that there was no unpredicted cross-mapping, misidentification or large differences in amplification rate (Fig. 3a). We processed 14 samples spanning a range of viral cycle threshold (CT) values (CT range: 25–33), with both the standard ARTIC and SDSI + AmpSeq methods. For each amplicon, across all samples, there was no significant difference in coverage between the ARTIC and SDSI + AmpSeq conditions (Fig. 3b). Even in samples with low viral loads (CT > 30), we found that there were no significant differences in amplicon coverage (Extended Data Fig. 4a). Additionally, within the 14 samples processed +/− SDSI, we see a 100% genome concordance rate, illustrating that the addition of the SDSIs does not impact the accuracy of assembled genomes.

Fig. 3: SDSI + AmpSeq amplicon coverage and genome concordance.

a, Percent of SDSI for SDSI 1–96 in patient samples. b, log of the mean amplicon coverage for the same clinical samples run with and without an SDSI (n = 14). A unique SDSI was used in each sample. The solid blue line represents SDSI + AmpSeq and the solid black line is ARTIC only with no SDSI. Blue and black shading around the solid lines represents the 95% confidence interval. There were no statistical differences (P > 0.05) in the mean amplicon coverage for each amplicon between the groups (two-tailed Mann Whitney t-test and multiple comparison two-stage step-up Benjamini, Krieger and Yekutieli test with FDR set to 5%). c, SNV concordance plot between SDSI + AmpSeq and unbiased consensus sequences. Two discordant SNVs, outlined in a red box, were found. Grey dots represent SNVs found in both the unbiased and SDSI + AmpSeq method, whereas black dots indicate that the SNV was only present in unbiased sequencing.

As extensive PCR can result in the propagation of numerous types of errors, such as DNA polymerase base substitution errors, PCR recombination events and thermocycling induced DNA damage, we further compared SARS-CoV-2 genome concordance between SDSI + AmpSeq and unbiased, metagenomic sequencing on a batch of 89 unique patient samples9,10,20,21. The samples consisted of diverse viral lineages and loads (CT range: 11.9–37.4), with the more sensitive amplicon sequencing method generating more complete genomes at higher CTs (Extended Data Fig. 5a–d). We assessed the coverage uniformity, as increasing uniformity reduces the sequencing depth required to generate reliable genomes22. We found that unbiased sequencing had more uniform coverage up to a CT of 25 (n = 31, Gini coefficient = 0.240 ± 0.046 (unbiased) vs 0.428 ± 0.026 (SDSI + AmpSeq)), while SDSI + AmpSeq generated more uniform coverage for samples above a CT of 25 (n = 39, Gini coefficient = 0.766 ± 0.265 (unbiased) vs 0.554 ± 0.124 (SDSI + AmpSeq)) (Extended Data Fig. 5e). For the 37 samples that assembled a full genome in both methods, only two out of 332 total single nucleotide variants (SNVs) were divergently identified by SDSI + AmpSeq (Fig. 3c). Each SNV was observed in only one sample and both fell within an ARTIC primer region, despite primer trimming during analysis; for one SNV, this suggests that PCR error from the ARTIC protocol contributed to the discrepancy23, whereas manual inspection of the other (C9565T) indicated the presence of intra-host variation in both methods, with a variant allele frequency of 39.4% (SDSI + AmpSeq) and 59.2% (unbiased sequencing). Overall, the discordant SNV calling rate was 0.6% (2 of 332), consistent with observed circulating diversity and sequencing-based errors. Consistent with previous reports from other groups, ARTIC amplicon sequencing maintains a high level of concordance at the consensus genome level10, even with the addition of SDSIs.
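For readers unfamiliar with the Gini coefficient as a uniformity metric, the sketch below computes it from a list of coverage depths using the standard rank-based formula. This is an illustration only: this section does not state whether the authors computed the coefficient per base or per amplicon, and the function name and input values are hypothetical. A value of 0 indicates perfectly even coverage; values approaching 1 indicate coverage concentrated in a few positions.

```python
def gini(depths):
    """Gini coefficient of non-negative coverage depths (0 = perfectly uniform)."""
    values = sorted(float(d) for d in depths)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero depth")
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    print(gini([10, 10, 10, 10]))  # even coverage
    print(gini([0, 0, 0, 100]))    # all coverage at one position
```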

We explored other technical modifications to the ARTIC amplicon sequencing protocol to improve genome recovery, limit contamination points and enhance reproducibility of our SDSI approach. Foremost, increasing cDNA length by using more processive reverse transcriptases improved amplicon coverage (Extended Data Fig. 6a,b). Amplification of ARTIC amplicons and SDSIs by Q5 Hot-Start High-Fidelity 2x Master Mix resulted in higher amplification (Extended Data Fig. 6c and Supplementary Data Table 2). We found that increasing (2×) primer concentrations (20.8 nM final concentration) for poorly performing amplicons increased coverage in these amplicons, even enabling whole genome recovery for multiple samples, supporting the observation that primer rebalancing can enable greater coverage24,25 (Extended Data Figs. 6d and 7 and Supplementary Data Table 3). We explored the effects of different numbers of PCR cycles, DNA-hybridization steps, and temperature ramp speeds. Both DNA-hybridization steps and temperature ramping provided no substantial changes in amplification (Extended Data Fig. 6e,f). Additional PCR cycles can increase genome coverage uniformity of low-viral-load samples, although it may lead to increased erroneous SNV calls23 (Extended Data Fig. 6g). Using a standardized cDNA input, we found that the DNA Flex library workflow resulted in increased genome coverage depth across all CT values tested, compared to Nextera XT (Extended Data Fig. 6h). To further mitigate the risk of contamination from highly amplified libraries, we took advantage of the self-normalizing feature of the DNA Flex kit and found that limiting the tagmentation beads by scaling down (0.5×) all components of the DNA Flex library construction reagents restricted library over-amplification. Notably, this did not impact final library size distributions or SDSI amplification, while generating final sequencing libraries at half their original concentrations and halving the library construction cost per sample (Methods and Supplementary Data Table 4). We summarized the results of our optimizations (https://benchling.com/s/prt-R95g0tCxKOeCAqn8lAk3) and show that SDSIs can be easily integrated with numerous protocol alterations.

Implementation of SDSIs to clinical samples at scale

The SDSI + AmpSeq method is compatible with a range of viral CTs, SARS-CoV-2 lineages, origins of the patient sample, and laboratories in which the pipeline is implemented, demonstrating that this is a robust and flexible approach that can be readily implemented for surveillance. A half plate of SDSIs was used at two large-scale sequencing facilities, the Broad Institute and Jackson Laboratories (JAX), for SDSI + AmpSeq SARS-CoV-2 surveillance across a total of 6,676 clinical samples and controls (JAX: n = 3,773; Broad: n = 2,903). Individual batches typically consisted of 92 clinical samples with 4 designated water controls. Clinical samples were largely from Maine, Massachusetts, and Rhode Island from December 2020 to July 2021 and covered a wide range of viral CT values (8.4–39.9) and pango lineages (77 total lineages) (Fig. 4a). The SDSI + AmpSeq method worked robustly despite minor implementation differences in protocols between the two laboratories, including different cDNA synthesis enzymes, CT normalization, and library construction approaches (0.5X Illumina DNA Flex vs Illumina COVID-Seq) (Methods).

Fig. 4: SDSI + AmpSeq performs well across thousands of samples.

a, Sample diversity from two different institutions representing a range of CTs, viral lineages, and states of sample collection from samples where the data was available. b, Percentage of SDSI reads out of the sum of all SDSI reads that map to the correct spike-in. Left: JAX, n = 3,773 biologically independent samples. Right: Broad, n = 2,903 biologically independent samples. Data are presented as mean ± s.e.m. Individual data points are displayed when n ≤ 10. c, Percentage of SDSI reads over the total of all sequenced reads for all SARS-CoV-2-positive samples. Left: JAX, n = 3,045 biologically independent samples. Right: Broad, n = 2,670 biologically independent samples. Data are presented as mean ± s.e.m. Individual data points are displayed when n ≤ 10. The dashed line represents 10% SDSI reads.


SDSI + AmpSeq is a tractable and easily implemented method for genome quality control when applied to high-throughput processing of clinical samples. Across thousands of clinical samples, the SDSIs performed consistently and reliably (Fig. 4b,c). The mean percentage of SDSI reads that mapped to the expected SDSI was above 95% for all SDSIs in both laboratories (Fig. 4b). This demonstrated that across a large set of highly variable clinical samples, there were no systematic issues of misidentification for specific SDSIs. Additionally, across 5,715 samples from both institutions, the percentage of all SDSI reads in SARS-CoV-2-positive samples averaged 3.72% (90% of samples fell between 0.0003% and 9.9887%) (Fig. 4c). Importantly, this low but consistent percentage of reads mapping to SDSIs allows for their implementation without necessitating increased sequencing depth. Notably, the SDSIs performed well with and without previous normalization of cDNA based on CT; however, normalizing increased the percentage of SDSI reads (Extended Data Fig. 4, Fig. 4b(left) and Methods). Normalization of viral CT may provide an additional level of quality control that is especially important for labs with limited sequencing capacities.

Confident genome sequencing and analysis with SDSI + AmpSeq

SDSIs enable detection of sample swaps and contamination events that occur in large-scale batch processing, which may otherwise go undetected. To demonstrate that SDSI + AmpSeq can accurately detect contamination, we mixed two SDSIs at various ratios before the ARTIC PCR and found that SDSI ratios were reflected in the sequencing output (Fig. 5a). Across 6,676 SARS-CoV-2 samples processed in large batches, SDSIs allowed for identification of multiple key modes of error (Fig. 5b). As plotted, a plate without contaminating events or sample swaps should display a simple diagonal pattern with 1:1 matching of expected and observed SDSIs. In some cases, off-diagonal events occur in clear patterns, enabling speculation on the nature of the contamination and demonstrating the utility of SDSIs as an internal control and in-sample label. We observed cases where a plate was accidentally inverted when the post-PCR reaction pool 1 was mixed with pool 2 (Fig. 5b). SDSI + AmpSeq detects entire flawed batches that may not have been flagged with standard controls (as in the case with the plate inversion where water controls in plate corners would not have been affected). In another example, SDSIs were detected unexpectedly throughout a batch, indicating that an SDSI (and possibly SARS-CoV-2) contaminated a common reagent.
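The expected-versus-observed comparison described above can be sketched as a simple per-well check: for each well, find the SDSI with the most reads and compare it with the SDSI that was assigned to that well. This is a minimal illustration rather than the authors' analysis pipeline; the function names, well labels, SDSI names and read counts are all hypothetical.

```python
def dominant_sdsi(counts):
    """Return the SDSI with the most reads, or None if there are no SDSI reads."""
    if not counts or sum(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


def find_mismatches(expected, observed):
    """Map well -> (expected SDSI, dominant observed SDSI) for wells that disagree."""
    mismatches = {}
    for well, exp in expected.items():
        dom = dominant_sdsi(observed.get(well, {}))
        if dom != exp:
            mismatches[well] = (exp, dom)
    return mismatches


# Hypothetical plate fragment in which wells A2 and A3 were swapped.
expected = {"A1": "SDSI_1", "A2": "SDSI_2", "A3": "SDSI_3"}
observed = {
    "A1": {"SDSI_1": 980, "SDSI_2": 5},
    "A2": {"SDSI_3": 1000},
    "A3": {"SDSI_2": 990},
}

if __name__ == "__main__":
    print(find_mismatches(expected, observed))
```

A plate with no swaps yields an empty result, which corresponds to the clean diagonal pattern; a plate inversion would appear as a mismatch in most wells, with a regular relationship between expected and observed positions.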

Fig. 5: SDSI + AmpSeq is used to identify sample swaps and contamination.

a, Intentional SDSI contamination experiment (run in duplicate) assessing whether different ratios of contamination between SDSI 87 and SDSI 94 (SDSI 87:SDSI 94) were detectable with the SDSI + AmpSeq method. b, Examples of experimental errors that were caught using the SDSI + AmpSeq method. c, Top: distance matrix showing pairwise differences between the 17 complete genomes assembled from this sample set. Putative cluster samples are in bold. Bottom: spike-in counts for each of the 24 samples and water controls in this sequencing batch.

SDSI + AmpSeq also enables fine-resolution insight into sample processing errors with high specificity. In one example, SDSI counts indicated that columns were unintentionally mixed together (Fig. 5b). Here, in-sample labelling allowed researchers to confidently move forward with analyses on unaffected samples. In other cases, samples were associated with both the expected SDSI and SDSIs from neighbouring samples, indicating potential spillover or pipetting errors. Again, genomes generated from samples with suspicious SDSI profiles can be investigated further, and removed from analyses and/or reprocessed. We recommend manual curation of genomes assembled from any samples with <95% of SDSI reads mapping to the expected SDSI. This level of impurity is probably attributable to sample processing contamination, given minimal baseline crosstalk from sources such as indexing primer or oligo synthesis observed (Methods and Extended Data Fig. 8). Moreover, these patterns of contamination events identified via use of SDSI + AmpSeq illuminated key sources of error in processing pipelines and provided an opportunity to improve processing fidelity in subsequent batches.
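The 95% recommendation above translates directly into a per-sample flag. The sketch below is one possible implementation under stated assumptions: the function names and read counts are hypothetical, and the choice to flag samples with no SDSI reads at all for review is our assumption rather than something specified in the text.

```python
def expected_fraction(counts, expected):
    """Fraction of SDSI reads mapping to the expected SDSI, or None if no SDSI reads."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts.get(expected, 0) / total


def needs_manual_curation(counts, expected, threshold=0.95):
    """Flag a sample if fewer than `threshold` of its SDSI reads map to the expected SDSI.

    Samples with no SDSI reads are also flagged (an assumption made for this sketch).
    """
    frac = expected_fraction(counts, expected)
    return frac is None or frac < threshold


if __name__ == "__main__":
    # Hypothetical counts: 82% expected SDSI, 18% from a neighbouring sample's SDSI.
    print(needs_manual_curation({"SDSI_A": 820, "SDSI_B": 180}, "SDSI_A"))
    print(needs_manual_curation({"SDSI_A": 990, "SDSI_B": 10}, "SDSI_A"))
```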

To demonstrate the ability of SDSIs to enable confident interpretation of sequencing data in an outbreak investigation, we used SDSI + AmpSeq to investigate a putative SARS-CoV-2 cluster from Massachusetts General Hospital (MGH). The Infection Control Unit suspected nosocomial transmission, yet sample swaps and contamination could easily undermine findings. We sequenced 24 samples with SDSI + AmpSeq (including 14 suspected cluster-related samples, 8 unlinked contextual controls, and 2 water controls). Out of 11 assembled genomes that were suspected to be part of the cluster, 10 were genetically highly similar (0–1 consensus nucleotide difference) (Fig. 5c) and distinct from other samples from Massachusetts around the same time (Extended Data Fig. 9), strongly suggesting that this cluster arose from nosocomial transmission. Analysis of the SDSIs confirmed that genome sequence similarity among cluster-associated samples was not the result of cross-contamination (Fig. 5c). Indeed, 23/24 libraries contained >95% SDSI-mapped reads corresponding to the expected SDSI. One sample that was not part of the cluster (MA_MGH_02845) showed 18% of reads from a second SDSI, which was added to a different sample in the batch (MA_MGH_02839). We resequenced both samples implicated in the contamination event. We confirmed that the two genome sequences for MA_MGH_02845 were 100% concordant, and no genome was assembled for MA_MGH_02839 in either attempt, probably due to its very low viral load (CT = 37). This example illustrates how SDSIs can be used to isolate and validate only those samples implicated in contamination events and altogether increase confidence in cluster investigations.
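The "0–1 consensus nucleotide difference" comparison can be illustrated by counting mismatches between two aligned consensus sequences. This is a hedged sketch: it assumes the sequences are already aligned to the same length and ignores positions where either sequence has an ambiguous base such as N, which may differ from how the authors built their distance matrix. The function name and toy sequences are hypothetical.

```python
def consensus_differences(seq_a: str, seq_b: str) -> int:
    """Count unambiguous nucleotide differences between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in unambiguous and b in unambiguous and a != b
    )


if __name__ == "__main__":
    # Toy sequences: one real difference (G vs C); the N position is ignored.
    print(consensus_differences("ACGTN", "ACCTA"))
```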

To further increase our ability to comprehensively identify sources of error in AmpSeq methods for viral genomics, we sought to capture contamination and sample swaps that might occur before cDNA synthesis. We explored the feasibility of modifying the SDSI approach to use synthetic RNA spike-ins (SRSIs) from the same constructs, which could be added to clinical sample RNA. For a subset of SDSIs, we included a T7 promoter site to produce RNAs in vitro. For two clinical samples representing low (20) and mid (26) CTs, we detected reads from the SRSIs added directly to extracted viral RNA as a proof-of-principle (Extended Data Fig. 10). Notably, this approach did not require any additional protocol modifications, and we therefore expect it to be a highly versatile and user-friendly method when deployed at scale for complete end-to-end sample tracking.

Discussion

Amplicon-based sequencing methods enable rapid, full genome recovery for emerging SARS-CoV-2 variant surveillance; however, robust tools are needed to ensure accuracy in genomic data. SDSI + AmpSeq is an invaluable technique for detecting key modes of contamination, addressing this critical gap in standard controls and practices. SDSIs do not compromise genome quality, have been successfully deployed in thousands of clinical samples, and are in use across multiple laboratories with differing protocols. These SDSIs revealed numerous instances of sample swaps and contamination, many of which would go unnoticed with standard batch-level controls. SDSIs further provide critical confidence in the interpretation of clusters of identical genomes, a renewed challenge in the surveillance of more transmissible variants. The common primer design enables the applicability of SDSIs to multiple short amplicon designs and sequencing strategies, adding only minor changes to existing protocols and minimal additional cost.

SDSIs overcome multiple modes of error in the production of amplicon-based genomic sequencing data, and are a critical component of quality control measures. The approach is most effective when adopted fully within a laboratory setting and thus we propose routine use of the SDSI + AmpSeq method to flag laboratory-wide contamination. We provide an extensively tested protocol with ARTIC v3 and Illumina-based tagmentation. It can also be applied to other sequencing pipelines, although this potentially requires further optimization. Our pathogen-exclusion design criteria allow our 96 validated SDSIs to be immediately incorporated into other tiled amplicon panels, such as existing ones for Zika, Ebola and other viruses of epidemic potential26,27.

The SDSI-labelling paradigm is broadly applicable to many amplicon-based needs: it is amenable to various technical enhancements, flexible to remaining error modes and expandable to additional targets. One could apply the same design parameters to expand the set of SDSIs, such as to 384-well formats. To design larger panels of SDSIs, one could use artificial core sequences, rather than excerpting from Archaea. Primer sites could also be easily adapted for integration with new advancements in amplicon sequencing, such as tailed primer approaches or new primer schemes28,29,30,31,32. In its current implementation, our SDSIs detect contamination or workflow errors that occur during and after amplification, but not issues arising at the RNA or cDNA generation stage. While we expect DNA spike-ins would be more robust to variation in samples and protocols and are applicable to both RNA and DNA viruses, further refinement of the RNA spike-in approach could address other modes of contamination, enabling end-to-end sample tracking for RNA viruses at scale. Specifically, SRSIs could improve the tracking of high-viral-load samples that are more likely to have well-to-well contamination than low-viral-load samples. Future work improving quantification and SDSI analysis pipelines may enable them to serve as within-sample controls, since samples or batches with outlier SDSI read counts may reveal experimental error.

The integration of SDSIs can mitigate a critical vulnerability of amplicon-based sequencing while preserving its many advantages, increasing the robustness of its use across laboratory and clinical settings. Adoption of controls across the viral surveillance community would increase accuracy and integrity of genomic data worldwide. Looking forward, SDSIs could serve as a crucial component in improving data integrity in amplicon-based genomic sequencing beyond infectious disease surveillance, such as for food safety, species identification and environmental sampling.

Methods

We have provided our SDSI implementation protocol on Benchling (https://benchling.com/s/prt-R95g0tCxKOeCAqn8lAk3) in addition to the methods detailed below.

Sample collection and study design

Research was conducted at the Broad Institute with an exempt determination from the Broad Office of Research Subjects Protections and with approval from the MIT Institutional Review Board under protocol #1612793224. Samples were obtained from MGH, Massachusetts Department of Public Health, the Rhode Island Department of Public Health and the Broad Institute Clinical Research Sequencing Platform. Samples from MGH fall under Partners Institutional Review Board under protocol #2019P003305. Samples were secondary-use or residual clinical and diagnostic specimens (referred to collectively throughout as clinical samples), obtained by researchers under a waiver of consent. All samples were nasopharyngeal or anterior nares swabs in a stabilizing medium (for example, molecular transport media (MTM), viral transport media (VTM) or universal transport media (UTM)). These unique biological materials are not available to other researchers as they are human patient samples from clinical excess material and are thus of limited volume. Samples sequenced at JAX were approved under protocol 2020-NHSR-019-BH.

SDSI design and in silico validation

We designed synthetic DNA fragments that each contained a 140 bp unique sequence and constant priming regions. Core SDSI sequence homology to sequences from various organisms was predicted by a permissive BLAST search (blastn; 5,000 maximum targets; E = 10; word size = 11; no mask for low complexity). We considered homologies identified with the BLASTn search described above that were additionally >50 bps (>35% query cover) and >90% sequence identity to be above our homology threshold. For all 96 selected SDSIs, there were no such homologies when results were filtered to all Homo sapiens (taxid:9606) or viral (taxid:10239) sequences in the NCBI database. For homologies above the threshold to bacterial or eukaryotic sequences in the NCBI database (excluding Archaea: taxid:2157), we report both the SDSI and the genus it mapped to in each case (Extended Data Fig. 1a). Using the same BLASTn (BLASTN 2.12.0+) parameters, we also mapped SDSIs against a custom database including SDSI core sequences, and found no homologies above our threshold between SDSIs. As there were no homologies above our threshold between SDSIs and human, virus or other SDSI sequences, we noted the maximum alignment scores for any homology below our threshold identified, and reported these scores (Extended Data Fig. 1b).
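The homology threshold defined above (>50 bp and >90% identity) can be applied programmatically to BLAST results. The sketch below assumes the hits were exported in BLAST+ default tabular format (outfmt 6), where the third column is percent identity and the fourth is alignment length; whether the authors used this output format is not stated, and using alignment length as the measure of ">50 bps" is our interpretation. The function name and example hit lines are hypothetical.

```python
def above_homology_threshold(line: str, min_len: int = 50, min_identity: float = 90.0) -> bool:
    """Return True if a BLAST outfmt-6 hit exceeds both the length and identity cut-offs."""
    fields = line.rstrip("\n").split("\t")
    pident = float(fields[2])  # column 3: percent identity
    length = int(fields[3])    # column 4: alignment length
    return length > min_len and pident > min_identity


# Hypothetical hits: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore.
hits = [
    "SDSI_1\tsubjA\t95.0\t60\t3\t0\t1\t60\t100\t159\t1e-20\t95.0",
    "SDSI_1\tsubjB\t88.0\t120\t14\t0\t1\t120\t5\t124\t1e-30\t130.0",
    "SDSI_2\tsubjC\t99.0\t40\t0\t0\t1\t40\t10\t49\t1e-10\t75.0",
]

if __name__ == "__main__":
    for hit in hits:
        print(hit.split("\t")[:2], above_homology_threshold(hit))
```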

We confirmed that SDSI primers and amplicons were predicted to amplify specifically and consistently with ARTIC v3 amplicons. We used Primer-BLAST to predict 50–5,000 bp amplicons produced on templates in the entire non-redundant nucleotide database; no amplicons were identified. We calculated the length and GC content of SDSI primers and full SDSI amplicon sequences and ARTIC v3 primers and amplicons using Geneious Prime (2019.2.1) and compared their distributions (Extended Data Fig. 2b,c). ARTIC and SDSI primer melting temperatures were matched and calculated using the New England Biolabs online calculator (version 1.13.0; https://tmcalculator.neb.com).

SDSI experimental validation

We sought to validate in silico predictions for the performance of our SDSI primers and amplicons. We ordered primers (from IDT; oligo sequences in Supplementary Data File 1) and performed quantitative PCR (qPCR) using the Q5 Hot-Start 2x Master Mix, with 500 nM SDSI primers and 0.17X SYBR Gold (ThermoFisher S11494), and without ARTIC primer pools. We performed this assay in triplicate in 10 μl reactions on a QuantStudio 6 with the following cycling conditions: 95 °C for 30 s, followed by 35 cycles of 95 °C for 15 s and 65 °C for 5 min. We tested 4 conditions: (1) 0.5 μl of an SDSI gene block (IDT) (1 pM), (2) 0.5 μl of an SDSI gene block + 0.5 μl of cDNA from an NP swab, (3) 0.5 μl of cDNA from an NP swab, and (4) no template to detect any non-specific amplification of the primers (Extended Data Fig. 2a). We performed PCR on each SDSI oligo, using the standard SDSI + AmpSeq PCR conditions (https://benchling.com/s/prt-R95g0tCxKOeCAqn8lAk3), then ran the PCR products on a 2.2% agarose gel to confirm that these primers amplified the SDSIs and that the product was clean and of the expected size (Extended Data Fig. 2d).

We ordered unique oligos as TruGrade ultramers (IDT), then resuspended and stored them at 10 μM in water (oligo sequences in Supplementary Data File 1). Further characterization for identification of 96 SDSIs was achieved by direct PCR amplification with primers containing the constant SDSI handle and an Illumina P5/P7 adapter, followed by sequencing with a MiSeq Nano 2 × 150 bp kit (Illumina #MS-102-2002). SDSI reads were quantified by mapping each SDSI against other SDSIs with the ‘align_and_count_multiple_report’ workflow implemented in Terra, as described below, and purity and sequence fidelity of SDSIs were assessed by calculating the percentage of reads mapping to each SDSI out of total SDSI reads (Fig. 2b). Using these same data, we explored our SDSI mapping stringency threshold. We determined whether each SDSI was uniquely identified over a range of SDSI stringency thresholds (0.01%–50% of SDSI reads mapping, with a step size of 0.01%) (Extended Data Fig. 8). We tested 142 total unique SDSIs; all SDSIs amplified successfully with high sequence fidelity and purity (>95% of reads mapped to the expected SDSI in the experiment described above). The final set of 96 SDSIs was chosen after first-pass validation in a combination of clinical sample amplification tests, GC cut-offs and sequence homology cut-offs. SDSIs excluded because of poor amplification or impurity in clinical sample processing were not retested to determine whether error was technical or biological.
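
The purity calculation and the stringency sweep can be illustrated as follows. This is a hedged sketch under stated assumptions, not the Terra workflow itself: it assumes per-SDSI read counts are already available as a dictionary, and it treats an SDSI as "uniquely identified" when it is the only SDSI at or above the threshold. All function names are hypothetical. Thresholds are handled as integer hundredths of a percent (1 to 5,000) to avoid floating-point drift across the 0.01% steps.

```python
from typing import Dict, List, Optional


def percent_of_sdsi_reads(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of reads mapping to each SDSI out of all SDSI reads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def sdsis_passing(counts: Dict[str, int], threshold_hundredths: int) -> List[str]:
    """SDSIs whose share of SDSI reads is >= threshold (in 0.01% units)."""
    total = sum(counts.values())
    if total == 0:
        return []
    # n / total >= threshold_hundredths / 10000, rewritten with integers only
    return sorted(
        name for name, n in counts.items()
        if n * 10000 >= threshold_hundredths * total
    )


def lowest_unique_threshold(counts: Dict[str, int], expected: str) -> Optional[int]:
    """Lowest threshold (0.01% units, 0.01%-50%) at which only `expected` passes."""
    for t in range(1, 5001):
        if sdsis_passing(counts, t) == [expected]:
            return t
    return None
```

For example, with hypothetical counts of 9,700, 200 and 100 reads for three SDSIs, the expected SDSI has 97% purity and becomes the only SDSI passing once the threshold exceeds 2.00%.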

Viral CT determination

Viral CT values for all samples sequenced at the Broad Institute were obtained using the Centers for Disease Control and Prevention (CDC) reverse transcription qPCR (RT–qPCR) assay with the N1 probe as previously described21. Viral CTs for samples sequenced at JAX were obtained from various providers and the RT–qPCR assays used were thus variable.

CT normalization

CT normalization was performed by first setting a desired mock viral CT and calculating the difference between this desired mock viral CT and the measured viral CT of a given sample, rounding to the nearest whole number. We next calculated the number of doublings required for the mock viral CT (assuming 100% PCR efficiency), and multiplied this by the volume of cDNA input to be used for the normalization. The final volume of water used to dilute the cDNA was the doubling factor minus the volume of cDNA input. An example calculation is illustrated below:

CT normalization calculation:

N = Difference between the desired mock CT and the measured CT, rounded to the nearest whole number

X = Volume (μl) of cDNA to use for normalization

DF = Doubling factor = X × 2^N

Volume water for dilution (μl) = DF − X
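
As a worked illustration of this calculation, the sketch below computes N, DF and the water volume for one sample. It is not code from the study. The function name is hypothetical, and two behaviours are assumptions: samples whose measured CT is already at or above the mock CT are left undiluted, and ties at exactly .5 follow Python's built-in rounding.

```python
from typing import Tuple


def ct_normalization(measured_ct: float, mock_ct: float,
                     cdna_volume_ul: float) -> Tuple[int, float, float]:
    """Return (N, doubling factor DF in ul, water volume in ul).

    Assumes 100% PCR efficiency, so each CT unit is one two-fold dilution.
    """
    # N: CT difference rounded to the nearest whole number.
    # Assumption: never "concentrate" a sample, so N is floored at 0.
    n = max(0, round(mock_ct - measured_ct))
    doubling_factor = cdna_volume_ul * 2 ** n      # DF = X * 2^N (final volume)
    water_ul = doubling_factor - cdna_volume_ul    # water = DF - X
    return n, doubling_factor, water_ul


# Example: 1 ul of cDNA at CT 20.2 diluted to a mock CT of 25
n, df, water = ct_normalization(20.2, 25, 1.0)
print(n, df, water)  # 5 32.0 31.0
```

In this example the sample is five doublings away from the mock CT, so 1 μl of cDNA is brought to 32 μl with 31 μl of water.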

This CT normalization was done for certain method development samples which are described throughout the manuscript as being ‘mock diluted’ or ‘normalized to CT X’. The nosocomial cluster was normalized to CT 27. The majority of batch data generated at the Broad Institute underwent CT normalization to CT 25. Batch data from JAX did not undergo CT normalization.

cDNA generation and ARTIC amplification optimization

Reverse transcriptase

We tested reverse transcriptase enzymes using extracted RNA from four SARS-CoV-2-positive clinical samples (CTs = 13.9, 23.9, 29.6, 33.6) (Extended Data Fig. 6a,b). We added 2 μl purified DNase-treated RNA as input into SuperScript III (Thermo 18080093), SuperScript IV (Thermo 18091050) or SuperScript IV VILO (Thermo 11756500). SuperScript IV (SSIV) reactions were incubated at room temperature for 10 min, followed by 50 °C for 60 min and an inactivation step at 80 °C for 10 min. SuperScript IV VILO shared the same protocol, but with a temperature of 85 °C for the inactivation step. We input 2.5 μl of cDNA for ARTIC pool 1 PCR under standard conditions for 40 cycles. We then processed the resulting pool 1 amplicons using the scaled-down Illumina DNA Flex library construction (as described in Methods below) and sequenced them on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

ARTIC PCR enzyme

We tested PCR enzyme efficiency using extracted RNA from SARS-CoV-2-positive clinical samples, followed by cDNA generation using SuperScript IV, and diluted the resulting cDNA to a mock CT value of 35 for standardization across all PCR enzyme tests. We set up the standard ARTIC PCR pool 1 and pool 2 using an input of 2.5 μl, altering only the PCR enzyme and corresponding buffer. We tested NEB Q5 Hot-Start High-fidelity 2x Master Mix (Q5 2X MM) (NEB M0494L), NEB Q5 Hot-Start High-fidelity 2x Master Mix plus 0.01% SDS, NEB Q5 Ultra II Master Mix (NEB M0544L), KAPA HiFi HotStart (Roche KK2601), and KOD Hot-Start DNA polymerase (Sigma–Aldrich 71842) (Extended Data Fig. 6c). We quantified the resulting ARTIC PCR amplicons using a High-Sensitivity DNA Qubit kit, then input 25 ng from each pool (50 ng total) into scaled-down Illumina DNA Flex library construction. The resulting libraries (except Q5 plus 0.01% SDS, which had no visible product using the Tapestation D1000 High-Sensitivity Kit) were quantified, pooled and sequenced on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

Rehybridization PCR

We optimized PCR cycling conditions on mock CT 35 cDNA (generated as described above) using standard ARTIC PCR primer conditions. We performed a catch-up/rehybridization PCR under the following conditions: 98 °C for 30 s, 95 °C for 15 s then 65 °C for 5 min (10 cycles); 95 °C for 15 s, 80 °C for 30 s then 65 °C for 5 min (2 cycles); 95 °C for 15 s then 65 °C for 5 min (8 cycles); then 4 °C hold (Extended Data Fig. 6e). We quantified the resulting ARTIC PCR amplicons using a High-Sensitivity DNA Qubit kit, then input 25 ng from each pool (50 ng total) into scaled-down Illumina DNA Flex library construction. We then quantified these libraries, pooled them and sequenced them on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

Cycle test

We further optimized ARTIC PCR by modifying PCR cycle numbers. Extracted RNA from six SARS-CoV-2-positive clinical samples ranging from CT 27 to 37 was converted to cDNA with SuperScript IV and amplified under standard ARTIC PCR reaction components (with Q5 2X MM), varying the number of PCR cycles (35, 40 or 45) (Extended Data Fig. 6g). We quantified cDNA and used a standard 50 ng of input for scaled-down Illumina DNA Flex library construction, then quantified the resulting libraries, pooled them and sequenced them on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

Ramp test

We used mock CT 35 cDNA to test the effect of decreased ramp speed on genome recovery and coverage. ARTIC PCR conditions for this experiment were 98 °C for 30 s, followed by 40 cycles of 95 °C for 15 s and 65 °C for 5 min, with a cooling and heating ramping speed of 3 °C s−1. We tested a slow ramp PCR protocol with the ramp speed reduced to 1.5 °C s−1 (Extended Data Fig. 6f). Libraries were constructed with Illumina DNA Flex and sequenced on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

Primer concentration optimization

Under standard ARTIC protocol conditions, we ordered lyophilized ARTIC v3 primers from IDT and resuspended them in water at 100 μM each. Pool 1 primers consisted of all odd numbered amplicons, whereas pool 2 primers consisted of all even numbered amplicons. To generate the 100 μM pool 1 primer stock, we combined 5 μl of each 100 μM pool 1 primer, and repeated this protocol for the even numbered primers to give a 100 μM pool 2 primer stock. We selected a total of 20 amplicons as regions of low coverage from previous sequencing data (Supplementary Table 3). Low-coverage amplicons were present in both pools, with 11 coming from pool 1 and 9 coming from pool 2. For the primer 2X pools, we spiked in primers for the corresponding amplicons at 2× the concentration (20.8 nM final) of the other primers in the pool. For these low-coverage primers, we used 10 μl of the 100 μM stock rather than 5 μl. We diluted both the original and 2X primer pools 1:10 in nuclease-free water to generate a 10 μM working stock. We then selected 8 samples with varying CT values to determine whether selectively increasing primer concentrations reduced amplicon dropout (Extended Data Fig. 6d). We used the SDSI + AmpSeq protocol (without the SDSI or SDSI primers) and processed each sample with both the original primer pool, as well as the 2X primer pool, then sequenced these 16 samples on an Illumina Miseq (V2 reagent kit) with 2 ×150 bp paired-end sequencing. Only 6 of the 8 samples generated complete genomes (>98%) in both conditions and were used for further analysis.

CT normalization experiment

The CT normalization experiment was performed by taking four individual clinical samples (CT = 18–25) with four randomly selected SDSIs and either not normalizing the cDNA or normalizing to CT 25, 26 or 27 before the ARTIC PCR (Extended Data Fig. 4b). Samples were processed with the standard SDSI + AmpSeq protocol described below and sequenced on a NextSeq 500 Mid Output Kit v2.5 (300 cycles).

Illumina DNA flex

We performed a head-to-head comparison of standard Illumina Nextera DNA Flex and Nextera XT (Illumina FC-131-1096) library construction kits (Extended Data Fig. 6h). The Nextera XT protocol was performed as previously described21,33. Both library construction methods were compared on post ARTIC v1 PCR amplicons from clinical samples. Briefly, we amplified samples with a range of SARS-CoV-2 viral CT values (CT = 22.9, 26.2, 30.3) with ARTIC v1 primers, producing 400-bp-size fragments. We quantified amplicons from each ARTIC primer pool and pooled them in equimolar concentrations. Input was 100 ng (50 ng from each pool) for standard Nextera DNA Flex and 1 ng (0.5 ng from each pool) for Nextera XT. We quantified and pooled the resulting libraries before sequencing on an Illumina MiSeq (V2 reagent kit) with 2 × 150 bp paired-end sequencing.

We optimized Illumina DNA Flex library construction (Illumina 20018705) with the goal of reducing normalization steps and cost while increasing throughput. We scaled down (0.5X) Illumina DNA Flex throughout the standard Illumina sequencing protocol, also scaling down sample input for a total of 50 ng (25 ng from each primer pool). Because of the CT normalization step, we removed the pre-DNA Flex DNA concentration and pooling step. We used 1–2 μl post ARTIC PCR amplicon as input into the scaled-down DNA Flex library construction, and performed post library construction quantification and pooling with more uniform library size and concentration, further reducing time and cost of pooling libraries for sequencing. This protocol was used for all method development experiments, the cluster investigation, and a portion of the batch data generated from both the Broad Institute and JAX.

SDSI + AmpSeq SDSI titration in ARTIC SARS-CoV-2 sequencing

To determine an optimal concentration for SDSIs in ARTIC SARS-CoV-2 sequencing, we diluted SDSI 49 to 0.6, 6, 60 and 600 copies per μl (0.001, 0.01, 0.1 and 1 fM, respectively); 1 μl SDSI 49 was added to 5 μl cDNA, to be split to 2 × 3 μl for each ARTIC pool (Extended Data Fig. 3 and Supplementary Data File 1). SDSI primers were added to each ARTIC pool with a final concentration of 40 nM. The cDNA from one clinical sample (MA_MGH_00195; CT = 16) was mock diluted to CT 20, 25, 30 and 35 for this experiment using the protocol described within the CT normalization section. On the basis of the results of this experiment, SDSIs were used at 6 × 10^2 copies per μl (1 fM) for all method development data. Batch processing modifications to this approach from the Broad Institute and JAX are detailed below.
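
The correspondence between femtomolar concentration and copies per μl follows directly from Avogadro's number, as the short sketch below shows. The function names are hypothetical, and the per-pool calculation simply assumes an even split of the 6 μl SDSI and cDNA mixture into two 3 μl aliquots, as described above.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_ul(conc_fm: float) -> float:
    """Convert a concentration in fM (fmol per litre) to copies per microlitre."""
    moles_per_litre = conc_fm * 1e-15
    return moles_per_litre * AVOGADRO * 1e-6  # 1 ul = 1e-6 l


def copies_per_pool(stock_copies_per_ul: float, sdsi_ul: float = 1.0,
                    cdna_ul: float = 5.0, aliquot_ul: float = 3.0) -> float:
    """Copies of SDSI carried into one ARTIC pool aliquot after mixing."""
    total_copies = stock_copies_per_ul * sdsi_ul
    return total_copies * aliquot_ul / (sdsi_ul + cdna_ul)


print(round(copies_per_ul(1.0)))   # 602, i.e. about 6 x 10^2 copies per ul
print(copies_per_pool(600))        # 300.0 copies in each 3 ul aliquot
```

At the 1 fM working concentration, each ARTIC pool reaction therefore starts with roughly 300 SDSI molecules.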

SDSI + AmpSeq protocol

Full protocol details can be found at https://benchling.com/s/prt-R95g0tCxKOeCAqn8lAk3 (Fig. 1). Briefly, cDNA synthesis was performed on 2.5 μl DNase-treated viral RNA with SSIV following the manufacturer’s protocol, with an extension of the 50 °C incubation from 10 min to 60 min. An additional cDNA normalization step could be performed (see above) or one could move directly into the ARTIC PCR by taking 5 μl cDNA and mixing this with 1 μl of a 1 fM SDSI (equal to 600 copies per μl). After mixing, the mixture was split into 2 × 3 μl aliquots and ARTIC primer pool 1 or pool 2 was added, as well as 1 μM of the spike-in forward and reverse primers (40 nM final concentration in the ARTIC pool). The ARTIC PCR conditions were 98 °C for 30 s, followed by 40 cycles of 95 °C for 15 s and 65 °C for 5 min. Pool 1 and pool 2 PCR reactions were combined and taken through library construction with scaled-down Illumina DNA Flex.

Broad Institute sample processing

The batch data from the Broad Institute were generated using SDSI + AmpSeq with minor modifications (Fig. 4). Briefly, SSIV was used for cDNA synthesis. Q5 2X MM was used for the ARTIC PCR, which was run for 35 cycles. The SDSIs were spiked in at 6 × 10^3 copies per μl and the SDSI-specific primers were added to each ARTIC pool at a final concentration of 40 nM. Library construction was performed either with the scaled-down Illumina DNA Flex (previously described) or COVID-seq (Illumina 20043675). Samples were sequenced on a NovaSeq 6000 SP Reagent Kit v1 (300 cycles), v1.5 kits (300 cycles) or NextSeq 500 v2 kit (300 cycles).

The correlation between the GC percentage for each SDSI and the percentage SDSI reads over total reads for SDSI 2–48 was performed with the samples sequenced at the Broad Institute (n = 2,903) (Extended Data Fig. 2e). A linear regression was used to evaluate significance (P value = 0.8160).

Jackson Laboratory sample processing

Data generated at Jackson Laboratory used two different protocols publicly available at https://github.com/tewhey-lab/SARS-CoV-2-Consensus (Fig. 4). All samples included 6 × 10^2 copies per μl of SDSIs and the SDSI-specific primers were added to each ARTIC pool at a final concentration of 4 nM. Samples processed from December 2020 to April 2021 used Lunascript (NEB E3010) for cDNA synthesis and Q5 2X MM for the ARTIC PCR, which was run for 35 cycles. These samples used scaled-down Illumina DNA Flex for library construction. Samples sequenced after April 2021 used the standard COVID-seq protocol. All samples were sequenced on a NextSeq 500 using paired 75 bp reads by the Genome Technology group on Jackson Laboratory’s Bar Harbor campus. The validation of all SDSIs in clinical samples (Fig. 3a) was performed with this protocol and is presented as the percentage of SDSI reads over the total of all reads for each sample. Of note, the SDSIs (used at the lowest recommended concentration of 6 × 10^2 copies per μl) were reliably detected in the samples sequenced at JAX. This reliable detection, however, is also dependent on the sequencing depth used by the institution.

SDSI impact on genome recovery

For +/− SDSI experiments testing the impact on recovery of viral genomes, 14 clinical samples spanning a range of CTs (17.6–30) were selected (Fig. 3b and Extended Data Fig. 4a). Samples were CT normalized and split after cDNA synthesis into 2 ×5 μl aliquots. Samples below CT 20 were normalized to CT 25 and samples between CT 20–25 were normalized to CT 26. Fourteen randomly selected SDSIs were used, with each sample receiving either an SDSI (600 copies per μl) and the SDSI-specific primers (40 nM final concentration in the ARTIC pool), or just the ARTIC pool 1 and pool 2 mastermix with additional nuclease-free water and no SDSI primers. Samples were processed according to the SDSI + AmpSeq protocol using scaled-down Illumina DNA Flex for library construction, sequenced on a NextSeq 500 Mid Output Kit v2.5 (300 Cycles) and analysed as described below.

Statistical analysis for the +/− SDSI experiment involved analysis of the mean coverage for all 98 amplicons for the full sample set with a two-tailed Mann–Whitney U-test and the two-stage step-up multiple comparison procedure of Benjamini, Krieger and Yekutieli, with the false discovery rate (FDR) set to 5%. All 98 amplicons were found to be not significantly different (P > 0.05) between the plus and minus SDSI groups. Samples were also separated into three CT bins (CT < 27 (n = 4), 27–29 (n = 6), >30 (n = 4)) and the test repeated for each CT bin. This analysis also revealed that there was no significant difference (P > 0.05) in the mean coverage across any amplicon for any CT bin.
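
For readers unfamiliar with the two-stage step-up procedure, the sketch below shows one way to apply it to a list of per-amplicon P values. It is an illustrative implementation of the procedure as commonly described, not the software used for the analysis reported here, and it assumes the P values have already been computed by a separate test. The function names are hypothetical.

```python
from typing import List, Sequence


def bh_reject_count(pvalues: Sequence[float], q: float) -> int:
    """Number of hypotheses rejected by the linear step-up procedure at level q."""
    m = len(pvalues)
    rejected = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= rank * q / m:
            rejected = rank  # step-up: keep the largest rank that passes
    return rejected


def two_stage_step_up(pvalues: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return one True/False discovery flag per P value at FDR level q."""
    m = len(pvalues)
    if m == 0:
        return []
    q1 = q / (1 + q)                       # stage 1 level
    r1 = bh_reject_count(pvalues, q1)
    if r1 == 0:
        return [False] * m                 # nothing rejected, stop
    if r1 == m:
        return [True] * m                  # everything rejected, stop
    m0_hat = m - r1                        # estimated number of true nulls
    q2 = q1 * m / m0_hat                   # stage 2 level
    r2 = bh_reject_count(pvalues, q2)      # r2 >= r1 >= 1 because q2 > q1
    cutoff = sorted(pvalues)[r2 - 1]
    return [p <= cutoff for p in pvalues]
```

When every P value exceeds 0.05, as reported above for all 98 amplicons, the first stage rejects nothing and no amplicon is flagged.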

Intentional SDSI contamination experiment

The intentional contamination experiment used SDSI 87 and SDSI 94. The SDSIs were mixed at five different proportions (SDSI 87:SDSI 94; 100:0, 75:25, 50:50, 25:75 and 0:100) (Fig. 5a). Each condition was performed in duplicate. All samples were processed according to the standard SDSI + AmpSeq protocol using scaled-down Illumina DNA Flex for library construction and sequenced on a NextSeq 500 Mid Output Kit v2.5 (300 Cycles).

Production and application of synthetic RNA spike-ins (SRSI)

We ordered SDSI oligos with minor modifications to enable in vitro transcription of RNAs (including a T7 promoter upstream of the SDSI amplicon, as well as 17 bp of constant sequence within the primer region) (Twist Bioscience) (Supplementary Data File 1). For two SDSIs (SDSI 1 and SDSI 4), we in vitro transcribed RNA using a T7 transcription kit (NEB E2050), quantified RNA using an RNA screen tape (Agilent 5067-5579 and 5067-5580), then diluted RNA in water to 10 fM (6,000 copies per μl), 1 fM (600 copies per μl), 100 aM (60 copies per μl) and 10 aM (6 copies per μl). We added 1 μl SRSI at each concentration directly to 5 μl RNA from two patient samples with high and intermediate viral loads, respectively, and prepared sequencing libraries using the SDSI + AmpSeq protocol (without the SDSI addition step at the cDNA stage). For the sample with a high viral load, we performed a dilution at the cDNA stage (diluting 32-fold for a mock CT of 25 rather than 20). Reads mapping to unique SDSI sequences and SARS-CoV-2 were quantified using the ‘align_and_count_multiple_report’ and ‘assemble_refbased’ workflows, respectively, and % SDSI per combined reads was reported (Extended Data Fig. 10).

Computational analysis workflow

We analysed sequencing data on the Terra platform (app.terra.bio) using viral-ngs 2.1.28 with workflows that are publicly available on the Dockstore Tool Repository Service (dockstore.org/organizations/BroadInstitute/collections/pgs).

Samples were demultiplexed using the ‘demux_plus’ workflow with a spike-in database file for the SDSIs. We performed any separate analyses to quantify read counts, including those for SDSIs, with the ‘align_and_count_multiple_report’ workflow with the relevant database. For most analyses involving direct comparisons between samples, we performed downsampling to the lowest number of reads passing filter with the ‘downsample’ workflow. We performed assembly using the ‘assemble_refbased’ workflow to the following reference fasta: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta. We used iVar version 1.2.1 for primer trimming on all samples, followed by assembly with minimap2 set to a minimum coverage of either 3, 10 or 20, skipping deduplication procedures. The computational pipeline for all samples sequenced at JAX is publicly available at https://github.com/tewhey-lab/SARS-CoV-2-Consensus.

Samples from the batch data were subset in the following way for analysis. All samples with a present SDSI were used for the percentage of SDSI reads out of the sum of all SDSI reads analysis (JAX: n = 3,773; Broad: n = 2,903). Samples with known experimental contamination errors or where the dominant (>50%) SDSI was not the correct SDSI were removed. For the percentage of SDSI reads over the total of all sequenced reads analysis (JAX: n = 3,045; Broad: n = 2,670), non-template controls (waters) and clinical samples with no detectable viral load (CT > 40 or not detected via qPCR as described above) were removed from analysis.
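
The subsetting rules above can be expressed as a small per-sample check. The sketch below is an assumption-laden illustration rather than the pipeline used for the batch data: it takes per-SDSI read counts for one sample, together with the SDSI that was assigned to it, and returns a label. The labels and function names are hypothetical.

```python
from typing import Dict


def classify_sample(counts: Dict[str, int], expected: str) -> str:
    """Label a sample by whether the expected SDSI is dominant (>50% of SDSI reads)."""
    total = sum(counts.values())
    if total == 0:
        return "no_sdsi"            # no SDSI reads detected
    # Integer comparison avoids floating-point issues at exactly 50%
    if counts.get(expected, 0) * 2 > total:
        return "pass"               # expected SDSI is the dominant SDSI
    return "flag"                   # another SDSI dominates, or no clear majority


def percent_sdsi_of_total(sdsi_reads: int, total_reads: int) -> float:
    """Percentage of SDSI reads over the total of all sequenced reads."""
    if total_reads == 0:
        return 0.0
    return 100.0 * sdsi_reads / total_reads
```

Only samples labelled "pass" would be carried forward under these rules; samples labelled "flag" correspond to those where the dominant SDSI was not the correct one.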

Metagenomic sequencing and comparison

Metagenomic sequencing data and genome assemblies used for the comparison of amplicon-based sequencing were prepared, sequenced and analysed as described previously21, and the data are publicly available at NCBI’s GenBank and Sequence Read Archive databases under BioProject PRJNA622837. We prepared amplicon sequencing libraries from the sample RNA extract following our SDSI + AmpSeq protocol (Fig. 1). We normalized cDNA samples that had a high viral load (CT < 27) to a CT of 27. To prepare for the ARTIC PCR, we transferred 5 μl of the normalized cDNA to a new plate and added 1 μl of an SDSI (600 copies per μl). After mixing, we transferred 3 μl to a new plate, added ARTIC PCR pool 1 mastermix and pool 2 mastermix to the respective plates, and incubated the mixture on a thermal cycler at 98 °C for 30 s, followed by 40 cycles of 95 °C for 15 s and 65 °C for 5 min. We then combined equimolar amounts of amplified samples for a total of 50 ng and processed the mixture through a 0.5X Illumina Flex library construction pipeline. We sequenced the concordance dataset on a NovaSeq 6000 SP Reagent Kit v1 (300 cycles) and analysed the data as detailed in the methods below. For SNV analysis, the coverage depth over each divergent SNV was greater than 1,000× for both platforms, and both SNV calls persisted at relaxed (n = 3) and conservative (n = 20) minimum coverage thresholds. Primer trimming using iVar version 1.2.1 was manually confirmed.

Suspected nosocomial cluster investigation

We received NP swab samples in UTM and extracted RNA from 200 μl of biosample as previously described8. We prepared amplicon sequencing libraries as described above and analysed them as detailed in the methods below. A pairwise distance was calculated between all partial genomes (>80% complete), excluding gaps, to determine whether samples were likely to be the result of nosocomial transmission (Fig. 5c). We calculated the proportion of reads that mapped to a given SDSI out of all reads that mapped to any SDSI. Data have been made available in both the Sequence Read Archive and NCBI GenBank under BioProjects PRJNA622837 and PRJNA648098. GenBank accessions for SARS-CoV-2 genomes from this set of samples are MW454553–MW454562.
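
A gap-excluding pairwise distance of this kind can be sketched as below. This is a simplified illustration under explicit assumptions, not the code used for the cluster analysis: both genomes are assumed to be in the same coordinate system and of equal length (for example, reference-based assemblies), and both '-' and 'N' are treated as missing positions. Ambiguity codes other than N would be counted as differences in this sketch. The function names are hypothetical.

```python
MISSING = set("-N")  # assumption: gaps and uncalled bases are both excluded


def completeness(seq: str) -> float:
    """Fraction of positions with a called base."""
    if not seq:
        return 0.0
    called = sum(1 for base in seq.upper() if base not in MISSING)
    return called / len(seq)


def pairwise_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions, skipping any position missing in either genome."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a not in MISSING and b not in MISSING and a != b
    )


def comparable(seq: str, min_complete: float = 0.8) -> bool:
    """Apply the >80% completeness cut-off used to select partial genomes."""
    return completeness(seq) > min_complete
```

Because missing positions are skipped rather than scored, two partial genomes can have a distance of zero even when they cover different regions, which is one reason a completeness cut-off is applied first.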

For phylogenetic tree reconstruction, we placed the suspected nosocomial cluster in a broader genomic context by performing a subsampling of the genome sequences available in GISAID34 (Extended Data Fig. 9). We used the ‘sarscov2_nextstrain’ workflow to perform a Massachusetts-weighted subsampling of samples from 1 January 2020 to 1 November 2020. Our subsampled dataset included 3,146 sequences; 1,449 samples from Massachusetts, 1,425 samples from elsewhere in the United States and 283 from other countries. We constructed a maximum likelihood tree using iqtree with a General Time Reversible substitution model, and edited and interpreted the tree in Figtree v1.4.4.

Data presentation

Data analysis and graphing were performed using R Statistical Software (version 1.3.959; R Foundation for Statistical Computing), GraphPad PRISM (version 9.0.2; GraphPad Software, www.graphpad.com) and Python (version 3.7). We created original figures using BioRender (BioRender.com).

Statistics and reproducibility

The aim of our study was to establish the potential utility of our method, and study sample sets were largely dictated by the availability of clinical samples. No statistical method was used to predetermine sample size and experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. We performed some experiments, such as RT–qPCRs, with technical replicates, and the experiment shown in Fig. 5a was run in duplicate.

Samples that failed sequencing due to technical reasons or where insufficient genome coverage was obtained for a given analysis (see Methods) were excluded. Data for Fig. 4b and Extended Data Fig. 2e were subsetted to exclude samples where SDSIs showed evidence of substantial contamination (expected SDSI is not the dominant (>50%) SDSI), the addition of no SDSIs, or failed sequencing. This dataset was further subsetted for Fig. 4c to exclude non-template controls and samples with no detectable CT for SARS-CoV-2. Analysis for 46 SDSIs that we did not proceed with further (due to suspected overlap with lab contaminants or contamination during synthesis or preparation) was omitted from this manuscript. Data from one sample that we added SRSI to was omitted from reported data.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.