Real-time SARS-CoV-2 diagnostic and variants tracking over multiple candidates using nanopore DNA sequencing

Since December 2019, a novel coronavirus responsible for a severe acute respiratory syndrome (SARS-CoV-2) is accountable for a major pandemic situation. The emergence of the B.1.1.7 strain, as a highly transmissible variant has accelerated the world-wide interest in tracking SARS-CoV-2 variants’ occurrence. Similarly, other extremely infectious variants, were described and further others are expected to be discovered due to the long period of time on which the pandemic situation is lasting. All described SARS-CoV-2 variants present several mutations within the gene encoding the Spike protein, involved in host receptor recognition and entry into the cell. Hence, instead of sequencing the whole viral genome for variants’ tracking, herein we propose to focus on the SPIKE region to increase the number of candidate samples to screen at once; an essential aspect to accelerate diagnostics, but also variants’ emergence/progression surveillance. This proof of concept study accomplishes both at once, population-scale diagnostics and variants' tracking. This strategy relies on (1) the use of the portable MinION DNA sequencer; (2) a DNA barcoding and a SPIKE gene-centered variant’s tracking, increasing the number of candidates per assay; and (3) a real-time diagnostics and variant’s tracking monitoring thanks to our software RETIVAD. This strategy represents an optimal solution for addressing the current needs on SARS-CoV-2 progression surveillance, notably due to its affordable implementation, allowing its implantation even in remote places over the world.

Since December 2019, the emergence of a novel coronavirus responsible for a severe acute respiratory syndrome (SARS-CoV-2) is accountable for a major pandemic situation, leading to date to nearly 3 million deaths worldwide. Early on 2020 major viral diagnostic efforts were deployed, including the development of news strategies aiming at (1) reducing the diagnostics time, (2) decrease the costs per assay, (3) providing population-scale solutions, but also (4) presenting a high sensitivity and specificity, notably to discriminate asymptomatic cases. While costs and required time per assay were early on addressed by the development of immune-based tests 1 , their sensitivity relative to viral nucleic acid-targeting diagnostics were initially questioned, but finally accepted in several countries notably due to the low throughput of diagnostics based on RT-qPCR assays 2 .
With the aim of taking advantage of the viral nucleic acid targeting diagnostics but at the same time increasing the number of candidates' through-put, strategies based on the use of massive parallel DNA sequencing for population-scale diagnostics were proposed [3][4][5] . These efforts rely on (1) the incorporation of DNA molecular barcodes during sample preparation; (2) pooling barcoded samples for an "in-one-run" diagnostics assay by the use of massive parallel DNA sequencing; (3) stratification of diagnostics' outcome with the help of bioinformatics processing. While such strategies demonstrated to up-scale the diagnostics power beyond several thousand candidates per assay, they require to count with a DNA sequencing platform where large instruments (Illumina Sequencers) are driven by specialized technical engineers. As consequence, this strategy remains applicable to highly developed countries (due to the prohibitive costs of the required DNA sequencers), which in addition requires a dedicated pipeline strategy for transporting collected samples towards sequencing centers.
Beyond the diagnostics requirements, the identification of the highly transmissible SARS-CoV-2 variant B.1.1.7-originated in the south of England-has strongly accelerated the worldwide interest in tracking SARS-CoV-2 variants' occurrence. More recently, two other extremely infectious variants, namely the P. 1  www.nature.com/scientificreports/ Brazilian city of Manaus and the B.1.351 initially detected In South Africa, were described, and unsurprisingly further others are expected to be discovered, notably due to the long period of time on which the pandemic situation is lasting 6 . Herein, we describe a proof of concept study for population-scale diagnostics and variants' tracking using massive parallel DNA sequencing performed with the MinION Oxford Nanopore Technology (ONT). ONT MinION sequencers are well-known for their portability, their reduced cost, and their simplicity of use; thus, representing a suitable strategy for its implementation even in remote places over the world (reviewed in 7 ). Indeed, this technology is widely used for SARS-CoV-2 whole-genome sequencing, notably by the use of the amplicon sequencing protocol developed by the ARTIC Network (https:// artic. netwo rk/ ncov-2019) 8 . Furthermore, a Covid-19 diagnostics strategy combining Loop-mediated isothermal Amplification (LAMP) with Nanopore sequencing has been also described 9 .
In addition to a dedicated molecular biology protocol, we deliver a computational solution (RETIVAD: Real-Time Variants Detector) providing the outcome of the diagnostic in a real-time mode during sequencing; allowing to earn time for diagnostics report, essential in the current context of the pandemic situation. Finally, we have also demonstrated that our strategy is able to perform SARS-CoV-2 variants' detection across multiple candidates, notably by targeting the SPIKE gene instead of covering the whole viral genome (Fig. 1). This targeted variants' tracking strategy allows increasing ~ 10 × the potential sequencing coverage delivered during the assay, thus allowing to redistribute this resource towards the variants' screening of a large number of candidates at once.
In summary, the proposed strategy satisfies the current need for counting with a population diagnostic assay, and discriminates between the various SARS-CoV-2 variants, which are recognized to potentially present variable infection properties as well as clinical outcomes 6,10 .

Results
A combinatorial molecular barcode strategy for SARS-CoV-2 diagnostics over multiple candidates. With the aim of performing SARS-CoV-2 diagnostics over multiple candidates with the MinION sequencer, we combined a regular reverse-transcription assay, with a particular PCR amplification allowing the generation of long DNA concatemers. This strategy allows to target short amplicons, like those used on quantitative real-time PCR strategies, but at the same time take advantage of the long fragments sequencing capabilities of the Oxford Nanopore technology.
Reverse transcription (RT) is performed with a Covid-19 specific primer, presenting in addition, a unique molecular barcode (20nt length) and a common sequence adapter (Gibson sequence; 30nt length) ( Fig. 2A) www.nature.com/scientificreports/ reverse-oriented sequence for the common Gibson sequence adapter (30nt length) is used together with a Gibson primer ( Fig. 2A). This strategy allows to incorporate a second unique molecular barcode, thus providing a combinatorial barcoding strategy per candidate which is extremely useful for increasing the number of candidates to interrogate without having to linearly increase the number of primers to purchase; an aspect which is of major relevance due to their long length (> 70nt). The presence of Gibson sequences on both extremities of the PCR amplified products allows generating concatemers during amplification ( Fig. 2A). While this process takes place in an uncontrolled manner during PCR amplification, low primer concentrations were shown to give rise preferentially to long concatemers (Fig. 2B,C). As consequence, concatenated PCR products can be purified from long primers (> 70nt length) with SPRI-select reagent (Beckman; B23318), without the risk of losing the short amplicons. Furthermore, the use of long concatemers increased the average sequencing coverage of MinION by a factor of ~ 9 (i.e. 1 million concatemers sequenced corresponds to 9 million monomers; Figure S1 and Fig. 3E), such that a large number of candidates can be potentially screened like in previous population-scale diagnostics efforts based on the use of Illumina instruments [3][4][5] .
Overall, the proposed pipeline is composed of a standard molecular biology strategy, requiring less than 8 h of samples preparation prior Nanopore sequencing ( Fig. 2D and "Methods" section).
Real-time SARS-CoV-2 diagnostics over multiple pseudo-candidates presenting variable amounts of viral copies. One of the major bottlenecks for population-scale Covid-19 diagnostics based on massive parallel DNA sequencing is the required time to deliver the outcome of the diagnostic. In addition to the molecular biology workflow required for samples' processing till sequencing library preparation, DNA sequencing, performed on Illumina instruments, requires an incompressible processing time, which delays downstream analyses since they can only be performed at the end of the whole sequencing process. In contrary, Oxford Nanopore sequencers deliver sequenced reads on a rolling basis (4000 sequenced reads per generated fastq file). Hence, we have developed a computational pipeline dedicated to (1) collect available fastq files; (2) To validate the performance of the proposed diagnostics strategy, synthetic SARS-CoV-2 RNA control samples (Twist Biosciences) were used for mimicking variable viral RNA copies per assay. Specifically, subsequent dilutions of the synthetic Wuhan-Hu-1 (GeneBankID: MN908947.3) SARS-CoV-2 RNA control sample has been used for reverse transcription, followed by a quantitative PCR assay targeting a region on the E gene of Considering that Oxford Nanopore instruments provide sequenced data during sequencing, the aforementioned steps are performed on real-time, such that the number of aligned read counts per candidate (defined by the combination of cDNA & PCR incorporated DNA barcodes) are assessed and displayed during the MinION sequencing process. (B) The synthetic SARS-CoV-2 RNA control sample (Twist Bioscience; Wuhan-Hu-1 strain) was subsequently diluted and used for reverse transcription and RT-qPCR assay targeting the E-Sarbeco viral region. (C) Reverse transcription assay scheme for 8 conditions corresponding to different viral RNA dilutions (dilution ID corresponding to that presented in (B)) and targeting the E-Sarbeco region. Each of these reverse transcription assays are identified by the indicated molecular barcodes (BC33-BC40). Note that among them a sample devoid of viral RNA is also included (BC37). All 8 samples are pooled at the end of the assay and concat-PCR amplified with a set of other eight barcoded primers targeting the E-Sarbeco gene (BC1-BC8), providing a combinatorial barcoding strategy corresponding to 64 pseudo-candidates. (D) Real-time display illustrating the number of aligned read counts associated to the E-Sarbeco region per dual barcode combination (BC3 vs all cDNA barcodes) assessed during Nanopore sequencing. Inset: Zoom view highlighting that the first seven hours of DNA sequencing are sufficient to reveal read count differences between samples. (E) Summary of the bioinformatics pipeline assessed at the end of 4 days of DNA sequencing. Concatenation factor is inferred from the ratio between the number of detected monomers and the initial number of sequenced concatemers. Optimal barcodes correspond to the number of sequences presenting an optimal monomer length (> 100nt; < 250nt) and presenting both cDNA & PCR barcode sequences. (F) Boxplots illustrating the total aligned read counts associated to each cDNA barcode (BC33-BC40) and the different PCR barcodes. Each of the boxplots are labelled with their corresponding viral RNA dilution (dil.ID indicated in (B)). (G) Heatmap illustrating the total aligned read counts associated to each cDNA barcode (BC33-BC40) and the different PCR barcodes (BC1-BC8). Note that (F,G) reveals a sensitivity of 100 viral copies, since the next RNA dilution condition appears in some cases at the same level of the negative samples. www.nature.com/scientificreports/ SARS-Cov-2, defined by primer sequences described previously (E-Sarbeco) 11 . This quantitative PCR allowed to detect as few as 10 viral copies of the synthetic SARS-CoV-2 RNA control sample (Fig. 3B). For Nanopore DNA sequencing-based diagnostic, as described on Fig. 2A, we have designed eight barcoded E-Sarbeco reverse primers (targeting the same amplicon region used for the quantitative PCR assay), and other eight barcoded E-Sarbeco forward primers (incorporating the required Gibson sequence), for concat-PCR amplification. This combinatorial strategy allows to mimic the screening of 64 pseudo-candidates (Fig. 3C).
Subsequent dilutions of the synthetic SARS-CoV-2 RNA control sample were used for reverse transcription, performed with defined barcoded E-Sarbeco primers (BC33-BC36; BC38-BC40), in addition to a negative control sample devoid of viral RNA material (BC37) (dilution ID in Fig. 3C). The real-time diagnostic assay revealed significant differences between samples presenting different viral RNA copies based on the number of aligned read counts towards the E-Sarbeco target region. Such differences, observed as early as 7 h after DNA sequencing, revealed that samples issued from high viral RNA titers presented systematically higher amounts of aligned read counts to the E-Sarbeco target region ( Fig. 3D and Figure S2). Furthermore, after 4 days of DNA sequencing (72 h of DNA sequencing, and 1 day more for finalizing base-calling) a total of ~ 4 million reads were sequenced, corresponding to ~ 38 million monomers (9.7 × concatemerization factor) (Fig. 3E). Despite this large gain on monomer sequences, only 14 thousand optimal sequences (i.e. presenting both cDNA and PCR flaking barcodes and within the expected monomer amplicon insert size) were used for the diagnostic assay, all others presenting either too short inserts (< 100nt) or missing one or both of the required barcodes, essential for the identification of the samples. Noteworthy, the analysis of ~ 14 thousand optimal sequences was sufficient for discriminating samples with viral RNA titers ranging within 1 to 100 fold dilutions of the synthetic Covid-19 RNA material (Fig. 3F,G); equivalent to 100 copies detected with the quantitative PCR assay (Fig. 3B). Indeed, the sensitivity of the assay fails to discriminate in some cases between the negative control samples (BC37) and those issued from the least viral RNA titer (1/1000 dilution; BC36) ( Fig. 3F,G).

Real-time SARS-CoV-2 variants' tracking over multiple pseudo-candidates.
Considering the current needs on tracking SARS-CoV-2 variants' emergence and spreading over the population, we have evolved the aforementioned real-time SARS-CoV-2 diagnostic strategy into a variants' tracking system. Because all current described SARS-CoV-2 variants present an important number of mutations within the SPIKE gene, we have focused the real-time tracking system to this region. In fact, by concentrating the sequencing efforts towards the SPIKE gene (~ 3.8 kb) we can afford to combine diagnostic and variants' tracking into a single assay, which per se represents a major progress relative to the current strategies.
In contrast to the aforementioned diagnostics effort targeting a short amplicon (E-Sarbeco), variants' tracking over the whole SPIKE gene uses a random-hexamer sequence for the reverse transcription step, and four primers targeting the SPIKE gene at different regions (~ 1 kb distance among each primer) (Fig. 4A). Like in the case of the diagnostics assay, the random-hexamer primer used for the reverse transcription present a unique molecular barcode as well as the common Gibson sequence. In contrary, the PCR primers (P1-P4) present a unique molecular barcode but they are devoid of the Gibson sequence. In fact, PCR amplification over cDNA issued from random-hexamer priming leads to amplicons of variable length (including fragments > 1 kb; see Fig. 4B & Figure S3), thus the concatemerization step used on the short amplicon strategy does not appear essential. To simplify its use, PCR amplification is performed in presence of a pool of all four primers targeting the SPIKE region (P1-P4) and a Gibson primer sequence (in an equimolar concentration; i.e. 4 × Gibson primer for 1 × for each SPIKE targeting primer).
To better mimic a real situation, in which the diagnostics/variants' tracking assay is performed with samples collected from human candidates (e.g. nasopharyngeal swabs), human RNA-issued from cell lines in culturehas been combined with the synthetic viral RNA samples, such that the random-hexamer primers used for the reverse transcription assay could also provide means to count with a positive control for sample collection. Furthermore, for addressing variants tracking performance, in addition to the Wuhan-Hu-1 SARS-CoV-2 RNA control sample, we have used synthetic viral RNA samples corresponding to the highly transmissible SARS-CoV-2 variant B.1.1.7, first described in the UK in December 2020, as well as the strain HF2393 described as a particular clade circulating in France since January 2020 12 (Fig. 4C).
To perform diagnostics and variants' tracking assay over multiple pseudo-candidates, we have used eight barcoded random-hexamer primers for reverse transcription (BC37-BC44) in presence of different viral RNA titers (consecutive dilutions) and issued from the aforementioned synthetic viral RNA strains (Fig. 4D). To multiply the number of pseudo-candidates, we have amplified the eight barcoded cDNA products with four barcoded primers, either targeting the SPIKE gene (BC1-BC4 corresponding to pools of P1-P4) or targeting the E-Sarbeco region (BC5-BC8).
The cDNA products issued from barcoded-random-hexamer priming were first validated by quantitative PCR amplification, targeting either the human RNAse-P gene or the viral E-Sarbeco region (Fig. 4E). Then, the PCR products generated with barcoded primers targeting SPIKE (BC1-BC4) and those targeting the E-Sarbeco region (BC5-BC8) were pooled together and analyzed by ONT sequencing. The real-time diagnostic assay for the SPIKE gene or the E-Sarbeco region, revealed significant differences between samples presenting different viral RNA copies based on the number of aligned read counts. Specifically, the negative control sample (BC44) presented systematically the least number of aligned counts for the SPIKE and the E-Sarbeco region, and in the opposite, the sample presenting the most concentrated viral RNA titers (BC43) presented the highest number of aligned reads per aforementioned target regions (Fig. 4F,G).
In addition to the diagnostic outcome, RETIVAD performs a variants' detection relative to the SARS-CoV-2 reference genome. In fact, as illustrated on Fig. 4H (Fig. 4D). Similarly, variants detection performed on samples issued from the BC40-barcoded cDNA (BC1-BC4 x BC40) revealed a similar signature, in despite of their lower coverage (Fig. 4I). In fact, samples associated to the BC40 were also generated with the synthetic viral RNA corresponding to the B.1.1.7 strain, but with a tenfold lower titer than in the case of the BC37 (Fig. 4D). Finally, variants' tracking performed on samples issued from the BC38-barcoded cDNA (BC1-BC4 x BC38; French strain HF2393) revealed a single common mutation, D614G, previously described within the French strain HF2393 12 , but also among other European strains, including the B.1.1.7, as observed within the variants detected on samples issued from BC37 and BC40 (Fig. 4H,I).

Discussion
Despite the current vaccination efforts, the pandemic situation anticipates the propagation of the recently described variants, as well as the emergence of new others, notably due to the major discrepancies between countries to access to optimal vaccination programs 6 . As consequence, counting with methodologies for Covid-19 diagnostics but also variants' tracking on a population scale remains essential.
Currently, diagnostics and variants' tracking are performed in two steps, which per se delays the time for obtaining a complete diagnostic. Furthermore, variants' tracking relies on full genome sequencing of SARS-CoV-2, which represent an expensive and labor-intensive effort, incompatible with population-scale assays. Finally, most of the variants' tracking efforts, relying on the use of large and expensive sequencing instruments, requires to centralize samples processing to dedicated institutions, which are rather absent on poor countries.
To address all these pitfalls, we propose herein a methodology able to (1) perform SARS-CoV-2 diagnostic and variants' tracking at the same time; (2) shorten the time for accessing to the outcome of the diagnostic/ variants' detection by the use of a real-time tracking system; (3) take advantage of the compact and affordable Oxford Nanopore MinION DNA sequencing instrument, notably for deploying the aforementioned diagnostics and variants' tracking platform anywhere over the world.
Within this study we have validated the proposed Nanopore DNA sequencing-based diagnostic, by targeting a region on the E gene of SARS-Cov-2 (E-Sarbeco) on a total of 64 pseudo-candidate samples (Fig. 3). Importantly, the few number of optimal sequences required for performing this diagnostic assay suggests that larger candidate screenings (several hundreds) are possible within the scale of the sequencing-depth afforded by the MinION sequencer.
The proposed variants' tracking focus on the analysis of the SPIKE gene instead of the whole SARS-CoV-2 genome. In fact, the SPIKE gene has shown to concentrate the majority of the currently discovered variants, which can be rationalized by the fact that it codes for the protein required for the interaction with the human cells. Noteworthy, most of the current vaccines against the Covid-19 are also based on the SPIKE region. Hence, concentrating variants' tracking towards this gene appears an optimal compromise to increase the number of candidates to screen at once. This philosophy is shared by a recent work describing the HiSpike method, performing variants' detection within the SPIKE gene with the help of the small Illumina MiSeq instrument 13 . Illumina instruments being a short fragments sequencing approach, HiSpike requires 42 primer pairs generating amplicons of 400 nucleotides of length for covering the SPIKE region. In contrary, our proposed strategy, based on the use of long DNA fragments, compatible with the Nanopore DNA sequencing technology, relies on 4 primers targeting the SPIKE region, one random primer for the cDNA step and a common Gibson sequence. In fact, a strategy for SARS-CoV-2 diagnostic and variants' surveillance, dedicated to be deployed anywhere on the world does not only requires to count with low cost instruments, but also involve a reduced number of reagents for allowing its use on a long-standing manner.
During the time of the final revision of this article, new variants derived from the lineage of B.1.1.7 has gained major attention notably due to the fact that they are vastly spreading in the UK, USA, and other locations (B.1.617.1, also known as Kappa; B.1.617.2 also known as "Delta" and B.1.618 also known as "triple mutant, ", as suspected to have evolved from B.1.617) 14 . In all cases, new mutations on the gene SPIKE were observed, further supporting our strategy for targeting this genomic region for variants surveillance.
For all these reasons, we are convinced that our proposed strategy, accompanied by the dedicated computational solution RETIVAD, can represent a game-changer approach for tracing variants' progression in the following months. For SPIKE-centered variants tracing, the reverse-complementary Gibson sequence has been excluded, notably because the PCR amplification performed over random-hexamer generated cDNA templates generate large fragments (> 1 kb). In contrary, the amplicon product issued from E-Sarbeco targeting oligonucleotides gives rise to short amplicons-compatible with quantitative PCR assays, thus requiring concatemerization prior nanopore sequencing.

SARS-CoV
With the aim of covering the SPIKE gene (~ 3.8 kb), the following four primers targeting the SPIKE region with ~ 1 kb interval within them were designed: Spike1_F: AGG GGT ACT GCT GTT ATG

Real-time SARS-CoV-2 diagnostics and variants tracking.
Real-time processing of the fastq files generated by the Mk1C instrument during sequencing is performed with our in-house developed RETIVAD (Real-Time Variants Detector) software. RETIVAD collects each ten minutes the fastq files generated during sequencing from the Mk1C instrument and (1) search for Gibson sequences within the sequenced reads (regex query (https:// pypi. org/ proje ct/ regex) with maximum 4 mismatches accepted by default); (2) split sequenced reads on monomers-delimited by the presence of the identified Gibson sequences; (3) search for barcode sequences introduced during the reverse transcription (cDNA barcodes: BC33-BC48; regex query); (4) search for barcodes introduced during the PCR amplification (PCR barcodes: BC1-BC8; regex query); (5) collect the sequences between the cDNA and PCR-associated barcodes for their alignment towards the SARS-CoV-2 reference genome (Wuhan-Hu-1 genome: NC_045512; BWA aligner following nanopore parameters). Considering that steps (3)(4)(5) are performed within the retrieved monomers in step (2), RETIVAD matches cDNA/PCR barcode combinations to aligned outcomes, thus providing a diagnostic outcome for the associated pseudocandidate. Due to the low number of sequences retrieved at intervals of 10 min, and the relative short size of the SARS-CoV-2 genome (~ 29 kb), RETIVAD is able to provide such diagnostic outcome within such 10 min interval. Hence, at the next round of data collection, RETIVAD reiterates on the aforementioned steps over the newly collected fastq files and appends the outcome to the previous results. The continuous gain on aligned readcounts per region of interest (e.g. E-Sarbeco amplicon sequence, or SPIKE gene; defined as entry parameters) is visualized within scatterplot view (png file) updated during the processing (RETIVAD produces a png file per detected PCR barcode monomers; thus allowing to generate scatterplots displaying as many lines as detected cDNA barcode monomers; Figure S4).
In addition to the diagnostic outcome, RETIVAD has been designed for performing real-time variants detection. For it, the racon package (version 1.4.20) is used for raw de novo DNA assembly, followed by the alignment with Minimap2 (version 2.17). Medaka (version 1.2.5) is used to create a consensus sequence and variant calls from the aligned data (> 20 aligned reads required as default for variants calling). RETIVAD generates BAM files per cDNA/PCR barcode combination harboring base conversion information and their associated coverage (for visualization with Genome browsers like IGV). Furthermore, summary files per cDNA/PCR barcode combination are also generated, providing the genomic position in which the base conversion has been detected, the identity of the converted bases as well as their associated coverage. Like in the case of the diagnostics processing, variants detection is performed in real time, thus providing the possibility to the user to visualize the results and decide whether sequencing requires still to be performed or whether the collected information is sufficient to conclude the diagnostics/variants tracing for the corresponding candidates.
RETIVAD generates a final summary report for the diagnostics outcome when no new fastq files are available after 40 min from the last data collection and processing.

Data availability
All genome sequences included in this study are available under request. RETIVAD is freely available at https:// github. com/ SysFa te/ retiv ad.