Rapid genomic characterization of SARS-CoV-2 viruses from clinical specimens using nanopore sequencing

The novel SARS-CoV-2 outbreak has swiftly spread worldwide. The rapid genome sequencing of SARS-CoV-2 strains has become a helpful tool for better understanding the genomic characteristics and origin of the virus. To obtain virus whole-genome sequences directly from clinical specimens, we performed nanopore sequencing using a modified ARTIC protocol in a portable nanopore sequencer and validated a routine 8-h workflow and a 5-h rapid pipeline. We conducted some optimization to improve the genome sequencing workflow. The sensitivity of the workflow was also tested by serially diluting RNA from clinical samples. The optimized pipeline was finally applied to obtain the whole genomes of 29 clinical specimens collected in Hangzhou from January to March 2020. In the 29 obtained complete genomes of SARS-CoV-2, 33 variations were identified and analyzed. The genomic variations and phylogenetic analysis hinted at multiple sources and different transmission patterns during the COVID-19 epidemic in Hangzhou, China. In conclusion, the genomic characteristics and origin of the virus can be quickly determined by nanopore sequencing following our workflows.


h and 5 h workflows for SARS-CoV-2 nanopore sequencing.
To acquire the whole-genome sequence of SARS-CoV-2 more efficiently, an 8-h workflow was designed on the basis of the sequencing throughput and speed after loading the library into the flow cell, and a 5-h workflow was designed for rapid library building (Fig. 1). These two workflows were both tested on the HZCDC0001 sample, with Ct values of 26.51/27.03 (Orf1ab/N); this sample was obtained from the first case that appeared in Hangzhou, Zhejiang Province. In the 8 h workflow, a nanopore ligation sequencing kit was applied, since this protocol can maximize the sequencing throughput and the length of reads (Fig. 2C). The total bases sequenced and the genome completeness of SARS-CoV-2 increased much faster than in the 5 h workflow. Here, regions with a depth greater than 15 × are recognized as the credible coverage, and genome completeness can reach almost 100% in only 10 min after loading the library into the flow cell. In contrast, the 5 h workflow took more than 1.5 h to approach 100% completeness. The 5 h workflow presented the advantage of a rapid 15-min library preparation time, especially under extreme conditions. This workflow greatly shortened the library preparation time compared to the 2-h ligation protocol (Fig. 1). However, as the rapid nanopore protocol cleaves the DNA to quickly add transposase adapters, the sequencing throughput and speed were poor compared with those of the 8 h workflow ( Fig. 2A-C).
In both workflows, two regions (5231-5644 bp and 22,798-23,214 bp, primer pairs #18 and #76) appeared to be short boards in genome mapping, which indicates a need for further optimization. We performed some optimization to improve the workflow. Since RNA extraction is vital for follow-up sequencing, we compared magnetic bead extraction in an NP968 instrument (Tianlong, China) with column RNA extraction using the RNeasy Mini Kit (QIAGEN, Germany) (Fig. 2D). The latter approach seemed to yield higher-quality RNA, nearly doubling the depth of coverage (Fig. 2F). Moreover, the PCR procedure took more than 40% of the total workflow time, so we tried to decrease the annealing and extension time from 5 to 3 min and 1 min, corresponding to a total time during PCR of approximately 3-1 h. Even when the annealing time was reduced to 1 min, the whole-genome sequence could still be obtained from the products of the 1 h PCR procedure (Fig. 2E,G). Because of the different viral titers of SARS-CoV-2 in the clinical samples, the 3-min annealing time could be considered the equilibrium point.

Sensitivity test and application in clinical samples.
As clinical specimens may exhibit extremely low viral titers, we tested the sensitivity of the routine 8-h workflow by serially diluting RNA from two COVID-19 patients 10 times (starting from RNA samples with a Ct value ~ 22) to determine whether we could amplify the whole genome of SARS-CoV-2 from samples that showed failure in the qRT-PCR test. Approximately 85% of the reads were mapped to the reference genome (GenBank accession number MN908947.3) with an average depth of coverage greater than 250 × across > 97.56% of the SARS-CoV-2 genome for both samples, including the 100,000 ×-diluted sample, which was undetectable by the qRT-PCR test in an ABI 7500 instrument ( Fig. 2H-J). The average depth was not obviously decreased among the diluted samples; however, the genome-wide depth fluctuated very significantly when the dilution rate was above 10,000 ×.
The routine workflow was applied to obtain the genomes from 29 clinical samples with an average depth of 233.75-754.88 × and genome completeness of 98.08-100% (Table 1). The information for the specimens, including the genome-wide depth of coverage determined by nanopore sequencing and Ct values determined by RT-PCR, is listed in Table 1 and Fig. 2K. In particular, the genomes of some specimens with low Ct values, such as HZCDC0090 and HZCDC0091, were also obtained with completeness ranging from 99.15 to 100%, which provided evidence that some problematic clinical samples can be successfully sequenced via nanopore sequencing. In addition, both upper and lower respiratory tract specimens (HZCDC0048, HZCDC0048L, HZCDC0090, HZCDC0090L, HZCDC0091 and HZCDC0091L) from three COVID-19 patients were sequenced, and the genome mapping results showed that the virus genomes from different parts of the respiratory tract were consistent (Table 1).

Genomic variations and phylogenetic analyses of SARS-CoV-2. The length of the reference SARS-
CoV-2 genome (MN908947.3) was 29,903 bp. However, a considerable fraction of the submitted SARS-CoV-2 genomes were incomplete. Therefore, the strategy of building a phylogenetic tree based on SNPs was applied to investigate the traceability of samples of interest. To avoid introducing errors in nanopore sequencing, we first filtered low-quality reads, and only SNPs with high quality (Phred value ≥ 20) and a high site depth of coverage (≥ 50) were considered in the downstream analysis. In addition, we performed Illumina sequencing for all 29 clinical samples, which provided evidence that the SNPs from our standard were 100% consistent with the Illumina data and that SNPs from SARS-CoV-2 genomes can be called on the basis of nanopore sequencing alone. In all 29 obtained complete SARS-CoV-2 genomes, 33 substitutions distributed in five coding sequences (CDSs) and 5′UTRs were identified based on sequence alignment (Fig. 3), including C125T and C241T in the 5′UTR, 10 synonymous variations and 21 missense variations ( Table 2). Among the 21 amino acid variants, two were found in the S gene (E96D and D614G), three were found in the N gene (R203K, G204R and S235F), and the others were detected in ORF1a/b (T265I, T551I, I739V, P765S, A2142S, L3606F, M4590T, A4784V, T4847I, T5020I, V5661A, I6525T and D7006G) and ORF3a (Q57H and G251V), respectively. The D614G amino acid change in the spike protein was confirmed to be associated with greater infectivity 6 , which was caused by an A23403G nucleotide mutation compared with MN908947.3. In our study, it was found that G614 in the spike protein had become the dominant in the imported cases, increasing from 0% (15 genomes in January) to 71.4% (10 of 14 genomes in March), accompanied by C241T in the 5′UTR, the silent mutation C3037T and the missense variation C14408T, which results in a P323L amino acid change in RNA-dependent RNA polymerase (RdRp).
Two family clusters of SARS-CoV-2 infections involving 4 patients were found. One of the index patients (HZCDC0013) was a female who was infected in Wuhan before returning home and was diagnosed with COVID-19 five days later. Her husband (HZCDC0012) was confirmed to have SARS-CoV-2 infection on the same day, with a strain showing with two mutations, G11083T and A21282G, consistent with HZCDC0013. In another human-to-human transmission event, identical substitutions (A2480G, C2558T, G11083T, C14805T and G26144T) were found in the genomes obtained from a mother and a son, hinting at a family-cluster transmission history of SARS-CoV-2 from a U.K. student to his mother.
The phylogenetic analysis was conducted based on a July 4, 2020 download of 196 SARS-CoV-2 reference genomes from different lineages 7 and clades from the GISAID database (GISAID acknowledgments are included in Table S1) randomly selected by the Perl rand() function. According to the observed genetic diversity, four clades, L, V, S and G, were defined, among which the L clade was the most prominent among the genomes collected in January 2020 except four of fifteen genomes belonging to clade V showing an L3603F substitution in ORF1a/b and a G251V substitution in ORF3a (Fig. 3). Clade V strains were also found in three samples collected from imported cases in March 2020. However, most of the genomes from the patients who had a history of going abroad or of contact with imported confirmed cases were included in the G clade, which harbors the D614G substitution. The G clade is currently the most prominent clade and has been additionally subdivided into the G, GR and GH subclades. The phylogenetic relationships revealed that most of our genomes from COVID-19 patients linked to imported cases were scattered among the three subclades, with 8 cases in G, 1 case in GR, and 2 cases in GH (Fig. 3), which hinted at multiple sources of transmission from overseas.

Discussion
The outbreak of COVID-19 caused by SARS-CoV-2 has swiftly spread worldwide. However, the probable origin of SARS-CoV-2 associated with the COVID-19 pandemic is still unclear 8 . Recent reports of COVID-19 cases with no or mild upper respiratory tract symptoms suggest the potential for asymptomatic or oligosymptomatic transmission during the first week of symptoms 9,10 . Hence, there is an urgent need for rapid identification and traceability of pathogens for disease control and prevention. A deep understanding of the novel virus is first obtained through the analysis of the genome sequence. In this study, we demonstrated the utility of nanopore sequencing for SARS-CoV-2 genomes from clinical specimens based on a modified ARTIC protocol. The adopted approach allowed the confirmation of SARS-CoV-2 infections at the genomic level within a few minutes by sequencing and simultaneously mapping the reads to the reference genome and analyzing the output data in real time.
Compared with nasal/oropharyngeal swabs, the virus could be detected more readily in lower respiratory tract specimens from COVID-19 patients 11 . Our data showed that the virus genomes from different parts of the respiratory tract were consistent. However, the difference in viral loads among samples will affect the stability of  To characterize the genomic variations, we found 33 different substitution sites distributed in five coding regions among 29 SARS-CoV-2 genomes, without any recombination events. Genomic evidence supported the linkage of most of the early 15 infections with Wuhan either directly or indirectly, and 10 of the 14 genomes from the imported infections that occurred in March 2020 exhibited the specific D614G variation in the spike protein that belongs to clade G compared with the domestic strains.
The small sample size of 29 clinical samples is a significant limitation of this study. Although the number of genomes was not large, the typical enrolled cases were representative. First, this cohort included various forms of transmission, including local cases, domestic spreading from a Wuhan returnee, transmission from imported cases from multiple countries, familial cluster infections (HZCDC0012 and HZCDC0013, HZCDC6948  and HZCDC7019), and a small-scale outbreak (HZCDC0048, HZCDC0049, HZCDC0090, HZCDC0091 and  HZCDC0119). Second, the mean (± SD) age of the patients was 34.81 ± 12.79 years and was distributed between 12 and 62 years old, including 53.85% males and 46.15% females. Moreover, five different types of samples were used to test the approach of SARS-CoV-2 sequencing based on nanopore technology, including nasal and oropharyngeal swabs, sputum, tracheal aspirates, bronchoalveolar lavage fluid from the respiratory tract and fecal specimens from the digestive tract, showing a wide range of applications in different types of clinical samples.
In summary, we performed SARS-CoV-2 genome sequencing in a portable nanopore sequencer. Combined with the 8-h workflow, the genomic characteristics and the origin of the virus could be quickly determined. The rapid 5-h workflow, with 15-min fast library preparation could be applied for backward tracing of the strains out of lab, bringing genome-level molecular epidemiology analysis to the front lines of the outbreak. Therefore, based on prompt diagnosis and rapid whole-genome analysis, the swift and decisive response to the SARS-CoV-2 outbreak will benefit disease control and prevention efforts. According to the eight-hour routine workflow (Fig. 1), the purified DNA was repaired with NEBNext FFPE Repair Mix (NEB, USA), followed by DNA end preparation using NEBNext End repair/dA-tailing Module (NEB, USA) and the successive attachment of native barcodes and sequencing adapters supplied in the EXP-NBD104/114 kit (Oxford Nanopore Technologies, UK) to the DNA ends. The DNA concentration was determined with a Qubit 3.0 instrument using a dsDNA HS Assay Kit (Thermo Fisher, USA). After priming the flow cell, 60 ng of DNA per sample of the products was pooled in a DNA library with a final volume of 65 μL. Following the ligation sequencing kit (SQK-LSK109, Oxford Nanopore Technologies, UK) protocol, MinION Mk1B was used to perform genome sequencing in an R9.4.1 flow cell for 1 h per sample. For the rapid barcoding workflow, a fragmentation mixture from the SQK-RBK004 kit (Oxford Nanopore Technologies, UK) was used to attach the barcodes to the DNA ends, followed by the attachment of sequencing adapters.
After the application of read quality control procedures, the artic-ncov2019 pipeline (https ://artic .netwo rk/ ncov-2019) was applied to perform sequence mapping, primer trimming, variant calling and consensus assembly building. Variations were called using Medaka 0.11.1 (https ://githu b.com/nanop orete ch/medak a). In the stage of consensus assembly building, sites with a depth lower than 50 × were masked by N bases, and the reference was substituted by homozygous variations with a Phred quality ≥ 20. "Samtools depth" was used to calculate the depth of each site, and "Samtools bedcov" was used to calculate the window depth for scanning in the genome 12 .
Read preprocessing and variant calling for Illumina sequencing. Raw Illumina PE reads were trimmed and subjected to quality control with the software fastap 0.20.0 with default parameters 13 . Bwa 0.7.17-r1188 14 was used to map the clean reads to the SARS-CoV-2 reference genome, and SAM/BAM files were manipulated by using SAMtools 1.9 12 . Variations were detected with the program "mpileup and calling" from bcftools 1.9 15 . Variations were considered positive when they exhibited a Phred quality value ≥ 20 and a depth ≥ 50.
Phylogeny and variant analysis. To remove bias from the gaps in the incomplete genome, sequence alignment according to all SNP sites was chosen to build the phylogenetic tree. First, all SNPs were called with alignment to SARS-CoV-2 reference sequences using the "nucmer" and "dnadiff " programs from MUMmer 3.23 16 , and the effect of the SNPs was estimated using SnpEff 4.3t 17 . Second, all SNP sites were connected to a single sequence for every sample based on the variant calling results from the last step. Then, these sequences were combined to perform phylogenetic analysis, and maximum likelihood phylogenies were estimated by using FastTree 2.1.10 18 with the default parameters. The phylogenetic tree with the variation heatmap matrix was drawn by using phyD3 19 . The group and clade numbers were assigned to achieve consistency with earlier studies.

Data availability
All genome sequences included in this study are available from GISAID (https ://gisai d.org) (the full list of the accession numbers is available in Table 1). Scientific Reports | (2020) 10:17492 | https://doi.org/10.1038/s41598-020-74656-y www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.