First complete genome sequences of Streptococcus pyogenes NCTC 8198T and CCUG 4207T, the type strain of the type species of the genus Streptococcus: 100% match in length and sequence identity between PacBio solo and Illumina plus Oxford Nanopore hybrid assemblies

We present the first complete, closed genome sequences of Streptococcus pyogenes strains NCTC 8198T and CCUG 4207T, the type strain of the type species of the genus Streptococcus and an important human pathogen that causes a wide range of infectious diseases. S. pyogenes NCTC 8198T and CCUG 4207T are derived from deposits of the same strain at two different culture collections. NCTC 8198T was sequenced using a PacBio platform, and the genome sequence was assembled de novo using HGAP. CCUG 4207T was sequenced using Illumina and Oxford Nanopore platforms, and a de novo hybrid assembly combining both sets of sequence reads was generated using SPAdes. Both strategies yielded closed genome sequences of 1,914,862 bp, identical in length and sequence. Combining short-read Illumina and long-read Oxford Nanopore sequence data compensated for the expected error rate of the nanopore sequencing technology, producing a genome sequence indistinguishable from the one determined with PacBio. Sequence analyses revealed five prophage regions, a CRISPR-Cas system, numerous virulence factors and no relevant antibiotic resistance genes. These two complete genome sequences of the type strain of S. pyogenes will serve as valuable taxonomic and genomic references for infectious disease diagnostics, as well as references for future studies and applications within the genus Streptococcus.


Introduction
Streptococcus pyogenes, within the β-hemolytic, Lancefield group A Streptococcus (GAS) (Lancefield, 1933), is a clinically relevant, strictly human pathogen causing a wide range of diseases, including local and invasive infections (e.g., throat, skin infections, meningitis), severe toxin-mediated diseases (e.g., necrotizing fasciitis, scarlet fever, streptococcal toxic shock syndrome) and immune-mediated diseases (e.g., rheumatic fever, [...]). In 2005, it was estimated that more than 500,000 people were dying every year from severe diseases caused by GAS, in addition to an estimated 600 million new cases of pharyngitis and 100 million new cases of pyoderma (Carapetis et al., 2005). Thus, S. pyogenes is among the top-10 infectious causes of mortality in humans (Barnett et al., 2019). Moreover, S. pyogenes is the type species of the genus Streptococcus, the type genus of the family Streptococcaceae, and, as a clinically relevant bacterium, it has been studied continuously since it was first described (Rosenbach, 1884).

In recent decades, several next-generation and third-generation (i.e., long-read) sequencing technologies have emerged and are now widely used in many settings (Loman and Pallen, 2015). For instance, Illumina has led the field in high-throughput DNA sequencing by providing highly accurate and relatively inexpensive sequence reads. However, their short lengths (a few hundred base pairs) limit their ability to resolve problematic genomic regions (e.g., repeats, ribosomal operons, long sequence motifs), sometimes yielding fragmented and incomplete assemblies (Goodwin et al., 2016). Meanwhile, PacBio provides long reads (several kilobase pairs) with high consensus accuracy, generally yielding complete bacterial genome sequences.
However, the high capital costs of PacBio platforms have limited their accessibility for users, who normally access them through commercial or institutional sequencing services. Additionally, the requirement for large quantities of high-quality DNA makes PacBio sequencing relatively laborious, time-consuming and impractical for some applications. More recently, Oxford Nanopore Technologies launched the MinION portable sequencer, which provides ultra-long reads of as many as two million base pairs (Payne et al., 2018) and requires only simple, rapid and cost-effective DNA library preparation protocols. Nanopore sequencing has been demonstrated to resolve very long repetitive regions that not even PacBio sequencing could resolve (Schmid et al., 2018). However, high initial error rates (>30%; currently ~7%) [...]

[...] Gothenburg, Sweden) on Chocolate agar medium (Brain Heart Infusion Agar with 10% heat-lysed defibrinated horse-blood, 15% horse-serum and fresh yeast extract, prepared by the Substrate Unit, Department of Clinical Microbiology, Sahlgrenska University Hospital), with 5% CO2, at 37°C. Genomic DNA was extracted from fresh pure biomass, using a Wizard® Genomic DNA Purification Kit (Promega, Madison, WI, USA) for Illumina sequencing, and a modified version (Salvà-Serra et al., 2018) of a previously described protocol (Marmur, 1961) for Oxford Nanopore sequencing (Figure 1). [...]

[...] quality score threshold of Q30. Meanwhile, the FAST5 files containing the raw data generated by the Oxford Nanopore sequencing run were processed with the Oxford Nanopore basecalling pipeline, Albacore, version 2.0.2, and the quality was analyzed with NanoPlot, version 1.13.0 (De Coster et al., 2018). Only reads with a quality score greater than Q7 were used for the assembly (i.e., classified as Pass by Albacore).
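For orientation, the Q30 and Q7 thresholds mentioned above can be translated into error probabilities with the standard Phred relation Q = -10·log10(p). The following minimal Python sketch is an illustration only and is not part of the pipeline described here; the function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into an estimated error probability p."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) into a Phred quality score Q."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The two thresholds named in the text.
    for q in (7, 30):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Under this relation, Q30 corresponds to roughly one error per 1,000 bases, whereas Q7 corresponds to roughly one error per five bases, which illustrates why the long reads alone were not expected to yield an accurate consensus.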
Afterwards, a hybrid de novo assembly, using both Illumina and Oxford Nanopore reads, was performed with SPAdes, version 3.11.0 (Bankevich et al., 2012; Antipov et al., 2016). The assembly was performed with the --careful flag enabled, which maps the Illumina reads back to the assembly with BWA, version 0.7.12-r1039 (Li and Durbin, 2009), to reduce the number of mismatches and short indels. The ends of the assembly were trimmed manually, and the sequence was reorganized to start with dnaA, as was done for the PacBio assembly. Assembly statistics were obtained, using QUAST, [...]

Once assembled, closed and completed, the genome sequences of S. pyogenes NCTC 8198T and S. pyogenes CCUG 4207T were compared (Figure 1). Firstly, both genome assemblies were aligned, using BLASTN, version 2.2.10 (Altschul et al., 1990). Secondly, all the raw Illumina paired-end reads were mapped against the complete and closed genome sequences, using CLC Genomics Workbench, version 10.0 (Qiagen Aarhus A/S, Aarhus, Denmark), and a Basic Variant Detection 2.0 analysis was performed with the same software, using a minimum frequency of 35% (default).
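Because a closed bacterial chromosome is circular, two assemblies of the same genome may start at different positions, which is why both were reorganized to start with dnaA before comparison. The sketch below illustrates the idea in Python under simplifying assumptions: it considers the forward strand only, uses exact string matching instead of BLASTN, and uses a short hypothetical motif as a stand-in for the start of dnaA. It is not the procedure used in this study.

```python
def rotate_to_start(sequence: str, anchor: str) -> str:
    """Rotate a circular sequence so that it begins at the first occurrence of anchor."""
    if not anchor or len(anchor) > len(sequence):
        raise ValueError("anchor must be non-empty and no longer than the sequence")
    doubled = sequence + sequence  # lets a match span the original end/start junction
    position = doubled.find(anchor)
    if position == -1:
        raise ValueError("anchor not found in circular sequence")
    return doubled[position:position + len(sequence)]


def same_circular_sequence(a: str, b: str) -> bool:
    """Return True if b is a rotation of a (same length, same order of bases)."""
    return len(a) == len(b) and b in (a + a)


if __name__ == "__main__":
    # Toy sequences; "ACAG" is a hypothetical stand-in for the start of dnaA.
    assembly_a = "GATTACAGG"
    assembly_b = "ACAGGGATT"
    print(rotate_to_start(assembly_a, "ACAG") == assembly_b)
    print(same_circular_sequence(assembly_a, assembly_b))
```

In practice the anchor gene may lie on the reverse strand, in which case the sequence would first have to be reverse-complemented; that step is deliberately left out of this sketch.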

Genome sequence assemblies
The assembly of the PacBio reads with HGAP, followed by a polishing step performed with Quiver, yielded a complete and closed sequence. Trimming of the ends (i.e., overlapping redundant sequences) resulted in a final sequence of 1,914,862 bp, representing the genome of S. pyogenes NCTC 8198T. In parallel, the de novo hybrid assembly of the trimmed Illumina sequence reads plus the basecalled Oxford Nanopore reads also resulted in a complete and closed sequence. The trimming of the ends yielded a final sequence of 1,914,862 bp. Analysis performed with QUAST confirmed that neither assembly contained any gaps (i.e., no N's). [...]

The latest annotation (i.e., S. pyogenes CCUG 4207T annotated with PGAP version 4.7 and available in RefSeq) revealed a total of 2,009 genes, of which 1,920 were CDSs. The annotation detected 89 RNA genes, which included 67 tRNA genes, four non-coding RNA genes and 18 ribosomal genes distributed in six complete ribosomal operons. Additionally, 60 pseudogenes were annotated. A total of 306 genes (14.8%) were annotated as "hypothetical proteins". A genome atlas of this annotation version is depicted in Figure 2.

Prophages
Five putative prophages were detected using the software PHASTER, four marked by the software as 'intact' (completeness score > 90) and one as 'questionable' (completeness score = 70-90), with a GC content ranging from 37.4 to 39.1% (Table 3). The largest prophage region was 56,926 bp and the shortest 41,886 bp. Overall, the five regions add up to 234,671 bp, which represents 12.3% of the genome size (Figure 2). In total, according to the PGAP annotation, the five prophage regions encompass 321 CDSs, which represents 16.72% of the total 1,920 CDSs annotated by PGAP in the genome sequence.
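The percentages quoted above follow directly from the reported totals. As a quick consistency check, the arithmetic can be reproduced with a few lines of Python; all input values are taken from the text, and the variable and function names are illustrative only.

```python
# Values reported in the text.
GENOME_LENGTH_BP = 1_914_862
PROPHAGE_TOTAL_BP = 234_671
PROPHAGE_COUNT = 5
PROPHAGE_CDS = 321
TOTAL_CDS = 1_920


def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


genome_fraction = percentage(PROPHAGE_TOTAL_BP, GENOME_LENGTH_BP)
cds_fraction = percentage(PROPHAGE_CDS, TOTAL_CDS)
mean_prophage_bp = PROPHAGE_TOTAL_BP / PROPHAGE_COUNT

if __name__ == "__main__":
    print(f"Prophage share of genome length: {genome_fraction:.1f}%")
    print(f"Prophage share of CDSs: {cds_fraction:.2f}%")
    print(f"Mean prophage region length: {mean_prophage_bp:.1f} bp")
```

The mean region length of about 46.9 kb falls between the reported shortest (41,886 bp) and largest (56,926 bp) regions, as it should.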

CRISPR-Cas systems
The analysis of the genome with CRISPRFinder revealed the presence of a CRISPR array [...]

The fact that two wholly independent approaches have yielded identical sequences demonstrates the high quality of these genome sequences. On the one hand, the NCTC strain was sequenced using a PacBio RSII platform, which, as expected, yielded long and relatively inaccurate raw sequence reads (indicated in Table 1 by the low Phred quality scores). However, because sequencing errors are randomly distributed, they can be corrected during assembly by high coverage (Koren et al., 2013). On the other hand, the CCUG strain was sequenced with the highly accurate Illumina HiSeq 2500 and the Oxford Nanopore MinION device, the latter of which, as expected, yielded long raw sequence reads with low accuracy (indicated in Table 1). Interestingly, Nanopore reads exhibit a higher average Phred quality score than PacBio reads. [...]

[...] in the genome sequence of S. pyogenes CCUG 4207T (= NCTC 8198T). Additionally, high similarity was found between the sequences of the spacers of the CRISPR arrays and three of the prophage regions, one of them being 100% identical. However, we detected a frameshift, caused by a single-nucleotide deletion, in the cas3 gene, which encodes an endonuclease/helicase that is essential for CRISPR-Cas interference (Brouns et al., 2008). Therefore, the CRISPR-Cas system is most likely non-functional owing to this truncation, leaving the way open for infection by bacteriophages, which could explain the presence of five prophage regions inserted within the chromosome. In addition, the small number of spacers suggests that the CRISPR-Cas system might also be inactive in the acquisition of new spacers. [...]

[...] was isolated from a throat swab from a scarlet fever patient (Griffith, 1926). For many years, scarlet fever has been associated with S. pyogenes strains producing pyrogenic toxins (Dick and Dick, 1924).
In accordance with this, several streptococcal pyrogenic exotoxins have been found, one of them encoded in a prophage region. Additionally, several of the other virulence factors have also been found encoded in prophage regions, highlighting the role of prophages in the pathogenicity of S. pyogenes strains. In any case, further studies will be needed to uncover the full pathogenic potential of this strain, with emphasis on revealing the role of the still high number of CDSs annotated as 'hypothetical proteins', which represent 15% of the 2,009 genes annotated.

The only antibiotic resistance gene detected in the genome sequence was a gene encoding an ABC transporter ATP-binding protein, the overexpression of which has been associated with fluoroquinolone resistance. This lack of significant antibiotic resistance genes was expected, as the type strain of S. pyogenes was isolated before the antibiotic era (i.e., before 1928). In addition, antibiotic resistance is currently not as great a problem in S. pyogenes as it is in other bacterial species and taxa, with penicillin remaining the drug of choice despite numerous decades of use (Spellerberg and Brandt, 2016).

Here we present the first complete genome sequences of the type strain of S. pyogenes (NCTC 8198T = CCUG 4207T = S.F. 130T), the type species of the genus Streptococcus, the type genus of the family Streptococcaceae, and a major human pathogen. These genome sequences represent the reference genomic material to be used in taxonomic studies involving this family and its members. Additionally, we have shown that combining high-quality, short Illumina sequence reads with long Oxford Nanopore sequence reads can generate a complete genome sequence identical to the one obtained with PacBio sequencing alone.