Complete genome sequences of Streptococcus pyogenes type strain reveal 100%-match between PacBio-solo and Illumina-Oxford Nanopore hybrid assemblies

We present the first complete, closed genome sequences of Streptococcus pyogenes strains NCTC 8198T and CCUG 4207T, the type strain of the type species of the genus Streptococcus and an important human pathogen that causes a wide range of infectious diseases. S. pyogenes NCTC 8198T and CCUG 4207T are derived from deposits of the same strain at two different culture collections. NCTC 8198T was sequenced using a PacBio platform, and the genome sequence was assembled de novo using HGAP. CCUG 4207T was sequenced using Illumina and Oxford Nanopore platforms, and a de novo hybrid assembly combining both read sets was generated using SPAdes. Both strategies yielded closed genome sequences of 1,914,862 bp, identical in length and sequence. Combining short-read Illumina and long-read Oxford Nanopore sequence data circumvented the expected error rate of the nanopore sequencing technology, producing a genome sequence indistinguishable from the one determined with PacBio. Sequence analyses revealed five prophage regions, a CRISPR-Cas system, numerous virulence factors and no relevant antibiotic resistance genes. These two complete genome sequences of the type strain of S. pyogenes will serve as valuable taxonomic and genomic references for infectious disease diagnostics, as well as references for future studies and applications within the genus Streptococcus.

PacBio sequencing. Genomic DNA of S. pyogenes NCTC 8198T was sheared with a 26 G blunt Luer-Lok needle and used to prepare two 10 to 20-kb PacBio SMRT libraries, following the manufacturer's recommendations. The libraries were sequenced using the P6-C4 chemistry on a Single Molecule, Real-Time (SMRT) cell, using a PacBio RSII platform (Pacific Biosciences of California, Inc., Menlo Park, CA, USA) (www.pacb.com), at the Wellcome Trust Sanger Institute (Hinxton, UK).
Illumina sequencing. Genomic DNA of S. pyogenes CCUG 4207T was used to prepare a standard Illumina library, with an insert size ranging from 130 to 680 bp, following an optimized protocol (GATC Biotech, Konstanz, Germany) and using standard Illumina adapter sequences. The library was sequenced at GATC Biotech (Konstanz, Germany), using an Illumina HiSeq 2500 instrument (Illumina, Inc., San Diego, CA, USA) (www.illumina.com) to generate paired-end reads of 126 bp.
PacBio de novo assembly. PacBio sequence reads from both SMRT sequencing runs were used in the assembly. Read quality was assessed using NanoPlot version 1.13.0 15 . Sequence reads were auto-error-corrected and assembled de novo using the Hierarchical Genome Assembly Process (HGAP) version 3 16 . The assembled sequence was polished with the consensus calling algorithm Quiver, version 1. The ends of the final assembly were trimmed manually (i.e., eliminating sequence redundancy) to circularize the genome, and the genome
Scientific Reports | (2020) 10:11656 | https://doi.org/10.1038/s41598-020-68249-y www.nature.com/scientificreports/
Genome annotations and characterization. The genome sequence of S. pyogenes NCTC 8198T was initially annotated with Prokka 23 , and submitted to the European Nucleotide Archive 24 . Afterwards, the genome sequence was re-annotated with the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), version 4.1 25 , and deposited in the NCBI Reference Sequence (RefSeq) database 26 . The genome sequence of S. pyogenes CCUG 4207T was submitted to GenBank 27 . Subsequently, the sequence was annotated with PGAP, version 4.7, and deposited in RefSeq. The latest of these annotations (i.e., PGAP version 4.7) was used for further analyses and to construct a genome atlas with the on-line server GView (www.gview.ca) 28 . The on-line tool PHASTER (www.phaster.ca) 29 was used to search for prophages inserted in the chromosome, while the tool CRISPRFinder 30 was used to search for clustered, regularly interspaced short palindromic repeat (CRISPR) arrays. The consensus sequences of the direct repeats were classified using CRISPRmap v2.1.3-2014 31,32 , and the crRNA-encoding strand was determined using CRISPRstrand 32 , implemented in CRISPRmap. Additionally, the tool CRISPRone 33 was used to confirm the detected CRISPR arrays and to identify possible CRISPR-associated genes (cas).
Spacer sequences of CRISPR arrays were analysed with BLASTN, version 2.2.10 22 , against the complete genome sequence of S. pyogenes NCTC 8198T (= CCUG 4207T). Searches for virulence factors across the genome were done against the protein sequences of the curated core dataset (3,200
The final complete and closed genome sequence of S. pyogenes CCUG 4207T was analysed, using BLASTN, against the complete genome sequence of S. pyogenes NCTC 8198T. The analysis yielded a match of 1,914,862 bp with 100% identity. Afterwards, for further quality control, the entire set of raw Illumina paired-end reads (i.e., 1,292.9 Mb, coverage: 675 X) was mapped against each of the two complete genome sequences. A variant calling analysis was performed, and no variants were found in either case. Thus, the two independent and parallel strategies of sequencing and assembly resulted in genome sequences of identical length (1,914,862 bp) and identical sequence, with a GC content of 38.5% (Table 2).
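The two quality-control checks described above can be illustrated with a short sketch. It is not the authors' pipeline (they used BLASTN, read mapping and variant calling); it only reproduces the coverage arithmetic from the reported figures and shows one simple way to test whether two closed circular sequences are identical regardless of where each assembly was linearised. The function names and the toy sequences are assumptions made for illustration.

```python
# Minimal sketch, not the authors' workflow: coverage arithmetic plus a
# rotation-aware identity check for two circular genome sequences.

GENOME_LENGTH_BP = 1_914_862      # reported length of both assemblies
ILLUMINA_YIELD_BP = 1_292.9e6     # reported Illumina yield (1,292.9 Mb)


def sequencing_depth(total_yield_bp: float, genome_length_bp: int) -> float:
    """Average depth of coverage = total sequenced bases / genome length."""
    return total_yield_bp / genome_length_bp


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def same_circular_sequence(a: str, b: str) -> bool:
    """True if b equals a up to rotation and/or strand.

    A circular molecule can be written starting at any position and on
    either strand, so b (or its reverse complement) must occur in a + a.
    """
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


depth = sequencing_depth(ILLUMINA_YIELD_BP, GENOME_LENGTH_BP)
print(f"Illumina depth: {depth:.0f} X")

# Toy example (hypothetical sequences): same circle, different start point.
print(same_circular_sequence("ACGTTG", "TTGACG"))
```

For real genomes of about 1.9 Mb the substring test is still fast in practice, but an alignment-based comparison such as the BLASTN analysis used in the study additionally reports where any differences lie.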
Characterization of the complete genome sequence. Annotation. The latest annotation (i.e., S. pyogenes CCUG 4207T annotated with PGAP version 4.7 and available in RefSeq) revealed a total of 2,009 genes, of which 1,920 were coding sequences (CDSs). The annotation detected 89 RNA genes, which included 67 tRNA genes, four non-coding RNA genes and 18 ribosomal genes distributed in six complete ribosomal operons. Additionally, 60 pseudogenes were annotated. A total of 306 genes (14.8%) were annotated as "hypothetical proteins". A genome atlas of this annotation version is depicted in Fig. 2.
Prophages. Five putative prophages were detected using the software PHASTER, four marked by the software as 'intact' (completeness score > 90) and one as 'questionable' (completeness score = 70-90), with a GC content ranging from 37.4 to 39.1% (Table 3). The largest prophage region was 56,926 bp and the shortest 41,886 bp. Overall, the five regions add up to 234,671 bp, which represents 12.3% of the genome size (Fig. 2). In total, according to the PGAP annotation, the five prophage regions encompass 321 CDSs, which represents 16.72% of the total 1,920 CDSs annotated by PGAP in the genome sequence.
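The two proportions quoted above follow directly from the reported totals. The sketch below recomputes them; the constant names are illustrative, and all values are taken from the text.

```python
# Recompute the prophage proportions from the totals reported in the text.

GENOME_LENGTH_BP = 1_914_862   # closed genome length
PROPHAGE_TOTAL_BP = 234_671    # summed length of the five prophage regions
TOTAL_CDS = 1_920              # CDSs annotated by PGAP
PROPHAGE_CDS = 321             # CDSs located within the prophage regions


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


genome_share = percent(PROPHAGE_TOTAL_BP, GENOME_LENGTH_BP)
cds_share = percent(PROPHAGE_CDS, TOTAL_CDS)

print(f"Prophage share of genome length: {genome_share:.1f}%")
print(f"Prophage share of CDSs: {cds_share:.2f}%")
```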
CRISPR-Cas systems. The analysis of the genome with CRISPRFinder revealed the presence of a CRISPR array (positions: 1,317,644-1,317,938 bp), composed of five direct repeats of 32 bp and four spacers of sizes ranging from 33 to 35 bp. The consensus sequence of the direct repeats was classified, with CRISPRmap, into the family 5 and structure motif 3 (Fig. 3). The analysis with CRISPRone revealed seven CRISPR-associated (cas) genes, located adjacent to the CRISPR array (cas3, cas5, cas8c, cas7, cas4, cas1 and cas2; locus tags: DB248_RS07080-DB248_RS07050). This architecture corresponds to the Class 1, subtype I-C of the updated evolutionary classification of CRISPR-Cas systems 36 . However, cas3 was frame-shifted due to a single-nucleotide deletion. The frameshift was confirmed by manually inspecting the mapped Illumina reads.
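As a consistency check, the reported array coordinates can be compared with the repeat and spacer counts. The sketch below assumes that the positions are 1-based and inclusive (an assumption; the text does not state the coordinate convention) and that the array consists only of alternating repeats and spacers.

```python
# Consistency check of the CRISPR array layout, assuming 1-based inclusive
# coordinates and an array made only of alternating repeats and spacers.

ARRAY_START, ARRAY_END = 1_317_644, 1_317_938
N_REPEATS, REPEAT_LENGTH_BP = 5, 32
N_SPACERS = 4


def interval_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


array_length = interval_length(ARRAY_START, ARRAY_END)
spacer_total = array_length - N_REPEATS * REPEAT_LENGTH_BP
mean_spacer_length = spacer_total / N_SPACERS

print(f"Array length: {array_length} bp")
print(f"Bases left for spacers: {spacer_total} bp")
print(f"Mean spacer length: {mean_spacer_length:.2f} bp")
```

Under these assumptions the mean spacer length falls inside the reported 33 to 35 bp range, so the coordinates and the element counts agree.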
The BLASTN analyses of the spacer sequences against the whole genome sequence revealed that the sequence of the second spacer of the CRISPR array (positions 1,317,807-1,317,839 bp) shows 100% identity against a sequence of the prophage region SF130.1 (positions: 551,171-551,203; identities = 33/33), located in a gene encoding a phage predominant capsid protein (DB248_RS03050). Additionally, the sequence of the third spacer shows high similarity to a sequence of the prophage region SF130.5 (positions: 1,388,239-1,388,206 bp; identities: 32/34), which is part of a gene encoding a hypothetical protein (DB248_RS07380). Finally, the sequence of the fourth spacer presents a high degree of homology to a sequence of the prophage region SF130.2 (positions: 850,316-850,348; identities = 32/33), which is also part of a gene encoding a hypothetical protein (DB248_RS04545).
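The identities reported for the three spacer hits can be expressed as percentages, and the hit coordinates can be checked against the alignment lengths. This is a small illustrative calculation from the figures in the text, not a re-run of BLASTN; a start position larger than the end position is taken to indicate a hit on the reverse strand.

```python
# Percent identity and hit length for the three spacer-to-prophage matches.
# Tuples: (spacer, prophage region, hit start, hit end, matches, aligned length)
SPACER_HITS = [
    ("spacer 2", "SF130.1", 551_171, 551_203, 33, 33),
    ("spacer 3", "SF130.5", 1_388_239, 1_388_206, 32, 34),  # reverse-strand hit
    ("spacer 4", "SF130.2", 850_316, 850_348, 32, 33),
]


def hit_length(start: int, end: int) -> int:
    """Length of an inclusive interval, whichever strand it is reported on."""
    return abs(end - start) + 1


def percent_identity(matches: int, aligned_length: int) -> float:
    """Identical positions as a percentage of the aligned length."""
    return 100.0 * matches / aligned_length


for spacer, region, start, end, matches, aligned in SPACER_HITS:
    print(
        f"{spacer} vs {region}: {hit_length(start, end)} bp hit, "
        f"{percent_identity(matches, aligned):.1f}% identity"
    )
```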
Additionally, four putative cas genes (cas9, cas1, cas2, csn2; locus tags: DB248_RS04305-DB248_RS04320) were located between positions 812,387 and 818,352, adjacent to a gene encoding a phage integrase of the
Table 1. Results of whole-genome sequencing of S. pyogenes NCTC 8198T and CCUG 4207T, from the four sequencing runs done with PacBio RSII, Illumina HiSeq 2500 and Oxford Nanopore MinION platforms. The total number of reads, total yield (Mb), sequencing depth, average read length (bp), read length N50 (bp), longest read (bp) and average Phred quality score are shown for each sequencing run. The SRA accession number of each run is indicated.
Table S1), several of them within the prophage regions. One of the genes was emm, encoding the surface protein M (DB248_RS09295), one of the major virulence factors of S. pyogenes, which provides protection against the immune system and has been used for serological typing of strains 37 . The BLASTN analysis of the sequence of the emm gene against the emm database of the Streptococcus Laboratory (CDC, USA) confirmed that it is type 1.0. Genes related to the synthesis of the hyaluronic acid capsule were also found (hasA, DB248_RS09950; hasB, DB248_RS09955; hasC, DB248_RS09965). This polysaccharide capsule is a key virulence factor involved in adhesion and tissue invasion 38 , as well as in molecular mimicry for immune evasion 39 .
A gene encoding the hyaluronate lyase HylA was detected (DB248_RS04240). HylA has been suggested to facilitate the spread of large molecules and to play a nutritional role for S. pyogenes, by disrupting host tissue as well as its own capsule, allowing growth on hyaluronic acid as a carbon source 44 . In addition, four prophage-associated hyaluronidase-encoding genes were found (DB248_RS03100, DB248_RS04540, DB248_RS06545, DB248_RS07275), located in the prophage regions SF130.1, SF130.2, SF130.4 and SF130.5, respectively, which may act as additional spreading factors 45 . These enzymes seem to help phages penetrate the hyaluronic acid capsule 46 . Additionally, an ideS/mac gene, encoding an immunoglobulin G-degrading enzyme, was found (DB248_RS03795). This exoenzyme shields the cells from being opsonized by IgG antibodies, by cleaving their heavy chain 47 . A gene encoding a C5a peptidase was also found (scpA, DB248_RS09275). This peptidase degrades the chemotactic complement factor C5a 48 , thus preventing the C5a-based recruitment of neutrophils and other inflammatory cells to the site of infection 49 . Moreover, a gene encoding streptokinase A (ska, DB248_RS09135) was found, which catalyses the conversion of plasminogen to plasmin, a serine protease that facilitates tissue invasion by degrading proteins of the extracellular matrix 50 . Additionally, several genes encoding putative streptococcal exotoxins were detected: speA (DB248_RS05660), encoding a pyrogenic exotoxin type A and located in the prophage region SF130.3; speB (DB248_RS09375), encoding a pyrogenic exotoxin type B; speG (DB248_RS01135), encoding a pyrogenic exotoxin type G; speJ (DB248_RS01990), encoding a pyrogenic exotoxin type J precursor; and smeZ (DB248_RS09220), encoding the streptococcal mitogenic exotoxin Z.
Genes for pili biosynthesis were also detected (cpa, DB248_RS00770; lepA, DB248_RS00775; fctA, DB248_RS00780; srtC1, DB248_RS00785; fctB, DB248_RS00790), which have been shown to play roles in biofilm formation and attachment to pharyngeal cells 51 . Moreover, several genes encoding putative adhesion-related proteins were found (e.g., fibronectin-binding proteins), such as fbaA (DB248_RS09270), encoding an F2-like fibronectin-binding protein, or fbp54 (DB248_RS04160). FBP54 has been shown to play a role in the adhesion of GAS to host cells 52,53 . An lbp gene was also detected (DB248_RS09265), encoding the laminin-binding protein Lbp, involved in adhesion to epithelial cells 54 and suggested to play a role in zinc homeostasis 55 . A grab gene, encoding a G-related α2-macroglobulin-binding protein (GRAB), was also detected (DB248_RS06200). GRAB is a surface protein that inhibits unwanted proteolysis through a high affinity for α2-macroglobulin, a proteinase inhibitor of human plasma 56 .
Antibiotic resistances. The analysis of the genome sequences with RGI (CARD) revealed one gene related to antibiotic resistance and classified as "strict". The gene encodes a putative ABC transporter ATP-binding protein (locus tag: DB248_01185), located downstream of a gene encoding another ABC transporter ATP-binding protein (DB248_01180). The gene products show 67% and 66% sequence identity to the ABC transporters PatB and PatA of S. pneumoniae TIGR4, whose overexpression has been linked to fluoroquinolone resistance 57 .

Discussion
Here we present the first complete genome sequence of S. pyogenes NCTC 8198T = CCUG 4207T, the type strain of the type species of the genus Streptococcus, the type genus of the family Streptococcaceae. The sequence has been determined twice, using two fully independent but parallel strategies: PacBio-solo and Illumina plus Oxford Nanopore sequencing. Both strategies have yielded 100% identical complete genome sequences, thus demonstrating that hybrid approaches can fully compensate for the error rate of long-read sequences. The type strain of S. pyogenes (NCTC 8198T = CCUG 4207T) was isolated as strain S.F. 130 in Manchester, UK, from a throat swab of a scarlet fever case. The strain was provided by William W. C. Topley (University of Manchester) to Frederick Griffith (Pathological Laboratory of the Ministry of Health), who used it for the preparation of Type 1 agglutination sera, in a study of scarlatinal streptococci 58 . In 1950, the strain was deposited at the NCTC by Robert E. O. Williams (Central Public Health Laboratory, Colindale, London, UK) and, in 1974, the NCTC strain was deposited at the CCUG (Fig. 1). Having been available to the scientific community for decades, the strain has served as a taxonomic reference point and has been used in numerous studies. Today, the strain is also preserved and publicly available in other culture collections, e.g.  59 . However, despite the clinical relevance of the species and the taxonomic importance of the type strain, this is the first complete genome sequence of the type strain of S. pyogenes that has been determined.
Following sequencing, two different sequence assembly strategies were used. While S. pyogenes NCTC 8198T was assembled de novo using only PacBio reads, S. pyogenes CCUG 4207T was assembled de novo using both short Illumina and long Oxford Nanopore reads. Surprisingly, both approaches yielded fully identical complete genome sequences of 1,914,862 bp. Recently, numerous studies have reported high-quality genome assemblies obtained by combining high-quality Illumina reads and long Oxford Nanopore reads 60-62 . However, to our knowledge, this is the first study reporting two identical complete genome sequences determined with different methodologies.
The fact that two wholly independent approaches have yielded identical sequences demonstrates the high quality of these genome sequences. On the one hand, the NCTC strain was sequenced using a PacBio RSII platform, which, as expected, yielded long and relatively inaccurate raw sequence reads (indicated in Table 1 by the low Phred quality scores). However, because these sequencing errors are randomly distributed, they can be corrected during assembly, given high coverage 63 . On the other hand, the CCUG strain was sequenced with the highly accurate Illumina HiSeq 2500 and the Oxford Nanopore MinION device, which, as expected, yielded long raw sequence reads with low accuracy (indicated in Table 1). Interestingly, Nanopore reads exhibit a higher average Phred quality score than PacBio reads. However, because Nanopore errors are less randomly distributed (e.g., misinterpretation of homopolymers) 64 , the inclusion of the high-quality Illumina reads, during or after the assembly, is crucial to obtain an accurate genome sequence.
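For readers less familiar with Phred quality scores, the standard definition relates a score Q to the estimated per-base error probability P by Q = -10 log10(P). The sketch below converts between the two; the example scores are illustrative and are not the values reported in Table 1.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated per-base error probability for Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for a per-base error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# Illustrative scores only (not taken from Table 1).
for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.3f}, accuracy {100 * (1 - p):.1f}%")
```

A score of 10 therefore corresponds to roughly one error in ten bases and a score of 30 to one in a thousand. Because the scale is logarithmic, averages of per-base Phred scores should be compared with some care.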
These results demonstrate how a hybrid strategy, combining Illumina and Oxford Nanopore sequencing, can provide results as accurate as high-coverage PacBio-solo sequencing. This should help to reduce the scepticism generated by the initially high error rate of Oxford Nanopore 10-12 . In this study, the identical results obtained by both strategies are an indicator of the high quality and accuracy of this genome sequence, which makes it a definitive genomic reference of the species as well as a good candidate model for future evaluations of bacterial genome assemblers.
Furthermore, it is noteworthy that the hybrid assembly was performed with SPAdes (i.e., a relatively user-friendly, well-established and widely used de novo genome assembler), as the assembly itself only required a single command line and did not involve complex and tedious methodologies. Nevertheless, further strategies have been developed in recent years to perform de novo hybrid assemblies combining short and long reads (e.g., Unicycler and MaSuRCA), aiming to cover the shortcomings of each technology with the advantages of the other 65,66 . Alternative strategies involve Nanopore-only de novo assemblies 67 , which can afterwards be 'polished' by other pieces of software (e.g., Pilon and/or Racon) in order to improve their accuracy 68,69 . In addition, all these strategies and methodologies can be complementary to each other, as a protocol that performs well under some conditions may perform poorly under others. For instance, SPAdes relies on the short reads to create a first draft assembly and afterwards performs a scaffolding step, while Canu performs a long-read-only de novo assembly 67 , which can optionally be followed by a polishing step with short reads, using software like Pilon 68 .
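To illustrate what a single-command hybrid assembly of this kind can look like, the sketch below only builds (and does not run) a SPAdes command line for paired-end Illumina reads plus Nanopore reads. The exact command and options used in the study are not given in this section, so the option set shown here (`-1`, `-2`, `--nanopore`, `-o`) is an assumption based on the general SPAdes command-line interface, and all file and directory names are hypothetical.

```python
import shlex
from typing import List


def build_spades_hybrid_command(
    illumina_r1: str, illumina_r2: str, nanopore_reads: str, output_dir: str
) -> List[str]:
    """Build an argument list for a SPAdes hybrid assembly.

    Assumed options: -1/-2 for paired-end short reads, --nanopore for long
    reads, -o for the output directory. Not necessarily the authors' command.
    """
    return [
        "spades.py",
        "-1", illumina_r1,
        "-2", illumina_r2,
        "--nanopore", nanopore_reads,
        "-o", output_dir,
    ]


# Hypothetical file names, for illustration only.
command = build_spades_hybrid_command(
    "illumina_R1.fastq.gz",
    "illumina_R2.fastq.gz",
    "nanopore.fastq.gz",
    "hybrid_assembly",
)
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the argument list separately from executing it keeps the sketch runnable without SPAdes installed; the list could be passed to `subprocess.run` in an environment where the assembler is available.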
In any case, despite the lack of differences between the two determined genome sequences, genomic variations could have been expected, as there have been several passages between strain NCTC 8198T and strain CCUG 4207T, and cultivation conditions and DNA preparation methods differed between the two culture collections. These circumstances evidently increase the probability of natural genotypic and phenotypic changes 70-72 . As a practical example, the ATCC recommends that users perform no more than five passages from ATCC Genuine Cultures. For this reason, i.e., to reduce the risk of natural alterations, the same starting material and DNA preparation should have been used. Nonetheless, this genome sequence will be a definitive reference of the type strain of S. pyogenes.
The vast amount of genomic data that the current "next-generation" and "third-generation" sequencing platforms generate can be used for bacterial systematics and taxonomy, which traditionally have been based on observations of phenotypic features, DNA G + C content, DNA-DNA hybridization similarities and sequence determinations and analyses of marker genes, such as 16S rRNA 73 . Recently, numerous studies have shown the effectiveness, high resolution and discriminative power of whole-genome sequence-based comparative studies 74,75 . In fact, several methods and tools have been developed for analysing whole-genome sequence similarities, e.g., Average Nucleotide Identity (ANI) 76 , which can be calculated using JSpecies 77,78 , and in silico DNA-DNA hybridization, which can be calculated using the Genome-to-Genome Distance Calculator (GGDC) 79 .
Other interesting tools include the Type Strain Genome Server 80 and TrueBac ID 81 , both high-throughput on-line servers for genome sequence-based taxonomy, dependent upon curated databases of bacterial species type strain genome sequences. Despite the availability of such tools, public databases contain numerous misclassified genome sequences 74,75,82 , most likely due to disregard for taxonomic controls, but also because genome sequences of the type strains of many species have not yet been determined 83 or are not reliable, or because genome sequences have been erroneously labelled as "type strains" 84 . Global efforts and initiatives are underway to curate public databases 85 , as well as to sequence the genomes of type strains 83,86 . Since January 2018, the International Journal of Systematic and Evolutionary Microbiology (IJSEM), the official publication of the International Committee on Systematics of Prokaryotes (ICSP) and the journal of record for publication of novel microbial taxa, has required authors describing new taxa to provide the genome sequence data; the genome of an organism encodes the basis of its biology and, therefore, is the fundamental basis of information for understanding the organism. Furthermore, genome sequences of the type strains of bacterial and archaeal species are crucial as reference points for identifying and classifying genetic and metagenomic data 87,88 .
Five prophage regions have been detected in the genome sequence of S. pyogenes CCUG 4207T (= NCTC 8198T), encompassing 17% of the CDSs of the genome. These results agree with initial reports of genome sequences of S. pyogenes, which already confirmed a high prevalence of prophages inserted in chromosomes of the species 59,89,90 . Overall, numerous studies have shown the crucial role of bacteriophages in the ecology, pathogenicity and evolution of S. pyogenes strains. In fact, prophages have been linked with the recent resurgence of M1-GAS-associated invasive diseases 91 .
CRISPR-Cas systems are adaptive immune systems that are widespread in bacteria 36,92 . These systems have been previously found in several strains of S. pyogenes, and an inverse correlation has been observed between their presence and the number of prophages inserted in the genome 93 . Contrary to this observation, we found a complete CRISPR-Cas system together with five putative prophages in the genome sequence of S. pyogenes CCUG 4207T (= NCTC 8198T). Additionally, high similarity was found between the sequences of the spacers of the CRISPR array and three of the prophage regions, one of the spacers being 100% identical to its prophage target. However, we detected a frameshift, caused by a single-nucleotide deletion, in the cas3 gene, which encodes an endonuclease/helicase that is essential for CRISPR-Cas interference 94 . Therefore, the CRISPR-Cas system is most likely non-functional due to this truncation, leaving the way open for infection by bacteriophages, which could explain the presence of five prophage regions inserted within the chromosome. In addition, the small number of spacers suggests that the CRISPR-Cas system might also not be active in the acquisition of new spacers.
As a major human pathogen, S. pyogenes has a great repertoire of virulence factors, some of which are intrinsic and shared among almost all strains, while others might be present only in certain strains 34 or serotypes (e.g., EndoS2, exclusive to serotype M49) 95 . In this study, the analysis of the protein sequences of S. pyogenes CCUG 4207T (= NCTC 8198T) against the VFDB has provided insight into the virulence potential of this strain, revealing the presence of numerous prominent virulence factors. In particular, this strain was isolated from a throat swab from a scarlet fever patient 58 . For many years, scarlet fever has been associated with S. pyogenes strains producing pyrogenic toxins 96 . In accordance with this, several streptococcal pyrogenic exotoxins have been found, one of them encoded in a prophage region. Additionally, several of the other virulence factors have also been found encoded in prophage regions, highlighting the role of prophages in the pathogenicity of S. pyogenes strains. In any case, further studies will be needed to uncover the full pathogenic potential of this strain, with emphasis on revealing the role of the still high number of CDSs annotated as 'hypothetical proteins', which represent 15% of the 2,009 genes annotated.
The only antibiotic resistance gene detected in the genome sequence was a gene encoding an ABC transporter ATP-binding protein; overexpression of homologues of this gene in Streptococcus pneumoniae has been associated with fluoroquinolone resistance 57 . This lack of significant antibiotic resistance genes was expected, as the type strain of S. pyogenes was isolated before the antibiotic era (i.e., before 1928). In addition, antibiotic resistance is currently not as severe a problem in S. pyogenes as it is in other bacterial species and taxa, with penicillin remaining the drug of choice despite many decades of use 97 .

Conclusions
Here we present the first complete genome sequences of the type strain of S. pyogenes (NCTC 8198T = CCUG 4207T = S.F. 130T), the type species of the genus Streptococcus, the type genus of the family Streptococcaceae and a major human pathogen. These genome sequences represent the reference genomic material to be used in taxonomic studies involving this family and its members. Additionally, we have shown how the combination of high-quality, short, Illumina sequence reads with long Oxford Nanopore sequence reads is able to generate a complete genome sequence, identical to the one obtained with only PacBio sequencing.