Essential Genes in the Core Genome of the Human Pathogen Streptococcus pyogenes

Streptococcus pyogenes (Group A Streptococcus, GAS) remains a major public health burden worldwide, infecting over 750 million people leading to over 500,000 deaths annually. GAS pathogenesis is complex, involving genetically distinct GAS strains and multiple infection sites. To overcome fastidious genetic manipulations and accelerate pathogenesis investigations in GAS, we developed a mariner-based system (Krmit) for en masse monitoring of complex mutant pools by transposon sequencing (Tn-seq). Highly saturated transposant libraries (Krmit insertions in ca. every 25 nucleotides) were generated in two distinct GAS clinical isolates, a serotype M1T1 invasive strain 5448 and a nephritogenic serotype M49 strain NZ131, and analyzed using a Bayesian statistical model to predict GAS essential genes, identifying sets of 227 and 241 of those genes in 5448 and NZ131, respectively. A large proportion of GAS essential genes corresponded to key cellular processes and metabolic pathways, and 177 were found conserved within the GAS core genome established from 20 available GAS genomes. Selected essential genes were validated using conditional-expression mutants. Finally, comparison to previous essentiality analyses in S. sanguinis and S. pneumoniae revealed significant overlaps, providing valuable insights for the development of new antimicrobials to treat infections by GAS and other pathogenic streptococci.

. pKRMIT, a Tn-seq compatible in vivo mariner transposition system for GAS. Figure S2. Additional statistical analyses for pairwise comparisons of the initial libraries produced in GAS 5448. Figure S3. The GAS core genome. Figure S4. Conserved essential genes within the GAS core genome.
Supplemental Tables. Following ten (10) tables are included: Table S1. Complete analysis of Tn-seq read counts and alignments. Table S2. Selection of reads for the Bayesian analysis. Table S3. Bayesian analysis of Tn-seq datasets for GAS M1T1 5448 grown in vitro. Table S4. Integrated gene essentiality determination for GAS M1T1 5448 for all time points. Table S5. Bayesian analysis of Tn-seq datasets for GAS M49 NZ131 grown in vitro. Table S6. Integrated gene essentiality determination for GAS M49 NZ131 for all time points. Table S7. Conserved GAS essential genes identified in both M1T1 5448 and M49 NZ131 compared to the predicted GAS core genome. Table S8. Bacterial strains and plasmids. Table S9. Primers used in this study.
Table S10. Summary of 10 publically available GAS genomes.
(tnseqv0.rmd for no-mismatches, and tnseqv1.rmd for 1 mismatch analyses) and knitr (http://yihui.name/knitr/) was used to generate tnseq.pdf which provides a (longer than anyone will ever read) report of the operations performed.

Library Metrics
Before using the libraries to quantify the essentiality/fitness of coding sequences in the genome, it was necessary to compare and contrast the libraries and quantify their relative coverage, similarity, and saturation with respect to available TA insertion points.
These comparisons in turn require some attention to the normalization strategies employed.
Library Saturation by TA insertion points. The GAS genome is AT rich, providing a relatively large number of possible mariner insertion points: there are 1,838,562 and 1,815,783 TA sites found on the genome of MGAS5005 and NZ131, respectively. Therefore, the bam alignments were read into tnseq_count_hitsv2.pl along with the fasta genome; this script created a simple array of the genome and filled in the number of observed reads, which started (forward strand) or ended (reverse strand) at every TA in the genome. The bowtie parameters used allowed no mismatch, therefore this script excluded any read which did not match exactly the TA insertion point. The resulting text files contain one line for each TA of the genome, include a column for the position of each TA, the number of observed reads on the forward strand, reverse strand, and missed reads. These may be found in data/count_tas/; there is one file for the concatenated 5448 libraries as well as each separate library for all strains, in each case the file ends in 'tas_only.txt'. Some reads were observed (again due to bowtie mismatches) to hit near but not on the TA insertion points. These were also collected in the tnseqcount.out files, but were discarded. These same outputs were used to feed the essentiality package 11 .
These files were read into the R function plot_saturation() and used to visualize the lifespan of each library over time. This was done by taking the log 2 (hits + 1) for each position and plotting them as a set of histograms.
Normalization and visualization strategies. The most expedient method of comparing the libraries was to treat them as if they were components of an RNA sequencing experiment and assuming similar normalization strategies 12 apply. This strategy is very similar to that taken by the essentials software package 13 , but uses voom/limma instead of EdgeR. Therefore the alignments were counted by reference genome coding sequence and quantile normalized without removing the low-count genes.
These normalized counts were adjusted by the size of the library and length of each coding sequence (rpkm implemented in edgeR 14 with assistance from DESeq 15 and log 2 transformed. Conversely, the quantile normalized counts were adjusted by library size (cpm), then adjusted by the ratio of TAs observed in the coding sequence divided by the median number of TAs in all coding sequences (all normalizations were performed by my_norm and divide_seq). The resulting data sets were used to quantify the median coverage of the genome. Pairwise Euclidean distances (Fig. S2B), Spearman correlation coefficients ( Fig. 1G, H, I) and principle component analyses (Fig. S2A) were then used to visualize the similarities/differences between normalized libraries.
Pairwise sample comparisons. In order to further explore the similarities and difference between the libraries, they were plotted against one another as a series of scatter plots (generated by my_linear_scatter) and histograms. These provided a visual metric of the similarities and differences among the libraries and time points between libraries. In the case of the 5448 strain, voom 16 and limma 17 were used to combine the samples, correct for the batch effect among libraries, and calculate the merged changes between time points (performed by simple_comparison), and include significance with respect to the variance in the data.
Essentiality. The essentiality software package 11 provides an opportunity to query statistically significant stretches of TAs which have no observed insertions to further inform its metric of essentiality. The insertion data was therefore converted into its expected format (tnseq_count_hitsv2.pl) and passed to the version 1.21 of the implementation python script. The resulting table provided a count of the number of insertions observed in each ORF, the number of observed TAs, the maximum length of non-observed sequence, the nucleotide span of this region, a call on whether each ORF is essential, and the posterior probability for each call. A few different parameters were tested with each strain and settled on a minimum hit parameter of 2 for both strains.
Circos. R was used to massage various data structures into the format expected by circos 18 .

Supplemental Tables
Following nine (9) tables are presented: Table S1. Complete analysis of Tn-seq read counts and alignments. Table S2. Selection of reads for the Bayesian analysis. Table S3. Bayesian analysis of Tn-seq datasets for GAS M1T1 5448 grown in vitro. Table S4. Integrated gene essentiality determination for GAS M1T1 5448 for all time points. Table S5. Bayesian analysis of Tn-seq datasets for GAS M49 NZ131 grown in vitro. Table S6. Integrated gene essentiality determination for GAS M49 NZ131 for all time points. Table S7. Conserved GAS essential genes identified in both M1T1 5448 and M49 NZ131 compared to the predicted GAS core genome. Table S8. Bacterial strains and plasmids. Table S9. Primers used in this study. Table S10. Summary of the 20 publically available GAS genome sequences. 44 32 (a) Sample name describes the GAS strain (i.e GAS 5448 or NZ131) used to produce the Krmit library. For 5448, 4 independent libraries (Lb1 to Lb4) were analyzed separately. The different libraries produced (T0) were subjected to 3 additional passages in THY (T1 to T3). Master 5448 is the combinaison of the reads from the 4 initial libraries in 5448. (b) Reads aligning to more than one position on the chromosome were randomly placed onto the GAS chromosome. (c) Percentage in parenthesis represents the proportion of the initial reads (sum of strictly aligned and randomly placed) that align to the GAS genome. (d) Percentage in parenthesis coresponds to the proportion of reads among the reads that failed to align to the GAS genome that were aligned to the pKRMIT plasmid sequence. (e) The saturation index depicts the proportion of unique TA sites found on the GAS chromosome hit at least once by Krmit. (f) Calculated distance separating two adjacent TA sites where Krmit inserted. 57,377 44 11,487 9 (a) Recapitulation of the total number of unique transposon insertion sites (TIS) and saturation indexes for the Krmit libraries produced in GAS 5448 and NZ131. (b) The Bayesian analysis for gene essentiality prediction called for eliminating TIS with one Krmit insertion. The remaining reads have at least two Krmit counts. (c) The saturation index depicts the proportion of unique TA sites found on the GAS chromosome hit at least once by Krmit.  (a) Spy numbers from MGAS5005 genome.

All TIS(a) TIS with at least 2 Krmit copies(b)
(b) When available, gene name is provided.
(c) Call integrating data from Bayesian analyses on 3 time points.
(d) Hyperlink to the NCBI Gene website (http://www.ncbi.nlm.nih.gov/gene). (f) Asteriks refer to genes found in the GAS core genome.

Transcriptional Unit (e)
Locus Tag