Comparative genomics of a novel clade shed light on the evolution of the genus Erysipelothrix and characterise an emerging species

Erysipelothrix sp. isolates obtained from a deadly outbreak in farmed turkeys were sequenced and compared to representatives of the genus. Phylogenetic trees—supported by digital DNA:DNA hybridization and Average Nucleotide Identity—revealed a novel monophyletic clade comprising isolates from pigs, turkeys, and fish, including isolates previously described as E. sp. Strain 2. Genes coding for the SpaC protein, typically found in E. sp. Strain 2, were detected in all isolates of the clade. Therefore, we confirm E. sp. Strain 2 represents a unique species, that despite its official name “Erysipelothrix piscisicarius” (meaning a killer of fish), may be isolated from a broad host range. Core genome analysis showed that the pathogenic species of this genus, E. rhusiopathiae and the clade E. sp. Strain 2, are enriched in core functionalities related to nutrient uptake and transport, but not necessarily homologous pathways. For instance, whereas the aerobic DctA transporter may uptake C4-dicarboxylates in both species, the anaerobic DcuC transporter is exclusive of the E. sp. Strain 2. Remarkably, the pan-genome analysis uncovered that genes related to transport and metabolism, recombination and repair, translation and transcription in the fish isolate, within the novel clade, have undergone a genomic reduction through pseudogenization. This reflects distinct selective pressures shaping the genome of species and strains within the genus Erysipelothrix while adapting to their respective niches.


Identification of single-copy core genes potentially involved in horizontal gene transfer:
We identified genes potentially involved in horizontal gene transfer (HGT) events and removed their respective orthologous group (OG) from the single-copy core genome dataset to prevent bias in the phylogenomic analysis. Protein sequences from E. rhusiopathiae Fujisawa (one of the most derived species) and E. larvae (the most ancient species) were used as query sequences to run BLASTP searches (parameters: -max_target_seqs 100 -remote) against the NR database at the NCBI. Results were tabulated and manually inspected. When a query sequence had a better hit (considering both Evalue and identity) to a sequence belonging to other genera rather than Erysipelothrix, this sequence was considered potentially involved in HGT events. All queries (618 sequences) from E. rhusiopathiae retrieved best hits belonging to Erysipelothrix species before retrieving hits against any other genera, showing no putative genes involved in HGT events. A hundred and twelve queries out of 618 from E. larvae had a better hit against other genera rather than Erysipelothrix. After removing the respective 112 OGs from the single-copy core genome dataset, the phylogenomic analysis was constructed based on 506 OGs single-copy core genome dataset.
Identification of the accessory genes shared only by E. sp. EsS2-6-Brazil and E. sp. EsS2-7-Brazil in isolate E. sp. 15TAL0474: A total of 307 sequences belong to the accessory genome of the turkey isolates (i.e., shared between E. sp. EsS2-6-Brazil and EsS2-7-Brazil) but were found missing in the fish isolate genome (E. sp. 15TAL0474). To verify that the genes were not missed due to misannotation in E. sp. 15TAL0474, we performed BLASTP and TBLASTN searches of either E. sp. EsS2-6-Brazil and EsS2-7-Brazil protein sequences (as queries) against the E. sp. 15TAL0474 pseudogenes and genome, respectively. BLAST results were classified according to Blanc et al. (2007) as follow: unannotated coding gene -when a hit found using TBLASTN was >50% of the query coverage; annotated pseudogene -when a hit was found using BLASTP against an annotated pseudogene; unannotated pseudogene -when a hit found using TBLASTN had a E-value better than 0.01 and HSP > 20 amino acids or query coverage >20% ; absent -when no hit was found with previous criteria. No unannotated coding genes were found in E. sp. 15TAL0474. After searches, we found that 197 queries out of 307 OGs were found in 15TAL0474 as annotated pseudogenes, 47/307 OGs were found as unannotated pseudogenes, and 63/307 OGs had no hits in 15TAL0474 and were considered absent (Supplementary Table S9). Therefore, the missed gene set of E. sp. 15TAL0474 is not related to faulty annotation.

General genome characteristics
Sequenced isolates (E. sp. EsS2-6-Brazil and E. sp. EsS2-7-Brazil) had been previously confirmed as E. sp. Strain 2-related isolates (Hoepers et al., 2019) based on a PCR method developed elsewhere (Pal et al., 2010). The WGS generated 3,739,346 total sequenced reads for isolate EsS2-6-Brazil and 2,914,734 for EsS2-7-Brazil, corresponding to approximately 431x and 322x genome coverage, respectively. The draft genome sequences were assembled in 30 contigs each, comprising a total of 1,739,138 bp and 1,714,240 bp for isolate EsS2-6-Brazil and EsS2-7-Brazil, respectively, making them amongst the smallest genome sizes compared to other Erysipelothrix spp. genomes available (Supplementary Table S1).

Estimating the genus core genome
To help us understand the phylogenetic relationship and bacterial evolution of the genus, we used the program FastOrtho to predict orthologous groups (OGs) among Erysipelothrix species. Considering the four species (E. larvae, E. tonsillarum, E. rhusiopathiae and E. sp. Strain 2) of the genus, FastOrtho predicted a total of 2,007 OGs and 1,138 orphans revealing a pan-genome of 3,145 protein families.
The genus core genome consisted of 647 OGs, of which 618 were represented by single-copy genes. A total of 112 OGs were disregarded due to possible horizontal gene transfer (HGT) events in E. larvae (Supplementary Material). The core genome represents 18.5% and 30.9% of the protein coding gene content for E. larvae and E. tonsillarum, respectively, and 34% of the average gene content across E.
rhusiopathiae and E. sp. Strain 2. Among the protein families found in all E. rhusiopathiae and E. sp.

Unique repertoire of Erysipelothrix species reveals exclusive expansions in E. larvae
Unique genes (strain-specific genes) were found in most studied genomes. E. sp. Strain 2 isolates showed 35 (isolate 15TAL0474), 41 (isolate EsS2-6-Brazil) and 29 (isolate EsS2-7-Brazil) exclusive genes. Among E. rhusiopathiae genomes, few genes comprise the exclusive set, indicating that the vast majority of the gene content is shared at some level among distinct clades (Forde et al., 2016) within the species (Supplementary Fig. 2, Supplementary Fig. 6 and Supplementary Methods). The considerable number of shared genes (core and accessory) in addition to the low abundance of unique sets in E. rhusiopathiae ( Supplementary Fig. 6) is likely to be caused by genetic homogeneity among strains rather than a lack of representativeness, once isolates from three distinct clades were considered ( Supplementary Fig. 2).
As expected, E. larvae and E. tonsillarum showed the highest number of unique genes (1,090 and 185 genes, respectively), which is likely a consequence of i) the lack of other available genomes from the same species, which would demonstrate the repertoire of shared genes (probably including many genes of the current unique list (Supplementary Table S8)) within the species and ii) being the two most ancient species (Fig. 2B) and distantly related (Fig. 3AB) to E. rhusiopathiae and E. sp. Strain 2 genomes. Surprisingly, E. larvae showed a considerable fraction (27.3% or 298/1090 genes) of lineagespecific expansions in its exclusive gene set. In constrast, E. tonsillarum showed only few instances (about 2.2% or 4/185 genes) of lineage-specific expansions in its exclusive gene set. Most of the expanded unique families in E. larvae were poorly characterized sequences related to the mobilome, such as chromosomal transposases and insertion sequences. Interestingly, protein family 6-phosphobeta-glucosidase (K01223; EC: 3.2.1.86) represents a lineage-specific expansion of E. larvae. This protein family is involved in starch and sucrose metabolism (Ko00500) responsible for the hydrolysis of cellobiose-6-phosphate (imported as cellobiose from extracellular environment) to D-glucose. E. larvae was isolated from the healthy gut microbiota of a rhinoceros beetle, Trypoxylus dichotomus (Bang et al., 2014), which is fed on decayed wood. Therefore, the expansion of a protein related to cellulose degradation represents an important advantage not only to the bacteria but also to the host insect, since it promotes efficient digestion of the disaccharide.  Table S5). Organism abbreviation is described in Supplementary Table S1.

Supplementary Fig. S6 -A flower plot of the pan-genome of E. rhusiopathiae and E. sp. Strain 2 isolates.
Legend: A flower-plot schematic representation illustrates the number of predicted core (917) and accessory (369 to 780) genes among E. rhusiopathiae and E. sp. Strain 2 isolates. Petals show the number of predicted unique genes for each isolate. E. rhusiopathiae isolates are color-coded according their Clades (Forde et al., 2016): Clade 1 -blue; Clade 2 -green; Clade Intermediate -Orange. E. rhusiopathiae and E. sp. Strain 2 isolates were indicated as described in Supplementary Table S1. Legend: E. sp. Strain 2 isolates were highlighted in red whereas E. rhusiopathiae isolates were highlighted in cyan. The standard threshold value used as species boundaries for 16S rRNA and rpoB sequences is 97% (Stackebrandt & Goebel, 1994) and 97.7% (Adékambi et al., 2003;Adékambi et al., 2006), respectively. Organism abbreviation is described in Supplementary Table S1. In addition, organism abbreviations not shown in Supplementary Table S1 (isolates with no genome available until submission of this study) are: Erysipelothrix inopinata 143-02 (eino-143-02) and Erysipelothrix sp. 715 (espe-715).

Strain 2 genomes.
Legend: Functional category enrichment analysis was calculated using the Fisher's exact test. The pvalue, followed by the odds value (between parentheses), is shown in each cell. A p-value < 0.05 was considered significant. An odds value greater than 1.0 indicates a COG category enriched in the core genome (highlighted in pink), whereas an odds value lower than 1.0 indicates a COG category enriched in the accessory genome (highlighted in green). Legend: COG E -Amino acid transport and metabolism, COG I -Lipid transport and metabolism, and COG P -Inorganic ion transport and metabolism. Rows were colored according to the COG category a gene belongs to: red for COG E, blue for COG I, and yellow for COG P. Column "Core/accessory/unique" shows whether a gene belongs to the core genome (shared among all genomes) or accessory genome (shared between two or more, but not all genomes) or unique (exclusive of one genome) inferred amongst E. rhusiopathiae isolates (core-erhu, accessory-erhu, unique-erhu) and amongst E. sp. Strain 2 isolates (core-strain2, accessory-strain2, unique-strain2). A cell content showing "NA" (Not Assigned) means that no KO number from Kyoto Encyclopedia of Genes and Genomes (KEGG), KO number description (KEGG), Pfam domain, Pfam domain architecture, E-value (Pfam) was found in the analysis. Organism abbreviation is described in Supplementary Table S1.

Supplementary
Supplementary Table S6 -List of core genes identified in E. rhusiopathiae isolates and core genes identified in E. sp. Strain 2 isolates.
Legend: Column "CORE erhu or strain2" shows whether a gene belongs to the core genome of E. rhusiopathiae isolates (core-erhu) or E. sp. Strain 2 isolates (core-strain2). A cell content showing "NA" (Not Assigned) means that no Cluster of Orthologous Groups (COG) or COG category was found in the analysis. Organism abbreviation is described in Supplementary Table S1.