(Meta)genomic insights into the pathogenome of Cellulosimicrobium cellulans

Despite causing serious clinical manifestations, Cellulosimicrobium cellulans remains under-reported, with only three genome sequences available at the time of writing. Genome sequences of C. cellulans LMG16121, C. cellulans J36 and Cellulosimicrobium sp. strain MM were used to determine the distribution of pathogenicity islands (PAIs) across C. cellulans, which revealed 49 potential marker genes with known association to human infections, e.g. the Fic and VbhA toxin-antitoxin system. Oligonucleotide composition-based analysis of orthologous proteins (n = 791) across the three genomes revealed a significant negative correlation (P < 0.05) between the frequency of optimal codons (Fopt) and gene G+C content, highlighting the G+C-biased gene conversion (gBGC) effect across Cellulosimicrobium strains. Bayesian molecular-clock analysis performed on three virulence-associated PAI proteins (Fic; D-alanyl-D-alanine carboxypeptidase; transposase) dated their divergence from the most recent common ancestor at 300 million years ago. Synteny-based annotation of hypothetical proteins highlighted gene transfers from non-pathogenic bacteria as a key factor in the evolution of PAIs. Additionally, deciphering the metagenomic islands using strain MM’s genome with environmental data from the site of isolation (hot-spring biofilm) revealed (an)aerobic respiration as a population segregation factor across the in situ cohorts. Using reference genomes and metagenomic data, our results highlight the emergence and evolution of PAIs in the genus Cellulosimicrobium.
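The relationship between Fopt and gene G+C content described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the set of optimal codons, the toy sequences and the choice of a Pearson coefficient are all assumptions made for illustration, and no P value is computed.

```python
from math import sqrt

# Met (ATG), Trp (TGG) and stop codons have no synonymous alternatives,
# so they are conventionally left out of the Fop denominator.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def fop(seq, optimal_codons):
    """Frequency of optimal codons: optimal / (optimal + non-optimal)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    scored = [c for c in codons if c not in EXCLUDED]
    if not scored:
        return 0.0
    return sum(c in optimal_codons for c in scored) / len(scored)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical optimal codons and toy coding sequences (not real data).
OPTIMAL = {"GCT", "AAA"}
toy_genes = ["ATGGCTGCTAAATAA", "ATGGCCGCGAAGTAA", "ATGGCTGCCAAATAA"]

gc_values = [gc_content(g) for g in toy_genes]
fop_values = [fop(g, OPTIMAL) for g in toy_genes]
print(pearson_r(gc_values, fop_values))
```

In a real analysis the optimal codons would be derived from the genomes themselves (for example from highly expressed genes), and significance would be assessed with a statistical test over all orthologous genes.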


Legends of Supplementary Information
Text S1: The PAI gene content specific to each strain is outlined in this section.

Supplementary Information
Text S1 The PAI gene content specific to each strain:

(a) PAIs of Cellulosimicrobium sp. strain MM
The first PAI (locus MM_CPAI1; Figure 2A, 2B, Table 1) was characterized by the presence of genes for hypothetical proteins, integrase, transposase (TniQ), inositol-3-phosphate synthase and a multidrug efflux transporter protein. Inositol is required for phospholipid synthesis in both prokaryotes and eukaryotes 46, particularly by pathogens that must proliferate inside hosts. Human pathogens are known to acquire inositol either by synthesis or by import from host fluids (such as blood). Despite the abundance of inositol in host environments, certain pathogenic bacteria such as Mycobacteria (Actinobacteria) synthesize inositol via the enzyme inositol-3-phosphate synthase 46.
The second locus (MM_CPAI2, Figure 2A, 2B, Table 1) was characterized by Clp proteolytic subunits, part of an enzyme system responsible for proteolysis of misfolded or damaged proteins. These proteins are well-established markers of M. tuberculosis pathogenesis 47. In addition to a gene encoding a site-specific recombinase such as XerC, genes encoding the anti-toxin VbhA and Fic (filamentation induced by cyclic AMP) proteins were also annotated on the second locus (MM_CPAI2) (Figure 2B). Fic proteins are effector proteins that work in a complex with the VbhT (toxin) and VbhA (anti-toxin) system and are used by pathogenic bacteria to interfere with host cell signaling, collapsing the actin cytoskeleton and thereby causing cell death 3. This PAI thus implicitly supports the "selfish operon" theory, marked by the juxtaposed Fic and VbhA proteins, suggesting that conjugative systems are transferred together on pathogenicity or genomic island loci. One of the ORFs was predicted to encode a TrwC relaxase; such relaxases work with the cognate conjugative type IV secretion system (T4SS) and are bona fide virulence factors 2 (Figure 2B). In addition to the above-mentioned virulence factors, an ORF for a bacteriocin-like protein was also annotated; this class of proteins possesses bactericidal activity that modulates both inter-strain and inter-species inhibition 48.
Microbial antibiotic resistance is not in itself a pathogenicity factor; however, it enhances virulence and fitness in environments with increased chemical stress 49.
The third PAI locus (MM_CPAI3) consisted of genes for fluoroquinolone resistance, which is a known marker trait of extra-intestinal pathogenic bacteria such as Escherichia coli 5 (Figure 2B, Table 1). Apart from four hypothetical proteins, a gene encoding a sulfatase-modifying factor was also found on this locus; mutational analysis has shown this factor to increase the fitness of pathogenic bacteria such as Streptococcus pneumoniae 6. Interestingly, one of the ORFs was annotated (Figure 2B, Table 1) as a DEAD-box helicase, which has been documented earlier on one of the O islands of the formidable pathogen E. coli O157:H7. In a deletion mutant assay, the DEAD-box helicase was reported to impede the function of the fliC gene responsible for flagellar-driven motility 50.
Strikingly, ORFs on the fourth locus (MM_CPAI4, Figure 2B, Table 1) ...

(b) PAIs of Cellulosimicrobium cellulans LMG16121
Hypothetical proteins (n = 8) were annotated on the second locus (LMG_CPAI2) (Figure S3B) along with a type II secretion system, which is involved in enhancing both pathogenicity and environmental fitness. Two ORFs for transposases were also revealed along with a putative membrane protein. Another very short PAI (LMG_CPAI3) (5 kbp) was characterized by hypothetical proteins along with transcriptional activators (XRE family) and a cell division protein (FtsK) (Figure S3C).
The mobility of PAIs can be phage-induced through helper proteins such as capsid proteins, which help package the PAIs into transducing particles; such proteins were discovered on LMG_CPAI4 (Figure S3D) 57. Other phage-associated genes identified included those encoding phage integrases (XerC and XerD family). Interestingly, genes encoding laminarinase and β-glucosidase-related glycosidases, which are well known to be responsible for cellulolytic activity in bacteria, were also found on this locus. This can be directly linked to the genus Cellulosimicrobium, in which cellulose hydrolysis is one of the most notable characteristics 58. The occurrence of these genes on the flexible genome is of special interest, as it suggests lateral acquisition of the cellulolytic trait by this taxon (Figure S3D). Two ORFs encoding F420-dependent oxidoreductase were also discovered on this locus; this enzyme is established to play a pivotal role in the self-defense of pathogenic bacteria (specifically Actinobacteria) against oxidative stress 59. This locus was also characterized by the highest number of hypothetical proteins (n = 13) (Figure S3D).
The fifth PAI (LMG_CPAI5) was marked by the presence of genes encoding an arsenic resistance operon repressor and the ClpX subunit (Figure S3E). D-alanyl-D-alanine carboxypeptidase, which is involved in cell wall biogenesis and is indicative of host-cell recognition 60, was also revealed on this PAI. Genes involved in repair mechanisms, such as ligD, were also found; these are reported to enhance the overall fitness of the bacterial community. This locus also included lsr2, which encodes a histone-like bridging protein and is a well-known housekeeping gene; hence its occurrence on a PAI implies its relocation

(c) PAIs of Cellulosimicrobium cellulans J36
The first annotated PAI locus (J36_CPAI1) in strain C. cellulans J36 revealed a relatively greater number of integrases and transposases, accompanied by hypothetical proteins (Figure S4A). Additionally, a hemin ABC transporter protein was revealed on this locus, suggesting the presence of an evolved iron-uptake system, which is required to regulate numerous virulence factors 64. A copper-binding protein and a copper export protein were also revealed, again indicating the presence of metallo-regulatory systems that enhance bacterial fitness. An anti-anti-sigma factor, which is involved in regulating virulence genes 65, was also annotated on this locus. Antibiotic resistance genes were also revealed, including those for bleomycin resistance, doxorubicin resistance and a methyl viologen resistance protein. A phage-shock protein (psp) operon transcriptional activator, which is established to maintain the proton motive force (PMF) required for cellular motility by modulating flagellar movement 66, was also annotated on this locus.
The second locus (J36_CPAI2) (Figure S4B) was characterized by two ORFs encoding membrane proteins, accompanied by three ORFs each for integrase and transposase. Hypothetical proteins were most abundant (n = 15) on this locus. Two ORFs for resolvases belonging to the serine recombinase family were also annotated, which indicate active HGT events via site-specific recombination (Figure S4B