Genome analysis of Campylobacter concisus strains from patients with inflammatory bowel disease and gastroenteritis provides new insights into pathogenicity

Campylobacter concisus is an oral bacterium that is associated with inflammatory bowel disease. C. concisus has two major genomospecies, which appear to have different enteric pathogenic potential. Currently, no studies have compared the genomes of C. concisus strains from different genomospecies. In this study, a comparative genome analysis of 36 C. concisus strains was conducted including 27 C. concisus strains sequenced in this study and nine publically available C. concisus genomes. The C. concisus core-genome was defined and genomospecies-specific genes were identified. The C. concisus core-genome, housekeeping genes and 23S rRNA gene consistently divided the 36 strains into two genomospecies. Two novel genomic islands, CON_PiiA and CON_PiiB, were identified. CON_PiiA and CON_PiiB islands contained proteins homologous to the type IV secretion system, LepB-like and CagA-like effector proteins. CON_PiiA islands were found in 37.5% of enteric C. concisus strains (3/8) isolated from patients with enteric diseases and none of the oral strains (0/27), which was statistically significant. This study reports the findings of C. concisus genomospecies-specific genes, novel genomic islands that contain type IV secretion system and putative effector proteins, and other new genomic features. These data provide novel insights into understanding of the pathogenicity of this emerging opportunistic pathogen.


Results
The draft genomes of 27 C. concisus strains. The genomes of 27 C. concisus strains were sequenced in this study. These 27 C. concisus strains were previously isolated in our laboratory from patients with CD, patients with UC and healthy controls, and they were randomly selected for inclusion in this study. Ten of these strains were analysed in our previous studies of grouping C. concisus strains using housekeeping genes 3,6,10,23 .
The draft genome sizes of these C. concisus strains were 1.80 to 2.21 Mb. The contig numbers ranged from 7 to 76. The fold coverage ranged from 83.98 to 230.58. The summaries of the C. concisus genomes sequenced in this study are in Table 1.
Table 1. Summary of the 27 C. concisus genomes sequenced in this study. Draft genomes were assembled using St. Petersburg genome assembler (SPAdes, Ver. 3.6.1). Letters P and H in strain ID indicate strains isolated from patients with inflammatory bowel disease and healthy controls respectively. O indicates oral strains isolated from saliva samples and B indicates a strain isolated from intestinal biopsies. These strains were isolated from our previous studies 3, 6 . * Indicates strains used in a previous study using housekeeping genes to group C. concisus strains 18 . CD: Crohn's disease. UC: Ulcerative colitis.
Scientific RepoRts | 6:38442 | DOI: 10.1038/srep38442
The core-genome and accessory genes. The C. concisus core-genome was derived from 36 C. concisus strains including the 27 C. concisus strains sequenced in this study and nine C. concisus genomes that are publically available 3,6,10,23,24 . The C. concisus core-genome of the 36 C. concisus strains consisted of 582 genes, which were 28.7% (582/2025) of the total number of genes present in C. concisus strain 13826. The core-genomes of GS1 and GS2 strains had 1,098 and 1,143 genes respectively. The genes in both GS1 and GS2 C. concisus core-genomes were evenly distributed amongst different Clusters of Orthologous Groups (Supplementary Fig. S1). The accessory genes in the 36 C. concisus strains ranged from 1,163 to 1,521.
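The core and accessory definitions can be illustrated with a minimal Python sketch. It assumes a strict "present in every strain" definition of the core-genome and treats each genome as a set of gene-cluster names; the strain names and gene lists are hypothetical toy data, not results from this study, and the function is not the Roary implementation. The last line simply reproduces the reported proportion 582/2025.

```python
from functools import reduce

def core_and_accessory(gene_sets):
    """Return (core, accessory): core is the set of genes shared by every
    strain; accessory maps each strain to its genes that are not in the core."""
    core = reduce(set.intersection, gene_sets.values())
    accessory = {strain: genes - core for strain, genes in gene_sets.items()}
    return core, accessory

# Hypothetical toy input: strain name -> set of gene-cluster names.
toy = {
    "strainA": {"gyrB", "atpA", "pstS", "zot"},
    "strainB": {"gyrB", "atpA", "aqpZ"},
    "strainC": {"gyrB", "atpA", "aqpZ", "cas1"},
}
core, accessory = core_and_accessory(toy)

# Core-genome size as a percentage of the genes in strain 13826 (from the text).
core_fraction = round(100 * 582 / 2025, 1)
```

In the toy data only the two genes carried by all three strains end up in the core; everything else is accessory for the strain that carries it.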
Both housekeeping genes and a PCR method targeting the polymorphisms of 23S rRNA gene were previously used to separate C. concisus strains into different groups [15][16][17][18][19][20][21] . In this study, we compared the assignment of C. concisus strains by housekeeping genes and 23S rRNA gene. The sequences of these housekeeping genes or 23S rRNA gene divided the 36 strains into two clusters, consistent with the GS1 and GS2 grouping assigned based on the C. concisus core-genome (Figs 2 and 3).
A previous study examining eight C. concisus strains found that the 16S rRNA gene was able to differentiate C. concisus strains isolated from patients with gastroenteritis and CD 24 . However, in this study, we found that the 16S rRNA gene was unable to differentiate C. concisus genomospecies or their related diseases (Fig. 4).
Figure 1. The phylogenetic tree generated based on C. concisus core-genome sequences. The phylogenetic tree was generated based on the C. concisus core-genome generated from 36 Campylobacter concisus strains using Roary 45 . Oral strains from patients with IBD that were sequenced in this study are coloured red. Oral strains from healthy controls that were sequenced in this study are coloured blue. Oral strain ATCC 33237 is coloured purple; this strain was isolated from a patient with gingivitis. Enteric strains are coloured green. The genome of enteric strain P3UCB1, a strain isolated from intestinal biopsies of a patient with UC, was sequenced in this study. The remaining genomes of enteric C. concisus strains are publically available. Enteric strain ATCC 51561 was isolated from faecal samples of a healthy individual. Enteric strains UNSW2, UNSW3 and UNSWCD were isolated from patients with CD 24 . The remaining enteric strains were isolated from patients with gastroenteritis. Bootstrap values of more than 70 are indicated on the internal branches. GS1 and GS2 indicate Genomospecies 1 and 2 respectively.
Genomospecies-specific genes. Using Burrows-Wheeler Aligner, BLASTn and BLASTx, we found that some genes that were present in all GS1 C. concisus strains were absent in all GS2 strains and vice versa, showing that these were genomospecies-specific genes. The flanking regions of GS1-specific genes were found in the genomes of all GS2 strains on unbroken contigs, and vice versa, further confirming that they were truly genomospecies specific. Of the nine GS1-specific genes, three genes encode phosphate transport proteins (PstS, PstA and PstC). The remaining GS1-specific genes encode hypothetical proteins, transporter proteins and enzymes (Table 2).
Fourteen GS2-specific genes were found, including genes that encode a protein involved in regulation of osmolarity (aquaporin Z), a protein involved in pH homeostasis and sodium extrusion (Na + /H + antiporter NhaC), a twitching motility protein and others (Table 2).
CRISPR-associated proteins. Twenty-two C. concisus strains, all belonging to GS2, were found to have genes encoding CRISPR-associated proteins. Cas1, Cas2, Cas3 and Cas4a proteins were found in all 22 strains. Cas5h, Csh1 and Csd2/Csh2 proteins were found in most of these 22 strains, Cas6 protein was found in five strains and the remaining seven CRISPR-associated proteins were found in one or two C. concisus strains (Table 3).
Two different genomic islands containing T4SS homologues and putative effector proteins were found in enteric and oral C. concisus strains respectively. P3UCO1 and P3UCB1 strains were isolated from saliva and intestinal biopsies of a patient with UC. These two strains were genetically closely related (Fig. 1). Interestingly, we found a region in the genome of the enteric strain P3UCB1 that was absent in the genome of the oral strain P3UCO1 (Fig. 5A). The size of this region is 31,286 bp, beginning with an integrase. This region contained five proteins homologous to T4SS proteins from the tumour inducing (Ti) plasmid in the plant pathogen Agrobacterium tumefaciens, namely VirB4, VirB8, VirB9, VirB10 and VirB11. Their similarities to the A. tumefaciens VirB proteins were 41%, 42%, 29%, 39% and 50% respectively. Furthermore, this region had proteins homologous to the RP4 plasmid conjugative transfer protein TraQ, the plasmid partitioning protein ParA and to various hypothetical proteins. Collectively, these findings showed that this region is a plasmid-derived genomic island, which we have named the C. concisus plasmid integrative island A (CON_PiiA) (Fig. 5A and Table 4). Two additional enteric C. concisus strains, UNSW2 and ATCC 51562, were found to have CON_PiiA based on the annotated proteins. CON_PiiA was identified in 37.5% (3/8) of the enteric C. concisus strains isolated from individuals with enteric disease and, interestingly, in none of the oral strains (0/27), a statistically significant difference (P = 0.0086). The core-genomes of multiple oral strains collected from some individuals were genetically similar (Fig. 1), which may lead to biased statistical results. Therefore, we re-analysed the data by considering multiple oral C. concisus strains from a given individual as one strain if these strains were in the same small group in Fig. 1. P24CDO-S3, P24CDO-S2 and P24CDO-S4 were considered as one strain, P2CDO3 and P2CDO-S6 were considered as one strain, P20CDO-S1 and P20CDO-S3 were considered as one strain, and H21O-S1 and H21O-S5 were considered as one strain. Therefore, the total number of oral strains used for re-analysis was 22 instead of 27. The presence of CON_PiiA in enteric strains isolated from patients with enteric diseases and in oral C. concisus strains was still significantly different: 37.5% (3/8) vs 0% (0/22) (P = 0.0138).
Figure 2. The sequences of six housekeeping genes (asd, aspA, atpA, glnA, pgi and tkt) were extracted from the 36 C. concisus strains and were used to generate the phylogenetic tree using neighbour-joining method, which was performed using molecular evolutionary genetic analysis software version 6.06 (MEGA 6.06) with 1,000 bootstrap replications 47 . Oral strains from patients with IBD that were sequenced in this study are coloured red. Oral strains from healthy controls that were sequenced in this study are coloured blue. Oral strain ATCC 33237 is coloured purple; this strain was isolated from a patient with gingivitis. Enteric strains are coloured green. The genome of enteric strain P3UCB1, a strain isolated from intestinal biopsies of a patient with UC, was sequenced in this study. The remaining genomes of enteric C. concisus strains are publically available. Enteric strain ATCC 51561 was isolated from faecal samples of a healthy individual. Enteric strains UNSW2, UNSW3 and UNSWCD were isolated from patients with CD 24 . The remaining enteric strains were isolated from patients with gastroenteritis. Bootstrap values of more than 70 are indicated on the internal branches. Campylobacter jejuni strain NCTC11168 was used as an outgroup (GenBank accession no. NC_002163). GS1 and GS2 indicate Genomospecies 1 and 2 respectively.
We found a second genomic island in oral C. concisus strains. A contig in H17O-S1 strain contained the entire island, which was closely examined. Like P3UCB1 strain, H17O-S1 strain had a region containing genes encoding homologues of VirB4 (44% similarity), VirB8 (45%), VirB9 (40%), VirB10 (49%) and VirB11 (49%). Additionally, there were proteins homologous to TraQ and various hypothetical proteins. Furthermore, H17O-S1 strain contained genes encoding homologues of VirB5 (33%), VirB6 (32%) and VirD4 (43%) from the Ti plasmid in A. tumefaciens, which were not seen in CON_PiiA (Table 4). Repetitive sequences (AGTCCTGGTGAACCCACCA), indicative of attachment sites, were found between an integrase and tRNA-Met-CAT at the positions of 675,445-675,463 bp and 714,647-714,667 bp. Except for two proteins, the proteins in this region had less than 20% amino acid identity to proteins in CON_PiiA. We named this region C. concisus plasmid integrative island B (CON_PiiB), which was 38,653 bp in length (Fig. 5B). The nine VirB proteins and some CON_PiiB proteins were also found in the remaining four oral C. concisus strains from two individuals, including three strains from one patient with CD (P21CDO-S1, P21CDO-S2, P21CDO-S4) and one strain from a healthy individual (H14O-S1). However, the contigs in the three strains from the patient with CD were not long enough to reveal the entire sequence of the CON_PiiB island. CON_PiiB was found in 18.5% (5/27) of oral C. concisus strains and none of the enteric strains (0/9), a difference that was not statistically significant (P > 0.05). The prevalence of CON_PiiB in oral strains isolated from healthy individuals and patients with IBD was 20% (2/10) and 18.8% (3/16) respectively, a difference that was not statistically significant (P > 0.05).
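The search for direct repeats that bound an island can be sketched in a few lines of Python. This is only an illustration of the idea, assuming exact forward-strand matches and 1-based inclusive coordinates; the contig below is a hypothetical toy sequence built around the repeat reported in the text, not the H17O-S1 contig, and `find_motif` is a helper name introduced here.

```python
def find_motif(sequence, motif):
    """Return 1-based inclusive (start, end) coordinates of every exact,
    possibly overlapping, occurrence of motif on the forward strand."""
    hits = []
    pos = sequence.find(motif)
    while pos != -1:
        hits.append((pos + 1, pos + len(motif)))
        pos = sequence.find(motif, pos + 1)
    return hits

ATT_REPEAT = "AGTCCTGGTGAACCCACCA"  # repeat reported in the text (19 nt)

# Hypothetical toy contig: two copies of the repeat flanking a short insert.
toy_contig = "TTTT" + ATT_REPEAT + "GGGGGGGGGG" + ATT_REPEAT + "AAAA"
hits = find_motif(toy_contig, ATT_REPEAT)

# Length of the region bounded by the first and last repeat, repeats included.
bounded_length = hits[-1][1] - hits[0][0] + 1
```

On a real assembly the same function would be run on the contig carrying the integrase and tRNA gene, and the two hit positions compared with the annotated island boundaries.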
Potential effector proteins within CON_PiiA and CON_PiiB islands were found. A number of proteins in both islands had similarities to Legionella pneumophila virulence effector proteins, most of which, such as LepB and LepA, are involved in intracellular survival of the pathogen [25][26][27][28][29] . One protein had similarities to Helicobacter pylori cytotoxin-associated protein A (CagA), which is a virulence factor associated with more severe disease states in H. pylori infection 30 . The details of the comparison between proteins in CON_PiiA and CON_PiiB islands and effector proteins are shown in Table 4.
Figure 3. The phylogenetic tree was generated based on the sequences of the 23S ribosomal RNA genes. The neighbour-joining method was used to generate the phylogenetic tree, which was performed using Molecular Evolutionary Genetic Analysis software version 6.06 (MEGA 6.06) with 1,000 bootstrap replications 47 . Oral strains from patients with IBD that were sequenced in this study are coloured red. Oral strains from healthy controls that were sequenced in this study are coloured blue. Oral strain ATCC 33237 is coloured purple; this strain was isolated from a patient with gingivitis. Enteric strains are coloured green. The genome of enteric strain P3UCB1, a strain isolated from intestinal biopsies of a patient with UC, was sequenced in this study. The remaining genomes of enteric C. concisus strains are publically available. Enteric strain ATCC 51561 was isolated from faecal samples of a healthy individual. Enteric strains UNSW2, UNSW3 and UNSWCD were isolated from patients with CD 24 .

Discussion
Previous studies using different molecular methods such as AFLP, analysis of housekeeping genes and PCR of the 23S rRNA gene showed that C. concisus has two genomospecies [15][16][17][18][19][20][21] . There was some evidence that C. concisus strains of these two genomospecies may have different pathogenic potential [15][16][17][18][19][20][21] .
For example, strains invasive to intestinal epithelial cells were often found in GS2 10,11 . Despite these findings, there is a lack of understanding regarding these two C. concisus genomospecies at the genome level.
In this study, for the first time, we compared the genomes of C. concisus strains from different genomospecies, which revealed new genomic features of this bacterium. We analysed the nine publically available C. concisus genomes, together with the genomes of an additional 27 C. concisus strains that we sequenced. We generated the C. concisus core-genome from these 36 C. concisus strains. The core-genome, the sequences of six housekeeping genes and the 23S rRNA gene consistently assigned these C. concisus strains into two genomospecies (Figs 1-3). The enteric strains did not form distinct groups within either genomospecies, further supporting our previous theory that some oral C. concisus strains may cause enteric disease when colonizing the intestinal tract 3,31,32 . Although the previous study examining eight C. concisus strains reported that the 16S rRNA gene was able to differentiate C. concisus strains isolated from patients with CD and gastroenteritis, this was not observed in our study, in which 36 C. concisus strains were examined (Fig. 4) 24 .
We found nine genes that were specific to GS1 C. concisus strains and fourteen genes that were specific to GS2 C. concisus strains, some of which encode proteins that may contribute to the survival and pathogenicity of C. concisus (Table 2). For example, three of the nine GS1-specific genes encode proteins involved in phosphate transport (PstS, PstA, PstC), suggesting that strains of GS1 and GS2 may differ in their phosphate uptake. Aquaporin Z was found in all GS2 C. concisus strains, but not in any GS1 strains. Aquaporin Z is a protein that moves water across bacterial membranes to maintain intracellular osmotic pressure 33 . The finding that GS2 C. concisus strains have aquaporin Z suggests that they may have enhanced abilities in adapting to environments where osmolarity frequently changes.
Figure 4. The phylogenetic tree generated based on the sequences of 16S ribosomal RNA genes for the 36 Campylobacter concisus strains. The phylogenetic tree was generated based on the sequences of the 16S ribosomal RNA genes. The neighbour-joining method was used to generate the phylogenetic tree, which was performed using Molecular Evolutionary Genetic Analysis software version 6.06 (MEGA 6.06) with 1,000 bootstrap replications 47 . Oral strains from patients with IBD that were sequenced in this study are coloured red. Oral strains from healthy controls that were sequenced in this study are coloured blue. Oral strain ATCC 33237 is coloured purple; this strain was isolated from a patient with gingivitis. Enteric strains are coloured green. The genome of enteric strain P3UCB1, a strain isolated from intestinal biopsies of a patient with UC, was sequenced in this study. The remaining genomes of enteric C. concisus strains are publically available. Enteric strain ATCC 51561 was isolated from faecal samples of a healthy individual. Enteric strains UNSW2, UNSW3 and UNSWCD were isolated from patients with CD 24 . The remaining enteric strains were isolated from patients with gastroenteritis. Bootstrap values of more than 70 are indicated on the internal branches. Campylobacter jejuni strain NCTC11168 was used as an outgroup (GenBank accession no. NC_002163).
The type I CRISPR system, which has the Cas3 protein, was found in 78.6% (22/28) of GS2 C. concisus strains (Table 3). However, the number of CRISPR-associated proteins varied between C. concisus strains. Cas6, an endoribonuclease that generates RNAs for defense in the type I CRISPR system, was present in only five C. concisus strains. The CRISPR system provides acquired immunity to plasmids and phages 34,35 . The CRISPR proteins found in C. concisus strains do not seem to be related to the CON_phi2 prophage that contains the zonula occludens toxin (zot) gene 31 . The C. concisus Zot was found to damage the intestinal epithelial barrier and affect the function of macrophages, and the zot gene was detected in C. concisus strains from both GS1 and GS2 11,23,36 .
Two novel C. concisus genomic islands were identified in this study. CON_PiiA and CON_PiiB islands were found in both GS1 and GS2 C. concisus strains. CON_PiiA was found in 37.5% (3/8) of enteric strains isolated from patients with enteric diseases, including two patients with IBD and one patient with gastroenteritis, but not in the 27 oral C. concisus strains, a difference that was statistically significant. CON_PiiA was not found in ATCC 51561, an enteric strain isolated from faecal samples of a healthy individual. CON_PiiB was found in 18.5% (5/27) of oral C. concisus strains and none of the enteric strains; this difference did not reach statistical significance. Collectively, these data suggest that the CON_PiiA island may preferentially integrate into enteric C. concisus strains isolated from patients with enteric diseases. However, the number of enteric C. concisus strains included in this study was small; larger numbers of enteric C. concisus strains need to be examined to confirm this finding.
Both CON_PiiA and CON_PiiB islands contained T4SS homologous proteins. The T4SS is used by microorganisms to transport macromolecules such as proteins or DNA across the cell envelope 37 . T4SS may be involved in plasmid conjugation, uptake or release of DNA, or transfer of effector proteins into host cells 38 . The well-studied H. pylori cag pathogenicity island encodes proteins homologous to VirB2, VirB4, VirB5, VirB7, VirB9, VirB10, VirB11 and VirD4; these proteins deliver effector proteins such as CagA to host cells through the formation of a pilus 39 . Putative effector proteins similar to L. pneumophila and H. pylori virulence effector proteins were found in both CON_PiiA and CON_PiiB islands. The virulence effector proteins in L. pneumophila are mainly involved in bacterial survival within macrophages [25][26][27][28][29] . The H. pylori CagA virulence factor is associated with gastric cancer 30 . Given that the two novel C. concisus genomic islands found in this study contained proteins similar to T4SS proteins and their effector proteins found in human pathogens, CON_PiiA and CON_PiiB islands are likely to be involved in C. concisus virulence. However, the putative effector proteins found in CON_PiiA and CON_PiiB islands had similarities to only a fragment of CagA and L. pneumophila effector proteins. Their true role in virulence requires confirmation by characterization of individual proteins in these islands.
To our knowledge, this is the first study examining the genomes of C. concisus strains of different genomospecies. We sequenced the genomes of 27 C. concisus strains and performed comparative genome analysis of 36 C. concisus strains, and identified GS1 and GS2 C. concisus specific genes. Furthermore, we identified two novel genomic islands that contained T4SS homologous proteins and putative effector virulence proteins; CON_PiiA appeared to be associated with enteric C. concisus strains isolated from patients with enteric diseases. The new C. concisus genomic features obtained from this study provide novel insights into understanding of the pathogenicity of this emerging opportunistic pathogen.
Table 3. CRISPR-associated proteins in Campylobacter concisus strains. All C. concisus strains that have CRISPR-associated proteins belonged to Genomospecies 2. Letters P and H in strain ID indicate oral strains isolated from patients with inflammatory bowel disease and healthy controls respectively. The remaining five strains were enteric strains isolated from patients with Crohn's disease and gastroenteritis. A positive sign (+) indicates the presence of a gene.

Methods
C. concisus strains used for genome sequencing. C. concisus strains sequenced in this study were isolated from saliva samples or intestinal biopsies in our previous studies 3, 6,11,22 . The genomes of 27 C. concisus strains were sequenced. C. concisus strains were grown on Horse Blood Agar (HBA) plates as previously described 1
Draft genome assembly and identification of C. concisus pan- and core-genome. In addition to the above 27 C. concisus strains sequenced in this study, nine C. concisus genomes that are available in NCBI database were also included for analysis, of which seven genomes were from a previous study 24 . The accession numbers of these nine C. concisus genomes are ANNF00000000, ANNJ00000000, ANNE00000000, AENQ00000000, ANNG00000000, ANNH00000000, ANNI00000000, CP000792.1, NZ_CP012541.1. The genomes of strains 13826 and ATCC 33237 (accession numbers CP000792.1, NZ_CP012541.1) were fully sequenced and the remaining genomes were draft genomes. Thus, a total of 36 C. concisus strains were analysed in this study including 27 oral strains and nine enteric strains. The raw reads were assembled using St. Petersburg genome assembler to obtain the draft genomes (SPAdes, Ver. 3.6.1) 42 (Table 1). Gene annotation was performed using a combination of Rapid Annotations using Subsystems Technology server (RAST, Ver. 2.0) and Prokka (Ver. 1.11) 43,44 . The pan- and core-genome for the 36 C. concisus strains were defined by the Rapid large-scale prokaryote pan-genome analysis software (Roary, Ver. 3.5.7) 45 .
Table 4. Putative effector proteins and other proteins in CON_PiiA and CON_PiiB genomic islands. AA: amino acid. # The homology of putative effector proteins in CON_PiiA and CON_PiiB islands to known bacterial effector proteins based on BLASTp was expressed as % similarity (the number of amino acids used for comparison) (the start and end position of the known bacterial effector proteins that matched). @ Proteins predicted to contain a signal peptide. * The two proteins in CON_PiiA and CON_PiiB had more than 40% identities and the remaining proteins in these two islands had less than 20% identities.
The genome function analysis was performed as described previously 46 . Briefly, the protein sequences were extracted from the annotated genomes and blasted against the NCBI COG database (ver. 2014). Genes with COG assignment were then categorised in a list of functional groups.
Phylogenetic analysis based on the C. concisus core-genome, sequences of housekeeping genes, 23S and 16S rRNA genes. The phylogenetic tree based on the C. concisus core-genome was generated using Roary 45 . The neighbour-joining method was used to generate phylogenetic trees based on housekeeping genes, 23S rRNA genes and 16S rRNA genes of the 36 C. concisus strains examined in this study, which were performed using Molecular Evolutionary Genetic Analysis software version 6.06 (MEGA 6.06) with 1,000 bootstrap replications 47 . The six housekeeping genes were previously shown to be able to define C. concisus genomospecies, including aspartase A (aspA), glutamine synthetase (glnA), transketolase (tkt), aspartate semialdehyde dehydrogenase (asd), ATP synthase F1 alpha subunit (atpA) and glucose-6-isomerase (pgi) 18 . The sequences of housekeeping genes, 23S and 16S rRNA genes from a Campylobacter jejuni strain (GenBank accession no. NC_002163) were used as an outgroup.
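As an illustration of the input to a distance-based tree such as neighbour-joining, the following Python sketch concatenates the six aligned housekeeping genes per strain and computes pairwise distances. It is a sketch under stated assumptions: the uncorrected p-distance is swapped in for illustration because the substitution model used in MEGA is not stated here, the tree-building step itself is not implemented, and the strain names and sequences are hypothetical toy data.

```python
from itertools import combinations

GENE_ORDER = ["asd", "aspA", "atpA", "glnA", "pgi", "tkt"]

def concatenate(alignments, strain):
    """Join one strain's aligned gene sequences in a fixed gene order."""
    return "".join(alignments[gene][strain] for gene in GENE_ORDER)

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(alignments, strains):
    """Pairwise p-distances on the concatenated alignment, keyed by strain pair."""
    joined = {s: concatenate(alignments, s) for s in strains}
    return {(s, t): p_distance(joined[s], joined[t])
            for s, t in combinations(strains, 2)}

# Hypothetical toy alignments: gene -> {strain: aligned sequence}.
toy = {gene: {"GS1_a": "ACGTAC", "GS1_b": "ACGTAC", "GS2_a": "ACCTAT"}
       for gene in GENE_ORDER}
toy["tkt"]["GS1_b"] = "ACGTA-"  # one gapped column, ignored in comparisons

dist = distance_matrix(toy, ["GS1_a", "GS1_b", "GS2_a"])
```

In the toy data the two GS1-like strains are at distance 0 from each other and at a larger distance from the GS2-like strain, which is the pattern a clustering method would use to separate two groups.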
Identification of genomospecies-specific genes. The annotated genes of the 36 C. concisus strains representing the two genomospecies were compared using Roary to determine candidate genes that were specific to GS1 or GS2. A GS1-specific gene refers to a gene that is present in all GS1 strains and absent in all GS2 strains analysed in this study. Similarly, a GS2-specific gene refers to a gene that is present in all GS2 strains and absent in all GS1 strains. To confirm the presence and absence of genomospecies-specific genes, the assemblies of each of the genomes were searched with BLASTn (BLAST+ , Ver. 2.2.31) and BLASTx (BLAST+ , Ver. 2.2.31) 48 .
To ensure that the absence of genomospecies-specific genes was not due to issues with assemblies or sequencing artefacts, raw reads were mapped with Burrows-Wheeler Aligner (BWA, Ver. 0.7.12) 49 . Finally, the flanking regions of the absent genes were confirmed to be located on the same contig.
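The definition of a genomospecies-specific gene given above translates directly into set operations. The Python sketch below assumes a presence/absence table has already been derived (for example from a pan-genome analysis); the function name, strain names and gene assignments are hypothetical toy data chosen to echo the text, and the sketch does not reproduce the BLAST or read-mapping confirmation steps.

```python
def genomospecies_specific(presence, groups):
    """presence: strain -> set of genes; groups: strain -> group label.
    A gene is specific to a group if every strain in that group carries it
    and no strain outside the group does."""
    specific = {}
    for label in set(groups.values()):
        inside = [presence[s] for s in presence if groups[s] == label]
        outside = [presence[s] for s in presence if groups[s] != label]
        in_all = set.intersection(*inside)    # present in all group members
        in_any_other = set().union(*outside)  # present in any other strain
        specific[label] = in_all - in_any_other
    return specific

# Hypothetical toy data.
presence = {
    "gs1_x": {"core1", "pstS", "pstA"},
    "gs1_y": {"core1", "pstS", "pstA", "zot"},
    "gs2_x": {"core1", "aqpZ", "nhaC", "zot"},
    "gs2_y": {"core1", "aqpZ", "nhaC", "cas1"},
}
groups = {"gs1_x": "GS1", "gs1_y": "GS1", "gs2_x": "GS2", "gs2_y": "GS2"}

specific = genomospecies_specific(presence, groups)
```

Note that `cas1` in the toy data is found only in one GS2 strain, so it is not reported as GS2-specific, and `zot` occurs in both groups, so it is specific to neither.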
Identification of genomic islands and the putative effector proteins. Two C. concisus genomic islands containing T4SS homologous proteins were identified in this study, based on the comparison of the flanking genes in C. concisus strains, the presence of integrases and attachment sites, the sizes of the regions, and the presence of plasmid-associated genes. Clustal Omega was used to compare protein sequences between islands 50 . The effector proteins were identified by comparing the proteins in the identified genomic islands with the proteins in the T4SS effector protein database SecReT4 using WU-BLAST on default settings 51 .

Statistical analysis. Fisher's exact test (two tailed) was used to compare the prevalence of CON_PiiA and CON_PiiB islands in enteric and oral C. concisus strains. Statistical analysis was performed using GraphPad Prism 6 software (San Diego, CA).