Whole genome sequencing and protein structure analyses of target genes for the detection of Salmonella

Rapid and sensitive detection of Salmonella is a critical step in routine food quality control, outbreak investigation, and food recalls. Although various genes have been targeted in the design of rapid molecular detection methods for Salmonella, there is limited information on the diversity of these target genes at the level of DNA sequence and encoded protein structure. In this study, we investigated the diversity of ten target genes (invA, fimA, phoP, spvC, and agfA; ttrRSBCA operon including 5 genes) commonly used in the detection and identification of Salmonella. To this end, we performed whole genome sequencing of 143 isolates of Salmonella serotypes (Enteritidis, Typhimurium, and Heidelberg) obtained from poultry (eggs and chicken). Phylogenetic analysis showed that Salmonella ser. Typhimurium was more diverse than either Enteritidis or Heidelberg. Forty-five non-synonymous mutations were identified in the target genes from the 143 isolates, the two most common being T ↔ C (15 times) and A ↔ G (13 times). The gene spvC was primarily present in Salmonella ser. Enteritidis isolates and absent from Heidelberg isolates, whereas ttrR was more conserved (0 non-synonymous mutations) than ttrS, ttrB, ttrC, and ttrA (7, 2, 2, and 7 non-synonymous mutations, respectively). Notably, we found one non-synonymous mutation (fimA-Mut.6) across all Salmonella ser. Enteritidis and Salmonella ser. Heidelberg isolates, C → T (nt position 496), resulting in a change at AA position 166, glutamine (Q) → stop codon (TAG), suggesting that the fimA gene has questionable sites as a target for detection. Using Phyre2 and SWISS-MODEL software, we predicted the structures of the proteins encoded by some of the target genes, illustrating the positions of these non-synonymous mutations, which were mainly located in the α-helices and β-sheets, key elements for maintaining protein conformation. These results will facilitate the development of sensitive molecular detection methods for Salmonella.


Results
WGS and phylogenetic analyses of the three Salmonella serotypes. The overall results of WGS, such as the number of assembled bases and N50 contig sizes, are summarized in Supplementary Tables 1-3 and Supplementary Fig. 1. In general, the Salmonella genomes averaged approximately 5 Mb in size, the number of contigs ranged from 28 to 473, and the average depth of coverage ranged from 25× to 433× (Supplementary Tables 1-3). The WGS data of all the isolates studied here can be accessed from the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) with their accession numbers listed in Supplementary Tables 1-3. Next, we constructed phylogenetic trees grouped by serotype to investigate the isolates' genetic diversity. The phylogenetic tree constructed for the 64 Salmonella ser. Enteritidis isolates was separated into two clades (clade A and clade B) with 552-565 SNPs (Fig. 1). Forty-one of the isolates were placed into clade B with 3 subclades, namely, clade B1 (9 isolates), clade B2 (14 isolates), and clade B3 (18 isolates), with 160-219 SNPs. The 40 Salmonella ser. Typhimurium isolates were grouped into two clades (clade A and clade B); each clade had three subclades (clade A1, A2, A3, and clade B1, B2, B3) (Fig. 2). Within the three subclades of clade A, we identified 48-81 SNPs; within the three subclades of clade B, we identified 673-1141 SNPs. All eight Salmonella ser. Typhimurium isolates from egg sources were placed into clade B2. The 39 Salmonella ser. Heidelberg isolates were grouped in three different clades, namely, clade A, clade B, and clade C (subclades A, B1, B2, C1, C2, and C3), with fewer than 100 SNPs (19-95 bp) among them (Fig. 3). Five egg-sourced isolates formed clade A. Eighteen isolates belonged to clade B, and 16 belonged to clade C. Clade B2 encompassed all nine chicken-sourced and two egg-sourced (CFSAN015479 and CFSAN033547) Salmonella ser. Heidelberg isolates; this clade also had relatively high pairwise SNP distances from the other clades (85-95 bp).
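The SNP ranges reported above are distances between aligned genomes. As a rough illustration of what such a range measures (this is not the pipeline used in the study, which is not described in this section), the following Python sketch counts pairwise differences in a toy alignment. The function names and the three toy sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base ('N') are ignored (a simplifying assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignored = {"N", "-"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignored and b not in ignored
    )


def pairwise_snp_range(alignment: dict[str, str]) -> tuple[int, int]:
    """Return the (minimum, maximum) pairwise SNP distance in an alignment."""
    distances = [
        snp_distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    ]
    return min(distances), max(distances)


# Hypothetical toy alignment, not data from the study.
toy_alignment = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "ACCTANGA",
}
print(pairwise_snp_range(toy_alignment))  # (1, 2)
```

Real SNP pipelines also filter by mapping quality, coverage, and SNP density, so distances from a production workflow will generally differ from a naive column-by-column count.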

Distribution of selected target genes among three Salmonella serotypes. After sequencing and assembling the genomes, we performed BLAST analyses to investigate the presence of the selected target genes among Salmonella ser. Enteritidis, Salmonella ser. Typhimurium and Salmonella ser. Heidelberg isolates. The BLAST analyses showed that all 143 isolates contained the genes invA, ttrRSBCA, phoP, fimA, and agfA (data not shown). By contrast, not all the isolates carried the spvC gene. Specifically, this gene was present in 59/64 Salmonella ser. Enteritidis isolates and in 14/40 Salmonella ser. Typhimurium isolates, whereas none of the 39 Salmonella ser. Heidelberg isolates carried spvC. We also identified isolates that carried only partial target gene sequences. For example, two Salmonella ser. Typhimurium isolates (CFSAN034209, CFSAN036243) carried a partial sequence of fimA, and six Salmonella ser. Heidelberg isolates carried a partial sequence of ttrA (CFSAN015377, CFSAN015378, CFSAN015380, CFSAN017093, CFSAN017094, CFSAN017095).
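The spvC carriage figures can be converted into per-serotype prevalence with simple arithmetic. The short Python sketch below does so using only the counts reported above; the variable and function names are illustrative.

```python
# (isolates carrying spvC, isolates examined) per serotype, as reported in the text
spvc_counts = {
    "Enteritidis": (59, 64),
    "Typhimurium": (14, 40),
    "Heidelberg": (0, 39),
}


def prevalence_percent(present: int, total: int) -> float:
    """Percentage of isolates carrying the gene."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * present / total


for serotype, (present, total) in spvc_counts.items():
    print(f"{serotype}: {present}/{total} = {prevalence_percent(present, total):.2f}%")
# Enteritidis: 59/64 = 92.19%
# Typhimurium: 14/40 = 35.00%
# Heidelberg: 0/39 = 0.00%

total_isolates = sum(total for _, total in spvc_counts.values())
total_carriers = sum(present for present, _ in spvc_counts.values())
```

Summed over the three serotypes, the denominators give the 143 isolates of the study, which is a quick check that the per-serotype counts are consistent.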

Mutations of the target genes.
We identified numerous non-synonymous mutations and synonymous mutations among the selected target genes (Table 1). The most commonly detected non-synonymous mutations were changes between T and C (15 times), A and G (13 times), and T and G (8 times); the less frequent changes were those between G and C (4 times), A and C (4 times), and A and T (once). The mutation rates for A/T and G/C were 53.33% and 46.67%, respectively.
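As a consistency check on the counts above, the six substitution classes can be tallied and grouped into transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) and transversions. The transition/transversion grouping is added here for illustration and is not reported in the text; the names in the sketch are hypothetical.

```python
# Non-synonymous substitution classes and their counts, as reported in the text
substitution_counts = {
    ("T", "C"): 15,
    ("A", "G"): 13,
    ("T", "G"): 8,
    ("G", "C"): 4,
    ("A", "C"): 4,
    ("A", "T"): 1,
}

PURINES = {"A", "G"}


def is_transition(base_a: str, base_b: str) -> bool:
    """True when both bases are purines or both are pyrimidines."""
    return (base_a in PURINES) == (base_b in PURINES)


total = sum(substitution_counts.values())
transitions = sum(
    n for (a, b), n in substitution_counts.items() if is_transition(a, b)
)
transversions = total - transitions

print(total, transitions, transversions)  # 45 28 17
```

The total of 45 matches the number of non-synonymous mutations stated in the abstract.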
The fimA gene had 13 non-synonymous mutations, most of which were identified in the Salmonella ser. Enteritidis and Salmonella ser. Heidelberg isolates; only two synonymous mutations were found, one in a single Salmonella ser. Enteritidis isolate (CFSAN030097, raw egg yolks) and one in all Salmonella ser. Heidelberg isolates (data not shown). Notably, all Salmonella ser. Enteritidis and Salmonella ser. Heidelberg isolates carried one common nonsense mutation (fimA-Mut.6), C → T (nt position 496), resulting in a change at AA 166, namely, glutamine (Q) → stop codon (TAG).
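The relationship between the nucleotide position and the affected amino acid in fimA-Mut.6 follows from codon arithmetic. The Python sketch below assumes that positions are 1-based and counted from the first base of the start codon, and that the original codon is CAG. The codon is inferred here from the reported change (C → T at the first base of a glutamine codon yielding TAG); the wild-type sequence itself is not given in this section. Function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def locate_in_codon(nt_position: int) -> tuple[int, int]:
    """Map a 1-based nucleotide position to (1-based codon number, 0-based offset)."""
    if nt_position < 1:
        raise ValueError("nucleotide positions are 1-based")
    return (nt_position - 1) // 3 + 1, (nt_position - 1) % 3


def substitute(codon: str, offset: int, new_base: str) -> str:
    """Return the codon with the base at `offset` replaced by `new_base`."""
    return codon[:offset] + new_base + codon[offset + 1:]


codon_number, offset = locate_in_codon(496)
mutant_codon = substitute("CAG", offset, "T")  # CAG is an assumed wild-type codon

print(codon_number, offset, mutant_codon, mutant_codon in STOP_CODONS)
# 166 0 TAG True
```

Under the same numbering assumption, the phoP substitution at nt 268 reported in this study falls at the first base of codon 90, since `locate_in_codon(268)` returns `(90, 0)`.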
Finally, the phoP gene exhibited one non-synonymous mutation across 25 Salmonella ser. Enteritidis isolates and four synonymous mutations among all Salmonella ser. Enteritidis isolates; no phoP mutation was found among the isolates in the other two serotypes. For the agfA and spvC genes, only six synonymous mutations occurred in the Salmonella ser. Heidelberg and Salmonella ser. Typhimurium isolates, and one synonymous mutation of the spvC gene occurred in Salmonella ser. Enteritidis isolates (data not shown). No non-synonymous mutations were observed for the agfA and spvC genes among the isolates studied.
Phylogenetic analyses for each target gene. Phylogenetic trees were then constructed for each of the selected target genes based on their nucleotide sequences (including synonymous/non-synonymous mutations) in the Salmonella isolates studied to investigate their genetic differences. We found disparities among all selected target genes (Figs. 4, 5, 6) across all three serotypes except the ttrR gene. Four Salmonella ser. Enteritidis isolates (CFSAN030097, CFSAN025700, CFSAN032958, CFSAN032959) differed from the others in the fimA-based phylogenetic tree, and the Salmonella ser. Typhimurium isolates were separated into three clusters (Fig. 4C). In the phoP-based phylogenetic tree, the isolates of Salmonella ser. Typhimurium and Salmonella ser. Heidelberg were in the same clade, and Salmonella ser. Enteritidis isolates were divided into two different clades (Fig. 5A). Variations were observed for the phylogenetic trees constructed for ttrS (Fig. 6A), ttrB (Fig. 7A). Although we tried to predict the structure of the protein encoded by the fimA gene using Phyre2 and SWISS-MODEL, neither program generated an acceptable result. The predicted protein structure for the fimA gene, as generated by SWISS-MODEL, had a GMQE score of 0.99, but the low QMEAN score (-5.44) suggested that it may not be the best representation possible (Fig. 7B). The structure generated by Phyre2 was worse, with less than 21% confidence (model not shown). Similarly, we did not finalize models for the protein structures of the agfA and ttrR genes, as the confidence and quality scores from both Phyre2 and SWISS-MODEL were too low to proceed. Additionally, in the predicted structures (Figs. 7 and 8), the two common secondary structure elements, α-helices and β-sheets, were evident in the proteins encoded by the target genes: invA (8 α-helices and 5 β-sheets), fimA (8 α-helices and 2 β-sheets), phoP (8 α-helices and 4 β-sheets), spvC (6 α-helices and 3 β-sheets), ttrS (14 α-helices and 6 β-sheets), ttrB (4 α-helices and 3 β-sheets), ttrC (12 α-helices and 0 β-sheets), and ttrA (15 α-helices and 12 β-sheets).

Discussion
Phylogenetic analyses of three Salmonella serotypes. WGS is the most powerful tool for bacterial genomic variation analyses because it is based on the complete bacterial genome at the DNA level. WGS SNP analysis of 256 Salmonella ser. Enteritidis isolates obtained over 48 years was used to establish that a new lineage of Salmonella ser. Enteritidis emerged and spread in Brazil after 1994 23 . In Denmark, reanalysis by WGS of isolates from eight previously reported outbreaks successfully discriminated 372 isolates of Salmonella ser. Typhimurium and its monophasic variants 24 27 . Interestingly, in this study, all eight Salmonella ser. Typhimurium isolates from egg sources were placed into the same subclade (clade B2), whereas five egg-sourced Salmonella ser. Heidelberg isolates grouped into one clade (clade A), and nine chicken-sourced isolates and two egg-sourced isolates were placed into the same subclade (clade B2). Unfortunately, the epidemiological information available for these isolates was too limited to explore the relationships among them.
DNA sequence variation of target genes among Salmonella isolates and their phylogenetic relationships. The selected target genes/gene clusters invA, ttrRSBCA, phoP, fimA, agfA, and spvC are all virulence factors of Salmonella. The spvC gene is on the Salmonella virulence plasmid, while the others are located on the chromosome 28 . Although the spvABCD gene cluster was reported to be highly conserved in Salmonella ser. Typhimurium, Salmonella ser. Dublin and Salmonella ser. Choleraesuis 29 , our analysis revealed that the spvC gene was primarily found among Salmonella ser. Enteritidis isolates (92.19%), followed by Salmonella ser. Typhimurium isolates (35%). None of our Salmonella ser. Heidelberg isolates carried spvC, which is consistent with previous reports [30][31][32][33] . A previous study showed that spvC is essential for the virulence of Salmonella ser. Typhimurium in mice 55 ; the potential impact of the presence or absence of the spvC gene on the isolates' ability to cause infection warrants further investigation. In addition, only selected alleles of the target genes/gene clusters, namely those more frequently used in the design of detection technologies, were investigated in the current project. Analysis of the phylogenetic diversity of the serotypes based on the phylogenetic trees constructed using individual genes showed that the serotypes differ in the genes ttrS, ttrB, ttrC, and ttrA (Fig. 6). Variations were observed in Salmonella ser. Enteritidis and Salmonella ser. Typhimurium, but the genes in Salmonella ser. Heidelberg were more stable. The ttrR gene was the most highly conserved of the ttr genes, because no mutations were found in ttrR among all the isolates studied. As a gene cluster within the Salmonella pathogenicity island (SPI, part of the flexible gene pool) on the chromosome, the ttrRSBCA locus has been shown to be transferable through horizontal gene transfer (HGT) events such as transfer by phages or conjugative transposons. This might be one of the reasons that many PCR protocols have an extra target (such as invA) in combination with the ttrRSBCA gene to reinforce the specificity of PCR detection 9,34 . In addition, the presence of mutations in these genes could also be an underlying reason. Finally, our study also demonstrated the presence of partial fimA (two isolates) and ttrA (six isolates) gene sequences among some Salmonella isolates, whereas all studied Salmonella isolates contained the genes invA, phoP, and agfA. Phylogenetic trees generated from both the invA and agfA genes clearly resolved three lineages corresponding to the three serotypes (Fig. 4A and B). Interestingly, the phoP-based phylogenetic tree divided the Salmonella ser. Enteritidis isolates into two different clades (Fig. 5A). Taken together with the analysis of the fimA, spvC, and ttrRSBCA genes (Figs. 4C, 5B, 6), these results suggest that the invA, phoP, and agfA genes have higher discriminatory power to diagnose Salmonella at the genus level than the other selected genes.
www.nature.com/scientificreports/

Mutations of target genes and prediction of protein structures.
Among the 143 Salmonella isolates investigated in this study, both synonymous and non-synonymous mutations were discovered in the selected target genes. Among the 45 non-synonymous mutations observed (Table 1), the two most frequent substitutions were between T and C (15 times) and between A and G (13 times). The substitution rates of A/T and G/C were approximately equal. Similarly, in a study involving 106 Salmonella ser. Enteritidis isolates, 55 non-synonymous mutations were discovered, and the two most frequent substitutions were also T ↔ C (27 times) and A ↔ G (15 times) 25 . These results are in accordance with the biased gene conversion model, in which AT → GC mutations have a higher probability of being transmitted to the next generation, as an AT/GC heterozygote produces more gametes carrying G or C than those carrying A or T, presumably through the GC-biased repair of A:C and G:T mismatches in heteroduplexed recombination intermediates 35,36 . In addition, polymorphisms in an organism result from mutation, selection, and other processes such as biased gene conversion, which favors the transmission of G/C over A/T alleles 35,36 .
In the successfully predicted protein structures of the target genes (Figs. 7 and 8), the overwhelming majority of the mutations were located in the main domain of the structure, except for fimA-Mut.1, which occurred close to the C-terminus, and fimA-Mut.13, which occurred close to the N-terminus. The mutations mainly occurred in α-helices and β-sheets, which are key elements for maintaining the conformation of proteins; such mutations could therefore interfere with the hydrogen bonding between main-chain amide and carbonyl groups and their corresponding representations. It is well known that the C-terminal sequence is an important structural and functional site of proteins and peptides, whereas the N-terminus influences the overall biological function of the protein.
The invA gene encodes an N-terminal integral membrane domain and a C-terminal cytoplasmic domain that is proposed to form part of a docking platform. As invA is essential for Salmonella to gain access to epithelial cells, isolates with non-synonymous invA mutations may have reduced virulence 37,38 . Since the invA gene is well acknowledged as an effective target for detecting Salmonella, it seems that the mutations observed in invA have little impact on protein function. Further experiments and data are needed to test this hypothesis and reveal the underlying mechanisms.
In terms of the number of mutations present in ttrRSBCA, we speculated that the ttrR gene was the most highly conserved, followed by the ttrB and ttrC genes, which have been successfully used in real-time PCR detection of Salmonella in food 10,34 . Differences in the prevalence of mutations in the ttrRSBCA cluster may be related to the distinct functions and loci of these genes. The ttrR and ttrS genes are components of the ttrSR two-component regulatory system for functional tetrathionate reductase expression, while ttrA, ttrB and ttrC are tetrathionate reductase structural genes. The ttrB iron-sulphur clusters probably function in the transfer of electrons from ttrC to ttrA 39 .
Among the 64 Salmonella ser. Enteritidis isolates studied, 25 had one mutation each in the phoP gene (G → A, nt 268). Allard et al. (2013) also observed the same non-synonymous mutation in the phoP gene in Salmonella ser. Enteritidis isolates from egg-associated samples 25 . The phoP protein has a conserved N-terminal domain with an essential aspartate residue and a C-terminal domain that binds DNA. The phoP/Q two-component system, encoded by phoP and phoQ, controls more than 40 genes, such as prgs, pagO, pagC, and pagD, which regulate the host inflammatory response, lipopolysaccharide (LPS) formation, and extracellular protein transport and promote virulence and intracellular survival [40][41][42][43] . This system may also play a specific role in Salmonella ser. Enteritidis pathogenicity in mice 44 . A phoP-based LAMP assay has been developed to effectively detect Salmonella in food samples 13 . The stability of the phoP gene (only one non-synonymous mutation was found) also supports its potential as a target gene for detecting Salmonella.
It was reported that the fimA gene contains sequences unique to Salmonella strains and is an effective target for detecting Salmonella in feed and food samples 16,45 . However, in this study, we found not only fimA mutations close to the N- and C-termini but also that all Salmonella ser. Enteritidis and Salmonella ser. Heidelberg isolates carried one common nonsense mutation (fimA-Mut.6, Q → stop codon), which indicates that the fimA gene has questionable sites for use as a target in the design of Salmonella detection methods. The fimA gene, encoding a major fimbrial subunit, was mapped within the fim gene cluster for the chaperone-usher pathway for the assembly and secretion of multi-subunit appendages (type I pili/fimbriae). Type I pili consist of a helical rod-like structure (fimA and papA) and a flexible tip that contains the minor pilus subunits (fimF, fimG and fimH, papE, papF, papG, papK). Solved crystal structures have shown the elongation complex fimD-fimH-fimG-fimF-fimC and the next subunit-chaperone complex of fimA-fimC in the chaperone-usher pathway. The 3D structure of fimA modelled on a fimH-G1 template indicates that the interface between the subunits contains small hydrophobic or polar residues such as alanine (A), serine (S) and threonine (T) [46][47][48] . This may provide an explanation for the non-synonymous mutation revealed at the N-terminus (S → P) of the predicted fimA structure.
No non-synonymous mutations were detected in the agfA and spvC genes. Our attempt to predict the protein structure of agfA was unsuccessful. The limited number of reports on the agfA gene calls for further research on this gene. The agfBCA operon encodes thin aggregative fimbriae/curli (formerly SEF17), and the thin aggregative fimbriae are primarily comprised of agfA subunits 49,50 . Although thin aggregative fimbriae are produced by most Salmonella and Escherichia coli isolates 51 and a high sequence similarity was found between Salmonella ser. Enteritidis SEF17 fimbriae and E. coli curli 49 , Doran et al. (1993) reported that agfA-based nucleotide probes hybridized only to Salmonella DNA 52 . The agfA gene has been successfully used as a target for Salmonella detection 9 . The spv genes, including spvA, spvB, spvC, spvD, and spvR, are often carried on a large Salmonella virulence plasmid. However, in some serotypes, they are integrated into the chromosome 53 . The spvC gene product, a Salmonella effector with phosphothreonine lyase activity towards host mitogen-activated protein kinases, can be secreted in vitro by the SPI-1 and SPI-2 type III secretion systems 54 . This gene is essential for the full virulence of Salmonella ser. Typhimurium in mice 55 . Oligonucleotide insertions in spvC were shown to be nonpolar 56 .
Currently, little is known about the complex functions of the target genes frequently used for the detection of the above-mentioned Salmonella isolates. Our study provides a comparison of 10 target genes from the perspective of DNA sequence and protein structure. More effort should be made to determine whether these mutations affect protein function, and further biological, molecular, and functional research would help fill the knowledge gaps in this area. We found that both non-synonymous and synonymous mutation rates vary among the frequently used target genes in the three serotypes studied. Large-scale investigation of mutation rates in more serotypes is needed to determine which genes are more suitable for use as detection targets. With the use and sharing of this new information about these target genes in the future, the ability to identify and investigate Salmonella infections by comparing gene sequence data will be greatly enhanced.

Materials and methods
Salmonella isolates. We selected 143 Salmonella isolates from chicken/duck eggs (including raw egg whites, raw egg yolks, raw whole eggs, pecked eggs, egg slurry, egg salad, frozen liquid egg, cooked quail eggs, salted duck eggs, duck egg yolks, frozen salted duck yolks) and chicken (chicken, chicken jerky, chicken breast): 64 Salmonella ser. Enteritidis isolates were collected from 1995 to 2016 (Supplementary Table 1