Novel approach toward the understanding of genetic diversity based on the two types of amino acid repeats in Erwinia amylovora

Erwinia amylovora is a notorious plant pathogenic bacterium of global concern that has devastated the apple and pear production industry worldwide. Nevertheless, the approaches available currently to understand the genetic diversity of E. amylovora remain unsatisfactory because of the lack of a trustworthy index and data covering the globally occurring E. amylovora strains; thus, their origin and distribution pattern remains ambiguous. Therefore, there is a growing need for robust approaches for obtaining this information via the comparison of the genomic structure of Amygdaloideae-infecting strains to understand their genetic diversity and distribution. Here, the whole-genome sequences of 245 E. amylovora strains available from the NCBI database were compared to identify intraspecific genes for use as an improved index for the simple classification of E. amylovora strains regarding their distribution. Finally, we discovered two kinds of strain-typing protein-encoding genes, i.e., the SAM-dependent methyltransferase and electron transport complex subunit RsxC. Interestingly, both of these proteins carried an amino acid repeat in these strains: SAM-dependent methyltransferase comprised a single-amino-acid repeat (asparagine), whereas RsxC carried a 40-amino-acid repeat, which was differentially distributed among the strains. These noteworthy findings and approaches may enable the exploration of the genetic diversity of E. amylovora from a global perspective.


Amino acid repeat of the "Class I SAM-dependent methyltransferase" protein
Through a comparative genome analysis of the 245 E. amylovora strains downloaded from the NCBI database (Supplementary data-Genome), we found a distinct difference in the size of the gene encoding the "Class I SAMdependent methyltransferase" (WP_004166224.1 of strain ATCC49946).This gene exhibited sizes ranging from 1326 to 1389 bp (442-463 amino acids) across AI strains.This difference was fully attributable to a hexanucleotide tandem repeat (5′-AAC AAT -3′) that ranged from 3 to 15 repeated units (Fig. 1).This repeat encoded two asparagine residues (NN), giving rise to a single-asparagine repeat (SAR) of 6-30 units in the E. amylovora strains (Table 1).However, this SAM-dependent methyltransferase gene with a SAR was not detected in most of the RI strains.In addition, we designed PCR primers (metd_F/R) to access SAR from E. amylovora and obtained about 405 bp of amplicons from strains 21-18, 21-1, 20-10, and 21-42.After purifying and sequencing the amplicons, we determined their SARs as 18, 20, 22, and 30, respectively.
We grouped the E. amylovora strains according to the SARs number, and then each SAR group's origin and clade type were analyzed (Table 1).The strains belonging to the Widely-Prevalent clade appeared in various numbers of SAR from 6 to 24.E. amylovora strains from various countries, except for some isolates from USA and Canada, belonged to this clade.In Western N.A. clade, SAR 6, 12, and 14 isolated from USA and Canada were included.In Eastern N.A. clade, there were SAR 6, 10, and B-group, SAR 6, 8, and 12 were included.Interestingly, strains of SARs of more than 16 belonged to the Widely-Prevalent clade.
The results of typing E. amylovora for the SAR revealed unique patterns in some strains isolated from Korea (more than 24 SAR) but not enough to provide high resolution for typing when used alone.Nevertheless, SAR has only one repeat unit, indicating a comparatively high diversity among E. amylovora strains.Thus this repeat region should usually use in combination with other tandem repeat regions as VNTR analysis.Unfortunately, it was difficult to determine the relationship between the host, isolated region, and year according to the SAR length.In addition, strains isolated from Rubus spp.did not carry the SAM-dependent methyltransferase gene and SARs, with the exception of the ATCC BAA-2158 strain.This strain, which belongs to the B-group, carried 6 SARs that may be sorted in the AI group, similar to that reported by another study 13,18 .However, it should be noted that only draft genomes were available for the RI strains.
Generally, bacteria undergo extensive genetic variation in response to various environmental conditions, in part resulting in the expansion and contraction of tandem repeats 23,24 .In turn, tandem repeats have been reported to undergo insertion or deletion events through slipped-strand mispairing or via uneven cross-over during DNA replication.Therefore, many of the tandem repeat sequences in bacterial genomes have been identified and used as genotyping tools.In the case of E. amylovora, tandem repeats have been broadly used in VNTR analysis [9][10][11] .
In fact, the tandem repeat detected in the gene encoding SAM-dependent methyltransferase was used in a VNTR analysis in another study 9 .However, the repeat was reported as "TAA CAA " motif from the target region of the 'hypothetical protein (CFBP 1430, Eamy_0389)' .Currently, the gene annotation of Eamy_0389 has been changed to "Class I SAM-dependent methyltransferase", and we revised the repeat motif as "AAC AAT ", causing a SAR.
Tandem repeats consisting of multiples of three nucleotides in the coding region generate single-amino-acid repeats in the translated protein 20,25,26 .The most-frequently occurring single-amino-acid repeats are glutamine (Q), followed by asparagine (N) and serine (S) 24 .Single-amino-acid repeats have previously been shown to alter protein function or virulence potential 20,21,[25][26][27][28] .Such tandem repeats also happened to cause a SAR in the SAM-dependent methyltransferase gene from AI-type E. amylovora strains.However, the functional role of the tandem repeats and the consequences of their variation among strains remain unclear.

Comparison of the dnd and dpt operons from Erwinia amylovora and Escherichia coli
We compared the genes surrounding the gene encoding SAM-dependent methyltransferase of AI and RI strains of E. amylovora with that of Escherichia coli to identify the presence or absence of this gene between the strains (Supplementary data-GI gene components).We detected differences in the gene composition among AI, RI, and B-group strains.In the case of AI strains, a dpt gene cluster was observed with dptFGH located upstream of the SAM-dependent methyltransferase gene, and a dnd gene cluster was detected with dndEDCB situated in the downstream region (Fig. 2A).These dpt and dnd gene clusters were also discovered in the UMEA 3176-1 strain from E. coli (GCA_000460595.1),as a similar gene structure.However, genes encoding a hypothetical protein or ATPase instead of SAM-dependent methyltransferase were discovered in E. coli 22 .Furthermore, the AI and the E. coli strains commonly carried a tRNA and integrase/recombinase gene upstream of dptF, which is known as a mobile gene element 22,29 and was reported as a GI in the E. amylovora CFBP 1430 and ATCC BAA-2158 strains 18 , suggesting that this region was acquired by horizontal gene transfer (HGT).Interestingly, RI strains or some of the AI clades that did not possess the SAM-dependent methyltransferase gene also had both the tRNA and integrase/recombinase genes in this region.However, other genes were present instead of the dpt/SAM-dependent methyltransferase/dnd cluster.Therefore, some AI strains that did not possess SAM-dependent methyltransferase belonged to the B-group, which carried a specific gene composition after the tRNA and integrase/recombinase gene (Fig. 2B).In addition, RI strains were also clustered differentially according to the gene composition downstream of the tRNA and integrase/recombinase gene (Fig. 2C).Accordingly, we suggest grouping the types of gene structures representing AI, AI B-group, and RI strains in the region located downstream of the tRNA-Leu mobile element and recombinase/integrase gene.Unfortunately, the genomes of all strains presented in Fig. 2B and C were draft genomes, which hampered the full confirmation of the gene structure.
In E. coli, the dnd operon has been shown to be a GI, and three conserved genes, i.e., dptF, dptG, and dptH, are found near the dnd operon (Fig. 2D).Furthermore, E. coli strains encoding the dnd operon are frequently  www.nature.com/scientificreports/among the pathogenic E. coli 22 .In E. amylovora, RI strains, which are restricted to Rubus spp.regarding their host range 30 , did not possess dnd/SAM-dependent methyltransferase gene/dpt gene clusters in their genome.These observations led us to hypothesise that one of the key factors for determining the pathogenicity and host tropism of E. amylovora is the presence of the GI-possessing dnd operon.The causal agent of black shoot blight, E. pyrifoliae, which has a host range that is limited to specific cultivars of pears and apples and is less virulent than E. amylovora 31 , is genetically close to E. amylovora, but does not encode this GI.As an extension, studying the host range or pathogenicity of the strains of the B-group, which belongs to the AI strain group, would be valuable for understanding the relationship between the GI and the dnd cluster, pathogenicity, and host selectivity after horizontal acquisition.Since the genes from the EAMY0383-0403 locus of strain CFBP 1430 were determined as a GI 18 , we analyzed the sequence similarity of the gene components in the GI with those of other organisms to explore the origin of GI.As a result, these genes exhibited a high sequence identity with those of Serratia marcescens WVU-005, Klebsiella grimontii NCTC9146, Klebsiella pneumonia RGT40- GX1006, Pectobacterium odoriferum JK2.1, Yersinia intermedia FDDAARGOS_358, Y. pseudotuberculosis FDAA-GOS_580, and Pantoea dispersa Lsch, with a sequence identity of more than 76% and an E-value less than 0.05 (Table 2).Some species were plant pathogens, including D. dadantii (for SAM-dependent methyltransferase) and P. odoriferum (for dptG).However, most of the bacteria were pathogenic to humans and were distributed in soil, water, and the human gastrointestinal tract [32][33][34][35][36] .The taxonomic order of these bacteria was identical, i.e., Enterobacterales.

Forty-amino-acid repeat located within the "electron transport complex subunit RsxC" gene
We found another intraspecific gene, named "electron transport complex subunit RsxC", with a size that varied among the E. amylovora species.The rsxC gene was included in the rsx cluster in the order of rsxABCDGE in E. amylovora, and exhibited a similar gene composition to that of the E. coli rsx cluster 39 .Among the genes included in the rsx cluster, the gene size of rsxC alone was different among the E. amylovora strains.The size of the rsxC gene ranged from 1853 bp (strain EaG5) to 2493 bp (strain HKN06P1), and the main sequence variation among the different E. amylovora strains emerged at the position of 1679 bp toward the 3′ end.The translation of the nucleotide sequence of rsxC and the comparison of its amino acid sequence between the strains revealed that, starting at amino acid position 553, there were tandem repeats of 40-amino-acid units of the sequence "DPRKAAVEAAIARAKAKKAAQAAPAAADKAAPVQQPAAEQ" toward the C-terminus (Fig. 3).The number of amino acid repeats in rsxC varied from 0 to 5 among the E. amylovora strains (Table 1).Moreover, we detected this amino acid repeat pattern in both AI and RI strains.Nevertheless, we could not find every amino acid repeat pattern of RsxC in most of the RI strains, because their genome sequence was not complete.In addition, we designed PCR primers (EarsxC_885F/R) for amplifying and detecting amino acid repeats in E. amylovora.From the strains 21-18, 21-1, 20-10, and 21-42, 885 bp of amplicons were obtained by PCR and sequenced.Finally, three tandem repeats of 40-amino-acid units were found from each of the strains.
We clustered the E. amylovora strains according to the number of amino acid repeats, from rsx-0 to rsx-5, and compared the origin and clade type between the groups.In Widely-Prevalent clade, rsx-0, 1, and 3 which originated from various countries were included.In Western N.A. clade, there were rsx-1, 2, and 4, and in Eastern N.A. clade, rsx-0, 2, and 4 were included.Interestingly, all strains of rsx-3 group belonged to the Widely-Prevalent clade.Unfortunately, the chromosomes of many of the strains that have been deposited in GenBank were in the scaffold or contig form (Supplementary data-genome).From the 16 RI strains deposited in GenBank, we obtained only one rsxC sequence from strain Ea1-95, which belonged to the rsx-2 group.Likewise, from the strains of B-group, only strain CA3R had rsxC sequence which belonged to the rsx-4 group.
The resolution of this typing method was lower than that of SARs in SAM-dependent methyltransferase since SAR clusters vary from 6 to 30 units.This is because tandem repeats in rsxC are composed of 40-amino-acids, and seem to be very conserved and stable.Interestingly, E. amylovora strains isolated from North America were classified into each of the amino acid repeat groups.In contrast, the European strains were in the rsx-1 and rsx-3 groups, whereas the Korean strains were only in the rsx-3 group.The genetic diversity of the American strain was higher than that of the European and Korean strains, being proportional to the time of E. amylovora emergence.It was also difficult to determine the relationship between the host, isolation region, and year according to the number of amino acid repeats in RsxC.
Intraspecific gene, rsxC is also called rnfC in other bacteria, and the complex is well known to be related to electron transport using CO 2 as an electron acceptor in the anaerobic conditions of Acetobacterium woodii 39 .The cause of the rsxC size difference among the strains is not known; however, the differences in the rnfC size among various bacterial species are understood.It has been reported that the RnfC subunit has a FeS center and Flavin-and NADH-binding sites, and that some species have a longer C-terminus 39 .The amino acid repetition causing the size difference in rsxC among E. amylovora strains was discovered in this study.The exact threedimensional protein structure of rsxC in E. amylovora remains unknown.However, repeated units of 40-aminoacid residues may form solenoid or toroid repeats 40 .This sequence repetition trait detected in rsxC can be used as a new marker for VNTR analysis.

Combining and comparing the amino acid tandem repeats with CRISPR spacer patterns
Additionally, we compared amino acid repeat numbers in SAM-dependent methyltransferase and rsxC genes with concatenated CRISPR spacer patterns 6 (Fig. 4).We could not compare all the E. amylovora strains described in this study since a lot of sequences deposited in NCBI appeared as dozens of contigs or scaffolds.However, the clusters made by CRISPR arrays showed regular patterns with amino acid repeat numbers.E. amylovora strains were mainly divided into three groups by CRISPR patterns.The strains of CRISPR group I, which were usually belonged to Widely-Prevalent clade was matched with rsx-1, 3 group and 16 to 26 SAR.Whereas most strains of CRISPR group II were belonged to Western N.A. clade, and they were matched with rsx-4 and SAR 12 or 14 group.The strains belonging to CRISPR group III were from Eastern N.A. clade or B-group, and matched with rsx-4 and SAR 10 or 12 group.Suggesting that the resolution of tandem repeats in rsxC were more similar to the CRISPR patterns, and SARs would improve the resolution of strain typing by combining these patterns.As LCI types in E. amylovora were revealed to describe distribution recently 16 , future studies of combining LCIs with this study would broaden our knowledge about exploring genetic diversity and evolution of E. amylovora.
In conclusion, we identified two intraspecific genes, i.e., the "SAM-dependent methyltransferase" and "rsxC" genes, using a comparative genomic analysis, to explore the genetic diversity of E. amylovora.We found that the differences in the amino acid repeats present in each of these genes detected among the strains caused strainspecific traits and would increase the resolution of epidemiological studies when combined with other typing methods.Furthermore, the SAM-dependent methyltransferase gene, which was flanked by the dnd and dpt clusters, was only detected in AI strains, and may be acquired by HGT.These results may contribute fundamental information for the study of the genetic diversity and host specificity of E. amylovora.

Collection of apple and pear samples
The diseased plant materials collection and use were carried out in accordance with the fire blight surveillance and control guidelines of Rural Development Administration (RDA, Jeonju, South Korea) which is responsible for the management of fire blight diseased orchards.Samples were collected under RDA Phytosanitary Control Officers license (no.1767).The source of plant samples was listed in the supplementary data-Table S1.

Bacterial strains and DNA isolation
E. amylovora strains were isolated from apple or Asian pear trees with fire blight disease in South Korea.The leaves or branches showing symptoms were sterilised using 70% ethanol, and the margins between the necrotic and healthy tissues were cut into 5 × 5 mm pieces, which were then placed into 1.5-ml microtubes containing 500 µl of sterilized distilled water, followed by grinding and maceration for 30 min.Subsequently, 10 µl of the macerated samples were streaked on tryptic soy agar 41 and King's medium B agar 42 , respectively, then incubated

Whole-genome sequencing
Whole-genome sequencing (WGS) of the E. amylovora isolates was performed using the PacBio RSII (Pacific Bioscience, Menlo Park, CA, USA) and HiSeq™ 4000 (Illumina, San Diego, CA, USA) platform combination.Briefly, to construct the library, 8 µg of genomic DNA was sheared to a size of 20-40 kb using a g-TUBE (Covaris, Woburn, MA, USA).Then, using the PacBio DNA template Prep Kit v1.0 (Pacific Bioscience), 10 µL of library was prepared.SMRTbell templates were annealed and sequenced using the DNA/Polymerase Binding Kit P6 and the PacBio DNA Sequencing Kit 4.0 in 8-well SMRT cells, respectively.The subreads were assembled using the Hierarchical Genome Assembly Process v3 protocol and the SMRT Analysis Software v2.3, and the sequences were then corrected and fixed by Quiver v1 and SMRTpipe v2.3.0.139497, respectively.For the HiSeq sequencing, 1 µg of gDNA was randomly fragmented by Covaris, the adapters were ligated at the end of the fragment, and a size of 400-500 bp was selected for PCR amplification.Illumina reads were mapped against the assembled DNA using Pilon v1.21 for sequence compensation.

Comparative genome analysis
We downloaded the genomic FASTA files of the coding DNA sequences (CDSs) of the E. amylovora strains listed in supplementary data (Genome) from the NCBI bacterial genome database (https:// www.ncbi.nlm.nih.gov/ genome/).We checked the taxonomy and Average Nucleotide Identity results of the deposited sequences in the NCBI Genome Assembly to ensure that the expected sequences were obtained.All collected sequences were compared to mine species-specific genes with more than five differences in amino-acid number in a gene.The nucleotide and amino acid sequences of the mined genes were compared among E. amylovora stains using ClustalV of the Lasergene MegAlign software (Version 7.2.1;DNASTAR Inc., Madison, WI, USA).As a result, we discovered amino acid repeats in these genes that varied among the E. amylovora strains.

Primers for analysing amino acid repeats
Two primer sets were designed to directly analyse amino acid tandem repeats from the E. amylovora isolates.From both nucleotide sequences of Class I SAM-dependent methyltransferase and rsxC genes, forward and reverse primers were designed more than 50 bp outside of each target region.Finally, the metd_F (5′-ATT TAT TAC GGC TTT GGT TTCTT-3′) and metd_R (5′-CTT TCG ATC AGT AGT GTT ATTT) primers for detecting SARs in Class I SAM-dependent methyltransferase and EarsxC_885F (5′-GCG GAG TGC GAA ACA TCA -3′) and EarsxC_885R (5′-GCC TGG CGT GCA TCA TCT G-3′) for detecting amino acid repeats in rsxC were constructed and selected by PrimerSelect software (Version 7.2.1;DNASTAR Inc., Madison, WI, USA).We amplified Korean E. amylovora strains 21-18, 21-1, 20-10, and 21-42 listed in Table 1 by metd and EarsxC_885 primers, respectively.The volume of 25 µl reaction mixture was produced by 25 ng gDNA template, 10 mM of each forward and reverse primer, 1 × reaction buffer, 1.25 unit of Taq polymerase (Promega, Madison, WI, USA), and 0.2 mM of dNTPs.The PCR conditions were as follows: pre-denaturation at 95 °C for 5 min, 35 cycles of denaturation at 95 °C for 30 s, annealing at 60 °C (metd) or 69 °C (EarsxC_885) for 30 s, and extension at 72 °C for 40 s, and a final extension at 72 °C for 10 min.The final products were 405 bp (metd) and 885 bp (EarsxC_885) for each primer.The amplicons were purified and sequenced (Bionics™, Daejeon, South Korea) to determine amino acid repeats.

Structural analysis of the Genomic Island
We compared and analysed the CDS regions located near Class I SAM-dependent methyltransferase in E. amylovora strains using BLASTn against standard databases that are publicly available in NCBI genomes (https:// blast.ncbi.nlm.nih.gov/).For the BLAST search, we selected "Nucleotide collection (nr/nt) of Standard databases, " excluding organism "E.amylovora, " "E.pyrifoliae, " and "uncultured/environmental sample sequences, " and program selection optimised for "somewhat similar sequences (blastn)."

Analysis of CRISPR spacer patterns
To compare amino acid repeat numbers with CRISPR spacer patterns, we collected CRISPR 1, 2, and 3 sequences of E. amylovora strains described by McGhee et al. (2012) from NCBI databases.CRISPR sequences were concatenated, aligned, and then clustered by unweighted-pair group method (UPGMA) tree with 1000 bootstrap replications using Mega-X (v 10.0.5).

Figure 1 .
Figure 1.Structure of the hexanucleotide tandem repeats in the gene encoding the 'Class I SAM-dependent methylfransferase' and their corresponding single-asparagine repeats.Erwinia amylovora ATCC49946 (a) and UT5P4 (b) strains.

Figure 2 .
Figure 2. Genetic map of the genomic islands encoding the dnd and dpt clusters among the different Erwinia amylovora strains.Mobile elements (recombinase, integrase, and transposase) are colored in yellow; restriction endonuclease, blue; and DUF domain-containing protein, grey.

Figure 3 .
Figure 3. Structure of the 40-amino-acid repeats in the gene encoding the 'electron transport complex subunit RsxC' in Erwinia amylovora.HKN06P1 (a) and EaG5 (b) strains.

Table 1 .
Amino acid repeats and basic information of the Erwinia amylovora strains.

Table 2 .
Second-order match homology analysis of query genes in the genomic island using the BLASTn module for Erwinia amylovora ATCC49946.