## Introduction

Bacillus cereus group comprises 11 closely related species, including the first described species B. anthracis, along with B. cereus, B. thuringiensis, B. mycoides, B. weihenstephanensis, B. pseudomycoides and the recently identified “B. gaemokensis”, “B. manliponensis”, B. cytotoxicus, B. toyonensis, “B. bingmayongensis” (Three species “B. gaemokensis”, “B. manliponensis” and “B. bingmayongensis” are effectively but not yet validly published and thus are in quotation marks throughout in this study)1. In comparison with their tremendous contributions for production of numerous enzymes2 and metabolites3, removal of various heavy metals4 and persistent organic pollutants5 and growth promotion of animals and plants as probiotics6, much attention has been focused on the epidemiology and pathogenesis of the B. cereus group bacteria7. More specifically, B. anthracis, the etiologic agent of anthrax, can result in high mortality for human and ungulate herbivore and was even used as a biological weapon in a terrorist attack8. B. cereus as an opportunistic pathogen is responsible for food poisoning caused by a range of protein toxins, characterized by either nausea and vomiting or abdominal pain and diarrhea as well as a wide range of serious invasive infections9. Occasionally, the presence of toxin genes is determined using a variety of methods in B. mycoides, B. weihenstephanensis, B. cytotoxicus and other10. B. thuringiensis is an entomopathogenic bacterium largely attributed to insecticidal crystal proteins encoded by the cry genes, which generally locate on plasmids11.

The taxonomy of the B. cereus group strains is the cornerstone for better understanding of phylogenetic relationships and population diversity. In the past several decades, bacterial identification and taxonomic classification of this group have been extensively investigated by using traditional phenotypes (e.g., morphology, physiology, biochemical characteristics, etc.) and genotypes (e.g., ribotyping, random amplified polymorphim DNA, multilocus sequence typing (MLST), etc.)12,13,14,15,16. Given the simplicity, reproducibility and discrimination among these approaches, MLST is an outstanding method for elucidating clonal relationships of bacteria within this group. Meanwhile, with the advent of the genomic era, a large number of publicly available genomic sequences provide a promising avenue for bacterial species delineation using digital DNA-DNA hybridization (dDDH) and average nucleotide identity (ANI). Recently, a number of studies combining with dDDH and ANI have been successfully applied to some genera and species1, 17, 18. However, this approach has not yet been applied to the B. cereus group.

The bacteria of the B. cereus group occupy a wide range of habitats, ranging from soil to water and from plants and animals to food, probably owing to the metabolic diversity endowing them with ecological adaptation of different environments and strong survivability of spores allowing them to better withstand hostile conditions and to better disperse19, 20. Their population structure and phylogenetic diversity in freshwater and terrestrial environments has been extensively investigated in recent years using several MLST schemes14,15,16. However, to the best of our knowledge, they are less well understood in diverse marine environments, particularly in the deep sea. In last decades, a large number of bacteria within this group were recovered from various marine samples after enrichment with heavy metals, crude oil and the polycyclic aromatic hydrocarbons under various environmental conditions like low temperature and high salinity. These samples used were collected globally, mainly seawater of different layers and sediments from both coastal and pelagic areas, from the Pacific Ocean21, 22, the Indian Ocean23, the South China Sea24, etc., awaiting further analyses on their phylogenetic and ecological diversity.

The aim of this study is to investigate the diversity and niche specificity of bacteria affiliated to the B. cereus group in marine environments by means of an MLST scheme, and to further elucidate their taxonomic position by measurements of dDDH and ANI, using 71 isolated strains together with 11 previously published type strains (Table 1).

## Results

### Sequence diversity

The MLST sequences for the P scheme16 were from 348 bp (purH) to 504 bp (gmk) in length, and seven concatenated sequences produced a 2,829-bp fragment (Table 2). Among the 82 strains, the number of alleles for per gene varied from 28 to 34, and the number of polymorphic sites (S) ranged from 82 to 191 (Table 2). The gene ilvD possessed the highest number of alleles (34), and correspondingly the highest number of polymorphic sites (191), whereas gmk and tpi possessed the lowest number of alleles (28) and polymorphic sites (82). No insertions or deletions were observed in any sequences except for type strain B. manliponensis BL4–6T with a six-base insertion in ilvD gene. The nucleotide diversity (π), which is defined as the average number of nucleotide differences per site between two randomly selected sequences, varied from 0.0199 (tpi) to 0.0856 (ilvD). Seven genes exhibited mean G + C content from 38.0 to 44.8 mol%, which were higher than those of the genomic sequences (ca. 35 mol%) of the B. cereus group bacteria from the GenBank database. A considerable variation in the Ka/Ks ratio was observed in Table 2, suggesting that tpi and glpF were under higher selective pressure than gmk, ilvD, pta, purH and pycA. However, all values were far below one, indicating that these genes were most likely selectively neutral and thereby suitable for MLST analysis. Consistently, the Tajima’s D values, which measures deviation from the standard neutral model of evolution, ranged from −1.8967 to −0.7817 (Table 2). The evolution of these genes was likely driven by neutral selection, which is typical for housekeeping genes.

The Index of Association (I A) was calculated to estimate the degree of association and recombination between alleles at different loci based on the allelic profile data25. Values of both the classical and standardized Index of Association (I A = 5.5542 and $${I}_{{\rm{A}}}^{{\rm{S}}}$$ = 0.9256, respectively) were significantly different from zero, and the pairwise variance (VD = 4.3022) was greater than the 95% critical value (L = 0.6901) when all isolates were included in the analyses (P < 0.001), indicating that there was a high level of linkage disequilibrium. Therefore, the result demonstrated that there were limited recombinational events and a clonal structure within the B. cereus group.

### STs, CCs, distribution and population of STs

The seven loci of the 71 marine strains (27 STs) and 11 type strains (11 STs) yielded 38 unique sequence types (STs), of which 23 STs were first discovered and deposited in the MLST database (Supplementary Table S1). The most frequent type, ST 761, was shared by 18 strains, followed by the other two types, ST 32 and ST 117, which were shared by 14 and 11 isolates, respectively. The type ST 756 was found in five strains. The other 34 types were represented by only one strain. Thirty-eight STs contained one clonal complexes (CC) and 36 singletons at the single locus variant level by the Global Optimal eBURST (goeBURST)26. The sole CC was formed by ST 177 and ST 759, which contained 12 strains (Fig. 1). However, the evolutionary relationship among the 36 singletons could not be inferred based on STs, neither between these singletons and the defined CC doublet.

Further, we analyzed the distribution of STs for 71 strains and two type strains from marine environments. In spite of exhibition of STs for nine type strains in Fig. 1, they were not included in the distribution analyses according to their terrestrial origin. As illustrated in Fig. 1a, ST 761 (18 strains) and ST 765 (5 strains) distributed only in sediment samples. In contrast, ST 32 (14 strains) and ST 177 (11 strains) distributed in both sediment and seawater samples. In sediment samples, the occurrence frequency of STs was 50%, significantly higher than that in seawater (23.5%). Similarly, we also analyzed the distribution of STs in different oceans or marine areas (Fig. 1b). ST 32 representing 14 strains was only found in the Pacific Ocean. On the contrary, ST 177 distributed more widely and was found in the Pacific Ocean, the Indian Ocean, and the Arctic Ocean. Occurrence frequency of STs in the South China Sea (each of the 14 strains represents a unique ST) was considerably higher than those of STs in other marine areas; for example, in 45 strains from the Pacific Ocean, 12 STs were retrieved.

All STs for 82 bacteria of the B. cereus group were analyzed using Structure software27. Multiple runs with K values from 2 to 20 showed maximal posterior probability at K = 12, suggesting that the collected the B. cereus group strains were descendants of 12 distinct ancestor populations. The rebuilding of population structure showed all STs were attributed to 20 distinct subpopulations, which corresponded to 20 clusters/species described below. There is little admixture of ancestral sources among these subpopulations, such as clusters 1, 4, 10, 11 and 15 (Fig. 2a); thus the STs within each subpopulation were highly homogenous. On the contrary, an obvious admixture of ancestral sources occurs within the subpopulations, such as cluster 5–8 and 16–20 (Fig. 2a).

### Phylogenetic analysis

Generally, the phylogenetic tree of a single housekeeping gene is consistent with that of MLST, but the latter provided a better resolution and presented a robust topology with high bootstrap values. Therefore, a neighbor-joining (NJ)28 tree based on seven concatenated housekeeping genes was generated using the software MEGA version 5.0529. As shown in Fig. 2b, the MLST tree revealed a high genetic diversity; the 82 strains were divided into 20 clusters, labelled Cluster 01 to Cluster 20. Each cluster represents a unique species based on the following dDDH/ANI analyses. The assignment of 82 strains to each cluster thoroughly described in Table 1.

In more details, Cluster 01 was the largest group, including 19 strains, and had two branches represented by ST 761 and ST 770. Cluster 02 represented B. anthracis, and accommodated type strain ATCC 14578T and two strains from the sediment of the South China Sea. Cluster 03 included three strains with unique STs, corresponding to ST 771, ST 773 and ST 774. The third largest group, Cluster 04, contained 14 strains, all from the seawater of the Pacific Ocean, and displayed the same ST (ST 32).

Cluster 09 attributed to the species B. toyonensis, including type strain BCT-7112T (ST 111) and strain BC41 (ST 762). Cluster 10 included five marine bacteria and type strain B. thuringiensis ATCC 10792T, each owns a unique ST, while these types were closely related to each other. Cluster 11, the second largest group, contained 15 marine strains and type strain B. cereus ATCC 14579T and was represented by 6 STs in total. Interestingly, in this cluster, 11 isolates of ST 177 were distributed widely, with nine strains from the Pacific Ocean and two from the Arctic Ocean. Cluster 12 harbored two previously established species, represented by type strains B. mycoides DSM 2048T and B. weihenstephanensis DSM 11821T. However, our results indicate that they should be conspecific. Cluster 15 was split into three branches corresponding to 3 STs, represented by 6 strains. Other 11 clusters were each represented by a single strain and corresponded to a unique ST. Thus, in addition to 10 previously established species, 10 new taxa were demonstrated in marine environments.

### The dDDH and ANI analyses

For further validation of the phylogeny based on MLST, the dDDH and ANI analyses were carried out in this study. To ensure at least one strain representative for each cluster from Cluster 01 to 20, 21 genomic sequences of our isolates were determined. The 11 of reference bacteria were obtained from the GenBank database. Therefore, a total of 32 bacterial genomes were analyzed (Supplementary Table S2). The dDDH and ANI values between pairwise strains were shown in Supplementary Table S3. Considering 70% DDH values as the gold standard for the species boundary in bacterial taxonomy30, 32 strains were classified into 20 species, perfectly corresponding to the 20 clusters of the MLST analysis. However, the dDDH values between B. cereus bacteria and B. thuringiensis bacteria were slightly but not significantly above 70%, indicating once again that the two species were of the closest relationship. Furthermore, ANI analysis presented the same phylogeny as obtained by dDDH analysis. According to 95–96% ANI criteria for species definition31, 32 strains were also divided into 20 species, corresponding to the same 20 clusters. Therefore, among 20 clusters, 10 belonged to the well-defined species, each of which was represented by a type strain; while the remaining 10 clusters each represented a putative novel species.

The correlation between the dDDH values and MLST similarities, dDDH and ANI values for 32 strains were determined by a nonlinear simulate analysis method, respectively. The dDDH values were highly correlated with the MLST similarities (R2 = 0.9754). Based on the simulative logarithmic equation of y = 91.11*exp(0.001079*x) − 700.6*exp(−0.19*x), 70% DDH was equivalent to a 98.3% MLST similarity (Fig. 3a). The dDDH values were also strongly correlated with the ANI values (R2 = 0.9944). Based on the simulative exponential equation (y = 86.61*exp(0.001515*x) − 119.5*exp(−0.09997*x)), 70% dDDH corresponded to a 96.2% ANI (Fig. 3b). The correlation analyses described above indicated that both 98.3% of MLST similarity and 96.2% of ANI could be used as the species threshold for bacteria of the B. cereus group.

## Discussion

Bacteria of the B. cereus group are the subject of growing interest because of their considerable significance in agriculture and industry, especially in the area of food safety and health of human and animals7. They are ubiquitously found in diverse environments and have been isolated from a wide variety of niches for hundreds of years, including the multiple organs of plants, different kinds of human and animals, water from ponds, rivers and lakes, soils of different texture and regions, various types of food and other locations (to get more detailed information of isolation origins for bacteria within the B. cereus group, please refer to http://mlstoslo.uio.no/). However, their phylogenetic diversity in marine environments has not been determined to date, although a large number of bacteria have been isolated from various marine environments in our laboratory and other research teams. In this report, we applied MLST, dDDH and ANI analyses to explore their genetic diversity and geographic distribution of this group.

In this study, we found that the 71 marine isolates contained 27 STs that consisted of 23 new STs and four previously identified STs. These isolates were not only widely distributed in marine environments, but also extraordinarily rich in diversity and contained numerous novel genotypes. The marine environment is complex and unique compared with terrestrial environments, characterized by high salinity, oligotrophy and weak alkalinity (pH 7.9–8.4)32 in general and other properties in special niches, which certainly contribute to the development of the genetic, physiological and metabolic diversity of marine bacteria.

In addition to the new phylotypes, it is also noteworthy that some marine strains belonged to B. anthracis, B. cereus and B. thuringiensis species. Two strains, designated BC31 and BC36, were affiliated with B. anthracis species based on dDDH and ANI analyses. In consideration of their phylogenetic position and the pXO1-like plasmid of BC36 (unpublished data), two strains differed from traditionally described B. anthracis, and therefore should be allocated to anomalous B. anthracis. In general, B. cereus is a ubiquitous opportunistic pathogen and can result in the food poisoning when ingested by humans. Among the 71 marine strains, 14 belonged to B. cereus species. Furthermore, the enterotoxin genes (hblCDA, nheABC, cytK and entFM)33 were detected in most of the genomic sequences of 32 strains using a local BLASTN method (unpublished data), indicating that they have a relatively wide distribution in bacteria of this group, not just limited to B. cereus. Accordingly, the bacteria of this group from marine environments, including B. cereus, should be given the same attentions as bacteria from terrestrial environments. Four bacteria belonged to B. thuringiensis, but did not contain cry genes that encode insecticidal crystal proteins (Cry). This is in agreement with our previous report that the presence or absence of cry genes cannot be used as a phenotypic characteristic for discriminating B. thuringiensis from other1.

Evidently, the abundance of bacteria in sediment samples is significantly higher than that in seawater samples, and they possess more diverse STs. This is in accordance with many previous studies34,35,36. Compared with the water column, sediments retain more organic materials that are resistant to degradation, are less proteinaceous and other fragile materials37. Sediment thus more closely resembles the soil environment, where Bacillus bacteria thrive. Likewise, occurrence frequencies of novel STs were much higher in the South China Sea than in other areas. We speculated that two reasons may account for this phenomenon. On the one hand, the indigenous bacteria of this group isolated from the South China Sea probably possessed more rapidly diversification rates relative to them in other areas because they enjoyed the large input of nutrients from coastal areas. On the other hand, the bacteria isolated from the South China Sea, might have originated from terrestrial environments by watercourse and/or anthropogenic activities, were forced to modify the phenotypic and genotypic characterizations in order to adapt harsh ocean environments, that is, stress-driven evolution. No matter what kind of evolution was driven, they resulted in a much richer diversity of B. cereus group isolates in the South China Sea compared with other areas. Certainly, we cannot also completely rule out the possibilities of a lower sampling or less effective cultivability in marine environments than other from the South China Sea. The different assumptions would need to be further confirmed by comparative genomics analysis and other in the following studies.

The classification and taxonomic separation of the B. cereus group strains is the cornerstone for a better understanding of their phylogenetic relationships and population structure38. In spite of this, it has long been a controversial topic for bacteriologists in distinguishing these species within this group39.

Traditionally, these organisms of this group have been differentiated on the basis of diverse phenotypic characteristics, in particular pathogenic potential7. For example, B. anthracis is the etiological agent of anthrax and an obligate pathogen threatening to human and animal health8. B. cereus is an opportunistic pathogen and often associated with human food poisoning10. B. thuringiensis has long been regarded as an insect pathogen owing to the formation of intracellular protein crystals during sporulation11. But, given that the phenotypic characteristics depend on environmental conditions, such as media, temperature, light, etc., some strains varied in reproducibility of phenotypes. In such case, the current practices in phenotypic typing for the B. cereus bacteria have been questioned.

Genotypic typing approaches on the basis of 16S rRNA gene and single housekeeping genes can overcome the shortcomings of phenotypic analyses, and therefore have been widely used in many studies40,41,42. But, the previous studies were unable to provide an accurate and consistent classification for the phylogenetically interspersed species within this group probably because of the high conservation, horizontal gene transfer and recombination of rRNA and housekeeping gene. For example, the 16S rRNA gene can identify a bacterium to the B. cereus group, but cannot assign it accurately to a certain species because of its low discrimination1. In our previous study, a housekeeping gene pycA was capable of rapidly distinguishing almost all close relatives of this group1, which has proven useful for the present study (see Supplementary Fig. S1b). However, as shown in Supplementary Fig. S1b, the use of a shortened pycA gene sequence (363 bp) provides inadequate genetic information to describe the accurately phylogenetic relationships of bacteria within this group relative to the MLST analysis1 (see Supplementary Fig. S1a). The taxonomic affiliations of bacteria of this group might be wrongly attributed because of horizontal gene transfer of a single housekeeping gene43. For example, two strains, Bacillus cereus AH187 and Bacillus cereus G9241, belonged to two different species on the basis of 64.2% of dDDH and 95.7% of ANI. However, these two strains affiliated to the same species based on a 98.2% similarity of gyrB gene. In 2015, the taxonomic status of the bacteria within the B. cereus group was clearly established using whole-genome sequences by our group in cooperation with German colleagues1. However, species discrimination based on whole-genome sequencing is rather expensive even today.

In view of the aforementioned situation, MLST was applied to investigate the phylogenetic relationship of marine bacteria of the B. cereus group in this study. These strains can be divided into 20 clusters corresponding to 20 species that were further confirmed using the dDDH and ANI analyses based on genomic sequences. Among 10 new clusters, six overlapped with those found in our previous report: Cluster 01 in this study corresponded to Group BCG 12, Cluster 03 corresponded to BCG 14, Cluster 04 corresponded to BCG 10, Cluster 06 corresponded to BCG 22, Cluster 13 corresponded to BCG 21, and Cluster 14 corresponded to BCG 111. The other four clusters did not correspond with any clusters from our previous study1. Compared with the studies of Priest et al.16, Barker et al.44, Hoffmaster et al.45 and Didelot et al.46, the analyzed isolates were grouped into three clades including eight lineages with the exception of the “Others” lineage. The closely related B. cereus group isolates were roughly clustered with three major clades in the studies of Kim et al.47 and Cardazzo et al.48. Although these findings were nearly the same for the B. cereus group isolates, these clades and lineages were arbitrarily defined, taking into consideration the relationships between isolates in multiple phylogenetic trees. Moreover, these clades and lineages were also not further validated using other approaches. Consequently, we recommend that the taxonomic and phylogenetic analysis of the B. cereus group bacteria should combine a variety of methods in future studies.

The correlation analyses between the dDDH values and MLST similarities, dDDH and ANI values provided two helpful thresholds for the species definition of the B. cereus group bacteria. In general, an ANI value of approximately 95–96% was used as the cut-off for species demarcation31. In our analysis, this threshold value was slightly raised to 96.2% which will provide more accurate taxonomic relationships of the closely related strains within the B. cereus group. Up to the time of writing this manuscript, the MLST database for the B. cereus group strains contained 1,272 STs among 1,547 isolates corresponding to the P scheme16. Based on the 98.3% MLST similarity, the precise and detailed classification of 1,547 isolates should be feasible in a future study.

In summary, our report uncovers for the first time the genetic diversity of the B. cereus group bacteria from diverse marine environments, based on MLST, dDDH and ANI analyses. These isolates are highly diverse with a wide distribution in marine environments and harbor many novel STs, including 20 clusters. Threshold values (98.3% of MLST similarity and 96.2% of ANI) are proposed as the standard for the species definition of this group. The results of this study are beneficial for understanding the adaptation and diversification of the B. cereus group to diverse marine environments.

## Methods

### Bacterial isolates and DNA extraction

Seventy-one strains of the B. cereus group from the Marine Culture Collection of China (MCCC) were included in this study (Table 1). Specifically for isolated habitats, 54 of them were isolated from the sediments and 17 from the seawater. Among these strains, 50 were from the Pacific Ocean, 9 from the South China Sea, 7 from the Gulf of Mexico, 3 from the Indian Ocean and 2 from the Arctic Ocean. Moreover, all type strains excluding type strain B. anthracis ATCC 14578T from multiple Microbiological Culture Collection Center abroad have currently preserved in the MCCC, and have been used as reference strains (Table 1). For the original source of type strains, four were from soil samples, two from the sediments, and five from other, including cattle, air, food and insect. Apart from two type strains from the Yellow sea, nine were from the terrestrial habitats (Table 1). All strains were incubated on modified Luria-Bertani medium with 3% NaCl (w/v) at pH 7.0 and 32 °C. Genomic DNA was extracted using the SBS extraction kit (Shanghai SBS Genetech Co., Ltd., Shanghai, China) in accordance with the manufacturer’s instructions.

### MLST gene amplification, sequencing, and determination

The P scheme of MLST for B. cereus group used internal fragments of the following seven housekeeping genes: glpF (glycerol uptake facilitator protein), gmk (guanylate kinase, putative), ilvD (dihydroxy-acid dehydratase), pta (phosphate acetyltransferase), purH (phosphoribosylaminoimidazolecarboxamide), pycA (pyruvate carboxylase) and tpi (triosephosphate isomerase)16. The seven housekeeping were chosen owing to the following criteria: (1) the common presence in all strain belonged to the core genes; (2) single copy; (3) a relatively high discrimination power relative to rRNA genes and other; (4) no lateral gene transfer and recombination events. PCR amplification of the housekeeping genes was performed using the primer pairs of each gene, different annealing temperatures and extension times (Table S4). The PCR products were purified and subsequently sequenced with the same primers used in the PCR. Each unique gene sequence was regarded as an allele and was assigned an allele number. The set of all allele numbers for a given strain represents an allelic profile or sequence type (ST). Allele numbers of each gene and ST of each strain were assigned using the PubMLST database (http://pubmlst.org/bcereus/). New alleles and new STs were assigned by the MLST website curator.

### Analysis of sequence diversity

Diversity indices of single gene sequence, such as the number of alleles, the number of polymorphic sites, nucleotide diversity per site (π), the mean G + C content, the Ka/Ks ratios (Ka: the number of non-synonymous substitutions per non-synonymous site; Ks: the number of synonymous substitutions per synonymous site) and Tajima’s D, were analyzed with the software DnaSP version 5.1029. Pairwise similarities for the concatenated sequences were analyzed with the software MEGA version 5.0549 using Kimura’s two-parameter model50.

Allelic linkage disequilibrium was assessed with two test options of both Monte Carlo methods and Parametric (100 resamplings) using LIAN version 3.7 (http://guanine.evolbio.mpg.de/cgi-bin/lian/lian.cgi.pl/query)25.

### Population structure

Strain relationships were analyzed using the goeBURST algorithm26, as implemented in the software PHYLOViZ51 to cluster the STs into clonal complexes (CCs) based on the most stringent definition. The isolates that shared identical alleles at six of the seven loci with at least one other member of the group, were assigned to a single CC. The primary founder of a CC, single locus variants, double locus variants, and singletons were defined using the stringent default setting. Isolate-specific metadata, including the isolation sample and source, were then overlaid on top of the minimum spanning tree.

The population structure of bacteria within the B. cereus group was built using the Structure software. The data set was analyzed by the software Structure version 2.3.4 using the admixture model27, following a burn-in period of 100,000 iterations, Markov chain Monte Carlo of 50,000 repetitions with five iterations for each K (K set between 2 and 20), where K is the estimated maximum number of genetically distinct groups. The evaluation of the K probability was conducted by the Δ(K) method52.

### Phylogenetic analyses

The determined single gene and concatenated sequences (in the following order: glpF, gmk, ilvD, pta, purH, pycA and tpi) were aligned using the ClustalW option implemented in the software MEGA version 5.0529. The phylogenetic tree of the concatenated genes was constructed using the neighbor-joining (NJ) algorithm28 by Kimura’s two-parameter model50 with the software MEGA version 5.0529. Bootstrap values for individual nodes were calculated for 1,000 replicates for the evaluation of tree robustness.

### Correlation analyses between dDDH, MLST similarities and ANI values

The genomic sequences of 21 representative strains selected on the basis of the phylogenetic analysis were determined by Tianjin Biochip Corporation (Tianjin, China) using the Illumina/Solexa sequencing technology. Reads were trimmed to remove low quality nucleotides using the software Trimmomatic version 0.32 (http://www.usadellab.org/cms/?page=trimmomatic). The high-quality reads of each strain were assembled using the software SOAPdenovo version 1.05. The sequencing depth for each strain was over 100×. Automatic gene annotation for each genome was carried out by the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) (http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html), followed by manual editing. The genomic sequences of 11 type strains within the group were obtained from the GenBank database. The dDDH values of 32 bacteria were estimated using the genome-to-genome distance calculator website service (GGDC 2.1) (http://ggdc.dsmz.de/distcalc2.php). ANI values of genomic sequences of them were calculated using the EzGenome web service (http://www.ezbiocloud.net/ezgenome/ani). Correlation analyses between the dDDH values and MLST similarities, dDDH and ANI values were, respectively, simulated using a nonlinear simulation analysis method with the default option of the Curve Fitting Tool implemented in MATLAB 8.1.

### Nucleotide sequence accession numbers

Seven housekeeping genes sequences for 71 strains and draft genomes sequences for 21 strains were submitted to the GenBank database, and all accession numbers were, respectively, shown in Supplementary Tables S2 and S5.