Introduction

Members of the genus Bacillus comprise gram-positive, spore forming, rod-shaped, aerobic bacteria. Three species of the Bacillus (Bacillus thuringiensis, Bacillus anthracis and Bacillus cereus) have a huge impact on human activities. For example, B. anthracis is the cause of the acute and often lethal disease anthrax1, which is therefore of a concern as a possible agent in biological warfare; B. thuringiensis is extensively used in the biological control of insects due to its ability to produce parasporal protein crystals with insecticidal activity2; B. cereus is an opportunistic human pathogen involved in food-poisoning incidents and contaminations in hospitals1. Some strains of B. cereus have been developed as a useful biological control agent in the suppression of fungi and crop disease3.

While the phenotypes of these Bacillus species are different, their intra and inter phylogenetic relationships are not clear. Several approaches have been used to classify B. thuringiensis strains, including rRNA gene sequences2, amplified fragment length polymorphisms (AFLP)2, restriction fragment length polymorphisms (RFLPs) in small subunit (SSU) rRNA sequences4, GryB (gyrase subunit B) and AroE (shikimate-5-dehydrogenase) gene sequences5. The results of these approaches suggest that there is a high level of sequence homology among the strains of B. thuringiensis. Similarly, overall genetic studies have shown that B. thuringiensis and B. cereus are essentially identical6. B. anthracis can only be distinguished from B. thuringiensis and B. cereus through microbiological and biochemical tests7. Since these genetic methods are not able to easily distinguish different members of B. thuringiensis, B. anthracis and B. cereus, it becomes necessary to look for some more easily recognizable markers.

With the advent and development of next generation sequencing technologies, a great deal of sequencing data has been generated in recent years. The rapid accumulation of whole genome data of Bacillus species in Genbank makes it possible for comparisons of genomic differences over the entire genome that can’t be identified in analyses of specific single gene sequences. However, the size of the whole genome data poses great challenges on alignment-based algorithms, which are effective in dealing with closely related sequences but are unable to evaluate the recombination, shuffling and rearrangement events of the whole genomes8. Thus, alignment-free sequence analysis approaches, such as FFP (Feature Frequency Profile), provide attractive alternatives over alignment-based approaches.

FFP is a new method used to study the whole genome phylogeny based on k –mers9,10,11. In this method, the number of features of a particular length l that occur in a particular genome is counted and assembled into a FFP vector. FFPs from different species are then compared using the Jensen–Shannon (JS) Divergence12. A neighbor-joining phylogenetic tree can thus be constructed based on the resulting distance matrix. Compared to the traditional multiple sequences alignment (MSA) based method, the alignment free FFP method can compare both genic and non-genic regions of the whole genome at higher speed; it can incorporate a wide variety of genomic features into each comparison including intron deletions, exon sequence indels, transposable element insertions, base transversions in coding sequences and some rare genomic changes such as short interspersed element/long interspersed element (SINE/LINE) insertions13. Benefitting from these advantages, this method has been successfully applied to resolving relationships among Escherichia coli and Shigella strains10, prokaryotes9 and mammals13.

In this study, we reconstructed the whole-genome phylogeny of Bacillus (with an emphasis on B. thuringiensis, B. anthracis and B. cereus) using the FFP approach, with an aim to better understand the phylogenetic relationships that exist among them. To validate the usefulness of FFP method, we also processed the data with kSNP v2 (alignment-free sequence analysis method) and Harvest Suite (core genome sequence alignment method). For comparison purpose, we constructed phylogenetic trees inferred from three single genes: 16s rRNA genes, GyrB and AroE, whose DNA sequences were extracted from the corresponding genomes.

Results

The phylogenetic results based on the whole genome data

The phylogenetic tree inferred from the whole genome data of 51 taxa (Table 1) (including 23 B. thuringiensis strains, nine B. anthracis strains, 11 B. cereus strains, three B. subtilis strains, one B. licheniformis strain, one B. weihenstephanensis strain, one B. clausii strain, one B. halodurans strain and one E. coli strain) is presented in Fig. 1. A cluster (I) containing all the B. thuringiensis, B. anthracis and B. cereus strains apart from other Bacillus members under study can be recognized (with an exception of B. weihenstephanensis) . This cluster could be further sub-divided into at least five sub-clusters (I-a to I-e, Fig. 1). The sub-cluster I–b contains all nine B. anthracis strains (B. anthracis str. A0248, B. anthracis str. A16, B. anthracis str. A16R, B. anthracis str. Ames, B. anthracis str. ‘Ames Ancestor’, B. anthracis str. CDC 684, B. anthracis str. H9401, B. anthracis str. Sterne, B. anthracis str. SVA11), whereas the sub-cluster of I-a and I-d contain exclusively B. thuringiensis strains (B. thuringiensis BMB171, B. thuringiensis Bt407, B. thuringiensis DAR 81934, B. thuringiensis HD-771, B. thuringiensis IBL 200, B. thuringiensis IBL 4222, B. thuringiensis serovar andalousiensis BGSC 4AW1, B. thuringiensis serovar berliner ATCC 10792, B. thuringiensis serovar chinensis CT-43, B. thuringiensis serovar huazhongensis BGSC 4BD1, B. thuringiensis serovar kurstaki str. HD73, B. thuringiensis serovar kurstaki str. T03a001, B. thuringiensis serovar kurstaki str. YBT-1520, B. thuringiensis serovar monterrey BGSC 4AJ1, B. thuringiensis serovar pakistani str. T13001, B. thuringiensis serovar pondicheriensis BGSC 4BA1, B. thuringiensis serovar pulsiensis BGSC 4CC1, B. thuringiensis serovar sotto str. T04001, B. thuringiensis serovar thuringiensis str. IS5056, B. thuringiensis YBT-1518). Three B. cereus strains (B. cereus ATCC 10987, B. cereus Q1 and B. cereus AH187) form a monophyletic clade in sub-cluster I-c, whilst four other B. cereus strains (B. cereus AH1271, B. cereus AH1273, B. cereus AH603, B. cereus AH621) are closely grouped with Bacillus weihenstephanensis KBAB4 in sub-cluster I-e.

Table 1 Source of sequence data.
Figure 1
figure 1

Phylogenetic tree of 50 Bacillus strains.

The tree was constructed using the NJ algorithm based on the FFP features of the Whole Genome Data. Escherichia coli Bl21 (DE3) (AM946981.2) was used as an outgroup in the analysis. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis; Green Square for Bacillus subtilis.

In the topology of this NJ tree, three B. subtilis strains (B. subtilis BSn5, B. subtilis subsp. spizizenii str. W23, B. subtilis subsp. subtilis str. 168) form a monophyletic clade, which is further grouped with B. licheniformis DSM 13 = ATCC 14580. These Bacillus strains together with the remaining Bacillus members (B. clausii KSM-K16 DNA, B. halodurans C-125 DNA) are placed near the outgroup E. coli (Escherichia coli BL21(DE3).

Validation of FFP results

The NJ tree inferred from the kSNP analyses of the whole genome is presented in Fig. 2. The monophyly of B. anthracis was confirmed with high bootstrap support (100). A monophyletic clade containing 16 B. thuringiensis isolates was recognized (clade Bacillus thuringiensis). All the B. anthracis, B. cereus, B. thuringiensis and B. weihenstephanensis form a monophyletic clade (Bacillus cereus sensu lato), which is separated from the remaining Bacillus species examined in this study. Outside this major clade, the monophyly of B. subtilis was confirmed (100 bootstrap support).

Figure 2
figure 2

Phylogenetic tree of 50 Bacillus strains.

The tree was constructed using the NJ algorithm based on all SNP matrix inferred from the kSNP V2 analysis. Escherichia coli Bl21 (DE3) (AM946981.2) was used as an outgroup in the analysis. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis; Green Square for Bacillus subtilis.

The core SNP matrix resulted from the kSNP analysis provided a direct visualization of the relationships among all the Bacillus species studied (Fig. 3). There was no variation between the core SNPs of B. anthracis and B. thuringiensis, whist only single variation was found for two B. cereus strains (Bacillus cereus AH603 and Bacillus cereus AH621) and B. weihenstephanensis. The variation of core SNP increased to two among the B. subtilis species and the B. licheniformis DSM 13 = ATCC 1458034. The sharp increase of core SNP variations in B. halodurans C-12533 and B. clausii KSM-K16 (4 and 5 respectively) revealed their distant relationships to the remaining Bacillus species.

Figure 3
figure 3

Core SNP matrix inferred from kSNP v2 (BA: Bacillus anthracis, BC: Bacillus cereus, BCL: Bacillus clausii, BH: Bacillus halodurans, BL: Bacillus licheniformis, BS: Bacillus subtilis, BT: Bacillus thuringiensis, BW: Bacillus weihenstephanensis).

Our effort in using Harvest suite to analyse all the species examined in FFP was not successful. The shared core genome among all the studied taxa was too low (less than 1%) to let the Parsnp program to work. This is because Parsnp is designed for intraspecific alignments and requires >=97% average nucleotide identity among input genomes. The Parsnp started to work when Bacillus species other than the member of Bacillus cereus sensu lato were excluded from the analysis. The final alignment and the resulting NJ tree are presented in Figs 4, 5, 6. The NJ tree distinguished two highly supported clades (100 in bootstrap value): one including all the B. anthracis strains and the other including sixteen B. thuringiensis strains. The whole topology of this NJ tree is highly similar to that of the NJ trees inferred from FFP and kSNP analyses (Figs 1 and 2).

Figure 4
figure 4

NJ tree of 44 Bacillus strains.

The tree was constructed using Parsnp and annotated in MEGA 6.0. The NJ algorithm was based on the Core Genome SNP data of the 44 Bacillus strains. Bacillus weihenstephanensis KBAB4 was used as an outgroup in the analysis. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis. The Genbank file of Bacillus cereus ATCC 14579 (represented as BC14579) was used as reference in the Parsnp analysis.

Figure 5
figure 5

Gingr visualization of 44 Bacillus genomes aligned with Parsnp.

The leaves of the reconstructed phylogenetic tree (left) are paired with their corresponding rows in the multi-alignment.

Figure 6
figure 6

Gingr visualization of 44 Bacillus genomes aligned with Parsnp.

The visual layout is the same as Fig. 5, but unlike Fig. 5, the genome alignment was zoomed to reveal the phylogenetic signature of several clades, in this case within the fully-aligned BC3801 (dipicolinate synthase subunit A).

The Gingr visualization of NJ tree and the genome alignment (core genome based) displayed multiple conserved regions (represented by the SNP heatmap) throughout the entire genome across 44 members of Bacillus cereus sensu lato (Figs 5 and 6). These conserved regions are scattered throughout the genome but showed more density in four regions (500–1500 bp; 11000–15000 bp, 36000–46000 bp and 52000–53000 bp). When being zoomed, the Gingr visualization turned the SNP heatmap into vertical lines, which revealed the phylogenetic signature of several clades [in this case within the fully—aligned dpaA gene (BC3801)] (Fig. 6).

The phylogenetic results based on the single gene data

Three NJ trees inferred from the data of three single genes (16s rRNA gene, GryB and AroE), are shown in Figs 7, 8, 9 respectively. These trees did not support the monophyletic status of B. anthracis. The clades that contain B. anthracis strains also contain other Bacillus species (e.g. B. cereus AH820 in Clade II of Fig. 8 and B. thuringiensis serovar monterrey BGSC 4AJ1 in Clade D of Fig. 9). Among the total 23 B. thuringiensis strains studied, some strains form monophyletic sub - clades in trees inferred from GyrB (Clade V, Fig. 8) and AroE (Clade A and C, Fig. 9), but the monophyletic status of the whole B. thuringiensis strains cannot be confirmed by these analyses. Similarly, B. cereus proved to be a paraphyletic group by all NJ trees inferred from data of three single genes. The data for GyrB and AroE suggested that B. subtilis might be a monophyletic group (Clade IV in Fig. 8 and Clade B in Fig. 9) and this group has close relationship with B. licheniformis DSM 13 ATCC 14580, which is supported by high bootstrap value (97 in Fig. 8 and 99 in Fig. 9). With respect to the phylogenetic placement of B. subtilis and B. licheniformis, the 16S rRNA gene shows very low support in comparison to the other two protein coding genes (Fig. 7).

Figure 7
figure 7

Neighbor-joining tree constructed based on the sequences of the 16S rRNA gene from 50 Bacillus strains.

Escherichia coli Bl21 (DE3) (AM946981.2) was used as an outgroup in the analysis. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis; Green Square for Bacillus subtilis.

Figure 8
figure 8

Neighbor-joining tree constructed based on the sequences of the GyrB from 50 Bacillus strains.

Escherichia coli Bl21 (DE3) (AM946981.2) was used as an outgroup in the analysis. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis; Green Square for Bacillus subtilis.

Figure 9
figure 9

Neighbor-joining tree constructed based on the sequences of the AroE from 50 Bacillus strains.

Escherichia coli Bl21 (DE3) (AM946981.2) was used as an outgroup in the analyse. The bootstrap confidence values were generated using 1,000 permutations. Different symbols were allocated to represent different species: Blue triangle for Bacillus thuringiensis; Pink diamond for Bacillus cereus; Red circle for Bacillus anthracis; Green Square for Bacillus subtilis.

Discussion

Our phylogenetic analysis based on the FFP features of the whole genome and associated plasmids resulted in a major cluster containing all strains of B. thuringiensis, B. anthracis and B. cereus separated from other recognised Bacillus members. When strains of same species were grouped together and subject to pairwise distance analysis, the groups of B. thuringiensis, B. anthracis and B. cereus formed a monophyletic clade in the NJ tree (Fig. 10). These results clearly suggest the close relationship among B. thuringiensis, B. anthracis and B. cereus species and are in agreement with earlier results from DNA-DNA hybridization analysis and Multi Locus Enzyme Electrophoresis (MEE), which showed high identity among B. anthracis, B. cereus and B. thuringiensis strains14. These three species have been grouped under the name of Bacillus cereus sensu lato15 despite their obvious difference in phenotype and pathological effects, which are resulted from the genetic difference in plasmid rather than in chromosome1. The results of present study appear to support the classification of Bacillus cereus sensu lato when using genomic sequences only (data not shown). Greater resolution of recognised species was achieved when plasmid sequences were added to the analysis.

Figure 10
figure 10

Pairwise Distance of the Bacillus numbers.

The scale bar represents a 1% sequence difference (BA:B. anthracis; BT:B. thuringiensis; BC:B. cereus; BW:B. weihenstephanensis; BS:B. subtilis and B. clausii, B. halodurans, B. licheniformis) are placed near the outgoup E. coli (Escherichia coli BL21(DE3).

In the present study, B. weihenstephanensis strain KBAB4 was found to be very closely grouped with the major cluster I-d consisting of all B. thuringiensis isolates and proximal to cluster I-e (B. cereus) and cluster I-c (a cluster containing both B. thuringiensis and B. cereus strains) (Fig. 1). B. weihenstephanensis is a member of the Bacillus cereus sensu lato and has high similarities with B. thuringiensis and B. cereus in terms of its ecological features such as producing cereulide as B. cereus and being psychrotolerance as some B. thuringiensis isolates16,17. Soufiane and Cote (2009)5 revealed the close relationship between B. weihenstephanensis and some B. thuringiensis serovars based on the 16S rRNA, GyrB and AroE gene sequences. Our results support their research and provide further evidence for the classification of Bacillus cereus sensu lato.

The NJ tree inferred from the whole genome sequences of these bacteria species not only revealed the close relationship among B. thuringiensis, B. anthracis and B. cereus, but also confirmed the monophyly of B. anthracis (I-b, Fig. 1). Previous studies using other techniques have all stated that B. anthracis is the most monomorphic species among B. thuringiensis, B. anthracis and B. cereus18,19,20. Our results confirmed the genetic homogeneity of B. anthracis but failed to elicidate the evolutionary relationships between B. anthracis and the remaining two species. B. anthracis has been regarded to be evolved from a B. cereus ancestor through acquisition of key plasmid-encoded toxin, capsule and regulatory loci21. Such a relationship did not appear in our phylogenetic analyses based on FFP analysis of whole genome data (Fig. 1). The B. anthracis clade is proximal to a number of isolates of B. cereus and B. thuringiensis which have been associated with disease or toxicity in humans.

The findings of FFP analyses were fully supported by SNP phylogenies construed by kSNP (alignment – free sequence analysis method) and Parsnp (core genome alignment method). By comparing the NJ trees inferred from FFP analysis (Fig. 1) and kSNP analysis (Fig. 2), we found a high level of similarity between two phylogenies. The clades of I - b and I - d clades in FFP tree are consistent with the Bacillus anthracis and Bacillus thuringiensis clades in kSNP tree, whilst the cluster I in FFP tree is corresponding to the clade of Bacillus cereus sensu lato in kSNP tree. While the core genome SNP tree constructed by Parsnp failed to cover all the species studied, the exclusion of other Bacillus species from the major cluster was actually a support for the monophyly of Bacillus cereus sensu lato. This is because Parsnp is limited in intraspecific alignment and can only tolerate genomes with high similarity ( >=97%). Genomes from other species will be automatically excluded from the full alignment22.

Within the core genome SNP tree constructed by Parsnp and visualized by Gingr, the monophyly of B. anthracis and a subclade covering 16 B. thuringiensis strains were confirmed, which is consistent with the results of FFP analysis. The tree also revealed the paraphyly of B. cereus and B. thuringiensis, which is similar to the findings of FFP and kSNP analyses. By zooming the alignment files from genome level to nucleotide level via the fisheye zoom feature of Gingr22, we noticed the SNP variations across different strains of B. cereus and B. thuringiensis that affects the topology of the trees. For B. cereus, the most variable region falls on an area between the gene of Cytochrome d ubiquinol oxidase subunit II and the gene of Alanine racemase (around 121456 bp). There are more SNP sites at this region among four B. cereus strains (B. cereus AH603, B. cereus AH621, B. cereus AH1272 and B. cereus AH1273), which contributed the distant placement of these four strains from the remaining B. cereus strains in the NJ tree. Similarly, we found a region (around 1006988 bp) with high SNP density in B. thuringiensis (starting from the gene of Lysr-type transcriptional regulator and ending at the gene of Thiamine/molybdopterin biosynthesis protein). The distantly placed B. thuringiensis strains (such as B. thuringiensis serovar finitimus YBT-020 and B. thuringiensis str. Al Hakam) generally have more SNP sites in this region than that of the remaining B. thuringiensis strains. It is not clear why some B. cereus and B. thuringiensis strains have more SNP variations at these particular genome regions than that of other strains and what are the impacts of these SNP variations on the phenotype, function and pathogenicity of these Bacillus strains. More research is thus required to answer these questions.

In contrast to the FFP analysis based on the whole genome data, our phylogenetic analysis based on single gene data (16S rRNA gene, GyrB and AroE) were unable to clearly distinguish between Bacillus species examined. The 16S rRNA gene sequence analysis clustered all the sequences together and provided poor resolution for the relationships between each strain. Similarly, our analysis based on two protein coding genes, GyrB and AroE, were unable to separate B. thuringiensis, B. anthracis and B. cereus from other Bacillus members while they provided support for the monophyletic position of B. subtilis. A further analysis using the concatenated sequence of these genes failed to provide any better analysis (data not shown).

These results suggest that FFP analysis of the combined genomic and plasmid sequence data allows for comparisons of genomic differences that can’t be identified in analyses of specific single gene sequences and provides greater resolution of species belonging to B. cereus senso lato than other techniques such as MLST, AFLP or single gene sequence analysis. Furthermore, the availability and reduced cost of whole genome sequencing can be used without extensive gene annotation to provide robust phylogenetic analysis of new isolates as they become available.

Materials and Methods

Source of sequence data

The genome sequence of Bacillus thuringiensis strain DAR 81934 was from our previous research23. Genome sequences of other 49 Bacillus and one E.coli [Escherichia coli BL21(DE3)] (used as outgroup) were retrieved from GenBank (Table 1). The retrieved genome sequences cover both the main chromosome sequences and the plasmid sequences (if any) of each species. The nucleotide sequences of three single nuclear genes for these taxa: 16s rRNA, GyrB and AroE, were extracted from the corresponding whole genome sequences.

Phylogeny analysis via FFP

The whole genome sequences of the 51 taxa were converted to multiFasta format before being uploaded to FFP –3.1910, where the different forms of genome partitions were compared between species and NJ trees were constructed based on the Jensen–Shannon divergences matrix from each type of genome partition. By following the recommendations of the program, we used the tools of ffpvocab and ffpre to find the right range of lengths to use (l = 20 was finally chosen in the analysis). We also conducted bootstrapping (1000) to assess the FFP phylogenetic analysis. The outcome of the bootstrap analysis was imported into Phylip 3.224 to create a consensus distance matrix, which was further processed in MEGA 6.025 to display the final NJ tree.

Validation of FFP results

We applied two programs to validate the outcomes of FFP analysis. The first program is kSNP v2.1.226, an alignment-free sequence analysis method with the capacity to build whole genome phylogeny on single nucleotide polymorphisms (SNPs) in whole genome data. We examined the same datasets of FFP by running kchooser to find the optimum k –mer size (19) prior to the kSNP analysis and including the flag of “–j ” in the command line to estimate Neighbor Joining trees based on all SNPs and core SNPs. The resulting all-SNPs-matrix was imported into MEGA 6.025 for NJ tree construction. The core-SNP-smatrix was applied to demonstrate the core SNP differneces across all examined genomes.

The Harvest Suite22 (inclusing Parsnp and Gingr) was also applied to validate the FFP outcome. We aligned genomes studied in FFP and built NJ phylogentic trees through Parsnp, a fast core-genome multi-aligner and vusualized the alignment and trees with Gingr, a dynamic visual platform. The default parameters recommanded by the program were followed during the whole analysis.

Single gene based phylogeny

The retrieved single gene sequences of 16s rRNA, GyrB and AroE were imported into MEGA 6.0 for sequence alignment (Clustal W27) and phylogenetic analysis (Neighbor-Joining28). The Kimura 2-parameter model was selected by executing the function of “Find Best DNA/Protein Models” prior to the Phylogenetic analyses. Statistical confidence on the inferred tree topology was assessed with 1,000 bootstrap replications.

Additional Information

How to cite this article: Wang, A. and Ash, G. J. Whole Genome Phylogeny of Bacillus by Feature Frequency Profiles (FFP). Sci. Rep. 5, 13644; doi: 10.1038/srep13644 (2015).