Whole Genome Phylogeny of Bacillus by Feature Frequency Profiles (FFP)

Fifty complete Bacillus genome sequences and associated plasmids were compared using the “feature frequency profile” (FFP) method. The resulting whole-genome phylogeny supports the placement of three Bacillus species (B. thuringiensis, B. anthracis and B. cereus) as a single clade. The monophyletic status of B. anthracis was strongly supported by the analysis. FFP proved to be more effective in inferring the phylogeny of Bacillus than methods based on single gene sequences [16s rRNA gene, GryB (gyrase subunit B) and AroE (shikimate-5-dehydrogenase)] analyses. The findings of FFP analysis were verified using kSNP v2 (alignment-free sequence analysis method) and Harvest suite (core genome sequence alignment method).

FFP is a new method used to study the whole genome phylogeny based on k -mers [9][10][11] . In this method, the number of features of a particular length l that occur in a particular genome is counted and assembled into a FFP vector. FFPs from different species are then compared using the Jensen-Shannon (JS) Divergence 12 . A neighbor-joining phylogenetic tree can thus be constructed based on the resulting distance matrix. Compared to the traditional multiple sequences alignment (MSA) based method, the alignment free FFP method can compare both genic and non-genic regions of the whole genome at higher speed; it can incorporate a wide variety of genomic features into each comparison including intron deletions, exon sequence indels, transposable element insertions, base transversions in coding sequences, and some rare genomic changes such as short interspersed element/long interspersed element (SINE/LINE) insertions 13 . Benefitting from these advantages, this method has been successfully applied to resolving relationships among Escherichia coli and Shigella strains 10 , prokaryotes 9 and mammals 13 .
In this study, we reconstructed the whole-genome phylogeny of Bacillus (with an emphasis on B. thuringiensis, B. anthracis and B. cereus) using the FFP approach, with an aim to better understand the phylogenetic relationships that exist among them. To validate the usefulness of FFP method, we also processed the data with kSNP v2 (alignment-free sequence analysis method) and Harvest Suite (core genome sequence alignment method). For comparison purpose, we constructed phylogenetic trees inferred from three single genes: 16s rRNA genes, GyrB and AroE, whose DNA sequences were extracted from the corresponding genomes.

Results
The phylogenetic results based on the whole genome data. The phylogenetic tree inferred from the whole genome data of 51 taxa ( Validation of FFP results. The NJ tree inferred from the kSNP analyses of the whole genome is presented in Fig. 2. The monophyly of B. anthracis was confirmed with high bootstrap support (100). A monophyletic clade containing 16 B. thuringiensis isolates was recognized (clade Bacillus thuringiensis). All the B. anthracis, B. cereus, B. thuringiensis and B. weihenstephanensis form a monophyletic clade (Bacillus cereus sensu lato), which is separated from the remaining Bacillus species examined in this study. Outside this major clade, the monophyly of B. subtilis was confirmed (100 bootstrap support).
The core SNP matrix resulted from the kSNP analysis provided a direct visualization of the relationships among all the Bacillus species studied (Fig. 3). There was no variation between the core SNPs of B. anthracis and B. thuringiensis, whist only single variation was found for two B. cereus strains (Bacillus cereus AH603 and Bacillus cereus AH621) and B. weihenstephanensis. The variation of core SNP increased to two among the B. subtilis species and the B. licheniformis DSM 13 = ATCC 1458034. The sharp increase of core SNP variations in B. halodurans C-12533 and B. clausii KSM-K16 (4 and 5 respectively) revealed their distant relationships to the remaining Bacillus species.
Our effort in using Harvest suite to analyse all the species examined in FFP was not successful. The shared core genome among all the studied taxa was too low (less than 1%) to let the Parsnp program to work. This is because Parsnp is designed for intraspecific alignments and requires >=97% average nucleotide identity among input genomes. The Parsnp started to work when Bacillus species other than the member of Bacillus cereus sensu lato were excluded from the analysis. The final alignment and the resulting NJ tree are presented in Figs 4-6. The NJ tree distinguished two highly supported clades (100 in  (Figs 1 and 2). The Gingr visualization of NJ tree and the genome alignment (core genome based) displayed multiple conserved regions (represented by the SNP heatmap) throughout the entire genome across 44 members of Bacillus cereus sensu lato (Figs 5 and 6). These conserved regions are scattered throughout the genome but showed more density in four regions (500-1500 bp; 11000-15000 bp, 36000-46000 bp and 52000-53000 bp). When being zoomed, the Gingr visualization turned the SNP heatmap into vertical lines, which revealed the phylogenetic signature of several clades [in this case within the fully-aligned dpaA gene (BC3801)] (Fig. 6).
The phylogenetic results based on the single gene data. Three NJ trees inferred from the data of three single genes (16s rRNA gene, GryB and AroE), are shown in Figs 7-9 respectively. These trees did not support the monophyletic status of B. anthracis. The clades that contain B. anthracis strains also contain other Bacillus species (e.g. B. cereus AH820 in Clade II of Fig. 8, and B. thuringiensis serovar monterrey BGSC 4AJ1 in Clade D of Fig. 9). Among the total 23 B. thuringiensis strains studied, some strains form monophyletic sub -clades in trees inferred from GyrB (Clade V, Fig. 8) and AroE (Clade A and C, Fig. 9), but the monophyletic status of the whole B. thuringiensis strains cannot be confirmed by these analyses. Similarly, B. cereus proved to be a paraphyletic group by all NJ trees inferred from data of  Fig. 8 and Clade B in Fig. 9), and this group has close relationship with B. licheniformis DSM 13 ATCC 14580, which is supported by high bootstrap value (97 in Fig. 8 and 99 in Fig. 9). With respect to the phylogenetic placement of B. subtilis and B. licheniformis, the 16S rRNA gene shows very low support in comparison to the other two protein coding genes (Fig. 7).  15 despite their obvious difference in phenotype and pathological effects, which are resulted from the genetic difference in plasmid rather than in chromosome 1 . The results of present study appear to support the classification of Bacillus cereus sensu lato when using genomic sequences only (data not shown). Greater resolution of recognised species was achieved when plasmid sequences were added to the analysis. In the present study, B. weihenstephanensis strain KBAB4 was found to be very closely grouped with the major cluster I-d consisting of all B. thuringiensis isolates and proximal to cluster I-e (B. cereus) and cluster I-c (a cluster containing both B. thuringiensis and B. cereus strains) (Fig. 1). B. weihenstephanensis is a member of the Bacillus cereus sensu lato, and has high similarities with B. thuringiensis and B. cereus in terms of its ecological features such as producing cereulide as B. cereus and being psychrotolerance as some B. thuringiensis isolates 16,17 . Soufiane and Cote (2009) 5   The NJ tree inferred from the whole genome sequences of these bacteria species not only revealed the close relationship among B. thuringiensis, B. anthracis and B. cereus, but also confirmed the monophyly of B. anthracis (I-b, Fig. 1). Previous studies using other techniques have all stated that B. anthracis is the most monomorphic species among B. thuringiensis, B. anthracis and B. cereus [18][19][20] . Our results confirmed the genetic homogeneity of B. anthracis but failed to elicidate the evolutionary relationships between B. anthracis and the remaining two species. B. anthracis has been regarded to be evolved from a B. cereus ancestor through acquisition of key plasmid-encoded toxin, capsule and regulatory loci 21 . Such a relationship did not appear in our phylogenetic analyses based on FFP analysis of whole genome data (Fig. 1). The B. anthracis clade is proximal to a number of isolates of B. cereus and B. thuringiensis which have been associated with disease or toxicity in humans.
The findings of FFP analyses were fully supported by SNP phylogenies construed by kSNP (alignment -free sequence analysis method) and Parsnp (core genome alignment method). By comparing the NJ trees inferred from FFP analysis (Fig. 1) and kSNP analysis (Fig. 2), we found a high level of similarity between two phylogenies. The clades of I -b and I -d clades in FFP tree are consistent with the Bacillus anthracis and Bacillus thuringiensis clades in kSNP tree, whilst the cluster I in FFP tree is corresponding to the clade of Bacillus cereus sensu lato in kSNP tree. While the core genome SNP tree constructed by Parsnp failed to cover all the species studied, the exclusion of other Bacillus species from the major cluster was actually a support for the monophyly of Bacillus cereus sensu lato. This is because Parsnp is limited in intraspecific alignment and can only tolerate genomes with high similarity ( >=97%). Genomes from other species will be automatically excluded from the full alignment 22 . Within the core genome SNP tree constructed by Parsnp and visualized by Gingr, the monophyly of B. anthracis and a subclade covering 16 B. thuringiensis strains were confirmed, which is consistent with the results of FFP analysis. The tree also revealed the paraphyly of B. cereus and B. thuringiensis, which is similar to the findings of FFP and kSNP analyses. By zooming the alignment files from genome level to nucleotide level via the fisheye zoom feature of Gingr 22 , we noticed the SNP variations across different strains of B. cereus and B. thuringiensis that affects the topology of the trees. For B. cereus, the most variable region falls on an area between the gene of Cytochrome d ubiquinol oxidase subunit II and the gene of Alanine racemase (around 121456 bp). There are more SNP sites at this region among four B. cereus strains (B. cereus AH603, B. cereus AH621, B. cereus AH1272 and B. cereus AH1273), which contributed the distant placement of these four strains from the remaining B. cereus strains in the NJ tree. Similarly, we found a region (around 1006988 bp) with high SNP density in B. thuringiensis (starting from the gene of Lysr-type transcriptional regulator and ending at the gene of Thiamine/molybdopterin biosynthesis protein). The distantly placed B. thuringiensis strains (such as B. thuringiensis serovar finitimus YBT-020 and B. thuringiensis str. Al Hakam) generally have more SNP sites in this region than that of the remaining B. thuringiensis strains. It is not clear why some B. cereus and B. thuringiensis strains have more SNP variations at these particular genome regions than that of other strains, and what are the impacts of these SNP variations on the phenotype, function and pathogenicity of these Bacillus strains. More research is thus required to answer these questions.
In contrast to the FFP analysis based on the whole genome data, our phylogenetic analysis based on single gene data (16S rRNA gene, GyrB and AroE) were unable to clearly distinguish between Bacillus species examined. The 16S rRNA gene sequence analysis clustered all the sequences together and provided poor resolution for the relationships between each strain. Similarly, our analysis based on two protein coding genes, GyrB and AroE, were unable to separate B. thuringiensis, B. anthracis and B. cereus  from other Bacillus members while they provided support for the monophyletic position of B. subtilis. A further analysis using the concatenated sequence of these genes failed to provide any better analysis (data not shown).
These results suggest that FFP analysis of the combined genomic and plasmid sequence data allows for comparisons of genomic differences that can't be identified in analyses of specific single gene sequences and provides greater resolution of species belonging to B. cereus senso lato than other techniques such as MLST, AFLP or single gene sequence analysis. Furthermore, the availability and reduced cost of whole genome sequencing can be used without extensive gene annotation to provide robust phylogenetic analysis of new isolates as they become available.

Materials and Methods
Source of sequence data. The   cover both the main chromosome sequences and the plasmid sequences (if any) of each species. The nucleotide sequences of three single nuclear genes for these taxa: 16s rRNA, GyrB and AroE, were extracted from the corresponding whole genome sequences.
Phylogeny analysis via FFP. The whole genome sequences of the 51 taxa were converted to multi-Fasta format before being uploaded to FFP -3.19 10 , where the different forms of genome partitions were compared between species, and NJ trees were constructed based on the Jensen-Shannon divergences matrix from each type of genome partition. By following the recommendations of the program, we used the tools of ffpvocab and ffpre to find the right range of lengths to use (l = 20 was finally chosen in the analysis). We also conducted bootstrapping (1000) to assess the FFP phylogenetic analysis. The outcome of the bootstrap analysis was imported into Phylip 3.2 24 to create a consensus distance matrix, which was further processed in MEGA 6.0 25 to display the final NJ tree.
Validation of FFP results. We applied two programs to validate the outcomes of FFP analysis. The first program is kSNP v2.1.2 26 , an alignment-free sequence analysis method with the capacity to build whole genome phylogeny on single nucleotide polymorphisms (SNPs) in whole genome data. We examined the same datasets of FFP by running kchooser to find the optimum k -mer size (19) prior to the kSNP analysis, and including the flag of "-j " in the command line to estimate Neighbor Joining trees based on all SNPs and core SNPs. The resulting all-SNPs-matrix was imported into MEGA 6.0 25 for NJ tree construction. The core-SNP-smatrix was applied to demonstrate the core SNP differneces across all examined genomes.
The Harvest Suite 22 (inclusing Parsnp and Gingr) was also applied to validate the FFP outcome. We aligned genomes studied in FFP and built NJ phylogentic trees through Parsnp, a fast core-genome multi-aligner, and vusualized the alignment and trees with Gingr, a dynamic visual platform. The default parameters recommanded by the program were followed during the whole analysis.