Introduction

Phylogenetic tree analysis and comparative studies of taxa are essential parts of modern molecular biology. Phylogenetic reconstruction and comparative sequence analysis traditionally depend on multiple or pairwise sequence alignments. However, alignment-based approaches run into various limitations when applied to large datasets. Whole genome alignment of higher eukaryotes can exceed available computational resources. Moreover, factors such as the combinatorics of genomic rearrangements and duplications make the alignment of entire genomes impossible. Therefore, the alignable homologous segments of the genomes under study have to be identified in the initial steps. Recently, large amounts of sequence data produced by next-generation sequencing techniques have become available in private and public databases, which has created new challenges for alignment-based approaches. This plethora of sequence information increases the computation and time requirements for genome comparisons in computational biology. Therefore, there is a strong need for faster sequence analysis algorithms. Various methods, termed alignment-free methods, have been proposed to overcome the limitations of the alignment-based approach1,2,3. Alignment-free methods are used not only in phylogenetic studies4,5, but also in metagenomics6,7,8,9,10,11, analysis of regulatory elements12,13,14, protein classification15,16, sequence assembly17, isoform quantification from transcriptome data18, and the identification of biomarkers in diagnostic tests19. Alignment-free methods fall into two broad categories: methods based on k-mer or word frequency, and methods based on match length20. Methods based on k-mer or word frequency are popular and have been studied extensively. The k-mer based methods were developed to compare DNA sequences by counting the frequencies of substrings of k letters occurring in the respective sequences21. 
In the recent past, many k-mer based methods have been proposed and applied to sequence analysis and phylogeny, such as feature frequency profile (FFP)22, return time distribution (RTD)23, frequency chaos game representation (FCGR)24, an improved complete composition vector method (ICCV)25, composition vector (CV)26 and complete composition vector (CCV)27. For sequence comparison, the ICCV method is more efficient and robust than the CV and CCV methods. The other category of alignment-free methods is based on match lengths and exploits the similarity of substrings between two sequences28,29,30,31. Examples of match length methods include k-mismatch average common substring32, average common substring28 and the Kr method28. These methods are commonly used for string processing in computer science. In this study, we propose to use the fuzzy integral33 to analyze DNA sequences based on a Markov chain34, an approach that can be categorised as a k-mer or word frequency method. The fuzzy integral similarity method35,36 assigns a similarity score between two DNA sequences based on the estimated parameters of a Markov chain. A DNA sequence consists of four characters (A, T, G and C). By taking the state space as S = {A, T, G, C}, we used the k-th step transition probability matrix, a fuzzy measure37 and the fuzzy integral to describe the DNA sequences. We used the fuzzy integral similarity to obtain a distance matrix, which was passed to the neighbor program in the PHYLIP package38 to construct a phylogenetic tree. A similar fuzzy integral similarity approach was taken in ref. 36; however, the method of feature vector extraction from the DNA sequences in ref. 36 differs from ours. In both our method and ref. 36, the extracted features are used as input for the fuzzy integral similarity analysis. 
The proposed method was tested on 18S rDNA sequences from 11 arbuscular mycorrhizal fungi isolates and 16S rDNA sequences from 40 bacterial isolates, and on the following benchmark datasets: 41 mammalian mitochondrial genomes, 59 ebolavirus complete genomes, 30 coronavirus whole genomes, 30 bacterial whole genomes, 48 Hepatitis E virus (HEV) whole genomes, 24 Eutherian mammal sequences, 58 genome datasets from different species and 29 Escherichia/Shigella complete genomes. The method was also tested on a large mammalian dataset. In addition, we used receiver operating characteristic (ROC)39,40,41 curves to measure the performance of our method and to compare it with other alignment-free methods from the Alfree repository2. The consistency can also be seen from statistical measures such as the AUC (area under the ROC curve) values calculated from the ROC curves provided in the Supplementary Material.

Materials and Methods

Construction of a Markov chain for DNA sequence

Let P = [pij] denote the transition probability matrix of a discrete-time Markov chain34. Each state transition probability pij is defined as follows:

$${p}_{ij}=p({X}_{n+1}={s}_{j}|{X}_{n}={s}_{i}),\,1\le i,j\le S,$$
(1)

where Xn indicates the actual state at time n (n = 1, 2, 3, …) and si is the ith state of S distinct states. In the context of a DNA sequence, the number of states is S = 4, which corresponds to the four-nucleotide symbol set \({\bf{S}}=\{A={s}_{1},T={s}_{2},G={s}_{3},C={s}_{4}\}.\) The state transition probabilities are subject to

$${p}_{ij}\ge 0\,\forall i,j\,\,{\rm{and}}\,\,\sum _{j=1}^{S}\,{p}_{ij}=1\,\forall i.$$

Since the transition probabilities are unknown initially, they must be estimated based on the observed sequence. Here, we estimate the parameters of the Markov chain by taking the frequencies of occurrence of all possible nucleotide pairs for each sequence42. If the total number of each adjacent nucleotide pair (si, sj) in the sequence is denoted by \({N}_{{s}_{i}{s}_{j}}\), then the 1st-step transition probability from state si to state sj is estimated as

$${p}_{ij}=\frac{{N}_{{s}_{i}{s}_{j}}}{{N}_{{s}_{i}A}+{N}_{{s}_{i}T}+{N}_{{s}_{i}G}+{N}_{{s}_{i}C}},$$
(2)

where \({N}_{{s}_{i}{s}_{j}}\) represents the total number of each adjacent pair starting from nucleotide si and ending with nucleotide sj.
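The estimate in Eq. (2) can be illustrated with a short sketch. The function name `transition_matrix`, the handling of non-ATGC symbols (pairs containing them are skipped) and the uniform fallback for a nucleotide that never starts a pair are our assumptions for illustration; the text does not specify these cases.

```python
from collections import Counter

STATES = "ATGC"  # state order s1..s4 as defined in the text


def transition_matrix(seq):
    """Estimate the 1-step transition matrix of Eq. (2) from adjacent-pair counts."""
    seq = seq.upper()
    # Count adjacent pairs; pairs with a non-ATGC symbol are skipped (assumption).
    counts = Counter(
        (a, b) for a, b in zip(seq, seq[1:]) if a in STATES and b in STATES
    )
    matrix = []
    for a in STATES:
        row_total = sum(counts[(a, b)] for b in STATES)
        if row_total == 0:
            # Assumption: a nucleotide that never starts a pair gets a uniform row.
            matrix.append([1.0 / len(STATES)] * len(STATES))
        else:
            matrix.append([counts[(a, b)] / row_total for b in STATES])
    return matrix


# Toy sequence with pairs AT, TG, GC, CA, AT, TT, TA.
P = transition_matrix("ATGCATTA")
for state, row in zip(STATES, P):
    print(state, row)
```

In this toy sequence every pair starting with A is AT, so the row for A places all of its probability on T, and each row sums to 1 as required by the constraints above.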

Equation (2) gives the 1-step Markov chain. The k-step transition probabilities can be calculated from the 1-step transition probabilities through the Chapman-Kolmogorov equation. Let \({P}^{k}=[{p}_{ij}^{k}]\) denote the k-step transition probability matrix of the discrete-time Markov chain, whose entry \({p}_{ij}^{k}\) is the probability of being in state j after k steps starting from state i. Each state transition probability \({p}_{ij}^{k}\) is defined as follows:

$${p}_{ij}^{k}=p({X}_{n+k}={s}_{j}|{X}_{n}={s}_{i}),\,1\le i,j\le S,$$
(3)

The state transition probabilities are subject to

$${p}_{ij}^{k}\ge 0\,\forall i,j\,\,{\rm{and}}\,\,\sum _{j=1}^{S}\,{p}_{ij}^{k}=1\,\forall i.$$

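For any three events A, B and C, the following identity is known: \(p[A\cap B|C]=p[A|B\cap C]p[B|C].\) By interpreting A as Xn+k = sj, B as Xn+t = sr and C as Xn = si for an intermediate step t with 0 < t < k, and using the Markov property to drop the conditioning on Xn in the fourth line, we have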

$$\begin{array}{rcl}{p}_{ij}^{k} & = & p[{X}_{n+k}={s}_{j}|{X}_{n}={s}_{i}]\\ & = & \sum _{{s}_{r}\in {\bf{S}}}\,p[{X}_{n+k}={s}_{j},{X}_{n+t}={s}_{r}|{X}_{n}={s}_{i}]\\ & = & \sum _{{s}_{r}\in {\bf{S}}}\,p[{X}_{n+k}={s}_{j}|{X}_{n+t}={s}_{r},{X}_{n}={s}_{i}]\times p[{X}_{n+t}={s}_{r}|{X}_{n}={s}_{i}]\\ & = & \sum _{{s}_{r}\in {\bf{S}}}\,p[{X}_{n+k}={s}_{j}|{X}_{n+t}={s}_{r}]\times p[{X}_{n+t}={s}_{r}|{X}_{n}={s}_{i}]\\ & = & \sum _{{s}_{r}\in {\bf{S}}}\,{p}_{rj}^{k-t}{p}_{ir}^{t},\end{array}$$
(4)

which is known as the Chapman-Kolmogorov equation.

Hence, the matrix with the elements \({p}_{ij}^{k}\) is the kth power of the 1-step matrix P, i.e. \([{p}_{ij}^{k}]={P}^{k}\).
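As a minimal sketch of this result, the k-step matrix can be obtained by repeated multiplication of the 1-step matrix. The helper names `mat_mul` and `k_step` and the example matrix are hypothetical and serve only to check the Chapman-Kolmogorov identity numerically.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)]
        for i in range(n)
    ]


def k_step(p, k):
    """Return P^k for k >= 1 by repeated multiplication."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    result = p
    for _ in range(k - 1):
        result = mat_mul(result, p)
    return result


# Hypothetical 1-step matrix; every row sums to 1.
P = [
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.00, 0.50, 0.50, 0.00],
]
P5 = k_step(P, 5)
# Chapman-Kolmogorov with t = 2: P^5 equals P^2 multiplied by P^3.
P5_split = mat_mul(k_step(P, 2), k_step(P, 3))
```

Repeated multiplication costs O(m^3) per step for an m × m matrix, which is negligible for m = 4.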

The selection of step k plays an important role in capturing rich evolutionary information from the DNA sequence. In the context of a DNA sequence, the kth-step transition probability can be written as:

$${P}^{k}=[\begin{array}{cccc}{p}_{11}^{k} & {p}_{12}^{k} & {p}_{13}^{k} & {p}_{14}^{k}\\ {p}_{21}^{k} & {p}_{22}^{k} & {p}_{23}^{k} & {p}_{24}^{k}\\ {p}_{31}^{k} & {p}_{32}^{k} & {p}_{33}^{k} & {p}_{34}^{k}\\ {p}_{41}^{k} & {p}_{42}^{k} & {p}_{43}^{k} & {p}_{44}^{k}\end{array}]=[\begin{array}{cccc}{p}_{AA}^{k} & {p}_{AT}^{k} & {p}_{AG}^{k} & {p}_{AC}^{k}\\ {p}_{TA}^{k} & {p}_{TT}^{k} & {p}_{TG}^{k} & {p}_{TC}^{k}\\ {p}_{GA}^{k} & {p}_{GT}^{k} & {p}_{GG}^{k} & {p}_{GC}^{k}\\ {p}_{CA}^{k} & {p}_{CT}^{k} & {p}_{CG}^{k} & {p}_{CC}^{k}\end{array}]$$
(5)

which is subject to \({p}_{ij}^{k}\ge 0\,\forall i,\,j\in \{1,2,3,4\}\) and \({\sum }_{j=1}^{4}\,{p}_{ij}^{k}=1\,\forall i.\) The \({p}_{ij}^{k}\) can be calculated using Eqs (2) and (4).

Fuzzy measure and fuzzy integral for the kth-step nucleotide sequence

Fuzzy set theory43 is particularly suitable for modelling imprecise data, whereas the fuzzy integral is highly appropriate for representing the interaction among different information sources. The concept of the fuzzy integral with respect to a fuzzy measure was proposed by Sugeno in 197444. In this section, we propose combining the fuzzy integral with the transition probability matrix, where the elements of the transition probability matrix are taken as fuzzy membership degrees.

Let \(F=\{{({s}_{i}{s}_{j})}^{k}={y}_{ij}|i,j\in \{1,2,3,4\}\}\) be the finite set of kth-step nucleotides starting from nucleotide si and ending with nucleotide sj estimated from the observed sequence.

Let X, Y ⊆ F and let R(F) be the power set of F. A fuzzy measure μ is a real-valued function μ: R(F) → [0, 1] that satisfies the following conditions:

(a) μ(ϕ) = 0 and μ(F) = 1.

(b) μ(X) ≤ μ(Y) if X ⊆ Y.

For a fuzzy measure μ, let \(\mu ({y}_{ij})={\mu }^{ij}\,\forall {y}_{ij}\in F.\) The mapping \({y}_{ij}\to {\mu }^{ij}\) is termed a fuzzy density function. The fuzzy density function can be interpreted as the importance of element yij in determining the set F. By definition of the fuzzy measure μ, the measure of the union of two disjoint subsets cannot be directly computed from the measures of the component subsets. In other words, the fuzzy measure value of a given subset is not simply the sum of the measures of its elements. Therefore, to define a fuzzy measure, we need to know the fuzzy density of each element of the measured set and the measure of each combination of elements. These values can be provided by an expert or extracted from the problem definition. However, when dealing with a set of numerous elements, this job may become noisy, tedious or even unfeasible. A possible solution to this problem is to use a λ-fuzzy measure. A λ-fuzzy measure45 fulfills the criteria of a fuzzy measure and has an additional property: for all \(X,Y\subseteq \{{y}_{i1},{y}_{i2},{y}_{i3},{y}_{i4}\}\) with \(X\cap Y=\varphi \) and fixed \(i\in \{1,2,3,4\},\)

$$\mu (X\cup Y)=\mu (X)+\mu (Y)+{\lambda }_{i}\mu (X)\mu (Y),\,{\rm{for}}\,{\rm{each}}\,{\lambda }_{i} > -\,1.$$
(6)

Furthermore, λi can be calculated by solving:

$${\lambda }_{i}+1=\prod _{j=1}^{4}\,(1+{\lambda }_{i}{\mu }^{ij})\,{\rm{for}}\,{\rm{fixed}}\,\,i.$$
(7)

To solve Eqs (6) and (7), we only need the individual fuzzy densities of the elements \({\mu }^{ij}\,(i,j=1,2,3,4).\)
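One way to solve Eq. (7) numerically for a fixed row i is bisection on the residual of the equation, sketched below. The function names, the tolerances and the choice of bisection are our assumptions; the text does not state which root-finding method was used. The sketch relies on the standard property that the non-trivial root lies in (−1, 0) when the densities sum to more than 1, in (0, ∞) when they sum to less than 1, and is 0 when they sum to exactly 1.

```python
def lambda_residual(lam, densities):
    """Residual of Eq. (7): prod_j (1 + lam * g_j) - (1 + lam)."""
    product = 1.0
    for g in densities:
        product *= 1.0 + lam * g
    return product - (1.0 + lam)


def solve_lambda(densities, tol=1e-12):
    """Find the non-trivial root lam > -1 of Eq. (7) by bisection (assumed method)."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # the densities are already additive
    if total > 1.0:
        # Root in (-1, 0): residual is >= 0 at -1 and < 0 just below 0.
        lo, hi, positive_at_lo = -1.0, -1e-9, True
    else:
        # Root in (0, inf): residual is < 0 just above 0; grow hi until it is > 0.
        lo, hi, positive_at_lo = 1e-9, 1.0, False
        while lambda_residual(hi, densities) <= 0.0:
            hi *= 2.0
            if hi > 1e12:
                raise ValueError("no finite root; need two or more non-zero densities")
    for _ in range(200):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if (lambda_residual(mid, densities) > 0.0) == positive_at_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With only two non-zero densities the root has the closed form
# lam = (1 - g1 - g2) / (g1 * g2), which gives a simple check.
lam = solve_lambda([0.6, 0.6, 0.0, 0.0])
```

If one density equals 1 while the others are positive, Eq. (7) has no root strictly above −1 and the sketch returns a value at the boundary; how such a case should be treated is not specified in the text.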

Let \(F=\{{({s}_{i}{s}_{j})}^{k}={y}_{ij}|i,\,j\in \{1,2,3,4\}\}\) be a finite set of information sources. Let h: F → [0, 1] represent a function that maps each element of F to its observed evidence. Suppose that \(h({y}_{i1})\ge h({y}_{i2})\ge h({y}_{i3})\ge h({y}_{i4})\) for each fixed \(i\in \{1,2,3,4\}.\) If this decreasing order is not fulfilled, the elements of F should be reordered so that it holds, and the subsequent steps are based on the reordered elements. Let μ: R(F) → [0, 1] be a fuzzy measure. Then, the fuzzy integral of h with respect to the fuzzy measure μ is

$$I=\max \left[{\max }_{i=1}^{4}\left[{\min }_{j=1}^{4}\left[h({y}_{ij}),\mu ({A}_{ij})\right]\right]\right],$$
(8)
$${\rm{where}}\,{A}_{ij}=\{{y}_{i1},{y}_{i2},\ldots ,{y}_{ij}\}\,{\rm{for}}\,{\rm{each}}\,{\rm{fixed}}\,i.$$
(9)

The fuzzy integral considers the significance provided by every element of a given set, and the importance of each subset of elements (i.e., the fuzzy measure) plays an important role in its decision-making process. By combining the extracted information with the importance of the source that provided it, the fuzzy integral is well suited to information fusion. This theory has the potential to address uncertainties associated with data extraction and processing procedures. Therefore, it has been widely implemented in feature extraction and classification45,46.
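The following sketch shows one possible reading of Eqs (6), (8) and (9): for each fixed row i, the evidence is sorted in decreasing order, the measures of the nested sets Ai1 ⊂ Ai2 ⊂ … are built with the λ rule of Eq. (6), and the usual Sugeno form (the maximum over j of the minimum of h and μ) is taken; the outer maximum then runs over the four rows. This interpretation, the function names and the example values are our assumptions, and the value of λ is taken as an input.

```python
def lambda_measure_chain(densities, lam):
    """Measures of the nested sets A_1, A_2, ... built with the rule of Eq. (6)."""
    measures, mu = [], 0.0
    for g in densities:
        # Adding one element: mu(A u {y}) = g + mu(A) + lam * g * mu(A).
        mu = g + mu + lam * g * mu
        measures.append(mu)
    return measures


def sugeno_integral(evidence, densities, lam):
    """Sugeno integral of one row: max_j min(h_j, mu(A_j)) after sorting h."""
    order = sorted(range(len(evidence)), key=lambda j: evidence[j], reverse=True)
    h_sorted = [evidence[j] for j in order]
    g_sorted = [densities[j] for j in order]
    measures = lambda_measure_chain(g_sorted, lam)
    return max(min(h, mu) for h, mu in zip(h_sorted, measures))


def fuzzy_integral(h_rows, g_rows, lambdas):
    """Outer maximum of Eq. (8) over the rows i (assumed reading)."""
    return max(
        sugeno_integral(h, g, lam) for h, g, lam in zip(h_rows, g_rows, lambdas)
    )


# Hypothetical row whose densities sum to 1, so lam = 0 applies.
row_value = sugeno_integral([0.9, 0.5, 0.7, 0.1], [0.1, 0.2, 0.3, 0.4], 0.0)
```

For this row the sorted evidence is 0.9, 0.7, 0.5, 0.1 and the nested measures are 0.1, 0.4, 0.6, 1.0, so the row value is 0.5.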

Fuzzy integral similarity and distance matrix for sequence comparison

The fuzzy integral similarity is based on the distance of the kth-step nucleotide pair frequency with respect to the conservation level of the position between two sequences. In our case, the kth-step nucleotide pair frequency at all sixteen positions in the transition probability matrix is taken as the fuzzy membership degree.

Let \({P}_{1}^{k}\) and \({P}_{2}^{k}\) be two kth-step transition probability matrices. The fuzzy integral function finds the similarity level of the nucleotide pairs between the two kth-step transition probability matrices. We constructed a fuzzy integral function h, which is given as:

$${h}^{{y}_{ij}}=1-|{({P}_{1}^{k})}^{{y}_{ij}}-{({P}_{2}^{k})}^{{y}_{ij}}|,$$
(10)

where yij ∈ {(AA)k, (AT)k, (AG)k, (AC)k, (TA)k, (TT)k, (TG)k, (TC)k, (GA)k, (GT)k, (GG)k, (GC)k, (CA)k, (CT)k, (CG)k, (CC)k}.

Additionally, the fuzzy measure function finds the maximum level of conservation of the nucleotide pairs between the kth-step transition probability matrices \({P}_{1}^{k}\) and \({P}_{2}^{k}\), which favours the importance of better conserved positions.

Taking advantage of the properties explained above, we can construct a λ-fuzzy measure μ using the fuzzy density of each element μij.

In this case,

$${\mu }^{ij}={\mu }^{{y}_{ij}}=max\{{({P}_{1}^{k})}^{{y}_{ij}},{({P}_{2}^{k})}^{{y}_{ij}}\},$$
(11)

where yij ∈ {(AA)k, (AT)k, (AG)k, (AC)k, (TA)k, (TT)k, (TG)k, (TC)k, (GA)k, (GT)k, (GG)k, (GC)k, (CA)k, (CT)k, (CG)k, (CC)k}. At this stage, we apply Eq. (7) to find λ and use the value of λ in Eq. (6) to obtain the fuzzy measure μ. The result generated by Eq. (6) satisfies criteria (a) and (b) of the fuzzy measure. After generating h and μ, we obtain the fuzzy integral similarity by applying Eq. (8). In the fuzzy integral similarity, greater importance is given to the higher degrees of membership, which are calculated via the fuzzy integral with respect to the fuzzy measure. The approach is based on fuzzy technology and is intended to deal with the intrinsic uncertainty involved in sequence comparison tasks. Fuzzy integral similarity does not require any additional parameters, which makes it fully automated and robust.
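Equations (10) and (11) translate directly into element-wise operations on the two matrices, as in the sketch below. The function name and the example matrices are hypothetical. One consequence worth noting, which is our observation rather than a statement from the text: because every row of both matrices sums to 1, the densities of Eq. (11) in each row sum to at least 1, so the root λi of Eq. (7) falls in (−1, 0].

```python
def evidence_and_densities(p1, p2):
    """Return h (Eq. 10) and the fuzzy densities (Eq. 11) for two k-step matrices."""
    # Eq. (10): evidence is 1 minus the absolute difference of matching entries.
    h = [[1.0 - abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(p1, p2)]
    # Eq. (11): density is the larger of the two matching entries.
    g = [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(p1, p2)]
    return h, g


# Hypothetical k-step matrices: a uniform chain and a chain using only A and T.
P1 = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
P2 = [[0.5, 0.5, 0.0, 0.0] for _ in range(4)]
h, g = evidence_and_densities(P1, P2)
# Each row of h is [0.75, 0.75, 0.75, 0.75]; each row of g is [0.5, 0.5, 0.25, 0.25].
```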

The fuzzy integral similarity measure provides the similarity score between the two kth-step transition probability matrices. Next, we will define a distance measure between two kth-step transition probability matrices \({P}_{1}^{k}\) and \({P}_{2}^{k},\) which is given as follows:

$$D({P}_{1}^{k},{P}_{2}^{k})=1-I({P}_{1}^{k},{P}_{2}^{k}).$$
(12)

Using Eq. (12), we calculate the distance for all pairwise combinations of the n DNA sequences. Finally, a symmetric distance matrix is generated. This matrix is used as an input for the neighbor program in the PHYLIP package38 for phylogenetic tree construction.
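A minimal sketch of this step is given below. The similarity function is passed in as a parameter, so any implementation of the fuzzy integral similarity can be plugged in; the function names, the zero diagonal and the six-decimal formatting are our assumptions. The output follows the square distance-matrix layout read by PHYLIP: the number of taxa on the first line, then one line per taxon with a name padded to 10 characters followed by its distances.

```python
def distance_matrix(items, similarity):
    """Symmetric matrix of D = 1 - similarity (Eq. 12) over all pairs of items."""
    n = len(items)
    dist = [[0.0] * n for _ in range(n)]  # diagonal left at 0 (assumption)
    for a in range(n):
        for b in range(a + 1, n):
            d = 1.0 - similarity(items[a], items[b])
            dist[a][b] = dist[b][a] = d
    return dist


def to_phylip(names, dist):
    """Format a square distance matrix in PHYLIP layout (names padded to 10 chars)."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Stand-in similarity on plain numbers, used only to exercise the functions.
toy_similarity = lambda x, y: 1.0 - abs(x - y)
D = distance_matrix([0.0, 0.25, 1.0], toy_similarity)
print(to_phylip(["seqA", "seqB", "seqC"], D))
```

Only the n(n − 1)/2 pairs above the diagonal are evaluated, which matches the pair count used in the time complexity analysis.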

Algorithm

This section describes the algorithmic aspect of the proposed method. The entire algorithm contains three stages.

Stage 1: Calculation of the transition probability matrix through a Markov chain:

Algorithm 1 (figure a). kth-step transition probability matrix.

Stage 2: Calculation of the fuzzy integral similarity between two kth-step transition probability matrices:

Algorithm 2 (figure b). FISim \(({P}_{1}^{k},{P}_{2}^{k})\).

Stage 3: Integration of stage (1) and stage (2) for phylogenetic construction:

Algorithm 3 (figure c). Construction of distance matrix.

Time complexity of the proposed algorithm

To determine the time complexity of the algorithm, we assume that all operations take the same unit of time. The whole computational process consists of three stages. In the first stage, we calculate the transition probability matrix from the raw DNA sequences. The time complexity of stage (1) is \(O({m}^{3}nk+nl)\), where l is the average length of the DNA sequences, n is the total number of DNA sequences, m is the number of bases and k is the number of steps of the transition probability matrix. In the second stage, we calculate the fuzzy integral similarity between two m × m transition probability matrices generated in stage (1). The time complexity of stage (2) is \(O(m{2}^{m})\). In the third stage, we integrate stage (1) and stage (2) to generate a distance matrix. Let k = h be an optimal step that satisfies condition (8) in algorithm (3). Therefore, the total time complexity to generate the final distance matrix at the hth optimal step is:

$$\begin{array}{l}=\,{h}^{th}\text{-step time complexity of stage 1}+h\,\frac{n(n-1)}{2}\times \text{time complexity of stage 2}\\ =\,O({m}^{3}nh+nl)+h\,\frac{n(n-1)}{2}\times O(m{2}^{m})\\ =\,O({m}^{3}nh+nl)+O(h{n}^{2}m{2}^{m})\\ =\,O({m}^{3}nh+nl+h{n}^{2}m{2}^{m}).\end{array}$$

Since we are calculating the computational complexity for DNA sequences, the number of bases (A, T, G and C) is m = 4. Hence, the time complexity of our proposed algorithm is \(O(nh+nl+h{n}^{2})\).

Results

To check the performance of the proposed method, we tested it on different datasets, some small and others medium sized. The length of the sequences ranges from seven thousand to several million base pairs. To compare and analyze the genomic data, we generated a distance matrix using Eq. (12) for each distinct step k using the method described above. We increased step k until we obtained the same distance matrix for two consecutive steps (say k = h and k = h + 1, where h is a fixed integer), i.e., until the root mean square distance47 between the two distance matrices generated at steps h and h + 1 was zero. We then considered k = h the optimal step and generated the phylogenetic tree at step k = h using the PHYLIP package. Here, we used the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) approach in the PHYLIP package38 to generate the phylogenetic tree. To test the effectiveness of our proposed approach, we selected ten sets of test data: (i) 18S rDNA sequences from 11 arbuscular mycorrhizal fungi isolates, (ii) 16S rDNA sequences from 40 bacterial isolates, (iii) 41 mammalian mitochondrial genomes, (iv) 59 ebolavirus complete genomes, (v) 30 coronavirus whole genomes, (vi) 30 bacterial whole genomes, (vii) 48 Hepatitis E virus (HEV) whole genomes, (viii) 24 Eutherian mammal sequences, (ix) 58 genome datasets from different species and (x) 29 Escherichia/Shigella complete genomes. We compared our trees with the trees generated by previously published methods using the same datasets.
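The stopping rule for the step k can be sketched as follows. The function names, the `max_k` cap and the `tol` parameter are hypothetical, and averaging the squared differences over all n × n entries is our assumption about the root mean square distance; the text requires the distance to be exactly zero, which corresponds to `tol=0.0`, whereas a small positive tolerance may be more practical in floating-point arithmetic.

```python
import math


def rms_difference(d1, d2):
    """Root mean square difference between two n x n distance matrices."""
    n = len(d1)
    total = sum((d1[i][j] - d2[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(total / (n * n))


def optimal_step(build_distance_matrix, max_k=50, tol=0.0):
    """Return the first h whose distance matrix equals that of step h + 1."""
    previous = build_distance_matrix(1)
    for k in range(1, max_k):
        current = build_distance_matrix(k + 1)
        if rms_difference(previous, current) <= tol:
            return k
        previous = current
    raise RuntimeError("distance matrix did not stabilise within max_k steps")


# Toy builder whose off-diagonal distance stops changing from k = 4 onwards.
def toy_builder(k):
    d = min(k, 4) / 10
    return [[0.0, d], [d, 0.0]]


h = optimal_step(toy_builder)  # the matrices at steps 4 and 5 coincide
```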

Phylogenetic tree analysis using 18S rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) isolates

Arbuscular mycorrhiza (also called endomycorrhiza) is a type of mycorrhiza in which the fungus, known as an arbuscular mycorrhizal (AM) fungus (AMF), infects vascular plants by penetrating the cortical cells of the root. Arbuscular mycorrhizas are characterized by the formation of unique structures (arbuscules) and vesicles; the fungi belong to the phylum Glomeromycota. Arbuscular mycorrhizal fungi help plants to capture nutrients, such as phosphorus, sulfur, nitrogen and micronutrients, from soil. The development of arbuscular mycorrhizal symbiosis is believed to have played a crucial role in the initial colonization of plants on land and in the evolution of vascular plants48. We built a phylogenetic tree (Fig. 1) of the 11 AMF sequences listed in Table S1 using the optimal step k = 8. To compare our method with an alignment-based method, we also constructed a phylogenetic tree (Fig. S1) by the ClustalW method using the MEGA package49. We characterized the 11 AMF sequences based on their families and genera. All sequences of genus Rhizophagus, which belongs to family Glomeraceae, were clustered together in cluster (a), except one (the “15 Rhi in” sequence, which falls in cluster (d)). All sequences of genus Glomus, also in family Glomeraceae, were clustered in cluster (b). All sequences of genus Gigaspora, in family Gigasporaceae, were clustered in cluster (c). Comparing the tree produced by our method (Fig. 1) with the tree produced by the ClustalW method (Fig. S1) using the UPGMA approach, we found that the Glomus sequences were clustered together in Fig. 1 but not in Fig. S1. An obvious flaw in both phylogenetic trees (Figs 1 and S1) is that neither clustered the Rhizophagus sequences in a single clade.

Figure 1. The phylogenetic tree of the 11 AMF sequences constructed using our method.

Phylogenetic tree analysis using 16S rDNA sequences from 40 bacterial isolates

Endophytic bacteria are an essential part of plant systems and play significant roles in plant growth and development50. The 40 bacterial sequences were obtained from pure cultures of endophytic bacteria isolated from the surface-sterilized mature endosperms of six rice varieties. The rice seeds were collected from two different locations in north-east India: North Lakhimpur (Assam) and Aizawl (Mizoram). Genomic DNA was extracted from the pure cultures, and the full-length 16S rDNA sequences were amplified using the primer pair 27f (5′-AGAGTTTGATYMTGGCTCAG) and 1492r (5′-TACCTTGTTAYGACTT). The amplicons were sequenced in an Applied Biosystems sequencer using the BigDye terminator method. To minimize sequencing error, we also used two internal primers, 533f/805r, along with 27f/1492r. We assembled the contigs based on their phred scores (15) using the CodonCode Aligner v7.0.1, BioEdit and SeqTrace v0.9 software51. The contigs were checked for the presence of any chimeras in mothur v1.35.1 and were aligned for identification against the NCBI reference rRNA database using the blastn algorithm. Information including the accession numbers, phyla, classes, orders and families for the 40 bacterial isolates is listed in Table S2. With these 40 bacterial isolates, we generated a phylogenetic tree (Fig. 2) with our approach using the optimal step k = 6. The tree (Fig. 2) obtained by our method was compared with the tree (Fig. S2) obtained by the ClustalW method using the MEGA package49. Our algorithm separated the 40 bacterial sequences into two major clades: clade A (purple) and clade B (green) (Fig. 2). Clade A branched into two clades: clade A1 (bold black) and clade A2 (purple). Clade A1 contained only Staphylococcus warneri, which separated out as an outgroup from the sequences in clade A2. 
Clade A2 consisted of 25 sequences, of which one sequence represented phylum Firmicutes, three represented Actinobacteria, and the remaining 21 belonged to phylum Proteobacteria. Our method successfully grouped sequences of genus Pantoea together, but in one instance it placed Pantoea and Xanthomonas as sister groups. In the same clade, our method placed a third sequence belonging to Xanthomonas as an outlier. Additionally, in clade A2, Brevibacillus and Pantoea, which belong to phyla Firmicutes and Proteobacteria respectively, were clustered as a sister group, and Curtobacterium was grouped with Luteibacter. None of the actinobacterial sequences were grouped together in this clade.

Figure 2. The phylogenetic tree constructed by our method using 16S rDNA sequences from 40 bacterial isolates.

The second major cluster, clade B (green), consisted of 14 sequences. Bacillus marisflavi was an outlier from the remaining 13 sequences. In this cluster, Curtobacterium plantarum and Pantoea agglomerans were placed together as sister groups, which might indicate sequence similarity between the two species. Three actinobacterial sequences (two of C. plantarum and one of Microbacterium proteolyticum) were placed together in one clade. However, C. plantarum and M. proteolyticum were placed as sister groups, and the other C. plantarum sequence was positioned as an outlier. When we compared our method (Fig. 2) with the ClustalW method (Fig. S2), we found that both methods clearly separated the 40 bacterial sequences into two major clades. Each clade contains the same type of bacterial sequences, but the order was interchanged. In clade A, our method failed to cluster the Xanthomonas sequences together, whereas they were grouped together in the result obtained by ClustalW.

Phylogenetic tree analysis of 41 mammalian mitochondrial genomes

The proposed algorithm was tested on the benchmark mammalian dataset containing 41 complete mitochondrial genomes (mtDNA) with nearly 16500 nucleotides (Table S3). In the tree generated by our approach (Fig. 3) using the optimal step k = 6, the 41 species were correctly divided into eight groups: Primates (red), Cetacea (green), Artiodactyla (pink), Perissodactyla (light green), Rodentia (black), Lagomorpha (dark red), Carnivora (blue), and Erinaceomorpha (grey). One exception is the cat, which our approach clustered with the Artiodactyla group. We compared the phylogenetic tree (Fig. 3) generated by our approach with the phylogenetic trees (Figs S3 and S4) collected from previous work52. Fig. S3 was generated by the multiple encoding vector method52 and Fig. S4 by the FFP method22 using substrings of length seven. In Fig. 3, the 10 primates (red) formed a cluster, and Vervet monkey and Macaca thibetana of family Cercopithecidae were clustered in a single clade as a sister group, which was not observed in Fig. S3. Moreover, the species belonging to Artiodactyla were grouped into a separate clade, which was lacking in Fig. S3. We also compared our result with the phylogenetic tree (Fig. S4) generated by the FFP method. As shown in Fig. S4, the eight groups were not classified well. The four species of Perissodactyla were distributed into two clades. The Indus River dolphin from Cetacea was separated from the other species of Cetacea. The Primates, Artiodactyla and Carnivora clades were all divided into more than one group. The phylogenetic tree (Fig. 3) generated by our approach shows better clustering than Figs S3 and S4.

Figure 3. The phylogenetic tree of 41 mammalian mitochondrial genomes constructed using our method.

Phylogenetic tree analysis of 59 ebolavirus complete genomes

The benchmark dataset used in this study comprised 59 complete ebolavirus genomes with nearly 18900 nucleotides (Table S4). The Ebolavirus genus includes five species: Bundibugyo virus (BDBV), Reston virus (RESTV), Ebola virus (formerly Zaire ebolavirus, EBOV), Sudan virus (SUDV), and Tai Forest virus (TAFV)53. Ebola viruses are single-stranded, negative-sense RNA viruses. Each ebolavirus genome encodes seven proteins, of which the glycoprotein is the only viral protein on the surface of the virus. The first human case of EBOV infection was reported in 1976 in Zaire (currently the Democratic Republic of the Congo (DRC))54. We applied our proposed method to generate the phylogenetic tree (Fig. 4) of the 59 viruses in the Ebolavirus genus using the optimal step k = 6. As shown in Fig. 4, the five species were correctly separated. The EBOV strains from the recognized outbreaks form a lineage independent of the other four species in genus Ebolavirus. The EBOV strains from the 1976–1977 Zaire (DRC) outbreak were clustered together as a clade. The EBOV strains from the 2007 DRC outbreak were clustered together, with the exception of EBOV_2007_KC242788, which was clustered with the Zaire (DRC) 1976–1977 strains. The EBOV strains from the 2014 Guinea epidemic were clustered together as a clade. The three EBOV strains from the 1995 outbreak in Zaire (DRC) formed a clade. SUDV and RESTV formed separate clades. BDBV and TAFV viruses were positioned together. Our result was consistent with the results generated using the multiple encoding vector method52 (Fig. S5) and the FFP method22 (Fig. S6).

Figure 4. The phylogenetic tree of 59 ebolavirus complete genomes constructed using our method.

Phylogenetic tree analysis of 30 coronavirus whole genomes

The other benchmark dataset used for validation of the method comprised 30 complete coronavirus genomes with nearly 25,000 to 32,000 nucleotides (Table S5). Coronaviruses55 are enveloped, single-stranded, positive-sense RNA viruses within the family Coronaviridae56. They are pleomorphic RNA viruses that are widespread among birds, bats, humans and other mammals, and are known to cause mild to severe respiratory, gastroenterological, neurological and systemic conditions. This group of viruses can easily cross the species barrier and infect new species57. As a result of coronavirus outbreaks, especially SARS, the classification and evolutionary relationships among these viruses have been extensively investigated. We employed our method to analyse the 30 coronavirus whole genome sequences along with 4 non-coronaviruses as outgroups. The 30 coronaviruses were classified into five groups according to their host type. As shown in Fig. 5, generated by our approach using the optimal step k = 7, the 30 coronaviruses along with the 4 non-coronaviruses were correctly grouped according to their host type, except group 1 (Table S5). We compared Fig. 5 with Figs S7, S8, S9 and S10 collected from previous work52,58. The limitation observed in our result (Fig. 5) is that, unlike Figs S7 and S8, our method was unable to cluster group 1. In contrast, in Fig. S9, generated by the k-mer59 method, and Fig. S10, generated by the FFP method using substrings of length six, the four non-coronaviruses were not clustered together. Therefore, for this dataset, the tree generated by our approach has an advantage over the k-mer and FFP methods using substrings of length six.

Figure 5. The phylogenetic tree of 30 coronavirus whole genomes constructed using our method.

Phylogenetic tree analysis of 30 bacterial whole genomes

Another benchmark dataset used in this study comprised 30 complete bacterial genomes with more than 1 million nucleotides (Table S6). Methods based on multiple sequence alignment programs cannot handle such large datasets. As shown in Fig. 6, generated by our approach using the optimal step k = 7, the 30 bacterial genomes were clustered into nine groups based on taxonomic family: Burkholderiaceae, Rhodobacteraceae, Enterobacteriaceae, Borreliaceae, Bacillaceae, Clostridiaceae, Desulfovibrionaceae, Yersiniaceae, and Staphylococcaceae. Our result (Fig. 6) is similar at the taxonomic family level to the result (Fig. S11) generated by the Fourier power spectrum method in previous work58. However, our phylogenetic tree (Fig. 6) has an advantage at the phylum level: as shown in Fig. 6, the genomes were successfully clustered into three phyla, Firmicutes, Proteobacteria, and Spirochaetes, as separate clades, which was not observed in Fig. S11.

Figure 6. The phylogenetic tree of 30 bacterial whole genomes constructed using our method.

Phylogenetic tree analysis of 48 Hepatitis E virus (HEV) whole genomes

The other benchmark comprised 48 complete genomes of hepatitis E virus (HEV). This virus is a non-enveloped, single-stranded RNA virus with a genome of nearly 7200 nucleotides (Table S7). It causes the acute form of hepatitis E. The difference between HEV and the other known hepatitis viruses (A, B, C, D) is that HEV is the only one with an animal host60. To understand the relationship between the HEV sequences, we applied our proposed method to generate the phylogenetic tree of the 48 HEV whole genome sequences using the optimal step k = 6. As shown in Fig. 7, generated by our approach, the HEV genomes were divided into separate clades based on the four genotypic61 categories (I (red), II (grey), III (blue) and IV (green)), except for a few sequences. The phylogenetic tree (Fig. 7) generated by our approach shows better clade distribution based on the genotypic division than Figs S12 and S13 collected from previous work62.

Figure 7

The phylogenetic tree of 48 Hepatitis E virus (HEV) whole genomes constructed using our method.

Phylogenetic tree analysis of 24 Eutherian mammal sequences

We selected transferrin (red) and lactoferrin (green) sequences from 24 vertebrates as a benchmark dataset63 (Table S8). Vertebrate transferrins and lactoferrins are iron-binding proteins found in blood serum, milk, egg whites, tears, and interstitial spaces. They can be involved in iron storage and resistance to bacterial disease. We applied our proposed method, using the optimal step k = 8, to generate the phylogenetic tree (Fig. 8) of the 24 Eutherian mammal sequences. As shown in Fig. 8, the transferrin sequences (red) were clustered into two distinct clades, except for the rabbit transferrin sequence, which was grouped with the lactoferrin class. Similarly, the lactoferrin sequences (green) were clustered together, except for the mouse lactoferrin sequence, which was grouped with the transferrin class. The phylogenetic tree generated by our approach (Fig. 8) showed a better clade distribution with respect to the transferrin and lactoferrin categories than previous work62, which is shown in Figs S14 and S15.

Figure 8

The phylogenetic tree of 24 Eutherian mammal sequences constructed using our method.

Phylogenetic tree analysis using 58 genome datasets from different species

To verify the clustering efficiency of our method on extremely divergent sequences from different organisms, we randomly collected genomes of varying length from the datasets in Tables S1, S2, S3, S4, S5, S7 and S8. The resulting dataset included four arbuscular mycorrhizal fungi, six bacterial isolates, nine primate mitochondrial genomes, ten ebolavirus complete genomes (five Reston virus (RESTV) and five Bundibugyo virus (BDBV)), ten SARS coronaviruses, eleven hepatitis E viruses and eight eutherian mammals. We applied our proposed method to generate the phylogenetic tree (Fig. 9) using the optimal step k = 8. As shown in Fig. 9, the genomes of the different species were clustered separately. This result showed that our method can cluster genomes irrespective of their size and divergence. It was also consistent with the result generated using ClustalW (Fig. S16). Our method took less than 1 second to generate the transition probability matrices, whereas ClustalW took 3 hours and 12 minutes.

Figure 9

The phylogenetic tree of 58 genome datasets from different species constructed using our method.

Phylogenetic tree analysis using 29 Escherichia/Shigella complete genomes

Another benchmark dataset used in this study comprised 29 complete genomes from the genera Escherichia/Shigella with more than 1 million nucleotides (Table S9). We applied our proposed method to generate the phylogenetic tree (Fig. 10) using the optimal step k = 7. As shown in Fig. 10, the genomes were clustered into distinct clades, Escherichia (green) and Shigella (red). As a reference, we took the benchmark tree64, which is based on concatenated alignments of the 2034 core genes and was inferred using the maximum likelihood method. We calculated the Robinson-Foulds distance (RF-distance)65 between the tree produced by our method and the benchmark tree64. The RF-distance, which evaluates the topological congruence between an inferred tree and a benchmark tree, is often used to compare trees of closely related species, and the 29 Escherichia/Shigella genomes in this dataset are closely related organisms. We also collected the RF-distances reported in the previous study66. RF = 0 indicates that the test-tree topology is identical to that of the benchmark tree, and the level of similarity decreases as RF increases. As shown in Fig. S17, the RF-distance between the tree from our approach and the reference tree was higher than the RF-distances obtained by the other methods. This result demonstrates that our proposed method has a limitation in clustering closely related organisms.
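
To make the RF-distance comparison concrete, the following minimal Python sketch counts the bipartitions (splits) that appear in one tree but not in the other. It is an illustration only and is not the software used in this study: the nested-tuple tree representation, the function names and the toy five-leaf trees are assumptions introduced here, and the count is the unnormalised RF-distance on unrooted topologies.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree` and record every internal clade in `out`.

    A tree is a nested tuple whose leaves are strings (hypothetical format).
    """
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return (all leaves, set of non-trivial bipartitions) for a tree."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # fixed reference leaf used to normalise each split
    result = set()
    for clade in clades:
        # Always store the side of the split that does NOT contain `ref`,
        # so that the same split is represented identically in both trees.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (empty side, single leaf, or all but one leaf).
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return all_leaves, result


def rf_distance(t1, t2):
    """Unnormalised Robinson-Foulds distance between two unrooted topologies."""
    leaves1, s1 = splits(t1)
    leaves2, s2 = splits(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must share the same leaf set")
    return len(s1 ^ s2)  # splits present in exactly one of the two trees


# Toy example: the two trees disagree on whether B or C groups with A.
benchmark = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(benchmark, benchmark))  # 0
print(rf_distance(benchmark, inferred))   # 2
```

An RF-distance of 0 corresponds to identical topologies, and each conflicting split adds to the count, which is why a higher value in Fig. S17 indicates weaker agreement with the benchmark tree.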

Figure 10

The phylogenetic tree of 29 Escherichia/Shigella whole genomes constructed using our method.

Conclusion

This study focused on a fuzzy integral similarity technique based on a Markov chain and applied this algorithm to phylogenetic tree analysis. Sequence comparison is one of the most useful and widely practiced methods in bioinformatics and computational biology. Alignment based methods perform well if the genetic sequences are homologous, but high mutation rates and genetic recombination limit their applicability. At the genomic scale, alignment based methods also become impractical due to their computational complexity. Alignment-free methods are of great value because they reduce the technical constraints of alignments. We constructed a transition probability matrix using a Markov chain for each DNA sequence, without performing prior alignment at the genomic scale. The fuzzy integral similarity technique calculates a similarity score between the transition probability matrices of two DNA sequences. The main advantage of our approach is that it does not require any additional parameters, which makes it fully automated and robust. We implemented and tested our method on suitable datasets.
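
As an illustration of the per-sequence step, the following minimal Python sketch estimates a transition probability matrix over the states A, T, G and C from adjacent nucleotide pairs and then derives a k-step matrix. It is a sketch under stated assumptions rather than the implementation used in this study: the function names are hypothetical, the k-step matrix is taken to be the k-th power of the one-step matrix (the standard result for a homogeneous Markov chain), and the fuzzy measure and fuzzy integral steps are not shown.

```python
STATES = "ATGC"  # state space S = {A, T, G, C}


def transition_matrix(seq):
    """Estimate one-step transition probabilities from adjacent nucleotide pairs."""
    idx = {s: i for i, s in enumerate(STATES)}
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq, seq[1:]):
        # Pairs containing other symbols (e.g. N) are skipped.
        if a in idx and b in idx:
            counts[idx[a]][idx[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that never occurs as the first element of a pair
        # gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def matmul(x, y):
    """Multiply two 4 x 4 matrices stored as lists of lists."""
    return [[sum(x[i][m] * y[m][j] for m in range(4)) for j in range(4)]
            for i in range(4)]


def k_step(matrix, k):
    """Return the k-step matrix, assumed here to be the k-th matrix power."""
    if k < 1:
        raise ValueError("k must be at least 1")
    result = matrix
    for _ in range(k - 1):
        result = matmul(result, matrix)
    return result


# Toy example: in ATATAT, A is always followed by T and T by A.
p = transition_matrix("ATATAT")
print(p[0])             # [0.0, 1.0, 0.0, 0.0]
print(k_step(p, 2)[0])  # [1.0, 0.0, 0.0, 0.0]
```

Because the matrix is always 4 x 4 regardless of sequence length, one pass over the sequence is enough, which is consistent with the short running times reported for the transition probability matrices.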

All programs were run on a Linux server with 384 GB of RAM and 24 dual-core processors. Our proposed approach is faster (Table 1) than the alignment-based method ClustalW, and also faster than the various alignment-free methods discussed above. For large datasets such as the 30 bacterial genomes and the 29 Escherichia/Shigella genomes, which ClustalW cannot handle, our alignment-free method takes only 3 seconds to produce the transition probability matrices for both datasets.

Table 1 Running time comparison.

In this study, we plotted ROC curves39,40,41 (Fig. 11) and calculated the area under the ROC curve (AUC) using distance matrices generated by our method and by other alignment-free methods from the Alfree repository2. A detailed discussion of the ROC results (Fig. 11) and the AUC analysis for all benchmark datasets is given in ROC_Supplementary Material. While our AUC values are similar to those of the other methods, the phylogenetic trees generated by our method cluster the benchmark groups better than those of the other existing methods. Moreover, the advantage of our method for long sequences can be observed from the execution times in Figs 30 and 48 (ROC_Supplementary Material).
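
One common way to obtain an AUC from a distance matrix, sketched below in Python, is to treat every pair of sequences as a binary classification case: pairs from the same group are positives, pairs from different groups are negatives, and a smaller distance counts as a stronger prediction of "same group". This is an assumed protocol for illustration and may differ from the exact procedure behind Fig. 11; the function name, the group labels and the toy matrix are hypothetical.

```python
from itertools import combinations


def auc_from_distances(dist, labels):
    """AUC of the rule 'smaller distance means same group'.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- group label of each sequence, in the same order as `dist`
    """
    same, diff = [], []
    for i, j in combinations(range(len(labels)), 2):
        (same if labels[i] == labels[j] else diff).append(dist[i][j])
    if not same or not diff:
        raise ValueError("need at least one same-group and one different-group pair")
    # Mann-Whitney formulation: the fraction of (same, different) pairings
    # in which the same-group distance is smaller; ties count as one half.
    wins = 0.0
    for s in same:
        for d in diff:
            if s < d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same) * len(diff))


# Toy example: two groups that the distances separate perfectly.
labels = ["I", "I", "II", "II"]
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(auc_from_distances(dist, labels))  # 1.0
```

An AUC of 1.0 means every same-group pair is closer than every different-group pair, while 0.5 corresponds to distances that carry no information about group membership.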

Figure 11

Receiver operating characteristic curve (ROC) on the given datasets.

Our proposed method is fast and has the potential to build phylogenetic trees for large genomes, such as mammalian genomes. Mammalian genomes are divided into several chromosomes. In this study, we selected chromosome X for the phylogenetic analysis; details are given in Table S10. Our dataset included the following species: human (Homo sapiens), monkey (Macaca mulatta), chimpanzee (Pan troglodytes), gorilla (Gorilla gorilla), horse (Equus caballus), mouse (Mus musculus), dog (Canis familiaris), opossum (Monodelphis domestica), and platypus (Ornithorhynchus anatinus). The length of chromosome X in these organisms ranges from approximately 6 to 147 Mb. Applying our method, we generated the phylogenetic tree (Fig. S18) of the nine mammals using the optimal step k = 8. Our method took only 18 seconds to generate the transition probability matrices. In the phylogenetic tree generated by the multiple encoding vector method52 (Fig. S19), mouse clustered with the primates, while dog and horse formed a clade together. The tree generated by our method (Fig. S18) formed three major clusters. The branch point in the first clade shows a divergence of horse and mouse from the primates. A comparative radiation hybrid map of chromosome X of human, horse and mouse reveals many conserved syntenies between the three species67,68. This similarity may have placed horse and mouse in a sister group. Dog and opossum formed a distinct clade, while platypus separated as an outgroup.

Based on these results, we found that our method performed better on divergent sequences than on closely related sequences. This approach would therefore be useful for generating hypotheses that can be investigated in further detail with subsequent analysis. In developing the method further, it should be kept in mind that the approach is probabilistic in nature and can be extended by incorporating more information about the sequences, such as nucleotide substitution, insertion, deletion, genetic recombination, and the physicochemical properties of nucleotides. Overall, our goal in this study was to introduce a new methodology and a new tool to the comparative genomics research community. This work can be used to guide the development of more powerful measures for sequence comparison.