Determination of k-mer density in a DNA sequence and subsequent cluster formation algorithm based on the application of electronic filter

We describe a novel algorithm for information recovery from DNA sequences by using a digital filter. This work proposes a three-part algorithm to decide the k-mer or q-gram word density. Employing a finite impulse response digital filter, one can calculate the sequence's k-mer or q-gram word density. Further principal component analysis is used on word density distribution to analyze the dissimilarity between sequences. A dissimilarity matrix is thus formed and shows the appearance of cluster formation. This cluster formation is constructed based on the alignment-free sequence method. Furthermore, the clusters are used to build phylogenetic relations. The cluster algorithm is in good agreement with alignment-based algorithms. The present algorithm is simple and requires less time for computation than other currently available algorithms. We tested the algorithm using beta hemoglobin coding sequences (HBB) of 10 different species and 18 primate mitochondria genome (mtDNA) sequences.

The alignment-free approach is independent in respect to the position of the strings. It is advantageous to compare sequences avoiding the biasness which arises from the order of strings within the sequences 7 . Two alignment-free methods, (a) graphical representation and (b) q-mer/word frequency estimation, are very popular for sequence similarity analysis. The visual inspection of data can be done through graphical representation. It facilitated to comparison of DNA sequences as well similarity analysis [8][9][10] . Different graphical representation has been reported to determine multi-dimensional mode for DNA characterization [11][12][13][14][15][16][17][18] which are mostly two-and three-dimensional in nature. Li et al., reported 2D graphical representation of DNA sequence to constructed a coronavirus phylogeny 19 . Similarly, two-dimensional representation was described in dual nucleotides sequencing by Liu et al. 20 . Randic and Bai used triplet codons to obtained 2D graphical plot of protein sequences 21,22 . On the other hand three-dimensional graphical representation was used for different sequences. In this way RNA secondary structure was described using 3D graphical representation 23 . Cao et al. used dual nucleotide coding in three-dimensional representation 24 . Tri-nucleotide coding was also used in 3D graphical representations 25 . 4D/5D graphical representation was also reported by some workers to determining similarity index between two sequences [26][27][28] .
On the other hand, q-mer/word frequency estimation is based on the fact that the more alike two sequences are, higher the number of features that are common to two sequences. Blaisdell determined sequence comparisons based on frequency statistics It is very useful for the comparision of long sequences based on alignment-free statistic D2 29 . D 2 analysis is based on finding correlation between the occurrences of all q-mers which appear in two sequences. But, D 2 calculation is sensitive to noise arising from the sequence randomness. It results in low statistical power to determine relationship in the comparison of two sequences 30 . Sometimes word frequency measurement has low sensitivity when determining the statistical significance of a word's frequency in the sequence [31][32][33] .
To reduce the computational complexity and find a more realistic method for the recovery of DNA sequence information, in this paper, we propose a digital signal processing (DSP) technique [34][35][36] . We impose a DSP-based finite impulse response (FIR) filter 37 on a DNA sequence to calculate the k-mer or q-gram word density via sequence-free analysis. K-mer is highly important to understand the landscape of NGS read dataset 38 . The DSP technique is advantageous as it deals with the general structures of coding, and it is quite different from organism dependence. The four letter alphabet A, C, G, T are the basis to construct the genetic word of different lengths, viz., {AA, …, TT}, {AAA, …, TTT}. The first set of word is of length 2 and the latter set is of length 3. These are the examples of two letters and three letters word respectively. Generally, a q-nucleotide word is termed as k-mer or q-gram. As the set of DNA alphabet contains four letters, the quantity of all possible q-grams is 4 q . We use a dynamic window of width W to slide over the sequence to count the frequency (or occurrence) of each k-mer or q-gram. The window slides over one by one site. While existing on W number sites, digital filter technique is used to calculate the density of the word through the sequence. The distribution of word density along the sequence is taken as input for evaluation Principal Component Analysis (PCA), which helps finding the dissimilarity between the sequences. PCA projects the multi-dimensional data sets into low dimension without damaging the original data matrix's reliability 39,40 . To examine the validity of the method, we have taken two different groups, beta hemoglobin genes (HBB) and mitochondria genomes (mtDNA).

Materials and methods
Nucleotide Density. We have used W-size window which slides over the sequence in a base-by-base means between position i = 1 to n along the DNA strand. Window performs the counting of genetic word to find the word density in the sequence. The method relies upon the observations through sliding 'counter' of size W over the DNA sequence. A particular number of q-grams, herein called bins, are taken into consideration for the formation of the counter. The following definitions are helpful to calculate the word density distribution in a sequence. Definition 1. q-gram of Sequence Consider a counter of length q moves along a sequence segment 'seq' and it will count the signature of a q-gram. Thus, it will count a total number of seq − q − 1 q-grams over the sequence 'seq' . The assumption of homologous region contiguity is not required in alignment-free method Since there are four basis of genetic alphabet, one can construct a total number of 4 q possible q-grams or bins. The bins are arranged in lexicographical order. If i th bin is denoted by b i in this order, the set of all possible bins are denoted as Definition 2. Bin signature Bin signature represents the presence or absence of a bin b j b j ∈ B q ; j = 1, 2, ..., 4 q at a position α in the sequence. It can be expressed as S j , which is a mapping of bin b j with is signature at a position α in the sequence. For a sequence segment 'seq' , there are seq − b j − 1 number of bits in S j .   In the present work, we have considered uniformly distributed window of unit value for nucleotide density calculation. As elucidated, the output in the form of convolution summation denotes the nucleotide density distribution. The algorithms for bin structure, bin signature, and filter process are discussed as presented in different table (table S1, table S2, and table S3, correspondingly).

Sequence analysis.
Based on the density distribution of the DNA sequences, one can find the similarity/ dissimilarity measure between two density distributions, d i and d j such that d k = (y k1 , y k2 , …, y kn ). A data matrix D is constructed by including all density distributions [d 1 , d 2 , d 3 , ...d m ] ′ , where m is the total number of sequences. Thus, D becomes a m-by-n matrix. We wanted to locate the coordinates of the species in 2D and 3D representation. But the m-by-n D matrix cannot visualize the coordinates due to the higher dimensionality of the data set. In that case, reducing dimensionality to the utmost 3-dimension help visualization of the coordinates. In this scenario, we have used Principal Component Analysis (PCA) to reduce the multidimensional data sets to lesser dimensions without losing the consistency of the original data matrix. PCA helps estimating the scores between the density distributions. All the sequences under consideration are taken as a default query sequence in NCBI BLAST alignment as sense strands. In such a case, the nucleotide density distribution study is sufficient without emphasis on whether the strand is sense (positive) or antisense (negative).
The scores in the first three principal components are used to determine the dissimilarity between two sequences. Hence a score matrix, S, of m-by-3 order is constructed. Ordered pair of rows of the score matrix is taken for computation of the Euclidean distance between the pairs of sequences. Rows of S corresponding to sequences are the observations. On the other hand, the columns corresponding to the position index in the sequence are the variables. Since there are m number of observations, one can construct m(m-1)/2 number of Euclidean distances, corresponding to pairs of observations in S. The Euclidean distances are organized in the order (2, 1), (3,1), …, (m, 1), (3,2), …, (m, 2), …, (m, m-1) and they are arranged in a row matrix of 1-by-m(m-1)/2 order, which is further used for building dissimilarity matrix in clustering or multidimensional scaling. In our case, the low dimensional data in the first three components are taken for computation of the Euclidean distance between the pairs of sequences. Hence PCoA, K-medoid, or MDS are not required for further analysis 61 . A phylogenetic tree is constructed by employing Unweighted Pair Group Method with Arithmetic mean (UPGMA) on the PC scores as included in S matrix. The entire process is displayed as a flowchart in Fig. 1.

Results and discussions
Sequences. We have tested the algorithm with two sets of sequences-beta hemoglobin coding sequences (HBB) of 10 different species and 18 primate mitochondria genome (mtDNA) sequences. The detail of the sequences is given in Table S4 and S4 respectively. The sequences are obtained from NCBI genetic sequence database, which is publicly available, and it does not need any ethical approval. We have taken the Nucleotide database from NCBI. The nucleotide sequence is used to determine the nucleotide density distribution over the sequence. We considered sequences of beta hemoglobin coding genes (HBB) of 444 base-pair length from different organisms, such as primates, ungulates, rodents, and birds, to examine our algorithm. Similarly, the second set of sequence data of 337 base-pair length is taken from some primates' HVR-2 mitochondria genome. The sequences are of equal length. The study limits the comparison of sequences with different lengths. It can be overcome by normalizing sequence distribution. Moreover, in our case, a multiple sequence alignment algorithm is used to identify the aligned regions between the sequences.
FIR filtering. Employing FIR filter, the spatial distribution of nucleotide density is generated as d k = (y k1 , y k2 , …, y kn ), where k = 1 to 10 and n = 444. We calculated the density distribution for one-, two-, three-, … gram nucleotides for different organisms. Figure S1 displays the spatial variation of the nucleotide density along the HBB sequence for 10 organisms. The G nucleotide distribution demonstrates a rich density in base position between 70 and 100 for all organisms except Gallus gallus. The same distribution for the T nucleotide demonstrates a rich www.nature.com/scientificreports/ density of approximately 150 bp for most of the organisms except Gallus gallus and Sus scrofa. Mus musculus and Rattus norvegicus are enriched with A nucleotides in the region between 200 and 240 bp, but Gallus gallus shows enrichment of the A nucleotide in the region between 240 and 280 bp. Gallus gallus also shows enrichment of the C nucleotide in the region between 145 and 200 bp. This type of variation in nucleotide density profiles indicates the occurrence of evolution among organisms. In the subsequent discussion, we will elaborate on the variations through the formation of phylogenetic relations.
Di-nucleotide density distribution. We calculated 16 two-gram nucleotide densities along HBB sequences for 10 organisms. Figure S2 presents the AA and AC density distributions along the sequence for all organisms. Tri-nucleotide density distribution. We determined 64 three-gram nucleotide density profiles. Figure S3 shows two three-gram nucleotide distributions along the HBB sequences for 10 organisms.

Principal component analyses (PCA). Two component analysis. Principal Component Analyses (PCA)
is a vital technique frequently used in multivariate analysis to compare patterns by extracting information from a higher dimension to a lower dimension [41][42][43] . It is used in geometric morphometric shape analysis and consequently to determine trait-based phylogenic trees. PCA on phenotypic variance are found in the literature [44][45][46] .
In our case, genome-wide patterns are taken as input to PCA which has been employed on the nucleotide density distribution along the sequences. PCA converts the information in multidimensional data sets into principal components (PC), which reduces the multidimensional data sets to a lesser dimension without losing the consistency of the original data matrix. Instead of considering all PCs, very few of the principal components are used to capture most of the original dataset variation. The data can then be represented in a 2D or 3D scatter plot, taking two or three principal component axes, respectively. This PCA plot helps visualization of groups of observations in the original dataset. First Principal Component (PC1) includes the most variation, PC2 represents the second most variation, and so on. In this way, the first two or three PCs are sufficient to capture most of the variation. We have considered a data set of m-by-n matrix for PCA, where m is the number of sequences and n is the length of each sequence. Each of m rows is projected on n number principal component basis. A scatter plot of the first two PC values dealing with HBB nucleotides of 10 organisms is displayed in Fig. 2 (4) Gallus gallus. After noticing this disagreement, we considered all four nucleotide density distributions simultaneously, in the same data matrix, and then performed three components PCA on the entire data matrix. We have included the script of the MAT-LAB code in the supplementary section (Script S1).
Three component analysis. As mentioned earlier, a data set of the m-by-4n matrix is considered for 3D principal component analysis (3D-PCA). One can notice that instead of n columns, the data set contains 4n columns. It is due to the inclusion of four nucleotide density distribution in a single row of each sequence. Out of 4n columns, we have considered three columns, as three PCs contain most of the variation. Other columns (4n-3) are discarded as their contribution is less in the information acquirement. Finally, a data set of m-by-3 matrix is taken for 3D principal component plot as shown in Fig. 3. The PCA with simultaneous nucleotide density distributions is in good agreement with the phylogeny reported elsewhere 47 . Figure 3 clearly  Dissimilarity matrix calculation. The dissimilarity value (DV) between any two organisms was estimated based on the first 3 principle components (> 90% of the total variability) by computing the Euclidean distance between organisms. The obtained DV is used to find the dissimilarity between two sequences, as presented in table S5. Additionally, the obtained dissimilarity matrix is shown in Fig. 4    Phylogenetic tree. Taking HBB sequences as a basis, we get the phylogenetic tree of the organisms with the Unweighted Pair Group Method's help with the Arithmetic mean (UPGMA) method as applied on the dissimilarity matrix. The phylogenic tree, as found from our algorithm, is displayed in panel A of Fig. 5. The same phylogenetic relationship is obtained by applying the CLUSTALW alignment-based method. It is indicated in panel B of Fig. 5. The two phylogenic tree as constructed based on two methods are in good agreement. Both the figures show the same nature of cluster formation. Cluster 1 includes two Ungulates accessions (Equus caballus and Sus scrofa). Two members of cluster 1 are at a genetic distance (GD) of 0.307 from each other, which is relatively large in comparison to members in other clusters. This large GD between Equus caballus and Sus scrofa is shown to be relevant to the morphological criteria of different sides, namely, headgear, dentition, and foot structure 48,49 . They are mainly segregated on the basis of their foot morphology-whether they are even-toed or odd-toed hoofed mammals. Equus caballus is an odd-toed ungulate, and Sus scrofa is an even-toed ungulate. Cluster 2 contains five primates, namely, Homo sapiens, Callithrix jacchus, Macaca mulatta, Pan troglodytes and Papio anubis. They are close to one another in the tree, which means that primates maintain less GD in comparison to the other clusters. For example, the Homo sapiens and Pan troglodytes HBB sequences are almost identical, with a likeness of approximately 98.8% of their genome 50 . Phylogenic studies have established the relationships, presenting the common chimpanzee (Pan troglodytes) and bonobo (Papio anubis) as our neighboring evolutionary relatives 51 . However, some major interspecies evolutionary changes are not reflected in the short GD relationship 52 . It is found that Gallus gallus is genetically far away from the other organisms, with a GD as high as 0.7, which is in support of the fact that Gallus gallus is only non-mammalian species among the 10 species in the present study.
Validation of the algorithm with another set of data. To further demonstrate the applicability of our method, we give emphasis to the origin of the human species. After the discovery of fossilized Neanderthal skeletons in Europe, many questions arose regarding the origin of human beings, from which the subject of our relation to Neanderthal is foremost. We apply the present sequencing method to study the mitochondrial genomes of primate origins. We choose mtDNA because of its rapid mutation rate 53 and because it is worthwhile to elucidate the evolutionary relationships among species based on mtDNA sequence analysis. We consider the displacement loop (D-loop), which is a non-coding section of mtDNA. It comprises two regions, viz., hyper variable region 1 (HVR-1) and hyper variable region 2 (HVR-2), which represent variations between species. We have taken sequences from HVR-2. The associated mitochondrion genome sequences are downloaded from the GenBank database (table S6). We examined the HVR-2 in the D-loop region of 18 species of primates, including four humans (Berber, Chinese, Georgean, and Yoruba), Neanderthal, three common chimpanzees, two pygmy chimpanzees, three gorillas, two Sumatran orangutans, two Bornean orangutans, and a gibbon, in an attempt to estimate the divergence of mtDNAs and to determine the relationship between them. We calculated the four nucleotide density distribution matrix followed by performing principal component analysis. The first three principal components for 18 primates are shown in Fig. 6. We observed that seven clusters were formed. Cluster 1 comprises five Homo sapiens (Berber, Chinese, Georgean, Yoruba, and Neanderthal). Three chimpanzees (Pan_troglodytes_1, Pan_troglodytes_2, Pan_troglodytes_3) are included in cluster 2. Two pygmy chimpanzees (Pan_paniscus_1, Pan_ paniscus_2) are grouped into cluster 3. Cluster 4 is built with three gorillas (Gorilla_gorilla_1, Gorilla_gorilla_2, Gorilla_gorilla_gorilla). Sumatran orangutans (Pongo_pygmaeus_abelii_1, Pongo_pygmaeus_abelii_1) appear in  We have constructed a dissimilarity matrix of these sequences from the first three principal components, and the phylogenetic tree among the 18 organisms is obtained by using UPGMA algorithm. The phylogenic tree, as determined from the present algorithm, is shown in panel A of Fig. 7. It is also constructed by CLUSTALW alignment-based method. The latter tree is displayed in the panel B of Fig. 7. The two phylogenic trees as built by two methods show a good agreement. Both trees show the same nature of cluster formation. Our present phylogenetic analysis maintains consistency with the analysis reported by Cristianini and Hahn 54 . The organisms Berber, Chinese, Georgean, and Yoruba are very close in cluster 1. Neanderthals is shown to be connected to a common ancestor of Neanderthals and modern humans. However, there is controversy regarding the relationship between Neanderthals and modern humans 55 . Krings et al. described Neanderthal mtDNA to be far outside the  www.nature.com/scientificreports/ phylogenetic tree connecting the mtDNAs of contemporary humans 56 . On the other hand, Wolpoff reported a close relation between Neanderthal and Europeans 57 . However, in our case, the mtDNA HVR-2 phylogenetic tree shows that Neanderthals are more closely related to modern humans than to any of the other extant Great Apes (chimpanzees, gorillas, orangutans) or Lesser Apes (gibbons). The GD between Neanderthals and the human line is 0.096687, whereas the GD between Neanderthals and chimpanzees is 0.29986. Gorillas and orangutans have GDs of 0.73339 and 0.91766, respectively, from Neanderthals. Gibbons are positioned at a GD of 0.77814 from Neanderthals. The common chimpanzee and pygmy chimpanzee have a GD of 0.2506 between them. The Sumatran orangutan has a GD of 0.36534 from Bornean orangutans. Lesser Apes are distinct from Great Apes; for instance, the gibbon is far from chimpanzees (GD = 0.77814). The present clustering algorithm has several advantages over alignment-based methods. It does not require any pre-editing of the sequences. Alignment-based similarity findings may result in some type of incorrect relationship in the case of divergent sequences 58 . Comparisons of the existing alignment-based algorithms are provided in table S7. However Jemes et al. has developed a clustering for DNA sequences using clustering algorithm 59 . The present algorithm also helpful for genetic grouping of different species subtype using DNA sequence 60 .

Conclusion
We propose a novel algorithm for the determination of the occurrence of k-mer or q-gram words in a DNA sequence for information recovery in sequence dynamics. The present algorithm counts the prevalence of each k-mer or q-gram occurrence by using dynamic window which slides over the sequence. Using digital signal processing (DSP) technique and based on the operation of a FIR type filter, we have calculated the density distribution which is further used for PCA. The present method is of sequence alignment-free and does not require graphical representation. The cluster algorithm based on the FIR operation validates its applicability. Application of the FIR algorithm demonstrates its applicability to two data sets with evolutionary importance, beta hemoglobin coding sequences (HBB) and HVR-2 mtDNA sequences. Our algorithm can be useful for various evolutionary analyses and will be very helpful for future biological communities that are working in the area of molecular phylogenetics and also helpful for genetic grouping of different subtype of a species using DNA sequence.