Introduction

One of the most important disciplines in bioinformatics is protein classification, which is used to discover phylogenetic and evolutionary relationships amongst species1,2. Furthermore, accurate classification of a protein sequence among large protein sequence database is critical while developing pharmaceutical products3, such as vaccines, which is becoming increasingly important, particularly in the wake of the coronavirus epidemic. Without current redundancy, the protein sequence database contains over 190 million inputs, and the number of unique functional domains is much larger4. Any attempt to annotate protein function has many issues due to the huge amount of data connected with these proteins. In this manner, protein classification into sequence and structural classes has long been used as a means of simplifying the challenge. As a result, a variety of approaches are used, as explored below.

The process of obtaining a large number of protein sequences has been made easier for researchers because of advances in sequencing technologies5,6,7. As a result, comparing and phylogenetic analysis of these biological sequences becomes a new issue, posing challenges in a variety of areas, including processing time and resource management, due to the large number of data to be considered5. Various methods have been proposed up to this point, which can be divided into two categories: alignment-based methods and alignment-free methods8. For a clear description, Table 1 provides a list of online tools developed for some of the alignment-based and alignment-free methods, as well as the summary of their corresponding comparison algorithm. Alignment-based methods relying on multiple sequence alignment, which commonly uses some sort of evolutionary tools including sequence similarity search tools (e.g., BLAST9, FASTA10), multiple sequence aligners (e.g., ClustalW11, MUSCLE12, MAFFT13), sequences’ profile search programs (e.g., PSI-BLAST14, HMMER/Pfam15), and whole-genome aligners (e.g., progressive Mauve, BLASTZ16, TBA17). Although these methods achieve up-to-scratch results in evolutionary relationships discovery, they are generally time and resource consuming and rely on multiple assumptions about the evolution of the sequences to be compared (i.e. various parameters should be set, such as substitution matrices, gap penalties, and threshold values for statistical parameters which are somewhat arbitrary)2. As a consequence, several alignment-free methods have been presented to prevail over these drawbacks. Alignment-free approaches include any method for evaluating sequence similarity/dissimilarity that does not apply or produce sequence alignment at any step of the algorithm; instead, they use feature extraction to extract the required information from the query sequences2,5. As mentioned before, the alignment-based methods provide high accuracy at the cost of being time-consuming and expensive in memory usage. In contrast, alignment-free methods are fast in computational speed and have been introduced to overcome the complication of sequence alignment2,5. However, they confront the obstacle of accurate comparison and classification. In other words, one of the outstanding concerns in this field is to develop an accurate alignment-free approach that can be utilized in practice, and hence, studies in the field of developing alignment-free methods are mostly focused on this.

Table 1 A list of online tools developed for some of the alignment-based and alignment-free methods, as well as the summary of their comparison algorithm.

Feature extraction from a protein sequence is the challenging part of protein classification studies in any of the approaches discussed above; as a result, various alignment-free methods have been developed in this area. These techniques can extract features in one of two ways25: (1) the protein's amino acid composition, which includes the frequencies of the 20 distinct amino acids within the sequence as well as their physicochemical qualities, or (2) the order and positional information of amino acids within the sequence. So, the physicochemical properties of amino acids can be used to derive features from a protein sequence. As is obvious, a protein sequence is made up of amino acids, each of which has its own set of physicochemical properties that influence protein structures, functions, folding, protein–protein interactions, and evolutionary patterns26,27. As a result, amino acid physicochemical properties play an essential role in protein sequence similarity analysis, protein subcellular localization prediction, and protein structural class prediction. Another aspect to consider is that relying solely on the amino acids’ physical qualities results in the loss of various information, such as the number of amino acids, their location in a string, and so forth25. This positional information, on the other hand, can significantly affect the accuracy of similarity analysis between two sequences. As a result, by relying simply on either physicochemical characteristic or positional information, some data is lost, and the information embedded within a protein sequence is not fully utilized.

In this regard, many studies have been conducted in this area, each of which has used one of these two methodologies or a combination of them. One of these state-of-the-art methods, DCGR27, for example, is built upon the chaos game model based on the physicochemical features of amino acids. The Energy matrix approach2 is another study that is based on physicochemical properties and the position-feature Energy Matrix. Apart from these two studies, methods8,25,28, and many others have utilized the physicochemical features in various ways in their methods. The main drawback of these studies is that they just use the value of physicochemical properties of amino acids in an indirect manner, which results in some information loss. Moreover, there are other concerns, such as the high complexity and increasing volume of calculations in these methods, such as DCGR.

On the other hand, other studies, such as natural vector based method29 that is based on the k-mer natural vector, fuzzy integral (FI) based method5 that is based on the fuzzy integral and Markov chain, and7 method, and many more, do not consider physicochemical properties of the amino acids. Even though these properties provide essential factors for predicting the function and structure of protein sequences.

As a consequence, based on the previous studies, it appears that combining physicochemical properties of amino acids with other properties of protein sequences can assist to enrich the features derived from the sequences2,27. After retrieving the attributes of the protein sequences, numerous methods have been proposed to leverage these features to complete the target tasks and perform the desired calculations. These approaches either require large memories or are not optimal in terms of algorithm execution time, although they achieve acceptable accuracy among various alignment-free methods2,5,27. To build an efficient algorithm for protein classification, several tradeoffs between memory requirement, time consumption, and high accuracy, must be evaluated, which necessitates effective feature extraction.

Given aforementioned challenges of the available alignment-free methods, including clustering accuracy, accuracy of the resultant similarity/dissimilarity scores in comparison to the alignment-based methods, computation speed, considering each protein sequence as a single unit to be processed, and thus, ignoring individual indels' effect, and finally, not involving the physicochemical properties of the amino acids, we propose a simple but efficient vector-based method named PCV (PhysicoChemical properties Vector) to numerically characterize a protein sequence, utilizing the value of the amino acids' physicochemical properties, as well as the positional information of the letters. In more details, we utilize all physicochemical properties of the amino acids, we split each protein sequence into fixed-size blocks, and encode each block of protein applying the proposed encoding method. As a result, PCV takes into account both the influential amino acids' physicochemical properties and the local sequence comparisons, none of which have been considered by most of the existing alignment-free methods. In this work, we also attempt to incorporate the key advantages of the alignment-free methods, including increased processing speed and reduced resource utilization, in comparison to the alternative alignment-based and alignment-free methods.

The steps of our approach are as follows:

  1. 1.

    Extract physicochemical properties from the AAindex.

  2. 2.

    Cluster properties into 110 items.

  3. 3.

    Split a sequence into fixed-length blocks.

  4. 4.

    Calculate statistical or positional characteristics and produce vectors based on the physicochemical properties.

  5. 5.

    Calculate the distance metric between different spices vectors to perform the evolutionary analysis.

Going through above steps, PCV offers several key advantages: (1) simple feature extraction by using the value of the amino acids’ physicochemical properties, and (2) preserving more sequence information by using the amino acids’ physicochemical properties and moment values as a combination of protein composition and positional information, (3) incorporating locational information of amino acids to fully utilize all information embedded within a protein sequence, (4) providing computational parallelism as the result of sequence partitioning into fixed-length blocks, which facilitates parallel operation on various blocks through iterative steps, (5) capable of handling numerous mutations in compared sequence, (6) reduced runtime, compared to alternative methods.

As a comparative study, we evaluate PCV on a variety of datasets with diverse sequence lengths and numbers of sequences. Also several comparison metrics (e.g. Correlation Coefficient and Robinson Foulds distance) are applied to compare our results with those of alternative methods, such as fuzzy integral based method5 and ClustalW as the reference method11. The results confirm that PCV offers higher accuracy, provides correct evolutionary relationships of different kinds of species, and offers high speed comparison. The rest of the paper is organized as follows. Details of the proposed method, i.e. PCV, are introduced in “Method and materials” section. The experimental setup, simulation results, and comparative studies are explored in “Results” section. Finally, the paper is concluded in “Discussion” and “Conclusions” sections.

Method and materials

As the main target of this paper, we attempt to develop an accurate alignment-free approach to boost the speed of protein comparison, while preserving resource efficiency. In this case, we utilize two different types of data: (1) physical and chemical characteristics of amino acids, and (2) statistical information of the amino acids, such as their spatial frequency within the sequences. For this purpose, as shown in Fig. 1, we present four units to utilize and process this information in our proposed method, called as PCV:

  1. 1.

    Clustering unit categorizes physicochemical features of the amino acids. This unit categorizes 566 amino acid features into 110 classes. As a result, although all features are utilized, the amount of data fed to the comparison algorithm is reduced.

  2. 2.

    Splitting unit splits protein sequences into fixed-size blocks: Splitting sequences preserves local information, while enables indel modeling.

  3. 3.

    Calculation unit generates statistical information vector for each block: In this unit, statistical information of the amino acids and their order are calculated based on the corresponding physicochemical features.

  4. 4.

    Comparison unit computes dissimilarity/similarity metric. This unit computes Euclidean distance of physicochemical and statistical information vectors for each pair of blocks of two sequences, and so, utilizes the extracted distance to compute the dissimilarity metric between input sequences.

Figure 1
figure 1

The overall schematic of PVC algorithm with its four units: (a) clustering unit, (b) splitting unit, (c) calculation unit, and (d) comparison unit.

The process of PCV algorithm is such that first of all, its Splitting unit elongates the input sequences to a specific length and splits them into pieces of fixed length. Afterwards, the Calculation unit encodes these components using the physicochemical properties clustered by the Clustering unit. Finally, the PCV algorithm calculates a score as the degree of similarity / dissimilarity between each two sequences. To clarify these functionalities, Fig. 1 shows the relationship between various units of the PCV algorithm. In the following, we provide a detailed description of each unit of the PCV algorithm.

PCV’s units

Clustering unit

The physicochemical properties of amino acids demonstrate the characteristics of biochemical reactions and have been widely used in bioinformatics research. AAindex is a database of numerical indices representing diverse physicochemical and biochemical properties of amino acids26. This database consists of three sections: (1) AAindex1 for the amino acid indices, as we require in this paper, (2) AAindex2 for the amino acid substitution matrices, and (3) AAindex3 for the amino acid contact potentials.

AAindex1 includes 566 properties for amino acids26. Although it can be used in various applications, such as sequence comparison involving high number of sequences, this massive volume of data arises resource consumption and computational complexity concerns. In this manner, we propose grouping them together to overcome these challenges. For this purpose, we first calculate the correlation between each pair of 566 properties to obtain a 566 × 566 correlation matrix. Then, we use this matrix to calculate a pairwise Euclidean distance in a 566-dimensional space to achieve a 1-d array. This array is then used to feed the “linkage method” (this is also known as the Farthest Point Algorithm or Voor Hees Algorithm)30,31, as is discussed in more details in the supplementary materials section "AAindex1 clustering". Finally, we form flat clusters from the hierarchical clustering defined by the given linkage matrix, provided that each flat cluster has a cophenetic distance not greater than 0.2 × max_distance (maximum Euclidean distance from the 1-d array). The details of clustering these 566 properties into 110 groups can be seen in Table S1 of the supplementary materials.

Once the physicochemical properties are clustered, to use each class, a representative value of all the attributes in that class should be determined. In this paper, we employ the average values of all attributes of each class for each amino acid, resulting in a 110 × 20 matrix, which replaces the 566 × 20 matrix used by AAindex1 to feed the PCV. In addition, these values are normalized by Studentized residual to unify the impact of the physicochemical properties. This clustered 110 × 20 matrix is called “level-1 categorization”, which is still a significant amount of data to be used through the sequence comparison process. To resolve the issue, we also partition the numerical range of each of these 110 classes into four equal ranges, and allocate each amino acid to one of the four ranges at which its corresponding numerical value is located. This categorization scheme, called as quadruple classification, is utilized in other parts of the PCV as well. It is worth noting that the number of amino acids in each of the four categories might be different. Figure 2 depicts an example of this categorization scheme for one sample class of the 110 classes.

Figure 2
figure 2

An example of categorizing a sample class into four groups of amino acids.

Splitting unit

Splitting unit partitions protein sequences into blocks of constant length. This partitioning preserves the locality information of protein sequences through the comparison process, which is also empowered by the physicochemical characteristics of amino acids. However, due to the different lengths of input sequences, first of all, lengths of all sequences are made equal to the length of the longest sequence by appending a meaningless substring to their tails.

The occurrence of indels within the sequences and their diagnosis raises some challenges since sequence partitioning into fixed length blocks might prevent their appearances. To resolve this issue, as shown in Fig. 3, we adopt a sliding window (with fixed size S) with maximum shift of d characters to the right and the left of a fixed point, rather than selecting each block from a fixed location. In other words, each initial block is left and right shifted by equal or less than d characters, and so, results in a total of 2d + 1 states, all of them are fed to the PCV, whose best result, in terms of the distance between the blocks, is reported. It should be noted that block size and d value should be pre-determined, as presented in section "Block size and shift analysis" of the supplemental materials. Analyzing various datasets, we chose block size of 50 with maximum shift value d equal to 5, as the suitable values of parameters.

Figure 3
figure 3

An example of sequence splitting into fixed size (S) blocks by various right and left shifts (d as maximum shift); in this example S = 5 and d = 2.

Calculation unit

Once the sequences have been split into blocks, they must be encoded and compared with each other. For this purpose, the Calculation unit performs the encoding and the comparison unit compares the sequences. Each block is encoded as a vector, consisting of two parts: (1) physicochemical vector, representing values of the block's physicochemical attributes collected and preprocessed from AAindex1, and (2) statistical vector, representing statistical information of the block's amino acid groups. In the following, we describe these two parts of the vector in more details.

Physicochemical vector

As described in the clustering unit, the 566 properties provided for each amino acid in AAindex1 are clustered into 110 classes. As a result, for each amino acid in the split block, the calculating unit allocates a vector of length 110 to represent its physicochemical properties. In this manner, for a block of length |S|, a matrix of size 110 ×|S| is generated. As the next step for each block, the values of each physicochemical property for various amino acids are added up. It should be noted that this process is performed for all possible shifts of each block. Figure 4 depicts the aforementioned step-by-step procedure.

Figure 4
figure 4

Producing physicochemical vector for each block.

Statistical vector

As described in the introduction, in order to extract the most information from a protein strand, the statistical and spatial information of amino acids must be analyzed, alongside their physicochemical properties. Specifically, quadruple classification as defined in the clustering unit is employed to produce statistical vectors for each block. For this purpose, 2nd moment of position of each group of quadruple classification is used for each block as statistical vector which is derived from the other two concepts as follow: (1) the number of repetitions of each quadruple classification group for each property in the block, and (2) the mean position of each group for each property in the block. This statistical information can be calculated by Eqs. (1) to (3), respectively.

$$N_{S} = \mathop \sum \limits_{i = 1}^{\left| S \right|} f_{N} \left( {S_{i} } \right)$$
(1)
$$\mu_{S} = \mathop \sum \limits_{i = 1}^{\left| S \right|} i \cdot \frac{{f_{N} \left( {S_{i} } \right)}}{{N_{s} }}$$
(2)
$$D_{2}^{S} = \mathop \sum \limits_{i = 1}^{\left| S \right|} \frac{{\left( {i - \mu_{S} } \right)^{2} f_{N} \left( {S_{i} } \right)}}{{N_{S} N}}$$
(3)

where S represents a block of protein sequence, Ns is the number of repetitions of each quadruple classification group for each property in block S, fN is a binary function calculated for each group of quadruple classification of each of 110 property classes and assumes that all members (amino acids) of each group are similar. Therefore, for each group of quadruple classification, fN equals one if its input character (each amino acid of block S) belongs to the intended group, and its zero otherwise, μS is the mean position of each group of quadruple classification for each property in block S, \(D_{2}^{S}\) is the 2nd moment of position of each group of quadruple classification for each property in block S. Finally, the calculated statistical vector is normalized using the Studentized residual method. Therefore, the statistical vector is a 4 × 110 vector, in which each of the four rows represents the second moment of the corresponding group of quadruple classification and 110 columns represent 110 property classes of amino acids. For more clarity, Fig. 5 depicts the generation process of a statistical vector for a sample block in a more practical manner.

Figure 5
figure 5

Producing statistical vector for each block.

Comparison unit

The vectors produced by the computation unit for each block, as well as their corresponding shifts, are compared in this unit. To accomplish so, it compares ith blocks of two sequences using the Euclidean distance, as addressed in Eq. (4).

$$\begin{aligned} ED_{block\,i, shift\,j\,and\,k} & = \sqrt {\mathop \sum \limits_{m = 1}^{\left| S \right|} \left( {v_{2,i,j,m} - v_{1,i,k,m} } \right)^{2} } , - d \le j, k \le d \\ V_{sequence\,1,block\,i,shift\,j} & = \left( {v_{1,i,j,1} , v_{1,i,j,2} ,v_{1,i,j,3} , \ldots , v_{1,i,j,550} } \right), \\ V_{sequence\,2,block\,i,shift\,k} & = \left( {v_{2,i,k,1} , v_{2,i,k,2} ,v_{2,i,k,3} , \ldots , v_{1,i,k,550} } \right) \\ \end{aligned}$$
(4)

where V refers to the encoded vectors of the split blocks, as generated by appending physicochemical and statistical vectors, and d denotes the maximum shift value for each block. Shifts are accomplished as shown in Fig. 6, while all resultant shifts of the two blocks are compared with each other. The minimal distance between two blocks is presented by the minimum distance between any pair of their shifted variants. This value is utilized for computing dissimilarity score of PCV. Finally, similar to Eq. (5), the dissimilarity score between the two sequences is calculated by sum of the Euclidean distance values picked from all pairs of their shifted variants.

$$D_{Seq1,Seq2} = \mathop \sum \limits_{i = 1}^{{\frac{{\left| {Seq} \right|}}{\left| S \right|}}} {\text{min}}\left( {ED_{block\,i,\,shift\,j\,and\,k} } \right), - d \le j, k \le d$$
(5)
Figure 6
figure 6

The method of comparing blocks in the PCV algorithm's comparison unit.

Datasets

The proposed algorithm is tested on 12 different datasets, available at The National Center for Biotechnology Information (NCBI), F10 and G11 protein datasets, and other public databases2,5,7,29. Our datasets include 9 ND5, 8 ND6, 24 TFs, Coronavirus in two versions, i.e. a) 24 sequences and b) 50 sequences, Betaglobin in three versions, i.e. a) 9 sequences, b) 50 sequences, and c) 88 sequences, 27 AFPs, 114 HRV,1163 influenza, and 20 xylanases protein sequences. The detailed information of each dataset is provided in Table 2. According to this table, these datasets are very diverse in terms of number of sequences and sequence lengths, ranging from about 150 to about 2200 in length, and from 8 to more than 1150 sequences. Finally, it should be noted that these protein sequences are used as Fasta files. Access information of this data can be found in the "Data" section of the supplementary materials.

Table 2 Benchmark datasets details.

Results

As mentioned in the data section, we utilized 12 datasets (as listed in Table 2) with substantial variation, in terms of the number of sequences and sequence lengths used in earlier studies, to evaluate PCV in terms of correctness and speed. PCV's applicability and accuracy in comparing sequences can be assessed in a variety of ways, including qualitative phylogenetic tree analysis, RF (Robinson–Foulds) distance measurement, CC (Correlation Coefficient) similarity measurement, ROC (Receiver Operating Characteristic) diagram, and AUC (Area under the ROC Curve), each of which analyzes various aspects of the algorithm, as follows2,5. The qualitative analysis of trees can assess the clustering ability and topology of the resultant trees, demonstrating the method's capacity to appropriately categorize the input data. Analyzing the RF distance and CC similarity metric for the two phylogenetic trees or the distance matrices determines their topological dissimilarity/similarity, considering the corresponding similarity ratios, by investigating the linear relationship between the corresponding branch lengths of the trees (or the distance matrices) produced by the two methods. Finally, the ROC diagram and AUC are primarily used to qualify data classification, and do not assess an exact numerical relationship between similarity scores supplied by the compared methods.

Since each of the aforementioned comparative approaches evaluates PCV from a different perspective, we have reported them all to comprehensively analyze the performance of PCV, against the counterpart alignment-free methods. It is worth noting that we compare PCV with two alignment-free methods, the fuzzy integral-based method and the Euclidean distance method. We chose fuzzy integral-based method, since it is been compared to, and shown to be better than, five other alignment-free method, as reported in5. This comparison is achieved in terms of RF distance and CC values. In this manner, indirect comparison of PCV with five other alignment-free approaches, FFP, RTD, CV, NCD, and BBC, can be achieved. For a more comprehensive comparison, ROC values of these methods are depicted in the ROC diagrams, as shown in Fig. 19. Moreover, accuracy of the PCV method is also compared against that of the Euclidean distance approach, which is a word-based method, since it outperforms the reference methods, such as Smith–Waterman, especially in the case of homologous sequences. Furthermore, as a key advantage of providing an alignment-free approach, we can enhance the speed of sequence comparison, especially for large datasets. In this manner, “Runtime analysis” section compares PCV's processing time to that of alternative methods for various datasets.

It should be emphasized that in this study, ClustalW method is chosen as the reference method, with which various alignment-free method have been compared so far. Hence, the outperformance of the PCV approach, over the alternative alignment-free methods, is investigated by comparing its performance metrics against those of the alternative methods, considering ClustalW as the reference method. For a comprehensive study, we also take advantage of two state-of-the-arts alignment-based methods, Clustal Omega and Muscle, as the reference methods in this study. It should be noted that although we report accuracy of the PCV method alongside that of ClustalW and two other references, we do not focus on the accuracy improvement of PCV method against the alignment-based approaches. Rather, the main goal is to provide almost the same accuracy of the alignment-based method by an alignment-free one. Finally, it is worth noting that although ClustalW is not the most accurate alignment-based method proposed so far, it is a popular one and has been used as the reference method for evaluating other alignment-free methods2,5. Therefore, for a fair comparison of PCV method with the alternative alignment-free methods, we have also chosen ClustalW as the reference method. However, we incorporated two other reference methods, Clustal Omega and Muscle, to investigate the efficiency of PCV in comparison to the state-of-the-art methods.

Phylogenetic trees analysis

Studying phylogenetic trees specifies the capability of PCV to properly cluster different samples from various categories. Specifically, this approach, as a qualitative evaluation approach, only studies proper placement of the samples within the subtrees. It should be noted that since some methods produce their phylogenetic trees by UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and some others by NJ (Neighbor-Joining), we have used both methods to produce the trees. All trees produced by these methods are accessible in the supplemental materials, however to avoid overlength paper, depending on the method being compared with PCV, either UPGMA or NJ is reported in the following section. Finally, it should be mentioned that we use MEGAX software version 10.1.7 to produce the trees.

9 ND5 protein sequences

MT-ND5 is one of the seven mitochondrial genes encoding subunits of the enzyme NADH dehydrogenase (ubiquinone). This protein is a section of a large enzyme complex, known as complex I, which is active in mitochondria. This enzyme is the largest of the respiratory complexes, and is responsible for the first step of the electron transport process, i.e. the transfer of electrons from a molecule called NADH to another molecule called ubiquinone5,27,28. This dataset includes 9 protein sequences of ND5, which approximately has 600 amino acids. The phylogenetic tree (produced by UPGMA method) constructed by our method, PCV, is shown in Fig. 7. These 9 sequences are divided into four categories based on their taxonomic families, (1) including human, pigmy chimpanzee, common chimpanzee, and gorilla, (2) including fin whale and blue whale, (3) including mouse and rat, and (4) including opossum. An opossum is farthest away from other species, hence it is clustered separately. Unlike the phylogenetic tree produced by fuzzy integral algorithm5 for this dataset, which is shown in Figure S2, pigmy chimpanzee and common chimpanzee are correctly put close to each other by our method. To elucidate the effectiveness of our approach, we compared the phylogenetic tree generated by our method with the corresponding tree generated by ClustalW11 (as shown in Figure S4). This comparison reveals that our method leads to a completely compatible result, although it is an alignment-free method.

Figure 7
figure 7

The phylogenetic tree of 9 sequences of NADH Dehydrogenase 5 protein constructed by our method (UPGMA).

8 ND6 protein sequences

MT-ND6, the only protein-coding gene located on the L-strand of the human mitogenome, is one of seven mitochondrial genes encoding subunits of the enzyme NADH dehydrogenase (ubiquinone)5. This dataset includes 8 protein sequences of ND6, which approximately has 175 amino acids. The phylogenetic tree constructed by our method is shown in Fig. 8. As shown in this figure, these 8 sequences are divided into four categories based on their taxonomic family, (1) including human, chimpanzee, and gorilla, (2) including gray seal and harbor, (3) including mouse and rat, and (4) including wallaroo. As shown in the phylogenetic tree (produced by UPGMA method) generated by PCV, the protein sequences belonging to the four categories are correctly separated. Indeed, the similarity between wallaroo and (human, chimpanzee, gorilla) and (gray seal, harbor seal) is more than the similarity between (mouse, rat). Therefore, we can conclude that our method successfully clustered wallaroo in same clade with (human, chimpanzee, and gorilla) and (gray seal, harbor seal). The latter observation illustrates the outperformance of PCV over fuzzy integral algorithm5, which put wallaroo in the same clade with (mouse, rat), as shown in Figure S6. Finally, it is worth noting that, as shown in Figure S8, the alignment-based approach, ClustalW11, generates a phylogenetic tree with the same topology as that of PCV. Hence, we can conclude that PCV can be as accurate as the alignment-based methods.

Figure 8
figure 8

The phylogenetic tree of 8 sequences of NADH Dehydrogenase 6 protein constructed by our method (UPGMA).

24 TF protein sequences

Protein sequences of transferrins (TFs) from vertebrates, with approximately 700 amino acids, are glycoproteins found in vertebrates that bind to iron (Fe), and consequently, they mediate the transport of iron (Fe) through blood plasma. The liver is the main site of transferrin synthesis, but other tissues and organs, including the brain, also produce transferrin. Transferrin is also associated with the innate immune system. It is found in the mucosa and binds iron, thus creates an environment with low level of free iron that impedes bacterial survival in a process called iron withholding5,8. For PCV, both UPGMA and Neighbor-Joining tree construction methods show the precise categorization. Figure 9 shows the PCV-UPGMA tree for the TF dataset. The 24 sequences are split into four clades, as indicated in this diagram, mammalian TF (light green clade), mammalian LF (red clade), actinopterygii (green clade), and amphibians (bluegreen clade). Whereas, the tree produced by fuzzy integral-based method5 (Figure S10) clusters the Japanese flounder transferrin sequence (which belongs to the actinopterygii class) with the Frog transferrin sequence (which belonge to amphibians class). Even the tree produced by ClastalW method (Figure S12) does not correctly classify Possum transferrin sequence as a mammalian TF. It should be noted that the NJ tree of the TF dataset produced by PCV also follows the similar trend to the produced UPGMA tree.

Figure 9
figure 9

The phylogenetic tree of 24 sequences of transferrins protein constructed by our method (UPGMA).

24 Coronavirus protein sequences

Coronaviruses are large, enclosed, positive-stranded RNA viruses that belong to the Coronaviridae family. Coronaviruses cause respiratory and gastrointestinal disorders in humans and other animals. The spike protein, which is found in all coronaviruses, is essential for viral attachment and entrance into the host cell, while its sequence and structure differ depending on the host5,7,32. Coronaviruses can be classified into four types based on their spike proteins and host types. This dataset contains 24 coronavirus protein sequences divided into four classes5: classes I (mammalian coronaviruses), II (mammalian coronaviruses), III (avian coronaviruses), and IV (SARS-CoV samples). As shown in Fig. 10, the phylogenetic tree produced by PCV almost correctly categorizes samples of this dataset into the four indicated classes. All samples are accurately classified by PCV, except two samples from class I (i.e. AAK38656 and NP598310), which are added to the top level of class II. However, it should be noted that since they are close to other members of class II and also are separated from class I, it might not be considered as a clustering error. For a comprehensive study, this dataset is also put to the ClustalW11 method for comparison. Similar to our method, as shown in Figure S15, ClustalW properly classifies the samples too. It should be mentioned that all phylogenetic trees for this dataset are generated by UPGMA method. This dataset is also used to evaluate other comparison methods, such as the intensity-based method7. While this method has not created a phylogenetic tree for this dataset, it uses UPGMA for phylogenetics tree construction for two other datasets (i.e. ND 5 and betaglobin). The intensity-based technique merely presented the average and standard deviation of the produced intensity vectors (Table S11), claiming that the values for distinct groups of coronaviruses are different. However, the boundaries of these values are unclear for each class of this dataset, and hence, a threshold cannot be chosen for classifying this dataset using the intensity-based technique.

Figure 10
figure 10

24 Coronavirus protein sequences constructed by our method (UPGMA).

50 Coronavirus spike protein sequences

This coronavirus data collection, which includes 50 samples of length 1500 amino acids, is also divided into four groups: Mammalian coronaviruses are classified into different types I and II, avian coronaviruses are found in group III, and SARS-CoVs are found in group IV5,7,32. As shown in Fig. 11, the phylogenetic tree produced by PCV properly distinguishes the strings belonging to each group. Furthermore, all strings that are closely connected are classified in the same category within a group. For example, (TGEV, TGEVG) and (PEDVC, PEDV) from class I are clustered as separate clades, or as another example, while the SARS coronavirus group is categorized independently from other coronavirus groups, its subgroups are also categorized as two sub-trees, i.e. group IVa from the 03–04 interspecies epidemic and the one containing all other human-related SARS-CoVs branches29. Furthermore, considering coronavirus protein sequences, the classifications achieved by PCV are consistent with those resulted by the ClustalW11 method (as shown in Figure S19) and the fuzzy integral based method5 (as shown in Figure S17). It is worth noting that these phylogenetic trees are created using the UPGMA approach.

Figure 11
figure 11

50 Coronavirus protein sequences constructed by our method (UPGMA).

27 Antifreeze protein sequences

Antifreeze proteins (AFPs) are a group of proteins that bind to macromolecular ice and prevent it from accumulating. Spruce budworm (Choristoneura fumiferana, CF), yellow mealworm (Tenebrio molitor, TM), Hypogastrura harveyi (HH), Dorcus curvidens binodulosus (DCB), Microdera dzhungarica punctipennis (MDP), and Dendroides canadensis (DC) are the six species that makeup AFPs2,27. The phylogenetic tree11 presented in2 for ClustalW differs from the corresponding trees we generated using NJ and UPGMA methods. Specifically, as shown in Fig. 12, by comparing PCV’s generated trees with that of ClustalW (Figure S22), reported in2, we can observe that only two elements in the "TM" clade of the NJ's PCV phylogenetic tree are classified incorrectly, so our generated tree outperforms the tree presented in2. On the other hand, the phylogenetic trees generated for the ClustalW (Figure S23 and Figure S24) provide better results, with some mergers within the three categories "DCB," "MDP," and "TM" visible. Therefore, in comparison with reported results in2, PCV achieves more accurate phylogenetic trees. According to the description presented in2 for phylogenetic trees analysis (Figure S21), the Energy matrix approach2 accurately performs the classification task, while it has not reported any evaluation metric, such as RF distance or AUC, to confirm the accuracy. Finally, it is should be noted that in this manuscript, the aforementioned phylogenetic trees are created using the NJ approach.

Figure 12
figure 12

27 Antifreeze proteins (AFPs) sequences constructed by our method (NJ).

9 Betaglobin protein sequences

The most prevalent haemoglobin in adult humans is beta-globin, which is frequently used to investigate species’ connections. This dataset is the first version of betaglobin, and includes 9 protein sequences with a maximum length of 147 amino acids from four categories: (1) Hominidae, including Chimpanzee, Gorilla, Human, (2) Rodentia, including Rat, mouse, (3) Didelphidae, including Oppossum, and (4) Anatidae, including Duck, Gutta, Gallus7,29. The phylogenetic tree constructed by PCV, based on UPGMA method, is shown in Fig. 13. According to this figure, our method clustered the sequences as accurate as ClustalW11 (as shown in Figure S28). For example, the similarity between Gorilla and Chimpanzee is more than that of humans and Chimpanzee, and our method successfully clustered humans after gorilla and chimpanzee. Furthermore, as compared to the alignment-free approaches, such as the Intensity method7 (whose generated tree is shown in Figure S26), PCV tends to combine comparable groups together. For example, PCV appropriately groups Rodentia and Hominidae, while the Intensity technique fails to do so, or PCV correctly groups Oppossum with Rodentia and Hominidae, whereas the Intensity method incorrectly groups it with Anatidae.

Figure 13
figure 13

The Phylogenetic tree of 9 betaglobin proteins constructed by our method (UPGMA).

50 Betaglobin protein sequences

According to5, these 50 beta-globin protein sequences, from various species taken from GenBank, can be divided into four categories: mammals, birds, reptiles, and aquatic animals. However, the categorization does not end here; there are more details, while each of these classes encompasses multiple subclasses. Specifically, Primates, Proboscidea, Ungulate, Carnivora, Rodentia, Chiroptera, and Cetacea are all mammals; birds or Aves are both categorized as birds; and, Aquatics consist of Actinopterygii and Chondrichthyes2,5. As shown in Fig. 14, PCV precisely categorizes these sequences into the same four main categories as previously described. Moreover, PCV has good performance in classifying subclasses while other free-alignments methods, such as fuzzy integral based method5 (as shown in Figure S30), lead to some mistaken categorization within classes or subclasses, such as “Antinopterygii” and “Ungulate”. Consequently, PCV outperforms other proposed alignment-free approaches, such as2,20,33. Furthermore, PCV's phylogenetic tree is closely related to the ClustalW’s result11 (as shown in Figure S32), as an alignment-based approach, and even it outperforms the ClustalW in clustering “Ungulate” class as a separate subtree. It should be noted that for this dataset, phylogenetic trees are generated using the UPGMA approach.

Figure 14
figure 14

The phylogenetic tree of 50 betaglobin proteins constructed by our method (UPGMA).

88 Betaglobin protein sequences

This dataset is related to beta-globin protein data, but it includes 88 samples from 20 distinct categories and a wider range of species, including Carnivora, Primates, Sirenia, Insectivora, Perissodactyla, Hyracoidea, Proboscidea, Rodentia, Diprotodontia, Testudines, Columbiformes, Passeriformes, Galliformes, Anseriformes, Crocodylia, Anura, Perciformes, Gadiformes, Cypriniformes, and Salmoniformes29. There is a relationship between some of these 20 categories at the high levels of evolution, just as there was with the previous dataset (i.e. 50 Betaglobin protein sequences). As a result, a classifier strategy with a strong performance is expected to bring these categories closer together. This database is created using the natural vector approach, and the findings are presented in the form of a phylogenetic tree created by NJ. So, for a fair comparison, outputs of PCV for this dataset are used to produce a phylogenetic tree based on the NJ approach. PCV clusters all 20 categories appropriately, as well as groups related categories closer together at the order level, as shown in Fig. 15. Furthermore, PCV’s outcome is very similar to ClustalW11 (as shown in Figure S35), as an alignment approach. Moreover, in comparison with ClustalW, it provides better clustering and does not result in any mistake in grouping classes Anseriformes and Rodentia. Finally, the phylogenetic tree produced by the natural vector29 (as shown in Figure S34), as a non-alignment method, is as accurate as that of PCV.

Figure 15
figure 15

The phylogenetic tree of 88 betaglobin proteins constructed by our method (NJ).

20 Xylanase protein sequences

The 20 xylanases protein sequences, which had roughly 500 amino acids, constitute another benchmark dataset utilized to validate the approach. Sequences of this dataset belong to the two classes F10 and G115. As shown in Fig. 16, PCV correctly distinguishes between these two types of samples. It should be noted that although the fuzzy integral based method5 (Figure S38), as an alignment-free method, also makes such a distinction between the classes, PCV’s phylogenetic tree appears to be more similar to that of the ClustalW11 (Figure S39), as the reference approach. Discussed later, quantitative comparisons, such as the RF distance and the CC measurement, support this claim as well. Finally, it should be noted for Xylanase dataset phylogenetic trees are generated using the NJ approach.

Figure 16
figure 16

The phylogenetic tree of 20 Xylanase proteins constructed by our method (NJ).

113 Human rhinoviruses (HRV) and 3 HEV-C protein sequences

Human rhinovirus (HRV) is one of the most prevalent causes of respiratory infections and is often associated with the common cold. The HRV dataset29 contains 113 HRV protein sequences from the RV-A, RV-B, and RV-C Enterovirus genera in the Picornaviridae family, as well as three HEV samples as an outgroup. As a result, phylogenetic trees must categorize this dataset into four clusters. According to Fig. 17, PCV accurately categorizes three HRV species and HEV-C. It resembles the ClustalW’s tree11 (as shown in Figure S43), as well as the phylogenetic tree produced by other methods, such as the Natural vector method29 (as shown in Figure S42). The NJ technique has been used to create all phylogenetic trees for this dataset.

Figure 17
figure 17

The phylogenetic tree of 113 Human rhinoviruses (HRV) and 3 HEV-C proteins constructed by our method (NJ).

1163 Influenza A viruses protein sequences

Influenza A viruses cause influenza in birds and certain mammals, which can result in a range of serious human disorders. Antigenic variation of two surface glycoproteins, i.e. hemagglutinin (HA) and neuraminidase (NA), varies substantially across influenza viruses29. As a result, subtypes of influenza A viruses are identified by two numbers H and N, where H represents the hemagglutinin type, which currently has 18 variants, and N represents the type of neuraminidase, for which currently there are 11 variants. Thus, diversity of this dataset emphasizes its key role for evaluating classification methods, including PCV. For this purpose, we used 1163 influenza A viruses NA protein sequences which are divided into 13 subtypes: H5N6, H5N1, H7N9, H1N1, H6N2, H3N8, H3N2, H4N6, H5N5, H10N3, and H7N3. As shown in Fig. 18, PCV precisely categorizes 13 varieties of influenza A viruses, while the sequences with the same N number are grouped together. The resultant phylogenetic tree demonstrates that our proposed method is capable of accurate categorizing even for enormous datasets. It should be noted that PCV’s categorization is in line with ClustalW’s findings11 (as shown in Figure S47) and natural vector method29 (as shown in Figure S46). Furthermore, our phylogenetic tree appears to be a bit better than theirs, with more sequences clustered inside each clade and fewer separated samples, compared to phylogenetic trees produced by ClustalW and natural vector method. Finally, as shown in Fig. 18, PCV brings related clades together. The NJ technique has been used to create all phylogenetic trees for this dataset.

Figure 18
figure 18

The phylogenetic tree of 1163 Influenza A viruses proteins constructed by our method (NJ).

Robinson Foulds (RF) distance results

The Robinson–Foulds (RF) distance is a commonly used comparison metric for analyzing the linear relation between two phylogenetic trees. The RF distance of zero indicates that the trees are identical; as the distance increases, the trees become less similar5. As presented in this section, we built phylogenetic trees using both the UPGMA and NJ methods, and for each method, we reported RF distance between the phylogenetic trees produced by our method (i.e. PCV) and three alignment-based method, the ClustalW11, Clustal Omega18, and Muscle12, as the reference method. Furthermore, while some other studies, such as the fuzzy integral-based technique5, provide RF distances for their proposed methods, we have also included those information in Table 3 as well, so we can compare our proposed method to some other alignment-free methods. According to this table, for all provided datasets, PCV outperforms fuzzy integral-based approach and also outperforms the squared Euclidean distance approach except for one of them (Xylanase) with small differentiate, and its resultant trees are more compatible with those of ClustalW approach. More detailed discussions of this table are provided in the "Discussion" section.

Table 3 Comparison of PCV (both UPGMA and NJ trees), ED and Fuzzy integral based method based on Robinson Foulds (RF) distance for 12 benchmark datasets.

Correlation Coefficient (CC) results

Another well-known metric for measurement of linear relationships between two vectors is the Correlation Coefficient (CC)5,28. The value of this metric is in the range of [− 1,1], where the greater the linear relationship between two vectors, the closer CC value is to 1 or − 1, and the weaker the relationship, the closer it is to zero. The direct and indirect connection of the two vectors, respectively, is presented by positive and negative values. In this manner, for each of the 12 datasets, we utilized CC to examine the linear relationship between the distance matrix of PCV and three alignment-based methods ClustalW11, Clustal Omega18, and Muscle12; Table 4 shows these results. According to this table, the average CC metric for all datasets is above 94%, indicating that our suggested approach has been able to largely conform to alignment-based methods and outperforms the fuzzy integral-based method5 for the given datasets and also outperforms ED for most of datasets. More detailed discussions of this table are provided in the "Discussion" section.

Table 4 Comparison of PCV, Fuzzy integral based method, and ED method based on Correlation Coefficient (CC) distance with ClustalW, Clustal Omega, and Muscle for 12 benchmark datasets.

Receiver Operating Characteristic (ROC) and AUC results

The ROC (Receiver Operating Characteristic) curve is a graphical representation of the diagnostic capabilities of a binary classifier system when its discrimination threshold varies. It can, however, be used for a multi-class classifier as well. The ROC curve is produced by the true positive rate (TPR), also known as sensitivity, recall, or probability of detection, against the false positive rate (FPR), also known as the probability of false alarm and computed as (1—specificity). In addition, the AUC (Area under the ROC Curve) measure is one of the most prevalent interpretations based on the ROC curve. AUC is in the range of [0, 1], while the closer value to 1 indicates the classification approach with the greater accuracy. Generally, an approach with an AUC above 0.9 is considered as a high accurate classifier, while AUC between 0.7 and 0.9, and between 0.5 and 0.7 represent classifiers with average and low accuracy, respectively5,35. More information about this curve can be found in the "ROC" section of the supplementary materials.

To evaluate the proposed algorithm, for all 12 datasets, we generated ROC curves and computed their AUC values for PCV, ED34, and three alignment-based methods ClustalW11, Clustal Omega18 and Muscle12. Moreover, all ROC curves and AUC values reported for fuzzy integral based method5 are also considered in Table 5. It is worth noting that the ROC curves of five other methods, including FFP, RTD, CV, NCD, and BBC, are also provided in5. Based on all information reported in Fig. 19 and Table 5, considering AUC as an interpretable metric, PCV is one of the most accurate clustering algorithms, except for two datasets AFP and Xylanase, even more accurate in some datasets like Coronavirus (24) than alignment-based methods, Muscle and Clustal Omega. Moreover, PCV has an AUC value of more than 0.9 representing its clustering capability. AUC value of the fuzzy integral based approach5, which in most datasets is greater than the AUC values of ClustalW and PCV methods, should be discussed. As a key point to be noticed in this section is that the classification matrix used to calculate AUC has a high impact on its value, while these matrices are identical for both ClustalW and PCV methods. As a result, their AUC values confirm the trend of other metrics, such as the CC between PCV and ClustalW, as expected. However, the classification matrix of fuzzy integral based approach is ambiguous, and it most likely differs from the matrix we utilize, resulting in a higher AUC score, compared to the ClustalW method, while other metrics indicate that it is inferior to ClustalW.

Table 5 Comparison of PCV, ED, Fuzzy integral based method and three alignment-based methods ClustalW, Clustal Omega, and Muscle based on AUC values for 12 benchmark datasets.
Figure 19
figure 19

ROC of 12 benchmark datasets.

To resolve the aforementioned issue, considering the distance matrix provided by the fuzzy integral based method at5, we constructed ROC curves for this method using our classification matrix. In this manner, we achieved two ROC curves for the fuzzy integral approach (the one published at5 and the other one we generated from the provided matrix). As reported in Table 5, AUC values for these two curves are computed. It should be noted that the classification matrix reported at5 is still used for the other five alignment-free methods. Given the ROC curves and the repetition of the ROCs production for the fuzzy integral based method and their AUC values, we can conclude that our classification matrices are more rigid and reliable, and hence, it is possible that the computed AUC values based on our classification matrix for five alternative alignment-free methods be lower than the corresponding reported values at5. Finally, as a key advantage of our proposed method, it should be noted that PCV provides better classification results for large datasets, such as influenza A and betaglobin (88), and HRV, compared to the alternative alignment-free methods as well as alignment-based methods, ClustalW, Clustal Omega, and Muscle. As analysis of quantitative metrics confirm this superiority, the latter achievement is also obvious from qualitative study of the corresponding phylogenetic trees. More detailed discussions of this table are provided in the "Discussion" section.

Runtime analysis

Considering 12 aforementioned datasets with different number of sequences and varying sequence lengths, in this section, we evaluate the execution time of PCV. For a comparative study, the implementations of ClustalW11, as an alignment-based algorithm, and the fuzzy integral based approach5, as a free-alignment method, have been presented as well. The corresponding execution times are listed in Table 6. It should be noted that due to the lack of access to FI tool, we only report the execution times reported in5, and so, the execution times for the datasets not examined by FI tool are not applied. It should be mentioned that, in addition to the method's execution time, the system’s specifications executing the algorithm should be presented as well. Saw et al. 5 uses a Linux server with 24 dual-core processors, 384 GB RAM, and two threads to run its fuzzy integral based technique. While execution times of PCV and ClustalW are measured on a Windows system with a 4-core CPU running at 3.6 GHz and 12 GB of RAM. Considering above specifications, assuming execution of fuzzy integral based technique on a similar system implementing PCV and ClustalW methods, its runtime would be several times higher. Moreover, it should be noted that in order to report the ClustalW's execution time, we ran it 20 times on the system and averaged their results. According to Table 6, for large datasets, either in terms of sequence length or number of sequences, the superiority of PCV’s speed over alignment-based and alignment-free methods is obvious, even for a non-optimized implementation in MATLAB. Figure 20 does a better job of addressing this point. The execution time of the PCV method, as well as those of the Muscle and ClustalW methods36, is depicted in this diagram for a various numbers of input sequences. As depicted in this figure, the priority of PCV in terms of execution time for a large number of sequences is crystal clear. For example, PCV can compare around 5000 input sequences in about 28 h, while the other two approaches take more than 84 h to accomplish this comparison. It should be noted that the sequences in this database are around 330 characters long, and execution times are measured on a system with a 4-core CPU running at 2.4 GHz and 12 GB of RAM. Finally, in section "time estimates" of the supplementary materials, we provided a runtime estimation formula of PCV, based on the implementing system’s specification and maximum sequence length.

Table 6 Running time of PCV, ClustalW, and fuzzy integral based method.
Figure 20
figure 20

Execution time comparison of three sequence comparison methods, PCV, ClustalW2, and Muscle, for various numbers of input sequences. The sequences in this database are around 330 characters long, and execution times are measured on a system with a 4-core CPU running at 2.4 GHz and 12 GB of RAM.

Discussion

As discussed in this paper, it is critical to provide an accurate and high-speed tool, as a means of protein sequence comparison, which provides sequence categorization, particularly for large datasets with a significant number of sequences. Proteins are sequences whose folded structures have a direct effect on their functions, and their structures are influenced by physicochemical qualities on the one hand26,27. As a result, effective physicochemical qualities combined with sequence information can be utilized to increase the accuracy of comparison tools and form the basis for their application in other tools, such as predicting the second and third structures and the protein function. PCV is created and developed with these scenarios in mind.

As previously stated, evaluations and comparisons with current acceptable tools are required to assure PCV’s performance, and these are carried out at several levels. We employed numerous comparison methods, since each method offers a distinct comparison approach. In comparison to the fuzzy integral based technique and ED34, the phylogenetic trees of PCV offer substantially much similarity with those of ClustalW11, Clustal Omega18, and Muscle12 based on the RF distance. For this comparison, the topology and branch lengths of the trees are addressed for similarity measurement of phylogenetic trees, assuming that the reference tree is completely correct. On the other hand, since alignment-based metrics are still being studied, it is not definitive to choose an alignment-based method as the reference one. As a result, for some datasets, such as Influenza A and HRV, due to the large number of samples and the less likely exact similarity of the phylogenetic tree of PCV to the reference one, PCV results in large values of RF distance for some datasets. Since no other approach has supplied RF distance for these challenging datasets, PCV evaluation for these datasets is impossible.

We performed a comparative study of PCV and three alignment-based methods, ClustalW, Clustal Omega, and Muscle using the CC similarity metric. For this purpose, we eased the similarity conditions of phylogenetic trees by examining the linear relationship between our method's distance matrix and that of the reference method. In comparison to the fuzzy integral-based method5, PCV is improved greatly in terms of this parameter, with an average CC value of 95%. Another measure for evaluating classification capability of a method is the ROC curve. Regarding this measurement, PCV showed to be an accurate classification approach for most datasets. In addition to all these achievements in accuracy comparison, our method is also able to determine the local similarity for each pair of blocks, as well as the overall similarity of the sequences indicated by the final dissimilarity score.

It should be noted that most studies addressing the protein sequence alignment-free comparison2,5, with no explicit statement, involve customized datasets containing a number of sequences from specific categories, often at the taxonomy levels of family, genus, or species. In other words, to prepare a customized dataset, an extensive search is performed within a large database to select the proper sequences and create the required datasets. It should be mentioned that for the alignment-free methods, the initial assumption is that the input sequences do not include large structural variations; in other words, they are homologous, so that a one-way comparison can be performed from the beginning to the end of the sequences. Moreover, the assumption can be equated with the proposition that along with the alignment-free methods, a simple sequence search method is required to detect the homologous data. Therefore, we can conclude that generally, the alignment-free methods can process the datasets of homologous sequences, unless joined with some preprocessing tools. To validate the aforementioned statement, for each alignment-free comparison tool, we determined its best response assuming the least identity of the input sequences and the minimum degree of sequence alignment. In this manner, we obtained Table 7 by analyzing the identity matrix of datasets used by each comparison tool. As reported in this table, each dataset is evaluated using three reference methods (i.e. ClustalW, Clustal Omega, and Muscle), and the minimum, maximum, mean, and standard deviation values of each dataset are reported. It should be noted that due to the large number of sequences in the Influenza A dataset, the aforementioned values for this dataset could only be calculated by the Clustal Omega method. Finally, based on the values reported in Table 5, which provides the AUC value of the PCV method for each dataset, we can conclude that the PCV method leads to the accurate classification results for all these datasets, and therefore, it is applicable for the wide range of minimum identity values, reported in Table 7 (i.e. greater than 5% based on Muscle as the reference method). However, choosing the 90% threshold value of AUC, according to the Table S9, we can conclude that the PCV method provides the best sequence comparison output for mean identity values greater than 60% (based on Muscle as the reference method) with the standard deviation of less than 23%.

Table 7 Degree of sequence alignment of each dataset, as calculated by the means of sequence identity metric.

In addition to the issue of accuracy, high speed processing is one of our main goals in designing a suitable method for comparing protein sequences. Examining the speed of our method compared to those of ClustalW, ClustalW2, Muscle and fuzzy integral based method5, we can conclude that in the case of large datasets, either in terms of sequence lengths or number of sequences, we can offer a higher processing speed.

Finally it should be mentioned that the alignment-based methods, such as ClustalW, are known as the accurate comparison tools, and have been used for many years. Therefore, in this study, ClustalW was chosen as a reference method to assess the accuracy of the PCV and the alternative alignment-free methods. ClustalW, as mentioned in "Runtime analysis" section, performs calculations for small and medium datasets in an acceptable period of time, and produces auxiliary outputs, like alignment output, in addition to the comparison score. However, it should be noted that in many applications, such as clustering input sequences or searching them within the databases, these auxiliary outputs are not required, and only the comparison score is taken into account. Moreover, in cases when the datasets are particularly large, execution of the alignment-based approaches are not feasible or reasonable for everyone, as seen in Fig. 20, and this is where speeding up the comparison task with an alignment-free method, like PCV, becomes very significant. Furthermore, PCV can outperform the ClustalW by offering hierarchical methods that require separation of distinct kinds of input sequences in the early phases.

Conclusions

Due to the growing need for development of sequence comparison tools, especially protein sequences, in this work, we presented an alignment-free method that uses sequence information and physicochemical properties of amino acids. This method estimates the similarity of the protein sequence in whole by determining local similarity of fixed length blocks. As a result, although it is an alignment-free method, it can resemble the behavior of alignment-based methods for protein comparison. We compared this method, known as PCV, to some well-known alignment-based and alignment-free methods in a variety of ways. Specifically, we evaluated the PCV approach for 12 benchmark datasets considering various conditions, which is a superior of datasets compared to alternative studies. It should be mentioned that like other alignment-free methods, these datasets include classes with homologous sequences which may require a simple preprocessing search tool to select the homologous data. Assuming ClustalW11 as the reference method, in addition to improving the comparison speed compared to the other methods, Correlation Coefficient (CC) metric, RF distance, ROC curve, and the corresponding AUC metric indicate greater improvement for PCV method than the alternative alignment-free methods. Specifically, we reported an average CC of 94% between PCV and ClustalW methods, as well as more accurate classifications at the different levels of evolution. In this way, we can conclude that PCV is accurate and fast, while providing local similarity information which is not considered by other alignment-free methods. As the future works, PCV can be developed to display pseudo-dot plots and reduce the required amount of memory. Furthermore, due to its structure, employing repetitive operational units, PCV method can be implemented on widely available hardware platforms, such as FPGA, which can assist speeding up this approach, compared to its current CPU-based version. In addition, we would like to make PCV method available as a public online tool.