Introduction

Although basic biological principles, such as the fidelity of replicases and the compatibility of codon degeneracy, attempt to maintain the stability of genes and proteins, random point mutations inevitably provide innumerable novel opportunities for genetic selection and evolution. The accumulation of non-synonymous mutations promotes the formation of homologies, which enable structural diversity while maintaining similar function. Thus, the homogeneity and heterogeneity of proteins hold great interest for the study of structure-based function and molecular evolution.

To this end, many bioinformatics tools have been created. Multiple sequence alignment (MSA) is widely used to identify sequence similarity1,2 and extended methods such as motif search tools and phylogeny trees have been developed to explore functional regions and evolutionary relationships. Although conserved regions among orthologues and paralogues are considered core structures required for basic biological function3, co-variant amino acids have attracted attention recently for their roles in differentiation and phenotype, which reflect a subtle correlation between the polymorphism and stability of proteins4,5. Several algorithmic approaches have been used to explore the coevolution of protein residues, such as explicit likelihood6, maximum likelihood7, joint probability estimation8 and correlation coefficient tests9. However, many of these methods are unable to separate phylogenetic linkages from covariant sites in a group of mixed sequences9,10 that might feature multiple or dynamic evolutionary issues. Furthermore, several of these methods do not effectively remove noise derived from unrelated sequences and ignore Indel fragments in the alignment. Recent effective methods such as PSIcov and DCA11,12 detect only covariant pairs and not multiple covariance. In addition, due to inadequate consideration of the relationship between a protein and its ligand, previous estimations of covariance are not reliable depending on adjacent amino acids in a 3D structure. Thus, there is no good standard for testing covariance and most models neglect computing efficiency.

In this work, we attempt to establish a novel approach to rapidly and effectively detect multiple covariance in a dataset of sequences to interpret the profile of biological divergence and its significance.

Methods

Data initialization

Amino acid (or nucleotide) sequences aligned by MSA were used as input data. Initially, this model discarded illegal sequences and unrelated sites according to the following criteria: a) more than one third of residues (or bases) in a sequence matched the consensus sequence; and b) at least two types of residues appeared at one site in at least 5 similar samples.

Independent pair model

We designed a new model to detect independent covariant pairs based on direct correlation. First, one residue at one site was defined by a site-residue element. For instance, the x(m) element represented a residue m at site x and the y(n) element represented a residue n at site y. For any two sites, the purity of the coupled elements x(m)y(n) was calculated as a correlation probability, the purity value was denoted by P and the association of every pair was defined as:

The length of the aligned sequence is denoted by L and the number of elements (or pairs) is denoted by N.

As a correlation function, a rational threshold of Px(m)y(n) should be considered first. According to the formula, if there are no coupled elements, Px(m)y(n) is zero and if x(m) is completely coupled with y(n), Px(m)y(n) is one. When all of x(m) are coupled with half of y(n), Px(m)y(n) is 2/3, a correlation can be observed with Px(m)y(n)greater than 2/3. Thus, the rational threshold of purity ranges from 2/3 to 1 (2/3 < P≤ 1), with a larger value representing a stronger correlation. As with real biological events, covariance does not occur at one time point and the purity of covariance is not always 100%; thus the default threshold value was set to a appropriate correlation value of 0.7 (Figure S1). Moreover, because covariance results in a stable transformation of amino acids, at least two types of elements are needed to confirm covariance between the two sites.

Correlated tandem model

To detect multiple covariance information, we joined the independent pairs with each other in a potential tandem pattern if they exhibited a common site-residue element. First, the association degree of each element was evaluated to confirm a reliable pattern. Figure 1 demonstrates the processing procedure.

Figure 1
figure 1

Schematic of the correlated tandem model.

Based on the correlation of the inter-restriction among the elements, the associated independent pairs were harvested into a group and transformed into a matrix. The average purity of each column or row reflects the degree of the general association of the element compared to others. Here, the number of columns or rows in the current matrix is denoted by n, the purity value of every pair is denoted by P, the degree of association of each element is individually calculated and the smallest association is obtained by the following formula:

If dm is lower than a threshold, the corresponding element is removed from the current matrix. Then, the model re-calculates the matrix until the degree value of every element is higher than the threshold.

For example, if the threshold of the degree of association is set to a intermediate correlation degree of 0.7, the ‘2 W’ element with a dm (the smallest association) less than 0.7 is removed firstly during the first round of calculation, then the degrees of every elements sites are more than 0.7 during the next round. So we successfully harvest a covariant pattern of ‘ATV’ at 1, 6 and 7, which is represented in a (1-A 6-T 7-V) format. This model then detected other patterns in subsequent groups in an identical manner. Finally, the confirmed patterns are reported after removing the subsets of all patterns.

Sequence clustering

Based on the covariant patterns, we can easily cluster the sequences. Because every pattern represents a strong characteristic of one population, a loose fitness of the pattern is sufficient to classify a sequence into the relevant population. Using the distribution test of pattern matching proportion, a sequence was classified if 45% or more of the pattern sites matched (Figure S2). This process can be used for genotyping and evolutionary analysis, among other applications.

Algorithm implementation

These models were integrated into a new algorithm named ‘Fastcov’ implemented in the golang ( https://golang.org) programming language. The executable binary files for most popular operating systems are freely available at http://yanlilab.github.io/fastcov.

Testing and verification

Well-established data, such as reverse transcriptase (RT) data for hepatitis B virus (HBV), glyceraldehyde 3-phosphate dehydrogenase (GAPDH), the 90 Kd heat shock protein (Hsp90) and toll-like receptor 4 (TLR4), were adopted as samples for model testing and performance verification. All sequences were downloaded from public biological databases. The MSA data were prepared using the ClustalW tool and a phylogenetic tree was drawn using the maximum likelihood method with default parameters in Mega 7.0 software13. The results obtained using our new algorithm were evaluated based on the structure of the phylogenetic tree and the genotyping performance, based on the background information provided with the samples. Several evaluation indexes, Sensitivity (Sen), Specificity (Spe) and Accuracy (Acc), were used to evaluate the performance14.

Moreover, ten-fold cross validation was used to estimate the model’s ability to fit out-of-sample data. During the procedure, the dataset of each genotype was randomly divided into 10 subsets marked with No. 1 ~ 10. The subsets with same number from all genotypes were put together as a test data and the remaining sets composed a corresponding training set. After the tests were performed on the 10 groups of training-testing datasets, the average indexes across all the trials were computed. Furthermore, the dataset was randomly sampled in a proportion of 50% or 10% for different scale test.

Results

Genotyping performance based on covariant patterns

Eight genotypes (A–H) of Hepatitis B virus (HBV) have been identified worldwide and exhibit distinct geographical distributions. The gold standard for classifying any virus depends on the sequence divergence of the entire genome. HBV has been classified into genotypes A-D based on the intergroup divergence of the entire genome, which exceeds 8%15,16.

Here, we assessed the performance of Fastcov by focusing on one HBV protein. For reliable detection, non-redundant data were obtained from a professional HBV database17. In total, 3424 amino acid sequences of the HBV polymerase RT region from genotypes A-D were prepared. To avoid bias due to inadequate sampling, genotypes E-H were not considered in this study. Using our Fastcov algorithm with a purity threshold of 0.7, five groups of tandem sites with 21 patterns were obtained (Table 1). Among them, 4 covariant patterns exhibited typical genotype characteristics by which the relevant sequences could be clustered into their own genotype group (Table 1, bold font). The number of sequences labelled by genotyping was counted in every cluster to evaluate performance. The accuracies reached 98% and above. In Table 1, the evaluation indexes indicate a satisfactory clustering effect for every genotype, indicating that the covariant patterns for the RT protein were consistent with genotyping. Our data also suggest that divergence in the genome contributed to consistent changes and adaptive evolution at the protein level. This result further supports that structural and functional requirements determine selection, as previously reported18. The other two patterns observed at the tandem sites 124-145-151-322 separated genotype AB from genotype CD (Table 1, bold font), but no patterns separating genotypes AC and BD were observed.

Table 1 Clustering performance by covariant patterns.

In addition, to confirm the reliability of Fastcov, we performed 10-fold cross-validation on three scales of HBV sequences. According to the best patterns observed for types A, B, C and D, the multi-class classification results were represented by the average value of the four genotypes14. These results demonstrated that the performance of the Fastcov algorithm was highly reliable (Table 2). We therefore suggest that the Fastcov algorithm not only possesses outstanding power for covariant analysis but is also reliable for evolutionary classification based on covariant patterns.

Table 2 Ten-fold cross-validation of performance for datasets of different scales

Comparison of the performance of different algorithms

To evaluate the performance of Fastcov, we performed a comparative analysis with the most popular relevant methods: PSIcov and CAPS11,19. The analysis was performed in a Windows 7 64-bit operating system with an Intel i5 4570 3.2 GHz 4-core CPU and 8 G memory using the above dataset of 3424 amino acid sequences with a length of 344 residues. The key indexes, including computational efficiency and their functions were evaluated. The test results, shown in Table 3, indicate that Fastcov is very fast, occupies less memory and exhibits more powerful functions than the other methods.

Table 3 Performance of different algorithms

As shown in Table 3, because the latter two tools were unable to perform multiple covariance detection or sequence classification, they were incompatible with all but the co-pair function of Fastcov. We therefore examined the intersection of the co-pairs produced by Fastcov and PSIcov. The Venn diagram indicated that Fastcov and PSIcov obtained similar results at the level of co-pair detection (Fig. 2A). If the threshold decreased to 0.5 below the bottom line of the rational range and those sites where one pair with P ≥ 0.8 remained, the intersection between the Fastcov and PSIcov datasets increased (Fig. 2B), indicating that the models were similar. According to the suggested parameters, PSIcov listed too many candidate pairs following manual reselection11.

Figure 2
figure 2

Venn diagrams of co-pairs produced by Fastcov and PSIcov.

(A) At a threshold of 0.7 in Fastcov. (B) At a threshold of 0.5, with purity of one pair more than 0.8 in Fastcov. (C) Union of A and B in Fastcov.

It should be noticed that the information of pairs provided by PSIcov is incomplete, which only contains sites but no residues, so that people don’t know what covariance occurred in the sites. By contrast, the Fastcov presents complete information in form of the site-residue elements. Because a covariance event between the two sites would lead to at least one transformation of a pair of amino acids into another pair, that means there should exist at least two types of amino acids pairs, so Fastcov confirms a covariance when at least two types of elements reached the threshold simultaneously. If only one, transformation could not take place, then these pairs were considered noise to discard by Fastcov. Unfortunately, they were retained by PSIcov, exactly locating in the subsection of PSIcov in Fig. 2.

On the other hand, after checking the Fastcov subsection, we found that the covariant pairs with lower frequency were ignored by PSIcov. It suggested that FASTcov is better than PSIcov to deal with low frequency data.

Moreover, despite a result similar to that of CAPS at the co-variance level, the co-evolutionary amino acid group reported by CAPS could not distinguish the genotypes. This result indicates that CAPS was unable to provide a precise evolutionary classification.

Thus, only Fastcov retained the sequence and amino acid information in the result, which facilitates the interpretation of the data by biologists. Unfortunately, some algorithms, such as DCA, did not provide software for comparison12.

Furthermore, the versions of Fastcov for popular operating systems, including Windows, Linux, Mac OS X and BSD, are adequate. Fastcov is easy to use and install; it must simply be downloaded and run in a command line terminal. With the exception of the optional parameters, only one input file containing the MSA sequences in FASTA format is needed to run the analysis.

Covariant features of GAPDHs corresponding to the phylogenetic tree

GAPDH is a classic glycolytic enzyme that plays fundamental roles in energy metabolism and non-metabolic processes, including transcription activation and the initiation of apoptosis. GAPDH is related to a variety of pathologies, including diabetes and age-related neurodegenerative disorders20,21,22. GAPDH is also a well-known house-keeping gene that is widely used as an internal control in studies involving protein quantification due to its highly conserved structure and stable expression in tissues23.

Here, we characterized the diversity among GAPDHs in different animal families. In total, non-redundant amino acid sequences from 112 species were retrieved from the Refseq database. Based on these sequences, Fastcov successfully detected family-characteristic covariant patterns that corresponded perfectly to relevant branches in the phylogenetic tree (Fig. 3), thus providing new insights for the elucidation of gene evolution and structural diversity.

Figure 3
figure 3

Covariant patterns corresponding to different families in the phylogenetic tree.

Note: The sites are denoted by the order of the aligned data.

Covariant features of HSPs corresponding to the phylogenetic tree

Heat-shock proteins (HSPs) are stress proteins that are highly conserved and present in virtually all living organisms. HSP90 proteins are ancient and are conserved from bacteria to mammals24,25,26. To observe covariant patterns during evolutionary processes, 31 amino acid sequences from bacteria, fungi, plants and animals were prepared for examination. Our data revealed that HSP90 was relatively conserved, with many Indels in sequences from different organisms. Fastcov was therefore run under a purity threshold of 0.95 and sequences with genotype characteristics are shown in Fig. 4. We detected more frequent and longer patterns in bacterial groups, which displayed covariances occurring with a higher frequency. By contrast, in eukaryotes, covariances always occurred in relatively conserved tandem sites. For example, the pattern (254-Q 729-S 760-A 763-H) distinguished animal sequences from plants and fungi and the pattern (105-S 323-E 433-D) differentiated plants from animals and fungi. Fastcov indicated that the multiple-site combined variance was a common mode utilized during evolution and thus provides a new method by which to assess co-evolutionary modes.

Figure 4
figure 4

Covariant patterns corresponding to different families in the phylogenetic tree.

Note: The sites are denoted by the order of the aligned data.

Divergence analysis of TLR4

Toll-like receptors (TLRs) are highly conserved from Drosophila to humans and share structural and functional similarities that play fundamental roles in pathogen recognition and the activation of innate immunity27. Human and mouse TLR4s both recognize LPS but occasionally exhibit differential activation28. Selective recognition of temperature-dependent shifts in LPS acylation by human and mouse TLR4 was recently reported29. To clarify the molecular mechanism underlying this phenomenon, we performed a divergence analysis of TLR4 using the Fastcov approach.

As reported by the RCSB Protein Data Bank (PDB), the three-dimensional (3D) structures of human and mouse TLR4 include a similar C-shaped circle. Interestingly, among the 570 amino acids, nearly two thirds were conserved (red in Fig. 5); Fastcov exposed a covariant pattern of 180 sites, of which more than 90% (yellow) were located in the loops at the outside of the circle, with few in the beta sheets at the inner region. This result suggests that the divergence is mainly associated with the stability of the 3D structure, providing a clue to understand the differential activation of TLR4 from different species at the molecular level28,29.

Figure 5
figure 5

Distribution of covariant sites (yellow) between human and mouse TLR4 (A) left view; (B) right view.

Discussion

Covariant mutations are crucial for maintaining the structural characteristics of a protein and, consequently, conformational and functional stability30. Accumulating data from numerous populations has provided expanded insights on the significance of multi-site combined mutations.

Point mutations are typically stochastic and as mutations accumulate, a covariant pattern appears in the population. With multiple coincidences, biological trends become visible. During the process, mutations only provide diversity, but selection shapes the pattern. In this study, we developed a novel algorithm to identify correlated changes using an independent pair and correlated tandem model. This algorithm is based on the idea of correlated inter-restriction, which is suitable for the analysis of non-synchronous co-variances. This algorithm enables the reliable detection of multiple-residue covariant patterns and sequence clustering, which provides a new method for studying the divergence of biological sequences.

Substitutions or conservations at two independent sites are not considered directly comparable due to differences in their amino acid compositions9. To resolve this issue, Fastcov utilizes a new independent pair model that considers site-residue elements. Fastcov provides faster processing as well as more powerful functions than other complex methods. The lightweight Fastcov algorithm significantly improves processing efficiency and thus could easily be applied to much larger datasets. Furthermore, the Fastcov algorithm can detect multiple covariances and perform sequence classification, in contrast to existing algorithms.

Sample size is a common factor influencing signal and statistics. Fastcov is adequate for datasets with samples of different sizes, as shown in Section 1 of the results. Furthermore, Fastcov opens the parameters to accommodate observation zooms, higher threshold presents patterns with lower heterogeneity and lower one would provide more patterns including those with higher heterogeneity. In particular, Fastcov adopts several strategies to eliminate noise, including inter-restriction model design, limitation to the smallest sample size and fast processing of large datasets. For any method of covariance analysis, well-aligned data are necessary for a precise result.

CAPS is designed to remove phylogenetic co-evolution, which effectively distinguishes background correlation from true correlation9, in contrast to many other algorithms31,32. For various reasons, Fastcov focuses on detecting co-variance by site-residue elements. However, Fastcov is able to obtain more detailed covariant information by relying on reference sequences. In contrast to interaction methods33, the Fastcov approach is not typically used to identify protein-protein interaction interfaces and is unable to resolve the docking problem. However, the information reported by Fastcov might be useful for extended studies when accompanied by relevant tools.

Fastcov is most useful for studying the point and compensatory mutations caused by natural selection, drug induction and environmental or host immunological pressure and is less useful for analysing the fragments of genes with frequent recombination, such as hemagglutinin and neuraminidase genes of the influenza A virus.

Finally, the samples used in this study were chosen primarily to assess the performance and applications of this model. Additional biological information could be explored by Fastcov by employing a comparative approach under various conditions according to the needs of the researcher.

In conclusion, Fastcov provides a fast and reliable approach for multiple covariance detection and sequence clustering and should be widely used in studies involving sequence diversity analysis, evolutionary classification, genotyping and protein structure formation. Furthermore, the Fastcov algorithm could be easily implanted into integrated bioinformatics tools as an extended module.

Additional Information

How to cite this article: Shen, W. and Li, Y. A novel algorithm for detecting multiple covariance and clustering of biological sequences. Sci. Rep. 6, 30425; doi: 10.1038/srep30425 (2016).