Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.


Results
Dataset and information content evaluation. The non-redundant dataset includes 3,905 complete genomes from the RefSeq viruses as summarized in Supplementary Table S1. The smallest genome is the Anguilla anguilla circovirus (NC_023421), with a length of 1,378 nt and the largest genome is Pandoravirus salinus (NC_022098), which is 2,473,870 nt long. The distribution of genome sizes is depicted in the density plot in Fig. 1. The long tail on the right shows some large genome sizes as outliers, such as Pandoraviruses, Megaviruses, Mimiviruses, and other giant viruses. It is worth mentioning that, after determining the CRE values as shown in Fig. 2, we noted that the recommended range for k-mer length varies greatly depending on genome size, and therefore divided the dataset into four arbitrary subgroups (Q1-Q4) using the 25%, 50%, and 75% quartiles cutoff of 6,407; 12,141; and 45,242 bp; respectively. Assessment of the optimal feature length (k). Because the criteria used by Wu et al. 32 were not directly applicable to our large-scale virus dataset, due to the dependence of CRE on genome size, we determined the Scientific RepoRts | 7:40712 | DOI: 10.1038/srep40712 optimal feature length based on three criteria: (1) from an individual genome perspective, using CRE to find the minimum feature length: where the genome curves reach zero CRE or fall to < 10% of their CRE maximum values; this CRE value was the original criterion for optimal feature lengths in previously published papers 31,32,39 ; (2) from a pairwise-comparison perspective, the average number of common features (ACF) among genomes was applied to determine the maximum feature length: the length prior to ACF dropping to a lower value; this ACF criterion is defined as the average number of common features when comparing pairwise to each of the other genomes at a specific feature length; (3) from an "all-genomes comparison" perspective, we measured commonness of k-mers among all genomes in our dataset in terms of the diversity index to narrow down the range of optimal feature length. The Shannon diversity index was used to quantify the diversity of commonness of k-mers using the fraction of k-mers shared by genomes. The preferred length is the one with a higher Shannon diversity index value (which represents more diversity in the commonness of k-mers) in the range suggested from criteria 1 and 2. When the three steps suggest multiple optimal lengths in the range, then the tree stability (based on Robinson-Foulds distance) is also considered supporting information (see Materials and Methods for more details).
Cumulative relative entropy. For each individual genome, CRE values were calculated by increasing the k-mer length from 5 to 15. We plotted CRE values for the 3,905 referenced viral genomes (Fig. 2), which are colored by genome size and ordered from smallest to largest genome. CRE curves did not simultaneously drop to < 10% of the maximum CRE value for all genomes, which is the selection criterion that Wu et al. 32 recommended. When curves for the smaller genomes achieve that goal, some curves for larger genomes are still at a plateau. At k = 9, the curves of small genomes start to fall < 10% of the maximum CRE, and ~50% of all CREs drop to < 20% of their maxima. At larger values of k (k = 10, 11, and 12), more genome CREs satisfy the < 10% of the maximum criterion. When k = 13, the CRE values of most genomes fall to < 10% of the maximum CREs. However, k = 13 cannot simply be chosen as the optimal feature length, because it might be too large (no information left) for small genomes. By quartile, the optimal k-mer lengths for subgroups Q1, Q2, Q3, and Q4 are determined to be 9 to 11, 10 to 12, 11 to 13, and 12 to 15, respectively. Therefore, we initially determined the optimal range of k-mer lengths for the entire set of 3,905 genomes to be 9 to 13. This range will be refined in the following steps.
Average number of common features. It was previously found that computed RSD values did not work as expected (that is, they did not converge to zero after reaching the optimal feature length). Because of this, we  The curves start to fall to < 10% of the maximum at k = 9 and most genomes satisify the criteria at k = 13. Subgroups Q1, Q2, Q3, and Q4 are green, yellow, orange, and red, respectively. did not use the comparison with random feature space; instead, we only used the denominator of the RSD to explore the average number of common features between pairwise genomes, which we call the ACF. For each genome, the ACF is defined as the average number of common features from a pairwise comparison of all the other genomes at a specific feature length (see Materials and Methods). As FFP is a pairwise-comparison method, the ACF is not expected to be very low among samples at the specific feature length. Otherwise, the obtained information will tend to be randomized, which means that it could produce a random phylogeny. On the other hand, very high ACF will lead to obtain a poor discrimination phylogeny.
First, in order to reveal the shared degree of features at different length, we calculated the ACF among 3,905 RefSeq viral genomes by comparing each genome with the other 3,904 at different feature lengths, as plotted in Fig. 3. The ACF plot demonstrated that few features are shared when the feature length is larger than 11 (k > 11). As a result, the maximal feature length for 3,905 genomes should be 11 nucleotides. Therefore, the optimal range for k is reduced to 9-11. These curves were also coloured by different genome sizes, as in subgroups Q1-Q4. As seen in the Fig. 3, the ACFs are increased with an increased genome size. As we estimated, when k = 13, many of the features of small genomes (Q1 subgroup) are shared, which implies that we cannot consider only the CRE criterion to choose the optimal k.
Finally, we also calculated ACF values for the subgroups (Fig. 4), by comparing each genome with the other 995 or 996 in the same quartile. The maximal optimal feature lengths for Q1, Q2, Q3, and Q4 were found to be 10, 11, 12, and 13, respectively. As a result, the optimal feature ranges were reduced to 9-10, 10-11, 11-12, and 12-13.
All observed feature occurrences in genomes. The unions of all observed features at different lengths have been calculated and compared with theoretical occurrences, as shown in Table 1. Noticeably, the number of observed non-redundant features increases exponentially as powers of alphabetical size (4 for nucleotide sequences); i.e., when k < 13, the total redundant feature number (165,838,971) largely covers the expected feature space. However, when k > 13, the number of observed non-redundant features grows more slowly. In the subgroups, all of the feature numbers show similar trends.
The optimal k-mer length necessary to construct a good dendrogram is that length that provides the best balance of shared and unique features among the genome dataset. To illustrate the relationship between "all features" and "all genomes", the distribution of feature occurrences in genomes was calculated and plotted. As shown in Fig. 5, when the feature length is small (k = 5, 6), most features can be found in most genomes; however, when the feature length is large (k = 14, 15), most features (> 50% or 80%) are unique (occurrence = 1). In either scenario, the FFP method cannot work efficiently. After all, the feature occurrences should be diverse to balance the similarity and dissimilarity when comparing all genomes. For this purpose, the Shannon diversity index was applied and plotted with different feature lengths (Fig. 6). From the curve, the diversity of feature occurrence peaks at k = 7, and then drops steadily. In this regard, k = 9 is more appropriate than 10 or 11 within our previous optimal feature range.
We repeated the same process for each of the four subgroups and obtained Supplementary Figures S1-S4 for distributions and Fig. 7 for the Shannon diversity index. Considering all criteria, the optimal feature lengths for Q1, Q2, Q3, and Q4 were determined to bed 9, 10, 11 and 12, respectively.
What is the optimal feature length? The results for the application of all criteria to the selection of optimal k-mer length are summarized in Table 2. For the dendrogram of the 3,905 viral genomes, either 9 or 11 can be chosen as the optimal feature length. k = 10 has a lower ACF and Shannon diversity index, indicating a non-linear relationship in the dataset. When k = 9, CRE values have not yet dropped to < 10% of their maximum, and the other two criteria perform well. When k = 11, most of the CRE values have droped to < 10% of their maximum, although the ACF is not good for small viral genomes. In this case, it is hard to choose between 9 and 11, because neither can perfectly satisfy our three criteria. Therefore, it makes sense to check the tree stability and use it as supporting information for this study. To evaluate the tree stability, we calculated Robinson-Foulds distances between k (5, 6, 7… ) and k + 1 at different feature lengths. When the Robinson-Foulds distances drop to a low value, it means that the tree stability starts at this k point and the tree topology does not change much as feature lengths increase. As shown in Fig. 8, the trees start to converge at k = 9; therefore, we will choose k = 9 as the optimal feature length for this dendrogram. Furthermore, because we want to obtain a global view of the relationship among RefSeq viral genomes, the "pairwise comparison perspective" and "all genome comparison perspective" are considered more important in this research than the exact estimation of individual genomes, especially when all sequences are RefSeq whole genomes (not as similar and sensitive). For dendrograms of the four subgroups Q1, Q2, Q3, and Q4, the optimal feature lengths have been identified as k = 9, 10, 11, and 12, respectively.

Phylogenomic analysis of 3,905 viral RefSeq genomes.
Based on the three-step assessment, the dendrogram of all 3,905 RefSeq viruses (k = 9) is shown in Fig. 9. This dendrogram was built by the neighbour-joining method using all FFP values as pairwise distances. As a whole, the taxonomic groupings of the 3,905 viral whole genomes generally agree well with the reference taxonomy. The dendrogram is colour-coded by the Baltimore classification, viral orders, kingdoms of hosts, and by genome sizes. From this dendrogram, a global view of all relationships among the 3,905 viral RefSeq genomes is demonstrated. We used this dendrogram as a preliminary step to show the global view of clustering for the hundreds of whole genomes of Ebola viruses sequenced in the 2015 West Africa Outbreak compared with the diverse set of viral taxa. We then used rigorous analysis based on traditional methods to analyse the genomic variation among Ebola viruses 40 . As shown in Fig. 9, all branches of the dendrogram are coloured by the Baltimore classification, including dsDNA viruses, dsRNA viruses, retrotranscribing (RT) viruses, ssDNA viruses, ssRNA positive-strand viruses, and ssRNA negative-strand viruses. In the dendrogram, dsDNA viruses, the largest taxon, are classified into five major groups, which include one relatively large group, one medium-size group, and three small groups. The second major group, ssRNA(+ ) viruses, forms multiple small clades and interlaces among other groups. ssDNA viruses also form five groups, which include one large group and four small groups. ssRNA(− ) viruses and RT viruses organize two relatively independent clades.
The innermost circle of the dendrogram is coloured by reference taxonomy at different Orders, including Caudovirales, Herpesvirales, Ligamenvirales, Mononegavirales, Nidovirales, Picornavirales, Tymovirales, and unclassified families. Supplementary Table S1 shows that ~60% of viruses in our dataset are Caudovirales (excluding the 2,171 viruses whose reference orders are unclassified or unassigned). In general, the Caudovirales viruses group well (ignoring the unclassified orders), with a few member discrepancies. It is interesting to note, however, that Herpesvirales viruses form a small clade to split the largest clade of Caudovirales. Other Herpesvirales viruses also group inside the Caudovirales clades as discrepancies. The Ligamenvirales, Mononegavirales, Nidovirales, Picornavirales, and Tymovirales orders separate from each other to form small sporadic groups.
The second circle shows the hosts' kingdoms, including archaea, bacteria, fungi, animal, plants, protists, and the environment. As shown, the host kingdom with the most dsDNA viruses is bacteria. The plant viruses are mainly ssDNA viruses and ssRNA(+ ) viruses. The animal viruses distribute around the whole dendrogram, and respond to various sequence structures and reference orders, which suggests their possible origin from transmission. Different levels of genome sizes color the outside circle. The overall trend is that genomes with similar sizes are easier to group together, although the colours mix as locality changes.
From Fig. 9, we observed that there is a correlation between the length of the genome and dendrogram grouping as shown in the outer circle. Therefore, the dendrogram of subgroup based on the optimal k-mer as reported in Table 2, will give a better taxonomic resolution.
Statistical analysis for grouping uncertainty. The RefSeq dataset of 3,905 genomes contains 97 known families (by the ICTV annotation), and 59 genomes do not have information about their families (missing or "unassigned" in GenBank). The 10 largest families (see Material and Methods) were evaluated for grouping uncertainty (Huang et al. 34 ). Considering the dendrogram derived from the optimal k = 9, the descriptive statistics of within-group and between-group distances of different viral families were calculated by the Kruskal-Wallis one-way ANOVA and the Wilcoxon rank sum test. For the Kruskal-Wallis one-way ANOVA, the null hypothesis, which is that the within-group and between-group distances of the largest 10 families have equal means, is rejected (p < 2.2 × 10 −16 ). The pairwise Wilcoxon rank sum test shows that within-group distances are smaller than the between-group distance for each viral family (p < 2.2 × 10 −16 ). These statistical results strongly indicate    Step 1: CRE 9, 10, 11, 12, 13 9, 10, 11 10, 11, 12 11, 12, 13 12, 13, 14 Step 2: ACF 9, 10, 11 9, 10 10, 11 11, 12 12, 13 Step 3: feature Occurrence in genomes 9 or 11* 9 10 11 12 Optimal feature length 9 or 11* 9 10 11 12 Table 2. Summary for optimal feature length. *k = 9 performs best in Step 3 and k = 11 performs best in Step 1.  Figure S8). The Caudovirales viruses, most of which are bacterial viruses, form three large clades. Among these are animal viruses with a few protist viruses whose orders are Herpesvirales or unknown.

Discussion
Identifying optimal feature length in an alignment-free phylogenomic method is an important but challenging process, especially when we construct phylogenomic trees for large-scale datasets of divergent genomes of varied size. In this study, we have developed a comprehensive strategy to find the optimal length of k-mers in alignment-free phylogenomic analysis, and we built phylogenomic dendrograms for all complete viral genomes in the NCBI RefSeq as of October 2014 41 .
With the development of sequencing technologies, whole-genome information presents new possibilities for microbial classification 42 . Compared to traditional gene trees, whole-genome phylogenies use completed genomic information and solve the incongruence generated by gene trees from various studies. The alignment-free method with k-mers is useful for comparing genomes with low homology and has been applied to various microbial studies. However, it is still not clear how to find the optimal feature length of k-mers in alignment-free phylogenomic analysis, especially for large-scale comparison of viral genomes. CRE and RSD values have been used as criteria in previous studies 29,31,32,39 , however, these studies used, at most, hundreds of genomes and their lengths did not change greatly. However, thousands of viral genomes in the NCBI RefSeq showed a great difference in size, which ranged from the smallest (Anguilla anguilla circovirus) at 1,378 nucleotides to the largest (Pandoravirus salinus) at 2,473,870 nucleotides. As a result, their CRE curves cannot simultaneously drop to < 10% of maximum as required in previous studies. Furthermore, CRE reflects the ability to identify individual whole genomes at various lengths of k. More details should be taken into consideration when dealing with such highly-diverse data, such as pairwise-comparison information and shared k-mers among all genomes. Hence, we divided our dataset into four subgroups by 25%, 50%, and 75% quantiles cut-off of genomic size.
In this study, we designed a comprehensive strategy to find the optimal length of k-mers for alignment-free FFP phylogenomic analysis. This comprehensive strategy combines three steps: (1) an individual genome perspective: CRE to find the minimum feature length; (2) pairwise-comparison perspective, where the ACF among genomes is applied to determine the maximum feature length; (3) an all-genome comparison perspective, where  9). The braches are colored by Baltimore classifications. The circles, from inside to outside, are colored by different orders, hosts, and genome sizes as follows: The ranches by Baltimore classification: dsDNA (no RNA stage), red; dsRNA, green; RT viruses, pink; ssDNA, blue; ssRNA negative-strand, bright blue; ssRNA positive-strand, yellow. The First circle, inside to outside, by order: Caudovirales, red; Herpesvirales, green; Ligamenvirales, blue; Mononegavirales, orange; Nidovirales, cyan: Picornavirales, pink; Tymovirales, dark green; unclassified, silver. The second circle, inside to outside, by host: protist, orange; archaea, red; bacteria, dark green; fungi, blue; animal, cyan; animal and plants, pale violet red; plant, pink; environment or NA, silver.The third circle, inside to outside, by genome size: Q1, green; Q2, yellow; Q3, orange; Q4, red.
the Shannon diversity index of all observed feature occurrences in genomes was used to find the optimal feature length between the minimum and the maximum. If the results are not unique, the tree stability information (obtained from the Robinson-Foulds distance) can be used to determine the optimal length of k. Based on these criteria, the optimal feature lengths for each subgroup have been identified and are shown in Table 2. To trace the global relationship of all 3,905 viral whole genomes, we chose the smallest k (k = 9) among the optimal feature lengths for subgroups as an acceptable feature length and constructed a dendrogram of all viral whole genomes.
In conclusion, our three-step comprehensive strategy was successfully applied to identify the optimal feature length of k in an alignment-free phylogenomic analysis for thousands of whole genomes with highly-diverse sizes. Moreover, our dendrogram with the optimal feature length derived from all complete viral genomes provides a global view of classification in good agreement with the current viral taxonomy reported by the ICTV and the Baltimore classification. Moreover, this overall dendrogram can also be used as a preliminary step to show the global view of clustering of the diverse viral taxa and further analyze the genomic variation by traditional methods for specific viruses as example on the study of Ebola viruses that were responsible for the recent outbreak in 2015 in West Africa 40 .

Materials and Methods
Dataset. Viral genomes (5,326) were downloaded from the RefSeq database 41 (http://www.ncbi.nlm.nih.gov/ refseq/) in October of 2014. After merging all multiple-segmented genomes from the same virus, 4,300 genomes were obtained. Viroid and satellite data were excluded from the dataset, and then 3,905 genomes remained for this research. All genome data were converted to k-mer feature counts with Jellyfish software 43 . The dataset was also divided into four subsets by 25%, 50%, and 75% quantile cut-offs of genomic size in order to explore how optimal feature lengths vary by genomic size. FFP and phylogenomic trees. All phylogenomic trees were calculated based on FFP-based distance matrices 29 . All criteria, which are related to optimal feature lengths, were computed in parallel with Python 2.7. Phylogenomic trees were calculated from distance matrices based on the neighbor-joining method with the R package phytools 44 . All dendrograms were plotted by the ITOL online tool (http://itol.embl.de/itol.cgi), and the other figures were generated by R software. Fig. 10, the optimal feature lengths were determined by three criteria: (1) from the individual genome perspective using CRE; (2) from a pairwise-comparison perspective: ACF among genomes; and (3) from an all-genome-comparison perspective: all observed feature occurrences in genomes. If multiple values of feature lengths were determined after this process, tree stability will be used to find the optimal length.

Optimal feature lengths. As shown in
Cumulative relative entropy. A general description of CRE was provided previously 39 , and the optimal feature length, k, was considered the point where the genome curves start having zero CRE or begin falling to < 10% of their CRE maximum values. The CRE has been calculated as follows 32 : RE Where l is the feature length, f i is the observed feature frequency, and f i is the expected frequency formulated from the Markov chain of order 2 as in the previous study 45 . Because the relative entropy (Kullback-Leibler divergence) 46 is always a non-negative value, the function of CRE is monotonically decreasing.
In previous studies 29,32 , RSD has also been used to determine the optimal feature length. However, RSD cannot be applied for this research because our 3,905 genomes provide a huge feature space, and the overlap in feature space between the viral genomes and random sequence does not reduce. As a result, not all RSD values decrease to zero as expected. From another aspect, the random sequences are only generated once, without any iteration, and the iteration can be time-consuming. Therefore, RSD was discarded for use e in this research. However, from this information, we developed the ACF to check the overlap in feature space among genomes.
Average number of common features. For pairwise genomes, the similarity in FFP methods is actually held by the common features between them. When k is small, most features in one viral genome can be shared by another. However, the all possible features number is small (4 K ), therefore the average number should be low. On the other hand, when k is very large (because the features are long), only a few features can be shared between pairwise genomes. In this case, FFP may not provide enough signals for phylogeny and may show a random phylogeny. Therefore, the optimal k should be chosen before the ACF drops to low values. The ACF can be defined as follows: where c(s i , s j , l) is the number of common features of length l between sequences s i and s j , and N is the genomic number of genomes in the dataset. We used 10% of the maximum ACF for the considered population as the suggested cutoff, similar to the suggestion for RSD 29,32 .
All observed feature occurrences in genomes. From the perspective of all genomes, to balance the similarity and dissimilarity, neither of these situations are acceptable for FFP: (1) most features can be found in most genomes (when the feature length is too small); and (2) most features are unique (when the feature length is too large). For this purpose, the union of all observed features at different k were calculated in our dataset, as well as their occurrence in genomes. Theoretically, the maximum number of all possible features is 4 k . However, the biological sequences are not a random combinations of alphabets (nucleotide bases). As a result, the percentage of observed features relative to the maximum possible number of features decreases as feature length increases in our 3,905 genomes. To balance the level of similarity and dissimilarity, the occurrence for all observed features can be measured by the Shannon diversity index 47 : where p i is the probability that features can be found in i genomes and N is the total number of genomes in the dataset. For the specific length k, the number of observed k-mers is O k (O k ≤ 4 k ). C i k-mers, can be found in i genomes (1 ≤ i ≤ N). The p i can be calculated as p i = C i /O k . For example, to calculate the Shannon diversity index of the k = 9 dendrogram, the O k = 262,144. We assume there are C 5 is the number of k-mers that can be found in five genomes, which means that C 5 many k-mers exist in five of the 3,905 genomes. Here p 5 = C 5 /262,144 (i = 5). The Shannon diversity index can be calculated by adding values from p 1 to p 3905 .
Tree stability. Although the three-step process is applied to check the optimal feature length, it is still possible that inconsistent results can be obtained from the three criteria. To strengthen the feasibility of our method, we used tree stability as an additional factor to determine the optimal feature length. Tree stability is estimated by calculating the topology difference between trees at feature length k (k = 5, 6, 7, … … ) and k + 1 using Robinson-Foulds distance 48 , which is a metric to compare differences between two phylogenies. Therefore, when the Robinson-Foulds distances between trees at feature length k and k + 1 decrease to a low value, it means that the tree stability starts at this k point and the tree topology does not change much as k increases. In our case, trees start to converge at k = 9; therefore, k = 9 was chosen as the optimal feature length for the global dendrogram.
Evaluation of grouping uncertainty. The dendrogram (k = 9) was evaluated for grouping uncertainty by viral family annotation, based on ICTV classification, using the statistical methods described by Huang 34 . Kruskal-Wallis one-way ANOVA was used to evaluate the difference of the distance mean between within-groups and between-groups. The Wilcoxon rank sum test was used to evaluate the difference of distance means between within-group and between-group for each group. The top 10 highest members of viral families, which are 1 Siphoviridae (657 viruses), Geminiviridae (364 viruses), Myoviridae (307 viruses), Podoviridae (218 viruses),