Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains

The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.

reads in human small intestine microbiota 9 , and up-to 50% reads that cannot be assigned to reference databases in oceans with large phytoplankton 10 . Therefore, alignment-based methods are not applicable for microbial communities with a large amount of dark matters. (2) Current tools analyzing the microbial communities were mostly designed for metagenomics based on mark genes, such as 16S rRNA. However, for the metatranscriptomic dataset, ribosomal RNA (rRNA) transcripts are often required to be depleted in order to maximize mRNA recovery 8,9 . Therefore, the metagenomic tools based on 16S rRNA marker genes are not suitable to analyze metatranscriptomic data. Among the limited metatranscriptomic analytic tools, some were designed for Illumina paired-end 10 or single/paired-end data 11 , or only used to evaluate the gene expression level 12 . A previous study 11 compared four taxonomical classification tools based on a common metatranscriptomic data and obvious differences among the taxonomical analytic results were observed, which was the second figure in original paper 11 . (3) Sequence assembly is time-consuming and challenging especially for metagenome/metatranscriptome when organisms share a high volume of homologous sequences. Different assembled contigs were obtained for the same reads when using different assembly tools. Therefore, alignment-free methods provide a promising alternative for microbial community comparison, eliminating the requirements of reference sequences and assembly.
One type of alignment-free methods is based on the frequencies of k-tuples (k-words, k-mers or k-grams) 13 . A k-tuple is a contiguous sequence of length k. Previous studies indicate that relative k-tuple frequencies are similar across different regions of the same genome, but differ between genomes 14 . One of the earliest similarity measures between two sequences is D 2 which measures the total number of matched k-tuples between two long sequences 13 . However, theoretical studies have shown that the distribution of D 2 is dominated by the variance in the number of occurrences of k-tuples along individual sequences and less by the relationship between sequences 15 . Consequently, other similarity measures have been developed with different normalization, centralization and background models in an attempt to modify D 2 , including D Z 2 16 , D S 2 15 , ⁎ D 2 17 , S 2 18,19 and CVTree measures. Subsequently, normalized dissimilarity measures 20 based on D 2 , D S 2 and ⁎ D 2 , including d 2 , d S 2 and ⁎ d 2 1,21 with range between 0 and 1, were developed for high-throughput sequencing data. Indeed, previous studies 22 showed that k-tuple-based dissimilarity measures are effective in revealing group relationships and gradient relationships among metagenomic and metatranscriptomic samples and that d S 2 and ⁎ d 2 achieved the best performances in most comparisons of microbial communities.
However, the utility of d S 2 and ⁎ d 2 depends on a proper probability model for background genomes. To address this gap, Fixed Order Markov Chains (FOMC) were used to model the background genome sequences, as reported in previous studies 22,23 . There are several limitations during the applications of FOMC: (1) The order of Markov Chain (MC) needs to be set manually. However, for most microbial communities, there is no prior knowledge available for setting the MC order. (2) Furthermore, it is hard to model probabilities of different tuples using a single fixed order MC, and FOMC is not structurally rich. There are n r × (n − 1) independent parameters for an r-th order MC, where n is the number of states, that is, n = 4 for DNA or RNA sequences. When the order r equals 2 or 3, the number of parameters for the model is 48 or 192, respectively. There are no FOMCs with number of parameters between 48 and 192. (3) Thus, the number of parameters grows exponentially with the increase of order r. When sequencing depth is relatively low, the parameters, with their number growing exponentially with the increase of MC order in FOMC models, cannot be accurately estimated.
With this in mind, we introduced Variable Length Markov Chains 24 (VLMC) as an alternative for FOMC to model the background genomes of microbial community in this study. VLMC adaptively determines the order of MC based on the sequence data, thus eliminating manual selection. Additionally, the number of variables in VLMC is flexible. VLMC was originally designed for modeling one long sequence and was represented as a context tree structure 24,25 . For high-throughput sequencing of short reads, the likelihood of underlying, or unobserved sequences cannot be calculated. As a result, the rules for pruning the tree are not clearly defined. Therefore, we first developed strategies to determine the parameters for building a context tree and then extended VLMC for high-throughput sequencing of short reads. Thus, the complete context tree is constructed from these short reads, which typically overfits the data. The number of independent parameters is Num(nodes) × 3, where the Num(nodes) is the total number of nodes in the context tree except the root-node. The tree is then pruned according to a local decision rule. Using VLMC for background modeling, d S 2 and ⁎ d 2 measures were then applied to compare transcriptomic or metatranscriptomic datasets. From the obtained dissimilarities among samples, the clustering trees were evaluated based on the triples distance 26 between the reference and resulting trees. Experimental results show that VLMC models the position dependency in the nucleotide sequences better than FOMC, and since it is free from order selection required by FOMC, VLMC is easier to apply. Our studies also show that VLMC probability models combined with d S 2 and ⁎ d 2 measures exhibit superior performance in clustering metatranscriptomic samples when compared to previous approaches.

Results
Design of experiments. In order to explore the performance of d S 2 and ⁎ d 2 with VLMC, we designed experiments with one simulated dataset and four real datasets. The simulated metatranscriptomic dataset is composed of 90 samples belonging to 3 different groups with 5,000 genes from 5 microbes. Real dataset 1 consists of 18 and 22 RNA-Seq datasets from marine microbial eukaryotes. For 18 RNA-Seq datasets, the molecular phylogeny 27 was reconstructed based on the 18S rRNA genes with maximum likelihood (ML) method. The ML phylogenetic tree was then used to evaluate the ability of VLMC as background model, combined with d S 2 and ⁎ d 2 measures to compare their relationships based on the high-throughput sequencing data of individual species. For 22 RNA-Seq datasets, phylogenetic tree was built with Bayesian inference using MrBayes 28 program. Dataset 2 contains 88 metatranscriptomic samples collected from the Global Ocean Sampling Expedition (GOSE), and they were used to study the effect of VLMC-based measures in identifying group relationships. Dataset  consists of 14 metatranscriptomic samples from depths of 0.03 m and 0.08 m within a typical iron-rich microbial mat. Datasets 3 and 4 were used to study the performance of VLMC-based measures in revealing environmental gradient relationships. The triples distances were applied to evaluate the consistency between the reference and clustering trees from alignment-free measures.
There is no rigorous criterion to decide the optimal length k for k-tuples. However, according to our previous experiments, generally the optimal k is 6-9. For comparison, d S 2 and ⁎ d 2 with 0-4th order FOMC, three L p -norm measures and d 2 were also applied. Experiment 1: Detecting group relationships among simulated metatranscriptomic datasets. Using a similar simulation strategy as developed in Martinez et al. 11 , we simulated three groups of synthetic mock communities with different expression levels using Polyester 12 , an RNA-Seq simulation tool. Five most abundant microbial genomes in human gut were selected based on Qin et al. 29 : Bacteroides vulgatus ATCC 8482, Ruminococcus torques L2−14, Faecalibacterium prausnitzii SL3/3, Bacteroides thetaiotaomicron VPI-5482 and Parabacteroides distasonis ATCC 8503. For each bacterium, a subsample of 1000 genes was randomly selected without replacement. Based on the mock community consisting of 5,000 genes from the five bacteria, we set three group centers with different gene expression levels as follows: (1) Among the 5000 genes, 20% showed 4-fold overexpression, 20% showed 4-fold under-expression, and 60% were normally-expressed. The simulation tool Polyester 12 uses a fold change vector to specify the different expression levels among transcripts. Polyester generates the baseline read numbers from a negative binomial distribution with a preset mean value (default mean = 300), and then multiply the baseline numbers by the fold changes to simulate the transcripts with different expression levels. As shown in equation (1), A is the basic fold change vector, and 20% of the elements equal 1 4 , 20% equal 4, and the others equal 1. We then generated 90 samples belonging to 3 groups each containing 30 samples using the simulation strategy as in Jiang et al. 22 , shown in steps (2) and (3). (2) The three group centers A 1 A 2 and A 3 were generated as equation (2). Norm(μ,σ 2 ) indicates the normal distribution with mean μ and variance σ 2 . Based on the generated 90 expression level vectors, 90 metatranscriptomic sequencing data were simulated and the read length was 76 bp.
The best hierarchical clustering trees with VLMC and FOMC are shown in Fig. 1, and the corresponding triples distances are shown in Table 1. Clear groups of three simulated datasets among samples can be observed for both VLMC and FOMC. The best clustering trees with the smallest triples distance for VLMC and FOMC are both obtained in k = 9 and using d S 2 dissimilarity measure. From the clustering tree in Fig. 1, it is clear that the tree built based on VLMC is more similar to the true tree than that build based on FOMC. Quantitatively, the smallest triples distance for VLMC and FOMC are 42,973 and 43,043, respectively, where VLMC outperforms FOMC with less misclassification.  Table S1 in Supplementary Section 1.1. The reference tree of the eukaryotes was extracted from the molecular phylogenetic tree built from a previous study 27 that reconstructed the tree by maximum likelihood (ML) based on the 18S rRNA gene from a genome sequence or RNA-seq-based transcriptome assembly, shown in Supplementary Figure 1 of their paper 27 . Figure 2a shows the resulting ML of the 18 eukaryotes and it is used as a reference tree in our study. The bootstrap supports for the nodes in the phylogenetic reference tree are higher than 65%. The bootstrap support values of the nodes were calculated based on 1,000 replicates of the data with the same substitution model. The Bayesian posterior probabilities of the nodes in the tree were higher than 90%. The Bayesian analyses were performed with two independent runs with 1,000,000 generations per run. After a burn-in of 350,000 trees (that were discarded) per run, the remaining trees were used to reconstruct a consensus tree and to obtain posterior probabilities for node supports 27 . Table 2 shows the triples distance between the reference and clustering trees using various dissimilarity measures and tuple length. The best clustering result with the smallest triples distance of 177 is obtained by VLMC using the dissimilarity measure d S 2 and tuple length k = 6, as shown in Fig. 2b. The topological structure is similar to that of the reference phylogenetic tree which basically includes three groups. The smallest triples distance for Scientific RepoRts | 6:37243 | DOI: 10.1038/srep37243 FOMC is 318, which was achieved by using ⁎ d 2 with 0-order MC and k = 2, as shown in Fig. 2c. Its overall topological structure of the clustering results is different with the phylogenetic tree in Fig. 2a. The clustering result based on VLMC is obviously better than the result based on the FOMC model.
We also analyzed RNA-Seq data from another set consisting of 22 Marine Microbial Eukaryotes from the Phylum Bacillariophyta, Chlorophyta, and Cryptophyta. The phylogenetic tree was built using MrBayes 28 based on multiple alignments of 18S rRNA sequences using the default settings, and it was used as a reference tree for evaluations. The score for each branch is the Bayesian posterior probability of each partition or clade in the tree. It is the fraction of times that the partition or clade appears in the set of sampled posterior trees. The total number of samples generated from the posterior probability distribution is 1,000,000, and the beginning 25% of the samples were treated as burn-in and were discarded. The three groups Ch, Cr and Ba were clearly clustered to different groups with 100% posterior probabilities. Two internal branches in group Ba have Bayesian posterior probabilities less than 100%. The corresponding results for clustering the 22 species based on transcriptome data using FOMC and VLMC were shown in Supplementary Section 1.2. The experiment also shows the superior performance of VLMC over FOMC.   Fig. 3c. In VLMC, clear groups of different locations among samples can be seen. Except for the two samples from "SWGE", all other samples are consistently grouped with the marine locations. The communities with proximate latitudes, including Eastern Equa, Atlan_Amazon and SWGE, are clustered first, which is consistent with our understanding that these communities should have greater similarity of gene expression profiles. For FOMC, samples from SWGE and the Amazon River are both scattered into several parts of the clustering. VLMC-based measures reveal location relationships of these 88 global ocean metatranscriptomic samples.     Table S4 in Supplementary Section 2). Table 3 shows the triples distance between the reference tree and the derived clustering trees using different dissimilarity measures and background sequence models. Using the VLMC background sequence model, both d S 2 and ⁎ d 2 can recover the reference tree. The best results from both VLMC and FOMC background sequence models show clear separations between metagenomic and metatranscriptomic groups, as shown in Fig. 4b Table S4 in section 2) samples were collected from different locations with two research cruises in the Equatorial North Atlantic Ocean and South Pacific Subtropical gyre. mainly bacteria and archaea. Previous studies 32 found clear phylogenetic stratification between the surface and the deeper regions of the microbial mat where iron-oxidizing bacteria dominated the community in the upper layers, and methanothrophs contributed to the majority of sequences in the deeper layers. Therefore, in this experiment, we used our methods to study 14 metatranscriptomic samples 32 to evaluate gradient relationships at different depths of the microbial mat. As shown in Figure S2 in Supplementary Section 3, the sampling site is a slow-flowing stream where two collection sites (S1, S2) are placed at 1 cm in depth (surface water), and three collection sites (D1, D2.D3) are placed in deeper regions of 7-9 cm. Three samples were collected at every collection site, except D3, where only two samples were harvested. The descriptions and dataset IDs of these samples can be found in Table S5 in Supplementary Section 3. Figure 5a shows the reference tree of the 14 microbial mat samples. Samples from S1 and S2 are marked in red, and samples from D1, D2 and D3 are marked in black. Samples at three different locations were respectively represented as squares, triangles and circles. Table 4 shows the triples distance between the reference and the clustering trees. The best clustering tree was achieved by VLMC with d S 2 when k = 8 as shown in Fig. 5b, and the smallest triples distance is 76. Samples from surface water (S1, S2) and from deeper regions (D1, D2 and D3) are clearly separated. In contrast, the best result   based on the FOMC background sequence model showed that surface samples were merged with deeper samples successively, as shown in Fig. 5c. The two-dimension Principal Component Analysis (PCA) plots based on the optimal results from FOMC and VLMC when k = 8 are shown in Fig. 6a,b, respectively. The PCA plot based on VLMC reflects the gradient information for collection depths and sites as the first and second principal component. We also plotted the PCA figures for k = 7 and 9, shown in Figure S3 in Supplementary Section 4. Although k = 7 and 9 are not the optimal value for VLMC, they still can separate the different depths and collecting sites. In comparison, the PCA ordinates based on FOMC did not show clear separations, and some points are shown as outliers.

Discussion and Conclusions
In this study, we developed theoretical and computational approaches to model background sequences using VLMCs based on short reads from high throughput sequencing. We compared the performances of VLMC and FOMC with ⁎ d 2 and d S 2 , as well as d 2 and three L p -norm measures, to model the background sequence with one simulated dataset, three real metatranscriptomic datasets, and one real RNA-seq dataset. VLMC outperformed FOMC in all experiments; and d S 2 together with VLMC, as background sequence model, outperformed FOMC in all experiments. Experiments show that VLMC builds the model with adaptive and variable MC according to the   metatranscriptomic data, exempting from manual selection of a fix MC order. Compared with FOMC, VLMC are more structural rich and easy to use. Based on the experimental results, we show that ⁎ d 2 and d S 2 dissimilarity measures combined with VLMC background model can identify the underlying relationships among samples from different microbial communities. They can also reveal the gradient relationship among the samples. Therefore, such dissimilarity measures should be adopted in comparative transcriptomic and metatranscriptomic studies.
In this study, we only applied VLMC to RNA-Seq or metatranscriptomic datasets. We also attempted to apply VLMC to metagenomic datasets, but here, VLMC does not achieve obvious improvements compared with the results of FOMC. For instance, we applied VLMC to analyze a real mammalian gut metagenomic dataset 33 . It includes 21 samples from mammalian species of herbivores and 7 samples from species of carnivores. As shown in Figure S4 in Supplementary section 5, results indicate that VLMC is less effective than FOMC in distinguishing between the two mammalian sample types. This could be attributed to the inclusion of both expressed and non-expressed regions in the whole genomes, making them heterogeneous. One model cannot fit the data well resulting in a simple independent identically distributed yielding the most meaningful results in most cases. Since the transcriptome only includes expressed regions, they will most likely be homogeneous, and a Markov model may fit better. Thus, while VLMC can improve performance for metatranscriptomic datasets, it does not show obvious improved performance for metagenomic datasets.
Alignment-free method avoids the complications of alignment-based approach, and is able to process the microbial community with a large amount of dark matters. However, it does not provide detail insights of microbial communities and further biological interpretation. To answer such questions, alignment-based methods are still needed.

Methods
Processing flow chart. The processing procedure consists of three main steps: (1) calculating k-tuple frequency; (2) calculating the probability of each tuple based on VLMC and applying various dissimilarity measures to k-tuple frequencies; and (3) evaluating different dissimilarity measures and models for background sequences. We used UPGMA 34 for hierarchical clustering based on dissimilarity matrix and applied the triples distance 26 to evaluate consistency between the reference tree and the clustering tree. We extended the VLMC algorithm to make it suitable for high-throughput sequencing data and then applied VLMC to model the underlying background genomes in d S 2 and ⁎ d 2 dissimilarity measures. Figure 7 shows the flow chart, and the details of these steps are given below.
Calculating k-tuple frequency. Alignment-free methods use k-tuple frequencies as sequence signatures to represent each metatranscriptomic datum. In our study, k-tuple frequencies from k = 1 to a maximum k value are calculated with our developed pipeline, taking complementary strands into consideration. The maximum k value Dissimilarity measures based on k-tuple frequency. The dissimilarity between two samples is calculated based on the frequency vectors using various measures, including measures with background model normalization such as d S 2 and ⁎ d 2 with VLMC/FOMC background sequence models, and measures without background model normalization such as d 2 , Ma, Ch and Eu in our study. The calculation of d S 2 and ⁎ d 2 is described briefly as follows 21 : ,4 k represent the k-tuple frequency vectors of sequencing data X and Y, be the sum of the counts of all k-tuples. The d S 2 and ⁎ d 2 dissimilarity measures are defined in equations (1) and (2) , . The ranges of d S 2 and ⁎ d 2 are between 0 and 1.
where P X,i and P Y,i are the probability of the i th k-tuple based on X and Y, respectively. The probabilities are calculated based on a specific probabilistic model. For example, consider a 5-tuple "GCTAC". Then P(GCTAC) can be calculated as: We used UPGMA for hierarchical clustering based on dissimilarity matrix and applied the triples distance to evaluate the consistency between the reference and the clustering trees.
Scientific RepoRts | 6:37243 | DOI: 10.1038/srep37243 ) P( ) P( ) P( ) P( ) P( ) (6) In previous studies 1 , FOMC was used to compute transition probability with fixed order r. For example, when r = 2, ) P( ) P( ) P( ) P( ) P( ) (7) In application, the order of MC needs to be set manually. But for most microbial communities, there is no prior knowledge available for MC order. Furthermore, it is hard to model probabilities of different tuples using a single fixed order MC. Variable Length Markov Chains 24 (VLMC) model the background genomes selecting the MC order adaptively in a data-driven way. For example, the probability (3) might be represented as formula (8) after determining the order in VLMC: Thus VLMC is more structurally rich and the number of variables is flexible. VLMC was originally designed to model long sequences 24,35 and was represented as a context tree structure 25 . In our study, VLMC was extended to model the background genomes based on short reads from high throughput sequencing.
VLMC for modeling background genomes with high-throughput sequencing data. The VLMC for high-throughput sequencing data is implemented with the following three steps: (1) A full prefix tree is built based on 1, 2, …, 10-tuple frequency vectors, but the tree usually overfits the data. (2) The tree is subsequently pruned to remove redundant branches based on Kullback-Leibler divergence 36 , and the pruned tree is also called a context tree 25 . (3) Transition probabilities are calculated with respect to the MC orders from the context tree, and the probabilities of k-tuples are then computed accordingly. A specific example is given in Fig. 8. The three steps were inspired by the original VLMC method on a single genomic sequence proposed by Bühlmann P. and Wyner, A. J. in 1999 24 .
Step 1: Generating a prefix tree τ max based on tuple frequency. We first generate a tree τ max to store tuples in the frequency vector. The tree τ max is actually a prefix tree growing downwards, where each node in the tree represents a tuple. The l th level nodes represent tuples of length l. In our study, the maximum depth of the tree τ max is up to 10. The following logic determines the relationships connecting nodes. If a node represents the l-tuple ω ∈ {A,C,G,T} l , l = 1, 2, …, 9, then its offspring represents the (l + 1)-tuple word μω (μ is a character in front of ω, μ ∈ {A,C,G,T}. For a node representing ω, the transition probability is calculated as P X (X|ω) = C X,ωX /C X,ω and saved at the node. In practice, the construction of τ max based on 1, 2, … , k-tuple frequency vectors is fast.
In Fig. 8A, τ max is generated based on frequency vector C gg . Node N 2 (C) represents tuple C, and its offspring N 21 (GC) represents tuple GC. Additionally, each node is associated with the transition probability from corresponding tuple to X (X∈ {A,C,G,T}). Node N 2 (C) is associated with P(X|C), and node N 21 (GC) is associated with P(X|GC).
Step 2: Pruning the tree τ max . The next step involves pruning the tree τ max to remove redundant branches.
If the probability P(X|μω) for a terminal node μω is the same as its parent node's transition probability P(X|ω), meaning that the transition probability of μω can be replaced by that of ω, then the terminal node μω can be pruned from the branch. In our study, Kullback-Leibler divergence is a measure of the distance between two probability distributions P(X|μω) and P(X|ω). Accordingly, Kullback-Leibler divergence 36 is applied to compare P(X|μω) and P(X|ω), which is denoted as D KL (P(X|μω)||P(X|ω)). A value of D KL (P(X|μω)||P(X|ω)) less than a threshold value K indicates that no information is lost when P(X|ω) is used to approximate P(X|μω), thereby allowing μω to be pruned. D KL (P(X|μω)||P(X|ω)) is given by formula (9), and N(*) is the frequency.
Taking Fig. 8B as an example, the Kullback-Leibler divergence between N 21 (GC) and N 2 (C) is calculated to determine whether nodeN 21 (GC) should be pruned: KL X A C G T { , , , } Suppose that threshold K is set to 5, then node N 21 (GC) should be pruned if

KL
The pruning is implemented for each terminal node until no branches can be pruned. K is the threshold that determines the degree of pruning. A larger K means greater conditional latitude in branch pruning, in turn producing a smaller tree. Similar to the study of Mächler and Bühlmann 35 , the determination of K is implemented through the optimization of Akaike Information Criterion (AIC) 37 designed for high-throughput sequencing data. AIC measures the relative quality of statistical models for a given set of data. AIC is originally defined as = − n L AIC 2 2 ln( ) ( 13) where L is the maximum value of the likelihood function for a statistical model, indicating the goodness of fit of the model, and n is the number of parameters in the model, indicating the complexity of the model. Here we develop the AIC calculation algorithm for high-throughput sequencing data. Given high-throughput sequencing data with M reads of length β, M j j j 1 2 1 where S j is the j th read, S ji is the i th nucleotide of j th read, and S ji ∈ {A,C,G,T}. Then, AIC with pruning threshold K is defined as: where card (τ ĉK ) denotes the number of nodes in the context tree τ ĉK , and L ln ( ) K R is the log-pseudo-likelihood under a fitted VLMC model with threshold K. The superscript R denotes the short read data. The log-pseudo-likelihood of the sequencing data is where P(S j(i+ 1) |S j1 … S ji ) is the estimated transition probability from the high-throughput sequencing data. The optimal K is determined by minimizing the formula AIC R (K). The two steps of tree building and pruning for high throughput sequencing data is extended from the original algorithm for long sequences from Bühlmann et al. 24 . The pruning step starts from the terminal nodes and the procedure is repeated until no more pruning is possible. The algorithm is greedy, so it is possible that the final pruned context tree is not the global optimal one. The R-package for long sequences 35 developed in 2012 follows the same greedy algorithm.
Step 3: Calculating probabilities of tuples based on the context tree. The corresponding probabilities of tuples are calculated based on the context tree. The number of independent parameters is Num(nodes) × 3, where the Num(nodes) is the total number of nodes in the context tree except the root-node. Taking the context tree in Fig. 8C as an example, Node N 21 (GC) was pruned away; therefore, in tuple GCX, G has no effect on the transition probability from GC to state X. Thus, P(X|GC) can be replaced by P(X|C) and stored in node N 2 (C) of the context tree in Fig. 8C.
The tuples in the node of context tree can be of variable length, allowing the VLMC model to estimate the transition probability. The corresponding probabilities of tuples used in d S 2 and ⁎ d 2 are then computed based on the transition probabilities. For example in Fig. 8C, the probability of 5-tuple word "GCTAC", In the real data from marine metatranscriptome, there are ~10 3 nodes for the pruned context tree with 8 levels and ~10 2 nodes for the tree with 7 levels, which means that the number of parameters reduced from 4 8 × 3∼ 2 × 10 5 to ∼ 10 3-4 for r = 8; and from 4 7 × 3∼ 5 × 10 4 to ∼ 10 2-3 for r = 7, at least 10-fold decrease in the number of parameters.
Using a heuristic approach to search for optimal K. The value of K is determined by minimizing AIC R (K). However, no simple analytical formula exists between K and AIC R (K), making it a challenge to find the optimal K for all sequencing data. To solve this problem, we developed the following heuristic approach to determine the value of K. In our study, one branch is pruned when its Kullback-Leibler divergence is less than the threshold K. Therefore, K is meaningful only when it is within the value range of Kullback-Leibler divergence. In Experiment on 22 Marine Microbial Eukaryotes, the probability density distribution of the Kullback-Leibler divergence is shown as Fig. 9. The values of Kullback-Leibler divergence in most tuples are between 100 and 500. Optimal results are generally obtained with K setting around the peak points and the right two inflexions (point A, B and C in Fig. 9). Hence, we only implement local search around these three points for the optimal K that minimizes loss functionAIC R (K). Sample clustering with UPGMA 34 (Unweighted Pair Group Method with Arithmetic Mean) is a hierarchical clustering method initially designed for classification problems. UPGMA is now widely used for hierarchical clustering in bioinformatics based on dissimilarity matrices. The nearest two clusters are combined into a higher-level cluster. The distance between two clusters A and B is defined as the average of all distances between pairs of samples x in A and y in B. The calculation is presented in equation (14), where d (x,y) refers to the dissimilarity between sample x and sample y. This is repeated for each step. UPGMA is implemented with the function 'upgma' from the 'phangorn' toolbox of R.
x Ay B The selection of proper evaluation metrics: Based on the dissimilarity matrix from different background models, the hierarchical clustering trees are produced. The consistency between the reference and the clustering trees offers the metrics to evaluate the performance of the various background models. There are several metrics to measure the difference of topological structures between two trees. Parsimony score 38,39 is the most common one to compare the topological structures of two trees. The parsimony score for a tree is the sum of the smallest number of substitutions needed comparing with the reference tree, which was implemented with toolbox Mothur in our study. When one tree is binary and one tree is not binary, the parsimony score is not suitable for comparison of the trees. Symmetric difference 40 was originally defined to compare two node sets. It has been used as a criterion to evaluate the consistency between two trees 38 i.e., the set of nodes present in one tree, but not in the other tree, where | * | is the number of elements, and A and B are the complements of set A and set B, respectively. Compared with the parsimony score, symmetric difference does not use branch length information, only tree topologies. Moreover, symmetric difference has taken the order of hierarchical clustering into consideration, making the comparison more sensitive. Symmetric difference is calculated with Treedist from Phylip. The triples distance 26 , another tree comparison metric to measure the distance between binary 26 or non-binary trees 41 , is also used. In our study, some reference trees are rooted non-binary trees. The measures are based on the topologies of the input trees induce on triplets; that is, on three-element subsets of the set of species. Triplet based distances provide a robust and fine-grained measure of the similarities between trees 41 , which was developed as toolbox TreeCmp 42 . The above three metrics have different characteristics and application scopes of their own. In Supplementary Section 7, we constructed example trees and measure their distances with the three metrics. Table S6 shows the three metrics for experiment 5, and the three metrics reflect general consistent tendency of tree distance. These two experiments show that the triples distance is most suitable and has high accuracy to evaluate the consistence of topologies of two trees. Principal component analysis (PCA) 43 is an important tool to analyze a multivariate data table in which observations are described by several inter-correlated quantitative dependent variables. Its goal is to extract the important information from the table and to represent the information as a set of new orthogonal variables called principal components. In R 'ape' toolbox, the functions princomp and prcomp can be used for principal component analysis.