A Bayesian approach for accurate de novo transcriptome assembly

De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.

. Example of contig concatenation by paired-end reads support For alternative extensions occurred during the contig extension, we will remove erroneous extensions by a read guided strategy. Due to the complexity of the k-mers, some alternative extensions may be only supported by part of reads or mixed sequences from multiple reads.
Therefore, we only keep the extensions that are supported by full-length read, which will effectively filter out false positive paths in the graph. Figure S2 shows an example of how the read guided strategy works for removing erroneous branches. This example shows a 5-mer de Brujin graph where the red branch is constructed by the last three reads that happen to have the 5-mers to extend. In our graph construction step, the red branch will be removed since it is not supported by a full length read. Figure S2. Example of read-guide branch correction strategy S1.2 Bayesian framework to identify transcript structure from splicing graph In BayesDenovo, we employ a Bayesian framework published in [3] to reliably identify transcript structures from splicing graph. The framework utilizes a Bayesian model to simulate the generation of sequencing reads. First, a candidate set of isoforms will be enumerated from the TTGCC CAA GCA GCA Read1: GCATTGCCGCA Read2: CCCCAATGCCC Read3: TTTATTGCCCA Read4: GCGATCCCCAA splicing graph. The existence of each isoform will be modeled as a binary variable z. If one isoform is actually expressed in the data, the associated variable z for this isoform will be 1. Otherwise, z will be 0 representing the isoform is enumerated by mistake. Due to the complexity of splicing graph, the size of candidate isoform set will be huge. Therefore, we can expect the vector z of all candidate isoforms will be sparse. The sparsity will be modeled by a Bernoulli distribution with a small probability of success.
where ≪ 1 introduces the sparsity in variable z. Suppose the total expression of isoforms equals to 1, then we can model the relative expression of isoform by Dirichlet distribution given the existence status of isoforms: , where x is the relatively expression for all isoforms, is the total number of expressed isoforms and is a hyper parameter. For a given transcript, the probability of a paired reads that can be generated from this transcript can be modeled by a generative model with three parts. Suppose we would like to generate a fragment or paired-end reads = ( 1 , 2 ) from transcript t. We need to first determine the three prime position s of the fragment, which can be sampled from The full joint model of the Bayesian framework will be estimated by a Gibbs sampling approach.
The two key parameters z and x will be iteratively sampled. The confidence of existence for an isoform will be estimated as the frequency of equals to 1. The isoforms with confidence of existence larger than 0.5 will be classified as existed isoforms in the final identified isoform sets by default.

S1.3 Assemblers evaluation by rnaQUAST
Beside the metrics used in Figure 1 in the main text, we further comprehensively evaluate the assemblers using another evaluation package called rnaQUAST [4] on another simulation data generated by the Flux simulator. Additionally, we add rnaSPAdes [5] and Trans-ABySS [6] in our comparison study. The results are shown in Table S1.
Supplemental Table S1. Performance evaluated by rnaQUAST Based on average alignment length, we can see that our BayesDenovo method assembles longer transcripts than existing methods. Similar to the results shown in Figure 2 in the main text, Bayesdenovo detects full length transcripts (95%-assembled isoforms) more accurately. It has assembled 5722 full length transcripts isoforms from only 15889 total predictions.

S1.4 Computational time evaluation
The computation time of single k-mer assemblers is evaluated on a high-performance cluster. For assemblers with multiple threads support, we use 10 threads for this evaluation. Figure S3 shows the computational time on a simulation dataset with 80 million reads. It can be seen that BayesDenovo is comparable to other assemblers even using a sampling based approach.