A multi-sample approach increases the accuracy of transcript assembly

Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph-based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves a significantly better sensitivity-precision tradeoff, with precision up to 2-3 fold higher than that of the StringTie system and of Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.

All of the above methods generate a set of partial transcripts (transfrags) for a given RNA-seq sample. Since most experiments involve multiple RNA-seq samples, transfrags from all samples are further 'merged' into more complete exon-intron structures to determine a consensus set of transcripts, or meta-annotations. Existing meta-assemblers include Cuffmerge, included with the Cufflinks package, StringTie-merge from the StringTie package, and more recently TACO 15. TACO builds a 'path' graph from the input transcripts, in which each vertex is a 'path', that is, a sequence of consecutive splice junctions (a partial transcript), and two vertices are connected if they have compatible junction patterns; it then iteratively selects the most abundant paths (isoforms). Despite the importance of meta-annotations for subsequent quantification and differential gene expression analyses, there has been relatively little effort in designing mathematically rigorous meta-assemblers. Lastly, only a small number of studies have focused on simultaneous multi-sample transcript assembly. These include CLIIQ 16, an early prototype algorithm that uses an integer linear programming (ILP) approach whose variables are the full set of isoforms; MiTie 17, which builds a splicing graph representing the gene and maximizes a likelihood function using mixed integer programming with a regularization penalty; and ISP 18, which solves an LP or ILP problem iteratively on a weighted connectivity graph derived from the input samples. While marking significant conceptual advances, they scale poorly (MiTie) or otherwise have limited performance in detecting splicing variation (ISP), as demonstrated in ref. 19.

Supplementary Figures

Supplementary Figure 1. Overview of the PsiCLASS algorithm.
Step 1. Build sample-level subexon graphs from aligned reads and spliced reads. PsiCLASS builds a subexon graph for each sample by clustering overlapping read alignments into regions, dividing regions into subexons at splice junctions (inferred from spliced reads), and adding an edge between subexons that are adjacent within the same region or connected by an intron.
Step 2. Build and refine a global subexon graph by merging the sample-level subexon graphs and applying intron and subexon filters that evaluate information simultaneously across all samples.
Step 3. Enumerate or select a set of candidate transcripts using dynamic programming across all samples.
Step 4. Select a subset of transcripts in each sample, using a greedy strategy that iteratively selects an optimal transcript (with global subexon graph-based dynamic programming).
Step 5. Select a unified set of meta-annotations from among the sample-level transcripts, with voting.

Supplementary Figure 2. Performance evaluation of PsiCLASS and existing reference methods at the level of meta-annotations on simulated data. Methods tested include combinations of three single-sample assemblers (Cufflinks, StringTie and Scallop) and two meta-assemblers (TACO and StringTie(ST)-merge), and two multi-sample integrated methods (ISP and FlipFlop), where TACO and ST-merge were used to aggregate the outputs from individual samples into a unified set of meta-annotations. Below, the shape of the point represents the single-sample assembly tool used, and the color represents the aggregation method. For PsiCLASS, the red curve shows the variation in performance as the weighted voting cutoff varies among 0, 1, 2, 4, 8, 16 (right to left). PsiCLASS produces the highest precision and its sensitivity is comparable with the best of the other methods.

Comparison of transcript assembly methods at the sample-level and for different alignment tools. Each point represents the performance of the stated method on one of the 25 simulated samples. The shape of the mark represents the transcript assembly method (StringTie, Scallop and PsiCLASS), and the color indicates the RNA-seq alignment tool (Hisat2 and STAR). All methods perform similarly with the two alignment methods, with Hisat2 leading to a slight increase in performance. When assembly methods are compared, PsiCLASS (with Hisat2) using a global subexon graph leads to improved accuracy at sample level, with the highest per sample average recall, 28% higher than StringTie and 16% higher than Scallop, and precision comparable to StringTie and Scallop.

Performance of methods on simulated data based on gene splicing complexity. Genes were divided into low (1 transcript/gene; 680 genes, 680 transcripts), medium (2 transcripts/gene; 166 genes, 332 transcripts) and high (3 or more transcripts/gene; 106 genes, 431 transcripts) complexity and methods were evaluated for each group. PsiCLASS is the best performer on the high and medium complexity genes (two leftmost groups), whereas StringTie has the best overall performance on the low complexity group, followed closely by PsiCLASS. Source data are provided as a Source Data file.

Three different methods were tested for their ability to generate meta-annotations from single-, two- and multi-condition experiments, using GTEx RNA-seq data from six tissues (cortex, frontal cortex, cerebellum, heart, liver and lung): i) PsiCLASS with voting on all samples in a single and/or multi-tissue collection; ii) PsiCLASS with TACO on all samples in a single and/or multi-tissue collection; and iii) PsiCLASS on all samples of a given tissue, followed by meta-assembly between tissue collections with TACO. Source data are provided as a Source Data file.

Supplementary Figure 8. Effect of feature selection: performance at intron and internal exon level for the three single-sample assemblers with the voting aggregation method. PsiCLASS captures 10-40% more introns, and 10-50% more internal exons. Source data are provided as a Source Data file.
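
To make Step 1 of the algorithm overview (Supplementary Figure 1) more concrete, the following minimal Python sketch illustrates one way a sample-level subexon graph could be built: aligned blocks are clustered into regions, regions are cut at intron boundaries, and edges are added between abutting subexons and across introns. It is an illustration under stated assumptions, not the PsiCLASS implementation. The function names (`cluster_regions`, `split_into_subexons`, `build_edges`), the closed 1-based coordinates, the choice to merge directly abutting blocks into one region, and the toy input are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval [start, end]; coordinate convention is an assumption


def cluster_regions(blocks: List[Interval]) -> List[Interval]:
    """Merge overlapping (or directly abutting) aligned blocks into regions."""
    regions: List[Interval] = []
    for start, end in sorted(blocks):
        if regions and start <= regions[-1][1] + 1:
            # Extend the current region to cover this block.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def split_into_subexons(regions: List[Interval], introns: List[Interval]) -> List[Interval]:
    """Cut each region wherever an intron starts or ends inside it.

    An intron (s, e) covers positions s..e, so the upstream exon ends at
    s - 1 and the downstream exon starts at e + 1.
    """
    subexons: List[Interval] = []
    for r_start, r_end in regions:
        cuts = {r_start}  # positions at which a new subexon begins
        for i_start, i_end in introns:
            if r_start < i_start <= r_end:
                cuts.add(i_start)
            if r_start < i_end + 1 <= r_end:
                cuts.add(i_end + 1)
        ordered = sorted(cuts)
        for k, s in enumerate(ordered):
            e = ordered[k + 1] - 1 if k + 1 < len(ordered) else r_end
            subexons.append((s, e))
    return subexons


def build_edges(subexons: List[Interval], introns: List[Interval]) -> List[Tuple[int, int]]:
    """Connect abutting subexons and subexons joined by an intron."""
    by_end = {e: i for i, (_, e) in enumerate(subexons)}
    by_start = {s: i for i, (s, _) in enumerate(subexons)}
    edges = set()
    # Abutting subexons always lie in the same region, because
    # cluster_regions merges blocks that touch.
    for i in range(len(subexons) - 1):
        if subexons[i][1] + 1 == subexons[i + 1][0]:
            edges.add((i, i + 1))
    for i_start, i_end in introns:
        u = by_end.get(i_start - 1)
        v = by_start.get(i_end + 1)
        if u is not None and v is not None:
            edges.add((u, v))
    return sorted(edges)


if __name__ == "__main__":
    # Toy input: three aligned blocks and two introns sharing an acceptor.
    blocks = [(100, 200), (150, 250), (400, 500)]
    introns = [(251, 399), (181, 399)]
    regions = cluster_regions(blocks)
    subexons = split_into_subexons(regions, introns)
    print(regions)
    print(subexons)
    print(build_edges(subexons, introns))
```

In this toy case the alternative donor at position 180 splits the first region into two subexons, so both splice variants are represented as distinct paths through the graph. A real assembler would additionally track read support for each subexon and intron, which this sketch omits.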