Complementarity of assembly-first and mapping-first approaches for alternative splicing annotation and differential analysis from RNAseq data

Genome-wide analyses estimate that more than 90% of multi exonic human genes produce at least two transcripts through alternative splicing (AS). Various bioinformatics methods are available to analyze AS from RNAseq data. Most methods start by mapping the reads to an annotated reference genome, but some start by a de novo assembly of the reads. In this paper, we present a systematic comparison of a mapping-first approach (FaRLine) and an assembly-first approach (KisSplice). We applied these methods to two independent RNAseq datasets and found that the predictions of the two pipelines overlapped (70% of exon skipping events were common), but with noticeable differences. The assembly-first approach allowed to find more novel variants, including novel unannotated exons and splice sites. It also predicted AS in recently duplicated genes. The mapping-first approach allowed to find more lowly expressed splicing variants, and splice variants overlapping repeats. This work demonstrates that annotating AS with a single approach leads to missing out a large number of candidates, many of which are differentially regulated across conditions and can be validated experimentally. We therefore advocate for the combined use of both mapping-first and assembly-first approaches for the annotation and differential analysis of AS from RNAseq datasets.

repeats may be trapped in these subgraphs. We showed in 4 that an effective strategy to deal with this issue is to enumerate only bubbles which have at most b branches. In practice, we set b to 5. Increasing b will increase the running time, but allow to find more repeat-associated alternative splicing events. Bubbles which do not correspond to true AS events can be filtered out at the mapping step. MISO 5 was run in "exon-centric" mode with default parameter. We first generated from the EnsEMBL r75 gff file the alternative event annotation file requested by MISO using rnaseqlib. The mapping step was done exactly the same as for FARLINE with Tophat-2.0.11 6 , except that the replicates of each condition were merge together because MISO does not accept biological replicates. We then run all MISO scripts with default parameters. Finally, we filtered the differentially changing events with the filter_events script using the following parameters :

Cufflinks
Cufflinks 6 was run on the same alignment files used in FARLINE using annotation as a guide with the following parameters : -g <EnsEMBL r75 gff file> -b <hg19 genome> -u -p 16.
When an annotation is given as a guide to Cufflinks, some faux-reads are introduced to support all transcripts present in the annotation. Because it can annotate transcripts even if there are not expressed in the samples, for the rest of the analysis, we decide to consider only the reconstructed transcripts supported by real reads.
Then, the AS events were retrieved from the reconstructed transcripts using the FARLINE annotation script.

Trinity
Trinity 7 was run with the following parameters : --max_memory 110G --CPU 16 --min_kmer_cov 2 --seqType fq --SS_lib_type RF. In order to retrieve the bubbles from Trinity's output file, we parsed the transcripts' headers by firstly partitioning the reconstructed transcripts into disjoint sets, where each set is a predicted gene. Then, for each such set, the bubbles were found by processing the nodes' identifiers used to build each isoform.   Figure S2. A) Main categories identified explaining why some exons are detected by only one method. Numbers for MCF-7 dataset. B) The exon in intron 3 of the DIP2A gene is an example of an exon not annotated in EnsEMBL r75. This event was identified by KISSPLICE but not by FARLINE. C) PDPK1 and PDPK2 are 2 paralog genes. KISSPLICE detected 2 isoforms that could be produced by these 2 genes. FARLINE did not detect any event in either of these genes. The exon skipped is exon 15 in PDPK1 (corresponding to exon 7 in PDPK2). C) Exon 7 of the SUGT1 gene is an example of exon skipping overlapping an Alu element identified only by FARLINE. The events in panel A to C were validated by RT-PCR. E) The KPNA1 gene contains a complex event with more than 5 branches inside the bubble. This event was detected by FARLINEbut not by . The total number of annotated splicing events predicted by at least one method, with the minor isoform being supported by at least 5 reads is 10546. The largest overlaps are 2415 (all methods), 1647 (all methods but Trinity), 874 (KisSplice-Trinity), 662 (FarLine-MISO-Cufflinks). As expected, Trinity is the least sensitive method. We also observe that the three mapping-first approaches (FarLine, MISO and Cufflinks) have a very large number of common candidates, 662 of which are not found by the two assembly-first approaches (KisSplice and Trinity). Conversely, the two assembly-first approaches have a very large number of common candidates, 874 of which are not found by the three mapping-first approaches. Similar numbers are found for the MCF-7 dataset. These results support the main conclusion of this paper.    Figure S9. RT-PCR validations of events found by both approaches in the MCF-7 dataset. 41 out of the 48 events were validated (both the inclusion and the exclusion variant were amplified by RT-PCR). In some cases, there were additional PCR products (marked as '*') suggesting the existence of additional variants.

13/20
Relative frequency of minor isoform   Figure S12. Dealing with repeats in KISSPLICEand KISSPLICE2REFGENOME. A) Example of a bubble containing an Alu. Repeated events such as Alu are expected to be present in several copies in the reads. Thus, when the graph is constructed, edges link different copies of Alu. Because a bubble with more than 5 edges within one of its paths is not enumerated by KISSPLICE, this case is not annotated by the assembly-first approach. B) In KISSPLICE2REFGENOME, if the two variants (i.e. paths) both map on different copies (exact repeat), we classify it as a recent paralog. On the contrary if each variant maps on a different locus, we consider the event as coming from an inexact repeat. This category represents mostly paralogs that have diverged.    Figure S15. FasterDB exons are defined as the projection of the longer or most frequent exon in the transcripts (except for alternative first or last exons). The whole analysis done with FARLINE is based on these exons.

17/20
p-value! with exonic reads! p-value! without exonic reads! Figure S16. Correlation of the p-values when exonic reads were taken or not into account in the quantification. Red dots and blue dots correspond to ASE predicted to be regulated (p-value<0.05) when using junction reads and when using junction and exonic reads respectively.  Figure S17. Examples of AS events predicted as differentially spliced between the two conditions in the MCF-7 dataset using junction and exonic reads, but not using only junction reads. A) Exon 6 of STARD4 is detected as an alternatively skipped exon, but it also overlaps with an alternative last exon. B-C) Exon in intron 3 of PANK2 gene and exon in intron 18 of PRRC2B gene are new exons found by KISSPLICE. These exons are located in poorly spliced introns. Individual PSI computation: Ψ cond1_rep1 = 10 / 10+5 = 66% Ψ cond1_rep2 = 20 / 20+7 = 74% Ψ cond2_rep1 = 0 / (0+50) = 0% Ψ cond2_rep2 = 2 / (2+45) = 4%

18/20
Mean PSI per condition: Ψ cond1 = 60%, Ψ cond2 = 2% Final result: P-value = 10 -3 ∆Ψ = Ψ cond1 -Ψ cond2 = 58% kissDE Figure S18. Input and output of the differential analysis. Counts for each replicate of each condition were computed by FARLINE or KISSPLICE. These counts together with the experimental plan are the input of KISSDE. In this example, we show counts for one single event, in practice KISSDE tests all events discovered by one method to spot the differential splicing events. Provided that at least two replicates are available per condition, KISSDE computes p-values and DeltaPSI (∆Ψ) per event, and results are ranked using these two metrics.