Bayesian nonparametric discovery of isoforms and individual specific quantification

Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop biisq, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. biisq does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. biisq shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios.


Introduction
Alternative splicing is the process by which a single protein-coding gene produces distinct mRNA transcripts, which vary in usage of component exons 1 .Isoforms can differ by alternative transcription initiation sites, alternative usage of splice sites (either 5 donor or 3 acceptor sites), alternative polyadenylation sites, or variable inclusion of entire exons or introns (Figure 1).Altogether, alternative splicing enables the large diversity of mRNA expression levels or proteome composition observed in eukaryotic cells, which is particularly important for regulating the context-specific needs of the cell 2 .
It is estimated that 95% of human protein-coding genes produce alternatively spliced mRNAs 1 .
These splicing decisions are important drivers of many biological processes, with considerable variation in splicing patterns across human tissues 3 .For example, mutations in splicing regulatory elements leading to disease pathogenesis and progression 1,[4][5][6][7][8] and mutations in protein domains of specific splicing factors occur at a high rate in tumor cells, resulting in increased cellular proliferation 9 .Furthermore, proteins resulting from splicing variants often have distinct molecular functions-for instance, the two variants of survivin have opposite functions: one with pro-apoptotic and the other with anti-apoptotic properties 10 .
Though there is increasing evidence of the biological importance of splicing processes, the precise role of alternative isoforms in regulating complex phenotypes is still largely uncharacterized.This gap in understanding is due, in part, to the difficulty of identifying and quantifying isoforms with high accuracy from RNA-seq data 11 .Transcript reconstruction is essential to elucidate the role of gene expression in biological processes, because gene level quantification is convoluted by the multiple transcribed isoforms for each gene.The difficulties in isoform quantification stem from the tissue-and sample-specific composition and expression patterns of isoforms, the lack of a complete reference for isoform composition, and low abundance levels of many isoforms 2 .Further, RNA-seq reads that overlap informative splice junctions are rare, often noisy 12 , and difficult to map to a reference genome 13 .Improvements in reconstructing and quantifying tissue-and sample-specific isoforms would enable substantial improvements in understanding the role of alternative splicing in complex disease.While many tools exist for isoform reconstruction using RNA-seq data, these methods have a number of drawbacks.First, many quantification methods assume that a high-resolution isoform sequence reference is available for each gene in the genome [14][15][16] , which in practice is often not available or incomplete for non-model organisms and rare tissue or disease samples 11 .Second, most methods consider a single sample in isolation, which fails to exploit the sharing of isoforms across samples to gain power for identification of rare or low abundance isoforms [17][18][19] .Third, many methods make technology dependent assumptions by controlling for specific biases (e.g., non-uniform sampling of reads 20 ) that do not generalize to mixtures of existing technologies or new technologies with different biases.
Our method, Bayesian isoform discovery and individual specific quantification (BIISQ), addresses these limitations.First, BIISQ uses annotations of transcribed regions as prior information 21,22 , but the number and composition of isoforms across samples are estimated directly from the data, and the number of isoforms may grow with additional observations.Second, BIISQ explicitly captures isoforms shared across samples using a Bayesian hierarchical admixture model, which models multiple samples jointly and borrows statistical strength across samples to identify shared isoforms that may be in low abundance across samples.Third, we assume that each nucleotide base in an isoform has an independent frequency in the mapped reads allowing BIISQ to account for read mapping biases in RNA-seq data 23 .
We develop a computationally tractable stochastic variational inference (SVI) algorithm to fit this model to RNA-seq data to estimate the structure of isoforms, probabilistically assign reads to isoforms, and compute sample-specific and global isoform proportions 24 .We compare and validate BIISQ results on simulated data from the Benchmarker for Evaluating the Effectiveness of RNA-Seq (BEERS) software 16 and from Pacific Biosciences (PacBio) Iso-seq data, which produces approximately 1 − 10 Kb sequence reads potentially capturing full-length isoforms 25 .
Finally, we apply BIISQ to a large RNA-seq data set from lymphoblastoid cell lines (LCLs) 26 to identify the catalog of isoforms across samples.We use our catalog of sample-specific inferred isoforms and genotype data to identify genetic variants associated with isoform ratios and assess the functional significance of alternatively spliced genes and associated splicing variants.

Results
The goal of isoform reconstruction and quantification is to robustly estimate both absolute and relative mRNA isoform expression levels, for both lowly and highly expressed transcripts, for each sample in a large RNA-seq data set from multiple sequenced samples.Our method, BIISQ, approaches this problem by postulating a model of isoform composition and relative isoform expression shared across samples.Specifically, BIISQ implements a Bayesian non-parametric hierarchical model of RNA-seq reads and isoforms inspired by the hierarchical Dirichlet process 27 , and we use stochastic variational inference (SVI) methods for computationally tractable and robust posterior inference 24 .
The BIISQ model probabilistically maps each RNA-seq read into a distribution over isoform exon compositions.Each sample is associated with its own distribution over isoforms, drawn from a global distribution over an arbitrarily large catalog of isoforms.The exon composition of each isoform is modeled with a structured prior over exon usage that constrains the space of possible isoforms to those with support in the observed RNA-seq reads.The model allows for extending the global isoform catalog by constructing novel isoforms given observed RNA-seq reads that could not have been generated from the current isoform catalog.Importantly, in our model, exon usage, RNA-seq read assignments, and the sample-specific and global isoform proportions are interpretable model parameters that translate directly to isoform composition and population-and sample-level isoform quantification (for details, see Methods).
Variational methods enable computationally tractable posterior inference in Bayesian nonparametric models such as BIISQ when applied to large genomic data 28,29 .In brief, the posterior distribution of the BIISQ model is intractable to compute directly; instead, we hypothesize a set of tractable variational distributions over the latent variables.Then, we iteratively compute the values of the variational distribution parameters that minimize the distance between the variational and true posterior distribution with respect to the Kullback-Leibler divergence 30,31 .BIISQ implements stochastic variational inference (SVI), an extension of variational inference that uses random subsets of the samples to compute each approximate update of the variational parameters 24 .SVI was previously applied in eXpress to efficiently assign ambiguously mapped sequence reads for transcript abundance estimation 32 .
Related isoform quantification methods.Methods for jointly inferring and quantifying alternatively spliced transcripts can be broadly partitioned into categories based on the required level of reference annotation 33 .Transcriptome annotation dependent methods require complete annotation of the genome and transcriptome, including a description of isoform transcripts sequences and splice junctions [14][15][16] .In contrast, annotation-free methods require neither transcriptome nor genome annotations 17,34,35 .A third category of methods requires annotations of transcribed regions but is agnostic to isoform and splicing annotations [17][18][19] ; our method BIISQ is in this category.Methods may exhibit characteristics shared across multiple of the three previously mentioned categories.For example, Cufflinks has evaluation modes that can be annotation-free or guided by reference annotations 17 .
We compared results from BIISQ with three representative isoform reconstruction and quantification methods: Cufflinks 17 , CEM 19 , and SLIDE 18 .These methods were selected based on the following criteria: i) the ability to use annotations of transcribed genomic regions for isoform discovery and quantification, but no requirement for isoform transcripts or splice junction annotations; ii) coverage of combinatorial and statistical approaches; iii) support for both singleend and paired-end reads; and iv) high-quality performance in a recent benchmark study of isoform detection and quantification 36 .
Cufflinks uses a parsimonious approach to isoform discovery in order to find the minimal number of transcripts to explain the aligned reads 17 .After filtering erroneous spliced read alignments, aligned reads are to assigned vertices in an overlap graph, whose edges represent isoform compatibility between aligned reads.Transcript assembly then reduces to finding a minimum set of paths through the overlap graph such that each aligned read is part of a path.Transcript quantification uses a generative model for RNA-seq reads to compute a maximum a posteriori estimate of the isoform quantifications, extending an earlier unpaired model 37 ; confidence intervals are estimated using importance sampling.
CEM, an extension to the method IsoLasso 38 , constructs a connectivity graph to generate a set of candidate isoforms 19 .CEM and IsoLasso model the coverage of aligned reads at each location as a Poisson distribution and uses lasso regression to produce a set of inferred isoforms and abundance levels.The principle difference between CEM and IsoLasso is the optimization procedure: CEM uses expectation maximization (EM) while IsoLasso solves a quadratic program.
In our comparison, we preferred CEM because of the superior performance demonstrated on benchmark data 39 .
The sparse linear modeling for isoform discovery and abundance estimation (SLIDE) method implements a statistical approach based on the start and end positions of aligned reads 18 .SLIDE computes the number of aligned read start and end positions that group into transcribed regions of exons and organizes them into bins.Isoform proportions are quantified using a linear model of the observed bin proportions; a modified lasso penalty limits the number and composition of isoforms.In our methods comparison, we ran SLIDE using two distinct settings of the regularization parameter, λ = 0.01 and λ = 0.2, which encourages more and fewer discovered isoforms, respectively.We refer to results from SLIDE with these two parameter settings as SLIDE more and SLIDE fewer below.
Evaluation criteria.We evaluated precision and recall for each method in terms of exact and partial matches to simulated RNA-seq data 40 .Precision and recall were calculated based on exact full length isoform matches between simulated and estimated isoforms (Equation 4, Methods).
Partial precision and recall were calculated by defining imperfect matchings between each estimated transcript and the true transcripts (see Methods and Supplementary Fig. 1).We controlled for issues regarding exon identification by counting an exon as successfully inferred if any subsequence of the inferred isoform overlapped an exon in the gene annotation.Thus, reconstructing any subsequence of an exon was equivalent to reconstructing the whole exon correctly.
RNA-sequencing simulation: BEERS.The first evaluated our model on simulated data generated using the benchmarker for evaluating the effectiveness of RNA-Seq software (BEERS) 16 .
After removing genes with less than three exons, we divided RefSeq genes into three equally sized groups according to exon counts producing gene sets with 3 − 6 exons, 7 − 12 exons, and 13 − 312 exons.We then simulated 10,000 single-end reads from 100 samples and 35 genes randomly selected from each group (105 simulated genes in total) using simulator parameters to vary read length and the number of distinct isoforms.
To test the accuracy of each method, we applied the four isoform quantification methods to these simulated data and computed the precision and recall of the isoform discovery results-both perfect and partial matches (Figure 2).For perfect matchings, BIISQ displayed significantly higher precision and recall across the 105 genes (t-test, p ≤ 2.2 × 10 −16 ).For partial matchings, BIISQ showed significantly higher recall but lower precision than Cufflinks (t-test, p ≤ 2.2 × 10 −16 ).The overall trend remains when considering the results factored by the number of exons in the gene or the number of alternatively spliced transcripts.However, there are opportunities to improve BIISQ performance in precision and recall for highly spliced genes (Supplementary Fig. 2) and genes with a small number of exons (Supplementary Fig. 3).These results suggest that BIISQ makes marginal sacrifices in false discovery rate (FDR) to identify a higher proportion of true isoforms relative to Cufflinks.Next, to investigate the source of BIISQ's increased precision and recall, we evaluated the number of perfectly inferred isoform transcripts for the most lowly expressed transcript in each gene.Across all samples in the simulated data, BIISQ, CEM, Cufflinks, SLIDE more, and SLIDE fewer correctly inferred 3,011, 1,205, 996, 1,167, and 1,067, respectively, out of a total of 10,500 transcripts (i.e., 105 genes and 100 samples per gene) highlighting that much of the recall gains of BIISQ originated from isoform transcripts with low expression levels.
We assessed the quantification accuracy of each method by computing the correlation between true and inferred normalized read counts (reads per kilobase of exon model per million mapped reads, or RPKM).BIISQ inferred positive expression for 13,607 transcripts compared to 8,512, 8,428, 7,516, and 7,062 for SLIDE more, CEM, Cufflinks, and SLIDE fewer, respectively.We found a wide range of expression level estimates from the four methods (Figure 3A), which is typical of isoform quantification in human samples 40 .Overall, BIISQ showed the highest correlation of results across the BEERS data, followed by CEM and Cufflinks (BIISQ r = 0.539, CEM r = 0.52, Cufflinks r = 0.514).
We next investigated the characteristics of inferred transcripts with positive predicted expression by each method.All methods inferred a similar number of transcripts when the number of exons was small (at least three exons and at most six).However, BIISQ quantified over 2,300 additional transcripts compared to Cufflinks, CEM, and SLIDE for medium length transcripts (total number of exons ∈ [7,12]) and at least 1,948 additional transcripts for larger transcripts (total number of exons ∈ [13,312]; Supplementary Fig. 4).We also evaluated the number of transcripts quantified in terms of coverage, or the number of bases sampled from the transcript with simulated reads normalized by the transcript length.Consistent with our results on low frequency transcripts, BIISQ correctly infers almost 500% more transcripts across samples for isoforms with coverage < 5 within each sample (Supplementary Fig. 5).As coverage increases, the difference in the number of transcripts correctly inferred between BIISQ and competing methods diminishes, with SLIDE more inferring the largest number of transcripts across all samples for transcripts at coverage ≥ 400 within each sample.These results highlight the benefits of BIISQ's model based approach to isoform sharing across samples.Short read RNA-seq simulations from long-read RNA-seq data.The BEERS simulated RNAseq data models technology-specific biases of short read RNA-seq data but does not capture the exon composition of true transcript isoforms.The Pacific Biosciences Iso-Seq protocol is a single molecule transcriptome sequencing technology that offers read lengths of up to 10 Kb 25 .Iso-Seq reads may span entire RNA transcripts, making the characterization of isoform composition straightforward relative to inference from short-read data, where each read may map to one of many possible isoforms.Thus, long read sequencing provides a medium for experimentally driven evaluation of isoform reconstruction; however, the cost and platform specific error rates make this technology infeasible to replace short read RNA-seq in the near future, necessitating the development of methods such as BIISQ for short read RNA-seq data.Further, while reconstruction is aided by long reads, quantifying isoforms is challenging due to the high costs at the relatively low throughput realized by Iso-Seq, making precision and recall the principle metrics for evaluation of Iso-Seq data 41 .
We simulated short-read RNA-seq data from full length Iso-Seq reads, which allows us to precisely capture true isoform composition and proportions in simulated data.To do this, we constructed a reference set of genes from the Iso-Seq human transcriptome reference samples of heart and brain tissue.After mapping genes and transcripts across tissues, we identified seven genes with two or more isoforms in the heart and brain tissues (see Supplementary Methods).
Iso-Seq reads have a different error profile than other short-read technologies, so we evaluated both GMAP and STARlong's Iso-Seq read alignments to the human genome version hg19 42 (Supplementary Figs. 6, 7).To sample paired-end short reads from the Iso-Seq reads, the simulator copied the RNA sequence from the Iso-Seq read and the alignment from GMAP or STARlong.For seven transcripts (Supplementary Table 1), we simulated reads with lengths 50 bp, 100 bp, and 200 bp for 50 replicates of brain and heart tissue Iso-Seq samples.We define transcript span relative to the transcript sequence coverage of simulated reads; for example, a span of 0.5 indicates that short reads were simulated until the simulated reads mapped to half the length of the Iso-Seq reads for a specific transcript.
We first evaluated the accuracy of each method in terms of perfect and partial precision and recall.BIISQ achieved the highest precision and recall from both exact and partial matching thresholds across the seven Iso-Seq transcripts (Figure 4A).This strong performance improvement remains when partitioning the RNA-seq data by read length and span (Figure 4B, C).Importantly, the performance of BIISQ, Cufflinks, and SLIDE fewer does not deteriorate substantially in either precision or recall for paired-end short reads relative to the deterioration in performance from CEM and SLIDE more (Figures 2 and 4A).
We then compared isoform quantifications across the four methods in the paired-end short read data (Figure 3B).BIISQ computed positive quantifications for 5,595 transcripts while Cufflinks, SLIDE fewer, SLIDE more, and CEM inferred 3,248, 2,170, 1,296, and 1,025 transcripts respectively.Rankings of quantification results on paired-end data largely mirrored the BEERS simulations and, in total, BIISQ and Cufflinks showed the highest average correlation across the BEERS and Iso-Seq data (r = 0.645, r = 0.639).BIISQ and Cufflinks also showed the highest agreement between any pair of distinct methods (r = 0.567, Supplementary Fig. 8).We also found that BIISQ inferred more transcripts across all exon compositions (Supplementary Fig. 9) and all spans (Supplementary Fig. 10) in the Iso-Seq data.
In theory, paired-end reads are more informative for isoform reconstruction and quantification than single-end reads largely because reads sampled from the same fragment are transcript specific and may span additional non-contiguous junctions.However, isoform transcripts often share many splice junctions, making them difficult to deconvolve in short-read paired end data, whereas dissimilar transcripts are easier to differentiate using unique splice junctions.The Iso-Seq simulated data included transcripts with a lower average normalized distance between isoforms (0.471) compared to the BEERS simulated data (0.584), making these gains in short-read pairedend data difficult to achieve.This transcript similarity and complications in modeling the insert length of paired-end data may have contributed to the dramatic decrease in CEM's accuracy on the Iso-Seq simulated data (Figure 4).
We investigated the run time of each method as functions of the number of exons, gene length, read length, and span; BIISQ run times are averaged across 20 runs and do not include the one-time preprocessing for converting aligned reads to read-terms (Supplementary Figs.11-14).
CEM was the most efficient method tested, followed closely by Cufflinks, while BIISQ and SLIDE had the longest run times.However, isoform reconstruction can be parallelized at the level of reference transcript, so difficulties associated with running multiple iterations of BIISQ on large data may be reduced by having many compute nodes to process distinct genes in parallel.Gene and read lengths had marginal effects on run time (Supplementary Figs.GEUVADIS RNA-seq data for 462 samples.Next, we applied BIISQ to RNA-seq data for 462 lymphoblastoid cell lines (LCLs) from the GEUVADIS RNA sequencing project for 1000 Genomes samples 26 .We built a model of transcription for each gene from the human genome annotations in GENCODE release 19 and mapped RNA-seq reads with STAR 2-pass to the human genome version hg19 (Methods).Applying BIISQ to these data, we discovered 31,712 novel and 14,044 known transcript isoforms with respect to the GENCODE database v19 transcript isoform annotations, considering only perfect matches to isoform exon composition.When using a matching threshold of 0.2, we discovered 24,871 novel and 20,885 known isoforms.The distribution of the number of isoforms per gene is peaked for genes with no evidence of alternative splicing (one transcript) and heavily spliced genes (≥ 7 transcripts) although we note that this distribution could be confounded by erroneous splice junctions and fragmented transcripts in the BIISQ output (Online Methods and Supplementary Fig. 15) 12 .
To investigate population and sex specific splicing patterns, we first analyzed transcript ratio quantification patterns across all genes in the GEUVADIS data.We considered global signatures of differential transcript ratio usage and did not find a significant difference in the average isoform transcript counts across sex (χ 2 test, p ≤ 0.99) or population (χ 2 test, p ≤ 1) when counts were aggregated across protein coding genes (Supplementary Fig. 15).We then computed population and sex specific transcript ratio distributions for each protein coding gene independently using like-lihood ratio (LR) tests (see Online Methods).We found 924 and 148 genes that show populationspecific and sex-specific transcript ratio distributions, respectively (χ 2 test, Bonferroni-corrected p ≤ 0.05; Supplementary Table 14).Gene PTPRN2 showed the most significant differential effects of population on isoform ratios (LR test, Bonferonni-corrected p ≤ 2.2 × 10 −16 ; Figure 5A, top).

Gene
LGALS9B showed the most significant differential effects of sex on isoform ratios (LR test, Bonferonni-corrected p ≤ 2.2 × 10 −16 ; Figure 5A, bottom, B).Most samples express at most two of the five isoform transcripts of LGALS9B, but females show more highly variable isoform expression levels than males, in particular for isoform 5 (Figure 5B).
Next, due to scarce prior information on population or sex specific transcript ratios, we validated these results by testing for over-representation of population (CEU, TSI, FIN, GBR, YRI), European (EUR) vs. African (AFR) (termed super population), and sex specific variants in the exonic and intronic regions of the 924 and 148 genes.We compared this set to variants in background genes defined as non-overlapping gene regions in human genome version hg19.To control for the correlation structure of variation throughout the genome, we preformed linkage disequilibrium (LD) pruning by removing variants in high r 2 with neighboring variation (see Methods).
We varied the minor allele frequency (MAF), population-, and sex-specific thresholds to test the hypergeometric test's sensitivity to the threshold.We found a significant over-representation of population-specific variants at MAF ≥ 0.15 and a population threshold requiring ≥ 85% of alleles in a sample to exist in a specific population (hypergeometric test, Bonferroni-corrected p ≤ 1.8 × 10 −5 ; Supplementary Table 2).We computed over-abundance tests for European or African specific variants (Supplementary Table 3) and find that they require a much lower threshold than the population tests (MAF=0.05,super population threshold=0.75,hypergeometric test, Bonferroni-corrected p ≤ 1.42 × 10 −5 ), which is expected given that most of the GEUVADIS sample is concentrated in the European population.We found similar significance for sex specific variants (Supplementary Table   overlapping transcripts to compare BIISQ cis-trQTLs and the GEUVADIS study cis-trQTLs, but, comparing the most significant cis-trQTLs for each gene between the BIISQ and GEUVADIS study (EUR population) results did not show significant correlation (Supplementary Fig. 19; t-test, p ρ = 0.49 and p rs = 0.12).Interestingly, BIISQ cis-eQTLs were more highly correlated with BIISQ cis-trQTLs (Supplementary Fig. 20; ρ = 0.74 and r s = 0.51) than GEUVADIS study cis-trQTLs and cis-eQTLs (EUR: Supplementary Fig. 21; ρ = 0.3 and r s = 0.23 and YRI: Supplementary Fig. 22; ρ = 0.18 and r s = 0.03).
Next, we quantified potentially novel biological insight uniquely enabled by BIISQ.We computed genes with cis-trQTLs exclusively inferred by BIISQ compared to previous work 26 .
The most significant cis-trQTL (rs7042091) was identified for gene LCN8 by BIISQ (FDR ≤ 0.05; Figure 5E) both of which were not found in the GEUVADIS study (FDR < 0.05) 26 .However, we found evidence for a cis-eQTL at this SNP-gene pair in the GTEx study across a number of tissues (most significant p-value in whole blood p ≤ 2.2 × 10 −16 ) 3 .Further, mutations in the SNCA gene, typically expressed in neurons but also in LCLs, have been associated with Parkinson's disease and was among the 721 uniquely inferred genes with a significant cis-trQTL 46 .It has been demonstrated that alternatively spliced transcripts can cause protein misfolding 47 and the misfolding of SNCA's protein product has been suggested as a therapeutic target for Parkinson's disease treatment 48 .The synthesis of these results suggest the alternatively spliced transcripts of SNCA might be interesting targets for future research and demonstrate the unique utility of BIISQ.
Finally, we quantified enrichment of eGenes in the Database for Annotation, Visualization and Integrated Discovery (DAVID) 53 across nine annotation lists including the KEGG, SwissProt, UniProt, and InterPro databases (Supplementary Table 7).We compiled sets of high confidence isoforms for each gene by filtering out transcript isoforms not present in ≥ 10% of the samples.
We computed functional enrichment for the remaining eGenes, and genes with > 1, > 4 and > 6 transcript isoforms using all annotated human genes as the background set.We found cis-trQTL gene target enrichment in a single KEGG pathway, olfactory transduction (BH adjusted p ≤ 2.5 × 10 −3 ) (Supplementary Table 8).This pathway shows substantial transcript diversity: more than two thirds of olfactory receptors have been estimated to be alternatively spliced 54 .The most significant enrichment for the SwissProt and UniProt seq-feature annotations across the > 1, > 4, and > 6 gene sets were alternative splicing and splice variant respectively (BH adjusted p ≤ 2.2 × 10 −16 ) (Supplementary Tables 9-11).We found the most significant enrichment from InterPro to be protein kinases (BH adjusted p ≤ 2.2 × 10 −16 ) which have been shown to exhibit high proteomic and functional diversity as the result of alternative splicing 55 .These database enrichment results support BIISQ's ability to reconstruct biologically relevant alternatively spliced gene sets.

Discussion
We presented a statistical model, BIISQ, for quantifying RNA isoforms, which shares strength across samples to estimate isoforms (especially those at low abundance) without reference isoform compositions.We described a stochastic variational inference method for fitting BIISQ to data that allows our approach to scale to genome-wide study data; further, BIISQ showed increased efficiency as the coverage or size of the gene increased in paired-end Iso-Seq data.We demonstrated that our method improves substantially over three state-of-the-art methods in both precision and recall of isoforms on two different types of simulated data with significant improvement for low abundance transcripts.We applied BIISQ to the GEUVADIS RNA-seq data and identified known and novel isoforms that we extensively validated, in part, by identifying cis-trQTLs that cluster near known splice junctions and are significantly enriched in cis regulatory elements associated with chromatin accessibility, histone modifications, and alternatively excised intron clusters.
BIISQ has several advantages over existing representations of RNA isoforms: 1) samplespecific isoforms are drawn from a collection of global isoforms, which leads to higher power to discover low frequency isoforms by sharing strength across samples; 2) a Bayesian hierarchical approach enables the principled incorporation of high-quality prior information such as observed variation in the exon composition of isoforms; and 3) a nonparametric approach allows us to flexibly combine computationally tractable posterior inference with model selection (here, the number of isoforms).BIISQ is guaranteed to converge to a local maximum, but the results on BEERS and Iso-Seq data demonstrate that the quality of isoform reconstruction is improved from taking the best (in the maximum a posteriori) solution from multiple random restarts.
Our results on GEUVADIS data show that BIISQ captures biologically interesting trends.This is likely due to the fact that the potentially incomplete transcripts BIISQ computes are still useful; in fact, many methods quantify alternatively spliced exons 14,56 or excised introns 8 that correspond to incomplete transcripts.This suggests that both the full-length and partial transcripts merge-propose-reduce step in SVI and executed this step every 30 iterations 57,58 .For every pair of isoforms, the merge step calculates the likelihood of the data before and after merging the pair of isoforms; if the likelihood is greater after the merge for each sample, the merge is accepted.BIISQ proposes new isoforms by computing the union of exons in randomly sampled, poorly mapped read-terms, where the likelihood of that read mapping to existing isoforms is < 0.5.If at least one novel isoform is proposed, BIISQ reinitializes all variational parameters as defined by Supp.
Algorithm 1. Finally, the reduce step removes an isoform k from the local and global distributions if, for all samples, there are no reads that map to isoform k with a probability > 0.01.
BEERS simulated data runs.Single-end simulated RNA-seq reads were generated by the benchmarker for evaluating the effectiveness of RNA-Seq software (BEERS) 16 .We divided RefSeq gene models into three equally sized groups according to exon count, producing groups of genes with 3-6, 7-12, and 13-312 exons.For 35 randomly selected genes in each group, we sampled 10,000 error-free reads for seven parameter configurations: a fixed number of three novel transcripts with varying read lengths in {50,200,400} and a fixed read length of 100 with a varying number of novel transcripts in {2,3,4,6} (see Supplementary Methods).For each gene model, we sample reads for 100 samples drawn from the aligned RNA-seq reads in the BEERS simulated read pool.
The start position of a read was sampled from a gamma distribution with a parameter that decreases linearly with the position to simulate a 5' bias.The exonic composition of a novel splice form was generated by BEERS, and isoform proportions for each sample were sampled from a Dirichlet distribution with concentration parameter 1.
Short read simulations from PacBio Iso-Seq long read data.We downloaded the full length non-chimeric human transcriptome liver, heart, and brain data from the Iso-Seq protocol, which included unaligned sequence reads and General Feature Format (GFF) reference files for each tissue 25 .The gene identifiers provided in the reference files were created independently for each tissue, so we constructed a reference set of genes and their transcripts across tissues as follows.For each gene, we created a standard set of exons by parsing its transcripts and collapsing overlapping exons in the GFF files.We then mapped genes across the three tissues based on a base-pair overlap of 95% and discarded non-unique mappings.For each gene, we then mapped transcripts across tissues based on a 95% overlap (see Supplementary Methods).This process was conservative by design, leading to a confident baseline of cross-tissue isoforms.We found seven genes having at least two transcripts isoforms shared across at least two tissues (see Supplementary Table 1): BLOC1S6, ZFAND6, CYTH1, APP, C1orf43, SPARCL1, and RNF14.None of these genes were found to be expressed in liver and thus the liver data was discarded.
We mapped the Iso-Seq reads to the gene sequences of the identified transcripts from human genome version hg19 (Supplementary Table 1) using the GMAP and STARlong algorithms 42 .We built an Iso-Seq short-read simulator (ISSRS) that simulates short-reads from longer Iso-Seq reads.
The inputs to ISSRS are sequencing parameters, a gene reference file with exon boundaries, and an aligned sequence read file.The outputs of ISSRS are aligned sequences in SAM format that contain shorter sequence reads but retain the read mapping biases present in the Iso-Seq data by copying the sequence position from the Iso-Seq reads.In brief, the simulator works as follows: (1) compute Iso-Seq reads that map with high confidence to a known transcript; (2) for each read, determine the amount of sampling based on input coverage; (3) sample reads by attempting to add insert sizes distributed normally with mean 10 bp and 40 bp standard deviation; (4) output sampled reads while preserving sequence and quality scores from the aligned Iso-Seq transcripts in SAM and BIISQ format (see Supplementary Methods).For step (1), STARlong mappings yielded fewer false positives than GMAP, but GMAP produced many more usable alignments (Supplementary Figs. 6 and 7).For the seven transcripts identified across tissues (Supplementary Table 1), we simulated reads with lengths 50 bp, 100 bp, and 200 bp and approximate coverage values of the input Iso-Seq transcripts of 0.25, 0.5, and 1 for 50 samples from brain and heart tissues.
GEUVADIS RNA-seq data preparation.RNA-seq reads from EBV-transformed LCLs were downloaded from the Genetic European Variation in Health and Disease (GEUVADIS) project 26 .
BIISQ requires read-terms -mapped RNA-seq read start positions, end positions, and exons covered tuples -and a model of transcription for each gene indicating contiguous transcribed subse-splice junctions, transcripts with ratios of 0 or 1 in all samples, as well as, transcripts expressed in less than 10% of the samples were discarded.
GEUVADIS functional assessment.The Database for Annotation, Visualization and Integrated Discovery (DAVID v6.8, May 2016) analysis included functional enrichment for nine databases (Supplementary Table 7) and used the default whole genome set of genes 53 .To reduce the affects of linkage disequilibrium (LD) on the variant set enrichment analysis, we computed LD blocks for YRI, CEU, FIN, GBR, and TSI populations with a minimum MAF of 0.001, r 2 of 0.8, and genotyping rate of 0.8 using the rAggr interface to Haploview on the 1000 Genomes Project phase 3 data 62 .Variant set enrichment analysis was run on the LD blocks for ENCODE Encyclopedia 3 annotations: DNaseI hypersensitive sites, H3K4me3, H3K27ac, annotations generated from a chromatin state segmentation computational tool sourcing from the Broad Histone UCSC track for nine factors and nine cell types, DNaseI/FAIRE/ChIP synthesis annotations from ENCODE and OpenChrom 52 , and LeafCutter clusters (Supplementary Table 6).Sample Gm12878HMM was removed from enrichment analysis due to a non-normal null distribution (KS test, p ≤ 0.01), which is required by VSE (Supplementary Fig. 23).We included annotations from LCLs as well as several other cell types as controls: glioblastoma, cervical carcinoma, hepatocellular carcinoma, trophoblast, and embryonic stem cells.
Population and sex specific splicing.Population and sex-specific transcript ratios were evaluated based on a likelihood ratio test.The alternative hypothesis modeled sample transcript ratios as draws from population and sex-specific Dirichlet distributions while the null hypothesis assumed a global Dirichlet distribution.We computed the maximum likelihood estimates for parameters of the global, population-specific, and sex-specific Dirichlet distributions.Genes were selected for the isoform proportion plots (Figure 5A) based on the likelihood ratio test Software.The source code and software implementing the BIISQ model and inference methods can be downloaded from: https://github.com/bee-hive/BIISQ.

Figure 1 :
Figure 1: A single gene may be transcribed into several distinct mRNA variants called isoforms through alternative splicing mechanisms.This figure shows six common types of splicing events (top to bottom): simple transcript; alternative transcription start site; alternative 5' splice site; alternative 3' splice site; skipped exon; and alternative polyadenylation.

Figure 2 :
Figure 2: Isoform discovery precision and recall for simulated data.Precision (red) and recall (blue) of the results from BIISQ, CEM, Cufflinks (CUFF), and SLIDE (SLIDE more and SLIDE fewer) applied to the BEERS simulated single-end RNA-seq data.The thick center bars denote the mean precision or recall and the fill denotes twice the standard error.Transparent fill denotes partial precision and recall with a matching threshold of 0.2.Across all methods, the best (partial) precision and recall values are annotated above their respective data points.
11 and 12).However, CEM and Cufflinks showed increased run time as a function of the number of exons (p-value for linear regression of run time versus the number of exons for CEM and Cufflinks: p ≤ 8.79 × 10 −4 and p ≤ 1.64 × 10 −3 , respectively) and the span (linear regression of run time versus span p ≤ 7.44 × 10 −5 and p ≤ 8.85 × 10 −5).SLIDE all exhibited increased run time as a function of exons (p ≤ 6.16×10 −6 ) but no significant increase for span (p ≤ 0.32).In contrast, the run time of BIISQ shows no dependence on the number of exons (p ≤ 0.61) or span (p ≤ 0.29) (Supplementary Figs.

13 and 14 )
; this efficiency is likely due to modeling reads as read-terms and the aggregation of similar read-terms in the initial processing of the aligned sequence read input.

Figure 4 :
Figure 4: Comparison of methods on Iso-seq simulations.Precision (red) and recall (blue) of the results from BIISQ, CEM, Cufflinks (CUFF), SLIDE more, and SLIDE fewer applied to (A) the short-read data simulated from Iso-Seq reads; (B) simulated data split by read length; and (C) simulated data split by span.Transparent fill denotes partial precision and recall with a matching threshold of 0.2 The thick center bars denote the mean precision or recall, and the fill denotes twice the standard error.The best (partial) precision and recall values are annotated above their respective points.

Figure 5 :
Figure 5: Results for GEUVADIS data.(A) The isoform quantification distribution where color denotes a unique isoform and each vertical bar is a single sample for genes PTPRN2 (top) and LGALS9B (bottom).(B) Simplex plots for gene LGALS9B factored by sex.Each point (red for female, black for male) represents a sample's isoform composition distribution for the two isoforms denoted on the bottom axis and the remaining isoforms at the top intersection point.(C) Matrix-eQTL p-value distribution for (left) cis-trQTLs and (right) cis-eQTLs.(D) The density of cis-trQTLs, LD pruned cis-trQTLs, and cis-eQTLs distances to the nearest canonical splice junctions in GENCODE.(E) LCN8 contained The most significant cis-trQTL (p ≤ 2.2 × 10 −16 ).Linear and logistic regressions are shown in black and red.(F) Variant set enrichment scores (red points denoting significance).The x-axis includes cis-regulatory annotations (histone modifications and chromatin accessibility).

2log L(∼ χ 2 where 1 : 8 )
θCEU |x CEU tr ),L( θFIN |x F IN tr ),L( θGBR |x GBR tr ),L( θTSI |x T SI tr ),L( θY RI |x Y RI tr ) L( θALL |x ALL tr ) θa are the maximum likelihood estimates for the parameters of the Dirichlet distribution for population a, x tr are the sample transcript ratios and populations are denoted as superscripts.Sex For inexact matches, partial precision and recall were calculated by defining a matching M , or a set of pairs of inferred-true isoforms, that is of maximum cardinality and minimum weight (i.e., distance between isoform composition of a pair) between each computed transcript and the true transcripts as follows.Let K C and K T be the set of estimated and true isoforms, respectively, which are Boolean vectors of length E exons {1,2,...,E}.A 1 at position ι signifies that exon ι is contained within that isoform, and k[ι] indexes the position of the Boolean vector k.We define the distance between an inferred and true isoform d k,l for all k ∈ K C and l ∈ K T to be the Hamming distance.The Hamming distance counts the number of mismatched exons between the estimated and true isoforms.The maximum cardinality minimum weight M is then the solution to the optimization problem min M = k∈1:K C ∈1:K T x k d k K C , ∀ ∈ 1 : K T (If the total number of isoforms is I, finding a maximum cardinality minimum weight matching can be solved in O(I 3 ) time 65 .If d k is the distance between inferred isoform k ∈ K C and true isoform ∈ K T for matching M , then d k, = 0 (d k, > 0) implies k is a true (false) positive; if d k, ≤ p|E| (d k, > p|E|) then k is a p-partial true (false) positive (p-TP and p-FP).Any true isoform not matched by a p-partial true positive is a p-partial false negative (p-FN).Using these definitions of p-TP, p-FP, and p-FN, we can compute p-precision and p-recall as in Equation 4.