Abstract
A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. We report DARTS (https://github.com/Xinglab/DARTS), a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples. DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.
Code availability
The DARTS program, trained model parameters, and predictive features are provided at GitHub (https://github.com/Xinglab/DARTS).
Data availability
The RNA-seq data that support the findings of the deep learning models are available from the ENCODE project (https://www.encodeproject.org/) and the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/). The H358 time-course RNA-seq data were downloaded from GEO accession GSE75492. The PC3E-GS689 RNA-seq data and RASL-seq data can be accessed from GEO under accession GSE112037.
Acknowledgements
We thank X.-D. Fu (UCSD) for the RASL oligos and advice on RASL-seq. This study is supported by National Institutes of Health grants (R01GM088342, R01GM117624, U01HG007912, and U01CA233074 to Y.X.). Z.Z. is partially supported by a UCLA Dissertation Year Fellowship.
Author information
Contributions
Z.Z. and Y.X. conceived the study; Z.Z., Y.W., and Y.X. designed the research; Z.Z., Z.P., Y.Y., S.A., and J.P. performed the research; Z.X., R.P.C., and D.L.B. contributed analytic tools; Z.Z. and Y.X. analyzed data; and Z.Z. and Y.X. wrote the paper with input from all other authors.
Ethics declarations
Competing interests
Y.X. and D.L.B. are scientific cofounders of Panorama Medicine. Z.Z. and Y.X. are in the process of filing a patent application for DARTS.
Integrated supplementary information
Supplementary Figure 1 Schematic overview of the DARTS DNN model.
The DARTS DNN model consists of four hidden layers and 7,923,402 parameters. Batch normalization and dropout of hidden units are applied during training to mitigate overfitting.
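As a rough check on the parameter total above, the sketch below counts weights, biases, and two learnable batch-normalization parameters per hidden unit for a fully connected network. The layer widths are not stated in this caption; the values in `ASSUMED_LAYER_SIZES` are an assumption, chosen because they reproduce the stated total of 7,923,402, and `count_parameters` is a hypothetical helper, not part of the DARTS code.

```python
def count_parameters(layer_sizes, batch_norm=True):
    """Count learnable parameters of a fully connected network.

    layer_sizes lists the input width, each hidden width, and the output width.
    """
    total = 0
    # Each dense layer contributes a weight matrix plus one bias per output unit.
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    if batch_norm:
        # Batch normalization adds a scale and a shift per hidden unit.
        total += 2 * sum(layer_sizes[1:-1])
    return total


# ASSUMPTION: input width, four hidden widths, and a two-class output.
# These numbers are illustrative and do not come from this section.
ASSUMED_LAYER_SIZES = [5922, 1200, 500, 300, 200, 2]

print(count_parameters(ASSUMED_LAYER_SIZES))  # 7923402
```

Under these assumed widths the first dense layer accounts for roughly 90% of the parameters, which is typical when a wide feature vector feeds a comparatively narrow network.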
Supplementary Figure 2 Performance comparison of DARTS BHT(flat), MISO, and MATS using simulated RNA-seq data generated by Flux simulator.
We derived transcriptome profiles from a real RNA-seq dataset with widespread splicing changes (E-MTAB-1147; knockdown of the splicing factor HNRNPC in the HeLa cell line) and supplied them to Flux simulator as ground truth to simulate RNA-seq reads. a,b, We then computed (a) the area under the receiver operating characteristic curve (AUROC) and (b) the area under the precision-recall curve (AUPR) for each statistical method, labeling exon-skipping events with ground-truth \(|\Delta\psi| > 0.05\) as positive and those with \(|\Delta\psi| \le 0.05\) as negative (details are provided in the Supplementary Notes). DARTS BHT(flat) performed favorably compared with MISO and MATS.
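The labeling and scoring step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the toy values are made up, AUROC is computed as the rank (Mann-Whitney) statistic, and AUPR is approximated by average precision, which is one common estimator among several.

```python
def label_events(true_delta_psi, threshold=0.05):
    """Positive if the ground-truth |delta psi| exceeds the threshold."""
    return [abs(d) > threshold for d in true_delta_psi]


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each positive (an AUPR estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (illustrative only): ground-truth delta psi and a method's scores.
true_delta_psi = [0.30, -0.12, 0.02, 0.00, 0.08, -0.04]
method_scores = [0.95, 0.80, 0.60, 0.10, 0.40, 0.20]

labels = label_events(true_delta_psi)
print(auroc(method_scores, labels))              # 8/9, about 0.889
print(average_precision(method_scores, labels))  # 11/12, about 0.917
```

Both functions assume at least one positive and one negative event; a production evaluation would also need a policy for tied scores in the precision-recall ranking.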
Supplementary Figure 3 Performance comparison of DARTS BHT(flat) with replicates versus DARTS BHT(flat) on pooled data and rMATS with replicates on n = 3,000 independent simulated events.
We fixed the total RNA-seq read counts (coverage per replicate × number of replicates) while varying the number of replicates (K), within-group variance (sigma), and whether there is one outlier sample. The replicate DARTS model (rDARTS) outperforms DARTS on pooled data when there are outlier samples (b,e) or when the within-group variance is large (c,f).
Supplementary Figure 4 The performance of the DARTS DNN during cross-validation and testing as training progressed.
The maximum AUROC was 0.97 during cross-validation and 0.86 during testing.
Supplementary Figure 5 Relationship of DARTS posterior, prior, and the amount of observed RNA-seq read counts.
For a fixed absolute PSI difference between the two conditions (i.e., the effect size δ), we calculated the posterior probability \(P(|\delta| > 0.05 \mid I, S)\) from simulated data by varying the prior probability and the amount of read counts. The prior’s effect on the DARTS posterior diminished when the number of observed read counts was large (>100) and/or with large effect sizes (δ = 0.3). For events with moderate or low numbers of read counts, a strong informative prior improves the inference accuracy.
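The interplay between prior and read counts can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden stand-in, not the DARTS BHT model: it treats inclusion and skipping counts as plain binomial draws (ignoring isoform effective lengths and replicate variance), places a uniform prior on the two PSI values within each hypothesis, and evaluates \(P(|\delta| > \tau \mid I, S)\) on a grid. The function name and all parameters are hypothetical.

```python
import math


def posterior_differential(i1, s1, i2, s2, prior, tau=0.05, grid=200):
    """Grid estimate of P(|psi1 - psi2| > tau | counts).

    i*/s* are inclusion and skipping read counts in conditions 1 and 2;
    prior is the prior probability that the event is differential.
    """
    pts = [(k + 0.5) / grid for k in range(grid)]  # midpoints in (0, 1)
    ll1 = [i1 * math.log(p) + s1 * math.log(1 - p) for p in pts]
    ll2 = [i2 * math.log(p) + s2 * math.log(1 - p) for p in pts]
    shift = max(ll1) + max(ll2)  # rescale to avoid floating-point underflow

    sum_diff = sum_same = 0.0
    n_diff = n_same = 0
    for a, p1 in enumerate(pts):
        for b, p2 in enumerate(pts):
            weight = math.exp(ll1[a] + ll2[b] - shift)
            if abs(p1 - p2) > tau:
                sum_diff += weight
                n_diff += 1
            else:
                sum_same += weight
                n_same += 1

    # Average likelihood under each hypothesis (uniform within its region).
    m_diff = sum_diff / n_diff
    m_same = sum_same / n_same
    return prior * m_diff / (prior * m_diff + (1 - prior) * m_same)


# Same PSI estimates (0.3 versus 0.7) at low and high coverage.
for counts in [(3, 7, 7, 3), (300, 700, 700, 300)]:
    weak = posterior_differential(*counts, prior=0.1)
    strong = posterior_differential(*counts, prior=0.9)
    print(counts, round(weak, 3), round(strong, 3))
```

In this toy setting the two priors give clearly different posteriors at 10 reads per condition and nearly identical posteriors at 1,000 reads per condition, which is the qualitative behavior the caption describes.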
Supplementary Figure 6 Application of the DARTS DNN to different classes of alternative splicing patterns.
a,c,e, The performance of the DARTS DNN during cross-validation and testing as training progressed for alternative 5′ splice sites (A5SS), alternative 3′ splice sites (A3SS), and retained introns (RI) as measured by AUROC. b,d,f, Comparison of the DARTS DNN with baseline methods using independent leave-out datasets. The DARTS DNN outperforms baseline methods trained on individual leave-out datasets by a large margin. Note that in these analyses the DARTS DNN is trained using combined ENCODE and Roadmap RNA-seq datasets, with certain pairwise comparisons held out for benchmarking as independent leave-out datasets.
Supplementary Figure 7 An example of the DARTS DNN prediction for the PLEKHA1 gene in the H358 EMT time-course RNA-seq data.
The genome browser view represents aggregated RNA-seq signals from three biological replicates. The DARTS DNN score for this exon is 0.94 in day 5 versus day 0, raising the posterior probability of differential splicing from 0.42 (RNA-seq data alone) to 0.73. The differential splicing pattern of this exon was apparent throughout the time course and was previously validated by RT-PCR.
Supplementary Figure 8 Meta-exon motif analysis of the ESRP motif.
a,b, For the comparison of day 6 versus day 0 on the H358 time-course RNA-seq data, we calculated ESRP motif scores for (a) all DARTS BHT(flat) significant events and (b) DARTS BHT(flat) significant events that become insignificant in DARTS BHT(info). The latter set of events does not have enrichment of the ESRP motif.
Supplementary Figure 9 Characteristics of the DARTS DNN predicted events.
a,b, The cumulative distribution function of (a) gene expression levels (TPM values) and (b) RNA-seq read coverage for DARTS-DNN-predicted differential events and RNA-seq differential events. The DARTS-DNN-predicted differential events are from genes with significantly lower expression levels and have significantly lower RNA-seq read coverage than RNA-seq differential events (two-sided Wilcoxon test). c, DARTS BHT(info) outperforms baseline methods that use RNA-seq data alone to call differential splicing (DARTS BHT(flat), rMATS, and SUPPA2), as benchmarked using ground truth defined by RASL-seq. d, DARTS BHT(info) outperforms baseline methods at different FPR thresholds for DARTS-DNN-predicted differential events (n represents the number of alternative splicing events), with the maximum gain observed for the most confidently predicted events with FPR = 1%.
Supplementary Figure 10 Ranking by DARTS BHT on simulated data when using different t1 and t2 values.
The results of DARTS BHT are robust to different choices of parameters, especially for the inference of differential alternative splicing events (upper right corner in each panel).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10, Supplementary Notes
Supplementary Table 1
A complete list of cis RNA sequence features and trans RNA-binding protein features used in the DARTS DNN model. The cis RNA sequence features for each type of alternative splicing pattern (skipped exons, alternative 5′ splice sites, alternative 3′ splice sites, retained introns) are shown in separate tabs.
Supplementary Table 2
Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in ENCODE RBP knockdown RNA-seq experiments used to train the DARTS DNN model.
Supplementary Table 3
Performance of DARTS DNN, DNN, logistic regression, and random forest with ENCODE leave-out datasets.
Supplementary Table 4
Performance of DARTS BHT(info) and DARTS BHT(flat) in cell-type-specific splicing analysis using individual replicates of ENCODE HepG2 versus K562 RNA-seq data.
Supplementary Table 5
Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in pairwise comparisons of Roadmap RNA-seq data used for training the DARTS DNN model.
Supplementary Table 6
A complete list of DARTS-DNN-predicted differential (n = 52) and unchanged (n = 29) alternative splicing events in the comparison of GS689 versus PC3E. Prior probability predicted by the DARTS DNN, posterior probability of DARTS BHT(flat) and DARTS BHT(info), and P values and FDR from rMATS are reported.
About this article
Cite this article
Zhang, Z., Pan, Z., Ying, Y. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods 16, 307–310 (2019). https://doi.org/10.1038/s41592-019-0351-9