  • Brief Communication

Deep-learning augmented RNA-seq analysis of transcript splicing

Abstract

A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. We report DARTS (https://github.com/Xinglab/DARTS), a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples. DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.

Fig. 1: The DARTS computational framework.
Fig. 2: Performance evaluation of the DARTS BHT framework, and the influence of training datasets on the performance of the DARTS DNN.
Fig. 3: DARTS analysis of alternative splicing during the EMT.

Code availability

The DARTS program, trained model parameters, and predictive features are provided at GitHub (https://github.com/Xinglab/DARTS).

Data availability

The RNA-seq data that support the findings of the deep learning models are available from the ENCODE project (https://www.encodeproject.org/) and the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/). The H358 time-course RNA-seq data were downloaded from GEO accession GSE75492. The PC3E-GS689 RNA-seq data and RASL-seq data can be accessed from GEO under accession GSE112037.

Acknowledgements

We thank X.-D. Fu (UCSD) for the RASL oligos and advice on RASL-seq. This study is supported by National Institutes of Health grants (R01GM088342, R01GM117624, U01HG007912, and U01CA233074 to Y.X.). Z.Z. is partially supported by a UCLA Dissertation Year Fellowship.

Author information

Contributions

Z.Z. and Y.X. conceived the study; Z.Z., Y.W., and Y.X. designed the research; Z.Z., Z.P., Y.Y., S.A., and J.P. performed the research; Z.X., R.P.C., and D.L.B. contributed analytic tools; Z.Z. and Y.X. analyzed data; and Z.Z. and Y.X. wrote the paper with input from all other authors.

Corresponding author

Correspondence to Yi Xing.

Ethics declarations

Competing interests

Y.X. and D.L.B. are scientific cofounders of Panorama Medicine. Z.Z. and Y.X. are in the process of filing a patent application for DARTS.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Schematic overview of the DARTS DNN model.

The DARTS DNN model consists of four hidden layers and 7,923,402 parameters. Batch normalization and dropout of hidden variables are applied during training to mitigate overfitting.
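As a rough arithmetic check on the stated parameter count, the sketch below counts the learned parameters of a fully connected network with four hidden layers and batch normalization. The layer widths (5,922 inputs, hidden layers of 1,200, 500, 300, and 200 units, and 2 outputs) are an assumption that is not given in this caption, and the function name is hypothetical. Only the learned batch-normalization scale and shift are counted, not running statistics.

```python
def mlp_parameter_count(layer_widths, batch_norm=True):
    """Count learned parameters of a fully connected network.

    layer_widths lists the input width, each hidden width, and the output width.
    """
    total = 0
    # Each dense layer has n_in * n_out weights plus n_out biases.
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        total += n_in * n_out + n_out
    if batch_norm:
        # One learned scale and one learned shift per hidden unit.
        total += 2 * sum(layer_widths[1:-1])
    return total


# ASSUMED widths (not stated in the caption): input, four hidden layers, output.
ASSUMED_WIDTHS = [5922, 1200, 500, 300, 200, 2]

print(mlp_parameter_count(ASSUMED_WIDTHS))  # prints 7923402
```

Under these assumed widths the total matches the 7,923,402 parameters quoted above. Other combinations of widths could give the same total, so this does not confirm the architecture.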

Supplementary Figure 2 Performance comparison of DARTS BHT(flat), MISO, and MATS using simulated RNA-seq data generated by Flux simulator.

We derived transcriptome profiles from a real RNA-seq dataset with widespread splicing changes (E-MTAB-1147; knockdown of the splicing factor HNRNPC in the HeLa cell line) and supplied them to Flux simulator as ground truth to simulate RNA-seq reads. a,b, We then computed (a) AUROC and (b) AUPR for each statistical method, labeling exon-skipping events with ground-truth \(|\Delta\psi| > 0.05\) as positive and \(|\Delta\psi| \le 0.05\) as negative (details are provided in the Supplementary Notes). DARTS BHT(flat) performed favorably compared with MISO and MATS.
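The labeling and scoring scheme in this caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' evaluation code: the function names and toy numbers are hypothetical, AUROC is computed through its pairwise-ranking interpretation, and AUPR is approximated by average precision.

```python
def label_events(true_delta_psi, cutoff=0.05):
    """Label an event positive (1) if |delta PSI| > cutoff, else negative (0)."""
    return [1 if abs(d) > cutoff else 0 for d in true_delta_psi]


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive event."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy ground-truth delta PSI values and method scores (made up for illustration).
true_delta = [0.30, -0.12, 0.02, 0.00, 0.08, -0.01]
scores = [0.95, 0.80, 0.60, 0.10, 0.40, 0.20]

labels = label_events(true_delta)
print(labels)
print(auroc(labels, scores), average_precision(labels, scores))
```

The pairwise AUROC is quadratic in the number of events, which is fine for a toy example. A rank-based implementation would be preferable for tens of thousands of simulated events.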

Supplementary Figure 3 Performance comparison of DARTS BHT(flat) with replicates versus DARTS BHT(flat) on pooled data and rMATS with replicates on n = 3,000 independent simulated events.

We fixed the total RNA-seq read counts (coverage per replicate × number of replicates) while varying the number of replicates (K), within-group variance (sigma), and whether there is one outlier sample. The replicate DARTS model (rDARTS) outperforms DARTS on pooled data when there are outlier samples (b,e) or when the within-group variance is large (c,f).

Supplementary Figure 4 The performance of the DARTS DNN during cross-validation and testing as training progressed.

The maximum AUROC was 0.97 during cross-validation and 0.86 during testing.

Supplementary Figure 5 Relationship of DARTS posterior, prior, and the amount of observed RNA-seq read counts.

For a fixed absolute PSI difference between the two conditions (i.e., the effect size δ), we calculated the posterior probability \(P(|\delta| > 0.05 \mid I, S)\) from simulated data by varying the prior probability and the number of read counts. The prior’s effect on the DARTS posterior diminished when the number of observed read counts was large (>100) and/or the effect size was large (δ = 0.3). For events with moderate or low numbers of read counts, a strong informative prior improves the inference accuracy.
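The interplay of prior and read counts can be illustrated with a toy model. This is a simplified sketch and not the DARTS BHT implementation. It assumes that I and S are inclusion and skipping read counts, that counts are binomial given PSI, and that δ follows a two-component normal mixture whose wide component is weighted by the prior probability of differential splicing. The function names, the mixture widths `tau_slab` and `tau_spike`, and the example counts are all hypothetical.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_differential(counts1, counts2, prior, tau_slab=0.3,
                           tau_spike=0.03, cutoff=0.05, steps=200):
    """Grid approximation of P(|delta| > cutoff | counts) in the toy model.

    counts1, counts2: (inclusion, skipping) read counts for each condition.
    prior: prior probability that the event is differentially spliced.
    """
    grid = [(i + 0.5) / steps for i in range(steps)]  # PSI values inside (0, 1)
    # Binomial log-likelihood (up to a constant) at every grid point.
    loglik1 = [counts1[0] * math.log(p) + counts1[1] * math.log(1.0 - p) for p in grid]
    loglik2 = [counts2[0] * math.log(p) + counts2[1] * math.log(1.0 - p) for p in grid]
    shift = max(loglik1) + max(loglik2)  # subtracted before exp() to avoid underflow
    inside = 0.0
    outside = 0.0
    for psi1, l1 in zip(grid, loglik1):
        for psi2, l2 in zip(grid, loglik2):
            delta = psi2 - psi1
            # Mixture prior on delta: wide "changed" part and narrow "unchanged" part.
            prior_density = (prior * normal_pdf(delta, tau_slab)
                             + (1.0 - prior) * normal_pdf(delta, tau_spike))
            weight = math.exp(l1 + l2 - shift) * prior_density
            if abs(delta) > cutoff:
                outside += weight
            else:
                inside += weight
    return outside / (inside + outside)


# Same observed PSI values (0.3 versus 0.6) at low and high coverage.
low = [posterior_differential((3, 7), (6, 4), p) for p in (0.1, 0.9)]
high = [posterior_differential((300, 700), (600, 400), p) for p in (0.1, 0.9)]
print("low coverage :", low)
print("high coverage:", high)
```

In this toy setting the two priors give clearly different posteriors at 10 reads per condition but nearly identical posteriors at 1,000 reads, which matches the qualitative trend described above.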

Supplementary Figure 6 Application of the DARTS DNN to different classes of alternative splicing patterns.

a,c,e, The performance of the DARTS DNN during cross-validation and testing as training progressed for alternative 5′ splice sites (A5SS), alternative 3′ splice sites (A3SS), and retained introns (RI) as measured by AUROC. b,d,f, Comparison of the DARTS DNN with baseline methods using independent leave-out datasets. The DARTS DNN outperforms baseline methods trained on individual leave-out datasets by a large margin. Note that in these analyses the DARTS DNN is trained using combined ENCODE and Roadmap RNA-seq datasets, with certain pairwise comparisons held out for benchmarking as independent leave-out datasets.

Supplementary Figure 7 An example of the DARTS DNN prediction for the PLEKHA1 gene in the H358 EMT time-course RNA-seq data.

The genome browser view represents aggregated RNA-seq signals from three biological replicates. The DARTS DNN score for this exon is 0.94 in day 5 versus day 0, which raises the posterior probability of differential splicing to 0.73, compared with 0.42 when using RNA-seq data alone. The differential splicing pattern of this exon was apparent throughout the time course and was previously validated by RT-PCR.

Supplementary Figure 8 Meta-exon motif analysis of the ESRP motif.

a,b, For the comparison of day 6 versus day 0 on the H358 time-course RNA-seq data, we calculated ESRP motif scores for (a) all DARTS BHT(flat) significant events and (b) DARTS BHT(flat) significant events that become insignificant in DARTS BHT(info). The latter set of events does not have enrichment of the ESRP motif.

Supplementary Figure 9 Characteristics of the DARTS DNN predicted events.

a,b, The cumulative distribution function of (a) gene expression levels (TPM values) and (b) RNA-seq read coverage for DARTS-DNN-predicted differential events and RNA-seq differential events. The DARTS-DNN-predicted differential events come from genes with significantly lower expression levels and have significantly lower RNA-seq read coverage than RNA-seq differential events (two-sided Wilcoxon test). c, DARTS BHT(info) outperforms baseline methods that use RNA-seq data alone to call differential splicing (DARTS BHT(flat), rMATS, and SUPPA2), as benchmarked using ground truth defined by RASL-seq. d, DARTS BHT(info) outperforms baseline methods at different FPR thresholds for DARTS-DNN-predicted differential events (n represents the number of alternative splicing events), with the maximum gain observed for the most confidently predicted events with FPR = 1%.
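One common way to turn an FPR level into a score cutoff is sketched below. It is only an illustration of the general idea and may differ from how the thresholds in panel d were derived. The function name and the toy scores are hypothetical.

```python
def threshold_at_fpr(negative_scores, fpr):
    """Return a cutoff such that at most a fraction fpr of negatives score above it."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed above the cutoff (small epsilon guards rounding).
    allowed = min(int(fpr * len(ranked) + 1e-9), len(ranked) - 1)
    return ranked[allowed]


# Toy prediction scores for 100 events known to be unchanged (made up).
negatives = [i / 100 for i in range(100)]

cutoff_5pct = threshold_at_fpr(negatives, 0.05)
cutoff_1pct = threshold_at_fpr(negatives, 0.01)

# Events scoring strictly above the cutoff are called differential.
false_pos_5pct = sum(s > cutoff_5pct for s in negatives)
false_pos_1pct = sum(s > cutoff_1pct for s in negatives)
print(cutoff_5pct, false_pos_5pct)
print(cutoff_1pct, false_pos_1pct)
```

A stricter FPR level gives a higher cutoff and therefore a smaller set of more confidently predicted events.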

Supplementary Figure 10 Ranking by DARTS BHT on simulated data when using different t1 and t2 values.

The results of DARTS BHT are robust to different choices of parameters, especially for the inference of differential alternative splicing events (upper right corner in each panel).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Notes

Reporting Summary

Supplementary Table 1

A complete list of cis RNA sequence features and trans RNA-binding protein features used in the DARTS DNN model. The cis RNA sequence features for each type of alternative splicing pattern (skipped exons, alternative 5′ splice sites, alternative 3′ splice sites, retained introns) are shown in separate tabs.

Supplementary Table 2

Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in ENCODE RBP knockdown RNA-seq experiments used to train the DARTS DNN model.

Supplementary Table 3

Performance of DARTS DNN, DNN, logistic regression, and random forest with ENCODE leave-out datasets.

Supplementary Table 4

Performance of DARTS BHT(info) and DARTS BHT(flat) in cell-type-specific splicing analysis using individual replicates of ENCODE HepG2 versus K562 RNA-seq data.

Supplementary Table 5

Statistics of the significant differential and unchanged alternative splicing events called by DARTS BHT(flat) in pairwise comparisons of Roadmap RNA-seq data used for training the DARTS DNN model.

Supplementary Table 6

A complete list of DARTS-DNN-predicted differential (n = 52) and unchanged (n = 29) alternative splicing events in the comparison of GS689 versus PC3E. Prior probability predicted by the DARTS DNN, posterior probability of DARTS BHT(flat) and DARTS BHT(info), and P values and FDR from rMATS are reported.

About this article

Cite this article

Zhang, Z., Pan, Z., Ying, Y. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods 16, 307–310 (2019). https://doi.org/10.1038/s41592-019-0351-9

  • DOI: https://doi.org/10.1038/s41592-019-0351-9
