Technical Report

Annotation-free quantification of RNA splicing using LeafCutter

  • Nature Genetics volume 50, pages 151–158 (2018)
  • doi:10.1038/s41588-017-0004-9

Abstract

The excision of introns from pre-mRNA is an essential step in mRNA processing. We developed LeafCutter to study sample and population variation in intron splicing. LeafCutter identifies variable splicing events from short-read RNA-seq data and finds events of high complexity. Our approach obviates the need for transcript annotations and circumvents the challenges in estimating relative isoform or exon usage in complex splicing events. LeafCutter can be used both to detect differential splicing between sample groups and to map splicing quantitative trait loci (sQTLs). Compared with contemporary methods, our approach identified 1.4–2.1 times more sQTLs, many of which helped us ascribe molecular effects to disease-associated variants. Transcriptome-wide associations between LeafCutter intron quantifications and 40 complex traits increased the number of associated disease genes at a 5% false discovery rate by an average of 2.1-fold compared with that detected through the use of gene expression levels alone. LeafCutter is fast, scalable, easy to use, and available online.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We thank X. Lan and other members of the Pritchard lab for helpful discussions and comments. This work was supported by a CEHG fellowship (Y.I.L.), the Howard Hughes Medical Institute (J.K.P.), and the US National Institutes of Health (NIH grants HG007036, HG008140, and HG009431 to J.K.P., and MH107666 to H.K.I.).

Author information

Author notes

    • Yang I. Li

    Present address: Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA

  1. Yang I. Li and David A. Knowles contributed equally to this work.

Affiliations

  1. Department of Genetics, Stanford University, Stanford, CA, USA

    • Yang I. Li, David A. Knowles & Jonathan K. Pritchard
  2. Department of Computer Science, Stanford University, Stanford, CA, USA

    • David A. Knowles
  3. Department of Radiology, Stanford University, Stanford, CA, USA

    • David A. Knowles
  4. UCL Genetics Institute, Gower Street, London, UK

    • Jack Humphrey
  5. Department of Neurodegenerative Disease, UCL Institute of Neurology, London, UK

    • Jack Humphrey
  6. Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA

    • Alvaro N. Barbeira, Scott P. Dickinson & Hae Kyung Im
  7. Department of Biology, Stanford University, Stanford, CA, USA

    • Jonathan K. Pritchard
  8. Howard Hughes Medical Institute, Stanford University, Stanford, CA, USA

    • Jonathan K. Pritchard

Contributions

Y.I.L., D.A.K., and J.K.P. conceived of the project. Y.I.L. and D.A.K. performed the analyses and implemented the software. D.A.K. developed and performed the statistical tests and modeling. J.H. implemented the visualization application. A.N.B., S.P.D., and H.K.I. performed the S-PrediXcan analyses. Y.I.L. and J.K.P. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Yang I. Li or David A. Knowles or Jonathan K. Pritchard.

Integrated Supplementary Information

  1. Supplementary Figure 1

    Several types of common alternative splicing events are captured by the alternative excision of introns.

  2. Supplementary Figure 2

    Bar plots showing the number of alternatively used junctions annotated from our GTEx analyses that were found in Intropolis⁶. phenopredict⁸ was used to predict the tissue type corresponding to the SRA samples analyzed in Intropolis. For each set of junctions, the proportions of junctions that were found (at least 1 read) in any SRA sample (Any) or in samples predicted to be from testis (Testis) are highlighted. The predicted tissues with the highest number of supported junctions are colored in purple. Eighty-six percent of all novel alternatively used testis junctions from our LeafCutter analysis could be found in testis samples within SRA (not including GTEx).

  3. Supplementary Figure 3

    Junctions in GTEx tissues. (a) Distribution of the number of different GTEx tissues in which junctions predicted to be absent, or present in three commonly used annotation databases, could be detected. (b) Relative junction usage in multiple GTEx organs of annotated and unannotated junctions identified in four GTEx organs. (c) Distribution of LeafCutter clusters from GTEx samples in terms of their splicing types. Clusters with only annotated junctions and clusters with unannotated junctions were further separated.

  4. Supplementary Figure 4

    PhastCons score distribution of splice sites of novel introns. While 60% of annotated splice sites have a local phastCons score >0.6, only 15–25% of unannotated splice sites do. Thus, 80% of novel splice sites may represent noisy intron excision events.

  5. Supplementary Figure 5

    Comparison between beta-binomial and Dirichlet-multinomial models for differential splicing analyses, performed on 10 male brain vs. heart samples from GTEx. Two approaches for combining per-intron p-values into cluster-level p-values are compared: Bonferroni correction and Fisher's combined test. Bonferroni is very conservative, as expected. Fisher's combined test has considerably lower power than the multinomial approaches. However, only v2 of the Dirichlet-multinomial (which uses a per-intron concentration/overdispersion parameter) is well calibrated under permutations.

  6. Supplementary Figure 6

    Memory usage (RAM) of four differential splicing methods applied to comparisons between 3, 5, 10, and 15 YRI versus CEU LCL RNA-seq samples. We omitted the 15 vs. 15 MAJIQ run because of its high resource usage (in terms of both time and RAM). The right panel shows usage on a log scale.

  7. Supplementary Figure 7

    Cumulative distributions of differential splicing test P values (1-posterior for MAJIQ) for the comparison of all YRI versus CEU LCLs (red). The distributions of test P values for the permuted comparisons are also shown (black). *Cufflinks2 reports 19 significantly differentially spliced genes in the 3 vs. 3 comparison, but none in the other comparisons.

  8. Supplementary Figure 8

    Receiver operating characteristic (ROC) curves of LeafCutter, Cufflinks2, rMATS and MAJIQ for evaluation of differential splicing of genes with transcripts simulated to have varying levels of differential expression. The top panel shows ROC curves when excluding genes that were not tested by each respective method. The bottom panel includes genes that were not tested in the calculation of the true positive rate.

  9. Supplementary Figure 9

    LeafCutter is effective even with as few as 8 samples. Here we performed differential splicing analysis of 4 male brain vs. 4 male muscle samples, and compared the results to those obtained using 220 samples. a) p-values under permutations are well calibrated. b–c) p-values and effect sizes are highly correlated between the two sample-size datasets. d) Significant disparity in effect sizes between the two sample sizes is primarily driven by an intron being unique to a tissue when N = 8.

  10. Supplementary Figure 10

    Hierarchical clustering on all 1,258 introns that had no missing values in any of the samples.

  11. Supplementary Figure 11

    We restricted the analysis to introns that were found to be differentially excised between human tissues (P value < 10⁻¹⁰ and effect size > 1.0).

  12. Supplementary Figure 12

    Sharing of sQTL discoveries between Cufflinks2, Altrans, and LeafCutter estimated using Storey's π₀ method.

  13. Supplementary Figure 13

    Meta-cluster representation of the positions of all 4,543 sQTLs identified at 1% FDR.

  14. Supplementary Figure 14

    Functional enrichment of 4,543 sQTLs identified at 1% FDR from CEU GEUVADIS data. Bars represent the 95% confidence interval from 500 bootstraps.

  15. Supplementary Figure 15

    Example of a shared sQTL.

  16. Supplementary Figure 16

    Example of a tissue-specific sQTL.
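
The two rules named in the caption of Supplementary Figure 5 for turning per-intron p-values into a single cluster-level p-value (Bonferroni correction and Fisher's combined test) can be sketched as follows. This is a minimal illustration of the textbook forms of both rules, not the authors' implementation; the function names and the example p-values are hypothetical. Fisher's test assumes independent p-values, which is an assumption of this sketch and may not hold for introns belonging to the same cluster.

```python
import math


def _check(pvalues):
    """Validate that the input is a non-empty list of p-values in (0, 1]."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")


def bonferroni_cluster_p(intron_pvalues):
    """Bonferroni rule: smallest per-intron p-value times the number of introns, capped at 1."""
    _check(intron_pvalues)
    return min(1.0, len(intron_pvalues) * min(intron_pvalues))


def fisher_cluster_p(intron_pvalues):
    """Fisher's combined test.

    Under the null, and assuming independent p-values, the statistic
    X = -2 * sum(log p_i) follows a chi-square distribution with 2k degrees
    of freedom. For even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!, so no external library is needed.
    """
    _check(intron_pvalues)
    k = len(intron_pvalues)
    half_stat = -sum(math.log(p) for p in intron_pvalues)  # this is X / 2
    term = 1.0   # (X/2)^j / j! for j = 0
    total = 1.0  # running sum of the series
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical per-intron p-values for one cluster (illustrative only).
    cluster = [0.01, 0.20, 0.03]
    print("Bonferroni:", bonferroni_cluster_p(cluster))
    print("Fisher:    ", fisher_cluster_p(cluster))
```

With a single intron both rules return the input p-value unchanged, which is a quick sanity check on either function.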

Supplementary information

  1. Supplementary Figures and Supplementary Note

    Supplementary Figures 1–16 and Supplementary Note 1

  2. Life Sciences Reporting Summary

  3. Supplementary Dataset 1

    List of genes associated through RNA expression or splicing with S-PrediXcan