Article

Terminal exon characterization with TECtool reveals an abundance of cell-specific isoforms

Nature Methods volume 15, pages 832–836 (2018)

Abstract

Sequencing of RNA 3′ ends has uncovered numerous sites that do not correspond to the termination sites of known transcripts. Through their 3′ untranslated regions, protein-coding RNAs interact with RNA-binding proteins and microRNAs, which regulate many properties, including RNA stability and subcellular localization. We developed the terminal exon characterization (TEC) tool (http://tectool.unibas.ch), which can be used with RNA-sequencing data from any species for which a genome annotation that includes sites of RNA cleavage and polyadenylation is available. We discovered hundreds of previously unknown isoforms and cell-type-specific terminal exons in human cells. Ribosome profiling data revealed that many of these isoforms were translated. By applying TECtool to single-cell sequencing data, we found that the newly identified isoforms were expressed in subpopulations of cells. Thus, TECtool enables the identification of previously unknown isoforms in well-studied cell systems and in rare cell types.

Data availability

Accession numbers of all data analyzed in this study are listed in this article and/or associated supplementary information files. TECtool is available at http://tectool.unibas.ch. Source data for Figs. 1, 3, and 4 are available online.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Kishore, S., Luber, S. & Zavolan, M. Deciphering the role of RNA-binding proteins in the post-transcriptional control of gene expression. Brief. Funct. Genomics 9, 391–404 (2010).

  2. Hausser, J. & Zavolan, M. Identification and consequences of miRNA-target interactions—beyond repression of gene expression. Nat. Rev. Genet. 15, 599–612 (2014).

  3. Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A. & Burge, C. B. Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647 (2008).

  4. Lackford, B. et al. Fip1 regulates mRNA alternative polyadenylation to promote stem cell self-renewal. EMBO J. 33, 878–889 (2014).

  5. Gruber, A. J. et al. Discovery of physiological and cancer-related regulators of 3′ UTR processing with KAPAC. Genome Biol. 19, 44 (2018).

  6. Mayr, C. & Bartel, D. P. Widespread shortening of 3′ UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673–684 (2009).

  7. Spies, N., Burge, C. B. & Bartel, D. P. 3′ UTR-isoform choice has limited influence on the stability and translational efficiency of most mRNAs in mouse fibroblasts. Genome Res. 23, 2078–2090 (2013).

  8. Gruber, A. R. et al. Global 3′ UTR shortening has a limited effect on protein abundance in proliferating T cells. Nat. Commun. 5, 5465 (2014).

  9. Gruber, A. J. et al. A comprehensive analysis of 3′ end sequencing datasets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation. Genome Res. 26, 1145–1159 (2016).

  10. Plass, M., Rasmussen, S. H. & Krogh, A. Highly accessible AU-rich regions in 3′ untranslated regions are hotspots for binding of regulatory factors. PLoS Comput. Biol. 13, e1005460 (2017).

  11. Martin, G., Gruber, A. R., Keller, W. & Zavolan, M. Genome-wide analysis of pre-mRNA 3′ end processing reveals a decisive role of human cleavage factor I in the regulation of 3′ UTR length. Cell Rep. 1, 753–763 (2012).

  12. Lee, J. Y., Yeh, I., Park, J. Y. & Tian, B. PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res. 35, D165–D168 (2007).

  13. Derti, A. et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 22, 1173–1183 (2012).

  14. Lin, Y. et al. An in-depth map of polyadenylation sites in cancer. Nucleic Acids Res. 40, 8460–8471 (2012).

  15. Tian, B., Hu, J., Zhang, H. & Lutz, C. S. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201–212 (2005).

  16. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

  17. Kersey, P. J. et al. Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res. 44, D574–D580 (2016).

  18. Liu, N. et al. N6-methyladenosine-dependent RNA structural switches regulate RNA-protein interactions. Nature 518, 560–564 (2015).

  19. Calviello, L. et al. Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 165–170 (2016).

  20. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).

  21. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

  22. Hayer, K. E., Pizarro, A., Lahens, N. F., Hogenesch, J. B. & Grant, G. R. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics 31, 3938–3945 (2015).

  23. Lagarde, J. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat. Genet. 49, 1731–1740 (2017).

  24. Uhlén, M. et al. Tissue-based map of the human proteome. Science 347, 1260419 (2015).

  25. Long, S. A. et al. Partial exhaustion of CD8 T cells and clinical response to teplizumab in new-onset type 1 diabetes. Sci. Immunol. 1, eaai7793 (2016).

  26. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

  27. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

  28. Aken, B. L. et al. The Ensembl gene annotation system. Database (Oxford) 2016, baw093 (2016).

  29. Lahens, N. F. et al. IVT-seq reveals extreme bias in RNA sequencing. Genome Biol. 15, R86 (2014).

  30. Gallego Romero, I., Pai, A. A., Tung, J. & Gilad, Y. RNA-seq: impact of RNA degradation on transcript quantification. BMC Biol. 12, 42 (2014).

  31. Lianoglou, S., Garg, V., Yang, J. L., Leslie, C. S. & Mayr, C. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 27, 2380–2396 (2013).

  32. Katz, Y. et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31, 2400–2402 (2015).

  33. Hinrichs, A. S. et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 34, D590–D598 (2006).

  34. You, L. et al. APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals. Nucleic Acids Res. 43, D59–D67 (2015).

  35. Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).

  36. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  37. Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).

  38. van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011).

  39. Jones, E., Oliphant, T. & Peterson, P. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org (2001).

  40. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  41. McKinney, W. pandas: a foundational Python library for data analysis and statistics. Presented at PyHPC 2011: Python for High Performance and Scientific Computing, 18 November 2011, Seattle, WA, USA.

  42. Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics https://doi.org/10.1093/bioinformatics/bty350 (2018).

  43. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  44. Hahne, F. & Ivanek, R. Visualizing genomic data using Gviz and Bioconductor. Methods Mol. Biol. 1418, 335–351 (2016).

  45. Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009).

  46. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).

Acknowledgements

We thank M. Jacquot and the members of sciCore for help with the infrastructure that we used to process the data. This work was supported by the Swiss National Science Foundation (grants 31003A_170216 (M.Z.) and 51NF40_141735 (National Center for Competence in Research ‘RNA & Disease’; to the NCCR consortium)) and the Marie Curie Initial Training Network (project #607720, RNATRAIN; M.Z.).

Author information

Author notes

  1. These authors contributed equally: Andreas J. Gruber, Foivos Gypas.

Affiliations

  1. Oxford Big Data Institute, Nuffield Department of Medicine, University of Oxford, Oxford, UK

    • Andreas J. Gruber
  2. Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland

    • Foivos Gypas, Ralf Schmidt & Mihaela Zavolan
  3. Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France

    • Andrea Riba

Contributions

A.J.G. conceived the study. A.J.G., F.G., and M.Z. designed the study. A.J.G., F.G., and A.R. developed the method. A.J.G., F.G., A.R., and R.S. analyzed the data with input from M.Z. All of the authors wrote and approved the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Andreas J. Gruber or Mihaela Zavolan.

Integrated supplementary information

  1. Supplementary Figure 1 Intronic poly(A) sites are processed in a tissue-specific manner.

    (a) Percentage of ‘intronic’ PAS in individual samples in the data sets obtained with the SAPAS protocol (ref. 34). Bottom panel: corresponding sequencing depths. (b) Position-dependent frequency of the canonical poly(A) signal (‘AAUAAA’) upstream of the ‘intronic’ poly(A) sites (orange) and of poly(A) sites corresponding to annotated terminal exons (blue) from the study introduced in (a). The usual position of the poly(A) signal, 21 nt upstream of the cleavage site, is indicated by the dashed vertical line. (c) Distribution of the number of distinct samples (from panel b) in which individual PAS were observed, for different types of PAS. ‘introns’: PAS from genomic regions currently annotated as intronic; ‘terminal exon (ds stop)’: PAS from annotated terminal exons that are located upstream of an annotated stop codon in the corresponding gene; ‘terminal exon’: PAS from terminal exons with no stop codon annotated downstream. Black boxes indicate the interquartile range (IQR) with the orange line corresponding to the median, whiskers corresponding to 1.5 times the IQR from the hinge, and densities extending to the most extreme values.

  2. Supplementary Figure 2 Sashimi plots of gene structures inferred from the RNA-seq data from different tissues.

    (a) The Coiled-coil Domain Containing 173 (CCDC173) gene locus with the annotated ENSEMBL transcript (orange), PAS from the PolyAsite atlas (red lines, http://www.polyasite.unibas.ch), and densities of mRNA reads (gray) from fallopian tube and testis samples. Gray arcs indicate spliced reads with their corresponding numbers. The novel terminal exon (red dotted box) is expressed in the fallopian tube, but not in testis, indicating a sex-dependent isoform switch. Note: Same as Fig. 2a in the manuscript, but showing all spliced reads. (b) Same representation for part of the Kinesin Family Member 1B (KIF1B) gene locus. The novel terminal exon (red dotted box) is mainly expressed in bone marrow. (c) Similar for the locus of lincRNA LINC01744.

  3. Supplementary Figure 3 TECtool analysis for bulk single-end or paired-end RNA-seq reads.

    (a) Graphic representation of the analysis flow for two replicates. (b-h) Features calculated by TECtool for annotated and putative terminal exons. The region from which these statistics are calculated is referred to as ‘object’. (b) Splicing-in-boundary: Reads that splice from an upstream region to the 5' boundary of the object. (c) Splicing-in-all: Reads that splice from an upstream region anywhere within the object. (d) Splicing-out-boundary: Reads that splice from the 3' boundary of the object to a downstream region. (e) Splicing-out-all: Reads that splice out from anywhere within the object to a downstream region. (f) Crossing-in-boundary: Unspliced reads that overlap the 5' boundary of the object. (g) Crossing-out-boundary: Unspliced reads that overlap the 3' boundary of the object. (h) Unspliced-within-boundaries: Unspliced reads that are contained in the object.

  4. Supplementary Figure 4 Overview of the region classification model in TECtool.

    (a) Flow chart of the machine learning algorithm in TECtool. (b) Analysis of TECtool running time. A data set of approximately 123 million reads was subsampled in increments of 10% (starting from approximately 12 million reads) and analyzed running TECtool on a single CPU. The analysis was repeated 10 times for each data set size. Shown are the mean and standard deviation over the 10 runs for each data set size.

  5. Supplementary Figure 5 Features used in the model.

    Schematic representation of the features that are used to construct the model and then classify regions into terminal exons, internal exons or background regions.

  6. Supplementary Figure 6 TECtool analysis flow for single-cell data.

    Example of TECtool analysis of two individual cells.

  7. Supplementary Figure 7 Evaluation of TECtool’s performance.

    (a) Scatter plot of estimated expression levels of already annotated transcripts (ENSEMBL v87, transcript support level 1 (TSL1), blue, 41,676 transcripts) and of transcripts ending at TECtool-identified terminal exons (red, 893 transcripts), in biological replicates of RNA-seq from HEK 293 cells (rP values indicate the corresponding Pearson correlations). (b) Translational efficiencies computed for annotated terminal exons, novel terminal exons and intronic regions (two-tailed t-test p-values for pairwise comparisons of regions based on TSL1, novel versus intron replicate 1 (rep1): 5.6e-85; replicate 2 (rep2): 2.3e-84, and annotated versus novel rep1: 1.7e-20; rep2: 2.2e-20). Boxes indicate the interquartile range (IQR) with the line corresponding to the median, whiskers correspond to the most extreme value that is within 1.5 times the IQR from the hinge and outliers beyond this range are shown as individual points. (c) Normalized position-dependent frequencies of ribosome footprints around STOP codons of annotated (upper panel TSL1, lower panel TSL1-5) or novel transcripts. (d) Smoothed (± 5 nt) frequency profiles of the canonical poly(A) signal (AAUAAA) around 3' ends of transcripts predicted as novel relative to TSL1-5 by TECtool, StringTie and Cufflinks, respectively. (e) Venn diagrams showing the number of unique terminal exons, defined only by their 5' ends, that were predicted by Cufflinks, StringTie and TECtool from the two replicate HEK 293 RNA-seq data sets using TSL1-5 annotation. The three-circle Venn diagram shows the relationship between 5' end-defined terminal exons that were predicted in both replicates by the above-mentioned tools. (f) Venn diagrams reflecting the reproducibility of terminal exon prediction by TECtool, StringTie and Cufflinks (again using ENSEMBL v87 TSL1-5 annotation). Two independent biological replicates were analyzed with the mentioned tools to identify novel terminal exons. The overlap was then determined when exons were uniquely defined based on both their 5' and 3' genome coordinates, or based only on the 5' end, or only on the 3' end.

  8. Supplementary Figure 8 Distribution of expression levels inferred by Salmon (ref. 43) from short-read sequencing data.

    RNA-seq was carried out from (a) brain, (b) heart, (c) liver and (d) testis samples, and distributions are shown for various transcript sets: all annotated transcripts (green), transcripts sequenced on the PacBio platform from the corresponding samples (red), and transcripts with novel terminal exons predicted by TECtool (blue). The number of transcripts is indicated in parentheses.

  9. Supplementary Figure 9 Summary of RNA-seq samples from the protein atlas dataset.

    (a) Scatter plot of the number of mapped reads (mate 1) and number of previously unknown terminal exons identified by TECtool. Pearson’s r = 0.56, p-value = 3.93e-17. Spearman’s r = 0.73, p-value = 6.44e-34. (b) Bar plots of the number of novel exons identified from individual tissues and (c) corresponding library sizes. In (b) and (c), black error bars indicate the standard deviation computed from replicate samples. Note for (a-c): Number of samples used for each tissue indicated in parentheses: salivary gland (6), lung (8), liver (5), heart (9), lymph node (13), prostate (7), adipose tissue (6), bone marrow (8), bladder (4), adrenal gland (6), placenta (7), thyroid (9), spleen (5), skin (6), small intestine (8), appendix (6), gallbladder (6), colon (8), testis (8), esophagus (6), endometrium (6), stomach (4), ovary (5), kidney (4), pancreas (4), cerebral cortex (3), duodenum (4), fallopian tube (6), smooth muscle (3), rectum (4), skeletal muscle (6), tonsil (3).

  10. Supplementary Figure 10 Expression of TECtool-identified transcripts across 32 human tissues.

    (a) Distribution of the mean gene expression contribution of the, on average, most highly expressed annotated (blue) or novel (red) isoform to the total expression of the corresponding gene in the indicated tissue. Boxes indicate the interquartile range (IQR) with the line corresponding to the median, whiskers correspond to the most extreme value that is within 1.5 times the IQR from the hinge and outliers beyond this range are shown as individual points. Testis annotated (n=22383), testis novel (n=251), bone marrow annotated (n=12834), bone marrow novel (n=185), fallopian tube annotated (n=16964), fallopian tube novel (n=66), rectum annotated (n=16484), rectum novel (n=59), colon annotated (n=15534), colon novel (n=37), endometrium annotated (n=16503), endometrium novel (n=35), adipose tissue annotated (n=15012), adipose tissue novel (n=32), smooth muscle annotated (n=15547), smooth muscle novel (n=29), tonsil annotated (n=14913), tonsil novel (n=23), lung annotated (n=15829), lung novel (n=18), heart annotated (n=13936), heart novel (n=21), ovary annotated (n=14950), ovary novel (n=20), cerebral cortex annotated (n=17421), cerebral cortex novel (n=24), spleen annotated (n=16402), spleen novel (n=22), lymph node annotated (n=15080), lymph node novel (n=17), kidney annotated (n=16062), kidney novel (n=21), placenta annotated (n=15290), placenta novel (n=16), small intestine annotated (n=15773), small intestine novel (n=20), liver annotated (n=12365), liver novel (n=15), prostate annotated (n=16787), prostate novel (n=8), thyroid annotated (n=16144), thyroid novel (n=14), esophagus annotated (n=14860), esophagus novel (n=11), skeletal muscle annotated (n=11547), skeletal muscle novel (n=13), salivary gland annotated (n=11441), salivary gland novel (n=4), gallbladder annotated (n=16919), gallbladder novel (n=7), skin annotated (n=16977), skin novel (n=7), stomach annotated (n=15357), stomach novel (n=7), adrenal gland annotated (n=15836), adrenal gland novel (n=6), duodenum annotated (n=15768), duodenum novel (n=6), pancreas annotated (n=8364), pancreas novel (n=4), appendix annotated (n=16143), appendix novel (n=3), bladder annotated (n=15639), bladder novel (n=2). (b) Number of genes for which a novel transcript is, on average, the dominant expressed isoform. For (a) and (b) only novel transcripts having a median expression >1 TPM within a specific tissue were considered.

  11. Supplementary Figure 11 TECtool identifies previously unknown isoforms that are expressed in subsets of single cells.

    (a) Fractions of cells expressing annotated (blue density) or novel (red dots) transcripts, as a function of the average expression of these transcripts across all cells. (b) Histograms of the number of transcripts which contribute a specific fraction of expression of their corresponding gene. Novel transcripts are shown in red, annotated transcripts in blue. We subsampled 20 times the set of annotated transcripts with a mean expression across cells similar to that of novel transcripts (subset size equal to the size of the novel transcripts set), and computed means and standard deviations over the 20 resamplings. Only reads that spliced into terminal exons were used to estimate the expression of transcripts containing the respective terminal exons. Furthermore, we only considered cases where there were at least two distinct reads that could be counted towards the expression of a given gene. (c) Sashimi plot of the locus of the O-glucosyltransferase 1 (POGLUT1) gene with the annotated ENSEMBL transcripts (blue), the novel transcripts predicted by TECtool (red), and RNA-seq read densities (gray) within two different cells. Gray arcs indicate spliced reads with their corresponding numbers. The first track indicates that a novel transcript is expressed in cell X, whereas the second track indicates that another cell, Y, expresses the annotated transcript. (d) Similar to (c) but for the Pre-mRNA Processing Factor (BCAS2) gene locus.

  12. Supplementary Figure 12 TECtool analysis of an RNA-seq dataset obtained from mouse CD4+ T cells.

    (a) Overlap of the sets of previously unknown terminal exons identified by TECtool from 3 replicate samples for each of the following CD4+ populations: untreated, and at different time points following activation (0.5, 1, 2, 4, 6, 12, 24, 48 and 72 hours). From replicate 2 of CD4+ cells 48 hours following activation, no novel terminal exons were identified. (b) Sashimi plots of gene structures inferred from the mouse CD4+ cell sequencing data. Part of the Intersectin 2 (Itsn2) gene locus with the annotated ENSEMBL transcript (orange), PASs from the PolyAsite atlas (http://www.polyasite.unibas.ch, red tick marks on the track under the gene structure), and densities of RNA sequencing reads (gray) from 3 replicates of CD4+ T cells, 72 hours after activation. Gray arcs indicate spliced reads with their corresponding numbers. The red dotted box shows the novel terminal exon. (c) Same representation as in (b) but for the Cytidine monophospho-N-acetylneuraminic acid hydroxylase (Cmah) gene locus.
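
The seven read-level features listed in the caption of Supplementary Figure 3 (b-h) can be illustrated with a small Python sketch. This is not TECtool's code: the function names, the feature keys, and the read representation (a sorted list of half-open `(start, end)` aligned blocks, with more than one block meaning a spliced read) are assumptions made for illustration. The sketch also assumes a plus-strand object, so that the 5' boundary is `obj_start` and the 3' boundary is `obj_end`, and it assumes that the "all" features include reads that hit the boundary exactly.

```python
from collections import Counter


def read_features(blocks, obj_start, obj_end):
    """Return the set of (hypothetical) feature names supported by one read.

    blocks: sorted list of half-open (start, end) aligned segments.
    obj_start, obj_end: half-open coordinates of the region ('object').
    """
    features = set()
    if len(blocks) == 1:
        # Unspliced read: check boundary crossing and containment.
        start, end = blocks[0]
        if start < obj_start < end:
            features.add("crossing_in_boundary")
        if start < obj_end < end:
            features.add("crossing_out_boundary")
        if obj_start <= start and end <= obj_end:
            features.add("unspliced_within_boundaries")
        return features

    # Spliced read: each pair of consecutive blocks is one splice junction.
    for (_, donor_end), (acceptor_start, _) in zip(blocks, blocks[1:]):
        # Junction coming from upstream and landing inside the object.
        if donor_end <= obj_start <= acceptor_start < obj_end:
            features.add("splicing_in_all")
            if acceptor_start == obj_start:
                features.add("splicing_in_boundary")
        # Junction leaving the object towards a downstream region.
        if obj_start < donor_end <= obj_end <= acceptor_start:
            features.add("splicing_out_all")
            if donor_end == obj_end:
                features.add("splicing_out_boundary")
    return features


def count_features(reads, obj_start, obj_end):
    """Count, per feature, how many reads support it for one object."""
    counts = Counter()
    for blocks in reads:
        counts.update(read_features(blocks, obj_start, obj_end))
    return counts


# Toy reads around an object spanning [100, 200); coordinates are invented.
example_reads = [
    [(50, 80), (100, 130)],    # splices exactly onto the 5' boundary
    [(50, 80), (120, 150)],    # splices into the interior of the object
    [(170, 200), (300, 330)],  # splices out exactly from the 3' boundary
    [(150, 180), (300, 330)],  # splices out from the interior
    [(90, 120)],               # unspliced, overlaps the 5' boundary
    [(190, 220)],              # unspliced, overlaps the 3' boundary
    [(120, 150)],              # unspliced, contained in the object
]
print(dict(count_features(example_reads, 100, 200)))
```

Under these assumptions, a terminal exon would be expected to show splicing-in support at its 5' boundary without matching splicing-out support at its 3' boundary, which is the kind of contrast such counts are meant to expose.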

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–12, Supplementary Note and Supplementary Table 1

  2. Reporting Summary

  3. Supplementary Dataset 1

    Sample IDs from datasets that were used in this study

  4. Source Data Figure 1

  5. Source Data Figure 3

  6. Source Data Figure 4

About this article

DOI

https://doi.org/10.1038/s41592-018-0114-z