Transcriptome variation in human tissues revealed by long-read sequencing

Glinos, Dafni A.; Garborcauskas, Garrett; Hoffman, Paul; Ehsan, Nava; Jiang, Lihua; Gokden, Alper; Dai, Xiaoguang; Aguet, François; Brown, Kathleen L.; Garimella, Kiran; Bowers, Tera; Costello, Maura; Ardlie, Kristin; Jian, Ruiqi; Tucker, Nathan R.; Ellinor, Patrick T.; Harrington, Eoghan D.; Tang, Hua; Snyder, Michael; Juul, Sissel; Mohammadi, Pejman; MacArthur, Daniel G.; Lappalainen, Tuuli; Cummings, Beryl B.

doi:10.1038/s41586-022-05035-y

Article
Published: 03 August 2022

Nature volume 608, pages 353–359 (2022)

Abstract

Regulation of transcript structure generates transcript diversity and plays an important role in human disease^{1,2,3,4,5,6,7}. The advent of long-read sequencing technologies offers the opportunity to study the role of genetic variation in transcript structure^{8,9,10,11,12,13,14,15,16}. In this Article, we present a large human long-read RNA-seq dataset using the Oxford Nanopore Technologies platform from 88 samples from Genotype-Tissue Expression (GTEx) tissues and cell lines, complementing the GTEx resource. We identified just over 70,000 novel transcripts for annotated genes, and validated the protein expression of 10% of novel transcripts. We developed a new computational package, LORALS, to analyse the genetic effects of rare and common variants on the transcriptome by allele-specific analysis of long reads. We characterized allele-specific expression and transcript structure events, providing new insights into the specific transcript alterations caused by common and rare genetic variants and highlighting the resolution gained from long-read data. We were able to perturb the transcript structure upon knockdown of PTBP1, an RNA binding protein that mediates splicing, thereby finding genetic regulatory effects that are modified by the cellular environment. Finally, we used this dataset to enhance variant interpretation and study rare variants leading to aberrant splicing patterns.

**Fig. 1: Overview and quality control of the dataset.**

**Fig. 2: Discovery of novel transcripts and comparison between tissues.**

**Fig. 3: Allelic analysis of long-read data.**

**Fig. 4: Variant interpretation through novel transcripts and ASTS analysis.**

Data availability

Raw long-read data generated as part of this manuscript are available in the GTEx v.9 release under dbGaP accession number phs000424.v9 and on AnVIL at https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_GTEx_V9_hg38. The GTF file of FLAIR transcripts, along with the transcript-level overall and allelic expression quantifications from GENCODE v.26 and FLAIR transcripts, is available on the GTEx portal (https://gtexportal.org/home/datasets). The GTEx WGS, Illumina short-read, allelic analysis, eQTL, sQTL and enloc colocalization files are all part of the GTEx v.8 release (phs000424.v8). In addition, we used the transcript and gene counts available from https://gtexportal.org/home/datasets. The GRCh38 human genome reference and GENCODE v.26 processed for GTEx were used in this analysis (https://console.cloud.google.com/storage/browser/gtex-resources). The CHESS and Workman transcript datasets were downloaded from GitHub (https://github.com/chess-genome/chess and https://github.com/nanopore-wgs-consortium/NA12878). ENCODE eCLIP data were downloaded from https://www.encodeproject.org/.

Code availability

All original code used in the manuscript is released as part of a software package, https://github.com/LappalainenLab/lorals. General scripts are available at https://github.com/LappalainenLab/lorals_paper_code (https://doi.org/10.5281/zenodo.6529254).

References

1. Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102, 11–26 (2018).
2. Nicolae, D. L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).
3. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).
4. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
5. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).
6. Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).
7. Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).
8. Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).
9. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
10. Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci. Rep. 6, 31602 (2016).
11. Weirather, J. L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res. 6, 100 (2017).
12. Anvar, S. Y. et al. Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA processing. Genome Biol. 19, 46 (2018).
13. Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018).
14. Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).
15. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. USA 111, 9869–9874 (2014).
16. Tilgner, H. et al. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nat. Biotechnol. 33, 736–742 (2015).
17. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506 (2013).
18. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014).
19. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
20. Rivas, M. A. et al. Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).
21. Smith, D. et al. A rare IL33 loss-of-function mutation reduces blood eosinophil counts and protects from asthma. PLoS Genet. 13, e1006659 (2017).
22. Mohammadi, P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science 366, 351–356 (2019).
23. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
24. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 12, 323 (2011).
25. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 888 (2016).
26. Teng, M. et al. A benchmark for RNA-seq quantification pipelines. Genome Biol. 17, 74 (2016).
27. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
28. Pai, A. A. et al. Widespread shortening of 3′ untranslated regions and increased exon inclusion are evolutionarily conserved features of innate immune responses to infection. PLoS Genet. 12, e1006338 (2016).
29. Alasoo, K. et al. Genetic effects on promoter usage are highly context-specific and contribute to complex traits. eLife 8, e41673 (2019).
30. Mittleman, B. E. et al. Alternative polyadenylation mediates genetic regulation of gene expression. eLife 9, e57492 (2020).
31. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).
32. Jiang, L. et al. A quantitative proteome map of the human body. Cell 183, 269–283.e19 (2020).
33. Yeo, G., Holste, D., Kreiman, G. & Burge, C. B. Variation in alternative splicing across human tissues. Genome Biol. 5, R74 (2004).
34. Reyes, A. & Huber, W. Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Res. 46, 582–592 (2018).
35. Castel, S. E. et al. A vast resource of allelic expression data spanning human tissues. Genome Biol. 21, 234 (2020).
36. Van Nostrand, E. L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711–719 (2020).
37. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
38. Ferraro, N. M. et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).
39. Yang, X. et al. Widespread expansion of protein interaction capabilities by alternative splicing. Cell 164, 805–817 (2016).
40. Castel, S. E., Levy-Moonshine, A., Mohammadi, P., Banks, E. & Lappalainen, T. Tools and best practices for data processing in allelic expression analysis. Genome Biol. 16, 195 (2015).
41. Sibley, C. R. et al. Recursive splicing in long vertebrate genes. Nature 521, 371–375 (2015).
42. Scotti, M. M. & Swanson, M. S. RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016).
43. Gandal, M. J. et al. Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. Science 362, eaat8127 (2018).
44. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
45. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
46. De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669 (2018).
47. Alasoo, K. Wiggleplotr: Make read coverage plots from bigwig files. Bioconductor https://doi.org/10.18129/B9.bioc.wiggleplotr (2017).
48. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).
49. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
50. Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol. 19, 40 (2018).
51. Keller, A., Nesvizhskii, A. I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).
52. Deutsch, E. W. et al. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin. Appl. 9, 745–754 (2015).
53. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
54. Nowicka, M. & Robinson, M. D. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res. 5, 1356 (2016).
55. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
56. Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Res. 27, 1872–1884 (2017).
57. Cohen, J. Statistical Power Analysis for the Behavioral Sciences. (Academic Press, 2013).
58. Van Nostrand, E. L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711–719 (2020).
59. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
60. Gremme, G., Steinbiss, S. & Kurtz, S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 645–656 (2013).
61. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).

Acknowledgements

We thank M. Micorescu and K. Potamousis from the Oxford Nanopore Technologies commercial team for their help in generating the data. D.A.G. was funded by NIH grant nos. R01GM124486 and U24DK112331. T.L. was funded by NIH grant nos. R01GM124486, R01GM122924, R01AG057422 and UM1HG008901. P.H. was funded by NIH grant no. R01GM124486. A.G. was funded by a Roy and Diana Vagelos Pilot Grant. Funding for long-read sequencing of GTEx samples at the Broad was provided by a Broad Ignite grant. N.E. and P.M. were supported by NIGMS award no. R01GM140287. N.R.T. was funded by NIH grant no. K01-HL140187.

Author information

These authors contributed equally: Dafni A. Glinos, Garrett Garborcauskas
These authors jointly supervised this work: Tuuli Lappalainen, Beryl B. Cummings

Authors and Affiliations

New York Genome Center, New York, NY, USA
Dafni A. Glinos, Paul Hoffman, Alper Gokden, Kathleen L. Brown & Tuuli Lappalainen
Department of Systems Biology, Columbia University, New York, NY, USA
Dafni A. Glinos & Tuuli Lappalainen
Medical and Population Genetics Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
Garrett Garborcauskas, Daniel G. MacArthur & Beryl B. Cummings
Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
Nava Ehsan & Pejman Mohammadi
Department of Genetics, Stanford University, Stanford, CA, USA
Lihua Jiang, Ruiqi Jian, Hua Tang & Michael Snyder
Oxford Nanopore Technologies, New York, NY, USA
Xiaoguang Dai, Eoghan D. Harrington & Sissel Juul
Broad Institute of MIT and Harvard, Cambridge, MA, USA
François Aguet, Kiran Garimella, Tera Bowers, Maura Costello & Kristin Ardlie
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Kathleen L. Brown
Masonic Medical Research Institute, Utica, NY, USA
Nathan R. Tucker
Cardiovascular Disease Initiative, The Broad Institute of Harvard and MIT, Cambridge, MA, USA
Nathan R. Tucker & Patrick T. Ellinor
Scripps Research Translational Institute, La Jolla, CA, USA
Pejman Mohammadi
Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, New South Wales, Australia
Daniel G. MacArthur
Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Victoria, Australia
Daniel G. MacArthur
Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
Tuuli Lappalainen

Contributions

D.A.G., T.L. and B.C. conceived and designed the project. D.A.G. performed most of the data analysis. G.G., A.G. and X.D. carried out the library preparation and sequencing. P.H. packaged the code. L.J., R.J., H.T. and M.S. provided and analysed the data for the proteomic validation. P.H. and K.B. assisted in allelic expression analysis. K.G. carried out the base-calling. N.E. and P.M. performed power analyses and advised on analysis methods. A.G. carried out the PTBP1 knockdown. N.R.T. and P.E. provided the CVD samples. T.B. and M.C. aided in the data generation. F.A., K.A., E.D.H., S.J., D.G.M., B.C. and T.L. provided feedback on the study design and data analysis. N.E., F.A., N.R.T., E.D.H., S.J., P.M. and D.G.M. provided feedback on the manuscript. E.D.H., S.J., D.G.M. and T.L. supervised the work. D.A.G., T.L. and B.C. wrote the manuscript with contributions from other authors. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Dafni A. Glinos, Garrett Garborcauskas, Tuuli Lappalainen or Beryl B. Cummings.

Ethics declarations

Competing interests

D.A.G. is currently a fellow at Vertex Pharmaceuticals. X.D., E.D.H. and S.J. are employees of Oxford Nanopore Technologies and are shareholders and/or share option holders. F.A. has been an employee of Illumina, Inc., since 8 November 2021. P.T.E. has received sponsored research support from Bayer AG and IBM Health, and he has served on advisory boards or consulted for Bayer AG, MyoKardia and Novartis; none of these activities are related to the work presented here. D.G.M. is a founder with equity in Goldfinch Bio, a paid advisor to GSK, Insitro, Third Rock Ventures and Foresite Labs, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer and Sanofi-Genzyme; none of these activities are related to the work presented here. B.C. is currently employed at Third Rock Ventures. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Quality control of the dataset.

A) Number and B) median length of raw and aligned reads per sample. The diagonal lines correspond to intercept = 0. With the dashed black circle, we highlight the two samples sequenced using the direct-cDNA technology. C) Read number and read length in two fibroblast cell line samples (one of which was sequenced in replicate) that were sequenced using both the direct-cDNA and the PCR-cDNA protocol for 48 h. P values were calculated using a two-sided t-test. Error bars: standard deviation from the mean. D) Hierarchical clustering using Euclidean distance for replicate samples aligned to GENCODE for transcripts with expression above 3 TPM in at least 5 samples. E) Principal component analysis using all 88 samples aligned to GENCODE (v26) for transcripts with expression above 3 TPM in at least 5 samples.

Extended Data Fig. 2 Comparison between ONT and Illumina gene expression.

A) Correlation between the transcriptome of each sample quantified by ONT and by Illumina sequencing technologies. B) Normalized gene and transcript expression for high residual (|residual| > 1) genes and transcripts retrieved from the Spearman correlation analysis. C) Characteristics of genes and D) transcripts with high or low residuals with respect to gene/transcript length, number of transcripts per gene and number of exons per transcript.

Extended Data Fig. 3 Three prime bias analysis using mitochondrial reads.

A) Observed versus expected read length for one sample sequenced using both direct (Spearman R² = 0.3) and PCR cDNA (Spearman R² = 0.26) protocols. The discrete clusters below the diagonal represent incorrect assignments to GENCODE isoforms (potential novel transcripts), and the diffuse shading represents fragmented RNA. Relationship between the expected transcript read length and the fraction of observed nanopore poly(A) RNA reads over the expected full length by sample for B) all samples and C) only fibroblasts. Labels are for mitochondrial genes without the MT prefix. The median was calculated per sample and error bars represent standard deviation. D) Median fraction of full-length per method by which RNA was isolated. E) Comparison of alternative transcription structure events found in highly expressed transcripts in the top and bottom 10% of samples ranked by 3′ bias. We observed no difference between the two deciles when using a two-sided proportionality test.
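As a rough illustration of the quantity plotted in panels B to D, the sketch below computes the fraction of the expected full-length transcript covered by each read and takes the median per sample. Capping the fraction at 1 and the names `fraction_full_length` and `median_fraction` are assumptions made for this example only.

```python
from statistics import median
from typing import List


def fraction_full_length(observed_lengths: List[int], expected_length: int) -> List[float]:
    """Per-read fraction of the expected transcript length, capped at 1 (assumed convention)."""
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    return [min(obs / expected_length, 1.0) for obs in observed_lengths]


def median_fraction(observed_lengths: List[int], expected_length: int) -> float:
    """Median fraction of full length for the reads of one transcript in one sample."""
    return median(fraction_full_length(observed_lengths, expected_length))


# Toy read lengths for a hypothetical 1,000-nt mitochondrial transcript.
toy_reads = [500, 750, 1000, 1000]
toy_median = median_fraction(toy_reads, 1000)
```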

Extended Data Fig. 4 FLAIR transcript characterisation.

A) FLAIR transcripts comparison to GENCODE with respect to different genomic levels. The difference between intron chain and transcript is that the former only looks at matching the intron boundaries, therefore allowing variation in the UTR regions. B) Transcript length and C) number of exons per transcript classified by comparison to GENCODE. D) Number of overlapping transcripts between the ones identified in this paper and the ones released by a) CHESS and b) Workman.

Extended Data Fig. 5 Protein validation analysis of transcripts from matched tissues.

A) Percentage of validated transcripts at the protein level using mass spectrometry for different TPM thresholds. Each point represents the mean across n = 7 assayed tissues and error bars represent the standard deviation. B) Mean expression per tissue with over one sample (lung, liver, heart, muscle and brain) of annotated and novel transcripts stratified by their validation status. The vertical line denotes the 5 TPM threshold used. C) Percentage of validated transcripts at the protein level using mass spectrometry per primary tissue. D) Proportion of the AltTS events validated per tissue. E) MLF1 is an example of a gene with multiple highly-expressed transcripts across both muscle and heart tissues with a different transcript validated in each tissue. A3: alternative 3′ splice site; A5: alternative 5′ splice site; AF: alternative first exon; AL: alternative last exon; A3UTR: alternative 3′ end; A5UTR: alternative 5′ end; MX: mutually exclusive exons; RI: retained intron; SE: skipped exon.

Extended Data Fig. 6 Transcript expression overview of novel and annotated transcripts.

A) Hierarchical clustering using Euclidean distance and B) principal component analysis for selected samples aligned to GENCODE for transcripts with expression above five TPM in at least three samples, separated based on whether they are novel or not. C) Proportion of transcripts expressed at different TPM thresholds and classified based on how many tissues express the transcript in at least two samples. The total number of transcripts per group and threshold is included in the legend.

Extended Data Fig. 7 Differential transcript expression and usage between tissues.

Heatmap of number of differentially A) upregulated transcripts and B) used transcripts (FDR ≤ 0.05) in pairwise comparison of tissues with at least five samples. In differential expression analysis we identify up- or down-regulated transcripts per pairwise comparison (asymmetrical heatmap) while in the differential transcript usage analysis there is no direction of effect (symmetrical heatmap). C) Number of differentially expressed or used transcripts that were specific to that tissue. D) Median gene expression across all transcripts that are specific to a tissue.

Extended Data Fig. 8 LORALS pipeline development and aligning statistics.

A) Pipeline for allele-specific analysis. Raw long reads are first aligned to the genome using minimap2. This alignment is used to correct the phase of some of the heterozygous variants in the whole-genome sequencing VCF. The phase-corrected VCF is then used to generate personalized genome reference files, against which the raw reads are again aligned using minimap2. The raw reads are also aligned to the transcriptome using minimap2. The VCF file, along with the genome-aligned and transcriptome-aligned reads, is then fed into LORALS for allelic analysis. B) Percentage of switched haplotypes per donor informed by the long-read data. For this, all samples from the same donor were merged to harmonize the files. C) Percentage of haplotype-specific reads, calculated as reads having a higher mapping score when using a personalized genome reference. D) Delta, calculated as the difference in the start position of the aligned read between the genome-aligned and the personalized-genome-aligned reads. Reads with Delta = 0 are not shown. E) Reference ratio for the samples present in this study sequenced using Illumina technology and ONT technology aligned with two different approaches.
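The two read-level summaries in panels C and D can be sketched as follows. This is an illustration and not the LORALS implementation: the per-read dictionaries, the function names, and the sign convention for Delta (standard-genome start minus personalized-genome start) are all assumptions, and "mapping score" is treated as any per-read alignment score where higher is better.

```python
from typing import Dict


def haplotype_specific_pct(
    ref_scores: Dict[str, int], personalized_scores: Dict[str, int]
) -> float:
    """Percentage of reads scoring higher against the personalized reference."""
    shared = ref_scores.keys() & personalized_scores.keys()
    if not shared:
        return 0.0
    n_better = sum(1 for read in shared if personalized_scores[read] > ref_scores[read])
    return 100.0 * n_better / len(shared)


def start_deltas(
    ref_starts: Dict[str, int], personalized_starts: Dict[str, int]
) -> Dict[str, int]:
    """Non-zero start-position differences per read (assumed sign: reference minus personalized)."""
    deltas = {}
    for read in ref_starts.keys() & personalized_starts.keys():
        delta = ref_starts[read] - personalized_starts[read]
        if delta != 0:  # reads with Delta = 0 are left out, as in panel D
            deltas[read] = delta
    return deltas


# Toy (made-up) alignments for four reads.
toy_ref_scores = {"r1": 60, "r2": 40, "r3": 60, "r4": 30}
toy_pers_scores = {"r1": 60, "r2": 60, "r3": 60, "r4": 30}
toy_pct = haplotype_specific_pct(toy_ref_scores, toy_pers_scores)

toy_ref_starts = {"r1": 100, "r2": 205, "r3": 300, "r4": 400}
toy_pers_starts = {"r1": 100, "r2": 200, "r3": 300, "r4": 400}
toy_deltas = start_deltas(toy_ref_starts, toy_pers_starts)
```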

Extended Data Fig. 9 LORALS pipeline allele specific analysis filter setting.

A) Reference ratio and B) normalized read counts across the different filter flags for the Illumina and ONT sequencing technologies. Mapping bias: mapping bias in simulations; Low mappability: low-mappability regions (75-mer mappability < 1, based on 75-mer alignments with up to two mismatches, following the pipeline for ENCODE tracks and available on the GTEx portal); Genotype warning: no more reads supporting two alleles than would be expected from sequencing noise alone, indicating potential genotyping errors (FDR < 1%); Blacklist: ENCODE blacklist; Multi-mapping: regions with multi-mapping reads constructed using the alignability track from UCSC using a threshold of 0.1 (so that a 100-mer aligning to that site aligns to at least 5 other locations in the genome with up to 2 mismatches); Other allele warning: regions where the proportion of reads containing the reference or alternative allele is lower than 0.8; High indel warning: sites where the proportion of non-indel-containing reads is lower than 0.8. C) Reference ratios and normalized read counts of all kept sites across Illumina and ONT sequencing technologies. D) Distribution of the high indel warning ratios and the other allele ratios across all samples. E) Proportion of genes with at least 20 overlapping reads flagged per filter. The proportion was calculated across all genes for each sample (n = 59). The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge.
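The two read-based flags (other allele warning and high indel warning) can be sketched as a simple per-site check. The legend gives only the 0.8 threshold, so the denominators are an assumption: here the indel ratio is taken over all reads at the site, and the reference/alternative ratio over non-indel reads. The `SiteCounts` container and `site_flags` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteCounts:
    ref: int    # reads carrying the reference allele
    alt: int    # reads carrying the alternative allele
    other: int  # reads carrying a base that is neither reference nor alternative
    indel: int  # reads with an insertion or deletion at the site


def site_flags(counts: SiteCounts, threshold: float = 0.8) -> Dict[str, bool]:
    """Flag a heterozygous site using the 0.8 thresholds described in the legend."""
    flags = {"other_allele_warning": False, "high_indel_warning": False}
    non_indel = counts.ref + counts.alt + counts.other
    total = non_indel + counts.indel
    if total == 0:
        return flags
    # High indel warning: too few reads without an indel at the site.
    flags["high_indel_warning"] = non_indel / total < threshold
    # Other allele warning: too few reads carrying either expected allele.
    if non_indel > 0:
        flags["other_allele_warning"] = (counts.ref + counts.alt) / non_indel < threshold
    return flags


# Toy (made-up) sites.
clean_site = site_flags(SiteCounts(ref=45, alt=45, other=5, indel=5))
noisy_site = site_flags(SiteCounts(ref=30, alt=30, other=20, indel=30))
```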

Extended Data Fig. 10 Allele specific analysis on all GTEx samples.

A) Estimated power for different numbers of transcripts (2, 3 or 5) with respect to the coverage (x-axis) for effect sizes 0.1 (low), 0.3 (medium) and 0.5 (high), derived from simulated count data. B) Number of genes tested for allele-specific expression (ASE) and allele-specific transcript structure (ASTS) and number of significant genes. The diagonal indicates the median percentage of significant genes (9% and 9%, respectively). C) Number of transcripts per gene tested for ASTS. D) Number of genes with allelic data across donors per tissue for ASE and ASTS. E) Total number of genes calculated per sample (n = 59) at different levels of power. Outliers are hidden for ease of viewing. The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge.

Extended Data Fig. 11 Comparison of allele specific expression between ONT and Illumina.

A) Proportion of significant ASE genes discovered using Illumina or ONT and replicated in the other method. π1 calculations are carried out up to P value = 0.5. B) Log allelic fold change in Illumina and ONT data for the shared ASE genes. C) Number of ASE events with opposite directions between ONT and Illumina per sample. Highlighted are the five fibroblast cell lines that were further cultured prior to sequencing, where 12/57 events were observed. RNA-seq read pile-ups for D) CCDC69 and E) ACSSL3, which have ASE in the opposite direction between the two methods. The variant used to assign the reads to the two haplotypes is shown in red. CCDC69 differences can be attributed to a depression in the Illumina read pile-ups, while ACSSL3 differences can be attributed to the variant being in the 3′UTR, which is not well captured by Illumina reads. F) Venn diagram of the significant ASE genes discovered by Illumina and ONT. The log₂ of total counts for each method is shown for each group of the Venn diagram.

Extended Data Fig. 12 Alternative transcript structure event annotation of allele specific events.

A) Percentage of genes displaying a single alternative transcript structure event. P values were calculated using a two-sided binomial test. A3: alternative 3′ splice site; A5: alternative 5′ splice site; AF: alternative first exon; AL: alternative last exon; A3UTR: alternative 3′ end; A5UTR: alternative 5′ end; MX: mutually exclusive exons; RI: retained intron; SE: skipped exon. B) Average relative location of the heterozygous variant used for each ASTS event, obtained by grouping all the transcripts of an ASTS event together. C) Read pile-ups per transcript for the two donors displaying ASTS in the DUSP13 gene. In the lower panel the transcript structure is shown, without details of the coding sequence. D) Transcript percentage for four of the five DUSP13 annotated transcripts with high read coverage in GTEx v8.
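As an illustration of the two-sided binomial test named in panel A, the sketch below computes an exact P value by summing the probabilities of all outcomes that are no more likely than the observed one. The function name and the default null proportion of 0.5 are assumptions; the legend does not state the null proportion or the software used.

```python
from math import exp, lgamma, log


def binomial_test_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial P value for k successes in n trials.

    Assumption: 0 < p < 1, where p is the null (background) proportion.
    """
    def log_pmf(i: int) -> float:
        # Log-space binomial probability, to avoid overflow for large n.
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))

    pmf = [exp(log_pmf(i)) for i in range(n + 1)]
    observed = pmf[k]
    # A small relative tolerance keeps tied outcomes on both tails.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)
```

With 10 trials and a null proportion of 0.5, observing 0 successes gives a P value of 2/1024, because the equally unlikely outcome of 10 successes is counted as well.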

Extended Data Fig. 13 Differential expression analysis between PTBP1-KD and control samples.

A) Volcano plot of differential gene expression between control and PTBP1 knockdown samples using ONT data. P values were calculated using the Wald test in DESeq2. B) Gene expression profiles of PTBP1 and PTBP2 (under normal circumstances, PTBP2 expression is restricted to the brain). C) Proportion of different alternative transcript structure events in transcripts upregulated in the control or the PTBP1 knockdown samples. P values were calculated using a two-sided proportionality test.

Extended Data Fig. 14 Allele specific analysis of PTBP1-KD and control samples.

A) Correlation between the control and the PTBP1 knockdown samples in the reference ratio of gene expression and transcript structure. B) Changes in ASTS by PTBP1 knockdown, as assayed by gridION, with the heatmap showing the co-occurrence of alternative transcript structure events that are observed at least once per each event (or a single time for the diagonal) in a given gene. Color corresponds to the log₂ ratio of the number of events found in the control over PTBP1 knockdown (KD) samples. C) Number of significant ASE and ASTS genes found by gridION categorized based on their status in the PromethION data from the same samples. D) Proportion of genes displaying allele-specific patterns specifically in either control or PTBP1 knockdown samples. E) SLC1A5 gene transcript read pile-ups, which display significant ASTS only in the control sample. The arrow indicates the location of the PTBP1 eCLIP site which contains a heterozygous variant in that donor.

Extended Data Fig. 15 Variant interpretation through novel transcripts and allele-specific transcript structure analysis.

A) Number of variants per variant effect predictor (VEP) category using GENCODE v26 protein-coding genes with or without novel FLAIR transcripts. B) CADD score distribution of variants that were reassigned to a more severe consequence when the GENCODE gene annotations were complemented with the novel FLAIR transcripts, compared to variants that retained their annotation (downsampled to a similar size). P values were calculated using a two-sided t-test. The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge. C) Percentage of variants per clinical significance category that are reassigned when supplementing the gene annotation with the novel transcripts. The numbers above the bars correspond to the number of reassigned variants. D) Number of rare variants per ASTS gene (10kb window around gene). E) Proportion of rare heterozygous variants per annotation in significant ASTS events. As a background all ASTS events were used, and P values were calculated using a two-sided binomial test. F) Enrichment of the significant ASTS genes within splicing outliers. As a background all ASTS genes were used and P values were calculated using a two-sided binomial test. G) NDUFS4 as an example of a gene with a rare heterozygous variant in a sample that is a GTEx splicing outlier and has significant ASTS, with read pile-ups and grey arrows indicating the rare variants. Log normalized transcript counts per allele are plotted per transcript, with the REF:ALT ratios marked for each.

Supplementary information

Supplementary Information

Supplementary Methods, legends for Supplementary Tables 1–12 and Supplementary Fig. 1.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–12; see main Supplementary file for legends.

About this article

Cite this article

Glinos, D.A., Garborcauskas, G., Hoffman, P. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022). https://doi.org/10.1038/s41586-022-05035-y

Received: 25 February 2021
Accepted: 28 June 2022
Published: 03 August 2022
Issue Date: 11 August 2022
DOI: https://doi.org/10.1038/s41586-022-05035-y
