Transcriptome variation in human tissues revealed by long-read sequencing

Abstract

Regulation of transcript structure generates transcript diversity and plays an important role in human disease (refs. 1–7). The advent of long-read sequencing technologies offers the opportunity to study the role of genetic variation in transcript structure (refs. 8–16). In this Article, we present a large human long-read RNA-seq dataset using the Oxford Nanopore Technologies platform from 88 samples from Genotype-Tissue Expression (GTEx) tissues and cell lines, complementing the GTEx resource. We identified just over 70,000 novel transcripts for annotated genes, and validated the protein expression of 10% of novel transcripts. We developed a new computational package, LORALS, to analyse the genetic effects of rare and common variants on the transcriptome by allele-specific analysis of long reads. We characterized allele-specific expression and transcript structure events, providing new insights into the specific transcript alterations caused by common and rare genetic variants and highlighting the resolution gained from long-read data. We were able to perturb the transcript structure upon knockdown of PTBP1, an RNA binding protein that mediates splicing, thereby finding genetic regulatory effects that are modified by the cellular environment. Finally, we used this dataset to enhance variant interpretation and study rare variants leading to aberrant splicing patterns.

Fig. 1: Overview and quality control of the dataset.
Fig. 2: Discovery of novel transcripts and comparison between tissues.
Fig. 3: Allelic analysis of long-read data.
Fig. 4: Variant interpretation through novel transcripts and ASTS analysis.

Data availability

Raw long-read data generated as part of this manuscript are available in the GTEx v.9 release under dbGaP accession number phs000424.v9 and on AnVIL at https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_GTEx_V9_hg38. The GTF file of FLAIR transcripts, along with the transcript-level overall and allelic expression quantifications from GENCODE v.26 and FLAIR transcripts, are available on the GTEx portal (https://gtexportal.org/home/datasets). The GTEx WGS, Illumina short-read, allelic analysis, eQTL, sQTL and enloc colocalization files are all part of the GTEx v.8 release (phs000424.v8). In addition, we used the transcript and gene counts available from https://gtexportal.org/home/datasets. The GRCh38 human genome reference and GENCODE v.26 processed for GTEx were used in this analysis (https://console.cloud.google.com/storage/browser/gtex-resources). The CHESS and Workman transcript datasets were downloaded from GitHub (https://github.com/chess-genome/chess and https://github.com/nanopore-wgs-consortium/NA12878). ENCODE eCLIP data were downloaded from https://www.encodeproject.org/.

Code availability

All original code used in the manuscript is released as part of a software package, https://github.com/LappalainenLab/lorals. General scripts are available at https://github.com/LappalainenLab/lorals_paper_code (https://doi.org/10.5281/zenodo.6529254).

References

  1. Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102, 11–26 (2018).

  2. Nicolae, D. L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).

  3. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).

  4. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  5. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).

  6. Kremer, L. S. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824 (2017).

  7. Gonorazky, H. D. et al. Expanding the boundaries of RNA sequencing as a diagnostic tool for rare Mendelian disease. Am. J. Hum. Genet. 104, 466–483 (2019).

  8. Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).

  9. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

  10. Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci Rep. 6, 31602 (2016).

  11. Weirather, J. L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res. 6, 100 (2017).

  12. Anvar, S. Y. et al. Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA processing. Genome Biol. 19, 46 (2018).

  13. Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018).

  14. Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).

  15. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. USA 111, 9869–9874 (2014).

  16. Tilgner, H. et al. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nat. Biotechnol. 33, 736–742 (2015).

  17. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506 (2013).

  18. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014).

  19. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).

  20. Rivas, M. A. et al. Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).

  21. Smith, D. et al. A rare IL33 loss-of-function mutation reduces blood eosinophil counts and protects from asthma. PLoS Genet. 13, e1006659 (2017).

  22. Mohammadi, P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science 366, 351–356 (2019).

  23. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

  24. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 12, 323 (2011).

  25. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 888 (2016).

  26. Teng, M. et al. A benchmark for RNA-seq quantification pipelines. Genome Biol. 17, 74 (2016).

  27. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  28. Pai, A. A. et al. Widespread shortening of 3′ untranslated regions and increased exon inclusion are evolutionarily conserved features of innate immune responses to infection. PLoS Genet. 12, e1006338 (2016).

  29. Alasoo, K. et al. Genetic effects on promoter usage are highly context-specific and contribute to complex traits. eLife 8, e41673 (2019).

  30. Mittleman, B. E. et al. Alternative polyadenylation mediates genetic regulation of gene expression. eLife 9, e57492 (2020).

  31. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).

  32. Jiang, L. et al. A quantitative proteome map of the human body. Cell 183, 269–283.e19 (2020).

  33. Yeo, G., Holste, D., Kreiman, G. & Burge, C. B. Variation in alternative splicing across human tissues. Genome Biol. 5, R74 (2004).

  34. Reyes, A. & Huber, W. Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Res. 46, 582–592 (2018).

  35. Castel, S. E. et al. A vast resource of allelic expression data spanning human tissues. Genome Biol. 21, 234 (2020).

  36. Van Nostrand, E. L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711–719 (2020).

  37. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).

  38. Ferraro, N. M. et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).

  39. Yang, X. et al. Widespread expansion of protein interaction capabilities by alternative splicing. Cell 164, 805–817 (2016).

  40. Castel, S. E., Levy-Moonshine, A., Mohammadi, P., Banks, E. & Lappalainen, T. Tools and best practices for data processing in allelic expression analysis. Genome Biol. 16, 195 (2015).

  41. Sibley, C. R. et al. Recursive splicing in long vertebrate genes. Nature 521, 371–375 (2015).

  42. Scotti, M. M. & Swanson, M. S. RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016).

  43. Gandal, M. J. et al. Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. Science 362, eaat8127 (2018).

  44. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).

  45. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  46. De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669 (2018).

  47. Alasoo, K. Wiggleplotr: Make read coverage plots from bigwig files. Bioconductor https://doi.org/10.18129/B9.bioc.wiggleplotr (2017).

  48. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).

  49. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Research 9, 304 (2020).

  50. Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol. 19, 40 (2018).

  51. Keller, A., Nesvizhskii, A. I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).

  52. Deutsch, E. W. et al. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin. Appl. 9, 745–754 (2015).

  53. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  54. Nowicka, M. & Robinson, M. D. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research 5, 1356 (2016).

  55. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

  56. Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Res. 27, 1872–1884 (2017).

  57. Cohen, J. Statistical Power Analysis for the Behavioral Sciences. (Academic Press, 2013).

  58. Van Nostrand, E. L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711–719 (2020).

  59. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  60. Gremme, G., Steinbiss, S. & Kurtz, S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 645–656 (2013).

  61. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).

Acknowledgements

We thank M. Micorescu and K. Potamousis from the Oxford Nanopore Technologies commercial team for their help in generating the data. D.A.G. was funded by NIH grant nos. R01GM124486 and U24DK112331. T.L. was funded by NIH grant nos. R01GM124486, R01GM122924, R01AG057422 and UM1HG008901. P.H. was funded by NIH grant no. R01GM124486. A.G. was funded by a Roy and Diana Vagelos Pilot Grant. Funding for long-read sequencing of GTEx samples at the Broad was provided by a Broad Ignite grant. N.E. and P.M. were supported by NIGMS award no. R01GM140287. N.R.T. was funded by NIH grant no. K01-HL140187.

Author information

Contributions

D.A.G., T.L. and B.C. conceived and designed the project. D.A.G. performed most of the data analysis. G.G., A.G. and X.D. carried out the library preparation and sequencing. P.H. packaged the code. L.J., R.J., H.T. and M.S. provided and analysed the data for the proteomic validation. P.H. and K.B. assisted in allelic expression analysis. K.G. carried out the base-calling. N.E. and P.M. performed power analyses and advised on analysis methods. A.G. carried out the PTBP1 knockdown. N.R.T. and P.E. provided the CVD samples. T.B. and M.C. aided in the data generation. F.A., K.A., E.D.H., S.J., D.G.M., B.C. and T.L. provided feedback on the study design and data analysis. N.E., F.A., N.R.T., E.D.H., S.J., P.M. and D.G.M. provided feedback on the manuscript. E.D.H., S.J., D.G.M. and T.L. supervised the work. D.A.G., T.L. and B.C. wrote the manuscript with contributions from other authors. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Dafni A. Glinos, Garrett Garborcauskas, Tuuli Lappalainen or Beryl B. Cummings.

Ethics declarations

Competing interests

D.A.G. is currently a fellow at Vertex Pharmaceuticals. X.D., E.D.H. and S.J. are employees of Oxford Nanopore Technologies and are shareholders and/or share option holders. F.A. has been an employee of Illumina, Inc., since 8 November 2021. P.T.E. has received sponsored research support from Bayer AG and IBM Health, and he has served on advisory boards or consulted for Bayer AG, MyoKardia and Novartis; none of these activities are related to the work presented here. D.G.M. is a founder with equity in Goldfinch Bio, a paid advisor to GSK, Insitro, Third Rock Ventures and Foresite Labs, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer and Sanofi-Genzyme; none of these activities are related to the work presented here. B.C. is currently employed at Third Rock Ventures. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Quality control of the dataset.

A) Number and B) median length of raw and aligned reads per sample. The diagonal lines correspond to intercept = 0. With the dashed black circle, we highlight the two samples sequenced using the direct-cDNA technology. C) Read number and read length in two fibroblast cell line samples (one of which was sequenced in replicate) that were sequenced using both the direct-cDNA and the PCR-cDNA protocol for 48 h. P values were calculated using a two-sided t-test. Error bars: standard deviation from the mean. D) Hierarchical clustering using Euclidean distance for replicate samples aligned to GENCODE for transcripts with expression above 3 TPM in at least 5 samples. E) Principal component analysis using all 88 samples aligned to GENCODE (v26) for transcripts with expression above 3 TPM in at least 5 samples.
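
The expression filter named in panels D and E (transcripts above 3 TPM in at least 5 samples) could be sketched as follows. This is a minimal illustration and not the authors' code; the function name `filter_expressed` and the dictionary layout (transcript ID mapped to a list of per-sample TPM values) are assumptions made for the example.

```python
def filter_expressed(tpm_by_transcript, min_tpm=3.0, min_samples=5):
    """Keep transcripts whose TPM exceeds min_tpm in at least min_samples samples.

    tpm_by_transcript: dict mapping a transcript ID to a list of per-sample TPMs
    (hypothetical layout chosen for this sketch).
    """
    kept = {}
    for transcript_id, tpms in tpm_by_transcript.items():
        # Count the samples in which this transcript passes the TPM threshold.
        n_above = sum(1 for value in tpms if value > min_tpm)
        if n_above >= min_samples:
            kept[transcript_id] = tpms
    return kept


example = {
    "tx_A": [4.0, 5.5, 3.1, 10.0, 7.2, 0.0],  # above 3 TPM in 5 samples: kept
    "tx_B": [4.0, 5.5, 3.0, 10.0, 7.2, 0.0],  # exactly 3.0 does not count: dropped
}
expressed = filter_expressed(example)
```

The strict inequality mirrors the wording "above 3 TPM"; whether the original analysis used a strict or inclusive comparison is not stated in the legend.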

Extended Data Fig. 2 Comparison between ONT and Illumina gene expression.

A) Correlation between the transcriptome of each sample quantified by ONT and by Illumina sequencing technologies. B) Normalized gene and transcript expression for high residual (|residual| > 1) genes and transcripts retrieved from the Spearman correlation analysis. C) Characteristics of genes and D) transcripts with high or low residuals with respect to gene/transcript length, number of transcripts per gene and number of exons per transcript.

Extended Data Fig. 3 Three prime bias analysis using mitochondrial reads.

A) Observed versus expected read length for one sample sequenced using both direct (Spearman R2 = 0.3) and PCR cDNA (Spearman R2 = 0.26) protocols. The discrete clusters below the diagonal represent incorrect assignments to GENCODE isoforms (potential novel transcripts), and the diffuse shading represents fragmented RNA. Relationship between the expected transcript read length and the fraction of observed nanopore poly(A) RNA reads over the expected full length by sample for B) all samples and C) only fibroblasts. Labels are for mitochondrial genes without the MT prefix. The median was calculated per sample and error bars represent standard deviation. D) Median fraction of full-length per method by which RNA was isolated. E) Comparison of alternative transcription structure events found in highly expressed transcripts in the top and bottom 10% of samples ranked by 3′ bias. We observed no difference between the two deciles when using a two-sided proportionality test.

Extended Data Fig. 4 FLAIR transcript characterisation.

A) FLAIR transcripts comparison to GENCODE with respect to different genomic levels. The difference between intron chain and transcript is that the former only looks at matching the intron boundaries, therefore allowing variation in the UTR regions. B) Transcript length and C) number of exons per transcript classified by comparison to GENCODE. D) Number of overlapping transcripts between the ones identified in this paper and the ones released by a) CHESS and b) Workman.

Extended Data Fig. 5 Protein validation analysis of transcripts from matched tissues.

A) Percentage of validated transcripts at the protein level using mass spectrometry for different TPM thresholds. Each point represents the mean across n = 7 assayed tissues and error bars represent the standard deviation. B) Mean expression per tissue with over one sample (lung, liver, heart, muscle and brain) of annotated and novel transcripts stratified by their validation status. The vertical line denotes the 5 TPM threshold used. C) Percentage of validated transcripts at the protein level using mass spectrometry per primary tissue. D) Proportion of the AltTS events validated per tissue. E) MLF1 is an example of a gene with multiple highly-expressed transcripts across both muscle and heart tissues with a different transcript validated in each tissue. A3: alternative 3′ splice site; A5: alternative 5′ splice site; AF: alternative first exon; AL: alternative last exon; A3UTR: alternative 3′ end; A5UTR: alternative 5′ end; MX: mutually exclusive exons; RI: retained intron; SE: skipped exon.

Extended Data Fig. 6 Transcript expression overview of novel and annotated transcripts.

A) Hierarchical clustering using Euclidean distance and B) principal component analysis for selected samples aligned to GENCODE for transcripts with expression above five TPM in at least three samples, separated based on whether they are novel or not. C) Proportion of transcripts expressed at different TPM thresholds and classified based on how many tissues express the transcript in at least two samples. The total number of transcripts per group and threshold is included in the legend.

Extended Data Fig. 7 Differential transcript expression and usage between tissues.

Heatmap of number of differentially A) upregulated transcripts and B) used transcripts (FDR ≤ 0.05) in pairwise comparison of tissues with at least five samples. In differential expression analysis we identify up- or down-regulated transcripts per pairwise comparison (asymmetrical heatmap) while in the differential transcript usage analysis there is no direction of effect (symmetrical heatmap). C) Number of differentially expressed or used transcripts that were specific to that tissue. D) Median gene expression across all transcripts that are specific to a tissue.
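
An FDR threshold such as the one in panels A and B is applied to P values adjusted for multiple testing. The legend does not say which adjustment was used; the sketch below assumes the Benjamini-Hochberg procedure, which is a common default, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Assumption for this sketch: BH is the adjustment behind the FDR cut-off.
    """
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
# Indices of tests passing FDR <= 0.05.
significant = [i for i, q in enumerate(q_values) if q <= 0.05]
```

In this toy input only the first test passes the threshold, even though three raw P values are below 0.05.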

Extended Data Fig. 8 LORALS pipeline development and aligning statistics.

A) Pipeline for allele-specific analysis. Raw long reads are first aligned to the genome using minimap2. This alignment is used to correct the phase of some of the heterozygous variants in the whole-genome sequencing VCF. This new file is then used to generate personalized genome reference files against which the raw reads are again aligned using minimap2. The raw reads are also aligned to the transcriptome using minimap2. The VCF file along with the genome-aligned reads and the transcriptome-aligned reads are then fed into LORALS for allelic analysis. B) Percentage of switched haplotypes per donor informed by the long-read data. For this, all samples from the same donor were merged to harmonize the files. C) Percentage of haplotype-specific reads calculated as reads having a higher mapping score when using a personalized genome reference. D) Delta calculated as the difference in the start position of the aligned read between the genome-aligned and the personalized-genome-aligned reads. Not shown are the reads that had Delta = 0. E) Reference ratio for the samples present in this study sequenced using Illumina technology and ONT technology aligned with two different approaches.
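
The reference ratio in panel E is conventionally the fraction of allele-informative reads that carry the reference allele at a heterozygous site. The sketch below assumes that definition; it is not taken from LORALS, and the function names and the `(ref_reads, alt_reads)` tuple layout are hypothetical.

```python
from statistics import median


def reference_ratio(ref_reads, alt_reads):
    """Fraction of allele-informative reads carrying the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no informative reads at this site
    return ref_reads / total


def median_reference_ratio(sites):
    """Median reference ratio over (ref_reads, alt_reads) pairs, skipping empty sites."""
    ratios = [reference_ratio(ref, alt) for ref, alt in sites]
    ratios = [ratio for ratio in ratios if ratio is not None]
    return median(ratios) if ratios else None


sites = [(30, 10), (10, 10), (0, 0), (5, 15)]
sample_ratio = median_reference_ratio(sites)
```

A sample-level value close to 0.5 is what one would expect in the absence of reference mapping bias, which is why this summary is useful for comparing alignment approaches.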

Extended Data Fig. 9 LORALS pipeline allele specific analysis filter setting.

A) Reference ratio and B) normalized read counts across different Illumina and ONT flags for both of these sequencing technologies. Mapping bias: mapping bias in simulations; Low mappability: low-mappability regions (75-mer mappability < 1 based on 75-mer alignments with up to two mismatches based on the pipeline for ENCODE tracks and available on the GTEx portal); Genotype warning: no more reads supporting two alleles than would be expected from sequencing noise alone, indicating potential genotyping errors (FDR < 1%); Blacklist: ENCODE blacklist; Multi-mapping: regions with multi-mapping reads constructed using the alignability track from UCSC using a threshold of 0.1 (so that a 100kmer aligning to that site aligns to at least 5 other locations in the genome with up to 2 mismatches); Other allele warning: regions where the proportion of reads containing the reference or alternative allele is lower than 0.8; High indel warning: sites where the proportion of non-indel-containing reads is lower than 0.8. C) Reference ratios and normalized read counts of all kept sites across Illumina and ONT sequencing technologies. D) Distribution of the high indel warning ratios and the other allele ratios across all samples. E) Proportion of genes with at least 20 overlapping reads flagged per filter. The proportion was calculated across all genes for each sample (n = 59). The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge.
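
The two read-level flags described above (other allele warning and high indel warning, both with a 0.8 threshold) might be computed per site roughly as follows. How reads are partitioned into the four counts, and which denominators are used, are assumptions of this sketch rather than the documented LORALS behaviour; `site_warnings` is a hypothetical name.

```python
def site_warnings(ref_reads, alt_reads, other_reads, indel_reads, threshold=0.8):
    """Return the two read-level warnings for one heterozygous site.

    Assumed partition of reads overlapping the site:
      ref_reads / alt_reads: reads with the reference or alternative base
      other_reads:           reads with any other base
      indel_reads:           reads with an indel at the site
    """
    base_reads = ref_reads + alt_reads + other_reads
    all_reads = base_reads + indel_reads
    # Share of base-containing reads that carry one of the two expected alleles.
    other_allele_ratio = (ref_reads + alt_reads) / base_reads if base_reads else 0.0
    # Share of all reads that do not contain an indel at the site.
    non_indel_ratio = base_reads / all_reads if all_reads else 0.0
    return {
        "other_allele_warning": other_allele_ratio < threshold,
        "high_indel_warning": non_indel_ratio < threshold,
    }


clean_site = site_warnings(40, 45, 5, 10)
noisy_bases = site_warnings(30, 30, 40, 0)
indel_rich = site_warnings(10, 10, 0, 20)
```

In this sketch a site with no reads at all raises both warnings, which is a conservative choice made for the example.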

Extended Data Fig. 10 Allele specific analysis on all GTEx samples.

A) Estimated power for different numbers of transcripts (2, 3 or 5) with respect to the coverage (x-axis) for effect sizes 0.1 (low), 0.3 (medium) and 0.5 (high), derived from simulated count data. B) Number of genes tested for allele-specific expression (ASE) and allele-specific transcript structure (ASTS) and number of significant genes. The diagonal indicates the median percentage of significant genes (9% and 9%, respectively). C) Number of transcripts per gene tested for ASTS. D) Number of genes with allelic data across donors per tissue for ASE and ASTS. E) Total number of genes calculated per sample (n = 59) at different levels of power. Outliers are hidden for ease of viewing. The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge.
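
The effect sizes 0.1, 0.3 and 0.5 match the conventional small, medium and large values of Cohen's w for count tables (ref. 57). Assuming that w is the measure meant in panel A, which the legend does not state explicitly, it can be computed from an allele-by-transcript table of read counts as sketched below; the function name and table layout are illustrative.

```python
import math


def cohens_w(table):
    """Cohen's w for a contingency table given as a list of rows of counts.

    Here each row is one allele and each column one transcript (assumed layout).
    The table must contain at least one read.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of allele and transcript.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # w is the chi-square statistic rescaled by the total number of reads.
    return math.sqrt(chi2 / n)


balanced = cohens_w([[50, 50], [50, 50]])
shifted = cohens_w([[60, 40], [40, 60]])
```

Because w is normalized by the read count, the same shift in transcript proportions gives the same effect size at any coverage, while the power to detect it grows with coverage.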

Extended Data Fig. 11 Comparison of allele specific expression between ONT and Illumina.

A) Proportion of significant ASE genes discovered using Illumina or ONT and replicated in the other method. π1 calculations are carried out up to P value = 0.5. B) Log allelic fold-change of Illumina and ONT of the shared ASE genes. C) Number of ASE events with opposite directions between ONT and Illumina per sample. Highlighted are the five fibroblast cell lines that were further cultured prior to sequencing, where 12/57 events were observed. RNA-seq read pile-ups for D) CCDC69 and E) ACSSL3, which have ASE in the opposite direction between the two methods. In red is shown the variant used to parse the reads between the two haplotypes. CCDC69 differences can be attributed to a depression in the Illumina read pile-ups while ACSSL3 can be attributed to the variant being in the 3′UTR, which is not well captured by Illumina reads. F) Venn diagram of the significant ASE genes discovered by Illumina and ONT. The log2 of total counts for each method is shown for each group of the Venn diagram.
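
Panels B and C compare the log allelic fold-change between technologies. A minimal sketch is given below, taking the log allelic fold-change as the log2 ratio of alternative to reference allele read counts (in the spirit of ref. 56). The pseudocount and both function names are assumptions introduced here to keep the example well defined at zero counts.

```python
import math


def log_afc(ref_count, alt_count, pseudocount=1.0):
    """log2 allelic fold-change of the alternative over the reference allele.

    The pseudocount is an assumption of this sketch, not a documented setting.
    """
    return math.log2((alt_count + pseudocount) / (ref_count + pseudocount))


def opposite_direction(afc_a, afc_b):
    """True when two estimates favour different alleles (strictly opposite signs)."""
    return afc_a * afc_b < 0


ont_afc = log_afc(ref_count=3, alt_count=7)        # alternative allele favoured
illumina_afc = log_afc(ref_count=7, alt_count=3)   # reference allele favoured
discordant = opposite_direction(ont_afc, illumina_afc)
```

An estimate of exactly zero is treated as having no direction, so it is never counted as discordant.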

Extended Data Fig. 12 Alternative transcript structure event annotation of allele specific events.

A) Percentage of genes displaying a single alternative transcript structure event. P values were calculated using a two-sided binomial test. A3: alternative 3′ splice site; A5: alternative 5′ splice site; AF: alternative first exon; AL: alternative last exon; A3UTR: alternative 3′ end; A5UTR: alternative 5′ end; MX: mutually exclusive exons; RI: retained intron; SE: skipped exon. B) Average relative location of the heterozygous variant used for the ASTS event, obtained by grouping all the transcripts of an ASTS event together. C) Read pile-ups per transcript for the two donors displaying ASTS in the DUSP13 gene. In the lower panel the transcript structure is shown, without details of the coding sequence. D) Transcript percentage for four of the five DUSP13 annotated transcripts with high read coverage in GTEx v8.

Extended Data Fig. 13 Differential expression analysis between PTBP1-KD and control samples.

A) Volcano plot from differential gene expression between control and samples with PTBP1 knockdown using ONT data. P values were calculated using the Wald test in DESeq2. B) Gene expression profile in PTBP1 and PTBP2 genes (PTBP2 under normal circumstances has its expression restricted to the brain). C) Proportion of different alternative transcript structure events in transcripts upregulated in the control or the PTBP1 knockdown samples. P values were calculated using a two-sided proportionality test.

Extended Data Fig. 14 Allele specific analysis of PTBP1-KD and control samples.

A) Correlation between the control and the PTBP1 knockdown samples in the reference ratio of gene expression and transcript structure. B) Changes in ASTS by PTBP1 knockdown, as assayed by gridION, with the heatmap showing the co-occurrence of alternative transcript structure events that are observed at least once per each event (or a single time for the diagonal) in a given gene. Color corresponds to the log2 ratio of the number of events found in the control over PTBP1 knockdown (KD) samples. C) Number of significant ASE and ASTS genes found by gridION categorized based on their status in the PromethION data from the same samples. D) Proportion of genes displaying allele-specific patterns specifically in either control or PTBP1 knockdown samples. E) SLC1A5 gene transcript read pile-ups, which display significant ASTS only in the control sample. The arrow indicates the location of the PTBP1 eCLIP site which contains a heterozygous variant in that donor.

Extended Data Fig. 15 Variant interpretation through novel transcripts and allele-specific transcript structure analysis.

A) Number of variants per variant effect predictor (VEP) category using GENCODE v26 protein-coding genes with or without novel FLAIR transcripts. B) CADD score distribution of variants that were reassigned to a more severe consequence when the GENCODE gene annotations were complemented with the novel FLAIR transcripts, compared to variants that retained their annotation (downsampled to a similar size). P values from two-sided t-test. The center corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles and the whiskers extend from the hinge to the smallest/largest value no further than 1.5 * inter-quartile range from the hinge. C) Percentage of variants per clinical significance category that get reassigned when supplementing the gene annotation with the novel transcripts. The numbers above the bars correspond to the number of reassigned variants. D) Number of rare variants per ASTS gene (10 kb window around gene). E) Proportion of rare heterozygous variants per annotation in significant ASTS events. As a background all ASTS events were used, and P values were calculated using a two-sided binomial test. F) Enrichment of the significant ASTS genes within splicing outliers. As a background all ASTS genes were used and P values were calculated using a two-sided binomial test. G) NDUFS4 as an example of a gene with a rare heterozygous variant in a sample that is a GTEx splicing outlier and has significant ASTS, with read pile-ups and grey arrows indicating the rare variants. Log-normalized transcript counts per allele are plotted per transcript, with the REF:ALT ratios marked for each.

Supplementary information

Supplementary Information

Supplementary Methods, legends for Supplementary Tables 1–12 and Supplementary Fig. 1.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–12; see main Supplementary file for legends.

About this article

Cite this article

Glinos, D.A., Garborcauskas, G., Hoffman, P. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022). https://doi.org/10.1038/s41586-022-05035-y

  • DOI: https://doi.org/10.1038/s41586-022-05035-y
