Abstract
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.
Data availability
All data used in this study are available at https://github.com/jonassibbesen/vgrna-project-paper. Data that are available from public repositories are provided as web links only. Accession numbers are included when relevant, and accession numbers for sequencing data are also listed in Supplementary Table 4. The repository also includes all spliced pangenome graphs and pantranscriptome haplotype-specific transcript sets, which may be freely used in other projects. Mapping benchmark tables and haplotype-specific expression estimates are archived in Zenodo (https://doi.org/10.5281/zenodo.7234454).
Code availability
The source code for VG and RPVG is publicly available at https://github.com/vgteam/vg and https://github.com/jonassibbesen/rpvg, respectively. Both tools are licensed under the MIT License. A full list of the versions of all computational tools used is available in Supplementary Table 6. All bash scripts with exact command-lines used to generate the results are available at https://github.com/jonassibbesen/vgrna-project-paper. This repository also includes the custom C++, Python, and R scripts used for analysis and plotting, together with references to Docker containers and log files from the analyses.
Acknowledgements
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award numbers U01HG010961, R01HG010485, U41HG010972, U24HG011853, and OT2OD026682 to B.P. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The work of J.A.S. was supported by the Carlsberg Foundation. We thank the ENCODE Consortium, the Thomas Gingeras Laboratory (Cold Spring Harbor Laboratory), the Ali Mortazavi Laboratory (University of California Irvine) and the Joe Ecker Laboratory (Salk Institute for Biological Studies) for generating and sharing the ENCODE data used in this study. We would also like to thank M. Dennis (University of California Davis) for generating and providing access to the CHM13 RNA-seq data on behalf of the T2T consortium. Finally, we would like to thank J. Monlong and G. Hickey for feedback on the manuscript, and everybody else in the VG Team.
Author information
Contributions
J.A.S. and J.M.E. developed software, designed and carried out experiments, analyzed data, and wrote the paper. A.M.N., J.S., X.C., and E.G. contributed to developing the software. B.P. contributed to project conceptualization, supervised the research, and edited the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Michael Love, Harold Pimentel, and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Diagram of a multipath alignment.
A diagrammatic comparison between the multipath alignment output of VG MPMAP and the single-path alignment output of other graph aligners (such as VG MAP). (a) A read and (b) a sequence graph, which have been colored to indicate which parts of the read could plausibly align to which parts of the graph. (c) A single-path alignment. The read sequence is aligned to one path from the graph. (d) A multipath alignment. The alignment can split and rejoin to express the alignment uncertainty to different paths in the graph.
Extended Data Fig. 2 Mapping benchmark for primary alignments using RNA-seq data from NA12878.
Mapping error and recall for VG MPMAP and three other methods using simulated Illumina data. Colored numbers indicate different mapping quality thresholds. Reads are considered correctly mapped if their primary alignments cover 90% of the true reference sequence alignment.
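The 90% coverage criterion used in the mapping benchmarks can be illustrated with a minimal sketch. It assumes that both the true and the reported alignment are given as half-open (start, end) intervals on the same reference sequence, one interval per exon block for spliced reads. The function names and the interval representation are assumptions made for illustration; the authors' evaluation scripts may differ.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(true_intervals, aligned_intervals):
    """Fraction of the true alignment's bases covered by the reported one."""
    truth = merge_intervals(true_intervals)
    aligned = merge_intervals(aligned_intervals)
    total = sum(end - start for start, end in truth)
    if total == 0:
        return 0.0
    shared = 0
    for t_start, t_end in truth:
        for a_start, a_end in aligned:
            # Length of the overlap between two half-open intervals.
            shared += max(0, min(t_end, a_end) - max(t_start, a_start))
    return shared / total


def is_correctly_mapped(true_intervals, aligned_intervals, min_cover=0.9):
    """Apply the coverage threshold (0.9 mirrors the 90% in the captions)."""
    return covered_fraction(true_intervals, aligned_intervals) >= min_cover


# Hypothetical spliced read: two 50-base exon blocks.
truth = [(100, 150), (300, 350)]
print(is_correctly_mapped(truth, [(100, 150), (300, 345)]))
print(is_correctly_mapped(truth, [(100, 150)]))
```

In this sketch, a reported alignment that misses five bases of the second block still passes, while one that misses the second block entirely does not.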
Extended Data Fig. 3 Mapping benchmark stratified by edit distance using RNA-seq data from NA12878.
Mapping recall (a) and error (b) for VG MPMAP and three other methods using simulated Illumina data as a function of edit distance. Unique alignments are primary alignments with a mapping quality of at least 30. Reads are considered correctly mapped if their alignments cover 90% of the true reference sequence alignment.
Extended Data Fig. 4 Mapping benchmark stratified by non-reference variants using RNA-seq data from NA12878.
Mapping error and recall for VG MPMAP and three other methods using simulated Illumina data. Colored numbers indicate different mapping quality thresholds. Reads are considered correctly mapped if one of their multi-alignments covers 90% of the true reference sequence alignment. Reads are stratified into those that a contain no variants, b contain no insertions or deletions (indels) and one single nucleotide variant (SNV), c contain no indels and two SNVs, d contain no indels and three SNVs, e contain no indels and more than three SNVs, and f contain any indels.
Extended Data Fig. 5 Allelic bias benchmark using RNA-seq data from NA12878.
Allelic mapping bias for VG MPMAP and four other methods using simulated Illumina RNA-seq reads, which were simulated without allelic bias. STAR was used as the aligner for the WASP pipeline. The WASP (STAR) pipeline was provided the 1000GP NA12878 haplotypes as input. The number of variant sites with coverage at least 20 is plotted against the observed rate of false positive hypothesis tests of allelic skew (two-sided binomial test, α = 0.01). Coverage was calculated from primary alignments with a mapping quality value of at least 30. The bottom row shows a zoomed view without WASP (STAR).
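As a rough illustration of the test described in this caption, the sketch below computes an exact two-sided binomial p-value per variant site and reports the fraction of sufficiently covered sites that are called skewed. Because the reads were simulated without allelic bias, every rejection counts as a false positive. The function names and the (reference count, alternative count) input format are assumptions, and the two-sided p-value here sums all outcomes that are no more likely than the observed one, which is one common convention and may not be the one the authors used.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so floating-point ties are counted.
    tail = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, tail)


def false_positive_rate(site_counts, min_coverage=20, alpha=0.01):
    """Return (number of tested sites, fraction with p < alpha).

    site_counts: iterable of (ref_reads, alt_reads) per variant site,
    taken from data simulated without allelic bias.
    """
    tested = [(r, a) for r, a in site_counts if r + a >= min_coverage]
    if not tested:
        return 0, None
    rejected = sum(binom_two_sided_p(r, r + a) < alpha for r, a in tested)
    return len(tested), rejected / len(tested)


# Hypothetical counts; the last site is dropped for low coverage.
sites = [(10, 10), (3, 17), (5, 15), (4, 4)]
print(false_positive_rate(sites))
```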
Extended Data Fig. 6 Haplotype-specific transcript uniqueness in the 1000 Genomes Project.
The fraction of HSTs that are unique to each of the 2504 samples in the 1000 Genomes Project (1000GP) when compared to different subsets of samples in the 1000GP. Left box plots show the fraction unique when comparing to all other samples, middle box plots show the fraction unique when comparing to all other samples excluding the samples’ population, and right box plots show the fraction unique when comparing to all other samples excluding the samples’ super population. AFR: African (n = 661), AMR: Admixed American (n = 347), EAS: East Asian (n = 504), EUR: European (n = 503), SAS: South Asian (n = 489). The horizontal line in the boxes corresponds to the median, and the box bounds (inter-quartile range) to the 25th and 75th percentiles. The whiskers extend to the minimum and maximum value, but no further than 1.5 times the inter-quartile range from the box bounds. Values outside the whiskers are displayed as points.
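The box plot conventions described in this caption can be written out as a short sketch. The quartile interpolation rule is an assumption (the "inclusive" method of Python's statistics.quantiles, which matches the default of several plotting libraries); the caption does not state which rule was used, and the function name is hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Median, box bounds, whisker ends and outlying points for one box."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that lie within the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical values with one point beyond the upper whisker.
print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```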
Extended Data Fig. 7 Allele-specific expression benchmark using RNA-seq data from NA12878.
Allele-specific expression (ASE) results comparing the MPMAP-RPVG pipeline against WASP (with STAR as the aligner) using simulated data. The plot shows the true positive rate and false positive rate of ASE significance for different thresholds of variant read count in the simulated data. Variants were defined as showing significant ASE using a two-sided binomial test of the allele-specific read counts with p-values adjusted using the Benjamini-Hochberg procedure and a False Discovery Rate (FDR) α = 0.1. All heterozygous NA12878 variants from the 1000 Genomes Project (1000GP) with at least one read in the simulated data were used for the benchmark. For the MPMAP-RPVG pipeline, we used the personal transcriptome generated from the 1000GP NA12878 haplotypes (Supplementary Table 3). WASP was provided the 1000GP NA12878 haplotypes as input. Note, we only used WASP for bias correction and allele-specific read counting, and not its downstream inference method.
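For readers unfamiliar with the Benjamini-Hochberg procedure named in this caption, a minimal sketch of the p-value adjustment follows. It takes already computed per-variant p-values as input; the function name and the example values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # are monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-variant p-values from a two-sided binomial test.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.1 for q in adjusted]  # FDR threshold from the caption
print(adjusted, significant)
```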
Extended Data Fig. 8 Proportion of marginal expression attributed to ≤2 HSTs of a transcript.
For an African American individual (left) and a European American individual (right), the proportion of transcripts for which the marginal expression has at least X proportion assigned to ≤2 HSTs is shown for various values of X. Colors correspond to different thresholds on the proportion of marginal expression. A pantranscriptome generated from all 1000 Genomes Project haplotypes was used for the evaluation (“Whole” in Supplementary Table 3). Transcripts with fewer than 1 inferred read are omitted.
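The quantity plotted here can be sketched as follows, under two assumptions that the caption does not spell out: that the marginal expression of a transcript is the sum of the expression of its HSTs, and that the read filter applies to the summed inferred read count of those HSTs. The input layout and function names are hypothetical.

```python
def top_two_fraction(hst_expression):
    """Share of a transcript's summed expression held by its two largest HSTs."""
    total = sum(hst_expression)
    top_two = sum(sorted(hst_expression, reverse=True)[:2])
    return top_two / total


def proportion_meeting_threshold(transcripts, x, min_reads=1.0):
    """Proportion of transcripts with at least x of expression in <=2 HSTs.

    transcripts: dict mapping a transcript name to a list of
    (expression, inferred_reads) tuples, one per HST.
    """
    kept = 0
    passing = 0
    for hsts in transcripts.values():
        reads = sum(r for _, r in hsts)
        expression = [e for e, _ in hsts]
        if reads < min_reads or sum(expression) <= 0:
            continue  # omitted, as in the caption
        kept += 1
        if top_two_fraction(expression) >= x:
            passing += 1
    return passing / kept if kept else None


# Hypothetical values; t3 has fewer than 1 inferred read and is omitted.
transcripts = {
    "t1": [(6.0, 60), (3.0, 30), (1.0, 10)],
    "t2": [(5.0, 50), (5.0, 50)],
    "t3": [(0.01, 0.4), (0.01, 0.3)],
}
print(proportion_meeting_threshold(transcripts, 0.95))
```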
Extended Data Fig. 9 Multipath alignment benchmark using RNA-seq data from NA12878.
Haplotype-specific transcript (HST) quantification results comparing RPVG with single-path and multipath alignments from VG MPMAP and VG MAP as input using simulated and real Illumina data. For details on the pantranscriptomes used see Supplementary Table 3. The VG MPMAP single-path alignments were created by finding the best scoring path in each multipath alignment. (a) Recall and precision of whether a transcript is correctly assigned nonzero expression for different expression value thresholds (colored numbers for “Whole (excl. CEU)” pantranscriptome) using simulated data. Expression is measured in transcripts per million (TPM). (b) Mean absolute relative expression difference (MARD) between simulated and estimated expression (in TPM) for different pantranscriptomes using simulated data. MARD was calculated using either all HSTs in the pantranscriptome (solid bars) or using only the NA12878 HSTs (shaded bars). (c) Number of expressed transcripts from NA12878 haplotypes shown against the number from non-NA12878 haplotypes for different expression value thresholds (colored numbers) using real data. (d) Fraction of transcript expression (in TPM) assigned to NA12878 haplotypes for different pantranscriptomes using simulated (left) and real (right) data.
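A sketch of how a MARD value like the one in this caption could be computed is given below. The caption does not give the formula, so the per-transcript relative difference |x − y| / (x + y), set to 0 when both values are 0, is an assumption borrowed from a convention common in quantification benchmarks. The optional subset argument mirrors the distinction between all HSTs and only the NA12878 HSTs; all names and values are hypothetical.

```python
def mard(simulated, estimated, subset=None):
    """Mean absolute relative difference between two expression maps.

    simulated, estimated: dicts mapping HST name to expression (e.g. TPM).
    subset: optional collection of names to restrict the mean to.
    """
    names = set(simulated) | set(estimated)
    if subset is not None:
        names &= set(subset)
    if not names:
        raise ValueError("no transcripts to compare")
    total = 0.0
    for name in names:
        x = simulated.get(name, 0.0)
        y = estimated.get(name, 0.0)
        # Assumed convention: relative difference is 0 when both are 0.
        total += 0.0 if x + y == 0 else abs(x - y) / (x + y)
    return total / len(names)


# Hypothetical expression values in TPM.
simulated = {"hst_a": 10.0, "hst_b": 0.0, "hst_c": 5.0}
estimated = {"hst_a": 10.0, "hst_b": 2.0, "hst_c": 15.0}
print(mard(simulated, estimated))
print(mard(simulated, estimated, subset={"hst_a", "hst_c"}))
```

Under this convention the value lies between 0 (perfect agreement) and 1 (every transcript is expressed in only one of the two maps).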
Extended Data Fig. 10 Examples of allele expression concordance across tissues.
A set of examples showing allele concordance across tissues using two different variant expression thresholds. Only three tissues are used in the example for simplicity. Blue and orange bars correspond to reference and alternative allele expression, respectively. Variant expression is calculated as the sum of the two alleles. An allele is defined as concordant if it is either consistently expressed or consistently not expressed across all tissues for which the corresponding variant is expressed. Using this definition all alternative alleles except for the allele in variant 2 are defined as concordant when the minimum variant expression threshold is set to 0. If the variant expression threshold is increased to 3, the alternative allele in variant 2 becomes concordant since tissue 2 will be filtered for this variant. Moreover, variant 4 will be excluded due to tissue 3 being filtered since at least two expressed tissues are needed to compute concordance.
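The concordance definition in this caption can be expressed as a small function. The sketch assumes that a tissue counts as expressed for a variant when the summed expression of the two alleles is strictly greater than the threshold, and that an allele counts as expressed when its own expression is greater than 0; the caption does not state whether these comparisons are strict. The example values are hypothetical and are not read from the figure.

```python
def alt_allele_concordance(expr_by_tissue, min_variant_expression=0.0):
    """Concordance of the alternative allele of one variant across tissues.

    expr_by_tissue: list of (ref_expression, alt_expression), one per tissue.
    Returns True or False, or None when fewer than two tissues express
    the variant and concordance cannot be computed.
    """
    # Variant expression is the sum of the two alleles.
    kept = [(ref, alt) for ref, alt in expr_by_tissue
            if ref + alt > min_variant_expression]
    if len(kept) < 2:
        return None
    alt_expressed = [alt > 0 for _, alt in kept]
    # Concordant: expressed in every kept tissue, or in none of them.
    return all(alt_expressed) or not any(alt_expressed)


# Hypothetical variant whose alternative allele is silent in tissue 2.
variant_2 = [(5, 3), (2, 0), (4, 2)]
print(alt_allele_concordance(variant_2, 0))  # tissue 2 kept
print(alt_allele_concordance(variant_2, 3))  # tissue 2 filtered

# Hypothetical variant expressed in only two tissues.
variant_4 = [(4, 1), (0, 0), (1, 1)]
print(alt_allele_concordance(variant_4, 0))
print(alt_allele_concordance(variant_4, 3))  # one tissue left
```

The two hypothetical variants reproduce the behavior described in the caption: raising the threshold can turn a discordant allele concordant by filtering a tissue, and it can exclude a variant that no longer has two expressed tissues.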
Supplementary information
Supplementary Information
Supplementary Figs 1–18, Supplementary Tables 1–6, Supplementary Note, Supplementary Algorithms 1–8.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sibbesen, J.A., Eizenga, J.M., Novak, A.M. et al. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods 20, 239–247 (2023). https://doi.org/10.1038/s41592-022-01731-9
DOI: https://doi.org/10.1038/s41592-022-01731-9