Haplotype-aware pantranscriptome analyses using spliced pangenome graphs


Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.

Fig. 1: Diagram of haplotype-aware transcriptome analysis pipeline.
Fig. 2: Mapping benchmark using RNA-seq data from NA12878.
Fig. 3: HST quantification benchmark using RNA-seq data from NA12878.
Fig. 4: HLA typing and allele concordance evaluation using RNA-seq data from trios and different tissues.
Fig. 5: Exploratory demonstration of analyzing genomic imprinting using data from NA12878 lymphoblastoid cell line.

Data availability

All data used in this study are available in a public repository. Data that are available from public repositories are provided as web links only. Accession numbers are included when relevant, and accession numbers for sequencing data are also listed in Supplementary Table 4. The repository also includes all spliced pangenome graphs and pantranscriptome haplotype-specific transcript sets, which may be freely used in other projects. Mapping benchmark tables and haplotype-specific expression estimates are archived in Zenodo.

Code availability

The source code for VG and RPVG is publicly available, and both tools are licensed under the MIT License. A full list of the versions of all computational tools used is available in Supplementary Table 6. All bash scripts with the exact command lines used to generate the results are available in a public repository. This repository also includes the custom C++, Python, and R scripts used for analysis and plotting, together with references to Docker containers and log files from the analyses.



Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award numbers U01HG010961, R01HG010485, U41HG010972, U24HG011853, and OT2OD026682 to B.P. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The work of J.A.S. was supported by the Carlsberg Foundation. We thank the ENCODE Consortium, the Thomas Gingeras Laboratory (Cold Spring Harbor Laboratory), the Ali Mortazavi Laboratory (University of California Irvine) and the Joe Ecker Laboratory (Salk Institute for Biological Studies) for generating and sharing the ENCODE data used in this study. We would also like to thank M. Dennis (University of California Davis) for generating and providing access to the CHM13 RNA-seq data on behalf of the T2T consortium. Finally, we would like to thank J. Monlong and G. Hickey for feedback on the manuscript, and everybody else in the VG Team.

Author information


J.A.S. and J.M.E. developed software, designed and carried out experiments, analyzed data, and wrote the paper. A.M.N., J.S., X.C., and E.G. contributed to developing the software. B.P. contributed to project conceptualization, supervised the research, and edited the paper.

Corresponding author

Correspondence to Benedict Paten.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Michael Love, Harold Pimentel, and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Diagram of a multipath alignment.

A diagrammatic comparison between the multipath alignment output of VG MPMAP and the single-path alignment output of other graph aligners (such as VG MAP). (a) A read and (b) a sequence graph, which have been colored to indicate which parts of the read could plausibly align to which parts of the graph. (c) A single-path alignment. The read sequence is aligned to one path from the graph. (d) A multipath alignment. The alignment can split and rejoin to express the alignment uncertainty to different paths in the graph.

Extended Data Fig. 2 Mapping benchmark for primary alignments using RNA-seq data from NA12878.

Mapping error and recall for VG MPMAP and three other methods using simulated Illumina data. Colored numbers indicate different mapping quality thresholds. Reads are considered correctly mapped if their primary alignments cover 90% of the true reference sequence alignment.
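The 90% coverage criterion used in these mapping benchmarks can be illustrated with a minimal sketch. It assumes that each alignment has been reduced to a list of half-open reference intervals (one per exon block), which is a simplification; the function names and the interval representation are hypothetical and are not taken from the benchmark scripts.

```python
def covered_fraction(true_blocks, mapped_blocks):
    """Fraction of the true alignment's bases also covered by the mapped one.

    Both arguments are lists of half-open (start, end) reference intervals on
    the same contig. Blocks within one alignment are assumed not to overlap.
    """
    true_len = sum(end - start for start, end in true_blocks)
    if true_len == 0:
        return 0.0
    shared = 0
    for t_start, t_end in true_blocks:
        for m_start, m_end in mapped_blocks:
            # Length of the intersection of the two intervals (0 if disjoint).
            shared += max(0, min(t_end, m_end) - max(t_start, m_start))
    return shared / true_len


def is_correctly_mapped(true_blocks, mapped_blocks, min_fraction=0.9):
    """Apply the coverage threshold (90% by default) to one read."""
    return covered_fraction(true_blocks, mapped_blocks) >= min_fraction


# Made-up spliced read spanning two exons (100 bases in total).
truth = [(100, 150), (300, 350)]
good = [(100, 150), (300, 345)]   # shares 95 of 100 bases with the truth
bad = [(100, 150), (500, 550)]    # second block placed at the wrong locus

print(is_correctly_mapped(truth, good), is_correctly_mapped(truth, bad))
```

Working on reference intervals rather than read coordinates means a spliced read is only credited for exon blocks that land at the right locus, which is the behavior the sketch is meant to show.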

Extended Data Fig. 3 Mapping benchmark stratified by edit distance using RNA-seq data from NA12878.

Mapping recall (a) and error (b) for VG MPMAP and three other methods using simulated Illumina data as a function of edit distance. Unique alignments are primary alignments with a mapping quality of at least 30. Reads are considered correctly mapped if their alignments cover 90% of the true reference sequence alignment.

Extended Data Fig. 4 Mapping benchmark stratified by non-reference variants using RNA-seq data from NA12878.

Mapping error and recall for VG MPMAP and three other methods using simulated Illumina data. Colored numbers indicate different mapping quality thresholds. Reads are considered correctly mapped if one of their multi-alignments covers 90% of the true reference sequence alignment. Reads are stratified into those that (a) contain no variants, (b) contain no insertions or deletions (indels) and one single nucleotide variant (SNV), (c) contain no indels and two SNVs, (d) contain no indels and three SNVs, (e) contain no indels and more than three SNVs, and (f) contain any indels.

Extended Data Fig. 5 Allelic bias benchmark using RNA-seq data from NA12878.

Allelic mapping bias for VG MPMAP and four other methods using simulated Illumina RNA-seq reads, which were simulated without allelic bias. STAR was used as the aligner for the WASP pipeline. The WASP (STAR) pipeline was provided the 1000GP NA12878 haplotypes as input. The number of variant sites with coverage of at least 20 is plotted against the observed rate of false positive hypothesis tests of allelic skew (two-sided binomial test, α = 0.01). Coverage was calculated from primary alignments with a mapping quality value of at least 30. The bottom row shows a zoomed view without WASP (STAR).
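The false positive rate described in this caption can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names and the allele counts are hypothetical. The two-sided p-value sums the probabilities of all outcomes that are no more likely than the observed one, which is one common convention for the exact binomial test.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials (0 < p < 1).

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Log-space arithmetic avoids overflow at high coverage.
    """
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))

    cutoff = log_pmf(k) + 1e-7  # small tolerance for floating-point ties
    total = sum(exp(log_pmf(i)) for i in range(n + 1) if log_pmf(i) <= cutoff)
    return min(1.0, total)


def false_positive_rate(sites, alpha=0.01, min_coverage=20):
    """Fraction of sufficiently covered sites called as allelically skewed.

    `sites` is a list of (ref_count, alt_count) pairs. The reads are assumed
    to be simulated without allelic bias, so every call is a false positive.
    """
    tested = [(r, a) for r, a in sites if r + a >= min_coverage]
    if not tested:
        return 0.0
    calls = sum(binomial_two_sided_p(r, r + a) < alpha for r, a in tested)
    return calls / len(tested)


# Made-up counts: only the second site is called as skewed, and the last
# site falls below the coverage cutoff and is not tested.
sites = [(10, 10), (3, 17), (15, 5), (2, 6)]
rate = false_positive_rate(sites)
print(rate)
```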

Extended Data Fig. 6 Haplotype-specific transcript uniqueness in the 1000 Genomes Project.

The fraction of HSTs that are unique to each of the 2504 samples in the 1000 Genomes Project (1000GP) when compared to different subsets of samples in the 1000GP. Left box plots show the fraction unique when comparing to all other samples, middle box plots show the fraction unique when comparing to all other samples excluding the samples’ population, and right box plots show the fraction unique when comparing to all other samples excluding the samples’ super population. AFR: African (n = 661), AMR: Admixed American (n = 347), EAS: East Asian (n = 504), EUR: European (n = 503), SAS: South Asian (n = 489). The horizontal line in the boxes corresponds to the median, and the box bounds (inter-quartile range) to the 25th and 75th percentile. The whiskers extend to the minimum and maximum value, but no further than 1.5 times the inter-quartile range from the box bounds. Values outside the whiskers are displayed as points.

Extended Data Fig. 7 Allele-specific expression benchmark using RNA-seq data from NA12878.

Allele-specific expression (ASE) results comparing the MPMAP-RPVG pipeline against WASP (with STAR as the aligner) using simulated data. Shown are the true positive rate and false positive rate of ASE significance for different thresholds of variant read count in the simulated data. Variants were defined as showing significant ASE using a two-sided binomial test of the allele-specific read counts, with p-values adjusted using the Benjamini-Hochberg procedure and a False Discovery Rate (FDR) α = 0.1. All heterozygous NA12878 variants from the 1000 Genomes Project (1000GP) with at least one read in the simulated data were used for the benchmark. For the MPMAP-RPVG pipeline, we used the personal transcriptome generated from the 1000GP NA12878 haplotypes (Supplementary Table 3). WASP was provided the 1000GP NA12878 haplotypes as input. Note that we used WASP only for bias correction and allele-specific read counting, and not for its downstream inference method.
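The Benjamini-Hochberg adjustment named in this caption can be sketched in a few lines. This is a generic illustration of the standard procedure rather than the analysis code used for the figure; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up per-variant p-values from a binomial test of allele counts.
p_values = [0.001, 0.2, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)

# Variants called as showing significant ASE at an FDR of 0.1.
significant = [i for i, q in enumerate(adjusted) if q <= 0.1]
print(adjusted, significant)
```

Adjusting across all tested variants at once, rather than thresholding raw p-values, is what keeps the expected share of false discoveries among the calls near the chosen FDR.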

Extended Data Fig. 8 Proportion of marginal expression attributed to ≤2 HSTs of a transcript.

For an African American individual (left) and a European American individual (right), the proportion of transcripts for which the marginal expression has at least X proportion assigned to ≤2 HSTs is shown for various values of X. Colors correspond to different thresholds on the proportion of marginal expression. A pantranscriptome generated from all 1000 Genomes Project haplotypes was used for the evaluation (“Whole” in Supplementary Table 3). Transcripts with fewer than 1 inferred read are omitted.
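The quantity plotted here can be illustrated with a small sketch. It assumes that "assigned to ≤2 HSTs" refers to the two most highly expressed HSTs of each transcript, which is an interpretation of the caption rather than a statement of the authors' method. The function names and expression values are hypothetical, and the filter on inferred read count is not modeled.

```python
def top_two_fraction(hst_expression):
    """Share of a transcript's marginal expression carried by its top two HSTs.

    `hst_expression` lists the expression of every HST of one transcript.
    Returns None when the transcript has no expression at all.
    """
    total = sum(hst_expression)
    if total <= 0:
        return None
    return sum(sorted(hst_expression, reverse=True)[:2]) / total


def proportion_meeting(transcripts, x):
    """Proportion of expressed transcripts whose top-two share is at least x."""
    fractions = [top_two_fraction(v) for v in transcripts.values()]
    fractions = [f for f in fractions if f is not None]
    if not fractions:
        return 0.0
    return sum(f >= x for f in fractions) / len(fractions)


# Made-up HST expression values for three transcripts.
transcripts = {
    "tx1": [5.0, 5.0, 0.0],   # all expression on two HSTs
    "tx2": [4.0, 3.0, 3.0],   # expression spread over three HSTs
    "tx3": [9.0, 0.5, 0.5],   # mostly one HST
}
print(proportion_meeting(transcripts, 0.9))
```

In a diploid sample at most two haplotypes are truly present, so a share close to 1 indicates that the inferred expression is concentrated on a plausible pair of HSTs.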

Extended Data Fig. 9 Multipath alignment benchmark using RNA-seq data from NA12878.

Haplotype-specific transcript (HST) quantification results comparing RPVG with single-path and multipath alignments from VG MPMAP and VG MAP as input using simulated and real Illumina data. For details on the pantranscriptomes used see Supplementary Table 3. The VG MPMAP single-path alignments were created by finding the best scoring path in each multipath alignment. (a) Recall and precision of whether a transcript is correctly assigned nonzero expression for different expression value thresholds (colored numbers for “Whole (excl. CEU)” pantranscriptome) using simulated data. Expression is measured in transcripts per million (TPM). (b) Mean absolute relative expression difference (MARD) between simulated and estimated expression (in TPM) for different pantranscriptomes using simulated data. MARD was calculated using either all HSTs in the pantranscriptome (solid bars) or using only the NA12878 HSTs (shaded bars). (c) Number of expressed transcripts from NA12878 haplotypes shown against the number from non-NA12878 haplotypes for different expression value thresholds (colored numbers) using real data. (d) Fraction of transcript expression (in TPM) assigned to NA12878 haplotypes for different pantranscriptomes using simulated (left) and real (right) data.
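The caption does not spell out the MARD formula. The sketch below assumes a definition that is commonly used in quantification benchmarks: the absolute difference between estimated and simulated expression divided by their sum, taken as zero when both are zero, and averaged over transcripts. The function name and TPM values are hypothetical.

```python
def mard(simulated, estimated):
    """Mean absolute relative difference between two expression profiles.

    Both arguments map transcript IDs to expression (for example in TPM).
    A transcript missing from one profile is treated as having expression 0.
    The per-transcript term |e - s| / (e + s) is an assumed definition.
    """
    ids = set(simulated) | set(estimated)
    if not ids:
        return 0.0
    total = 0.0
    for transcript_id in ids:
        s = simulated.get(transcript_id, 0.0)
        e = estimated.get(transcript_id, 0.0)
        if s + e > 0:
            total += abs(e - s) / (e + s)
        # Transcripts with zero expression in both profiles contribute 0.
    return total / len(ids)


# Made-up expression values for three HSTs.
simulated = {"hst_a": 10.0, "hst_b": 0.0, "hst_c": 5.0}
estimated = {"hst_a": 10.0, "hst_b": 2.0, "hst_c": 15.0}
print(mard(simulated, estimated))
```

Under this definition each transcript contributes a value between 0 and 1, so a falsely expressed HST such as hst_b counts as a full error regardless of how small its estimated expression is.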

Extended Data Fig. 10 Examples of allele expression concordance across tissues.

A set of examples showing allele concordance across tissues using two different variant expression thresholds. Only three tissues are used in the example for simplicity. Blue and orange bars correspond to reference and alternative allele expression, respectively. Variant expression is calculated as the sum of the two alleles. An allele is defined as concordant if it is either consistently expressed or consistently not expressed across all tissues for which the corresponding variant is expressed. Using this definition all alternative alleles except for the allele in variant 2 are defined as concordant when the minimum variant expression threshold is set to 0. If the variant expression threshold is increased to 3, the alternative allele in variant 2 becomes concordant since tissue 2 will be filtered for this variant. Moreover, variant 4 will be excluded due to tissue 3 being filtered since at least two expressed tissues are needed to compute concordance.
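The concordance definition above can be written as a short sketch. It assumes that a tissue counts as expressing a variant when the summed allele expression is strictly greater than the threshold, and that an allele counts as expressed when its expression is above zero. Both are interpretations of the caption. The function name and expression values are hypothetical: they were chosen to mimic the behavior described for variants 2 and 4 and were not read from the figure.

```python
def alt_allele_concordance(tissues, min_variant_expression=0.0):
    """Concordance of the alternative allele across tissues.

    `tissues` is a list of (ref_expression, alt_expression) pairs, one per
    tissue. Returns True or False, or None when fewer than two tissues
    express the variant and concordance cannot be computed.
    """
    # Keep only tissues in which the variant (sum of both alleles) is expressed.
    kept = [(ref, alt) for ref, alt in tissues
            if ref + alt > min_variant_expression]
    if len(kept) < 2:
        return None
    alt_expressed = [alt > 0 for _, alt in kept]
    # Concordant if consistently expressed or consistently not expressed.
    return all(alt_expressed) or not any(alt_expressed)


# Made-up values for three tissues per variant.
variant_2 = [(4.0, 3.0), (2.0, 0.0), (5.0, 4.0)]
variant_4 = [(5.0, 5.0), (0.0, 0.0), (1.0, 1.0)]

for threshold in (0.0, 3.0):
    print(threshold,
          alt_allele_concordance(variant_2, threshold),
          alt_allele_concordance(variant_4, threshold))
```

With these values, raising the threshold from 0 to 3 filters the low-expression tissue of variant 2, making its alternative allele concordant, and leaves variant 4 with a single expressed tissue, so it is excluded.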

Supplementary information

Supplementary Information

Supplementary Figs 1–18, Supplementary Tables 1–6, Supplementary Note, Supplementary Algorithms 1–8.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sibbesen, J.A., Eizenga, J.M., Novak, A.M. et al. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods (2023).
