
Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

Abstract

Single-cell RNA-sequencing (scRNA-seq) is an indispensable tool for characterizing cellular diversity and generating hypotheses throughout biology. Droplet-based scRNA-seq datasets often lack expression data for genes that can be detected with other methods. Here we show that the observed sensitivity deficits stem from three sources: (1) poor annotation of 3′ gene ends; (2) issues with intronic read incorporation; and (3) gene overlap-derived read loss. We show that missing gene expression data can be recovered by optimizing the reference transcriptome for scRNA-seq through recovering false intergenic reads, implementing a hybrid pre-mRNA mapping strategy and resolving gene overlaps. We demonstrate, with a diverse collection of mouse and human tissue data, that reference optimization can substantially improve cellular profiling resolution and reveal missing cell types and marker genes. Our findings argue that transcriptomic references need to be optimized for scRNA-seq analysis and warrant a reanalysis of previously published datasets and cell atlases.


Fig. 1: Missing genes and sequencing read registration in single-cell RNA-seq experiments.
Fig. 2: Characterization of intergenic read mapping proximal to 3′ end of genes.
Fig. 3: Same-strand gene overlaps and resulting compromised scRNA-seq gene detection in the mouse and human genomes.
Fig. 4: Strategy for compiling an optimized transcriptomic reference.
Fig. 5: Evaluation of optimized mouse and human reference transcriptomes.


Data availability

Raw and fully processed scRNA-seq data generated for this project (mouse MnPO and human PBMC) are available at the NCBI Gene Expression Omnibus (GEO, GSE198528). Additionally, previously published mouse and human datasets were analyzed, including mouse 10x Genomics scRNA-seq datasets generated by the Tabula Muris consortium (bone marrow, SRR6835854; kidney, SRR6835849; lung, SRR6835860; tongue, SRR6835844), which can be accessed from the GEO repository GSE132042. Human brain scRNA-seq data generated from the prefrontal cortex (CS22_PFC) were acquired from the NEMO archive at https://assets.nemoarchive.org/dat-0rsydy7, which requires a custom data use agreement. Finally, human 10x Genomics scRNA-seq data (liver, TSP14_Liver_NA; lung, TSP14_Lung_Proximal; tongue, TSP14_Tongue_Anterior) generated by the Tabula Sapiens consortium can be accessed through the Tabula Sapiens AWS storage web service, available from https://tabula-sapiens-portal.ds.czbiohub.org/, which requires a custom data use agreement. Baseline single-cell transcriptomic references for human (GRCh38) and mouse (mm10) datasets were downloaded from 10x Genomics (latest available version 2020-A): https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest?. The latest optimized versions of the mouse and human reference transcriptomes and the respective genome annotations are available for download at www.thepoollab.org/resources.

Code availability

Custom scripts for analyzing data and generating figures are available at https://github.com/PoolLab/Generecovery. The ReferenceEnhancer R package for optimizing genome annotations for scRNA-seq analysis is available at https://github.com/PoolLab/ReferenceEnhancer.


Acknowledgements

We thank L. S. Pachter and members of the M.T. lab for helpful discussion and comments. We thank the Single-Cell Profiling Center (SPEC) in the Beckman Institute at Caltech for technical assistance with scRNA-seq. A.H.P. is supported by Eugene McDermott Scholar funds and by Startup funds from Peter O’Donnell Jr. Brain Institute at UT Southwestern. Y.O. is supported by Startup funds from the President and Provost of the California Institute of Technology and the Biology and Biological Engineering Division of California Institute of Technology, Searle Scholars Program, the Mallinckrodt Foundation, the McKnight Foundation, the Klingenstein–Simons Foundation, the New York Stem Cell Foundation and the NIH (grant nos. R56MH113030 and R01NS109997).

Author information


Contributions

A.-H.P. conceived and designed the project. A.-H.P. and H.P. devised and performed data analysis. A.-H.P. and S.C. generated the MnPO scRNA-seq dataset. S.C. and M.T. generated the human PBMC scRNA-seq dataset. S.C., M.T. and Y.O. provided conceptual advice on data analysis. All authors contributed to the manuscript as drafted by A.-H.P. and H.P. A.-H.P. and Y.O. supervised the overall project.

Corresponding authors

Correspondence to Allan-Hermann Pool or Yuki Oka.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Rob Patro and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Genes with shared terminal exon sequences are obscured from scRNA-seq analysis.

a. The terminal exons of the human genes TIFAB and DCANP1 overlap, so all sequencing reads mapping to the overlapping region are discarded under the ‘multigene mapping’ classification. As a result, 3′ scRNA-seq is mostly blind to the expression of these genes. b. Expression information for TIFAB and DCANP1 can be recovered by removing one of the genes from the annotation and renaming the remaining gene’s transcript, which recovers the discarded expression data.
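The kind of same-strand overlap described in this caption can be located with a simple interval scan. The following is a minimal sketch, not the authors' ReferenceEnhancer implementation; the `Gene` record, the function name and the toy coordinates are all assumptions made for illustration, and real use would read gene spans from a genome annotation.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive (GTF convention)
    end: int    # 1-based, inclusive


def same_strand_overlaps(genes):
    """Return sorted (gene_a, gene_b) name pairs whose spans overlap on the same strand."""
    by_locus = defaultdict(list)
    for g in genes:
        by_locus[(g.chrom, g.strand)].append(g)

    pairs = set()
    for group in by_locus.values():
        group.sort(key=lambda g: g.start)
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if b.start > a.end:
                    break  # sorted by start: no later gene can overlap a
                pairs.add(tuple(sorted((a.name, b.name))))
    return sorted(pairs)


# Toy coordinates, invented for illustration only.
toy_genes = [
    Gene("geneA", "chr1", "+", 100, 500),
    Gene("geneB", "chr1", "+", 450, 900),  # overlaps geneA on the same strand
    Gene("geneC", "chr1", "-", 400, 600),  # opposite strand: not reported
    Gene("geneD", "chr2", "+", 100, 500),  # different chromosome: not reported
]
overlapping_pairs = same_strand_overlaps(toy_genes)
print(overlapping_pairs)
```

Each reported pair would still need a decision on how to resolve it, for example which gene to keep, and that decision is not modeled here.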

Extended Data Fig. 2 Different strategies for incorporating intronic reads into scRNA-seq analysis vary by detection of several hundred genes and up to 7.1% of sequencing reads.

a. Comparison of genes detected by distinct methods for incorporating intronic sequencing reads into scRNA-seq analysis, using the mouse MnPO brain nucleus dataset. ‘Cell Ranger pre-mRNA reference’ data were generated with Cell Ranger 6.1.2 software in exonic mapping mode using a genome annotation in which all transcripts were defined as exons, thus leading to incorporation of reads that previously mapped to introns. ‘STARsolo GeneFull mode’ data were generated with the STARsolo software by specifying the ‘GeneFull’ attribute, which integrates intronically mapped reads into the assembled gene-cell matrix. ‘Cell Ranger include-introns mode’ data were generated with the Cell Ranger software with the ‘--include-introns’ parameter specified, leading to integration of intronic reads into the assembly of the gene-cell matrix. ‘Hybrid pre-mRNA reference’ was a combination of the ‘Regular pre-mRNA reference’ and ‘Cell Ranger include-introns’ strategies, in which all genes received additional full-transcript-length exons and data were mapped in Cell Ranger include-introns mode. Black – genes detected by all four methods for incorporating intronic reads; red – genes that are either unique to one method or shared by up to two other methods for detecting intronic reads. b. Matrix outlining the number of unique genes detected by specific intronic read capturing strategies as compared to other methods. c. Comparison of sequencing reads incorporated into the gene-cell matrix by distinct methods for registering intronic sequencing reads in scRNA-seq analysis. Black – reads detected by all three methods; red – reads that are either uniquely detected by a given method or shared by up to two other strategies. d. Histogram of enriched genes in the MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Regular pre-mRNA reference’. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset. e. Histogram of enriched genes in the MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns mode’. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset.
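The annotation side of the ‘Hybrid pre-mRNA reference’ described in this caption (extra exons spanning the full transcript length, added on top of the regular annotation) can be sketched as a small GTF transformation. This is an illustrative assumption about how such records could be produced, not the authors' code: the function name and toy GTF lines are invented, and a real annotation would also need consistent exon identifiers and numbering, which this sketch does not handle.

```python
def add_transcript_length_exons(gtf_lines):
    """Return the GTF lines plus one extra 'exon' record per 'transcript' record.

    The original records, including the annotated exons, are kept unchanged;
    each added record is a copy of a transcript record relabeled as an exon,
    so it spans the full transcript length.
    """
    out = []
    for line in gtf_lines:
        line = line.rstrip("\n")
        out.append(line)
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) == 9 and fields[2] == "transcript":
            fields[2] = "exon"
            out.append("\t".join(fields))
    return out


# Toy annotation, invented for illustration only.
toy_gtf = [
    '#!toy annotation',
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "g1";',
    'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
hybrid_gtf = add_transcript_length_exons(toy_gtf)
for record in hybrid_gtf:
    print(record)
```

According to the caption, data would then be mapped against such an annotation in Cell Ranger include-introns mode; that mapping step is outside the scope of this sketch.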

Extended Data Fig. 3 Evaluation of optimized reference derived improvements in scRNA-seq data capture.

a. The use of optimized references dramatically changes the composition of variable genes in single-cell RNA-seq analysis. The 850 most variable genes in mouse MnPO neurons or the 450 most variable genes in human T lymphocytes, as identified by the Seurat::FindVariableFeatures ‘vst’ selection method, performed on data mapped to either the exonic reference or the optimized reference. New genes in mouse MnPO and human PBMC data stemming from improved read detection with optimized references are shown in light green. Inset shows the % of new genes stemming from 3′ extended genes (dark green), resolved overlapping genes (white) and hybrid pre-mRNA captured genes (red, genes that display 50% or more reads detected via the hybrid pre-mRNA strategy as opposed to the regular Cell Ranger include-introns strategy). b. Increased gene detection by the Hybrid pre-mRNA capture strategy (transcript-length exons added to the regular genome annotation and data mapped in Cell Ranger include-introns mode) as compared to regular exonic and exon+intron mapped data (Cell Ranger include-introns mode). Evaluated mouse scRNA-seq data include mouse brain (MnPO) tissue and bone marrow, kidney, lung and tongue tissue from the Tabula Muris consortium. Evaluated human data include scRNA-seq data from PBMCs, brain tissue (prefrontal cortex) as well as liver, lung and tongue tissue generated by the Tabula Sapiens consortium. c. Improved sequencing read capture by the Hybrid pre-mRNA reference as compared to the regular exonic or exon+intron read mapping reference strategy. scRNA-seq datasets are the same as in panel b. d. Increased gene detection by 3′ gene extension fixed reference mapped data as compared to regular exonic reference mapped data. scRNA-seq datasets are the same as in panel b. e. Increased sequencing read registration with the gene overlap resolved reference mapped with Cell Ranger include-introns mode as compared to regular Cell Ranger include-introns mapped data. scRNA-seq datasets are the same as in panel b. f. Histogram of enriched genes in the mouse MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns’ based exon+intron reference mapped data. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset. g. Histogram of enriched genes in the human PBMC dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns’ based exon+intron reference mapped data. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset.
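The ‘enriched genes’ criterion used in panels f and g (genes detected only with one reference, or with 50% or more transcripts than in the contrasted dataset) can be written as a short filter. The sketch below assumes that "50% or more" means a count of at least 1.5 times the contrasted count; the function name, the `min_fold` parameter and the toy counts are hypothetical and not taken from the authors' scripts.

```python
def enriched_genes(counts_new, counts_ref, min_fold=1.5):
    """Return genes unique to counts_new or with at least min_fold times the reference count.

    counts_new and counts_ref map gene names to total transcript counts in the
    two mapped datasets being contrasted. min_fold=1.5 encodes the assumed
    reading of "50% or more transcripts".
    """
    enriched = []
    for gene, new in counts_new.items():
        ref = counts_ref.get(gene, 0)
        if new > 0 and (ref == 0 or new >= min_fold * ref):
            enriched.append(gene)
    return sorted(enriched)


# Toy counts, invented for illustration only.
hybrid_counts = {"g1": 150, "g2": 100, "g3": 5, "g4": 0}
include_introns_counts = {"g1": 100, "g2": 90, "g4": 3}
enriched = enriched_genes(hybrid_counts, include_introns_counts)
print(enriched)
```

Here `g1` passes because it reaches 1.5 times the contrasted count, `g3` passes because it is detected only in the first dataset, and `g2` and `g4` are excluded.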

Extended Data Fig. 4 Elucidation of neuron types, cell-type-specific markers and physiological functions of cells in the mouse Median Preoptic Nucleus (MnPO) with regular exonic and optimized transcriptomic reference based scRNA-seq analyses.

a. Violin plot of the expression of previously implicated genetic markers labeling thirst, warmth, cold or licking activated neurons in the MnPO as analyzed by a regular exonic reference based analysis pipeline. Red arrows indicate cellular marker genes in the exonic transcriptomic reference that are obscured by the various read registration issues addressed by reference optimization and for which read mapping data are plotted below. Expression is shown on a log-normalized scale with maximum counts per million (max CPM). b. Violin plot of the same marker expression in the MnPO as analyzed with the optimized transcriptomic reference pipeline. Previously missing marker gene expression data are highlighted with red arrows. c. Mapping of sequencing reads to the Kisspeptin 1 (Kiss1) locus, with the majority being discarded from downstream analysis due to gene overlaps between Kiss1 and Gm28040. d. Mapping of sequencing reads to the Ptger3 locus, with the majority being discarded from downstream analysis with an exonic reference due to their intronic mapping. e. Mapping of sequencing reads to the Gdf11 locus, with a large fraction being discarded from downstream analysis due to mapping to the intergenic region. f. Nomenclature of neuron types in the MnPO as identified by scRNA-seq with the optimized transcriptomic reference (data same as in Fig. 5b). g. Suggested function to cell-type mapping in the MnPO neurons based on overlap of previously identified genetic markers in the new neuronal nomenclature.
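Panel e describes reads that fall in the intergenic region near a gene and are therefore discarded. One way to flag such reads as candidates for 3′ end extension is to look for the closest same-strand gene whose annotated 3′ end lies just upstream of the read. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function name, the tuple layouts and the `max_dist` window are hypothetical, and it does not check whether another gene lies between the read and the candidate gene.

```python
def assign_to_upstream_gene(read, genes, max_dist):
    """Return the nearest same-strand gene whose 3' end is within max_dist upstream of the read.

    read  : (chrom, strand, pos) for an intergenic read
    genes : iterable of (name, chrom, strand, start, end), with start <= end
    Returns the gene name, or None if no gene qualifies.
    """
    chrom, strand, pos = read
    best_name, best_dist = None, None
    for name, g_chrom, g_strand, start, end in genes:
        if g_chrom != chrom or g_strand != strand:
            continue
        # Distance past the annotated 3' end, measured in the direction of transcription.
        dist = pos - end if strand == "+" else start - pos
        if 0 < dist <= max_dist and (best_dist is None or dist < best_dist):
            best_name, best_dist = name, dist
    return best_name


# Toy genes and reads, invented for illustration only.
toy_genes = [
    ("geneA", "chr1", "+", 100, 500),
    ("geneB", "chr1", "-", 2000, 3000),
]
window = 1000  # hypothetical search window in base pairs
assignments = [
    assign_to_upstream_gene(("chr1", "+", 650), toy_genes, window),
    assign_to_upstream_gene(("chr1", "-", 1800), toy_genes, window),
    assign_to_upstream_gene(("chr1", "+", 5000), toy_genes, window),
]
print(assignments)
```

Reads assigned this way only indicate where an annotation might be extended; any actual extension would need to be checked against the read pileup at the locus.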

Supplementary information

Reporting Summary

Peer Review File

Supplementary Table 1

Summary of mouse genome annotation updates.

Supplementary Table 2

Summary of human genome annotation updates.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Pool, AH., Poldsam, H., Chen, S. et al. Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references. Nat Methods 20, 1506–1515 (2023). https://doi.org/10.1038/s41592-023-02003-w


  • DOI: https://doi.org/10.1038/s41592-023-02003-w
