Abstract
Single-cell RNA-sequencing (scRNA-seq) is an indispensable tool for characterizing cellular diversity and generating hypotheses throughout biology. Droplet-based scRNA-seq datasets often lack expression data for genes that can be detected with other methods. Here we show that the observed sensitivity deficits stem from three sources: (1) poor annotation of 3′ gene ends; (2) issues with intronic read incorporation; and (3) gene overlap-derived read loss. We show that missing gene expression data can be recovered by optimizing the reference transcriptome for scRNA-seq through recovering false intergenic reads, implementing a hybrid pre-mRNA mapping strategy and resolving gene overlaps. We demonstrate, with a diverse collection of mouse and human tissue data, that reference optimization can substantially improve cellular profiling resolution and reveal missing cell types and marker genes. Our findings argue that transcriptomic references need to be optimized for scRNA-seq analysis and warrant a reanalysis of previously published datasets and cell atlases.
Data availability
Raw and fully processed scRNA-seq data generated for this project (mouse MnPO and human PBMC) are available at the NCBI Gene Expression Omnibus (GEO, GSE198528). Previously published mouse and human datasets were also analyzed. Mouse 10x Genomics scRNA-seq datasets generated by the Tabula Muris consortium (bone marrow, SRR6835854; kidney, SRR6835849; lung, SRR6835860; tongue, SRR6835844) can be accessed from the GEO repository GSE132042. Human brain scRNA-seq data generated from the prefrontal cortex (CS22_PFC) were acquired from the NEMO archive at https://assets.nemoarchive.org/dat-0rsydy7, which requires a custom data use agreement. Human 10x Genomics scRNA-seq data (liver, TSP14_Liver_NA; lung, TSP14_Lung_Proximal; tongue, TSP14_Tongue_Anterior) generated by the Tabula Sapiens consortium can be accessed through the Tabula Sapiens AWS storage web service, reachable from https://tabula-sapiens-portal.ds.czbiohub.org/; access requires a custom data use agreement. Baseline single-cell transcriptomic references for human (GRCh38) and mouse (mm10) datasets were downloaded from 10x Genomics (latest available version 2020-A): https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest?. The latest optimized versions of the mouse and human reference transcriptomes and the respective genome annotations are available for download at www.thepoollab.org/resources.
Code availability
Custom scripts for analyzing data and generating figures are available at https://github.com/PoolLab/Generecovery. The ReferenceEnhancer R package for optimizing genome annotations for scRNA-seq analysis is available at https://github.com/PoolLab/ReferenceEnhancer.
Acknowledgements
We thank L. S. Pachter and members of the M.T. lab for helpful discussion and comments. We thank the Single-Cell Profiling Center (SPEC) in the Beckman Institute at Caltech for technical assistance with scRNA-seq. A.-H.P. is supported by Eugene McDermott Scholar funds and by startup funds from the Peter O’Donnell Jr. Brain Institute at UT Southwestern. Y.O. is supported by startup funds from the President and Provost of the California Institute of Technology and the Biology and Biological Engineering Division of the California Institute of Technology, the Searle Scholars Program, the Mallinckrodt Foundation, the McKnight Foundation, the Klingenstein–Simons Foundation, the New York Stem Cell Foundation and the NIH (grant nos. R56MH113030 and R01NS109997).
Author information
Contributions
A.-H.P. conceived and designed the project. A.-H.P. and H.P. devised and performed data analysis. A.-H.P. and S.C. generated the MnPO scRNA-seq dataset. S.C. and M.T. generated the human PBMC scRNA-seq dataset. S.C., M.T. and Y.O. provided conceptual advice on data analysis. All authors contributed to the manuscript as drafted by A.-H.P. and H.P. A.-H.P. and Y.O. supervised the overall project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Rob Patro and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Extended data
Extended Data Fig. 1 Genes with shared terminal exon sequences are obscured from scRNA-seq analysis.
a. The terminal exons of the human genes TIFAB and DCANP1 overlap, so all sequencing reads mapping to the overlapping region are discarded under the ‘multigene mapping’ classification. As a result, 3′ scRNA-seq is mostly blind to the expression of these genes. b. Expression information for the TIFAB and DCANP1 genes can be recovered by removing one of the genes and renaming the remaining gene’s transcript, which recovers the discarded expression data.
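The overlap problem described in panel a can be screened for computationally. The following is a minimal Python sketch, not the authors' pipeline: it assumes gene records with 1-based inclusive coordinates (as in GTF) and strand-specific read counting, so only same-strand overlaps are reported. The `Gene` tuple, the function name and the toy coordinates are hypothetical and do not correspond to real gene positions.

```python
from collections import namedtuple

# Hypothetical minimal gene record; coordinates are 1-based and inclusive, as in GTF.
Gene = namedtuple("Gene", ["name", "chrom", "strand", "start", "end"])


def same_strand_overlaps(genes):
    """Return (name, name) pairs of genes whose spans overlap on the same chromosome and strand."""
    groups = {}
    for gene in genes:
        groups.setdefault((gene.chrom, gene.strand), []).append(gene)

    pairs = []
    for group in groups.values():
        group.sort(key=lambda g: g.start)
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if second.start > first.end:
                    break  # sorted by start: no later gene can overlap `first`
                pairs.append((first.name, second.name))
    return pairs


# Toy annotation with made-up coordinates (not real gene positions).
toy_genes = [
    Gene("GeneA", "chr1", "+", 100, 500),
    Gene("GeneB", "chr1", "+", 450, 900),    # overlaps GeneA on the same strand
    Gene("GeneC", "chr1", "-", 400, 600),    # opposite strand, not reported
    Gene("GeneD", "chr1", "+", 1000, 1200),  # no overlap
]
overlapping = same_strand_overlaps(toy_genes)
print(overlapping)
```

Each reported pair is a candidate for manual review; deciding which gene to remove or rename, as in panel b, is a curation step that this sketch does not attempt.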
Extended Data Fig. 2 Different strategies for incorporating intronic reads into scRNA-seq analysis vary by detection of several hundred genes and up to 7.1% of sequencing reads.
a. Comparison of genes detected by distinct methods for incorporating intronic sequencing reads into scRNA-seq analysis, using the mouse MnPO brain nucleus dataset. ‘Cell Ranger pre-mRNA reference’ data were generated with Cell Ranger 6.1.2 software in exonic mapping mode using a genome annotation in which all transcripts were defined as exons, leading to incorporation of reads that previously mapped to introns. ‘STARsolo GeneFull mode’ data were generated with the STARsolo software by specifying the ‘GeneFull’ attribute, which integrates intronically mapped reads into the assembled gene-cell matrix. ‘Cell Ranger include-introns mode’ data were generated with the Cell Ranger software with the ‘--include-introns’ parameter specified, which integrates intronic reads into the assembly of the gene-cell matrix. ‘Hybrid pre-mRNA reference’ combined the ‘Regular pre-mRNA reference’ and ‘Cell Ranger include-introns’ strategies: all genes received additional full-transcript-length exons and data were mapped in Cell Ranger include-introns mode. Black, genes detected by all four methods for incorporating intronic reads; red, genes that are either unique or shared by up to two other methods for detecting intronic reads. b. Matrix outlining the number of unique genes detected by specific intronic read capturing strategies as compared to other methods. c. Comparison of sequencing reads incorporated into the gene-cell matrix by distinct methods for registering intronic sequencing reads in scRNA-seq analysis. Black, reads detected by all three methods; red, reads that are either uniquely detected by a given method or shared by up to two other strategies. d. Histogram of enriched genes in the MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Regular pre-mRNA reference’. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset. e. Histogram of enriched genes in the MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns mode’. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset.
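The annotation half of the hybrid pre-mRNA strategy described in panel a (giving every gene an additional full-transcript-length exon) can be illustrated on GTF text. This is a minimal Python sketch under stated assumptions, not the ReferenceEnhancer implementation: it assumes a tab-separated, nine-column GTF in which each transcript has its own `transcript` record, and simply duplicates that record as an `exon`. The function name and the toy annotation are hypothetical.

```python
def add_transcript_length_exons(gtf_lines):
    """Return GTF lines with one extra exon record spanning each transcript."""
    out = []
    for raw in gtf_lines:
        line = raw.rstrip("\n")
        out.append(line)
        if line.startswith("#"):
            continue  # header/comment lines are passed through unchanged
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] == "transcript":
            exon = list(fields)
            exon[2] = "exon"  # same span and attributes as the transcript record
            out.append("\t".join(exon))
    return out


# Toy annotation (hypothetical gene G1 with two annotated exons).
toy_gtf = [
    "#toy annotation",
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
hybrid = add_transcript_length_exons(toy_gtf)
for record in hybrid:
    print(record)
```

The original exon records are kept, so spliced reads still match the annotated exon structure while reads falling in the 201–799 intron of the toy transcript now also lie within an exon. Mapping the data in include-introns mode, the second half of the strategy, is a separate step performed by the aligner.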
Extended Data Fig. 3 Evaluation of optimized reference derived improvements in scRNA-seq data capture.
a. The use of optimized references dramatically changes the composition of variable genes in single-cell RNA-seq analysis. The top 850 most variable genes in mouse MnPO neurons or the top 450 in human T lymphocytes, as identified by the Seurat::FindVariableFeatures ‘vst’ selection method applied to data mapped with either the exonic reference or the optimized reference. New genes in mouse MnPO and human PBMC data stemming from improved read detection with optimized references are shown in light green. Inset shows the % of new genes stemming from 3′ extended genes (dark green), resolved overlapping genes (white) and hybrid pre-mRNA captured genes (red, genes that display 50% or more reads detected via the hybrid pre-mRNA strategy as opposed to the regular Cell Ranger include-introns strategy). b. Increased gene detection by the Hybrid pre-mRNA capture strategy (transcript-length exons added to the regular genome annotation and data mapped in Cell Ranger include-introns mode) as compared to regular exonic and exon+intron mapped data (Cell Ranger include-introns mode). Evaluated mouse scRNA-seq data include mouse brain (MnPO) tissue and bone marrow, kidney, lung and tongue tissue from the Tabula Muris consortium. Evaluated human data include scRNA-seq data from PBMCs, brain tissue (prefrontal cortex) as well as liver, lung and tongue tissue generated by the Tabula Sapiens consortium. c. Improved sequencing read capture by the Hybrid pre-mRNA reference as compared to the regular exonic or exon+intron read mapping reference strategy. The scRNA-seq datasets are the same as in panel b. d. Increased gene detection by 3′ gene extension fixed reference mapped data as compared to regular exonic reference mapped data. The scRNA-seq datasets are the same as in panel b. e. Increased sequencing read registration with the gene overlap resolved reference mapped in Cell Ranger include-introns mode as compared to regular Cell Ranger include-introns mapped data. The scRNA-seq datasets are the same as in panel b. f. Histogram of enriched genes in the mouse MnPO dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns’ based exon+intron reference mapped data. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset. g. Histogram of enriched genes in the human PBMC dataset detected by ‘Hybrid pre-mRNA reference’ in comparison to ‘Cell Ranger include-introns’ based exon+intron reference mapped data. Genes include uniquely detected genes as well as genes with 50% or more transcripts than in the contrasted reference mapped dataset.
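The enrichment criterion used in panels f and g (uniquely detected genes, plus genes with 50% or more transcripts than in the contrasted dataset) can be written as a simple filter. The Python sketch below is one reading of that criterion, not the authors' script: it assumes per-gene total transcript (UMI) counts summed over cells, and interprets "50% or more" as at least a 1.5-fold count. The function name, the `min_fold` parameter and the toy counts are hypothetical.

```python
def enriched_genes(counts_test, counts_contrast, min_fold=1.5):
    """Genes detected only in `counts_test`, or with >= min_fold times the contrast count."""
    enriched = []
    for gene, test_count in counts_test.items():
        contrast_count = counts_contrast.get(gene, 0)
        if test_count <= 0:
            continue  # not detected in the test dataset at all
        uniquely_detected = contrast_count == 0
        if uniquely_detected or test_count >= min_fold * contrast_count:
            enriched.append(gene)
    return sorted(enriched)


# Toy per-gene totals (made-up numbers).
hybrid_counts = {"g1": 10, "g2": 15, "g3": 5, "g4": 0}
include_introns_counts = {"g1": 10, "g2": 10, "g4": 3}

enriched = enriched_genes(hybrid_counts, include_introns_counts)
print(enriched)
```

In the toy data, `g2` passes on the fold criterion and `g3` passes as uniquely detected, while `g1` (unchanged) and `g4` (not detected in the test dataset) are excluded.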
Extended Data Fig. 4 Elucidation of neuron types, cell-type-specific markers and physiological functions of cells in the mouse Median Preoptic Nucleus (MnPO) with regular exonic and optimized transcriptomic reference based scRNA-seq analyses.
a. Violin plot of the expression of previously implicated genetic markers labeling thirst, warmth, cold or licking activated neurons in the MnPO as analyzed by a regular exonic reference based analysis pipeline. Red arrows indicate cellular marker genes in the exonic transcriptomic reference that are obscured by the various read registration issues addressed by reference optimization and for which read mapping data are plotted below. Expression is shown on a log-normalized scale with maximum counts per million (max CPM). b. Violin plot of the same marker expression in the MnPO as analyzed with the optimized transcriptomic reference pipeline. Previously missing marker gene expression data are highlighted with red arrows. c. Mapping of sequencing reads to the Kisspeptin 1 (Kiss1) locus, with the majority being discarded from downstream analysis due to gene overlaps between Kiss1 and Gm28040. d. Mapping of sequencing reads to the Ptger3 locus, with the majority being discarded from downstream analysis with an exonic reference due to their intronic mapping. e. Mapping of sequencing reads to the Gdf11 locus, with a large fraction being discarded from downstream analysis due to mapping to the intergenic region. f. Nomenclature of neuron types in the MnPO as identified by scRNA-seq with the optimized transcriptomic reference (data same as in Fig. 5b). g. Suggested function to cell-type mapping in the MnPO neurons based on overlap of previously identified genetic markers in the new neuronal nomenclature.
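Panel e shows reads lost because they fall just beyond an annotated 3′ gene end. Recovering them amounts to extending the gene's 3′ boundary, which depends on strand. The Python sketch below shows only that coordinate arithmetic and makes several assumptions: 1-based inclusive coordinates, an extension length supplied by the user (how far to extend is a separate decision that this sketch does not address), and an optional `limit` marking the nearest downstream gene so the extension does not create a new overlap. The function name, parameters and numbers are hypothetical.

```python
def extend_three_prime(start, end, strand, extension, limit=None):
    """Extend a gene span at its 3' end by `extension` bases.

    `limit` is the coordinate of the nearest downstream feature; the
    extended span stops one base short of it. Coordinates are 1-based.
    """
    if extension < 0:
        raise ValueError("extension must be non-negative")
    if strand == "+":
        new_end = end + extension  # 3' end is the larger coordinate
        if limit is not None:
            new_end = min(new_end, limit - 1)
        return start, max(new_end, end)  # never shrink the gene
    if strand == "-":
        new_start = max(1, start - extension)  # 3' end is the smaller coordinate
        if limit is not None:
            new_start = max(new_start, limit + 1)
        return min(new_start, start), end  # never shrink the gene
    raise ValueError("strand must be '+' or '-'")


# Made-up coordinates for illustration only.
print(extend_three_prime(1000, 2000, "+", 500))
print(extend_three_prime(1000, 2000, "+", 500, limit=2300))
print(extend_three_prime(1000, 2000, "-", 500))
print(extend_three_prime(1000, 2000, "-", 500, limit=800))
```

In a GTF, the same shift would be applied to the gene, transcript and terminal exon records together so that they stay consistent.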
Supplementary information
Supplementary Table 1
Summary of mouse genome annotation updates.
Supplementary Table 2
Summary of human genome annotation updates.
About this article
Cite this article
Pool, AH., Poldsam, H., Chen, S. et al. Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references. Nat Methods 20, 1506–1515 (2023). https://doi.org/10.1038/s41592-023-02003-w
DOI: https://doi.org/10.1038/s41592-023-02003-w