Abstract
Over the last decade, biology has begun utilizing ‘big data’ approaches, resulting in large, comprehensive atlases in modalities ranging from transcriptomics to neural connectomics. However, these approaches must be complemented and integrated with ‘small data’ approaches to efficiently utilize data from individual labs. Integration of smaller datasets with major reference atlases is critical to provide context to individual experiments, and approaches toward integration of large and small data have been a major focus in many fields in recent years. Here we discuss progress in integration of small data with consortium-sized atlases across multiple modalities, and its potential applications. We then examine promising future directions for utilizing the power of small data to maximize the information garnered from small-scale experiments. We envision that, in the near future, international consortia comprising many laboratories will work together to collaboratively build reference atlases and foundation models using small data methods.
Acknowledgements
This work was supported by National Institutes of Health (NIH) grants UM1MH130994, U01AG076791, U01DA052769, R01AG067153, R01AG082127 and RF1AG065675 to X.X. and the Knights Templar Eye Foundation grant KTEF-5646361 to S.F.G. F.J.T. acknowledges support from the German Federal Ministry of Education and Research (BMBF; 031L0210A) and from the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI (ZT-I-PF-5-01). Q.N. acknowledges support from National Science Foundation grants DMS1763272, MCB202842 and CBET2134916, and NIH grants R01AR079150, R01ED030565 and U01AR073159. K.G.J. acknowledges support from NIH grant T32 DC010775-14.
Author information
Contributions
K.G.J. and S.F.G. wrote the paper and created the figures. Q.N. and F.J.T. oversaw the writing. X.X. oversaw and supported the work.
Ethics declarations
Competing interests
F.J.T. consults for Immunai, Singularity Bio B.V., CytoReason and Omniscope, and has ownership interest in Dermagnostix GmbH and Cellarity.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.
About this article
Cite this article
Johnston, K.G., Grieco, S.F., Nie, Q. et al. Small data methods in omics: the power of one. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02390-8
DOI: https://doi.org/10.1038/s41592-024-02390-8