  • Perspective

Small data methods in omics: the power of one

Abstract

Over the last decade, biology has begun utilizing ‘big data’ approaches, resulting in large, comprehensive atlases in modalities ranging from transcriptomics to neural connectomics. However, these approaches must be complemented and integrated with ‘small data’ approaches to efficiently utilize data from individual labs. Integration of smaller datasets with major reference atlases is critical to provide context to individual experiments, and approaches toward integration of large and small data have been a major focus in many fields in recent years. Here we discuss progress in integration of small data with consortium-sized atlases across multiple modalities, and its potential applications. We then examine promising future directions for utilizing the power of small data to maximize the information garnered from small-scale experiments. We envision that, in the near future, international consortia comprising many laboratories will work together to collaboratively build reference atlases and foundation models using small data methods.

Fig. 1: Constructing an updateable integrated cell atlas.
Fig. 2: Applications of single-cell integrative foundational models.

Acknowledgements

This work was supported by National Institutes of Health (NIH) grants UM1MH130994, U01AG076791, U01DA052769, R01AG067153, R01AG082127 and RF1AG065675 to X.X. and the Knights Templar Eye Foundation grant KTEF-5646361 to S.F.G. F.J.T. acknowledges support from the German Federal Ministry of Education and Research (BMBF; 031L0210A) and from the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI (ZT-I-PF-5-01). Q.N. acknowledges support from National Science Foundation grants DMS1763272, MCB202842 and CBET2134916, and NIH grants R01AR079150, R01ED030565 and U01AR073159. K.G.J. acknowledges support from NIH grant T32 DC010775-14.

Author information

Contributions

K.G.J. and S.F.G. wrote the paper and created the figures. Q.N. and F.J.T. oversaw the writing. X.X. oversaw and supported the work.

Corresponding authors

Correspondence to Qing Nie, Fabian J. Theis or Xiangmin Xu.

Ethics declarations

Competing interests

F.J.T. consults for Immunai, Singularity Bio B.V., CytoReason and Omniscope, and has ownership interest in Dermagnostix GmbH and Cellarity.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnston, K.G., Grieco, S.F., Nie, Q. et al. Small data methods in omics: the power of one. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02390-8

  • DOI: https://doi.org/10.1038/s41592-024-02390-8
