Abstract
The development of single-cell multi-omics technology has greatly enhanced our understanding of biology, and in parallel, numerous algorithms have been proposed to predict the protein abundance and/or chromatin accessibility of cells from single-cell transcriptomic information and to integrate various types of single-cell multi-omics data. However, few studies have systematically compared and evaluated the performance of these algorithms. Here, we present a benchmark study of 14 protein abundance/chromatin accessibility prediction algorithms and 18 single-cell multi-omics integration algorithms using 47 single-cell multi-omics datasets. Our benchmark study showed that, overall, totalVI and scArches outperformed the other algorithms for predicting protein abundance, and LS_Lab was the top-performing algorithm for the prediction of chromatin accessibility in most cases. Seurat, MOJITOO and scAI emerged as the leading algorithms for vertical integration, whereas totalVI and UINMF outperformed their counterparts in both horizontal and mosaic integration scenarios. Additionally, we provide a pipeline to assist researchers in selecting the optimal multi-omics prediction and integration algorithm.
Data availability
A summary of the multi-omics datasets used in the benchmark study, including the sequencing technologies and the websites where the raw data are available, is as follows: dataset 1 (human BMMCs): CITE-seq, GSE128639 (ref. 5); dataset 2 (human BMMCs): CITE-seq, GSE194122 (ref. 79); dataset 3 (human brain immune cells): CITE-seq, GSE201048 (ref. 80); dataset 4 (human CBMCs): CITE-seq, GSE100866 (ref. 1); dataset 5 (human glioblastomas): CITE-seq, GSM4972212 (ref. 81); dataset 6 (mouse glioblastomas): CITE-seq, GSE163120 (ref. 81); dataset 7 (mouse HSPCs): CITE-seq, GSE175702 (ref. 82); dataset 8 (human MALT tumor): CITE-seq, https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3; datasets 9–10 (mouse splenic myeloid cells): CITE-seq, GSE149544 (ref. 83); dataset 11 (mouse naive brains): CITE-seq, GSE148127 (ref. 84); datasets 12–13 (human PBMCs): CITE-seq, GSE164378 (ref. 5); datasets 14–15 (human PBMCs): CITE-seq, https://zenodo.org/record/6348128#.Y5f40LJBzDU (ref. 30); datasets 21–22 (mouse spleen and lymph nodes): CITE-seq, GSE150599 (ref. 6); datasets 23–24 (human PBMCs): REAP-seq, GSE100501 (ref. 2); datasets 25–26 and datasets 40–41 (human PBMCs): DOGMA-seq, GSE156478 (ref. 18); datasets 27 and 42 (human PBMCs): TEA-seq, GSE158013 (ref. 71); dataset 28 (human PBMCs): inCITE-seq, GSE163480 (ref. 85); dataset 29 (skin of mouse): SHARE-seq, GSE140203 (ref. 3); dataset 30 (adult brain of mouse): SHARE-seq, GSE140203 (ref. 3); dataset 31 (adult brain of mouse): SNARE-seq, GSE126074 (ref. 4); dataset 32 (adult brain of mouse): ISSAAC-seq, https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11264/ (ref. 12); dataset 33 (adult brain of mouse): 10x Multiome, https://www.10xgenomics.com/resources/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-2-0-0/; dataset 34 (10,000 PBMCs with granulocytes removed): 10x Multiome, https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0/; dataset 35 (3,000 PBMCs with granulocytes removed): 10x Multiome, https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0/; dataset 36 (10,000 PBMCs): 10x Multiome, https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-10-k-1-standard-2-0-0/; dataset 37 (3,000 PBMCs): 10x Multiome, https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0/; dataset 38 (mouse retina): 10x Multiome, GSE201402 (ref. 86); dataset 39 (human BMMCs): 10x Multiome, GSE194122 (ref. 79); dataset 43 (mouse spleen): scRNA-seq, GSE132901 (ref. 87); dataset 44 (mouse retina): scRNA-seq, GSE181251 (ref. 88); dataset 45 (mouse adult brain): scRNA-seq, GSE246147 (ref. 89); dataset 46 (mouse HSPCs): scRNA-seq, GSE175702 (ref. 82); dataset 47 (mouse retina): scATAC-seq, GSE181251 (ref. 88). Source data are provided with this paper.
Code availability
We have uploaded the code and scripts used for the benchmark study and figure plotting to a GitHub repository, which can be accessed at https://github.com/QuKunLab/MultiomeBenchmarking/. The code is also available in the Zenodo repository via https://doi.org/10.5281/zenodo.10540843 (ref. 90).
References
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).
Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116.e20 (2020).
Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
Zhang, L., Zhang, J. & Nie, Q. DIRECT-NET: an efficient method to discover cis-regulatory elements and construct regulatory networks from single-cell multiomics data. Sci. Adv. 8, eabl7393 (2022).
Kartha, V. K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genom. 2, 100166 (2022).
Li, C., Virgilio, M. C., Collins, K. L. & Welch, J. D. Multi-omic single-cell velocity models epigenome–transcriptome interactions and improves cell fate prediction. Nat. Biotechnol. 41, 387–398 (2023).
Gorin, G., Svensson, V. & Pachter, L. Protein velocity and acceleration from single-cell multiomics experiments. Genome Biol. 21, 39 (2020).
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Xu, W. et al. ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat. Methods 19, 1243–1249 (2022).
Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 651 (2020).
Bennett, H. M., Stephenson, W., Rose, C. M. & Darmanis, S. Single-cell proteomics enabled by next-generation sequencing or mass spectrometry. Nat. Methods 20, 363–374 (2023).
Gatto, L. et al. Initial recommendations for performing, benchmarking and reporting single-cell proteomics experiments. Nat. Methods 20, 375–386 (2023).
Lance, C. et al. Multimodal single cell data integration challenge: results and lessons learned. In Proc. NeurIPS 2021 Competitions and Demonstrations Track (eds. Kiela, D. et al.) 162–176 (PMLR, 2022).
Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat. Biotechnol. 39, 825–835 (2021).
Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat. Methods 20, 1222–1231 (2023).
Lakkis, J. et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. Nat. Mach. Intell. 4, 940–952 (2022).
Wu, K. E., Yost, K. E., Chang, H. Y. & Zou, J. BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proc. Natl Acad. Sci. USA 118, e2023070118 (2021).
Du, J.-H., Cai, Z. & Roeder, K. Robust probabilistic modeling for single-cell multimodal mosaic integration and imputation via scVAEIT. Proc. Natl Acad. Sci. USA 119, e2214414119 (2022).
Lan, M., Zhang, S. & Gao, L. Efficient generation of paired single-cell multiomics profiles by deep learning. Adv. Sci. 10, 2301169 (2023).
Wen, H. et al. Proc. 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, 2022).
Yang, K. D. et al. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Commun. 12, 31 (2021).
Baysoy, A., Bai, Z., Satija, R. & Fan, R. The technological landscape and applications of single-cell multi-omics. Nat. Rev. Mol. Cell Biol. 24, 695–713 (2023).
Cheng, M., Li, Z. & Costa, I. G. MOJITOO: a fast and universal method for integration of multimodal single-cell data. Bioinformatics 38, i282–i289 (2022).
Lotfollahi, M., Litinetskaya, A. & Theis, F. J. Multigrate: single-cell multi-omic data integration. Preprint at bioRxiv https://doi.org/10.1101/2022.03.16.484643 (2022).
Wang, R. H., Wang, J. & Li, S. C. Probabilistic tensor decomposition extracts better latent embeddings from single-cell multiomic data. Nucleic Acids Res. 51, e81 (2023).
Kim, H. J., Lin, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics 36, 4137–4143 (2020).
Ma, A. et al. Single-cell biological network inference using a heterogeneous graph transformer. Nat. Commun. 14, 964 (2023).
Jin, S., Zhang, L. & Nie, Q. scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles. Genome Biol. 21, 25 (2020).
Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
Li, G. et al. A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data. Genome Biol. 23, 20 (2022).
Lynch, A. W. et al. MIRA: joint regulatory modeling of multimodal expression and chromatin accessibility in single cells. Nat. Methods 19, 1097–1108 (2022).
Singh, R., Hie, B. L., Narayan, A. & Berger, B. Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities. Genome Biol. 22, 131 (2021).
Kriebel, A. R. & Welch, J. D. UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization. Nat. Commun. 13, 780 (2022).
Zhang, Z. et al. scMoMaT jointly performs single cell mosaic integration and multi-modal bio-marker detection. Nat. Commun. 14, 384 (2023).
Ghazanfar, S., Guibentif, C. & Marioni, J. C. Stabilized mosaic single-cell data integration using unshared features. Nat. Biotechnol. 42, 284–292 (2024).
De Biasi, S. et al. Circulating mucosal-associated invariant T cells identify patients responding to anti-PD-1 therapy. Nat. Commun. 12, 1669 (2021).
Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
Miao, Z., Humphreys, B. D., McMahon, A. P. & Kim, J. Multi-omics integration in the age of million single-cell data. Nat. Rev. Nephrol. 17, 710–724 (2021).
Argelaguet, R., Cuomo, A. S. E., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
Hu, Y. et al. WEDGE: imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition. Brief. Bioinform. 22, bbab085 (2021).
Truong, K.-L. et al. Killer-like receptors and GPR56 progressive expression defines cytokine production of human CD4+ memory T cells. Nat. Commun. 10, 2263 (2019).
Fergusson, J. R. et al. CD161intCD8+ T cells: a novel population of highly functional, memory CD8+ T cells enriched within the gut. Mucosal Immunol. 9, 401–413 (2016).
Kung, P. C., Goldstein, G., Reinherz, E. L. & Schlossman, S. F. Monoclonal antibodies defining distinctive human T cell surface antigens. Science 206, 347–349 (1979).
Liang, Y. & Tedder, T. F. Identification of a CD20-, FcϵRIβ-, and HTm4-Related gene family: sixteen new MS4A family members expressed in human and mouse. Genomics 72, 119–127 (2001).
Ziegler-Heitbrock, H. W. L. & Ulevitch, R. J. CD14: cell surface receptor and differentiation marker. Immunol. Today 14, 121–125 (1993).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
Gertz, J. et al. Distinct properties of cell-type-specific and shared transcription factor binding sites. Mol. Cell 52, 25–36 (2013).
Kang, R. et al. EnhancerDB: a resource of transcriptional regulation in the context of enhancers. Database 2019, bay141 (2019).
Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
Lewis, S. M. et al. Spatial omics and multiplexed imaging to explore cancer biology. Nat. Methods 18, 997–1012 (2021).
Li, B. et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat. Methods 19, 662–670 (2022).
Linderman, G. C. et al. Zero-preserving imputation of single-cell RNA-seq data. Nat. Commun. 13, 192 (2022).
Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).
Chen, A. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777–1792.e21 (2022).
Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).
Su, G. et al. Spatial multi-omics sequencing for fixed tissue via DBiT-seq. STAR Protoc. 2, 100532 (2021).
Liu, Y. et al. High-plex protein and whole transcriptome co-mapping at cellular resolution with spatial CITE-seq. Nat. Biotechnol. 41, 1405–1409 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).
Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).
Swanson, E. et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. eLife 10, e63632 (2021).
Hand, D. J. & Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186 (2001).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Strehl, A. & Ghosh, J. Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks (eds. Vanschoren, J. & Yeung, S.) 13 (NeurIPS, 2021).
Kumar, P. et al. Single-cell transcriptomics and surface epitope detection in human brain epileptic lesions identifies pro-inflammatory signaling. Nat. Neurosci. 25, 956–966 (2022).
Pombo Antunes, A. R. et al. Single-cell profiling of myeloid cells in glioblastoma across species and disease stage reveals macrophage competition and specialization. Nat. Neurosci. 24, 595–610 (2021).
Konturek-Ciesla, A. et al. Temporal multimodal single-cell profiling of native hematopoiesis illuminates altered differentiation trajectories with age. Cell Rep. 42, 112304 (2023).
Lukowski, S. W. et al. Absence of Batf3 reveals a new dimension of cell state heterogeneity within conventional dendritic cells. iScience 24, 102402 (2021).
Golomb, S. M. et al. Multi-modal single-cell analysis reveals brain immune landscape plasticity during aging and gut microbiota dysbiosis. Cell Rep. 33, 108438 (2020).
Chung, H. et al. Joint single-cell measurements of nuclear proteins and RNA in vivo. Nat. Methods 18, 1204–1212 (2021).
Dou, J. et al. Bi-order multimodal integration of single-cell data. Genome Biol. 23, 112 (2022).
Kimmel, J. C. et al. Murine single-cell RNA-seq reveals cell-identity- and tissue-specific trajectories of aging. Genome Res. 29, 2088–2103 (2019).
Lyu, P. et al. Gene regulatory networks controlling temporal patterning, neurogenesis, and cell-fate specification in mammalian retina. Cell Rep. 37, 109994 (2021).
Sun, W. et al. Spatial transcriptomics reveal neuron–astrocyte synergy in long-term memory. Nature 627, 374–381 (2024).
Hu, Y. et al. Benchmarking algorithms for single-cell multi-omics prediction and integration. Zenodo https://doi.org/10.5281/zenodo.10540843 (2024).
Acknowledgements
This work was supported by the National Natural Science Foundation of China grants (T2125012 to K.Q.), the National Key R&D Program of China (2020YFA0112200 and 2022YFA1303200 to K.Q.), the National Natural Science Foundation of China grants (32170668 to B.L.; 12371383 and 61972368 to F.C.), CAS Project for Young Scientists in Basic Research YSBR-005 (to K.Q.), Anhui Province Science and Technology Key Program (202003a07020021 to K.Q.), the Fundamental Research Funds for the Central Universities (YD2070002019, WK9110000141 and WK2070000158 to K.Q.; WK0010000085 to Y.H.), Anhui Provincial Natural Science Foundation (2308085QA07 to Y.H.) and China Postdoctoral Science Foundation (2023M733383 to Y.H.). We thank the USTC supercomputing center and the School of Life Science Bioinformatics Center for providing computing resources for this project.
Author information
Contributions
K.Q., B.L. and F.C. conceived the project. Y.H., S.W. and Y. Luo designed the framework and performed data analysis with help from T.W., S.J., Y.Z., N.L. and Z.Y. Y. Li, W.D. and C.J. contributed to the revision. B.L., Y.H. and K.Q. wrote the manuscript with input from all authors. K.Q. supervised the entire project. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Jinmiao Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editors: Hui Hua and Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Performance of eleven algorithms in predicting RC protein abundance from RNA expression.
a, b, Average PCC (a) and CMD (b) values between the reference and predicted RC protein expression for the intra-dataset scenario, that is, the training and test sets are from the same datasets. The X and Y axes are the cell‒cell and protein‒protein PCC/CMD, respectively, and the dashed lines are the medians of all algorithms’ results. Error bar: standard deviation of 23 datasets. Data are presented as mean values +/− 0.5xSD. c, d, Same as (a) and (b), but the results were predicted for the inter-dataset scenario, that is, the training and test sets are from different datasets. Error bar: standard deviation of 10 datasets. e, Average RMSE values between the reference data and the predicted results for the intra-dataset scenario (X axes) and inter-dataset scenario (Y axes). Error bars: standard deviation of 23 datasets (X axes) or 10 datasets (Y axes). Data are presented as mean values +/− 0.5xSD. f, g, Rank index (RI) values of eleven algorithms in the intra-dataset (f) and inter-dataset (g) scenarios. h, The overall performance of eleven algorithms in both intra-dataset and inter-dataset scenarios. Source data for this figure are provided.
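The three accuracy measures named in these captions (PCC, CMD and RMSE) can be illustrated with a minimal standard-library sketch. It rests on assumptions that are not stated in the text above: the matrices are stored as cells in rows and proteins in columns, the PCC is averaged per cell or per protein, and CMD is taken to be the usual correlation matrix distance, 1 − tr(R₁R₂)/(‖R₁‖_F ‖R₂‖_F). All function names are hypothetical, and the benchmark's own scripts may define these quantities differently.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

def columns(matrix):
    """Transpose a row-major matrix (cells x features) into feature vectors."""
    return [list(col) for col in zip(*matrix)]

def mean_pcc(ref, pred, axis):
    """Average PCC per cell (axis='cell') or per feature (axis='protein').

    Assumption: rows are cells and columns are proteins.
    """
    if axis == "protein":
        ref, pred = columns(ref), columns(pred)
    values = [pearson(r, p) for r, p in zip(ref, pred)]
    values = [v for v in values if not math.isnan(v)]
    return sum(values) / len(values)

def rmse_matrix(ref, pred):
    """Root mean squared error over all matrix entries."""
    diffs = [(a - b) ** 2 for r, p in zip(ref, pred) for a, b in zip(r, p)]
    return math.sqrt(sum(diffs) / len(diffs))

def corr_matrix(vectors):
    """Pairwise Pearson correlation matrix of a list of vectors."""
    return [[pearson(u, v) for v in vectors] for u in vectors]

def cmd(ref_vectors, pred_vectors):
    """Correlation matrix distance (assumed definition, see lead-in).

    Both correlation matrices are symmetric, so tr(R1 R2) equals the sum
    of their elementwise products.
    """
    r1, r2 = corr_matrix(ref_vectors), corr_matrix(pred_vectors)
    inner = sum(a * b for ra, rb in zip(r1, r2) for a, b in zip(ra, rb))
    norm1 = math.sqrt(sum(a * a for ra in r1 for a in ra))
    norm2 = math.sqrt(sum(b * b for rb in r2 for b in rb))
    return 1.0 - inner / (norm1 * norm2)

# Toy example: 4 cells x 3 proteins (made-up numbers, not benchmark data).
reference = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [0.5, 3.0, 1.0], [4.0, 0.0, 2.0]]
predicted = [[1.2, 1.8, 3.1], [2.1, 0.7, 3.6], [0.4, 2.5, 1.3], [3.7, 0.2, 2.2]]
cell_pcc = mean_pcc(reference, predicted, "cell")
protein_pcc = mean_pcc(reference, predicted, "protein")
cell_cmd = cmd(reference, predicted)                       # cell-cell
protein_cmd = cmd(columns(reference), columns(predicted))  # protein-protein
error = rmse_matrix(reference, predicted)
```

Under these assumptions a higher PCC and a lower CMD or RMSE indicate a better prediction, which matches how the axes of panels a to e are described.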
Extended Data Fig. 2 Performance of eleven algorithms in predicting RU protein abundance from RNA expression.
a, b, Average PCC (a) and CMD (b) values between the reference and predicted RU protein abundance for the intra-dataset scenario, that is, the training and test sets are from the same datasets. The X and Y axes are the cell‒cell and protein‒protein PCC/CMD, respectively, and the dashed lines are the medians of all algorithms’ results. Error bar: standard deviation of 23 datasets. Data are presented as mean values +/− 0.5xSD. c, d, Same as (a) and (b), but the results were predicted for the inter-dataset scenario, that is, the training and test sets are from different datasets. Error bar: standard deviation of 10 datasets. e, Average RMSE values between the reference data and the predicted results for the intra-dataset scenario (X axes) and inter-dataset scenario (Y axes). Error bars: standard deviation of 23 datasets (X axes) or 10 datasets (Y axes). Data are presented as mean values +/− 0.5xSD. f, g, Rank index (RI) values of seven algorithms in the intra-dataset (f) and inter-dataset (g) scenarios. h, The overall performance of seven algorithms in both intra-dataset and inter-dataset scenarios. Source data for this figure are provided.
Extended Data Fig. 3 Performance of nine chromatin accessibility prediction algorithms when converting peaks to DORCs.
a, b, Average PCC (a) and CMD (b) values between the reference data and the predicted results for the intra-dataset scenario, that is, the training and test sets are from the same datasets. The X and Y axes are the cell‒cell and DORC-DORC PCC/CMD, respectively, and the dashed lines are the medians of all algorithms’ results. Error bar: standard deviation of 11 datasets. Data are presented as mean values +/− 0.5xSD. c, Average RMSE values between the reference data and the predicted results for the intra-dataset scenario (X axes) and inter-dataset scenario (Y axes). Error bar: standard deviation of 11 datasets (X axes) or 8 datasets (Y axes). Data are presented as mean values +/− 0.5xSD. d, e, Same as (a) and (b), but the results were predicted for the inter-dataset scenario, that is, the training and test sets are from different datasets. Error bar: standard deviation of 8 datasets. f, g, Rank index (RI) values of nine algorithms in the intra-dataset (f) and inter-dataset (g) scenarios. h, The overall performance of nine algorithms in both intra-dataset and inter-dataset scenarios. Source data for this figure are provided.
Extended Data Fig. 4 Performance of nine chromatin accessibility prediction algorithms when using smoothed ATAC-seq matrix.
a, b, Average PCC (a) and CMD (b) values between the KNN-smoothing reference data and the predicted results for the intra-dataset scenario, that is, the training and test sets are from the same datasets. The X and Y axes are the cell‒cell and peak-peak PCC/CMD, respectively, and the dashed lines are the medians of all algorithms’ results. Error bar: standard deviation of 11 datasets. Data are presented as mean values +/− 0.5xSD. c, Average RMSE values between the KNN-smoothing reference data and the predicted results for the intra-dataset scenario (X axes) and inter-dataset scenario (Y axes). Error bar: standard deviation of 11 datasets (X axes) or 8 datasets (Y axes). Data are presented as mean values +/− 0.5xSD. d, e, Same as (a) and (b), but the results were predicted for the inter-dataset scenario, that is, the training and test sets are from different datasets. Error bar: standard deviation of 8 datasets. f, g, Rank index (RI) values of nine algorithms in the intra-dataset (f) and inter-dataset (g) scenarios. h, The overall performance of nine algorithms in both intra-dataset and inter-dataset scenarios. Source data for this figure are provided.
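The idea of a KNN-smoothed reference matrix can be sketched as follows. This is only an illustration under stated assumptions: each cell's profile is replaced by the mean over itself and its k nearest cells, neighbours are found by Euclidean distance on the raw profiles, and the function name `knn_smooth` and the value of `k` are hypothetical. The caption does not say how the benchmark chooses neighbours or k, so the actual procedure may differ (for example, neighbours may be searched in a reduced-dimension space).

```python
import math

def knn_smooth(matrix, k):
    """Replace each row (cell) by the mean of itself and its k nearest rows.

    Assumptions: rows are cells, columns are peaks, and neighbours are
    ranked by Euclidean distance on the unsmoothed profiles.
    """
    n_cells = len(matrix)
    smoothed = []
    for row in matrix:
        # Rank all cells by distance to this cell; the cell itself has
        # distance 0, so it is normally the first entry. Ties are broken
        # by cell index to keep the result deterministic.
        order = sorted(
            range(n_cells),
            key=lambda j: (math.dist(row, matrix[j]), j),
        )
        neighbours = order[: k + 1]
        smoothed.append(
            [
                sum(matrix[j][c] for j in neighbours) / len(neighbours)
                for c in range(len(row))
            ]
        )
    return smoothed

# Toy binary accessibility matrix: 4 cells x 3 peaks (made-up values).
atac = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
]
smoothed_atac = knn_smooth(atac, k=1)
```

Smoothing of this kind turns a sparse, near-binary matrix into a denser one with fractional values, which is why correlation-based metrics computed against a smoothed reference can differ from those computed against the raw peaks.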
Extended Data Fig. 5 Computational resources consumed by the fourteen multi-omics prediction algorithms.
a, b, The computational time and memory cost of eleven algorithms for predicting protein abundance in datasets with different numbers of cells. Guanlab-dengkw and scArches reported memory errors and stopped when processing the dataset with 500k cells. Error bar: standard deviation of 5 down-samplings and 2 tests. Data are presented as mean values +/− 0.5xSD. c, d, The computational time and memory cost of nine algorithms for predicting chromatin accessibility in datasets with different numbers of cells. Error bar: standard deviation of 5 down-samplings and 2 tests. Data are presented as mean values +/− 0.5xSD. Source data for this figure are provided.
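As a rough illustration of how run time and peak memory might be recorded over repeated runs and then summarized as mean ± 0.5 × SD, the sketch below uses only the Python standard library. It is not the benchmark's measurement code: `profile`, `mean_and_half_sd` and `toy_workload` are hypothetical names, `tracemalloc` only tracks memory allocated through Python (not native or GPU memory), and the use of the sample standard deviation is an assumption.

```python
import statistics
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def mean_and_half_sd(values):
    """Return (mean, mean - 0.5*SD, mean + 0.5*SD) for a list of numbers.

    Assumption: SD is the sample standard deviation.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, mean - 0.5 * sd, mean + 0.5 * sd

def toy_workload(n_cells):
    """Stand-in for an algorithm run on a down-sampled dataset."""
    return sum(i * i for i in range(n_cells))

# Repeat the measurement several times, as one would for repeated
# down-samplings, then summarize the run times.
times = []
for _ in range(5):
    _, seconds, _ = profile(toy_workload, 10_000)
    times.append(seconds)
mean_time, lower, upper = mean_and_half_sd(times)
```

For tools that run outside the Python interpreter or on a GPU, an external monitor of the whole process would be needed instead of `tracemalloc`.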
Extended Data Fig. 6 Computational resources consumed by eighteen single-cell multi-omics integration algorithms.
a, Computational time and memory used by nine vertical integration algorithms when integrating RNA expression and protein abundance for datasets with different numbers of cells. CiteFuse reported memory errors and stopped when processing datasets with over 20k cells. Error bar: standard deviation of 5 down-samplings and 2 tests. Data are presented as mean values +/− 0.5xSD. b, Same as (a), but the results were generated by twelve vertical integration algorithms when integrating RNA expression and chromatin accessibility. scAI reported memory errors and stopped when processing datasets with over 20k cells. c, Computational time and memory cost of five horizontal integration algorithms when integrating single-cell RNA + Protein data for datasets with different numbers of cells. Error bar: standard deviation of 5 down-samplings and 2 tests. Data are presented as mean values +/− 0.5xSD. d, Same as (c), but the results were generated by seven horizontal integration algorithms when integrating single-cell RNA + ATAC data. e, Computational time and memory cost of seven mosaic integration algorithms when integrating scRNA-seq and single-cell RNA + Protein data for datasets with different numbers of cells. Error bar: standard deviation of 5 down-samplings and 2 tests. Data are presented as mean values +/− 0.5xSD. f–h, Same as (e), but the results were generated by mosaic integration algorithms when integrating scRNA-seq data and single-cell RNA + ATAC data (f), integrating scATAC-seq data and single-cell RNA + ATAC data (g), and integrating single-cell RNA + Protein data and single-cell RNA + ATAC data (h). Source data for this figure are provided.
Extended Data Fig. 7 Summary of the performance of the fourteen multi-omics prediction algorithms.
The figure shows: (i) the properties of these algorithms, including the programming languages, methodologies, and GPU acceleration requirements; (ii) the overall performance of these algorithms, evaluated by six metrics in both the inter- and intra-scenarios. A lighter color (and/or a larger dot) indicates better performance for a given metric; (iii) the computational time and memory consumed by these algorithms for different sizes of datasets; ‘NA’ indicates a memory error or invalid result. Source data for this figure are provided.
Extended Data Fig. 8 Summary of the performance of the fifteen vertical integration algorithms.
The figure shows: (i) the properties of these algorithms, including the programming languages, methodologies, and GPU acceleration requirements; (ii) the overall performance of these algorithms, evaluated by four metrics; (iii) the computational time and memory consumed by these algorithms for different sizes of datasets; ‘NA’ indicates a memory error or invalid result. Source data for this figure are provided.
Extended Data Fig. 9 Summary of the performance of nine horizontal integration algorithms.
The figure shows: (i) the properties of these algorithms, including the programming languages, methodologies, and GPU acceleration requirements; (ii) the overall performance of these algorithms, evaluated by ten metrics in both the inter- and intra-scenarios; (iii) the computational time and memory consumed by these algorithms for different sizes of datasets; ‘NA’ indicates a memory error or invalid result. Source data for this figure are provided.
Extended Data Fig. 10 Summary of the performance of eight mosaic integration algorithms.
The figure shows: (i) the properties of these algorithms, including the programming languages, methodologies, and GPU acceleration requirements; (ii) the overall performance of these algorithms, evaluated by ten metrics in both the inter- and intra-scenarios; (iii) the computational time and memory consumed by these algorithms for different sizes of datasets; ‘NA’ indicates a memory error or invalid result. Source data for this figure are provided.
Supplementary information
Supplementary Information
Supplementary Figs. 1–59
Supplementary Tables 1–9
Supplementary Table 1: Multi-omics prediction algorithm properties. Supplementary Table 2: Detailed information of 47 multi-omics datasets. Supplementary Table 3: Quality-control parameters for 39 multi-omics datasets used for prediction algorithms. Supplementary Table 4: Vertical integration algorithm properties. Supplementary Table 5: Horizontal integration algorithm properties. Supplementary Table 6: Mosaic integration algorithm properties. Supplementary Table 7: Detailed information of 24 single-cell multi-omics datasets for vertical integration. Supplementary Table 8: Detailed information of 19 single-cell multi-omics data groups used for benchmarking horizontal integration algorithms. Supplementary Table 9: Detailed information of 55 paired datasets used for benchmarking mosaic integration algorithms.
Source data
Source Data Fig. 1: Statistical source data.
Source Data Fig. 2: Statistical source data.
Source Data Fig. 3: Statistical source data.
Source Data Fig. 4: Statistical source data.
Source Data Fig. 5: Statistical source data.
Source Data Fig. 6: Statistical source data.
Source Data Extended Data Fig. 1: Statistical source data.
Source Data Extended Data Fig. 2: Statistical source data.
Source Data Extended Data Fig. 3: Statistical source data.
Source Data Extended Data Fig. 4: Statistical source data.
Source Data Extended Data Fig. 5: Statistical source data.
Source Data Extended Data Fig. 6: Statistical source data.
Source Data Extended Data Fig. 7: Statistical source data.
Source Data Extended Data Fig. 8: Statistical source data.
Source Data Extended Data Fig. 9: Statistical source data.
Source Data Extended Data Fig. 10: Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Y., Wan, S., Luo, Y. et al. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02429-w