Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq


Massively parallel single-cell and single-nucleus RNA sequencing has opened the way to systematic tissue atlases in health and disease, but as the scale of data generation is growing, so is the need for computational pipelines for scaled analysis. Here we developed Cumulus—a cloud-based framework for analyzing large-scale single-cell and single-nucleus RNA sequencing datasets. Cumulus combines the power of cloud computing with improvements in algorithm and implementation to achieve high scalability, low cost, user-friendliness and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.

Fig. 1: Cumulus: a scalable, feature-rich, accessible cloud-based framework for sc/snRNA-seq analysis.
Fig. 2: Algorithmic and implementation improvements underlying Pegasus’ high scalability.

Data availability

The bone marrow dataset is available at The 1.3 million mouse brain dataset is available at The 5K PBMC dataset is available at The BGISEQ SMART-Seq2 data are available at In particular, data from accessions SRX3654625, SRX3654622, SRX3654623, SRX3654630, SRX3654942, SRX3654606, SRX3654816 and SRX3654921 were used.

Code availability

Cumulus code consists of four components: the Pegasus and scPlot Python packages; the Cumulus WDL workflows and Docker files; the Cumulus Docker images; and the Cirrocumulus app. Pegasus source code is available at Pegasus documentation is available at scPlot source code is available at We wrote all the workflows using the Workflow Description Language (WDL) and encapsulated all software packages into Docker images using Docker files. Cumulus WDL and Docker files are available at The source code used to generate feature-count matrices, generate_count_matrix_ADTs, is available at Cumulus Docker images are available at For Terra users, we additionally deposited Cumulus workflows in the Broad Methods Repository and provide a step-by-step manual at Cirrocumulus source code is available at Cirrocumulus documentation is available at Pegasus, scPlot, Cumulus WDL files, Docker files and Cirrocumulus are licensed under a BSD three-clause license. In addition, we documented licenses for Cumulus dependencies in Supplementary Data 5. Due to third-party licensing requirements, we can only provide CellRanger Docker images without bcl2fastq2; users can build their own private bcl2fastq2-containing Docker images by following the instructions listed in the Cumulus documentation.




Acknowledgements

We thank J. Rood for help with manuscript editing, L. Gaffney for help with figure preparation, E. Banks and A. Philippakis for advice on creating Cumulus’ featured Terra workspace, C. O’Day and E. Law for advice on licensing Cumulus as open-source software, C. O’Day additionally for help with summarizing the license terms of third-party packages on which Cumulus depends (Supplementary Data 5), D. Dionne, J. Waldman, J. Lee and K. Shekhar for contributions in generating the Census of Immune Cells dataset and sharing it openly pre-publication, M. Maarouf and D. Erdogan for transferring the Pegasus namespace on Read the Docs to us, and J. Gatter for providing the Kallisto-BUStools docker and WDLs. This work was supported by the Klarman Cell Observatory, the Manton Foundation, HHMI and the Ludwig Center at MIT (to A.R.), as well as the Human Tumor Atlas Pilot Project scientific team at Leidos Biomedical Research, Frederick National Laboratory for Cancer Research and NCI.

Author information




Contributions

B.L. and A.R. conceived of the study, designed the experiments and devised the analyses. B.L. developed the computational methods. B.L., J.G., Y.Y. and S.S. implemented the code. B.L., J.G., Y.Y., S.S., M.T., O.A. and Y.R. conducted the computational experiments. M.S., M.S.K. and A.-C.V. helped to interpret the results from the Census of Immune Cells data. T.T. helped with Terra cloud-related development. N.H., O.R.-R. and A.R. supervised the work. B.L., J.G., Y.Y. and A.R. wrote the paper with input from all of the authors.

Corresponding authors

Correspondence to Bo Li or Orit Rozenblatt-Rosen or Aviv Regev.

Ethics declarations

Competing interests

A.R. is a founder of and equity holder in Celsius Therapeutics, a Scientific Advisory Board member of Thermo Fisher Scientific, Neogene Therapeutics, Syros Pharmaceuticals and Asimov, and an equity holder in Immunitas. N.H. is a founder and Scientific Advisory Board member of Neon Therapeutics. A.R. is a co-inventor on patent applications filed by the Broad Institute for inventions relating to single-cell genomics.

Additional information

Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The new HVG selection procedure provides excellent quality vs. the standard procedure.

a, New HVG selection procedure (n = 16,613 robust genes). Variance (y axis) vs. mean (x axis) of log expression. Red: fitted LOESS curve. HVGs (blue) are defined as the genes above the LOESS curve. b, Curated immune genes captured by each procedure. The number of ImmPort curated immune genes selected by a standard HVG procedure (red) and Cumulus (blue). c, Analysis with HVGs selected by the new approach highlights an additional cell type (n = 274,182 bone marrow cells). FIt-SNE plots of cells from the bone marrow dataset generated by Cumulus with HVGs selected by the standard (left) and new (right) procedure and colored by cell subset annotations. Bottom: Adjusted Mutual Information (AMI) score shows overall high concordance. Only the plot from the new procedure identifies megakaryocytes. HSCs: hematopoietic stem cells; MSCs: mesenchymal stem cells; cDCs: conventional dendritic cells; pDCs: plasmacytoid dendritic cells; NK cells: natural killer cells.

Extended Data Fig. 2 Benchmarking of batch correction methods on 34,654 bone marrow cells.

a, Execution time of each method. b–g, UMAP visualizations of the bone marrow cells (n = 34,654) colored by either cell type annotation (left) or donor identity (right) without batch correction (b, baseline), with L/S adjustment (c, Pegasus), ComBat (d), MNN (e), BBKNN (f), and Seurat v3 (g).

Extended Data Fig. 3 Benchmarking of approximate nearest neighbor finding methods on the bone marrow dataset (n = 274,182 cells).

Accuracy (a, y axis, % recall, Methods) and speed (b, y axis, minutes) of each of three methods. Boxplot (a): Line: median; box boundaries: lower and upper quartiles; whiskers: 1.5 interquartile range (IQR) below and above the low and high quartile, respectively.
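The recall metric in panel a is defined in the Methods, which are not reproduced here. As a rough illustration only, the sketch below uses the common definition: the percentage of each cell's exact k nearest neighbors that the approximate search also returns. The function name knn_recall and the toy neighbor lists are hypothetical and are not part of Pegasus.

```python
def knn_recall(exact_neighbors, approx_neighbors):
    """Return per-cell recall (in percent) of an approximate kNN search.

    Assumed definition: the share of a cell's exact nearest neighbors
    that also appear in its approximate neighbor list.
    """
    recalls = []
    for exact, approx in zip(exact_neighbors, approx_neighbors):
        exact_set = set(exact)
        hits = len(exact_set & set(approx))
        recalls.append(100.0 * hits / len(exact_set))
    return recalls


# Toy example with two cells and k = 4 (made-up neighbor indices).
exact = [[1, 2, 3, 4], [0, 2, 3, 5]]
approx = [[1, 2, 3, 9], [0, 2, 3, 5]]
per_cell_recall = knn_recall(exact, approx)
```

The per-cell values are the kind of distribution that a boxplot such as the one in panel a would summarize.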

Extended Data Fig. 4 Adjusting diffusion pseudotime map parameters for visualization of pseudotemporal trajectories.

a, Using a large number of diffusion pseudotime components yields a developmental trajectory that enhances separation of trajectories of different cell populations (n = 274,182 bone marrow cells). FLEs of single cells (colored by cell type annotation) generated from diffusion pseudotime maps (t = ∞) with 15 (left), 50 (middle) or 100 (right) components. CD8+ and CD4+ naïve T cells are fused together in the left FLE (circled in red). Erythrocytes and Pro-B cells overlap in the middle FLE (circled in red). b, Choosing the timescale t for a diffusion pseudotime map. Von Neumann entropy (y axis) for diffusion maps with 100 components calculated from the bone marrow data at different timescales (x axis). Black point: knee point.

Extended Data Fig. 5 Spectral community detection algorithms combine the strengths of spectral clustering and community detection algorithms.

FIt-SNE of bone marrow single cells (dots, n = 274,182) colored by cluster assignment from (a) Spectral (left) vs. Louvain (right) clustering; (b) Louvain (left) vs. Spectral Louvain (right) clustering; or (c) Leiden (left) vs. spectral Leiden (right) clustering. Top: Execution time; bottom: Adjusted Mutual Information (AMI). Post hoc annotation labels are listed.

Extended Data Fig. 6 Deep-learning based visualization speeds up t-SNE and FLE visualizations while maintaining comparable quality.

Visualization of cell profiles (dots) from the full bone marrow data set (n = 274,182) colored by the same Louvain cluster membership (color; legend shows post hoc annotations) and laid out by (a) t-SNE (left), Net-tSNE (middle), or FIt-SNE (right); or (b) by FLE (left) or Net-FLE (right). Top: Execution time and kSIM acceptance rate.

Extended Data Fig. 7 Benchmarking the count step with respect to the number of channels for the bone marrow dataset.

Plot of maximum runtime in hours (left) and amortized total costs (right) in US dollars against the number of 10x channels.

Supplementary information

Supplementary Information

Supplementary Tables 1–8 and Supplementary Notes 1 and 2.

Reporting Summary

Supplementary Data 1

ImmPort-curated immune genes selected by the two HVG selection procedures. Tab 1 (GOappend1_Immport): a copy of the original ImmPort spreadsheet from; Tab 2 (Duplicates_merged): removed any duplicated genes; Tab 3 (Common_markers): ImmPort genes that were selected by both procedures; Tab 4 (Standard_procedure_specific) and Tab 5 (New_procedure_specific): ImmPort genes that are only selected by the standard procedure and the new procedure, respectively.

Supplementary Data 2

Execution time for benchmarking Pegasus algorithms on subsampled and full bone marrow datasets. The top table records execution time using eight threads (laptop setting). The bottom table records execution time using 28 threads (cloud setting).

Supplementary Data 3

Count job execution time and amortized cost of 63 bone marrow channels. The execution times are read from the Terra platform’s timing diagrams. The amortized cost is calculated as follows: suppose the total cost is C, the total execution time is T and one channel’s execution time is t; its amortized cost is: \(\frac{t}{T}C\).
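The amortized-cost formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name amortized_costs, the channel names and the numbers are all hypothetical, and T is taken to be the sum of the per-channel execution times.

```python
def amortized_costs(total_cost, channel_times):
    """Split total_cost (C) across channels in proportion to execution time.

    Each channel with execution time t receives t / T * C, where T is
    assumed here to be the sum of all channel execution times.
    """
    total_time = sum(channel_times.values())
    if total_time <= 0:
        raise ValueError("total execution time must be positive")
    return {
        name: t / total_time * total_cost
        for name, t in channel_times.items()
    }


# Made-up example: three channels, 6 hours in total, 12 US dollars overall.
example_costs = amortized_costs(
    12.0, {"channel_1": 2.0, "channel_2": 1.0, "channel_3": 3.0}
)
```

Because the shares t / T sum to one, the amortized costs always add back up to the total cost C.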

Supplementary Data 4

Raw count matrices generated on the cloud and local server. This is a gzipped tarball, containing two subfolders: 5k_pbmc and bgiseq_smartseq2. 5k_pbmc stores raw count matrices generated from the 5K PBMC dataset and contains two subfolders: cloud_output and local_output. Each subfolder contains the same raw count matrix in matrix market format. bgiseq_smartseq2 stores two raw count matrices generated from eight SMART-Seq2 cells sequenced by BGISEQ-500 and contains one file and two folders. The file BGISEQ_list.txt contains detailed information about the eight selected cells. The two folders cloud_output and local_output each contain two count matrices in DGE format (one for paired-end reads (PE) and one for single-end reads (SE)).

Supplementary Data 5

Cumulus dependencies and their licenses.

Supplementary Video 1

Demonstration of how to submit Cumulus jobs on the Terra platform.

Supplementary Video 2

Demonstration of Cirrocumulus on the bone marrow dataset.

Supplementary Video 3

Demonstration of interactive data analysis using Pegasus and Terra Jupyter notebooks.

About this article

Cite this article

Li, B., Gould, J., Yang, Y. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat Methods 17, 793–798 (2020).
