Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq

Li, Bo; Gould, Joshua; Yang, Yiming; Sarkizova, Siranush; Tabaka, Marcin; Ashenberg, Orr; Rosen, Yanay; Slyper, Michal; Kowalczyk, Monika S.; Villani, Alexandra-Chloé; Tickle, Timothy; Hacohen, Nir; Rozenblatt-Rosen, Orit; Regev, Aviv

doi:10.1038/s41592-020-0905-x

Article
Published: 27 July 2020

Nature Methods volume 17, pages 793–798 (2020)

Abstract

Massively parallel single-cell and single-nucleus RNA sequencing has opened the way to systematic tissue atlases in health and disease, but as the scale of data generation is growing, so is the need for computational pipelines for scaled analysis. Here we developed Cumulus—a cloud-based framework for analyzing large-scale single-cell and single-nucleus RNA sequencing datasets. Cumulus combines the power of cloud computing with improvements in algorithm and implementation to achieve high scalability, low cost, user-friendliness and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.

**Fig. 1: Cumulus: a scalable, feature-rich, accessible cloud-based framework for sc/snRNA-seq analysis.**

**Fig. 2: Algorithmic and implementation improvements underlying Pegasus’ high scalability.**

Data availability

The bone marrow dataset is available at https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79. The 1.3 million mouse brain dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The 5K PBMC dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3. The BGISEQ SMART-Seq2 data are available at https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=430491. In particular, data from accessions SRX3654625, SRX3654622, SRX3654623, SRX3654630, SRX3654942, SRX3654606, SRX3654816 and SRX3654921 were used.

Code availability

Cumulus code consists of four components: the Pegasus and scPlot Python packages; the Cumulus WDL workflows and Docker files; the Cumulus Docker images; and the Cirrocumulus app. Pegasus source code is available at https://github.com/klarman-cell-observatory/pegasus. Pegasus documentation is available at https://pegasus.readthedocs.io. scPlot source code is available at https://github.com/klarman-cell-observatory/scPlot. We wrote all the workflows using the Workflow Description Language (WDL; https://github.com/openwdl/wdl) and encapsulated all software packages into Docker images using Docker files. Cumulus WDL and Docker files are available at https://github.com/klarman-cell-observatory/cumulus. The source code used to generate feature-count matrices, generate_count_matrix_ADTs, is available at https://github.com/klarman-cell-observatory/cumulus_feature_barcoding. Cumulus Docker images are available at https://hub.docker.com/u/cumulusprod. For Terra users, we additionally deposited Cumulus workflows in the Broad Methods Repository (https://portal.firecloud.org/?return=terra#methods) and provide a step-by-step manual at https://cumulus.readthedocs.io. Cirrocumulus source code is available at https://github.com/klarman-cell-observatory/cirrocumulus. Cirrocumulus documentation is available at https://cirrocumulus.readthedocs.io. Pegasus, scPlot, Cumulus WDL files, Docker files and Cirrocumulus are licensed under a BSD three-clause license. In addition, we documented licenses for Cumulus dependencies in Supplementary Data 5. Due to third-party licensing requirements, we can only provide CellRanger Docker images without bcl2fastq2; users can build their own private bcl2fastq2-containing Docker images by following the instructions listed in the Cumulus documentation.

Acknowledgements

We thank J. Rood for help with manuscript editing, L. Gaffney for help with figure preparation, E. Banks and A. Philippakis for advice on creating Cumulus’ featured Terra workspace, C. O’Day and E. Law for advice on licensing Cumulus as open-source software, C. O’Day additionally for help on summarizing the license terms of third-party packages on which Cumulus depends (Supplementary Data 5), D. Dionne, J. Waldman, J. Lee and K. Shekhar for contributions in generating the Census of Immune Cells dataset and sharing it openly pre-publication, M. Maarouf and D. Erdogan for transferring the Pegasus namespace on Read the Docs to us, and J. Gatter for providing the Kallisto-BUStools docker and WDLs. This work was supported by the Klarman Cell Observatory, the Manton Foundation, HHMI and the Ludwig Center at MIT (to A.R.), as well as the Human Tumor Atlas Pilot Project scientific team at Leidos Biomedical Research, Frederick National Laboratory for Cancer Research and NCI.

Author information

Authors and Affiliations

Klarman Cell Observatory, Broad Institute of Harvard and MIT, Cambridge, MA, USA
Bo Li, Joshua Gould, Yiming Yang, Marcin Tabaka, Orr Ashenberg, Yanay Rosen, Michal Slyper, Monika S. Kowalczyk, Timothy Tickle, Orit Rozenblatt-Rosen & Aviv Regev
Division of Rheumatology, Allergy, and Immunology, Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Boston, MA, USA
Bo Li, Yiming Yang & Alexandra-Chloé Villani
Department of Medicine, Harvard Medical School, Boston, MA, USA
Bo Li, Alexandra-Chloé Villani & Nir Hacohen
Broad Institute of Harvard and MIT, Cambridge, MA, USA
Siranush Sarkizova, Alexandra-Chloé Villani & Nir Hacohen
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Siranush Sarkizova
Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
Alexandra-Chloé Villani & Nir Hacohen
Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, MA, USA
Aviv Regev
Koch Institute of Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
Aviv Regev
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
Aviv Regev

Contributions

B.L. and A.R. conceived of the study, designed the experiments and devised the analyses. B.L. developed the computational methods. B.L., J.G., Y.Y. and S.S. implemented the code. B.L., J.G., Y.Y., S.S., M.T., O.A. and Y.R. conducted the computational experiments. M.S., M.S.K. and A.-C.V. helped to interpret the results from the Census of Immune Cells data. T.T. helped with Terra cloud-related development. N.H., O.R.-R. and A.R. supervised the work. B.L., J.G., Y.Y. and A.R. wrote the paper with input from all of the authors.

Corresponding authors

Correspondence to Bo Li, Orit Rozenblatt-Rosen or Aviv Regev.

Ethics declarations

Competing interests

A.R. is a founder of and equity holder in Celsius Therapeutics, a Scientific Advisory Board member of Thermo Fisher Scientific, Neogene Therapeutics, Syros Pharmaceuticals and Asimov, and an equity holder in Immunitas. N.H. is a founder and Scientific Advisory Board member of Neon Therapeutics. A.R. is a co-inventor on patent applications filed by the Broad Institute for inventions relating to single-cell genomics.

Additional information

Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The new HVG selection procedure provides excellent quality vs. the standard procedure.

a, New HVG selection procedure (n = 16,613 robust genes). Variance (y axis) vs. mean (x axis) of log expression. Red: fitted LOESS curve. HVGs (blue) are defined as the genes above the LOESS curve. b, Curated immune genes captured by each procedure. The number of ImmPort curated immune genes selected by a standard HVG procedure (red) and Cumulus (blue). c, Analysis with HVGs by new approach highlighted an additional cell type (n = 274,182 bone marrow cells). FIt-SNE plots of cells from the bone marrow dataset generated by Cumulus with HVG genes selected by the standard (left) and new (right) procedure and colored by cell subset annotations. Bottom: Adjusted Mutual Information (AMI) score shows overall high concordance. Only the plot from the new procedure identifies megakaryocytes. HSCs: hematopoietic stem cells; MSCs: mesenchymal stem cells; cDCs: conventional dendritic cells; pDCs: plasmacytoid dendritic cells; NK cells: natural killer cells.

Extended Data Fig. 2 Benchmarking of batch correction methods on 34,654 bone marrow cells.

a, Execution time of each method. b–g, UMAP visualizations of the bone marrow cells (n = 34,654) colored by either cell type annotation (left) or donor identity (right) without batch correction (b, baseline), with L/S adjustment (c, Pegasus), ComBat (d), MNN (e), BBKNN (f), and Seurat v3 (g).

Extended Data Fig. 3 Benchmarking of approximate nearest neighbor finding methods on the bone marrow dataset (n = 274,182 cells).

Accuracy (a, y axis, % recall, Methods) and speed (b, y axis, minutes) of each of three methods. Boxplot (a): Line: median; box boundaries: lower and upper quartiles; whiskers: 1.5 interquartile range (IQR) below and above the low and high quartile, respectively.
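As a rough illustration of the recall metric named in this caption, the sketch below compares a hypothetical approximate neighbor list against brute-force exact neighbors on four toy points. It assumes recall means the fraction of each cell's true k nearest neighbors that the approximate method recovers, averaged over cells; the article's Methods give the authoritative definition. The function names and toy data are illustrative only and are not taken from Pegasus.

```python
from math import dist

def exact_knn(points, k):
    """Brute-force k nearest neighbors (excluding the point itself) for every point."""
    result = []
    for i, p in enumerate(points):
        # Sort the other points by distance, breaking ties by index.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: (dist(p, points[j]), j))
        result.append(others[:k])
    return result

def recall(approx, exact):
    """Fraction of true neighbors recovered, averaged over all points."""
    per_point = [len(set(a) & set(e)) / len(e) for a, e in zip(approx, exact)]
    return sum(per_point) / len(per_point)

# Toy data (hypothetical): four points in two dimensions.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
exact = exact_knn(points, k=2)

# Hypothetical output of an approximate method: one neighbor of the last point is wrong.
approx = [list(neighbors) for neighbors in exact]
approx[3][1] = 0

print(f"recall = {recall(approx, exact):.1%}")  # prints "recall = 87.5%"
```

Brute-force search is quadratic in the number of cells, so at the scale of this dataset it would only be practical on a subsample.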

Extended Data Fig. 4 Adjusting diffusion pseudotime map parameters for visualization of pseudotemporal trajectories.

a, Using a large number of diffusion pseudotime components yields a developmental trajectory that enhances separation of trajectories of different cell populations (n = 274,182 bone marrow cells). FLEs of single cells (colored by cell type annotation) generated from diffusion pseudotime maps (t = ∞) with 15 (left), 50 (middle) or 100 (right) components. CD8⁺ and CD4⁺ naïve T cells are fused together in the left FLE (circled in red). Erythrocytes and Pro-B cells overlap in the middle FLE (circled in red). b, Choosing the timescale t for a diffusion pseudotime map. Von Neumann entropy (y axis) for diffusion maps with 100 components calculated from the bone marrow data at different timescales (x axis). Black point: knee point.
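To make panel b more concrete, here is a minimal sketch of one common way to compute a von Neumann entropy curve over timescales and pick a knee point. It rests on two assumptions that are not stated in this caption: the entropy is taken over the eigenvalues raised to the power t and normalized to sum to one, and the knee is the point farthest from the straight line joining the curve's endpoints. The eigenvalues are made up, and the article's Methods may define both quantities differently.

```python
from math import log

def von_neumann_entropy(eigenvalues, t):
    """Entropy of the normalized spectrum after raising each eigenvalue to the power t (assumed definition)."""
    powered = [abs(lam) ** t for lam in eigenvalues]
    total = sum(powered)
    probs = [p / total for p in powered]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(q * log(q) for q in probs if q > 0)

def knee_point(xs, ys):
    """Index of the point farthest from the line through the first and last points (assumed heuristic)."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]

    def distance(x, y):
        # Unnormalized perpendicular distance; the shared denominator does not change the argmax.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    return max(range(len(xs)), key=lambda i: distance(xs[i], ys[i]))

# Hypothetical eigenvalues of a diffusion operator (not taken from the paper).
eigenvalues = [0.9, 0.5, 0.1]
timescales = list(range(1, 11))
entropies = [von_neumann_entropy(eigenvalues, t) for t in timescales]

knee = knee_point(timescales, entropies)
print("knee timescale:", timescales[knee])
```

Under this definition the entropy can only fall as t grows, because larger powers concentrate the normalized spectrum on the leading eigenvalue; the knee marks where further increases in t stop changing the curve much.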

Extended Data Fig. 5 Spectral community detection algorithms combine the strengths of spectral clustering and community detection algorithms.

FIt-SNE of bone marrow single cells (dots, n = 274,182) colored by cluster assignment from (a) spectral (left) vs. Louvain (right) clustering; (b) Louvain (left) vs. spectral Louvain (right) clustering; or (c) Leiden (left) vs. spectral Leiden (right) clustering. Top: execution time; bottom: Adjusted Mutual Information (AMI). Post hoc annotation labels are listed.

Extended Data Fig. 6 Deep-learning based visualization speeds up t-SNE and FLE visualizations while maintaining comparable quality.

Visualization of cell profiles (dots) from the full bone marrow data set (n = 274,182) colored by the same Louvain cluster membership (color; legend shows post hoc annotations) and laid out by (a) t-SNE (left), Net-tSNE (middle), or FIt-SNE (right); or (b) by FLE (left) or Net-FLE (right). Top: Execution time and kSIM acceptance rate.

Extended Data Fig. 7 Benchmarking the count step with respect to the number of channels for the bone marrow dataset.

Plot of maximum runtime in hours (left) and amortized total costs (right) in US dollars against the number of 10x channels.

Supplementary information

Supplementary Information

Supplementary Tables 1–8 and Supplementary Notes 1 and 2.

Reporting Summary

Supplementary Data 1

ImmPort-curated immune genes selected by the two HVG selection procedures. Tab 1 (GOappend1_Immport): a copy of the original ImmPort spreadsheet from https://www.immport.org/shared/geneData/GOappend1.xls; Tab 2 (Duplicates_merged): removed any duplicated genes; Tab 3 (Common_markers): ImmPort genes that were selected by both procedures; Tab 4 (Standard_procedure_specific) and Tab 5 (New_procedure_specific): ImmPort genes that are only selected by the standard procedure and the new procedure, respectively.

Supplementary Data 2

Execution time for benchmarking Pegasus algorithms on subsampled and full bone marrow datasets. The top table records execution time using eight threads (laptop setting). The bottom table records execution time using 28 threads (cloud setting).

Supplementary Data 3

Count job execution time and amortized cost of 63 bone marrow channels. The execution times are read from the Terra platform’s timing diagrams. The amortized cost is calculated as follows: suppose the total cost is C, the total execution time is T and one channel’s execution time is t; its amortized cost is: \(\frac{t}{T}C\).
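The amortized-cost formula above can be worked through with a small sketch. It assumes that the total execution time T is the sum of the per-channel execution times, which the description does not state explicitly, and it uses made-up numbers rather than values from Supplementary Data 3.

```python
def amortized_costs(total_cost, channel_times):
    """Split total_cost (C) across channels in proportion to each channel's time t, i.e. (t / T) * C."""
    total_time = sum(channel_times)  # T, assumed to be the sum of per-channel times
    return [t / total_time * total_cost for t in channel_times]

# Hypothetical example: three channels that ran for 2, 3 and 5 hours, with a total cost of 20 US dollars.
times_in_hours = [2.0, 3.0, 5.0]
costs = amortized_costs(20.0, times_in_hours)

for t, c in zip(times_in_hours, costs):
    print(f"{t:.1f} h -> ${c:.2f}")
```

With this convention the per-channel amortized costs add up to the total cost C, so longer-running channels absorb a proportionally larger share.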

Supplementary Data 4

Raw count matrices generated on the cloud and local server. This is a gzipped tarball, containing two subfolders: 5k_pbmc and bgiseq_smartseq2. 5k_pbmc stores raw count matrices generated from the 5K PBMC dataset and contains two subfolders: cloud_output and local_output. Each subfolder contains the same raw count matrix in matrix market format. bgiseq_smartseq2 stores two raw count matrices generated from eight SMART-Seq2 cells sequenced by BGISEQ-500 and contains one file and two folders. The file BGISEQ_list.txt contains detailed information about the eight selected cells. The two folders cloud_output and local_output each contain two count matrices in DGE format (one for paired-end reads (PE) and one for single-end reads (SE)).

Supplementary Data 5

Cumulus dependencies and their licenses.

Supplementary Video 1

Demonstration on how to submit Cumulus jobs on the Terra platform.

Supplementary Video 2

Demonstration of Cirrocumulus on the bone marrow dataset.

Supplementary Video 3

Demonstration on interactive data analysis using Pegasus and Terra Jupyter notebooks.

Cite this article

Li, B., Gould, J., Yang, Y. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat Methods 17, 793–798 (2020). https://doi.org/10.1038/s41592-020-0905-x

Received: 29 October 2019
Accepted: 18 June 2020
Published: 27 July 2020
Issue Date: August 2020
DOI: https://doi.org/10.1038/s41592-020-0905-x
