Abstract
Massively parallel single-cell and single-nucleus RNA sequencing has opened the way to systematic tissue atlases in health and disease, but as the scale of data generation is growing, so is the need for computational pipelines for scaled analysis. Here we developed Cumulus—a cloud-based framework for analyzing large-scale single-cell and single-nucleus RNA sequencing datasets. Cumulus combines the power of cloud computing with improvements in algorithm and implementation to achieve high scalability, low cost, user-friendliness and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.
Data availability
The bone marrow dataset is available at https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79. The 1.3 million mouse brain dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The 5K PBMC dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3. The BGISEQ SMART-Seq2 data are available at https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=430491. In particular, data from accessions SRX3654625, SRX3654622, SRX3654623, SRX3654630, SRX3654942, SRX3654606, SRX3654816 and SRX3654921 were used.
Code availability
Cumulus code consists of four components: the Pegasus and scPlot Python packages; the Cumulus WDL workflows and Docker files; the Cumulus Docker images; and the Cirrocumulus app. Pegasus source code is available at https://github.com/klarman-cell-observatory/pegasus. Pegasus documentation is available at https://pegasus.readthedocs.io. scPlot source code is available at https://github.com/klarman-cell-observatory/scPlot. We wrote all the workflows using the Workflow Description Language (WDL; https://github.com/openwdl/wdl) and encapsulated all software packages into Docker images using Docker files. Cumulus WDL and Docker files are available at https://github.com/klarman-cell-observatory/cumulus. The source code used to generate feature-count matrices, generate_count_matrix_ADTs, is available at https://github.com/klarman-cell-observatory/cumulus_feature_barcoding. Cumulus Docker images are available at https://hub.docker.com/u/cumulusprod. For Terra users, we additionally deposited Cumulus workflows in the Broad Methods Repository (https://portal.firecloud.org/?return=terra#methods) and provide a step-by-step manual at https://cumulus.readthedocs.io. Cirrocumulus source code is available at https://github.com/klarman-cell-observatory/cirrocumulus. Cirrocumulus documentation is available at https://cirrocumulus.readthedocs.io. Pegasus, scPlot, Cumulus WDL files, Docker files and Cirrocumulus are licensed under a BSD three-clause license. In addition, we documented licenses for Cumulus dependencies in Supplementary Data 5. Due to third-party licensing requirements, we can only provide CellRanger Docker images without bcl2fastq2; users can build their own private bcl2fastq2-containing Docker images by following the instructions in the Cumulus documentation.
Acknowledgements
We thank J. Rood for help with manuscript editing, L. Gaffney for help with figure preparation, E. Banks and A. Philippakis for advice on creating Cumulus’ featured Terra workspace, C. O’Day and E. Law for advice on licensing Cumulus as open-source software, C. O’Day additionally for help with summarizing the license terms of third-party packages on which Cumulus depends (Supplementary Data 5), D. Dionne, J. Waldman, J. Lee and K. Shekhar for contributions in generating the Census of Immune Cells dataset and sharing it openly pre-publication, M. Maarouf and D. Erdogan for transferring the Pegasus namespace on Read the Docs to us, and J. Gatter for providing the Kallisto-BUStools docker and WDLs. This work was supported by the Klarman Cell Observatory, the Manton Foundation, HHMI and the Ludwig Center at MIT (to A.R.), as well as the Human Tumor Atlas Pilot Project scientific team at Leidos Biomedical Research, Frederick National Laboratory for Cancer Research and NCI.
Author information
Contributions
B.L. and A.R. conceived of the study, designed the experiments and devised the analyses. B.L. developed the computational methods. B.L., J.G., Y.Y. and S.S. implemented the code. B.L., J.G., Y.Y., S.S., M.T., O.A. and Y.R. conducted the computational experiments. M.S., M.S.K. and A.-C.V. helped to interpret the results from the Census of Immune Cells data. T.T. helped with Terra cloud-related development. N.H., O.R.-R. and A.R. supervised the work. B.L., J.G., Y.Y. and A.R. wrote the paper with input from all of the authors.
Ethics declarations
Competing interests
A.R. is a founder of and equity holder in Celsius Therapeutics, a Scientific Advisory Board member of Thermo Fisher Scientific, Neogene Therapeutics, Syros Pharmaceuticals and Asimov, and an equity holder in Immunitas. N.H. is a founder and Scientific Advisory Board member of Neon Therapeutics. A.R. is a co-inventor on patent applications filed by the Broad Institute for inventions relating to single-cell genomics.
Additional information
Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 The new HVG selection procedure provides excellent quality vs. the standard procedure.
a, New HVG selection procedure (n = 16,613 robust genes). Variance (y axis) vs. mean (x axis) of log expression. Red: fitted LOESS curve. HVGs (blue) are defined as the genes above the LOESS curve. b, Curated immune genes captured by each procedure. The number of ImmPort curated immune genes selected by a standard HVG procedure (red) and Cumulus (blue). c, Analysis with HVGs selected by the new approach highlights an additional cell type (n = 274,182 bone marrow cells). FIt-SNE plots of cells from the bone marrow dataset generated by Cumulus with HVGs selected by the standard (left) and new (right) procedure and colored by cell subset annotations. Bottom: Adjusted Mutual Information (AMI) score shows overall high concordance. Only the plot from the new procedure identifies megakaryocytes. HSCs: hematopoietic stem cells; MSCs: mesenchymal stem cells; cDCs: conventional dendritic cells; pDCs: plasmacytoid dendritic cells; NK cells: natural killer cells.
Extended Data Fig. 2 Benchmarking of batch correction methods on 34,654 bone marrow cells.
a, Execution time of each method. b–g, UMAP visualizations of the bone marrow cells (n = 34,654) colored by either cell type annotation (left) or donor identity (right) without batch correction (b, baseline), with L/S adjustment (c, Pegasus), ComBat (d), MNN (e), BBKNN (f), and Seurat v3 (g).
Extended Data Fig. 3 Benchmarking of approximate nearest neighbor finding methods on the bone marrow dataset (n = 274,182 cells).
Accuracy (a, y axis, % recall, Methods) and speed (b, y axis, minutes) of each of three methods. Boxplot (a): line: median; box boundaries: lower and upper quartiles; whiskers: 1.5 × the interquartile range (IQR) below the lower quartile and above the upper quartile.
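The accuracy panel summarizes per-cell recall as a boxplot. The sketch below illustrates how such a summary could be computed. It assumes the common definition of recall for approximate nearest neighbor search (the percentage of exact neighbors that the approximate method also returns); the exact definition used here is given in Methods and is not reproduced in this section. The function names and the toy neighbor lists are hypothetical.

```python
from statistics import quantiles


def recall_percent(exact_neighbors, approx_neighbors):
    """Percent of the exact neighbors that the approximate search also returned.

    Assumed definition; the paper's Methods section is authoritative.
    """
    exact = set(exact_neighbors)
    if not exact:
        raise ValueError("exact neighbor list must not be empty")
    return 100.0 * len(exact & set(approx_neighbors)) / len(exact)


def boxplot_summary(values):
    """Median, quartiles and 1.5 x IQR whisker limits, as in the caption."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }


# Toy data (hypothetical): neighbor indices for five cells, k = 4.
exact = [[1, 2, 3, 4], [0, 2, 3, 5], [0, 1, 4, 5], [1, 2, 6, 7], [0, 3, 6, 7]]
approx = [[1, 2, 3, 4], [0, 2, 3, 9], [0, 1, 8, 9], [1, 2, 6, 7], [0, 3, 6, 9]]

recalls = [recall_percent(e, a) for e, a in zip(exact, approx)]
summary = boxplot_summary(recalls)
print(recalls)
print(summary)
```

Quartile conventions differ between tools, so the "inclusive" method used here is one choice among several. Plotting libraries also commonly draw each whisker only as far as the most extreme data point inside these limits, which matters for recall because the upper limit can exceed 100%.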
Extended Data Fig. 4 Adjusting diffusion pseudotime map parameters for visualization of pseudotemporal trajectories.
a, Using a large number of diffusion pseudotime components yields a developmental trajectory that enhances separation of trajectories of different cell populations (n = 274,182 bone marrow cells). FLEs of single cells (colored by cell type annotation) generated from diffusion pseudotime maps (t = ∞) with 15 (left), 50 (middle) or 100 (right) components. CD8+ and CD4+ naïve T cells are fused together in the left FLE (circled in red). Erythrocytes and Pro-B cells overlap in the middle FLE (circled in red). b, Choosing the timescale t for a diffusion pseudotime map. Von Neumann entropy (y axis) for diffusion maps with 100 components calculated from the bone marrow data at different timescales (x axis). Black point: knee point.
Extended Data Fig. 5 Spectral community detection algorithms combine the strengths of spectral clustering and community detection algorithms.
FIt-SNE of bone marrow single cells (dots, n = 274,182) colored by cluster assignment from (a) spectral (left) vs. Louvain (right) clustering; (b) Louvain (left) vs. spectral Louvain (right) clustering; or (c) Leiden (left) vs. spectral Leiden (right) clustering. Top: Execution time; bottom: Adjusted Mutual Information (AMI). Post hoc annotation labels are listed.
Extended Data Fig. 6 Deep-learning based visualization speeds up t-SNE and FLE visualizations while maintaining comparable quality.
Visualization of cell profiles (dots) from the full bone marrow data set (n = 274,182) colored by the same Louvain cluster membership (color; legend shows post hoc annotations) and laid out by (a) t-SNE (left), Net-tSNE (middle), or FIt-SNE (right); or (b) by FLE (left) or Net-FLE (right). Top: Execution time and kSIM acceptance rate.
Extended Data Fig. 7 Benchmarking of the count step with respect to the number of channels for the bone marrow dataset.
Plot of maximum runtime in hours (left) and amortized total costs (right) in US dollars against the number of 10x channels.
Supplementary information
Supplementary Information
Supplementary Tables 1–8 and Supplementary Notes 1 and 2.
Supplementary Data 1
ImmPort-curated immune genes selected by the two HVG selection procedures. Tab 1 (GOappend1_Immport): a copy of the original ImmPort spreadsheet from https://www.immport.org/shared/geneData/GOappend1.xls; Tab 2 (Duplicates_merged): removed any duplicated genes; Tab 3 (Common_markers): ImmPort genes that were selected by both procedures; Tab 4 (Standard_procedure_specific) and Tab 5 (New_procedure_specific): ImmPort genes that are only selected by the standard procedure and the new procedure, respectively.
Supplementary Data 2
Execution time for benchmarking Pegasus algorithms on subsampled and full bone marrow datasets. The top table records execution time using eight threads (laptop setting). The bottom table records execution time using 28 threads (cloud setting).
Supplementary Data 3
Count job execution time and amortized cost of 63 bone marrow channels. The execution times are read from the Terra platform’s timing diagrams. The amortized cost is calculated as follows: suppose the total cost is C, the total execution time is T and one channel’s execution time is t; its amortized cost is: \(\frac{t}{T}C\).
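The amortized-cost formula above can be illustrated with a short sketch. It assumes that the total execution time T is the sum of the per-channel execution times, so that the amortized costs add up to the total cost C; the function name, channel names and numbers are hypothetical and are not taken from Supplementary Data 3.

```python
def amortized_costs(total_cost, channel_times):
    """Split total_cost across channels in proportion to execution time.

    Implements (t / T) * C, where T is assumed to be the sum of all
    per-channel execution times t.
    """
    total_time = sum(channel_times.values())
    if total_time <= 0:
        raise ValueError("total execution time must be positive")
    return {
        channel: t / total_time * total_cost
        for channel, t in channel_times.items()
    }


# Hypothetical example: two channels that ran for 1 h and 3 h, total cost $10.
example_costs = amortized_costs(10.0, {"channel_a": 1.0, "channel_b": 3.0})
print(example_costs)  # {'channel_a': 2.5, 'channel_b': 7.5}
```

Under this assumption a channel that accounts for a quarter of the total run time is charged a quarter of the total cost.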
Supplementary Data 4
Raw count matrices generated on the cloud and local server. This is a gzipped tarball, containing two subfolders: 5k_pbmc and bgiseq_smartseq2. 5k_pbmc stores raw count matrices generated from the 5K PBMC dataset and contains two subfolders: cloud_output and local_output. Each subfolder contains the same raw count matrix in matrix market format. bgiseq_smartseq2 stores two raw count matrices generated from eight SMART-Seq2 cells sequenced by BGISEQ-500 and contains one file and two folders. The file BGISEQ_list.txt contains detailed information about the eight selected cells. The two folders cloud_output and local_output each contain two count matrices in DGE format (one for paired-end reads (PE) and one for single-end reads (SE)).
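As an illustration of how a reader might check that two matrices in Matrix Market format are identical, the following sketch parses coordinate-format text with the Python standard library and compares the entries. The inline matrices are toy data, not the contents of cloud_output or local_output, and the function names are hypothetical. For real files, a library reader such as scipy.io.mmread would be the more usual choice.

```python
def parse_matrix_market(text):
    """Parse coordinate-format Matrix Market text with integer values.

    Returns ((n_rows, n_cols), {(row, col): value}) using 1-based indices.
    """
    lines = (ln.strip() for ln in text.splitlines())
    # Lines starting with '%' are the banner and comments.
    body = [ln for ln in lines if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(tok) for tok in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row), int(col))] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


def matrices_equal(text_a, text_b):
    """True if both texts encode the same sparse matrix, in any entry order."""
    return parse_matrix_market(text_a) == parse_matrix_market(text_b)


# Toy matrices: same entries in a different order, then one changed count.
cloud = """%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 5
2 2 1
3 1 7
"""
local = """%%MatrixMarket matrix coordinate integer general
3 2 3
3 1 7
1 1 5
2 2 1
"""
changed = local.replace("2 2 1", "2 2 4")

print(matrices_equal(cloud, local))    # True
print(matrices_equal(cloud, changed))  # False
```

Comparing parsed entries rather than raw file bytes makes the check independent of entry order and whitespace.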
Supplementary Data 5
Cumulus dependencies and their licenses.
Supplementary Video 1
Demonstration of how to submit Cumulus jobs on the Terra platform.
Supplementary Video 2
Demonstration of Cirrocumulus on the bone marrow dataset.
Supplementary Video 3
Demonstration of interactive data analysis using Pegasus and Terra Jupyter notebooks.
About this article
Cite this article
Li, B., Gould, J., Yang, Y. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat Methods 17, 793–798 (2020). https://doi.org/10.1038/s41592-020-0905-x
DOI: https://doi.org/10.1038/s41592-020-0905-x