Integration of millions of transcriptomes using batch-aware triplet neural networks

Abstract

Efficient integration of heterogeneous and increasingly large single-cell RNA sequencing data poses a major challenge for analysis and, in particular, comprehensive atlasing efforts. Here we developed a novel deep learning algorithm called INSCT (Insight) to overcome batch effects using batch-aware triplet neural networks. We use simulated and real data to demonstrate that INSCT generates an embedding space that accurately integrates cells across experiments, platforms and species. Our benchmark comparisons with current state-of-the-art single-cell RNA sequencing integration methods revealed that INSCT outperforms competing methods in scalability while achieving comparable accuracies. Moreover, using INSCT in semisupervised mode enables users to classify unlabelled cells by projecting them into a reference collection of annotated cells. To demonstrate scalability, we applied INSCT to integrate more than 2.6 million transcriptomes from four independent studies of mouse brains in less than 1.5 h using less than 25 GB of memory. This feature empowers researchers to perform atlasing-scale data integration in a typical desktop computer environment. INSCT is freely available at https://github.com/lkmklsmn/insct.

Fig. 1: Overview of INSCT.
Fig. 2: INSCT robustly overcomes batch effects in simulated scRNA-seq data.
Fig. 3: INSCT accurately integrates cell atlases across different platforms.
Fig. 4: INSCT enables cross-species integration of cell atlases.
Fig. 5: Semisupervised INSCT robustly classifies cell types.
Fig. 6: INSCT integrates multimillion brain cells.

Data availability

The simulation data are available from https://drive.google.com/file/d/1G5rBw-iOIcVMk-3R51DJK0ngttf1iobo. The pancreas data collection was downloaded from the Supplementary Information of the work by Luecken and colleagues (ref. 28; https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968). The macaque retina dataset (ref. 38) was downloaded from the Gene Expression Omnibus (accession: GSE118480). Cell annotations were obtained from the Broad Institute’s Single Cell Portal at https://portals.broadinstitute.org/single_cell. The regular Tabula Muris data (ref. 30) were downloaded from https://tabula-muris.ds.czbiohub.org/. The Tabula Muris Senis data (ref. 26) were downloaded from https://figshare.com/projects/Tabula_Muris_Senis/64982. Mouse Cell Atlas data (ref. 31) were downloaded from https://figshare.com/articles/MCA_DGE_Data/5435866. The Human Cell Landscape data (ref. 32) were downloaded from https://figshare.com/articles/HCL_DGE_Data/7235471. Mouse brain single-cell/nuclei RNA sequencing data were downloaded from four different resources. The 10× Genomics dataset was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The DropViz dataset was downloaded from http://dropviz.org. The MouseBrain.org data were downloaded from http://mousebrain.org/downloads.html. The SPLIT-seq data (ref. 39) were downloaded from the Gene Expression Omnibus (accession: GSE110823).

Code availability

INSCT, as well as all analyses and results presented in the manuscript, is available via GitHub at https://github.com/lkmklsmn/insct (ref. 40). INSCT is licensed under the MIT license.

References

  1. Bianconi, E. et al. An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013).

  2. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  3. Bacher, R. & Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).

  4. Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).

  5. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  6. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  7. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).

  8. Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).

  9. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  10. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  11. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  12. Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).

  13. Heaton, J. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: deep learning. Genet. Prog. Evolvable Mach. 19, 305–307 (2018).

  14. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).

  15. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  16. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  17. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).

  18. Simon, L. M., Yan, F. & Zhao, Z. DrivAER: identification of driving transcriptional programs in single-cell RNA sequencing data. Gigascience 9, giaa122 (2020).

  19. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).

  20. Schroff, F., Kalenichenko, D. & Philbin, J. FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015); https://doi.org/10.1109/cvpr.2015.7298682

  21. Wang, J. et al. Learning fine-grained image similarity with deep ranking. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014); https://doi.org/10.1109/cvpr.2014.180

  22. Szubert, B., Cole, J. E., Monaco, C. & Drozdov, I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci. Rep. 9, 8914 (2019).

  23. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  24. Hoffer, E. & Ailon, N. in Similarity-Based Pattern Recognition 84–92 (Springer, 2015); https://doi.org/10.1007/978-3-319-24261-3_7

  25. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  26. Tabula Muris Consortium. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Nature 583, 590–595 (2020).

  27. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

  28. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Preprint at https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2 (2020).

  29. Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

  30. Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  31. Han, X. et al. Mapping the mouse cell atlas by microwell-seq. Cell 173, 1091–1107.e17 (2018).

  32. Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).

  33. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2020).

  34. Obernier, K. & Alvarez-Buylla, A. Neural stem cells: origin, heterogeneity and regulation in the adult mammalian brain. Development 146, dev156059 (2019).

  35. Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).

  36. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).

  37. Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).

  38. Peng, Y.-R. et al. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell 176, 1222–1237.e22 (2019).

  39. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).

  40. Simon, L. M. & Wang, Y. github.com/lkmklsmn/insct (Zenodo, 2021); https://doi.org/10.5281/ZENODO.4728031

Acknowledgements

This work was supported by the Cancer Prevention and Research Institute of Texas (CPRIT core grant no. RP180734). Z.Z. was also partially supported by the National Institutes of Health (grant nos. R01LM012806 and R01DE030122). The funders had no role in the study design, data collection and analysis, the decision to publish or the preparation of the manuscript. We would like to thank the members of the Bioinformatics and Systems Medicine Laboratory at the University of Texas Health Science Center at Houston, as well as D. Henke, for stimulating discussion.

Author information

Contributions

L.M.S. conceptualized the algorithm. L.M.S., Y.-Y.W. and Z.Z. designed the project. L.M.S. and Y.-Y.W. developed the algorithm and analysed the data. Z.Z. supervised the project. L.M.S., Y.-Y.W., and Z.Z. wrote the manuscript. All authors read and approved the final manuscript. Correspondence and requests for materials should be addressed to L.M.S.

Corresponding authors

Correspondence to Lukas M. Simon or Zhongming Zhao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Smita Krishnaswamy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Schematic illustrates the concept of batch-aware triplet sampling.

Using mutual nearest neighbors (MNNs), triplets are chosen such that the anchor and the positive come from different batches. Our GitHub repository contains an animation of the batch-aware triplet concept (https://github.com/lkmklsmn/insct).
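As a rough illustration of the sampling idea in this caption, the minimal sketch below picks anchor-positive pairs as mutual nearest neighbors across two batches. It is not the INSCT implementation: the function names, the two-batch setup, and the choice to draw the negative at random from the anchor's own batch are assumptions made for illustration only.

```python
import math
import random


def nearest(query, refs, k):
    """Return the indices of the k points in refs closest to query."""
    order = sorted(range(len(refs)), key=lambda j: math.dist(query, refs[j]))
    return set(order[:k])


def mutual_nearest_pairs(batch_a, batch_b, k):
    """Pairs (i, j) where a[i] and b[j] are in each other's k nearest neighbors."""
    a_to_b = [nearest(p, batch_b, k) for p in batch_a]
    b_to_a = [nearest(q, batch_a, k) for q in batch_b]
    return [
        (i, j)
        for i in range(len(batch_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]
    ]


def batch_aware_triplets(batch_a, batch_b, k=2, seed=0):
    """Build (anchor, positive, negative) triplets as (batch_name, index) tuples.

    Anchor and positive are an MNN pair from different batches.
    Assumption: the negative is a random other cell from the anchor's batch.
    """
    rng = random.Random(seed)
    triplets = []
    for i, j in mutual_nearest_pairs(batch_a, batch_b, k):
        candidates = [n for n in range(len(batch_a)) if n != i]
        triplets.append((("a", i), ("b", j), ("a", rng.choice(candidates))))
    return triplets
```

On toy data with two well-separated cell groups per batch, every anchor-positive pair returned by this sketch spans the two batches while staying within one cell group.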

Extended Data Fig. 2 INSCT robustly integrates simulated scRNA-seq data with many batches and many cell types.

a, Left, ten cell groups from ten batches were simulated for scenario 1. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. Right, cells were removed such that four cell types were exclusive to a given batch (highlighted with ellipses) for scenario 2. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. b, INSCT embeddings show the impact of a range of k-to-m-ratio parameter values (left to right) on the complex simulation scenario 2. Cells are colored by batch (top) and cell group (bottom). With low k-to-m-ratio values, the model cannot learn meaningful representations of cell groups exclusive to individual batches. With high k-to-m-ratio values, the model cannot merge cell groups exclusive to individual batches into coherent clusters. With intermediate k-to-m-ratio values, the model learns meaningful representations for cells exclusive to individual batches.

Extended Data Fig. 3 INSCT accurately integrates cells in Tabula Muris data collection.

a, Bar plots depict running time for each method. b, Box plots depict mean integration accuracy based on a KNN classifier. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Box plot depicts kBET scores across the integration methods. e–l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell ontology class (right).
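To make the KNN-based accuracy in panel b more concrete, here is a minimal sketch of one way such a score could be computed on an integrated embedding. The leave-one-batch-out protocol, the function names and the default k are assumptions; this page does not state the exact procedure the authors used.

```python
import math
from collections import Counter
from statistics import mean, stdev


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(query, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def per_batch_knn_accuracy(emb, labels, batches, k=3):
    """Assumed protocol: hold out each batch in turn, train on the rest."""
    scores = {}
    for held_out in sorted(set(batches)):
        train = [i for i, b in enumerate(batches) if b != held_out]
        test = [i for i, b in enumerate(batches) if b == held_out]
        train_x = [emb[i] for i in train]
        train_y = [labels[i] for i in train]
        hits = sum(
            knn_predict(train_x, train_y, emb[i], k) == labels[i] for i in test
        )
        scores[held_out] = hits / len(test)
    return scores


def mean_and_standard_error(values):
    """Mean and standard error across per-batch accuracies (needs >= 2 values)."""
    values = list(values)
    return mean(values), stdev(values) / math.sqrt(len(values))
```

If cells of the same type from different batches sit close together in the embedding, a classifier trained on the other batches should label the held-out batch well, so higher per-batch accuracy suggests better integration.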

Extended Data Fig. 4 INSCT accurately integrates cells in the macaque retina data collection.

a, Bar plots depict running time for each method. b, Box plots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Box plot depicts kBET scores across the various integration methods. e–l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).
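The LISI metric in panel c is built on the inverse Simpson's index of the labels found in each cell's local neighborhood. The sketch below is a deliberately simplified, unweighted variant that uses a fixed k nearest neighbors; the published LISI uses distance-weighted neighborhoods, and the function names here are hypothetical.

```python
import math
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index: 1 / sum of squared label proportions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


def simple_lisi(emb, labels, k):
    """Per-cell inverse Simpson's index over the k nearest other cells.

    Simplification: neighbors are unweighted, unlike the published LISI.
    """
    scores = []
    for i, point in enumerate(emb):
        others = [j for j in range(len(emb)) if j != i]
        others.sort(key=lambda j: math.dist(point, emb[j]))
        scores.append(inverse_simpson([labels[j] for j in others[:k]]))
    return scores
```

Computed on batch labels, a value near the number of batches indicates well-mixed neighborhoods, while a value of 1 means a neighborhood contains a single batch. Computed on cell-type labels, values near 1 are the desirable outcome.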

Extended Data Fig. 5 INSCT accurately integrates cells in the human pancreas data collection.

a, Bar plots depict running time for each method. b, Box plots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Box plot depicts kBET scores across the various integration methods. e–l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).

Extended Data Fig. 6 INSCT is robust to mislabeling.

a, UMAP depicts cells from simulation scenario 1. A group of red cells from batch 2 is mislabeled as green cells. For visualization purposes only, the mislabeled cells are colored purple. b, As the label ratio parameter decreases from left to right, model training uses more triplets sampled from MNNs and KNNs and fewer sampled from cell-type labels, thereby going from completely supervised (label ratio 1) to completely unsupervised (label ratio 0). At a high label ratio, the mislabeled cells (purple) map onto the incorrect cluster (green). As the label ratio decreases, the mislabeled cells (purple) form a separate cluster. At label ratio 0, which corresponds to completely unsupervised training, the mislabeled cells map to the correct cluster (red).
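One way to picture the label ratio is as the probability that a given training triplet is drawn from cell-type labels rather than from MNN/KNN neighbors. The sketch below encodes only that reading; the function name and the per-triplet random draw are assumptions, and the actual mixing scheme in INSCT may differ.

```python
import random


def choose_triplet_sources(n_triplets, label_ratio, seed=0):
    """Decide, per triplet, whether to sample it from labels or from neighbors.

    Assumption: label_ratio is the probability of using cell-type labels.
    label_ratio=1 is fully supervised; label_ratio=0 is fully unsupervised.
    """
    if not 0.0 <= label_ratio <= 1.0:
        raise ValueError("label_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    return [
        "label" if rng.random() < label_ratio else "neighbor"
        for _ in range(n_triplets)
    ]
```

Under this reading, intermediate values let neighbor-derived triplets partly counteract wrong labels, which is consistent with the mislabeled cells separating from the incorrect cluster as the ratio drops.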

Extended Data Fig. 7 INSCT efficiently integrates millions of cells.

a, Peak memory usage in gigabytes (y axis) as a function of the number of cells (x axis). b, Runtime in minutes (y axis) as a function of the number of cells (x axis). Colors represent the different integration methods.

Extended Data Fig. 8 Down-sampling decreases running time while achieving comparable performance.

a, Integration accuracy as measured via a KNN classifier remains robust across a number of down-sampling sets, while running time decreases (b). The integrated embedding of the training data (c) is nearly identical to the integrated embedding after projecting all cells (d). For c and d, rows represent different down-sampling levels and columns show the integrated embedding colored by batch (left) and major common cell types (right).

Extended Data Fig. 9 The margin parameter m impacts the attraction-repulsion theory.

The simulated data were subjected to INSCT integration while varying the margin parameter m from high to low (left to right). Top and bottom rows depict the integrated embedding colored by batch and cell type, respectively.
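For readers unfamiliar with the role of a margin, the sketch below shows a common form of triplet loss with Euclidean distance. It is a generic textbook formulation, not necessarily the exact loss used by INSCT, and the function name and default margin are assumptions.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise the triplet still contributes.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

With anchor (0, 0), positive (0, 1) and negative (0, 3), a margin of 1 gives zero loss, whereas a margin of 3 leaves a loss of 1. In this formulation, a larger margin keeps more triplets active and so demands more separation between anchor and negative, which is one plausible way the margin shifts the balance between attraction and repulsion.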

Cite this article

Simon, L.M., Wang, YY. & Zhao, Z. Integration of millions of transcriptomes using batch-aware triplet neural networks. Nat Mach Intell 3, 705–715 (2021). https://doi.org/10.1038/s42256-021-00361-8

  • DOI: https://doi.org/10.1038/s42256-021-00361-8
