Abstract
Efficient integration of heterogeneous and increasingly large single-cell RNA sequencing data poses a major challenge for analysis and, in particular, for comprehensive atlasing efforts. Here we developed a deep learning algorithm called INSCT (Insight) that overcomes batch effects using batch-aware triplet neural networks. We use simulated and real data to demonstrate that INSCT generates an embedding space that accurately integrates cells across experiments, platforms and species. Our benchmark comparisons with current state-of-the-art single-cell RNA sequencing integration methods revealed that INSCT outperforms competing methods in scalability while achieving comparable accuracy. Moreover, using INSCT in semisupervised mode enables users to classify unlabelled cells by projecting them into a reference collection of annotated cells. To demonstrate scalability, we applied INSCT to integrate more than 2.6 million transcriptomes from four independent studies of mouse brains in less than 1.5 h using less than 25 GB of memory. This efficiency enables researchers to perform atlas-scale data integration on a typical desktop computer. INSCT is freely available at https://github.com/lkmklsmn/insct.
Data availability
The simulation data are available from https://drive.google.com/file/d/1G5rBw-iOIcVMk-3R51DJK0ngttf1iobo. The pancreas data collection was downloaded from the Supplementary Information of the work by Luecken and colleagues (ref. 28) (https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968). The macaque retina dataset (ref. 38) was downloaded from the Gene Expression Omnibus (accession: GSE118480). Cell annotations were obtained from the Broad Institute’s Single Cell Portal at https://portals.broadinstitute.org/single_cell. The regular Tabula Muris data (ref. 30) were downloaded from https://tabula-muris.ds.czbiohub.org/. The Tabula Muris Senis data (ref. 26) were downloaded from https://figshare.com/projects/Tabula_Muris_Senis/64982. Mouse Cell Atlas data (ref. 31) were downloaded from https://figshare.com/articles/MCA_DGE_Data/5435866. The Human Cell Landscape data (ref. 32) were downloaded from https://figshare.com/articles/HCL_DGE_Data/7235471. Mouse brain single-cell/nuclei RNA sequencing data were downloaded from four different resources. The 10x Genomics dataset was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The DropViz dataset was downloaded from http://dropviz.org. The MouseBrain.org data were downloaded from http://mousebrain.org/downloads.html. The SPLIT-seq data (ref. 39) were downloaded from the Gene Expression Omnibus (accession: GSE110823).
Code availability
INSCT, as well as all analyses and results presented in the manuscript, is available via GitHub: https://github.com/lkmklsmn/insct (ref. 40). INSCT is licensed under the MIT license.
References
Bianconi, E. et al. An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013).
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Bacher, R. & Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).
Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Heaton, J. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: deep learning. Genet. Program. Evolvable Mach. 19, 305–307 (2018).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Simon, L. M., Yan, F. & Zhao, Z. DrivAER: identification of driving transcriptional programs in single-cell RNA sequencing data. Gigascience 9, giaa122 (2020).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Schroff, F., Kalenichenko, D. & Philbin, J. FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015); https://doi.org/10.1109/cvpr.2015.7298682
Wang, J. et al. Learning fine-grained image similarity with deep ranking. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014); https://doi.org/10.1109/cvpr.2014.180
Szubert, B., Cole, J. E., Monaco, C. & Drozdov, I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci. Rep. 9, 8914 (2019).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Hoffer, E. & Ailon, N. in Similarity-Based Pattern Recognition 84–92 (Springer, 2015); https://doi.org/10.1007/978-3-319-24261-3_7
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Tabula Muris Consortium. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Nature 583, 590–595 (2020).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Preprint at https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2 (2020).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Han, X. et al. Mapping the mouse cell atlas by microwell-seq. Cell 173, 1091–1107.e17 (2018).
Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2020).
Obernier, K. & Alvarez-Buylla, A. Neural stem cells: origin, heterogeneity and regulation in the adult mammalian brain. Development 146, dev156059 (2019).
Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).
Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Peng, Y.-R. et al. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell 176, 1222–1237.e22 (2019).
Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
Simon, L. M. & Wang, Y. github.com/lkmklsmn/insct (Zenodo, 2021); https://doi.org/10.5281/ZENODO.4728031
Acknowledgements
This work was supported by the Cancer Prevention and Research Institute of Texas (CPRIT core grant no. RP180734). Z.Z. was also partially supported by the National Institutes of Health (grant nos. R01LM012806 and R01DE030122). The funders had no role in the study design, data collection and analysis, the decision to publish, or the preparation of the manuscript. We would like to thank the members of the Bioinformatics and Systems Medicine Laboratory at the University of Texas Health Science Center at Houston, as well as D. Henke, for stimulating discussion.
Contributions
L.M.S. conceptualized the algorithm. L.M.S., Y.-Y.W. and Z.Z. designed the project. L.M.S. and Y.-Y.W. developed the algorithm and analysed the data. Z.Z. supervised the project. L.M.S., Y.-Y.W., and Z.Z. wrote the manuscript. All authors read and approved the final manuscript. Correspondence and requests for materials should be addressed to L.M.S.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Machine Intelligence thanks Smita Krishnaswamy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Schematic illustrates the concept of batch-aware triplet sampling.
Using mutual nearest neighbors (MNNs), triplets are chosen such that the Anchor and the Positive come from different batches. Our GitHub repository contains an animation of the batch-aware triplet concept (https://github.com/lkmklsmn/insct).
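The sampling idea in this caption can be sketched in plain Python. This is a minimal illustration, not the INSCT implementation: the function names (`mutual_nearest_neighbors`, `sample_triplets`) are hypothetical, the brute-force neighbor search is only suitable for toy data, and drawing the Negative at random from the Anchor's own batch is an assumption that the caption does not state.

```python
import math
import random


def nearest(query, candidates, cells, k):
    """Return the k indices in `candidates` closest to cells[query]."""
    return sorted(candidates, key=lambda j: math.dist(cells[query], cells[j]))[:k]


def mutual_nearest_neighbors(cells, batches, k=1):
    """Find cross-batch pairs (i, j) that are each other's k nearest neighbors.

    For cell i, neighbors are searched only among cells of another batch;
    the pair is kept only if i is also among the neighbors of j in i's batch.
    """
    by_batch = {}
    for idx, batch in enumerate(batches):
        by_batch.setdefault(batch, []).append(idx)

    pairs = set()
    for i, batch_i in enumerate(batches):
        for batch_j, members in by_batch.items():
            if batch_j == batch_i:
                continue
            for j in nearest(i, members, cells, k):
                if i in nearest(j, by_batch[batch_i], cells, k):
                    pairs.add((i, j))
    return pairs


def sample_triplets(cells, batches, k=1, seed=0):
    """Build (anchor, positive, negative) index triplets.

    Anchor and positive are an MNN pair from different batches.
    ASSUMPTION: the negative is a random cell from the anchor's own batch.
    """
    rng = random.Random(seed)
    triplets = []
    for anchor, positive in sorted(mutual_nearest_neighbors(cells, batches, k)):
        same_batch = [c for c in range(len(cells))
                      if batches[c] == batches[anchor] and c != anchor]
        if same_batch:
            triplets.append((anchor, positive, rng.choice(same_batch)))
    return triplets


if __name__ == "__main__":
    # Two cell groups, each observed in batch "A" and batch "B".
    toy_cells = [(0.0, 0.0), (5.0, 5.0), (0.5, 0.2), (5.3, 4.9)]
    toy_batches = ["A", "A", "B", "B"]
    for triplet in sample_triplets(toy_cells, toy_batches):
        print(triplet)
```

In the toy data, every Anchor is paired with a Positive from the other batch, which is the property the caption describes.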
Extended Data Fig. 2 INSCT robustly integrates simulated scRNA-seq data with many batches and many cell types.
a, Left, ten cell groups from ten batches were simulated for scenario 1. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. Right, cells were removed such that four cell types were exclusive to a given batch (highlighted with ellipses) for scenario 2. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. b, INSCT embeddings show the impact of a range of k-to-m-ratio parameter values (left to right) on the complex simulation scenario 2. Cells are colored by batch (top) and cell group (bottom). With low k-to-m-ratio parameter values, the model cannot learn meaningful representations of cell groups exclusive to individual batches. High k-to-m-ratio parameter values cannot merge cell groups exclusive to individual batches into coherent clusters. Intermediate k-to-m-ratio parameter values learn meaningful representations for cells exclusive to individual batches.
Extended Data Fig. 3 INSCT accurately integrates cells in Tabula Muris data collection.
a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy based on a KNN classifier. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell ontology class (right).
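One way to read the accuracy metric in panel b is as a per-batch evaluation of a KNN classifier on the integrated embedding. The sketch below assumes a leave-one-batch-out protocol (train on all other batches, test on the held-out batch) and a simple majority vote; the caption does not spell out the protocol, so both choices, and the function names, are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote over the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def per_batch_accuracy(embedding, labels, batches, k=3):
    """Accuracy on each batch when the classifier is trained on the others.

    ASSUMPTION: leave-one-batch-out evaluation; not confirmed by the caption.
    """
    scores = {}
    for held_out in sorted(set(batches)):
        train = [i for i, b in enumerate(batches) if b != held_out]
        test = [i for i, b in enumerate(batches) if b == held_out]
        train_x = [embedding[i] for i in train]
        train_y = [labels[i] for i in train]
        correct = sum(
            knn_predict(train_x, train_y, embedding[i], k) == labels[i]
            for i in test
        )
        scores[held_out] = correct / len(test)
    return scores


def mean_and_standard_error(values):
    """Mean and standard error (sample s.d. divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    # Toy embedding in which both cell types overlap across the two batches.
    emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]
    cell_types = ["a", "a", "b", "b", "a", "b"]
    batch_ids = [1, 1, 1, 1, 2, 2]
    scores = per_batch_accuracy(emb, cell_types, batch_ids, k=1)
    print(scores, mean_and_standard_error(list(scores.values())))
```

A well-integrated embedding should give high accuracy on every held-out batch, because cells of the same type from different batches end up as neighbors.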
Extended Data Fig. 4 INSCT accurately integrates cells in the macaque retina data collection.
a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the various integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).
Extended Data Fig. 5 INSCT accurately integrates cells in the human pancreas data collection.
a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the various integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).
Extended Data Fig. 6 INSCT is robust to mislabeling.
a, UMAP depicts cells from simulation scenario 1. A group of red cells from batch 2 is mislabeled as green cells. For visualization purposes only, the mislabeled cells are colored purple. b, As the label ratio parameter decreases from left to right, model training uses more triplets sampled from MNNs and KNNs relative to triplets sampled from cell-type labels, thereby going from completely supervised (label ratio 1) to completely unsupervised (label ratio 0). At a high label ratio, the mislabeled cells (purple) map onto the incorrect cluster (green). As the label ratio parameter decreases, the mislabeled cells (purple) form a separate cluster. At label ratio 0, which corresponds to completely unsupervised training, the mislabeled cells map to the correct cluster (red).
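The role of the label ratio can be illustrated with a short sketch that mixes two pools of triplets. It assumes the label ratio is the fraction of training triplets drawn from label-derived (supervised) triplets, with the remainder drawn from neighbor-derived (MNN/KNN) triplets; the function name `mix_triplets` and the sampling-with-replacement scheme are hypothetical and are not taken from the INSCT code.

```python
import random


def mix_triplets(supervised, unsupervised, label_ratio, n_triplets, seed=0):
    """Draw a training set of triplets from two pools.

    ASSUMPTION: `label_ratio` is the fraction of triplets taken from the
    label-derived pool; 1.0 is fully supervised, 0.0 fully unsupervised.
    Triplets are sampled with replacement from each pool.
    """
    if not 0.0 <= label_ratio <= 1.0:
        raise ValueError("label_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_supervised = round(label_ratio * n_triplets)
    n_unsupervised = n_triplets - n_supervised
    chosen = [rng.choice(supervised) for _ in range(n_supervised)]
    chosen += [rng.choice(unsupervised) for _ in range(n_unsupervised)]
    rng.shuffle(chosen)  # avoid ordering the training set by triplet source
    return chosen


if __name__ == "__main__":
    # Triplets are (anchor, positive, negative) index tuples; values are toy data.
    from_labels = [(0, 1, 2), (3, 4, 5)]
    from_neighbors = [(10, 11, 12), (13, 14, 15)]
    for ratio in (1.0, 0.5, 0.0):
        print(ratio, mix_triplets(from_labels, from_neighbors, ratio, n_triplets=4))
```

Under this reading, lowering the ratio reduces the influence of any single wrong label, which is consistent with the behavior the caption reports for the mislabeled cells.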
Extended Data Fig. 7 INSCT efficiently integrates millions of cells.
a, Peak memory usage in gigabytes (y axis) as a function of the number of cells (x axis). b, Runtime in minutes (y axis) as a function of the number of cells (x axis). Colors represent the different integration methods.
Extended Data Fig. 8 Down-sampling decreases running time while achieving comparable performance.
a, Integration accuracy, as measured via a KNN classifier, remains robust across a number of down-sampling levels. b, Running time decreases with down-sampling. The integrated embedding of the training data (c) is nearly identical to the integrated embedding after projecting all cells (d). For c and d, rows represent different down-sampling levels and columns show the integrated embedding colored by batch (left) and major common cell types (right).
Extended Data Fig. 9 The margin parameter m impacts the attraction-repulsion theory.
The simulated data were subjected to INSCT integration while varying the margin parameter m from high to low (left to right). Top and bottom rows depict the integrated embedding colored by batch and cell type, respectively.
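To make the role of m concrete, the sketch below evaluates the standard triplet margin loss, max(d(a, p) - d(a, n) + m, 0), for a single triplet. This is the generic formulation from the triplet-network literature cited in the references; whether INSCT uses exactly this form, and the use of unsquared Euclidean distance here, are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet margin loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    ASSUMPTION: unsquared Euclidean distance; other distances are possible.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


if __name__ == "__main__":
    a, p, n = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)  # d(a, p) = 5, d(a, n) = 10
    for m in (1.0, 5.0, 10.0):
        print(f"margin={m}: loss={triplet_loss(a, p, n, m)}")
```

With a small margin this triplet already contributes no loss, whereas a larger margin keeps pushing the Negative away from the Anchor, which is one way to think about m shifting the balance between attraction and repulsion in the embedding.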
Cite this article
Simon, L.M., Wang, YY. & Zhao, Z. Integration of millions of transcriptomes using batch-aware triplet neural networks. Nat Mach Intell 3, 705–715 (2021). https://doi.org/10.1038/s42256-021-00361-8
DOI: https://doi.org/10.1038/s42256-021-00361-8