Efficient integration of heterogeneous and increasingly large single-cell RNA sequencing data poses a major challenge for analysis and, in particular, comprehensive atlasing efforts. Here we developed a novel deep learning algorithm called INSCT (Insight) to overcome batch effects using batch-aware triplet neural networks. We use simulated and real data to demonstrate that INSCT generates an embedding space that accurately integrates cells across experiments, platforms and species. Our benchmark comparisons with current state-of-the-art single-cell RNA sequencing integration methods revealed that INSCT outperforms competing methods in scalability while achieving comparable accuracies. Moreover, using INSCT in semisupervised mode enables users to classify unlabelled cells by projecting them into a reference collection of annotated cells. To demonstrate scalability, we applied INSCT to integrate more than 2.6 million transcriptomes from four independent studies of mouse brains in less than 1.5 h using less than 25 GB of memory. This feature empowers researchers to perform atlasing-scale data integration in a typical desktop computer environment. INSCT is freely available at https://github.com/lkmklsmn/insct.
The simulation data are available from https://drive.google.com/file/d/1G5rBw-iOIcVMk-3R51DJK0ngttf1iobo. The pancreas data collection was downloaded from the Supplementary Information of the work by Luecken and colleagues (ref. 28) (https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968). The macaque retina dataset (ref. 38) was downloaded from the Gene Expression Omnibus (accession: GSE118480). Cell annotations were obtained from the Broad Institute’s Single Cell Portal at https://portals.broadinstitute.org/single_cell. The regular Tabula Muris data (ref. 30) were downloaded from https://tabula-muris.ds.czbiohub.org/. The Tabula Muris Senis data (ref. 26) were downloaded from https://figshare.com/projects/Tabula_Muris_Senis/64982. The Mouse Cell Atlas data (ref. 31) were downloaded from https://figshare.com/articles/MCA_DGE_Data/5435866. The Human Cell Landscape data (ref. 32) were downloaded from https://figshare.com/articles/HCL_DGE_Data/7235471. Mouse brain single-cell/nuclei RNA sequencing data were downloaded from four different resources. The 10x Genomics dataset was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The DropViz dataset was downloaded from http://dropviz.org. The MouseBrain.org data were downloaded from http://mousebrain.org/downloads.html. The SPLIT-seq data (ref. 39) were downloaded from the Gene Expression Omnibus (accession: GSE110823).
Bianconi, E. et al. An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013).
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Bacher, R. & Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).
Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Heaton, J. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: deep learning. Genet. Prog. Evolvable Mach. 19, 305–307 (2018).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Simon, L. M., Yan, F. & Zhao, Z. DrivAER: identification of driving transcriptional programs in single-cell RNA sequencing data. Gigascience 9, giaa122 (2020).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Schroff, F., Kalenichenko, D. & Philbin, J. FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015); https://doi.org/10.1109/cvpr.2015.7298682
Wang, J. et al. Learning fine-grained image similarity with deep ranking. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014); https://doi.org/10.1109/cvpr.2014.180
Szubert, B., Cole, J. E., Monaco, C. & Drozdov, I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci. Rep. 9, 8914 (2019).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Hoffer, E. & Ailon, N. in Similarity-Based Pattern Recognition 84–92 (Springer, 2015); https://doi.org/10.1007/978-3-319-24261-3_7
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Tabula Muris Consortium. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Nature 583, 590–595 (2020).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Preprint at https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2 (2020).
Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Han, X. et al. Mapping the mouse cell atlas by microwell-seq. Cell 173, 1091–1107.e17 (2018).
Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2020).
Obernier, K. & Alvarez-Buylla, A. Neural stem cells: origin, heterogeneity and regulation in the adult mammalian brain. Development 146, dev156059 (2019).
Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).
Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Peng, Y.-R. et al. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell 176, 1222–1237.e22 (2019).
Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
Simon, L. M. & Wang, Y. github.com/lkmklsmn/insct (Zenodo, 2021); https://doi.org/10.5281/ZENODO.4728031
This work was supported by the Cancer Prevention and Research Institute of Texas (CPRIT core grant no. RP180734). Z.Z. was also partially supported by National Institutes of Health grants (grant nos. R01LM012806 and R01DE030122). The funders had no role in the study design, data collection and analysis, the decision to publish, or the preparation of the manuscript. We would like to thank the members of the Bioinformatics and Systems Medicine Laboratory at the University of Texas Health Science Center at Houston, as well as D. Henke, for stimulating discussion.
The authors declare no competing interests.
Peer review information Nature Machine Intelligence thanks Smita Krishnaswamy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Schematic illustrating the concept of batch-aware triplet sampling. Using mutual nearest neighbors (MNNs), triplets are chosen such that the Anchor and the Positive come from different batches. Our GitHub repository contains an animation of the batch-aware triplet concept (https://github.com/lkmklsmn/insct).
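The batch-aware triplet idea in this caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the INSCT implementation: the function names (`knn_across`, `mutual_nearest_neighbors`, `sample_triplets`) are hypothetical, only two batches are handled, distances are Euclidean, and the Negative is drawn at random from the Anchor's own batch, which the caption does not specify.

```python
import math
import random


def knn_across(source, target, k):
    """For each point in `source`, return the indices of its k nearest points in `target`."""
    neighbors = []
    for point in source:
        order = sorted(range(len(target)), key=lambda j: math.dist(point, target[j]))
        neighbors.append(set(order[:k]))
    return neighbors


def mutual_nearest_neighbors(batch_a, batch_b, k):
    """Return (i, j) pairs where a_i and b_j are each among the other's k nearest neighbors."""
    a_to_b = knn_across(batch_a, batch_b, k)
    b_to_a = knn_across(batch_b, batch_a, k)
    return [
        (i, j)
        for i in range(len(batch_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]
    ]


def sample_triplets(batch_a, batch_b, k, rng):
    """Build (anchor, positive, negative) triplets tagged with their batch.

    Anchor and Positive are an MNN pair from different batches.
    Assumption: the Negative is a random other cell from the Anchor's batch.
    """
    triplets = []
    for i, j in mutual_nearest_neighbors(batch_a, batch_b, k):
        candidates = [n for n in range(len(batch_a)) if n != i]
        negative = rng.choice(candidates)
        triplets.append((("A", i), ("B", j), ("A", negative)))
    return triplets


# Toy data: batch B is batch A's cell groups shifted by a small "batch effect".
batch_a = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
batch_b = [(0.5, 0.5), (5.5, 5.5)]
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
triplets = sample_triplets(batch_a, batch_b, k=1, rng=random.Random(0))
```

Because every Positive sits in a different batch from its Anchor, a network trained on such triplets is pushed to place matching cells from different batches close together.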
Extended Data Fig. 2 INSCT robustly integrates simulated scRNA-seq data with many batches and many cell types.
a, Left, ten cell groups from ten batches were simulated for scenario 1. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. Right, cells were removed such that four cell types were exclusive to a given batch (highlighted with ellipses) for scenario 2. UMAP (top) and IVIS (middle) embeddings cannot overcome the batch effect. INSCT (bottom) correctly groups cells of the same type across batches. b, INSCT embeddings show the impact of a range of k-to-m-ratio parameter values, from left to right, on the complex simulation scenario 2. Cells are colored by batch (top) and cell group (bottom). With low k-to-m-ratio parameter values, the model cannot learn meaningful representations of cell groups exclusive to individual batches. With high k-to-m-ratio parameter values, the model cannot merge cell groups exclusive to individual batches into coherent clusters. With intermediate k-to-m-ratio parameter values, the model learns meaningful representations for cells exclusive to individual batches.
INSCT accurately integrates cells in the Tabula Muris data collection. a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy based on a KNN classifier. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the various integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell ontology class (right).
INSCT accurately integrates cells in the macaque retina data collection. a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the various integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).
INSCT accurately integrates cells in the human pancreas data collection. a, Barplots depict running time for each method. b, Boxplots depict mean integration accuracy. Error bars represent the standard error derived from testing the classifier on each batch. c, Scatter plot represents LISI metrics colored by method. d, Boxplot depicts kBET scores across the various integration methods. e-l, Integrated embedding plots for UMAP, BBKNN, Harmony, Scanorama, scVI, SAUCIE, INSCT and INSCT-UMAP, respectively. Cells are colored by batch (left) and cell type (right).
a, UMAP depicts cells from simulation scenario 1. A group of red cells from batch 2 is mislabeled as green cells. For visualization purposes only, the mislabeled cells are colored purple. b, As the label ratio parameter decreases from left to right, model training uses more triplets sampled from MNNs and KNNs and fewer sampled from cell type labels, thereby going from completely supervised (label ratio 1) to completely unsupervised (label ratio 0). At a high label ratio, the mislabeled cells (purple) map onto the incorrect cluster (green). As the label ratio parameter decreases, the mislabeled cells (purple) form a separate cluster. At label ratio 0, which corresponds to completely unsupervised training, the mislabeled cells map to the correct cluster (red).
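One way to picture the label ratio described in this caption is as the probability that a given training triplet is drawn from cell type labels rather than from MNN/KNN neighbors. The sketch below is an assumption made for illustration only: the function `choose_triplet_sources` is hypothetical, and INSCT may mix the two triplet sources differently (for example, by fixed proportions rather than per-triplet draws).

```python
import random


def choose_triplet_sources(n_triplets, label_ratio, rng):
    """Decide, per triplet, whether it is sampled from labels or from neighbors.

    label_ratio = 1 gives only label-based (supervised) triplets;
    label_ratio = 0 gives only neighbor-based (unsupervised) triplets.
    """
    if not 0.0 <= label_ratio <= 1.0:
        raise ValueError("label_ratio must lie in [0, 1]")
    # rng.random() is in [0, 1), so the two extremes are exact.
    return [
        "label" if rng.random() < label_ratio else "neighbor"
        for _ in range(n_triplets)
    ]


rng = random.Random(0)
fully_supervised = choose_triplet_sources(100, 1.0, rng)
fully_unsupervised = choose_triplet_sources(100, 0.0, rng)
mixed = choose_triplet_sources(100, 0.5, rng)
```

Under this reading, lowering the label ratio gives wrongly annotated cells fewer label-derived triplets, which is consistent with the mislabeled cells drifting back to their true cluster in panel b.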
INSCT efficiently integrates millions of cells. a, The graph shows the number of cells and their peak memory usage in gigabytes on the X and Y axes, respectively. b, The graph shows the number of cells and their runtime in minutes on the X and Y axes, respectively. Colors represent the different integration methods.
Down-sampling decreases running time while achieving comparable performance. Integration accuracy, as measured via a KNN classifier, remains robust across a number of down-sampling sets (a), while running time decreases (b). The integrated embedding of the training data (c) is nearly identical to the integrated embedding after projecting all cells (d). For c and d, rows represent different down-sampling levels and columns show the integrated embedding colored by batch (left) and major common cell types (right).
The margin parameter m influences the attraction-repulsion balance. The simulated data were subjected to INSCT integration while varying the margin parameter m from high to low (left to right). Top and bottom rows depict the integrated embedding colored by batch and cell type, respectively.
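To see why a margin shifts the balance between attraction and repulsion, consider the standard triplet margin loss in the style of FaceNet (Schroff et al.). The sketch below uses that textbook form with plain Euclidean distances as an assumption; the exact loss used by INSCT may differ, and `triplet_margin_loss` is a hypothetical helper name.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the Negative is at least `margin` farther
    from the Anchor than the Positive is.
    """
    d_pos = math.dist(anchor, positive)  # pulls Anchor and Positive together
    d_neg = math.dist(anchor, negative)  # pushes Anchor and Negative apart
    return max(d_pos - d_neg + margin, 0.0)


anchor, positive, negative = (0.0, 0.0), (1.0, 0.0), (3.0, 0.0)
small_margin_loss = triplet_margin_loss(anchor, positive, negative, margin=1.0)
large_margin_loss = triplet_margin_loss(anchor, positive, negative, margin=3.0)
```

With the small margin this triplet is already satisfied and contributes no gradient, whereas the large margin keeps it active, so the Negative continues to be pushed away. In this formulation a higher m therefore means more triplets exert repulsion.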
Cite this article
Simon, L.M., Wang, YY. & Zhao, Z. Integration of millions of transcriptomes using batch-aware triplet neural networks. Nat Mach Intell 3, 705–715 (2021). https://doi.org/10.1038/s42256-021-00361-8