Methods to deconvolve single-cell RNA-sequencing (scRNA-seq) data are necessary for samples containing a mixture of genotypes, whether they are natural or experimentally combined. Multiplexing across donors is a popular experimental design that can avoid batch effects, reduce costs and improve doublet detection. By using variants detected in scRNA-seq reads, it is possible to assign cells to their donor of origin and identify cross-genotype doublets that may have highly similar transcriptional profiles, precluding detection by transcriptional profile. More subtle cross-genotype variant contamination can be used to estimate the amount of ambient RNA. Ambient RNA is caused by cell lysis before droplet partitioning and is an important confounder of scRNA-seq analysis. Here we develop souporcell, a method to cluster cells using the genetic variants detected within the scRNA-seq reads. We show that it achieves high accuracy on genotype clustering, doublet detection and ambient RNA estimation, as demonstrated across a range of challenging scenarios.
HipSci cell line data are available at the European Nucleotide Archive (ENA) with accession numbers ERS2630499–ERS2630501 for the three replicates of the experimental mixture and ERS2630502–ERS2630507 for the individual cell lines of euts, nufh, babz, oaqd and ieki, respectively. These data are shown in Fig. 2 and Supplementary Fig. 1. Maternal/fetal data are available at E-MTAB-6701 with sample numbers FCA7474063–FCA7474065. These data are shown in Fig. 3 and Supplementary Fig. 2. The Plasmodium data are available on ENA with accession numbers ERS4280420, ERS4280419 and ERS4280421 for Plasmodium samples 1–3, respectively. These data are shown in Fig. 3 and Supplementary Fig. 3.
Souporcell is freely available under an MIT open-source license at https://github.com/wheaton5/souporcell.
We acknowledge the Wellcome Sanger Institute’s DNA Pipelines for construction of the 10x sequencing libraries. We thank A. Muhwezi and A. Russell for assistance with parasite culture and 10x single-cell 3′ RNA-seq, respectively. In addition, we would like to thank M. Young for useful conversations about ambient RNA, M. Efremova for providing information about the maternal/fetal data and K. Gray for assistance in interpreting the previously unannotated cluster. The Wellcome Sanger Institute is funded by the Wellcome Trust (grant no. 206194/Z/17/Z), which supports M.K.N.L. and M.H. This work was supported by an MRC Career Development Award (G1100339) to M.K.N.L. R.D. was supported by Wellcome Trust grant WT207492. We would like to acknowledge the Wellcome Sanger Institute as the source of the human iPSC lines that were generated under the Human iPSC Initiative funded by a grant from the Wellcome Trust and Medical Research Council and supported by the Wellcome Trust (WT098503) and the NIHR/Wellcome Trust Clinical Research Facility. We acknowledge Life Science Technologies Corporation as the provider of Cytotune (http://HipSci.org). The Cardiovascular Epidemiology Unit is supported by core funding from the UK Medical Research Council (MR/L003120/1), the British Heart Foundation (RG/13/13/30194; RG/18/13/33946) and the National Institute for Health Research (Cambridge Biomedical Research Centre at the Cambridge University Hospital’s NHS Foundation Trust). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
H.H. was previously an employee of 10x Genomics and holds shares in that company.
Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
a, Distribution of the number of cells expressing a variant and b, distribution of the number of alleles observed per cell that were used in souporcell clustering for HipSci mixture replicate 1 (replicates 2 and 3 are very similar and so are not shown). c, Expression PCA of HipSci mixture replicate 2 (4832 cells) colored by genotype clusters from souporcell. d and e, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 2, also colored by genotype cluster. f, Expression PCA of HipSci mixture replicate 3 (5144 cells) colored by genotype clusters. g and h, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 3 colored by genotype cluster. i, Assessment of genotype calling across souporcell, vireo and scSplit. We plot true positive versus false positive genotype calls while sweeping the threshold on genotype likelihood. These are compared to a truth set obtained from variant calls on the WGS data. j, Each tool's genotype calls versus the true genotypes for a synthetic mixture of five HipSci lines with 6% doublets and 10% ambient RNA, with a 0.95 probability threshold for each tool. The facets are the genotype calls made by each tool and the x-axis shows the correct assignments according to the WGS data. We observe that a major error mode for both vireo and scSplit compared to souporcell is that homozygous reference variants are mis-called as heterozygous because ambient RNA is not accounted for in these methods.
a, Expression t-SNE of a decidua1 sample (FCA747063, 2119 cells) colored by genotype clusters for each tool. Souporcell and demuxlet are highly concordant (ARI = 0.93). Vireo misidentifies a significant number of maternal cells as fetal cells. Excluding doublets and unassigned cells, vireo has an ARI of 0.3 versus demuxlet. scSplit has many errors, resulting in an ARI versus demuxlet of 0. b, Expression t-SNE of the placenta2 sample (3968 cells) colored by genotype clusters for each tool. Souporcell is again highly concordant with demuxlet (ARI = 0.96). Vireo has significant problems, producing an ARI versus demuxlet of 0.18, even when excluding doublets and unassigned cells called by either tool. As with the other maternal/fetal samples, scSplit struggles and has an ARI versus demuxlet of 0.
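The concordance values quoted above are adjusted Rand index (ARI) scores between two sets of cluster labels. As an illustration only, the following is a minimal sketch of the standard ARI formula in plain Python, together with a helper that drops doublets and unassigned cells before scoring. The function names and the label strings "doublet" and "unassigned" are assumptions made for this example; they are not taken from souporcell, demuxlet, vireo or scSplit output formats.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs of cells to disagree on
    # Pairs of cells grouped together in both labelings, in A only, in B only.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (sum_both - expected) / (max_index - expected)


def ari_excluding(labels_a, labels_b, excluded=("doublet", "unassigned")):
    """ARI over cells that neither tool labels with an excluded category.

    The category names in `excluded` are hypothetical placeholders.
    """
    kept = [(a, b) for a, b in zip(labels_a, labels_b)
            if a not in excluded and b not in excluded]
    if not kept:
        raise ValueError("no cells left after exclusion")
    kept_a, kept_b = zip(*kept)
    return adjusted_rand_index(kept_a, kept_b)
```

The index is invariant to cluster naming, so a tool that labels clusters 0 and 1 can be compared directly with one that reports donor names; a value near 0 indicates agreement no better than chance.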
a, Distribution of the number of variants observed per cell used for clustering (with at least 4 cells required to support each allele) and the total number of variants used for clustering on the Plasmodium1 sample. b, Distribution of counts of the number of cells expressing each allele used for clustering as well as the total number of cells in the Plasmodium1 sample. c, Elbow plots for each Plasmodium data set show relatively strong support for the correct number of clusters (6) for Plasmodium1, but less clear results for Plasmodium2, which suffered from higher amounts of ambient RNA, and for Plasmodium3, in which cell numbers were biased towards three genotypes rather than forming a relatively even mixture. For this reason, we analyze Plasmodium3 with k = 3. d, Expression PCA of the Plasmodium2 sample (1893 cells) colored by genotype clusters as called by souporcell. e, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo and scSplit. For souporcell we see one cluster per strain, as expected. Both vireo and scSplit have the majority strain, 3D7, split across two clusters and two other strains combined into a single cluster. f, Expression PCA of the Plasmodium3 sample (2293 cells) colored by genotype clusters as called by souporcell. g, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo and scSplit genotype clusters with k = 3. Souporcell clusters out the 3D7 and 7G8 strains correctly and puts all other cells into the final cluster, while both vireo and scSplit put 3D7 into two clusters and all other cells into the remaining cluster.
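An elbow plot of total clustering loss against the number of clusters k is usually read by eye. As a rough sketch of that reading, the snippet below picks the k at which the improvement from adding one more cluster falls off most sharply (the largest second difference of the loss curve). This heuristic, the function name `elbow_k` and the loss values in the usage example are all assumptions for illustration; they are not part of souporcell and the numbers are not results from the paper.

```python
def elbow_k(loss_by_k):
    """Return the k with the sharpest bend in a loss-versus-k curve.

    loss_by_k maps each tested k to a total clustering loss (lower is better).
    Assumes the tested k values are evenly spaced, e.g. 1, 2, 3, ...
    """
    ks = sorted(loss_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of k to locate an elbow")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = loss_by_k[prev_k] - loss_by_k[k]  # improvement on reaching k
        gain_after = loss_by_k[k] - loss_by_k[next_k]   # improvement on going past k
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical losses for a six-genotype mixture: large gains up to k = 6,
# then almost none.
example_losses = {1: 1000, 2: 800, 3: 600, 4: 400, 5: 200, 6: 50, 7: 45, 8: 42}
chosen_k = elbow_k(example_losses)
```

When a sample is dominated by a few genotypes or carries substantial ambient RNA, the curve flattens gradually and no single bend stands out, which is consistent with the less clear elbows described for Plasmodium2 and Plasmodium3.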
a, Souporcell cluster assignments of singletons for the combined dataset, showing that Sample A and Sample B are non-overlapping and Sample C contains all 8 samples. b, First cluster of the doublet assignment for doublets, showing largely non-overlapping assignments between Samples A and B.
a, UMAP of the normalized log likelihood cluster matrix for the singletons of a mixture of the 5 HipSci samples and the 16 PBMC samples from the Human Cell Atlas project. The main error is the assignment of 129 CB8 cells to the CB3-dominant cluster, indicated by the arrow. We show later that this is likely due to contamination.
a, Elbow plot of the CB8+CB3 synthetic mixture with 3% doublets shows a clear preference for three clusters rather than the expected two. b, PCA of the normalized cell-by-cluster log likelihood matrix (n = 2716 cells) showing three distinct genotypes.
a, The synthetic mixture of 5 HipSci cell lines with 6% doublets and 5% ambient RNA with UMIs downsampled shows predominantly good clustering, but performance drops below 800 UMIs/cell. b, The clustering is consistently good with downsampled cells, down to an average of 40 cells per cluster. In the run with an average of 40 cells per cluster, the smallest cluster had 20 cells.
Cite this article
Heaton, H., Talman, A.M., Knights, A. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat Methods (2020). https://doi.org/10.1038/s41592-020-0820-1