Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes


Methods to deconvolve single-cell RNA-sequencing (scRNA-seq) data are necessary for samples containing a mixture of genotypes, whether they are natural or experimentally combined. Multiplexing across donors is a popular experimental design that can avoid batch effects, reduce costs and improve doublet detection. By using variants detected in scRNA-seq reads, it is possible to assign cells to their donor of origin and identify cross-genotype doublets that may have highly similar transcriptional profiles, precluding detection by transcriptional profile. More subtle cross-genotype variant contamination can be used to estimate the amount of ambient RNA. Ambient RNA is caused by cell lysis before droplet partitioning and is an important confounder of scRNA-seq analysis. Here we develop souporcell, a method to cluster cells using the genetic variants detected within the scRNA-seq reads. We show that it achieves high accuracy on genotype clustering, doublet detection and ambient RNA estimation, as demonstrated across a range of challenging scenarios.
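To make the idea of assigning cells by genotype concrete, here is a minimal sketch of only the assignment step: scoring one cell's allele counts against per-cluster alternate-allele fractions with a binomial log-likelihood. It is not souporcell's implementation, which learns the cluster genotypes without a reference and also models doublets and ambient RNA. The function name, the data layout and the `error_rate` floor are all assumptions made for illustration.

```python
from math import log


def assign_cell(cell_counts, cluster_alt_fractions, error_rate=0.01):
    """Pick the cluster whose allele fractions best explain one cell.

    cell_counts: {variant: (ref_count, alt_count)} for a single cell.
    cluster_alt_fractions: list of {variant: expected alt-allele fraction};
        every cluster must have an entry for every variant in the cell.
    error_rate: hypothetical floor that keeps log() finite at 0 or 1.
    """
    scores = []
    for fractions in cluster_alt_fractions:
        score = 0.0
        for variant, (ref, alt) in cell_counts.items():
            # Clamp the fraction away from 0 and 1 to allow for read errors.
            p = min(max(fractions[variant], error_rate), 1.0 - error_rate)
            # Binomial log-likelihood; the binomial coefficient is dropped
            # because it is identical for every cluster.
            score += alt * log(p) + ref * log(1.0 - p)
        scores.append(score)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Illustrative, made-up values: two clusters, one sparsely covered cell.
clusters = [
    {"v1": 0.0, "v2": 0.5, "v3": 1.0},
    {"v1": 1.0, "v2": 0.0, "v3": 0.5},
]
cell = {"v1": (3, 0), "v3": (0, 2)}
best_cluster, cluster_scores = assign_cell(cell, clusters)
```

In this toy case the cell carries only reference reads at `v1` and only alternate reads at `v3`, so the first cluster scores far higher than the second.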

Fig. 1: Souporcell overview.
Fig. 2: Evaluation of clustering accuracy.
Fig. 3: Application to challenging datasets.

Data availability

HipSci cell line data are available at the European Nucleotide Archive (ENA) with accession numbers ERS2630499–ERS2630501 for the three replicates of the experimental mixture and ERS2630502–ERS2630507 for the individual cell lines of euts, nufh, babz, oaqd and ieki, respectively. These data are shown in Fig. 2 and Supplementary Fig. 1. Maternal/fetal data are available at E-MTAB-6701 with sample numbers FCA7474063–FCA7474065. These data are shown in Fig. 3 and Supplementary Fig. 2. The Plasmodium data are available on ENA with accession numbers ERS4280420, ERS4280419 and ERS4280421 for Plasmodium samples 1–3, respectively. These data are shown in Fig. 3 and Supplementary Fig. 3.

Code availability

Souporcell is freely available under an MIT open-source license at




We acknowledge the Wellcome Sanger Institute’s DNA Pipelines for construction of the 10x sequencing libraries. We thank A. Muhwezi and A. Russell for assistance with parasite culture and 10x single-cell 3′ RNA-seq, respectively. In addition, we would like to thank M. Young for useful conversations about ambient RNA, M. Efremova for providing information about the maternal/fetal data and K. Gray for assistance in interpreting the previously unannotated cluster. The Wellcome Sanger Institute is funded by the Wellcome Trust (grant no. 206194/Z/17/Z), which supports M.K.N.L. and M.H. This work was supported by an MRC Career Development Award (G1100339) to M.K.N.L. R.D. was supported by Wellcome Trust grant WT207492. We would like to acknowledge the Wellcome Sanger Institute as the source of the human iPSC lines that were generated under the Human iPSC Initiative funded by a grant from the Wellcome Trust and Medical Research Council and supported by the Wellcome Trust (WT098503) and the NIHR/Wellcome Trust Clinical Research Facility. We acknowledge Life Science Technologies Corporation as the provider of Cytotune. The Cardiovascular Epidemiology Unit is supported by core funding from the UK Medical Research Council (MR/L003120/1), the British Heart Foundation (RG/13/13/30194; RG/18/13/33946) and the National Institute for Health Research (Cambridge Biomedical Research Centre at the Cambridge University Hospital’s NHS Foundation Trust). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

Author information




M.K.N.L. and M.H. conceived the project. H.H. developed the methods and software, ran the tests and simulations, and created the figures. M.K.N.L., M.H. and H.H. wrote the manuscript with methods contributions from A.M.T., A.K. and M.I. A.T. conducted the Plasmodium wet-lab experiments. A.K. and M.I. conducted the HipSci cell line experiments. D.J.G. provided the HipSci cell lines and sequencing. R.D. provided feedback and guidance throughout the project.

Corresponding authors

Correspondence to Haynes Heaton or Martin Hemberg or Mara K. N. Lawniczak.

Ethics declarations

Competing interests

H.H. was previously an employee of 10x Genomics and holds shares in that company.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 HipSci data sparsity, replicates, and genotyping.

a,b, Distribution of the number of cells expressing a variant (a) and of the number of alleles observed per cell (b) that were used in souporcell clustering for HipSci mixture replicate 1 (replicates 2 and 3 are very similar and are not shown). c, Expression PCA of HipSci mixture replicate 2 (4832 cells) colored by genotype clusters from souporcell. d and e, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 2, also colored by genotype cluster. f, Expression PCA of HipSci mixture replicate 3 (5144 cells) colored by genotype clusters. g and h, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 3 colored by genotype cluster. i, Assessment of genotype calling across souporcell, vireo and scSplit. We plot true-positive versus false-positive genotype calls while sweeping the threshold on genotype likelihood. These are compared to a truth set obtained from variant calls on the WGS data. j, Genotype calls of each tool versus the true genotypes for a synthetic mixture of five HipSci lines with 6% doublets and 10% ambient RNA, using a 0.95 probability threshold for each tool. The facets are the genotype calls made by each tool and the x-axis shows the correct assignments according to the WGS data. We observe that a major error mode for both vireo and scSplit compared to souporcell is that homozygous reference variants are miscalled as heterozygous because ambient RNA is not accounted for in these methods.
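The threshold sweep described for panel i can be sketched as follows. This is an illustrative outline rather than the evaluation code used for the figure: the tuple layout of `calls`, the genotype strings and the example values are all assumptions, and calls at variants absent from the truth set are counted as false positives here by choice.

```python
def sweep_thresholds(calls, truth, thresholds):
    """Count true- and false-positive genotype calls at each threshold.

    calls: list of (variant_id, genotype, probability) from one tool.
    truth: {variant_id: genotype}, e.g. derived from WGS variant calls.
    thresholds: iterable of minimum probabilities to evaluate.
    Returns a list of (threshold, true_positives, false_positives).
    """
    curve = []
    for threshold in thresholds:
        true_pos = false_pos = 0
        for variant, genotype, probability in calls:
            if probability < threshold:
                continue  # call is not confident enough at this threshold
            if truth.get(variant) == genotype:
                true_pos += 1
            else:
                # Wrong genotype, or variant missing from the truth set.
                false_pos += 1
        curve.append((threshold, true_pos, false_pos))
    return curve


# Made-up example calls; "0/0" is homozygous reference, "0/1" heterozygous.
example_calls = [
    ("v1", "0/1", 0.99),
    ("v2", "0/0", 0.97),
    ("v3", "0/1", 0.80),  # truth is 0/0: the error mode noted in panel j
    ("v4", "1/1", 0.60),
]
example_truth = {"v1": "0/1", "v2": "0/0", "v3": "0/0", "v4": "1/1"}
example_curve = sweep_thresholds(example_calls, example_truth, [0.5, 0.9, 0.95])
```

Plotting the second and third elements of each tuple against each other gives a curve of the kind described for panel i.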

Supplementary Figure 2 Maternal/Fetal decidua1 and placenta2.

a, Expression t-SNE of a decidua1 sample (FCA747063, 2119 cells) colored by genotype clusters for each tool. Souporcell and demuxlet are highly concordant (ARI = 0.93). Vireo misidentifies a significant number of maternal cells as fetal cells. Excluding doublets and unassigned cells, vireo has an ARI of 0.3 versus demuxlet. scSplit has many errors, resulting in an ARI versus demuxlet of 0. b, Expression t-SNE of the placenta2 sample (3968 cells) colored by genotype clusters for each tool. Souporcell is again highly concordant with demuxlet (ARI = 0.96). Vireo has significant problems, producing an ARI versus demuxlet of 0.18, even when excluding doublets and unassigned cells called by either tool. As with the other maternal/fetal samples, scSplit struggles and has an ARI versus demuxlet of 0.
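For readers unfamiliar with the adjusted Rand index (ARI) quoted above, the following is a minimal sketch of the standard pair-counting formula, plus a helper that drops doublets and unassigned cells before scoring. The label strings `"doublet"` and `"unassigned"` and both function names are assumptions for illustration, not the output format of any of the tools compared.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same cells (1.0 = identical)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells to count pairs")
    # Pairs of cells that share a cluster in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (both - expected) / (maximum - expected)


def ari_excluding(labels_a, labels_b, excluded=("doublet", "unassigned")):
    """ARI after removing cells that either labeling marks as excluded."""
    kept = [(a, b) for a, b in zip(labels_a, labels_b)
            if a not in excluded and b not in excluded]
    return adjusted_rand_index([a for a, _ in kept], [b for _, b in kept])
```

The ARI is invariant to how clusters are named, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, while values near 0 indicate agreement no better than chance.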

Supplementary Figure 3 Plasmodium clustering.

a, Distribution of the number of variants observed per cell used for clustering (with at least 4 cells required to support each allele) and the total number of variants used for clustering on the Plasmodium1 sample. b, Distribution of counts of the number of cells expressing each allele used for clustering as well as the total number of cells in the Plasmodium1 sample. c, Elbow plots for each Plasmodium data set show relatively strong support for the correct number of clusters (6) for Plasmodium1, but less clear results for Plasmodium2, which suffered from higher amounts of ambient RNA, and for Plasmodium3, in which cell numbers were biased towards three genotypes rather than forming a relatively even mixture. For this reason, we analyze Plasmodium3 with k = 3. d, Expression PCA of the Plasmodium2 sample (1893 cells) colored by genotype clusters as called by souporcell. e, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo and scSplit. For souporcell we see one cluster per strain, as expected. Both vireo and scSplit have the majority strain, 3D7, split across two clusters and two other strains combined into a single cluster. f, Expression PCA of the Plasmodium3 sample (2293 cells) colored by genotype clusters as called by souporcell. g, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo and scSplit genotype clusters with k = 3. Souporcell clusters out the 3D7 and 7G8 strains correctly and puts all other cells into the final cluster, while both vireo and scSplit put 3D7 into two clusters and all other cells into the remaining cluster.
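An elbow plot such as the one in panel c is usually read by eye. As a rough illustration of what "the elbow" means, the sketch below picks the k at which the improvement in total loss falls off most sharply (the largest second difference). This is a common heuristic assumed here for illustration, not a procedure taken from the paper, and the loss values are invented.

```python
def elbow_k(loss_by_k):
    """Return the k with the sharpest bend in a loss-versus-k curve.

    loss_by_k: {k: total clustering loss}, with evenly spaced k values
    and a loss that decreases as k grows.
    """
    ks = sorted(loss_by_k)
    if len(ks) < 3:
        raise ValueError("need losses for at least three values of k")
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = loss_by_k[prev_k] - loss_by_k[k]  # drop on reaching k
        gain_after = loss_by_k[k] - loss_by_k[next_k]   # drop on passing k
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Invented losses shaped like a clean six-genotype mixture.
example_losses = {1: 1000, 2: 800, 3: 600, 4: 400, 5: 200, 6: 50, 7: 45, 8: 41}
example_k = elbow_k(example_losses)
```

A heuristic like this is only as reliable as the curve is clean: when ambient RNA is high or the mixture is uneven, as the caption notes for Plasmodium2 and Plasmodium3, the bend is shallow and no automatic rule should be trusted without inspecting the plot.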

Supplementary Figure 4 Demuxlet data.

a, Souporcell cluster assignments of singletons for the combined dataset, showing that Sample A and Sample B are non-overlapping and that Sample C contains all 8 samples. b, The first cluster of the doublet assignment for doublets, showing largely non-overlapping assignments between Samples A and B.

Supplementary Figure 5 21 donor synthetic mixture.

a, UMAP of the normalized log likelihood cluster matrix for the singletons of a mixture of the 5 HipSci samples and the 16 PBMC samples from the Human Cell Atlas project. The main error is the assignment of 129 CB8 cells to the CB3-dominant cluster, indicated by the arrow. We show later that this is likely due to contamination.

Supplementary Figure 6 Contamination of CB8 samples.

a, Elbow plot of the CB8+CB3 synthetic mixture with 3% doublets, showing a clear preference for three clusters rather than the expected two. b, PCA of the normalized cell-by-cluster log likelihood matrix (n = 2716 cells), showing three distinct genotypes.

Supplementary Figure 7 UMI and Cell downsampling.

a, The synthetic mixture of 5 HipSci cell lines with 6% doublets and 5% ambient RNA with UMIs downsampled shows predominantly good clustering, but performance drops below 800 UMIs/cell. b, The clustering is consistently good with downsampled cells, down to an average of 40 cells per cluster. In the run averaging 40 cells per cluster, the smallest cluster had 20 cells.
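One simple way to simulate reduced UMI depth is to keep each observed UMI independently with a fixed probability. The sketch below shows that thinning step under stated assumptions: the `umi_counts` layout, the function name and the fixed seed are all hypothetical, and the source does not say which downsampling procedure was used for this figure.

```python
import random


def downsample_umis(umi_counts, keep_fraction, seed=0):
    """Thin UMI counts by keeping each UMI with probability keep_fraction.

    umi_counts: {key: count}, where a key might be (cell, variant, allele).
    keep_fraction: probability in [0, 1] of retaining any single UMI.
    seed: fixed so that the same input gives the same thinned output.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    thinned = {}
    for key, count in umi_counts.items():
        # One independent draw per UMI, i.e. a binomial sample of the count.
        thinned[key] = sum(1 for _ in range(count) if rng.random() < keep_fraction)
    return thinned


# Made-up counts for two cells at one variant.
example_counts = {("cell1", "v1", "ref"): 12, ("cell2", "v1", "alt"): 7}
example_half = downsample_umis(example_counts, 0.5)
```

To target a mean depth such as 800 UMIs per cell, `keep_fraction` would be set to the target divided by the observed mean depth, capped at 1.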

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Note and Supplementary Tables 1–4

Reporting Summary

About this article

Cite this article

Heaton, H., Talman, A.M., Knights, A. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat Methods 17, 615–620 (2020).
