Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes

Heaton, Haynes; Talman, Arthur M.; Knights, Andrew; Imaz, Maria; Gaffney, Daniel J.; Durbin, Richard; Hemberg, Martin; Lawniczak, Mara K. N.

doi:10.1038/s41592-020-0820-1

Article
Published: 04 May 2020

Nature Methods volume 17, pages 615–620 (2020)

Abstract

Methods to deconvolve single-cell RNA-sequencing (scRNA-seq) data are necessary for samples containing a mixture of genotypes, whether they are natural or experimentally combined. Multiplexing across donors is a popular experimental design that can avoid batch effects, reduce costs and improve doublet detection. By using variants detected in scRNA-seq reads, it is possible to assign cells to their donor of origin and identify cross-genotype doublets that may have highly similar transcriptional profiles, precluding detection by transcriptional profile. More subtle cross-genotype variant contamination can be used to estimate the amount of ambient RNA. Ambient RNA is caused by cell lysis before droplet partitioning and is an important confounder of scRNA-seq analysis. Here we develop souporcell, a method to cluster cells using the genetic variants detected within the scRNA-seq reads. We show that it achieves high accuracy on genotype clustering, doublet detection and ambient RNA estimation, as demonstrated across a range of challenging scenarios.

**Fig. 2: Evaluation of clustering accuracy.**

**Fig. 3: Application to challenging datasets.**

Data availability

HipSci cell line data are available at the European Nucleotide Archive (ENA) with accession numbers ERS2630499–ERS2630501 for the three replicates of the experimental mixture and ERS2630502–ERS2630507 for the individual cell lines of euts, nufh, babz, oaqd and ieki, respectively. These data are shown in Fig. 2 and Supplementary Fig. 1. Maternal/fetal data are available at E-MTAB-6701 with sample numbers FCA7474063–FCA7474065. These data are shown in Fig. 3 and Supplementary Fig. 2. The Plasmodium data are available on ENA with accession numbers ERS4280420, ERS4280419 and ERS4280421 for Plasmodium samples 1–3, respectively. These data are shown in Fig. 3 and Supplementary Fig. 3.

Code availability

Souporcell is freely available under an MIT open-source license at https://github.com/wheaton5/souporcell.

Acknowledgements

We acknowledge the Wellcome Sanger Institute’s DNA Pipelines for construction of the 10x sequencing libraries. We thank A. Muhwezi and A. Russell for assistance with parasite culture and 10x single-cell 3′ RNA-seq, respectively. In addition, we would like to thank M. Young for useful conversations about ambient RNA, M. Efremova for providing information about the maternal/fetal data and K. Gray for assistance in interpreting the previously unannotated cluster. The Wellcome Sanger Institute is funded by the Wellcome Trust (grant no. 206194/Z/17/Z), which supports M.K.N.L. and M.H. This work was supported by an MRC Career Development Award (G1100339) to M.K.N.L. R.D. was supported by Wellcome Trust grant WT207492. We would like to acknowledge the Wellcome Sanger Institute as the source of the human iPSC lines that were generated under the Human iPSC Initiative funded by a grant from the Wellcome Trust and Medical Research Council and supported by the Wellcome Trust (WT098503) and the NIHR/Wellcome Trust Clinical Research Facility. We acknowledge Life Science Technologies Corporation as the provider of Cytotune (http://HipSci.org). The Cardiovascular Epidemiology Unit is supported by core funding from the UK Medical Research Council (MR/L003120/1), the British Heart Foundation (RG/13/13/30194; RG/18/13/33946) and the National Institute for Health Research (Cambridge Biomedical Research Centre at the Cambridge University Hospital’s NHS Foundation Trust). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

Author information

Authors and Affiliations

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
Haynes Heaton, Andrew Knights, Maria Imaz, Daniel J. Gaffney, Martin Hemberg & Mara K. N. Lawniczak
MIVEGEC, IRD, CNRS, University of Montpellier, Montpellier, France
Arthur M. Talman
BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Strangeways Research Laboratory, Cambridge, UK
Maria Imaz
Department of Genetics, University of Cambridge, Cambridge, UK
Richard Durbin

Contributions

M.K.N.L. and M.H. conceived the project. H.H. developed the methods and software, ran the tests and simulations, and created the figures. M.K.N.L., M.H. and H.H. wrote the manuscript with methods contributions from A.M.T., A.K. and M.I. A.M.T. conducted the Plasmodium wet-lab experiments. A.K. and M.I. conducted the HipSci cell line experiments. D.J.G. provided the HipSci cell lines and sequencing. R.D. provided feedback and guidance throughout the project.

Corresponding authors

Correspondence to Haynes Heaton, Martin Hemberg or Mara K. N. Lawniczak.

Ethics declarations

Competing interests

H.H. was previously an employee of 10x Genomics and holds shares in that company.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 HipSci data sparsity, replicates, and genotyping.

a, Distribution of the number of cells expressing a variant and b, distribution of the number of alleles observed per cell that were used in souporcell clustering for HipSci mixture replicate 1 (replicates 2 and 3 are very similar and so are not shown). c, Expression PCA of HipSci mixture replicate 2 (4832 cells) colored by genotype clusters from souporcell. d and e, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 2, also colored by genotype cluster. f, Expression PCA of HipSci mixture replicate 3 (5144 cells) colored by genotype clusters. g and h, PCAs of the normalized cell-by-cluster loss matrix of HipSci mixture replicate 3 colored by genotype cluster. i, Assessing genotype calling across souporcell, vireo, and scSplit. We plot true positive versus false positive genotype calls while sweeping the threshold on genotype likelihood. These are compared to a truth set obtained from variant calls on the WGS data. j, Each tool’s genotype calls versus the true genotypes for a synthetic mixture of five HipSci lines with 6% doublets and 10% ambient RNA, using a 0.95 probability threshold for each tool. The facets are the genotype calls made by each tool and the x-axis shows the correct assignments according to the WGS data. We observe that a major error mode for both vireo and scSplit compared to souporcell is that homozygous reference variants are mis-called as heterozygous because ambient RNA is not accounted for in these methods.
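The threshold sweep in panel i can be illustrated with a short sketch. This is not the authors' evaluation code: the function name `sweep_thresholds`, the tuple layout and the genotype strings are assumptions made here for illustration, and a call is counted as a true positive simply when it equals the truth-set genotype.

```python
def sweep_thresholds(calls, thresholds):
    """Count true and false positive genotype calls at each threshold.

    calls: list of (probability, called_genotype, true_genotype) tuples
           (hypothetical layout, chosen for this example).
    Returns a list of (threshold, true_positives, false_positives).
    """
    curve = []
    for t in thresholds:
        # Keep only the calls whose reported probability passes the threshold.
        kept = [(called, truth) for p, called, truth in calls if p >= t]
        tp = sum(1 for called, truth in kept if called == truth)
        fp = len(kept) - tp
        curve.append((t, tp, fp))
    return curve


if __name__ == "__main__":
    example_calls = [
        (0.99, "0/1", "0/1"),
        (0.96, "0/0", "0/0"),
        (0.90, "0/1", "0/0"),  # homozygous reference mis-called as heterozygous
        (0.60, "1/1", "0/1"),
    ]
    for row in sweep_thresholds(example_calls, [0.5, 0.95]):
        print(row)
```

Raising the threshold trades false positives for fewer retained calls, which is the trade-off the plotted curve summarizes.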

Supplementary Figure 2 Maternal/Fetal decidua1 and placenta2.

a, Expression t-SNE of a decidua1 sample (FCA747063, 2119 cells) colored by genotype clusters for each tool. Souporcell and demuxlet are highly concordant (ARI = 0.93). Vireo misidentifies a significant number of maternal cells as fetal cells. Excluding doublets and unassigned cells, vireo has an ARI of 0.3 versus demuxlet. scSplit has many errors, resulting in an ARI versus demuxlet of 0. b, Expression t-SNE of a placenta2 sample (3968 cells) colored by genotype clusters for each tool. Souporcell is again highly concordant with demuxlet (ARI = 0.96). Vireo has significant problems, producing an ARI versus demuxlet of 0.18 even when excluding doublets and unassigned cells called by either tool. As with the other maternal/fetal samples, scSplit struggles and has an ARI versus demuxlet of 0.
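The ARI values above compare two labelings of the same cells after dropping doublets and unassigned cells. A minimal sketch of that computation follows, assuming the standard adjusted Rand index definition; the function name and the label strings "doublet" and "unassigned" are illustrative and are not taken from any of the tools.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b, exclude=("doublet", "unassigned")):
    """Adjusted Rand index between two labelings of the same cells.

    Cells labeled with any value in `exclude` by either tool are dropped
    before scoring (the label names are assumptions for this example).
    """
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a not in exclude and b not in exclude]
    total_pairs = comb(len(pairs), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on

    # Pair counts from the contingency table and its row/column margins.
    sum_cells = sum(comb(c, 2) for c in Counter(pairs).values())
    sum_rows = sum(comb(c, 2) for c in Counter(a for a, _ in pairs).values())
    sum_cols = sum(comb(c, 2) for c in Counter(b for _, b in pairs).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    tool_1 = ["maternal", "maternal", "fetal", "fetal", "doublet"]
    tool_2 = ["0", "0", "1", "1", "0"]
    print(adjusted_rand_index(tool_1, tool_2))
```

Because the index depends only on which cells are grouped together, cluster names need not match between tools.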

Supplementary Figure 3 Plasmodium clustering.

a, Distribution of the number of variants observed per cell used for clustering (with at least 4 cells required to support each allele) and the total number of variants used for clustering on the Plasmodium1 sample. b, Distribution of counts of the number of cells expressing each allele used for clustering as well as the total number of cells in the Plasmodium1 sample. c, Elbow plots for each Plasmodium data set show relatively strong support for the correct number of clusters (6) for Plasmodium1, but less clear results for Plasmodium2, which suffered from higher amounts of ambient RNA, and for Plasmodium3, in which cell numbers were biased towards three genotypes rather than forming a relatively even mixture. For this reason, we analyze Plasmodium3 with k = 3. d, Expression PCA of the Plasmodium2 sample (1893 cells) colored by genotype clusters as called by souporcell. e, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo, and scSplit. For souporcell we see one cluster per strain, as expected. Both vireo and scSplit have the majority strain, 3D7, split across two clusters and two other strains combined into a single cluster. f, Expression PCA of the Plasmodium3 sample (2293 cells) colored by genotype clusters as called by souporcell. g, Confusion matrix heatmap of the demuxlet best single strain (y axis) versus souporcell, vireo, and scSplit genotype clusters with k = 3. Souporcell clusters out the 3D7 and 7G8 strains correctly and puts all other cells into the final cluster, while both vireo and scSplit put 3D7 into two clusters and all other cells into the remaining cluster.
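The confusion matrices in panels e and g tabulate a reference strain label against a cluster label for each cell. A rough sketch of that tabulation is given below; the function names and the toy labels are hypothetical and do not reproduce the published analysis.

```python
from collections import Counter, defaultdict


def confusion_matrix(reference_labels, cluster_labels):
    """Nested dict: reference strain -> {cluster: number of cells}."""
    counts = Counter(zip(reference_labels, cluster_labels))
    matrix = defaultdict(dict)
    for (ref, cluster), n in counts.items():
        matrix[ref][cluster] = n
    return {ref: dict(row) for ref, row in matrix.items()}


def dominant_cluster(matrix):
    """For each reference strain, the cluster holding most of its cells."""
    return {ref: max(row, key=row.get) for ref, row in matrix.items()}


if __name__ == "__main__":
    strains = ["3D7", "3D7", "3D7", "7G8", "7G8"]
    clusters = [0, 0, 1, 2, 2]
    m = confusion_matrix(strains, clusters)
    print(m)                    # a strain with more than one key is split
    print(dominant_cluster(m))
```

A strain whose row has cells in more than one cluster is split, and a cluster that is dominant for more than one strain has merged them; these are the two error patterns described for vireo and scSplit.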

Supplementary Figure 4 Demuxlet data.

a, Souporcell cluster assignments of singletons for the combined dataset, showing that Sample A and Sample B are non-overlapping and that Sample C contains all 8 samples. b, First cluster of the doublet assignment for doublets, showing largely non-overlapping assignments between Samples A and B.

Supplementary Figure 5 21 donor synthetic mixture.

a, UMAP of the normalized log likelihood cluster matrix for the singletons of a mixture of the 5 HipSci samples and the 16 PBMC samples from the Human Cell Atlas project. The main error is the assignment of 129 CB8 cells to the CB3 dominant cluster indicated by the arrow. We show later that this is likely due to contamination.

Supplementary Figure 6 Contamination of CB8 samples.

a, Elbow plot of the CB8+CB3 synthetic mixture with 3% doublets shows a clear preference for three clusters rather than the expected two. b, PCA of the normalized cell-by-cluster log likelihood matrix (n = 2716 cells), showing three distinct genotypes.

Supplementary Figure 7 UMI and Cell downsampling.

a, The synthetic mixture of 5 HipSci cell lines with 6% doublets and 5% ambient RNA with UMIs downsampled shows predominantly good clustering, but performance drops below 800 UMIs/cell. b, The clustering is consistently good with downsampled cells down to an average of 40 cells per cluster. In the run averaging 40 cells per cluster, the smallest cluster had 20 cells.
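The caption does not say how the UMIs were downsampled. One plausible scheme, shown here purely as an assumption, is to draw a fixed number of UMIs per cell uniformly without replacement; the function name, the input layout and the seed are all hypothetical.

```python
import random


def downsample_umis(umis_by_cell, target_per_cell, seed=0):
    """Uniformly subsample each cell's UMIs without replacement.

    umis_by_cell: dict mapping a cell barcode to a list of UMI records
                  (hypothetical layout for this example).
    Cells that already have at most `target_per_cell` UMIs are kept as-is.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    downsampled = {}
    for cell, umis in umis_by_cell.items():
        if len(umis) <= target_per_cell:
            downsampled[cell] = list(umis)
        else:
            downsampled[cell] = rng.sample(umis, target_per_cell)
    return downsampled


if __name__ == "__main__":
    cells = {
        "AAAC": [f"umi{i}" for i in range(1000)],
        "TTTG": [f"umi{i}" for i in range(300)],
    }
    result = downsample_umis(cells, target_per_cell=800)
    print({cell: len(umis) for cell, umis in result.items()})
```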

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Note and Supplementary Tables 1–4

About this article

Cite this article

Heaton, H., Talman, A.M., Knights, A. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat Methods 17, 615–620 (2020). https://doi.org/10.1038/s41592-020-0820-1

Received: 28 June 2019
Accepted: 24 March 2020
Published: 04 May 2020
Issue Date: June 2020
DOI: https://doi.org/10.1038/s41592-020-0820-1
