Characterization of phenotypic diversity is a key challenge in the emerging field of single-cell RNA-sequencing (scRNA-seq). In scRNA-seq data, patterns of gene expression (GE) are conventionally used as features to explore the heterogeneity among single cells1,2,3. However, GE features are subject to a significant amount of noises4. For example, GE might be affected by batch effect, where results obtained from two different runs of experiments may present substantial variations5, even when the input materials are identical. Additionally, the expression of particular genes varies with cell cycle6, increasing the heterogeneity observed in single cells7. To cope with these sources of variations, normalization of GE is usually a mandatory step before downstream functional analysis7. Even with these procedures, other sources of biases still exist, e.g., dependent on read depth, cell capture efficiency and experimental protocols etc.

Single-nucleotide variations (SNVs) are genetic alterations of one single base occurring in specific cells as compared to the population background. SNVs may manifest their effects on gene expression by cis and/or trans effect8,9.The disruption of the genetic stability, e.g. increasing number of new SNVs, is known to be linked with cancer evolution10,11. A cell may become the precursor of a subpopulation (clone) upon gaining a set of SNVs. Considerable heterogeneity exists not only between tumors but also within the same tumor12,13. Therefore, investigating the patterns of SNVs provides means to understand tumor heterogeneity.

In single cells, SNVs are conventionally obtained from single-cell exome-sequencing and whole-genome sequencing approaches14. The resulting SNVs can then be used to infer cancer cell subpopulations15,16. In this study, we propose to obtain useful SNV-based genetic information from scRNA-seq data, in addition to the GE information. Rather than being considered the by-products of scRNA-seq, the SNVs not only have the potential to improve the accuracy of identifying subpopulations compared to GE, but also offer unique opportunities to study the genetic events (genotype) associated with gene expression (phenotype)17,18. Moreover, when the coupled DNA- and RNA-based single-cell sequencing techniques become mature, the computational methodology proposed in this report can be adopted as well19.

Here we first built a computational pipeline to identify SNVs from scRNA-seq raw reads directly. We then constructed a linear modeling framework to obtain filtered, effective, and expressed SNVs (eeSNVs) associated with gene expression profiles. In all the datasets tested, these eeSNVs show better accuracies at retrieving cell subpopulation identities, compared to those from gene expression (GE). Moreover, when combined with cell entities into bipartite graphs, they demonstrate improved visual representation of the cell subpopulations. We ranked eeSNVs and genes according to their overall significance in the linear models and discovered that several top-ranked genes (e.g., HLA genes) appear commonly in all cancer scRNA-seq data. In summary, we emphasize that extracting SNV from scRNA-seq analysis can successfully identify subpopulation complexity and highlight genotype–phenotype relationships.


SNV calling from scRNA-seq data

We implemented a pipeline to identify SNVs directly from FASTQ files of scRNA-seq data, following the SNV guideline of GATK (Supplementary Figure 1). We applied this pipeline to five scRNA-seq cancer datasets (Kim20, Ting21, Miyamoto22, Patel23, and Chung24 see Methods), and tested the efficiency of SNV features on retrieving single cell groups of interest. These datasets vary in tissue types, origins (Mouse or Human), read lengths and map-ability (Table 1). They all have pre-defined cell types (subclasses), providing useful references for assessing the performance of a variety of clustering methods used in this study.

Table 1 Summary of scRNA-seq datasets used in this study

We evaluated the GATK SNV calling pipeline using several approaches. First, we estimated the true positive rates of the SNV calling pipeline at different depths of scRNA-seq reads. For this we performed a simulation experiment by artificially introducing 50,000 random SNVs in the exonic regions of hg19 and measured recovery of these SNVs using our pipeline on Kim dataset. The true positive rates monotonically increase with read depth. For as few as 4 read-depth, the pipeline achieves on average over 50% true positive rate and increases to 68% true positive rate when read depth is more than 6 (Fig. 1a). This accuracy is in line with what was reported from bulk cell RNA-seq25. The false positive rate is consistently <0.1, and the median reaches below 0.05 when the read depth is >6 (Fig. 1b). We compared the SNV call results from GATK to those from another SNV caller FreeBayes26 and obtained similar results (Supplementary Figure 2A, B). Additionally, we conducted simulation experiment on a new non-cancer 10X Genomic dataset, and obtained comparable true positive rates (Supplementary Figure 2C, D). Moreover GATK shows better performance than FreeBayes in the 10X dataset. We thus opted to use GATK to call SNVs for the remainder of the report, given its popularity and performance.

Fig. 1
figure 1

The performance measurements of GATK SNV calling and SSrGE pipelines. a, b Performance measurement of GATK SNV calling pipeline. Box plots of true positive rate (a) and false positive rate (b) with respect to the read depth at the called SNV position. The rates are calculated from GATK SNV calling pipeline, using hg19 reference genome to align modified scRNA-seq reads from a subset of 20 cells from the Kim dataset, which were introduced 50,000 random artificial mutations the exonic region of the reads. Error bars represent standard deviation. c, d Comparisons of importance the different types of features in SSrGE models, with respect to the ranking, in Miyamoto dataset (c) and Kim dataset (d). The scores of the SNVs and CNVs correspond to the sum of the coefficients inferred by the SSrGE models. The gene score is the sum of the SNVs scores for a particular gene. Blue: CNV feature; Red: eeSNV feature; Green: gene feature

Using SSrGE to detect eeSNVs in scRNA-seq data

To link the relationship between SNV and GE, we developed a method called Sparse SNV inference to reflect Gene Expression (SSrGE), as detailed in Methods. In addition to SNV, we also optionally considered the effect of CNVs on gene expression, since copy number variation (CNV) may contribute to gene expression variation as well. Similar to gene-based association method PrediXscan17, SSrGE uses SNVs and additionally optionally CNVs as predictors to fit a linear model for gene expression, under LASSO regularization and feature selection27. We choose LASSO rather than elasticNet for penalization, so that the list of resulting eeSNVs is short (Supplementary Figure 3). These eeSNVs serve as refined descriptive features for subsequent subpopulation identification. To directly pinpoint the contributions of SNVs relevant to protein-coding genes, we used the SNVs residing between transcription starting and ending sites of genes as the inputs. We further assessed the relative contributions of eeSNVs and CNVs to gene expression and found that the coefficients of the CNVs are significantly lower than those of eeSNVs(Fig. 1c, d). The ranks of the top genes with and without CNVs in the SSrGE models are not statistically different overall, as the Kendall-Tau correlation scores28 are close to 1 with p-values = 0 (Kendall-Tau test).

Additionally, SNV genotypes and allelic specific level gene expression may also complicate the relationships between eeSNVs and gene expression. Therefore, we further calibrated SSrGE model by considering SNV genotype and allelic specific gene expression. We used QUASAR29 to estimate the SNV genotypes (Supplementary Table 1), and the allelic specific gene expression using the SNV genotype. We rebuilt individual SSrGE models using only the SNVs from a particular genotype and allelic-specific gene expression, and then merged the eeSNV weights from related SSrGE models together to obtain a final ranking of eeSNVs. The new rankings are not statistically different compared to the previous approach (Supplementary Table 1). The Kendall-Tau scores, which evaluate the similarities between the re-calibrated model and the original model have p-values = 0 (Kendall-Tau test) in all data sets.

Lastly, to quantitatively evaluate if the eeSNVs obtained from SSrGE are truly significant, we designed a simulation pipeline (Methods). The pipeline creates random binary matrices of SNVs for n simulated cells, which are connected to the matrices of gene expression. The SNVs present in the simulated cell have probabilities to modify gene expression of the genes positively or negatively. We used various levels of noise to perturb the GE and the SNV matrices. We compared the ranks of top genes identified by SSrGE to the expected impact of each gene provided by the simulation. The inferred top ranked genes using SSrGE have monotonic and positive correlations with those set by the simulation (Supplementary Figure 4A). These correlations are all significant (p-value « 0.05, Kendall-Tau test), independently of the alpha and the level of noise used, confirming the value of SSrGE model. Moreover, to simulate the patterns of dropout in the data, we also introduced two other parameters, one for random dropout or biased dropout toward both lowly expressed genes, and other for dropout rate relative to cell, gene or reads (Methods). We observed that SSrGE performs well on all dropout models (Supplementary Figure 4B). Thus, SSrGE method is validated to generate eeSNVs that are truly important.

eeSNVs are better than gene counts at finding subpopulations

We measured the performance of SNVs and gene expression (GE) to identify subpopulations on the five datasets, using five clustering approaches (Fig. 2). These clustering approaches include two dimension reduction methods, namely principal component analysis (PCA)30 and factor analysis (FA)31, followed by either K-means or the hierarchical agglomerative method (agglo) with WARD linkage32. We also used a recent algorithm SIMLR designed explicitly for scRNA-seq data clustering and visualization33. To evaluate the accuracy of obtained subpopulations in each dataset, we used the metric of adjusted mutual information (AMI) over 30 bootstrap runs, from the optimal a parameters (Supplementary Data 1). These optimal parameters were estimated by testing different a values for each dataset and each clustering approach (Supplementary Figure 5). As shown in Fig. 2, eeSNVs are better features to retrieve cancer cell subpopulations compared to GE, independent of the clustering methods used. Among the clustering algorithms, SIMLR tends to be a better choice using eeSNV features.

Fig. 2
figure 2

Comparison of clustering accuracy using eeSNV and gene expression (GE) features. ae Bar plots comparing the clustering performance using eeSNV vs. gene expression (GE) as features, over five different clustering strategies and five datasets, each with its own pre-defined classes as the truth measure: a Kim dataset, b Ting dataset, c Chung dataset, d Miyamoto dataset, and e Patel dataset. Y-axis is the adjusted mutual information (AMI) obtained across 30 bootstrap runs (mean ± s.d.). Error bars represent standard deviation. *P < 0.05, **P < 0.01, and ***P < 0.001 (paired t-test). f Heatmap of the rankings among different methods and datasets as shown in ae

Additionally, we computed the Adjusted Rand index (ARI)34 and V-measure35, two other metrics for modularity measurements (Methods) and obtained similar trends (Supplementary Figure 6). Similar to AMI, ARI is a normalized metric against random chance, and evaluates the number of correct pairs obtained. On the other hand, V-measure combines the homogeneity score, which measures the homogeneity of reference classes in the obtained clusters, and the completeness score, which measures the homogeneity of obtained clusters within the reference classes. Due to the high number of small homogenous clusters obtained for the Miyamoto dataset, we observed higher V-measure scores, compared to AMI and ARI results (Supplementary Figure 6).

Visualization of subpopulations with bipartite graphs

Bipartite graphs are an efficient way to describe the binary relations between two different classes of objects. We next represented the presence of the eeSNVs into single cell genomes with bipartite graphs using ForceAtlas2 algorithm36. We drew an edge (link) between a cell node and a given eeSNV node whenever an eeSNV is detected. The results show that bipartite graph is a robust and more discriminative alternative (Fig. 3), comparing to PCA plots (using GE and eeSNVs) as well as SIMLR (using GE). For Kim dataset, bipartite graph separates the three classes perfectly. However, gene-based visualization approaches using either PCA or SIMLR have misclassified data points. For Ting data, the eeSNV-cell bipartite graph gives a clear visualization of all six different subgroups of single cells. Other three approaches have more exaggerated separations among the same mouse circulating tumor cells (CTC) subgroup MP (orange color), but mix some other subpopulations (e.g., GM, MP, and TuGMP groups). Miyamoto dataset is the most difficult one to visualize among the four datasets, due to its high number (24) of reference classes and heterogeneity among CTCs. Bipartite graphs are not only able to condense the whole populations, but also separate subpopulations (e.g., the orange colored PC subpopulation) much better than the other three methods.

Fig. 3
figure 3

Comparison of clustering visualization using eeSNV and gene expression (GE) features. a Bipartite graphs using eeSNVs and cells as two groups of nodes. An edge between a cell and an eeSNV represents the presence of the eeSNV within that cell. b Principle component analysis (PCA) results using GE as features of the cells. c PCA results using eeSNVs as features of the cells. d SIMILR results using GE as the input

Characteristics of eeSNVs

In SSrGE, regularization parameter a is the only tuning variable, controlling the sparsity of the linear models and influences the number of eeSNVs. We next explored the relationship between eeSNVs and a (Fig. 4). For every dataset, increasing the value of a decreases the number of selected eeSNVs overall (Fig. 4a), as well as the average number of eeSNVs associated with every expressed gene (Fig. 4b). The optimal a depends on the clustering algorithm and the dataset used (Supplementary Data 1 and Supplementary Figure 5). Increasing the value of a expands the proportion of eeSNVs that have annotations in human dbSNP138 database, indicating a higher true positive rate of SNVs compared to that prior to SSrGE filtering (Fig. 4c). Additionally, increasing a increases the average number of cells sharing the same eeSNVs (Fig. 4d), corresponding to the decreasing number of eeSNVs (Fig. 4b). Note the slight drop in the average number of cells sharing the same eeSNVs in Kim data when a > 0.6, this is due to overpenalization (e.g., a = 0.8 yields only 34 eeSNVs).

Fig. 4
figure 4

Characteristics of the eeSNVs. X-axis: the regularization parameter a values used by LASSO penalization in the SSrGE models. And the Y-axes are: a Log10 transformation of the number of eeSNVs. b The average number of eeSNVs per gene. c The proportion of SNVs with dbSNP138 annotations (human datasets). d The average number of cells sharing eeSNVs

Cancer relevance of eeSNVs

Following the simulation results, we ranked the different eeSNVs and the genes for the five datasets, from SSrGE models (Supplementary Data 2). We found that eeSNVs from multiple genes in human leukocyte antigen (HLA) complex, such as HLA-A, HLA-B, HLA-C, and HLA-DRA, are top ranked in all four human datasets (Table 2 and Supplementary Data 2). HLA is a family encoding the major histocompatibility complex (MHC) proteins in human. Beta-2-microglobulin (B2M), on the other hand, is ranked 7th and 45th in Ting and Patel datasets, respectively (Table 2). Unlike HLA that is present in human only, B2M encodes a serum protein involved in the histocompatibility complex MHC that is also present in mice. Other previously identified tumor driver genes are also ranked top by SSrGE, demonstrating the impact of mutations on cis-gene expression (Table 2 and Supplementary Data 2). Notably, KRAS, previously linked to tumor heterogeneity (Kim et al.37), is ranked 13th among all eeSNV containing genes (Supplementary Data 2). AR and KLK3, two genes reported to show genomic heterogeneity in tumor development in the original study22, are ranked 6th and 19th, respectively. EGFR, the therapeutic target in Patel study with an important oncogenic variant EGFRvIII (Patel et al.23), is ranked 88th out of 4225 genes. Therefore, genes top-ranked by their eeSNVs are empirically validated.

Table 2 A list of interested genes highly ranked

Next we conducted more systematic investigation to identify KEGG pathways enriched in each dataset, using these genes as the input for DAVID annotation tool38 (Fig. 5a). The pathway-gene bipartite graph illustrates the relationships between these genes and enriched pathways (Fig. 5b). As expected, Antigen processing and presentation pathway stands out as the most enriched pathway, with the sum −log10 (p-value) of 15.80 (Fig. 5b). Phagosome is the second most enriched pathway in all four data sets, largely due to its members in HLA families (Fig. 5b). Additionally, pathways related to cell junctions and adhesion (focal adhesion and cell adhesion molecules CAMs), protein processing (protein processing in endoplasmic reticulum and proteasome), and PI3K-AKT signaling pathway are also highly enriched with eeSNVs (Fig. 5a).

Fig. 5
figure 5

Gene and KEGG pathways enriched with eeSNVs in the five scRNA-seq datasets. a Bipartite graph using significant KEGG pathways and genes enriched with eeSNVs as nodes. An edge exists between a significant pathway and a gene if this gene is part of the pathway. Genes of each data set is represented with a distinct color. The size of the nodes reflects the gene and the pathway scores. The gene scores are computed by SSrGE and the pathway scores are the sum of the gene scores linked for each pathway. b KEGG pathways enriched within the top 100 genes based on eeSNV contributions in the five datasets. Pathways are sorted by the sum of the −log10 (p-value) of each dataset, in the descending order

Heterogeneity markers using eeSNVs

We display the potential of eeSNV as heterogeneity markers via pseudo-time reconstruction and heatmap, using Kim dataset (Fig. 6a, b). We built a Minimum Spanning Tree, similarly to the Monocle algorithm39, to reconstruct the pseudo-time ordering of the single cells. The graphs beautifully capture the continuity among cells, from the primary to metastasized tumors (Fig. 6a). Moreover, it highlights ramifications inside each of the subgroups, demonstrating the intra-group heterogeneity. On the contrary, pseudo-time reconstruction using GE showed much less complexity and more singularity (Supplementary Figure 7). As a confirmation, hierarchical clustering of eeSNV heatmp also shows almost perfect separation of the three subgroups (Fig. 6b). Next, we used our method to identify eeSNVs specific to each single-cell subgroup and ranked the genes according to these eeSNVs. We compared the characteristics of the metastasis cells to primary tumor cells. Two top-ranked genes identified by the method, CD44 (1st) and LPP (2nd), are known to promote cancer cell dissemination and metastasis growth after genomic alteration40,41,42,43 (Supplementary Data 2). Other top-ranked genes related to metastasis are also identified, including LAMPC2 (7th), HSP90B1 (14th), MET (44th), and FN1 (52th). As expected, Pathways in Cancer are the top-ranked pathway enriched with mutations (Fig. 6b). Additionally, Focal Adhesion, and Endocytosis pathways are among the other significantly mutated pathways, providing new insights on the mechanistic difference between primary and metastasized RCC tumors (Fig. 6c).

Fig. 6
figure 6

Heterogeneity revealed by Kim dataset. a Pseudo-time ordering reconstruction of the different subgroups: single cells from PDX primary tumor (green), patient metastasis (blue) and PDX metastasis (red). The eeSNVs are obtained with a = 0.6. The tree is inferred using the MST algorithm on Pearson’s correlation-based distance matrix. b Heatmap of the cells (row) and eeSNVs (column). c Bipartite graph using KEGG pathways (orange color) and genes enriched with significant eeSNVs (green color) as two set of nodes. The significant eeSNVs are inferred from the metastasized cells, compared to the primary tumor cells. The size of the nodes reflects the gene scores (given by SSrGE) and the pathway scores (sum of the gene scores). Lighter green indicates genes with a lower rank

Another application is to explore the potential of eeSNVs to separate different cell types within the same individual. Toward this, we extended the same analysis on the two patients BC03 and BC07 from the Chung dataset, who have primary and metastasized tumor cells as well as infiltrating immune cells. Again, bipartite graphs and minimum spanning trees-based visualization illustrate clear separations of tumor (primary and metastasized) cells from immune cells (Supplementary Figure 8). Furthermore, the top-ranked genes relative to the metastasis subgroups (BC03M and BC07M) present some similarities with those found in Kim dataset (Supplementary Data 3). Strikingly, CD44 is also top-ranked (23rd) among the significant genes of BC07M. Similarly, HSP90B1 is top-ranked as the 63rd and 51st most important genes, in BC03M and BC07M, respectively.

Integrating DNA- and RNA-seq data in the same single cells

Coupled DNA-seq and RNA-seq measurements from the same single cell are the new horizon of single-cell genomics. To demonstrate the potential of SSrGE in integrating DNA and RNA data, we downloaded a public single cells data, which have DNA methylation and RNA-seq records from the same hepatocellular carcinoma (HCC) single cells (Hou dataset)44. We then inferred SNVs from the aligned reduced representation bisulfite sequencing (RRBS) reads (see Methods), and used them to predict the scRNA-seq data from the same samples. Given the fact that SNVs are heterozygous among tumor and normal cells, and that a small fraction of genes harboring eeSNVs are subject to CNV, we included both the percentages of SNVs as well as CNVs as additional predictive variables in the SSrGE model besides SNV features. Interestingly, the identified eeSNVs can clearly separate normal hepatocellular cells from cancer cells and highlight the two cancer subtypes identified in the original study (Fig. 7). Pseudo-time ordering shows an early divergence between the two previously assumed subtypes (Fig. 7b). This observation is confirmed by hierarchical clustering of eeSNV based heatmap (Fig. 7c). A simplified version of SSrGE model, where only SNV features were considered as predictors for gene expression, shared 92% eeSNVs as those in Fig. 7a, and achieved almost identical separations between normal hepatocellular cells and cancer cells. This confirms the earlier observation that eeSNVs are much more important predictive features, compared to CNVs (Fig. 1c, d).

Fig. 7
figure 7

Heterogeneity revealed by eeSNVs from multi-omics single cell HCC (Hou) dataset. Normal cells are colored green and HCC tumor cells are colored bright (subpopulation I) or dark (subpopulation II) red. a Bipartite-graph representation using the single cells and eeSNVs from RRBS reads as two sets of nodes. b Pseudo-time ordering reconstruction of the HCC cells, using eeSNVs in from RRBS. c Heatmap of the cells (row) and eeSNVs (column)

We postulated that a considerable part of bisulfite reads was aligned with methylation islands associated with gene promoter regions. Thus, we annotated eeSNVs within 1500 bp upstream of the transcription starting codon, and obtained genes with these eeSNVs, which are significantly prevalent in certain groups. When comparing HCC vs. normal control cells, two genes PRMT2, SULF2 show statistically significant mutations in HCC cells (p-values < 0.05, Fisher’s exact test). Downregulation of PRMT2 was previously associated with breast cancer45, SULF2 was known to be upregulated in HCC and promotes HCC growth46.


Using GE to accurately analyze scRNA-seq data has many challenges, including technological biases such as the choice of the sequencing platforms, the experimental protocols and conditions. These biases may lead to various confounding factors in interpreting GE data5. SNVs, on the other hand, are less prone to these issues given their binary nature. In this report, we demonstrate that eeSNVs extracted from scRNA-seq data are ideal features to characterize cell subpopulations. Moreover, they provide a means to examine the relationship between eeSNVs and gene expression in the same scRNA-seq sample.

The process of selecting eeSNVs linked to GE allows us to identify representative genotype markers for cell subpopulations. We speculate the following reasons attributed to the better accuracies of eeSNVs compared to GE. First, eeSNVs are binary features rather than continuous features like GE. Thus, eeSNVs are more robust at separating subpopulations. We have noticed that SNVs are less affected by batch effects (Supplementary Figure 9). Secondly, LASSO penalization works as a feature selection method and minimizes the spurious SNVs (false positive) from the filtered set of eeSNVs. Thirdly, since eeSNVs are obtained from the same samples as scRNA-seq data, they are more likely to have biological impacts, and this is supported by the observation that they have high prevalence of dbSNP annotations.

A small number of eeSNVs can be used to discriminate distinct single-cell subpopulations, as compared to thousands of genes that are normally used for scRNA-seq analyses. Taking advantage of the eeSNV-GE relationship, a very small number of top eeSNVs still can clearly separate cell subpopulations of the different datasets (e.g., 8 eeSNV features have decent separations for Kim dataset). Moreover, our SSrGE package can be easily parallelized and process each gene independently. It has the potential to scale up to very large datasets, well poised for the new wave of scRNA-seq technologies that can generate thousands of cells at one time47. One can also easily rank the eeSNVs and the genes harboring them, for the purpose of identifying robust eeSNVs as genetic markers for a variety of cancers.

The eeSNV based approach to identify subpopulations can be generalized to other non-cancer tissues and scRNA-seq data generated from various platforms. Additionally, we analyzed nine datasets, including two non-cancer 10X Genomics datasets with a large number of cells, four non-cancer datasets (GSE79457, GSE81903, GSE71485, and GSE80232), and three cancer single-cell datasets (GSE69405, GSE81383, and GSE65253). Bipartite graphs with eeSNVs and cells were able to separate the single-cell subpopulations based on their genotype (Supplementary Figure 10A–I).

SSrGE uses an accumulative ranking approach to select eeSNVs linked to the expression of a particular gene. Mainly, HLA class I genes (HLA-A, HLA-B, and HLA-C) are top-ranked for the three human datasets, and they contribute to antigen processing and presentation pathway, the most enriched pathways of the four datasets. HLA has amongst the highest polymorphic genes of the human genome48, and the somatic mutations of genes in this family occurred in the development and progression of various cancers49,50. eeSNVs of HLA genes could be used as fingerprints to identify the cellular state of the cancer cells, and lead to better separation between the primary (pRCC, green) and metastatic cells (mRCC, red and blue) compared to GE of HLA genes (Supplementary Figure 11). B2M, another gene with top-scored eeSNVs in Ting and Patel datasets, is also known to be a mutational hotspot51. It is immediately linked to immune response as tumor cell proliferation49,51. Many other top-ranked genes, such as SPARC, were reported to be driver genes in the original studies of the different dataset. Thus, it is reasonable to speculate that SSrGE is capable of identifying some driver genes. Another possibility is that some of the eeSNVs reflect aberrant splicing of genes such as the HLA family, which are regularly found in deregulated cancer cells52. Nevertheless, SSrGE may miss some driver mutations due to the incomplete DNA coverage of scRNA-seq reads. Also, its primary goal is to identify a minimal set of eeSNV features by LASSO penalization but in case of correlated features, LASSO may select one of those highly correlated SNV features that correspond to GE.

We have showed with strong evidence that eeSNVs unveil inter- and intra-tumor cells heterogeneities better than gene expression count data obtained from the same RNA-seq reads. Reconstructing the pseudo-time ordering of cancer cells from the same tumor (Kim dataset) displays branching even inside primary tumor and metastasis subgroups, which gene expression data are unable to do. We identified genes enriched with SNVs specific to the metastasis, which were not reported in the original HCC single cell study44. Most interestingly, we showed that eeSNVs can also be retrieved from RRBS reads in a multi-omics single-cell HCC dataset, a twist from their original purpose of single-cell DNA methylation. Again, genes ranked by eeSNVs from RRBS reads only differentiate normal from cancer cells but also the different cancer subtypes. We identified several genes that are significant in either HCC or HCC subgroup, whose promoters are highly impacted by eeSNVs. Thus, we have demonstrated that our method is on the fore-front to analyze data generated by new single-cell technologies extracting multi-omics from the same cells44,53.

Bipartite graphs are a natural way to visualize eeSNV-cell relationships. We have used force-directed graph drawing algorithms involve spring-like attractive forces and electrical repulsions between nodes that are connected by edges. This approach has the advantage to reveal outlier single cells, with a small set of eeSNVs, compared to those distance-based approaches. Moreover, the bipartite representation also reveals directly the relationship between single cells and the eeSNV features. Contrary to dimension reduction approaches such as PCA that requires linear transformation of features into principle components, bipartite graphs preserve all the binary information between cell and eeSNV. Graph analysis software such as Gephi36 or Cytoscape54 can be utilized to explore the bipartite relationships in an interactive manner.

In this manuscript, we demonstrated the efficiency of using eeSNVs for cell subpopulation identification over multiple datasets. eeSNVs are excellent genetic markers for intra-tumor heterogeneity and may serve as genetic candidates of new treatment options. We also have developed SSrGE, a linear model framework that correlates genotype (eeSNV) and phenotype (GE) information in scRNA-seq data. Moreover, we have showed the capacity of SSrGE in analyzing multi-omics data from the same single cells, obtained from the most cutting-edge genomics techniques55,37. Our method has the great promise as part of routine analyses in scRNA-seq pipelines56, as well as multi-omics single-cell integration projects.


scRNA-seq datasets

All six datasets were downloaded from the NCBI Gene Expression Omnibus (GEO) portal57.

Kim dataset (accession GSE73121) contains three cell populations from matched primary and metastasis tumor from the same patient20. Patient Derivated Xenographs (PDX) were constructed using cells from the primary Clear Cell Renal Cell Carcinoma (PDX-pRCC) tumor and from the lung metastasic tumor (PDX-mRCC). Also, metastatic cells from the patient (Pt-mRCC) were sequenced.

Patel dataset (accession GSE57872) contains five glioblastoma cell populations isolated from five individual tumors from different patients (MGH26, MGH28, MGH29 MGH30, and MGH31) and two gliomasphere cell lines, CSC6 and CSC8, used as control23.

Miyamoto dataset (accession GSE67980) contains 122 CTCs from Prostate cancer from 18 patients, 30 single cells derived from 4 different cancer cell lines: VCaP, LNCaP, PC3, and DU145, and 5 leukocyte cells from a healthy patient (HD1)22. A total of 23 classes (18 CTC classes + 4 cancer cell lines + 1 healthy leukocyte cell lines) were obtained.

Ting dataset (subset of accession GSE51372) contains 75 CTCs from Pancreatic cancer from 5 different KPC mice (MP2, MP3, MP4, MP6, and MP7), 18 CTCs from two GFP-lineage traced mice (GMP1 and GMP2), 20 single cells from one GFP-lineage traced mouse (TuGMP3), 12 single cells from a mouse embryonic fibroblast cell line (MEF), 12 single cells from mouse white blood (WBC) and 16 single cells from the nb508 mouse pancreatic cell line (nb508)21. KPC mice have uniform genetic cancer drivers (Tp53, Kras). Due to their shared genotype, we merged all the KPC CTCs into one single reference class. CTCs from GMP1 did not pass the QC test and were dismissed. CTCs from GMP2 mice were labeled as GMP. Finally, 6 reference classes were used: MP, nb508, GMP, TuGMP, MEF, and WBC.

Hou dataset (accession GSE65364) contains 25 hepatocellular carcinoma single cells (Ca) extracted from the same patient and 6 normal liver cells (HepG2) obtained from the adjacent normal tissue of another HCC patient44. The 32 cells were sequenced using scTrio-seq in order to obtain reads from both RNA-seq and reduced representation bisulfate sequencing (RRBS). The authors highlighted that one of the Ca cells (Ca_26) was likely to be a normal cell, based on CNV measurements, and thus we discarded this cell. We used the RRBS reads to infer the SNVs. We use gene expression data provided by the authors to construct a GE matrix. For controls, we used the bulk genome of all the RNA-seq and RRBS reads of the HepG2 group.

Chung dataset (accession GSE75688) contains 549 single cells from primary breast cancer and lymph nodes metastases, extracted from 11 patients (BC01-11) of distinct molecular subtypes. BC01-02 are estrogen receptor positive (ER+); BC04-06 are human epidermal growth factor receptor 2 positive (HER+); BC03 is double positive (ER+ and HER+); BC07-11 are triple negative breast cancer (TNBC)24. Only BC03 and BC07 presented cells extracted from lymph nodes metastases (BC03M and BC07M). Additionally, the dataset contains a large part of infiltrating tumor cells. Following the original analytical procedure of the original study24, we performed an unsupervised clustering analysis to separate the cancer from the immune cells. We first reduced the dimension with a PCA analysis and then used a Gaussian mixture model to infer the clusters. We obtained a total of 372 cancer cells and 177 immune cells.

SNV detection using scRNA-seq data

The SNV detection pipeline using scRNA-seq data follows the guidelines of GATK ( It includes four steps: alignment of spliced transcripts to the reference genome (hg19 or mm10), BAM file preprocessing, read realignment and recalibration, and variant calling and filtering (Supplementary Figure 1)58.

Specifically, FASTQ files were first aligned using STAR aligner59, using mm10 and hg19 as reference genomes for mouse and human datasets, respectively. The BAM file quality check was done by FastQC60, and samples with lower than 50% of unique sequences were removed (default of FastQC). Also, samples with more than 20% of the duplicated reads were removed by STAR. Finally, samples with insufficient reads were also removed, if their reads were below the mean minus two times the standard deviation of the entire single-cell population. The summary of samples and reads filtered by these steps is listed in Table 1. Raw gene counts Xj were estimated using featureCounts61, and normalized using the logarithmic transformation:

$$f(X_j) = {\mathrm{log}}_2\left( {1 + X_j.\frac{{10^9}}{{(G_j.R)}}} \right)$$

where Xjis the raw expression of gene j, R is the total number of reads and Gj is the length of the gene j. BAM files were pre-processed and reordered using Picard Tools (, before subject to realignment and recalibration using GATK tools62. SNVs are then calculated and filtered using the GATK tools HaplotypeCaller and VariantFiltration using default parameters.

Additionally, we used Freebayes26 with default parameters to infer the SNVs, as alternative to the HaplotypeCaller and VariantFiltration softwares. The SNV calling results between the two callers are very similar (Supplementary Figure 2).

SNV detection using RRBS data

We first aligned the RRBS reads on the hg19 reference genome using the Bismark software63. We then processed the bam files using all the preprocessing steps as described in the SNV detection using scRNA-seq data section (i.e., Picard Preprocessing, Order reads, Split reads and Realignments), except the base recalibration step. Finally, we called the SNVs using the BS-SNPer software (default setting)64. The details are the following.

–minhetfreq 0.1 # Threshold of frequency for calling heterozygous SNP

–minhomfreq 0.85 # Threshold of frequency for calling homozygous SNP

–minquali 15 # Threshold of base quality

–mincover 10 # Threshold of minimum depth of covered reads

–maxcover 1000 # Threshold of maximum depth of covered reads

–minread2 2 # Minimum mutation reads number

–errorate 0.02 # Minimum mutation rate

–mapvalue 20 # Minimum read mapping value

10X genomic data preprocessing

We downloaded the fastq files of two data sets: bone marrow mononuclear cells (BMMC) of 2 individuals and peripheral blood mononuclear cells (PBMC), from the 10X Genomic website ( We de-multiplexed the reads using the BC tags extracted with the pysam library ( For bipartite graph display, we used a subset of 2000 BMMC cells and 3300 PBMC cells with the highest number of reads in each dataset, respectively.

SNV simulation

We created a modified version of the hg19 reference genome by introducing 50,000 random mutations in the exonic region of the genes. To introduce a new mutation, we weighted each exon depending on its base length and selected one randomly and proportional to its weight. We realigned the scRNA-seq reads from a subset of 20 cells from the Kim dataset with the introduced new mutations, using the same SNV detection pipelines described above. We used BedTools65 to compute the read depth of each mutation.

SNV annotation

To annotate human SNV datasets, dbSNP138 from the NCBI Single Nucleotide Polymorphism database66 and reference INDELs from 1000 genomes (1000_phase1 as Mills_and_1000G_gold_standard)67 were used. To annotate the mouse SNV dataset, dbSNPv137 for SNPs and INDELs were downloaded from the Mouse Genomes Project of the Sanger Institute, using the following link: The mouse SNP databases were sorted using SortVcf command of Picard Tools in order to be properly used by Picard Tools and GATK.

SSrGE package to correlate eeSNVs to gene expression

For each dataset, we denote MSNV and MGE as the SNV and gene expression matrices, respectively. MSNV is binary (MSNVc,s{0,1}) indicating the presence/absence of SNV s in cell c. MGEc,s is the log transformed gene expression value of the gene g in cell c. Copy number variation (CNV) MCNV, can be added as an additional optional predictor in SSrGE. We computed CNV for each gene g in each cell c using the online platform Ginkgo68. For the Hou dataset, we inferred CNV using the same approach described by the authors44. We removed any SNV present in less than k = 3 cells, as it offers the best tradeoff between accuracy and SSrGE running time. We also filtered SNVs associated with genes having normalized expression value below 2.0. We also discarded genes with normalized expressions below 2.0 and expressed in less than 10 cells from SSrGE analysis. For each gene g, we applied a sparse linear regression using LASSO to identify Wg, the linear coefficient associated to SNV, as well as Wcnv, the coefficient associated to the CNV of g (if CNV was considered). The objective function for SSrGE to minimize is:

$$\begin{array}{*{20}{c}}{\rm{min}}_{{W_{g + {\mathrm{CNV}}}}} \left({\frac{1}{{2n}}\left\Vert {M_{{\mathrm{SNV}} + {\mathrm{CNV}}g}.W_{g + {\mathrm{CNV}}g}^T - M_{{\mathrm{GE}} \ast ,g}} \right\Vert_2^2 + \alpha .\left\Vert {W_{g + {\mathrm{CNV}}g}} \right\Vert_1} \right)\end{array}$$

where α is the regularization parameter, MSNV+CNVg represents the MSNV matrix with an additional column that corresponds to the CNV of the gene g for each cell. In this configuration, Wg+CNV has one more column too, corresponding to the weight associated with the CNV. An SNV was considered as eeSNV when Wg(s)≠0.

When CNV is not considered in SSrGE, the objective function is simplified as:

$$\begin{array}{*{20}{c}}{\rm{min}}_{{W_{g}}}\left( {\frac{1}{{2n}}\left\Vert {M_{\mathrm{SNV}}.W_g^T - M_{\mathrm{GE} \ast ,g}} \right\Vert_2^2 + \alpha .\left\Vert {W_g} \right\Vert_1} \right)\end{array}$$

Inference of SNV genotype and allele-specific SSrGE calibration

We used the package QuASAR ( to identify the genotype of each SNV29. For each dataset, we collected the SNVs and constructed an n × 3 matrix using the number of reads mapping to the reference allele, the alternate allele, or neither of the alleles. We then fit this matrix with QuASAR to estimate the true genotype of each SNV. We then estimated the allelic gene expression for each cell by multiplying the normalized gene counts with the fraction of the SNV of a particular genotype. To calibrate SSrGE model with allele expression, we first fit an SSrGE model for each genotype using the allele-specific SNVs and gene expression as inputs. We then merged the eeSNVs and weights inferred for each model into a final model.

Ranking of eeSNVs and genes

SSrGE generates coefficients of eeSNVs for each gene, as a metric for their contributions to the gene expression. The score of an eeSNV is given by the sum of its weights over all genes:

$${\mathrm {score}}_{\mathrm {eeSNV}} = \mathop {\sum }\limits_ g \left| {W_g^T\left( s \right)} \right|$$

Each gene also receives a score according to its associated eeSNVs:

$${\mathrm {score}}_{{\mathrm {gene}}_ {g}} = \mathop {\sum }\limits_{{\mathrm {eeSNV}} \in {\mathrm {gene}}_{g}} {\mathrm {score}}_{{\mathrm {eeSNV}}}$$

In practice, we first obtained eeSNVs using a minimum filtering of α = 0.1, before using these two scores above to rank eeSNVs and the genes.

Ranking of eeSNVs and genes for a subpopulation

For a given single-cell subpopulation p, an eeSNV is defined as specific to the subpopulation p when it has a significantly higher frequency in p than in any other subpopulation. For each eeSNV we took only the subset of cells expressing the gene g associated with the eeSNV. We then computed the Fisher’s exact test to compare the presence of the eeSNV between single-cells inside and outside p. We considered an eeSNV significant for p-value < 0.05. p′ designates the subset of cells from p expressing g. The score of an eeSNV for p is given by:

$${\mathrm {score}}_{{\mathrm {eeSNV}}}^p = \frac{{\left| {\left\{ {{\mathrm {cell}}\,{\mathrm {with}}\,{\mathrm {eeSNV}}|{\mathrm {cell}} \in p\prime } \right\}} \right|}}{{\left| p\prime \right|}}{\mathrm {score}}_{{\mathrm {eeSNV}}}$$

The score of a given gene g for p is thus given by:

$${\mathrm {score}}_{{\mathrm {gene}}_{g}}^p = \mathop {\sum }\limits_{{\mathrm {eeSNV}}\, \in \,{\mathrm {gene}}_{g}} {\mathrm {score}}_{{\mathrm {eeSNV}}}^p$$

To rank eeSNVs from the promoter regions of the RRBS reads in Hou dataset, we applied a similar methodology: we annotated the eeSNVs within 1500 bp upstream of genes’ starting codon regions.

Perturbation-model based simulation to quantitatively assess SSrGE

We first simulated an interaction table which gives an interaction score (between −1 and 1) for each gene-SNV pair, denoted as interaction g v for gene g and SNV v. This interaction score follows a mixed normal distribution, with two normal distribution components. These two distributions are named the inhibiting component (centered at −0.5), and the enhancing component (centered at 0.5). The type of interaction determines the weighting factor. We define a cis-interaction if the SNV is located within the gene and a trans-interaction otherwise. A cis-interaction has equal weights on the inhibiting and enhancing components. A trans-interaction has significantly larger weight on the inhibiting component, as previous studies found cis interaction is more likely to be inhibitive.

We then simulated the unperturbed expression matrix using Splatter69, with parameters estimated from the Kim dataset. We also simulated the SNV matrix using random shuffle of the SNV matrix extracted from the same dataset. We then applied the interactions to the unperturbed expression. For a gene g in a sample s, the expression level GEgs was perturbed to GE’gs using the following formula:

$${\mathrm{GE}}_{{gs}}^\prime = {\mathrm{GE}}_{{gs}}{\mathrm{exp}}^{\left( {\mathop {\sum }\limits_v {\mathrm{snv}}_v.{\mathrm{interaction}}_{{gv}}} \right)}$$

This created the perturbed expression matrix. Then, we added drop-outs to the perturbed expression matrix using different strategies, in order to create the final observed expression matrix. To mimic dropouts, we introduced two variables: one for dropout bias (either bias toward lower expressed genes or randomly), and the other for dropout rate dependency (cell-based, gene-based, or read-based). We did not consider dropout bias towards highly expressed genes, as it contradicts observations documented earlier70. Random dropout (no bias) removes the dependency on cell, gene, or reads. We also added random noises to the SNV matrix (following a Bernoulli distribution) to generate the final observed SNV matrix.

Pseudo-time ordering reconstruction

To estimate the trajectory of cell evolvement, we adopted the following procedure, motivated by the method described earlier39. We first constructed the following distance matrix to reflect the Pearson’s correlation between each pair of cells: Di,j = 1−Correlation(Samplei,Samplej)

Subpopulation clustering algorithms

We combined two-dimension reduction algorithms: PCA30 and FA31 with two popular clustering approaches: the K-means algorithm71 and agglomerative hierarchical clustering (agglo) with WARD linkage32. We also used SIMLR, a recent algorithm specifically tailored to cluster and visualize scRNA-seq data, which learns the similarity matrix from subpopulations33. Similar to the original SIMLR study, we used the embedding of the cells produced by the algorithm to apply K-means algorithm.

PCA and FA were performed using their corresponding implementation in Scikit-Learn (sklearn)72. For PCA, FA, and SIMLR, we used various input dimensions D [2, 3, 5, 10, 15, 20, 25, 30] to project the data. To cluster the data with K-means or the hierarchical agglomerative procedure, we used a different cluster numbers N (2 to 80) to obtain the best clustering results from each dataset. We computed accuracy metrics for each (D, N) pair and chose the combination that gives the overall best score. Between the two clustering methods, K-means was the implementation of sklearn package with the default parameter, and hierarchical clustering was done by the AgglomerativeClustering implementation of sklearn, using WARD linkage.

Validation metrics

To assess the accuracy of the obtained clusters, we used three metrics: Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI) and V-measure34,35. These metrics compare the obtained clusters C to some reference classes K and generate scores between 0 and 1 for AMI and V-measure, and between −1 and 1 for ARI. A score of 1 means perfect match between the obtained clusters and the reference classes. For ARI, a score below 0 indicates a random clustering.

AMI normalizes Mutual Information (MI) against chances34. The Mutual Information between two sets of classes C and K is equal to: \({\mathrm{MI}}({{C}},{{K}}) = \mathop {\sum }_{i = 1}^{|{{C}}|} \mathop {\sum }_{{{j}} = 1}^{|{{K}}|} {P}\left( {{{i}},{{j}}} \right){\mathrm{log}}\left( {\frac{{{P}\left( {{{i}},{{j}}} \right)}}{{{P}\left( {{i}} \right){P}\prime \left( {{j}} \right)}}} \right)\), where P(i) is the probability that an object from C belongs to the class i, P′(j) is the probability that an object from K belongs to class j, and P(i, j) is the probability that an object are in both class i and j. AMI is equal to: \({\mathrm{AMI}}(C,K) = \frac{{{\mathrm{MI}}(C,K) - {E}({\mathrm{MI}}(C,K))}}{{\max \left( {\{ H\left( C \right),H(K)\} } \right) - {E}({\mathrm{MI}}(C,K))}}\), where H(C) and H(K) designates the entropy of C and K.

Similar to AMI, ARI normalizes RI against random chances: \({\mathrm{ARI}} = \frac{{{\mathrm{RI}} - {E}({\mathrm{RI}})}}{{\max \left( {{\mathrm{RI}}} \right) - {E}({\mathrm{RI}})}}\)34. Rand Index (RI) was computed by: \({\mathrm{RI}} = \frac{{{a} + {b}}}{{{C}_2^{{n}_{{\mathrm{sample}}}}}}\), where a is the number of con-concordant sample pairs in obtained clusters C and reference classes K, whereas b is the number of dis-concordant samples.

V-measure, similar to F-measure, calculates the harmonic mean between homogeneity and completeness. Homogeneity is defined as \(1 - \frac{{H(C|K)}}{{H(C)}}\), where H(C|K) is the conditional entropy of C given K. Completeness is the symmetrical of homogeneity: \(1 - \frac{{H(K|C)}}{{H(K)}}\).

Graph visualization

The different datasets were transformed into GraphML files with Python scripts using iGraph library73. Graphs were visualized using GePhi software36 and spatialized using ForceAtlas274, a specific graph layout implemented into the GePhi software.

Pathway enrichment analysis

We used the KEGG pathway database to identify pathways related to specific genes75. We first selected the top 100 genes for each dataset, according to the ranks given by SSrGE. We then selected genes scored with significant eeSNVs for the metastasis cells from Kim dataset. We used DAVID 6.8 functional annotation tool to identify significant pathways amongst these genes38. We used the default significance value (adjusted p-value threshold of 0.10). Significant pathways are represented as a bipartite graph using Gephi: Nodes are either genes or pathway and the size of each nodes represent the score of the genes or, in the case of pathways, the sum of the scores of the genes linked to the pathways. We used the same methodology to infer significant pathways of cancer cells, compared to normal cells, from Hou dataset. However, we used all the genes ranked rather than only the significant genes, since only few genes are found to be significant for cancer cells.

Code availability

The SNV calling pipeline and SSrGE are available through the following GitHub project:

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.