Abstract
Revealing the heterogeneity among tissues is the greatest advantage of single-cell sequencing. Marker genes are not only the key to correctly identifying cell types but also serve as biomarkers for cell status under given experimental conditions. Current analysis tools such as Seurat and Monocle employ algorithms that compare one cluster with all the rest and select markers according to statistical tests. This pattern entails redundant calculations and thus results in low computational efficiency, specificity, and accuracy. To address these issues, we introduce starTracer, a novel algorithm designed to enhance the efficiency, specificity, and accuracy of marker gene identification in single-cell RNA-seq data analysis. starTracer operates as an independent pipeline and offers great flexibility by accepting multiple input file types. The primary output is a marker matrix in which genes are sorted by their potential to function as markers, with those exhibiting the greatest potential positioned at the top. Across three independent datasets, starTracer is 2–3 orders of magnitude faster than Seurat, and it shows a lower false positive rate on a simulated testing dataset with ground truth. It is worth noting that the speed improvement grows with larger data volumes. starTracer also excels in identifying markers in smaller clusters. These advantages establish starTracer as an important tool for single-cell RNA-seq data, combining robust accuracy with exceptional speed.
Introduction
Single-cell/nucleus RNA sequencing (sc/snRNA-seq) reveals the heterogeneity in cell types, with different clusters identifiable through various methods. Annotation of clusters requires marker genes, which serve as identifiers that exhibit highly specific expression patterns to differentiate clusters1,2. This facilitates the integration of single-cell sequencing data with subsequent experiments, such as fluorescence-activated cell sorting (FACS).
As sc/snRNA-seq technology is applied in an increasing number of studies, reducing computation energy and time is a sustainability issue. Meanwhile, the requirements for computational resources and time costs increase in tandem with the growing number of cells generated in single-cell sequencing studies3. Consequently, data analysis becomes increasingly inefficient, posing challenges in effectively managing the flexibility, scalability, and efficiency of single-cell sequencing workflows.
Seurat is widely acknowledged as one of the most robust tools for single-cell data analysis4. Marker gene identification in Seurat predominantly relies on the “FindAllMarkers” function. The underlying algorithm for marker gene identification, shared by other toolkits such as Scanpy5 and Monocle6, is based on the principle that a marker gene for a specific cluster exhibits significant up-regulation compared to the remaining clusters. However, this algorithm causes a “dilution” issue, which arises when a cluster with high expression is pooled with the lower expression of the majority of clusters, decreasing accuracy given that an ideal marker gene should be unique to a single cluster. Furthermore, calculating genes for each cluster can result in redundant computations, produce superfluous information, and consume significant computational resources. This creates an efficiency bottleneck, particularly in experiments involving larger cell populations and complex annotations, and warrants further improvement. In addition, identifying marker genes for clusters with low cell counts can be challenging.
Indeed, taking into account the mean expression values and standard deviations could be regarded as a solution. Performing significance tests over multiple groups, or an analysis of variance with post-hoc tests, may also offer additional information to avoid the “dilution” issue. However, these methods all require additional computational resources.
To address this issue, marker gene searching strategies should refrain from aggregating the remaining clusters into one entity and should consider the expression values in each cluster instead of performing significance tests, so as to maintain high efficiency and accuracy.
We developed the R package starTracer to accurately find marker genes with high specificity and high efficiency. This package is specifically designed to integrate seamlessly with the widely used single-cell analysis tool Seurat and provide valuable insights into marker genes in single-cell data.
starTracer provides multiple flexible parameters. Users can pass a sparse single-cell expression matrix, a Seurat object, or an average expression matrix of each cell type to the “searchMarker” function. In Seurat, a gene may be considered a marker in multiple clusters. starTracer instead offers researchers a marker gene list in which each gene is exclusively associated with one cluster, based on the number of marker genes specified for each cluster by the researcher. starTracer also offers an option to select marker genes specifically among highly variable genes, thus reducing interference from genes with low variation. In addition, a threshold parameter can be set to limit the lowest expression level of the marker genes, allowing marker genes to be identified at various expression and specificity levels.
Additionally, for users who already have the output marker gene matrix from Seurat, we offer another module, “filterMarker”, that re-arranges and re-assigns marker genes based on their specificity level to optimize the results obtained from Seurat.
Overall, starTracer is an open-source R software package that is readily available for users to efficiently identify potential marker genes in single-cell sequencing data.
Results
Accurate identification of marker genes by starTracer from human prefrontal cortex, heart, and mouse kidney data
We assessed the performance of starTracer using three publicly available datasets: adult samples from the human prefrontal cortex (24,564 cells)7, human left ventricle (592,689 cells)8, and control samples from mouse kidneys (16,119 cells)9 (Fig. 1A–C). Under the default settings of starTracer’s “searchMarker” module, which include only highly variable genes, starTracer accurately identified a number of typical marker genes documented in published research articles or databases. For instance, in the human PFC data, starTracer identified ETNPPL and SLC14A1 as prominent marker genes for astrocyte cells, CUX210 for L2–3 CUX2 neurons, and CNDP1 for oligodendrocyte cells, all ranking at the top of its list (Fig. 1D) and consistent with previous studies11,12,13. In the human heart data, we identified NRXN1, CD163, and MYH11 as marker genes for cardiac neurons14, macrophages15, and vascular-associated smooth muscle cells16, respectively (Fig. 1E). In the mouse kidney data, Slc12a3 and Tie1 were identified as marker genes for distal convoluted tubule cells17 and glomerulus epithelial cells18, respectively (Fig. 1F). When the option was adjusted to include all detected genes in the analysis, starTracer successfully identified canonical markers for nearly all inhibitory neuron subtypes (Supplementary Fig. 3E–G). Importantly, starTracer not only accurately identifies marker genes but also effectively minimizes noise in the marker gene dataset (Supplementary Fig. 3F, G). All the marker genes found by “searchMarker” that are consistent with previous studies are recorded (Supplementary Tables 1–3). These findings collectively highlight starTracer’s accuracy in identifying marker genes.
To provide further insight into the functional significance of markers identified by starTracer and Seurat, Gene Ontology (GO) functional annotation was conducted using the top 50 markers (Supplementary Figs. 4–6). Upon comparison with the bubble plots depicting marker genes, we observed that clusters sharing markers identified by Seurat also shared functional annotation terms, whereas with starTracer, annotation terms were assigned more accurately to the corresponding cluster. For example, in the human heart data (Supplementary Fig. 4), the term “actin binding” was annotated uniquely to cardiac muscle cells when using the markers found by starTracer (left panel) and was absent from vascular-associated smooth muscle cells and pericyte cells, in contrast to Seurat (right panel). This discrepancy may stem from starTracer’s ability to identify more precise genes. Thus, by excluding shared marker genes, starTracer demonstrated its capacity to discern more accurate annotation terms for clusters.
Enhanced specificity of marker gene identification with starTracer
To assess the specificity of identified marker genes, we re-analyzed the three datasets. The “searchMarker” module from starTracer was set to detect five marker genes for each cluster. We gauged specificity using the metric \({T}_{i}\) and employed fold-\({T}_{i}\) to determine specificity differences. \({T}_{i}\) is defined as the proportion of cells expressing gene i that originate from the target cluster (refer to the “Methods” section for details, Supplementary Fig. 1G). Higher specificity levels could be observed for all cell types in the human PFC data, especially for interneurons (Vas, VIP, PV_SCUBE3, SST, LAMP5_NOS1, and PV) (Fig. 1G). In the other two datasets, we also noticed an increased specificity level for all clusters (Supplementary Fig. 3A, B). When comparing the results from starTracer’s “searchMarker” module to Seurat’s “FindAllMarkers”, we observed significant reductions in background noise across all datasets (Fig. 1D–F and Supplementary Fig. 3C–E).
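As an illustration of the specificity metric, the following minimal sketch computes \({T}_{i}\) for one gene from per-cell values and cluster labels. It is not the package's code: the function name `specificity_t` is hypothetical, and treating a cell as "expressing" the gene when its value is greater than zero is an assumption made here.

```python
def specificity_t(expr, labels, target):
    """Fraction of gene-expressing cells that belong to `target`.

    expr   : per-cell expression values of one gene
    labels : cluster label of each cell, in the same order as `expr`
    Assumption: a cell "expresses" the gene when its value is > 0.
    """
    expressing = [lab for value, lab in zip(expr, labels) if value > 0]
    if not expressing:
        return 0.0  # no expressing cells: define the specificity as 0
    return sum(1 for lab in expressing if lab == target) / len(expressing)


# Toy data: four cells express the gene, two of them are SST neurons.
expr = [3, 0, 5, 1, 0, 2]
labels = ["SST", "SST", "SST", "VIP", "VIP", "PV"]
t_sst = specificity_t(expr, labels, "SST")
print(t_sst)
```

A value close to 1 means that almost every cell expressing the gene comes from the target cluster, which is the behavior expected of a highly specific marker.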
starTracer increased calculation speed by 2–3 orders of magnitude compared to Seurat’s “FindAllMarkers”
To assess the computational cost of starTracer, we conducted a runtime comparison with Seurat’s “FindAllMarkers”. In the human PFC sample, “FindAllMarkers” required an average of 562.86 s, whereas starTracer completed its task in 3.03 s when importing HVGs and in 3.33 s when importing all detected genes, a 185.76- and 169.03-fold time saving, respectively (Fig. 1H). Similarly, in the human heart sample, “FindAllMarkers” averaged 381.28 min compared to starTracer’s 0.65 and 0.66 min, a 586.59- and 577.70-fold improvement (Fig. 1I). In the mouse kidney sample, starTracer finished in an average of 1.19 and 2.52 s for HVGs and all detected genes, whereas “FindAllMarkers” took 45.34 s, a 38.10- and 17.92-fold improvement (Fig. 1J). Collectively, these results show that starTracer requires less computation time for marker gene detection.
To further assess the influence of sample size on the computational time of both starTracer and Seurat, we created a series of subsets of the left ventricle scRNA-seq dataset. These subsets ranged in size from 10% to 100% in 10% increments. Cells for each subset were selected without replacement using a uniform sampling strategy to retain the original cell composition. The runtime measurement was repeated three times to ensure robustness. Notably, as the sample size increased, the difference in runtime between starTracer and Seurat became more pronounced, with starTracer consistently outperforming Seurat (Fig. 1K). This trend underscores starTracer’s distinct advantage when dealing with larger cell datasets.
Seurat itself offers parameters such as min.pct and fold-change to expedite calculations, which can enhance the identification of precise marker genes. To demonstrate Seurat’s calculation speed across various parameters, we first used starTracer to identify 50 markers with sufficient specificity and accuracy. The mean fold-change of these markers was determined and used as the input value for Seurat’s “FindAllMarkers” function. Subsequently, min.pct was systematically varied from 10% to 90%. Our analysis revealed that when the fold-change was set equal to that determined by starTracer, the calculation speed of Seurat indeed increased. Moreover, as min.pct increased, the calculation speed further improved. However, even at a min.pct of 90%, Seurat remained slower than starTracer (Fig. 1L–N). It is worth noting that the fold-change used for Seurat in this experiment is a post hoc value: in a real analysis, researchers would not know a suitable threshold until the calculation is completed.
We expanded our investigation to include multiple methods for calculating marker genes, such as Wilcoxon, bimod, ROC, t-test, negbinom, Poisson, LR, MAST, DESeq2, and the R package Monocle 3, in order to assess calculation speed. The fold-change was standardized to match the mean value of starTracer’s results. Notably, negbinom exhibited the longest computation time, while the default method, Wilcoxon, ranked fifth among all methods from shortest to longest (Supplementary Fig. 6A). In the case of Monocle 3, the number of cores used for calculation influenced computation speed. Surprisingly, we observed that as more cores were used, the calculation time slightly increased (Supplementary Fig. 6B). This phenomenon may be attributed to the overhead of parallel computation. The marker genes identified by Monocle 3 are presented in Supplementary Fig. 6C. Overall, starTracer outperformed all included methods in terms of calculation speed in identifying marker genes.
Presto, a C++-based algorithm, accelerates the Wilcoxon and auROC analyses in Seurat’s “FindAllMarkers” while preserving the same results. This enhancement significantly boosts the computational efficiency of Seurat. Our evaluation encompassed the same human prefrontal cortex (PFC) data, human heart data, and mouse kidney data. For the human PFC data, with highly variable genes (HVGs) included in starTracer, the calculation speed still differed significantly between starTracer and Presto (Supplementary Fig. 6D): starTracer outperformed Presto by a 2.49-fold reduction in calculation time. For the human heart data, starTracer also outperformed Presto by a 3.91-fold reduction in calculation time (Supplementary Fig. 7E). Similarly, for the mouse kidney data, starTracer surpassed Presto’s performance by a 1.57-fold reduction in calculation time (Supplementary Fig. 7F). In summary, while Presto significantly enhances calculation speed by leveraging C++, starTracer demonstrates better performance on large-scale datasets.
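One way to draw subsets that retain the original cell composition is to sample the same fraction from every cluster without replacement. The sketch below illustrates this idea only; the exact procedure used in the study is not specified in this section, and the function name `stratified_subsample`, the fixed seed, and the toy cluster labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted cell indices, keeping `fraction` of each cluster.

    Cells are drawn without replacement, cluster by cluster, so the
    subset keeps (up to rounding) the original cluster proportions.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)

    picked = []
    for idxs in by_cluster.values():
        n_keep = max(1, round(len(idxs) * fraction))  # keep at least one cell
        picked.extend(rng.sample(idxs, n_keep))
    return sorted(picked)


# Toy annotation: 60 / 30 / 10 cells in three clusters.
labels = ["CM"] * 60 + ["FB"] * 30 + ["EC"] * 10
subset = stratified_subsample(labels, 0.5)
print(len(subset))
```

Calling the function with fractions 0.1, 0.2, and so on up to 1.0 would produce a series of nested-size subsets comparable to the one described above.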
Evaluating false positive rate from simulated single-cell-sequencing data with ground-truth
Splatter is a widely used tool that generates simple, reproducible, and well-documented simulations of scRNA-seq data for systematically evaluating the performance of different tools19. To comprehensively evaluate the false positive rate of starTracer and Seurat, we generated single-cell sequencing data comprising 1000 simulated cells and 5 clusters using Splatter (Fig. 2A). Fifty marker genes were defined as ground truth (refer to the “Methods” section for details) according to a selection strategy20. The top 50 marker genes were found by starTracer and Seurat according to the same criteria (Supplementary Fig. 8). A gene is defined as a true positive marker if it appears in both the selected and the ground-truth list, while a false positive marker is a gene that is selected by starTracer/Seurat but not included in the 50 ground-truth marker genes. FPR is defined as the ratio of the number of false positive markers to the number of all positive markers (refer to the “Methods” section for details). We calculated the FPR for the 5 clusters (Fig. 2B). The results show that, in all clusters, the FPR of starTracer is consistently lower than that of Seurat. starTracer and Seurat each give a ranked marker gene list.
To illustrate the true positive rate across varying numbers of top N markers, we intersected the top N marker genes from these two tools with the ground-truth marker gene sets, ranging from the top 1 to the top 50 markers (Fig. 2C). As N increased, starTracer consistently identified more genes that intersected with the ground-truth marker genes across all five clusters.
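The FPR described above can be sketched in a few lines. This is an illustration rather than the evaluation script used in the study; the function name `false_positive_rate` and the toy gene names are hypothetical, and "all positive markers" is interpreted here as all markers selected by the tool (true plus false positives).

```python
def false_positive_rate(selected, ground_truth):
    """False positives divided by all selected (positive) markers.

    Assumption: "all positive markers" means every gene the tool
    selected, i.e. true positives plus false positives.
    """
    selected = set(selected)
    truth = set(ground_truth)
    if not selected:
        return 0.0
    false_positives = selected - truth  # selected but not in the ground truth
    return len(false_positives) / len(selected)


# Toy example: five selected genes, two of which are not ground-truth markers.
truth = {"g1", "g2", "g3", "g4"}
picked = ["g1", "g2", "g7", "g9", "g3"]
fpr = false_positive_rate(picked, truth)
print(fpr)
```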
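The top-N comparison amounts to counting, for each N, how many of the first N ranked genes are ground-truth markers. A minimal sketch is given below; the function name `top_n_overlap` and the toy ranking are hypothetical.

```python
def top_n_overlap(ranked, ground_truth, max_n):
    """For N = 1..max_n, count ground-truth genes among the top N ranked genes."""
    truth = set(ground_truth)
    hits = 0
    curve = []
    for gene in ranked[:max_n]:
        if gene in truth:
            hits += 1
        curve.append(hits)  # cumulative number of true positives at this N
    return curve


# Toy ranking from one tool, compared against three ground-truth markers.
ranked = ["g1", "g7", "g2", "g3", "g9"]
curve = top_n_overlap(ranked, {"g1", "g2", "g3"}, 5)
print(curve)
```

Plotting one such curve per tool and per cluster gives a comparison of the kind described for Fig. 2C.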
Finding marker genes from different annotation levels using starTracer
Researchers tend to annotate cells at various cluster levels3,21 to illustrate differences in cell components and gene expression changes across these levels. To investigate the performance of starTracer at different cluster levels, we reassessed the human PFC data7 using multiple annotation levels fetched from the Single Cell Portal. These levels include (1) Bio_clust: principal neurons (PN), interneurons (IN), astrocytes (Astro), microglia (Micro), oligodendroglia (Oligo), and oligodendroglia progenitor cells (OPC) (Supplementary Fig. 9A). (2) Major_clust: PN is further divided into L2/3 ET-1, etc., while IN is divided into SST, VIP, PVALB, SNCG, among others1 (Fig. 1A). (3) Sub_clust: each of the aforementioned clusters is further delineated into more detailed clusters, totaling 66 subgroups (Fig. 3A). In our examination of marker genes at each level based on this classification scheme, we noticed that marker genes identified by starTracer consistently exhibited elevated specificity for all clusters (Supplementary Fig. 9B).
For the Bio_clust level, starTracer successfully identified GLI3 as an astrocyte marker22 and CNDP1 and SLC5A11 as oligodendroglia markers13,23 (Supplementary Fig. 9C, D). GAD2 is widely recognized as a canonical marker for inhibitory neurons in single-cell analysis3,21,24, COL9A1 and FERMT1 are well-known OPC marker genes25,26, CBLN2 is known to be expressed in principal neurons27, and FGD2 is highly expressed in microglia28. Both Seurat and starTracer are able to identify marker genes for each cluster, but starTracer outperforms Seurat by identifying marker genes with high specificity (Supplementary Fig. 9C, D).
At the Major_clust level, consistent with the marker genes from the Bio_clust level, we found SLC5A11 and CNDP1 as oligodendroglia markers, COL9A1 and FERMT1 as OPC markers, and FGD2 as a microglia marker (Fig. 1D). The rest of the markers are shown in the previous results.
For Sub_clust, although having 66 subgroups makes the differences between clusters subtle (Fig. 3A), starTracer provided a noise-reduced marker gene set compared with “FindAllMarkers” (Fig. 3B, C). Meanwhile, all clusters exhibited a higher specificity level as well (Fig. 3D).
Overall, starTracer demonstrates a robust capability to identify marker genes across different levels of annotations.
Marker gene identification from rare cell types using starTracer
To showcase starTracer’s capability to identify marker genes even within rare cell types, we used the human prefrontal cortex (PFC) data, focusing particularly on inhibitory neurons, where starTracer exhibits notable enhancement. We generated nine sub-samples from the original dataset, with the inhibitory neurons reduced stepwise from 100% to 10% while maintaining the original cell counts for all the other cell types (Fig. 4A). Despite the reduction in sample size, starTracer successfully identified a substantial portion of the original marker genes, with approximately half retained even at 10% sample size (Fig. 4B). To delve deeper into its ability to preserve top markers, we examined the intersection using only the top five markers (Fig. 4B). Remarkably, for the five identified “rare cell types”, starTracer effectively retains specific markers: SST remains the top marker gene for SST neurons, PVALB and NPNT are identified as the top two marker genes for PV_SCUBE3 neurons, VIP emerges as the top marker for VIP neurons (ranked first at 100% sample and second at 10% sample), and NDNF persists as the top marker gene for ID2 neurons down to a 20% sample size. Notably, the top marker genes identified by starTracer also exhibit high specificity even at a 10% sample size (Fig. 4C).
Parameter influences: starTracer provides insights into the negative correlation between the expression level and the specificity level of marker genes
starTracer gives users the flexibility to adjust \({S}_{2}\), representing the lower limit of gene expression level. To investigate the impact of varying expression levels on marker gene identification, we generated a series of marker genes based on different expression values. The “searchMarker” module in starTracer enables users to select an \({S}_{2}\) value from 0 to 1, which filters out the lowest \({S}_{2}\) percentile of genes for each cluster (refer to the “Methods” section for details). Here, by analyzing the subset of principal neurons from the adult human prefrontal cortex (Fig. 5A), we demonstrate how users can find optimal marker genes based on their specified minimum expression levels.
We varied \({S}_{2}\) from 0 to 0.8 in 0.2 increments, corresponding to filtering out 0% to 80% of the least expressed genes after gene allocation to each cluster. Using “searchMarker”, we determined marker genes for each \({S}_{2}\) value and investigated \({E}_{1}^{{ij}}\) and \({T}_{i}\) across \({S}_{2}\) values. \({T}_{i}\) of the top 5 marker genes decreased as \({S}_{2}\) increased. Meanwhile, as expected, the gene expression level showed a positive correlation with \({S}_{2}\) (Fig. 5B). These findings highlight an inverse relationship between specificity and marker gene expression level: as specificity demands rise, the expression level drops, and vice versa. This underscores the versatility of starTracer, allowing users to customize marker gene selection based on their specificity or expression level criteria.
Bubble plots showcased marker genes under \({S}_{2}\) values of 0, 0.4, and 0.8. A discernible increase in background noise was evident as \({S}_{2}\) values climbed, pointing to a decrease in the specificity level (Fig. 5C–E).
“filterMarker”: Identify marker genes with higher specificity based on the results of Seurat operation
To test the impact of “filterMarker”, we revisited data from the human prefrontal cortex, human left ventricle, and mouse kidney. We observed a marked increase in \({T}_{i}\) after the application of “filterMarker” (Fig. 6A–C) and reduced background noise (Fig. 6D, E). Notably, as observed in the “searchMarker” analysis, interneurons displayed significantly higher specificity levels. “filterMarker” retains all genes from “FindAllMarkers”. However, it identifies the most appropriate marker genes and allocates them to their corresponding cluster (refer to the “Methods” for details). This fine-tuning of Seurat’s output matrix aids in pinpointing marker genes with enhanced specificity levels.
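The effect of \({S}_{2}\) can be pictured as a percentile cut applied to the genes already allocated to a cluster. The sketch below is a simplified illustration, not the package's implementation: the function name `filter_low_expression` is hypothetical, and the choice of expression value used for ranking and the rounding of the cut-off are assumptions.

```python
def filter_low_expression(gene_expr, s2):
    """Drop the lowest `s2` fraction of genes allocated to one cluster.

    gene_expr : dict mapping gene name -> expression level in that cluster
    s2        : fraction in [0, 1) of the least expressed genes to remove
    Assumption: the number of dropped genes is rounded down.
    """
    if not 0 <= s2 < 1:
        raise ValueError("s2 must be in [0, 1)")
    ranked = sorted(gene_expr, key=gene_expr.get)  # ascending expression
    n_drop = int(len(ranked) * s2)
    return ranked[n_drop:]


# Toy cluster with four allocated genes; s2 = 0.5 removes the two lowest.
genes = {"a": 0.1, "b": 2.0, "c": 0.5, "d": 3.0}
kept = filter_low_expression(genes, 0.5)
print(kept)
```

Raising `s2` leaves only the more highly expressed candidates, which matches the reported trade-off: expression rises while specificity tends to fall.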
Discussion
Accurate identification of marker genes is a critical outcome of single-cell analysis, underpinning all downstream functional analyses. In this study, we introduce starTracer, an R package designed to swiftly and precisely identify high-specificity marker genes, offering researchers a customizable approach. starTracer comprises two core modules: “searchMarker”, an independent pipeline, and “filterMarker”, which works on the output of tools like Seurat. To optimize efficiency, “searchMarker” avoids redundant calculations by pre-allocating genes to target clusters. We devised a novel metric, MI, within the “searchMarker” module to efficiently gauge a gene’s potential to serve as a marker. To overcome the “dilution issue”, we utilized the threshold \({S}_{2}\) to divide each gene into two sets and performed the calculations separately. \({T}_{i}\) objectively quantifies the proportion of cells originating from the target cluster; it can be used as a metric to evaluate the performance of marker genes and thus serves as the basis of a complementary module in starTracer that refines marker genes found by Seurat. Overall, starTracer can rapidly identify marker genes with high specificity, even in samples with a large number of cells or in clusters with a small number of cells.
starTracer relies on pre-clustered data. We have also demonstrated a negative correlation between specificity levels and expression levels. Thus, as an option, \({S}_{2}\) allows users to identify ideal marker genes based on expression levels and to create a clustering tree according to marker gene expression, which has the potential to refine cell-type annotations based on marker gene expression patterns, given a specific \({S}_{2}\) value. By default, “searchMarker” considers highly variable genes as candidates for marker gene identification, optimizing computational efficiency. However, users can freely input all detected genes for marker identification, and we recommend using highly variable genes for the initial run and exploring all genes if the results are unsatisfactory. Additionally, starTracer provides built-in functions for visualizing the expression of identified marker genes.
Currently, the bottleneck in the “searchMarker” process still lies in the handling of sparse matrices for single-cell data processing: we continue to rely on established, stable computational methods for processing sparse matrices. In future versions, if better algorithms specifically designed for sparse matrices emerge, or if other researchers in the community contribute to “searchMarker”, its computational speed will be further enhanced. On the other hand, starTracer is written in the interpreted language R, which limits further speed improvement. For example, Presto achieves computation speeds tens of times faster by using a lower-level language (C++)29. In this study, we did not use C++ for package development. Various solutions are available, such as using Rcpp to accelerate computations in R packages30, which could be adopted for further optimization of starTracer.
We applied starTracer to multiple tissues and organs from both humans and mice, and it consistently demonstrated excellent performance. Particularly noteworthy was its enhanced accuracy when applied to sequencing data from the human prefrontal cortex compared to traditional approaches. This may stem from the relatively minor differences between cortical neurons, especially within inhibitory neurons31, compared to the differences among the various cell types in the heart and kidney, highlighting starTracer’s potential for application in multi-organ single-cell omics studies. Additionally, there has been increasing emphasis on cell-centric research in recent years32,33,34, advocating for the integration and analysis of cells from different single-cell datasets rather than segregating them into separate datasets. Tools such as UCell35 and Sargent34 have been designed for cell-centric annotation, and starTracer has the potential to provide more information for these methodologies. Furthermore, with the continuous maturation of single-cell sequencing technologies, studies involving millions of cells are becoming increasingly prevalent. In this context, starTracer shows promise for large-scale sequencing workflows involving a vast number of cells.
starTracer seamlessly accepts input in various formats, including Seurat objects, sparse expression matrices with annotation tables, or average expression matrices with features as rows and cells as columns. This versatility extends its potential applications to other high-resolution omic data, such as spatial transcriptomic data24,36,37, single-cell ATAC-seq data, and morphomic data from morphOMICs, due to their shared data structure. Moreover, starTracer can serve as a valuable tool for identifying the most up-regulated genes in bulk RNA-seq experiments with multiple treatments.
In conclusion, starTracer emerges as a powerful and flexible tool, significantly enhancing the efficiency and accuracy of marker gene identification in single-cell analysis. Its potential applications in diverse omic datasets hold great promise for advancing our understanding of cellular heterogeneity and gene regulation in various biological contexts.
Methods
Rationale of starTracer
Single-cell sequencing data includes an expression matrix and an annotation matrix (Fig. 7A). The prevailing strategy for identifying marker genes within a cluster largely depends on comparing the expression levels of a gene in a particular cluster with the corresponding expression levels in the remaining clusters4,5. While generally effective, this approach may not always yield markers with superior specificity. Such a circumstance may arise when facing the “dilution” issue: a gene is relatively highly expressed not only in the target cluster but also in one or more additional clusters. Moreover, since this strategy entails a comparison of one cluster versus the remaining clusters for each cluster in question, there is a considerable degree of redundant computation. This redundancy can render the process of identifying marker genes markedly time-consuming.
starTracer’s “searchMarker” module operates as an independent pipeline that receives a gene expression matrix as input (Fig. 7B). Within starTracer, we employ max-normalization, a commonly used algorithm in machine learning, to extract features from genes while preserving each gene’s expression pattern across clusters. Using a threshold value, starTracer classifies the clusters into a high-expression and a low-expression group for each gene. Subsequently, starTracer calculates the gene’s potential to act as a positive marker in the high-expression group and as a negative marker in the low-expression group. Finally, the gene’s overall potential to serve as a marker gene is determined based on both its positive and negative marker capabilities. Notably, as no comparisons are made during this process, no statistical tests are performed, resulting in heightened efficiency. The output of starTracer includes a list of genes sorted by their potential to function as cluster-specific markers, with genes exhibiting the highest potential positioned at the top of the matrix for each respective cluster.
The package starTracer includes an additional functional module, “filterMarker”, which is designed as a complementary pipeline to Seurat’s “FindAllMarkers” function. It automatically removes redundant results in the output matrix and re-orders the matrix according to \({T}_{i}\), the proportion of cells expressing gene i that belong to the cluster marked by gene i (Fig. 7C). For notations and meanings, please refer to Table 1.
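The “dilution” issue can be made concrete with a toy calculation. The sketch below is not Seurat's implementation; it uses a simplified one-versus-rest log2 fold change on cluster averages, where the pseudocount of 1, the equal weighting of clusters, and the function name `one_vs_rest_log2fc` are all assumptions made for illustration.

```python
import math


def one_vs_rest_log2fc(cluster_means, target):
    """log2 fold change of `target` against the pooled mean of all other clusters.

    Assumptions: clusters are weighted equally and a pseudocount of 1 is used.
    """
    rest = [v for name, v in cluster_means.items() if name != target]
    rest_mean = sum(rest) / len(rest)
    return math.log2((cluster_means[target] + 1) / (rest_mean + 1))


# A gene that is high in clusters A and B and absent elsewhere.
means = {"A": 8.0, "B": 7.0, "C": 0.0, "D": 0.0, "E": 0.0, "F": 0.0}
fc_a = one_vs_rest_log2fc(means, "A")
fc_b = one_vs_rest_log2fc(means, "B")
print(round(fc_a, 2), round(fc_b, 2))
```

Because the second high-expression cluster is averaged together with four silent clusters, the gene passes a typical fold-change cut-off for both A and B, even though it is unique to neither.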
Generating a maximum-scaled average expression matrix
The “searchMarker” module demonstrates great flexibility by accepting a variety of input file types. One essential input is the expression matrix of the single-cell sequencing data. This can be provided in three ways: (i) importing a sparse expression matrix along with an annotation matrix into R; (ii) using the “AverageExpression” function from Seurat for users who already have a Seurat object; and (iii) importing an average expression matrix along with an annotation matrix into R. In all cases, an average expression matrix is generated (Supplementary Fig. 1 Step 1). This adaptability ensures seamless integration with various data processing workflows.
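For input route (i), the average expression matrix can be thought of as a per-cluster mean of each gene. The sketch below uses plain Python lists and a simple arithmetic mean; the function name `average_expression` is hypothetical, and the result may differ from Seurat's “AverageExpression”, which applies its own handling of normalized data.

```python
from collections import defaultdict


def average_expression(expr, labels):
    """Average each gene over the cells of every cluster.

    expr   : list of gene rows, each a list of per-cell values
    labels : cluster label of each cell (column)
    Returns (clusters, avg) with avg[i][j] = mean of gene i in clusters[j].
    """
    clusters = sorted(set(labels))
    columns = defaultdict(list)
    for col, lab in enumerate(labels):
        columns[lab].append(col)

    avg = [
        [sum(row[c] for c in columns[k]) / len(columns[k]) for k in clusters]
        for row in expr
    ]
    return clusters, avg


# Two genes measured in four cells belonging to two clusters.
expr = [[2, 4, 0, 0], [1, 1, 3, 5]]
labels = ["A", "A", "B", "B"]
clusters, avg = average_expression(expr, labels)
print(clusters, avg)
```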
After generating the average expression matrix, the mean expression value of gene i from the average expression matrix will be calculated, which is denoted as \({E}_{1}^{i}\) (Supplementary Fig.Β 1 Step 2):
\({E}_{1}^{i}\) represents the mean expression of gene i in K clusters. \({x}_{{ij}}\) represents the average expression value of gene i in cluster j.
To accentuate the differences in gene expression across samples while preserving the inherent expression characteristics of each gene, such as the maximum and minimum expression values and zero values, we compute the maximum-scaled average expression matrix [38] (hereafter referred to as the max-normalized matrix) for each gene (Supplementary Fig. 1 Step 3):
\(\widetilde{{x}_{{ij}}}=\frac{{x}_{{ij}}}{\max \left(\left\{{x}_{i1},{x}_{i2}\ldots ,{x}_{{iK}}\right\}\right)}.\)
\(K\) represents the total number of clusters.
\(\widetilde{{x}_{{ij}}}\) represents the max-normalized average expression value of gene i in cluster j.
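By normalizing the average expression matrix, the highest expression value for each gene is scaled to 1:
\(\max \left(\left\{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\right\}\right)=1.\)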
This facilitates the identification of the cluster with the maximum expression for each gene. Furthermore, this normalization method allows for the comparison of normalized results across genes within each cluster.
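The averaging and max-normalization steps can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the starTracer implementation: the function names `mean_expression` and `max_normalize`, the toy matrix, and the choice to leave genes with no expression as all-zero rows are hypothetical.

```python
def mean_expression(avg_expr):
    """Return E1 for every gene: the mean of its per-cluster average expression."""
    return [sum(row) / len(row) for row in avg_expr]


def max_normalize(avg_expr):
    """Scale each gene (row) by its own maximum so that its largest entry is 1."""
    normalized = []
    for row in avg_expr:
        row_max = max(row)
        if row_max <= 0:
            # Assumption: a gene with no expression in any cluster stays all-zero.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([value / row_max for value in row])
    return normalized


# Toy average expression matrix: rows are genes, columns are K = 3 clusters.
avg_expr = [
    [4.0, 1.0, 0.0],  # gene 0 peaks in cluster 0
    [0.5, 2.0, 1.0],  # gene 1 peaks in cluster 1
]
e1 = mean_expression(avg_expr)
x_tilde = max_normalize(avg_expr)
```

In this toy case zeros remain zero and each row's maximum becomes exactly 1, which is what allows the cluster of maximum expression to be read directly from the normalized matrix.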
Assigning genes to clusters and generating sub-matrices
For the assignment of genes to their potential target clusters, we employ a selection strategy in which a gene is assigned to a cluster only if it displays its highest average expression value in that cluster. This strategy rests on the premise that a gene can only be a potential marker for a cluster where \(\widetilde{{x}_{{ij}}}=1\). For each cluster, the genes with \(\widetilde{{x}_{{ij}}}=1\) are retained and regarded as potential markers (Supplementary Fig. 1 Step 3). Thus, we define an index set \({M}_{j}\subset \{1,2,\ldots ,L\}\) for cluster j:
\({M}_{j}=\left\{i\,|\,\widetilde{{x}_{{ij}}}=1\right\},\)
where \({M}_{j}\) indexes the rows of the original matrix whose max-normalized value in cluster j equals 1, that is, the genes whose maximum expression value occurs in cluster j.
Let V denote the original max-normalized matrix. We further divide V into sub-matrices \({V}_{j}\), one for each cluster j, according to \({M}_{j}\).
The rows of the sub-matrix \({V}_{j}\) are those indexed by \({M}_{j}\), while the columns are the clusters from 1 to K. By doing so, the algorithm assigns genes to clusters and avoids redundant calculations. We assume that there are \({l}_{j}\) genes in sub-matrix \({V}_{j}\), where \({l}_{j}=|{M}_{j}|\).
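The assignment step can be sketched as follows. This is a hedged illustration rather than the package's code: `assign_genes` and `sub_matrices` are hypothetical names, and the behavior for ties (a gene whose maximum is shared by two clusters is listed under both) is an assumption that the text does not address.

```python
def assign_genes(x_tilde):
    """Return {cluster j: [gene indices i with x_tilde[i][j] == 1]} (the sets M_j)."""
    n_clusters = len(x_tilde[0])
    index_sets = {j: [] for j in range(n_clusters)}
    for i, row in enumerate(x_tilde):
        for j, value in enumerate(row):
            if value == 1.0:
                index_sets[j].append(i)
    return index_sets


def sub_matrices(x_tilde, index_sets):
    """Return {cluster j: rows of x_tilde indexed by M_j}, keeping all K columns."""
    return {j: [x_tilde[i] for i in genes] for j, genes in index_sets.items()}


# Toy max-normalized matrix: 4 genes (rows) by 3 clusters (columns).
x_tilde = [
    [1.0, 0.25, 0.0],
    [0.25, 1.0, 0.5],
    [0.1, 1.0, 0.9],
    [0.2, 0.3, 1.0],
]
m = assign_genes(x_tilde)
v = sub_matrices(x_tilde, m)
```

Each gene is scored only within the cluster where it peaks, so every later statistic is computed once per gene rather than once per gene and cluster pair.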
Excluding low-expression genes
In the sub-matrix \({V}_{j}\), genes are re-arranged in descending order of \({E}_{1}^{i}\) (Supplementary Fig. 1 Step 5). In the majority of scenarios, genes with rather low expression levels may not be considered optimal candidates for marker genes. To improve computational efficiency by excluding these low-expression genes, we offer an optional threshold, denoted as \({S}_{2}\), which ranges from 0 to 1. It is used to filter out the lowest \({S}_{2}\) fraction of genes in each cluster after re-arranging them by \({E}_{1}^{i}\) (e.g., setting \({S}_{2}\) to 0.1 filters out the genes with the lowest 10% of \({E}_{1}^{i}\)) (Supplementary Fig. 1 Step 6). After filtering, \({S}_{2}\times {l}_{j}\) genes are removed while \((1-{S}_{2})\times {l}_{j}\) genes are retained for further analysis.
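A minimal sketch of this optional filter is given below. The function name `filter_low_expression` is hypothetical, and rounding the number of removed genes down to an integer is an assumption, since the text does not say how a fractional count is handled.

```python
import math


def filter_low_expression(gene_ids, e1, s2):
    """Sort genes by E1 in descending order and drop the lowest s2 fraction."""
    if not 0 <= s2 < 1:
        raise ValueError("s2 must lie in [0, 1)")
    ranked = sorted(gene_ids, key=lambda gene: e1[gene], reverse=True)
    # Assumption: a fractional number of genes to drop is rounded down.
    n_drop = math.floor(s2 * len(ranked))
    return ranked[: len(ranked) - n_drop]


# Toy E1 values for the 8 genes assigned to one cluster (indexed 0..7).
e1 = [5.0, 0.2, 3.0, 0.1, 4.0, 2.0, 1.0, 0.5]
kept = filter_low_expression(list(range(8)), e1, s2=0.25)
```

With `s2=0.25` and eight genes, the two genes with the lowest mean expression are removed and the remaining six stay in descending order of E1.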
Binary classification of clusters and the calculation of \({{\boldsymbol{E}}}_{{\boldsymbol{2}}}^{{\boldsymbol{i}}}\) and \({{\boldsymbol{E}}}_{{\boldsymbol{3}}}^{{\boldsymbol{i}}}\) to avoid the dilution issue
For each gene, researchers can set a parameter \({S}_{1}\), which defaults to 0.5 (Supplementary Fig. 2A), to segregate clusters into high- and low-expression groups.
Gene i is presumed to serve as a positive marker in any cluster j where \(\widetilde{{x}_{{ij}}}\) exceeds \({S}_{1}\) (high-expression group), and as a negative marker where \(\widetilde{{x}_{{ij}}}\) does not exceed \({S}_{1}\) (low-expression group).
Thus, each gene i, that is, each row of \({V}_{j}\), yields two vectors: \({p}_{i}\), which contains the \(\widetilde{{x}_{{ij}}}\) values of the clusters in which gene i has the potential to be a positive marker (Supplementary Fig. 1 Step 7), and \({q}_{i}\), which contains the \(\widetilde{{x}_{{ij}}}\) values of the clusters in which gene i has the potential to be a negative marker.
We then define \({E}_{2}^{i}\) as the mean of the \(\widetilde{{x}_{{ij}}}\) values in \({p}_{i}\), and \({E}_{3}^{i}\) as the mean of the \(\widetilde{{x}_{{ij}}}\) values in \({q}_{i}\).
Consider a scenario in which n clusters have max-normalized values greater than \({S}_{1}\). As established above, \(\max \{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\}=1\). It can be observed that \({E}_{2}^{i}\) and \({E}_{3}^{i}\), which represent the average max-normalized expression of a gene in clusters with values exceeding \({S}_{1}\) and not exceeding \({S}_{1}\), respectively, are negatively correlated with \({\rho }_{i}\) and \({\eta }_{i}\), respectively (Supplementary Fig. 2B, C):
\(\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} < 0\) and \(\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} < 0.\)
The values of \({E}_{2}^{i}\) and \({E}_{3}^{i}\) for each gene are compiled into a matrix for further analyses.
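The binary split and the two group means can be sketched for a single gene as follows. The function name `split_and_average` is hypothetical, and returning `None` for E3 when every cluster falls in the high-expression group is an assumption; the text does not describe that edge case.

```python
def split_and_average(row, s1=0.5):
    """Split one max-normalized row at s1 and return (n, E2, E3)."""
    high = [value for value in row if value > s1]   # p_i: candidate positive-marker clusters
    low = [value for value in row if value <= s1]   # q_i: candidate negative-marker clusters
    if not high:
        raise ValueError("no value exceeds s1; is the row max-normalized and s1 < 1?")
    e2 = sum(high) / len(high)
    # Assumption: E3 is left undefined (None) when no cluster is in the low group.
    e3 = sum(low) / len(low) if low else None
    return len(high), e2, e3


# One gene across K = 4 clusters, with the default threshold S1 = 0.5.
n, e2, e3 = split_and_average([1.0, 0.75, 0.25, 0.0], s1=0.5)
```

Because the means are taken within each group rather than over all clusters, a single highly expressing cluster is not diluted by many low-expressing ones.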
Evaluating genes’ potential as positive and negative markers by \({\boldsymbol{\rho }}\) and \({\boldsymbol{\eta }}\)
While marker genes are typically identified by their up-regulation in specific clusters, an ideal marker gene should also exhibit low expression in the remaining clusters, thereby ensuring its specificity (Supplementary Fig. 2A). We employ \(\rho\) and \(\eta\) to quantify the potential of a gene to serve as a positive or a negative marker, respectively.
We further define that an ideal marker gene should simultaneously exhibit high values for both \(\rho\) and \(\eta\). Note that as \({E}_{2}\) increases, the potential of a gene to be a marker gene decreases (Supplementary Fig. 2B), while as \({E}_{3}\) increases, the potential of a gene to be a negative marker gene also decreases (Supplementary Fig. 2C). So, for each gene i:
\(\frac{\partial {\rho }_{i}}{\partial {E}_{2}^{i}} < 0\) and \(\frac{\partial {\eta }_{i}}{\partial {E}_{3}^{i}} < 0.\)
Measuring a gene’s capability to be a positive or negative marker with the molecular index
To quantify the potential of a gene to serve as a marker, we devised a novel metric termed the molecular index (MI). MI is defined as the difference between the positive molecular index (PMI) and the negative molecular index (NMI), which take \({E}_{2}^{i}\) and \({E}_{3}^{i}\) into account, thereby offering a comprehensive measure of a gene’s marker potential (Supplementary Fig. 2D). The equations are as follows:
\({{\rm{PMI}}}_{i}=\frac{1-{E}_{2}^{i}}{1-{S}_{1}},\quad {{\rm{NMI}}}_{i}=\frac{{E}_{3}^{i}}{{S}_{1}},\quad {{\rm{MI}}}_{i}={{\rm{PMI}}}_{i}-{{\rm{NMI}}}_{i}.\)
In these equations, \(1-{E}_{2}^{i}\) is the difference between the maximum possible value of \({E}_{2}^{i}\), which equals 1, and its actual value. Note that the range of this quantity is determined by \(1-{S}_{1}\):
\(0\le 1-{E}_{2}^{i} < 1-{S}_{1}.\)
In contrast, \({S}_{1}\) is the width of the range of \({E}_{3}^{i}\), which is computed from clusters with values not exceeding \({S}_{1}\) and therefore varies from 0 to \({S}_{1}\):
\(0\le {E}_{3}^{i}\le {S}_{1}.\)
To prevent the ranges of \(1-{E}_{2}^{i}\) and \({E}_{3}^{i}\) from affecting the magnitude of these values for a given n, we normalize \((1-{E}_{2}^{i})\) and \({E}_{3}^{i}\) by dividing them by \((1-{S}_{1})\) and \({S}_{1}\), respectively. It is worth noting that PMI is positively correlated with \({\rho }_{i}\) and NMI is negatively correlated with \({\eta }_{i}\) (Supplementary Fig. 2E, F). MI is then defined as the difference between PMI and NMI. As such, we introduce PMI, NMI, and MI as our metrics.
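As a worked illustration, the three indices can be computed in a few lines of Python. The sketch assumes the forms implied by the normalization described above, namely PMI as \(1-{E}_{2}^{i}\) divided by \(1-{S}_{1}\) and NMI as \({E}_{3}^{i}\) divided by \({S}_{1}\); the function name `molecular_indices` is hypothetical and the code is not taken from the package.

```python
def molecular_indices(e2, e3, s1=0.5):
    """Return (PMI, NMI, MI) for one gene under the assumed normalized forms."""
    if not 0 < s1 < 1:
        raise ValueError("s1 must lie strictly between 0 and 1")
    pmi = (1 - e2) / (1 - s1)  # (1 - E2) rescaled by the width of its range
    nmi = e3 / s1              # E3 rescaled by the width of its range
    return pmi, nmi, pmi - nmi


# Two genes that each exceed S1 in a single cluster (so E2 = 1 and PMI = 0):
clean = molecular_indices(e2=1.0, e3=0.0)   # silent in all other clusters
leaky = molecular_indices(e2=1.0, e3=0.25)  # residual expression elsewhere
```

For a fixed n, the gene with less residual expression in the low-expression group obtains the higher MI, which matches the intended ranking.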
Assessing marker gene potential with \({{\bf{MI}}}_{{\boldsymbol{i}}}\)
Because \({{\rm{MI}}}_{i}\) is defined as the difference between \({{\rm{PMI}}}_{i}\) and \({{\rm{NMI}}}_{i}\), it correlates positively with \({\rho }_{i}\) owing to the following relationships:
\(\frac{\partial {{\rm{MI}}}_{i}}{\partial {E}_{2}^{i}}=-\frac{1}{1-{S}_{1}} < 0\) and \(\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} < 0\), therefore,
\(\frac{\partial {{\rm{MI}}}_{i}}{\partial {\rho }_{i}}=\frac{\partial {{\rm{MI}}}_{i}}{\partial {E}_{2}^{i}}\cdot \frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} > 0.\)
This inequality shows that \({{\rm{MI}}}_{i}\) positively correlates with \({\rho }_{i}\), reflecting the potential of a gene to be a positive marker gene.
Similarly, \({{\rm{MI}}}_{i}\) positively correlates with \({\eta }_{i}\), as indicated by the following inequality, where \(\frac{\partial {{\rm{MI}}}_{i}}{\partial {E}_{3}^{i}}=-\frac{1}{{S}_{1}} < 0\) and \(\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} < 0\), therefore:
\(\frac{\partial {{\rm{MI}}}_{i}}{\partial {\eta }_{i}}=\frac{\partial {{\rm{MI}}}_{i}}{\partial {E}_{3}^{i}}\cdot \frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} > 0.\)
This demonstrates that \({{\rm{MI}}}_{i}\) positively correlates with \({\eta }_{i}\), indicating the potential of a gene to be a negative marker gene. Thus, the potential of a gene to be a marker for a cluster can be evaluated using \({{\rm{MI}}}_{i}\).
Reordering genes in each cluster by n and \({{\rm{MI}}}_{i}\) to find optimal marker genes
The aforementioned statistics for gene i, including \({E}_{1}^{i}\), \({E}_{2}^{i}\), \({E}_{3}^{i}\), \({{\rm{PMI}}}_{i}\), \({{\rm{NMI}}}_{i}\) and \({{\rm{MI}}}_{i}\), are all calculated in the context of a given n. Genes with lower n values should have higher potential to serve as marker genes, as fewer clusters pass the threshold \({S}_{1}\). Therefore, within each cluster, we rearrange the matrix by n and \({{\rm{MI}}}_{i}\) so that genes with lower n and higher \({{\rm{MI}}}_{i}\) are ranked first.
The resulting matrix will be saved as an output of the “searchMarker” module.
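One way to express this ordering is a two-key sort, sketched below. The precedence used here (n first, MI only as a tie-breaker within equal n) is an assumption drawn from the statement that MI is calculated in the context of a given n; the function name `rank_genes` and the record layout are hypothetical.

```python
def rank_genes(stats):
    """Order genes by ascending n, breaking ties by descending MI."""
    return sorted(stats, key=lambda gene: (gene["n"], -gene["mi"]))


# Toy statistics for three genes assigned to the same cluster.
stats = [
    {"gene": "g1", "n": 2, "mi": 0.40},
    {"gene": "g2", "n": 1, "mi": -0.30},
    {"gene": "g3", "n": 1, "mi": -0.05},
]
ranked = [gene["gene"] for gene in rank_genes(stats)]
```

Here `g3` and `g2` precede `g1` because they exceed the threshold in only one cluster, and `g3` precedes `g2` because of its higher MI.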
Measuring specificity with \({T}_{i}\)
To measure the specificity level of potential marker genes, we introduce a statistic denoted as \({T}_{i}\) (Supplementary Fig. 2G), the proportion of cells expressing gene i that are marked by gene i:
\({T}_{i}=\frac{|{G}_{i}\cap {G}_{{\rm{clust}}}|}{|{G}_{i}|},\)
where
\({G}_{i}\) represents the set of cells expressing gene i in silico.
\({G}_{{\rm{clust}}}\) represents the set of cells for which gene i serves as the in silico marker gene.
When marker gene i is used to label cells from a sample in vivo or in vitro, \({T}_{i}\) indicates the extent to which the labeled cells genuinely belong to the cluster defined by marker gene i, and can therefore be used to assess the specificity of marker gene i.
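A set-based sketch of this statistic is shown below. It assumes that \({T}_{i}\) is the share of the cells in \({G}_{i}\) that also lie in \({G}_{{\rm{clust}}}\), in line with the verbal description of the statistic; the function name `specificity`, the cell identifiers, and the value returned when no cell expresses the gene are hypothetical.

```python
def specificity(expressing_cells, cluster_cells):
    """Return the fraction of gene-expressing cells that belong to the marked cluster."""
    expressing = set(expressing_cells)
    if not expressing:
        # Assumption: a gene expressed in no cell is given a specificity of 0.
        return 0.0
    return len(expressing & set(cluster_cells)) / len(expressing)


# Four cells express the gene; three of them belong to the cluster it marks.
t_i = specificity(["c1", "c2", "c3", "c4"], ["c1", "c2", "c3", "c9"])
```

A value of 0.75 means that three out of four cells that would be labeled by the gene truly belong to the target cluster.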
Rationale for selecting negative markers
To find negative markers, the overall process is similar to that for selecting positive markers, with the following adjustments: (1) the original max-normalized matrix is converted to a new matrix in which the expression value of gene i in cluster j is replaced by \(1-\widetilde{{x}_{{ij}}}\), and all subsequent parameters are calculated from this new matrix; (2) an additional threshold \({S}_{3}\) is applied during step 6, where the top \({S}_{3}\) fraction of genes (\({S}_{3}\in [0,1)\)) is filtered out; and (3) in step 8, a new \({{\rm{PMI}}}_{i}\) based on the converted max-normalized matrix is used to arrange the negative markers.
Rationale of “filterMarker”
“filterMarker” is a useful tool for users who already have results from the Seurat “FindAllMarkers” function and want to refine them. It provides an algorithm that re-sorts the matrix and offers a more accurate list of marker genes for each cluster. Working in conjunction with Seurat, “filterMarker” re-arranges the output matrix from Seurat to complement the “FindAllMarkers” function (Fig. 7C).
Users can apply “filterMarker” to sort the matrix from Seurat. The function takes the output matrix from “FindAllMarkers” as input and calculates \({T}_{i}\) from the number of cells in each cluster and the values of “pct.1” and “pct.2” provided in the Seurat matrix. Genes are then assigned to the cluster with the highest expression level, as measured by the average fold-change in log2 scale (“avg_log2FC”). Within each cluster, genes are re-arranged in descending order of \({T}_{i}\).
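The re-sorting logic can be sketched in Python on rows shaped like the “FindAllMarkers” output. This is an illustration under assumptions, not the package's code: the estimate of \({T}_{i}\) as pct.1 times the cluster size, divided by that quantity plus pct.2 times the number of remaining cells, is one plausible reading of the description, and the names `specificity_from_pct` and `filter_markers` are hypothetical.

```python
def specificity_from_pct(pct1, pct2, n_in_cluster, n_total):
    """Estimate T_i from detection rates inside (pct1) and outside (pct2) a cluster."""
    inside = pct1 * n_in_cluster
    outside = pct2 * (n_total - n_in_cluster)
    if inside + outside == 0:
        return 0.0
    return inside / (inside + outside)


def filter_markers(rows, cluster_sizes):
    """Keep each gene in its highest-avg_log2FC cluster, then sort by descending T."""
    n_total = sum(cluster_sizes.values())
    best = {}
    for row in rows:
        kept = best.get(row["gene"])
        if kept is None or row["avg_log2FC"] > kept["avg_log2FC"]:
            best[row["gene"]] = row
    scored = []
    for row in best.values():
        t = specificity_from_pct(
            row["pct.1"], row["pct.2"], cluster_sizes[row["cluster"]], n_total
        )
        scored.append({**row, "T": t})
    return sorted(scored, key=lambda row: (row["cluster"], -row["T"]))


rows = [
    {"gene": "A", "cluster": "c1", "avg_log2FC": 2.0, "pct.1": 0.5, "pct.2": 0.25},
    {"gene": "A", "cluster": "c2", "avg_log2FC": 1.0, "pct.1": 0.5, "pct.2": 0.5},
    {"gene": "B", "cluster": "c1", "avg_log2FC": 1.5, "pct.1": 0.75, "pct.2": 0.0},
]
result = filter_markers(rows, {"c1": 100, "c2": 200})
```

In this toy input the duplicate entry of gene A in cluster c2 is dropped, and gene B is ranked above gene A in cluster c1 because none of its expressing cells lie outside that cluster.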
Processing benchmark and testing data
Processing single-cell/nuclear sequencing data
Sequencing data and metadata for the human heart [8] and mouse kidney [9] were downloaded from the Single Cell Portal (SCP1303, SCP1245). Prefrontal cortex (PFC) data were obtained from the GEO database [39], with the accession ID GSE168408 [7]. R (v4.1.3) was used for the rest of the analysis. We created objects with Seurat (v4.3.0). Single-cell experiment data were normalized and scaled. We identified 3000 highly variable genes (HVG) and performed principal component analysis (PCA). The cumulative standard deviation across principal components was calculated, and the principal component with a cumulative standard deviation >90% and an individual standard deviation <5% was recorded as n1. The difference in standard deviation between neighboring principal components was calculated, and the principal component at which this difference was <0.1% was recorded as n2. The dimensions from the 1st to 1 plus the minimum of n1 and n2 were used for further analyses [40]. We performed uniform manifold approximation and projection (UMAP) using the uwot implementation for visualization.
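The dimension-selection rule can be sketched as follows. The sketch is a literal reading of the description above and the exact rule used in the analysis may differ: standard deviations are expressed as percentages of their total, n1 is taken as the first component meeting both conditions, and n2 as the first component whose drop to the next component is below 0.1 percentage points. The function name `choose_dims` and the toy values are hypothetical.

```python
def choose_dims(stdev):
    """Return (n1, n2, number of dimensions kept) from per-PC standard deviations."""
    total = sum(stdev)
    pct = [value / total * 100 for value in stdev]
    cumulative = []
    running = 0.0
    for value in pct:
        running += value
        cumulative.append(running)
    # n1: first PC (1-based) with cumulative share > 90% and individual share < 5%.
    n1 = next(
        (k + 1 for k, (c, p) in enumerate(zip(cumulative, pct)) if c > 90 and p < 5),
        len(pct),
    )
    # n2: first PC (1-based) whose drop to the following PC is below 0.1 points.
    n2 = next(
        (k + 1 for k in range(len(pct) - 1) if pct[k] - pct[k + 1] < 0.1),
        len(pct),
    )
    # Dimensions 1 .. (1 + min(n1, n2)) are kept.
    return n1, n2, min(n1, n2) + 1


n1, n2, n_dims = choose_dims([50, 30, 15, 3, 1, 1])
```

For the toy input, the fourth component is the first to satisfy the cumulative criterion and the fifth is the first with a negligible drop, so five dimensions are kept.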
Benchmarking and parameter evaluation
We evaluated the running time and specificity level using three independent datasets and a series of 10 samples with a linear increase in cell population from around 50,000 to around 500,000. Tests were performed at different annotation levels, and the influence of \({S}_{2}\) on expression and specificity was evaluated. The specificity improvements achieved by “filterMarker” were also evaluated. To generate rare cell types manually, we used the human PFC data and randomly generated 10 subsets of the inhibitory neurons.
Simulating single-cell sequencing data and false discovery rate
The Splatter package (version 3.18) was used to simulate a ground-truth single-cell RNA-Seq dataset. The simulation parameters were set as follows: de.facLoc = 3; de.facScale = 0.2; group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2). To identify ground-truth top-N marker genes, we followed the method outlined in a previously published benchmarking study [20]. Let i, ranging from 1 to L, denote the gene index, and j, ranging from 1 to K, denote the group index. The variable \({\beta }_{{ij}}\) represents the differential expression (DE) indicator for gene i in group j within the splat model. Here, \({\beta }_{{ij}}\) quantifies the differential expression level of gene i in group j compared to a baseline expression level. If \({\beta }_{{ij}}\) equals 1, gene i exhibits baseline expression in group j. To determine whether a gene is a significant marker for a specific cluster, we use a marker gene score, \({m}_{i}\).
For a given cluster, this score helps evaluate the marker potential of each gene.
The steps for selecting simulated ground-truth marker genes for a specific cluster are:
1. Select genes that have a simulated mean expression ≥0.1;
2. Calculate the marker gene score \({m}_{i}\) for each gene;
3. Rank genes by \({m}_{i}\);
4. Select the top n genes as marker genes.
A gene was classified as a true positive if it appeared in both the selected and ground-truth lists, a false positive if it was only in the selected list, and a false negative if it was only in the ground-truth list. The false positive rate (FPR) was calculated as follows: FPR = false positives/(true positives + false positives).
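The classification and the FPR formula above translate directly into set operations, as in the following sketch. The function name `false_positive_rate`, the gene identifiers, and the value returned for an empty selection are hypothetical.

```python
def false_positive_rate(selected, ground_truth):
    """Count TP, FP and FN and return FPR = FP / (TP + FP)."""
    selected = set(selected)
    ground_truth = set(ground_truth)
    tp = len(selected & ground_truth)   # in both lists
    fp = len(selected - ground_truth)   # only in the selected list
    fn = len(ground_truth - selected)   # only in the ground-truth list
    # Assumption: an empty selection is reported with an FPR of 0.
    rate = fp / (tp + fp) if selected else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "FPR": rate}


summary = false_positive_rate(
    selected=["g1", "g2", "g3", "g4"],
    ground_truth=["g1", "g2", "g5"],
)
```

In this example two of the four selected genes are absent from the ground truth, giving an FPR of 0.5 under the definition used here.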
Data included for testing and benchmarking
We utilized a series of single-cell/nucleus RNA-sequencing datasets to assess the performance of starTracer, focusing on its ability to efficiently identify high-specificity markers compared with “FindAllMarkers” from Seurat. We included three sets of single-cell/nucleus RNA-sequencing data from different species and organs, obtained from the Gene Expression Omnibus [39] and the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The selection of datasets ensured the inclusion of diverse species, organs, and sample sizes, and of both scRNA-seq and snRNA-seq techniques. We included objects created with Seurat [4]; annotations were provided by the authors. We conducted the basic analyses, including finding highly variable genes, normalizing the data, scaling the data, and running PCA and UMAP, on each Seurat object (refer to the “Methods” section for details).
Statistics and reproducibility
Unless otherwise indicated, statistical analyses were performed using the t-test. p values < 0.05 were regarded as statistically significant. Data graphics and statistical analysis were performed using R. No representative results have been selected from the repeated experiments. Repeated computational experiments were conducted in the same environment and on the same hardware.
Code availability
All code for benchmarking and testing can be found in the starTracer vignette at https://WHU-Neuroepigenetics-Lab.github.io/starTracer-vignette. The source code of the R package starTracer is available for download from our GitHub repository [41] (https://github.com/JerryZhang-1222/starTracer). Any updates to our package will also be documented on our GitHub page.
References
1. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
2. Hu, C. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 51, D870–D876 (2023).
3. Yao, Z. et al. A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Cell 184, 3222–3241.e26 (2021).
4. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
5. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
6. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
7. Herring, C. A. et al. Human prefrontal cortex gene regulatory dynamics from gestation to adulthood at single-cell resolution. Cell 185, 4428–4447.e28 (2022).
8. Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
9. Sidhom, E.-H. et al. Targeting a Braf/Mapk pathway rescues podocyte lipid peroxidation in CoQ-deficiency kidney disease. J. Clin. Investig. 131, e141380 (2021).
10. Cubelos, B. et al. Cux1 and Cux2 regulate dendritic branching, spine morphology, and synapses of the upper layer neurons of the cortex. Neuron 66, 523–535 (2010).
11. Yu, L. et al. Physiological functions of urea transporter B. Pflüg. Arch. - Eur. J. Physiol. 471, 1359–1368 (2019).
12. White, C. J., Ellis, J. M. & Wolfgang, M. J. The role of ethanolamine phosphate phospholyase in regulation of astrocyte lipid homeostasis. J. Biol. Chem. 297, 100830 (2021).
13. Caruso, G., Caraci, F. & Jolivet, R. B. Pivotal role of carnosine in the modulation of brain cells activity: multimodal mechanism of action and therapeutic potential in neurodegenerative disorders. Prog. Neurobiol. 175, 35–53 (2019).
14. Zhang, Y., Tian, C., Liu, X. & Zhang, H. Identification of genetic biomarkers for diagnosis of myocardial infarction compared with angina patients. Cardiovasc. Ther. 2020, 1–12 (2020).
15. Hu, J. M. et al. CD163 as a marker of M2 macrophage, contribute to predict aggressiveness and prognosis of Kazakh esophageal squamous cell carcinoma. Oncotarget 8, 21526–21538 (2017).
16. Dobnikar, L. et al. Disease-relevant transcriptional signatures identified in individual smooth muscle cells from healthy mouse vessels. Nat. Commun. 9, 4567 (2018).
17. Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763 (2018).
18. Chung, J.-J. et al. Single-cell transcriptome profiling of the kidney glomerulus identifies key cell types and reactions to injury. J. Am. Soc. Nephrol. 31, 2341–2354 (2020).
19. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
20. Pullin, J. M. & McCarthy, D. J. A comparison of marker gene selection methods for single-cell RNA sequencing data. Genome Biol. 25, 56 (2024).
21. Hu, P. et al. Dissecting cell-type composition and activity-dependent transcriptional state in mammalian brains by massively parallel single-nucleus RNA-Seq. Mol. Cell 68, 1006–1015.e7 (2017).
22. Cahoy, J. D. et al. A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J. Neurosci. 28, 264–278 (2008).
23. Petrova, R., Garcia, A. D. R. & Joyner, A. L. Titration of GLI3 repressor activity by sonic hedgehog signaling is critical for maintaining multiple adult neural stem cell and astrocyte functions. J. Neurosci. 33, 17490–17505 (2013).
24. Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).
25. Wang, Y. et al. Proteogenomics of diffuse gliomas reveal molecular subtypes associated with specific therapeutic targets and immune-evasion mechanisms. Nat. Commun. 14, 505 (2023).
26. Perlman, K. et al. Developmental trajectory of oligodendrocyte progenitor cells in the human brain revealed by single cell RNA sequencing. Glia 68, 1291–1303 (2020).
27. Seigneur, E. & Südhof, T. C. Cerebellins are differentially expressed in selective subsets of neurons throughout the brain. J. Comp. Neurol. 525, 3286–3311 (2017).
28. Grubman, A. et al. Transcriptional signature in microglia associated with Aβ plaque phagocytosis. Nat. Commun. 12, 3015 (2021).
29. Korsunsky, I., Nathan, A., Millard, N. & Raychaudhuri, S. Presto scales Wilcoxon and auROC analyses to millions of observations. Preprint at https://doi.org/10.1101/653253 (2019).
30. Eddelbuettel, D. Seamless R and C++ Integration with Rcpp (Springer, New York, NY, 2013).
31. Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
32. Abugessaisa, I. et al. SCPortalen: human and mouse single-cell centric database. Nucleic Acids Res. 46, D781–D787 (2018).
33. Chen, S. et al. hECA: the cell-centric assembly of a cell atlas. iScience 25, 104318 (2022).
34. Nouri, N., Gaglia, G., Kurlovs, A. H., De Rinaldis, E. & Savova, V. A marker gene-based method for identifying the cell-type of origin from single-cell RNA sequencing data. MethodsX 10, 102196 (2023).
35. Andreatta, M. & Carmona, S. J. UCell: robust and scalable single-cell gene signature scoring. Comput. Struct. Biotechnol. J. 19, 3796–3798 (2021).
36. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).
37. Zeng, H. et al. Spatially resolved single-cell translatomics at molecular resolution. Science 380, eadd3067 (2023).
38. Vafaei, N., Ribeiro, R. A. & Camarinha-Matos, L. M. Normalization techniques for multi-criteria decision making: analytical hierarchy process case study. In Technological Innovation for Cyber-Physical Systems Vol. 470 (eds. Camarinha-Matos, L. M., Falcão, A. J., Vafaei, N. & Najdi, S.) 261–269 (Springer International Publishing, Cham, 2016).
39. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 41, D991–D995 (2012).
40. Piper, M., Mistry, M., Liu, J., Gammerdinger, W. & Khetani, R. hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). https://doi.org/10.5281/ZENODO.5826256 (2022).
41. Zhang, F. JerryZhang-1222/starTracer: v1.0.1. Zenodo https://doi.org/10.5281/ZENODO.13364966 (2024).
Acknowledgements
The authors gratefully acknowledge grant support from the NSFC 82171517 (to Wei Wei), NSFC 82001421 (to Xiang Li), and NSFC 82271556 (to Xiang Li). Jincao Chen and Xiang Li are supported by the medical Sci-Tech innovation platform of Zhongnan Hospital, Wuhan University. Xiang Li is supported by the Climbing Project for Medical Talent of Zhongnan Hospital, Wuhan University. Wei Wei is supported by the Translational Medicine and Interdisciplinary Research Joint Fund of Zhongnan Hospital of Wuhan University. Feiyang Zhang is supported by joint-Ph.D. scholarship from the China Scholarship Council (CSC) for a medical research fellowship at Kyoto University. This project is supported by the RIKEN research award to Feiyang Zhang, Shengqun Hou, and Dan Ohtan Wang. The authors would also like to thank Dr. Jianjian Zhang for the helpful editing of the manuscript, and Dr. Bo Wang for comments and discussion.
Author information
Contributions
F.Z., Z.L., and Q.Z. formulated the algorithm. F.Z., R.C., W.M., K.H., Y.P., Y.L., and S.H. conceived the experiments. F.Z., X.L., W.W., J.C., and D.O.W. wrote and reviewed the manuscript. Y.P. and Y.L. tested the software.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Yue Cao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Aylin Bircan. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, F., Huang, K., Chen, R. et al. starTracer is an accelerated approach for precise marker gene identification in single-cell RNA-Seq analysis. Commun Biol 7, 1128 (2024). https://doi.org/10.1038/s42003-024-06790-6