Introduction

Single-cell/nucleus RNA sequencing (sc/snRNA-seq) reveals the heterogeneity of cell types, which can be grouped into clusters by various methods. Annotating these clusters requires marker genes, which serve as identifiers whose highly specific expression patterns differentiate one cluster from another1,2. Marker genes also facilitate the integration of single-cell sequencing data with subsequent experiments, such as fluorescence-activated cell sorting (FACS).

As sc/snRNA-seq technology is applied in an increasing number of studies, reducing the energy and time spent on computation has become a sustainability issue. The demand for computational resources and time grows in tandem with the number of cells generated in single-cell sequencing studies3. Consequently, data analysis becomes increasingly inefficient, which makes it harder to keep single-cell sequencing workflows flexible, scalable, and efficient.

Seurat is widely acknowledged as one of the most robust tools for single-cell data analysis4. Marker gene identification in Seurat predominantly relies on the “FindAllMarkers” function. The underlying algorithm, shared by other toolkits such as Scanpy5 and Monocle6, is based on the principle that a marker gene for a specific cluster is significantly up-regulated compared with the remaining clusters. However, this algorithm causes a “dilution” issue: when another cluster with high expression is pooled with the majority of clusters with low expression, its signal is diluted, which decreases accuracy because an ideal marker gene should be unique to a single cluster. Furthermore, testing genes for each cluster in turn results in redundant computations, produces superfluous information, and consumes significant computational resources. This creates an efficiency bottleneck, particularly for experiments with larger cell populations and complex annotations, and warrants further improvement. In addition, identifying marker genes for clusters with low cell counts can be challenging.
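The “dilution” issue can be illustrated with a small numerical sketch. The cluster means below are invented for illustration, the pooled mean ignores cluster sizes, and the one-versus-next ratio is only a contrast for this example, not the statistic used by Seurat or starTracer.

```python
from statistics import mean

# Hypothetical mean expression of one gene in six clusters (illustrative values only).
cluster_means = {"A": 8.0, "B": 7.5, "C": 0.2, "D": 0.1, "E": 0.3, "F": 0.2}

def one_vs_rest_ratio(means, target):
    """Target mean divided by the pooled mean of all other clusters.

    Simplification: every cluster is weighted equally, regardless of its size.
    """
    rest = [value for name, value in means.items() if name != target]
    return means[target] / mean(rest)

def one_vs_next_ratio(means, target):
    """Target mean divided by the highest mean among the other clusters."""
    rest = [value for name, value in means.items() if name != target]
    return means[target] / max(rest)

# Cluster B expresses the gene almost as strongly as cluster A, but pooling B
# with the four low-expression clusters hides this from the one-vs-rest ratio.
print("one vs rest:", round(one_vs_rest_ratio(cluster_means, "A"), 2))
print("one vs next:", round(one_vs_next_ratio(cluster_means, "A"), 2))
```

In this toy case the one-versus-rest ratio is large even though the gene is shared by two clusters, while the comparison against the next-highest cluster stays close to 1.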

Taking the mean expression values and standard deviations into account could be regarded as one solution. Significance tests over multiple groups, or analysis of variance with post hoc tests, may also offer additional information to avoid the “dilution” issue. However, these methods all require additional computational resources.

To address this issue while maintaining high efficiency and accuracy, marker gene searching strategies should refrain from aggregating the remaining clusters into one entity, and should consider the expression values in each cluster instead of performing significance tests.

We developed the R package starTracer to accurately find marker genes with high specificity and high efficiency. This package is specifically designed to integrate seamlessly with the widely used single-cell analysis tool Seurat and provide valuable insights into marker genes in single-cell data.

starTracer provides multiple flexible parameters. Through the “searchMarker” function, users can directly input a sparse single-cell expression matrix, a Seurat object, or an average expression matrix of each cell type. In Seurat, a gene may be considered a marker in multiple clusters. In contrast, starTracer returns a marker gene list in which each gene is exclusively associated with one cluster, with the number of marker genes per cluster specified by the researcher. starTracer also offers an option to select marker genes only among highly variable genes, thus reducing interference from genes with low variation. In addition, a threshold parameter sets the lowest permitted expression level of the marker genes, so that marker genes can be identified at various expression and specificity levels.

Additionally, for users who already have the output marker gene matrix from Seurat, we offer another module, “filterMarker”, which re-arranges and re-assigns marker genes based on their specificity level to optimize the results obtained from Seurat.

Overall, starTracer is an open-source R software package that is readily available for users to efficiently identify potential marker genes in single-cell sequencing data.

Results

Accurate identification of marker genes by starTracer from human prefrontal cortex, heart, and mouse kidney data

We assessed the performance of starTracer using three publicly available datasets: adult samples from the human prefrontal cortex (PFC; 24,564 cells)7, human left ventricle (592,689 cells)8, and control samples from mouse kidneys (16,119 cells)9 (Fig. 1A–C). With the default settings of starTracer’s “searchMarker” module, which include only highly variable genes, starTracer accurately identified a number of typical marker genes documented in published research articles or databases. For instance, in the human PFC data, starTracer identified ETNPPL and SLC14A1 as prominent marker genes for astrocytes, CUX210 for L2–3 CUX2 neurons, and CNDP1 for oligodendrocytes, all ranking at the top of its list (Fig. 1D) and consistent with previous studies11,12,13. In the human heart data, we identified NRXN1, CD163, and MYH11 as marker genes for cardiac neurons14, macrophages15, and vascular-associated smooth muscle cells16, respectively (Fig. 1E). In the mouse kidney data, Slc12a3 and Tie1 were identified as marker genes for distal convoluted tubule cells17 and glomerulus epithelial cells18, respectively (Fig. 1F). When the option was adjusted to include all detected genes in the analysis, starTracer identified canonical markers for nearly all inhibitory neuron subtypes (Supplementary Fig. 3E–G). Importantly, starTracer not only accurately identifies marker genes but also minimizes noise in the marker gene set (Supplementary Fig. 3F, G). All the marker genes found by “searchMarker” that are consistent with previous studies are recorded in Supplementary Tables 1–3. These findings collectively highlight starTracer’s accuracy in identifying marker genes.

Fig. 1: Comparing starTracer’s calculation speed with Seurat and the performances on the large-scale dataset.

A The UMAP plot of the single-cell sequencing data from the human prefrontal cortex, including 16 clusters and 24,564 cells. B The UMAP plot of the single-nucleus sequencing data from the human heart left ventricle, including 13 clusters and 592,689 cells. C The UMAP plot of the single-cell sequencing data from mouse kidney, including 13 clusters and 16,119 cells. D–F Bubble plots of the marker genes identified by starTracer for the human prefrontal cortex, human left heart ventricle, and mouse kidney samples. The size of each dot represents the proportion of cells in which expression of the gene is detected. The color of each dot represents the expression level. G Relative specificity level of each cluster from the human prefrontal cortex, measured by fold-\({T}_{i}\). The top 5 marker genes are included in the test. For each cluster, fold-\({T}_{i}\) is the quotient of the \({T}_{i}\) of the genes derived by “starTracer” and the \({\rm{mean}}\left({T}_{i}\right)\) of the genes derived by “FindAllMarkers”. The fold-\({T}_{i}\) of genes derived by “FindAllMarkers” therefore has an average value of 1, marked by the dashed line. H–J Calculation time of “FindAllMarkers” and starTracer on the human prefrontal cortex, human left heart ventricle, and mouse kidney samples. For each tool, the test was performed 10 times. A t-test was used to assess significance. K The calculation time of starTracer and “FindAllMarkers” on 10 samples containing 10% to 100% of the 592,689 cells, in 10% increments. Time elapsed is measured in minutes for “FindAllMarkers” and in seconds for starTracer. L–N The calculation time of the “FindAllMarkers” function from Seurat under different values of min.pct, which sets the minimum percentage of cells that must express the marker gene. At each min.pct level, the measurement was repeated 3 times. The red line indicates the calculation time of starTracer’s “searchMarker” on the same data. L Human prefrontal cortex data. M Human heart data. N Mouse kidney data.

To provide further insight into the functional significance of the markers identified by starTracer and Seurat, Gene Ontology (GO) annotation was conducted using the top 50 markers (Supplementary Figs. 4–6). Upon comparison with the bubble plots of marker genes, clusters that shared markers identified by Seurat also shared functional annotation terms. With starTracer’s markers, in contrast, terms were assigned more precisely to the corresponding cluster. For example, in the human heart data, the term “actin binding” was annotated only to cardiac muscle cells when the markers found by starTracer were used (Supplementary Fig. 4, left panel), and was absent from vascular-associated smooth muscle cells and pericytes, in contrast to the annotation based on Seurat’s markers (right panel). This discrepancy may stem from starTracer’s ability to identify more specific genes. Thus, by excluding shared marker genes, starTracer assigned more accurate annotation terms to clusters.

Enhanced specificity of marker gene identification with starTracer

To assess the specificity of the identified marker genes, we re-analyzed the three datasets. The “searchMarker” module of starTracer was set to detect five marker genes for each cluster. We gauged specificity using the metric \({T}_{i}\) and used fold-\({T}_{i}\) to compare specificity between tools. \({T}_{i}\) is defined as the proportion of the cells expressing gene i that originate from the target cluster (refer to the “Methods” section for details, Supplementary Fig. 1G). Higher specificity levels were observed for all cell types in the human PFC data, especially for interneurons (Vas, VIP, PV_SCUBE3, SST, LAMP5_NOS1, and PV) (Fig. 1G). In the other two datasets, we also observed an increased specificity level for all clusters (Supplementary Fig. 3A, B). When comparing the results from starTracer’s “searchMarker” module with those from Seurat’s “FindAllMarkers”, we observed marked reductions in background noise across all datasets (Fig. 1D–F and Supplementary Fig. 3C–E).
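The two specificity measures can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package’s implementation: a cell is taken to “express” a gene when its value is above 0, \({T}_{i}\) is read as the share of expressing cells that belong to the target cluster, and the function names and toy data are hypothetical.

```python
def specificity_t(expression, labels, target, threshold=0.0):
    """T_i: among cells expressing the gene, the fraction that belong to `target`.

    Assumption: a cell expresses the gene when its value exceeds `threshold`.
    """
    expressing = [label for value, label in zip(expression, labels) if value > threshold]
    if not expressing:
        return 0.0
    return sum(label == target for label in expressing) / len(expressing)

def fold_t(t_candidate, t_reference):
    """fold-T_i: a candidate marker's T_i divided by the mean T_i of reference markers."""
    return t_candidate / (sum(t_reference) / len(t_reference))

# Toy data: 4 interneurons (IN) followed by 6 principal neurons (PN).
labels = ["IN"] * 4 + ["PN"] * 6
gene_a = [3, 2, 0, 4, 0, 0, 0, 0, 1, 0]  # mostly restricted to IN
gene_b = [3, 2, 1, 4, 2, 1, 0, 3, 1, 0]  # expressed in both cell types

t_a = specificity_t(gene_a, labels, "IN")
t_b = specificity_t(gene_b, labels, "IN")
print(t_a, t_b, fold_t(t_a, [t_b]))
```

A fold-\({T}_{i}\) above 1 means the candidate marker is more specific to the target cluster than the reference markers are on average.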

starTracer increases calculation speed by 2–3 orders of magnitude compared with Seurat’s “FindAllMarkers”

To assess the computational cost of starTracer, we conducted a runtime comparison with Seurat’s “FindAllMarkers”. In the human PFC sample, “FindAllMarkers” required an average of 562.86 s, whereas starTracer completed its task in 3.03 s when importing highly variable genes (HVGs) and in 3.33 s when importing all detected genes, a 185.76- and 169.03-fold time saving, respectively (Fig. 1H). Similarly, in the human heart sample, “FindAllMarkers” averaged 381.28 min compared with starTracer’s 0.65 and 0.66 min, a 586.59- and 577.70-fold improvement (Fig. 1I). In the mouse kidney sample, starTracer finished in an average of 1.19 and 2.52 s for HVGs and all detected genes, respectively, whereas “FindAllMarkers” took 45.34 s, a 38.10- and 17.92-fold improvement (Fig. 1J). Collectively, these results show that starTracer requires far less computation time for marker gene detection.

To further assess the influence of sample size on the computational time of both starTracer and Seurat, we created a series of subsets of the left ventricle scRNA-seq dataset. These subsets ranged in size from 10% to 100%, increasing in 10% intervals. Cells for each subset were selected using a non-repeating, uniformly distributed strategy to retain the original cell composition. The runtime measurement was repeated 3 times to ensure robustness. As the sample size increased, the difference in runtime between starTracer and Seurat became more pronounced, with starTracer consistently outperforming Seurat (Fig. 1K). This trend underscores starTracer’s advantage when dealing with larger cell datasets.
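One way to draw such subsets is to sample within each cluster without replacement, so that every cluster keeps its share of cells. The sketch below follows that reading; the per-cluster stratification, the function name, and the fixed seed are assumptions, since the text only states that sampling was non-repeating and preserved the cell composition.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, seed=0):
    """Return sorted cell indices sampled without replacement within each cluster.

    Assumption: sampling is stratified by cluster label, so each cluster
    contributes roughly `fraction` of its cells (at least one cell).
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)
    chosen = []
    for indices in by_cluster.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))  # rng.sample never repeats an element
    return sorted(chosen)

# Toy annotation with three clusters of different sizes.
labels = ["a"] * 100 + ["b"] * 50 + ["c"] * 10
for percent in range(10, 101, 10):
    subset = stratified_subsample(labels, percent / 100)
    print(percent, len(subset))
```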

Seurat itself offers parameters such as min.pct and fold-change to expedite calculations, which can also help to identify more precise marker genes. To measure Seurat’s calculation speed across various parameters, we first used starTracer to identify 50 markers with sufficient specificity and accuracy. The mean fold-change of these markers was determined and used as the input value for Seurat’s “FindAllMarkers” function. Subsequently, min.pct was systematically varied from 10% to 90%. When the fold-change threshold was set equal to that determined by starTracer, the calculation speed of Seurat indeed increased, and it improved further as min.pct increased. However, even at a min.pct of 90%, Seurat remained slower than starTracer (Fig. 1L–N). It is worth noting that the fold-change used for Seurat in this experiment is a post hoc value: in a real analysis, researchers would not know a suitable threshold until the calculation has been completed.

We expanded our investigation to include multiple methods for calculating marker genes, namely Wilcoxon, bimod, ROC, t-test, negbinom, Poisson, LR, MAST, and DESeq2, as well as the R package Monocle 3, in order to assess calculation speed. The fold-change was standardized to match the mean value of starTracer’s results. Negbinom exhibited the longest computation time, while the default method, Wilcoxon, ranked fifth among all methods when ordered from shortest to longest (Supplementary Fig. 6A). In the case of Monocle 3, the number of cores used for calculation influenced computation speed. Surprisingly, as more cores were used, the calculation time slightly increased (Supplementary Fig. 6B). This may be attributed to the overhead of parallel computation. The marker genes identified by Monocle 3 are presented in Supplementary Fig. 6C. Overall, starTracer outperformed all included methods in terms of calculation speed in identifying marker genes.

Presto, a C++-based algorithm, accelerates the Wilcoxon and auROC analyses in Seurat’s “FindAllMarkers” while preserving the same results, which significantly boosts the computational efficiency of Seurat. We evaluated Presto on the same human prefrontal cortex (PFC), human heart, and mouse kidney data. For the human PFC data, with highly variable genes (HVGs) included in starTracer, the calculation speed still differed significantly between starTracer and Presto (Supplementary Fig. 6D): starTracer outperformed Presto with a 2.49-fold reduction in calculation time. For the human heart data, starTracer outperformed Presto with a 3.91-fold reduction in calculation time (Supplementary Fig. 7E). Similarly, for the mouse kidney data, starTracer surpassed Presto with a 1.57-fold reduction in calculation time (Supplementary Fig. 7F). In summary, while Presto significantly enhances calculation speed by leveraging C++, starTracer demonstrates better performance on large-scale datasets.

Evaluating false positive rate from simulated single-cell-sequencing data with ground-truth

Splatter is a widely used tool for generating simple, reproducible, and well-documented simulations of scRNA-seq data for systematically evaluating the performance of different tools19. To evaluate the false positive rate (FPR) of starTracer and Seurat, we used Splatter to generate single-cell sequencing data comprising 1000 simulated cells in 5 clusters (Fig. 2A). Fifty marker genes were defined as ground truth according to a selection strategy20 (refer to the “Methods” section for details). The top 50 marker genes were then found by starTracer and Seurat according to the same criteria (Supplementary Fig. 8). A gene is defined as a true positive marker if it appears in both the selected list and the ground-truth list, and as a false positive marker if it is selected by starTracer or Seurat but is not among the 50 ground-truth marker genes. The FPR is defined as the number of false positive markers divided by the number of all positive markers (refer to the “Methods” section for details). We calculated the FPR for each of the 5 clusters (Fig. 2B). In all clusters, the FPR of starTracer was consistently lower than that of Seurat. Both starTracer and Seurat return a ranked marker gene list.
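The FPR described above reduces to a set difference. The following sketch uses hypothetical gene names and treats “all positive markers” as every gene a tool selected, which is how the definition in the text reads.

```python
def false_positive_rate(selected, ground_truth):
    """Share of selected markers that are not in the ground-truth list.

    Follows the definition in the text: false positives / all selected markers.
    """
    selected_set = set(selected)
    if not selected_set:
        return 0.0
    false_positives = selected_set - set(ground_truth)
    return len(false_positives) / len(selected_set)

# Hypothetical gene identifiers for one simulated cluster.
truth = ["g1", "g2", "g9"]
picked = ["g1", "g2", "g3", "g4"]
print(false_positive_rate(picked, truth))
```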

Fig. 2: Comparing the false positive rate (FPR) of starTracer and Seurat on simulated data.

A UMAP plot of the simulated data, including five clusters. B False positive rate of starTracer and Seurat in 5 clusters. The blue line indicates the FPR of starTracer’s “searchMarker” module. The yellow line indicates the FPR of Seurat’s “FindAllMarkers”. C The intersection of the top-N markers found by starTracer and by Seurat with the ground-truth markers (blue for the overlap of starTracer with the ground truth, yellow for the overlap of Seurat with the ground truth), and the intersection of the top-N markers found by starTracer with those found by Seurat (gray). The red dashed line (y = x) indicates the ideal situation in which the top-N markers completely overlap with the ground-truth markers.

To illustrate the true positive rate across varying numbers of top-N markers, we intersected the top-N marker genes from each tool with the ground-truth marker gene set, with N ranging from 1 to 50 (Fig. 2C). As N increased, starTracer consistently identified more genes that intersected with the ground-truth marker genes than Seurat did, across all five clusters.
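The top-N curve can be computed in a single pass over a ranked marker list. This is a sketch with hypothetical gene names; it assumes the ranked list contains no duplicate genes.

```python
def top_n_overlap(ranked, ground_truth, max_n=None):
    """For N = 1..max_n, count how many of the top-N ranked genes are ground truth.

    Assumption: `ranked` is ordered best-first and has no duplicate genes.
    """
    truth = set(ground_truth)
    limit = len(ranked) if max_n is None else min(max_n, len(ranked))
    counts = []
    hits = 0
    for gene in ranked[:limit]:
        hits += gene in truth  # True counts as 1
        counts.append(hits)
    return counts

# Hypothetical ranked output of a tool and a ground-truth set.
ranked = ["g1", "x", "g2", "g3"]
truth = {"g1", "g2", "g3"}
print(top_n_overlap(ranked, truth))
```

A curve that follows the line y = x means every top-N gene is a ground-truth marker; the further it falls below that line, the more false positives appear among the top ranks.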

Finding marker genes from different annotation levels using starTracer

Researchers tend to annotate cells at various cluster levels3,21 to illustrate differences in cell components and gene expression changes across these levels. To investigate the performance of starTracer at different cluster levels, we reassessed the human PFC data7 using multiple annotation levels fetched from the Single Cell Portal. These levels include (1) Bio_clust: This includes principal neurons (PN), interneurons (IN), astrocytes (Astro), microglia (Micro), oligodendroglia (Oligo), and oligodendroglia progenitor cells (OPC) (Supplementary Fig. 9A). (2) Major_clust: Within this, PN is further divided into L2/3 ET-1, etc., while IN is divided into SST, VIP, PVALB, SNCG, among others1 (Fig. 1A). (3) Sub_clust: Here, each of the aforementioned clusters can be further delineated into more detailed clusters, totaling 66 subgroups (Fig. 3A). When examining marker genes at each level of this classification scheme, we found that marker genes identified by starTracer consistently exhibited elevated specificity for all clusters (Supplementary Fig. 9B).

Fig. 3: Testing starTracer’s performance on the dataset with complex annotations.

A The UMAP plot of the human prefrontal cortex data. The annotation “Sub_clust” is provided by the authors. Cells are annotated to 66 clusters. B Bubble plot of the marker genes found by starTracer using the annotation of “Bio_clust”. The size of each dot represents the proportion of cells expressing each gene. The color represents the expression level of each gene in each cell cluster. C Bubble plot of the marker genes found by starTracer using the annotation of “Sub_clust”. D Relative specificity level of each cluster, measured by fold-\({T}_{i}\), comparing the marker genes found by “FindAllMarkers” and by starTracer. The top 5 marker genes are included in the test. For each cluster, fold-\({T}_{i}\) is the quotient of the \({T}_{i}\) derived by “starTracer” and the \({\rm{mean}}\left({T}_{i}\right)\) calculated by “FindAllMarkers”. The fold-\({T}_{i}\) of genes derived by “FindAllMarkers” therefore has an average value of 1, marked by the dashed line.

For the Bio_clust level, starTracer identified GLI3 as an astrocyte marker22 and CNDP1 and SLC5A11 as oligodendroglia markers13,23 (Supplementary Fig. 9C, D). GAD2 is widely recognized as a canonical marker for inhibitory neurons in single-cell analysis3,21,24, COL9A1 and FERMT1 are known OPC marker genes25,26, CBLN2 is known to be expressed in principal neurons27, and FGD2 is highly expressed in microglia28. Both Seurat and starTracer are able to identify marker genes for each cluster, but starTracer outperforms Seurat by identifying marker genes with higher specificity (Supplementary Fig. 9C, D).

At the Major_clust level, consistent with the marker genes from the Bio_clust level, we found SLC5A11 and CNDP1 as oligodendroglia markers, COL9A1 and FERMT1 as OPC markers, and FGD2 as a microglia marker (Fig. 1D). The remaining markers are shown in the previous results.

For Sub_clust, although having 66 clusters makes the differences between clusters subtle (Fig. 3A), starTracer provided a noise-reduced marker gene set compared with FindAllMarkers (Fig. 3B, C). All clusters also exhibited a higher specificity level (Fig. 3D).

Overall, starTracer demonstrates a robust capability to identify marker genes across different levels of annotations.

Marker gene identification from rare cell types using starTracer

To showcase starTracer’s capability to identify marker genes even within rare cell types, we used the human prefrontal cortex (PFC) data, focusing on inhibitory neurons, where starTracer shows a notable improvement. We generated nine sub-samples from the original dataset, retaining progressively fewer inhibitory neurons (from 100% down to 10%) while maintaining the original cell counts for all other cell types (Fig. 4A). Despite the reduction in sample size, starTracer still identified a substantial portion of the original marker genes, with approximately half retained even at 10% sample size (Fig. 4B). To examine its ability to preserve top markers, we repeated the intersection analysis using only the top five markers (Fig. 4B). For the five “rare cell types”, starTracer retained specific markers: SST remains the top marker gene for SST neurons, PVALB and NPNT are identified as the top two marker genes for PV_SCUBE3 neurons, VIP is the top marker for VIP neurons (ranked first at 100% sample size and second at 10%), and NDNF persists as the top marker gene for ID2 neurons down to a 20% sample size. The top marker genes identified by starTracer also exhibit high specificity even at a 10% sample size (Fig. 4C).
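The sub-sampling design, in which only the selected cell types are thinned and every other cell is kept, can be sketched as follows. The function name, the seed, and the toy labels are hypothetical, and the exact sampling procedure used in the study is not given in this section.

```python
import random

def downsample_targets(labels, targets, fraction, seed=0):
    """Keep all non-target cells and a random `fraction` of each target cell type.

    Sampling is without replacement; each target type keeps at least one cell.
    """
    rng = random.Random(seed)
    targets = set(targets)
    keep = [i for i, label in enumerate(labels) if label not in targets]
    for target in sorted(targets):
        indices = [i for i, label in enumerate(labels) if label == target]
        if indices:
            k = max(1, round(len(indices) * fraction))
            keep.extend(rng.sample(indices, k))
    return sorted(keep)

# Toy annotation: two inhibitory neuron types plus astrocytes.
labels = ["SST"] * 20 + ["VIP"] * 10 + ["Astro"] * 30
kept = downsample_targets(labels, {"SST", "VIP"}, 0.1)
print(len(kept))
```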

Fig. 4: Finding marker genes from rare cell types using starTracer.

A UMAP of 100% sample and 10% sample of the target five types of inhibitory neurons. B The intersection of the original markers (markers found at 100% size) and the markers found at different sample sizes from 100% to 10%. The upper panel shows the intersection of the top 50 markers found by starTracer at each sample size. The lower panel shows the intersection of the top 5 markers found by starTracer. C Bubble plot of the top 50 markers found by starTracer at the level of 10% size. Bubble size indicates the percent of cells expressing the gene. Color indicates the average expression value.

Parameter influences: starTracer provides insights into the negative correlation between the expression level and the specificity level of marker genes

starTracer gives users the flexibility to adjust \({S}_{2}\), which represents the lower limit of gene expression level. To investigate the impact of varying expression levels on marker gene identification, we generated a series of marker gene sets based on different expression thresholds. The “searchMarker” module in starTracer lets users select an \({S}_{2}\) value from 0 to 1, which filters out the lowest \({S}_{2}\) percentile of genes for each cluster (refer to the “Methods” section for details). Here, by analyzing the subset of principal neurons from the adult human prefrontal cortex (Fig. 5A), we demonstrate how users can find optimal marker genes based on their specified minimum expression levels.

Fig. 5: The ability of starTracer to find marker genes at different expression levels.

A UMAP plot of the principal neurons in the human prefrontal cortex data, which includes 15 clusters. B The relationship between specificity level and the expression level of the top 5 marker genes. The expression level measured by \({E}_{1}\) is shown in a box plot. The specificity level of the marker genes is shown in the line chart. C–E Bubble plots of the marker genes found by starTracer using the annotation of “Sub_clust” with the \({S}_{2}\) value set to 0.0, 0.4, and 0.8. The size of each dot represents the proportion of cells expressing each gene. The color represents the expression level of each gene in each cell cluster.

We varied \({S}_{2}\) from 0 to 0.8 in 0.2 increments, which corresponds to filtering out from 0% to 80% of the least expressed genes after the genes have been allocated to each cluster. Using “searchMarker”, we determined marker genes for each \({S}_{2}\) value and examined \({E}_{1}^{{ij}}\) and \({T}_{i}\) at each setting. The \({T}_{i}\) of the top 5 marker genes decreases as \({S}_{2}\) increases. As expected, the gene expression level showed a positive correlation with \({S}_{2}\) (Fig. 5B). These findings highlight an inverse relationship between the specificity and the expression level of marker genes: as specificity demands rise, the expression level drops, and vice versa. This underscores the versatility of starTracer, which allows users to customize marker gene selection based on their specificity or expression level criteria.
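The effect of \({S}_{2}\) can be pictured as a per-cluster percentile cut on the genes already allocated to that cluster. The sketch below is an illustration under assumptions: the data layout, the function name, and the rounding of the cut-off are hypothetical, and the package may handle ties or percentile boundaries differently.

```python
def filter_low_expression(genes_by_cluster, s2):
    """Drop the lowest-expressed fraction `s2` of the genes allocated to each cluster.

    genes_by_cluster: {cluster: {gene: mean expression in that cluster}}
    Returns {cluster: [remaining genes, ordered from low to high expression]}.
    """
    if not 0.0 <= s2 <= 1.0:
        raise ValueError("s2 must lie between 0 and 1")
    kept = {}
    for cluster, expression in genes_by_cluster.items():
        ranked = sorted(expression, key=expression.get)  # ascending expression
        n_drop = int(len(ranked) * s2)  # assumption: round the cut-off down
        kept[cluster] = ranked[n_drop:]
    return kept

# Hypothetical allocation of five genes to one cluster.
allocated = {"L2_3": {"g1": 0.1, "g2": 0.5, "g3": 0.9, "g4": 2.0, "g5": 3.0}}
for s2 in (0.0, 0.4, 0.8):
    print(s2, filter_low_expression(allocated, s2)["L2_3"])
```

Raising the cut-off shrinks the candidate pool to highly expressed genes, which is consistent with the reported trade-off: the remaining genes are more abundant but not necessarily the most specific.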

Bubble plots show the marker genes under \({S}_{2}\) values of 0, 0.4, and 0.8. Background noise increased visibly as \({S}_{2}\) rose, indicating a decrease in the specificity level (Fig. 5C–E).

“filterMarker”: Identify marker genes with higher specificity based on the results of Seurat

To test the impact of “filterMarker”, we revisited the data from the human prefrontal cortex, human left ventricle, and mouse kidney. We observed a marked increase in \({T}_{i}\) after applying “filterMarker” (Fig. 6A–C), along with reduced background noise (Fig. 6D, E). As in the “searchMarker” analysis, interneurons displayed notably higher specificity levels. “filterMarker” retains all genes from “FindAllMarkers”, but it identifies the most appropriate marker genes and allocates them to their corresponding cluster (refer to the “Methods” section for details). This fine-tuning of Seurat’s output matrix helps to pinpoint marker genes with enhanced specificity levels.
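The re-assignment idea can be sketched as follows: when a gene is reported as a marker for several clusters, keep it only in the cluster where its specificity score is highest, then rank each cluster’s genes by that score. The row layout, the function name, and the scores are hypothetical, and this is an illustration of the principle rather than the module’s actual code.

```python
def reassign_markers(rows):
    """Assign each gene to its best-scoring cluster and rank genes within clusters.

    rows: iterable of (gene, cluster, score), where score is a specificity
    measure such as T_i. No gene is discarded; duplicates are collapsed.
    """
    best = {}
    for gene, cluster, score in rows:
        if gene not in best or score > best[gene][1]:
            best[gene] = (cluster, score)
    by_cluster = {}
    for gene, (cluster, score) in best.items():
        by_cluster.setdefault(cluster, []).append((gene, score))
    return {
        cluster: [gene for gene, _ in sorted(pairs, key=lambda pair: -pair[1])]
        for cluster, pairs in by_cluster.items()
    }

# Hypothetical one-vs-rest output in which GAD2 is listed for two clusters.
rows = [
    ("GAD2", "SST", 0.40),
    ("GAD2", "VIP", 0.35),
    ("SST", "SST", 0.90),
    ("VIP", "VIP", 0.80),
]
print(reassign_markers(rows))
```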

Fig. 6: “filterMarker” module’s capacity to refine Seurat’s output marker genes.

A–C Relative specificity level of each cluster from the human prefrontal cortex, human left heart ventricle, and mouse kidney samples, measured by fold-\({T}_{i}\). The top 5 marker genes are included in the test. For each cluster, fold-\({T}_{i}\) is the quotient of the \({T}_{i}\) of the genes derived by “starTracer” and the \({\rm{mean}}\left({T}_{i}\right)\) of the genes derived by “FindAllMarkers”. The fold-\({T}_{i}\) of genes derived by “FindAllMarkers” therefore has an average value of 1, marked by the dashed line. D Bubble plot of the marker genes identified by “FindAllMarkers”. Genes are arranged by the value of avg_log2FC. The dashed line indicates the genes identified as marker genes in the inhibitory neurons, which did not show a high specificity level. E Bubble plot of the marker genes identified by “filterMarker”. Genes are arranged by the value of MI. The dashed line indicates the genes identified as marker genes in the inhibitory neurons, which show a large increase in specificity.

Discussion

Accurate identification of marker genes is a critical outcome of single-cell analysis, underpinning all downstream functional analyses. In this study, we introduce starTracer, an R package designed to swiftly and precisely identify high-specificity marker genes, offering researchers a customizable approach. starTracer comprises two core modules: “searchMarker”, an independent pipeline, and “filterMarker”, which works on the output of tools like Seurat. To optimize efficiency, “searchMarker” avoids redundant calculations by pre-allocating genes to target clusters. We devised a novel metric, MI, within the “searchMarker” module to efficiently gauge a gene’s potential to serve as a marker. To overcome the “dilution” issue, we used the threshold \({S}_{2}\) to divide each gene into two sets and performed the calculations separately. \({T}_{i}\) objectively quantifies the proportion of cells originating from the target cluster. It can be used as a metric to evaluate the performance of marker genes and thus serves, in a complementary module of starTracer, to refine marker genes found by Seurat. Overall, starTracer can rapidly identify marker genes with high specificity, even in samples with a large number of cells or in clusters with a small number of cells.

starTracer relies on pre-clustered data. We have also demonstrated a negative correlation between specificity levels and expression levels. Thus, as an option, \({S}_{2}\) allows users to identify ideal marker genes based on expression levels and to create a clustering tree according to marker gene expression, which has the potential to refine cell-type annotations based on marker gene expression patterns, given a specific \({S}_{2}\) value. By default, “searchMarker” considers highly variable genes as candidates for marker gene identification, optimizing computational efficiency. However, users can freely input all detected genes for marker identification, and we recommend using highly variable genes for the initial run and exploring all genes if the results are unsatisfactory. Additionally, starTracer provides built-in functions for visualizing the expression of identified marker genes.

Currently, the bottleneck in the “searchMarker” process still lies in the handling of sparse matrices for single-cell data processing, for which we continue to rely on established and stable computational methods. In future versions, if better algorithms specifically designed for sparse matrices emerge, or if other researchers in the community contribute to “searchMarker”, its computational speed could be further enhanced. In addition, starTracer is still written in R, an interpreted language, which limits further speed improvement. For example, Presto achieves computation speeds tens of times faster by using a lower-level compiled language (C++)29. In this study, we did not use C++ for package development. Solutions such as Rcpp, which accelerates computations in R packages30, could be adopted for further optimization of starTracer.

We applied starTracer to multiple tissues and organs from both humans and mice, and it consistently demonstrated excellent performance. Particularly noteworthy was its enhanced accuracy when applied to sequencing data from the human prefrontal cortex compared to traditional approaches. This phenomenon may stem from the relatively minor differences between cortical neurons compared to the various cell types in the heart and kidney, especially within inhibitory neurons31, highlighting starTracer’s potential for application in multi-organ single-cell omics studies. Additionally, there has been increasing emphasis on cell-centric research in recent years32,33,34, advocating for the integration and analysis of cells from different single-cell datasets rather than segregating them into separate datasets. There have been tools such as UCell35 and Sargent34 designed for the cell-centric annotation techniques and starTracer has the potential to provide more information for these methodologies. Furthermore, with the continuous maturation of single-cell sequencing technologies, studies involving millions of cells are becoming increasingly prevalent. In this context, starTracer demonstrates promising prospects for applications in large-scale sequencing processes involving a vast number of cells.

starTracer seamlessly accepts input in various formats, including Seurat objects, sparse expression matrices with annotation tables, or average expression matrices with features as rows and cells as columns. This versatility extends its potential applications to other high-resolution omic data, such as spatial transcriptomic data24,36,37, single-cell ATAC-seq data, and morphomic data from morphOMICs, due to their shared data structure. Moreover, starTracer can serve as a valuable tool for identifying the most up-regulated genes in bulk RNA-seq experiments with multiple treatments.

In conclusion, starTracer emerges as a powerful and flexible tool, significantly enhancing the efficiency and accuracy of marker gene identification in single-cell analysis. Its potential applications in diverse omic datasets hold great promise for advancing our understanding of cellular heterogeneity and gene regulation in various biological contexts.

Methods

Rationale of starTracer

Single-cell sequencing data includes an expression matrix and an annotation matrix (Fig. 7A). The previous strategy for identifying marker genes within a cluster is largely dependent on comparing the expression levels of a gene in a particular cluster with the corresponding expression levels in the remaining clusters4,5. While generally effective, this approach may not always yield markers with superior specificity. Such a circumstance may arise when facing the “dilution” issue: a gene is relatively highly expressed not only in the target cluster but also in one or more additional clusters. Moreover, since this strategy entails a comparison of one cluster versus the remaining clusters for each cluster in question, there is a considerable degree of redundant computation. This redundancy can render the process of identifying marker genes markedly time-consuming.

Fig. 7: Schematic of starTracer.

A The structure of the expression matrix and annotation matrix of single-cell sequencing. B starTracer provides 2 options: de-novo “searchMarker” and in-conjunction “filterMarker”. “searchMarker” requires a cell annotation matrix and an expression matrix/averaged expression matrix from a single cell experiment or a Seurat object. “searchMarker” max-normalizes the average expression matrix, calculates the molecular index for each gene passing a threshold set by the user and outputs a matrix with marker genes. C “filterMarker” takes an output matrix from the “FindAllMarkers” function, assigns genes to clusters and re-arranges them by measuring the \({T}_{i}\) for each gene. The time elapsed for “searchMarker” is much shorter than that for “FindAllMarkers” and “filterMarker”.

The “searchMarker” module of starTracer operates as an independent pipeline that receives a gene expression matrix as input (Fig. 7B). Within starTracer, we employ max-normalization, a commonly used algorithm in machine learning, to extract features from genes while preserving each gene’s expression pattern across clusters. Using a threshold value, starTracer classifies the clusters into a high-expression group and a low-expression group for each gene. Subsequently, starTracer calculates the gene’s potential to act as a positive marker in the high-expression group and as a negative marker in the low-expression group. Finally, the gene’s overall potential to serve as a marker gene is determined based on both its positive and negative marker capabilities. Notably, because no cluster-versus-rest comparisons are made during this process, no statistical tests are performed, which improves efficiency. The output of starTracer includes a list of genes sorted by their potential to function as cluster-specific markers, with genes exhibiting the highest potential positioned at the top of the matrix for each respective cluster.

The starTracer package includes an additional functional module, “filterMarker”, designed as a complementary pipeline to Seurat’s “FindAllMarkers” function. It automatically removes redundant results from the output matrix and re-orders the matrix according to \({T}_{i}\), which represents the proportion of cells expressing gene i that are marked by gene i (Fig. 7C). For notations and meanings, please refer to Table 1.

Table 1 Notations and Meanings

Generating a maximum-scaled average expression matrix

The “searchMarker” module accepts a variety of input types. One essential input is the expression matrix of the single-cell sequencing data, which can be supplied in three ways: (i) importing a sparse expression matrix along with an annotation matrix into R; (ii) using the “AverageExpression” function from Seurat for users who already have a Seurat object; and (iii) importing an average expression matrix along with an annotation matrix into R. In every case, an average expression matrix is generated (Supplementary Fig. 7 Step 1). This adaptability allows integration with various data processing workflows.

After generating the average expression matrix, the mean expression value of gene i across the clusters is calculated, which is denoted as \({E}_{1}^{i}\) (Supplementary Fig. 1 Step 2):

$${E}_{1}^{i}=\frac{{\sum }_{j=1}^{K}{x}_{{ij}}}{K}.$$

\({E}_{1}^{i}\) represents the mean expression of gene i in K clusters. \({x}_{{ij}}\) represents the average expression value of gene i in cluster j.

To accentuate the differences in gene expression across clusters while preserving the inherent expression characteristics of each gene, such as the maximum and minimum expression values and zero values, we compute the maximum-scaled average expression matrix38 (hereafter referred to as the max-normalized matrix) for each gene (Supplementary Fig. 1 Step 3):

\(\widetilde{{x}_{{ij}}}=\frac{{x}_{{ij}}}{\max \left(\left\{{x}_{i1},{x}_{i2}\ldots ,{x}_{{iK}}\right\}\right)}.\)

\(K\) represents the total number of clusters.

\(\widetilde{{x}_{{ij}}}\) represents the max-normalized average expression value of gene i in cluster j.

By normalizing the average expression matrix, the highest expression value for each gene is scaled to 1:

$$\max \{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\}=1.$$

This facilitates the identification of the cluster with the maximum expression for each gene. Furthermore, this normalization method allows for the comparison of normalized results across genes within each cluster.
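As a minimal sketch of Steps 2 and 3, the Python snippet below computes \({E}_{1}^{i}\) and the max-normalized values for a toy average expression matrix. The gene names and values are invented for illustration, and the helper names `mean_expression` and `max_normalize` are hypothetical; they are not part of starTracer, which is implemented in R.

```python
# Toy average expression matrix: rows are genes, columns are K = 3 clusters.
# All values are made up for illustration.
avg_expr = {
    "geneA": [4.0, 1.0, 0.0],
    "geneB": [0.5, 2.0, 1.0],
}

def mean_expression(row):
    """E1: mean of a gene's average expression across the K clusters."""
    return sum(row) / len(row)

def max_normalize(row):
    """Divide each cluster value by the gene's maximum across clusters."""
    top = max(row)
    if top == 0:
        # Assumption: a gene with no expression anywhere is left as zeros.
        return [0.0 for _ in row]
    return [value / top for value in row]

e1 = {gene: mean_expression(row) for gene, row in avg_expr.items()}
normalized = {gene: max_normalize(row) for gene, row in avg_expr.items()}

for gene in avg_expr:
    print(gene, round(e1[gene], 3), normalized[gene])
```

After scaling, each gene has the value 1 in the cluster where its average expression is highest, and zeros remain zeros.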

Assigning genes to clusters and generating sub-matrices

To assign genes to their potential target clusters, we employ a selection strategy in which a gene is assigned to a cluster only if it displays its highest average expression value in that cluster. This strategy is based on the premise that a gene may only have the potential to be a marker in a cluster where \(\widetilde{{x}_{{ij}}}=1\). For each cluster, the genes with \(\widetilde{{x}_{{ij}}}=1\) are retained and regarded as potential markers (Supplementary Fig. 1 Step 3). Thus, we define an index set \({M}_{j}\subset \{{{\mathrm{1,2}}},\ldots ,{L}\}\) for cluster j:

$${M}_{j}=\{{i}:\widetilde{{x}_{{ij}}}=1,1\le {i}\le L\},$$

where \({M}_{j}\) is the set of row indices of the original matrix whose max-normalized value equals 1 in cluster j, that is, the genes whose maximum expression occurs in cluster j.

Let V denote the original max-normalized matrix, with

$$V\in {R}^{L\times K}.$$

We further divide V into sub-matrices \({V}_{j}\) for each cluster j according to \({M}_{j}\):

$${V}_{j}={\left({v}_{m,n}\right)}_{m\in {M}_{j},n\in \{1,2,\ldots ,K\}}.$$

The rows of the sub-matrix \({V}_{j}\) are drawn from the set \({M}_{j}\), while the columns are the clusters from 1 to K. In this way, the algorithm assigns genes to clusters and avoids redundant calculations. We assume that there are \({l}_{j}\) genes in sub-matrix \({V}_{j}\), where \({l}_{j}=|{M}_{j}|\).
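The assignment step can be sketched as follows. This is an illustrative Python snippet with invented values, not the package's R implementation; clusters are indexed from 0 here, and a gene with tied maxima would appear in more than one index set, which is an assumption of this sketch rather than a documented behavior.

```python
# Toy max-normalized matrix V: rows are genes, columns are K = 3 clusters.
V = {
    "geneA": [1.0, 0.25, 0.0],
    "geneB": [0.25, 1.0, 0.5],
    "geneC": [1.0, 0.9, 0.1],
}
K = 3

# M[j] collects the genes whose max-normalized value equals 1 in cluster j.
M = {j: [gene for gene, row in V.items() if row[j] == 1.0] for j in range(K)}

# V_sub[j] keeps the full rows (all K columns) of the genes in M[j].
V_sub = {j: {gene: V[gene] for gene in M[j]} for j in range(K)}

# l[j] is the number of candidate genes assigned to cluster j.
l = {j: len(M[j]) for j in range(K)}

print(M)
print(l)
```

Each gene is examined only within the sub-matrix of the cluster where it peaks, which is what removes the cluster-by-cluster repetition.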

Excluding low-expression genes

In the sub-matrix \({V}_{j}\), genes are re-arranged in descending order according to \({E}_{1}^{i}\) (Supplementary Fig. 1 Step 5). In most scenarios, genes with rather low expression levels may not be considered optimal candidates for marker genes. To improve computational efficiency by excluding these low-expression genes, we offer an optional threshold statistic, denoted as \({S}_{2}\), which ranges from 0 to 1. The statistic is used to filter out the lowest \({S}_{2}\) fraction of genes for each cluster after re-arranging according to their \({E}_{1}^{i}\) (e.g., setting \({S}_{2}\) to 0.1 will filter out the genes with the lowest 10% of \({E}_{1}^{i}\)) (Supplementary Fig. 1 Step 6). After filtering, \({{S}_{2}} * {l}_{j}\) genes will be filtered out while \(({1-S}_{2})\,* \,{l}_{j}\) genes will be retained for further analysis.
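A minimal sketch of this filter is given below in Python. The function name `drop_low_expression` is hypothetical, the \({E}_{1}^{i}\) values are invented, and rounding the number of removed genes down is an assumption of the sketch; the text does not state how non-integer counts are handled.

```python
import math

def drop_low_expression(e1_by_gene, s2):
    """Rank genes by E1 (descending) and drop the lowest s2 fraction."""
    if not 0 <= s2 <= 1:
        raise ValueError("s2 must lie between 0 and 1")
    ranked = sorted(e1_by_gene, key=e1_by_gene.get, reverse=True)
    # Assumption: the number of dropped genes is rounded down.
    n_drop = math.floor(s2 * len(ranked))
    return ranked[: len(ranked) - n_drop]

# Toy E1 values for the ten genes assigned to one cluster.
e1 = {f"gene{n}": float(n) for n in range(1, 11)}
kept = drop_low_expression(e1, 0.1)
print(kept)
```

With ten genes and a threshold of 0.1, the single gene with the lowest mean expression is removed and nine are retained.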

Binary classification of clusters and the calculation of \({{{{\boldsymbol{E}}}}}_{{{{\boldsymbol{2}}}}}^{{{{\boldsymbol{i}}}}}\) and \({{{{\boldsymbol{E}}}}}_{{{{\boldsymbol{3}}}}}^{{{{\boldsymbol{i}}}}}\) to avoid the dilution issue

For each gene, researchers can set a parameter \({S}_{1}\), which defaults to 0.5 (Supplementary Fig. 2A), to segregate clusters into high- and low-expression groups.

Gene i is presumed to serve as a positive marker in cluster j where \(\widetilde{{x}_{{ij}}}\) exceeds \({S}_{1}\) (high-expression group), and as a negative marker where \(\widetilde{{x}_{{ij}}}\) does not exceed \({S}_{1}\) (low-expression group).

Thus, for gene i, there are two index sets. The first contains the clusters in which gene i has the potential to be a positive marker (Supplementary Fig. 1 Step 7):

$${p}_{i}=\{j:\widetilde{{x}_{{ij}}} \, > \, {S}_{1}\},$$

and the second contains the clusters in which gene i has the potential to be a negative marker:

$${q}_{i}=\{j:\widetilde{{x}_{{ij}}}\le {S}_{1}\}.$$

Thus, we can extract two vectors from each row of \({V}_{j}\):

$${u}_{i,j}=\left({v}_{m,n}\,|\,m={i},\,{n}\in {p}_{i}\right)$$

and

$${w}_{i,j}=\left({v}_{m,n}\,|\,m={i},\,n\in {q}_{i}\right).$$

We then define, respectively:

$${E}_{2}^{i}=\frac{{\sum }_{j\in {p}_{i}}\widetilde{{x}_{{ij}}}}{\left|{p}_{i}\right|},$$

and

$${E}_{3}^{i}=\frac{{\sum }_{j\in {q}_{i}}\widetilde{{x}_{{ij}}}}{\left|{q}_{i}\right|}.$$

Here, \({E}_{2}^{i}\) represents the mean of \(\widetilde{{x}_{{ij}}}\) values for these clusters from \({p}_{i}\), and \({E}_{3}^{i}\) represents the mean of \(\widetilde{{x}_{{ij}}}\) values from \({q}_{i}\).
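The split into high- and low-expression groups and the two group means can be sketched as below. The snippet is illustrative Python with an invented row of max-normalized values; the function names are hypothetical, and returning 0 for \({E}_{3}^{i}\) when no cluster falls at or below \({S}_{1}\) is an assumption of this sketch.

```python
def split_clusters(row, s1=0.5):
    """Return (p, q): cluster indices above S1 and at or below S1."""
    p = [j for j, value in enumerate(row) if value > s1]
    q = [j for j, value in enumerate(row) if value <= s1]
    return p, q

def group_means(row, s1=0.5):
    """E2 is the mean over p; E3 is the mean over q."""
    p, q = split_clusters(row, s1)
    # p is never empty for a max-normalized row when s1 < 1,
    # because the maximum value is 1.
    e2 = sum(row[j] for j in p) / len(p)
    # Assumption: E3 is taken as 0 when every cluster exceeds S1.
    e3 = sum(row[j] for j in q) / len(q) if q else 0.0
    return e2, e3

row = [1.0, 0.8, 0.2, 0.0]  # max-normalized values of one toy gene
p, q = split_clusters(row)
e2, e3 = group_means(row)
print(p, q, e2, e3)
```

Here two clusters fall in the high-expression group and two in the low-expression group, so \({E}_{2}^{i}\) averages 1.0 and 0.8 while \({E}_{3}^{i}\) averages 0.2 and 0.0.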

Consider a gene for which n clusters have max-normalized values greater than \({S}_{1}\), that is, \(n=|{p}_{i}|\). From the normalization above, \(\max \{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\}=1\). \({E}_{2}^{i}\) and \({E}_{3}^{i}\), which represent the average expression levels of a gene in clusters with values exceeding \({S}_{1}\) and not exceeding \({S}_{1}\), respectively, are negatively correlated with the positive- and negative-marker potentials \({\rho }_{i}\) and \({\eta }_{i}\), respectively (Supplementary Fig. 2B, C):

$$\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, < \, 0,$$
$$\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} \, < \, 0.$$

The values of \({E}_{2}^{i}\) and \({E}_{3}^{i}\) for each gene are compiled into a matrix for further analyses.

Evaluating genes’ potential as positive and negative markers by \({{{\boldsymbol{\rho }}}}\) and \({{{\boldsymbol{\eta }}}}\)

While marker genes are typically identified by their up-regulation in specific clusters, it is also crucial for an ideal marker gene to exhibit low expression in the remaining clusters at the same time, thereby ensuring its specificity (Supplementary Fig. 2A). We employ

$$\rho \in R,0\le \rho \le 1,$$

and

$$\eta \in R,0\le \eta \le 1$$

to quantify the potential of a gene serving as a positive or negative marker.

We further define that an ideal marker gene should simultaneously exhibit high values for both \(\rho\) and \(\eta\). Note that as \({E}_{2}\) increases, the potential of a gene to be a positive marker gene decreases (Supplementary Fig. 2B), while as \({E}_{3}\) increases, the potential of a gene to be a negative marker gene also decreases (Supplementary Fig. 2C). So, for each gene i,

$$\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, < \, 0,$$
$$\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} \, < \, 0.$$

Measuring a gene’s capability to be a positive and negative marker with the molecular index

To quantify the potential of a gene to serve as a marker, we devised a novel metric termed the molecular index (MI). MI is defined as the difference between the positive molecular index (PMI) and the negative molecular index (NMI), which are computed from \({E}_{2}^{i}\) and \({E}_{3}^{i}\), respectively, thereby offering a comprehensive measure of a gene’s marker potential (Supplementary Fig. 2D). The equations are as follows:

$${{\rm {PM}{I}}}_{i}=\frac{1-{E}_{2}^{i}}{1-{S}_{1}},$$
$${{\rm {NM}{I}}}_{i}=\frac{{E}_{3}^{i}}{{S}_{1}},$$
$${\rm {M{I}}}_{i}={{\rm {PM}{I}}}_{i}-{{\rm {NM}{I}}}_{i}.$$

In these equations, \(1-{E}_{2}^{i}\) is the difference between the maximum possible value of \({E}_{2}^{i}\), which equals 1, and the actual \({E}_{2}^{i}\). The range of this quantity is governed by \(1-{S}_{1}\):

$${1-E}_{2}^{i}\in \left[0,\frac{\left(1-{S}_{1}\right)\left({n}_{i}-1\right)}{{{{|}}}{p}_{i}{{{|}}}}\right).$$
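The upper bound can be checked in one step, as a short sketch that assumes \({n}_{i}=|{p}_{i}|\) (the number of clusters in the high-expression group) and \({n}_{i}\ge 2\). One of the \({n}_{i}\) values in the high-expression group equals 1 and the remaining \({n}_{i}-1\) values are strictly greater than \({S}_{1}\), so

$${E}_{2}^{i} \, > \, \frac{1+\left({n}_{i}-1\right){S}_{1}}{{n}_{i}}\quad \Rightarrow \quad 1-{E}_{2}^{i} \, < \, \frac{{n}_{i}-1-\left({n}_{i}-1\right){S}_{1}}{{n}_{i}}=\frac{\left(1-{S}_{1}\right)\left({n}_{i}-1\right)}{{n}_{i}}.$$

The lower bound 0 is reached when every cluster in the high-expression group has a max-normalized value of 1. When \({n}_{i}=1\), \({E}_{2}^{i}=1\) and \(1-{E}_{2}^{i}=0\) exactly.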

In contrast, every value in the low-expression group lies between 0 and \({S}_{1}\), so the range of \({E}_{3}^{i}\) is governed by \({S}_{1}\):

$${E}_{3}^{i}\in \left[0,{S}_{1}\right].$$

To mitigate the impact of the ranges of \(1-{E}_{2}^{i}\) and \({E}_{3}^{i}\) on the magnitude of these values for a given n, we normalize \(1-{E}_{2}^{i}\) and \({E}_{3}^{i}\) by dividing them by \(1-{S}_{1}\) and \({S}_{1}\), respectively. It is worth noting that PMI is positively correlated with \({\rho }_{i}\) and NMI is negatively correlated with \({\eta }_{i}\) (Supplementary Fig. 2E, F). MI is then defined as the difference between PMI and NMI. As such, we introduce PMI, NMI, and MI as our metrics.
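The three indices follow directly from the equations above. The snippet below is a minimal Python sketch with invented inputs; the function name `molecular_index` is hypothetical and the check on \({S}_{1}\) is added here only to avoid division by zero.

```python
def molecular_index(e2, e3, s1=0.5):
    """Return (PMI, NMI, MI) from E2, E3 and the threshold S1."""
    if not 0 < s1 < 1:
        raise ValueError("s1 must lie strictly between 0 and 1")
    pmi = (1 - e2) / (1 - s1)   # normalized distance of E2 from its maximum
    nmi = e3 / s1               # normalized mean of the low-expression group
    return pmi, nmi, pmi - nmi

# A gene high in a single cluster and silent elsewhere: all three are 0.
print(molecular_index(e2=1.0, e3=0.0))

# A gene with a second, weaker high cluster and some background expression.
pmi, nmi, mi = molecular_index(e2=0.75, e3=0.125)
print(pmi, nmi, mi)
```

For a gene with only one cluster above \({S}_{1}\), \({E}_{2}^{i}=1\) and PMI is 0, so MI reduces to the negative of NMI; background expression in the remaining clusters then lowers MI.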

Assessing marker gene potential with \({{{\bf{M}}}}{{{{\bf{I}}}}}_{{{{\boldsymbol{i}}}}}\)

Considering that \({{{\rm {MI}}}}_{i}\) is defined as the difference between \({{{\rm {PMI}}}}_{i}\) and \({{{\rm {NMI}}}}_{i}\), it will positively correlate with \(\rho\) due to the following relationships:

\(\frac{\partial {{{\rm {MI}}}}_{i}}{\partial {E}_{2}^{i}}=\,-\frac{1}{1-{S}_{1}} < \, 0\) and \(\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} < \, 0\), therefore,

$$\frac{\partial {\rm {M{I}}}_{i}}{\partial {\rho }_{i}}=\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{2}^{i}}\,\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, > \, 0.$$

This inequality shows that \({\rm {M{I}}}_{i}\) positively correlates with \({\rho }_{i}\), reflecting the potential of a gene to be a positive marker gene.

Similarly, \({\rm {M{I}}}_{i}\) positively correlates with \({\eta }_{i}\), because \(\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{3}^{i}}=\,-\frac{1}{{S}_{1}} < \, 0\) and \(\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} < \, 0\); therefore:

$$\frac{\partial {\rm {M{I}}}_{i}}{\partial {\eta }_{i}}=\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{3}^{i}}\cdot \frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} \, > \, 0.$$

This demonstrates that \({\rm {M{I}}}_{i}\) positively correlates with \({\eta }_{i}\), indicating the potential of a gene to be a negative marker gene. Thus, the potential of a gene to be a marker for a cluster can be evaluated using \({\rm {M{I}}}_{i}\).

Reordering genes in each cluster by n and \({\rm {M{I}}}_{i}\) to find optimal marker genes

Following the calculation of the aforementioned statistics for gene i, including \({E}_{1}^{i}\), \({E}_{2}^{i}\), \({E}_{3}^{i}\), \({{{\rm {PMI}}}}_{i}\), \({{{\rm {NMI}}}}_{i}\) and \({{{\rm {MI}}}}_{i}\), it is worth noting that these calculations have been performed in the context of a given n. Genes with lower n values should have higher potential to serve as marker genes as there are fewer clusters passing the threshold \({S}_{1}\). Therefore, we rearrange the matrix based on the following principle:

$$\left\{\begin{array}{c}{{\mbox{ascending}}}(|{p}_{i}|)\\ {{\mbox{descending}}}({{{\rm{M{I}}}}}_{i})\end{array}\right..$$

The resulting matrix will be saved as an output of the “searchMarker” module.
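The two-level ordering can be sketched with a composite sort key. The Python snippet below uses invented records; the field names `n_high` (for \(|{p}_{i}|\)) and `mi` are hypothetical and chosen only for this illustration.

```python
# Each record holds a gene's |p_i| (clusters above S1) and its MI.
genes = [
    {"gene": "geneA", "n_high": 2, "mi": 0.30},
    {"gene": "geneB", "n_high": 1, "mi": -0.40},
    {"gene": "geneC", "n_high": 1, "mi": -0.05},
    {"gene": "geneD", "n_high": 2, "mi": 0.10},
]

# Sort by |p_i| ascending first, then by MI descending within ties.
ranked = sorted(genes, key=lambda g: (g["n_high"], -g["mi"]))
order = [g["gene"] for g in ranked]
print(order)
```

Genes with a single high-expression cluster come first, ordered by MI, followed by genes with two high-expression clusters.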

Measuring specificity with \({T}_{i}\)

To measure the specificity level of potential marker genes, we introduce a statistic denoted as \({T}_{i}\) (Supplementary Fig. 2G):

$${T}_{i}=\frac{\left|{{G}}_{i}\cap {G}_{{\mbox{clust}}}\right|}{\left|{G}_{i}\right|},$$

where

\({G}_{i}\) represents the set of the cells expressing gene i in-silico.

\({G}_{{{\rm {clust}}}}\) represents the set of cells where gene i serves as the in-silico marker gene.

In the context of using marker gene i to label cells from a sample in-vivo/vitro, \({T}_{i}\) could be utilized to assess the extent to which cells genuinely belong to the cluster defined by the marker gene i, and can, therefore, be utilized to assess the specificity level of marker gene i.
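As an illustration, \({T}_{i}\) can be computed from two sets of cell identifiers. The Python sketch below follows the description of \({T}_{i}\) as the proportion of cells expressing gene i that belong to the cluster marked by gene i; the cell identifiers are invented and the function name `specificity` is hypothetical.

```python
def specificity(expressing_cells, cluster_cells):
    """T_i: fraction of cells expressing the gene that lie in its cluster."""
    if not expressing_cells:
        # Assumption: a gene expressed in no cell gets a specificity of 0.
        return 0.0
    return len(expressing_cells & cluster_cells) / len(expressing_cells)

G_i = {"c1", "c2", "c3", "c7"}       # cells expressing gene i
G_clust = {"c1", "c2", "c3", "c4"}   # cells of the cluster marked by gene i
t_i = specificity(G_i, G_clust)
print(t_i)
```

In this toy case three of the four expressing cells lie in the marked cluster, giving a specificity of 0.75.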

Rationale for selecting negative markers

To find negative markers, the overall process is similar to selecting positive markers except for the following adjustments: (1) we convert the original max-normalized matrix to a new matrix where the expression value of gene i in cluster j is replaced by \(1-\widetilde{{x}_{{ij}}}\). In the following calculation, all the parameters will be calculated based on the new matrix. (2) An additional threshold \({S}_{3}\) is applied during step 6, where the top \({S}_{3}\) fraction of genes (\({S}_{3}\in \left[0,1\right)\)) will be filtered out. (3) In step 8, a new \({{{\rm {PMI}}}}_{i}\) based on the converted max-normalized matrix will be used to arrange the negative markers:

$$\left\{\begin{array}{c}{{\mbox{ascending}}}(|{p}_{i}|)\\ {{\mbox{descending}}}({\rm {PM}{I}}_{i})\end{array}\right..$$
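The matrix conversion in adjustment (1) can be sketched as below. This is an illustrative Python snippet with invented values; the function name `invert_matrix` is hypothetical, and only the conversion is shown, not the subsequent filtering and ranking.

```python
def invert_matrix(normalized):
    """Replace every max-normalized value x with 1 - x."""
    return {gene: [1 - value for value in row]
            for gene, row in normalized.items()}

normalized = {
    "geneA": [1.0, 0.25, 0.0],
    "geneB": [0.5, 1.0, 0.75],
}
inverted = invert_matrix(normalized)
print(inverted)
```

After conversion, a gene reaches 1 only in clusters where its expression is zero; geneB, which is expressed in every cluster, peaks at 0.5 in the converted matrix.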

Rationale of “filterMarker”

“filterMarker” is intended for users who already have results from the Seurat “FindAllMarkers” function and want to refine them. It provides an algorithm to re-sort the matrix and offer a more accurate list of marker genes for each cluster. Working in conjunction with Seurat, “filterMarker” re-arranges the output matrix from Seurat to complement the “FindAllMarkers” function (Fig. 7C).

Users can apply “filterMarker” to sort the matrix from Seurat. The function takes the output matrix from “FindAllMarkers” as input and calculates \({T}_{i}\) based on the number of cells in each cluster and the values of “pct.1” and “pct.2” provided by the Seurat matrix. Then, each gene is assigned to the cluster in which its expression level is highest, as measured by the average fold-change in log2 scale (“avg_log2FC”). Within each cluster, genes are re-arranged in descending order of \({T}_{i}\).
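One way to obtain \({T}_{i}\) from these columns is sketched below in Python. The exact formula used by “filterMarker” is not given in the text, so the estimate here (expressing cells inside the cluster divided by all expressing cells, reconstructed from “pct.1”, “pct.2” and the cluster sizes) is an assumption, as are the function name `t_from_pct` and the toy rows.

```python
def t_from_pct(pct1, pct2, n_cluster, n_total):
    """Estimate T_i from detection rates inside and outside the cluster."""
    inside = pct1 * n_cluster                # expressing cells in the cluster
    outside = pct2 * (n_total - n_cluster)   # expressing cells elsewhere
    total = inside + outside
    return inside / total if total > 0 else 0.0

cluster_sizes = {"c1": 100, "c2": 300}
n_total = sum(cluster_sizes.values())

# Toy rows mimicking the columns of a FindAllMarkers result.
rows = [
    {"gene": "geneA", "cluster": "c1", "avg_log2FC": 2.0, "pct.1": 0.8, "pct.2": 0.1},
    {"gene": "geneA", "cluster": "c2", "avg_log2FC": 0.5, "pct.1": 0.3, "pct.2": 0.2},
    {"gene": "geneB", "cluster": "c1", "avg_log2FC": 1.0, "pct.1": 0.5, "pct.2": 0.5},
    {"gene": "geneC", "cluster": "c2", "avg_log2FC": 1.5, "pct.1": 0.6, "pct.2": 0.05},
]

# Assign each gene to the cluster where its avg_log2FC is highest.
best = {}
for row in rows:
    current = best.get(row["gene"])
    if current is None or row["avg_log2FC"] > current["avg_log2FC"]:
        best[row["gene"]] = row

# Attach T_i, then sort by cluster and by descending T_i within a cluster.
scored = [
    dict(row, T=t_from_pct(row["pct.1"], row["pct.2"],
                           cluster_sizes[row["cluster"]], n_total))
    for row in best.values()
]
ranked = sorted(scored, key=lambda r: (r["cluster"], -r["T"]))
print([(r["cluster"], r["gene"], round(r["T"], 3)) for r in ranked])
```

The duplicate entry of geneA in cluster c2 is dropped, and within cluster c1 geneA ranks above geneB because a larger share of its expressing cells lies inside the cluster.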

Processing benchmark and testing data

Processing single-cell/nuclear sequencing data

Sequencing data and metadata for the human heart8 and mouse kidney9 were downloaded from the Single Cell Portal (SCP1303, SCP1245). Prefrontal cortex (PFC) data were obtained from the GEO database39 under accession ID GSE168408 (ref. 7). R (v4.1.3) was used for the rest of the analysis. We created objects with Seurat (v4.3.0). Single-cell experiment data were normalized and scaled. We identified 3000 highly variable genes (HVG) and performed principal component analysis (PCA). The accumulated standard deviation of each principal component was calculated, and the principal component with an accumulated standard deviation >90% and a standard deviation <5% was recorded as n1. The difference in standard deviation between each pair of neighboring principal components was calculated, and the principal component with a difference <0.1% was recorded as n2. The dimensions from the 1st to 1 plus the minimum of n1 and n2 were used for further analyses40. We performed uniform manifold approximation and projection (UMAP) with uwot for visualization.

Benchmarking and parameter evaluation

We evaluated the running time and specificity level using three independent datasets and a series of 10 samples with a linear increase in cell population from around 50,000 to around 500,000. Tests were performed at different annotation levels, and the influence of \({S}_{2}\) on expression and specificity was evaluated. The specificity level improvements achieved by “filterMarker” were also evaluated. For manually generated rare cell types, we used the human PFC data and randomly generated 10 subsets of the inhibitory neurons.

Simulating single-cell sequencing data and false discovery rate

The Splatter package (version 3.18) was used to simulate a ground-truth single-cell RNA-seq dataset. The simulation parameters were set as follows: de.facLoc = 3; de.facScale = 0.2; group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2). To identify ground-truth topN marker genes, we employed a method outlined in a previously published benchmarking study to systematically select ground-truth marker genes20. Let i, ranging from 1 to L, denote the gene index, and j, ranging from 1 to K, denote the group index. The variable \({\beta }_{ij}\) represents the differential expression (DE) factor for gene i in group j within the splat model. Here, \({\beta }_{ij}\) quantifies the differential expression level of gene i in group j compared to a baseline expression level. If \({\beta }_{ij}\) equals 1, gene i exhibits baseline expression in group j. To determine whether a gene is a significant marker for a specific cluster j, we use the score \({m}_{i}\):

$${m}_{i}=\frac{1}{K-1} \mathop {\sum }\limits_{r=1,r\ne j}^{K}\log \left(\frac{{{{{\rm{\beta }}}}}_{ij}}{{{{{\rm{\beta }}}}}_{ir}}\right).$$

For a given cluster indexed as j, this score helps evaluate the marker potential of each gene.
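The score can be sketched as a mean log-ratio of DE factors. The Python snippet below reads the sum as running over all groups other than the cluster of interest and uses the natural logarithm; both readings, the function name `marker_score`, and the toy DE factors are assumptions of this sketch.

```python
import math

def marker_score(beta_row, j):
    """Mean log-ratio of group j's DE factor to every other group's factor."""
    others = [r for r in range(len(beta_row)) if r != j]
    return sum(math.log(beta_row[j] / beta_row[r]) for r in others) / len(others)

beta = [4.0, 1.0, 1.0]  # toy DE factors of one gene in K = 3 groups
score_up = marker_score(beta, 0)     # group where the gene is up-regulated
score_base = marker_score(beta, 1)   # group at baseline expression
print(score_up, score_base)
```

A gene up-regulated only in the group of interest receives a positive score, while the same gene scored against a baseline group receives a negative one.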

Steps for selecting simulated ground truth marker genes for a specific cluster from the simulated data consist of:

1. Select genes that have simulated mean expression ≥0.1;

2. Calculate the marker gene score \({m}_{i}\) for each gene;

3. Rank genes by \({m}_{i}\);

4. Select the top n genes as marker genes.

A gene was classified as a true positive if it appeared in both the selected and ground-truth lists, a false positive if it was only in the selected list, and a false negative if it was only in the ground-truth list. The false positive rate (FPR) was calculated as follows: FPR = false positives/(true positives + false positives).
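The classification and the rate defined above can be sketched with set operations. The Python snippet below uses invented gene lists; the function name `false_positive_rate` is hypothetical, and returning 0 when nothing is selected is an assumption of the sketch.

```python
def false_positive_rate(selected, ground_truth):
    """Count TP, FP, FN and return FP / (TP + FP) as defined in the text."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)   # in both lists
    fp = len(selected - ground_truth)   # only in the selected list
    fn = len(ground_truth - selected)   # only in the ground-truth list
    # Assumption: the rate is 0 when no gene was selected.
    rate = fp / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, fn, rate

selected = ["g1", "g2", "g3", "g4"]
truth = ["g1", "g2", "g5", "g6"]
tp, fp, fn, rate = false_positive_rate(selected, truth)
print(tp, fp, fn, rate)
```

With two of the four selected genes present in the ground-truth list, the rate is 0.5.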

Data included for testing and benchmarking

We used a series of single-cell/nucleus RNA-sequencing datasets to assess the performance of starTracer. We focused on the ability to efficiently identify high-specificity markers compared to “FindAllMarkers” from Seurat. We included three sets of single-cell/nuclear RNA-sequencing data from different species and organs from the Gene Expression Omnibus39 and the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The selection of datasets ensured the inclusion of diverse species, organs, varying sample sizes, and both scRNA-seq and snRNA-seq techniques. We included objects created with Seurat4; annotations were provided by the authors. We conducted the basic analyses, including finding highly variable genes, normalizing the data, scaling the data, and running PCA and UMAP with each Seurat object (refer to the “Methods” section for details).

Statistics and reproducibility

Unless otherwise indicated, statistical comparisons were performed with the t-test. p values < 0.05 were regarded as statistically significant. Data graphics and statistical analysis were performed using R. No representative results were selected from the repeated experiments. Repeated computational experiments were conducted in the same environment and on the same hardware.