Abstract
Singlecell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multiomic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNAseq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using labelaware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improves the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization.
Introduction
Singlecell RNAseq (scRNAseq) has generated a wealth of data allowing researchers to measure and quantify RNA levels in single cells at unprecedented scales^{1,2}. These studies yield valuable insights regarding intrinsic properties of single cells and specific cell types, which is critical to understanding cell development and disease^{3}. When coupled with other data modalities, such as measurements of cell surface protein levels^{4} or spatial transcriptomics^{5}, a more precise and complex partition of the cell type landscape emerges.
Three singlecell methodologies motivate our work. First, scRNAseq performs short read RNA sequencing on disassociated cells in a sample; a key goal of scRNAseq analyses is to label each of the cells in the sample with a precise cell type by considering the genes that are expressed in the cell. Second, singlemolecule fluorescence in situ hybridization (smFISH) approaches assay transcriptional data spatially^{6}. These techniques rely on fluorescent probes that bind near genes of interest to be quantified called markers. When the marker genes bound by probes are expressed above a threshold, fluorescence can be detected using microscopy at the location of expression. Third, sorting techniques such as fluorescenceactivated cell sorting rely on a small number of cell surface marker probes that can accurately distinguish cell types according to differences in expression of a small set of cell surface markers^{7}. When a cell passes through a flow cytometer to be sorted, the cell is labeled rapidly based on these markers and sorted appropriately.
If we are to use these methods to visualize, characterize, and distinguish cell labels among collections of heterogeneous cells, a key challenge from these emerging technologies is to label each assayed cell with a precise cell type or cell state label. To do this, gene panels must be designed that allow us to distinguish between cell labels both efficiently (with as few markers as possible) and at high precision (discriminating similar cells with different labels). The number of markers is experimentally constrained by the product of the number of fluorescence channels and the number of hybridization cycles^{5}. In particular, stateoftheart smFISH methods use on the order of 40 markers. The task of optimally choosing markers among all genes that most reliably and precisely distinguish cell labels given a hierarchical partitioning of cell type labels is a combinatorially difficult problem.
Existing supervised approaches to marker selection are scarce and only allow for the identification of markers that distinguish each singlecell labels from all of the other cell labels in a sample (onevsall)^{8,9,10}. These methods identify markers that are differentially expressed across two groups by comparing withingroup expression with acrossgroup expression. One such method, COMET^{10}, exhaustively select sets of k markers that it then ranks in a onevsall task. This has complexity G^{k}, where G is the number of genes, and it quickly becomes infeasible as k reaches even 4. In fact, COMET proposes at most four gene panels, thus lacking both scalability and generalizability. Furthermore, these approaches ignore both hierarchical relationships among cell labels and correlations in expression patterns among genes. The simplistic onevsall representation of cell labels to select markers prevents a solution to the problem when the number of cell labels is larger than the number of markers that can be used in an experiment^{11}.
In contrast, label hierarchyaware approaches have the ability to select markers that partition the labels at layers that are not exclusively at the leaves of the hierarchy, allowing genes that are robustly differentially expressed across a subset of cell labels to be selected as markers. For a bifurcating tree representation of the cell type hierarchy, the number of markers required is k1, where k is the number of cell labels (i.e., leaves in the hierarchical tree). Yet, these markers will be redundant if the latent dimension of the space spanned by the gene expression profiles of each cell type is lower than the number of cell labels we aim to distinguish. Assuming this is true, then a number of markers smaller than k is needed to maintain the hierarchy.
To fill this methodological gap, we developed scGeneFit, a rigorous and efficient approach for marker selection in the context of scRNAseq data with a given hierarchical partition of labels. Our method draws from ideas in compressive classification^{11} and largest margin nearest neighbor algorithms^{12,13}. scGeneFit shows good performance in both simulated data and in scRNAseq data. Where traditional approaches test the discriminatory value of each gene separately, our approach jointly recovers the optimal set of genes of a given size that allows robust partitioning of the given labels. Our framework generalizes to settings where the input label partition is captured in a hierarchical, or treelike, structure.
Results
Briefly, scGeneFit works as follows. Given samples (cells) in a highdimensional feature space (genes), and corresponding categorical sample labels (e.g., cell type and cell state), labelaware compression methods^{13} find a projection to a lowdimensional subspace, or the space of selected markers, where the dimension is specified. Samples with the same labels are closer in the lowdimensional space than samples with different labels, when projected into the lowdimensional space. To ensure that the lowdimensional results enable marker selection, scGeneFit additionally constrains the projection so that each of the subspace dimensions must be aligned with a coordinate axis in the original space. With this constraint, each dimension in the lowdimensional space—capturing a single marker—corresponds to a single gene, and not a weighted linear combination of many genes. Fortunately, this constraint eases the computational challenge of the original projection problem, which is intractable in its most general form. This optimization becomes a linear program.
For input to our method, we use postquality control scRNAseq data with unique molecular identifier counts, a target marker set size, and a hierarchical taxonomy of cell labels. When cell labels do not exist, labels may be inferred using a clustering algorithm, or via another data modality^{4}. Similarly, the label hierarchy for input to our method can be expertprovided^{14}, or inferred via a hierarchical clustering algorithm. We assessed scGeneFit’s performance in both simulated scenarios and existing scRNAseq data.
Spectral toy model
We first investigated the behavior of scGeneFit in the context of simulated data that is often used to evaluate methods for spectral clustering. Samples of dimension d + 2 were generated such that d features are each drawn from a Gaussian distribution with mean zero and variance σ = 1, and two features, encoding the desired clusters, represent the twodimensional coordinates on one of two concentric circles with different radii; the cluster label represents whether the sample is on the inner or the outer circle (Fig. 1A; see “Methods”). We found that scGeneFit selected as markers the two features representing the circle coordinates whenever the size of the target marker set was ≥2, enabling the recovery of the sample labels. In contrast, methods that query each of the features independently are not able to identify the twodimensional coordinates.
Heteroskedastic toy model
To illustrate that scGeneFit can correctly identify multiple planted labels within the data, we simulated data corresponding to fantasy gene expression profiles from n = 1000 cells and d = 10,000 genes (Fig. 1B). We considered three different labelings of these n cells and assigned 250 genes as markers of each of three partitions and another 9250 genes as nonmarkers. Within each of the three labelings, the 250 markers for each label were drawn from a multivariate Gaussian with a 250 dimensional mean (a random sample with mean 0 and variance 1), and a dense 250 × 250 covariance matrix. The nonmarker genes were drawn from univariate Gaussian distributions with mean 0 and variance 1. Since the class labels depend on the covariance structure, no individual group of genes considering either mean or variance is predictive of class label. We set the marker set size to 10 and include as input one of two labelings. scGeneFit correctly recovers markers of the appropriate labeling. While there is no unique correct marker set, the probability of randomly selecting a correct set of k markers is (250/10,000)^{k}.
Synthetic multimodal gene expression
We next considered a synthetic model based on the intuition that cell labels may be defined by nonadditive gene interactions. We illustrate this through simulations in which gene expression is multimodal, and synthetic cell labels are defined in combinatorial ways, such as an exclusiveor (xor) relationship among a pair of genes (Fig. 1Ci–v, Table 1, and Supplementary Figs. 2–6).
For visualization purposes, we first simulated a small dataset of synthetic gene expression profiles with three distinct functional gene groups determining ten cell types. Data were simulated from multivariate normal mixtures with means specified in Table 1 as follows. A cell label is defined by a tuple of modes (A_{i}, B_{j}, C_{k}), where i = 1, 2, 3, 4, j = 1, 2, and k = 1, 2, 3, 4. The gene expression levels of each of the five markers were drawn from a Gaussian distribution, where the mean is determined by the type according to Table 1 (Fig. 1C).
We compared the performance of scGeneFit and onevsall in terms of true positive and false positive rates, using receiveroperating characteristic curves (Fig. 1C and Supplementary Figs. 3 and 4). While scGeneFit optimizes for joint cell label separation, it is able to outperform onevsall for most synthetic cell labels in onevsall tasks (which, by design, should be favorable to the onevsall marker selection methods). The performance of scGeneFit scales favorably with the number of markers used (Supplementary Figs. 3–6). This improvement is maintained when considering a larger simulation with 15 functional groups and 40 classes across 1000 cells and 10,000 genes (Supplementary Information, Supplementary Figs. 5 and 6). In this case, data were simulated from both a multivariate normal model (Supplementary Fig. 5) and a generalized multivariate normal model with a Poisson link (Supplementary Fig. 6). In both cases, scGeneFit substantially improved classification accuracy as the number of markers increased. This means that not only does scGeneFit achieve favorable accuracy with fewer markers when theoretically possible, but scGeneFit also provides a principled approach for the selection of informative markers, when the number of markers allowed experimentally exceeds the number of classes we wish to distinguish.
Markers for flat clustering
We studied the performance of scGeneFit in the context of two scRNAseq studies. To do this, we applied scGeneFit to a cord blood mononuclear cell (CBMC) study containing isolated cells from cord blood^{4}. In total, 8584 singlecell expression profiles of CBMCs from the 500 most variable genes from both human and mouse cell lines were considered. For the cell type labels, we used the reported transcriptomebased clustering that partitions CBMC types into B cells, T cells, natural killer cells, monocytes (CD14+ and CD16+), dendritic cells (DC, pDC), erythrocytes, and erythroblasts. We found that the discovered marker sets were able to recover nearly identical cell type partitions in the CBMCs. We explored the space of marker set size close to the number of cell type labels, and found that this number affects the ability to discriminate among cell labels (Fig. 2A, B). We quantified the discriminatory power of the identified marker set by performing knearest neighbor clustering on the projections of cells in lowdimensional marker space (see “Methods”). We found that the distinctions between cell type labels are largely preserved in the reduced dimensional space (Fig. 2B). scGeneFit achieved slightly better performance with fewer markers than the curated marker set based on differential association (8.05% scGeneFit label classification error vs 9.90% curated marker set error in heldout data; see “Methods”), and substantially outperformed a random marker set (30% error in heldout data).
Markers preserving hierarchical clustering
scGeneFit has similarly good performance in the context of scRNAseq data characterizing mouse cortical cell type diversity^{14}. Among the 48 cell types, cells in the mouse somatosensory cortex are organized into a hierarchy governed by five main neuronal and neuroglial types: pyramidal neurons, interneurons, astrocytes, oligodendrocytes, and microglia. We applied the hierarchical version of scGeneFit to identify 30 marker genes that allow recovery of the hierarchical label structure (Fig. 2C, D). Our marker selection technique enabled accurate recovery of the hierarchy of cell labels. These results coincide with those obtained by considering >50% more markers in the onevsall setup (Supplementary Information, Supplementary Fig. 8). At the first level of the hierarchy, scGeneFit performed on par with the onevsall method, slightly improving on identification accuracy of the well understood astrocyte population (f1score of 0.81 vs 0.74 in onevsall, Supplementary Information Table 4).
We compared the results obtained by scGeneFit at the second layer of the hierarchy using 30 or 40 markers with the onevsall method, using 48 markers (Supplementary Information). Our method with 30 and 40 markers performed on par with the onevsall method using 48 markers (classification error at the first layer of the hierarchy on heldout cells: 8.75% (30), 6.44% (40), 9.74% (onevsall); see Supplementary Table 2). In particular, scGeneFit performed well in the astrocyte, pyramidal neuron, and oligodendrocyte subpopulations (Supplementary Table 4).
Discussion
In summary, we develop a method, scGeneFit, that identifies markers to distinguish labeled cells given a structured partition (flat partition or hierarchy) of cell labels. We show that scGeneFit is able to accurately recover a set of markers of a prespecified size and allows robust labeling of cell types in scRNAseq data. scGeneFit is able to handle more complex relationships among class labels than a flat partition, making it the first approach to exploit hierarchical label structure to robustly solve the marker selection problem in an efficient and principled fashion. A key underlying assumption common to both scGeneFit and existing onevsall methods is that cell type classes are linearly separable, which is a reasonable assumption in high dimensions, but may not be true in lower dimensions. In future work, we plan to depart from this assumption by considering a more general class of nonlinear dimension reduction methods. Furthermore, we envision relaxing the categorical labeling to a manifold constraint that will allow selection of markers to place unlabeled cells at, for example, time points along cell trajectories or locations in spatial assays.
Methods
scGeneFit
In the marker selection problem, \({x}_{i}\in {{\Bbb{R}}}^{d}\) is gene expression measurements of the ith cell for d different genes. We assume the subset of the cells used for training include labels.
Setup. We model the marker selection problem as a labelaware dimension reduction method inspired by compressive classification and largest margin nearest neighbor algorithms^{13}. One such method, SqueezeFit, aims to find a projection to the lowestdimensional subspace for which samples with different labels remain farther apart than samples with the same label. Consider a dataset \({\mathcal{D}}={\{({x}_{i},{y}_{i})\}}_{i\in {\mathcal{I}}}\) in \({{\Bbb{R}}}^{d}\times [k]\), here x_{i} is a sample and y_{i} is its corresponding label. We denote \({\mathcal{Z}}({\mathcal{D}}):= \{{x}_{i}{x}_{j}:i,j\in {\mathcal{I}},{y}_{i}{\,}\ne{\,} {y}_{j}\}\) as the vector difference between samples with different labels.
The following optimization problem corresponds to finding the orthogonal projection to the lowestdimensional space that maintains a prescribed separation Δ > 0 between samples with different labels:
Here, Π is the lowdimensional projection, and Δ > 0 is the desired minimum distance between projected samples Πx_{i} and Πx_{j} with different labels. This parameter reflects a fundamental tension in compressive classification: Δ should be large so as to enable sufficient separation of samples with different labels in the lowdimensional space, and simultaneously the projected space rank Π should be of low dimension so that this projection effectively reduces the dimension of the sample. To address the intractability of the optimization in Eq. (1), a convex relaxation technique is used^{13}:
The relaxation extends the feasible set from the set of orthogonal projections, where optimization is intractable—matrices Π that satisfy the constraints in Eq. (1)—to the set of positive semidefinite matrices—matrices M that satisfy the constraints in Eq. (2)—where one can use standard optimization toolboxes to find the global optimum in polynomial time^{15}. The trace norm of M corresponds to the ℓ_{1}norm of the vector of eigenvalues of M. Therefore, minimizing the trace norm \(\mathrm{tr}\,M\) encourages M to be low rank^{16}.
Marker selection
scGeneFit finds a prescribed number of gene markers, so that when the samples are projected onto those marker dimensions they exhibit the same separation of cells with different labels as in the original gene space. The objective of selecting a handful of marker genes in mathematical terms translates to finding a projection onto a subset of the coordinates; specifically, M is a diagonal matrix with entries α_{1}, …, α_{d}. This constraint simplifies the optimization to a linear program:
The objective’s same ℓ_{1} trace norm promotes sparsity in the matrix M (ref. ^{16}). As a result, numerical experiments show that the solution of Eq. (3) is in fact sparse, and the dimension of the projection—the number of selected markers—is smaller than the dimension of the original space.
In order for this method to be useful in practice, we modify the optimization formulation Eq. (3) to allow for outliers, and we specify the dimension of the projected space (i.e., the number of markers) s, leading to the scGeneFit optimization problem:
Here β is a slack vector that quantifies how much the margin between sets with different labels is violated for each constraint^{17}. β is indexed by the elements \(z\in {\mathcal{Z}}({\mathcal{D}})\) and its dimension equals that of the constraint set \({\mathcal{Z}}({\mathcal{D}})\).
Incorporating label hierarchies
Consider a hierarchical partition of the samples denoted by T_{σ}, where σ is an ordered set of indices. Say \({T}_{\sigma ^{\prime} }\subset {T}_{\sigma }\), if σ is a prefix of \(\sigma ^{\prime}\) (for instance T_{ijk} ⊂ T_{ij} ⊂ T_{i}, corresponding to a threelevel hierarchy; see Fig. 3 for a concrete example).
When provided with the structured relationship of the labels, scGeneFit solves the optimization problem (Eq. (4)), replacing the set of constraints \({\mathcal{Z}}({\mathcal{D}})\) to \({{\mathcal{Z}}}_{T}({\mathcal{D}})\) to reflect the hierarchical information. In detail,
Alternative optimization constraints
The optimization problem described above for scGeneFit is effective when the label structure is a flat (star shaped) hierarchy; however, when the label structure has additional layers, we would like to add an additional constraint to encourage labels that are closer in hierarchical space to also be closer in the projected (marker) space. In particular, let \({{\mathcal{I}}}_{t}\) be the set of indices i ∈ {1, …, n} of cell profiles with label t, and let n_{t} be the number of cells in that set. The projected center of the profiles labeled t is Πc_{t}, with \({c}_{t}=\frac{1}{{n}_{t}}{\sum }_{i\in {{\mathcal{I}}}_{t}}{x}_{i}\). The desired constraints can thus be formally encoded as \(\parallel {{\Pi }}{x}_{i}{{\Pi }}{c}_{t^{\prime} }{\parallel }^{2}\parallel {{\Pi }}{x}_{i}{{\Pi }}{c}_{t}{\parallel }^{2}\ge {{{\Delta }}}^{2}\), if y_{i} = t and \(t{\,}\ne{\,} t^{\prime}\).
The hierarchical scGeneFit objective encodes the intuition that the distance between labeled cells should reflect the label distance in the given label hierarchy (Supplementary Information). This is given by the linear program
where, as before, β is a slack vector.
Highdimensional optimization
The optimization procedure allows the analysis of thousands of genes at a time as follows. As before, let c_{t} be the empirical gene expression mean of a class t and consider constraints of the form constraints of the form ∥c_{t} − c_{s}∥ > Δ. To insure that the number of constraints remains within the order of the number of classes, we only consider constraints over the cells with gene expression profiles closest to the cell type centroids in the lowerdimensional space. Our current implementation finds 50 markers in a simulated dataset with 10,000 cells (40 synthetic cell labels), with 10,000 genes in ~15 min (using a standard MacBook pro laptop)
Hyperparameter setting
In scGeneFit, the two main hyperparameters are: s (the target number of markers) and Δ (the target separation of samples with different labels or centers of different classes). In our code, we implement a dual annealing method that optimize for the value of Δ for a given training set, test set, and classifier. The other hyperparameters scGeneFit uses are set to make the problem smaller in case the computational power doesn’t allow the user to run the optimization in the entire dataset (like capping the number of constraints to be used or sampling the dataset to generate fewer constraints). Such hyperparameters are fully described in the Supplementary Material.
Optimization of the linear program and scalability
The optimization problem (Eq. (4)) is a linear program that we solve with scipy linear programming solver (scipy.opitimize.linprog). The computational bottleneck of the linear program is the number of constraints in \({\mathcal{Z}}({\mathcal{D}})\), which a priori scales quadratically with the number of cells. In order to resolve this issue and make the optimization more efficient, we use several strategies. The simplest one is to select the most relevant constraints in Eq. (4) by considering, for each sample, the Knearest neighbors from each of the other classes. Another strategy we use is to randomly select a subsample, run scGeneFit on the subsample, and project the heldout samples using the markers chosen on the subsample.
The most efficient strategy we use is to set constraints based on the empirical centroids of clusters, as discussed in the highdimensional example above. However, such a strategy has the underlying assumption that classes are linearly separable.
A detailed comparison among all variants of our method is documented in our software release^{18}.
Runtime
The computational complexity of linear programming is an open problem in optimization, but it is known to be asymptotically upper bounded by O(N^{2.5}), where N is the size of the problem (number of variables plus number of constraints)^{19}. For the particular experiments we perform, we solve scGeneFit with 4000 variables and 6000 constraints in <40 s, on Matlab 2018a running on an Intel Xeon CPU 1.90 GHz using <4 Gb of memory.
Dataset description and preprocessing
Zeisel
Cells in the mouse somatosensory cortex (S1) and hippocampal CA1 region were classified based on 3005 singlecell transcriptomes via scRNAseq. The nine major molecularly distinct classes of cells (layer 1) were obtained through a divisive biclustering method, and corresponding subclasses of cells (layer 2) were obtained through repeating the biclustering method within each major class^{14}.
CBMC
The CBMCs were produced with CITEseq^{4}. Singlecell RNA data processing and filtering were performed as specified in ref. ^{4}. In particular, the data are sparse and normalized by \({\mathrm{log}\,}_{2}(1+X)\).
Evaluation metrics
In order to evaluate the performance of scGeneFit, we first split the data in training (70%) and test (30%). We train a Knearest neighbor classifier on the training data (for K = 3, 5, 15) after projection to the corresponding markers (computed on the entire dataset). We evaluate the classifier on the test data and report the misclassification error with respect to the known classes (Supplementary Table S1). We also evaluate the performance of kmeans clustering, using kmeans++, reporting the smallest misclassification error among ten random initializations. For the hierarchical dataset in Zeisel, we evaluate the performance at the second level. Finally, we provide precision, recall, and f1metrics for the classification tasks of both synthetic and real datasets (Supplementary Information).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The processed data we used in the experiments is available in the scGeneFitpython Github repository^{18}, as well as the scGeneFit package distributed by the Python Package Index.
Code availability
We produced a python package that implements the marker selection algorithm described in this paper^{18}. One can simply install the package pip install scGeneFit. The code to conduct the simulations and reproduce the analyses is available at https://github.com/solevillar/scGeneFitpython.
References
Macosko, E. Z. et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Zhu, L., Lei, J., Devlin, B. & Roeder, K. et al. A unified statistical framework for single cell and bulk RNA sequencing data. Ann. Appl. Stat. 12, 609–632 (2018).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865 (2017).
Codeluppi, S. et al. Spatial organization of the somatosensory cortex revealed by cyclic smFISH. Nat. Methods 15, 932–935 (2018).
Raj, A., Van Den Bogaard, P., Rifkin, S. A., Van Oudenaarden, A. & Tyagi, S. Imaging individual mRNA molecules using multiple singly labeled probes. Nat. Methods 5, 877 (2008).
Veluchamy, J. P. et al. Standardized and flexible eight colour flow cytometry panels harmonized between different laboratories to study human NK cell phenotype and function. Sci. Rep. 7, 43873 (2017).
Finak, G. et al. Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 16, 278 (2015).
McDavid, A. et al. Data exploration, quality control and testing in singlecell qPCRbased gene expression experiments. Bioinformatics 29, 461–467 (2012).
Delaney, C. et al. Combinatorial prediction of marker panels from singlecell transcriptomic data. Mol. Syst. Biol. 15, e9005 (2019).
Reboredo, H., Renna, F., Calderbank, R. & Rodrigues, M. R. Bounds on the number of measurements for reliable compressive classification. IEEE Trans. Signal Process. 64, 5778–5793 (2016).
Weinberger, K. Q. & Saul, L. K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, 207–244 (2009).
McWhirter, C., Mixon, D. G. & Villar, S. Squeezefit: Labelaware dimensionality reduction by semidefinite programming. IEEE Trans. Inf. Theory 66, 3878–3892 (2019).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science 347, 1138–1142 (2015).
Grant, M. & Boyd, S. CVX: Matlab software for disciplined convex programming, version 2.1. https://cvxr.com/cvx (2014).
Srebro, N. & Shraibman, A. In International Conference on Computational Learning Theory, 545–560 (Springer, 2005).
Cortes, C. & Vapnik, V. Supportvector networks. Mach. Learn. 20, 273–297 (1995).
Villar, S. scGeneFit Python GitHub repository. https://github.com/solevillar/scGeneFitpython (2020).
Cohen, M. B., Lee, Y. T. & Song, Z. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, 938–942 (2019).
Acknowledgements
This work was performed while S.V. was with New York University, and B.D. was with Princeton University. S.V. was partially supported by NSF DMS 2044349, EOARD FA95501817007, and the NSFSimons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (MoDL; NSF DMS 2031985). D.G.M. was partially supported by AFOSR FA95501810107 and NSF DMS 1829955. B.E.E. was supported by a grant from the Helmsley Trust, a grant from the NIH Human Tumor Atlas Research Program, NIH NHLBI R01 HL133218, and NSF CAREER AWD1005627.
Author information
Authors and Affiliations
Contributions
B.D. and S.V. conceived the idea, designed and implemented the method, and ran the experiments. S.V., B.D., and B.E.E. analyzed the results and drafted the manuscript. D.M. provided guidance regarding the optimization framework for dimensionality reduction. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
B.E.E. is on the Scientific Advisory Board of Freenome, Celsius Therapeutics, and Crayon Bio. B.E.E. is a consultant for Freenome and was employed by Genomics plc during 20192020. Otherwise, the authors declare that they have no additional competing interests.
Additional information
Peer review information Nature Communications thanks Tallulah Andrews and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dumitrascu, B., Villar, S., Mixon, D.G. et al. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun 12, 1186 (2021). https://doi.org/10.1038/s41467021214534
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021214534
This article is cited by

Predictive and robust gene selection for spatial transcriptomics
Nature Communications (2023)

SMaSH: a scalable, general marker gene identification framework for singlecell RNAsequencing
BMC Bioinformatics (2022)

Temporal modelling using singlecell transcriptomics
Nature Reviews Genetics (2022)

geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNAseq
Genome Biology (2021)

Feature selection revisited in the singlecell era
Genome Biology (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.