Optimal marker gene selection for cell type discrimination in single cell analyses

Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a dataset given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods, using a computationally efficient and principled optimization.

• method: 'centers', 'pairwise', or 'pairwise centers'.
- 'centers' adds constraints requiring that every two consecutive classes have their empirical centers separated after projection onto the selected markers. According to our numerical experiments this is the least general but most efficient and stable set of constraints (the underlying assumption is that the classes are linearly separable, which appears to hold in high dimensions).
- 'pairwise' adds constraints requiring that points from different classes are separated by a minimal distance after projection onto the selected markers. Since using all pairwise constraints would typically make the problem computationally too expensive, the constraints are sampled (sampling rate) and capped (n neighbors, max constraints).
- 'pairwise centers' requires that, after projection onto the selected markers, every point lies closer to its own empirical class center than to every other class center (sampling and capping also apply here).
• epsilon: constraints take the form expr > ∆, where ∆ is chosen as epsilon times the norm of the smallest constraint (default 1). This is the most important parameter of the problem: it determines the scale of the constraints, while the remaining parameters only control the size of the LP so it fits limited computational resources. We include a function that finds the optimal value of epsilon given a classifier and a training/test split; an example of this optimization is provided in https://github.com/solevillar/scGeneFit-python/blob/master/examples/scGeneFit_functional_groups.ipynb
• sampling rate: (if method=='pairwise' or 'pairwise centers') selects constraints from a random sample of proportion sampling rate (default 1)
• n neighbors: (if method=='pairwise') chooses the constraints from the n neighbors nearest neighbors (default 3)
• max constraints: maximum number of constraints to consider (default 1000)
• redundancy: (if method=='centers') not all pairwise constraints are considered, only those between centers of consecutive labels plus a random fraction of the remaining constraints given by redundancy; if redundancy==1, all constraints between pairs of centers are considered
• verbose: whether to print information such as the size of the LP or the elapsed time (default True)

We implemented a search of the parameter space using dual annealing to find a good set of parameters for scGeneFit.
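For intuition, the 'centers' variant can be sketched as a small linear program: maximize a margin ∆ subject to the selected genes jointly keeping all pairs of empirical class centers separated. This is an illustrative reimplementation, not the released scGeneFit code; the function name `centers_lp_markers` and the use of scipy's `linprog` are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

def centers_lp_markers(X, labels, num_markers):
    """Toy 'centers'-style LP: maximize the margin Delta subject to the
    selected genes keeping every pair of empirical class centers apart."""
    classes = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in classes])
    d = X.shape[1]
    # one constraint per pair of centers: sum_j x_j * (c_t - c_s)_j^2 >= Delta
    rows = [(centers[t] - centers[s]) ** 2
            for t in range(len(classes)) for s in range(t + 1, len(classes))]
    A = np.array(rows)
    # variables: x_1..x_d in [0, 1] plus Delta (last); minimize -Delta
    c = np.zeros(d + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-A, np.ones((len(rows), 1))])        # Delta - A x <= 0
    budget = np.concatenate([np.ones(d), [0.0]])[None, :]  # sum_j x_j <= m
    res = linprog(c,
                  A_ub=np.vstack([A_ub, budget]),
                  b_ub=np.concatenate([np.zeros(len(rows)), [num_markers]]),
                  bounds=[(0, 1)] * d + [(0, None)])
    x = res.x[:d]
    return np.argsort(-x)[:num_markers]  # genes with the largest LP weights
```

The LP relaxation assigns each gene a weight in [0, 1] under a total budget of num_markers; the markers are read off as the genes with the largest weights.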

Supplementary Note 2. Numerical performance of scGeneFit
In this section we consider an expanded set of simulations in which we compare and contrast scGeneFit and one-vs-all marker selection procedures.
One dataset, multiple possible clusterings: heteroskedastic normal case
We consider the case where the same dataset might allow for multiple clusterings, each with its respective set of markers. To this end, we consider a small synthetic example with n = 1000 cells and d = 200 genes. We considered three different labelings of these n cells and assigned 50 genes as markers of each partition and another 50 genes as non-markers. Within each of the three labelings, the 50 markers for each label are drawn from a multivariate Gaussian with a 50-dimensional mean and a dense 50 × 50 covariance matrix. The 50 genes that are not markers are drawn from a univariate Gaussian distribution with mean 0 and variance 1. We set the marker set size to 20 and include as input one of the three labelings. We find that scGeneFit recovers markers of the appropriate labeling. Notably, since the third labeling is a finer partition of the first labeling, our method selects markers for both the first and third labelings when given the first labeling, but not vice versa. Discriminative markers were correctly recovered by scGeneFit for simulated samples drawn from mixtures of Gaussians corresponding to the three label sets (Supplementary Figure 1). The probability of observing k markers for a given clustering by chance is bounded above by (50/1000)^k; moreover, k itself has to be large enough to allow for covariance estimation, and the search is thus expensive.
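The simulation above can be sketched as follows. The label counts per partition (here 2, 3 and 4) and the covariance construction are illustrative assumptions, and, unlike the paper's setup, the three labelings in this sketch are drawn independently rather than nested.

```python
import numpy as np

def simulate_heteroskedastic(n_cells=1000, rng=None):
    """200 genes = three blocks of 50 marker genes (one block per labeling,
    dense covariance within each label) plus 50 N(0, 1) non-marker genes."""
    rng = np.random.default_rng(rng)
    # three independent labelings with 2, 3 and 4 labels (illustrative counts)
    labelings = [rng.integers(0, k, n_cells) for k in (2, 3, 4)]
    X = np.empty((n_cells, 200))
    for b, labels in enumerate(labelings):
        cols = slice(50 * b, 50 * (b + 1))
        for lab in np.unique(labels):
            idx = labels == lab
            mean = rng.normal(0, 2, 50)
            A = rng.normal(0, 0.3, (50, 50))
            cov = A @ A.T + 0.1 * np.eye(50)  # dense PSD covariance
            X[idx, cols] = rng.multivariate_normal(mean, cov, idx.sum())
    X[:, 150:] = rng.normal(0, 1, (n_cells, 50))  # non-marker genes
    return X, labelings
```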

Synthetic functional group model
We consider a synthetic model based on the intuition that genes interact with one another in a complex manner (e.g., their products belong to related pathways), giving rise to biological modules or functional groups that are utilized differently across cell types. Here, we consider functional groups to be determined by a (possibly overlapping) set of genes. Every functional group has a multi-modal structure: groups of genes can have different levels of activity (subgroups). Cell types are then determined by a unique combination of the different instances of the functional groups (see Supplementary Table 1 and Supplementary Figures 2 and 3 for a specific example). To allow for clear visualization and metric evaluation, we consider a small toy example with 9 cell types and only 100 cells. The small toy example makes it easier to observe the effect of the marker selection at the individual cluster level. In particular, we compare and contrast the performance of the two approaches in terms of true positive and false positive rates, as illustrated by receiver operating characteristic curves. This shows that while scGeneFit optimizes for joint separation, it is able to outperform one-vs-all for most synthetic cell types in one-vs-all tasks (which, by design, should be favorable to the one-vs-all marker selection procedures).

Supplementary Figure 1 (caption): Samples drawn from mixtures of Gaussians corresponding to three label sets with two (A), three (B), and four (C) labels, respectively. Each row is a single sample and each column is a single feature; the shade of blue corresponds to the label assignment, with samples with the same label sharing a color. The yellow lines correspond to the markers selected by scGeneFit.
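A minimal generator for this kind of functional-group model might look as follows; the number of groups, activity levels per group, genes per group, and the Gaussian noise scale are all illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from itertools import product

def functional_group_data(levels=(4, 2, 4), genes_per_group=5,
                          cells_per_type=10, rng=None):
    """Each functional group g has levels[g] activity levels; a cell type is
    one level choice per group; expression is Gaussian around the mean vector
    attached to that group's chosen level."""
    rng = np.random.default_rng(rng)
    means = [rng.normal(0, 3, (k, genes_per_group)) for k in levels]
    types = list(product(*[range(k) for k in levels]))  # all level combinations
    X, y = [], []
    for t, combo in enumerate(types):
        mu = np.concatenate([means[g][lvl] for g, lvl in enumerate(combo)])
        X.append(rng.normal(mu, 1.0, (cells_per_type, mu.size)))
        y += [t] * cells_per_type
    return np.vstack(X), np.array(y)
```

With the default levels (4, 2, 4) this produces 4 × 2 × 4 = 32 cell types, mirroring the example of Supplementary Table 1.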

Single cell count data
In Supplementary Figure 4, as suggested by the reviewers, we consider a count-based model. The data are generated element-wise using a Poisson link, Y_ij = log(Poisson(exp(X_ij)) + 1), where X_ij is an element of the matrix X generated from the mixture of multivariate normal distributions defined by the functional group example (Supplementary Figures 2 and 3). This follows statistical models with a rich history in describing count data [1, 2]. In Supplementary Figure 5, a similar model is explored at the scale of a real dataset.
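A sketch of this count transform, assuming the log1p reading of the formula (counts are drawn from Poisson rates exp(X_ij), then shifted by 1 before the log); the rate clipping is our own numerical safeguard, not part of the described model.

```python
import numpy as np

def poissonize(X, rng=None):
    """Element-wise count model: draw Poisson counts with rate exp(X_ij),
    then apply a log1p transform."""
    rng = np.random.default_rng(rng)
    rates = np.exp(np.clip(X, None, 20))  # clip: overflow safeguard (ours)
    return np.log(rng.poisson(rates) + 1.0)
```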

Large scale computation
In Supplementary Figure 5 we consider a larger-scale version of the example described above. In this larger-scale experiment, we generate 40 classes from 15 different functional groups. Each gene is randomly assigned to a gene expression subgroup. While in this case there are multiple marker sets suitable for discriminating among a given clustering, the probability of detecting a good marker set at random is negligible (O(1/40^15)).
In this specific example we have 1000 cells and 8,798 genes. scGeneFit finds the markers for this example in 10-12 minutes on a standard 2018 MacBook Pro (or on Google Colab with a standard free account). In order to make our algorithm work in this larger-scale setting we modify our method as explained in the next section as well as in the Methods section of our manuscript.
In order to deal with thousands of genes we consider a different way to choose the set of constraints. In this version we consider one variable per gene and constraints of the form c_t − c_s > ∆, where c_t is the empirical center of the t-th class. We aim for a number of constraints of the order of the number of classes; therefore we do not consider the constraints over all pairs of centers but only O(1) constraints per center. This implementation is quite efficient, but it relies on the underlying assumption that the classes are linearly separable, which is reasonable in high dimensions but may not be in lower dimensions. All these different implementations are documented in our released code. In Supplementary Figure 5 we show the performance of scGeneFit on a larger-scale problem and its comparison with one-vs-all.

Supplementary Table 1 (caption): In this example we have 32 = 4 × 4 × 2 possible cell types. For instance, if we consider the cell types T_1 = (A_1, B_1, C_4) and T_2 = (A_2, B_2, C_3), it would suffice to use G_3 as a marker to distinguish the two. However, to distinguish all 32 possible cell types one needs all 5 genes. We use this functional groups example in Supplementary Figures 2 and 3.
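The idea of keeping only O(1) constraints per center can be sketched as follows; pairing each center with its nearest other centers is one plausible sampling rule, not necessarily the one used in the released code.

```python
import numpy as np

def center_constraints(centers, per_center=2):
    """Keep only each class center's per_center nearest other centers,
    giving O(T) constraints instead of all ~T^2/2 center pairs."""
    T = centers.shape[0]
    D = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(D, np.inf)  # never pair a center with itself
    pairs = set()
    for t in range(T):
        for s in np.argsort(D[t])[:per_center]:
            pairs.add((min(t, int(s)), max(t, int(s))))
    return sorted(pairs)
```

Nearby centers are the hardest to separate, so constraining only those pairs tends to be the binding part of the full constraint set.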

CBMC and Zeisel datasets
In this section we report the performance of the scGeneFit 'pairwise' and 'centers' methods on the CBMC and Zeisel datasets in comparison with one-vs-all (Supplementary Tables 2 and 3, and Supplementary Figure 7). More results are presented in the main manuscript as well as in the GitHub repository https://github.com/solevillar/scGeneFit-python/.
In Supplementary Table 2, K-nearest neighbor based metrics were used to evaluate the performance on the flat clustering. Due to the small number of points per subcluster, this metric is not suitable for evaluating the lower level of the hierarchy.
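The K-nearest neighbor evaluation can be sketched as a plain misclassification rate on the marker-restricted data; this is an illustrative numpy version, with k = 3 and the Euclidean metric as our assumptions.

```python
import numpy as np

def knn_error(X_train, y_train, X_test, y_test, k=3):
    """Majority vote over the k nearest training cells (Euclidean metric);
    returns the misclassification rate on the test set."""
    D = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(D, axis=1)[:, :k]  # indices of k nearest training cells
    pred = np.array([np.bincount(y_train[row]).argmax() for row in nn])
    return float((pred != y_test).mean())
```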

Precision Recall Values For Hierarchical Clustering
scGeneFit provides a method for marker selection according to a hierarchical clustering structure, described in the Methods section of the main manuscript. In Supplementary Figure 8 and Supplementary Table 4 we report its performance on the Zeisel dataset using hierarchical labels.

Supplementary Figure 2 (caption): In this model, a cell type is defined by its group (A_i, B_j, C_k) where i = 1, 2, 3, 4, j = 1, 2 and k = 1, 2, 3, 4. The gene expression G_s for s = 1, 2, 3, 4, 5 is drawn from a Gaussian distribution whose mean is determined by the type according to Supplementary Table 1. (Left) We draw 10 cells from each of the 32 possible cell types. A random copy of each of the genes is drawn 100 consecutive times. In this example 5 markers are necessary and sufficient to distinguish all 32 classes. Note that even though all genes could technically be markers, the probability of selecting one from each class by choosing 5 genes at random is negligible. (Right) In this example we consider 9 cell types. The first 3 genes are repeated 10 times whereas the last two genes are repeated 100 times. The results of one-vs-all and scGeneFit for these synthetic data are shown in Supplementary Figure 3.

Detected Markers
scGeneFit identifies both cell-type specific markers (e.g., Pde1a, Cnr1, Enpp2, Mef2c) and markers that show broad patterns of behavior (e.g., the low expression of Tmsb4x is particular to astrocytes). Further interpreting the grammar governing this behavior is the subject of future work, and is outside the scope of the current paper.
[2] E. Pierson and C. Yau. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16(1):1-10, 2015.

Supplementary Figure 3 (caption): (Top left) A tSNE plot of the original data of Supplementary Figure 2, followed by a tSNE plot of the data restricted to the markers selected by the one-vs-all procedure. We observe that some classes are well distinguished (for instance 3, 4 and 5) but other classes are not (for instance 0, 1, 2). This is made quantitative in the second-row plot. (Top right) A tSNE plot of the original data, followed by a tSNE plot of the data restricted to the 9 markers selected by the scGeneFit procedure. (Second row) scGeneFit performance improves significantly as the number of markers is allowed to increase: we plot the classification accuracy of a K-nearest neighbors classifier as a function of the number of markers. (Third row) Receiver operating characteristic curves for (left) one-vs-all and (right) scGeneFit.

Supplementary Figure 5 (caption): We generated 40 classes from 15 different functional groups. We set each gene type to be repeated a random number of times between 10 and 1000. In this specific random example we have 1000 cells and 10,000 genes. Each run of scGeneFit takes around 10 minutes on Google Colab (standard account). (Top) Representation of the cells and genes as a matrix. (Middle) tSNE plots of (from left to right) the original data, the data restricted to the 40 markers selected by one-vs-all, the data restricted to the 40 markers selected by scGeneFit, and the data restricted to 45 markers selected by scGeneFit. (Bottom) Classification accuracy of scGeneFit and one-vs-all.

Supplementary Figure 6 (caption): We generated 40 classes from 15 different functional groups. We set each gene type to be repeated a random number of times between 10 and 1000. In this specific random example we have 1000 cells and 10,000 genes. Each run of scGeneFit takes around 10 minutes on Google Colab (standard account). (Top) tSNE plots of (from left to right) the original data, the data restricted to the 40 markers selected by one-vs-all, the data restricted to the 40 markers selected by scGeneFit, and the data restricted to 50 markers selected by scGeneFit. (Bottom) Classification accuracy of scGeneFit and one-vs-all.

Supplementary Tables 2 and 3 (caption): … from K-nearest neighbor and k-means after compressing with random markers, markers chosen as most significantly associated with a cluster, and scGeneFit with the 'pairwise' method (a lower score is better). (The Id column takes identity to be the "compression" operator.) For the first rows we split the data into 70% training (6032 cells for CBMC, 2104 for Zeisel) and 30% test (2585 cells for CBMC, 901 for Zeisel). We train a K-nearest neighbor classifier on the compressed training set and report its misclassification error on the compressed test set. The last row reports the smallest misclassification error of k-means clustering among 10 runs with different random seeds. Due to the high variance in the number of points per cluster, k-means is not a suitable clustering method for CBMC.

Supplementary Table 4 (caption): We find the markers using scGeneFit and one-vs-all on a training set comprising 75% of the dataset (selected at random), and we evaluate performance on the remaining 25%. The metrics (precision, recall, F1 score) are computed at the first level of the hierarchy. scGeneFit improves on most categories. The best performers are shown in bold. It is worth mentioning that the cluster labels of the Zeisel dataset were obtained with a one-vs-all analysis in mind.