Addressing the looming identity crisis in single cell RNA-seq

Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine the replicability of these studies. Meta-analysis of rapidly accumulating data is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that allows researchers to quantify the degree to which cell types replicate across datasets, and to rapidly identify clusters with high similarity for further testing. We first measure the replicability of neuronal identity by comparing more than 13 thousand individual scRNA-seq transcriptomes, then assess cross-dataset evidence for novel pyramidal neuron and cortical interneuron subtypes identified by scRNA-seq. We find that 24/45 cortical interneuron subtypes and 10/48 pyramidal neuron subtypes have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data for differential expression and provide lists of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types and subtypes with high accuracy, indicating many of the transcriptional changes characterizing cell identity are pervasive and easily detected.


Introduction 23
Single cell RNA-sequencing (scRNA-seq) has emerged as an important new technology 24 enabling the dissection of heterogeneous biological systems into ever more refined cellular 25 components. One popular application of the technology has been to try to define novel cell 26 subtypes within a given tissue or within an already refined cell class, as in the lung (Treutlein et  In order to answer this, we turned to the issue of cell diversity in the brain, a prime target of 40 scRNA-seq as neuron diversity is critical for construction of the intricate, exquisite circuits 41 underlying brain function. The heterogeneity of brain tissue makes it particularly important that 42 results be assessed for replicability, while its popularity as a target of study makes this goal 43 particularly feasible. Because a primary aim of neuroscience has been to derive a taxonomy of 44 cell types (Ascoli et al., 2008), already more than twenty single cell RNA-seq experiments have 45 been performed using mouse nervous tissue (Poulin et al., 2016). Remarkable strides have 46 been made to address fundamental questions about the diversity of cells in the nervous system, 47 including efforts to describe the cellular composition of the cortex and hippocampus (Tasic et 48 al., 2016;Zeisel et al., 2015), to exhaustively discover the subtypes of bipolar neurons in the 49 retina (Shekhar et al., 2016), and to characterize similarities between human and mouse 50 midbrain development (La Manno et al., 2016). In spite of this wealth of data, there have been 51 few attempts to compare, validate and substantiate cell type transcriptional profiles across 52 scRNA-seq datasets, and no systematic or formal method has been developed for 53 accomplishing this task. 54 To address this gap in the field, we propose a simple, supervised framework, MetaNeighbor 55 (meta-analysis via neighbor voting), to assess how well cell type-specific transcriptional profiles 56 replicate across datasets. Our basic rationale is that if a cell type has a biological identity rooted 57 in the transcriptome then knowing its expression features in one dataset will allow us to find 58 cells of the same type in another dataset. We make use of the cell type labels supplied by data 59 providers, and assess the correspondence of cell types across datasets by taking the following 60 approach (see schematic, Figure  2) Next, we do cross-dataset validation: we hide all cell type labels ('identity') for one 66 dataset at a time. This dataset will be used as our test set. Cells from all other datasets 67 remain labeled, and are used as the training set. 68 3) Finally, we predict the cell type labels of the test set: we use a neighbor voting algorithm 69 to predict the identity of the held-out cells based on their similarity to the training data. 70 Conceptually, this resembles approaches for the validation of sample clustering (Dudoit et al.,71 2002; Kapp and Tibshirani, 2007) but it has been adapted to operate from within a supervised 72 learning framework. This permits both systematic scoring and carefully defined control 73 experiments to investigate the data features that drive high performance. Our implementation is 74 extremely fast and robust to technical differences between experiments; because prediction is 75 performed only within an individual dataset at a time, we are able to keep many aspects of 76 technical variation constant. This essentially controls for any dataset specific effects that would 77 otherwise swamp the subtler cell identity signal. The method provides a score that indicates the 78 degree to which a cell type replicates for each gene set that is tested. This means that 79 MetaNeighbor doubles as a low-tech 'feature selection tool' that we can use to identify the 80 transcriptional features that are most discriminative between cell types. By comparing the 81 scores returned from using Gene Ontology (GO) functions ("functional gene sets") or sets of 82 randomly chosen genes ("random gene sets"), we can determine whether co-expression of 83 specific gene sets is characteristic of particular cell types, and thus important for cell function or 84

identity. 85
We evaluate cell identity by taking sequential steps according to the basic taxonomy of brain 86 cells: first classifying neurons vs. non-neuronal cells across eight single cell RNA-seq studies, 87 then classifying cortical inhibitory neurons vs. excitatory neurons, and for our final step, we align 88 interneuron and pyramidal cell subtypes across three studies. Critically, we discover that that 89 almost any sufficiently large and highly variable set of genes can be used to distinguish between 90 cell types, suggesting that cell identity is widely represented within the transcriptome. 91 Furthermore, we find that cross-dataset analysis of pyramidal neurons results in broad definition 92 of cortical vs. hippocampal types, and find evidence for the replication of five layer-restricted 93 subtypes. In contrast, we find that cortical interneuron subtypes show clear lineage-specific 94 structure, and we readily identify 11 subtypes that replicate across datasets, including 95 Chandelier cells and five novel subtypes defined by transcriptional clustering in previous work. 96 Meta-analysis of differential expression across these highly replicable cortical interneuron 97 subtypes revealed evidence for canonical marker genes such as parvalbumin and somatostatin, 98 as well as new candidates which may be used for improved molecular genetic targeting, and to 99 understand the diverse phenotypes and functions of these cells. 100

Assessing neuronal identity with MetaNeighbor 101
We aimed to measure the replicability of cell identity across tasks of varying specificity. 102 Broadly, these are divided into tasks where we are recapitulating known cell identities, and ones 103 where are measuring the replicability of novel cell identities discovered in recent research. The 104 former class of task is the focus of this subsection: first, by assessing how well we could 105 distinguish neurons from non-neuronal cells ("task one"), and next assessing the discriminability 106 of excitatory and inhibitory neurons ("task two"). As detailed in the methods, MetaNeighbor 107 outputs a performance score for each gene set and task. This score is the mean area under the 108 receiver operator characteristic curve (AUROC) across all folds of cross-dataset validation, and 109 it can be interpreted as the probability that we will rank a positive higher than a negative (e.g. 110 neuron vs. non-neuronal cell, when using neurons as the positive label set) based on the 111 expression of a set of genes. This varies between 0 and 1, with 1 being perfect classification, 112 0.5 meaning that we have performed as well as if we had randomly guessed the cell's identity, 113 and 0.9 or above being extremely high. Comparison of scores across gene sets allows us to 114 discover their relative importance for defining cell identity. 115 As described above, in task one we assessed how well we could identify neurons and non-116 neuronal cells across eight datasets with a total of 13928 cells (Table S1). Although this was 117 designed to be fairly simple, we were surprised to find that AUROC scores were significantly 118 higher than chance for all gene sets tested, including all randomly chosen sets (AUROC all 119 sets =0.78 ± 0.1, Figure 2A). Reassuringly, a bootstrapped sampling of the datasets showed a 120 trend toward increased performance with the inclusion of additional training data, indicating that 121 we are recognizing an aggregate signal across datasets ( Figure S1). However, the significant 122 improvement of random sets over the null means that prior knowledge about gene function is 123 not required to differentiate between these cell classes. Randomly chosen sets of genes have 124 decidedly non-random expression patterns that enable discrimination between cell types. 125 Task two aimed to assess how well we could discriminate between cortical excitatory and 126 previous results, we saw that AUROC scores were significantly higher than chance 129 (AUROC=0.69 ± 0.1, Figure 2B), suggesting that transcriptional differences are likely to be 130 encoded in a large number of genes. 131 Consistent with the view that a large fraction of transcripts are useful for determining cell 132 identity, we found a positive dependency of AUROC scores on gene set size, regardless of 133 whether genes within the sets were randomly selected or shared some biological function 134 ( Figure 2B). This was further supported by a comparison of scores for task one using 100 sets 135 of either 100 or 800 randomly chosen genes. AUROC score distributions and means were 136 significantly different, with sets of 100 genes having lower scores but higher variability in 137 performance, whereas sets of 800 genes were more restricted in variance and gave higher 138 performance on average ( Figure 2C, AUROC 100 =0.80 ± 0.05, AUROC 800 =0.90 ± 0.03, p<2.2E-139 16, Wilcoxon rank sum test). The variability in performance observed while keeping set size 140 constant suggests that even in random sets, there are transcriptional features that contribute to 141 cell identity. We delved into this further by comparing AUROC scores across gene sets chosen 142 based on their mean expression as we have previously shown that this is a critical factor to 143 control for in evaluating single cell gene co-expression (Crow et al., 2016). We performed task 144 one again using expression-level based gene sets and found a strong positive relationship 145 between expression level and our ability to classify cells ( Figure 2D, r s =0.9). 146 These results provide evidence that MetaNeighbor can readily identify cells of the same type 147 across datasets, without relying on specific knowledge of marker genes. In these two examples, 148 all cells could be classified as one of two types, making this a binary classification task. We find 149 that a gene set's size and mean expression level are the key features that allow for cell type 150 discrimination in this setting. in one and 23 in the other), and the authors of the later paper compared their outcomes by 162 looking at the expression of a handful of marker genes, which yielded mixed results: a small 163 number of cell types seemed to have a direct match but for others the results were more 164 conflicting, with multiple types matching to one another, and others having no match at all. Here 165 we aimed to more quantitatively assess the similarity of their results, and compare them with our 166 own data which derives from phenotypically characterized sub-populations; i.e., not from 167 unsupervised expression clustering (see Table S2 for sample information). 168 MetaNeighbor relies on coordinated variation in expression level to detect cell identity, which 169 means that genes with high variability are particularly useful. Our preceding binary 170 classifications showed that genes with high mean expression were more likely to have variation 171 that allowed MetaNeighbor to learn cell identities. In the following analyses, we are examining 172 both rare and common cell types across datasets. In this case, the mean expression level of 173 marker genes should be a proxy for cell incidence: we can expect that the marker expression for 174 a more abundant type would have a higher mean expression. Since variance scales with 175 expression, the most highly variable genes in the dataset would likely only be discriminative for 176 the abundant type. Because we would like to be able to identify both abundant and rare cell 177 types, we select the genes with the highest variance at each mean expression level. 178 We identified 638 genes with high variability given their expression levels (detailed in Methods) 179 and these were used as a 'high variability gene set' to measure AUROC scores between each 180 pair of cells across datasets. When AUROCs were measured using all genes, we saw that 181 clustering was subject to strong lab-specific effects ( Figure S2). In contrast, the use of variable 182 genes reproduced the known subtype structure, with major branches for the three main 183 subtypes, Pv, Sst and Htr3a. 184 To examine how the previously identified interneuron subtypes are represented across the three 185 studies, we tested the similarity of each pair of subtypes both within and across datasets using 186 the high variability gene set. For each genetically-targeted interneuron type profiled by Paul et 187 al., we found at least one corresponding subtype from the other two studies, which were defined 188 by having a mean AUROC score across training/testing folds >0.95 ( Figure 3). This includes 189 Chandelier cells, a subtype that could not be definitively identified by either Tasic or Zeisel. 190 Using our reciprocal testing and training protocol we find that the Tasic_Pvalb Cpne5 subtype 191 are likely to be Chandelier cells (AUROC=0.99). In addition, expanding our criteria to include all 192 reciprocal best matches in addition to those with ID scores >0.95, we found correspondence 193 among five subtypes that were assessed only in the Tasic and Zeisel data,  (Table S3), without requiring manual 204 gene curation. Because we quantify the similarity among types we can prioritize matches, and 205 use these as input to MetaNeighbor for further evaluation. 206 In the above, we identified overlaps using a single gene set. To assess cell identification more 207 broadly, we ran MetaNeighbor with these new across-dataset subtype labels, measuring 208 predictive validity across all gene sets in GO ( Figure 3A, far right). The distribution of AUROC 209 scores varied across subtypes but we found that the score from the high variability gene set was 210 representative of overall trends, with high performing groups showing higher mean AUROC 211 scores over many gene sets. As detailed in the previous section, we note that AUROC scores 212 are sensitive both to the number of training samples (n) and to underlying data features (e.g., 213 transcriptome complexity), which complicates direct comparison of ID score distributions. Both 214 the high mean AUROCs across all putative replicate subtypes (>0.6), and the similarity of 215 maximum performance suggest that distinctive gene co-expression can be observed in each 216 subtype (max AUROC=0.92 ± 0.04). As with previous tasks, we found little difference in average 217 AUROCs using functional gene sets compared to random sets (mean AUROC Random =0.67 ± 218 0.06, mean AUROC GO =0.68 ± 0.1). 219 These results indicate that highly variable gene sets can be used alongside pairwise testing and 220 training as a heuristic to identify replicable subtypes. 221

Investigating pyramidal neuron subtypes using MetaNeighbor 222
The heterogeneity of pyramidal neurons is undisputed, but the organizing principles are still 223 debated, with some suggesting that identity is discrete and modular (Habib et al., 2016;Zeisel 224 et al., 2015) and others purporting that identities are more likely to be described by expression 225 gradients or spectra (Cembrowski et al., 2016). With MetaNeighbor we are able to quantitatively 226 assess the degree to which pyramidal subtypes defined by scRNA-seq replicate across diverse 227 datasets. If cell types are discrete and modular, we would expect to see sharp differences, with 228 some types showing very strong similarity to one another, and strong dissimilarities to other 229

types. 230
To compare pyramidal neuron scRNA-seq datasets we permuted through all combinations of 231 subtypes as testing and training data based on a set of 743 genes with high variability given 232 their expression level (subtypes listed in Table S2). This was the same procedure that was used 233 for cortical interneurons and while there were similar numbers of subtypes in total, a smaller 234 fraction corresponded across datasets (10/48, ~21%) yielding five putative subtypes (Figure  235 3B). The AUROC score heatmap was generally less modular than the heatmap of interneuron 236 scores. The most prominent feature was that types from the hippocampus and cortex tended to 237 cluster separately from one another. Within each region-specific cluster some layer-or area-238 specific clustering was observed but it was not completely consistent. Particular discrepancy 239 was observed between the cortical layer 5 subtypes which showed more similar AUROC score 240 profiles to the hippocampal subtypes than to other deep layer types (Tasic L5b_Cdh13,241 L5_Chrna6, L5b_Tph). Note that these were also the same subtypes that Tasic et al. found no 242 match for in their marker gene analysis. We suggest that the inclusion of additional datasets 243 may help to resolve this inconsistency. 244 We assessed the five putative subtypes using MetaNeighbor. All subtypes were significantly 245 discernable compared to the null ( Figure 3B) and as with the interneuron subtypes, AUROC 246 scores from the high variability gene set were well correlated with mean performance across all 247 of GO (3888 gene sets). In line with previous tasks, we found that functional gene sets 248 performed equally to random gene sets (mean AUROC Random =0.71 ± 0.08, AUROC GO =0.70 ± 249 0.09). 250

Comparing gene set performance across tasks 251
Finally, we compared gene set results from the 11 replicate interneuron subtypes and the 5 252 pyramidal neuron subtypes. In agreement with our previous results, we found that the top 253 groups were all related to neuronal function, which is unsurprising given the large size of these 254 gene sets and their likelihood of expression and variation in these cells ( Figure 3C). AUROCs 255 were highly correlated across tasks (r~0.76), with slightly higher performance for identifying 256 interneuron types compared to pyramidal types ( Figure 3D). The linearity of the trend across all 257 scores suggests that fundamental data features, like mean expression level and set size, 258 underlie the differential discriminative value of gene sets. The high performance across many 259 sets (mean AUROC ~0.7) also supports the notion that cell identity is encoded promiscuously 260 across the transcriptome, and is not restricted to a small set of functionally important genes. 261

Identifying subtype specific genes 262
ScRNA-seq experiments often seek to define marker genes for novel subtypes. Though ideally 263 marker genes are perfectly discriminative with respect to all cells, in practice marker genes are 264 often contextual and defined relative to a particular out-group. Here we aimed to identify 265 possible marker genes that would allow discrimination among interneuron subtypes or 266 pyramidal neuron subtypes. For each of our identified replicate subtypes we generated a ranked 267 list of possible marker genes by performing one-tailed, non-parametric differential expression 268 analysis within each study for all subtypes (e.g., Int1 vs. all other interneurons in the Zeisel 269 study, Int2 vs. all interneurons, etc.) and combining p-values for replicated types using Fisher's 270 method (Table S4). Figure 4A shows the FDR adjusted p-values for the top candidates based 271 on fold change for the ten replicated interneuron subtypes with overlapping differential 272 expression patterns. Figure 4B shows the same for the two pyramidal neuron subtypes with 273 overlapping differential expression patterns. The majority of these genes have previously been 274 characterized as having some degree of subtype-or layer-specific expression, for example we 275 readily identify genes that were used for the Cre-driver lines in the Tasic and Paul studies (Sst, 276 Pvalb, Vip, Cck, Htr3a, Ctgf). Even though we filtered for genes with high fold changes, we see 277 that many genes are differentially expressed in more than one subtype. Notably, considerable We also identify some novel candidates, including Ptn, or pleiotrophin, which is significantly 283 more expressed in the three Nos1-expressing subtypes than in the others ( Figure 4B). It is thus 284 expected to be discriminative of Nos1-positive neurons compared to other interneuron types. 285 We validated Ptn expression with in situ hybridization and we show clear expression in neurons 286 that are positive for both Sst and Nos1 ( Figure 4C). Ptn is a growth factor, and we suggest that 287 its expression may be required for maintaining the long-range axonal connections that 288 characterize these cells. These cells are well described by current markers, however this 289 approach is likely to be of particular value for novel subtypes that lack markers, allowing 290 researchers to prioritize genes for follow-up by assessing robustness across multiple data 291

sources. 292
Discussion 293 Single-cell transcriptomics promises to have a revolutionary impact by enabling comprehensive 294 sampling of cellular heterogeneity; nowhere is this variability more profound than within the 295 brain, making it a particular focus of both single-cell transcriptomics and our own analysis into 296 its replicability. The substantial history of transcriptomic analysis and meta-analysis gives us 297 guidance about bottlenecks that will be critical to consider in order to characterize cellular 298 heterogeneity. The most prominent of these is laboratory-specific bias, likely deriving from the 299 adherence to a strict set of internal standards, which may filter for some classes of biological 300 signal (e.g., poly-A selection) or induce purely technical grouping (e.g., by sequencing depth). 301

Because of this, it is imperative to be able to align data across studies and determine what is 302
replicable. In this work, we have provided a formal means of determining replicable cell identity 303 by treating it as a quantitative prediction task. The essential premise of our method is that if a 304 cell type has a distinct transcriptional profile within a dataset, then an algorithm trained from that 305 data set will correctly identify the same type within an independent data set. 306 The currently available data allowed us to draw a number of conclusions. We validated the 307 discrete identity of eleven interneuron subtypes, and described replicate transcriptional profiles 308 to prioritize possible marker genes, including Ptn, a growth factor that is preferentially expressed 309 in Sst Chodl cells. We performed a similar assessment for pyramidal neurons but found less 310 correspondence among datasets, suggesting that additional data will be required to determine 311 whether there is evidence for discrete pyramidal neuron types. One major surprise of our 312 analysis is the degree of replicability in the current data. Our AUROC scores are exceptionally 313 high, particularly when considered in the context of the well-described technical confounds of 314 single-cell data. We suspect this reflects the fundamental nature of the biological problem we 315 are facing: discrete cell types can be identified by their transcriptional profiles, and the biological 316 clarity of the problem overcomes technical variation. 317 This is further suggested by our result that cell identity has promiscuous effects within 318 transcriptional data. While in-depth investigation of the most salient gene functions is required to 319 characterize cell types, to simply identify cell types is relatively straightforward. This is 320 necessarily a major factor in the apparent successes of unsupervised methods in determining 321 novel cell types and suggests that cell type identity is clearly defined by transcriptional profiles, sets show more correlated expression within than across types, and variation across types is 332 likely to be accounted for by simple important factors, like cell size. This is not to say that more 333 detailed characterization of cell types is not necessary: understanding the differences between 334 cells and how they work will require focused investigation into the precise molecular players that 335 are differentially utilized. However, we hope that this helps to demonstrate that the variations on 336 dimension reduction and clustering methods in single cell RNA-seq are 'working', inevitably by 337 taking advantage of this very clear signal. 338 In this work we opted to use the subtype or cluster labels provided by the original authors, in 339 essence to characterize both the underlying data as well as current analytic practices. However, 340 this has limitations where studies cluster to different levels of specificity. For example, the Tasic 341 paper defines multiple Parvalbumin subtypes but the Zeisel and Paul work do not. Our method 342 makes it extremely easy to identify highly overlapping types at the levels defined by each 343 author, facilitating downstream work to validate the sub-clusters through meta-analysis and at 344 the bench. Given the known noisiness of single-cell expression and the complex and 345 idiosyncratic character of approaches taken to assessing it, the degree of replicability that we 346 see is much higher than could have been expected were there not simple explanations for the 347 derived clusters from individual laboratories. Our work shows that with additional data, 348 comprehensive evaluation and replication is likely to be quantitatively straightforward, making it 349 possible to have high confidence in derived cell sub-types quite rapidly. As this additional data is 350 generated, our approach can provide consistent updates of the field-wide consensus. 351 The simplicity of our method makes it unlikely to be biased toward the exact cell identity tasks 352 assessed here. For example, because of the method's reliance on relative ranks, it is almost 353 entirely immune to normalization as a potential confound. On the one hand, this limits our 354 sensitivity to detect real signals of some type, but this cost is more than offset by the robustness 355 of signals identified. Its simplicity also means that it is scalable, and readily admits to the 356 incorporation of data from individual labs in their ongoing work. Ultimately we hope that by 357 defining what is replicable clearly, MetaNeighbor will allow future studies involving cell-cell 358 comparisons to build on a strong foundation toward a comprehensive delineation of cell types. 359

Animals, manual cell sorting and scRNA-seq 361
Mice were bred and cared for in accordance with animal husbandry protocols at Cold Spring 362 Harbor Laboratory, with access to food and water ad libitum and a 12 hour light-dark cycle. preparation kit (7-11 cycles of PCR). Libraries were size-selected with SPRISelect magnetic 379 beads (Agencourt) and sequenced with paired-end 101bp reads using an Illumina HiSeq. 380 PolyA-primed reads were mapped to the mouse reference genome (mm9) with Bowtie (v 381 0.12.7), while paired sequences were used for varietal tag counting. A custom python script was 382 used map and tally sequences with unique tags for each mRNA in each cell (Crow et al., 2016). 383 All data is available to download from GEO (accession GSE92522). 384

Public expression data 385
Data analysis was performed in R using custom scripts (github.com/maggiecrow/MetaNeighbor, 386 2016). Processed expression data tables were downloaded from GEO directly, then subset to 387 genes appearing on both Affymetrix GeneChip Mouse Gene 2.0 ST array (902119) and the 388 UCSC known gene list to generate a merged matrix containing all samples from each 389 experiment. The mean value was taken for all genes with more than one expression value 390 assigned. Where no gene name match could be found, a value of 0 was input. We considered 391 only samples that were explicitly labeled as single cells, and removed cells that expressed fewer 392 than 1000 genes with expression >0. Cell type labels were manually curated using sample 393 labels and metadata from GEO (see Tables S1 and S2). Merged data and metadata are linked 394 through our Github page. 395

Gene sets 396
Gene annotations were obtained from the GO Consortium 'goslim_generic' (August 2015). 397 These were filtered for terms appearing in the GO Consortium mouse annotations 398 'gene_association.mgi.gz' (December 2014) and for gene sets with between 20-1000 genes, 399 leaving 106 GO groups with 9221 associated genes. Random gene sets were generated by 400 randomly choosing genes with the same set size distribution as GO slim. Sets of high variance 401 genes were generated by binning data from each dataset into deciles based on expression 402 level, then making lists of the top 25% of the most variable genes for each decile, excluding the 403 most highly expressed bin. The high variance set was then defined as the intersect of the high 404 variance gene lists across the relevant datasets. 405

MetaNeighbor 406
All scripts, sample data and detailed directions to run MetaNeighbor in R can be found on our 407 Github page (github.com/maggiecrow/MetaNeighbor, 2016). 408 The input to MetaNeighbor is a set of genes, a data matrix and two sets of labels: one set for 409 labeling each experiment, and one set for labeling the cell types of interest. For each gene set, 410 the method generates a cell-cell similarity network by measuring the Spearman correlation 411 between all cells across the genes within the set, then ranking and standardizing the network so 412 that all values lie between 0 and 1. The use of rank correlations means that the method is 413 robust to any rank-preserving normalization (i.e., log2, TPM, RPKM). Ranking and standardizing 414 the networks ensures that distributions remain uniform across gene sets, and diminishes the 415 role outlier similarities can play since values are constrained. 416 The node degree of each cell is defined as the sum of the weights of all edges connected to it 417 (i.e., the sum of the standardized correlation coefficients between each cell and all others), and 418 this is used as the null predictor in the neighbor voting algorithm to standardize for a cell's 'hub-419 ness': cells that are generically linked to many cells are preferentially down-weighted, whereas 420 those with fewer connections are less penalized. For each cell type assessment, the neighbor 421 voting predictor produces a weighted matrix of predicted labels by performing matrix 422 multiplication between the network and the binary vector (0,1) indicating cell type membership, 423 then dividing each element by the null predictor (i.e., node degree). In other words, each cell is 424 given a score equal to the fraction of its neighbors, including itself, which are part of a given cell To test the dependency of results on the amount of training and testing data we repeated the 441 neuron vs. non-neuronal cell discrimination task after randomly selecting between two and 442 seven datasets ten times each. This was done for 21 representative gene sets. Means for each 443 gene set and each number of included datasets were plotted. 444

Identifying putative replicates 445
In cases where cell identity was undefined across datasets (i.e., cortical interneuron and 446 pyramidal subtypes) we treated each subtype label as a positive for each other subtype, and 447 assessed similarity over the high variance gene set described above. For example, Int1 from the 448 Zeisel dataset was used as the positive (training) set, and all other subtypes were considered 449 the test set in turn. Mean AUROCs from both testing and training folds are plotted in the 450 heatmap in Figure 3. A stringent cut-off of mean AUROC >0.95 and/or mutual best matches 451 across datasets identified putative replicated types for further assessment with our supervised 452 framework (detailed above). While lowering this threshold could increase the number of 453 subtypes with some match, we found that reciprocal top hits alone provided an upper bound on 454 the number of replicated types (i.e., lowering the thresholds did not allow for a higher number of 455 subtypes). New cell type labels encompassing these replicate types (e.g. a combined Sst-Chodl 456 label containing Int1 (Zeisel), Sst Chodl (Tasic) and Sst Nos1 (Paul)) were generated for 457 MetaNeighbor across random and GO sets, and for meta-analysis of differential expression. 458 While only reciprocal top-hits across laboratories were used to define novel cell-types, 459 conventional cross-validation within laboratories was performed to fill in AUROC scores across 460 labels contained within each lab. 461

Differential expression 462
For each cell type within a dataset (defined by the authors' original labeling), differential gene 463 expression was calculated using a one-sided Wilcoxon rank-sum test, comparing gene 464 expression within a given cell type to all other cells within the dataset (e.g., Zeisel_Int1 vs all 465 other Zeisel interneurons). Meta-analytic p-values were calculated for each putative replicated 466 type using Fisher's method (Fisher, 1925) then a multiple hypothesis test correction was 467 performed with the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). Top 468 differentially expressed genes were those with an adjusted meta-analytic p-value <0.001 and 469 with log2 fold change >2 in each dataset. All differential expression data for putative replicated 470 subtypes can be found in Table S4.  A and B, is plotted. There is a weak correlation between these cells. On the bottom left of the panel we see the correlation between cells A and C, which are strongly correlated. By taking the correlations between all pairs of cells we can build a cell network (right), where every node is a cell and the edges represent how similar each cell is to each other cell. C -The cell network that was generated in B can be extended to include data from multiple experiments (multiple datasets). The generation of this multi-dataset network is the first step of MetaNeighbor. D -The cross-validation and scoring scheme of MetaNeighbor is demonstrated in this panel. To assess cell type identity across experiments we use neighbor voting in cross-validation, systematically hiding the labels from one dataset at a time. Cells within the hidden dataset are predicted as similar to the cell types from other datasets, using a neighbor voting formalism. Whether these scores prioritize cells as the correct type within the dataset determines the performance, expressed as the AUROC. In other words, comparative assessment of cells occurs only within a dataset, but this is based only on training information from outside that dataset. This is then repeated for all gene sets of interest.

Figure 2 -Cell type identity is widely represented in the transcriptome
A & B -Distribution of AUROC scores from MetaNeighbor for discriminating neurons from non-neuronal cells ("task one", A) and for distinguishing excitatory vs. inhibitory neurons ("task two", B). GO scores are in black and random gene set scores are plotted in gray. Dashed grey lines indicate the null expectation for correctly guessing cell identity (AUROC=0.5). For both tasks, almost any gene set can be used to improve performance above the null, suggesting widespread encoding of cell identity across the transcriptome. C -Task one AUROC scores for each gene set are plotted with respect to the number of genes. A strong, positive relationship is observed between gene set size and AUROC score, regardless of whether genes were chosen randomly or based on shared functions. D -Distribution of AUROC scores for task one using 100 sets of 100 randomly chosen genes, or 800 randomly chosen genes. The mean AUROC score is significantly improved with the use of larger gene sets (mean 100 = 0.80 +/-0.05, mean 800 = 0.90 +/-0.03). E -Relationship between AUROC score and expression level. Task one was re-run using sets of genes chosen based on mean expression. A strong positive relationship was observed between expression level and performance (r s ~0.9).

Figure 3 -Cross-dataset analysis of interneuron and pyramidal neuron diversity
A -(Left) Heatmap of AUROC scores between interneuron subtypes based on the highly variable gene kernel. Dendrograms were generated by hierarchical clustering of Euclidean distances using average linkage. Row colors indicate data origin and column colors show marker expression. Clustering of AUROC score profiles recapitulates known cell type structure, with major branches representing the Pv, Sst and Htr3a lineages. (Middle) Table of reciprocal best matches and subtype pairs with scores >0.95. (Right) Boxplots of GO performance (3888 sets) for each replicated subtype, ordered by their AUROC score from the highly variable gene set. Subtypes are labeled with the names from Tasic et al. A positive relationship is observed between AUROC scores from the highly variable set and the average AUROC score for each subtype. Mean AUROCs are all greater than chance (0.5) suggesting robust cross-dataset replication across gene sets. B -(Left) Heatmap of AUROC scores between pyramidal subtypes based on the highly variable gene kernel, clustered as in A. Row colors indicate datasets and column colors show brain region, cortical layer or hippocampal area.
Clustering of AUROC score profiles shows a separation of cortical and hippocampal subtypes. (Middle) Table of reciprocal best matches. (Right) Boxplots of GO performance (3888 sets) for each replicated subtype, ordered by their AUROC score from the highly variable gene set. Subtypes are labeled by layer. A positive relationship is observed between ID scores from the highly variable set and the average AUROC for each subtype. C -The table shows the top GO terms that allow for cross-dataset subtype discrimination, listed by their mean AUROC across tasks. For both tasks, high scores are obtained for terms related to neuronal function. D -AUROC scores for each GO function are plotted, with pyramidal scores on the y-axis and interneuron scores on the x-axis. AUROCs are highly correlated across tasks (r s~0 .76), suggesting limited functional specificity. Subtypes are labeled by layer. B -Standardized Ptn expression is plotted across the three experiments, where each box represents an interneuron subtype. High, but variable expression is observed across the three Sst Chodl types. C -Fluorescent double in-situ of Ai14/tdTomato driven by Sst-Flp and Nos-Cre expression (green) and Ptn (red). Dotted box indicates the area shown in higher magnification on the right, arrowheads point to cells that express both transcripts.