northstar: leveraging cell atlases to identify healthy and neoplastic cells in transcriptomes from human tumors

Cell atlases are revolutionizing our understanding of tissue and disease heterogeneity, yet most single-cell transcriptomic analyses on tumors are not leveraging atlases effectively. We developed northstar, a computational approach to classify cells in tumor datasets guided by but not restricted by previously annotated cell atlases. To benchmark northstar, we transferred annotations from a human brain atlas to a published dataset on glioblastoma and could recapitulate the tumor composition accurately and within seconds. We then collected 1,622 cells from 11 pancreatic tumors and could robustly identify healthy pancreatic and immune cells and neoplastic cell states. Three cell populations were shared across patients while five were private to a single sample. northstar’s cell type classification offered rapid insight into the origins of neuroendocrine and exocrine tumors and fibromatosis. northstar is a useful tool to classify single-cell transcriptomes into known and novel cell types in the age of cell atlases.


Introduction
The widespread adoption of single-cell transcriptomics has led to a growing body of cell atlases, large and carefully annotated datasets describing tissue heterogeneity with unprecedented resolution [1] . Single-cell approaches are also transforming our understanding of disease and physiological perturbations [2] . As atlas data across species [3] , tissues [4][5][6][7][8] , development [9] , and aging [5,10] are being amassed, there is growing demand for computational tools that leverage cell atlases to guide the analysis of new single-cell datasets. In particular, there is an unmet need for annotation tools that classify new cells based on cell type annotations found in a cell atlas.
The most direct route to learn from an atlas is to combine atlas and new dataset into a single gene expression counts table and apply an unsupervised clustering algorithm such as Leiden [11]. This method is resource intensive and often leads to ambiguous classification, because atlas cell types can split into subclusters or merge into superclusters. Batch correction techniques can also be adapted for this task but are equally greedy for computational resources [12,13]. Supervised learning approaches (training a classifier on an atlas and using it on the new dataset) are too restrictive because cell states missing from the atlas (e.g. cancer cells) would nonetheless be forced into an atlas population, leading to data misinterpretation.
Here we present northstar, an algorithm and software package that classifies single-cell transcriptomes guided by one or multiple atlases but is also able to discover new cell types or cell states. northstar is computationally efficient and its output is simple to interpret. We benchmarked northstar on a published dataset on glioblastoma [14] and then applied it to 1,622 newly collected cells from 11 pancreatic cancer patients. northstar rapidly classified both datasets in terms of both healthy cell types and neoplastic states and ultimately led to useful biological insight into the composition and origin of the tumors.

Results

northstar identifies cell types guided by a cell atlas
northstar is a computational approach to identify cell types in a new single cell dataset leveraging one or multiple cell atlases. The unique feature of northstar is that every new cell can be assigned either to a known atlas cell type or to a novel cluster. An implementation in C++/Python is available at https://github.com/northstaratlas/northstar and preprocessed atlases for immediate use are available at https://northstaratlas.github.io/atlas_landmarks . We are accepting contributions to expand the list of ready-to-use atlases.
As input, the gene expression table for the new dataset and the gene expression table and cell type annotations for the cell atlas are specified by the user. northstar can use either an average expression vector for each cell type or a small subsample of each cell type. Atlas averages and subsamples can be used by northstar as reference landmarks to annotate the new cells. In this sense, northstar serves the same purpose in single-cell datasets as the North Star always had for maritime navigation: providing fixed points that guide the exploration of new landscapes. To simplify adoption, we provide precomputed landmarks (averages and subsamples) of several atlases (see above link). If a precomputed atlas is chosen, the user only needs to specify its name: counts and annotations are downloaded automatically.
The algorithm consists of the following steps. First, atlas landmarks (averages or subsamples) are merged with the new single-cell dataset into a single data table (Fig. 1A). Then, informative genes are selected: upregulated markers of each atlas cell type are included as well as genes showing a high variation within the new dataset. A similarity graph of the merged dataset is constructed, in which each edge connects either two cells with similar expression from the new dataset or a new cell with an atlas cell type (Fig. 1B). Finally, nodes in the graph are clustered into communities using a variant of the Leiden algorithm that prevents the atlas nodes from merging or splitting [11]. The output of northstar is an assignment of each cell to either an atlas cell type or, if a group of cells shows a distinctive gene expression profile, to a novel cluster (Fig. 1C).
northstar is designed to be easy to use (Fig. 1D). To examine northstar's scalability to large atlases, we downloaded the Tabula Muris plate data [3] , subsampled it to different cell numbers, and counted the number of cell types with at least 20 cells. As more cells were sampled new cell types were discovered, however with diminishing returns. At full sampling (50,000 cells), we estimated at most 20 new cell types per tenfold increase in cell numbers (Fig. 1E). Because of this sublinear behaviour, northstar scales to atlases of arbitrary size, unlike a naive approach that combines all atlas cells with the new dataset (Fig. 1F). Although subsampling each cell type (e.g. 20 cells) requires more storage memory than a single average, their scaling behaviour is exactly the same (i.e. logarithmic or better). In practical tests, we found that subsampling helps with very heterogeneous cell types, however it is more reliant on a high-quality atlas annotation.
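The subsampling experiment described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the annotation vector is synthetic (it is not the Tabula Muris data), and the function name `count_cell_types` and the abundance distribution are hypothetical choices made for this sketch.

```python
import numpy as np

def count_cell_types(labels, n_cells, min_cells=20, seed=0):
    """Subsample n_cells annotations and count cell types with >= min_cells cells."""
    rng = np.random.default_rng(seed)
    subsample = rng.choice(labels, size=n_cells, replace=False)
    _, counts = np.unique(subsample, return_counts=True)
    return int((counts >= min_cells).sum())

# Synthetic annotations (NOT Tabula Muris): 50 hypothetical cell types whose
# abundances span three orders of magnitude, 50,000 cells in total.
rng = np.random.default_rng(1)
abundances = np.logspace(0, 3, 50)
labels = rng.choice(50, size=50_000, p=abundances / abundances.sum())

depths = [500, 5_000, 50_000]
n_types = [count_cell_types(labels, n) for n in depths]
for depth, n in zip(depths, n_types):
    print(f"{depth:>6} cells sampled: {n} cell types with at least 20 cells")
```

With abundances spread evenly on a logarithmic scale, each tenfold increase in sampling depth adds a roughly constant number of cell types, which is the kind of sublinear growth that keeps the number of landmarks small.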

Benchmark against published datasets on healthy brain and glioblastoma
To validate northstar's performance, we analyzed a glioblastoma (GBM) dataset [14] on the basis of a previously annotated cell atlas of the human brain by the same authors [4]. The annotations of the GBM dataset according to the original authors define seven healthy cell types: neurons, astrocytes, oligodendrocytes, oligodendrocyte progenitor cells (OPCs), endothelial cells, microglia, and other immune cells. In addition to these seven, an additional cluster of neoplastic cells is described. Of these cell types, the first five are also present in the brain atlas, while some fetal cells were excluded from the atlas because the GBM patients were adults. A projection of the GBM data via t-Distributed Stochastic Neighbor Embedding (t-SNE) is shown in Fig. 2A with the original annotations. The relative abundances of the various cell types in the atlas and GBM data are shown in Fig. 2B.
We calculated cell type gene expression averages of the brain atlas, deleted the annotations from the GBM data, and fed labeled atlas averages and unlabeled GBM data into northstar. We obtained a classification of the new cells and observed that our method recapitulated previously reported cell types and also created new classes for myeloid and neoplastic cells as these cell types were absent from the atlas (Fig. 2C, Supplementary Fig. 1).
A detailed analysis of the original annotations versus new annotations generated by our algorithm highlights the strength of the method (Fig. 2D). Almost all cells of known types were correctly assigned to their respective cluster. The main misclassification was a population annotated as neoplastic in the original paper but classified as oligodendrocyte precursor (OPC) by northstar. However, those cells formed a relatively distinct cloud in both Fig. 2C and, even more clearly, in the original t-SNE plot (Supplementary Fig. 1). Moreover, northstar classified a diverse population of immune cells either as microglia, which are resident immune cells, or into new cluster 7. Neoplastic cells were correctly assigned to new clusters 8 and 9. To identify why both immune and neoplastic clusters were split by northstar we examined their connectedness in the similarity graph. We found that their subcomponents were only weakly connected (Supplementary Fig. 2); the Leiden algorithm, which was not available to Darmanis et al. in 2017, split them to increase internal connectedness [11]. Similar classifications were obtained using northstar with a subsample of the atlas, confirming that the classification is robust (Supplementary Fig. 3). Differential expression analysis indicated that the immune clusters expressed CD207, CD300E and other classical immune surface markers while the neoplastic cell clusters expressed KLK6, MIR1322, and other RNAs previously implicated in malignancies [15,16].

Classification of healthy and neoplastic cells in pancreatic cancer
To test the ability of northstar to provide biological insight into new datasets, we collected 1,622 single cells from 11 pancreatic tumors (Supplementary Table 1) and aimed to determine their composition in terms of known cell types and, potentially, novel clusters. Briefly, tumors were surgically resected at Stanford Hospital from patients with diagnosed pancreatic cancer and dissociated into single-cell suspensions (see Methods). Single cells were isolated by fluorescence-activated cell sorting (FACS) into 96- or 384-well microtiter plates and processed as described previously [5,17]. northstar was then used to identify the cells in these samples based on the pancreas atlas by Baron et al. (2016) [18] and the immune atlas by Zanini et al. (2018) [19]. We found that the cells from the cancer patients were classified into both known and novel cell types (Fig. 3A). Among the known cell types were mesenchymal, exocrine and endocrine cell types, and various immune cells. Although all tumors were resected in a similar procedure, the fraction of cells belonging to each known cell type varied considerably among patients (Fig. 3B and Supplementary Table 2). This reflects the challenges of isolating tumor tissue during surgery without capturing surrounding tissue, and making single-cell suspensions from pancreatic exocrine tissue while the tissue is self-digesting.
Samples TuPa2, 4, and 5 were composed mostly of acinar cells, or cells that closely resemble that cell type. In fact, it is clear from the embedding (Fig. 3A) that many of these cells are close to both acinar and ductal types. This was expected for patients TuPa2 and 4 because the tumor was diagnosed as Pancreatic Ductal Adenocarcinoma (PDAC) but not for TuPa5, which is annotated as a neuroendocrine tumor. TuPa1 was clinically described as fibromatosis and we found it is essentially composed of activated stellate cells. This is consistent with recent literature [20][21][22]. Sample TuPa27 was classified as composed mainly of monocytes, which was surprising considering that it was clinically described as a neuroendocrine tumor. Because surgery took longer than usual for this sample, we speculate that the tumor cells might have degraded leaving only the more robust immune cells. All other samples were classified into one or more novel clusters. TuPa6, 3, and 31 shared cluster 21 though those patients were diagnosed with three different conditions: ampullary adenocarcinoma, mucinous cystic neoplasm, and neuroendocrine tumor, respectively. The embedding confirms northstar's prediction and places these cells somewhere between alpha and gamma/PP endocrine types, indicating an endocrine origin. A minority of cells in samples TuPa3 and 31 belonged to another new cluster, 23, which is located nearby on the embedding. The tumor from TuPa28 belonged to its own private cluster 20. Its location is indicative of an endocrine lineage, in agreement with the patient's diagnosis of neuroendocrine tumor. The donor of TuPa29 was the only one diagnosed with invasive adenosquamous carcinoma and was found to contain a majority of cells in a new, almost private cluster 22 located in the vicinity of acinar cells. A minority of cells from this sample were assigned to quiescent stellate and delta cells. Finally, sample TuPa23 was composed entirely of quasi-private cluster 19, which is surrounded in the embedding by resident immune cells.
To better understand the nature of the novel clusters 19 to 23, we computed differentially expressed genes (DEGs) between each novel cluster and all other cells by Kolmogorov-Smirnov statistics on the expression. In short, we searched for genes that are expressed by a high fraction of the cells within the focal cluster and by few cells outside of it (Fig. 3C, top heatmap and Supplementary Table 3). The expression of PPY by clusters 21 and 23 suggests that these might be neoplastic cells derived from gamma/PP precursors. Clusters 21, 23, and 20 all express TTR and SCG5 but the latter cluster is missing PPY; this favors a distinct endocrine origin for the latter cluster. Cluster 22 expresses KRT19, indicating an epithelial origin which is consistent with its proximity on the t-SNE to acinar and ductal cells. Finally, cluster 19 expresses CD74, which is part of the MHC class II machinery found in immune, antigen-presenting cells. To further validate these results, we also looked at the expression of known markers for endocrine and exocrine cancers (bottom heatmap) and found that clusters 20, 21 and 23 show an expression consistent with endocrine cancer cells, while cluster 22 is consistent with exocrine cancer. Cluster 19 expresses PTPRC and IGKC, indicating it is related to B cells. Its location on the embedding supports an immune cell type, although the atlas B cells and plasmablasts are not located in proximity. This discrepancy might be due to biological differences between tumor-infiltrating B cells and peripheral blood lymphocytes from healthy subjects.
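As an illustration of the Kolmogorov-Smirnov ranking described above, the following sketch computes a two-sample KS statistic per gene with numpy only. It is a minimal example on a toy matrix, not the analysis code used for Fig. 3C; the function names `ks_statistic` and `rank_genes` are hypothetical, and the statistic here is two-sided, so in practice one would additionally check the direction of the difference.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

def rank_genes(expression, in_cluster):
    """Rank genes (rows) by KS statistic between the focal cluster and all other cells."""
    stats = np.array([ks_statistic(row[in_cluster], row[~in_cluster]) for row in expression])
    return np.argsort(stats)[::-1], stats

# Toy matrix: 3 genes x 6 cells; the first three cells form the focal cluster
expression = np.array([
    [9.0, 8.0, 7.0, 0.0, 0.0, 1.0],   # expressed almost only inside the cluster
    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],   # same distribution inside and outside
    [0.0, 5.0, 0.0, 0.0, 0.0, 0.0],   # expressed by a single cluster cell
])
in_cluster = np.array([True, True, True, False, False, False])
order, stats = rank_genes(expression, in_cluster)
print(order, stats)
```

A gene expressed by most cells in the cluster and by few cells outside it, such as the first row of the toy matrix, receives a statistic close to 1 and is ranked first.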

Discussion
Annotating a new single-cell transcriptomic dataset traditionally involves clustering without an atlas, however this process is laborious and can be inaccurate. Geometric subsampling of the atlas [23], followed by merging with the new data and unsupervised clustering is a viable route, however known cell types can split into subclusters or merge into superclusters, leading to difficulties in interpretation. In our experience such cases happen often because clustering can be performed at different resolutions leading to equally valid classifications (e.g. all immune cells, lymphocytes, T cells). northstar improves over these approaches by combining cell-type aware subsampling of the atlas with fixed clustering resolution and a stable clustering algorithm that guarantees internal connectedness [11]. Cell types can neither split nor merge simply because they are determined by the atlas.
northstar is very efficient because it approximates a cell atlas by compressed representations, i.e. averages or small subsamples. One can easily use an atlas with millions of cells on a laptop with 16 GB of RAM as long as the number of cell types remains within a few thousands. Current atlases only have tens (see Figs. 2 and 3) or hundreds of cell types and that figure is unlikely to be surpassed by orders of magnitude. Classifying thousands of glioblastoma and pancreatic tumor cells only took seconds and is actually faster than computing their t-SNEs.
northstar clearly separates the concerns of building the similarity graph (see Fig. 1B) and classifying cell types based on an existing graph (Fig. 1C). Although a simple and effective algorithm for graph building is implemented, we intentionally allow the use of custom similarity graphs to cope with batch effects. A plethora of batch correction methods has been proposed and northstar is designed to be compatible with most of them [12,13,24,25] .
Unfortunately, many cell atlases and single cell datasets are poorly disseminated. Data access is idiosyncratic to each dataset and often requires manual steps (e.g. writing emails to the authors). To change this trend and help disseminate northstar we provide a website with averages and subsamples for several atlases that can be accessed programmatically: https://northstaratlas.github.io/atlas_landmarks . This makes it easy to combine atlases and cherry pick different cell types from each to maximize the leverage provided by the annotations.
The analyses of brain and pancreatic tumors presented here highlight the utility of northstar to quickly characterize the cell type composition of tumors. Simple differential expression can be applied immediately afterwards (Fig. 3C) to identify the nature of the new clusters and to shed light on their biological origins. Sampling human tumors is challenging due to cell death preferential to certain cell types (e.g. neurons, pancreatic exocrine cells), hence northstar is an ideal tool to verify whether the cell types of interest are captured effectively. Moreover, the joint analysis of multiple patient samples is informative about how stereotypic neoplastic cell state progression is across individuals. In the 11 pancreatic tumors analyzed, we observed both shared and private clusters and also found corroborating evidence linking fibromatosis to activated stellate cells [20][21][22]. Cell atlases from large numbers of cancers are being collected in addition to healthy tissues [26]. This will further boost the utility of northstar for rapidly classifying tumors into known or novel neoplastic cell states.
Cell atlases provide an invaluable resource to study heterogeneous disease and in particular cancer. northstar's unique ability to identify healthy and neoplastic cells is an important step towards personalized diagnosis and characterization of disease states at the single cell level.

Methods
Brain atlas, glioblastoma dataset, and pancreas atlas
Gene expression count tables and cell type annotations were downloaded from NCBI's Gene Expression Omnibus website (brain atlas: GSE67835, glioblastoma: GSE84465, pancreas atlas: GSE81547). To combine the brain atlas and glioblastoma dataset, only genes that were present in both datasets were retained. Gene expression counts were normalized by 1 million total counts per cell. For the brain and pancreas atlases, arithmetic averages of the normalized counts were computed for each cell type and stored. Fetal cell types were excluded from the brain atlas since the glioblastoma dataset refers to adult patients, while ambiguous cell types (e.g. "unknown", "hybrid") were excluded from both atlases.
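The normalization and averaging steps can be sketched as follows. This is a minimal illustration on a toy count table, assuming genes as rows and cells as columns and no cells with zero total counts; the function names `counts_per_million` and `cell_type_averages` are hypothetical and are not part of the northstar package.

```python
import numpy as np

def counts_per_million(counts):
    """Normalize a genes x cells count matrix to 1 million total counts per cell."""
    counts = np.asarray(counts, dtype=float)
    return 1e6 * counts / counts.sum(axis=0, keepdims=True)

def cell_type_averages(cpm, cell_types, exclude=("unknown", "hybrid")):
    """Arithmetic average of normalized counts for each retained cell type."""
    cell_types = np.asarray(cell_types)
    names = [str(ct) for ct in np.unique(cell_types) if str(ct) not in exclude]
    averages = np.column_stack([cpm[:, cell_types == ct].mean(axis=1) for ct in names])
    sizes = np.array([int((cell_types == ct).sum()) for ct in names])
    return names, averages, sizes

# Toy table: 2 genes x 4 cells, one of which has an ambiguous annotation
counts = np.array([[10, 0, 5, 1],
                   [30, 20, 5, 3]])
cell_types = ["neuron", "neuron", "astrocyte", "unknown"]

cpm = counts_per_million(counts)
names, averages, sizes = cell_type_averages(cpm, cell_types)
print(names, sizes)
print(averages)
```

The number of cells behind each average (`sizes`) is kept alongside the averages because it is needed later to weight the atlas landmarks.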

Pancreatic cancer dataset
A novel single-cell transcriptomic dataset from human individuals with pancreatic cancer was collected. Eleven individuals were sampled with a total of 1,622 cells. A table of individual metadata is available as Supplementary Table 1.
Pancreatic tumor tissue was obtained at the Stanford Hospital from individuals undergoing surgery for pancreatic cancer between September 2015 and June 2018. Single-cell suspensions were then obtained by dissociating the samples for 1-2 hours with Collagenase/Hyaluronidase (Stemcell Technologies, 7912) in DMEM 1% FBS, followed by 2 minutes of digestion with Trypsin-EDTA (0.25%, Life Tech, except for TuPa#1, #3 and #4). ACK and DNAse treatments were performed as needed.
Single cell suspensions were isolated by fluorescence-activated cell sorting (FACS) on FACS Aria II (BD Biosciences for TuPa#1-6) into a single well of 96-well plate and on Sony SH800Z (for TuPa 23,27,28,29,31) into single wells of 384-well Biorad HardShell plates. Antibodies used for sorting pancreatic cells were: EpCAM fluorescein isothiocyanate (FITC), CD49f phycoerythrin (PE), CD24 PE-Cy7, CD44 Allophycocyanin (APC), hCD45/GPA Pacific-blue (BioLegend). Cells were gated on the basis of forward- and side-scatter profiles, and live/dead discrimination was obtained with Sytox Blue (ThermoFisher #oS34857) or DAPI (4′,6-diamidino-2-phenylindole). The plates were pre-filled with 500nl of lysis buffer containing poly-T capture oligos, spike-in External RNA Controls Consortium (ERCC) control RNAs, and other molecules as described elsewhere [3]. cDNA synthesis and amplification were performed as described in the same reference. Libraries were prepared using in-house Tn5 as described and sequenced on Illumina NovaSeq 6000 on S2 or S4 flow cells and 100 base paired-end kits at an average depth of 1 million reads per cell. To avoid index hopping, dual unique barcodes with a reciprocal Hamming distance > 2 were used.
The sequencing read pairs were mapped to the human genome using STAR RNA aligner [28] and sorted by name. Genes were counted using HTSeq [29]. One of us (FZ) is the maintainer of HTSeq. A new package for grouping mapping and counting was developed by one of us (FZ) and is available on GitHub: https://github.com/iosonofabio/bag_of_stars . Cells with fewer than 100,000 reads were discarded.

Data processing
For both datasets, count tables were further analyzed in Python 3.7 using numpy [30] , pandas [31] , and singlet. The latter package was developed by one of us (FZ) and is available on GitHub: https://github.com/iosonofabio/singlet and on PyPI: https://pypi.org/project/singlet/ . A detailed description of the northstar algorithm is available in Supplementary Text 1 .

Code and data availability
northstar is available on GitHub at https://github.com/northstaratlas/northstar . The constrained clustering step was developed with the assistance of Vincent Traag and requires the development branch of leidenalg on GitHub at https://github.com/vtraag/leidenalg . Cell type averages for a number of cell atlases are available at https://northstaratlas.github.io/atlas_landmarks/ . All scripts used to generate the figures and tables for this manuscript are available on GitHub at https://github.com/northstaratlas/northstar_analysis/ . The code is written in Python 3, C, and C++ and is tested via continuous integration on Linux and OSX.

[Figure legend fragment] Fig. 3A: squares indicate atlas cell types, circles new cells from pancreatic tumors.

Supplementary Text 1: Detailed description of the northstar algorithm
Merge cell atlas and new single cell dataset
northstar starts with a gene expression matrix $M$ with $L$ rows (genes) and $N$ columns (cells). The first $N_a$ columns represent the cell atlas. Since the atlas is already annotated, the gene expression of each cell can be approximated by the average of its cell type. Therefore, each of these columns contains the mean gene expression within a cell type from the atlas. In addition to these averages, the number of atlas cells within each cell type is taken as input as a vector, which we call $S$. The last $N_n = N - N_a$ columns of $M$ represent single cells from the new dataset ("new cells"), which need to be annotated. For these columns, no equivalent of $S$ is needed since each column describes only one cell. The input data for northstar is also depicted in Fig. 1A. The steps below are illustrated in Fig. 1B.

Feature selection
The first step of northstar is to select features (genes) to calculate the neighborhood graph. Briefly, feature selection is necessary because noisy points in a high-dimensional space are, loosely speaking, almost equidistant from one another ("curse of dimensionality"): the selection of 300-1000 most relevant genes drastically reduces noise while retaining enough information to calculate neighborhoods. It is empirically observed that excluding key features from the selection might cause overclustering, while erring on the side of too many features has a less dramatic impact on the results.
A common and simple way to select features is to take the most overdispersed genes, i.e. genes with a large variance-to-mean ratio across all cells. northstar by default takes this approach for the sake of simplicity, however it includes two modifications. First, it only computes overdispersion within the new dataset, since the atlas might be much larger and might otherwise dominate the feature selection, obfuscating de novo cell type discovery. Second, it explicitly includes the most discriminating genes for each cell type in the atlas, to ensure that new cells can be positioned correctly within the atlas if they do belong to any of the known cell types. To maximize customizability, the user can select features before northstar if she prefers to use different criteria.
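The two modifications can be illustrated with a short sketch. The scoring of "most discriminating" atlas genes below (log fold change of one cell type against the average of the others) is an assumption made for this example, not necessarily the criterion used by northstar, and the function name `select_features` is hypothetical.

```python
import numpy as np

def select_features(atlas_avg, new_data, n_per_type=1, n_overdispersed=1):
    """Return sorted gene indices: atlas markers plus overdispersed genes of the new data."""
    selected = set()
    # Markers: for each atlas cell type, genes most upregulated against the
    # mean of the other cell types (illustrative score, an assumption)
    for j in range(atlas_avg.shape[1]):
        others = np.delete(atlas_avg, j, axis=1).mean(axis=1)
        score = np.log2(atlas_avg[:, j] + 1) - np.log2(others + 1)
        selected.update(np.argsort(score)[-n_per_type:].tolist())
    # Overdispersion (variance-to-mean ratio) computed within the new dataset only
    mean = new_data.mean(axis=1)
    var = new_data.var(axis=1)
    fano = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    selected.update(np.argsort(fano)[-n_overdispersed:].tolist())
    return np.array(sorted(selected))

# Toy data: 6 genes; 2 atlas cell type averages and 4 new cells
atlas_avg = np.array([[100.0, 0.0], [0.0, 100.0], [10.0, 10.0],
                      [10.0, 10.0], [10.0, 10.0], [10.0, 10.0]])
new_data = np.array([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0],
                     [2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 40.0],
                     [5.0, 5.0, 5.0, 5.0], [3.0, 3.0, 3.0, 3.0]])
features = select_features(atlas_avg, new_data)
print(features)
```

In the toy data, genes 0 and 1 are atlas markers and gene 3 varies only within the new cells, so all three are retained even though gene 3 would be invisible to an atlas-only selection.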

Weighted PCA
Principal component analysis (PCA) is a commonly used technique across scientific fields and is used in single cell transcriptomics to further mitigate the effect of noise in high-dimensional spaces beyond feature selection. northstar performs PCA on the feature-selected data using weights to balance the influence of the atlas and the new dataset on the annotation result.
Note that most descriptions of PCA consider the rows as observations and columns as variables, whereas we consider a gene expression matrix M that has the cells as columns and the genes as rows, to be more consistent with the single cell transcriptomics literature (see above). In any case, the same operations can be trivially applied to the transposed matrix and lead to the same result.
First, the weights are normalized into fraction of the total weight via division by the total number of cells in both atlas and the new dataset. The resulting vector of weights is then written in a diagonal matrix W of size N x N. We call this matrix the normalized weight matrix.
Then, M is linearly shifted to have zero weighted mean across all cells. It is then normalized by dividing by the square root of the weighted variance across all cells.
The weighted covariance matrix of $M$ is calculated as $\mathrm{Cov}_w(M) = M W M^T$. This matrix has dimensions $L \times L$ and is real and symmetric, so its canonical linear map has $L$ real eigenvalues and $L$ real eigenvectors. Let $U$ be the matrix with the eigenvectors as columns.
The first P eigenvalues and eigenvectors are computed by standard linear algebraic techniques. Although all L components can be calculated in principle, it is in practice sufficient for the sake of subsequent steps to trim this operation to P ~ 20 because the spectrum of eigenvalues decays sharply after the first few elements.
The $P$ eigenvectors are the top gene loadings of the weighted PCA. To find the cell loadings or principal components (PCs), we exploit the well-known connection between PCA and singular value decomposition (SVD): the eigenvectors of $\mathrm{Cov}_w(M)$ are the singular vectors of the data matrix on the gene side (left singular vectors of $M$ when genes are rows, or equivalently right singular vectors in the transposed convention with observations as rows), while the PCs are the singular vectors on the cell side. The matrix $V$ with the PCs as columns can therefore be computed by matrix multiplication:

$$V = M^T U \Sigma^{-1},$$

where $\Sigma$ is the semi-diagonal matrix with the singular values (i.e. the square roots of the eigenvalues of the covariance matrix) as the first diagonal entries and zero elsewhere, and $\Sigma^{-1}$ indicates the reciprocal of all nonzero singular values and zeros elsewhere. In theory, it is possible that this inversion cannot be computed if the covariance matrix is degenerate (i.e. it has rank $< L$). However, such pathological cases are unlikely to appear in real datasets and are not considered further here.
Note that the linearity of PCA explains why it is possible to calculate an approximate result without having to load all atlas data: since the atlas cells are approximated with their cell type average, having a cell type with a weight of 2 is exactly the same as duplicating that column into the data matrix and having both copies have a weight of 1.
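The weighted PCA and the duplication argument can be checked numerically with a short sketch. The code below is an illustration written for this text, assuming a small dense matrix in which every gene has nonzero weighted variance; the function name `weighted_pca` is hypothetical and the implementation in northstar may differ in its details.

```python
import numpy as np

def weighted_pca(M, weights, n_pcs=2):
    """Weighted PCA of a genes x cells matrix; returns gene loadings U, cell PCs V, eigenvalues."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalized weights, diagonal of W
    mean = M @ w                             # weighted mean of each gene
    centered = M - mean[:, None]
    var = (centered ** 2) @ w                # weighted variance of each gene
    Mn = centered / np.sqrt(var)[:, None]
    cov = (Mn * w) @ Mn.T                    # M W M^T, an L x L matrix
    evals, evects = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:n_pcs]
    U = evects[:, top]
    sigma = np.sqrt(evals[top])              # singular values
    V = Mn.T @ U / sigma                     # V = M^T U Sigma^{-1}
    return U, V, evals[top]

# Toy data: 5 genes, one atlas average with weight 2 and three new cells
rng = np.random.default_rng(0)
M = rng.random((5, 4))
U, V, evals = weighted_pca(M, weights=[2, 1, 1, 1])

# Same data with the atlas column duplicated and unit weights everywhere
M_dup = np.column_stack([M[:, 0], M])
U_dup, V_dup, evals_dup = weighted_pca(M_dup, weights=[1, 1, 1, 1, 1])

print(np.allclose(evals, evals_dup))
print(np.allclose(np.abs(V), np.abs(V_dup[1:])))
```

Both comparisons hold up to the arbitrary sign of each eigenvector, which is why absolute values are compared for the PCs.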

Construction of the k nearest neighbors graph
To construct the k nearest neighbors (knn) graph, a distance matrix between the new cells and all cells (including atlas and new ones) is first computed. The software package lets the user choose the distance metric and defaults to correlation distance (one minus the Pearson correlation), a commonly used metric in single cell transcriptomics. The distance matrix has dimensions $N_n \times N$.
Then, for each new cell (row in the distance matrix) the k elements at closest distance excluding self are identified. Notice that there is no need to sort the distance vector fully for this. The algorithm then scans these candidate neighbors in order of increasing distance. If a candidate belongs to the new dataset, it is added to the neighbors list of this cell. If a candidate belongs to the atlas, however, it is a representative of a larger cell type which is bona fide tightly clustered around it. Hence, not only one neighbor is added but rather a number equal to the size of the cell type in the atlas. The total number of neighbors is then trimmed to k.
In addition to the neighbors from the new data into the atlas, adding a few edges that are computed outwards from the atlas helps to reduce batch effects (similarly to mutual nearest neighbors schemes). We usually compute 5 neighbors in the new dataset for each atlas average and add this small number of edges to the similarity graph.
It is possible that a cell (or atlas average) has no close neighbors or just fewer than k. To consider this situation, a maximal distance threshold is used to compute the neighbor candidates. The default for correlation distance is 0.8: cells with a Pearson r < 0.2 with the current cell of interest are never treated as neighbors.
Once the nearest neighbors for every new cell are found, an undirected graph is constructed by symmetrization of all edges. Edges between new cells and atlas cell types can be present multiple times in the neighbors lists and are weighted accordingly. This increased edge weight ensures that if a cell is really close to a known cell type it will rapidly be absorbed into that cluster during the Leiden algorithm below. No direct edges are set between atlas cell types because they are annotated already and are not allowed to change cluster membership during the modified Leiden algorithm.
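The neighbor search for the new cells can be sketched as follows. This is a simplified illustration: it computes the full distance matrix and uses a full sort for brevity, it omits the few edges computed outwards from the atlas, and it assigns unit weight to edges between new cells, which is an assumption of this example. The function name `knn_edges` is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_edges(pcs, n_atlas, sizes, k=3, threshold=0.8):
    """Weighted undirected edges from each new cell to its nearest neighbors.

    pcs: N x P array of principal components; the first n_atlas rows are atlas averages.
    sizes: number of atlas cells behind each average.
    """
    dist = 1.0 - np.corrcoef(pcs)              # correlation distance
    edges = Counter()
    for i in range(n_atlas, pcs.shape[0]):
        d = dist[i].copy()
        d[i] = np.inf                          # exclude self
        # k closest candidates within the threshold, by increasing distance
        candidates = [int(j) for j in np.argsort(d)[:k] if d[j] <= threshold]
        neighbors = []
        for j in candidates:
            copies = sizes[j] if j < n_atlas else 1   # an atlas average stands for many cells
            neighbors.extend([j] * copies)
        for j in neighbors[:k]:                # trim to k neighbors
            key = (min(i, j), max(i, j))
            if j < n_atlas:
                edges[key] += 1                # repeated atlas neighbors raise the weight
            else:
                edges[key] = 1                 # new-new edge, unit weight (assumption)
    return dict(edges)

pcs = np.array([
    [1.0, 2.0, 3.0, 4.0],    # atlas average 0
    [4.0, 3.0, 2.0, 1.0],    # atlas average 1
    [1.0, 2.0, 3.0, 5.0],    # new cell resembling atlas 0
    [1.1, 2.0, 3.2, 4.0],    # new cell resembling atlas 0
    [4.0, 3.0, 2.0, 0.0],    # new cell resembling atlas 1
])
edges = knn_edges(pcs, n_atlas=2, sizes=[2, 2], k=3)
print(edges)
```

In this toy example each new cell is connected to its closest atlas average with weight 2, while the two new cells that resemble the same cell type share a single edge of weight 1. Candidates with a correlation distance above the threshold never become neighbors.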

Leiden clustering with fixed nodes
The joint dataset is now represented as a neighborhood graph and can be clustered using standard graph-based methods. However, the nodes belonging to atlas cell types are already annotated and should not be allowed to change membership lest a full reannotation of the atlas is required. northstar solves this issue by modifying the Leiden algorithm for community detection in large graphs to allow for a number of nodes to be "fixed" into their initial membership. The Leiden algorithm itself has been proven effective in unsupervised clustering of single cell transcriptomic data, scales well with graphs with millions of nodes, and provides mathematical connectivity guarantees that lend trust to the resulting annotations.
Briefly, Leiden performs two kinds of operations on nodes, namely "move" and "merge". Moreover, it recursively collapses the initial graph into simplified aggregated graphs in which each cluster becomes a single node: the "move" and "merge" steps are then repeated in the aggregated graph. northstar changes both the "move/merge" steps and the aggregation step.
First, whenever a queue of nodes is constructed for consideration in terms of a move/merge, nodes that are marked as fixed are just never considered. This prevents them from switching membership to another community. Second, whenever aggregated graphs are constructed, fixed nodes are collapsed only if they belong to the same community from the beginning; otherwise they are never collapsed.
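The effect of fixing nodes can be illustrated with a deliberately simplified sketch. The code below is not the Leiden algorithm: it implements only a greedy, Louvain-style local moving phase on a dense adjacency matrix, without the refinement and aggregation steps, and it is meant solely to show that skipping fixed nodes keeps the atlas memberships intact while new cells join them or form their own communities. The function name and the toy graph are hypothetical.

```python
import numpy as np

def local_moves_with_fixed_nodes(adjacency, membership, fixed, resolution=1.0, max_sweeps=20):
    """Greedy modularity-increasing moves in which fixed nodes keep their community."""
    A = np.asarray(adjacency, dtype=float)
    membership = np.array(membership)
    degree = A.sum(axis=1)
    two_m = degree.sum()

    def gain(v, c):
        # Modularity gain (up to a constant factor) of placing node v into community c
        members = membership == c
        members[v] = False
        return A[v, members].sum() - resolution * degree[v] * degree[members].sum() / two_m

    for _ in range(max_sweeps):
        moved = False
        for v in range(A.shape[0]):
            if fixed[v]:
                continue                       # fixed atlas nodes are never queued
            best, best_gain = membership[v], gain(v, membership[v])
            for c in np.unique(membership):
                g = gain(v, c)
                if g > best_gain + 1e-12:
                    best, best_gain = c, g
            if best != membership[v]:
                membership[v] = best
                moved = True
        if not moved:
            break
    return membership

# Toy graph: nodes 0 and 1 are atlas cell types, nodes 2 to 7 are new cells
n = 8
A = np.zeros((n, n))
for (i, j), w in {(0, 2): 3, (0, 3): 3, (2, 3): 1, (1, 4): 3,
                  (5, 6): 1, (5, 7): 1, (6, 7): 1}.items():
    A[i, j] = A[j, i] = w
fixed = np.array([True, True] + [False] * 6)
result = local_moves_with_fixed_nodes(A, membership=np.arange(n), fixed=fixed)
print(result)
```

In the toy graph, the new cells connected to an atlas node are absorbed into its community, whereas the three new cells that are connected only to each other end up in a community without any atlas node, i.e. a novel cluster.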
The output of northstar is a list of cluster memberships for each new cell. Numbers 0 to $(N_a - 1)$ represent extant cell types in the atlas (corresponding to the first $N_a$ columns of $M$) and higher numbers indicate new inferred cell types in the new dataset.
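Translating such a membership list into readable labels is straightforward. The helper below is a hypothetical convenience function written for this text, assuming the atlas cell type names are given in the same order as the first columns of the merged matrix; the label format for new clusters is likewise an arbitrary choice.

```python
def decode_membership(membership, atlas_cell_types):
    """Translate integer memberships into atlas cell type names or new cluster labels."""
    n_atlas = len(atlas_cell_types)
    return [
        atlas_cell_types[m] if m < n_atlas else f"new cluster {m}"
        for m in membership
    ]

# Hypothetical atlas with three cell types, in column order
atlas_cell_types = ["neuron", "astrocyte", "oligodendrocyte"]
labels = decode_membership([0, 2, 3, 1, 4, 3], atlas_cell_types)
print(labels)
```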