Cell BLAST: Searching large-scale scRNA-seq databases via unbiased cell embedding

An effective and efficient cell-querying method is critical for integrating existing scRNA-seq data and annotating new data. Herein, we present Cell BLAST, an accurate and robust cell-querying method. Powered by a well-curated reference database and a user-friendly Web server, Cell BLAST (http://cblast.gao-lab.org) provides a one-stop solution for real-world scRNA-seq cell querying and annotation.

(batch effect 5, 6 ). Meanwhile, multiple data harmonization methods have been proposed to 23 remove such confounding factors during alignment, for example, via warping canonical 24 correlation vectors 7 or matching mutual nearest neighbors across batches 6 . While these 25 methods can be applied to align multiple reference datasets, computation-intensive 26 realignment is required to map query cells to the (pre-)aligned reference data space. 27 28 To address these challenges, we introduce a new customized deep generative model together 29 with a novel cell-to-cell similarity metric specifically designed for cell querying (Fig. 1a,  30 Method). Differing from canonical variational autoencoder (VAE) models 8-11 , adversarial 31 batch alignment is applied to correct batch effect during low-dimensional embedding of 32 reference datasets. Such a design also enables a special "online tuning" mode that can handle 33 batch effect between query and reference data when necessary. Moreover, by exploiting the 34 model's universal approximator posterior to model uncertainty in latent space, we implement 35 a distribution-based metric to measure cell-to-cell similarity. Finally, we also provide a well-36 curated multispecies single-cell transcriptomics database (ACA) and an easy-to-use Web 37 interface for convenient exploratory analysis. 38 39 To assess our model's capability of capturing biological similarity in the low-dimensional 40 latent space, we first benchmarked against several popular dimension reduction tools 8, 12, 13 41 using real-world data (Supplementary Table 1) and found that our model is overall among 42 the best performing methods (Supplementary Fig. 1-2). We further compared batch effect 43 correction performance using combinations of multiple datasets with overlapping cell types 44 profiled (Supplementary Table 1). Our model achieves significantly better dataset mixing 45 ( Fig. 1b) while maintaining comparable cell type resolution (Fig. 1c). Latent space visualization also demonstrates that our model can effectively remove batch effect for multiple datasets with a considerable difference in cell type distribution (Supplementary 48 Fig. 3). Notably, we found that the correction of inter-dataset batch effect does not 49 automatically generalize to that within each dataset, which is most evident in the pancreatic 50 datasets (Supplementary Fig. 3c-d, Supplementary Fig. 4a-c). For such complex scenarios, 51 our model is effective in removing multiple levels of batch effect simultaneously 52 ( Supplementary Fig. 4d-h). 53 54 While the unbiased latent space embedding derived by the nonlinear deep neural network 55 effectively removes confounding factors, the network's random components and nonconvex 56 optimization procedure also lead to serious challenges, especially false-positive hits when 57 cells outside reference types are provided as query. Thus, we propose a novel posterior 58 distribution-based cell-to-cell similarity metric in the latent space, which we term 59 "normalized projection distance" (NPD). Distance metric ROC analysis (Method) shows that 60 our posterior NPD metric is more accurate and robust than Euclidean distance which is 61 commonly used in other neural network-based embedding tools (Fig. 1d, Supplementary  62   Fig. 4k). Additionally, we exploit the stability of query-hit distance across multiple models to 63 improve specificity (Method, Supplementary Fig. 4l). An empirical p-value is computed for 64 each query hit as a measure of "confidence" by comparing the posterior distance to the 65 empirical NULL distribution obtained from randomly selected pairs of cells in the queried 66 data. 67

68
The high specificity of Cell BLAST is especially important for discovering novel cell types. 69 Two recent studies ("Montoro" 14 and "Plasschaert" 15 ) independently reported a rare tracheal 70 cell type named pulmonary ionocyte. We artificially removed ionocytes from the "Montoro" 71 dataset and used it as a reference to annotate query cells from the "Plasschaert" dataset. In 72 addition to accurately annotating 95.9% of query cells, Cell BLAST correctly rejected 12 of 73 19 "Plasschaert" ionocytes ( Fig. 1e). Moreover, it highlights the existence of a putative novel 74 cell type as a well-defined cluster with large p-values among all 156 rejected cells ( Fig. 1f-g). 75 Further examination shows that this cluster actually corresponds to ionocytes 76 (Supplementary Fig. 6a; also see Supplementary Fig. 5 for more detailed analysis on the 77 remaining 7 ionocytes). By contrast, scmap-cell 2 only rejected 7 "Plasschaert" ionocytes 78 despite the higher overall rejection number of 401 (i.e., more false negatives;

81
We further systematically compared the performance of query-based cell typing with scmap-82 cell 2 and CellFishing.jl 4 (Method) using four groups of datasets, each including both positive 83 control and negative control queries (first 4 groups in Supplementary Table 2). Of interest, 84 while Cell BLAST shows superior performance than scmap-cell and CellFishing.jl under the 85 default setting (Supplementary Fig. 7a-c, 8-10), detailed ROC analysis reveals that the 86 performance of scmap-cell could be further improved to a level comparable to Cell BLAST 87 by employing higher thresholds, while ROC and optimal thresholds of CellFishing.jl show 88 large variation across different datasets (Supplementary Fig. 7d). Cell BLAST presents the 89 most robust performance with a default threshold (p-value < 0.05) across different datasets, 90 which will significantly benefit real-world application. Additionally, we assessed their 91 scalability using reference data varying from 1,000 to 1,000,000 cells. Both Cell BLAST and 92 CellFishing.jl scale well with increasing reference size, while scmap-cell's querying time 93 rises dramatically for larger reference datasets with more than 10,000 cells (Supplementary 94  Fig. 11a-f). By contrast, scmap-cell incorrectly assigned most cells to 116 monocyte and granulocyte lineages (Supplementary Fig. 11g). As another example, we 117 applied "online tuning" to Tabula Muris 18 spleen data, which exhibit significant batch effect 118 between 10x-and Smart-seq2-processed cells. The ROC of Cell BLAST improved 119 significantly after "online tuning", achieving high specificity, sensitivity and Cohen's κ (a 120 measure of prediction accuracy corrected for chance, see Methods for more details) 2 at the 121 default cutoff (Supplementary Fig. 11h (Fig. 2e, Supplementary Fig. 12a-b, Supplementary Table 3). To ensure a unified 128 and high-resolution cell type description, all records in ACA are collected and annotated 129 using a standard procedure (Method), with 98.9% of datasets manually curated with Cell 130 Ontology, a structured controlled vocabulary for cell types. We trained our model on all ACA 131 datasets. Notably, we found that the model works well in most cases with minimal 132 hyperparameter tuning (latent space visualizations, self-projection coverage and accuracy 133 available on our website, Supplementary Fig. 12e). 134 135 A user-friendly Web server is publicly accessible at http://cblast.gao-lab.org, with all curated 136 datasets and pretrained models available. Based on the wealth of resources, our website 137 provides "off-the-shelf" querying service. Users can obtain querying hits and visualize cell 138 type predictions with minimal effort (Supplementary Fig. 12c-d). For advanced users, a 139 well-documented Python package implementing the Cell BLAST toolkit is also available, 140 which enables model training on custom references and diverse downstream analyses. 141 142 By explicitly modeling multilevel batch effect as well as uncertainty in cell-to-cell similarity 143 estimation, Cell BLAST is an accurate and robust querying algorithm for heterogeneous 144 single-cell transcriptome datasets. In combination with a comprehensive, well-annotated 145 database and an easy-to-use Web interface, Cell BLAST provides a one-stop solution for 146 both bench biologists and bioinformaticians.
The solution to the above maximization is given by: 255 Substituting D ‡ * ( ) back into (10), we obtain: 256 Note that in practice, model training balances between ℒ ‡ˆ‰Š and pure batch alignment. 260 Aligning cells of the same type induces a minimal cost in ℒ ‡ˆ‰Š , while improperly aligning 261 cells of different types could cause ℒ ‡ˆ‰Š to rise dramatically. During training, the gradient 262 from both batch discriminators and decoder provide fine-grain guidance to align different 263 batches, leading to better results than "hand-crafted" alignment strategies like CCA 7 and 264 MNN 6 . Empirically, given proper values for ‡ , the adversarial approach correctly handles 265 difference in cell type distribution among batches. If multiple levels of batch effect exist, e.g., 266 within-dataset and cross-dataset, we use an independent batch discriminator for each 267 component, providing extra flexibility. 268

Data preprocessing for benchmarks 269
Most informative genes were selected using the Seurat 7 function "FindVariableGenes". We 270 set the argument "binning.method" to "equal_frequency" and left other arguments as default. 271 If within-dataset batch effect exists, genes are selected independently for each batch and then 272 pooled together. By default, a gene is retained if it is selected in at least 50% of batches. 273 Downstream benchmarks were all performed using this gene set, except for scmap and 274 CellFishing.jl, which provide their own gene selection method. GNU parallel 28 was used to 275 parallelize and manage jobs throughout the benchmarking and data processing pipeline.

Benchmarking dimension reduction 277
PCA was performed using the R package irlba 29 (v2.3.2). ZIFA 12 was downloaded from its 278 Github repository, and hard coded random seeds were removed to reveal actual stability. for scVI and our model, since neural network-based models are typically considered less 292 stable. Run time was limited to 2 hours, after which the jobs were terminated. 293 Cell type nearest neighbor mean average precision (MAP) was computed with K nearest 294 neighbors of each cell based on low-dimensional space Euclidean distance. If we denote the 295 cell type of a cell as , and the cell types of its ordered nearest neighbors as S , oe , … ž . The 296 average precision (AP) for that cell is defined as: 297 Mean average precision is then given by: 298 Note that when = 1, MAP reduces to the nearest neighbor accuracy. We set to 1% of the 299 total cell number throughout all benchmarks. 300

Benchmarking batch effect correction 301
We merged multiple datasets according to shared gene names. If datasets to be merged are 302 from different species, Ensembl ortholog 31 information was used to map genes to ortholog groups before merging. To obtain informative genes in merged datasets, we take the union of 304 informative genes from each dataset, and then intersect the union with the intersection of 305 detected genes from each dataset. 306 CCA 7 and MNN 6 alignments were performed using the R packages Seurat 7 (v2.3.3) and 307 scran 32 (v1.6.9), respectively. Hard-coded random seeds in Seurat were removed to reveal 308 actual stability. The modified version of Seurat is available upon request. For comparability, 309 we evaluated cell type resolution and batch mixing in a 10-dimensional latent space. For 310 MNN alignment, we set the argument "cos.norm.out" to false and left other arguments as 311 default. PCA was applied to reduce the dimensionality to 10 after obtaining the MNN-312 corrected expression matrix. For CCA alignment, we used the first 10 canonical correlation 313 vectors. Run time was limited to 2 hours, after which the jobs were terminated. Seurat 314 alignment score was computed exactly as described in the CCA alignment paper 7 . For our 315 own model, we consistently used ‡ = 0.01, and all other hyperparameters remain the same 316 as in dimension reduction benchmarks. 4 random seeds were used for PCA, CCA and MNN, 317 while 16 random seeds were used for scVI and our model, since neural network-based 318 models are typically considered less stable. We term this distance metric normalized projection distance (NPD). By default, 50 samples 330 from the posterior are used to compute NPD, which produces sufficiently accurate results 331 (Supplementary Fig. 4i-j). The definition of posterior NPD does not imply an efficient 332 nearest neighbor searching algorithm. To increase speed, we first use Euclidean distance-333 based nearest neighbor searching, which is highly efficient in the low-dimensional latent 334 space, and then compute posterior distances only for these nearest neighbors. The empirical 335 distribution of posterior NPD for a dataset is obtained by computing posterior NPD on 336 randomly selected pairs of cells in the reference dataset. Empirical p-values of query hits are 337 computed by comparing the posterior NPD of a query hit to this empirical distribution. 338 We note that even with the querying strategy described above, querying with single models 339 still occasionally leads to many false-positive hits when cell types on which the model has 340 not been trained are provided as query. This is because embeddings of such untrained cell 341 types are mostly random, and they could localize close to reference cells by chance. We 342 reason that embedding randomness of untrained cell types could be utilized to identify and 343 correctly reject them. Practically, we train multiple models with different starting points (as 344 determined by random seeds) and compute query hit significance for each model. A query hit 345 is considered significant only if it is consistently significant across multiple models. To 346 acquire predictions based on significant hits, we use majority voting for discrete variables, 347 e.g., cell type, or averaging for continuous variables, e.g., cell fate distribution. 348

Distance metric ROC analysis 349
Our model and scVI 8 were fitted on reference datasets and applied to positive and negative 350 control query datasets in the pancreas group of Supplementary Table 2. We then randomly 351 selected 10,000 query-reference cell pairs. A query-reference pair is defined as "positive" if 352 the query cell and reference cell are of the same cell type, and "negative" otherwise. Each 353 benchmarked similarity metric was then computed on all sampled query-reference pairs and 354 used as predictors for "positive"/"negative" pairs. AUROC values were computed for each 355 benchmarked similarity metric. In addition to the Euclidean distance, we also computed 356 posterior distribution distances for scVI (Supplementary Fig. 4k). NPD was computed as 357 described in (17), based on samples from the posterior Gaussian. JSD was computed in the 358 original latent space without projection.

Benchmarking query-based cell typing 360
Cell ontology annotations in ACA were used as ground truth. Cells without cell ontology 361 annotations were excluded in the analysis. For each querying method, cell type predictions 362 for query cells were obtained based on query hits with a minimal similarity cutoff, i.e., query 363 cells with no significant hits are rejected, while cells not rejected are further assigned cell 364 type predictions. Sensitivity, specificity and Cohen's κ are computed as follows: 365 Predictions are considered correct if they exactly match the ground truth, i.e., no flexibility 366 based on cell type similarity. This prevents unnecessary bias introduced in the selection of 367 cell type similarity measure. Cells were inversely weighed by the size of the corresponding 368 dataset when computing average sensitivity, specificity and Cohen's κ. AUROC was 369 computed using linear interpolation. For scmap 2 , we varied the minimal cosine similarity 370 requirement for nearest neighbors. For Cell BLAST, we varied the maximal p-value cutoff 371 used in filtering hits. For CellFishing.jl 4 , the original implementation does not include a 372 dedicated cell type prediction function, so we used the same strategy as that for our own 373 method (majority voting after distance filtering) to acquire final predictions, in which we 374 varied the Hamming distance cutoff used in distance filtering. Finally, 4 random seeds were 375 tested for each cutoff and each method to reflect stability. Several other cell querying tools 376 (CellAtlasSearch 3 , scQuery 33 , scMCA 34 ) were not included in our benchmark because they 377 do not support custom reference datasets. 378

Benchmarking querying speed 379
To evaluate the scalability of querying methods, we constructed reference datasets of varying 380 sizes by subsampling from the 1M mouse brain dataset 35 . For query data, the "Marques" 381 dataset 36 was used. Benchmarking was performed on a workstation with 40 CPU cores, 382 100GB RAM and GeForce GTX 1080Ti GPU. For all methods, only the querying time was 383 recorded, not including the time consumed to build reference indices.

Application to trachea datasets 385
We first removed cells labeled as "ionocytes" in the "Montoro_10x" 14 dataset and used 386 "FindVariableGenes" from Seurat to select informative genes in the remaining cells. Four 387 models with different starting points were trained on the tampered "Montoro_10x" dataset. 388 We used a cutoff of empirical p-value > 0.1 to reject query cells from the "Plasschaert" 15 389 dataset as potential novel cell types. We clustered rejected cells using spectral clustering 390   cutoff, so we chose a Hamming distance of 110, which is the closest to balancing sensitivity and specificity, but 508 it is still far from being stable across different datasets. (e) Querying speed on reference datasets of different 509 sizes subsampled from the 1M mouse brain dataset 35 . Error bars indicate mean ± s.d.