Precise identification of cell states altered in disease using healthy single-cell references

Joint analysis of single-cell genomics data from diseased tissues and a healthy reference can reveal altered cell states. We investigate whether integrated collections of data from healthy individuals (cell atlases) are suitable references for disease-state identification and whether matched control samples are needed to minimize false discoveries. We demonstrate that using a reference atlas for latent space learning followed by differential analysis against matched controls leads to improved identification of disease-associated cells, especially with multiple perturbed cell types. Additionally, when an atlas is available, reducing control sample numbers does not increase false discovery rates. Jointly analyzing data from a COVID-19 cohort and a blood cell atlas, we improve detection of infection-related cell states linked to distinct clinical severities. Similarly, we studied disease states in pulmonary fibrosis using a healthy lung atlas, characterizing two distinct aberrant basal states. Our analysis provides guidelines for designing disease cohort studies and optimizing cell atlas use.


Supplementary figure 3: Cell and sample representation in cell neighbourhoods constructed with different reference designs on the IPF case-control dataset. Integration with the CR design leads to less mixing between samples in the cell neighbourhoods. (A) Histogram of the number of cells per neighbourhood with the query-mapping CR design (scANVI scArches), the joint-embedding CR design (scANVI) and the query-mapping ACR design. (B) Histogram of the number of samples per neighbourhood with the query-mapping CR design (scANVI scArches), the joint-embedding CR design (scANVI) and the query-mapping ACR design. (C) Two-dimensional histograms showing the number of cells per neighbourhood against the number of samples per neighbourhood for each reference design.
Supplementary figure 4: Differential expression analysis to identify markers for aberrant basal-like cells detected with ACR design. Heatmaps of mean expression of marker genes for aberrant basal-like phenotypes, identified by differential expression analysis between basaloid cells and KRT17hi aberrant basal cells (edgeR quasi-likelihood differential expression test [1]). DE genes were considered markers only if also significantly DE between the aberrant phenotype and normal basal cells (at 1% Benjamini-Hochberg FDR). Here we show marker genes that were expressed in at least 20% of cells of the aberrant phenotype, where the log-fold change for DE was > 2 and with mean expression (log-normalized counts) > 0.2 (see suppl. table 4 for a full list of DE genes). Genes previously identified as markers of basaloid or KRT17hi aberrant basal cells are marked with a blue dot (the left facet shows expression of genes that were previously reported as markers, but that were not identified as DE between basaloid and aberrant KRT17hi basal cells in our analysis). Genes are ordered by absolute logFC, as indicated by points on the left. Genes significantly associated with GWAS variants for lung function (trait ID: EFO_0003892, using the OpenTargets Genetics Locus2Gene association) are marked by a blue bar. Genes marked by a red bar on the right are validated target genes of drugs in trials for lung disease (trait ID: EFO_0003818, using the OpenTargets database).

Supplementary Notes
2.1 Alternatives to differential abundance analysis for OOR state detection.
In this study, the workflow to detect disease-associated states relies on differential abundance analysis at the subpopulation level, using methods such as Milo [2], MELD [3] or co-varying neighbourhood analysis [4]. We recognize that scRNA-seq studies focused on disease-state identification, especially those using atlas datasets for this task [5][6][7], often omit differential abundance testing and define out-of-reference (OOR) cells using different criteria. In this note we test three alternative criteria for recovery of OOR states in our simulation framework: (a) distance/segregation of OOR states in latent space, (b) query-mapping quality control metrics, and (c) differential expression analysis after label transfer. We discuss the limitations of these alternative approaches, and how they can also benefit from the use of control datasets.

Distance/segregation of OOR cells in latent space
Phenotypically distinct out-of-reference cells are generally expected to segregate from reference cells in the latent space, so that they can be distinguished by unsupervised clustering or other distance-based metrics. Exploratory data analysis suggested that this is not always the case (see for example figure 2B: with the AR design the OOR cells do not form a separate cluster). To quantitatively test this hypothesis, we use the simulations with a single OOR state (figure 3A). For the latent space of each reference design (see Methods: Simulation experiments - Latent space embedding), we train a KNN classifier to distinguish the OOR state in the disease dataset, using Euclidean distance in the latent space (from scVI or scArches mapping) (k = 50). For each simulation with a different OOR state, we split the disease dataset into a training set (66% of cells) and a test set (33% of cells) and measure precision and recall of OOR state classification on the test set. Of note, this experiment is a proof-of-concept to test whether OOR cells are indeed distant from in-reference cells in the latent space; it does not represent a strategy to detect OOR cells in a discovery scenario, since it assumes prior labelling of true OOR cells.
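This proof-of-concept can be sketched as follows (a toy example using scikit-learn; the latent coordinates, cluster separation and OOR labels are illustrative stand-ins for the simulation embeddings, not the actual data):

```python
# Toy sketch of the KNN proof-of-concept: train a k-NN classifier on 66% of
# "disease" cells with known OOR labels and score precision/recall on the
# held-out 33%. All data below is simulated for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# 1,500 in-reference cells and 300 OOR cells, shifted in a 10-D latent space
latent = np.vstack([rng.normal(0, 1, (1500, 10)),
                    rng.normal(2, 1, (300, 10))])
is_oor = np.array([0] * 1500 + [1] * 300)

X_train, X_test, y_train, y_test = train_test_split(
    latent, is_oor, test_size=0.33, stratify=is_oor, random_state=0)

knn = KNeighborsClassifier(n_neighbors=50, metric="euclidean")
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
prec, rec = precision_score(y_test, pred), recall_score(y_test, pred)
```

With well-separated toy clusters both scores are high; in the real simulations they vary strongly with the choice of OOR population, which is the point of the experiment.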
In these experiments, the OOR cell state is not consistently distinguished by distance to reference cells in the latent space (Suppl. note figure 1). In general, the CR design with joint embedding performed best with this metric, but precision and recall are mostly dependent on the OOR cell population.

Mapping quality control metrics
In some cases, different cell-level metrics of query-to-reference mapping quality are used to flag OOR cells, such as the reconstruction error of the latent embedding model or the uncertainty of a cell type classifier trained on the atlas dataset [5][6][7]. The assumption is that, if the disease-specific cell state was not seen during training of the latent embedding model, mapping quality metrics should indicate poor mapping.
We compare quality control (QC) metrics for scArches on the mapping of the disease dataset onto the atlas dataset (mapping QC score), quantifying whether OOR cells and cell states are detectable by a low QC score (i.e. higher uncertainty during query-to-reference mapping). For the label transfer uncertainty, we use the method presented by [6]. Briefly, we train a weighted KNN classifier for cell type labels on the latent embedding of the scVI reference dataset, setting the number of neighbors to k = 100. For each cell x in the query dataset, we extract its k nearest neighbors in the reference and the corresponding Euclidean distances, adjusted by a Gaussian kernel. We compute the probability P(y|x) of assigning each label y to the query cell x by normalising across all adjusted distances. The label uncertainty u(x) corresponds to 1 - max_y P(y|x). For both label uncertainty and reconstruction error, higher values correspond to lower mapping quality. To compare the mapping QC metrics to the differential abundance logFC on neighbourhoods defined with the ACR design, we calculate the average mapping QC score over all disease cells in the same neighbourhood.
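A minimal sketch of the label uncertainty score (our own simplification of the weighted KNN scheme of [6]; the Gaussian bandwidth choice here is an assumption, not the reference implementation):

```python
import numpy as np

def label_uncertainty(dists, neighbor_labels, n_labels):
    """dists: (k,) Euclidean distances from a query cell to its k reference
    neighbours; neighbor_labels: (k,) integer cell type labels of those
    neighbours. Returns 1 - P(most likely label)."""
    # Gaussian kernel adjustment of distances (bandwidth assumed = std of dists)
    sigma = float(dists.std()) or 1.0
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    # Probability of each label = normalised sum of its neighbours' weights
    probs = np.bincount(neighbor_labels, weights=weights, minlength=n_labels)
    probs = probs / probs.sum()
    # Higher uncertainty corresponds to lower mapping quality
    return 1.0 - probs.max()
```

When all k neighbours carry the same label the uncertainty is 0; a 50/50 split between two labels at equal distance gives 0.5.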
In our hands, mapping quality metrics failed to distinguish ground-truth OOR cells (Suppl. note figure 2A) or neighbourhoods with a high fraction of OOR cells (Suppl. note figure 2B).
We tested whether, by using an ACR design, we could increase sensitivity by prioritising neighbourhoods where the mapping quality was worse for the disease dataset cells than for the control dataset cells, taking the difference between the average QC scores over disease cells and over control cells. This slightly improved OOR state detection performance using mapping QC metrics (Suppl. note figure 2C). However, overall we found that using mapping quality metrics to identify OOR populations performed significantly worse than using the logFC estimated from differential analysis. Moreover, we also compared these metrics on real disease data, testing the ability to detect SPP1hi macrophages in the IPF dataset (Extended Data figure 7D-E). Here we confirmed that differential abundance analysis performed significantly better than the label transfer uncertainty score used in the HLCA study for detecting profibrotic macrophages [7].

OOR detection with label transfer and differential expression analysis
Finally, we reasoned that an atlas dataset might be employed for automatic label annotation, before identification of disease-specific cell populations through differential expression (DE) analysis between cells from disease and reference samples of each cell type.
To compare reference designs with such a workflow, we tested whether OOR cell states in simulations could be distinguished by identifying over-expressed marker genes when running differential expression analysis on cell type clusters. For each simulation and reference design, we perform latent space embedding as described in the Methods and we train a KNN classifier to learn cell type annotation labels from the healthy reference latent embedding (k = 50). We then predict annotations in the pseudo-disease dataset. We test for differential expression between donors in the healthy reference and the pseudo-disease dataset for each cell type annotation, pseudo-bulking expression profiles by donor, with the edgeR quasi-likelihood test [1] using the implementation in the R package glmGamPoi [8], at a 10% FDR. For each cell type DE test, we remove genes expressed in fewer than 10 donors and with total counts below 20, and subsequently select the 10,000 most highly variable genes, using the modelGeneVar function from the scran package [9]. We define a cell type cluster as disease-affected (i.e. a misannotated cell population) if at least 5% of cells in the OOR state were predicted to belong to the cluster (suppl. note figure 3A). All other clusters are considered unaffected. For sensitivity analysis, we assume that no genes are differentially expressed in the unaffected clusters, while we test for the ability to detect differentially expressed marker genes of the OOR state in the affected cell state (suppl. note figure 3B). We define marker genes for each cell type as the top 50 predictive genes from a celltypist logistic regression model [10] trained on human immune cells.
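The pseudo-bulking and gene-filtering steps above can be sketched in Python as follows (a simplified stand-in: the actual DE test is run with edgeR/glmGamPoi in R, and the thresholds are passed as parameters here for illustration):

```python
import numpy as np
import pandas as pd

def pseudobulk(counts, donors):
    """Sum raw counts over cells of the same donor -> donors x genes matrix.
    counts: (cells, genes) raw count matrix; donors: per-cell donor IDs."""
    df = pd.DataFrame(counts)
    df["donor"] = donors
    return df.groupby("donor").sum()

def filter_genes(pb, min_donors=10, min_total=20):
    """Keep genes detected in >= min_donors donors with >= min_total counts,
    mirroring the pre-filtering applied before each cell type DE test."""
    keep = ((pb > 0).sum(axis=0) >= min_donors) & (pb.sum(axis=0) >= min_total)
    return pb.loc[:, keep]
```

The resulting donors x genes matrix is what a count-based DE test such as the edgeR quasi-likelihood test operates on.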
Without a control dataset (AR design) we had a high rate of false positive genes in the unaffected clusters, where no DE genes are expected by design (suppl. note figure 3B). In contrast, we detected almost no false positives with designs where the pseudo-disease dataset and control are contrasted in the DE analysis, while still detecting a comparable rate of true positives (marker genes of the OOR state) (suppl. note figure 3B). We verified that the rate of false positives is not attributable to failed label transfer (suppl. note figure 3C).

Robustness of OOR detection with ACR design
This note presents additional analyses on the robustness of OOR state detection with the ACR design to variation in the atlas dataset (see Results - Robustness of OOR detection with ACR design).

Leave-one-out analysis
We assessed whether one particular study amongst the 12 PBMC studies used to assemble the atlas dataset in simulations was driving the OOR detection performance with the ACR design (see suppl. table 1 for the list of studies). To explore this, we performed repeated simulations with a leave-one-out approach. In each iteration, we excluded one of the 12 datasets from the analysis and then repeated the OOR state detection using the ACR design. For simulations with more than 400 OOR cells, the AUPRC of OOR state detection was stable when removing studies from the atlas dataset. The variability in performance increased where the OOR state population was smaller (< 400 cells) (suppl. note figure 4A). We did not observe systematically higher AUPRC with the largest atlas dataset (with all studies), and performance was not systematically lower when excluding a specific dataset (suppl. note figure 4B): in some cases we obtained the best performance when excluding data from the largest study (Powell dataset, comprising 1000 donors [11]). The variance in performance with smaller OOR cell states likely reflects limitations in sensitivity of the current workflow for OOR state detection (Extended Data figure 4).

OOR state detection with cross-tissue atlas
In addition to varying the size and complexity of the atlas dataset, we also considered a scenario where the atlas does not perfectly match the cell types in the case-control dataset. Since large atlases are not yet available for all organs, we reasoned that, in the absence of a tissue-specific atlas dataset, case-control datasets could be mapped to large datasets collecting cell types from multiple tissues [10,[12][13][14]. We applied our pseudo-disease simulation framework to a dataset of healthy lung cells from 30 donors, where we remove both immune and non-immune OOR states from the control (suppl. note figure 5A). Here we compared OOR detection performance using a tissue-specific lung cell atlas [7] or a cross-tissue atlas dataset [14], which does not include lung cells (suppl. note figure 5B).
Briefly, gene expression count matrices for human lung scRNA-seq data from Adams et al. 2020 were downloaded from GEO (GSE136831). As cell type annotations, we used uniform labels generated from integration of this dataset with the Human Lung Cell Atlas by Sikkema et al. [7], downloaded from Zenodo (https://zenodo.org/record/6337966). We subset the dataset to the control patients and to samples for which at least 1000 cells were captured, retaining data from 30 samples (libraries) from 30 donors. We further subsampled the dataset by randomly selecting 1000 cells per sample, to reduce the computational burden of this analysis. To simulate control and disease datasets for benchmarking, we split donors at random with equal probabilities into a disease subset (15 donors) and a control subset (15 donors). We checked that the split was balanced in overall cell state representation between the two sets with principal component analysis on cell type proportions for each sample. To simulate the presence of an out-of-reference cell state, we select one cell type label and remove all cells with that label from the control dataset (suppl. note figure 5A). We repeated this simulation for 6 annotated cell types in the Adams dataset, including 3 immune cell states (natural killer cells, type 2 dendritic cells, alveolar macrophages) and 3 non-immune/stromal cell states (fibroblasts, lymphatic endothelial cells and multiciliated epithelial cells). As a tissue-specific atlas, we used the core Human Lung Cell Atlas (HLCA) [7], collecting scRNA-seq data from 107 donors from 14 studies. We downloaded the scANVI model [15] trained on the core HLCA from Hugging Face using the API available via scvi-tools v0.20.0. As a cross-tissue atlas, we used the Tabula Sapiens dataset [14], comprising scRNA-seq data from 24 tissues from 14 individuals. We downloaded the full dataset from figshare, excluding smart-seq2 libraries and data from lung samples. We trained an scVI model on this dataset as described above for the PBMC simulation, using donor ID as the batch covariate and training on 2000 highly variable genes to match the number of genes used to train the HLCA model. For each simulated control/disease dataset assignment with a different OOR population, for the ACR designs we performed latent embedding of the query dataset using scArches on the atlas models as described above.
For the CR design with joint embedding, we trained an scVI model on the concatenated control and disease datasets as described for the PBMC dataset, using donor ID as the batch covariate and training on 2000 highly variable genes.We performed differential abundance analysis between the control and disease dataset with Milo, as described above.
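The donor split and OOR removal described above can be sketched as follows (hypothetical column names on a per-cell metadata table; the real pipeline operates on AnnData objects):

```python
import numpy as np
import pandas as pd

def simulate_oor(obs, oor_label, seed=0):
    """obs: per-cell metadata with 'donor' and 'cell_type' columns (names
    are illustrative). Splits donors 50/50 at random into disease/control
    and removes the OOR cell type from the control half only."""
    rng = np.random.default_rng(seed)
    donors = obs["donor"].unique()
    disease_donors = set(rng.choice(donors, size=len(donors) // 2, replace=False))
    is_disease = obs["donor"].isin(disease_donors)
    # keep all disease cells; drop the OOR label from the control cells
    keep = is_disease | (obs["cell_type"] != oor_label)
    out = obs.loc[keep].copy()
    out["dataset"] = np.where(is_disease.loc[keep], "disease", "control")
    return out
```

The OOR cell type is thus present only in the disease subset, which is what the detection workflow is subsequently asked to recover.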
We found that OOR state detection performance with the two types of atlases was qualitatively comparable across simulations, with a slight decrease in sensitivity with the cross-tissue atlas for 2 of the 6 OOR cell states tested (suppl. note figure 5C-D). Notably, we observed the largest difference in sensitivity between the ACR design with a cross-tissue atlas and alternative designs in the detection of out-of-reference multiciliated cells, which are a specialized class of epithelial cells found in the airways. Based on the maximum fraction of OOR cells per neighbourhood in this simulation, multiciliated cells occupied neighbourhoods shared with other epithelial cell types under this design. These results suggest that using a cross-tissue atlas for contextualization of a case-control dataset might be a practical alternative where tissue-specific data are scarce, although the power to detect changes in highly tissue-specialized subpopulations might decrease.

Matched controls allow the most sensitive out-of-reference detection
While our analyses suggest that mapping matched pseudo-disease and control datasets to an atlas dataset enhances the sensitivity to detect OOR states, in practice there are cases where collecting matched control samples is especially challenging (e.g. brain tumour studies, or studies of osteosarcoma). Therefore, we tested whether selecting a subset of samples from the atlas to use as the control dataset in the ACR design performed as well as using matched control samples. Specifically, we ran the disease-state detection workflow with the ACR design, defining as the control dataset either (a) donors from the same study (matched control), (b) a random subsample of donors from the atlas dataset (random control) or (c) a subset of donors selected by similarity in cell abundances to the disease dataset donors (close control). For the latter, we applied the following unsupervised selection strategy, based on similarity between cell state proportions (illustrated in suppl. note figure 6A): we take the latent dimensions learnt by training scVI on the atlas dataset and mapping the disease dataset with scArches (i.e. as described in Methods for the AR design). In this latent space, we construct Milo neighbourhoods and the matrix of cell counts of dimension samples x neighbourhoods, as described in the Methods (section "Differential abundance analysis with Milo"). We log-normalise cell counts per sample (running scanpy.pp.normalize_total with parameter target_sum=10000, followed by scanpy.pp.log1p) and perform principal component analysis on the matrix of log-normalised cell counts. This generates a d-dimensional space (d = 10) in which we compute Euclidean distances between disease and atlas samples. In this space, for each disease sample we take the closest atlas sample and we consider this set to be the "close controls". If one atlas sample is the nearest neighbor to multiple disease dataset samples, we look at the second nearest neighbors for these samples and take those with the smallest Euclidean distance.
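The close-control selection can be sketched as follows (hypothetical inputs: the real workflow builds the samples x neighbourhoods count matrix with Milo and normalises with scanpy; the greedy matching below is a simplification of the second-nearest-neighbour tie-breaking rule):

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

def close_controls(counts, is_disease, n_pcs=10, target_sum=1e4):
    """counts: (samples, neighbourhoods) cell count matrix;
    is_disease: boolean mask marking disease samples.
    Returns indices of atlas samples chosen as close controls."""
    # log-normalise counts per sample (as normalize_total + log1p would)
    norm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * target_sum)
    n_comp = min(n_pcs, min(norm.shape) - 1)
    pcs = PCA(n_components=n_comp).fit_transform(norm)
    # Euclidean distances from each disease sample to each atlas sample
    dists = cdist(pcs[is_disease], pcs[~is_disease])
    atlas_idx = np.where(~is_disease)[0]
    chosen = []
    # match greedily, closest pairs first, never reusing an atlas sample;
    # a claimed sample falls through to the next nearest neighbour
    for row in np.argsort(dists.min(axis=1)):
        for j in np.argsort(dists[row]):
            if atlas_idx[j] not in chosen:
                chosen.append(int(atlas_idx[j]))
                break
    return chosen
```

Each disease sample thus claims one distinct atlas sample, giving a control set the same size as the disease cohort.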
In all cases, the control donors are excluded from the atlas dataset used for scVI training. We found that using the matched control always outperformed subsampling from the atlas, and using the close control only marginally reduced the FDR compared to random subsampling (suppl. note figure 6B-C). Notably, using any type of control always outperformed differential analysis on the full atlas dataset.
These results highlight that comparison against a set of matched controls is always advantageous, likely minimizing false positives driven by hidden confounding factors. To an extent, accounting for confounding covariates in differential analysis based on available metadata might reduce the false discovery rate when no matched controls are available. However, it is to be expected that comparing uniformly processed healthy and disease samples from a similar population will minimise variability associated with known and hidden confounders. Selecting samples from published data based on phenotypic similarity led to a small improvement in performance compared to selecting samples at random. While here we applied a naive approach to define sample similarity, novel methods to model sample-level heterogeneity in scRNA-seq data are being explored [16][17][18]. These could improve the matching of disease samples to optimal controls, and provide new insights into which technical and demographic variables are likely to affect disease-to-healthy comparisons.

Impact of latent embedding strategy on OOR state detection
Throughout this study, we use a transfer learning (TL) method for joint dimensionality reduction with an atlas dataset [6], as an efficient alternative to joint embedding of concatenated datasets. We expect our reference design analysis results to be robust to the use of other TL-based methods for query-to-reference mapping [5,19]. In the COVID-19 analysis we observed minimal changes to the embedding with the ACR design when using de novo integration instead of TL (figure 5D). Conversely, for the reference design without an atlas (CR design) we observed differences between the TL and joint embeddings, which was not surprising given the limitations of TL approaches when comparable numbers of cells are present in the reference dataset (control) and the query dataset (disease) [6]. We found that the CR design with joint embedding significantly increases the performance of disease-state detection compared to the CR design with TL (figure 3B, figure 5D), although this approach appears to be highly sensitive to feature selection (figure 3C). Of note, in this study we used a standard approach to select features for integration, choosing the default method implemented in the most popular single-cell analysis pipelines [20,21]. Many studies use more complex feature selection strategies that might be batch-aware or tissue-specific (e.g. excluding known "soup" genes). Given the wealth of options, we believe robustness of workflows to feature selection choices is a key feature to consider. To date, integration methods and parameters, including feature selection, have been primarily benchmarked on the tasks of batch correction, automatic cell type annotation and fast analysis of new data generated from the same biological condition [22][23][24]. While a systematic comparison of different integration strategies on disease datasets is beyond the scope of this study, the benchmarking set-up and python package presented here (https://github.com/MarioniLab/oor_benchmark) could serve as a basis for future comparisons focused on alternative integration methods.
For the reconstruction error metric, we take n = 50 samples from the posterior of the scArches model to generate profiles of predicted gene expression counts x̂_c,g for each cell c and each of the highly variable genes g used in training. We define the reconstructed gene expression profile for cell c as log(x̂_c,g + 1). Given the true log-normalised gene expression profile x_c for each cell c, we define the reconstruction error as the cosine distance between x_c and the reconstructed profile.
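A sketch of this metric (assumed array shapes; averaging the posterior samples before the log transform is our assumption, and the posterior sampling itself would be done with scvi-tools):

```python
import numpy as np

def reconstruction_error(true_lognorm, posterior_counts):
    """true_lognorm: (genes,) observed log-normalised expression of one cell;
    posterior_counts: (n_samples, genes) predicted counts sampled from the
    model posterior. Returns cosine distance (0 = perfect reconstruction)."""
    # reconstructed profile: log(predicted counts + 1), averaged over samples
    recon = np.log1p(posterior_counts.mean(axis=0))
    cos_sim = float(true_lognorm @ recon) / (
        np.linalg.norm(true_lognorm) * np.linalg.norm(recon))
    return 1.0 - cos_sim
```

Because cosine distance is scale-invariant, the metric compares the shape of the expression profile rather than its magnitude.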