Abstract
Joint analysis of single-cell genomics data from diseased tissues and a healthy reference can reveal altered cell states. We investigate whether integrated collections of data from healthy individuals (cell atlases) are suitable references for disease-state identification and whether matched control samples are needed to minimize false discoveries. We demonstrate that using a reference atlas for latent space learning followed by differential analysis against matched controls leads to improved identification of disease-associated cells, especially with multiple perturbed cell types. Additionally, when an atlas is available, reducing control sample numbers does not increase false discovery rates. Jointly analyzing data from a COVID-19 cohort and a blood cell atlas, we improve detection of infection-related cell states linked to distinct clinical severities. Similarly, we studied disease states in pulmonary fibrosis using a healthy lung atlas, characterizing two distinct aberrant basal states. Our analysis provides guidelines for designing disease cohort studies and optimizing cell atlas use.
Main
Precise identification of cell phenotypes altered in disease with single-cell genomics can yield insights into pathogenesis, biomarkers and potential drug targets1,2,3,4,5,6,7,8.
The standard approach to identify altered cell states involves joint analysis of single-cell RNA sequencing (scRNA-seq) data from diseased tissues and a healthy reference. This typically includes integrating cellular profiles from different conditions into a common phenotypic latent space to match common cell types and minimize technical differences9,10. Subsequently, differential analysis is performed on matched cell states between healthy and diseased cells to identify differences in gene expression patterns or cellular composition11,12,13,14,15. Regardless of the methods used for these steps, the selection of the healthy reference dataset is crucial.
Large-scale profiling of healthy samples by the Human Cell Atlas community has yielded large, harmonized collections of data from multiple organs, or atlas datasets (http://data.humancellatlas.org/). In tissues like lung and blood, millions of cells have been profiled from hundreds to thousands of individuals. Computational analyses allow for meaningful integration of these datasets, providing a comprehensive view of cell phenotypes in a tissue, while minimizing technical variation. Nevertheless, the characteristics of the samples included in an atlas might differ greatly from those of a disease cohort (Fig. 1a). This could introduce false discoveries if confounding factors are unknown or not appropriately handled in statistical testing. Despite this, several studies use atlas datasets as references for discovering disease states1,16,17,18,19, especially for tissues where obtaining matched healthy controls is challenging, such as the brain20,21.
In contrast, several studies collect matched control samples from healthy tissue alongside the disease samples5,22,23,24,25, with similar demographic and experimental protocol characteristics. This minimizes the risk of false positives driven by confounders. However, collection of a large number of healthy control samples is not always practical or possible. Moreover, using a relatively small number of samples for the integration step increases the risk of missing rare cell states and overinterpreting sample-specific noise. Understanding how features of the reference dataset impact identification of disease-associated cell states will guide effective data reuse, design of disease studies and future cell atlasing efforts.
In this study, we compare the use of atlas and control datasets as references for the identification of disease-associated cell states. We demonstrate the benefits of using an atlas dataset as reference for latent embedding and of a control dataset as reference for differential analysis, with important implications for both experimental design and use of single-cell disease cohorts.
Results
Reference design for disease-associated state identification
To optimize the selection of a reference dataset for the identification of disease-associated cell states, we considered the following workflow (Fig. 1b). First, a dimensionality reduction model is trained on the healthy dataset (the embedding reference dataset) to learn a latent space representative of cellular phenotypes while minimizing batch effects. Next, this model is used for transfer learning to map the query dataset, which includes the disease samples, to the same latent space9,10. Finally, differential analysis is performed to compare cells between disease and healthy samples (differential analysis reference) to identify disease-associated states. We defined a healthy reference dataset as a control if it matched the disease dataset in terms of cohort characteristics and experimental protocols. We defined an atlas reference (AR) dataset as one that aggregated data from hundreds to thousands of individuals from multiple cohorts, collected with several experimental protocols. With this workflow, we outlined three alternatives for selecting a reference dataset (reference design) (Fig. 1c): (1) the AR design and (2) the control reference (CR) design, in which a single healthy dataset (the atlas or the control dataset, respectively) serves as both the embedding reference and the differential analysis reference; and (3) an atlas to control reference (ACR) design, where an atlas and a control dataset are used in different steps of the workflow. In the ACR design, the atlas dataset serves as the embedding reference, while the disease and control datasets are mapped to the same latent space; finally, differential analysis is performed contrasting the disease dataset to the control dataset only. For the CR design, we compared a workflow for latent embedding where the control dataset was used as reference for query mapping, and another where the latent embedding model was trained on the concatenated control and disease datasets (Supplementary Note 2.4).
In the following sections, we quantify the ability of these three designs to identify disease-specific cell states in simulations and real data.
Detection of out-of-reference cell states in simulations
To test a scenario with ground truth, we simulated the attributes of atlas, control and disease datasets by splitting scRNA-seq data from 13 studies that profiled healthy peripheral blood mononuclear cells (PBMCs) from 1,248 donors (Supplementary Table 1 and Methods). We selected one study and randomly split the donors to simulate a pseudo-disease and a control dataset (Fig. 2a). This ensured that cohort demographics and experimental protocols were matched, preserving donor and library effects present in real data. The remaining cells (1,219 donors) form the atlas dataset. To simulate a cell population specific to the pseudo-disease dataset, hereafter an out-of-reference (OOR) state, we selected one or more annotated cell types and removed cells with those labels from the control and atlas datasets.
To identify the OOR state, we first learned a latent space embedding on the chosen reference (atlas or control) using single-cell variational inference (scVI)26 (Fig. 2b, left). Then, we used transfer learning with scArches9 to map the query dataset(s) to the trained scVI model. For the CR design with joint embedding (CR scVI), we trained the scVI model on the concatenated pseudo-disease and control datasets (Fig. 2b, center). In the ACR design, the atlas dataset was used to train the latent embedding model; however, after mapping with scArches, only the disease and control datasets are considered. Finally, we used neighborhood-level differential abundance (DA) testing with Milo11 to identify cell states enriched in the disease dataset (Fig. 2b, right).
We first considered a scenario where a single-cell-type cluster is selected as the OOR state and removed from the healthy references (Fig. 3a). Across simulations with different OOR states, we observed that using the combination of the atlas and control datasets (ACR design) led to sensitive detection of neighborhoods with a high fraction of OOR cells (Figs. 2c and 3b, and Extended Data Fig. 1). Conversely, the AR design led to an inflated number of false positives, where significant enrichment was also detected when the fraction of unseen cells was low or 0. Using only the control dataset, latent embedding with query mapping led to more balanced log fold changes, but still a higher false discovery rate (FDR) than the ACR design, while performance with a joint embedding was comparable to the ACR design. Notably, we found minimal difference in the quality of integration with different designs (Extended Data Fig. 2). The difference between reference design results was also consistent when applying alternative methods for DA analysis13,15 (Extended Data Fig. 3). Finally, as expected, the power to detect OOR cell states depended, for all methods, on the number of cells present, with a minimum of 250 cells per cell type needed to identify the OOR population (Extended Data Fig. 4).
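The comparison of false discoveries across designs can be illustrated with a minimal sketch. It assumes that each neighborhood is described by the fraction of OOR cells it contains and by a significance call from DA testing. The function name and the cutoff `min_oor_fraction` used to label a neighborhood as truly OOR are hypothetical choices for illustration, not values taken from this study.

```python
def detection_metrics(oor_fraction, significant, min_oor_fraction=0.2):
    """Summarize DA calls against the simulated ground truth.

    oor_fraction: fraction of OOR cells in each neighborhood.
    significant: True where DA testing called enrichment in disease.
    min_oor_fraction: hypothetical cutoff defining a true OOR neighborhood.
    """
    if len(oor_fraction) != len(significant):
        raise ValueError("inputs must have one entry per neighborhood")
    tp = fp = fn = tn = 0
    for frac, sig in zip(oor_fraction, significant):
        is_oor = frac >= min_oor_fraction
        if sig and is_oor:
            tp += 1
        elif sig:
            fp += 1
        elif is_oor:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Report 0.0 when a rate is undefined (empty denominator).
        return num / den if den else 0.0

    return {
        "TPR": ratio(tp, tp + fn),  # sensitivity
        "FPR": ratio(fp, fp + tn),  # false positive rate
        "FDR": ratio(fp, tp + fp),  # false discovery rate
    }


# Toy example: the third neighborhood contains no OOR cells but is called significant.
metrics = detection_metrics(
    [0.9, 0.6, 0.0, 0.05, 0.0],
    [True, True, True, False, False],
)
```

In this toy input, a design that calls enrichment in neighborhoods with no OOR cells raises the FDR while leaving sensitivity unchanged, which is the pattern described for the AR design.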
We hypothesized that the good performance in OOR state detection with the CR scVI design could be explained by feature selection. Latent embedding models are trained on the top 5,000 highly variable genes (HVGs) in the input dataset (Methods). When training on concatenated disease and control datasets, marker genes for the OOR population are more likely to be among the HVGs. We compared the performance of different reference designs trained using HVGs from the atlas dataset, from the control dataset or from the concatenated control and disease dataset. For all designs, the area under the precision-recall curve (AUPRC) for OOR state detection was highest when using HVGs selected on the same data used to train the model. However, only the CR design with joint embedding showed a substantial decrease in performance when selecting HVGs without using the disease dataset (Fig. 3c). On average, 81% of the HVGs selected from the control and disease data were shared with the set selected from control only and 68% were shared with the set selected from the atlas only. These results indicate that the performance of joint embedding with CR design is sensitive to the feature selection strategy used to train the latent embedding model.
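The overlap between HVG sets reported above is a simple set comparison, sketched below. The function name and the gene lists are hypothetical toy values chosen for illustration; they are not the HVG sets used in the study.

```python
def shared_fraction(hvgs, other_hvgs):
    """Fraction of genes in `hvgs` that also appear in `other_hvgs`."""
    hvgs = set(hvgs)
    if not hvgs:
        raise ValueError("hvgs must not be empty")
    return len(hvgs & set(other_hvgs)) / len(hvgs)


# Toy gene sets (hypothetical): HVGs selected on three different inputs.
joint_hvgs = ["CD14", "LYZ", "S100A8", "FCGR3A", "MS4A1"]   # control + disease
control_hvgs = ["CD14", "LYZ", "S100A8", "MS4A1", "CD3E"]   # control only
atlas_hvgs = ["CD14", "LYZ", "MS4A1", "NKG7", "CD3E"]       # atlas only

shared_with_control = shared_fraction(joint_hvgs, control_hvgs)
shared_with_atlas = shared_fraction(joint_hvgs, atlas_hvgs)
```

Genes present only in the joint set are candidates for markers of the OOR population, which is why a model trained without the disease data can lose them.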
We reasoned that this might impact performance when multiple transcriptionally distinct OOR states are present in the disease population. To test this, we conducted simulations where we removed a fixed cell population (corresponding to classical monocytes) from the reference datasets and then defined a second variable OOR cell state (shifted OOR state) by splitting a cell type population into two distinct groups (Methods and Fig. 3d). The ACR design performed best in OOR state identification (Fig. 3e). In particular, in all simulations where the CR scVI design outperformed the ACR design when the OOR state was removed, the ACR design could better distinguish OOR states in the mixed case, even when considering only recovery of the shifted OOR state (Fig. 3f). In one case (simulation with CD4+ central memory T (TCM) cells as shifted OOR state), we observed a significant drop in performance with the ACR design if the OOR state was shifted instead of removed.
In summary, differential analysis using control datasets drastically reduced the rate of false discoveries in the detection of disease-associated cell states. Of note, studies using atlas datasets to identify disease-associated states9,10,16 might use criteria different from DA to detect OOR cells, such as distance in the latent space, label transfer uncertainty or differential expression analysis. We compared these alternatives to our workflow in Supplementary Note 2.1.
Robustness of OOR detection with the ACR design
We next assessed the robustness of different reference designs to heterogeneity in the control and atlas datasets. We first tested whether using the atlas reduces the number of control donors needed to detect disease-specific states by simulating control datasets of increasing size (Methods). While sensitivity declined for all designs when using a very small control cohort, the ACR design maintained the highest performance in OOR state detection compared to the CR design, regardless of the latent embedding strategy (t-test P < 0.01 for AUPRC distributions across control cohort sizes for both the CR scVI and CR scArches designs) (Fig. 4a and Supplementary Fig. 1). The difference in performance was especially marked when simulating a smaller disease cohort (Supplementary Fig. 1). These results suggest that using the ACR design can minimize the number of control samples required. In Supplementary Note 2.3, we tested options for cases where collecting matched control samples is not feasible.
We also tested how OOR state detection was affected by variation in the atlas dataset. We first confirmed robustness to removal of any given study from the atlas dataset (Supplementary Note 2.2.1). Then, we measured performance with the AR and ACR designs when including an increasing number of PBMC studies in the atlas dataset (Methods). While the results were always significantly affected when using just one or two studies as the atlas dataset, sensitivity with the ACR design was stable when the atlas included at least 10,000 cells (Fig. 4b and Supplementary Fig. 2). Without controls, we observed a stronger dependency of performance on atlas size (Pearson correlation of AUPRC and size: R² = 0.69, P = 7.2 × 10⁻⁷ for the AR design; R² = 0.4, P = 0.0017 for the ACR design). Notably, the false positive rate (FPR) increased with smaller atlas datasets with an AR design (Supplementary Fig. 2). We compared the use of a cross-tissue or tissue-specific atlas for the ACR design (Supplementary Note 2.2.2), as a practical alternative where the availability of tissue-specific data might be scarce.
In summary, combining the use of an atlas and control dataset led to robust detection of putative disease states, even with a varying quality of the control or atlas dataset.
Detection of interferon-stimulated states in patients with coronavirus disease 2019
We next assessed the benefits of using a healthy atlas to identify altered states in a real patient cohort. We used a published scRNA-seq dataset of PBMCs from 90 patients with varying severities of coronavirus disease 2019 (COVID-19) and 23 healthy volunteers24. As an atlas dataset, we used harmonized scRNA-seq profiles from 12 studies involving 1,219 healthy individuals (Fig. 5a). We compared the use of the healthy PBMC atlas for latent embedding (ACR design) against using only the COVID-19 and control datasets with joint embedding (CR design). To quantify the ability of different designs to identify disease-associated states, we tested whether cells expressing genes involved in interferon (IFN) signaling, a key antiviral response pathway and a recognized hallmark of COVID-19, could be detected among the COVID-19-enriched neighborhoods (Fig. 5b and Methods).
The ACR design showed a stronger correlation between DA log fold change and the mean IFN signature (ACR Pearson R = 0.63, CR Pearson R = 0.52, Fisher’s z-transformation P < 2.2 × 10⁻¹⁶), indicating better prioritization of IFNhi cell states (Fig. 5c), regardless of the latent embedding strategy used (Fig. 5d). Stratifying according to cell type, the correlation was especially strong in myeloid cells, where the strongest IFN stimulation was observed (Extended Data Fig. 5a). Among the IFNlo states prioritized with the ACR design, we found primarily plasmablasts and plasma cells (Extended Data Fig. 5b), followed by platelets, all expected to expand in COVID-19 (refs. 27,28). For lymphocytes, where the average expression of IFN genes was lower than in myeloid cells, the ACR design outperformed the CR design in prioritizing the top 10% IFNhi neighborhoods in natural killer (NK) and CD8+ T cells, while neither design distinguished IFNhi CD4+ T cells or B cells (Extended Data Fig. 5c). The CR design prioritized IFNlo naive B cells over other IFNhi subsets, such as CD16hi and proliferating NK cells (Extended Data Fig. 5b–d), contradicting the widely reported lymphopenia in patients with COVID-19 (ref. 29).
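A comparison of two Pearson correlations with Fisher's z-transformation can be sketched as follows. This is a minimal illustration that assumes the two correlations come from independent samples; the text does not state which variant of the test was applied. The function name and the neighborhood counts passed in the example are hypothetical.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Two-sided test for a difference between two Pearson correlations.

    Applies Fisher's z-transformation (atanh) to each coefficient and
    assumes the two samples are independent, with sizes n1 and n2 (> 3).
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Correlations reported in the text; the neighborhood counts are hypothetical.
z_stat, p_value = compare_correlations(0.63, 5000, 0.52, 5000)
```

The resulting P value depends strongly on the number of neighborhoods, so the toy counts above should be replaced with the actual number of tested neighborhoods in each design.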
Through iterative dataset subsetting, subclustering and differential analysis, several COVID-19 scRNA-seq studies distinguished IFN-stimulated COVID-19-associated subclusters and normal IFNlo subtypes across immune cell types22,30. Yet, IFN activation is not global, and transitional or alternative pathological phenotypes might be present in COVID-19 PBMCs. In our neighborhood-level analysis with the ACR design, we observed neighborhoods with a relatively low IFN signature that were significantly associated with the disease, notably among classical (CD14+) monocytes (Fig. 5e). We categorized CD14+ monocytes into three phenotypes: normal classical monocytes; COVID-associated IFNlo monocytes; and COVID-associated IFNhi monocytes (Fig. 5f). The proportion of CD14+ monocyte phenotypes changed significantly with different disease severity: the IFNhi state was most prominent in mild and asymptomatic cases compared to healthy cases (Wilcoxon test P = 1.19 × 10⁻⁷), while the IFNlo state was predominant in patients with moderate-to-critical disease (Fig. 5g). This supports the notion that IFN stimulation acts as a protective pathway in the acute phase of infection31. Conversely, when using the CR design to define IFNhi and IFNlo states after differential analysis, we found a high fraction of IFNlo COVID-enriched monocytes in healthy and asymptomatic individuals, indicating that this design failed to distinguish IFNlo normal monocytes from the IFNlo phenotype in severe COVID-19 (Extended Data Fig. 6a–c). Additionally, the fraction of IFNhi cells in mild and moderate cases was not significantly higher than in severe cases (Wilcoxon test P = 0.325743). Differential expression analysis between IFNhi and IFNlo COVID-associated monocytes showed that IFNhi monocytes had higher expression of HLA genes, leukocyte-recruiting chemokines (CCL8, CXCL10, CXCL11) and markers of activation (FCGR3A) (Extended Data Fig. 6d,e and Supplementary Table 2).
Conversely, the IFNlo monocytes enriched in severe disease overexpressed S100A genes, previously identified as key markers of COVID-19 severity30,32. This HLA-DRlo S100Ahi phenotype corresponds to a subset of dysfunctional monocytes associated with severe COVID-19, previously described in an independent cohort through direct comparison of mild and severe cases23 (Extended Data Fig. 6f). These markers were not recovered when comparing IFNlo and IFNhi COVID-19 monocytes defined by the CR design (Extended Data Fig. 6e and Supplementary Table 3).
Detection of aberrant cell states in pulmonary fibrosis
To assess the benefit of using atlas and control datasets in other biological contexts, we analyzed a published scRNA-seq dataset of lung parenchyma samples from 32 patients with idiopathic pulmonary fibrosis (IPF), a progressive lung disease with limited treatment options, which is characterized by extracellular matrix (ECM) deposition, inflammation and scarring33,34. This study included data from 28 control donors and 18 patients with chronic obstructive pulmonary disease (COPD)2. As an atlas dataset, we used the core Human Lung Cell Atlas (HLCA) dataset16 (Fig. 6a).
Our first aim was to recover the emergence of IPF-specific alveolar macrophages overexpressing SPP1 and other ECM-remodeling genes contributing to lung fibrosis35. Comparing different designs, the ACR design outperformed the AR and CR designs in detecting macrophages with the strongest profibrotic signature (Fig. 6b and Extended Data Fig. 7a). Interestingly, the CR design incorrectly prioritized neighborhoods with significantly fewer samples compared to true positives (Extended Data Fig. 7b,c), suggesting that the difference in ACR and CR design performance is due to residual batch effects in the latent space (Supplementary Fig. 3).
We next focused on stromal and epithelial cells. We considered cell types with high expression of biomarker genes from diagnostic models built on IPF lung explant RNA-seq36 (Extended Data Fig. 8a,b and Methods). The ACR design consistently led to the most precise distinction of cell states expressing the diagnostic signature (Fig. 6c and Extended Data Fig. 8c). Differential analysis using control samples led to the precise identification of rare aberrant cell states emerging in IPF, such as the KRT5–KRT17+ basaloid cells2,37 thought to originate from the alveolar epithelium in response to fibrosis38,39 (Extended Data Fig. 8c,d). Furthermore, the difference in performance between reference designs was especially notable for basal cells (Fig. 6c and Extended Data Fig. 8c). These were on average significantly enriched in the IPF samples, in agreement with previous reports2,37. However, by using the ACR design, we distinguished the neighborhoods of normal basal cells (with a mix of cells from patients with IPF and controls) and IPF-enriched neighborhoods with high biomarker expression (Extended Data Fig. 8c). We found that basal cells in the ACR design IPF-enriched neighborhoods overexpressed marker genes for KRT5+KRT17hi aberrant basal cells identified in bronchial brushings of patients with IPF40 (Fig. 6d). Marker gene expression was especially high in the neighborhood showing the strongest enrichment in IPF cells. DA analysis with the CR or AR design did not distinguish this aberrant phenotype (Fig. 6d).
While the study describing KRT5+KRT17hi basal cells highlighted their transcriptional similarity to basaloid cells40, we identified both aberrant phenotypes as distinct states (Fig. 6e and Extended Data Fig. 8d). Therefore, we further characterized their specific markers and functional differences. Specifically, we identified genes differentially expressed between aberrant basal-like states and overexpressed compared to normal basal cells (Methods). We identified 981 significantly differentially expressed genes (DEGs) (FDR = 5%) (Fig. 6f and Supplementary Table 4), including six previously described markers for KRT17hi aberrant basal cells and 35 previously described markers for basaloid cells. Several other previously described markers were only overexpressed compared to normal basal cells (Supplementary Fig. 4). KRT17hi basal cells overexpressed genes associated with Myc signaling, in agreement with Jaeger et al.40, and genes involved in keratinization, including keratins and desmoplakin genes (Extended Data Fig. 9a). Similar processes have been identified in lung carcinoma41 and in the lung epithelium of smokers42, indicating that this might be a widespread response to epithelial injury. Basaloid-specific markers showed significant enrichment in the genes involved in ECM organization and epithelial–mesenchymal transition (EMT), including collagens and metalloproteases, as well as morphogenesis factors, including SOX11, SOX4 and TGF-beta signaling genes (Extended Data Fig. 9b). These markers also include genes linked to genomic variants associated with lung function, including the EMT-inducer IL32 (ref. 43), neurotrimin (NTM), GPC5 and DCBLD2 (refs. 44,45,46). Some of the newly identified markers encode targets of drugs approved or in trial for other lung pathologies. 
For example, CSF2, strongly overexpressed in basaloid cells, has been implicated in the pathogenesis of asthma and COPD, and is being investigated in phase 3 trials for pneumonia treatment (ClinicalTrials.gov registration: NCT04351152)47; the CCL2-inhibitor carlumab has completed a phase 2 trial for pulmonary fibrosis (ClinicalTrials.gov registration: NCT00786201); while U.S. Food and Drug Administration-approved drugs inhibiting ROS1 are used for non-small cell lung carcinoma48.
Discussion
In this study, we assessed how the choice of reference dataset affects the identification of altered cell states from scRNA-seq data of diseased tissues. Using simulations and real-life applications, we showed that atlas datasets are not a substitute for control samples, but that they enhance disease-state discovery in complex scenarios. Contrasting cell profiles from disease samples against a restricted set of control samples is necessary to minimize false positives in disease-state identification. However, when an atlas dataset is available, it is possible to reduce the number of control samples without introducing false discoveries and with minimal impact on sensitivity (Fig. 4a).
Multiple factors could explain the improved performance of ACR compared to CR design in complex scenarios. First, feature selection in joint embedding with a CR design is less likely to include disease-relevant genes necessary to distinguish rare populations (Fig. 3c). Additionally, residual batch effects in the latent space can lead to false positives (Extended Data Fig. 7b,c). Interestingly, while a comprehensive representation of cell states in atlas datasets might have a role, our leave-one-out analysis indicates that the size and composition of the atlas dataset do not significantly impact disease-state detection performance (Supplementary Note 2.2.1). Moreover, as in the comparison between tissue-specific or cross-tissue atlas datasets (Supplementary Note 2.2.2), sensitive detection of disease-specific states is possible when the cell type composition of atlas and case-control datasets differ substantially.
Despite its advantages, researchers may face challenges when applying an ACR design. First, data integration and harmonization efforts are ongoing, and integrated datasets are frequently updated with more individuals, even for well-sampled tissues, such as blood, lung16, heart49,50,51 or gastrointestinal tract52,53. Reassuringly, we showed that the ACR design is robust to the set of harmonized datasets (Supplementary Note 2.2.1) and maintains high sensitivity with smaller atlas datasets (Fig. 4b and Supplementary Fig. 2), making disease analysis more robust to atlas updates. Second, downloading and processing atlas data can be computationally expensive. By benchmarking disease-state detection using latent embedding with transfer learning9, we advocate for atlas builders to share trained models for embedding along with datasets (for example, refs. 54,55,56). Lastly, when the use of an atlas is not feasible, we found that in several benchmarking scenarios, a CR design with joint embedding provided satisfactory performance, serving as an alternative design in this scenario. In this case, we recommend validating predicted disease-associated states by checking for residual batch effects between samples (Extended Data Fig. 7b,c), and evaluating the robustness of results to factors such as the inclusion or exclusion of specific control samples (Fig. 4a) or feature selection (Fig. 3c).
Our disease cohort analyses revealed that an ACR design enables more sensitive identification of transitional and heterogeneous pathological cell states. In the COVID-19 dataset11, we captured IFNhi states across immune cell types, and fine subsets of dysfunctional CD14+ monocytes associated with disease severity (Fig. 5e–g)23. Analyzing lung data from patients with IPF using an ACR design, we distinguished and characterized rare basal-like aberrant cell states (Fig. 6d–f). Previous studies linked IPF severity with basal marker gene expression2,57,58,59 and basal cell accumulation in distal airways60. Our analysis adds insights on basal-like cellular phenotypes in IPF. First, while KRT17hi aberrant basal cells were first described in bronchial epithelium40, we found them in lung parenchyma, supporting their role in bronchiolization61. Second, we showed that only a subset of basal cells in the IPF samples were KRT17hi, suggesting that normal basal cells might undergo reprogramming in the parenchyma. Third, we established that KRT17hi aberrant basal cells are distinct from the recently described IPF-associated KRT5–KRT17+ basaloid cells2,37,62,63, highlighting their distinguishing features and marker genes.
In conclusion, we demonstrated that the combined use of a cell atlas and matched controls as references enables the most precise identification of affected cell states in disease scRNA-seq datasets. We envision that our analysis will instruct the design of new cohort studies, guide efficient data reuse and provide operating principles for analysis of disease datasets and construction of cell atlases.
Methods
Ethics statement
This study relies on the analysis of previously published data, which were collected with written informed consent obtained from all participants and comply with the ethical guidelines for human samples.
PBMC data preprocessing
We collected raw gene expression counts and cell type annotations from healthy PBMC 10X Genomics scRNA-seq data from 13 studies5,18,22,23,24,30,54,64,65,66,67,68,69, available via the CELLxGENE portal (https://cellxgene.cziscience.com/collections) (Supplementary Table 1). During harmonization, we sampled 500 cells for each sample to reduce the computational burden of this analysis, while maintaining sample-level diversity; we excluded samples for which fewer than 500 cells were detected, retaining in total 1,268 samples from 1,248 individuals. We subsequently retained cells in which at least 1,000 mRNA molecules were detected and genes that were expressed in at least one cell. This resulted in a dataset of 599,379 high-quality cells.
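The per-sample downsampling and cell filtering can be sketched in a few lines. This is a simplified illustration that represents each cell as a dictionary; the field names `sample` and `total_counts`, the function name and the random seed are hypothetical, and the gene-level filter is omitted.

```python
import random
from collections import defaultdict


def subsample_cells(cells, n_per_sample=500, min_counts=1000, seed=0):
    """Keep n_per_sample random cells per sample, then filter on total counts.

    Samples with fewer than n_per_sample cells are excluded entirely.
    """
    rng = random.Random(seed)
    by_sample = defaultdict(list)
    for cell in cells:
        by_sample[cell["sample"]].append(cell)

    kept = []
    for sample in sorted(by_sample):
        group = by_sample[sample]
        if len(group) < n_per_sample:
            continue  # sample is too small: drop it
        kept.extend(rng.sample(group, n_per_sample))

    # Retain cells with at least min_counts detected mRNA molecules.
    return [cell for cell in kept if cell["total_counts"] >= min_counts]


# Toy input: sample A has 600 cells, sample B has only 400 and is excluded.
toy_cells = [{"sample": "A", "total_counts": 1500} for _ in range(600)]
toy_cells += [{"sample": "B", "total_counts": 1500} for _ in range(400)]
retained = subsample_cells(toy_cells)
```

Sampling a fixed number of cells per sample keeps every retained sample equally represented, at the cost of discarding samples below the threshold.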
To generate a unified cell type annotation, we integrated all normal cells from different studies in a common latent space using the scVI model, as implemented in the Python package scvi-tools26,70. Briefly, we selected the 5,000 most HVGs based on the dispersion of log-normalized counts, as implemented in SCANPY71. We trained the scVI model on raw counts, subsetting to HVGs, considering the library ID as batch (model parameters: n_latent = 30, gene_likelihood = ‘nb’, use_layer_norm = ‘both’, use_batch_norm = ‘none’, encode_covariates = True, dropout_rate = 0.2, n_layers = 2; training parameters: early_stopping = True, train_size = 0.9, early_stopping_patience = 45, max_epochs = 200, batch_size = 1,024, limit_train_batches = 20). We constructed a k-nearest neighbor graph based on similarity in the scVI latent dimensions, using k = 50. Cells were clustered using the Leiden algorithm with resolution = 1.5. Subsequently, clusters were annotated by majority voting using the harmonized cell type labels available via CELLxGENE. During this process, one cluster of cells was excluded as potentially containing doublets. After this final filtering, the dataset included 597,321 cells annotated into 16 cell types.
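Cluster annotation by majority voting amounts to assigning each cluster the most frequent label among its cells. The sketch below illustrates this step only; the function name and toy labels are hypothetical, and ties are resolved in favor of the label encountered first, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def annotate_by_majority(clusters, labels):
    """Map each cluster ID to the most frequent cell type label among its cells."""
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have one entry per cell")
    votes = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        votes[cluster][label] += 1
    # most_common(1) returns the label with the highest count in the cluster.
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Toy example: cluster 0 is mostly B cells, cluster 1 contains only NK cells.
cluster_labels = annotate_by_majority(
    [0, 0, 0, 1, 1],
    ["B cell", "B cell", "NK cell", "NK cell", "NK cell"],
)
```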
Simulation experiments
In this section we describe the simulation strategy (Fig. 2a) and workflow to identify OOR cells (Fig. 2b). We designed evaluation experiments and chose methods for the integration and differential analysis with the specific use-case of disease datasets in mind. We believe our results will extrapolate to other types of case-control studies, as long as the main assumptions apply, that is, (1) that all the cell states observed in the control dataset are also found in the atlas dataset and (2) that only a fraction of cell types are altered in the disease datasets. Note that throughout this study the term ‘cell state’ defines a group of cells that are more transcriptionally similar to each other than to other cells in the same tissue.
Data splitting into atlas, control and pseudo-disease
To simulate the attributes of the disease, atlas and control datasets, we selected donors from one study (query study, 29 healthy donors, Stephenson et al.24) and we split these at random with equal probabilities into a disease subset (16 donors) and a control subset (13 donors). The data from the remaining 12 studies comprises the atlas dataset (1,219 donors). To simulate the presence of an OOR cell state, we selected one cell type label and removed all cells with that label from the control and atlas dataset. We repeated this simulation with 15 annotated cell types in the PBMC dataset. Neutrophils were excluded because they were underrepresented in the Stephenson et al.24 study. For seven cell types where the number of cells in the OOR cell state was fewer than 250 cells, we found that our workflow was unable to detect OOR states across designs (Extended Data Fig. 4); therefore, most downstream analysis was restricted to simulations where at least 250 OOR cells were simulated.
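The splitting procedure can be sketched as follows. This is a simplified illustration that represents each cell as a dictionary; the field names (`study`, `donor`, `cell_type`), the function name and the random seed are hypothetical, and donor IDs are assumed to be unique to a study.

```python
import random


def simulate_datasets(cells, query_study, oor_label, seed=0):
    """Split cells into atlas, control and pseudo-disease datasets.

    Donors of the query study are assigned to disease or control with equal
    probability; all other studies form the atlas. Cells carrying the OOR
    label are removed from the atlas and control datasets only.
    """
    rng = random.Random(seed)
    donors = sorted({c["donor"] for c in cells if c["study"] == query_study})
    assignment = {d: "disease" if rng.random() < 0.5 else "control" for d in donors}

    datasets = {"atlas": [], "control": [], "disease": []}
    for cell in cells:
        if cell["study"] == query_study:
            group = assignment[cell["donor"]]
        else:
            group = "atlas"
        if group != "disease" and cell["cell_type"] == oor_label:
            continue  # the OOR state is unseen in both healthy references
        datasets[group].append(cell)
    return datasets


# Toy input: two studies, six donors each, two cell types per donor.
toy_cells = [
    {"study": study, "donor": f"{study}_d{i}", "cell_type": cell_type}
    for study in ("query", "other")
    for i in range(6)
    for cell_type in ("classical monocyte", "naive B cell")
]
simulated = simulate_datasets(toy_cells, "query", "classical monocyte")
```

Splitting at the donor level, rather than the cell level, keeps donor and library effects intact within the control and pseudo-disease datasets.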
To simulate a scenario with multiple cell states altered in disease with different effect sizes (Fig. 3d–f), we selected a fixed cell type label to be removed from the atlas and control as described above (classical monocytes). We then selected a variable cell type label (shifted OOR cell state) that we split between an OOR and an in-reference group with the following procedure: we selected the cells of the shifted OOR cell state in the disease and control datasets; we log-normalized their gene expression profiles and ran a PCA to split the cells into OOR and in-reference groups based on their weights on the first principal component. We then used a k-nearest neighbor classifier (using the implementation in scikit-learn, with k = 10) to assign atlas cells to one of the two groups. We used this procedure instead of running the PCA on atlas, control and disease cells to avoid having a first principal component that captures only batch effects between the query and atlas datasets.
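The split of a shifted OOR cell state can be sketched as follows. This is an illustrative sketch under several assumptions: the input is taken to be already log-normalized, the split along the first principal component is assumed to be at the median (the cutoff is not stated above), and a small NumPy nearest-neighbor vote stands in for the scikit-learn classifier. All function and variable names are hypothetical.

```python
import numpy as np

def fit_pc1(X):
    """Return the feature means and the first principal axis of X (cells x genes)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[0]

def knn_assign(X_train, y_train, X_new, k=10):
    """Label each new cell by majority vote among its k nearest training cells."""
    dist = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    # y_train is boolean; an exact tie falls to the in-reference group
    return y_train[nearest].mean(axis=1) > 0.5

# Toy data: two well-separated groups of query (disease + control) and atlas cells
rng = np.random.default_rng(0)
offset = np.array([5.0, 0.0, 0.0, 0.0, 0.0])
X_query = np.vstack([rng.normal(size=(20, 5)) - offset, rng.normal(size=(20, 5)) + offset])
X_atlas = np.vstack([rng.normal(size=(10, 5)) - offset, rng.normal(size=(10, 5)) + offset])

# PCA is fitted on query cells only, so PC1 cannot capture query-versus-atlas batch effects
mean, axis1 = fit_pc1(X_query)
scores = (X_query - mean) @ axis1
is_oor_query = scores > np.median(scores)  # assumed median split
is_oor_atlas = knn_assign(X_query, is_oor_query, X_atlas, k=10)
```

Because the sign of a principal component is arbitrary, which of the two groups is called OOR is a free choice in this sketch.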
Latent space embedding
For each simulated atlas, control and disease dataset assignment, we embedded the reference and query datasets into a common latent space using transfer learning with scArches9 on scVI models9,26, using the implementation in the Python package scvi-tools v.0.17.4 (ref. 70). Briefly, we selected the top 5,000 HVGs in the reference dataset based on the dispersion of log-normalized counts, as implemented in SCANPY. We trained the scVI model on the raw counts of the reference dataset, subsetting to HVGs, considering the sample ID as batch and specifying the recommended parameters to enable scArches mapping (use_layer_norm = ‘both’, use_batch_norm = ‘none’, encode_covariates = True, dropout_rate = 0.2, n_layers = 2). Models were trained for 400 epochs or until convergence. For the CR design with joint embedding (CR scVI), the scVI model was trained on the concatenated disease and control datasets. Next, we performed transfer learning on the query dataset(s) from the model trained on the reference, running the model for 200 epochs and setting the weight_decay parameter to 0. The reference (for scVI training) and query (for scArches mapping) datasets for latent space embedding were defined as follows for the three reference designs: AR design: the atlas dataset was used as the reference dataset, the disease dataset was used as the query dataset; control reference with query mapping (CR design, scArches): the control dataset was used as the reference dataset, the disease dataset was used as the query dataset; control reference with joint embedding (CR design, scVI): the control and disease datasets were used as the reference dataset, no query mapping was performed; ACR design: the atlas dataset was used as the reference dataset, the disease and control datasets were used as the query dataset.
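The dataset roles listed above can be summarized as a small lookup table. This is only a restatement of the text in code form; the dictionary keys and the helper embedded_datasets are hypothetical names, not part of any package.

```python
# Datasets used for scVI training (reference) and scArches mapping (query), per design
EMBEDDING_DESIGNS = {
    "AR": {"reference": ["atlas"], "query": ["disease"]},
    "CR_scArches": {"reference": ["control"], "query": ["disease"]},
    "CR_scVI": {"reference": ["control", "disease"], "query": []},  # joint embedding, no mapping
    "ACR": {"reference": ["atlas"], "query": ["disease", "control"]},
}

def embedded_datasets(design):
    """Return every dataset that receives a latent embedding under a design."""
    roles = EMBEDDING_DESIGNS[design]
    return roles["reference"] + roles["query"]
```

For example, embedded_datasets("ACR") lists the atlas, disease and control datasets, whereas the two CR variants never involve the atlas.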
DA analysis
To find cell states enriched in the disease dataset, we used the Milo framework for DA on cell neighborhoods11 using the implementation in the package milopy v.0.1.0 (https://github.com/emdann/milopy). Briefly, we computed the k-nearest neighbor graph of cells in the reference and disease datasets based on latent embedding. The reference datasets for differential analysis were defined as follows for the three reference designs: (1) AR design: atlas dataset; (2) CR design: control dataset; (3) ACR design: control dataset.
Of note, for the ACR design, the atlas dataset was not considered when constructing the k-nearest neighbor graph. This reduces the computational burden of handling a dataset of hundreds of thousands of cells. We set the value of k to be equal to the total number of samples times five, up to a maximum of k = 200 (this upper limit was set for memory efficiency reasons), as suggested by Dann et al.11. We assigned cells to neighborhoods (milopy.core.make_nhoods, parameters: prop = 0.1) and counted the number of cells belonging to each sample in each neighborhood (milopy.core.count_cells). We assigned to each neighborhood a cell type label based on majority voting of the cells belonging to that neighborhood. To test for enrichment of cells from the disease dataset, we modeled the cell count in neighborhoods as a negative binomial generalized linear model, using a log-linear model to model the effects of disease status on cell counts (log fold change). Although the split between control and disease samples was balanced in terms of the available metadata, in the query study there was a known batch effect between the three sites from which samples were collected24. Therefore, we included site identity as a confounding covariate in the DA model when using the ACR and CR designs, although we found that the results presented in this report were robust even without modeling this confounder. We controlled for multiple testing using the weighted Benjamini–Hochberg correction as described in Dann et al.11 (spatial FDR correction). Unless otherwise specified, neighborhoods were considered enriched in disease cells if the spatial FDR < 0.1 and log fold change > 0.
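The rule for choosing k, a weighted Benjamini–Hochberg correction and the enrichment call can be sketched as follows. This is a minimal NumPy sketch, not the milopy implementation; the negative binomial model fit itself is not shown, and the choice of weights (in Milo, typically the reciprocal of the distance to the k-th nearest neighbor of each neighborhood's index cell) is treated here as an assumption supplied by the caller. Function names are hypothetical.

```python
import numpy as np

def choose_k(n_samples, per_sample=5, k_max=200):
    """k = 5 x number of samples, capped at 200."""
    return min(per_sample * n_samples, k_max)

def weighted_bh(pvalues, weights):
    """Weighted Benjamini-Hochberg adjustment; equal weights give standard BH."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(p)
    adjusted = p[order] * w.sum() / np.cumsum(w[order])
    # Enforce monotonicity, starting from the largest P value
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]
    out = np.empty_like(p)
    out[order] = np.minimum(adjusted, 1.0)
    return out

def enriched_in_disease(logfc, spatial_fdr, alpha=0.1):
    """Neighborhoods called enriched: spatial FDR < alpha and log fold change > 0."""
    return (np.asarray(spatial_fdr) < alpha) & (np.asarray(logfc) > 0)

# Toy example with equal weights
fdr = weighted_bh([0.01, 0.04, 0.03, 0.5], weights=[1, 1, 1, 1])
calls = enriched_in_disease([1.2, -0.5, 0.8, 0.1], fdr)
```

With 29 donors, choose_k returns 145; the cap of 200 is reached from 40 samples onward.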
For the comparison across DA methods (Extended Data Fig. 3), we constructed the k-nearest neighbor graph using the same parameters as described above for the Milo analysis. We used the MELD13 implementation available via PyPI (v.1.0.0) and tested for significant differences in density between pseudo-disease and control samples as described by Petukhov et al.72. Specifically, we computed sample-specific densities over the k-nearest neighbor graph (running meld.MELD().fit_transform()) and tested for significant differences in sample densities between conditions using a Wilcoxon rank-sum test, as implemented in SciPy73. While in the original MELD analysis the authors took the normalized mean density across samples of the same condition as a metric for the effect size of DA, we opted to use the Wilcoxon rank-sum test after observing significant variance in sample densities across donors of the same condition. We ran covarying neighborhood analysis (CNA)15 using the implementation available via PyPI (v.0.1.4). We used the CNA correlation as a metric for the effect size of DA (running cna.tl.association, with ks = [20]).
We tested additional alternatives to DA to identify OOR cell states, as shown in Supplementary Note 2.1.
Sensitivity analysis
For each simulation (that is, with different OOR cell state and reference design), we defined a neighborhood as an OOR state (true positive) if the percentage of OOR cells in the neighborhood was more than 20% of the maximum percentage observed in that simulation. This threshold was chosen to quantify the ability to detect the neighborhoods containing the largest number of OOR cells, even when the atlas dataset was included in the k-nearest neighbor graph (AR design), in which case most cells in each neighborhood belong to the atlas dataset. The selected thresholds for each experiment are shown in Extended Data Fig. 1. We calculated TPRs, FPRs and FDRs considering neighborhoods where the spatial FDR < 0.1 and log fold change > 0 as predicted positives.
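The definition of true OOR neighborhoods and the resulting rates can be illustrated with a short sketch. The function names and toy values below are hypothetical; only the thresholds come from the text.

```python
import numpy as np

def oor_truth(frac_oor, rel_threshold=0.2):
    """True OOR neighborhoods: OOR fraction above 20% of the maximum observed fraction."""
    frac = np.asarray(frac_oor, dtype=float)
    return frac > rel_threshold * frac.max()

def confusion_rates(is_true, is_pred):
    """TPR, FPR and FDR from boolean ground truth and predictions."""
    t = np.asarray(is_true, dtype=bool)
    p = np.asarray(is_pred, dtype=bool)
    tp, fp = int(np.sum(t & p)), int(np.sum(~t & p))
    fn, tn = int(np.sum(t & ~p)), int(np.sum(~t & ~p))
    nan = float("nan")
    return {
        "TPR": tp / (tp + fn) if tp + fn else nan,
        "FPR": fp / (fp + tn) if fp + tn else nan,
        "FDR": fp / (fp + tp) if fp + tp else nan,
    }

# Toy example with five neighborhoods
frac_oor = [0.0, 0.05, 0.3, 0.5, 0.02]
spatial_fdr = np.array([0.5, 0.05, 0.01, 0.2, 0.9])
logfc = np.array([0.1, 1.2, 2.0, 1.5, -0.3])
truth = oor_truth(frac_oor)
predicted = (spatial_fdr < 0.1) & (logfc > 0)
rates = confusion_rates(truth, predicted)
```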
With precision-recall curve analysis, we quantified the ability to detect true positive OOR states with different thresholds of log fold change, without considering the significance estimated with spatial FDR, using the implementation in scikit-learn74. As a measure of uncertainty around the estimated AUPRC, we performed bootstrap resampling on the neighborhood log fold change values, maintaining the original ratio of positive and negative points, and computed the 95% CI on the distribution of AUPRC values for 1,000 resamplings.
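A stratified bootstrap of the AUPRC can be sketched as follows. This is an illustration rather than the code used in the study: a simple average-precision function stands in for the scikit-learn implementation, and the function names and toy data are hypothetical.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (scores sorted descending)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(y_true, dtype=bool)[order]
    precision_at_k = np.cumsum(y) / np.arange(1, y.size + 1)
    return precision_at_k[y].mean()

def bootstrap_auprc_ci(y_true, scores, n_boot=1000, seed=0):
    """95% CI of the AUPRC, resampling positives and negatives separately."""
    y = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y), np.flatnonzero(~y)
    values = np.empty(n_boot)
    for b in range(n_boot):
        # Sampling with replacement within each class keeps the positive/negative ratio fixed
        idx = np.concatenate([rng.choice(pos, pos.size), rng.choice(neg, neg.size)])
        values[b] = average_precision(y[idx], s[idx])
    return np.percentile(values, [2.5, 97.5])

# Toy example: 10 true OOR neighborhoods with higher log fold changes, 40 negatives
rng = np.random.default_rng(1)
is_oor = np.array([True] * 10 + [False] * 40)
logfc = np.where(is_oor, rng.normal(2, 1, 50), rng.normal(0, 1, 50))
ci_low, ci_high = bootstrap_auprc_ci(is_oor, logfc)
```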
Control and atlas size analysis
For the analysis with varying number of control donors (Fig. 4a and Supplementary Fig. 1), we selected the simulations with the five OOR cell populations with the highest average TPR with CR and ACR designs in the previous analysis (Fig. 3b). For each simulation, we selected the five, seven or nine donors from the disease dataset who had the highest fraction of cells in the OOR cell population. Subsequently, we selected a random subset of n donors (with 3 < n < 12) from the control dataset and performed disease-state identification with the CR or ACR design, as described above. For each disease dataset size and n we repeated the simulation with five different initializations of the control donor selection.
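The donor selection for this analysis can be sketched with the standard library alone. The function names, donor identifiers and seeding scheme below are hypothetical; only the sizes (five, seven or nine disease donors, 3 < n < 12 control donors, five initializations) come from the text.

```python
import random

def top_disease_donors(oor_fraction_by_donor, size):
    """Pick the `size` disease donors with the highest fraction of OOR cells."""
    ranked = sorted(oor_fraction_by_donor, key=oor_fraction_by_donor.get, reverse=True)
    return ranked[:size]

def control_subsets(control_donors, n_values=range(4, 12), n_repeats=5):
    """Yield (n, repeat, donors) for every control size n and random initialization."""
    for n in n_values:
        for repeat in range(n_repeats):
            rng = random.Random(1000 * n + repeat)  # hypothetical seeding scheme
            yield n, repeat, sorted(rng.sample(control_donors, n))

# Toy example with the 13 control donors of the simulation
control_donors = [f"control_{i:02d}" for i in range(13)]
subsets = list(control_subsets(control_donors))
```

Each disease dataset size is then combined with all 40 control subsets (eight values of n, five initializations each).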
To assess whether a shallow atlas dataset would introduce false discoveries in disease-state identification (Supplementary Fig. 2), we used all 29 donors from the query dataset in the disease and control datasets, and subsampled the atlas dataset removing data from one to 11 studies (ordering studies according to the total number of cells), and performed disease-state identification with the AR and ACR designs.
More cases of robustness to perturbation of the atlas and control datasets of the reference designs are described in Supplementary Notes 2.1 and 2.2.
Design comparison on the COVID dataset
Data preprocessing and model training
We downloaded data for COVID-19 and healthy PBMCs from Stephenson et al.24, via the CELLxGENE portal (collection ID: ddfad306-714d-4cc0-9985-d9072820c530). We sampled 500 cells for each sample to reduce the computational burden of this analysis, while maintaining sample-level diversity, and we excluded samples for which fewer than 500 cells were detected. We excluded cells where fewer than 1,000 mRNA molecules were detected and we excluded data from three samples that were profiled with the Smart-seq2 protocol. As cell type annotation, we used the high-level annotation from the original authors.
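The per-sample subsampling can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy sample sizes; the order of this step relative to the other filters is not specified above and is not modeled here.

```python
import numpy as np

def subsample_per_sample(sample_ids, n_cells=500, seed=0):
    """Indices of kept cells: n_cells drawn without replacement from each large-enough sample."""
    sample_ids = np.asarray(sample_ids)
    rng = np.random.default_rng(seed)
    keep = []
    for sample in np.unique(sample_ids):
        idx = np.flatnonzero(sample_ids == sample)
        if idx.size < n_cells:
            continue  # samples with fewer than n_cells cells are excluded
        keep.append(rng.choice(idx, size=n_cells, replace=False))
    return np.sort(np.concatenate(keep)) if keep else np.array([], dtype=int)

# Toy example: sample C has too few cells and is dropped
sample_ids = np.repeat(["A", "B", "C"], [700, 500, 120])
kept = subsample_per_sample(sample_ids)
```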
As the atlas dataset, we used the healthy PBMC data described above, excluding the healthy PBMC profiles from Stephenson et al.24. Reference model training and query mapping were performed as described above. After query mapping, control and COVID-19 cells were embedded in a k-nearest neighbor graph (k = 100), which was used to build neighborhoods and perform DA with Milo as described above. For the comparison of de novo integration and query mapping (Fig. 5d), scVI training was performed on the concatenated atlas, control and COVID-19 datasets (ACR design) or control and COVID-19 datasets (CR design), as described above. Also in this case, the atlas dataset was used for scVI model training, but only model weights were used for mapping with scArches; all downstream analysis was performed solely on the COVID-19 and control datasets.
IFN signature calculation
To define IFN-stimulated cells, we aggregated the expression of a set of IFN-associated genes defined by Yoshida et al.22 (BST2, CMPK2, EIF2AK2, EPSTI1, HERC5, IFI35, IFI44L, IFI6, IFIT3, ISG15, LY6E, MX1, MX2, OAS1, OAS2, PARP9, PLSCR1, SAMD9, SAMD9L, SP110, STAT1, TRIM22, UBE2L6, XAF1 and IRF7), using the SCANPY function scanpy.tl.score_genes() to quantify signature expression for each cell. The signature was calculated as the average scaled expression of the IFN-associated genes minus the average expression of a reference set of genes sampled for each binned expression value75. A threshold of IFN signature greater than 0.05 was used for the precision-recall analysis.
CD14+ monocyte disease-state analysis
For the analysis of the COVID-19-associated monocyte subsets, we focused on the neighborhoods annotated as CD14+ monocytes based on majority voting, as described above. We split CD14+ monocyte neighborhoods into IFNhi COVID-19 neighborhoods (spatial FDR < 0.1, log fold change > 0 and IFN signature > 0.2), IFNlo COVID-19 neighborhoods (spatial FDR < 0.1, log fold change > 0 and IFN signature < 0.2) and healthy neighborhoods (the remaining neighborhoods). To assign cells to one of these three phenotypes, we computed, for each cell, the number of neighborhoods of each phenotype to which that cell belonged (as Milo neighborhoods can be partially overlapping) and we labeled cells based on the most representative phenotype (if the cell was found in at least three neighborhoods of that phenotype; otherwise the cell was annotated as mixed CD14+ monocyte phenotype).
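The neighborhood labeling and the cell-level assignment can be sketched as follows. This is an illustrative sketch with hypothetical function names and toy values; ties between phenotypes are resolved here by the order of PHENOTYPES, which is an assumption not stated in the text.

```python
import numpy as np

PHENOTYPES = np.array(["IFNhi", "IFNlo", "healthy"])

def label_neighborhoods(spatial_fdr, logfc, ifn_signature, alpha=0.1, ifn_cut=0.2):
    """Label each CD14+ monocyte neighborhood as IFNhi, IFNlo or healthy."""
    enriched = (np.asarray(spatial_fdr) < alpha) & (np.asarray(logfc) > 0)
    ifn = np.asarray(ifn_signature)
    labels = np.full(enriched.shape, "healthy", dtype=object)
    labels[enriched & (ifn > ifn_cut)] = "IFNhi"
    labels[enriched & (ifn < ifn_cut)] = "IFNlo"
    return labels.astype(str)

def assign_cells(membership, nhood_labels, min_nhoods=3):
    """membership: boolean (cells x neighborhoods); neighborhoods may overlap."""
    counts = np.stack(
        [membership[:, nhood_labels == p].sum(axis=1) for p in PHENOTYPES], axis=1
    )
    best = counts.argmax(axis=1)
    # A cell needs at least min_nhoods neighborhoods of its top phenotype, else it is mixed
    return np.where(counts.max(axis=1) >= min_nhoods, PHENOTYPES[best], "mixed")

# Toy example: seven neighborhoods, three cells
nhood_labels = label_neighborhoods(
    spatial_fdr=[0.01, 0.02, 0.03, 0.5, 0.6, 0.7, 0.04],
    logfc=[1.0, 2.0, 1.5, 0.1, -0.2, 0.3, 0.8],
    ifn_signature=[0.5, 0.4, 0.3, 0.1, 0.0, 0.05, 0.1],
)
membership = np.zeros((3, 7), dtype=bool)
membership[0, [0, 1, 2]] = True  # three IFNhi neighborhoods
membership[1, [3, 4, 5]] = True  # three healthy neighborhoods
membership[2, [0, 3, 6]] = True  # one neighborhood of each phenotype
cell_labels = assign_cells(membership, nhood_labels)
```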
For differential expression analysis, we aggregated gene expression profiles by summing counts according to sample and CD14+ monocyte phenotype and performed differential expression testing with the edgeR quasi-likelihood test78 using the implementation in the R package glmGamPoi76 and 1% FDR (Supplementary Tables 2 and 3).
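The pseudobulk aggregation that precedes differential expression testing can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy counts; the quasi-likelihood test itself (run in R with glmGamPoi) is not shown.

```python
import numpy as np

def pseudobulk(counts, sample_ids, phenotypes):
    """Sum raw counts (cells x genes) over all cells sharing a (sample, phenotype) pair."""
    sums = {}
    for row, key in zip(np.asarray(counts), zip(sample_ids, phenotypes)):
        sums[key] = sums.get(key, 0) + row
    return sums

# Toy example: four cells, three genes
counts = np.array([[1, 0, 2], [2, 1, 2], [0, 3, 1], [4, 4, 4]])
sample_ids = ["s1", "s1", "s1", "s2"]
phenotypes = ["IFNhi", "IFNhi", "healthy", "IFNhi"]
pb = pseudobulk(counts, sample_ids, phenotypes)
```

Each resulting count vector is one observation in the downstream model, so replication is at the level of samples rather than cells.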
Design comparison on the IPF dataset
Data preprocessing and model training
Gene expression count matrices for human lung IPF, control and COPD scRNA-seq data from Adams et al.2 were downloaded from the Gene Expression Omnibus (accession no. GSE136831). As cell type annotations, we used uniform labels generated from the integration of this dataset with the HLCA by Sikkema et al.16, downloaded from Zenodo (https://zenodo.org/record/6337966). For latent embedding with the AR and ACR designs, we used the embeddings from scArches mapping on the core HLCA model provided by Sikkema et al. via Zenodo. For latent embedding with the CR design, we trained a scANVI model77 on the concatenated control and disease datasets, replicating the parameters used to train the HLCA model (according to the notebooks in https://github.com/LungCellAtlas/HLCA_reproducibility), using dataset ID as the batch covariate and training on the same set of 2,000 HVGs used for HLCA training. We opted to keep the HLCA HVG set for the CR design instead of recomputing HVGs because it was selected using a custom batch-aware strategy and compared (in the original study) to alternative selections with a benchmarking pipeline16. Therefore, we reasoned that recomputing HVGs on the CR design would not represent a fair comparison. DA with Milo was performed as described above (changing only milopy.core.make_nhoods, parameters: prop = 0.01), comparing the abundance of cells from IPF samples to the abundance of cells from the control samples. Neighborhood-level annotations were performed using majority voting as described previously.
SPP1hi macrophage analysis
To define SPP1hi profibrotic macrophages, we aggregated the expression of a set of marker genes defined by Adams et al.2 (SPP1, LIPA, LPL, FDX1, SPARC, MATK, GPC4, PALLD, MMP7, MMP9, CHIT1, CSTK, CHI3L1, CSF1, FCMR, TIMP3, COL22A1, SIGLEC15, CCL2), using the SCANPY function scanpy.tl.score_genes() to quantify the signature expression of each cell. A threshold of signature greater than 0.32 was used for the precision-recall analysis (corresponding to the 90% quantile of the signature expression in all cells). For comparison to the label transfer uncertainty metrics, we used the values for uncertainty provided by Sikkema et al.
IPF signature analysis
To define profibrotic signatures in stromal cells, we used a gene expression signature developed on bulk RNA-seq data to diagnose IPF from lung explants36. We downloaded DEGs from the original paper, selected upregulated genes and normalized the differential expression test effect sizes to weights \(\in [0,1]\) with L2 normalization (Extended Data Fig. 8a). We then used a modified version of the SCANPY function scanpy.tl.score_genes() (using weighted means based on gene weights) to quantify the diagnostic signature expression for each cell. We then selected relevant cell types where the difference in mean signature expression between cells from IPF samples and cells from COPD samples was the highest, to control for the effect of end-stage lung disease (Extended Data Fig. 8b). For the precision-recall analysis, we computed the mean profibrotic signature expression across IPF cells in the neighborhoods and used the top 50% quantile for each cell type group (alveolar type (AT), fibroblasts, club cells, basal cells) as the threshold for calling true positives.
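The weighting step can be sketched as follows. This is a simplified illustration with hypothetical function names: it shows L2 normalization of positive effect sizes and a weighted mean over signature genes, and it omits the subtraction of a reference gene set that scanpy.tl.score_genes() performs.

```python
import numpy as np

def l2_weights(effect_sizes):
    """Scale positive effect sizes to unit L2 norm, so that every weight lies in [0, 1]."""
    effects = np.asarray(effect_sizes, dtype=float)
    return effects / np.linalg.norm(effects)

def weighted_signature(expr, weights):
    """Weighted mean expression of the signature genes for each cell (cells x genes)."""
    return np.asarray(expr, dtype=float) @ weights / weights.sum()

# Toy example: two signature genes, two cells
weights = l2_weights([3.0, 4.0])
signature = weighted_signature([[1.0, 1.0], [2.0, 0.0]], weights)
```

Genes with larger effect sizes in the bulk signature contribute more to the per-cell score than they would under an unweighted mean.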
Analysis of aberrant basal-like cells
We annotated the neighborhoods of basaloid cells and KRT17hi aberrant basal cells based on profibrotic signature expression and expression of marker genes reported by refs. 2,37,38,40 (Extended Data Fig. 8a,c,d). We defined normal basal cells as cells annotated as basal and not belonging to the basaloid neighborhood or the KRT17hi basal neighborhood. In total we annotated 1,562 normal basal cells, 377 basaloid cells and 350 KRT17hi aberrant basal cells, distributed across individuals (Fig. 6e). For differential expression analysis, we aggregated gene expression profiles by summing counts according to sample and basal-like phenotype, and performed differential expression testing with the edgeR quasi-likelihood test (Robinson and Oshlack78) using the implementation in the R package glmGamPoi (Ahlmann-Eltze and Huber76), using 1% FDR (Supplementary Table 4). We compared KRT17hi aberrant basal cells against basaloid cells, and each aberrant state against normal basal cells. Differential expression analysis was run on the top 7,500 HVGs for each comparison, selected using the modelGeneVar function from the scran package79. We considered genes to be aberrant state markers (shown in Fig. 6f and Supplementary Fig. 4) only if significant in the comparison between aberrant states and significantly overexpressed against the normal state (reported in Supplementary Table 4). We performed gene set enrichment analysis using the enrichr method80, as implemented in the Python package GSEApy81. To annotate genes targeted by drugs in trials or approved for lung disease, we downloaded the targets of drugs approved or being trialed for lung disease (trait ID: EFO_0003818) from the Open Targets platform82. To annotate genes associated with GWAS variants for lung function (forced expiratory volume, trait ID EFO_0004314), we downloaded a list of significant GWAS loci and predicted causal genes based on the locus2gene model available via the Open Targets Genetics database83.
The full tables for drug targets, the lung function GWAS studies used for the genetic evidence analysis and GWAS-associated genes are shared as metadata in our reproducibility repository (https://github.com/MarioniLab/oor_design_reproducibility).
Statistics and reproducibility
No statistical method was used to predetermine sample size. No data were excluded from the analyses, unless otherwise stated in the relevant section of the Methods where the rationale for exclusion is described. Statistical tests were chosen to model the underlying data distributions (negative binomial likelihood generalized linear models for cell counts11 and mRNA counts78, Wilcoxon signed-rank tests for nonparametric comparisons between metrics). The experiments were not randomized. The investigators were not blinded to allocation during the experiments and outcome assessments. All code to replicate the analysis is available as described in the Code availability section.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All the data used for analysis are publicly available. The blood datasets used for the simulation studies and COVID-19 analysis were downloaded from the CELLxGENE database (Supplementary Table 1). Lung data from the IPF cohort are available via the Gene Expression Omnibus (accession no. GSE136831). The core Human Lung Cell Atlas gene expression data were downloaded from the CELLxGENE database (ID 6f6d381a-7701-4781-935c-db10d30de293). Unified cell type annotations for healthy and IPF data were downloaded from Zenodo (https://zenodo.org/record/6337966). The Tabula Sapiens data used in Supplementary Note 2.2.2 were downloaded from figshare (https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219). All processed data objects in AnnData format84 and trained scVI models are available via figshare (https://doi.org/10.6084/m9.figshare.21456645).
Code availability
The functions for benchmarking out-of-reference state detection, including the code for preprocessing, data splitting, latent embedding, differential analysis and evaluation metrics, have been made available as a Python package at https://github.com/MarioniLab/oor_benchmark (deposited at Zenodo85). Notebooks and scripts to reproduce all analyses presented in the manuscript are available at https://github.com/MarioniLab/oor_design_reproducibility (deposited at Zenodo86).
References
Reichart, D. et al. Pathogenic variants damage cell composition and single cell transcription in cardiomyopathies. Science 377, eabo1984 (2022).
Adams, T. S. et al. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 6, eaba1983 (2020).
Reyfman, P. A. et al. Single-cell transcriptomic analysis of human lung provides insights into the pathobiology of pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 199, 1517–1536 (2019).
Elmentaite, R. et al. Single-cell sequencing of developing human gut reveals transcriptional links to childhood Crohn’s disease. Dev. Cell 55, 771–783 (2020).
Perez, R. K. et al. Single-cell RNA-seq reveals cell type-specific molecular and genetic associations to lupus. Science 376, eabf1970 (2022).
Velmeshev, D. et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 364, 685–689 (2019).
Eisenstein, M. Machine learning powers biobank-driven drug discovery. Nat. Biotechnol. 40, 1303–1305 (2022).
Lindeboom, R. G. H., Regev, A. & Teichmann, S. A. Towards a Human Cell Atlas: taking notes from the past. Trends Genet. 37, 625–630 (2021).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nat. Biotechnol. 39, 30–34 (2021).
Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl Acad. Sci. USA 118, e2100293118 (2021).
Reshef, Y. A. et al. Co-varying neighborhood analysis identifies cell populations associated with phenotypes of interest from single-cell transcriptomics. Nat. Biotechnol. 40, 355–363 (2022).
Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29, 1563–1577 (2023).
Jardine, L. et al. Blood and immune development in human fetal bone marrow and Down syndrome. Nature 598, 327–331 (2021).
Szabo, P. A. et al. Longitudinal profiling of respiratory and systemic immune responses reveals myeloid cell-driven lung inflammation in severe COVID-19. Immunity 54, 797–814 (2021).
Guo, C. et al. Single-cell analysis of two severe COVID-19 patients reveals a monocyte-associated and tocilizumab-responding cytokine storm. Nat. Commun. 11, 3924 (2020).
Leng, K. et al. Molecular characterization of selectively vulnerable neurons in Alzheimer’s disease. Nat. Neurosci. 24, 276–287 (2021).
Olah, M. et al. Single cell RNA sequencing of human microglia uncovers a subset associated with Alzheimer’s disease. Nat. Commun. 11, 6129 (2020).
Yoshida, M. et al. Local and systemic responses to SARS-CoV-2 infection in children and adults. Nature 602, 321–327 (2022).
Schulte-Schrepping, J. et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell 182, 1419–1440 (2020).
Stephenson, E. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat. Med. 27, 904–916 (2021).
Nehar-Belaid, D. et al. Mapping systemic lupus erythematosus heterogeneity at the single-cell level. Nat. Immunol. 21, 1094–1106 (2020).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Sette, A. & Crotty, S. Adaptive immunity to SARS-CoV-2 and COVID-19. Cell 184, 861–880 (2021).
Barrett, T. J. et al. Platelets contribute to disease severity in COVID‐19. J. Thromb. Haemost. 19, 3139–3153 (2021).
Chen, Z. & John Wherry, E. T cell responses in patients with COVID-19. Nat. Rev. Immunol. 20, 529–536 (2020).
Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913.e19 (2021).
Hadjadj, J. et al. Impaired type I interferon activity and inflammatory responses in severe COVID-19 patients. Science 369, 718–724 (2020).
Singh, P. & Ali, S. A. Multifunctional role of S100 protein family in the immune system: an update. Cells 11, 2274 (2022).
Rangarajan, S., Locy, M. L., Luckhardt, T. R. & Thannickal, V. J. Targeted therapy for idiopathic pulmonary fibrosis: where to now? Drugs 76, 291–300 (2016).
Somogyi, V. et al. The therapy of idiopathic pulmonary fibrosis: what is next? Eur. Respir. Rev. 28, 190021 (2019).
Morse, C. et al. Proliferating SPP1/MERTK-expressing macrophages in idiopathic pulmonary fibrosis. Eur. Respir. J. 54, 1802441 (2019).
Meltzer, E. B. et al. Bayesian probit regression model for the diagnosis of pulmonary fibrosis: proof-of-principle. BMC Med. Genomics 4, 70 (2011).
Habermann, A. C. et al. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 6, eaba1972 (2020).
Lang, N. J. et al. Ex vivo tissue perturbations coupled to single cell RNA-seq reveal multi-lineage cell circuit dynamics in human lung fibrogenesis. Preprint at bioRxiv https://doi.org/10.1101/2023.01.16.524219 (2023).
Strunz, M. et al. Alveolar regeneration through a Krt8+ transitional stem cell state that persists in human lung fibrosis. Nat. Commun. 11, 3559 (2020).
Jaeger, B. et al. Airway basal cells show a dedifferentiated KRT17high phenotype and promote fibrosis in idiopathic pulmonary fibrosis. Nat. Commun. 13, 5637 (2022).
Park, H. J. et al. Keratinization of lung squamous cell carcinoma is associated with poor clinical outcome. Tuberc. Respir. Dis. 80, 179–186 (2017).
Amatngalim, G. D. et al. Aberrant epithelial differentiation by cigarette smoke dysregulates respiratory host defence. Eur. Respir. J. 51, 1701009 (2018).
Gong, L. et al. IL-32 induces epithelial-mesenchymal transition by triggering endoplasmic reticulum stress in A549 cells. BMC Pulm. Med. 20, 278 (2020).
Barton, A. R., Sherman, M. A., Mukamel, R. E. & Loh, P.-R. Whole-exome imputation within UK Biobank powers rare coding variant association and fine-mapping analyses. Nat. Genet. 53, 1260–1269 (2021).
Kichaev, G. et al. Leveraging polygenic functional enrichment to improve GWAS power. Am. J. Hum. Genet. 104, 65–75 (2019).
Shrine, N. et al. Author correction: new genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries. Nat. Genet. 51, 1067 (2019).
Temesgen, Z. et al. C reactive protein, a biomarker for early COVID-19 treatment, improves efficacy: results from the phase 3 ‘live-air’ trial. Thorax 78, 606–616 (2023).
Sehgal, K. et al. Cases of ROS1-rearranged lung cancer: when to use crizotinib, entrectinib, lorlatinib, and beyond? Precis. Cancer. Med. 3, 17 (2020).
Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
Koenig, A. L. et al. Single-cell transcriptomics reveals cell-type-specific diversification in human heart failure. Nat. Cardiovasc. Res. 1, 263–280 (2022).
Hocker, J. D. et al. Cardiac cell type-specific gene regulatory programs and disease risk association. Sci. Adv. 7, eabf1444 (2021).
Elmentaite, R. et al. Cells of the human intestinal tract mapped across space and time. Nature 597, 250–255 (2021).
Kong, L. et al. The landscape of immune dysregulation in Crohn’s disease revealed through single-cell transcriptomic profiling in the ileum and colon. Immunity 56, 444–458 (2023).
Jones, R. C. et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022).
Suo, C. et al. Mapping the developing human immune system across organs. Science 376, eabo0510 (2022).
Prasse, A. et al. BAL cell gene expression is indicative of outcome and airway basal cell involvement in idiopathic pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 199, 622–630 (2019).
Xu, Y. et al. Single-cell RNA sequencing identifies diverse roles of epithelial cells in idiopathic pulmonary fibrosis. JCI Insight 1, e90558 (2016).
Jonsdottir, H. R. et al. Basal cells of the human airways acquire mesenchymal traits in idiopathic pulmonary fibrosis and in culture. Lab. Invest. 95, 1418–1428 (2015).
Smirnova, N. F. et al. Detection and quantification of epithelial progenitor cell populations in human healthy and IPF lungs. Respir. Res. 17, 83 (2016).
Heinzelmann, K. et al. Single-cell RNA sequencing identifies G-protein coupled receptor 87 as a basal cell marker expressed in distal honeycomb cysts in idiopathic pulmonary fibrosis. Eur. Respir. J. 59, 2102373 (2022).
Chakraborty, A., Mastalerz, M., Ansari, M., Schiller, H. B. & Staab-Weijnitz, C. A. Emerging roles of airway epithelial cells in idiopathic pulmonary fibrosis. Cells 11, 1050 (2022).
Valenzi, E. et al. Disparate interferon signaling and shared aberrant basaloid cells in single-cell profiling of idiopathic pulmonary fibrosis and systemic sclerosis-associated interstitial lung disease. Front. Immunol. 12, 595811 (2021).
Yazar, S. et al. Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).
Liu, C. et al. Time-resolved systems immunology reveals a late juncture linked to fatal COVID-19. Cell 184, 1836–1857.e22 (2021).
Ahern, D. J. et al. A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell 185, 916–938 (2022).
Arunachalam, P. S. et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science 369, 1210–1220 (2020).
Lee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Sci. Immunol. 5, eabd1554 (2020).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Petukhov, V. et al. Case-control analysis of single-cell RNA-seq studies. Preprint at bioRxiv https://doi.org/10.1101/2022.03.15.484475 (2022).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
Ahlmann-Eltze, C. & Huber, W. glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data. Bioinformatics 36, 5701–5702 (2021).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
Xie, Z. et al. Gene set knowledge discovery with enrichr. Curr. Protoc. 1, e90 (2021).
Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023).
Ochoa, D. et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 51, D1353–D1359 (2023).
Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53, 1527–1533 (2021).
Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Wolf, F. A. anndata: annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).
Dann, E. MarioniLab/oor_benchmark: v0.1.0. Zenodo https://doi.org/10.5281/zenodo.8307751 (2023).
Dann, E. MarioniLab/oor_design_reproducibility: v0.1.0. Zenodo https://doi.org/10.5281/zenodo.8307757 (2023).
Acknowledgements
We thank M. Morgan and R. Lindeboom for the critical reading of the manuscript, and R. Elmentaite, A. Missarova and all members of the Marioni and Teichmann laboratories for valuable discussions and feedback on this project. The PBMC studies included in this work were selected using the materials from the Chan–Zuckerberg Initiative workshop on ‘Assembling Tissue References’, which were kindly shared by L. Dratva. J.C.M. acknowledges core funding from Cancer Research UK (C9545/A29580) and the European Molecular Biology Laboratory. E.D., A.-M.C., A.J.O., K.B.M. and S.A.T. acknowledge Wellcome Sanger core funding (WT206194).
Funding
Open access funding provided by European Molecular Biology Laboratory (EMBL).
Author information
Contributions
E.D. and J.C.M. conceptualized the study. E.D. wrote the benchmarking package and performed the analysis. A.-M.C., A.J.O. and K.B.M. provided datasets, references and intellectual input for the IPF analysis. All authors interpreted the results, and wrote and approved the manuscript. S.A.T. and J.C.M. oversaw the project.
Ethics declarations
Competing interests
In the past 3 years, S.A.T. has consulted for Sanofi and has sat on the scientific advisory boards of QIAGEN, Foresite Labs and GlaxoSmithKline. She is a cofounder and equity holder of Transition Bio. From 1 September 2022, J.C.M. has been an employee of Genentech. The other authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Mathew Chamberlain and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Out-of-reference recovery across simulations.
Scatterplot of differential abundance log-Fold Change (DA logFC) against fraction of out-of-reference (OOR) cells for each neighbourhood, in simulations with different removed OOR cell populations (indicated in y-axis). Colored points indicate neighbourhoods where the enrichment was significant (10% SpatialFDR, logFC > 0). The dotted red line indicates the threshold used to define true positives for precision-recall analysis (20% of the higher fraction in the simulation).
Extended Data Fig. 2 Batch correction and biological conservation with latent dimensions learnt with different reference designs.
Quantification of the overlap between clusters of disease cells and cell type labels (as a measure of biological conservation, left) or sample IDs (as a measure of batch effect, right) on latent dimensions after scArches mapping with different designs (color). The overlap between clusters and covariates is measured by the Normalised Mutual Information (NMI), using the implementation in scikit-learn v1.1.2 (ref. 74). Each box plot shows the median and interquartile range for simulations with different OOR cell populations (n = 15 simulations). NMI values for Leiden clustering with increasing resolution (x-axis) are shown. In boxplots the center line denotes the median; box limits, first and third quartiles; whiskers, 1.5X interquartile range.
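As an illustration of the overlap measure named in this caption, the following is a minimal pure-Python sketch of NMI between two labelings (for example, cluster assignments and cell type labels). It is not the analysis code used for the figure, which relied on scikit-learn; the function names `entropy`, `mutual_information` and `nmi` are hypothetical, and the arithmetic-mean normalization is assumed here because it is the default of scikit-learn's `normalized_mutual_info_score` in recent versions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same cells."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI with arithmetic-mean normalization (assumed, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    denom = (entropy(a) + entropy(b)) / 2
    if denom == 0:
        # Both labelings contain a single group: treat as perfect agreement.
        return 1.0
    return mutual_information(a, b) / denom


# Toy example: clusters versus cell type labels for six cells.
clusters = [0, 0, 0, 1, 1, 1]
cell_types = ["T", "T", "B", "B", "B", "B"]
print(round(nmi(clusters, cell_types), 3))
```

A value close to 1 means that the clustering almost fully recovers the covariate, which is desirable for cell type labels and undesirable for sample IDs.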
Extended Data Fig. 3 Reference design comparison with alternative differential analysis methods for OOR detection.
Boxplots of false discovery rate (FDR), false positive rate (FPR), true positive rate (TPR) and Area Under the Precision-Recall Curve (AUPRC) for detection of OOR cell states with different reference designs (boxplot colour) using 3 different methods for differential cell abundance analysis: co-varying neighbourhood analysis (CNA), MELD and Milo. Points represent simulations with different OOR populations (n = 8, selecting OOR states with at least 250 cells). Tests on the same simulated data are connected. In boxplots the center line denotes the median; box limits, first and third quartiles; whiskers, 1.5X interquartile range.
Extended Data Fig. 4 Statistical power is dependent on the size of the OOR cell state across reference designs.
Scatterplot of number of cells in the simulated OOR state (x-axis) against the true positive rate (TPR, y-axis) of OOR state detection with alternative reference designs.
Extended Data Fig. 5 Reference design comparison on COVID-19 cohort.
(a) Scatterplot of neighbourhood differential abundance log-Fold Change (DA logFC) against the mean expression of IFN signature with ACR design (left) and CR design with joint embedding (right), stratified by cell type annotation. Colored points indicate neighbourhoods where the enrichment was significant (10% SpatialFDR and logFC > 0). The dotted line denotes the threshold for high-IFN used for precision-recall analysis. (b) Beeswarm plot of DA logFC, with neighbourhoods grouped by the fine annotation from Stephenson et al. Neighbourhoods where the differential abundance was significant (10% SpatialFDR) are colored. Annotations are ordered by the value of the maximum logFC for the annotation, to visualize which cell types are prioritized for each design. (c) (left) As in (a), but a close-up on lymphoid cell types. The red dotted line denotes the 90% quantile of mean IFN signature, used to identify the top 10% IFN-high states for each lymphoid cell type for precision-recall analysis. (right) Area under the precision-recall curve for identification of top 10% IFN-high neighbourhoods in lymphoid cell types. The dotted line denotes the baseline value for the AUPRC, indicating the case of a random classifier. Error bars denote the 95% confidence interval of AUPRC calculated from bootstrapping with 1000 resamplings. The height of the bar denotes the AUPRC computed on the real data. (d) Volcano plot for differential abundance analysis on NK cell neighbourhoods (CD16hi NK cells and proliferating NK cells) and naive B cell neighbourhoods. The dotted line denotes the significance threshold of 10% SpatialFDR.
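The AUPRC, its random-classifier baseline and the bootstrap confidence interval described in panel (c) can be sketched as follows. This is a hedged illustration rather than the code behind the figure: the function names are hypothetical, the AUPRC is approximated by average precision (an assumption, as the caption does not state the estimator), and the interval is a simple percentile bootstrap.

```python
import random


def average_precision(y_true, scores):
    """Average precision: mean of precision values at the rank of each positive.

    Ties in `scores` are broken by input order (a simplification).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            tp += 1
            total += tp / rank
    return total / n_pos


def baseline_auprc(y_true):
    """Expected AUPRC of a random classifier: the fraction of positives."""
    return sum(y_true) / len(y_true)


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for average precision."""
    if sum(y_true) == 0:
        raise ValueError("at least one positive is required")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if sum(ys) == 0:
            continue  # resample contains no positives; draw again
        stats.append(average_precision(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy example: 1 marks an IFN-high neighbourhood, scores stand in for DA logFC.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
logfc = [2.1, 0.3, 1.7, 1.9, -0.2, 0.8, 0.1, -1.0, 0.4, 0.0]
print(average_precision(labels, logfc), baseline_auprc(labels))
print(bootstrap_ci(labels, logfc))
```

In this sketch, `labels` plays the role of the top 10% IFN-high indicator and `logfc` the ranking metric; an AUPRC well above `baseline_auprc(labels)` indicates that the metric prioritizes the IFN-high neighbourhoods.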
Extended Data Fig. 6 Heterogeneity in COVID-19 associated CD14+ monocyte states.
(a) Scatterplot of neighbourhood differential abundance log-Fold Change (DA logFC) against the mean expression of IFN signature with CR design for neighbourhoods of CD14+ monocyte cells (as in Fig. 5e). (b) Distribution of IFN signature score for cells belonging to neighbourhoods in CR design assigned to 3 alternative CD14+ phenotypes. (c) COVID-19 enriched CD14+ phenotypes (from CR design) across patients with varying disease severity (Healthy: n = 23 patients, Asymptomatic: n = 9 patients, Mild: n = 23 patients, Moderate: n = 30 patients, Critical: n = 15 patients, Severe: n = 13 patients): each point represents a donor, the y-axis shows the fraction of all CD14+ monocytes in that donor showing IFN-high COVID-19 enriched phenotype (orange), and IFN-low COVID-19 enriched phenotype (yellow). The remaining fraction are monocytes with healthy phenotype (not shown). In boxplots the center line denotes the median; box limits, first and third quartiles; whiskers, 1.5X interquartile range. (d) Volcano plot of differential expression analysis results from comparison between IFN-high and IFN-low COVID-19 specific CD14+ phenotypes identified with ACR design. For each tested gene, the x-axis shows the logFC of the edgeR quasi-likelihood differential expression test64 and the y-axis shows the Benjamini-Hochberg adjusted p-value. Genes with significant DE at FDR < 1% are colored in red. A subset of significant genes with absolute logFC > 0.75 are labelled. (e) Dotplot of mean expression of IFN signature genes, HLA-DR genes and S100 genes for different CD14+ monocyte states identified with ACR design. Dot size is proportional to the fraction of cells expressing the gene in a group. (f) Predicted CD14+ monocyte phenotype for monocytes of COVID-19 patients from the Schulte-Schrepping23 dataset. A logistic regression model was trained on the monocytes from the Stephenson dataset24 and used to predict phenotypes for all CD14+ monocytes in the Schulte-Schrepping23 dataset. The barplot shows the proportion of cells with a predicted phenotype for HLA-DRlo S100hi monocytes and for all other monocytes. (g) Volcano plot of differential expression analysis results from comparison between IFN-high and IFN-low COVID-19 specific CD14+ phenotypes identified with CR design (as in (d)).
Extended Data Fig. 7 Detection of profibrotic (SPP1hi) macrophages with alternative reference designs.
(a) Scatterplots of differential abundance log-Fold Change (DA logFC) against the mean expression of profibrotic macrophage signature in macrophage cell neighbourhoods with ACR design (left), AR design (middle) and CR design (right). Coloured points indicate neighbourhoods where the enrichment was significant (1% SpatialFDR and logFC > 0). Pearson’s correlation coefficients and p-values for significance of the correlation are reported (two-sided test). (b) Analysis of top 10% macrophage neighbourhoods prioritized by DA logFC using ACR and CR designs. When examining prioritized neighbourhoods with low expression of profibrotic signature (top 10% false positives), we found that with the CR design these neighbourhoods include cells from significantly fewer samples compared to the true positives. On the left, we mark neighbourhoods that are considered top 10% (colored), separating out False Positive (FP) neighbourhoods, where the mean profibrotic macrophage signature was below the threshold of the 90% quantile used for precision-recall analysis. The boxplots on the right show the number of samples represented in each top 10% neighbourhood (ACR other: n = 10 neighbourhoods; ACR FP: n = 65 neighbourhoods; CR other: n = 66 neighbourhoods; CR FP: n = 18 neighbourhoods). In boxplots the center line denotes the median; box limits, first and third quartiles; whiskers, 1.5X interquartile range. (c) Barplots of fraction of cells from each donor in top 10% false positive neighbourhoods with ACR (left) and CR design (right). (d) Detection of profibrotic macrophages with label transfer uncertainty score from Sikkema et al. 2022. Violin plots show the distribution of label uncertainty on cells (left), mean label uncertainty on neighbourhoods (centre) and DA logFC with ACR design for profibrotic macrophages (profibrotic macrophage signature > 90% quantile, in pink) and other macrophages (in grey). The dotted lines denote the median value and inter-quartile range. (e) Precision-recall curve for detection of profibrotic macrophages with metrics shown in (d). The dotted lines denote the baseline value for the AUPRC, indicating the case of a random classifier.
Extended Data Fig. 8 Detection of IPF diagnostic gene signature in stromal and epithelial lung cells.
(a) Scatterplot of weights assigned to genes used for IPF signature calculation (from ref. 36). (b) Boxplots of IPF diagnostic signature values for cells of different cell type groups. Cells are grouped by disease status (Control: n = 28 patients; IPF: n = 32 patients). The number of cells for each cell type group and disease group is shown on the right. Cell type groups are ordered by the difference in mean signature between cells from IPF patients and COPD patients (COPD: chronic obstructive pulmonary disease), with cell type groups where the IPF diagnostic signature was highest in IPF patients shown on top. EC: endothelial cells; Club: club cells; SMG: submucosal gland cells. In boxplots the center line denotes the median; box limits, first and third quartiles; whiskers, 1.5X interquartile range. (c) Scatterplots of differential abundance log-Fold Change (DA logFC) against the mean expression of IPF diagnostic signature in cell neighbourhoods of affected cell type groups (AT: alveolar cells, basal cells, club cells, fibroblasts) with ACR design (left), AR design (middle) and CR design (right). Coloured points indicate neighbourhoods where the enrichment was significant (1% SpatialFDR and logFC > 0). Pearson’s correlation coefficients and p-values for significance of the correlation are reported (two-sided test). Neighbourhoods corresponding to aberrant basal-like phenotypes examined in downstream analysis are highlighted. (d) Dotplot of expression of marker genes for different aberrant basal-like cell states (KRT17hi aberrant basal markers from Jaeger et al.40, basaloid markers from Adams et al.2). Dot size is proportional to the fraction of cells expressing the gene in a group.
Extended Data Fig. 9 Differential expression analysis to identify markers for aberrant basal-like cells detected with ACR design.
(a) Gene set enrichment analysis (Enrichr80 hypergeometric test) results for markers of KRT17hi aberrant basal cells: adjusted p-value (BH correction for multiple testing, transformed to −log10(p-value)) for significant gene sets (10% FDR threshold, marked by dotted line) from GO biological process terms and MSigDB Hallmark pathway terms. Example marker genes associated with each term are shown. (b) Gene set enrichment analysis (Enrichr80 hypergeometric test) results for markers of basaloid cells: adjusted p-value (BH correction for multiple testing, transformed to −log10(p-value)) for significant gene sets (5% FDR threshold, marked by dotted line) from GO biological process terms and MSigDB Hallmark pathway terms. Example marker genes associated with each term are shown.
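The Benjamini-Hochberg adjustment and the −log10 transformation used on the x-axis of these panels can be illustrated with a short sketch. The functions `bh_adjust` and `neg_log10` are hypothetical names introduced here, and the input p-values are made up for the example; the enrichment results in the figure were obtained with Enrichr, not with this code.

```python
import math


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p):
    """Transform a p-value to the -log10 scale used for plotting."""
    return -math.log10(p)


# Made-up raw p-values for four gene sets.
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
threshold = 0.10  # a 10% FDR threshold, as in panel (a)
for p_raw, p_adj in zip(raw, adj):
    print(p_raw, p_adj, neg_log10(p_adj), p_adj <= threshold)
```

On the transformed scale, a gene set passes the threshold when its value lies to the right of the dotted line at `neg_log10(threshold)`.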
Supplementary information
Supplementary Information
Supplementary Figs. 1–4, Notes, Note Figs. 1–6 and Note references.
Supplementary Table 1
References of studies included in the healthy PBMC dataset used for the simulations.
Supplementary Table 2
Differential expression analysis results for comparison of CD14+ monocyte COVID-19 phenotypes with ACR design. Differential expression analysis was performed using the edgeR quasi-likelihood test64 using the implementation in the R package glmGamPoi65, using 1% Benjamini–Hochberg FDR as the threshold for significance.
Supplementary Table 3
Differential expression analysis results for comparison of CD14+ monocyte COVID-19 phenotypes with CR design. Differential expression analysis was performed using the edgeR quasi-likelihood test64 using the implementation in the R package glmGamPoi65, using 1% Benjamini–Hochberg FDR as the threshold for significance.
Supplementary Table 4
Differential expression analysis results for comparison of basaloid and KRT17hi aberrant basal cells in the IPF dataset. Differential expression analysis was performed using the edgeR quasi-likelihood test64 using the implementation in the R package glmGamPoi65, using 1% Benjamini–Hochberg FDR as the threshold for significance.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dann, E., Cujba, AM., Oliver, A.J. et al. Precise identification of cell states altered in disease using healthy single-cell references. Nat Genet 55, 1998–2008 (2023). https://doi.org/10.1038/s41588-023-01523-7
DOI: https://doi.org/10.1038/s41588-023-01523-7