Introduction

The central nervous system (CNS) comprises a diverse range of cell types, each characterized by distinct functions and gene expression profiles. CNS disorders, including ischemic, hemorrhagic, neurodegenerative, inflammatory, and developmental conditions, exhibit complex physiological or pathological alterations in CNS cells, resulting in abnormal functions and gene expression profiles1,2,3,4. Since CNS cells can undergo heterogeneous transformations and functional changes in response to various physiological and pathological conditions, CNS is one of the most complex systems, including a multitude of cell types and subtypes with distinct and unique morphology and gene expression profiles5,6.

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for investigating gene expression within specific cell types or subpopulations associated with physiological or pathological conditions7,8,9,10. Existing scRNA-seq analysis methods can be divided into several categories according to their paradigms: clustering-based methods (k-means, DBSCAN11, Seurat12,13), dimensionality reduction-based methods (PCA, t-SNE, UMAP), and deep learning-based methods (scVI14, MARS15, scGNN16). Deep learning-based methods provide a promising solution, with a common strategy of training neural networks for representation learning, and have demonstrated remarkable performance in various tasks. The advancements in scRNA-seq are also empowering neuroscientists to gain insights into the cellular and molecular mechanisms underlying CNS function and disease. For instance, recent CNS scRNA-seq studies have achieved breakthroughs in identifying the phenotypic transformation of astrocytes into non-neuroprotective states in Alzheimer’s disease (AD)17,18, revealing the active involvement of reactive oligodendrocytes in multiple sclerosis (MS)19, and constructing a human brain atlas20,21.

Integrating scRNA-seq data can provide comprehensive analysis of cell types, states, and trajectories that may not be discernible when analyzing individual datasets separately, ultimately enhancing our understanding of cellular heterogeneity and complex biological processes. One of the major issues in building a high-quality reference map from scRNA-seq datasets has been the reduction of technical bias and batch effects. Recent developments in computational methods have addressed these issues and allowed the generation of large-scale reference atlases22. Nonetheless, given the complexity and heterogeneity of CNS cells in different physiological and pathological conditions, integrating large-scale CNS scRNA-seq data still encounters challenges15,23,24. Furthermore, gene expression profiles often contain numerous missing values or noise, known as ‘dropout’, which can mislead downstream analyses such as low-dimensional representation and cell sub-population identification25,26,27. Thus, it is crucial to capture the patterns in gene expression profiles and reveal the heterogeneous relationships of CNS cell types/subtypes for integrating large-scale CNS scRNA-seq data28,29.

Contrastive learning has recently gained attention for its successful application in self-supervised representation learning in computer vision, as exemplified by SimCLR30 and MoCo31. The fundamental concept behind contrastive learning is to map similar samples to nearby regions of the representation space while mapping dissimilar samples to distant regions. Contrastive learning in scRNA-seq analysis aims to learn representations of single-cell data by comparing and contrasting different cells or cell subsets. It can capture meaningful patterns and variations that may indicate distinct cell types, states, or biological processes. Two contrastive learning-based scRNA-seq analysis methods, Concerto4 and CLEAR32, have been proposed and have exhibited remarkable success in various scRNA-seq analyses4,32,33. Concerto is a self-supervised distillation framework configured as an asymmetric teacher–student architecture for modeling multimodal single-cell atlases. CLEAR overcomes the heterogeneity of the scRNA-seq data through a specifically designed representation learning task, allowing it to effectively handle batch effects and dropout events simultaneously. Compared to other scRNA-seq analysis methods, contrastive learning-based methods offer superior generalization capabilities, handling unseen samples more accurately. Thus, contrastive learning shows great promise to overcome the challenges in the integration of large-scale CNS scRNA-seq data. More recently, several novel contrastive learning frameworks have been proposed, among which MoCoV3 achieved state-of-the-art performance on various tasks34. MoCoV3 is built on a momentum encoder and dynamic queue design, enabling effective handling of large sample volumes while maintaining an extensive set of effective negative samples35. Thus, MoCoV3 has strong potential to capture heterogeneity in gene expression profiles and reveal the relationships among cells, making it well suited to integrating large-scale CNS scRNA-seq data.

In this work, we proposed scCM, a self-supervised contrastive learning model for integrating large-scale CNS scRNA-seq data. scCM leverages the latest contrastive learning framework to bring functionally related CNS cells close together while simultaneously pushing apart dissimilar CNS cells by comparing the variations of gene expression, effectively revealing the heterogeneous relationships within the CNS cell types/subtypes. To evaluate the performance of scCM, we conducted experiments on 20 CNS datasets worldwide, encompassing data from 4 different species and covering 10 subcategories of CNS diseases. Results show that scCM outperforms the state-of-the-art methods in various CNS scRNA-seq analyses. Taking advantage of scCM’s effective representation and robust spatial relationship of CNS cells, we integrated the collected human CNS datasets into a large-scale reference for cell annotation in neurodegenerative diseases (ND). Results demonstrate that scCM provides an accurate annotation, along with rich spatial information of cell state. In summary, scCM offers improved performance and can be a valuable tool for researchers studying the complexity and heterogeneity of CNS cells, addressing the significant challenge of integrating large-scale CNS scRNA-seq data to enhance our understanding of CNS function and disease mechanisms.

Results

Overview of scCM architecture

scCM is constructed based on a momentum contrastive learning framework (MoCo v335) with the aim of learning informative representations of CNS scRNA-seq data (Fig. 1a). It comprises three modules: an encoder, a momentum encoder, and a predictor head, all constructed using fully connected neural networks. The encoder and momentum encoder receive a pair of gene expression vectors as input. The gene expression vector fed into the encoder is transformed into an embedding and then projected as a vector q (representing the query) by the predictor head. Simultaneously, the gene expression vector fed into the momentum encoder is transformed into a vector k (representing the key). If the two inputs are the same, they are labeled as a positive pair (q, k+); otherwise, they are labeled as a negative pair (q, k-). The InfoNCE loss function is used to minimize the distance between positive pairs and maximize the distance between negative pairs. During training, the encoder and predictor head are updated by the gradients backpropagated from the InfoNCE loss. The momentum encoder is updated using the momentum strategy (also known as the exponential moving average) based on the learned weights of the encoder. Specifically, its weights are updated as a combination of its own previous weights and the current weights of the encoder, moderated by a momentum coefficient. In this way, scCM promotes spatial proximity among similar cells while gradually distinguishing dissimilar cells. As a result, cells of the same type cluster together as closely as possible, while cells of different types separate as far as possible. After training, the embedding vectors produced by the encoder can be regarded as representations of CNS cells that can be used for downstream tasks.
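The contrastive objective described above can be illustrated with a minimal NumPy sketch of a batch-wise InfoNCE loss. This is not the authors' implementation: the function name `info_nce_loss`, the use of in-batch negatives, cosine similarity, and the temperature value of 0.2 are assumptions made for illustration only.

```python
import numpy as np

def info_nce_loss(q, k, temperature=0.2):
    """Batch-wise InfoNCE: q[i] and k[i] form the positive pair,
    q[i] and k[j] (j != i) are treated as negative pairs.
    The temperature value is an illustrative assumption."""
    # L2-normalise rows so that dot products are cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature                  # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(queries, queries)                       # keys match queries
loss_shuffled = info_nce_loss(queries, np.roll(queries, 1, axis=0))  # keys mismatched
```

In this sketch the loss is small when each query is most similar to its own key and grows when the keys are mismatched, which is the behavior the training objective exploits.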

Fig. 1: Illustration of scCM architecture and CNS datasets.

a scCM is constructed using a momentum contrastive learning framework with symmetric encoders. Its goal is to minimize the embedding distance between similar CNS cells/clusters and maximize the embedding distance between dissimilar CNS cells/clusters. The embeddings learned by Encoder can be informative representations for various downstream tasks, such as clustering, batch effect correction, and cell annotation; b Geographic distribution of the collected CNS datasets; c Species and diseases included in CNS datasets; d Data distribution of each category of CNS data.

CNS datasets

We gathered CNS scRNA-seq datasets related to neurological diseases that were deposited in the GEO database over the past five years. A total of 18 datasets with cell-type labels were selected from CNS studies worldwide2,36,37,38,39,40,41,42,43,44,45,46,47,48,49 (Fig. 1b). Additionally, two unlabeled datasets were obtained to assess the annotation capabilities of scCM: one is the Alzheimer’s disease (AD) dataset50 from the GEO database, and the other is the cerebral tumor dataset obtained from our institution. Together, the datasets comprise 924,425 cells from 4 continents (America, Asia, Europe, and Oceania), encompassing 4 species (human, primate, rodent, and fish) (Fig. 1c) and covering ten subcategories of CNS diseases (AD, Alzheimer’s disease; BM, brain metastasis; GBM, glioblastoma; HD, Huntington’s disease; MB, medulloblastoma; MCD, malformations of cortical development; MS, multiple sclerosis; NHD, Nasu-Hakola disease; TBI, traumatic brain injury) (Fig. 1d). Based on the number of cells, the datasets are categorized into 2 small-scale datasets (<10,000 cells), 7 medium-scale datasets (10,000-50,000 cells), and 9 large-scale datasets (>50,000 cells). Two of the large-scale datasets have cell counts exceeding 100,000. Detailed characteristics of all datasets are provided in Supplementary Table 1.

scCM efficiently improves the performance of CNS scRNA-seq analysis

To demonstrate the effectiveness of scCM in CNS scRNA-seq analysis, we evaluated it on 4 CNS datasets (Anderson, Fournier, Ryan, Zhou), which contain rich information about cell annotations, groups, and batches. We compared scCM with several popular methods, including Seurat12,13, Harmony, LIGER, scVI14, MARS15, CLEAR32 and Concerto4. We also compared with scGNN, DESC, SCLSC, and SMILE (Supplementary Methods). The performance was evaluated using 7 metrics (Supplementary Methods): Accuracy (Acc), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), k-nearest neighbor Batch Effect Test (kBET), Silhouette Score (SS), Homogeneity Score (HS), and V-Measure Score (VMS).

scCM achieves the best performance across all 4 datasets in terms of Acc, ARI and VMS, while MARS achieves the best performance in terms of batch-effect correction (kBET) (Fig. 2a and Supplementary Table 2). We visualized the distribution of clustering, batch-effect correction, and groups specifically on the Fournier dataset (Fig. 2b and Supplementary Fig. 1). These results demonstrate that scCM achieves the highest similarity and consistency between clustering results and true categories. The MARS model, with its strong learning abilities, employs a nonlinear embedding function to cluster similar cell types, characterized by strong batch effect elimination. However, this approach may blur boundaries between closely related cells or subtypes, as seen with the tight grouping of DC1, DC2, and pDC, and the mixing of Neuroblast and Neuron (Fig. 2c). These findings suggest that MARS over-corrects batch effects, which makes it difficult to differentiate similar cell subgroups. Similar conclusions can be drawn from the other three datasets (Supplementary Fig. 2). The clustering efficiency of scCM is comparable to that of Concerto, both of which are based on contrastive learning frameworks. However, scCM exhibits stronger batch effect correction capabilities. Unlike CLEAR and Concerto, scCM does not use a data augmentation approach.

Fig. 2: Performance comparison and visualization between scCM and 7 popular methods.

a Performance comparison of different methods on 4 CNS datasets. Based on the weighted average of all performance metrics, the scCM and MARS models demonstrate superior performance. b The UMAP visualization of compared methods on the Fournier dataset. c The UMAP visualization of MARS, Concerto, and scCM for subclusters of the Fournier dataset. d The performance of comparison methods under different dropout rates.

Furthermore, we also evaluated the impact of a high dropout rate and the efficiency of scCM on large-scale CNS datasets. A high dropout rate makes CNS scRNA-seq analysis more challenging as it results in the loss of gene information. scCM achieves the best performance under various dropout rates in terms of ARI (Fig. 2d), which indicates that scCM is robust to dropout. Interestingly, both MARS and scCM exhibit better batch correction in terms of kBET as the dropout rate increases. This result suggests batch correction can be improved by appropriately reducing the number of highly variable genes (Supplementary Fig. 3). To evaluate efficiency on large-scale datasets, we analyzed the runtime of various deep learning-based methods (Table 1). To assess the running time at close to million-cell scale, we integrated all collected datasets (n = 924,425 cells) for testing. The findings revealed that scCM can handle clustering tasks at this scale, with a running time of 1122.4 ± 72.7 sec, demonstrating scCM’s capability to handle clustering tasks for larger-scale datasets.

Table 1 The running time of various comparison models in the 4 datasets (sec.)

scCM provides promising clustering analysis across large-scale CNS datasets

We further investigated the generalization of scCM in CNS scRNA-seq analysis, considering various critical factors such as sample size, number of clusters, highly variable genes, species, and diseases. We first assessed the impact of sample size and cluster numbers across a wide range of dataset scales (ranging from 789 to 105,332 cells) and cluster numbers (ranging from 4 to 21) (Fig. 3a). Leiden clustering51 on scCM embeddings achieves outstanding clustering performance with Acc and ARI greater than 0.9, except on the Siddique dataset (ARI of 0.84). We then investigated the impact of the number of highly variable genes (HVGs) on clustering analysis in different CNS datasets (Fig. 3b and Supplementary Tables 3 and 4). Although a limited number of HVGs poses significant challenges in CNS cell representation learning, scCM clearly distinguishes clusters even with an extremely small number of HVGs (Supplementary Figs. 4 and 5). Analysis of variance results indicates that scCM’s performance is more stable across various datasets when using the top 3000 HVGs. Therefore, to facilitate training and comparison across multiple datasets, we have set the top 3000 HVGs as the default parameter for our model. Furthermore, we also examined the robustness of scCM across continents, species, and diseases (Fig. 3c). scCM maintains high performance in terms of Acc and ARI on CNS datasets from different continents, species, and diseases. These results highlight the strong generalization and robustness of scCM, which effectively captures the variations across diverse CNS cells, demonstrating its suitability for the species and CNS diseases examined in this study. On non-neural tissue datasets52, scCM also demonstrated favorable clustering outcomes.

Fig. 3: scCM presents promising clustering analysis for large-scale CNS datasets.

a Performance of scCM on 18 labeled CNS datasets. b The effect of the number of highly variable genes on the clustering performance of scCM in terms of ARI (The dots represent different datasets). c Clustering performance comparison on different continents, species, and CNS diseases (The value of n in parentheses represents the number of datasets included). d scCM effectively groups similar brain metastatic subtypes. e The relationship between cell clusters in tumors and oligodendrocytes. f scCM is harnessed to annotate unknown cells in Fournier data.

Moreover, we examined whether scCM captures reliable cluster relationships, in which clusters with similar functions are projected close together. In the analysis of the Gonzalez dataset with five brain metastatic types (Fig. 3d), the same cancer subtypes are adjacent. Specifically, the three breast subtypes are located at the top, while the two ovarian subtypes and three melanoma subtypes are on the left and bottom, respectively. However, for the lung metastases, Lung1 is clearly separated from the other two non-small cell lung cancer clusters (Lung2 and Lung3) because Lung1 is a small cell carcinoma cluster. Furthermore, in Datta’s dataset, IDH-mutant oligodendrocytes are positioned close to glioblastoma by scCM (Fig. 3e). Therefore, scCM can effectively represent functional similarity in the visualization results.

To leverage the ability of scCM to capture cluster relationships, we employed it to annotate the unknown cells in the Fournier data. When visualizing cell clusters in the Fournier data (Fig. 3f), we observed that neuroimmune-related cells tend to congregate in the upper-right region, while cells associated with the substance circulation pathway composition are predominantly located in the lower-right region. The unknown cluster and neuron cells are located in the lower-left region. Thus, according to the spatial distribution, we can deduce that the unknown cluster is closely situated to astrocytes, neurons, and neuroblasts. Therefore, we hypothesized that the unknown cluster is likely a kind of glia with a repairing function similar to astrocytes. To verify this assumption, we conducted GO enrichment analysis and discovered that the genes expressed in the unknown cluster are related to neural injury repair, myelination, and neurotransmitter function. By comparing marker genes (Fig. 3f), we identified the unknown cluster as olfactory ensheathing cells, a specialized glial cell type that promotes axon growth, myelination, and neural nourishment53,54.

CNS cells integration for unveiling the relationship between cell types/subtypes and neurodegenerative diseases

Here, we employed scCM to integrate data from 4 neurodegenerative diseases (AD, HD, MS, and NHD) in 5 controlled human trials to comprehensively analyze cell types and states, subsequently revealing relationships between CNS cell types/subtypes and neurodegenerative diseases. To ensure data consistency, we first normalized and cleaned all datasets by removing cells with inaccurate annotation information and those with ambiguous or unspecified types. Additionally, cell types with fewer than 2000 cells were excluded. After preprocessing, approximately 280,000 cells remained, including oligodendrocytes (114,447 cells), neurons (113,086 cells), astrocytes (31,016 cells), oligodendrocyte precursor cells (11,784 cells), microglia (9747 cells), and endothelial cells (2688 cells). Finally, all cells were integrated by scCM, which corrected batch effects and divided them into clusters. The visualizations in Fig. 4a and Supplementary Fig. 6 demonstrate that scCM effectively removes batch effects and accurately aligns cells across datasets, despite differences in experimental protocols, sequencing technologies, and gene expression measurements.

Fig. 4: Visualization of integrated 4 neurodegenerative disease datasets.

a UMAP visualization of the combined datasets before and after batch correction with scCM. b Astrocytes are divided into five subtypes with the distinct percentage in 4 neurodegenerative diseases. c Bar plot showing the relative percentage of different astrocyte subtypes comparing the control group and the 4 disease subgroups. d Heatmaps of gene expression in neurodegenerative disease groups and astrocyte subpopulations.

Because neurodegenerative diseases are typically characterized by progressive heterogeneity of glial cells, we focused on a well-known type of glial cells, astrocytes, which have been implicated in various neurodegenerative diseases55,56. After eliminating non-astrocytic cells from the reference, astrocytes were divided into five subtypes with distinct percentages in 4 neurodegenerative diseases (Fig. 4b and Supplementary Fig. 7). Specifically, the Astro 0 subtype is the principal component of HD and is rarely found in AD and NHD (Fig. 4c). Moreover, based on the comparison of highly variable genes (Fig. 4d), Astro 0 exhibits high expression of IGF2R, a gene known to play a crucial role in cognitive processes such as learning and memory. Therefore, Astro 0 can be used to annotate the HD samples. Besides, both Astro 1 and Astro 4 subtypes are found in all 4 neurodegenerative diseases, indicating that they play critical roles in nervous system development. However, Astro 1 is the principal component in AD and NHD, comprising more than 80% of astrocytes, which suggests that the two diseases share gene expression profiles during CNS cell functional changes. This is consistent with the findings that AD and NHD have a close relationship40. In MS, by contrast, Astro 4 accounts for 60% of astrocytes, and the remaining 40% consists of the other three astrocyte subtypes. Thus, the cell heterogeneity in MS is higher than in other neurodegenerative diseases, resulting in more complex gene expression profiles in MS. Notably, Astro 2 and Astro 3 subtypes are exclusively found in MS samples. According to the analysis of highly expressed genes, Astro 2 is identified as a type of potential inflammatory astrocyte expressing up-regulated immune-related and apoptosis-related genes (HSPs, NAMPT, and TPST1). The extensive neuronal loss in neurodegenerative diseases is attributed to apoptosis, and emerging evidence suggests that dysregulated apoptosis may be involved in the pathogenesis of MS57,58. Meanwhile, Astro 3 is enriched with genes related to neuron development, especially cilium function, where cilia are tiny microtubule-based signaling devices that regulate diverse physiological functions. Therefore, the development of MS is potentially associated with the heterogeneous transformation of astrocyte subtypes, with the emergence of inflammatory astrocytes being a cause for concern. Furthermore, we also observed heterogeneity among oligodendrocytes and microglia (Supplementary Fig. 8).

By using scCM, the integration of large-scale CNS datasets can facilitate comparative analysis across different CNS diseases, enabling the exploration of cellular heterogeneity under various physiological and pathological conditions, and contributing to a comprehensive exploration of CNS disease causality.

CNS cell annotation by a metadata reference

According to the above results, scCM demonstrates outstanding robustness in CNS cell clustering by bringing similar CNS cells closer and separating dissimilar ones. Therefore, we believe that scCM has the potential to serve as an annotation method to identify unknown cells. The unlabeled CNS cells are first mapped into the spatial distribution of a metadata reference and are then annotated based on the nearest cell clusters in the metadata reference.
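The nearest-cluster annotation idea can be sketched as follows. This is a simplified illustration rather than the scCM procedure itself: representing each reference cluster by its centroid, using Euclidean distance, and the function name `annotate_by_nearest_cluster` are all assumptions.

```python
import numpy as np

def annotate_by_nearest_cluster(query_emb, ref_emb, ref_labels):
    """Assign each query cell the label of the closest reference cluster.
    Assumption: clusters are summarised by centroids in embedding space
    and closeness is measured with Euclidean distance."""
    ref_labels = np.asarray(ref_labels)
    classes = np.unique(ref_labels)
    # One centroid per reference cluster.
    centroids = np.stack([ref_emb[ref_labels == c].mean(axis=0) for c in classes])
    # Pairwise distances, shape (n_query, n_clusters).
    dist = np.linalg.norm(query_emb[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    # Also return the distance to the chosen centroid for inspection.
    return classes[nearest], dist[np.arange(len(query_emb)), nearest]

# Toy reference with two well-separated clusters in a 2-D embedding.
ref_emb = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
ref_labels = ["astrocyte", "astrocyte", "neuron", "neuron"]
query_emb = np.array([[0.1, 0.1], [4.9, 5.1]])
labels, distances = annotate_by_nearest_cluster(query_emb, ref_emb, ref_labels)
```

Returning the distance alongside the label makes it possible to inspect cells that lie far from every reference cluster, which may correspond to cell types not present in the reference.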

First, we used scCM to annotate Soreq’s AD data, with the integrated dataset above as a metadata reference. All cells in the Soreq data were manually identified according to specific marker genes (Fig. 5a), and all cell types are present in the metadata reference. We compared the annotation performance of scCM with MARS. scCM successfully aligns cell types in the Soreq data with the metadata reference, including the small cluster of endothelial cells (n = 381) (Fig. 5b). However, MARS struggles to effectively align Soreq’s data with the metadata reference, making it challenging to classify and annotate subpopulations within the clusters. In particular, MARS erroneously assigns oligodendrocytes as annotations for endothelial cells, disregarding the distinct functional characteristics of these two cell clusters. These results indicate that scCM provides a more similar spatial distribution between the metadata reference and Soreq’s data. Furthermore, the annotation results in the two confusion matrices demonstrate that scCM exhibits greater sensitivity in annotating small cell clusters.

Fig. 5: Annotation of unknown cells.

a Manual annotation of Soreq’s Alzheimer’s disease (AD) dataset. b The annotation by scCM and MARS, and the confusion matrices of annotation measured in terms of accuracy. c Pituitary tumor cells mapped on the reference and divided into 4 clusters by scCM. d The visualization of clusters showing a relationship between reference and 4 clusters of pituitary tumor cells by scCM and MARS.

To further assess the annotation performance of scCM in recognizing unobserved cell types, we then used the above integrated neurodegenerative disease dataset as a metadata reference to annotate the pituitary tumor data. The pituitary tumor samples are divided into 4 clusters in the metadata reference (Fig. 5c), and each cluster has distinct gene expression patterns (Supplementary Fig. 9). Specifically, cluster 4 lies close to endothelial cells in the metadata reference, suggesting that cluster 4 should be annotated as endothelial cells. This annotation is further corroborated by the expression of the marker genes CLDN5 and ITM2A. The other three cell clusters are distinct from the reference, since their cell types are not observed in the metadata reference. Cluster 1 is annotated as corticotrophs based on the marker genes TBX19 and NEUROD1. Clusters 2 and 3 are close to microglia, which are intracranial immune cells responsible for regulating the maintenance of neuronal networks. Based on marker genes, clusters 2 and 3 are identified as macrophages and T cells, respectively. This result demonstrates that scCM is not only robust in rejecting annotation for unobserved cell types, but also offers relative spatial information about unobserved cell types in the reference space. In contrast, although MARS also has the potential to detect novel clusters, it lacks a reliable spatial distribution where similar CNS cells are in close proximity, which leads to incorrect spatial information for cell annotation (Fig. 5d). These results demonstrate that scCM efficiently annotates CNS cell types.

Discussion

We developed scCM, a self-supervised contrastive learning method utilizing the MoCoV3 framework, which offers improved performance and effectively integrates large-scale CNS scRNA-seq data, making it a valuable tool for researchers studying the complexity and heterogeneity of CNS cells. The momentum encoder and dynamic queue of MoCoV3 efficiently handle extensive sample volumes, maintaining a broad effective negative sample set, crucial for detecting heterogeneity in single-cell data. In contrast, Concerto’s SimCLR framework, while proficient at data representation learning, faces a limitation due to its inability to support very large batch sizes. Unlike the CLEAR model, which utilizes the MoCoV1 framework with a Memory Queue mechanism, our implementation of the MoCoV3 framework dispenses with this mechanism, allowing for large batch sizes during sample observation. It also incorporates elements from the BYOL framework, such as the projection head and prediction head, which significantly contribute to performance enhancement. The SCLSC that employs a supervised contrastive learning model framework is highly label-dependent, linking the model’s performance to label quality. The advantages of MoCoV3 make it more effective in processing large-scale CNS single-cell data.

scCM can capture the heterogeneity in CNS cells, enabling it to integrate large-scale CNS scRNA-seq data to unveil inter-cell connections based on spatial relationships. Experiments on 20 CNS datasets demonstrate that scCM maintains outstanding performance and generalization for clustering, handling dropout events, and correcting batch effects across datasets with varying numbers of genes, cells, clusters, species, and CNS diseases. Moreover, scCM can offer spatial interpretability of CNS cell clustering, which can help infer cell development or functional relationships. Taking advantage of these strengths of scCM, we have successfully annotated the observed and unobserved CNS cell subtypes based on an integrated large-scale neurodegenerative data reference. We anticipate that our approach will have broader applications in the CNS field.

There are a few directions to be further investigated: (1) The evaluation of scCM was mainly conducted on CNS scRNA-seq data. Additional research is necessary to assess the applicability of scCM to other tissues. (2) A larger and more comprehensive reference can provide more reliable annotations for unknown cells. In future work, we plan to collect broader CNS datasets to increase the number of cells, especially focusing on the rare cell subtypes. (3) The limited diversity of cell types and confounding factors may reduce the accuracy of annotation. To solve this problem, we plan to create an expanded and higher-quality reference, specifically tailored to different neurologic disease contexts. (4) Analyzing cell subtypes currently requires individual analysis of each cell cluster, as it cannot be achieved in a single step.

Methods

Dataset collection

We systematically collected the published articles within the past five years related to CNS scRNA-seq in the GEO database. The literature search strategy is based on the following keywords: (central nervous system OR brain OR cerebral OR neurodegenerative disease OR Alzheimer’s disease OR Huntington’s disease OR multiple sclerosis OR Parkinson’s disease OR dementia OR cerebral tumor OR brain metastases OR glioblastoma OR medulloblastoma OR pituitary adenoma OR traumatic brain injury) AND (single-cell RNA sequencing). After preprocessing, 18 CNS datasets with labeled cell annotations were obtained from the GEO database. Additionally, 2 unlabeled datasets were used to evaluate the cell annotation performance of scCM. One AD dataset was sourced from the GEO database, and another human brain tumor dataset was derived from neural neoplasm specimens, with approval from the Ethics Review Committee of Chinese Academy of Medical Sciences and Peking Union Medical College.

Data preprocessing

Data processing includes filtering, preprocessing, normalization, and HVG selection. The preprocessing was conducted by SCANPY functions, as follows:

(a) To eliminate potential empty droplets, low-quality cells, and potential multiplets, cells with fewer than 300 or more than 6000 transcripts were excluded from the analysis. Furthermore, cells identified as being of poor quality, where mitochondrial gene expression accounted for more than 15% of the total counts, were excluded. (b) Gene expression measurements in each cell were first normalized to scale the total transcript counts to 10,000 per cell, a crucial step to address variations in sequencing depths across cells. After normalization, the data were transformed using the natural logarithm (“LogNormalize” method). (c) We use the top 3000 HVGs as the default training parameter. On one hand, using a fixed number of HVGs allows for consistent comparisons across different datasets59. On the other hand, after testing various numbers of HVGs, we found that the model yields more stable training results when the number of HVGs is set at 3000.
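Steps (a) to (c) can be sketched in plain NumPy as follows. The original pipeline uses SCANPY functions, so this is only a simplified stand-in: the function name `preprocess`, the reading of “transcripts” as total counts per cell, the `MT-` prefix for mitochondrial genes, and the variance-based HVG ranking are all assumptions made for illustration.

```python
import numpy as np

def preprocess(counts, gene_names, min_counts=300, max_counts=6000,
               max_mito_frac=0.15, target_sum=1e4, n_hvg=3000):
    """counts: (cells, genes) raw count matrix; gene_names: list of str."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)
    # Assumption: mitochondrial genes are recognised by an 'MT-' prefix.
    is_mito = np.array([g.upper().startswith("MT-") for g in gene_names])
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    # (a) drop cells outside the count range or with a high mitochondrial share
    keep = (total >= min_counts) & (total <= max_counts) & (mito_frac <= max_mito_frac)
    x = counts[keep]
    # (b) scale each cell to target_sum counts, then apply log(1 + x)
    x = x / x.sum(axis=1, keepdims=True) * target_sum
    x = np.log1p(x)
    # (c) keep the n_hvg most variable genes (simplified HVG ranking)
    n_hvg = min(n_hvg, x.shape[1])
    hvg_idx = np.sort(np.argsort(x.var(axis=0))[::-1][:n_hvg])
    return x[:, hvg_idx], keep, hvg_idx

genes = ["MT-CO1", "GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names
raw = [[10, 500, 400, 90],     # kept: 1000 counts, 1% mitochondrial
       [10, 50, 30, 10],       # removed: only 100 counts
       [500, 300, 100, 100]]   # removed: 50% mitochondrial
x, keep, hvg_idx = preprocess(raw, genes, n_hvg=4)
```

The function also returns the cell mask and gene indices so that labels and batch annotations can be subset consistently with the expression matrix.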

Dropout event

We simulated dropout events by randomly replacing gene expression values with zeros. This replacement was performed after selecting highly variable genes to avoid redundantly processing genes with zero expression values. This strategy ensures that dropout events mainly affect the expressed genes identified in the cells. The dropout rate ranged from 0.05 to 0.50.
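One possible reading of this procedure is sketched below. The function name `simulate_dropout`, the `seed` argument, and the decision to sample only among non-zero entries (so that exactly the requested fraction of expressed values is zeroed) are assumptions made for illustration.

```python
import numpy as np

def simulate_dropout(x, rate, seed=0):
    """Set a fraction `rate` of the non-zero entries of x to zero."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    out = x.copy()

    # Assumption: only expressed (non-zero) values are candidates for dropout.
    rows, cols = np.nonzero(x)
    n_drop = int(round(rate * rows.size))
    chosen = rng.choice(rows.size, size=n_drop, replace=False)
    out[rows[chosen], cols[chosen]] = 0.0
    return out
```

Calling the function with rates from 0.05 to 0.50 on the HVG matrix would produce the series of corrupted inputs described above.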

The scCM framework

scCM is based on a contrastive learning framework, which is trained to differentiate similar and dissimilar pairs of cells, aiming to group cells with similar gene expression patterns while distinguishing cells with larger expression differences. The scCM workflow is outlined as follows:

Encoder architecture

The query encoder \({f}_{q}(x)\) and the key encoder \({f}_{k}(x)\) share the same architecture and together form a Siamese network; each is implemented as a two-layer fully connected network. The number of input nodes matches the number of genes in the gene expression profile. The first layer has 1024 neurons with LayerNorm normalization and a ReLU activation function. To mitigate overfitting, a Dropout module with a dropout rate of 0.5 is added in subsequent layers. Finally, the output layer produces a 128-dimensional embedding. The query encoder (\({f}_{q}\)) is updated by standard backpropagation, whereas the key encoder (\({f}_{k}\)) is updated with a momentum update35.

$${f}_{k}=m{f}_{k}+\left(1-m\right){f}_{q}.$$
(1)

Here, \(m\in [0,1)\) is the momentum coefficient, regulating the update speed of the encoder network. A higher value of m results in slower updates of the key encoder, whereas a lower value of m drives the key encoder to resemble the query encoder more closely.
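The effect of Eq. (1) can be illustrated with a minimal NumPy sketch, in which each encoder is represented by a dictionary of parameter arrays. The dictionary representation and the function name `momentum_update` are assumptions for illustration; in a deep learning framework the same rule would be applied to the network weights without gradient tracking.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.99):
    """Apply f_k = m * f_k + (1 - m) * f_q to every parameter array."""
    return {
        name: m * key_params[name] + (1.0 - m) * query_params[name]
        for name in key_params
    }

# Toy example: the key weights drift slowly toward the query weights.
key = {"w": np.zeros(3)}
query = {"w": np.ones(3)}
for _ in range(10):
    key = momentum_update(key, query, m=0.99)
```

After n updates with a fixed query encoder, the key weights in this toy example equal 1 − m^n, which shows why a value of m close to 1 yields slow, stable updates.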

Predictor head

A predictor head consisting of two fully connected layers is added downstream of the query encoder. The first layer is a 128-dimensional hidden layer with a ReLU activation function, and the output layer has no activation function. The additional predictor network can effectively prevent collapse60.

Loss function

The InfoNCE loss function is used to minimize the distance between positive pairs and maximize the distance between negative pairs. The inputs of the query encoder and key encoder are labeled as a positive pair (q, k+) when they come from the same cell; otherwise, they are labeled as a negative pair (q, k−). The InfoNCE loss function is:

$${L}_{q}=-\log \frac{\exp \left(q\cdot {k}^{+}/\tau \right)}{\exp \left(q\cdot {k}^{+}/\tau \right)+{\sum }_{{k}^{-}}\exp \left(q\cdot {k}^{-}/\tau \right)}.$$
(2)

where the temperature parameter (\(\tau\)) shapes the distribution61. Increasing \(\tau\) creates a smoother distribution. Setting \(\tau\) too small causes the model to concentrate solely on hard negative samples, leading to convergence issues and limited feature generalization.
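Eq. (2) for a single query can be written out as a short NumPy sketch. The function name `info_nce` and the argument layout (one positive key, a matrix of negative keys) are assumptions; whether embeddings are L2-normalized beforehand is not stated here and is left to the caller.

```python
import numpy as np

def info_nce(q, k_pos, k_neg, tau=0.8):
    """InfoNCE loss for one query.

    q     : (d,) query embedding
    k_pos : (d,) positive key embedding
    k_neg : (n, d) negative key embeddings
    """
    # Similarity logits: the positive pair first, then all negatives.
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / tau
    # Subtract the maximum for numerical stability (does not change the loss).
    logits = logits - logits.max()
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos
```

The loss decreases as the query becomes more similar to its positive key relative to the negatives, and a larger \(\tau\) flattens the differences between logits.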

The parameters of the encoder are optimized using the Adam optimizer. The temperature parameter (\(\tau\)) is set to 0.8. A momentum coefficient (m) of 0.99 is employed for slower update of the key encoder’s weights35. For scCM, the default learning rate is \(1\times {10}^{-6}\), and the batch size is 512. We standardized the batch size to 512 to facilitate performance comparison across different datasets. Our training process involves a warm-up period of 5 epochs to accelerate convergence, followed by a total of 10 epochs.

Clustering and UMAP visualization

We first performed k-means clustering in Python on the embeddings of each model to allow a uniform comparison. To determine the optimal number of clusters (k), we employed the Elbow method, which involves plotting the explained variation as a function of the number of clusters and selecting the elbow of the curve as the number of clusters to use. We then calculated the ARI values for several values around this optimal k and selected the k value corresponding to the highest ARI value, followed by calculating other performance metrics62. The optimal number of clusters (k) can be found in Supplementary Table 2. The Leiden algorithm was employed to assist the scCM model in clustering analysis across multiple datasets (Supplementary Methods)4. Cell embeddings were visualized by UMAP using SCANPY (n_neighbors is set to 15).

CNS cell integration and annotation

Integration

To integrate two single-cell RNA-seq datasets using Scanpy, we first ensured that both datasets were in the same format (e.g., AnnData) and then combined them into a single AnnData object. The data were preprocessed by filtering low-quality cells and genes, normalizing and log-transforming the expression values, and selecting 3000 HVGs. The preprocessed dataset was then imported into scCM, and training was initiated using the default parameter settings.

Annotation

After the training phase, we incorporated the cell annotation information from the metadata set into the trained embeddings. For cells not present in the metadata set, we filled in missing annotation information with ‘unknown’. This allowed the visualization of the embeddings to reflect different cell clusters. Based on the mapping relationships between cell clusters in the new dataset and the metadata set, we then annotated the different cell clusters using the Leiden algorithm.
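The label-transfer step can be illustrated with a small sketch. It assumes that cluster assignments (for example, from the Leiden algorithm) are already available and that each cluster receives the most frequent known label among its cells; this majority-vote rule and the function name `annotate_clusters` are assumptions for illustration, not a description of the exact mapping used in the study.

```python
from collections import Counter, defaultdict

def annotate_clusters(cluster_ids, labels, unknown="unknown"):
    """Fill 'unknown' labels with the majority known label of each cluster."""
    # Count known labels within every cluster.
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label != unknown:
            votes[cluster][label] += 1

    # Assumption: a cluster inherits its most frequent known label.
    cluster_label = {
        cluster: (votes[cluster].most_common(1)[0][0]
                  if votes[cluster] else unknown)
        for cluster in set(cluster_ids)
    }
    return [
        cluster_label[cluster] if label == unknown else label
        for cluster, label in zip(cluster_ids, labels)
    ]
```

Clusters that contain no annotated cells remain ‘unknown’ under this rule and would require manual inspection.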

Evaluation metrics

The analysis utilized several metrics to assess the quality and effectiveness of the data clustering and batch effect adjustments. These metrics include:

Acc

Measure the proportion of correct predictions against the total number of cases examined, ranging from 0 to 1.

ARI

Evaluate the similarity between two clusterings, adjusting for chance so that a random clustering scores approximately zero, ranging from −1 to 1.

NMI

Assess the mutual dependence between the clustering results, normalized to account for the size of the clusters, facilitating comparison across different sets, ranging from 0 to 1.

kBET

Detect batch effects within the data by analyzing the local neighborhood of each point, thereby ensuring that similar samples are not batch-specific, ranging from 0 to 1.

SS

Measure how similar an object is to its own cluster compared to other clusters. This score helps to determine the appropriateness of the clustering by calculating the difference in distances between clusters, ranging from −1 to 1.

HS

Evaluate clusters that contain only members of a single class, reflecting the consistency of labeling within clusters, ranging from 0 to 1.

VMS

Combine both homogeneity and completeness of the clustering results into a single score to evaluate the success of the clustering against a ground truth classification, ranging from 0 to 1.
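As a worked example of one of the metrics above, the ARI can be computed from the contingency table of true and predicted labels using the standard pair-counting formula. The sketch below is a plain NumPy illustration with hypothetical function names; it is not the implementation used to produce the reported values.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count cells for every (true class, predicted cluster) combination."""
    _, ti = np.unique(np.asarray(labels_true), return_inverse=True)
    _, pi = np.unique(np.asarray(labels_pred), return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)
    return table

def comb2(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    table = contingency(labels_true, labels_pred)
    n = table.sum()
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()

    # Expected index under random labeling and the maximum attainable index.
    expected = sum_rows * sum_cols / comb2(n)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)
```

Because the score depends only on co-membership of cell pairs, it is invariant to how the clusters are numbered.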

For detailed information on the specific methodologies and results of these metrics, please refer to the Supplementary Methods. To better compare the performance of different models, we normalized each metric and then calculated a composite index (CI), where \({x^{\prime} }_{i}\) is the normalized value of the i-th metric and N is the number of metrics63.

$${CI}=\mathop{\sum }_{i=1}^{N}{x^{\prime} }_{i}$$
(3)

This approach allowed us to generate a rank for each model that encompasses all evaluated metrics.
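The composite index of Eq. (3) can be sketched as follows. The normalization scheme is not specified in this section, so min-max scaling of each metric across models is assumed here purely for illustration, and the function name `composite_index` is hypothetical.

```python
import numpy as np

def composite_index(scores):
    """Sum of per-metric normalized scores for each model.

    scores : (models, metrics) array in which higher values are better
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    # Assumption: min-max scaling; constant metrics contribute zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    normed = (scores - lo) / span
    return normed.sum(axis=1)

# Toy example with three models and two metrics.
ci = composite_index([[0.9, 0.8], [0.5, 0.4], [0.7, 0.8]])
ranking = np.argsort(-ci)  # model indices from best to worst
```

In this toy example the first model is best on both metrics and therefore receives the highest CI.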

Running time

The running time reported in this study covers only the model’s training, including data loading time and training time. To evaluate the training time on large-scale datasets, we integrated the 18 datasets collected in this study into a single dataset comprising approximately one million cells. Instead of using specialized data integration methods, we standardized the data formats across all datasets and then performed a vertical merge to combine them. Data preprocessing used the default parameters.

Statistics and reproducibility

The scCM model was built using the PyTorch (version 2.2.1) framework in Python (version 3.9). Single-cell data preprocessing, analysis, and visualization were performed using Scanpy (version 1.9.8). Model testing included running time and performance metrics, with clustering performance metrics calculated using NumPy (version 1.26.4) and analyzed using means and standard deviations. Performance comparison charts were created using Python and GraphPad Prism (version 8).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.