Abstract
The central nervous system (CNS) comprises a diverse range of brain cell types with distinct functions and gene expression profiles. Although single-cell RNA sequencing (scRNA-seq) provides new insights into the brain cell atlases, integrating large-scale CNS scRNA-seq data still encounters challenges due to the complexity and heterogeneity among CNS cell types/subtypes. In this study, we introduce a self-supervised contrastive learning method, called scCM, for integrating large-scale CNS scRNA-seq data. scCM brings functionally related cells close together while simultaneously pushing apart dissimilar cells by comparing the variations of gene expression, effectively revealing the heterogeneous relationships within the CNS cell types/subtypes. The effectiveness of scCM is evaluated on 20 CNS datasets covering 4 species and 10 CNS diseases. Leveraging these strengths, we successfully integrate the collected human CNS datasets into a large-scale reference to annotate cell types and subtypes in neural tissues. Results demonstrate that scCM provides an accurate annotation, along with rich spatial information of cell state. In summary, scCM is a robust and promising method for integrating large-scale CNS scRNA-seq data, enabling researchers to gain insights into the cellular and molecular mechanisms underlying CNS functions and diseases.
Introduction
The central nervous system (CNS) comprises a diverse range of cell types, each characterized by distinct functions and gene expression profiles. CNS disorders, including ischemic, hemorrhagic, neurodegenerative, inflammatory, and developmental conditions, exhibit complex physiological or pathological alterations in CNS cells, resulting in abnormal functions and gene expression profiles1,2,3,4. Since CNS cells can undergo heterogeneous transformations and functional changes in response to various physiological and pathological conditions, CNS is one of the most complex systems, including a multitude of cell types and subtypes with distinct and unique morphology and gene expression profiles5,6.
Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for investigating gene expression within specific cell types or subpopulations associated with physiological or pathological conditions7,8,9,10. Existing scRNA-seq analysis methods can be divided into several categories according to their paradigms: clustering-based methods (k-means, DBSCAN11, Seurat12,13), dimensionality reduction-based methods (PCA, t-SNE, UMAP), and deep learning-based methods (scVI14, MARS15, scGNN16). Deep learning-based methods provide a promising solution, commonly training neural networks for representation learning, and have demonstrated remarkable performance in various tasks. The advancements in scRNA-seq are also empowering neuroscientists to gain insights into the cellular and molecular mechanisms underlying CNS function and disease. For instance, recent CNS scRNA-seq studies have achieved breakthroughs in the identification of phenotypic transformation of astrocytes into non-neuroprotective states in Alzheimer’s disease (AD)17,18, the active involvement of reactive oligodendrocytes in multiple sclerosis (MS)19, and the construction of a human brain atlas20,21.
Integrating scRNA-seq data can provide comprehensive analysis of cell types, states, and trajectories that may not be discernible when analyzing individual datasets separately, ultimately enhancing our understanding of cellular heterogeneity and complex biological processes. One of the major issues in building a high-quality reference map from scRNA-seq datasets has been the reduction of technical bias and batch effects. Recent developments in computational methods have addressed these issues and allowed the generation of large-scale reference atlases22. Nonetheless, given the complexity and heterogeneity of CNS cells in different physiological and pathological conditions, integrating large-scale CNS scRNA-seq data still encounters challenges15,23,24. Furthermore, gene expression profiles often contain numerous missing values or noise, known as ‘dropout’, which can mislead downstream analyses such as low-dimensional representation and cell sub-population identification25,26,27. Thus, it is crucial to capture the patterns in gene expression profiles and reveal the heterogeneous relationships of CNS cell types/subtypes for integrating large-scale CNS scRNA-seq data28,29.
Contrastive learning has recently gained attention for its successful application in self-supervised representation learning in computer vision, as exemplified by SimCLR30 and MoCo31. The fundamental concept behind contrastive learning is to map similar samples to nearby points in the representation space while mapping dissimilar samples to distant points. Contrastive learning in scRNA-seq analysis aims to learn representations of single-cell data by comparing and contrasting different cells or cell subsets. It can capture meaningful patterns and variations that may indicate distinct cell types, states, or biological processes. Two contrastive learning-based scRNA-seq analysis methods, Concerto4 and CLEAR32, have been proposed. They have exhibited remarkable success in various scRNA-seq analyses4,32,33. Concerto is a self-supervised distillation framework configured as an asymmetric teacher–student architecture for modeling multimodal single-cell atlases. CLEAR overcomes the heterogeneity of the scRNA-seq data through a specifically designed representation learning task, allowing it to effectively handle batch effects and dropout events simultaneously. Compared to other scRNA-seq analysis methods, contrastive learning-based methods offer superior generalization capabilities for handling unseen samples more accurately. Thus, contrastive learning shows great promise to overcome the challenges in the integration of large-scale CNS scRNA-seq data. More recently, several novel contrastive learning frameworks have been proposed, among which MoCoV3 achieved state-of-the-art performance on various tasks34. MoCoV3 is built on a momentum encoder and dynamic queue design, enabling effective handling of large sample volumes while maintaining an extensive set of effective negative samples35. Thus, MoCoV3 has strong potential to capture heterogeneity in gene expression profiles and reveal the relationships among cells, enabling effective integration of large-scale CNS scRNA-seq data.
In this work, we proposed scCM, a self-supervised contrastive learning model for integrating large-scale CNS scRNA-seq data. scCM leverages the latest contrastive learning framework to bring functionally related CNS cells close together while simultaneously pushing apart dissimilar CNS cells by comparing the variations of gene expression, effectively revealing the heterogeneous relationships within the CNS cell types/subtypes. To evaluate the performance of scCM, we conducted experiments on 20 CNS datasets worldwide, encompassing data from 4 different species and covering 10 subcategories of CNS diseases. Results show that scCM outperforms the state-of-the-art methods in various CNS scRNA-seq analyses. Taking advantage of scCM’s effective representation and robust spatial relationship of CNS cells, we integrated the collected human CNS datasets into a large-scale reference for cell annotation in neurodegenerative diseases (ND). Results demonstrate that scCM provides an accurate annotation, along with rich spatial information of cell state. In summary, scCM offers improved performance and can be a valuable tool for researchers studying the complexity and heterogeneity of CNS cells, addressing the significant challenge of integrating large-scale CNS scRNA-seq data to enhance our understanding of CNS function and disease mechanisms.
Results
Overview of scCM architecture
scCM is constructed based on a momentum contrastive learning framework (MoCo v335) with the aim of learning informative representations of CNS scRNA-seq data (Fig. 1a). It comprises three modules: an encoder, a momentum encoder, and a predictor head, all constructed using fully connected neural networks. The encoder and momentum encoder receive a pair of gene expression vectors as input. The gene expression vector fed into the encoder is transformed into an embedding and then projected as a vector q (representing the query) by the predictor head. Simultaneously, the gene expression vector fed into the momentum encoder is transformed into a vector k (representing the key). If the two inputs are the same, they are labeled as a positive pair (q, k+); otherwise, they are labeled as a negative pair (q, k-). The InfoNCE loss function is utilized to minimize the distance between positive pairs and maximize the distance between negative pairs. During training, the encoder and predictor head are updated by the gradients backpropagated from the InfoNCE loss. The momentum encoder is updated using the momentum strategy (also known as the exponential moving average) based on the learned weights of the encoder. Specifically, its weights are updated as a combination of its previous weights and the current weights of the encoder, moderated by a momentum coefficient. In this way, scCM promotes spatial proximity among similar cells while gradually distinguishing dissimilar cells. As a result, cells of the same type cluster together as closely as possible, while cells of different types separate as far as possible. After training, the embedding vectors produced by the encoder can be regarded as representations of CNS cells that can be utilized for downstream tasks.
CNS datasets
We gathered CNS scRNA-seq datasets related to neurological diseases that were deposited in the GEO database over the past five years. A total of 18 datasets with cell-type labels were selected from CNS studies worldwide2,36,37,38,39,40,41,42,43,44,45,46,47,48,49 (Fig. 1b). Additionally, two unlabeled datasets were obtained to assess the annotation capabilities of scCM: one is the Alzheimer’s disease (AD) dataset50 from the GEO database, and the other is the cerebral tumor dataset obtained from our institution. Together, the datasets comprise 924,425 cells from 4 continents (America, Asia, Europe, and Oceania), encompassing 4 species (human, primate, rodent, and fish) (Fig. 1c) and covering ten subcategories of CNS diseases (AD, Alzheimer’s disease; BM, brain metastasis; GBM, glioblastoma; HD, Huntington’s disease; MB, medulloblastoma; MCD, malformations of cortical development; MS, multiple sclerosis; NHD, Nasu-Hakola disease; TBI, traumatic brain injury) (Fig. 1d). Based on the number of cells, the labeled datasets are categorized into 2 small-scale datasets (<10,000 cells), 7 medium-scale datasets (10,000-50,000 cells), and 9 large-scale datasets (>50,000 cells). Two of the large-scale datasets have cell counts exceeding 100,000. Detailed characteristics of all datasets are provided in Supplementary Table 1.
scCM efficiently improves the performance of CNS scRNA-seq analysis
To demonstrate the effectiveness of scCM in CNS scRNA-seq analysis, we evaluated it on 4 CNS datasets (Anderson, Fournier, Ryan, Zhou), which contain rich information about cell annotations, groups, and batches. We compared scCM with several popular methods, including Seurat12,13, Harmony, LIGER, scVI14, MARS15, CLEAR32 and Concerto4. We also compared with scGNN, DESC, SCLSC, and SMILE (Supplementary Methods). The performance was evaluated using 7 metrics (Supplementary Methods): Accuracy (Acc), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), k-nearest neighbor Batch Effect Test (kBET), Silhouette Score (SS), Homogeneity Score (HS), and V-Measure Score (VMS).
scCM achieves the best performance across all 4 datasets in terms of Acc, ARI and VMS, while MARS achieves the best performance in terms of batch effects (kBET) (Fig. 2a and Supplementary Table 2). We visualized the distribution of clustering, batch-effect correction, and groups specifically on the Fournier dataset (Fig. 2b and Supplementary Fig. 1). These results demonstrate that scCM achieves the highest similarity and consistency between clustering results and true categories. The MARS model employs a nonlinear embedding function to cluster similar cell types and strongly eliminates batch effects. However, this approach may blur boundaries between closely related cells or subtypes, as seen with the tight grouping of DC1, DC2, and pDC, and the mixing of Neuroblast and Neuron (Fig. 2c). These findings suggest that MARS over-corrects batch effects, making it difficult to differentiate similar cell subgroups. Similar conclusions can be drawn from the other three datasets (Supplementary Fig. 2). The clustering efficiency of scCM is comparable to that of Concerto, both of which are based on contrastive learning frameworks. However, scCM exhibits stronger batch effect correction capabilities. Unlike CLEAR and Concerto, scCM does not utilize a data augmentation approach.
We also evaluated the impact of high dropout rates and the efficiency of scCM on large-scale CNS datasets. A high dropout rate makes CNS scRNA-seq analysis more challenging as it results in the loss of gene information. scCM achieves the best performance under various dropout rates in terms of ARI (Fig. 2d), which indicates that scCM is robust to dropout. Interestingly, both MARS and scCM exhibit better batch correction in kBET as the dropout rate increases. This result suggests that batch correction can be improved by appropriately reducing the number of highly variable genes (Supplementary Fig. 3). To evaluate efficiency on large-scale datasets, we analyzed the runtime of various deep learning-based methods (Table 1). To assess the running time at a scale approaching one million cells, we integrated all collected datasets (n = 924,425 cells) for testing. The findings revealed that scCM can handle clustering tasks at this scale, with a running time of 1122.4 ± 72.7 s, demonstrating scCM’s capability in effectively handling clustering tasks for larger-scale datasets.
scCM provides promising clustering analysis across large-scale CNS datasets
We further investigated the generalization of scCM in CNS scRNA-seq analysis, considering various critical factors such as sample size, number of clusters, highly variable genes, species, and diseases. We first assessed the impact of sample size and cluster numbers across a wide range of dataset scales (ranging from 789 to 105,332 cells) and cluster numbers (ranging from 4 to 21) (Fig. 3a). Leiden clustering51 on scCM embeddings achieves outstanding clustering performance with Acc and ARI greater than 0.9, except on the Siddique dataset (ARI of 0.84). We then investigated the impact of the number of highly variable genes (HVGs) on clustering analysis in different CNS datasets (Fig. 3b and Supplementary Tables 3 and 4). Although a limited number of HVGs poses significant challenges in CNS cell representation learning, scCM clearly distinguishes clusters even with an extremely small number of HVGs (Supplementary Figs. 4 and 5). Analysis of variance results indicates that scCM’s performance is more stable across various datasets when using the top 3000 HVGs. Therefore, to facilitate training and comparison across multiple datasets, we set the top 3000 HVGs as the default parameter for our model. Furthermore, we examined the robustness of scCM across continents, species, and diseases (Fig. 3c). scCM maintains high performance in terms of Acc and ARI on CNS datasets from different continents, species, and diseases. These results highlight the strong generalization and robustness of scCM, which effectively captures the variations across diverse CNS cells, demonstrating its suitability for the species and CNS diseases examined in this study. In non-neural tissue datasets52, scCM also demonstrated favorable clustering outcomes.
Moreover, we examined whether scCM captures reliable cluster relationships, in which clusters with similar functions are projected close together in the embedding space. In the analysis of the Gonzalez dataset with five brain metastatic types (Fig. 3d), the same cancer subtypes are adjacent. Specifically, the three breast subtypes are located at the top, while the two ovarian subtypes and three melanoma subtypes are on the left and bottom, respectively. However, for the lung metastases, Lung1 is clearly separated from the other two non-small cell lung cancer clusters (Lung2 and Lung3) because Lung1 is a small cell carcinoma cluster. Furthermore, in Datta’s dataset, IDH-mutant oligodendrocytes are positioned closely to glioblastoma by scCM (Fig. 3e). Therefore, scCM can effectively represent functional similarity in the visualization results.
To leverage the ability of scCM to capture cluster relationships, we employed it to annotate the unknown cells in the Fournier data. When visualizing cell clusters in the Fournier data (Fig. 3f), we observed that neuroimmune-related cells tend to congregate in the upper-right region, while cells associated with substance circulation pathways are predominantly located in the lower-right region. The unknown cluster and neuron cells are located in the lower-left region. Thus, according to the spatial distribution, we can deduce that the unknown cluster is closely situated to astrocytes, neurons, and neuroblasts. Therefore, we hypothesized that the unknown cluster is likely a kind of glia with a repairing function similar to astrocytes. To verify this assumption, we conducted GO enrichment analysis and discovered that the genes expressed in the unknown cluster are related to neural injury repair, myelination, and neurotransmitter function. By comparing marker genes (Fig. 3f), we identified the unknown cluster as olfactory ensheathing cells, a specialized glial cell type that promotes axon growth, myelination, and neural nourishment53,54.
CNS cells integration for unveiling the relationship between cell types/subtypes and neurodegenerative diseases
Here, we employed scCM to integrate data from 4 neurodegenerative diseases (AD, HD, MS, and NHD) in 5 controlled human trials to comprehensively analyze cell types and states, subsequently revealing relationships between CNS cell types/subtypes and neurodegenerative diseases. To ensure data consistency, we first normalized and cleaned all datasets by removing cells with inaccurate annotation information and those with ambiguous or unspecified types. Additionally, cell types with fewer than 2000 cells were excluded. After preprocessing, approximately 280,000 cells remained, including oligodendrocytes (114,447 cells), neurons (113,086 cells), astrocytes (31,016 cells), oligodendrocyte precursor cells (11,784 cells), microglia (9747 cells), and endothelial cells (2688 cells). Finally, all cells were integrated by scCM, which corrected batch effects and divided them into clusters. The visualizations in Fig. 4a and Supplementary Fig. 6 demonstrate that scCM effectively removes batch effects and accurately aligns cells across datasets, despite differences in experimental protocols, sequencing technologies, and gene expression measurements.
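The exclusion of rare cell types described above can be sketched in a few lines of Python. This is only an illustration: the function name `filter_rare_cell_types` and the use of a plain list of labels are assumptions, and the real pipeline presumably operates on annotated data objects.

```python
from collections import Counter


def filter_rare_cell_types(labels, min_cells=2000):
    """Return indices of cells whose annotated type has at least `min_cells` cells.

    `labels` holds one cell-type label per cell. The 2000-cell default mirrors
    the threshold stated in the text; the function name is hypothetical.
    """
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_cells]
```

For instance, calling `filter_rare_cell_types(labels, min_cells=2)` on a toy label list keeps only the cells whose type occurs at least twice.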
Because neurodegenerative diseases are typically characterized by progressive heterogeneity of glial cells, we focused on a well-known type of glial cells, astrocytes, which have been implicated in various neurodegenerative diseases55,56. After eliminating non-astrocytic cells from the reference, astrocytes were divided into five subtypes with distinct percentages in 4 neurodegenerative diseases (Fig. 4b and Supplementary Fig. 7). Specifically, the Astro 0 subtype is the principal component in HD and is rarely found in AD and NHD (Fig. 4c). Moreover, based on the comparison of highly variable genes (Fig. 4d), Astro 0 exhibits high expression of IGF2R, a gene known to play a crucial role in cognitive processes such as learning and memory. Therefore, Astro 0 can be used to annotate the HD samples. Besides, both Astro 1 and 4 subtypes are found in all 4 neurodegenerative diseases, indicating that they play critical roles in nervous system development. However, Astro 1 is the principal component, comprising more than 80% in AD and NHD, demonstrating that the two diseases share gene expression profiles during CNS cell functional changes. This is consistent with the findings that AD and NHD have a close relationship40. In contrast, Astro 4 accounts for 60% of astrocytes in MS, while the remaining 40% consists of the other three astrocyte subtypes. Thus, the cell heterogeneity in MS is higher than in other neurodegenerative diseases, resulting in more complex gene expression profiles in MS. Notably, Astro 2 and 3 subtypes are exclusively found in MS samples. According to the analysis of highly expressed genes, Astro 2 is identified as a type of potential inflammatory astrocyte expressing up-regulated immune-related and apoptosis-related genes (HSPs, NAMPT, and TPST1). The extensive neuronal loss in neurodegenerative diseases is attributed to apoptosis, and emerging evidence suggests that dysregulated apoptosis may be involved in the pathogenesis of MS57,58.
Meanwhile, Astro 3 is enriched with genes related to neuron development, especially cilium function, where cilia are tiny microtubule-based signaling devices that regulate diverse physiological functions. Therefore, the development of MS is potentially associated with the heterogeneous transformation of astrocyte subtypes, with the emergence of inflammatory astrocytes being a cause for concern. Furthermore, we also observed heterogeneity among oligodendrocytes and microglia (Supplementary Fig. 8).
By using scCM, the integration of large-scale CNS datasets can facilitate comparative analysis across different CNS diseases, enabling the exploration of cellular heterogeneity under various physiological and pathological conditions, and contributing to a comprehensive exploration of CNS disease causality.
CNS cell annotation by a metadata reference
According to the above results, scCM demonstrates outstanding robustness in CNS cell clustering by bringing similar CNS cells closer and separating dissimilar ones. Therefore, we believe that scCM has the potential to serve as an annotation method to identify unknown cells. Unlabeled CNS cells are first mapped into the spatial distribution of a metadata reference and then annotated based on the nearest cell clusters in the reference.
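The idea of annotating a query cell by its nearest reference cluster can be illustrated with a nearest-centroid rule in embedding space. This is a minimal sketch rather than the annotation procedure actually used: representing each cluster by its mean embedding, using Euclidean distance, and rejecting cells beyond a `max_distance` threshold are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cluster_centroids(embeddings, labels):
    """Compute the mean embedding of each labeled reference cluster."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, x in enumerate(vec):
            acc[d] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def annotate(query, centroids, max_distance=None):
    """Return the label of the nearest centroid, or None if it is too far away.

    `max_distance` is a hypothetical rejection threshold for cell types that
    are absent from the reference; None disables rejection.
    """
    label, dist = min(
        ((lab, math.dist(query, c)) for lab, c in centroids.items()),
        key=lambda item: item[1],
    )
    if max_distance is not None and dist > max_distance:
        return None
    return label
```

In this sketch, a query embedding that lies far from every reference centroid is left unannotated rather than forced into the closest known type.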
Initially, we harnessed scCM to annotate Soreq’s AD data based on the integrated dataset above as a metadata reference. All cells in the Soreq data were manually identified according to specific marker genes (Fig. 5a), and all cell types are present in the metadata reference. We compared the annotation performance of scCM with MARS. It is obvious that scCM successfully aligns cell types in Soreq with the metadata reference, including the small cluster of endothelial cells (n = 381) (Fig. 5b). However, MARS struggles to effectively align Soreq’s data with metadata reference, making it challenging to classify and annotate subpopulations within the clusters. In particular, MARS erroneously assigns oligodendrocytes as annotations for endothelial cells, disregarding the distinct functional characteristics of these two cell clusters. These results indicate scCM provides a more similar spatial distribution between metadata reference and Soreq’s data. Furthermore, the annotation results in two confusion matrices demonstrate that scCM exhibits greater sensitivity in annotating small cell clusters.
To further assess the annotation performance of scCM in recognizing unobserved cell types, we then used the above integrated neurodegenerative disease dataset as a metadata reference to annotate the pituitary tumor data. The pituitary tumor samples are divided into 4 clusters in the metadata reference (Fig. 5c), and each cluster has distinct gene expression patterns (Supplementary Fig. 9). Specifically, cluster 4 lies close to endothelial cells in the metadata reference, suggesting that cluster 4 can be annotated as endothelial cells. This annotation is further corroborated by the marker gene expression of CLDN5 and ITM2A. The other three cell clusters are distinct from the reference, since their cell types are not observed in the metadata reference. Cluster 1 is distinct from the reference and annotated as corticotrophs based on the marker genes TBX19 and NEUROD1. Clusters 2 and 3 are close to microglia, which are intracranial immune cells responsible for regulating the maintenance of neuronal networks. Then, based on marker genes, clusters 2 and 3 are identified as macrophages and T cells, respectively. This result demonstrates that scCM is not only robust in rejecting annotation for unobserved cell types, but also offers relative spatial information of unobserved cell types in the reference space. In contrast, although MARS also has the potential to detect novel clusters, it lacks a reliable spatial distribution where similar CNS cells are in close proximity, which leads to incorrect spatial information for cell annotation (Fig. 5d). These results demonstrate that scCM efficiently annotates CNS cell types.
Discussion
We developed scCM, a self-supervised contrastive learning method utilizing the MoCoV3 framework, which offers improved performance and effectively integrates large-scale CNS scRNA-seq data, making it a valuable tool for researchers studying the complexity and heterogeneity of CNS cells. The momentum encoder and dynamic queue of MoCoV3 efficiently handle extensive sample volumes, maintaining a broad effective negative sample set, crucial for detecting heterogeneity in single-cell data. In contrast, Concerto’s SimCLR framework, while proficient at data representation learning, faces a limitation due to its inability to support very large batch sizes. Unlike the CLEAR model, which utilizes the MoCoV1 framework with a Memory Queue mechanism, our implementation of the MoCoV3 framework dispenses with this mechanism, allowing for large batch sizes during sample observation. It also incorporates elements from the BYOL framework, such as the projection head and prediction head, which significantly contribute to performance enhancement. The SCLSC that employs a supervised contrastive learning model framework is highly label-dependent, linking the model’s performance to label quality. The advantages of MoCoV3 make it more effective in processing large-scale CNS single-cell data.
scCM can capture the heterogeneity in CNS cells, enabling it to integrate large-scale CNS scRNA-seq data to unveil inter-cell connections based on spatial relationships. Experiments on 20 CNS datasets demonstrate that scCM maintains outstanding performance and generalization for clustering, handling dropout events, and correcting batch effects across datasets with varying numbers of genes, cells, clusters, species, and CNS diseases. Moreover, scCM can offer spatial interpretability of CNS cell clustering, which can help infer cell development or functional relationships. Taking advantage of these properties, we have successfully annotated the observed and unobserved CNS cell subtypes based on an integrated large-scale neurodegenerative data reference. We anticipate that our approach will have broader applications in the CNS field.
There are a few directions to be further investigated: (1) The evaluation of scCM was only conducted on CNS scRNA-seq data. Additional research is necessary to assess the applicability of scCM to other tissues. (2) A larger and more comprehensive reference can provide more reliable annotations for unknown cells. In future work, we plan to collect broader CNS datasets to increase the number of cells, especially focusing on the rare cell subtypes. (3) The limited diversity of cell types and confounding factors may reduce the accuracy of annotation. To solve this problem, we plan to create an expanded and higher-quality reference, specifically tailored to different neurologic disease contexts. (4) Analyzing cell subtypes requires individual analysis of each cell cluster, as it cannot be achieved in a single step.
Methods
Dataset collection
We systematically collected articles published within the past five years related to CNS scRNA-seq with data in the GEO database. The literature search strategy is based on the following keywords: (central nervous system OR brain OR cerebral OR neurodegenerative disease OR Alzheimer’s disease OR Huntington’s disease OR multiple sclerosis OR Parkinson’s disease OR dementia OR cerebral tumor OR brain metastases OR glioblastoma OR medulloblastoma OR pituitary adenoma OR traumatic brain injury) AND (single-cell RNA sequencing). After preprocessing, 18 CNS datasets with labeled cell annotations were obtained from the GEO database. Additionally, 2 unlabeled datasets were used to evaluate the cell annotation performance of scCM. One AD dataset was sourced from the GEO database, and another human brain tumor dataset was derived from neural neoplasm specimens, approved by the Ethics Review Committee of Chinese Academy of Medical Sciences and Peking Union Medical College.
Data preprocessing
Data processing includes filtering, preprocessing, normalization, and HVG selection. The preprocessing was conducted by SCANPY functions, as follows:
(a) To eliminate potential empty droplets, low-quality cells, and potential multiplets, cells with fewer than 300 or more than 6000 transcripts were excluded from the analysis. Furthermore, cells identified as being of poor quality, where mitochondrial gene expression accounted for more than 15% of the total counts, were excluded. (b) Gene expression measurements in each cell were first normalized to scale the total transcript counts to 10,000 per cell, a crucial step to address variations in sequencing depths across cells. After normalization, the data were transformed using the natural logarithm (“LogNormalize” method). (c) We used the top 3000 HVGs as the default training parameter. On one hand, using a fixed number of HVGs allows for consistent comparisons across different datasets59. On the other hand, after testing various numbers of HVGs, we found that the model yields more stable training results when the number of HVGs is set at 3000.
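Steps (a) and (b) can be sketched in plain Python on a small cells-by-genes count matrix. This is an illustration under stated assumptions, not the SCANPY-based pipeline itself: the function name `preprocess` is hypothetical, the 300 and 6000 thresholds are applied to total counts per cell, cells are dropped when the mitochondrial fraction exceeds 15% (the usual quality-control convention), and the log transform is taken as log(1 + x). HVG selection is not shown.

```python
import math


def preprocess(counts, mito_mask, min_counts=300, max_counts=6000,
               max_mito_frac=0.15, target_sum=10_000):
    """Filter cells by QC thresholds, then depth-normalize and log-transform.

    counts: list of per-cell lists of raw counts (cells x genes).
    mito_mask: list of booleans marking mitochondrial genes.
    Returns (kept_cell_indices, normalized_matrix).
    """
    kept, normalized = [], []
    for i, cell in enumerate(counts):
        total = sum(cell)
        if total < min_counts or total > max_counts:
            continue  # likely an empty droplet or a multiplet
        mito = sum(c for c, is_mito in zip(cell, mito_mask) if is_mito)
        if mito / total > max_mito_frac:
            continue  # high mitochondrial fraction suggests a damaged cell
        scale = target_sum / total  # rescale so each cell sums to target_sum
        normalized.append([math.log1p(c * scale) for c in cell])
        kept.append(i)
    return kept, normalized
```

After this step every retained cell has the same total before the log transform, so differences between cells reflect relative expression rather than sequencing depth.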
Dropout event
We simulated dropout events by randomly replacing gene expression values with zeros. This replacement occurs after selecting highly variable genes to avoid redundantly processing genes with zero expression values. This strategy ensures that dropout events mainly affect the expressed genes identified in the cells. The dropout percentage ranges from 0.05 to 0.50.
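A minimal sketch of this simulation is given below. The text does not state exactly how entries are chosen, so drawing positions uniformly from all entries of the HVG-restricted matrix is an assumption, as are the function name `simulate_dropout` and the fixed seed.

```python
import random


def simulate_dropout(matrix, rate, seed=0):
    """Return a copy of `matrix` with a fraction `rate` of its entries set to zero.

    matrix: cells x genes expression values, already restricted to HVGs.
    rate: dropout fraction, e.g. 0.05 to 0.50 as in the experiments.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = random.Random(seed)
    n_genes = len(matrix[0])
    positions = [(i, j) for i in range(len(matrix)) for j in range(n_genes)]
    n_drop = round(rate * len(positions))
    masked = [row[:] for row in matrix]  # leave the input untouched
    for i, j in rng.sample(positions, n_drop):
        masked[i][j] = 0.0
    return masked
```

Sampling positions without replacement guarantees that exactly the requested fraction of entries is zeroed, which makes results at different dropout rates directly comparable.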
The scCM framework
scCM is based on a contrastive learning framework, which is trained to differentiate similar and dissimilar pairs of cells, aiming to group cells with similar gene expression patterns while distinguishing cells with larger expression differences. The scCM workflow is outlined as follows:
Encoder architecture
Both the query encoder \({f}_{q}(x)\) and the key encoder \({f}_{k}(x)\) share the same architecture, forming a Siamese network implemented with two-layer fully connected networks. The number of input nodes matches the number of genes in the gene expression profile. The first layer has 1024 neurons with LayerNorm normalization and ReLU activation function. To mitigate overfitting, a Dropout module is added in subsequent layers with a dropout rate of 0.5. Finally, the output layer consists of 128-dimensional neurons. The query encoder (\({f}_{q}\)) is updated by standard backpropagation. The key encoder (\({f}_{k}\)) is updated with a momentum update35:
$${\theta }_{k}\leftarrow m{\theta }_{k}+(1-m){\theta }_{q}$$
where \({\theta }_{q}\) and \({\theta }_{k}\) denote the parameters of \({f}_{q}\) and \({f}_{k}\), respectively, and m ∈ [0, 1) is the momentum coefficient, regulating the update speed of the key encoder. A higher value of m results in slower updates of the key encoder, whereas a lower value of m drives the key encoder to resemble the query encoder more closely.
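The momentum update of the key encoder is an exponential moving average of the query encoder's weights, which can be sketched as follows. Representing the parameters as flat lists of floats and the function name `momentum_update` are simplifications for illustration; an actual implementation would apply the same rule to each weight tensor.

```python
def momentum_update(key_params, query_params, m=0.99):
    """Return updated key-encoder parameters: m * theta_k + (1 - m) * theta_q."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

With m = 0.99, each step moves a key parameter only 1% of the way toward the corresponding query parameter, while m = 0 would copy the query encoder outright.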
Predictor head
A predictor head is added downstream of the query encoder, consisting of two fully connected layers. The first layer has a hidden layer of 128-dimensional with a ReLU activation function. The output layer is without the ReLU activation function. The additional predictor network can effectively prevent collapse60.
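The encoder and predictor described above could be assembled in PyTorch roughly as follows. This is a sketch under stated assumptions rather than the released scCM code: the helper names `build_encoder` and `build_predictor` are hypothetical, the predictor output size (128) is not given in the text, and initializing the key encoder as a frozen copy of the query encoder follows the usual momentum-contrast convention.

```python
import torch
from torch import nn

def build_encoder(n_genes, hidden_dim=1024, out_dim=128, p_drop=0.5):
    """Two-layer fully connected encoder: genes -> 1024 -> 128."""
    return nn.Sequential(
        nn.Linear(n_genes, hidden_dim),
        nn.LayerNorm(hidden_dim),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden_dim, out_dim),
    )

def build_predictor(dim=128):
    """Predictor head: hidden layer with ReLU, output layer without activation.

    The output dimension is an assumption (set equal to the embedding size).
    """
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

n_genes = 3000  # matches the default number of HVGs
query_encoder = build_encoder(n_genes)
key_encoder = build_encoder(n_genes)
# Assumption: the key encoder starts as a copy of the query encoder and
# receives no gradients, because it is updated by momentum only.
key_encoder.load_state_dict(query_encoder.state_dict())
for param in key_encoder.parameters():
    param.requires_grad = False
predictor = build_predictor()

x = torch.rand(4, n_genes)          # a toy batch of 4 cells
q = predictor(query_encoder(x))     # query branch: encoder followed by predictor
with torch.no_grad():
    k = key_encoder(x)              # key branch: encoder only
```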
Loss function
The InfoNCE loss function is used to minimize the distance between positive pairs and maximize the distance between negative pairs. The inputs of the query encoder and the key encoder are labeled as a positive pair (q, k+) when they come from the same cell; otherwise, they are labeled as a negative pair (q, k−). The InfoNCE loss function is:
\({L}_{q}=-\log \frac{\exp (q\cdot {k}_{+}/\tau )}{\exp (q\cdot {k}_{+}/\tau )+{\sum }_{{k}_{-}}\exp (q\cdot {k}_{-}/\tau )}\)
where the temperature parameter (\(\tau\)) shapes the distribution61. Increasing \(\tau\) creates a smoother distribution. Setting \(\tau\) too small causes the model to concentrate solely on hard negative samples, leading to convergence issues and limited feature generalization.
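For illustration, the InfoNCE loss can be computed with a few lines of NumPy. The sketch assumes the standard formulation, in which each query is scored against one positive key and a shared set of negative keys through a softmax with temperature τ. The function `info_nce` and the L2 normalization of the embeddings are assumptions, not details taken from the scCM code.

```python
import numpy as np

def info_nce(q, k_pos, k_neg, tau=0.8):
    """Mean InfoNCE loss for queries q (N, D), positives k_pos (N, D), negatives k_neg (K, D)."""
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q, k_pos, k_neg = (l2_normalize(np.asarray(a, dtype=float)) for a in (q, k_pos, k_neg))
    pos = np.sum(q * k_pos, axis=1, keepdims=True) / tau  # (N, 1) positive logits
    neg = q @ k_neg.T / tau                               # (N, K) negative logits
    logits = np.concatenate([pos, neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    # Log-probability assigned to the positive key (column 0).
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())

# Toy example in two dimensions with one negative key.
aligned = info_nce([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]], tau=1.0)
misaligned = info_nce([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]], tau=1.0)
```

The loss is small when the query matches its positive key (`aligned`) and larger when it matches a negative key instead (`misaligned`).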
The parameters of the encoder are optimized using the Adam optimizer. The temperature parameter (\(\tau\)) is set to 0.8. A momentum coefficient (m) of 0.99 is employed for a slower update of the key encoder’s weights35. For scCM, the default learning rate is \(1\times {10}^{-6}\), and the batch size is 512. We standardized the batch size to 512 to facilitate performance comparison across different datasets. Our training process involves a warm-up period of 5 epochs to accelerate convergence, followed by a total of 10 epochs.
Clustering and UMAP visualization
We first performed k-means clustering in Python across multiple models for uniform comparison. To determine the optimal number of clusters (k), we employed the Elbow method, which involves plotting the explained variation as a function of the number of clusters and selecting the elbow of the curve as the number of clusters to use. We then calculated the ARI values for several values of k around this optimum and selected the k value corresponding to the highest ARI value, followed by calculating the other performance metrics62. The optimal number of clusters (k) can be found in Supplementary Table 2. The Leiden algorithm was employed to assist the scCM model in clustering analysis across multiple datasets (Supplementary Methods)4. Cell embeddings were visualized by UMAP using SCANPY (n_neighbors set to 15).
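The refinement of k around the elbow can be sketched with scikit-learn, assuming that k-means is run once per candidate k and the candidate with the highest ARI against the reference labels is kept. The helper `best_k_by_ari`, the synthetic blobs and the candidate list are illustrative assumptions; the paper does not state which k-means implementation was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def best_k_by_ari(embeddings, true_labels, candidate_ks, seed=0):
    """Run k-means for each candidate k and return the k with the highest ARI."""
    scores = {}
    for k in candidate_ks:
        pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        scores[k] = adjusted_rand_score(true_labels, pred)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy embeddings: three well-separated groups of 50 cells each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 50)
embeddings = centers[labels] + rng.normal(scale=0.5, size=(150, 2))

# Candidates around a hypothetical elbow at k = 3.
best_k, scores = best_k_by_ari(embeddings, labels, candidate_ks=[2, 3, 4])
```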
CNS cell integration and annotation
Integration
To integrate two single-cell RNA-seq datasets using Scanpy, we first ensured that both datasets were in the same format (e.g., AnnData) and then combined them into a single AnnData object. The data were preprocessed by filtering low-quality cells and genes, normalizing and log-transforming the expression values, and selecting 3000 HVGs. The preprocessed dataset was then imported into scCM, and training was initiated using the default parameter settings.
Annotation
After the training phase, we incorporated the cell annotation information from the metadata set into the trained embeddings. For cells not present in the metadata set, we filled in missing annotation information with ‘unknown’. This allowed the visualization of the embeddings to reflect different cell clusters. Based on the mapping relationships between cell clusters in the new dataset and the metadata set, we then annotated the different cell clusters using the Leiden algorithm.
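One simple way to realize the cluster-to-label mapping is a majority vote, sketched below in plain Python. This is an assumption made for illustration: the paper does not specify how clusters are mapped to reference annotations, and the function `annotate_clusters` is hypothetical.

```python
from collections import Counter

def annotate_clusters(cluster_ids, known_labels, unknown="unknown"):
    """Assign each cluster the most frequent known label among its cells.

    Cells without an annotation (None) count as `unknown` and do not vote.
    Clusters with no annotated cells stay `unknown`.
    """
    votes = {}
    for cluster, label in zip(cluster_ids, known_labels):
        label = unknown if label is None else label
        if label != unknown:
            votes.setdefault(cluster, Counter())[label] += 1
    cluster_to_label = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [cluster_to_label.get(c, unknown) for c in cluster_ids]

# Toy example: clusters 0 and 1 contain annotated reference cells, cluster 2 does not.
clusters = [0, 0, 0, 1, 1, 2]
reference = ["astrocyte", None, "astrocyte", "microglia", None, None]
annotated = annotate_clusters(clusters, reference)
```

In this toy case, unannotated cells in clusters 0 and 1 inherit the label of their annotated neighbours, while cluster 2 remains ‘unknown’.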
Evaluation metrics
The analysis utilized several metrics to assess the quality and effectiveness of the data clustering and batch effect adjustments. These metrics include:
Acc
Measure the proportion of correct predictions against the total number of cases examined, ranging from 0 to 1.
ARI
Evaluate the similarity between two clusterings, adjusting for chance and providing a measure that is normalized to show random clustering as zero, ranging from −1 to 1.
NMI
Assess the mutual dependence between the clustering results, normalized to account for the size of the clusters, facilitating comparison across different sets, ranging from 0 to 1.
kBET
Detect batch effects within the data by analyzing the local neighborhood of each point, thereby ensuring that similar samples are not batch-specific, ranging from 0 to 1.
SS
Measure how similar an object is to its own cluster compared to other clusters. This score helps to determine the appropriateness of the clustering by calculating the difference in distances between clusters, ranging from −1 to 1.
HS
Evaluate clusters that contain only members of a single class, reflecting the consistency of labeling within clusters, ranging from 0 to 1.
VMS
Combine both homogeneity and completeness of the clustering results into a single score to evaluate the success of the clustering against a ground truth classification, ranging from 0 to 1.
For detailed information on the specific methodologies and results of these metrics, please refer to the Supplementary Methods. To better compare the performance of different models, we normalized each metric and then calculated a composite index (CI), where \({x^{\prime} }_{i}\) is the normalized indicator value63.
This approach allowed us to generate a rank for each model that encompasses all evaluated metrics.
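As a worked example of one of these metrics, the ARI can be computed directly from pair counts. The sketch below uses the standard Hubert and Arabie definition and only the Python standard library; the function name `adjusted_rand_index` is hypothetical and the snippet is not the implementation used in the study.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    if n < 2:
        return 1.0
    # Pairs of cells grouped together in both labelings, in the reference, and in the prediction.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # labels renamed only
split_cluster = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])   # one cluster split in two
```

Renaming the cluster labels leaves the score at 1, whereas splitting one of the two clusters lowers it to 4/7.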
Running time
In this study, we focused solely on the statistical analysis of the model’s training time, including data loading time and training time. To evaluate the training time on large-scale datasets, we integrated the 18 datasets collected in this study into a single dataset, comprising approximately one million cells. Instead of specialized data integration methods, we standardized the data formats across all datasets and then performed a vertical merge to combine them. The data preprocessing was consistent with default parameters.
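A vertical merge of this kind can be sketched in NumPy as follows. The function `vertical_merge` is hypothetical, and restricting the merged matrix to the genes shared by all datasets is an assumption; the paper does not state how genes missing from some datasets were handled.

```python
import numpy as np

def vertical_merge(datasets):
    """Stack cells from several datasets over their shared genes.

    `datasets` is a list of (cells x genes matrix, list of gene names) pairs.
    Returns the merged matrix, the shared gene names and a per-cell dataset index.
    Assumption: only genes present in every dataset are kept.
    """
    shared = set(datasets[0][1])
    for _, genes in datasets[1:]:
        shared &= set(genes)
    shared_genes = sorted(shared)

    blocks, batch = [], []
    for i, (matrix, genes) in enumerate(datasets):
        column_of = {gene: j for j, gene in enumerate(genes)}
        idx = [column_of[gene] for gene in shared_genes]
        block = np.asarray(matrix)[:, idx]   # reorder columns to the shared gene order
        blocks.append(block)
        batch.extend([i] * block.shape[0])
    return np.vstack(blocks), shared_genes, np.array(batch)

# Toy example: two datasets with different gene sets and gene orders.
first = (np.array([[1, 2, 3]]), ["GFAP", "AQP4", "SNAP25"])
second = (np.array([[4, 5], [6, 7]]), ["SNAP25", "GFAP"])
merged, genes, batch = vertical_merge([first, second])
```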
Statistics and reproducibility
The scCM model was built using the PyTorch (version 2.2.1) framework in Python (version 3.9). Single-cell data preprocessing, analysis, and visualization were performed using Scanpy (version 1.9.8). Model testing included running time and performance metrics, with clustering performance metrics calculated using NumPy (version 1.26.4) and analyzed using means and standard deviations. Performance comparison charts were created using Python and GraphPad Prism (version 8).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The scRNA-seq data that support the findings of this study are available in GEO. The 19 datasets, together with their download links, are described in Supplementary Table 1. All other data or additional information are available from the corresponding author upon reasonable request.
Code availability
Our tool is open source and written in Python using the PyTorch framework. The source code with a reproducibility demo is available on GitHub at https://github.com/farry92/ScCM.git or on Zenodo64 at https://doi.org/10.5281/zenodo.13119941. Detailed guidance on using the code is available upon request from Y.F.
References
Wang, Z. et al. Single-cell transcriptomic analyses provide insights into the cellular origins and drivers of brain metastasis from lung adenocarcinoma. Neuro Oncol. 25, 1262–1274 (2023).
Grubman, A. et al. A single-cell atlas of entorhinal cortex from individuals with Alzheimer’s disease reveals cell-type-specific gene expression regulation. Nat. Neurosci. 22, 2087–2097 (2019).
Deleersnijder, D. et al. Current methodological challenges of single-cell and single-nucleus RNA-sequencing in glomerular diseases. J. Am. Soc. Nephrol. 32, 1838–1852 (2021).
Yang, M. et al. Contrastive learning enables rapid mapping to multimodal single-cell atlas of multimillion scale. Nat. Mach. Intell. 4, 696–709 (2022).
Lee, H. G., Lee, J. H., Flausino, L. E. & Quintana, F. J. Neuroinflammation: An astrocyte perspective. Sci. Transl. Med. 15, eadi7828 (2023).
Amor, S. et al. White matter microglia heterogeneity in the CNS. Acta Neuropathol. 143, 125–141 (2022).
Poulin, J. F., Gaertner, Z., Moreno-Ramos, O. A. & Awatramani, R. Classification of midbrain dopamine neurons using single-cell gene expression profiling approaches. Trends Neurosci. 43, 155–169 (2020).
Gimple, R. C., Yang, K., Halbert, M. E., Agnihotri, S. & Rich, J. N. Brain cancer stem cells: resilience through adaptive plasticity and hierarchical heterogeneity. Nat. Rev. Cancer 22, 497–514 (2022).
Wheeler, M. A. et al. MAFG-driven astrocytes promote CNS inflammation. Nature 578, 593–599 (2020).
Sanmarco, L. M. et al. Gut-licensed IFNγ(+) NK cells drive LAMP1(+)TRAIL(+) anti-inflammatory astrocytes. Nature 590, 473–479 (2021).
Koch, F. C., Sutton, G. J., Voineagu, I. & Vafaee, F. Supervised application of internal validation measures to benchmark dimensionality reduction methods in scRNA-seq data. Brief. Bioinform. 22, bbab304 (2021).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Li, X. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).
Brbić, M. et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat. Methods 17, 1200–1206 (2020).
Wang, J. et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat. Commun. 12, 1882 (2021).
Mathys, H. et al. Single-cell transcriptomic analysis of Alzheimer’s disease. Nature 570, 332–337 (2019).
Lau, S. F., Cao, H., Fu, A. K. Y. & Ip, N. Y. Single-nucleus transcriptome analysis reveals dysregulation of angiogenic endothelial cells and neuroprotective glia in Alzheimer’s disease. Proc. Natl Acad. Sci. USA 117, 25800–25809 (2020).
Zheng, J. et al. Single-cell RNA-seq analysis reveals compartment-specific heterogeneity and plasticity of microglia. iScience 24, 102186 (2021).
Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).
Nowogrodzki, A. How to build a human cell atlas. Nature 547, 24–26 (2017).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Spurgat, M. S. & Tang, S. J. Single-cell RNA-sequencing: astrocyte and microglial heterogeneity in health and disease. Cells 11, 2021 (2022).
Jäkel, S. et al. Altered human oligodendrocyte heterogeneity in multiple sclerosis. Nature 566, 543–547 (2019).
Qiu, P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 11, 1169 (2020).
Xu, J. et al. Evaluating the performance of dropout imputation and clustering methods for single-cell RNA sequencing data. Comput Biol. Med 146, 105697 (2022).
Zhang, L. & Zhang, S. Imputing single-cell RNA-seq data by considering cell heterogeneity and prior expression of dropouts. J. Mol. Cell Biol. 13, 29–40 (2021).
Masuda, T., Sankowski, R., Staszewski, O. & Prinz, M. Microglia heterogeneity in the single-cell Era. Cell Rep. 30, 1271–1281 (2020).
Berg, D. et al. Prodromal Parkinson disease subtypes - key to understanding heterogeneity. Nat. Rev. Neurol. 17, 349–361 (2021).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the International Conference on Machine Learning; Virtual, 1597–1607 (2020).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: IEEE (2020).
Han, W. et al. Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. Brief. Bioinform. 23, bbac377 (2022).
Ciortan, M. & Defrance, M. Contrastive self-supervised clustering of scRNA-seq data. BMC Bioinforma. 22, 280 (2021).
Bai, G. et al. Robust and rotation-equivariant contrastive learning. IEEE Trans. Neural Netw. Learn. Syst. 1–14 (2023).
Chen, X., Xie, S. & He, K. An Empirical study of training self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 9620–9629 (2021).
Zhang, C., Chen, R. & Zhang, Y. Accurate inference of genome-wide spatial expression with iSpatial. Sci. Adv. 8, eabq0990 (2022).
Farhy-Tselnicker, I. et al. Activity-dependent modulation of synapse-regulating genes in astrocytes. Elife 10, e70514 (2021).
Boghdadi, A. G. et al. NogoA-expressing astrocytes limit peripheral macrophage infiltration after ischemic brain injury in primates. Nat. Commun. 12, 6906 (2021).
Arneson, D. et al. Systems spatiotemporal dynamics of traumatic brain injury at single-cell resolution reveals humanin as a therapeutic target. Cell Mol. Life Sci. 79, 480 (2022).
Zhou, Y. et al. Human early-onset dementia caused by DAP12 deficiency reveals a unique signature of dysregulated microglia. Nat. Immunol. 24, 545–557 (2023).
Absinta, M. et al. A lymphocyte-microglia-astrocyte axis in chronic active multiple sclerosis. Nature 597, 709–714 (2021).
Lim, R. G. et al. Huntington disease oligodendrocyte maturation deficits revealed by single-nucleus RNAseq are rescued by thiamine-biotin supplementation. Nat. Commun. 13, 7791 (2022).
Fournier, A. P. et al. Single-cell transcriptomics identifies brain endothelium inflammatory networks in experimental autoimmune encephalomyelitis. Neurol. Neuroimmunol. Neuroinflamm. 10, (2023).
Anderson, A. G. et al. Single nucleus multiomics identifies ZEB1 and MAFB as candidate regulators of Alzheimer’s disease-specific cis-regulatory elements. Cell Genom. 3, 100263 (2023).
Lee, S. H. et al. TREM2-independent oligodendrocyte, astrocyte, and T cell responses to tau and amyloid pathology in mouse models of Alzheimer disease. Cell Rep. 37, 110158 (2021).
Riemondy, K. A. et al. Neoplastic and immune single-cell transcriptomics define subgroup-specific intra-tumoral heterogeneity of childhood medulloblastoma. Neuro Oncol. 24, 273–286 (2022).
LeBlanc, V. G. et al. Single-cell landscapes of primary glioblastomas and matched explants and cell lines show variable retention of inter- and intratumor heterogeneity. Cancer Cell 40, 379–392.e379 (2022).
Siddique, K., Ager-Wick, E., Fontaine, R., Weltzien, F. A. & Henkel, C. V. Characterization of hormone-producing cell types in the teleost pituitary gland using single-cell RNA-seq. Sci. Data 8, 279 (2021).
Gonzalez, H. et al. Cellular architecture of human brain metastases. Cell 185, 729–745.e720 (2022).
Soreq, L., Bird, H., Mohamed, W. & Hardy, J. Single-cell RNA sequencing analysis of human Alzheimer’s disease brain samples reveals neuronal and glial specific cells differential expression. PLoS One 18, e0277630 (2023).
Gadelha, M. & Wildemberg, L. E. Alternative approach to BIPSS in the differential diagnosis of ACTH-dependent Cushing’s syndrome. J. Clin. Endocrinol. Metab 109, e1460-e1461 (2023).
Jones, R. C. et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Su, Z. & He, C. Olfactory ensheathing cells: biology in neural development and regeneration. Prog. Neurobiol. 92, 517–532 (2010).
Ursavas, S., Darici, H. & Karaoz, E. Olfactory ensheathing cells: Unique glial cells promising for treatments of spinal cord injury. J. Neurosci. Res. 99, 1579–1597 (2021).
Carter, S. F. et al. Astrocyte Biomarkers in Alzheimer’s Disease. Trends Mol. Med. 25, 77–95 (2019).
Preman, P., Alfonso-Triguero, M., Alberdi, E., Verkhratsky, A. & Arranz, A. M. Astrocytes in Alzheimer’s disease: pathological significance and molecular pathways. Cells 10, 540 (2021).
Sharma, V. K., Singh, T. G., Singh, S., Garg, N. & Dhiman, S. Apoptotic Pathways and Alzheimer’s disease: probing therapeutic potential. Neurochem Res. 46, 3103–3122 (2021).
Mohamed, D. A. W., Selim, H. M., Elmazny, A., Genena, A. & Nabil, M. M. Apoptotic protease activating factor-1 gene and MicroRNA-484: A possible interplay in relapsing remitting multiple sclerosis. Mult. Scler. Relat. Disord. 58, 103502 (2022).
Li, W. et al. scMHNN: a novel hypergraph neural network for integrative analysis of single-cell epigenomic, transcriptomic and proteomic data. Brief. Bioinform 24, bbad391 (2023).
Grill, J.-B. et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv:2006.07733 (2020).
Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. Big Self-Supervised Models are Strong Semi-Supervised Learners. Proceedings of the 34th International Conference on Neural Information Processing Systems. 22243–22255 (2020).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Mazziotta, M. & Pareto, A. Methods for constructing composite indices: One for all or all for one? Riv. Ital. di Economia, Demogr. e Stat. 67, 67–80 (2013).
Fang, Y. scCM v1.0.0 (2024). https://doi.org/10.5281/zenodo.13119941.
Acknowledgements
This work was supported by the National Key R&D Program of China (2021YFE0114300), National Natural Science Foundation of China (82171475, 82103302, 62102118), Shenzhen Science and Technology Program (JCYJ20230807094318038), the Shenzhen Colleges and Universities Stable Support Program (GXWD20220811170504001), CAMS Innovation Fund for Medical Sciences (2023-I2M-C&T-B-008), National High Level Hospital Clinical Research Funding (2022-PUMCH-C-012) and Key-Area Research and Development Program of Guangdong Province (2021B0101420005).
Author information
Contributions
M.F., B.H., and R.W. designed and supervised the project. Y.F., J.C., and H.W. contributed equally to this work by conducting research, performing analyses, and contributing to the experiments. Y.F. wrote and revised the manuscript. M.Q. collected the clinical samples and associated information. S.W., Q.C., and Q.S. contributed to the manuscript. L.X. drew the diagrams. The manuscript was reviewed by all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
The study was approved by Chinese Academy of Medical Sciences and Peking Union Medical College (S-K477).
Peer review
Peer review information
Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: George Inglis and Aylin Bircan. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Fang, Y., Chen, J., Wang, H. et al. Integrating large-scale single-cell RNA sequencing in central nervous system disease using self-supervised contrastive learning. Commun Biol 7, 1107 (2024). https://doi.org/10.1038/s42003-024-06813-2
DOI: https://doi.org/10.1038/s42003-024-06813-2