scBridge embraces cell heterogeneity in single-cell RNA-seq and ATAC-seq data integration

Single-cell multi-omics data integration aims to reduce the omics difference while keeping the cell type difference. However, it is daunting to model and distinguish the two differences due to cell heterogeneity. Namely, even cells of the same omics and type would have various features, making the two differences less significant. In this work, we reveal that instead of being an interference, cell heterogeneity could be exploited to improve data integration. Specifically, we observe that the omics difference varies in cells, and cells with smaller omics differences are easier to be integrated. Hence, unlike most existing works that homogeneously treat and integrate all cells, we propose a multi-omics data integration method (dubbed scBridge) that integrates cells in a heterogeneous manner. In brief, scBridge iterates between i) identifying reliable scATAC-seq cells that have smaller omics differences, and ii) integrating reliable scATAC-seq cells with scRNA-seq data to narrow the omics gap, thus benefiting the integration for the rest cells. Extensive experiments on seven multi-omics datasets demonstrate the superiority of scBridge compared with six representative baselines.

Single-cell RNA sequencing (scRNA-seq) 1 has been widely used and has driven great progress in biology and medicine. Recently, advances in single-cell technologies have enabled profiling single cells from different layers, such as chromatin accessibility (scATAC-seq) 2,3 , spatial transcriptome (Stereo-seq) 4 , and proteome (ScoPE-MS) 5 . Integrating diverse omics data provides a chance to reconstruct a comprehensive molecular regulation network and to promote the development of precision medicine. In particular, scATAC-seq studies the physical structure of the genome by identifying open chromatin regions, while the dynamic remodeling of chromatin structure is one of the main mechanisms that affect transcription. Thus, the integration of scATAC-seq and scRNA-seq allows not only observing the differences at the transcriptional level but also understanding the reasons behind the differences from an epigenetic perspective 6 .
Several methods have been proposed to integrate transcriptomic data [7][8][9][10][11][12][13][14] , which could also be used for multi-omics data integration. However, suboptimal results would be achieved by directly applying those transcriptomic-oriented methods to integrate scRNA-seq and scATAC-seq, because the data distribution and sparsity level are vastly different across omics 15 . To address this issue, some efforts have been devoted to multi-omics data integration. Specifically, with the cross-omics pairing information, scAI 16 and MOFA+ 17 could perform joint integration and clustering on multi-omics data. However, it is daunting to obtain such pairing information due to the prohibitive cost of multi-omics sequencing techniques 18,19 . As a remedy, the focus of the community has shifted to the scenario wherein the scRNA-seq and scATAC-seq data are sequenced independently, i.e., unpaired. To integrate unpaired multi-omics data, the most common paradigm is first independently conducting feature extraction in each omics, and then reducing the omics difference in features through manifold alignment 20,21 , mutual nearest neighbor (MNN) correction 22 , graph linking 23,24 , and adversarial training 25 . A more straightforward solution is explicitly modeling the omics difference as a factor in matrix factorization 26,27 . Recently, considering the abundant annotated scRNA-seq data, scJoint 15 proposes to integrate multi-omics data under the semi-supervised learning paradigm.
Although various methods have been developed and achieved remarkable progress, most of them overlook the role of cell heterogeneity, not to mention exploiting it in the integration. To be specific, the objective of multi-omics integration is to reduce the omics difference while keeping the cell-type difference. However, due to the cell heterogeneity, even the cells of the same omics and type would have non-negligible variances which would make omics and cell-type differences less significant. As a result, it is daunting to model and distinguish the two differences, leading to suboptimal integration results. On the one hand, when the cell-type difference is falsely treated as the omics difference and accordingly eliminated, the cells of different types would be integrated, leading to the over-integration problem. On the other hand, when the omics difference is falsely treated as the cell-type difference and insufficiently reduced, the cells of different omics would not be well mixed, leading to the under-integration problem.
Here, we reveal that instead of being an interference, the cell heterogeneity could be exploited to facilitate data integration based on the following observation. Specifically, the chromatin accessibility of scATAC-seq cells exhibits variable correlations with gene expression of scRNA-seq 28,29 . scATAC-seq cells with higher positive correlation exhibit smaller omics differences, which are easier to integrate and could bridge the modality gap between the two omics. According to the observation, we designed scBridge, a heterogeneous transfer learning method for multi-omics data integration. Briefly, scBridge first warms up a deep neural classifier with the annotated scRNA-seq data, and then identifies the scATAC-seq cells with smaller omics differences through reliability modeling. After that, the reliable scATAC-seq cells are integrated with scRNA-seq cells through cross-omics prototype alignment. Lastly, scBridge selects and merges the most reliable scATAC-seq cells into the annotated scRNA-seq data to narrow the omics gap. By repeating the above processes, the omics difference would be gradually reduced, and more cells would be integrated, leading to the final integration result.
We evaluate the data integration performance of scBridge on seven multi-omics datasets in terms of joint embedding quality and label transfer accuracy. Extensive experimental results illustrate the superiority of scBridge in data integration compared with scJoint 15 (semi-supervised), Seurat 22 (MNN-based), Portal 25 (adversarial), Harmony 14 (transcriptomic-oriented), GLUE 24 and Conos 23 (graph-based). Furthermore, the empirical evaluations show that scBridge is robust against the number and quality of scRNA-seq annotations, the inconsistency between scRNA-seq and scATAC-seq cell types, and technical noises in sequencing data.

Results
The scBridge algorithm
scBridge is a semi-supervised method that integrates the annotated scRNA-seq data and the unlabeled scATAC-seq data in a heterogeneous transfer learning manner. As illustrated in Fig. 1, scBridge passes the data into a deep neural encoder and a classifier to achieve data integration and label transfer with the help of a reliability modeling module (Overview a). To be specific, scBridge first warms up the networks using the annotated data (Step b). After that, the networks are transferred to scATAC-seq data. However, such a vanilla transfer paradigm would misclassify scATAC-seq cells due to the modality gap between RNA and ATAC omics. Hence, to integrate the cells of different types correctly, heterogeneous transfer learning is proposed by utilizing cell heterogeneity. Specifically, a portion of scATAC-seq cells exhibits smaller omics differences with scRNA-seq cells as their chromatin accessibility has higher positive correlations with gene expression. Consequently, the classification results on those scATAC-seq cells are more reliable. To estimate the reliability of each scATAC-seq cell, scBridge models the discriminability and confidence of scATAC-seq cells with a Gaussian Mixture (Step c). To be specific, scBridge computes the discriminability of each scATAC-seq cell based on its distance to the RNA prototypes (computed by averaging scRNA-seq cells of different types), as well as the confidence based on its classification loss value. Cells with higher discriminability and confidence are considered more reliable. With the estimated cell reliability, scBridge computes the ATAC prototypes as the weighted average of scATAC-seq cells with the same predicted cell type and aligns them with the RNA prototypes to achieve integration (Step d). Lastly, scBridge selects the most reliable scATAC-seq cells and merges them into the annotated data, with labels given by the current classification results (Step e). The selected scATAC-seq cells could act as a bridge to reduce the modality gap between RNA and ATAC omics. By repeating steps b to e, scBridge takes a from-easy-to-hard learning fashion to further identify and integrate the remaining scATAC-seq cells that have more distinct features with scRNA-seq cells.
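As an illustration of Step d, the following minimal sketch computes reliability-weighted ATAC prototypes and measures how far they are from the RNA prototypes. It is not the authors' implementation: the function names, the use of plain Python lists for embeddings, and the choice of averaging one minus the cosine similarity over shared cell types are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_prototypes(embeddings, predicted_types, reliability):
    """Reliability-weighted mean embedding for each predicted cell type."""
    sums, totals = {}, {}
    for emb, cell_type, weight in zip(embeddings, predicted_types, reliability):
        acc = sums.setdefault(cell_type, [0.0] * len(emb))
        for j, value in enumerate(emb):
            acc[j] += weight * value
        totals[cell_type] = totals.get(cell_type, 0.0) + weight
    # Types whose cells all have zero reliability get no prototype.
    return {k: [x / totals[k] for x in acc] for k, acc in sums.items() if totals[k] > 0}

def alignment_distance(rna_prototypes, atac_prototypes):
    """Mean cosine distance over the cell types present in both omics (assumed form)."""
    shared = sorted(set(rna_prototypes) & set(atac_prototypes))
    if not shared:
        raise ValueError("no cell type is shared between the two prototype sets")
    return sum(1.0 - cosine(rna_prototypes[k], atac_prototypes[k]) for k in shared) / len(shared)
```

Because unreliable cells receive small weights, a misclassified cell contributes little to the prototype of its wrongly predicted type, which is the intuition behind using reliability as the weight.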

scBridge achieves promising integration results on the golden benchmarks
To evaluate the integration performance of scBridge, we first applied it to three golden benchmarks including the SNARE-seq dataset of mouse brain cortex 18 , the SHARE-seq dataset of human bone marrow 30 , and the 10x Multiome dataset of mouse kidney 31 . As these three sequencing techniques could link the cell's transcriptome with its accessible chromatin, the pairing information provides a golden criterion to validate the integration performance. Notably, the pairing information was not used during integration, but only for validation. Moreover, these three datasets cover three different tissues and two species, which also evaluates the generalization ability of the methods.
To intuitively show how scBridge iteratively integrates scRNA-seq and scATAC-seq data through heterogeneous transfer learning, we visualized the integration process on the SNARE-seq dataset in Fig. 2c. To be specific, the right figure shows the Pearson correlation score (computed on all genes) between scRNA-seq cells and the selected reliable scATAC-seq cells, where larger scores denote smaller omics differences between scRNA-seq and scATAC-seq cells. As shown, scBridge first integrates scATAC-seq cells that are most similar to scRNA-seq cells and gradually integrates more distinct ones in the subsequent iterations (t-test p-value < 1e-3 in the first five iterations, with the Pearson correlation score decreasing significantly). Such a trend also holds in different types of cells as illustrated in Supplementary Fig. 1a. Here, we took Ex-L2/3-Rasgrf2 cells of scATAC-seq as an example to demonstrate various cell correlation levels across omics. Supplementary Fig. 1b shows a decrease in the Rasgrf2 gene activity as the model iterates. Meanwhile, Supplementary Fig. 1c demonstrates that the discrepancy between Rasgrf2 gene activity and gene expression increases as integration proceeds, consistent with the results in Supplementary Fig. 1a. The left figure in Fig. 2c demonstrates the reliable cell selection and overall label transfer accuracy across iterations. In brief, after the first iteration, scBridge achieves 60.11% label transfer accuracy for all scATAC-seq cells. Based on the Gaussian Mixture Model, 1898 scATAC-seq cells are selected as the annotated data with an accuracy of 90.89%. By using those reliable scATAC-seq cells to bridge RNA and ATAC omics, scBridge achieves better integration results (63.62% label transfer accuracy) in the second iteration. As the training proceeds, more scATAC-seq cells are selected as reliable by scBridge, and the label transfer accuracy steadily grows to 71.95%. Figure 2a and Supplementary Fig. 2a illustrate the final data integration results achieved by scBridge and six baseline methods. As shown, though all seven methods successfully mix scRNA-seq and scATAC-seq cells, scBridge and scJoint achieve more discriminative cell clusters compared with other baselines. In some clusters, however, scJoint falsely integrates cells with different types, leading to inferior label transfer performance. To further validate the superiority of scBridge, Fig. 2d and Supplementary Fig. 2b visualize the confusion matrix of the transferred labels. The results show that scBridge discriminates the cells of different types more accurately compared with all baselines. For example, scJoint fails to separate Claustrum, Mic, and OPC cells, whereas scBridge achieves almost perfect label transfer on them. By using the silhouette score and label transfer accuracy to quantitatively evaluate the integration results, Fig. 2b shows that scBridge achieves the highest harmonized silhouette score, indicating its superiority in the removal of omics difference and the preservation of cell-type difference. We also noticed that scBridge achieves a more precise integration for the rare cell types, i.e., a significant improvement on the weighted F1-score in label transfer (42.26% by scBridge compared with 22.12% by scJoint).
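The Pearson correlation score used above to track the omics difference can be sketched as follows. This is only an illustrative computation, assuming that each scATAC-seq gene activity vector is compared with a matched scRNA-seq expression vector over the same genes; the function names and the pairing scheme are assumptions, since the exact procedure is not spelled out in this section.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def mean_cross_omics_correlation(atac_activity, rna_expression):
    """Average per-cell correlation over matched ATAC/RNA profiles (assumed pairing)."""
    scores = [pearson(a, r) for a, r in zip(atac_activity, rna_expression)]
    return sum(scores) / len(scores)
```

Under this reading, a batch of selected cells with a higher mean score has a smaller omics difference, so the score would be expected to fall as later iterations admit harder cells.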
We further visualized the joint embeddings obtained by scBridge and scJoint on the SHARE-seq and 10x Multiome datasets in Fig. 2e-f. On the SHARE-seq dataset, scBridge achieves better cell grouping than scJoint, especially for the rare types like Baso. On the 10x Multiome dataset, scBridge successfully mixes scRNA-seq and scATAC-seq cells, while scJoint fails to eliminate the gap between the two modalities. The UMAP visualizations, label transfer matrix, and quantitative metrics of scBridge and all other baselines in Supplementary Figs. 3-4 demonstrate the superior performance of scBridge in data integration and label transfer.
Finally, as the heterogeneous transfer learning paradigm of scBridge requires the annotated scRNA-seq data, a natural question is how many annotated scRNA-seq cells are needed for accurate integration. To answer this question, we evaluated the robustness of scBridge against the number of annotations on the three golden benchmarks, compared with four baselines that support label transfer. Specifically, we carried out experiments by using 100%, 75%, 50%, and 25% of scRNA-seq data. Figure 2g shows that scBridge achieves the best label transfer accuracy and F1-score under all downsample rates on three benchmarks. Notably, on the 10x Multiome dataset, scBridge maintains a high average F1-score of 77.08% with only 25% annotated scRNA-seq cells compared with 77.35% on full data. In contrast, scJoint encounters a significant performance drop in average F1-score, i.e., from 73.22% on full data to 59.36% on 25% downsampled data (t-test p-value = 1.94e-5, degrees of freedom = 8, 95% confidence interval = [0.103, 0.174]). In addition, scBridge with only 50% scRNA-seq annotations outperforms all baselines with full data on the SNARE-seq dataset. Such a data-efficient property of scBridge could be attributed to its heterogeneous transfer learning paradigm. Namely, as long as the annotated scRNA-seq data is enough for identifying a portion of reliable scATAC-seq data, scBridge could progressively integrate the remaining cells.

scBridge scales to atlas data
With the development of sequencing techniques, the number of cells profiled with various protocols grows continually, arousing the demand for efficiently handling large-scale data. To assess how scBridge scales to large data, we evaluated it on the mouse atlas dataset. Specifically, we used the cells sequenced with FACS and droplet protocols provided by Tabula-Muris 32 as scRNA-seq data, and the cells sequenced by Cusanovich et al. 33 as scATAC-seq data. After data preprocessing, 102,103 cells from 18 common types are selected for evaluation.

Fig. 1 a The input of scBridge is composed of the annotated scRNA-seq data and unlabeled scATAC-seq data. scBridge passes them into a deep neural encoder and a classifier to achieve data integration and label transfer iteratively. The main steps of scBridge are elaborated as follows. b scBridge warms up the deep neural encoder and classifier with the annotated scRNA-seq data and accordingly obtains the initial embedding and cell-type prediction for scRNA-seq and scATAC-seq cells. c scBridge models the reliability of each scATAC-seq cell based on its embedding discriminability and classification loss. d Based on the estimated reliability, scBridge computes the ATAC prototypes as the weighted average of scATAC-seq cells and aligns them with the RNA prototypes for integration. e At the end of each iteration, scBridge selects the most reliable scATAC-seq cells and merges them into the annotated data, with labels given by the current predictions. scBridge repeats steps b to e until convergence so that more and more ATAC cells would be integrated, leading to the final data integration and label transfer results.
To investigate the computation efficiency of scBridge, we applied it to five subsets of mouse atlas with 5,000-80,000 cells. Figure 3c shows the (logged) running time and memory consumption of all tested methods with respect to different cell numbers. As shown, scBridge takes linearly increasing running time (the third-best) and constant memory consumption (the second-best), which is favorable in scaling to large data.
Figure panel labels: scJoint (Accuracy: 66.00%, F1 Score: 22.12%); Seurat (Accuracy: 56.24%, F1 Score: 20.70%).

Beyond the scalability and efficiency of scBridge, we also evaluated its effectiveness on the full mouse atlas data. As visualized in Fig. 3a and Supplementary Fig. 5a, scBridge achieves better grouping of cells by types and mixing of cells by omics. The superiority of scBridge is also verified according to the cell type and omics silhouette score in Fig. 3b. To evaluate the label transfer performance, we illustrated the label transfer matrix in Fig. 3e and Supplementary Fig. 5b. As shown, scBridge (89.66% F1-score) achieves a clearer diagonal label transfer matrix compared with scJoint (64.60% F1-score) and Seurat (44.64% F1-score), which indicates a more precise cell integration. Additionally, we observed that Hematopoietic stem cells (HSC) were annotated as HSC, Macrophage, and Monocyte by scBridge, but only Macrophage by scJoint. To explore such a difference, we computed the activity score of the marker genes Cd34, Cd68, Tgfbi for the three types in Fig. 3d. The results verify that scBridge makes more reasonable and accurate cell-type predictions than scJoint. Notably, despite the immense differences between FACS and droplet data (the FACS method captures fewer cells but detects more molecules per cell than the microfluidic-droplet method), scBridge does not require any preintegration of these two batches of data, which demonstrates its capacity to handle data with batch effects. Note that not all results agree with those reported in the scJoint paper 15 due to the slight differences in data preprocessing.

scBridge handles data with inconsistent cell type across omics
In the above experiments, the cell type is consistent across scRNA-seq and scATAC-seq data. In practice, however, such consistency does not always hold. Hence, it is important to explore how data integration methods behave when cell types do not match across omics. For this purpose, we evaluated scBridge on the human myocardial infarction data 34 , which consists of 67,360 scRNA-seq cells from 11 types, and 46,086 scATAC-seq cells from 8 types after preprocessing. More specifically, Mast, Adipocyte, and Cycling cells are only observed in scRNA-seq data. Experimental results in Supplementary Fig. 7 show that scBridge not only learns the joint embedding with better cell-type grouping and omics mixing but also achieves the best performance in all five data integration and label transfer metrics.
Furthermore, we conducted a more challenging evaluation by manually removing the Myeloid cells from scRNA-seq data. In other words, there are only 7 cell types in common for scRNA-seq and scATAC-seq data, and both of them have unique cell types. The UMAP visualizations in Fig. 4a and Supplementary Fig. 6a illustrate that scJoint fails to integrate the cells from different omics, and other methods achieve less distinct partition of cells with different types compared with scBridge. According to the label transfer matrix in Fig. 4d, scBridge and Seurat transfer fewer scATAC-seq cells of common types to the three unique types in scRNA-seq data than GLUE and Conos, and scBridge achieves more precise label transfer results among the seven common types. Next, we focused on the integration results for Myeloid cells in scATAC-seq data, which are novel with respect to the annotations in scRNA-seq data. Equipped with the structure preservation loss, scBridge assigns a relatively low confidence score for scATAC-seq Myeloid cells as shown in Fig. 4b. To identify cells of novel types, instead of manually setting a confidence threshold, we proposed a data-driven strategy by fitting the confidence score of all cells with a two-component GMM. As shown in Fig. 4b, the confidence threshold is estimated by the intersection of two probability density functions (PDF). In other words, cells belonging to the less confident GMM component are considered novel. According to the novel type identification performance shown in Fig. 4c and Supplementary Fig. 6d, scBridge gives a more distinct pattern between cells of common and novel types, leading to the highest F1-score for novel type discovery. The superiority of scBridge is also verified by the label transfer matrix in Fig. 4d, namely, it assigns fewer cells of common types as novel.
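The data-driven threshold described above can be sketched with a small one-dimensional EM routine. This is a minimal illustration rather than the authors' code: the function names, the midrange initialization, the fixed iteration count, and the use of mixing-weight-scaled densities when locating the intersection are all assumptions, and the input is assumed to contain at least two distinct values.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_two_component_gmm(values, n_iter=100):
    """Fit a 1-D two-component GMM by EM; returns (weight, mean, std) tuples sorted by mean."""
    mid = 0.5 * (min(values) + max(values))
    # Initialise the components from the values below and above the midrange.
    params = []
    for part in ([x for x in values if x <= mid], [x for x in values if x > mid]):
        mean = sum(part) / len(part)
        var = sum((x - mean) ** 2 for x in part) / len(part)
        params.append([len(part) / len(values), mean, max(math.sqrt(var), 1e-3)])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and standard deviation.
        for c in range(2):
            nk = sum(r[c] for r in resp)
            if nk < 1e-12:
                continue
            mean = sum(r[c] * x for r, x in zip(resp, values)) / nk
            var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, values)) / nk
            params[c] = [nk / len(values), mean, max(math.sqrt(var), 1e-3)]
    return sorted((tuple(p) for p in params), key=lambda p: p[1])

def intersection_threshold(low, high, steps=10000):
    """Scan from the lower mean to the higher mean for the point where the densities cross."""
    (w1, m1, s1), (w2, m2, s2) = low, high
    previous = m1
    for i in range(1, steps + 1):
        x = m1 + (m2 - m1) * i / steps
        if w1 * normal_pdf(x, m1, s1) <= w2 * normal_pdf(x, m2, s2):
            return 0.5 * (previous + x)
        previous = x
    return m2
```

Cells whose confidence score falls below the returned threshold would then be flagged as candidates for a novel type, which avoids hand-tuning a cutoff per dataset.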
scBridge is robust to the dropout technical noise in sequencing data
In single-cell sequencing studies, it is inevitable to introduce technical noises in sequencing data due to biological and technical limitations such as amplification bias, low starting mRNA amount, and sequencing depth. For example, a typical technical noise is the dropout event, where some entries in the gene expression or activity matrices are false zeros 35 . Accordingly, the data would be contaminated with considerable non-biological variances, hurting the data integration performance.
To investigate the robustness of scBridge against the dropout technical noise, we applied it to the human hematopoiesis data which contains 34,609 scRNA-seq and 33,819 scATAC-seq cells from 23 common types. To simulate the dropout events, we downsampled the scRNA-seq count matrix, scATAC-seq activity matrix, and scATAC-seq peak-by-cell matrix by 25%, 50%, and 75%, respectively, with the downsampleMatrix function provided in the scuttle R package 36 . As shown in Fig. 5c, scBridge achieves superior robustness towards the scRNA-seq data quality. Namely, its integration and label transfer performances are almost unaffected at up to a 75% dropout rate on scRNA-seq data. By comparison, though GLUE achieves higher label transfer accuracy than scBridge on the original data, its performance becomes worse and unstable on data contaminated with dropout noises. Similarly, scJoint achieves a comparable silhouette score with scBridge, but encounters prominent performance reduction as the dropout rate increases. Figure 5a, b, and Supplementary Fig. 8 demonstrate the superiority of scBridge over six baselines in data integration and label transfer. Likewise, scBridge also achieves better performance on the corrupted scATAC-seq data, especially under high dropout rates as shown in Fig. 5c. Such robustness of scBridge could be attributed to its iterative and heterogeneous integration paradigm. Namely, even if the sequencing data is of low capture rate, scBridge could still identify a portion of reliable scATAC-seq data, which further helps the model to integrate the remaining cells. Note that some results do not exactly match those reported in the scJoint paper 15 due to the differences in data preprocessing and the added dropout corruption.
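To make the dropout simulation concrete, the sketch below thins a count matrix in Python. It is a stand-in under stated assumptions and not the scuttle procedure itself: each count is thinned independently with a binomial draw, the function name `simulate_dropout` is hypothetical, and a dropout rate of 25% is interpreted here as keeping each molecule with probability 0.75.

```python
import random

def simulate_dropout(counts, dropout_rate, seed=0):
    """Binomially thin every entry of a count matrix given as a list of rows."""
    rng = random.Random(seed)
    keep_prob = 1.0 - dropout_rate
    thinned = []
    for row in counts:
        # Each of the `count` molecules survives independently with probability keep_prob.
        thinned.append([sum(1 for _ in range(count) if rng.random() < keep_prob)
                        for count in row])
    return thinned
```

Thinning can only lower counts, so entries that were zero stay zero and small counts are the most likely to become false zeros, mimicking the dropout event described above.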
scBridge is robust to noisy labels in scRNA-seq data annotation
Cell-type annotation is challenged by incomplete messenger RNA detection, a lack of curated marker gene lists, improper handling of batch effects, and difficulties in leveraging the latent gene-gene interaction information 37 . It is inevitable to introduce some noisy labels during the manual or automatic annotation. As scBridge requires annotated scRNA-seq data, it is highly desirable that scBridge be robust against noisy labels.
We conducted evaluations on the multi-modal PBMC data 38 consisting of 4644 and 4157 cells of seven types from scRNA-seq and scATAC-seq data, respectively. As shown in Fig. 6b, c, scBridge, scJoint, Seurat, and GLUE achieve promising results when the annotations in scRNA-seq data are accurate. Specifically, scBridge achieves the best label transfer accuracy and F1-score, outperforming the second-best method scJoint by about 5% on average with t-test p-value ≤ 1e-3 in the two metrics. To simulate label corruptions, we randomly shuffled 5%, 10%, and 20% of annotations in the scRNA-seq data. According to the results in Fig. 6c, scJoint performs worse than Seurat and GLUE even under a small label corruption rate of 5%. In contrast, scBridge outperforms them under all corruption rates (t-test p-value = 2e-6, 8e-4, and 0.1 under rates 5%, 10%, and 20% respectively in F1-score compared with Seurat), demonstrating its ability to handle data with label corruptions.
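The label corruption protocol can be illustrated with the short sketch below. It assumes that "shuffling" means picking a random subset of cells at the given rate and permuting the labels among them; the function name and the returned index list are illustrative choices, and the exact protocol used in the experiments may differ.

```python
import random

def shuffle_labels(labels, rate, seed=0):
    """Permute the labels of a random subset of cells; returns (noisy_labels, chosen_indices)."""
    rng = random.Random(seed)
    n_corrupt = round(rate * len(labels))
    chosen = rng.sample(range(len(labels)), n_corrupt)
    picked = [labels[i] for i in chosen]
    rng.shuffle(picked)  # permute labels within the chosen subset only
    noisy = list(labels)
    for index, label in zip(chosen, picked):
        noisy[index] = label
    return noisy, chosen
```

Note that under this scheme the class proportions are preserved, and a chosen cell may by chance keep a correct label, so the effective noise rate is at most the nominal rate.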
Finally, we validated scBridge's cell-type predictions of Naive CD4+ T cells using protein CD45RA and gene CCR7, as well as Effector CD4+ T cells using protein CD45RO and gene PLEKHG3 in Fig. 6d. Furthermore, we reconstructed the cell differentiation trajectory using monocle 39 in Supplementary Fig. 9c. As shown, the RNA pseudotime shows a trend from Naive CD4+ T cells to Effector CD4+ T cells, reflecting the CD4+ T cell differentiation. The UMAP plots in Fig. 6d show that scBridge reflects the continuous trajectory better than scJoint.

Discussion
By utilizing cell heterogeneity, scBridge achieves accurate scRNA-seq and scATAC-seq data integration, as well as label transfer with heterogeneous transfer learning. To summarize, scBridge accepts annotated scRNA-seq data and unlabeled scATAC-seq data to perform integration in an iterative manner. In each iteration, scBridge models the reliability of heterogeneous scATAC-seq cells and conducts cross-omics prototype alignment based on the estimated cell reliability. After that, scBridge selects the most reliable scATAC-seq cells as the annotated data and repeats the entire process. As the training proceeds, the modality gap between RNA and ATAC omics is gradually reduced, which enables scBridge to identify and integrate more scATAC-seq cells, leading to the final integration result. On seven multi-omics data integration benchmarks, scBridge outperforms six representative baselines in both joint embedding quality and label transfer accuracy. In addition to its superior performance, scBridge also shows strong robustness against (i) the number of annotated scRNA-seq cells, (ii) the inconsistency between scRNA-seq and scATAC-seq cell types, (iii) the dropout technical noise in sequencing data, and (iv) the quality of scRNA-seq annotations.
Though scBridge is a deep learning-based method, we simplified its structure and hyper-parameters to avoid laborious parameter tuning on different datasets. We fixed the same set of hyper-parameters on all seven datasets and scBridge achieves the best performance without any parameter tuning. In other words, users only need to decide whether to turn on the structure preservation loss to enable novel type discovery or strictly integrate all scATAC-seq cells with annotated scRNA-seq data, based on practical needs. Moreover, as scBridge only requires mini-batch optimization, it naturally scales to large-scale data, with linear time and constant memory consumption with respect to the cell number.
In this paper, we have focused on integrating scRNA-seq and scATAC-seq data. But theoretically, scBridge could extend to other modalities as long as the input data matrix is aligned in columns (i.e., genes, proteins, etc.). We found that when directly applied to the protein data from the human peripheral blood mononuclear dataset, scBridge still achieves better integration and label transfer performance compared with scJoint (Supplementary Note 1). Notably, scBridge does not strictly require all scRNA-seq cells to be annotated and scATAC-seq cells to be unlabeled. Cell annotations from any omics could be easily incorporated into the heterogeneous integration framework of scBridge, which is flexible in practice.
In conclusion, scBridge is a multi-omics data integration method based on a novel paradigm, i.e., heterogeneous transfer learning. Considering the large amounts of well-annotated scRNA-seq data and a wide range of scRNA-seq annotation tools 40 , scBridge has a promising application prospect. With its superior performance compared with existing baselines, robustness across different scenarios, and scalability to large datasets, scBridge would be a reliable tool in single-cell multi-omics analysis.

Methods
scBridge
For the given scRNA-seq data $X^s \in \mathbb{R}^{n_s \times m}$ and scATAC-seq data $X^t \in \mathbb{R}^{n_t \times m}$, scBridge employs a shared deep encoder network $f: X \to E$ to learn cell embeddings, followed by a shared classification head $g: E \to Y$ to classify both $X^s$ and $X^t$, where $n_s$ and $n_t$ denote the number of scRNA-seq and scATAC-seq cells, $m$ is the number of their common genes, and $X^s$ comes with the annotation $Y^s \in \mathbb{R}^{n_s}$. scBridge integrates multi-omics data in an iterative manner. First, scBridge warms up $f(\cdot)$ and $g(\cdot)$ using annotated scRNA-seq data. The performance of $f(\cdot)$ and $g(\cdot)$ is limited when directly transferred to scATAC-seq data due to the omics difference. Thanks to cell heterogeneity, a portion of scATAC-seq cells exhibit smaller omics differences with scRNA-seq cells. Consequently, the model could learn more discriminative embeddings and make more accurate cell-type predictions for those scATAC-seq cells. To identify those reliable scATAC-seq cells, scBridge models the reliability of scATAC-seq cells by fitting the embedding discriminability and classification confidence with the Gaussian Mixture. Based on the estimated cell reliability, scBridge computes the prototypes in ATAC omics and aligns them with the RNA prototypes for integration. Lastly, scBridge selects the most reliable scATAC-seq cells as the annotated data and repeats the entire training process. During the iterations, the modality gap between RNA and ATAC omics is gradually narrowed, enabling scBridge to identify and integrate more cells, and precisely predict their cell type $Y^t \in \mathbb{R}^{n_t}$. The training procedure of scBridge is elaborated below.
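The iterative procedure outlined above can be summarized by the following control-flow sketch. It is only a schematic under stated assumptions: `train_and_score` is a hypothetical caller-supplied function standing in for warm-up, reliability modeling, and prototype alignment, and `alpha` is an illustrative threshold on the two reliability probabilities.

```python
def select_reliable(cells, predictions, p_discriminable, p_confident, alpha=0.9):
    """Split unlabeled cells into (cell, predicted_label) pairs to annotate and cells to keep."""
    selected, remaining = [], []
    for cell, label, p_d, p_c in zip(cells, predictions, p_discriminable, p_confident):
        if p_d > alpha and p_c > alpha:
            selected.append((cell, label))
        else:
            remaining.append(cell)
    return selected, remaining

def iterative_integration(annotated, unlabeled, train_and_score, alpha=0.9, max_iters=10):
    """Repeatedly train, select reliable cells, and merge them into the annotated pool."""
    for _ in range(max_iters):
        if not unlabeled:
            break
        # Hypothetical callback: returns predictions and two reliability probabilities per cell.
        predictions, p_d, p_c = train_and_score(annotated, unlabeled)
        selected, unlabeled_next = select_reliable(unlabeled, predictions, p_d, p_c, alpha)
        if not selected:
            break  # nothing new is reliable enough, so stop early
        annotated = annotated + selected
        unlabeled = unlabeled_next
    return annotated, unlabeled
```

The design mirrors the from-easy-to-hard idea: each round enlarges the annotated pool with pseudo-labeled cells, so later rounds are trained on data that sits closer to the remaining cells.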
Warm-up with annotated scRNA-seq data. To endow the deep neural networks $f(\cdot)$ and $g(\cdot)$ with the capacity of feature extraction and cell classification, we first use the annotated scRNA-seq data $\{X^s, Y^s\}$ to warm up $f(\cdot)$ and $g(\cdot)$ with a weighted cross-entropy loss (Eq. (1)), where $N$ and $K$ denote the mini-batch size and cell-type number, $x^s_i$ and $y^s_i$ denote the gene expression and annotation of the i-th cell, $p^s_i$ refers to the predicted soft label of sample i, and $|Y^s_k|$ indicates the number of cells belonging to the k-th class. Here the weight $w^s_i$ is adopted to alleviate the influence of the highly imbalanced distribution of cells across different types.
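For readers who want a concrete picture of a class-weighted cross-entropy, a minimal sketch follows. The weighting scheme is an assumption: each class is weighted by the inverse of its relative frequency, $n / (K \cdot |Y^s_k|)$, which is one common choice and not necessarily the weight used by scBridge. Labels are assumed to be integer class indices and all names are illustrative.

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_weights(all_labels):
    """Assumed inverse-frequency weights: n / (K * class size)."""
    counts = Counter(all_labels)
    n, k = len(all_labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

def weighted_cross_entropy(batch_logits, batch_labels, weights):
    """Mean over the mini-batch of the weighted negative log-likelihood of the true class."""
    total = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        total += -weights[label] * math.log(probs[label])
    return total / len(batch_labels)
```

With such weights, a mistake on a rare cell type costs more than a mistake on an abundant one, which counteracts the tendency of the classifier to favor dominant types.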
Reliability modeling for scATAC-seq data with the Gaussian Mixture Model. After the warm-up, the model is able to correctly predict cell types for scRNA-seq data. However, its classification performance on scATAC-seq data is unsatisfactory due to the modality gap between RNA and ATAC omics. Nevertheless, thanks to cell heterogeneity, we observe that a portion of scATAC-seq cells exhibit smaller omics differences from scRNA-seq cells when their chromatin accessibility has higher positive correlations with gene expression, and those cells are easier to integrate. Specifically, for those cells the deep embedding network $f(\cdot)$ tends to extract more discriminative features, and the classifier $g(\cdot)$ tends to make more confident predictions. Motivated by this observation, to identify the reliable cells that have smaller omics differences, we model the reliability of scATAC-seq cells with Gaussian Mixture Models (GMMs) based on their discriminability and confidence. The discriminability $d_i^t$ of the scATAC-seq embedding $E^t = f(X^t)$ is calculated from the distance to the class centers of the scRNA-seq embedding (Eq. (2)), where $\cos(\cdot,\cdot)$ denotes the cosine similarity, $c_k^s$ is the center of the scRNA-seq cells from the $k$-th class $X_k^s$ in the embedding space, and $e_i^t$ refers to the embedding of the $i$-th scATAC-seq cell. The confidence of each prediction is evaluated through the following cross-entropy loss:

$$l_i^t = -\sum_{k=1}^{K} \hat{y}_{ik}^t \log p_{ik}^t, \tag{3}$$

where $\hat{y}_i^t$ denotes the current prediction for the $i$-th scATAC-seq cell and $p_i^t$ its predicted soft label. Note that $l_i^t$ is smaller when the prediction is more confident (i.e., closer to one-hot). Given the discriminability $d^t = \{d_i^t\}_{i=1}^{n_t}$ and the loss $l^t = \{l_i^t\}_{i=1}^{n_t}$, we fit two two-component GMMs, $g^a$ and $g^b$, respectively:

$$g^a(d) = \gamma_{c_1^a}\,\phi(d \mid c_1^a) + \gamma_{c_2^a}\,\phi(d \mid c_2^a), \qquad g^b(l) = \gamma_{c_1^b}\,\phi(l \mid c_1^b) + \gamma_{c_2^b}\,\phi(l \mid c_2^b). \tag{4}$$

Since the components of each GMM are indexed by increasing mean, $c_2^a$ collects the cells with higher discriminability and $c_1^b$ the cells with lower loss. Based on our observation, the cells belonging to components $c_2^a$ and $c_1^b$ are more likely to be correctly classified. Accordingly, we design the following cross-omics prototype alignment strategy to integrate scRNA-seq and scATAC-seq data.
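The reliability modeling step can be sketched with a small two-component, one-dimensional Gaussian mixture fitted by expectation-maximization. This is a dependency-free stand-in for a library GMM, not the authors' implementation: the helper names `fit_gmm_1d` and `posterior`, the initialization, and the toy discriminability and loss values are all assumptions for illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, n_iter=100, min_var=1e-6):
    """EM for a two-component 1-D GMM; returns (coefficients, means, variances)
    with the components sorted by increasing mean."""
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                                  # start the means at the extremes
    var = [max((hi - lo) ** 2 / 4, min_var)] * 2
    gamma = [0.5, 0.5]                             # mixture coefficients
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            w = [gamma[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(w) or 1e-300
            resp.append([wj / s for wj in w])
        # M-step: update coefficients, means, and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp) or 1e-300
            gamma[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj, min_var)
    order = sorted(range(2), key=lambda j: mu[j])
    return [gamma[j] for j in order], [mu[j] for j in order], [var[j] for j in order]

def posterior(x, gmm, component):
    """Probability that x belongs to the given component (0 = lower mean)."""
    gamma, mu, var = gmm
    w = [gamma[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
    return w[component] / (sum(w) or 1e-300)

# Toy values for six scATAC-seq cells (invented for illustration).
discriminability = [0.20, 0.25, 0.30, 0.85, 0.90, 0.95]
loss = [2.0, 1.8, 2.2, 0.10, 0.05, 0.15]

gmm_a = fit_gmm_1d(discriminability)
gmm_b = fit_gmm_1d(loss)
# Reliable cells: high-discriminability component of g^a, low-loss component of g^b.
p_reliable_d = [posterior(d, gmm_a, 1) for d in discriminability]
p_reliable_l = [posterior(l, gmm_b, 0) for l in loss]
```

In this toy setting the last three cells fall into the high-discriminability and low-loss components, so both probabilities are close to one for them and close to zero for the first three.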
Cross-omics prototype alignment with the estimated cell reliability. To align cells of the same type across different omics, we propose minimizing the pair-wise distance between cross-omics prototypes. The prototype in scRNA-seq data corresponds to the class center as defined in Eq. (2), while the prototype in scATAC-seq data is defined as a weighted mean whose weights are built from $p(c_2^a \mid d_i^t) \in [0,1]$, the probability of cell $i$ belonging to the second component $c_2^a$ in GMM $g^a$, and $p(c_1^b \mid l_i^t) \in [0,1]$, the probability of it belonging to the first component $c_1^b$ in GMM $g^b$. Such a GMM-based sample weighting approach is designed to alleviate the influence of falsely classified cells and to prevent the over-integration problem, because falsely classified cells are likely to be assigned relatively low probabilities $p(c_2^a \mid d_i^t)$ and $p(c_1^b \mid l_i^t)$. Our integration objective is to minimize the cosine distance between prototypes $c_k^s$ and $c_k^t$. However, it is difficult to compute global prototypes in the mini-batch optimization paradigm. Hence, as a remedy, we initialize the prototypes with global information and update them with the exponential moving average via

$$\bar{c}_k^{\,\cdot} \leftarrow \eta\,\bar{c}_k^{\,\cdot} + (1-\eta)\,c_k^{\,\cdot},$$

where $\bar{c}_k^{\,\cdot}$ is the globally initialized prototype, $c_k^{\,\cdot}$ refers to the prototype in each mini-batch, and $\eta$ is the momentum parameter. The data integration is achieved by minimizing the pair-wise distance between cross-omics prototypes (Eq. (7)).

Iterative integration with heterogeneous transfer learning. Due to cell heterogeneity and the large modality gap between RNA and ATAC omics, it is hard to accurately integrate all cells at once. As a solution, we adopt an iterative integration paradigm. To be specific, at the end of each iteration we select the most reliable scATAC-seq cells as annotated data, namely those whose GMM probabilities $p(c_2^a \mid d_i^t)$ and $p(c_1^b \mid l_i^t)$ are larger than the threshold $\alpha$. After that, we treat the annotated scRNA-seq data and the selected scATAC-seq data as annotated. Together with the remaining unlabeled scATAC-seq data, we retrain the network. Such a heterogeneous transfer learning paradigm gradually eliminates the omics difference, allowing the model to accurately integrate and classify more cells. The above process is repeated until no more reliable scATAC-seq data can be selected.
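The reliability-weighted prototypes, their moving-average update, and the cosine alignment objective described above can be sketched as follows. Several details are assumptions made for illustration only: the per-cell weight is taken as the product of the two GMM probabilities, cells are grouped by their predicted class, the alignment loss is averaged over classes, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_prototypes(embeddings, pred_labels, weights, n_classes):
    """Reliability-weighted mean embedding per predicted class."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for e, k, w in zip(embeddings, pred_labels, weights):
        totals[k] += w
        for j in range(dim):
            sums[k][j] += w * e[j]
    # A class with zero total weight gets no prototype (None).
    return [[s / t for s in row] if t > 0 else None for row, t in zip(sums, totals)]

def ema_update(global_proto, batch_proto, eta=0.9):
    """Exponential moving average: keep eta of the old prototype, mix in the batch one."""
    return [eta * g + (1 - eta) * b for g, b in zip(global_proto, batch_proto)]

def alignment_loss(rna_protos, atac_protos):
    """Mean cosine distance (1 - cos) over classes that have both prototypes."""
    pairs = [(s, t) for s, t in zip(rna_protos, atac_protos) if s is not None and t is not None]
    return sum(1 - cosine(s, t) for s, t in pairs) / len(pairs)

# Toy example: three scATAC-seq cells, two classes, 2-D embeddings.
atac_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
atac_pred = [0, 0, 1]
p_disc = [1.0, 0.0, 1.0]   # p(c_2^a | d_i)
p_conf = [1.0, 0.5, 0.5]   # p(c_1^b | l_i)
cell_weights = [a * b for a, b in zip(p_disc, p_conf)]  # assumed combination rule

atac_protos = weighted_prototypes(atac_emb, atac_pred, cell_weights, n_classes=2)
rna_protos = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(rna_protos, atac_protos)
smoothed = ema_update([1.0, 0.0], [0.0, 1.0], eta=0.9)
```

The second cell has zero weight, so it does not pull the class-0 prototype away from the first cell; this is the mechanism by which low-reliability cells are kept from distorting the alignment.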
To sum up, scBridge first warms up the deep neural networks $f(\cdot)$ and $g(\cdot)$ with the annotated scRNA-seq data by Eq. (1). After that, scBridge models the cell reliability with Eq. (4) and performs cross-omics prototype alignment with Eq. (7). Finally, scBridge selects the most reliable scATAC-seq cells as the annotated data and repeats the training process until convergence.
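The outer loop summarized above can be sketched as a simple selection procedure. Here `score_fn` is a hypothetical stand-in for one round of retraining followed by GMM fitting, and the threshold value, the `max_rounds` cap, and the toy scores are assumptions for illustration rather than settings taken from scBridge.

```python
def select_reliable(p_disc, p_conf, alpha=0.9):
    """Indices of cells whose two GMM probabilities both exceed alpha."""
    return [i for i, (a, b) in enumerate(zip(p_disc, p_conf)) if a > alpha and b > alpha]

def iterative_integration(score_fn, alpha=0.9, max_rounds=20):
    """score_fn(labeled) -> (p_disc, p_conf, preds) for all scATAC-seq cells.
    Returns a dict mapping selected cell index -> pseudo-label."""
    labeled = {}
    for _ in range(max_rounds):
        p_disc, p_conf, preds = score_fn(labeled)
        new = [i for i in select_reliable(p_disc, p_conf, alpha) if i not in labeled]
        if not new:          # stop when no more reliable cells can be selected
            break
        for i in new:
            labeled[i] = preds[i]
    return labeled

# Toy scoring: reliability grows as more cells are merged into the annotated set,
# mimicking the narrowing omics gap.
base = [0.95, 0.85, 0.75, 0.20]

def toy_score(labeled):
    boost = 0.1 * len(labeled)
    p = [min(1.0, b + boost) for b in base]
    return p, p, [0, 1, 0, 1]

selected = iterative_integration(toy_score, alpha=0.9)
```

In the toy run, one additional cell crosses the threshold in each of the first three rounds, while the fourth cell never becomes reliable and is left unlabeled when the loop stops.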
from https://github.com/GreenleafLab/MPAL-Single-Cell-2019. We excluded cells annotated as "Unknown", resulting in 34,609 cells for scRNA-seq data and 33,819 cells for scATAC-seq data for the analysis. The number of common genes between the two omics is 15,715.
• Human peripheral blood mononuclear cells (PBMC). The original CITE-seq data, ASAP-seq data, and fragments file are provided in GSE156478 38. For convenience, we downloaded the preprocessed data provided in https://github.com/SydneyBioX/scJoint/blob/main/data.zip, which contains 4,644 CITE-seq and 4,506 ASAP-seq cells of 7 common types. The peak-by-cell matrix was calculated using the R package ArchR 50, v1.0.2, with default parameters. Though both CITE-seq and ASAP-seq could profile chromatin and protein levels simultaneously, only gene expression and activity matrices are used to focus on the scRNA-seq and scATAC-seq data integration. Additionally, we show that scBridge could also integrate the protein data in Supplementary Note 1.
For scJoint 15, we adopted the official code released at https://github.com/SydneyBioX/scJoint, with hyper-parameters provided in the code example or recommended in the paper. Since scJoint first binarizes the gene expression and activity matrices, no data preprocessing is needed. The algorithm directly outputs the embedding of both scRNA-seq and scATAC-seq data (_embeddings.txt), the transferred label for scATAC-seq data (_knn_predictions.txt), as well as the confidence score (_knn_probs.txt).
For Seurat, we used the Seurat R package 22. The raw count matrix of scRNA-seq and the unnormalized gene activity matrix of scATAC-seq were transformed into Seurat objects using the CreateSeuratObject function. The Seurat objects were further processed with the NormalizeData, FindVariableFeatures, ScaleData, and RunPCA functions, where the PCA dimensions were set to 1:30 for scRNA-seq data and 2:30 for scATAC-seq data. After that, the FindTransferAnchors function was adopted to identify the anchors between the scRNA-seq and scATAC-seq datasets. Then, the TransferData function was employed to impute and integrate the scATAC-seq data, and to transfer the annotations from the scRNA-seq dataset to the scATAC-seq cells. The TransferData function also outputs the confidence score of each prediction. Finally, PCA was performed on the combined matrix of scRNA-seq data and imputed scATAC-seq data to obtain features. Unless otherwise mentioned, all parameters were set to their defaults.
For Portal, we used the portal-sc Python package 25, v1.0.2. Following its default pipeline, we used the model.preprocess function to preprocess the gene expression matrix and gene activity matrix. After that, the model.train function was used to integrate data. We set training_steps = 1000 for datasets with sample size <20,000 and training_steps = 2000 otherwise, as suggested in the package. Finally, the integration results were obtained from model.latent after running the model.eval function. As Portal itself does not support label transfer, we adopted the KNeighborsClassifier function with k = 10 provided in the scikit-learn Python package 42, v1.1.3, to transfer cell annotations from scRNA-seq to scATAC-seq data.
For Harmony, we adopted the same data preprocessing pipeline as scBridge, followed by a PCA dimensionality reduction with 50 components, and then used the official harmonypy Python package 14 (https://github.com/slowkow/harmonypy), v0.0.5, with the recommended parameters for data integration. As Harmony itself does not support label transfer, we adopted the KNeighborsClassifier function with k = 10 provided in the scikit-learn Python package 42, v1.1.3, to transfer cell annotations from scRNA-seq to scATAC-seq data.
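For baselines without built-in label transfer, annotations are carried over by a k-nearest-neighbor vote in the integrated embedding. The sketch below is a dependency-free illustration of that vote with Euclidean distance and uniform weights; it is a stand-in written for this purpose, not scikit-learn's KNeighborsClassifier, and the function name `knn_transfer` and the toy coordinates are assumptions. Ties are broken by the first label encountered, which may differ from scikit-learn's behavior.

```python
import math
from collections import Counter

def knn_transfer(ref_emb, ref_labels, query_emb, k=10):
    """Assign each query cell the majority label among its k nearest reference cells."""
    preds = []
    for q in query_emb:
        # Rank reference cells by Euclidean distance to the query cell.
        nearest = sorted(range(len(ref_emb)), key=lambda i: math.dist(q, ref_emb[i]))[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds

# Toy integrated embedding: scRNA-seq cells as reference, scATAC-seq cells as queries.
rna_emb = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
rna_labels = ["T", "T", "T", "B", "B", "B"]
atac_emb = [(0.2, 0.1), (5.5, 5.2)]
atac_pred = knn_transfer(rna_emb, rna_labels, atac_emb, k=3)
```

The toy example uses k = 3 only because the reference set is tiny; the experiments described above use k = 10.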
For GLUE, we adopted the official code released at https://github.com/gao-lab/GLUE, v0.3.2. Following its default pipeline, we first preprocessed the scRNA-seq expression and scATAC-seq peak-by-cell matrices and then constructed the prior regulatory graph. After that, we trained a GLUE model to integrate data with the scglue.models.fit_SCGLUE function. Then, we applied the model for cell and feature embedding using the encode_data function. Finally, cell annotations with confidence scores were transferred from scRNA-seq to scATAC-seq data with the scglue.data.transfer_labels function.
For Conos, we used the conos R package 23, v1.4.6. The raw count matrix of scRNA-seq and the unnormalized gene activity matrix of scATAC-seq were first preprocessed by the basicP2proc function provided in the pagoda2 R package 51, v1.0.10. Then, the joint graph was built by the buildGraph function with parameters k = 15, k.self = 5, k.self.weight = 0.01, ncomps = 30, n.odgenes = 5e3, and space = 'PCA'. The joint embedding was generated by the embedGraph function. Finally, labels were propagated from the scRNA-seq to the scATAC-seq data based on the joint graph using the propagateLabels function, which also outputs the confidence score of each prediction. Unless otherwise mentioned, all parameters were set to their defaults.
Evaluation metrics. We adopted the cell classification accuracy (ACC) and weighted F1-score (F1) to measure the performance of label transfer. Specifically, let $\hat{y}_i$ and $y_i$ be the predicted type and the ground-truth annotation for cell $i$. ACC measures the percentage of cells being correctly classified, i.e.,

$$\mathrm{ACC} = \frac{1}{n_t}\sum_{i=1}^{n_t} \mathbb{1}(\hat{y}_i = y_i).$$

Considering the severe data imbalance among different cell types, ACC would be dominated by large cell types, leading to an incomplete evaluation of small classes. Hence, we chose the weighted average F1-score as another measurement for comprehensive evaluation, where the weight is inversely proportional to the number of cells of each cell type. Mathematically,

$$\mathrm{F1} = \sum_{k=1}^{K} w_k\,\frac{2 P_k R_k}{P_k + R_k}, \qquad w_k \propto \frac{1}{|Y_k|},$$

where $P_k$ and $R_k$ denote the Precision and Recall for the $k$-th cell type, and $|Y_k|$ refers to the number of cells belonging to type $k$. Silhouette coefficients were adopted to evaluate the joint embedding after data integration. We computed silhouette coefficients given cell types and omics respectively, denoted by $\mathrm{Sil}_{\mathrm{Type}}$ and $\mathrm{Sil}_{\mathrm{Omic}}$. Since one expects the cross-omics cells of the same type to be mixed, higher $\mathrm{Sil}_{\mathrm{Type}}$ and lower $\mathrm{Sil}_{\mathrm{Omic}}$ indicate better results. To provide an overall measure, we harmonized the two silhouette coefficients into a single $\mathrm{Sil}_{\mathrm{F1}}$ score, where a higher $\mathrm{Sil}_{\mathrm{F1}}$ score indicates better integration results.
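The contrast between ACC and the inverse-size-weighted F1-score can be illustrated with a short sketch. The normalization of the weights so that they sum to one, as well as the function names, are assumptions made for illustration; the text only states that the weights are inversely proportional to the class sizes.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def inverse_size_weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to 1 / class size
    (weights normalized to sum to one, which is an assumption)."""
    sizes = Counter(y_true)
    norm = sum(1 / n for n in sizes.values())
    score = 0.0
    for k, n_k in sizes.items():
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        f1_k = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += (1 / n_k) / norm * f1_k
    return score

# Toy example: a classifier that ignores the rare type "B".
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
acc = accuracy(y_true, y_pred)
f1 = inverse_size_weighted_f1(y_true, y_pred)
```

Here the accuracy stays high because the large type dominates, whereas the weighted F1-score is low because the missed rare type carries most of the weight.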
To evaluate novel type prediction, we treated it as a novel/seen binary classification task and evaluated the performance via

$$\mathrm{Novel\ F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$

where TP, FP, and FN stand for true positive, false positive, and false negative, respectively. A higher Novel F1 score corresponds to a more accurate novel-type prediction.
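As a small worked example of the novel/seen evaluation, the sketch below flags a cell as novel when its confidence score falls below a threshold and then computes the F1-score of that binary decision. The function name `novel_f1`, the threshold value, and the toy scores are assumptions for illustration.

```python
def novel_f1(is_novel_true, confidence, threshold):
    """F1-score of the 'novel' class when cells below the threshold are called novel."""
    pred_novel = [c < threshold for c in confidence]
    tp = sum(t and p for t, p in zip(is_novel_true, pred_novel))
    fp = sum((not t) and p for t, p in zip(is_novel_true, pred_novel))
    fn = sum(t and (not p) for t, p in zip(is_novel_true, pred_novel))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: two truly novel cells, three seen cells.
truth = [True, True, False, False, False]
conf_scores = [0.2, 0.7, 0.3, 0.9, 0.95]
score = novel_f1(truth, conf_scores, threshold=0.5)
```

With one true positive, one false positive, and one false negative, the score in this example is 0.5.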
Trajectory analysis. The monocle R package 39, v2.22.0, was used to infer the pseudotime in Naive CD4+ T cells and Effector CD4+ T cells from the PBMC dataset, with its default parameters. The inferred pseudotime was then projected to the UMAP plots obtained by scBridge and scJoint.
Visualization. We used features extracted by the deep embedding network $f(\cdot)$ as cell representations and adopted the harmony and umap functions provided in the Scanpy Python package 52, v1.9.1, with the default parameters to reduce the dimension to two and visualize cells. Dot plots and feature plots were drawn using the ggplot2 R package 53.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Fig. 1 | Overview of scBridge. a The input of scBridge is composed of the annotated scRNA-seq data and unlabeled scATAC-seq data. scBridge passes them into a deep neural encoder and a classifier to achieve data integration and label transfer iteratively. The main steps of scBridge are elaborated as follows. b scBridge warms up the deep neural encoder and classifier with the annotated scRNA-seq data and accordingly obtains the initial embedding and cell-type prediction for scRNA-seq and scATAC-seq cells. c scBridge models the reliability of each scATAC-seq cell based on its embedding discriminability and classification loss. d Based on the

Fig. 2 | Integration results on three golden benchmarks. a UMAP plot of the joint embeddings obtained by the seven methods on the SNARE-seq dataset, where cells are colored by types. b Quantitative evaluation on the SNARE-seq dataset in terms of the joint embedding quality and label transfer accuracy. Source data are provided as a Source Data file. c (Left) The number of reliable scATAC-seq cells selected by scBridge with the corresponding accuracy, and the overall label transfer accuracy across the training process on the SNARE-seq dataset. (Right) The Pearson correlation score between scRNA-seq and the selected scATAC-seq cells in different iterations. Source data are provided as a Source Data file. d The label transfer matrix of the agreement between the predicted cell type and the ground-truth annotation. A clearer diagonal structure denotes better label transfer performance. e UMAP embeddings of scBridge and scJoint on the SHARE-seq dataset. f UMAP embeddings of scBridge and scJoint on the 10x Multiome dataset. The first and second rows show cells colored by types (DCDP: Distal collecting duct principal cell, DCT: Distal convoluted tubule cell, DTC: Distal tubule cell, PTC: Proximal tubule cell, SMC: Smooth muscle cell, GP: Glomerular podocyte, PTBB: Proximal tubule brush border cell) and omics, respectively. g The label transfer accuracy and F1-score of the tested methods on three benchmarks, where 100%, 75%, 50%, and 25% annotated scRNA-seq data are used. Five random experiments are conducted with different downsample rates. Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. Source data are provided as a Source Data file.

Fig. 4 | Integration results on the human myocardial infarction data, where both RNA and ATAC omics have their unique cell types (adipocyte, cycling cells, and mast cells only exist in scRNA-seq data, and Myeloid cells only exist in scATAC-seq data). a UMAP visualization of the joint embedding obtained by scBridge, scJoint, Seurat, GLUE, and Conos. The first and second rows show cells colored by types and omics, respectively. The novel Myeloid cells are gray-colored and red-circled. b (Left) scBridge's UMAP embedding of scATAC-seq cells, colored by the confidence score. (Right) scBridge's novel type threshold was estimated by applying a two-component GMM on the confidence score. c The confidence score predicted by scBridge and Seurat on different types of 46,086 scATAC-seq cells. The red dashed line corresponds to the novel type confidence threshold estimated by GMM. Each miniature boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. d The label transfer results of scBridge, Seurat, GLUE, and Conos. Cells are considered novel if their confidence scores are below the threshold estimated by GMM.

Fig. 5 | Integration results on human hematopoiesis data. a UMAP visualization of the joint embeddings learned by scBridge, scJoint, Seurat, and Portal under 75% dropout on scRNA-seq data, where cells are colored by types. b The agreement between labels transferred by scBridge, Seurat, and the manual annotations under 75% dropout on scRNA-seq data. A clearer diagonal structure indicates better agreement. c The F1 harmonized silhouette score and the weighted F1 label transfer accuracy of scBridge and six baselines with different dropout corruption rates on scRNA-seq and scATAC-seq data. Five random experiments are conducted under each dropout rate. Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. Source data are provided as a Source Data file.

Fig. 6 | Integration results on human PBMC data. a UMAP visualization of the joint embeddings obtained by scBridge, scJoint, Seurat, GLUE, and Conos. The first and second rows show cells colored by types and omics, respectively. b The labels transferred by scBridge, scJoint, and Seurat. A clearer diagonal structure indicates better agreement between the transferred labels and manual annotations. c The label transfer accuracy and F1-score of the tested methods with 0%, 5%, 10%, and 20% corruption rates on the scRNA-seq cell annotations. Five random experiments are conducted with different corruption rates. Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. Source data are provided as a Source Data file. d Protein CD45RA, CD45RO and gene CCR7, PLEKHG3 projected on the scBridge UMAP plot of Effector CD4+ T and Naive CD4+ T cells. e Pseudotime projected on the scBridge and scJoint UMAP plots of Effector CD4+ T and Naive CD4+ T cells. The arrows indicate continuously changed cells from Naive CD4+ T cells to Effector CD4+ T cells.

In the GMMs of Eq. (4), $\gamma_{c_1^{\cdot}}$ and $\gamma_{c_2^{\cdot}}$ denote the mixture coefficients for components $c_1^{\cdot}$ and $c_2^{\cdot}$, and $\phi(\cdot \mid c_1^{\cdot})$ and $\phi(\cdot \mid c_2^{\cdot})$ refer to the probability densities. Without loss of generality, we assume that the mean values of the two components satisfy $\mu_{c_1^{\cdot}} < \mu_{c_2^{\cdot}}$.

ggplot2 was used at version v3.3.2. Other plots were based on the seaborn Python package 54, v0.11.2.

Statistics and reproducibility. Statistical analyses were performed with the SciPy Python package 55, v1.11.1. P values were determined by a two-sided t-test, and a P value < 0.05 was considered statistically significant. All experiments were conducted under five randomizations with different model initializations. No data were excluded and no statistical methods were used to predetermine sample size. Investigators were not blinded to allocation during library preparation, experiments, or analysis.