Abstract
Mapping single-cell sequencing profiles to comprehensive reference datasets provides a powerful alternative to unsupervised analysis. However, most reference datasets are constructed from single-cell RNA-sequencing data and cannot be used to annotate datasets that do not measure gene expression. Here we introduce ‘bridge integration’, a method to integrate single-cell datasets across modalities using a multiomic dataset as a molecular bridge. Each cell in the multiomic dataset constitutes an element in a ‘dictionary’, which is used to reconstruct unimodal datasets and transform them into a shared space. Our procedure accurately integrates transcriptomic data with independent single-cell measurements of chromatin accessibility, histone modifications, DNA methylation and protein levels. Moreover, we demonstrate how dictionary learning can be combined with sketching techniques to improve computational scalability and harmonize 8.6 million human immune cell profiles from sequencing and mass cytometry experiments. Our approach, implemented in version 5 of our Seurat toolkit (http://www.satijalab.org/seurat), broadens the utility of single-cell reference datasets and facilitates comparisons across diverse molecular modalities.
Main
In the same way that read-mapping tools have transformed genome sequence analysis^{1,2,3}, the ability to map new datasets to established references represents an exciting opportunity for the field of single-cell genomics. As an alternative to fully unsupervised clustering, supervised mapping approaches leverage large and well-curated references to interpret and annotate query profiles. This strategy is enabled by the curation and public release of reference datasets as well as the development of new computational tools, including statistical learning^{4,5,6,7} and deep learning-based approaches^{8,9}, that have been successfully applied toward this goal.
A current limitation of existing approaches is the primary focus on single-cell RNA-sequencing (scRNA-seq) data. Single-cell transcriptomics is well suited for the assembly and annotation of reference datasets, particularly as differentially expressed (DE) gene markers can typically be interpreted to help annotate cell clusters. This has led to the development of high-quality, carefully curated and expertly annotated references, particularly from consortia including the Human Cell Atlas^{10}, the Human Biomolecular Atlas Project (HuBMAP^{11}) and the Chan Zuckerberg Biohub^{12}. Mapping to these references facilitates data harmonization, standardization of cell ontologies and naming schemes and comparison of scRNA-seq datasets across experimental conditions and disease states.
A crucial challenge is to extend reference mapping to additional molecular modalities, including single-cell measurements of chromatin accessibility (for example, single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq^{13,14})), DNA methylation (single-cell bisulfite sequencing^{15}), histone modifications (single-cell cleavage under targets and tagmentation (scCUT&Tag^{16,17})) and protein levels (cytometry by time of flight (CyTOF^{18})), each of which measures a different set of features than scRNA-seq. The lack of transcriptome-wide measurements creates challenges for unsupervised annotation. Ideally, datasets from different modalities could be mapped onto scRNA-seq references, ensuring that established cell labels and ontologies would be preserved. We and others have proposed methods to map datasets across modalities^{19,20,21}, for example, taking the gene body sum of ATAC-seq signal (or the inverse of the DNA methylation levels) as a proxy for transcriptional output. These methods make strict biological assumptions (for example, that accessible chromatin is associated with active transcription) that may not always hold true, particularly when analyzing cellular transitions or developmental trajectories^{22}.
Here, we introduce ‘bridge integration’, which integrates single-cell datasets measuring different modalities by leveraging a separate dataset where both modalities are simultaneously measured as a molecular ‘bridge’. The multiomic bridge dataset, which can be generated by a diverse set of technologies^{23,24,25,26,27,28,29,30,31,32}, helps to translate information between disparate measurements, resulting in robust integration without requiring any limiting biological assumptions. We illustrate the broad applicability of our approach, demonstrating its performance across five different molecular modalities (Fig. 1a). Moreover, we introduce ‘atomic sketch integration’, which combines dictionary learning and dataset sketching to improve the computational efficiency of large-scale single-cell analysis and enables rapid integration of dozens of datasets spanning millions of cells.
Results
Using multiomic dictionaries for bridge integration
We aimed to develop a flexible and robust integration strategy to integrate data from single-cell sequencing experiments where different modalities are measured (‘single-modality datasets’). The fundamental challenge is that different single-modality datasets measure different sets of features. We reasoned that one approach would be to leverage a multiomic dataset as a bridge that can help to translate between disparate modalities. To perform this translation, we were inspired by the field of dictionary learning, a form of representation learning that is commonly used in image analysis and genomics^{33,34,35,36,37}. The goal of dictionary learning is to represent input data in terms of individual elements that are called atoms and together comprise a dictionary. Reconstructing input data as a weighted linear combination of these atoms is an effective tool for denoising and represents a transformation of the input data into a dictionary-defined space.
We find that dictionary learning enables cross-modality bridge integration at single-cell resolution. Our key insight is to treat a multiomic dataset as a dictionary, with each individual cell’s multiomic profile representing an atom. We learn a ‘dictionary representation’ of each unimodal dataset based on these atoms. For clarity, we emphasize that in contrast to the original applications of dictionary learning where the atoms represent a set of features^{33,37}, we use individual instances (cells) as dictionary elements. This transformation takes datasets in which completely different sets of features were measured and represents them each in a space where the defining features represent the same set of atoms (Fig. 1b). Once different modalities can be represented using the same set of features, they can be readily aligned in a final step.
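The idea of representing two unimodal datasets over one shared set of atoms can be illustrated with a minimal sketch. The Gaussian-kernel weighting, the `dictionary_representation` helper and the toy values are assumptions made for illustration; the fitting procedure we actually use is described in the Supplementary Methods and may differ.

```python
import math

def dictionary_representation(cells, atoms, bandwidth=1.0):
    """Express each cell as non-negative weights over the atoms (each row sums to 1).

    Assumption: a Gaussian kernel on Euclidean distance stands in for the
    fitting procedure of the paper, which is not reproduced here.
    """
    weights = []
    for cell in cells:
        sims = [math.exp(-math.dist(cell, atom) ** 2 / (2 * bandwidth ** 2))
                for atom in atoms]
        total = sum(sims)
        weights.append([s / total for s in sims])
    return weights

# Three bridge cells (atoms), each profiled in both modalities (toy values).
bridge_rna = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]                   # 2 RNA features
bridge_atac = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # 3 ATAC features

rna_cells = [[0.2, 0.1], [9.8, 0.3]]              # unimodal RNA dataset
atac_cells = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]   # unimodal ATAC dataset

# Each modality is compared only against the matching half of the bridge,
# yet both outputs have one column per atom and are directly comparable.
rna_repr = dictionary_representation(rna_cells, bridge_rna)
atac_repr = dictionary_representation(atac_cells, bridge_atac, bandwidth=0.5)
```

Although the RNA and ATAC inputs have different numbers of features, both representations have three columns (one per bridge cell), which is the property that makes a downstream alignment step possible.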
Our bridge integration procedure is illustrated in Fig. 1b and is described fully in the Supplementary Methods, and we note a few key points below. First, our procedure makes no assumptions about the relationships between modalities, as these are learned automatically from the multiomic dataset. Second, the key advance we present here is a transformation to project datasets profiling different modalities to be represented by a shared set of atoms. Once transformed, the final alignment step is compatible with a wide diversity of single-cell integration techniques, including Harmony^{38}, mnnCorrect^{39}, Seurat^{19}, Scanorama^{40} or scVI^{41}. In this manuscript, we perform this step with an implementation of the mnnCorrect algorithm^{39}.
Third, we found that when working with sizable bridge datasets, the large number of atoms (single cells in the bridge dataset) created a substantial computational burden. Motivated by a similar problem addressed by Laplacian Eigenmaps^{42}, we compute an eigendecomposition of the graph Laplacian for the multiomic dataset to reduce the dimensionality from the number of atoms to the number of selected eigenvectors (Supplementary Methods). We then use these eigenvectors to transform the learned dictionary representations into the same lower-dimensional space, substantially increasing the efficiency of our bridge integration procedure.
Mapping scATAC-seq data onto scRNA-seq references
We first demonstrate our bridge integration strategy by performing cross-modality mapping on scATAC-seq and scRNA-seq samples of human bone marrow mononuclear cells (BMMCs). These samples consist of cells representing the full spectrum of hematopoietic differentiation, including hematopoietic stem cells (HSCs), multipotent and oligopotent progenitors and fully differentiated cells. As part of HuBMAP, we have leveraged public datasets to construct a comprehensive scRNA-seq reference (‘Azimuth reference’; 297,627 cells) of human BMMCs, carefully annotating 10 progenitor and 25 differentiated cell states (Fig. 2a). We aimed to map scATAC-seq ‘query’ datasets of human BMMCs^{43} (16,266 whole bone marrow profiles and 9,893 CD34^{+}-enriched profiles) to this reference (Fig. 2b). We used a 10x multiome dataset^{44} (32,368 cells; paired single-nucleus RNA-seq + scATAC-seq) that was publicly released as part of NeurIPS 2021 as a bridge.
Our bridge procedure successfully mapped the scATAC-seq dataset onto our Azimuth reference, enabling joint visualization of scATAC-seq and scRNA-seq data (Fig. 2c) and automated annotation of scATAC-seq profiles with accompanying prediction scores. Reference mapping also aligned shared cell populations across multiple samples, mitigating sample-specific batch effects. Query samples representing CD34^{+} BMMC fractions mapped exclusively to the HSC and progenitor components in the reference dataset, demonstrating that bridge integration can robustly handle cases where the query dataset represents a subset of the reference, while whole fractions mapped to all 35 cell states (Supplementary Fig. 1a).
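To make the notion of an annotation with an accompanying prediction score concrete, the sketch below transfers labels by a distance-weighted vote among the nearest reference cells in a shared space. This voting scheme, the `transfer_labels` helper and the toy coordinates are illustrative assumptions and are not the scoring used in our implementation.

```python
import math
from collections import defaultdict

def transfer_labels(query, reference, ref_labels, k=4):
    """Assign each query cell a label and a score in (0, 1].

    Assumption: a distance-weighted k-nearest-neighbor vote, used here only
    as a stand-in for the label-transfer step.
    """
    results = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda i: math.dist(q, reference[i]))
        votes = defaultdict(float)
        for i in order[:k]:
            # Closer reference cells contribute larger votes.
            votes[ref_labels[i]] += 1.0 / (1.0 + math.dist(q, reference[i]))
        best = max(votes, key=votes.get)
        # Score = share of the total vote captured by the winning label.
        results.append((best, votes[best] / sum(votes.values())))
    return results

# Toy shared-space coordinates for reference cells and two query cells.
reference = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ref_labels = ["HSC", "HSC", "HSC", "CD14 Mono", "CD14 Mono", "CD14 Mono"]
query = [[0.5, 0.5], [6.0, 6.0]]

annotations = transfer_labels(query, reference, ref_labels)
```

The first query cell sits inside the HSC group and receives a score close to 1, whereas the second lies between the two groups and receives a lower score, mirroring how a prediction score can flag ambiguous mappings.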
Our reference-derived annotations were concordant with the annotations accompanying the query dataset produced by the original authors (Supplementary Fig. 1b), but we found that bridge integration annotated additional rare and high-resolution subpopulations. For example, our annotations separated monocytes into CD14^{+} and CD16^{+} fractions, natural killer cells into CD56^{bright} and CD56^{dim} subgroups and cytotoxic T cells into CD8^{+} and mucosal-associated invariant T (MAIT) cell subpopulations. While these subdivisions were not identified in the unsupervised scATAC-seq analysis, we confirmed these predictions by observing differential accessibility at canonical loci after grouping by reference-derived annotations (Fig. 2d,e and Supplementary Fig. 1c). We validated these chromatin patterns using independent multiome datasets, where cell identity was assigned based on concurrent RNA measurements (Supplementary Fig. 1d,e). Similarly, bridge integration identified extremely rare groups of innate lymphoid cells (ILCs; 0.15%) and recently discovered AXL^{+}SIGLEC6^{+} dendritic cells (ASDCs^{45,46}; 0.10%; Fig. 2f and Supplementary Fig. 1c). To our knowledge, these cell populations have not been previously identified in scATAC-seq data. Again, we found that differentially accessible sites, such as an ASDC-specific peak in the SIGLEC6 gene (Fig. 2f), fully supported the accuracy of our mapping procedure.
By projecting datasets from multiple modalities into a common space, our reference-mapping procedure not only enables the transfer of discrete annotations but also allows us to explore how variation in one corresponds to variation in another. For example, after integration, we applied diffusion maps to the harmonized measurements to construct a joint differentiation trajectory spanning multiple progenitor states during myeloid differentiation (Fig. 2g). Because this trajectory represents both reference and query cells, we can explore how pseudotemporal variation in chromatin accessibility correlates with gene expression, even though the two modalities were measured in separate experiments.
Consistent with previous findings, we identified cases where gene expression changes ‘lagged’ behind variation in chromatin accessibility. For example, while myeloperoxidase (encoded by MPO) is expressed in granulocyte–macrophage progenitors (GMPs) and is associated with myeloid fate commitment^{47,48}, the regulatory region immediately upstream acquired accessibility in lymphoid-primed multipotent progenitors (LMPPs; Fig. 2h–j). We used a cross-correlation-based metric to systematically identify 236 ‘lagging’ loci (Supplementary Methods) across this trajectory. KEGG pathway enrichment analysis revealed a strong enrichment for genes involved in the cell cycle and DNA replication (Fig. 2k). These loci were characterized by accessible chromatin at the earliest stages of differentiation (HSCs), but there was a delay before the associated genes became transcriptionally active (Fig. 2l). The accessible state of these loci in the earliest progenitors may represent a form of priming to enable rapid cell cycle entry once the decision to differentiate has been made and may represent the type of discovery that can be enabled through integrative analysis across modalities.
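As an illustration of how a cross-correlation can expose a lag, the sketch below shifts an expression profile against an accessibility profile along binned pseudotime and reports the shift with the highest Pearson correlation. The binning, the `best_lag` helper and the toy profiles are assumptions; the metric used to call the 236 loci is defined in the Supplementary Methods and is not reproduced here.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def best_lag(accessibility, expression, max_lag):
    """Return the shift (in pseudotime bins) at which expression best tracks accessibility.

    A positive lag means expression changes follow accessibility changes.
    """
    scores = {}
    for lag in range(max_lag + 1):
        acc = accessibility[: len(accessibility) - lag]
        expr = expression[lag:]
        scores[lag] = pearson(acc, expr)
    return max(scores, key=scores.get), scores

# Toy profiles over ten pseudotime bins: expression rises two bins after accessibility.
accessibility = [0, 0, 1, 2, 3, 3, 3, 3, 3, 3]
expression = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]

lag, scores = best_lag(accessibility, expression, max_lag=4)
```

In this toy case the two profiles coincide after a shift of two bins, so the correlation peaks at a lag of 2.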
Robustness and benchmarking analysis
As our strategy relies on the ability of the dictionary to represent and reconstruct individual datasets, we explored how the size and composition of the multiomic dataset affected the accuracy of integration. We sequentially downsampled the multiomic dataset, repeated bridge integration and compared the results to our original findings. Downsampling the bridge generally returned results that were concordant with the full analysis but, as expected, could affect annotation concordance for rare cell types, which are most sensitive to downsampling (Fig. 3a). We found that if a bridge dataset contained at least 50 cells (‘atoms’) representing a given cell type, this was sufficient for robust integration. We note that this threshold is not a strict requirement; we found that integration can be successful for rare cell types, such as ASDCs, even when fewer than ten cells are present in the bridge, but we also observed failure modes in this regime. We note that generating bridge datasets consisting of more than 50 cells per subpopulation is quite feasible for many multiomic technologies and that our findings represent guidelines to assist in experimental design when performing multiomic experiments. Notably, we found that substantially altering the relative composition of cell types in the bridge dataset (while maintaining the minimum threshold) did not negatively affect performance, demonstrating that bridge integration can be successful even in cases where there are substantial compositional differences in the sample used to generate the multiomic bridge (Supplementary Fig. 2a,b).
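The 50-cell guideline can be turned into a simple pre-flight check on a candidate bridge dataset. The sketch below is a minimal example; the `underrepresented_cell_types` helper and the toy label counts are hypothetical, and the threshold is a guideline rather than a strict requirement, as noted above.

```python
from collections import Counter

def underrepresented_cell_types(bridge_labels, min_cells=50):
    """Return {cell type: count} for types with fewer than `min_cells` bridge cells."""
    counts = Counter(bridge_labels)
    return {cell_type: n for cell_type, n in counts.items() if n < min_cells}

# Hypothetical per-cell annotations of a bridge dataset.
bridge_labels = ["CD14 Mono"] * 400 + ["pDC"] * 60 + ["ASDC"] * 8

flagged = underrepresented_cell_types(bridge_labels)
# flagged == {"ASDC": 8}: mapping of ASDCs may still succeed, but is less reliable.
```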
We next compared the performance of bridge integration against two recently proposed methods for integrated analysis of multimodal and single-modality datasets. Both multiVI^{49} and Cobolt^{50} use variational autoencoders for integration, and while they do not explicitly treat multiomic datasets as a bridge, they aim to integrate datasets across technologies and modalities into a shared space. When applied to the previously described datasets, both methods were broadly successful in integrating scRNA-seq and scATAC-seq data but did not identify matches at the same level of resolution (for example, neither method successfully matched ASDCs in scATAC-seq data to the ASDCs in the Azimuth reference; Fig. 3b and Supplementary Fig. 2d–f). We also found that the latent space and neighbor relationships learned by bridge integration were most consistent with the labels originally assigned in the ATAC-seq analysis (Supplementary Fig. 2c). When comparing computational efficiency, bridge integration (0.8 h, not including 1.2 h of preprocessing time) and Cobolt (3.3 h) were the most efficient methods, while multiVI required more computational resources (15.7 h).
We next performed quantitative benchmarking of multiomic integration methods (bridge integration, Cobolt and multiVI) and also evaluated ‘bridge-free’ methods (Canonical Correlation-based Integration and LIGER), which perform integration on the basis of gene activity scores (Supplementary Methods). We found that our bridge integration most consistently and effectively matched cells in the same biological state across modalities (Fig. 3c and Supplementary Fig. 3a). Consistent with our previous results, we found that the strongest improvements were observed when mapping rare cell types, including plasma cells and DCs (Supplementary Fig. 3b). As our procedure is compatible with multiple integration techniques, we compared the performance of bridge integration when using either mnnCorrect^{39} or Seurat v3 (ref. ^{19}) for the final alignment step and observed very similar results (Supplementary Fig. 3a,b). We also computed additional metrics based on the cluster labels originally assigned based on the scRNA-seq measurements^{44} (Supplementary Table 1). In all cases, we consistently found that bridge integration exhibited superior performance.
As a second quantitative benchmark with ground truth data, we pursued a similar strategy using a recently published Paired-Tag dataset^{26}, where individual histone modification binding profiles via scCUT&Tag were simultaneously measured with RNA transcriptomes. We performed cross-modality integration between scRNA-seq and scCUT&Tag for active histone marks (H3K27ac), repressive histone marks (H3K27me3) and enhancer histone marks (H3K4me1). In each case, bridge integration successfully integrated cells across modalities and returned the highest Jaccard similarity and classification metrics between matched scRNA-seq and scCUT&Tag profiles (Fig. 3c, Supplementary Fig. 3d,e and Supplementary Table 1).
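For readers unfamiliar with the Jaccard similarity, the sketch below computes it between the neighbor sets that a cell obtains from each modality and averages over matched cells. How neighborhoods are defined for the benchmark is specified in the Supplementary Methods; the `mean_neighbor_jaccard` helper and the toy neighbor lists here are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_neighbor_jaccard(neighbors_x, neighbors_y):
    """Average Jaccard similarity over cells present in both neighbor maps."""
    shared = sorted(neighbors_x.keys() & neighbors_y.keys())
    if not shared:
        raise ValueError("no matched cells between the two neighbor maps")
    return sum(jaccard(neighbors_x[c], neighbors_y[c]) for c in shared) / len(shared)

# Toy neighbor lists for two cells measured in both modalities.
rna_neighbors = {"cell_a": ["n1", "n2", "n3"], "cell_b": ["n1", "n2"]}
cuttag_neighbors = {"cell_a": ["n2", "n3", "n4"], "cell_b": ["n2", "n1"]}

score = mean_neighbor_jaccard(rna_neighbors, cuttag_neighbors)
# cell_a shares 2 of 4 distinct neighbors (0.5); cell_b shares all (1.0); mean 0.75.
```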
To further demonstrate the flexibility of our approach, we used bridge integration to map and annotate an snmC-seq dataset, which measures DNA methylation profiles in single cells from the human cortex^{51}. As a reference, we used a dataset from the Allen Brain Atlas, which defines an expertly curated and multilevel cell ontology^{52} in the human cortex. Using an snmC2T-seq dataset, which simultaneously measures methylation and gene expression as a bridge^{28}, we were able to annotate the snmC-seq profiles with high confidence (Supplementary Fig. 3f). Even when our reference-derived annotations did not augment the resolution to unsupervised clustering of snmC-seq data, they did add substantial interpretability (Fig. 3d–f). For example, unsupervised clustering identified multiple populations of layer 6 (L6) neurons (labeled as L6-1, L6-2 and L6-3), but RNA-assisted annotation clearly labeled these clusters as either ‘near projecting’ or deep neocortical laminar 6b excitatory neurons (Fig. 3f).
Last, we aimed to characterize the performance of our method specifically in cases where the bridge dataset was missing specific cell populations or exhibited low data quality. Using the BMMC multiome benchmark dataset, we removed all plasmacytoid DCs (pDCs) from the multiomic dataset and repeated bridge integration. We found that this modification did not alter the annotations or confidence scores of non-pDCs in the query but that pDC query cells did exhibit a drop in annotation performance (94.4% annotated as pDCs using the full bridge and 83.5% annotated as pDCs using the depleted bridge dataset). However, we found that these query cells also exhibited a specific and sharp drop in prediction confidence (average prediction scores of 0.907 using the full bridge and 0.514 using the depleted bridge), demonstrating that our procedure correctly reduced the confidence of prediction when the underlying assumptions were not met. We repeated this analysis after separately depleting three additional cell populations (B cells, CD8^{+} T cells and CD14^{+} monocytes) and observed similar results (Supplementary Fig. 4a). Moreover, we found that substantially reducing bridge data quality by discarding unique molecular identifiers (UMIs; 86% downsampling to 750 RNA UMIs per cell or 70% downsampling to 2,500 ATAC fragments per cell) did not adversely affect integration, although we did observe performance reductions after further downsampling (Supplementary Fig. 4b,c).
Taken together, these results demonstrate the accuracy, robustness and flexibility of our bridge integration procedure. We demonstrate applications on multiple modalities and data types as well as best-in-class performance via quantitative and ground truth benchmark comparisons.
Using dictionary learning for massively scalable integration
The recent increase in publicly available single-cell datasets poses a challenge for integrative analysis. For example, multiple tissues have now been profiled across dozens of studies, representing hundreds of individuals and millions of cells. We refer to the challenge of harmonizing a broad swath (or the entirety) of publicly available single-cell datasets from a single organ as ‘community-wide’ integration. While a rich diversity of analytical methods can harmonize datasets of hundreds of thousands of cells, performing unsupervised ‘community-wide’ integration remains challenging, even when analyzing a single modality.
We were inspired by previous work on ‘geometric sketching’, which first selects a representative subset of cells (a ‘sketch’) across all datasets, integrates them and then propagates the integrated result back to the full dataset^{53}. This pioneering approach substantially improves the scalability of integration, as the heaviest computational steps are focused on subsets of the data. However, this approach is dependent on the results of principal-component analysis (PCA) that must first be performed on the full dataset. As datasets continue to grow in scale, performing dimensional reduction can become a limiting step. We aimed to devise a strategy that could integrate large compendiums of datasets, without ever needing to simultaneously analyze or perform intensive computation on the full set of cells.
We reasoned that dictionary learning could also enable efficient and large-scale integrative analysis. We first selected a representative sketch of cells (that is, 5,000 cells) from each dataset and treated these cells as atoms in a dictionary (Fig. 4a and Supplementary Methods). We next learned a dictionary representation, a weighted linear combination of atoms that can reconstruct the full dataset. These steps can occur for each dataset independently, allowing for efficient parallel processing. We then performed integration on the atoms from each dataset. This is the only step that simultaneously analyzes cells from multiple datasets, but because only the atoms are considered, this does not impose scalability challenges. Finally, we applied our previously learned dictionary representations to the harmonized atoms from each dataset individually and reconstructed harmonized profiles for the full dataset. We refer to this procedure as ‘atomic sketch integration’. We highlight that for this application, the ‘atoms’ used to reconstruct a dataset represent a subset of cells from the dataset itself. By contrast, in bridge integration, the atoms refer to cells from a different (multiomic) dataset.
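The four steps of atomic sketch integration (sketch, learn a dictionary representation, integrate atoms, reconstruct) can be sketched end to end on toy data. Every component below is a labeled stand-in: uniform random sampling replaces leverage-score sketching, Gaussian-kernel weights replace the learned dictionary representation and per-dataset mean alignment replaces a real integration method. The function names are hypothetical.

```python
import math
import random

def kernel_weights(cell, atoms, bandwidth=1.0):
    """Weights of one cell over the atoms (sum to 1); a stand-in for the learned representation."""
    d2 = [math.dist(cell, a) ** 2 for a in atoms]
    nearest = min(d2)  # subtract the minimum so the largest similarity is exactly 1
    sims = [math.exp(-(d - nearest) / (2 * bandwidth ** 2)) for d in d2]
    total = sum(sims)
    return [s / total for s in sims]

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def integrate_atoms(atom_sets):
    """Stand-in integration: shift each dataset's atoms onto the pooled mean."""
    global_mean = column_means([a for atoms in atom_sets for a in atoms])
    corrected = []
    for atoms in atom_sets:
        local_mean = column_means(atoms)
        corrected.append([[v - m + g for v, m, g in zip(a, local_mean, global_mean)]
                          for a in atoms])
    return corrected

def atomic_sketch_integration(datasets, sketch_size, seed=0):
    rng = random.Random(seed)
    atom_sets, weight_sets = [], []
    for cells in datasets:  # steps 1-2 run independently per dataset
        atoms = rng.sample(cells, sketch_size)  # 1. sketch (uniform sampling stand-in)
        atom_sets.append(atoms)
        weight_sets.append([kernel_weights(c, atoms) for c in cells])  # 2. representation
    corrected_sets = integrate_atoms(atom_sets)  # 3. the only joint step, atoms only
    harmonized = []
    for weights, atoms in zip(weight_sets, corrected_sets):  # 4. reconstruct all cells
        dims = len(atoms[0])
        harmonized.append([[sum(w * a[d] for w, a in zip(ws, atoms)) for d in range(dims)]
                           for ws in weights])
    return harmonized

# Two toy datasets with the same two cell groups; dataset B carries a batch shift of +10 in y.
dataset_a = ([[0.1 * i, 0.05 * i] for i in range(10)]
             + [[4 + 0.1 * i, 0.05 * i] for i in range(10)])
dataset_b = [[x, y + 10.0] for x, y in dataset_a]

harmonized_a, harmonized_b = atomic_sketch_integration([dataset_a, dataset_b], sketch_size=6)
```

Only the atoms (12 cells in this toy case) are ever analyzed jointly; the remaining cells are reconstructed from their own dataset's corrected atoms, which is the design choice that keeps the joint step small regardless of dataset size.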
The success of atomic sketch integration rests on identifying a representative subset of cells for each dataset. Sketching techniques for single-cell analyses aim to find subsamples that preserve the overall geometry of these datasets^{53,54,55}. These methods do not require a preclustering of the data but aim to ensure that the sketched dataset represents both rare and abundant cell states even after downsampling. Here, we perform sketching using a leverage score sampling-based strategy that has been proposed for large-scale information retrieval problems^{56} and can be rapidly and efficiently computed on sparse datasets. Leverage score-based sampling does not require performing PCA but maintains the ability to efficiently identify cells from rare subpopulations compared to geometric sketching techniques^{53} (Supplementary Fig. 5a,b). We emphasize that atomic sketch integration represents a general strategy for improving scalability that can be broadly coupled with existing methods. For example, a wide variety of integration techniques, including Harmony^{38}, Scanorama^{40}, mnnCorrect^{39}, scVI^{41} and Seurat^{19}, can be used to integrate the atom elements in each dictionary, with our procedure then enabling these results to be extended to full datasets.
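The intuition behind leverage score sampling is that cells from rare states contribute disproportionately to the structure of the data matrix and therefore receive high scores. The sketch below computes exact statistical leverage scores, l_i = x_i^T (X^T X)^{-1} x_i, for a tiny dense matrix and samples cells with probability weighted by those scores. This exact computation and the helper names are assumptions for illustration; it is not the efficient sparse-data procedure referred to above.

```python
import random

def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def leverage_scores(cells):
    """Exact leverage score of each row; assumes the feature columns are linearly independent."""
    dims = len(cells[0])
    gram = [[sum(c[i] * c[j] for c in cells) for j in range(dims)] for i in range(dims)]
    return [sum(x * z for x, z in zip(c, solve(gram, c))) for c in cells]

def leverage_sketch(cells, size, seed=0):
    """Sample `size` distinct cell indices, favoring high-leverage cells."""
    rng = random.Random(seed)
    scores = leverage_scores(cells)
    # Efraimidis-Spirakis keys give weighted sampling without replacement.
    keys = [(rng.random() ** (1.0 / s), i) for i, s in enumerate(scores) if s > 0]
    return sorted(i for _, i in sorted(keys, reverse=True)[:size])

# 98 cells of an abundant state and 2 cells of a rare state (toy two-feature profiles).
cells = [[1.0, 0.0]] * 98 + [[0.0, 1.0]] * 2

scores = leverage_scores(cells)
sketch = leverage_sketch(cells, size=10)
```

In this toy matrix each rare cell has a leverage score of 0.5 against roughly 0.01 for each abundant cell, so rare cells are far more likely to be retained than under uniform downsampling.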
Community-scale integration for human lung scRNA-seq
To demonstrate the potential of atomic sketch integration to perform ‘community-wide’ analysis, we first considered scRNA-seq datasets of the human lung. During the coronavirus disease 2019 (COVID-19) pandemic, there has been widespread scRNA-seq data collection from respiratory tissues, particularly by the Human Cell Atlas Lung Biological Network^{57}. Leveraging a recently published ‘database’ of scRNA-seq studies^{58} and a collection of openly released lung and upper airway datasets from the Human Cell Atlas (https://www.covid19cellatlas.org/index.healthy.html), we assembled a group of 19 datasets spanning a total of 1,525,710 individual cells. We created an atomic dictionary consisting of 5,000 cells from each dataset (95,000 total atoms), integrated these cells and reconstructed the full datasets. Our atomic sketch integration procedure performed all these steps (including preprocessing) in 55 min using a single computational core. We found that the integrated latent space preserved the neighbor relationships between cell types independently assigned in each dataset but also mixed cells across datasets (Supplementary Fig. 5c–e).
Our results exhibit the advantages of community-scale integration compared to individual analysis. First, by matching biological states across datasets and technologies, the integrated reference can help to standardize cell ontologies and naming schemes (Fig. 4b,c). When observing previously assigned annotations derived from each study, we found that matched cell populations were often assigned slightly different names (Supplementary Fig. 5f). We also identified cases where integrated annotations exhibited increased resolution compared to the original labels and verified that our higher-resolution annotations were supported by the expression patterns of reproducible gene expression markers (Supplementary Fig. 5g).
As a second benefit, we found that community-scale integration enabled consistent identification of ultrarare populations and, in particular, a population of Foxi1-expressing ‘pulmonary ionocytes’ that were recently discovered in both human and mouse lungs^{59} (Fig. 4d). While these cells were only independently annotated in 6 of 19 studies, our integrated analysis discovered at least one pulmonary ionocyte in 17 of 19 studies. The identified ionocytes were extremely rare (0.047%) but exhibited clear expression of canonical markers (Fig. 4c), highlighting the potential value for pooling multiple datasets to characterize these cells. We note that selection of dictionary atoms by sketching or leverage score sampling is essential for optimal performance (Supplementary Fig. 5h,i); repeating the analysis using a set of atoms determined by random downsampling successfully integrated abundant cell types but failed to integrate ionocytes, as they were not sufficiently represented in the dictionary.
Finally, we found that community-scale integration can substantially improve the identification of DE cell-type markers. The use of 19 study replicates specifically enables us to identify genes that show consistent patterns across laboratories and technologies, representing robust and reproducible markers. We grouped cells by both sample replicate and cell-type identity and performed differential expression on the resulting pseudobulk profiles (Fig. 4e and Supplementary Fig. 6). For example, we identified 116 positive markers for pulmonary ionocytes, representing one of the deepest transcriptional characterizations of this cell type. These markers included canonical markers, such as the transcription factor FOXI1, but also revealed clear ontology enrichments for ATPases (for example, ATP6V1G3 and ATP6V0A4) and chloride channels (for example, CLCNKA, CLCNKB and CFTR), supporting the role of these cells in regulating chemical concentrations in the lung (Fig. 4f). One advantage of working with pseudobulk values is increased quantification accuracy for genes expressed at low levels. Indeed, we repeatedly found that the top DE markers found using this strategy tended to capture more genes at a lower range of average expression values (Fig. 4g).
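Pseudobulk aggregation of the kind described above amounts to summing counts over all cells that share a sample and a cell-type label. The sketch below is a minimal example on toy counts; the `pseudobulk` helper, the sample names and the count values are hypothetical, and the downstream differential expression test is not shown.

```python
def pseudobulk(counts, samples, cell_types):
    """Sum per-cell count vectors within each (sample, cell type) group."""
    profiles = {}
    for row, sample, cell_type in zip(counts, samples, cell_types):
        key = (sample, cell_type)
        if key not in profiles:
            profiles[key] = [0] * len(row)
        profiles[key] = [total + c for total, c in zip(profiles[key], row)]
    return profiles

genes = ["FOXI1", "CFTR"]                      # columns of the count matrix
counts = [[2, 0], [1, 3], [0, 1], [4, 4]]      # one row per cell (toy values)
samples = ["s1", "s1", "s1", "s2"]
cell_types = ["ionocyte", "ionocyte", "basal", "ionocyte"]

profiles = pseudobulk(counts, samples, cell_types)
# profiles[("s1", "ionocyte")] == [3, 3]; each sample then acts as one replicate.
```

Summing within groups pools the sparse per-cell counts of lowly expressed genes, which is why pseudobulk values can quantify such genes more accurately.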
Community-scale integration of scRNA-seq and CyTOF
As a final demonstration, we considered a similar problem of community-wide integration for circulating human peripheral blood cells, which is one of the most widely profiled systems with diverse single-cell technologies. Exploring publicly available studies of either COVID-19 samples or healthy controls, we accumulated a collection of 14 studies with scRNA-seq measurements, representing a total of 3.46 million cells from 639 individuals. Data from 11 of the studies were obtained from a recently published collection of standardized single-cell sequencing datasets^{60}. We performed unsupervised atomic sketch integration, yielding a harmonized collection in which we annotated 30 cell states (Fig. 5a). We identified populations of activated granulocytes and B cells that were specific to COVID-19 samples (Supplementary Fig. 7a). Consistent with previous reports, monocytes in COVID-19 samples sharply upregulated the expression of interferon response genes^{61,62} but were correctly harmonized with healthy monocytes (Fig. 5b and Supplementary Fig. 7b). By matching shared cell types across disease states (while still allowing for the possibility of disease-specific subpopulations), this collection represents a valuable resource for identifying cell-type-specific transcriptional changes that reproduce across multiple studies. We characterized cell-type-specific responses for eight additional cell types, each of which exhibited a conserved interferon-driven response alongside the activation of cell-type-specific response genes (Supplementary Fig. 8).
While single-cell sequencing technologies are capable of measuring RNA transcripts and surface proteins in thousands of single cells, cytometry-based techniques can measure both extracellular and intracellular proteins in millions of cells. As our bridge integration procedure should enable the mapping of CyTOF profiles onto scRNA-seq datasets, we obtained a collection of CyTOF datasets spanning 119 individuals and a total of 5,170,249 cells^{63}. We used our previously collected CITE-seq dataset of 161,764 peripheral blood mononuclear cells (PBMCs) from healthy donors as a multiomic bridge^{4}. The CyTOF and CITE-seq datasets shared 30 cell surface protein features, while the CyTOF dataset also measured 17 unique proteins, which included intracellular targets that cannot be measured via CITE-seq.
Bridge integration annotated each CyTOF dataset with cluster labels derived from our scRNA-seq collection of 3.46 million cells and allowed us to infer intracellular protein levels for each of these clusters (Fig. 5c). Predicted regulatory CD4^{+} T cells expressed high levels of the transcription factor FOXP3 (ref. ^{64}), and effector T cells exhibited enriched KLRG1 levels^{65} (Fig. 5d). We also found that among cytotoxic lymphocyte populations, MAIT cells were uniquely depleted for expression of the cytotoxic protease granzyme B, consistent with previous reports^{66}. Each of these patterns supports the accuracy of our cross-modality mapping. Finally, we successfully annotated a rare population of ILCs (0.024%), which were not independently identified in the CyTOF dataset but correctly exhibited a CD25^{+}CD127^{+}CD161^{+}CD56^{−} immunophenotype^{4,67} (Fig. 5d,e). Taken together, we conclude that dictionary learning enhances the scalability of integration and the ability to integrate and compare diverse molecular modalities.
Discussion
To map datasets measuring a diverse set of modalities to scRNAseq reference datasets, we developed bridge integration, an approach for crossmodality alignment that leverages a multiomic dataset as a bridge. We characterize specific requirements for the bridge dataset and demonstrate the broad applicability of our method to a wide variety of technologies and modalities. Finally, we demonstrate how to use atomic sketch integration to extend the scalability of our approach to harmonize dozens of datasets spanning millions of cells.
We anticipate that our methods will be valuable to individual labs but also larger consortia that have already invested in constructing and annotating comprehensive scRNAseq references. For example, the Human Cell Atlas, Human Biomolecular Atlas Project, Tabula Sapiens^{68} and Human Cell Landscape^{69} have all released scRNAseq references spanning hundreds of thousands of cells for multiple human tissues. Similar efforts are present in model organisms as well, including the Fly Cell Atlas^{70} and Plant Cell Atlas projects^{71}. In each case, these efforts involve careful, collaborative and expertdriven cell annotation alongside the curation of reference cell ontologies. While repeating this manual effort for each modality is not feasible, bridge integration enables the mapping of new modalities without having to modify the reference. As additional multiomic datasets become available, we expect that tools such as Azimuth will also begin to map additional modalities.
We note that bridge integration is particularly well suited for experimental designs where multiomic technologies can be applied to only a subset of experimental samples because of their increased cost, lower throughput and reduced data quality. In particular, combinatorial indexing approaches can be readily applied to profile a single modality in hundreds of thousands of cells^{72,73} but are less readily applied to multiomic technologies. We propose that the collection of large singlemodality datasets, harmonized via a smaller but representative multiomic bridge, may represent an efficient and robust strategy to explore crossmodality relationships across millions of cells.
We note that future extensions of our work can further broaden the applicability of bridge integration or demonstrate its potential in new contexts. For example, performing bridge integration on spatially resolved unimodal datasets (for example, CODEX^{74}) could help to better characterize the spatial localization of scRNAseqdefined cell types in large tissue sections. New multiomic technologies that couple highresolution mass spectrometry imaging to singlecell or spatial transcriptomics could serve as a bridge to harmonize lipidomic and metabolic profiles^{75,76} with sequencingbased references. In addition, future computational improvements will further lower the requirements of the bridge dataset, enabling robust integration with an even smaller number of multiomic cells.
We emphasize the ability of bridge and atomic sketch integration to identify and characterize rare cell populations, including ASDCs and pulmonary ionocytes. Singlecell transcriptome profiling played an essential role in the initial discovery of these cell types, but a deeper understanding of their biological role and function will benefit from multimodal characterization. The goal of moving beyond an initial taxonomic classification of cell types toward a complete multimodal reference will not be accomplished with a single experiment or technology. We envision that computational tools for crossmodality integration will make key contributions to the construction of this map.
Methods
Bridge integration procedure
Our bridge integration procedure is designed to perform integration of singlecell datasets profiling different modalities by leveraging a separate multiomic dataset as a molecular bridge. The individual multiomic profiles each represent individual atoms, which together comprise a multiomic dictionary (that is, each cell in the bridge dataset represents an atom, and the entire bridge dataset represents a dictionary). This dictionary is used to transform both unimodal datasets into a shared space defined by the same set of features, facilitating crossmodality integration. Our approach consists of the following four broad steps described in detail below: (1) withinmodality harmonization of unimodal and bridge datasets, (2) construction of a dictionary representation for each unimodal dataset, (3) dimensional reduction via Laplacian eigen decomposition and (4) alignment of dictionary representations across datasets. We illustrate each step of the method in Fig. 1b using the same mathematical notations that we introduce below.
All methods are implemented in our opensource R package Seurat (www.satijalab.org/seurat and www.github.com/satijalab/seurat).
Withinmodality harmonization of unimodal and bridge datasets
The first step in our procedure is to harmonize the unimodal and bridge datasets based on shared modalities. For example, when performing bridge integration to map an scATACseq dataset onto an scRNAseq reference (via a 10x multiome bridge), we first harmonize the gene expression measurements from the scRNAseq and multiome experiments and the chromatin accessibility measurements from the scATACseq and multiome experiments. Specifically, we define the following:
\(X \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times d_{{\mathrm{genes}}}}\) is the scRNAseq expression counts matrix,
\(Y \in {\Bbb R}^{n_{{\mathrm{scATAC-seq}}} \times d_{{\mathrm{peaks}}}}\) is the scATACseq accessibility counts matrix,
\(M = [M_X\;M_Y]\) is the multiomic expression + accessibility counts matrix, where
\(M_X \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times d_{{\mathrm{genes}}}}\) is the scRNAseq subset of the multiomic matrix and
\(M_Y \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times d_{{\mathrm{peaks}}}}\) is the scATACseq subset of the multiomic matrix.
Our goal is to harmonize X with M_{X} and Y with M_{Y}. This can be performed with a wide variety of existing tools for the harmonization of singlecell datasets. For example, Seurat, Harmony, LIGER, scVI, Scanorama, fastMNN and scArches all learn a shared lowdimensional space that jointly represents the datasets and aligns cells in a matched biological state. Our goal is therefore to learn
\(X^ \ast \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times d_{{\mathrm{RNA}}}}\), harmonized space for scRNAseq data,
\(Y^ \ast \in {\Bbb R}^{n_{{\mathrm{scATAC-seq}}} \times d_{{\mathrm{ATAC}}}}\), harmonized space for scATACseq data, and
\(M^ \ast = [M_X^ \ast \;M_Y^ \ast ]\), harmonized space for the multiomic data, where
\(M_X^ \ast \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times d_{{\mathrm{RNA}}}}\) is the harmonized space for the scRNAseq subset of the multiomic dataset and
\(M_Y^ \ast \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times d_{{\mathrm{ATAC}}}}\) is the harmonized space for the scATACseq subset of the multiomic dataset.
In this work, we treat the scRNAseq dataset X as a reference and map the multiomic gene expression profiles (M_{X}) onto this reference using the FindTransferAnchors and MapQuery functions in Seurat to obtain X^{*} and \(M_X^ \ast\). An example workflow is provided at https://satijalab.org/seurat/articles/integration_mapping.html (‘Mapping and Annotating Query Datasets’).
The same functionality has been implemented in the Signac package for the mapping and harmonization of scATACseq datasets (https://satijalab.org/signac/articles/integrate_atac.html). However, we emphasize that our approach is compatible with a wide variety of preexisting approaches for withinmodality harmonization, including all the methods listed above.
We also note that when finding anchors between the bridge and query datasets, we can leverage the multimodal nature of the bridge dataset to perform ‘supervised’ dimensional reduction, which uses both modalities when calculating a lowdimensional representation during harmonization. For example, we have previously described the use of ‘supervised PCA’ to learn optimized transformations from CITEseq data^{4,77}. When working with bridge datasets that measure ATACseq or CUT&Tag chromatin features (for example, PairedTag and 10x multiome), we use an analogous procedure for supervising the latent semantic indexing reduction.
Construction of a dictionary representation for each unimodal dataset
The goal of dictionary learning is to reconstruct individual data points as a weighted linear combination of atoms in a dictionary. We treat M^{*} as a dictionary, with each row of this matrix representing an atom. We aim to learn reconstructions of X^{*} and Y^{*} based on the atoms of M^{*} while minimizing the error between the original and reconstructed values. Specifically, we aim to identify the matrices D_{X} and D_{Y}, where
\(D_X \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times n_{{\mathrm{multiomic}}}}\) is the dictionary representation of the scRNAseq dataset, and
\(D_Y \in {\Bbb R}^{n_{{\mathrm{scATAC-seq}}} \times n_{{\mathrm{multiomic}}}}\) is the dictionary representation of the scATACseq dataset, such that
\(D_X = \mathop {\arg \min }\limits_D \left\| X^ \ast - DM_X^ \ast \right\|_F^2\)
and
\(D_Y = \mathop {\arg \min }\limits_D \left\| Y^ \ast - DM_Y^ \ast \right\|_F^2\).
As described in refs. ^{56,78}, this optimization problem is analogous to matrix regression and has a closedform solution for calculating D_{X} and D_{Y},
\(D_X = X^ \ast (M_X^ \ast )^\dagger\) and \(D_Y = Y^ \ast (M_Y^ \ast )^\dagger\),
where † represents the pseudoinverse of the matrix.
We note that D_{X} and D_{Y} represent transformations of the original scRNAseq and scATACseq datasets. While the two experiments originally measured different sets of features, after the transformation, they now are represented by the same set of features, namely, the atoms of the multiomic experiment.
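As a minimal sketch of this step, the following NumPy example computes dictionary representations with the pseudoinverse. The matrices are random stand-ins for the harmonized spaces, the sizes are arbitrary and the function name dictionary_representation is hypothetical; this illustrates the linear algebra only and is not the Seurat implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): cells per dataset and harmonized dimensions.
n_rna, n_atac, n_multi, d_rna, d_atac = 200, 150, 40, 10, 8

# Random stand-ins for the harmonized spaces X*, Y*, M_X* and M_Y*.
X_star = rng.normal(size=(n_rna, d_rna))
Y_star = rng.normal(size=(n_atac, d_atac))
M_X_star = rng.normal(size=(n_multi, d_rna))
M_Y_star = rng.normal(size=(n_multi, d_atac))

def dictionary_representation(query, atoms):
    """Least-squares weights D such that D @ atoms approximates query."""
    return query @ np.linalg.pinv(atoms)

D_X = dictionary_representation(X_star, M_X_star)
D_Y = dictionary_representation(Y_star, M_Y_star)

# Both datasets are now described by the same n_multi features (the atoms).
print(D_X.shape, D_Y.shape)
```

In this toy setting the bridge has more atoms than harmonized dimensions, so D_X @ M_X_star reproduces X_star up to numerical precision; with real data the reconstruction is a least-squares approximation.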
Dimensional reduction via Laplacian eigen decomposition
After the datasets have been transformed in the previous step, it is possible to integrate them directly. The dimensionality of the datasets is based on the number of cells in the multiomic dataset. Unlike the original measurements, the dictionary representations are not sparse. As multiomic datasets often consist of thousands of cells, working with highdimensional and nonsparse dictionary representations is computationally inefficient. We therefore aimed to reduce the dimensionality of the dictionary representation. Motivated by a similar problem addressed by Laplacian eigenmaps^{42}, a nonlinear dimensionality reduction technique, we perform dimensionality reduction by computing an eigen decomposition of the graph Laplacian matrix. Unlike a PCA, which aims to identify lowdimensional representations that preserve data variance, Laplacian eigenmaps represent a lowdimensional reduction that optimally preserves the graphdefined local neighbor relationships^{42}.
We first compute a graph representation of the multiomic dataset M^{*}. We use a ‘shared nearest neighbor’ graph representation, as proposed by Levine et al.^{79} for clustering singlecell datasets. We note that the matrix representation of this graph is symmetric, which is a requirement for downstream eigen decomposition. Our approach is compatible with any userdefined distance metric when constructing this graph, although we recommend using either the Euclidean distance based on harmonized gene expression measurements (that is, \(M_X^ \ast\)) or, alternately, a weighted combination of modalities using the ‘weighted nearest neighbor’ distance metric that we have previously introduced^{4}. We define
\(G \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times n_{{\mathrm{multiomic}}}}\) as the symmetric graph representation of the multiomic dataset and
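\(L = I - D^{ - \frac{1}{2}}GD^{ - \frac{1}{2}}\) as the graph Laplacian matrix, where I is the identity matrix and D is the diagonal degree matrix of G (distinct from the dictionary representations D_{X} and D_{Y}).
We next perform an eigen decomposition of the graph Laplacian matrix:
\(L = U\Lambda U^{\mathrm{T}}\).
Here, the columns of U are the eigenvectors of L, and Λ is the diagonal matrix of eigenvalues. U_{L} is the leftmost n_{Laplacian} eigenvectors of U, where n_{Laplacian} specifies the reduced dimensionality of the dataset. We select n_{Laplacian} = 50 for all examples in this work.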
We now multiply the learned dictionary representations for the scRNAseq and scATACseq datasets by this truncated set of eigenvectors. Doing so transforms these representations into the same lowerdimensional space (n_{Laplacian}). We define
\(L_X \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times n_{{\mathrm{Laplacian}}}}\) as the reduced dictionary representation for the scRNAseq data,
\(L_Y \in {\Bbb R}^{n_{{\mathrm{scATAC-seq}}} \times n_{{\mathrm{Laplacian}}}}\) as the reduced dictionary representation for the scATACseq data and
\(L_M \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times n_{{\mathrm{Laplacian}}}}\) as the reduced dictionary representation for the multiomic dataset
and calculate the following matrices:
\(L_X = D_XU_L\)
and
\(L_Y = D_YU_L\).
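The dimensional reduction described in this section can be sketched as follows. The example assumes that the 'leftmost' eigenvectors are those with the smallest eigenvalues, uses a toy Gaussian affinity graph in place of the shared nearest neighbor graph and uses a random matrix in place of a real dictionary representation; the function name laplacian_eigenvectors is hypothetical.

```python
import numpy as np

def laplacian_eigenvectors(G, n_laplacian):
    """Eigenvectors of L = I - D^{-1/2} G D^{-1/2} with the smallest eigenvalues."""
    degree = G.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L = np.eye(G.shape[0]) - d_inv_sqrt[:, None] * G * d_inv_sqrt[None, :]
    eigenvalues, U = np.linalg.eigh(L)      # eigh returns ascending eigenvalues
    return eigenvalues[:n_laplacian], U[:, :n_laplacian]

rng = np.random.default_rng(1)
n_multi, n_rna, n_laplacian = 60, 100, 5

# Toy symmetric graph: Gaussian affinities between random bridge cells
# (a stand-in for the shared nearest neighbor graph described in the text).
points = rng.normal(size=(n_multi, 3))
sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-sq_dist)

eigenvalues, U_L = laplacian_eigenvectors(G, n_laplacian)

# Project a (random, stand-in) dictionary representation onto the truncated basis.
D_X = rng.normal(size=(n_rna, n_multi))
L_X = D_X @ U_L
print(L_X.shape)
```

The projection reduces each cell from n_multi dense features to n_laplacian features, which is the source of the computational savings described above.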
Alignment of dictionary representations across datasets
Both the scRNAseq and scATACseq datasets have now been transformed into a lowdimensional space defined by the same set of features. They can now be directly harmonized using existing methods. As in step 1, multiple published methods can accomplish this goal. In this work, we use our internal implementation of the mnnCorrect integration technique to perform this harmonization^{39}. We choose mnnCorrect, as we find that after performing the steps described above, any remaining samplespecific differences are minor and are typically far less than the differences we observe when aligning scRNAseq datasets across different technologies. To demonstrate the compatibility of our approach with alternative methods, we also repeat our quantitative benchmarking experiments using our previously developed integration workflow in Seurat v3 (ref. ^{19}) and observe very similar results (Supplementary Fig. 3).
Specifically, the final output of our procedure represents
\(L_X^ \ast \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times n_{{\mathrm{Laplacian}}}}\) as the harmonized reduced dictionary representation for the scRNAseq data,
\(L_Y^ \ast \in {\Bbb R}^{n_{{\mathrm{scATAC-seq}}} \times n_{{\mathrm{Laplacian}}}}\) as the harmonized reduced dictionary representation for the scATACseq data and
\(L_M^ \ast \in {\Bbb R}^{n_{{\mathrm{multiomic}}} \times n_{{\mathrm{Laplacian}}}}\) as the harmonized reduced dictionary representation for the multiomic dataset.
These representations can be used as input for common downstream analytical tasks, including tdistributed stochastic neighbor embedding (tSNE) or UMAP visualization, graphbased clustering and the identification of developmental trajectories.
Atomic sketch integration
Our approach consists of four steps: (1) for each dataset, sample a representative subset of cells (atoms) that span both rare and abundant populations; (2) for each dataset, learn a dictionary representation to reconstruct each cell based on the atoms; (3) integrate the atoms from each dataset and (4) for each dataset, reconstruct each cell from the integrated atoms. Each step is described in detail below. We note that steps 1, 2 and 4 are performed on each dataset individually, and step 3 only requires performing joint computation on the downsampled set of atoms. Therefore, our procedure never requires loading or processing the entirety of the datasets at one time. Our approaches should therefore successfully extend to and beyond the analysis of 100,000,000 cells, which is now an achievable scale for combinatorial barcoding technologies.
All methods are implemented in our opensource R package Seurat (www.satijalab.org/seurat, www.github.com/satijalab/seurat).
Sample a representative subset of cells (‘atoms’) from each dataset
Our first step is to selectively downsample the cells in each dataset, aiming to identify a reduced set of cells that are representative of the full dataset. In particular, we aim to ensure that rare populations continue to be represented even after downsampling. We also aim to identify cell subsets in a computationally efficient manner and to minimize any computation that must be performed on the full dataset before downsampling. We aim to select a subset of k cells from each dataset, each of which is referred to as an atom. In this manuscript, we use k = 5,000, unless otherwise noted.
We define
\(X \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times d_{{\mathrm{genes}}}}\) as the count matrix for scRNAseq and
\(S \in {\Bbb R}^{k \times n_{{\mathrm{scRNA-seq}}}}\) as the sampling matrix for the dataset; each row is a one-hot vector indicating which cell is selected (that is, \(s_{j,i} = 1\) if cell i is the jth cell to be selected; \(i = 1,2,\ldots,n_{{\mathrm{scRNA-seq}}}\) and \(j = 1,2,\ldots,k\)).
\(SX \in {\Bbb R}^{k \times d_{{\mathrm{genes}}}}\) is the scRNAseq matrix after downsampling to the k cells selected. We also call this matrix A, as it represents the ‘atoms’ selected from the original dataset.
We can use a variety of techniques to define the sketching matrix S. These include geometric sketching techniques, such as geosketch^{53} or Hopper^{54}, or fast clustering procedures, such as minibatch kmeans^{55} followed by clusterinformed downsampling.
In this work, we select cells based on their statistical leverage scores, a method for selecting influential data points in a dataset. In the context of linear regression, statistical leverage represents the influence of an individual data point in determining the best leastsquares fit. In this context, cells with high leverage scores will tend to make the largest contribution to the gene covariance matrix and, therefore, reflect the importance of the cell’s profile. The exact statistical leverage score for a cell can be computed via an eigen decomposition of the X matrix, but this is computationally inefficient. As an alternative, Clarkson and Woodruff^{56} propose a randomized algorithm that efficiently computes a fast approximation of statistical leverage^{56}. This algorithm is attractive for singlecell sequencing analysis as it is highly scalable and runs efficiently on sparse datasets. Briefly, the algorithm amounts to constructing a ‘randomized’ sketch of the input matrix based on the Johnson–Lindenstrauss lemma and computing the Euclidean norms of the rows of that sketch. The algorithm is fully described in Clarkson and Woodruff^{56}, but we note the key mathematical steps below.
For the randomized sketching matrix, we use the sparse random CountSketch matrix C, which consists of 0, 1 and –1 elements and is defined in ref. ^{80}.
\(C \in {\Bbb R}^{c \times n_{{\mathrm{scRNA-seq}}}}\) is the sparse randomized CountSketch matrix.
We then perform a QR decomposition of the sketched matrix,
\(CX = QR\).
We then apply a fast Johnson–Lindenstrauss transformation using a very sparse random projection matrix ∏^{81}. We calculate this matrix using the RandPro package^{82} in R (‘li’ projection function),
\(Z = XR^{ - 1}\Pi\).
We can now calculate the leverage score for each cell, which is the squared Euclidean norm of the corresponding row of the Z matrix. We can also calculate a sampling probability for selecting each cell i as an atom based on the leverage scores.
\(l_i = \left\| {Z\,[i,]} \right\|_2^2\) is the leverage score for cell i, and
\(p_i = \frac{{l_i}}{{\mathop {\sum}\nolimits_{j = 1}^n {l_j} }}\) is the probability of selecting cell i as an atom.
Finally, we sample k cells as atoms based on these probabilities. As described above, this procedure results in a downsampled dataset in which only the atoms remain, which we name A.
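The leverage-based sampling described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the Seurat or RandPro implementation: the function name approximate_leverage_scores, the sketch sizes c and jl_dim, the sparsity parameter of the projection and the choice to sample without replacement are all assumptions, and the input is a small dense toy matrix.

```python
import numpy as np

def approximate_leverage_scores(X, c, jl_dim, rng):
    """Approximate row leverage scores of X via CountSketch, QR and a sparse projection."""
    n, d = X.shape
    # CountSketch C (c x n): every cell is assigned to one random row with a random sign.
    rows = rng.integers(0, c, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    CX = np.zeros((c, d))
    np.add.at(CX, rows, signs[:, None] * X)      # equals C @ X without forming C
    _, R = np.linalg.qr(CX)                      # CX = QR, with R of size d x d
    # Very sparse random projection with entries in {-1, 0, +1} (assumed sparsity sqrt(d)).
    s = np.sqrt(d)
    Pi = rng.choice([-1.0, 0.0, 1.0], size=(d, jl_dim),
                    p=[0.5 / s, 1.0 - 1.0 / s, 0.5 / s]) * np.sqrt(s / jl_dim)
    Z = X @ np.linalg.solve(R, Pi)               # Z = X R^{-1} Pi
    return np.sum(Z ** 2, axis=1)                # squared row norms of Z

rng = np.random.default_rng(2)
n_cells, n_genes, k = 500, 20, 50
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)   # toy count matrix

leverage = approximate_leverage_scores(X, c=200, jl_dim=10, rng=rng)
probabilities = leverage / leverage.sum()

# Sample k atoms with probability proportional to leverage (without replacement here).
atoms = rng.choice(n_cells, size=k, replace=False, p=probabilities)
A = X[atoms]
print(A.shape)
```

Only the sketched matrix CX is decomposed, so the cost of the QR step depends on c and the number of features rather than on the number of cells.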
Learn a dictionary representation to reconstruct each cell based on the atoms
We aim to learn reconstructions of X based on the atoms of A while minimizing the error between the original and reconstructed values. Specifically, we aim to identify the matrix D, where
\(D \in {\Bbb R}^{n_{{\mathrm{scRNA-seq}}} \times k}\) is the dictionary representation of the scRNAseq dataset
such that
\(D = \mathop {\arg \min }\limits_{D'} \left\| X - D'A \right\|_F^2\).
As described previously, this optimization problem is analogous to matrix regression and has a closedform solution for calculating D,
\(D = XA^\dagger\),
where † represents the pseudoinverse of the matrix.
Integrate the atoms from each dataset
Let \(i = 1,2,\ldots,n_{{\mathrm{dataset}}}\) represent the datasets to be integrated, and let A_{i} represent the matrix of atoms that result from downsampling dataset i. Our goal is to harmonize the set of matrices \([A_1,A_2,\ldots,A_{n_{{\mathrm{dataset}}}}]\).
This can be performed with a wide variety of existing tools for the harmonization of singlecell datasets. For example, Seurat, Harmony, LIGER, scVI, Scanorama, fastMNN and scArches all learn a shared lowdimensional space that jointly represents the datasets and aligns cells in a matched biological state. Our goal is therefore to learn
\(\left[ {A_1^ \ast ,A_2^ \ast ,\ldots,A_{n_{{\mathrm{dataset}}}}^ \ast } \right]\),
where \(A_i^ \ast \in {\Bbb R}^{k \times d_{{\mathrm{RNA}}}}\) is the harmonized space for the atoms of scRNAseq dataset i.
In this manuscript, we use our previously developed anchorbased workflow to integrate the datasets using reciprocal PCA, which is optimized for integration tasks with large numbers of samples and cells (‘fast integration using reciprocal PCA’ at https://satijalab.org/seurat/articles/integration_rpca.html). The integration procedure returns a lowdimensional space that jointly represents atoms from all datasets.
Reconstruct each cell from the integrated atoms
The last step is performed individually for each dataset. Let \(i = 1,2,\ldots,n_{{\mathrm{dataset}}}\) represent the datasets to be integrated, and let X_{i} represent the full scRNAseq count matrix representing dataset i.
We reconstruct integrated values for each cell in dataset i using the previously computed dictionary representation for the dataset, D_{i}, along with the harmonized space \(A_i^ \ast\),
\(X_i^ \ast = D_iA_i^ \ast\).
The collection of matrices \(\left[ {X_1^ \ast ,X_2^ \ast ,\ldots,X_{n_{{\mathrm{dataset}}}}^ \ast } \right]\) now represents a lowdimensional space that jointly represents all cells from all datasets. Because these matrices are low dimensional, each of them can be simultaneously loaded into memory. These representations can be used as input for common downstream analytical tasks, including tSNE or UMAP visualization, graphbased clustering and the identification of developmental trajectories.
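The four steps of atomic sketch integration can be sketched end to end on toy data. In this example, uniform sampling stands in for leverage-based sampling and a shared linear projection of the pooled atoms stands in for a real integration method such as reciprocal PCA; both substitutions, and all sizes, are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, d_low = 12, 40, 4
sizes = [300, 500]                      # cells per toy dataset

datasets = [rng.normal(size=(n, d)) for n in sizes]

# Step 1 (simplified): uniform sampling stands in for leverage-based sampling.
atom_idx = [rng.choice(n, size=k, replace=False) for n in sizes]
atoms = [X[idx] for X, idx in zip(datasets, atom_idx)]

# Step 2: dictionary representation D_i = X_i A_i^+ (n_i x k), per dataset.
dictionaries = [X @ np.linalg.pinv(A) for X, A in zip(datasets, atoms)]

# Step 3 (placeholder): a shared linear projection of the pooled atoms stands in
# for a real integration method; only the atoms are processed jointly.
_, _, Vt = np.linalg.svd(np.vstack(atoms), full_matrices=False)
W = Vt[:d_low].T
atoms_star = [A @ W for A in atoms]

# Step 4: reconstruct every cell from the integrated atoms, X_i* = D_i A_i*.
cells_star = [D @ A_star for D, A_star in zip(dictionaries, atoms_star)]
print([X_star.shape for X_star in cells_star])
```

Because the placeholder integration is linear, the reconstructed values equal the projection of the full dataset, which makes the example easy to check. A real integration method applies a nonlinear correction to the atoms, and the dictionary representation then carries that correction to every cell.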
Preprocessing details for each dataset
Adult mouse frontal cortex and hippocampus PairedTag dataset
The datasets from Zhu et al.^{26} were generated with PairedTag, which performs simultaneous profiling of histone modifications and cellular transcriptomes and contains a total of 64,849 nuclei. We extracted three datasets for the histone modifications H3K27ac, H3K4me1 and H3K27me3. We used the gene expression matrices as quantified in the original experiment. For each epigenetic modification, the original manuscript quantified read densities in 5,000 bins. These were aggregated into larger peaks using the CombineTiles function in Signac, and aggregated peaks less than 1 megabase in size were retained. We retained cells with total RNA counts between 500 and 10,000. We applied SCTransform to normalize the gene expression data and TFIDF to normalize the histone modification data. We used PCA (dimensions 1:30) and TFIDF (dimensions 2:30, excluding the first dimension, as this is typically correlated with technical metrics in ATACseq or scCUT&Tag data) to reduce the dimensionality of the RNA and histone modification modalities and construct the weighted nearest neighbor (WNN) graph.
Data acquisition source: Gene Expression Omnibus (GEO), accession number GSE152020.
Human frontal cortex snmCseq data
This human frontal cortex dataset is an snmCseq dataset from Luo et al.^{51} and contains 2,784 nuclei. We used the nonCG methylation 100,000kb bin count matrices as quantified in the original experiment. We applied SCTransform^{83} to normalize the gene expression data and log normalization to normalize the methylation data. Because this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the methylation data.
Data acquisition source: GEO, accession number GSE97179 (https://brainome.ucsd.edu/annoj/brain_single_nuclei/).
Human frontal cortex snmC2Tseq data
This human frontal cortex dataset is an snmC2Tseq dataset from Luo et al.^{28} and contains 4,357 nuclei. We used the nonCG methylation 100,000kb bin count matrices as quantified in the original experiment. We applied SCTransform to normalize the gene expression data and log normalization to normalize the methylation data. We used PCA to reduce the dimensionality to 30 for both datasets and construct the WNN graph.
Data acquisition source: GEO, accession number GSE140493.
BMMC multiome
We collected a total of ten 10x multiome datasets from the NeurIPS Multimodal SingleCell Data Integration challenge website, representing 32,368 paired singlenucleus profiles of transcriptome and chromatin accessibility. We retained cells with total RNA counts between 1,000 and 10,000 and with total ATAC peak counts between 2,000 and 30,000. We applied SCTransform to normalize the gene expression data and TFIDF to normalize ATAC peak counts. We used PCA (dimensions 1:40) and TFIDF (dimensions 2:40) to reduce the dimensionality of each modality and construct the WNN graph.
Data acquisition source: https://openproblems.bio/competitions/neurips_2021/.
Human BMMC ATACseq
This human bone marrow dataset is an snATACseq dataset from Granja et al.^{43}. As the reads were originally mapped to hg19, we used cellrangeratac v2 to remap fastq files to hg38. In each cell, we quantified the same set of peaks that were detected in the BMMC multiome dataset. After removing lowquality cells, 26,159 cells were retained, with total ATAC peaks of <50,000 and >2,000. We applied TFIDF to normalize the ATACseq data. As this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the ATACseq data.
Data acquisition source: GEO, accession number GSE139369.
Human PBMC scRNA
This human PBMC scRNA dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/resources/datasets/) and consists of 33,015 cells. We retained cells with total RNA counts between 400 and 10,000. We applied log normalization for the gene expression matrix. We annotated these cells by mapping them to the Azimuth PBMC reference with the Seurat4 referencemapping framework and refined the annotations by de novo clustering. These data were used for sketching benchmark analysis (Supplementary Fig. 5).
Data acquisition source: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/pbmc33k.
Human PBMC multiome
This human PBMC multiome (RNA + ATAC) dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/resources/datasets/) and consists of 10,970 cells. We retained cells with total RNA counts between 500 and 10,000 and total ATAC peak counts between 2,000 and 100,000. We applied SCTransform to normalize the gene expression data and TFIDF to normalize ATAC peak counts. We annotated these cells by mapping the RNA profile to the Azimuth PBMC RNA reference with the Seurat4 referencemapping framework. We then used these annotations to create ATACseq tracks, shown in Supplementary Fig. 2d.
Data acquisition source: https://www.10xgenomics.com/resources/datasets/10khumanpbmcsmultiomev10chromiumx1standard200.
Human CD34^{+} bone marrow multiome
This human CD34^{+} bone marrow multiome (RNA + ATAC) dataset was obtained from Persad et al.^{84} and consists of 13,398 cells from two replicates. We retained cells with total RNA counts between 500 and 30,000 and total ATAC peak counts between 1,000 and 100,000. We used the same normalization method used for the human PBMC multiome. We annotated these cells by mapping the RNA profile to the Azimuth RNA BMMC reference with the Seurat4 referencemapping framework. When using the human PBMC multiome dataset, we did not observe sufficient numbers of ASDCs to create a chromatin track for this dataset. However, we identified 12 cells annotated as ASDCs in these CD34^{+} bone marrow data. We used these cells to generate a chromatin track for the SIGLEC6 locus (Supplementary Fig. 2e), which validates our predicted ASDC identified via bridge integration.
Data acquisition source: https://zenodo.org/record/6383269.
Human PBMC CyTOF dataset
This human PBMC CyTOF dataset was generated by the COVID19 Multiomics Blood Atlas (COMBAT) consortium and consists of 7.11 million cells with a panel of 47 antibodies. We removed cells from individuals with sepsis, yielding a remainder of 5.17 million cells. We used the normalized expression matrices as quantified in the original study. As this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the protein data.
Data acquisition source: https://zenodo.org/record/5139561.
Azimuth reference
Azimuth scRNAseq references for human bone marrow (297,627 cells) and the human motor cortex (159,738 cells) were downloaded from the HuBMAP portal. The portal includes descriptions of each public data source used when compiling the reference dataset and a link to a GitHub repository and Docker Hub to reproduce the construction of the reference.
Data acquisition source: https://azimuth.hubmapconsortium.org.
Lung scRNAseq dataset atlas
Nineteen datasets profiling human lung samples using scRNAseq were downloaded from publicly available sources (links for each source dataset are provided in Supplementary Table 2). Lowquality cells were filtered using uniform quality control thresholds; cells with RNA counts between 300 and 100,000 and with mitochondrial read percentages below 20% were retained. Normalization was performed using log normalization implemented in Seurat. We used PCA (dimensions 1:40) to reduce the dimensionality of each dataset.
Data acquisition source: Supplementary Table 2 and lung scRNA datasets^{68,69,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101}.
PBMC COVID scRNAseq dataset atlas
Fourteen datasets profiling human PBMC samples using scRNAseq were downloaded from publicly available sources (links for each source dataset are provided in Supplementary Table 2). Eleven of these datasets had been previously organized in Tian et al.^{60}. Lowquality cells were filtered using uniform quality control thresholds; cells with RNA counts between 150 and 150,000 and with mitochondrial read percentages below 15% were retained. Normalization was performed using log normalization implemented in Seurat. We used PCA (dimensions 1:40) to reduce the dimensionality of each dataset.
Data acquisition source: Supplementary Table 2 and PBMC scRNA datasets^{4,62,63,102,103,104,105,106,107,108,109,110,111,112}.
Differentiation trajectory and pseudotime analysis
In Fig. 2, we identify a myeloid differentiation trajectory and pseudotime ordering of cells that describes both reference (scRNAseq) and query (scATACseq) cells. We extracted reference cells belonging to HSC, LMPP, GMP and CD14^{+} monocyte populations and query cells that mapped to any of these subsets after bridge integration. We next constructed a knearest neighbor (KNN) graph representing cells from both modalities using the latent space learned during the bridge integration procedure. This graph was used as input to the destiny package, which reduces the dimensionality of the data using diffusion maps^{113}. We note that as we manually selected cell populations that are known to encompass monocytic differentiation, we did not expect or observe branching events. We used the first two diffusion map coordinates as input to monocle3 (ref. ^{114}) to infer a pseudotemporal ordering.
We next aimed to identify cases where dynamic gene expression patterns ‘lag’ behind the accessibility dynamics of nearby regulatory regions. We can perform this analysis because our pseudotemporal ordering encompasses both scATACseq and scRNAseq cells. We first associated each scATACseq peak with a gene using the ClosestFeature function in Signac. For each gene, we next smoothed the expression profile along the learned trajectory using the ksmooth function (‘stats’ package in R^{115}) using 1,000 intervals and a bandwidth of 0.01. We repeated the same process for the accessibility of each peak linked to this gene (bandwidth of 0.05). We next calculated the crosscorrelation of the smoothed expression and accessibility values, which measures the similarity for the two time series and calculates the optimal displacement of one relative to the other. We used the ccf function (‘stats’ package in R^{115}) and identified a total of 574 gene–peak pairs with a crosscorrelation of >0.6. Of these, we identified 236 cases exhibiting an optimal displacement of >0.01 (we illustrate 6 such cases in Fig. 2l).
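The lag analysis can be illustrated with a small NumPy sketch. The box-kernel smoother and the lagged Pearson correlation below are simplified stand-ins for the ksmooth and ccf functions in R, the function names are hypothetical and the two sigmoid curves are synthetic; the lag convention (correlation of x[t + k] with y[t]) follows the one documented for ccf.

```python
import numpy as np

def box_smooth(t, y, grid, bandwidth):
    """Box-kernel average of y over the window |t - g| <= bandwidth / 2."""
    weights = (np.abs(grid[:, None] - t[None, :]) <= bandwidth / 2).astype(float)
    return weights @ y / weights.sum(axis=1)

def cross_correlation(x, y, max_lag):
    """Pearson correlation of x[t + k] with y[t] for k in [-max_lag, max_lag]."""
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    values = []
    for k in lags:
        a, b = (x[k:], y[:n - k]) if k >= 0 else (x[:n + k], y[-k:])
        values.append(np.corrcoef(a, b)[0, 1])
    return lags, np.array(values)

# Synthetic pseudotime on 1,000 intervals; expression rises 0.1 units after accessibility.
pseudotime = np.linspace(0.0, 1.0, 1000)
accessibility = 1.0 / (1.0 + np.exp(-(pseudotime - 0.4) / 0.05))
expression = 1.0 / (1.0 + np.exp(-(pseudotime - 0.5) / 0.05))

expr_smooth = box_smooth(pseudotime, expression, pseudotime, bandwidth=0.01)
acc_smooth = box_smooth(pseudotime, accessibility, pseudotime, bandwidth=0.05)

lags, cc = cross_correlation(expr_smooth, acc_smooth, max_lag=300)
best_lag = int(lags[np.argmax(cc)])
displacement = best_lag * (pseudotime[1] - pseudotime[0])
print(best_lag, round(float(displacement), 3), round(float(cc.max()), 3))
```

A positive optimal displacement here means that the expression curve is best aligned with the accessibility curve after shifting it back in pseudotime, that is, expression lags accessibility.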
Bridge cell downsampling analysis
To explore how the size and composition of the multiomic dataset affected the robustness of bridge integration, we performed 25 serial downsamplings of the entire BMMC multiomic dataset (200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 20,000 and 30,000 cells). We used one batch of the scATAC dataset (12,256 cells) as a query, repeated bridge integration and compared the resulting predictions with our original findings. As expected, the degree of agreement after downsampling was cell-type dependent, as predictions for abundant cell types were more robust to downsampling. We therefore expressed our results as a function of the number of cells of each cell type present in the bridge dataset. For example, the 7,000-cell downsampled dataset contained 144 CD16^{+} monocytes (prediction concordance of 1.00) and 22 proB cells (prediction concordance of 0.66). The 2,000-cell downsampled dataset contained 41 CD16^{+} monocytes (prediction concordance of 0.94) and 6 proB cells (prediction concordance of 0.55). We aggregated these results across downsamples and displayed them in Fig. 2a. For visual clarity, we restricted the x axis to a range of 10 to 500 in Fig. 2a.
Bridge cell-type composition resampling analysis
To assess the robustness of our bridge integration procedure to the relative proportion of cell types in the bridge dataset, we scrambled the proportions and then repeated the bridge integration procedure. To accomplish this, we sampled 10,000 cells from the original bridge dataset without replacement. We set each cell’s sampling probability inversely proportional to the frequency of its cell type in the original dataset, ensuring that we would substantially alter cell-type composition. We then repeated bridge integration with the resampled dataset, mapping the same query dataset, and compared the results in Supplementary Fig. 2.
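As a rough illustration of inverse-frequency resampling, the Python sketch below weights each cell by the reciprocal of its cell type's frequency and draws a sample without replacement. The weighted draw uses Efraimidis-Spirakis random keys, which is a technique chosen here for convenience and is not stated in the text; the function names and toy labels are hypothetical.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each cell by 1 / (frequency of its cell type)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[label] for label in labels]

def weighted_sample_without_replacement(weights, size, seed=0):
    """Draw `size` distinct indices; larger weights are more likely to be kept.

    Uses Efraimidis-Spirakis keys u ** (1 / w) and keeps the largest keys.
    """
    rng = random.Random(seed)
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return [i for _, i in sorted(keys, reverse=True)[:size]]

# Toy bridge: 90 cells of an abundant type and 10 cells of a rare type.
labels = ["abundant"] * 90 + ["rare"] * 10
weights = inverse_frequency_weights(labels)
picked = weighted_sample_without_replacement(weights, size=20)
resampled_composition = Counter(labels[i] for i in picked)
```

Because every cell of a rare type carries a larger weight, rare types are enriched in the resampled bridge relative to their original proportion.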
Benchmark analysis with multiVI and Cobolt
To assess the performance of our bridge integration method alongside other recently proposed integration tools, we compared our results with multiVI^{49} from scvi-tools v.0.14.5 and Cobolt^{50} (v.1.0.0). As both Cobolt and multiVI use variational autoencoders, both methods were run on a server with a discrete NVIDIA A100 GPU with 40 gigabytes of memory and PyTorch Lightning v.1.3.8 installed. Seurat analyses were run on an Intel Xeon Platinum 8280L server using a single computational core.
For multiVI, we used the scRNA-seq, scATAC-seq and multiomic RNA–ATAC paired counts matrices as input. We used the multiome_anndatas function to generate one anndata object for integration. We set batch information in categorical_covariate_keys using the setup_anndata function. We then integrated the datasets by running the multiVI function, as outlined in the multiVI tutorial (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/MultiVI_tutorial.html). We used 500 epochs for model training, as suggested in the multiVI tutorial. All other parameters were set to default settings. multiVI learns a latent space that jointly represents cells across the scRNA-seq and scATAC-seq datasets. We extracted this space and performed nearest neighbor calculations and UMAP visualization in Seurat.
For Cobolt, we used the scRNA-seq, scATAC-seq and multiomic RNA–ATAC paired counts matrices as input. We used the SingleData function from cobolt_utils to generate three Cobolt objects and trained the model using 20 latent variables, a 0.001 learning rate and 100 iterations, as recommended in the Cobolt tutorial (https://github.com/epurdom/cobolt/blob/master/docs/tutorial.ipynb). All other parameters were set to default or were the recommended settings in the tutorial. Cobolt learns a latent space that jointly represents cells across the scRNA-seq and scATAC-seq datasets. We extracted this space and performed nearest neighbor calculations and UMAP visualization in Seurat.
We performed comparative benchmarking in three contexts. First, we ran all three approaches on the datasets from Fig. 2, aiming to map an scATAC-seq query dataset onto an scRNA-seq-defined reference. We did not have ground truth information for this dataset, so we did not calculate quantitative benchmarks, although we visualized the performance of all methods in Fig. 3b and Supplementary Fig. 2. As multiVI and Cobolt do not provide methods to explicitly label query scATAC-seq cells using scRNA-seq references, we applied a common heuristic for label transfer: for each scATAC-seq cell, we identified its five nearest neighbors among the scRNA-seq cells and transferred the most common cell annotation among those neighbors. In Fig. 3b, we visualized chromatin accessibility at the SIGLEC6 locus for cells predicted as ASDCs by all methods, and additional loci are shown in Supplementary Fig. 2.
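The nearest-neighbor label-transfer heuristic described above can be sketched in a few lines of Python. This is an illustrative brute-force version under simple assumptions (Euclidean distance in a shared latent space, ties broken by the order in which labels are encountered); the function name and toy coordinates are hypothetical.

```python
import math
from collections import Counter

def transfer_labels(query, reference, ref_labels, k=5):
    """Give each query cell the most common label among its k nearest reference cells."""
    predictions = []
    for q in query:
        # Euclidean distance from this query cell to every reference cell.
        dists = sorted((math.dist(q, r), i) for i, r in enumerate(reference))
        votes = Counter(ref_labels[i] for _, i in dists[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy latent space with two well-separated reference populations.
reference = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ref_labels = ["A", "A", "A", "B", "B", "B"]
query = [(0.2, 0.2), (10.5, 10.5)]
predicted = transfer_labels(query, reference, ref_labels, k=5)
```

For datasets of realistic size, an approximate nearest-neighbor index would replace the exhaustive distance computation.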
Second, we performed quantitative benchmarking in a context where we had a ground truth dataset to establish the accuracy of scATAC-seq/scRNA-seq integration. We split the BMMC multiomic dataset into two groups. The first group consisted of a randomly sampled subset of 2,115 cells representing at most 100 cells per author-defined cell type. This group of cells was used as the multiomic bridge dataset for benchmarking. The remaining cells were placed in the second group and were split into separate scRNA-seq and scATAC-seq profiles (that is, the multiomic pairing information was temporarily discarded). We then integrated the datasets using bridge integration (using both Seurat v3 and mnnCorrect for the final alignment step), multiVI or Cobolt. After integration, all methods return a latent space that jointly represents cells from both the scATAC-seq and scRNA-seq datasets. For each scATAC-seq cell, we know its matched scRNA-seq profile, as they were originally measured simultaneously. Successful integration techniques will place matched profiles close together in this latent space. For each scATAC-seq cell, we therefore calculated the Jaccard similarity metric to its matched scRNA-seq profile (we note that this similarity metric is symmetric). We report these results in Fig. 3c and Supplementary Fig. 3, either averaged together across all cells or averaged within author-defined cell types.
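One way to compute a Jaccard similarity between matched profiles is to compare their neighbor sets in the joint latent space. The Python sketch below makes that assumption explicit: for each pair, it finds the k nearest scRNA-seq cells of the scATAC-seq profile and of its matched scRNA-seq profile (excluding the matched cell itself) and reports the overlap of the two sets. The text does not specify the neighbor definition or k, so both are illustrative choices, and the function names are hypothetical.

```python
import math

def knn_indices(point, cells, k, exclude=None):
    """Indices of the k cells closest to `point`, optionally skipping one index."""
    order = sorted((math.dist(point, c), i)
                   for i, c in enumerate(cells) if i != exclude)
    return {i for _, i in order[:k]}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def matched_pair_jaccard(atac_embed, rna_embed, k=3):
    """atac_embed[i] and rna_embed[i] are the same cell in two modalities."""
    scores = []
    for i, (a, r) in enumerate(zip(atac_embed, rna_embed)):
        nn_atac = knn_indices(a, rna_embed, k, exclude=i)
        nn_rna = knn_indices(r, rna_embed, k, exclude=i)
        scores.append(jaccard(nn_atac, nn_rna))
    return scores

# Perfect integration places each matched pair at the same coordinates.
rna = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
scores = matched_pair_jaccard(rna, rna, k=2)
```

A score of 1 indicates that both modalities of a cell share exactly the same neighborhood, and the score falls toward 0 as the matched profiles drift apart.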
Third, we repeated the ground truth benchmarking analysis on a second multiomic technology. Paired-Tag enables simultaneous CUT&Tag and transcriptomic profiling in single cells. We used data for three histone modifications: H3K27ac, H3K27me3 and H3K4me1. As each dataset consists of multiple replicates, we used replicate 1 as the multiomic dataset and split the CUT&Tag and RNA modalities from the second replicate for benchmarking. We ran multiVI, Cobolt and bridge integration (using both Seurat v3 and mnnCorrect for the final alignment step) as before, substituting the CUT&Tag counts matrix for the scATAC-seq matrix.
Bridge-free benchmark analysis with SeuratCCA and Liger
To benchmark bridge-free integration methods for the cross-modality integration of scATAC and scRNA data, we first transformed ATAC peaks into gene activity scores using the GeneActivity function in the Signac package. This generates a gene activity score matrix by summing peak counts per cell in the gene body and promoter region. The scRNA gene expression and scATAC gene activity score count matrices were used as input for integration. We performed integration by following the procedures from publicly available vignettes: https://satijalab.org/seurat/articles/atacseq_integration_vignette.html and http://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_scRNA_and_scATAC_data.html.
Classification metrics for benchmark analysis
In addition to Jaccard similarity, we also calculated additional quantitative benchmarking metrics (discussed below) that leverage predefined cellular annotations.
Multiclass classification area under the receiver operating characteristic (ROC) curve (AUC)
We performed a one-versus-rest multiclass ROC analysis because cell-type annotations include multiple classes. We assessed multiclass predictions by iteratively contrasting each class with all the others. In each iteration, we designated one class as the ‘positive’ class and the remaining classes as the ‘negative’ classes. Using the prediction score for the ‘positive’ class, we calculated the AUC for that iteration. We report the average AUC value across all cell types.
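A minimal Python sketch of the one-versus-rest procedure follows. It assumes a table of per-class prediction scores and computes each AUC as the probability that a positive cell outscores a negative cell (ties counted as one half), which is equivalent to the area under the ROC curve. The function names and toy scores are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(true_labels, score_table):
    """score_table[c][j] is the prediction score of class c for cell j.

    Returns the mean AUC across classes and the per-class AUCs.
    """
    aucs = {}
    for cls, scores in score_table.items():
        # Treat `cls` as positive and every other class as negative.
        aucs[cls] = binary_auc(scores, [t == cls for t in true_labels])
    return sum(aucs.values()) / len(aucs), aucs

true_labels = ["T", "T", "B", "B"]
score_table = {
    "T": [0.9, 0.3, 0.4, 0.1],
    "B": [0.1, 0.4, 0.6, 0.9],
}
mean_auc, per_class = one_vs_rest_auc(true_labels, score_table)
```

The pairwise comparison is quadratic in the number of cells; a rank-based computation would be preferable for large datasets.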
Query annotation KNN purity
This metric quantifies the consistency between cell-type labels and neighbor relationships in the latent space. For each query cell, it measures the fraction of neighbors that receive the same annotation as the query cell itself. Using the mapped query dataset, we calculated a KNN graph in the integrated space to find the k = 30 nearest neighbors of each query cell and calculated the fraction of those neighbors receiving the same annotation.
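The purity calculation can be sketched as follows in Python, assuming Euclidean distances in the integrated space and a brute-force neighbor search. The function name and toy data are hypothetical.

```python
import math

def knn_purity(embedding, labels, k=30):
    """For each cell, the fraction of its k nearest neighbors sharing its label."""
    purities = []
    for i, cell in enumerate(embedding):
        # Rank all other cells by distance to this cell.
        order = sorted((math.dist(cell, other), j)
                       for j, other in enumerate(embedding) if j != i)
        neighbors = [j for _, j in order[:k]]
        same = sum(labels[j] == labels[i] for j in neighbors)
        purities.append(same / len(neighbors))
    return purities

# Two tight, well-separated groups of three cells each.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["A", "A", "A", "B", "B", "B"]
purity = knn_purity(embedding, labels, k=2)
```

In this toy example every cell's two nearest neighbors share its label, so the purity is 1 for all cells; raising k to 3 forces one neighbor from the other group and lowers each value to 2/3.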
Multiclass binary cross entropy
Multiclass binary cross entropy is a commonly used metric in classification and machine learning tasks^{116} and considers both the accuracy of prediction and the associated prediction score,
\[ - \frac{1}{K}\sum_{i = 1}^{K} \left[ y^{(i)}\log \left( \hat y^{(i)} \right) + \left( 1 - y^{(i)} \right)\log \left( 1 - \hat y^{(i)} \right) \right] \]
where K is the number of potential classes (cell types), \(y^{(i)}\) is an indicator variable that denotes a correct (1) or incorrect (0) prediction, and \(\hat y^{(i)}\) is the prediction score (probability associated with predicting class i).
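As a minimal Python sketch of this metric for a single cell, assuming the standard binary cross-entropy form averaged over the K classes and clipping scores away from 0 and 1 to avoid taking the logarithm of zero (the clipping constant and function name are illustrative choices):

```python
import math

def multiclass_binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross entropy over K classes.

    y[i] is 1 if class i is the correct class and 0 otherwise;
    y_hat[i] is the prediction score for class i.
    """
    total = 0.0
    for yi, pi in zip(y, y_hat):
        pi = min(max(pi, eps), 1 - eps)  # keep log() finite
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)

# A confident, correct prediction scores lower (better) than an uncertain one.
confident = multiclass_binary_cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
uncertain = multiclass_binary_cross_entropy([1, 0, 0], [0.4, 0.3, 0.3])
```

Lower values indicate predictions that are both accurate and confident.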
Bridge cell-type removal analysis
We removed certain cell types from the bridge and repeated bridge integration to characterize the performance of our method in situations where cell populations were missing from the bridge dataset. We separately removed the CD8^{+} T cell, pDC, CD14^{+} monocyte and B cell subpopulations from the BMMC multiome benchmark dataset. For each modified bridge dataset, we repeated the bridge integration procedure and compared the predicted labels (and prediction scores) assigned to query cells based on the full and modified bridge datasets.
Bridge cell-type RNA and ATAC UMI downsample analysis
To simulate reduced data quality for RNA or ATAC modalities in the multiome dataset, we downsampled RNA or ATAC UMI counts in the multiome dataset by 1, 5, 10, 20, 30, 40, 50, 60, 70, 80 and 90 using downsampleMatrix from the scuttle package^{117}. We renormalized RNA or ATAC data after downsampling and repeated the bridge integration procedure. We assessed the prediction results using four evaluation metrics: classification AUC, query KNN purity, multiclass binary cross entropy and Jaccard similarity.
Sketching benchmark analysis and evaluation metrics
We applied the three assessment metrics listed below to evaluate the performance of sketching algorithms. These metrics assess the ability of sketching algorithms to identify a compact subset of cells that is maximally representative of the full dataset and, in particular, to retain cells from rare subpopulations in the dataset sketch. We computed metrics on two datasets: a 33,000-cell dataset of human PBMCs publicly available from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc33k) and a 66,000-cell lung scRNA-seq dataset^{98}. Using two sketching algorithms (leverage score-based sketching and geosketch^{53}), we separately sketched 100, 300, 500, 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 6,000, 7,000, 8,000, 9,000 and 10,000 cells from each full dataset. We also performed uniform downsampling as a negative control. To measure computing time, we varied the size of the original dataset and fixed the number of sampled cells at 5,000.
Change of gene covariance matrix
Optimal cell sketches should modify cellular density but preserve information in the dataset gene–gene covariance matrix. We therefore calculated the magnitude of the difference between the gene covariance matrix calculated on the full dataset and the gene covariance matrix calculated on the sketched dataset. Before calculating the covariance matrix, we first performed a PCA on the dataset and reconstructed a gene expression matrix using the top 50 principal components. We then calculated the Frobenius norm of changes in the gene covariance matrix using the following equation:
\[ \left\| \operatorname{Cov}(X_r) - \operatorname{Cov}(X_S) \right\|_F, \quad X_S = S X_r \]
where \(X_r \in {\Bbb R}^{n \times d}\) is the PCA denoised full gene expression matrix, S is the sampling matrix for the dataset to sample k cells from the full matrix, \(X_S \in {\Bbb R}^{k \times d}\) is the denoised gene expression matrix for the k sampled cells, and \(\operatorname{Cov}( \cdot )\) denotes the gene–gene covariance matrix.
Cell-type entropy
We assessed the evenness of cell types in the sketched data based on the original annotations from the dataset. Cell-type entropy will increase when the sketched data effectively represent rare cell types. Cell-type entropy will decrease when abundant cell types dominate the sketched data and rare cell types are not represented.
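The evenness measure can be illustrated with a short Python sketch. It assumes that entropy here refers to the Shannon entropy of the cell-type label distribution (natural logarithm); the exact formula is not given in the text, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def cell_type_entropy(labels):
    """Shannon entropy (in nats) of the cell-type proportions."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A sketch that keeps rare types is more even than one dominated by a single type.
balanced_sketch = ["T"] * 5 + ["B"] * 5
skewed_sketch = ["T"] * 9 + ["B"] * 1
balanced = cell_type_entropy(balanced_sketch)
skewed = cell_type_entropy(skewed_sketch)
```

The balanced sketch reaches the maximum entropy for two cell types, log 2, whereas the skewed sketch scores lower.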
Hausdorff distance
We also evaluated the performance of sketching using the Hausdorff distance, a metric fully described in the geosketch manuscript^{53}. The Hausdorff distance measures the largest distance from any cell in the full dataset to its nearest cell in the sketch. A low Hausdorff distance indicates that all cells in the full dataset are well represented by the sketched cells.
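As a small Python illustration of this quantity, assuming Euclidean distance and that the sketch is a subset of the full dataset (so only the full-to-sketch direction is informative), with a hypothetical function name and toy coordinates:

```python
import math

def hausdorff_to_sketch(full, sketch):
    """Largest distance from any cell in `full` to its nearest cell in `sketch`."""
    return max(min(math.dist(x, s) for s in sketch) for x in full)

full = [(0, 0), (1, 0), (4, 0)]
good_sketch = [(0, 0), (4, 0)]   # covers both ends of the data
poor_sketch = [(0, 0)]           # misses the distant cell
d_good = hausdorff_to_sketch(full, good_sketch)
d_poor = hausdorff_to_sketch(full, poor_sketch)
```

The sketch that retains the distant cell yields the smaller distance, reflecting better coverage of the full dataset.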
Community-wide integration analyses
To facilitate the harmonization and subsequent meta-analysis of a diversity of publicly available scRNA-seq datasets, we applied our atomic sketch integration approach to 1,525,710 scRNA-seq profiles spanning 19 publicly available human lung scRNA-seq datasets. As described above, we calculated a leverage score for each cell in each dataset and used this to sample 5,000 cells from each dataset as atoms. We found that the sampled atoms retain rare cell types, despite downsampling (Supplementary Fig. 5). We learned a dictionary representation that reconstructs cells from each dataset based on the selected atoms using the methods described above. We used our previously developed reciprocal PCA-based integration workflow (https://satijalab.org/seurat/articles/integration_rpca.html) to integrate the 95,000 atoms originating from these 19 datasets. Finally, we used the learned dictionary representations to reconstruct harmonized profiles (in low-dimensional space) for all 1,525,710 scRNA-seq profiles. This space was used as input for UMAP to generate the visualization in Fig. 4b,c.
The harmonized space for all 1,525,710 scRNA-seq profiles can also be used as input to graph-based clustering approaches. However, because annotation is an iterative and manual process, we chose to first perform clustering on the harmonized dataset of 95,000 atoms. We constructed a shared nearest neighbor graph and partitioned this into clusters using the graph-based smart local moving algorithm^{118}. We initially clustered cells at a high resolution (resolution = 5) and performed differential expression analysis on all pairs of clusters for RNA markers. We merged clusters that did not exhibit clear evidence of separation. We removed clusters that showed clear evidence of expressing markers for two different cell types as likely doublets. To assign names to individual clusters, we used the recently published anatomical structures, cell types and biomarker tables^{119}, except for five clusters (adventitial fibroblast, alveolar fibroblast, myofibroblast, proliferating NK/T and squamous), where our desired annotation was not present in the most recently available version of the table (v1)^{120}. For each cell in the full dataset, we found its ten nearest neighbors among the annotated atoms and transferred the most commonly observed annotation.
Lung integration evaluation
We computed two evaluation metrics to assess the performance of the integration of 19 lung datasets. The local inverse Simpson index (LISI) evaluates batch-effect correction, and KNN purity evaluates preservation of the original labels in the integrated space. As an unintegrated control, we merged the raw RNA expression matrices from the 19 lung datasets, normalized them and performed a PCA.
To compare batch effects, we computed the LISI score separately on the top 50 dimensions of the cell embeddings in the RNA PCA and in the integrated latent space. We used the same integrated latent space to determine KNN purity.
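To convey the intuition behind LISI, the Python sketch below computes a simplified, unweighted variant: the inverse Simpson index of batch labels among each cell's k nearest neighbors. The published LISI weights neighbors with a perplexity-based Gaussian kernel, which this sketch omits, so the values are not directly comparable to those reported here. The function name and toy data are hypothetical.

```python
import math
from collections import Counter

def simple_lisi(embedding, batches, k=30):
    """Unweighted inverse Simpson index of batch labels in each cell's k-neighborhood."""
    scores = []
    for i, cell in enumerate(embedding):
        order = sorted((math.dist(cell, other), j)
                       for j, other in enumerate(embedding) if j != i)
        neighbor_batches = [batches[j] for _, j in order[:k]]
        counts = Counter(neighbor_batches)
        simpson = sum((c / len(neighbor_batches)) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return scores

# Four cells on a unit square with alternating batch labels: well mixed.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1)]
mixed = simple_lisi(embedding, ["a", "b", "a", "b"], k=2)
unmixed = simple_lisi(embedding, ["a", "a", "a", "a"], k=2)
```

The score approaches the number of batches when neighborhoods are well mixed and equals 1 when every neighbor comes from a single batch.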
In Fig. 5, we performed ‘community-wide’ integration on 3.46 million cells spanning 639 individuals and 14 studies. As these studies varied widely in the number of cells present in each dataset, we selected at least 5,000 and at most 10% of the cells in each dataset as atoms based on their leverage score. This enabled the larger and more comprehensive datasets to contribute additional weight to the integrated reference. We performed integration, reconstruction and annotation using the same steps as described for the lung.
Identifying DE genes across cell types and conditions
In the lung and PBMC community-wide integration, we identified DE genes on the ‘pseudobulk’ expression values calculated from each individual study. We used a logistic regression-based method to identify DE genes. For space considerations, we typically reported only the top 10 markers in each heat map and sorted genes first by adjusted P value and next by log(fold change) to determine the top markers. To compare the results of single-cell and pseudobulk analyses, we used the wilcoxauc method from presto^{121} to identify DE genes using either the single-cell or pseudobulk profiles as input and sorted by the AUC statistic. In Fig. 4g, we compared the distribution of average expression values (within a cell type) for the top 100 markers identified by either single-cell or pseudobulk analysis.
To identify COVID-19 response signatures that are consistent across multiple individuals, we first calculated a pseudobulk average for CD14^{+} monocytes for each of the 506 donors who were either healthy or whose metadata indicated mild, moderate or severe COVID-19 (ref. ^{60}). We performed DE analysis at the pseudobulk level to identify markers of CD14^{+} monocytes expressed in severe COVID-19 samples compared to healthy samples. In Fig. 5b, we ordered each pseudobulk profile by the expression levels of these genes, which are enriched for interferon response genes, for visualization. We repeated this process for eight additional cell states in Supplementary Fig. 7b.
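The pseudobulk step can be sketched in Python as a group-wise average. This illustration assumes that a pseudobulk profile is the mean expression vector of all cells sharing a grouping key (for example, a donor and cell-type pair); summing counts is another common convention. The function name and toy values are hypothetical.

```python
from collections import defaultdict

def pseudobulk_means(expression, groups):
    """Average per-cell gene vectors within each group.

    expression: list of per-cell gene-expression vectors (equal length).
    groups: one hashable key per cell, e.g. (donor, cell_type).
    """
    sums = {}
    counts = defaultdict(int)
    for vec, key in zip(expression, groups):
        if key not in sums:
            sums[key] = [0.0] * len(vec)
        sums[key] = [s + v for s, v in zip(sums[key], vec)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}

expression = [[1, 0], [3, 2], [10, 10]]
groups = [("donor1", "CD14 Mono"), ("donor1", "CD14 Mono"), ("donor2", "CD14 Mono")]
profiles = pseudobulk_means(expression, groups)
```

Each donor then contributes a single profile per cell type to the downstream DE test, so donors with many cells do not dominate the comparison.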
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
We used publicly available datasets in this work. Download locations for each dataset are listed in the Supplementary Methods and Supplementary Tables. Azimuth references are available for download at http://azimuth.hubmapconsortium.org.
Code availability
Bridge integration and atomic sketch integration are implemented as part of the Seurat R package. In this work, we also make use of the Signac and Azimuth packages. All are freely available as open-source software at the following websites: https://github.com/satijalab/seurat, https://github.com/timoast/signac and https://github.com/satijalab/azimuth.
We include two vignettes describing the ‘bridge integration’ and ‘atomic sketch integration’ procedures as Supplementary Notes with this manuscript.
References
Kent, W. J. BLAT—the BLASTlike alignment tool. Genome Res. 12, 656–664 (2002).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Hao, Y. et al. Integrated analysis of multimodal singlecell data. Cell 184, 3573–3587 (2021).
Kang, J. B. et al. Efficient and precise singlecell reference atlas mapping with Symphony. Nat. Commun. 12, 5890 (2021).
Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of singlecell RNAseq data across data sets. Nat. Methods 15, 359–362 (2018).
Domínguez Conde, C. et al. Crosstissue immune cell analysis reveals tissuespecific features in humans. Science 376, eabl5197 (2022).
Xu, C. et al. Probabilistic harmonization and annotation of singlecell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Lotfollahi, M. et al. Mapping singlecell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
Tabula Muris Consortium et al. Singlecell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Buenrostro, J. D. et al. Singlecell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Clark, S. J. et al. Genomewide baseresolution mapping of DNA methylation in single cells using singlecell bisulfite sequencing (scBSseq). Nat. Protoc. 12, 534–547 (2017).
Wu, S. J. et al. Singlecell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression. Nat. Biotechnol. 39, 819–824 (2021).
Bartosovic, M., Kabbe, M. & CasteloBranco, G. Singlecell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat. Biotechnol. 39, 825–835 (2021).
Bendall, S. C. et al. Singlecell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
Stuart, T. et al. Comprehensive integration of singlecell data. Cell 177, 1888–1902 (2019).
Barkas, N. et al. Joint analysis of heterogeneous singlecell RNAseq dataset collections. Nat. Methods 16, 695–698 (2019).
Welch, J. D. et al. Singlecell multiomic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
LaraAstiaso, D. et al. Immunogenetics. Chromatin state dynamics during blood formation. Science 345, 943–949 (2014).
Chen, S., Lake, B. B. & Zhang, K. Highthroughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Ma, S. et al. Chromatin potential identified by shared singlecell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
Zhu, C. et al. An ultra highthroughput method for singlecell joint analysis of open chromatin and transcriptome. Nat. Struct. Mol. Biol. 26, 1063–1070 (2019).
Zhu, C. et al. Joint profiling of histone modifications and transcriptome in single cells from mouse brain. Nat. Methods 18, 283–292 (2021).
Xiong, H., Luo, Y., Wang, Q., Yu, X. & He, A. Singlecell joint detection of chromatin occupancy and transcriptome enables higherdimensional epigenomic reconstructions. Nat. Methods 18, 652–660 (2021).
Luo, C. et al. Single nucleus multiomics identifies human cortical cell regulatory genome diversity. Cell Genomics 2, 100107 (2022).
Clark, S. J. et al. scNMTseq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 781 (2018).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Chung, H. et al. Joint singlecell measurements of nuclear proteins and RNA in vivo. Nat. Methods 18, 1204–1212 (2021).
Chen, A. F. et al. NEAT-seq: simultaneous profiling of intranuclear proteins, chromatin accessibility and gene expression in single cells. Nat. Methods 19, 547–553 (2022).
Elad, M. & Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15, 3736–3745 (2006).
Rams, M. & Conrad, T. O. F. Dictionary learning allows modelfree pseudotime estimation of transcriptomic data. BMC Genomics 23, 56 (2022).
Ramirez, I., Sprechmann, P. & Sapiro, G. in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 3501–3508 (IEEE, 2010).
Zhang, Q. & Li, B. in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2691–2698 (IEEE, 2010).
Aharon, M., Elad, M. & Bruckstein, A. KSVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54, 4311–4322 (2006).
Korsunsky, I. et al. Fast, sensitive and accurate integration of singlecell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous singlecell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Belkin, M. & Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003).
Granja, J. M. et al. Singlecell multiomic analysis identifies regulatory programs in mixedphenotype acute leukemia. Nat. Biotechnol. 37, 1458–1465 (2019).
Luecken, M. D. et al. in 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (NeurIPS, 2021).
Villani, A. C. et al. Singlecell RNAseq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
See, P. et al. Mapping the human DC lineage through the integration of highdimensional techniques. Science 356, eaag3009 (2017).
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Zheng, S., Papalexi, E., Butler, A., Stephenson, W. & Satija, R. Molecular transitions in early progenitors during human cord blood hematopoiesis. Mol. Syst. Biol. 14, e8041 (2018).
Ashuach, T., Gabitto, M. I., Jordan, M. I. & Yosef, N. MultiVI: deep generative model for the integration of multimodal data. Preprint at bioRxiv https://doi.org/10.1101/2021.08.20.457057 (2021).
Gong, B., Zhou, Y. & Purdom, E. Cobolt: integrative analysis of multimodal singlecell sequencing data. Genome Biol. 22, 351 (2021).
Luo, C. et al. Singlecell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
Bakken, T. E. et al. Comparative cellular analysis of motor cortex in human, marmoset and mouse. Nature 598, 111–119 (2021).
Hie, B., Cho, H., DeMeo, B., Bryson, B. & Berger, B. Geometric sketching compactly summarizes the singlecell transcriptomic landscape. Cell Syst. 8, 483–493 (2019).
DeMeo, B. & Berger, B. Hopper: a mathematically optimal algorithm for sketching biological data. Bioinformatics 36, i236–i241 (2020).
Hicks, S. C., Liu, R., Ni, Y., Purdom, E. & Risso, D. mbkmeans: fast clustering for single cell data using minibatch kmeans. PLoS Comput. Biol. 17, e1008625 (2021).
Clarkson, K. L. & Woodruff, D. P. Lowrank approximation and regression in input sparsity time. JACM 63, 1–45 (2017).
Schiller, H. B. et al. The Human Lung Cell Atlas: a highresolution reference map of the human lung in health and disease. Am. J. Respir. Cell Mol. Biol. 61, 31–41 (2019).
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in singlecell transcriptomics. Database 2020, baaa073 (2020).
Plasschaert, L. W. et al. A singlecell atlas of the airway epithelium reveals the CFTRrich pulmonary ionocyte. Nature 560, 377–381 (2018).
Tian, Y. et al. Singlecell immunology of SARSCoV2 infection. Nat. Biotechnol. 40, 30–41 (2022).
Lee, J. S. & Shin, E. C. The type I interferon response in COVID19: implications for treatment. Nat. Rev. Immunol. 20, 585–586 (2020).
Wilk, A. J. et al. A singlecell atlas of the peripheral immune response in patients with severe COVID19. Nat. Med. 26, 1070–1076 (2020).
COvid19 Multiomics Blood ATlas (COMBAT) Consortium. A blood atlas of COVID19 defines hallmarks of disease severity and specificity. Cell 185, 916–938.e58 (2022).
Rudensky, A. Y. Regulatory T cells and Foxp3. Immunol. Rev. 241, 260–268 (2011).
Thimme, R. et al. Increased expression of the NK cell receptor KLRG1 by virusspecific CD8 T cells during persistent antigen stimulation. J. Virol. 79, 12112–12116 (2005).
Kurioka, A. et al. MAIT cells are licensed through granzyme exchange to kill bacterially sensitized targets. Mucosal Immunol. 8, 429–440 (2015).
Bjorklund, A. K. et al. The heterogeneity of human CD127^{+} innate lymphoid cells revealed by singlecell RNA sequencing. Nat. Immunol. 17, 451–460 (2016).
Tabula Sapiens Consortium. The Tabula Sapiens: A multipleorgan, singlecell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Han, X. et al. Construction of a human cell landscape at singlecell level. Nature 581, 303–309 (2020).
Li, H. et al. Fly Cell Atlas: A singlenucleus transcriptomic atlas of the adult fruit fly. Science 375, eabk2432 (2022).
Plant Cell Atlas Consortium et al. Vision, challenges and opportunities for a Plant Cell Atlas. eLife 10, e66877 (2021).
Lareau, C. A. et al. Dropletbased combinatorial indexing for massivescale singlecell chromatin accessibility. Nat. Biotechnol. 37, 916–924 (2019).
Datlinger, P. et al. Ultrahighthroughput singlecell RNA sequencing and perturbation screening with combinatorial fluidic indexing. Nat. Methods 18, 635–642 (2021).
Goltsev, Y. et al. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell 174, 968–981 (2018).
Li, Z. et al. Singlecell lipidomics with high structural specificity by mass spectrometry. Nat. Commun. 12, 2869 (2021).
Capolupo, L. et al. Sphingolipid control of fibroblast heterogeneity revealed by singlecell lipidomics. Preprint at bioRxiv https://doi.org/10.1101/2021.02.23.432420 (2021).
Barshan, E., Ghodsi, A., Azimifar, Z. & Jahromi, M. Z. Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds. Pattern Recognit. 44, 1357–1371 (2011).
Woodruff, D. P. Sketching as a tool for numerical linear algebra. Preprint at https://doi.org/10.48550/arXiv.1411.4357 (2014).
Levine, J. H. et al. Datadriven phenotypic dissection of AML reveals progenitorlike cells that correlate with prognosis. Cell 162, 184–197 (2015).
Charikar, M., Chen, K. & FarachColton, M. in International Colloquium on Automata, Languages, and Programming 693–703 (Springer, 2002).
Li, P., Hastie, T. J. & Church, K. W. in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 287–296 (Association for Computing Machinery, 2006).
Siddharth, R. & Aghila, G. RandPro—a practical implementation of random projectionbased feature extraction for high dimensional multivariate data analysis in R. SoftwareX 12, 100629 (2020).
Hafemeister, C. & Satija, R. Normalization and variance stabilization of singlecell RNAseq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
Persad, S. et al. SEACells infers transcriptional and epigenomic cellular states from singlecell genomics data. Nat. Biotechnol., 1–12 (2023).
Adams, T. S. et al. Singlecell RNAseq reveals ectopic and aberrant lungresident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 6, eaba1983 (2020).
Bischoff, P. et al. Singlecell RNA sequencing reveals distinct tumor microenvironmental patterns in lung adenocarcinoma. Oncogene 40, 6748–6758 (2021).
Chua, R. L. et al. COVID19 severity correlates with airway epithelium–immune cell interactions identified by singlecell analysis. Nat. Biotechnol. 38, 970–979 (2020).
Delorey, T. M. et al. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595, 107–113 (2021).
Deprez, M. et al. A singlecell atlas of the human healthy airways. Am. J. Respir. Crit. Care Med. 202, 1636–1645 (2020).
Eraslan, G. et al. Singlenucleus crosstissue molecular reference maps toward understanding disease gene function. Science 376, eabl4290 (2022).
Habermann, A. C. et al. Singlecell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 6, eaba1972 (2020).
Lukassen, S. et al. SARSCoV2 receptor ACE2 and TMPRSS2 are primarily expressed in bronchial transient secretory cells. EMBO J. 39, e105114 (2020).
Madissoon, E. et al. scRNAseq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 21, 1 (2019).
Mayr, C.H. et al. Integrative analysis of cell state changes in lung fibrosis with peripheral protein biomarkers. EMBO Mol. Med. 13, e12871 (2021).
Melms, J. C. et al. A molecular singlecell lung atlas of lethal COVID19. Nature 595, 114–119 (2021).
Morse, C. et al. Proliferating SPP1/MERTKexpressing macrophages in idiopathic pulmonary fibrosis. Eur. Respir. J. 54, 1802441 (2019).
Reyfman, P. A. et al. Singlecell transcriptomic analysis of human lung provides insights into the pathobiology of pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 199, 1517–1536 (2019).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from singlecell RNA sequencing. Nature 587, 619–625 (2020).
Wang, A. et al. Singlecell multiomic profiling of human lungs reveals celltypespecific and agedynamic control of SARSCoV2 host genes. eLife 9, e62522 (2020).
Watanabe, N. et al. Anomalous epithelial variations and ectopic inflammatory response in chronic obstructive pulmonary disease. Am. J. Respir. Cell Mol. Biol. 67, 708–719 (2022).
Wauters, E. et al. Discriminating mild from critical COVID-19 by innate and adaptive immune single-cell profiling of bronchoalveolar lavages. Cell Res. 31, 272–290 (2021).
Arunachalam, P. S. et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science 369, 1210–1220 (2020).
Combes, A. J. et al. Global absence and targeting of protective immune states in severe COVID-19. Nature 591, 124–130 (2021).
Lee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Sci. Immunol. 5, eabd1554 (2020).
Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).
Schulte-Schrepping, J. et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell 182, 1419–1440 (2020).
Silvin, A. et al. Elevated calprotectin and abnormal myeloid cell subsets discriminate severe from mild COVID-19. Cell 182, 1401–1418 (2020).
Stephenson, E. et al. Single-cell multiomics analysis of the immune response in COVID-19. Nat. Med. 27, 904–916 (2021).
Su, Y. et al. Multiomics resolves a sharp disease-state shift between mild and moderate COVID-19. Cell 183, 1479–1495 (2020).
Yao, C. et al. Cell-type-specific immune dysregulation in severely ill COVID-19 patients. Cell Rep. 34, 108943 (2021).
Yu, K. et al. Dysregulated adaptive immune response contributes to severe COVID-19. Cell Res. 30, 814–816 (2020).
Zhu, L. et al. Single-cell sequencing of peripheral mononuclear cells reveals distinct immune response landscapes of COVID-19 and influenza patients. Immunity 53, 685–696 (2020).
Haghverdi, L., Buettner, F. & Theis, F. J. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2013).
Bishop, C. M. & Nasrabadi, N. M. Pattern Recognition and Machine Learning, Vol. 4 (Springer, 2006).
McCarthy, D. J., Campbell, K. R., Lun, A. T. & Wills, Q. F. Scater: preprocessing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Waltman, L. & Van Eck, N. J. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B 86, 471 (2013).
Borner, K. et al. Anatomical structures, cell types and biomarkers of the Human Reference Atlas. Nat. Cell Biol. 23, 1117–1128 (2021).
Gloria Pryhuber, X.S. HuBMAP ASCT+B Tables. Lung v1.1 https://doi.org/10.48539/HBM323.SGDF.945 (2021).
Korsunsky, I., Nathan, A., Millard, N. & Raychaudhuri, S. Presto scales Wilcoxon and auROC analyses to millions of observations. Preprint at bioRxiv https://doi.org/10.1101/653253 (2019).
Acknowledgements
We thank all members of the Satija Lab for thoughtful discussions related to this work. We thank A. Butler and H. Srivastava for assistance in identifying and locating scRNA-seq datasets from human lung and PBMCs. We acknowledge the Gottardo and Newell labs for publicly releasing a standardized compendium of human PBMC scRNA-seq datasets. This work was supported by the Chan Zuckerberg Initiative (EOSS0000000082 and HCAA170401895 to R.S.) and the NIH (K99HG01148901 to T.S.; K99CA267677 to A.S.; RM1HG01101402, 1OT2OD026673 01, DP2HG00962301, R01HD096770 and R35NS097404 to R.S.).
Author information
Contributions
T.S., Y.H. and R.S. conceived the research. Y.H., T.S., M.H.K., S.C., P.H., A.H., A.S., G.M. and S.M. performed the computational analyses, supervised by C.F.G. and R.S. Y.H., T.S. and R.S. wrote the manuscript, with input and assistance from all authors.
Ethics declarations
Competing interests
In the past 3 years, R.S. has worked as a consultant for Bristol-Myers Squibb, Regeneron and Kallyope and served as an SAB member for ImmunAI, Resolve Biosciences, Nanostring and the NYC Pandemic Response Lab. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Rhonda Bacher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Supplementary information
Supplementary Information
Supplementary Figs. 1–8, Tables 1 and 2 and Notes 1 and 2.
Supplementary Tables 1 and 2
Supplementary Table 1. Summary of cross-modality integration benchmark results. Supplementary Table 2. scRNA lung and PBMC data acquisition sources.
About this article
Cite this article
Hao, Y., Stuart, T., Kowalski, M. H. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01767-y