Abstract
Jointly profiling the transcriptome, chromatin accessibility and other molecular properties of single cells offers a powerful way to study cellular diversity. Here we present MultiVI, a probabilistic model to analyze such multiomic data and leverage it to enhance single-modality datasets. MultiVI creates a joint representation that allows an analysis of all modalities included in the multiomic input data, even for cells for which one or more modalities are missing. It is available at scvi-tools.org.
Main
The advent of technologies for profiling the transcriptional and chromatin accessibility landscapes at a singlecell resolution has been paramount for cataloging cellular types and states^{1,2}. However, most uses of singlecell RNAsequencing (scRNAseq)^{3,4} and singlecell assay for transposaseaccessible chromatin with sequencing (scATACseq)^{2,5} have been limited such that a given cell can only be profiled by one technology. Recently, multimodal singlecell protocols have emerged for simultaneously profiling gene expression, chromatin accessibility and, more recently, the abundance of surface protein in the same cell^{6,7}. This concomitant measurement enables a more refined categorization of cell states and, ultimately, a better understanding of the mechanisms that underlie their diversity.
The emerging area of multimodal profiling has benefited greatly from new statistical methods that jointly account for multiple data types in a range of analysis tasks^{8,9,10}. Another promising application of multimodal assays, however, is to improve the way in which the more common and less costly singlemodality datasets (for example, scRNAseq) are analyzed and interpreted. By leveraging datasets with multimodal (paired) information, one can infer properties of the missing modalities and thus gain new insight that is otherwise difficult to achieve. To provide a comprehensive solution, such an integrative analysis should be carried out at two levels. First, it should generate a lowdimensional summary of the state of each cell that reflects all the input molecular types, regardless of which type of information is available for that particular cell. As commonly done in other applications of singlecell genomics, such a representation can facilitate the identification of subpopulations or gradients and enable a more informative data visualization^{11}. A second level of analysis should generate a normalized, batchcorrected view of each highdimensional data type (for example, accessibility of each chromatin region), either observed or inferred. Such an analysis can enable broader identification of molecular features that characterize cellular subpopulations of interest.
Here, we introduce MultiVI, a deep generative model for probabilistic analysis of multimodal datasets, which also enables their integration with single-modality datasets. Focusing on gene expression and chromatin accessibility as our main case study, we demonstrate that MultiVI provides solutions for the two levels of analysis, with a low-dimensional summary of cell state and a normalized high-dimensional view of both modalities (measured or inferred) in each cell. MultiVI was designed to account for the general caveats of single-cell genomics data, namely batch effects, different technologies for the same modality, variability in sequencing depth, limited sensitivity and noise. It does so while explicitly modeling the statistical properties of each modality, treating the discreteness of the scRNA-seq signal and the binary nature of the scATAC-seq signal. A key part in the design of MultiVI is its modularity, which allows for the inclusion of additional data modalities. Here, we demonstrate it by adding surface protein expression with tagged antibodies as a third modality^{6,7}. The extended model accounts for properties of the protein data (for example, a non-zero background component), and enables integration and joint analysis with single-modality (RNA, chromatin or protein-only) datasets.
A recent method (Cobolt^{12}) presented an approach similar to that of MultiVI, with promising results. As we will show, MultiVI provides a more comprehensive solution for integrating and interpreting information across modalities, studies and technologies (a summary of all the computational experiments performed in the paper is available in Supplementary Fig. 1). In addition to showcasing its ability to derive accurate low-dimensional representations, we demonstrate several key properties of MultiVI as a way of imputing high-dimensional signals. First, we demonstrate that MultiVI provides calibrated estimates of the uncertainty in the imputed values (for example, predicted chromatin accessibility for scRNA-seq-only cells and predicted gene expression for scATAC-seq-only cells), such that less accurate predictions are also less confident. Second, we demonstrate that these estimates of uncertainty give rise to accurate estimates of differential gene expression or chromatin accessibility in cells for which the respective modality was not available. Third, we show that even if a population of cells has information from only one modality, accurate imputation may still be achieved when multimodal information is available for related populations (thus effectively performing out-of-sample prediction). MultiVI is available in scvi-tools as a continuously supported, open-source software package, along with detailed documentation and a usage tutorial at https://docs.scvi-tools.org/.
Results
The MultiVI model
MultiVI leverages our previously presented variational autoencoding (VAE^{13}) models for gene expression (scVI, ref. ^{14}), chromatin accessibility (PeakVI, ref. ^{15}) and protein abundance (totalVI, ref. ^{16}). For clarity, we focus the discussion here on jointly modeling scRNA-seq and scATAC-seq data. The extension to surface protein measurements is provided in the Methods section.
Given multimodal data from a single cell (X), and a sample (or batch) S, we divide the observations into gene expression \(\left({X}_{\mathrm{R}}\right)\) and chromatin accessibility \(\left({X}_{\mathrm{A}}\right)\). Two deep neural networks, termed encoders, learn modalityspecific, batchcorrected multivariate normal distributions that represent the latent state of the cell based on the observed data, q(z_{R}∣X_{R}, S) and q(z_{A}∣X_{A}, S), from the expression and accessibility observations, respectively. To obtain a latent space that reflects both modalities, we penalize the model so that the distance between the two latent representations is minimized and then estimate the integrative cell state q(z∣X_{R}, X_{A}, S) as the average of both representations. The states of cells for which only one modality is available (that is, unpaired), are drawn directly from the representation for which data are available (that is, z_{R} or z_{A}). This encoding part of the model can be naturally extended for handling other molecular properties (such as protein abundance), by including additional encoder networks.
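As an illustration of the encoding step described above, the following minimal sketch computes a symmetric KL penalty between two modality-specific Gaussian representations and combines them into a joint state. It is not the MultiVI implementation: it assumes diagonal Gaussians, averages point estimates rather than distributions, and all function names are hypothetical.

```python
import math

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two diagonal Gaussians given as lists of means and variances."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def symmetric_kl(mu_r, var_r, mu_a, var_a):
    """Symmetric penalty: KL(q_R || q_A) + KL(q_A || q_R)."""
    return (kl_diag_gaussian(mu_r, var_r, mu_a, var_a)
            + kl_diag_gaussian(mu_a, var_a, mu_r, var_r))

def joint_state(z_r=None, z_a=None):
    """Average both representations for a paired cell; for an unpaired cell,
    return whichever representation is available (assumed simplification)."""
    if z_r is not None and z_a is not None:
        return [(r + a) / 2.0 for r, a in zip(z_r, z_a)]
    return list(z_r if z_r is not None else z_a)
```

The penalty is zero only when the two representations coincide, which is what pushes the expression-based and accessibility-based encodings of the same cell toward each other.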
In the second part of the model, observations are generated from the latent representation using modalityspecific decoder neural networks. Similar to our previous models for gene expression (scVI) and accessibility (PeakVI), RNA expression data are drawn from a negative binomial distribution and the accessibility data from a Bernoulli distribution. The likelihood is computed from both modalities for paired (multimodal) cells, and only from the respective modality of unpaired cells. Finally, during training, we include an adversarial component that penalizes the model if cells from different modalities are overly separated in latent space (Methods).
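The way the likelihood is assembled per cell can be sketched as follows. This is a simplified illustration under stated assumptions: the negative binomial is parameterized by a mean and an inverse dispersion, the decoder outputs are passed in as plain numbers, and the function names are hypothetical rather than taken from the MultiVI code.

```python
import math

def nb_log_prob(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def bernoulli_log_prob(y, p):
    """Log-probability of a binary accessibility observation y given probability p."""
    return math.log(p) if y == 1 else math.log(1.0 - p)

def cell_log_likelihood(rna=None, atac=None):
    """rna: list of (count, mu, theta) per gene; atac: list of (binary, p) per region.
    A modality that was not measured is passed as None and contributes nothing."""
    total = 0.0
    if rna is not None:
        total += sum(nb_log_prob(x, mu, th) for x, mu, th in rna)
    if atac is not None:
        total += sum(bernoulli_log_prob(y, p) for y, p in atac)
    return total
```

For a paired cell both terms are summed, whereas an RNA-only or ATAC-only cell contributes only the term for its observed modality.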
This two-part architecture leverages the paired data to learn a low-dimensional representation of cell state, which reflects both data types, and it allows cells for which only one modality is available to be represented in the same (joint) latent space. Additionally, the generative part of the model provides a way to derive normalized, batch-corrected gene expression and accessibility values for both the multimodal cells (that is, normalizing the observed data) and for unpaired cells (that is, imputing unobserved data; Fig. 1 and Methods).
MultiVI integrates paired and unpaired samples
We first trained MultiVI on a fullypaired peripheral blood mononuclear cells (PBMC) dataset from 10X genomics, and observed that our predicted library size factors (Methods) are highly correlated with the observed number of unique molecular identifiers in both the expression and accessibility libraries (Pearson’s correlation 0.97 and 0.91, respectively; Supplementary Fig. 1b,c). Next, to study how well MultiVI integrates paired and singlemodality data into a common lowdimensional representation, we artificially unpaired the dataset. In this procedure, a random set of cells (between 1 and 99% of all cells) were unpaired such that each cell in the set appears twice: once with only gene expression data, and once with only chromatin accessibility data. This resulted in a heterogeneous dataset containing three sets of ‘cells’: one set has both modalities available, a second set with only RNA sequencing (RNAseq) information and the third set with only ATACseq information.
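The artificial unpairing procedure can be sketched in a few lines. This is a minimal illustration operating on cell identifiers only; the function name, the record labels and the fixed random seed are assumptions made for the example.

```python
import random

def unpair(cell_ids, fraction, seed=0):
    """Split a random fraction of paired cells into two single-modality copies.

    Returns a list of (cell_id, kind) records, where kind is 'paired',
    'expression_only' or 'accessibility_only'."""
    rng = random.Random(seed)
    n_unpaired = round(fraction * len(cell_ids))
    chosen = set(rng.sample(cell_ids, n_unpaired))
    records = []
    for cid in cell_ids:
        if cid in chosen:
            # The same cell appears twice, once per modality.
            records.append((cid, "expression_only"))
            records.append((cid, "accessibility_only"))
        else:
            records.append((cid, "paired"))
    return records
```

Because every unpaired cell is represented twice, the resulting dataset has more records than the original number of cells, and the identity of each pair is retained as ground truth for later evaluation.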
Using these data, we compared MultiVI to Cobolt^{12}, a model similar to MultiVI that uses products of experts to create a common latent space. To explore the performance of additional analysis strategies, we also added several adaptations of Seurat^{8}. Specifically, we used the Seurat V4 code base with three different approaches: (1) gene activity, we converted the ATACseq data of the accessibilityonly cells to gene activity scores (using the signac procedure), and then integrated all the cells using the genelevel data (that is, gene scores when RNAseq is not available or gene expression when RNAseq is available); (2) imputed, we followed the steps in (1) and then used Seurat to impute the RNA expression values for the accessibilityonly cells and (3) weighted nearest neighbors (WNN), using WNN graphs, which leverage information from both modalities to create a joint representational space, then project singlemodality data onto this space (Methods).
We ran all methods on the artificially unpaired datasets and compared their latent representations (with the exception of the WNNbased approach on the 99% unpaired dataset, which failed to produce results due to the low number of paired cells; Fig. 2a–c and Supplementary Fig. 2). We first quantified the mixing performance by calculating the local inverse Simpson’s index (LISI) score described by ref. ^{17} (Fig. 2d). We found that algorithms based on generative modeling (Cobolt, MultiVI) outperform the alternative approaches of gene scoring and WNN in most rates of unpaired cells. Conversely, the Seuratbased imputation approach (unlike the other two Seuratbased approaches) maintains high mixing performance across all levels of unpaired cells. This result is expected, since each accessibilityonly cell is represented by a gene expression vector that is an average over cells for which RNAseq is available and that have gene expression profiles that are similar to one another (that is, a local neighborhood in a transcriptomebased space). It does not, however, indicate whether these representations are accurate.
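To make the mixing score more concrete, the sketch below computes an inverse Simpson's index over the modality labels in each cell's neighborhood. It is a deliberately simplified stand-in for LISI: it uses unweighted K-nearest neighbors with brute-force distances, whereas the published LISI score uses distance-weighted neighborhoods. Function names are hypothetical.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def inverse_simpson(labels):
    """1 / sum of squared label frequencies; equals the number of labels
    when they are perfectly balanced and 1 when only one label is present."""
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(labels).values())

def knn_mixing_scores(points, labels, k):
    """Inverse Simpson's index of the labels among each point's k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted((sq_dist(p, q), j) for j, q in enumerate(points) if j != i)
        scores.append(inverse_simpson([labels[j] for _, j in dists[:k]]))
    return scores
```

With two modalities, a score near 2 indicates that a cell's neighborhood contains both modalities in equal proportion, and a score of 1 indicates a neighborhood drawn from a single modality.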
We next examined the accuracy of the inferred latent space, taking advantage of the groundtruth information contained in our artificially unpaired datasets. Ideally, the two modalityspecific representations of unpaired cells would be situated closely in the latent representation, as both capture the same biological state. We therefore looked at the distances between the two representations of each unpaired cell in the latent space created by each method. To account for the varying scales of different latent spaces, we used the rank distance (the minimal K for which the two representations are within each other’s Knearest neighbors (KNN), averaged across all cells; Methods and Fig. 2e). In this experiment, we found that MultiVI and Cobolt maintain the multimodal mixing accuracy substantially better than the three alternatives, and that all methods have a deteriorating performance as the level of unpaired cells increases.
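The rank distance can be read as follows: for the two copies of an unpaired cell, find the smallest K such that each copy is among the other's K nearest neighbors, then average over cells. A brute-force sketch, with hypothetical function names and strict-inequality tie handling as an assumption, is:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def neighbor_rank(points, i, j):
    """1-based rank of point j among the neighbors of point i (excluding i)."""
    d_ij = sq_dist(points[i], points[j])
    closer = sum(
        1 for k, q in enumerate(points)
        if k not in (i, j) and sq_dist(points[i], q) < d_ij
    )
    return 1 + closer

def rank_distance(points, pairs):
    """Average over pairs of the minimal K for which both members of a pair
    are within each other's K nearest neighbors."""
    ks = [max(neighbor_rank(points, i, j), neighbor_rank(points, j, i))
          for i, j in pairs]
    return sum(ks) / len(ks)
```

Because the score depends only on neighbor ranks, it is comparable across latent spaces with different scales, which is the motivation given above for using it.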
One of the key modeling decisions in MultiVI, compared with previous approaches such as totalVI (ref. ^{16}), is the creation of latent representations for each data modality, followed by the calculation of an average representation, while penalizing the symmetric KL divergence between representations. To test the robustness of this approach, we trained models with other approaches and compared them against default MultiVI. We tested replacing the symmetric KL divergence term with a maximum mean discrepancy term (MMD)^{18}, as well as replacing the simple average with weighted averages in two settings: a global weighting scheme (w_{m}, such that ∑_{Modalities}w_{m} = 1) and a cell-specific weight (w_{m,c}, such that ∑_{Modalities}w_{m,c} = 1). In both cases, weights are learnable parameters optimized along with the rest of the model. We used two annotated multimodal datasets in which PBMCs are profiled using either DOGMA-seq^{7} or transcriptome-epitope-accessibility sequencing (TEA-seq)^{6} to evaluate latent space creation (Supplementary Fig. 3a,b). To evaluate each condition, we used the LISI metric together with several single-cell genomics batch correction and conservation of biological variation metrics, as defined in the scib package^{19} (Supplementary Information). In both datasets and under the different modeling decisions, all settings we tested yielded highly similar results (Supplementary Fig. 3c). We concluded that the model is robust with regard to these modeling decisions.
Taken together, these results show that the deep generative modeling approach, as embodied by MultiVI, effectively integrates unpaired scRNA and scATAC data while still preserving the biological state of each cell.
Integration of independent studies
Our benchmark analyses in Fig. 2 rely on artificially unpaired data, where our model benefits from all data fundamentally being generated in a single batch and by a single technology. This does not reflect realworld situations in which it is desired to integrate datasets that were generated in different batches or even different studies. We therefore sought to demonstrate MultiVI on a set of realworld data. We collected three distinct datasets of PBMCs: (1) multimodal data from the 10X dataset we used previously; (2) ATACseq from a subset of hematopoiesis data generated by Satpathy et al.^{20}, containing multiple batches of PBMCs as well as celltype specific (FACSsorted) samples and (3) PBMC data generated by several different technologies for scRNAseq, taken from a benchmarking study by Ding et al.^{21}. The datasets were processed to create a set of shared features (genes or genomic regions, when measurements are available), and annotations were collected from both the Satpathy et al. and Ding et al. datasets and combined into a shared set of celltype labels (Methods). The resulting dataset has 47,148 (53%) ATAConly cells from Satpathy et al., 30,495 (34%) RNAonly cells from Ding et al. and 12,012 (13%) jointly profiled cells from 10X.
To gauge the extent of batch effects in these data, we ran MultiVI without accounting for the study of origin of each sample or its specific technology (which varies between the RNAseq samples from Ding et al.). In this setting, cells stratified based on sample in the accessibility data, and based on technologies in the expression data, indicating that batch effects were affecting the latent representation (Supplementary Fig. 4). We then configured MultiVI to correct for batch effects and technologyspecific effects within each dataset and reran the analysis (Methods). The resulting joint latent space mixes the three datasets well (Fig. 3a), while accurately matching labeled populations from both datasets (Fig. 3b). MultiVI achieves this while also correcting batch effects within the Satpathy data and technologyspecific effects within the Ding data (Fig. 3c,d and Supplementary Fig. 5a–c). To study the correctness of the integration, we examined the set of labeled cells from the two singlemodality datasets (FACSbased labels from Satpathy and manually annotated cells from Ding). For each cell, we identified its 50 nearest neighbors that came from the other modality and summarized the distribution of labels from the neighboring cells. We find a clear agreement between the labels of each cell and the labels of its crossmodality neighbors, with some mixing among related cell types (Supplementary Fig. 5d). This therefore demonstrates that MultiVI is capable of deriving biologically meaningful lowdimensional representations that effectively integrate data not only from different modalities, but also from different laboratories and technologies.
Probabilistic data imputation with estimated uncertainty
The generative nature of MultiVI enables several functionalities for analyzing the data in the full high-dimensional space, performing imputation of missing observations and modalities, estimation of uncertainty and differential analysis. To evaluate MultiVI’s data imputation capabilities, we resorted to the PBMC dataset from 10X (Methods) where 75% of the cells were artificially unpaired (as in Fig. 2). We used MultiVI to infer the values of the missing modality for the unpaired cells and found that for both modalities, the imputation had high correspondence to the observed values (Fig. 4a–c). Considering all the gene expression entries together, MultiVI achieves a Spearman correlation of 0.57 between the imputed values and the originally observed ones (scaled by library size). Since raw chromatin accessibility data are binary, we computed the area under the precision-recall curve to assess imputation accuracy, in which MultiVI reaches 0.41. Since the raw data can be markedly affected by low sensitivity, we also calculated the correlation between the imputed values and a smoothed version of the data (obtained with a method different from MultiVI; Methods), where the signal is averaged over similar cells (separately for ATAC-seq and RNA-seq), thus mitigating this issue. We obtained a high level of correspondence between the imputed values and this corrected version of the raw data (Spearman correlations 0.8 and 0.86 for accessibility and expression, respectively; Supplementary Fig. 7a,b).
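For binary accessibility data, the area under the precision-recall curve can be estimated by average precision, which is one common estimator and is assumed here; the text does not state which estimator was used. The function name is hypothetical and tied scores are broken by input order.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at the rank of each
    positive, with entries ranked by decreasing imputed score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank
    return ap / n_pos
```

Here `y_true` would hold the observed binary accessibility values and `scores` the imputed accessibility probabilities for the same cell-region entries. Unlike correlation, this measure is sensitive to how rare accessible entries are, so its baseline for a random predictor equals the fraction of positives rather than 0.5.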
Next, we focused on uncertainty estimation for the imputed accessibility values. We measured the uncertainty of the model for each imputed accessibility value by sampling from MultiVI’s generative model (Methods) and found a strong relationship between the estimated uncertainty and the error at each data point, indicating that the model is indeed less certain of predictions that are farther from the hidden groundtruth values (Fig. 4c). Equivalent analysis for expression imputations is less informative due to the strong correlation between the average observed expression of a gene and the uncertainty of the imputed results.
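The relationship between uncertainty and error can be examined per entry once repeated samples of the imputed value are available. The sketch below assumes that the samples have already been drawn from the generative model and summarizes them with a standard deviation; the exact uncertainty statistic is described in Methods and is not reproduced here, so both the statistic and the function name are assumptions.

```python
import statistics

def summarize_imputation(samples, observed):
    """Summarize repeated imputations of one accessibility entry.

    samples: imputed probabilities from repeated draws of the model
    observed: the held-out binary observation (0 or 1)"""
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)  # assumed measure of uncertainty
    return {"imputed": mean, "uncertainty": spread, "error": abs(mean - observed)}
```

Plotting `uncertainty` against `error` across all entries gives the kind of comparison described above, in which well-calibrated predictions show larger spread where the error is larger.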
We identified a small subset of chromatin accessibility values (roughly 0.5% of observations) for which we have highconfidence imputations that also have high error (Fig. 4c, green square). These highconfidence–higherror imputations correspond to cases where the model confidently predicts the opposite of the actually observed value (Fig. 4d). To investigate the source of these errors, we instead compared the imputed values to the smoothed version of the observed accessibility estimates (Methods). The smoothed data agree with the MultiVI predictions—namely, observations that were predicted as accessible tend to be open in highly similar cells, and observations that were predicted as inaccessible tend to be closed in similar cells (Fig. 4e). This indicates that these highconfidence higherror values may correspond to false negatives and false positives in the raw data.
As a specific example of imputation, we highlight the T cell marker gene CD3G. While the observed expression and the observed accessibility of the region containing the transcription start site (TSS) of the gene show high noise and sparsity, the imputed values are highly consistent and clearly mark the T cell compartment of the latent space (Fig. 4f).
To further test the usability of MultiVI’s imputation, we next explored a scenario in which the multi- and single-modal data come from different biological conditions. In this case, we resorted to the PBMC dataset collected under the DOGMA-seq protocol^{7}. In this dataset, PBMCs are profiled at a resting condition as well as after stimulation. We artificially created a dataset in which stimulated cells lack chromatin accessibility information and trained MultiVI on this dataset and on the original (complete) one (Supplementary Fig. 6a,b). We first considered the resulting latent representations to evaluate the extent of batch-effect correction and preservation of biological information. To this end, we applied the scib^{19} metrics package on both the perturbed and complete dataset. We found that the scores of the perturbed dataset are similar (within 5%) to the scores of the unperturbed dataset (Supplementary Fig. 6c). Next, we assessed the accuracy of imputation of chromatin accessibility values in the stimulated cells by comparing the imputed values to the hidden accessibility values after smoothing the latter values, as above. In this test, we found an accurate level of reconstruction (Spearman correlation 0.93; Supplementary Fig. 6d), highlighting MultiVI’s ability to infer missing modalities even under new perturbations.
Overall, these results show that MultiVI is capable of imputing missing observations effectively. The ability to quantify the uncertainty further allows the user to determine which imputed values are reliable for downstream analyses and which are not.
Cross-modal differential analyses
Our results thus far demonstrate that MultiVI can be used to impute missing observations in situations where the multimodal and the single-modality data both contain similar cell types. The task becomes more challenging when considering a population of cells that is distinct from all other populations in the data and for which one of the modalities was not observed.
To explore this, we used the same 10X PBMC dataset, with 75% of cells artificially unpaired, and clustered the latent space to identify distinct cellular populations (Supplementary Fig. 8a). We chose the B cell cluster, which we annotated as such using established markers (for example, CD19, CD79A). Next, we removed all expression information (paired or unpaired) from the B cell population, thus creating a distinct population for which only accessibility data are available to the model. In a second experiment, we removed all accessibility data from the B cell population to create a dataset for which only expression was observed in those cells (Supplementary Fig. 8b,c). We trained MultiVI separately on each of the two corrupted datasets, and used the model to perform differential analyses, comparing the B cell population and the remainder of the cells. Specifically, we conducted differential expression analysis with the model trained without B cell expression and differential accessibility with the model trained without B cell accessibility. Statistical significance was estimated with Bayes factor, as in previous work^{14,15,22} (Methods). To evaluate the accuracy of this analysis, we used standard differential analyses (not using generative models) on the heldout data to create groundtruth differential results and compared them to our inferred results (Methods). Considering the first corrupted dataset, although no expression data were observed in the B cell population, we found high concordance between the observed and predicted log foldchange (log_{FC}) values (Fig. 5a, Pearson’s correlation 0.57). When examining genes that are preferentially expressed in B cells (observed log_{FC} > 1) this became more evident (Pearson’s correlation 0.74). Similarly, with the second corrupted dataset, we found high concordance between observed and predicted differences of accessibility (Fig. 5b, Pearson correlation 0.67).
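A Bayes factor for a differential feature can be estimated from samples of the model's normalized values in the two groups. The sketch below follows one common construction, assumed here rather than taken from Methods: the posterior probability that the feature is higher in group A is estimated as the fraction of sample pairs in which A exceeds B, and the Bayes factor is reported on the natural-log scale. Function names, the clipping constant and the pseudocount are hypothetical.

```python
import math

def log_bayes_factor(samples_a, samples_b, eps=1e-8):
    """log(p / (1 - p)), where p is the fraction of (a, b) sample pairs with a > b."""
    pairs = [(a, b) for a in samples_a for b in samples_b]
    p = sum(1 for a, b in pairs if a > b) / len(pairs)
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) when all pairs agree
    return math.log(p / (1.0 - p))

def log2_fold_change(samples_a, samples_b, pseudocount=1e-6):
    """log2 ratio of the mean sampled values in the two groups."""
    mean_a = sum(samples_a) / len(samples_a)
    mean_b = sum(samples_b) / len(samples_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
```

A value near zero indicates no evidence for a difference in either direction, and large positive or negative values indicate strong evidence that the feature is higher in group A or group B, respectively.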
Among the most differentially expressed genes detected with the imputed expression values, we found known B cell markers, including IGLC3, IGHM, CD79A and IGHD (Supplementary Table 1 and Fig. 5c). Another example of a differentially expressed gene is CR2, a membrane protein that is normally expressed by both B and T cells and that was predicted specifically in the corresponding compartments (Fig. 5d). Overall, we identified 1,621 significantly differential genes (false discovery rate (FDR) of less than 0.05), of which 75% were also identified with the held-out data at a 5% FDR, a modest but significant enrichment (odds ratio 1.22, hypergeometric test P < 1.9 × 10^{−35}; Supplementary Table 1). Increasing the threshold of significance (on the FDR for the standard analysis and the Bayes factor for the MultiVI results) increased the overlap between the sets of results, indicating that the results are more consistent for more highly significant genes (Fig. 5e). Similarly, we identified 922 differentially accessible regions (FDR of less than 0.05, Supplementary Table 2), of which 86% were also identified with the held-out data at 5% FDR (odds ratio 1.57, P < 1.7 × 10^{−95}). As in the gene expression analysis, the overlap between the inferred and observed differential accessibility analyses increased with the significance thresholds (Fig. 5f).
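The enrichment of the overlap between two sets of significant features can be quantified with a hypergeometric tail probability and an odds ratio from the corresponding 2 × 2 table. The sketch below is a generic illustration; how the table was constructed for the reported values is not specified here, and the function names are hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric: population of N features, K of them
    significant in the reference analysis, n significant in the inferred one."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

def odds_ratio(both, only_a, only_b, neither):
    """Odds ratio of the 2 x 2 table of features called significant by
    analysis A, analysis B, both or neither."""
    return (both * neither) / (only_a * only_b)
```

Python integers have arbitrary precision, so the binomial coefficients remain exact even for thousands of genes or regions, and only the final ratio is converted to floating point.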
Taken together, these results demonstrate that MultiVI can be used to impute missing modalities even for populations that were only identified in a single-modality dataset. This unlocks the ability to leverage multimodal data to reanalyze existing single-modality datasets and impute the missing modality: chromatin landscape for existing scRNA experiments, and gene expression for existing scATAC experiments, as well as performing differential analyses using these imputed values.
MultiVI models three modalities and enables data imputation
To test the ability of MultiVI to integrate more than two modalities, we added measurements of protein abundance on the cell surface (with CITE-seq) and accounted for these in the MultiVI model using distributional assumptions similar to totalVI^{16} (Methods). To assess the ability of MultiVI to create meaningful trimodal latent representations, we used the two PBMC datasets profiled with DOGMA-seq^{7} and TEA-seq^{6} protocols. For each dataset separately we then integrated the different samples (Fig. 6a–c and Supplementary Fig. 9a–c), and evaluated the results using batch correction and biological-conservation metrics. As a benchmark, we compared MultiVI to MOFA+ (ref. ^{9}) and Seurat WNN^{8} (notably, the latter was designed for the case when all samples have all modalities). In both datasets, the latent space inferred by MultiVI performed on par with the two benchmark methods (Fig. 6d and Supplementary Fig. 9d). Next, we explored MultiVI’s ability to integrate trimodal paired and unpaired samples. To benchmark this regime, we used the DOGMA-seq dataset to artificially create cells that are only profiled in two of the three modalities (RNA + chromatin, chromatin + protein or RNA + protein) (Fig. 6e). Overall we observed that MultiVI’s latent representation effectively removes batch effects and preserves cell-type identity. We quantified this performance by computing integration metrics and comparing against a principal components analysis (PCA) latent space. PCA reaches a better adjusted Rand index (ARI) score, while MultiVI outperforms in terms of batch connectivity (iLISI), graph connectivity and all silhouette-based metrics (Methods and Supplementary Fig. 9e).
Next, we examined how MultiVI handles a complex experimental design in which cells are profiled in all possible combinations of modalities. We split cells in the DOGMAseq dataset, selectively removed information from different modalities and created datasets that have only one (out of three) modality or two (out of three) modalities. We found that the resulting integrated space (Fig. 6f) ameliorates batch and technologies effects while maintaining celltype biological information, as in previous scenarios. Again, we assessed MultiVI’s performance by comparison to PCA embedding computed on the raw data and showed superior integration performance (Supplementary Fig. 9e). Overall, these results demonstrate that MultiVI is able to ameliorate batch effects while preserving biological heterogeneity, even in the complex scenario in which three data modalities are present at different combinations.
Last, we assessed the performance of MultiVI to impute missing data in this trimodal setting. We artificially unpaired 75% of the data, such that each cell is represented by three copies with only RNA, chromatin accessibility or protein expression data (resulting in a dataset in which 8% of cells have all three modalities and the rest have only one of the three modalities). We find that the imputed values of the missing modalities generated by MultiVI (Supplementary Fig. 10) correspond to the observed values. These include Spearman correlations of 0.78 in the gene expression values, independent of the single modality used as input data (using smoothed observations, as above). We observed a similar outcome in the case of chromatin accessibility, with a Spearman correlation of 0.73 and 0.76 between the smoothed observed values and the imputed ones when only RNA or protein information is available, respectively. Our model for the protein data was designed to control for the non-negligible background component in the signal (which may result from non-specific binding of antibodies). We therefore first calculated the foreground (‘denoised’) component of all observed protein expression values using totalVI (ref. ^{16}). Since the protein imputed values in MultiVI are also generated with a similar two-component model, we were able to compare the imputed foreground signals to foreground signals that were inferred from the respective hidden observations. We find that the imputed values are also correlated with the observed data (with Spearman correlations of 0.53 when only chromatin accessibility data or gene expression data are available).
In summary, the inclusion of protein data into our analysis highlights the ability of MultiVI to handle additional data types and leverage them for a joint analysis with measurements of chromatin accessibility and gene expression.
Discussion
MultiVI is a deep generative model for integrated analysis of multimodal and single-modality single-cell datasets. MultiVI uses jointly profiled data to learn a multimodal model of data sources and to relate individual modalities on the same population of cells. The model accounts for technical sources of measurement noise and can correct for additional sources of unwanted variation (for example, batch effects). MultiVI learns a rich embedding of the data coalescing information present in each individual data type, which can be used for downstream analyses.
Multiomic integration algorithms have been recently classified based on their ability to infer latent representations from shared cells across measurements (vertical integration scenario), shared features across datasets (horizontal integration scenario)^{23,24} or subsets of any of the previous (mosaic integration scenario). MultiVI could be considered a vertical integration scheme, when purely multiomic data are analyzed and different modalities are measured in the same cells. At the same time, MultiVI’s ability to integrate transcriptional and protein data, when only a subset of genes are shared across technologies, places it within the horizontal integration scenario. Last, MultiVI’s capacity for integrating multiple modalities when only a subset of cells is shared across modalities also qualifies it as a mosaic algorithm. These characteristics highlight MultiVI’s ability to work under many scenarios, in contrast to previous algorithms that are optimized to work in just one condition^{25,26}.
Recent algorithms for the analysis of multimodal data were developed to process paired datasets, in which both modalities have been profiled in the same cell^{9,10}. Most of these algorithms, however, handle multimodal data but lack the ability to integrate singlemodality datasets. While this task is possible to achieve with the Seurat code base^{8}, the respective methods we used here were not specifically designed for this purpose, and their performance was not tested for this task. Here, we have shown that the use of deep generative modeling can effectively combine unpaired scRNA and scATAC data with multimodal singlecell data, generating a meaningful lowdimensional representation of the cells’ state that captures information about both their transcriptome and epigenome. This joint representation is achievable even when the amount of paired data is minimal, thus opening exciting opportunities for future studies in which only a small amount of paired data can be sufficient for deriving a more nuanced interpretation of singlemodality data. In contrast to Cobolt, or other twomodality algorithms^{27,28}, we demonstrate that MultiVI is able to integrate information from additional molecular properties of cells, such as surface protein expression. However, MultiVI might be limited in its ability to analyze datasets with a small number of cells, owing both to the scale of data required to train neural networks and to the amount of information needed to correctly learn multiple modality-specific embeddings.
An additional key capability that is unique to MultiVI is the inference of the missing modality. We have demonstrated that we can identify differential gene expression in subpopulations for which only chromatin accessibility data are available and distinguishing chromatin features for subpopulations for which only gene expression data are available. These results open the way for exciting future applications: first, MultiVI and similar methods have the potential to enable a reanalysis of the large compendia of available singlemodality datasets with a relatively small amount of additional paired data, thus potentially leading to more comprehensive characterizations of cell state. Second, it can facilitate costeffective designs for future studies, in which only a subset of samples need to be profiled with the (more costly) multimodal protocol. Last, MultiVI expedites the transfer of information and analysis results across modalities, such as the case in which RNA velocity could be ported to cells in which only chromatin information has been profiled.
In summary, MultiVI is able to seamlessly integrate single and multimodal data, process information from different laboratories or technologies and create a rich joint representation (low and high dimensional) that harnesses all available information. It is implemented in the scvitools framework^{29}, making it easy to configure, train and use.
Methods
The MultiVI model
MultiVI inherits generative models describing transcriptional, chromatin accessibility and protein observations from scVI (ref. ^{14}), peakVI (ref. ^{15}) and TotalVI (ref. ^{16}). Briefly, let \({X}_{\mathrm{R}}\in {{\mathbb{N}}}_{0}^{C\times G}\) be a scRNAseq cellbygene matrix with C cells and G genes, where \({x}_{\mathrm{R}}^{cg}\in {{\mathbb{N}}}_{0}\) is the number of reads from cell c that map to gene g. Let \({X}_{\mathrm{A}}\in {{\mathbb{N}}}_{0}^{C\times J}\) be a scATACseq cellbyregion matrix with C cells and J regions, where \({x}_{\mathrm{A}}^{cj}\in {{\mathbb{N}}}_{0}\) is the number of fragments from cell c that map to region j. Let \({X}_{\mathrm{P}}\in {{\mathbb{N}}}_{0}^{C\times P}\) be a cellbyprotein matrix with C cells and P proteins, where \({x}_{\mathrm{P}}^{cp}\in {{\mathbb{N}}}_{0}\) is the number of counts from cell c that are assigned to protein p.
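MultiVI models the probability of observing \({x}_{\mathrm{R}}^{cg}\) counts in a gene by using a negative binomial distribution, \({x}_{\mathrm{R}}^{cg} \sim {\mathrm{NB}}\left({\ell }_{c}{\rho }_{cg},{\theta }_{g}\right)\),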
where ℓ_{c} is a scaling factor that captures cellspecific biases (for example, library size), ρ_{cg} is a normalized gene frequency and θ_{g} models the per gene dispersion. The probability of observing a region as accessible is modeled with a Bernoulli distribution,
where p_{cj} captures the true biological heterogeneity and r_{j} captures regionspecific biases (for example, width, sequence). Last, MultiVI models protein expression with a mixture of negative binomial distributions that encompass background and foreground protein expression:
In this model, π_{1} accounts for the mixture proportion, β for the background expression level and α ≥ 0 is a value that corrects for foreground expression. In the observational models, the normalized gene frequency per cell, the normalized peak accessibility and the background expression level are inferred from data using deep neural networks. The scaling factor, the regionspecific and the per gene dispersion parameters are optimized directly (this is in contrast to the original implementation of scVI in which library size was modeled using a lognormal distribution).
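The expression and accessibility likelihoods described above can be illustrated with a minimal sketch. It assumes the mean and inverse-dispersion parameterization of the negative binomial that is common in scVI-style models, and it assumes that the Bernoulli probability is the product of the biological term and the region-specific factor (any further cell-specific factor would multiply in the same way). The function names and numeric values are hypothetical and are not taken from the MultiVI code base.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def bernoulli_log_pmf(x, p):
    """Log-probability of a binary observation x with success probability p."""
    return math.log(p) if x == 1 else math.log(1.0 - p)


# Hypothetical values for one cell, one gene and one region.
library_size = 2000.0     # l_c, cell-specific scaling factor
rho = 0.001               # rho_cg, normalized gene frequency
theta = 2.0               # theta_g, per-gene dispersion parameter
mu = library_size * rho   # expected count of this gene in this cell

log_p_expr = nb_log_pmf(3, mu, theta)

p_cj, r_j = 0.4, 0.9      # biological accessibility and region-specific factor
log_p_acc = bernoulli_log_pmf(1, p_cj * r_j)
```

In a trained model, the normalized frequency and the accessibility probability would be outputs of the decoder networks rather than fixed constants.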
Next, for each cell, normalized gene frequencies ρ_{cg}, accessibility biological heterogeneity p_{cj} and foreground and background protein expression \({\alpha }_{cp}^{\,f}\) and \({\beta }_{cp}^{b}\) are estimated using a latent representation as in VAE^{13}. Briefly, each modality is assigned its own latent representation with an isotropic multivariate normal prior, \({Z}_{c}^{\,\mathrm{A}} \sim {{{\rm{MVN}}}}(0,I)\), \({Z}_{c}^{\,\mathrm{R}} \sim {{{\rm{MVN}}}}(0,I)\) and \({Z}_{c}^{\,\mathrm{P}} \sim {{{\rm{MVN}}}}(0,I)\). Then, with the purpose of bringing all representations together, they are combined by taking their average (for example, in the case of two profiled modalities such as ATAC and RNA, we have \({Z}_{c}=\frac{{Z}_{c}^{\,\mathrm{A}}+{Z}_{c}^{\,\mathrm{R}}}{2}\)). This merged representation is then used to decode all model parameters.
We explored alternative modality weighting schemes. Our default mode involves an average across modality latent representations, termed ‘equal’ in our software release and in the Supplementary Information (example for two modalities: \({Z}_{c}=\frac{{Z}_{c}^{\,\mathrm{A}}+{Z}_{c}^{\,\mathrm{R}}}{2}\)). Alternatively, a global weighted average scheme uses modality weights that are equal across all cells, w_{m}, such that ∑_{modalities}w_{m} = 1 (example for two modalities: \({Z}_{c}={w}_{\mathrm{A}}{Z}_{c}^{\,\mathrm{A}}+{w}_{\mathrm{R}}{Z}_{c}^{\,\mathrm{R}}\)). Last, a cellspecific scheme uses a separate weight per modality and cell, w_{m,c}, such that ∑_{modalities}w_{m,c} = 1 (example for two modalities: \({Z}_{c}={w}_{{\mathrm{A}},c}{Z}_{c}^{\,\mathrm{A}}+{w}_{{\mathrm{R}},c}{Z}_{c}^{\,\mathrm{R}}\)). In both weighted schemes, the weights are learnable parameters optimized along with the rest of the model.
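The three weighting schemes can be sketched as a single mixing function. This is an illustration only: the softmax parameterization used to keep the weights positive and summing to one is an assumption, and the function name `mix_latents` and its arguments are hypothetical rather than part of the software release.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mix_latents(latents, logits=None):
    """Combine per-modality latent vectors into one representation.

    latents: array of shape (n_modalities, n_cells, n_latent).
    logits:  None for the 'equal' scheme, shape (n_modalities,) for one
             global weight per modality, or shape (n_modalities, n_cells)
             for cell-specific weights.
    """
    n_mod = latents.shape[0]
    if logits is None:
        weights = np.full((n_mod, 1, 1), 1.0 / n_mod)
    elif logits.ndim == 1:
        weights = softmax(logits, axis=0)[:, None, None]
    else:
        weights = softmax(logits, axis=0)[:, :, None]
    return (weights * latents).sum(axis=0)


rng = np.random.default_rng(0)
z = rng.normal(size=(2, 5, 3))  # two modalities, five cells, three latent dims
z_equal = mix_latents(z)
z_global = mix_latents(z, logits=np.array([0.0, 1.0]))
z_cell = mix_latents(z, logits=rng.normal(size=(2, 5)))
```

In a trainable implementation the logits would be model parameters updated by the optimizer; here they are fixed arrays.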
MultiVI inference model
We use variational inference^{30} to compute posterior estimates of model parameters using the following variational approximation:
where the delta distribution, δ, highlights the fact that parameters ℓ, r, θ and π_{1} are inferred from the data as point estimates and optimized directly. The cellspecific factor ℓ_{c} ∈ ℓ is computed from the input data for cell c via a deep neural network \({f}_{\ell }:{{\mathbb{N}}}_{0}^{C}\to \left[0,1\right]\). The regionspecific factor r_{j} ∈ r, since it is optimized across samples, is stored as a Jdimensional tensor, used and optimized directly. For each latent representation, the encoders are defined as \({h}_{z}^{\mathrm{Transc}}:{{\mathbb{N}}}_{0}^{C}\to \left({{\mathbb{R}}}^{D},{{\mathbb{R}}}^{D}\right)\), \({h}_{z}^{{\mathrm{Chrom}}}:{{\mathbb{N}}}_{0}^{C}\to \left({{\mathbb{R}}}^{D},{{\mathbb{R}}}^{D}\right)\), \({h}_{z}^{\mathrm{Protein}}:{{\mathbb{N}}}_{0}^{C}\to \left({{\mathbb{R}}}^{D},{{\mathbb{R}}}^{D}\right)\), where each of them computes the distributional parameters of a Ddimensional multivariate normal random variable: \({z}_{c}^{\mathrm{Modality}} \sim {\mathrm{MVN}}\left(\mu ={[{h}_{z}^{\mathrm{Modality}}\left({x}_{c}\right)]}_{1},{\sigma }^{2}={[{h}_{z}^{\mathrm{Modality}}\left({x}_{c}\right)]}_{2}\right)\). The subscripts 1 and 2 in the last equation reflect the fact that each encoder computes both the mean and the variance of the distribution.
Using the variational approximation, the evidence lower bound is computed and optimized with respect to the variational and model parameters using stochastic gradients. To enforce the similarity between chromatin and transcription latent representations, we add to the evidence lower bound a term that penalizes the distance between representations using a symmetric Jeffreys divergence between distributions, \(d\,({Z}_{c}^{\,\mathrm{A}},{Z}_{c}^{\,\mathrm{R}})={{{\rm{symmKL}}}}(q({z}_{c}^{\mathrm{A}}),q({z}_{c}^{\mathrm{R}}))={{{\rm{KL}}}}(q({z}_{c}^{\mathrm{A}}),q({z}_{c}^{\mathrm{R}}))+{{{\rm{KL}}}}(q({z}_{c}^{\mathrm{R}}),q({z}_{c}^{\mathrm{A}}))\). In the case of three or more distributions, we extend the penalty to match every possible pair of distributions (when we include protein data, \(d({Z}_{c}^{\,\mathrm{A}},{Z}_{c}^{\,\mathrm{R}},{Z}_{c}^{\,\mathrm{P}})={{{\rm{symmKL}}}}(q({z}_{c}^{\mathrm{R}}),q({z}_{c}^{\mathrm{A}}))+{{{\rm{symmKL}}}}(q({z}_{c}^{\mathrm{R}}),q({z}_{c}^{\mathrm{P}}))+{{{\rm{symmKL}}}}(q({z}_{c}^{\mathrm{A}}),q({z}_{c}^{\mathrm{P}}))\)). Last, we explored an alternative penalization scheme in which we replace the symmetric KL divergence with a maximum mean discrepancy (MMD) penalty^{18}. Additional information about the model definition and inference procedure can be found at https://docs.scvitools.org/en/stable/user_guide/models/multivi.html.
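For Gaussian posteriors with diagonal covariance, the symmetric KL penalty has a closed form. The sketch below shows the pairwise sum for an arbitrary number of modalities; the function names and the toy posteriors are hypothetical, and the diagonal-covariance form is an assumption based on the encoder outputs (a mean and a variance per latent dimension).

```python
from itertools import combinations

import numpy as np


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariance, summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0,
        axis=-1,
    )


def symmetric_kl(p, q):
    """Symmetrized KL divergence; p and q are (mean, variance) tuples."""
    return kl_diag_gaussians(*p, *q) + kl_diag_gaussians(*q, *p)


def alignment_penalty(posteriors):
    """Sum of symmetric KL terms over every pair of modality posteriors."""
    return sum(symmetric_kl(p, q) for p, q in combinations(posteriors.values(), 2))


# Toy posteriors for one cell with a two-dimensional latent space.
posteriors = {
    "A": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    "R": (np.array([1.0, 0.0]), np.array([1.0, 1.0])),
    "P": (np.array([2.0, 0.0]), np.array([1.0, 1.0])),
}
penalty = alignment_penalty(posteriors)
```

With unit variances, each symmetric KL term reduces to the squared distance between the means, so the three pairs in the toy example contribute 1, 1 and 4.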
Training procedure
By default, MultiVI is optimized using AdamW^{31} with a learning rate of 0.0001, weight decay of 0.001 and minibatch size of 128. As in previous models, we trained on 90% of the data and used 10% as a validation set. We selected an initial training plan consisting of 500 epochs, but training is stopped early if there is no improvement in the reconstruction loss on the validation dataset for 50 epochs (early stopping). We downweighted the KL divergence between the latent representation and its prior during the first 50 epochs (for i ∈ [1, 50], \({{{\rm{KL}}}} \cdot \frac{i}{50}\)). In addition, a domain adaptation penalty is included in the training schemes to increase mixing in the latent space^{32,33}. Briefly, a classifier is created using a twolayer feedforward neural network with 32 hidden units. Its output is the probability for each cell to belong to one of the batch keys. We use the output of this classifier to create a crossentropy loss that is adversarially trained.
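The KL warm-up and early stopping rules can be written down in a few lines. This is a schematic of the schedule only, under the assumption that "no improvement" means the validation loss did not fall below its best value so far; the function names are hypothetical and no model is trained here.

```python
def kl_weight(epoch, n_warmup=50):
    """Linear warm-up: epoch i in [1, n_warmup] gives i / n_warmup, then 1."""
    return min(1.0, epoch / n_warmup)


def stopping_epoch(val_losses, max_epochs=500, patience=50):
    """Epoch at which training ends for a given sequence of validation losses.

    Training stops once the validation loss has not improved on its best
    value for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return min(len(val_losses), max_epochs)


# Ten improving epochs followed by a plateau: training stops 50 epochs later.
losses = [1.0 - 0.01 * i for i in range(10)] + [0.95] * 100
stopped_at = stopping_epoch(losses)
```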
Modeling differences between MultiVI and Cobolt
While conceptually similar, MultiVI and Cobolt have several key differences in design and implementation choices. MultiVI offers additional functionalities due to its generative model, namely denoising, imputation, uncertainty estimation and differential analyses, which are discussed in detail in this paper. In addition to those, we detail several other differences between the methods: (1) MultiVI uses a distributional average and penalization to mix the latent representations, compared with the classical product of experts calculation used by Cobolt. (2) The distributional assumptions made by the models are different: MultiVI uses tailored noise models for each modality (negative binomial for expression, Bernoulli for accessibility) and uses a deep neural network for the generative component of the model as well as the inference component. In contrast, Cobolt uses a multinomial likelihood for both modalities and uses a linear transformation as a generative model. (3) MultiVI explicitly guards against overfitting the data, in both the architecture (for example, dropout layers) and the training procedure (holding out data to use for early stopping if the model overfits), whereas Cobolt does not contain such guardrails.
Benchmarking and evaluation
Dataset preprocessing
The 10X multiomic unsorted PBMC dataset was downloaded from the company website. For artificial unpairing analyses, the processed peakbycell matrix was downloaded and filtered to remove features that are detected in fewer than 1% of the cells. For the mixedsource PBMC dataset, the fragment file was downloaded and reprocessed using CellRangerARC (v.2.0.0) with the Satpathy hg38 peaks. The Satpathy dataset was downloaded from the Gene Expression Omnibus (GEO) (accession no. GSE129785); specifically the processed peakbycell matrix and metadata files: scATACHematopoiesisAll.cellbarcodes.txt.gz, scATACHematopoiesisAll.mtx.gz and scATACHematopoiesisAll.peaks.txt.gz. We then filtered the data to only include peaks that were detected in at least 0.1% of the data, and lifted those peaks over from the hg19 to the hg38 genome reference using the UCSC liftover utility^{34}. The Ding dataset was downloaded from GEO (accession no. GSE132044); specifically the pbmc data: pbmc_hg38_count_matrix.mtx.gz, pbmc_hg38_cell.tsv.gz and pbmc_hg38_gene.tsv.gz.
Matching celltype annotation was downloaded from SCP (accession no. SCP424). After preprocessing, the reanalyzed 10X dataset was combined with both singlemodality datasets and the features were filtered to remove features (either genes or peaks) that were detected in fewer than 1% of the cells.
The DOGMAseq dataset^{7}, containing paired scRNA, scATAC and surface protein abundance observations, was downloaded from GEO (accession no. GSE156478); specifically the four samples containing all three modalities: CD28_CD3_*. The ATAC observations were merged using ArchR^{35} with default arguments to produce a unified set of peaks called from all four samples. For model training, we only used features that were detected in at least 1% of cells. For the analyses included in this paper, only cells originating from the DIG_CTRL sample were used.
RNAbased Seurat integration
This integration modality disregards multiomic information and only RNA information is considered from multiome cells. Briefly, RNA information is first integrated and then chromatin accessibility is integrated using gene activity scores (RNAbased method) or RNA imputed values (RNAbased imputed method).
In more detail, cells were separated into three different datasets: multiomic cells (using only expression data), RNAonly cells and ATAConly cells. Seurat objects were created for multiome and RNAonly data, and were then normalized and scaled, and the first 50 principal components were calculated. For ATAConly cells, a Seurat object was created, gene activity scores were calculated and scaled, and principal components were computed. To integrate the three datasets, integration anchors were calculated (using FindIntegrationAnchors) and the data were then integrated (using IntegrateData). The RNAbased method uses gene activity scores as representative values from the ATAConly cells. The RNAbased imputed method includes an additional step in which RNA imputed values are calculated from gene activity scores by running FindTransferAnchors and TransferData. In this integration method, RNA imputed values are used as representative values from ATAConly cells. Finally, the integrated data were scaled and principal components were calculated to generate the final latent space. Across these integration methods, we followed the standard recommended procedure for analyzing data with Seurat given in their tutorials^{36}.
WNNbased Seurat integration
This approach aims to leverage information from both modalities (chromatin accessibility and expression values), using the newly described WNN approach from Seurat V4 (ref. ^{8}). We first created a WNN graph using multiomic information and then projected chromatin and transcriptional information onto this graph.
We began by separating cells in unpaired datasets into three different datasets: multiomic cells (with both expression and chromatin data), RNAonly cells and ATAConly cells. First, the multiome latent representation was found by using the sctransform function and principal components on the expression data and latent semantic analysis (LSA) (TFIDF decomposition followed by singular value decomposition) on the chromatin data. Next, multimodal neighbors and the first 50 supervised PCA components were calculated. To merge RNAonly and ATAConly data with the multiome representation, transfer anchors (FindTransferAnchors) were computed on the RNAonly data and on gene activity scores for the ATAConly data, and each dataset was integrated using the IntegrateEmbeddings function. Finally, datasets and dimensionality reductions were merged, and a uniform manifold approximation and projection (UMAP) was computed and visualized using the merged information.
Neighbor rank distance calculation
For artificially unpaired cells, each cell has two unpaired representations in the latent space. Given cell c with representations c_{a} and c_{b}, let \(S\left({c}_{\mathrm{a}},K\right)\) be the set of K nearest neighbors (KNN) of c_{a}. We then define \(\delta \left({c}_{\mathrm{a}},{c}_{\mathrm{b}}\right)\) as the minimal K for which cell c_{b} is among the KNN of cell c_{a}: \(\delta \left({c}_{\mathrm{a}},{c}_{\mathrm{b}}\right)=\min \left\{k:{c}_{\mathrm{b}}\in S\left({c}_{\mathrm{a}},k\right)\right\}\).
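The neighbor rank distance equals one plus the number of cells that lie strictly closer to c_{a} than c_{b} does. A minimal sketch follows, assuming Euclidean distance in the latent space and excluding a cell from its own neighbor set; the function name and the toy coordinates are hypothetical.

```python
import numpy as np


def neighbor_rank_distance(latent, idx_a, idx_b):
    """Minimal K such that cell idx_b is among the K nearest neighbors of idx_a.

    latent: array of shape (n_cells, n_latent) holding all representations.
    """
    dists = np.linalg.norm(latent - latent[idx_a], axis=1)
    dists[idx_a] = np.inf  # a cell is not counted as its own neighbor
    # Rank of idx_b = 1 + number of cells strictly closer to idx_a.
    return int(np.sum(dists < dists[idx_b])) + 1


latent = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.5, 0.0]])
rank_01 = neighbor_rank_distance(latent, 0, 1)  # one cell (index 3) is closer
rank_03 = neighbor_rank_distance(latent, 0, 3)
rank_02 = neighbor_rank_distance(latent, 0, 2)
```

Counting closer cells avoids building the full KNN graph for every K, which is convenient when the rank can be large.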
LISI score calculation
Enrichment scores were computed as they were in our previous work^{15}, and similarly to the LISI scores described in the Harmony paper^{17}. Briefly, given a latent representation R, an integer k and the modality labels (expression or accessibility) L, we compute G_{R,k}, the KNN graph from R with k neighbors. Using G_{R,k}, we compute for each cell the proportion of neighbors that share the same modality: \({s}_{i}=\frac{1}{k}{\sum }_{j\in {G}_{R,k}(i)}{\mathbb{1}}\left({L}_{i}={L}_{j}\right)\). The enrichment score is the average score across all cells, \(\bar{s}\), normalized by the expected score for a random sample from the distribution of labels: \(E\left[s\right]={\sum }_{\ell \in \{L\}}{p}_{\ell }^{2}\), with p_{ℓ} being the proportion of each modality.
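The enrichment score can be computed directly from this definition. The sketch below uses a brute-force Euclidean KNN search, which is an assumption made for brevity and only suitable for small inputs; the function name and toy data are hypothetical.

```python
import numpy as np


def enrichment_score(latent, labels, k):
    """Average same-label neighbor fraction divided by its expected value.

    latent: array of shape (n_cells, n_latent); labels: one label per cell.
    """
    labels = np.asarray(labels)
    n = latent.shape[0]
    # Brute-force pairwise squared distances; self-distances are masked out.
    d = ((latent[:, None, :] - latent[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    s_bar = (labels[knn] == labels[:, None]).mean(axis=1).mean()
    _, counts = np.unique(labels, return_counts=True)
    expected = np.sum((counts / n) ** 2)
    return s_bar / expected


# Two well-separated groups, one per modality: every neighbor shares the label.
separated = np.array([[0.0], [0.1], [0.2], [0.3], [100.0], [100.1], [100.2], [100.3]])
labels_sep = ["expression"] * 4 + ["accessibility"] * 4
score_separated = enrichment_score(separated, labels_sep, k=3)

# Alternating labels along a line: the nearest neighbor never shares the label.
alternating = np.array([[0.0], [1.0], [2.0], [3.0]])
labels_alt = ["expression", "accessibility", "expression", "accessibility"]
score_alternating = enrichment_score(alternating, labels_alt, k=1)
```

A score near 1 indicates that modalities are mixed as expected by chance, whereas larger values indicate that cells cluster by modality.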
Extended integration metrics
We computed extended integration metrics using the scib^{19} package. Briefly, to quantify integration throughout the paper, we computed the ARI, normalized mutual information, graph connectivity, batch LISI (iLISI), celltype LISI (cLISI), kBET, silhouette width (label silhouette) and the average silhouette width as proposed previously. To quantify metrics that depend on clustering of the data, we first ran the provided functions to optimize the clustering resolution. We provided celltype labels as labelkey and the corresponding label under evaluation as batchkey.
Estimating imputation uncertainty
We estimated the uncertainty of the model for each imputed value by sampling from the latent space (n = 15). Next, for each imputed feature (gene or locus), we computed the mean and standard deviation across samples. More consistent predictions corresponded to less uncertainty.
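The sampling procedure can be sketched as follows. The decoder here is a toy stand-in (a fixed linear map followed by a sigmoid), and the function name, argument names and array shapes are hypothetical; only the number of samples (15) comes from the text.

```python
import numpy as np


def imputation_uncertainty(mu, sigma, decode, n_samples=15, seed=0):
    """Mean and standard deviation of decoded values over latent samples.

    mu, sigma: posterior mean and standard deviation, shape (n_cells, n_latent).
    decode:    function mapping latent samples to imputed feature values.
    """
    rng = np.random.default_rng(seed)
    draws = np.stack([
        decode(mu + sigma * rng.standard_normal(mu.shape))
        for _ in range(n_samples)
    ])
    return draws.mean(axis=0), draws.std(axis=0)


# Toy decoder standing in for the trained generative network (hypothetical).
rng = np.random.default_rng(1)
weights = rng.normal(size=(4, 6))


def toy_decoder(z):
    return 1.0 / (1.0 + np.exp(-(z @ weights)))


post_mu = rng.normal(size=(10, 4))
post_sigma = np.full((10, 4), 0.3)
imputed_mean, imputed_std = imputation_uncertainty(post_mu, post_sigma, toy_decoder)
```

Features whose decoded values change little across latent samples receive a small standard deviation, matching the notion that consistent predictions are less uncertain.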
KNNbased estimate of accessibility
To estimate accessibility without using MultiVI, we computed a lower dimensional representation of the data using LSA (top 30 components), then for each cell we computed the average accessibility profile of the 50 nearest neighbors in the LSA space. This created a smooth estimate of accessibility using highly similar cells, mitigating the effect of false observations.
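A minimal sketch of this smoothing step is given below. It assumes one common TF-IDF variant (row-normalized term frequency multiplied by a log inverse document frequency), Euclidean distances in LSA space and neighbor sets that exclude the cell itself; none of these details are specified in the text, and the function names are hypothetical. Every cell and region is assumed to have at least one nonzero entry.

```python
import numpy as np


def lsa_embedding(counts, n_components=30):
    """TF-IDF transform followed by a truncated SVD (LSA)."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    idf = np.log1p(counts.shape[0] / counts.sum(axis=0))
    u, s, _ = np.linalg.svd(tf * idf, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def knn_average(embedding, values, k=50):
    """Average the rows of `values` over each cell's k nearest neighbors."""
    sq = (embedding ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d, np.inf)  # exclude the cell itself
    knn = np.argsort(d, axis=1)[:, :k]
    return values[knn].mean(axis=1)


# Small synthetic binary accessibility matrix (60 cells, 40 regions).
rng = np.random.default_rng(0)
acc = rng.binomial(1, 0.3, size=(60, 40)).astype(float)
acc[0, :] = 1.0  # guarantee every region is observed at least once
acc[:, 0] = 1.0  # guarantee every cell has at least one accessible region
embedding = lsa_embedding(acc, n_components=10)
smoothed = knn_average(embedding, acc, k=5)
```

The toy example uses fewer components and neighbors than the values stated above (30 and 50) only because the synthetic matrix is small.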
Expression smoothing
Expression smoothing was achieved by taking the top 30 principal components of the expression data (computed with PCA), computing the KNN graph (for K = 50) and averaging the expression values of the neighbors for each cell (scaled by library size).
Differential analyses with heldout data
To identify a distinct population of cells, we used the Leiden community detection algorithm^{37}, then examined the expression levels of known marker genes (CD79A, CD19) to identify the cluster of B cells. We then unpaired the data within the cluster, once by removing all expression data from the B cells and once by removing all accessibility data from the cluster. Since the data were already unpaired, this resulted in several cells with no observations at all, and those were removed from the dataset.
Differential expression using heldout data
Differential expression was computed in two ways. (1) Using the heldout data: values were normalized per cell by dividing the expression levels by the total number of reads in the cell. The log fold change (logFC) values were then computed from the ratio of the mean expression values of the two groups. Statistical significance was determined using the Wilcoxon ranksum test. (2) Without the heldout data, using MultiVI: we followed a procedure described by Lopez et al.^{14}, which samples from the latent space and uses the generative model to estimate expression profiles. Statistical significance was then determined using Bayes factors, as well as an FDR approach described by Lopez et al.^{22}.
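The heldout-data calculation in (1) amounts to a per-cell normalization followed by a ratio of group means. The sketch below covers only that part; the base-2 logarithm and the small pseudocount are assumptions, the function name is hypothetical, and the rank-sum test is not shown.

```python
import numpy as np


def log_fold_change(counts, in_target, eps=1e-8):
    """Per-gene log2 fold change between target and reference cells.

    counts:    array of shape (n_cells, n_genes) with raw counts.
    in_target: boolean array marking the cells of the target group.
    eps:       pseudocount guarding against division by zero (assumption).
    """
    in_target = np.asarray(in_target, dtype=bool)
    normalized = counts / counts.sum(axis=1, keepdims=True)
    mean_target = normalized[in_target].mean(axis=0)
    mean_reference = normalized[~in_target].mean(axis=0)
    return np.log2((mean_target + eps) / (mean_reference + eps))


counts = np.array([[10.0, 90.0], [20.0, 80.0], [50.0, 50.0], [40.0, 60.0]])
lfc = log_fold_change(counts, [True, True, False, False])
```

In the toy example the first gene has mean normalized expression 0.15 in the target group and 0.45 in the reference group, giving a log2 fold change of about -1.58.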
Differential accessibility using heldout data
Differential accessibility was computed equivalently to differential expression. (1) Using heldout data, values were normalized using the TFIDF transformation, differential accessibility was computed by subtracting the mean accessibility in the reference group from the same value in the target group. Statistical significance was determined using Wilcoxon ranksum test. (2) Without the heldout data, using MultiVI, using the procedures described in our previous work^{14,15,22}.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data used in this paper are publicly available via the original publications and releases. TEAseq datasets were downloaded from GEO, accession no. GSE158013. DOGMAseq datasets were downloaded from GEO, accession no. GSE156478. The Satpathy scATACseq dataset was downloaded from GEO, accession no. GSE129785. The Ding scRNAseq dataset was downloaded from GEO, accession no. GSE132044. The 10X PBMC sample dataset is available from the company website: https://www.10xgenomics.com/resources/datasets
Code availability
The code for MultiVI is publicly available via the scvitools suite at scvitools.org. Intermediate data, trained models used in this paper and the custom notebooks needed to generate the figures in this paper are all posted and available on Zenodo^{38}.
References
Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).
Buenrostro, J. D., Wu, B., Chang, H. Y. & Greenleaf, W. J. ATACseq: a method for assaying chromatin accessibility genomewide. Curr. Protoc. Mol. Biol. 109, 21.29.1–21.29.9 (2015).
Tang, F. et al. mRNASeq wholetranscriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
Jaitin, D. A. et al. Massively parallel singlecell RNAseq for markerfree decomposition of tissues into cell types. Science 343, 776–779 (2014).
Buenrostro, J. D. et al. Singlecell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Elliott, S. et al. Simultaneous trimodal singlecell measurement of transcripts, epitopes and chromatin accessibility using TEAseq. eLife 10, e63632 (2021).
Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
Hao, Y. et al. Integrated analysis of multimodal singlecell data. Cell 184, 3573–3587 (2021).
Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multimodal singlecell data. Genome Biol. 21, 111 (2020).
Singh, R., Hie, B. L., Narayan, A. & Berger, B. Schema: metric learning enables interpretable synthesis of heterogeneous singlecell modalities. Genome Biol. 22, 131 (2021).
DeTomaso, D. et al. Functional interpretation of single cell similarity maps. Nat. Commun. 10, 4376 (2019).
Gong, B., Zhou, Y. & Purdom, E. Cobolt: integrative analysis of multimodal singlecell sequencing data. Genome Biol. 22, 351 (2021).
Kingma, D. P. & Welling, M. Autoencoding variational Bayes. Preprint at http://arxiv.org/abs/1312.6114v10 (2013).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: a deep generative model for single cell chromatin accessibility analysis. Cell Rep. Meth. 2 (2022).
Gayoso, A. et al. Joint probabilistic modeling of singlecell multiomic data with totalvi. Nat. Methods 18, 272–282 (2021).
Korsunsky, I. et al. Fast, sensitive and accurate integration of singlecell data with harmony. Nat. Methods 16, 1289–1296 (2019).
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. A kernel method for the twosampleproblem. Advances in neural information processing systems 19 (NIPS, 2006).
Malte, L. et al. Benchmarking atlaslevel data integration in singlecell genomics. Nat. Methods 19, 41–50 (2022).
Satpathy, A. T. et al. Massively parallel singlecell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Ding, J. et al. Systematic comparison of singlecell and singlenucleus RNAsequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
Lopez, R. et al. Decisionmaking with autoencoding variational Bayes. Advances in Neural Information Processing Systems 33, 5081–5092 (2020).
Richardson, S., Tseng, G. C. and Sun, W. Statistical methods in integrative genomics. Annu. Rev. Stat. Appl. 3, 181–209 (2016).
Argelaguet, R., Cuomo, A. S., Stegle, O. & Marioni, J. C. Computational principles and challenges in singlecell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
Ghazanfar Shila, M. J. C. & Guibentif C. Stabmap: mosaic single cell data integration using nonoverlapping features. bioRxiv (2022).
Kriebel, A. R. & Welch, J. D. UINMF performs mosaic integration of singlecell multiomic datasets using nonnegative matrix factorization. Nat. Commun. 13, 780 (2022).
Minoura, K., Abe, K., Nam, H., Nishikawa, H. and Shimamura, T. A mixtureofexperts deep generative model for integrated analysis of singlecell multiomics data. Cell Rep. Meth. 1, 100071 (2021).
Lakkis, J., Schroeder, A., Su, K., Lee, M. Y., Bashore, A. C., Reilly, M. P. & Li, M. A multiuse deep learning method for CITEseq and singlecell RNAseq data integration with cell surface protein prediction and imputation. Nat. Mach. Intell. 4, 1–13 (2022).
Gayoso, A. et al. A Python library for probabilistic analysis of singlecell omics data. Nat. Biotechnol. 40, 163–166 (2022).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. OpenReview.net https://openreview.net/forum?id=Bkg6RiCqY7 (2019).
Yaroslav, G. et al. DomainAdversarial Training of Neural Networks Vol. 7 (2016).
Lopez, R., Nazaret, A., Langevin, M., Samaran, J., Regier, J., Jordan, M. I. & Yosef, N. A joint model of unpaired data from scRNAseq and spatial transcriptomics for imputing missing gene expression measurements. arXiv 1905.02269 (2019).
Liftover utility. UCSC https://genome.ucsc.edu/cgibin/hgLiftOver
Granja, J. M. et al. Archr is a scalable software package for integrative singlecell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Hoffman, P. et al. Integrating scRNAseq and scATACseq data. Satijalab https://satijalab.org/seurat/articles/atacseq_integration_vignette.html (2021).
Traag, V. A., Waltman, L. & Van Eck, N. J. From louvain to leiden: guaranteeing wellconnected communities. Sci. Rep. 9, 5233 (2019).
Ashuach, T. & Gabitto, M. I. MultiVI  intermediate datasets, notebooks, and scripts. Zenodo https://doi.org/10.5281/zenodo.5762077 (2022).
Acknowledgements
We thank A. Gayoso for assistance on model implementation in scvitools. We thank C. Usher for assistance with visualizations. This work was funded by Chan Zuckerberg Foundation Network grant no. 201902452 and NIHNIAID grant no. U19 AI090023. This work was supported in part by the Vannevar Bush Faculty Fellowship program under grant no. N000142112941.
Author information
Contributions
T.A., M.I.G. and N.Y. conceived of the model and designed the analyses. T.A. and M.I.G. implemented the model. T.A., M.I.G. and R.V.K. processed the data used for benchmarking. T.A., M.I.G. and G.A.S. performed the analyses. N.Y. and M.I.J. supervised the work. T.A., M.I.G., M.I.J. and N.Y. wrote the paper.
Ethics declarations
Competing interests
N.Y. is an adviser to and/or has equity in Cellarity, Celsius Therapeutics and Rheos Medicines. T.A. is an employee of Vevo Therapeutics. All other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–10.
Supplementary Table 1
Supplementary tables 1 and 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ashuach, T., Gabitto, M.I., Koodli, R.V. et al. MultiVI: deep generative model for the integration of multimodal data. Nat Methods 20, 1222–1231 (2023). https://doi.org/10.1038/s41592023019099
DOI: https://doi.org/10.1038/s41592023019099