Abstract
The profiling of multiple molecular layers from the same set of cells has recently become possible. There is thus a growing need for multi-view learning methods able to jointly analyze these data. We here present Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization and Optimal Transport, enhancing both the clustering performance and the interpretability of integrative Nonnegative Matrix Factorization. We apply Mowgli to multiple paired single-cell multi-omics data profiled with 10X Multiome, CITE-seq, and TEA-seq. Our in-depth benchmark demonstrates that Mowgli’s performance is competitive with the state-of-the-art in cell clustering and superior to the state-of-the-art when considering biological interpretability. Mowgli is implemented as a Python package seamlessly integrated within the scverse ecosystem and is available at http://github.com/cantinilab/mowgli.
Introduction
Single-cell sequencing technologies, providing a quantitative and unbiased characterization of cellular heterogeneity, are revolutionizing our understanding of the immune system, development, and complex diseases^{1,2,3}. A new frontier in single-cell sequencing technologies is represented by multi-omics single-cell sequencing, allowing for the simultaneous profiling of multiple molecular readouts (e.g. transcriptome, chromatin accessibility, surface proteins) from the same cell^{4,5,6,7,8,9,10,11,12}. Examples of these cutting-edge sequencing technologies are CITE-seq, which simultaneously measures RNA and surface protein abundance by leveraging oligonucleotide-conjugated antibodies^{5}, and the 10x Genomics Multiome platform, which quantifies RNA and chromatin accessibility by microdroplet-based isolation of single nuclei.
Multi-omics single-cell sequencing platforms provide us with complementary molecular readouts from exactly the same set of cells, hereafter called paired multi-omics data. The joint analysis of such data offers the exciting opportunity to understand how different molecular facets of a cell collaboratively define the cell’s function, morphology, and state^{13}. Several multi-view learning methods, jointly analyzing paired multi-omics data by taking into account their shared and complementary information, have thus been recently developed^{14,15,16,17,18,19,20,21,22,23}. These methods, unlike unpaired integration ones^{24,25}, take advantage of the known correspondences between cells across modalities. State-of-the-art multi-view learning methods for single-cell multi-omics integration are based on integrative Matrix Factorization^{14,19,22}, k-nearest neighbors^{15}, or variational autoencoders^{16,17,18,26,27}. Integrative Matrix Factorization (integrative MF) and variational autoencoders perform dimensionality reduction, jointly embedding the high-dimensional multi-omics cellular profiles into a shared lower-dimensional latent space by leveraging common cells/observations^{13,28}. Integrative MF, due to its linear nature, defines a latent space with a natural biological interpretation, but it is too simple to capture complex biological processes^{13,28}. On the other hand, non-linear methods, such as variational autoencoders, have shown great potential in clustering cells, but despite recent works on the subject^{29,30}, they inherently lack biological interpretability. Improving integrative MF methods is thus crucial to striking a balance between interpretability and performance.
We here propose Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli, github.com/cantinilab/mowgli), a novel integrative MF method for single-cell multi-omics data combining integrative Nonnegative Matrix Factorization^{31} (integrative NMF) with Optimal Transport^{32} (OT). On one hand, Mowgli employs integrative NMF, popular in computational biology due to its intuitive representation by parts, and further enhances its interpretability^{31}. On the other hand, Mowgli enhances the clustering performance of integrative MF by taking advantage of OT, which we have previously shown to better capture similarities between single-cell omics profiles^{33}. We extensively benchmarked Mowgli with respect to the state-of-the-art in the integration of several paired multi-omics data profiled with CITE-seq^{5}, 10X Genomics Multiome, and TEA-seq^{7} platforms. Of note, while we focused on the integration of the currently available omics data, Mowgli can deal with paired multi-omics datasets with any type and number of omics, without any statistical assumption on the data. The in-depth comparison showed that Mowgli’s embedding and clustering quality outperform the state-of-the-art in controlled settings derived from real multi-omics data and are competitive with the state-of-the-art in more complex real multi-omics data. Of note, the latter comparison is affected by the lack of an absolute ground truth annotation for most real datasets. Finally, Mowgli was shown to improve on the state-of-the-art in terms of biological interpretability through an in-depth biological analysis of TEA-seq data.
Results
Mowgli: a new tool for paired single-cell multi-omics data integration
We developed Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a new tool for paired single-cell multi-omics data integration (github.com/cantinilab/mowgli).
Mowgli is based on integrative Matrix Factorization (integrative MF). Starting from \(d\) omics matrices \(\mathbf{A}^{(p)} \in \mathbb{R}^{m_p \times n}\) with \(p \in [1 \ldots d]\), sharing the same columns (the cells) but having different features (e.g. genes, peaks), Mowgli jointly decomposes them into the product of omic-specific dictionaries \(\mathbf{H}^{(p)} \in \mathbb{R}^{m_p \times k}\) and a shared embedding \(\mathbf{W} \in \mathbb{R}^{k \times n}\) with \(k \ll m_p\) and \(k \ll n\) (Fig. 1A). As a standard nomenclature, in the following we will call \(k\) the number of latent dimensions, the columns of \(\mathbf{H}^{(p)}\) loadings, and the rows of \(\mathbf{W}\) factors^{28,34}.
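To make the shapes in this decomposition concrete, the following is a minimal NumPy sketch with random matrices. The omic names, the sizes, and the column normalization of the toy \(\mathbf{H}^{(p)}\) and \(\mathbf{W}\) are illustrative assumptions; no fitting is performed and this is not Mowgli's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 100, 5                      # cells, latent dimensions (hypothetical sizes)
m = {"rna": 300, "atac": 500}      # features per omic (hypothetical sizes)

# Shared embedding W (k x n): one column per cell, one row per factor.
W = rng.random((k, n))
W /= W.sum(axis=0, keepdims=True)  # assumption: each cell is a distribution over factors

# Omic-specific dictionaries H^(p) (m_p x k): one column (loading) per factor.
H = {}
for p, m_p in m.items():
    H_p = rng.random((m_p, k))
    H[p] = H_p / H_p.sum(axis=0, keepdims=True)

# Each omic is approximated by the product H^(p) W, of shape m_p x n.
A_hat = {p: H[p] @ W for p in m}
```

With column-normalized \(\mathbf{H}^{(p)}\) and \(\mathbf{W}\), every column of the product is itself a discrete distribution over features, which is what allows a cell's reconstruction to be compared with its observed profile as two histograms.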
In line with state-of-the-art MF methods for multi-omics integration^{35}, the cell embedding \(\mathbf{W}\) can be used to visualize and cluster the cells (Fig. 1B)^{36,37,38,39}. The dictionaries \(\mathbf{H}^{(p)}\) instead enable biological interpretation via gene set enrichment analysis^{40}, motif enrichment analysis^{41}, or by identifying markers among the top weights (Fig. 1C).
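The main innovation of Mowgli is to perform integrative MF by combining integrative Nonnegative Matrix Factorization (integrative NMF) with Optimal Transport (OT). It thus solves the optimization problem:
$$\min_{\mathbf{H}^{(p)},\, \mathbf{W}} \; \sum_{p=1}^{d} \left[ \mathrm{OT}_{\varepsilon}\left(\mathbf{H}^{(p)}\mathbf{W},\, \mathbf{A}^{(p)}\right) + \rho_p \mathrm{E}\left(\mathbf{H}^{(p)}\right) \right] + \mu \mathrm{E}\left(\mathbf{W}\right) \tag{1}$$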
In computational biology, integrative NMF is usually applied with a Euclidean reconstruction term between \(\mathbf{H}^{(p)}\mathbf{W}\) and \(\mathbf{A}^{(p)}\)^{19,22,24}. We here introduce instead a reconstruction term based on entropy-regularized Optimal Transport (OT) (see Eq. 1 and Methods), which unlike Euclidean or Kullback-Leibler losses leverages a notion of similarity between features. This choice is justified by the improved performance that we have previously observed when using OT to compare single-cell omics profiles^{33}. Of note, outside of biology, OT has already been used in the reconstruction loss of NMF for the factorization of single matrices^{42,43,44} and single tensors^{45}.
In addition, as in Rolet et al.^{42}, we add to the optimization problem (Eq. 1) two entropic regularization terms \(\rho_p \mathrm{E}(\mathbf{H}^{(p)})\) and \(\mu \mathrm{E}(\mathbf{W})\) (see Methods). These terms ensure that the loadings and embeddings are positive distributions and they control their sparsity (see Methods), a crucial feature to further enhance NMF’s known representation by parts and sparsity properties^{31}. \(\rho_p\) and \(\mu\) are the coldness parameters of softmax functions (see Methods) and thus offer a natural way to adjust sparsity. For instance, as \(\mu\) approaches 0, cells will be assigned to only one factor. As instead \(\mu\) increases, cells will be a combination of several factors. For all details on the mathematical formulation of Mowgli see Methods.
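The effect of such a softmax parameter on sparsity can be illustrated with a small NumPy sketch. The scores are made up, and dividing them by \(\mu\) is an assumption chosen to reproduce the behaviour described above (a small \(\mu\) gives a near one-hot assignment); the exact softmax used by Mowgli is given in the Methods and is not reproduced here.

```python
import numpy as np

def softmax(scores, mu):
    """Softmax of scores / mu; mu plays the role described in the text (assumed form)."""
    z = scores / mu
    z = z - z.max()                 # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

# Hypothetical unnormalized scores of one cell for four factors.
scores = np.array([1.0, 0.8, 0.2, -0.5])

w_small_mu = softmax(scores, mu=0.01)   # nearly all mass on a single factor
w_large_mu = softmax(scores, mu=10.0)   # mass spread over several factors
```

In this toy case, the small-\(\mu\) weights concentrate on the first factor while the large-\(\mu\) weights are close to uniform, matching the two regimes described in the paragraph above.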
Of note, Mowgli is implemented as an open-source Python package seamlessly integrated into the classical Python single-cell analysis pipeline (github.com/cantinilab/mowgli). Users can thus take advantage of scverse tools like Scanpy and Muon for preprocessing and downstream analysis^{46,47}. In addition, Mowgli provides a user-friendly visualization of top genes and enriched gene sets, thus helping biological interpretability.
In the following, we extensively benchmark Mowgli against the state-of-the-art: Seurat v4^{15}, Cobolt^{26}, Multigrate^{27}, and MOFA+^{14}. Although several methods exist^{14,15,16,17,18,19,20,21,22,23}, we here focused on the leading methods for paired data integration that could be applied to the multiple combinations of single-cell omics data considered here. In addition, an integrative NMF baseline is also considered (see Methods), to further compare Mowgli with standard integrative NMF.
Mowgli’s cell embedding and clustering outperform the state-of-the-art in controlled settings derived from cell lines data
We first focused on evaluating Mowgli’s embedding and clustering performance in controlled settings derived from cell lines data. To represent a panel of realistic scenarios with different distributions of cells across three groups, we applied different transformations to a simple dataset composed of three cancer cell lines profiled with scCAT-seq (see Fig. 2A). The scCAT-seq dataset provides a joint profiling of scRNA and scATAC from HCT116, HeLa-S3, and K562 cell lines^{8}. Unlike simulated data, this solution allows us to avoid making assumptions about the distribution of the data. Indeed, generating simulated data following a Gaussian distribution, for instance, would favor methods that approximate single-cell data with this same distribution.
The four scenarios in our panel represent distinct realistic challenges of multi-omics integration: (i) Mixed in RNA contains two cell populations that are mixed in scRNA, but well separated in scATAC; (ii) Mixed in both contains two cell populations mixed in scRNA and well separated in scATAC and two cell populations mixed in scATAC and well separated in scRNA; (iii) Rare population presents a population with much fewer cells than the others; and (iv) 82% sparse, 90% sparse, and 96% sparse contain data with increasing percentages of dropouts (82–96%). Scenarios (i) and (ii) test the ability of methods to take into account the complementarity of different omic data. Scenario (iii) tests the ability of the methods to recover rare populations. Finally, scenario (iv) tests the robustness of the methods to dropout noise, while staying in a realistic range of dropouts for single-cell data. For details on the generation of these datasets see Methods.
We benchmarked Mowgli, Seurat v4, MOFA+, integrative NMF, Cobolt, and Multigrate based on natural metrics for embedding and clustering performance: silhouette score, Adjusted Rand Index (ARI), and purity score (see Methods). The silhouette score cannot be computed for Seurat because it requires an embedding of the cells, which this method does not provide. In addition, we computed UMAP visualizations for the different methods and datasets^{37}.
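As an illustration of one of these metrics, the ARI between a ground truth labelling and a clustering can be computed from their contingency table. The sketch below is a plain NumPy version of the standard pair-counting formula with made-up labels; it is not the benchmark code, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would typically be used.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count cells for every (true label, predicted cluster) pair."""
    _, t = np.unique(labels_true, return_inverse=True)
    _, c = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((t.max() + 1, c.max() + 1), dtype=np.int64)
    np.add.at(table, (t, c), 1)
    return table

def pairs(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2

def adjusted_rand_index(labels_true, labels_pred):
    # Undefined (0/0) when both labellings consist of a single group.
    table = contingency(labels_true, labels_pred)
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Toy example: three hypothetical cell lines, two candidate clusterings.
truth = np.array(["HCT116", "HCT116", "HeLa", "HeLa", "K562", "K562"])
perfect = np.array([2, 2, 0, 0, 1, 1])   # same partition, different names
noisy = np.array([0, 0, 0, 1, 1, 1])     # merges and splits cell lines

ari_perfect = adjusted_rand_index(truth, perfect)
ari_noisy = adjusted_rand_index(truth, noisy)
```

The ARI depends only on the partition, not on how clusters are named, so a relabelled but otherwise identical clustering still scores 1.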
As shown in Fig. 2B, overall, Mowgli provides superior performance over the current state-of-the-art according to all metrics. Indeed, in all datasets except 90% sparse, Mowgli has a performance greater than or equal to that of other methods. In the 90% sparse dataset, integrative NMF has a better silhouette score than Mowgli but the same ARI and purity score. Cobolt performs poorly in Mixed in RNA and Mixed in both according to all three metrics.
These performances are confirmed by looking qualitatively at the UMAP plots in Fig. 2B. In Mixed in RNA and Mixed in both, Seurat v4 and Cobolt confuse populations when individual omics are not sufficient to identify the three groups. Regarding dropouts, one of the most challenging features of single-cell data^{48}, Mowgli shows the highest resilience with respect to the state-of-the-art. Indeed, while a sparsity of 96% is still coherent with realistic data^{49}, MOFA+, Seurat v4, Cobolt, and Multigrate confuse the three populations in the 96% sparse dataset. By contrast, Mowgli correctly separates the three groups of cells in 96% sparse.
All methods except Seurat are parametrized by a number of latent dimensions. For each method, we choose the overall best-performing number of latent dimensions (see Methods). Tuning the number of latent dimensions for each metric and dataset does not change our conclusions (see Supplementary Fig. 1).
Mowgli’s cell embeddings and clusterings are competitive with the state-of-the-art in complex and heterogeneous datasets
We then benchmarked Mowgli, Seurat v4, MOFA+, integrative NMF, Cobolt, and Multigrate based on their embedding and clustering performance on five paired single-cell multi-omics datasets (see Fig. 3A). Of note, these data have already been widely used to benchmark single-cell multi-omics integrative methods^{15,50,51}. The chosen datasets span different sequencing technologies, modalities, tissues, and sizes: (i) Liu is a scCAT-seq cell lines dataset by Liu et al.^{8}; (ii) PBMC 10X is a 10X Multiome human PBMC dataset from 10X Genomics; (iii) OP Multiome is a 10X Multiome human bone marrow dataset from Open Problems^{52}; (iv) OP CITE is a CITE-seq human bone marrow dataset from Open Problems^{52}; (v) BM CITE is a CITE-seq human bone marrow dataset from Stuart et al.^{25}. BM CITE is the largest dataset considered here, with 29,803 cells. Supplementary Table 1 lists the modalities, numbers of cells, and numbers of cell types for each dataset. For details on data preprocessing, see Methods.
We benchmarked Mowgli, Seurat v4, MOFA+, integrative NMF, Cobolt, and Multigrate based on the same natural metrics used in the previous section. Since these metrics require a ground truth annotation, we used the cell-type annotations available from the original publications of these data. In Liu, the ground truth annotations are based on the cell line of origin and are thus well-defined. On the contrary, the annotations of the other datasets were computationally derived, thus affecting this benchmark for all methods. Supplementary Note 1 illustrates this on specific subpopulations of CD8 T cells in the BM CITE and OP CITE datasets.
As displayed in Fig. 3B, Mowgli can handle large single-cell datasets and deliver embedding and clustering performances competitive with the state-of-the-art, especially considering the lack of absolute ground truth annotations on most datasets employed here.
In particular, according to the silhouette score, Mowgli outperforms other methods in Liu, PBMC 10X, and OP CITE. MOFA+ performs best in the other two datasets. In terms of ARI across resolutions, Seurat v4 performs best in PBMC 10X, OP Multiome, and OP CITE. Mowgli and Seurat v4 perform comparably in the BM CITE dataset. In the Liu dataset, only MOFA+, Multigrate, and Mowgli reach a maximum ARI of 1. Of note, in Liu, ARIs should be compared only at low resolution, as higher resolutions lead to over-clustering. In terms of purity score, Mowgli outperforms other methods in the OP CITE dataset, and it is comparable to MOFA+ and Multigrate in the PBMC 10X dataset. Finally, the purity scores of all methods are comparable in the Liu dataset, except for Cobolt, which performs worse.
The UMAP plots in Fig. 3B give a qualitative intuition of the described performance. In OP CITE, only integrative NMF, Cobolt, and Mowgli correctly separate subpopulations of B cells (Fig. 3B circled). In BM CITE, MAIT T cells and subpopulations of CD8+ T cells (Fig. 3B circled) are more neatly separated in Seurat v4 than in other methods. However, as explained in Supplementary Note 1, the annotation pipeline of BM CITE might favor Seurat v4.
As in the previous section, we chose the overall best-performing number of latent dimensions for each method (see Methods). Tuning the number of latent dimensions for each metric and dataset does not favor one method over the others, which supports Mowgli’s competitiveness with the other methods (see Supplementary Fig. 1).
Mowgli improves the biological interpretability of the state-of-the-art by providing cell-type-specific factors in TEA-seq data
We benchmarked Mowgli with respect to MOFA+ and integrative NMF based on its biological interpretability (see Fig. 4A). Indeed, MOFA+ is the leading single-cell multi-omics integration tool providing user-friendly biological interpretability of its latent dimensions^{14}. At the same time, integrative NMF can be considered as a baseline with respect to Mowgli.
For this benchmark, we considered a TEA-seq dataset of human PBMCs, corresponding to the paired profiling of scRNA-seq, scATAC-seq, and surface proteins^{7}. This dataset allows us to test the methods on more than two omic datasets, thus taking into account more complementary layers of molecular regulation.
First, MOFA+, integrative NMF, and Mowgli were independently applied for the integration of the three omics constituting the TEA-seq data. As the dataset was not provided with an annotation of the cells, we separately clustered the embeddings obtained from Mowgli, integrative NMF, and MOFA+ and annotated them based on gene and protein markers (see Supplementary Fig. 2 and Fig. 4B). We identified in this way coarse immune cell types: CD4 T cells, CD8 T cells, B cells, Natural Killer (NK) cells, MAIT T cells, Monocytes, and Erythroid cells. Of note, the cell type annotations obtained with Mowgli, integrative NMF, and MOFA+ show 94% agreement and match an independent RNA-based annotation obtained through Azimuth (see Supplementary Fig. 3). All three methods are thus able to recover the expected cell types through clustering of their embeddings.
To then test the biological interpretability of Mowgli, integrative NMF, and MOFA+, we evaluated the specificity of the associations between their factors and the identified immune cell types. The underlying assumption we are making here is that an interpretable method should provide factors that are not broadly active in all the cells, but selectively associated with a cell type. Indeed, characterizing a cell type that results from a combination of many factors is a daunting task. On the contrary, having cell-type-specific factors makes the biological characterization of the associated cell type straightforward. To evaluate such specificity, for each cell type, we plotted how Mowgli, integrative NMF, and MOFA+’s factors are distributed according to their mean weight within the cell type and their mean weight outside the cell type (Fig. 4C). Factors specific to a cell type should have a high average weight within the cell type and a low average weight outside the cell type, thus falling in the upper left corner of the plots. As MOFA+’s factors are not constrained to be positive and their positive and negative parts could be associated with different biological information, we split each factor into two parts, as done in MOFA+’s interpretation tools^{14}. In addition, we quantified the performance of each factor with a specificity score, also reported in bold in Fig. 4C, and defined in the Methods section.
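The two quantities plotted for each factor (its mean weight within a cell type and its mean weight outside it) can be sketched as follows. The function name, the toy embedding, and the labels are hypothetical, and the specificity score itself is defined in the Methods and is not reproduced here.

```python
import numpy as np

def mean_weights_in_out(W, labels, cell_type):
    """Per-factor mean weight inside and outside a cell type.

    W is a k x n embedding (factors x cells); labels has one entry per cell.
    """
    inside = labels == cell_type
    mean_in = W[:, inside].mean(axis=1)
    mean_out = W[:, ~inside].mean(axis=1)
    return mean_in, mean_out

# Toy embedding: 3 factors, 6 cells (made-up values).
W = np.array([
    [0.90, 0.80, 0.10, 0.00, 0.10, 0.00],
    [0.05, 0.10, 0.80, 0.90, 0.10, 0.10],
    [0.05, 0.10, 0.10, 0.10, 0.80, 0.90],
])
labels = np.array(["B", "B", "NK", "NK", "Mono", "Mono"])

mean_in, mean_out = mean_weights_in_out(W, labels, "B")
```

In this toy case the first factor has a high mean weight inside the B cell group and a low one outside, so it would fall in the upper left corner of the corresponding plot, while the other two factors would not.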
As shown in Fig. 4C, while MOFA+ and integrative NMF tend to associate multiple factors with the same cell type, Mowgli frequently defines clear one-to-one associations between factors and cell types. In addition, the specificity score of such factors is higher in Mowgli than in MOFA+ and integrative NMF. This is particularly striking in NK cells, CD8 T cells, and CD4 T cells, where both MOFA+ and integrative NMF seem to aggregate information from many factors whereas Mowgli is more selective. Of note, as shown in Supplementary Fig. 4, the multiple factors associated by MOFA+ with the same cell type do not necessarily correspond to subpopulations of the same cell type.
As shown in Supplementary Fig. 5, our conclusions regarding the specificity of Mowgli’s factors are robust to the choice of hyperparameters.
Overall, the comparison with integrative NMF validates that the use of OT as a loss function and of the sparsity-inducing regularization introduced in Mowgli has a concrete effect on biological interpretability.
Mowgli identifies relevant subpopulations of immune cells in TEA-seq data
We finally focused on the biological relevance of the factors identified by Mowgli on the human PBMC TEA-seq data, described in the previous section. Indeed, while in the previous section we only considered coarse immune cell types (e.g. B cells, CD4 T cells, CD8 T cells), Mowgli could identify multiple factors able to subset such cell types into relevant subpopulations (see Fig. 5A, B; Supplementary Fig. 6; Supplementary Note 2). For example, Mowgli identifies factors splitting the B cell cluster into two subpopulations: memory and naive B cells. In the same way, Mowgli detects factors associated with CD8 T cell subpopulations (naive, central memory, and effector memory), monocyte subclusters (classical and non-classical), dendritic cell subpopulations (plasmacytoid and conventional), and Natural Killer (NK) subclusters (CD56^{dim} and CD56^{bright}). For effector memory CD8 T cells, naive B cells, memory B cells, and CD56^{dim} NK cells, the association of the factors with specific immune subpopulations is made based on top-ranked genes and proteins. For all other populations, the association with factors is instead based on the correlation of the factors’ weights with those of known protein markers (see Supplementary Note 2). Figure 5B displays side-by-side the UMAP plots showing the similarity between the distribution of the factors’ weights and the activity of the protein markers of their associated immune subpopulations. The UMAP visualizations of all factors and marker proteins are available in Supplementary Fig. 6 and Supplementary Fig. 7.
These same results could not be obtained with MOFA+, due to its lower biological interpretability observed in the previous section. In MOFA+, factors having similar patterns to those observed in Mowgli could be obtained for effector memory CD8 T cells, memory B cells, non-classical monocytes, and CD56^{dim} NK cells (see Supplementary Fig. 4 and Supplementary Fig. 8). On the contrary, MOFA+’s factors most closely associated with the other immune subpopulations identified by Mowgli have less clear patterns (see Supplementary Note 2). As a consequence, interpreting with MOFA+ the pathways associated with CD56^{bright} NK cells, for example, would require a complex combination of the pathway enrichments obtained from different factors. On the contrary, the same analysis in Mowgli can be easily carried out by looking at the pathways enriched in the loadings of its 13th factor.
Finally, we looked at the biological information that Mowgli could provide regarding the identified immune subpopulations. For this part, we focused on the factors associated with four immune cell subpopulations: effector memory CD8 T cells (factor 49), naive B cells (factor 33), memory B cells (factor 44), and CD56^{dim} NK cells (factor 2). For each of these four factors, we considered their associated loadings in \(\mathbf{H}^{(\mathrm{rna})}\), \(\mathbf{H}^{(\mathrm{adt})}\), and \(\mathbf{H}^{(\mathrm{atac})}\) and analyzed the top genes in \(\mathbf{H}^{(\mathrm{rna})}\), the top proteins in \(\mathbf{H}^{(\mathrm{adt})}\), the gene sets enriched in \(\mathbf{H}^{(\mathrm{rna})}\), and the motifs enriched in \(\mathbf{H}^{(\mathrm{atac})}\) to verify the biological information that could be extracted from the output of Mowgli (see Methods). Figure 5C displays the results obtained from this analysis.
For effector memory CD8 T cells (CD8 TEM cells), corresponding to factor 49, Mowgli could extract two top genes (CRTAM and KLRK1), known to be essential for CD8+ T-cell-mediated cytotoxicity^{53,54}, and two top proteins (CD45RO, TCRa/b) that are a known memory T cell marker and a T cell receptor, respectively^{55,56}. More interestingly, several Transcription Factors (TFs) that are candidate regulators of this subpopulation are also identified, among them EOMES and TBX21 (aka T-bet), known to be important for CD8 TEM development^{57}. In addition, five of the top candidate TF regulators (TBR1, TBX21, TBX4, TBX5, and MGA) target three of the top genes of the same factor (CCL5, CRTAM, and IL21R), thus suggesting a regulatory program possibly important for CD8 TEM cells.
In naive B cells (factor 33), Mowgli identifies as top genes FCER2 (aka CD23), a low-affinity receptor for immunoglobulin E (IgE) with an essential role in the differentiation of B cells^{58}, and MARCH1, which downregulates the surface expression of major histocompatibility complex (MHC) class II molecules^{59}. In the top proteins, we can single out CD19, CD21, and HLA-DR, well-known markers of B cells^{60}. In addition, the relative weights of IgD and IgM in factor 33 are coherent with the distribution already described for naive B cells^{60}. Finally, among the top TF candidate regulators of factor 33, Early B-cell Factors (EBF3 and EBF1)^{61} and NF-kB proteins (REL and RELA) stand out as regulators of the top genes of the same factor. Of note, these TFs play an essential role in B-cell development, maintenance, and function^{62}.
For memory B cells (factor 44), Mowgli extracts as top genes IGHA1 and IGKC, part of immunoglobulin complexes^{63}, and JAM3, belonging to the Immunoglobulin superfamily and already studied in the context of B cell homing and development^{64,65,66}. The top proteins include the well-known B cell markers CD19, CD21, and HLA-DR^{60}. In addition, as observed before for naive B cells, the relative weights of IgD and IgM in factor 44 are coherent with the distribution already described for memory B cells^{60}. Among the top TFs emerging from our motif analysis and targeting the top genes, we finally find RELA, TCF4, and MAX::MYC, known to be involved in the transcriptional regulation of memory B cell differentiation^{67}.
Finally, in CD56^{dim} NK cells (factor 2), Mowgli detects as top genes: NCAM1 (aka CD56), the go-to marker for NK cells^{68}; KLRF1 and KLRD1, genes of the KLR family of receptors controlling NK cell activity^{69}; GZMB, involved in NK-cell-mediated cytotoxicity^{70}; and SLAMF7, mediating NK cell activation^{71}. Top proteins include CD56, the canonical marker of NK cells^{68}, but its weight is lower than that of CD16, which is coherent with the expression profile of CD16+CD56^{dim} NK cells^{68}. Regarding TF candidate regulators, we detect EOMES and TBX21 (aka T-bet), which are critical to NK-cell differentiation^{72}; MafF, which has a key role in the regulation of NK cell effector functions by IL-27; and JUNB::FOSB, early activator protein (AP)-1 TFs that regulate NK-mediated cytotoxicity^{73,74}. Finally, a strong regulatory program seems to emerge here, with four of the top candidate TF regulators for factor 2 (MGA::EVX1, EOMES, TBX21, and JDP2) targeting four of the top genes of the same factor (C1orf21, IL18RAP, PTGDR, and SLAMF7).
Discussion
Multiple technologies allowing the multi-omics profiling of the same set of cells are currently available. We thus need integration methods able to jointly learn from multiple omics data profiled on the same cells.
In this article, we introduced Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), an integrative method for paired multi-omics data that enables rich biological interpretation for any type and number of omics. We benchmarked in depth Mowgli’s cell embedding and clustering performance with respect to the state-of-the-art in controlled settings derived from scCAT-seq profiling of cancer cell lines. In this benchmark, Mowgli outperforms the state-of-the-art, showing its high potential even in challenging conditions. We then considered more complex and heterogeneous data profiled with CITE-seq and 10x Genomics Multiome technologies. On these data, Mowgli performed comparably with the state-of-the-art, with no method clearly outperforming others. Finally, regarding biological interpretability, when tested on TEA-seq data, corresponding to paired scRNA, scATAC, and surface protein profiling, Mowgli produces biologically meaningful representations superior to those of the state-of-the-art.
A major limitation affecting this benchmark, and all others focused on paired multi-omics integration, is the lack of a high-quality biological annotation of the cells. While in some cases fluorescence-activated cell sorting (FACS) could represent a clear solution for an independent annotation of the cells, paired multi-omics data with this type of annotation are lacking in the literature.
A limitation of Mowgli is that it does not offer a straightforward approach to defining the number of latent dimensions. The choice of the number of latent dimensions (\(k\)) is inherently problem-dependent and several values of \(k\) should be tested. This problem has however been extensively studied in the NMF literature, and classical tools like the cophenetic coefficient^{75} or the elbow method^{76} may guide the user’s choice of \(k\). At the same time, Mowgli displays relative robustness to changes in \(k\), thus suggesting that small changes in \(k\) will not affect its performance. In addition, OT is inherently expensive to compute compared to the Euclidean distance. However, the entropic regularization of OT considered here is GPU-friendly, and GPU computations enable Mowgli to scale to the larger datasets presented in this article. GPUs are nowadays standard in research centers, and their availability is likely to grow further as larger single-cell datasets become available. Regarding possible extensions of Mowgli, it would be interesting to deal with batch correction when integrating paired multi-omics data. Indeed, most recent large-scale paired multi-omics data are profiled in different centers, thus creating batch correction issues.
Methods
Notations
Let us consider \(n\) cells, measured across several modalities. Each modality \(p\) has \(m_p\) features (e.g. genes). Let us denote \(\mathbf{A}^{(p)} \in \mathbb{R}_{+}^{m_p \times n}\) the dataset for modality \(p\). Additionally, we require each column of \(\mathbf{A}^{(p)}\) to sum to 1, i.e. to be a discrete probability distribution.
Optimal transport
Optimal Transport (OT), as defined by Monge^{32} and Kantorovich^{77}, aims at comparing two probability distributions \(a\) and \(b\) by computing the minimal cost of transporting one distribution to the other. In the discrete case, the classical OT distance, also known as the Wasserstein distance, between \(m\)-dimensional histograms \(\mathbf{a} = (a_1, \ldots, a_m)\) and \(\mathbf{b} = (b_1, \ldots, b_m)\) is defined as
$$\mathrm{OT}(\mathbf{a}, \mathbf{b}) = \min_{\mathbf{P} \in \Pi(\mathbf{a}, \mathbf{b})} \sum_{k,l} P_{k,l}\, C_{k,l} \tag{2}$$
where \(\Pi(\mathbf{a}, \mathbf{b}) = \{\mathbf{P} \in \mathbb{R}_{+}^{m \times m} \text{ such that } \sum_{l} P_{k,l} = a_k \text{ and } \sum_{k} P_{k,l} = b_l\}\) and \(\mathbf{C}\) is the ground cost described below.
In this discrete case, the coupling \(\mathbf{P} \in \Pi(\mathbf{a}, \mathbf{b})\) is a matrix that represents how the mass in the discrete probability distribution \(\mathbf{a}\) is moved from one bin (e.g. gene) to another one in order to transform \(\mathbf{a}\) into \(\mathbf{b}\). In other words, \(P_{k,l}\) is the amount of gene expression transported between genes \(k\) and \(l\) when transporting the gene expression profile of the cell \(\mathbf{a}\) to the gene expression profile of the cell \(\mathbf{b}\).
The ground cost \({\bf{C}}\in {\mathbb{R}}_{+}^{m\times m}\) is a pairwise distance matrix that encodes the penalty for transporting mass from one feature (e.g. gene) to another. Hence, \({\bf{C}}\) should be chosen in such a way that similar bins (e.g. genes) \(k\) and \(l\) have a low cost \({C}_{k,l}\), which favors transporting gene expression between similar genes. Here, for a given omic \(p\), we define \({\bf{C}}\) in a data-driven way as the matrix of pairwise cosine distances between the features, i.e. the rows of our dataset \({\bf{A}}^{\left(p\right)}\). In other words, denoting \({\bf{u}}_{k}\in {\mathbb{R}}_{+}^{n}\) and \({\bf{u}}_{l}\in {\mathbb{R}}_{+}^{n}\) two rows of our dataset,
$${C}_{k,l}=1-\frac{\left\langle {\bf{u}}_{k},\, {\bf{u}}_{l}\right\rangle }{{\Vert {\bf{u}}_{k}\Vert }_{2}\,{\Vert {\bf{u}}_{l}\Vert }_{2}}$$ (3)
where \(\left\langle {\bf{x}},\, {\bf{y}}\right\rangle={\sum }_{i}{x}_{i}{y}_{i}\) is the dot product and \({\Vert {\bf{x}}\Vert }_{2}=\sqrt{\left\langle {\bf{x}},\, {\bf{x}}\right\rangle }\) is the Euclidean norm. This choice of ground cost gave the best results in our previous work^{33}.
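As an illustration of the data-driven ground cost described above, the following minimal sketch computes pairwise cosine distances between the rows (features) of a small toy matrix. It uses only the Python standard library; the function name `cosine_ground_cost` and the toy values are hypothetical and are not taken from the Mowgli package, which presumably performs this step with array libraries on the GPU.

```python
import math

def cosine_ground_cost(A):
    """Pairwise cosine distances between the rows (features) of A.

    Assumes no row is entirely zero, otherwise the norm below vanishes.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in A]
    m = len(A)
    return [
        [
            1.0 - sum(x * y for x, y in zip(A[k], A[l])) / (norms[k] * norms[l])
            for l in range(m)
        ]
        for k in range(m)
    ]

# Toy data: 3 features (rows) measured in 4 cells (columns).
A = [
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],  # proportional to feature 0: distance close to 0
    [0.0, 0.0, 3.0, 1.0],  # no cell shared with feature 0: distance 1
]
C = cosine_ground_cost(A)
```

Features with proportional profiles across cells get a cost near zero, so transporting mass between them is cheap, whereas features that are never detected in the same cells get the maximal cost of 1.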
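Due to the high dimensionality of single-cell data, we use an approximation of classical OT. The entropic regularization of OT, described in (Eq. 4) below, is computed using the fast and GPU-enabled Sinkhorn algorithm^{78}.
$${\mbox{OT}}_{\varepsilon }\left({\bf{a}},\, {\bf{b}}\right)=\mathop{\min }\limits_{{\bf{P}}\in \Pi \left({\bf{a}},\, {\bf{b}}\right)}{\sum }_{k,l}{P}_{k,l}{C}_{k,l}-\varepsilon \,{\mbox{E}}\left({\bf{P}}\right)$$ (4)
where the entropy \({\mbox{E}}\) is defined as \({\mbox{E}}:{\bf{X}}\in {\mathbb{R}}_{+}^{m\times n}\mapsto -{\sum }_{k,l}{X}_{k,l}\log {X}_{k,l}\).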
If \({{{{{\rm{\varepsilon }}}}}}\) is set to zero, (Eq. 4) corresponds exactly to classical OT (Eq. 2). Increasing values of \({{{{{\rm{\varepsilon }}}}}}\) correspond to a more diffused coupling \({{{{{\bf{P}}}}}}\). In previous work, we showed the entropic regularization of OT to improve similarity inference between singlecell omics profiles compared to classical notions of distance^{33}.
As explored in that work^{33}, entropic regularization is expected to control the systematic noise due to technical dropouts and to the stochasticity of gene expression at the singlecell level. In addition, more diffused couplings increase the exchange of mass between features. This enables OT to leverage the relationships between features (e.g. genes), motivating further its application to singlecell data.
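To make the role of the regularization strength concrete, here is a minimal sketch of Sinkhorn iterations for entropy-regularized OT between two small histograms. It is written with the standard library only and is not the implementation used in Mowgli; the function names, the toy histograms, the ground cost \(|k-l|\), and the fixed number of iterations are all illustrative assumptions.

```python
import math

def sinkhorn(a, b, C, eps, n_iter=500):
    """Approximate entropic OT coupling between histograms a and b.

    Assumes strictly positive entries in a and b to keep the scalings finite.
    """
    m = len(a)
    K = [[math.exp(-C[k][l] / eps) for l in range(m)] for k in range(m)]
    u, v = [1.0] * m, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows, then columns, to match the two marginals.
        u = [a[k] / sum(K[k][l] * v[l] for l in range(m)) for k in range(m)]
        v = [b[l] / sum(K[k][l] * u[k] for k in range(m)) for l in range(m)]
    return [[u[k] * K[k][l] * v[l] for l in range(m)] for k in range(m)]

def transport_cost(P, C):
    """Total cost sum_{k,l} P[k][l] * C[k][l] of a coupling."""
    return sum(p * c for row_p, row_c in zip(P, C) for p, c in zip(row_p, row_c))

a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
C = [[abs(k - l) for l in range(3)] for k in range(3)]

P_sharp = sinkhorn(a, b, C, eps=0.5)    # weaker regularization
P_diffuse = sinkhorn(a, b, C, eps=5.0)  # stronger regularization, more diffuse
```

With the larger value of `eps`, mass is spread over more pairs of bins, so the coupling is more diffuse and its transport cost is higher than with the smaller value.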
Mowgli
We aim to decompose each matrix \({\bf{A}}^{\left(p\right)}\) as the product of a modality-specific dictionary \({\bf{H}}^{\left(p\right)}\in {\mathbb{R}}_{+}^{{m}_{p}\times k}\) and an embedding matrix \({\bf{W}}\in {\mathbb{R}}_{+}^{k\times n}\) shared across modalities. The integer \(k\) is the number of dimensions of the latent space and should be small compared to the number of features. We use the entropic regularization of OT as a reconstruction loss to compare \({\bf{H}}^{\left(p\right)}{\bf{W}}\) to the reference data \({\bf{A}}^{\left(p\right)}\).
In addition, we require the columns of \({{{{{{\bf{H}}}}}}}^{\left(p\right)}{{{{{\bf{W}}}}}}\) to sum to 1, i.e. belong to the simplex. We thus impose that the columns of \({{{{{{\bf{H}}}}}}}^{\left(p\right)}\) sum to one, and that the columns of \({{{{{\bf{W}}}}}}\) sum to one. Following Rolet et al.^{42}, we use the entropy function E defined previously, with a value of −∞ when columns do not sum to one.
Combining the reconstruction and the entropy terms yields the loss
Note that for the sake of readability, we write OT_{ε} for all p, but this loss actually depends on an omic-specific ground cost C^{(p)}, which itself depends on A^{(p)} (see Eq. 3). The parameters ρ_{p} and μ control the sparsity of the columns of H^{(p)} and W, respectively. In order to make these parameters more comparable across omics and datasets, we define
We have set the values of \({\tilde{\rho }_{{rna}}}=0.01,\, {\tilde{\rho }_{{adt}}}=0.01,\, {\tilde{\rho }_{{atac}}}=0.1\) and \(\widetilde{\mu }=0.001\) since they performed best across datasets and metrics (silhouette score, ARI, purity score) (see Supplementary Figs 9, 10, 11). Regarding the choice of number of latent dimensions, for Liu and datasets derived from Liu, we run the method with 5 factors. For other datasets, we choose 50 factors. These parameters gave the best results overall (see Supplementary Fig. 12).
Similarly to Rolet et al.^{42}, we alternate between minimizing (Eq. 5) with respect to \({\bf{H}}^{\left(p\right)}\) and with respect to \({\bf{W}}\). One can show that these minimization problems on \({\bf{H}}^{\left(p\right)}\) and \({\bf{W}}\) are equivalent to the following smooth minimization problems on new dual variables \({\bf{G}}^{\left(p\right)}\). These problems can be solved using standard optimization methods, and the method of choice is L-BFGS, a limited-memory quasi-Newton method.

Optimizing \({{{{{{\bf{H}}}}}}}^{\left(p\right)}\). We solve the following smooth minimization problem:
$$\mathop{\min }\limits_{{\bf{G}}^{\left(p\right)}}\mathop{\sum }\limits_{j}{\mbox{OT}}_{\varepsilon }^{\star }\left({\bf{g}}_{j}^{\left(p\right)},\, {\bf{a}}_{j}^{\left(p\right)}\right)+{\rho }_{p}{\left(-{\mbox{E}}\right)}^{\star }\left(-{\bf{G}}^{\left(p\right)}{\bf{W}}^{\top }/{\rho }_{p}\right)$$ (7)
Then, we update the primal variable as follows:
$${\bf{H}}^{\left(p\right)}={\mbox{softmax}}\left(-{\bf{G}}^{\left(p\right)}{\bf{W}}^{\top }/{\rho }_{p}\right)$$ (8)
The columnwise softmax is defined as:
$${\mbox{softmax}}:{\bf{X}}\in {\mathbb{R}}^{m\times n}\mapsto {\left(\frac{\exp \left({X}_{i,j}\right)}{{\sum }_{i^{\prime} }\exp \left({X}_{i^{\prime},j}\right)}\right)}_{i,j}$$ (9)
Optimizing \({\bf{W}}\). We solve the following smooth minimization problem:
Then, we update the primal variable as follows:
Here, \({\mbox{OT}}_{\varepsilon }^{\star }\) and \({\left(-{\mbox{E}}\right)}^{\star }\) denote the Legendre duals of the functions \({\mbox{OT}}_{\varepsilon }\) and \(-{\mbox{E}}\), and their smooth closed-form expressions are given in Rolet et al.^{42}.
An important property of NMF enabling representation by parts is sparsity, but it is not enforced explicitly^{79}. In classical NMF, the L1 norm can be leveraged to explicitly enforce sparsity^{79,80}. In our setting, the simplex constraint renders the L1 penalty ineffective (because it is equal to 1), but we can leverage the entropic penalty of Rolet et al.^{42} to control sparsity. Indeed, the coefficients \({{{{{{\rm{\rho }}}}}}}_{p}\) and \({{{{{\rm{\mu }}}}}}\) parametrize softmax functions, and hence control the sparsity of distributions. When setting \({{{{{{\rm{\rho }}}}}}}_{p}\) (or \({{{{{\rm{\mu }}}}}}\)) to zero, the solutions are Dirac masses (extreme sparsity) while when setting it to +∞ the solutions are uniform (no sparsity).
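The effect of the entropic coefficients on sparsity can be illustrated with a softmax whose argument is divided by a temperature, playing the role of \(\rho_p\) or \(\mu\). The sketch below is a toy example under that assumption; the function name, the `temperature` parameter, and the score values are hypothetical and not part of the Mowgli code.

```python
import math

def softmax(x, temperature):
    """Softmax of x / temperature, shifted by the maximum for numerical stability."""
    scaled = [v / temperature for v in x]
    shift = max(scaled)
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]
sparse = softmax(scores, temperature=0.01)    # nearly all mass on the largest score
diffuse = softmax(scores, temperature=100.0)  # nearly uniform distribution
```

A small temperature concentrates almost all of the mass on one entry, which approaches a Dirac mass, while a large temperature yields an almost uniform distribution, matching the two limiting cases described above.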
The code is implemented in Python and relies on PyTorch^{81} for matrix operations on the GPU and on Muon^{47} and Scanpy^{46} to handle single-cell multimodal data. The running times of Mowgli and the other methods are reported in Supplementary Table 2.
MOFA+
We compare Mowgli to MOFA+^{14}, a variational inference method analogous to sparse PCA for multiomic data. We use the R interface MOFA2 with default training parameters. MOFA+ provides a parameter drop_factor_threshold designed to keep only informative factors, but we found that in practice it removed important information. For example, the benchmark in Zuo and Chen^{18} only kept one factor for MOFA+, which is not enough to represent cellular heterogeneity in the data. We thus choose to keep 5 factors for Liu and the datasets derived from Liu, and 15 factors for the other datasets. These parameters gave the best results overall (see Supplementary Fig. 13).
Seurat v4
We compare Mowgli to Seurat v4^{15} which uses Weighted Nearest Neighbors to integrate multiomics data. We use the R interface Seurat with default parameters.
Cobolt
We compare Mowgli to Cobolt^{26}, a deep neural network approach to integrate multiomics data. We use the Python package cobolt with a learning rate of 0.001 and 200 epochs. As suggested in the documentation, we use raw counts as input for the RNA and ATAC modalities. For Liu and the datasets derived from Liu, we use 5 latent dimensions, and for the other datasets, we use 30 latent dimensions. These parameters gave the best results overall (see Supplementary Fig. 14).
Multigrate
We compare Mowgli to Multigrate^{27}, another deep neural network approach to integrate multiomics data. We use the Python package multigrate with 200 epochs. As suggested in the documentation, we use raw counts and a negative binomial loss for RNA and processed data with a mean squared error for ATAC and ADT. For Liu and the datasets derived from Liu, we use 15 latent dimensions, and for the other datasets, we use 50 latent dimensions. These parameters gave the best results overall (see Supplementary Fig. 15).
Integrative NMF
We implemented a baseline NMFbased integration method by concatenating the features from the different omics and solving the optimization problem with positivity constraints:
We implemented this approach using the TorchNMF package. For Liu and the datasets derived from Liu, we use 5 latent dimensions, and for the other datasets, we use 30 latent dimensions. These parameters gave the best results overall (see Supplementary Fig. 16).
Note that this is almost equivalent to intNMF^{82} with \(\theta=1\), which minimizes instead \({\sum }_{p}{\Vert {\bf{A}}^{\left(p\right)}-{\bf{H}}^{\left(p\right)}{\bf{W}}\Vert }_{2}\). However, on the considered datasets, the intNMF package was too slow to be included in the benchmark.
Choice of the number of latent dimensions for all methods
For each method described above, we selected the number of latent dimensions displaying the overall best performances across evaluation metrics (Silhouette score, ARI, purity score) and datasets. This analysis has been done considering Liu and the controlled settings separately from the other datasets (PBMC 10X, OP Multiome, OP CITE, BM CITE). Indeed, Liu, composed only of three cell lines, presents much less variation than the other real datasets.
Data generation
Mixed in RNA
We simulate a dataset where one modality confuses two populations, while the other can separate them. To do so we replace the RNA profiles of all HCT cells with RNA profiles of random HeLa cells. ATAC profiles are left untouched.
Mixed in both
We simulate a dataset where the two modalities each confuse two cell populations, but separate two others. This makes the two omics complementary. To do so we replace the RNA profiles of all HCT cells with RNA profiles of random HeLa cells. Then, we replace the ATAC profiles of all K562 cells with ATAC profiles of random HeLa cells.
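The two mixing procedures above can be sketched as follows. This is only an illustration on made-up toy profiles: the helper `replace_profiles` is hypothetical, and sampling donor cells with replacement is an assumption, since the text only states that random HeLa cells are used.

```python
import random

def replace_profiles(profiles, labels, target, donor, rng):
    """Replace the profile of every `target` cell with that of a random `donor` cell."""
    donor_idx = [i for i, label in enumerate(labels) if label == donor]
    return [
        list(profiles[rng.choice(donor_idx)] if label == target else profiles[i])
        for i, label in enumerate(labels)
    ]

rng = random.Random(0)
labels = ["HCT", "HCT", "HeLa", "HeLa", "K562"]
rna = [[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 7, 1], [0, 1, 9]]
atac = [[1, 0], [1, 1], [0, 1], [0, 2], [3, 3]]

# "Mixed in RNA": HCT cells receive HeLa RNA profiles, ATAC is left untouched.
mixed_rna = replace_profiles(rna, labels, "HCT", "HeLa", rng)
# "Mixed in both": additionally, K562 cells receive HeLa ATAC profiles.
mixed_atac = replace_profiles(atac, labels, "K562", "HeLa", rng)
```

The cell line labels are kept unchanged, so they can still serve as ground truth when evaluating whether a method recovers the three populations from the complementary modality.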
Sparse
We simulate high dropout noise by randomly replacing 50%, 70% or 90% of the values with zeros. Since the data is already 65% sparse, the final sparsity is 82%, 90% and 96%.
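The final sparsity values quoted above follow from a short calculation: if a fraction \(s\) of the entries is already zero and a fraction \(d\) of all entries is set to zero uniformly at random, the expected final sparsity is \(1-(1-s)(1-d)\). The sketch below checks this arithmetic; the function names are hypothetical, and zeroing each entry independently with probability \(d\) is an assumption about how the replacement was carried out.

```python
import random

def final_sparsity(initial_sparsity, dropout_rate):
    """Expected fraction of zeros after zeroing entries uniformly at random."""
    return 1.0 - (1.0 - initial_sparsity) * (1.0 - dropout_rate)

def apply_dropout(matrix, dropout_rate, rng):
    """Set each entry to zero independently with probability dropout_rate."""
    return [[0.0 if rng.random() < dropout_rate else x for x in row] for row in matrix]

# Expected sparsity for the three dropout levels, starting from 65% zeros:
# approximately 0.825, 0.895 and 0.965.
expected = {rate: final_sparsity(0.65, rate) for rate in (0.5, 0.7, 0.9)}
```

These values agree, up to rounding, with the 82%, 90% and 96% reported in the text.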
Rare population
We simulate the presence of a rare population by keeping only 10 randomly chosen HeLa cells.
Data preprocessing
All preprocessing was performed using the Scanpy^{46} and Muon^{47} Python packages.
RNA preprocessing
Quality control filtering of cells was performed on the proportion of mitochondrial gene expression, the number of expressed genes, and the total number of counts (using Muon’s filter_obs). Quality control filtering of genes was performed on the number of cells expressing the gene (using Muon’s filter_var). Cells were normalized to sum to 10,000 (using Scanpy’s normalize_total), then log-transformed (using Scanpy’s log1p). The top 2,500 most variable genes (1,500 for the Liu dataset) were selected for downstream analysis (using Scanpy’s highly_variable_genes with flavor=‘seurat’).
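The normalization and log-transformation steps can be written out by hand as in the sketch below. It is meant to show the arithmetic only, on toy counts with cells as rows; it does not call Scanpy, and the function name is hypothetical.

```python
import math

def normalize_total_and_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)  # assumed non-zero after quality control filtering
        normalized.append([math.log1p(x * target_sum / total) for x in cell])
    return normalized

# Two toy cells with the same gene proportions but different sequencing depths.
counts = [[3, 1, 0, 6], [15, 5, 0, 30]]
logged = normalize_total_and_log1p(counts)
```

Because both toy cells have the same proportions, they end up with identical normalized profiles, which shows how this step removes differences in sequencing depth.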
ATAC preprocessing
Quality control filtering of cells was performed on the number of open peaks and the total number of counts (using Muon’s filter_obs). Quality control filtering of peaks was performed on the number of cells where the peak is open (using Muon’s filter_var). In Liu, TEA, and 10X PBMC, cells were normalized to sum to 10,000 (using Scanpy’s normalize_total), then log-transformed (using Scanpy’s log1p). In OP Multiome, cells were normalized using TF-IDF (using Muon’s tfidf) to follow the preprocessing chosen by its authors. The most variable peaks were selected for downstream analysis (using Scanpy’s highly_variable_genes with flavor=‘seurat’). Due to differences in the data’s distribution across datasets, we chose to keep 1,500 peaks in Liu, 5,000 peaks in PBMC, and 15,000 peaks in OP Multiome and TEA.
ADT preprocessing
Since the number of proteins is small and the data is less noisy than RNA or ATAC, no quality control or feature selection was performed. The data was normalized by Center Log Ratio (using Muon’s clr).
Data analysis
Gene set enrichment analysis (GSEA)
The gProfiler API^{83} was used through Scanpy’s enrich. Custom sources GO:CC, GO:MF, GO:BP, Azimuth, and ImmuneSigDB were retrieved from the Enrichr website^{84}. Gene sets enriched with adjusted p-values under 0.05 (with Bonferroni correction) were selected for further analysis. To make genes comparable, we normalized the rows of the matrix H^{(rna)} to 1. The 150 top genes for every factor were then used as an unordered input to gProfiler.
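The selection of top genes per factor can be sketched as follows. The sketch assumes that normalizing rows "to 1" means dividing each row of the dictionary by its sum, which the text does not state explicitly; the function name, the gene names, and the toy dictionary are hypothetical.

```python
def top_features_per_factor(H, names, n_top):
    """Rank features for each factor after rescaling every row of H to sum to 1.

    H has one row per feature and one column per factor. Rows are assumed
    to contain at least one non-zero weight.
    """
    normalized = [[v / sum(row) for v in row] for row in H]
    n_factors = len(H[0])
    top = []
    for factor in range(n_factors):
        order = sorted(
            range(len(H)), key=lambda i: normalized[i][factor], reverse=True
        )
        top.append([names[i] for i in order[:n_top]])
    return top

# Toy dictionary: 3 genes (rows) and 2 factors (columns).
H = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]
names = ["geneA", "geneB", "geneC"]
top_genes = top_features_per_factor(H, names, n_top=2)
```

Row normalization ranks a gene by how specific it is to a factor rather than by its overall weight, so genes with high weights in every factor are not automatically selected.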
Motif enrichment analysis
Signac^{85} was used to perform Motif Enrichment Analysis, using the JASPAR2022 motif database^{86}. To make peaks comparable, we normalized rows of the matrix \({H}^{\left({atac}\right)}\) to 1. The 100 top peaks for every factor were used as input to Signac’s FindMotifs. The union of the top peaks across factors constitutes the background.
Visualization
To visualize the latent representation of cells in MOFA+, integrative NMF, and Mowgli’s models, we computed kNN graphs (k = 20) with the Euclidean distance between the cells’ low-dimensional embeddings (using Scanpy’s neighbors). We used these graphs to compute 2D UMAP^{37} projections (using Scanpy’s umap). For Seurat v4, 2D UMAP projections based on WNN graphs were computed using Seurat v4’s function RunUMAP.
Clustering
For Mowgli, integrative NMF, and MOFA+, we clustered datasets using the Leiden algorithm^{38} with varying resolutions (using Scanpy’s leiden). Similarly to UMAP visualization, the inputs of the Leiden algorithm were the previously computed kNN graphs. For Seurat v4, Leiden clustering was performed using Seurat v4’s function FindClusters.
Evaluation metrics
Silhouette score
For each sample, the silhouette width is defined as \(\frac{b-a}{\max \left(a,b\right)}\) where \(a\) is the mean distance of the sample to the other samples of the same cluster and \(b\) is the mean distance of the sample to the samples of the nearest other cluster. The silhouette score is the mean of the silhouette widths across samples. The silhouette score varies between −1 and 1. We used Scikit-learn’s implementation silhouette_score^{87}.
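The definition above can be checked on a toy example with the following sketch, which computes silhouette widths directly from Euclidean distances. It is an illustrative reimplementation, not the Scikit-learn function used in the benchmark; the toy points and labels are made up, and at least two clusters of distinct points are assumed.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette width over all samples, using Euclidean distances."""
    widths = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        means = {label: sum(d) / len(d) for label, d in dists.items()}
        if labels[i] not in means:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        a = means.pop(labels[i])  # mean distance within the sample's own cluster
        b = min(means.values())   # mean distance to the nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Labels that follow the two well-separated groups give a score close to 1, whereas labels that cut across them give a negative score.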
kNN purity score
The kNN purity score measures the average proportion of a sample’s nearest neighbors that share the sample’s cluster annotation. It thus varies between 0 and 1.
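A minimal sketch of this score is given below. The number of neighbors `k`, the use of Euclidean distances, and the toy data are assumptions made for illustration, since the text does not specify them.

```python
import math

def knn_purity(points, labels, k):
    """Average fraction of each sample's k nearest neighbors sharing its label."""
    scores = []
    for i, p in enumerate(points):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors = [j for _, j in others[:k]]
        scores.append(sum(labels[j] == labels[i] for j in neighbors) / k)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
pure = knn_purity(points, [0, 0, 0, 1, 1, 1], k=2)   # labels match the groups
mixed = knn_purity(points, [0, 1, 0, 1, 0, 1], k=2)  # labels alternate
```

When the annotation matches the neighborhood structure of the embedding, every neighbor shares the sample's label and the score reaches its maximum of 1.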
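Adjusted Rand index
The Rand Index (RI) measures the similarity between a ground truth annotation and an experimental clustering. The Adjusted Rand Index (ARI) corrects the RI for chance and is defined as
$${\mbox{ARI}}=\frac{{\mbox{RI}}-{\mathbb{E}}\left[{\mbox{RI}}\right]}{\max \left({\mbox{RI}}\right)-{\mathbb{E}}\left[{\mbox{RI}}\right]}$$ (13)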
and varies between −1 and 1, with 0 representing a random clustering. We used Scikitlearn’s implementation adjusted_rand_score. Since the simulated datasets are derived from three cell lines, there is no biological heterogeneity within each of the three groups and clustering at high resolution necessarily leads to overclustering. Figure 2 thus displays maximum ARI, which is achieved for low resolutions and is more informative than the ARI across resolutions. ARI across resolutions can be found in Supplementary Figs 9 to 16.
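For illustration, the ARI can be computed from pair counts as in the sketch below, which follows the standard contingency-table formulation. It is a reimplementation for small examples, not the Scikit-learn function used in the benchmark, and the function name and label vectors are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI computed from the number of sample pairs grouped together."""
    n = len(truth)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (pairs_both - expected) / (maximum - expected)

truth = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand_index(truth, [1, 1, 0, 0])
crossing = adjusted_rand_index(truth, [0, 1, 0, 1])
```

The score is invariant to renaming the clusters, so a clustering identical to the ground truth up to a permutation of labels still reaches the maximum value.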
Specificity
MOFA+ was applied with 15 factors, which are enough to represent the data (see Supplementary Fig. 13). Integrative NMF was applied with 30 factors. Mowgli was applied with 50 factors. For all three methods, a coarse Leiden clustering was applied (using Scanpy’s leiden). For all three methods, each cluster was assigned a cell type based on the expression of the canonical gene and protein markers (see Supplementary Fig. 2). To confirm this annotation, Azimuth was run on the RNA signal of the dataset (using the Azimuth web tool and the PBMC reference). The agreement of the four independent annotations is confirmed in a Sankey diagram (see Supplementary Fig. 3). Dendritic cells are absent in our manual annotations because of the coarseness of the clustering. Likewise, the ADT signal (see Supplementary Fig. 7) informs us that there is a CD4CD8 T cell population missed in all four annotations. For each factor and each cell type, we computed (i) the proportion of cells within that cell type with an absolute weight greater than \({10}^{-3}\), (ii) \(a\), the mean weight for cells within that cell type, and (iii) \(b\), the mean weight for cells outside of that cell type. For each cell type, we then defined a specificity score for factor \(i\):
where \(\mathop{\max }\limits_{j}{a}_{j}\) is the maximum value of \(a\) computed over all factors. The specificity score is thus bounded by 1. See Fig. 4C for a visualization of this information and see Supplementary Fig. 5 for results across hyperparameters \({{{{{\rm{\mu }}}}}}\) and numbers of latent dimensions.
Biological interpretation
We added stars in front of biologically interesting elements in Fig. 5C. The first resource we used was the Human Protein Atlas, from which we programmatically retrieved information about the top proteins and genes. We starred them if they were marked as specific to NK cells, Naive B cells, Memory B cells, or Memory CD8 T cells respectively. In addition, some genes or proteins were starred manually; we discuss those in the Results and refer to the relevant literature.
We starred gene sets if they matched the considered cell types, e.g. MHC II protein complex for B cells. To reduce the noise in the Immune gene sets, we only considered gene sets opposing subtypes of the broad cell type considered, e.g. NAIVE_VS_MEMORY_BCELL_UP.
We starred the TFs if they target one of the top 20 genes. For this, we retrieved TFgene links from the Regulatory Circuits database^{88} and considered the Natural Killer cells, CD19+B cells, and CD8+T cells networks.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Regulatory Circuits. At the time of writing, the Regulatory Circuits website http://ww1.regulatorycircuits.org/ is down. We recovered the data from the mirror http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar.
PBMC. We retrieve a 10X Genomics Multiome (RNA+ATAC) dataset with 9320 PBMCs. Data is available at https://www.10xgenomics.com/resources/datasets/pbmcfromahealthydonorgranulocytesremovedthroughcellsorting10k1standard200.
Liu. We retrieve a scCATseq (RNA+ATAC) dataset from Liu et al.^{8} with 206 cells from three cancer cell lines (HCT116, HeLaS3, K562). Data is available in the Supplementary Materials of the original publication.
Simulated data. Controlled settings derived from cell lines data were generated from the Liu dataset and can be reproduced using the provided reproducibility code (see Code availability).
OP Multiome. We retrieve a Multiome (RNA+ATAC) dataset from the Open Problems challenge^{52} and select only the first batch, which contains 6137 BMMCs. The GEO accession number is GSE194122 and the data is available at.
OP CITE. We retrieve a CITEseq (RNA+ADT) dataset from the Open Problems challenge^{52} and select only the first batch, which contains 4249 BMMCs. The GEO accession number is GSE194122 and the data is available at.
BM CITE. We retrieve a CITEseq (RNA+ADT) dataset from Stuart et al.^{25} with 29,803 BMMCs. The GEO accession number is GSE128639 and the data is available at.
PBMC TEAseq. We retrieve a recent TEAseq (RNA+ATAC+ADT) dataset from Swanson et al.^{7} with 7084 PBMCs. The GEO accession number is GSE158013 and the data is available at.
Code availability
Package. The Python package for Mowgli is hosted at https://github.com/cantinilab/mowgli/^{89}. It can be installed easily by running pip install mowgli. Reproducibility. Code to reproduce the experiments and figures is available at https://github.com/cantinilab/mowgli_reproducibility/.
References
Rajewsky, N. et al. LifeTime and improving European healthcare through cellbased interceptive medicine. Nature 587, 377–386 (2020).
Potter, S. S. Singlecell RNA sequencing for the study of development, physiology and disease. Nat. Rev. Nephrol. 14, 479–492 (2018).
Papalexi, E. & Satija, R. Singlecell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. 18, 35–45 (2018).
Lee, J., Hyeon, D. Y. & Hwang, D. Singlecell multiomics: technologies and data analysis methods. Exp. Mol. Med. 52, 1428–1442 (2020).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Clark, S. J. et al. scNMTseq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 781 (2018).
Swanson, E. et al. Simultaneous trimodal singlecell measurement of transcripts, epitopes, and chromatin accessibility using TEAseq. eLife 10, e63632 (2021).
Liu, L. et al. Deconvolution of singlecell multiomics layers reveals regulatory heterogeneity. Nat. Commun. 10, 470 (2019).
Angermueller, C. et al. Parallel singlecell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232 (2016).
Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
Chen, S., Lake, B. B. & Zhang, K. Highthroughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Miao, Z., Humphreys, B. D., McMahon, A. P. & Kim, J. Multiomics integration in the age of million singlecell data. Nat. Rev. Nephrol. 17, 710–724 (2021).
Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multimodal singlecell data. Genome Biol. 21, 111 (2020).
Hao, Y. et al. Integrated analysis of multimodal singlecell data. Cell 184, 3573–3587.e29 (2021).
Gayoso, A. et al. Joint probabilistic modeling of singlecell multiomic data with totalVI. Nat. Methods 18, 272–282 (2021).
Ashuach, T., Gabitto, M. I., Jordan, M. I. & Yosef, N. MultiVI: deep generative model for the integration of multimodal data. https://doi.org/10.1101/2021.08.20.457057 (2021).
Zuo, C. & Chen, L. Deepjointlearning analysis model of single cell transcriptome and open chromatin accessibility data. Brief. Bioinform. 22, bbaa287 (2021).
Duren, Z. et al. Regulatory analysis of single cell multiome gene expression and chromatin accessibility data with scREG. Genome Biol. 23, 114 (2022).
Singh, R., Hie, B. L., Narayan, A. & Berger, B. Schema: metric learning enables interpretable synthesis of heterogeneous singlecell modalities. Genome Biol. 22, 131 (2021).
Wang, X. et al. BREMSC: a bayesian random effects mixture model for joint clustering single cell multiomics data. Nucl. Acids Res. 48, 5814–5824 (2020).
Jin, S., Zhang, L. & Nie, Q. scAI: an unsupervised approach for the integrative analysis of parallel singlecell transcriptomic and epigenomic profiles. Genome Biol. 21, 25 (2020).
Kim, H. J., Lin, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. CiteFuse enables multimodal analysis of CITEseq data. Bioinformatics 36, 4137–4143 (2020).
Welch, J. D. et al. SingleCell Multiomic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 177, 1873–1887 (2019).
Stuart, T. et al. Comprehensive Integration of SingleCell Data. Cell 177, 1888–1902.e21 (2019).
Gong, B., Zhou, Y. & Purdom, E. Cobolt: integrative analysis of multimodal singlecell sequencing data. Genome Biol. 22, 1–21 (2021).
Lotfollahi, M., Litinetskaya, A. & Theis, F. J. Multigrate: singlecell multiomic data integration. BioRxiv (2022).
Stanojevic, S., Li, Y., Ristivojevic, A. & Garmire, L. X. Computational Methods for Singlecell Multiomics Integration and Alignment. Genomics Proteomics Bioinformatics https://doi.org/10.1016/j.gpb.2022.11.013 (2022).
Ainsworth, S., Foti, N., Lee, A. K. & Fox, E. Interpretable VAEs for nonlinear group factor analysis. at http://arxiv.org/abs/1802.06765 (2018).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of singlecell RNAseq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020).
Lee, D. D. & Seung, H. S. Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999).
Monge, G. Memoire sur la theorie des deblais et des remblais. Mem Math Phys Acad R. Sci. 666–704 (1781).
Huizing, G.J., Peyré, G. & Cantini, L. Optimal transport improves cell–cell similarity inference in singlecell omics data. Bioinformatics 38, 2169–2177 (2022).
SteinO’Brien, G. L. et al. Enter the Matrix: Factorization Uncovers Knowledge from Omics. Trends Genet. 34, 790–805 (2018).
Cantini, L. et al. Benchmarking joint multiomics dimensionality reduction approaches for the study of cancer. Nat. Commun. 12, 1–12 (2021).
van der Maaten, L. & Hinton, G. Visualizing Data using tSNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing wellconnected communities. Sci. Rep. 9, 5233 (2019).
Blondel, V. D., Guillaume, J.L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, P10008 (2008).
Mootha, V. K. et al. PGC1αresponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
Korhonen, J. H., Palin, K., Taipale, J. & Ukkonen, E. Fast motif matching revisited: highorder PWMs, SNPs and indels. Bioinformatics 33, 514–521 (2017).
Rolet, A., Cuturi, M. & Peyré, G. Fast dictionary learning with a smoothed Wasserstein loss. in Artificial Intelligence and Statistics. 51, 630–638 (PMLR, 2016).
Qian, W., Hong, B., Cai, D., He, X. & Li, X. NonNegative Matrix Factorization with Sinkhorn Distance. IJCAI 1960–1966 (2016).
Schmitz, M. A. et al. Wasserstein dictionary learning: Optimal transportbased unsupervised nonlinear dictionary learning. SIAM J. Imaging Sci. 11, 643–678 (2018).
Zhang, S. Y. A unified framework for nonnegative matrix and tensor factorisations with a smoothed Wasserstein loss. in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 4178–4186 (IEEE). https://doi.org/10.1109/ICCVW54120.2021.00466 2021.
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: largescale singlecell gene expression data analysis. Genome Biol. 19, 15 (2018).
Bredikhin, D., Kats, I. & Stegle, O. MUON: multimodal omics analysis framework. Genome Biol. 23, 42 (2022).
Lähnemann, D. et al. Eleven grand challenges in singlecell data science. Genome Biol. 21, 31 (2020).
Qiu, P. Embracing the dropouts in singlecell RNAseq analysis. Nat. Commun. 11, 1169 (2020).
Lance, C. et al. Multimodal single cell data integration challenge: Results and lessons learned. in Proc. of the NeurIPS 2021 Competitions and Demonstrations Track 162–176 (PMLR, 2022).
Luecken, M. D. et al. Benchmarking atlaslevel data integration in singlecell genomics. Nat. Methods 19, 41–50 (2022).
Luecken, M. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (eds. Vanschoren, J. & Yeung, S.) vol. 1 (2021).
Lanier, L. L. NKG2D receptor and its ligands in host defense. Cancer Immunol. Res. 3, 575–582 (2015).
Boles, K. S., Barchet, W., Diacovo, T., Cella, M. & Colonna, M. The tumor suppressor TSLC1/NECL2 triggers NKcell and CD8+ Tcell responses through the cellsurface receptor CRTAM. Blood 106, 779–786 (2005).
Prince, H. E., York, J. & Jensen, E. R. Phenotypic comparison of the three populations of human lymphocytes defined by CD45RO and CD45RA expression. Cell. Immunol. 145, 254–262 (1992).
Shah, K., AlHaidari, A., Sun, J. & Kazi, J. U. T cell receptor (TCR) signaling in health and disease. Signal Transduct. Target. Ther. 6, 1–26 (2021).
Intlekofer, A. M. et al. Effector and memory CD8+ T cell fate coupled by Tbet and eomesodermin. Nat. Immunol. 6, 1236–1244 (2005).
Pirron, U., Schlunck, T., Prinz, J. C. & Rieber, E. P. IgEdependent antigen focusing by human B lymphocytes is mediated by the lowaffinity receptor for IgE. Eur. J. Immunol. 20, 1547–1551 (1990).
Bartee, E., Mansouri, M., Hovey Nerenberg, B. T., Gouveia, K. & Früh, K. Downregulation of Major Histocompatibility Complex Class I by Human Ubiquitin Ligases Related to Viral Immune Evasion Proteins. J. Virol. 78, 1109–1120 (2004).
Glass, D. R. et al. An Integrated Multiomic SingleCell Atlas of Human B Cell Identity. Immunity 53, 217–232.e5 (2020).
Lukin, K., Fields, S., Hartley, J. & Hagman, J. Early B cell factor: Regulator of B lineage specification and commitment. Semin. Immunol. 20, 221–227 (2008).
Kaileh, M. & Sen, R. NF‐κB function in B lymphocytes. Immunol. Rev. 246, 254–271 (2012).
Schroeder, H. W. & Cavacini, L. Structure and function of immunoglobulins. J. Allergy Clin. Immunol. 125, S41–S52 (2010).
Ody, C. et al. Junctional adhesion molecule C (JAMC) distinguishes CD27+ germinal center B lymphocytes from nongerminal center cells and constitutes a new diagnostic tool for Bcell malignancies. Leukemia 21, 1285–1293 (2007).
Weber, C., Fraemohs, L. & Dejana, E. The role of junctional adhesion molecules in vascular inflammation. Nat. Rev. Immunol. 7, 467–477 (2007).
Doñate, C. et al. Homing of Human B Cells to Lymphoid Organs and BCell Lymphoma Engraftment Are Controlled by Cell Adhesion Molecule JAMC. Cancer Res. 73, 640–651 (2013).
Laidlaw, B. J. & Cyster, J. G. Transcriptional regulation of memory B cell differentiation. Nat. Rev. Immunol. 21, 209–220 (2021).
Vivier, E., Tomasello, E., Baratin, M., Walzer, T. & Ugolini, S. Functions of natural killer cells. Nat. Immunol. 9, 503–510 (2008).
RodaNavarro, P. et al. Human KLRF1, a novel member of the killer cell lectinlike receptor gene family: molecular characterization, genomic structure, physical mapping to the NK gene complex and expression analysis. Eur. J. Immunol. 30, 568–576 (2000).
Su, B., Bochan, M. R., Hanna, W. L., Froelich, C. J. & Brahmi, Z. Human granzyme B is essential for DNA fragmentation of susceptible target cells. Eur. J. Immunol. 24, 2073–2080 (1994).
Guo, H., CruzMunoz, M.E., Wu, N., Robbins, M. & Veillette, A. Immune Cell Inhibition by SLAMF7 Is Mediated by a Mechanism Requiring Src Kinases, CD45, and SHIP1 That Is Defective in Multiple Myeloma Cells. Mol. Cell. Biol. 35, 41–51 (2015).
Zhang, J. et al. Sequential actions of EOMES and TBET promote stepwise maturation of natural killer cells. Nat. Commun. 12, 5446 (2021).
Ponti, C. et al. Role of CREB transcription factor in cfos activation in natural killer cells. Eur. J. Immunol. 32, 3358–3365 (2002).
Bernard, K. et al. Engagement of natural cytotoxicity programs regulates AP1 expression in the NKL human NK cell line. J. Immunol. Baltim. 162, 4062–4068 (1999).
Brunet, J.P., Tamayo, P., Golub, T. R. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. 101, 4164–4169 (2004).
Kotliar, D. et al. Identifying gene expression programs of celltype identity and cellular activity with singlecell RNASeq. eLife 8, e43803 (2019).
Kantorovich, L. On the transfer of masses (in Russian). Doklady Akademii Nauk 37, 227–229 (1942).
Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. in Advances in Neural Information Processing Systems (eds. Burges, C. J., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q.) vol. 26 (Curran Associates, Inc., 2013).
Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004).
Le Roux, J., Weninger, F. J. & Hershey, J. R. Sparse NMF – half-baked or well done? Mitsubishi Electr. Res. Labs MERL Camb. MA USA Tech Rep No TR2015-023 11, 13–15 (2015).
Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
Chalise, P. & Fridley, B. L. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLOS ONE 12, e0176278 (2017).
Raudvere, U. et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucl. Acids Res. 47, W191–W198 (2019).
Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Castro-Mondragon, J. A. et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucl. Acids Res. 50, D165–D173 (2022).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Marbach, D. et al. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nat. Methods 13, 366–370 (2016).
Huizing, G.-J., Deutschmann, I. M., Peyré, G. & Cantini, L. cantinilab/Mowgli: v0.3.1. Zenodo https://doi.org/10.5281/zenodo.8410737 (2023).
Acknowledgements
We would like to thank Frank Augé and Tommaso Andreani for the insightful scientific discussions on this project in the context of the Sanofi iTech Awards. This work was supported by the Sanofi iTech Awards. The project leading to this manuscript has received funding from the Agence Nationale de la Recherche (ANR) JCJC project scMOmix and the French government under management of Agence Nationale de la Recherche as part of the ‘Investissements d’avenir’ program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). The work of G. Peyré was supported by the European Research Council (ERC project NORIA) and the French government under management of Agence Nationale de la Recherche as part of the ‘Investissements d’avenir’ program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). GPU computations were performed using HPC resources from GENCI-IDRIS (Grant 2022AD011012079R2).
Author information
Contributions
GJ.H., G.P. and L.C. designed and planned the study. GJ.H. and L.C. wrote the paper. G.P. revised the manuscript. GJ.H. developed the tool and performed all the analyses. IM.D. participated in the data analysis.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huizing, GJ., Deutschmann, I.M., Peyré, G. et al. Paired single-cell multi-omics data integration with Mowgli. Nat Commun 14, 7711 (2023). https://doi.org/10.1038/s41467-023-43019-2
DOI: https://doi.org/10.1038/s41467-023-43019-2