The GLUE framework

We assume that there are K different omics layers to be integrated, each with a distinct feature set \({{{\mathcal{V}}}}_k,k = 1,2, \ldots ,K\). For example, in scRNA-seq, \({\mathcal{V}}_k\) is the set of genes, while in scATAC-seq, \({{{\mathcal{V}}}}_k\) is the set of chromatin regions. The data spaces of different omics layers are denoted as \({{{\mathcal{X}}}}_k \subseteq {\Bbb R}^{\left| {{{{\mathcal{V}}}}_k} \right|}\) with varying dimensionalities. We use \({{{\mathbf{x}}}}_k^{(n)} \in {{{\mathcal{X}}}}_k,n = 1,2, \ldots ,N_k\) to denote cells from the kth omics layer and \({{{\mathbf{x}}_{k}}_{i}}^{(n)},i \in {{{\mathcal{V}}}}_k\) to denote the observed value of feature i of the kth layer in the nth cell, where \(N_k\) is the sample size of the kth layer. Notably, the cells from different omics layers are unpaired and can have different sample sizes. To avoid clutter, we drop the superscript (n) when referring to an arbitrary cell.

We model the observed data from different omics layers as generated by a low-dimensional latent variable (that is, cell embedding) \({{{\mathbf{u}}}} \in {\Bbb R}^m\):

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{x}}}}_k;\theta _k} \right) = {\int} p \left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}};\theta _k} \right)p\left( {{{\mathbf{u}}}} \right){\mathrm{d}}{{{\mathbf{u}}}}} \end{array}$$ (1)

where p(u) is the prior distribution of the latent variable, \(p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}};\theta _k} \right)\) are learnable generative distributions (that is, data decoders) and \(\theta _k\) denotes learnable parameters in the decoders. The cell latent variable u is shared across different omics layers. In other words, u represents the common cell states underlying all omics observations, while the observed data from each layer are generated by a specific type of measurement of the underlying cell states.

With the introduction of variational posteriors \(q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)\) (that is, data encoders, where \(\phi _k\) are learnable parameters in the encoders), model fitting can be efficiently performed by maximizing the following evidence lower bounds:

$$\begin{array}{rcl}{{{\mathcal{L}}}}_{{{{\mathcal{X}}}}_k}\left( {\phi _k,\theta _k} \right) & = & {\Bbb E}_{{{{\mathbf{x}}}}_k \sim p_{{{{\mathrm{data}}}}}\left( {{{{\mathbf{x}}}}_k} \right)}\left[ {{\Bbb E}_{{{{\mathbf{u}}}} \sim q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)}\log p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}};\theta _k} \right)}\right. \\ && \left.{ - {{{\mathrm{KL}}}}\left( {q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)\parallel p\left( {{{\mathbf{u}}}} \right)} \right)} \right]\end{array}$$ (2)

Since different autoencoders are independently parameterized and trained on separate data, the cell embeddings learned for different omics layers could have inconsistent semantic meanings unless they are linked properly.

To link the autoencoders, we propose a guidance graph \({{{\mathcal{G}}}} = \left( {{{{\mathcal{V}}}},{{{\mathcal{E}}}}} \right)\), which incorporates prior knowledge about the regulatory interactions among features at distinct omics layers, where \({{{\mathcal{V}}}} = \mathop {\bigcup}\nolimits_{k = 1}^K {{{{\mathcal{V}}}}_k}\) is the universal feature set and \({{{\mathcal{E}}}} = \left\{ {\left( {i,j} \right)|i,j \in {{{\mathcal{V}}}}} \right\}\) is the set of edges. Each edge is also associated with a sign and a weight, denoted as \(s_{ij}\) and \(w_{ij}\), respectively. We require that \(w_{ij} \in (0,1]\), which can be interpreted as interaction credibility, and that \(s_{ij} \in \left\{ { - 1,1} \right\}\), which specifies the sign of the regulatory interaction. For example, an ATAC peak located near the promoter of a gene is usually assumed to positively regulate its expression, so they can be connected with a positive edge (\(s_{ij} = 1\)). Meanwhile, DNA methylation in the gene promoter is usually assumed to suppress expression, so they can be connected with a negative edge (\(s_{ij} = - 1\)). In addition to the connections between features, self-loops are also added for numerical stability, with \(s_{ii} = 1,w_{ii} = 1,\forall i \in {{{\mathcal{V}}}}\). The guidance graph is allowed to be a multi-graph, where more than one edge can exist between the same pair of vertices, representing different types of prior regulatory evidence.

We treat the guidance graph as an observed variable and model it as generated by low-dimensional feature latent variables (that is, feature embeddings) \({{{\mathbf{v}}}}_i \in {\Bbb R}^m,i \in {{{\mathcal{V}}}}\). Furthermore, in contrast to the previous model, we now model \({{{\mathbf{x}}}}_k\) as generated by the combination of feature latent variables \({{{\mathbf{v}}}}_i \in {\Bbb R}^m,i \in {{{\mathcal{V}}}}_k\) and the cell latent variable \({{{\mathbf{u}}}} \in {\Bbb R}^m\). For convenience, we introduce the notation \({{{\mathbf{V}}}} \in {\Bbb R}^{m \times \left| {{{\mathcal{V}}}} \right|}\), which combines all feature embeddings into a single matrix. The model likelihood can thus be written as:

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{x}}}}_k,{{{\mathcal{G}}}};\theta _k,\theta _{{{\mathcal{G}}}}} \right) = {\int} p \left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right)p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right)p\left( {{{\mathbf{u}}}} \right)p\left( {{{\mathbf{V}}}} \right){\mathrm{d}}{{{\mathbf{u}}}}{\mathrm{d}}{{{\mathbf{V}}}}} \end{array}$$ (3)

where \(p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right)\) and \(p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right)\) are learnable generative distributions for the omics data (that is, data decoders) and knowledge graph (that is, graph decoder), respectively. \(\theta _k\) and \(\theta _{{{\mathcal{G}}}}\) are learnable parameters in the decoders. p(u) and p(V) are the prior distributions of the cell latent variable and feature latent variables, respectively, which are fixed as standard normal distributions for simplicity:

$$\begin{array}{*{20}{c}} {p\left( {{{\mathbf{u}}}} \right) = N\left( {{{{\mathbf{u}}}};\mathbf{0},{{{\mathbf{I}}}}_m} \right)} \end{array}$$ (4)

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{v}}}}_i} \right) = N\left( {{{{\mathbf{v}}}}_i;\mathbf{0},{{{\mathbf{I}}}}_m} \right),p\left( {{{\mathbf{V}}}} \right) = \mathop {\prod }\limits_{i \in {{{\mathcal{V}}}}} p\left( {{{{\mathbf{v}}}}_i} \right)} \end{array}$$ (5)

although alternatives may also be used68. For convenience, we also introduce the notation \({{{\mathbf{V}}}}_k \in {\Bbb R}^{m \times \left| {{{{\mathcal{V}}}}_k} \right|}\), which contains only feature embeddings in the kth omics layer, and \({{{\mathbf{u}}}}_k\), which emphasizes that the cell embedding is from a cell in the kth omics layer.

The graph likelihood \(p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right)\) (that is, graph decoder) is defined as:

$$\begin{array}{rcl} {\log p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right)} = {{\Bbb E}_{i,j \sim p\left( {i,j;w_{ij}} \right)}} \\ {\left[ {\log \sigma \left( {s_{ij} {{{\mathbf{v}}}}_i^ \top {{{\mathbf{v}}}}_j} \right) + {\Bbb E}_{j\prime \sim p_{{{{\mathrm{ns}}}}}\left( {j\prime |i} \right)}\log \left( {1 - \sigma \left( {s_{ij} {{{\mathbf{v}}}}_i^ \top {{{\mathbf{v}}}}_{j\prime }} \right)} \right)} \right]} \end{array}$$ (6)

where σ is the sigmoid function and p ns is a negative sampling distribution69. Here the graph likelihood has no trainable parameters, so \(\theta _{{{\mathcal{G}}}} = \emptyset\). In other words, we first sample the edges (i, j) with probabilities proportional to the edge weights and then sample vertices j′ that are not connected to i and treat them as if \(s_{ij\prime } = s_{ij}\). When maximizing the graph likelihood, the inner products between features are maximized or minimized (per edge sign) based on the Bernoulli distribution. For example, ATAC peaks located near the promoter of a gene would be encouraged to have similar embeddings to that of the gene, while DNA methylation in the gene promoter would be encouraged to have a dissimilar embedding to that of the gene.
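As an illustration of equation (6), the following minimal sketch estimates the graph log-likelihood by Monte Carlo sampling. It is not the scglue implementation: the function names, the edge tuple layout and the choice of a uniform negative sampling distribution over vertices not connected to i are all assumptions made for this example.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def graph_log_likelihood(edges, emb, n_neg=2, n_samples=1000, seed=0):
    """Monte Carlo estimate of equation (6).

    edges: list of (i, j, sign, weight); emb: dict vertex -> embedding vector.
    Assumption: negatives j' are drawn uniformly from vertices not adjacent to i.
    """
    rng = random.Random(seed)
    vertices = sorted(emb)
    neighbors = {v: {v} for v in vertices}  # self-loops count as connected
    for i, j, _, _ in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    weights = [w for _, _, _, w in edges]
    total = 0.0
    for _ in range(n_samples):
        # Sample an edge with probability proportional to its weight.
        i, j, s, _ = rng.choices(edges, weights=weights, k=1)[0]
        term = log_sigmoid(s * dot(emb[i], emb[j]))
        candidates = [v for v in vertices if v not in neighbors[i]]
        if candidates:
            neg = 0.0
            for _ in range(n_neg):
                jp = rng.choice(candidates)
                # log(1 - sigmoid(x)) equals log_sigmoid(-x); j' inherits the sign s_ij.
                neg += log_sigmoid(-s * dot(emb[i], emb[jp]))
            term += neg / n_neg
        total += term
    return total / n_samples
```

With a single positive edge, embeddings that place the two connected features close together and the unconnected feature far away score higher than the reverse arrangement, which is the behavior described above.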

The data likelihoods \(p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right)\) (that is, data decoders) in equation (3) are built on the inner product between the cell embedding u and feature embeddings \({{{\mathbf{V}}}}_k\). Thus, analogous to the loading matrix in principal component analysis (PCA), the feature embeddings \({{{\mathbf{V}}}}_k\) confer semantic meanings for the cell embedding space. As \({{{\mathbf{V}}}}_k\) are modulated by interactions among omics features in the guidance graph, the semantic meanings become linked. While this linearity limits decoder capacity, our empirical evaluations show that it is well compensated by the nonlinear encoders, producing high-quality multi-omics alignments (Fig. 2, Extended Data Figs. 1–4 and Supplementary Figs. 1–7). The exact formulation of data likelihood depends on the omics data distribution. For example, for count-based scRNA-seq and scATAC-seq data, we used the negative binomial (NB) distribution:

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right) = \mathop {\prod }\limits_{i \in {{{\mathcal{V}}}}_k} {{{\mathrm{NB}}}}\left( {{{\mathbf{x}}_{k}}_{i};\mathbf{\mu} _i,\mathbf{\theta} _i} \right)} \end{array}$$ (7)

$$\begin{array}{*{20}{c}} {{\mathrm{NB}}\left( {{{\mathbf{x}}_{k}}_{i};{{{\mathbf{\mu }}}}_i,{{{\mathbf{\theta }}}}_i} \right) = \frac{{{{{\mathrm{{\Gamma}}}}}\left( {{{\mathbf{x}}_{k}}_{i} + {{{\mathbf{\theta }}}}_i} \right)}}{{{{{\mathrm{{\Gamma}}}}}\left( {{{{\mathbf{\theta }}}}_i} \right){{{\mathrm{{\Gamma}}}}}\left( {{{\mathbf{x}}_{k}}_{i} + 1} \right)}}\left( {\frac{{{{{\mathbf{\mu }}}}_i}}{{{{{\mathbf{\theta }}}}_i + {{{\mathbf{\mu }}}}_i}}} \right)^{{\mathbf{x}_{k}}_{i}}\left( {\frac{{{{{\mathbf{\theta }}}}_i}}{{{{{\mathbf{\theta }}}}_i + {{{\mathbf{\mu }}}}_i}}} \right)^{{{{\mathbf{\theta }}}}_i}} \end{array}$$ (8)

$$\begin{array}{*{20}{c}} {{{{\mathbf{\mu }}}}_i = {{{\mathrm{Softmax}}}}_i\left( {{{{\mathbf{\alpha }}}} \odot {{{\mathbf{V}}}}_k^ \top {{{\mathbf{u}}}} + {{{\mathbf{\beta }}}}} \right) \cdot \mathop {\sum }\limits_{j \in {{{\mathcal{V}}}}_k} {\mathbf{x}_{k}}_{j}} \end{array}$$ (9)

where \({{{\mathbf{\mu }}}},{{{\mathbf{\theta }}}} \in {\Bbb R}_ + ^{\left| {{{{\mathcal{V}}}}_k} \right|}\) are the mean and dispersion of the negative binomial distribution, respectively, \({{{\mathbf{\alpha }}}} \in {\Bbb R}_ + ^{\left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\beta }}}} \in {\Bbb R}^{\left| {{{{\mathcal{V}}}}_k} \right|}\) are scaling and bias factors, ⊙ is the Hadamard product, \({\mathrm{Softmax}}_i\) represents the ith dimension of the softmax output and \(\mathop {\sum}\nolimits_{j \in {{{\mathcal{V}}}}_k} {{\mathbf{x}_{k}}_{j}}\) gives the total count in the cell. Taking softmax and then multiplying by total count ensures that the library size of reconstructed data matches the original30. The set of learnable parameters is \(\theta _k = \left\{ {{{{\mathbf{\theta }}}},{{{\mathbf{\alpha }}}},{{{\mathbf{\beta }}}}} \right\}\). Analogously, many other distributions can also be supported, as long as we can parameterize the means of the distributions by feature-cell inner products.
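Equations (7)–(9) can be sketched for a single cell as follows. This is a plain-Python illustration rather than the scglue code; the function names and the list-of-lists layout for \({{{\mathbf{V}}}}_k\) are assumptions, and the sketch assumes a cell with a nonzero total count.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nb_log_prob(x, mu, theta):
    """Log of equation (8) for one feature (mean mu > 0, dispersion theta > 0)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + x * math.log(mu / (theta + mu))
            + theta * math.log(theta / (theta + mu)))

def nb_decoder(x, u, V_k, alpha, beta, theta):
    """Return (mu, log-likelihood) for one cell.

    x: counts per feature; u: cell embedding of length m;
    V_k: one length-m feature embedding per feature (hypothetical layout).
    """
    # Equation (9): scaled and shifted inner products, softmax, then library size.
    logits = [a * sum(vi * ui for vi, ui in zip(v, u)) + b
              for a, v, b in zip(alpha, V_k, beta)]
    total = sum(x)
    mu = [p * total for p in softmax(logits)]
    # Equation (7): product over features, computed as a sum of logs.
    log_lik = sum(nb_log_prob(xi, mi, ti) for xi, mi, ti in zip(x, mu, theta))
    return mu, log_lik

mu, log_lik = nb_decoder(
    x=[3, 0, 7], u=[0.5, -1.0],
    V_k=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    alpha=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0], theta=[2.0, 2.0, 2.0],
)
```

Because the softmax output sums to 1, the reconstructed means sum to the observed total count of the cell, which is the library-size property noted above.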

For efficient inference and optimization, we introduce the following factorized variational posterior:

$$\begin{array}{*{20}{c}} {q\left( {{{{\mathbf{u}}}},{{{\mathbf{V}}}}|{{{\mathbf{x}}}}_k,{{{\mathcal{G}}}};\phi _k,\phi _{{{\mathcal{G}}}}} \right) = q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right) \cdot q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)} \end{array}$$ (10)

The graph variational posterior \(q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)\) (that is, graph encoder) is modeled as diagonal-covariance normal distributions parameterized by a graph convolutional network70:

$$\begin{array}{*{20}{c}} {q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right) = \mathop {\prod }\limits_{i \in {{{\mathcal{V}}}}} q\left( {{{{\mathbf{v}}}}_i|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)} \end{array}$$ (11)

$$\begin{array}{*{20}{c}} {q\left( {{{{\mathbf{v}}}}_i|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right) = N\left( {{{{\mathbf{v}}}}_i;{{{\mathrm{GCN}}}}_{{{{\mathbf{\mu }}}}_i}\left( {{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right),{{{\mathrm{GCN}}}}_{{{{\mathbf{\sigma }}}}_i^2}\left( {{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)} \right)} \end{array}$$ (12)

where \(\phi _{{{\mathcal{G}}}}\) represents the learnable parameters in the graph convolutional network (GCN) encoder.

The variational data posteriors \(q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)\) (that is, data encoders) are modeled as diagonal-covariance normal distributions parameterized by multilayer perceptron (MLP) neural networks:

$$\begin{array}{*{20}{c}} {q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right) = N\left( {{{{\mathbf{u}}}};{{{\mathrm{MLP}}}}_{k,{{{\mathbf{\mu }}}}}\left( {{{{\mathbf{x}}}}_k;\phi _k} \right),{{{\mathrm{MLP}}}}_{k,{{{\mathbf{\sigma }}}}^2}\left( {{{{\mathbf{x}}}}_k;\phi _k} \right)} \right)} \end{array}$$ (13)

where \(\phi _k\) is the set of learnable parameters in the multilayer perceptron encoder of the kth omics layer.

Model fitting can then be performed by maximizing the following evidence lower bound:

$$\begin{array}{*{20}{c}} {\mathop {\sum}\limits_{k = 1}^K {{\Bbb E}_{{{{\mathbf{x}}}}_k \sim p_{{{{\mathrm{data}}}}}\left( {{{{\mathbf{x}}}}_k} \right)}} \left[ {\begin{array}{*{20}{c}} {{\Bbb E}_{{{{\mathbf{u}}}} \sim q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right),{{{\mathbf{V}}}} \sim q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)}\log p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right)p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right)} \\ { - \mathrm{KL}\left( {q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)\parallel p\left( {{{\mathbf{u}}}} \right)p\left( {{{\mathbf{V}}}} \right)} \right)} \end{array}} \right]} \end{array}$$ (14)

which can be further rearranged into the following form:

$$\begin{array}{*{20}{c}} {K \cdot {{{\mathcal{L}}}}_{{{\mathcal{G}}}}\left( {\theta _{{{\mathcal{G}}}},\phi _{{{\mathcal{G}}}}} \right) + \mathop {\sum}\limits_{k = 1}^K {{{{\mathcal{L}}}}_{{{{\mathcal{X}}}}_k}} \left( {\theta _k,\phi _k,\phi _{{{\mathcal{G}}}}} \right)} \end{array}$$ (15)

where we have

$$\begin{array}{rcl}{{{\mathcal{L}}}}_{{{{\mathcal{X}}}}_k}\left( {\theta _k,\phi _k,\phi _{{{\mathcal{G}}}}} \right) = {\Bbb E}_{{{{\mathbf{x}}}}_k \sim p_{{{{\mathrm{data}}}}}\left( {{{{\mathbf{x}}}}_k} \right)} \\ \left[ {{\Bbb E}_{{{{\mathbf{u}}}} \sim q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right),{{{\mathbf{V}}}} \sim q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)}\log p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right) - {{{\mathrm{KL}}}}\left( {q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)\parallel p\left( {{{\mathbf{u}}}} \right)} \right)} \right]\end{array}$$ (16)

$$\begin{array}{*{20}{c}} {{{{\mathcal{L}}}}_{{{\mathcal{G}}}}\left( {\theta _{{{\mathcal{G}}}},\phi _{{{\mathcal{G}}}}} \right) = {\Bbb E}_{{{{\mathbf{V}}}} \sim q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)}\log p\left( {{{{\mathcal{G}}}}|{{{\mathbf{V}}}};\theta _{{{\mathcal{G}}}}} \right) - \mathrm{KL}\left( {q\left( {{{{\mathbf{V}}}}|{{{\mathcal{G}}}};\phi _{{{\mathcal{G}}}}} \right)\parallel p\left( {{{\mathbf{V}}}} \right)} \right)} \end{array}$$ (17)
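One way to see the rearrangement from equation (14) to equation (15), sketched here using only the factorizations in equations (5) and (10), is that both the log-likelihood and the KL term split additively:

$$\log p\left( {{\mathbf{x}}_k|{\mathbf{u}},{\mathbf{V}};\theta _k} \right)p\left( {{\mathcal{G}}|{\mathbf{V}};\theta _{\mathcal{G}}} \right) = \log p\left( {{\mathbf{x}}_k|{\mathbf{u}},{\mathbf{V}};\theta _k} \right) + \log p\left( {{\mathcal{G}}|{\mathbf{V}};\theta _{\mathcal{G}}} \right)$$

$$\mathrm{KL}\left( {q\left( {{\mathbf{u}}|{\mathbf{x}}_k;\phi _k} \right)q\left( {{\mathbf{V}}|{\mathcal{G}};\phi _{\mathcal{G}}} \right)\parallel p\left( {\mathbf{u}} \right)p\left( {\mathbf{V}} \right)} \right) = \mathrm{KL}\left( {q\left( {{\mathbf{u}}|{\mathbf{x}}_k;\phi _k} \right)\parallel p\left( {\mathbf{u}} \right)} \right) + \mathrm{KL}\left( {q\left( {{\mathbf{V}}|{\mathcal{G}};\phi _{\mathcal{G}}} \right)\parallel p\left( {\mathbf{V}} \right)} \right)$$

The graph terms depend on neither \({\mathbf{x}}_k\) nor \({\mathbf{u}}\), so they pass through the expectations over \({\mathbf{x}}_k\) and \({\mathbf{u}}\) unchanged. Each of the K summands in equation (14) therefore contributes one copy of \({\mathcal{L}}_{\mathcal{G}}\), which gives the factor K in equation (15), and the remaining terms form \({\mathcal{L}}_{{\mathcal{X}}_k}\) in equation (16).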

Below, for convenience, we denote the union of all encoder parameters as \(\phi = \left( {\mathop {\bigcup}\nolimits_{k = 1}^K {\phi _k} } \right) \cup \phi _{{{\mathcal{G}}}}\) and the union of all decoder parameters as \(\theta = \left( {\mathop {\bigcup}\nolimits_{k = 1}^K {\theta _k} } \right) \cup \theta _{{{\mathcal{G}}}}\).

To ensure the proper alignment of different omics layers, we use the adversarial alignment strategy31,71. A discriminator D with a K-dimensional softmax output is introduced, which predicts the omics layers of cells based on their embeddings u. The discriminator D is trained by minimizing the multiclass classification cross entropy:

$$\begin{array}{*{20}{c}} {{{{\mathcal{L}}}}_{{{\mathrm{D}}}}\left( {\phi ,\psi } \right) = - \frac{1}{K}\mathop {\sum }\limits_{k = 1}^K {\Bbb E}_{{{{\mathbf{x}}}}_k \sim p_{{{{\mathrm{data}}}}}\left( {{{{\mathbf{x}}}}_k} \right)}{\Bbb E}_{{{{\mathbf{u}}}} \sim q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)}\log {{{\mathrm{D}}}}_k\left( {{{{\mathbf{u}}}};\psi } \right)} \end{array}$$ (18)

where \({\mathrm{D}}_k\) represents the kth dimension of the discriminator output and ψ is the set of learnable parameters in the discriminator. The data encoders can then be trained in the opposite direction to fool the discriminator, ultimately leading to the alignment of cell embeddings from different omics layers72.

The overall training objective of GLUE thus consists of:

$$\begin{array}{*{20}{c}} {\mathop {{\min }}\limits_\psi \lambda _{{{\mathrm{D}}}} \cdot {{{\mathcal{L}}}}_{{{\mathrm{D}}}}\left( {\phi ,\psi } \right)} \end{array}$$ (19)

$$\begin{array}{*{20}{c}} {\mathop {{\max }}\limits_{\theta ,\phi } \lambda _{{{\mathrm{D}}}} \cdot {{{\mathcal{L}}}}_{{{\mathrm{D}}}}\left( {\phi ,\psi } \right) + \lambda _{{{\mathcal{G}}}}K \cdot {{{\mathcal{L}}}}_{{{\mathcal{G}}}}\left( {\theta _{{{\mathcal{G}}}},\phi _{{{\mathcal{G}}}}} \right) + \mathop {\sum }\limits_{k = 1}^K {{{\mathcal{L}}}}_{{{{\mathcal{X}}}}_k}\left( {\theta _k,\phi _k,\phi _{{{\mathcal{G}}}}} \right)} \end{array}$$ (20)

The two hyperparameters \(\lambda _{\mathrm{D}}\) and \(\lambda _{{{\mathcal{G}}}}\) control the contributions of adversarial alignment and graph-based feature embedding, respectively. We use stochastic gradient descent to train the GLUE model. Each stochastic gradient descent iteration is divided into two steps. In the first step, the discriminator is updated according to equation (19). In the second step, the data and graph autoencoders are updated according to equation (20). The RMSprop optimizer with no momentum term is used to ensure the stability of adversarial training.

Weighted adversarial alignment

As shown in previous work31, canonical adversarial alignment amounts to minimizing a generalized form of Jensen–Shannon divergence among the cell embedding distributions of different omics layers:

$$\frac{1}{K}\mathop {\sum}\limits_{k = 1}^K {{{{\mathrm{KL}}}}} \left( {q_k({{{\mathbf{u}}}})||\frac{1}{K}\mathop {\sum}\limits_{k = 1}^K {q_k} ({{{\mathbf{u}}}})} \right)$$ (21)

where \(q_k\left( {{{\mathbf{u}}}} \right) = {\Bbb E}_{{{{\mathbf{x}}}}_k \sim p_{{{{\mathrm{data}}}}}\left( {{{{\mathbf{x}}}}_k} \right)}q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k;\phi _k} \right)\) represents the marginal cell embedding distribution of the kth layer. Without other loss terms, equation (21) converges at perfect alignment, that is, when \(q_i\left( {{{\mathbf{u}}}} \right) = q_j\left( {{{\mathbf{u}}}} \right),\forall i \ne j\). This can be problematic when cell type compositions differ dramatically across different layers, for example, in the cell atlas integration. To address this issue, we added cell-specific weights \(w^{\left( n \right)}\) to the discriminator loss in equation (18):

$$\begin{array}{*{20}{c}} {{{{\mathcal{L}}}}_{{{\mathrm{D}}}}\left( {\phi ,\psi } \right) = - \frac{1}{K}\mathop {\sum }\limits_{k = 1}^K \frac{1}{{W_k}}\mathop {\sum }\limits_{n = 1}^{N_k} w^{\left( n \right)} \cdot {\Bbb E}_{{{{\mathbf{u}}}} \sim q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k^{\left( n \right)};\phi _k} \right)}\log {{{\mathrm{D}}}}_k\left( {{{{\mathbf{u}}}};\psi } \right)} \end{array}$$ (22)

where the normalizer \(W_k = \mathop {\sum}\nolimits_{n = 1}^{N_k} {w^{\left( n \right)}}\). The adversarial alignment still amounts to minimizing equation (21) but with weighted marginal cell embedding distributions \(q_k\left( {{{\mathbf{u}}}} \right) = \frac{1}{{W_k}}\mathop {\sum}\limits_{n = 1}^{N_k} {w^{\left( n \right)}} q\left( {{{{\mathbf{u}}}}|{{{\mathbf{x}}}}_k^{\left( n \right)};\phi _k} \right)\). By assigning appropriate weights to balance the cell distributions across different layers, the optimum of \(q_i\left( {{{\mathbf{u}}}} \right) = q_j\left( {{{\mathbf{u}}}} \right),\forall i \ne j\) could be much closer to the desired alignment.
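The weighted discriminator loss in equation (22) can be sketched as below. The function name and the nested-list input layout are assumptions for illustration, and the inner expectation over u is approximated with a single sampled embedding per cell.

```python
import math

def weighted_discriminator_loss(log_probs, weights):
    """Equation (22) with one sampled embedding per cell.

    log_probs[k][n]: log D_k(u) for cell n of omics layer k.
    weights[k][n]: balancing weight w^(n) of that cell.
    """
    K = len(log_probs)
    loss = 0.0
    for lp_k, w_k in zip(log_probs, weights):
        W_k = sum(w_k)  # per-layer normalizer
        loss -= sum(w * lp for w, lp in zip(w_k, lp_k)) / W_k
    return loss / K

log_probs = [[math.log(0.5), math.log(0.25)], [math.log(0.5)]]
uniform = weighted_discriminator_loss(log_probs, [[1.0, 1.0], [1.0]])
rescaled = weighted_discriminator_loss(log_probs, [[3.0, 3.0], [7.0]])
```

Because of the normalizer \(W_k\), multiplying all weights within a layer by a constant leaves the loss unchanged; only the relative weights of cells within a layer matter.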

To obtain the balancing weights in an unsupervised manner, we devised the following two-stage training procedure. First, we pretrain the GLUE model with constant weight \(w^{\left( n \right)} = 1\), during which noise \({\boldsymbol{\epsilon}} \sim {{{\mathcal{N}}}}\left( {{\boldsymbol{\epsilon}} ;\mathbf{0},{\mathbf{\Sigma}}} \right)\) is added to the cell embeddings before they are passed to the discriminator. We set \({\mathbf{\Sigma}}\) to be 1.5× the empirical variance of cell embeddings in each minibatch, which helps produce a coarse alignment immune to composition imbalance. Then, we cluster the coarsely aligned cell embeddings per omics layer using Leiden clustering. The balancing weight \(w_i\) for cells in cluster i is computed as:

$$\begin{array}{*{20}{c}} {w_i = \frac{{\mathop {\sum }\nolimits_{k_i \ne k_j} f\left( {{{{\mathbf{u}}}}_i,{{{\mathbf{u}}}}_j} \right)}}{{n_i}}} \end{array}$$ (23)

$$\begin{array}{*{20}{c}} {f\left( {{{{\mathbf{u}}}}_i,{{{\mathbf{u}}}}_j} \right) = \left\{ {\begin{array}{*{20}{l}} {\cos \left( {{{{\mathbf{u}}}}_i,{{{\mathbf{u}}}}_j} \right)^4,} \hfill & {{\mathrm{cos}}({{{\mathbf{u}}}}_i,{{{\mathbf{u}}}}_j) > 0.5} \hfill \\ {0,} \hfill & {{\mathrm{otherwise}}} \hfill \end{array}} \right.} \end{array}$$ (24)

where \({{{\mathbf{u}}}}_i\) is the average cell embedding of cluster i, \(k_i\) denotes the omics layer of cluster i, and \(n_i\) is the number of cells in cluster i. In other words, we sum up the cosine similarities (raised to the power of 4 to increase contrast) between cluster i and all its matching clusters in other layers with cosine similarity >0.5, and then normalize by cluster size, which effectively balances the contribution of matching clusters regardless of their sizes. In the second stage, we fine-tune the GLUE model with the estimated balancing weights, during which the additive noise \({\boldsymbol{\epsilon}} \sim {{{\mathcal{N}}}}\left( {{\boldsymbol{\epsilon}} ;\mathbf{0},\tau \cdot {\mathbf{\Sigma}}} \right)\) gradually anneals to 0 (with τ starting at 1 and decreasing linearly per epoch until 0). The number of annealing epochs was set automatically based on the data size and learning rate to match a learning progress equivalent to 4,000 iterations at a learning rate of 0.002.
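The cluster-level balancing weights of equations (23) and (24) can be illustrated with a short sketch. The function names and input layout are hypothetical, and the clustering itself (Leiden in the text) is assumed to have been done already.

```python
import math

def cosine(a, b):
    num = sum(p * q for p, q in zip(a, b))
    return num / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def f(u_i, u_j, power=4, cutoff=0.5):
    """Equation (24): thresholded cosine similarity raised to a power."""
    c = cosine(u_i, u_j)
    return c ** power if c > cutoff else 0.0

def balancing_weights(centroids, layers, sizes):
    """Equation (23).

    centroids[i]: average cell embedding of cluster i;
    layers[i]: omics layer of cluster i; sizes[i]: number of cells in cluster i.
    """
    weights = []
    for u_i, k_i, n_i in zip(centroids, layers, sizes):
        # Sum similarities to clusters from other omics layers only.
        total = sum(f(u_i, u_j) for u_j, k_j in zip(centroids, layers) if k_j != k_i)
        weights.append(total / n_i)
    return weights

w = balancing_weights(
    centroids=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    layers=["rna", "rna", "atac", "atac"],
    sizes=[100, 50, 10, 20],
)
```

In this toy case each cluster has exactly one perfectly matching cluster in the other layer, so each weight reduces to the reciprocal of the cluster size, and every cluster carries the same total weight regardless of how many cells it contains.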

All benchmarks and case studies in the study were conducted with the two-stage training procedure as described above, regardless of whether the dataset being used is balanced or not.

Batch effect correction

To handle batch effects within omics layers, we incorporate batch as a covariate of the data decoders. Let \(b \in \left\{ {1,2, \ldots ,B} \right\}\) be the batch index, where B is the total number of batches; the decoder likelihood is then extended to \(p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}},b;\theta _k} \right)\). Specifically, this is achieved by converting learnable parameters in the data decoder to be batch-dependent. For example, in the case of a negative binomial decoder, the network now uses batch-specific α, β and θ parameters:

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}},b;\theta _k} \right) = \mathop {\prod }\limits_{i \in {{{\mathcal{V}}}}_k} {{{\mathrm{NB}}}}\left( {{\mathbf{x}_{k}}_{i};{{{\mathbf{\mu }}}}_i,{\mathbf{\theta }_{b}}_{i}} \right)} \end{array}$$ (25)

$$\begin{array}{*{20}{c}} {{\mathrm{NB}}\left( {{\mathbf{x}_{k}}_{i};{{{\mathbf{\mu }}}}_i,{\mathbf{\theta }_{b}}_{i}} \right) = \frac{{{\Gamma}\left( {{\mathbf{x}_{k}}_{i} + {\mathbf{\theta }_{b}}_{i}} \right)}}{{{\Gamma}\left( {{\mathbf{\theta }_{b}}_{i}} \right){\Gamma}\left( {{\mathbf{x}_{k}}_{i} + 1} \right)}}\left( {\frac{{{{{\mathbf{\mu }}}}_i}}{{{\mathbf{\theta }_{b}}_{i} + {{{\mathbf{\mu }}}}_i}}} \right)^{{\mathbf{x}_{k}}_{i}}\left( {\frac{{{\mathbf{\theta }_{b}}_{i}}}{{{\mathbf{\theta }_{b}}_{i} + {{{\mathbf{\mu }}}}_i}}} \right)^{{\mathbf{\theta }_{b}}_{i}}} \end{array}$$ (26)

$$\begin{array}{*{20}{c}} {{{{\mathbf{\mu }}}}_i = {{{\mathrm{Softmax}}}}_i\left( {{{{\mathbf{\alpha }}}}_b \odot {{{\mathbf{V}}}}_k^ \top {{{\mathbf{u}}}} + {{{\mathbf{\beta }}}}_b} \right) \cdot \mathop {\sum }\limits_{j \in {{{\mathcal{V}}}}_k} {\mathbf{x}_{k}}_{j}} \end{array}$$ (27)

where \({{{\mathbf{\alpha }}}} \in {\Bbb R}_ + ^{B \times \left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\beta }}}} \in {\Bbb R}^{B \times \left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\theta }}}} \in {\Bbb R}_ + ^{B \times \left| {{{{\mathcal{V}}}}_k} \right|}\), and \({{{\mathbf{\alpha }}}}_b\), \({{{\mathbf{\beta }}}}_b\), \({{{\mathbf{\theta }}}}_b\) are the bth rows of α, β, θ. Other probabilistic decoders can also be extended in similar ways.

Implementation details

We applied linear dimensionality reduction using canonical methods such as PCA (for scRNA-seq) or LSI (latent semantic indexing, for scATAC-seq) as the first transformation layers of the data encoders (note that the decoders were still fitted in the original feature spaces). This effectively reduced model size and enabled a modular input, so advanced dimensionality reduction or batch effect correction methods can also be used instead as preprocessing steps for GLUE integration.

During model training, 10% of the cells were used as the validation set. In the final stage of training, the learning rate would be reduced by factors of 10 if the validation loss did not improve for consecutive epochs. Training would be terminated if the validation loss still did not improve for consecutive epochs. The patience for learning rate reduction, training termination and the maximal number of training epochs were automatically set based on the data size and learning rate to match a learning progress equivalent to 1,000, 2,000 and 16,000 iterations at a learning rate of 0.002, respectively.

For all benchmarks and case studies with GLUE, we used the default hyperparameters unless explicitly stated. The set of default hyperparameters is presented in Extended Data Fig. 3.

Integration consistency score

The integration consistency score is a measure of consistency between the integrated multi-omics data and the guidance graph. First, we jointly cluster cells from all omics layers in the aligned cell embedding space using k-means. For each omics layer, the cells in each cluster are aggregated into a metacell. The metacells are established as paired samples, based on which feature correlation can be computed. Using the paired metacells, we then compute the Spearman’s correlation for each edge in the guidance graph. The integration consistency score is defined as the average correlation across all graph edges, negated per edge sign and weighted by edge weight.
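A minimal sketch of the final step, computing the score from paired metacells, is given below. It assumes the metacells have already been built; the function names are hypothetical, and normalizing by the sum of edge weights is one reading of "weighted by edge weight" rather than a detail stated in the text.

```python
import math

def rankdata(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)  # undefined for constant features

def spearman(a, b):
    return pearson(rankdata(a), rankdata(b))

def consistency_score(metacells, edges):
    """metacells: dict feature -> aggregated value per cluster (paired across layers).
    edges: list of (i, j, sign, weight) from the guidance graph."""
    num = sum(w * s * spearman(metacells[i], metacells[j]) for i, j, s, w in edges)
    den = sum(w for _, _, _, w in edges)  # assumed normalization
    return num / den

metacells = {"gene": [1, 2, 3, 4], "peak": [10, 20, 25, 40], "meth": [9, 7, 3, 1]}
edges = [("peak", "gene", 1, 1.0), ("meth", "gene", -1, 0.5)]
score = consistency_score(metacells, edges)
```

Multiplying each correlation by the edge sign means that a negative edge whose features are anticorrelated counts as consistent with the graph, in the same way as a positive edge whose features are positively correlated.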

Systematic benchmarks

UnionCom23, Pamona24 and GLUE were executed using the Python packages ‘unioncom’ (v.0.3.0), ‘Pamona’ (v.0.1.0) and ‘scglue’ (v.0.2.0), respectively. MMD-MA25 was executed using the Python script provided at https://bitbucket.org/noblelab/2020_mmdma_pytorch. Online iNMF16, LIGER17, Harmony18, bindSC33, and Seurat v3 (ref. 15) were executed using the R packages ‘rliger’ (v.1.0.0), ‘rliger’ (v.1.0.0), ‘harmony’ (v.0.1.0), ‘bindSC’ (v.1.0.0) and ‘Seurat’ (v.4.0.2), respectively. For each method, we used the default hyperparameter settings and data preprocessing steps as recommended. For the scRNA-seq data, 2,000 highly variable genes were selected using the Seurat ‘vst’ method. We used two separate schemes to construct the guidance graph. In the standard scheme, we connected ATAC peaks with RNA genes via positive edges if they overlapped in either the gene body or proximal promoter regions (defined as 2 kb upstream from the TSS). In an alternative scheme involving larger genomic windows, we connected ATAC peaks with RNA genes via positive edges if the peaks are within 150 kb of the proximal gene promoters; the edges were weighted by a power-law function \(w = \left( {d + 1} \right)^{ - 0.75}\) (d is the genomic distance in kb), which has been proposed to model the probability of chromatin contact42,43. For the methods that require feature conversion (online iNMF, LIGER, bindSC and Seurat v.3), we converted the scATAC-seq data to gene-level activity scores by summing up counts in the ATAC peaks connected to specific genes in the guidance graph. Notably, online iNMF and LIGER also recommend an alternative way of ATAC feature conversion, that is, directly counting ATAC fragments falling in gene body and promoter regions without resorting to ATAC peaks (https://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_scRNA_and_scATAC_data.html), which we abbreviate to FiG (fragments in genes). 
We also tested the FiG feature conversion method with online iNMF and LIGER whenever applicable.
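The distance-weighted guidance graph of the alternative scheme can be sketched as follows. This is a simplified illustration: peaks and promoters are reduced to single coordinates in kb on one chromosome, whereas real data would use genomic intervals, and the function names are hypothetical.

```python
def power_law_weight(d_kb):
    """Edge weight w = (d + 1)^(-0.75), with d the genomic distance in kb."""
    return (d_kb + 1.0) ** -0.75

def window_edges(peaks, promoters, window_kb=150.0):
    """Connect each peak to every promoter within the window via a positive edge.

    peaks, promoters: dict name -> position in kb (toy single-coordinate layout).
    Returns a list of (peak, gene, sign, weight) tuples.
    """
    edges = []
    for gene, g_pos in promoters.items():
        for peak, p_pos in peaks.items():
            d = abs(p_pos - g_pos)
            if d <= window_kb:
                edges.append((peak, gene, 1, power_law_weight(d)))
    return edges

edges = window_edges(
    peaks={"p1": 100.0, "p2": 115.0, "p3": 400.0},
    promoters={"g": 100.0},
)
```

A peak that overlaps the promoter (d = 0) receives weight 1, and the weight decays with distance, so all weights stay within the required range (0,1].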

Mean average precision (MAP) was used to evaluate the cell type resolution. Supposing that the cell type of the ith cell is \(y^{\left( i \right)}\) and that the cell types of its K ordered nearest neighbors are \(y_1^{\left( i \right)},y_2^{\left( i \right)}, \ldots, y_K^{\left( i \right)}\), the mean average precision is then defined as follows:

$$\begin{array}{*{20}{c}} {{\mathrm{MAP}} = \frac{1}{N}\mathop {\sum}\limits_{i = 1}^N {{{{\mathrm{AP}}}}^{\left( i \right)}} } \end{array}$$ (28)

$$\begin{array}{*{20}{c}} {{{{\mathrm{AP}}}}^{\left( i \right)} = \left\{ {\begin{array}{*{20}{l}} {\frac{{\mathop {\sum }\nolimits_{k = 1}^K 1_{y^{\left( i \right)} = y_k^{\left( i \right)}} \cdot \frac{{\mathop {\sum }\nolimits_{j = 1}^k 1_{y^{\left( i \right)} = y_j^{\left( i \right)}}}}{k}}}{{\mathop {\sum }\nolimits_{k = 1}^K 1_{y^{\left( i \right)} = y_k^{\left( i \right)}}}},} \hfill & {{\mathrm{if}}\,\mathop {\sum }\limits_{k = 1}^K 1_{y^{\left( i \right)} = y_k^{\left( i \right)}} > 0} \hfill \\ {0,} \hfill & {{\mathrm{otherwise}}} \hfill \end{array}} \right.} \end{array}$$ (29)

where \(1_{y^{\left( i \right)} = y_k^{\left( i \right)}}\) is an indicator function that equals 1 if \(y^{\left( i \right)} = y_k^{\left( i \right)}\) and 0 otherwise. For each cell, average precision (AP) computes the average cell type precision up to each cell type-matched neighbor, and mean average precision is the mean of the AP values across all cells. We set K to 1% of the total number of cells in each dataset. Mean average precision has a range of 0 to 1, and higher values indicate better cell type resolution.
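Equations (28) and (29) translate directly into a few lines of code. The sketch below assumes that the ordered neighbor labels have already been obtained from a nearest-neighbor search; the function names are hypothetical.

```python
def average_precision(label, neighbor_labels):
    """Equation (29). neighbor_labels: labels of the K nearest neighbors, nearest first."""
    hits = 0
    precision_sum = 0.0
    for k, neighbor in enumerate(neighbor_labels, start=1):
        if neighbor == label:
            hits += 1
            precision_sum += hits / k  # precision among the first k neighbors
    return precision_sum / hits if hits else 0.0

def mean_average_precision(labels, neighbors):
    """Equation (28): mean of the per-cell average precision."""
    return sum(average_precision(y, nb) for y, nb in zip(labels, neighbors)) / len(labels)

ap = average_precision("T", ["T", "B", "T"])
```

In this example the matching neighbors sit at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged; a mismatched neighbor at an early rank lowers the score more than one at a late rank.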

Cell type ASW (average silhouette width) was also used to evaluate the cell type resolution, which was defined as in a recent benchmark study73:

$$\begin{array}{*{20}{c}} {{\mathrm{cell}}\,{\mathrm{type}}\,{\mathrm{ASW}} = \frac{1}{2}\left( {\frac{{\mathop {\sum }\nolimits_{i = 1}^N s_{{{{\mathrm{cell}}}}\,{{{\mathrm{type}}}}}^{\left( i \right)}}}{N} + 1} \right)} \end{array}$$ (30)

where \(s_{{{{\mathrm{cell}}}}\,{{{\mathrm{type}}}}}^{\left( i \right)}\) is the cell type silhouette width for the ith cell, and N is the total number of cells. Cell type ASW has a range of 0 to 1, and higher values indicate better cell type resolution.

Neighbor consistency (NC) was used to evaluate the preservation of single-omics data variation after multi-omics integration and was defined following a previous study74:

$$\begin{array}{*{20}{c}} {{\mathrm{NC}} = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N \frac{{\left| {{{{\mathrm{NNS}}}}^{\left( {{{\mathrm{i}}}} \right)} \cap {{{\mathrm{NNI}}}}^{\left( {{{\mathrm{i}}}} \right)}} \right|}}{{\left| {{{{\mathrm{NNS}}}}^{\left( {{{\mathrm{i}}}} \right)} \cup {{{\mathrm{NNI}}}}^{\left( {{{\mathrm{i}}}} \right)}} \right|}}} \end{array}$$ (31)

where NNS(i) is the set of K-nearest neighbors of the ith cell in the single-omics data, NNI(i) is the set of K-nearest neighbors of the ith cell in the integrated space, and N is the total number of cells. We set K to 1% of the total number of cells in each dataset. Neighbor consistency has a range of 0 to 1, and higher values indicate better preservation of data variation.
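Equation (31) is the mean Jaccard index between the two neighbor sets of each cell. A minimal sketch, assuming the neighbor indices in both spaces have been precomputed and that K ≥ 1 so that no union is empty (the function name and input format are illustrative):

```python
def neighbor_consistency(single_omics_neighbors, integrated_neighbors):
    """Mean Jaccard index between each cell's single-omics and integrated neighbor sets."""
    scores = []
    for nns, nni in zip(single_omics_neighbors, integrated_neighbors):
        nns, nni = set(nns), set(nni)
        scores.append(len(nns & nni) / len(nns | nni))
    return sum(scores) / len(scores)
```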

Biology conservation

Mean average precision, cell type ASW and neighbor consistency all measure biology conservation of the data integration. Following the procedure from the recent benchmark study73, we first conduct min-max scaling for each of the metrics and then compute the average across the three to summarize them into a single metric representing biology conservation:

$$\begin{array}{*{20}{c}} {{\mathrm{biology}}\,{\mathrm{conservation}} = \frac{{{{{\mathrm{scale}}}}\left( {{{{\mathrm{MAP}}}}} \right) + {{{\mathrm{scale}}}}\left( {{{{\mathrm{cell}}}}\,{{{\mathrm{type}}}}\,{{{\mathrm{ASW}}}}} \right) + {{{\mathrm{scale}}}}\left( {{{{\mathrm{NC}}}}} \right)}}{3}} \end{array}$$ (32)

Seurat alignment score (SAS) was used to evaluate the extent of mixing among omics layers and was computed as described in the original paper75:

$$\begin{array}{*{20}{c}} {{\mathrm{SAS}} = 1 - \frac{{\bar x - \frac{K}{N}}}{{K - \frac{K}{N}}}} \end{array}$$ (33)

where \(\bar x\) is the average number of cells from the same omics layer among the K-nearest neighbors (different layers were first subsampled to the same number of cells as the smallest layer), and N is the number of omics layers. We set K to 1% of the subsampled cell number. Seurat alignment score has a range of 0 to 1, and higher values indicate better mixing.
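Equation (33) can be sketched as follows, assuming that the number of same-layer cells among the K nearest neighbors has already been counted for every subsampled cell (the function and argument names are illustrative):

```python
def seurat_alignment_score(same_layer_counts, k, n_layers):
    """SAS from per-cell counts of same-layer cells among the K nearest neighbors."""
    x_bar = sum(same_layer_counts) / len(same_layer_counts)
    expected = k / n_layers  # same-layer neighbors expected under perfect mixing
    return 1 - (x_bar - expected) / (k - expected)
```

With two layers and K = 10, an average of 5 same-layer neighbors gives a score of 1 (perfect mixing), whereas an average of 10 gives a score of 0 (no mixing).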

Omics layer ASW was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study73:

$$\begin{array}{*{20}{c}} {{\mathrm{omics}}\,{\mathrm{layer}}\,{\mathrm{ASW}} = \frac{1}{M}\mathop {\sum}\limits_{j = 1}^M {{{{\mathrm{omics}}}}\,{{{\mathrm{layer}}}}\,{{{\mathrm{ASW}}}}_{{{\mathrm{j}}}}} } \end{array}$$ (34)

$$\begin{array}{*{20}{c}} {{{{\mathrm{omics}}}}\,{{{\mathrm{layer}}}}\,{{{\mathrm{ASW}}}}_{{{\mathrm{j}}}} = \frac{1}{{N_j}}\mathop {\sum }\limits_{i = 1}^{N_j} \left( {1 - \left| {s_{{{{\mathrm{omics}}}}\,{{{\mathrm{layer}}}}}^{\left( i \right)}} \right|} \right)} \end{array}$$ (35)

where \(s_{{{{\mathrm{omics}}}}\,{{{\mathrm{layer}}}}}^{\left( i \right)}\) is the omics layer silhouette width for the ith cell, \(N_j\) is the number of cells in cell type j, and M is the total number of cell types. Omics layer ASW has a range of 0 to 1, and higher values indicate better mixing.

Graph connectivity (GC) was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study73:

$$\begin{array}{*{20}{c}} {{\mathrm{GC}} = \frac{1}{M}\mathop {\sum }\limits_{j = 1}^M \frac{{\left| {{{{\mathrm{LCC}}}}_j} \right|}}{{N_j}}} \end{array}$$ (36)

where \(\left| {{\mathrm{LCC}}_j} \right|\) is the number of cells in the largest connected component of the cell K-nearest neighbors graph (K = 15) for cell type j, \(N_j\) is the number of cells in cell type j and M is the total number of cell types. Graph connectivity has a range of 0 to 1, and higher values indicate better mixing.
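A minimal sketch of equation (36), assuming the K-nearest neighbors graph is supplied as a dictionary of neighbor lists, is treated as undirected, and is restricted to the cells of each cell type before the connected components are found (these conventions and all names are assumptions made for illustration):

```python
def largest_component_size(adjacency):
    """Size of the largest connected component; edges are treated as undirected."""
    undirected = {node: set() for node in adjacency}
    for node, nbrs in adjacency.items():
        for nbr in nbrs:
            if nbr in undirected:
                undirected[node].add(nbr)
                undirected[nbr].add(node)
    seen, best = set(), 0
    for start in undirected:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nbr in undirected[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best


def graph_connectivity(knn_graph, cell_types):
    """knn_graph: dict cell -> neighbor cells; cell_types: dict cell -> type label."""
    by_type = {}
    for cell, ctype in cell_types.items():
        by_type.setdefault(ctype, []).append(cell)
    ratios = []
    for cells in by_type.values():
        members = set(cells)
        # keep only edges between cells of the same type
        sub = {c: [n for n in knn_graph.get(c, ()) if n in members] for c in cells}
        ratios.append(largest_component_size(sub) / len(cells))
    return sum(ratios) / len(ratios)
```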

Omics mixing

Seurat alignment score, omics layer ASW and graph connectivity all measure omics mixing of the data integration. Following the procedure from the recent benchmark study73, we first conduct min-max scaling for each of the metrics, and then compute the average across the three to summarize them into a single metric representing omics mixing:

$$\begin{array}{*{20}{c}} {{\mathrm{omics}}\,{\mathrm{mixing}} = \frac{{{{{\mathrm{scale}}}}\left( {{{{\mathrm{SAS}}}}} \right) + {{{\mathrm{scale}}}}\left( {{{{\mathrm{omics}}}}\,{{{\mathrm{layer}}}}\,{{{\mathrm{ASW}}}}} \right) + {{{\mathrm{scale}}}}\left( {{{{\mathrm{GC}}}}} \right)}}{3}} \end{array}$$ (37)

Overall integration score

To compute an overall integration score, we use a 6:4 weight between biology conservation and omics mixing, following the recent benchmark study73:

$$\begin{array}{*{20}{c}} {{\mathrm{overall}}\,{\mathrm{integration}}\,{\mathrm{score}} = 0.6 \times {\mathrm{biology}}\,{\mathrm{conservation}} + 0.4 \times {\mathrm{omics}}\,{\mathrm{mixing}}} \end{array}$$ (38)
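Equations (32), (37) and (38) can be combined in a short sketch. It assumes that min-max scaling is applied to each metric across the compared methods, and that a metric with identical values for all methods is scaled to 0; the dictionary keys and function names are illustrative:

```python
def min_max_scale(values):
    """Rescale a list of values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant metric carries no information
    return [(v - lo) / (hi - lo) for v in values]


def overall_integration_scores(metrics):
    """metrics: dict mapping metric name -> list of raw values, one per method."""
    s = {name: min_max_scale(vals) for name, vals in metrics.items()}
    n_methods = len(s["MAP"])
    scores = []
    for i in range(n_methods):
        biology = (s["MAP"][i] + s["cell_type_ASW"][i] + s["NC"][i]) / 3
        mixing = (s["SAS"][i] + s["omics_layer_ASW"][i] + s["GC"][i]) / 3
        scores.append(0.6 * biology + 0.4 * mixing)
    return scores
```

Because of the scaling step, the resulting scores are relative to the set of methods being compared and change if methods are added or removed.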

FOSCTTM25 was used to evaluate the single-cell level alignment accuracy. It was computed on two datasets with known cell-to-cell pairings. Suppose that each dataset contains N cells, and that the cells are sorted in the same order, that is, the ith cell in the first dataset is paired with the ith cell in the second dataset. Denote x and y as the cell embeddings of the first and second dataset, respectively. The FOSCTTM is then defined as:

$$\begin{array}{*{20}{c}} {{\mathrm{FOSCTTM}} = \frac{1}{{2N}}\left( {\mathop {\sum }\limits_{i = 1}^N \frac{{n_1^{\left( i \right)}}}{N} + \mathop {\sum }\limits_{i = 1}^N \frac{{n_2^{\left( i \right)}}}{N}} \right)} \end{array}$$ (39)

$$\begin{array}{*{20}{c}} {n_1^{\left( i \right)} = \left| {\left\{ {j|d\left( {{{{\mathbf{x}}}}_j,{{{\mathbf{y}}}}_i} \right) < d\left( {{{{\mathbf{x}}}}_i,{{{\mathbf{y}}}}_i} \right)} \right\}} \right|} \end{array}$$ (40)

$$\begin{array}{*{20}{c}} {n_2^{\left( i \right)} = \left| {\left\{ {j|d\left( {{{{\mathbf{x}}}}_i,{{{\mathbf{y}}}}_j} \right) < d\left( {{{{\mathbf{x}}}}_i,{{{\mathbf{y}}}}_i} \right)} \right\}} \right|} \end{array}$$ (41)

where \(n_1^{\left( i \right)}\) and \(n_2^{\left( i \right)}\) are the numbers of cells in the first and second dataset, respectively, that are closer to the ith cell of the opposite dataset than its true match. d is the Euclidean distance. FOSCTTM has a range of 0 to 1, and lower values indicate higher accuracy.
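Equations (39) to (41) translate directly into a brute-force sketch, assuming the embeddings are given as equal-length lists of coordinate vectors in paired order (the quadratic loop is for illustration only and would be vectorized for real datasets):

```python
import math


def foscttm(x, y):
    """x, y: lists of embedding vectors; x[i] is paired with y[i]."""
    n = len(x)
    total = 0.0
    for i in range(n):
        d_true = math.dist(x[i], y[i])
        # n1: cells of the first dataset closer to y[i] than its true match x[i]
        n1 = sum(math.dist(x[j], y[i]) < d_true for j in range(n))
        # n2: cells of the second dataset closer to x[i] than its true match y[i]
        n2 = sum(math.dist(x[i], y[j]) < d_true for j in range(n))
        total += n1 / n + n2 / n
    return total / (2 * n)
```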

Feature consistency was used to evaluate the consistency of feature embeddings from different models. Since the raw embedding spaces are not directly comparable across models, we defined the consistency as the cross-model conservation of cosine similarities among features in the same model. Specifically, we first randomly subsample 2,000 features and compute the pairwise cosine similarity among them using feature embeddings from the two compared models. The feature consistency score is then defined as the Pearson’s correlation between the cosine similarities of the two models, averaged over four random subsamples. Feature consistency has a range of −1 to 1, and higher values indicate higher consistency.
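The procedure can be sketched as follows, assuming the two models' feature embeddings are given as lists of vectors in the same feature order; the helper names, the seed handling and the use of each unordered feature pair once are illustrative choices:

```python
import math
import random


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    return cov / math.sqrt(sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v))


def pairwise_cosines(emb, idx):
    """Cosine similarity for every unordered pair of the selected features."""
    return [cosine(emb[i], emb[j]) for a, i in enumerate(idx) for j in idx[a + 1:]]


def feature_consistency(emb_a, emb_b, n_features=2000, n_repeats=4, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        # the same random subset of features is used for both models
        idx = rng.sample(range(len(emb_a)), min(n_features, len(emb_a)))
        scores.append(pearson(pairwise_cosines(emb_a, idx), pairwise_cosines(emb_b, idx)))
    return sum(scores) / len(scores)
```

Two embeddings that differ only by a rotation or a global rescaling have identical cosine similarities and therefore a consistency of 1, which is why the score is comparable across models whose raw coordinates are not.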

For the baseline benchmark, each method was run eight times with different random seeds, except for Harmony and bindSC, which have deterministic implementations and were run only once. For the guidance corruption benchmark, we removed the specified proportions of existing peak–gene interactions and added equal numbers of nonexistent interactions, so the total number of interactions remained unchanged. Of note, feature conversion was also repeated using the corrupted guidance graphs. The corruption procedure was repeated eight times with different random seeds. For the subsampling benchmark, the scRNA-seq and scATAC-seq cells were subsampled in pairs (so FOSCTTM could still be computed). The subsampling process was also repeated eight times with different random seeds.
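The guidance corruption step can be illustrated with a small sketch. It assumes an unweighted edge list and draws replacement edges uniformly from all peak–gene pairs that are absent from the original graph; the sampling scheme, the function name and the explicit enumeration of candidate pairs (feasible only for toy inputs) are assumptions made for illustration:

```python
import random


def corrupt_guidance(edges, peaks, genes, proportion, seed=0):
    """Replace a proportion of existing peak-gene edges with nonexistent ones."""
    rng = random.Random(seed)
    edges = set(edges)
    n_corrupt = round(proportion * len(edges))
    removed = set(rng.sample(sorted(edges), n_corrupt))
    # candidate replacements: pairs that were not in the original graph
    candidates = [(p, g) for p in peaks for g in genes if (p, g) not in edges]
    added = set(rng.sample(candidates, n_corrupt))
    return (edges - removed) | added
```

The returned graph has the same number of edges as the input, of which a fraction `proportion` no longer reflects the original guidance.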

For the systematic scalability test (Supplementary Fig. 17a), all methods were run on a Linux workstation with 40 CPU cores (two Intel Xeon Silver 4210 chips), 250 GB of RAM and NVIDIA GeForce RTX 2080 Ti graphics processing units. Only a single graphics processing unit card was used when training GLUE.

Triple-omics integration

The scRNA-seq and scATAC-seq data were handled as previously described (section Systematic benchmarks). Due to low coverage per single-C site, the snmC-seq data were converted to average methylation levels in gene bodies. The mCH and mCG levels were quantified separately, resulting in two features per gene. The gene methylation levels were normalized by the global methylation level per cell. An initial dimensionality reduction was performed using PCA (section Implementation details). For the triple-omics guidance graph, the mCH and mCG levels were connected to the corresponding genes with negative edges.

The normalized methylation levels were positive, with dropouts corresponding to the genes that were not covered in single cells. As such, we used the zero-inflated log-normal (ZILN) distribution for the data decoder:

$$\begin{array}{*{20}{c}} {p\left( {{{{\mathbf{x}}}}_k|{{{\mathbf{u}}}},{{{\mathbf{V}}}};\theta _k} \right) = \mathop {\prod }\limits_{i \in {{{\mathcal{V}}}}_k} {{{\mathrm{ZILN}}}}\left( {{\mathbf{x}_{k}}_{i};{{{\mathbf{\mu }}}}_i,{{{\mathbf{\sigma }}}}_i,{{{\mathbf{\delta }}}}_i} \right)} \end{array}$$ (42)

$$\begin{array}{*{20}{c}} {{\mathrm{ZILN}}\left( {{\mathbf{x}_{k}}_{i};{{{\mathbf{\mu }}}}_i,{{{\mathbf{\sigma }}}}_i,{{{\mathbf{\delta }}}}_i} \right) = \left\{ {\begin{array}{*{20}{l}} {\frac{{1 - {{{\mathbf{\delta }}}}_i}}{{{\mathbf{x}_{k}}_{i}{{{\mathbf{\sigma }}}}_i\sqrt {2\pi } }}\exp \left( { - \frac{{\left( {\log {\mathbf{x}_{k}}_{i} - {{{\mathbf{\mu }}}}_i} \right)^2}}{{2{{{\mathbf{\sigma }}}}_i^2}}} \right),} \hfill & {{\mathbf{x}_{k}}_{i} > 0} \hfill \\ {{{{\mathbf{\delta }}}}_i,} \hfill & {{\mathbf{x}_{k}}_{i} = 0} \hfill \end{array}} \right.} \end{array}$$ (43)

$$\begin{array}{*{20}{c}} {{{{\mathbf{\mu }}}} = \mathbf{\alpha} \odot {{{\mathbf{V}}}}_k^ \top \mathbf{u} + \mathbf{\beta} } \end{array}$$ (44)

where \({{{\mathbf{\mu }}}} \in {\Bbb R}^{\left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\sigma }}}} \in {\Bbb R}_ + ^{\left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\delta }}}} \in \left( {0,1} \right)^{\left| {{{{\mathcal{V}}}}_k} \right|}\) are the log-scale mean, log-scale standard deviation and zero-inflation parameters of the zero-inflated log-normal distribution, respectively, and \({{{\mathbf{\alpha }}}} \in {\Bbb R}_ + ^{\left| {{{{\mathcal{V}}}}_k} \right|},{{{\mathbf{\beta }}}} \in {\Bbb R}^{\left| {{{{\mathcal{V}}}}_k} \right|}\) are scaling and bias factors.
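For reference, the per-feature log density in equation (43) can be written as a short function. This is a scalar sketch for a single feature and not the decoder implementation used in GLUE; the function name is illustrative:

```python
import math


def ziln_log_prob(x, mu, sigma, delta):
    """Log density of one observation under ZILN(mu, sigma, delta)."""
    if x == 0:
        return math.log(delta)  # point mass at zero (dropout)
    log_x = math.log(x)
    return (
        math.log1p(-delta)                                  # log(1 - delta)
        - log_x - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (log_x - mu) ** 2 / (2 * sigma ** 2)
    )
```

Summing this quantity over all features in \({{{\mathcal{V}}}}_k\) gives the log of the product in equation (42).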

To unify the cell type labels, we performed a nearest neighbor-based label transfer with the snmC-seq dataset as a reference. The five nearest neighbors in snmC-seq were identified for each scRNA-seq and scATAC-seq cell in the aligned embedding space, and majority voting was used to determine the transferred label. To verify whether the alignment was correct, we tested for significant overlap in cell type marker genes. The features of all omics layers were first converted to genes. Then, for each omics layer, the cell type markers were identified using the one-versus-rest Wilcoxon rank-sum test with the following criteria: FDR < 0.05 and log fold change > 0 for scRNA-seq/scATAC-seq; FDR < 0.05 and log fold change < 0 for snmC-seq. The significance of marker overlap was determined by the three-way Fisher’s exact test40.
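The label transfer step can be sketched as a brute-force nearest-neighbor vote, assuming Euclidean distance in the aligned embedding space and breaking ties in favor of the label of the closest tied neighbor (both are assumptions; the text above does not specify them):

```python
import math
from collections import Counter


def transfer_labels(query_emb, ref_emb, ref_labels, k=5):
    """Assign each query cell the majority label among its k nearest reference cells."""
    transferred = []
    for q in query_emb:
        # indices of the k reference cells nearest to the query cell
        nearest = sorted(range(len(ref_emb)), key=lambda j: math.dist(q, ref_emb[j]))[:k]
        votes = Counter(ref_labels[j] for j in nearest)
        transferred.append(votes.most_common(1)[0][0])
    return transferred
```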

To perform correlation and regression analysis after the integration, we clustered all cells from the three omics layers using fine-scale k-means (k = 200). Then, for each omics layer, the cells in each cluster were aggregated into a metacell by summing their expression/accessibility counts or averaging their DNA methylation levels. The metacells were established as paired samples, based on which feature correlation and regression analyses could be conducted.

To integrate the same datasets using online iNMF, we inverted the snmC-seq data by subtracting each entry of the data matrix from the largest entry, following the procedure described in the original paper16.

GLUE-based cis-regulatory inference

To ensure consistency of cell types, we first selected the overlapping cell types between the 10X Multiome and pcHi-C data. The remaining cell types included T cells, B cells and monocytes. The eQTL data were used as is, because they were not cell type-specific. For scRNA-seq, we selected 6,000 highly variable genes. To capture remote cis-regulatory interactions, the base guidance graph was constructed for peak–gene pairs within a distance of 150 kb, using the alternative scheme as described in the section Systematic benchmarks.

To incorporate the regulatory evidence of pcHi-C and eQTL, we anchored all evidence to that between the ATAC peaks and RNA genes. A peak–gene pair was considered supported by pcHi-C if (1) the gene promoter was within 1 kb of a bait fragment, (2) the peak was within 1 kb of an other-end fragment and (3) significant contact was identified between the bait and the other-end fragment in pcHi-C. The pcHi-C-supported peak–gene interactions were weighted by multiplying the promoter-to-bait and the peak-to-other-end power-law weights (above). If a peak–gene pair was supported by multiple pcHi-C contacts, the weights were summed and clipped to a maximum of 1. A peak–gene pair was considered supported by eQTL if (1) the peak overlapped an eQTL locus and (2) the locus was associated with the expression of the gene. The eQTL-supported peak–gene interactions were assigned weights of 1. The composite guidance graph was constructed by adding the pcHi-C- and eQTL-supported interactions to the previous distance-based interactions, allowing for multi-edges.

For regulatory inference, only peak–gene pairs within 150 kb in distance were considered. The GLUE training process was repeated four times with different random seeds. For each repeat, the peak–gene regulatory score was computed as the cosine similarity between the feature embeddings. The final regulatory inference was obtained by averaging the regulatory scores across the four repeats. To evaluate the significance of the regulatory scores, we compared the scores to a null distribution obtained via randomly shuffled feature embeddings and computed empirical P values as the probability of getting more extreme scores in the null distribution. Finally, we computed the FDR of regulatory inference based on the P values using the Benjamini–Hochberg procedure. For cis-regulatory inference using LASSO, we used hyperparameter α = 0.01, which was optimized for area under the receiver operating characteristic curves of pcHi-C and eQTL prediction.
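The significance calculation can be sketched as follows, assuming the null scores from shuffled feature embeddings have already been computed and that "more extreme" refers to the upper tail (a one-sided test is an assumption here); the Benjamini–Hochberg step is the standard step-up adjustment:

```python
def empirical_p_values(scores, null_scores):
    """Fraction of null scores strictly larger than each observed score."""
    n = len(null_scores)
    return [sum(s0 > s for s0 in null_scores) / n for s in scores]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```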

TF-target gene regulatory inference

We used the SCENIC workflow76 to construct a TF-gene regulatory network from the inferred peak–gene regulatory interactions. Briefly, the SCENIC workflow first constructs a gene coexpression network based on the scRNA-seq data, and then uses external cis-regulatory evidence to filter out false positives. SCENIC accepts cis-regulatory evidence in the form of gene rankings per TF, that is, genes with higher TF enrichment levels in their regulatory regions are ranked higher. To construct the rankings based on our inferred peak–gene interactions, we first overlapped the ENCODE TF chromatin immunoprecipitation (ChIP) peaks77 with the ATAC peaks and counted the number of ChIP peaks for each TF in each ATAC peak. Since different genes can have different numbers of connected ATAC peaks, and the ATAC peaks vary in length (longer peaks can contain more ChIP peaks by chance), we devised a sampling-based approach to evaluate TF enrichment. Specifically, for each gene, we randomly sampled 1,000 sets of ATAC peaks that matched the connected ATAC peaks in both number and length distribution. We counted the numbers of TF ChIP peaks in these random ATAC peaks as null distributions. For each TF in each gene, an empirical P value could then be computed by comparing the observed number of ChIP peaks to the null distribution. Finally, we ranked the genes by the empirical P values for each TF, producing the cis-regulatory rankings used by SCENIC. Since peak–gene-based inference is mainly focused on remote regulatory regions, proximal promoters could be missed. As such, we provided SCENIC with both the above peak-based and proximal promoter-based cis-regulatory rankings.

Integration for the human multi-omics atlas

The scRNA-seq and scATAC-seq atlases have highly unbalanced cell type compositions, which are primarily caused by differences in organ sampling sizes (Supplementary Fig. 17b). Although cell types are unknown during real-world analyses, organ sources are typically available and can be used to help balance the integration process. To perform organ-balanced data preprocessing, we first subsampled each omics layer to match the organ compositions. For the scRNA-seq data, 4,000 highly variable genes were selected using the organ-balanced subsample. Then, for the initial dimensionality reduction, we fitted PCA (scRNA-seq) and LSI (scATAC-seq) on the organ-balanced subsample and applied the projection to the full data. The PCA/LSI coordinates were used as the first transformation layer in the GLUE data encoders (section Implementation details), as well as for metacell aggregation (below). The guidance graph was constructed as described previously (section Systematic benchmarks).

The two atlases consist of large numbers of cells but with low coverage per cell. To alleviate dropout and increase the training speed simultaneously, we used a metacell aggregation strategy during pretraining. Specifically, in the pretraining stage, we clustered the cells in each omics layer using fine-scale k-means (k = 100,000 for scRNA-seq and k = 40,000 for scATAC-seq). To balance the organ compositions at the same time, k-means centroids were fitted on the previous organ-balanced subsample and then applied to the full data. The cells in each k-means cluster were aggregated into a metacell by summing their expression/accessibility counts and averaging their PCA/LSI coordinates. GLUE was then pretrained on the aggregated metacells with additive noise, which roughly oriented the cell embeddings but did not actually align them (section Weighted adversarial alignment). To better use the large data size, the hidden layer dimensionality was doubled to 512 from the default 256. In the second stage, GLUE was fine-tuned on the full single-cell data with the balancing weight estimated as described in the section Weighted adversarial alignment. No metacell aggregation was used when comparing the scalability of different methods (Supplementary Fig. 17a).

For a comparison with other integration methods, we also tried online iNMF and Seurat v.3. Online iNMF was the only other method that could scale to millions of cells, so we applied it to the full dataset. Seurat v.3, on the other hand, showed the second-best accuracy in our previous benchmark. Because Seurat v.3 could not scale to the full dataset (Supplementary Fig. 17a), we applied it to the aggregated data used in the first stage of GLUE training. Label transfer was performed using the same procedure as in the triple-omics case, except that we used majority voting among the 50 nearest neighbors.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.