Bayesian log-normal deconvolution for enhanced in silico microdissection of bulk gene expression data

Deconvolution of bulk gene expression profiles into their cellular components is pivotal to portraying a tissue's complex cellular make-up, such as the tumor microenvironment. However, the inherently variable nature of gene expression requires a comprehensive statistical model and reliable prior knowledge of individual cell types, which can be obtained from single-cell RNA sequencing. We introduce BLADE (Bayesian Log-normAl Deconvolution), a unified Bayesian framework that estimates both the cellular composition and the gene expression profile of each cell type. Unlike previous comprehensive statistical approaches, BLADE can handle > 20 cell types owing to its efficient variational inference. In an extensive evaluation with > 700 simulated and real datasets, BLADE demonstrated enhanced robustness against gene expression variability and better completeness than conventional methods, in particular in reconstructing the gene expression profile of each cell type. In summary, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems from standard bulk gene expression data.


Supplementary Figures

Supplementary Figure. Fraction of cell types in the four different levels of PBMC simulation data. The fraction (y-axis) of cell types (x-axis) in 20 simulated PBMC datasets in levels 1-4 (one panel per level). Standard boxplot notation is used (lower/upper hinges: first/third quartiles; whiskers extend from the hinges to the largest/lowest values no further than 1.5 * the interquartile range).

Supplementary Figure S15. Performance of BLADE in alternative metrics compared to baseline methods in fraction estimation of simulated PBMC data. Spearman's rank correlation coefficient (y-axis; top) and RMSE (y-axis; bottom) compared between BLADE (orange) and baseline methods (CIBERSORTx (blue), MuSiC (light yellow), and NNLS (dark red)) for level 1-4 PBMC data (x-axis). Note that RMSE should not be compared between datasets with different numbers of cell types, as it depends strongly on the abundance of cell types (by RMSE, performance appears to improve with more cell types, which is misleading). Standard boxplot notation is used, as above.

Supplementary Figure S19. Fraction of genes purified by CIBERSORTx in PBMC simulation data. Radar plots represent the fraction of genes with estimated cell-type-specific gene expression profiles for each cell type in group mode (blue) and high-resolution mode (red) in level 1-4 PBMC simulation data.

Supplementary Figure S20. Performance of BLADE in high-resolution mode purification for simulated PBMC bulk RNA-seq data. Performance (Pearson correlation coefficient; y-axis) of BLADE (orange) and CIBERSORTx (blue) in estimating gene expression profiles per cell type (x-axis) and per sample in levels 1-3 (n = 20 samples per cell type).
If we assume that each cell-type-specific expression level follows a log-normal distribution, then $y_{ij}$, the observed bulk expression level of gene $j$ in sample $i$, is a convolution of $T$ log-normal random variables.

Approximation of the Log-Normal (LN)/Negative Binomial (NB) convolution using the Probability Generating Function (PGF)
Note that the evaluation of $g_{y_{ij}}(y)$ needs to be extremely efficient, because numerical maximum likelihood estimation evaluates $g_{y_{ij}}(y)$ many times, depending on the number of parameters ($I \times J \times T$). While numerical evaluation of (2) may still be efficient for $T = 2$ [1], the extension to $T > 2$ requires a $(T-1)$-dimensional integral, which is not straightforward. To this end, the log-normal density $g_t = g_{x_{ij}^t}$ is approximated by a probability generating function (PGF). Specifically, we used an equi-spaced grid that ranges from 0 to $y = y_{ij}$ (i.e., the observed convoluted gene expression level of sample $i$ and gene $j$), in which the bins represent integer fractions of $y$, centered around those fractions. For instance, for $y = 100$, there are 101 bins representing $0, \ldots, 100$, with bin $i$: $[i - 0.5, i + 0.5]$. Each bin $i$: $[a_i, a_{i+1}]$ is assigned a total density of $p_i = G_t(a_{i+1}) - G_t(a_i)$, where $G_t$ is the cumulative distribution of $x_{ij}^t$. Then, the PGF of the associated approximation density $\hat{g}_t$ is: $$G_{\hat{g}_t}(x) = \sum_{i=0}^{y} p_i x^i. \quad (3)$$ By assuming that $y$ is an integer and the bins represent $0, \ldots, y$, we have $$G_{\hat{g}_{y_{ij}}}(x) = \prod_{t=1}^{T} G_{\hat{g}_t}(x). \quad (4)$$ Hence, $\hat{g}_{y_{ij}}(y)$ is simply the coefficient of $x^y$ in the product (4). Thanks to the equi-spaced grid, $G_{\hat{g}_t}(x)$ is a polynomial, for which repeated multiplication is very efficient. In fact, since only the $x^y$ term is needed at the end (because it represents observation $y$), higher-order terms can be discarded during the multiplications in (4). Therefore, when using (4), computation grows only linearly with $T$, as opposed to the (naive) evaluation of (2). The R package pracma was used for efficient polynomial multiplication.
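As an illustration, the binned-PGF convolution can be sketched in Python (the original implementation uses R with the pracma package); the function name, the grid choice, and the example parameters below are hypothetical, and the sketch treats a sum of unscaled log-normal components:

```python
import numpy as np
from scipy import stats

def pgf_convolution_density(y, mus, sigmas):
    """Approximate the density of sum_t X_t at integer y, with
    X_t ~ LogNormal(mu_t, sigma_t), by discretising each component on
    the integer grid 0..y and multiplying the resulting polynomials
    (PGFs), truncating terms above degree y after each product."""
    y = int(y)
    edges = np.clip(np.arange(y + 2) - 0.5, 0.0, None)  # bin i: [i - 0.5, i + 0.5]
    coef = np.zeros(y + 1)
    coef[0] = 1.0                                       # the empty product "1"
    for mu, sigma in zip(mus, sigmas):
        cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
        p = np.diff(cdf)                                # p_i = G_t(a_{i+1}) - G_t(a_i)
        coef = np.convolve(coef, p)[: y + 1]            # keep only terms up to x^y
    return coef[y]                                      # coefficient of x^y
```

Because the grid is equi-spaced with unit bins, the coefficient of $x^y$ approximates the probability that the convolution falls in $[y - 0.5, y + 0.5]$, i.e., the density near $y$; truncating after each `np.convolve` call is what keeps the cost linear in $T$.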
For discrete distributions such as the negative binomial, (4) is exact when the number of bins equals the total count plus one. The PGF approximation requires evaluating probabilities for the bins $[a_i, a_{i+1}]$. The cumulative distribution $G_t$ is calculated via the cumulative distribution function $G_{NB}$ of the negative binomial with parameters $(\mu_j^t, \sigma_j^t)$ (5). With given fractions $f_i^t$ and convolution $g_{y_{ij}}$, the log-likelihood for gene $j$ equals $$L(Y_{\cdot j}; \theta_j) = \sum_{i=1}^{I} \log g_{y_{ij}}(y_{ij}),$$ where $\theta_j$ contains all $2T$ parameters of the convolution components and is evaluated by (5). The MLE estimate of $\theta_j$ is obtained by maximizing $L(Y_{\cdot j}; \theta_j)$ using the numerical optimizer Rsolnp. The numerical optimizer requires several thousand evaluations of (2) per gene, which limits the scalability of the method.

Evaluation of the PGF approximation of the LN convolution model

Unlike the NB distribution, for which the PGF approximation is exact, the LN distribution is continuous and thus the PGF approximation is not exact. Therefore, we evaluated the accuracy of the PGF approximation for the LN distribution, taking an alternative approximation method, the Fenton-Wilkinson (FW) approximation [2], for comparison. FW approximates $g_y = g_{y_{ij}}(y)$ by a log-normal density, thanks to which FW is free of numerical integration and thus computationally very efficient. As a reference, we also implemented a Monte-Carlo (MC) evaluation of (2), which is computationally intense ($M = 10^6$ samples) but accurate. The likelihoods of the PGF and FW approximations were then evaluated on simulated data with three cell types ($T = 3$) using the following steps:
• Sample the means and standard deviations of the 3 cell types from a uniform distribution.
• Sample simulated bulk gene expression data accordingly.
• Evaluate the likelihood for 3 realizations of $Y = y$.
The likelihoods approximated by PGF and FW were compared to the reference likelihood calculated by MC, and we found a superior concordance of PGF with MC (see Supplementary Figure S3).
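The FW approximation amounts to matching the first two moments of the sum with a single log-normal; the following is a minimal sketch under the assumption of independent components (the parameter values are illustrative, not the ones used in the evaluation):

```python
import numpy as np

def fenton_wilkinson(mus, sigmas):
    """Return (mu, sigma) of the single log-normal whose first two moments
    match those of a sum of independent LogNormal(mu_t, sigma_t) variables."""
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    m = np.exp(mus + sigmas**2 / 2)            # component means
    v = (np.exp(sigmas**2) - 1) * m**2         # component variances
    M, V = m.sum(), v.sum()                    # moments of the sum
    sigma2 = np.log(1 + V / M**2)
    return np.log(M) - sigma2 / 2, np.sqrt(sigma2)

# sanity check against Monte-Carlo sampling of the sum
rng = np.random.default_rng(0)
mus, sigmas = [1.0, 1.5, 2.0], [0.3, 0.3, 0.3]
samples = sum(rng.lognormal(m, s, 100_000) for m, s in zip(mus, sigmas))
mu_fw, sigma_fw = fenton_wilkinson(mus, sigmas)
fw_mean = np.exp(mu_fw + sigma_fw**2 / 2)      # mean implied by the FW log-normal
```

By construction FW reproduces the mean and variance of the sum exactly; its error lies in the shape of the density, which is what the comparison against MC in Supplementary Figure S3 probes.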

Supplementary Note 2: Detailed derivation of BLADE
In this section, we provide a detailed derivation of Bayesian log-normal deconvolution with collapsed variational inference.

Notation
We assume expression levels of $J$ genes obtained from $I$ samples. The bulk gene expression data is contributed by $T$ cell types. We define the following variables:
• $y_{ij}$: log-transformed bulk expression level of gene $j$ for sample $i$ (observed);
• $x_{ij}^t$: log-transformed expression level of gene $j$ in cell type $t$ for sample $i$ (hidden);
• $f_i^t$: fraction of cell type $t$ in sample $i$ (hidden).
We then assume the following deconvolution problem: $$\exp(y_{ij}) = \sum_{t=1}^{T} f_i^t \exp(x_{ij}^t).$$ Note that we assume $y_{ij}$ and $x_{ij}^t$ to be log-transformed data for the convenience of using the Gaussian distribution instead of the log-normal distribution. On the linear scale, $\exp(y_{ij})$ and $\exp(x_{ij}^t)$ follow log-normal distributions, and the actual deconvolution is performed on the linear scale.
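The generative relation can be made concrete with a small simulation; all sizes, prior values, and the Dirichlet choice for the fractions below are toy assumptions for illustration, not BLADE defaults:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, T = 5, 100, 3                        # samples, genes, cell types (toy sizes)

# hypothetical log-scale means and precisions per cell type and gene
mu = rng.normal(2.0, 1.0, size=(T, J))
lam = rng.gamma(5.0, 1.0, size=(T, J))     # precision = 1 / variance

# hidden variables: log-scale profiles x_ij^t and fractions f_i^t
x = rng.normal(mu, 1.0 / np.sqrt(lam), size=(I, T, J))
f = rng.dirichlet(np.ones(T), size=I)      # each row sums to 1

# deconvolution on the linear scale: exp(y_ij) = sum_t f_i^t * exp(x_ij^t)
y = np.log(np.einsum("it,itj->ij", f, np.exp(x)))
```

Note that the mixing happens on the linear scale (`np.exp(x)`), after which the bulk profile is transformed back to the log scale, matching the convention above.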

Probabilistic assumptions
For the three random variables, we assume the following underlying probability distributions; in particular, each log-transformed expression level follows a Gaussian distribution, $x_{ij}^t \sim N(\mu_j^t, \frac{1}{\lambda_j^t})$.
Note that $\lambda$ in $N(x \mid \mu, \frac{1}{\lambda})$ is a precision, the inverse of which is the variance. To incorporate prior knowledge of the gene expression profiles per cell type, we take a Bayesian approach for $x_{ij}^t$, endowing the parameters $\mu_j^t$ and $\lambda_j^t$ with a conjugate (Normal-Gamma) prior distribution: $$P(\mu_j^t, \lambda_j^t \mid \mu_{0j}^t, \kappa_{0j}^t, \alpha_{0j}^t, \beta_{0j}^t) = N\!\left(\mu_j^t \,\middle|\, \mu_{0j}^t, \frac{1}{\kappa_{0j}^t \lambda_j^t}\right) G(\lambda_j^t \mid \alpha_{0j}^t, \beta_{0j}^t).$$ The hyperparameters are chosen based on the observed expression levels and standard deviation of each gene and each cell type, derived from single-cell RNA-seq data (see Online Methods).

Optimization -collapsed variational inference
We take a collapsed variational inference approach for the optimization of the model [3]. Denote the hidden variables in our model by $Z$ and the observed ones by $Y$. The standard approach is to maximize the evidence lower bound (ELBO): $$\mathcal{L} = E_{Q(Z)}[\log P(Y, Z)] - E_{Q(Z)}[\log Q(Z)], \quad (13)$$ where $Q(Z)$ is the variational distribution that approximates the intractable posterior probability of the hidden variables given the observed variables (i.e., $P(Z \mid Y)$). Maximization of (13) w.r.t. the variational parameters of $Q(Z)$ is equivalent to minimization of the Kullback-Leibler divergence between $P(Z \mid Y)$ and $Q(Z)$ [4], which should render $Q(Z)$ an accurate approximation of $P(Z \mid Y)$. Before discussing the specific variational distributions for $X$ (all latent gene expression levels) and $F$ (all latent cell type fractions), a subset of the hidden variables is integrated out in advance to reduce the complexity.

Collapsing conjugate priors
We first integrate out the variables with conjugate prior distributions (i.e., collapse the variables). This allows us to account for the entire distribution of these variables in a fully Bayesian manner, instead of finding a maximum a posteriori solution. In our model, $\mu_j^t$ and $\lambda_j^t$ are endowed with a conjugate prior and can be integrated out as follows: $$P(x_{ij}^t \mid \mu_{0j}^t, \kappa_{0j}^t, \alpha_{0j}^t, \beta_{0j}^t) = \int P(x_{ij}^t \mid \mu_j^t, \lambda_j^t) \, P(\mu_j^t, \lambda_j^t \mid \mu_{0j}^t, \kappa_{0j}^t, \alpha_{0j}^t, \beta_{0j}^t) \, d\mu_j^t \, d\lambda_j^t.$$ By replacing $P(x_{ij}^t \mid \mu_j^t, \lambda_j^t) P(\mu_j^t, \lambda_j^t \mid \mu_{0j}^t, \kappa_{0j}^t, \alpha_{0j}^t, \beta_{0j}^t)$ with the marginal distribution $P(x_{ij}^t \mid \mu_{0j}^t, \kappa_{0j}^t, \alpha_{0j}^t, \beta_{0j}^t)$, we eliminate the parameters $\mu_j^t$ and $\lambda_j^t$ from the model.
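Integrating a Normal-Gamma prior out of a Gaussian yields a Student's t marginal (a standard conjugacy result); the sketch below checks this by ancestral sampling, with arbitrary hyperparameter values:

```python
import numpy as np
from scipy import stats

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 3.0, 2.0
rng = np.random.default_rng(2)

# ancestral sampling from the Normal-Gamma hierarchy
lam = rng.gamma(alpha0, 1.0 / beta0, 400_000)      # lambda ~ G(alpha0, beta0)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * lam))  # mu | lambda
x = rng.normal(mu, 1.0 / np.sqrt(lam))             # x | mu, lambda

# closed-form marginal: Student's t with 2*alpha0 degrees of freedom
df = 2 * alpha0
scale = np.sqrt(beta0 * (kappa0 + 1) / (alpha0 * kappa0))
```

The collapsed marginal therefore has heavier tails than any single Gaussian with fixed $(\mu_j^t, \lambda_j^t)$, which is exactly the effect of accounting for parameter uncertainty.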

Variational distribution
After collapsing $\mu_j^t$ and $\lambda_j^t$, we assume the following variational distributions for the remaining hidden variables: a Gaussian $Q(x_{ij}^t) = N(x_{ij}^t \mid \nu_{ij}^t, (\omega_j^t)^2)$ for the latent expression levels and a Dirichlet $Q(f_i) = \mathrm{Dir}(f_i \mid \beta_i)$ for the cell type fractions. Here, $\nu_{ij}^t$, $\omega_j^t$ and $\beta_i^t$ ($\forall t = 1, \ldots, T$; $\forall i = 1, \ldots, I$; $\forall j = 1, \ldots, J$) are the variational parameters to be optimized in our model.

Optimization of variational parameters: Evidence lower-bound (ELBO)
Given the variational distribution $Q(\cdot)$, the ELBO (13) is optimized with respect to the variational parameters. To achieve this, we take a line-search approach using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. For efficient optimization, both the objective function and the gradients with respect to the parameters to be optimized are calculated analytically.
The ELBO (13) can be calculated analytically as a sum of five components (17); see the appendix for the calculation of these components and their derivatives with respect to the variational parameters.
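The same optimization pattern (analytic objective plus analytic gradient fed to L-BFGS) can be illustrated on a toy conjugate-Gaussian model rather than the full BLADE objective; the model and all values below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

y = 1.0  # one observation; prior theta ~ N(0, 1), likelihood y | theta ~ N(theta, 1)

def neg_elbo_and_grad(params):
    """Negative ELBO for Q = N(m, s^2), parameterised as s = exp(t),
    together with its analytic gradient."""
    m, t = params
    s2 = np.exp(2 * t)
    elbo = (-np.log(2 * np.pi) - 0.5 * (m**2 + s2) - 0.5 * ((y - m)**2 + s2)
            + 0.5 * np.log(2 * np.pi * np.e) + t)
    grad = np.array([2 * m - y, 2 * s2 - 1])  # d(-ELBO)/dm, d(-ELBO)/dt
    return -elbo, grad

res = minimize(neg_elbo_and_grad, x0=np.zeros(2), jac=True, method="L-BFGS-B")
m_opt, s2_opt = res.x[0], np.exp(2 * res.x[1])  # should recover N(y/2, 1/2)
```

Because the toy model is conjugate, the optimum of the ELBO is the exact posterior $N(y/2, 1/2)$, which makes the gradient implementation easy to verify.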

Empirical Bayes approach for selection of hyperparameters
BLADE has the following hyperparameters, among which a subset is determined automatically.
• $\mu_{0j}^t$: expected expression level of gene $j$ in cell type $t$ in log scale (i.e., $E[x_j^t]$). This is estimated using scRNA-seq data.
• $\kappa_{0j}^t$: a scaling factor for the precision of $\mu_j^t$.
• $\alpha_{0j}^t$: a shape parameter of the Gamma distribution.
• $\beta_{0j}^t = \alpha_{0j}^t \, V[x_j^t]$, where $V[x_j^t]$ is the sample variance of gene $j$ in cell type $t$ measured in scRNA-seq data. This is based on the fact that the precision of gene $j$ (i.e., the inverse of the variance; $\lambda_j^t = \frac{1}{V[x_j^t]}$) follows $G(\alpha_{0j}^t, \beta_{0j}^t)$ with expectation $\frac{\alpha_{0j}^t}{\beta_{0j}^t}$. This allows us to incorporate the gene expression variability observed in scRNA-seq data.
• A hyperparameter defined in terms of $s$ and $V[y_j]$, where $s$ is a user-defined scaling factor and $V[y_j]$ is the observed variance (i.e., the inverse of the precision) of gene $j$.
Among the hyperparameters, $\alpha_i^t$, $\kappa_{0j}^t$, $\alpha_{0j}^t$, and $s$ are chosen by the user. To systematically identify the optimal parameters, we employed an Empirical Bayes approach to select the best set of parameters. For each configuration of parameters, we obtained maximum likelihood estimates using only a subset of samples, not only to gain computational efficiency but also to avoid overfitting. The hyperparameter configuration with the highest likelihood, maximized w.r.t. the model parameters, is then selected, followed by deconvolution using the entire data set. Throughout the manuscript, we considered a total of 90 different parameter configurations covering all possible combinations of: $\alpha_i^t \in \{1, 10\}$, $\alpha_{0j}^t \in \{0.1, 0.5, 1, 5, 10\}$, $\kappa_{0j}^t \in \{1, 0.5, 0.1\}$, and $s \in \{1, 0.3, 0.5\}$.
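The screening over the 90 configurations can be sketched as a plain grid search; `fit_fn` below is a hypothetical stand-in for fitting the model on a subset of samples and returning the maximized likelihood:

```python
from itertools import product

# hyperparameter grids as listed above (2 * 5 * 3 * 3 = 90 configurations)
alpha_i_grid = [1, 10]
alpha0_grid = [0.1, 0.5, 1, 5, 10]
kappa0_grid = [1, 0.5, 0.1]
s_grid = [1, 0.3, 0.5]
configs = list(product(alpha_i_grid, alpha0_grid, kappa0_grid, s_grid))

def select_config(configs, subset, fit_fn):
    """Empirical Bayes selection: keep the configuration whose model,
    fitted on the sample subset, attains the highest likelihood."""
    return max(configs, key=lambda cfg: fit_fn(subset, *cfg))
```

Only the winning configuration is then used for deconvolution on the full data set.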

A.1 Derivation of the ELBO function
Each of the five components in (17) can be calculated analytically as below. The inequality is introduced by Jensen's inequality, making use of the convexity of the function $f(x) = -\log(x)$.
which is based on the Laplace approximation. The expectation and variance terms can be calculated as below. The gradient with respect to $\nu_{ij}^t$ is calculated for each of the five components in (17).
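A second-order (Laplace/delta-method style) approximation of an expectation of the form $E[\log X]$ can be checked numerically; the Gaussian and its parameters below are arbitrary illustrations, not the actual terms of (17):

```python
import numpy as np

mu, sigma = 10.0, 1.0
rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, 1_000_000)

# second-order Taylor expansion of log around the mean:
# E[log X] ~ log(mu) + 0.5 * Var[X] * (d^2/dx^2 log x at x = mu)
approx = np.log(mu) - sigma**2 / (2 * mu**2)
mc = np.log(x).mean()  # Monte-Carlo reference
```

The approximation is accurate when the distribution is concentrated relative to the curvature of the log, i.e., when $\sigma \ll \mu$ as here.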
The other components in (17) do not involve $\nu_{ij}^t$, and therefore their gradients with respect to $\nu_{ij}^t$ vanish. Similarly, the components in (17) that do not involve $\omega_j^t$ or $\beta_i^t$ contribute nothing to the gradients with respect to those parameters.