BLADE: Bayesian Log-normAl DEconvolution for enhanced in silico microdissection of bulk gene expression data

High-resolution deconvolution of bulk gene expression profiles is pivotal to characterize the complex cellular make-up of tissues, such as tumor microenvironment. Single-cell RNA-seq provides reliable prior knowledge for deconvolution, however, a comprehensive statistical model is required for efficient utilization due to the inherently variable nature of gene expression. We introduce BLADE (Bayesian Log-normAl Deconvolution), a comprehensive probabilistic framework to estimate both cellular make-up and gene expression profiles of each cell type in each sample. Unlike previous comprehensive statistical approaches, BLADE can handle >20 cell types thanks to the efficient variational inference. Throughout an intensive evaluation using >700 datasets, BLADE showed enhanced robustness against gene expression variability and better completeness than conventional methods, in particular to reconstruct gene expression profiles of each cell type. All-in-all, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems based on standard bulk gene expression data.


Introduction
Over the past decade, gene expression profiling has been applied to elucidate the complexity of transcriptional regulation in diverse biological contexts, such as cancer 1,2 .
Conventional gene expression profiling, based on either RNA sequencing (RNA-seq) or microarrays, captures cumulative gene expression levels of many cells combined. It is therefore often referred to as bulk gene expression profiling, in order to distinguish it from the recent single-cell gene expression profiling technologies 3 . In oncology, single-cell RNA sequencing (scRNA-seq) is employed to study cellular heterogeneity within a tumor, composed of malignant (tumor) and non-malignant cells [4][5][6][7][8][9][10] . However, scRNA-seq has serious limitations: apart from technical challenges such as drop-out, only a limited number of samples can be profiled due to the high cost and technical difficulties 11,12 which altogether hinder its application to large series and translation to clinical applications.
Several computational deconvolution methods have been developed to predict cellular composition from bulk RNA-seq by employing a signature of pre-determined cell type-specific gene expression profiles. Initially, these signatures were constructed by sorting each cell type followed by gene expression profiling 13 , whereas recent methods such as CIBERSORTx 14 and MuSiC 15 employed scRNA-seq data for this purpose. The majority of approaches perform linear regression to reconstruct the bulk gene expression profiles using the gene expression signatures, where the regression coefficients correspond to the cellular composition. However, the standard regression approach does not account for variability in gene expression within the same cell type, and may therefore render biased results.
To the best of our knowledge, there is no deconvolution method that can adequately and also efficiently account for the variability of gene expression within the same cell type.
Modelling gene expression variability is challenging specifically for deconvolution due to the incompatibility of the log-normalization 16 which significantly stabilizes gene expression variability. Without the log-normalization (i.e., in linear-scale), gene expression data has a heavily skewed distribution, which is not properly accounted for by the standard linear regression approaches. Currently, there are few probabilistic deconvolution approaches that take skewed variability into account, but these methods handle only a restricted number of cell types due to difficulties in optimization (e.g. three cell types in DeClust 17 and Demix/DemixT 18 ). Recently, CIBERSORTx introduced a two-step approach to address variable gene expression profiles across the samples: first estimate cellular fraction (deconvolution) and then reconstruct gene expression per cell type in each sample (purification). However, for some genes, the purification step of CIBERSORTx is an underdetermined optimization problem with infinite number of solutions, which leads to incomplete purification that excludes those genes.
Here, we introduce BLADE (Bayesian Log-normAl DEconvolution), a novel Bayesian method that jointly performs deconvolution and purification in a single-step, taking into account prior knowledge of cell type specific gene expression profiles obtained from scRNAseq data. BLADE takes a Bayesian framework that integrates two signatures of mean and variability of gene expression per-cell type using a log-normal probability model. The unified probabilistic model for both deconvolution and purification of BLADE can leverage the prior knowledge for purification as well, which may remedy the underdetermination issue.
Furthermore, an efficient variational inference algorithm was developed for which we show that it can handle at least 20 cell types. Through a comprehensive evaluation based on more than 700 bulk gene expression data sets, we demonstrate a robust performance of BLADE regardless of gene expression variability. In particular, BLADE achieved high accuracy and completeness in gene expression purification, underpinning the power of the unified Bayesian framework for both tasks.

Gene expression variability within a cell type
We first assessed gene expression variability within a cell type using publicly available PBMC CITE-seq data from 10x Genomics. Based on the integration and clustering analysis followed by phenotyping of 10,403 cells, we identified fifteen immune cell types, among which nine are in common, with distinct cell-surface markers and gene expression profiles ( Fig. 1a; see Methods and Supplementary Figs. S1-2). The size of cell populations ranges from 38 regulatory T cells (0.36%) to 2,518 classical monocytes (24%). We then identified differentially expressed genes (DEGs) for each cell type. Subsequently, the standard deviation of gene expression levels per gene and per cell type was measured to assess gene expression variability among the same cell types. We identified high gene expression variability among the same cell populations, especially for DEGs without log-transformation (i.e., linear-scale; Figs. 1b-c). The variability further increased when cells from the two scRNA-seq datasets were combined, indicating the presence of more variability between individuals ( Fig. 1d; P<2.2x10e-16 from a paired t-test of within-sample and betweensample variability).

Modeling gene-expression variability by probabilistic distribution
To properly account for variation in gene expression, we examined multiple probability distributions. We evaluated normal distribution, negative binomial distribution, and lognormal distribution to fit the expression level of each gene per cell type without lognormalization. The normal distribution is the standard variability model in many deconvolution algorithms, including CIBERSORTx 14 , EPIC 19 , and ABIS 20 , while the negative binomial distribution is frequently used for handling RNA-seq data 21 . The log-normal distribution is identical to the normal distribution but includes an exponential function, assuming gene expression data is normally distributed on a log-scale but not on a linearscale. In order to evaluate the performance of these probability distributions on gene expression variability, we assessed the maximum likelihood in fitting gene expression profiles per cell type from the scRNA-seq data. The log-normal distribution, in general, shows the best performance in per-gene maximum likelihood, followed by the negative binomial and normal distributions (Figs. 2a-b). In particular, we noted a biased fit of the normal distribution towards outlier observations, in contrast to the log-normal and negative binomial distribution (examples in Fig. 2c).
We further evaluated the performance of the log-normal and negative binomial distributions in the context of deconvolution. To this end, we constructed a generic statistical deconvolution method that can model gene expression profiles with diverse probabilistic assumptions given known cellular fractions. The method approximates the convolution of random variables with an arbitrary distribution using a probabilistic generating function, for which both negative binomial and log-normal random variables can be accurately approximated (see Methods, Supplementary Note 1 and Supplementary Fig. S3). Based on this method, we evaluated the performance of negative binomial and log-normal distribution in fitting the gene expression profiles per cell type using RNA-seq data from TCGA. We obtained TCGA RNA-seq data of mesothelioma (TCGA-MESO; n=84) and sarcoma (TCGA-SARC; n=256), from which we estimated the fraction of eight cell types using EPIC 19 , a non-probabilistic deconvolution method. Then, we applied the flexible deconvolution method with two different probabilistic assumptions, log-normal and negative binomial, to estimate expression profiles per cell type of 200 random genes. In terms of loglikelihood measured per gene, log-normal and negative binomial deconvolutions performed equally well for most of the genes, except for a few genes with a more favorable performance with log-normal (Fig. 3). The lower performance of the negative binomial distribution might be due to the difficulty in finding maximum likelihood parameters.
Cumulatively, we concluded that the log-normal distribution is an attractive probabilistic distribution to model gene expression variability of each cell type. for the further details of the framework. As a result, BLADE can handle many cell types (>20 cell types) and samples (>20 samples); unlike the previous log-normal based deconvolution that can account for a maximum of three cell types 18 .

Robustness of BLADE deconvolution against gene expression variability.
We assessed the robustness of BLADE, CIBERSORTx, and non-negative least squares (NNLS) against gene expression variability by applying them to model-based simulation data.
The simulation data was created to have diverse but controlled variability levels of gene expression profiles (standard deviation of 0.1-1.5) as well as different numbers of cell types In general, all three methods could accurately estimate cellular fractions in case of a high number of genes, a low number of cell types and a low variability level. In contrast, the performance decreased when a smaller number of genes are presented, and the number of cell types is increased (Fig. 5a).
However, BLADE was more robust against gene expression variability. In particular, in the range of observed expression variability of differentially expressed genes in the PBMC scRNA-seq data (on average >0.5; Fig. 1b), BLADE significantly outperformed CIBERSORTx and NNLS.
We then compared the performance of BLADE and CIBERSORTx in estimating gene expression profiles per cell type. In this comparison, NNLS is not included because of redundancy, since the purification step of CIBERSORTx is based on NNLS. There are two modes of purification in CIBERSORTx, both of which were compared with BLADE: 1) estimating average profile per cell type across the samples (group-mode purification), and 2) estimating the profile per cell type for each sample (high-resolution-mode purification). For the data set with low variability levels, both BLADE and CIBERSORTx accurately reconstructed gene expression profiles per cell type (Fig. 5b-c). However, unlike BLADE, the performance of CIBERSORTx decreased rapidly as the RNA expression variability within a cell type increased. Furthermore, CIBERSORTx often excludes genes for purification, especially in high-resolution mode, when: 1) the number of cell types is larger than the number of samples, and 2) the variability in gene expression is high (Fig. 5d-e). BLADE could accurately estimate the gene expression profiles of each cell type in both group-mode and high-resolution mode, regardless of the number of cell types and samples without any filtering (Fig. 5b-c).

Application of BLADE to in silico mixture of PBMC scRNA-seq data
We further evaluated our method based on actual scRNA-seq data from PBMC samples that were mixed in silico in various known proportions to generate bulk gene expression data without any model assumption. We generated 20 bulk gene expression data sets by random sampling, followed by mixing 100 cells among the 10,403 cells from the two PBMC scRNAseq data sets. In order to make the simulation data as realistic as possible, a cumulative sum of raw counts of 100 cells was obtained, followed by a standard normalization of the count data The resulting simulation data recapitulate the gene expression variability of 15 cell types ( Fig. 6a; Supplementary Fig. S5). We constructed signature matrices for the BLADE significantly outperformed CIBERSORTx in the estimation of gene expression profiles per cell type in both group-mode and high-resolution mode ( Fig. 6c-e).
For group-mode purification, CIBERSORTx reconstructed expression profiles per cell type with reasonable accuracy (>0.5 Pearson correlation except for pDC). The highest performance was achieved for classical monocytes. Here, performance of BLADE was near-perfect (Fig. 6c). In high-resolution mode, CIBERSORTx did not estimate expression levels of most genes, and essentially no genes were in silico purified for 11 cell types (Fig.   6d). Furthermore, even after filtering, the estimated gene expression profiles per cell type and per sample by CIBERSORTx are less accurate than those by BLADE (Fig. 6e). The performance of BLADE in high-resolution mode purification is consistently accurate (>0.7 Pearson correlation) across all 15 cell types. Cumulatively, Bayesian simultaneous deconvolution and in silico purification by BLADE significantly outperformed CIBERSORTx in both estimating cellular fraction and especially in reconstructing gene expression profiles per cell type.

Discussion
One of the major challenges in deconvolution of bulk RNAseq data is adequate and yet efficient handling of gene expression variability without log-normalization. This difficulty causes either of the two critical shortcomings in most of the available deconvolution algorithms: 1) only a small number of cell types can be handled; and 2) the inadequate variability model (usually normal distribution) implicitly or explicitly assumed in the core algorithm, such as support vector regression and non-negative least-squares, implies inferior results. We showed that, the normal distribution often renders a biased fit (Fig. 2b-c). The inadequate noise model leads to suboptimal performance of deconvolution algorithms when there is a realistic level of gene expression variability (Fig. 5). Furthermore, purification of gene expression involves more variables to be estimated, for which an incorrect variability model can have substantial impact on the performance (Figs. 5 and 6b-d). Statistical inference of log-normal convolution models, which were shown to appropriately capture variability, however, is very challenging, as demonstrated by previous log-normal deconvolution methods that handle three cell types maximally 18 .
BLADE solves this by using a novel hierarchical Bayesian model that simultaneously performs deconvolution and estimation of gene expression profiles per cell type. The lognormal convolution model efficiently accounts for variability in gene expression (Fig. 3) and also for prior knowledge of gene expression profiles per cell type derived from scRNA-seq data (Fig. 4). Notably, thanks to the unified probabilistic model used in BLADE, the prior knowledge contributes to both deconvolution and gene expression purification. This prior knowledge significantly reduces the search space of solutions for both tasks, which leads to enhanced accuracy and coverage, especially for gene expression purification. The efficient variational inference of BLADE allowed to handle a large number of cell types (> 20 cell types) which was not possible by previous statistical approaches 17,18 . Furthermore, BLADE may be beneficial in handling cell types without a precise prior knowledge, for instance, cancer cells with highly variable gene expression profiles across the subject, unlike the nonmalignant cells 24 . By integrating both signatures of mean and standard deviation, BLADE balances the contribution of genes with varying precision for deconvolution by prioritizing the genes with low variability (i.e., high precision) when estimating cell type fractions. Furthermore, BLADE can aid previously-established gene expression subtypes (e.g., PDAC 25,26 ) by characterizing the subtypes with distinct TME profiles. The detailed profiling of the TME, in particular, immune TME profiles may lead to a clinically applicable biomarker strategy for immunotherapy based on the standard bulk gene expression profiling. In conclusion, BLADE is a novel tool that can significantly contribute to unravel cellular heterogeneity in complex biological systems. (1) which implies, with = ,
The interest lies in estimating parameters = ( , ) by maximum likelihood.
While numerical evaluation of (2) may still be efficient for = 2 28 , however, the extension to > 2 is not straightforward to a − 1 dimensional integral. To this end, the log-normal density = is approximated by a probability generating function (PGF).
See Supplementary Note 1 for the details of PGF approximation. The PGF-based approximation of showed higher accuracy than an alternative approximation method, Fenton-Wilkinson (FW) approximation 29 , which was also included as a benchmark (See Supplementary Fig. S3).

Comparison of MLE for LN and NB based on the generic deconvolution technique
The aforementioned generic deconvolution was used to evaluate LN and NB for deconvolution. For this, two RNA-seq data sets are retrieved from The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/tcga/) using TCGAbiolinks 30 . We considered all complete samples from the following tumor types: Mesothelioma (MESO 31 , n = 84; and Sarcoma (SARC 32 , n = 256. Data was preprocessed as described previously in Rauschenberger et al. 33 . The comparison procedure for LN and NB distributions is: 1. Apply a non-statistical method, EPIC 19 , to estimate cell type fractions for bulk RNA-seq data using cell type specific reference signatures. It has shown that EPIC provides an reliable estimate of cellular fractions of = 8 cell types 34 , and it provides absolute fractions that add up to 1.
2. Fix the cellular fractions and fit generic deconvolution models with = 8 LN or NB components using maximum likelihood.
3. Compare the maximum likelihood values of the LN and NB models for genes.
The above procedure was done for 200 randomly selected genes with mean count per million larger or equal to 5 to exclude lowly expressed genes. Note that the comparison of the maximum likelihood values is fair, because the number of parameters used in the LN and NB components is the same, 2 = 16 per gene.

Hierarchical Bayesian model for convolution of log-normal variables (BLADE)
A novel Bayesian Log-normal Deconvolution model, BLADE, is introduced to efficiently For the inference, a collapsed variational inference was employed to handle analytically intractable posterior distribution of hidden variables given observed variables 35 .
In the framework, the random variables with conjugate prior distribution, which are and , were integrated out, which allows us to find a fully Bayesian estimation of instead of estimation of the single most probable and 35 . By defining the variational distribution for the hidden variables, and , the objective function is to minimize the dissimilarity between the variational distribution and probability distribution, measured by Numba-compiled objective function and gradients were used for the acceleration.

Selection of hyperparameters based on the empirical-Bayes framework
BLADE has multiple hyperparameters for the hidden variables and , and also for is also user-defined, which serve as a scale factor for variance of (see also

Supplementary Note 2).
An empirical Bayes approach was employed to select the best set of user-defined parameters 36 . For each configuration of parameters, a maximum likelihood estimate of variational parameters is obtained using a subset of samples. Then, the hyperparameter configuration with the highest likelihood is selected, followed by performing deconvolution using the entire data set. Only a subset of samples is used in the empirical Bayes step, not only to gain computational efficiency but also to avoid overfitting. Throughout the manuscript, we considered a total of 90 different parameter configurations that cover all possible combinations of: ∈ 1,10 , 0 ∈ 0.1, 0.5, 1, 5, 10 , 0 ∈ 1, 0.5, 0.1 , and ∈ 1, 0.3, 0.5 .

Construction of the simulation data with a controlled noise level
We

Construction of PBMC stimulation data
To construct realistic simulation data, 20 bulk gene expression data sets were generated by randomly sampling and merging a subset of 10,403 cells from the two PBMC scRNA-seq datasets. For each sample, the cellular fraction was first sampled from a Dirichlet distribution.
The actual fractions of the 15 cell types were used as the parameter of the Dirichlet distribution so that the sampled fraction is similar to the total fraction. The fraction was then converted into the count of each cell type, with the following constraints: 1) the total number of cells is 100, and 2) the minimum number of cells per type is one. Then, the given number of cells were sampled with replacement, followed by obtaining the raw counts per cell type as the cumulative sum of raw counts of the sampled cells. Up to three distinct cells per type were allowed to be sampled since otherwise, gene expression variability was over-stabilized due to the averaging. Finally, the simulated bulk raw counts were obtained by taking the cumulative sum of the raw counts per cell type among 15 cell types. The bulk gene expression data was log-normalized using the Seurat package 27 .  then also in merged PBMC data set (y-axis). Only the 9 cell types commonly detected in two data sets were used in the analysis.