New generative methods for single-cell transcriptome data in bulk RNA sequence deconvolution

Numerous methods for bulk RNA sequence deconvolution have been developed to identify cellular targets of diseases by understanding the composition of cell types in disease-related tissues. However, issues of heterogeneity in gene expression between subjects and the shortage of reference single-cell RNA sequence data remain to achieve accurate bulk deconvolution. In our study, we investigated whether a new data generative method named sc-CMGAN and benchmarking generative methods (Copula, CTGAN and TVAE) could solve these issues and improve the bulk deconvolutions. We also evaluated the robustness of sc-CMGAN using three deconvolution methods and four public datasets. In almost all conditions, the generative methods contributed to improved deconvolution. Notably, sc-CMGAN outperformed the benchmarking methods and demonstrated higher robustness. This study is the first to examine the impact of data augmentation on bulk deconvolution. The new generative method, sc-CMGAN, is expected to become one of the powerful tools for the preprocessing of bulk deconvolution.


Impact of data augmentation on deconvolution results
The results of our study demonstrate the positive impact of sc-CMGAN data augmentation on bulk deconvolution performance (see Table 1 for information about the each scRNA-seq dataset).We focused on Baron's dataset and observed consistent improvements in performance, as measured by the Pearson coefficient and RMSE, across all three deconvolution methods (SCDC, MuSiC, and BisqueRNA) (Fig. 2A).Specifically, significant improvements in deconvolution were observed for SCDC (p = 0.043) and BisqueRNA (p = 0.005) when using sc-CMGAN data augmentation.We further compared the performance of sc-CMGAN with other benchmarking methods in terms of their impact on the three deconvolution methods (Fig. 2B,C and Table 2).For SCDC, significant improvements were observed only when using sc-CMGAN.For MuSiC, the Copula and CTGAN generative methods led to a decrease of performance, while TVAE and sc-CMGAN showed improvements.In the case of (A) Workflow of bulk deconvolution using data augmentation.Matrix C with augmentation is single cell RNA sequence data for reference data in deconvolution.Matrix T is the pseudo-bulk RNA sequence data, which is the input of the deconvolution methods.Finally, predicted cell type propotions and computed cell-type propotions are compared.RNA seq: RNA sequence; RMSE: root-mean-square error; Pearson: Pearson correlation value.(B) Summary of sc-CMGAN (Details are described in Method).In the feature selection step, top (t 0 − t n ) % of genes was selected as cell marker by ridge regression.sc-CMGAN: Generative Adversarial Network based on cell markers for single-cell genomics data, t 0 : initial value of percentage of cell markers in all genes; n: number of steps of gene selection.

Relationship between hyperparameter and performance
Figure 2D presents the relationship between the combination of epoch and the number of generated data and their impact on deconvolution, specifically for the SCDC method using the Baron dataset.It was observed that the performance of deconvolution was most improved when the epoch was set to 100 and the number of generated data was 100.In terms of the effect of epoch on performance, it was found to have a smaller impact on performance in sc-CMGAN compared to CTGAN.When the number of generated data was 300 cells/cell type and the deconvolution method used was sc-CMGAN, the Pearson correlation values were as follows: 0.851 in 50 epochs, 0.854 in 100 epochs, 0.841 in 150 epochs, 0.845 in 200 epochs, 0.838 in 250 epochs, and 0.845 in 300 epochs.The best value and worst value had a difference of 0.016.On the other hand, when the number of generated data was 300 cells/cell type and the deconvolution method was CTGAN, the Pearson correlation values were as follows: 0.809 in 50 epochs, 0.822 in 100 epochs, 0.830 in 150 epochs, 0.840 in 200 epochs, 0. 829 in 250 epochs and 0.834 in 300 epochs.The difference between the best value and worst value was 0.031.

Evaluation at the cell type level
The impact of data generation on bulk deconvolution was further analyzed at the cell type level, specifically for the SCDC method using the Baron dataset.The results showed that performance was improved for almost all cell types, with the exception of quiescent stellate cells (Fig. 3A, Table 3).The ratio of RMSE improvement to the RMSE of the control (without data generation) was calculated for each cell type.Among all cell types, beta cells exhibited the highest improvement ratio (+ 0.50), indicating a substantial enhancement of performance.On the other hand, quiescent stellate cells showed a negative improvement ratio (− 0.08), suggesting a decrease of performance with data generation.To provide a visual representation of the training, testing, and generated data in beta cells and quiescent stellate cells, UMAP plots were created (Fig. 3B).

Deconvolution in other datasets
The impact of sc-CMGAN on bulk deconvolution was also examined in other datasets, namely GSE81547, Kidney HCL, and PBMCs datasets (Fig. 4, Table 4).The results showed improvements in Pearson correlation values for most conditions, except for the analysis of PBMCs data using the MuSiC method.In the analysis of Kidney HCL and PBMCs datasets using the BisqueRNA method, sc-CMGAN significantly improved the RMSE (Kidney HCL: p = 0.039, PBMCs: p = 0.005).Furthermore, MuSiC analysis of the Kidney HCL dataset initially encountered the "Not enough valid cell type" error, however, data augmentation using sc-CMGAN allowed for error-free deconvolution in this dataset.All the detailed results, including the best epoch and cell numbers from this experiment, can be found in the supplementary data.

Deconvolution in real bulk RNA sequence dataset
Using Baron's dataset as a reference, we performed bulk deconvolution in real bulk RNA sequence data (Fig. 5).As also shown in the study by Wang et al., deconvolution methods overestimated the proportion of α cells 9 .However, all deconvolution methods with augmentation underestimated the proportion of α cells, compared to the deconvolution without augmentation.

Discussion
Bulk deconvolution is a valuable approach for estimating cell type proportions from bulk RNA-seq data, providing a cost-effective alternative to scRNA-seq.In the study, we attempted to improve the performance of bulk deconvolution using data generative methods.While efforts have been made to generate scRNAseq.In silico, the impact of data augmentation on deconvolution remains uncertain 14 .Additionally, we developed a new stepwise generative method (sc-CMGAN), and its performance was compared the performance of sc-CMGAN with that of benchmarking generative methods.
The results demonstrated that data augmentation using sc-CMGAN consistently improved the performance of all tested bulk deconvolution methods in the Baron's dataset.While other benchmarking generative methods also led to improvements, sc-CMGAN exhibited two key advantages.Firstly, sc-CMGAN displayed the highest robustness across different deconvolution methods.Notably, significant improvements were observed with SCDC only when using sc-CMGAN.In MuSiC, the performance with TVAE was slightly better than sc-CMGAN, but in BisqueRNA, TVAE showed less improvement compared to the other three methods.Secondly, sc-CMGAN exhibited high stability and improvement regardless of the training epoch, unlike CTGAN, which showed significant performance variations with different epochs.This stability can be attributed to the stepwise strategy employed by sc-CMGAN.These deconvolution improvements were found to be logically appropriate by two analyses at the cell type level (Fig. 3, Table 3).In the analysis of performance of each cell type, most cell types showed improved performance without bias, which is consistent with comprehensive data augmentation for cell types.In the analysis of the relationship improvement and visualization using UMAP, the highest improvement was seen in beta cells, of which distribution of the generated data was similar to the distribution of test data.Only quiescent stellate cells didn't showed improvement, but this could be improved by adjusting the epochs for each cell.The two analyses showed that the augmentation strategy had the potential to partially mitigate the heterogeneity in gene expression between subjects and improve the bulk deconvolution.
Furthermore, the study extended its evaluation to other datasets (GSE81547, Kidney HCL, and PBMCs) (Fig. 4, Table 4).Pearson correlation values improved in almost all conditions, with the exception of the analysis of PBMCs data using MuSiC.This improvement was observed in the Baron dataset and GSE81547 with the inter-case variation, as well as in Kidney HCL and PBMCs datasets with the intra-case variation.These findings indicate that sc-CMGAN is effective in addressing both types of variation encountered in bulk deconvolution.
We have two limitations of the study.First, the study is lacking real RNA sequence data analysis.The main purpose of the study is to investigate the influence of augmentation strategy on bulk deconvolution, so we designed the study using only pseudo-RNA sequence data.Our method has a potential to investigate the interesting biology if we deconvolute a real bulk RNA-seq data.Second, it is better to include many results that quantify this improvement under different tissues and conditions.Then, we selected tissues to avoid overlap (kidney, pancreas, peripheral blood cells) and the results that cannot be included in the main text is included in the supplementary data.Further extension of the conditions will make the results of improvements more reliable.
In conclusion, our study demonstrated that both the benchmarking and new generative methods improved the performance of bulk deconvolution.Specifically, our newly developed sc-CMGAN method outperformed the benchmarking methods in enhancing the performance of bulk deconvolution.The sc-CMGAN method, accompanied by its dedicated library and software, shows promising potential to become one of the powerful tools for the preprocessing in bulk deconvolution.

Datasets
The primary dataset used in this study was the pancreatic single-cell transcriptome data from Baron et al., which is widely used in bulk deconvolution 15 .Additionally, we examined other datasets of pancreatic, renal, and peripheral blood single-cell transcriptome data to assess the robustness of our approach (refer to Table 1 for details on the scRNA-seq datasets) [16][17][18][19] , and Fadista's dataset to perform bulk deconvolution in real bulk RNA sequence data 20 .Selection of the primary dataset was based on the criteria of having the highest number of cell types and two cases at least available.

Pre-processing
In the pre-processing step, we followed the approach described by Cobos et al. 13 Initially, we removed rows corresponding to genes with zero expression or zero variability.Next, the cells with library size, mitochondrial www.nature.com/scientificreports/content or ribosomal content further than three median absolute deviations away were discarded.Subsequently, we retained only those genes that were present in at least 5% of all cells, regardless of cell type, and had a UMI or read count greater than one.TMM normalization was applied to the final scRNA-seq expression dataset 21 .

Generation of artificial pseudo-bulk mixtures
The deconvolution pipeline, as depicted in Fig. 1A, was implemented in this study.Following the pre-processing step, the dataset was divided into equal proportions of 50% for training data and 50% for testing data.Subsequently, using the testing data, a matrix (referred to as matrix T) comprising 1000 pseudo-bulk mixtures was generated.This involved summing the count values from randomly selected individual cells.For each dataset, the minimum number of cells utilized to construct the pseudo-bulk mixture was set at 100.

Data augmentation
The training data were augmented by benchmarking or new generative methods.In the study, we employed Gaussian Copula (Copula), Conditional Tabular GAN (CTAGAN) and Triplet-based Variational Autoencoder (TVAE) as benchmarking generative methods 22,23 .Additionally, we developed and tested a new generative method specifically designed for scRNA-seq data.A grid search was performed to investigate optimal data generative conditions (number of images generated and training epoch).Specifically, we varied the number of generated data by increments of 100 cells, ranging from 100 cells per cell type to 1000 cells per cell type.Furthermore, we explored different training epochs, testing values from 50 to 300 epochs with increments of 50 epochs.The augmented data was used as a new independent reference case and added to the matrix C for the bulk deconvolution.

New generative method
A new data generative method called sc-CMGAN (stepwise Generative adversarial network based on cell markers for single-cell genomics data) was developed (Fig. 1B).The sc-CMGAN approach consists of three main steps: feature selection, training, and generating.In the feature selection step, a set of cell marker genes was identified using ridge regression.These genes serve as key indicators of cell types in the sc-RNA seq.The absolute value of the coefficient in ridge regression was taken as the importance of the genes.Next, in the training step, the sc-RNA seq.data corresponding to the selected cell marker genes were used to train the GAN model (CTGAN).This step involved the learning and capturing of the underlying data distribution of the cell marker genes.In the generating step, the trained models were used to generate sc-RNA seq.data for the cell marker genes.Simultaneously, the non-cell marker genes were assigned the median value of the expression data for the respective cell type.This process was repeated for a specific number of cycles (= n), and the generated data sets from each cycle were combined.To control the selection of cell markers, the top (t 0 − t n ) percentage of genes were chosen, where t 0 represents the initial value of the percentage of cell markers in all genes.In this study, the hyperparameters were set as (n, t 0 , t) = (2, 40, 20).The sc-CMGAN code, library and software can be obtained from the following GitHub link: https:// github.com/ TouiN ishik awa/ scCMG AN.

Bulk deconvolution method
The bulk deconvolution method was employed to estimate the cell type proportions from the artificial pseudobulk mixtures (matrix T) and the augmented reference scRNA-seq data (matrix C).Three benchmarking deconvolution methods, namely SCDC, MuSiC, and BisqueRNA, were utilized for the bulk deconvolution analysis [9][10][11] .The bulk deconvolution process was implemented in the R environment (version 3.6.1.).

Evaluation and visualization of results
Both the Pearson correlation coefficient and root mean square error (RMSE) were calculated to evaluate the performance of different deconvolution methods.The Pearson correlation coefficient measures the linear relationship between the estimated cell type proportions and the true proportions from the pseudo-bulk mixtures.A higher Pearson correlation values indicates better agreement between the estimated and true proportions.The RMSE quantifies the difference between the estimated and true proportions, with lower values indicating better performance.Furthermore, the generated data were visualized using the Uniform Manifold Approximation and

Statistical analysis
To assess the significance of the improvement achieved through data augmentation, a paired t-test was conducted.This statistical test compared the Pearson correlation values and RMSE between the results obtained with and without data augmentation.The paired t-test was performed in a Python environment, using appropriate statistical packages.

Figure 1 .
Figure 1.(A) Workflow of bulk deconvolution using data augmentation.Matrix C with augmentation is single cell RNA sequence data for reference data in deconvolution.Matrix T is the pseudo-bulk RNA sequence data, which is the input of the deconvolution methods.Finally, predicted cell type propotions and computed cell-type propotions are compared.RNA seq: RNA sequence; RMSE: root-mean-square error; Pearson: Pearson correlation value.(B) Summary of sc-CMGAN (Details are described in Method).In the feature selection step, top (t 0 − t n ) % of genes was selected as cell marker by ridge regression.sc-CMGAN: Generative Adversarial Network based on cell markers for single-cell genomics data, t 0 : initial value of percentage of cell markers in all genes; n: number of steps of gene selection.

Figure 3 .
Figure 3. (A) Correlation between the computed (known) and predicted cell type proportions in 1000 pseudobulk RNA-seq.data, organized by cell types.Pearson correlation values (r) are improved in almost all cell types.(B) Scatter plot visualization of the training, testing, and generated data using UMAP.This compares the generated data for cells that showed much improvement and cells that showed a decrease in performance.

Figure 4 .
Figure 4. Comparison between bulk deconvolution without sc-CMGAN (Control) and with sc-CMGAN in the datasets of Kidney HCL., PBMCs, and GSE81547.The results showed improvements in Pearson correlation values for most conditions, except for the analysis of PBMCs data using the MuSiC method.

Table 1 .
Summary of datasets.

Table 2 .
RMSE and Pearson correlation values between the computed (known) proportions in 1000 pseudobulk RNA-seq.data and the predicted proportions from the different bulk deconvolution methods using different generative methods.*p < 0.05, **p < 0.01.

Table 3 .
24SE and Pearson correlation values by cell types using different generative methods (bulk deconvolution method was SCDC).The best values of each cell type are in bold.Projection (UMAP) technique.UMAP is a dimensionality reduction algorithm that can provide a low-dimensional representation of high-dimensional data, allowing for visualization and clustering of the generated data24.