New generative methods for single-cell transcriptome data in bulk RNA sequence deconvolution

Nishikawa, Toui; Lee, Masatoshi; Amau, Masataka

doi:10.1038/s41598-024-54798-z


Article
Open access
Published: 20 February 2024

Toui Nishikawa¹,
Masatoshi Lee¹ &
Masataka Amau²

Scientific Reports volume 14, Article number: 4156 (2024)

Abstract

Numerous methods for bulk RNA sequence deconvolution have been developed to identify cellular targets of diseases by understanding the composition of cell types in disease-related tissues. However, two obstacles to accurate bulk deconvolution remain: heterogeneity in gene expression between subjects and a shortage of reference single-cell RNA sequence data. In this study, we investigated whether a new generative method named sc-CMGAN and benchmarking generative methods (Copula, CTGAN and TVAE) could address these issues and improve bulk deconvolution. We also evaluated the robustness of sc-CMGAN using three deconvolution methods and four public datasets. In almost all conditions, the generative methods improved deconvolution. Notably, sc-CMGAN outperformed the benchmarking methods and demonstrated higher robustness. This study is the first to examine the impact of data augmentation on bulk deconvolution. The new generative method, sc-CMGAN, is expected to become a powerful tool for preprocessing in bulk deconvolution.

Introduction

Recent advancements in single-cell RNA sequencing (scRNA-seq) have enabled the analysis of transcriptome profiles at the individual cell level. scRNA-seq allows for the determination of cell type composition and ratios, which can facilitate the study of changes in tissue composition associated with diseases and the identification of disease-related cellular targets. For example, different types of infiltrating immune cells have different effects on tumor progression, and the mass of A cells is increased in type 2 diabetes^1,2. However, the high cost and technical complexity of obtaining scRNA-seq data pose challenges when dealing with large sample populations^3,4.

To overcome these challenges, methods that estimate cell proportions from bulk RNA expression data without relying on single-cell sequencing have attracted attention. This process, known as bulk RNA sequence deconvolution (bulk deconvolution), was first addressed with statistical and computational methods^5,6,7,8. More recently, methods utilizing scRNA-seq as a reference have achieved higher performance in deconvolution^9,10,11. For instance, MuSiC (2019) demonstrated high performance, particularly in tissues with closely related cell types, and Bisque (2020) implemented a regression-based approach to learn gene-specific bulk expression transformations^9,10. SCDC (2021) proved to be an effective method by leveraging cell type-specific gene expression profiles from multiple scRNA-seq reference datasets¹¹.

Despite these advancements, there are several challenges associated with bulk deconvolution using scRNA-seq reference datasets. Firstly, there is heterogeneity in gene expression between subjects, which has been reported to reduce the performance of bulk deconvolution^4,9,12,13. Secondly, achieving higher performance in bulk deconvolution requires more and higher-quality scRNA-seq data. However, as mentioned, the cost and availability of public scRNA-seq data make it difficult to secure sufficient data for analysis.

In this study, we aimed to investigate whether the performance of bulk deconvolution could be improved by augmenting scRNA-seq data using benchmarking generative methods (Fig. 1A). Additionally, we developed a new generative method based on a stepwise selection of cell markers, called sc-CMGAN (stepwise Generative Adversarial Network based on cell markers for single-cell genomics data) (Fig. 1B).

Results

Impact of data augmentation on deconvolution results

The results of our study demonstrate the positive impact of sc-CMGAN data augmentation on bulk deconvolution performance (see Table 1 for information about each scRNA-seq dataset). We focused on Baron's dataset and observed consistent improvements in performance, as measured by the Pearson coefficient and RMSE, across all three deconvolution methods (SCDC, MuSiC, and BisqueRNA) (Fig. 2A). Specifically, significant improvements in deconvolution were observed for SCDC (p = 0.043) and BisqueRNA (p = 0.005) when using sc-CMGAN data augmentation. We further compared the performance of sc-CMGAN with the benchmarking methods in terms of their impact on the three deconvolution methods (Fig. 2B,C and Table 2). For SCDC, significant improvements were observed only when using sc-CMGAN. For MuSiC, the Copula and CTGAN generative methods led to a decrease in performance, while TVAE and sc-CMGAN showed improvements. In the case of BisqueRNA, all generative methods significantly improved the performance compared to the control (Copula: p = 0.003, CTGAN: p = 0.003, TVAE: p = 0.025, sc-CMGAN: p = 0.005).

Table 1 Summary of datasets.


Table 2 RMSE and Pearson correlation values between the computed (known) proportions in 1000 pseudo-bulk RNA-seq mixtures and the predicted proportions from the different bulk deconvolution methods using different generative methods.


Relationship between hyperparameter and performance

Figure 2D presents how the combination of training epochs and the number of generated cells affects deconvolution, specifically for the SCDC method using the Baron dataset. Performance improved most when the number of epochs was 100 and the number of generated cells was 100. The number of epochs had a smaller effect on performance for sc-CMGAN than for CTGAN. With 300 generated cells per cell type and sc-CMGAN as the generative method, the Pearson correlation values were 0.851 at 50 epochs, 0.854 at 100 epochs, 0.841 at 150 epochs, 0.845 at 200 epochs, 0.838 at 250 epochs, and 0.845 at 300 epochs. The difference between the best and worst values was 0.016. With 300 generated cells per cell type and CTGAN as the generative method, the Pearson correlation values were 0.809 at 50 epochs, 0.822 at 100 epochs, 0.830 at 150 epochs, 0.840 at 200 epochs, 0.829 at 250 epochs, and 0.834 at 300 epochs. The difference between the best and worst values was 0.031.
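The two spreads quoted above can be checked directly from the listed values. The following Python sketch does only that arithmetic; the variable and function names are illustrative and do not come from the sc-CMGAN repository.

```python
# Pearson correlations reported above for 300 generated cells per cell type
# (SCDC on the Baron dataset), keyed by training epoch.
pearson_sc_cmgan = {50: 0.851, 100: 0.854, 150: 0.841, 200: 0.845, 250: 0.838, 300: 0.845}
pearson_ctgan = {50: 0.809, 100: 0.822, 150: 0.830, 200: 0.840, 250: 0.829, 300: 0.834}


def spread(values_by_epoch):
    """Return the best-minus-worst Pearson correlation, rounded to 3 decimals."""
    values = list(values_by_epoch.values())
    return round(max(values) - min(values), 3)


print(spread(pearson_sc_cmgan))  # 0.016
print(spread(pearson_ctgan))     # 0.031
```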

Evaluation at the cell type level

The impact of data generation on bulk deconvolution was further analyzed at the cell type level, specifically for the SCDC method using the Baron dataset. Performance improved for almost all cell types, with the exception of quiescent stellate cells (Fig. 3A, Table 3). The ratio of RMSE improvement to the RMSE of the control (without data generation) was calculated for each cell type. Among all cell types, beta cells exhibited the highest improvement ratio (+0.50), indicating a substantial enhancement of performance. On the other hand, quiescent stellate cells showed a negative improvement ratio (−0.08), suggesting a decrease in performance with data generation. To visualize the training, testing, and generated data for beta cells and quiescent stellate cells, UMAP plots were created (Fig. 3B).
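As a reading aid, the improvement ratio described above can be written as (RMSE of control − RMSE with augmentation) / RMSE of control, so that a positive value means a lower error after augmentation. The sketch below assumes this sign convention, which matches the reported +0.50 and −0.08; the function name is hypothetical and the numbers in the usage lines are made up for illustration and are not taken from Table 3.

```python
def improvement_ratio(rmse_control, rmse_augmented):
    """Fraction of the control RMSE removed by augmentation (positive = better)."""
    if rmse_control <= 0:
        raise ValueError("rmse_control must be positive")
    return (rmse_control - rmse_augmented) / rmse_control


# Illustrative inputs only (not values from the paper):
print(improvement_ratio(0.10, 0.05))   # error halved -> ratio of 0.5
print(improvement_ratio(0.10, 0.108))  # error grew   -> negative ratio
```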

Table 3 RMSE and Pearson correlation values by cell types using different generative methods (bulk deconvolution method was SCDC).


Deconvolution in other datasets

The impact of sc-CMGAN on bulk deconvolution was also examined in other datasets, namely the GSE81547, Kidney HCL, and PBMCs datasets (Fig. 4, Table 4). The results showed improvements in Pearson correlation values for most conditions, except for the analysis of PBMCs data using the MuSiC method. In the analysis of the Kidney HCL and PBMCs datasets using the BisqueRNA method, sc-CMGAN significantly improved the RMSE (Kidney HCL: p = 0.039, PBMCs: p = 0.005). Furthermore, MuSiC analysis of the Kidney HCL dataset initially encountered the “Not enough valid cell type” error; however, data augmentation using sc-CMGAN allowed for error-free deconvolution in this dataset. Detailed results, including the best epoch and cell numbers from this experiment, can be found in the supplementary data.

Table 4 RMSE and Pearson correlation values from the datasets of GSE81547, PBMCs and Kidney HCL.


Deconvolution in real bulk RNA sequence dataset

Using Baron's dataset as a reference, we performed bulk deconvolution on real bulk RNA sequence data (Fig. 5). As also shown in the study by Wang et al., the deconvolution methods overestimated the proportion of α cells⁹. With augmentation, however, all deconvolution methods estimated a lower proportion of α cells than they did without augmentation.

Discussion

Bulk deconvolution is a valuable approach for estimating cell type proportions from bulk RNA-seq data, providing a cost-effective alternative to scRNA-seq. In this study, we attempted to improve the performance of bulk deconvolution using data generative methods. While efforts have been made to generate scRNA-seq data in silico, the impact of data augmentation on deconvolution remains uncertain¹⁴. Additionally, we developed a new stepwise generative method (sc-CMGAN) and compared its performance with that of benchmarking generative methods.

The results demonstrated that data augmentation using sc-CMGAN consistently improved the performance of all tested bulk deconvolution methods on Baron’s dataset. While the benchmarking generative methods also led to improvements, sc-CMGAN exhibited two key advantages. Firstly, sc-CMGAN displayed the highest robustness across different deconvolution methods. Notably, significant improvements were observed with SCDC only when using sc-CMGAN. In MuSiC, the performance with TVAE was slightly better than with sc-CMGAN, but in BisqueRNA, TVAE showed less improvement than the other three methods. Secondly, sc-CMGAN exhibited stable improvement regardless of the training epoch, unlike CTGAN, whose performance varied considerably with the number of epochs. This stability can be attributed to the stepwise strategy employed by sc-CMGAN.

These deconvolution improvements were supported by two analyses at the cell type level (Fig. 3, Table 3). In the per-cell-type analysis, most cell types showed improved performance without bias, which is consistent with augmentation being applied comprehensively across cell types. In the analysis relating improvement to the UMAP visualization, the highest improvement was seen in beta cells, for which the distribution of the generated data was similar to that of the test data. Only quiescent stellate cells did not show improvement, but this might be addressed by adjusting the epochs for each cell type. The two analyses showed that the augmentation strategy had the potential to partially mitigate the heterogeneity in gene expression between subjects and improve bulk deconvolution.

Furthermore, the study extended its evaluation to other datasets (GSE81547, Kidney HCL, and PBMCs) (Fig. 4, Table 4). Pearson correlation values improved in almost all conditions, with the exception of the analysis of PBMCs data using MuSiC. This improvement was observed in the Baron dataset and GSE81547, which have inter-case variation, as well as in the Kidney HCL and PBMCs datasets, which have intra-case variation. These findings indicate that sc-CMGAN is effective in addressing both types of variation encountered in bulk deconvolution.

This study has two limitations. First, it lacks analysis of real RNA sequence data. Because the main purpose of the study was to investigate the influence of the augmentation strategy on bulk deconvolution, we designed it using only pseudo-bulk RNA sequence data. Our method has the potential to reveal interesting biology when applied to real bulk RNA-seq data. Second, the improvement should ideally be quantified across more tissues and conditions. We therefore selected tissues that do not overlap (kidney, pancreas, peripheral blood cells), and the results that could not be included in the main text are provided in the supplementary data. Extending the range of conditions further will make the reported improvements more reliable.

In conclusion, our study demonstrated that both the benchmarking and new generative methods improved the performance of bulk deconvolution. Specifically, our newly developed sc-CMGAN method outperformed the benchmarking methods in enhancing the performance of bulk deconvolution. The sc-CMGAN method, accompanied by its dedicated library and software, has the potential to become a powerful tool for preprocessing in bulk deconvolution.

Materials and methods

Datasets

The primary dataset used in this study was the pancreatic single-cell transcriptome data from Baron et al., which is widely used in bulk deconvolution¹⁵. Additionally, we examined other pancreatic, renal, and peripheral blood single-cell transcriptome datasets to assess the robustness of our approach (see Table 1 for details on the scRNA-seq datasets)^16,17,18,19, and used Fadista’s dataset to perform bulk deconvolution on real bulk RNA sequence data²⁰. The primary dataset was selected for having the highest number of cell types and at least two cases available.

Pre-processing

In the pre-processing step, we followed the approach described by Cobos et al.¹³ First, we removed rows corresponding to genes with zero expression or zero variability. Next, cells whose library size, mitochondrial content or ribosomal content was more than three median absolute deviations from the median were discarded. Subsequently, we retained only those genes that were present in at least 5% of all cells, regardless of cell type, and had a UMI or read count greater than one. TMM normalization was applied to the final scRNA-seq expression dataset²¹.
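The filtering steps above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the genes-by-cells layout, the boolean `is_mito` and `is_ribo` gene masks, and the reading of the last criterion as "count greater than one in at least 5% of the retained cells" are all assumptions. TMM normalization is not reproduced here.

```python
import numpy as np


def mad_outliers(values, n_mads=3.0):
    """Flag entries more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # If mad is 0, any value different from the median is flagged.
    return np.abs(values - median) > n_mads * mad


def filter_counts(counts, is_mito, is_ribo, min_cell_fraction=0.05):
    """Filter a genes x cells count matrix.

    Returns (filtered_counts, kept_gene_mask, kept_cell_mask).
    """
    counts = np.asarray(counts, dtype=float)
    # 1. Drop genes with zero total expression or zero variance across cells.
    gene_ok = (counts.sum(axis=1) > 0) & (counts.var(axis=1) > 0)
    # 2. Drop cells that are outliers in library size or in
    #    mitochondrial / ribosomal fraction.
    lib = counts.sum(axis=0)
    safe_lib = np.where(lib > 0, lib, 1.0)  # avoid division by zero
    mito = counts[is_mito].sum(axis=0) / safe_lib
    ribo = counts[is_ribo].sum(axis=0) / safe_lib
    cell_ok = ~(mad_outliers(lib) | mad_outliers(mito) | mad_outliers(ribo))
    # 3. Keep genes with a count > 1 in at least 5% of the retained cells
    #    (one possible reading of the criterion in the text).
    frac = (counts[:, cell_ok] > 1).mean(axis=1)
    gene_ok &= frac >= min_cell_fraction
    return counts[np.ix_(gene_ok, cell_ok)], gene_ok, cell_ok
```

A real analysis would pass masks derived from gene names (for example, mitochondrial and ribosomal gene prefixes) and then normalize the filtered matrix.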

Generation of artificial pseudo-bulk mixtures

The deconvolution pipeline, as depicted in Fig. 1A, was implemented in this study. Following the pre-processing step, the dataset was divided into equal proportions of 50% for training data and 50% for testing data. Subsequently, using the testing data, a matrix (referred to as matrix T) comprising 1000 pseudo-bulk mixtures was generated. This involved summing the count values from randomly selected individual cells. For each dataset, the minimum number of cells utilized to construct the pseudo-bulk mixture was set at 100.
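A minimal Python/NumPy sketch of the split and the pseudo-bulk construction is given below. The text does not specify how many cells each mixture contains beyond the minimum of 100, nor how cells were sampled, so uniform sampling without replacement and a mixture size drawn between `min_cells` and `max_cells` are assumptions; the function names are hypothetical. The known proportions returned in `P` are what the predicted proportions are later compared against.

```python
import numpy as np


def split_half(n_cells, seed=0):
    """Randomly split cell indices into 50% training and 50% testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)
    half = n_cells // 2
    return order[:half], order[half:]


def make_pseudo_bulk(counts, cell_types, n_mixtures=1000, min_cells=100,
                     max_cells=None, seed=0):
    """Build pseudo-bulk mixtures by summing counts of randomly chosen cells.

    counts: genes x cells matrix; cell_types: one label per cell.
    Returns (T, P, type_names) where T is genes x mixtures and
    P is cell types x mixtures (known proportions).
    Requires min_cells <= max_cells <= number of cells.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    cell_types = np.asarray(cell_types)
    type_names = np.unique(cell_types)
    n_cells = counts.shape[1]
    max_cells = n_cells if max_cells is None else max_cells
    T = np.zeros((counts.shape[0], n_mixtures))
    P = np.zeros((len(type_names), n_mixtures))
    for j in range(n_mixtures):
        size = rng.integers(min_cells, max_cells + 1)  # upper bound inclusive
        chosen = rng.choice(n_cells, size=size, replace=False)
        T[:, j] = counts[:, chosen].sum(axis=1)
        labels = cell_types[chosen]
        P[:, j] = [(labels == t).mean() for t in type_names]
    return T, P, type_names
```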

Data augmentation

The training data were augmented by benchmarking or new generative methods. In this study, we employed Gaussian Copula (Copula), Conditional Tabular GAN (CTGAN) and Triplet-based Variational Autoencoder (TVAE) as benchmarking generative methods^22,23. Additionally, we developed and tested a new generative method specifically designed for scRNA-seq data. A grid search was performed to investigate optimal data generative conditions (number of cells generated and training epochs). Specifically, we varied the number of generated cells in increments of 100, ranging from 100 cells per cell type to 1000 cells per cell type. Furthermore, we explored different training epochs, testing values from 50 to 300 epochs in increments of 50 epochs. The augmented data were used as a new independent reference case and added to the matrix C for the bulk deconvolution.
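The grid described above has 10 × 6 = 60 settings. A short Python sketch of enumerating it is shown below; `score_fn` is a hypothetical callable standing in for "augment, deconvolve, and return the Pearson correlation", which is not implemented here.

```python
from itertools import product

cells_per_type_grid = range(100, 1001, 100)  # 100, 200, ..., 1000
epoch_grid = range(50, 301, 50)              # 50, 100, ..., 300
grid = list(product(epoch_grid, cells_per_type_grid))


def best_setting(score_fn, grid):
    """Return the (epochs, cells_per_type) pair that maximises score_fn."""
    return max(grid, key=lambda setting: score_fn(*setting))


print(len(grid))  # 60
```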

New generative method

A new data generative method called sc-CMGAN (stepwise Generative Adversarial Network based on cell markers for single-cell genomics data) was developed (Fig. 1B). The sc-CMGAN approach consists of three main steps: feature selection, training, and generating. In the feature selection step, a set of cell marker genes was identified using ridge regression. These genes serve as key indicators of cell types in the scRNA-seq data. The absolute value of the coefficient in ridge regression was taken as the importance of each gene. Next, in the training step, the scRNA-seq data corresponding to the selected cell marker genes were used to train the GAN model (CTGAN), which learns the underlying data distribution of the cell marker genes. In the generating step, the trained models were used to generate scRNA-seq data for the cell marker genes. Simultaneously, the non-cell marker genes were assigned the median value of the expression data for the respective cell type. This process was repeated for a specific number of cycles (= n), and the generated data sets from each cycle were combined. To control the selection of cell markers, the top (t₀ − t_n) percentage of genes were chosen, where t₀ represents the initial value of the percentage of cell markers in all genes. In this study, the hyperparameters were set as (n, t₀, t) = (2, 40, 20). The sc-CMGAN code, library and software can be obtained from the following GitHub link: https://github.com/TouiNishikawa/scCMGAN.
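The three steps can be illustrated with the following Python/NumPy sketch. It is not the sc-CMGAN implementation: the CTGAN generator is replaced by a simple bootstrap of real cells so that the example stays self-contained, ridge regression is fitted in closed form against one-hot cell type labels with the per-gene importance taken as the largest absolute coefficient across cell types, and the marker percentage for each cycle is passed explicitly because the text does not pin down the schedule. All function names and these three choices are assumptions.

```python
import numpy as np


def ridge_gene_importance(X, labels, alpha=1.0):
    """X: cells x genes. Returns one importance score per gene."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    types = np.unique(labels)
    Y = (labels[:, None] == types[None, :]).astype(float)  # one-hot targets
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Closed-form ridge solution: (X'X + alpha I)^-1 X'Y, genes x types.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return np.abs(W).max(axis=1)


def top_percent(importance, percent):
    """Boolean mask selecting the top `percent` % of genes by importance."""
    k = max(1, int(round(len(importance) * percent / 100)))
    mask = np.zeros(len(importance), dtype=bool)
    mask[np.argsort(importance)[::-1][:k]] = True
    return mask


def generate_stepwise(X, labels, marker_percents, n_per_type, seed=0):
    """Generate n_per_type synthetic cells per cell type for each cycle."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    importance = ridge_gene_importance(X, labels)
    out_X, out_y = [], []
    for percent in marker_percents:  # one cycle per marker percentage
        markers = top_percent(importance, percent)
        for t in np.unique(labels):
            real = X[labels == t]
            # Non-marker genes: per-cell-type median, as described in the text.
            synth = np.tile(np.median(real, axis=0), (n_per_type, 1))
            # Marker genes: STAND-IN generator (bootstrap of real cells);
            # the paper trains CTGAN on these genes instead.
            picks = rng.integers(0, len(real), size=n_per_type)
            synth[:, markers] = real[picks][:, markers]
            out_X.append(synth)
            out_y.extend([t] * n_per_type)
    return np.vstack(out_X), np.array(out_y)
```

For example, `generate_stepwise(X, labels, marker_percents=(40, 20), n_per_type=100)` runs two cycles with 40% and then 20% of genes treated as markers, which is only one possible reading of (n, t₀, t) = (2, 40, 20).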

Bulk deconvolution method

The bulk deconvolution method was employed to estimate the cell type proportions from the artificial pseudo-bulk mixtures (matrix T) and the augmented reference scRNA-seq data (matrix C). Three benchmarking deconvolution methods, namely SCDC, MuSiC, and BisqueRNA, were utilized for the bulk deconvolution analysis^9,10,11. The bulk deconvolution process was implemented in the R environment (version 3.6.1).

Deconvolution in real bulk RNA sequence dataset

Evaluation and visualization of results

Both the Pearson correlation coefficient and root mean square error (RMSE) were calculated to evaluate the performance of different deconvolution methods. The Pearson correlation coefficient measures the linear relationship between the estimated cell type proportions and the true proportions from the pseudo-bulk mixtures. A higher Pearson correlation value indicates better agreement between the estimated and true proportions. The RMSE quantifies the difference between the estimated and true proportions, with lower values indicating better performance. Furthermore, the generated data were visualized using the Uniform Manifold Approximation and Projection (UMAP) technique. UMAP is a dimensionality reduction algorithm that can provide a low-dimensional representation of high-dimensional data, allowing for visualization and clustering of the generated data²⁴.
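Both metrics are standard and can be written in a few lines of plain Python. The sketch below takes two equal-length sequences of proportions; whether the paper pools all cell types and mixtures into one vector or computes the metrics per cell type first is not stated here, so the flattened-vector input is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```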

Statistical analysis

To assess the significance of the improvement achieved through data augmentation, a paired t-test was conducted. This statistical test compared the Pearson correlation values and RMSE between the results obtained with and without data augmentation. The paired t-test was performed in a Python environment, using appropriate statistical packages.
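For reference, the paired t statistic is the mean of the per-pair differences divided by its standard error, with n − 1 degrees of freedom. The plain-Python sketch below computes only the statistic; a p-value would come from the t distribution, for example via `scipy.stats.ttest_rel`, although the text does not name the package that was actually used. The function name is hypothetical and the numbers in the usage line are made up for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(with_aug, without_aug):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(with_aug, without_aug)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Illustrative values only (not results from the paper):
t, df = paired_t_statistic([0.90, 0.80, 0.85], [0.80, 0.75, 0.70])
```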

Data availability

The four publicly available datasets used in the study can be found at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133 (baron), https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81547 (GSE81547), https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a (PBMCs), https://figshare.com/articles/HCL_DGE_Data/7235471 (kidney.HCL).

Code availability

Source code can be found at https://github.com/TouiNishikawa/scCMGAN.

References

1. Fridman, W. H., Pagès, F., Sautès-Fridman, C. & Galon, J. The immune contexture in human tumours: Impact on clinical outcome. Nat. Rev. Cancer 12, 298–306 (2012).
2. Rahier, J., Goebbels, R. M. & Henquin, J. C. Cellular composition of the human diabetic pancreas. Diabetologia 24, 366–371. https://doi.org/10.1007/bf00251826 (1983).
3. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145. https://doi.org/10.1038/nrg3833 (2015).
4. Ziegenhain, C. et al. Comparative analysis of single-cell RNA sequencing methods. Mol. Cell 65, 631-643.e634. https://doi.org/10.1016/j.molcel.2017.01.023 (2017).
5. Venet, D., Pecasse, F., Maenhaut, C. & Bersini, H. Separation of samples into their constituents using gene expression data. Bioinformatics 17(Suppl 1), S279-287. https://doi.org/10.1093/bioinformatics/17.suppl_1.s279 (2001).
6. Shen-Orr, S. S. et al. Cell type-specific gene expression differences in complex tissues. Nat. Methods 7, 287–289. https://doi.org/10.1038/nmeth.1439 (2010).
7. Gong, T. & Szustakowski, J. D. DeconRNASeq: A statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics 29, 1083–1085. https://doi.org/10.1093/bioinformatics/btt090 (2013).
8. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457. https://doi.org/10.1038/nmeth.3337 (2015).
9. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 10, 380. https://doi.org/10.1038/s41467-018-08023-x (2019).
10. Jew, B. et al. Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nat. Commun. 11, 1971. https://doi.org/10.1038/s41467-020-15816-6 (2020).
11. Dong, M. et al. SCDC: Bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Brief Bioinform. 22, 416–427. https://doi.org/10.1093/bib/bbz166 (2021).
12. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498. https://doi.org/10.1038/s41586-018-0414-6 (2018).
13. Avila Cobos, F., Alquicira-Hernandez, J., Powell, J. E., Mestdagh, P. & De Preter, K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 11, 5650. https://doi.org/10.1038/s41467-020-19015-1 (2020).
14. Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11, 166. https://doi.org/10.1038/s41467-019-14018-z (2020).
15. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346-360.e344. https://doi.org/10.1016/j.cels.2016.08.011 (2016).
16. Enge, M. et al. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell 171, 321-330.e314. https://doi.org/10.1016/j.cell.2017.09.004 (2017).
17. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049. https://doi.org/10.1038/ncomms14049 (2017).
18. Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309. https://doi.org/10.1038/s41586-020-2157-4 (2020).
19. Guo, G. HCL DGE Data https://doi.org/10.6084/m9.figshare.7235471.v2 (2020).
20. Fadista, J. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proc. Natl. Acad. Sci. U. S. A. 111, 13924–13929. https://doi.org/10.1073/pnas.1402665111 (2014).
21. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140. https://doi.org/10.1093/bioinformatics/btp616 (2010).
22. Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling tabular data using conditional GAN. arXiv:1907.00503 (2019). https://ui.adsabs.harvard.edu/abs/2019arXiv190700503X
23. Ishfaq, H., Hoogi, A. & Rubin, D. TVAE: Triplet-based variational autoencoder using metric learning. arXiv:1802.04403 (2018). https://ui.adsabs.harvard.edu/abs/2018arXiv180204403I
24. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018). https://ui.adsabs.harvard.edu/abs/2018arXiv180203426M

Acknowledgements

The authors are grateful to Professor S. Murata (Department of Human Pathology, Wakayama Medical University) and Professor S. Hashimoto (Department of Molecular Pathophysiology, Wakayama Medical University) for useful discussions. We also thank Professor R. Watanabe (Department of Life Science Frontiers, CiRA, Kyoto University) and Rhelixa Inc. for organizing the Single Cell Genomics Hackathon 2022.

Author information

Authors and Affiliations

Faculty of Medicine, Wakayama Medical University, 811-1 Kimiidera, Wakayama, 641-8509, Japan
Toui Nishikawa & Masatoshi Lee
Faculty of Medicine, Kyoto University, Kyoto, Japan
Masataka Amau

Contributions

T.N. contributed to the design, analysis and interpretation of the report. M.L. contributed significantly to the collection of datasets and the preparation of a figure. M.A. contributed to the software engineering.

Corresponding author

Correspondence to Toui Nishikawa.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Nishikawa, T., Lee, M. & Amau, M. New generative methods for single-cell transcriptome data in bulk RNA sequence deconvolution. Sci Rep 14, 4156 (2024). https://doi.org/10.1038/s41598-024-54798-z

Received: 08 September 2023
Accepted: 16 February 2024
Published: 20 February 2024
DOI: https://doi.org/10.1038/s41598-024-54798-z
