Bulk tissue cell type deconvolution with multi-subject single-cell expression reference

Knowledge of cell type composition in disease relevant tissues is an important step towards the identification of cellular targets of disease. We present MuSiC, a method that utilizes cell-type specific gene expression from single-cell RNA sequencing (RNA-seq) data to characterize cell type compositions from bulk RNA-seq data in complex tissues. By appropriate weighting of genes showing cross-subject and cross-cell consistency, MuSiC enables the transfer of cell type-specific gene expression information from one dataset to another. When applied to pancreatic islet and whole kidney expression data in human, mouse, and rats, MuSiC outperformed existing methods, especially for tissues with closely related cell types. MuSiC enables the characterization of cellular heterogeneity of complex tissues for understanding of disease mechanisms. As bulk tissue data are more easily accessible than single-cell RNA-seq, MuSiC allows the utilization of the vast amounts of disease relevant bulk tissue RNA-seq data for elucidating cell type contributions in disease.

the cell type proportions are estimated accurately by MuSiC when the missing cell type is not the dominant cell type in the bulk tissue.

Supplementary Note 4: Tolerance of bias in single-cell relative abundance on deconvolution
The protocol discrepancies between bulk and single-cell datasets may lead to estimation bias. To evaluate the degree of bias tolerance relative to the biological signal, we manually introduce noise to cross-subject average of the single-cell obtained relative abundance . Because of the constraint that ∑ =1 = 1, we generate biased relative abundance by Dirichlet distribution, denoted by ′ . Consider one cell type only. For simplicity, we drop the superscript for cell type. We assume the relative abundances of genes follow a Dirichlet distribution, where is a scaling factor. The mean and variance of ′ are and �1− � +1 , respectively. By setting = 999, 1332, 1999 and 3999, corresponding to ≥ 2, 1.5, 1 and 0.5, we simulated 100 times the cross-subject average of relative abundance of 6 major cell types from Segerstolpe et al. 3 We deconvolved the artificial bulk data constructed by Xin et al. 1 (Supplementary Figure 8c) and MuSiC provides accurate cell type proportions even with biased relative abundance as input.

Supplementary Note 5: Robustness to single-cell dropout noise on deconvolution
Single-cell RNA-seq data are prone to gene dropout and the dropout rates can differ across datasets. To evaluate the robustness of MuSiC, NNLS, BSEQ-sc and CIBERSORT to dropout in single-cell data, we constructed artificial bulk data from the original scRNA-seq data and deconvolve it by single-cell data with additional dropout noise. Following Jia et al. 4 , the dropout rate is generated by where is the observed read counts, is the dropout rate parameters. The simulated read count ′ follows distribution such that We evaluated four different dropout rates with = 1, 0.5, 0.2 and 0.1 (Supplementary Figure  8a-b). In general, adding more dropout noise leads to lower MuSiC estimation accuracy. Compared with other methods, MuSiC consistently performs better in the presence of dropout noise.

Supplementary Note 6: Convergence of MuSiC with different starting points
MuSiC estimates cell type proportions by weighted non-negative least square (W-NNLS), which might be sensitive to the choice of starting values. To examine the convergence property of MuSiC, we re-analyzed the data in Figure 2b to show convergence with different starting points. The artificial bulk data is constructed by Xin et al. 1 while the single-cell reference consists of 6 healthy subjects from Segerstolpe et al. 3 The cell type proportions of four cell types: alpha, beta, delta and gamma are estimated using MuSiC with different starting points are shown in Supplementary Table 8. W-NNLS converges to the same value regardless of the starting points (Supplementary Figure 9).

Supplementary Note 7: Complex models
More complex error models, such as the gamma may give better fit to data, but could be computationally more challenging to fit. Here our empirical results show that the Gaussian model already gives accurate estimates.

Supplementary Tables
Supplementary  TSPYL1  RPS4X  72  ATP6V0B  TMED4  SPINT2  23  ACTB  C6orf62  HSPA5  73  PSAP  CST3  REG1B  24  PRSS1  RPL3  ITGB1  74  LRRC75A-AS1  CD63  HNRNPC  25  RBP4  DPYSL2  IAPP  75  S100A11  TOB1  RPL15  26  GDF15  UBC  TPT1  76  MUC13  HLA-A  ENO1  27  COX8A  SCG2  RPL5  77  MAP1B  CLU  RPS11  28  ALDOA  ALDH1A1  SLC7A2  78  CD59  TTC3  GANAB  29  PDK4  PFKFB2  HNRNPA1  79  SLC30A8  RPS11  CDH1  30  RPL8  CPE  ANXA2  80  CPE  G6PC2  PEG10  31  H3F3B  C10orf10  RPL7  81  CLPS  GRN  CLDN4  32  IGFBP7  TMBIM6  RPS18  82  CTSD  SERPINA1  GSTP1  33  S100A6  CRYBA2  PCSK1  83  ATP1B1  SSR4  TUBA1A  34  EEF2  FTX  ATP1A1  84  OLFM4  RPS6  RPS27A  35  TIMP1  HSPA8  IDS  85  TAGLN2  OAZ1  PRPF8  36  CFL1  HSP90AA1  GDF15  86  SCGN  MARCKS  HSPB1  37  GRN  H3F3B  RPS3  87  SERPING1  RPL15  RPS8  38  SPINT2  SLC30A8  RPSA  88  WFS1  SQSTM1  RPS12  39  SQSTM1  TLK1  CSDE1  89  LAPTM4A  RASD1  ACLY  40  KRT19  ETNK1  CLTC  90  TAAR5  DSP  MSN  41  CD63  B2M  RPL10  91  SLC22A17  COX8A  HNRNPA2B1  42  SLC40A1  DDX5  YWHAZ  92  RPL3  TIMP1  CTNNB1  43  G6PC2  FOS  RPL3  93  HERPUD1  ATP1B1  MORF4L1  44  REG1A  MAFB  SLC30A8  94  CD24  WFS1  SERINC1  45  DDX5  CD59  RPL6  95  CALR  PRDX3  KRT19  46  PCBP1  TM4SF4  TMSB10  96  CLDN7  CHP1  NCL  47  C6orf62  TMEM33  CD44  97  LAMP2  YWHAE  GPX4  48  CRYBA2  CAPZA1  NPM1  98  CST3  FAM46A  GNB1  49  CD74  CALM2  B2M  99  TMBIM6  RUFY3  RPS7  50 HLA    The x-axis is the log transformed average relative abundance across cells from the same cell type, and the y-axis is the subject label. The relative abundance of gene GC is widely spread across the x-axis while the relative abundance of gene TTR is more concentrated across subjects. We consider gene GC as non-informative and TTR as informative. b. Comparison of log transformed relative abundance levels between real bulk tissue RNA-seq data and artificially constructed bulk RNA-seq data for the same subject. Single-cell and bulk tissue RNA-seq data are both from Segerstolpe et al. Each dot represents a gene and the gray line is x=y. c. Heatmap of true and estimated cell type proportions. In addition to the four methods described in the main text, we also evaluated the estimates given by MuSiC and NNLS when using only the marker genes used in BSEQ-sc. Source data are provided as a Source Data file.  Deconvolution results when the single-cell reference is from the 12 healthy subjects of Xin et al. with leave-one-out, i.e., for each subject under deconvolution, only single-cell data from the remaining 11 subjects were used as single-cell reference. c.The cell type proportions for the artificial bulk data are manually adjusted so that beta cells are the dominant cell type, as expected in real bulk tissue. Alpha cells dominate in the scRNA-seq data due to dissociation and capture bias. Thus, this analysis mirrors the real data analysis scenario where cell type proportions differ substantially between scRNA-seq reference and bulk tissue. In more detail, we combined cells from two subjects as one artificial bulk tissue RNA-seq dataset, for example, H1.2 combined cells from subject H1 and H2. Then we dropped 75% of the alpha cells at random. The single-cell reference is from the 6 healthy subjects of Segerstolpe et al. Here, all methods that rely on pre-selected marker genes from CIBERSORT are heavily biased by the cell type proportions in the single cell reference, and miss the true cell type proportions in the bulk tissue data. In comparison, MuSiC is able to adjust to the difference between scRNA-seq reference and bulk data. Source data are provided as a Source Data file.  The artificial bulk data and the single-cell reference are both from Segerstolpe et al. We constrained our analysis to the 6 major cell types: alpha, beta, delta, gamma, acinar and ductal cells. The artificial bulk data is constructed by summing read counts from the 6 major cell types while the single-cell reference contains only 5 cell types (the column header shows the cell type that is missing in the single-cell reference). The x-axis labels cell types used in the single-cell reference and the y-axis shows the subject label. The top panel shows the true composition, while panels below it show the results from each method. See Supplementary  Table 3 for detailed evaluation results. Source data are provided as a Source Data file.    SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8   SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2  D8  SO  D2