Enhancing prognostic power in multiple myeloma using a plasma cell signature derived from single-cell RNA sequencing

Multiple myeloma (MM) is a heterogenous plasma cell malignancy, for which the established prognostic models exhibit limitations in capturing the full spectrum of outcome variability. Leveraging single-cell RNA-sequencing data, we developed a novel plasma cell gene signature. We evaluated and validated the associations of the resulting plasma cell malignancy (PBM) score with disease state, progression and clinical outcomes using data from five independent myeloma studies consisting of 2115 samples (1978 MM, 65 monoclonal gammopathy of undetermined significance, 35 smoldering MM, and 37 healthy controls). Overall, a higher PBM score was significantly associated with a more advanced stage within the spectrum of plasma cell dyscrasias (all p < 0.05) and a shorter overall survival in MM (hazard ratio, HR = 1.72; p < 0.001). Notably, the prognostic effect of the PBM score was independent of the International Staging System (ISS) and Revised ISS (R-ISS). The downstream analysis further linked higher PBM scores with the presence of cytogenetic abnormalities, TP53 mutations, and compositional changes in the myeloma tumor immune microenvironment. Our integrated analyses suggest the PBM score may provide an opportunity for refining risk stratification and guide decisions on therapeutic approaches to MM.

The UAMS Study: Two gene expression datasets of the UAMS study were used in our analysis.UAMS-I (GEO accession number: GSE136400) included 426 patients with newly diagnosed MM (NDMM) enrolled between 2004 and 2019. 2 Gene expression profiles were generated using Affymetrix Human Genome U133 Plus 2.0 Array in 867 whole bone marrow (WBM) samples, for which 401 had matched purified tumor CD138+ cells. 2 UAMS-I study was used as a validation set for the survival analysis.UAMS-II (GEO accession number: GSE5900) profiled gene expression of CD138-selected cells from the BM of 22 healthy controls, 44 MGUS, and 12 SMM patients using Affymetrix Human Genome U133 Plus 2.0 Array. 8UAMS-II data was used for the analysis of MM development.

The MAQC-II Study:
The MAQC-II study consists of 36 independent teams analyzing six microarray datasets to generate predictive models for lung or liver toxicity in rodents, breast cancer, MM, or neuroblastoma in humans. 3The MM dataset (GEO accession number: GSE24080) including 559 NDMM was contributed by the Myeloma Institute for Research and Therapy at the UAMS.Gene expression profiling of highly purified bone marrow plasma cells was performed with Affymetrix Human Genome U133 Plus 2.0 Array and used as a validation set in the survival analysis.We noticed that there were 136 samples overlapped between the UAMS-I and MAQC-II datasets and have removed the expression data of the 136 samples from the analysis of MAQC-II study.
The APEX Study: This dataset included patients with relapsed myeloma enrolled in phase 2 and phase 3 clinical trials of bortezomib. 4 Pretreatment tumor CD138+ cells from the BM were profiled with Affymetrix Human Genome U133A/B Array.0][11] For simplicity, we used "APEX" to represent this study in our analysis.The APEX dataset was used as a validation set in the survival analysis.
The Mayo Clinic Study: The Mayo Clinic study included 15 healthy controls, 21 MGUS, 23 SMM, 75 NDMM, and 28 relapsed/refractory MM (RRMM) patients. 12Gene expression (GEO accession number: GSE6477) was performed on CD138-selected BM cells using the Affymetrix Human Genome U133A chip.The Mayo Clinic dataset was used in the analysis of MM development.

Development of the Plasma B-cell Signature
To define the plasma cell signature, we referred to the immune cell specific marker genes curated by the PanglaoDB database. 13The database integrated a large number of single-cell RNA-seq datasets from previous publications and defined high-resolution human immune cell clusters. 13ese clusters were corresponding to 16 different types of the human immune cell types, including naive B-cell, memory B-cell, and plasma B-cell.The data do not include malignant B cells, e.g., CD138+ cells from MM samples.The database manually curated a list of candidate marker genes for each immune cell type based on 4,598 previous publications.Following that, a ubiquitousness index (UI) and a specificity index (SI) were calculated to quantify the specificity of these candidate markers.Based on the provided information, we selected 84 genes specifically expressed in normal plasma B-cells to form the plasma B-cell (PB) signature (Supplementary Table 2).As controls, we also defined signatures for naive B-cell, memory B-cell, and 13 other immune cell types (macrophage, dendritic cell, plasmacytoid dendritic cell, myeloid-derived suppressor cell, mast cell, natural killer cell, neutrophil, eosinophil, basophil, CD4+ T-cell, CD8+ T-cell, regulatory T-cell, memory T-cell) in a similar way.

Calculation of the Plasma B-Cell Malignancy Score
Given a MM gene expression dataset, we used the plasma B-cell signature to calculate a patientspecific score named plasma B-cell malignancy score (PBM score).PBM score quantifies the level of malignancy of CD138+ cells in MM tumors, with a greater score indicating a higher level of malignancy.To calculate the PBM score, a rank-based statical framework called BASE (Binding Associated with Sorted Expression) 14 was applied to gauge the perturbed expression of PB signature genes.Briefly, the analysis includes the following steps.First, in a given MM gene expression dataset, the expression of each gene was log-transformed and normalized by subtracting its median expression across all samples.Second, the expression of all genes in each sample was sorted in the decreasing order.Third, the expression values of all signature genes were summarized considering their positions in the ranked gene list, resulting in a raw enrichment score for each sample.Finally, the raw score was normalized against the null distribution obtained from permutated data to obtain the PBM score.Of note, the BASE algorithm was originally designed for weighted gene signatures. 15,16Here we applied it to the unweighted gene sets.In this way, the BASE algorithm give rise to similar results with those from the single-sample GSEA analysis. 17

Gene Ontology Biological Process Analysis
The Gene ontology (GO) biological process (BP) enrichment analysis was performed to explore the biological functions of genes correlated with PBM scores.Using the CoMMpass data, we identified genes that were correlated with the PBM scores in their expression values across all MM samples.In total, 234 positively correlated genes (Spearman correlation coefficient r > 0.25, P < 0.001) and 1,075 negatively correlated genes (r < -0.25 and P < 0.001) were selected (Supplementary Table 3).GO enrichment analysis was implemented using the R package "clusterProfiler".We focused on biological process GO terms in our analysis. 18

Tumor Microenvironment Immune Cell Inference and Clustering Analysis
The UAMS-I dataset has gene expression profiles for paired WBM and tumor CD138+ cell samples of 401 patients.By referring to the method described by Danziger et al 2 , we inferred the infiltration levels of major immune cell types based on deconvolution analysis.Specifically, the digital cell quantifier (DCQ) algorithm, implemented by an R package "ADAPTS", was used to deconvolute WBM expression profiles. 19,20A signature matrix named MGSM27 was used as the reference for deconvolution, which included the 22 leukocyte signature matrix (LM22), 4 bone marrow cell types (adipocyte, osteoclast, plasma memory cell, and osteoblast), and the malignant plasma B cell. 20,21ter deconvolution analysis, we obtained the proportion of the above-list 27 cell types in each WBM sample.Then we combined these results with CD138+ cell expression data to calculate TIME-specific (CD138-cells) gene expression levels of all genes.Specifically, the proportion of the CD138+ cells (Prop CD138+ ) in each WBM sample was estimated as the sum of the plasma.cells,B.cells.memory,MM.plasma.cells, and PlasmaMemory. 2The expression profiles of the CD138-cells (e CD138-) in the WBM of each patient were then estimated as the following: Based on the estimated CD138-expression profiles, we performed unsupervised clustering analysis by using the top 5% most expressed genes (n=805 genes).Specifically, the R package "ConsensusClusterPlus" was applied to implement this analysis, leading to 5 MM clusters as reported in the original study. 2,22

The Selection of Myeloma Driver Mutations
To investigate the relationship between PBM score and myeloma driver mutations, 33 genes with nonsynonymous somatic mutations in at least 30 samples in the CoMMpass dataset were listed as candidates (Supplementary Table 6).Ten out of the 33 genes that were also included in the COSMIC Cancer Gene Census (CGC) 23 were selected for the subsequent analysis, including KRAS, NRAS, MUC16, BRAF, FAT4, LRP1B, FAT3, CSMD3, FAT1, and TP53.

Integrative Models for Predicting Prognosis
To examine the additional prognostic value contributed by the PBM score, we constructed prognostic models that include clinical variables with and without PBM scores.The performance of these models was evaluated and compared based on a 5-fold cross-validation.Specifically, the samples in a given dataset were randomly divided into 5 subgroups.Each time, 4 subgroups were selected and merged into the training data, and the other subgroup was used as the test data.The training data were used to fit a multivariate Cox regression model, which was subsequently used to predict the prognostic risk of samples in the test data.This procedure was rotated 5 times until the prognostic risk of all samples was predicted exactly once.The predicted prognostic scores were compared with known survival information to evaluate the performance of the models.The performance of all models was evaluated 100 times, each from different subgroup randomization.The R package "rms" was used for model construction and evaluation.Specifically, the "cph" function was used to construct the univariate and multivariate Cox regression models; the "validate" function was used to perform 5-fold cross-validation for each model; and the "survest" function was used to calculate the prognostic risk (i.e., survival probabilities) of samples in the testing dataset.The "rcorr.cens"function of the R package "Hmisc" was used to calculate the Concordance index (C-index) of prognostic models.

Statistical Analysis
The R package "survival" was implemented to perform the survival analysis.Univariate and multivariate Cox proportional hazard models were constructed to estimate the association between PBM score and patient survival by calculating hazard ratios (HRs) and 95% confidence intervals (CIs).By using the median of PBM scores as the threshold, MM patients were stratified into two groups with "high" and "low" PBM scores, respectively.Kaplan-Meier method was used to plot their survival curves and log-rank test was used to compare their survival difference.
The difference between two samples groups were conducted using the Wilcoxon rank-sum test, and further confirmed by the Student's t-test for values with a normal distribution.The Benjamini-Hochberg method was used to correct multiple testing, e.g., in differential gene expression analysis.Receiver operating characteristic (ROC) analysis using PBM to predict the progression of MGUS was performed using the "roc" function of the R package "pROC".The cutoff value of the optimal PBM and the calculation of the area under the ROC curve (AUC) are conducted using the same function.All statistical analyses were conducted in the R environment (v4.0.2).