Introduction

Skeletal muscle diseases result in decreased muscle mass and muscle dysfunction thereby inducing physical disability and increased mortality1. Skeletal muscle dysfunction has been shown to contribute to decreased quality of life, increased disease morbidity and mortality in respiratory illness including for example, chronic obstructive pulmonary disease (COPD), pulmonary arterial hypertension and acute respiratory distress syndrome (ARDS)2,3,4. Despite the profound clinical implications, knowledge of the molecular mechanisms of muscle dysfunction, as well as objective, non-volitional methods to quantify the degree of muscle dysfunction are insufficient5,6. Understanding the pathomolecular mechanisms conserved across muscle diseases may provide vital insight to help develop therapies to ameliorate them.

A growing number of studies of human muscle disease have identified dysregulated gene expression that is associated with disease severity1,7,8,9. These studies however, are usually limited by relatively small sample sizes without external validation from independent cohorts10. Moreover, individually these studies are not representative of biological and clinical heterogeneity observed in the real-world patient population, which substantially limits their generalizability. The vast quantity of expression profiling data in the public repositories Gene Expression Omnibus (GEO) and ArrayExpress represents novel opportunities to address these challenges by facilitating comprehensive integration of human muscle disease cohorts for meta-analysis.

We applied a multi-cohort analysis framework11,12 that leverages the biological, clinical, and technical heterogeneity across independent data sets to identify a reproducible disease gene signature13,14. This approach has discovered robust signatures in organ rejection13, neurodegenerative diseases14, sepsis15, tuberculosis16, viral infections17, vaccination18, and systemic sclerosis19, many of which have been successfully validated in prospective independent cohorts20,21,22. We hypothesized that convergent transcriptional abnormalities occur across muscle diseases regardless of the specific muscle pathophysiology and that the relative expression of these genes is associated with the degree of muscle dysfunction. To the best of our knowledge, this is the largest systematic multi-cohort analysis investigating transcriptional changes across multiple human muscle diseases.

We identified a conserved gene signature across five muscle disease categories including muscular dystrophies, inflammatory myopathies, critical illness myopathy, and chronic systemic diseases associated with muscle dysfunction such as chronic obstructive pulmonary disorder (COPD). Importantly, we validated the discriminatory power of this signature in other diseases with muscle phenotypes that were not part of the discovery meta-analysis, cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS). We found that the common muscle disease gene signature is significantly associated with clinical and histological disease severity in independent validation cohorts. Finally, we identified patterns of gene dysregulation unique to each muscle disease category relative to the others.

Portions of this manuscript, including the methods section, have been presented previously reported in a PhD thesis by the first author1. The present study has an expanded number of patient cohorts that were not included in the PhD thesis.

Results

Meta-analysis identifies a common gene signature of muscle diseases

A total of 45 independent patient cohorts that profiled human muscle diseases and normal muscle controls (862 cases, 512 controls), comprising 1374 samples met criteria for inclusion (Supplementary Fig. 1, Table 1). Collectively, the cohorts represent a broad range of patient ages and peripheral muscles from both upper and lower extremities. Available phenotypic data for patient samples included in public repositories is shown in Supplementary Table 1 and summary descriptions of each study are found in Supplementary Document 1.

Table 1 Summary of public gene expression-based discovery and validation data sets used in the meta-analysis.

For the discovery cohort we ensured that there were at least three cohorts for each disease category that met our inclusion criteria. As there were only two cohorts for the MND category, this was not included in the discovery cohort as a disease category; instead, the two MND cohorts were included in the secondary validation cohort.

We chose smaller patient cohorts (< 30 samples) for the discovery meta-analysis and reserved larger patient cohorts and/or cohorts with clinical measures of muscle mass or strength or histologic assessments for validation analysis. For the discovery meta-analysis, 30 patient cohorts (348 cases, 288 controls) containing at least three cohorts from each of the five muscle disease categories were analyzed.

To identify the most robust differentially expressed (DE) genes across muscle diseases measured on multiple different microarray platforms we performed gene expression meta-analysis10,11 using a “leave-one-disease-out” strategy to correct for heterogeneity in genes DE between muscle disease categories and to avoid one muscle disease influencing the overall analysis, as described before13,14. We identified 209 genes that remained significantly DE in all 5 iterations of the “leave-one-disease-out” analysis (Supplementary Table 2).

We then applied an iterative greedy forward search15 to the 209 genes and identified a set of 52 genes (36 up-regulated, 16 down-regulated) that was optimized for discriminatory power termed the Common Muscle disease Module (CMDM). As expected, the 52-gene CMDM score distinguished muscle disease from healthy controls with summary area under the curve (AUC) = 0.91 (95% confidence interval [CI] 0.83- 0.96) in the discovery cohorts (Fig. 1A,B).

Figure 1
figure 1

Discovery of the 52-gene signature expressed across human muscle diseases. (A) Meta-analysis and leave-one-disease-out analysis reveal common differentially expressed genes across muscle diseases. (B) Representative forest plots of most up-regulated (CHRNA1, left) and most-downregulated (CAMK2II, right) across muscle disease. The x axis represents standardized mean difference (Hedges’ g in z-scaled log 2 values) between muscle disease and controls. Summary effect sizes for each gene across all cohorts are represented as a yellow diamond.

Next, we tested the CMDM 52- gene signature in the validation dataset of 12 cohorts (N = 657 total samples; Table 1). The validation set included at least one cohort belonging to each of the five muscle disease categories. (Table 2; Supplementary Table 3). The CMDM accurately identified muscle disease samples in most cohorts (summary AUC = 0.91 [95% CI 0.77–0.97]) (Fig. 2A).

Table 2 Common muscle disease module (CMDM) genes.
Figure 2
figure 2

Validation of the 52-gene set of genes expressed across human muscle diseases. (A) ROC curves comparing 657 patients in the validation cohort. (B) ROC curves comparing 81 patients in the secondary validation cohort. (C) Heat map shows consistent differential expression in the majority of discovery, validation, and secondary validation cohort data sets. Columns represent CMDM genes ranked from the highest to the lowest standardized mean difference (Hedges’ g in z-scaled log 2 values) from left to right. Rows denote data sets used in each stage of meta-analysis, arranged by unsupervised hierarchical clustering using Ward’s minimum variance method. (D) Violin plots of CMDM muscle disease severity scores for a cohort of congenital muscle diseases, classified by degree of fibrosis (none, mild, moderate, severe), GSE109178. Error bars show middle quartiles. P values calculated with Wilcoxon rank-sum test. Jonckheere's trend test shows significant association (two-tailed p ≤ 0.05) p = 4.45 × 10–9. Refer to Table 1 for data set information. ICUAW intensive care unit acquired weakness, IM inflammatory myopathies, DI disuse and immobility, CMD congenital muscle disorders, CSM chronic systemic diseases affecting muscle. Genes unavailable for a dataset are shown in grey.

An additional three cohorts (two cohorts from amyotrophic lateral sclerosis [ALS] and one cohort from cerebral palsy [CP]) could not be classified into any of the five muscle disease categories present in the discovery and validation sets. Therefore we tested these three cohorts (N = 81 total samples; Table 1) as a secondary validation set to assess the generalizability of the CMDM. The CMDM accurately identified muscle disease samples in two of the three cohorts (ALS-GSE3307 and GSE31243); summary AUC of all three cohorts = 0.92 [95% CI 0.54–0.99]) (Fig. 2B).

Visual inspection of the heatmap of the 52 CMDM genes in Fig. 2C shows the pattern of expression is generally highly consistent between the discovery and validation set, as well as the secondary validation set, further supporting the generalizability of the CMDM to muscle diseases. One notable exception was the cohort GSE13608 in the validation analysis with AUC 0.23 (95% CI 0.14–0.32), which was reflected in the heatmap showing gene expression opposite to the majority of genes across the meta-analysis.

CMDM score significantly associates with clinical and histological measures of disease severity

When selecting differentially expressed genes using the multi-cohort analysis, we did not consider disease severity. Every sample was classified as either “control” or “case.” As muscle disease severity exists along a continuum, we hypothesized that the summary expression of the CMDM would correlate with the severity of muscle disease and clinical measures of muscle function. We calculated and then correlated CMDM scores for each cohort to measures of disease severity and extent, muscle mass, strength, and function, when this detail was provided.

Five of the cohorts reported disease severity scores and/or clinical measures of muscle mass, strength and function. The cohort GSE109178 assessed the degree of muscle fibrosis histologically in patients with dystrophic subtypes of CMD, and categorized cases into subgroups of normal, mild, moderate, or severe fibrosis (Fig. 2D, Supplementary Table 1). Cohorts GSE78929 (ICUAW), GSE34111 (CSM) and EMEXP3260 (ALS), reported strength, muscle mass and/or physical functional capacity and the ALS cohort was additionally classified as “early” vs “late” disease severity. We found CMDM summary expression scores were significantly correlated to the histologic measures of disease severity and clinical measures of muscle mass, strength and function (Table 3 and Fig. 3A–F).

Table 3 Associations of common muscle disease module (CMDM) scores with clinical and histological measures of disease severity.
Figure 3
figure 3

CMDM score significantly associates with clinical and histological severity in ALS and ICUAW. Plots of (A) CMDM scores in violin plots for controls, early ALS, and late ALS (EMEXP3260) Error bars show middle quartiles. P-values calculated with Wilcoxon rank-sum test. (B) Grading score of muscle atrophy in ALS (based on histology) versus CMDM score. (C) Shoulder abduction (muscle strength) versus CMDM score. (D) CMDM scores in violin plots for controls, early ICUAW (Day 7 post-ICU), and sustained ICUAW (month 6 post-ICU; GSE78929). (E) CMDM scores versus functional independence measure [FIM] motor subscore. (F) CMDM score versus quadriceps muscle mass. Each dot corresponds to individual samples. ICUAW ICU acquired weakness, ALS acute amyotrophic lateral sclerosis.

Meta-analysis highlights common mechanisms of muscle diseases

We next sought to identify conserved pathways dysregulated across muscle diseases using meta-analysis. Gene Set Enrichment Analysis (GSEA) evaluated the enrichment of Gene Ontology (GO) terms in the complete ranked list of genes based on expression relative to controls from both discovery and validation cohorts combined. A total of 74 GO Biological Process (BP) terms were significantly enriched (FDR q-value < 0.05) after removing redundant GO terms (Supplementary Table 4). Twelve GO terms were down-regulated and 62 up-regulated. Networks of overlapping significantly enriched up- and down-regulated GO terms were visualized to aid in the interpretation of the GO enrichment results (Supplementary Fig. 3).

The most down-regulated and up-regulated gene sets based on enrichment score (ES) were regulation of skeletal muscle adaptation (ES = -0.80, q-value = 7.9 × 10–4) and macrophage migration (ES = 0.73, q-value = 7.3 × 10–4), respectively. Nine of the 12 down-regulated gene sets were related to mitochondrial metabolism, including 2-oxoglutarate metabolism and synthesis of (ubi)quinone, a redox-active lipid that participates in several processes including mitochondrial electron transport. Four upregulated gene sets were related to collagen metabolism and extracellular structure organization. Forty-four upregulated gene sets were related to immune system processes including neutrophil activation, innate immune response and antigen processing.

Disease-specific patterns of gene expression changes

We hypothesized that functional analysis of a disease category-specific gene signature, after removing genes shared with other disease categories, would provide insights into the unique pathomolecular mechanisms underlying each individual muscle disease. Thus, for each disease category we used the meta-analysis approach to generate a rank ordered list of genes based on expression relative to controls for the combined discovery and validation cohorts. We next visually examined the location of each of the CMDM genes, within the ordered list of genes, for each disease category (Fig. 4A). The CMDM genes were more densely distributed amongst the most up- and down-regulated genes for each muscle-specific category gene list, validating that the CMDM genes are similarly dysregulated in each muscle disease.

Figure 4
figure 4

Disease-specific meta-analysis. (A) Distribution of the 52 CMD genes among individual disease meta-analysis gene lists. Each line presents the presence of a CMDM gene among the 24,572 gene probes generated from disease-specific meta-analysis ranked from the most positive standardized mean difference (left) to the most negative standardized mean difference (right). (B) Disease-specific meta-analysis after removing genes differentially expressed across the other four disease categories, identifies genes more strongly expressed in a single disease. Top 10 up-regulated and top 10 down-regulated genes shown (if 20 or more genes present). ICUAW intensive care unit acquired weakness, IM inflammatory myopathies, DI disuse and immobility, CMD congenital muscle disorders, CSM chronic systemic diseases affecting muscle.

We then utilized the “leave-one-disease-out” meta-analysis approach to iteratively generate ranked lists containing four of the five disease categories. Significantly DE genes identified in the four-disease meta-analysis gene list were then removed from the single disease category meta-analysis gene list. We removed 359, 99, 200, 484, and 365 genes from the gene lists for CMD, IM, ICUAW, DI, and CSM, respectively.

The disease-specific gene lists represent genes that are expressed more strongly in a specific muscle disease category (Fig. 4B). The gene lists were then assessed for GO term enrichment to identify disease-specific pathways (Supplementary Fig. 4AE. Supplementary Table 5A–E). Despite removing genes significant to other disease categories, persistent down-regulation of (i) gene sets related to mitochondrial electron transport were found in IM, DI, ICUAW, and CSM cohorts and (ii) gene sets related to mitochondrial translation were found in CMD, DI, and CSM cohorts. Significant down-regulation of genes sets related to muscle contraction were identified in CMD, IM, and ICUAW. Upregulation of mRNA splicing via spliceosome was found in ICUAW and DI. Upregulation of extracellular matrix organization genes were observed in CMD and IM. Significant up regulation of NF-kB signaling genes was found in IM and ICUAW categories. Only ICUAW had down-regulation of genes related to cell fate specification, including SOX17, a transcription factor induced during satellite cell specification23.

Subcellular localization analysis of CMDM genes

We explored whether the CMDM gene signature is overrepresented in certain subcellular compartments. The majority of CMDM genes (98.1%) mapped to at least one subcellular localization. The vesicular exosome contained the greatest proportion of CMDM genes (25%) and was significantly overrepresented in CMDM signature (q-value = 0.039) (Supplementary Table 6A and 6B, Supplementary Fig. 5). One gene (FST) was extracellular.

Discussion

We analyzed the reported transcriptomes of 1374 individual muscle samples collected from 45 independent patient/control cohorts classified into five categories of skeletal muscle disease and derived using multiple microarray platforms, to identify and validate a robust and reproducible gene signature of muscle disease. This analysis leveraged both biological and technical heterogeneity across multiple independent cohorts in the discovery cohort to avoid overfitting and validated the CMDM signature using cohorts containing larger sample sizes to reduce technical heterogeneity11. To assess the generalizability of the signature we examined three more cohorts that could not be classified within the five specified muscle disease categories and found the signature to be reproducible in these cohorts as well.

Although there are heterogeneous muscle types and diverse genetic and acquired causes of different muscle diseases, the 52-gene CMDM reflected convergent transcriptional pathways across peripheral skeletal muscles affected by disease. This implies that prior characterization of any of the genes in the CMDM may be relevant across muscle diseases. Several of the genes in the CMDM have been associated with muscle disease previously, whereas many remain unknown or poorly characterized in skeletal muscle.

Increased expression of cholinergic receptor nicotinic, alpha 1 (CHRNA1), the most robustly DE CMDM gene, has been recognized as a marker of severity of muscle denervation24,25. Up-regulation of CHRNA1 has been reported to be associated with dynamic epigenetic modifications of the gene in a rat model of disuse-induced atrophy26. The most down-regulated CMDM gene CAMK2G, calcium/calmodulin-dependent protein kinase type II (CaMKII) subunit gamma, is involved in sarcoplasmic reticulum Ca2+ transport in skeletal muscle and has been shown to remain active after exercise27. While agonists of CaMKII have been proposed as potential pharmacologic therapies of in various muscle disease28 , it has remained unclear which of the CaMKII subunits is most important in the regulation of skeletal muscle adaptation, response to injury and activity, and oxidative capacity as these subunits are currently not well characterized. Given that CAMK2G is down-regulated across most muscle diseases in this study we propose that it may be a suitable target for future studies of potential therapeutics.

Although able to robustly identify a broad range of muscle diseases, the CMDM signature more importantly strongly correlates with clinical and histological measures of disease severity, providing persuasive evidence that the signature could have future applications as a biomarker for phenotyping muscle disease. The CMDM signature could specifically provide diagnostic information and quantify the molecular response to therapy for muscle disease. Measuring changes in CMDM scores after treatment may improve the identification of therapy responders, and using it at enrollment in therapeutic trials may aid the stratification of patients within trial arms. Furthermore this signature could also serve to phenotype patients with COPD, ICUAW and other chronic respiratory diseases based on the extent of muscle dysfunction.

We applied gene set enrichment analysis to identify functional pathways that are similarly altered across muscle diseases. As expected, genes involved in skeletal muscle skeletal adaptation and mitochondrial function were down-regulated. Coordinate down-regulation of mitochondrial genes has been described in a number of muscle diseases29,30,31,32. The predominant up-regulated functional terms were related to immune activation. Muscle damage secondary to disease induces immune activation culminating in inflammation and deposition of extracellular matrix (ECM)33,34. Skeletal muscle diseases are characterized by up-regulation of ECM genes including collagen, with progressive development of fibrosis leading to dysfunctional muscle tissue35,36. Collectively, these findings are consistent with literature in chronic skeletal muscle diseases proposing the convergence of final common pathways including chronic inflammation, fibrosis, oxidative stress, and mitochondrial dysfunction36.

We next removed genes similarly dysregulated across muscle disease to identify pathways altered within specific muscle diseases. This strategy identified pathways unique to ICUAW as well as those shared with other muscle diseases. Significant up-regulation of NF-kB signaling genes was found in IM and ICUAW. NF-kB has been previously shown to play a role in IM37,38, has been studied in animal models of cancer cachexia and ICUAW39,40,41 and has been shown to be an inhibitor of skeletal myogenesis and muscle regeneration42. Remarkably, only ICUAW had down-regulation of genes related to cell fate specification. Decreased numbers of satellite cells (precursors to skeletal muscle cells) in ICUAW sustained long term, compared to healthy controls have been detected histologically43, supporting the finding of a down-regulated stem-cell gene set.

An unexpected finding of our meta-analysis was that the CMDM signature is enriched for genes targeted to the exosomal vesicle. Vesicular exosomes, cell derived vesicles containing signaling factors (including genes and microRNAs) for intercellular communication, have been found to have roles in muscle regeneration and congenital muscle diseases44. Monitoring exosomal miRNAs has been proposed as a non-invasive method for tracking muscle disease progression44,45. Future studies will assess whether plasma protein concentrations of the exosomal CMDM genes correlate with muscle severity to the same extent as their transcripts.

Our meta-analysis has limitations despite its comprehensiveness. Although most included studies attempted to select patients without co-morbidities that span more than one muscle disease category, there are potentially multiple pathologies in some of the muscle samples. Given the number of cohorts and size of the overall study, such confounding is likely to be minimal. The broad inclusion criteria applied in this study has identified a robust disease signature that reflects the heterogeneity observed in the real-world patient population. The considerable variance in gene expression profiles between the different muscle tissue sites46 included in this analysis is expected to have reduced the number of significant genes, while increasing the generalizability of the significant genes detected. We primarily focused on identifying a gene signature that is conserved between several muscle disease categories and across samples. Although this is beneficial for capturing features that are consistent across multiple diseases, it is ill-suited for identifying subgroups of disease.

Based on the use of microarray data from multiple platforms, we cannot test for alterations in splicing regulation, which has been associated with several congenital muscle diseases including the most common adult onset muscular dystrophies47. Analysis of RNA-seq transcriptome data will be necessary to determine whether altered splice variants lead to muscle pathology in other disease categories. Identification of conserved epigenetic signatures of muscle disease will provide important insights into the underlying mechanisms resulting in gene transcriptomic dysregulation identified here, once future epigenome-wide association studies of various muscle diseases are available.

The cohort GSE34111 had a global expression pattern that differed markedly from the other muscle diseases and disease categories. As this cohort was the only one in the analysis that included cancer cachexia, it remains unclear whether the difference in global expression pattern reflects significant differences in the pathomechanism of cancer cachexia or technical or experimental differences in the study. Future analysis comparing peripheral muscle from patients with cancer cachexia and controls are required. Within the validation set, the chronic systemic disease and ICUAW categories each consisted of one cohort, reducing the power to detect significant effect size differences from controls within these disease categories. For this reason, disease specific pathway analysis was performed by combing both discovery and validation cohorts.

CMDM genes may be conceptually divided into those having direct etiological contribution to muscle disease and those that represent a secondary phenomenon in the development of muscle disease, include stress-related changes or cell survival mechanisms48. Further experimentation will be required to identify the CMDM genes directly contributing to disease as these genes are expected to be good candidates for novel disease modifying therapies14. CMDM genes without functional annotation can be prioritized for future experimental evaluation based on the strength of the molecular data (e.g. effect size or correlation with clinical phenotype). Direct experimentation will be necessary to determine the role of the dysregulated genes and pathways in muscle disease as either causal drivers or responses to muscle disease.

Our results identify a conserved muscle disease transcriptional signature associated with clinical and histologic disease severity, and identify numerous novel genes associated with muscle disease severity. Muscle disease specific analysis identifies pathways uniquely altered in ICUAW. Thus our findings serve as a valuable resource for interpreting disease mechanisms, connecting findings across muscle diseases, and driving novel hypotheses.

Methods

The analysis workflow is shown in Supplementary Fig. 1.

Data collection and pre-processing

Two public gene expression microarray repositories (ArrayExpress, NIH GEO) were searched for human muscle disease datasets (search date: Aug 29, 2019). We found 45 independent datasets with 1374 muscle biopsies that met our inclusion criteria (Supplementary Methods).

We divided the sample cohorts into 6 disease categories for analysis: (1) inflammatory myopathies (IM), (2) ICU acquired weakness (ICUAW), (3) congenital muscle diseases (CMD), (4) chronic systemic disease affecting muscle (CSM), (5) disuse and immobility (DI), (6) motor neuron disease (MND). Next, we divided the patient cohorts into a discovery cohort for the initial meta-analysis and a validation cohort for the independent validation analysis. For the discovery cohort we ensured that there were at least three cohorts for each disease category that met our inclusion criteria. As there were only two cohorts for the MND category, this was not included in the discovery cohort as a disease category; instead, the two MND cohorts were included in the secondary validation cohort.

Normalization and probe expression summarization are described in Supplementary Methods. The number of studies measured for each gene are listed in Supplementary Tables 2 and 3.

Meta-analysis

Multicohort meta-analysis of gene expression was performed (using R package MetaIntegrator)12 as described in the Supplementary Methods. The utility of the leave-one-disease-out approach in identifying a robust gene expression signature during acute rejection across different transplanted solid organs13 and across neurodegenerative diseases14 has been shown before.

Derivation of common muscle disease module (CMDM) score

We applied a greedy forward search as described in the Supplementary Methods section to identify a gene signature maximized for diagnostic power, termed the Common Muscle Disease Module (CMDM).

Validation of CMDM score and correlation of the CMDM genes with clinical and histological severity

Tukey’s Biweight correlation was used to assess the association of the CMDM score with the histologic and clinical measures. Between- and within-group CMDM score comparisons were done with the Wilcoxon rank sum test.

Muscle disease category specific meta-analysis

To identify patterns on gene expression changes that are unique to each muscle disease category we performed meta-analysis using the combined data of the discovery and validation cohorts. We first analyzed each disease category separately, as well as the other four diseases together.. Genes that were significantly differentially expressed in the four-disease category meta-analysis were then removed from the individual disease category meta-analysis, thereby removing the DE genes common to all muscle disorders, from the disease-specific gene list.

Gene ontology functional analysis identified functional themes within differentially expressed genes across muscle disease categories as described in the Supplementary Methods.

Subcellular localization analysis was performed for each gene within the CMDM as described in the Supplementary Methods.

All analyses were completed in R language for statistical computing (version 3.4.1). Significance levels were set at two-tailed p < 0.05, unless specified otherwise.