Comprehensive multi-cohort transcriptional meta-analysis of muscle diseases identifies a signature of disease severity

Muscle diseases share common pathological features suggesting common underlying mechanisms. We hypothesized there is a common set of genes dysregulated across muscle diseases compared to healthy muscle and that these genes correlate with severity of muscle disease. We performed meta-analysis of transcriptional profiles of muscle biopsies from human muscle diseases and healthy controls. Studies obtained from public microarray repositories fulfilling quality criteria were divided into six categories: (i) immobility, (ii) inflammatory myopathies, (iii) intensive care unit (ICU) acquired weakness (ICUAW), (iv) congenital muscle diseases, (v) chronic systemic diseases, (vi) motor neuron disease. Patient cohorts were separated in discovery and validation cohorts retaining roughly equal proportions of samples for the disease categories. To remove bias towards a specific muscle disease category we repeated the meta-analysis five times by removing data sets corresponding to one muscle disease class at a time in a “leave-one-disease-out” analysis. We used 636 muscle tissue samples from 30 independent cohorts to identify a 52 gene signature (36 up-regulated and 16 down-regulated genes). We validated the discriminatory power of this signature in 657 muscle biopsies from 12 additional patient cohorts encompassing five categories of muscle diseases with an area under the receiver operating characteristic curve of 0.91, 83% sensitivity, and 85.3% specificity. The expression score of the gene signature inversely correlated with quadriceps muscle mass (r = −0.50, p-value = 0.011) in ICUAW and shoulder abduction strength (r = −0.77, p-value = 0.014) in amyotrophic lateral sclerosis (ALS). The signature also positively correlated with histologic assessment of muscle atrophy in ALS (r = 0.88, p-value = 1.62 × 10–3) and fibrosis in muscular dystrophy (Jonckheere trend test p-value = 4.45 × 10–9). Our results identify a conserved transcriptional signature associated with clinical and histologic muscle disease severity. Several genes in this conserved signature have not been previously associated with muscle disease severity.

www.nature.com/scientificreports/ A growing number of studies of human muscle disease have identified dysregulated gene expression that is associated with disease severity 1,[7][8][9] . These studies however, are usually limited by relatively small sample sizes without external validation from independent cohorts 10 . Moreover, individually these studies are not representative of biological and clinical heterogeneity observed in the real-world patient population, which substantially limits their generalizability. The vast quantity of expression profiling data in the public repositories Gene Expression Omnibus (GEO) and ArrayExpress represents novel opportunities to address these challenges by facilitating comprehensive integration of human muscle disease cohorts for meta-analysis.
We applied a multi-cohort analysis framework 11,12 that leverages the biological, clinical, and technical heterogeneity across independent data sets to identify a reproducible disease gene signature 13,14 . This approach has discovered robust signatures in organ rejection 13 , neurodegenerative diseases 14 , sepsis 15 , tuberculosis 16 , viral infections 17 , vaccination 18 , and systemic sclerosis 19 , many of which have been successfully validated in prospective independent cohorts [20][21][22] . We hypothesized that convergent transcriptional abnormalities occur across muscle diseases regardless of the specific muscle pathophysiology and that the relative expression of these genes is associated with the degree of muscle dysfunction. To the best of our knowledge, this is the largest systematic multi-cohort analysis investigating transcriptional changes across multiple human muscle diseases.
We identified a conserved gene signature across five muscle disease categories including muscular dystrophies, inflammatory myopathies, critical illness myopathy, and chronic systemic diseases associated with muscle dysfunction such as chronic obstructive pulmonary disorder (COPD). Importantly, we validated the discriminatory power of this signature in other diseases with muscle phenotypes that were not part of the discovery meta-analysis, cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS). We found that the common muscle disease gene signature is significantly associated with clinical and histological disease severity in independent validation cohorts. Finally, we identified patterns of gene dysregulation unique to each muscle disease category relative to the others.
Portions of this manuscript, including the methods section, have been presented previously reported in a PhD thesis by the first author 1 . The present study has an expanded number of patient cohorts that were not included in the PhD thesis.

Results
Meta-analysis identifies a common gene signature of muscle diseases. A total of 45 independent patient cohorts that profiled human muscle diseases and normal muscle controls (862 cases, 512 controls), comprising 1374 samples met criteria for inclusion ( Supplementary Fig. 1, Table 1). Collectively, the cohorts represent a broad range of patient ages and peripheral muscles from both upper and lower extremities. Available phenotypic data for patient samples included in public repositories is shown in Supplementary Table 1 and summary descriptions of each study are found in Supplementary Document 1.
For the discovery cohort we ensured that there were at least three cohorts for each disease category that met our inclusion criteria. As there were only two cohorts for the MND category, this was not included in the discovery cohort as a disease category; instead, the two MND cohorts were included in the secondary validation cohort.
We chose smaller patient cohorts (< 30 samples) for the discovery meta-analysis and reserved larger patient cohorts and/or cohorts with clinical measures of muscle mass or strength or histologic assessments for validation analysis. For the discovery meta-analysis, 30 patient cohorts (348 cases, 288 controls) containing at least three cohorts from each of the five muscle disease categories were analyzed.
To identify the most robust differentially expressed (DE) genes across muscle diseases measured on multiple different microarray platforms we performed gene expression meta-analysis 10,11 using a "leave-one-disease-out" strategy to correct for heterogeneity in genes DE between muscle disease categories and to avoid one muscle disease influencing the overall analysis, as described before 13,14 . We identified 209 genes that remained significantly DE in all 5 iterations of the "leave-one-disease-out" analysis (Supplementary Table 2).
We then applied an iterative greedy forward search 15 to the 209 genes and identified a set of 52 genes (36 up-regulated, 16 down-regulated) that was optimized for discriminatory power termed the Common Muscle disease Module (CMDM). As expected, the 52-gene CMDM score distinguished muscle disease from healthy controls with summary area under the curve (AUC) = 0.91 (95% confidence interval [CI] 0.83-0.96) in the discovery cohorts (Fig. 1A,B).
Next, we tested the CMDM 52-gene signature in the validation dataset of 12 cohorts (N = 657 total samples; Table 1). The validation set included at least one cohort belonging to each of the five muscle disease categories. (Table 2; Supplementary Table 3). The CMDM accurately identified muscle disease samples in most cohorts (summary AUC = 0.91 [95% CI 0.77-0.97]) ( Fig. 2A).
An additional three cohorts (two cohorts from amyotrophic lateral sclerosis [ALS] and one cohort from cerebral palsy [CP]) could not be classified into any of the five muscle disease categories present in the discovery and validation sets. Therefore we tested these three cohorts (N = 81 total samples; Table 1) as a secondary validation set to assess the generalizability of the CMDM. The CMDM accurately identified muscle disease samples in two of the three cohorts (ALS-GSE3307 and GSE31243); summary AUC of all three cohorts = 0.92 [95% CI 0.54-0.99]) (Fig. 2B).
Visual inspection of the heatmap of the 52 CMDM genes in Fig. 2C shows the pattern of expression is generally highly consistent between the discovery and validation set, as well as the secondary validation set, further supporting the generalizability of the CMDM to muscle diseases. One notable exception was the cohort GSE13608 in the validation analysis with AUC 0.23 (95% CI 0.14-0.32), which was reflected in the heatmap showing gene expression opposite to the majority of genes across the meta-analysis. www.nature.com/scientificreports/ CMDM score significantly associates with clinical and histological measures of disease severity. When selecting differentially expressed genes using the multi-cohort analysis, we did not consider disease severity. Every sample was classified as either "control" or "case. " As muscle disease severity exists along a continuum, we hypothesized that the summary expression of the CMDM would correlate with the severity of muscle disease and clinical measures of muscle function. We calculated and then correlated CMDM scores for each cohort to measures of disease severity and extent, muscle mass, strength, and function, when this detail was provided. Five of the cohorts reported disease severity scores and/or clinical measures of muscle mass, strength and function. The cohort GSE109178 assessed the degree of muscle fibrosis histologically in patients with dystrophic subtypes of CMD, and categorized cases into subgroups of normal, mild, moderate, or severe fibrosis (Fig. 2D, Supplementary Table 1). Cohorts GSE78929 (ICUAW), GSE34111 (CSM) and EMEXP3260 (ALS), reported strength, muscle mass and/or physical functional capacity and the ALS cohort was additionally classified as "early" vs "late" disease severity. We found CMDM summary expression scores were significantly correlated to the histologic measures of disease severity and clinical measures of muscle mass, strength and function (Table 3 and Fig. 3A-F).

Meta-analysis highlights common mechanisms of muscle diseases.
We next sought to identify conserved pathways dysregulated across muscle diseases using meta-analysis. Gene Set Enrichment Analysis (GSEA) evaluated the enrichment of Gene Ontology (GO) terms in the complete ranked list of genes based on Table 1. Summary of public gene expression-based discovery and validation data sets used in the metaanalysis. IM inflammatory myositides, MMD myotonic muscular dystrophy, MD muscular dystrophies, DMD Duchene's muscular dystrophy, FSHD fascioscapulohumoral muscular dystrophy, LGMD2A limbgirdle muscular dystrophy type 2A, HIBM heritable IBM, POLG1, MD, TMD tibial muscular dystrophy, IBM inclusion body myositis, DM diabetes mellitus, GSD II glycogen storage disease type II, also called Pompe disease, POLG1 mitochondrial DNA polymerase γ, ICUAW intensive care unit acquired weakness, MODS multi-organ dysfunction syndrome. a Not published yet. b Deltoid muscle samples removed as FSHD typically affects biceps. c Same healthy controls used in subcohorts of GSE3307.  www.nature.com/scientificreports/  Table 4). Twelve GO terms were down-regulated and 62 up-regulated. Networks of overlapping significantly enriched up-and down-regulated GO terms were visualized to aid in the interpretation of the GO enrichment results (Supplementary Fig. 3).
The most down-regulated and up-regulated gene sets based on enrichment score (ES) were regulation of skeletal muscle adaptation (ES = -0.80, q-value = 7.9 × 10 -4 ) and macrophage migration (ES = 0.73, q-value = 7.3 × 10 -4 ), respectively. Nine of the 12 down-regulated gene sets were related to mitochondrial metabolism, including 2-oxoglutarate metabolism and synthesis of (ubi)quinone, a redox-active lipid that participates in several processes including mitochondrial electron transport. Four upregulated gene sets were related to collagen metabolism and (C) Heat map shows consistent differential expression in the majority of discovery, validation, and secondary validation cohort data sets. Columns represent CMDM genes ranked from the highest to the lowest standardized mean difference (Hedges' g in z-scaled log 2 values) from left to right. Rows denote data sets used in each stage of meta-analysis, arranged by unsupervised hierarchical clustering using Ward's minimum variance method. (D) Violin plots of CMDM muscle disease severity scores for a cohort of congenital muscle diseases, classified by degree of fibrosis (none, mild, moderate, severe), GSE109178. Error bars show middle quartiles. P values calculated with Wilcoxon rank-sum test. Jonckheere's trend test shows significant association (two-tailed p ≤ 0.05) p = 4.45 × 10 -9 . Refer to Table 1  www.nature.com/scientificreports/ extracellular structure organization. Forty-four upregulated gene sets were related to immune system processes including neutrophil activation, innate immune response and antigen processing.
Disease-specific patterns of gene expression changes. We hypothesized that functional analysis of a disease category-specific gene signature, after removing genes shared with other disease categories, would  www.nature.com/scientificreports/ provide insights into the unique pathomolecular mechanisms underlying each individual muscle disease. Thus, for each disease category we used the meta-analysis approach to generate a rank ordered list of genes based on expression relative to controls for the combined discovery and validation cohorts. We next visually examined the location of each of the CMDM genes, within the ordered list of genes, for each disease category (Fig. 4A). The CMDM genes were more densely distributed amongst the most up-and down-regulated genes for each musclespecific category gene list, validating that the CMDM genes are similarly dysregulated in each muscle disease. We then utilized the "leave-one-disease-out" meta-analysis approach to iteratively generate ranked lists containing four of the five disease categories. Significantly DE genes identified in the four-disease meta-analysis gene list were then removed from the single disease category meta-analysis gene list. We removed 359, 99, 200, 484, and 365 genes from the gene lists for CMD, IM, ICUAW, DI, and CSM, respectively.
The disease-specific gene lists represent genes that are expressed more strongly in a specific muscle disease category (Fig. 4B). The gene lists were then assessed for GO term enrichment to identify disease-specific pathways ( Supplementary Fig. 4A-E. Supplementary Table 5A-E). Despite removing genes significant to other disease categories, persistent down-regulation of (i) gene sets related to mitochondrial electron transport were found in IM, DI, ICUAW, and CSM cohorts and (ii) gene sets related to mitochondrial translation were found in CMD, DI, and CSM cohorts. Significant down-regulation of genes sets related to muscle contraction were identified in CMD, IM, and ICUAW. Upregulation of mRNA splicing via spliceosome was found in ICUAW and DI. Upregulation of extracellular matrix organization genes were observed in CMD and IM. Significant up regulation of NF-kB signaling genes was found in IM and ICUAW categories. Only ICUAW had down-regulation of genes related to cell fate specification, including SOX17, a transcription factor induced during satellite cell specification 23 .

Subcellular localization analysis of CMDM genes.
We explored whether the CMDM gene signature is overrepresented in certain subcellular compartments. The majority of CMDM genes (98.1%) mapped to at least one subcellular localization. The vesicular exosome contained the greatest proportion of CMDM genes (25%) and was significantly overrepresented in CMDM signature (q-value = 0.039) (Supplementary Table 6A and 6B, Supplementary Fig. 5). One gene (FST) was extracellular.

Discussion
We analyzed the reported transcriptomes of 1374 individual muscle samples collected from 45 independent patient/control cohorts classified into five categories of skeletal muscle disease and derived using multiple microarray platforms, to identify and validate a robust and reproducible gene signature of muscle disease. This analysis leveraged both biological and technical heterogeneity across multiple independent cohorts in the discovery cohort to avoid overfitting and validated the CMDM signature using cohorts containing larger sample sizes to reduce technical heterogeneity 11 . To assess the generalizability of the signature we examined three more cohorts that could not be classified within the five specified muscle disease categories and found the signature to be reproducible in these cohorts as well.
Although there are heterogeneous muscle types and diverse genetic and acquired causes of different muscle diseases, the 52-gene CMDM reflected convergent transcriptional pathways across peripheral skeletal muscles affected by disease. This implies that prior characterization of any of the genes in the CMDM may be relevant across muscle diseases. Several of the genes in the CMDM have been associated with muscle disease previously, whereas many remain unknown or poorly characterized in skeletal muscle.
Increased expression of cholinergic receptor nicotinic, alpha 1 (CHRNA1), the most robustly DE CMDM gene, has been recognized as a marker of severity of muscle denervation 24,25 . Up-regulation of CHRNA1 has been reported to be associated with dynamic epigenetic modifications of the gene in a rat model of disuse-induced atrophy 26 . The most down-regulated CMDM gene CAMK2G, calcium/calmodulin-dependent protein kinase type II (CaMKII) subunit gamma, is involved in sarcoplasmic reticulum Ca2+ transport in skeletal muscle and has been shown to remain active after exercise 27 . While agonists of CaMKII have been proposed as potential pharmacologic therapies of in various muscle disease 28 , it has remained unclear which of the CaMKII subunits is most important in the regulation of skeletal muscle adaptation, response to injury and activity, and oxidative capacity as these subunits are currently not well characterized. Given that CAMK2G is down-regulated across most muscle diseases in this study we propose that it may be a suitable target for future studies of potential therapeutics.
Although able to robustly identify a broad range of muscle diseases, the CMDM signature more importantly strongly correlates with clinical and histological measures of disease severity, providing persuasive evidence that the signature could have future applications as a biomarker for phenotyping muscle disease. The CMDM signature could specifically provide diagnostic information and quantify the molecular response to therapy for muscle disease. Measuring changes in CMDM scores after treatment may improve the identification of therapy responders, and using it at enrollment in therapeutic trials may aid the stratification of patients within trial arms. Furthermore this signature could also serve to phenotype patients with COPD, ICUAW and other chronic respiratory diseases based on the extent of muscle dysfunction.
We applied gene set enrichment analysis to identify functional pathways that are similarly altered across muscle diseases. As expected, genes involved in skeletal muscle skeletal adaptation and mitochondrial function were down-regulated. Coordinate down-regulation of mitochondrial genes has been described in a number of muscle diseases [29][30][31][32] . The predominant up-regulated functional terms were related to immune activation. Muscle damage secondary to disease induces immune activation culminating in inflammation and deposition of extracellular matrix (ECM) 33,34 . Skeletal muscle diseases are characterized by up-regulation of ECM genes including collagen, with progressive development of fibrosis leading to dysfunctional muscle tissue 35,36  www.nature.com/scientificreports/ www.nature.com/scientificreports/ findings are consistent with literature in chronic skeletal muscle diseases proposing the convergence of final common pathways including chronic inflammation, fibrosis, oxidative stress, and mitochondrial dysfunction 36 . We next removed genes similarly dysregulated across muscle disease to identify pathways altered within specific muscle diseases. This strategy identified pathways unique to ICUAW as well as those shared with other muscle diseases. Significant up-regulation of NF-kB signaling genes was found in IM and ICUAW. NF-kB has been previously shown to play a role in IM 37,38 , has been studied in animal models of cancer cachexia and ICUAW [39][40][41] and has been shown to be an inhibitor of skeletal myogenesis and muscle regeneration 42 . Remarkably, only ICUAW had down-regulation of genes related to cell fate specification. Decreased numbers of satellite cells (precursors to skeletal muscle cells) in ICUAW sustained long term, compared to healthy controls have been detected histologically 43 , supporting the finding of a down-regulated stem-cell gene set.
An unexpected finding of our meta-analysis was that the CMDM signature is enriched for genes targeted to the exosomal vesicle. Vesicular exosomes, cell derived vesicles containing signaling factors (including genes and microRNAs) for intercellular communication, have been found to have roles in muscle regeneration and congenital muscle diseases 44 . Monitoring exosomal miRNAs has been proposed as a non-invasive method for tracking muscle disease progression 44,45 . Future studies will assess whether plasma protein concentrations of the exosomal CMDM genes correlate with muscle severity to the same extent as their transcripts.
Our meta-analysis has limitations despite its comprehensiveness. Although most included studies attempted to select patients without co-morbidities that span more than one muscle disease category, there are potentially multiple pathologies in some of the muscle samples. Given the number of cohorts and size of the overall study, such confounding is likely to be minimal. The broad inclusion criteria applied in this study has identified a robust disease signature that reflects the heterogeneity observed in the real-world patient population. The considerable variance in gene expression profiles between the different muscle tissue sites 46 included in this analysis is expected to have reduced the number of significant genes, while increasing the generalizability of the significant genes detected. We primarily focused on identifying a gene signature that is conserved between several muscle disease categories and across samples. Although this is beneficial for capturing features that are consistent across multiple diseases, it is ill-suited for identifying subgroups of disease.
Based on the use of microarray data from multiple platforms, we cannot test for alterations in splicing regulation, which has been associated with several congenital muscle diseases including the most common adult onset muscular dystrophies 47 . Analysis of RNA-seq transcriptome data will be necessary to determine whether altered splice variants lead to muscle pathology in other disease categories. Identification of conserved epigenetic signatures of muscle disease will provide important insights into the underlying mechanisms resulting in gene transcriptomic dysregulation identified here, once future epigenome-wide association studies of various muscle diseases are available.
The cohort GSE34111 had a global expression pattern that differed markedly from the other muscle diseases and disease categories. As this cohort was the only one in the analysis that included cancer cachexia, it remains unclear whether the difference in global expression pattern reflects significant differences in the pathomechanism of cancer cachexia or technical or experimental differences in the study. Future analysis comparing peripheral muscle from patients with cancer cachexia and controls are required. Within the validation set, the chronic systemic disease and ICUAW categories each consisted of one cohort, reducing the power to detect significant effect size differences from controls within these disease categories. For this reason, disease specific pathway analysis was performed by combing both discovery and validation cohorts.
CMDM genes may be conceptually divided into those having direct etiological contribution to muscle disease and those that represent a secondary phenomenon in the development of muscle disease, include stress-related changes or cell survival mechanisms 48 . Further experimentation will be required to identify the CMDM genes directly contributing to disease as these genes are expected to be good candidates for novel disease modifying therapies 14 . CMDM genes without functional annotation can be prioritized for future experimental evaluation based on the strength of the molecular data (e.g. effect size or correlation with clinical phenotype). Direct experimentation will be necessary to determine the role of the dysregulated genes and pathways in muscle disease as either causal drivers or responses to muscle disease.
Our results identify a conserved muscle disease transcriptional signature associated with clinical and histologic disease severity, and identify numerous novel genes associated with muscle disease severity. Muscle disease specific analysis identifies pathways uniquely altered in ICUAW. Thus our findings serve as a valuable resource for interpreting disease mechanisms, connecting findings across muscle diseases, and driving novel hypotheses.

Methods
The analysis workflow is shown in Supplementary Fig. 1.
Data collection and pre-processing. Two public gene expression microarray repositories (ArrayExpress, NIH GEO) were searched for human muscle disease datasets (search date: Aug 29, 2019). We found 45 independent datasets with 1374 muscle biopsies that met our inclusion criteria (Supplementary Methods).
We divided the sample cohorts into 6 disease categories for analysis: (1) inflammatory myopathies (IM), (2) ICU acquired weakness (ICUAW), (3) congenital muscle diseases (CMD), (4) chronic systemic disease affecting muscle (CSM), (5) disuse and immobility (DI), (6) motor neuron disease (MND). Next, we divided the patient cohorts into a discovery cohort for the initial meta-analysis and a validation cohort for the independent validation analysis. For the discovery cohort we ensured that there were at least three cohorts for each disease category that met our inclusion criteria. As there were only two cohorts for the MND category, this was not included in the discovery cohort as a disease category; instead, the two MND cohorts were included in the secondary validation cohort. www.nature.com/scientificreports/ Normalization and probe expression summarization are described in Supplementary Methods. The number of studies measured for each gene are listed in Supplementary Tables 2 and 3.

Meta-analysis.
Multicohort meta-analysis of gene expression was performed (using R package MetaIntegrator) 12 as described in the Supplementary Methods. The utility of the leave-one-disease-out approach in identifying a robust gene expression signature during acute rejection across different transplanted solid organs 13 and across neurodegenerative diseases 14 has been shown before.
Derivation of common muscle disease module (CMDM) score. We applied a greedy forward search as described in the Supplementary Methods section to identify a gene signature maximized for diagnostic power, termed the Common Muscle Disease Module (CMDM).
Validation of CMDM score and correlation of the CMDM genes with clinical and histological severity. Tukey's Biweight correlation was used to assess the association of the CMDM score with the histologic and clinical measures. Between-and within-group CMDM score comparisons were done with the Wilcoxon rank sum test.
Muscle disease category specific meta-analysis. To identify patterns on gene expression changes that are unique to each muscle disease category we performed meta-analysis using the combined data of the discovery and validation cohorts. We first analyzed each disease category separately, as well as the other four diseases together.. Genes that were significantly differentially expressed in the four-disease category meta-analysis were then removed from the individual disease category meta-analysis, thereby removing the DE genes common to all muscle disorders, from the disease-specific gene list.
Gene ontology functional analysis identified functional themes within differentially expressed genes across muscle disease categories as described in the Supplementary Methods. Subcellular localization analysis was performed for each gene within the CMDM as described in the Supplementary Methods.
All analyses were completed in R language for statistical computing (version 3.4.1). Significance levels were set at two-tailed p < 0.05, unless specified otherwise.

Data availability
The datasets supporting the results of this article are available in GEO and ArrayExpress online repositories at http:// www. ebi. ac. uk/ arrayexpress/ and http:// www. ncbi. nlm. nih. gov/ geo/. Data set accession numbers can be found in Table 1. www.nature.com/scientificreports/