Texture analysis of muscle MRI: machine learning-based classifications in idiopathic inflammatory myopathies

To develop a machine learning (ML) model that predicts disease groups or autoantibodies in patients with idiopathic inflammatory myopathies (IIMs) using muscle MRI radiomics features. Twenty-two patients with dermatomyositis (DM), 14 with amyopathic dermatomyositis (ADM), 19 with polymyositis (PM) and 19 with non-IIM were enrolled. Using 2D manual segmentation, 93 original features as well as 93 local binary pattern (LBP) features were extracted from MRI (short-tau inversion recovery [STIR] imaging) of proximal limb muscles. To construct and compare ML models that predict disease groups using each set of features, dimensional reductions were performed using a reproducibility analysis by inter-reader and intra-reader correlation coefficients, collinearity analysis, and the sequential feature selection (SFS) algorithm. Models were created using the linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machine (SVM), k-nearest neighbors (k-NN), random forest (RF) and multi-layer perceptron (MLP) classifiers, and validated using tenfold cross-validation repeated 100 times. We also investigated whether it was possible to construct models predicting autoantibody status. Our ML-based MRI radiomics models showed the potential to distinguish between PM, DM, and ADM. Models using LBP features provided better results, with macro-average AUC values of 0.767 and 0.714, accuracy of 61.2 and 61.4%, and macro-average recall of 61.9 and 59.8%, in the LDA and k-NN classifiers, respectively. In contrast, the accuracies of radiomics models distinguishing between non-IIM and IIM disease groups were low. A subgroup analysis showed that classification models for anti-Jo-1 and anti-ARS antibodies provided AUC values of 0.646–0.853 and 0.692–0.792, with accuracy of 71.5–81.0 and 65.8–78.3%, respectively. 
ML-based TA of muscle MRI may be used to predict disease groups or the autoantibody status in patients with IIM and is useful in non-invasive assessments of disease mechanisms.

Idiopathic inflammatory myopathies (IIMs) are a heterogeneous family of systemic disorders characterized by muscle weakness, muscle enzyme elevations, inflammatory changes on muscle biopsy, and extra-muscular manifestations 1,2 . The common disease groups of IIMs in adults are polymyositis (PM), dermatomyositis (DM), amyopathic dermatomyositis (ADM), and inclusion body myositis (IBM). These inflammatory myopathies show different clinical presentation patterns and responses to treatment [3][4][5][6] . Patients with PM and DM have similar therapeutic strategies involving the empirical use of corticosteroids and immunosuppressive agents 5 , whereas patients with ADM require earlier and more intensive therapy because of its poor prognosis with severe pulmonary involvement and early death 6 . Therefore, the early identification of IIM disease groups is essential for predicting clinical courses and selecting treatment plans. With the new discoveries of myositis-specific autoantibodies (MSAs) and myositis-associated autoantibodies (MAAs) [7][8][9] , more clinical characteristics have been obtained for IIMs. These autoantibodies are associated with distinct clinical phenotypes and may define a prognosis for a subset of patients.
In IIMs, MRI of skeletal muscles is a feasible method for assessing disease activity and identifying useful biopsy sites. Due to uniform fat suppression and no administration of contrast media, STIR MR sequences are preferred 10,11 . The proximal legs are preferentially examined because thigh muscles are mostly affected in IIM patients 12 . Although previous studies reported characteristic muscle MRI findings in IIM patients [11][12][13][14][15][16] , quantitative or semi-quantitative assessments with MRI have been limited 12 .
A texture analysis (TA) is an image analysis technique that allows for the quantification of image characteristics based on the distribution of pixels and their surface intensity or patterns 17,18 . These image characteristics are based on the microstructures of a background tissue and are sometimes imperceptible to the human visual system 17 . TA has been applied to a number of medical image assessments, including oncologic imaging 19,20 , neuroimaging 21,22 , and musculoskeletal imaging 23,24 . Recent US-based radiomics studies reported differentiation between neurogenic and myogenic diseases using musculature imaging 25 . To the best of our knowledge, an analysis of IIMs with texture features derived from muscle MRI has not yet been conducted.
The present study was performed to evaluate the diagnostic performance of ML-based MRI radiomics models for predicting disease groups in patients with IIMs. We also investigated the feasibility of classifications based on autoantibodies (e.g., anti-Jo-1 and anti-ARS antibodies).

Methods
The present study was approved by the Research Ethics Committee of Saitama Medical University Hospital as a retrospective medical imaging data analysis using TA and a deep-learning technique. The requirement for informed consent was waived by the Committee (approval number 20041.01). All experiments were performed in accordance with the relevant guidelines and regulations.
Patients. Figure 1 shows the inclusion and exclusion criteria. In total, 243 patients who underwent muscle MRI of the thighs with suspicion of myositis between January 2012 and December 2019 were identified and reviewed. The following patients were excluded: 134 diagnosed with diseases other than myositis or with an unknown cause; 4 who were neither followed up nor treated after MRI in our hospital; 11 with high-grade muscle atrophy (difficulty in segmentation); 8 with severe artifacts on MRI; 7 with insufficient clinical data; and 3 who underwent MRI at other institutions. Using the 2017 European League Against Rheumatism/American College of Rheumatology (EULAR/ACR) classification criteria, the latest and most widely used criteria because of their high sensitivity and specificity 26  In a subgroup study, we investigated whether it was possible to predict the status of some representative MSAs. The data analysis workflow is shown in Fig. 2. After segmentation, image processing, texture feature extraction, a reproducibility analysis, and a collinearity analysis were conducted on all datasets together, followed by texture feature selection and ML-based model construction in separate classification attempts.

Segmentation.
Muscle segmentation was performed using open-source software (ITK-SNAP version 3.8.0). A two-dimensional region of interest (ROI) that covered the whole area of one slice of a muscle MR image of the proximal thighs and excluded the epimysium was selected for each subject (see Fig. 3). Two radiologists with 20 and 4 years of experience delineated the ROIs independently. A senior radiologist performed the segmentation again after a minimum interval of 2 months. Segmentation was performed on the same image slice assessed by another radiologist with 5 years of experience. All three radiologists were blinded to clinical information.

Texture feature extraction. To avoid data heterogeneity bias, all MRI data were subjected to intensity normalization (image intensities were scaled to 0-100) and resampled to the same resolution (3 × 3 × 3 mm) before feature extraction. Texture features were calculated using an open-source software package capable of extracting a large panel of engineered features from medical images (PyRadiomics version 2.1.0). Features were calculated based on six feature classes: first-order statistics, the gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run-length matrix (GLRLM), gray-level size zone matrix (GLSZM), and neighboring gray-tone difference matrix (NGTDM). In addition to the 93 original features (18 first-order, 24 GLCM, 14 GLDM, 16 GLRLM, 16 GLSZM, and 5 NGTDM features), 93 features were extracted from images filtered with a local binary pattern (LBP), and the results obtained with the two feature sets were compared.

Dimensional reduction of texture features. After numeric values had been normalized as z-scores, dimensional reduction was performed in two consecutive steps: a reproducibility analysis and a collinearity analysis.
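The two normalization steps mentioned above (scaling image intensities to 0-100 before extraction and converting feature values to z-scores before dimensional reduction) can be sketched as follows. This is a minimal illustration that assumes a linear min-max rescaling; the text does not state the exact formula applied by the software, and the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def rescale_intensity(pixels, new_min=0.0, new_max=100.0):
    """Linearly map a flat list of pixel intensities onto [new_min, new_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:  # constant image: avoid division by zero
        return [new_min for _ in pixels]
    return [new_min + (p - lo) * (new_max - new_min) / (hi - lo) for p in pixels]


def z_scores(values):
    """Standardize one feature across patients (mean 0, SD 1)."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; the sample SD is an equally valid choice
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy values for illustration only, not study data.
scaled = rescale_intensity([12, 40, 68, 124])
standardized = z_scores([0.8, 1.1, 1.4, 2.3])
```

Standardizing each feature puts all features on a common scale, so that no single feature dominates the distance-based classifiers used later merely because of its units.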
To evaluate intra-observer and inter-observer reproducibilities, intraclass correlation coefficient (ICC) values were calculated for each texture feature. Features with excellent reproducibility (ICC ≥ 0.8) in intra-observer and inter-observer analyses were included in further analyses.
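As an illustration of the reproducibility filter, the sketch below computes an ICC for one feature measured by several readers. The ICC form is an assumption: the text does not specify which one was used, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is shown here. The function name and the ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per subject and one column per reader.
    """
    n = len(ratings)          # subjects
    k = len(ratings[0])       # readers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical feature values from two readers for five subjects.
feature_by_reader = [[10.1, 10.3], [12.4, 12.0], [9.8, 9.9], [15.2, 15.6], [11.0, 11.1]]
keep_feature = icc_2_1(feature_by_reader) >= 0.8  # threshold used in the text
```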
A collinearity analysis was conducted using Pearson's correlation coefficient (r). The threshold for collinearity was r = 0.7. Features with high collinearity were excluded from the analysis. In the case of a feature pair having high collinearity, the one with the lowest collinearity with the other features remained in the analysis.
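The collinearity rule described above can be sketched as a greedy filter: whenever a pair of features reaches the threshold, the member that is on average more correlated with the remaining features is dropped. This is one possible reading of the rule, not the authors' code; the processing order of pairs, the tie-breaking, and all names and values are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # assumes neither feature is constant


def drop_collinear(features, threshold=0.7):
    """Return the feature names kept after removing collinear ones.

    `features` maps a feature name to its list of values across patients.
    """
    kept = list(features)

    def abs_r(a, b):
        return abs(pearson_r(features[a], features[b]))

    def mean_r_with_others(name, partner):
        others = [o for o in kept if o not in (name, partner)]
        return sum(abs_r(name, o) for o in others) / len(others) if others else 0.0

    while True:
        pairs = [(abs_r(a, b), a, b)
                 for i, a in enumerate(kept) for b in kept[i + 1:]]
        high = [p for p in pairs if p[0] >= threshold]
        if not high:
            return kept
        _, a, b = max(high)  # handle the most strongly correlated pair first
        # Drop the member that is more correlated with the other features.
        loser = a if mean_r_with_others(a, b) > mean_r_with_others(b, a) else b
        kept.remove(loser)


# Toy values for illustration only, not study data.
toy = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],   # nearly a multiple of f1
    "f3": [5, 1, 4, 2, 3],
}
kept_features = drop_collinear(toy)
```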
Feature selection and ML-based classification. The sequential feature selection (SFS) algorithm, a wrapper-based greedy search algorithm, was used for feature selection 29 . A radiomics model was created based on a limited number of selected features (3-4 features according to the number of patients) with the lowest collinearity scores 30 . Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machine (SVM), k-nearest neighbors (k-NN), random forest (RF), and multi-layer perceptron (MLP) classifiers were used for model development, all with default parameter settings. Ten-fold cross-validation repeated 100 times was performed for classification models. The performance of classifiers was evaluated by the area under the curve (AUC). Accuracy, sensitivity, specificity, precision, and the F-measure were calculated based on the confusion matrix of classification results.

Statistical analysis. Statistical analyses were performed using an open-source software package (Python scikit-learn 0.22.1). Differences in patient characteristics were assessed using 2-sample t-tests and chi-squared tests. Values of p < 0.05 were considered to be significant.
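To make the wrapper-based greedy search concrete, the sketch below implements forward sequential feature selection with a pluggable scoring function. It is an illustration under stated assumptions rather than the study's implementation: the scorer is a leave-one-out 1-nearest-neighbor accuracy chosen for brevity (the study scored its own classifiers with cross-validation), and the function names and toy data are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum((row[f] - other[f]) ** 2 for f in feature_idx)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def sequential_forward_selection(n_features, score_fn, n_select):
    """Greedily add the feature that most improves score_fn(selected subset)."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
rows = [
    (0.0, 5.0, 1.0), (0.1, 1.0, 2.0), (0.2, 4.0, 1.0),
    (1.0, 2.0, 2.0), (1.1, 5.0, 1.0), (1.2, 1.0, 2.0),
]
labels = ["A", "A", "A", "B", "B", "B"]
chosen = sequential_forward_selection(
    3, lambda idx: loo_1nn_accuracy(rows, labels, idx), n_select=2
)
```

Because the search is greedy, it evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing the best combination.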

Results
Clinical characteristics. The study included 74 patients (19 PM, 22 DM, 14 ADM, and 19 non-IIM). Mean age was lower in the ADM group than in the other groups. Muscle weakness was rarer and CK levels were lower in the ADM group than in the other groups. Patient characteristics are shown in Supplementary Table S1.

Dimensional reduction with the reproducibility test and collinearity analysis. Among the 93 original features, the mean ICC value was 0.926 (SD = 0.101) in the inter-observer reproducibility test and 0.975 (SD = 0.029) in the intra-observer reproducibility test. Eighty and 93 features had excellent inter-reader and intra-reader reproducibilities (ICC ≥ 0.8), respectively. The number of features with excellent reproducibility in both analyses was 80. By excluding features with high collinearity (r ≥ 0.7), the number of features was further reduced to 11. Eleven representative features and their respective ICCs are shown in Table 1, and their distribution and collinearity status are shown in Supplementary Figs. S1 and S2 online, respectively. On the other hand, among 93 LBP features, the mean ICC value was 0.776 (SD = 0.207) in the inter-observer reproducibility test, and 0.859 (SD = 0.100) in the intra-observer reproducibility test. Fifty-four and 60 features had excellent inter-reader and intra-reader reproducibilities (ICC ≥ 0.8), respectively. The number of features with excellent reproducibility in both analyses was 54. By excluding features with high collinearity (r ≥ 0.7), the number of features was further reduced to 9. Nine representative features and their respective ICCs are shown in Table 2, with their distribution and collinearity status in Supplementary Figs. S3 and S4 online, respectively.
Feature selection and ML-based multi-class classification of IIM disease groups. The SFS algorithm associated with a univariate analysis (p < 0.05) provided 3 to 4 features for each classifier (i.e. LDA, QDA, SVM, k-NN, RF, and MLP classifiers). We constructed multi-class classification models based on the selected features, and evaluated their performance via the tenfold cross-validation repeated 100 times. The top classification score for original features was obtained in the LDA classifier: the macro-average AUC was 0.683 (SD = 0.012), with an accuracy of 58.6% (SD = 2.0%), macro-average precision of 59.1% (SD = 2.9%), and macro-average recall of 56.7% (SD = 2.1%). Equivalent classification scores were also observed in the QDA and SVM classifiers.
All classification attempts for original and LBP features are summarized in Tables 3 and 4, with confusion matrices in Supplementary Figs. S5 and S6 and ROC curves in Figs. S7 and S8 online, respectively.
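For readers unfamiliar with macro-averaging, the sketch below shows how accuracy, macro-average precision, and macro-average recall follow from a multi-class confusion matrix. The function name is hypothetical and the 3 × 3 matrix is an invented toy example, not a result from this study.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision/recall.

    confusion[i][j] is the number of class-i cases predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    recalls, precisions = [], []
    for i in range(n_classes):
        actual = sum(confusion[i])                                   # row total
        predicted = sum(confusion[r][i] for r in range(n_classes))   # column total
        recalls.append(confusion[i][i] / actual if actual else 0.0)
        precisions.append(confusion[i][i] / predicted if predicted else 0.0)

    return {
        "accuracy": accuracy,
        # Macro-averaging weights every class equally, regardless of its size.
        "macro_recall": sum(recalls) / n_classes,
        "macro_precision": sum(precisions) / n_classes,
    }


# Invented toy matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
metrics = macro_metrics(toy_confusion)
```

Macro-averaging is a natural choice when class sizes differ, as they do here, because a model cannot score well by favoring the largest class.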

ML-based multi-class classification attempts for non-IIM and IIM disease groups.
In the multi-class classification analysis of non-IIM vs PM vs DM vs ADM, we selected representative features using the SFS algorithm associated with a univariate analysis, and evaluated model performance via tenfold cross-validation repeated 100 times.
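The repeated tenfold cross-validation scheme can be sketched as a generator of train/test index splits. Stratification by class and the round-robin fold assignment are assumptions of this sketch; the text does not state whether folds were stratified. The class sizes below are those reported for the PM, DM, and ADM groups, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices), reshuffling within class per repeat."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal cases round-robin so each fold gets a similar class mix.
            for pos, idx in enumerate(shuffled):
                folds[(offset + pos) % n_splits].append(idx)
            offset += len(shuffled)
        for test in folds:
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, sorted(test)


labels = ["PM"] * 19 + ["DM"] * 22 + ["ADM"] * 14
# 10 folds x 100 repeats = 1000 train/test splits.
n_total = sum(1 for _ in repeated_stratified_kfold(labels))
```

Averaging a metric over all splits, and reporting its standard deviation across repeats, reduces the dependence of the estimate on any single random partition of a small cohort.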

Subgroup analysis of ML-based classification of representative autoantibodies. We searched our patients for the anti-Jo-1 and anti-ARS antibodies, and selected 45 patients with sufficient data for these two autoantibodies. Although we also investigated other MSA/MAAs, few patients had sufficient data (the information on representative MSA/MAAs in all patients for this analysis is shown in Supplementary Table S4 online). Therefore, in this subgroup analysis, we focused on the anti-Jo-1 and anti-ARS antibodies, and attempted to construct binary classification models for each antibody. We only applied original features because of the limited number of subjects. We selected 3 representative features using the SFS algorithm associated with a univariate analysis, and evaluated their performance by tenfold cross-validation repeated 100 times.
Classification attempts for these two antibodies are summarized in Tables 5 and 6, with confusion matrices in Supplementary Figs. S13 and S14 and ROC curves in Figs. S15 and S16 online, respectively.

Table 4. Performance of multi-class classifications of IIM groups (LBP features). Note: Data are means ± standard deviations. Feature name codes are as follows: TexFL1 = total energy, TexFL2 = variance, TexFL3 = cluster shade, TexFL4 = contrast, TexFL5 = difference entropy, TexFL6 = long run emphasis, TexFL7 = long run low gray-level emphasis, TexFL8 = gray-level non-uniformity, TexFL9 = busyness. LDA linear discriminant analysis, QDA quadratic discriminant analysis, SVM support vector machine, k-NN k-nearest neighbors classifier, RF random forest classifier, MLP multi-layer perceptron.

Discussion
In the present study, we found that ML-based TA of muscle MRI has the potential to distinguish between PM, DM, and ADM. In contrast, ML models distinguishing between non-IIM and IIM disease groups had low classification accuracy. We also showed that our ML models have the potential to predict the status of anti-Jo-1 and anti-ARS antibodies. Since IIMs are rare disorders, we were unable to collect a large number of IIM patients. Therefore, this analysis is a small-scale proof-of-concept study that demonstrates the potential of MR-based TA to predict disease groups or the autoantibody status.
To the best of our knowledge, the potential value of MR-based TA for discriminating IIM disease groups has not yet been assessed. Apart from TA, attempts have been made to differentiate between IIM disease subtypes using conventional MRI findings. Previous studies demonstrated that a subcutaneous high signal intensity (HSI), fascial HSI, and the patchy or diffuse distribution of HSI in muscle are useful MRI findings for differentiating between PM and DM [11][12][13][14][15][16] . Ukichi et al. assessed the likelihood of DM using a scoring system with several characteristic MRI findings 16 . Although classification performance in the present study was lower, several points need to be considered that emphasize the advantages of our models. We built a multi-class classification model for PM, DM and ADM, which is more practical for clinical applications. Furthermore, instead of using conventional morphological parameters that are subject to individual interpretation and inter-observer variability, we introduced radiomics imaging analyses that extract various quantitative features from medical images and overcome these issues. In addition, we further developed classification models for autoantibodies, which are useful for clinical practice in recent antibody-oriented medicine.
Table 5. Performance of machine learning-based classifications of anti-Jo-1 antibodies. Note: Data are means ± standard deviations. Feature name codes are as follows: TexF1 = kurtosis, TexF2 = interquartile range, TexF3 = total energy, TexF4 = cluster prominence, TexF5 = correlation, TexF6 = difference average, TexF7 = imc2, TexF8 = maximum probability, TexF9 = large dependence high gray-level emphasis, TexF10 = dependence nonuniformity, TexF11 = coarseness. LDA linear discriminant analysis, QDA quadratic discriminant analysis, SVM support vector machine, k-NN k-nearest neighbors classifier, RF random forest classifier, MLP multi-layer perceptron.

Table 6. Performance of machine learning-based classifications of anti-ARS antibodies. Note: Data are means ± standard deviations. Feature name codes are as follows: TexF1 = kurtosis, TexF2 = interquartile range, TexF3 = total energy, TexF4 = cluster prominence, TexF5 = correlation, TexF6 = difference average, TexF7 = imc2, TexF8 = maximum probability, TexF9 = large dependence high gray-level emphasis, TexF10 = dependence nonuniformity, TexF11 = coarseness. LDA linear discriminant analysis, QDA quadratic discriminant analysis, SVM support vector machine, k-NN k-nearest neighbors classifier, RF random forest classifier, MLP multi-layer perceptron.

IIMs are now diagnosed based on the findings of clinical and histopathological examinations. Although muscle and skin biopsies are widely accepted methods for defining the diagnosis of IIMs, they are invasive and susceptible to significant sampling bias. In previous studies, false-negative results were reported in 10-20% of all IIM muscle biopsies due to sampling errors caused by the scattered distribution of focal disease activity 31-34 . The 2017 EULAR/ACR criteria were recently introduced, which permit diagnoses using a two-version scoring system with and without muscle biopsy 26 .
Although the new criteria have the advantages of high diagnostic performance and flexibility, disagreements in the diagnosis of IIM disease groups have been reported in several cohort studies 27,28 . MRI, which is not incorporated into the new criteria, is not invasive and has the potential to characterize IIM disease subtypes. A quantitative radiomics assessment of muscle MRI, as shown in our approach, may be a more objective and feasible method for IIM diagnoses and disease subtype classifications.
In the present study, several complex TA features were valuable for differentiating between IIM disease groups. GLCM features describe the second-order statistical information of gray levels between neighboring pixels in an image 35 , while the LBP-2D filtered features represent a comparison of center pixels with their surrounding pixels. Although these complex TA features have been suggested to reflect underlying pathomorphological texture patterns in various fields of medical imaging [36][37][38] , the interpretation of the TA features in the present study needs to be supported by further evidence, including histopathology.
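The idea behind the LBP filter, comparing a center pixel with its surrounding pixels, can be illustrated with the basic 8-neighbor local binary pattern for a single 3 × 3 patch. This is a textbook sketch, not the study's pipeline: the neighborhood radius, number of sampling points, and encoding variant used by the software may differ, and the function name is hypothetical.

```python
def lbp_code(patch):
    """Basic 8-neighbor local binary pattern code for the center of a 3x3 patch.

    Each neighbor contributes one bit: 1 if it is at least as bright as the
    center pixel, 0 otherwise. Codes range from 0 to 255.
    """
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left corner.
    neighbors = [
        patch[0][0], patch[0][1], patch[0][2], patch[1][2],
        patch[2][2], patch[2][1], patch[2][0], patch[1][0],
    ]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:
            code |= 1 << bit
    return code


flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]     # homogeneous region
bright_spot = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]    # isolated bright center
flat_code = lbp_code(flat_patch)
spot_code = lbp_code(bright_spot)
```

Because the code depends only on the ordering of intensities within the neighborhood, it is insensitive to monotonic changes in overall signal intensity, which is one reason LBP-type features are attractive for MRI.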
Conventional ML classifiers, such as LDA, SVM, k-NN, and RF, were mainly examined in the present study instead of using a deep-learning or convolutional neural network (CNN) approach. Since deep learning is now widely used for image classification to facilitate the diagnosis of various diseases, introducing a deep-learning or CNN method may add value and improve classification rates. Multi-task deep CNN models were recently applied to the diagnosis of neurological diseases and achieved high classification performance 39 . These models are suited to our theme because complex multi-omics data are particularly important in IIM or other collagen diseases, and a deep CNN approach will assist in the construction of favorable classification models for these diseases. As a preliminary study on CNN-based classification models, we implemented the MLP classifier, which is the simplest form of an artificial neural network. In the present study, the MLP classifier provided similar or slightly lower results than the other conventional classifiers. Since MLP is considered to be a favorable estimator for non-linear models, this suggests that our classification problems may be better approximated by linear models than by complex non-linear models. However, since this is a small-scale study, different results may have been obtained if we had employed large samples as well as independent training and test cohorts.
Overall, our radiomics models distinguished between IIM disease groups with moderate diagnostic accuracy, but distinguished poorly between non-IIM and IIM disease groups. It is important to note that even patients classified as non-IIM by the 2017 ACR/EULAR criteria may have IIM to some degree because the criteria are in themselves a prediction model using an aggregate scoring system derived from several variables. According to the criteria's definitions, "possible IIM" and "non-IIM", which were combined as non-IIM in the present study, correspond to probabilities of ≥ 50% to < 55% and of < 50%, respectively 26 .
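The probability bands described above can be written as a small mapping. The "possible IIM" and "non-IIM" cut-offs come from the text; the "definite" (≥ 90%) and "probable" (≥ 55% to < 90%) cut-offs are taken from the published 2017 criteria rather than from this article, and the function names are hypothetical.

```python
def eular_acr_category(probability):
    """Map a 2017 EULAR/ACR probability (0-1) to its classification band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability >= 0.90:   # cut-off from the published criteria
        return "definite IIM"
    if probability >= 0.55:   # cut-off from the published criteria
        return "probable IIM"
    if probability >= 0.50:   # stated in the text
        return "possible IIM"
    return "non-IIM"          # stated in the text


def study_group(probability):
    """Grouping used in this study: possible IIM is merged into non-IIM."""
    category = eular_acr_category(probability)
    return "non-IIM" if category in ("possible IIM", "non-IIM") else "IIM"


borderline = study_group(0.52)  # falls in the "possible IIM" band
```

The narrow 50-55% band shows how a small change in the aggregate score can move a patient between the IIM and non-IIM groups analyzed here.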
In a recent study, among 111 patients who were diagnosed with IIM clinically, 89 (80.2%) were classified as having probable/definite IIM using the 2017 ACR/EULAR criteria, while the other 22 (19.8%) were in the false-negative possible IIM/non-IIM group 28 . In the present study, all 19 patients with non-IIM were clinically diagnosed with IIM. Except for two patients with ADM (confirmed by skin biopsy), the other 17 were treated as PM; however, it was not clear whether at least 4 patients had PM or DM. Moreover, 12 out of the 19 non-IIM patients showed HSI on muscle STIR MRI.
Consistent with previous findings, sampling errors may have occurred with biopsies, which reduced the likelihood of a diagnosis of IIM within the classification criteria because the aggregate score cut points increase when biopsy information is added. Other reported diagnostic factors that may lead to false classification results in the IIM/non-IIM group included the autoantibody status and skin manifestations 27,28 . Due to the uncertainty and heterogeneity of the non-IIM group decided by the 2017 EULAR/ACR criteria, it appears to be more important to include appropriate control groups, such as normal or other disease groups, rather than a non-IIM group or to construct autoantibody-oriented classification models if the goal is to construct useful classification models for clinical practice. Since it is not currently possible to include new samples, this is a subject for future analyses.
The present results also provide a promising perspective on the classification of autoantibodies. We achieved good diagnostic performance using radiomics models for the anti-Jo-1 and anti-ARS antibodies. Anti-Jo-1 antibodies are the most common autoantibodies among IIM (up to 20% of IIM) 40 . They belong to the anti-ARS antibodies, which define the clinical phenotype called anti-synthetase syndrome (ASS), including myositis, interstitial lung disease, arthritis, Raynaud's phenomenon, and mechanic's hands 41 . Previous studies reported characteristic histopathological features and muscle MRI patterns in active ASS 42,43 . In the present study, we speculate that a high magnitude of voxel values and inhomogeneity in an image, which correspond to features such as total energy, cluster prominence, dependence non-uniformity, and coarseness, may be characteristics of ASS; however, this needs to be evaluated in future studies that include histopathology. Regarding other autoantibodies, Pinal-Fernandez et al. showed that anti-SRP-positive immune-mediated necrotizing myopathy (IMNM) had more severe atrophy and fatty replacement than anti-HMGCR-positive IMNM 44 . Due to our small sample size, we were unable to construct classification models for several other autoantibodies. However, based on the importance of autoantibodies in IIMs, we need to evaluate multi-class classification models of several important autoantibodies in future studies.
The present study had limitations. The number of patients examined was not sufficient to construct a robust ML-based classification model. The rarity of these disorders prevented the collection of a large sample. To avoid overfitting, we performed several feature reduction steps, including a collinearity analysis and SFS algorithms, and constructed a model with a limited number of features and multiple classifiers. We also performed a cross-validation of the calculated models to avoid overestimation. Nevertheless, future studies with a large sample size and independent training and test cohorts are needed to provide supporting evidence for the diagnostic value of our radiomics models. Another limitation is that the present study lacked appropriate control groups, such as normal or other disease groups, as described above. Moreover, it is of greater clinical importance to construct a comprehensive classification model in consideration of clinical, serological, and pathological data. We still need to correlate histopathology with MRI TA values and to compare our radiomics models with human readers. The further development of our models will be achieved by addressing these issues in the future. In addition, our analysis only considered the intra-muscular area. Based on previous findings, the inclusion of the extra-muscular area, particularly the subcutaneous area, may provide more accurate results. Similarly, we only used STIR images in our radiomics analysis. We did not include additional TA on contrast-enhanced images because contrast-enhanced sequences were not available in all patients. A recent study on characteristic MRI findings of IIM stated that contrast-enhanced sequences were useful for the differentiation of disease groups, whereas STIR images provided similar results and were beneficial considering the risk and cost of contrast media.
In conclusion, ML-based TA of muscle MRI has potential as a method for predicting disease groups or autoantibody status in patients with IIM. With further studies to verify its reproducibility and viability, TA may become a clinically feasible technique that will be of assistance in non-invasive assessments of underlying disease mechanisms and help guide therapeutic decisions.

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.