Comparative validation of AI and non-AI methods in MRI volumetry to diagnose Parkinsonian syndromes

Automated segmentation and volumetry of brain magnetic resonance imaging (MRI) scans are essential for the diagnosis of Parkinson’s disease (PD) and Parkinson’s plus syndromes (P-plus). To enhance the diagnostic performance, we adopted deep learning (DL) models for brain MRI segmentation and compared their performance with that of the gold-standard non-DL method. We collected brain MRI scans of healthy controls (n = 105) and patients with PD (n = 105), multiple system atrophy (n = 132), and progressive supranuclear palsy (n = 69) at Samsung Medical Center from January 2017 to December 2020. Using the gold-standard non-DL model, FreeSurfer (FS), we segmented six brain structures (midbrain, pons, caudate, putamen, pallidum, and third ventricle) and used the results as annotated data for the DL models, namely representative convolutional neural network (CNN) and vision transformer (ViT)-based models.
Dice scores and the area under the curve (AUC) for differentiating normal, PD, and P-plus cases were calculated to determine the extent to which the DL approaches can reproduce FS performance while increasing speed. The segmentation times of CNN and ViT for the six brain structures per patient were 51.26 ± 2.50 and 1101.82 ± 22.31 s, respectively, being 14 to 300 times faster than FS (15,735 ± 1.07 s). Dice scores of both DL models were sufficiently high (> 0.85), and their AUCs for disease classification were not inferior to that of FS. For classification of normal vs. P-plus and PD vs. P-plus (except multiple system atrophy-Parkinsonian type) based on all brain parts, the DL models and FS showed AUCs above 0.8, demonstrating the clinical value of DL models in addition to FS. DL significantly reduces the analysis time without compromising the performance of brain segmentation and differential diagnosis. Our findings may contribute to the adoption of DL brain MRI segmentation in clinical settings and advance brain research.


Introduction
Parkinson's disease (PD) diagnosis is primarily based on clinical presentation. However, for atypical symptoms called red flags [1], brain magnetic resonance imaging (MRI) is essential for diagnosing Parkinson-plus syndromes (P-plus), such as multiple system atrophy (MSA) and progressive supranuclear palsy (PSP). MRI improves the diagnostic accuracy and can be used for monitoring disease progression [2]. Brain MRI can reveal various features that appear in P-plus but not in PD [2,3,4]. For instance, patients with PSP show marked midbrain atrophy [5], known as the hummingbird sign. In MSA-Parkinsonian type (MSA-P), the putamen is atrophic, with a flattened lateral border, and shows a hypointense signal on T1-weighted gradient-echo images. Patients with MSA-cerebellar type (MSA-C) show predominant atrophy in the pons and middle cerebellar peduncles, resulting in an increased midbrain-to-pons ratio [6] and a decrease in the magnetic resonance Parkinsonism index [7]. Accordingly, quantitative measures of the volume of these brain structures have also been assessed, showing high sensitivity and specificity in differentiating PD from P-plus [8].
Although the diagnostic sensitivity and specificity obtained by evaluating the midbrain area are generally high for differentiating between PSP, MSA, and PD [9], the visual assessment of this area is not quantitative, lacks objectivity, and is highly dependent on the physician's skills and on image acquisition. Consequently, diagnoses based on visual assessments have shown a broad spectrum of accuracy, even falling below 80% [10,11,12]. To develop a consistent and quantitative analysis of brain MRI, volumetry of the midbrain area has been used as an optimal predictor for accurate diagnosis [8,13,6,14]. Thus, brain image segmentation has become an important stage in most downstream analyses based on prediction models or automated machine-learning (ML) methods for volumetry and diagnosis.
A trained physician's manual segmentation of brain MRI scans is strenuous and time-consuming, and it requires a highly skilled specialist to correctly identify the brain structures. Various automated techniques using atlas-based or deep-learning (DL) techniques have been developed to overcome these problems. Although automated image segmentation models for the brain show limitations [15,16], FreeSurfer (FS) [17] can extract brain structures with relatively high accuracy. Therefore, FS has been widely adopted as a non-DL automated segmentation method [17,18,19,20,21].
Various automated segmentation methods for brain structures have been developed, but their use in clinical practice is limited, and they are typically used in one-time studies. This is attributable to the time-consuming and complex process of automated segmentation models compared with physicians' simple visual assessments of brain MRI scans. For instance, FS takes more than 4.5 h per patient to segment the brain in an MRI scan. This complexity problem occurs because existing automated segmentation methods use atlas-based registration [22,23,24,25]. In fact, expressing segmentation as an atlas-based registration problem requires considerable time, and FS must be optimized to obtain a coordinate transformation function suitable for the internal atlas model of each test sample.
An automated model for fast segmentation and diagnosis without involving intricate methods should be developed for clinical use. Although DL segmentation has been used in various fields, including medicine [16], the segmentation of brain structures in MRI for the diagnosis of neurodegenerative diseases has made little progress. In addition, no study has introduced artificial-intelligence-based analysis or demonstrated the usefulness of DL (i.e., complexity or disease discrimination performance) compared with existing non-DL automated segmentation of brain structures (e.g., FS). Unlike existing non-DL methods, DL may increase the analysis speed by completing segmentation using only forward computations based on learned parameters without requiring optimization processes such as registration. However, it is difficult to predict whether DL shows performance degradation compared with non-DL methods, especially in diagnosing neurodegenerative diseases. Our study is significant because it is the first experimental study that demonstrates, with extensive clinical data, the competitive performance of DL and non-DL methods. A DL method can achieve high performance in terms of the analysis complexity and diagnostic performance for differentiating major neurodegenerative diseases (e.g., differential diagnosis between PD, P-plus, and normal cases).
Recent DL segmentation models are classified into convolutional neural network (CNN) and vision transformer (ViT) architectures. Accordingly, a representative model of each framework, V-Net [26] and UNet transformer (UNETR) [27], respectively, was adopted to perform volumetric 3D image segmentation in this study. The DL models were trained to segment brain structures on MRI scans for the diagnosis of neurodegenerative diseases, and their performances were analyzed and compared with an existing non-DL model, FS. Six brain structures that are important in classifying normal, PD, and P-plus cases were segmented: putamen, pallidum, midbrain, pons, caudate, and third ventricle. The volumes of the segmented areas were subsequently used to differentiate between normal, PD, and P-plus cases. We compared the disease differentiation accuracy and segmentation time of the DL models with those of FS, whose outputs were regarded as the reference (i.e., ground truth) for training the DL segmentation models.

Methods
In this section, we describe the brain MRI data (Section 2.1), FS implementation (Section 2.2), and DL method implementation (Section 2.3) for the volumetric analysis of key brain structures to diagnose neurodegenerative diseases. Figure 1 shows an overview of the study process, including the evaluation and comparison of FS and the DL models (i.e., modified V-Net and UNETR representing the CNN and ViT DL architectures, respectively). Figure B.1 shows a diagram of the overall performance comparison. We developed DL models with faster processing but similar segmentation performance to FS. The DL models were trained to reproduce the FS segmentation result $F_i \in \mathbb{R}^{256 \times 256 \times 128}$ for each brain structure $i$ ($i \in \{$pallidum, putamen, caudate, third ventricle, midbrain, pons$\}$) by taking the skull-stripped brain image $I \in \mathbb{R}^{256 \times 256 \times 128}$ as input, with resolution $(h, w, d)$ (height $h = 256$, width $w = 256$, depth $d = 128$). The segmentation results for the six brain structures were stored as 3D binary masks ($F_i$ and $V_i$ indicate the FS and DL-model masks for brain structure $i$, respectively), where each mask contained values between 0 and 1 (0 for the area outside and 1 for the area inside the target brain structure). By calculating the absolute volume of each or all of the brain structures predicted by FS or the DL models, we performed binary classification of PD, MSA-C, MSA-P, PSP, and normal cases and calculated the area under the curve (AUC) for each segmentation method.
This study was approved by the Institutional Review Board of Samsung Medical Center, and the requirement for written informed consent was waived (approval number: 2021-07-026). We retrospectively screened patients from the Neurology Department of Samsung Medical Center between January 2017 and December 2020. Patients diagnosed with PD, probable MSA, or probable PSP were included in this study. The diagnosis for each patient was determined by movement disorder specialists based on the following criteria: PD was determined according to the United Kingdom PD Society Brain Bank criteria [28] using [18F] N-(3-fluoropropyl)-2β-carbomethoxy-3β-(4-iodophenyl) nortropane positron emission tomography, while probable MSA and PSP were diagnosed according to the second consensus diagnosis of MSA [29] and the Movement Disorder Society clinical diagnostic criteria for PSP [30], respectively. MSA cases were further classified as either MSA-P or MSA-C after reaching consensus [29]. Patients with concomitant or structural brain lesions, including stroke and tumors, which may affect brain MRI scans, were excluded from the study. An age-matched healthy elderly population was included as the control group. Demographic information on age, sex, and disease duration until the brain MRI examination was collected, as listed in Table 1. We analyzed the data from 411 individuals and performed threefold cross-validation to train and evaluate the DL models. The cohort consisted of 105 healthy controls and 105 PD, 69 PSP, 69 MSA-C, and 63 MSA-P cases.

Study population and clinical assessments
We applied cross-validation with three outer folds for evaluation to mitigate bias in the validation and test sets and to analyze the effect of set composition (combinations of cases in groups). The data were randomly divided into three folds, with one used for testing and two for training in each round. Each fold comprised 35 normal, 35 PD, 23 PSP, 23 MSA-C, and 21 MSA-P cases.
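The fold composition above (one third of each diagnostic group per fold) can be illustrated with a minimal sketch. The split procedure itself is not described in the text, so the round-robin stratification, the function name `stratified_folds`, and the random seed below are assumptions made only for illustration.

```python
import random
from collections import Counter, defaultdict

# Group sizes reported in the text (411 individuals in total).
GROUP_SIZES = {"normal": 105, "PD": 105, "PSP": 69, "MSA-C": 69, "MSA-P": 63}


def stratified_folds(labels, n_folds=3, seed=0):
    """Return n_folds lists of indices with near-equal label counts per fold."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets an equal share.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


labels = [group for group, n in GROUP_SIZES.items() for _ in range(n)]
folds = stratified_folds(labels)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

for k, test_idx in enumerate(folds):
    # One fold is held out for testing; the other two are used for training.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(k, len(train_idx), len(test_idx), dict(fold_counts[k]))
```

Because every group size is divisible by three, each held-out fold contains 137 cases with the per-group counts stated above, and the remaining 274 cases form the training set.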

Data acquisition and standardization
Axial brain MRI scans were acquired using a standard protocol for T1-magnetization-prepared rapid acquisition of gradient echo, with repetition/echo time of 11,000/125 ms, inversion time of 2,800 ms, field of view of 240 mm, acquisition matrix size of 320 × 249, echo train length of 27, 1 signal average, slice thickness of 5 mm, interslice gap of 1.5 mm, and scanning time of 198 s.
We included six brain structures that are involved in Parkinsonian syndromes in the gray matter, namely, the midbrain, pons, putamen, pallidum, caudate, and third ventricle. These areas are reported to have the highest sensitivity and specificity for differentiating Parkinsonian syndromes [13,16]. The MRI scans were resized to 256 × 256 × 128 (i.e., number of slices in the coronal/sagittal/axial planes) to segment each structure.
FS accepts Digital Imaging and Communications in Medicine (DICOM) or Neuroimaging Informatics Technology Initiative (NIfTI) files as inputs. DICOM is a powerful and flexible but complex format that provides interoperability between several hardware and software tools. Given its complexity, DICOM may be inefficient in image processing and analysis [31]. In addition, DICOM stores a single volume as a series of 2D slices, which is cumbersome for 3D imaging. NIfTI is a more straightforward format than DICOM and preserves the essential metadata. In addition, it maintains the volume as a single file and stores raw data after a simple header, so NIfTI files can be loaded and processed faster than DICOM files. Therefore, we converted the brain MRI files from the DICOM format into the NIfTI format using MRIcroGL.

Brain structure segmentation: Baseline with FS
The extraction of brain structures using atlas-based automated segmentation is necessary for training and validation before establishing an automated DL segmentation model. In this study, we used these results as DL ground-truth labels and evaluated the validity of the DL models for generating the same labels. As a representative technology for atlas-based automated segmentation (see details in Supplementary Section A), we selected FS (version 7.2), which is publicly available for neuroscience research and provides high segmentation performance [18,19,20,21,32,33].
To segment and extract the six brain structures, FS sequentially executes the recon-all pipeline and the brainstem substructure pipeline. We used both pipelines because the recon-all pipeline does not support segmentation of brainstem structures (e.g., pons and midbrain). However, because the brainstem substructure pipeline receives preprocessed inputs from the recon-all pipeline, both pipelines should be executed. Therefore, the extraction of the six brain structures through FS can be divided into MRI scan preprocessing in the recon-all pipeline and the remaining segmentation of the recon-all pipeline along with segmentation in the brainstem substructure pipeline. These processes are explained in Sections 2.2.1 and 2.2.2.

MRI scan preprocessing for FS: Motion correction and skull removal
The MRI scan preprocessing in the recon-all pipeline of FS mainly consists of 1) motion correction, 2) normalization, and 3) skull stripping. Motion correction is conducted before averaging when various source volumes are used, compensating for small motion variations between volumes. FS constructs cortical surface models and the boundary between white matter and cortical gray matter to automatically match the brain images of patients, using software [17]. In addition, intensity normalization is applied to the original volume. However, adjusting for intensity fluctuations may hinder intensity-based segmentation. Instead, the intensities of all voxels are scaled so that the mean intensity of white matter is 110.
After correcting for motion and normalizing the data, FS removes the skull and provides the skull-stripped brain MRI scan. Removing nonbrain tissues (e.g., skin, fat, muscle, neck, and eyeballs) may reduce human rater variability [34], promote automated brain image segmentation, and improve analysis quality. Therefore, brain MRI scans should be preprocessed to isolate the brain from extracranial or nonbrain tissues in a process known as skull stripping [35]. FS developers devised and applied in-house automated skull-stripping algorithms to isolate intracranial cavities by default.
In this study, the steps of brain MRI scan preprocessing (i.e., skull stripping with motion correction and normalization of a brain MRI scan) took approximately 20 min. We converted the final skull-stripped images to NIfTI files with a size of 256 × 256 × 128, while the original brain MRI scan had a size of 256 × 256 × 256, which was adjusted for efficient comparison with the DL models.

FS for brain structure segmentation
After preprocessing (Section 2.2.1), FS segments the six brain structures by applying the remaining processes of the recon-all pipeline and the complete brainstem substructure pipeline. After skull stripping, registration-based segmentation proceeds as follows. FS determines and refines the white and gray matter interfaces for both hemispheres. Then, FS searches for the edge of the gray matter, which represents the pial surface. With pial surfaces, FS expands and inflates sulci banks and gyri ridges. Subsequently, it extends again into a sphere and parcellates the cortex. After applying these processes, FS segments the brain. The recon-all pipeline encompasses some brain structures (i.e., putamen, caudate, pallidum, and third ventricle), while the brainstem substructure pipeline segments the midbrain and pons.
In this study, the final segmentation result was assessed with the same input size of 256 × 256 × 128. In addition, we replaced FS with a DL model applied to the skull-stripped MRI scan (i.e., preprocessing result of the recon-all pipeline) to perform segmentation. For the replacement, we evaluated whether the DL analysis is faster than FS analysis and whether the segmentation result of DL is sufficiently reproducible compared with that of FS. The difference between FS and DL segmentation is illustrated in Figures 4 and 5.

DL models for brain structure segmentation
In this study, we used DL models and FS to segment the same skull-stripped images (i.e., images preprocessed by the FS recon-all pipeline, as described in Section 2.2.1). The original size of the skull-stripped image generated by FS was 256 × 256 × 256, which was adjusted to 256 × 256 × 128 for DL segmentation owing to the limited graphics processing unit (GPU) memory. Specifically, similar to the segmentation using FS described in Section 2.2.2, the DL models received skull-stripped images as inputs and were trained to individually segment each structure as a binary mask, in which voxels inside and outside the structure were valued 1 and 0, respectively. We evaluated and compared the performance and analysis time of the DL models by replacing the segmentation process of FS after skull stripping with DL. FS may be inefficient because it segments the entire brain image, requiring many hours of processing. In fact, FS takes at least 4.5 h to segment the six brain structures considered in this study because it requires atlas-based registration to transform the coordinates of the entire MRI scan to segment specific brain structures. Consequently, FS cannot notably reduce the processing time even if only six brain structures are to be segmented. On the other hand, we verified that DL segmentation (e.g., using V-Net or UNETR) takes less than 1 min per case to segment the six target brain structures. As DL models do not require complex registration, unlike non-artificial-intelligence methods (e.g., FS), they can substantially increase the processing efficiency.

DL models
The implementation details of the DL models are described herein. As DL models, we adopted the CNN-based V-Net [26] and ViT-based UNETR [27] using the segmentation results provided by FS as labels (Section 2.2.2). The two models were trained to reproduce FS segmentation.

CNN-based V-Net
V-Net has been used to segment an entire volume after training an end-to-end CNN on MRI volumes depicting the prostate [26]. The architecture of V-Net is V-shaped, where the left part of the network is a compression path, whereas the right part decompresses the features until the original input size is recovered. The left part of the network is separated into stages that operate at varying resolutions.
In this study, one to three convolutional layers were used in each stage. A residual function was learned at each level. The input of the residual part was used in the convolutional layers and nonlinear operations. This output was added to the last convolutional layer of the stage. The rectified linear unit (ReLU) was used as the nonlinear activation function. Convolutions were applied throughout the compression path. The right part of the network learned a residual function similar to that of the left part. V-Net has shown promising segmentation results, and using this model in our application improved performance. The model was adjusted according to the available memory. The proposed architecture is illustrated in Figure 2. The left part used a residual block (ResBlock) and maximum pooling (MaxPooling). ResBlock was applied to all the blocks with an input size of 256 × 256 × 128. On the other hand, 3D MaxPooling reduced the depth, height, and width of the feature maps to reduce their resolution. The right part also used ResBlock but replaced MaxPooling with UpConvolution, which consisted of 3D upsampling, batch normalization, ReLU activation, and convolutional layers (5 × 5 × 5 filter, same padding, and stride of 1). Upsampling increased the resolution of the feature maps, and batch normalization improved convergence throughout the network [36].

ViT-based UNETR
UNETR [27] is a transformer architecture for 3D medical-image segmentation. It uses a transformer as the encoder to learn the sequence representations of the input volume and capture global multiscale information while adopting U-shaped architectures for the encoder and decoder. The proposed architecture is illustrated in Figure 3.
UNETR followed a contracting-expanding path with an encoder comprising a stack of transformers connected to a decoder through skip connections. The encoder directly used 3D patches and was connected to a CNN-based decoder via a skip connection. A 3D input volume was split into homogeneous nonoverlapping patches and projected onto a subspace using a linear layer. Position embedding was applied to the sequence and then used as input to the transformer. The encoded representations at different levels in the transformer were retrieved and sent to a decoder via skip connections to obtain the segmentation results.
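To give a sense of the sequence length that the transformer encoder processes, the following sketch counts the nonoverlapping patches obtained from a 256 × 256 × 128 volume. The patch size is not stated in the text; the value of 16 used here is an assumption for illustration only, and `num_patches` is a hypothetical helper.

```python
def num_patches(volume_shape, patch_size):
    """Number of nonoverlapping cubic patches that tile a 3D volume."""
    if any(dim % patch_size for dim in volume_shape):
        raise ValueError("each dimension must be divisible by the patch size")
    count = 1
    for dim in volume_shape:
        count *= dim // patch_size
    return count


volume_shape = (256, 256, 128)  # input size used in this study
patch_size = 16                 # assumption: not stated in the text
sequence_length = num_patches(volume_shape, patch_size)
# Length of each flattened single-channel patch before the linear projection.
voxels_per_patch = patch_size ** 3
print(sequence_length, voxels_per_patch)
```

Under this assumption, the volume yields 16 × 16 × 8 = 2048 patches of 4096 voxels each; a smaller patch size would lengthen the sequence and increase the memory required by self-attention.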

Implementation details of DL models: Training and inference
For the DL models, the input comprised a brain mask and the corresponding patient's segmented brain structures in the MRI scans, which were merged into an array of dimension 256 × 256 × 128. The ground truth of each brain structure was segmented using FS. For evaluation, threefold cross-validation was applied, and the Dice score and Dice loss were calculated on the test data. We implemented V-Net in TensorFlow and Keras and trained it for 100 epochs. For UNETR, PyTorch and MONAI were used, and the model was trained for 20,000 iterations. Both models were trained using an NVIDIA Tesla V100 DGXS GPU with a batch size of 1 and an initial learning rate of 0.0001.
We evaluated the accuracy of the models using the Dice score by comparing the segmentation predicted by V-Net (or UNETR) with the FS output. The Dice score measures the overlap between the reference and predicted segmentation masks. A Dice score of 1 indicates perfect spatial correspondence between the two binary images, whereas a score of 0 indicates no overlap. We used the Dice loss to determine the performance of the three outer cross-validations on their test sets for the corresponding structures. If $F_i$ and $V_i$ are the ground-truth mask and its prediction for each brain structure, respectively (i.e., FS segmentation mask $F_i$ and its DL prediction mask $V_i$, respectively, as shown in Figure 1), the Dice score [37] for each brain structure $i \in \{$pallidum, putamen, caudate, third ventricle, midbrain, pons$\}$ is derived as
$$\mathrm{Dice}(F_i, V_i) = \frac{2\,\lVert F_i \circ V_i \rVert_1}{\lVert F_i \rVert_1 + \lVert V_i \rVert_1},$$
where $\circ$ denotes the Hadamard product (i.e., component-wise multiplication) and $\lVert \cdot \rVert_1$ is the L1-norm (i.e., sum of absolute values of all components). Moreover, we measured the segmentation time for evaluation.
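The Dice score described above (twice the L1-norm of the Hadamard product divided by the sum of the L1-norms of the two masks) can be sketched as follows. This is a minimal illustration on flattened binary masks rather than the authors' implementation; the example masks are hypothetical, and returning 1.0 for two empty masks is a convention chosen here.

```python
def dice_score(reference, prediction):
    """Dice score between two binary masks given as flat sequences of 0/1."""
    if len(reference) != len(prediction):
        raise ValueError("masks must have the same number of voxels")
    # L1-norm of the Hadamard product: number of voxels set in both masks.
    overlap = sum(f * v for f, v in zip(reference, prediction))
    # Sum of the L1-norms of the two masks.
    total = sum(reference) + sum(prediction)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / total


fs_mask = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical FS reference mask F_i
dl_mask = [0, 1, 1, 0, 0, 1, 1, 0]  # hypothetical DL prediction mask V_i
print(dice_score(fs_mask, dl_mask))  # 3 shared voxels, 4 + 4 in total -> 0.75
```

A 256 × 256 × 128 mask would be flattened to a single sequence before calling the function; in practice an array library would perform the same sums far more efficiently.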

Statistical analysis for binary classification of cases
We obtained the absolute volumes from the six segmented brain structures (i.e., pons, putamen, pallidum, midbrain, caudate, and third ventricle) predicted by the DL models (i.e., CNN-based V-Net or ViT-based UNETR) or FS. Based on the absolute volume of the individual brain structures, we calculated the AUC of the binary classification of diseases, normal vs. P-plus, normal vs. PD, and PD vs. P-plus cases. The AUC was computed based on the receiver operating characteristic curve produced by the correlation between the predicted absolute volume of each brain structure and each case.
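As a rough illustration of how a single structure's volume can yield an AUC, the sketch below computes the volume of a binary mask and a rank-based AUC (the probability that a randomly chosen patient scores higher than a randomly chosen control, with ties counted as one half). The voxel volume, the example volumes, and the use of the negated volume as the score are assumptions for illustration, not values or procedures taken from the study.

```python
def mask_volume_mm3(mask, voxel_volume_mm3=1.0):
    """Absolute volume of a binary mask (flat sequence of 0/1 voxels)."""
    return sum(mask) * voxel_volume_mm3


def auc(positive_scores, negative_scores):
    """AUC as the probability that a positive case scores above a negative one."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical volumes (mm^3) of one structure; not data from the study.
control_volumes = [6100.0, 5900.0, 6300.0, 5800.0]
patient_volumes = [5200.0, 5950.0, 5000.0, 5400.0]

# Atrophy means smaller volumes in patients, so the negated volume is the score.
result = auc([-v for v in patient_volumes], [-v for v in control_volumes])
print(result)  # 14 of 16 patient-control pairs are ordered correctly -> 0.875
```

Sweeping a volume threshold over all observed values traces the receiver operating characteristic curve, and the area under that curve equals this pairwise statistic.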
Disease binary classification was conducted using the six segmented brain structures individually or collectively. For individual analysis, the AUC was derived through thresholding-based binary classification by obtaining the absolute volume of the individual structures. For a comprehensive analysis of all structures, we additionally considered an ML classification algorithm to perform disease binary classification with the six volumes as inputs. For the classification algorithm, binomial logistic regression (LR) and extreme gradient boosting (XGBoost) were used. LR is a statistical model widely used in ML classification [38,39,40]. XGBoost is a well-established method that produces advanced results among gradient-boosting-based techniques [41] (e.g., XGBoost was used in 17 of the 29 winning solutions to ML challenges posted on Kaggle in 2015 [42]). In both methods, we evaluated the AUC obtained by the DL model and FS through threefold cross-validation.
As mentioned in Section 2.2.2, the brain structure segmentation using FS sequentially processes the remainder of the recon-all pipeline and the complete brainstem substructure pipeline. In FS segmentation, we removed the analysis time of preprocessing (i.e., time to extract the skull-stripped image from the original MRI) described in Section 2.2.1. The resulting time provides a fair comparison of the total times, as FS and the DL models use the skull-stripped MRI scan as input to derive the final segmentation results, indicated by bold values in Table 2.

Segmentation time of brain structures
The CNN-based V-Net and ViT-based UNETR are considerably faster than FS. On average, V-Net took 3.48 s to segment the six brain structures, and UNETR took 48.14 s, whereas FS took approximately 15,735 s, being approximately 4521 and 326 times slower than V-Net and UNETR, respectively.
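The speed-up factors quoted above follow directly from the reported mean segmentation times, as the short calculation below shows. The variable names are illustrative; the three times are the values given in the text.

```python
fs_seconds = 15735.0   # FS segmentation time reported in the text
vnet_seconds = 3.48    # V-Net segmentation time reported in the text
unetr_seconds = 48.14  # UNETR segmentation time reported in the text

vnet_speedup = fs_seconds / vnet_seconds
unetr_speedup = fs_seconds / unetr_seconds
fs_hours = fs_seconds / 3600.0

print(f"V-Net speed-up: {vnet_speedup:.1f}x")
print(f"UNETR speed-up: {unetr_speedup:.1f}x")
print(f"FS segmentation time: {fs_hours:.2f} h")
```

The ratios truncate to the factors of 4521 and 326 given in the text, and the FS segmentation time corresponds to roughly 4.4 h before the approximately 20 min of preprocessing is added.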

Dice score of brain structure segmentation using DL models
Segmentation and prediction results of V-Net and FS are illustrated in Figure 4. The corresponding results of UNETR are illustrated in Figure 5. The Dice score was obtained (Table 3) to evaluate the performance of 3D image segmentation. The CNN- and ViT-based models showed high Dice scores above 0.85 for all the brain structures. The Dice scores were higher for the midbrain and pons than for the basal ganglia (i.e., caudate, putamen, pallidum), possibly because the brainstem structures are surrounded by cerebrospinal fluid, which provides a stronger contrast for accurate segmentation. The ViT-based model showed a higher Dice score than the CNN-based model, which in turn showed a much shorter segmentation time than the ViT-based model (e.g., 3.48 s for V-Net and 48.14 s for UNETR, as shown in Table 2). In addition, the CNN-based V-Net had a similar performance to the ViT-based UNETR in actual disease classification, as listed in Table 4.

Binary classification based on individual brain structures
Using the estimated volumes, we performed binary classification for cases normal vs. P-plus, normal vs. PD, and PD vs. P-plus, where P-plus comprised PSP, MSA-P, and MSA-C cases. The AUCs of the brain structures for each model were compared, as listed in Table 4, which also presents the AUC of the midbrain-to-pons ratio [43].
Among the 98 cases (7 cases of binary classification × 2 DL models × 7 cases of brain structures), there was no significant difference in AUC between the DL models and FS, except for 11 cases. Of these 11 cases, 7 AUCs of the DL models (i.e., CNN-based V-Net and ViT-based UNETR) were higher than those of FS. Furthermore, most of the cases for the CNN-based V-Net showed no lower AUC for disease classification than the cases for the ViT-based UNETR.
The highest AUCs in the comparison between the methods were higher in normal or PD vs. MSA-C (0.91–0.94) than in normal or PD vs. PSP (0.75–0.89). Among the brain structures, the midbrain-to-pons ratio showed the best performance in normal vs. MSA-C and PD vs. MSA-C, while the third ventricle and pallidum showed the best performance in normal vs. PSP and PD vs. PSP. The highest AUCs were not significantly different in the classification of normal or PD vs. MSA-P (0.69–0.73) or normal vs. PD (0.63).

Binary classification based on complete brain structures
Most AUCs of the DL models were not significantly different from those of FS, as listed in Table 5, although a considerable difference existed in the segmentation speed between the models and FS, as listed in Table 2. In Table 5, the highest AUCs of FS and the DL models for each binary classification are indicated in bold. The highest AUCs of classification between PD vs. P-plus and normal vs. P-plus were higher than 0.8 in both DL models, except for PD vs. MSA-P (AUC > 0.76). There was no significant difference between FS and the DL models (p-value of 0.05 or higher) in all highest AUCs.
Table 5 shows that of the 28 cases (2 ML models × 2 DL models × 7 binary classifications), 24 cases had no significant differences with FS, obtaining p-values above 0.05. As listed in Table 3, the CNN-based V-Net achieved a better AUC than the ViT-based UNETR. In 9 of the 14 pairs of cases, the CNN-based V-Net outperformed the ViT-based UNETR. In both LR and XGBoost, collectively considering the six brain structures (Table 5) resulted in a significantly higher AUC than when considering the individual structures (Table 4). The best performance was higher in normal or PD vs. MSA-C (0.93–0.95) than in normal or PD vs. PSP (0.80–0.89). Unlike the AUC in individual brain structures, the highest AUC became significant in normal or PD vs. MSA-P (0.79–0.82). The highest AUC was not significantly different for normal vs. PD (0.70). We interpret these results in the Discussion section.

Discussion
We developed two DL models, V-Net and UNETR, which showed significantly faster brain segmentation than FS and a comparable accuracy. Our DL models shortened the segmentation time by at least 300 times compared with FS. Moreover, they showed robust high performance in differential diagnosis between PD and P-plus cases using the volume of segmented brain structures. The DL models were efficient (i.e., analysis speed at least 300 times faster than FS) and effective (i.e., comparable to FS in Dice score and AUC) in automated brain segmentation and disease diagnosis, even for simultaneous analysis of all brain structures and their individual analyses. Thus, the proposed DL models may promote the application of automated brain segmentation in clinical practice and facilitate efficient and accurate brain research in medicine.
Automated tools have scarcely been adopted for brain segmentation in clinical practice despite their high accuracy in the differential diagnosis of patients with Parkinsonism [13,16]. This is mainly attributable to the complicated and time-consuming process of automated brain segmentation compared with physicians' qualitative visual assessment of brain MRI scans. Consequently, automated segmentation models have mainly been used in research settings that require quantitative brain measurements. Nevertheless, their application in clinical settings may increase with our DL models, which have shown much faster segmentation than FS with a similar accuracy. The DL models may contribute to improving the accuracy of clinical diagnosis of PD or P-plus cases by providing precise brain image analysis. In addition, clinical trials that require quantitative brain measurement from a large population may be conveniently conducted using our fast and accurate DL models. In the past, methods for brain image analysis were time- and resource-consuming, even with an automated segmentation tool such as FS. While V-Net and UNETR showed significantly faster segmentation with satisfactory accuracy, the CNN-based V-Net may be more suitable in clinical settings for diagnosis based on volumetry of brain MRI scans. Although the ViT-based UNETR is the most recent DL model and shows a high Dice score, the number of training parameters is approximately 46 times larger than that of V-Net. As the number of calculations increases with the number of trainable parameters, the hardware requirements increase in terms of GPU memory and processing power. Consequently, the ViT-based UNETR may be considerably demanding for training and evaluation, requiring a high-specification GPU. The CNN-based V-Net showed an AUC generally higher than that of UNETR and lower Dice scores. Until the ViT performance is further improved, the CNN-based V-Net, which uses fewer GPU resources, seems to be the best option for clinical practice.
Regarding the AUC of differential diagnosis for PD and P-plus cases, the CNN- and ViT-based models (V-Net and UNETR, respectively) showed comparable performance to FS. Since our DL models are 14 to 300 times faster than FS without sacrificing diagnostic performance, they are superior to FS in terms of clinical efficacy. In binary classification using individual brain structures, the relative order of the AUC of each brain structure was consistent with previously reported results [10,44]. For instance, the pons and midbrain-to-pons ratio showed the highest AUC in classification of normal vs. MSA-C and PD vs. MSA-C cases. The third ventricle and pallidum showed the highest AUC in classification of normal vs. PSP and PD vs. PSP cases. The putamen showed the highest AUC in classification of PD vs. MSA-P cases. In the classification of PD vs. PSP cases, the third ventricle showed a higher AUC, whereas the midbrain showed a relatively lower AUC. Single measurements of the midbrain have failed to differentiate PSP from PD or MSA [45,46,47], despite classic MRI studies showing an atrophic midbrain in PSP [7,11]. On the other hand, the third ventricle has been shown to be a reliable marker for differentiating early-stage PSP from PD and late-stage PSP [48], and it has been added to a new version of the magnetic resonance Parkinsonism index [49].
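To make the single-structure classification concrete, the AUC of one scalar feature (e.g., a volume or the midbrain-to-pons ratio) can be computed as the probability that a randomly chosen case from one group has a larger value than a randomly chosen case from the other group (the Mann-Whitney formulation). The sketch below uses hypothetical volumes that are not data from this study, and the function names are illustrative assumptions.

```python
def auc_from_scores(pos, neg):
    """AUC as the Mann-Whitney statistic: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def midbrain_to_pons_ratio(midbrain_mm3, pons_mm3):
    """Ratio of midbrain volume to pons volume for one subject."""
    return midbrain_mm3 / pons_mm3


# Hypothetical (midbrain, pons) volumes in mm^3 for two groups; NOT study data.
group_a = [(6000, 15000), (6200, 14800), (5900, 15500)]
group_b = [(6100, 11000), (5800, 12000), (6000, 15200)]

ratios_a = [midbrain_to_pons_ratio(m, p) for m, p in group_a]
ratios_b = [midbrain_to_pons_ratio(m, p) for m, p in group_b]

# Treat group_b as the "positive" class; an AUC below 0.5 would only mean
# that the feature separates the groups in the opposite direction.
auc = auc_from_scores(ratios_b, ratios_a)
print(f"AUC of the midbrain-to-pons ratio: {auc:.3f}")
```

An AUC of 0.5 corresponds to no separation between the groups, and values close to 1 (or, with the direction reversed, close to 0) indicate that the single feature separates them well.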
For binary classification based on the six brain structures, significant improvements in the AUC were achieved in all models. In both DL models, the highest AUC of classification of PD vs. P-plus and normal vs. P-plus cases was above 0.8, except for PD vs. MSA-P cases. The relatively low AUC of classification between PD and MSA-P cases based on brain MRI scans has also been reported in previous studies [10,44]. The limitation of clinical diagnosis may have contributed to the relatively low AUCs in these studies owing to the overlapping manifestations between PD and MSA-P cases. Clinical diagnosis of PSP and MSA-P has been reported to have the most frequent discrepancy from autopsy-proven diagnosis, even when considering diagnostic criteria [50]. No significant difference in brain MRI scans has been found between normal and PD cases, resulting in no significant AUC differences for classification between these cases.
Our study has some limitations. First, the diagnoses of PD, PSP, and MSA-C were not pathologically verified. Instead, movement specialists provided clinical diagnoses based on validated clinical consensus, providing only probable diagnoses. Second, we segmented six brain structures, namely, midbrain, pons, caudate, putamen, pallidum, and third ventricle, but disregarded other brain structures that may reflect different pathologic characteristics between PD and P-plus (e.g., cerebellum, middle cerebellar peduncle). We excluded those structures owing to the low segmentation accuracy achieved by FS. Nevertheless, the differential diagnosis of P-plus using only the brain structures included in this study has been reported as reliable [43]. Third, given memory limitations, we downscaled the input shape from 256 × 256 × 256 to 256 × 256 × 128, which may have caused an information loss. Nevertheless, the Dice scores suggest a negligible impact of information loss, whereas using a downscaled input accelerates training and inference in DL models.
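The trade-off in the third limitation can be illustrated with two small calculations: the number of voxels saved by the downscaling, and the Dice score used to judge whether segmentation quality suffered. The sketch below assumes the standard Dice definition, 2|A ∩ B| / (|A| + |B|), and uses toy masks represented as sets of voxel coordinates; none of the names or values come from the study.

```python
def dice_score(mask_a, mask_b):
    """Dice = 2|A & B| / (|A| + |B|) for two sets of voxel coordinates."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


# Voxel counts before and after downscaling along the last axis.
full_voxels = 256 * 256 * 256
downscaled_voxels = 256 * 256 * 128
print(f"voxels kept: {downscaled_voxels / full_voxels:.0%}")

# Toy 4 x 4 x 4 reference mask and a prediction shifted by one voxel in x.
reference = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
predicted = {(x, y, z) for x in range(1, 5) for y in range(4) for z in range(4)}
print(f"Dice of toy masks: {dice_score(reference, predicted):.2f}")
```

Halving one axis halves the number of voxels the network has to process, while a Dice score close to 1 between the DL output and the FS reference indicates that little overlap was lost.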
Automated segmentation of brain MRI scans has become an influential method for diagnosing neurodegenerative diseases, including movement disorders. The proposed DL models showed remarkable results for both brain segmentation and the differential diagnosis of PD and P-plus. Using the high-performance CNN- and ViT-based models, we significantly shortened the segmentation time of deep brain structures while obtaining comparable accuracy to the conventional FS segmentation. Despite the superior DL performance, no quantitative results of the comparative analysis and evaluation of the performance of DL have been reported to date for the differential diagnosis of neurodegenerative diseases, including PD and P-plus. To the best of our knowledge, this is the first study to quantitatively establish the significance of DL segmentation and disease classification. We found that the cost-effective CNN-based model achieves satisfactory performance in both segmentation and differential diagnosis compared with the most recent ViT-based model. Our DL models may contribute to the development of patient- and clinician-friendly segmentation methods that enable fast and accurate diagnosis and may provide a meaningful reference for hospitals planning to introduce DL brain segmentation and diagnosis for neurodegenerative diseases.

Data Availability
The authors declare that the main data supporting the results of this study are available within the paper. The raw datasets from Samsung Medical Center are protected to preserve patient privacy but can be made available upon reasonable request provided that approval is obtained from the corresponding Institutional Review Board.

A Related work

A.1 Manual segmentation
In manual MRI scan segmentation, human raters (e.g., expert physicians) manually delineate and label regions of interest in the scans [51]. Although this method is considered the gold standard, it is cumbersome and has low reproducibility. Manual segmentation of 3D volume scans is generally performed slice by slice and typically requires segmentation of 80 slices, which is tedious and time-consuming. While various brain structures have been used to diagnose central nervous system diseases (e.g., stroke, Alzheimer's disease), certain brain structures are used for diagnosing atypical Parkinsonism. The putamen, globus pallidus, midbrain, and pons are the main brain structures that show changes in atypical Parkinsonism and are often segmented for diagnosis and differentiation. However, manual segmentation of these structures in a brain MRI scan is time-consuming and strenuous, even for an experienced radiologist or neurologist who can accurately recognize these structures. In addition, manual segmentation is prone to inter- and intra-rater variability [10,11,12]. Moreover, the segmentation quality depends on rater proficiency, and even experienced specialists may show variability from their previous annotations. Hence, validation by at least two raters is required for the analysis. Given the challenges and problems of manual segmentation, automated methods are preferred for large-scale datasets in clinical trials or when accurate and quantitative analyses of brain MRI scans are required, such as when measuring the volume or signal intensity of a brain structure.

A.2 Automated segmentation: Atlas-based method
Automated image segmentation has been dominated by atlas-based methods that formulate segmentation as an image-registration problem [22]. A labeled image (i.e., an atlas) is transformed (i.e., registered) using a deformation model for mapping onto an unlabeled image (i.e., test scan). The established spatial correspondence is then used to transfer labels from the atlas to the target MRI scan [23,24,25]. Initially, a single atlas delineated by medical experts was used, but segmentation could be highly biased depending on the quality of registration (i.e., similarity between the atlas and scan) [52]. Subsequently, multiple labeled atlases have been used to mitigate bias and capture wide anatomical variations [52]. Accordingly, two strategies have been proposed: 1) multi-atlas and 2) Bayesian segmentation. Multi-atlas segmentation registers atlases individually onto the test scan and applies label fusion (majority voting) to propagate the most frequently selected labels [53,54]. Bayesian segmentation uses a single probabilistic atlas that summarizes all atlases [55,56]. This entails propagating label probabilities (prior) and image voxel intensities (likelihood) to deduce a generative model (posterior probability) using Bayes' rule. This strategy can be adapted to MRI scans [55,57,58,59] and is faster than multi-atlas segmentation because it requires only one computationally intensive registration step per scan. Bayesian segmentation is implemented in various tools such as FS [17], statistical parametric mapping [55], and the FMRIB software library (FSL) FMRIB integrated registration and segmentation tool (FIRST) [60].
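The two strategies described above can be sketched for a handful of voxels. The example below is a toy illustration under simplifying assumptions (atlases already registered, labels given as flat lists, made-up prior and likelihood values); it does not reproduce how FS or any other tool implements these steps, and all names are hypothetical.

```python
from collections import Counter


def majority_vote(labels_per_atlas):
    """Label fusion: labels_per_atlas[i][v] is atlas i's label for voxel v."""
    n_voxels = len(labels_per_atlas[0])
    fused = []
    for v in range(n_voxels):
        votes = Counter(atlas[v] for atlas in labels_per_atlas)
        fused.append(votes.most_common(1)[0][0])  # most frequent label wins
    return fused


def posterior(prior, likelihood):
    """Bayes' rule for one voxel: normalise prior[label] * likelihood[label]."""
    unnormalised = {label: prior[label] * likelihood[label] for label in prior}
    total = sum(unnormalised.values())
    return {label: value / total for label, value in unnormalised.items()}


# Three registered atlases voting on four voxels (toy labels).
atlases = [
    ["bg", "putamen", "putamen", "bg"],
    ["bg", "putamen", "bg", "bg"],
    ["putamen", "putamen", "putamen", "bg"],
]
print(majority_vote(atlases))

# One voxel: the probabilistic atlas gives the prior, the intensity model
# gives the likelihood of the observed intensity (both values are made up).
voxel_posterior = posterior(
    prior={"putamen": 0.3, "bg": 0.7},
    likelihood={"putamen": 0.8, "bg": 0.2},
)
print(voxel_posterior)
```

Using an odd number of atlases with two labels avoids ties in the vote; real label-fusion schemes typically need an explicit tie-breaking or weighting rule.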
Velasco-Annis et al. [33] analyzed various automated segmentation algorithms. For our six target brain structures, Velasco-Annis et al. [33] reported the average specificity, positive predictive value, and Dice score of FS as higher than those of FSL-FIRST. Additionally, compared with other automated approaches (i.e., statistical parametric mapping and FSL), FS had the highest sensitivity and specificity for brain volume changes in ROC analysis, achieving more consistency, less susceptibility to noise, and better image quality [18,19,20,21]. Furthermore, with several segmentation tools introduced for general brain segmentation, FS is frequently used in PD diagnosis [61,62,63,64]. Therefore, with the extensive and automated analysis of key features in the human brain, FS has been widely recognized as the most representative atlas-based automated segmentation method for brain structure analysis. Thus, it served as the reference in our study.

A.3 Automated segmentation: DL model
Modern automated image segmentation relies on DL techniques, with the two most generalized DL models being CNNs and ViTs. As for other computer vision tasks, CNNs are predominant in image segmentation owing to the effectiveness of the convolution operation. Convolution deals with sparse interactions (local connections), weight (parameter) sharing, and translation equivariance, giving CNNs a strong and useful inductive bias (prior knowledge) and allowing them to quickly converge with reduced computational complexity. Owing to the effectiveness of the convolution operation, the UNet architecture [65] has achieved outstanding results in the medical field [66,67,68,69,70,71], being widely used for segmentation. UNet has a U-shaped symmetric encoder-decoder architecture, typically including 1) a convolutional encoder (or downsampling network) to extract relevant features from the inputs at different resolutions, followed by 2) a convolutional decoder (or upsampling network) to synthesize the extracted features as a high-resolution image to obtain pixel- or voxel-wise precision, and 3) a skip connection between layers to recover spatial information lost during downsampling. V-Net [26] is a representative variant of UNet for 3D medical image segmentation.
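The convolution properties listed above (local connections, weight sharing, translation equivariance) can be seen in a one-dimensional toy example. The sketch below is a plain-Python illustration with made-up values, not code from V-Net or UNet: one small kernel is reused at every position, and shifting the input shifts the output by the same amount.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: one shared kernel slides over the signal."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))  # local window only
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]  # the same pattern moved one step to the right
kernel = [1, 0, -1]          # three shared weights, reused at every position

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

print(out)
print(out_shifted)
# Away from the borders, the response to the shifted input is the shifted
# response to the original input.
print(out_shifted[1:] == out[:-1])
```

Because the same three weights are applied everywhere, the number of parameters does not grow with the input size, which is one reason CNNs converge quickly on limited data.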
Despite their efficiency, CNNs have a limited ability to learn long-distance dependencies owing to the locality of receptive fields in the convolutional layers [72,73]. Thus, transformer-based models, which use self-attention mechanisms as core operators, have recently enabled attractive solutions for computer vision tasks. The key idea of the self-attention mechanism, which has shown great success in natural language processing, is to learn the relative importance (self-alignment) of a single token relative to all other tokens in a sequence [74]. In other words, calculating the pairwise interactions between all input units has essentially the same effect as having a global receptive field of long-range dependencies [75]. Inspired by this mechanism, ViT [76] was introduced to interpret an image as a sequence of patches, adapting self-attention for computer vision applications. ViT and its variants have demonstrated excellent performance in many computer vision tasks [77,78,79,80,81]. UNETR [27] is a representative ViT-based 3D image segmentation model that improves the segmentation performance by reducing the loss of encoding information by converting the encoder of an existing CNN-based segmentation model into a ViT. However, transformer-based approaches have limitations, such as the need for large amounts of training data owing to the lack of inductive bias and the quadratic computational complexity of self-attention according to the input image size [76].
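A minimal sketch of scaled dot-product self-attention makes the global receptive field and the quadratic cost concrete. The version below is deliberately simplified: it uses the token vectors directly as queries, keys, and values (no learned projections, single head), and the 16 × 16 × 16 patch size used in the patch count is an assumption for illustration rather than a setting taken from this study.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (simplified)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # One score per (query, key) pair: every token attends to every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


def num_patches(shape, patch):
    """Number of non-overlapping patches (tokens) for a volume."""
    return math.prod(s // p for s, p in zip(shape, patch))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy patch embeddings
outputs, weights = self_attention(tokens)
print(weights)

n = num_patches((256, 256, 128), (16, 16, 16))  # patch size is an assumption
print(n, "tokens ->", n * n, "pairwise attention scores per layer and head")
```

The attention weight matrix has one row and one column per token, so doubling the number of patches quadruples the number of pairwise scores.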
Although the medical community has a great interest in DL models for image segmentation, few studies have been conducted on segmenting the intricate brain structures to diagnose diseases. To the best of our knowledge, no existing study has investigated DL methods for segmenting the biomarkers of Parkinsonian syndromes. Bocchetta et al. [82] and Manjón et al. [83] used FS as an automated segmentation tool for diagnosing Parkinsonian syndromes but did not consider DL methods. Similarly, Khan et al. [84] proposed a method for fully automated segmentation of the brain without relying on DL. They introduced a pipeline that uses FS labeling to provide information in a highly nonlinear transformation method (i.e., large deformation diffeomorphic metric mapping). In this study, we used high-performance DL models based on CNN and ViT to segment brain structures of patients with Parkinsonian syndromes. We established that DL models can yield results equal to or more effective than those of FS. These models can substantially shorten the segmentation time while retaining the accuracy of non-DL FS segmentation.

Figure 1: Overview of the study. The diagnostic performance for Parkinsonian syndromes, in terms of analysis time and accuracy for extracting and segmenting brain structures, was compared between DL models and FS. Disease diagnosis was performed using the extracted structures individually or comprehensively.

Figure 2: Architecture of CNN-based 3D segmentation using V-Net. ResBlock, MaxPooling, and UpConvolution were used to reduce the depth, height, and width. The output shown in the figure is the segmentation of the pallidum. (Conv, convolution layer; BN, batch normalization)

Figure 4: Segmentation results of CNN-based V-Net (left 3D images in first column and red-highlighted areas in second column) and FS (right 3D images in first column and blue-highlighted areas in second column) for each brain structure.

Figure 5: Segmentation results of ViT-based UNETR (left 3D images in first column and red-highlighted areas in second column) and FS (right 3D images in first column and blue-highlighted areas in second column) for each brain structure.

Figure B.1: Study overview and performance comparisons. Using FS and DL V-Net and UNETR for segmentation, we analyzed the segmentation time, Dice score, and AUC of disease diagnosis considering each brain structure. In addition, a comparison of AUC was conducted using ML methods for disease diagnosis considering all brain structures.

Table 2: Measured segmentation time per patient obtained by using CNN-based V-Net, ViT-based UNETR, and FS. The time was calculated after the skull-stripped image was obtained. Data are shown as mean ± standard deviation. (V3, third ventricle)

Table 2 lists the time required to segment the six brain structures per patient.

Table 4: Disease binary classification based on individual brain structures. Segmentation AUC of CNN-based V-Net, ViT-based UNETR, and FS. Mean ± standard deviation for threefold cross-validation and midbrain-to-pons ratio segmentation are listed.
* p < 0.05 indicates a significant difference in AUC between the DL models and FS. The best result for each volume segmentation method based on FS and DL in binary classification is shown in bold.

Table 5: Binary classification of diseases based on all the brain structures. AUC in LR and XGBoost of CNN-based V-Net, ViT-based UNETR, and FS. The AUC is expressed as the mean from threefold cross-validation. (LR, logistic regression; XGBoost, eXtreme Gradient Boosting)
Normal vs. MSA-P: 0.78 ± 0.04, 0.73 ± 0.003 *, 0.81 ± 0.03, 0.77 ± 0.04, 0.79 ± 0.001, 0.82 ± 0.01
The best result for each volume segmentation method based on FS and DL in each binary classification is shown in bold.
* p < 0.05 indicates a significant difference in AUC between the DL models and FS.