A deep learning approach for fully automated measurements of lower extremity alignment in radiographic images

During the clinical evaluation of patients and the planning of orthopedic treatments, periodic assessment of lower limb alignment is critical. Currently, physicians use physical tools and radiographs to directly observe limb alignment. However, this process is manual, time consuming, and prone to human error. To address this, a deep-learning (DL)-based system was developed to automatically, rapidly, and accurately detect lower limb alignment using anteroposterior standing X-ray images of the lower limbs. For this study, leg radiographs of 770 non-overlapping patients were collected from January 2016 to August 2020. To precisely detect the necessary landmarks, a DL model was implemented stepwise. The final calculated measurements were compared with the observations of a radiologist in terms of the concordance correlation coefficient (CCC), Pearson correlation coefficient (PCC), and intraclass correlation coefficient (ICC). Based on the results for 250 frontal lower limb radiographs obtained from 250 patients, the system measurements for 16 indicators revealed superior reliability (CCC, PCC, and ICC ≥ 0.9; mean absolute error, mean square error, and root mean square error ≤ 0.9) relative to clinical observations. Furthermore, the average measurement time was approximately 12 s. In conclusion, the analysis of anteroposterior standing X-ray images by the DL-based lower limb alignment diagnostic support system produces measurement results similar to those obtained by radiologists.


Related research
Methods using DL techniques on radiographs to automatically assess lower limb alignment have been proposed to improve and supplement the manual interpretations of radiologists. To assess leg-length discrepancies in pediatric patients using radiographic images, Zheng et al. 11 proposed a system using DL techniques. A U-Net with mixed residual blocks was used to segment femurs and tibias in radiographs, followed by leg-length calculation. Furthermore, the measurement of pediatric leg length on radiographs was automated and performed rapidly using a DL algorithm.
Schock et al. 12 proposed a DL method for automatically analyzing lower limb alignment by calculating the anatomic-mechanical angle (AMA) and hip-knee-ankle angle (HKAA) using weight-bearing bilateral lower limb radiographs. U-Net 15, which is generally employed in semantic segmentation techniques, was used to generate the binary mask images required for quantitatively measuring AMAs and HKAAs. Furthermore, various data augmentation techniques were adopted to prevent overfitting of the model.
Tack et al. 13 proposed a multi-stage approach to localize relevant landmarks for assessing lower limb alignment. First, YOLOv4 16 was used to detect regions of interest (ROIs) in the entire lower limb radiographs, wherein landmarks within individual ROIs were located using ResNet 17. Second, the mean radial error was used as a loss function to minimize the regression errors. However, this study is limited in that the performance of ROI extraction by YOLOv4 is inconsistent for radiographs with low contrast, thereby introducing significant errors in the measurement of HKAA.
Finally, Lee et al. 14 proposed a DL-based system to automatically measure the leg length using entire leg radiographs of diverse patients, including those with orthopedic hardware implanted for surgical treatment. The system comprised a four-stage cascade architecture: ROI detection, bone segmentation, landmark detection, and leg-length calculation. For ROI detection and bone segmentation, a customized single-shot multi-box detector 18 and XY-attention network 19 were used, respectively. Independent of the orthopedic hardware implanted in the lower extremities of patients, the performance of this system was similar to that of radiologists in terms of accuracy and reliability.

Material and methods
This retrospective study was approved by the institutional review board of Keimyung University Dongsan Medical Center, where it was conducted (IRB No. DSM-2021-04-063). All methods were performed in accordance with the ethical standards of the Declaration of Helsinki. Because the data used in this retrospective study were fully deidentified to protect patient confidentiality, the requirement for informed consent was waived by the institutional review board of Keimyung University Dongsan Medical Center.
Study participants and datasets. The leg radiographs of 770 non-overlapping patients were collected from January 2016 to August 2020 (Innovision; DK Medical Systems Co. Ltd., Seoul, South Korea). Among the 770 images, we excluded 320 images featuring at least one artificial joint and images belonging to patients who underwent hip arthroplasty or exhibited skeletal or fibrous dysplasia. The remaining 450 images were used for system development, training, and performance validation. The images had an average resolution of approximately 3000 × 7000 pixels and were stored as 24-bit grayscale JPEG files. Before use, all images were anonymized for privacy protection.
The dataset was divided into training, validation, and test sets. For object detection and image segmentation training, data were allocated in the ratio training:validation:test = 200:50:200, with random selection for strict separation. The training set was used for object detection and semantic segmentation training to extract the required ROIs and mask images. The validation set was used to verify the performance of each model. The test set was used to evaluate the performance of the completed DL models and to compare the measurements of lower limb parameters obtained using the system with the clinical observations. In the ROIs within the individual radiographs, the segmentation masks were manually annotated by a board-certified radiologist (M.L., with 18 years of experience). These annotations were used as the ground truth for ROI detection and segmentation. Figure 1 displays the flowchart for the dataset composition process, and Table 1 summarizes the basic information of the patients from whom the dataset was acquired.
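As an illustration of the random allocation described above, the following is a minimal sketch of a seeded 200:50:200 split. The function name `split_dataset`, the fixed seed, and the patient identifiers are hypothetical and are not taken from the study.

```python
import random


def split_dataset(image_ids, n_train=200, n_val=50, n_test=200, seed=0):
    """Randomly partition image IDs into disjoint train/validation/test subsets."""
    if len(image_ids) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of images")
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 450 retained radiographs.
ids = [f"patient_{i:03d}" for i in range(450)]
train_ids, val_ids, test_ids = split_dataset(ids)
```

Because each radiograph belongs to a different patient, splitting by image ID in this way also keeps patients strictly separated across the three subsets.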

Reference standard for lower limb alignment
A board-certified radiologist (M.L., with 18 years of experience) measured parameters for assessing lower limb alignment in all radiographs within the test set. The measured parameters are listed in Table 2 20,21.
Model architecture. The proposed measurement system consists of three steps, as summarized in Fig. 2. In step 1, a DL algorithm was used to detect ten classes of ROIs corresponding to the left and right femurs and tibiae (four in total) and the femoral heads, knees, and ankles (six in total). In step 2, the detected ROIs were used to extract mask images for the long axes of the left and right femurs and tibiae, and for the femoral heads, knees, and ankles. To this end, each ROI was cropped from the radiograph, and a semantic segmentation model was used to extract mask images.
In step 3, the generated mask images were used to measure the parameters for determining the lower limb alignment status shown in the radiographs. To this end, an algorithm to detect the necessary landmarks was applied, wherein image processing techniques were used to determine landmarks based on medical definitions (see Supplementary Methods for details). The detected landmarks were used to calculate angles and lengths, and the results are displayed on the radiographs.
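Once landmarks are available, the angle and length calculations in step 3 reduce to elementary plane geometry. The sketch below shows one way this could be done; it is not the authors' implementation. The function names, the landmark coordinates, and the pixel spacing are hypothetical, and the clinical definitions of the individual parameters follow Table 2 and the Supplementary Methods rather than this example.

```python
import math


def angle_between(p1, p2, q1, q2):
    """Angle in degrees (0 to 180) between direction p1->p2 and direction q1->q2."""
    v = (p2[0] - p1[0], p2[1] - p1[1])
    w = (q2[0] - q1[0], q2[1] - q1[1])
    dot = v[0] * w[0] + v[1] * w[1]
    cross = v[0] * w[1] - v[1] * w[0]
    return math.degrees(math.atan2(abs(cross), dot))


def length_mm(p, q, pixel_spacing_mm):
    """Euclidean distance between two landmarks, converted from pixels to mm."""
    return math.dist(p, q) * pixel_spacing_mm


# Hypothetical (x, y) landmark coordinates in pixels.
hip_center = (120.0, 50.0)
knee_center = (130.0, 450.0)
ankle_center = (125.0, 850.0)

# Deviation between the hip->knee axis and the knee->ankle axis.
axis_deviation = angle_between(hip_center, knee_center, knee_center, ankle_center)
# Hip-to-ankle distance, assuming a hypothetical spacing of 0.5 mm per pixel.
limb_length = length_mm(hip_center, ankle_center, pixel_spacing_mm=0.5)
```

Using `atan2` on the cross and dot products avoids the loss of precision that `acos` exhibits for nearly parallel axes, which is the typical case for a well-aligned limb.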
Training strategy. First, the YOLOv5 22 model was used to identify the ROIs for each part. To ensure adequate training, the number of epochs and learning rate were set to 300 and 0.001, respectively. The Adam optimizer, a common gradient-based optimization method, was used to update the weights. For training and inference, owing to the limited amount of graphics processing unit (GPU) memory, the input lower limb radiographs were resized to a fixed resolution of 640 × 480 pixels. To avoid image distortion from resizing, the input radiographs were padded to a square shape before being fed into the DL system. Finally, pixel values were normalized between 0 and 1. For the training dataset, the batch size was set to 8, and model validation was performed using the validation dataset at the end of each epoch to monitor overfitting.
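The padding and normalization steps can be sketched as follows. This is a simplified illustration on a tiny nested-list image, assuming zero-valued symmetric padding and 8-bit intensities; the source does not state the padding value or placement, and the resizing step is left to an imaging library.

```python
def pad_to_square(image, fill=0):
    """Pad a 2D image (list of rows) symmetrically so that height equals width."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    padded = [[fill] * side for _ in range(side)]
    for r, row in enumerate(image):
        padded[top + r][left:left + width] = row  # copy the row into the canvas
    return padded


def normalize(image, max_value=255):
    """Scale pixel intensities to the range [0, 1]."""
    return [[px / max_value for px in row] for row in image]


# A tiny 4 x 2 portrait image standing in for a full-leg radiograph.
tiny = [[255, 0], [0, 255], [255, 255], [0, 0]]
square = pad_to_square(tiny)
scaled = normalize(square)
```

Padding before resizing preserves the aspect ratio of the anatomy, so that angles measured on the resized image are not skewed by anisotropic scaling.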
Thereafter, the HarDNet-MSEG 23 image segmentation model was used to create mask images for each ROI. Two hundred epochs were used for training, with a learning rate of 0.005 and a batch size of 1. Furthermore, the Adam optimizer was used to update the weights. To compensate for the varying contrast levels in radiographs, contrast limited adaptive histogram equalization (CLAHE) 24 was applied to each ROI to preserve local features while enhancing the low contrast of images. This method helped distinguish important features from noise during the learning process. The CLAHE performance was influenced by two parameters, i.e., the block size for block-wise processing and the clipping value used to prevent extreme pixel intensity variations within blocks 25. Based on the best-performing experimental values, the block size and clipping value were set to (8, 8) and 2.0, respectively. Because the performance improvement in ROI detection was negligible, CLAHE was not applied to full-leg radiographic images. The right-side long-axis ROIs and the ROIs for the right femoral head, knee, and ankle were horizontally flipped to match the orientation of the left-side data. Similar to dataset preparation for ROI detection, the extracted ROI images were padded to a square shape and resized to 512 × 512 pixels for training the image segmentation model. Model validation was performed using the validation dataset at the end of each epoch. An Intel(R) Xeon(R) Silver 4216 CPU @ 2.10 GHz and an NVIDIA RTX 2080Ti were used for detection and image segmentation training, model performance evaluation, and inference time measurement.

Statistical analysis.
To evaluate the performance of the DL model for ROI detection, the mean average precision (AP), which is a widely adopted evaluation metric for object detection, was used with intersection over union thresholds spanning from 0.5 to 0.95 26. For bone segmentation, the Dice similarity coefficient (DSC) 27 and Hausdorff distance (HD) 28 were used as evaluation metrics. DSC measures the pixel-wise agreement between a predicted segmentation and its ground truth, and HD quantifies the largest discrepancy between two segmentation masks.
The reliability and accuracy of the proposed system were evaluated by calculating the concordance correlation coefficient (CCC), Pearson correlation coefficient (PCC), and intraclass correlation coefficient (ICC). Bland-Altman plots were used to investigate the similarity between the clinical and system measurements and the presence of biases.
Thereafter, the mean absolute deviation (MAD) was calculated to determine the variability and extent of differences between measurements performed by the radiologist and the system. Finally, the mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE) were sequentially calculated to validate the measurement performance of the system.
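For concreteness, the two segmentation metrics can be sketched on masks represented as sets of foreground pixel coordinates. This is a brute-force illustration rather than the evaluation code used in the study; in particular, it computes HD over all foreground pixels, whereas practical implementations often restrict the computation to mask boundaries for efficiency.

```python
import math


def dice(pred, truth):
    """Dice similarity coefficient between two sets of (row, col) pixels."""
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between two non-empty pixel sets."""
    def directed(a, b):
        # Largest distance from a point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))


# A 4 x 4 ground-truth square and a prediction shifted right by one column.
truth_mask = {(r, c) for r in range(4) for c in range(4)}
pred_mask = {(r, c) for r in range(4) for c in range(1, 5)}

dsc = dice(pred_mask, truth_mask)       # 12 shared pixels out of 16 + 16
hd = hausdorff(pred_mask, truth_mask)   # worst-case mismatch, in pixels
```

The example highlights how the two metrics complement each other: DSC summarizes overall overlap, while HD is driven entirely by the single worst-placed pixel.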
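The agreement statistics named above can be computed directly from paired measurements. The sketch below covers PCC, Lin's CCC (using population moments), and the Bland-Altman bias with 95% limits of agreement. ICC is omitted because it has several variants and the text does not specify which one was used. The function names and sample values are illustrative only.

```python
import math
import statistics as st


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = st.fmean(x), st.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def concordance(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(x)
    mx, my = st.fmean(x), st.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def bland_altman(reference, system):
    """Return (bias, lower limit, upper limit) of agreement for system - reference."""
    diffs = [s - r for r, s in zip(reference, system)]
    bias = st.fmean(diffs)
    sd = st.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Illustrative values: the system reads exactly 2 units higher than the reference.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
system = [3.0, 4.0, 5.0, 6.0, 7.0]
pcc = pearson(reference, system)
ccc = concordance(reference, system)
bias, lower, upper = bland_altman(reference, system)
```

In this example the PCC is 1 because the two series are perfectly linearly related, whereas the CCC drops to 0.5 because of the constant offset. This is why CCC and Bland-Altman analysis are reported alongside PCC: correlation alone cannot reveal a systematic bias.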
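The error indices follow directly from their definitions, as the minimal sketch below shows. The function names and sample values are illustrative and are not data from the study.

```python
import math


def error_metrics(reference, measured):
    """Return MAE, MSE, and RMSE of measured values against reference values."""
    errors = [m - r for r, m in zip(reference, measured)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


def mean_absolute_deviation(values):
    """Average absolute deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Illustrative radiologist and system readings for one parameter.
radiologist = [10.0, 20.0, 30.0]
system_out = [11.0, 18.0, 30.0]
metrics = error_metrics(radiologist, system_out)
spread = mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
```

MAE and RMSE are expressed in the units of the measured parameter (degrees or millimeters), whereas MSE is in squared units, so thresholds on the three indices are not directly interchangeable.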

Results
Study participants. The lower limb frontal radiographs obtained from 450 patients (mean age ± standard deviation, 55 ± 16 years; age range, 11-89 years; 168 men, 282 women) were used. The number of female patients was approximately 1.7 times that of male patients, and the collected radiographs covered the entire area from the hips to both ankles. Images containing medical implants, skeletal anomalies, or bone dysplasia were excluded.

Performance of DL models for ROI detection and segmentation.
After training, the performances of the object detection and image segmentation models were tested using a predefined test dataset. According to the experimental results, detection failures were not observed for any test data, and the mean AP across ROIs was approximately 0.99, indicating a considerably high detection performance. Thereafter, DSC and HD were measured to evaluate the performance of the image segmentation model (HarDNet-MSEG) for the femoral heads, knees, ankles, and shafts of the femur and tibia. The DSC averaged across classes was 0.97. Table 3 summarizes the performance results of the object detection and image segmentation models.
Accuracy and reliability of parameter measurements. The 16 quantitative parameters for diagnosing lower limb malalignments, including the lengths of the two legs, femurs, and tibiae, revealed high concordance and close correlations with the measurements performed by the radiologist. For lower limb length measurements, the ICC, PCC, and CCC indicated high values for the entire lower limb length (0.979, 0.996, and 0.979), tibial length (0.905, 0.975, and 0.906), and femoral length (0.979, 0.986, and 0.940). Similarly, for the lower limb alignment assessment parameters, the experimental results indicated that the proposed system produced highly reliable results, with correlation coefficients greater than or equal to 0.9. As summarized in Table 4, the high measurement accuracy of the proposed system was confirmed using MAE, MSE, and RMSE, which validated its effectiveness.
The Bland-Altman plots demonstrated that the measurements obtained using the system agreed excellently with the reference standard obtained by the radiologist (Fig. 3). In particular, only marginal variability and mean differences were observed between the two sets of measurements. A few observations deviated from the central line for certain parameters; however, these deviations were negligible.
Subsequently, MAD was analyzed by calculating the average of the absolute deviations obtained after subtracting the overall mean of the measurements from the individual measurements, to determine the variability relative to the reference values. Table 5 summarizes these results, and Fig. 4 shows cases of lower extremity alignment calculated using the DL-based system. Overall, these experimental results indicate the absence of systematic bias between the reference standard and the parameters measured using the proposed system.
Finally, the duration of each step of the system and that required to compute the final outputs were measured (Table 6). Measurements were repeatedly performed for a single lower limb radiograph, and separate measurements were performed for central processing unit (CPU)-only and combined CPU-GPU computations. For CPU-only computations, the average durations for the first, second, and third steps were approximately 1.08, 8.59, and 2.53 s, respectively. For combined CPU-GPU computations, the average durations for the first, second, and third steps were 3.43, 7.95, and 2.7 s, respectively. For CPU-only and combined CPU-GPU computations, the total execution times for the system were approximately 12.2 and 14.08 s, respectively, and the difference between these two values can be attributed to the time required to load the DL models into the GPU memory. In particular, when the execution time of the system was measured after the DL models for ROI detection and segmentation had been loaded into GPU memory, steps 1 and 2 were observed to save time. Consequently, compared to CPU-only computations, no significant difference was observed in the overall execution time.

Discussion
This study proposed a method for automated measurements of bilateral leg lengths, femoral and tibial lengths, and parameters used to determine the presence of lower limb malalignments by applying DL technology to anteroposterior lower limb radiographs. First, the ability of the DL-based model to detect and segment the femur, tibia, femoral head, knee, and ankle was validated for measuring each parameter using a predefined test dataset of 200 images (average of class-wise AP = 0.99; average DSC = 0.97), indicating excellent performance.
Comparing the results of each system-calculated indicator with the values observed by a radiologist revealed that the measurements of the total lower limb length, femoral and tibial lengths, and the 13 parameters for determining lower limb alignment (mLPFA, mLDFA, mMPTA, mLDTA, MAD, mJLCA, mTFA, aMPFA, aLDFA, NSA, aMPTA, aLDTA, and aTFA) exhibited a high correlation (CCC, ICC, and PCC ≥ 0.91). To our knowledge, this study is the first to evaluate such a large number of alignment indicators. Furthermore, the obtained MAE, MSE, and RMSE results revealed the absence of significant bias between the values measured by the system and the observations of an actual evaluator.
Schock et al. 12 proposed an automated evaluation method for the lower limb alignment status by applying DL technology to anteroposterior lower limb radiographs, thereby measuring the alignment indicators rapidly and accurately (HKAA: PCC = 0.99 [P-value < 0.001], ICC = 0.99, and mean ± deviation = 0.10 ± 4.42; AMA: PCC = 0.99 [P-value < 0.001], ICC = 0.89, and mean ± deviation = 5.13 ± 1.36). Similar to the proposed method, an image segmentation model was used to generate mask images for the entire femur and tibia, and the points required to calculate the indicators were identified directly from the contours of these mask images. However, that method differs considerably from the proposed method, in which precise landmark locations are not determined from the entire femur and tibia mask images alone. Instead, the ROIs in the radiographs are identified, the corresponding mask images are extracted, and the landmarks within those regions are determined. This approach considers local features that can be easily overlooked in the entire image. These factors likely contribute to the performance variations between the systems.
Comparing the results obtained using the proposed method with those observed by the radiologist revealed that the proposed method has high concordance and reliability. Furthermore, the time required to automatically generate results from a single radiograph input was approximately 12 s, which was faster than that required for radiologists to perform direct observations (approximately 130 s per image). Therefore, a large number of images can be measured accurately at a substantially faster pace, which enables the swift determination of the lower limb alignment status and renders the model effective for the repetitive measurement tasks required for prognosis observation.
However, this methodology has several limitations. First, the evaluation process after obtaining the alignment indicators was performed using a limited internal dataset. Second, the data did not include radiographs with abnormal skeletal structures or bone dysplasia. To handle diverse patient groups, the development of sophisticated DL models and the establishment of large datasets that include these patients are necessary. Third, a single evaluator obtained the validation data; methods such as comparing the system with the results of the same evaluator measuring the same radiographs after a time interval, or introducing one or more additional evaluators for cross-validation, can be considered.
In summary, a DL-based system was developed and evaluated for measuring various parameters to assess lower limb malalignments, including angles and lengths. Because the steps involved in the measurement process are executed automatically without human intervention, the process can be performed rapidly. This is beneficial for reducing the workload of clinicians and establishing appropriate treatment plans for patients through fast and accurate diagnoses, thereby enhancing the effectiveness of treatments. Further prospective studies should be performed to extend this system by increasing the diversity of measured parameters and patient groups, including those with artificial joints and skeletal abnormalities.

Figure 1. Flowchart detailing the number of patients included and excluded for this study based on given criteria.

Figure 2. Progression of the proposed system.

Figure 3. Bland-Altman plot for the reference standards and measurements using the DL-based system for each parameter. The x-axis represents the mean of the reference and corresponding parameter measured by the system, whereas the y-axis represents the difference between the two measurements. (a) Lower limb length, (b) parameters based on mechanical axes, and (c) parameters based on anatomical axes.

Figure 4. Example outputs of the system. The lines were generated using the DL-based automatic measurement system. Images on the right side indicate that the system accurately and reliably localized relevant landmarks required to assess lower extremity alignment. (a) Lower limb length, (b) parameters based on mechanical axes, and (c) parameters based on anatomical axes.

Table 2. Parameters required to determine lower limb alignment status and normal range for each parameter.

Table 3. Performance of DL models for ROI detection and segmentation.

Table 4. Correlation coefficients and error indices between the actual values observed by the radiologist and those measured by the system.

Table 5. Mean, mean difference, and mean absolute deviation for the actual values observed by the radiologist and the values measured by the system.

Table 6. Comparison of average and total measurement times for each step of the proposed system obtained with and without using GPU. *Indicates the case where the deep-learning models have already been loaded into the GPU memory.