Deep learning approach for automatic landmark detection and alignment analysis in whole-spine lateral radiographs

Human spinal balance assessment relies considerably on sagittal radiographic parameter measurement. Deep learning methods have previously been applied to automatic landmark detection and alignment analysis, achieving mild to moderate standard errors and favourable correlations with manual measurement. In this study, based on 2210 annotated images of various spinal disease aetiologies, we developed deep learning models capable of automatically locating 45 anatomic landmarks and subsequently generating 18 radiographic parameters on a whole-spine lateral radiograph. In the assessment of model performance, the localisation accuracy and learning speed were the highest for landmarks in the cervical area, followed by those in the lumbosacral, thoracic, and femoral areas. All the predicted radiographic parameters were significantly correlated with ground truth values (all p < 0.001). A comparison between humans and artificial intelligence revealed that the deep learning model was capable of matching the reliability of doctors for 15 of the 18 parameters. The proposed automatic alignment analysis system was able to localise spinal anatomic landmarks with high accuracy and to generate various radiographic parameters with favourable correlations with manual measurements.

Spinal curvature in humans, one of the few bipedal species, was modified to enable horizontal gaze while freeing both hands for complex tasks. Lordotic curvature thus developed in the cervical and lumbar spine to maintain the centre of mass within the area covered by both feet (the stance width). This concept was further elaborated in the 'cone of economy' theory proposed by Dubousset1: shifting the centre of mass away from the standing area results in additional energy expenditure. Various radiographic parameters have been developed and validated on standard uniplanar two-dimensional (2D) whole-spine radiographs to evaluate spinopelvic harmony and the sagittal balance of the spine2.
In the clinical setting, manual measurement and calculation of numerous spinopelvic parameters on whole-spine radiographs require substantial time and effort. Thus, semi-automated or automated spine radiographic anatomic landmark localisation and vertebral segmentation on plain radiographs have been studied for over a decade3,4. Recently, deep learning methods have been applied to automatic sagittal radiographic parameter measurement, with mild to moderate standard errors and favourable correlations with manual measurement5-9. It is worth mentioning that some approaches can better characterise spinal alignment, as they can estimate several spinopelvic parameters at once. For example, Galbusera et al.8 trained 78 distinct deep learning models to derive 78 landmark coordinates and six spinopelvic parameters. Korez et al.5 were able to estimate five spinopelvic parameters following a detection-based approach, which first detects four anatomic structures and then regresses five anatomic landmarks within the detected structures. However, these approaches still have room for improvement: (1) The detection-based model may fail to distinguish between similar adjacent anatomic structures (e.g. the third and fourth thoracic vertebrae). (2) The predicted radiographic parameters are not comprehensive enough to cover the whole spinal column and pelvic structures. (3) The coordinate regressors in some studies5,10 may lose the ability to utilise all relevant anatomic structures of the entire image, as their approach involves splitting images into small patches. (4) The test datasets are often insufficient in quantity and diversity of spinal pathologies, and they might not represent real clinical situations.
Our present study addresses these problems by using an ensemble of two end-to-end trainable models to localise 45 anatomic landmarks on whole-spine lateral radiographs. We created a dataset consisting of 2210 radiographs, the largest annotated dataset with various spinal pathologies to date. Using our dataset, we trained deep learning models that predict landmark coordinates using anatomic context from the entire radiograph. Our models were able to find 45 anatomic coordinates of the whole spinal column and pelvic structures with low median error and to generate various radiographic parameters with favourable correlations with manual measurements.

Results
Study design. Whole-spine plain radiographs are widely used as the first-line examination for standing patients with scoliosis, kyphosis, or spinal imbalance, or for patients who have received long spinal instrumentation. This study aims to automatically annotate 45 landmark coordinates (Fig. 1) on whole-spine lateral images. With the derived landmark coordinates, 18 spinopelvic parameters (Supplementary Fig. S1) can be used to (1) evaluate whole spinopelvic alignment and balance in standing positions and (2) monitor implants spanning multiple levels and wide regions during postoperative follow-up.
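To make the parameter derivation concrete, the sketch below computes two such parameters, SVA and PT, from landmark coordinates. The coordinate convention (x to the right, y downward, as in image space), the helper names, and all numeric values are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def sagittal_vertical_axis(c7_centre, s1_post_sup):
    """SVA: horizontal offset (mm) of the C7 plumb line from the
    posterosuperior corner of S1 (positive = anterior shift)."""
    return c7_centre[0] - s1_post_sup[0]

def pelvic_tilt(femoral_head_centre, s1_midpoint):
    """PT: angle (degrees) between the vertical and the line joining the
    femoral head axis to the midpoint of the sacral endplate."""
    dx = s1_midpoint[0] - femoral_head_centre[0]
    dy = femoral_head_centre[1] - s1_midpoint[1]  # image y grows downward
    return np.degrees(np.arctan2(dx, dy))

# Hypothetical landmark coordinates in mm
print(sagittal_vertical_axis((120.0, 80.0), (95.0, 400.0)))   # 25.0
print(round(pelvic_tilt((100.0, 430.0), (110.0, 400.0)), 1))  # 18.4
```

The remaining parameters follow the same pattern: each is either a distance to a reference line or an angle between two lines defined by annotated landmarks.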
Dataset demographics. From January 2018 to April 2020, a total of 2900 consecutive whole-spine lateral plain radiographs were reviewed and annotated under the approval of the institutional review board of our hospital (IRB No. 202000623B0). After excluding (1) 174 images with inadequate length, meaning they did not include either the C2 dens or both femoral heads, (2) 294 images with anatomic variance, in which the vertebral column numbered more or fewer than 25 vertebrae, and (3) 222 images with poor contrast preventing identification of pelvic anatomic structures, a total of 2210 images were included in our study.

Learning speed for landmarks in different spinal areas. We observed that the width of the coarse heatmaps produced at the 1st model stage gradually became narrower during training, indicating that the neural network became increasingly confident about the estimated landmark coordinates. To visualise this narrowing behaviour, we categorised the landmarks of vertebral centres and femoral heads into four areas (cervical, thoracic, lumbar, and femoral heads) and estimated the standard deviations (SDs) of the per-landmark first-stage heatmaps. As indicated by Fig. 5, we calculated the per-area averaged SDs and observed that the decay profile of the SDs followed the tendency of t^-γ, where t is the training time (epoch) and γ is the decay rate. The decay rate γ can be interpreted as the learning speed for landmarks. The learning speed was the highest for landmarks in the cervical area, followed by those in the lumbosacral, thoracic, and femoral areas.

Performance of the deep learning model for automatic landmark localisation. We used localisation error as the metric for performance evaluation on 400 test images. The localisation error was defined as the Euclidean distance between the landmark coordinates of the ground truth and the deep learning model. Due to the non-normal distribution of the localisation errors, we illustrated the localisation errors of the 45 landmarks using boxenplots (also known as letter-value plots28), as in Fig. 6.
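The t^-γ decay of the heatmap SDs described above can be estimated by a least-squares fit in log-log space. A minimal sketch with synthetic SD values (the real values come from the per-area averaged first-stage heatmap SDs):

```python
import numpy as np

# Hypothetical per-epoch heatmap SDs (pixels) for one spinal area,
# synthesised to follow t^-gamma with gamma = 0.8
epochs = np.arange(1, 11)
sds = 30.0 * epochs ** -0.8

# gamma is the negative slope of log(SD) versus log(t)
slope, intercept = np.polyfit(np.log(epochs), np.log(sds), 1)
gamma = -slope
print(round(gamma, 2))  # 0.8
```

A larger fitted γ corresponds to a faster-learning area, which is how the per-area learning speeds can be compared.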
A boxenplot describes the error distribution using quantiles. For example, the widest box covers the 0.25 to 0.75 quantile range (the interquartile range [IQR]); the second-widest box covers the 0.125 to 0.875 quantile range; and the third-widest box covers the 0.0625 to 0.9375 quantile range. The performance of the deep learning model was the highest in the cervical area, where the median localisation errors ranged from 1.75 to 2.64 mm, followed by model performance in the lumbosacral area, with median localisation errors ranging from 1.76 to 2.63 mm. The thoracic area had greater localisation errors than the cervical and lumbosacral areas, with median errors ranging from 2.21 to 3.07 mm. Localisation errors were the greatest at the centres of both femoral heads, with median errors of 2.75 mm and 3.39 mm.
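The nested boxes of a boxenplot can be reproduced directly: box k spans the [2^-(k+1), 1 - 2^-(k+1)] quantile range, so k = 1 is the IQR, k = 2 the 0.125-0.875 box, and so on (seaborn's `boxenplot` draws exactly these). A sketch with synthetic, skewed errors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic localisation errors (mm): right-skewed and long-tailed,
# standing in for the real per-landmark error samples
errors = rng.gamma(shape=2.0, scale=1.5, size=400)

# Letter-value box k covers the [2^-(k+1), 1 - 2^-(k+1)] quantile range
for k in range(1, 4):
    lo = 2.0 ** -(k + 1)
    q_lo, q_hi = np.quantile(errors, [lo, 1 - lo])
    print(f"box {k}: {lo:.4f}-{1 - lo:.4f} quantile -> [{q_lo:.2f}, {q_hi:.2f}] mm")
```

Successive boxes widen as they reach further into the tails, which is why the heights of the outer boxes summarise tail behaviour in the analysis below.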
We further examined the error distributions of anatomic landmarks using the calculated boxenplots. Error distributions for landmarks in the cervical area were generally narrower (all heights of the third-widest boxes were < 8 mm) with shorter tails (most heights of the fourth- to seventh-widest boxes were < 20 mm), indicating accurate predictions of anatomic landmarks. Error distributions for landmarks in the thoracic area were wider (all heights of the third-widest boxes were < 20 mm) and longer-tailed (most heights of the fourth- to seventh-widest boxes lay between 10 mm and 30 mm), indicating larger localisation errors and a higher possibility of incorrect level recognition. Model performance was higher in the lumbosacral area; error distributions of landmarks became narrow again below L3 (heights of the third-widest boxes were < 8 mm) but with long tails (heights of the fourth- to seventh-widest boxes ranged from 7 mm to 40 mm). For the centres of both femoral heads, the localisation errors were the greatest among all anatomic landmarks; the error distributions were the widest (heights of the third-widest boxes were < 28 mm) and had the longest tails (heights of the fourth- to seventh-widest boxes were < 50 mm).

The performance of the predicted radiographic parameters is summarised in Table 1. Median parameter errors with IQR are presented in addition to mean parameter errors with SD due to the non-normal distribution of the parameter errors. All the predicted radiographic parameters were significantly correlated with ground truth values (all p < 0.001).
For fundamental spinopelvic parameters, the mean errors ranged from 1.1° (PT) to 5.1° (LL), and the median errors ranged from 0.6° (PT) to 3.0° (LL). No significant differences were observed between the predictions and the ground truth values, with all p > 0.05 in Wilcoxon signed-rank tests. The predicted PT and SVA were highly correlated with ground truth values, with Pearson correlation coefficients (R) > 0.9.

Model performance varied for regional spinal parameters in different anatomic areas. For cervical parameters, the mean errors ranged from 1.1 mm (cSVA) to 6.6° (CL), and the median errors ranged from 0.7 mm (cSVA) to 5.3° (CL). For thoracic parameters, the errors were generally larger, with mean errors ranging from 5.4° (MTK) to 6.7° (PTK) and median errors ranging from 4.2° (MTK) to 4.9° (GTK). For lumbosacral parameters, the mean errors were 4.3° (L4SL) and 1.0° (LPA), and the median errors were 2.3° (L4SL) and 0.5° (LPA). Among the regional spinal parameters, significant differences were observed between predictions and ground truth values for T1S, GTK, and PTK, with p < 0.05 in Wilcoxon signed-rank tests. The predicted cSVA and LPA had the closest correlations with ground truth values, with R > 0.9.

The proposed deep learning model performed well in predicting global spinopelvic parameters: the mean errors ranged from 0.3 (Barrey index) to 3.4° (SSA), and the median errors ranged from 0.03 (Barrey index) to 1.9° (SSA). No statistical differences were present between the predictions and ground truth values, with all p > 0.05 in Wilcoxon signed-rank tests. All the global spinopelvic parameters were strongly correlated with ground truth values, with all R > 0.9 except for the Barrey index (R = 0.893).

Level of agreement between doctors and the deep learning model. Interobserver reliability comparisons between three human observers (a junior resident, a spine fellow, and a senior surgeon), the deep learning model, and the ground truth values were conducted using the intraclass correlation coefficient (ICC) for 90 images within the dataset for interobserver reliability analysis, as described in "Datasets" in the Methods section. The ICC heatmap (Fig. 7) presents a data matrix in which colouring provides an overview of the numeric differences in the ICC for each radiographic parameter. Colour intensity represents the magnitude of the ICC values, with deeper red indicating a higher ICC and deeper blue indicating a lower ICC. Reliability was categorised into four grades according to ICC magnitude: excellent (0.9-1.0), high (0.7-0.9), moderate (0.5-0.7), and low (0.25-0.5). Hierarchical cluster analysis was used to compare the overall performance of the deep learning model and the human observers. The sequences of the horizontal (interobserver comparisons) and vertical (radiographic parameters) axes were redistributed according to the trend of ICC magnitudes, such that higher interobserver reliability scores clustered towards the upper right corner and lower scores towards the lower left corner. As can be observed in Fig. 7, the deep learning model achieved excellent reliability for the parameters PT, TPA, GT, cSVA, SVA, and OD-HA, comparable with the reliability of the three human observers. For LPA, MTK, LL, SSA, T1S, GTK, SS, and CL, the deep learning model and the human observers achieved high reliability. Nonetheless, interobserver reliability was more divergent for the Barrey index, PTK, PI, and L4SL, as the variance in heatmap colours indicated. Compared with ground truth values, the deep learning model outperformed the human observers for the Barrey index but underperformed them for PTK, PI, and L4SL.
In brief, the deep learning model was capable of matching the reliability of human observers for 15 out of the 18 parameters.

Discussion
Adult spinal deformity (ASD) is a debilitating condition present in 32-68% of people older than 65 years29,30. The aetiologies span a spectrum of diseases including de novo scoliosis, progressive adolescent idiopathic scoliosis, degenerative hyperkyphosis, and iatrogenic flat back deformity31. Radiographic assessment of the whole spine, including both hip joints, is recommended for the evaluation of sagittal balance in ASD. Numerous studies have reported correlations of fundamental spinopelvic parameters with health-related quality of life metrics and the prognosis of ASD corrective surgeries2,31-35. Regional spinal parameters also play a prominent role in disease classification and preoperative planning18,22,36,37. Global spinopelvic parameters23-27, which facilitate the evaluation of spinal curvature across more than two regions, enable overall assessment of sagittal balance without the influence of postural changes, body size differences, or regional compensating mechanisms for ASD such as cervical hyperlordosis, thoracic hypokyphosis, and pelvic retroversion. In the clinical setting, manual measurement of all these parameters is time consuming and is sometimes subject to interobserver variability38-42. In this study, we proposed a deep learning model demonstrating performance comparable to that of human observers for most parameters.

Automatic localisation of spinal structures has previously been investigated with conventional machine learning methods43,44. Deep learning has enabled further improvement of mean localisation errors for intervertebral discs to below 2 mm45,46. However, for pathologic spines or in the presence of metallic implants, mean localisation errors range between 6 and 8.5 mm when machine learning is applied47-49. Even with the assistance of deep learning, progress is limited, and mean localisation errors have only marginally improved, to between 6.9 and 9 mm, in CT datasets of various spinal pathologies50.
Our two-stage deep learning model was able to automatically localise 45 anatomic landmarks (24 vertebral centres and 21 specific landmarks) in a complex test dataset containing 400 randomly selected images of various spinal pathologies and different types of metallic implants. Model performance was the highest in the cervical area, with all median localisation errors lower than 3 mm. Occlusion by shoulder girdle structures may explain the wider distribution of localisation errors (Fig. 6) near the cervicothoracic junction. Landmarks in the thoracic area are difficult to predict, especially for patients with scoliosis in the midthoracic region. Overlapping of the vertebrae in severe scoliotic curves hinders clear recognition of each vertebral centre, and prediction under this condition typically requires repeated evaluations in both directions (cranial-caudal and caudal-cranial), even for experienced surgeons and radiologists. Unlike in the thoracic region, landmarks in the lumbosacral area are not occluded by adjacent anatomic structures and can typically be identified clearly. In this study, the median localisation errors of the lumbosacral vertebral landmarks were all less than 3 mm. With median localisation errors of 3.39 mm and 2.75 mm, the recognition of the bilateral femoral heads was worse than that of the whole vertebral column. Partial occlusion by the contralateral femoral head, poor contrast in the pelvic region, and the presence of metallic implants in the hip region may all affect the localisation of the femoral head centres. Overall, our model performed better in the cervical and lumbosacral areas, but improvement is required for landmark identification in the thoracic and pelvic areas.
The two-stage deep learning model locates the 45 anatomic landmarks in a coarse-to-fine manner. Taking Fig. 3 as an example, the predicted heatmaps of landmarks in the thoracic and thoracolumbar areas were less localised in the first stage, scattered along the spinal column between the anterior and posterior borders of the vertebrae. These illegible landmarks then became more localised in the second stage of the model. In contrast, the predicted heatmaps of landmarks in the cervical and lumbosacral areas were already highly localised in the first stage. Correct recognition of illegible landmarks is a challenge for the deep learning model in the following situations: first, landmarks with a higher degree of occlusion, e.g. thoracolumbar landmarks partially occluded by the rib cage, implants, or bone cement; second, vertebral landmarks with high indistinguishability, e.g. T7, which could not be determined easily because its structure and background were extremely similar to those of its neighbouring vertebrae.

Table 1. Performance evaluation of the spinal radiographic parameters of the ensemble model. R Pearson correlation coefficient; SD standard deviation; IQR interquartile range; PI pelvic incidence; SS sacral slope; PT pelvic tilt; LL lumbar lordosis; SVA sagittal vertical axis; CL cervical lordosis; T1S T1 slope; cSVA cervical sagittal vertical axis; GTK global thoracic kyphosis; PTK proximal thoracic kyphosis; MTK main thoracic kyphosis; L4SL L4-S1 lordosis; LPA lumbar pelvic angle; SSA spino-sacral angle; GT global tilt; TPA T1 pelvic angle; OD-HA odontoid hip axis. *p value < 0.05. †Absolute difference between the prediction and the ground truth.

The long tails of the error distributions in Fig. 6 were mainly caused by incorrect recognition of vertebral levels in these situations. Several studies have also applied deep learning methods for automatic radiographic parameter prediction in individuals with spinal disorders. For Cobb angle measurement of adolescent idiopathic scoliosis, Wang et al.7 achieved circular mean absolute errors (CMAEs) of 7.81° and 6.26° in anteroposterior (AP) and lateral views, respectively, by using a multi-view extrapolation net (MVE-Net). Wu et al.9 further improved the CMAEs to 4.04° and 4.07° in AP and lateral views, respectively, by using a multi-view correlation network (MVC-Net). In addition, Galbusera et al.8 collected 493 biplanar EOS plain radiographs for model development and applied 78 convolutional neural networks to automatically recognise 78 landmarks. Radiographic parameters could be generated accordingly, including T4-T12 kyphosis, L1-L5 lordosis, Cobb angle of scoliosis, PI, PT, and SS, with the standard errors of the estimated parameters ranging from 2.7° (PT) to 11.5° (L1-L5 lordosis). Recently, Korez et al.5 developed a two-stage model for fully automatic measurement of sagittal spinopelvic parameters. In their model, RetinaNet was used in the first stage to recognise specific areas such as C7, S1, and both femoral heads. In the second stage, a U-net was used for anatomic landmark detection, such as the centres and the anterior and posterior corners within the previously identified areas. With a small training dataset of 145 sagittal radiographs, they were able to achieve mean absolute errors ranging from 1.2° ± 1.2° (spinal tilt) to 5.5° ± 4.2° (PI) for spinopelvic parameters, with significant correlations between manual measurements and their deep learning model. Nonetheless, the study included only images with degenerative pathologies, with or without short-segment fusions.
The researchers also excluded images not automatically recognised in the first stage from the statistical analysis. Furthermore, the test datasets of the previous studies were relatively small, with fewer than 100 X-ray images. In our study, after generating the locations of 45 anatomic landmarks, 18 radiographic parameters were automatically calculated, most of which (15/18) had model performance comparable to that of human observers; the exceptions were PTK, PI, and L4SL (Fig. 7). The mean absolute errors ranged from 0.1° ± 0.2° (OD-HA) to 6.7° ± 6.2° (PTK), and the median absolute errors ranged from 0.03 ± 0.1 (Barrey index) to 5.3° ± 6.4° (CL; Table 1). Our deep learning approach achieved good performance for PT, cSVA, SVA, T1S, and SS, which can be determined using two anatomic landmarks and one reference line (either horizontal or vertical), as shown in Supplementary Fig. S1. The deep learning model also achieved good performance for radiographic parameters spanning a broad region (e.g. TPA, GT, OD-HA, SSA, and the Barrey index) and the mid-range region (≥ 5 levels; e.g. LPA, MTK, LL, GTK, CL). These parameters are determined by the angle between two crossed lines connecting distant landmarks, and a small perturbation of these landmark coordinates does not increase the parameter error much. However, relatively low performance was observed for three parameters: PTK, PI, and L4SL. The possible explanations are as follows. The anatomic landmarks used for the measurement of PTK span five levels in the upper thoracic area but are usually blocked by the shoulder girdle (clavicle, scapula, humerus) and ribs. PI is a pelvic morphology parameter that requires four anatomic landmarks in the pelvis, including the femoral heads, the landmarks with the largest localisation errors.
Lastly, L4SL is a parameter whose measurement requires four anatomic landmarks across a relatively short distance, spanning only two levels in the lumbosacral area.

The prevalence of adult spinal deformity has increased as the world's population ages31. Since whole-spine plain radiographs are the standard first-line examination, fast and accurate interpretation of the radiographic parameters through a deep learning approach can readily be applied in clinical practice. Deriving all of these spinopelvic parameters is an arduous task for a doctor in a busy clinical setting. The deep learning model can reduce repetitive work and potentially improve clinical efficiency, reducing the time required to generate the 18 radiographic parameters to roughly 1 second while maintaining a generally acceptable 5° error for most parameters. More importantly, this method can be applied retrospectively to multi-institutional datasets to characterise the distribution of spinal parameters at the population level.

The main limitation of this study was the lack of an external validation dataset; the current dataset contained images from only one medical centre. In addition, we excluded images with anatomic variance, i.e. anomalous vertebral numbering (more or fewer than 25 vertebrae), from the dataset. Therefore, our current deep learning model may not produce appropriate predictions for individuals with conditions that lead to anomalous vertebral numbering, such as lumbosacral transitional vertebrae or other congenital vertebral anomalies. To address this issue, it may be necessary to design an algorithm that determines the number and identities of vertebrae before localising landmarks. It is noteworthy that significant variability exists even among experienced surgeons when measuring radiographic parameters in patients with anomalous vertebral numbering51. Two further aspects of our proposed model with room for improvement are the occurrence of incorrect level recognition and its relatively poor hip centre recognition. Improvement in these aspects should be possible with a larger training dataset.
We demonstrated that with the aid of the proposed deep learning model, the accuracy of the automatic landmark localiser was within acceptable ranges for whole-spine lateral plain radiographs in a large test dataset consisting of images of various spinal pathologies. The model was capable of matching the reliability of human observers for 15 of the 18 parameters, and could potentially be applied in institutional practice to aid in clinical workflow and reduce research workload.

Methods
This study was approved by the Institutional Review Board of Chang Gung Memorial Hospital, Taiwan (IRB No. 202000623B0) and carried out in accordance with the pertinent guidelines and regulations. Informed consent was obtained from all participants or their legal guardians prior to each clinical visit in the study.

Datasets.
A total of 2900 consecutive whole-spine lateral images collected in our hospital from January 2018 to April 2020 were reviewed and deidentified before data analysis. A senior radiologist screened the entire image dataset and excluded (1) 174 images with inadequate length that did not include either the C2 dens or both femoral heads; (2) 294 images with anatomic variance in which the vertebral column contained more or fewer than 25 vertebrae; and (3) 222 images with poor contrast preventing identification of pelvic anatomic structures. After exclusion, a total of 2210 images were included and annotated in this study.
We further split the 2210 images into three categories, namely scoliosis (1041 images), kyphosis (466 images), and implant (703 images), according to the disease aetiologies or the presence of one or more implants. The annotated 2210 images were then used to form the following datasets: (1) Dataset of children (120 images): we did not evaluate model performance using images of children because the sacral ossification centre fuses gradually from the teenage years to young adulthood. In this study, a senior spinal surgeon screened the annotated 2210 images and identified 120 of them as showing a totally unfused sacrum (patients aged less than 12 years). This dataset of children was not used during inference but was included during training to enrich the diversity of training samples.

Labelling and classification of the datasets. We used a custom-written MATLAB GUI program for annotation of the whole-spine lateral images in Digital Imaging and Communications in Medicine (DICOM) format. The dataset of 2210 images underwent a three-stage peer-reviewed annotation process. The first-stage annotation was conducted by an annotation team comprising three junior orthopaedic residents and one senior orthopaedic resident; the annotated coordinates of all 45 anatomic landmarks were recorded accordingly. In the second stage, a spine fellow reviewed the first-stage annotations and classified the images into the following categories according to the disease aetiologies or the presence of metallic implants: scoliosis, kyphosis, and implant. Modifications were made by the spine fellow if a first-stage annotation was deemed incorrect. In the third stage, the data were reviewed and amended again by a senior spine surgeon. The finalised version of the annotations was set as the ground truth and exported after integration by an engineer.
Annotated landmarks and derived radiographic parameters. The 45 annotated anatomic landmarks and the 18 radiographic parameters derived from them are illustrated in Fig. 1 and Supplementary Fig. S1. For landmark coordinate regression, we adopted the Wing loss,

wing(x) = w ln(1 + |x|/ε) if |x| < w, and |x| − C otherwise,

where w confines the range of non-linearity to (−w, w), ε limits the curvature of the non-linear region, and C = w − w ln(1 + w/ε) ensures continuity at |x| = w. The Wing loss function was initially adopted for facial landmark detection and was demonstrated to perform better than the commonly used Smooth L1 or L2 loss.
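A NumPy sketch of the Wing loss described above; w = 10 and ε = 2 are illustrative values only, as the paper does not report its hyperparameter choices here.

```python
import numpy as np

def wing_loss(diff, w=10.0, eps=2.0):
    """Wing loss: logarithmic for small errors (|x| < w), L1-like for
    large ones. diff is the coordinate residual (predicted - annotated)."""
    x = np.abs(diff)
    c = w - w * np.log(1.0 + w / eps)  # offset ensuring continuity at |x| = w
    return np.where(x < w, w * np.log(1.0 + x / eps), x - c)

# Small residuals fall on the log branch (amplified gradients);
# the large residual falls on the linear branch, like L1
print(wing_loss(np.array([0.5, 5.0, 20.0])).round(2))
```

The log branch keeps gradients substantial for small and medium residuals, which is the property that helps precise landmark regression compared with plain L2.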
In addition, the two-stage network predicts the same regression target twice. We select the last-stage output as the final prediction because the results of the last stage can generally be expected to improve upon those of the former stage53.
Gaussian or exponential heatmaps for heatmap regularisation. For each landmark (x_k, y_k), we predefine a distribution P(x_k, y_k) ∈ R^(H×W), which contains a small splotch centred at (x_k, y_k). An additional loss for heatmap regularisation (the Jensen-Shannon divergence) is then applied to encourage P^(k,s=2) ∼ P^(k). In our paper, we experiment with the following forms of P(x_k, y_k)6:

G_ij(x_k, y_k; σ) = C_G^(-1) exp(−[(j − x_k)² + (i − y_k)²] / (2σ²)),
E_ij(x_k, y_k; σ) = C_E^(-1) exp(−√((j − x_k)² + (i − y_k)²) · √(ln 2 / 2) / σ),

where σ is a positive real number, G_ij(x_k, y_k; σ) is a Gaussian, E_ij(x_k, y_k; σ) is an exponential function, and C_G and C_E are normalisation constants ensuring that Σ_ij G_ij(x_k, y_k) = 1 and Σ_ij E_ij(x_k, y_k) = 1. The two functions are designed such that they reach the same half maxima at (x_k ± σ√(2 ln 2), y_k) and (x_k, y_k ± σ√(2 ln 2)).
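The two heatmap forms can be sketched as follows, with the exponential decay constant chosen so that both functions reach half maximum at the same radius σ√(2 ln 2); the function names are ours, not the paper's.

```python
import numpy as np

def gaussian_heatmap(h, w, xk, yk, sigma):
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - xk) ** 2 + (ys - yk) ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # normalise so the heatmap sums to 1

def exponential_heatmap(h, w, xk, yk, sigma):
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.sqrt((xs - xk) ** 2 + (ys - yk) ** 2)
    # decay constant chosen so the half maximum sits at r = sigma*sqrt(2 ln 2),
    # matching the Gaussian's half-maximum radius
    lam = sigma * np.sqrt(2 * np.log(2)) / np.log(2)
    e = np.exp(-r / lam)
    return e / e.sum()

g = gaussian_heatmap(64, 48, xk=24, yk=32, sigma=4.0)
peak = tuple(int(v) for v in np.unravel_index(g.argmax(), g.shape))
print(peak)  # (32, 24): the splotch peaks at the landmark (row = y, col = x)
```

The exponential form has heavier tails than the Gaussian at equal half-width, which changes how strongly the regulariser penalises distant probability mass.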
In the above setting, σ is a hyperparameter of the model that controls the width of both the Gaussian and the exponential function. In general, σ has to be small enough that the model is encouraged to produce localised heatmaps in the 2nd stage. However, too small a value of σ is also inappropriate: in an extreme case (e.g. σ < 1), P(x_k, y_k) is extremely localised (almost only one pixel is non-zero), which means we unnecessarily encourage the network to be extremely certain about its estimation, and learning could become a challenge for the network.

The coordinate of each landmark, (x̂_k, ŷ_k), is predicted using the differentiable spatial to numerical transform (DSNT) layer12 as follows:

x̂_k = Σ_ij P_ij^(k,s) X_ij,  ŷ_k = Σ_ij P_ij^(k,s) Y_ij,

where X, Y ∈ R^(H×W) are the predefined 2D matrices representing the x and y coordinates; P^(k,s) ∈ R^(H×W) is the predicted 2D heatmap for landmark (x_k, y_k) in the s-th model stage, obtained by applying a 2D Softmax to the network's unnormalised heatmap output to ensure the normalisation condition Σ_ij P_ij^(k,s) = 1. DSNT can be interpreted as a layer that calculates the 'expected value' of the landmark coordinates. To predict the landmark coordinates well, we expect each last-stage heatmap P^(k,s=2) to contain only one small splotch centred at (x_k, y_k).
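A NumPy sketch of the DSNT computation in pixel coordinates (the original DSNT formulation normalises coordinates to [−1, 1]; that rescaling is omitted here for brevity):

```python
import numpy as np

def dsnt(logits):
    """Differentiable spatial-to-numerical transform: the expected (x, y)
    coordinate under the softmax-normalised heatmap."""
    h, w = logits.shape
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # 2D softmax: heatmap sums to 1
    ys, xs = np.mgrid[0:h, 0:w]       # predefined coordinate matrices Y, X
    return (p * xs).sum(), (p * ys).sum()

# A sharply peaked heatmap recovers its peak location as the expectation
logits = np.full((64, 48), -20.0)
logits[32, 24] = 20.0
x, y = dsnt(logits)
print(round(x), round(y))  # 24 32
```

Because the output is an expectation, diffuse or multi-modal heatmaps pull the coordinate away from any single peak, which is why the heatmap regularisation above encourages a single small splotch.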

Weighted localisation error for model selection.
To determine localisation quality, we can measure the Euclidean distance between the predicted and annotated landmark coordinates. The localisation error for landmark i on image j is defined as

ξ_i^(j) = ‖p_i^(j) − a_i^(j)‖₂,

where p_i^(j) (a_i^(j)) is the predicted (annotated) coordinate of landmark i on image j. We then consider a weighted version of the localisation error for model selection. It is crucial to note that 24 of the 45 landmarks are annotated on vertebral centres, which have different extents of tolerable variability. For example, the size of C1 is much smaller than that of L5; hence, a 1-mm error for the L5 centre is less significant than a 1-mm error for the C1 centre. Considering the per-landmark error significance, we define the weighted localisation error for each landmark i on image j as

ξ̃_i^(j) = ξ_i^(j) / min_{m≠i} ‖a_i^(j) − a_m^(j)‖₂,

where the minimum runs over the other M − 1 landmarks and M is the number of landmarks to be localised. In this formulation, the error of landmark i on image j is measured in units of the distance between the annotated landmark i and its nearest annotated neighbour. If ξ̃_i^(j) = 1, the predicted location of landmark i could reach its nearest-neighbour landmark on image j. If ξ̃_i^(j) ≪ 1, landmark i on image j is well predicted and unlikely to reach the annotated locations of other landmarks. ξ̃_i^(j) is a dimensionless quantity. Due to the existence of highly occluded landmarks or 'hard cases', the error distributions were skewed in our experiments. Thus, instead of the error mean, we compute the error median for each landmark i over the images:

ξ̃_i = median_{j=1,…,N} ξ̃_i^(j),

where N is the total number of images.
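The weighted localisation error can be sketched for a single image as follows, with hypothetical coordinates in mm:

```python
import numpy as np

def weighted_localisation_errors(pred, gt):
    """pred, gt: (M, 2) arrays of landmark coordinates on one image.
    Each Euclidean error is divided by the distance from that ground
    truth landmark to its nearest annotated neighbour, giving a
    dimensionless per-landmark error."""
    err = np.linalg.norm(pred - gt, axis=1)             # Euclidean errors
    d = np.linalg.norm(gt[:, None] - gt[None], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                         # ignore self-distance
    return err / d.min(axis=1)

gt = np.array([[0.0, 0.0], [0.0, 10.0], [0.0, 25.0]])
pred = gt + np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
print(weighted_localisation_errors(pred, gt).round(2))  # [0.1 0.2 0.2]
```

Note how the 3-mm error on the third landmark scores the same as the 2-mm error on the second: its nearest neighbour is farther away, so the same absolute error matters less.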
In this study, all landmark errors are expected to remain within a certain threshold, because many of them are directly related to one or more spinopelvic parameters. Thus, we use the maximum median error over landmarks for model selection, defined as:

$$\tilde{\xi}_{\max} = \max_{1 \le i \le M}\, \tilde{\xi}_i.$$

Model selection. Two hyperparameters were fine-tuned to obtain the optimal model, namely the heatmap type for regularisation (Gaussian, exponential, or none) and the heatmap width for regularisation ($\sigma \in \{d_\sigma, d_\sigma/2, d_\sigma/4\}$, if applicable). $d_\sigma$ is obtained using the following heuristic: we estimate the mean distance between consecutive vertebral centres, which is 41.2 pixels, and assume that a suitable Gaussian splotch should be confined approximately within this distance. Thus, we require $6 d_\sigma = 41.2$ pixels and obtain $d_\sigma \approx 6.9$.
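The width heuristic and the model-selection score can be sketched as below; `max_median_error` is an illustrative helper name (an assumption, not the authors' code).

```python
import numpy as np

# Heuristic width for the regularisation heatmaps: confine the splotch
# (roughly +/- 3 sigma on each side) within the mean distance between
# consecutive vertebral centres, estimated from the dataset.
mean_vertebral_distance = 41.2           # pixels
d_sigma = mean_vertebral_distance / 6.0  # 6 * d_sigma = 41.2  ->  d_sigma ~= 6.9

def max_median_error(errors):
    """Model-selection score: the maximum over landmarks of the
    per-landmark median weighted localisation error.

    errors: (N, M) array of weighted errors for N images, M landmarks.
    """
    return np.median(errors, axis=0).max()
```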
After fivefold cross-validation on the dataset (1690 radiographs), we observed that the model trained with heatmap regularisation (exponential function, σ = d_σ/4) and the model trained without heatmap regularisation achieved the highest performance, i.e. the lowest maximum median errors.
Model ensemble. An ensemble of the two optimal models was created to boost predictive performance. The ensemble was obtained by averaging (unweighted) their predictions.
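The unweighted averaging of the two models' predictions amounts to the following sketch, where each model is treated as a callable returning an (M, 2) coordinate array (an assumed interface for illustration).

```python
import numpy as np

def ensemble_predict(models, image):
    """Unweighted ensemble: average the landmark coordinates predicted
    by each model for the same input image.

    models: sequence of callables, each returning an (M, 2) array.
    """
    preds = np.stack([m(image) for m in models])  # (n_models, M, 2)
    return preds.mean(axis=0)
```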
Training details. All radiographs were downsized to 864 × 382 px and then padded to 864 × 480 px in the pre-processing pipeline. During training, several data-augmentation operations were applied to prevent the deep learning model from adapting to images of certain scales, orientations, and types of noise. The operations used were random scaling (scale ranging from 0.8 to 1.0), random rotation (angle ranging from −30° to 30°), and random Gaussian blur (strength ranging from 0 to 0.1). We used the Adam optimiser for loss minimisation during training. For all training experiments, we trained the model for 120 epochs; however, if the model's performance did not improve within 20 epochs, training was stopped early. The learning rate of the optimiser was set to 0.005 at the beginning and reduced to 0.0005 at epoch 100. The model was implemented and trained using TensorFlow 2.1.0 and Horovod 0.18.2 on six NVIDIA Tesla V100 GPUs. The batch size was set to 18 (3 images per GPU). We used cross-GPU batch normalisation 54 so that both the mean and variance were estimated using tensors scattered across all GPUs. Using TensorFlow, we enabled automatic mixed-precision training 55 for faster training and reduced GPU memory use.
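The learning-rate schedule and augmentation ranges described above can be expressed in plain Python as follows. This is a sketch under the stated hyperparameters; `lr_schedule` and `sample_augmentation` are illustrative helper names, not the authors' code.

```python
import random

def lr_schedule(epoch):
    """Step learning-rate schedule described above:
    0.005 from the start, dropped to 0.0005 at epoch 100."""
    return 0.005 if epoch < 100 else 0.0005

def sample_augmentation():
    """Sample one set of the data-augmentation parameters described above
    (random scale, rotation angle in degrees, Gaussian-blur strength)."""
    return {
        "scale": random.uniform(0.8, 1.0),
        "rotation_deg": random.uniform(-30.0, 30.0),
        "blur_strength": random.uniform(0.0, 0.1),
    }
```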
Statistical analysis. Our deep learning approach was evaluated using the test dataset (400 images) for landmark localisation and parameter estimation. Because the localisation errors were not normally distributed, we used boxenplots to visualise the error distributions. For parameter estimation, median parameter errors with IQRs are presented in addition to mean parameter errors with SDs. Pearson correlation coefficients were used to evaluate the correlations between the predicted radiographic parameters and the corresponding ground truth values. Wilcoxon signed-rank tests were used to evaluate the numerical differences between the deep learning model's predictions and the ground truth values. A p value < 0.05 was considered significant.
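For a single radiographic parameter, the two tests above correspond to the following SciPy calls; `compare_to_ground_truth` is a hypothetical helper used only to illustrate the analysis.

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def compare_to_ground_truth(pred, gt):
    """Correlation and paired-difference tests for one radiographic
    parameter over the test set (a sketch of the analysis described above).

    pred, gt: 1D arrays of predicted and ground truth parameter values.
    """
    r, p_corr = pearsonr(pred, gt)     # Pearson correlation coefficient
    stat, p_diff = wilcoxon(pred, gt)  # paired Wilcoxon signed-rank test
    return r, p_corr, p_diff
```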
The ICC was used to evaluate interobserver reliability between three human observers (a junior resident, a spine fellow, and a senior surgeon), the deep learning model, and the ground truth values using the dataset for interobserver reliability analysis (90 images). Reliability was classified into four grades according to the magnitude of the ICC: excellent (0.9−1.0), high (0.7−0.9), moderate (0.5−0.7), and low (0.25−0.5). The ICC data matrix was illustrated as a heatmap, in which colouring provides an overview of the numeric ICC differences for each radiographic parameter. Hierarchical cluster analysis was used to build a hierarchy of the ICC heatmap clusters. As a result, the sequences of the horizontal (interobserver reliability) and vertical (radiographic parameters) axes were redistributed according to the trend of the ICC magnitudes, such that higher ICC values cluster towards the upper-right corner and lower ICC values cluster towards the lower-left corner.
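The four-grade classification can be codified as a simple threshold function; `icc_grade` is an illustrative helper, and the handling of boundary values and of ICCs below 0.25 (which the grading scheme above does not cover) is an assumption of this sketch.

```python
def icc_grade(icc):
    """Classify an ICC value into the four reliability grades used above.

    Boundary values are assigned to the higher grade (an assumption);
    values below 0.25 fall outside the stated grades and return None.
    """
    if icc >= 0.9:
        return "excellent"
    if icc >= 0.7:
        return "high"
    if icc >= 0.5:
        return "moderate"
    if icc >= 0.25:
        return "low"
    return None
```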

Data availability
The training and test datasets generated for this study contain protected patient information. Some data may be available from the corresponding author for research purposes upon reasonable request.