Determining the anatomical side in knee radiographs using deep learning

An important quality criterion for radiographs is correct anatomical side marking. A deep neural network is evaluated to predict the correct anatomical side in radiographs of the knee acquired in anterior–posterior direction. In this retrospective study, a ResNet-34 network was trained on 2892 radiographs from 2540 patients to predict the anatomical side of knees in radiographs. The network was evaluated in an internal validation cohort of 932 radiographs from 816 patients and in an external validation cohort of 490 radiographs from 462 patients. The network showed an accuracy of 99.8% and 99.9% on the internal and external validation cohort, respectively, which is comparable to the accuracy of radiographers. The anatomical side of knees in anterior–posterior radiographs can thus be determined with high accuracy using deep learning.


Patient cohorts.
Training and internal validation cohorts were gathered by querying the in-house radiological information system (University Hospital Essen). The training cohort was randomly selected from all patients who had a radiograph of the knee in anterior–posterior direction acquired between 1.1.2009 and 31.12.2018. From each year, 300 radiographs were included in the training cohort. Radiographs were excluded for the following reasons: (a) not an AP view; (b) both knees acquired; (c) knee not fully visible (knee cut off or another body part acquired); (d) low image quality; (e) fibula missing.
The internal validation cohort was gathered in the same manner by including 1000 randomly selected radiographs of the knee acquired between 1.1.2019 and 31.12.2020. The same procedure and the same inclusion and exclusion criteria were applied as for the training cohort. In addition, to avoid positive bias, radiographs of patients who were already included in the training cohort were removed from the validation cohort.
A second, external validation cohort was collected from a neighboring hospital (Elisabeth Krankenhaus Essen). Using their radiological information system, all patients with a knee radiograph between 1.1.2021 and 31.5.2021 were gathered.
Altogether, the training set consisted of 2892 radiographs from 2540 patients, the internal validation cohort of 932 radiographs from 816 patients and the external validation cohort of 490 radiographs from 462 patients (Fig. 1).
Scanners and acquisition parameters. Radiographs in the training and validation cohorts were acquired mainly on scanners from Siemens (Siemens Healthineers, Erlangen, Germany) and AGFA (AGFA Healthcare, Mortsel, Belgium). The external validation cohort was acquired using Canon (Canon Europe, London, UK) scanners (Table 1). On average, radiographs in the training cohort were acquired with 64.8 kVp (range 54.9–75.0) and 7.7 mAs (range 1–62), and radiographs in the internal validation cohort with 64.5 kVp (range 54.9–76.8) and 7.5 mAs (range 1–35). For the external validation cohort these parameters were not available in the DICOM tags.
Removing anatomical side markers. As placing an anatomical side marker (ASM) is a major quality criterion, nearly all radiographs can be expected to contain one. Without any processing, a network could "cheat" during training and use the ASM to determine the anatomical side instead of the anatomy of the knee; such a network would then fail completely on data without ASMs. Therefore, all ASMs need to be removed for training. Care needs to be taken, as the removal itself could introduce an unexpected bias into the model: in clinical routine, right ASMs are typically placed on the patient's right side (corresponding to the left side of the image), while left ASMs are placed on the left side. If the ASMs were simply masked, e.g., with black rectangles, the position of the visible modification alone would give away the anatomical side, and the network could determine it simply by observing on which side the modification took place. Unfortunately, removing the ASMs without leaving visible evidence is not easy, as they are sometimes placed very close to or even inside the knee tissue. Therefore, the ASMs were first marked in all radiographs of the training cohort, and the radiographs were subsequently cropped to an area in which the ASMs are not visible (Fig. 2). For the validation cohorts this procedure is not necessary: since the network never saw these markers during training, it could not have learned their meaning and cannot use them for inference.
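The cropping step can be sketched as follows, assuming the ASM bounding boxes have already been annotated. The `crop_out_marker` helper and its largest-remaining-rectangle heuristic are illustrative stand-ins, not the authors' exact procedure:

```python
import numpy as np

def crop_out_marker(img, marker_box):
    """Crop a 2-D radiograph to the largest axis-aligned rectangle that
    excludes the annotated marker bounding box (y0, y1, x0, x1)."""
    h, w = img.shape
    y0, y1, x0, x1 = marker_box
    # Candidate crops: everything above, below, left of, or right of the marker.
    candidates = [
        (0, y0, 0, w),   # above the marker
        (y1, h, 0, w),   # below the marker
        (0, h, 0, x0),   # left of the marker
        (0, h, x1, w),   # right of the marker
    ]
    # Keep the candidate with the largest remaining area.
    best = max(candidates, key=lambda c: (c[1] - c[0]) * (c[3] - c[2]))
    return img[best[0]:best[1], best[2]:best[3]]
```

Note that cropping, unlike masking, leaves no visible rectangle whose position could betray the side.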
Annotations. Anatomical side labels were determined from the corresponding DICOM tags. Nonetheless, as the tags might contain errors, all images were checked by a radiographer (20 years of experience) and corrected where necessary.
Preprocessing. After cropping, the radiographs were checked a second time to ensure that no side markers remained in the data. All radiographs were then reduced from 12-bit to 8-bit depth by simple linear rescaling and resized to 256 × 256 pixels.
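The bit-depth reduction can be sketched as below, assuming the full 12-bit range is mapped linearly onto the 8-bit range (the exact windowing is not specified in the text; the subsequent resize to 256 × 256 would typically be done with an image library such as Pillow):

```python
import numpy as np

def to_uint8(img12):
    """Linearly rescale a 12-bit image (values in [0, 4095]) to 8-bit.
    Assumes the full nominal 12-bit range; min-max windowing would be
    an alternative interpretation of 'linear rescaling'."""
    out = img12.astype(np.float32) / 4095.0 * 255.0
    return np.round(out).astype(np.uint8)
```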
Neural network. For modeling, a standard architecture, the ResNet-34 pretrained on the ImageNet dataset, was chosen16. This architecture has shown excellent performance in imaging tasks, is readily available, and is medium-sized, which should fit the amount of data available. Two loss functions were tested for optimization: cross-entropy loss and focal loss (with parameters α = 1 and γ = 2). The latter was included because it gives more weight to examples that are harder to classify than cross-entropy loss does17. The loss was optimized using the Adam optimizer (with parameters β1 = 0.9, β2 = 0.999). Multiple augmentations such as brightness changes and sharpening were applied during training to synthetically increase the sample size, which helps the network generalize better (a full list can be found in Supplemental 1). The batch size was set to 32, and training was stopped after 30 epochs. For development, PyTorch v1.418 and PyTorch-Lightning v1.2.619 were used. Details on the network can be found in Supplemental 1.

Cross-validation. One of the most important hyperparameters when training a neural network is the learning rate. Therefore, a fivefold cross-validation was employed to choose an optimal learning rate and loss function. The accuracy on the test fold was used to measure the performance of each model, and the learning rate that showed the highest accuracy was selected for training the final model.
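The focal loss mentioned above, FL(p_t) = −α(1 − p_t)^γ log(p_t), can be sketched in NumPy for the binary case. This is an illustration of the loss function, not the training code itself:

```python
import numpy as np

def focal_loss(p, y, alpha=1.0, gamma=2.0):
    """Binary focal loss. p is the predicted probability of class 1,
    y is the true label in {0, 1}. With gamma = 0 and alpha = 1 this
    reduces to ordinary cross-entropy."""
    pt = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return -alpha * (1.0 - pt) ** gamma * np.log(pt)
```

With γ = 2, a confidently correct example (e.g., p_t = 0.9) contributes a hundredfold less to the loss than under cross-entropy, shifting the optimization toward hard examples.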
Validation. The final model was then retrained on the whole training data set with the selected learning rate and loss. The same training parameters were used, i.e., training was performed with the Adam optimizer for 30 epochs with a batch size of 32. The performance of the final model was then measured on both the internal and the external validation cohort.
To understand how the network derived its decisions from the radiographs, occlusion sensitivity maps 20 with a stride of 8 and a patch size of 48 were employed to produce heat maps of the regions that contributed to a given decision.
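The occlusion-sensitivity procedure can be sketched as below; `score_fn` stands in for the trained network's score for the predicted class, so this is an illustration of the technique rather than the evaluation code:

```python
import numpy as np

def occlusion_map(img, score_fn, patch=48, stride=8, fill=0):
    """Slide an occluding patch over the image and record how much the
    model's score drops when each region is hidden. A large drop marks a
    region that is important for the decision."""
    h, w = img.shape
    base = score_fn(img)  # score on the unmodified image
    hm = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch] = fill
            hm[i, j] = base - score_fn(occluded)
    return hm
```

The resulting map is coarser than the input (one value per stride step) and is typically upsampled and overlaid on the radiograph as a heat map.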

Statistics.
All descriptive statistics were reported as mean ± standard deviation or standard error where appropriate. Statistical significance was chosen to be below a p-value of 0.05. All analyses were conducted with Python 3.7 and the SciPy 1.5.4 package21.

Results
Patient collective. The mean age of all patients was 44.5 ± 24.4 years (range 0–101 years), with 1908 females and 1910 males (Table 2). Sex distribution differed little between the training and validation cohorts, but there was a significant difference in age, especially for the external validation cohort (Fig. 3).
Annotations. All knees in the radiographs were labeled as either left or right. The training cohort contained 1403 left (48.5%) and 1489 right knees (51.5%), while the internal validation cohort contained 475 left (51%) and 457 right knees (49%). Similarly, the external validation cohort contained 260 left and 229 right knees. No significant difference between the training set and each validation cohort could be seen using chi-square tests (both p > 0.05).
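The chi-square comparison of label distributions can be reproduced with SciPy from the reported counts; shown here for the training versus external validation cohort (a sketch of the test, not the authors' analysis script):

```python
from scipy.stats import chi2_contingency

# Rows: cohorts; columns: left and right knee counts as reported.
table = [[1403, 1489],   # training cohort
         [260,  229]]    # external validation cohort
chi2, p, dof, expected = chi2_contingency(table)
# p > 0.05: no significant difference in left/right distribution.
```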

Cross-validation.
The best learning rate was 3 × 10⁻⁴ using cross-entropy as the loss, with a mean accuracy of 99.7% (SE: 0.001%) (Table 3). On the 2892 images in the training set, the models trained with this learning rate made 8 errors altogether; 6 times a right knee was predicted to be a left knee, and 2 times vice versa (Fig. 4).

Table 2. Demographics of the patient collective. The p-value denotes the significance of a chi-square and a Wilcoxon rank-sum test for sex and age between the training and the internal and external validation cohorts, respectively.

Discussion
Side marking in X-ray images is of great clinical importance to avoid side confusion with potentially fatal consequences. Whereas in the past a radiopaque side marker was placed in the examination area and also x-rayed, nowadays the side marking is increasingly done digitally. To date, this is done manually by radiographers and is tedious work due to the sheer volume of radiographs. Because the digital detectors used today cannot be exposed from the wrong side, unlike X-ray films or storage film systems in the past, it is possible to automate this task, at least for skeletal images. In our study we demonstrated that this manual work can be performed equally well by a neural network. We trained a standard network architecture, the ResNet-34, to determine the anatomical side in radiographs of the knee.
The network demonstrated excellent performance on an internal as well as an external validation cohort. The accuracy on both cohorts was slightly higher than during cross-validation, which might be due to chance. Nonetheless, the network was able to generalize to unseen data without compromising overall accuracy.
When comparing the accuracy to that of the radiographers, the network performed similarly during the internal cross-validation. The radiographers mislabeled 0.38% of all cases, which is nearly the same as the error of the network at 0.28% (p = 0.65, using a chi-square test). On the internal and external validation sets, the radiographers mislabeled 0.2% of all cases. Although the network did not show any error here, the difference is not statistically significant (p = 0.48 and p = 1.0). It therefore cannot be claimed that the network performs better just because it showed perfect accuracy. In contrast to the radiographers, who missed placing the anatomical side markers in around 1.3%, 3.1%, and 2.5% of all cases, the network will, by design, always produce a prediction.
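The comparison of error rates during cross-validation can be checked with a chi-square test on the 2 × 2 contingency table. The network's 8 errors are reported directly; the radiographers' count of 11 is reconstructed here from the reported rate of 0.38% of 2892 cases, so it is an inferred figure:

```python
from scipy.stats import chi2_contingency

# Rows: network vs. radiographers; columns: errors vs. correct labels.
# 8 network errors reported; 11 radiographer errors inferred from 0.38% of 2892.
table = [[8,  2892 - 8],
         [11, 2892 - 11]]
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction (scipy default for 2x2)
# p is approximately 0.65, matching the value reported in the text.
```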
We applied occlusion sensitivity maps to obtain a rough estimate of which parts of the knee radiograph the network draws its decision from. As expected, the fibula seems to be the most important region for determining the anatomical side. Unexpectedly, however, a second hot spot is visible in many images, on the opposite side of the fibula and above the intercondylar area. The network seems to take into consideration the slant of the lower end of the femur, which also appears to be a rather good indicator of whether a knee is a left or a right one. Nonetheless, we stress that any network is in general a black box, and such interpretations have to be taken with some care.

Table 3. Mean accuracy and standard error of the models trained during cross-validation. The highest accuracy is marked in bold.

Although determining the anatomical side and placing ASMs is a routine task, only very few studies have attempted to automatically predict such properties in radiographs: for chest radiographs, Xue et al. present a system based on classical image processing to predict whether a radiograph shows the chest in frontal or lateral view22, while Reza et al. predict the projection (PA vs. AP), also using classical image processing and machine learning techniques23. Fang et al. present a more general network that classifies the view in general radiographs24. As their problem is much wider, their accuracy is naturally lower. They also removed radiographs of children, which often show much higher variability. Other studies concentrate on radiological findings; for example, several present deep networks for generating labels that can be used directly in radiological reports25,26. These models assume the anatomical side to be correct, which could be checked using a network like the one presented here.
Determining the anatomical side of the knee in AP views seems a rather easy task as long as the fibula is visible. Nonetheless, biological variability and acquisition conditions differ greatly, as scans may contain screws, implants, casts, and other temporary stabilization artifacts. This variation makes automation using classical image processing and machine learning techniques rather complex and error-prone. Neural networks, on the other hand, are much better suited to the task, and our results show that even an off-the-shelf network can solve such problems easily.
We restricted ourselves to knee radiographs acquired in AP direction, since this is the most common acquisition in clinical routine, and our network showed perfect accuracy in two validation cohorts, matching that of the radiographers and making application in clinical routine possible. However, because a wrong ASM can have severe consequences and deep learning networks are in general black boxes, the prediction of the network could be presented to the radiographer for a second check, making mistakes even less probable.
We have little doubt that similar networks to determine the anatomical side for other body parts such as the chest, abdomen, spine, and upper and lower extremities, as well as in lateral views or combined AP and lateral views, can easily be developed with similarly high accuracies. Such studies should be performed in the future. Since the internal and external validation cohorts were acquired with different scanners but represent a similar population, it should also be verified that the network performs equally well in other populations.
Our cohorts were acquired from clinical routine and were only randomly subsampled to reduce the sample size. Despite this, a statistically significant age difference was observed. While we have no clear explanation for this difference, since the populations of the hospitals should be relatively similar, it is not a disadvantage of our study. Indeed, our network still performed excellently, showing that it can work with images of patients of any age.

Figure 5. Occlusion sensitivity maps for randomly selected images from the internal validation cohort (left column) and external validation cohort (right column). In each row, first the cropped original image is shown, then the occlusion sensitivity map, and finally an overlay of both. In general, two hot spots are visible, which correspond to the regions most important for the network's decision: the fibula as well as the lower end of the femur opposite the fibula.
In our experiments, we used the ResNet-34 network because it is a medium-sized network and has performed excellently in many applications. However, we believe that other networks, e.g., the Inception V3 or the VGG-16 network, would work equally well. Similarly, contrary to our expectation, the cross-entropy loss performed better than the focal loss in our experiment. This might be because the focal loss has two parameters, which we may have selected suboptimally and did not tune during cross-validation. Nonetheless, although the cross-entropy loss was better during cross-validation, producing only 8 instead of 14 errors, this difference is not statistically significant (p = 0.29, using a chi-square test).
Some limitations apply to our study: even though we used radiographs from two validation cohorts obtained from scanners from different vendors, it remains necessary to check that the network performs equally well in different populations. In addition, we restricted ourselves to AP views; while these are by far the most common view direction in clinical routine, a network that can also deal with PA views would be desirable. Finally, we assumed that the radiographs were acquired with a digital detector system, so that no mirroring occurred when the images were transferred to the PACS, which is more common when using film-screen radiography.
In conclusion, a neural network to determine the anatomical side in radiographs of the knee was trained and was shown to perform excellently in two validation cohorts.

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.