Automatic classification of canine thoracic radiographs using deep learning

The interpretation of thoracic radiographs is a challenging and error-prone task for veterinarians. Despite recent advancements in machine learning and computer vision, the development of computer-aided diagnostic systems for radiographs remains a challenging and unsolved problem, particularly in the context of veterinary medicine. In this study, a novel method, based on a multi-label deep convolutional neural network (CNN), for the classification of thoracic radiographs in dogs was developed. All thoracic radiographs of dogs acquired between 2010 and 2020 at the institution were retrospectively collected. Radiographs were taken with two different radiograph acquisition systems and were divided into two data sets accordingly. One data set (Data Set 1) was used for training and testing, and the other (Data Set 2) was used to test the generalization ability of the CNNs. The radiographic findings used as non-mutually exclusive labels to train the CNNs were: unremarkable, cardiomegaly, alveolar pattern, bronchial pattern, interstitial pattern, mass, pleural effusion, pneumothorax, and megaoesophagus. Two different CNNs, based on the ResNet-50 and DenseNet-121 architectures respectively, were developed and tested. The CNN based on ResNet-50 had an Area Under the Receiver Operating Characteristic curve (AUC) above 0.8 for all the included radiographic findings except bronchial and interstitial patterns, on both Data Set 1 and Data Set 2. The CNN based on DenseNet-121 had a lower overall performance. Statistically significant differences in generalization ability between the two CNNs were evident, with the CNN based on ResNet-50 showing better performance for alveolar pattern, interstitial pattern, megaoesophagus, and pneumothorax.

Thoracic radiographs are part of the routine clinical evaluation of patients with confirmed or suspected thoracic pathology in both human and veterinary medicine. Nevertheless, interpreting thoracic radiographs is a challenging and error-prone task for the medical doctor 1,2, and for the veterinary practitioner alike 3. In human medicine, despite efforts to improve radiology residents' training programmes, the prevalence of interpretation errors has not significantly improved in recent decades 1,2. The prevalence and impact of interpretation errors on thoracic radiographs have only seldom been investigated in veterinary medicine 4. Conversely, this topic has been widely studied in human medicine and the most common causes of interpretation errors have been identified 5-7. Different strategies to reduce interpretation errors have been proposed in both human 1,8 and veterinary medicine 3; among these is the use of computer-aided detection (CAD) tools to support the practitioner in everyday practice 6,9. The high performance shown by deep-learning algorithms in several radiology-related tasks has driven very active research in this field, with an increasing number of publications 10. In particular, deep-learning algorithms for the detection of specific pathologies or conditions such as pneumothorax 11, pneumonia 12, malignant nodules 13 and COVID-19 14 have been proposed. In addition, broader applications of these algorithms, such as automatic triaging 15 and automatic labeling of chest radiographs 16, have been investigated. Furthermore, several artificial intelligence-based products for the automatic detection of specific conditions, both on plain radiographs and on computed tomographic images, have been approved by the Food and Drug Administration in the last few years, thereafter becoming commercially available.
To date, the possibilities offered by deep learning in veterinary medicine have been investigated for the classification of magnetic resonance images 17,18, for the detection of liver degeneration from ultrasound images 19, and for the automatic classification of corneal lesions from photographs 20. Multi-label algorithms allow for the detection of different objects (in our case, lesions) on the same image. In multi-label training, each image is annotated with multiple labels according to the lesions evident on the radiograph 21. To the best of the authors' knowledge, both in human 11,12,22 and in veterinary medicine 22,23, most studies applying CNNs to thoracic radiographs focus on detecting individual pathologies or conditions, whereas studies using a multi-label approach are relatively scarce in the human medical literature 16,21,24,25, and the use of multi-label algorithms on canine thoracic radiographs has not yet been explored. Therefore, the aims of this study were: (1) to develop a multi-label deep learning-based network capable of detecting some of the most common lesions found on plain radiographs of the canine thorax; (2) to test the generalization ability of the developed algorithm on an external data set of radiographs.

Results
Database. The complete database was composed of 3839 latero-lateral (LL) radiographs. Data Set 1 comprised 3063 LL images; a further 632 LL images were discarded due to incorrect positioning or poor image quality. Data Set 2 comprised 776 LL images; a further 77 LL radiographs were excluded because of positioning errors or poor image quality. In both data sets, "unremarkable" and "cardiomegaly" were the two most represented labels. The different radiographic findings were unevenly distributed between the two data sets, with some over-represented and some under-represented in Data Set 2 when compared to Data Set 1.

Selection of the radiographic findings.
Only a limited number of radiographs showing tracheal collapse, hernia, fracture and pneumomediastinum were available in Data Set 1 (Table 1), and, therefore, these radiographic findings were excluded from training. Thus, the radiographic findings used to train the network were: unremarkable, cardiomegaly, alveolar pattern, bronchial pattern, interstitial pattern, mass, pleural effusion, pneumothorax, and megaoesophagus.
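As a concrete illustration of the multi-label set-up, each radiograph can be represented by a multi-hot vector over these nine retained findings. A minimal sketch in Python (the finding list follows the study; the function name is illustrative):

```python
# The nine non-mutually exclusive findings retained for training,
# in the order listed above.
FINDINGS = [
    "unremarkable", "cardiomegaly", "alveolar pattern", "bronchial pattern",
    "interstitial pattern", "mass", "pleural effusion", "pneumothorax",
    "megaoesophagus",
]

def encode_labels(present):
    """Return a multi-hot vector marking which findings are present
    on one radiograph (findings are not mutually exclusive)."""
    present = set(present)
    return [1 if finding in present else 0 for finding in FINDINGS]

# A radiograph showing both cardiomegaly and pleural effusion:
print(encode_labels({"cardiomegaly", "pleural effusion"}))
# → [0, 1, 0, 0, 0, 0, 1, 0, 0]
```

Because the entries are independent, any combination of findings can be encoded, which is exactly what distinguishes this from single-label (one-hot) classification.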
Classification results. ResNet-50 had a higher classification accuracy than DenseNet-121, both on Data Set 1 and on Data Set 2, for all the considered radiographic findings except pleural effusion. The classification accuracy of the two architectures on Data Set 1 and Data Set 2 is reported in Tables 2 and 3. For some radiographic findings, the classification accuracy of both ResNet-50 and DenseNet-121 was higher on Data Set 2 than on Data Set 1. In particular, both architectures showed a higher accuracy on Data Set 2 than on Data Set 1 for alveolar pattern. Furthermore, DenseNet-121 also showed higher accuracy on Data Set 2 than on Data Set 1 for bronchial pattern, cardiomegaly, megaoesophagus, unremarkable and pneumothorax. For the remaining radiographic findings, accuracy on Data Set 2 was lower than on Data Set 1. Statistically significant differences in accuracy on Data Set 2 (generalization accuracy) between ResNet-50 and DenseNet-121 were evident for: (1) alveolar pattern (Z = 3.813, P = 0.0001); (2) interstitial pattern (Z = 3.283, P = 0.0010); (3) megaoesophagus (Z = 2.257, P = 0.0240); (4) pneumothorax (Z = 3.314, P = 0.0009). No differences were evident for: cardiomegaly (Z = 0.800, P = 0.4270); mass (Z = 1.580, P = 0.1142); unremarkable (Z = 0.817, P = 0.4137); pleural effusion (Z = 0.347, P = 0.7286). A graphical representation of the classification results of the model is reported in Fig. 1.
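The text does not spell out which statistical test produced the reported Z values; a common choice for comparing two classifiers' accuracies on a shared test set is the two-proportion z-test, sketched below under that assumption (the function name and the counts in the example are illustrative):

```python
import math

def two_proportion_z(correct_a, correct_b, n):
    """Z statistic for the difference between two classification accuracies,
    where the two models classified the same n test images correctly
    correct_a and correct_b times, respectively."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)        # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))   # pooled standard error
    return (p_a - p_b) / se

# Example with made-up counts: 90/100 vs. 80/100 correct.
print(round(two_proportion_z(90, 80, 100), 2))
# → 1.98
```

The resulting Z is compared against the standard normal distribution to obtain the two-sided P values reported above.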

Discussion
A new, deep learning-based, multi-label classification method for the automatic detection of several radiographic findings on canine thoracic radiographs is proposed. The high classification accuracy shown by both tested architectures on Data Set 2, for almost all the radiographic findings, suggests that multi-label CNNs can be successfully trained even on relatively small and highly unbalanced databases. On the other hand, differences between the veterinary and the human medical literature in how several radiographic findings are classified make comparison with similar studies 21,25 not entirely straightforward. Moreover, some radiographic findings that are common in humans (e.g. emphysema, fibrosis) are rarely found in dogs. Nonetheless, a direct comparison between human and veterinary examples is feasible for some radiographic findings, such as cardiomegaly, pleural effusion, pneumothorax, consolidation (labelled "alveolar pattern" in this study) and unremarkable 21,25. Interestingly, for all the above-mentioned radiographic findings, the AUC of the developed CNN was similar to or higher than that reported in similar studies on humans 21,25, both for Data Set 1 and for Data Set 2.
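For reference, the AUC values compared here can be read as the probability that a randomly chosen positive radiograph receives a higher score than a randomly chosen negative one. A minimal pairwise-count sketch (illustrative function name; a real evaluation would use an optimized library implementation):

```python
def empirical_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive image receives the higher score; ties count half.
    Equivalent to the normalized Mann-Whitney U statistic."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfectly separated scores give an AUC of 1.0:
print(empirical_auc([0.9, 0.8], [0.1, 0.2]))
# → 1.0
```

An AUC of 0.5 corresponds to chance-level ranking, which is why the 0.8 threshold used throughout this discussion marks a comfortably better-than-chance detector.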
Another interesting aspect of this research relates to the large variability in body size and shape typical of the dog, which directly translates into a wide range of normal radiographic appearances of the canine thorax. Indeed, the dog is the only known species with a 50-fold variability in size among individuals. The radiographic appearance of the thorax of, for example, a bulldog, a dachshund, or a German shepherd is therefore very different. Despite such variability, the developed CNN was able to detect most of the radiographic findings included in the training with an accuracy ranging from moderate to very good. In particular, ResNet-50 displayed an AUC above 0.8 in the detection of alveolar pattern, cardiomegaly, megaoesophagus, pleural effusion, and pneumothorax. In addition, it showed high accuracy in identifying normal radiographs (labelled "unremarkable"). Interestingly, in similar experiments in humans the accuracy in identifying radiologically normal images was lower 25. Conversely, the AUC was below 0.8 for bronchial pattern, interstitial pattern and mass. It is the authors' opinion that the limited generalization ability shown by ResNet-50 in the detection of bronchial and interstitial patterns might be related to the difference in image quality of the original DICOM images between Data Set 1 and Data Set 2. In fact, the radiographs acquired using the CR system had a lower image quality than those acquired through the DR system. Another possible explanation is that bronchial and interstitial patterns were not assessed on VD images. On the other hand, the low accuracy in the detection of masses could be related to the inability of the network to consider orthogonal views simultaneously.
The low accuracy in detecting masses shown by ResNet-50 and DenseNet-121, on both Data Set 1 and Data Set 2, is probably related to the fact that several mass-like structures (for example, nipples, degeneration of the costochondral joints in older animals, and pleural mineralizations) are often present in normal radiographs. Interestingly, in the experiments by Wang et al. (2017) 24 and Yao et al. (2018) 26 as well, the accuracy in detecting masses and nodules in humans was low (AUC below 0.8). The developed CNN had variable performance in the detection of the different lesions and, therefore, results obtained with the current version of the CNN should be confirmed with other methods (e.g. interpretation by a radiologist, computed tomography, magnetic resonance imaging) before clinical decisions are made based on those results. ResNet-50 and DenseNet-121 are the two most commonly used pre-trained CNNs for multi-label chest X-ray image classification 21,24,26. In this study, ResNet-50 showed a significantly higher generalization ability than DenseNet-121 in the detection of alveolar pattern, interstitial pattern, megaoesophagus, and pneumothorax, whereas no differences were evident for cardiomegaly, mass, unremarkable and pleural effusion. In previous human studies, these two network architectures demonstrated variable accuracy in the detection of radiographic lesions, with ResNet-50 performing better than DenseNet-121 for some lesions and vice versa 21. Furthermore, in some studies, both ResNet-50 and DenseNet-121 were used as backbones for category-wise, residual, and attention-based modules 21. Incorporating such modules within the network is reported to increase the average AUC 21. These modules were not included in the present study, mainly due to the limited data set size and the highly imbalanced lesion distribution.
Models trained on a specific data set do not always achieve comparable performance when tested on data sets from a different institution; accuracy increases if data sets acquired from multiple institutions are used for training 27. A limitation of this study is that both data sets were acquired at the same institution, and a data set from an external veterinary clinic was not available. However, in order to take cross-center generalization into account, Data Set 1 and Data Set 2 (used for training and testing, respectively) were acquired using two different radiograph acquisition systems. Further studies, possibly including radiographs acquired at multiple veterinary clinics, could help clarify the generalization performance of the developed CNN. Furthermore, it is also possible that the exclusion of incorrectly positioned and incorrectly exposed radiographs from both the training and the test set might have biased the classification accuracy towards more favorable results. The possibility of automatically detecting positioning or exposure abnormalities has not yet been explored.
Another limitation of the present study is that the radiographic findings included in the training set do not, of course, fully represent all the lesion types that might occur in canine thoracic radiographs. Furthermore, due to the limited number of available cases, radiographs showing the least represented radiographic findings (tracheal collapse, hernia, fracture, and pneumomediastinum) were not included in the training. For the above reasons, the real "in-field" generalization ability of the developed CNN has yet to be fully tested.
The developed CNN is prospectively aimed at assisting veterinary clinicians, both general practitioners and radiology specialists, in their daily work. It is the authors' opinion that the use of deep learning-based tools during routine clinical activity will increase productivity while decreasing the error rate. Generally speaking, veterinary facilities are smaller than human hospitals, and the global number of veterinary specialists across all disciplines is significantly lower than the number of specialist medical doctors. Therefore, veterinary general practitioners are required to develop expertise in several different fields of medicine, such as radiology, surgery, internal medicine, and pathology. It is the authors' opinion that, in such a scenario, veterinarians could greatly benefit from the use of deep learning-based tools to assist them in their clinical routine. Indeed, several application cases for these algorithms have been proposed and analysed in the human medical literature. For instance, the use of deep learning-based algorithms is reported to increase the accuracy of skilled radiologists in the detection of pulmonary nodules 9, and to decrease the average reporting delay in a clinical setting 15. The possible impact of CNN use in the veterinary medical field has not yet been evaluated.

Methods
Database creation. Radiographic findings. All the images were reviewed by three experienced veterinary radiologists (AZ, TB and SB, with more than 20, 10, and 3 years' experience, respectively). Before interpretation, image quality was assessed; in particular, radiograph exposure and patient positioning were evaluated. Only properly exposed images with the animal positioned correctly were included in both data sets. Radiographs of immature dogs and images with evident artefacts (double exposure, dirt on the cassette, etc.) were also excluded. When available, both LL and VD radiographs of the same patient were reviewed simultaneously. The radiographs were classified strictly based on the presence or absence of individual radiographic findings and not on the presence or absence of pathologies (e.g. pneumonia) or conditions (e.g. oedema) that might be characterized by the simultaneous presence of several radiographic findings. All the radiographs were labelled according to the following radiographic findings: alveolar pattern, interstitial pattern, bronchial pattern, mass, cardiomegaly, pleural effusion, pneumothorax, hernia, megaoesophagus, fracture, pneumomediastinum, and tracheal collapse. If no radiographic findings were evident, the image was classified as unremarkable. The distribution (focal vs. diffuse) of both alveolar and interstitial patterns was not considered. Interstitial and bronchial patterns were graded as mild, moderate, or severe. Mild bronchial and interstitial patterns were considered normal variations in the radiographic appearance of the canine thorax and, therefore, were not included in the training. If only mild bronchial and interstitial patterns were evident, the radiograph was classified as unremarkable. Cases showing either segmental or diffuse megaoesophagus were classified as megaoesophagus. The presence of cardiomegaly was assessed based on the authors' experience.
In unclear cases, the vertebral heart score 28 was calculated and then compared with the breed-specific reference intervals reported in the literature. Mediastinal and thoracic wall masses were included under the mass label. Both diaphragmatic and abdominal wall hernias were classified as hernia. Likewise, fractures of both the ribs and the vertebral column were classified as fracture. Fractures of the long bones were not considered. No grading score was assigned to tracheal collapse. All the images were reviewed simultaneously by the three authors and all the labels were assigned following a consensus discussion.
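The grading rule described above (mild bronchial or interstitial patterns count as normal variation) amounts to a small relabelling step before training. A sketch of that step; the grade names and the function name are illustrative assumptions:

```python
# Findings whose mild grade is treated as normal variation.
MILD_AS_NORMAL = {"bronchial pattern", "interstitial pattern"}

def final_labels(graded_findings):
    """graded_findings maps each observed finding to a grade
    ('mild', 'moderate', 'severe') or to None for ungraded findings.
    Returns the label set after discounting mild patterns."""
    kept = {
        name for name, grade in graded_findings.items()
        if not (name in MILD_AS_NORMAL and grade == "mild")
    }
    # A radiograph left with no findings is classified as unremarkable.
    return kept if kept else {"unremarkable"}

# Only a mild bronchial pattern present -> relabelled as unremarkable:
print(final_labels({"bronchial pattern": "mild"}))
# → {'unremarkable'}
```

A severe pattern, or a mild pattern accompanied by another finding, survives the relabelling unchanged.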
Image processing and deep learning. The deep-learning analysis was performed on a dedicated workstation (Linux operating system, Ubuntu 18.04, Canonical) equipped with four graphics processing units (Tesla V100; NVIDIA), a 2.2 GHz processor (Intel Xeon E5-2698 v4; Intel) and 256 GB of random-access memory. Before being fed to the CNN, the images were downsampled to 224 × 224 pixels. The images were not cropped during the test phase, nor were they lossy-compressed or converted to JPEG; instead, the lossless MHA format was used. Radiograph classification was performed using convolutional neural networks (CNNs), a special class of deep-learning algorithms specifically designed to work with images. Two different CNN architectures were tested: (1) DenseNet-121 29 and (2) ResNet-50 30. Both architectures were pre-trained on ImageNet, a large-scale data set of everyday images, and then fine-tuned. Different radiographic findings are usually evident on the same radiograph, often as a result of a single condition or pathology, and, therefore, a multi-label approach was used. Binary cross-entropy was used as the objective function. The same training parameters were used for all the networks. Training was performed until convergence using the Adam optimizer and a learning-rate scheduler with exponential decay. The weights from the epoch with the lowest loss on the validation set were chosen and used for testing. The training set was augmented by random horizontal/vertical flips, cropping, affine warping, and linear contrast changes. All the images were normalized to the 0-1 range, where 0 denotes the background. The split ratio for training, validation, and test sets (for Data Set 1) was 8:1:1, respectively. The training scheme did not directly optimize any of the evaluation metrics (e.g. AUC, sensitivity, or specificity). No information from Data Set 2 was used during training.
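The multi-label objective described above pairs an independent sigmoid output per finding with a binary cross-entropy term averaged over findings. A framework-free sketch of that loss (in practice it would be computed by the deep-learning library on GPU; the function names are illustrative):

```python
import math

def sigmoid(z):
    """Squash one logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_bce(logits, targets):
    """Binary cross-entropy averaged over findings. Each logit is squashed
    independently, so several findings can be 'present' at once, unlike
    a softmax over mutually exclusive classes."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)

# An uncertain prediction (logit 0 -> probability 0.5) for a present
# finding costs ln(2) ≈ 0.693 per label:
print(round(multi_label_bce([0.0], [1]), 3))
# → 0.693
```

Because each finding contributes its own independent term, the loss naturally handles radiographs annotated with any combination of the nine labels.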

Conclusions
A multi-label CNN-based network for the automatic classification of canine LL radiographs was developed and tested. The developed network had variable accuracy in the detection of radiographic findings in an external test set. Further studies, ideally including a larger number of radiographs acquired in several different veterinary institutions, could allow the development of a network with a broader generalization ability. Furthermore, a larger database would also allow the network to be tested on VD images. CNN-based tools could, prospectively, assist veterinarians in their everyday work, allowing for higher-quality veterinary care. Nonetheless, for a successful application of these tools in the clinical workflow, their advantages and pitfalls must be clearly known by the operator.

Data availability
The data sets generated and/or analysed during the current study are not publicly available because they are the property of the Veterinary Teaching Hospital of the University of Padua, but they are available from the corresponding author on reasonable request.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.