Thoracic radiographs are part of routine clinical evaluation of patients with confirmed or suspected thoracic pathology both in human and veterinary medicine. Nevertheless, interpreting thoracic radiographs is a challenging and error-prone task for the medical doctor1,2, and for the veterinary practitioner alike3. In human medicine, despite the efforts to improve radiology residents’ training programmes, the prevalence of interpretation errors has not significantly improved in recent decades1,2. The prevalence and the impact of interpretation errors on thoracic radiographs have only seldom been investigated in veterinary medicine4. Conversely, this topic has been widely studied in human medicine and the most common causes of interpretation errors have been identified5,6,7. Different strategies to reduce interpretation errors have been proposed both in human1,8 and veterinary medicine3; among these is the use of computer-aided detection (CAD) tools to support the practitioner in everyday practice6,9.

The high performance shown by deep-learning algorithms in several radiology-related tasks has driven very active research in this field, with an increasing number of publications10. In particular, deep-learning algorithms for the detection of specific pathologies or conditions such as pneumothorax11, pneumonia12, malignant nodules13 and COVID-1914 have been proposed. In addition, broader applications of these algorithms, such as automatic triaging15 and automatic labeling of chest radiographs16, have been investigated. Furthermore, several artificial intelligence-based products for the automatic detection of specific conditions, both on plain radiographs and computed tomographic images, have been approved by the Food and Drug Administration in the last few years and have subsequently become commercially available.

To date, the possibilities offered by deep learning in veterinary medicine have been investigated for the classification of magnetic resonance images17,18, for the detection of liver degeneration from ultrasound images19 and for the automatic classification of corneal lesions from photographs20. Multi-label algorithms allow for the detection of different objects (in our case lesions) on the same image. In multi-label training, each image is annotated with multiple labels according to the lesions evident on the radiograph21. To the best of the authors’ knowledge, both in human11,12,22 and in veterinary medicine22,23, most of the studies on applying convolutional neural networks (CNNs) to thoracic radiographs are focused on detecting individual pathologies or conditions, whereas studies using a multi-label approach are relatively scarce in the human medical literature16,21,24,25 and the use of multi-label algorithms on canine thoracic radiographs has not yet been explored. Therefore, the aims of this study are: (1) to develop a multi-label deep learning-based network capable of detecting some of the most common lesions found on plain radiographs of the canine thorax; (2) to test the generalization ability of the developed algorithm on an external data set of radiographs.



The complete database was composed of 3839 latero-lateral (LL) radiographs. Data Set 1 comprised 3063 LL images; 632 LL images were discarded due to incorrect positioning or poor image quality. Data Set 2 comprised 776 LL images; 77 LL radiographs were excluded because of positioning errors or poor image quality. In both data sets, “unremarkable” and “cardiomegaly” were the two most represented radiographic findings. The distribution of the different radiographic findings was uneven between the two data sets, with some findings over-represented and some under-represented in Data Set 2 compared to Data Set 1.

Table 1 Number of LL radiographs showing the following included radiographic findings.

Selection of the radiographic findings

Only a limited number of radiographs showing tracheal collapse, hernia, fracture and pneumomediastinum were available in Data Set 1 (Table 1) and, therefore, these radiographic findings were excluded from training. Thus, the radiographic findings used to train the network were: unremarkable, cardiomegaly, alveolar pattern, bronchial pattern, interstitial pattern, mass, pleural effusion, pneumothorax and megaoesophagus.

Classification results

ResNet-50 had a higher classification accuracy than DenseNet-121, both on Data Set 1 and on Data Set 2, for all the considered radiographic findings except pleural effusion. The classification accuracy of the two architectures on Data Set 1 and Data Set 2 is reported in Tables 2 and 3. For some radiographic findings, the classification accuracy of both ResNet-50 and DenseNet-121 was higher on Data Set 2 than on Data Set 1. In particular, both architectures showed a higher accuracy on Data Set 2 than on Data Set 1 for alveolar pattern. Furthermore, DenseNet-121 also showed higher accuracy on Data Set 2 than on Data Set 1 for bronchial pattern, cardiomegaly, megaoesophagus, unremarkable and pneumothorax. For the remaining radiographic findings, accuracy on Data Set 2 was lower than on Data Set 1. Statistically significant differences in accuracy on Data Set 2 (generalization accuracy) between ResNet-50 and DenseNet-121 were evident for: (1) alveolar pattern (Z = 3.813, P = 0.0001); (2) interstitial pattern (Z = 3.283, P = 0.0010); (3) megaoesophagus (Z = 2.257, P = 0.0240); (4) pneumothorax (Z = 3.314, P = 0.0009). No significant differences were evident for: cardiomegaly (Z = 0.800, P = 0.427); mass (Z = 1.580, P = 0.1142); unremarkable (Z = 0.817, P = 0.4137); pleural effusion (Z = 0.347, P = 0.7286). A graphical representation of the classification results of the model is reported in Fig. 1.
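The reported P values appear consistent, up to rounding, with a two-sided test on a standard normal Z statistic. The following is a minimal sketch of that conversion using only the Python standard library; it is not the MedCalc implementation, and the names `two_sided_p` and `reported_z` are hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal Z statistic."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z scores reported in the text for ResNet-50 vs DenseNet-121 on Data Set 2.
reported_z = {
    "alveolar pattern": 3.813,
    "interstitial pattern": 3.283,
    "megaoesophagus": 2.257,
    "pneumothorax": 3.314,
    "cardiomegaly": 0.800,
    "mass": 1.580,
    "unremarkable": 0.817,
    "pleural effusion": 0.347,
}

ALPHA = 0.05  # significance level stated in the statistical analysis

p_values = {finding: two_sided_p(z) for finding, z in reported_z.items()}
significant = sorted(f for f, p in p_values.items() if p < ALPHA)

for finding, p in p_values.items():
    print(f"{finding}: Z = {reported_z[finding]:.3f}, P = {p:.4f}")
```

Under this assumption, the four findings flagged as significant in the text are the ones whose recomputed P value falls below 0.05.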

Table 2 Performance of ResNet-50 on Data Set 1 and Data Set 2. Parentheses show 95% CIs.
Table 3 Performance of DenseNet-121 on Data Set 1 and Data Set 2. Parentheses show 95% CIs.
Figure 1

Visual assessment of the ResNet-50 classification results of a radiograph of a dog showing an alveolar pattern in the cranial lung lobe. The activations of the last layer are visualized superimposed on the radiographs. Each image corresponds to the activations for a specific radiographic finding. The alveolar pattern was correctly identified by the model (B); however, the model also falsely identified the presence of a mass (E). (A) Original image, (B) alveolar pattern, (C) bronchial pattern, (D) cardiomegaly, (E) mass, (F) interstitial pattern, (G) pleural effusion, (H) pneumothorax, (I) unremarkable.


A new, deep learning-based, multi-label classification method for the automatic detection of several radiographic findings in canine thoracic radiographs is proposed. The high classification accuracy shown by both tested architectures on Data Set 2, for almost all the radiographic findings, suggests that multi-label CNNs can be trained successfully even on relatively small and highly unbalanced databases. On the other hand, the differences in the classification of several radiographic findings between the veterinary and the human medical literature make comparison with similar studies21,25 not entirely straightforward. Moreover, some of the radiographic findings that are common in humans (e.g. emphysema, fibrosis) are rarely found in dogs. Nonetheless, a direct comparison between human and veterinary studies is feasible for some radiographic findings, such as cardiomegaly, pleural effusion, pneumothorax, consolidation (labelled “alveolar pattern” in this study) and unremarkable21,25. Interestingly, for all the above-mentioned radiographic findings, the AUC of the developed CNN was similar to or higher than that reported in similar studies on humans21,25, both for Data Set 1 and for Data Set 2.

Another interesting aspect of this research is related to the large variability in body size and body shape typical of the dog, which directly translates into a wide range of normality in the radiographic appearance of the canine thorax. Indeed, the dog is the only known species that has a 50-fold variability in dimensions among individuals. Therefore, the radiographic appearance of the thorax of, for example, a bulldog, a dachshund, or a German shepherd differs markedly. Despite such variability, the developed CNN was able to detect most of the radiographic findings included in the training with an accuracy ranging from moderate to very good. In particular, ResNet-50 displayed an AUC above 0.8 in the detection of alveolar pattern, cardiomegaly, megaoesophagus, pleural effusion, and pneumothorax. In addition, it showed high accuracy in identifying normal radiographs (labelled “unremarkable”). Interestingly, in similar experiments in humans the accuracy in identifying radiologically normal images was lower25. Conversely, accuracy was lower than 0.8 for bronchial pattern, interstitial pattern and mass. It is the authors’ opinion that the limited generalization ability shown by ResNet-50 in the detection of bronchial and interstitial patterns might be related to the difference in image quality of the original DICOM images between Data Set 1 and Data Set 2. In fact, the radiographs acquired using the computed radiography (CR) system had a lower image quality than those acquired with the digital radiography (DR) system. Another possible explanation is that bronchial and interstitial patterns were not assessed on ventro-dorsal (VD) images. On the other hand, the low accuracy in the detection of masses could be related to the inability of the network to consider orthogonal views simultaneously.
The low accuracy in detecting masses shown by ResNet-50 and DenseNet-121, both on Data Set 1 and Data Set 2, is probably related to the fact that several mass-like structures (for example nipples, degeneration of the costochondral joints in older animals, pleural mineralizations) are often present in normal radiographs. Interestingly, accuracy in detecting masses and nodules in humans was also low (AUC below 0.8) in the experiments by Wang et al. (2017)24 and Yao et al. (2018)26. The developed CNN had variable performance in the detection of the different lesions and, therefore, results obtained with the current version of the CNN should be confirmed with other methods (e.g. interpretation by a radiologist, computed tomography, magnetic resonance imaging) before clinical decisions are made on the basis of those results.

ResNet-50 and DenseNet-121 are the two most commonly used pre-trained CNNs for multi-label chest X-ray image classification21,24,26. In this study, ResNet-50 showed a significantly higher generalization ability than DenseNet-121 in the detection of alveolar pattern, interstitial pattern, megaoesophagus, and pneumothorax, whereas no significant differences were evident for cardiomegaly, mass, unremarkable and pleural effusion. In previous human studies, these two network architectures demonstrated variable accuracy in the detection of radiographic lesions, with ResNet-50 performing better than DenseNet-121 for some lesions and vice versa21. Furthermore, in some studies, both ResNet-50 and DenseNet-121 were used as backbones for category-wise, residual operations, and attention-based mechanisms21. Incorporating these modules within the network is reported to increase the average AUC21. These modules were not included in the present study, mainly because of the limited data set size and the highly imbalanced lesion distribution.

Models trained on a specific data set do not always achieve comparable performance when tested on data sets from a different institution. Accuracy increases if data sets acquired from multiple institutions are used for training27. A limitation of this study is that both data sets were acquired at the same institution and a data set from an external veterinary clinic was not available. However, to take between-center generalization into account, Data Set 1 and Data Set 2 (used respectively for training and testing) were acquired using two different radiograph acquisition systems. Further studies, possibly including radiographs acquired at multiple veterinary clinics, could help clarify the generalization performance of the developed CNN. Furthermore, it is also possible that the exclusion of incorrectly positioned and exposed radiographs from both the training and the test set might have biased the classification accuracy towards more favorable results. The possibility of automatically detecting positioning or exposure abnormalities has not yet been explored.

Another limitation of the present study is that the radiographic findings included in the training set do not, of course, fully represent all the lesion types that might occur in thoracic radiographs in dogs. Furthermore, due to the limited number of available cases, radiographs showing the least represented radiographic findings (tracheal collapse, hernia, fracture, and pneumomediastinum) were not included in the training. For the above reasons, the real “in-field” generalization ability of the developed CNN has yet to be fully tested.

The developed CNN is intended to assist veterinary clinicians, both general practitioners and radiology specialists, in their daily work. It is the authors’ opinion that the use of deep learning-based tools during routine clinical activity will increase productivity while decreasing the error rate. Generally speaking, veterinary facilities are smaller than human hospitals and the global number of veterinary specialists in all disciplines is significantly lower than the global number of specialist doctors. Therefore, veterinary general practitioners are required to develop expertise in several different fields of medicine, such as radiology, surgery, internal medicine, pathology, and so on. It is the authors’ opinion that, in such a scenario, veterinarians could greatly benefit from the use of deep learning-based tools to assist them in their clinical routine. Indeed, several application cases for these algorithms have been proposed and analysed in the human medical literature. For instance, the use of deep learning-based algorithms is reported to increase accuracy in the detection of pulmonary nodules by skilled radiologists9, or to decrease the average reporting delay in a clinical setting15. The possible impact of CNN use in the veterinary medical field has not yet been evaluated.


Database creation

Radiographic findings

All the images were reviewed by three experienced veterinary radiologists (AZ, TB and SB, with more than 20, 10 and 3 years’ experience, respectively). Before interpretation, image quality was assessed and, in particular, radiograph exposure and patient positioning were evaluated. Only properly exposed images with the animal positioned correctly were included in both data sets. Radiographs of immature dogs and images with evident artefacts (double exposure, dirt on the cassette, etc.) were also excluded. When available, both LL and VD radiographs of the same patient were reviewed simultaneously. The radiographs were classified strictly based on the presence or absence of individual radiographic findings and not on the presence or absence of pathologies (e.g. pneumonia) or conditions (e.g. oedema) that might be characterized by the simultaneous presence of several radiographic findings. All the radiographs were labelled according to the following radiographic findings: alveolar pattern, interstitial pattern, bronchial pattern, mass, cardiomegaly, pleural effusion, pneumothorax, hernia, megaoesophagus, fracture, pneumomediastinum, tracheal collapse. If no radiographic findings were evident, the image was classified as unremarkable. The distribution (focal vs. diffuse) of both alveolar and interstitial patterns was not considered. Interstitial and bronchial patterns were graded as mild, moderate, or severe. Mild bronchial and interstitial patterns were considered normal variations in the radiographic appearance of the canine thorax and were, therefore, not included in the training. If only mild bronchial and interstitial patterns were evident, the radiographs were classified as unremarkable. Both segmental and diffuse megaoesophagus were classified as megaoesophagus. The presence of cardiomegaly was assessed based on the authors’ experience.
In unclear cases, the vertebral heart score28 was calculated and compared with the breed-specific reference intervals reported in the literature. Mediastinal and thoracic wall masses were included in the mass label. Both diaphragmatic and abdominal wall hernias were classified as hernia. Likewise, fractures of both the ribs and the vertebral column were classified as fracture. Fractures of the long bones were not considered. No grading score was assigned to tracheal collapse. All the images were reviewed simultaneously by the three authors and all the labels were assigned following a consensus discussion.
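The labelling rules described above can be illustrated with a minimal sketch that turns the findings agreed for one radiograph into a multi-hot target vector. This is not the authors’ code: the names `TRAINING_LABELS`, `GRADED` and `encode_labels`, the label order, and the convention of passing an empty string for ungraded findings are all assumptions made for illustration. Findings excluded from training (hernia, fracture, pneumomediastinum, tracheal collapse) are rejected rather than mapped to a label.

```python
from typing import List, Mapping

# The nine findings used for training, in the order listed in the text
# (the order itself is an assumption of this sketch).
TRAINING_LABELS = [
    "unremarkable", "cardiomegaly", "alveolar pattern", "bronchial pattern",
    "interstitial pattern", "mass", "pleural effusion", "pneumothorax",
    "megaoesophagus",
]
# Findings that were graded as mild, moderate, or severe.
GRADED = {"bronchial pattern", "interstitial pattern"}


def encode_labels(findings: Mapping[str, str]) -> List[int]:
    """Map {finding: grade} to a multi-hot vector over TRAINING_LABELS.

    The grade is "mild", "moderate" or "severe" for graded findings and
    an empty string for all other findings.
    """
    allowed = set(TRAINING_LABELS) - {"unremarkable"}
    unknown = set(findings) - allowed
    if unknown:
        # Radiographs with excluded findings are outside this sketch.
        raise ValueError(f"findings not used for training: {sorted(unknown)}")
    # Mild bronchial and interstitial patterns count as normal variation.
    kept = {
        name for name, grade in findings.items()
        if not (name in GRADED and grade == "mild")
    }
    # With nothing left, the radiograph is labelled unremarkable.
    if not kept:
        kept = {"unremarkable"}
    return [1 if label in kept else 0 for label in TRAINING_LABELS]


print(encode_labels({"cardiomegaly": "", "interstitial pattern": "severe"}))
print(encode_labels({"bronchial pattern": "mild"}))
```

In this encoding, “unremarkable” is mutually exclusive with every other label, whereas any combination of the remaining eight findings can be active on the same image.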

Image processing and deep learning

The deep-learning analysis was performed on a dedicated workstation (Linux operating system, Ubuntu 18.04, Canonical) equipped with four graphics processing units (Tesla V100; NVIDIA), a 2.2 GHz processor (Intel Xeon E5-2698 v4; Intel) and 256 GB random-access memory. Before being fed to the CNN, the images were downsampled to 224 × 224 pixels. The images were not cropped during the test phase, nor were they lossy-compressed or converted to JPEG; instead, the lossless MHA format was used. Radiograph classification was performed using convolutional neural networks (CNNs), a special class of deep-learning algorithms specifically designed to work with images. Two different CNN architectures were used: (1) DenseNet-12129, (2) ResNet-5030. The tested CNN architectures were pre-trained on a large-scale data set of everyday images called ImageNet and then fine-tuned. Different radiographic findings are usually evident on the same radiograph, often as a result of a single condition or pathology, and, therefore, a multi-label approach was used. Binary cross-entropy was used as the objective function. The same training parameters were used for all the networks. Training was performed until convergence using the Adam optimizer and a learning-rate scheduler with exponential decay. The weights from the epoch with the lowest loss on the validation set were chosen and used for testing. The training set was augmented by random horizontal/vertical flips, cropping, affine warping, and linear contrast changes. All the images were normalized to the 0-1 range, where 0 denotes the background. The split ratio for training, validation, and test set (for Data Set 1) was 8:1:1, respectively. The training scheme did not directly optimize any of the evaluation metrics (e.g. AUC, sensitivity, or specificity). No information from Data Set 2 was used during the training.
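To make the objective function concrete, the following is a minimal, framework-free sketch of a multi-label binary cross-entropy and of an exponentially decaying learning rate. It is an illustration under stated assumptions, not the training code used in this study: the function names are hypothetical, and the initial learning rate and decay factor shown are placeholder values, since neither is reported in the text.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits: Sequence[float], targets: Sequence[int],
                   eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over independent per-finding outputs.

    Each finding has its own sigmoid output, so several findings can be
    predicted as present on the same radiograph.
    """
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for logit, target in zip(logits, targets):
        # Clamp the probability to avoid log(0).
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(logits)


def exponential_lr(initial_lr: float, gamma: float, epoch: int) -> float:
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_lr * gamma ** epoch


# Placeholder values: the paper does not report the initial rate or gamma.
for epoch in range(3):
    print(epoch, exponential_lr(initial_lr=1e-3, gamma=0.9, epoch=epoch))

# Uninformative outputs (all logits 0) give a loss of ln 2 per label.
print(multilabel_bce([0.0, 0.0, 0.0], [1, 0, 1]))
```

Because each label contributes its own term to the loss, the objective does not force the outputs to sum to one, which is what distinguishes this multi-label setting from single-label softmax classification.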

Statistical analysis

We assessed the individual architectures, both on Data Set 1 and Data Set 2, with the area under the receiver operating characteristic curve (AUC) using commercially available statistical software (MedCalc). Sensitivity was calculated as: true positive / (true positive + false negative), specificity as: true negative / (false positive + true negative), positive likelihood ratio (PLR) as: sensitivity / (1 − specificity) and negative likelihood ratio (NLR) as: (1 − sensitivity) / specificity. The performance of the two architectures was compared, on Data Set 2 only, with the DeLong test. The differences in the AUCs of the considered tests, as a result of the DeLong test, are expressed as Z scores. All P values were assessed at an alpha of 0.05.
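The definitions above can be written out as a short sketch. The code below is illustrative only and does not reproduce MedCalc: the function names and the example counts and scores are hypothetical, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation rather than by any method stated in the text.

```python
from typing import Dict, Sequence


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, PLR and NLR from confusion-matrix counts.

    Assumes at least one positive and one negative case.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    # The likelihood ratios are undefined when the denominator is zero;
    # infinity is returned in that case.
    plr = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    nlr = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PLR": plr, "NLR": nlr}


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the probability that a positive case scores higher than a
    negative case, with ties counted as one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
print(metrics)
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

The pairwise form makes the interpretation of the AUC explicit, but it is quadratic in the number of cases; rank-based implementations are preferable for large data sets.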


A multi-label CNN-based network for the automatic classification of canine LL radiographs was developed and tested. The developed network had a variable accuracy in the detection of radiographic findings in an external test set. Further studies, ideally including a larger number of radiographs acquired in several different veterinary institutions, could allow the development of a network with a broader generalization ability. Furthermore, a larger database could allow the network to be tested on VD images as well. CNN-based tools could, prospectively, assist veterinarians in their everyday work, allowing for higher-quality veterinary care. Nonetheless, for a successful application of these tools in the clinical workflow, the advantages and the pitfalls of such tools must be clearly known by the operator.