## Introduction

Thoracic radiographs are part of the routine clinical evaluation of patients with confirmed or suspected thoracic pathology in both human and veterinary medicine. Nevertheless, interpreting thoracic radiographs is a challenging and error-prone task for the medical doctor1,2 and for the veterinary practitioner alike3. In human medicine, despite efforts to improve radiology residents’ training programmes, the prevalence of interpretation errors has not decreased significantly in recent decades1,2. The prevalence and impact of interpretation errors on thoracic radiographs have only seldom been investigated in veterinary medicine4. Conversely, this topic has been widely studied in human medicine, and the most common causes of interpretation errors have been identified5,6,7. Different strategies to reduce interpretation errors have been proposed in both human1,8 and veterinary medicine3; among these is the use of computer-aided detection (CAD) tools to support the practitioner in everyday practice6,9.

The high performance shown by deep-learning algorithms in several radiology-related tasks has driven very active research in this field, with an increasing number of publications10. In particular, deep-learning algorithms for the detection of specific pathologies or conditions such as pneumothorax11, pneumonia12, malignant nodules13 and COVID-1914 have been proposed. In addition, broader applications of these algorithms, such as automatic triaging15 and automatic labelling of chest radiographs16, have been investigated. Furthermore, several artificial intelligence-based products for the automatic detection of specific conditions, on both plain radiographs and computed tomographic images, have been approved by the Food and Drug Administration in the last few years, thereafter becoming commercially available.

To date, the possibilities offered by deep learning in veterinary medicine have been investigated for the classification of magnetic resonance images17,18, for the detection of liver degeneration from ultrasound images19 and for the automatic classification of corneal lesions from photographs20. Multi-label algorithms allow for the detection of different objects (in our case, lesions) on the same image. In multi-label training, each image is annotated with multiple labels according to the lesions evident on the radiograph21. To the best of the authors’ knowledge, both in human11,12,22 and in veterinary medicine22,23, most studies applying CNNs to thoracic radiographs focus on detecting individual pathologies or conditions; studies using a multi-label approach are relatively scarce in the human medical literature16,21,24,25, and the potential of multi-label algorithms on canine thoracic radiographs has not been explored yet. Therefore, the aims of this study are: (1) to develop a multi-label deep learning-based network capable of detecting some of the most common lesions found on plain radiographs of the canine thorax; (2) to test the generalization ability of the developed algorithm on an external Data Set of radiographs.
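The multi-label setup described above can be illustrated with a minimal sketch: each finding is an independent yes/no decision, so the network produces one sigmoid probability per label and is trained with a per-label binary cross-entropy loss, allowing several findings to co-occur on the same radiograph. The plain-Python example below is illustrative only (the label subset and scores are hypothetical, not the study's implementation):

```python
import math

# Hypothetical subset of the findings used in this study, for illustration.
FINDINGS = ["cardiomegaly", "alveolar_pattern", "pleural_effusion"]

def sigmoid(x: float) -> float:
    """Map a raw network output (logit) to a per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_bce(logits, targets):
    """Binary cross-entropy averaged over labels.

    Unlike softmax classification, each label is scored independently,
    so multiple findings can be positive on a single image.
    """
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss / len(logits)

# Example: a radiograph annotated with cardiomegaly and pleural effusion.
targets = [1.0, 0.0, 1.0]
logits = [2.0, -1.5, 0.8]  # raw scores from a hypothetical CNN
probs = [sigmoid(z) for z in logits]
predicted = [f for f, p in zip(FINDINGS, probs) if p >= 0.5]
```

In this toy example the thresholded probabilities recover both annotated findings at once, which is exactly what a single-label (softmax) classifier cannot do.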

## Results

### Database

The complete database was composed of 3839 latero-lateral (LL) radiographs. Data Set 1 comprised 3063 LL images; 632 LL images were discarded due to incorrect positioning or poor image quality. Data Set 2 comprised 776 LL images; 77 LL radiographs were excluded because of positioning errors or poor image quality. In both data sets, “unremarkable” and “cardiomegaly” were the two most represented findings. The different radiographic findings were unevenly distributed between the two data sets, with some over-represented and some under-represented in Data Set 2 compared to Data Set 1.

### Selection of the radiographic findings

Only a limited number of radiographs showing tracheal collapse, hernia, fracture and pneumomediastinum were available in Data Set 1 (Table 1), and, therefore, these radiographic findings were excluded from training. Thus, the radiographic findings used to train the network were: unremarkable, cardiomegaly, alveolar pattern, bronchial pattern, interstitial pattern, mass, pleural effusion, pneumothorax and megaoesophagus.

### Classification results

ResNet-50 had a higher classification accuracy than DenseNet-121, on both Data Set 1 and Data Set 2, for all the considered radiographic findings except pleural effusion. The classification accuracy of the two architectures on Data Set 1 and Data Set 2 is reported in Tables 2 and 3. For some radiographic findings, the classification accuracy of both ResNet-50 and DenseNet-121 was higher on Data Set 2 than on Data Set 1. In particular, both architectures showed a higher accuracy on Data Set 2 than on Data Set 1 for alveolar pattern. Furthermore, DenseNet-121 showed higher accuracy on Data Set 2 than on Data Set 1 also for bronchial pattern, cardiomegaly, megaoesophagus, unremarkable and pneumothorax. For the remaining radiographic findings, accuracy on Data Set 2 was lower than on Data Set 1. Statistically significant differences in accuracy on Data Set 2 (generalization accuracy) between ResNet-50 and DenseNet-121 were evident for: (1) alveolar pattern (Z = 3.813, P = 0.0001); (2) interstitial pattern (Z = 3.283, P = 0.0010); (3) megaoesophagus (Z = 2.257, P = 0.0240); (4) pneumothorax (Z = 3.314, P = 0.0009). No differences were evident for: cardiomegaly (Z = 0.800, P = 0.427); mass (Z = 1.580, P = 0.1142); unremarkable (Z = 0.817, P = 0.4137); pleural effusion (Z = 0.347, P = 0.7286). A graphical representation of the classification results of the model is reported in Fig. 1.

## Discussion

A new deep learning-based multi-label classification method for the automatic detection of several radiographic findings in canine thoracic radiographs is proposed. The high classification accuracy shown by both tested architectures on Data Set 2, for almost all the radiographic findings, suggests that multi-label CNNs can be successfully trained even on relatively small and highly unbalanced databases. On the other hand, differences in how several radiographic findings are classified between the veterinary and the human medical literature make comparison with similar studies21,25 not entirely straightforward. Moreover, some radiographic findings that are common in humans (e.g. emphysema, fibrosis) are rarely found in dogs. Nonetheless, a direct comparison between human and veterinary examples is feasible for some radiographic findings, such as cardiomegaly, pleural effusion, pneumothorax, consolidation (labelled “alveolar pattern” in this study) and unremarkable21,25. Interestingly, for all the above-mentioned radiographic findings, the AUC of the developed CNN was similar to or higher than that reported in similar studies on humans21,25, both for Data Set 1 and for Data Set 2.

ResNet-50 and DenseNet-121 are the two most commonly used pre-trained CNNs for multi-label chest X-ray image classification21,24,26. In this study, ResNet-50 showed a significantly higher generalization ability than DenseNet-121 in the detection of alveolar pattern, interstitial pattern, megaoesophagus and pneumothorax, whereas no differences were evident for cardiomegaly, mass, unremarkable and pleural effusion. In previous human studies, these two network architectures demonstrated variable accuracy in the detection of radiographic lesions, with ResNet-50 performing better than DenseNet-121 for some lesions and vice versa21. Furthermore, in some studies, both ResNet-50 and DenseNet-121 were used as backbones for category-wise residual and attention-based mechanisms21. Incorporating such modules within the network is reported to increase the average AUC21. These modules were not included in the present study, mainly due to the limited data set size and the highly imbalanced lesion distribution.

Models trained on a specific data set do not always achieve comparable performance when tested on data sets from a different institution. Accuracy increases if data sets acquired from multiple institutions are used for training27. A limitation of this study is that both data sets were acquired at the same institution, and a data set from an external veterinary clinic was not available. However, in order to take inter-center generalization into account, Data Set 1 and Data Set 2 (used for training and testing, respectively) were acquired using two different radiograph acquisition systems. Further studies, possibly including radiographs acquired at multiple veterinary clinics, could help clarify the generalization performance of the developed CNN. Furthermore, it is also possible that the exclusion of incorrectly positioned and exposed radiographs from both the training and the test set might have biased the classification accuracy towards more favorable results. The possibility of automatically detecting positioning or exposure abnormalities has not been explored yet.

Another limitation of the present study is that the radiographic findings included in the training set do not, of course, fully represent all the lesion types that might occur in canine thoracic radiographs. Furthermore, due to the limited number of available cases, radiographs showing the least represented radiographic findings (tracheal collapse, hernia, fracture and pneumomediastinum) were not included in the training. For the above reasons, the real “in-field” generalization ability of the developed CNN has yet to be fully tested.

The developed CNN is intended to assist veterinary clinicians, both general practitioners and radiology specialists, in their daily work. It is the authors’ opinion that the use of deep learning-based tools during routine clinical activity will increase productivity while decreasing the error rate. Generally speaking, veterinary facilities are smaller than human hospitals, and the global number of veterinary specialists across all disciplines is significantly lower than that of specialist medical doctors. Therefore, veterinary general practitioners are required to develop expertise in several different fields of medicine, such as radiology, surgery, internal medicine and pathology. In such a scenario, veterinarians could greatly benefit from deep learning-based tools to assist them in their clinical routine. Indeed, several application cases for these algorithms have been proposed and analysed in the human medical literature. For instance, the use of deep learning-based algorithms is reported to increase the accuracy of skilled radiologists in the detection of pulmonary nodules9, and to decrease the average reporting delay in a clinical setting15. The possible impact of CNN use in the veterinary medical field has not been evaluated yet.

## Methods

### Statistical analysis

We assessed the individual architectures, on both Data Set 1 and Data Set 2, with the area under the receiver operating characteristic curve (AUC), using commercially available statistical software (MedCalc). Sensitivity was calculated as: true positives/(true positives + false negatives); specificity as: true negatives/(false positives + true negatives); the positive likelihood ratio (PLR) as: sensitivity/(1 − specificity); and the negative likelihood ratio (NLR) as: (1 − sensitivity)/specificity. The performances of the two architectures were compared, on Data Set 2 only, with the DeLong test. The differences in the AUCs of the considered tests, as computed by the DeLong test, are expressed as Z scores. All p-values were assessed at an alpha of 0.05.
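The metrics above follow directly from the confusion-matrix counts. As a plain-Python illustration of the stated formulas (not the MedCalc computation used in the study; the counts are made up for the example):

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, PLR and NLR as defined in the text."""
    sensitivity = tp / (tp + fn)            # true positives / (TP + FN)
    specificity = tn / (fp + tn)            # true negatives / (FP + TN)
    plr = sensitivity / (1 - specificity)   # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PLR": plr, "NLR": nlr}

# Hypothetical example: 90 TP, 10 FN, 80 TN, 20 FP.
m = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
# sensitivity = 0.9, specificity = 0.8, PLR = 4.5, NLR = 0.125
```

A PLR well above 1 and an NLR close to 0, as in this toy example, indicate a classifier whose positive and negative calls both meaningfully shift the post-test probability of the finding.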