Introduction

Gout is a common metabolic disorder characterized by chronic hyperuricemia, i.e., the deposition of monosodium urate monohydrate crystals (tophi) in soft tissue, triggering intense and acute arthritis with intolerable pain as well as articular and periarticular inflammation1. Note that the prevalence of gout (< 1–6.8%) and the incidence of gout (0.58–2.89 per 1000 person-years) tend to vary widely due to the population studied and methods employed2. Tophi deposits are generally found at the 1st metatarsophalangeal (MTP) joints; however, they are occasionally found in the ankles and knees. Tophi has also been shown to promote chronic inflammation and erosive arthritis3. The gold standard for the diagnosis of gout is arthrocentesis4; however, imaging is increasingly being used to confirm the diagnosis due to its noninvasiveness5. The American College of Rheumatology (ACR) and the European League Against Rheumatism (EULAR) have collaboratively approved classification criteria for gout. Gout classification criteria in ACR/EULAR 2015 include clinical, laboratory, and imaging findings (Table 1), where cases of gout are indicated by a threshold score of ≥ 86. Note however that some imaging-related findings, such as a double contour (DC) sign in ultrasound (Fig. 1A), urate in dual-energy computed tomography (DECT) (Fig. 1B), or radiographic gout-related erosion (Fig. 1C), generate a score of up to 4. Clearly, the diagnosis of gout depends heavily on imaging findings. Imaging has also been recognized as a useful tool to assess the outcomes of urate-lowering therapy7. Thiele et al. reported a strong correlation between DC characteristics and serum uric acid levels8. The Outcome Measures in Rheumatology (OMERACT) Gout Working Group has provided guidance for researchers on the means by which to assess the effects of treatment via imaging9. Imaging modalities have demonstrated good efficacy in detecting changes in urate deposition, joint inflammation, and bone erosions10. A substantial reduction in the size of tophi in ultrasound images is associated with a lower probability of relapse after terminating gout prophylaxis. Ultrasound follow-up can also be useful for managing urate-lowering therapy and gout flare prophylaxis11,12,13,14,15. DECT is the most accurate diagnostic imaging test; however, the procedure is expensive and exposes the patient to high levels of radiation16. Although musculoskeletal ultrasound is relatively non-invasive and inexpensive17, the reliability depends on expert experience. Studies have demonstrated that ultrasound offers good diagnostic accuracy with high specificity and is likely beneficial in the diagnosis of gout18. Researchers are increasingly looking to artificial intelligence (AI) to facilitate the interpretation of ultrasound images for diagnostic applications19; however, there has been relatively little research on the application of AI to the identification of musculoskeletal phenomena in ultrasound images. Our review of the literature pertaining to this issue revealed studies on myositis20, dysplasia of the neonatal hip21, synovitis22, lumbar ultrasound image feature extraction23, and nerve identification24,25. In the current study, we sought to use AI to identify the indicators of gout with a focus on detecting tophi at the 1st MTP joints.

Table 1 Definition and special considerations for each domain included in 2015 Gout Classification criteria.
Figure 1
figure 1

Imaging features included in the classification criteria: (A) Transverse ultrasonographic image showing double-contour in femoral articular cartilage. The image features hyperechoic enhancement over the surface of the hyaline cartilage; (B) Dual-energy computed tomography scan showing urate deposition at the right first metatarsophalangeal joint; (C) Conventional radiographic image showing signs of erosion in the first metatarsophalangeal joint, as indicated by a cortical break with sclerotic margin and overhanging edge.

Materials and methods

Subjects

This study enrolled patients who were diagnosed with gout in accordance with 2015 ACR/EULAR gout classification criteria during the period from December 5th, 2019, to May 28th, 2020. The study was approved by Taipei Medical University Hospital Institute Ethics Committee (N202204073). All methods were performed in accordance with the relevant guidelines and regulations.

Ultrasound imaging protocol

All images were obtained using a LOGIQ E9 ultrasound machine and ML6-15 ultrasound scanner by a single rheumatologist with 7 years of experience. The scanner was placed on the dorsal and lateral sides of the bilateral 1st MTP joints. A total of four images from each patient were used for training and evaluation.

Region of interest (ROI) cropping and definition

The ROI in the images was identified using Microsoft Paint. All ROIs were labeled by the same rheumatologist who performed the scanning. Note that the size of the ROI was not stipulated. Rather, we focused on the bilateral 1st MTP joints and tophi. The blue square and the red label in Fig. 2 respectively indicate the ROI and tophi. The hyperechoic nature of tophi (and occasional calcification) generally results in a heterogenous appearance in scans26. Figure 3 presents a schematic diagram showing the workflow employed for training and testing. After ROI cropping, the images were split into a training and test set. Data augmentation based on histogram equalization and horizontal flipping was used to increase the number of images in the training set.

Figure 2
figure 2

Ultrasound images of (A) left 1st metatarsophalangeal joint lateral side with tophi and (B) right 1st metatarsophalangeal joint lateral side without tophi in single gout patient. The red curve indicates the boundary of the tophi region. The blue rectangle frame indicates the region of interest drawn by a rheumatologist for subsequent transfer learning.

Figure 3
figure 3

Schematic diagram showing the workflow of training and testing. After cropping the region of interest in 375 ultrasound images, the images were assigned to training or test datasets. Data augmentation based on histogram equalization and horizontal flipping was used to increase the number of images in the training dataset. Transfer learning was performed using three pre-trained models with fivefold cross-validation prior to the analysis of diagnostic performance using the test set.

Study design

The diagnoses of all tophi in this study were subject to confirmation by a rheumatologist, such that training was implemented as supervised learning. Transfer learning and the fine-tuning of hyperparameters were implemented on three pre-trained models: InceptionV3, ResNet101, and VGG19. Note that the classification accuracy of these models has been demonstrated in the ImageNet Large-Scale Visual Recognition Challenge. The MATLAB 2021a platform was used for the re-training of the three models specifically for the classification of tophi versus non-tophi in ultrasound images. The size of the input images was adjusted in accordance with model settings. Stochastic gradient descent with momentum was applied as the solver. The maximum number of epochs was as follows: InceptionV3 (20), ResNet101 (20), and VGG19 (30). The learning rate was as follows: InceptionV3 (0.0001), ResNet101 (0.0001), and VGG19 (0.0001). Five-fold cross-validation was used to ensure the stability of the results.

Evaluation metrics

In estimating the diagnostic performance of physicians, the images were classified as tophi and non-tophi according to sonographic patterns. Model evaluation was based on several evaluation metrics, including accuracy, recall (sensitivity), precision (positive predictive value, PPV), F1-Score, receiver operating characteristic (ROC) curve, and area under the curve (AUC). The evaluation metrics were calculated based on the calculation of true negative (TN), true positive (TP), false negative (FN), and false positive (FP). TP and TN were respectively defined as the number of positive and negative cases that were successfully classified. FP and FN were respectively defined as the number of misclassified negative and positive cases. Accuracy was calculated as the ratio of correct predicted classes to the total number of samples evaluated. Recall was used to compute the fraction of positive patterns that were correctly classified. Precision was used to compute the positive patterns that were correctly predicted based on all predicted patterns in a positive class. F1-score indicated the harmonic average between recall and precision rates.

Results

Study population

Our initial recruitment included 111 patients. From this group, 15 patients were excluded due to unclear imaging data. This left 96 patients who met the enrollment criteria for this study, including those with tophi (n = 47) and those without tophi (n = 49). The initial plan was to use 4 images from each patient, for a total of 384 images. However, the exclusion of 9 images due to a lack of clarity resulted in 375 images, including 73 images with tophi and 302 images without tophi.

Demographics

The patients were divided into a tophi group and a non-tophi group. Patient ages ranged from 22 to 87 years old. The mean age of patients was 52 years old (male: 52 years old, female: 64.5 years old). The ratio of male: female was 92:4. The mean uric acid level was 7.04 mg/dl (male: 7.08 mg/dl, female: 6.1 mg/dl).

Performance assessment of AI models

After data augmentation, the training set included 174 images in the tophi group and 241 in the non-tophi group. Following the completion of transfer learning, a test set was used to assess the performance of the models and obtain a confusion matrix (Fig. 4). Table 2 presents the accuracy of the models in terms of diagnostic performance, as follows: InceptionV3 (0.871 ± 0.020), ResNet101 (0.913 ± 0.015), and VGG19 (0.918 ± 0.020). Recall performance was as follows: InceptionV3 (0.507 ± 0.060), ResNet101 (0.680 ± 0.056), and VGG19 (0.747 ± 0.056). Precision was as follows: InceptionV3 (0.767 ± 0.091), ResNet101 (0.863 ± 0.098), and VGG19 (0.825 ± 0.062). AUC was as follows: InceptionV3 (0.925 ± 0.011), ResNet101 (0.966 ± 0.007), and VGG19 (0.967 ± 0.008).

Figure 4
figure 4

Confusion matrix of models when applied to the test set: (a) InceptionV3, (b) ResNet101, and (c) VGG19.

Table 2 Performance evaluation of models in test dataset.

Discussion

Neural networks comprise an input layer, a hidden layer for calculations, and an output layer to update the parameters of the hidden layer and train weights. The neurons in each layer are laid out within a fully connected structure. In 1988, Lecun et al. proposed the LeNet network architecture27, which laid the foundation for convolutional neural networks (CNNs). Their architecture included a convolution layer, activation function layer, pooling layer, flatten layer, and fully connected layer. A convolution kernel or filter slides across the image to extract features with the length and width of the image controlled by stride and padding. An activation function after the convolutional layer is used to derive a feature map via nonlinear transformation. A pooling layer is used to reduce the volume of image-related information by retaining only the information deemed important. The original information is then subjected to max pooling or mean pooling via a pooling filter, after which a feature map is created in the form of a 2D image. Finally, the feature map is expanded into a 1D array prior to insertion into the fully connected layer.

Currently dominating various computer vision tasks, CNNs are a class of artificial neural network that has been attracting immense interest across a variety of domains, including radiology. A CNN is designed to automatically learn and adapt spatial hierarchies of features through backpropagation using multiple building blocks, such as convolution layers, pooling layers, and fully connected layers28. CNNs are now widely utilized for the identification of anatomical structures in medical ultrasound image analysis29. In the rheumatology field, several studies have used CNNs for the measurement and estimation of metacarpal-head cartilage30,31. Visual observation and semi-quantitative analysis are commonly used in the interpretation of medical ultrasound images; however, visual analysis is subjective and reviewer-dependent while semi-quantitative methods are operator-dependent. Visual evaluation involves the careful examination of ultrasound images covering the ROIs to compare lesions with surrounding normal tissue. However, the results are easily affected by ambient lighting, screen brightness and contrast, eyestrain, and the clinical experience of the observer. Semi-quantitative methods involve converting the intensity of ROIs bound by radiotracers into a numeric value representing the characteristics of that specific lesion. Ideally, the results should be reproducible and strongly correlated with the results of visual analysis. Researchers have recommended the establishment of databases by which to derive observer-independent imaging protocols.

In patients with chronic gout, serum urate levels and the rate of urate deposition are primary outcome measures32. The gout working group at the OMERACT in 2014 reported that measuring whole-body urate deposition levels is not a feasible option. By contrast, it is possible to quantify tophi within a representative area (e.g., bilateral 1st MTP joints) or a predetermined set of joints33. Ultrasound and DECT are the methods best suited to measuring urate deposition9; however, DECT exposes patients to unacceptable levels of radiation and is available in few research centers. Ultrasound can also be used for the quantification of tophi deposition in superficial areas, such as the 1st MTP joints. It is inexpensive, non-intrusive, and widely available; however, the results are highly dependent on the specifics of the scanner and operator experience. Our objective in this study was to apply an AI learning system to the identification of tophi in ultrasound images. Our review of the literature revealed that this was the first-ever study to use transfer learning in the development of an AI system for the interpretation of ultrasound images aimed at identifying tophi in musculoskeletal tissue.

It is important to note that the properties of ultrasound images vary with the operator, scanner, and patient34. Our study also has some limitations. All of the images used in the current study were captured by the same experienced rheumatologist using the same ultrasound machine and scanner. Although this helped to ensure consistent image quality, it limited the diversity of images in the database.

Conclusions

In our study, we applied an AI learning system to the identification of tophi in ultrasound images. Our results demonstrate that it is possible to re-train deep convolutional neural networks to identify the patterns of tophi in ultrasound images with a high degree of accuracy.