Introduction

Acute and chronic nonhealing wounds represent a heavy burden to healthcare systems, affecting millions of patients around the world1. In the United States, Medicare cost projections for all wounds are estimated to be between $28.1B and $96.8B2. Unlike acute wounds, chronic wounds fail to progress predictably through the phases of healing in an orderly and timely fashion and thus require hospitalization and additional treatment, adding billions in cost to health services annually3. The shortage of well-trained wound care clinicians in primary and rural healthcare settings reduces access to and quality of care for millions of Americans. Accurate measurement of the wound area is critical to the evaluation and management of chronic wounds, both to monitor the wound healing trajectory and to determine future interventions. However, manual measurement is time-consuming and often inaccurate, which can negatively affect patient care. Wound segmentation from images is a popular solution to these problems: it not only automates the measurement of the wound area but also allows efficient data entry into the electronic medical record to enhance patient care.

Related studies on wound segmentation can be roughly categorized into two groups: traditional computer vision methods and deep learning methods. Studies in the first group focus on combining computer vision techniques and traditional machine learning approaches. These studies apply manually designed feature extraction to build a dataset that is later used to support machine learning algorithms. Song et al. described 49 features that are extracted from a wound image using K-means clustering, edge detection, thresholding, and region growing in both grayscale and RGB4. These features are filtered and assembled into a feature vector that is used to train a Multi-Layer Perceptron (MLP) and a Radial Basis Function (RBF) neural network to identify the region of a chronic wound. Ahmad et al. proposed generating a Red-Yellow-Black-White (RYKW) probability map of an input image with a modified hue-saturation-value (HSV) model5. This map then guides the segmentation process using either optimal thresholding or region growing. Hettiarachchi et al. demonstrated an energy-minimizing discrete dynamic contour algorithm applied to the saturation plane of the image in its HSV color model6. The wound area is then calculated from a flood fill inside the enclosed contour. Hani et al. proposed applying an Independent Component Analysis (ICA) algorithm to pre-processed RGB images to generate hemoglobin-based images, which are used as input to K-means clustering to segment the granulation tissue from the wound images7. These segmented areas are utilized as an assessment of the early stages of ulcer healing by detecting the growth of granulation tissue on the ulcer bed. Wantanajittikul et al. proposed a similar system to segment burn wounds from images8. Cr-transformation and Luv-transformation are applied to the input images to remove the background and highlight the wound region. The transformed images are segmented with a pixel-wise Fuzzy C-means Clustering (FCM) algorithm. These methods suffer from at least one of the following limitations: (1) as in many computer vision systems, the hand-crafted features are affected by skin pigmentation, illumination, and image resolution; (2) they depend on manually tuned parameters and empirically handcrafted features, which does not guarantee an optimal result, and they are not robust to severe pathologies and rare cases, which is impractical from a clinical perspective; and (3) the performance is evaluated on small, biased datasets.

Since the success of AlexNet9 in the 2012 ImageNet Large Scale Visual Recognition Challenge10, the application of deep learning11 in the domain of computer vision has sparked interest in semantic segmentation12 using deep convolutional neural networks (CNNs)13. Typically, traditional machine learning and computer vision methods make decisions based on feature extraction. To segment the region of interest, one must guess a set of important features and then handcraft sophisticated algorithms that capture these features14. A CNN, in contrast, integrates feature extraction and decision making: the convolutional kernels extract the features, and their importance is determined during training of the network. In a typical CNN architecture, the input is processed by a sequence of convolutional layers and the output is generated by a fully connected layer that requires fixed-size input. One successful variant of the CNN is the fully convolutional neural network (FCN)15. An FCN is composed of convolutional layers without a fully connected layer as the output layer. This allows arbitrary input sizes and prevents the loss of spatial information caused by the fully connected layers in CNNs. Several FCN-based methods have been proposed to solve the wound segmentation problem. For example, Wang et al. estimated the wound area by segmenting wounds16 with the vanilla FCN architecture15. With time-series data consisting of the estimated wound areas and corresponding images, wound healing progress is predicted using a Gaussian process regression model. However, the mean Dice accuracy of the segmentation was evaluated to be only 64.2%. Goyal et al. proposed employing the FCN-16 architecture on wound images in a pixel-wise manner, predicting the class to which each pixel of an image belongs17. The segmentation result is simply derived from the pixels classified as wound. By testing different FCN architectures they were able to achieve a Dice coefficient of 79.4% on their dataset. However, the network's segmentation accuracy is limited in distinguishing small wounds and wounds with irregular borders, as the tendency is to draw smooth contours. Liu et al. proposed a new FCN architecture that replaces the decoder of the vanilla FCN with skip-layer concatenation upsampled with bilinear interpolation18. A pixel-wise softmax layer is appended to the end of the network to produce a probability map, which is post-processed to produce the final segmentation. A Dice accuracy of 91.6% is achieved on their dataset of 950 images taken under uncontrolled lighting with complex backgrounds. However, images in their dataset are semi-automatically annotated using a watershed algorithm, which means that the deep learning model learns how the watershed algorithm labels wounds rather than how human specialists do.

To better explore the capacity of deep learning on the wound segmentation problem, we propose an efficient and accurate framework to automatically segment wound regions. The segmentation network of this framework is built on MobileNetV219. This network is lightweight and computationally efficient since significantly fewer parameters are used during the training process.

Our contributions can be summarized as follows:

  1. We build a large dataset of wound images with segmentation annotations done by wound specialists. To the best of our knowledge, this is by far the largest dataset focused on wound segmentation.

  2. We propose a fully automatic wound segmentation framework based on MobileNetV2 that balances computational efficiency and accuracy.

  3. Our proposed framework shows high efficiency and accuracy in wound image segmentation.

Dataset

Dataset construction

There is currently no public dataset large enough for training deep-learning-based models for wound segmentation. To explore the effectiveness of wound segmentation using deep learning models, we collaborated with the Advancing the Zenith of Healthcare (AZH) Wound and Vascular Center, Milwaukee, WI. Our chronic wound dataset was collected over two years at the center and includes 1109 foot ulcer images taken from 889 patients during multiple clinical visits. The raw images were taken with a Canon SX 620 HS digital camera and an iPad Pro under uncontrolled illumination conditions and with various backgrounds. Figure 1 shows some sample images from our dataset.

Figure 1

An illustration of images in our dataset. The first row contains the raw images collected. The second row shows the segmentation mask annotations created with the AZH Wound and Vascular Center.

The raw images collected are of various sizes and cannot be fed into our deep learning model directly, since the model requires fixed-size input images. To unify the size of images in our dataset, we first localize the wound by placing bounding boxes around it using an object localization model, YOLOv320, that we trained de novo. Our localization dataset contains 1010 images, also collected from the AZH Wound and Vascular Center. We augmented the images and built a training set containing 3645 images and a testing set containing 405 images. We used LabelImg21 to manually label all the data (both training and testing) in the YOLO format. The model was trained with a batch size of 8 for 273 epochs. With an intersection over union (IoU) threshold of 0.5 and non-maximum suppression of 1.00, we obtain a mean Average Precision (mAP) of 0.939. In the next step, image patches are cropped based on the bounding boxes produced by the localization model. We unify the image size (224 × 224 pixels) by applying zero-padding to these patches, which are regarded as the data points of our dataset. We confirm that the data collected were de-identified and handled in accordance with relevant guidelines and regulations, and that patient informed consent was waived by the institutional review board of the University of Wisconsin-Milwaukee.
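As an illustration of this cropping and padding step, below is a minimal Python/NumPy sketch (not the authors' code) that crops a patch from a detected bounding box, assumed here to be given in pixel coordinates as (x_min, y_min, x_max, y_max), and centers it on a 224 × 224 zero canvas. It assumes the localized wound patch already fits within the target size.

```python
import numpy as np

def crop_and_pad(image, box, target=224):
    """Crop a wound patch from `image` using a pixel-space bounding box
    (x_min, y_min, x_max, y_max) and center it on a target x target zero canvas.
    Assumes the localized patch already fits within the target size."""
    x_min, y_min, x_max, y_max = box
    patch = image[y_min:y_max, x_min:x_max]

    padded = np.zeros((target, target, image.shape[2]), dtype=image.dtype)
    h, w = patch.shape[:2]
    top, left = (target - h) // 2, (target - w) // 2   # center the patch
    padded[top:top + h, left:left + w] = patch
    return padded
```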

Data annotation

During training, a deep learning model learns the annotations of the training dataset; thus, the quality of the annotations is essential. Automatic annotation generated with computer vision algorithms is not ideal when deep learning models are meant to learn how human experts recognize the wound region. In our dataset, the images were manually annotated with segmentation masks that were further reviewed and verified by wound care specialists from the collaborating wound clinic. Initially, only foot ulcer images were annotated and included in the dataset, as these wounds tend to be smaller than other types of chronic wounds, which makes it easier and less time-consuming to manually annotate the pixel-wise segmentation masks. In the future, we plan to create larger image libraries that include all types of chronic wounds, such as venous leg ulcers, pressure ulcers, and surgical wounds, as well as non-wound reference images. The AZH Wound and Vascular Center, Milwaukee, WI, has consented to making our dataset publicly available.

Methods

In this section we describe our method, including the architecture of the deep learning model for wound segmentation, the transfer learning used during training, and the post-processing methods of hole filling and removal of small noise regions. We confirm that this research was approved by the institutional review board of the University of Wisconsin-Milwaukee.

Pre-processing

Besides the cropping and zero-padding discussed in the dataset construction section, standard data augmentation techniques are applied to our dataset before it is fed into the deep learning model. These image transformations include random rotations in the range of −25 to +25 degrees, random left–right and top-down flips with a probability of 0.5, and random zooming within 80% of the original image area. Random zooming is the only non-rigid transformation performed because we suspect that other non-rigid transformations, such as shearing, do not represent common wound shape variations. Eventually, the training dataset is augmented to around 5000 images. We keep the validation dataset unaugmented so that the evaluation remains unbiased.
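A hedged sketch of how such an augmentation pipeline could be set up with Keras' ImageDataGenerator is shown below. The directory paths, the batch size, and the exact zoom_range standing in for "zooming within 80% of the original image area" are assumptions; a shared random seed keeps image and mask transformations in sync.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The same random transformations must be applied to images and their masks,
# so two generators share identical arguments and the same random seed.
aug_args = dict(
    rotation_range=25,       # random rotations in [-25, +25] degrees
    horizontal_flip=True,    # left-right flip with probability 0.5
    vertical_flip=True,      # top-down flip with probability 0.5
    zoom_range=[0.9, 1.1],   # assumed stand-in for the paper's zoom setting
    fill_mode="constant", cval=0,
)
image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

seed = 1  # identical seed keeps image/mask pairs aligned
image_flow = image_gen.flow_from_directory(
    "data/train/images", target_size=(224, 224), class_mode=None,
    batch_size=2, seed=seed)
mask_flow = mask_gen.flow_from_directory(
    "data/train/masks", target_size=(224, 224), class_mode=None,
    color_mode="grayscale", batch_size=2, seed=seed)
train_flow = zip(image_flow, mask_flow)
```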

Model architecture overview

A convolutional neural network (CNN), MobileNetV219, is adopted to segment the wound from the images. Compared with conventional CNNs, this network substitutes the fundamental convolutional layers with depth-wise separable convolutional layers22, where each layer can be separated into a depth-wise convolution layer and a point-wise convolution layer. A depth-wise convolution performs lightweight filtering by applying one convolutional filter per input channel. A point-wise convolution is a 1 × 1 convolution responsible for building new features through linear combinations of the input channels. This substitution reduces the computational cost by almost a factor of k2 compared to traditional convolutional layers, where k is the convolutional kernel size. Thus, depth-wise separable convolutions are much more computationally efficient than conventional convolutions, making them suitable for mobile or embedded applications where computing resources are limited. For example, the mobility of MobileNetV2 could benefit medical professionals and patients by allowing instant wound segmentation and wound area measurement immediately after a photo is taken on a mobile device such as a smartphone or tablet. An example of a depth-wise separable convolution layer is shown in Fig. 3c, compared to a traditional convolutional layer shown in Fig. 3b.
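The following Keras sketch illustrates a single depth-wise separable convolution of the kind described above, with a rough parameter count in the comments showing the approximately k2 saving. It is an illustration of the building block, not the authors' implementation; the example channel counts are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_conv(x, filters, stride=1):
    """One depth-wise separable convolution: a 3x3 depth-wise filter applied per
    input channel, followed by a 1x1 point-wise convolution that mixes channels."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)          # ReLU6, as in MobileNet
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)

# Rough weight-count comparison against a standard 3x3 convolution (example: 64 -> 128 channels):
#   standard:  k*k*C_in*C_out         = 3*3*64*128      = 73,728
#   separable: k*k*C_in + C_in*C_out  = 3*3*64 + 64*128 = 8,768   (~1/k^2 of the above)
```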

The model has an encoder-decoder architecture, as shown in Fig. 2. The encoder is built by repeatedly applying the depth-separable convolution block (marked with diagonal lines in Fig. 2). Each block, illustrated in Fig. 3a, consists of six layers: a 3 × 3 depth-wise convolutional layer followed by batch normalization and ReLU activation23, and a 1 × 1 point-wise convolution layer followed again by batch normalization and ReLU. More specifically, ReLU624 is used as the activation function. In the decoder, shown in Fig. 2, the encoded features are captured at multiple scales with a spatial pyramid pooling block and then concatenated with higher-level features generated from a pooling layer and a bilinear up-sampling layer. After the concatenation, we apply a few 3 × 3 convolutions to refine the features, followed by another bilinear up-sampling by a factor of 4 to generate the final output. A batch normalization layer is inserted into each bottleneck block and a dropout layer is inserted right before the output layer. In MobileNetV2, a width multiplier α is introduced to deal with various dimensions of input images. We set α = 1, and the input image size is 224 × 224 pixels in our model.
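For concreteness, the sketch below shows one way such an encoder-decoder could be assembled in Keras using the MobileNetV2 backbone from tf.keras.applications. The choice of low-level feature layer ("block_3_expand_relu"), the simplified image-level pooling branch standing in for the spatial pyramid pooling block, and the channel widths are our assumptions rather than the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_segmenter(input_shape=(224, 224, 3), alpha=1.0, weights=None):
    # MobileNetV2 encoder; intermediate and final feature maps are reused below.
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, alpha=alpha, include_top=False, weights=weights)
    low = backbone.get_layer("block_3_expand_relu").output   # 56 x 56 low-level features
    deep = backbone.output                                    # 7 x 7 encoder output

    # Image-level pooling branch (a much simplified stand-in for the pyramid pooling
    # block), upsampled and concatenated with a 1x1 projection of the deep features.
    pool = layers.AveragePooling2D(pool_size=7)(deep)
    pool = layers.Conv2D(256, 1, activation="relu")(pool)
    pool = layers.UpSampling2D(size=7, interpolation="bilinear")(pool)
    deep = layers.Conv2D(256, 1, activation="relu")(deep)
    x = layers.Concatenate()([deep, pool])

    # Upsample to the low-level resolution, concatenate, refine with 3x3 convolutions.
    x = layers.UpSampling2D(size=8, interpolation="bilinear")(x)
    low = layers.Conv2D(48, 1, activation="relu")(low)
    x = layers.Concatenate()([x, low])
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.1)(x)   # dropout right before the output layer

    # Final bilinear upsampling by a factor of 4 and a sigmoid for the binary wound mask.
    x = layers.UpSampling2D(size=4, interpolation="bilinear")(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(backbone.input, out)
```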

Figure 2

The encoder–decoder architecture of MobileNetV219.

Figure 3

(a) A depth-separable convolution block. The block contains a 3 × 3 depth-wise convolutional layer and a 1 × 1 point-wise convolution layer. Each convolutional layer is followed by batch normalization and ReLU6 activation. (b) An example of a convolution layer with a 3 × 3 × 3 kernel. (c) An example of a depth-wise separable convolution layer equivalent to (b).

Transfer learning

To make the training more efficient, we used transfer learning for our deep learning model. Instead of randomly initializing the weights, the MobileNetV2 model pre-trained on the Pascal VOC segmentation dataset25 was loaded before training. Transfer learning with the pre-trained model benefits the training process in that the weights converge faster and to a better optimum.
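In Keras terms, this can be approximated by building the encoder from pre-trained weights rather than random initialization, as in the snippet below. The "imagenet" weights are a readily available stand-in for the Pascal VOC pre-trained checkpoint used in the paper, and build_segmenter refers to the hypothetical architecture sketch above.

```python
# Load pre-trained encoder weights before training; "imagenet" is a readily
# available stand-in for the Pascal VOC pre-trained model used in the paper.
# The decoder layers added on top of the backbone remain randomly initialized.
model = build_segmenter(weights="imagenet")
```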

Post-processing

The raw segmentation masks predicted by our model are grayscale images with pixel intensities ranging from 0 to 255. In the post-processing step, binary segmentation masks are first generated by thresholding with a fixed threshold of 127, which is half the maximum intensity. The binary masks are further processed by hole filling and removal of small regions to generate the final segmentation masks, as shown in Fig. 4. We notice that abnormal tissue, such as fibrinous tissue within chronic wounds, can be identified as non-wound and cause holes in the segmented wound regions. Such holes are detected as small connected components in the segmentation results using connected component labelling (CCL)26 and filled to improve the true positive rate. Small false-positive noise regions are removed in the same way. Since each image in the dataset is cropped from the raw image around a single wound, we simply remove noise in the segmentation results by discarding connected components that are too small according to an adaptive threshold. More specifically, a connected region is removed when the number of pixels within the region is less than a threshold that is adaptively calculated from the total number of pixels segmented as wound in the image.
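A minimal post-processing sketch using scikit-image connected component labelling is given below. The 5% area ratio used for the adaptive threshold is an assumed value, as the paper states only that the threshold is derived from the total number of segmented wound pixels.

```python
import numpy as np
from skimage import measure, morphology

def postprocess(prob_map, ratio=0.05):
    """Threshold a grayscale prediction (0-255), fill small holes inside the wound,
    and drop small false-positive components. The 5% ratio is an assumed value for
    the adaptive threshold described in the text."""
    mask = prob_map >= 127                         # fixed threshold at half intensity
    min_size = int(ratio * mask.sum())             # adaptive size threshold

    labeled = measure.label(mask, connectivity=2)  # connected component labelling (CCL)
    mask = morphology.remove_small_objects(labeled, min_size=min_size) > 0
    mask = morphology.remove_small_holes(mask, area_threshold=min_size)
    return mask.astype(np.uint8) * 255
```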

Figure 4

An illustration of the segmentation results and the post-processing method. The first row shows images in the testing dataset. The second row shows the segmentation results predicted by our model without any post-processing. Holes are marked with red boxes and noise regions are marked with yellow boxes. The third row shows the final segmentation masks generated by the post-processing method.

Results

We describe the evaluation metrics and compare the segmentation performance of our method with several popular and state-of-the-art methods. Our deep learning model is trained with the data augmentation and pre-processing described above. Extensive experiments were conducted to investigate the effectiveness of our network. FCN-VGG-16 is a popular network architecture for wound image segmentation17,27, so we trained this network on our dataset as the baseline model. For a fair comparison, we used the same training and data augmentation strategies throughout the experiments.

Evaluation metrics

To evaluate the segmentation performance, Precision, Recall, and the Dice coefficient are adopted as the evaluation metrics28.

Precision

Precision shows the accuracy of segmentation. More specifically, Precision measures the percentage of correctly segmented pixels in the segmentation and is computed by:

$$\text{Precision}=\frac{\text{True positives}}{\text{True positives}+\text{False positives}}$$

Recall

Recall also shows the accuracy of segmentation. More specifically, it measures the percentage of correctly segmented pixels in the ground truth and is computed by:

$$\text{Recall}=\frac{\text{True positives}}{\text{True positives}+\text{False negatives}}$$

Dice coefficient (Dice)

Dice shows the similarity between the segmentation and the ground truth. Dice is also called the F1 score, as it balances Precision and Recall. More specifically, Dice is the harmonic mean of Precision and Recall:

$$\text{Dice}=\frac{2 \times \text{True positives}}{2 \times \text{True positives}+\text{False negatives}+\text{False positives}}$$
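The three metrics can be computed directly from the binary prediction and ground-truth masks, for example with the following NumPy snippet (an illustration, not the evaluation code used in the paper):

```python
import numpy as np

def precision_recall_dice(pred, truth):
    """Pixel-wise Precision, Recall, and Dice for binary masks (values 0/1)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # true positives
    fp = np.logical_and(pred, ~truth).sum()   # false positives
    fn = np.logical_and(~pred, truth).sum()   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, dice
```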

Experiment setup

The deep learning model in the presented work was implemented in Python with Keras29 and a TensorFlow30 backend. To speed up training, the models were trained on a 64-bit Ubuntu PC with an 8-core 3.4 GHz CPU and a single NVIDIA RTX 2080Ti GPU. For updating the parameters of the network, we employed the Adam optimization algorithm31, which has become popular in the field of stochastic optimization due to its fast convergence compared to other optimizers. Binary cross-entropy was used as the loss function, and we also monitored Precision, Recall, and the Dice score as evaluation metrics. The initial learning rate was set to 0.0001 and each minibatch contained only 2 images to balance training accuracy and efficiency. The convolutional kernels of our network were initialized with He initialization32 to speed up the training process, and the training time of a single epoch was about 77 s. We used early stopping to terminate training, saving the best result when the Dice score did not improve for more than 100 epochs. Eventually, our deep learning model was trained for around 1000 epochs before overfitting.
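A hedged sketch of this training setup in Keras is shown below. The soft Dice metric, the steps_per_epoch value, the maximum epoch count, and the data generators (train_flow, val_flow, and model from the earlier sketches) are assumptions for illustration, while the optimizer, loss, learning rate, batch size, and early-stopping patience follow the settings described above.

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # Soft Dice on the flattened probability map; used here only for monitoring.
    y_true_f, y_pred_f = K.flatten(y_true), K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # initial learning rate 0.0001
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), dice_coef],
)

# Stop when the validation Dice score has not improved for more than 100 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_dice_coef", mode="max", patience=100, restore_best_weights=True)

model.fit(train_flow, validation_data=val_flow, epochs=2000,
          steps_per_epoch=2500,   # ~5000 augmented images / batch size of 2
          callbacks=[early_stop])
```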

To evaluate the performance of the proposed method, we compared the segmentation results achieved by our method with those of FCN-VGG-1617,27, SegNet16, and Mask-RCNN33,34. We also added 2D U-Net35 to the comparison due to its outstanding segmentation performance on biomedical images with relatively small training datasets. The segmentation results predicted by our model are shown in Fig. 4 along with the illustration of our post-processing method. Quantitative results for the different networks are presented in Table 1, where bold numbers indicate the best results among all models.

Table 1 The precision, recall, and dice score evaluated using various models on our dataset.

Comparing our method to the others

In the performance measures, the Recall of our model was the second highest among all models, at 89.97%. This was 1.32 percentage points behind the highest Recall, 91.29%, which was achieved by U-Net. Our model also achieved the second highest Precision of 91.01%. Overall, the results show that our model achieves the highest accuracy, with a mean Dice score of 90.47%. VGG16 showed the worst performance among the compared CNN architectures. Mask-RCNN achieved the highest Precision of 94.30%, which indicates that the segmentation predicted by Mask-RCNN contains the highest percentage of true positive pixels. However, its Recall is only 86.40%, meaning that more wound pixels go undetected (false negatives) compared to U-Net and MobileNetV2. Our Dice score was slightly higher than those of U-Net and Mask-RCNN, and significantly higher than those of SegNet and VGG16.

Comparison within the Medetec Dataset

Apart from our dataset, we also conducted experiments on the Medetec Wound Dataset36 and compared the segmentation performance of these methods. The results are shown in Table 2. We annotated this dataset in the same way that our dataset was annotated and trained the networks with the same experimental setup. The highest Dice score, 94.05%, is achieved by MobileNetV2 + CCL. This evaluation agrees with the conclusion drawn from our dataset: our method outperforms the others regardless of which chronic wound segmentation dataset is used, demonstrating that our model is robust and not biased toward a single dataset.

Table 2 The precision, recall, and dice score evaluated using various models on the Medetec dataset.

Discussion

Comparing our method to VGG16, the Dice score is boosted from 81.03% to 90.47% on our dataset. Based on the appearance of chronic wounds, we know that wound segmentation is complicated by various shapes, colors, and the presence of different types of tissue. The patient images captured in clinical settings also suffer from varying lighting conditions and perspectives. MobileNetV2 has a deeper architecture with more convolutional layers than VGG16, which makes it more capable of handling these variations. MobileNetV2 utilizes residual blocks with skip connections instead of the sequential convolutional layers in VGG16 to build a deeper network. These skip connections, bridging the beginning and the end of a convolutional block, allow the network to access earlier activations that were not modified in the block, enhancing the capacity of the network.

Another comparison, between U-Net and SegNet, indicates that the former model is significantly better in terms of mean Dice score. Similar to the previous comparison, U-Net introduces skip connections between convolutional layers to replace the pooling-indices operation in the architecture of SegNet. These skip connections concatenate the output of the transposed convolution layers with the feature maps from the encoder at the same level. Thus, the expansion section, which consists of a large number of feature channels, allows the network to propagate localization information combined with contextual information from the contraction section to higher-resolution layers. Intuitively, in the expansion section or "decoder" of the U-Net architecture, the segmentation results are reconstructed with the structural features that are learned in the contraction section or "encoder". This allows U-Net to make predictions at more precise locations. These comparisons illustrate the effectiveness of skip connections for improving the accuracy of wound segmentation.
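As an illustration of the skip connections discussed here, the snippet below sketches a generic U-Net-style expansion block in Keras, in which the coarser decoder feature map is upsampled and concatenated with the encoder feature map at the same resolution. This is a generic example, not the exact U-Net configuration evaluated in this work.

```python
from tensorflow.keras import layers

def up_block(decoder_feat, encoder_feat, filters):
    """One U-Net-style expansion step: upsample the coarser decoder features,
    concatenate them with the encoder features at the same resolution (the skip
    connection), then refine with two 3x3 convolutions."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(decoder_feat)
    x = layers.Concatenate()([x, encoder_feat])   # the skip connection
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
```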

Besides its performance, our method is also efficient and lightweight. As shown in Table 3, the total number of trainable parameters in the adopted MobileNetV2 is only a fraction of the numbers in U-Net, VGG16, and Mask-RCNN. Thus, the network takes less time to train and can be deployed on mobile devices with less memory and limited computational power. Alternatively, higher-resolution input images could be fed into MobileNetV2 using less memory and computational power than the other models.

Table 3 Comparison of total numbers of trainable parameters.

Conclusions

We addressed the automated segmentation of chronic foot ulcers using deep learning on a dataset that we built ourselves. We conducted comprehensive experiments and analyses on SegNet, VGG16, U-Net, Mask-RCNN, and our model based on MobileNetV2 and CCL to evaluate the performance of chronic wound segmentation. In the comparison of these neural networks, our method demonstrated its effectiveness and suitability for mobile deployment in the field of image segmentation, owing to its fully convolutional architecture consisting of depth-wise separable convolutional layers. We demonstrated the robustness of our model by testing it on the foot ulcer images in the publicly available Medetec Wound Dataset, where our model still achieves the highest Dice score. In the future, we plan to improve our work with a novel multi-stream neural network architecture that extracts shape features separately from the pixel-wise convolutions in our deep learning model. A sketch of this idea is shown in Fig. 5. With the advance of hardware and mobile computing, larger deep learning models will become runnable on mobile devices. Another line of future research is testing deeper neural networks on our dataset. We will also include more data in the dataset to improve the robustness and prediction accuracy of our method.

Figure 5

An illustration of the idea of a multi-stream model that processes pixel intensities and shape information separately. To predict segmentation masks, the output tensors from both streams can be fused using carefully designed concatenation layers or multi-scale pooling layers.
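A purely hypothetical Keras sketch of this two-stream idea is given below: two small convolutional streams, one over pixel intensities and one over a shape representation such as an edge map, are fused by concatenation before the per-pixel prediction. The layer widths and inputs are placeholders, not a design used in this work.

```python
from tensorflow.keras import layers, Model, Input

# Two hypothetical streams: one over the RGB image (pixel intensities) and one
# over a single-channel shape representation such as an edge map.
image_in = Input((224, 224, 3))
shape_in = Input((224, 224, 1))

intensity_feat = layers.Conv2D(32, 3, padding="same", activation="relu")(image_in)
shape_feat = layers.Conv2D(32, 3, padding="same", activation="relu")(shape_in)

# Fuse the output tensors of both streams by concatenation before the per-pixel prediction.
fused = layers.Concatenate()([intensity_feat, shape_feat])
mask = layers.Conv2D(1, 1, activation="sigmoid")(fused)
two_stream = Model([image_in, shape_in], mask)
```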