Generative adversarial networks based skin lesion segmentation

Skin cancer is a serious condition that requires accurate diagnosis and treatment. One way to assist clinicians in this task is using computer-aided diagnosis tools that automatically segment skin lesions from dermoscopic images. We propose a novel adversarial learning-based framework called Efficient-GAN (EGAN) that uses an unsupervised generative network to generate accurate lesion masks. It consists of a generator module with a top-down squeeze-and-excitation-based compound scaled path, an asymmetric lateral connection-based bottom-up path, and a discriminator module that distinguishes between original and synthetic masks. A morphology-based smoothing loss is also implemented to encourage the network to create smooth semantic boundaries of lesions. The framework is evaluated on the International Skin Imaging Collaboration (ISIC) Lesion Dataset. It outperforms the current state-of-the-art skin lesion segmentation approaches with a Dice coefficient, Jaccard similarity, and accuracy of 90.1%, 83.6%, and 94.5%, respectively. We also design a lightweight segmentation framework called Mobile-GAN (MGAN) that achieves performance comparable to EGAN with an order of magnitude fewer training parameters, resulting in faster inference in low compute resource settings.


Introduction
Skin cancer results in approximately 91,000 deaths annually 1. Early detection and regular monitoring are crucial in improving the quality of diagnosis, ensuring accurate treatment planning, and reducing skin cancer mortality rates 2. A common detection method involves a dermatologist examining skin images to identify ambiguous clinical patterns of lesions that are often not visible to the naked eye. Dermoscopy, a widely used technique, helps dermatologists differentiate between malignant and benign lesions by eliminating surface reflections on the skin, thereby improving the accuracy of skin cancer diagnosis 3. Skin lesion segmentation, a method to differentiate foreground lesions from the background, has received a lot of attention for over a decade due to its high clinical applicability and demanding nature. Computer-aided diagnostic algorithms for automated skin lesion segmentation could aid clinicians in precise treatment and diagnosis, strategic planning, and cost reduction. However, automated skin lesion segmentation is challenging due to several factors 7 such as (i) large variance in shape, texture, color, geographical conditions, and fuzzy boundaries, (ii) the presence of artifacts such as hair and blood vessels, and (iii) poor contrast between background skin and cancer lesions in addition to artifacts from image acquisition, as shown in Figure 1.

Prior Work
Pixel-level skin lesion segmentation algorithms can be divided into approaches built upon a) classical image processing and b) deep learning-based architectures. Deep learning-based methods can be further classified into Convolutional Neural Networks (CNN) and Adversarial Learning-based Generative Networks (GAN) based on the network topology. A brief review of a few prior works in these categories is presented in Table 1. The performance of classical image processing approaches depends heavily on post-processing (such as thresholding, clustering, and hole filling), hyperparameter tuning, and manual feature selection. Manually tuning these parameters can be expensive and could result in poor generalizability. Lately, deep learning-based approaches have surpassed several classical image processing-based approaches, mainly due to the wide availability of large labeled datasets and compute resources. Deep convolutional neural network (DCNN) based methods gained a lot of popularity for skin lesion segmentation prior to the introduction of Transformer and GAN-based approaches in the field of medical imaging [23][24][25][26][27].
The success of prior DCNN-based approaches in skin lesion segmentation is primarily based on supervised methods that rely on large labeled datasets to extract features related to the image's spatial characteristics and deep semantic maps. However, gathering a large dataset with finely annotated images is time-consuming and expensive. To address this challenge, Goodfellow et al. 28 introduced Generative Adversarial Networks (GANs), which have gained popularity in various applications, including medical image synthesis, due to the lack of widely available finely annotated data. Several recent and relevant GAN-based approaches in skin lesion analysis from the literature are listed in Table 1. Unsupervised learning-based algorithms that can handle large datasets with precision and high performance without requiring ground truth labels carry significant promise in addressing real-world problems such as computer-aided medical image analysis. In our work, we address the challenges of lesion segmentation by utilizing generative adversarial networks (GANs) 28, which can generate accurate segmentation masks with minimal or no supervision. GANs work by training a generator and a discriminator to compete against each other: the generator tries to create realistic images, and the discriminator tries to differentiate between real and generated images (Figure 2). However, designing an effective GAN for segmentation takes considerable time, as the performance is highly dependent on the architecture and the choice of loss function. Our study aims to optimize all three components (generator, discriminator, and loss function) for better segmentation results. The choice of loss function is critical for the success of any deep learning architecture, and our approach takes this into account 29.

Proposed Work
We propose two GAN frameworks for skin lesion segmentation. The first is Efficient-GAN (EGAN), which focuses on precision and learns in an unsupervised manner, making it data-efficient. It uses an encoder-decoder-based generator, a patchGAN-based discriminator 30, and a smoothing-based loss function. The generator architecture uses a squeeze-and-excitation-based compound scaled encoder and a lateral connection-based asymmetric decoder. This architecture captures dense features to generate fine-grained segmentation maps, and the discriminator distinguishes between synthetic and original labels. We also implement a morphology-based smoothing loss function to capture fuzzy boundaries more effectively.
Although deep learning methods provide high precision for lesion segmentation, they are computationally expensive, making them impractical for real-world applications with limited resources, such as dermatoscopy machines. This presents a challenge in contexts where high-resource devices are unavailable to dermatologists. To address this issue, various devices such as MoleScope II, DermLite, and HandyScope have been developed for lesion analysis with low computational resources. These devices use a special lens with a smartphone. To create a more practical model for such real-time applications, we propose Mobile-GAN (MGAN), a lightweight unsupervised model consisting of an Inverted Residual block 31 with Atrous Spatial Pyramid Pooling 32. This model aims to achieve good segmentation performance in terms of the Jaccard score with lower resource strain. With only 2.2M parameters (as opposed to 27M parameters in EGAN), the model can run at 13 frames per second, increasing the potential impact of computer vision-based approaches in day-to-day clinical practice.

Performance of CNN-based models
We implemented and analyzed the results of several CNN and GAN-based approaches for this task. Table 2 summarizes the evaluation of CNN and GAN-based approaches on the unseen test dataset. We started with one of the most popular architectures in medical image segmentation: UNet 33. Since this architecture is a simple stack of convolutional layers, the original UNet provided a baseline performance on the ISIC 2018 dataset. We then conducted several experiments using deeper encoders such as ResNet, MobileNet, and EfficientNet, together with asymmetric decoders (described in the Methods section). In these decoders, the concatenation of low-level features is skewed rather than linking every encoder block, as in the traditional UNet. Adding a batch normalization layer after each convolutional layer also helped achieve better performance. For a detailed evaluation of CNN-based methods, we also experimented with DeepLabV3+ 32 and Feature Pyramid Network (FPN) 34 decoders in combination with the encoders described above, and these modifications led to improved performance. Results on the ISIC 2018 test set that come from our own experimentation, i.e., from running the authors' code to train the proposed models, are marked with * in Table 2.
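The three metrics reported for these comparisons (Dice coefficient, Jaccard similarity, and accuracy) can be illustrated with a minimal sketch on binary masks. The function name `segmentation_metrics` and the toy masks are hypothetical, and the handling of two empty masks is an assumed convention; the ISIC evaluation server may compute and aggregate its metrics differently.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Dice, Jaccard, and pixel accuracy for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # lesion pixels found
    fp = np.logical_and(pred, ~target).sum()   # background marked as lesion
    fn = np.logical_and(~pred, target).sum()   # lesion pixels missed
    tn = np.logical_and(~pred, ~target).sum()  # background kept as background
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    jaccard = tp / union if union else 1.0
    accuracy = (tp + tn) / pred.size
    return float(dice), float(jaccard), float(accuracy)

# Toy 4x4 example: the prediction misses one column of the lesion.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[1, 1, 1, 0],
                   [1, 1, 1, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]])
dice, jaccard, accuracy = segmentation_metrics(pred, target)
print(dice, jaccard, accuracy)
```

For a single image the Jaccard index is never larger than the Dice coefficient, which is why the reported Jaccard values sit below the Dice values.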

Performance of GAN-based models
Table 2 also lists several results from recent literature on this dataset for completeness of comparison. Models trained by us were submitted to the evaluation server for a fair evaluation. We then compare the results of various GAN-based approaches, as shown in Table 2. We observe that a well-designed generative adversarial network (GAN) improves performance compared to CNN-based techniques for medical image segmentation. This is because of GANs' ability to overcome the main challenge in this domain: the lack of large labeled training datasets. Our proposed EGAN approach outperforms all other approaches. A few works 8,9,35,36 report better performance than our results, but these works created and used an independent test split from the ISIC training data and did not use the actual ISIC test data.

Performance of lightweight models
We designed a lightweight generator model called MGAN, based on DeepLabV3+ and MobileNetV2, which achieves results comparable to our EGAN model in terms of Dice Coefficient with significantly fewer parameters and faster inference times.

Visualization of the learned representations
One of the criticisms of deep neural networks, which can make valuable and skillful predictions, is that they are generally opaque, i.e., it is unclear how or why a particular prediction or decision is made. To address concerns about the opacity of deep neural networks, we utilized the internal structures of convolutional neural networks that work on 2D image data to investigate the representations learned by our unsupervised model. Figure 4 displays the segmentation results for visual interpretation. The proposed GAN framework also demonstrates better segmentation performance regardless of non-skin objects or artifacts in the image. We assessed and visualized the 2D filter weights of the model to explore the features learned by the model. Additionally, we investigated the activation layers of the model to understand precisely which features the model recognized for a given input image, and we displayed the results in Figure 3 for visual understanding. We selected the output of seven blocks of the encoder (Block1-Block7) and four output feature maps from the decoder (D1-D4) for visualization, as the model has numerous convolutional layers in each architecture block.

Discussion
This paper has three main findings. First, we proposed a novel unsupervised adversarial learning-based framework (EGAN) based on Generative Adversarial Networks (GANs) to segment skin lesions accurately and in a fine-grained manner. In data-scarce applications such as skin lesion segmentation, the success of GANs relies on the quality of the generator, discriminator, and loss function used. One of the main challenges in the field of medical imaging is the availability of large annotated datasets, which are tedious, time-consuming, and costly to collect. To address the data-efficiency challenge, we trained our model unsupervised, allowing the generator module to capture features effectively and segment the lesion without supervision. Our patchGAN-based discriminator penalized the adversarial network by differentiating between labels and predictions. As we do not backpropagate the error during training in the discriminator, no further advancement of the discriminator is needed; the patchGAN-based architecture is powerful enough to classify between real and fake. In skin lesion segmentation, capturing contextual information around the segmentation boundary is crucial for improving performance 8. To address this, we also implemented the morphology-based smoothing loss to capture fuzzy lesion boundaries, resulting in a highly discriminative GAN that considers contextual information and segmented boundaries. The performance-focused EGAN approach outperforms prior works: trained with adversarial learning and the morphology-based smoothing loss function, it achieves a Dice coefficient of 90.1% on the ISIC 2018 test dataset, compared with 88.4% when trained with the Dice loss alone, revealing the potential of our methodology. Our evaluation on the ISIC 2018 dataset demonstrates significantly improved performance compared to existing models in the literature. Furthermore, the proposed framework could be extended to other medical imaging applications.
Second, we proposed a lightweight segmentation framework (MGAN) that achieves comparable results while being much less computationally expensive, with an order of magnitude fewer training parameters and significantly faster inference. The MGAN approach is suitable for real-time applications, making it a viable solution for edge deployment, for instance, in low compute resource contexts. Our proposed framework thus includes two generative models, EGAN and MGAN, which are designed to balance performance and efficiency. Integrating models like MGAN with dermoscopy devices could enable more efficient, accurate, real-time segmentation and more accessible care for patients with skin lesions. Third, our approach enables visualizing the learned representations of the model to interpret the predictions. This is especially crucial for clinical algorithms-in-the-loop applications such as skin lesion segmentation, where the decisions of automated segmentation methods could be considered by clinicians in the context of the features learned by the model.
Limitations: Although our model has achieved promising performance on the ISIC 2018 dataset, its performance could not be evaluated on other datasets. We explored different datasets such as Derm7pt 43, Diverse Dermatology Images 44, and Fitzpatrick 17k 45, among others, to assess the generalizability of the proposed approach. However, segmentation masks were not available for them. While segmentation masks were available for the PH2 dataset 46, we could not access the dataset. Deep learning models are computationally intensive and require significant resources. The EGAN model is computationally heavy for deployment in real-time clinical applications, which can limit its use in resource-constrained environments or on devices with limited processing capabilities. In such scenarios, models such as MGAN could be utilized.

Methods
The skin lesion GAN-based segmentation framework we propose in this work is shown in Figure 2. The framework contains three main components: i) the generator, which consists of an encoder to extract feature maps and a decoder to generate segmentation maps without supervision and adapt to variations in contrast and artifacts; ii) the discriminator, which distinguishes between the reference label and the segmentation output; and iii) appropriate loss functions to prevent overfitting, achieve excellent convergence, and accurately capture fuzzy lesion boundaries.

Dataset
The proposed segmentation approach was evaluated using the ISIC 2018 dataset, a standard skin lesion analysis dataset. This dataset contains 2594 images with corresponding ground truth, of which 20% (approximately 514 images) were used for validation. The images in the dataset vary in size and aspect ratio and contain lesions with different appearances in various skin areas. Some sample images from the dataset are shown in Figure 1. To ensure a fair evaluation, the results of the test set were uploaded to the online server of the ISIC 2018 portal 4.

Generative Adversarial Network
Goodfellow et al. 28 first introduced generative adversarial networks (GANs) to generate synthetic data. Labeling clinical information is a tricky and time-consuming task requiring a specialist, and several medical imaging applications lack adequately annotated data. Inspired by this, the proposed work leverages an unsupervised GAN for skin lesion segmentation. To begin with the methodology, we first briefly discuss generator and discriminator concepts. An adversarial network comprises a generator (G) and a discriminator (D). The generator maps a random vector γ from source domain space α to generate the desired output. In the adversarial objective, V is a function of the discriminator (D) and the generator (G), γ is drawn from an input noise distribution P γ (γ), true samples are drawn from P data (α), and θ G and θ D are the generator and discriminator parameters, respectively.
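The symbols defined above match the standard minimax objective of Goodfellow et al. Assuming that is the objective intended here, it can be written in this section's notation as follows (a reconstruction under that assumption, not a quotation):

```latex
\min_{\theta_G} \max_{\theta_D} V(D, G) =
  \mathbb{E}_{\alpha \sim P_{data}(\alpha)} \left[ \log D(\alpha) \right]
  + \mathbb{E}_{\gamma \sim P_{\gamma}(\gamma)} \left[ \log \left( 1 - D(G(\gamma)) \right) \right]
```

The discriminator is updated to increase V, pushing D(α) toward 1 on true samples and D(G(γ)) toward 0 on generated ones, while the generator is updated to decrease V.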

Segmentation Framework
Generally, segmentation frameworks consist of an encoder-decoder-based architecture. The encoder module is the feature extraction block that captures spatial information within the image. It reduces the spatial size, i.e., the dimension of the input image, and decreases the feature map resolution to capture high-level features. The decoder recovers the spatial information by upsampling the feature maps extracted by the encoder layers and produces the output segmentation map. We propose to modify the encoder design to capture dense feature maps, rather than using the traditional encoder, and to change the decoder accordingly, as shown in Figure 5. Including squeeze-and-excitation-based compound scaled encoders significantly improves the segmentation results.

Design of Encoder
Advances in CNN design depend on the available infrastructure and, subsequently, on scaling the model in width (w), depth (d), or resolution (r) to improve performance further as more resources become available. Instead of doing this scaling manually and arbitrarily, Tan et al. proposed a novel systematic and automatic scaling approach by introducing a compound coefficient φ, which scales the network's depth, width, and resolution together using a fixed set of scaling factors. The encoder is built using this compound scaling, as proposed by Baheti et al. 40, and consists of seven building blocks. Each basic building block of this encoder combines squeeze-and-excitation functions 48 with mobile inverted bottleneck convolution (MBConv), as shown in Figure 5(b). Swish activation is also used in each encoder block, which enhances performance.
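Assuming the scaling rule referred to here is the EfficientNet compound scaling of Tan et al., it can be written as below. Note that α, β, and γ in this expression are per-dimension scaling constants found by a small grid search; they are unrelated to the α and γ used for the GAN objective elsewhere in this paper.

```latex
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{subject to } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```

Under this constraint, increasing φ by one roughly doubles the computational cost, because the cost of a convolutional network grows approximately linearly with depth and quadratically with width and resolution.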

Design of Decoder
The encoder downsamples the input image to a smaller resolution and captures contextual information. A decoder block, also called an upsampling path, comprises many convolutional layers that progressively upsample the feature map obtained from the encoder. Conventional segmentation frameworks such as UNet 33 have symmetric encoder and decoder architectures. The proposed architecture instead pairs a compound scaled squeeze-and-excitation-based encoder with a decoder to form an asymmetric network. The output features from the encoder are expanded in the decoder blocks, which use bilinear upsampling. The low-level features from the encoder are combined with the higher-level feature maps of matching size from the decoder to generate a more precise segmentation output.

Design of lightweight segmentation framework
To develop a lightweight segmentation architecture for the generator, we leverage MobileNetV2 31 and DeepLabV3+ 32, which includes an atrous spatial pyramid pooling (ASPP) module, as shown in Figure 6. MobileNetV2 uses depthwise separable convolution and inverted residual blocks as its basic building module, as shown in Figure 6 above the encoder. MobileNetV2 is modified such that the output stride, i.e., the ratio of the input image resolution to the output feature map resolution, is 8. It has fewer computations and parameters and is thus suitable for real-time applications. The ASPP block uses several dilation rates, i.e., 1, 6, 12, and 18, to generate multi-scale feature maps, which are then integrated by concatenation. This feature map is upsampled and integrated with a low-level intermediate feature map from the contracting path, i.e., the encoder, to generate a fine-grained segmentation output. The feature extractor consists of a sequence of inverted residual blocks, as shown in Figure 6. The stride of the latter blocks is set to one. Images of size 512 × 512 × 3 are fed as input to the MGAN architecture.
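The numbers in this paragraph can be checked with a short worked example. The sketch below computes the encoder output resolution implied by an output stride of 8 and the spatial span covered by each ASPP branch. The 3 × 3 kernel size is an assumption (in DeepLab-style implementations the undilated branch is often a 1 × 1 convolution, in which case its span is 1), and `effective_kernel` is a hypothetical helper name.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

INPUT_SIZE = 512             # input resolution, from the text
OUTPUT_STRIDE = 8            # from the text
ASPP_RATES = (1, 6, 12, 18)  # dilation rates, from the text
KERNEL_SIZE = 3              # assumption: 3x3 atrous kernels in every branch

# Resolution of the encoder output that feeds the ASPP block.
feature_map_size = INPUT_SIZE // OUTPUT_STRIDE

# Span (in feature-map pixels) seen by one kernel application per branch.
spans = {rate: effective_kernel(KERNEL_SIZE, rate) for rate in ASPP_RATES}

print(feature_map_size)  # 64
print(spans)             # {1: 3, 6: 13, 12: 25, 18: 37}
```

On a 64 × 64 feature map, the branch with rate 18 spans more than half of the map in each direction, which is how the block gathers context at several scales without further downsampling.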

Discriminator
In our architecture, we have a generator and a discriminator. The discriminator supervises the generator to produce precise masks that match the original ground truth. To achieve this, we have implemented a patchGAN-based approach in which each element of the m × n discriminator output classifies one patch of the mask as matching the ground truth or not. The discriminator consists of five Conv2D layers with a kernel size of 4 × 4 and a stride of 2 × 2, with 64, 128, 256, 512, and 1 feature maps, respectively. LeakyReLU activation with an alpha value of 0.2 is used in each Conv2D layer except the last, which uses sigmoid activation. The patch-based discriminator has an output size (m × n) of 16 × 16, where one pixel is linked to a patch of the input probability maps with a size of 94 × 94. The discriminator classifies each patch as either fake or real. This learning strategy enforces the predicted label to be similar to the ground truth. The number of parameters is the same as proposed in patchGAN 30.
We use the following adversarial technique to align each generated label with the ground truth labels. A two-step min-max game alternately updates the generator and discriminator networks with adversarial learning. In the discriminator loss, x and y are the pixel locations of the input, D(I S ) is the discriminator output for source domain images (I S ), i.e., label images, D(I T ) is the discriminator output for target domain images (I T ), i.e., predicted images, and γ is the probability of the predicted pixel: γ = 1 when the input is from the ground truth, i.e., the source domain, and γ = 0 when the input is a mask segmented by the generator, i.e., the target domain.
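The 16 × 16 output size and the 94 × 94 patch size follow from the layer configuration, which a few lines of arithmetic can confirm. The sketch below assumes 'same' padding and a 512 × 512 input (the input size stated for MGAN); both are assumptions, and `conv_stack_summary` is a hypothetical helper, not part of any library.

```python
import math

def conv_stack_summary(input_size, layers):
    """Track output size and receptive field through a stack of conv layers.

    layers is a list of (kernel_size, stride) pairs. 'same' padding is
    assumed, so each layer maps a size n to ceil(n / stride).
    """
    size, receptive_field, jump = input_size, 1, 1
    for kernel_size, stride in layers:
        size = math.ceil(size / stride)
        # Each layer widens the receptive field by (k - 1) input-space steps,
        # where one step equals the product of all previous strides.
        receptive_field += (kernel_size - 1) * jump
        jump *= stride
    return size, receptive_field

# Five Conv2D layers with 4x4 kernels and stride 2, as described in the text.
layers = [(4, 2)] * 5
out_size, patch_size = conv_stack_summary(512, layers)
print(out_size, patch_size)  # 16 94
```

Each of the 16 × 16 outputs therefore judges an overlapping 94 × 94 region of the mask, so the discriminator penalizes local structure rather than issuing one verdict for the whole image.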
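The definitions above are consistent with a per-location binary cross-entropy over the discriminator output map. Assuming that form (the exact expression, including any weighting, is an assumption here), the discriminator loss can be sketched as:

```latex
L_D = - \sum_{x, y} \Big[ \gamma \, \log D(I_S)^{(x, y)}
      + (1 - \gamma) \, \log \big( 1 - D(I_T)^{(x, y)} \big) \Big]
```

With γ = 1 only the first term is active and the discriminator is rewarded for scoring label images as real; with γ = 0 only the second term is active and it is rewarded for scoring generator outputs as fake.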

Loss Function
We implement a morphology-based smoothing loss to improve skin lesion segmentation and to supervise the network so that it captures the lesion's smoothness and fuzzy boundaries. The network's loss function includes the Dice coefficient loss (L DL ) as well as the morphology-based smoothing loss (L SL ). The Dice coefficient loss measures the overlap between the ground truth and the prediction.
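A minimal sketch of a Dice coefficient loss is given below, assuming the common soft formulation L_DL = 1 - 2 Σ(p·g) / (Σp + Σg) over predicted probabilities p and ground-truth labels g. The stabilizing constant `eps` and the function name are assumptions, and the exact variant used for the reported results may differ. The morphology-based term L SL is not sketched because its definition is not given in this section.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2 * sum(p * g) / (sum(p) + sum(g)).

    pred holds probabilities in [0, 1], target holds 0/1 labels.
    eps (an assumed constant) avoids division by zero on empty masks.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    denominator = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

target = np.array([[1.0, 1.0], [0.0, 0.0]])
perfect = dice_loss(target, target)                 # close to 0
uncertain = dice_loss(np.full((2, 2), 0.5), target)  # close to 0.5
print(perfect, uncertain)
```

Because the loss is computed from sums over the whole mask rather than per pixel, it is less sensitive to the imbalance between small lesions and large background regions than a pixel-wise loss.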

Figure 1. Challenges in skin lesion segmentation using dermoscopic images. First row: a) minor variation in the lesion and skin color, b) low contrast between lesion and skin, c) occlusion in lesions due to hair, and d) artifacts from image acquisition. Second row: a few examples from the ISIC Lesion dataset 4 used in this paper.

Figure 2. Flowchart of the proposed framework. The generator module is an encoder-decoder network. The discriminator classifies the segmentation result as real or fake.

Figure 3. Visualization of the feature maps of the proposed EGAN architecture.

Figure 4. Comparison of the segmentation by various CNN and GAN-based approaches. Each column serially depicts the input image, label, output of various CNN-based approaches, and output of the proposed MGAN and EGAN. Ground truth and segmented lesions are marked with green and red curves, respectively.

Figure 5. The architecture of the proposed generator in the EGAN architecture. (a) The encoder has seven blocks and asymmetric concatenation to the decoder with lateral connections. The decoder is made of four Conv2D blocks, D1-D4. (b) MBConv block of the encoder.

Figure 6. The architecture of the lightweight and efficient segmentation network MGAN. This architecture is based on an inverted residual network and atrous spatial pyramid pooling. The inverted residual block is shown above the encoder.

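Table 1. Related work on skin lesion segmentation with CNN and GAN-based approaches.

| Model name / Citation | One phrase description | Architecture |
| Saliency Maps 5 | Segmentation based on Supervised Saliency Maps | Classical |
| UNet Segmentation 6 | Stochastic weight averaging using UNet | CNN |
| Deep CNN 7 | Full Resolution Networks | CNN |
| BLA-Net 8 | Deformable convolution ResNet34 with auxiliary boundary learning network | CNN |
| ERU 9 | EfficientNetB4 with UNet based encoder-decoder | CNN |
| AS-Net 10 | Combines spatial and channel attention for learning | CNN |
| Attention Network 11 | Attention mechanism with high resolution features | CNN |
| SEACU-Net 12 | Squeeze and Excitation based Attentive ConvLSTM | CNN |
| Conditional Random Fields 13 | Deep Learning Approach with Pre and Post Processing | CNN |
| FAT-Net 14 | Feature Adaptive Transformers | Transformer |
| DFE-Net 15 | CNN and Transformer based Feature extraction | Transformer |
| SLT-Net 16 | CSwin Transformer replaced Conv module in UNet | Transformer |
| cGAN 17 | Conditional Generative Adversarial Network | GAN |
| Generative Network 18 | Decisive Generator for skin lesion segmentation | GAN |
| DCGAN 19 | Generating Synthetic Skin Images | GAN |
| FCA-Net 20 | Factorised channel attention and multi-scale features | GAN |
| DAGAN 21 | Deep Neural Network with generative Networks | GAN |
| UNet-SCDC GAN 21 | Leveraging power of discriminators | GAN |
| SLS-Net 22 | Lightweight device model with GAN | GAN |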

Table 3 comprehensively compares various mobile architectures based on the Jaccard Index, the number of parameters (in millions), and inference speed on the test dataset for a patch size of 512 × 512. As Table 3 shows, MGAN has 2.2M parameters and an inference speed of 13 FPS. Even though SLSNet reports a higher Jaccard Index, that result was evaluated on an independent validation test set.

Table 2. Results of CNN and GAN-based approaches, including our proposed algorithms (MGAN and EGAN), on the ISIC 2018 test dataset. * indicates the model was re-trained using the authors' source code. - indicates metrics not being reported.

Table 3. Comparison of various mobile networks available in the literature. The inference column indicates the Frames per Second (FPS) on original images with a patch size of 512 × 512.