Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge

Polyps are well-known cancer precursors identified by colonoscopy. However, variability in their size, appearance, and location makes polyp detection challenging. Moreover, colonoscopy surveillance and removal of polyps are highly operator-dependent procedures that take place in a highly complex organ topology, and both missed detection and incomplete removal of colonic polyps remain common. To assist in clinical procedures and reduce miss rates, automated machine learning methods for detecting and segmenting polyps have been developed in recent years. However, the major drawback of most of these methods is their limited ability to generalise to out-of-sample, unseen datasets from different centres, populations, modalities, and acquisition systems. To test this hypothesis rigorously, we, together with expert gastroenterologists, curated a multi-centre and multi-population dataset acquired from six different colonoscopy systems and challenged computational expert teams to develop robust automated detection and segmentation methods in a crowd-sourced Endoscopic Computer Vision challenge. This work puts forward rigorous generalisability tests and assesses the usability of the devised deep learning methods in dynamic, real clinical colonoscopy procedures. We analyse the results of the four top-performing teams for the detection task and the five top-performing teams for the segmentation task. Our analyses demonstrate that the top-ranking teams concentrated mainly on accuracy over the real-time performance required for clinical applicability. We further dissect the devised methods and provide an experiment-based hypothesis that reveals the need for improved generalisability to tackle the diversity present in multi-centre datasets and routine clinical procedures.


Introduction
Colorectal cancer (CRC) is the third leading cause of cancer deaths, with a reported mortality rate of nearly 51% 1 . CRC can be characterised by early cancer precursors such as adenomas or serrated polyps that may over time lead to cancer. While polypectomy is a standard technique to remove polyps 2 by placing a snare (thin wire loop) around the polyp and closing it to cut through the polyp tissue either with diathermy (heat to seal vessels) or without (cold snare polypectomy), identifying small or flat polyps (e.g. lesions less than 10 mm) can be extremely challenging. This is due to the complex organ topology of the colon and rectum, which makes navigation and treatment procedures difficult and requires expert-level skills. Similarly, the removal of polyps can be very challenging due to constant organ deformations, which sometimes make it impossible to keep track of the lesion boundary, making complete resection difficult and subjective to the experience of the endoscopist. Computer-assisted systems can help to reduce operator subjectivity and enable improved adenoma detection rates (ADR). Thus, computer-aided detection and segmentation methods can assist in localising polyps and guiding surgical procedures (e.g. polypectomy) by showing polyp locations and margins. Some of the major requirements for such systems to be utilised in the clinic are real-time performance and algorithmic robustness.
Machine learning, in particular deep learning, together with tremendous improvements in hardware, has made it possible to design networks that provide real-time performance despite their computational complexity. However, one major challenge in developing these methods is the lack of comprehensive public datasets that include diverse patient populations, imaging modalities and scope manufacturers. Incorporating real-world challenges in the dataset is the only way forward in building guaranteed robust systems. In the past, there have been several attempts to collect and curate gastrointestinal (GI) datasets that include polyps and other GI lesions. A summary of existing related datasets with polyps is provided in Supplementary Table 1. A major limitation of these publicly available datasets is that they consist of either a single centre or a data cohort representing a single population. Additionally, the most widely used public datasets contain only single-frame, single-modality images. Moreover, even though conventional white light endoscopy (WLE) is used in standard colonoscopic procedures, narrow-band imaging (NBI), a type of virtual chromo-endoscopy, is widely used by experts for polyp identification and characterisation. Most deep learning-based detection [3][4][5] and segmentation [6][7][8][9] methods are trained and tested on the same-centre dataset and the WLE modality only. These supervised deep learning techniques have a major issue in not being able to generalise to unseen data from a different centre population 10 or even a different modality from the same centre 11 . The type of endoscope used also compromises robustness. Because most available datasets provide selectively sampled images for method development, the test datasets also comprise similarly collected data samples 9,[12][13][14] . Like most endoscopic procedures, colonoscopy is a continuous visualisation of the mucosa with a camera and a light source. During this process, live videos are acquired that are often corrupted with specularity, floating objects, stool, bubbles and pixel saturation 15 . Mucosal scene dynamics such as severe deformations, view-point changes and occlusion can also be major limiting factors for algorithm performance. It is thus important to cross-examine the generalisability of developed algorithms more comprehensively and on variable data settings, including modality changes and continuous frame sequences.
With the presented Endoscopic Computer Vision (EndoCV) challenge in 2021, we collected and curated a multi-centre dataset 16 aimed at generalisability assessment. For this, we took a strategic approach of providing single-modality (white light endoscopy, WLE) data from five hospitals (both single frames and sequences) for training and validation, while the test data consisted of four different configurations: a) mixed-centre unseen data in the WLE modality with samples from the centres in the training data, b) a different-modality dataset (narrow-band imaging, NBI) from all centres, c) hidden sixth-centre single-frame data, and d) hidden sixth-centre continuous frame sequence data. While unseen data from centres included in training assesses the traditional way of testing supervised machine learning methods on held-out data, unseen-modality and hidden-centre data gauge the algorithm's generalisability. Similarly, the sequence test data mimic the occurrence of polyps as observed in routine clinical colonoscopy procedures.

Detection and localisation
While classification methods are frame-based classifiers for polyps [17][18][19] , detection methods provide both classification and localisation of polyps 3,4 , which can direct clinicians to the site of interest and can additionally be used for counting polyps to assess disease burden in patients. With the advancements of object detection architectures, recent methods are end-to-end networks providing better detection performance and improved speed. The state-of-the-art methods are broadly divided into two categories: multi-stage detectors and single-stage detectors. Multi-stage detector methods include the Region proposal-based Convolutional Neural Network (R-CNN) 20 , Fast R-CNN 21 , Faster R-CNN 22 , Region-based Fully Convolutional Networks (R-FCN) 23 , the Feature Pyramid Network (FPN) 24 and Cascade R-CNN 25 . One-stage detectors, on the other hand, directly provide the predicted output (bounding boxes and object classification) from input images without the region of interest (ROI) proposal stage. One-stage detector methods include the Single-Shot Multibox Detector (SSD) 26 , YOLO 27 , RetinaNet 28 and EfficientDet 29 .
Different studies in the literature have focused on polyp detection by employing both multi-stage and single-stage detectors. Multi-stage detectors: Shin et al. 30 used a transfer learning strategy based on the Faster R-CNN architecture with an Inception ResNet backbone to detect polyps. Qadir et al. 4 adapted Mask R-CNN 31 to detect colorectal polyps and evaluated its performance with different CNNs, including ResNet50 32 , ResNet101 32 and Inception ResNetV2 33 , as its feature extractor. Despite their speed limitation, multi-stage detectors are widely used in the detection tasks of endoscopy data challenges due to their competitive performance on evaluation metrics. Single-stage detectors: Urban et al. 3 used YOLO to detect polyps in real-time, which also resulted in high detection performance. Lee et al. 34 employed YOLOv2 35 and validated the proposed approach on four independent datasets. They reported real-time performance and high sensitivity and specificity on all datasets. Zhang et al. 36 proposed the ResYOLO network, which adds residual learning modules to the YOLO architecture to train deeper networks; they reported near-real-time performance for ResYOLO depending on the hardware used. Zhang et al. 5 proposed an enhanced SSD, named SSD for Gastric Polyps (SSD-GPNet), for real-time gastric polyp detection. SSD-GPNet concatenates feature maps from lower layers and deconvolves higher layers using different pooling techniques. YOLOv3 37 with a darknet53 backbone and YOLOv4 showed IoU and average precision (AP) over 0.80 and real-time frame rates over 45 FPS. Moreover, there exist methods that rely on anchor-free detectors to locate polyps without the definition of anchors, such as CornerNet 38 and ExtremeNet 39 . Zhou et al. 40 proposed CenterNet, which treats each object as a point and increases speed significantly while keeping accuracy acceptable, while Wang et al. 41 achieved state-of-the-art results on automatic polyp detection in real-time settings using anchor-free object detection methods. In addition to these works, multi-stage, single-stage and other types of detectors have been widely used by participating teams in different polyp detection datasets and challenges such as MICCAI'15 42 , ROBUST-MIS 43 , EAD2019 15 and EndoCV2020 9 .

Segmentation
Semantic segmentation is the process of grouping related pixels in an image into an object of the same category. Deep learning has been very successful in the medical domain, and convolutional neural network (CNN) based techniques have been suggested to generate complete and precise segmentation outputs without requiring any post-processing. Deep learning-based medical segmentation methods can be grouped into four categories: models based on fully convolutional networks, models based on encoder-decoder architectures, models based on pyramid-based architectures and models based on dilated convolution architectures.
Models based on fully convolutional networks: Brandao et al. 44 proposed three different FCN-based architectures for the detection and segmentation of polyps from colonoscopy images. Zhang et al. 45 proposed a multi-step approach for polyp segmentation: the first step generates region proposals using an FCN, and the second step uses spatial features and a random forest classifier for refinement. A similar method was introduced by Akbari et al. 46 , which uses patch selection while training the FCN and Otsu thresholding to find the accurate location of the polyp. Guo et al. 7 described two FCN-based methods for the Gastrointestinal Image ANALysis (GIANA) polyp segmentation sub-challenge.
Models based on encoder-decoder architectures: Nguyen and Slee 6 proposed multiple deep encoder-decoder networks to capture multi-level contextual information and learn rich features during training. Zhou et al. 47 proposed UNet++, a deeply supervised encoder-decoder network that showed improved performance on the polyp segmentation task. Similarly, PraNet 48 aggregated deep features in a parallel partial decoder to form initial guidance area maps. Mahmud et al. 49 integrated dilated inception blocks into each unit layer and aggregated the features of different receptive fields to capture better-generalised feature representations. Huang et al. 50 proposed a low-memory-traffic, fast and accurate method for polyp segmentation, achieving 86 frames per second (FPS). Later, Zhang et al. 8 proposed a hybrid method combining a transformer-based network and a CNN to capture global dependencies and low-level spatial features for the segmentation task. Inspired by the high-resolution network 51 , Srivastava et al. 10 proposed the multi-scale residual fusion network (MSRF-Net), which allows information exchange across multiple scales and showed improved generalisability on unseen datasets. All of these encoder-decoder architectures were evaluated only on still images. Ji et al. 52 proposed a progressively normalised self-attention network (PNS-Net) for video polyp segmentation.
Models based on pyramid-based architectures: Jia et al. 53 proposed a pyramid-based model named PLPNet for automated pixel-level polyp classification in colonoscopy images. Guo et al. 54 employed the Pyramid Scene Parsing Network (PSPNet) 55 with SegNet 56 and U-Net 57 as an ensemble deep learning model. The proposed model achieved an improvement of up to 6.38% compared with a single base learner.
Models based on dilated convolution architectures: Sun et al. 58 used dilated convolution in the last block of the encoder, while Safarov et al. 59 used it in all encoder blocks. Although 59 used a mesh of attention blocks and residual blocks as a decoder, both methods tested their models on CVC-ClinicDB, achieving F1-scores of 96.106 and 96.043, respectively. Furthermore, the nested dilation network (NDN) 60 was designed to segment lesions and, tested on the GIANA2018 dataset, achieved improvements in Dice of up to 3% compared to other methods.

The EndoCV challenge: dataset, evaluation and submission
In this section, we present the dataset compiled for the polyp generalisation challenge, the protocol used to obtain the ground truth, the evaluation metrics defined to assess the participants' methods, and a brief summary of the challenge setup and ranking procedure.

Dataset and challenge tasks
Dataset: The EndoCV2021 challenge addresses the generalisability of polyp detection and segmentation in endoscopy frames. The colonoscopy video frames used in the challenge were collected from six different centres and include two modalities (WLE and NBI) with both sequence and non-sequence frames (see Figure 1 a). The challenge includes five different types of data, which participants were allowed to combine for their train-validation splits: i) multi-centre video frames from five centres for training and validation, ii) a polyp size-based split, iii) a single-frame and sequence data split, iv) a modality split (testing phase only) and v) one hidden-centre test (testing phase only). The training dataset consisted of a total of 3242 WLE frames from five centres (C1-C5) with both single and sequence frames. The test dataset consists of: a) a dataset with an unseen modality, NBI (data 1), b) a dataset with single frames from an unknown centre (data 2), c) frame sequences from the mixed centres (C1-C5, data 3), and d) unseen-centre sequence frames (C6, data 4). A total of 777 frames were used, and data 3 was chosen as the base dataset against which the generalisability of methods was assessed. The polyp size distribution (see Figure 1 (b), left) and the sizes in log-scale on images resized to the same resolution (540 × 720 pixels) (see Figure 1 (b), right) in both training and test sets are presented. These sizes were divided into null (no polyp in the frame), small (< 100 × 100 pixel bounding box), medium (≤ 200 × 200 pixel bounding box) and large (> 200 × 200 pixel bounding box). In the training set, these numbers were 534, 1129, 1224 and 705, respectively, for null, small, medium and large polyps (accounting for 3058 polyp instances). Similarly, for the test set the numbers were 134, 144, 296 and 261, respectively (701 polyp instances in total). The size distributions in the two datasets are nearly identical (see Figure 1 (b), right), which is due to the defined ranges used to represent their occurrence categorically.
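The size categorisation described above can be sketched as a simple binning function. This is an illustrative sketch, not challenge code; the function name is hypothetical and the thresholds follow the bounding-box ranges stated in the text (assuming "small"/"medium" require both box dimensions to fall under the threshold).

```python
def polyp_size_category(bbox_w, bbox_h):
    """Bin a polyp bounding box into the size categories used above
    (on frames resized to 540x720 pixels). A frame without a polyp
    bounding box is categorised as 'null'."""
    if bbox_w is None or bbox_h is None:
        return "null"
    if bbox_w < 100 and bbox_h < 100:      # < 100x100 pixels
        return "small"
    if bbox_w <= 200 and bbox_h <= 200:    # <= 200x200 pixels
        return "medium"
    return "large"                          # > 200x200 pixels
```

Applying this to every annotated bounding box in the training and test sets yields the per-category counts reported above.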
Challenge tasks: EndoCV2021 included two tasks for which the generalisability assessment was also conducted (see Figure 2): 1) a detection and localisation task and 2) a pixel-level segmentation task. For the detection task, participants were provided with both single and sequence frames with manually annotated ground truth polyp labels and their corresponding bounding boxes. Participants were required to train their models to predict class labels, bounding box co-ordinates and confidence scores for localisation. For the segmentation task, we provided expert pixel-level ground truth segmentations for the same data as provided for the detection task. Here, the participants were challenged to produce binary segmentation map predictions close to the ground truth. Both challenge tasks were assessed rigorously to understand the generalisability of the developed methods. To this end, the test data consisted of four different categories, referred to here as data 1, data 2, data 3 and data 4. Data 1 consisted of an unseen modality (NBI) widely used in colonoscopy, data 2 comprised single frames from the unseen centre C6, data 3 consisted of sequence data from the mixed seen centres (C1-C5), and data 4 included sequence data from the unseen centre C6. For generalisability, we compared the scores on data 3 (seen-centre data) with those on the other, unseen data categories. All test results were evaluated on a common NVIDIA Tesla V100 GPU. Further details on metric computation are provided in Section Evaluation metrics.

Ethical and privacy aspects of the data
The data for EndoCV2021 was gathered from six different centres located in five different countries (UK, Italy, France, Norway and Egypt). The ethical, legal and privacy aspects of the relevant data were handled by each responsible centre. Data collection at each centre included the following essential steps:
• Patient consenting procedure at each institution (required).
• Review of the data collection plan by a local medical ethics committee or an institutional review board.
• Anonymization of the video or image frames (including demographic information) before sending to the organizers (required).
Table 1 illustrates the ethical and legal processes fulfilled by each center along with the details of the endoscopic equipment and recorders used for the collected data.

Annotation protocol
The annotation process was conducted by a team of three experienced researchers using an online annotation tool called Labelbox 1 . Each annotation was cross-validated by the team and by the centre expert to ensure accurate boundary segmentation. At least one senior gastroenterologist was assigned to an independent binary review process. The following protocols for manual polyp annotation were designed: 1. Data collection information for each centre: data acquisition system and patient consenting information.
• Clear raised polyps: boundary pixels should include only protruded regions. Precaution has to be taken when delineating along the normal colon folds.
• Inked polyp regions: only the non-inked part of the appearing object is delineated.
• Polyps with instrument parts: the annotation should not include the instrument, must be carefully delineated and may form more than one object.
• Pedunculated polyps: the annotation should include all raised regions unless appearing on a fold.
• Flat polyps: regions identified as flat polyps were zoomed before manual delineation, consulting the centre expert if needed.
The annotated masks were examined by experienced gastroenterologists who gave a binary score indicating whether the annotation could be considered clinically acceptable. Additionally, some experts provided feedback on annotations, and these images were placed into an ambiguous category for further refinement based on the experts' feedback. A detailed process, along with the number of annotations conducted and reviewed, is outlined in Supplementary Figure 1.

Polyp detection
For the polyp detection task, we computed standard computer vision metrics, namely average precision (AP) and intersection-over-union (IoU) 61 .
• IoU: The IoU metric measures the overlap between a predicted bounding box A and a ground truth bounding box B as the ratio of their intersection to their union:

IoU(A, B) = |A ∩ B| / |A ∪ B|.

Here, ∩ represents the intersection and ∪ the union.
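A minimal sketch of the IoU computation for axis-aligned bounding boxes, given as (x1, y1, x2, y2) corner coordinates (the function name and box convention are illustrative, not taken from the challenge code):

```python
def bbox_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1.0 and disjoint boxes give 0.0; a detection is typically counted as a true positive when its IoU with a ground truth box exceeds a chosen threshold.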
• AP: AP is computed as the area under the precision-recall curve of detections, sampled at all unique recall values (r1, r2, ...) whenever the maximum precision value drops:

AP = Σ_n (r_{n+1} − r_n) p_interp(r_{n+1}), with p_interp(r_{n+1}) = max_{r̃ ≥ r_{n+1}} p(r̃).

Here, p(r_n) denotes the precision value at a given recall value. This definition ensures a monotonically decreasing precision envelope. AP was computed as the average of the APs at IoU thresholds from 0.50 to 0.95 in increments of 0.05. Additionally, we calculated AP small , AP medium and AP large . More details about the detection evaluation metrics and their formulas can be found here 2 .
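The all-point interpolated AP above can be sketched as follows, given matched precision/recall samples for one class at one IoU threshold (an illustrative sketch; the official evaluation uses the COCO-style tooling, not this function):

```python
def average_precision(precisions, recalls):
    """All-point interpolated AP: p_interp(r) is the maximum precision at
    any recall >= r; AP is the area under the interpolated P-R curve."""
    pts = sorted(zip(recalls, precisions))          # sort by recall
    rs = [0.0] + [r for r, _ in pts]                # prepend recall origin
    ps = [0.0] + [p for _, p in pts]
    # sweep right-to-left so precision is monotonically decreasing
    for i in range(len(ps) - 2, -1, -1):
        ps[i] = max(ps[i], ps[i + 1])
    # sum rectangle areas between consecutive recall points
    return sum((rs[i + 1] - rs[i]) * ps[i + 1] for i in range(len(rs) - 1))
```

Averaging this quantity over the IoU thresholds 0.50, 0.55, ..., 0.95 yields the mean AP reported for the detection task.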

Polyp segmentation
For the polyp segmentation task, we used widely accepted computer vision metrics, including the Sørensen-Dice coefficient (DSC = 2tp / (2tp + fp + fn)), precision (p = tp / (tp + fp)), recall (r = tp / (tp + fn)), overall accuracy (Acc = (tp + tn) / (tp + tn + fp + fn)), and F2 (= 5p × r / (4p + r)). In addition to these performance metrics, we also computed frames per second (FPS = #frames / sec). Here, tp, fp, tn and fn represent true positives, false positives, true negatives and false negatives, respectively. Another commonly used segmentation metric, based on the distance between two point sets, here the ground truth (G) and estimated or predicted (E) pixels, is the average Hausdorff distance (H_d), defined as the mean of the two directed average distances:

H_d(G, E) = 1/2 ( (1/|G|) Σ_{g∈G} min_{e∈E} d(g, e) + (1/|E|) Σ_{e∈E} min_{g∈G} d(e, g) ).

The mean H_d is normalised between 0 and 1 by dividing it by the maximum value H_d max for a given test set.
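The pixel-level metrics above can be computed directly from the confusion counts of two flattened binary masks. A minimal sketch (the function name is illustrative; the edge-case conventions for empty masks are an assumption):

```python
def segmentation_scores(pred, gt):
    """Pixel-level metrics from two equal-length binary masks (0/1),
    following the formulas above."""
    tp = sum(p and g for p, g in zip(pred, gt))
    fp = sum(p and not g for p, g in zip(pred, gt))
    fn = sum((not p) and g for p, g in zip(pred, gt))
    tn = sum((not p) and (not g) for p, g in zip(pred, gt))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0,
        "p": prec,
        "r": rec,
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "F2": 5 * prec * rec / (4 * prec + rec) if prec + rec else 0.0,
    }
```

Note that F2 weighs recall more heavily than precision, which suits a clinical setting where missing a polyp is costlier than over-segmenting it.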

Polyp generalisation metrics
We define the generalisability score based on the stability of the algorithm's performance between the seen white light modality and centre dataset (data 3) and the unseen centre splits (data 2 and data 4) and unseen modality (data 1) in the test dataset. We conducted the generalisability assessment for the detection and segmentation approaches separately.
For detection, the deviation in score between the seen and unseen data types is computed over the different AP categories, k ∈ {mean, small, medium, large}, with a tolerance tl = 0.1:

dev_k = |AP_k(data 3) − AP_k(data j)| if it exceeds the tolerance tl, and dev_k = 0 otherwise, for data j ∈ {data 1, data 2, data 4}.

Similarly, for segmentation, the deviation in score between the seen and unseen data types is computed over the different segmentation metric categories, k ∈ {DSC, F2, p, r, H_d}, with a tolerance tl = 0.05.
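The deviation scheme above can be sketched as a per-metric comparison between the seen split (data 3) and one unseen split. This is a sketch of the scheme as described, not the challenge's evaluation code; the exact aggregation used in the official ranking may differ:

```python
def deviation_score(seen, unseen, tol):
    """Per-metric deviation between seen-data scores (data 3) and
    unseen-data scores. Deviations within the tolerance are treated as
    zero, i.e. the method is considered stable on that metric."""
    dev = {}
    for k in seen:
        d = abs(seen[k] - unseen[k])
        dev[k] = 0.0 if d <= tol else d
    return dev
```

For detection the tolerance would be 0.1 over the AP categories, and for segmentation 0.05 over {DSC, F2, p, r, H_d}; smaller total deviation indicates better generalisability.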

Challenge setup, and ranking procedure
We set up a challenge website 3 with an automated docker system for a metric-based ranking procedure. Challenge participants were required to perform inference on our cloud-based system, which incorporated an NVIDIA Tesla V100 GPU and provided the test dataset with instructions for using it directly on the GPU, without downloading the data, for rounds 1 and 2. We added an additional round 3, in which the challenge participants' trained models were used by the organisers for inference on an additional unseen sequence dataset. Thus, the challenge consisted of three rounds. All provided test frames were from unseen patient data to avoid any data leakage. The data samples in each round are summarised below:
• Round 1: Test subset-I consisted of 50 samples each of data 1 (unseen modality), data 2 (unseen single frames, C6) and data 3 (mixed-centre C1-C5 sequence data).
• Round 2: Test subset-II comprised all 88 samples of data 1 (unseen modality), 86 samples of data 2 (unseen single frames, C6) and 124 samples of data 3 (mixed C1-C5).
• Round 3: Inference on round 3 data was performed by the organisers on the same GPU. This round comprised 135 samples of data 1 (unseen modality), 86 samples of data 2 (unseen single frames, C6), 124 samples of data 3 (mixed-centre C1-C5 sequence data) and an additional set of 432 sequence samples from the unseen centre C6.
We conducted eliminations after both round 1 and round 2 based on the metric scores on the leaderboard and timely submission. In round 2, we also eliminated entries with very high inference times or inconsistent metric scores. The selected participants were asked to submit a method description paper to the EndoCV proceedings 62 to allow transparent reporting of their methods. All accepted methods were eligible for round 3 evaluation and are reported in this paper.
Detection ranking was performed using an aggregated score combining the average precision and the deviation scores between data 3 (seen, C1-C5) and the other, unseen data in the test set. Similarly, for team ranking on segmentation, we used the segmentation metrics and the corresponding deviation scores (between seen and unseen data). Please refer to Section Evaluation metrics for details. An aggregated rank was used to announce the winner. In the final ranking reported in this paper, we additionally used inference time.

Method summary of the participants
Below, we summarise the EndoCV2021 generalisability assessment challenge methods for polyp detection and segmentation using deep learning. Tabulated summaries are also provided, highlighting the nature of the devised methods and the basis of choice in terms of speed and accuracy for detection (see Table 2) and segmentation (see Table 3). The methods are detailed in the compiled EndoCV2021 challenge proceedings 62 .

Detection Task
• AIM_CityU: 63 The team used the one-stage anchor-free FCOS 67 as the baseline detection algorithm and adopted ResNeXt-101-DCN with FPN as the final feature extractor. For model optimisation, both online (random flipping and multi-scale training) and offline (random rotation, gamma contrast, brightness transformation, etc.) data augmentation strategies were performed to improve model generalisation.
• HoLLYS_ETRI: 64 The team used Mask R-CNN 31 for both the detection and segmentation tasks. All weights were initialised with pre-trained weights. An ensemble learning method based on 5-fold cross-validation was used to improve generalisation performance: rather than training a single Mask R-CNN on data from all centres, each model was trained on the data from four centres and validated on the data from the remaining centre. Ensemble inference was performed by combining the inference results of the 5 models. For the detection task, the weighted box fusion technique 68 was used to combine the detection results. For the segmentation task, the segmentation masks from the 5 models were averaged with an IoU threshold of 0.6.
• JIN_ZJU: 65 The team used YOLOv5 69 as the baseline detection algorithm. To improve the generalisation ability of the standard YOLOv5, different data augmentation methods were applied, including hue, saturation and value adjustment, rotation, translation, scaling, up-down flipping, left-right flipping, mosaic and mixup.
• GECE_VISION: 66 The team proposed an ensemble-based polyp detection architecture using the EfficientDet 29 family as base models with EfficientNet as the backbone network. Bootstrap aggregating (bagging) was used to aggregate different versions of the predictors (EfficientDet D0, D1, D2, D3), which were trained on bootstrap replicates of the training set. To increase variance and improve the generalisation capability of the model, data augmentation was used (scale jittering with factors 0.2-2.0, horizontal flipping, and rotation between 0° and 360°). The Adam optimiser was used with a scheduled learning rate, decreased by a factor of 0.2 whenever the validation loss did not improve over the last 10 epochs.

Segmentation Task
• aggcmab: The team 70 improved their previously developed cascaded double encoder-decoder convolutional neural network 76 by increasing the encoder representation capability and adopting a multi-site sampling technique. The first encoder-decoder generates an initial attempt to segment the polyp by extracting features and downsampling spatial resolution while increasing the number of channels through learned convolutional filters. The output of the first network acts as input for the second encoder-decoder, along with the original image. Cross-entropy loss was minimised using stochastic gradient descent with a batch size of 4 and a learning rate lr = 0.01, with a rate decay of 1e-8 every 25 epochs. The training images were resized to 640×512, and data augmentation (e.g. random rotations, vertical/horizontal flipping, contrast, saturation and brightness changes) was applied. At test time, four versions of each image were generated (via horizontal and vertical flipping) and the average result was calculated.
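The flip-based test-time averaging used by this entry can be sketched as follows. This is a generic illustration, not the team's code; `model` stands for any hypothetical function mapping an HxW image to an HxW probability map:

```python
import numpy as np

def tta_flip_average(model, image):
    """Test-time augmentation by flipping: run the model on the original
    image and its horizontally/vertically flipped versions, un-flip each
    prediction back to the original orientation, and average."""
    variants = [
        (image, lambda m: m),                        # original
        (image[:, ::-1], lambda m: m[:, ::-1]),      # horizontal flip
        (image[::-1, :], lambda m: m[::-1, :]),      # vertical flip
        (image[::-1, ::-1], lambda m: m[::-1, ::-1]) # both flips
    ]
    preds = [undo(model(view)) for view, undo in variants]
    return np.mean(preds, axis=0)
```

Averaging the un-flipped predictions smooths out orientation-dependent errors at the cost of roughly four forward passes per frame.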
• AIM_CityU: The team 63 adopted HRNet 51 as the backbone to maintain high-resolution representations in a multi-scale feature fusion mechanism. To further eliminate noisy information in the segmentation predictions and enhance model generalisation, the team proposed a low-rank module that maps feature maps from the high-dimensional space to a low-dimensional manifold. For model optimisation, various data augmentation strategies, including random flipping, rotation, colour shift (brightness, colour, sharpness and contrast) and Gaussian noise, were used to further improve model generalisation. Cross-entropy and Dice losses were utilised to optimise the whole model.
• HoLLYS_ETRI: The team 64 proposed an ensemble inference model based on 5-fold cross-validation to improve the performance of polyp detection and segmentation. Mask R-CNN was used to generate the output segmentation mask. Ensemble inference was used to generate the final segmentation mask by averaging the results from the 5 models. After averaging the masks, pixels whose inference result was greater than the threshold (0.6) were considered polyp; otherwise they were counted as background. Data augmentation was performed using the techniques provided in Detectron2. The model was trained for 50,000 steps, checkpoints were saved every 1,000 steps, and the learning rate lr = 0.001 was adjusted with a warm-up scheduler.
• MLC_SimulaMet: The team 71 developed two ensemble models using well-known segmentation models: UNet++ 47 , FPN 24 , DeepLabv3 77 , DeepLabv3+ 78 and the novel TriUNet in their DivergentNet ensemble model, and three UNet 57 architectures in their TriUNet ensemble model. The novel TriUNet model takes a single image as input, which is passed through two separate UNet models with different randomised weights. The outputs of both models are then concatenated before being passed through a third UNet model to predict the final segmentation mask. The whole TriUNet network was trained as a single unit. Thus, the proposed DivergentNet included five segmentation models.
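The averaging-and-thresholding step used by ensemble entries such as the one above can be sketched in a few lines. This is an illustrative sketch under the assumption that each model emits a per-pixel probability map; it is not the teams' actual inference code:

```python
import numpy as np

def ensemble_masks(prob_maps, threshold=0.6):
    """Average per-model probability maps and binarise: pixels whose
    mean score exceeds the threshold are labelled polyp (1), the rest
    background (0)."""
    mean_map = np.mean(np.stack(prob_maps), axis=0)
    return (mean_map > threshold).astype(np.uint8)
```

A higher threshold trades recall for precision, since more models must agree before a pixel is kept in the final polyp mask.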
• sruniga: The team 72 proposed a lightweight deep learning-based algorithm to meet real-time clinical needs. The proposed network uses HarDNet-MSEG 50 as the backbone network, as its reduced shortcuts give it a low inference time. Moreover, they proposed an augmentation strategy to realise a more generalisable model, with data augmentation applied according to a set probability. For training, the dataset was split into 80% training and 20% validation, using the Adam optimiser and a learning rate lr of 0.00001 for all experiments.
Images were resized to 352×352, and data augmentation was applied according to the team's proposed algorithm.
• Mah_UNM: The team 74 proposed modifying SegNet 56 by embedding gated recurrent unit (GRU) units 79 within the convolution layers to improve its performance in segmenting polyps. The hyperparameters were set as in the original SegNet, with a learning rate lr of 0.005 and a batch size of 4. A multiplicative factor gamma of 0.8 was used for the learning rate decay, with the Adam optimizer and a weighted cross-entropy loss. The provided dataset was split into 80% training and 20% validation.
• NDS_MultiUni: The team 75 suggested building a cascaded ensemble model made of MultiResUNet 80 architectures. The input image was fed to four different MultiResUNet models, and each model generated an output mask. Afterwards, the four predicted outputs were averaged to produce the final segmentation mask. Each model was trained for 100 epochs with the same hyperparameter settings. The input images were resized to 256×256, with a batch size of 8, binary cross-entropy as the loss function, and the Adam optimizer. The learning rate was set to lr = 1e−3, using the ReduceLROnPlateau callback.
• YCH_THU: The team 73 used the existing parallel reverse attention network (PraNet) 48. They extracted multi-level features from colonoscopy images using a parallel Res2Net-based network. Moreover, the segmentation results were post-processed to remove uncertain pixels and enhance the boundary. The images were resized to 512×512 and the dataset was split into 80% training and 20% validation. The model was trained for 300 epochs with a batch size of 18 and a learning rate lr of 1e−4, which was reduced every 50 epochs.
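Several of the approaches above (HoLLYS_ETRI, NDS_MultiUni) produce their final mask by averaging per-model predictions and thresholding. A minimal sketch of this ensemble step, assuming per-model outputs are probability maps in [0, 1] and using HoLLYS_ETRI's reported threshold of 0.6 (illustrative only, not the teams' code):

```python
import numpy as np

def ensemble_mask(prob_maps, threshold=0.6):
    """Average per-model probability maps and binarise.

    prob_maps: list of HxW probability maps, one per model.
    Pixels whose mean probability exceeds the threshold are labelled
    polyp (1); the rest are counted as background (0).
    """
    mean_map = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_map > threshold).astype(np.uint8)

# Example: five identical 2x2 model outputs, so the mean equals each map
maps = [np.array([[0.9, 0.2], [0.7, 0.1]]) for _ in range(5)]
mask = ensemble_mask(maps)
print(mask)  # [[1 0]
             #  [1 0]]
```

The same helper covers NDS_MultiUni's four-model averaging by passing four maps instead of five.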

Results
The EndoCV2021 challenge focused on the detection and segmentation of polyps of different sizes in endoscopic frames. The endoscopy video frames were gathered from six worldwide centres, including two different modalities (i.e. white light and narrow-band imaging). The frames were annotated by clinical experts in the challenge team for the purposes of detection and localisation. The training dataset consisted of a total of 3242 frames from five centres only, with the release of binary masks for the segmentation task and bounding box coordinates for the detection task. For the test dataset, frames from centre six were included to provide an overall 777 frames from the six centres, with a variation between single and sequence frames. There was a variation in polyp size in both the training and testing sets, as shown in Fig. 1b.

Aggregated performance and ranking on detection task
Table 4 presents the average precision (AP) computed at three different IoU thresholds and AP at different scales for the participating teams on the four datasets. Moreover, results from the baseline methods YOLOv4 81, RetinaNet (ResNet50) 28 and EfficientNet-D2 are provided. The methods presented by teams HoLLYS_ETRI and JIN_ZJU outperform the other teams in terms of AP values for the single-frame datasets (i.e. both data 1 (NBI) and data 2 (WLE)). The results of both teams on data 1 showed an increased difference for AP mean (>20%), AP 50 (>15%) and AP 75 (>18%) when compared to the other teams. However, for data 2, team AIM_CityU produced comparable results, leading them to third place with a small difference of 0.88% in AP mean score compared to team HoLLYS_ETRI. For the seen sequence dataset (data 3), team JIN_ZJU maintained the top performance for AP mean (i.e. higher than the second-best team AIM_CityU by 4.19%) and AP 75 (i.e. higher than the second-best team HoLLYS_ETRI by 3.29%). Team HoLLYS_ETRI maintained their top performance with the highest result for AP 50, with a difference of 2.10% compared to AIM_CityU in second place. Furthermore, the method by HoLLYS_ETRI surpassed the results of the other teams and baseline methods on the unseen sequence (data 4), where the second-placed team trails by >0.037 AP mean, >0.04 AP 50 and >0.055 AP 75. In general, as concluded from Table 4, teams HoLLYS_ETRI, JIN_ZJU and AIM_CityU had the best performance, while the baseline methods did not outperform any of the proposed methods.
Table 5 shows the ranking of the detection task of the polyp generalisation challenge after calculating the average detection score, average deviation scores and time. Team AIM_CityU ranked first, with an inference time of 0.10 seconds per frame and the lowest deviation scores of dev_g 2−3 (0.1339), dev_g 4−3 (0.0562) and dev_g (0.0932). Team HoLLYS_ETRI followed in second place, with an increased inference time of 0.69 s per frame and the top average detection score of 0.4914. In third place, team JIN_ZJU reported an inference time of 1.9 s per frame and the second-best average detection result of 0.4783.
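The AP values in Table 4 are computed at fixed IoU thresholds between predicted and ground-truth bounding boxes. As a reference for how a prediction is matched at, say, AP 50, a standard IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format (an illustrative helper, not the challenge's evaluation code):

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive at AP 50 if IoU >= 0.5
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 ≈ 0.333
```

AP at a threshold is then the area under the precision-recall curve built from such matches, as in the standard COCO-style protocol.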

Aggregated performance and ranking on segmentation task
Figure 4 (a) shows the boxplots for each team and the baseline methods. It can be observed that the median values for all area-based metrics (Dice, precision, recall and F2) are above 0.8 for most teams when compared on all 777 test samples. However, a greater variability can be observed for all teams and baselines, represented by a large number of outlier samples.
For the mean distance-based normalised metric (1 − Hd), only a marginal change can be seen, for which the top teams have higher values, as expected. On observing closely only the Dice similarity metric in Figure 4 (a), where dot and box plots are provided, teams MLC_SimulaMet and aggcmab obtained the best scores, demonstrating the least deviation and with most samples concentrated in the interquartile range (IQR). It can be observed that the pairs aggcmab and MLC_SimulaMet; DeepLabV3+ (ResNet50) and ResNetUNet (ResNet34); and HoLLYS_ETRI and PSPNet have similar performances, since their quartile Q1, Q2 and Q3 scores are very close to each other. Although the mean DSC score of team aggcmab is slightly higher than that of MLC_SimulaMet, there was no statistically significant difference between these two teams. However, both of these teams showed a significant difference with p < 0.05 when compared to the best-performing baseline, DeepLabV3+ (ResNet50).
Table 6 presents the JC, DSC, F2, PPV, recall, accuracy and HDF acquired by the top five participating teams and the baseline methods (i.e. FCN8, PSPNet, DeepLabV3+ and ResNetUNet) on data 1 to data 4, respectively. As shown in the table for data 1 (NBI still images), the methods suggested by teams sruniga and AIM_CityU outperformed the other teams and baseline methods in terms of JC (>65%), DSC (>74%) and F2 (>73%). Team sruniga had an outstanding performance in segmenting fewer false-positive regions, achieving a PPV of 81.52%, which is higher than the other methods by at least 5%. Nevertheless, the top recall values of teams MLC_SimulaMet and HoLLYS_ETRI (>86%) prove their ability to detect more true-positive regions. The accuracy results on this data were comparable between all teams and baseline methods, ranging from 95.78% to 97.11%, with the best performance by team AIM_CityU. The results on data 2 (WLE still images) are also presented in Table 6. For this data, the methods developed by teams MLC_SimulaMet and aggcmab produced the top values for JC (>0.77), DSC (>0.82) and F2 (>0.81), with comparable results between the two teams. The top PPV value was again held by the method proposed by team sruniga (as discussed for data 1), with a value of 0.8698 ± 0.21, followed by team MLC_SimulaMet in second place with a value of 0.8635 ± 0.26. Additionally, the method by team MLC_SimulaMet surpassed the results for all evaluation measures compared to the other teams and baseline methods on data 3, as shown in the table. Moreover, the method proposed by team aggcmab comes in second place, with more than a 5% reduction in the JC, DSC and HDF results. For this dataset, the baseline method DeepLabV3+ (ResNet50) showed improved performance compared to the previously discussed data (i.e. data 1 and data 2), acquiring second place for F2 and accuracy with results of 82.66% and 95.99%, respectively. Similarly to the performance of the teams on data 3, as shown in Table 6 (i.e. on data 4 (unseen sequence)), the methods by teams MLC_SimulaMet and aggcmab produce the best results for most of the evaluation measures: JC (>0.68), DSC (>0.73), F2 (>0.71), ACC (>0.97) and HDF (<0.34). Generally, throughout the evaluation on the different datasets, team sruniga provided a high PPV value on data 1, data 2 and data 4. Furthermore, the baseline methods showed low performance in terms of final score compared to the methods proposed by the participants, especially on data 1, data 2 and data 4.
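The area-based metrics reported in Table 6 derive from pixel-wise true/false positives and negatives on the binary masks. A compact sketch of how DSC, PPV (precision), recall and F2 can be computed (illustrative only; the challenge used its own evaluation scripts):

```python
import numpy as np

def seg_metrics(pred, gt, eps=1e-8):
    """Pixel-wise Dice (DSC), precision (PPV), recall and F2 for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    ppv = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    # F-beta with beta = 2 weights recall higher than precision
    f2 = 5 * ppv * rec / (4 * ppv + rec + eps)
    return dsc, ppv, rec, f2

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
# tp = 1, fp = 1, fn = 1 -> DSC = PPV = recall = F2 = 0.5
```

A high PPV with lower recall (as reported for team sruniga) means few false-positive pixels at the cost of missed polyp pixels, and vice versa for the high-recall teams.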
To understand the behaviour of each method on the provided test data splits, we plotted the DSC values for each split separately and compared the ability of the methods to generalise across them. From Figure 4 (c-d) it can be observed that the difference in data setting affects almost all methods. There is up to a nearly 20% gap in the performance of the same methods when tested on WLE versus NBI, and similarly for the single versus sequence frame case and for the unseen centre data. However, it could be observed that methods with very close values across splits (e.g., HoLLYS_ETRI) suffered in overall performance compared to the other methods.
To assess the generalisability of each method, we also computed deviation scores for semantic segmentation, referred to as dev_g (see Table 7 and Figure 4 (f)). For this assessment, team aggcmab ranked first on both the average segmentation score R seg and the deviation score R dev. Even though team sruniga was only third on R seg, they were second on R dev and ranked first for computation time, with an average inference time of only 17 ms per frame. Team MLC_SimulaMet was ranked only third due to their large computational time of 120 ms per frame and larger deviations (lower generalisation ability). We provide the results of teams with performance below the baselines and poor ranking compared to the top five teams in Supplementary Table 2 for completeness. It is to be noted that these teams were also selected in round 3 of the challenge, but have not been analysed in this paper due to their below-baseline scores.
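The deviation score dev_g summarises how much a method's performance varies across the data splits; the exact formula is defined in the challenge evaluation protocol. As a rough sketch, under the assumption that each pairwise term (e.g. dev_g 2−3) is the absolute gap between a split's mean score and the seen-sequence split (data 3), and that dev_g aggregates these gaps:

```python
def deviation_score(scores):
    """Hypothetical sketch of a dev_g-style score (assumed form, not the
    challenge's exact definition).

    scores: dict of mean scores per split, e.g. {'data1': 0.7, ...}.
    Assumes each dev_g_{i-3} is |score_i - score_3| and the aggregate
    dev_g is their average; lower means better generalisation.
    """
    ref = scores['data3']
    gaps = {k: abs(v - ref) for k, v in scores.items() if k != 'data3'}
    dev_g = sum(gaps.values()) / len(gaps)
    return gaps, dev_g

gaps, dev_g = deviation_score(
    {'data1': 0.70, 'data2': 0.80, 'data3': 0.75, 'data4': 0.65})
# gaps ≈ {'data1': 0.05, 'data2': 0.05, 'data4': 0.10}; dev_g ≈ 0.067
```

Under this reading, a method with high but uneven scores can rank behind a slightly weaker but more consistent one, which matches the aggcmab versus MLC_SimulaMet outcome in Table 7.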

Discussion
While polyp detection and segmentation using computer vision methods, in particular deep learning, have been widely studied in the past, rigorous assessment and benchmarking on centre-wise splits, modality splits and sequence data have not been comprehensively studied. In our EndoCV2021 edition, we challenged participants to address generalisability issues in polyp detection and segmentation methods by providing multi-centre and diverse data. For polyp detection and localisation, 3/4 teams chose feature pyramid-based networks, while one team used a YOLOv5 ensemble paradigm. Unlike most of these methods, which require anchors to detect multiple objects of different scales and overlap, team AIM_CityU 63 used an anchor-free fully convolutional one-stage object detection (FCOS) method. HoLLYS_ETRI 64 focused mostly on accuracy and used an ensemble to train five different models, i.e., one model per centre, and an aggregated model output was devised for test inference. Even though team HoLLYS_ETRI led the leaderboard ranking on the average detection score on almost all data types, their observed detection speed (0.69 s) and the deviation in generalisation score put them only in second rank (see Table 5). On the contrary, team AIM_CityU, with their anchor-free single-stage network, performed consistently well on almost all data, with the fastest inference (0.1 s) and the least deviation score among teams (see Figure 3 and Table 5), hence leading the leaderboard.
Hypothesis I: It can be hypothesised that anchor-free detection methods generalise better than anchor-based methods on heterogeneous multi-centre datasets. This is supported by the fact that the polyp sizes in the dataset vary widely (see Figure 1 b) and the image sizes range from 388 × 288 pixels to 1920 × 1080 pixels.
Since all methods were trained on the single-frame images provided in this challenge, it can be observed in Table 4 that the detection scores for all methods are relatively higher for data 2 (WLE-single) compared to the other data categories, even though these frames came from the unseen data centre 6. However, a performance drop can be observed for both the seen (centres C1-C5) and unseen (centre C6) sequence data, which consisted of WLE images only. In addition, a change in modality has a detrimental effect on the performance of all methods, even on single frames (see data 1, NBI-single, Table 4).
Hypothesis II: It can be hypothesised that methods trained on single frames produce sub-optimal and inconsistent detections in videos, as image-based object detection cannot leverage the rich temporal information inherent in video data. The scenario worsens when a method is applied to a different centre from the one on which it was trained. To overcome this, Long Short-Term Memory (LSTM)-based methods can be used to keep the temporal information for pruning predictions 87.
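As a simple illustration of the kind of temporal pruning Hypothesis II points at (a stand-in illustration, not the cited LSTM-based method), per-frame detection confidences can be smoothed over a video with an exponential moving average, so that an isolated spurious detection lacking support from neighbouring frames is suppressed:

```python
def smooth_confidences(conf_per_frame, alpha=0.5):
    """Exponential moving average over per-frame detection confidences.

    A detection kept only when its smoothed confidence is high requires
    support from neighbouring frames, pruning one-frame false positives.
    """
    smoothed, prev = [], conf_per_frame[0]
    for c in conf_per_frame:
        prev = alpha * c + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

# A single-frame spike (0.9) among low-confidence frames is damped:
print(smooth_confidences([0.1, 0.1, 0.9, 0.1], alpha=0.5))
```

Recurrent models generalise this idea by learning what to carry over between frames instead of using a fixed decay factor.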
For the segmentation task, while most teams used ensemble techniques targeting to win on the leaderboard (MLC_SimulaMet 71, HoLLYS_ETRI 64, aggcmab 70), some teams worked towards model efficiency (e.g., team sruniga 72) or modifications for faster inference and improved accuracy (e.g., team AIM_CityU 63). The lightweight model using a HarDNet68 backbone with aggregated maps across scales (team sruniga) and the multi-scale feature fusion network (HRNet) with low-rank disentanglement used by team AIM_CityU outperformed all other methods on the narrow-band imaging modality (data 1), including the baseline segmentation methods (see Table 6). These methods showed acceptable performance for single frames on unseen data (data 2, WLE-single) as well. However, on sequence data (both the seen sequence data 3 and the unseen sequence data 4), both of these methods performed poorly compared to the ensemble-based techniques (see Figure 4 (c-e)). The conjoined networks of MLC_SimulaMet and the dual UNet network used by team aggcmab have the disadvantage of large inference time (nearly 6 times higher than the fastest method).
Hypothesis III: It can be hypothesised that, on single-frame data, multi-scale feature fusion networks perform better irrespective of modality changes, without requiring an ensemble of the same or multiple models for inference, which otherwise increases both model complexity and inference time. However, on sequence data we advise incorporating temporal information propagation into the designed networks. Furthermore, to improve model generalisation on unseen modalities, domain adaptation techniques can be applied 11.
HoLLYS_ETRI 64 used an instance segmentation approach with five separate models, each trained on one of the C1 to C5 training centres. It can be observed that this scheme provided better generalisation ability in most cases, leading to the least deviation in average Dice score (see Figure 4 (f)). However, their reported Dice metric values were lower than those of most methods, especially the ensemble techniques that are targeted towards higher accuracy but are less generalisable (in terms of consistency of test inference across multiple data categories). This is also evident in Figure 5, where the proportion of samples from data 1 for the top-performing teams aggcmab and MLC_SimulaMet falls mostly in the third and fourth ranks.
Hypothesis IV: It can be hypothesised that pretext tasks can lead to improved generalisability. However, to boost model accuracy, modifications are desired, which could include feature fusion blocks and other aggregation techniques.

Conclusion
We provided a comprehensive dissection of widely used deep learning baseline methods and the methods devised by the top participants in the crowd-sourced EndoCV2021 challenge. Through our experimental design, the provided multi-centre dataset and holistic comparisons, we demonstrate the need for generalisable methods to tackle the real-world clinical challenges of robust polyp detection and segmentation. While most methods improved over several widely used baseline methods, their design adversely impacted algorithmic robustness and/or real-time capability when provided with unseen sequence data and a different modality. A better trade-off between inference time and generalisability is the key takeaway of this work. We provide experiment-based hypotheses to encourage future research towards innovating more applicable methods that can work effectively on multi-centre data and the diverse modalities that are widely used in colonoscopic procedures.

Supplementary Table 2. Semantic segmentation results for teams ranking below 5th place on out-of-sample data 1, data 2, data 3 and data 4.


Figure 1. Multi-centre training and test samples: a. Colonoscopy video frames for which the annotation samples were reviewed and released as training (left) and test (right). Training samples included nearly proportional frames from five centres (C1-C5), while test samples consist of a majority of single and sequence frames from the unseen centre (C6); b. Number of polyp or non-polyp samples based on polyp sizes on resized image frames of 540 × 720 pixels (left) and their intra-size variability (right) for training (top) and testing data (bottom).

Figure 3. Generalisation assessment on the detection task: (left) mean average precision (mAP) on all data versus deviation computed between the seen centre with unseen modality and the unseen centre. The least deviation with larger mAP is desired. (right) Comparison of teams and baseline methods on seen-centre sequence data versus unseen-centre sequence data, C6. Higher values along both axes are desired.

Figure 4. Generalisation assessment on the segmentation task: a) Box plots for all segmentation metrics (Dice coefficient, DSC; precision, PPV; recall, Rec; F2; and Hausdorff distance, Hd) used in the challenge for all test data samples. b) Boxplots representing descriptive statistics over all cases (median, quartiles and outliers) combined with horizontally jittered dots representing individual data points in all test data. A red line represents the best median line. It can be observed that teams aggcmab and MLC_SimulaMet have similar results, and a Friedman-Nemenyi post-hoc p value < 0.05 denotes a significant difference with the best-performing baseline DeepLabv3+ method. c) White light endoscopy (WLE) versus narrow-band imaging (NBI), d) single versus sequence data, e) seen centres (C1-C5) versus unseen centre (C6), and f) deviation scores.

Figure 5. Algorithmic ranks across bootstrap samples 86, displayed with colours according to the data categories: histogram bars show what proportion (in %) of each dataset contributes to the ranking of each team and baseline method. Here, for ranking, we have only considered Dice similarity coefficient values.

Approved by the data inspectorate. No further ethical approval was required, as it did not interfere with patient treatment.

Table 2. Summary of the participating teams' detection task algorithms for the EndoCV2021 Challenge. All tests were done on an NVIDIA V100 GPU provided by the organisers.

Table 3. Summary of the participating teams' algorithms for the segmentation task of the EndoCV2021 Challenge. The top five teams are shown above the horizontal line and the worse-performing teams in round 3 are provided below this line.

Table 5. Ranking of the detection task of the polyp generalisation challenge.

Table 6. Team results for the polyp segmentation methods proposed by the participating teams as well as for the baseline methods. All results are given for data 1, data 2, data 3 and data 4. Top evaluation criteria are highlighted in bold.

Table 7. Ranking of the segmentation task of the polyp generalisation challenge: ranks are provided based on a) semantic score aggregation, R seg; b) average deviation score, R dev; and c) an overall ranking (R all) that takes into account R seg, R dev and time. For ties in the final ranking (R all), the segmentation score is taken into account. For time, ranks are provided in three categories: teams with < 50 ms, between 50−100 ms and > 100 ms. The top two values are in bold.