A multi-centre polyp detection and segmentation dataset for generalisability assessment

Polyps in the colon are widely known cancer precursors identified at colonoscopy. Whilst most polyps are benign, their number, size and surface structure are linked to the risk of colon cancer. Several methods have been developed to automate polyp detection and segmentation. However, these methods are rarely tested rigorously on a large, purpose-built multi-centre dataset, one reason being the lack of a comprehensive public dataset. As a result, the developed methods may not generalise to datasets from different populations. To this end, we have curated a dataset from six unique centres incorporating more than 300 patients. The dataset includes both single-frame and sequence data, with 3762 annotated polyp labels whose precise polyp-boundary delineations were verified by six senior gastroenterologists. To our knowledge, this is the most comprehensive detection and pixel-level segmentation dataset (referred to as PolypGen) curated by a team of computational scientists and expert gastroenterologists. The paper provides insight into data construction and annotation strategies, quality assurance, and technical validation.

particularly challenging to assess the malignant potential of lesions smaller than 10 mm. As a consequence, most detected lesions are removed, with a subsequent reduction in CRC mortality 4. The removal of lesions also depends on an exact delineation of their boundaries to ensure complete resection. If lesions are detected and completely removed at a precancerous stage, mortality is nearly null 5. Unfortunately, there is a considerable limitation related to varying human skill 6,7, confirmed in a recent systematic review and meta-analysis demonstrating miss rates of 26% for adenomas, 9% for advanced adenomas and 27% for serrated polyps 8. A thorough and detailed assessment of the neoplasia is essential to judge the malignant potential and the appropriate treatment. This assessment is based on size, morphology and surface structure. Currently, the Paris classification, prone to substantial inter-observer variation even among experts, is used to assess the morphology 9. The surface structure, classified by the Kudo pit pattern classification system or the Narrow-Band Imaging International Colorectal Endoscopic (NICE) classification system, also helps to predict the risk and degree of malignant transformation ?. These classification systems may to some extent also predict the histopathological classification into either adenomas, sessile serrated lesions (SSLs), hyperplastic polyps or traditional serrated adenoma (TSA) ?
. Unfortunately, these macroscopic classification systems are prone to substantial inter-observer variation; thus a high-performing automatic computer-assisted system would be of great importance, both to increase detection rates and to reduce inter-observer variability. To develop such a system, large segmented image databases are required. While current deep learning approaches have been instrumental in the development of computer-aided diagnosis (CAD) systems for polyp identification and segmentation, most of these trained networks suffer a large performance gap when out-of-sample data have large domain shifts. On one hand, training models on large multi-centre datasets all together can lead to improved generalisation, but at an increased risk of false detection alarms 10. On the other hand, training and validation on centre-based splits can improve model generalisation. Most reported works are not focused on multi-centre data at all, mostly because of the lack of comprehensive multi-centre and multi-population datasets. In this paper, we present the PolypGen dataset, which incorporates colonoscopy data from 6 different centres covering multiple patients and varied populations. Attentive splits are provided to test the generalisation capability of methods for improved clinical applicability. The dataset is also suitable for exploring federated learning and for training other time-series models. PolypGen can be pivotal in algorithm development and in providing more clinically applicable CAD detection and segmentation systems.
Although there are some publicly available datasets of colonoscopic single frames and videos (Table 1), the lack of pixel-level annotations and the preconditions applied for accessing them pose challenges to their wide usability for method development. Many of these datasets are available only by request, which requires approval from the data provider; approval usually takes a prolonged time and is not guaranteed. Similarly, some datasets do not include pixel-level ground truth for the abnormality location, which causes difficulty in the development or validation of CAD systems (e.g., El Salvador atlas 11 and Atlas of GI Endoscope 12). Moreover, most of the publicly available datasets include a limited number of image frames from one or a few centres only (e.g., the datasets provided in 13,14,15). To this end, the presented PolypGen dataset is composed of a total of 8037 frames including both single and sequence frames. This comprehensive dataset consists of 3762 positive sample frames collected from six centres and 4275 negative sample frames collected from four different hospitals. The PolypGen dataset comprises varied population data, endoscopic systems, surveillance experts and treatment procedures for polyp resection. A t-SNE plot for the positive samples provided in Fig. 1 demonstrates the diversity of the compiled dataset.
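A diversity visualisation like the one in Fig. 1 can be sketched as follows. This is only an illustrative stand-in, not the authors' pipeline: the deep-autoencoder feature extraction is replaced by random latent vectors, and all names and dimensions are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for autoencoder-extracted image features:
# 50 "images", each encoded as a 64-D latent vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 64))

# Embed into 2D with t-SNE; each row of `embedding` is one image's point
# in a plot like Fig. 1. Perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
print(embedding.shape)
```

In practice the features would come from an autoencoder trained on the positive frames, and points could be coloured by centre to inspect heterogeneity.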

Ethical and privacy aspects of the data
Our multi-centre polyp detection and segmentation dataset consists of colonoscopy video frames representing varied patient populations imaged at six different centres in Egypt, France, Italy, Norway and the United Kingdom (UK). Each centre was responsible for handling the ethical, legal and privacy aspects of its data. The data collection from each centre included either two or all of the essential steps described below:

1. Patient consenting procedure at each individual institution (required)
2. Review of the data collection plan by a local medical ethics committee or an institutional review board
3. Anonymisation of the video or image frames (including demographic information) prior to sending to the organisers (required)

Table 2 illustrates the ethical and legal processes fulfilled by each centre, along with the endoscopy equipment and recorders used for the data collection.

Study design
PolypGen data was collected from 6 different centres. More than 300 unique patient videos/frames were used for this study.
The general purpose of this diverse dataset is to allow robust design of deep learning models and validation of their generalisability. In this context, we have proposed different dataset configurations for training and out-of-sample validation, and unique generalisation assessment metrics to reveal the strength of deep learning methods. Below we provide a comprehensive description of the dataset collection, the annotation strategies and their quality, the ethical guidelines and the metric evaluation strategies.

Video acquisition, collection and dataset construction
A consortium of six different medical data centres (hospitals) was built, where each centre provided videos and image frames from at least 50 unique patients. The videos and image samples were collected and sent by the senior gastroenterologists involved in this project. The collected dataset consisted of both polyp and normal-mucosa colonoscopy acquisitions. To incorporate the nature of polyp occurrences and maintain heterogeneity in the data distribution, the following protocol was adhered to when establishing the dataset:

1. Single-frame sampling from each patient video incorporated different viewpoints
2. Sequence-frame sampling consisted of both visible and invisible polyp frames (in most cases) with a minimal gap
3. While single-frame data consisted of all polyp instances in that patient, sequence-frame data consisted of only one localised targeted polyp
4. Positive sequences included both positive and negative polyp instances, but from videos with a confirmed polyp location, while for negative sequences only patient videos with normal mucosa were used

An overview of the number of positive and negative samples is presented in Fig. 2a. A total of 3762 positive frames is released, comprising 484, 1166, 457, 677, 458 and 520 frames from centres C1, C2, C3, C4, C5 and C6, respectively. These frames consist of 1537 single frames (1449 frames from C1-C5, also provided in the EndoCV2021 challenge, and 88 frames from C6) and 2225 sequence frames, with the majority of sequence data sampled from centres C2 (865), C4 (450) and C6 (432). The polyp counts for pixel-level annotation of small (≤ 100 × 100), medium (between 100 × 100 and 200 × 200) and large (> 200 × 200) polyps, including no-polyp frames and frames in close proximity to a polyp, are represented as a histogram plot (Fig. 2b). The total number of polyp annotations that we release is 3447. All these polyp samples were verified by expert gastroenterologists.
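The size binning behind the histogram in Fig. 2b can be sketched as a small helper. The 100 × 100 and 200 × 200 thresholds are from the text; interpreting them as pixel-area cut-offs on the annotated region is our assumption, and the function name is hypothetical.

```python
def size_category(width: int, height: int) -> str:
    """Bin an annotated polyp by pixel area, following the dataset's
    thresholds: small <= 100x100, medium <= 200x200, large otherwise."""
    area = width * height
    if area <= 100 * 100:
        return "small"
    if area <= 200 * 200:
        return "medium"
    return "large"
```

For example, a 150 × 150 annotation (22 500 pixels) falls in the medium bin, while a 300 × 200 one (60 000 pixels) is large.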
We have provided both still image frames and continuous short video sequence data with their corresponding annotations.The positive and negative samples in the dataset of the polyp generalisation (PolypGen) are further detailed below.

Positive samples
Positive samples consist of video frames from patients with a diagnosed polyp case. A selected frame may or may not contain the polyp, which may instead be located near the chosen frame; nevertheless, the majority of these frames contain at least one polyp. For the sequence positive samples, the continuity of the appearance and disappearance of the polyp, similar to a real scenario, has been taken into account, and thus these frames can contain a mixture of polyp instances and frames with normal mucosa. Table 3 details the characteristics of the 23 sequences included in our dataset. It can be observed from Figure 4 that polyps of varied size are included in the dataset, with variable viewpoints, occlusions and instruments. Exemplary pixel-level annotations of positive polyp samples for each centre and their corresponding bounding boxes are presented in Fig. 3.

Negative samples
Negative samples refer to the negative sequences released in this dataset, i.e. sequences with no polyp frames. These sequences are taken from patient videos with a confirmed absence of polyps (i.e., normal mucosa), or from areas away from polyp occurrences. They include cases with normal anatomy such as colon linings, as well as light reflections and mucosa covered with stool that may be confused with polyps (see Figure 5 and the corresponding negative-sequence attributes in Table 4).

Annotation strategies and quality assurance
A team of 6 senior gastroenterologists (all with over 20 years of experience in endoscopy), two experienced post-doctoral researchers and one PhD student was involved in the data collection, data sorting, annotation and the review of annotation quality. For details on data collection and sorting, please refer to Section Video acquisition, collection and dataset construction. All annotations were performed by a team of three experienced researchers using an online annotation tool called Labelbox1. The dataset was divided equally between the three annotators, with each researcher annotating a specific group of frames. However, all annotated frames were reviewed by the senior gastroenterologist team. Each annotation was later cross-validated for accurate segmentation margins by the team and by the centre expert. Further, an independent binary review was then assigned to a senior gastroenterologist; in most cases, experts from different centres were assigned. A protocol for the manual annotation of polyps was designed to minimise heterogeneity in the manual delineation process. The protocol was discussed in detail with the clinical experts and the annotators during several weekly meetings. Here, we present only a brief account of the important aspects to be taken care of during annotation. Example samples were provided by expert endoscopists to the annotators; this was especially the case for the video annotations. The set protocols are listed below (refer to Fig. 3 for final ground-truth annotations):

• Video sequence annotation: One sample annotation from an expert gastroenterologist was provided for sequences that showed difficulty in distinguishing between mucosa and polyp. Polyps that were distant and not clearly visible were not annotated as polyps.
• Tackling occlusion: For polyps occluded by stool or an instrument, the obstructed parts of the mucosa were excluded from the annotation.
• Cancerous mucosa: Mucosa that was already cancerous but did not appear as a polyp was excluded from the annotation. However, raised mucosal surfaces that characterised adenomatous polyps were included.
Each of these annotated masks was reviewed by expert gastroenterologists. During this review, a binary score was provided by the experts depending on whether the annotation was clinically acceptable or not. Some of the experts also provided feedback on the annotations, and these images were placed into an ambiguous category for further rectification based on the expert feedback. This ambiguous category was then jointly annotated by two researchers and further sent for review to one expert. The outcome of these quality checks is provided in Figure 6. It can be observed that a large fraction (30.5%) of annotations was rejected (excluding the ambiguous batch, the total was 2213 annotations, among which only 1537 were accepted and 676 frames were rejected). Similarly, the ambiguous batch, which included correction of annotations after the first review, also recorded 34.17% rejected frames on the second review. The dataset is provided in the folder "PolypGen". The size of the images provided in the dataset ranges from 384 × 288 pixels to 1920 × 1080 pixels. The size of the masks corresponds to the size of the original images; however, the polyp sizes in the provided dataset are variable (as indicated in Figure 2). Since we followed a full anonymisation protocol, no gender or age information is provided.

Technical Validation
For the technical validation, we included single-frame data (1449 frames) from five centres (C1 to C5) in our training set and tested on out-of-sample C6 data, on both single frames (88 frames) and sequence frames (432 frames). Such out-of-sample testing on data from a completely different population and endoscopy device provides comprehensive evidence of the generalisability of current deep learning methods. The training set was split into 80% training-only and 20% validation data. Here, we take the most commonly used methods for segmentation in the biomedical imaging community [30][31][32][33][34], including those for polyps. For reproducibility, we have included the train-validation split as .txt files in the PolypGen dataset folder. However, users can choose any different combined-training or split-training scheme for generalisation tests as they prefer, e.g., training on three random centres and testing on the remaining three centres. The dataset is also suitable for federated learning (FL) approaches 35, which use decentralised training and aggregate weights on a central server for an improved and generalisable model without compromising data privacy.
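A centre-held-out split of this kind can be sketched as follows. The dict-of-lists layout and the function name are hypothetical; the actual split used in the paper is shipped as .txt files with the dataset.

```python
import random

def centre_split(frames_by_centre, held_out="C6", val_fraction=0.2, seed=0):
    """Return (train, val, test) frame lists, keeping `held_out` for testing.

    frames_by_centre: dict mapping centre id (e.g. "C1") to a list of frame paths.
    """
    # Pool all non-held-out centres, then carve off a validation fraction.
    pool = [f for c, fs in sorted(frames_by_centre.items()) if c != held_out for f in fs]
    random.Random(seed).shuffle(pool)
    n_val = int(len(pool) * val_fraction)
    return pool[n_val:], pool[:n_val], list(frames_by_centre[held_out])

# Toy example: 10 frames per centre, C6 held out.
frames = {f"C{i}": [f"C{i}_{j}.jpg" for j in range(10)] for i in range(1, 7)}
train, val, test = centre_split(frames)
```

The same helper also covers the suggested variant of training on three random centres and testing on the remaining three, by calling it with a different `held_out` partition.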

Benchmarking of state-of-the-art methods
To assess the generalisation capability of some state-of-the-art (SOTA) methods, we used a set of popular and well-established semantic segmentation CNN models on our PolypGen dataset. Each model was run for nearly 500 epochs with a batch size of 16 and an image size of 512 × 512. All models were optimised using Adam with a weight decay of 0.00001 and a learning rate of 0.01, allowing the best model to be saved after 100 epochs. Classical augmentation strategies were used, including scaling (0.5, 2.0), random cropping, random horizontal flipping and image normalisation. All models were run on a Quadro RTX6000.
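The augmentations named above can be sketched in a dependency-light form. This is an illustrative approximation, not the authors' pipeline: it uses nearest-neighbour index resampling instead of library interpolation, normalises to [0, 1], and the function name is hypothetical.

```python
import numpy as np

def augment(img, rng, out_size=512):
    """Apply the classical augmentations listed in the text: random scale in
    [0.5, 2.0], random crop to out_size, random horizontal flip, normalisation."""
    scale = rng.uniform(0.5, 2.0)
    h, w = img.shape[:2]
    # Resize (nearest-neighbour via index sampling), never below the crop size.
    nh, nw = max(out_size, int(h * scale)), max(out_size, int(w * scale))
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    img = img[ys][:, xs]
    # Random crop to out_size x out_size.
    top = rng.integers(0, nh - out_size + 1)
    left = rng.integers(0, nw - out_size + 1)
    img = img[top:top + out_size, left:left + out_size]
    # Random horizontal flip.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return img.astype(np.float32) / 255.0

out = augment(np.full((600, 800, 3), 255, np.uint8), np.random.default_rng(0))
```

For segmentation training, the same geometric transforms (scale, crop, flip) must of course be applied identically to the mask.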
Evaluation metrics for segmentation.
We compute standard metrics used for assessing segmentation performance, based on pixel counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN): the Jaccard index JI = TP / (TP + FP + FN), the F1-score (aka Dice similarity coefficient, DSC), the F2-score, precision (aka positive predictive value, PPV) p = TP / (TP + FP), recall r = TP / (TP + FN), and overall accuracy Acc. = (TP + TN) / (TP + TN + FP + FN). The precision-recall trade-off is captured by the Dice similarity coefficient (DSC, or F1-score) and the F2-score, where the F_β-score is computed as a weighted harmonic mean of precision and recall:

F_β = (1 + β²) · p · r / (β² · p + r),

with β = 1 for the DSC and β = 2 for the F2-score. Another commonly used segmentation metric, based on the distance between two point sets, here the ground-truth (G) and estimated or predicted (E) pixels, is the average Hausdorff distance (d_AHD), defined for the symmetric case as:

d_AHD(G, E) = (1/2) · [ (1/|G|) Σ_{g∈G} min_{e∈E} d(g, e) + (1/|E|) Σ_{e∈E} min_{g∈G} d(e, g) ],

where d(·, ·) is the Euclidean distance. Since boundary-distance-based metrics are insensitive to object size but sensitive to object shape, we include two additional metrics: the average surface distance (ASD) and the normalised surface dice (NSD). The ASD is the average of all Euclidean distances between pixels on the predicted segmentation border and their nearest neighbours on the reference segmentation border; for the symmetric case, distances are computed in both directions and averaged. The normalised surface dice (NSD) 36 computes the fraction of correctly predicted segmentation boundary using an additional threshold accounting for a class-specific acceptable distance deviation; in our experiments, we set it to 10. If τ is the acceptable deviation (threshold), B_E and B_G are the predicted and reference mask boundaries, and d(b, B) denotes the nearest-neighbour distance from a boundary pixel b to boundary B, then for the symmetric case the NSD is given by:

NSD = ( |{b ∈ B_E : d(b, B_G) ≤ τ}| + |{b ∈ B_G : d(b, B_E) ≤ τ}| ) / ( |B_E| + |B_G| ).

NSD is bounded between 0 and 1, where 1 means the entire segmentation boundary lies within the deviation threshold τ.
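The pixel-count metrics above can be implemented directly from a confusion-matrix summary. A minimal sketch (function name hypothetical), covering JI, DSC, F_β, precision, recall and accuracy as defined in the text:

```python
def seg_metrics(tp, fp, tn, fn, beta=2.0):
    """Segmentation metrics from pixel counts of true/false positives/negatives."""
    p = tp / (tp + fp)            # precision (PPV)
    r = tp / (tp + fn)            # recall
    return {
        "JI": tp / (tp + fp + fn),                              # Jaccard index
        "DSC": 2 * p * r / (p + r),                             # F1 / Dice
        "F2": (1 + beta**2) * p * r / (beta**2 * p + r),        # F-beta, beta=2
        "precision": p,
        "recall": r,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

m = seg_metrics(80, 20, 880, 20)   # toy counts: p = r = 0.8
```

The distance-based metrics (d_AHD, ASD, NSD) additionally require the boundary pixel sets of both masks and nearest-neighbour distance queries, so they are omitted from this sketch.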

Polyp segmentation benchmarking
Long et al. 30 presented the Fully Convolutional Network (FCN), which uses downsampling and upsampling for image segmentation. The model is divided into two parts: the first extracts detailed feature maps by downsampling the spatial resolution of the image, and the second retrieves the location information through upsampling. The U-Net architecture developed by Ronneberger et al. 31 has shown tremendous success in medical image segmentation 37, including endoscopy 15,38. U-Net is essentially an encoder network followed by a decoder network: convolution blocks followed by max-pooling downsampling are applied to the image to encode feature representations at multiple levels. Afterwards, the decoder semantically projects the discriminative characteristics learnt by the encoder. The decoder is composed of upsampling and concatenation followed by standard convolutions. The skip connections between the downsampling and upsampling paths in the U-Net (which make it symmetric) are the main difference between the U-Net and the FCN 39. In the Pyramid Scene Parsing Network (PSPNet) 32, standard and dilated convolutions are both present in the encoder. Similarly, dilated convolutions enabled the construction of semantic segmentation networks that effectively control the size of the receptive field, which was incorporated in a family of very effective semantic segmentation architectures collectively named DeepLab 33. DeepLabV3 captures multi-scale information by employing atrous convolution at multiple rates in a cascade, or in parallel through multi-scale spatial pyramid pooling. Moreover, ResUNet 40 incorporates the benefits of both ResNet and U-Net, which allowed the design of a network with fewer parameters and improved segmentation performance. Fig. 8 provides an illustration of the architectures of the SOTA methods explained above.
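The encoder-decoder-with-skip idea that distinguishes U-Net from a plain FCN can be illustrated at the shape level, without any deep learning framework. This is a toy sketch of the data flow only (no learned convolutions); all function names are hypothetical.

```python
import numpy as np

def pool2(x):
    """2x2 max-pooling: the encoder's downsampling step."""
    h, w, c = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def up2(x):
    """Nearest-neighbour 2x upsampling: the decoder's step."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like(x):
    """Shape-level sketch of the U-Net idea: downsample, upsample, then
    concatenate the encoder feature map with the decoder one channel-wise
    (the skip connection that a plain FCN lacks)."""
    skip = x                      # encoder feature map kept for the skip path
    bottleneck = pool2(x)         # encode: halve spatial resolution
    decoded = up2(bottleneck)     # decode: restore spatial resolution
    return np.concatenate([skip, decoded], axis=-1)

y = unet_like(np.arange(48, dtype=float).reshape(4, 4, 3))
```

In a real U-Net, each level also applies learned convolution blocks before pooling and after concatenation, and the skip/concat pattern is repeated at every resolution level.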
All of these networks have been explored for polyp segmentation in the literature [41][42][43]. Here, we benchmark our dataset on these popular deep learning model architectures. Out-of-sample generalisation results for both single-frame (Table 5) and sequence data (Table 6) are included in our technical validation of the presented data.

Validation Summary
Our technical validation suggests that DeepLabV3+ with a ResNet101 backbone has the best performance on most metrics, except for FPS, indicating larger inference latency (Table 5 and Table 6). The highest DSC score was 0.82 and the lowest d_AHD 9.67, obtained with DeepLabV3+ with ResNet101 on single-frame data. However, the second-best inference speed (47 FPS) and scores (DSC = 0.81, d_AHD = 9.95) were obtained again using DeepLabV3+, but with a ResNet50 backbone. Similarly, for the C6 out-of-sample sequence test, the highest DSC of 0.65, the highest recall of 0.73 and the lowest d_AHD of 9.08 were obtained with DeepLabV3+ with the ResNet101 backbone. With the same ResNet101 backbone, ResUNet achieved a very close performance of 0.65 DSC, but with the highest precision of 0.92 and a d_AHD of 9.20. Moreover, with the same ResNet101 backbone, ResUNet (40 FPS) was faster than DeepLabV3+ (33 FPS). In addition, we also evaluated the normalised surface dice (NSD) and the mean average surface distance (MASD), both of which demonstrated a similar performance trend for most methods. NSD was the highest for DeepLabV3+ and ResNet-UNet with the ResNet101 backbone: 0.64 and 0.63, respectively, for single frames, and 0.57 each for sequence frames. The lowest MASD was reported for DeepLabV3+ with the ResNet101 backbone: 23.29 and 18.59, respectively, for single and sequence frames. We also computed size-stratified DSC estimates for each algorithm. For medium and large polyps, the DSC score of the majority of methods was not affected, with a DSC of 0.87 for large polyps and 0.84 for medium polyps in the case of DeepLabV3+ with the ResNet101 backbone. However, a steep decrease in DSC was observed for small polyps, with a DSC value of only 0.46. For the classical PSPNet and FCN8 networks, the DSC difference between large and medium polyp sizes was estimated to be over 0.10, while both ResNet-UNet and DeepLabV3+ had a much smaller difference.
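Size-stratified scoring of the kind described above amounts to grouping per-frame DSC values by polyp-size category and averaging within each group. A minimal sketch (function name and the toy scores hypothetical):

```python
def stratified_mean(scores, categories):
    """Average per-frame scores (e.g. DSC) within each size category."""
    sums, counts = {}, {}
    for s, c in zip(scores, categories):
        sums[c] = sums.get(c, 0.0) + s
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# Toy per-frame DSC values with their size bins (values illustrative only).
out = stratified_mean([0.9, 0.7, 0.5, 0.3], ["large", "large", "small", "small"])
```

Comparing the resulting per-bin means is what reveals the drop on small polyps reported above.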

Discussion
While for the single-frame data the DSC is above 0.80 for ResNet-UNet and DeepLabV3+ with the ResNet101 backbone, the same architectures provided a DSC of only around 0.70 on the sequence dataset. This can be primarily because of the larger number of frames in the sequence dataset (nearly 5 times larger: 432 sequence frames versus only 88 single frames), and the heterogeneous image appearance across the sequence frames.

Figure 1 .
Figure 1. t-SNE plot for positive samples: 2D t-SNE embedding of the "PolypGen" dataset based on features extracted with a deep autoencoder. Each point is an image in the positive samples of the dataset. For each of the six boxed regions (dashed black lines), 25 images were randomly sampled for display in a 5 × 5 image grid. Here, the 1st boxed region represents mostly the sequence data. Interestingly, the 3rd, 4th and 6th boxed regions mostly represent both polyp and non-polyp data and are heterogeneously distributed. Samples from the 2nd and 5th boxed regions show mostly protruded polyps, but with differently positioned endoscopy locations. Some samples in these regions also include colonoscopy frames with dyes.

Figure 2 .
Figure 2. PolypGen dataset: (a) Positive (both single and sequence frames) and negative samples (sequence only) from each centre, and (b) polyp size-based histogram plot for positive samples showing variably sized annotated polyps in the dataset (small: ≤ 100 × 100 pixels; medium: > 100 × 100 and ≤ 200 × 200 pixels; large: > 200 × 200 pixels). Null represents no polyp present in the sample.

Figure 3 .
Figure 3. Sample polyp annotations from each centre: Segmentation areas with boundaries and corresponding bounding box/boxes overlaid on images from all six centres. Samples range from small-sized polyps (< 10000 pixels), including some flat polyps, to large-sized polyps (≥ 40000 pixels), as well as polyps during resection procedures, such as polyps with blue dyes.

Figure 4.
Figure 4. Positive sequence data: Representative samples chosen from the 23 sequences of the provided positive sample data. A summary description is provided in Table 3. Parts of images have been cropped for visualisation.

Figure 5.

Figure 6 .
Figure 6. Annotation quality review: Total curated frames along with accepted and rejected frame counts during the annotation quality review by experts for single-frame data. The percentages of flat and protruded polyps categorised during annotation are also provided.

Figure 7 .
Figure 7. Folder structure: The PolypGen dataset is divided into two folders: positive frames and negative frames. The positive folder is further divided into three levels, while the negative folder has only one level. (Left) Sub-folder structure with the different folder names and the format of the data present in each sub-folder 2, with sample images. (Right) Sub-folders present in the negative folder, with two sample sequences (from centre 4 and centre 1). Each data-source centre is shown as a colour legend.

Table 1 .
An overview of existing gastrointestinal (GI) lesion datasets including polyps: the number of images or videos is provided, along with the availability type. † Including ground truth segmentation masks; ‡ Contour; Video capsule endoscopy; • Not available anymore; ♣ Medical atlas for education with several low-quality samples of various GI findings.

Table 2 .
Data collection information for each centre: Data acquisition system and patient consenting information.

Table 3 .
Positive sample sequence summarised attributes: A total of 23 sequences is provided as positive sample sequences for patients with polyp instances during colonoscopy examination. Here, JNET refers to the Japan NBI Expert Team classification score. These sequences depict polyps of different sizes and locations, with different artefacts and varying visibility. One selected image per sequence is shown in Fig. 4.

Table 4 .
Negative sample sequence summarised attributes: A total of 23 sequences is provided as negative sample sequences for patients with no polyp during colonoscopy examination. These sequences depict different artefacts and varying visibility of the vascular pattern and occlusion of the mucosa.

Table 5 .
Performance evaluation of SOTA segmentation methods on 88 single frames from centre 6 in an out-of-sample generalisation task. The top two values are presented in bold. JI: Jaccard index; DSC: Dice coefficient; F2: F-beta measure with β = 2; PPV: positive predictive value; Acc.: overall accuracy; d_AHD: average Hausdorff distance; NSD: normalised surface dice; MASD: mean average surface distance; FPS: frames per second; ↑: higher is better; ↓: lower is better.

Table 6 .
Performance evaluation of SOTA segmentation methods on 432 sequence frames from centre 6 in an out-of-sample generalisation task. The top two methods are presented in bold. d_AHD: average Hausdorff distance; NSD: normalised surface dice; MASD: mean average surface distance; ↑: higher is better; ↓: lower is better.