An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy

We present a comprehensive analysis of the submissions to the first edition of the Endoscopy Artefact Detection challenge (EAD). Using crowd-sourcing, this initiative is a step towards understanding the limitations of existing state-of-the-art computer vision methods applied to endoscopy and promoting the development of new approaches suitable for clinical translation. Endoscopy is a routine imaging technique for the detection, diagnosis and treatment of diseases in hollow organs: the esophagus, stomach, colon, uterus and bladder. However, the nature of these organs prevents imaged tissues from being free of imaging artefacts such as bubbles, pixel saturation, organ specularity and debris, all of which pose substantial challenges for any quantitative analysis. Consequently, the potential for improved clinical outcomes through quantitative assessment of the abnormal mucosal surfaces observed in endoscopy videos is presently not fully realized. The EAD challenge promotes awareness of, and addresses, this key bottleneck by investigating methods that can accurately classify, localize and segment artefacts in endoscopy frames as critical prerequisite tasks. Using a diverse, curated, multi-institutional, multi-modality, multi-organ dataset of video frames, the accuracy and performance of 23 algorithms were objectively ranked for artefact detection and segmentation. The ability of methods to generalize to unseen datasets was also evaluated. The best performing methods (top 15%) propose deep learning strategies to reconcile variabilities in artefact appearance with respect to size, modality, occurrence and organ type. However, no single method outperformed the others across all tasks. Detailed analyses reveal the shortcomings of current training strategies and highlight the need to develop new optimal metrics that accurately quantify the clinical applicability of methods.


Supplementary Figures
Composition of the endoscopy artefact detection dataset. a, Percentage of image frames from each of the 6 data institutions in the train, test and generalization datasets by institute (outer ring), imaging modality (middle ring) and imaged organ (inner ring). 'Other' indicates frames that cannot be classified into the five main organs, including laparoscopy frames and non-organ objects. b, Proportion of the total number of artefact bounding boxes (number in the pie chart centre) for each of the 7 artefact classes in the detection train, test and generalization datasets, respectively. c, Normalised width vs. height (relative to the source image width and height) of annotated ground-truth bounding boxes (one small dot per box), coloured by class. Larger points plot the mean (width, height) pair of each class. Boxes are primarily square shaped (mean width:height ≈ 1:1). d, Mean box width and height of the individual classes in (c) and their characteristic standard deviations.

Spatial overlap of ground-truth reference bounding box annotations. a, Percentage (%) of boxes in each class (row, y-axis) that spatially overlap (IoU > 0) with boxes from each of the 7 classes (column, x-axis). b, The distribution of overlapping boxes in each class (row, y-axis) and the proportion of times they overlap with boxes of each of the 7 classes (column, x-axis), computed by normalising each row of the matrix in (a) by its row sum. c, The mean IoU of overlapped boxes in each class (row, y-axis) with boxes of each of the 7 classes (column, x-axis). Comparison of (b) and (c) shows that the mean IoU does not reflect the higher frequency of overlap of small artefact classes with other small and large artefacts.

Figure 5. Illustration of the intersection-over-union (IoU) and average precision metrics for evaluating detection performance. a, Schematic of IoU; shaded blue areas illustrate the relevant areas referred to in the equation. b, Steps involved in computing precision and recall for constructing the precision-recall curve, and computation of average precision (AP) as the area underneath this curve for assessing ranked detection performance. c, Example precision-recall curve for the contrast class for the Faster R-CNN detection baseline, with an area element for computing AP illustrated. Individual points correspond to each predicted 'contrast' bounding box in the detection test dataset, plotted left to right by descending predicted objectness score. d, Positively matched (green boxes) and unmatched (grey boxes) boxes predicted by yangsuhui according to different IoU cutoffs for an example test image.

Relative IoU and mAP detection performance of detection methods. Critical difference diagrams of IoU (a) and mAP (b) detection performance for all individual methods (black font), the merged prediction baseline (blue font) and the state-of-the-art Faster R-CNN and RetinaNet baselines (green font). The scale reports the average rank of methods across all artefact classes with respect to each metric; the lower the score, the better the performance. Thick black horizontal lines join methods that are not statistically different in rank (p ≤ 0.05) according to post-hoc Friedman-Nemenyi statistical testing 1.

Globally easy and hard images for artefact detection. Ease of artefact detection for a given image in the detection test dataset was defined as the mean F1 score across all teams (Supplementary Note II). a, F1 score for each test image sorted by ascending mean F1 score. The higher the F1 score, the easier it was for all teams to locate all artefacts present in the image. b, Breakdown of F1 score per team (y-axis) for each test image in ascending mean F1 score order as in (a). Teams are ordered top to bottom by descending detection score. c, Montage of the 24 hardest and easiest images for artefact detection. Images in the montages have been rescaled to 256 x 256 pixels for display.

(Caption fragment.) In all panels error bars plot ± 1 standard deviation of class scores relative to the global mean score per team, and the red dashed line plots the score for a method which predicts negative for all classes for all images. '*' marks teams whose rank performance significantly improves on the U-Net baseline with Friedman Bonferroni-Dunn post-hoc testing and p < 0.05. Solid black lines join methods with no significant difference in rank with Friedman-Nemenyi post-hoc testing and p < 0.05. …, recall (f) and segmentation or s-score (g) across all 10 teams. In panels b)-g) error bars plot ± 1 standard deviation of individual team scores relative to the mean team score with respect to the artefact class. In all panels teams are plotted with circle markers, with marker size decreasing by decreasing mean s-score.

Figure 17. Globally easy and hard images for artefact segmentation. Ease of artefact segmentation for a given image in the segmentation test dataset was defined by the mean DSC score across all teams (Supplementary Note II). a, DSC score for each test image sorted by ascending mean DSC score. The higher the DSC score, the easier it was for all teams to segment all artefacts present in the image. b, Breakdown of DSC score per team (y-axis) for each test image in ascending mean DSC score order (x-axis) as in (a). Teams are ordered top to bottom by descending segmentation score. c, Montage of the 24 hardest and easiest images to segment. Images in the montages have been rescaled to 256 x 256 pixels for display.

Supplementary Note I: EAD dataset
With the EAD challenge we aimed to establish a first large and comprehensive dataset for endoscopy artefact detection (see Suppl. Figs. 1-3). The provided data were assembled from 6 different centres worldwide: John Radcliffe Hospital, Oxford, UK; ICL Cancer Institute, Nancy, France; Ambroise Paré Hospital of Boulogne-Billancourt, Paris, France; Istituto Oncologico Veneto, Padova, Italy; University Hospital Vaudois, Lausanne, Switzerland; and the Botkin Clinical City Hospital, Moscow. This unique endoscopic video frame dataset is multi-organ (gastroscopy (stomach), cystoscopy (bladder), gastro-oesophageal endoscopy (oesophagus), colonoscopy (colon) and capsule endoscopy (small intestine)), multi-modal (white light, fluorescence, capsule and narrow band imaging), inter-patient and encompasses multiple populations (UK, France, Russia and Switzerland). Videos were collected from patients on a first-come-first-served basis at Oxford, with randomized sampling at the French centres, and only cancer patients were selected at the Moscow centre. Videos at these centres were acquired with standard imaging protocols using endoscopes built by different companies: Olympus (Oxford, Paris), Biospec (Moscow), Medtronic (Oxford) and Karl Storz (EPFL, ICL, Paris). The dataset was built by randomly mixing the collected data proportionately. Suppl. Fig. 3 gives a comprehensive visual breakdown of the dataset. All images were carefully anonymised before release; no patient information should be visible in this data. We have also developed comprehensive open-source software 1 to assist all participants.

Fabrication of dataset Gold standard
Clinical relevance to the challenge problem was first identified. During this step, 7 common imaging artefact types (see Suppl. Fig. 3b) were suggested by 2 expert clinicians, who performed bounding box labelling of these artefacts on a small dataset (∼100 frames). These frames were then taken as the reference to produce bounding box annotations for the remaining train-test dataset by 2 experienced postdoctoral fellows. A final validation by 2 experts (clinical endoscopists) was carried out to ensure the reference standard; ground-truth labels were randomly sampled (1 per 20 frames) during this process. To maximize consistency between the two independent annotators, a few rules were determined, as described below:

Annotation software
We used the open-source VIA annotation tool 2 for semantic segmentation. For bounding box annotation we used a Python-, Qt- and OpenCV-based in-house tool.

Annotation Protocols
- For the same region, multiple boxes were annotated if the region belonged to more than 1 class.
- Minimal box sizes were used to describe the artefact region, e.g. if multiple specular reflections are present in an image then, instead of one large box, multiple small boxes are used to capture the natural size of the artefact.
- Each artefact type was determined to be distinctive and general across endoscopy datasets.

Annotator Variation
Variation in bounding box annotations was accounted for by computing a weighted final score, 0.6*mAP + 0.4*IoU, in the multi-class artefact detection challenge. Here, IoU (intersection over union) was down-weighted as it is likely to vary more than mAP (mean average precision) across individual annotators.
Variation in the semantic class labels of masks for semantic segmentation was not found to be significant. Further, we do not consider the contrast and blur classes, which are inherently poorly defined spatially.

Composition for different sub-challenges
Below we describe the composition of the training-test dataset for each sub-task of the EAD challenge. The information is also visually represented in Suppl. Fig. 3.

Detection
The training dataset for detection consists of 2192 annotated frames in total over all 7 artefact classes. All algorithms were evaluated online 2 using the evaluation metrics discussed in Supplementary Note II on a test set of 195 frames (∼10% of the training data). During annotation we found that most frames were affected far more by specularity, imaging artefacts and bubbles than by the other artefact classes. We tried to keep the ratio of class types similar between the training and test datasets as best we could. Suppl. Fig. 3 shows the artefact class distribution for the detection and generalization datasets.

Semantic Segmentation
The training dataset for semantic segmentation consists of 475 annotated frames for 5 of the 7 classes: specularity, saturation, artefact, bubbles and instrument (i.e., no contrast and blur). The test data contains 122 annotated frames.
Generalization
The training dataset for generalization is the same as that for detection; however, the test data for generalization uses a previously withheld dataset provided by a sixth institution (Padova) not present in any other training or test data released for the detection and segmentation tasks (Suppl. Fig. 3b). The generalization test data consists of 52 images, and the task was to detect all 7 artefact classes as in the detection task.

Image variation within dataset
We highlight in this section the diversity of the assembled dataset in terms of visual appearance and variability in artefacts.
Image modality Imaging modality plays an important role in the visual diversity of clinical endoscopy data. Different imaging modalities are commonly used during diagnosis to better visualize the underlying disease, which inevitably changes the appearance of the imaged tissue of an organ.
White light (WL) WL is considered the standard imaging modality. Broad-spectrum light is shone and all reflected wavelengths are collected to form a true-to-life representation of the tissue surface (mucosa). The captured visual appearance is what the human eye normally sees: pink tissue with low contrast and surface vasculature. Unfortunately, due to the multi-focal nature of tumors, the specificity of this imaging modality is very low.
Fluorescence light (FL) FL is used as a complementary modality to WL due to its improved specificity for detecting cancer tumors in hollow organs. During FL, a colored dye is introduced which selectively visualizes targeted tissue regions at a different color and wavelength. Common applications include the identification of multi-focal bladder cancer, squamous cell carcinomas and dysplasia in the oesophagus, and screening for dysplasia in individuals with ulcerative colitis 3.
Narrow band imaging (NBI) is an imaging technique that uses specific blue and green wavelengths to enhance the detail of the tissue surface, targeting the peak light absorption of haemoglobin in the blood. Consequently, blood vessels appear darker, which improves visual contrast for the identification of other surface structures 4. It is frequently used to aid the identification of Barrett's oesophagus 5 and of pit patterns for colorectal polyp and tumour classification 6. Research has shown improved accuracy and specificity compared to the WL modality.
Capsule imaging is a procedure that uses a much smaller wireless camera, housed in a vitamin-sized capsule which the patient swallows, to capture images, in contrast to the standard flexible endoscope. It is commonly used to image the small intestine, an area which would otherwise be difficult to access. Although strictly a type of instrumentation rather than an imaging modality, we chose to include it as a modality because the images of the small intestine captured with the smaller camera form factor are distinctively different from the above three modalities. Further, because the PillCam in the EAD dataset images in water, bubbles cover much larger areas.

Imaged organs
Oesophagus. The oesophagus is a hollow tube that connects the mouth with the stomach. Two major cancers occur here: squamous cell carcinoma in the upper and middle oesophagus near the mouth, and adenocarcinoma in the lower oesophagus at the junction with the stomach. A premalignant lesion that can precede adenocarcinoma is called Barrett's oesophagus. The surface of the oesophagus appears smooth, ending in a sphincter leading towards the stomach. This transition is characterised visually by smooth tissue giving way to the rougher, glandular-looking stomach tissue. Common conditions include cancer, oesophagitis (inflammation) and Barrett's oesophagus.
Stomach. The stomach is a hollow organ that receives food from the oesophagus. Visually, the stomach is most distinguished by its glandular, rough tissue and, under white light, the presence of two sphincters, one to the oesophagus and one to the lower intestine. Conditions include gastritis and cancer.
Small Intestine. The small intestine absorbs nutrients from food. Of the organs listed, it is the most difficult to reach: the oesophagus and stomach are reached endoscopically through the mouth, the colon through the anus and the bladder through the urethra. Under capsule endoscopy (PillCam) the surface appears tentacle-like with many villi. Conditions include Crohn's disease, irritable bowel syndrome and cancer.
Colon. The colon, also called the large bowel or large intestine, removes water and salt to form stool. Endoscopy of the colon is called colonoscopy. Common conditions include colitis (inflammation of the colon) and colon cancer. The appearance of the colon is tube-like with regular surface undulations.

Bladder. The bladder is a muscular sac in the pelvis, above and behind the pubic bone, which stores urine. Endoscopy of the bladder is called cystoscopy. It is used to assess conditions such as cystitis (inflammation of the bladder) and bladder cancer. The surface of the bladder typically looks smooth with abundant microvasculature.

Imaging instrument
The imaging instrument used depends on what is being imaged, which often determines the specialist manufacturer. In our dataset, endoscopy images of the oesophagus (Oxford), stomach (Oxford, Paris, Padova) and colon (Oxford, Padova) use Olympus endoscopes, bladder images (EPFL, ICL, Moscow) use Karl Storz endoscopes, and the small intestine (Oxford) uses Medtronic's PillCam SB3 system.

Visualisation of image diversity using t-SNE
To capture the image diversity based on texture and colour we trained a ResNet 'encoder-decoder' autoencoder to extract deep image features for t-SNE (Tables 1, 2). All images were rescaled by bilinear interpolation to 256 x 256. ResNet blocks use the full pre-activation variant 7. The autoencoder was trained on the detection training set with mean absolute error loss and the Adam optimizer (learning rate λ = 1x10⁻⁴, β₁ = 0.5, β₂ = 0.99), with early stopping and no data augmentation. The output of the dense layer in the encoder (Table 1) was used as input to t-SNE (perplexity = 15, learning rate = 100) to reduce the 256 dimensions to 2 dimensions for plotting (Suppl. Fig. 2). By visual inspection, individual images are well spread in all directions, with some of the generalization dataset images occupying new areas of the t-SNE embedding not occupied by the train and test datasets. This indicates diversity. Further, we see that the detection test dataset is representative of the training dataset, with images embedded in the areas covered by both 'Train-I' and 'Train-II'.

Table 3. Computed image quality metrics for the EAD dataset. LAPV was computed with a 5x5 Laplacian kernel. IoU, measure of spatial overlap using intersection-over-union; see Supplementary Note II.
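For readers who wish to reproduce this kind of embedding, the minimal sketch below shows only the dimensionality-reduction step of the t-SNE visualisation described above; it assumes the 256-dimensional encoder features have already been extracted, and the variable and function names are illustrative rather than taken from the challenge code.

```python
# Illustrative sketch: embed pre-extracted autoencoder features with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

def embed_features(features: np.ndarray) -> np.ndarray:
    """features: assumed (n_images, 256) array taken from the encoder's dense layer."""
    tsne = TSNE(n_components=2, perplexity=15, learning_rate=100, random_state=0)
    return tsne.fit_transform(features)  # (n_images, 2) coordinates for plotting
```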
Image quality measures To provide a more objective quantification of the variability in endoscopic images, a number of image quality measures were computed (Table 3). Without corresponding 'clean' reference images, standard full-reference image quality measures such as PSNR and SSIM could not be used. Instead, we used no-reference image quality measures from the literature, which do not require matched 'clean' versions of the image: the variance of the Laplacian (LAPV) 8, which quantifies the amount of edges in an image (primarily useful for detecting image blur), and BRISQUE 9, which quantifies the extent of visual distortion (and is thus theoretically independent of artefact type). These metrics have primarily been developed and tuned on natural images, which have very different image statistics compared to biomedical endoscopy images: more diverse colours, distinguished edges and distinctive image texture. As such, we computed the following additional estimates of quality based on the trained autoencoder features and the manually labelled endoscopy artefact bounding box annotations.
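As an illustration of the LAPV measure, a minimal sketch using OpenCV is given below; the mapping of the 5x5 Laplacian kernel onto the ksize argument is our assumption, and the function name is illustrative.

```python
# Illustrative sketch: Variance-of-Laplacian (LAPV) no-reference blur measure.
import cv2

def variance_of_laplacian(image_path: str) -> float:
    """Higher LAPV = stronger edge response (sharper image); lower LAPV suggests blur."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    lap = cv2.Laplacian(gray, cv2.CV_64F, ksize=5)  # 5x5 Laplacian kernel (assumed)
    return float(lap.var())
```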
Visual diversity index. We compute the ratio of the explained variance of the second and first principal components, σ₂²/σ₁², obtained by applying PCA to the features extracted by the trained deep autoencoder. The larger the diversity, the smaller the variation that can be captured by the first component relative to the second component.
Shannon diversity index (H). More commonly known as information entropy in the information sciences, it quantifies the uncertainty in predicting the artefact class of an individual bounding box taken at random from the dataset:

H = −∑_{i=1}^{N} p_i ln(p_i),

where p_i is the proportion of bounding boxes belonging to the ith of N artefact classes in the dataset of interest.
Simpson diversity index (D). Measures 1 minus the probability that two bounding boxes randomly selected from the same dataset (without replacement) belong to the same artefact class:

D = 1 − ∑_{i=1}^{N} n_i(n_i − 1) / (n(n − 1)),

where n_i is the number of bounding boxes in the ith of N artefact classes in the dataset of interest and n = ∑_i n_i is the total number of boxes. Simpson's diversity index gives a value between 0 and 1 that increases with diversity (more class types).
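A minimal sketch of both diversity indices, computed from per-class bounding-box counts, is given below; the example counts are hypothetical and only illustrate the calculation.

```python
# Illustrative sketch: Shannon (H) and Simpson (D) diversity from per-class box counts.
import numpy as np

def shannon_index(counts) -> float:
    """H = -sum_i p_i * ln(p_i), over classes with at least one box."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def simpson_index(counts) -> float:
    """D = 1 - sum_i n_i*(n_i - 1) / (n*(n - 1)), with n the total box count."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return float(1.0 - (counts * (counts - 1.0)).sum() / (n * (n - 1.0)))

# Hypothetical per-class box counts for 7 artefact classes:
example_counts = [1200, 800, 400, 150, 90, 60, 30]
print(shannon_index(example_counts), simpson_index(example_counts))
```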

Frequency of images with overlap of bounding boxes between different artefact classes. A major problem of endoscopy imaging artefacts, compared to the detection of objects in natural images, is the potential for numerous spatial overlaps between artefact bounding boxes. Intuitively, this complicates accurate artefact detection. To better understand how frequently spatial overlap occurs in endoscopy image datasets, we further dissected the percentage of overlapped boxes between artefact classes (Table 3) with respect to individual artefact classes and frequency across an image dataset. For a given dataset, D, of N images, three measures were used.
The frequency (%) of images with box overlap between artefact classes i and j (Suppl. Fig. 4a): the fraction of the total number of images containing a box of class i that overlaps with a box of class j. Boxes were deemed to overlap if IoU > 0:

Freq_{ij} = (100/N) ∑_{k=1}^{N} I(image k contains a class-i box overlapping a class-j box),

where I denotes the indicator function for counting.
The proportion (%) of overlapped boxes in artefact class i that overlap with an overlapped box in class j (Suppl. Fig. 4b). This is essentially the row-normalised matrix of the frequency overlap measure above.
The mean IoU of overlapped boxes in class i with overlapped boxes in class j (Suppl. Fig. 4c).
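The sketch below illustrates how the first of these measures can be computed from ground-truth boxes; the data structure (a list of (class_id, box) pairs per image) and all names are assumptions made for illustration only.

```python
# Illustrative sketch: % of images in which a class-i box overlaps (IoU > 0) a class-j box.
import numpy as np

def box_iou(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def overlap_frequency(dataset, num_classes=7) -> np.ndarray:
    """dataset: list of images, each a list of (class_id, box) ground-truth pairs."""
    freq = np.zeros((num_classes, num_classes))
    for boxes in dataset:
        pairs_seen = set()
        for idx_i, (ci, bi) in enumerate(boxes):
            for idx_j, (cj, bj) in enumerate(boxes):
                if idx_i != idx_j and box_iou(bi, bj) > 0:
                    pairs_seen.add((ci, cj))  # classes i and j overlap in this image
        for ci, cj in pairs_seen:
            freq[ci, cj] += 1
    return 100.0 * freq / max(len(dataset), 1)
```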

Performance Criteria (technical measures used in the challenge)
Intersection over union (IoU) and Jaccard index (J).
We quantify the amount of overlap between reference and predicted bounding boxes and segmentations using the intersection-over-union (Suppl. Fig. 5a), defined as

IoU(R, S) = |R ∩ S| / |R ∪ S|,

where R is the reference bounding box/segmentation of an artefact, S its corresponding predicted bounding box/segmentation and | · | denotes the set cardinality. The IoU is more commonly known as the Jaccard index in image segmentation. For multiple artefact classes, the mean IoU/Jaccard index calculated over all reference bounding boxes in the dataset images is used for evaluation. The IoU ranges from 0 for no overlap to 1 for perfect overlap.
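A minimal sketch of the IoU/Jaccard computation for binary masks is given below (bounding boxes can be handled analogously or rasterised to masks); array shapes and names are illustrative.

```python
# Illustrative sketch: IoU / Jaccard index for boolean masks of equal shape.
import numpy as np

def iou(reference: np.ndarray, prediction: np.ndarray) -> float:
    """IoU(R, S) = |R ∩ S| / |R ∪ S|."""
    r = reference.astype(bool)
    s = prediction.astype(bool)
    union = np.logical_or(r, s).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(r, s).sum() / union)
```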
Mean average precision (mAP). For any given image the number of predicted bounding boxes will typically not equal the number of reference bounding boxes. Intuitively, the best algorithm is the one that minimises the number of spurious predictions, i.e. maximises the precision, p = TP/(TP+FP), and maximises the number of detected reference boxes, i.e. the recall, r = TP/(TP+FN). The average precision (AP) is the standard measure used to evaluate this trade-off for object detectors, taking into account the detector's confidence that a box contains an object (objectness score). AP is computed as the area under the precision-recall curve of the detections, sampled at all unique recall values (r_1, r_2, ...) whenever the maximum precision value drops:

AP = ∑_n (r_{n+1} − r_n) · p_interp(r_{n+1}), with p_interp(r_{n+1}) = max_{r ≥ r_{n+1}} p(r).

Here, p(r_n) denotes the precision value at a given recall value. This definition ensures a monotonically decreasing precision curve. The mAP is the mean of AP over all artefact classes i, for N = 7 classes:

mAP = (1/N) ∑_{i=1}^{N} AP_i.

This definition of AP was popularised in the PASCAL VOC challenge 10. The calculation is illustrated in Suppl. Fig. 5b, c. Considering variation in annotation, we used IoU ≥ 0.25 to designate a positive "match" between a reference and a predicted box. The mAP ranges between 0 for no detection and 1 for full detection; the higher the mAP, the better the performance.
Artefact detection accuracy (score_d). The detection performance of participants was finally ranked using the weighted score score_d = 0.6·mAP + 0.4·IoU.
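The sketch below illustrates the PASCAL-style AP computation and the weighted detection score defined above; it assumes the per-class predictions have already been matched to ground truth (TP/FP flags sorted by descending objectness score), a step that is not shown, and all names are illustrative.

```python
# Illustrative sketch: PASCAL-VOC-style AP for one class and the weighted detection score.
import numpy as np

def average_precision(matches, num_gt) -> float:
    """matches: 1/0 TP/FP flags sorted by descending objectness; num_gt: ground-truth boxes."""
    matches = np.asarray(matches, dtype=float)
    tp = np.cumsum(matches)
    fp = np.cumsum(1.0 - matches)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # Add sentinels and enforce a monotonically decreasing precision envelope.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for k in range(len(p) - 2, -1, -1):
        p[k] = max(p[k], p[k + 1])
    # Sum area elements wherever the recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(((r[idx + 1] - r[idx]) * p[idx + 1]).sum())

def detection_score(mAP: float, mean_iou: float) -> float:
    """score_d = 0.6*mAP + 0.4*IoU, as defined above."""
    return 0.6 * mAP + 0.4 * mean_iou
```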

Dice coefficient (DSC).
A spatial overlap measure for segmentation, similar to the IoU, defined as DSC(R, S) = 2|R ∩ S| / (|R| + |S|), where | · | denotes the set cardinality and R and S are the reference and predicted masks, respectively. DSC is 0 for no overlap and 1 for perfect overlap. It can be calculated from the IoU as DSC = 2·IoU / (1 + IoU).

Precision (p), recall (r) and F_β score. These measures evaluate the fraction of correctly predicted instances. Given a number #GT of true instances (ground-truth bounding boxes, or pixels in image segmentation) and a method which predicts #Pred instances, precision is the fraction of predicted instances that were correctly found, p = #TP/#Pred, where TP denotes true positives, and recall is the fraction of ground-truth instances that were correctly predicted, r = #TP/#GT. Ideally, the best methods should have jointly high precision and recall. The F_β score captures this desirability in a single number through a weighted (β) harmonic mean of precision and recall, F_β = (1 + β²) · p·r / (β²·p + r).

Segmentation accuracy (score_s or s-score). Semantic segmentation accuracy was measured with similar considerations to detection accuracy, using the combined weighted score score_s = 0.75 · [0.5 · (F_1 + J)] + 0.25 · F_2, which takes into account the overlap between predicted and reference segmentations as given by the Jaccard index (J) and the precision-recall trade-off as given by the Dice similarity coefficient (DSC), also called the F_1 score, and the F_2 score.
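A minimal sketch of these segmentation measures for a single class and binary masks is given below; it is an illustration of the formulas above, not the official evaluation code, and all names are illustrative.

```python
# Illustrative sketch: Dice (F1), Jaccard, F-beta and the combined s-score for boolean masks.
import numpy as np

def jaccard(r, s) -> float:
    union = np.logical_or(r, s).sum()
    return float(np.logical_and(r, s).sum() / union) if union else 0.0

def dice(r, s) -> float:
    denom = r.sum() + s.sum()
    return float(2.0 * np.logical_and(r, s).sum() / denom) if denom else 0.0

def f_beta(r, s, beta=2.0) -> float:
    tp = np.logical_and(r, s).sum()
    precision = tp / max(s.sum(), 1)
    recall = tp / max(r.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta**2) * precision * recall / (beta**2 * precision + recall))

def s_score(reference, prediction) -> float:
    """score_s = 0.75 * [0.5 * (F1 + J)] + 0.25 * F2."""
    r, s = reference.astype(bool), prediction.astype(bool)
    return 0.75 * 0.5 * (dice(r, s) + jaccard(r, s)) + 0.25 * f_beta(r, s, beta=2.0)
```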

Generalization score. We define generalization of artefact detection as the stability of an algorithm in achieving similar performance when applied to a different imaging dataset that may differ in imaging modality and acquisition protocol but contains the same imaging artefact classes. To assess this, participants applied their trained methods to data collected from a sixth institution whose images were not included in either the training or the test data of the detection and segmentation tasks. Without access to the training code, we estimated the generalization ability as the mean deviation, dev_g, between the per-class mAP on the detection test dataset (mAP_i^d) and on the generalization test dataset, counting only classes for which the deviation exceeds a tolerance of 0.1·mAP_i^d.
The best algorithm should have a high mAP_g and a low dev_g (→ 0). In practice, participants were finally ranked using a weighted ranking score, score_g = 1/3 · Rank(dev_g) + 2/3 · Rank(mAP_g), where Rank(mAP_g) is the rank of a participant when sorted by mAP_g in ascending order.

Performance Criteria (technical measures for targeted analysis)
Building super detection by merging participant bounding box predictions.
For each image, all bounding box predictions from all teams were concatenated together to form a super-set of box predictions, B. The super-set B was then filtered to produce the final "super detector" predictions:

Contribution of individual teams to performance of super detector
We measure the contribution of individual teams to the super detector by the mean percentage of merged boxes per image in the given dataset that are taken from each team. Given two detectors that produce similar bounding boxes, the detector that scores its predicted bounding boxes most stably and with higher confidence is superior and will contribute most to the super detector. Suppl. Fig. 7c illustrates that not all the bounding boxes in the merged detector come from top-ranking methods.
Building super segmentation by merging participant segmentation masks. All predicted binary segmentation masks were summed together for each artefact class independently and divided by the number of teams to generate a consensus score between 0 and 1 across teams. Thus, a pixel in the merged mask prediction has value 1 only if all teams predicted that the artefact class was present. The final merged mask retains all predicted pixels for which at least 4 teams predicted the class, i.e. consensus score ≥ 0.4. This threshold on the level of consensus retains only "stable" predictions.
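A minimal sketch of this consensus merge for one artefact class is shown below; the array layout (teams stacked along the first axis) is an assumption made for illustration.

```python
# Illustrative sketch: consensus ("super segmentation") merge for one artefact class.
import numpy as np

def merge_masks(team_masks: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """team_masks: (n_teams, H, W) binary predictions; keep pixels with consensus >= threshold."""
    consensus = team_masks.astype(float).mean(axis=0)   # per-pixel consensus score in [0, 1]
    return (consensus >= threshold).astype(np.uint8)    # e.g. >= 4 of 10 teams agree
```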

Contribution of individual teams to performance of super segmentation
We measure the contribution of individual team predictions to the super segmentation as the mean fraction of pixels shared between the individual team predictions and the super segmentation binary masks across artefact classes.
Confusion matrix of class detection. This measures the proportion (as a fraction between 0 and 1) of predicted boxes that were correctly classified. Predicted boxes for each image were ranked in descending order of predicted objectness score, independently of class, and each was assigned to the ground-truth bounding box with the highest IoU. For each class, the number of times its bounding boxes were assigned to each ground-truth artefact class was then tabulated and normalized by the total number of ground-truth bounding boxes in the class. Predicted boxes with no matched ground-truth box were ignored and do not contribute to the computation. For an image dataset, the average confusion matrix over all images is reported. The ideal matrix has all leading-diagonal elements equal to 1 and all off-diagonal elements equal to 0.
Confusion matrix score of class detection. To summarize the confusion matrix performance with a single number, we report the mean of the trace of the confusion matrix C over the N artefact classes:

Confusion matrix score = Tr(C) / N.

The higher the score, the better the ability of a detection method to distinguish between artefact classes.
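Given the average confusion matrix, the score itself reduces to a one-line computation, as sketched below.

```python
# Illustrative sketch: confusion-matrix score = Tr(C) / N for an (N x N) average confusion matrix.
import numpy as np

def confusion_matrix_score(C: np.ndarray) -> float:
    """Mean of the diagonal (correctly classified fraction per class)."""
    return float(np.trace(C) / C.shape[0])
```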

Supplementary Note III: Deep Neural Network Detection and Segmentation
We briefly review deep neural network approaches to detection and segmentation and propose a taxonomy based on algorithm design. The reader is referred to specialised reviews for technical details of detection 13 and segmentation 14. Recently, deep neural networks have emerged as the automatic method of choice for state-of-the-art detection and segmentation of objects, both in everyday image scenes 10,15 and in biomedical images 16-18. Compared to traditional feature handcrafting approaches, deep neural networks construct features automatically by learning the association between two datasets given sufficient training pairs. Specifically, given a set of input images x and the corresponding desired output y_true (bounding boxes for detection or masks for segmentation), deep neural networks specify a parametrised function f(x; θ), given by weights θ, that minimises the dissimilarity, or loss, between the function output y = f(x; θ) and the desired output y_true (Suppl. Fig. 18a). The performance of a deep learning system thus depends on how well the architecture design of f captures the task complexity, how well the loss function captures the desired objective to be learnt, and the ability of the chosen optimizer to find the best minimum. To this end, modern state-of-the-art architectures can be considered in terms of a 'backbone' network that extracts generically informative image features and feeds into a 'head' network that produces task-specific predictions (Suppl. Fig. 18b-d).
The 'backbone' network is usually a pre-trained network whose weights were previously trained on a very large dataset and a related task, such as image classification in the PASCAL VOC 10 or MS COCO 15 challenges, to take advantage of the hierarchical feature learning of neural networks. Early layers learn generic "primitive" image concepts, such as colour and edges, that are useful across networks 19,20. This practice is commonly called transfer learning and is used to train on smaller datasets. The 'head' network is typically a very small network specifically designed for the given problem (Suppl. Fig. 18c, d) and is trained for each dataset starting from randomly initialized or pretrained weights. To date, numerous deep learning object detection and segmentation systems have been proposed; however, most of these designs share three common underlying considerations: scale (exploiting the image context), accuracy and speed. Scale is the ability to capture local and global spatial relationships between image pixels so as to capture the specific characteristics of objects of any shape and size. Accuracy is the ability to maximize true positive and minimize false positive detections, and speed refers to the computational time required by a trained network to process a single image. Often there is a speed-accuracy compromise. Suppl. Fig. 18e, f places EAD participants' base solutions within a taxonomy constructed according to design principle and chronological order of development.

From the caption of Suppl. Fig. 18: the backbone network is constructed to extract image features through successive downsampling of the input image; because of this property of 'encoding' features, the backbone is also often termed an 'encoder'. Using the backbone's 'encoded' features, the head network produces the task-specific output predictions. c, The two most common detector heads are: i) region proposal classification: the backbone network predicts the coordinates of candidate bounding box regions likely to contain an object, and the corresponding image patch is cropped out and classified into the different object classes or background; ii) regression-based classification: the joint backbone and head network acts as one large region proposal network that simultaneously predicts candidate bounding boxes and their classes without an intermediate proposal of candidate regions. d, The most common detector head found in medical imaging: the downsampled features from the backbone network are 'decoded' via parameterized upsampling to give a classification for each pixel at the original input image size; combined with a backbone 'encoder', the full architecture is known as an 'encoder-decoder' network. e, f, Taxonomy of the submitted EAD2019 participant detection and segmentation algorithms, respectively, organized according to architectural design principle (bold text between arrows, specific realization in brackets and italicized font) and in relation to historical development.
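To make the backbone/head decomposition concrete, the sketch below shows a toy regression-style detector in PyTorch built from a pretrained ResNet backbone and a small convolutional head. It is illustrative only and does not correspond to any participant's submission; the torchvision weights argument is an assumption and differs between library versions (older releases use pretrained=True).

```python
# Illustrative sketch of the backbone ('encoder') + head pattern, not a participant method.
import torch
import torch.nn as nn
import torchvision

class DetectionBackboneHead(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Transfer learning: ImageNet-pretrained ResNet-18 as the backbone.
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        # Toy regression-style head: per-location class scores and box offsets.
        self.cls_head = nn.Conv2d(512, num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(512, 4, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)                 # downsampled 'encoded' features
        return self.cls_head(feats), self.box_head(feats)

# Usage example on a single 256 x 256 RGB frame:
# scores, boxes = DetectionBackboneHead()(torch.randn(1, 3, 256, 256))
```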