Endomapper dataset of complete calibrated endoscopy procedures

Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset vary from the anatomical landmarks, procedure labeling, segmentations, reconstructions, simulated sequences with ground truth and same patient procedures. The software used in this paper is publicly available.


Background & summary
Endoscopes traversing body cavities are routine. However, their potential for navigation assistance or device autonomy remains mostly locked. In order to unlock it, computer-assisted endoscopes would require spatial AI (Artificial Intelligence) capabilities, i.e., being able to estimate a map of the regions that are traversed, along with the endoscope localization within such map. This capability is known in the robotics literature by the acronym VSLAM (Visual Simultaneous Localization and Mapping). Spatial AI and VSLAM will augment endoscopies with novel features, including augmented reality insertions, detection of blind zones, polyp measurements or guidance to polyps found in previous explorations. In the long term, VSLAM will also support entirely new robotized autonomous procedures. For our purposes in this paper, the goal of VSLAM is to build a per-patient map, in real time, during endoscope insertion in a first procedure. This map will be exploited and refined during the withdrawal phase of that first procedure, and in any future one.
There are mature methods for out-of-the-body VSLAM 1,2. However, bringing them to endoscopy implies overcoming new barriers. The light source is co-located with the endoscope, and hence is moving and close to the body surfaces. The body surfaces have poor texture and abundant reflections due to fluids. The scene geometry undergoes prevalent deformation. The video combines slow observation of areas of interest with fast motions and long occlusions of the endoscope lenses.
Our contribution in this paper is the Endomapper dataset 3, which makes available, for the first time, 96 high-quality calibrated recordings of complete routine endoscopies (Figure 1), making secondary use of medical data, i.e., just recording standard procedures that were going to be performed in any case, without any modification. Compared to ad-hoc recordings, secondary-use ones show realistic features and hence contain the actual challenges VSLAM will face in routine practice.
No other public dataset offers a comparable volume of fully calibrated endoscopies in HD (see Table 1). Heidelberg 4 is a very interesting dataset that contains images of the colon from laparoscopic procedures. However, this view of the colon is not compatible with our goal of building 3D reconstructions of the interior of the gastrointestinal tract. CVCClinicDB, GIANA and Kvasir focus on polyp detection since they are often used to benchmark CAD (Computer Aided Diagnosis) systems. Other datasets focus on the segmentation of elements of interest, such as tools in instrument-kvasir or polyps in Kvasir-seg. However, they only provide sparse image sets or short videos (less than 30 seconds). More similar to ours, Colon10k provides images from short sequences for place recognition and reconstruction. In contrast, we offer hours of real calibrated video, corresponding to the complete procedures. Apart from these real imaging datasets, it is worth mentioning works that create simulated data of the colon, either using 3D models, like Rau et al. 5, Bobrow et al. 6 and Incetan et al. 7, or a realistic phantom, like Ozyoruk et al. 8. Similar to these works, we also include a few simulated sequences, whose extra information serves to evaluate the methods developed. In particular, due to the monocular nature of the dataset, no ground truth geometry is available for quantitative evaluation of the real sequences. To address this issue, we include photorealistic sequences from a simulated colon, with ground truth geometry for the deforming scene and the endoscope trajectory.
The Endomapper dataset includes colonoscopies, gastroscopies, and calibration videos along with geometric and photometric calibration parameters. More than half of the sequences are screening colonoscopies, for which the standard procedure implies a thorough and slow exploration. This is close to the typical operation modes for VSLAM and can serve as a bridge to more challenging sequences. To research map reuse and recognition in a second exploration, colonoscopies corresponding to the same patient but separated in time by several weeks are also included in the data.
Regarding metadata, some endoscopies include a description of the procedure made by the endoscopist, in the form of text footage. The text describes the anatomical regions traversed, re-explorations of the same region, the interventions performed or the tools used. This footage indexes the videos to identify interesting sections for VSLAM. Building on our dataset, the community can provide derived data or metadata to support subsequent research. Some examples of these derived data are included in the dataset: 1) anatomical region segmentation, at frame level, performed by a doctor after visualizing the video; 2) tool segmentation in selected video sections, which can boost tool segmentation performance in the specific endoscopy domain; 3) Structure from Motion (SfM) reconstructions using COLMAP 18, which provide an up-to-scale 6 DoF endoscope trajectory and 3D models for the video segments corresponding to smooth explorations of non-deforming scenes. The SfM output has proven valid to supervise learning tasks such as image matching 19 or image retrieval 20.
Endomapper offers a sweet spot of challenge, including easy video segments where state-of-the-art algorithms perform reasonably. However, all these methods also fail at some point, signaling the research challenges to be faced. We believe that the dataset will spur research that identifies challenges and fosters progress of VSLAM in gastrointestinal environments.
Finally, we have made publicly available seven software repositories corresponding to photometric and geometric calibration from calibration videos, simulated colon sequence generation and technical validation. Section Code availability contains the details of these repositories, including installation and usage instructions.

Methods
The methodology used to create the dataset is explained in this section. First, we present a description of the recording procedure for the sequences in the dataset, including the description of the capture system and the type of recordings. Then, we describe the calibration procedure and the methodology used in both geometric and photometric calibration. Finally, we also briefly summarize the methods used to create each type of meta-data.

Recording endoscopies procedure
The acquisition of the sequences in the dataset was performed in the Hospital Clinico Universitario Lozano Blesa, in Zaragoza (Spain), using an Olympus EVIS EXERA III CV-190 video processor, EVIS EXERA III CLV-190 light source, and EVIS EXERA III CF-H190 colonoscope or EVIS EXERA III GIF-H190 gastroscope. The acquisition system is composed of a computer and a data acquisition card connected to the endoscopy tower via a Digital Visual Interface (DVI). Two different acquisition cards have been used: Epiphan Video DVI2USB 3.0 and Magewell Pro Capture DVI. The videos were recorded at 1440×1080 at 40 fps in 24-bit RGB (Epiphan) or 1440×1080 at 50 fps in 24-bit RGB (Magewell). The output image given by the endoscopy tower is cropped to remove personal information. The videos were manually edited to remove any frame recorded when the camera was outside the body of the patient. During the span of the project, the recordings were done one day per week and coordinated with the medical staff involved. The patients were not selected based on their symptoms or pathology; we followed without interference the hospital's schedule, which was mostly focused on the colorectal cancer screening program. The Endomapper technical staff was present in all the recording sessions to secure the quality of the acquisitions, but without interfering with the medical procedure.

Calibration
The dataset uses 10 different colonoscopes and 8 different gastroscopes. The calibration sequences for all the colonoscopes and gastroscopes were acquired in a single session using a Lambertian pattern (obtained from calib.io). Figure 2 shows two frames of the calibration videos imaging the calibration pattern. The Lambertian pattern corresponds to an array of circles from the Vicalib 21 library. The physical size of the pattern used is 5.61 × 9.82 cm.

Geometric calibration
The calibration videos are processed by Vicalib 21 to obtain the endoscope intrinsic parameters according to the Kannala & Brandt model 22,23. The calibration defines eight intrinsic parameters: four projective parameters (in pixels) $f_x, f_y, C_x, C_y$ and four distortion coefficients $k_1, k_2, k_3, k_4$. We process 1 out of 20 frames and outlier matches are removed. The projection model yielding the projection in pixels $\mathbf{u} = (u, v)$ for a 3D point with coordinates $\mathbf{X} = (x, y, z)$ with respect to the camera frame is described as:

$$\mathbf{u} = \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} f_x \, r_d \, x / r + C_x \\ f_y \, r_d \, y / r + C_y \end{pmatrix} \qquad (1)$$

where

$$r_d = \theta + k_1 \theta^3 + k_2 \theta^5 + k_3 \theta^7 + k_4 \theta^9 \qquad (2)$$

is the distorted radius, $r = \sqrt{x^2 + y^2}$ is the undistorted radius and $\theta = \operatorname{arctan2}(r, z)$ is the angle between the incoming ray and the optical axis.
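As an illustration of the projection model in this subsection, the following Python sketch projects a 3D point using the Kannala & Brandt polynomial in its common odd-power form, $r_d = \theta + k_1\theta^3 + k_2\theta^5 + k_3\theta^7 + k_4\theta^9$. That polynomial form, the function name and the numeric intrinsics are assumptions made for the example; they are not values taken from any Endoscope_XX calibration file.

```python
import math


def project_kannala_brandt(X, fx, fy, cx, cy, k):
    """Project a 3D point X = (x, y, z), given in the camera frame, to pixels.

    k holds the four distortion coefficients (k1, k2, k3, k4).
    Assumes the odd-power Kannala & Brandt polynomial for the distorted radius.
    """
    x, y, z = X
    r = math.hypot(x, y)              # undistorted radius
    if r < 1e-12:                     # point on the optical axis
        return (cx, cy)
    theta = math.atan2(r, z)          # angle between the ray and the optical axis
    t2 = theta * theta
    r_d = theta * (1.0 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)
    return (fx * r_d * x / r + cx, fy * r_d * y / r + cy)


# Hypothetical intrinsics, for illustration only.
FX, FY, CX, CY = 700.0, 700.0, 720.0, 540.0
K = (-0.05, 0.01, 0.0, 0.0)
print(project_kannala_brandt((0.02, -0.01, 0.05), FX, FY, CX, CY, K))
```

Because the model is expressed in terms of the ray angle rather than the perspective ratio $r/z$, it remains well defined for rays far from the optical axis, which is where wide-angle endoscope images are most distorted.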

Photometric calibration
The light source and camera of the endoscope are calibrated to obtain a model able to reproduce the photometry of the recordings. In the endoscope, the distances between the light sources and the camera are small and mostly symmetrical. Following Modrzejewski et al. 24, we assume that these sources can be modelled as a single virtual light and adopt the Spot Light Source model (SLS), which was shown to offer a good compromise between complexity and accuracy. In addition, the light spread function and the camera vignetting are jointly estimated assuming radial symmetry. With this model, the light radiance going from the endoscope to a 3D surface point $\mathbf{X}$ is

$$\sigma_{SLS}(\mathbf{X}) = \sigma_0 \, R\big(\mu, \mathbf{D}, \mathbf{L}(\mathbf{X}, \mathbf{P})\big) \, S(\mathbf{X}, \mathbf{P}) \qquad (3)$$

where $\mathbf{P}$ is the light centre, $\sigma_0$ is the light's intensity value and $\mathbf{D}$ is the principal direction in which light propagates with a spreading factor $\mu$, which modulates the radial attenuation $R$. As the light traverses the scene, its radiance decreases as a function of the distance travelled $d = \|\mathbf{X} - \mathbf{P}\|$, following an inverse-square law $S(\mathbf{X}, \mathbf{P}) = 1/d^2$. Finally, $\mathbf{L}(\mathbf{X}, \mathbf{P})$ is the unit vector of the light's outgoing direction. The corresponding intensity value $I(\mathbf{X})$ on the image is:

$$I(\mathbf{X}) = \big( g_t \, \sigma_{SLS}(\mathbf{X}) \, f_r(\omega_i, \omega_r) \, \cos\theta \big)^{1/\gamma} \qquad (4)$$

where a bidirectional reflectance distribution function (BRDF) $f_r(\omega_i, \omega_r)$ defines how light is reflected from the surface to the camera. The projection of the light beam on the geometry introduces a cosine term of the angle $\theta$ between the incoming light ray $\omega_i$ and the surface normal. Finally, the endoscope applies an automatic gain $g_t$, which can vary at every time instant $t$, and a gamma curve ($\gamma = 2.2$) to improve the perceived dynamic range of the image.

The parameters of this model are estimated by optimising a photometric loss on the white areas of the Vicalib pattern (Figure 2). The results of the calibration provide a 2D weighting of the photometric effects caused by the vignetting and the light spread function, which can be used to compensate for them (Figure 3).

Figure 3 panels: original image, radial attenuation, compensated image.
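The photometric model of this subsection can be sketched in a few lines of Python. The text names a radial attenuation R governed by the spreading factor µ and the principal direction D but, as written here, does not give its closed form; the exponential fall-off `exp(-mu * (1 - D·L))` used below is an assumption chosen for illustration, as are the Lambertian BRDF (albedo / π) and all function names.

```python
import math


def sls_radiance(X, P, D, mu, sigma0=1.0):
    """Radiance reaching surface point X from a virtual spot light at P.

    D is the unit principal direction of the light, mu its spreading factor.
    ASSUMPTION: radial attenuation R = exp(-mu * (1 - D.L)).
    """
    v = [xi - pi for xi, pi in zip(X, P)]
    d = math.sqrt(sum(c * c for c in v))          # distance travelled
    L = [c / d for c in v]                        # unit outgoing light direction
    cos_axis = sum(a * b for a, b in zip(D, L))
    R = math.exp(-mu * (1.0 - cos_axis))          # assumed radial attenuation
    S = 1.0 / d**2                                # inverse-square law
    return sigma0 * R * S


def image_intensity(radiance, cos_theta, gain=1.0, albedo=1.0, gamma=2.2):
    """Image intensity for a surface lit with the given radiance.

    ASSUMPTION: Lambertian BRDF f_r = albedo / pi.
    cos_theta is the cosine between the incoming light ray and the surface normal.
    """
    f_r = albedo / math.pi
    linear = gain * radiance * f_r * max(cos_theta, 0.0)
    return linear ** (1.0 / gamma)                # gamma response curve


# A fronto-parallel patch 20 mm in front of a light placed at the origin.
rad = sls_radiance((0.0, 0.0, 20.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), mu=2.0)
print(image_intensity(rad, cos_theta=1.0, gain=500.0))
```

With this structure, compensating an image amounts to dividing the linearised intensity by the per-pixel attenuation terms, which is the role of the 2D weighting described above.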

Simulated colon
The VR-Caps 7 simulator is used to generate photorealistic synthetic image sequences of a 3D colon model obtained from a Computed Tomography scan. Since this is a simulation, we have full access to the scene configuration (camera calibration, deformations, trajectory and illumination), and hence to the ground truth geometry, camera pose and 3D deforming scene. For the same endoscope trajectory, we generated different sequences with increasingly aggressive deformations to allow ablation studies with respect to the deformation magnitude. The applied deformations are defined in terms of $V^0_x$, $V^0_y$ and $V^0_z$, the coordinates of the surface point at rest. We can control the magnitude and velocity of the deformations through the parameters $A$ and $\omega$, which correspond to the maximum excursion and the velocity of the deformations, respectively. We also modified the colon texture to increase its contrast. The code to create these simulated sequences is available in the repository EM_Dataset-Simulations (see Section Code availability, https://github.com/endomapper/EM_Dataset-Simulations).

Meta-data
For a set of selected recordings, we provide several types of meta-data useful for many potential research lines, and in particular for VSLAM. This subsection presents a description of the meta-data and the annotation methodologies.

Text footage
The endoscopist performing each procedure provided a description of it, which was registered during the exploration. It includes the anatomical regions traversed, the interventions, the medical findings such as the approximate polyp size, the tools used, or the sections with NBI (Narrow-Band Imaging) illumination. This description is made available as text footage synchronized with the corresponding videos. This metadata can be useful, for example, to identify the sections of the video that are more promising for VSLAM, such as the re-observations of the same region or interactions with tools of known size.

Anatomical regions
Anatomical section recognition is useful to create topological maps of the colon. These maps can be used to create smaller reconstructions with a lower probability of errors. Some colonoscopy procedures were annotated by the medical staff of the project after the recording. Multiple careful visualizations were necessary to delimit the ten anatomical regions, which are shown in Figure 4.

Tools segmentation
Tool segmentation is one of the challenges for spatial AI in colonoscopies. Since tools occlude the view and cause failures in other algorithms, many works in the literature mask them out. Tools were manually segmented using the Odin CAT tool 25, which allows a mask to be maintained between frames, giving a more robust annotation.

COLMAP 3D reconstruction
Traditional SIFT-based rigid SfM algorithms are able to produce partial reconstructions from colonoscopy videos. We include some examples of the output of COLMAP 18,26 processing on our sequences, which provides a first approximation of the up-to-scale camera trajectory and the scene's sparse structure. This information can be organized to produce weak supervision in the form of sparse depth maps, local correspondences between frames, image-to-image labels (frames depicting the same place) or relative camera pose transformations between frames. Several computer vision tasks like depth prediction, image matching, image retrieval and visual localization can greatly benefit from this kind of supervision. Megadepth 27 is a well-known dataset that uses this SfM procedure to obtain 3D point clouds, similar to ours. It is being extensively used for deep learning supervision 19,28,29. Other works employed SfM to identify co-visible frames in the recordings, which has proven to be useful to train CNNs for place recognition in landmark images 20 and in colonoscopy sequences 16,30.
For our recordings, we apply exhaustive guided matching between all the images in the sequence to associate frames that are temporally distant. We use our camera calibration and we do not optimize it during the COLMAP bundle adjustment. The minimum triangulation angle is relaxed to 8 degrees during the initialization of the models. The rest of the parameters are left at their defaults.

Recordings from the same patient
One of the main obstacles in colon reconstruction is the consistency between colonoscopies in longitudinal studies. Thanks to the colorectal cancer screening program, colonoscopy pairs from the same patient were recorded. These pairs help to evaluate the lifelong capabilities of the developed VSLAM algorithms.

Data Records
This section describes the dataset structure and details of the meta-data available. The dataset is available on the Synapse platform 3 and is subject to access controls (see Section Usage Notes). A summary of the dataset structure can be seen in Figure 5. At publication time, there is a total of 96 real sequences and their duration ranges from less than ten minutes to more than half an hour. The file DatasetSummary.xls in the dataset main folder includes a summary of the acquisition details of each sequence in the dataset.

Video recordings
Data is stored in the directory Sequences. Each procedure has a corresponding directory Seq_XXX (XXX is the sequence number) that contains:
1. The directory meta-data, which contains all the meta-data files associated with the sequence. These files are described in the next section.
2. The video Seq_XXX.mov, which holds the actual recording. The video codec is H264 31, a lossy compression using the profile High 4:4:4 with level 4.2 and a bit rate of 7 Mbps. It offers a good size vs. quality trade-off for lossy compression.
3. The thumbnail version, Seq_XXX_thumbnail.webm, which contains a compressed version of the recording for easy and quick visualization. This version uses the free codec libvpx 32, at 320 × 240 resolution.

Camera calibration
All the calibration information is included in the directory Calibrations. There is a directory Endoscope_XX (XX is the endoscope number) for each endoscope that contains:
1. The calibration video Endoscope_XX.mov. This version is the lossy H264 version. The lossless version can be found in the lossless folder mentioned before.

Geometric calibration
The file Endoscope_XX_geometrical.xml is the output calibration from Vicalib 21. This XML file contains the intrinsic parameters of the camera ($f_x, f_y, C_x, C_y, k_1, k_2, k_3, k_4$) following the Vicalib output format.

Photometric calibration
The photometric calibration file, Endoscope_XX_photometrical.xml, contains the calibrated parameters of the light source and the camera of the endoscope. An endoscope's <rig> may have one or more <camera> tags, associated with one or more <light> sources. Currently, only a single camera and a single virtual light are supported.
Each camera tag has a particular <camera_model>. This model has a single parameter, the value of the gamma $\gamma$ of the response function in Eq. (4). Regarding the light source, the <light_model> has four parameters: the intensity value $\sigma_0$, the light spread factor $\mu$ and two vectors for the light centre $\mathbf{P}$ and the principal direction $\mathbf{D}$.

Simulated colon
All the data related to the simulated colon is included in the directory Simulated Sequences. There is a directory seq_X (X is the sequence number) for each sequence obtained from the simulation. The directory contains:
1. The directory rgb with the RGB images of the sequence in png format.
2. The directory depth with the depth images for each RGB image of the sequence stored in exr format.
3. A file rgb.txt with a list of the file names of all RGB images of the sequence.
4. A file depth.txt with a list of the file names of all depth images of the sequence.
5. A file trajectory.csv containing the ground truth camera trajectory.
6. A file calibration.txt containing the simulated camera calibration.
7. A file info.txt containing the deformations applied, their parameters and units.

Meta-data
This section contains the details and formats for each type of meta-data. The file DatasetSummary.xls details the availability of the metadata in each sequence of the dataset.

Text footage
Two files, Seq_XXX.json and Seq_XXX.srt, are included inside the root and meta-data folder. The .json file contains a structure with the timestamp and the associated text. The text footage is also included in .srt format to ease visualization synchronized with the video. The references to identify the tools used during the procedure are stored in the meta-data directory.
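To locate the video sections that the text footage describes, the .srt file can be parsed into timed entries and each timestamp converted to a frame index. The sketch below relies only on the standard SubRip layout (index line, `start --> end` line, text lines); the sample text, the function names and the 50 fps figure in the usage line are illustrative assumptions, and the frame rate should be taken from the acquisition details of each sequence.

```python
import re

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' to seconds."""
    h, m, s, ms = map(int, _TIME.match(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0


def parse_srt(text):
    """Return a list of (start_s, end_s, text) tuples from SubRip content."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if timing is None:
            continue                                  # not a subtitle block
        start, end = lines[timing].split("-->")
        caption = " ".join(lines[timing + 1:])
        entries.append((_to_seconds(start), _to_seconds(end), caption))
    return entries


def seconds_to_frame(seconds, fps):
    """Nearest frame index for a timestamp, given the video frame rate."""
    return round(seconds * fps)


# Hypothetical footage, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:04,000
Cecum

2
00:01:02,500 --> 00:01:10,000
Polyp found
"""

for start, end, caption in parse_srt(SAMPLE):
    print(seconds_to_frame(start, 50), seconds_to_frame(end, 50), caption)
```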

Anatomical regions
Table 2 shows the detailed number of frames labelled for each region in each video. The dataset contains this information in a file named Anatomical_Regions_XXX.txt with the format Frame###;region label; in each line.
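A per-region frame count like the one summarized in Table 2 can be derived directly from these files. The following sketch assumes that the first semicolon-separated field contains the frame number (possibly prefixed by text such as "Frame") and that the second field is the region label; the sample lines and label names are hypothetical.

```python
import re
from collections import Counter


def parse_region_line(line):
    """Parse one 'Frame###;region label;' line into (frame_number, label).

    Returns None for blank or malformed lines.
    """
    fields = [f.strip() for f in line.strip().split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        return None
    match = re.search(r"\d+", fields[0])      # frame number inside the first field
    if match is None:
        return None
    return int(match.group()), fields[1]


def frames_per_region(lines):
    """Count labelled frames for each anatomical region."""
    counts = Counter()
    for line in lines:
        parsed = parse_region_line(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# Hypothetical file content, for illustration only.
SAMPLE_LINES = [
    "Frame000120;rectum;",
    "Frame000121;rectum;",
    "Frame000500;sigmoid;",
    "",
]
print(frames_per_region(SAMPLE_LINES))
```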

Tool Segmentation
There are 4086 frames with tools segmented across four different colonoscopies, as detailed in Table 3. The segmentations for each video can be found in the file tool_segmentation_XXX.xml. This file contains, for each segmented frame, the id of the frame and a list of 2D point coordinates that define the tool segmentation as a binary polygon. The segmentation was done using the proprietary Odin CAT tool 25. Some examples can be seen in Figure 6.
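Since the segmentation is stored as a polygon, training a segmentation network usually requires rasterizing it into a binary mask first. The sketch below does this with an even-odd ray-casting test in pure Python; it takes the polygon as a plain list of (x, y) points and deliberately does not parse the XML, whose exact schema is not described here. For full-resolution frames a vectorized routine from an image library would be the practical choice.

```python
def point_in_polygon(px, py, polygon):
    """Even-odd rule: True if (px, py) lies inside the closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):            # edge crosses the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(polygon, width, height):
    """Binary mask (list of rows) with 1 where the pixel centre is inside."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical square tool outline on a tiny 6 x 6 image.
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
for row in polygon_to_mask(square, 6, 6):
    print("".join(str(v) for v in row))
```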

COLMAP 3D reconstruction
Table 4 summarizes the reconstruction results for the Endomapper sequences. The reconstructions are stored following the text format of COLMAP (https://colmap.github.io/format.html). We provide text files showing the images contained in each cluster reconstructed by COLMAP, as well as the images that COLMAP considered covisible, i.e., images that have at least one 3D point in common. Figure 7 shows two examples of these reconstructed clusters.
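The camera trajectory and the covisibility relation can be recovered from a COLMAP `images.txt` file with a few lines of Python. In that text format each image takes two lines: a pose line `IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME` (world-to-camera rotation and translation) followed by a line of `X Y POINT3D_ID` triplets, where an id of -1 marks a 2D feature without a 3D point. The sketch below is a minimal reader under those conventions; the sample content and function names are invented for the example, and image names are assumed to contain no spaces.

```python
from itertools import combinations


def quat_to_rot(qw, qx, qy, qz):
    """Rotation matrix (nested lists) from a unit quaternion."""
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def parse_images_txt(text):
    """Return {image_id: {'name', 'center', 'points3d'}} from images.txt content."""
    rows = [ln for ln in text.splitlines() if not ln.startswith("#")]
    images = {}
    for pose_line, points_line in zip(rows[0::2], rows[1::2]):
        f = pose_line.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, f[1:8])
        R = quat_to_rot(qw, qx, qy, qz)
        t = (tx, ty, tz)
        # Camera centre in world coordinates: C = -R^T t
        center = tuple(-sum(R[r][c] * t[r] for r in range(3)) for c in range(3))
        ids = (int(v) for v in points_line.split()[2::3])
        images[int(f[0])] = {
            "name": f[9],
            "center": center,
            "points3d": {i for i in ids if i != -1},
        }
    return images


def covisible_pairs(images):
    """Image id pairs that share at least one 3D point."""
    return [(a, b) for a, b in combinations(sorted(images), 2)
            if images[a]["points3d"] & images[b]["points3d"]]


# Hypothetical reconstruction, for illustration only.
SAMPLE = """# Image list with two lines of data per image:
#   IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
#   POINTS2D[] as (X, Y, POINT3D_ID)
1 1 0 0 0 1 2 3 1 frame_0001.png
10.0 20.0 7 30.0 40.0 -1
2 1 0 0 0 0 0 0 1 frame_0002.png
11.0 21.0 7 50.0 60.0 9
3 1 0 0 0 0 0 0 1 frame_0003.png
12.0 22.0 -1
"""

imgs = parse_images_txt(SAMPLE)
print(covisible_pairs(imgs))
```

Because the reconstructions are up-to-scale, the camera centres obtained this way are expressed in arbitrary units and need an external scale reference before being compared with metric quantities.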

Same patient recordings
A file SamePatient.json is stored in the root folder containing which sequences are from the same patient and the time that separates both sequences.

Technical validation
Firstly, we detail an error analysis of the calibration and a comparison of the calibration parameters among endoscopes. Secondly, we test state-of-the-art SfM and VSLAM algorithms on typical colonoscopy sequences. Finally, the anatomical region and tool segmentation labels are validated with state-of-the-art recognition algorithms. All the code used in this section is publicly available in the Endomapper repositories https://github.com/Endomapper.

Calibration validation

Geometric calibration
The software used to compute the geometric calibration and to obtain the validation and comparisons shown in this section is available in the repository EM_Dataset-GeometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-GeometricCalibration). The geometric calibration was computed from the calibration videos using the Vicalib 21 tool, tuning the parameters for each endoscope calibration separately. The parameters are detailed in the repository.
To compare the different calibrations visually, we have undistorted a grid using each calibration. Figure 8 shows the differences among the 10 colonoscopes and among the 8 gastroscopes, respectively. The results show that the calibrations are equivalent around the center of the images and that differences between them are significant only at the image borders.
To further assess the calibration results, we analyzed the reprojection error. For all the calibrations, the RMSE is between 0.3 and 0.4 pixels. We have selected two endoscopes, one colonoscope (Endoscope_06) and one gastroscope (Endoscope_18), as prototypes. Figure 9 displays the inlier reprojection error distribution for the selected endoscopes. The error of the inliers is uniformly distributed across the image, and only at the image boundaries are there fewer measurements. The calibrations are therefore expected to be very accurate in general, with the least accurate areas being those closest to the borders.
The projection function that relates the incoming ray angle $\theta$ to the distorted radius $r_d$ is plotted in Figure 10. Both types of endoscopes are almost equal and very close to an orthogonal projection fisheye lens 22. Finally, in Figure 11 we show the view angle of both prototype endoscopes. The gastroscope has a lower view angle than the colonoscope at the edges, which is why each type of endoscope needs to be calibrated separately.
With this analysis we conclude that the individual calibrations obtained from the videos are accurate (the RMSE is low, the measurements cover most of the image, and the results are consistent across all endoscopes). We believe that accurate calibration information boosts the performance of the geometric methods.

Photometric calibration
The software used to compute the photometric calibration is available in the repository EM_Dataset-PhotometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-PhotometricCalibration). The photometric calibration was computed from the calibration videos for each endoscope separately. We selected 38 frames per sequence, looking for a variety of distances from the camera to the calibration pattern. On each frame, we consider a 120° field of view. The centre of the virtual light converges about 4 mm behind the tip of the endoscope, thus being able to model all real lights with a single beam. The gamma value is experimentally fixed to $\gamma = 2.2$, which is also a broadly used value. The endoscope applies a continuous gain control, progressively increasing or decreasing the gain. Relative auto-gain is estimated with respect to the first image of the sequence. Consequently, the $\sigma_0$ value is unobservable and it is fixed to one.
The resulting models are validated on a different set of images of the Vicalib pattern. The photometric errors in Table 5 show the validation results of eight colonoscopes and eight gastroscopes. In the colonoscopes, the calibration is able to estimate the pixel intensities of the images with an RMSE of 2.9 grey levels. In gastroscopes, lights are not symmetrically placed on the tip of the endoscope. Consequently, the RMSE increases slightly, up to 3.3 grey levels.

SfM/SLAM validation

COLMAP validation
COLMAP is able to estimate sparse reconstructions for different sections along a sequence; see some examples in Figure 7. As can be seen, the 3D point cloud and the camera trajectory look reasonable, showing a tubular shape with cameras traversing it. The covisibility information is always accurate, as the geometrical checks in COLMAP prevent frames that do not observe the same place from being incorrectly reconstructed in the same model.
Covisibility information was exploited in 30, where a CNN was trained with COLMAP reconstructions from our sequences for the task of image retrieval. The system 30 is able to recognize frames observing the same place in the colon, even when the frames come from different colonoscopies of the same patient. In Figure 12, we can see some examples of images successfully retrieved between two colonoscopies of the same patient performed within two weeks of each other. The 3D reconstructions look reasonably accurate and can be of great help as weak supervision for training depth, camera pose or image retrieval networks. The software used to compute the reconstructions shown in this section is available in the repository EM_Dataset-ColmapValidation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ColmapValidation).

ORB-SLAM validation
ORB-SLAM3 1 is the reference system for sparse real-time visual SLAM. We have used it to process the whole Seq_015 video, to build the map and estimate the endoscope pose. To achieve real-time performance, the image size is reduced from 1440 × 1080 to 720 × 540 and one out of every two frames is skipped. To address non-rigidity, the reprojection error acceptance threshold has been increased by a factor of 2 with respect to its default value for rigid scenes, which helps in preventing tracking losses. The Kannala-Brandt camera model has proven to be essential to extract and triangulate features close to the borders of the image, where distortion is significant. As argued, an accurate calibration enables the use of the whole image for geometric computation, boosting accuracy and robustness.
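When the images are downscaled before being fed to a SLAM system, the projective intrinsics must be rescaled accordingly, while the Kannala-Brandt distortion coefficients stay unchanged because they act on the ray angle rather than on pixel coordinates. The Python sketch below shows this adjustment together with a simple real-time budget check. The helper names and the numeric intrinsics are hypothetical, and the plain multiplication of the principal point assumes a pixel convention in which coordinates scale linearly with image size.

```python
def scale_intrinsics(fx, fy, cx, cy, scale):
    """Rescale projective intrinsics for an image resized by `scale`.

    Distortion coefficients k1..k4 of the Kannala-Brandt model are not
    affected, since they are functions of the ray angle theta.
    """
    return fx * scale, fy * scale, cx * scale, cy * scale


def fits_real_time(tracking_ms, frame_interval_ms):
    """True if processing one frame takes no longer than the frame interval."""
    return tracking_ms <= frame_interval_ms


# Hypothetical full-resolution intrinsics (1440 x 1080), halved to 720 x 540.
fx, fy, cx, cy = scale_intrinsics(700.0, 700.0, 720.0, 540.0, 0.5)
print(fx, fy, cx, cy)
```

This is only a sketch of the bookkeeping involved; the actual settings used for the experiments are those documented in the EM_Dataset-ORBSLAM3Validation repository.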
ORB-SLAM3 has been able to estimate 133 sub-maps of small size (see Table 6 for a summary of the map specifications). The camera has been localized successfully with respect to a map in 25% of the frames. The time between video frames is 40 ms, and ORB-SLAM3 is able to run in real time, with an average tracking time of 23 ms and a maximum of 37 ms. Figure 13 shows a sub-map where the camera undergoes a forward-backward motion. The map contains 54 keyframes, 3682 points and 349 frames.
From this analysis we conclude that our EndoMapper dataset offers the challenges of real endoscopy exploration, such as scene deformation, multiple occlusions, changes in lighting, and clutter due to cleaning water or tools, which eventually result in frequent tracking losses. Classical discrete-feature VSLAM methods like ORB-SLAM3 can run on these videos in real time, localizing the camera in 25% of the frames. However, the scene model is fragmented into a myriad of very small rigid sub-maps. The clear challenge is to develop multi-map techniques for endoscopy that are able to merge sub-maps with common areas.
The software and the detailed tuning used to compute the ORB-SLAM3 reconstructions shown in this section are available in the repository EM_Dataset-ORBSLAM3Validation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ORBSLAM3Validation).

Anatomical region validation
The anatomical region labels have been validated by fine-tuning different models for anatomical region recognition. The software used to evaluate the anatomical region recognition is available in the repository EM_Dataset-AnatomicalRegions (see Section Code availability, https://github.com/endomapper/EM_Dataset-AnatomicalRegions). For the fine-tuning, we train the models for 100 epochs following the learning rate decay in Thanh et al. 34 and use 4 sequences (Seq_003, Seq_011, Seq_013, Seq_093) for training and Seq_094 for testing. Seq_094 was chosen as the test sequence because it has the best balance between classes. The metrics used to evaluate this experiment were Top-1 and Top-3 accuracy, defined as the fraction of test frames for which the true class matches the most probable prediction and any of the 3 most probable predictions, respectively. Table 7 presents the results for the anatomical region recognition.
ResNet performs better in Top-1 accuracy and similarly in Top-3, which makes it the best model overall. It is also interesting to note that MobileNet obtains a close performance with a smaller computational footprint, which makes it attractive for real-time systems. It is also worth remarking that Top-1 accuracy is low in comparison to other computer vision tasks, suggesting that anatomical region classification is a challenging research problem. Indeed, the differences between sections are very subtle, even for the trained eye. In any case, the results on Top-3 accuracy are promising and show that the EndoMapper data is a relevant tool to advance the performance in this problem.
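For reference, the Top-1 and Top-3 metrics used above can be computed from per-frame class scores as in the following Python sketch. The score matrix and labels are toy values invented for the example and are unrelated to the results in Table 7.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-sample score lists (one score per class).
    labels: list of true class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three frames and three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, 1), top_k_accuracy(scores, labels, 2))
```

A large gap between Top-1 and Top-3 accuracy, as observed here, typically indicates that the confusions are concentrated among a few similar classes, which is consistent with adjacent anatomical regions looking alike.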

Tools segmentation validation
The tool segmentation labels have been validated by training and testing several models for binary tool segmentation as proposed in Tomasini et al. 38. This work compares the performance of various models on three different datasets, including the EndoMapper dataset labels. All the models were trained from scratch on the EndoVis17 dataset and fine-tuned on Kvasir-Inst and EndoMapper. The performance results obtained can be seen in Table 8. Examples of binary segmentation of images from the Endomapper dataset obtained using the different models can be seen in Figure 14.
The lower mIoU of all models on our EndoMapper dataset compared to that on the other datasets highlights the challenge posed by the EndoMapper tool segmentation labels. It is interesting to note that the efficient MiniNetV2 reaches similar performance to state-of-the-art models on all datasets while requiring less memory and inference time. The software used to evaluate the tool segmentation models is available in the repository EM_Dataset-ToolSegmentation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ToolSegmentation).
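As a reminder of what the mIoU figure measures, the sketch below computes the intersection over union for each class of a binary mask and averages the result. Averaging over the two classes (background and tool) is an assumption made for this example; the exact averaging protocol used for Table 8 is the one implemented in the EM_Dataset-ToolSegmentation repository.

```python
def iou(pred, target, cls):
    """Intersection over union of class `cls` between two label masks."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes=(0, 1)):
    """Mean IoU over the classes that appear in prediction or target."""
    values = [iou(pred, target, c) for c in classes]
    values = [v for v in values if v == v]        # drop NaN (absent classes)
    return sum(values) / len(values)


# Toy 2 x 2 masks: 1 = tool, 0 = background.
prediction = [[1, 1],
              [0, 0]]
ground_truth = [[1, 0],
                [0, 0]]
print(mean_iou(prediction, ground_truth))
```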

Usage notes
The dataset is available on the Synapse platform 3. The dataset can be publicly accessed under the following conditions: 1) Limited to research on how to obtain relevant medical information from images or video. 2) Redistribution of the data is not allowed. 3) Requires a Statement of Intended Use, which includes a description of how you intend to use this data. 4) You further agree to cite the DOI of the collection and the publication in any publication resulting from this content as follows: a) Azagra, P.

Table 8. Binary segmentation results (mIoU) for models pre-trained on EndoVis17 and fine-tuned for each of the target datasets (Kvasir and Endomapper). N.A.: Not available due to computational resource limitations.

Figure 4. Illustration of the anatomical regions labeled.

4. A subtitle file, Seq_XXX.srt, if the video has text footage in the form of text subtitles.
5. The metadata file, Seq_XXX_info.json, where the sequence number, the endoscope number and the type of metadata of the procedure are stored.
Additionally, the folder Lossless_sequences contains the lossless versions of the videos. This format uses the codec ffv1 version 3 with a bit rate of 310 Mbps.

Figure 5. Directory structure of the dataset.

Figure 6. Examples of the tool segmentation masks in Seq_009.

Figure 7. Two clusters from the COLMAP reconstruction after processing Seq_001. For each cluster, a 3D view of the frame poses and colon map points is shown, together with five RGB images summarizing the cluster frames.

Figure 9. Distributions of error in the images in prototype calibrations. The lines representing the errors are not magnified; most of them appear as points because the errors are mostly smaller than one pixel.

Figure 11. View angles plotted on top of calibration images from each prototype endoscope. The iso-lines are plotted at 20° intervals.

Figure 12. Examples of successful retrieval from different sequences of the same patient. The left column contains the queries from the current sequence (Seq_035) while the rest of the columns are the first three retrieved images from the previous sequence (Seq_027).

Figure 13. ORB-SLAM3 sub-map in Seq_015 between frames 54420 and 55170. The camera undergoes a forward-backward motion. Right: 3D map in top view, with keyframes in blue and map points in red. Left: images corresponding to 4 keyframes spread over the trajectory.

+ Params: Memory required by the model (M = millions of parameters). ++ Time: Average inference time for 1 image on GPU RTX2080.

Table 1. Overview of existing datasets of endoscopies in the gastrointestinal tract.

Table 2. Summary of the anatomical sections per video and label.

Table 3. Summary of the frames with tool segmentation.

Table 5. Summary of photometric calibration errors.

Table 6. Summary of the size of the 133 sub-maps obtained after processing Seq_015.

Following works in medical imaging 33 and datasets 15, we fine-tuned four different CNNs that are known to perform well on medical image classification: EfficientNet V2 34, MobileNetV2 35, DenseNet 36 and ResNetV2 37. With MobileNet and EfficientNet, we look for a model requiring low computational resources. DenseNet and ResNet were chosen for their performance in image classification.

Table 7. Top-1 and Top-3 accuracy of the anatomical region recognition models. All models were fine-tuned on 4 sequences (Seq_003, Seq_011, Seq_013, Seq_093) on top of ImageNet pre-trained weights, and tested on Seq_094.