Background & Summary

Endoscopes traversing body cavities are routine. However, their potential for navigation assistance or device autonomy remains mostly locked. To unlock it, computer-assisted endoscopes would require spatial AI (Artificial Intelligence) capabilities, i.e., the ability to estimate a map of the regions traversed, along with the endoscope localization within such a map. This capability is known in the robotics literature by the acronym VSLAM (Simultaneous Localization and Mapping from Visual sensors). Spatial AI and VSLAM will augment endoscopies with novel features, including augmented reality insertions, detection of blind zones, polyp measurements or guidance to polyps found in previous explorations. In the long term, VSLAM will also support entirely new robotized autonomous procedures. For our purposes in this paper, the goal of VSLAM is to build a per-patient map, in real time, during endoscope insertion in a first procedure. This map will be exploited and refined during the withdrawal of that first procedure, and in any future one.

There are mature methods for out-of-body VSLAM1,2. However, bringing them to endoscopy implies overcoming new barriers. The light source is co-located with the endoscope, and hence moves with it and stays close to the body surfaces. The body surfaces have poor texture and abundant reflections due to fluids. The scene geometry includes prevalent deformation. The video combines slow observation of areas of interest with fast motions and long occlusions of the endoscope lens.

Our contribution in this paper is the Endomapper dataset3, which makes available, for the first time, 96 high-quality calibrated recordings of complete routine endoscopies (Fig. 1), making secondary use of medical data, i.e., recording standard procedures that would have been performed in any case, without any modification. Compared to ad-hoc recordings, secondary-use ones show realistic features and hence contain the actual challenges VSLAM will face in routine practice.

Fig. 1
figure 1

Overview of the Endomapper Dataset.

No other public dataset offers a comparable volume of fully calibrated endoscopies in HD (see Table 1). Heidelberg4 is a very interesting dataset that contains images of the colon from laparoscopic procedures. However, this view of the colon is not compatible with our goal of building 3D reconstructions of the interior of the gastrointestinal tract. CVCClinicDB, GIANA and Kvasir focus on polyp detection and are often used to benchmark CAD (Computer Aided Diagnosis) systems. Other datasets focus on segmentation of elements of interest, such as tools in instrument-kvasir or polyps in Kvasir-seg. However, they only provide sparse image sets or short videos (less than 30 seconds). Closer to ours, Colon10k provides images from short sequences for place recognition and reconstruction. In contrast, we offer hours of real calibrated video, corresponding to complete procedures. Apart from these real imaging datasets, it is worth mentioning works that create simulated data of the colon, either using 3D models, like Rau et al.5, Bobrow et al.6 and Incetan et al.7, or a realistic phantom, like Ozyoruk et al.8. Similar to these works, we also include a few simulated sequences to harness the extra information of these scenes as a means to evaluate the methods developed. In particular, due to the monocular nature of the dataset, no ground truth geometry is available for quantitative evaluation. To address this issue, we include photorealistic sequences from a simulated colon, with ground truth geometry for the deforming scene and the endoscope trajectory.

Table 1 Overview of existing datasets of endoscopies in the gastrointestinal tract.

The Endomapper dataset includes colonoscopies, gastroscopies, and calibration videos along with geometric and photometric calibration parameters. More than half of the sequences are screening colonoscopies, for which the standard procedure implies a thorough and slow exploration, close to typical VSLAM operating conditions, and which can serve as a bridge to more challenging sequences. To research map reuse and recognition in a second exploration, colonoscopies corresponding to the same patient but separated in time by several weeks are also included in the data.

Regarding metadata, some endoscopies include a description of the procedure made by the endoscopist, in the form of text footage. The text describes the anatomical regions traversed, re-explorations of the same region, the interventions performed and the tools used. This footage indexes the videos, helping to identify sections of interest for VSLAM.

Building on our dataset, the community can provide derived data or metadata to support subsequent research. Some examples of these derived data are included in the dataset: 1) anatomical region labels, at frame level, annotated by a doctor after reviewing the video; 2) tool segmentation in selected video sections, which can boost tool segmentation performance in the specific endoscopy domain; 3) Structure from Motion (SfM) reconstructions using COLMAP9, which provide an up-to-scale 6 DoF endoscope trajectory and 3D models for the video segments corresponding to smooth explorations of non-deforming scenes. The SfM output has proven valid to supervise learning tasks such as image matching10 or image retrieval11.

Endomapper offers a sweet spot of challenge, including easy video segments where state-of-the-art algorithms perform reasonably. However, these methods also fail at some point, signaling the research challenges to face. We believe that the dataset will spur research that identifies these challenges and will foster the progress of VSLAM in gastrointestinal environments.

Finally, we have made publicly available 7 software repositories corresponding to photometric and geometric calibration from the calibration videos, generation of the simulated colon sequences, and technical validation. Section Code Availability details these repositories, including installation and usage instructions.

Methods

The methodology used to create the dataset is explained in this section. First, we present a description of the recording procedure for the sequences in the dataset, including the description of the capture system and the type of recordings. Then, we describe the calibration procedure and the methodology used in both geometric and photometric calibration. Finally, we also briefly summarize the methods used to create each type of meta-data.

Endoscopy recording procedure

The acquisition of the sequences in the dataset was performed in the Hospital Clinico Universitario Lozano Blesa, in Zaragoza (Spain), using an Olympus EVIS EXERA III CV-190 video processor, EVIS EXERA III CLV-190 light source, and EVIS EXERA III CF-H190 colonoscope or EVIS EXERA III GIF-H190 gastroscope. The acquisition system is composed of a computer and a data acquisition card connected to the endoscopy tower via a Digital Visual Interface (DVI). Two different acquisition cards have been used: Epiphan Video DVI2USB 3.0 and Magewell Pro Capture DVI. The videos were recorded at 1440 × 1080, 40 fps and 24-bit RGB (Epiphan) or 1440 × 1080, 50 fps and 24-bit RGB (Magewell). The output image given by the endoscopy tower is cropped to remove personal information. The videos were manually edited to remove any frame recorded when the camera was out of the body of the patient. During the span of the project, the recordings were done one day of the week and synchronized with the medical staff involved. The patients were not selected based on their symptoms or pathology; we followed the hospital’s schedule without interference, which was mostly focused on the colorectal cancer screening program. The Endomapper technical staff was present in all the recording sessions to ensure the quality of the acquisitions, but without interfering with the medical procedure.

Use of human participants

The recordings were made under the ethical approval of the CEICA Ethics Committee (Comité de Ética de la Investigación de la Comunidad Autónoma de Aragón (CEICA), meetings 04/03/2020 acta 05/2020, 23/09/2020 acta 18/2020, 20/04/2022 acta 08/2022 and 16/11/2022 acta 20/2022). Informed consent was obtained from all subjects. According to this approval, the collection can be publicly accessed under certain conditions (see Section Usage Notes).

Calibration

The dataset uses 10 different colonoscopes and 8 different gastroscopes. The calibration sequences for all the colonoscopes and gastroscopes were acquired in a single session using a Lambertian pattern (obtained from calib.io). Figure 2 shows two frames of the calibration videos imaging the calibration pattern. The Lambertian pattern corresponds to an array of circles from the Vicalib12 library. The physical size of the pattern used is 5.61 × 9.82 cm.

Fig. 2
figure 2

Two examples of calibration images (left, middle). The calibration pattern (right).

Geometric calibration

The calibration videos are processed by Vicalib12 to obtain the endoscope intrinsic parameters according to the Kannala & Brandt model13,14. The calibration defines eight intrinsic parameters: four projective parameters (in pixels) fx, fy, Cx, Cy and four distortion coefficients k1, k2, k3, k4. We process 1 out of every 20 frames and remove outlier matches. The projection model, yielding the projection in pixels u = (u, v) for a 3D point with coordinates X = (x, y, z) in the camera frame, is described as:

$$u={f}_{x}{x}_{d}+{C}_{x},\quad {x}_{d}={r}_{d}\frac{x}{r}$$
(1)
$$v={f}_{y}{y}_{d}+{C}_{y},\quad {y}_{d}={r}_{d}\frac{y}{r}$$
(2)

where \({r}_{d}=\theta \left(1+{k}_{1}{\theta }^{2}+{k}_{2}{\theta }^{4}+{k}_{3}{\theta }^{6}+{k}_{4}{\theta }^{8}\right)\) is the distorted radius, \(r=\sqrt{{x}^{2}+{y}^{2}}\) is the undistorted radius and θ = arctan2(r, z) is the angle between the incoming ray and the optical axis.
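For reference, the sketch below re-implements Eqs. (1)–(2) in Python/NumPy, using the parameter names fx, fy, Cx, Cy, k1–k4 stored in the geometric calibration files. It is an illustrative re-implementation of the Kannala & Brandt projection, not the Vicalib code itself.

```python
import numpy as np

def project_kannala_brandt(X, fx, fy, Cx, Cy, k1, k2, k3, k4):
    """Project a 3D point X = (x, y, z), given in the camera frame,
    to pixel coordinates (u, v) following Eqs. (1)-(2)."""
    x, y, z = X
    r = np.sqrt(x**2 + y**2)              # undistorted radius
    theta = np.arctan2(r, z)              # angle between incoming ray and optical axis
    # distorted radius (polynomial in theta)
    r_d = theta * (1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8)
    if r > 0:
        x_d, y_d = r_d * x / r, r_d * y / r
    else:                                  # point on the optical axis
        x_d, y_d = 0.0, 0.0
    u = fx * x_d + Cx
    v = fy * y_d + Cy
    return u, v
```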

Photometric calibration

The light source and camera of the endoscope are calibrated to obtain a model able to reproduce the photometry of the recordings. In the endoscope, the distances between the light sources and the camera are small and mostly symmetrical. Following Modrzejewski et al.15, we assume that these sources can be modelled as a single virtual light and adopt the Spot Light Source model (SLS), which was shown to offer a good compromise between complexity and accuracy. In addition, the light spread function and the camera vignetting are jointly estimated assuming radial symmetry. With this model, the light radiance going from the endoscope to a 3D surface point X is

$${\sigma }_{{\rm{SLS}}}\left({\bf{X}},{\bf{P}}\right)={\sigma }_{0}R\left(\mu ,{\bf{D}},{\bf{X}},{\bf{P}}\right)S\left({\bf{X}},{\bf{P}}\right)L\left({\bf{X}},{\bf{P}}\right),$$
(3)

where P is the light centre, σ0 is the light’s intensity value and D is the principal direction in which light propagates, with a spreading factor μ that modulates the radial attenuation R. As the light traverses the scene, its radiance decreases as a function of the distance travelled \(d=\left\Vert {\bf{X}}-{\bf{P}}\right\Vert \), following an inverse-square law \(S\left({\bf{X}},{\bf{P}}\right)=1/{d}^{2}\). Finally, L(X, P) is the unit vector of the light’s outgoing direction. The corresponding intensity value \({\mathcal{I}}({\bf{X}})\) on the image is:

$${\mathcal{I}}({\bf{X}})={\left(\left|{\sigma }_{{\rm{SLS}}}({\bf{X}},{\bf{P}})\right|\,{f}_{r}({{\boldsymbol{\omega }}}_{i},{{\boldsymbol{\omega }}}_{r})\,\cos \theta \,{g}_{t}\right)}^{1/\gamma },$$
(4)

where a bidirectional reflectance distribution function (BRDF) \({f}_{r}\left({{\boldsymbol{\omega }}}_{i},{{\boldsymbol{\omega }}}_{r}\right)\) defines how light is reflected from the surface to the camera. The projection of the light beam on the geometry introduces a cosine term of the angle θ between the incoming light ray ωi and the surface normal. Finally, the endoscope applies an automatic gain gt, which can vary at each time instant t, and a gamma curve (γ = 2.2) to improve the perceived dynamic range of the image.

The parameters of this model are estimated by optimising a photometric loss on the white areas of the Vicalib pattern (Fig. 2). The calibration results provide a 2D weighting of the photometric effects caused by the vignetting and the light spread function, which can be used to compensate for them (Fig. 3).
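As an illustration, the sketch below evaluates Eqs. (3)–(4) in Python/NumPy. It assumes a Lambertian BRDF (constant albedo) and a cosine-power form for the radial attenuation R; both are assumptions for the sake of the example, and the exact functional forms used in the calibration are those in the EM_Dataset-PhotometricCalibration repository.

```python
import numpy as np

def sls_radiance(X, P, D, mu, sigma0=1.0):
    """Spot Light Source radiance arriving at surface point X, Eq. (3).
    The radial attenuation R is modelled here as cos^mu of the angle between
    the principal direction D and the ray P->X (an assumption for this sketch)."""
    d = np.linalg.norm(X - P)                       # travelled distance
    L = (X - P) / d                                 # unit outgoing light direction
    S = 1.0 / d**2                                  # inverse-square fall-off
    R = max(np.dot(D / np.linalg.norm(D), L), 0.0) ** mu   # radial attenuation
    return sigma0 * R * S * L                       # vector along L, magnitude = radiance

def predicted_intensity(X, P, D, mu, n, albedo, g_t, gamma=2.2, sigma0=1.0):
    """Predicted image intensity at the projection of X, Eq. (4),
    assuming a Lambertian BRDF f_r = albedo / pi."""
    sigma = sls_radiance(X, P, D, mu, sigma0)
    L = (X - P) / np.linalg.norm(X - P)             # incoming light direction at X
    cos_theta = max(np.dot(n, -L), 0.0)             # foreshortening with surface normal n
    f_r = albedo / np.pi                            # Lambertian BRDF (assumption)
    return (np.linalg.norm(sigma) * f_r * cos_theta * g_t) ** (1.0 / gamma)
```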

Fig. 3
figure 3

Example of photometric calibration results.

Simulated colon

The VR-Caps7 simulator is used to generate photorealistic synthetic image sequences of a 3D colon model obtained from a Computed Tomography scan. Since this is a simulation, we have full access to the scene configuration: camera calibration, deformations, trajectory and illumination, and hence to the ground truth geometry, camera pose and 3D deforming scene. For the same endoscope trajectory, we generated different sequences with increasingly aggressive deformations to allow ablation studies with respect to the deformation magnitude. The applied deformations are described by the following equation:

$${V}_{y}^{t}={V}_{y}^{0}+A\,\sin (\omega t+{V}_{x}^{0}+{V}_{y}^{0}+{V}_{z}^{0}),$$
(5)

where \({V}_{x}^{0}\), \({V}_{y}^{0}\) and \({V}_{z}^{0}\) are the coordinates of the surface point at rest. The parameters A and ω control the magnitude and velocity of the deformations, corresponding to the maximum excursion and the oscillation frequency, respectively. We also modified the colon texture to increase its contrast. The code to create these simulated sequences is available in the repository EM_Dataset-Simulations (see Section Code availability, https://github.com/endomapper/EM_Dataset-Simulations).
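For clarity, Eq. (5) applied to a set of surface points can be sketched as follows in Python/NumPy. The (N, 3) vertex array is an assumption about the data layout made for this example, not the simulator's internal representation.

```python
import numpy as np

def deform_vertices(V0, A, omega, t):
    """Apply the sinusoidal deformation of Eq. (5) to the rest-pose points V0.
    V0 is an (N, 3) array of (x, y, z) coordinates at rest; only the y
    coordinate is displaced, with amplitude A and angular velocity omega."""
    V = V0.copy()
    phase = V0[:, 0] + V0[:, 1] + V0[:, 2]          # per-point phase from rest coordinates
    V[:, 1] = V0[:, 1] + A * np.sin(omega * t + phase)
    return V
```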

Meta-data

For a set of selected recordings, we provide several types of meta-data useful for many potential research lines, and in particular for VSLAM. This subsection presents a description of the meta-data and the annotation methodologies.

Text footage

The endoscopist performing each procedure provided a description of it, registered during the exploration. It includes the anatomical regions traversed, the interventions, medical findings such as the approximate polyp size, the tools used, or the sections with NBI (Narrow-Band Imaging) illumination. This description is made available as text footage synchronized with the corresponding videos. This metadata can be useful, for example, to identify the sections of the video that are most promising for VSLAM, such as re-observations of the same region or interactions with tools of known size.

Anatomical regions

Anatomical region recognition is useful for creating topological maps of the colon. These maps can be used to build smaller reconstructions with a lower probability of error. Some colonoscopy procedures were annotated by the medical staff of the project after the recording. Multiple careful viewings were necessary to delimit the ten anatomical regions shown in Fig. 4.

Fig. 4
figure 4

Illustration of the anatomical regions labeled.

Tools segmentation

Tool segmentation is one of the challenges for spatial AI in colonoscopies. Since tools occlude the view and cause failures in other algorithms, many works in the literature mask them out. Tools were manually segmented using the Odin CAT tool16, which allows maintaining a mask across frames, yielding more robust annotations.

COLMAP 3D reconstruction

Traditional SIFT-based rigid SfM algorithms are able to produce partial reconstructions from colonoscopy videos. We include some examples of the output of COLMAP9,17 processing on our sequences, which provides a first approximation of the up-to-scale camera trajectory and the scene’s sparse structure. This information can be organized to produce weak supervision in the form of sparse depth maps, local correspondences between frames, image-to-image labels (frames depicting the same place) or relative camera pose transformations between frames. Several computer vision tasks like depth prediction, image matching, image retrieval and visual localization can greatly benefit from this kind of supervision. Megadepth18 is a well-known dataset that uses this SfM procedure to obtain 3D point clouds, similar to ours. It has been extensively used for deep learning supervision10,19,20. Other works employed SfM to identify co-visible frames in the recordings, which has proven useful to train CNNs for place recognition in landmark images11 and in colonoscopy sequences21,22.

For our recordings, we apply exhaustive guided matching between all the images in the sequence to associate frames that are temporally distant. We use our camera calibration and do not optimize it during the COLMAP bundle adjustment. The minimum triangulation angle is relaxed to 8 degrees during the initialization of the models. The rest of the parameters are left at their defaults.
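A possible command-line sketch of this configuration, wrapped in Python for reproducibility, is shown below. The option names correspond to standard COLMAP options but should be checked against the installed COLMAP version; the authoritative scripts are those in the EM_Dataset-ColmapValidation repository, and the camera parameter string is a placeholder to be filled with the values from the corresponding Endoscope_XX_geometrical.xml.

```python
import subprocess

DB, IMAGES, OUT = "seq.db", "frames/", "sparse/"
# Kannala-Brandt intrinsics from the geometric calibration (placeholder string):
# COLMAP's OPENCV_FISHEYE model uses the same fx, fy, cx, cy, k1..k4 parameterization.
CAM = "fx,fy,cx,cy,k1,k2,k3,k4"

# Feature extraction with fixed, shared intrinsics.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES,
                "--ImageReader.single_camera", "1",
                "--ImageReader.camera_model", "OPENCV_FISHEYE",
                "--ImageReader.camera_params", CAM], check=True)

# Exhaustive matching with guided matching, to associate temporally distant frames.
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", DB,
                "--SiftMatching.guided_matching", "1"], check=True)

# Mapping: intrinsics are not refined and the initial triangulation angle is relaxed to 8 degrees.
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES, "--output_path", OUT,
                "--Mapper.ba_refine_focal_length", "0",
                "--Mapper.ba_refine_principal_point", "0",
                "--Mapper.ba_refine_extra_params", "0",
                "--Mapper.init_min_tri_angle", "8"], check=True)
```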

Recordings from the same patient

One of the main obstacles in colon reconstruction is consistency between colonoscopies in longitudinal studies. Thanks to the colorectal cancer screening program, colonoscopy pairs from the same patient were recorded. This helps to evaluate the lifelong mapping capabilities of the developed VSLAM algorithms.

Data Records

This section describes the dataset structure and details of the available meta-data. The dataset is available on the Synapse platform3 and is subject to access controls (see Section Usage Notes). A summary of the dataset structure can be seen in Fig. 5. At publication time, there are a total of 96 real sequences, with durations ranging from less than ten minutes to more than half an hour. The file DatasetSummary.xls in the dataset main folder includes a summary of the acquisition details of each sequence in the dataset.

Fig. 5
figure 5

Directory structure of the dataset.

Video recordings

Data is stored in the directory Sequences. Each procedure has a corresponding directory Seq_XXX (XXX is the sequence number) that contains:

  1. 1.

    The directory meta-data, which contains all the meta-data files associated with the sequence. These files are described in the next section.

  2. 2.

    The video Seq_XXX.mov, containing the actual recording. The video codec is H26423, a lossy compression using the High 4:4:4 profile, level 4.2, and a bit rate of 7 Mbps. It offers a good size vs. quality trade-off for lossy compression.

  3. 3.

    The thumbnail version, Seq_XXX_thumbnail.webm, which contains a compressed version of the recording for easy and quick visualization. This version uses the free libvpx codec24, at 320 × 240 resolution.

  4. 4.

    A subtitle file, Seq_XXX.srt, if the video has text footage in the form of text subtitles.

  5. 5.

    The metadata file, Seq_XXX_info.json, where the sequence number, endoscope number and type of metadata of the procedure are stored.

Additionally, the folder Lossless_sequences contains the lossless versions of the videos. This format uses the FFV1 codec (version 3) with a bit rate of 310 Mbps.

Camera calibration

All the calibration information is included in the directory Calibrations. There is a directory Endoscope_XX (XX is the endoscope number) for each endoscope that contains:

  1. 1.

    The calibration video Endoscope_XX.mov. This is the lossy H264 version. The lossless version can be found in the lossless folder mentioned before.

  2. 2.

    The Geometric calibration parameters Endoscope_XX_geometrical.xml.

  3. 3.

    The Photometric Calibration parameters Endoscope_XX_photometrical.xml.

Geometric calibration

The file Endoscope_XX_geometrical.xml is the output calibration from Vicalib12. This XML file contains the intrinsic parameters of the camera (fx, fy, Cx, Cy, k1, k2, k3, k4) following the Vicalib output format.

Photometric calibration

The photometric calibration file, Endoscope_XX_photometrical.xml, contains the calibrated parameters of the light source and the camera of the endoscope. An endoscope’s <rig> may have one or more <camera> tags, associated with one or more <light> sources. Currently, only a single camera and a single virtual light are supported.

Each camera tag has a particular <camera_model>. This model has a single parameter, the value of the gamma γ response function in Eq. (4). Regarding the light source, the <light_model> has four parameters: the intensity value σ0, the light spread factor μ and two vectors for the light centre P and the principal direction D.

Simulated colon

All the data related to the simulated colon is included in the directory Simulated Sequences. There is a directory seq_X (X is the sequence number) for each sequence obtained from the simulation. The directory contains the following items (a minimal loading sketch follows the list):

  1. 1.

    The directory rgb with the RGB images of the sequence in png format.

  2. 2.

    The directory depth with the depth images for each RGB image of the sequence stored in exr format.

  3. 3.

    A file rgb.txt with a list of file names of all RGB images of the sequence.

  4. 4.

    A file depth.txt with a list of the file names of all depth images of the sequence.

  5. 5.

    A file trajectory.csv containing the ground truth camera trajectory.

  6. 6.

    A file calibration.txt containing the simulated camera calibration.

  7. 7.

    A file info.txt containing the deformations applied, their parameters and units.
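The sketch below loads the RGB/depth pairs of a simulated sequence with OpenCV. It assumes that rgb.txt and depth.txt list one file name per line, relative to the rgb and depth directories and in corresponding order, and that the installed OpenCV build supports OpenEXR (recent builds require the environment variable below to be set before importing cv2); these details are assumptions for the example.

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"   # enable EXR reading before importing cv2
import cv2

def load_simulated_sequence(seq_dir):
    """Yield (rgb, depth) pairs for a simulated sequence directory seq_X."""
    with open(os.path.join(seq_dir, "rgb.txt")) as f:
        rgb_files = [line.strip() for line in f if line.strip()]
    with open(os.path.join(seq_dir, "depth.txt")) as f:
        depth_files = [line.strip() for line in f if line.strip()]
    for rgb_name, depth_name in zip(rgb_files, depth_files):
        rgb = cv2.imread(os.path.join(seq_dir, "rgb", rgb_name), cv2.IMREAD_COLOR)
        depth = cv2.imread(os.path.join(seq_dir, "depth", depth_name),
                           cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)  # float EXR depth map
        yield rgb, depth
```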

Meta-data

This section contains the details and formats for each type of meta-data. The file DatasetSummary.xls details the availability of the metadata in each sequence of the dataset.

Text footage

Two files, Seq_XXX.json and Seq_XXX.srt, are included in the root and meta-data folders. The .json file contains a structure with the timestamp and the associated text. The text footage is also included in .srt format to ease visualization synchronized with the video. The references to identify the tools used during the procedure are stored in the meta-data directory.

Anatomical regions

Table 2 shows the detailed number of frames labelled for each region in each video. The dataset contains this information in a file named Anatomical_Regions_XXX.txt with the format Frame###;region label; on each line.
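A minimal parsing sketch for this per-line format is shown below; it assumes the first field is the literal prefix "Frame" followed by the frame number, which is our reading of the format described above.

```python
def load_anatomical_regions(path):
    """Parse Anatomical_Regions_XXX.txt lines of the form 'Frame###;region label;'
    into a dict mapping frame number -> region label."""
    regions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            frame_field, label = line.split(";")[:2]      # 'Frame###' and 'region label'
            frame = int(frame_field.replace("Frame", ""))
            regions[frame] = label.strip()
    return regions
```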

Table 2 Summary of the anatomical sections per video and label.

Tool segmentation

There are 4086 frames with tools segmented across four different colonoscopies, as detailed in Table 3. The segmentations for each video can be found in the file tool_segmentation_XXX.xml. This file contains, for each segmented frame, the id of the frame and a list of 2D point coordinates that define the tool segmentation as a binary polygon. The segmentation was done using the proprietary Odin CAT tool16. Some examples can be seen in Fig. 6.
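As an illustration, once the polygon vertices of a frame have been read from the XML (the exact tag names are not reproduced here), a binary mask can be rasterized as in the sketch below; the default 1440 × 1080 size matches the recording resolution.

```python
import numpy as np
import cv2

def polygon_to_mask(points, width=1440, height=1080):
    """Rasterize a tool polygon, given as a list of (x, y) pixel coordinates,
    into a binary mask at the recording resolution."""
    mask = np.zeros((height, width), dtype=np.uint8)
    polygon = np.array(points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [polygon], 255)
    return mask > 0
```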

Table 3 Summary of the frames with tool segmentation.
Fig. 6
figure 6

Examples for the tool segmentation mask in Seq_009.

COLMAP 3D reconstruction

Table 4 summarizes the reconstruction results for the Endomapper sequences. The reconstructions are stored following the text format of COLMAP (https://colmap.github.io/format.html). We provide text files listing the images contained in each cluster reconstructed by COLMAP, as well as the images that COLMAP considered covisible, i.e., images that have at least one 3D point in common. Figure 7 shows two examples of these reconstructed clusters.
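The covisibility relation can also be recovered directly from the COLMAP text export. The sketch below reads points3D.txt, whose documented format lists, for each 3D point, a track of (IMAGE_ID, POINT2D_IDX) pairs after the first eight fields, and counts shared points per image pair; it is an illustrative helper, not the script used to generate the provided files.

```python
from collections import defaultdict
from itertools import combinations

def covisible_pairs(points3D_txt):
    """Count, for every image pair, the number of 3D points observed by both,
    using COLMAP's points3D.txt text export."""
    counts = defaultdict(int)
    with open(points3D_txt) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            elems = line.split()
            track = elems[8:]                          # (IMAGE_ID, POINT2D_IDX) pairs
            image_ids = sorted({int(i) for i in track[0::2]})
            for a, b in combinations(image_ids, 2):
                counts[(a, b)] += 1
    return counts   # pairs with count >= 1 are covisible
```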

Table 4 Summary of COLMAP 3D reconstruction.
Fig. 7
figure 7

Two clusters from the COLMAP reconstruction after processing Seq_001. For each cluster, we show a 3D view of the frame poses and colon map points, and five RGB images summarizing the cluster frames.

Same patient recordings

A file SamePatient.json is stored in the root folder, listing which sequences come from the same patient and the time separating the two sequences.

Technical Validation

Firstly, we detail an error analysis of the calibration and a comparison of the calibration parameters among endoscopes. Secondly, we test state-of-the-art SfM and VSLAM algorithms on typical colonoscopy sequences. Finally, the anatomical region and tool segmentation labels are validated with state-of-the-art recognition algorithms. All the code used in this section is publicly available in the Endomapper repositories https://github.com/Endomapper.

Calibration validation

Geometric calibration

The software used to compute the geometric calibration and to obtain the validation and comparisons shown in this section is available in the repository EM_Dataset-GeometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-GeometricCalibration). The geometric calibration was computed from the calibration videos using the Vicalib12 tool, tuning the parameters for each endoscope calibration separately. The parameters are detailed in the repository.

To compare the different calibrations visually, we have undistorted a grid using each calibration. Figure 8 shows the differences between the endoscopes, the 10 colonoscopes and 8 gastroscopes respectively. The results show that the calibrations are equivalent around the centre of the images and that differences between them are significant only at the image borders.

Fig. 8
figure 8

Comparison of how a regular pixel grid is undistorted by the calibration of each endoscope. Colonoscopes and gastroscopes are separated for easier visualization.

To further analyze the calibration results, we analyzed the reprojection error. For all the calibrations, the RMSE is between 0.3 and 0.4 pixels. We have selected two endoscopes, one colonoscope (Endoscope_06) and one gastroscope (Endoscope_18), as prototypes. Figure 9 displays the inlier reprojection error distribution for the selected endoscopes. Here we can see that the inlier error is uniformly distributed across the image, and that only at the image boundaries are there fewer measurements. The calibrations are thus expected to be very accurate in general, with the least accurate areas being those closest to the borders.

Fig. 9
figure 9

Distribution of errors over the images for the prototype calibrations. The lines representing the errors are not magnified; observe that most of them appear as points, since the errors are mostly smaller than one pixel.

The projection function that relates the incoming ray angle θ with the distorted radius rd is plotted in Fig. 10. Here we can see that both types of endoscopes are almost identical and very close to an orthogonal-projection fisheye lens13. Finally, in Fig. 11 we show the view angle of both prototype endoscopes. There we can see that the gastroscope has a lower view angle than the colonoscope at the edges, which is why each type of endoscope needs to be calibrated separately.

Fig. 10
figure 10

Relation between the incoming ray angle θ with the distorted radius rd. The dotted curves represent the ideal orthogonal and equisolid projection models. The right image is a zoom of the curves to show the small differences between the colonoscope and the gastroscope.

Fig. 11
figure 11

View angles plotted on top of calibration images from each prototype endoscope. The iso-lines are plotted in 20° intervals.

With this analysis we conclude that the individual calibrations obtained from the videos are accurate: the RMSE is low, the measurements cover most of the image, and the results are consistent across endoscopes. We believe that accurate calibration information boosts the performance of geometric methods.

Photometric calibration

The software used to compute the photometric calibration and to obtain the validation shown in this section is available in the repository EM_Dataset-PhotometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-PhotometricCalibration). The photometric calibration was computed from the calibration videos for each endoscope separately.

We selected 38 frames per sequence, looking for a variety of distances from the camera to the calibration pattern. On each frame, we consider a 120° field of view. The centre of the virtual light converges about 4 mm behind the tip of the endoscope, making it possible to model all real lights with a single beam. The gamma value is experimentally fixed to γ = 2.2, which is also a broadly used value. The endoscope applies a continuous gain control, progressively increasing or decreasing the gain. The relative auto-gain is estimated with respect to the first image of the sequence. Consequently, the σ0 value is unobservable and is fixed to one.

The resulting models are validated on a different set of images of the Vicalib pattern. The photometric errors in Table 5 show the validation results for eight colonoscopes and eight gastroscopes. For the colonoscopes, the calibration is able to estimate the pixel intensities of the images with an RMSE of 2.9 grey levels. In the gastroscopes, the lights are not symmetrically placed on the tip of the endoscope. Consequently, the RMSE increases slightly, up to 3.3 grey levels.

Table 5 Summary of photometric calibration errors.

SfM/SLAM validation

COLMAP validation

COLMAP is able to estimate sparse reconstructions for different sections along a sequence; see some examples in Fig. 7. As can be seen, the 3D point cloud and the camera trajectory look reasonable, showing a tubular shape with cameras traversing it. The covisibility information is always accurate, as the geometrical checks in COLMAP prevent frames that do not observe the same place from being incorrectly reconstructed in the same model.

Covisibility information was exploited in22, where a CNN was trained with COLMAP reconstructions from our sequences for the task of image retrieval. The system22 is able to recognize frames observing the same place in the colon, even when the frames come from different colonoscopies of the same patient. Figure 12 shows some examples of successful retrieval between two colonoscopies of the same patient performed within two weeks of each other.

Fig. 12
figure 12

Examples of successful retrieval from different sequences of the same patient. The left column contains the queries from the current sequence (Seq_035) while the rest of the columns are the first three retrieved images from the previous sequence (Seq_027).

The 3D reconstructions look reasonably accurate and can be of great help as weak supervision for training depth, camera pose or image retrieval networks. The software used to compute the reconstructions shown in this section is available in the repository EM_Dataset-ColmapValidation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ColmapValidation).

ORB-SLAM validation

ORB-SLAM31 is the reference system for sparse real-time visual SLAM. We have used it to process the whole Seq_015 video, to build the map and estimate the endoscope pose. To achieve real-time performance, the image size is reduced from 1440 × 1080 to 720 × 540 and one out of every two frames is skipped. To address non-rigidity, the reprojection error acceptance threshold has been increased by a factor of 2 with respect to its default value for rigid scenes, which helps prevent tracking losses. The Kannala-Brandt camera model has proven essential to extract and triangulate features close to the borders of the image, where distortion is significant. As argued, an accurate calibration enables the use of the whole image for geometric computation, boosting accuracy and robustness.

ORB-SLAM3 has been able to estimate 133 sub-maps of small size (see Table 6 for a summary of the sub-map sizes). The camera has been localized successfully with respect to a map in 25% of the frames. The time between video frames is 40 ms, and ORB-SLAM3 is able to run in real time, with an average tracking time of 23 ms and a maximum of 37 ms. Figure 13 shows a sub-map where the camera undergoes a forward-backward motion. The map contains 54 keyframes, 3682 points and 349 frames.

Table 6 Summary of the size of the 133 sub-maps obtained after processing Seq_015.
Fig. 13
figure 13

ORB-SLAM3 sub-map in Seq_015 between frames 54420 and 55170. The camera undergoes a forward-backward motion. Right: 3D map in top view, keyframes in blue, map points in red. Left: images corresponding to 4 keyframes spread over the trajectory.

From this analysis we conclude that our EndoMapper dataset presents the challenges of real endoscopy exploration, such as scene deformation, frequent occlusions, changes in lighting, and clutter due to cleaning water or tools, which eventually result in frequent tracking losses. Classical discrete-feature VSLAM methods like ORB-SLAM3 can run on these videos in real time, localizing the camera in 25% of the frames. However, the scene model is fragmented into a myriad of very small rigid sub-maps. The clear challenge is to develop multi-map techniques for endoscopy able to merge sub-maps with common areas.

The software and the detailed tuning used to compute the ORB-SLAM3 reconstructions shown in this section are available in the repository EM_Dataset-ORBSLAM3Validation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ORBSLAM3Validation).

Anatomical region validation

The anatomical region labels have been validated by fine-tuning different models for anatomical region recognition. The software used to evaluate anatomical region recognition is available in the repository EM_Dataset-AnatomicalRegions (see Section Code availability, https://github.com/endomapper/EM_Dataset-AnatomicalRegions). Following works in medical imaging25 and datasets26, we fine-tuned four different CNNs that are known to perform well on medical image classification: EfficientNet V227, MobileNetV228, DenseNet29 and ResNetV230. With MobileNet and EfficientNet, we look for models requiring low computational resources; DenseNet and ResNet were chosen for their performance in image classification. For the fine-tuning, we train the models for 100 epochs following the learning rate decay in Thanh et al.27 and use 4 sequences (Seq_003, Seq_011, Seq_013, Seq_093) for training and Seq_094 for testing. Seq_094 was chosen as the test sequence because it has the best balance between classes. The metrics used to evaluate this experiment are Top-1 and Top-3 accuracy, defined as the accuracy for which the true class matches the most probable prediction and any of the 3 most probable predictions, respectively. Table 7 presents the results for the anatomical region recognition.
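For reference, the Top-1 and Top-3 metrics defined above can be computed from the per-frame class scores as in the NumPy sketch below; it is illustrative and not the exact evaluation code in the repository.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """scores: (N, C) array of class scores; labels: (N,) true class indices.
    Returns the fraction of frames whose true class is among the k most probable."""
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k highest scores per frame
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Top-1 and Top-3 as reported in Table 7:
# top1 = top_k_accuracy(scores, labels, k=1)
# top3 = top_k_accuracy(scores, labels, k=3)
```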

Table 7 Top-1 and Top-3 accuracy of the anatomical region recognition models.

ResNet performs better in Top-1 accuracy and similarly in Top-3, showing that this model is the best overall. It is also interesting to note that MobileNet obtains close performance with a smaller computational footprint, making it interesting for real-time systems. It is also worth remarking that the Top-1 accuracy is low in comparison to other computer vision tasks, suggesting that anatomical region classification is a challenging research problem. Indeed, the differences between regions are very subtle, even for the trained eye. In any case, the results on Top-3 accuracy are promising and show that the EndoMapper data is a relevant tool to advance performance on this problem.

Tools segmentation validation

The tool segmentation labels have been validated by training and testing several models for binary tool segmentation, as proposed in Tomasini et al.31. This work compares the performance of various models on three different datasets, including the EndoMapper dataset labels. All the models were trained from scratch on the EndoVis17 dataset and fine-tuned on Kvasir-Inst and EndoMapper. The resulting performance can be seen in Table 8. Examples of binary segmentation of images from the Endomapper dataset obtained using the different models can be seen in Fig. 14.

Table 8 Binary segmentation results (mIoU) for models pre-trained on EndoVis17 and fine-tuned for each of the target datasets (Kvasir and Endomapper).
Fig. 14
figure 14

Binary Segmentation examples from Endomapper dataset using different approaches fine-tuned on Endomapper dataset: (a) Original Image (b) Ground-truth manual segmentation (c) MiniNet (d) UNet (e) LinkNet (f) MF-TAPNet.

The lower mIoU of all models on our EndoMapper dataset compared to the other datasets highlights the challenge posed by the EndoMapper tool segmentation labels. It is interesting to note that the efficient MiniNetV2 reaches a performance similar to state-of-the-art models on all datasets while requiring less memory and inference time. The software used to evaluate the tool segmentation models is available in the repository EM_Dataset-ToolSegmentation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ToolSegmentation).

Usage Notes

The dataset is available on the Synapse platform3. The dataset can be publicly accessed under the following conditions: 1) Use is limited to research on how to obtain relevant medical information from images or video. 2) Redistribution of the data is not allowed. 3) Access requires a Statement of Intended Use, which includes a description of how you intend to use the data. 4) You further agree to cite the DOI of the collection and the publication in any publication resulting from this content as follows: a) Azagra, P. et al. Endomapper dataset of complete calibrated endoscopy procedures. https://doi.org/10.7303/syn26707219 (2022). Synapse. b) Azagra, P. et al. EndoMapper dataset of complete calibrated endoscopy procedures. Scientific Data. 5) Images from the collection can be included in the citing scientific publications. 6) Video segments can be used to produce multimedia material in the citing scientific publications.