Abstract
Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset range from anatomical landmarks, procedure labeling and segmentations to reconstructions, simulated sequences with ground truth, and same-patient procedures. The software used in this paper is publicly available.
Background & Summary
Endoscopes traversing body cavities are routine. However, their potential for navigation assistance or device autonomy remains mostly locked. In order to unlock it, computer-assisted endoscopes would require spatial AI (Artificial Intelligence) capabilities, i.e., being able to estimate a map of the traversed regions along with the endoscope localization within that map. This capability is known in the robotics literature by the acronym VSLAM (Simultaneous Localization and Mapping from Visual sensors). Spatial AI and VSLAM will augment endoscopies with novel features, including augmented reality insertions, detection of blind zones, polyp measurements or guidance to polyps found in previous explorations. In the long term, VSLAM will also support entirely new robotized autonomous procedures. For our purposes in this paper, the goal of VSLAM is to build a per-patient map, in real-time, during endoscope insertion in a first procedure. This map will be exploited and refined during the withdrawal phase of that first procedure, and in any future ones.
There are mature methods for out-of-the-body VSLAM1,2. However, bringing them to endoscopy implies overcoming new barriers. The light source is co-located with the endoscope camera, and hence it moves with the camera and stays close to the body surfaces. The body surfaces have poor texture and abundant reflections due to fluids. The scene geometry exhibits prevalent deformation. The video combines slow observation of areas of interest with fast motions and long occlusions of the endoscope lens.
Our contribution in this paper is the Endomapper dataset3, which makes available, for the first time, 96 high-quality calibrated recordings of complete routine endoscopies (Fig. 1), making secondary use of medical data, i.e., simply recording standard procedures that would have been performed in any case, without any modification. Compared to ad-hoc recordings, secondary-use ones exhibit realistic features and hence contain the actual challenges VSLAM will face in routine practice.
No other public dataset offers a comparable volume of fully calibrated endoscopies in HD (see Table 1). Heidelberg4 is a very interesting dataset that contains images of the colon from laparoscopic procedures. However, this view of the colon is not compatible with our goal of building 3D reconstructions of the interior of the gastrointestinal tract. CVCClinicDB, GIANA and Kvasir focus on polyp detection, since they are often used to benchmark CAD (Computer Aided Diagnosis) systems. Other datasets focus on segmentation of elements of interest, such as tools in instrument-kvasir or polyps in Kvasir-seg. However, they only provide sparse image sets or short videos (less than 30 seconds). Closer to ours, Colon10k provides images from short sequences for place recognition and reconstruction. In contrast, we offer hours of real calibrated video corresponding to complete procedures. Apart from these real imaging datasets, it is worth mentioning works that create simulated colon data, either using 3D models, like Rau et al.5, Bobrow et al.6 and Incetan et al.7, or a realistic phantom, like Ozyoruk et al.8. Similar to these works, we also include a few simulated sequences to harness the extra information of such scenes as a means to evaluate the methods developed. In particular, due to the monocular nature of the dataset, no ground truth geometry is available for quantitative evaluation on the real sequences. To address this issue, we include photorealistic sequences from a simulated colon, with ground truth geometry for the deforming scene and the endoscope trajectory.
The Endomapper dataset includes colonoscopies, gastroscopies, and calibration videos along with geometric and photometric calibration parameters. More than half of the sequences are screening colonoscopies, for which the standard procedure implies a thorough and slow exploration, which is close to typical operation modes for VSLAM and can serve as a bridge to more challenging sequences. To research map reuse and recognition in a second exploration, colonoscopies corresponding to the same patient but separated in time by several weeks are also included in the data.
Regarding metadata, some endoscopies include a description of the procedure made by the endoscopist, in the form of text footage. The text describes the anatomical regions traversed, re-explorations of the same region, the performed interventions or the tools used. This footage indexes the videos to identify interesting sections for VSLAM.
Building on our dataset, the community can provide derived or metadata results to support subsequent research. Some examples of these derived data are included in the dataset: 1) anatomical regions segmentation, at frame level, performed by a doctor after visualizing the video; 2) tools segmentation in selected video sections, which can boost the tool segmentation performance in the specific endoscopy domain; 3) Structure from Motion (SfM) reconstructions using COLMAP9, which provides up-to-scale 6 DoF endoscope trajectory and 3D models for the video segments corresponding to smooth explorations of non-deforming scenes. The SfM output has proven valid to supervise learning tasks such as image matching10 or image retrieval11.
Endomapper offers a sweet spot of challenge, including easy video segments where state-of-the-art algorithms perform reasonably well. However, all these methods also fail at some point, signaling the research challenges to be faced. We believe that the dataset will spur research that identifies challenges and will foster progress of VSLAM in gastrointestinal environments.
Finally, we have made publicly available 7 software repositories corresponding to photometric and geometric calibration from calibration videos, simulated colon sequence generation and technical validation. Section Code Availability details these repositories, including installation and usage instructions.
Methods
The methodology used to create the dataset is explained in this section. First, we present a description of the recording procedure for the sequences in the dataset, including the description of the capture system and the type of recordings. Then, we describe the calibration procedure and the methodology used in both geometric and photometric calibration. Finally, we also briefly summarize the methods used to create each type of meta-data.
Recording endoscopies procedure
The acquisition of the sequences in the dataset was performed in the Hospital Clinico Universitario Lozano Blesa, in Zaragoza (Spain), using an Olympus EVIS EXERA III CV-190 video processor, EVIS EXERA III CLV-190 light source, and EVIS EXERA III CF-H190 colonoscope or EVIS EXERA III GIF-H190 gastroscope. The acquisition system is composed of a computer and a data acquisition card connected to the endoscopy tower via a Digital Visual Interface (DVI). Two different acquisition cards have been used: Epiphan Video DVI2USB 3.0 and Magewell Pro Capture DVI. The videos were recorded at 1440 × 1080 with 24-bit RGB color, at 40 fps (Epiphan) or 50 fps (Magewell). The output image given by the endoscopy tower is cropped to remove personal information. The videos were manually edited to remove any frame recorded while the camera was out of the patient's body. During the span of the project, the recordings were done one day a week, synchronized with the medical staff involved. The patients were not selected based on their symptoms or pathology; we followed, without interference, the hospital's schedule, which was mostly focused on the colorectal cancer screening program. The Endomapper technical staff was present in all the recording sessions to secure the quality of the acquisitions, but without interfering with the medical procedure.
Use of human participants
The recordings were made under the ethical approval of the CEICA Ethics Committee (Comité de Ética de la Investigación de la Comunidad Autónoma de Aragón (CEICA), meetings 04/03/2020 acta 05/2020, 23/09/2020 acta 18/2020, 20/04/2022 acta 08/2022 and 16/11/2022 acta 20/2022). Informed consent was obtained from all subjects. According to this approval, the collection can be publicly accessed under certain conditions (see Section Usage Notes).
Calibration
The dataset uses 10 different colonoscopes and 8 different gastroscopes. The calibration sequences for all the colonoscopes and gastroscopes were acquired in a single session using a Lambertian pattern (obtained from calib.io). Figure 2 shows two frames of the calibration videos imaging the calibration pattern. The Lambertian pattern corresponds to an array of circles from the Vicalib12 library. The physical size of the pattern used is 5.61 × 9.82 cm.
Geometric calibration
The calibration videos are processed by Vicalib12 to obtain the endoscope intrinsic parameters according to the Kannala & Brandt model13,14. The calibration defines eight intrinsic parameters: four projective parameters (in pixels) fx, fy, Cx, Cy and four distortion coefficients k1, k2, k3, k4. We process 1 out of every 20 frames, and outlier matches are removed. Next, the projection model yielding the projection in pixels u = (u, v), for a 3D point with coordinates X = (x, y, z) with respect to the camera frame, is described as:

\({\bf{u}}=\left({f}_{x}\,{r}_{d}\,\frac{x}{r}+{C}_{x},\;{f}_{y}\,{r}_{d}\,\frac{y}{r}+{C}_{y}\right)\)

where \({r}_{d}=\theta \left(1+{k}_{1}{\theta }^{2}+{k}_{2}{\theta }^{4}+{k}_{3}{\theta }^{6}+{k}_{4}{\theta }^{8}\right)\) is the distorted radius, \(r=\sqrt{{x}^{2}+{y}^{2}}\) is the undistorted radius and θ = arctan2(r, z) is the angle between the incoming ray and the optical axis.
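In code, the projection defined by these intrinsics can be sketched as follows. This is a minimal Python implementation of the model as stated above; the function name and the values used in the test are illustrative, not part of the dataset tooling:

```python
import math

def kannala_brandt_project(X, fx, fy, cx, cy, k1, k2, k3, k4):
    """Project a 3D point X = (x, y, z), given in camera coordinates,
    to pixel coordinates (u, v) with the Kannala-Brandt fisheye model."""
    x, y, z = X
    r = math.sqrt(x * x + y * y)           # undistorted radius
    theta = math.atan2(r, z)               # angle between the ray and the optical axis
    # distorted radius: theta * (1 + k1*theta^2 + k2*theta^4 + k3*theta^6 + k4*theta^8)
    t2 = theta * theta
    rd = theta * (1 + t2 * (k1 + t2 * (k2 + t2 * (k3 + t2 * k4))))
    if r > 1e-12:
        return cx + fx * rd * (x / r), cy + fy * rd * (y / r)
    return cx, cy                          # point on the optical axis
```

For a point on the optical axis the projection reduces to the principal point (Cx, Cy).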
Photometric calibration
The light source and camera of the endoscope are calibrated to obtain a model able to reproduce the photometry of the recordings. In the endoscope, the distances between the light sources and the camera are small and mostly symmetrical. Following Modrzejewski et al.15, we assume that these sources can be modelled as a single virtual light and adopt the Spot Light Source model (SLS), which was shown to offer a good compromise between complexity and accuracy. In addition, the light spread function and the camera vignetting are jointly estimated assuming radial symmetry. With this model, the light radiance going from the endoscope to a 3D surface point X is
where P is the light center, σ0 is the light's intensity value and D is the principal direction in which light propagates, with a spreading factor μ that modulates the radial attenuation R. As the light traverses the scene, its radiance decreases as a function of the distance travelled \(d=\left\Vert {\bf{X}}-{\bf{P}}\right\Vert \), following an inverse-square law \(S\left({\bf{X}},{\bf{P}}\right)=1/{d}^{2}\). Finally, L(X, P) is the unit vector of the light's outgoing direction. The corresponding intensity value \({\mathscr{ {\mathcal I} }}({\bf{X}})\) on the image is:
where a bidirectional reflectance distribution function (BRDF) \({f}_{r}\left({{\boldsymbol{\omega }}}_{i},{{\boldsymbol{\omega }}}_{r}\right)\) defines how light is reflected from the surface to the camera. The projection of the light beam on the geometry introduces a cosine term of the angle θ between the incoming light ray ωi and the surface normal. Finally, the endoscope applies an automatic gain gt, which can vary at each time instant t, and a gamma curve (γ = 2.2) to improve the perceived dynamic range of the image.
The parameters of this model are estimated by optimising a photometric loss on the white areas of the Vicalib pattern (Fig. 2). The results of the calibration provide a 2D weighting of the photometric effects caused by the vignetting and the light spread function, that can be used to compensate them (Fig. 3).
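The composition of the model can be sketched from the quantities defined above. In the sketch below, the radial attenuation R is assumed to take the common cosine-power form max(0, D·L)^μ; this specific form, the function names, and the way the terms are composed into a final intensity are our illustrative assumptions, not the exact expressions of the calibrated model:

```python
import math

def sls_radiance(X, P, D, sigma0, mu):
    """Radiance reaching surface point X from a Spot Light Source centred
    at P with principal (unit) direction D. Assumes a cosine-power spread
    R = max(0, D.L)^mu as the radial attenuation (illustrative choice)."""
    diff = [x - p for x, p in zip(X, P)]
    d = math.sqrt(sum(c * c for c in diff))
    L = [c / d for c in diff]                           # outgoing light direction
    R = max(0.0, sum(di * li for di, li in zip(D, L))) ** mu
    return sigma0 * R / (d * d)                          # inverse-square falloff S = 1/d^2

def rendered_intensity(radiance, fr, cos_theta, gain, gamma=2.2):
    """Illustrative composition of the image intensity: BRDF value fr times
    incoming radiance times the cosine term, scaled by the automatic gain
    and passed through the gamma curve."""
    return (gain * fr * radiance * max(0.0, cos_theta)) ** (1.0 / gamma)
```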
Simulated colon
The VR-Caps7 simulator is used to generate photorealistic synthetic image sequences of a 3D colon model obtained from a Computed Tomography scan. Since this is a simulation, we have full access to the scene configuration: camera calibration, deformations, trajectory and illumination, hence to the ground truth geometry, camera pose and 3D deforming scene. For the same endoscope trajectory, we generated different sequences with more aggressive deformations to allow ablative studies with respect to the deformation magnitude. The applied deformations are described by the following equation:
where \({V}_{x}^{0}\), \({V}_{y}^{0}\) and \({V}_{z}^{0}\) are the coordinates of the surface point at rest. We control the magnitude and velocity of the deformations through the parameters A and ω, which correspond to the maximum excursion and the angular velocity of the deformations, respectively. We also modified the colon texture to increase its contrast. The code to create these simulated sequences is available in the repository EM_Dataset-Simulations (see Section Code availability, https://github.com/endomapper/EM_Dataset-Simulations).
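The exact deformation equation and its parameter units are stored with each simulated sequence in its info.txt. Purely as an illustration of how A and ω act, a sinusoidal displacement of a rest-position point along its surface normal could be sketched as follows (this specific functional form is our assumption, not the dataset's equation):

```python
import math

def deformed_point(V0, normal, A, omega, t):
    """Displace a rest-position surface point V0 = (Vx0, Vy0, Vz0) along
    its (unit) normal with a sinusoidal oscillation of maximum excursion A
    and angular velocity omega. Illustrative form only."""
    s = A * math.sin(omega * t)
    return tuple(v + s * n for v, n in zip(V0, normal))
```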
Meta-data
For a set of selected recordings, we provide several types of meta-data useful for plenty of potential research lines, but in particular for VSLAM. This subsection presents a description of the meta-data and the annotation methodologies.
Text footage
The endoscopist performing each procedure provided a description of it, registered during the exploration. It includes the anatomical regions traversed, the interventions, the medical findings such as the approximate size of polyps, the tools used, and the sections with NBI (Narrow-Band Imaging) illumination. This description is made available as text footage synchronized with the corresponding videos. This metadata can be useful, for example, to identify the sections of the video that are most promising for VSLAM, such as re-observations of the same region or interactions with tools of known size.
Anatomical regions
Anatomical section recognition is useful to create topological maps of the colon. These maps can be used to create smaller reconstructions with a lower probability of error. Some colonoscopy procedures were annotated by the medical staff of the project after the recording. Multiple careful visualizations were necessary to delimit the ten anatomical regions shown in Fig. 4.
Tools segmentation
Tool segmentation is one of the challenges for spatial AI in colonoscopies. Since tools occlude the view and cause failures in other algorithms, many works in the literature mask them out. Tools were manually segmented using the Odin CAT tool16, which allows a mask to be maintained across frames, yielding more robust annotations.
COLMAP 3D reconstruction
Traditional SIFT-based rigid SfM algorithms are able to produce partial reconstructions from colonoscopy videos. We include some examples of the output of COLMAP9,17 processing on our sequences, which provides a first approximation for the up-to-scale camera trajectory and the scene's sparse structure. This information can be organized to produce weak supervision in the form of sparse depth maps, local correspondences between frames, image-to-image labels (frames depicting the same place) or relative camera pose transformations between frames. Several computer vision tasks like depth prediction, image matching, image retrieval and visual localization can greatly benefit from this kind of supervision. Megadepth18 is a well-known dataset that uses a similar SfM procedure to obtain 3D point clouds, and it is extensively used for deep learning supervision10,19,20. Other works have employed SfM to identify co-visible frames in the recordings, which has proven useful to train CNNs for place recognition in landmark images11 and in colonoscopy sequences21,22.
For our recordings, we apply exhaustive guided matching between all the images in the sequence to associate frames that are temporally distant. We use our camera calibration and we do not optimize it during the COLMAP bundle adjustment. The minimum triangulation angle is relaxed to 8 degrees during the initialization of the models. The rest of the parameters are left as default.
Recordings from the same patient
One of the main obstacles in colon reconstruction is consistency between colonoscopies in longitudinal studies. Thanks to the colorectal cancer screening program, colonoscopy pairs from the same patient were recorded. These pairs help to evaluate the lifelong capabilities of the developed VSLAM algorithms.
Data Records
This section describes the dataset structure and details of the meta-data available. The dataset is available on the Synapse platform3 and is subject to access controls (see Section Usage Notes). A summary of the dataset structure can be seen in Fig. 5. At publication time, there are a total of 96 real sequences, with durations ranging from less than ten minutes to more than half an hour. The file DatasetSummary.xls in the dataset main folder includes a summary of the acquisition details of each sequence in the dataset.
Video recordings
Data is stored in the directory Sequences. Each procedure has a corresponding directory Seq_XXX (XXX is the sequence number) that contains:
1. The directory meta-data, which contains all the meta-data files associated with the sequence. These files are described in the next section.

2. The video Seq_XXX.mov, which contains the actual recording. The video codec is H26423, a lossy compression using profile High 4:4:4 at level 4.2 with a bit rate of 7 Mbps. It offers an optimal size vs. quality trade-off for lossy compression.

3. The thumbnail version, Seq_XXX_thumbnail.webm, which contains a compressed version of the recording for easy and quick visualization. This version uses the free codec libvpx24, at 320 × 240 resolution.

4. A subtitle file, Seq_XXX.srt, if the video has text footage in the form of text subtitles.

5. The metadata file, Seq_XXX_info.json, where the sequence number, endoscope number and type of metadata of the procedure are stored.
Additionally, the folder Lossless_sequences contains the lossless versions of the videos. This format uses codec ffv1 version 3 with a bitrate of 310 Mbps.
Camera calibration
All the calibration information is included in the directory Calibrations. There is a directory Endoscope_XX (XX is the endoscope number) for each endoscope that contains:
1. The calibration video Endoscope_XX.mov. This is the lossy H264 version; the lossless version can be found in the lossless folder mentioned before.

2. The geometric calibration parameters Endoscope_XX_geometrical.xml.

3. The photometric calibration parameters Endoscope_XX_photometrical.xml.
Geometric calibration
The file Endoscope_XX_geometrical.xml is the output calibration from Vicalib12. This XML file contains the intrinsic parameters of the camera (fx, fy, Cx, Cy, k1, k2, k3, k4) following the Vicalib output format.
Photometric calibration
The photometric calibration file, Endoscope_XX_photometrical.xml, contains the calibrated parameters of the light source and the camera of the endoscope. An endoscope’s <rig> may have one or more <camera> tags, associated with one or more <light> sources. Currently, only a single camera and a single virtual light are supported.
Each camera tag has a particular <camera_model>. This model has a single parameter, the value of the gamma γ response function in Eq. (4). Regarding the light source, the <light_model> has four parameters: the intensity value σ0, the light spread factor μ and two vectors for the light centre P and the principal direction D.
Simulated colon
All the data related to the simulated colon is included in the directory Simulated Sequences. There is a directory seq_X (X is the sequence number) for each sequence obtained from the simulation. The directory contains:
1. The directory rgb with the RGB images of the sequence in png format.

2. The directory depth with the depth images for each RGB image of the sequence, stored in exr format.

3. A file rgb.txt with a list of the file names of all RGB images of the sequence.

4. A file depth.txt with a list of the file names of all depth images of the sequence.

5. A file trajectory.csv containing the ground truth camera trajectory.

6. A file calibration.txt containing the simulated camera calibration.

7. A file info.txt containing the applied deformations, their parameters and units.
Meta-data
This section contains the details and formats for each type of meta-data. The file DatasetSummary.xls details the availability of the metadata in each sequence of the dataset.
Text footage
Two files, Seq_XXX.json and Seq_XXX.srt, are included inside the root and meta-data folders. The .json file contains a structure with the timestamp and the associated text. The text footage is also included in .srt format to ease visualization synchronized with the video. The references to identify the tools used during the procedure are stored in the meta-data directory.
Anatomical regions
Table 2 shows the detailed number of frames labelled for each region in each video. The dataset contains this information in a file named Anatomical_Regions_XXX.txt with the format Frame###;region label; in each line.
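A minimal parser for this per-frame label format might look like the following sketch (the function name is ours):

```python
def parse_anatomical_regions(path):
    """Parse Anatomical_Regions_XXX.txt, where each line follows the
    format 'Frame###;region label;', into a {frame_number: label} dict."""
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            frame_field, region = line.split(";")[:2]
            labels[int(frame_field.replace("Frame", ""))] = region
    return labels
```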
Tool segmentation
There are 4086 frames with tools segmented across four different colonoscopies as detailed in Table 3. The segmentations for each video can be found in file tool_segmentation_XXX.xml. This file contains, for each segmented frame, the id of the frame and a list of 2D point coordinates that define the tool segmentation as a binary polygon. The segmentation was done using a proprietary Odin CAT tool16. Some examples can be seen in Fig. 6.
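Since each tool is stored as a 2D polygon, a binary mask can be recovered by rasterizing it. A dependency-free sketch using an even-odd point-in-polygon test follows; the parsing of the XML itself is omitted because the exact tag layout is specific to the annotation tool:

```python
def point_in_polygon(x, y, poly):
    """Even-odd rule: True if (x, y) lies inside the polygon given
    as a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):                         # edge crosses the scanline
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(poly, width, height):
    """Rasterize the polygon into a binary mask, testing pixel centres."""
    return [[1 if point_in_polygon(c + 0.5, r + 0.5, poly) else 0
             for c in range(width)] for r in range(height)]
```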
COLMAP 3D reconstruction
Table 4 summarizes the reconstruction results for the Endomapper sequences. The reconstructions are stored following the text format of COLMAP (https://colmap.github.io/format.html). We provide text files showing the images contained in each cluster reconstructed by COLMAP, as well as the images that COLMAP considered covisible, i.e. images that have at least one 3D point in common. Figure 7 shows two examples of these reconstructed clusters.
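The COLMAP text files can be read back with a short parser. A minimal sketch for images.txt, where each image occupies two lines ('IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME' followed by the 2D points line), as documented in the COLMAP format reference linked above:

```python
def read_colmap_images_txt(path):
    """Parse COLMAP's text-format images.txt into a dict keyed by image id."""
    images = {}
    with open(path) as f:
        lines = [l.strip() for l in f if l.strip() and not l.lstrip().startswith("#")]
    for header, points in zip(lines[0::2], lines[1::2]):
        fields = header.split()
        images[int(fields[0])] = {
            "q": tuple(map(float, fields[1:5])),   # world-to-camera rotation (quaternion)
            "t": tuple(map(float, fields[5:8])),   # world-to-camera translation
            "camera_id": int(fields[8]),
            "name": fields[9],
            # count 2D features with a valid 3D point (POINT3D_ID != -1)
            "n_points3D": sum(1 for pid in points.split()[2::3] if pid != "-1"),
        }
    return images
```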
Same patient recordings
A file SamePatient.json is stored in the root folder containing which sequences are from the same patient and the time that separates both sequences.
Technical Validation
Firstly, we detail an error analysis of the calibration and a comparison of the calibration parameters among endoscopes. Secondly, we test state-of-the-art SfM and VSLAM algorithms on typical colonoscopy sequences. Finally, the anatomical region and tool segmentation labels are validated on state-of-the-art recognition algorithms. All the code used in this section is publicly available in the Endomapper repositories https://github.com/Endomapper.
Calibration validation
Geometric calibration
The software used to compute the geometric calibration and to obtain the validation and comparisons shown in this section is available in the repository EM_Dataset-GeometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-GeometricCalibration). The geometric calibration was computed from the calibration videos using the Vicalib12 tool, tuning the parameters for each endoscope calibration separately. The parameters are detailed in the repository.
To compare the different calibrations visually, we have undistorted a grid using each calibration. Figure 8 shows the differences between the endoscopes, for the 10 colonoscopes and the 8 gastroscopes respectively. The results show that the calibrations are equivalent around the center of the images, and differences between them are significant only at the image borders.
To further analyze the calibration results, we examined the reprojection error. For all the calibrations, the RMSE is between 0.3 and 0.4 pixels. We have selected two endoscopes, one colonoscope (Endoscope_06) and one gastroscope (Endoscope_18), as prototypes. Figure 9 displays the inlier reprojection error distribution for the selected endoscopes. The inlier error is uniformly distributed over the image, with fewer measurements only at the image boundaries. The calibrations are therefore expected to be very accurate in general, with the least accurate areas being those closest to the borders.
The projection function that relates the incoming ray angle θ to the distorted radius rd is plotted in Fig. 10. Both types of endoscopes behave almost identically, and both are very close to an orthogonal-projection fisheye lens13. Finally, Fig. 11 shows the view angle of both prototype endoscopes. The gastroscope has a lower view angle than the colonoscope at the edges, which is why each type of endoscope needs to be calibrated separately.
With this analysis we conclude that the individual calibrations obtained from the videos are accurate: the RMSE is low, the measurements cover most of the image, and the results are consistent across all endoscopes. We believe that accurate calibration information boosts the performance of geometric methods.
Photometric calibration
The software used to compute the photometric calibration and to obtain the validation shown in this section is available in the repository EM_Dataset-PhotometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-PhotometricCalibration). The photometric calibration was computed from the calibration videos for each endoscope separately.
We selected 38 frames per sequence, looking for a variety of distances from the camera to the calibration pattern. On each frame, we consider a 120° field of view. The centre of the virtual light converges about 4 mm behind the tip of the endoscope, thus being able to model all real lights with a single beam. The gamma value is experimentally fixed to γ = 2.2, which is also a broadly used value. The endoscope applies a continuous gain control, progressively increasing or decreasing the gain. Relative auto-gain is estimated with respect to the first image of the sequence. Consequently, the σ0 value is unobservable and it is fixed to one.
The resulting models are validated in a different set of images of the Vicalib pattern. The photometric errors in Table 5 show the validation results of eight colonoscopes and eight gastroscopes. In the colonoscopes, the calibration is able to estimate the pixel intensities of the images with an RMSE of 2.9 grey levels. In gastroscopes, lights are not symmetrically placed on the tip of the endoscope. Consequently, RMSE increases slightly, up to 3.3 grey levels.
SfM/SLAM validation
COLMAP validation
COLMAP is able to estimate sparse reconstructions for different sections along a sequence; see some examples in Fig. 7. As can be seen, the 3D point cloud and the camera trajectory look reasonable, showing a tubular shape with cameras traversing it. The covisibility information is always accurate, as the geometrical checks in COLMAP prevent frames that do not observe the same place from being incorrectly reconstructed in the same model.
Covisibility information was exploited in22, where a CNN was trained with COLMAP reconstructions from our sequences for the task of image retrieval. The system22 is able to recognize frames observing the same place in the colon, even when the frames come from different colonoscopies of the same patient. Figure 12 shows some examples of frames successfully retrieved between two colonoscopies of the same patient performed within two weeks of each other.
The 3D reconstructions look reasonably accurate and can be of great help as a weak supervision for training depth, camera pose or image retrieval networks. The software used to compute the reconstruction shown in this section is available in the repository EM_Dataset-ColmapValidation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ColmapValidation).
ORB-SLAM validation
ORB-SLAM31 is the reference system for sparse real-time visual SLAM. We have used it to process the whole Seq_015 video, building the map and estimating the endoscope pose. To achieve real-time performance, the image size is reduced from 1440 × 1080 to 720 × 540 and one out of every two frames is skipped. To address non-rigidity, the reprojection error acceptance threshold has been increased by a ×2 factor with respect to its default value for rigid scenes, which helps prevent tracking losses. The Kannala-Brandt camera model has proven essential to extract and triangulate features close to the borders of the image, where distortion is significant. As argued, an accurate calibration enables the use of the whole image for geometric computation, boosting accuracy and robustness.
ORB-SLAM3 has been able to estimate 133 sub-maps of small size (see Table 6 for a summary of the map specifications). The camera has been localized successfully with respect to a map in 25% of the frames. The time between video frames is 40 ms, and ORB-SLAM3 is able to run in real time, with an average tracking time of 23 ms and a maximum of 37 ms. Figure 13 shows a sub-map where the camera undergoes a forward-backward motion. The map contains 54 keyframes, 3682 points and 349 frames.
From this analysis we conclude that our EndoMapper dataset offers the challenges of real endoscopy exploration, such as scene deformation, multiple occlusions, changes in lighting, and clutter due to cleaning water or tools, which eventually result in frequent tracking losses. Classical discrete-feature VSLAM methods like ORB-SLAM3 can run on these videos in real time, localizing the camera in 25% of the frames. However, the scene model is fragmented into a myriad of very small rigid sub-maps. The clear challenge is developing multi-mapping techniques for endoscopy that are able to merge sub-maps with common areas.
The software and the detailed tuning used to compute the ORB-SLAM3 reconstructions shown in this section are available in the repository EM_Dataset-ORBSLAM3Validation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ORBSLAM3Validation).
Anatomical region validation
The anatomical region labels have been validated by fine-tuning different models for anatomical region recognition. The software used to evaluate anatomical region recognition is available in the repository EM_Dataset-AnatomicalRegions (see Section Code availability, https://github.com/endomapper/EM_Dataset-AnatomicalRegions). Following works in medical imaging25 and datasets26, we fine-tuned four different CNNs that are known to perform well on medical image classification: EfficientNet V227, MobileNetV228, DenseNet29 and ResNetV230. With MobileNet and EfficientNet, we look for models requiring low computational resources; DenseNet and ResNet were chosen for their performance in image classification. For the fine-tuning, we train the models for 100 epochs following the learning rate decay in Thanh et al.27, and use 4 sequences (Seq_003, Seq_011, Seq_013, Seq_093) for training and Seq_094 for testing. Seq_094 was chosen as the test sequence because it has the best balance between classes. The metrics used to evaluate this experiment are Top-1 and Top-3 accuracy, defined as the fraction of samples for which the true class matches the most probable prediction or any of the 3 most probable predictions, respectively. Table 7 presents the results for anatomical region recognition.
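The Top-k metric used here can be computed in a few lines; a small sketch (the function name is ours):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k classes with
    the highest predicted scores (Top-1 and Top-3 as used in Table 7)."""
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k highest-scoring classes for this sample
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```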
ResNet performs best in Top-1 accuracy and similarly to the other models in Top-3, making it the best model overall. It is also interesting to note that MobileNet obtains close performance with a smaller computational footprint, which makes it attractive for real-time systems. It is also worth remarking that Top-1 accuracy is low in comparison to other computer vision tasks, suggesting that anatomical region classification is a challenging research problem. Indeed, the differences between sections are very subtle, even to the trained eye. In any case, the Top-3 accuracy results are promising and show that the EndoMapper data is a relevant tool to advance performance on this problem.
Tool segmentation validation
The tool segmentation labels have been validated by training and testing several models for binary tool segmentation, as proposed in Tomasini et al.31. This work compares the performance of various models on three different datasets, including the EndoMapper dataset labels. All the models were trained from scratch on the EndoVis17 dataset and fine-tuned on Kvasir-Inst and EndoMapper. The performance results obtained can be seen in Table 8. Examples of binary segmentation of images from the EndoMapper dataset obtained using the different models can be seen in Fig. 14.
The lower mIoU of all models on our EndoMapper dataset compared to that on the other datasets highlights the challenge of the EndoMapper tool segmentation labels. It is interesting to note that the efficient MiniNetV2 reaches similar performance to state-of-the-art models on all datasets while requiring less memory and lower inference time. The software used to evaluate the tool segmentation models is available in the repository EM_Dataset-ToolSegmentation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ToolSegmentation).
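The mIoU metric used in this comparison can be sketched as follows. This is a minimal NumPy illustration of per-frame foreground IoU averaged over frames; the exact averaging in Tomasini et al.31 and the EM_Dataset-ToolSegmentation repository may differ (e.g., it may also average the background class), and the toy masks are assumptions for the example:

```python
import numpy as np

def binary_iou(pred, gt):
    """Intersection-over-Union of two binary tool masks (bool arrays)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks agree perfectly.
    return float(inter / union) if union > 0 else 1.0

def mean_iou(preds, gts):
    """Mean IoU over a set of frames."""
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))

# Toy 4x4 masks: a 4-pixel ground-truth tool vs. a 6-pixel prediction.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

iou = binary_iou(pred, gt)  # intersection 4, union 6 -> 2/3
```

Because IoU penalizes both missed tool pixels and over-segmentation, it is a stricter score than pixel accuracy on frames where the tool occupies a small fraction of the image, which is the common case in endoscopy.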
Usage Notes
The dataset is available on the Synapse platform3. It can be publicly accessed under the following conditions: 1) Use is limited to research on how to obtain relevant medical information from images or video. 2) Redistribution of the data is not allowed. 3) A Statement of Intended Use is required, describing how you intend to use this data. 4) You further agree to cite the DOI of the collection and the publication in any publication resulting from this content, as follows: a) Azagra, P. et al. EndoMapper dataset of complete calibrated endoscopy procedures. Synapse https://doi.org/10.7303/syn26707219 (2022). b) Azagra, P. et al. EndoMapper dataset of complete calibrated endoscopy procedures. Scientific Data. 5) Images from the collection can be included in the citing scientific publications. 6) Video segments can be used to produce multimedia material in the citing scientific publications.
Code availability
The dataset can be used without any further code. All the code used for the calibration, simulated sequence generation and technical validation is publicly available in repositories at https://github.com/Endomapper. The instructions for installing and using them are available in Synapse at https://www.synapse.org/#!Synapse:syn52137895.
References
Campos, C., Elvira, R., Rodríguez, J. J. G., Montiel, J. M. M. & Tardós, J. D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Transactions on Robotics 37, 1874–1890 (2021).
Engel, J., Koltun, V. & Cremers, D. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 611–625 (2017).
Azagra, P. et al. Endomapper dataset of complete calibrated endoscopy procedures. Synapse https://doi.org/10.7303/syn26707219 (2022).
Maier-Hein, L. et al. Heidelberg colorectal data set for surgical data science in the sensor operating room. Scientific Data 8, 1–11 (2021).
Rau, A., Bhattarai, B., Agapito, L. & Stoyanov, D. Bimodal camera pose prediction for endoscopy. Preprint at https://arxiv.org/abs/2204.04968 (2022).
Bobrow, T. L. et al. Colonoscopy 3D video dataset with paired depth from 2D-3D registration. Medical Image Analysis, 102956 (2023).
Incetan, K. et al. VR-Caps: a virtual environment for capsule endoscopy. Medical Image Analysis 70, 101990 (2021).
Ozyoruk, K. B. et al. EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical Image Analysis 71, 102058 (2021).
Schönberger, J. L. & Frahm, J.-M. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4104–4113 (2016).
Dusmanu, M. et al. D2-net: A trainable CNN for joint description and detection of local features. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8092–8101 (2019).
Radenović, F., Tolias, G. & Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1655–1668 (2018).
Heckman, C. et al. Vicalib. https://github.com/arpg/vicalib. University of Colorado, Boulder. (accessed: 14.10.2020).
Kannala, J. & Brandt, S. S. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1335–1340 (2006).
Usenko, V., Demmel, N. & Cremers, D. The double sphere camera model. In 2018 International Conference on 3D Vision (3DV), 552–560 (IEEE, 2018).
Modrzejewski, R., Collins, T., Hostettler, A., Marescaux, J. & Bartoli, A. Light modelling and calibration in laparoscopy. International Journal of Computer Assisted Radiology and Surgery 15, 859–866 (2020).
Odin. Odin cat tool. https://cat-aws.odin-vision.com.
Schönberger, J. L., Zheng, E., Pollefeys, M. & Frahm, J.-M. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 501–518 (2016).
Li, Z. & Snavely, N. Megadepth: Learning single-view depth prediction from internet photos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2041–2050 (2018).
Sarlin, P.-E. et al. Back to the feature: Learning robust camera localization from pixels to pose. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3247–3257 (2021).
Yang, T.-Y., Nguyen, D.-K., Heijnen, H. & Balntas, V. UR2KiD: Unifying retrieval, keypoint detection, and keypoint description without local correspondence supervision. Preprint at https://arxiv.org/abs/2001.07252 (2020).
Ma, R. et al. Colon10k: A benchmark for place recognition in colonoscopy. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 1279–1283 (2021).
Morlana, J., Azagra, P., Civera, J. & Montiel, J. M. M. Self-supervised visual place recognition for colonoscopy sequences. In Medical Imaging with Deep Learning (MIDL) (2021).
Richardson, I. E. H.264 and MPEG-4 video compression: video coding for next-generation multimedia (John Wiley & Sons, 2004).
Grange, A. & De Rivaz, P. VP9 bitstream & decoding process specification. Version 0.6. https://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-bitstream-specification-v0.6-20160331-draft.pdf (2016).
Thanh, N. C., Long, T. Q. et al. Polyp segmentation in colonoscopy images using ensembles of U-Nets with efficientnet and asymmetric similarity loss function. In 2020 RIVF International Conference on Computing and Communication Technologies (RIVF), 1–6 (IEEE, 2020).
Borgli, H. et al. Hyperkvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data 7, 1–14 (2020).
Tan, M. & Le, Q. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, 10096–10106 (PMLR, 2021).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4510–4520 (2018).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4700–4708 (2017).
He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 630–645 (Springer, 2016).
Tomasini, C., Riazuelo, L., Murillo, A. & Alonso, I. Efficient tool segmentation for endoscopic videos in the wild. In Medical Imaging with Deep Learning (MIDL) (2022).
Fernández-Esparrach, G. et al. Exploring the clinical potential of an automatic colonic polyp detection method based on the creation of energy maps. Endoscopy 48, 837–842 (2016).
Ali, S. et al. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports 10, 2748 (2020).
Bernal, J., Tudela, Y., Riera, M. & Sánchez, F. J. Polyp detection in colonoscopy videos. In Computer-Aided Analysis of Gastrointestinal Videos, 163–169 (Springer International Publishing, Cham, 2021).
Pogorelov, K. et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In 8th ACM Multimedia Systems Conference (MMSys'17), 164–169 (2017).
Jha, D. et al. Kvasir-seg: A segmented polyp dataset. In International Conference on Multimedia Modeling, 451–462 (Springer, 2020).
Pogorelov, K. et al. Nerthus: A bowel preparation quality video dataset. In 8th ACM Multimedia Systems Conference, MMSys'17, 170–174 (2017).
Jha, D. et al. Kvasir-instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy. In MultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II 27, 218–229 (Springer, 2021).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (MICCAI), 234–241 (Springer, 2015).
Iglovikov, V. & Shvets, A. Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. Preprint at https://arxiv.org/abs/1801.05746 (2018).
Chaurasia, A. & Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP), 1–4 (IEEE, 2017).
Jin, Y., Cheng, K., Dou, Q. & Heng, P.-A. Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 440–448 (Springer, 2019).
Alonso, I., Riazuelo, L. & Murillo, A. C. Mininet: An efficient semantic segmentation convnet for real-time robotic applications. IEEE Transactions on Robotics 36, 1340–1347 (2020).
Acknowledgements
This work was supported by EU-H2020 grant 863146 ENDOMAPPER, Spanish Government grants PI20/01514 and FPU20/06782, and Aragón Government grant T45 23 R.
Author information
Authors and Affiliations
Contributions
P.A. coordinated the dataset acquisition and led the organization. J.M.M.M. originated the concept of the dataset. P.A., L.R., J.C., J.D.T., A.C.M. and J.M.M.M. designed the dataset details. C.S., A.F. and A.L. performed the endoscopies, provided medical explanations and anatomical labels. P.A., L.R., C.O. and J.M.M.M. designed and operated the data acquisition system and created the database. P.A., V.M.B., J.D.T. and J.M.M.M. performed the endoscope calibration. J.J.G.R. and J.D.T. provided colon simulations. O.L.B., J.M. and J.M.M.M. provided COLMAP reconstructions. C.T., L.R. and A.C.M. provided tool segmentation. P.A. and J.L. provided anatomical landmark annotations. P.A., L.R., O.L.B., C.T., J.M., D.R., V.M.B., J.J.G.R., R.E., J.C., J.D.T., A.C.M. and J.M.M.M. performed the analysis and technical validation. P.A., L.R., J.M., V.M.B., J.J.G.R., J.D.T. and J.M.M.M. created and edited the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Azagra, P., Sostres, C., Ferrández, Á. et al. Endomapper dataset of complete calibrated endoscopy procedures. Sci Data 10, 671 (2023). https://doi.org/10.1038/s41597-023-02564-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-023-02564-7
This article is cited by
- ToTem NRSfM: Object-Wise Non-rigid Structure-from-Motion with a Topological Template. International Journal of Computer Vision (2024).