Background & Summary

Endoscopes traversing body cavities are routine. However, their potential for navigation assistance or device autonomy remains mostly locked. To unlock it, computer-assisted endoscopes would require spatial AI (Artificial Intelligence) capabilities, i.e., the ability to estimate a map of the regions traversed, along with the endoscope localization within such a map. This capability is known in the robotics literature by the acronym VSLAM (Simultaneous Localization and Mapping from Visual sensors). Spatial AI and VSLAM will augment endoscopies with novel features, including augmented reality insertions, detection of blind zones, polyp measurements or guidance to polyps found in previous explorations. In the long term, VSLAM will also support entirely new robotized autonomous procedures. For our purposes in this paper, the goal of VSLAM is to build a per-patient map, in real time, during endoscope insertion in a first procedure. This map will be exploited and refined during the withdrawal of that first procedure, and in any future one.

There are mature methods for out-of-body VSLAM1,2. However, bringing them to endoscopy implies overcoming new barriers. The light source is co-located with the endoscope, and hence moves with it and stays close to the body surfaces. The body surfaces have poor texture and abundant reflections due to fluids. The scene geometry includes prevalent deformation. The video combines slow observation of areas of interest with fast motions and long occlusions of the endoscope lens.

Our contribution in this paper is the Endomapper dataset3, which makes available, for the first time, 96 high-quality calibrated recordings of complete routine endoscopies (Fig. 1), making secondary use of medical data, i.e., recording standard procedures that would have been performed in any case, without any modification. Compared to ad-hoc recordings, secondary-use ones show realistic features and hence contain the actual challenges VSLAM will face in routine practice.

Fig. 1
figure 1

Overview of the Endomapper Dataset.

No other public dataset offers a comparable volume of fully calibrated endoscopies in HD (see Table 1). Heidelberg4 is a very interesting dataset that contains images of the colon from laparoscopic procedures. However, this view of the colon is not compatible with our goal of building 3D reconstructions of the interior of the gastrointestinal tract. CVCClinicDB, GIANA and Kvasir focus on polyp detection and are often used to benchmark CAD (Computer Aided Diagnosis) systems. Other datasets focus on segmentation of elements of interest, such as tools in instrument-kvasir or polyps in Kvasir-seg. However, they only provide sparse image sets or short videos (less than 30 seconds). Closer to ours, Colon10k provides images from short sequences for place recognition and reconstruction. In contrast, we offer hours of real calibrated video, corresponding to complete procedures. Apart from these real imaging datasets, it is worth mentioning works that create simulated data of the colon, either using 3D models, like Rau et al.5, Bobrow et al.6 and Incetan et al.7, or a realistic phantom, like Ozyoruk et al.8. Similar to these works, we also include a few simulated sequences to harness the extra information of these scenes as a means to evaluate the methods developed. In particular, due to the monocular nature of the dataset, no ground truth geometry is available for quantitative evaluation. To address this issue, we include photorealistic sequences from a simulated colon, with ground truth geometry for the deforming scene and the endoscope trajectory.

Table 1 Overview of existing datasets of endoscopies in the gastrointestinal tract.

The Endomapper dataset includes colonoscopies, gastroscopies, and calibration videos along with geometric and photometric calibration parameters. More than half of the sequences are screening colonoscopies, for which the standard procedure implies a thorough and slow exploration, close to typical VSLAM operating conditions, and which can serve as a bridge to more challenging sequences. To research map reuse and recognition in a second exploration, colonoscopies corresponding to the same patient but separated in time by several weeks are also included in the data.

Regarding metadata, some endoscopies include a description of the procedure made by the endoscopist, in the form of text footage. The text describes the anatomical regions traversed, re-explorations of the same region, the interventions performed and the tools used. This footage indexes the videos, helping to identify sections of interest for VSLAM.

Building on our dataset, the community can provide derived data or metadata to support subsequent research. Some examples of these derived data are included in the dataset: 1) anatomical region labels, at frame level, annotated by a doctor after reviewing the video; 2) tool segmentation in selected video sections, which can boost tool segmentation performance in the specific endoscopy domain; 3) Structure from Motion (SfM) reconstructions using COLMAP9, which provide an up-to-scale 6 DoF endoscope trajectory and 3D models for the video segments corresponding to smooth explorations of non-deforming scenes. The SfM output has proven valid to supervise learning tasks such as image matching10 or image retrieval11.

Endomapper offers a sweet spot of challenge, including easy video segments where state-of-the-art algorithms perform reasonably. However, these methods also fail at some point, signaling the research challenges to face. We believe that the dataset will spur research that identifies these challenges and will foster the progress of VSLAM in gastrointestinal environments.

Finally, we have made publicly available 7 software repositories corresponding to photometric and geometric calibration from the calibration videos, generation of the simulated colon sequences, and technical validation. Section Code Availability details these repositories, including installation and usage instructions.

Methods

The methodology used to create the dataset is explained in this section. First, we present a description of the recording procedure for the sequences in the dataset, including the description of the capture system and the type of recordings. Then, we describe the calibration procedure and the methodology used in both geometric and photometric calibration. Finally, we also briefly summarize the methods used to create each type of meta-data.

Endoscopy recording procedure

The acquisition of the sequences in the dataset was performed in the Hospital Clinico Universitario Lozano Blesa, in Zaragoza (Spain), using an Olympus EVIS EXERA III CV-190 video processor, EVIS EXERA III CLV-190 light source, and EVIS EXERA III CF-H190 colonoscope or EVIS EXERA III GIF-H190 gastroscope. The acquisition system is composed of a computer and a data acquisition card connected to the endoscopy tower via a Digital Visual Interface (DVI). Two different acquisition cards have been used: Epiphan Video DVI2USB 3.0 and Magewell Pro Capture DVI. The videos were recorded at 1440 × 1080, 40 fps and 24-bit RGB (Epiphan) or 1440 × 1080, 50 fps and 24-bit RGB (Magewell). The output image given by the endoscopy tower is cropped to remove personal information. The videos were manually edited to remove any frame recorded when the camera was out of the body of the patient. During the span of the project, the recordings were done one day of the week and synchronized with the medical staff involved. The patients were not selected based on their symptoms or pathology; we followed the hospital’s schedule without interference, which was mostly focused on the colorectal cancer screening program. The Endomapper technical staff was present in all the recording sessions to ensure the quality of the acquisitions, but without interfering with the medical procedure.

Use of human participants

The recordings were made under the ethical approval of the CEICA Ethics Committee (Comité de Ética de la Investigación de la Comunidad Autónoma de Aragón (CEICA), meetings 04/03/2020 acta 05/2020, 23/09/2020 acta 18/2020, 20/04/2022 acta 08/2022 and 16/11/2022 acta 20/2022). Informed consent was obtained from all subjects. According to this approval, the collection can be publicly accessed under certain conditions (see Section Usage Notes).

Calibration

The dataset uses 10 different colonoscopes and 8 different gastroscopes. The calibration sequences for all the colonoscopes and gastroscopes were acquired in a single session using a Lambertian pattern (obtained from calib.io). Figure 2 shows two frames of the calibration videos imaging the calibration pattern. The Lambertian pattern corresponds to an array of circles from the Vicalib12 library. The physical size of the pattern used is 5.61 × 9.82 cm.

Fig. 2
figure 2

Two examples of calibration images (left, middle). The calibration pattern (right).

Geometric calibration

The calibration videos are processed by Vicalib12 to obtain the endoscope intrinsic parameters according to the Kannala & Brandt model13,14. The calibration defines eight intrinsic parameters: four projective parameters (in pixels) fx, fy, Cx, Cy and four distortion coefficients k1, k2, k3, k4. We process 1 out of every 20 frames and remove outlier matches. The projection model, yielding the projection in pixels u = (u, v) for a 3D point with coordinates X = (x, y, z) in the camera frame, is described as:

$$u={f}_{x}{x}_{d}+{C}_{x},\quad {x}_{d}={r}_{d}\frac{x}{r}$$
(1)
$$v={f}_{y}{y}_{d}+{C}_{y},\quad {y}_{d}={r}_{d}\frac{y}{r}$$
(2)

where \({r}_{d}=\theta \left(1+{k}_{1}{\theta }^{2}+{k}_{2}{\theta }^{4}+{k}_{3}{\theta }^{6}+{k}_{4}{\theta }^{8}\right)\) is the distorted radius, \(r=\sqrt{{x}^{2}+{y}^{2}}\) is the undistorted radius and θ = arctan2(r, z) is the angle between the incoming ray and the optical axis.
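For reference, the sketch below re-implements Eqs. (1)–(2) in Python/NumPy, using the parameter names fx, fy, Cx, Cy, k1–k4 stored in the geometric calibration files. It is an illustrative re-implementation of the Kannala & Brandt projection, not the Vicalib code itself.

```python
import numpy as np

def project_kannala_brandt(X, fx, fy, Cx, Cy, k1, k2, k3, k4):
    """Project a 3D point X = (x, y, z), given in the camera frame,
    to pixel coordinates (u, v) following Eqs. (1)-(2)."""
    x, y, z = X
    r = np.sqrt(x**2 + y**2)              # undistorted radius
    theta = np.arctan2(r, z)              # angle between incoming ray and optical axis
    # distorted radius (polynomial in theta)
    r_d = theta * (1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8)
    if r > 0:
        x_d, y_d = r_d * x / r, r_d * y / r
    else:                                  # point on the optical axis
        x_d, y_d = 0.0, 0.0
    u = fx * x_d + Cx
    v = fy * y_d + Cy
    return u, v
```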

Photometric calibration

The light source and camera of the endoscope are calibrated to obtain a model able to reproduce the photometry of the recordings. In the endoscope, the distances between the light sources and the camera are small and mostly symmetrical. Following Modrzejewski et al.15, we assume that these sources can be modelled as a single virtual light and adopt the Spot Light Source model (SLS), which was shown to offer a good compromise between complexity and accuracy. In addition, the light spread function and the camera vignetting are jointly estimated assuming radial symmetry. With this model, the light radiance going from the endoscope to a 3D surface point X is

$${\sigma }_{{\rm{SLS}}}\left({\bf{X}},{\bf{P}}\right)={\sigma }_{0}R\left(\mu ,{\bf{D}},{\bf{X}},{\bf{P}}\right)S\left({\bf{X}},{\bf{P}}\right)L\left({\bf{X}},{\bf{P}}\right),$$
(3)

where P is the light centre, σ0 is the light’s intensity value and D is the principal direction in which light propagates, with a spreading factor μ that modulates the radial attenuation R. As the light traverses the scene, its radiance decreases as a function of the distance travelled \(d=\left\Vert {\bf{X}}-{\bf{P}}\right\Vert \), following an inverse-square law \(S\left({\bf{X}},{\bf{P}}\right)=1/{d}^{2}\). Finally, L(X, P) is the unit vector of the light’s outgoing direction. The corresponding intensity value \({\mathcal{I}}({\bf{X}})\) on the image is:

$${\mathcal{I}}({\bf{X}})={\left(\left|{\sigma }_{{\rm{SLS}}}({\bf{X}},{\bf{P}})\right|\,{f}_{r}({{\boldsymbol{\omega }}}_{i},{{\boldsymbol{\omega }}}_{r})\,\cos \theta \,{g}_{t}\right)}^{1/\gamma },$$
(4)

where a bidirectional reflectance distribution function (BRDF) \({f}_{r}\left({{\boldsymbol{\omega }}}_{i},{{\boldsymbol{\omega }}}_{r}\right)\) defines how light is reflected from the surface to the camera. The projection of the light beam on the geometry introduces a cosine term of the angle θ between the incoming light ray ωi and the surface normal. Finally, the endoscope applies an automatic gain gt, which can vary at each time instant t, and a gamma curve (γ = 2.2) to improve the perceived dynamic range of the image.

The parameters of this model are estimated by optimising a photometric loss on the white areas of the Vicalib pattern (Fig. 2). The calibration results provide a 2D weighting of the photometric effects caused by the vignetting and the light spread function, which can be used to compensate for them (Fig. 3).
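As an illustration, the sketch below evaluates Eqs. (3)–(4) in Python/NumPy. It assumes a Lambertian BRDF (constant albedo) and a cosine-power form for the radial attenuation R; both are assumptions for the sake of the example, and the exact functional forms used in the calibration are those in the EM_Dataset-PhotometricCalibration repository.

```python
import numpy as np

def sls_radiance(X, P, D, mu, sigma0=1.0):
    """Spot Light Source radiance arriving at surface point X, Eq. (3).
    The radial attenuation R is modelled here as cos^mu of the angle between
    the principal direction D and the ray P->X (an assumption for this sketch)."""
    d = np.linalg.norm(X - P)                       # travelled distance
    L = (X - P) / d                                 # unit outgoing light direction
    S = 1.0 / d**2                                  # inverse-square fall-off
    R = max(np.dot(D / np.linalg.norm(D), L), 0.0) ** mu   # radial attenuation
    return sigma0 * R * S * L                       # vector along L, magnitude = radiance

def predicted_intensity(X, P, D, mu, n, albedo, g_t, gamma=2.2, sigma0=1.0):
    """Predicted image intensity at the projection of X, Eq. (4),
    assuming a Lambertian BRDF f_r = albedo / pi."""
    sigma = sls_radiance(X, P, D, mu, sigma0)
    L = (X - P) / np.linalg.norm(X - P)             # incoming light direction at X
    cos_theta = max(np.dot(n, -L), 0.0)             # foreshortening with surface normal n
    f_r = albedo / np.pi                            # Lambertian BRDF (assumption)
    return (np.linalg.norm(sigma) * f_r * cos_theta * g_t) ** (1.0 / gamma)
```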

Fig. 3
figure 3

Example of photometric calibration results.

Simulated colon

The VR-Caps7 simulator is used to generate photorealistic synthetic image sequences of a 3D colon model obtained from a Computed Tomography scan. Since this is a simulation, we have full access to the scene configuration: camera calibration, deformations, trajectory and illumination, and hence to the ground truth geometry, camera pose and 3D deforming scene. For the same endoscope trajectory, we generated different sequences with increasingly aggressive deformations to allow ablation studies with respect to the deformation magnitude. The applied deformations are described by the following equation:

$${V}_{y}^{t}={V}_{y}^{0}+A\,\sin (\omega t+{V}_{x}^{0}+{V}_{y}^{0}+{V}_{z}^{0}),$$
(5)

where \({V}_{x}^{0}\), \({V}_{y}^{0}\) and \({V}_{z}^{0}\) are the coordinates of the surface point at rest. The parameters A and ω control the magnitude and velocity of the deformations, corresponding to the maximum excursion and the oscillation frequency, respectively. We also modified the colon texture to increase its contrast. The code to create these simulated sequences is available in the repository EM_Dataset-Simulations (see Section Code availability, https://github.com/endomapper/EM_Dataset-Simulations).
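For clarity, Eq. (5) applied to a set of surface points can be sketched as follows in Python/NumPy. The (N, 3) vertex array is an assumption about the data layout made for this example, not the simulator's internal representation.

```python
import numpy as np

def deform_vertices(V0, A, omega, t):
    """Apply the sinusoidal deformation of Eq. (5) to the rest-pose points V0.
    V0 is an (N, 3) array of (x, y, z) coordinates at rest; only the y
    coordinate is displaced, with amplitude A and angular velocity omega."""
    V = V0.copy()
    phase = V0[:, 0] + V0[:, 1] + V0[:, 2]          # per-point phase from rest coordinates
    V[:, 1] = V0[:, 1] + A * np.sin(omega * t + phase)
    return V
```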

Meta-data

For a set of selected recordings, we provide several types of meta-data useful for many potential research lines, and in particular for VSLAM. This subsection presents a description of the meta-data and the annotation methodologies.

Text footage

The endoscopist performing each procedure provided a description of it, registered during the exploration. It includes the anatomical regions traversed, the interventions, medical findings such as the approximate polyp size, the tools used, or the sections with NBI (Narrow-Band Imaging) illumination. This description is made available as text footage synchronized with the corresponding videos. This metadata can be useful, for example, to identify the sections of the video that are most promising for VSLAM, such as re-observations of the same region or interactions with tools of known size.

Anatomical regions

Anatomical region recognition is useful for creating topological maps of the colon. These maps can be used to build smaller reconstructions with a lower probability of error. Some colonoscopy procedures were annotated by the medical staff of the project after the recording. Multiple careful viewings were necessary to delimit the ten anatomical regions shown in Fig. 4.

Fig. 4
figure 4

Illustration of the anatomical regions labeled.

Tools segmentation

Tool segmentation is one of the challenges for spatial AI in colonoscopies. Since tools occlude the view and cause failures in other algorithms, many works in the literature mask them out. Tools were manually segmented using the Odin CAT tool16, which allows maintaining a mask across frames, yielding more robust annotations.

COLMAP 3D reconstruction

Traditional SIFT-based rigid SfM algorithms are able to produce partial reconstructions from colonoscopy videos. We include some examples of the output of COLMAP9,17 processing on our sequences, which provides a first approximation of the up-to-scale camera trajectory and the scene’s sparse structure. This information can be organized to produce weak supervision in the form of sparse depth maps, local correspondences between frames, image-to-image labels (frames depicting the same place) or relative camera pose transformations between frames. Several computer vision tasks like depth prediction, image matching, image retrieval and visual localization can greatly benefit from this kind of supervision. Megadepth18 is a well-known dataset that uses this SfM procedure to obtain 3D point clouds, similar to ours. It has been extensively used for deep learning supervision10,19,20. Other works employed SfM to identify co-visible frames in the recordings, which has proven useful to train CNNs for place recognition in landmark images11 and in colonoscopy sequences21,22.

For our recordings, we apply exhaustive guided matching between all the images in the sequence to associate frames that are temporally distant. We use our camera calibration and do not optimize it during the COLMAP bundle adjustment. The minimum triangulation angle is relaxed to 8 degrees during the initialization of the models. The rest of the parameters are left at their defaults.
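A possible command-line sketch of this configuration, wrapped in Python for reproducibility, is shown below. The option names correspond to standard COLMAP options but should be checked against the installed COLMAP version; the authoritative scripts are those in the EM_Dataset-ColmapValidation repository, and the camera parameter string is a placeholder to be filled with the values from the corresponding Endoscope_XX_geometrical.xml.

```python
import subprocess

DB, IMAGES, OUT = "seq.db", "frames/", "sparse/"
# Kannala-Brandt intrinsics from the geometric calibration (placeholder string):
# COLMAP's OPENCV_FISHEYE model uses the same fx, fy, cx, cy, k1..k4 parameterization.
CAM = "fx,fy,cx,cy,k1,k2,k3,k4"

# Feature extraction with fixed, shared intrinsics.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES,
                "--ImageReader.single_camera", "1",
                "--ImageReader.camera_model", "OPENCV_FISHEYE",
                "--ImageReader.camera_params", CAM], check=True)

# Exhaustive matching with guided matching, to associate temporally distant frames.
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", DB,
                "--SiftMatching.guided_matching", "1"], check=True)

# Mapping: intrinsics are not refined and the initial triangulation angle is relaxed to 8 degrees.
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES, "--output_path", OUT,
                "--Mapper.ba_refine_focal_length", "0",
                "--Mapper.ba_refine_principal_point", "0",
                "--Mapper.ba_refine_extra_params", "0",
                "--Mapper.init_min_tri_angle", "8"], check=True)
```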

Recordings from the same patient

One of the main obstacles in colon reconstruction is consistency between colonoscopies in longitudinal studies. Thanks to the colorectal cancer screening program, colonoscopy pairs from the same patient were recorded. This helps to evaluate the lifelong mapping capabilities of the developed VSLAM algorithms.

Data Records

This section describes the dataset structure and details of the available meta-data. The dataset is available on the Synapse platform3 and is subject to access controls (see Section Usage Notes). A summary of the dataset structure can be seen in Fig. 5. At publication time, there are a total of 96 real sequences, with durations ranging from less than ten minutes to more than half an hour. The file DatasetSummary.xls in the dataset main folder includes a summary of the acquisition details of each sequence in the dataset.

Fig. 5
figure 5

Directory structure of the dataset.

Video recordings

Data is stored in the directory Sequences. Each procedure has a corresponding directory Seq_XXX (XXX is the sequence number) that contains:

  1. 1.

    The directory meta-data, which contains all the meta-data files associated with the sequence. These files are described in the next section.

  2. 2.

    The video Seq_XXX.mov, containing the actual recording. The video codec is H26423, a lossy compression using the High 4:4:4 profile, level 4.2, and a bit rate of 7 Mbps. It offers a good size vs. quality trade-off for lossy compression.

  3. 3.

    The thumbnail version, Seq_XXX_thumbnail.webm, which contains a compressed version of the recording for easy and quick visualization. This version uses the free libvpx codec24, at 320 × 240 resolution.

  4. 4.

    A subtitle file, Seq_XXX.srt, if the video has text footage in the form of text subtitles.

  5. 5.

    The metadata file, Seq_XXX_info.json, where the sequence number, endoscope number and type of metadata of the procedure are stored.

Additionally, the folder Lossless_sequences contains the lossless versions of the videos. This format uses the FFV1 codec (version 3) with a bit rate of 310 Mbps.

Camera calibration

All the calibration information is included in the directory Calibrations. There is a directory Endoscope_XX (XX is the endoscope number) for each endoscope that contains:

  1. 1.

    The calibration video Endoscope_XX.mov. This is the lossy H264 version. The lossless version can be found in the lossless folder mentioned before.

  2. 2.

    The Geometric calibration parameters Endoscope_XX_geometrical.xml.

  3. 3.

    The Photometric Calibration parameters Endoscope_XX_photometrical.xml.

Geometric calibration

The file Endoscope_XX_geometrical.xml is the output calibration from Vicalib12. This XML file contains the intrinsic parameters of the camera (fx, fy, Cx, Cy, k1, k2, k3, k4) following the Vicalib output format.

Photometric calibration

The photometric calibration file, Endoscope_XX_photometrical.xml, contains the calibrated parameters of the light source and the camera of the endoscope. An endoscope’s <rig> may have one or more <camera> tags, associated with one or more <light> sources. Currently, only a single camera and a single virtual light are supported.

Each camera tag has a particular <camera_model>. This model has a single parameter, the value of the gamma γ response function in Eq. (4). Regarding the light source, the <light_model> has four parameters: the intensity value σ0, the light spread factor μ and two vectors for the light centre P and the principal direction D.

Simulated colon

All the data related to the simulated colon is included in the directory Simulated Sequences. There is a directory seq_X (X is the sequence number) for each sequence obtained from the simulation. The directory contains the following items (a minimal loading sketch follows the list):

  1. 1.

    The directory rgb with the RGB images of the sequence in png format.

  2. 2.

    The directory depth with the depth images for each RGB image of the sequence stored in exr format.

  3. 3.

    A file rgb.txt with a list of file names of all RGB images of the sequence.

  4. 4.

    A file depth.txt with a list of the file names of all depth images of the sequence.

  5. 5.

    A file trajectory.csv containing the ground truth camera trajectory.

  6. 6.

    A file calibration.txt containing the simulated camera calibration.

  7. 7.

    A file info.txt containing the deformations applied, their parameters and units.
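The sketch below loads the RGB/depth pairs of a simulated sequence with OpenCV. It assumes that rgb.txt and depth.txt list one file name per line, relative to the rgb and depth directories and in corresponding order, and that the installed OpenCV build supports OpenEXR (recent builds require the environment variable below to be set before importing cv2); these details are assumptions for the example.

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"   # enable EXR reading before importing cv2
import cv2

def load_simulated_sequence(seq_dir):
    """Yield (rgb, depth) pairs for a simulated sequence directory seq_X."""
    with open(os.path.join(seq_dir, "rgb.txt")) as f:
        rgb_files = [line.strip() for line in f if line.strip()]
    with open(os.path.join(seq_dir, "depth.txt")) as f:
        depth_files = [line.strip() for line in f if line.strip()]
    for rgb_name, depth_name in zip(rgb_files, depth_files):
        rgb = cv2.imread(os.path.join(seq_dir, "rgb", rgb_name), cv2.IMREAD_COLOR)
        depth = cv2.imread(os.path.join(seq_dir, "depth", depth_name),
                           cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)  # float EXR depth map
        yield rgb, depth
```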

Meta-data

This section contains the details and formats for each type of meta-data. The file DatasetSummary.xls details the availability of the metadata in each sequence of the dataset.

Text footage

Two files, Seq_XXX.json and Seq_XXX.srt, are included in the root and meta-data folders. The .json file contains a structure with the timestamp and the associated text. The text footage is also included in .srt format to ease visualization synchronized with the video. The references to identify the tools used during the procedure are stored in the meta-data directory.

Anatomical regions

Table 2 shows the detailed number of frames labelled for each region in each video. The dataset contains this information in a file named Anatomical_Regions_XXX.txt with the format Frame###;region label; on each line.
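A minimal parsing sketch for this per-line format is shown below; it assumes the first field is the literal prefix "Frame" followed by the frame number, which is our reading of the format described above.

```python
def load_anatomical_regions(path):
    """Parse Anatomical_Regions_XXX.txt lines of the form 'Frame###;region label;'
    into a dict mapping frame number -> region label."""
    regions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            frame_field, label = line.split(";")[:2]      # 'Frame###' and 'region label'
            frame = int(frame_field.replace("Frame", ""))
            regions[frame] = label.strip()
    return regions
```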

Table 2 Summary of the anatomical sections per video and label.

Tool segmentation

There are 4086 frames with tools segmented across four different colonoscopies, as detailed in Table 3. The segmentations for each video can be found in the file tool_segmentation_XXX.xml. This file contains, for each segmented frame, the id of the frame and a list of 2D point coordinates that define the tool segmentation as a binary polygon. The segmentation was done using the proprietary Odin CAT tool16. Some examples can be seen in Fig. 6.
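As an illustration, once the polygon vertices of a frame have been read from the XML (the exact tag names are not reproduced here), a binary mask can be rasterized as in the sketch below; the default 1440 × 1080 size matches the recording resolution.

```python
import numpy as np
import cv2

def polygon_to_mask(points, width=1440, height=1080):
    """Rasterize a tool polygon, given as a list of (x, y) pixel coordinates,
    into a binary mask at the recording resolution."""
    mask = np.zeros((height, width), dtype=np.uint8)
    polygon = np.array(points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [polygon], 255)
    return mask > 0
```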

Table 3 Summary of the frames with tool segmentation.
Fig. 6
figure 6

Examples for the tool segmentation mask in Seq_009.

COLMAP 3D reconstruction

Table 4 summarizes the reconstruction results for the Endomapper sequences. The reconstructions are stored following the text format of COLMAP (https://colmap.github.io/format.html). We provide text files listing the images contained in each cluster reconstructed by COLMAP, as well as the images that COLMAP considered covisible, i.e., images that have at least one 3D point in common. Figure 7 shows two examples of these reconstructed clusters.
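The covisibility relation can also be recovered directly from the COLMAP text export. The sketch below reads points3D.txt, whose documented format lists, for each 3D point, a track of (IMAGE_ID, POINT2D_IDX) pairs after the first eight fields, and counts shared points per image pair; it is an illustrative helper, not the script used to generate the provided files.

```python
from collections import defaultdict
from itertools import combinations

def covisible_pairs(points3D_txt):
    """Count, for every image pair, the number of 3D points observed by both,
    using COLMAP's points3D.txt text export."""
    counts = defaultdict(int)
    with open(points3D_txt) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            elems = line.split()
            track = elems[8:]                          # (IMAGE_ID, POINT2D_IDX) pairs
            image_ids = sorted({int(i) for i in track[0::2]})
            for a, b in combinations(image_ids, 2):
                counts[(a, b)] += 1
    return counts   # pairs with count >= 1 are covisible
```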

Table 4 Summary of COLMAP 3D reconstruction.
Fig. 7
figure 7

Two clusters from the COLMAP reconstruction after processing Seq_001. For each cluster, we show a 3D view of the frame poses and colon map points, and five RGB images summarizing the cluster frames.

Same patient recordings

A file SamePatient.json is stored in the root folder, listing which sequences come from the same patient and the time separating the two sequences.

Technical Validation

Firstly, we detail an error analysis of the calibration and a comparison of the calibration parameters among endoscopes. Secondly, we test state-of-the-art SfM and VSLAM algorithms on typical colonoscopy sequences. Finally, the anatomical region and tool segmentation labels are validated with state-of-the-art recognition algorithms. All the code used in this section is publicly available in the Endomapper repositories https://github.com/Endomapper.

Calibration validation

Geometric calibration

The software used to compute the geometric calibration and to obtain the validation and comparisons shown in this section is available in the repository EM_Dataset-GeometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-GeometricCalibration). The geometric calibration was computed from the calibration videos using the Vicalib12 tool, tuning the parameters for each endoscope calibration separately. The parameters are detailed in the repository.

To compare the different calibrations visually, we have undistorted a grid using each calibration. Figure 8 shows the differences between the endoscopes, the 10 colonoscopes and 8 gastroscopes respectively. The results show that the calibrations are equivalent around the centre of the images and that differences between them are significant only at the image borders.

Fig. 8
figure 8

Comparison of how a regular pixel grid is undistorted by the calibration of each endoscope. Colonoscopes and gastroscopes are separated for easier visualization.

To further analyze the calibration results, we analyzed the reprojection error. For all the calibrations, the RMSE is between 0.3 and 0.4 pixels. We have selected two endoscopes, one colonoscope (Endoscope_06) and one gastroscope (Endoscope_18), as prototypes. Figure 9 displays the inlier reprojection error distribution for the selected endoscopes. Here we can see that the inlier error is uniformly distributed across the image, and that only at the image boundaries are there fewer measurements. The calibrations are thus expected to be very accurate in general, with the least accurate areas being those closest to the borders.

Fig. 9
figure 9

Distribution of errors over the images for the prototype calibrations. The lines representing the errors are not magnified; observe that most of them appear as points, since the errors are mostly smaller than one pixel.

The projection function that relates the incoming ray angle θ with the distorted radius rd is plotted in Fig. 10. Here we can see that both types of endoscopes are almost identical and very close to an orthogonal-projection fisheye lens13. Finally, in Fig. 11 we show the view angle of both prototype endoscopes. There we can see that the gastroscope has a lower view angle than the colonoscope at the edges, which is why each type of endoscope needs to be calibrated separately.

Fig. 10
figure 10

Relation between the incoming ray angle θ with the distorted radius rd. The dotted curves represent the ideal orthogonal and equisolid projection models. The right image is a zoom of the curves to show the small differences between the colonoscope and the gastroscope.

Fig. 11
figure 11

View angles plotted on top of calibration images from each prototype endoscope. The iso-lines are plotted in 20° intervals.

With this analysis we conclude that the individual calibrations obtained from the videos are accurate: the RMSE is low, the measurements cover most of the image, and the results are consistent across endoscopes. We believe that accurate calibration information boosts the performance of geometric methods.

Photometric calibration

The software used to compute the photometric calibration and to obtain the validation shown in this section is available in the repository EM_Dataset-PhotometricCalibration (see Section Code availability, https://github.com/endomapper/EM_Dataset-PhotometricCalibration). The photometric calibration was computed from the calibration videos for each endoscope separately.

We selected 38 frames per sequence, looking for a variety of distances from the camera to the calibration pattern. On each frame, we consider a 120° field of view. The centre of the virtual light converges about 4 mm behind the tip of the endoscope, making it possible to model all real lights with a single beam. The gamma value is experimentally fixed to γ = 2.2, which is also a broadly used value. The endoscope applies a continuous gain control, progressively increasing or decreasing the gain. The relative auto-gain is estimated with respect to the first image of the sequence. Consequently, the σ0 value is unobservable and is fixed to one.

The resulting models are validated on a different set of images of the Vicalib pattern. The photometric errors in Table 5 show the validation results for eight colonoscopes and eight gastroscopes. For the colonoscopes, the calibration is able to estimate the pixel intensities of the images with an RMSE of 2.9 grey levels. In the gastroscopes, the lights are not symmetrically placed on the tip of the endoscope. Consequently, the RMSE increases slightly, up to 3.3 grey levels.

Table 5 Summary of photometric calibration errors.

SfM/SLAM validation

COLMAP validation

COLMAP is able to estimate sparse reconstructions for different sections along a sequence; see some examples in Fig. 7. As can be seen, the 3D point cloud and the camera trajectory look reasonable, showing a tubular shape with cameras traversing it. The covisibility information is always accurate, as the geometrical checks in COLMAP prevent frames that do not observe the same place from being incorrectly reconstructed in the same model.

Covisibility information was exploited in22, where a CNN was trained with COLMAP reconstructions from our sequences for the task of image retrieval. The system22 is able to recognize frames observing the same place in the colon, even when the frames come from different colonoscopies of the same patient. Figure 12 shows some examples of successful retrieval between two colonoscopies of the same patient performed within two weeks of each other.

Fig. 12
figure 12

Examples of successful retrieval from different sequences of the same patient. The left column contains the queries from the current sequence (Seq_035) while the rest of the columns are the first three retrieved images from the previous sequence (Seq_027).

The 3D reconstructions look reasonably accurate and can be of great help as weak supervision for training depth, camera pose or image retrieval networks. The software used to compute the reconstructions shown in this section is available in the repository EM_Dataset-ColmapValidation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ColmapValidation).

ORB-SLAM validation

ORB-SLAM31 is the reference system for sparse real-time visual SLAM. We have used it to process the whole Seq_015 video, to build the map and estimate the endoscope pose. To achieve real-time performance, the image size is reduced from 1440 × 1080 to 720 × 540 and one out of every two frames is skipped. To address non-rigidity, the reprojection error acceptance threshold has been increased by a factor of 2 with respect to its default value for rigid scenes, which helps prevent tracking losses. The Kannala-Brandt camera model has proven essential to extract and triangulate features close to the borders of the image, where distortion is significant. As argued, an accurate calibration enables the use of the whole image for geometric computation, boosting accuracy and robustness.

ORB-SLAM3 has been able to estimate 133 sub-maps of small size (see Table 6 for a summary of the sub-map sizes). The camera has been localized successfully with respect to a map in 25% of the frames. The time between video frames is 40 ms, and ORB-SLAM3 is able to run in real time, with an average tracking time of 23 ms and a maximum of 37 ms. Figure 13 shows a sub-map where the camera undergoes a forward-backward motion. The map contains 54 keyframes, 3682 points and 349 frames.

Table 6 Summary of the size of the 133 sub-maps obtained after processing Seq_015.
Fig. 13
figure 13

ORB-SLAM3 sub-map in Seq_015 between frames 54420 and 55170. The camera undergoes a forward-backward motion. Right: 3D map in top view, keyframes in blue, map points in red. Left: images corresponding to 4 keyframes spread over the trajectory.

From this analysis we conclude that our EndoMapper dataset presents the challenges of real endoscopy exploration, such as scene deformation, frequent occlusions, changes in lighting, and clutter due to cleaning water or tools, which eventually result in frequent tracking losses. Classical discrete-feature VSLAM methods like ORB-SLAM3 can run on these videos in real time, localizing the camera in 25% of the frames. However, the scene model is fragmented into a myriad of very small rigid sub-maps. The clear challenge is to develop multi-map techniques for endoscopy able to merge sub-maps with common areas.

The software and the detailed tuning used to compute the ORB-SLAM3 reconstructions shown in this section are available in the repository EM_Dataset-ORBSLAM3Validation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ORBSLAM3Validation).

Anatomical region validation

The anatomical region labels have been validated by fine-tuning different models for anatomical region recognition. The software used to evaluate anatomical region recognition is available in the repository EM_Dataset-AnatomicalRegions (see Section Code availability, https://github.com/endomapper/EM_Dataset-AnatomicalRegions). Following works in medical imaging25 and datasets26, we fine-tuned four different CNNs that are known to perform well on medical image classification: EfficientNet V227, MobileNetV228, DenseNet29 and ResNetV230. With MobileNet and EfficientNet, we look for models requiring low computational resources; DenseNet and ResNet were chosen for their performance in image classification. For the fine-tuning, we train the models for 100 epochs following the learning rate decay in Thanh et al.27 and use 4 sequences (Seq_003, Seq_011, Seq_013, Seq_093) for training and Seq_094 for testing. Seq_094 was chosen as the test sequence because it has the best balance between classes. The metrics used to evaluate this experiment are Top-1 and Top-3 accuracy, defined as the accuracy for which the true class matches the most probable prediction and any of the 3 most probable predictions, respectively. Table 7 presents the results for the anatomical region recognition.
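For reference, the Top-1 and Top-3 metrics defined above can be computed from the per-frame class scores as in the NumPy sketch below; it is illustrative and not the exact evaluation code in the repository.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """scores: (N, C) array of class scores; labels: (N,) true class indices.
    Returns the fraction of frames whose true class is among the k most probable."""
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k highest scores per frame
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Top-1 and Top-3 as reported in Table 7:
# top1 = top_k_accuracy(scores, labels, k=1)
# top3 = top_k_accuracy(scores, labels, k=3)
```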

Table 7 Top-1 and Top-3 accuracy of the anatomical region recognition models.

ResNet performs better in Top-1 accuracy and similarly in Top-3, showing that this model is the best overall. It is also interesting to note that MobileNet obtains close performance with a smaller computational footprint, making it interesting for real-time systems. It is also worth remarking that the Top-1 accuracy is low in comparison to other computer vision tasks, suggesting that anatomical region classification is a challenging research problem. Indeed, the differences between regions are very subtle, even for the trained eye. In any case, the results on Top-3 accuracy are promising and show that the EndoMapper data is a relevant tool to advance performance on this problem.

Tools segmentation validation

The tool segmentation labels have been validated by training and testing several models for binary tool segmentation, as proposed in Tomasini et al.31. This work compares the performance of various models on three different datasets, including the EndoMapper dataset labels. All the models were trained from scratch on the EndoVis17 dataset and fine-tuned on Kvasir-Inst and EndoMapper. The resulting performance can be seen in Table 8. Examples of binary segmentation of images from the Endomapper dataset obtained using the different models can be seen in Fig. 14.

Table 8 Binary segmentation results (mIoU) for models pre-trained on EndoVis17 and fine-tuned for each of the target datasets (Kvasir and Endomapper).
Fig. 14
figure 14

Binary Segmentation examples from Endomapper dataset using different approaches fine-tuned on Endomapper dataset: (a) Original Image (b) Ground-truth manual segmentation (c) MiniNet (d) UNet (e) LinkNet (f) MF-TAPNet.

The lower mIoU of all models on our EndoMapper dataset compared to the other datasets highlights the challenge posed by the EndoMapper tool segmentation labels. It is interesting to note that the efficient MiniNetV2 reaches a performance similar to state-of-the-art models on all datasets while requiring less memory and inference time. The software used to evaluate the tool segmentation models is available in the repository EM_Dataset-ToolSegmentation (see Section Code availability, https://github.com/endomapper/EM_Dataset-ToolSegmentation).

Usage Notes

The dataset is available on the Synapse platform3. The dataset can be publicly accessed under the following conditions: 1) Use is limited to research on how to obtain relevant medical information from images or video. 2) Redistribution of the data is not allowed. 3) Access requires a Statement of Intended Use, which includes a description of how you intend to use the data. 4) You further agree to cite the DOI of the collection and the publication in any publication resulting from this content as follows: a) Azagra, P. et al. Endomapper dataset of complete calibrated endoscopy procedures. https://doi.org/10.7303/syn26707219 (2022). Synapse. b) Azagra, P. et al. EndoMapper dataset of complete calibrated endoscopy procedures. Scientific Data. 5) Images from the collection can be included in the citing scientific publications. 6) Video segments can be used to produce multimedia material in the citing scientific publications.