Background & Summary

Following technological advancements in surgery, operating rooms are evolving into intelligent environments. Context-aware systems (CAS) are emerging as pivotal components of this evolution, with the potential to advance pre-operative surgical planning1,2,3, automate skill assessment4,5,6,7,8, support operating room planning9,10,11, and interpret the surgical context comprehensively12. By enabling real-time alerts and offering decision support, these systems prove invaluable, especially, but not only, for less-experienced surgeons. Their capabilities extend to the automatic analysis of surgical videos, encompassing functions such as indexing, documentation, and the generation of post-operative reports13. The ever-increasing demand for such automatic systems has sparked machine-learning-based approaches to surgical video analysis.

Cataract surgery, the most commonly conducted ophthalmic surgical procedure and one of the most demanding surgeries worldwide, is a major operation where deep learning can be of great benefit. Cataract, characterized by the opacification of the eye’s natural lens, is often attributed to aging and leads to impaired visual acuity, reduced brightness, visual distortion, double vision, and degraded color perception. Globally, cataracts are the primary cause of blindness14. Owing to the aging demographic and increased lifespans, the World Health Organization forecasts a surge in cataract-related blindness, estimating the number of cases to reach 40 million by the year 202514. This prevalent disease can be remedied through cataract surgery, in which the eye’s natural lens is replaced with a synthetic counterpart known as an intraocular lens (IOL). Advancements in technology have driven the evolution of cataract surgery techniques, from intracapsular cataract extraction (ICCE) in the 1960s and 1970s to extracapsular cataract extraction (ECCE) in the 1980s and 1990s. Today, the primary method is sutureless small-incision phacoemulsification surgery with injectable IOL implantation. Throughout this paper, the term “Cataract Surgery” is synonymous with “Phacoemulsification Cataract Surgery”.

Due to the widespread occurrence of cataract surgery and its substantial influence on patients’ quality of life, significant attention has been directed towards the analysis of cataract surgery content using deep learning methodologies over the past decade. In particular, surgical phase recognition and scene segmentation are core building blocks of various applications related to cataract surgery video analysis13. These applications include, but are not limited to, relevance detection15, relevance-based compression16, irregularity detection17,18, and surgical outcome prediction. The current public datasets for cataract surgery either provide annotations for a particular sub-task, such as instrument recognition19 or scene and relevant anatomical structure segmentation15,20,21,22,23, or offer small multi-task datasets targeting specific problems such as IOL irregularity detection17. As a result of the lack of a comprehensive dataset, there exists a considerable gap in exploring deep-learning-based approaches and frameworks to enhance cataract surgery outcomes. To facilitate the development of such systems and models, there is a compelling need for large-scale datasets that encompass multi-task annotations.

This paper introduces the largest cataract surgery video dataset to date, comprising 1000 videos of cataract surgery recorded at Klinikum Klagenfurt, Austria, between 2021 and 2023. We provide large-scale ground-truth annotations for the semantic segmentation of different instruments and relevant anatomical structures, as well as for surgical phases. In addition, the dataset features two subsets covering major irregularities that affect the surgical workflow: intraocular lens (IOL) rotation and pupil contraction. Together, these 1000 videos, annotated subsets, and irregularity subsets form a comprehensive resource to empower computer-assisted interventions (CAI) in cataract surgery.

Methods

This work was performed under ethics committee approval (EK 28/17) from the Ethikkommission Kärnten24. All patients gave written consent to the video recording and open publication.

Dataset acquisition

Cataract surgery is performed using a binocular microscope, which provides a magnified, illuminated, three-dimensional view that ensures precise observation of the patient’s eye. The surgeon manually adjusts the microscope’s focus to optimize visual clarity during the procedure. Additionally, a camera mounted within the microscope captures and archives the entire surgical process, facilitating subsequent analysis for various post-operative purposes.

Cataract-1K dataset description

The Cataract-1K dataset consists of 1000 videos of cataract surgeries conducted in the eye clinic of Klinikum Klagenfurt from 2021 to 2023. The videos are recorded using a MediLive Trio Eye device mounted on a ZEISS OPMI Vario microscope. The dataset comprises surgeries performed by surgeons whose cumulative counts of completed surgeries range from 1,000 to over 40,000 procedures. On average, the videos have a duration of 7.12 minutes, with a standard deviation of 200 seconds. In addition to this large-scale dataset, we provide surgical phase annotations for 56 regular videos and pixel-level annotations of relevant anatomical structures and instruments for 2256 frames sampled from 30 cataract surgery videos. Furthermore, we provide a small subset of surgeries with two major irregularities, “pupil reaction” and “IOL rotation,” to support further research on irregularity detection in cataract surgery. Apart from the annotated videos and images, the videos in the Cataract-1K dataset are encoded with a temporal resolution of 25 fps and a spatial resolution of 512 × 324 pixels. Table 1 compares the annotated subsets of the Cataract-1K dataset with currently existing datasets for semantic segmentation and phase recognition in cataract surgery. We delineate the challenges and annotation procedures for each subset in the following paragraphs.

Table 1 Comparison of annotated subsets in the Cataract-1K dataset with existing datasets for semantic segmentation and phase recognition in cataract surgery.

Phase recognition dataset

Crafting an approach that detects and classifies the significant phases within these videos, while accounting for frame-by-frame temporal details, presents considerable challenges due to several factors:

  • As shown in Fig. 1, phase recognition datasets for cataract surgery are extremely imbalanced, as the longest phase (phacoemulsification) and the shortest phase (incision) cover 28.72% and 2.1% of the annotations, respectively.

    Fig. 1
    figure 1

    Total duration of the annotated phases in the 56 annotated cataract surgery videos (in seconds).

  • Videos may exhibit defocus blur stemming from manual camera focus adjustments25.

  • Unintentional eye movements and rapid instrument motions close to the camera result in motion blur, impairing distinctive spatial details.

  • As illustrated in Fig. 2, instruments, which play a fundamental role in distinguishing between relevant phases, share a substantial resemblance in certain phases, leading to small inter-class variation for a trained classification model.

    Fig. 2
    figure 2

    Sample frames from different phases in a regular cataract surgery.

  • The lack of metadata in the stored videos precludes the use of additional contextual information.

  • Variations in the visual appearance of patients’ eyes create substantial inter-video distribution disparities, demanding ample training data to build networks with generalizable performance.

As shown in Fig. 2, regular cataract surgery can include twelve action phases: incision, viscoelastic, capsulorhexis, hydrodissection, phacoemulsification, irrigation-aspiration, capsule polishing, lens implantation, lens positioning, viscoelastic-suction, anterior-chamber flushing, and tonifying/antibiotics. The idle phases refer to the time spans within or between phases during which the surgeon mainly changes instruments and no instrument is visible in the frames. We provide a large annotated dataset to enable comprehensive studies on deep-learning-based phase recognition in cataract surgery videos. Table 2 visualizes the phase annotations of 56 regular cataract surgery videos, which have a spatial resolution of 1024 × 768, a temporal resolution of 30 fps, and an average duration of 6.45 minutes with a standard deviation of 2.04 minutes. This dataset comprises patients with an average age of 75 years, ranging from 51 to 93 years, with a standard deviation of 8.69 years. The videos in the phase recognition dataset correspond to surgeries performed by surgeons with an average experience of 8929 surgeries and a standard deviation of 6350 surgeries.

Table 2 Visualizations of phase annotations for 56 regular cataract surgeries. The video durations differ and are normalized for better visualization.

Semantic segmentation dataset

Figure 3 visualizes pixel-level annotations for relevant anatomical structures and instruments. As illustrated in Fig. 3, semantic segmentation in cataract surgery videos poses the following challenges:

  • Variations in the color, shape, size, and texture of the pupil.

  • Transparency and deformations of the artificial lens.

  • Smooth edges and color variations of the iris.

  • Occlusion, motion blur, reflections, and partial visibility of instruments.

  • Visual similarities between different instruments in the case of multi-class instrument segmentation.

Fig. 3
figure 3

Visualization of pixel-based annotations corresponding to relevant anatomical structures and instruments in cataract surgery and the challenges associated with different objects.

The semantic segmentation dataset includes frames from 30 regular cataract surgery videos with a spatial resolution of 1024 × 768 and an average duration of 6.52 minutes with a standard deviation of two minutes. Frames are extracted at a rate of one frame per five seconds. Subsequently, frames featuring severe motion blur or an out-of-scene iris are excluded from the dataset. We provide pixel-level annotations for three relevant anatomical structures, namely the iris, pupil, and intraocular lens, as well as for nine instruments used in regular cataract surgeries: slit/incision knife, gauge, spatula, capsulorhexis cystotome, phacoemulsifier tip, irrigation-aspiration, lens injector, capsulorhexis forceps, and katana forceps. All annotations are performed using polygons in the Supervisely platform and exported as JSON files. The patients in this dataset have an average age of 74.5 years, ranging from 51 to 90 years, with a standard deviation of 8.43 years. Additionally, the videos in the semantic segmentation dataset depict surgeries conducted by surgeons whose experience averages 8033 surgeries, with a standard deviation of 3894 surgeries. The provided dataset enables a reliable study of segmentation performance for relevant anatomical structures, binary instruments, and multi-class instruments.
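
For illustration, frames can be sampled at this rate with OpenCV as sketched below; this is a minimal example under assumed conventions, not the exact extraction script used to build the dataset.

```python
# Minimal OpenCV sketch: save one frame every five seconds of a surgery video.
import os
import cv2

def extract_frames(video_path, out_dir, interval_s=5.0):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps * interval_s)), 1)   # frames between two saved samples
    idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
        idx += 1
    cap.release()
```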

Irregularity detection dataset

This dataset contains two small subsets covering major intra-operative irregularities in cataract surgery: pupil reaction18 and lens rotation17.

  • Pupil Contraction: During the phacoemulsification phase, in which the opacified natural lens is fragmented and aspirated, there is a heightened risk of damaging the delicate iris. Even very subtle trauma to the tissue can lead to undesirable pupil constriction26. Such sudden changes in pupil size can have serious intra-operative implications. In particular, during the phacoemulsification phase, where the instrument is inserted deep inside the eye, sudden changes in pupil size may injure the eye’s delicate tissues. Besides, achieving precise IOL alignment or centration becomes challenging in cases of intraoperative pupil contraction (miosis). Particularly for multifocal IOLs, minor displacements or tilts, which might be negligible for conventional mono-focal IOLs, can significantly compromise visual performance. In the case of toric IOLs, precise alignment of the torus is crucial, as any deviation diminishes the IOL’s effectiveness. Detecting unusual pupil reactions and severe pupil contractions during surgery can substantially improve the overall outcome of cataract surgery and provide important insights for further post-operative investigations. Figure 4 (top) shows an example of severe pupil contraction during cataract surgery. Pupil contraction can be automatically detected via accurate segmentation of the pupil and cornea and tracking of the relative area of the pupil over time18 (a minimal sketch of this idea follows this list).

    Fig. 4
    figure 4

    Intra-operative irregularities in cataract surgery.

  • IOL Rotation: Although the IOL is aligned and centered at the conclusion of surgery, it may rotate or dislocate afterward. Even slight deviations, such as minor misalignments of the torus in toric IOLs or slight displacement and tilting of multifocal IOLs, can result in significant visual distortions and leave patients dissatisfied. The sole way to address this post-operative complication is follow-up surgery, which entails added costs, heightened surgical risks, and patient discomfort. Identifying intra-operative indicators for predicting and preventing post-surgical IOL dislocation is therefore an unmet clinical need. It is argued that intra-operative rotation of the IOL during cataract surgery is the leading cause of post-operative misalignment27. Hence, automatic detection and measurement of intra-operative lens rotation can effectively contribute to preventing post-operative IOL dislocation. Figure 4 (bottom) shows fast clockwise rotation of the IOL during unfolding, which occurs in less than seven seconds. While intra-operative IOL rotation is a serious irregularity, its occurrence in cataract surgery videos is relatively infrequent. Consequently, conventional classification techniques designed to discriminate videos exhibiting IOL rotation struggle with the considerable class imbalance present in the training data. Moreover, accurate computation of lens rotation necessitates more intricate frameworks. In our extensive investigation into predicting post-operative IOL dislocation, we have introduced, implemented, and assessed a robust framework for precisely calculating IOL rotation. This framework incorporates phase recognition, semantic segmentation, and object localization networks to measure the sum of absolute rotations of the IOL after lens unfolding28.
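
The following minimal sketch illustrates the pupil-contraction tracking idea referenced in the first item above: given per-frame binary segmentation masks of the pupil and cornea, it computes the relative pupil area over time and flags sharp drops. The smoothing window and contraction threshold are illustrative assumptions, not values from the cited work.

```python
# Illustrative sketch (assumed parameters, not the exact method of ref. 18):
# track the pupil area relative to the cornea area and flag sharp drops.
import numpy as np

def relative_pupil_area(pupil_masks, cornea_masks):
    """pupil_masks, cornea_masks: sequences of binary (H, W) masks, one per frame."""
    ratios = []
    for pupil, cornea in zip(pupil_masks, cornea_masks):
        denom = max(int(cornea.sum()), 1)      # guard against empty cornea masks
        ratios.append(float(pupil.sum()) / denom)
    return np.asarray(ratios)

def detect_contraction(ratios, fps=25, window_s=2.0, drop_factor=0.7):
    """Return frame indices where the smoothed relative area falls below
    drop_factor times an early-surgery baseline (both values are assumptions)."""
    win = max(int(fps * window_s), 1)
    smoothed = np.convolve(ratios, np.ones(win) / win, mode="same")
    baseline = float(np.median(smoothed[:win]))
    return np.where(smoothed < drop_factor * baseline)[0]
```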

Experimental settings for phase recognition

Network architectures

We adopt a combined CNN-RNN framework for phase recognition. The CNN component, serving as the backbone model, is responsible for extracting distinctive features from the individual frames of a video sequence. To this end, two pre-trained CNN architectures, VGG16 and ResNet50, are employed. The output feature map of the CNN is fed into a recurrent neural network (RNN), which captures temporal features from the input video clip. We compare the performance of four RNN architectures: LSTM, GRU, BiLSTM, and BiGRU.
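
As an illustration, the following tf.keras sketch shows one possible realization of such a CNN-RNN model with a ResNet50 backbone and a bidirectional GRU; layer sizes follow the training settings described below, while the hidden-layer activation and preprocessing details are assumptions rather than the authors’ exact code.

```python
# A minimal sketch of a CNN-RNN phase-recognition model (not the released code).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_phase_model(seq_len=10, img_size=224, rnn_units=64, num_classes=2):
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg")
    # Part of the backbone is kept frozen during training (see training settings).
    for layer in backbone.layers[-4:]:
        layer.trainable = False

    inputs = layers.Input(shape=(seq_len, img_size, img_size, 3))
    feats = layers.TimeDistributed(backbone)(inputs)         # (batch, seq_len, 2048)
    temporal = layers.Bidirectional(layers.GRU(rnn_units))(feats)
    temporal = layers.Dropout(0.5)(temporal)
    hidden = layers.Dense(64, activation="relu")(temporal)   # activation is an assumption
    outputs = layers.Dense(num_classes, activation="softmax")(hidden)
    return models.Model(inputs, outputs)

model = build_phase_model()
# With two classes, categorical cross-entropy matches the binary cross-entropy
# objective described in the training settings.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
```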

Training settings

We adopt a one-versus-rest strategy to evaluate phase recognition performance15,29. Accordingly, we segment all videos corresponding to each phase into three-second clips with an overlap of one second. Afterward, the entire dataset is split into two categories: the designated target phase and the remaining phases (the “rest” class). We apply offline augmentations to the videos across all categories. Typically, the number of clips in the target category is significantly lower than in the rest category. To rectify this imbalance, we employ a random selection process from the “rest” category, aligning it with the clip count in the target category. This strategy ensures an equal number of clips in both classes. The employed augmentations include gamma and contrast adjustments with a factor of 0.5, Gaussian blur with a sigma of 10, random rotation up to 20 degrees, brightness within a range of [−0.3, 0.3], and saturation within a range of [0.5, 1.5]. To maximize diversity within our training set, we employ a random sampling strategy during training. Specifically, we configure the network’s input sequence to comprise 10 frames randomly selected from the 90 frames within each three-second clip. In all settings, the backbone network employed for feature extraction is pre-trained on the ImageNet dataset. The RNN component is constructed with a single recurrent layer comprising 64 units. This is followed by a dense layer with 64 units and, finally, a two-unit layer with a Softmax activation function. To mitigate the risk of overfitting, the last four layers of the CNN component are kept frozen during training, and dropout regularization with a rate of 0.5 is applied to the output feature map of the recurrent layer. All models are trained on 32 videos and tested on non-overlapping clips from the remaining videos. We use a binary cross-entropy loss, the Adam optimizer, a learning rate of 0.001, and a batch size of 16. The network’s input dimensions are set to 224 × 224. We compare the performance of the trained models using accuracy and F1 score.
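
The clip-construction logic described above can be sketched as follows; the frame-indexing conventions and helper names are illustrative assumptions, not the released preprocessing code.

```python
# Sketch of one-vs-rest clip construction: 3-second clips with 1-second overlap,
# 10 randomly sampled frames per clip, and random subsampling of the "rest" class.
import random

FPS = 30                                # phase-recognition videos run at 30 fps
CLIP_LEN, STRIDE = 3 * FPS, 2 * FPS     # 90-frame clips, 1 s overlap -> 60-frame stride

def clip_starts(first_frame, last_frame):
    """Start indices of all three-second clips inside one annotated phase."""
    return list(range(first_frame, last_frame - CLIP_LEN + 2, STRIDE))

def sample_clip(start, n_frames=10):
    """Randomly pick n_frames of the 90 frames in a clip, kept in temporal order."""
    return sorted(random.sample(range(start, start + CLIP_LEN), n_frames))

def balance_one_vs_rest(target_clips, rest_clips):
    """Subsample the 'rest' class so both classes contain the same number of clips."""
    return target_clips, random.sample(rest_clips, min(len(target_clips), len(rest_clips)))
```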

Experimental settings for semantic segmentation

Network architectures

We perform experiments to validate the robustness of our pixel-level annotations using several state-of-the-art baselines targeting general images, medical images, and surgical videos. The specifications of the baselines are listed in Table 3.

Table 3 Specifications of the proposed and alternative approaches.

Training settings

For all neural networks, the backbones are initialized with ImageNet pre-trained parameters30. We train all networks with a batch size of eight and an initial learning rate of 0.001, which decreases during training using polynomial decay, \(lr=lr_{init}\times \left(1-\frac{iter}{iter_{total}}\right)^{0.9}\). The input size of the networks is set to 512 × 512. During training, we apply cropping, random rotation (up to 30 degrees), color jittering (brightness = 0.7, contrast = 0.7, saturation = 0.7), Gaussian blurring, and random sharpening as augmentations, and use the cross-entropy log-dice loss given in Eq. (1),

$${\mathscr{L}}=\lambda \times BCE\left({{\mathscr{X}}}_{true}(i,j),{{\mathscr{X}}}_{pred}(i,j)\right)-(1-\lambda )\times \log \left(\frac{2\sum {{\mathscr{X}}}_{true}\odot {{\mathscr{X}}}_{pred}+\sigma }{\sum {{\mathscr{X}}}_{true}+\sum {{\mathscr{X}}}_{pred}+\sigma }\right),$$
(1)

where \({{\mathscr{X}}}_{true}\) denotes the ground-truth binary mask, and \({{\mathscr{X}}}_{pred}\) denotes the predicted mask \(\left(0\le {{\mathscr{X}}}_{pred}(i,j)\le 1\right)\). The parameter \(\lambda \in [0,1]\) is set to 0.8 in our experiments, and \(\odot \) refers to the Hadamard product (element-wise multiplication). The parameter σ is the Laplacian smoothing factor, which is added to (i) prevent division by zero and (ii) avoid overfitting (in our experiments, σ = 1). We compare the performance of the baselines using the average Dice coefficient and the average intersection over union (IoU).
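
For illustration, a PyTorch sketch of Eq. (1) and of the polynomial learning-rate decay is given below; this is an assumed re-implementation, not the released training code.

```python
# Sketch of the BCE + log-Dice loss of Eq. (1) and the polynomial LR schedule.
import torch
import torch.nn.functional as F

def bce_log_dice_loss(pred, target, lam=0.8, sigma=1.0):
    """pred: predicted probabilities in [0, 1]; target: binary ground-truth mask."""
    bce = F.binary_cross_entropy(pred, target)
    dice = (2.0 * (pred * target).sum() + sigma) / (pred.sum() + target.sum() + sigma)
    return lam * bce - (1.0 - lam) * torch.log(dice)

def poly_lr(lr_init, iteration, total_iters, power=0.9):
    """Polynomial learning-rate decay applied per training iteration."""
    return lr_init * (1.0 - iteration / total_iters) ** power
```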

Data Records

All datasets and annotations, including the 1000 raw videos, the phase recognition set, the semantic segmentation set, and the irregularity detection set, are publicly released on Synapse31.

Frame-level annotations for phase recognition are provided in CSV files specifying the first and the last frame of each action phase per video. The preprocessing code to extract all action and idle phases from a video using the CSV files is provided in the GitHub repository of the paper. Table 2 visualizes our phase annotations for the 56 cataract surgery videos. Furthermore, Fig. 1 shows the total duration of the annotations corresponding to each phase across the 56 videos.
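
As an illustration, the annotations can be read as phase intervals and used to label arbitrary frames; the column names below are placeholders, and the exact schema is defined by the released CSV files and preprocessing code.

```python
# Hypothetical reader for the frame-level phase annotations (placeholder column names).
import csv

def read_phase_annotations(csv_path):
    """Return a list of (phase_name, first_frame, last_frame) tuples."""
    with open(csv_path, newline="") as f:
        return [(row["phase"], int(row["first_frame"]), int(row["last_frame"]))
                for row in csv.DictReader(f)]

def frame_to_phase(intervals, frame_idx):
    """Map a frame index to its action phase; frames outside all intervals are idle."""
    for name, start, end in intervals:
        if start <= frame_idx <= end:
            return name
    return "idle"
```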

Pixel-level annotations are provided in two formats: (1) the Supervisely format, for which we provide Python code for mask creation from the JSON files, and (2) the COCO format, which additionally provides bounding box annotations for all pixel-level annotated objects and can therefore be used for object localization problems. The preprocessing code to create training masks for “anatomy plus instrument segmentation”, “binary instrument segmentation”, and “multi-class instrument segmentation” is provided in the GitHub repository of the paper. We have formed five folds with patient-wise separation, meaning that every fold consists of the frames of six distinct videos. Table 4 compares the number of instances per label and the percentage of frames in which each label appears. Besides, Table 5 lists the average number of pixels per frame corresponding to each label.
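
For example, the COCO-format files can be converted into per-class binary masks and bounding boxes with pycocotools, as sketched below; the annotation file name is hypothetical, and the released preprocessing code should be preferred for reproducing the exact training masks.

```python
# Assumed sketch for reading the COCO-format annotations with pycocotools.
from pycocotools.coco import COCO

def load_masks(coco, image_id):
    """Return {category_name: (binary_mask, [x, y, w, h])} for one annotated frame
    (one instance per category assumed, for brevity)."""
    out = {}
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=[image_id])):
        name = coco.loadCats([ann["category_id"]])[0]["name"]
        out[name] = (coco.annToMask(ann), ann["bbox"])
    return out

coco = COCO("annotations.json")      # hypothetical file name
for img_id in coco.getImgIds():
    masks_and_boxes = load_masks(coco, img_id)
```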

Table 4 Number of instances and presence in the frames (% of total number of frames in each fold).
Table 5 Average pixels corresponding to different labels per frame.

Technical Validation

In this section, we validate the quality of our multi-task annotations by training several state-of-the-art neural network architectures for each task. We evaluate the performance of the trained models using relevant metrics to ensure the accuracy and reliability of our annotations.

Table 6 presents the phase recognition performance of several CNN-RNN architectures. In our evaluations, we combine the viscoelastic and anterior-chamber-flushing phases due to their shared visual features. The results reveal satisfactory phase recognition performance across diverse backbones and recurrent network configurations. Notably, the incorporation of bidirectional recurrent layers consistently improves detection accuracy and F1 score across all configurations. Furthermore, networks leveraging the ResNet50 backbone perform marginally better than those using VGG16. This outcome can be attributed to the deeper architecture of ResNet50, which facilitates the extraction of the intricate features essential for accurate recognition. The results also reveal how distinguishable the different phases of cataract surgery are. Specifically, the phacoemulsification phase consistently attains the highest accuracy and F1 score, owing to the distinctive phacoemulsification instrument and the unique texture of the pupil during this phase. Conversely, the least robust detection performance corresponds to the viscoelastic/anterior-chamber-flushing phases, reflecting the visual resemblance between these phases and other phases in cataract surgery videos.

Table 6 Phase recognition performance of several CNN-RNN architectures.

Table 7 provides a quantitative analysis of “anatomy plus instrument” segmentation performance for various neural network architectures. The results highlight that segmenting the relevant anatomical structures is less challenging than instrument segmentation for all networks. Specifically, the best performance corresponds to pupil segmentation, attributable to its distinct features and sharp boundaries. In contrast, lens segmentation shows relatively lower performance due to the lens’s transparency and an inherent class imbalance (outlined in Table 4). Instrument segmentation, however, remains challenging: this class is marked by major distortions, including motion blur, reflections, and occlusions, which collectively lower network performance. The best performance corresponds to the DeepPyramid network with a VGG16 backbone, which consistently yields the best results across all classes.

Table 7 Quantitative evaluations of “anatomy plus instrument” segmentation performance for neural network architectures listed in Table 3.

Figure 5 visually compares the Dice and IoU metrics’ averages and standard deviations across five folds for the evaluated neural networks. According to the results, DeepPyramid, AdaptNet, and ReCal-Net are the three best-performing networks for anatomy and instrument segmentation in cataract surgery videos.

Fig. 5
figure 5

Average and standard deviation of “anatomy plus instrument” segmentation results for neural network architectures listed in Table 3.

Table 8 compares the performance of various neural network architectures in intra-domain and cross-domain scenarios. These architectures are trained using our binary instrument annotations. The results clearly indicate statistical differences between the Cataract-1K and CaDIS datasets. Concretely, the average Dice coefficient for binary instrument segmentation is 77% on the Cataract-1K dataset. However, this metric drops to around 67% (66.23% for AdaptNet) when the trained models are applied to the CaDIS dataset. This considerable gap underscores the substantial domain shift between the two datasets. These results demonstrate the need to explore semi-supervised and domain adaptation techniques to improve instrument segmentation performance in cataract surgery videos under cross-dataset domain shifts32.

Table 8 Single domain and cross-domain binary instrument segmentation performance for neural network architectures listed in Table 3.

Usage Notes

The datasets are licensed under CC BY. For further legal details, we kindly ask readers to refer to the complete license terms. Sample videos and images from the dataset, as well as the GitHub repository containing the dataset preparation code, are publicly accessible.