Introduction

Major vascular injury during minimal access, endoscopic or robotic-assisted surgery can impair visualization and requires immediate action1,2. Despite maximal efforts, including conversion from minimally invasive to ‘open’ surgery, 13–60% of major vascular injuries result in patient death2,3,4,5,6. Surgeons must immediately assess the likelihood of achieving hemostasis and the need for blood transfusion; however, inexperience, inability7,8,9,10,11 and stress1,3,12,13 impair decision-making. Accordingly, surgeon self-assessments of the likelihood of controlling an unexpected vascular complication are uncorrelated with their actual performance14. Inaccurate predictions of blood loss and task outcome risk patient harm by delaying changes in technique, aid from surgical colleagues, or transfusion of blood products. Rather than waiting for a patient’s clinical deterioration, early prediction of difficulty in achieving hemostasis and of high-volume blood loss using computer vision (CV) techniques could optimize patient outcomes.

We created SOCAL (Simulated Outcomes Following Carotid Artery Laceration), a video dataset of attending and resident surgeons (otorhinolaryngologists and neurosurgeons) controlling life-threatening internal carotid artery injury (ICAI) in a validated, high-fidelity bleeding cadaveric simulator14,15,16,17,18. Carotid injury is a catastrophic complication of endonasal surgery and results in up to 30% mortality, similar to vascular injuries during minimally invasive abdominal and thoracic surgery5,19,20. In prior work, we applied artificial intelligence (AI) methods to SOCAL video and developed tools that quantify blood loss and measure surgeon performance metrics from video21,22. Using these tools, we showed that video contains signals of surgical task outcome, but we do not know whether the model can detect predictive signals early in a bleeding episode, nor how its performance compares to that of gold-standard human experts.

We provided human experts (fellowship-trained skull-base neurosurgeons) with the first minute of 20 videos from SOCAL (‘Test Set’) and collected predictions of blood loss and task success over the entire unseen task. Experts’ predictions of outcome and blood loss established a benchmark of human performance. We then built a deep learning neural network (DNN) trained on the SOCAL video dataset (excluding the Test Set), called SOCALNet, and compared model performance on the Test Set to the expert benchmarks. We validated SOCALNet predictions in subsequent experiments. To the authors’ knowledge, this is the first comparison of DNN-derived surgical video outcome prediction with human experts viewing the same video.

Methods

Experimental design

Experimental setup, data collection, consent and implementation parameters for the dataset are found in Appendix 1. Seventy-five surgeons ranging from junior trainees to world experts in endoscopic endonasal approaches (EEA) were recorded in a nationwide, validated, high-fidelity training exercise. Surgeons attempted to control an ICAI in a cadaveric head perfused with blood substitute. In short, task success was defined as the operating surgeon achieving hemostasis with a crushed muscle patch within 5 min; failure to do so within this window constituted simulated patient mortality. Blood loss was additionally measured and recorded for each trial. Performance data and intraoperative video were used to develop the SOCAL database14,15,16,17,18,23. The SOCAL database was developed in concordance with previously published methods and is publicly available23,24,25,26. The SQUIRE reporting guidelines were followed27. The study was approved by the IRB of the University of Southern California. All research was performed in accordance with relevant regulations/guidelines. No patient data were used; therefore, patient-level informed consent was waived. Participating surgeons’ consent was obtained for intraoperative video recording. Surgeon-expert consent was obtained.

Datasets

The 147 videos in SOCAL were divided into a training set of 127 videos and a separate test set of 20 videos. Ten videos depicting success and ten depicting failure were initially chosen at random for the test set; ultimately, 11 success videos and 9 failure videos were used because of ease of video formatting. Videos were truncated after 60 s. Only videos in the test set were shown to experts for grading.

SOCALNet model architecture

SOCALNet utilized two distinct neural network architectures and a transfer learning approach to generate predictions from video. The first layer, a ResNet, analyzes each individual frame to generate a vector representation of features that correspond with success/failure of a trial or with the amount of blood lost. However, because video (rather than individual frames) must be analyzed, a temporal layer was added after the ResNet. This temporal layer uses an LSTM architecture, a type of recurrent neural network containing an input, an output and a forget gate. These gates modify information from the current frame as well as from prior frames before passing this modified information to the subsequent cell, effectively regulating the flow of information across a temporal sequence. This enables SOCALNet to take individual frame predictions generated by the ResNet and link them together in a temporal sequence using the LSTM. A schematic of SOCALNet is shown in Fig. 1.
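For reference, a standard LSTM cell (the general formulation, not parameters specific to SOCALNet) updates its gates and states at each time step $t$ as follows, where $x_t$ is the per-frame input, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid function and $\odot$ element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$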

Figure 1

SOCALNet architecture. Deep learning model used to predict blood loss and task success in a critical hemorrhage-control task. (A) Video is sampled into individual frames. (B) A pretrained ResNet convolutional neural network (CNN) is fine-tuned on SOCAL images from (A) to predict blood loss and task success in each individual frame. The penultimate layer of the network was removed and a 1 × 4 matrix of values predictive of success/failure or blood loss was obtained. This is repeated for all frames, generating a new matrix with N (number of frames) rows and 4 columns. The output matrix from (B) and Tool Presence Information (C) [e.g. ‘Is suction present? Yes (check); is muscle present? No (X)’; encoded as 8 binary values per frame (N × 8)] are input into a temporal layer. (D) Temporal layer: a long short-term memory (LSTM) modified recurrent neural network allowing for temporal analysis across all frames. The 2D matrix of ResNet features and Tool Presence Information (‘check mark’, ‘X’) from each frame is fed into the temporal layer. All LSTM predictions are consolidated in one dense layer and (E) a final prediction of success/failure and blood loss (in mL) is output.

SOCALNet model implementation

See eSupp1 for model code. Video was sampled at 1 frame per second (fps) and input into two layers: a feature-generating layer and a temporal-analysis layer (Fig. 1). The output of the model was a binary prediction of surgical ability (trial success or failure) and the estimated blood loss over the entire trial (in milliliters).
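As an illustration of the 1-fps sampling step, the following sketch decimates a video into frames using OpenCV; the function name and fallback frame rate are placeholders, and this is not the authors' preprocessing code (see eSupp1 for that).

```python
# Illustrative 1-fps frame sampling (assumes OpenCV; not the authors' pipeline).
import cv2

def sample_frames(video_path, target_fps=1.0):
    """Yield frames from `video_path` at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(native_fps / target_fps)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame  # BGR array passed on to the feature-generating layer
        idx += 1
    cap.release()
```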

For the feature generator, we used a transfer-learning approach in which a Residual Learning Neural Network (ResNet) model was pretrained on the ImageNet 2012 classification dataset28,29. ResNet is a single-stage convolutional neural network (CNN) that uses skip (residual) connections, allowing very deep networks to effectively bypass layers that would otherwise degrade overall performance. ResNet has become ubiquitous for object detection and classification in computer vision (CV)29. All weights from pretraining on ImageNet were retained in our model; however, the final three layers of the ResNet were retrained on SOCAL images to predict blood loss or task success. The values of the four output nodes from the penultimate layer of the ResNet were extracted, representing a 1 × 4 matrix of values predictive of task success/failure or blood loss within that individual frame. This matrix is combined with tool presence information encoded as an array of eight binary values (a 1 × 8 matrix per frame, representing whether specific surgical instruments were present within the frame). This process is repeated for all frames, and the resulting 2D matrix is passed into a bi-layer Long Short-Term Memory (LSTM) recurrent neural network30. Instrument annotations alone are inadequate for outcome prediction; successful detectors incorporate instrument data and image features21.
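A minimal sketch of this two-stage design is shown below, assuming PyTorch/torchvision; the ResNet-50 backbone, hidden size, and output heads are illustrative placeholders rather than the authors' released implementation (see eSupp1), and the per-frame feature extraction is simplified to a single replaced classification layer.

```python
# Illustrative sketch of the SOCALNet design (CNN feature generator + LSTM
# temporal layer). Assumes PyTorch/torchvision; backbone choice, hidden size,
# and output heads are placeholders, not the authors' released code.
import torch
import torch.nn as nn
from torchvision import models

class SOCALNetSketch(nn.Module):
    def __init__(self, n_tools=8, lstm_hidden=64):
        super().__init__()
        # Frame-level feature generator: ImageNet-pretrained ResNet whose final
        # classification layer is replaced to emit 4 values per frame.
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Linear(backbone.fc.in_features, 4)
        self.frame_encoder = backbone
        # Temporal layer: two-layer LSTM over per-frame features + 8 tool flags.
        self.lstm = nn.LSTM(input_size=4 + n_tools, hidden_size=lstm_hidden,
                            num_layers=2, batch_first=True)
        self.success_head = nn.Linear(lstm_hidden, 1)     # trial success/failure
        self.blood_loss_head = nn.Linear(lstm_hidden, 1)  # blood loss in mL

    def forward(self, frames, tools):
        # frames: (batch, time, 3, H, W); tools: (batch, time, n_tools)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).view(b, t, 4)
        sequence = torch.cat([feats, tools], dim=-1)      # (batch, time, 4 + n_tools)
        out, _ = self.lstm(sequence)
        summary = out[:, -1]                              # final time-step state
        return torch.sigmoid(self.success_head(summary)), self.blood_loss_head(summary)
```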

Expert assessment

Experts were four skull-base fellowship-trained neurosurgeon instructors in ICAI management. Experts watched the twenty 1-min test videos and provided blood loss estimates (in mL), outcome predictions (success/failure), and surgeon grades (1–5 Likert scale, where 1 represents novice and 5 represents master). Experts also reported self-confidence in their outcome prediction (1–5 Likert scale; 5 represents most confident). Each expert was surveyed in a standardized fashion via the following questions: based on the 1 min of video viewed, (1) do you feel the operating surgeon will succeed or fail in controlling bleeding within 5 min? (2) How much blood (in mL) will be lost by the end of the trial? (3) On a Likert scale of 1–5, how skilled is the operator? (4) How confident are you in this prediction? To provide baselines prior to grading, experts were shown three anchoring videos demonstrating predetermined novice, average, and master performances with the respective outcome data. Anchoring videos were not contained in the Test Set and were chosen as representative videos of each skill level by adjudication of the study team. Experts were not given additional data (e.g. years of practice, attending/resident status) on participating surgeons and relied solely on intraoperative video. Grading sessions were conducted in double-blinded fashion by the lead author (DJP) and individual experts (BS, MR, GZ, DAD, referred to as S1–S4). Given high concordance, the mean and mode are reported for experts (‘S’).

Validation analysis

We conducted two experiments to evaluate model and expert concordance. In the first experiment, two videos were identified in the Test Set in which a critical error occurred shortly after the 1-min video sample concluded (i.e., not shown to the model or surgeons). The model and all surgeons had incorrectly predicted that both videos were successes. New 1-min clips showing the critical error and its aftermath were generated. These new clips were evaluated by one of the human experts and by SOCALNet.

In a second experiment, the three best (least blood loss, successes) and worst (most blood loss, failures) videos were identified from within the Test-Set. Composite ‘best’ and ‘worst’ videos were constructed by combining the first 20 s of each of the three best and worst trials in each possible order permutation (6 ‘best’, 6 ‘worst’ videos). The twelve composite videos were then presented to SOCALNet.
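As an illustration of how such composite clips could be assembled programmatically, the sketch below enumerates the 3! = 6 orderings and concatenates the first 20 s of each clip; it assumes the moviepy 1.x API, and the file names are hypothetical placeholders rather than the study's actual editing workflow.

```python
# Illustrative assembly of composite 'best' clips (assumes moviepy 1.x;
# file names are hypothetical, not the study's editing workflow).
from itertools import permutations
from moviepy.editor import VideoFileClip, concatenate_videoclips

best_clips = ["best_1.mp4", "best_2.mp4", "best_3.mp4"]  # placeholder names

for order in permutations(best_clips):  # 3! = 6 order permutations
    segments = [VideoFileClip(path).subclip(0, 20) for path in order]  # first 20 s
    composite = concatenate_videoclips(segments)
    out_name = "composite_" + "_".join(name[:-4] for name in order) + ".mp4"
    composite.write_videofile(out_name)
```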

Statistical analysis

Blood loss prediction was reported using mean error, root mean square error (RMSE), and Pearson’s correlation coefficients. Categorical inter-rater reliability was calculated using Cohen’s kappa, and Krippendorff’s alpha for more than two raters. Continuous inter-rater reliability was calculated using Pearson’s correlation coefficient and the intraclass correlation coefficient (ICC) for more than two groups, using a two-way random-effects ICC model31. We used Fisher’s exact test for categorical comparisons. We performed analysis in Python with SciPy32.
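For illustration, the core metrics can be computed with NumPy/SciPy as in the sketch below; the arrays and the 2 × 2 table contain placeholder values, not the study data or analysis script.

```python
# Illustrative computation of the reported metrics (placeholder data).
import numpy as np
from scipy import stats

predicted = np.array([400.0, 250.0, 900.0, 120.0])   # predicted blood loss (mL)
observed = np.array([520.0, 200.0, 1100.0, 100.0])   # measured blood loss (mL)

mean_error = np.mean(predicted - observed)            # signed bias
rmse = np.sqrt(np.mean((predicted - observed) ** 2))  # root mean square error
r, p_corr = stats.pearsonr(predicted, observed)       # Pearson correlation

# Fisher's exact test on a 2x2 correct/incorrect contingency table (placeholder counts)
table = [[17, 3],
         [55, 25]]
odds_ratio, p_value = stats.fisher_exact(table)
```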

Results

Table 1 lists predictions and ground truth data. There were 11 successful trials and 9 failed trials in the Test Set, with mean blood loss of 568 mL (range 20–1640 mL, mean success = 323 mL, mean failure = 868 mL). Experts correctly predicted outcome in 55/80 predictions (69%, Sensitivity: 79%, Specificity: 56%). Expert predictions were concordant, with one dissent in 80 ratings (Fleiss’ kappa = 0.95). The average root mean square error (RMSE) for blood loss prediction of surgeons was 351 mL (mean error = − 131 mL, average R2 = 0.70). Expert ICC was high at 0.72.

Table 1 Results comparing the deep learning model with expert surgeons.

Figure 2 and Supplemental Table 1 demonstrate the relationship between prediction confidence, surgeon skill and prediction accuracy. Experts were most accurate when maximally confident (5/5 confidence, accuracy 88%) or when viewing a surgeon they rated as having minimal (Likert scale 1, accuracy 92%) or maximal skill (Likert scale 5, accuracy 79%). Predictions with non-maximal confidence (levels 2–4) were only marginally better than chance (53%, p = 0.02 compared to maximal confidence). Predictions for intermediate-skill surgeons were also less accurate (levels 2–4, 63%, p = 0.04 compared to composite 1/5 and 5/5 skill).

Figure 2

Association between expert confidence, surgeon skill level and accuracy of prediction. Experts are most accurate when viewing trials of surgeons with low or high skill, or where they (experts) are maximally confident. For those with moderate skill or when experts have moderate confidence, prediction accuracy is lower. Size of circle denotes number of trials. Color denotes accuracy.

SOCALNet correctly predicted outcome in 17/20 trials (85%, Sensitivity: 100%, Specificity: 66%), noninferior to surgeons (p = 0.12). The model predicted blood loss with an RMSE of 295 mL (mean error = − 57 mL, R2 = 0.74) (Fig. 3). The model and experts all predicted outcome correctly in 13/20 trials. In four trials the model was correct and all experts were incorrect; in one trial the model was incorrect and all experts were correct; and in two trials all were incorrect (Fig. 4). Correlations (R2) between blood loss estimates of the model, the experts and ground truth are shown in Supplemental Fig. 1 and range from 0.53 to 0.93. The correlation between the model and the average surgeon blood loss estimate was 0.73, ranging from 0.53 to 0.74 for individual surgeons (Table 1).

Figure 3

Expert and SOCALNet blood loss quantification. Predicted versus observed blood loss estimations by individual surgeons (grey), surgeon mean (blue), and model (green). Red points represent measured blood loss (ground truth).

Figure 4

Outcome predictions of experts and SOCALNet. Outcomes of experts (Blue) and model (Red) in predicting task success using 1 min of video. Circle size denotes number of trials (N). Success (S) and failure (F) denoted underneath each N. When the union of successful predictions is taken, the model + expert grouping would successfully predict outcome in 18/20 cases. In the 2 remaining cases (bottom left quadrant), a critical error took place following the cessation of the video and was evaluated in subsequent counterfactual experiments.

We then evaluated trials above the 50th percentile for blood loss, in which blood loss exceeded 500 mL and transfusion might be needed. The model predicted a blood loss above 500 mL in 80% of these trials (8/10) compared with the experts’ 47.5% (19/40); this difference was not statistically significant (p = 0.09).

Exploratory model-validation

Supplemental Table 2 reports model-validation experiments. In two trials, experts and SOCALNet predicted success, but the surgeon failed due to a critical error shortly after the end of the 1-min clip (therefore unseen by experts and SOCALNet). When we included the critical error, the model accurately predicted ‘failure’, as did an expert. In a second experiment, SOCALNet viewed six composite ‘Best’ trials and uniformly predicted success with low blood loss (328–473 mL); conversely, in six composite ‘Worst’ videos the model uniformly predicted failure with high blood loss (792–794 mL).

Discussion

To address the need for datasets depicting surgical adverse events, we created SOCAL, a public video dataset of 147 attempts to control carotid injury in high-fidelity perfused cadavers. In this work we compared human expert predictions of outcome, made from 1 min of video from 20 trials in the dataset, to those of a DNN (SOCALNet). SOCALNet met or surpassed the expert benchmarks despite its relatively primitive architecture and a training dataset that is small by CV standards. We synthesized counterfactual videos of excellent and poor surgeon performance to challenge SOCALNet, and it correctly predicted the outcomes in these challenges. SOCALNet and other CV methods can aid surgeons by quantifying and predicting outcome during surgical events and by enabling automated video review. The absence of video datasets containing adverse events is a critical unmet need that prevents the development of predictive models to improve surgical care.

Benchmark performance of human experts

Expert predictions were highly concordant, indicating that experts detected similar signals of blood loss and outcome (cross-correlation: R2 = 0.74–0.93, Kappa for success prediction = 0.95). Experts had uniform definitions of success (hemostasis) and were familiar with the stepwise progression of a well-described technique18,33. Thus, it is reasonable to conclude that using the first minute of video of a bleeding event, human experts detect signals predictive of blood loss and task outcome.

Although experts’ outcome and blood loss predictions were reasonably accurate (69% accuracy, R2 = 0.7), experts systematically overestimated surgeon success and underestimated bleeding: 4/6 of expert errors were false ‘success’ predictions, experts underestimated blood loss by 131 mL on average, and experts failed to identify 52% of high blood loss (above 500 mL) events. This post-hoc cutoff of 500 mL represents a potential clinical marker of the need for transfusion. The tendency of human experts to underestimate blood loss is well documented34,35,36,37, is corroborated by our findings, and may result in delayed recognition of life-threatening hemorrhage.

To validate individual ratings, we asked experts to provide their confidence in each prediction, and perceived skill rating of the participating surgeon. Maximally confident predictions were more likely to be correct, as expected from prior work34,35,38. Similarly, predictions were most accurate when evaluating highest and lowest-skilled surgeons (skill rating 1 or 5), but scarcely better than chance when evaluating intermediate surgeons. Intermediate skill surgeons comprised half of all surgeons and may benefit greatly from performance assessments.

During a real vascular injury, the estimation ability of the average surgeon is likely to be inferior to that of our experts, who calmly rated a single stereotyped task after training with videos of known blood loss. Experts’ systematic underestimation of blood loss and their difficulty assessing the performance of intermediate surgeons represent a chasm in surgeon-assessment proficiency. Surgical patients may benefit from novel methods that improve on these benchmarks.

SOCALNet performance compared to experts

We designed a primitive deep-learning architecture containing a standard CNN and a recurrent neural network, which we call SOCALNet. We provided SOCALNet with short videos from a much smaller training dataset than is customary in CV. Despite these disadvantages, SOCALNet made statistically non-inferior (and numerically superior) outcome predictions and superior blood loss predictions compared to human experts. SOCALNet’s predictions of blood loss had a smaller mean underestimation and standard error. Unlike experts, SOCALNet predictions were accurate for intermediate-skill surgeons.

The advantages of SOCALNet support the development of computer vision tools for surgical video review and as potential teammates for surgeons39. SOCALNet demonstrates that CV models can provide accurate, clinically meaningful analyses of surgical outcome from video. Future models could leverage the vast but largely untapped collections of surgical videos. Workflows developed in building SOCALNet can guide model deployment for other surgical adverse events. Human-AI teaming is a validated concept in other domains40,41,42. A combined SOCALNet-and-expert team (with the model as a tiebreaker, particularly when expert confidence was low) would have generated 18/20 correct predictions. Furthermore, the only two inaccurate predictions from this teaming occurred when a critical error was made after the video ceased, and both the model and an expert detected these errors when shown extended clips in the validation experiments. If utilized at scale, AI-driven video analysis may quantify comparisons of surgical technique, provide real-time feedback for trainees, or provide guidance during rare scenarios that a surgeon may not have encountered (e.g. vascular injury) but on which the model has been trained39.

SOCALNet has room for improvement. For adverse events, (1) accurate estimation of high-volume blood loss and (2) detection of task failures should be prioritized, as exsanguination is life-threatening. SOCALNet blood loss predictions exhibited a more robust central tendency than the experts’, resulting in better predictions for typical performances. However, when grading the edge cases of the two worst surgeons in the Test Set, SOCALNet underestimated blood loss (absolute error of 790–800 mL on videos exceeding 1.5 L of blood loss). In predicting failure (specificity), both experts and SOCALNet showed limitations (specificity = 0.56 and 0.66, respectively); however, improving expert predictions is challenging, and most surgeons are non-experts. Accordingly, applying CV optimization techniques to AI models (e.g. cost-sensitive classification, oversampling) may be preferable43,44.
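As a minimal sketch of one such technique, the example below up-weights the failure class in a binary cross-entropy loss (cost-sensitive classification); it assumes PyTorch, and the weight value is a hypothetical placeholder rather than a parameter tuned for SOCALNet.

```python
# Illustrative cost-sensitive loss: penalize missed failures more heavily
# (assumes PyTorch; the cost ratio below is a hypothetical placeholder).
import torch
import torch.nn as nn

failure_weight = torch.tensor([3.0])  # hypothetical cost for the failure class
criterion = nn.BCEWithLogitsLoss(pos_weight=failure_weight)

logits = torch.randn(8, 1)                     # model outputs for a batch of trials
labels = torch.randint(0, 2, (8, 1)).float()   # 1 = failure (the costly class)
loss = criterion(logits, labels)               # failures contribute 3x to the loss
```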

Surgical adverse event video datasets: an unmet need in surgical safety

A growing body of evidence supports the quantitative analysis of surgical video22,45,46,47,48. One fundamental discovery has been the detection of signals in surgical video that predict patient outcome: surgeons have heterogeneous skill resulting in heterogeneous outcomes14,45,46,49. Although low-skill surgeons are more likely to have adverse intraoperative events, video of these events has not been systematically studied. Instead of studying surgical video, studies describe adverse events using textual medical records, radiography, and laboratory results. Analysis of these extra-operative records and correlations with pre-operative risk factors and post-operative management can be useful50,51,52,53,54. However, this research omits a crucial determinant of the outcome of the surgical patient: the surgical event itself. This omission limits root-cause analysis to only the extra-operative universe and prevents evaluation of the technical maneuvers and patient anatomic conditions that make adverse events more likely. Unlike textual records, surgical video depicts all visualized surgeon movements and patient anatomy, making video uniquely suited for the study of operative events. The results of the present study begin to demonstrate the value of studying video of surgical adverse events.

We propose the creation of large, multi-center datasets of surgical video that include adverse events55,56. Video datasets of surgical adverse events can be leveraged using predictive models (e.g., SOCALNet) that detect intraoperative events, evaluate performance and quantify technique. This study was supported by the North American Skull Base Society, whose mission is to promote scientific advancement, share outcomes data for education and advance outcomes research. Groups such as the Michigan Bariatric Surgery Collaborative and the Michigan Urologic Surgery Improvement Consortium have conducted similar work, and we hope to call their attention to adverse events in addition to routine procedures57,58. National organizations capable of soliciting large bodies of data should prioritize collecting adverse event videos and apply technical innovations adopted by other medical fields to ensure privacy and confidentiality59,60,61. National organizations can also facilitate the scaling of expert labeling. Small groups face long delays in accruing sufficient cases and labeling video: in this study, despite our team’s long-term track record of collaboration, it took 2 months for our experts to review 20 min of aggregated video62. Collaborative efforts may be able to require video review as a condition of membership. This work is important given the potential of AI models to augment human performance. In the context of ICAI, an AI model may be useful in predicting high volumes of blood loss, or in cases where outcomes are more uncertain. However, the volume of video required for appropriate statistical power to demonstrate clinical utility would require significant collaboration between institutions and expert surgeon reviewers. We are in the process of establishing a data-sharing collective aimed at providing a secure mechanism for surgeons to share anonymized video and corresponding outcomes. This effort mirrors other quality improvement efforts already underway in surgical fields, with the added modality of surgical video and computer vision analysis. It is our hope that these efforts can accelerate the collection of surgical video and its analysis with DNN methods such as those described in this manuscript.

Finally, high-fidelity simulation enables analysis of rare surgical events. Curating 150 videos of real carotid injuries would require tens of thousands of cases, an impossible task without streamlined data-sharing mechanisms; using perfused cadavers and real instruments we collected hundreds of observations of this otherwise rare event. Videos in the simulated environment can complement surgical video datasets that otherwise depict thousands of uncomplicated cases and only a few rare events14,15,17,18,63,64,65,66. As more surgical video datasets are developed, we can follow the ‘sim-to-real’ process where models are trained on virtual data and then fine-tuned and validated in the real environment67,68,69.

Limitations

Our study has several limitations. First, validation on clinical video is a clear next step, although accruing a corpus of carotid injury video would likely require substantial national effort. Second, individualized models are needed that incorporate surgeon experience, response to hemorrhage, and patient-specific factors into a predictive model; this is a necessary step in the development of deep learning models and for human-AI teaming. Concepts such as the ‘OR Black Box’ may be able to incorporate factors that are not captured in purely intraoperative video (e.g. a surgeon’s appropriate response to an injury)70. Additionally, results from carotid injuries may not transfer to other vascular injuries, and vascular injuries differ from other adverse events. Finally, this task was performed in a constrained, simulated environment with clear endpoints, which is far removed from the realities of clinical practice. Rather than diminishing our results, these complementary challenges showcase the depth of unmet need within surgical-video data science. Separately from these study-design limitations, SOCALNet ingests ground-truth tool annotations as input, which requires pre-processing of data and is thus not fully automated71,72,73. The lack of curated surgical video datasets remains a major limitation for future work.

Conclusion

Experts and a neural network can predict the outcome of surgical hemorrhage from the first minute of video of the adverse event. Neural network-based architectures can already achieve human or supra-human performance at predicting clinically relevant outcomes from video. To improve outcomes of surgical patients, advances in quantitative and predictive methods should be applied to newly collected video datasets containing adverse events.