Main

Surgical robots may be on the brink of achieving their fundamentally disruptive potential1. Since the first surgical robot was introduced in 1985 (the PUMA560, tasked with performing a computed tomography-guided brain biopsy2), the field of robotic surgery has expanded in size and scope, offering the potential for enhanced surgical precision, telesurgery and increasingly complex autonomous function. Technological advances in robotic control systems and artificial intelligence (AI) make it likely that the next generation of surgical robots will transform the surgical technology landscape, previously monopolized by a limited number of approved devices such as Intuitive’s da Vinci1,3,4.

This proliferation of robotic platforms poses important challenges for their safe and ethical clinical translation1,3,5—challenges that extend beyond the operating room and encompass wider considerations within healthcare and society4,6. The scope of the evaluation challenge is too broad for existing methodological templates5,7, but current circumstances create a brief window of opportunity to develop a structured framework capable of guiding the evaluation of surgical robots across their development and translation8.

Conducting high-quality surgical research is difficult owing to the nature of surgical innovation9,10, which has historically led to a methodologically weak approach. Specific problems have included a lack of robust early-stage studies providing transparent and timely reporting of iterative development, and subsequent comparative studies failing to address variations in surgical technique and indications, operator learning curves and lack of equipoise1,5. Evaluating surgical robots is subject to all of these challenges, and adds the need to address unique ethical considerations, profound questions about economic value and sustainability, major impacts on the host healthcare system, and the increasing integration of AI into robotic systems11.

Robotic surgery, like most innovative surgical technology, is often introduced without the stepwise testing process routinely used in medical therapeutics12. Evaluation of surgical innovation traditionally proceeds through initial small case series documenting feasibility, followed by adoption (which may be fast or slow) based largely on non-comparative retrospective evidence of potential benefits to the patient. Robotics manufacturers engage in active campaigns to promote their products to physicians and, directly or indirectly, to patients. Uncertainty, desire to improve and personal biases can lead to innovation without rigorous evaluation, with consequent risks to patient safety. Therefore, frameworks to ensure proper evaluation of patient safety are essential13.

The IDEAL framework provides a structured evaluation pathway for surgical innovation and devices, from needs analysis and preclinical testing, to long-term studies of widespread use9,14,15 (Fig. 1). However, the breadth of the evaluation problem of surgical robotics goes beyond both IDEAL and the boundaries of classical evidence-based medicine, with solutions requiring a diverse array of stakeholders to tackle all the aspects that need consideration. The IDEAL Robotics Colloquium was established to make proposals for a comprehensive practical guide for evaluation of surgical robots, using the existing IDEAL study stages as a template (Fig. 2).

Fig. 1: Current IDEAL framework with example study types.

The IDEAL framework provides an evaluative pathway for complex innovations and devices, spanning the entire life cycle, from early adoption to widespread use. IDEAL stages 0, 1 and 2a explore the safety and feasibility of an intervention; IDEAL stages 2b and 3 compare the intervention against the current standard to determine effectiveness; and IDEAL stage 4 involves the long-term monitoring of interventions following widespread uptake and adoption. Adapted from ref. 119.

Fig. 2: Examples of current robotic systems across IDEAL stages of evaluation.

Examples chosen are purely illustrative, representing an array of systems across a variety of specialities. Chosen examples were assigned to IDEAL stages based on relevant publications, including (but not limited to) a proof-of-concept study for the Maestro System120; a single-center cohort study for the VELYS120; a multicenter prospective cohort study for the Versius System121; and randomized studies for the Mako Robotic Arm and the da Vinci system59,122. Stages 3 and 4 were combined to reflect how robots may have comparative evidence (via a randomized controlled trial) with long-term monitoring data for particular indications, but still require further comparative evidence for use in other indications.

In this paper, we present a systematic analysis of the evaluation life cycle of surgical robots, in three parts. First, we dissect the preclinical and early clinical study of the safety and feasibility of new robotic concepts (IDEAL stages 0, 1 and 2a). Next, we review the pivotal phase when the effectiveness of robotic interventions is studied on a larger scale, and compared against current best practice (IDEAL stages 2b and 3). Finally, we consider IDEAL stage 4, when the robot has been widely adopted, shifting focus to long-term monitoring of performance in real-world settings. This analysis results in a list of stage-specific recommendations for systematic evaluation of robots in surgery.

Methodology

An international interdisciplinary consensus process was completed in several stages. First, seven distinct virtual panels with expertise relevant to important aspects of the challenges to robotic surgery evaluation were devised by the three lead authors (H.J.M., P.T.R. and P.M.). These panels considered AI, technical evaluation, clinical evaluation, human factors, health economics, ethics and surgical training. Patient representatives were included in each panel.

Panel leaders with relevant expertise were selected from the IDEAL council, and were asked to invite 8–12 experts from multiple disciplines (including surgeons, engineers, economists, statisticians, device regulators, patient representatives, ethicists, digital health experts, patient safety experts, system engineers, social scientists, philosophers and education experts) to join their respective panels. Experts from diverse professional and geographical backgrounds were invited, and were chosen based on leadership roles in relevant organizations (university, hospital, societal and industrial) and/or accomplishments relevant to robotic surgery development and evaluation. The recruitment and facilitation of these panels and the general strategy for their function were developed in partnership with the Royal College of Surgeons of England and the National Institute for Health and Care Research, and considered the views of industry. To avoid bias by association, one co-author conducted MEDLINE searches for publications relevant to each panel, and identified additional potential members, who were invited to join the panel, ensuring that each panel had at least one such member.

Each panel participated in a series of semistructured virtual meetings (chaired by respective panel leaders) at which the key challenges for each panel domain were discussed. The degree to which the current IDEAL framework addressed these challenges was also discussed, and further recommendations to address these challenges were proposed. Each panel therefore produced a report across each stage of the IDEAL framework to summarize the outputs of these meetings. Panel reports were then synthesized by an internationally diverse core writing group that included experts in sustainability (A.V.), global health (R.B.), device regulation (T.M.) and medical statistics (D.S.), who were independent from the colloquium panels. A full list of authors, panel members and industry collaborators is found at the end of the paper. To improve usability, the final recommendations were considered from the perspective of key surgical robotics stakeholders: the device developer, clinician, patient and wider healthcare ecosystems16 (Fig. 3).

Fig. 3: Key stakeholders in the development and evaluation of surgical robots.

Consideration of the key stakeholders is essential to the successful introduction of innovative devices.

Recommendations according to each of these perspectives were grouped together for IDEAL stages 0, 1 and 2a, which cover preclinical development and early clinical evaluation; for IDEAL stages 2b and 3, which cover comparative assessment; and for IDEAL stage 4, which covers long-term monitoring and technological evolution.

The IDEAL recommendations are based on three principles: (1) the use of the most rigorous and appropriate methodology to address the key questions at each stage in the intervention’s life cycle; (2) adherence to the fundamental principles of medical ethics (beneficence, non-maleficence, autonomy and justice); and (3) maximum feasible transparency in reporting evaluation outcomes. These principles have allowed the development of coherent proposals for evaluation across a very broad range of complex therapeutic interventions, but they inevitably lead to some recommendations that may not be feasible in many current contexts. In reporting our recommendations, we have indicated our recognition of this by prefacing certain recommendations with ‘in principle’, or by qualifying them by explicitly mentioning their conditional feasibility.

Preclinical development and early clinical evaluation (IDEAL stages 0, 1 and 2a)

An innovative device must first be deemed safe, feasible and acceptable for its successful translation. This is achieved through preclinical evaluation (IDEAL stage 0) to assess safety and feasibility, first-in-human study (IDEAL stage 1) and prospective development (IDEAL stage 2a) ahead of further collaborative evaluation and comparative assessment. Studies in this phase currently suffer from design flaws, severe reporting bias and methodological heterogeneity, which IDEAL aims to reduce16. This stage also commonly encompasses critical progression points such as regulatory approval and financing. The key challenges and recommendations of this early developmental stage are considered below, and summarized in Box 1.

Device perspective in IDEAL stages 0–2a

Key challenges

The complex and rapidly evolving nature of surgical robots poses unique challenges to assessing their safety and effectiveness3. Current assessment domains are usually driven by regulatory requirements. In the United Kingdom and European Union, this requires a demonstration of overall safety and performance; in the United States, a reasonable assurance of safety and effectiveness is required17,18. However, the implementation of these requirements varies among national regulators, and is subject to complex procedural rules and variable decision-making both within and between bodies. Requirements are also influenced by wider geopolitical, economic and legal factors19,20,21,22,23. Although international harmonized standards exist, they focus on technical aspects of device assessment, such as software or electrical safety assessments, rather than clinical metrics19,20,21,22. The nature and quality of scientific evidence developed for device safety, performance and effectiveness may therefore be vastly different for similar systems, being largely defined, verified, and validated internally by each company. Without recording of iterative systematic modification and assessment, key domains may be overlooked, particularly during prototyping and when changes are made during early clinical studies.

As complementary technologies develop, they will increasingly be integrated into surgical robotic systems13. The most impactful of these technologies will likely be AI—boosting function and increasing the autonomy of robotic systems through integration of sensory inputs, learned computational reasoning and adaptive behavior8,24,25,26,27. However, autonomous systems currently have no ‘common sense’ and so would not necessarily stop an obviously unsafe action if a specific scenario had not been ‘learned’ by the algorithm. The integration of AI also adds a further layer of complexity to device development, calibration and evaluation8,24,25,26,27. AI-integrated functions have the potential for rapid self-updating, requiring monitoring and understanding of risk and failure modes, including data drift. Isolating the dynamic AI components of the robotic system for assessment may be difficult, and assessment frameworks need to address this problem. A recent review on intraoperative AI applications for robotic surgery found that all identified publications reported on preclinical development only, and were heterogeneous in their evaluation approach, highlighting the need for a robust evaluation framework for early integration of AI into clinical practice8.

Recommendations

With these challenges in mind, this Colloquium proposes the following recommendations for early-stage evaluation of surgical robotics from a device perspective. When assessing the performance of a robotic system, technical metrics alone are acceptable in earlier studies (stage 0); however, a clinical outcomes-based approach should be used as the primary focus of assessment as early as is feasible28.

For early technical assessment of robotic systems, a standardized checklist should be used to summarize performance, safety and usability for each released version. Assessment should be systematic and transparent, including details of system latency, motion accuracy, instrument safety, operation under load, reliability, internal fault recognition and online security. Metrics and measurement instruments for each of these domains require further definition. For each domain, performance benchmarks and areas of concern should be clearly stated and shared.
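As an illustration only, the per-version checklist described above could be held in a small data structure that flags domains not yet assessed. The following Python sketch is a hypothetical layout rather than an IDEAL-endorsed instrument: the seven domain names come from the text, whereas the class names, field names and free-text fields are assumptions, reflecting that metrics and measurement instruments for each domain are still to be defined.

```python
from dataclasses import dataclass, field

# The seven assessment domains named in the text. Everything else in this
# sketch (class names, fields) is an illustrative assumption.
DOMAINS = (
    "system latency",
    "motion accuracy",
    "instrument safety",
    "operation under load",
    "reliability",
    "internal fault recognition",
    "online security",
)


@dataclass
class DomainResult:
    measured: str       # free-text summary of what was measured and how
    benchmark: str      # the stated performance benchmark for this domain
    concerns: str = ""  # areas of concern; empty if none were identified


@dataclass
class VersionChecklist:
    version: str  # identifier of the released system version
    results: dict = field(default_factory=dict)

    def record(self, domain: str, result: DomainResult) -> None:
        """Store the result for one domain, rejecting unknown domain names."""
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        self.results[domain] = result

    def missing_domains(self) -> list:
        """Domains that have not yet been assessed for this version."""
        return [d for d in DOMAINS if d not in self.results]

    def is_complete(self) -> bool:
        return not self.missing_domains()
```

A checklist created for a new version starts with all seven domains missing, so an incomplete assessment is visible before the version is released or reported.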

Building on the IDEAL-D preclinical device assessment approach of relative risk assessment, the proportionate evaluation of an autonomous surgical robot should be guided by its classification along two main axes (autonomy level and risk) before proceeding to clinical studies16,24. Autonomy levels follow the six categories described by Yang et al.: no autonomy, robot assistance, task autonomy, conditional autonomy, high autonomy and full autonomy. The preclinical evidence requirements should be guided by a failure modes and effects analysis approach to risk stratification, based on the likelihood and severity of device failure in each cell of the risk/autonomy matrix29. Therefore, before clinical study, a high-risk, full-autonomy device would require more extensive preclinical evidence than a low-risk device with no or low (that is, task-only) autonomy16,24.
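To make the two-axis classification concrete, the following Python sketch places a device in a risk/autonomy matrix and returns an ordinal evidence tier. It is a minimal illustration under stated assumptions: the six autonomy levels are those listed above, but the 1 to 5 likelihood and severity scales, the score cut-offs, the grouping of autonomy levels and the tier numbering are all hypothetical and would need to be set by consensus.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """The six autonomy categories listed in the text."""
    NO_AUTONOMY = 0
    ROBOT_ASSISTANCE = 1
    TASK_AUTONOMY = 2
    CONDITIONAL_AUTONOMY = 3
    HIGH_AUTONOMY = 4
    FULL_AUTONOMY = 5


def risk_band(likelihood: int, severity: int) -> str:
    """FMEA-style band from likelihood x severity.

    Both inputs use an assumed 1-5 ordinal scale; the cut-offs are
    illustrative, not taken from the source.
    """
    if not (1 <= likelihood <= 5 and 1 <= severity <= 5):
        raise ValueError("likelihood and severity must be in 1..5")
    score = likelihood * severity
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


def evidence_tier(autonomy: Autonomy, likelihood: int, severity: int) -> int:
    """Ordinal tier from 1 (least preclinical evidence) to 5 (most).

    Hypothetical rule: tier rises with both the risk band and the
    autonomy group of the matrix cell the device falls into.
    """
    risk_rank = {"low": 0, "medium": 1, "high": 2}[risk_band(likelihood, severity)]
    if autonomy <= Autonomy.TASK_AUTONOMY:
        autonomy_rank = 0  # no or low (task-only) autonomy
    elif autonomy == Autonomy.CONDITIONAL_AUTONOMY:
        autonomy_rank = 1
    else:
        autonomy_rank = 2  # high or full autonomy
    return 1 + risk_rank + autonomy_rank
```

Under these assumed cut-offs, a high-risk, full-autonomy device lands in tier 5 and a low-risk, task-only device in tier 1, mirroring the contrast drawn in the paragraph above.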

For the evaluation of AI-integrated robots, the preclinical (stage 0) testing should begin with stand-alone evaluation of the autonomous component and hardware separately, followed by in silico and simulator-based assessment of the two integrated into a functional unit in realistic tasks. Later stages (stage 1 and beyond) should study the performance of the AI algorithms within systems (with the hardware components of that system version) in a clinical context—using clinical outcomes where feasible. Reporting guidelines, such as DECIDE-AI, should be used to guide early clinical evaluation30.

The maturation of the system from in silico to in vitro and in vivo versions should revolve around addressing identified clinical unmet needs and should be described with clear identification of the prototype version. This should include documentation of iterative changes to the procedure, device and patient selection, and describe simulation studies in detail. This information should be recorded prospectively, and a log should be accessible to regulators. In systems with AI integration, the AI component is particularly susceptible to rapid iterations, and therefore changes to input data, algorithm code and model testing should be reported.

Clinician perspective in IDEAL stages 0–2a

Key challenges

From the perspective of clinicians, the introduction of a robotic device within a clinical team is a multifaceted challenge. Investigation of robot interaction with humans (that is, the surgical team) is crucial, particularly in the domains of usability, trust and failure analysis13,31,32,33. This is particularly pertinent with respect to the integration of AI, which could alter responsibility and liability paradigms34. Understanding systems modeling is important, as a device is never integrated into a ‘static’ system—the act of introducing it will change both the behaviors of the surgical team and the way they think about their work within the operating room. Surgical team trust must also be considered in the evaluation of these systems—especially in systems with an autonomous component, which current assessment frameworks do not recognize or evaluate24,35. Human factors and ergonomics approaches will be important in developing solutions to these largely unexplored problems. Recent projects such as the Trustworthy Autonomous Systems Hub and Responsible AI UK will aim to bring standardization and regulation to this rapidly evolving field.

Recommendations

The human–device interface and team–device integration in the operating room should be included in the intervention development and description. In principle, this process starts with robotic development, which should utilize user-centered design and involve input from surgical team members.

Robot assessment should include a human factors-based evaluation of team communication (including communication with the robot), intuitiveness of visual displays, control interface usability, feedback mechanisms (for example, haptic and auditory) and ease of integration with existing workflows. Human factors assessment of system integration should ideally include directly observed user situational awareness, user workload (mental and physical), task analysis in device use, operational challenges and potential safety-related issues36. Formal qualitative research to study robot user opinion and perceptions may be helpful; the ongoing REINFORCE initiative may provide a framework for this37,38,39,40. Human reliability assessments should be used to stratify potential risks and hazards across a wide variety of surgical expertise (that is, consultants and trainees, those with previous robotic or minimally invasive surgery expertise)41.

Surgeons’ trust in any AI autonomous function and its evolution should be evaluated initially in simulated situations. This should include monitoring for frequency of, and reasons for, surgical team members taking over control of the robotic system, alongside independent observation and qualitative assessment. Surgical robotic incorporation of AI poses ethical challenges, including fair distribution of risk and benefits for patients and clinicians. Integration of ethical considerations should occur across key domains for every study stage by addressing the key issues of minimizing harm, ensuring autonomy and consent and optimizing justice (for example, in terms of differential access to treatment). Conflicts of interest should be openly addressed35. In principle, a standard process for determining responsibility for errors when AI is integrated should be adopted with suitable expert advice and should be publicly accessible.

Patient perspective in IDEAL stages 0–2a

Key challenges

From a patient perspective, as robotic systems grow in complexity they become increasingly difficult to understand and trust42,43. Patients invited to participate in early clinical studies will rarely be able to understand the risks and limitations of the technology, compare these against other treatment options (including other robots) or be aware of potential vested interests of the investigators and healthcare system. The nature of the early IDEAL stages means patient numbers and operating team experience are limited, resulting in the evaluation of interventions at early stages of the learning curve, with resultant implications for both clinicians and patients44. Surgical teams may not know all the risks, or how learning curves may increase the overall risk to patients involved at this early stage (relative to that which later patients experience)44. Provisions to minimize harm and ensure truly informed consent are ethical requirements in this phase of surgical robot evaluation35,43,45,46.

Recommendations

Active patient and public involvement is desirable to ensure a patient-centered research design from the outset, and formal qualitative research assessing patient perceptions, understanding of the robotic system and trust in the intervention may be very informative. Patient information sheets for both research and surgical consent purposes should be developed with input from patient groups. Crucially, informed consent in early clinical studies (that is, stages 1 and 2a) should acknowledge a potentially increased uncertainty of benefit and risk of harm in early cases, as with all new device introductions. Information should include details of previous studies; known risks and the possibility of unknown risks; dependence level on the surgical robot and mitigation plans for system failure; level of AI-system autonomy and protocols for the takeover of control; transparency regarding surgical team experience with the system; and any potential conflicts of interest.

System perspective in IDEAL stages 0–2a

Key challenges

When considering the impact of surgical robots in health systems, societal cost must be considered. Currently, health economic assessments are not standard components of early evaluation frameworks for devices, as illustrated by the lack of guidelines from The Professional Society for Health Economics and Outcomes Research (ISPOR) for this stage. Early health economic evaluations are heterogeneous and often unsatisfactory. Economic evaluations at this stage act as exploratory tools to assist decision-making about pursuing further development, and to provide insights into future cost-effectiveness, particularly for complex interventions47. The current deficit in early economic evaluation extends to encompass related gaps in the evaluation of the environmental sustainability and global applicability of surgical robots48. Early and systematic use of unmet needs analyses, health economic analyses and sustainability analyses can and should serve a vital role in guiding the efficient onward development of devices and avoiding waste49.

Recommendations

Unmet needs analyses and early economic models should routinely be considered before moving into definitive studies42. Examples include headroom analyses to provide early estimates of cost-effectiveness, and economic burden studies to advise on high-priority disease targets. These could provide pilot metrics for expenditure (including time, money, human resource and technical resource) and costs of altered downstream care. Iterative exploratory decision-analytic modeling could inform robotic development as part of the early health technology assessment process47,50.
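As a worked illustration of a headroom analysis, the sketch below computes the maximum additional cost per patient at which a new robotic procedure could still be cost-effective, assuming the most optimistic plausible health gain. The formulation used here (willingness-to-pay threshold multiplied by maximum quality-adjusted life year gain, plus any downstream savings) is a common simplified form and is an assumption of this example, as are the function names and the figures in the usage note.

```python
def headroom_per_patient(
    max_qaly_gain: float,
    wtp_per_qaly: float,
    downstream_savings: float = 0.0,
) -> float:
    """Maximum extra cost per patient that could still be cost-effective.

    max_qaly_gain: most optimistic plausible QALY gain versus current care.
    wtp_per_qaly: payer willingness-to-pay threshold per QALY.
    downstream_savings: expected per-patient savings in later care.
    """
    if max_qaly_gain < 0 or wtp_per_qaly < 0:
        raise ValueError("QALY gain and threshold must be non-negative")
    return max_qaly_gain * wtp_per_qaly + downstream_savings


def clears_headroom(
    expected_added_cost: float,
    max_qaly_gain: float,
    wtp_per_qaly: float,
    downstream_savings: float = 0.0,
) -> bool:
    """True if the expected added cost per patient fits within the headroom."""
    headroom = headroom_per_patient(max_qaly_gain, wtp_per_qaly, downstream_savings)
    return expected_added_cost <= headroom
```

For example, with a hypothetical maximum gain of 0.125 QALYs, a threshold of 20,000 per QALY and 500 in downstream savings, the headroom is 3,000 per patient; a device expected to add 4,000 per patient would not clear it even under the most favorable assumptions, which is the kind of early signal that could inform a decision to redesign or halt development.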

Value-of-research studies should identify surgical robots that are unlikely to be successfully implemented into the health system, permitting decision-making on halting research and investment into technologies unlikely to be adopted. Reverse engineering and frugal or alternative surgical robot design (such as handheld platforms) could be explored to reduce cost, improve eventual accessibility across healthcare systems and boost the potential for global health impact51,52.

Sustainability metrics should be recorded during preclinical (device-only) and clinical (device within system) evaluations48. Assessment should integrate a complete life cycle assessment model53. This includes recording resources required to build, run and maintain each device version, along with device design (for example, careful material selection, modular system design, reusable parts) from preclinical stages onwards. Interoperability, parts replacement and maintenance by local teams, especially in low-resource settings, should be considered at the earliest design stages.

Comparative evaluation (IDEAL stages 2b and 3)

Once a stable version of an effective and safe robot has been developed, a comparative evaluation with the current surgical standard should follow. Expert consensus is needed on the nature of the patients and procedures to be studied in trials, and on markers of adequate procedure quality, to avoid bias due to learning curves or wide variations in performance. Evidence from collaborative prospective cohort studies (IDEAL stage 2b) in a range of potentially appropriate settings and indications can provide this, and thereby facilitate definitive randomized comparative studies against an appropriate control group (IDEAL stage 3).

The importance of adequate comparative evaluation before adoption was recently illustrated by the US Food and Drug Administration warning against the use of robotic surgery for the treatment of breast and cervical cancers54. The recommendation on cervical cancer was based on the results of a prospective randomized trial, and a population-based study comparing open versus minimally invasive surgery (including robotic surgery) showing worse disease-free survival and overall survival in patients who underwent minimally invasive surgery55,56. A breakdown of adverse events across all robotic surgeries recorded by the US Food and Drug Administration includes 2,000 events that involved injury to the patient, 17,000 events due to malfunction of specific robots and 294 fatalities57. It is not clear how many of these events could have been avoided by more rigorous evaluation at an earlier stage, but it is undeniable that omitting such evaluation reduces our capacity to limit harm. The key challenges and recommendations of this comparative evaluation stage are considered below, and summarized in Box 2.

Device perspective in IDEAL stages 2b and 3

Key challenges

Surgical robots offer great potential technical advantages including improved precision, dexterity, improved ergonomics, and teleoperation, but they demand new or different resources from healthcare systems (for example, surgical team training, audit and maintenance)1,58. Few definitive high-quality comparative trials have been published, and from these the evidence of benefits of robot-assisted surgery over comparable minimally invasive surgical approaches has been inconclusive55,56,59,60,61. The literature reveals methodological limitations, such as poor reporting of outcome measures, a lack of agreed core outcomes sets, incomplete efficacy or effectiveness assessment, and variable reporting of safety58,62. The rapid evolution of robots poses major problems for evaluation, with newer AI-enabled systems threatening to render current studies outdated before their completion—demanding innovative, iterative evaluation strategies, such as implementation trials63. This uncertainty complicates decision-making about when, how and if a definitive randomized clinical trial should be performed within the evaluation cycle of the surgical robot. Some of the technology incorporated in newer surgical robots could itself provide next-generation evaluation measures, such as computer vision, a domain of AI applied to operative videos with procedural analytics64,65,66,67. However, such outcome measures must themselves be robustly validated before clinical implementation, and their relation to clinical outcomes fully understood.

Recommendations

The comparative stage poses numerous device-related challenges. The benefits and risks of a surgical robot should be documented through well-designed prospective evaluations, capturing clearly defined safety and effectiveness outcomes (including patient-reported outcomes) relevant to a given procedure, surgical speciality or patient population. These studies should proceed in a stepwise fashion according to the IDEAL recommendations, considering and adopting seamless designs for efficiency, where plausible.

Measured outcomes must include well-defined clinical outcomes (ideally from existing consensus core outcome sets), technical outcomes (including those derived from robotic kinematic and haptic sensors), patient-reported outcomes (such as quality-of-life indicators) and wider outcomes that reflect potential robotic disruption (ergonomic benefits, impacts on accessibility to surgery) where relevant. Next-generation outcomes and measures, such as those derived from robotic kinematic sensors, haptic sensors and video data, should be reported where relevant, but should be robustly validated and their associations with clinical outcomes determined.

Randomized controlled trials will serve as the default choice for thorough comparative studies of robotic surgery where preliminary studies suggest a potentially important clinical or economic benefit. Planned prospective implementation trials should be considered only where randomized trials are considered impossible. However, for procedures where a robot system has previously established its superiority over non-robotic surgery in technically similar contexts, and no substantial change in the level of risk is expected, further randomized trials for every new procedure may be unnecessary. In this situation, a collaborative prospective cohort study (IDEAL stage 2b), a prospective implementation trial or a prospective registry is ethically necessary to ensure that a meaningful evaluation of effectiveness and safety is performed as indicated by existing decision-support algorithms68.

In principle, public preregistration of protocols and analytic intent is recommended for all studies, with any post hoc changes recorded. Protocols should specify defined data dictionaries, data recording by independent observers, with independent validation, and calculations of interobserver reliability. Data collection and analysis should be sensitive to, and protected against, conflicts of interest and related biases. The privacy and security implications of capturing, storing and using data from robotic devices should also be considered.
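The interobserver reliability calculation mentioned above can be illustrated with Cohen's kappa, one widely used statistic for agreement between two observers rating the same cases on a categorical scale. The choice of kappa is an assumption of this example, since the text does not name a specific statistic, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b) -> float:
    """Cohen's kappa for two observers rating the same cases.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both observers must rate the same non-empty set of cases")
    n = len(ratings_a)
    # Proportion of cases on which the two observers gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each observer's category frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both observers used one identical category throughout.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance. For ordinal or continuous measures, a weighted kappa or an intraclass correlation coefficient may be more appropriate, and the choice should be specified in the preregistered protocol.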

During evaluation, changes to the technology or procedure may result in unexpected outcomes, which could warrant reevaluation of the robotic surgery at the current or an earlier IDEAL stage. Thresholds for this kind of action should be established in advance, considering trends in outcome data suggesting changes in risk levels, indications for use or device performance. As a guiding principle, major changes in risk should warrant a return to earlier IDEAL stages. An independent expert panel should be involved and work with regulators in making these decisions, including which IDEAL stage study is required.

In cases where a robotic system can perform a procedure that achieves a physiological, clinical or functional effect that was not previously possible, there may be no reasonable comparator. Independent ethical advice should be sought to determine whether control groups for a randomized trial are acceptable, depending on the nature of the presumed benefits and anticipated risks of the procedure and the outcome data available. Where the clinical outcomes of the novel robotic approach are clearly unachievable by other means, randomization of participants may be unethical. Alternate designs to study effectiveness and safety, where possible, should be sought.

Clinician perspective in IDEAL stages 2b and 3

Key challenges

Human factors and ergonomics analysis is crucial during the clinical translation of surgical robots to ensure they are usable, and can efficiently integrate into complex teams and workflows33,69,70. Concerns about the occupational consequences of surgery have led to an interest in ergonomic innovation in surgery, and improved ergonomics is a purported benefit of surgical robotics, but the evidence base is conflicted and of limited quality70. As surgeons gain experience with the robot, their operative skills are expected to improve, a progression described by a learning curve71. Surgeon experience and learning curves are an important source of potential variation and bias in comparative surgical robot trials, with high-quality trials incorporating their effects into analysis72,73,74,75,76. A reliable measure of the learning curve can only be achieved by analysis of meaningful measures of operation quality and patient outcomes77. Learning curve evaluation is important for fair comparative analysis, and for planning and implementing training programs for the surgical team32,78,79. Effective, standardized team training is essential for comparative evaluation and clinical translation, but there is no consensus on developing mandatory training program requirements71,78,80,81.

Recommendations

Human factors should be considered, and behavior change scientists should be consulted during the evaluation of surgical robots to examine hypotheses generated in earlier IDEAL stages—evaluating features such as workflow, variations in system use, ergonomic risk assessment, data collection capabilities of the device, teamwork, nontechnical skills and workspace analysis.

Analyzing learning curves is essential in evaluating new technologies, including surgical robots. Large prospective cohorts (IDEAL 2b studies) offer the first opportunity to capture real-world learning curves for surgical robots, and should be used to study their complexity and improve our tools for evaluating them. Metrics gathered from direct supervision, objectively defined criteria, cadaver laboratories and simulator training should be standardized and used for assessment of real-world learning curves. The performance plateau should be continuously monitored to detect changes over time, studying the effects of factors that may influence surgical performance such as casemix, team changes, and changes in the surgical environment. Criteria for the minimal acceptable level of plateau performance should be agreed for surgeons to practice independently or take part in definitive pivotal comparative studies with the robot, using objective measures of procedural quality. Statistical exploration of learning effects (such as sensitivity analysis or extensions of the primary analysis82) should be included in trial protocols to identify and adjust for learning curve bias. Training mechanisms should be audited for impact and iteratively improved to meet user needs37. Programs should directly attempt to track the learning curves seen in surgical robotics training and investigate techniques such as mentoring approaches to shorten them or minimize any effect on patients. Training courses should be validated by evidence of correlation between course evaluations and clinical performance. Institutional clinical governance policies should require the development and use of consistent criteria pertaining to surgeon training and outcomes to monitor continued learning.
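As an illustration of how learning effects might be explored statistically, the following is a minimal sketch of a cumulative sum (CUSUM) chart, one commonly used way of visualizing a surgical learning curve. It is not a method prescribed by the Colloquium. The outcome sequence, the acceptable failure rate of 0.25 and the function names are hypothetical values chosen only for illustration.

```python
from typing import List


def cusum_curve(outcomes: List[int], acceptable_failure_rate: float) -> List[float]:
    """Return the running CUSUM over consecutive cases.

    outcomes: 1 if the case failed a predefined quality criterion, else 0
              (hypothetical data; the criterion must be defined in advance).
    acceptable_failure_rate: target failure rate (an assumption, set by protocol).
    The curve rises while failures exceed the target and falls once
    performance is better than the target.
    """
    curve = []
    running = 0.0
    for failed in outcomes:
        running += failed - acceptable_failure_rate
        curve.append(running)
    return curve


def peak_case(curve: List[float]) -> int:
    """1-based case number at which the curve reaches its maximum.

    A sustained fall after the peak suggests that performance has moved
    below the acceptable failure rate; it does not by itself prove a plateau.
    """
    return curve.index(max(curve)) + 1


# Hypothetical sequence of 12 consecutive cases for one surgeon.
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
curve = cusum_curve(outcomes, acceptable_failure_rate=0.25)
turning_point = peak_case(curve)
```

In practice, such a chart would be built on objective, prespecified measures of procedural quality and adjusted for casemix, and formal decision limits would be needed before drawing conclusions about an individual surgeon.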

In the case of autonomous systems, learning curves will likely be linked to the evolution of trust in the AI application. Proxies for clinicians’ trust in the autonomous components (such as how often they are used or how often manual override is required) should be studied and presented with learning curve analysis.

Patient perspective in IDEAL stages 2b and 3

Key challenges

Patient acceptability is increasingly important when implementing healthcare interventions, but this is difficult to define or assess in relation to surgical robots16,83. Acceptability is important for IDEAL stage 2b/3 studies as patients must provide fully informed consent to studies before enrollment. Very few patients have a comprehensive understanding of what a surgical robot is or does, of the current evidence about the potential and proven risks and benefits of a surgical robot or of the degree of autonomy of robots during surgery84. Patient perceptions of likely benefit or harm may be affected by media ‘spin’ or by industry and marketing psychology44. This may affect patient preference for one treatment over another42, and this could contribute to the challenges of randomization or trial recruitment. Therefore, it is important that patients are provided with a clear, accurate nontechnical explanation of the evidence on the established benefits, known risks and gaps in knowledge about robotic surgery in their specific context, and protected from potential bias from developers and robotic enthusiasts.

Recommendations

Although no universal definition of patient acceptability exists for surgical robots, it should be considered as including (1) patient perception (personal and societal views, the degree of trust within a patient–doctor relationship and wider system), (2) patient understanding (procedure, risk, equipoise and device) and (3) patient consent (informed consent, full disclosure of conflicts of interest).

The consent process should not be contaminated by surgeon bias or patient misinformation. Potential alternatives to the traditional consent process include using research nurses or computer decision-support programs. Surgeons involved in the process of consent for robotic surgery trials should undergo training to minimize unconscious bias85.

Patients involved in IDEAL stage 2b or 3 studies should be informed about their surgeon’s current level of experience with the proposed robotic platform and procedure, encompassing information on both local outcomes and complications for robotic and alternative (that is, standard-of-care) procedures. If an accurate assessment of the learning curve is available, this should be disclosed.

System perspective in IDEAL stages 2b and 3

Key challenges

A broad systems perspective is needed during comparative surgical robotic evaluation38. Surgical robots must be economically viable, and the costs of purchase, maintenance and repairs must be fully evaluated11,13. Increasing attention is being paid to the environmental impact of surgery; thus, the impact of robotics should be measured and justified in terms of global Net Zero initiatives53,86,87. Adoption of single-use robotic tools presents a concern in this regard.

Existing efforts reporting on the resource use, greenhouse gas emissions and material footprints associated with robots require extension to provide impact comparisons with existing technologies53. The Lancet Commission on Global Surgery highlighted the huge global unmet need for timely and effective surgical services, particularly in low-income settings88. An in-depth understanding of each surgical ecosystem will be needed before a decision on integration of robotics49,88,89. This includes understanding challenges such as inconsistent access to electricity, clean water, operating rooms and certified surgeons, equipment sterilization procedures, maintenance of equipment and inconsistent funding, which may render robotic surgery infeasible. In resource-poor settings, there is a clear opportunity cost of introducing surgical robots, which may squander scarce resources, and be impossible to maintain, resulting in net harm and perpetuating healthcare inequality81. From an ethical viewpoint, it is important to consider the impact of robotics on access to care for relatively disadvantaged populations in all healthcare systems90.

Recommendations

Analysis of healthcare costs associated with robotic intervention and control treatments should be routinely included in comparative surgical robotic studies. Economic studies should include clinically and system-relevant outcomes over a sufficient length of follow-up to compare a surgical robot to current surgical practice. Established international frameworks such as those published by ISPOR should be used to evaluate health economics and outcomes research91,92. Decision-analytic modeling should be used in IDEAL stage 2b studies. IDEAL stage 3 studies should incorporate formal economic evaluations providing trial-based cost-effectiveness analyses that follow established reporting guidelines, such as the Consolidated Health Economic Evaluation Reporting Standards93.
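To make the idea of a trial-based cost-effectiveness comparison concrete, the sketch below computes two standard summary measures: the incremental cost-effectiveness ratio (ICER) and the net monetary benefit. It is a simplified illustration under stated assumptions, not a substitute for a full economic evaluation. All costs, quality-adjusted life year (QALY) values and the willingness-to-pay threshold are hypothetical.

```python
def icer(cost_new: float, cost_comparator: float,
         effect_new: float, effect_comparator: float) -> float:
    """Incremental cost per unit of additional effect (for example, per QALY).

    A negative or undefined ratio needs separate interpretation (dominance),
    which this sketch does not attempt.
    """
    delta_cost = cost_new - cost_comparator
    delta_effect = effect_new - effect_comparator
    if delta_effect == 0:
        raise ValueError("No difference in effect; the ICER is undefined.")
    return delta_cost / delta_effect


def net_monetary_benefit(delta_cost: float, delta_effect: float,
                         willingness_to_pay: float) -> float:
    """Positive values favor the new intervention at the chosen threshold."""
    return willingness_to_pay * delta_effect - delta_cost


# Hypothetical mean per-patient values from a comparative study.
robot_cost, standard_cost = 12000.0, 9000.0
robot_qalys, standard_qalys = 8.5, 8.25

ratio = icer(robot_cost, standard_cost, robot_qalys, standard_qalys)
nmb = net_monetary_benefit(robot_cost - standard_cost,
                           robot_qalys - standard_qalys,
                           willingness_to_pay=20000.0)
```

With these illustrative inputs, the robot costs 3,000 more and yields 0.25 additional QALYs per patient. A real analysis would also report uncertainty around both differences, for example through sensitivity analysis.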

Although the Colloquium acknowledges the substantial barriers to implementing robotic surgery programs in low-resource settings, it is possible that future advances may reduce these. Therefore, stakeholders from low-income countries with an interest in robotic surgery should be encouraged to join discussions and provide insights into how robotic surgery might become more feasible and beneficial in such settings once its value in higher-income settings is established.

To delineate whether a surgical robot would result in net health benefits while remaining cost-effective in low-income settings, a rigorous modeling approach can be applied. This should include metrics on robot effectiveness and safety; health economic and sustainability analysis; and specific capacity metrics for the target healthcare environment—such as basic infrastructure (including energy and information technology services), healthcare infrastructure, necessary human resources (entire surgical team), medical supplies, critical care capacity and healthcare funding. The goal of this process is to estimate the robot’s impact within lower-resourced ecosystems, determining an environment’s readiness for downstream robot integration.

A modeling approach can also be applied to identify major risks to fair distribution of benefits within higher-income contexts. If modeling reveals concerns regarding equity of access, safety, cost-effectiveness or readiness, then a plan for local capacity building should be developed and its implementation monitored before robots are introduced. Efforts to uphold fairness by increasing access to successful innovation internationally should be supported by nongovernmental organizations, governments and the robotic industry, and by existing infrastructure, such as the SAFROS project to address current inequities in access to safe surgery.

The sustainability and economic evaluation of a surgical robot should include a complete life cycle assessment considering how the surgical robot changes practice in relation to the surgical procedure, manufacture and maintenance, type and amount of waste generated, and reusable and single-use items. Any projected increase in carbon footprint compared with continuing with non-robotic surgery should be assessed, minimized and offset where possible (for example, switching from consumable to reusable components) and should be justified in terms of other quantifiable benefits (for example, improved patient outcomes, and downstream economic and environmental benefits).

Long-term monitoring and technological evolution (IDEAL stage 4)

Following comparative evaluation and widespread adoption, the focus shifts to long-term monitoring of performance in real-world settings. Registries are the predominant methodology in this stage of evaluation11, but ownership and curation of robotic registries by commercial groups can introduce risks of bias and lack of transparency. Other prospective methods of long-term study, such as observational cohort studies, have limitations including fragmentation, maintenance costs and lack of comparability. In an increasingly digitalized healthcare landscape, real-world datasets (RWDs) leveraging data collected for clinical care or administrative purposes have become important potential data sources for the evaluation of health interventions94. However, valid studies based on RWDs need standards to guide their design and reporting, and safeguards for privacy and data security. Expanding on the IDEAL framework, targeted recommendations specific to IDEAL stage 4 study designs are needed to inform their methodologies and analytics. The key challenges and recommendations of this long-term monitoring stage are considered below, and summarized in Box 3.

Device perspective in IDEAL stage 4

Key challenges

Long-term monitoring of a surgical robot’s real-world performance is critical for the safety, evolution and longevity of a device. This could best be achieved by device developers working with regulators, providers, insurers and other stakeholders to create international surveillance systems7. The developers of surgical robots have a duty to ensure that patients and scientific evaluators have the best possible evidence to fulfill the ethical requirements for autonomy and non-maleficence, respectively, and this needs comprehensive, unbiased outcome data from real-world settings. Many existing device monitoring systems are criticized as passive and inconsistent, underreporting incidents and therefore lagging behind analogous systems for drug monitoring95,96,97. Given the current lack of incentives to evaluate, it is unsurprising that existing evidence on devices is weak, and efforts to curate data are fragmented, reducing comparability and scope for analysis98,99.

Manufacturers, hospitals and insurers curate and maintain datasets, but have few incentives to make them widely accessible, while commercial, and sometimes regulatory, issues also inhibit full disclosure of clinical and technical data95. Registries are the predominant methodology for long-term monitoring of robotic surgical interventions, but these are generally in-house datasets focused on a single robotic system, and usually lack independent validation and/or have limited access to real-world data15. Efforts to link datasets to facilitate better analysis of larger groups are currently limited in their impact and capacity, partly by regulatory issues around data sharing. Stakeholder collaboration at all levels (individual, organizational, system, international) is required to generate high-quality data, as seen with the US national device and evaluation system MDEpiNet, which acts as a registry network for specific surgical devices100. To give a full picture, evaluation systems for surgical robots need to go further, supplementing standard outcome measures (that is, effectiveness, safety and economic outcomes) with complementary datasets, including machine-generated activity data, data from human factors analyses and data to monitor the dynamic nature of AI incorporated into surgical robotics101.

Recommendations

In principle, best practice should be followed using established design and reporting guidelines, and prospectively collected high-quality data102. Integration of RWDs should be encouraged if quality can be assured. Data should be collected and analyzed by groups independent from those producing it. The roles and conflicts of interest of those producing and curating data should be transparent and available.

Datasets should include, but not be limited to, patient population demographics, disease characteristics, device characteristics, device indications, type of setting, clinical outcomes, economic outcomes, low-level technical outcomes, technical failures, adverse events, changes in device capabilities and dedicated metrics monitoring AI-system evolution. Reporting of technical failures (including software failures) and patient safety incidents should be mandatory, supported by national regulators and independent of device manufacturers. Rapidly generated, scalable datasets should be developed for widely adopted innovations. Collection and analysis should be fully automated, with harmonized coding language and core reporting and outcome measures.

Regulatory, political and commercial barriers may limit the feasibility of optimal sharing of real-world data. In principle, international collaborative approaches are recommended to produce homogeneous and comparable datasets, with data-sharing agreements giving data access to all stakeholders. Governance of linked datasets should ensure open access to facilitate observational research. Governments, insurers, hospitals and professional associations all have potential roles in this.

Statistical analyses of real-world data should be transparent in their methods, and show how they account for confounding factors, sources of bias and missing data. Analyses should be made accessible according to the FAIR (findability, accessibility, interoperability and reusability) principles103.

AI-enabled and autonomous systems require particular attention. The initial use and indication of use should be clearly stated, and metrics for long-term monitoring of performance and safety established from the outset of clinical use. Performance should be evaluated at regular intervals, with more frequent evaluations of rapidly changing systems. Changes in indication of use, the level of autonomy of the system or performance drift, which might increase the level of risk, will require detailed evaluation. Changes in machine behavior during the monitoring period should be described, with analysis of how the algorithm has changed where this is possible.
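The idea of evaluating performance at regular intervals against metrics fixed at the outset can be sketched as follows. This is a deliberately simple, hypothetical check, assuming a binary success measure per case, a baseline success rate established at first clinical use and a prespecified tolerance. Real surveillance would more plausibly use statistical process control with formal control limits.

```python
from typing import List


def flag_drift(baseline_rate: float,
               interval_outcomes: List[List[int]],
               tolerance: float) -> List[int]:
    """Return the 1-based numbers of evaluation intervals that need review.

    baseline_rate: success rate fixed at the outset of clinical use (assumed).
    interval_outcomes: one list per evaluation interval, 1 = success, 0 = failure.
    tolerance: prespecified acceptable drop in success rate (assumed).
    """
    flagged = []
    for number, outcomes in enumerate(interval_outcomes, start=1):
        if not outcomes:
            continue  # no cases in this interval, nothing to evaluate
        rate = sum(outcomes) / len(outcomes)
        if baseline_rate - rate > tolerance:
            flagged.append(number)
    return flagged


# Hypothetical data: three intervals of 20 cases each.
intervals = [
    [1] * 18 + [0] * 2,   # 0.90 success rate
    [1] * 17 + [0] * 3,   # 0.85 success rate
    [1] * 14 + [0] * 6,   # 0.70 success rate
]
needs_review = flag_drift(baseline_rate=0.9, interval_outcomes=intervals, tolerance=0.1)
```

A flagged interval would indicate only that a detailed evaluation is warranted. It would not by itself identify whether the cause is algorithm change, a shift in indication or a change in the patient population.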

Clinician perspective in IDEAL stage 4

Key challenges

The long-term integration of surgical robots into health systems relies on their adoption by clinicians. The principal challenges from this perspective arise from training, credentialing and determining accountability for adverse outcomes (particularly in the context of robot autonomy and AI). Even technologies that demonstrate safety and efficacy experimentally pose risk to patients in untrained hands, and inadequate training prolongs learning curves, particularly during the long-term study stage, as devices are adopted by new surgical teams101,104,105,106,107. Research attempting to elucidate learning curves associated with surgical robots remains sparse but appears to be developing. While standardized robotic training programs exist for well-established surgical robots, such as the da Vinci, most robotic surgery training remains inconsistent and non-standardized, particularly for novel robots108,109.

There are efforts to address these challenges, such as the multi-institutional validation and assessment of training modalities in robotic surgery (the MARS project), but the optimal strategies for training robotic surgeons are unclear109. Ongoing certification and credentialing based on a regular reexamination of skills is not currently required for robotic surgery, which contrasts with practice in comparable high-risk industries involving complex technologies (for example, aviation)71. Determining accountability for, and analyzing the causation of, adverse events during surgery will be more complicated in a robotic future34,110,111. Communication difficulties due to altered spatial relationships in the operating room, telesurgery, input from company technical experts and, in future, increasing machine AI autonomy all have the potential to diffuse responsibility for decisions33,112,113. Effective monitoring will require routine recording, storing and analysis of granular data including technical, video, audio and IT data streams, which may be needed in the analysis of adverse events, aligning surgery with other high-risk, high-technology processes114.

Recommendations

Novel training methods should undergo evaluation using appropriate frameworks for determining validity (for example, Messick’s framework115). They should specify the aims of the training and use an appropriate educational paradigm. These studies should inform standardized training programs, which receive oversight from recognized accrediting bodies and are independent from industry partners. Where validated methods exist, surgeons using a robotic system should undergo regular revalidation with holistic assessments of performance through assessment of technical and nontechnical skills. Novel methodologies including automated performance metrics, AI-driven credentialing and operative video assessment should be adopted if validated. Ongoing credentialing and revalidation should include assessments of skills necessary to operate the device, but also the availability of skills in techniques needed to safely manage emergencies using alternative approaches, whether by the same surgeon or another.

A human factors expert should be included in the analysis of all serious adverse events involving a surgical robot. Adverse events/errors should be analyzed using data including technical, usability, interface and system integration failures. Governance for robotic surgery, particularly where AI systems with autonomy are involved, needs to evolve so that it can determine appropriate responsibility for monitoring, accountability for adverse events and responsibility for implementing improvements. This will require collaboration between legislators, healthcare organizations, professional bodies and industry.

Processes for monitoring the unplanned evolution of aspects of machine-learning-enabled AI should be iteratively reviewed, as human experience of this activity is in its infancy. Reevaluation of processes and algorithms should take place at regular intervals, and whenever evolving aspects (for example, level of autonomy or drift in target population) cause substantial changes in performance.

Patient perspective in IDEAL stage 4

Key challenges

As with all IDEAL stages, patients are the most important stakeholder when evaluating surgical robotics, as the recipients of both benefits and harms. Patient perceptions are influenced by exposure to the views and agendas of other stakeholders, for example, manufacturer marketing and clinician enthusiasm. However, patients have limited access to scientific evidence, which may be further restricted due to regulatory/approval processes. They may be falsely reassured that a robotic system is well established and safe, without specific evidence for the indication for which it is being offered (procedure creep). They are unlikely to be cognizant of iterative changes to a surgical robot that render it different from the device upon which initial evidence was generated (device creep), making it important that this type of information is explicitly mentioned during the consent process.

Recommendations

Comprehensive robotic surgery registries and/or systems for extracting reliable information from existing real-world data sources should be made accessible and understandable to patients by providing lay language explanations of their outputs.

Current data should inform the consenting process; evidence referred to when seeking informed consent must relate to the indication and not simply to the device, since robotic systems may be used for many different procedures (procedure creep). The informed consent process should routinely seek patients’ general consent for future use of anonymized data for research and safety surveillance to maximize the value of health data.

Finally, where mechanisms to facilitate this exist, public and patient involvement should inform the design of IDEAL stage 4 studies and outcome measures to ensure they remain patient centered.

System perspective in IDEAL stage 4

Key challenges

The evaluation of the wider systems impact of robotic systems needs to continue in the long term, to track the cost-effectiveness and sustainability of their integration into healthcare systems with varying resources and capacity. Health economic analyses need to be iteratively updated with real-world data, and should remain free from restriction or private interests to maintain transparency116. Costs will be impacted by learning curves, technical errors, system failures, dynamic pricing and other factors. This means that real-world data, including health data, administrative claims data and prospective observational studies, are essential in modeling the true value of robotic systems in IDEAL stage 4. Potential access and equity issues accompanying these high-cost investments must be considered117, meaning resource allocation requires justification in terms of its place among competing choices. Ethically, providers must consider the benefits of robots against wider health system needs, and rationally allocate limited resources to high-priority issues.

Similarly, strong arguments are needed in favor of robotic surgery to counterbalance environmental impacts seen through life cycle assessments. This issue makes an argument for innovators to adopt sustainable practices in the development, implementation and maintenance of robots. Innovators should measure and minimize environmental harms of robots, ideally through open, transparent datasets, such as the HealthcareLCA repository118, fostering collaborative investigation of their impact in real-world settings. Outside experimental evaluation settings, complex interventions enter complex adaptive systems, with potentially unforeseen ‘emergent’ consequences. True performance will only be revealed in real-world settings and must be monitored to avoid unrecognized gradual decline in safety or effectiveness. This demands the development of monitoring infrastructure, processes and governance.

Recommendations

Cost-effectiveness analyses using decision-analytic modeling of real-world data should evaluate robotic systems by indication, and provide comparable analyses openly available to all stakeholders. These should use validated outcome metrics and comply with ISPOR guidance. In principle, regular reviews of robotic surgery cost-effectiveness should include an assessment of changes in organizational configurations and their influence on processes/outcomes, where the necessary resources are available.

National and international discussion forums involving clinicians, patient advocacy groups, industry, policymakers, ethicists and economists are needed to consider the potential effects of robotics on equity of healthcare access, and to explore models that might justify use in low-income settings. Advice from public health experts, policymakers, ethicists and climate scientists should be considered in discussions of how robotic surgery platform design, development and use could be made more sustainable.

In principle, complete life cycle assessments of surgical robotics should incorporate a broad range of parameters, be guided by environmental experts, produce data without restriction and contribute toward living open-access data repositories. Moreover, complete life cycle assessments of surgical robotics should be iteratively updated against existing care standards in real-world settings to monitor and minimize their environmental impacts through quality improvement. These recommendations will require further development of collaborations, datasets and resources.

Conclusion

The next generation of surgical robotics is poised to transform healthcare systems around the world. Whether this will result in substantial patient and societal benefit depends critically on whether innovation is guided by appropriate evaluation. This Colloquium has provided key recommendations for the evaluation of surgical robots across their developmental life cycle, mapped to the IDEAL evaluation framework.

Our analysis presents practical recommendations to guide robotics developers, clinicians, patients and wider systems as we enter the next era of surgical robotics. For all stages of evaluation, all stakeholders should be considered at the outset, including the surgical team (human factors analysis and training), patients (acceptability and rigorous ethical assessment) and the wider system (economic and sustainability evaluation). Further work is needed to establish standardized metrics for technical and clinical outcomes, refine health economic assessment models and assess the global representativeness of these recommendations.

No framework that deals with such a broad range of evaluation challenges can hope to avoid conflicts between recommendations, or situations where recommendations may appear disproportionate to the problem addressed. Such dilemmas with IDEAL recommendations are usually easily resolved by referral to the underlying principles mentioned in Methodology. The breadth of the subject also raises the question of which evaluation recommendations are relevant in the context of which particular studies. Clearly, incorporating all possible aspects in any single study would be infeasible and unnecessary, but sensible judgment, involving discussion where necessary with relevant subject experts, should allow this guidance to be of practical use to clinicians, robotic engineers, patients and other stakeholders in the development of robotic surgery.