A large proportion of the world’s population undergo surgical interventions during their lifetime, sometimes repeatedly. Surgical interventions (also known as ‘medical interventions’) are defined as any procedure on the human body with a therapeutic purpose, which includes invasive (open) or minimally invasive (laparoscopic, robotic, endoscopic or percutaneous) procedures. The World Health Organization (WHO) recognized in 2008 that complications of surgical interventions are a major burden and a global public health issue1. Ten years later, postoperative complications were described as a hidden pandemic with largely under-recognized causes, which are often avoidable2. A major barrier to reducing the burden of surgical interventions is the scarcity of data to act upon, and even if available, data on outcomes after surgical interventions are often of poor quality. The lack of consistent reporting is well highlighted in the medical literature3, and even top surgical journals often fail to provide proper information on postoperative events, for example, in defining the severity of complications or providing sufficient follow-up for assessment. Postoperative complications not only cause suffering and dissatisfaction with a reduction in quality of life (QoL) for patients, but also have serious implications on many levels of society and are associated with tremendous financial cost4,5.

The first step to preventing harmful events after an intervention and allowing for credible comparisons of competing therapies or care providers is to develop standardized tools assessing both the positive and negative outcomes of a procedure. Such tools must be relevant for patients and health care providers, as well as all other stakeholders within society, and must be widely accepted among various health care systems and cultures. Considering that the subject area remains complex and tools for outcome measurements covering perspectives of a broad range of stakeholders are not available, we opted for the format of a consensus approach to develop guidelines on how to assess the outcomes of a surgical intervention. The assumption was that the best available evidence, together with a consensus developed among diverse representatives of society, would yield the most convincing approach for broad adoption.

For this purpose, we relied on the Zurich–Danish model, where an independent Jury frames recommendations based on evidence reports prepared by a multidisciplinary panel of experts and on its own deliberations6,7,8,9. To our knowledge, our consensus conference is the first attempt to include a broad range of perspectives from various stakeholders affected by the quality of surgical outcomes. The resulting recommendations are intended to provide a general framework for surgical outcome assessment that can be adapted by researchers and health care providers for specific patient populations and medical interventions.


Zurich–Danish model for consensus building

The Zurich–Danish model6 aims at producing evidence-based, internationally valid and unbiased recommendations that consider the perspectives of many stakeholders, including patients and health care providers, as well as payers or governments. We have previously used this approach to develop a consensus in the area of liver transplant for hepatocellular carcinoma7 and treatment options for neuro-endocrine liver metastases8, as well as for the selection of an academic chair in Medicine9.

The principle relies on a clear distinction between those who provide the evidence (the experts) and those who draw the final recommendations (the Jury). The Jury consists of individuals with sufficient background knowledge to cover the perspectives of a wide and important range of stakeholders, without being directly involved within their professional spheres in the topic under evaluation. The organizing committee, the experts and the Jury interact in three phases (Fig. 1) — that is, the preparation phase, the in-person consensus conference and the Jury deliberations. Each panel of experts addresses their specific question in the year-long preparation phase and then proposes evidence-based recommendations at the conference meeting. The answers to the questions are communicated to the Jury in writing at least three weeks before the meeting, with possible interaction in the interim between Jury members and panel chairs. At the conference, the Jury and the audience challenge these recommendations by asking questions and offering comments. Based on all the presented information, the Jury finalizes the consensus recommendations, which are then made available to the public.

Fig. 1: The Zurich–Danish consensus conference.

The three phases of the Zurich–Danish consensus conference, an independent jury-based consensus conference model for the development of recommendations in medico-surgical practice. Adapted from Lesurtel et al.6.

Expert and Jury recruitment

Experts were recruited through four channels: (1) invitation to senior authors of relevant publications identified by the Local Organizing Committee (LOC), (2) invitation to experts recommended by panel chairs or members (snowballing technique), (3) call to patient- and scientific organizations to participate in one of the panels and (4) consideration of any expert contacting the LOC directly, after validation of their expertise.

The constitution of the Jury by the LOC started with a list of perspectives and priorities to be covered, with consideration of geographic and gender balance. We also favored individuals previously involved in consensus conferences with a similar format; for example, we recruited Carmen Walbert as President of the Jury owing to her active participation in a previous consensus conference on how to select an academic chair9. To secure proper patient representation, we contacted the organization EUPATI (European Patients’ Academy on Therapeutic Innovation), which issued a call to their members to send us resumes and letters of intent to participate. To prevent any conflict of interest, Jury members were not directly involved in surgical or medical outcomes research.

Topics and panels

An initial list of topics was prepared by the LOC and experts with relevant publications or opinion leaders in their respective fields were invited to participate. Topics were subsequently discussed and modified by invited faculties. Nine panels composed of four to five recognized international experts with different perspectives and geographic origin were proposed by the LOC, selected to provide a wide breadth of expertise. To maximize the relevance of the topics, the respective panel chairs and members could adjust their specific questions as needed to better cover their respective topics. Panels one through five focused on the various stakeholders’ perspectives, while panels six through nine concentrated on specific aspects of outcome measurement, analysis and interpretation. Each panel had the task of answering three to five questions on a specific topic (Box 1). The full list of the panel chairs, panel members and jury members can be found at the end of the text.


Standardized time points for outcome assessments

First, the Jury recognized that outcome assessment is a dynamic process that requires standardized time points of observations. There is currently no agreement as to when outcomes should be captured and an urgent need to move away from historically collected discharge or 30-day data only10,11. Through the panel presentations and discussions at the consensus conference, the Jury saw a need for standardized time points of outcome assessment to ensure comparability of outcomes. They proposed five fixed time points (Box 2). The first assessment should capture the pre-disease state — meaning a time before the patient had their condition — and should include some information on their quality of life (T0), followed by a recording of disease state and related symptoms before the intervention (T1), outcomes during the early postoperative phase (T2), mid-term (T3) and long-term (T4) (providing data five years after intervention).

Pre-disease conditions or QoL are difficult to assess retrospectively, but information on employment status, exposure to risk factors or other health behaviors is relatively reliable. T1 should include information collected a few days before the intervention. T2 and particularly T3, referring to the length of mid-term follow-up, should be disease-, procedure- and context-dependent. Ideally, the optimal length of mid-term follow-up should be defined by research, as it can vary greatly among individual procedures; for example, three months for liver resection12, six months for pancreatic resections13 and over one year for liver transplantation14. The T4 assessment should be carried out five years after the intervention (open ended from then on) and very long-term follow-ups should also be considered when appropriate.
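As a minimal sketch of how the five time points could be represented in a data collection tool, the following Python snippet lists T0 to T4 and reports which assessments are still missing for a patient. The record layout, the function name `missing_time_points` and the example values are hypothetical and serve only as an illustration; they are not part of the Jury's recommendations.

```python
from enum import Enum


class TimePoint(Enum):
    """The five assessment time points proposed by the Jury."""
    T0 = "pre-disease state, including some quality-of-life information"
    T1 = "disease state and symptoms shortly before the intervention"
    T2 = "early postoperative phase"
    T3 = "mid-term follow-up (length depends on disease and procedure)"
    T4 = "long-term follow-up, five years after the intervention"


def missing_time_points(recorded):
    """Return the time points that have no assessment in `recorded`.

    `recorded` is a hypothetical mapping from TimePoint to any
    assessment payload; a missing key or None counts as not assessed.
    """
    return [tp for tp in TimePoint if recorded.get(tp) is None]


# Illustrative record: only the preoperative and early postoperative
# assessments have been captured so far.
example_record = {
    TimePoint.T1: {"symptoms": "illustrative entry"},
    TimePoint.T2: {"complications": ["II"]},
}

print([tp.name for tp in missing_time_points(example_record)])
# ['T0', 'T3', 'T4']
```

A check of this kind could flag incomplete follow-up before outcomes from different institutions are compared.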

Outcome assessment goes beyond mortality

The health care providers’, mostly physicians’, perspective has been the only (or predominant) view on outcomes for a long time, typically by just reporting on short-term, for example, 30-day, mortality rates. With the dramatic decrease in perioperative mortality rates following most procedures, the focus has turned toward postoperative morbidity. However, reporting on complications has been inconsistent and notoriously lacking information on the severity of the respective events and their time of occurrence, making the evaluation of surgical procedures a ‘comic opera’15. Reliable comparisons of postinterventional outcome are only possible if results are uniformly and comprehensively reported. The ideal outcome measures should be relevant to most procedures, collected in ways that minimize bias, and interpretation must eventually be widely accepted to generate a universal language.

Postoperative negative events can be divided into three categories16. First, failure to cure — indicating that the objective of an intervention was not achieved (for example, no curative resection of a malignant tumor); second, sequelae when the negative event is inherent to the procedure (for example, amputation of a leg inevitably leads to invalidity); and third, complications covering all other events. Terms like major, severe, minor, serious, mild and intermediate must be avoided unless clearly defined.

A few systems to classify complications have been proposed, including the Clavien–Dindo17,18, Accordion19 and Memorial Sloan Kettering Cancer Center classifications, all of which are based on an inaugural proposal made in Toronto, Canada, in 1992 to critically assess the introduction of laparoscopic cholecystectomy16. To secure a universal language, there is a need to select the best system, which should be precise, reproducible, intuitive and quantitative, and should minimize biases in data collection. Irrespective of the system, data collection is best done independently by dedicated staff, instead of surgical interns or residents in training20. The Clavien–Dindo classification (Table 1), which has been applied to most fields of surgery, fulfills these criteria best and has been widely utilized.

Table 1 The Clavien–Dindo classification17,18

A limitation of the Clavien–Dindo and other classifications is that the full description of complications must be tabulated; therefore, they are difficult to use for outcome comparisons. Additionally, most studies only capture the most severe complications while omitting complications of lesser degree21. To address this limitation, the Comprehensive Complication Index (CCI), based on the Clavien–Dindo system, was developed to assess overall morbidity by capturing all complications in a single patient22,23. The patients’ perspective was explicitly considered in the development of the CCI by allotting weights from the patient view to the respective complications. The CCI expresses the cumulative burden with a single normalized metric ranging from 0 (no complication) to 100 (death) and accounts for both the number and severity of the complications. The CCI has been validated in several independent patient cohorts23,24,25, correlates highly with cost26,27 and has proven to be a highly sensitive endpoint for randomized trials28,29. A web application is available for the calculation of the CCI.
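The limitation described above, namely that reporting only the most severe complication hides lower-grade events, can be illustrated with a short Python sketch. It uses only the ordering of the Clavien–Dindo grades; the patient data and function names are hypothetical, and the sketch deliberately does not reproduce the CCI weights or formula, which should be taken from the cited CCI publications.

```python
# Clavien-Dindo grades in increasing order of severity.
GRADE_ORDER = ["I", "II", "IIIa", "IIIb", "IVa", "IVb", "V"]


def highest_grade(grades):
    """Most severe grade for one patient, or None if no complication."""
    if not grades:
        return None
    return max(grades, key=GRADE_ORDER.index)


def grade_counts(grades):
    """Number of complications per grade for one patient."""
    return {g: grades.count(g) for g in GRADE_ORDER if g in grades}


# Two hypothetical patients with the same most severe complication
# but a very different overall burden.
patient_a = ["IIIa"]
patient_b = ["I", "II", "II", "IIIa"]

print(highest_grade(patient_a), highest_grade(patient_b))  # IIIa IIIa
print(grade_counts(patient_a))  # {'IIIa': 1}
print(grade_counts(patient_b))  # {'I': 1, 'II': 2, 'IIIa': 1}
```

Both patients look identical when only the highest grade is reported, whereas the per-grade counts show the additional events in the second patient. A cumulative index such as the CCI is intended to capture that difference in a single number.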

There are other metrics such as the ‘textbook outcome’ approach, which refers to the proportion of patients without any negative events or just minimal deviation from the optimal clinical course30. For example, in pancreatic surgery, a textbook outcome is defined as a patient without any pancreatic fistula, bile leak, severe complications or readmission after being discharged31. Readmission rate is another frequently used parameter, as are length of hospital stay, days alive out of hospital (DAOH) and treatment costs. In this context, DAOH represents a more global outcome that includes all reasons for hospitalization (medical issues, adjuvant therapy, and so on) and is therefore more patient-centered32. An extension of this concept, known as ‘failure to rescue’, has been developed as a new indicator of quality to highlight the ability of superior centers to recognize complications at an early stage, and therefore to properly treat those complications, thus minimizing the risk of death. Based on the extensive literature, expert panels’ assessments, and thorough discussions at the consensus conference, the Jury proposed that, as a minimum, the Clavien–Dindo classification, the CCI and failure to rescue should be used to assess outcomes when it comes to postoperative complications (Box 2).
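The two rate-based metrics mentioned above can be sketched in a few lines of Python. The textbook outcome check follows the pancreatic surgery definition given in the text. The failure-to-rescue rate is computed here as deaths among patients who had a complication, divided by all patients who had a complication; this is a commonly used formulation that is assumed for illustration and is not defined in this text. The class, field names and example cases are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PancreaticCase:
    """Hypothetical per-patient record after pancreatic surgery."""
    pancreatic_fistula: bool
    bile_leak: bool
    severe_complication: bool
    readmitted: bool
    any_complication: bool
    died: bool


def is_textbook_outcome(case):
    """No fistula, bile leak, severe complication or readmission."""
    return not (
        case.pancreatic_fistula
        or case.bile_leak
        or case.severe_complication
        or case.readmitted
    )


def textbook_outcome_rate(cases):
    """Proportion of patients who reached a textbook outcome."""
    return sum(is_textbook_outcome(c) for c in cases) / len(cases)


def failure_to_rescue_rate(cases):
    """Assumed definition: deaths among patients with a complication,
    divided by all patients with a complication."""
    with_complication = [c for c in cases if c.any_complication]
    if not with_complication:
        return None  # undefined when nobody had a complication
    return sum(c.died for c in with_complication) / len(with_complication)


# Four illustrative cases (not real data).
cases = [
    PancreaticCase(False, False, False, False, False, False),  # uneventful
    PancreaticCase(True, False, False, False, True, False),    # fistula, survived
    PancreaticCase(False, False, True, False, True, True),     # severe, died
    PancreaticCase(False, False, False, True, False, False),   # readmitted
]

print(textbook_outcome_rate(cases))   # 0.25
print(failure_to_rescue_rate(cases))  # 0.5
```

In practice, the exact components of a textbook outcome and the complication severity threshold used for failure to rescue vary by procedure and study, so both would need to be fixed before any comparison.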

The Jury agreed that proper assessment from the surgical perspective must include regular, interdisciplinary morbidity and mortality conferences with the intent to reflect and learn from adverse patient outcomes in real-world practice, and to find solutions to reduce the risk for adverse outcomes in the future. Good outcomes in challenging cases should also be discussed at morbidity and mortality conferences, to better understand the ‘favorable’ factors affecting outcome33.

Patients at the center of their outcome assessment

While most metrics, except the CCI, were developed exclusively from the health care providers’ perspective, modern medicine has begun to reset its focus on the most central stakeholder, the patient, to deliver more patient-centered and holistic care.

For patients, many of the data recorded by their physicians may seem abstract. They also may give more value to their functional status after an intervention than to the quality of the non-medical services provided, such as quality of food or comfort of the hospital room. Patient-reported outcome measures (PROMs) allow for quantitative measurement and continuous improvement of these outcomes. PROMs should be used to ensure that the patients’ voice is heard and incorporated into clinical decisions, such as in shared decision-making, which is a cornerstone of patient-focused medical practice. The incorporation of PROMs into the clinical care pathway not only highlights patients’ perception of their treatment but can also change how patients think about their condition and can even improve survival rates, as shown, for example, in lung cancer studies34. PROMs can also improve the quality of interventions by considering outcomes that are inadequately represented by metrics relating only to a short-term interventional perspective; rather PROMs should (and often do) include questions relating to the entire care pathway and its integration, including the transitions of care. The Jury decided to recommend the use of PROMs in routine clinical care and research but refrained from making recommendations on the use of specific instruments. The choice of psychometrically validated PROM instruments depends on the patient population, intended use in clinical practice, and on the time frame of outcome measurement and comparability or quality improvement efforts by others.

While it is standard in some health care systems to inform and engage patients in decisions about their treatment options, patient passivity remains an issue. The challenge lies in how to effectively engage patients and to accompany them in understanding the process and benefits of shared decision-making with their physicians. Playing an active role in decision-making can be challenging for some patients, owing to the high cognitive and emotional burden required35. To truly empower patients in the process of shared decision-making, coaching patients and supporting them in self-management is of utmost importance. Offering access to adequate information on the disease, treatments and outcomes allows the patient to understand what to expect in the future and adapt to living with their disease, which is highly relevant in the case of chronic disorders. For the health care provider, this means tailoring the information presented to the individual patient, for example, using well-developed decision aids36,37. While the patient is at the center of the conversation, it is important to also include loved ones such as family members or caretakers, as surgical interventions and their outcomes will also influence the people close to the patient and the relationships between them. Mutual trust between patients and their health care providers is fundamental to optimize care, and can be achieved based on empathy, kindness and a positive patient-centered environment.

Creating a trustworthy and empathic environment and listening to the patient must be part of any inclusive outcome assessment, making patient-reported experience measures (PREMs) a relevant metric for optimal patient care. Unlike PROMs, which assess patients’ health status, PREMs evaluate patients’ personal experience of receiving care; they should be monitored by means of questionnaires and interpreted by independent staff members (for example, study nurses).

The Jury found that the need for more patient-centered assessment and treatment is immense. They recommend internationally standardized outcome measures like PROMs and PREMs to facilitate holistic patient management and emphasize the importance of communication between health care providers and patients (Box 2). Health care providers must communicate clearly with patients to ensure that they fully understand their condition and the potential consequences of an intervention. To achieve this goal, providers may benefit from formal training. However, through the process of engaging in shared decision-making and being truly empowered, patients themselves also take on a share of the responsibility for their outcome.

Comparisons of outcomes

Credible and relevant comparisons of specific procedures across hospitals, competing therapies and over time are requested by most stakeholders within health care systems, foremost by patients and their families. The goal is not limited to a ranking in the quality of care, but rather continuous improvement at each level of care delivery, including physicians and other health care personnel, hospitals and even health care systems.


Benchmarking is a quality improvement and monitoring approach originally used in business to compare the performance of an organization to the ‘best in class’. It differs from conventional quality improvement efforts where the aim is to reach the ‘average’ result across a range of institutions. To assess the value of benchmarking in surgery, a Delphi consensus study suggested specific steps to inform the benchmarking process38. Benchmarking targets are usually validated outcomes, rather than process measures for a specific operation. These outcomes are measured among ‘best case patients’ who have minimal risk factors and undergo the operation at designated ‘best centers’. These centers should have (1) a high caseload, (2) a specialized multidisciplinary team including non-surgical disciplines and (3) be part of or responsible for a national and/or international registry. This approach avoids debates about ambiguous risk adjustment and sets a target that is the best achievable result, to inspire and motivate physicians and the whole health care team. The targets require complete and accurate granular clinical data that meet source data verification standards.

By referring to a point of reference (the benchmark value), health care teams can better assess their strengths and weaknesses and strive for the best possible results. The actions taken to reduce or close the gap between an institution’s performance and the benchmark have great potential to improve outcomes. The CCI — which quantifies overall morbidity — has often been used as the main benchmarking outcome12,13,39,40. Additional markers should include adverse outcomes that are relevant to specific surgical interventions (for example, anastomotic leak in colorectal surgery or graft failure after transplantation)38 or textbook outcomes (as described above)30,41. Despite being an important pillar of patient-centered care, PROMs have rarely been addressed in surgical benchmarking initiatives, possibly because these measures often lack context-specific validity42 and are not universally agreed upon or collected in surgical databases.
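The comparison against a benchmark value can be sketched as follows. The text does not specify how the benchmark value is derived; the snippet assumes one possible convention, taking the 75th percentile of the median values reported by the reference centers, and measures an institution's gap to that cutoff. The function names and all numbers are hypothetical.

```python
from statistics import quantiles


def benchmark_cutoff(center_medians):
    """75th percentile of per-center median values.

    This convention is an assumption made for illustration; the
    inclusive method treats the data as the full set of centers.
    """
    return quantiles(center_medians, n=4, method="inclusive")[2]


def gap_to_benchmark(own_value, cutoff):
    """Difference between an institution's value and the cutoff.

    For a metric where lower is better (such as a morbidity index),
    a positive gap means the institution has not reached the benchmark.
    """
    return own_value - cutoff


# Hypothetical median morbidity scores from five reference centers.
reference_medians = [8.0, 10.0, 12.0, 14.0, 16.0]
cutoff = benchmark_cutoff(reference_medians)

print(cutoff)                          # 14.0
print(gap_to_benchmark(20.0, cutoff))  # 6.0
```

Tracking this gap over time is one simple way to see whether the actions taken to approach the benchmark are having an effect.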

The Jury (Box 2) recommended comparing standardized and reproducible outcomes through benchmarking, regardless of the size of hospital or standing of the individual department. Everyone should start with benchmarking because everyone should strive to improve. The Jury calls on editors of medical journals to ensure that authors referring to benchmarking relate it to the best possible result and not just average outcome.

Risk assessment

To properly compare surgical outcomes across patient groups and institutions, it is crucial to include risk profiles of all patients in the analysis and reporting. Failure to account for risk profiles encourages the avoidance of interventions on higher-risk patients, potentially decreasing these patients’ access to care. It is also counterproductive for expert health care institutions: centers involved in the management of the most complex or high-risk patients, which consequently have a lower proportion of ‘benchmark’ (that is, low-risk, straightforward) cases, disclose better outcomes when risk status is accounted for12,13,14,43,44,45,46,47,48,49,50. A high ratio of complex cases can positively impact the outcomes of all patients, as it logically enhances the capability of the surgeons and the center. Additionally, patients’ expectations can be adjusted and better understood when their potential individual risk is incorporated into discussions, thereby also improving outcomes, particularly in terms of QoL51. To ensure that fair and accurate comparison of outcomes between institutions is possible, the Jury recommended mandatory reporting of standardized risk profiles of patients, taking into account not just patient factors but also physician- and procedure-related factors, for example, surgical volume or high-risk procedures like pancreatic resection (Box 2).

Data management

Another recommendation of the Jury relates to the collection, verification and management of health care data. The need for reliable data, collected through secure channels and available for research projects and quality control, was widely acknowledged during the conference. The Jury concluded that there must be a position within every institution for a ‘data quality guarantor’, who would be responsible for data collection, management and storage. The role of this person would be to not only oversee and validate data collection, but also to train personnel and be the contact person for any official or governmental site overseeing quality and data. Beyond individual health care facilities, the role of governments and regulatory bodies was also seen as crucial by the Jury and experts.

Other perspectives relevant to society

The consensus conference included panels and discussions on the perspectives of payers, governments and society at large. This perspective does not only include outcomes per se, but also the resources invested to achieve certain outcomes. Wise spending of resources requires carefully developed, evidence-based guidelines with definitions of indications for surgical interventions. For example, the ‘Choosing Wisely’ campaign (an initiative of the American Board of Internal Medicine) seeks to address these challenges for specific indications and treatments by advancing a national dialogue with all stakeholders, focusing on shared decision-making with patients as partners to define ‘wise choices’. Its goal is to avoid unnecessary medical tests, treatments and procedures, especially in areas with limited resources52. High-quality outcome data are central to defining indications for surgical interventions and achieving the best possible outcomes for the resources invested. Therefore, governments have a vested interest in fostering the standardization of robust outcome data.

To this end, government regulatory agencies should follow a legislative mandate to promote and protect public and individual health, assuring fidelity to that mission by carrying out monitoring and measuring outcomes. They should clarify which metrics are most appropriate for addressing the required quality priorities, particularly those that can feasibly be collected using agreed-upon definitions (for example, long-term quality in care, PROMs, PREMs and standardized assessment of postoperative complications). This should involve qualified personnel to oversee, collect or validate the data, therefore requiring appropriately relevant and accurate data sources. The adage of ‘garbage in, garbage out’ is one that all efforts should prevent, otherwise wasted efforts and resources will ensue. Rather than basing measurements of care quality on a minimum number of procedures (for example, highly specialized procedures in general surgery), regulatory bodies should instead look at the quality and accuracy of recorded data and ways to improve centralized clinical expertise, such as multidisciplinary treatment of complex diseases (for example, by the intensive care unit and surgical department). Although high hospital volume (that is, the number of times a specific procedure is done at a facility per year) was shown to be an excellent tool to improve care quality in many domains, overconcentration can also have potentially harmful effects by creating a monopolistic market and less willingness to invest in novel procedures and adequate education and training53,54,55,56,57. Furthermore, global equity with fairness in financial contribution should be addressed, including basic coverage for everybody.

The Jury concluded that governments should be responsible for overseeing data collection, storage, management and access among researchers. Nationwide data collection enhances trust among patients, health care providers and the public. To secure the most appropriate interventions and treatments, the Jury acknowledges the importance of second opinions and removal of financial incentives — that is, monetary motivation to conduct procedures without evidence of benefit, which may lead to harm and exaggerated costs — as well as the importance of implementing initiatives like Choosing Wisely58,59 (Box 2).

Cultural and demographic differences in outcome interpretation

Differences in outcomes after medical interventions may occur when unjust and avoidable systemic differences exist in health care delivery that cannot be attributed to the disease, clinical indication for surgical procedures or type of procedure performed. These differences can arise from structural health systems or societal barriers to care60,61. Cultural factors also impact the way patients participate in their own care after a medical intervention. How we perceive, experience and cope with disease is based on representations regarding causes and consequences of sickness, which are shaped by cultural factors, our social positions and systems of meaning62. Cultural issues also play a major role in patient adherence and partnership with the health care team63.

While there are standards for the collection and evaluation of some specific social and demographic factors, such as employment and insurance status, there are no standards for the collection of information regarding cultural attitudes and social norms that can have an impact on the health of the individual or performance of a health system64,65,66. Additional information on social determinants (for example, poverty, food insecurity, discrimination and unsafe housing) and cultural and demographic factors (including gender identification, religion and others) would facilitate interpretation of outcomes after medical interventions.

The Jury concluded that cultural and demographic factors might have an extensive, although so far poorly assessed, effect on outcomes and outcome assessment. The Jury suggested incorporating cultural and demographic factors into the evaluation of outcomes, through cultural adaptation of outcome measures themselves and/or consideration of socio-cultural determinants of health when interpreting outcomes. Socio-demographic data should be collected in a consistent way — for example, by defining a minimal dataset in large national databases — and should be interpreted in the context of specific cultural and demographic backgrounds (Box 2).

A new culture in dealing with unwarranted outcomes

When something goes wrong during or following a surgical intervention, it is usually the result of multiple systemic factors, rather than a single cause67. While cases of gross negligence, recklessness or intentional harm call for an assignment of individual culpability and disciplinary action, it is critical to avoid treating the care provider as solely liable in cases of unintentional errors. Attitudes toward medical errors must focus on improving the overall process of care delivery — providing professional safety tools, training and support to clinicians so that they can express empathy, and where appropriate, apologize68,69. The best lesson is to offer transparent and honest disclosure to patients and families70, which appears to be the best modality to prevent more suffering. In line with this, the Jury recommended that health care facilities foster a shift from a culture of blame to one of collaboration and collective learning (Box 2).

The Jury also addressed the need for clearly defined systems and procedures to mitigate the consequences of unwarranted outcomes. From an ethical and legal standpoint, outcomes should be evaluated according to the consequences of the intervention (clinical outcome) and whether all required conditions were met (procedural outcome, such as compliance with the law, or informing patients about the risks). They should include benefits and harms jointly identified by care practitioners, individual patients and experts, and assessed against a standard of a decent or flourishing life, beyond just biological and psychological functioning71, as well as the process of health care journey. Developing such a standard requires further research. Discussion of clinical outcomes should be supported by evidence-based decision aids as part of shared decision-making and advance care planning.


This consensus conference delivered Jury-based recommendations on how to assess outcomes of surgical interventions using a rigorous format designed to minimize biases and conflicts of interest. The Jury was composed of independent members including key stakeholders of the society, from economy, industry, psychology/psychiatry, science and patient advocates. The Jury’s recommendations were mostly based on the work of nine panels of international and multidisciplinary experts, who succeeded in delivering their responses to specific questions to the Jury well in advance of the consensus meeting.

The statements of the Jury offer a better understanding of the various stakeholders, with a particular emphasis on patients, who are too often forgotten in the delivery of health care owing to overwhelming political and financial pressures. The Jury also highlighted the responsibilities in properly assessing the results of surgical interventions, which include not only the health care providers but also governments and, most importantly, patients themselves. Unfortunately, there is no single metric available covering all aspects, and likely there will never be. With these Jury-based recommendations, however, we provide a framework for outcome assessment that may be further developed by researchers and health care providers targeting specific patient populations and interventions. The most frequently recurring questions of the Jury to the panel chairs were ‘So what can be done better? What are the precise steps and actions you suggest for assessing outcomes more accurately after surgical interventions and thereby improving the quality of patient care worldwide?’ From the answers to these questions, we summarize seven priorities that emanate from the Jury’s recommendations to credibly report on surgical interventions (Box 3).

There are some limitations to the recommendations. While all attempts were made to minimize the risk of bias, each Jury member brought their own background and opinions. However, as in previous consensus conferences, the Jury did not simply accept expert statements but also rejected or strongly modified recommendations after productive deliberation. Next, we faced the challenge of balancing specificity with broad applicability of the recommendations. We prioritized recommendations pertinent to a broad range of surgical interventions, with the need for further adjustments based on the specifics of interventions and underlying diseases.

A final aim of the consensus exercise was to highlight areas needing more research (Box 4). For example, while the use and implementation of artificial intelligence is currently widely discussed in the assessment of surgical interventions, studies measuring its precise benefit in clinical practice and research are yet to be conducted. Also, the influence of socio-demographic and cultural factors on outcomes after surgical interventions is only now being recognized. Measuring, recording and comparing such complex determinants of health will require specialized tools, which are still lacking today.

The Jury underlined that this challenge is relevant to all cultures and societies and called on the WHO and G20 to specifically address these issues with the aim of achieving a level of standardization that will enable credible comparisons and improvements in the delivery of health care worldwide. This will go a long way in facilitating accurate outcome expectations for patients and thereby achieving better results.