Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI

Vasey, Baptiste; Nagendran, Myura; Campbell, Bruce; Clifton, David A.; Collins, Gary S.; Denaxas, Spiros; Denniston, Alastair K.; Faes, Livia; Geerts, Bart; Ibrahim, Mudathir; Liu, Xiaoxuan; Mateen, Bilal A.; Mathur, Piyush; McCradden, Melissa D.; Morgan, Lauren; Ordish, Johan; Rogers, Campbell; Saria, Suchi; Ting, Daniel S. W.; Watkinson, Peter; Weber, Wim; Wheatstone, Peter; McCulloch, Peter

doi:10.1038/s41591-022-01772-9


Consensus Statement
Published: 18 May 2022


Nature Medicine volume 28, pages 924–933 (2022)


A Publisher Correction to this article was published on 12 August 2022

This article has been updated

Abstract

A growing number of artificial intelligence (AI)-based clinical decision support systems are showing promising performance in preclinical, in silico evaluation, but few have yet demonstrated real benefit to patient care. Early-stage clinical evaluation is important to assess an AI system’s actual clinical performance at small scale, ensure its safety, evaluate the human factors surrounding its use and pave the way to further large-scale trials. However, the reporting of these early studies remains inadequate. The present statement provides a multi-stakeholder, consensus-based reporting guideline for the Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence (DECIDE-AI). We conducted a two-round, modified Delphi process to collect and analyze expert opinion on the reporting of early clinical evaluation of AI systems. Experts were recruited from 20 pre-defined stakeholder categories. The final composition and wording of the guideline was determined at a virtual consensus meeting. The checklist and the Explanation & Elaboration (E&E) sections were refined based on feedback from a qualitative evaluation process. In total, 123 experts participated in the first round of Delphi, 138 in the second round, 16 in the consensus meeting and 16 in the qualitative evaluation. The DECIDE-AI reporting guideline comprises 17 AI-specific reporting items (made of 28 subitems) and ten generic reporting items, with an E&E paragraph provided for each. Through consultation and consensus with a range of stakeholders, we developed a guideline comprising key items that should be reported in early-stage clinical studies of AI-based decision support systems in healthcare. By providing an actionable checklist of minimal reporting items, the DECIDE-AI guideline will facilitate the appraisal of these studies and replicability of their findings.


Main

The prospect of improved clinical outcomes and more efficient health systems has fueled a rapid rise in the development and evaluation of AI systems over the last decade. Because most AI systems within healthcare are complex interventions designed as clinical decision support systems, rather than autonomous agents, the interactions among the AI systems, their users and the implementation environments are defining components of the AI interventions’ overall potential effectiveness. Therefore, bringing AI systems from mathematical performance to clinical utility needs an adapted, stepwise implementation and evaluation pathway, addressing the complexity of this collaboration between two independent forms of intelligence, beyond measures of effectiveness alone¹. Despite indications that some AI-based algorithms now match the accuracy of human experts within preclinical in silico studies², there is little high-quality evidence for improved clinician performance or patient outcomes in clinical studies^3,4. Reasons proposed for this so-called AI chasm⁵ are a lack of the expertise needed for translating a tool into practice, a lack of funding available for translation, a general underappreciation of clinical research as a translation mechanism⁶ and, more specifically, a disregard for the potential value of the early stages of clinical evaluation and the analysis of human factors⁷.

The challenges of early-stage clinical AI evaluation (Box 1) are similar to those of complex interventions, as reported by the Medical Research Council dedicated guidance¹, and surgical innovation, as described by the IDEAL Framework^8,9. For example, in all three cases, the evaluation needs to consider the potential for iterative modification of the interventions and the characteristics of the operators (or users) performing them. In this regard, the IDEAL framework offers readily implementable and stage-specific recommendations for the evaluation of surgical innovations under development. IDEAL stages 2a and 2b, for example, are described as development and exploratory stages, during which the intervention is refined, operators’ learning curves are analyzed and the influence of patient and operator variability on effectiveness is explored prospectively, before large-scale efficacy testing.

Early-stage clinical evaluation of AI systems should also place a strong emphasis on validation of performance and safety, in a similar manner to phase 1 and phase 2 pharmaceutical trials, before efficacy evaluation at scale in phase 3. For example, small changes in the distribution of the underlying data between the algorithm training and clinical evaluation populations (so-called dataset shift) can lead to substantial variation in clinical performance and expose patients to potential unexpected harm^10,11.

Human factors (or ergonomics) evaluations are commonly conducted in safety-critical fields such as aviation, military and energy sectors^12,13,14. These assessments evaluate the effect of a device or procedure on its users’ physical and cognitive performance and vice versa. Human factors, such as usability evaluation, are an integral part of the regulatory process for new medical devices^15,16, and their application to AI-specific challenges is attracting growing attention in the medical literature^17,18,19,20. However, few clinical AI studies have reported on the evaluation of human factors³, and usability evaluation of related digital health technology is often performed with inconsistent methodology and reporting²¹.

Other areas of suboptimal reporting of clinical AI studies have also recently been highlighted^3,22, such as implementation environment, user characteristics and selection process, training provided, underlying algorithm identification and disclosure of funding sources. Transparent reporting is necessary for informed study appraisal and to facilitate reproducibility of study results. In a relatively new and dynamic field such as clinical AI, comprehensive reporting is also key to construct a common and comparable knowledge base to build upon.

Guidelines already exist, or are under development, for the reporting of preclinical, in silico studies of AI systems, their offline validation and their evaluation in large comparative studies^23,24,25,26; but there is an important stage of research between these, namely studies focusing on the initial clinical use of AI systems, for which no such guidance currently exists (Fig. 1 and Table 1). This early clinical evaluation provides a crucial scoping evaluation of clinical utility, safety and human factors challenges in live clinical settings. By investigating the potential obstacles to clinical evaluation at scale and informing protocol design, these studies are also important stepping stones toward definitive comparative trials.

**Fig. 1: Comparison of development pathways for drug therapies, AI in healthcare and surgical innovation.**

Table 1 Overview of existing and upcoming AI reporting guidelines


To address this gap, we convened an international, multi-stakeholder group of experts in a Delphi exercise to produce the DECIDE-AI reporting guideline. Focusing on AI systems supporting, rather than replacing, human intelligence, DECIDE-AI aims to improve the reporting of studies describing the evaluation of AI-based decision support systems during their early, small-scale implementation in live clinical settings (that is, the supported decisions have an actual effect on patient care). Whereas TRIPOD-AI, STARD-AI, SPIRIT-AI and CONSORT-AI are specific to particular study designs, DECIDE-AI is focused on the evaluation stage and does not prescribe a fixed study design.

Box 1 Methodological challenges of the AI-based decision support system evaluation

The clinical evaluation of AI-based decision support systems presents several methodological challenges, all of which will likely be encountered at an early stage. These include the need to:

account for the complex intervention nature of these systems and evaluate their integration within existing ecosystems
account for user variability and the added biases occurring as a result
consider two collaborating forms of intelligence (human and AI system) and, therefore, integrate human factors considerations as a core component
consider both physical patients and their data representations
account for the changing nature of the intervention (due to early prototyping, version updates or continuous learning design) and analyze related performance changes
minimize the potential of this technology to embed and reproduce existing health inequality and systemic biases
estimate the generalizability of findings across sites and populations
enable reproducibility of the findings in the context of a dynamic innovation field and intellectual property protection

Recommendations

Reporting item checklist

The DECIDE-AI guideline should be used for the reporting of studies describing the early-stage live clinical evaluation of AI-based decision support systems, independently of the study design chosen (Fig. 1 and Table 1). Depending on the chosen study design, and if available, authors may also want to complete the reporting according to study-type-specific guidelines (for example, STROBE for cohort studies)²⁷. Table 2 presents the DECIDE-AI checklist, comprising the 17 AI-specific reporting items and ten generic reporting items selected by the Consensus Group. Each item comes with an E&E to explain why and how reporting is recommended (Supplementary Appendix 1). A downloadable version of the checklist, designed to help researchers and reviewers check compliance when preparing or reviewing a manuscript, is available as Supplementary Appendix 2. Reporting guidelines are a set of minimum reporting recommendations and not intended to guide research conduct. Although familiarity with DECIDE-AI might be useful to inform some aspects of the design and conduct of studies within the guideline’s scope²⁸, adherence to the guideline alone should not be interpreted as an indication of methodological quality (which is the realm of methodological guidelines and risk of bias assessment tools). With increasingly complex AI interventions and evaluations, it might become challenging to report all the required information within a single primary manuscript, in which case references to the study protocol, open science repositories, related publications and supplementary materials are encouraged.

Table 2 DECIDE-AI checklist


Discussion

The DECIDE-AI guideline is the result of an international consensus process involving a diverse group of experts spanning a wide range of professional backgrounds and experience. The level of interest across stakeholder groups and the high response rate among the invited experts speak to the perceived need for more guidance in the reporting of studies presenting the development and evaluation of clinical AI systems and to the growing value placed on comprehensive clinical evaluation to guide implementation. The emphasis placed on the role of human-in-the-loop decision-making was guided by the Steering Group’s belief that AI will, at least in the foreseeable future, augment, rather than replace, human intelligence in clinical settings. In this context, thorough evaluation of the human–computer interaction and the roles played by the human users will be key to realizing the full potential of AI.

The DECIDE-AI guideline is the first stage-specific AI reporting guideline to be developed. This stage-specific approach echoes recognized development pathways for complex interventions^1,8,9,29 and aligns conceptually with proposed frameworks for clinical AI^6,30,31,32, although no commonly agreed nomenclature or definition has so far been published for the stages of evaluation in this field. Given the current state of clinical AI evaluation, and the apparent deficit in reporting guidance for the early clinical stage, the DECIDE-AI Steering Group considered it important to crystallize current expert opinion into a consensus, to help improve reporting of these studies. Beside this primary objective, the DECIDE-AI guideline will hopefully also support authors during study design, protocol drafting and study registration, by providing them with clear criteria around which to plan their work. As with other reporting guidelines, it is important to note that the overall effect on the standard of reporting will need to be assessed in due course, once the wider community has had a chance to use the checklist and explanatory documents, which is likely to prompt modification and fine-tuning of the DECIDE-AI guideline, based on its real-world use. Although the outcome of this process cannot be pre-judged, there is evidence that the adoption of consensus-based reporting guidelines (such as CONSORT) does, indeed, improve the standard of reporting³³.

The Steering Group paid special attention to the integration of DECIDE-AI within the broader scheme of AI guidelines (for example, TRIPOD-AI, STARD-AI, SPIRIT-AI and CONSORT-AI). It also focused on DECIDE-AI being applicable to all types of decision support modalities (that is, detection, diagnostic, prognostic and therapeutic). The final checklist should be considered as minimum scientific reporting standards and does not preclude reporting additional information, nor are the standards a substitute for other regulatory reporting or approval requirements. The overlap between scientific evaluation and regulatory processes was a core consideration during the development of the DECIDE-AI guideline. Early-stage scientific studies can be used to inform regulatory decisions (for example, based on the stated intended use within the study) and are part of the clinical evidence generation process (for example, clinical investigations). The initial item list was aligned with information commonly required by regulatory agencies, and regulatory considerations are introduced in the E&E paragraphs. However, given the somewhat different focuses of scientific evaluation and regulatory assessment³⁴, as well as differences between regulatory jurisdictions, it was decided to make no reference to specific regulatory processes in the guideline, nor to define the scope of DECIDE-AI within any particular regulatory framework. The primary focus of DECIDE-AI is scientific evaluation and reporting, for which regulatory documents often provide little guidance.

Several topics led to more intense discussion than others, both during the Delphi process and the Consensus Group discussion. Regardless of whether the corresponding items were included, these represent important issues that the AI and healthcare communities should consider and continue to debate. First, we discussed at length whether users (see glossary of terms) should be considered as study participants. The consensus reached was that users are a key study population, about whom data will be collected (for example, reasons for variation from the AI system recommendation and user satisfaction), and who might logically be consented as study participants and, therefore, should be considered as such. Because user characteristics (for example, experience) can affect intervention efficacy, both patient and user variability should be considered when evaluating AI systems and reported adequately.

Second, the relevance of comparator groups in early-stage clinical evaluation was considered. Most studies retrieved in the literature search described a comparator group (commonly the same group of clinicians without AI support). Such comparators can provide useful information for the design of future large-scale trials (for example, information on the potential effect size). However, comparator groups are often unnecessary at this early stage of clinical evaluation, when the focus is on issues other than comparative efficacy. Small-scale clinical investigations are also usually underpowered to make statistically significant conclusions about efficacy, accounting for both patient and user variability. Moreover, the additional information gained from comparator groups in this context can often be inferred from other sources, such as previous data on unassisted standard of care in the case of the expected effect size. Comparison groups are, therefore, mentioned in item VII but considered optional.

Third, output interpretability is often described as important to increase user and patient trust in the AI system, to contextualize the system’s outputs within the broader clinical information environment¹⁹ and potentially for regulatory purposes³⁵. However, some experts argued that an output’s clinical value may be independent of its interpretability and that the practical relevance of evaluating interpretability is still debatable^36,37. Furthermore, there is currently no generally accepted way of quantifying or evaluating interpretability. For this reason, the Consensus Group decided not to include an item on interpretability at the current time.

Fourth, the notion of users’ trust in the AI system and its evolution with time were discussed. As users accumulate experience with, and receive feedback from, the real-world use of AI systems, they will adapt their level of trust in its recommendations. Whether appropriate or not, this level of trust will influence, as recently demonstrated by McIntosh et al.³⁸, how much effect the systems have on the final decision-making and, therefore, influence the overall clinical performance of the AI system. Understanding how trust evolves is essential for planning user training and determining the optimal timepoints at which to start data collection in comparative trials. However, as for interpretability, there is currently no commonly accepted way to measure trust in the context of clinical AI. For this reason, the item about user trust in the AI system was not included in the final guideline. The fact that interpretability and trust were not included highlights the tendency of consensus-based guidelines development toward conservatism, because only widely agreed-upon concepts reach the level of consensus needed for inclusion. However, changes of focus in the field, as well as new methodological development, can be integrated into subsequent guideline iterations. From this perspective, the issues of interpretability and trust are far from irrelevant to future AI evaluations, and their exclusion from the current guideline reflects less a lack of interest than a need for further research into how we can best operationalize these metrics for the purposes of evaluation in AI systems.

Fifth, the notion of modifying the AI system (the intervention) during the evaluation received mixed opinions. During comparative trials, changes made to the intervention during data collection are questionable unless the changes are part of the study protocol; some authors even consider them impermissible, on the basis that they would make valid interpretation of study results difficult or impossible. However, the objectives of early clinical evaluation are often not to make definitive conclusions on effectiveness. Iterative design–evaluation cycles, if performed safely and reported transparently, offer opportunities to tailor an intervention to its users and beneficiaries and increase the chances of adoption of an optimized, fixed version during later summative evaluation^8,9,39,40.

Sixth, several experts noted the benefit of conducting human factors evaluation before clinical implementation and considered that, therefore, human factors should be reported separately. However, even robust preclinical human factors evaluation will not reliably characterize all the potential human factors issues that might arise during the use of an AI system in a live clinical environment, warranting a continued human factors evaluation at the early stage of clinical implementation. The Consensus Group agreed that human factors play a fundamental role in AI system adoption in clinical settings at scale and that the full appraisal of an AI system’s clinical utility can happen only in the context of its clinical human factors evaluation.

Finally, several experts raised concerns that the DECIDE-AI guideline prescribes an evaluation that is too exhaustive to be reported within a single manuscript. The Consensus Group acknowledged the breadth of topics covered and the practical implications. However, reporting guidelines aim to promote transparent reporting of studies rather than mandating that every aspect covered by an item must have been evaluated within the studies. For example, if a learning curves evaluation has not been performed, then fulfilment of item 14b would be to simply state that this was not done, with an accompanying rationale. The Consensus Group agreed that appropriate AI evaluation is a complex endeavour necessitating the interpretation of a wide range of data, which should be presented together as far as possible. It was also felt that thorough evaluation of AI systems should not be limited by a word count and that publications reporting on such systems might benefit from special formatting requirements in the future. The information required by several items might already be reported in previous studies or in the study protocol, which could be cited rather than described in full again. The use of references, online supplementary materials and open-access repositories (for example, Open Science Framework (OSF)) is recommended to allow the sharing and connecting of all required information within one main published evaluation report.

Our work has several limitations that should be considered. First, the issue of potential biases, which apply to any consensus process, must be considered. These include anchoring or participant selection biases⁴¹. The research team tried to mitigate bias through the survey design, using open-ended questions analyzed through a thematic analysis, and by adapting the expert recruitment process, but it is unlikely that it was eliminated entirely. Despite an aim for geographical diversity and several actions taken to foster it, representation was skewed toward Europe and, more specifically, the United Kingdom. This could be explained, in part, by the following factors: a likely selection bias in the Steering Group’s expert recommendations; a higher interest in our open invitation to contribute coming from European/United Kingdom scientists (25 of 30 experts approaching us, 83%); and a lack of control over the response rate and self-reported geographical location of participating experts. Considerable attention was also paid to diversity and balance among stakeholder groups, even though clinicians and engineers were the most represented, partly due to the profile of researchers who contacted us spontaneously after the public announcement of the project. Stakeholder group analyses were performed to identify any marked disagreements from underrepresented groups. Finally, as also noted by the authors of the SPIRIT-AI and CONSORT-AI guidelines^25,26, few examples of studies reporting on the early-stage clinical evaluation of AI tools were available at the time that we started developing the DECIDE-AI guideline. This might have affected the exhaustiveness of the initial item list created from literature review. However, the wide range of stakeholders involved and the design of the first round of Delphi allowed identification of several additional candidate items, which were added in the second iteration of the item list.

The introduction of AI into healthcare needs to be supported by sound, robust and comprehensive evidence generation and reporting. This is essential both to ensure the safety and efficacy of AI systems and to gain the trust of patients, practitioners and purchasers, so that this technology can realize its full potential to improve patient care. The DECIDE-AI guideline aims to improve the reporting of early-stage live clinical evaluation of AI systems, which lays the foundations for both larger clinical studies and later widespread adoption.

Methods

The DECIDE-AI guideline was developed through an international expert consensus process and in accordance with the EQUATOR Network’s recommendations for guideline development⁴². A Steering Group was convened to oversee the guideline development process. Its members were selected to cover a broad range of expertise and ensure a seamless integration with other existing guidelines. We conducted a modified Delphi process⁴³, with two rounds of feedback from participating experts and one virtual consensus meeting. The project was reviewed by the University of Oxford Central University Research Ethics Committee (approval R73712/RE003) and registered with the EQUATOR Network. Informed consent was obtained from all participants in the Delphi process and consensus meeting.

Initial item list generation

An initial list of candidate items was developed based on expert opinion informed by (1) a systematic literature review focusing on the evaluation of AI-based diagnostic decision support systems³; (2) an additional literature search about existing guidance for AI evaluation in clinical settings (search strategy available on the OSF⁴⁴); (3) literature recommended by Steering Group members^19,22,45,46,47,48,49; and (4) institutional documents^50,51,52,53.

Expert recruitment

Experts were recruited through five different channels: (1) invitation to experts recommended by the Steering Group; (2) invitation to authors of the publications identified through the initial literature searches; (3) call to contribute published in a commentary article in a medical journal⁷; (4) consideration of any expert contacting the Steering Group on their own initiative; and (5) invitation to experts recommended by the Delphi participants (snowballing). Before starting the recruitment process, 20 target stakeholder groups were defined, namely: administrators/hospital management, allied health professionals, clinicians, engineers/computer scientists, entrepreneurs, epidemiologists, ethicists, funders, human factors specialists, implementation scientists, journal editors, methodologists, patient representatives, payers/commissioners, policymakers/official institution representatives, private sector representatives, psychologists, regulators, statisticians and trialists.

One hundred thirty-eight experts agreed to participate in the first round of Delphi, of whom 123 (89%) completed the questionnaire (83 identified from Steering Group recommendations, 12 from their publications, 21 from contacting the Steering Group on own initiative and seven through snowballing). One hundred sixty-two experts were invited to take part in the second round of Delphi, of whom 138 completed the questionnaire (85%). One hundred ten had also completed the first round (continuity rate of 89%)⁵⁴, and 28 were new participants. The participating experts represented 18 countries and spanned all 20 of the defined stakeholder groups (Supplementary Note 1 and Supplementary Tables 1 and 2).
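The participation figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in this paragraph; the variable names are illustrative, and it assumes that the continuity rate is the share of first-round completers who also completed the second round, which the text does not define explicitly.

```python
def rate(numerator: int, denominator: int) -> int:
    """Return a percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# Counts as reported in the paragraph above.
agreed_round1 = 138
completed_round1 = 123
invited_round2 = 162
completed_round2 = 138
returning_round2 = 110

# Recruitment channel of each first-round completer (labels are illustrative).
channels = {
    "steering_group_recommendation": 83,
    "publications": 12,
    "own_initiative": 21,
    "snowballing": 7,
}

round1_completion = rate(completed_round1, agreed_round1)
round2_completion = rate(completed_round2, invited_round2)
# Assumption: continuity = returning participants / first-round completers.
continuity = rate(returning_round2, completed_round1)
new_in_round2 = completed_round2 - returning_round2

print(round1_completion, round2_completion, continuity, new_in_round2)
print(sum(channels.values()) == completed_round1)
```

Under these assumptions the computed rates agree with the reported 89%, 85% and 89%, and the four recruitment channels sum to the 123 first-round completers.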

Delphi process

The Delphi surveys were designed and distributed via the REDCap web application^55,56. The first round consisted of four open-ended questions on aspects viewed by the Delphi participants as necessary to be reported during early-stage clinical evaluation. The participating experts were then asked to rate, on a 1–9 scale, the importance of items in the initial list proposed by the research team. Ratings of 1–3 on the scale were defined as ‘not important’, 4–6 as ‘important but not critical’ and 7–9 as ‘important and critical’. Participants were also invited to comment on existing items and to suggest new items. An inductive thematic analysis of the narrative answers was performed independently by two reviewers (B.V. and M.N.), and conflict was resolved by consensus⁵⁷. The themes identified were used to correct any omissions in the initial list and to complement the background information about proposed items. Summary statistics of the item scores were produced for each stakeholder group by calculating the median score, the interquartile range (IQR) and the percentage of participants scoring an item 7 or higher, as well as 3 or lower, which were the pre-specified inclusion and exclusion cutoffs, respectively. A revised item list was developed based on the results of the first round.
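As an illustration of the per-item summary statistics described above, the following is a minimal sketch rather than the authors' analysis code, which is not reproduced in this article. The function names, the example scores and the choice of the "inclusive" quartile method are assumptions; the text does not state how quartiles were computed.

```python
from statistics import median, quantiles

INCLUDE_CUTOFF = 7  # scores of 7-9 were defined as 'important and critical'
EXCLUDE_CUTOFF = 3  # scores of 1-3 were defined as 'not important'


def summarize_item(scores):
    """Summarize one item's 1-9 importance ratings (needs at least two scores)."""
    if any(not 1 <= s <= 9 for s in scores):
        raise ValueError("scores must lie on the 1-9 scale")
    # Assumption: quartiles use Python's 'inclusive' method.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    n = len(scores)
    return {
        "median": median(scores),
        "iqr": (q1, q3),
        "pct_7_or_higher": 100 * sum(s >= INCLUDE_CUTOFF for s in scores) / n,
        "pct_3_or_lower": 100 * sum(s <= EXCLUDE_CUTOFF for s in scores) / n,
    }


def summarize_by_group(scores_by_group):
    """Apply summarize_item to each stakeholder group's scores for one item."""
    return {group: summarize_item(s) for group, s in scores_by_group.items()}


# Hypothetical ratings for a single item, not data from the study.
example = {
    "clinicians": [9, 8, 7, 7, 6, 3, 8, 9, 5, 7],
    "statisticians": [6, 7, 8, 4, 9],
}
for group, summary in summarize_by_group(example).items():
    print(group, summary)
```

Reporting the two percentages alongside the median and IQR makes it possible to read an item's position relative to both pre-specified cutoffs at a glance.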

In the second round, the participants were shown the results of the first round and invited to rate and comment on the items in the revised list. The detailed survey questions of the two rounds of Delphi can be found on the OSF⁴⁴. All analyses of item scores and comments were performed independently by two members of the research team (B.V. and M.N.) using NVivo (QSR International Pty Ltd., version 1.0) and Python (Python Software Foundation, version 3.8.5). Conflicts were resolved by consensus.

The initial item list contained 54 items. One hundred twenty sets of responses were included in the analysis of the first round of Delphi (one set of responses was excluded due to a reasonable suspicion of scale inversion, two due to completion after the deadline). The first round yielded 43,986 words of free text answers to the four initial open-ended questions, 6,419 item scores, 228 comments and 64 proposals for new items. The thematic analysis identified 109 themes. In the revised list, nine items remained unchanged, 22 were reworded/completed, 21 were reorganized (merged/split, becoming 13 items), two items were dropped and nine new items were added, for a total of 53 items. The two items dropped were related to health economic assessment. They were the only two items with a median score below 7 (median: 6, IQR: 2–9 for both) and received many comments describing them as an entirely separate aspect of evaluation. The revised list was reorganized into items and subitems. One hundred thirty-six sets of answers were included in the analysis of the second round of Delphi (one set of answers was excluded due to lack of consideration for the questions, one due to completion after the deadline). The second round yielded 7,101 item scores and 923 comments. The results of the thematic analysis and the initial and revised item lists, as well as per-item narrative and graphical summaries of the feedback received in both rounds, can be found on the OSF⁴⁴.
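The item and response counts in this paragraph can be reconciled with simple bookkeeping. The sketch below assumes that the four reported fates (unchanged, reworded/completed, reorganized, dropped) partition the initial list, which the text implies but does not state; the dictionary keys are illustrative labels.

```python
# Fate of each of the initial items, as reported above.
initial_fate = {
    "unchanged": 9,
    "reworded_or_completed": 22,
    "reorganized": 21,
    "dropped": 2,
}

# Composition of the revised list: the 21 reorganized items became 13.
revised_composition = {
    "unchanged": 9,
    "reworded_or_completed": 22,
    "reorganized": 13,
    "new": 9,
}

initial_total = sum(initial_fate.values())
revised_total = sum(revised_composition.values())

# Response sets analyzed = questionnaires completed minus exclusions.
round1_analyzed = 123 - (1 + 2)  # scale inversion, late completion
round2_analyzed = 138 - (1 + 1)  # questions not considered, late completion

print(initial_total, revised_total, round1_analyzed, round2_analyzed)
```

Under this reading, the totals match the 54 initial items, the 53 revised items and the 120 and 136 response sets reported in the text.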

Consensus meeting

A virtual consensus meeting was held on three occasions between 14 and 16 June 2021 to debate and agree on the content and wording of the DECIDE-AI reporting guideline. The 16 members of the Consensus Group (Supplementary Note 1 and Supplementary Table 2a,b) were selected to ensure a balanced representation of the key stakeholder groups as well as geographic diversity. All items from the second round of Delphi were discussed and voted on during the consensus meeting. For each item, the results of the Delphi process were presented to the Consensus Group members, and a vote was carried out anonymously using the Vevox online application (https://www.vevox.com). A pre-specified cutoff of 80% of the Consensus Group members (excluding blank votes and abstentions) was necessary for an item to be included. To highlight the new, AI-specific reporting items, the Consensus Group divided the guidelines into two item lists: an AI-specific items list, which represents the main novelty of the DECIDE-AI guideline, and a second list of generic reporting items, which achieved high consensus but are not AI specific and could apply to most types of studies. The Consensus Group selected 17 items (made of 28 subitems in total) for inclusion in the AI-specific list and ten items for inclusion in the generic reporting item list. A summary of the Consensus Group votes can be found in Supplementary Table 3.
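The inclusion rule used at the consensus meeting can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in Vevox: the vote labels are hypothetical, and the cutoff is read as "at least 80%" of valid votes, whereas the text says only that a cutoff of 80% was necessary.

```python
from collections import Counter

CUTOFF_PERCENT = 80  # pre-specified cutoff reported in the text


def item_included(votes, cutoff_percent=CUTOFF_PERCENT):
    """Decide inclusion from a list of anonymous votes.

    Only 'include' and 'exclude' count as valid votes; blank votes and
    abstentions are left out of the denominator, as described above.
    """
    counts = Counter(votes)
    valid = counts["include"] + counts["exclude"]
    if valid == 0:
        return False  # no valid votes, so no consensus to include
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    return 100 * counts["include"] >= cutoff_percent * valid


# Hypothetical ballots from a 16-member group, not data from the study.
ballot_a = ["include"] * 12 + ["exclude"] * 2 + ["abstain"] * 2
ballot_b = ["include"] * 12 + ["exclude"] * 4
print(item_included(ballot_a), item_included(ballot_b))
```

The two example ballots contain the same number of votes in favour but reach different outcomes, because abstentions shrink the denominator (12 of 14 valid votes) whereas votes against do not (12 of 16).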

Qualitative evaluation

The drafts of the guideline and of the E&E sections were sent for qualitative evaluation to a group of 16 selected experts with experience in AI system implementation or in the peer-reviewing of literature related to AI system evaluation (Supplementary Note 1), all of whom were independent of the Consensus Group. These 16 experts were asked to comment on the clarity and applicability of each AI-specific item, using a custom form (available on the OSF⁴⁴). Item wording amendments and modifications to the E&E sections were made based on the feedback from the qualitative evaluation, which was independently analyzed by two reviewers (B.V. and M.N.), with conflicts resolved by consensus. A glossary of terms (Box 2) was produced to clarify key concepts used in the guideline. The Consensus Group approved the final item lists, including any changes made during the qualitative evaluation. Supplementary Figs. 1 and 2 provide graphical representations of the evolution of the two item lists (AI-specific and generic).

Box 2 Glossary of terms

AI system	Decision support system incorporating AI and consisting of (1) the AI or machine learning algorithm; (2) the supporting software platform; and (3) the supporting hardware platform
AI system version	Unique reference for the form of the AI system and the state of its components at a single point in time. Allows for tracking changes to the AI system over time and comparing between different versions.
Algorithm	Mathematical model responsible for learning from data and producing an output.
AI	‘Science of developing computer systems which can perform tasks normally requiring human intelligence’²⁶
Bias	‘Systematic difference in treatment of certain objects, people, or groups in comparison to others’⁵⁸
Care pathway	Series of interactions, investigations, decision-making and treatments experienced by patients in the course of their contact with a healthcare system for a defined reason
Clinical	Relating to the observation and treatment of actual patients rather than in silico or scenario-based simulations
Clinical evaluation	Set of ongoing activities, analyzing clinical data and using scientific methods, to evaluate the clinical performance, effectiveness and/or safety of an AI system, when used as intended⁵⁰
Clinical investigation	Study performed on one or more human subjects to evaluate the clinical performance, effectiveness and/or safety of an AI system⁵⁹. This can be performed in any setting (for example, community, primary care and hospital).
Clinical workflow	Series of tasks performed by healthcare professionals in the exercise of their clinical duties
Decision support system	System designed to support human decision-making by providing person-specific and situation-specific information or recommendations to improve care or enhance health
Exposure	State of being in contact with, and having used, an AI system or similar digital technology
Human–computer interaction	Bi-directional influence between human users and digital systems through a physical and conceptual interface.
Human factors	Also called ergonomics. ‘The scientific discipline concerned with the understanding of interactions among humans and other elements of a system, and the profession that applies theory, principles, data and methods to design in order to optimise human well-being and overall system performance’ (International Ergonomics Association).
Indication for use	Situation and reason (medical condition, problem and patient group) where the AI system should be used
In silico evaluation	Evaluation performed via computer simulation outside the clinical settings
Intended use	Use for which an AI system is intended, as stated by its developers, and which serves as the basis for its regulatory classification. The intended use includes aspects of the targeted medical condition, patient population, user population, use environment and mode of action.
Learning curves	Graphical plotting of user performance against experience⁶⁰. By extension, analysis of the evolution of user performance with a task as exposure to the task increases. The measure of performance often uses other context-specific metrics as a proxy.
Live evaluation	Evaluation under actual clinical conditions, in which the decisions made have a direct effect on patient care, as opposed to ‘offline’ or ‘shadow mode’ evaluation, in which the decisions do not have a direct effect on patient care.
Machine learning	‘Field of computer science concerned with the development of models/algorithms that can solve specific tasks by learning patterns from data, rather than by following explicit rules. It is seen as an approach within the field of AI’²⁶.
Participant	Subject of a research study on whom data will be collected and from whom consent is obtained (or waived). The DECIDE-AI guideline considers that both patients and users can be participants.
Patient	Person (or the digital representation of this person) receiving healthcare attention or using health services and who is the subject of the decision made with the support of the AI system. Note: DECIDE-AI uses the term ‘patient’ pragmatically to simplify the reading of the guideline. Strictly speaking, a person with no health conditions who is the subject of a decision made about them by an AI-based decision support tool to improve their health and well-being or for a preventative purpose is not necessarily a ‘patient’ per se.
Patient involvement in research	Research carried out ‘with’ or ‘by’ patients or members of the public rather than ‘to’, ‘about’ or ‘for’ them (adapted from the INVOLVE definition of ‘Public Involvement’).
Standard practice	Usual care currently received by the intended patient population for the targeted medical condition and problem. This may not necessarily be synonymous with the state-of-the-art practice.
Usability	‘Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use’⁶¹.
User	Person interacting with the AI system to inform their decision-making. This person could be a healthcare professional or a patient.

The definitions provided pertain to the specific context of DECIDE-AI and the use of the terms in the guideline. They are not necessarily generally accepted definitions and might not always be fully applicable to other areas of research.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All data generated during this study (pseudonymized where necessary) are available upon justified request to the research team and for a duration of 3 years after publication of this manuscript. Translation of these guidelines into different languages is welcomed and encouraged, as long as the authors of the original publication are included in the process and resulting publication.

Code availability

All code produced for data analysis during this study is available upon justified request to the research team and for a duration of 3 years after publication of this manuscript.

Change history

12 August 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41591-022-01951-8

References

Skivington, K. et al. A new framework for developing and evaluating complex interventions: update of Medical Research Council guidance. Br. Med. J. 374, n2061 (2021).
Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).
Vasey, B. et al. Association of clinician diagnostic performance with machine learning-based decision support systems: a systematic review. JAMA Netw. Open 4, e211276 (2021).
Freeman, K. et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. Br. Med. J. 374, n1872 (2021).
Keane, P. A. & Topol, E. J. With an eye to AI and autonomous diagnosis. NPJ Digital Med. 1, 40 (2018).
McCradden, M. D., Stephenson, E. A. & Anderson, J. A. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat. Med. 26, 1325–1326 (2020).
Vasey, B. et al. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat. Med. 27, 186–187 (2021).
McCulloch, P. et al. No surgical innovation without evaluation: the IDEAL recommendations. Lancet 374, 1105–1112 (2009).
Hirst, A. et al. No surgical innovation without evaluation: evolution and further development of the IDEAL framework and recommendations. Ann. Surg. 269, 211–220 (2019).
Finlayson, S. G. et al. The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 385, 283–286 (2021).
Subbaswamy, A. & Saria, S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21, 345–352 (2020).
Kapur, N., Parand, A., Soukup, T., Reader, T. & Sevdalis, N. Aviation and healthcare: a comparative review with implications for patient safety. JRSM Open 7, 2054270415616548 (2015).
Corbridge, C., Anthony, M., McNeish, D. & Shaw, G. A new UK defence standard for human factors integration (HFI). Proc. Hum. Factors Ergon. Soc. Annu. Meet. 60, 1736–1740 (2016).
Stanton, N. A., Salmon, P., Jenkins, D. & Walker, G. Human Factors in the Design and Evaluation of Central Control Room Operations (CRC Press, 2009).
US Food and Drug Administration (FDA). Applying human factors and usability engineering to medical devices: guidance for industry and Food and Drug Administration staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/applying-human-factors-and-usability-engineering-medical-devices (2016).
Medicines & Healthcare products Regulatory Agency (MHRA). Guidance on applying human factors and usability engineering to medical devices including drug-device combination products in Great Britain. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/970563/Human-Factors_Medical-Devices_v2.0.pdf (2021).
Asan, O. & Choudhury, A. Research trends in artificial intelligence applications in human factors health care: mapping review. JMIR Hum. Factors 8, e28236 (2021).
Felmingham, C. M. et al. The importance of incorporating human factors in the design and implementation of artificial intelligence for skin cancer diagnosis in the real world. Am. J. Clin. Dermatol. 22, 233–242 (2021).
Sujan, M. et al. Human factors challenges for the safe use of artificial intelligence in patient care. BMJ Health Care Inform. 26, e100081 (2019).
Sujan, M., Baber, C., Salmon, P., Pool, R. & Chozos, N. Human factors and ergonomics in healthcare AI. https://www.researchgate.net/publication/354728442_Human_Factors_and_Ergonomics_in_Healthcare_AI (2021).
Wronikowska, M. W. et al. Systematic review of applied usability metrics within usability evaluation methods for hospital electronic healthcare record systems. J. Eval. Clin. Pract. 27, 1403–1416 (2021).
Nagendran, M. et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. Br. Med. J. 368, m689 (2020).
Collins, G. S. & Moons, K. G. M. Reporting of artificial intelligence prediction models. Lancet 393, 1577–1579 (2019).
Sounderajah, V. et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat. Med. 26, 807–808 (2020).
Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26, 1351–1363 (2020).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
von Elm, E. et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Br. Med. J. 335, 806–808 (2007).
Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Br. Med. J. 372, n71 (2021).
Sedrakyan, A. et al. IDEAL-D: a rational framework for evaluating and regulating the use of medical devices. Br. Med. J. 353, i2372 (2016).
Park, Y. et al. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open 3, 326–331 (2020).
Higgins, D. & Madai, V. I. From bit to bedside: a practical framework for artificial intelligence product development in healthcare. Adv. Intell. Syst. 2, 2000052 (2020).
Sendak, M. P. et al. A path for translation of machine learning products into healthcare delivery. Eur. Med. J. https://www.emjreviews.com/innovations/article/a-path-for-translation-of-machine-learning-products-into-healthcare-delivery/ (2020).
Moher, D., Jones, A., Lepage, L. & CONSORT Group. Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. J. Am. Med. Assoc. 285, 1992–1995 (2001).
Park, S. H. Regulatory approval versus clinical validation of artificial intelligence diagnostic tools. Radiology 288, 910–911 (2018).
US Food and Drug Administration (FDA). Clinical decision support software: draft guidance for industry and Food and Drug Administration staff. https://www.fda.gov/media/109618/download (2019).
Lipton, Z. C. The mythos of model interpretability. Commun. ACM 61, 36–43 (2018).
Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750 (2021).
McIntosh, C. et al. Clinical integration of machine learning for curative-intent radiation treatment of patients with prostate cancer. Nat. Med. 27, 999–1005 (2021).
International Organization for Standardization. Ergonomics of human–system interaction—part 210: human-centred design for interactive systems. https://www.iso.org/standard/77520.html (2019).
Norman, D. A. User Centered System Design (CRC Press, 1986).
Winkler, J. & Moser, R. Biases in future-oriented Delphi studies: a cognitive perspective. Technol. Forecast. Soc. Change 105, 63–76 (2016).
Moher, D., Schulz, K. F., Simera, I. & Altman, D. G. Guidance for developers of health research reporting guidelines. PLoS Med. 7, e1000217 (2010).
Dalkey, N. & Helmer, O. An experimental application of the DELPHI method to the use of experts. Manage. Sci. 9, 458–467 (1963).
Vasey, B., Nagendran, M. & McCulloch, P. DECIDE-AI 2022. https://doi.org/10.17605/OSF.IO/TP9QV (2022).
Vollmer, S. et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. Br. Med. J. 368, l6927 (2020).
Bilbro, N. A. et al. The IDEAL reporting guidelines: a Delphi consensus statement stage specific recommendations for reporting the evaluation of surgical innovation. Ann. Surg. 273, 82–85 (2021).
Morley, J., Floridi, L., Kinsey, L. & Elhalal, A. From what to how: an initial review of publicly available AI ethics tools, methods and research to translate principles into practices. Sci. Eng. Ethics 26, 2141–2168 (2019).
Xie, Y. et al. Health economic and safety considerations for artificial intelligence applications in diabetic retinopathy screening. Transl. Vis. Sci. Technol. 9, 22 (2020).
Norgeot, B. et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat. Med. 26, 1320–1324 (2020).
IMDRF Medical Device Clinical Evaluation Working Group. Clinical Evaluation. https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-191010-mdce-n56.pdf (2019).
IMDRF Software as Medical Device (SaMD) Working Group. ‘Software as a medical device’: possible framework for risk categorization and corresponding considerations. https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-140918-samd-framework-risk-categorization-141013.pdf (2014).
National Institute for Health and Care Excellence (NICE). Evidence standards framework for digital health technologies. https://www.nice.org.uk/about/what-we-do/our-programmes/evidence-standards-framework-for-digital-health-technologies (2019).
High-Level Independent Group on Artificial Intelligence (AI HLEG). Ethics guidelines for trustworthy AI. European Commission. Vol. 32. https://ec.europa.eu/digital (2019).
Boel, A., Navarro-Compán, V., Landewé, R. & van der Heijde, D. Two different invitation approaches for consecutive rounds of a Delphi survey led to comparable final outcome. J. Clin. Epidemiol. 129, 31–39 (2021).
Harris, P. A. et al. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 42, 377–381 (2009).
Harris, P. A. et al. The REDCap consortium: building an international community of software platform partners. J. Biomed. Inform. 95, 103208 (2019).
Nowell, L. S., Norris, J. M., White, D. E. & Moules, N. J. Thematic analysis: striving to meet the trustworthiness criteria. Int. J. Qual. Methods 16, 1609406917733847 (2017).
International Organization for Standardization. Information technology—artificial intelligence (AI)—bias in AI systems and AI aided decision making. https://www.iso.org/standard/77607.html (2021).
IMDRF Medical Device Clinical Evaluation Working Group. Clinical Investigation. https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-191010-mdce-n57.pdf (2019).
Hopper, A. N., Jamison, M. H. & Lewis, W. G. Learning curves in surgical practice. Postgrad. Med. J. 83, 777–779 (2007).
International Organization for Standardization. Ergonomics of human–system interaction—part 11: usability: definitions and concepts. https://www.iso.org/standard/63500.html (2018).

Acknowledgements

The authors would like to thank all Delphi participants and experts who participated in the guideline qualitative evaluation. B.V. would also like to thank B. Beddoe (Sheffield Teaching Hospital), N. Bilbro (Maimonides Medical Center), N. Marlow (Oxford University Hospitals), E. Taylor (Nuffield Department of Surgical Sciences, University of Oxford) and S. Ursprung (Department for Radiology, Tübingen University Hospital) for their support in the initial stage of the project. This work was supported by the IDEAL Collaboration. B.V. is funded by a Berrow Foundation Lord Florey scholarship. M.N. is supported by the UKRI CDT in AI for Healthcare (http://ai4health.io; grant P/S023283/1). D.C. receives funding from the Wellcome Trust, AstraZeneca, RCUK and GlaxoSmithKline. G.S.C. is supported by the NIHR Biomedical Research Centre, Oxford, and Cancer Research UK (program grant C49297/A27294). M.I. is supported by a Maimonides Medical Center Research fellowship. X.L. receives funding from the Wellcome Trust, the National Institute of Health Research/NHSX/Health Foundation, the Alan Turing Institute, the MHRA and NICE. B.A.M. is a fellow of the Alan Turing Institute, supported by EPSRC grant EP/N510129/, and holds a Wellcome Trust-funded honorary post at University College London for the purposes of carrying out independent research. M.M. receives funding from the Dalla Lana School of Public Health and the Leong Centre for Healthy Children. J.O. is employed by the Medicines and Healthcare products Regulatory Agency, which is the competent authority responsible for regulating medical devices and medicines in the United Kingdom. Elements of the work relating to the regulation of AI as a medical device are funded by grants from NHSX and the Regulators’ Pioneer Fund (Department for Business, Energy and Industrial Strategy). S.S. receives grants from the National Science Foundation, the American Heart Association, the National Institutes of Health and the Sloan Foundation. 
D.S.W.T. is supported by the National Medical Research Council, Singapore (NMRC/HSRG/0087/2018;MOH-000655-00), the National Health Innovation Centre, Singapore (NHIC-COV19-2005017), the SingHealth Fund Limited Foundation (SHF/HSR113/2017), the Duke-NUS Medical School (Duke-NUS/RSF/2021/0018;05/FY2020/EX/15-A58) and the Agency for Science, Technology and Research (A20H4g2141; H20C6a0032). P. Watkinson is supported by the NIHR Biomedical Research Centre, Oxford, and holds grants from the NIHR and Wellcome. P. McCulloch receives grants from Medtronic (unrestricted educational grant to Oxford University for the IDEAL Collaboration) and the Oxford Biomedical Research Centre. The views expressed in this guideline are those of the authors, Delphi participants and experts who participated in the qualitative evaluation of the guideline. These views do not necessarily reflect those of their institutions or funders.

Author information

Authors and Affiliations

Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
Baptiste Vasey, Mudathir Ibrahim & Peter McCulloch
Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK
Baptiste Vasey, David A. Clifton & Carmelo Velardo
Critical Care Research Group, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK
Baptiste Vasey, Peter Watkinson & Sarah Vollam
UKRI Centre for Doctoral Training in AI for Healthcare, Imperial College London, London, UK
Myura Nagendran
University of Exeter Medical School, Exeter, UK
Bruce Campbell
Royal Devon and Exeter Hospital, Exeter, UK
Bruce Campbell
Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology & Musculoskeletal Sciences, University of Oxford, Oxford, UK
Gary S. Collins
Institute of Health Informatics, University College London, London, UK
Spiros Denaxas & Bilal A. Mateen
British Heart Foundation Data Science Centre, London, UK
Spiros Denaxas
Health Data Research UK, London, UK
Spiros Denaxas
UCL Hospitals Biomedical Research Centre, London, UK
Spiros Denaxas
University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
Alastair K. Denniston & Xiaoxuan Liu
Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
Alastair K. Denniston & Xiaoxuan Liu
Moorfields Eye Hospital NHS Foundation Trust, London, UK
Alastair K. Denniston & Livia Faes
Healthplus.ai-R&D BV, Amsterdam, The Netherlands
Bart Geerts, Rachel Barnett & Siri L. van der Meijden
Department of Surgery, Maimonides Medical Center, Brooklyn, NY, USA
Mudathir Ibrahim & Joel Horovitz
The Wellcome Trust, London, UK
Bilal A. Mateen
The Alan Turing Institute, London, UK
Bilal A. Mateen
Department of General Anesthesiology, Anesthesiology Institute, Cleveland Clinic, Cleveland, OH, USA
Piyush Mathur
The Hospital for Sick Children, Toronto ON, Canada
Melissa D. McCradden
Dalla Lana School of Public Health, University of Toronto, Toronto ON, Canada
Melissa D. McCradden
Morgan Human Systems Ltd, Shrewsbury, UK
Lauren Morgan
Medicines and Healthcare products Regulatory Agency, London, UK
Johan Ordish
HeartFlow Inc., Redwood City, CA, USA
Campbell Rogers
Departments of Computer Science, Statistics, and Health Policy, and Division of Informatics, Johns Hopkins University, Baltimore, MD, USA
Suchi Saria
Bayesian Health, New York, NY, USA
Suchi Saria
Singapore National Eye Center, Singapore Eye Research Institute, Singapore, Singapore
Daniel S. W. Ting, Dinesh V. Gunasekaran, Tien-En Tan & Wei Yan Ng
Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
Daniel S. W. Ting
NIHR Biomedical Research Centre Oxford, Oxford University Hospitals NHS Trust, Oxford, UK
Peter Watkinson
The BMJ, London, UK
Wim Weber & John Fletcher
School of Medicine, University of Leeds, Leeds, UK
Peter Wheatstone
Department of Ophthalmology, School of Medicine, University of Washington, Seattle, WA, USA
Aaron Y. Lee
School of Medicine, Cardiff University, Cardiff, UK
Alan G. Fraser
Google Health, London, UK
Ali Connell, Christopher J. Kelly & Reena Chopra
Quantium Health, Johannesburg, South Africa
Alykhan Vira
Artera Research, Artera, Mountain View, CA, USA
Andre Esteva
University of Pittsburgh, Pittsburgh, PA, USA
Andrew D. Althouse
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Andrew L. Beam
CAIRElab, Leiden University Medical Centre, Leiden, the Netherlands
Anne de Hond
Institute for Biomedical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University Munich, Munich, Germany
Anne-Laure Boulesteix & Ludwig C. Hinske
Rheumatology Department, Royal Berkshire Hospital, Reading, UK
Anthony Bradlow
Cambridge Centre for AI in Medicine, University of Cambridge, Cambridge, UK
Ari Ercole
Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
Arsenio Paez
Usher Institute, Edinburgh Medical School, University of Edinburgh, Edinburgh, UK
Athanasios Tsanas
K Sharp, Llanelli, UK
Barry Kirby
Department of Computing, Imperial College London, London, UK
Ben Glocker
Sensyne Health, Oxford, UK
Carmelo Velardo
Seoul National University College of Medicine, Seoul, South Korea
Chang Min Park
Division of Imaging & Oncology, University Medical Center Utrecht, Utrecht, the Netherlands
Charisma Hehakaya
School of Computer Science, University of Birmingham, Birmingham, UK
Chris Baber & Konstantinos Kamnitsas
Nuffield Department of Medicine, University of Oxford, Oxford, UK
Chris Paton
Johner Institute, Konstanz, Germany
Christian Johner
PDD Group, London, UK
Christopher J. Vincent
University of Manchester, Manchester, UK
Christopher Yau
Pathology and Data Analytics, University of Leeds, Leeds, UK
Clare McGenity
Department of Biostatistics, Brown University School of Public Health, Providence, RI, USA
Constantine Gatsonis
The Christie NHS Foundation Trust, Manchester, UK
Corinne Faivre-Finn
London School of Economics, London, UK
Crispin Simon
Department of Medical Informatics, Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands
Danielle Sent
Mila, Quebec AI Institute, Montreal, Quebec, Canada
Danilo Bzdok
Leeds Teaching Hospitals NHS Trust, Leeds, UK
Darren Treanor
Department of Computer Science and Centre for Health Informatics, University of Manchester, Manchester, UK
David C. Wong
Google Health, Palo Alto, CA, USA
David F. Steiner
Berlin Institute of Health, Berlin, Germany
David Higgins
Healthcare Safety Investigation Branch, Farnborough, UK
Dawn Benson
MRC London Institute of Medical Sciences, Imperial College London, London, UK
Declan P. O’Regan
Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, UK
Dominic Danks
University of Pisa, Pisa, Italy
Emanuele Neri
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
Evangelia Kyrimi
Charité Universitätsmedizin Berlin, Berlin, Germany
Falk Schwendicke
Australian Institute of Health Innovation, Macquarie University, Sydney, New South Wales, Australia
Farah Magrabi
West Midlands Academic Health Science Network, Birmingham, UK
Frances Ives
Department of Cardiovascular Sciences, KU Leuven, Leuven, Belgium
Frank E. Rademakers
Bristol Centre for Surgical Research, Department of Population Health Sciences, Bristol Medical School, Bristol, UK
George E. Fowler
Deep Blue, Rome, Italy
Giuseppe Frau
Population Health Science Institute, Newcastle University, Newcastle upon Tyne, UK
H. D. Jeffry Hogg
Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, Queen Square, London, UK
Hani J. Marcus
Department of Radiology, University of Michigan, Ann Arbor, MI, USA
Heang-Ping Chan
The Abigail Wexner Research Institute, Nationwide Children’s Hospital, The Ohio State University, Columbus, OH, USA
Henry Xiang
Department of Medicine, East Sussex Healthcare Trust, Hastings, UK
Hugh F. McIntyre
Hardian Health, Haywards Heath, UK
Hugh Harvey
Department of Radiology, Seoul National University Hospital, Seoul, South Korea
Hyungjin Kim
Department of Computer Science, University of York, York, UK
Ibrahim Habli
Department of Anesthesiology and Critical Care Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD, USA
James C. Fackler
Joint Centre for Bioethics, University of Toronto, Toronto, Ontario, Canada
James Shaw
University of Oxford, Oxford, UK
Janet Higham
Centre for Trauma Sciences, Blizard Institute, Queen Mary University of London, London, UK
Jared M. Wohlgemut, Max Marsden & Zane B. Perkins
Department of Medical Imaging, Western University, London, Ontario, Canada
Jaron Chong
Radiation Oncology Department, Hôpital Européen Georges Pompidou, AP-HP, Paris, France
Jean-Emmanuel Bibault
Center of Research in Epidemiology and Statistics (Inserm 1153), Université de Paris, Paris, France
Jérémie F. Cohen
Department of Pathology, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, the Netherlands
Jesper Kers
Oxford Internet Institute, University of Oxford, Oxford, UK
Jessica Morley
Oral Diagnostics & Digital Health & Health Services Research, Charité Universitätsmedizin Berlin, Berlin, Germany
Joachim Krois
Nature Medicine, New York, NY, USA
Joao Monteiro
Nuclear Medicine / 3DLab, Sheffield Teaching Hospitals, Sheffield, UK
Jonathan Taylor
Department of Radiology, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
Jung Hyun Yoon
Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
Karandeep Singh
Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Karel G. M. Moons & Maarten van Smeden
Harvard T.H. Chan School of Public Health, Boston, MA, USA
Kassandra Karpathakis
Medical University of South Carolina, Charleston, SC, USA
Ken Catchpole
Centre for Trials Research, Cardiff University, Cardiff, UK
Kerenza Hood
Moorfields Ophthalmic Reading Centre and Clinical AI Hub, Moorfields Eye Hospital, London, UK
Konstantinos Balaskas
Applied Decision Science, Cincinnati, OH, USA
Laura Militello
Department of Epidemiology, Care and Public Health Research Institute, Maastricht University, Maastricht, the Netherlands
Laure Wynants
Australian Institute of Machine Learning, University of Adelaide, Adelaide, South Australia, Australia
Lauren Oakden-Rayner
University College London, London, UK
Laurence B. Lovat
Department of Epidemiology, Maastricht University, Maastricht, the Netherlands
Luc J. M. Smits
US Food and Drug Administration, Silver Spring, MD, USA
M. Khair ElZarrad
Hospital Israelita Albert Einstein, São Paulo, Brazil
Mara Giavina-Bianchi
The University of Western Ontario, London, Ontario, Canada
Mark Daley
Duke Institute for Health Innovation, Durham, NC, USA
Mark P. Sendak
Human Factors Everywhere, Woking, UK
Mark Sujan
Department of Operating Rooms, Radboudumc, Nijmegen, the Netherlands
Maroeska Rovers
University of Colorado, Boulder, CO, USA
Matthew DeCamp
The Healthcare Improvement Studies Institute, School of Clinical Medicine, University of Cambridge, Cambridge, UK
Matthew Woodward
Department of Surgery and Cancer, Imperial College London, London, UK
Matthieu Komorowski
Genomics England, Queen Mary University of London, London, UK
Maxine Mackintosh
University of Iowa, Iowa City, IA, USA
Michael D. Abramoff
Big Data Department, Fundación Pública Andaluza Progreso y Salud, Regional Ministry of Health of Southern Spain, Sevilla, Spain
Miguel Ángel Armengol de la Hoz
National Hospital for Neurology and Neurosurgery, Queen Square, London, UK
Neale Hambidge
Skin Analytics, London, UK
Neil Daly
Division of Informatics, Imaging and Data Science, The University of Manchester, Manchester, UK
Niels Peek
Kadoorie Centre for Critical Care Research and Education, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK
Oliver Redfern
Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
Omer F. Ahmad
Amsterdam University Medical Centers, University of Amsterdam, Amsterdam, the Netherlands
Patrick M. Bossuyt
Institute of Ophthalmology, University College London, London, UK
Pearse A. Keane
Centro de Engenharia e Tecnologia Naval e Oceânica–Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal
Pedro N. P. Ferreira
Institute of Public Health, Medical Decision Making and Health Technology Assessment, University for Health Sciences, Medical Informatics and Technology, Tirol, Austria
Petra Schnell-Inderst
Gastrointestinal Endoscopic Surgery, Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
Pietro Mascagni
King’s Health Partners Academic Surgery, King’s College London, London, UK
Prokar Dasgupta
Graduate School of Biomedical Sciences, University of Texas MD Anderson Cancer Center and The University of Texas Health Science Center at Houston, Houston, TX, USA
Pujun Guan
Division of Surgery and Interventional Sciences, University College London, London, UK
Rawen Kader
Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
Ritse M. Mann
The Lancet Digital Health, The Lancet Group, London, UK
Rupa Sarkar
Department of Neurosurgery, Helsinki University Hospital, Helsinki, Finland
Saana M. Mäenpää
Harvard Medical School, Boston, MA, USA
Samuel G. Finlayson
Data Science and its Application, German Research Center for Artificial Intelligence, Kaiserslautern, Germany
Sebastian J. Vollmer
Department of Radiology, Asan Medical Center, Seoul, South Korea
Seong Ho Park
University of York, York, UK
Shakir Laher
Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
Shalmali Joshi
Leiden University Medical Center, Leiden, the Netherlands
Siri L. van der Meijden
Department of Clinical Radiology, Great Ormond Street Hospital for Children NHS Foundation Trust, London, UK
Susan C. Shelmerdine
PUBLIC, Oxford, UK
Tom J. W. Stocker
University of Turin, Turin, Italy
Valentina Giannini
QUEST Centre for Responsible Research, Berlin Institute of Health, Charité Universitätsmedizin Berlin, Berlin, Germany
Vince I. Madai
University Division of Anaesthesia, Department of Medicine, University of Cambridge, Cambridge, UK
Virginia Newcombe
Philosophy Department and School of Medicine, Macquarie University, Sydney, New South Wales, Australia
Wendy A. Rogers
IBM Research Africa, Nairobi, Kenya
William Ogallo
Center for Computational Health, IBM Research, Cambridge, MA, USA
Yoonyoung Park

Authors

Baptiste Vasey
Myura Nagendran
Bruce Campbell
David A. Clifton
Gary S. Collins
Spiros Denaxas
Alastair K. Denniston
Livia Faes
Bart Geerts
Mudathir Ibrahim
Xiaoxuan Liu
Bilal A. Mateen
Piyush Mathur
Melissa D. McCradden
Lauren Morgan
Johan Ordish
Campbell Rogers
Suchi Saria
Daniel S. W. Ting
Peter Watkinson
Wim Weber
Peter Wheatstone
Peter McCulloch

Consortia

the DECIDE-AI expert group

Aaron Y. Lee, Alan G. Fraser, Alastair K. Denniston, Ali Connell, Alykhan Vira, Andre Esteva, Andrew D. Althouse, Andrew L. Beam, Anne de Hond, Anne-Laure Boulesteix, Anthony Bradlow, Ari Ercole, Arsenio Paez, Athanasios Tsanas, Baptiste Vasey, Barry Kirby, Bart Geerts, Ben Glocker, Bilal A. Mateen, Bruce Campbell, Campbell Rogers, Carmelo Velardo, Chang Min Park, Charisma Hehakaya, Chris Baber, Chris Paton, Christian Johner, Christopher J. Kelly, Christopher J. Vincent, Christopher Yau, Clare McGenity, Constantine Gatsonis, Corinne Faivre-Finn, Crispin Simon, Daniel S. W. Ting, Danielle Sent, Danilo Bzdok, Darren Treanor, David A. Clifton, David C. Wong, David F. Steiner, David Higgins, Dawn Benson, Declan P. O’Regan, Dinesh V. Gunasekaran, Dominic Danks, Emanuele Neri, Evangelia Kyrimi, Falk Schwendicke, Farah Magrabi, Frances Ives, Frank E. Rademakers, Gary S. Collins, George E. Fowler, Giuseppe Frau, H. D. Jeffry Hogg, Hani J. Marcus, Heang-Ping Chan, Henry Xiang, Hugh F. McIntyre, Hugh Harvey, Hyungjin Kim, Ibrahim Habli, James C. Fackler, James Shaw, Janet Higham, Jared M. Wohlgemut, Jaron Chong, Jean-Emmanuel Bibault, Jérémie F. Cohen, Jesper Kers, Jessica Morley, Joachim Krois, Joao Monteiro, Joel Horovitz, Johan Ordish, John Fletcher, Jonathan Taylor, Jung Hyun Yoon, Karandeep Singh, Karel G. M. Moons, Kassandra Karpathakis, Ken Catchpole, Kerenza Hood, Konstantinos Balaskas, Konstantinos Kamnitsas, Laura Militello, Laure Wynants, Lauren Morgan, Lauren Oakden-Rayner, Laurence B. Lovat, Livia Faes, Luc J. M. Smits, Ludwig C. Hinske, M. Khair ElZarrad, Maarten van Smeden, Mara Giavina-Bianchi, Mark Daley, Mark P. Sendak, Mark Sujan, Maroeska Rovers, Matthew DeCamp, Matthew Woodward, Matthieu Komorowski, Max Marsden, Maxine Mackintosh, Melissa D. McCradden, Michael D. Abramoff, Miguel Ángel Armengol de la Hoz, Myura Nagendran, Neale Hambidge, Neil Daly, Niels Peek, Oliver Redfern, Omer F. Ahmad, Patrick M. Bossuyt, Pearse A. Keane, Pedro N. P. Ferreira, Peter McCulloch, Peter Watkinson, Peter Wheatstone, Petra Schnell-Inderst, Pietro Mascagni, Piyush Mathur, Prokar Dasgupta, Pujun Guan, Rachel Barnett, Rawen Kader, Reena Chopra, Ritse M. Mann, Rupa Sarkar, Saana M. Mäenpää, Samuel G. Finlayson, Sarah Vollam, Sebastian J. Vollmer, Seong Ho Park, Shakir Laher, Shalmali Joshi, Siri L. van der Meijden, Spiros Denaxas, Suchi Saria, Susan C. Shelmerdine, Tien-En Tan, Tom J. W. Stocker, Valentina Giannini, Vince I. Madai, Virginia Newcombe, Wei Yan Ng, Wendy A. Rogers, William Ogallo, Wim Weber, Xiaoxuan Liu, Yoonyoung Park & Zane B. Perkins

Contributions

B.V., M.N. and P. McCulloch designed the study. B.V. and M.I. conducted the literature searches. Members of the DECIDE-AI Steering Group (B.V., D.C., G.S.C., A.K.D., L.F., B.G., X.L., P. Mathur, L.M., S.S., P. Watkinson and P. McCulloch) provided methodological input and oversaw the conduct of the study. B.V. and M.N. conducted the thematic analysis and Delphi rounds analysis and produced the Delphi round summaries. Members of the DECIDE-AI Consensus Group (B.V., G.S.C., S.P., B.G., X.L., B.A.M., P. Mathur, M.M., L.M., J.O., C.R., S.S., D.S.W.T., W.W., P. Wheatstone and P. McCulloch) selected the final content and wording of the guidelines. B.C. chaired the consensus meeting. B.V., M.N. and B.C. drafted the final manuscript and E&E sections. All authors reviewed and commented on the final manuscript and E&E sections. All members of the DECIDE-AI expert group collaborated in the development of the DECIDE-AI guidelines by participating in the Delphi process, the qualitative evaluation of the guidelines or both.

Corresponding author

Correspondence to Baptiste Vasey.

Ethics declarations

Competing interests

M.N. consults for Cera Care, a technology-enabled homecare provider. B.C. was a Non-Executive Director of the UK Medicines and Healthcare products Regulatory Agency (MHRA) from September 2015 until 31 August 2021. D.C. receives consulting fees from Oxford University Innovation, Biobeats and Sensyne Health and has an advisory role with Bristol Myers Squibb. B.G. has received consultancy and research grants from Philips NV and Edwards Lifesciences LLC and is owner and board member of Healthplus.ai BV and its subsidiaries. X.L. has advisory roles with the National Screening Committee UK, the WHO/ITU focus group for AI in health and the AI in Health and Care Award Evaluation Advisory Group (NHSX, AAC). P. Mathur is the co-founder of BrainX LLC and BrainX Community LLC. M.M. reports consulting fees from AMS Healthcare and honoraria from the Osgoode Law School and the Toronto Pain Institute. L.M. is director and owner of Morgan Human Systems. J.O. holds an honorary post as an Associate of Hughes Hall, University of Cambridge. C.R. is an employee of HeartFlow Inc., including salary and equity. S.S. has received honoraria from several universities and pharmaceutical companies for talks on digital health and AI. S.S. has advisory roles in Child Health Imprints, Duality Tech, Halcyon Health and Bayesian Health. S.S. is on the board of Bayesian Health. This arrangement has been reviewed and approved by Johns Hopkins in accordance with its conflict of interest policies. D.S.W.T. holds patents linked to AI-driven technologies and is a co-founder and equity holder of EyRIS Pte Ltd. P. Watkinson declares grants, consulting fees and stocks from Sensyne Health and holds patents linked to AI-driven technologies. P. McCulloch has an advisory role for WEISS International and the technology incubator PhD program at University College London. B.V., G.S.C., A.K.D., L.F., M.I., B.A.M., S.D., P. Wheatstone and W.W. have no further conflicts of interest to declare.

Peer review

Peer review information

Nature Medicine thanks Alejandro Berlin, Rahul Deo, Isabelle Boutron and Leo Anthony Celi for their contribution to the peer review of this work. Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Supplementary Tables 1–3 and Supplementary Note 1

Reporting Summary

Supplementary Appendix 1

E&E document

Supplementary Appendix 2

DECIDE-AI checklist

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vasey, B., Nagendran, M., Campbell, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 28, 924–933 (2022). https://doi.org/10.1038/s41591-022-01772-9

Received: 20 November 2021
Accepted: 03 March 2022
Published: 18 May 2022
Issue Date: May 2022
DOI: https://doi.org/10.1038/s41591-022-01772-9
