Introduction

Artificial intelligence (AI) has been introduced to healthcare with the promise of assisting or automating tasks to reduce human workload. In publications, medical AI models have been reported to produce promising results in a variety of data-driven scenarios, including clinical decision support, medical image interpretation and risk prediction1,2,3. However, real-world implementation of medical AI interventions has so far been limited and the potential benefits not yet realised. One significant barrier to adoption is the lack of high-quality evidence supporting their effectiveness, such as from randomised controlled trials (RCTs) performed in relevant clinical settings4,5.

RCTs provide the highest quality evidence for evaluating the impact of medical interventions. Importantly, they provide evidence on the effect of interventions on outcomes grounded in benefit to patients and the health system and often generate sufficient evidence to justify widespread adoption. Therefore, it is imperative that RCTs are well designed, properly conducted and transparently reported. Incomplete or unclear reporting obscures potential sources of bias, contributes to research waste and leads to poor decision-making and non-reproducibility of findings6.

Reporting guidelines such as the CONSORT 2010 statement set out consensus-driven minimum reporting standards for the reporting of RCTs7. To provide additional and specific guidance for RCTs involving AI interventions, the CONSORT-AI extension was developed and published in September 20208. CONSORT-AI includes 14 additional checklist items to be reported alongside the 37 CONSORT 2010 items. These items provide elaboration and additional criteria specific to AI, such as reporting algorithm version and input data selection, aiming to improve the completeness and relevance of the original CONSORT statement to AI interventions8.

Many RCTs of AI interventions have been published since CONSORT-AI, but the completeness of reporting is currently unclear. This systematic review aims to assess the completeness of reporting in recent RCTs for AI interventions using CONSORT-AI and to summarise study characteristics to provide insight into this area of research.

Results

In total, 5111 articles were retrieved following deduplication, of which 332 were selected for full-text review following title and abstract screening. Of these, 267 articles that did not meet the inclusion criteria were excluded, including 104 ongoing or unpublished trial registry entries, and 65 RCTs were included in the final analysis9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73. Amongst these, four were RCTs of diagnostic test evaluation, where the primary outcome was diagnostic yield (for example, the effect of an assistive AI intervention on a clinician’s ability to detect disease)13,17,24,36. Whilst these interventional studies did not measure patient outcomes, they were included in this review as concordance with CONSORT-AI guidelines remains relevant. Details of excluded articles are shown in the PRISMA flow diagram74 (Fig. 1). The full list of included RCTs is available in Supplementary Data 1.

Fig. 1: PRISMA flow diagram74.

Study characteristics

The majority of studies were conducted in China (n = 24, 37%)12,15,27,29,30,37,38,39,40,41,43,59,60,61,62,63,64,65,66,68,69,71,72,73, USA (n = 12, 18%)16,19,22,25,28,32,42,51,55,56,67,70 and Japan (n = 5, 8%)9,24,31,33,50. There were 4 international multicentre studies conducted across European sites11,13,34,44 and 10 studies performed within individual European countries: France (n = 3)18,36,58, Italy (n = 2)53,54, Spain (n = 2)21,47, England (n = 1)26, Germany (n = 1)48 and Denmark (n = 1)14. The remainder (n = 10, 15%) took place across South Korea, Taiwan, India, Thailand, Israel, Mexico, Argentina, Rwanda and Malawi, as shown in Fig. 210,17,20,23,35,45,46,49,52,57.

Fig. 2: Location heatmap of included studies by country, showing high distribution within China and USA. Generated using R Statistical Software (v4.1.1, R Core Team 2021).

Median sample size across all included RCTs was 186 (IQR 56–654). Most RCTs were single centre (n = 39, 60%) versus multicentre (n = 26, 40%). Studies were commonly unblinded (n = 24, 37%) or single-blinded (n = 21, 32%), with few double-blinded RCTs (n = 2, 3%); 18 (28%) did not report details of any blinding.

Types of AI intervention

The most common types of AI intervention were endoscopy assistance (n = 13, 20%), image enhancement (n = 11, 17%), image classification (n = 9, 14%), and chatbots (n = 7, 11%). Endoscopy assistance was defined as computer-aided detection of suspicious lesions during colonoscopy or upper endoscopy, highlighting regions on the endoscopist’s display in real time. Image enhancement encompasses AI interventions that modify medical images, such as ultrasound or radiography, to improve clarity or highlight areas of interest. In contrast, image classification involves automated diagnosis or interpretation of medical images by an AI model, with the results informing clinician decision-making. Chatbots use language models to process human speech or text prompts and generate responses; specific uses within the included studies were digital mental health assistance and exercise coaching, used as a supplement to healthcare professional-guided therapy. All RCTs had two arms, with the exception of one study that simultaneously investigated two different chatbot interventions against a control intervention (delivering personalised exercise coaching by smart speaker or by text messaging)25. Full classification and description of interventions is shown in Table 1.

Table 1 Type, frequency and description of AI interventions across included studies

AI interventions were placed into categories according to level of human oversight: ‘data presentation’ (n = 27, 43%), ‘clinical decision support’ (n = 14, 22%), ‘conditional automation’ (n = 6, 10%) and ‘high automation’ (n = 16, 25%). No AI interventions were determined to have ‘full automation’. More broadly, interventions were classified as assistive (non-autonomous) (n = 41, 63%) or autonomous (n = 22, 34%). Two studies (3%) did not report sufficient detail to determine level of human oversight.

Clinical specialty of interventions

When grouped by clinical specialty, most RCTs were in gastroenterology (n = 15, 23%), followed by radiology (n = 5, 8%), primary care (n = 4, 6%), emergency medicine (n = 4, 6%), diabetology (n = 4, 6%) and cardiology (n = 4, 6%). The full distribution of clinical specialties is shown in Fig. 3.

Fig. 3: Distribution of clinical specialties amongst included RCTs, showing a high prevalence of interventions within gastroenterology.

Journal of publication

The 65 RCTs were published across 52 unique medical journals. As of May 2023, only two of the included journals (4%) explicitly mandated CONSORT-AI in their online submission guidelines (The Lancet Digital Health, The Lancet Gastroenterology) and one (2%) recommended CONSORT-AI without an explicit mandate (Ophthalmology Science). However, CONSORT 2010 was explicitly mandated by 28 journals (54%) and recommended without mandate in a further seven journals (13%). The EQUATOR Network (www.equator-network.org) is a comprehensive catalogue of reporting guidelines (including CONSORT-AI) and was recommended by 23 journals (44%) in total, of which eight (15%) specifically mandated its use to locate relevant reporting guidelines. Most journals that recommended use of the EQUATOR Network also explicitly recommended CONSORT 2010 (n = 21, 91%).

Overall CONSORT-AI concordance

Overall median concordance with all CONSORT-AI items (comprising 14 AI-specific items and 37 non-AI-specific items) across all 65 included RCTs was 90% (IQR 77–94%). Two studies (3%) demonstrated 100% concordance34,63. Median overall CONSORT-AI concordance varied between geographical regions: China (86%, IQR 59–92%), USA (92%, IQR 90–94%), Japan (92%, IQR 86–96%) and Europe (93%, IQR 87–96%).
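To illustrate how such per-study concordance figures can be derived, the following is a minimal sketch only; the published analysis used SPSS, and the file name, the "region" column and the per-item columns below are hypothetical assumptions rather than the authors' actual data structure.

```python
# Minimal sketch (illustrative only; the published analysis used SPSS) of how
# per-study CONSORT-AI concordance and regional medians/IQRs could be computed.
import pandas as pd

df = pd.read_csv("consort_ai_extraction.csv")  # hypothetical table: one row per included RCT
item_cols = [c for c in df.columns if c.startswith("item_")]  # 51 CONSORT-AI items

# An item counts towards concordance when fully reported ("yes") or non-applicable ("na").
met = df[item_cols].isin(["yes", "na"]).sum(axis=1)
df["concordance_pct"] = 100 * met / len(item_cols)

# Overall median (IQR), then the breakdown by geographical region.
print(df["concordance_pct"].median(),
      df["concordance_pct"].quantile([0.25, 0.75]).round(1).tolist())
print(df.groupby("region")["concordance_pct"].quantile([0.25, 0.5, 0.75]).unstack())
```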

Ten RCTs (15%) explicitly reported use of CONSORT-AI, nine (14%) reported use of CONSORT 2010 only, five (8%) reported use of CONSORT-EHEALTH and 41 RCTs (63%) did not explicitly report use of any reporting guidelines. One study discussed CONSORT-AI in the limitations but did not make use of it, instead reporting according to CONSORT 201053. Median overall CONSORT-AI concordance amongst studies that reported use of CONSORT-AI was 96% (IQR 94–99%), compared to 92% (IQR 92–94%) for those that used CONSORT 2010 only, 90% (IQR 81–94%) for those that used CONSORT-EHEALTH, and 84% (IQR 62–91%) for those that did not mention use of any reporting guidelines (see Table 2).

Table 2 Overall CONSORT-AI concordance according to self-reported use of guidelines

Given that CONSORT 2010 has been widely adopted for many years, whereas the additional AI-specific items are relatively new recommendations, the following sections discuss reporting of AI-specific and non-AI-specific items separately.

AI-specific CONSORT-AI items

When considering the 14 AI-specific CONSORT-AI items only, median concordance across all studies was 86% (IQR 71–93%). Just over half of studies (n = 36, 55%) reported 12 or more of the 14 checklist items, with four studies (6%) reporting 9 or fewer items. Of the six studies (9%) that achieved 100% concordance, five had reported use of the CONSORT-AI checklist and one had not. Median concordance varied between geographical regions: China (79%, IQR 71–85%), USA (86%, IQR 73–91%), Japan (86%, IQR 82–96%) and Europe (93%, IQR 86–93%).

Concordance also varied between AI-specific items (Table 3). Concordance was especially low for items 5 (i) (stating algorithm version) and 25 (whether the AI intervention / code can be accessed): 20% and 42%, respectively. Items 4a (ii) (inclusion criteria for input data), 5 (iii) (handling of poor-quality input data) and 19 (analysis of performance errors) were also relatively poorly reported. 100% concordance was observed for items 1a,b (ii) (stating intended use of intervention), 5 (v) (stating output of intervention) and 5 (vi) (explaining how the output contributed to decision-making).

Table 3 Percentage concordance with AI-specific CONSORT-AI items8

There was no significant correlation between date of publication and CONSORT-AI concordance (Spearman’s r = −0.21, p = 0.091). However, this exploratory analysis was limited by the small number of studies and narrow date range.
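As an illustration of how this exploratory correlation could be computed, a hedged sketch using scipy is shown below; the column names, and the use of Python rather than the SPSS workflow described in the Methods, are assumptions for illustration only.

```python
# Hedged sketch of the exploratory Spearman correlation between publication date
# and overall CONSORT-AI concordance; column names are illustrative assumptions.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("consort_ai_extraction.csv", parse_dates=["publication_date"])
# Assumes a per-study concordance column computed as in the earlier sketch.
rho, p = spearmanr(df["publication_date"].apply(lambda d: d.toordinal()),
                   df["concordance_pct"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")  # reported result: -0.21, p = 0.091
```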

Non-AI-specific CONSORT-AI items

For the 37 non-AI-specific CONSORT-AI items (i.e., those contained within CONSORT 2010), median concordance across all RCTs was 92% (IQR 76–97%). Eight studies (12%) demonstrated 100% concordance with the non-AI-specific items, of which seven had explicitly reported use of CONSORT 2010 or CONSORT-AI. Median non-AI-specific CONSORT-AI concordance varied between geographical regions: China (88%, IQR 54–95%), USA (97%, IQR 93–97%), Japan (95%, IQR 85–99%) and Europe (95%, IQR 87–98%). Mean concordance for non-AI-specific items can be found in Supplementary Data 2.

There were several non-AI-specific CONSORT-AI items that were relatively poorly reported, including item 10 (who generated the allocation sequence / enrolled participants / assigned participants to interventions) at 51%, and item 24 (access to the full trial protocol) at 31%. Reporting was also suboptimal for sample size calculation, randomisation methods, harms / unintended effects and trial registration details (Supplementary Data 2).

Journal reporting guideline mandates

Overall, reporting concordance with CONSORT-AI was generally high regardless of whether journals mandated its use (Table 4). Median CONSORT-AI concordance was higher for RCTs published in journals where CONSORT-AI was mandated (n = 2, 3%), at 100%, versus 90% (IQR 76–94%) for RCTs published in journals that did not mandate CONSORT-AI (n = 63, 97%).

Table 4 RCT CONSORT-AI concordance according to reporting guideline mandates from their journals of publication

RCTs published in journals where CONSORT 2010 was mandated (n = 30, 46%) had a higher overall median CONSORT-AI concordance of 92% (IQR 90–95%), versus 82% (IQR 61–90%) where CONSORT 2010 was not mandated (n = 35, 54%). This is primarily attributable to non-AI-specific item concordance, which had a median of 92% (IQR 90–95%) versus 81% (IQR 57–92%) in CONSORT 2010 mandated versus non-mandated journals, respectively. AI-specific items also showed higher concordance when CONSORT 2010 was mandated, with median 86% (IQR 79–93%) versus 79% (IQR 71–93%).

Discussion

The primary aim of this review was to determine the extent to which published RCTs report according to the CONSORT-AI extension since its publication in September 2020. We found 65 RCTs evaluating AI interventions in a variety of clinical settings and countries. Only 10 RCTs mentioned use of CONSORT-AI and 9 mentioned use of CONSORT 2010. Despite this, concordance with CONSORT-AI was generally high. There remain notable areas of poor reporting, such as stating the AI algorithm’s version, explaining whether or how the AI algorithm could be accessed, and reporting details and availability of the full study protocol. From a journal mandate point of view, only 3 out of 52 journals instructed or recommended use of the CONSORT-AI checklist. It was unsurprising that journal mandates for use of CONSORT-AI were associated with greater concordance with CONSORT-AI reporting items (100% concordance versus 90%). However, we also found that AI RCTs published in journals mandating CONSORT 2010 were more completely reported against CONSORT-AI than those published in journals without such a mandate (92% concordance versus 82%). This may point towards a higher level of editorial scrutiny in journals which promote better reporting practices.

We found poor reporting for item 5 (i), regarding the statement of the algorithm version used, reported by only 20% of studies. Lack of reporting on algorithm versioning (or another type of traceable identifier) raises significant concerns when appraising evidence of past and future studies of the same AI intervention. Without a traceable identifier, significant adjustments and updates (if any) made over the lifetime of the AI intervention cannot be tracked, so comparison between studies becomes difficult. This is becoming more relevant as AI medical devices come to market with supporting evaluation evidence published years earlier. Stating whether the AI intervention or its code could be accessed (item 25) was also poorly reported, with concordance of 40%. This may impede the ability of other researchers to achieve independent evaluation and potentially replication of findings, especially when the AI device is not a commercially available product and there is no named manufacturer. The remaining AI-specific CONSORT-AI items with lower concordance were item 4a (ii), regarding inclusion criteria at the level of the input data, and item 5 (iii), regarding how poor-quality input data were handled, both of which are important for reproducibility of the intervention in future trials and real-world use. Additionally, relatively few RCTs reported item 19, regarding results of performance error analysis, indicating that exploration of AI errors, undertaken to gain further insight into the nature, causes and consequences of AI failures, remains non-standard practice.

Overall, concordance with non-AI-specific CONSORT-AI items was higher than for AI-specific items, at 92% (IQR 76–97%) versus 86% (IQR 71–93%), likely due to the longstanding ubiquity of CONSORT 2010 amongst the medical scientific community and its widespread acceptance as the standard of reporting. Despite this, low concordance was observed for several items, most notably providing access to the full trial protocol (item 24) with a concordance of only 31%. This has implications for reporting transparency, as unreported protocol deviations may obscure bias in the methodology and presentation of findings.

Most RCTs did not mention using specific reporting guidelines and only 10 out of the 65 included studies explicitly reported use of CONSORT-AI. This low uptake may be explained by the lack of journal mandates in instructions to authors. The CONSORT-AI extension was mandated by only two of the 52 journals in which the included studies were published, with one additional journal recommending its use without mandate. Other journals either recommended CONSORT 2010 or signposted to generic resources such as the EQUATOR Network, where finding CONSORT-AI would be left to individual authors’ initiative.

Previous research on instructions for authors in high impact factor journals, in the context of CONSORT 2010, has shown that journal endorsement is sometimes lacking, especially endorsement of specific extensions75. Following the publication of CONSORT-AI in late 2020, the working group reached out to editors of over 110 medical journals to raise awareness of the availability of these new standards. CONSORT-AI has been referenced by policy and regulatory bodies including the WHO76, FDA77 and MHRA78, and has received over 400 citations to date. Despite this, journal uptake remains low, so mechanisms to lower barriers to adoption may require further consideration. One approach could be editorial submission systems with tick boxes for authors to indicate the type of work being submitted, which then automatically deliver the appropriate reporting checklist to be submitted with the paper. Such mechanisms would help ensure transparent reporting whilst reducing the burden on journal editors.

This systematic review provided an opportunity to assess the applicability and interpretation of CONSORT-AI recommendations across a diverse range of RCTs published since September 2020. Given the fast-moving nature of the field, this review also served as a mechanism for reflecting on the clarity of the CONSORT-AI extension and for considering whether its items remain applicable to new and emerging types of AI interventions.

For item 1a,b (i) – “indicate that the intervention involves artificial intelligence/machine learning in the title and/or abstract and specify the type of model” – the type of AI model was frequently not specified within the abstract. A decision was made in this review not to impose stringent requirements for the “type of model” component. It is debatable how meaningful a short description of model type in the title and abstract can be, and a full description of the AI model is perhaps more relevant for diagnostic test accuracy studies and model development and validation studies (where STARD-AI and TRIPOD + AI, respectively, are the more relevant reporting guidance)79,80.

We also want to reflect on difficulties experienced by our reviewers when assessing certain items, which may be due to poor reporting by authors of the RCTs, but could also indicate a lack of clarity in the item itself. For example, for item 5 (iii) – “describe how poor-quality or unavailable input data were assessed and handled” – it was difficult to interpret the information provided as a separate consideration from item 4a (ii) – “state the inclusion and exclusion criteria at the level of the input data”. There were several disagreements during data extraction which required discussion, as it was unclear whether some RCTs were describing input data inclusion and exclusion criteria (item 4a (ii)) or the quality of the actual input data post-inclusion (item 5 (iii)). Further elaboration may be needed to differentiate these two criteria in the CONSORT-AI documentation and/or to provide authors with more specific reporting instructions. Item 5 (iii) was also difficult to apply to certain AI interventions, especially AI-assisted endoscopy. Some AI interventions will, by design, automatically exclude data that cannot be processed, which is desirable from a safety perspective. This means that item 5 (iii) may be inapplicable and therefore less likely to be reported.

Similarly, for item 19 – “describe results of any analysis of performance errors and how errors were identified” – assessment of concordance was challenging for certain AI interventions, especially those involving digital therapeutics (for example, AI-delivered cognitive behavioural therapy, counselling or rehabilitation). Analysis of performance errors was rarely performed in these studies, but we also found it difficult to define what performance errors could look like and how they could be measured within a trial setting for such interventions. Errors could be subtle and difficult to verify beyond obviously nonsensical responses. It may be appropriate to report an evaluation of harmful effects caused by the AI intervention, including disparate harms across subpopulations; however, these effects may be difficult to detect. As applications of AI technologies evolve, it is important that guidelines maintain relevance. Given the rapid growth of digital therapeutics and medical large language models, this could be an area of focus for subsequent CONSORT-AI iterations81,82.

An additional reflection is that this review identified a high proportion of trials evaluating AI-assisted endoscopy interventions for gastroenterology. This is in keeping with findings from a recent review83, and may be explained by the challenge of assessing the performance of these devices outside interventional trials (given that AI-assisted endoscopy is implemented in real time). For other AI interventions, such as image classification systems, observational retrospective or prospective studies can provide indications of diagnostic accuracy, with the downstream impact on health and resource outcomes less commonly evaluated. Furthermore, the performance of AI-assisted endoscopy is typically evaluated by measuring adenoma detection rate as an outcome. This can only be determined by performing polyp removal and confirmation using histopathology; therefore, interventional trials are necessary.

Previous systematic reviews have used the CONSORT-AI checklist to evaluate reporting completeness of RCTs involving AI interventions in healthcare84,85,86. However, these differ from the current systematic review in methodology (for example, using a less sensitive search strategy consisting of three search terms85) and in their incomplete application of CONSORT-AI (specifically, excluding three of the 14 AI-specific CONSORT-AI items84). Additionally, these reviews executed their literature searches in 2021 or earlier, less than a year after CONSORT-AI was published. Our systematic review used a robust search strategy, including clinical trials registries, and was carried out in conjunction with CONSORT-AI authors to ensure that each item was interpreted correctly. Furthermore, this review covers a two-year article submission period following publication of CONSORT-AI to provide a fairer assessment of initial uptake.

One limitation of this systematic review was the potential for incomplete study retrieval despite best efforts to maximise sensitivity. For example, some RCTs published in computer science journals did not explicitly identify as RCTs in the title, abstract or keywords, which could mean other similar trials were not retrieved by the literature search. Furthermore, indexing errors for study status in trial registry entries may have led to incorrect exclusion of published studies whose registry entries had not been updated. An attempt was made to mitigate this by searching relevant trial registration numbers through Google Search if no linked publication was included on the trial registry page. It should also be acknowledged that publications included in our review may have been submitted soon after publication of the CONSORT-AI guidelines (September 2020) and, given the length of editorial processes, may not have had sufficient time to be drafted in accordance with them. Our search strategy included terms describing AI and ML, which inevitably confounds concordance with CONSORT-AI item 1a,b (i): “Indicate that the intervention involves artificial intelligence / machine learning in the title and/or abstract and specify the type of model.” However, these are necessary keywords for literature searching and therefore an unavoidable confounder. Finally, non-English language RCTs were excluded, which has the potential to introduce bias, particularly when considering the diverse geographical spread of RCTs.

In conclusion, the results of this systematic review have shown that in the 2-year period since publication of CONSORT-AI in September 2020, most AI-specific CONSORT-AI items were well reported across relevant studies. However, a small number of specific items remain poorly reported. As with other reporting guidelines, the potential value of CONSORT-AI in improving reporting would be further enhanced by encouraging adoption, for example, through recommendations (or even mandates) from journals or funders. This systematic review has indirectly served as a test of the feasibility and usability of CONSORT-AI, indicating that some minor modifications in future updates to the checklist may help improve accessibility for authors and maintain relevance to the latest AI technologies. Arguably, it is still too early to evaluate the impact of CONSORT-AI fully, given that many RCTs take years to complete and be published. Future reviews of AI RCTs could compare these findings to new and ongoing RCTs that will be published in the coming years.

Methods

This systematic review is reported according to the PRISMA 2020 statement74. The protocol was prospectively registered on the Centre for Open Science’s Open Science Framework (OSF) Registry (doi.org/10.17605/OSF.IO/CRF3Q).

Search strategy

A combination of keywords and MeSH terms was used to identify RCTs on interventions involving AI, for example: “artificial intelligence”, “decision support system”, “deep learning” and “neural network”, in addition to specific terms such as “naïve bayes”, “random forest” and “multilayer perceptron”. A modified version of the Cochrane RCT sensitivity and precision maximising filter was used to improve relevant article retrieval87. The search strategy was developed in conjunction with an information specialist and was not adapted from any previous reviews. Keywords and subject headings were adjusted for each database as required. Database search strategies and PRISMA-S checklist are included in Supplementary Information.
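For readers unfamiliar with how such strategies are assembled, the snippet below sketches, in simplified form, how AI/ML keywords can be combined with an RCT filter into a single boolean query. The AI terms are examples taken from the text; the RCT filter terms are generic stand-ins, and neither reproduces the authors' actual database-specific strategies or the modified Cochrane filter (see Supplementary Information).

```python
# Simplified, illustrative assembly of a boolean search query combining AI/ML
# keywords with a generic RCT filter; not the authors' actual search strategy.
ai_terms = [
    "artificial intelligence", "machine learning", "deep learning",
    "neural network", "decision support system",
    "naive bayes", "random forest", "multilayer perceptron",
]
rct_filter = [
    "randomized controlled trial", "randomised controlled trial",
    "controlled clinical trial", "randomly", "placebo",
]

ai_block = " OR ".join(f'"{t}"' for t in ai_terms)
rct_block = " OR ".join(f'"{t}"' for t in rct_filter)
print(f"({ai_block}) AND ({rct_block})")
```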

MEDLINE, Embase and Cochrane Central databases were searched on 19th September 2022. Clinical trial registries, including the International Clinical Trials Registry Platform and ClinicalTrials.gov, were searched for completed studies on the same date. Articles published from 9th September 2020 onwards, following the date of CONSORT-AI publication, were retrieved for screening. Articles were restricted to English language. Reference lists of included articles and of identified secondary research sources (prior to their exclusion) were screened for relevant articles. The database searches were not repeated.

Study selection

Eligibility criteria were primary reports of RCTs involving AI interventions within any healthcare setting, available in the English language. AI interventions were defined as any devices or tools with an AI or machine learning component, determined by reviewers during screening. Conference abstracts, protocols and studies primarily evaluating robotics were excluded. Articles submitted to the journal of publication prior to the release of CONSORT-AI guidelines (September 2020), determined by online publication history, were excluded.

Covidence systematic review software (2022) was used to collate references, deduplicate and screen for inclusion at both title / abstract and full-text stages88. Titles and abstracts were independently screened by two authors (AM and VN). Full-text articles of eligible studies were retrieved and independently assessed in detail by two authors (AM and VN) before inclusion or exclusion, with reasons given for the final decision. Disagreements were resolved by discussion or by a senior author (XL).

Data extraction

Two authors (AM and XL) independently extracted data from the final selection of RCTs, including study characteristics (first author, date of publication, country of study, medical specialty, publishing journal, number of study sites, blinding, study duration, sample size, randomisation technique, experimental and control interventions, AI characteristics, level of human oversight, use of CONSORT-AI) and concordance with the 14 AI-specific items of the CONSORT-AI checklist. Level of human oversight was classified according to a graded autonomy model described by Bitterman et al., which included the categories: ‘data presentation’ (AI highlights areas for review by the clinician), ‘clinical decision support’ (AI calculates a risk score that is interpreted by the clinician), ‘conditional automation’ (AI acts autonomously with clinician as backup), ‘high automation’ (AI acts autonomously with no clinician backup or validation) and ‘full automation’ (as for ‘high automation’ but can be used across all populations or systems)89. Any conflicts were resolved by discussion. For each journal of publication of the included RCTs, online submission guidelines were accessed to determine the recommended RCT reporting guidelines, including whether CONSORT-AI was recommended or mandated. Journal submission guidelines and concordance with the 37 non-AI-specific items of CONSORT-AI were assessed by two authors (AM and BN), with any conflicts resolved by a senior author (XL). Risk of bias assessment was not conducted as this review was primarily concerned with completeness of reporting for AI-specific considerations, rather than RCT outcomes and intervention effectiveness.
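To make the oversight classification concrete, the sketch below encodes the graded autonomy categories described above as a simple enumeration; the category descriptions follow Bitterman et al.89 as summarised in the text, while the class and field names are our own illustrative choices rather than part of the extraction tooling.

```python
# Illustrative encoding of the graded autonomy model used to classify the
# level of human oversight of each AI intervention (categories per ref. 89).
from enum import Enum

class HumanOversight(Enum):
    DATA_PRESENTATION = "AI highlights areas for review by the clinician"
    CLINICAL_DECISION_SUPPORT = "AI calculates a risk score that is interpreted by the clinician"
    CONDITIONAL_AUTOMATION = "AI acts autonomously with clinician as backup"
    HIGH_AUTOMATION = "AI acts autonomously with no clinician backup or validation"
    FULL_AUTOMATION = "As for high automation, but usable across all populations or systems"

# Example: tagging a hypothetical extracted study record with its oversight category.
study = {"study_id": "example_rct", "oversight": HumanOversight.DATA_PRESENTATION}
print(study["oversight"].name, "-", study["oversight"].value)
```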

Data synthesis

The primary analysis assessed CONSORT-AI concordance as the percentage of RCTs reporting each item. Results relating to concordance are reported for all CONSORT-AI items, as well as for AI-specific and non-AI-specific items separately. Concordance is then reported according to the country of RCT conduct to examine variations in reporting practice across geographies. Lastly, concordance is reported according to whether the journal of publication mandated or endorsed the use of CONSORT-AI, CONSORT 2010 and/or any other reporting guidance. Concordance was defined as fulfilment of all components of each CONSORT-AI item, or the item being non-applicable. This rule was applied to all items with the exception of item 1a,b (i) – “indicate that the intervention involves artificial intelligence/machine learning in the title and/or abstract and specify the type of model”. After reviewing a sample of studies, we found the type of AI model was frequently not specified within the abstract. For this review, RCTs were considered to achieve this criterion as long as AI or ML was described; stringent requirements for the “type of model” component were not applied. Analysis of study characteristics was performed using descriptive statistics and figures. An exploratory analysis that was not part of the original protocol was carried out using Spearman’s rank-order correlation to determine whether CONSORT-AI concordance had changed with later dates of publication. P-values under 0.05 were considered significant. Statistical analysis was performed using Statistical Package for Social Sciences (SPSS) for Windows, Version 25.0.
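As a worked illustration of this concordance definition, the sketch below computes item-level concordance, counting an item as concordant when it is fully reported or non-applicable. It reuses the same hypothetical extraction table assumed in the earlier sketches and is illustrative only, not the SPSS analysis actually performed.

```python
# Illustrative item-level concordance: percentage of RCTs fulfilling, or judged
# non-applicable for, each CONSORT-AI item (hypothetical extraction table).
import pandas as pd

df = pd.read_csv("consort_ai_extraction.csv")
item_cols = [c for c in df.columns if c.startswith("item_")]

item_concordance = (df[item_cols].isin(["yes", "na"]).mean() * 100).round()
print(item_concordance.sort_values())  # lowest-reported items first
```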

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.