Overcoming the reporting deficit in biomedical AI

The past two decades have seen massive advances and rapidly declining costs in high-throughput technologies that produce enormous amounts of biomedical data. This development has been accompanied by breakthroughs in the field of artificial intelligence (AI). With the help of AI, high-dimensional data can now be modeled in a mathematically robust and accurate way, which has led to numerous applications in biomedical research. For example, AI has been successfully used to determine particles in cryogenic electron microscopy projection images1, to infer proteins from mass spectrometry data2, to conduct exploratory analysis of single-cell data3 and to predict incipient circulatory failure in the intensive care unit4.

In spite of the obvious potential of AI in biomedical research, we observe trends that are detrimental to the development of new, improved AI methods and also constitute major hurdles in applying biomedical AIs in basic or translational biomedical research. Best practices of machine learning are not always adhered to, and often only selected aspects of the AI models and their evaluation are reported5. Because of this, the decisions of biomedical AIs are often opaque, difficult to explain and not fully reproducible6,7,8,9,10,11,12. In clinical research in particular, it is crucial to instill trust in AI models and to report on them in an explicit and transparent fashion that adheres to commonly used standards5,12,13. Or, as put by Davenport et al.10: “For widespread adoption to take place, AI systems must be approved by regulators [and] standardised to a sufficient degree [...].”

To address this problem, several checklists and guidelines for reporting AI methodology and results in biomedical and clinical research have been proposed recently14,15,16,17,18,19,20,21. This, however, is only a first step toward resolving the reporting deficit because mere guidelines and checklists do not make biomedical AI reports accessible to the scientific community. Moreover, guidelines and checklists provide no practical means to identify biomedical AIs that do not adhere to the recommended best practices. We believe that what is needed is a community-driven registry that allows authors of new biomedical AIs to easily generate accessible, browsable and citable reports that can be scrutinized and reviewed by the scientific community.

In view of this, we present the AIMe registry for artificial intelligence in biomedical research: https://aime-registry.org. It consists of a user-friendly web service that guides authors of new AIs through the AIMe standard, a generic minimal information standard that allows reporting of any biomedical AI system. Once a report conforming to the AIMe standard has been completed, a database entry and an HTML report are created, along with a unique AIMe identifier. The identifier keeps the entry openly accessible and can be disseminated by the authors, for example by inclusion in a manuscript.

We have designed the AIMe registry as a community-driven platform for AI in biomedicine. It allows users to raise issues related to existing entries if they have doubts concerning their adequacy or informativeness. Moreover, we will update the reported AIMe standard each year based on feedback from the scientific community. Interested researchers are invited to join the AIMe steering committee, which consolidates the feedback into an updated version of the AIMe standard.

The remainder of this paper is organized as follows: we begin by presenting the first version of the AIMe standard. We then present the AIMe registry and detail how it incorporates feedback from the scientific community. In the section on governance, we formulate the mission of the AIMe initiative and provide details on the structure of the organization as well as the yearly revision process. Finally, we conclude the paper.

The AIMe2021 standard

Here, we present the first version of the AIMe standard, the AIMe2021 standard. To design it, we proceeded as follows: as a first step, the initial AIMe steering committee, composed of the co-authors affiliated with the Chair of Experimental Bioinformatics of the Technical University of Munich, the University of Hamburg and the Department of Mathematics and Computer Science of the University of Southern Denmark, compiled a draft of the AIMe2021 standard. We then shared a call for contributions via social media and mailing lists, in which we asked interested researchers to provide feedback and to join the AIMe steering committee. All other co-authors of this paper responded to this call. Finally, we consolidated the feedback into the AIMe2021 standard via a collaborative document editing effort coordinated by the first and last authors of this paper.

The AIMe2021 standard is divided into five sections: Metadata, Purpose, Data, Method and Reproducibility. The formal YAML specification of the AIMe2021 standard is available at https://aime-registry.org/specification/. Examples of AIMe reports are available at https://aime-registry.org/database/.

Metadata

The AIMe standard asks authors of biomedical AIs to report basic metadata for their methods (Supplementary Fig. 1). In a first series of questions, the authors are asked to provide metadata about the paper and the corresponding author(s) (MD.1–MD.6). They should also disclose funding sources (MD.7) and specify whether the entry should appear among the results when searching the AIMe database (MD.8). Temporarily excluding a report from the search might be useful if the reported AI has not been published yet. However, all created reports remain publicly accessible via their unique AIMe identifiers and automatically become searchable once a paper ID or URL is added under MD.4. Moreover, authors can upload other checklists or reports they might have filled in (MD.9), such as the MI-CLAIM checklist18.

Purpose

In this section, authors are requested to elaborate on the purpose of their biomedical AI (Supplementary Fig. 2). They should state what their AI is designed to learn or predict (P.1) and whether it predicts a surrogate marker rather than a directly measurable response variable (P.2). Furthermore, AIMe requests that the authors specify a category to which their AI problem belongs (P.3). Typical categories are classification (assign discrete labels to all items), regression (predict a real-valued number for all items), clustering (partition a set of items into subsets of homogeneous groups), ranking (learn an ordering for a set of items), dimensionality reduction (compress all items’ initial high-dimensional representations) and data generation.

Data

In biomedical research, it is common practice to include multiple datasets in the same pipeline to gain insights into complex biological processes. The AIMe standard therefore asks authors of new AIs to add each dataset employed separately and then to characterize it in terms of data availability, possible biases and applied transformations (Supplementary Fig. 3).

For each dataset x, the authors should report the type of data (D.x.1)—e.g., expression, methylation or phenotype data. For instance, if an AI uses gene expression data to predict the body mass index (BMI), then the authors should add one dataset for the BMI data and a separate dataset for the expression data. Because there are often no gold-standard data for biomedical AI problems, new AIs are often evaluated on simulated data. In view of this, AIMe asks the authors to specify whether their data is real or simulated (D.x.2). Moreover, the authors should report whether the dataset is publicly available (D.x.3) and specify whether it was used for training the AI method (D.x.4).

Biomedical data are often subject to various biases22,23,24. Even when these biases can be addressed appropriately, readers should still be aware of them to avoid possible misinterpretations. Therefore, AIMe asks the authors if, and if so how, they have checked whether their data are subject to biases (D.x.5). AIMe also requests that authors report the dimensionality of their data, i.e., the number of samples and features (D.x.6). This is especially important because high-dimensional data often exhibit multicollinearity and sparsity25, which in turn tend to negatively affect the efficiency of AI systems26 and often lead to overfitting. As most AI methods are not scale invariant, the data usually need to be normalized during pre-processing. Consequently, AIMe asks the authors if, and if so how, they have pre-processed their data (D.x.7).
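As an illustration of the kind of pre-processing step that could be reported under D.x.7, the following sketch standardizes features so that scale-sensitive methods are not dominated by high-variance features. It uses the scikit-learn library and synthetic data; neither this library nor this particular transformation is prescribed by the AIMe standard, and all variable names are hypothetical.

# Illustrative pre-processing sketch (not part of the AIMe standard itself).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # 200 samples, 50 features (synthetic)
y = rng.normal(size=200)         # synthetic response variable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit scaling on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics to avoid leakage

Reporting such a step, including the fact that the scaler was fit on the training data only, makes it much easier for readers to judge whether information leakage between training and test data was avoided.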

Method

The next series of questions addresses the specific AI methods (Supplementary Fig. 4). The first question AIMe asks in this regard is which AI or mathematical methods (e.g., logistic regression, random-forest classification, deep neural networks, ant colony optimization, genetic programming) were used (M.1). Next, the authors must specify how they selected the method’s hyper-parameters (e.g., number of trees in random-forest models) (M.2). This is important because hyper-parameters typically have an enormous impact on method performance but are often not reported in the publications27,28.
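To illustrate what a documented hyper-parameter selection procedure (M.2) might look like, the following sketch tunes the number of trees of a random-forest classifier via cross-validated grid search. The data, parameter grid and scoring metric are hypothetical choices made for this example only and are not mandated by AIMe.

# Illustrative hyper-parameter search (hypothetical data, grid and metric).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200, 400]},  # candidate numbers of trees
    cv=5,                                              # 5-fold cross-validation
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_)  # reporting the selected values supports reproducibility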

The AIMe standard also contains questions related to the validation and verification of the AI method used. The first of these questions asks which test metrics (e.g., Gini coefficient, running time, mean squared error) were used to evaluate the method (M.3). Next, the authors are asked to report how they prevented overfitting, i.e., how they ensured that their AI model does not merely memorize the training data but can generalize to unseen, independent data (M.4). Overfitting can be mitigated by various techniques such as ensemble learning, cross-validation and regularization.
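A minimal sketch of one way to document such an overfitting check (M.4) is shown below: it contrasts resubstitution performance on the training data with cross-validated performance on held-out folds. The model, data and metric are again purely illustrative.

# Illustrative overfitting check (hypothetical model, data and metric).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)

cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # held-out folds
model.fit(X, y)
train_score = model.score(X, y)                                     # resubstitution

# A large gap between the two values suggests memorization rather than generalization.
print(f"training accuracy: {train_score:.2f}, cross-validated accuracy: {cv_scores.mean():.2f}")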

Moreover, AIMe asks the authors to clarify whether they have checked for trigger situations that induce their method to fail in its task (M.5). A typical trigger situation is the presence of confounding factors, i.e., variables that influence both the model input and the output variables and, as a result, potentially distort the results29. The authors are also required to report whether they have checked if randomized steps in their AI affect the stability of the results (M.6). Finally, they should specify whether they have compared their AI method to simple baseline models (M.7), as well as to state-of-the-art competitors (M.8).
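The following sketch illustrates how the stability check (M.6) and a simple baseline comparison (M.7) might be documented: the randomized method is re-run with several seeds, and its performance is contrasted with that of a trivial majority-class classifier. All modeling choices here are hypothetical and serve only as an example.

# Illustrative stability and baseline checks (hypothetical setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

# Stability across random seeds (M.6): how much do the results fluctuate?
seed_scores = [
    cross_val_score(RandomForestClassifier(n_estimators=200, random_state=seed),
                    X, y, cv=5).mean()
    for seed in range(5)
]
print(f"accuracy across seeds: {np.mean(seed_scores):.3f} +/- {np.std(seed_scores):.3f}")

# Simple baseline (M.7): a majority-class classifier as a reference point.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
print(f"majority-class baseline accuracy: {baseline:.3f}")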

Reproducibility

The last four questions help increase the reproducibility of the experiments that validate the proposed AI (Supplementary Fig. 5). First, the authors are asked whether they provide all means to easily re-run their AI, e.g., by providing conda or pip packages, Dockerfiles, language-specific build system files or detailed README files (R.1). They are also required to provide information about the source code availability of the main AI method, the data simulator (if applicable) and the pre-processing pipeline (R.2). Next, AIMe asks the authors whether they provide a pre-trained model, e.g., by uploading it to repositories such as Kipoi30 (R.3). Finally, the authors should elaborate on the software and hardware environments required to run their AI method (R.4).
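As an example of how a pre-trained model could be packaged for re-use (R.3), the sketch below serializes a trained model with joblib and reloads it for prediction. The file name, library and serialization format are illustrative choices and are not prescribed by the AIMe standard.

# Illustrative packaging of a pre-trained model (hypothetical file name and format).
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

joblib.dump(model, "pretrained_model.joblib")      # ship alongside code and README (R.1, R.2)
reloaded = joblib.load("pretrained_model.joblib")  # users can predict without re-training
print(reloaded.predict(X[:5]))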

The AIMe registry

The AIMe registry provides three main services: add a new report, query the database and contribute to AIMe (Fig. 1).

Fig. 1: Overview of the AIMe registry.

Users can create a new report, query the database to find existing entries and raise issues, and contribute to AIMe by joining the AIMe steering committee or providing feedback that will be incorporated into the next version of the standard.

Creating a new report

During the creation of a new report, AIMe guides authors of new AIs through the current version of the AIMe standard (as discussed earlier in the description of the standard). To ensure that the standard is generically applicable, the system allows authors to skip some of the questions if the information required to answer them is not available. To encourage authors to skip as few questions as possible, a validation and a reproducibility score are computed for each report. The scores range from 0 to 10: the higher the scores, the fewer questions concerning validation and reproducibility of the reported AI have been skipped. Authors of AIMe reports can edit previously created reports at any time, but all previous versions will remain visible in the HTML report.
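The exact score formulas are defined in the openly accessible YAML specification of the AIMe standard. Purely as an illustration of the general idea, and not of the actual implementation, the following sketch maps the fraction of answered questions in a section to a 0 to 10 scale.

# Illustrative score computation; the actual formulas are defined in the AIMe YAML specification.
def section_score(answers):
    """Map the fraction of answered (non-skipped) questions to a 0-10 scale."""
    answered = sum(1 for value in answers.values() if value is not None)
    return 10 * answered / len(answers)

# Hypothetical reproducibility answers in which R.2 and R.4 were skipped.
reproducibility_answers = {"R.1": "conda package", "R.2": None,
                           "R.3": "pre-trained model on Kipoi", "R.4": None}
print(section_score(reproducibility_answers))  # prints 5.0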

Querying the AIMe database

Users can find existing reports in the AIMe database via their unique AIMe IDs, or search the database for reports of interest via full-text or keyword search. If users identify answers in the reports they deem inappropriate, uninformative or misleading, they can raise issues after providing their personal information (name and email address). The reports’ corresponding authors can reply to the issues, and they are allowed two weeks to notify AIMe’s executive board about offensive or otherwise inappropriate issues. If the authors raise no complaints or the executive board classifies the complaints as unwarranted, the issues and the personal information of the users who raised them, as well as the authors’ replies, are appended to the reports. Note that, because AIMe is committed to open peer review, issues that are due to misunderstandings but do not contain any insulting or off-topic elements will not be classified as inappropriate. Hence, by raising issues, members of the scientific community can review existing AIMe reports. This is important because it helps reveal reports in which questions are answered inadequately.

Contributing to AIMe

The Contribute functionality of the AIMe registry allows interested members of the scientific community to actively shape future versions of the AIMe standard by providing suggestions for improvement and requesting membership in the steering committee (as discussed below in the section on governance). All versions of the AIMe standard are formally specified in a YAML-based language. This ensures that the structure of old reports will remain well defined even after the current standard is updated at the beginning of each year. The YAML specifications are available at https://aime-registry.org/specification/.

AIMe governance

Mission

The mission of the AIMe initiative is to promote open, transparent and reproducible biomedical AI research. For this, we provide a community-driven registry, where biomedical AI researchers can report their AI models in a standardized fashion, search the AIMe database for AI systems related to their work and comment on existing reports as well as the AIMe standard itself (see “The AIMe Registry” above). The AIMe initiative is committed to the following principles of open science31,32.

  • Open peer review: Registry users who raise an issue on an existing entry are required to provide personal information, and all issues are appended to the reports and hence visible in the database (unless they are deemed by the AIMe executive board to be offensive or off-topic).

  • Open methodology: The openly accessible YAML specification of the AIMe standard clearly states how the reproducibility and validation scores are computed based on the answers provided in the reports.

  • Openness to diversity of knowledge: Biomedical AI researchers with diverse professional and cultural backgrounds are invited to join the steering committee and help shape future versions of the AIMe standard.

  • Open source code: The source code of the AIMe registry is freely available under the terms of a widely used open source license (see “Code availability” below).

Organization structure

There are three different roles in which scientists from the field of biomedical AI can participate in and contribute to the AIMe initiative: as a registry user, as a steering committee member and as an executive board member. These roles can be described as follows.

Registry user. Registry users can contribute to the AIMe initiative as described in the registry section above: i.e., by providing new entries, raising issues related to existing entries and commenting on the AIMe standard. Moreover, if they wish to play a more active role in the AIMe community, they can request membership in the steering committee.

Steering committee. The steering committee is responsible for maintaining and updating the specification of the AIMe standard. Its members are professional researchers working at the interface of AI, biomedicine, bioinformatics, computational biology and digital health. The founding steering committee consists of all co-authors of this paper. Supplementary Fig. 6 provides an overview of its members’ professional backgrounds and expertise in biomedical AI. The founding steering committee covers all academic career levels from PhD student to full professor and reflects the internationality of the biomedical AI community in that its members work at research institutions in eight different countries in Europe, Asia and the Americas.

Executive board. The executive board is responsible for coordinating the yearly reviews of the AIMe standard, for hosting and technical maintenance of the AIMe platform, for reviewing complaints on raised issues (i.e., deciding if issues qualify as offensive or off-topic) and for managing requests for membership in the steering committee. Such requests will be answered positively if the requester (a) provides plausible indication that they are a professional researcher with expertise in biomedical AI and (b) commits to actively participating in the yearly revision process. The founding executive board consists of the first and the senior authors of this paper.

Yearly revision process

Because biomedical AI is a rapidly evolving field, it is crucial that the AIMe standard continuously adapt to new developments in order to ensure that it will continue to reflect the needs of the research community. Therefore, AIMe foresees a yearly revision process, which is divided into two phases: a feedback phase from January 1 to September 30 of each year and a consolidation phase from October 1 to December 31.

During the feedback phase, users of the AIMe registry can provide feedback on the current version of the AIMe standard. Moreover, the steering committee members will actively reach out to influential representatives of the biomedical AI community and also submit their own proposals for improvements based on novel trends and developments in biomedical AI. During the consolidation phase, the steering committee, coordinated by the executive board, will consolidate the collected feedback into a new version of the AIMe standard. On January 1, the new version of the AIMe standard will replace the old one.

Conclusions

AI is on the rise in biology and medicine and demonstrates utility in numerous application scenarios. However, basic information about data, methods and implementation of AI is often incomplete in the respective publications. This makes it difficult to judge, comprehensively compare and reproduce the results of biomedical AIs, a situation that, in turn, constitutes a major hurdle for developing new AI methods and for applying AI in research and practice. To address this problem and thereby improve the quality, reliability and reproducibility of biomedical AIs, we have developed the community-driven AIMe registry presented in this paper. The registry allows authors to easily register their AIs and assists researchers and practitioners in finding existing AI systems that are relevant for their application scenarios.