Federated benchmarking of medical artificial intelligence with MedPerf

Medical artificial intelligence (AI) has tremendous potential to advance healthcare by supporting and contributing to the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving both healthcare provider and patient experience. Unlocking this potential requires systematic, quantitative evaluation of the performance of medical AI models on large-scale, heterogeneous data capturing diverse patient populations. Here, to meet this need, we introduce MedPerf, an open platform for benchmarking AI models in the medical domain. MedPerf focuses on enabling federated evaluation of AI models, by securely distributing them to different facilities, such as healthcare organizations. This process of bringing the model to the data empowers each facility to assess and verify the performance of AI models in an efficient and human-supervised process, while prioritizing privacy. We describe the current challenges healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status and real-world deployment, our roadmap and, importantly, the use of MedPerf with multiple international institutions within cloud-based technology and on-premises scenarios. Finally, we welcome new contributions by researchers and organizations to further strengthen MedPerf as an open benchmarking platform.

1 Introduction - Need for Wide Data Access and Model Generalization
As medical AI has begun to transition from research to clinical care [1][2][3][4], national agencies around the world have started drafting regulatory frameworks to support this new class of interventions.
Examples include the US Food and Drug Administration (https://www.fda.gov/medical-devices/digital-health-center-excellence), the European Medicines Agency (https://www.ema.europa.eu/en/about-us/how-we-work/regulatory-science-strategy), and the Central Drugs Standard Control Organisation in India 5. A key point of agreement for all regulatory agency efforts is a requirement for formal, large-scale validation of medical AI models [6][7][8]. Widespread approval and adoption of medical AI models will thus require expansion and diversification of clinical data sourced from multiple organizations. Furthermore, there are emerging parallels between the stages of approval for medical AI interventions and the regulatory approval of small molecules or medical devices through clinical trials [9][10][11].
Pioneering research in the medical field and elsewhere 12,13 has demonstrated that using large and diverse datasets during model training results in more accurate models. Such models are also expected to be more generalizable to other clinical settings. Other studies have shown that models trained with data from limited and specific clinical settings demonstrate bias toward specific patient populations [14][15][16], and such data biases can lead to models that appear promising during development but have lower performance in wider deployment 17,18. A given static model may be susceptible to distribution shifts in the model's input, the model's target, or both 19. For example, input distributional shifts may occur when an algorithm is evaluated on a population different from the one on which it was trained, when there are changes to local demographics or disease prevalence, or as a result of software or hardware upgrades of medical imaging equipment used for data acquisition. Similarly, distributional shifts may also arise from variations in geographic insurance reimbursement and medical procedure trends, or from new annotation or labeling guidelines. These issues, which are often intertwined with one another and frequently result in performance degradation, can also hinder trust and acceptance of AI among healthcare stakeholders, including clinicians, patients, insurers, and regulators.
We believe a new approach to leveraging diverse data can deliver consistent clinical and business value to healthcare data owners, while creating adoption incentives through lower implementation cost and lower deployment risk 6. Such an approach should allow collaborative model training and evaluation on large, multi-institutional and representative datasets while complying with privacy and regulatory requirements. However, the degree to which these requirements can be met during collaborative training is still an open research question 43.
Here we present MedPerf, an approach focused on broader data access during model evaluation, which we believe will best support model generalization as well as improve clinician and patient confidence. MedPerf was built upon the group's experience leading and disseminating efforts such as (i) the development of standardized benchmarking platforms (e.g., MLPerf for benchmarking machine learning training 20 and inference 21 across industries in a pre-competitive space; https://mlcommons.org/#MLPerf); (ii) the implementation of federated learning software frameworks (e.g., NVIDIA CLARA, Intel OpenFL 22, and Flower by Adap/University of Cambridge); (iii) the ideation and coordination of federated medical challenges across dozens of clinical sites and research institutes (e.g., BraTS 23 and FeTS 24); as well as (iv) other prominent medical AI and machine learning efforts spanning multiple countries and healthcare specialties (e.g., oncology 25 and COVID 26). MedPerf should also illuminate cases where better models are needed, increase adoption of existing generalizable models, and incentivise further model development, data annotation, curation, and data access, while preserving patient privacy. The development of this approach requires (a) consistent and rigorous methodologies to evaluate performance of AI models for real-world use in a standardized manner, (b) a technical approach that enables measuring model generalizability across institutions, while maintaining data privacy and respecting model intellectual property, and (c) a community of expert groups to employ the evaluation methodology and the technical approach to define and operate mature performance benchmarks.
MedPerf aims to address these goals. MedPerf is an open-source framework designed to develop and support benchmark reference implementations, respect data privacy, and support model evaluation through formal generation of benchmarking working groups. MedPerf provides the opportunity to set standards, best practices, and benchmarking for medical AI in a pre-competitive space. The current list of contributors includes representatives of 18 companies, 13 universities, 6 hospitals, and 10 countries.
In this section, we discuss challenges to wider data access for AI training and evaluation in healthcare. Convincing data owners to broaden data access is hindered by substantial regulatory, legal, and public perception risks, high up-front costs, and uncertain financial return on investment.

Risk
Sharing patient data presents three major classes of risk: liability, regulatory, and public perception. Sharing patient data can expose providers to liability risk in multiple ways. Shared data could be stolen or misused in a manner damaging to patients (e.g., to discriminate against patients with certain conditions). Patient data are protected by complex regulations such as HIPAA in the United States and GDPR in Europe that carry substantial penalties for violators.
The perception of risk is also heightened because AI is a relatively new paradigm where application of existing regulations can be unclear. Lastly, even if data are shared legally and used beneficially, people naturally value privacy, and sharing data without explicit consent could lead to negative public perception 27.

Cost
Sharing data requires up-front investment to turn raw data into a useful resource for AI. This transformation involves multiple steps:
1. Data collection: Cohorts need to be identified and the corresponding data need to be made accessible.
2. Transformation: Once accessible, data must be reformatted to a standardized representation for each data type (e.g., DICOM 28 for medical images) suitable for subsequent steps.
3. Anonymization: Data are anonymized by removing identifying information and/or filtering to comply with statistical and regulatory requirements (e.g., k-anonymity 29).
4. Labeling: For many AI tasks, data must be labeled (i.e., annotated) according to the task (e.g., brain tumor segmentation). To ensure quality and performance, labeling should be consistent across institutions. This step is expensive, highly human-dependent, and error-prone, while carrying additional costs related to annotation correction, versioning, and dataset maintenance 30.
5. Review: Data need to be reviewed for regulatory, legal, and policy compliance, and patients or patient groups need to be consulted for the design and perception of the use case.
6. Licensing: Data must be licensed in a manner that fulfills business and/or scientific interests while complying with existing regulations.
7. Sharing: Data must be physically shared with licensees, through complex legal agreements, which may require secure transmission of large data volumes or the creation of specially designed data enclaves.
Navigating these steps can be costly. The technical part of the process is also complex, requiring a mix of medical, artificial intelligence, and software engineering skills. There are multiple opportunities for error that may not be revealed until downstream consequences emerge, necessitating careful validation at each step, sometimes with multiple iterations 31.
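As an illustration of the statistical requirement mentioned in the anonymization step, the following minimal Python sketch checks whether a table satisfies k-anonymity, that is, whether every combination of quasi-identifier values is shared by at least k records. The field names (`age_band`, `zip3`, `diagnosis`) and the sample records are hypothetical and are not drawn from any MedPerf dataset; real anonymization pipelines involve considerably more than this check.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def is_k_anonymous(records: Iterable[Mapping[str, object]],
                   quasi_identifiers: Sequence[str],
                   k: int) -> bool:
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized records (age bands, truncated postal codes).
records = [
    {"age_band": "40-49", "zip3": "021", "diagnosis": "glioma"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "meningioma"},
    {"age_band": "50-59", "zip3": "100", "diagnosis": "glioma"},
    {"age_band": "50-59", "zip3": "100", "diagnosis": "glioma"},
    {"age_band": "50-59", "zip3": "100", "diagnosis": "metastasis"},
]

# The smallest group has two records, so the table is 2-anonymous but not 3-anonymous.
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))
print(is_k_anonymous(records, ["age_band", "zip3"], k=3))
```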

Uncertain Return
Even if a data owner (e.g., a hospital) is willing to pay for these costs and mitigate these risks, benefits can be unclear for financial or technical reasons. For example, if the development of an AI-based solution is driven by the AI model builder instead of the data provider, the AI provider may see a greater share of the eventual benefits than the data owner, even though the data owner may incur a greater share of the risk.
From a technical perspective, it can be difficult to prove a model's performance prior to deployment. Current medical AI community challenge efforts (e.g., FeTS 24, CheXpert 32, BraTS 33, NLST 34, CHAOS 35, fastMRI 36) have been invaluable for advancing research but lack the scope to serve as real-world evaluation mechanisms in clinical settings. These challenges typically focus on a single dataset and task and thus do not reflect the diversity (e.g., multi-modal and multi-institutional) and complexity (e.g., different clinical and technical workflows) of real-world use cases. Model training and evaluation on non-diverse datasets carries increased risk of overfitting and the chance that even top-performing models will not generalize in real-world use cases, where clinical data reside in multi-institutional, geographically distributed organizations with significant differences across domains (i.e., domain shifts) 14.
3 Proposed Solution: An Open Benchmarking Platform using Federated Evaluation
Our goal is to increase the clinical impact of AI by leveraging more data across multiple facilities to address the challenges described above. We are developing an open benchmarking platform that combines a lower-risk, evaluation-focused approach that requires no data sharing with appropriate infrastructure, technical support, and organizational coordination. This approach can reduce the risk and cost associated with data sharing while increasing the likelihood of business and medical benefits provided by AI solutions. MedPerf should lead to wider adoption, more efficacious and cost-effective clinical practice, and improved patient outcomes.
Our technical approach uses federated evaluation, a reduced-risk form of federated learning. At its core, federated evaluation aims to make sharing models with multiple data owners easy and reliable, to evaluate those models against data owners' data in controlled settings, and to aggregate and analyze evaluation metrics. Importantly, by limiting the goal to model evaluation, and by aggregating only evaluation metrics, federated evaluation poses significantly lower risk to patient privacy than collaborative model training, while also minimizing the risk 37,38 of intellectual property theft and data misuse.
More specifically, our open platform for federated evaluation will provide a common, open-source infrastructure for defining medical AI benchmarks and coordinating federated evaluation of models against such benchmarks. We are building the infrastructure with best practices to help align AI model owners and developers with data owners, through an active community with a neutral organization at its core. We intend for our approach to be compatible with, and to build upon, existing federated learning frameworks, rather than to compete with them. Furthermore, as detailed below, we introduce steps that give data owners control over what algorithms run on their data and allow them to confirm benchmarking results.

Risks are Mitigated by Focusing on Model Evaluation and Trusted Groups
MedPerf addresses regulatory, liability, and public perception risks using a three-pronged approach.
First, because the initial focus is on model evaluation instead of training, our federated evaluation approach maximizes value without data leaving the possession of data owners, whether directly or through accidental leakage in results. We only need data owners to share agreed-upon evaluation metrics (e.g., specificity), which are aggregated across participating institutions before dissemination. This mitigates most regulatory, public perception, and legal risk.
Second, MedPerf retains human evaluators 39 as a critical part of the process: the MedPerf client software requires a data owner's system administrator to approve all model evaluations and result uploads, and automatically records transactions to support auditing. Moreover, to protect against malicious or erroneous implementations, MedPerf requires that (a) all novel code has no network access and restricted local file-system access, (b) evaluation algorithm implementations are well-vetted and common among benchmarks, and (c) all output (i.e., statistics) must be explicitly approved by data owners before it is uploaded to the MedPerf platform.
Third, we leverage social trust: we enable benchmarks to be specified, developed, and deployed publicly or within closed groups, such as provider networks with pre-existing trusted relationships and business and legal contracts, and these closed-group benchmarks will be prioritized during the pilot phase of deployment. We are developing the MedPerf infrastructure through MLCommons, a non-profit with diverse membership and open-source practices, backed by dozens of high-profile companies and institutions (https://mlcommons.org/en/#founders).
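To make the first two points concrete, the sketch below shows one way that per-site metrics could be aggregated only after explicit data-owner approval. It is an illustrative assumption, not the MedPerf client implementation: the `SiteResult` schema, the `approved` flag, and the choice to pool confusion-matrix counts are all hypothetical. Only summary counts, never patient records, cross the institutional boundary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SiteResult:
    """Confusion-matrix counts reported by one data owner (hypothetical schema)."""
    site: str
    true_negatives: int
    false_positives: int
    approved: bool  # set by the data owner's administrator before upload


def pooled_specificity(results: Sequence[SiteResult]) -> float:
    """Pool counts from approved sites only and compute TN / (TN + FP)."""
    approved = [r for r in results if r.approved]
    tn = sum(r.true_negatives for r in approved)
    fp = sum(r.false_positives for r in approved)
    if tn + fp == 0:
        raise ValueError("no approved negative cases to aggregate")
    return tn / (tn + fp)


results = [
    SiteResult("site_a", true_negatives=90, false_positives=10, approved=True),
    SiteResult("site_b", true_negatives=160, false_positives=40, approved=True),
    # Not approved for upload, so it is excluded from the aggregate.
    SiteResult("site_c", true_negatives=50, false_positives=50, approved=False),
]

# Pools 250 true negatives over 300 negative cases from the two approved sites.
print(pooled_specificity(results))
```

Pooling counts weights each site by its number of cases; a benchmark group could equally choose to report per-site values or a macro-average, depending on its policy.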

Practices
We aim to reduce the costs of data sharing by developing open-source infrastructure and best practices that enable infrastructure vendors, model owners, and data owners to collaboratively build within a fast-growing ecosystem.
First, we provide community best practices for sharing models and data. For instance, we are using the MLCube container for model sharing (see https://mlcommons.org/en/mlcube/ for a concept introduction and practical examples, and https://github.com/mlcommons/mlcube for the code repository). MLCube extends common container standards, such as Docker and Singularity, to offer a simple and consistent file system-based interface for other infrastructure to train or make inferences using AI models (e.g., for testing harnesses or federated learning).
Additionally, deployment tools like Docker and Singularity enable hospital information technology groups to evaluate the AI model code for security concerns using common methods and tools.
Second, we are developing an open-source hub for medical AI benchmarks and a consistent methodology for benchmarking. The hub will offer coordination among benchmark groups, model developers, and data owners by providing a central model and data registry and by storing results, but will not directly handle proprietary models or data, ensuring that these assets remain in the hands of their owners. Instead, model and data owners will register hashes to enable checking the integrity of their assets without exposing them to the platform. This method will ensure that benchmark results can be compared to better establish promising technical approaches.
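The hash-based registration described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not name a hash function, so SHA-256 is assumed here, and the helper names `sha256_of_file` and `verify_asset` are hypothetical rather than part of any MedPerf API. The owner computes a digest locally and registers only that digest; anyone holding the asset can later confirm it matches the registered value.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large assets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(path: Path, registered_hash: str) -> bool:
    """Check a local asset against the hash previously registered for it."""
    return sha256_of_file(path) == registered_hash


with tempfile.TemporaryDirectory() as tmp:
    asset = Path(tmp) / "model_weights.bin"  # hypothetical asset name
    asset.write_bytes(b"example model weights")

    # Only this digest would be sent to the registry, never the asset itself.
    registered = sha256_of_file(asset)
    unchanged = verify_asset(asset, registered)

    # Any modification of the asset changes the digest and fails verification.
    asset.write_bytes(b"tampered model weights")
    tampered = verify_asset(asset, registered)

print(unchanged, tampered)
```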

Return on Investment: Increasing Certainty through Better Model Evaluation
Our approach decreases the uncertainty of deploying AI models by enabling easy evaluation against data held by multiple data owners. We enable model developers to indirectly interact with data owners' datasets and thus tap into a large, virtual test set. In doing so, we increase the size of the test set and thereby reduce uncertainty of the evaluation, even if all the data are from a single provider. More importantly, by enabling evaluation against data from multiple providers, we can more effectively evaluate how the model will perform when deployed at different facilities with diverse patient populations. And by providing multi-site performance feedback to model developers, we increase the odds of successful model deployment.
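The link between test-set size and evaluation uncertainty can be illustrated with a short calculation. The sketch below uses the normal-approximation confidence interval for a proportion, whose half-width is z * sqrt(p(1 - p) / n); this is a textbook simplification chosen for illustration and assumes independent test cases, which pooled multi-site data will only approximate. The accuracy value of 0.90 and the sample sizes are hypothetical.

```python
import math


def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Each fourfold increase in test-set size halves the interval width.
for n in (200, 800, 3200):
    print(n, ci_half_width(0.90, n))
```

Because the half-width scales with 1 / sqrt(n), gains from additional data diminish; the larger benefit of multi-site evaluation is arguably the coverage of different patient populations rather than the narrower interval alone.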
Ultimately, demonstration that broad evaluation via federated evaluation is correlated with clinical efficacy will further improve clinician and patient confidence and motivate additional data owners to participate.

Building an On-Ramp to Federated Learning
We believe widespread adoption of federated evaluation will also spur wider adoption of federated learning. Federated learning (FL) is a promising technology to enable development of AI models by leveraging data from multiple institutions without directly sharing data [40][41][42]. While FL enables model training without data sharing, data may leak through the model parameters themselves, requiring additional mitigations [43][44][45]. Research and development of these mitigations is ongoing, slowing the adoption of the technology. We believe that federated evaluation provides concrete benefits while building industry familiarity with the technology needed for full FL.

MedPerf Technical Approach
In this section, we describe the structure and functionality of an open benchmarking platform for medical AI. We define a MedPerf benchmark in this context, discuss user roles required to successfully operate a benchmark, and provide an overview of the operating workflow.

What is a Benchmark
For the purposes of our platform, a benchmark is a bundle of assets that enables quantitative measurement of the performance of AI models for a specific clinical problem. A benchmark consists of the following major components:
1. Specifications: precise definition of the clinical setting (e.g., problem or task and specific patient population) on which trained AI models are to be evaluated, the labelling methodology, and specific evaluation metrics.
2. Dataset Preparation: a process that prepares datasets for use in evaluation, and can also test the prepared datasets for quality and compatibility.
3. Registered Datasets: a list of registered datasets prepared according to the benchmark criteria and approved for evaluation use by their owners, e.g., patient data from multiple facilities representing (as a whole) a diverse patient population.
4. Evaluation: a consistent implementation of the testing pipelines and evaluation metrics.
5. Reference Implementation: an example of a benchmark submission consisting of example model code, the evaluation component above, and publicly available de-identified or synthetic sample data.
6. Registered Models: a list of registered models to run in this benchmark.
7. Documentation: documents for understanding and using the benchmark.
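The seven components listed above can be pictured as a single record. The following Python sketch is one hypothetical way to model such a bundle; the class name, field names, and identifier strings are illustrative assumptions and do not reflect the actual MedPerf schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Benchmark:
    """Hypothetical bundle of benchmark assets, mirroring the seven components."""
    name: str
    specification: str             # clinical setting, labelling method, metrics
    dataset_preparation: str       # reference to the dataset preparation container
    evaluation: str                # reference to the evaluation container
    reference_implementation: str  # example model plus public sample data
    documentation: str
    registered_datasets: List[str] = field(default_factory=list)
    registered_models: List[str] = field(default_factory=list)

    def register_dataset(self, dataset_id: str) -> None:
        """Record a dataset that its owner has prepared and approved for use."""
        if dataset_id not in self.registered_datasets:
            self.registered_datasets.append(dataset_id)

    def register_model(self, model_id: str) -> None:
        """Record a model that should be run in this benchmark."""
        if model_id not in self.registered_models:
            self.registered_models.append(model_id)


# All identifiers below are placeholders for illustration only.
benchmark = Benchmark(
    name="example-segmentation-benchmark",
    specification="example task, population, labelling protocol, and metrics",
    dataset_preparation="example/data-prep-container",
    evaluation="example/evaluation-container",
    reference_implementation="example/reference-model-container",
    documentation="README for the example benchmark",
)
benchmark.register_dataset("hospital-a-dataset")
benchmark.register_model("model-owner-x-v1")
print(benchmark.registered_datasets, benchmark.registered_models)
```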
Our platform uses the MLCube container for components such as Dataset Preparation, Evaluation, and the Registered Models. MLCube containers are software containers (e.g., Docker and Singularity) with standard metadata and a consistent file-system level interface. By using MLCube, the infrastructure software can easily interact with models implemented using different approaches and/or frameworks, running on different hardware platforms, as well as leverage common software tools for validating proper secure implementation practices (e.g., CIS Docker Benchmarks).

Benchmarking User Roles
We have identified four primary roles in operating an open benchmark platform, outlined in Table 1. In many cases, a single organization may participate in multiple roles, and multiple organizations may share any given role. Beyond these roles, the long-term success of medical AI benchmarking requires organizations that create and adopt appropriate community standards for interoperability such as Vendor Neutral Archives (VNA) 46,47, DICOM 28, OMOP 48 (https://www.ohdsi.org/data-standardization/the-common-data-model/), PRISSMM 49, and HL7/FHIR 50.

Benchmarking Workflow
Our open benchmarking platform uses the workflow depicted in Figure 1. To start, a benchmark group registers the benchmark with the benchmarking platform (1) and then recruits data owners (2) and model owners (3). The benchmarking platform sends model evaluation requests to the data owners, who approve and execute the evaluations, successively vetting and then pushing results to the benchmarking platform (4). The benchmarking platform shares the results with participants based on a policy specified by the benchmark group (5). Table 2 provides further details about each workflow step.
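The five workflow steps can be sketched as a small state machine. The code below is a simplified illustration, not the platform's implementation: the `Step` names and the `BenchmarkRun` class are hypothetical, and the strict 1-to-5 ordering is an assumption made for brevity (in practice, recruitment of data owners and model owners may overlap).

```python
from enum import IntEnum
from typing import List


class Step(IntEnum):
    """Workflow steps, numbered as in the description above."""
    REGISTER_BENCHMARK = 1
    RECRUIT_DATA_OWNERS = 2
    RECRUIT_MODEL_OWNERS = 3
    EXECUTE_EVALUATIONS = 4
    RELEASE_RESULTS = 5


class BenchmarkRun:
    """Tracks progress and rejects steps taken out of order (simplifying assumption)."""

    def __init__(self) -> None:
        self.completed: List[Step] = []

    def advance(self, step: Step) -> None:
        if len(self.completed) == len(Step):
            raise ValueError("workflow already finished")
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)

    @property
    def finished(self) -> bool:
        return len(self.completed) == len(Step)


run = BenchmarkRun()
for step in Step:  # IntEnum members iterate in definition order
    run.advance(step)
print(run.finished, [s.name for s in run.completed])
```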

MedPerf Roadmap
Ultimately, we aim to deliver an open platform that enables groups of researchers and developers to use federated evaluation to provide high-confidence evidence of generalized model performance to regulators, health care providers, and patients. In Table 3, we review necessary steps, scope of each step, and current progress towards developing this open benchmarking platform.

Related Work
Our effort is inspired by several classes of related work. First, we adopt a federated approach to data, focusing first on evaluation to lower the barriers to adoption. Second, we adopt the standardized measurement approach to medical AI from organizations such as RSNA (https://www.rsna.org), SIIM (https://siim.org), and Kaggle (https://www.kaggle.com), and we generalize these efforts to a standard platform that can be applied to many problems rather than focus on a specific one. Third, we leverage the open, community-driven approach to benchmark development successfully employed to accelerate hardware development, through efforts such as MLPerf (https://mlcommons.org) and SPEC (https://www.spec.org/benchmarks.html), and apply it to the medical domain. Lastly, we push towards creating shared best practices for AI as inspired by efforts like MLflow (https://mlflow.org) and Kubeflow (https://www.kubeflow.org) for AI operations, as well as MONAI (https://monai.io) and GaNDLF (https://cbica.github.io/GaNDLF/) for medical models.

Discussion and Conclusion
Our initial goal is to provide medical AI researchers with reproducible benchmarks based on diverse patient populations to assist healthcare algorithm development. We believe such benchmarks will increase development interest and solution quality, leading to patient benefit and growing adoption. Furthermore, our platform will help advance research related to, but not limited to, data utility, robustness to noisy annotations, and understanding of model failures. If a critical mass of AI researchers adopts these benchmarks, healthcare decision makers will see substantial benefits from aligning with this effort to increase benefit for their patient populations.
Ultimately, standardizing best practices and evaluation methods will lead to highly accurate models that are acceptable to regulatory agencies and clinical experts, and create momentum within patient advocacy groups. By bringing together these diverse groups, starting with AI researchers and healthcare organizations, as well as building trust with clinicians, regulatory authorities, and patient advocacy groups, we envision accelerating the adoption of AI in healthcare and increased clinical benefits to patients and providers worldwide. However, we cannot achieve these benefits without the help of the technical and medical community. We call for:
• Healthcare stakeholders to form the benchmark groups that define benchmark specifications and oversee the analyses of their results.
• AI researchers to test this end-to-end platform and use it to create and validate their own models across multiple institutions.
• Data owners (e.g., healthcare organizations) to register their data in the platform, again while never sharing the data itself.
• Data model standardization efforts to enable collaboration between institutions, such as the OMOP Common Data Model and VNA.
• Regulatory bodies to develop medical AI solution approval requirements that include technically robust and standardized benchmarking.
We believe open efforts like MedPerf can drive innovation and bridge the gap between AI research and real-world clinical impact. To achieve these benefits, there is a critical need for broad collaboration, reproducible, standardized and open computation, and a passionate community that spans academia, industry, and the clinical world. With MedPerf, we aspire to bring such a community of stakeholders together as a critical step toward realizing the grand potential of medical AI. We invite participation at https://mlcommons.org/medperf.

Figure 1. Benchmarking workflow, from benchmark registration to results. See Table 2 for details of all workflow steps, 1 through 5.

Table 1. Benchmarking user roles and responsibilities.

Table 2. Benchmarking workflow, steps, and interconnections with roles (partial).
• Once implemented by the model owner and approved by the benchmark group, the model can be registered on the platform.
• Once the benchmark, dataset, and models are registered to the benchmarking platform, the platform notifies the data owners that models are available for benchmarking.
• The data owner runs a benchmarking client that downloads available models, reviews and approves models for safety, then approves execution.
• Once execution completes, the data owner reviews and approves upload of the results to the benchmark platform.
5 Release results
• Benchmark results are aggregated by the benchmarking platform and shared per the policy specified by the benchmark group.

Table 3. MedPerf roadmap stages, scopes, and corresponding details for each stage.