The proliferation of generative AI models has garnered public interest and spurred creative healthcare and clinical applications. While discourse about stewarding AI towards responsible use has been ongoing for many years, concurrent with public demand for accountability1, there is increasing interest among AI researchers in evaluating the impact of AI systems and mitigating their potential to cause harm, particularly in their application to healthcare.

In another comment article, we discussed the pervasiveness of bias in clinical interactions, interventions and devices2. For example, racial bias has been reported in clinical notes, and some clinical devices have been designed without consideration of potential differences in their effectiveness across gender and skin color (for example, the pulse oximeter). The use of data from healthcare systems that embed gender, socioeconomic and racial discrimination to train AI algorithms can exacerbate health inequities. There have also been noted examples of hallucinations, bias and failure of generative AI systems3. Specific to health are concerns about large language models (LLMs) propagating medical racism, bias, myths, stigmas and misinformation. For example, research by Omiye et al.4 used nine questions to assess four popular large language models (that is, OpenAI’s ChatGPT, OpenAI’s GPT-4, Google’s Gemini (formerly Bard), and Anthropic’s Claude) and found that most of the models reproduced widespread misconceptions about race in medicine4. Another study, by Ayoub and colleagues5, showed that generative AI-based simulated physicians exhibited gender, racial, age, sexual orientation and political affiliation biases when making life and death decisions in a resource-limited setting. These concerns also extend to the spread of stigma and misinformation about sensitive, potentially life-threatening health topics, including mental and behavioral health challenges such as opioid use disorder, addictions and suicidal ideation, among others. The full scale of the potential impact of these ethical issues on the health outcomes of populations when LLMs are used in healthcare settings is still unknown. Yet LLM deployment and application in healthcare settings continues at a rapid pace. For example, in the past year, over 150,000 clinical notes have been drafted and millions of draft response messages created in Epic's EHR software using ambient technology.

In response to these challenges, responsible AI tools have been introduced to monitor bias and truthfulness in generative AI models, with broad and specific applications in health, politics and other fields6,7. This adds to existing fairness approaches, which Chohlas-Wood and colleagues grouped into three conceptual classes with varying challenges8: principles that limit the impact of demographic attributes, principles that seek to equalize error rates across groups, and principles that enforce uniform decision rates across groups. While these fairness principles have some usefulness, in some situations they could exacerbate poor health outcomes for marginalized populations8, and there are no systematic guidelines on how they should be applied in different contexts. Further, generative AI systems require the incorporation of fairness principles throughout their life cycle.
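To make the latter two fairness classes concrete, the minimal Python sketch below computes per-group decision rates (which uniform-decision-rate approaches would equalize) and per-group error rates (which equal-error-rate approaches would equalize) on synthetic predictions; the data and group names are illustrative assumptions rather than results from the cited work.

```python
# Illustrative sketch only: comparing two of the fairness notions described above
# on hypothetical model outputs. All records below are synthetic.
from collections import defaultdict

# (group, true_label, predicted_label) triples for a hypothetical classifier
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0), ("group_a", 0, 1),
    ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]

stats = defaultdict(lambda: {"n": 0, "positive_preds": 0, "errors": 0})
for group, y_true, y_pred in records:
    s = stats[group]
    s["n"] += 1
    s["positive_preds"] += y_pred          # used for decision-rate parity
    s["errors"] += int(y_true != y_pred)   # used for error-rate parity

for group, s in stats.items():
    decision_rate = s["positive_preds"] / s["n"]  # uniform-decision-rate principles would equalize this
    error_rate = s["errors"] / s["n"]             # equal-error-rate principles would equalize this
    print(f"{group}: decision rate={decision_rate:.2f}, error rate={error_rate:.2f}")
```

The first class, which limits the impact of demographic attributes, is commonly operationalized by restricting how protected attributes enter the model rather than by a group-level metric, and so is not shown in this sketch.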

Here, we argue that there is a need to ensure that public-use generative AI models are used appropriately in healthcare settings9. This implies that applications should not be driven by hype and excitement, but should be carefully designed and adopted with consideration of their potential impact on the entire healthcare system and of known systemic biases and inequities. Suitable applications and limitations of AI models should be clear to clinicians, patients and the health systems that purchase these AI systems.

Building on checklists and standards, such as Datasheets for Datasets10, that have been adopted across diverse fields, labels have been proposed as one way of enabling AI trustworthiness. These labels broadly focus on ethical, secure and responsible AI and data. Stuurman and Lachaud11 described different approaches for creating such labels, inspired by existing labeling standards in different industries, including information labels, nutrition-like labels, quality or conformity labels, education labels and professional labels. Scharowski and colleagues12 stated that certification labels attest that an item or service satisfies one or more requirements, making them appropriate for audited AI use cases. They also reported that surveyed subjects — end-users who may be impacted by AI systems — recorded higher willingness to use AI for both low-stake (for example, price comparison, route planning) and high-stake (for example, medical diagnosis, loan approval) scenarios after being told the AI model had a certification label. We extend these recommendations by proposing that there is a need for a responsible use label that is analogous to prescription labels and follows standards as strict as those set by the Food and Drug Administration (FDA). A usage label should inform healthcare adopters of the purpose of the algorithm, what it should not be used for and how to use it appropriately. If done well, labels as a form of disclosure could limit misuse of AI models in healthcare settings, thereby reducing their potential to cause harm or worsen health inequities.

Furthermore, creating such a label would force AI developers to be more critical in assessing the ethical implications of the algorithms they develop and release to the general public. While there is growing public demand for accountability, the teaching of ethical AI, specifically approaches to redressing the impacts of data and algorithmic bias, is not always a priority in computer science education or at conferences. Instead, the focus weighs heavily towards advances in algorithms. In the last several years, major AI companies in the US have been in the news for disbanding their ethics units. The development of AI labels would not be successful without diverse developer teams that include social scientists, ethicists and, in the case of healthcare AI, clinicians. Developers should clearly communicate the approved uses and potential side effects of an algorithm, challenging those adopting these systems for healthcare applications to conduct a ‘thought experiment’ to determine potential impacts prior to use.

Components of a responsible use label

Adopting the FDA approach to creating prescribing labels, the usage label should include the following information: approved usage, potential side effects, warnings and precautions, use in specific populations, adverse reactions, unapproved usage, completed studies, and ingredients of the algorithm. The responsible use label does not preclude adherence to ethical principles such as the GREAT PLEA — Governability, Reliability, Equity, Accountability, Traceability, Privacy, Lawfulness, Empathy, and Autonomy13 — which are necessary for the development, implementation and use of generative AI algorithms in healthcare settings. Rather, it adds to existing ethical AI principles by addressing the need for prescription-like information to support the effective and equitable application of general-use generative AI algorithms in healthcare settings, especially when the algorithms were not created for healthcare use.

Approved usage

A succinct description of what the AI developers created the algorithm to do, including known use cases and how adopters can use the algorithm to accomplish them. For example, if the AI model was created to summarize information on the Internet, the label should say so.

Potential side effects

Clearly communicates potential issues that might be encountered in the use of the AI algorithm (for instance, hallucinations or misrepresentation of historical data).

Warnings and precautions

Supplies information on the most serious ethical and equity issues that might arise from the use of the AI system, along with recommendations on how to identify and prevent such adverse reactions. This section also conveys questions that adopters should ask or consider before the algorithm is applied to solve a problem in a clinical setting. For example, what are the implications of adopting the algorithm for a specific clinical use case? How does adoption affect healthcare delivery for different racial, ethnic, gender or other marginalized groups?

Use in specific populations

Includes information on the intended application of the algorithm in specific populations. In a clinical setting, this could mean limiting applications to specific diseases or conditions. Given known biases in the healthcare system2, this section should also describe how the algorithm addresses issues of representation and context for diverse populations (such as use in populations with language differences).

Adverse reactions

Identifies undesirable or unintended effects associated with the adoption of the AI in a clinical setting, specifically as they relate to healthcare workers and patients. This section should describe the most frequently reported adverse reactions that require interruption or discontinuation of use of the algorithm. For example, an unintended effect could involve an AI algorithm that increases clinicians’ workload when it was intended to improve efficiency.

Unapproved usage

Conveys information about specific cases in which the algorithm should not be used and the potential impacts if it is. For example, an AI algorithm developed to summarize texts could be used to create a hospital discharge summary, in which case it could reproduce clinician bias present in clinical notes. Developers cannot anticipate every unapproved use, so this section should be updated as more research is conducted.

Completed studies

References to any scientific studies and findings that support the recommended use cases, adverse reactions, potential side effects and unapproved usage. An example is peer-reviewed research demonstrating the applications of the AI system.

Ingredients

Describes the datasets used in training the algorithm, including known ethical issues associated with the data. The dataset description should include the elements mentioned in the “Data Set Nutrition Label” proposed by Holland and colleagues14, which includes information on the data source, metadata and variables. Known ethical issues could include the underrepresentation or complete absence of specific populations in the data. For example, a model trained to predict acute kidney injury in a population of 703,782 US veterans that was 94% male performed worse at predicting acute kidney injury in female patients in both the Veterans Affairs data and a sex-balanced academic hospital cohort15.
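As a purely illustrative sketch rather than a proposed standard, the label components described above could also be captured in machine-readable form so that health systems can audit and compare labels programmatically; the Python field names and example values below are hypothetical assumptions, not part of any existing labeling framework.

```python
# Hypothetical sketch of a machine-readable responsible use label.
# Field names and example values are illustrative, not a standard.
from dataclasses import dataclass
from typing import List

@dataclass
class DatasetIngredient:
    """Training-data description, loosely following a nutrition-label style."""
    source: str
    metadata: str
    variables: List[str]
    known_ethical_issues: List[str]  # e.g., underrepresented populations

@dataclass
class ResponsibleUseLabel:
    approved_usage: List[str]           # what the algorithm was built to do
    potential_side_effects: List[str]   # e.g., hallucinations
    warnings_and_precautions: List[str]
    use_in_specific_populations: List[str]
    adverse_reactions: List[str]        # effects requiring interruption or discontinuation
    unapproved_usage: List[str]         # to be updated as new evidence emerges
    completed_studies: List[str]        # references to supporting research
    ingredients: List[DatasetIngredient]

# Example instance for a hypothetical general-purpose summarization model
label = ResponsibleUseLabel(
    approved_usage=["Summarize publicly available web text"],
    potential_side_effects=["Hallucinated statements", "Misrepresentation of historical data"],
    warnings_and_precautions=["Assess impact on marginalized groups before clinical adoption"],
    use_in_specific_populations=["Not evaluated for populations with language differences"],
    adverse_reactions=["Increased clinician workload"],
    unapproved_usage=["Drafting hospital discharge summaries"],
    completed_studies=["<reference to peer-reviewed evaluation>"],
    ingredients=[DatasetIngredient(
        source="Web crawl (unspecified)",
        metadata="Collection period and curation process",
        variables=["text"],
        known_ethical_issues=["Underrepresentation of non-English content"],
    )],
)
```

A structured representation along these lines would also make it easier to keep the unapproved usage and completed studies fields current as new evidence emerges.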

Conditions of labeling

For an AI algorithm to meet the usage label criteria outlined above, it must satisfy the following conditions. First, the developer has conducted a careful assessment of the benefits and potential risks for the use case. Second, the algorithm does not pose a substantial risk of exacerbating health inequities for the current use case. Third, rigorous research and validation have been conducted by the developer to ensure that it will be useful for the particular application. Fourth, there are clear guidelines on how the public or experts can ‘safely and effectively’ use the algorithm. Developing these guidelines will require input from social scientists, ethicists and clinicians.
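Continuing the hypothetical sketch above, the four conditions could be recorded as explicit attestations that must all hold before a usage label is issued; the structure and names below are illustrative assumptions rather than a formal certification process.

```python
# Illustrative continuation of the hypothetical label sketch: a label is issued
# only when all four labeling conditions are explicitly attested.
from dataclasses import dataclass

@dataclass
class LabelingConditions:
    benefit_risk_assessment_done: bool   # condition 1: benefits and risks assessed
    no_substantial_inequity_risk: bool   # condition 2: no substantial risk of worsening inequities
    validated_for_application: bool      # condition 3: rigorous research and validation conducted
    safe_use_guidelines_available: bool  # condition 4: clear 'safe and effective' use guidelines

def eligible_for_label(c: LabelingConditions) -> bool:
    """Return True only if every labeling condition is met."""
    return all([
        c.benefit_risk_assessment_done,
        c.no_substantial_inequity_risk,
        c.validated_for_application,
        c.safe_use_guidelines_available,
    ])

print(eligible_for_label(LabelingConditions(True, True, True, False)))  # False: guidelines missing
```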

Conclusion and outlook

As with medication use, having these guidelines will not stop misuse. For example, the FDA approves a medication for specific use cases, but people can choose to use it for other conditions. However, labeling challenges adopters to consider how their use deviates from the prescribed uses and forces them to take responsibility for that deviation. Furthermore, medication labels can be long and difficult to understand. We therefore recommend that attempts to replicate the FDA prescribing labeling process should also aim to create labels that are succinct and easy for the general public to understand.