To the Editor—The promise of machine learning (ML) to augment medical decision-making in dynamic care environments has yet to be fully realized because of a gap in how algorithms are translated to the bedside, sometimes known as the ‘AI chasm’1. The drivers of this gap are numerous and complex, but a central challenge relates to the integration of ML into complex decision-making processes and clinical workflows. The ML field has developed a more nuanced appreciation of the importance of having the ‘human in the loop’2, but has yet to identify precisely how to optimize the human–ML interface to achieve maximal impact on key outcomes3.

Cognitive load refers to the mental effort required to perform a task; in care environments, where clinicians collate, integrate, filter, weigh and reason about patient data in real time, that effort can be immense. Cognitive overload is associated with medical errors and burnout and contributes to suboptimal care outcomes4,5. ML can decrease the mental effort required to process immense amounts of biomedical data, yet its ability to do so is rarely, if ever, evaluated. We argue that cognitive load can and should be measured throughout the ML development cycle to maximize the potential for integration of ML into medicine and to improve patient and provider outcomes.

The main memory system that supports information processing is working memory. Cognitive load represents the amount of working memory resources consumed by a task. Intrinsic cognitive load refers to working memory resources consumed by the inherent difficulty of a task. Extraneous cognitive load describes working memory resources consumed by unnecessary or distracting details from the environment within which the task is performed. Cognitive load theory is based on the understanding that working memory resources are finite, and when the sum of intrinsic (non-modifiable) and extraneous (modifiable) cognitive load overwhelms working memory, task performance suffers as outlined above. ML that decreases cognitive load should therefore be valued, and ML that increases cognitive load, for example through excessive false alarms6 or complicated explainability visualizations7, may actually present unintended additional risk to clinicians, patients and the success of the ML project overall.

Cognitive load should be measured before and after implementation of ML in medicine. Checklists for ‘ideal’ ML in healthcare have been developed8, and we argue that an additional requirement should be that the cognitive load of the targeted clinical task(s) decreases with the addition of an ML system. Qualitative and quantitative techniques can be employed to make this determination. Psychometric rating scales (such as the 9-point Paas Scale and the NASA Task Load Index (NASA-TLX)), dual-task procedures (measurement of performance on a primary task while a distracting secondary task must be performed) and/or physiological measures (such as pupillary dilation, heart rate, galvanic skin response or electroencephalography data) can be used alone or in combination to increase the validity of cognitive load estimation9.
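To make the psychometric approach concrete, the raw (unweighted) NASA-TLX score is simply the mean of six subscale ratings, each on a 0–100 scale. The sketch below follows that standard scoring rule; the function name and the example ratings are hypothetical, not data from any study.

```python
# Raw (unweighted) NASA-TLX scoring: the mean of six subscale ratings,
# each on a 0-100 scale. Subscale names follow the standard instrument;
# the example ratings below are hypothetical.

NASA_TLX_SUBSCALES = (
    "mental_demand", "physical_demand", "temporal_demand",
    "performance", "effort", "frustration",
)

def raw_tlx(ratings: dict) -> float:
    """Return the raw TLX score: the mean of the six 0-100 subscale ratings."""
    missing = [s for s in NASA_TLX_SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return sum(ratings[s] for s in NASA_TLX_SUBSCALES) / len(NASA_TLX_SUBSCALES)

# Hypothetical ratings from one clinician after one bedside subtask
example = {
    "mental_demand": 80, "physical_demand": 20, "temporal_demand": 70,
    "performance": 40, "effort": 75, "frustration": 55,
}
print(round(raw_tlx(example), 1))  # → 56.7
```

The weighted NASA-TLX variant additionally asks raters to make pairwise comparisons between subscales; the raw score shown here is the simpler form commonly used when repeated measurements are needed at the bedside.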

Such cognitive load tools should be employed throughout the ML development cycle (Table 1). Stakeholders should add assessment of cognitive load to the list of considerations when they decide which medical decision-making processes to target with ML. As a supplement to discussions with clinicians, candidate tasks can be broken down into subtasks and the cognitive load associated with each end user subtask can be measured using the above techniques at the bedside, in the simulation lab and/or with end-user recall surveys. Cognitive load analysis can guide precisely where ML might be maximally impactful.

Table 1 Rationale and roadmap for measuring cognitive load in machine learning projects in healthcare

After the ML model has been trained, validated and tested, various user interfaces and mechanisms for conveying predictions, explainability and uncertainty can be trialed using techniques such as A/B testing in end users10. Visualization techniques that maximally decrease the cognitive load associated with the previously identified subtasks should be prioritized for future study and potential roll-out to the point of care.
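One minimal way to analyse such an A/B comparison is a permutation test on cognitive load ratings collected under each interface variant. The sketch below assumes independent groups of raters and uses only the Python standard library; the variant labels and scores are hypothetical.

```python
import random
import statistics

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Approximate two-sided permutation test for a difference in mean
    cognitive load between two interface variants (independent groups)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            count += 1
    return count / n_perm  # approximate two-sided p-value

# Hypothetical raw-TLX scores from clinicians using two candidate interfaces
variant_a = [62, 70, 58, 66, 74, 61]  # e.g. a dense explainability dashboard
variant_b = [45, 52, 48, 55, 50, 47]  # e.g. a simplified single-risk display
p = permutation_test(variant_a, variant_b)
print(f"p ≈ {p:.4f}")
```

With the small samples typical of early interface studies, a permutation test avoids the distributional assumptions of a t-test, which is why it is sketched here.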

Simulation might then be used to recreate the task and its associated clinical ecology, ideally using an exact replica of the care environment in which the study team can trial the intended deployment strategy in a more life-like clinical context. Realistic sources of extraneous cognitive load, such as multi-source data streams and alarming monitors, can be introduced. Paired clinical scenarios might require the clinician to perform the relevant clinical tasks without and with the help of the ML model. The variables associated with ML inference presentation can be validated and/or re-tested at this phase. The psychometric and physiologic metrics of cognitive load described above can be measured and compared between simulations, in addition to traditional metrics of ML and task performance. Pre- and post-evaluations can be repeated for a predefined cohort of clinical users with varying roles, experience and task expertise to gain a holistic understanding of the potential impacts of the ML model on cognitive load for the entire care team.
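The paired without-ML/with-ML comparison described above can be summarized per clinician as a simple paired-difference analysis. This sketch assumes one cognitive load score per clinician per condition; the function name and scores are hypothetical.

```python
import statistics

def paired_load_change(without_ml, with_ml):
    """Summarize per-clinician change in a cognitive load metric between
    paired simulations (with-ML minus without-ML); negative differences
    mean the ML model reduced load for that clinician."""
    if len(without_ml) != len(with_ml):
        raise ValueError("paired measurements must have equal length")
    diffs = [post - pre for pre, post in zip(without_ml, with_ml)]
    return {
        "mean_change": statistics.mean(diffs),
        "sd_change": statistics.stdev(diffs),
        "n_improved": sum(d < 0 for d in diffs),
        "n": len(diffs),
    }

# Hypothetical raw-TLX scores for six clinicians across paired scenarios
without_ml = [72, 68, 80, 75, 64, 70]
with_ml = [60, 65, 66, 70, 62, 58]
summary = paired_load_change(without_ml, with_ml)
print(summary)
```

Reporting the number of clinicians who improved alongside the mean change guards against a few large improvements masking increased load for part of the care team, which matters for the team-level understanding the text calls for.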

We recommend that an ML model proceed to a trial of bedside implementation only if it performs well in traditional metrics of model evaluation, improves task performance and decreases cognitive load during realistic simulations. This will optimize prospects for achieving return on investment for stakeholders and patients, given the substantial investment required to develop and deploy ML in clinical environments. Once the ML model is introduced to the bedside, qualitative and quantitative cognitive load assessments should continue, as part of small-scale feasibility studies to garner valuable feedback about how the model is integrating into clinical workflows3. Ethnographic evaluations of the impact of ML on overall unit workload dynamics will determine the intended and unintended consequences of ML deployment. Any initial increases in cognitive load due to unanticipated changes in workflows should ideally be followed by decreases in cognitive load as end users integrate the ML model into their clinical practice. Sustained suboptimal performance on cognitive load metrics during the feasibility study should prompt study teams to hold discussions with clinician end users to inform possible improvements, especially if desired clinical outcomes are unrealized. This is especially important in the context of the COVID-19 pandemic, during which cognitive overload has reached crisis levels in overburdened care environments.