Decades of work in population and public health has informed research and practice that aim to understand what makes and keeps people and populations healthy1. The major underpinning principle is that of health equity, defined as “minimizing avoidable disparities in health and its determinants—including but not limited to health care—between groups of people who have different levels of underlying social advantage or privilege, that is, different levels of power, wealth, or prestige due to their positions in society relative to other groups”2. Inequality at a societal level is itself harmful across the population as a whole3. To capture the complex interplay between individual, community and other structural factors such as racism, which affect and are leverage points for health, the social ecological model was developed4. This framework outlines how the health of an individual is affected by multiple factors operating at different levels in a hierarchy (Fig. 1). Indeed, an understanding of and focus on the macro-level properties (for example, levels beyond the individual in Fig. 1) is critical to put individuals in the best context to leverage interventions, without increasing disparities5. A focus on determinants, antecedents and other factors related to health outside the hospital are imperative to not only address specific challenges for high-risk individuals, but also determine what policies would benefit communities as a whole.

Fig. 1: The socio-ecological model of health.
figure 1

This model (adapted from Bronfenbrenner et al.4) can be used to understand and identify leverage points for the multifaceted and interactive effects of the multi-level individual and environmental factors that determine health. Macro-level properties (those above the individual level) are also key to understanding inequities and interventions that can reduce inequity.

An understanding of the importance of macro-level properties can be useful, for example, in examining the impact of introducing healthier food options in a neighbourhood to help people in a community improve their diet6 or how maternal health policies impact child mortality7. These are pressing challenges; in the United States social determinants (broadly defined as the conditions in which people are born, grow, live, work and age)8 account for 25–60% of deaths in any given year according to results from meta-analyses9. Moreover, 80% of the growing burden of non-communicable diseases worldwide could be prevented through modifying behaviours such as reducing tobacco, alcohol, fat and/or salt consumption, promoting physical activity and improving environmental conditions such as air quality and urban planning10.

In the past decade, the development of statistical and machine learning approaches with a focus on clinical tasks, such as predicting disease prognosis and identifying phenotypes, has greatly matured with some demonstrations of benefit to patients11,12. Echoing research in population and public health, the recent COVID-19 pandemic has highlighted how multi-sectoral factors outside of the clinic such as community, social networks and environment are also critical with respect to health13,14,15,16,17,18,19. Accordingly, this Perspective illustrates where and how machine learning has been shown to be useful in a holistic set of tasks related to health. We summarize existing data and methods used in public and population health, and use this synthesis to present directions for future work to leverage synergies in machine learning and population and public health.

Data in public health

Before elaborating on machine learning efforts in public and population health and gaps, an outline of the types of data that are commonly used is pertinent. Data commonly used in public health can be broadly categorized into (1) surveys conducted by public health and governmental organizations that aggregate individual-level information and (2) person-generated data that provide information at a finer resolution20. These types of data each provide complementary types of information relevant to public health; each also associated with their own challenges. In the following subsections we outline each of these data and associated methodological challenges with respect to their use in models of health.

Data from surveys and government reports

Traditional approaches to data collection in public health involve the aggregation of data via local officials and channels. When health providers report notifiable diseases on a case-by-case basis, it is known as passive surveillance; often useful during disease outbreaks and to gain a baseline view of disease burden in a specific location. Conversely, active surveillance is when a health department proactively contacts health care providers to request information about diseases, which can be catered to specific needs and easier to validate, but more labour intensive. This surveillance information is then forwarded to national health ministries and international organizations like the World Health Organization. Such collection procedures result in robust, denominator-based public data that are often made publicly available by public health and governmental organizations. Examples include the National Health and Nutrition Examination Survey (NHANES) or the Demographic and Health Surveys programme (DHS). These types of data system also aim to capture different indicators of health, designed via specific constructs. For example, housing quality can be measured via rental status, sanitation status, crowding, indoor air quality and so on21. Despite the robust nature of the constructs and denominator-based data collection processes22, loss of information at an individual level due to aggregation, privacy concerns and temporal delays in the process are challenges in the use of data for all situations—such as the need for rapid policy-making, which was exemplified during the COVID-19 pandemic.

Person-generated data

The rising ubiquity of technologies such as smartphones and physical activity trackers, as well as data from social media sites such as Twitter and Instagram and Internet surveys, have made it possible to access high-resolution and geo-linked data in near-real time. The nature of this data can help evade recall or information biases associated with traditional surveys. The often-linked geographic information and time stamps further enable the capture of hyper-local, daily and sub-daily health-related information from behaviours to exposures and other macro-level properties, as well as health outcomes23,24. Accordingly, such ubiquitous technologies can provide opportunities to better measure the social determinants of health in a targeted way, by person, location and/or time21,22,23,24,25. These attributes of person-generated data can complement denominator-based survey and report data that are not available at such high granularity. However, the opt-in nature of and resources required for the tools used to produce person-generated data (that is, an individual chooses to use a certain app, and there can be cost associated with access) alongside the unstructured nature of the data (not in the form of specific measures or constructs from the outset) bring new computational challenges to the fore if we intend to use the data to make inferences within and across populations. These challenges have been outlined in detail previously22 and are used to motivate discussion in subsequent sections.

Machine learning in population and public health

Building on the focus of public and population health, here we outline a taxonomy to organize and illustrate machine learning efforts linked to priorities of these fields. We use this summary and juxtaposition with current research to identify gaps in the application of machine learning in public and population health.

  • Identification of factors and their relation to health outcomes. Learning what contributes, at all levels of the socio-ecological framework and their interactions, to health outcomes is a significant part of public and population health research. Machine learning has so far played a role in identification in a broad range of studies from learning biological mechanisms26 to establishing the multivariate empirical relationship between the probability of disease outbreak and environmental conditions27. Given the complex relationships and possible mediations between multi-level factors28, by leveraging new data sources there is the opportunity to use and develop machine learning methods for the interpretable identification and assessment of the source and magnitude of a wide variety of multi-level factors in health outcomes.

  • Design of interventions. Besides multi-level factors related to disease, a socio-ecological framework also indicates the utility of interventions at multiple levels. Besides exacerbating inequities, targeting individuals directly can be highly stigmatizing, aggravating health-related behaviours that may be the target of intervention29. Thus, while a large body of work in machine learning has focused on targeting the individual, for example towards depression management30, self-efficacy for weight loss31, smoking cessation32 and personalized nutrition based on glycaemic response33, the possibility of leveraging data and machine learning in efforts that consider the multi-level nature of influence around an individual is an open investigation area. For example, group-based intervention programmes are one of the means to reduce substance abuse by reinforcing positive behaviour. System dynamics modelling approaches or agent-based models (which involve simulation of the history, location in space and time, and interaction between individual agents) can be useful to design and evaluate the effectiveness of interventions at multiple levels34,35.

  • Prediction of outcomes. Predicting mortality risk 36, hospital readmission37 and disease prognoses from pathology, imaging or other clinical data38 are well-studied tasks using probabilistic machine learning and deep learning methods. Prediction has also been leveraged for population-level questions, largely in geographic disease risk mapping39. Conversely, mitigating health disparities and the prediction of outside-hospital events are crucial challenges that have received less attention from the machine learning community. Although new data provide the potential to incorporate social, environmental and other multi-level determinants in (and thus improve) prediction models, there is a need to expand on research, which may use similar methods to the vast preponderance of research on clinical prediction models to incorporate these new data sources. Such data can also better capture factors that are represented by proxies such as race at present40,41. Besides the addition of data to prediction models, a critical focus on the types of prediction task and how they are used in practice is also essential. For example, although the incorporation of patient socio-economics can improve risk assessment, and epidemiological evidence shows the relation of these to several important outcomes42, concerns regarding their use being utilized to justify lower standards of care for poor patients43 have been voiced. Such important risks illustrate the importance of the development and use of prediction models that are closely linked to clinical and public health practices and priorities. In the United States, for example, accountable health initiatives and tax codes have encouraged the activation of risk mitigation actions beyond primary care that could leverage risk information from social factors, such as through improving health care teams’ ability to understand the ‘upstream’ factors impacting their patients’ health and the ability to act on care recommendations, informing clinical care decisions44.

  • Allocation of resources. The use of machine learning and artificial intelligence for resource allocation has been promoted in several types of health care and public health tasks, typically with a focus on allowing for estimation under uncertainty, computing treatment effects or alternate scenarios to aid in decision-making, augmenting decision rule approaches by incorporating more information and handling missing data. In health care, this has been applied at the person level (for example, for learning personalized management and treatment plans by modelling the temporal evolution of patient data45). Machine learning approaches have also been incorporated in resource allocation for measurement; for example deciding what laboratory measurements or psychosocial measures from mobile phones should be measured, when and on whom, trading off the value of information against the cost of acquisition46,47. With a public and population health lens, the propagation and enhancement of disparities in resource allocation should also be considered via a multi-level perspective. Examples of such efforts to account for and address inequity could be the development of methods that include relevant information beyond the individual level, such as the patient’s geographic location, in the optimization procedure48.

Current challenges

From the elemental look at the fields of population and public health and current data and machine learning efforts, we now shift our attention to future enterprise. What are the significant challenges that must be surmounted and for which there is room for machine learning innovation? We consider challenges across data, problem selection and formulation.

Privacy and health data

In light of a multi-level perspective, it should first be clarified that privacy can be viewed at collective and individual levels. This is important because collective rights are not necessarily a large-scale representation of individual rights and related issues49. At a collective level, for example, people may wish to avoid being stigmatized via certain assessments. At an individual level, people can also be concerned about their own privacy. The COVID-19 pandemic brightly illuminated and surfaced these issues and several trade-offs specific to person-generated data and public health based on the rapid use of data to inform policy efforts and models during the crisis. While focused literature has comprehensively summarized challenges, proposed recommendations and discussed these in detail50,51, we summarize main concepts here.

To make the use of person-generated data feasible, and more broadly any digital data in public health research and practice efforts at scale, regulatory frameworks for appropriate guidance for the health sector and other end-users have been emphasized. Regulation has an important role; indeed, current procedural mechanisms are lacking in their uniformity, which is needed to promote predictability and trust in the public at individual and collective levels. Regulatory guidance, drawn from human rights legislation, must be supported by broad audit and enforcement powers52. Beyond this, institutions that use data should foster robust data stewardship and standardize their practices to international best practices (such as ensuring the team developing the framework is appropriately diverse, in technical specialization as well as in terms of demographics and connections with appropriate communities). The latest recommendations on data privacy foster a dynamic approach, open to iteration. Approaches must also go beyond anonymization, which in itself may decrease utility and remains vulnerable to de-anonymization via aggregation of disparate pieces of information53,54.

A comprehensive approach to privacy is also imperative in lieu of relying only on regulation, which can be slow to change, or be a source of structural discrimination55. Research has also shown that within current regulatory frameworks it is common that individuals who share data publicly may not be cognizant that their data may be used by external parties for public health monitoring or research purposes56. Thus a discussion on data privacy is inextricable from efforts to empower the public to make their decisions and weigh in on what risks are appropriate for them. Best practices in health communication can be leveraged for this purpose. Data sharing can be a form of public health intervention and can increase the likelihood of using the data further and translating results into action. Reporting exposure data back to study participants is increasingly critical and can increase self-efficacy, particularly when working with underserved communities57. Personal actions as well as attitudes and trust of the use of data are essential in response to public health crises58. Accordingly, there is room for the development of such approaches in effective communication of data, which could entail summarizing and communicating one’s own data in a manner that is interpretable, and also aggregating or contrasting it with data from others in a privacy-preserving manner. Machine learning is starting to be used to automate patient education and information communication efforts59. Such approaches can be informed by the health communication literature, which shows that instead of treating health literacy as a patient problem that needs to be fixed or circumvented, health literacy interventions can integrate the principles of socio-ecology to develop interventions that build capacity and empower individuals and communities via factors significant to them60,61.

Assessing external validity

External validity, the validity of applying the conclusions of a scientific study outside the context of that study is an important consideration in all statistical and machine learning in health endeavours. Given the comprehensive consideration of multi-level attributes in population and public health, spanning populations and their context, external validity brings new considerations for data and algorithms. Whereas large denominator-based survey and government reports typically used in public health efforts aim to provide the information to mitigate these challenges (that is, by including a representative population, or at least information about which group is represented), the localized information offered through person-generated data comes with several external validity challenges. First, the data often do not come with linked information regarding attributes of the persons sharing the data (for example gender, age and so on), making it difficult to understand who has contributed the data22. Second, as the data from such sources are not organized into specific constructs (that is, the free-form text or images have to be processed to form features), variables from one dataset or environment may not be comparable to another. Finally, given the opt-in nature of (and resources required for) the tools used to produce person-generated data, measurement of an outcome may be affected by selection bias. Accordingly, it is important to understand the factors that lead to the data being contributed; developing algorithms on data in a new context can result in biased results if the mechanisms by which the algorithm worked in the original population are not well understood. For example, in one location people may tend to share more information due to different social norms23.

There are several machine learning avenues for increasing standardization across environments (where an environment is a data source, hospital, location and so on) or for analysing data from multiple environments. Data augmentation is an approach to fill gaps in non-representative samples62. Domain adaptation methods can be developed to address distribution shifts that may occur across different environment, both based on different populations and data generation mechanisms.25,63,64 In particular, thinking about data distribution shifts and differences from a causal perspective has been utilized to inform the empirical learning processes 23,65.

Measuring and integrating social determinants of health

New data provide a needed way to capture data from daily life outside of the hospital, yet still critical for a comprehensive understanding of health. However, social determinants encompass a broader set of information than just that at the individual level. There is a need for better measures of upstream factors that are important social determinants. This spans environmental, policy and other social factors such as racism55. Some work leveraging new data and machine learning methods has focused on using natural language processing, computer vision or other approaches to identify and generate relevant features for a specific task, which can be leveraged to generate features of the built and social environments from unstructured data in scalable ways (that is, for more environments and communities)66,67,68,69. Second, although social determinants have been studied extensively in the epidemiology literature, the findings from these studies have underscored the need for methods that can better capture flexible and complex relationships between social determinants and health outcomes70,71. This need also indicates an opportunity for machine learning; the flexibility of modern machine learning methods may help us model these relationships72,73. Third, the use of social variables in causal models is often restricted, under the premise that they are non-manipulable or not intervenable74. A causal perspective (for example, by the use of directed acyclic graphs) is important to systematically assess the role of different variables to model them in a manner that will also enable identification of where interventions can and should occur. At the same time, causal methods often assume stable unit treatment value, which implies that there is no interference and only one version of treatment; this is often non-tractable with the complex nature of social determinants and other methods must be investigated. In general, a full specification of social determinants and the pathways by which they operate is important. This requires identification of the variables and mediator or moderator effects75. An understanding of distal and proximal determinants and their relations is also relevant to be able to examine and consider the effect of different forms of interventions and to identify and work towards structural changes at the root cause of disparities55.

Health disparities

The description and explanation of racial and ethnic health disparities, which are differences in health status or in the distribution of health resources between different population groups, arising from the social conditions in which people are born, grow, live, work and age are major focuses of public and population health research. These disparities often manifest in specific gender, income level and race/ethnicity groups experiencing greater health risks2. Algorithmic fairness has recently emerged as a field of machine learning, with the goal of mitigating differences in machine learning outcomes across social groups. Broadly, algorithmic fairness approaches have consisted of statistical notions that ensure some form of parity for members of different protected groups (for example by race, gender and so on) and individual notions that aim to ensure that people who are ‘similar’ with respect to the classification task receive similar outcomes. We refer the reader to detailed summaries of algorithmic fairness elsewhere76.

Although it is becoming clear that algorithms are an important place to search for bias, as algorithms become embedded in many societal efforts, algorithms can incorporate and augment existing biases through many means, not only the outcomes they optimize for, which are a main focus of algorithmic fairness efforts (Fig. 2). Indeed, one risk is that clinicians and others may trust that issues of bias are sufficiently managed via algorithmic fairness efforts77. Accordingly, here we place algorithmic fairness in relation to the broader literature in health disparities within public and population health to identify challenges and opportunities in algorithmic fairness work with respect to advancing health.

Fig. 2: Illustration of sources of bias at different stages of data and algorithm use.
figure 2

The figure, adapted from Mitchell et al.100, shows where disparities can manifest in the process of generating and using data. Where machine learning is applied (what questions are asked) can impact disparities and equity, and is added to the pipeline. Algorithmic fairness comes into play at the end of this pipeline (via questions asked/outcomes optimized).

Health disparities are reflective of social oppression and its influence on the health of individuals that identify with such marginalized communities. It is essential to identify what leads to disparate health outcomes to design interventions to mitigate disparities and improve the health of high-risk populations. This involves multiple types of task, such as measuring health outcomes78,79 and disparities across social groups80,81,82, as well as designing policies to mitigate the disparities82. Figure 2 provides a framework for analysing how societal bias can result in biased predictions and where algorithmic fairness contributes (bottom two boxes in Fig. 2). Essentially, data are always sourced via some perspective83; an important consideration in any use of data to understand disparities. Issues of data representation in training data have also been well documented in terms of their importance in biased algorithm outcomes84. Linked to this, how algorithms make inference from underrepresented features85,86 can also contribute to biased outcomes and disparate performance. In the following subsections, current gaps in machine learning and algorithmic fairness work with respect to health disparities are made explicit, and recommendations are provided for ensuring that health equity remains an inherent goal in the design of machine learning algorithms in health settings.

Algorithmic design

Obermeyer and Mullainathan79 recently presented a commonly used clinical risk score that considered financial cost expenditure as a proxy for health care needs. Owing to unequal access to care, less money is spent on care for Black patients compared with white patients, and thus although health care cost appeared to be an effective proxy for health by some measures of predictive accuracy, large racial biases resulted. As such cost-based proxy objectives are not uncommon, an essential step is to be aware of task goals and outcomes and interrogate them with respect to health equity. It should be noted that several of the constructs considered in today’s algorithmic fairness measures are socially determined (for example race, gender) and thus consideration of them without broader attention to the systemic processes involved in their determination shifts focus away from the root causes of inequity55. Historically, algorithmic fairness has not accounted for the complex causal relationships between the biological, environmental and social factors that give rise to differences in medical conditions across protected identities. These missing factors can result in misalignment in equity and algorithmic fairness notions as described above77. Moreover, the deployment of algorithms could also perpetuate or augment disparities even with ‘algorithmically fair’ efforts87. In the following sections, challenges related to the use of such variables (as well as those related to algorithmic fairness that are exterior to the algorithm) are elaborated.

Pitfalls with ‘proxies’ in modelling social variables

Although accounting for factors such as ‘race’ may be important in specific analyses, it is often unknown what the comprising factors of such social constructs are, how they interact and how to model them88. Indeed, poor representation of variations within and between groups, along with the difficulty in attaining the appropriate factors, have led to the use of social constructs such as race as proxies for unknown and/or unmeasured biological and social factors (including racism). A recent study highlighted several clinical risk estimation tools across cardiology, nephrology, obstetrics and many other specialties that all use race, and how this use of a simple race variable severely compromises the health of marginalized individuals41. Accordingly, how relevant biologic variation is to be assessed and reported without stratifying populations based on factors such as race and ethnicity is still a challenge to be addressed89. Overall, the use of variables such as race, even when considering them as markers to be fair with respect to, such as in algorithmic fairness efforts, obscures variation within and between individuals, impeding equity goals.

Multiple axes of disparities and intersectionality

Health disparities are often measured by considering averages across individuals who identify with a certain attribute, such as a race category. However, measuring health disparities as averages by category does not fully represent or describe the multifaceted and interwoven effects of the forces shaping disparities. Moreover, gradients within specific categories are also neglected. For example, disparities have continued across income groups of childbearing women even after the introduction of policies to improve the health of pregnant women in California2. Indeed, lived experiences are frequently the product of intersecting patterns of social forces such as racism and sexism90, and modelling them as simple additive or multiplicative effects will not capture the full complexity. Research in public health has aimed to address this challenge of modelling the non-additive and dynamic nature of multiple social factors beyond simple interaction terms91. Recently, a statistical multi-level method for capturing social factors and their intersections, known as multi-level analysis of individual heterogeneity and discrimination accuracy has been described92. This method involves decomposing total variance into (1) between-strata variance, which allows the identification and assessment of disadvantaged groups, and (2) within-strata variance, which allows the identification of individuals within a social group that are at added disadvantage compared with other members of the group. The approach presents several advantages over fixed-effect models that include interaction terms for multiple sensitive attributes, including restricting parameter growth to linear (as opposed to geometric) forms and adjusting for the sample sizes of the intersectionalities. Overall, the dynamic nature of intersectionality highlights the need for the development of new machine learning approaches that can handle multiple protected attributes and their intersections, including at higher dimensions.

Algorithmic fairness in deployment and health disparity dissonances

Going beyond the data used and methods employed, there can be further equity impediments based on the types of questions that are the focus of machine learning efforts. Indeed, as work interrogating the limits of algorithmic fairness has discussed, the possible veneer of technical neutrality may break down once the full systems the technology is embedded in are considered93. In particular, although algorithms themselves may be deemed ‘fair’, they can result in unfairness when considered in the context of the systems they are deployed in. We illustrate such dissonances via two scenarios.

First, we highlight unfairness that can result from the use of protected attributes that are proxies. Consider the scenario represented in Fig. 3; predicting a health outcome H (for example, risk of cardiovascular disease) based on individual-level information including the perceived protected attribute, P, and clinical conditions, C. Even if successful in ensuring that the risk is fair using a group metric such as demographic parity (equal decision rates across groups regardless of outcome)94 with respect to racial identity as the protected group attribute, by only considering individual-level attributes, we are still left with unexplained variance for aspects that race can be acting as a proxy for, such as education, E, income levels, I, and neighbourhood characteristics N95. Indeed, equal risk scores for Black and non-Black patients would not eliminate disparities across lower-income Black patients versus higher-income Black patients. In health efforts, considering sensitive variables at macro96 and individual levels simultaneously is one route to a more holistic consideration of fairness97.

Fig. 3: Abstract illustration of challenge in algorithmic fairness due to unexplained variance or proxy variables.
figure 3

Consider a given P that is composed of several factors E, I and N. H is to be predicted using C and P while ensuring that the model prediction is fair with respect to P. However, E, I and N (shaded blue) are typically not accounted for, though doing so would better represent variance in P.

Another challenge in applying algorithmic fairness efforts that can exacerbate disparities is posed by the population to which the focus of algorithmic development and fairness efforts are devoted. Figure 4 represents a health care example wherein individual-level factors are included in an algorithm used to make a decision on a treatment, T, for patients based on C. An algorithm ensuring that model outcomes are fair with respect to P could still perpetuate health inequities in the population if insurance status, represented by I, and necessarily populations for which this varies are not accounted for (because insurance status can be related to health outcomes98). A continued focus on efforts that improve care for only the top tier of patients will advance disparities and can be detrimental to all. Machine learning approaches such as transportation of causal effects and domain adaptation may be used to focus questions and studies on improving efforts for neglected populations99. At minimum, population representation (who is being included in fairness efforts) should be identified and tallied to draw focus to the development of innovations for underrepresented groups.

Fig. 4: Abstract illustration of challenge in algorithmic fairness owing to who the data represents.
figure 4

A given P and C, which is used to determine the treatment T, are ultimately used to assess H. If high-risk populations such as the uninsured are not included (shaded blue), because I affects T, inferred relationships and treatment effects would not be relevant to those most vulnerable. Furthermore, if such populations continue to be excluded from algorithmic efforts, overall disparities can increase.

Conclusions

Through a discussion of the principles of population and public health, as well as current machine learning efforts, we synthesize and summarize areas where machine learning innovation may synergize with, advance and build on research and practice in these fields. We distil major areas of challenge spanning the data used, methods developed and questions asked, which are all important in equity considerations. These challenges include: the relevance of multi-level factors in health, privacy considerations with relation to health data, external validity concerns specific to public health data and questions, as well as the measurement of and inclusion of social determinants in causal models. We identify how algorithmic fairness efforts must be considered in the context of the data and systems in which they are applied, showing how they could otherwise perpetuate or advance health disparities. Speaking of health equity when only referring to clinical decision-making and fair AI in health care limits equity considerations with respect to health, ignoring the multi-level influences on (and possible interventions in) our health. These principles to shape data, measures and questions of machine learning efforts in health should be leveraged from domains such as public and population health, in which the study of inequity and health is rooted. This grounding is desirable as we work towards improving health for all populations amidst shifting climates, priorities and data. In sum, this Perspective aims to open the imaginary and activate the machine learning community regarding the types of data and questions we consider when thinking about machine learning and AI in health.