Main

As decision-making algorithms continue to be adopted for a variety of high-impact applications, many of them have been found to disproportionately harm marginalized populations, as evidenced by algorithmic audits1,2,3,4,5,6. In particular, area-based measures to identify disadvantaged neighbourhoods have recently become widespread for tasks such as allocating vaccines, assessing social vulnerability and adjusting healthcare costs, with the intention of promoting an equitable distribution of resources7,8,9,10. However, their potential for allocative harm—the withholding of resources from specific subpopulations11—is not well understood, and it remains unclear how different subpopulations may be disproportionately impacted by the design of such area-based models.

The California Community Environmental Health Screening Tool (CalEnviroScreen) is a data tool that designates neighbourhoods as eligible for capital projects and social services funding, and is intended to promote environmental justice. CalEnviroScreen’s model output is used to designate ‘disadvantaged communities’, for which 25% of proceeds from California’s cap-and-trade programme are earmarked. CalEnviroScreen also directly influences funding from a variety of public and private sources, and is reported to have directed an estimated US$12.7 billion in funding12. The funding targets of the tool are varied, including programmes for affordable housing, land-use strategies, agricultural subsidies, wildfire risk reduction, public transit and renewable energy. Similar data tools are in use or development at the federal and state levels across the United States13.

CalEnviroScreen ranks each census tract in the state according to its level of marginalization in terms of environmental conditions and population characteristics. The algorithm does so by aggregating publicly available tract-level data into a single score, based on variables from four categories: environmental exposures, environmental effects, sensitive populations and socioeconomic factors. Tracts in the top 25% of scores are designated as disadvantaged communities, representing ~10 million residents for whom earmarked funding is made available.
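
In outline, the designation rule can be sketched as follows. This is a minimal illustrative sketch, not the official implementation: the column prefixes and equal within-category weighting are assumptions, and the published model applies category-specific weights and scaling before combining scores.

```python
# Illustrative sketch of a CalEnviroScreen-style designation pipeline,
# assuming a pandas DataFrame `tracts` (one row per census tract) whose
# numeric columns use hypothetical prefixes for the four categories.
import pandas as pd

def designate(tracts: pd.DataFrame, cutoff: float = 0.75) -> pd.Series:
    # Percentile-rank each indicator (higher raw values are assumed to
    # indicate greater burden)
    ranked = tracts.rank(pct=True)

    # Average indicators into the two halves of the score: pollution burden
    # (exposures and effects) and population characteristics (sensitive
    # populations and socioeconomic factors)
    pollution = ranked.filter(regex="^(exposure|effect)_").mean(axis=1)
    population = ranked.filter(regex="^(health|socio)_").mean(axis=1)

    # Multiplicative aggregation into a single score
    score = pollution * population

    # Tracts in the top 25% of scores receive the designation
    return score.rank(pct=True) >= cutoff
```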

Screening tools like CalEnviroScreen have been criticized for being in tension with the principles of environmental justice, defined broadly as a movement to address systemic environmental harms faced by marginalized populations through mechanisms such as equitable resource allocation and inclusive decision-making14. On the one hand, such tools may distributively advance environmental justice by allocating resources to marginalized communities, but on the other, critics contend that subjective, state-run decision-making lacks accountability for affected communities, particularly as the state itself bears responsibility for perpetuating environmental injustices14,15,16. Consequently, such screening tools may present their algorithmic output as objective truth and place marginalized communities in competition for limited funds16,17. Algorithmic audits are therefore necessary to identify the extent to which the design of tools like CalEnviroScreen can impact different communities.

Audits of large-scale algorithms are often limited to observing ‘black box’ outputs, as individual-level data privacy requirements and proprietary pipelines prevent comprehensive audits of algorithmic systems1,2,18. By contrast, we are able to fully reproduce and test changes to CalEnviroScreen’s model due to its population-level usage of publicly available data, potentially enabling generalization of our findings to similar large-scale algorithms. In this Article we investigate the inner workings of CalEnviroScreen, characterizing model sensitivity, funding impact, ethical concerns and avenues for harm-reduction (Extended Data Fig. 1).

Model sensitivity and funding impact

The CalEnviroScreen model is highly sensitive to change. We found that 16.1% of all tracts could change designation based on small alterations to the model. This represents a high level of designation variation, given that only 25% of all tracts receive designation (Fig. 1 and Extended Data Table 1). These large fluctuations in designation are due solely to varying subjective model specifications such as health metrics, pre-processing methods and aggregation methods19. For example, changing the pre-processing method—switching from percentile ranking to a more commonly used method such as z-score standardization—led to a 5.3% change in designated tracts.

Fig. 1: CalEnviroScreen’s sensitivity to input parameters.

The axes denote model scores in terms of percentiles. Grey bars indicate maximum and minimum values from alternative plausible model specifications with varying health metrics, pre-processing methods and aggregation methods. The dashed red line indicates the 75th percentile cutoff score for funding designation. Dots indicate the median predicted model sensitivity at a given percentile, that is, by how many percentile ranks a tract's score can vary. Light shaded portions and error bars indicate 95% prediction intervals. Dark shaded portions indicate 75% prediction intervals (for example, in 95% of predictions, tracts at the 75th percentile can vary their score by 44 percentile ranks).

In the absence of a ground-truth variable or validation metric (that is, a concrete ability to quantify the true value of environmental harm in California), model sensitivity represents the ambiguity across alternative specifications, enabling an uncertainty assessment of the model outputs19,20,21,22. For example, we observe high levels of model sensitivity at the designation threshold (75th percentile), where the predicted tract ranking could vary across models by 44 percentile ranks (Fig. 1). Even tracts ranked as low as the bottom 5th percentile could be eligible under slightly different models. We observe lower, yet still substantial, model sensitivity at the 95th percentile, where the predicted range is 18 percentile ranks. Given this variability in ranking certainty, dichotomizing designation may present a false sense of precision, leading to funding decisions based on unstable information.

Receiving algorithmic designation is financially consequential. We estimated through a causal analysis that the effect of receiving designation from the algorithm is a 104% (95% confidence interval, 62–145%) increase in funding, equivalent to US$2.08 billion (US$1.56–2.41 billion) in additional funding over a four-year period for 2,007 tracts (Fig. 2 and Extended Data Table 2). Similarly, we estimated that the 400 tracts that would be eligible for designation under an alternative model (described below) would have received the equivalent of US$632 million (US$377–881 million) in additional funding over the same period.

Fig. 2: Total cumulative funding by California Climate Investments implemented in census tracts by CalEnviroScreen percentile from 2017 to 2021.

Dark blue bars indicate funding earmarked specifically for disadvantaged communities (DAC), as determined by the CalEnviroScreen model. Lighter blue bars indicate other earmarked funding (buffer and low income). Grey bars indicate all other funding. Buffer funding is earmarked for low-income communities and households that are not designated as disadvantaged communities, but are within half a mile of a disadvantaged communities census tract. Low-income funding is earmarked for low-income communities and households statewide.

Allocative tradeoffs and harms

Under such a model with high uncertainty, every subjective model decision is implicitly a value judgement: any variation of a model could favour one subpopulation or disfavour another. Both the model in its current form and plausible alternative forms can exhibit bias among different subpopulations, illustrating the zero-sum nature of delegating funding allocation to a single model.

To exemplify these challenging tradeoffs, we constructed an alternative model for designation assignment. CalEnviroScreen does not include race in its algorithm, but we are able to assess its impact on race by examining the racial composition of designated tracts. We first changed the pre-processing and aggregation methods to avoid penalizing tracts with extreme levels of variables such as air pollution indicators, and then incorporated a number of additional population health metrics for a broader definition of vulnerability to environmental exposures. On average, incorporating these changes increased designation of tracts with higher proportions of racially minoritized people in poverty, but decreased designation among racially minoritized populations overall (Fig. 3 and Extended Data Fig. 2).

Fig. 3: Allocative tradeoffs between racially minoritized populations in poverty and racially minoritized populations overall.

Comparison of how algorithmically designated tracts are distributed by race and poverty across the current CalEnviroScreen model and an alternative model, among tracts that would change designation status under the alternative model. The alternative model uses a different pre-processing technique, a different aggregation technique, and it incorporates additional population health variables. Red densities indicate tracts that receive designation under the current model but are not designated under the alternative model. Blue densities indicate tracts gaining designation under the alternative model. Contours are calculated as the smallest regions that bound a given proportion of the data.

In particular, expanding the ‘sensitive populations’ category of the algorithm presents ethical concerns. The category is represented by three variables: respiratory health, cardiovascular health and low birthweight. It would be sensible to include additional health indicators relevant to environmental exposures, such as chronic kidney disease or cancer23,24. The inclusion of such variables, however, would result in the loss of designation for tracts with large Black populations. Because low birthweight disproportionately affects Black infants, the introduction of other variables such as cancer—which also disproportionately affects Black populations, albeit to a lesser extent—would reduce the impact of low birthweight on the algorithm’s output.

Moreover, we found that the existing model potentially underrepresents foreign-born populations. The model measures respiratory health in terms of emergency-room visits for asthma attacks, which underrepresents groups who use the emergency room less or come from countries where asthma is less prevalent, yet still have other respiratory issues25,26. Consequently, using survey data on chronic obstructive pulmonary disease (COPD) to represent respiratory health increases the designation of tracts with foreign-born populations of 30% or higher (Extended Data Figs. 3 and 4).

Critically, the zero-sum nature and high sensitivity of the model make it conducive to manipulation. It is feasible for a politically motivated internal actor, whether subconsciously or intentionally, to prefer model specifications that designate tracts according to a specific demographic, such as political affiliation or race. Through adversarial optimization—optimizing over pre-processing and aggregation methods, health metrics and variable weights—we find that a model can be manipulated to increase the number of tracts designated for a specific US political party by up to 39%, or to decrease it by up to 34% (Fig. 4 and Extended Data Fig. 5). Any efforts to mitigate the harms of allocative algorithms such as CalEnviroScreen thus need to consider both model sensitivity and manipulability.

Fig. 4: Adversarially optimized distribution of algorithmically designated tracts by political party.

Columns and flows denote the distribution of tracts designated as disadvantaged by political affiliation, determined by the affiliations of district assembly members. The leftmost and rightmost columns are adversarially optimized to increase Democratic and Republican tracts, respectively. The centre column represents the original model. Flows between columns represent changes in binned percentile ranking of census tracts between the original and adversarial models.

Mitigation strategies

Because there is no singular ‘best’ model, we propose assessing robustness via sensitivity analysis and incorporating additional models accordingly. For example, the California Environmental Protection Agency recently decided to honour designations from both the current and previous versions of CalEnviroScreen, effectively taking the union of two different models. This approach reduces model sensitivity by 40.7%, and a three-model approach additionally incorporating designations from our alternative model reduces the model sensitivity by 71.0%. Using multiple models also mitigates allocative harm—by broadening the category of who is considered disadvantaged, different populations are less likely to be in competition with each other for designation.

A potential concern is that increasing the number of designated tracts may dilute earmarked funds for disadvantaged groups. However, incorporating an additional model per our example would increase the number of tracts by only 10%, yet reduce model sensitivity by 51.1%. Doing so would also reduce equity concerns and more accurately represent the uncertainty inherent to designating tracts (consideration should be given as to whether these benefits outweigh the downsides). Furthermore, adding models is only one possible solution. There are many other ways to equitably address decision-making under uncertainty, such as randomizing assignment for tracts near the decision threshold (similar to lottery admissions for educational institutions), aggregating outputs from multiple models into a single ensemble model, scoring tracts based on both the model output and its respective uncertainty measurement, or funding tracts on a tiered or sliding-scale system weighted by uncertainty measurements instead of using a single hard threshold27,28,29,30.
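
To make two of these mechanisms concrete, the following is a minimal sketch assuming tract-indexed percentile scores on a 0–1 scale. The union rule mirrors the two-model approach described above; the lottery band width and probability are illustrative parameters, not adopted policy.

```python
# Sketches of two mitigation mechanisms: a union of designations from two
# models, and a lottery for tracts whose scores are statistically
# indistinguishable from the cutoff. All parameters are hypothetical.
import numpy as np
import pandas as pd

def union_designation(scores_a: pd.Series, scores_b: pd.Series,
                      cutoff: float = 0.75) -> pd.Series:
    # A tract is designated if either model places it above the cutoff
    return (scores_a >= cutoff) | (scores_b >= cutoff)

def randomized_designation(scores: pd.Series, cutoff: float = 0.75,
                           band: float = 0.05, seed: int = 0) -> pd.Series:
    # Tracts within `band` of the cutoff enter a 50/50 lottery, reflecting
    # that their rankings are unstable under alternative specifications
    rng = np.random.default_rng(seed)
    near = (scores - cutoff).abs() <= band
    lottery = pd.Series(rng.random(len(scores)) < 0.5, index=scores.index)
    return (scores > cutoff + band) | (near & lottery)
```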

However, reducing model sensitivity is not a complete solution—transparency and accountability are necessary to reduce harm. The agency developing CalEnviroScreen is active in offering methodological transparency and soliciting feedback, which enables critiques such as ours and promotes public discourse. Agencies developing similar tools to identify disadvantaged neighbourhoods should follow suit. A safeguard like an external advisory committee comprising domain experts and leaders of local community groups could also help reduce harm by identifying ethical concerns that may have been missed internally. It would also promote equitable representation and involvement from the public, aligning with the tool’s goal of advancing environmental justice.

Discussion

Our findings are threefold: (1) CalEnviroScreen’s model is both sensitive to change and financially consequential; (2) subjective model decisions lead to allocative tradeoffs, and models can be manipulated accordingly; and (3) model sensitivity can be mitigated by accounting for uncertainty in designations, thereby reducing the need for tradeoffs. Concretely, we recommend accounting for uncertainty by incorporating sensitivity analyses and potentially including additional models to increase robustness, and urge for community-based independent oversight.

Our analysis is not a comprehensive audit of CalEnviroScreen. We do not identify every potential flaw or ethical concern of the model, but instead highlight illustrative examples of how model choices can facilitate allocative harm. Only members of a given community can fully know how their respective tracts are represented and affected by the algorithm. We do not advocate for any particular model over another. Such decisions are inherently subjective and should be made in consultation with affected communities and relevant experts. Our estimates of model sensitivity are probably underestimates, as we do not exhaustively specify alternative models. Furthermore, our estimate of the funding impact of algorithmic designation is probably an underestimate, as detailed data on relevant private funding sources are not publicly available. Our estimate of model bias for foreign-born populations may be inaccurate because undocumented immigrants are often underrepresented in census data31. We are unable to fully assess model sensitivity to the modifiable areal unit problem32—a phenomenon where varying the geographical unit of observation can alter model output—because we only had data to convert a fraction of the variables to a smaller geographical scale (Supplementary Information).

Other limitations of our analysis include the limitations besetting the data tool itself: missingness in data and a lack of random measurement error metrics. For example, indoor air quality or regulatory compliance may be important determinants of environmental risk, but are not included in CalEnviroScreen33,34. Similarly, CalEnviroScreen measures outdoor air quality using sensors and interpolation algorithms that infer air quality for areas between sensors, which may result in noisier estimates for marginalized communities farther from sensors. The degree to which such algorithms can cause allocative harm should be examined35. Overall, missingness in environmental data is more pronounced in marginalized communities36. As CalEnviroScreen’s data sources lack random measurement error metrics, our uncertainty estimates reflect only a specific type of uncertainty: model sensitivity, or ambiguity between models21,22. Incorporating random measurement error would increase the uncertainty of CalEnviroScreen scores.

Our work draws upon previous literature from a variety of disciplines. Previous studies related to decision science and composite indicator construction recognize model sensitivity, or ambiguity, as a measure of uncertainty, and demonstrate how diversity in modelling assumptions, or worldviews, can improve robustness19,21,22,37,38. In the algorithmic fairness literature, uncertainty has been formulated as a driver of algorithmic bias for ranking algorithms, and race has been found to be inferred from models that do not include it as a variable5,20,39. Environmental justice theorists have critiqued environmental screening algorithms for extractive data practices, the exclusion of affected communities from decision-making, the subjectivity of outside expertise in allocating community resources, and not including race as a variable7,16,17,36,40. Our analysis also draws from frameworks of ‘distributive justice’, examining how to fairly allocate resources within a society41,42,43,44. A recent environmental health study has examined how environmental screening tools can be improved to mitigate disparities in air quality45. Our analysis contributes to the literature by identifying technical mechanisms by which subjectivity in the model design of environmental screening algorithms contributes to uncertainty in the model output and the potential for allocative harm.

More broadly, our findings illustrate how allocative algorithms can encode unintentional bias into their outputs. Questions of how to allocate scarce resources have always been challenging and subjective, yet delegating allocation to algorithms may erroneously give the appearance of objectivity by obscuring the design choices behind the algorithms15,17,46. Any such notion that algorithms are intrinsically objective should be rejected. With increasingly high-dimensional and high-resolution data, unintentional bias will become both more common and more difficult to detect. Both algorithm developers and policymakers should acknowledge the subjective process of algorithm development and work to minimize harm accordingly.

Technical and regulatory solutions will be necessary to address the concerns of allocative harm as algorithms continue to be adopted for policy use. Although the misuse of such tools could exacerbate existing inequities, a careful and community-minded approach can lead to the broad realization of CalEnviroScreen’s intended goal—furthering environmental justice and mitigating the harms done to structurally marginalized populations.

Methods

Data

For our sensitivity analyses, we used census tract-level data obtained from the current version of CalEnviroScreen (version 4.0, implemented in late 2021), consisting of 8,035 observations with 21 variables measuring different aspects of environmental exposures and population characteristics. The variables measure ozone levels, fine particulate matter, diesel particulate matter, drinking water contaminants, lead exposure, pesticide use, toxic release from facilities, traffic impacts, cleanup sites, groundwater threats, hazardous waste, impaired waters, solid waste sites, asthma, cardiovascular disease, low birthweight, education, housing burden, linguistic isolation, poverty and unemployment.

For our additional variable analyses, we used the PLACES dataset from the Centers for Disease Control and Prevention to include tract-level variables on estimated prevalence for asthma, cancer, chronic kidney disease and coronary heart disease. We obtained demographic information on tract-level race/ethnicity from the American Community Survey.

For the causal analysis specifically, we examined years 2017–2021 of the California Climate Investments funding dataset, and used CalEnviroScreen 3.0 scores (implemented in 2017), as there are not yet sufficient data for funded projects guided by CalEnviroScreen 4.0. We calculated the amount of funding allocated to each census tract by summing the funding received from different programmes for each tract. For funding projects that were attributed to assembly districts but not specific census tracts, we made conservative assumptions that prioritized non-earmarked funding towards non-priority population tracts. We attributed assembly district-level funding to tracts using the following steps: (1) funds earmarked for ‘priority populations’ such as disadvantaged tracts were attributed exclusively to their respective tracts within that district; (2) the remaining funds were attributed to non-priority population tracts within the district, up to the amount attributed to priority population tracts; (3) any funds remaining after that were distributed equally (more details are provided in the Supplementary Information). For tracts spanning multiple districts, we followed the methodology listed in the California Climate Investments funding dataset, attributing each tract solely to the district containing the largest share of its population. For tracts missing the relevant block-level population metrics, we assigned each to the district containing more of the tract’s blocks. For a single tract that had missing population metrics and the same number of blocks in two districts, we assigned its district based on geographical area.
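
The following is a simplified sketch of these attribution steps under stated assumptions: district_funds and district_tracts are hypothetical inputs summarizing the funding dataset, earmarked funds are split equally among priority tracts, and every district contains at least one tract. The real dataset records programme-level amounts and several earmark types.

```python
# Sketch of the three district-to-tract attribution steps described above.
# district_funds: {district: (earmarked, other)} dollar totals (hypothetical)
# district_tracts: {district: [(tract_id, is_priority), ...]} (hypothetical)
def attribute_funds(district_funds, district_tracts):
    allocations = {}

    def add(tract, amount):
        allocations[tract] = allocations.get(tract, 0.0) + amount

    for district, (earmarked, other) in district_funds.items():
        priority = [t for t, p in district_tracts[district] if p]
        rest = [t for t, p in district_tracts[district] if not p]

        # Step 1: earmarked funds go exclusively to priority tracts
        for t in priority:
            add(t, earmarked / len(priority))

        # Step 2: non-earmarked funds go to non-priority tracts, capped at
        # the amount attributed to priority tracts (conservative assumption)
        matched = min(other, earmarked) if (priority and rest) else 0.0
        for t in rest:
            add(t, matched / len(rest))

        # Step 3: any remaining funds are distributed equally across all tracts
        everyone = priority + rest
        for t in everyone:
            add(t, (other - matched) / len(everyone))
    return allocations
```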

Algorithmic audit

We first reproduced the original CalEnviroScreen model based on its documentation47, then validated our reproduction on existing data. We next identified potential issues in the data tool and conceived plausible alternative models. As a general approach, we built alternative models (implementing various small changes to the current CalEnviroScreen model) and evaluated how they differed from the original model to assess the sensitivity of the CalEnviroScreen algorithm to model decisions. Variation was measured as the percentage of tracts changing designation, that is, the number of tracts changing designation divided by the total number of tracts, multiplied by 100. Details of each step of this approach are given in the following.
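
As a sketch, given boolean designation vectors indexed by tract, this variation metric can be computed as follows.

```python
# Percentage of tracts whose designation differs between two models
import pandas as pd

def designation_variation(base: pd.Series, alternative: pd.Series) -> float:
    # (number of tracts changing designation / total tracts) x 100
    return 100 * (base != alternative).mean()
```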

We assessed changes to (1) the pre-processing method, (2) the aggregation method and (3) health metrics, all of which are subjective choices in constructing composite indicators. We assessed pre-processing methods by changing the existing pre-processing method—percentile ranking—to z-score standardization. We assessed aggregation methods by changing the existing aggregation method—multiplication—to the arithmetic mean.
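
A minimal sketch of these two specification changes, assuming a DataFrame of raw indicator values in which higher values indicate greater burden:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, method: str) -> pd.DataFrame:
    if method == "percentile":   # existing CalEnviroScreen method
        return df.rank(pct=True) * 100
    if method == "zscore":       # alternative specification
        return (df - df.mean()) / df.std()
    raise ValueError(f"unknown pre-processing method: {method}")

def aggregate(pollution: pd.Series, population: pd.Series,
              method: str) -> pd.Series:
    if method == "multiply":     # existing CalEnviroScreen method
        return pollution * population
    if method == "average":      # alternative specification
        return (pollution + population) / 2
    raise ValueError(f"unknown aggregation method: {method}")
```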

We assessed health metrics based on our concerns about public health biases perpetuated by the algorithm. First, we noted that the existing method of measuring health indicators strictly by emergency-room visits may be skewed towards populations who use the emergency room disproportionately often, and so we tested including tract-level survey indicators of health in the model, namely asthma and cardiovascular health26. Second, we noted that using only asthma as a measure of environmental vulnerability with respect to respiratory health may not fully reflect those with respiratory health issues, so we tested including survey indicators for COPD. The included survey indicators of health were weighted such that the categories of respiratory health, cardiovascular health and low birthweight were equally weighted. Finally, we noted that low birthweight, cardiovascular and respiratory issues are not the only health-related ways in which populations may be vulnerable to environmental exposures, and so we tested including other indicators, such as chronic kidney disease and cancer.

Our alternative models were pre-specified and designed based on the changes listed above: changing pre-processing to standardization, changing aggregation to averaging, and including survey indicators of health for cardiovascular health, asthma, COPD, chronic kidney disease and cancer. We evaluated overall model sensitivity by assessing all combinations of model specifications, varying pre-processing (z-score standardization versus percentile ranking), aggregation (multiplication versus averaging) and health variables (including versus excluding the additional health variables we specified), and calculating the number of distinct tracts that change designation across all models. We trained a smooth nonparametric additive quantile regression model on the range (that is, minimum and maximum values across models) for each tract to obtain prediction intervals48.
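
A sketch of the resulting 2 × 2 × 2 grid of specifications, assuming a hypothetical helper designate_fn that returns a boolean designation per tract for one specification:

```python
# Enumerate all eight combinations of specification choices and count the
# distinct tracts that change designation relative to the original model
from itertools import product

def overall_sensitivity(df, designate_fn, base_designation):
    changed = set()
    for pre, agg, extra in product(["percentile", "zscore"],
                                   ["multiply", "average"],
                                   [False, True]):
        designation = designate_fn(df, pre, agg, extra)
        changed |= set(designation.index[designation != base_designation])
    return 100 * len(changed) / len(base_designation)  # % of all tracts
```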

Empirical strategy

We used a sharp regression discontinuity design with local linear regression as the functional form to estimate the effect of algorithm designation on total funding received49. We selected the bandwidth using the Imbens–Kalyanaraman algorithm50. The treatment variable was a binary indicator for each tract denoting whether it was designated as disadvantaged by the algorithm. The outcome variable was the log of total funding received per tract. The forcing variable was the CalEnviroScreen percentile rank for each tract. Covariates included the aggregate pollution burden and population characteristics indicators from CalEnviroScreen, and tract-level race and poverty estimates from the American Community Survey. As robustness checks, we estimated the treatment effect with varying bandwidths, functional forms, covariate adjustments and dataset configurations. We also estimated the treatment effect with a propensity score matching approach51, and a linear model causal forest (Supplementary Information)52. All parenthetical values reported in the main text are 95% confidence intervals, and were calculated by multiplying standard errors by the 97.5th percentile point of the standard normal distribution.
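
A minimal sketch of the discontinuity estimate follows, assuming tract-level arrays and an externally chosen bandwidth; the covariate adjustments and robustness checks described above are omitted.

```python
# Sharp regression discontinuity with local linear regression and a
# triangular kernel. `score` is the forcing variable (CalEnviroScreen
# percentile), `log_funding` the outcome; `h` is a bandwidth assumed to be
# chosen externally (for example, by the Imbens-Kalyanaraman algorithm).
import numpy as np
import statsmodels.api as sm

def rdd_estimate(score: np.ndarray, log_funding: np.ndarray,
                 cutoff: float = 75.0, h: float = 10.0) -> float:
    x = score - cutoff                   # centre the forcing variable
    inside = np.abs(x) <= h              # restrict to the local window
    treated = (x >= 0).astype(float)     # designation indicator

    # Local linear regression with separate slopes on each side of the cutoff
    X = sm.add_constant(np.column_stack([treated, x, treated * x])[inside])
    weights = 1 - np.abs(x[inside]) / h  # triangular kernel
    fit = sm.WLS(log_funding[inside], X, weights=weights).fit()
    return fit.params[1]                 # jump in log funding at the cutoff
```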

Adversarial optimization

We formulated our optimization strategy as follows:

$$\max_{\mathbf{W},\,p,\,a}\ \phi_d\bigl(f(\mathbf{W},\,p,\,a)\bigr)$$
$$\mathrm{s.t.}\quad 0.1\le w_i\le 0.9$$
$$p\in\{0,\,1\}$$
$$a\in\{0,\,1\}$$

where f is the CalEnviroScreen algorithm designating tracts as disadvantaged, ϕd is a function totalling the number of tracts belonging to a chosen demographic d (for example, political affiliation, race), W = {w1, …, wn} is a vector of weights for each variable in the CalEnviroScreen algorithm, p is an indicator variable denoting pre-processing options (percentile-ranking versus z-score standardization), and a is an indicator variable denoting the aggregation methods (multiplication versus averaging). Weight variables were restricted to be between 0.1 and 0.9 to prevent extreme individual weight values. We used the Hooke–Jeeves method to solve the optimization problem53.
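
A minimal sketch of this pattern search, assuming a hypothetical objective phi that, for fixed binary choices of p and a, maps a weight vector to ϕd (the count of designated tracts in the target demographic); the four (p, a) combinations can be enumerated outside the search.

```python
# Hooke-Jeeves pattern search (maximization variant) over the weight vector,
# with box constraints enforced by clipping. `phi` is a hypothetical
# objective; the binary choices p and a are fixed outside this function.
import numpy as np

def explore(phi, w, step, lo, hi):
    # Exploratory moves: try +/- step along each coordinate, keep improvements
    best = phi(w)
    for i in range(len(w)):
        for delta in (step, -step):
            trial = w.copy()
            trial[i] = np.clip(trial[i] + delta, lo, hi)
            val = phi(trial)
            if val > best:
                w, best = trial, val
                break
    return w, best

def hooke_jeeves(phi, w0, step=0.2, tol=1e-3, lo=0.1, hi=0.9):
    base = np.clip(np.asarray(w0, dtype=float), lo, hi)
    base_val = phi(base)
    while step > tol:
        new, new_val = explore(phi, base.copy(), step, lo, hi)
        if new_val > base_val:
            # Pattern move: extrapolate along the direction of improvement
            pattern = np.clip(new + (new - base), lo, hi)
            pat, pat_val = explore(phi, pattern.copy(), step, lo, hi)
            base, base_val = (pat, pat_val) if pat_val > new_val else (new, new_val)
        else:
            step /= 2  # no improvement at this scale: refine the step
    return base, base_val
```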

Political affiliation at the tract level was determined by the party affiliation of each tract’s assembly district. For tracts that spanned multiple assembly districts, we attributed those tracts to the district in which most of their population belonged, in line with how the California Climate Investments fund attributes tract-level funding to tracts spanning multiple districts. Race was determined by the percentage of each tract’s population belonging to a given race. We calculated the percent change in designated tracts for the party with fewer tracts.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.