## Introduction

Urbanization forecasts estimate that six of ten people will live in cities by 20301, increasing impacts on natural resources and demand for ecosystem services, both within and beyond city borders. Recent studies predict that the share of large cities vulnerable to water stress will increase from 35 to 45% in the next 25 years2. Watershed degradation and water treatment costs have increased throughout the 21st century2, and existing water governance systems and engineering responses are proving inadequate and unfeasible both environmentally and economically. The problem is particularly acute in large cities, where high populations and population densities concentrate many people dependent on the same water supplies. Increasing demand on water resources, together with exogenous factors such as those related to accelerated climate change, is driving new management strategies to sustain the provision of watershed services3,4,5,6. As water supply concerns grow, policies and tools must adapt to these new contexts7,8,9.

### Enabling conditions (variable importance)

We used random forest analyses to rank 17 predictor variables, each representing one or more key enabling conditions described in Fig. 1 (see Supplementary Data 1 for relationships between enabling conditions and representative data sets). The representative data were used to form two statistical models using different groups of cities: (1) the Global Cities Model, containing all 416 large cities from the CWM, and (2) the Non-USA Cities Model, containing only the 299 large cities in the CWM that lie outside of the USA. We developed the second model because cities in the USA are overrepresented in the CWM. These cities all receive the same value for country-level data, which would decrease the ability of these variables to explain variation in the data and result in artificially lower importance rankings. There were not enough large U.S. cities with IWS programs in our data set to justify a random forest model of only cities in the United States.

The two most important enabling conditions in both models were percent of watershed with agricultural land cover and percent of watershed area designated as protected (Fig. 3). Both models also indicated that Average Annual Growth (the average annual growth rate of national GDP for 1994–2014) is important, possibly because economic growth may increase the resources available for payment for ecosystem services programs and rapid economic growth can increase impacts to water supplies from increased development without infrastructure and institutions in place to address these impacts11. Other enabling conditions, such as population (an indicator for number of potential stakeholders) and enforceability of contracts (a World Bank Indicator14 used as proxy for both ability to enforce IWS agreements and secure land tenure) were relatively important in both models. Both models ranked water diversion volume and watershed population density as not important in predicting the presence of an IWS program relative to other variables.

### Differences in enabling conditions between groups of global cities

Less variation in a variable would result in a lower importance ranking because it would be harder for the algorithm to differentiate outcomes. Anticipation of this relationship partially motivated our choice to build the Non-USA Cities model. For example, when applied to the Global Cities Model, the random forest algorithm identified other variables not based on national scale data as having greater predictive power (Fig. 3). Variable importance rankings differed in our models with the exception of the top two most important enabling conditions. We expected differences between the Global Cities and Non-USA Cities models because of the resolution of our data and the high number of USA cities in the data set. Several of the variables we included in our analyses were based on national scale data and were associated with the presence of IWS only in the Non-USA Cities model. Furthermore, governance and economic variables such as presence of IUCN organizations, World Bank aggregate governance indicators, and Gross Domestic Product (GDP), ranked relatively more important for predicting the presence of IWS in the Non-USA Cities model. In addition, the Global Cities model included a disproportionately high number of USA cities (117 of the 416 cities in the Global Cities Model were in the USA), all with the same values for socioeconomic and governance variables that were based on national-scale data.

Weighted Drought Vulnerability is ranked important for Global Cities and not important for Non-USA Cities, indicating that drought vulnerability is more associated with IWS programs in the USA than elsewhere. However, drought could be a driving factor to search for policy and program innovations only when other enabling conditions are already in place. Such interactions among variables potentially explain some differences between the important conditions found in our models. Enabling conditions may interact within each model, such that different values of one condition may impact the importance of other conditions. The following section describes additional analysis of the behavior of individual conditions within each model.

### Enabling condition directionality and behavior

Enabling conditions are sorted into four general categories: biophysical, economic, governance, and sociocultural conditions (Fig. 1). While some enabling conditions were more important than others, in both models at least one important enabling condition came from each of the four categories (Table 1).

We used partial dependence plots for the random forest models to explore how individual enabling conditions could predict IWS programs across each range of values in the representative data (Supplementary Fig. 1). Partial dependence plots depict the relationship between an outcome and different values of a predictor variable within a model, with all other predictor variables held constant. For many of the variables, the marginal effect on the outcome (presence or absence of an IWS program) was more pronounced for changes occurring between lower values (e.g., changes in watershed area at the lower range of area size). The partial dependence plots indicate that at low values, marginal changes in an enabling condition could have a large impact on predicting IWS, whereas at higher values, marginal changes did not increase an enabling condition’s importance in predicting IWS. This may indicate possible thresholds for some predictor variables, above which increasing the variable does not influence the likelihood of IWS presence. Table 1 summarizes the direction of the relationship between each enabling condition and the outcome. For example, watersheds with a higher percentage of area with agricultural land cover were more likely than otherwise comparable watersheds to contain a city with an IWS program. Watersheds with a lower percentage of area protected were also more likely to contain a city with an IWS program.
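The idea behind a partial dependence curve can be illustrated with a minimal Python sketch. This is not the authors' code (the analysis was run in R); `toy_predict` is a hypothetical stand-in for a fitted model, and the feature names and values are invented for illustration. The curve is obtained by forcing one predictor to each grid value for every city and averaging the model's predictions.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        curve.append((value, mean(predict(r) for r in modified)))
    return curve


def toy_predict(row):
    """Hypothetical model: the effect of agricultural cover saturates at 30%."""
    ag = min(row["pct_agriculture"], 30.0) / 30.0
    prot = 1.0 - row["pct_protected"] / 100.0
    return 0.5 * ag + 0.5 * prot


# Invented example cities (not data from the paper).
cities = [
    {"pct_agriculture": 10.0, "pct_protected": 20.0},
    {"pct_agriculture": 50.0, "pct_protected": 60.0},
]
curve = partial_dependence(toy_predict, cities, "pct_agriculture", [0, 15, 30, 60])
```

In this toy setting the curve rises between 0 and 30% agricultural cover and is flat afterwards, which is the kind of threshold pattern described above.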

### Expanding the scope of IWS

Our results could be used in combination with local, context-specific data to guide decisions about sites for future IWS programs. We selected the top 5 enabling conditions from the Non-USA Cities model and divided cities outside of the US into top or bottom half of values for each enabling condition depending on the relationships described by partial dependence plots for each condition (Supplementary Fig. 1). Using this ranking system, the following four cities most closely matched the top 5 enabling conditions associated with IWS programs, but do not currently have a program: (1) Dhaka, Bangladesh; (2) Guayaquil, Ecuador; (3) Dubai, United Arab Emirates; and (4) Leon, Mexico. An additional 37 cities met the characteristics for 4 of the top 5 enabling conditions (Supplementary Data 3). However, it would be critical to supplement an analysis of candidate cities with additional information about the places and people, as our results are not comprehensive of all required factors that enable IWS programs. For example, alternative approaches to managing urban water supplies such as desalination in Dubai could eliminate a need for IWS.
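The screening logic described above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' procedure: the condition names, the use of the median as the split point, the treatment of cities exactly at the median (counted as not matching), and all values are assumptions.

```python
from statistics import median


def count_matches(cities, directions):
    """Count, per city, the conditions whose value falls in the favorable half.

    `directions` maps a condition name to +1 (higher values associated with
    IWS presence) or -1 (lower values associated with IWS presence).
    """
    medians = {
        cond: median(values[cond] for values in cities.values())
        for cond in directions
    }
    scores = {}
    for name, values in cities.items():
        scores[name] = sum(
            1
            for cond, sign in directions.items()
            if (values[cond] - medians[cond]) * sign > 0
        )
    return scores


# Hypothetical cities and values, for illustration only.
example = {
    "A": {"pct_agriculture": 60, "pct_protected": 5},
    "B": {"pct_agriculture": 40, "pct_protected": 30},
    "C": {"pct_agriculture": 20, "pct_protected": 10},
    "D": {"pct_agriculture": 10, "pct_protected": 50},
}
scores = count_matches(example, {"pct_agriculture": +1, "pct_protected": -1})
```

Cities with the highest counts and no existing program would then be the candidates to examine further with local, context-specific information.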

### Outlier cities

Not all conditions must necessarily be in place for an IWS program to develop. Within our analysis, there are examples of cities with IWS that do not have all the identified enabling conditions in place. For example, Seattle in the US has an IWS program with 0% agriculture cover in its source watershed (97.59% forest cover) but a high percentage of protected area (82.1%). This situation reflects a history of land acquisitions by the Seattle Public Utilities, which now owns and protects a large portion of the watershed. In other countries such as Mexico, China, South Africa, and Colombia, we found additional cities with IWS programs even though important enabling conditions were not present. In some cases, as with Colombia, where four of their seven large cities employ IWS, there may be national level programs, legal instruments, or concerted NGO efforts to initiate and support IWS9,11,26,27. As hypothesized by previous research22, while not all variables are needed for IWS to exist in a city, a combination of sufficient enabling conditions such as political support28, strong conservation need29, or outside conservation funding30 could provide sufficient conditions for an IWS program to emerge. We emphasize that knowledge of which conditions are critical in specific contexts would be important for IWS program design, program success, and long term IWS sustainability.

Conversely, having important enabling conditions in place is not sufficient to ensure presence of an IWS program. Our database also contains examples of cities that do not have IWS programs even though they have high levels of enabling conditions, such as the candidate cities we identified. Having enabling conditions in places with no IWS program could indicate presence of a different management strategy that successfully achieves the same outcomes of protecting urban water supplies. For example, cities and countries could have alternative policies or management practices in place, ranging from strong regulatory frameworks to more voluntary measures such as source water protection plans.

## Discussion

In our assessment of 416 cities with over 1.15 billion drinking water consumers, conditions representing a range of sociocultural, governance, biophysical, and economic factors were important for IWS presence. In comparing all major cities to only those outside the United States, two suites of important enabling conditions emerged in particular. We found that key enabling conditions for IWS programs in major global cities include the amount of watershed area in (1) agricultural land use and (2) protected designations. In general, threats or risks to ecosystem services can facilitate the development of IWS by increasing awareness of ecosystem service benefits and their need for conservation22,25,29,31,32. Places where ecosystem services have clear benefits to human communities are more likely to protect ecosystem services, since beneficiaries have incentive to compensate service providers25,28,30,33,34,35.

Greater percentage of agricultural land in the watersheds serving a city was an essential enabling condition for all of the cities in our sample and for the cities outside the USA (Fig. 3). Previous research suggests several mechanisms by which agricultural land can be an important factor associated with the development of programs such as IWS. Agricultural lands have long been a key area for the implementation of payments for ecosystem services type approaches, often due to large numbers of private landowners as ecosystem services suppliers, and the lack of specific regulation for concerns such as nonpoint source pollution18,34,36. Upstream agricultural land could further be associated with IWS programs and other environmental policy interventions because it can impact urban drinking water supply quality. Thus these land uses also present a ready opportunity for organized management actions7,9,16. Our results support the proposed linkages between agricultural lands, impacts on downstream water supplies, and existence of payment programs such as IWS, which have been cited as key drivers of IWS programs in cities such as New York in the US and Quito in Ecuador15.

The percentage of protected area in source watersheds was the second highest ranking enabling condition in both models—although the relationship was negative. As the percentage of protected area increased, the probability a city had an IWS program decreased. IWS programs are designed to provide land owners with incentives to protect or enhance the watershed for the provision of water services14. Watersheds with large percentages of protected areas may not need further protection or incentives provided by IWS programs, so there is less motivation to develop programs in these locations. Additionally, source watersheds with a lot of land in the public domain may be easier to convert to protected status while watersheds with more private landowners or community-based tenure arrangements are a better target for IWS. In watersheds that have a low percentage of protected area, there may be increased opportunities for an IWS program as a way to influence management in the watershed and enhance water provision services via interventions on privately-managed land. Establishment of protected areas may also face additional hurdles in watersheds with large amounts of agricultural land37, leading water managers to seek out market-based approaches such as IWS. Finally, a watershed with high percentage agricultural land and low percentage protected area could indicate increased risks to provisioning water services that have potential impacts to downstream water users.

Our research is one of the first attempts to quantitatively evaluate enabling conditions for IWS programs in cities across the world. Previous research on IWS has focused on individual or limited numbers of cases rather than global patterns. As a global-level analysis, our research begins to fill this gap by broadly testing factors associated with the existence of IWS programs. Our results should not, however, be interpreted as a mandatory or static checklist of all necessary factors to implement IWS or a similar policy. Though various conditions predict IWS presence, it is possible for IWS programs to emerge in a variety of situations. We identified the contextual conditions present in areas where IWS interventions already exist, and the conditions that were less relevant to the existence of IWS. Contextual details about the mechanisms underlying the emergence of IWS are important for understanding enabling conditions in specific places, even if the finer scale conditions vary. Conservation practitioners, in particular, could add from their experiences in the field to improve our understanding of what local conditions facilitate both the emergence and sustainability of IWS programs. For example, the lower cost of implementing IWS schemes as compared to other policy tools is a known factor in their emergence34,38. Our analysis outside the US identified per capita GDP as positively associated with IWS presence, while conservation spending was negatively associated. The positive association with GDP could potentially suggest that a certain level of affluence is needed for IWS and that water providers are not able to spread the cost to users when users are predominately poor. For conservation spending, it may mean that when spending is high, there are co-benefits to water quality coming from other investment actions that make IWS less necessary, similar to our interpretation for the negative relationship with the percentage of protected areas in the watershed. 
Where information on the cost effectiveness of various alternative strategies (including IWS) is available, it would enhance understanding of program emergence and sustainability.

Future research can further evaluate enabling conditions across a variety of contexts and scales to help establish clear relationships. Documenting and monitoring IWS performance are important for providing data to test the mechanisms through which enabling conditions are associated with the existence and sustainability of programs. Sub-national data and analyses would improve our ability to test enabling conditions and validate theory. Our results provide insights about general patterns and broad trends for large cities, but the nature of global synthesis can mask relationships between conditions that could explain variation within a region. Regional or country level analysis could provide more details about the mechanisms underlying how a program was operating (e.g. successfully or not) and the factors that are most important in initiating IWS programs, but our global-scale analysis is not designed or poised to take advantage of this higher resolution sub-national scale data. For example, funding sources, investors, and supporting organizations differ among IWS programs in Latin America27, and within the US some major water utilities pay upstream landowners to change management practices (e.g. Denver Water, Colorado) or purchase land in the watershed (Seattle Cedar River Watershed).

Local conditions and context indeed matter for more nuanced analysis and local application of findings; however, understanding the general conditions that make it more likely that a program will emerge is an important step in understanding where and how to dig deeper into finer grain analysis. Understanding the general conditions can provide a partial explanation of program presence and help evaluate the potential scope for expanding programs. Numerous researchers in this field have come to similar conclusions, in particular that the time is ripe to collect previously disparate lessons learned from case studies of ecosystem services and synthesize them for broader general conditions impacting presence of IWS (see synthesis provided in Huber-Stearns et al.). For example, Naeem et al.39 call for the need to document initial baseline conditions, including the initial state of threats to services and important factors that will forecast service trends in the beginning of a program, and Ingram et al.40 distill lessons learned about the use of ecosystem services, especially around understanding necessary institutional factors needed where governance may be weak. Recent research on PES programs more broadly identifies key characteristics of buyers, sellers, and program specific metrics as key determinants of the spread and uptake of PES13.

Implications and implementation of research on natural resource management are critical for practitioners. We have been working with collaborators at The Nature Conservancy (a non-governmental organization) on how to use the findings from this research to improve their IWS development program. When comparing potential locations for program investment, the most important conditions can be used to evaluate where IWS programs are likely present in comparable locations. In evaluating cities for program development, those that have similar characteristics to cities that do have a program may be good candidates. We provide an example process in Supplementary Note 1 comparing Recife and Salvador, which are both coastal cities in Brazil, using a ranking of cities based on the top 5 important conditions from our Non-USA Cities model (Supplementary Data 2). Neither currently has an IWS program according to our research, although there are other cities in Brazil that do have a program. By comparing values for important variables delineated by this research, an IWS program is more likely present in Recife. This information is valuable when combined with local context and investment criteria to evaluate scope and expansion of IWS programs into new locations.

IWS programs emerge out of the interplay among numerous factors in complex social-ecological systems. What works in one place may not work in another because of the unique social and ecological contexts in each place. Our study takes an empirical approach in examining broad and globally available evidence on IWS programs and their enabling conditions. To elucidate particular conditions that enable innovative solutions in natural resource management, we emphasize that further cross-disciplinary and sub-regional investigations are needed.

## Methods

### The city water map

Our list of 534 global cities comes from the City Water Map, version 2.325 (CWM), a database by The Nature Conservancy containing information on large cities and their source watersheds. The original city list for CWM started with the World Urbanization Prospects (WUP) report conducted by the United Nations Population Division1 that lists all current and previous world cities with a population > 300,000. Cities below this population threshold were added to the CWM from research on 225 cities with populations over 100,000 in the United States2. Data on the source watersheds and specific withdrawal information were collected by searching water utilities directly, though in some cases no information was found. The final City Water Map list contains 534 cities, including the world’s 50 largest urban areas, the largest urban area in each country with > 750,000 people, and a representative sample of cities stratified by both geographic region and population range1.

The CWM database contains a known bias resulting from data accessibility and availability that oversampled USA cities and undersampled Indian and Chinese cities1. The data were subset by removing all USA cities that met either of the following two criteria: (1) a population < 300,000, or (2) no available population data. This cutoff is based on the city population limit of 300,000 from the World Urbanization Prospects report by UNPD2 that was used to develop the CWM database. Most of the cities in CWM under the 300,000 threshold were additions to the WUP report, and creating this cutoff reduced the data set by almost 100 USA cities.
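The subsetting rule can be expressed as a short filter. The sketch below is a hypothetical Python illustration (the record fields `country` and `population` and the example records are assumptions, not the CWM schema): non-USA cities are always kept, while USA cities are dropped when their population is below 300,000 or missing.

```python
def keep_city(city, threshold=300_000):
    """Return False for USA cities below the threshold or lacking population data."""
    if city["country"] != "USA":
        return True
    population = city.get("population")
    return population is not None and population >= threshold


# Invented records, for illustration only.
records = [
    {"name": "City 1", "country": "USA", "population": 650_000},
    {"name": "City 2", "country": "USA", "population": 120_000},
    {"name": "City 3", "country": "USA", "population": None},
    {"name": "City 4", "country": "Ecuador", "population": 250_000},
]
subset = [c["name"] for c in records if keep_city(c)]
```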

### Identifying cities with IWS programs

Data on existing IWS programs were gathered from several sources. We analyzed the 416 cities that met the UNPD criteria for large cities (population > 300,000). We used Forest Trends’ State of Watershed Investments bi-annual report11 (29 cities identified) and a literature review of IWS programs, conducted with the search engine Web of Science and the publishing service ScienceDirect, to identify 59 cities in the CWM that have an IWS program. The search covered titles, abstracts, and keywords only, using the search terms “payment* for ecosystem services” OR “payment* for environmental services” OR “payment* for water* services” AND “water*”. Web of Science returned 136 articles and ScienceDirect returned 91, which, after excluding duplicates, produced a library of 171 articles.

Much of the program information was collected from the State of Watershed Payments annual report, produced by Forest Trends11. The Forest Trends report and literature search were reviewed for IWS programs that met two criteria: (1) they provide water for a city in the CWM database, and (2) drinking water protection is specified as a program goal. The list of cities that met the IWS criteria includes those with Demonstration Projects focused on drinking water, because they are actively managing drinking water using an IWS program. For this research, cities with IWS programs (CityIWS) are those cities within the CWM database with an IWS program identified by the Forest Trends report, the literature review, or both (Supplementary Fig. 2). Cities with no known IWS program are denoted by “Cityno IWS.” Of these 59 cities with IWS, 53 met the UNPD criteria for large cities. We defined IWS as transactional arrangements (in cash or in-kind) between two or more parties that compensate a land manager for protecting drinking water supplies for urban beneficiaries11,22. Our list of enabling conditions built on a synthesis of theory and case studies on payments for ecosystem services conducted by coauthors on this paper22. We identified global data sets for each variable (e.g. city population, watershed area) or, when necessary, for a proxy indicator that represented the variable (e.g. Property Rights Index represents land ownership and access). We intentionally targeted data for all four condition categories (biophysical, economic, governance, and sociocultural data) identified by Huber-Stearns et al. in an attempt to represent as many different types of potentially important characteristics as possible. All city data are available in Supplementary Data 3.

### Enabling conditions concept and data

#### Enabling conditions concept

The original concept and list of enabling conditions is derived from previously published work22. Enabling conditions are defined as factors that increase the likelihood of an intended change in the governance approach, strategy, or management regime. Enabling conditions, by definition, facilitate the emergence or sustainability of a particular environmental policy, while the absence of key enabling conditions can present a barrier to management or sustained policy action. In this initial publication we summarized existing literature on the concept of enabling conditions and synthesized the information into a list of potential conditions, grouped by category (Fig. 1). Although these categories provided more structure for the presentation of conditions, it is important to note that the conditions in each theme were identified from a variety of disciplinary perspectives and fields, journal types, and author considerations, so no one theme was solely represented by one discipline.

#### Enabling conditions data

Here we distinguish between EC variables, the broad conditions identified by Huber-Stearns et al.22, and Representative Data, the actual data used in the analysis. Information from 14 data sets was collected, processed, and integrated into a relational database (see Supplementary Data 1 for the relationship between EC variables and representative data sets). For this study we targeted global data sets to emphasize standard measurements for each indicator. For some EC variables no representative data were available with global coverage. In some cases representative data could potentially represent multiple EC variables (Supplementary Data 1). For example, the number of IUCN organizations per million people could represent the presence of an influential supporter of PES such as a politician or prominent NGO, the presence of strong intermediaries, and strong capacity among actors. In these cases it is also possible that the representative data reflect a combination or interaction of EC variables.

### Statistical analysis

Statistical analyses were performed in R version 3.2.341 with some pre-processing of geospatial data in ArcGIS42 within an equal-area Mollweide projection43.

### Water supply origin and water source characteristics

The origin of the water supply for each city and characteristics of the watersheds were described using CWM diversion type categories and volumes, combined with delineations of the surface and groundwater basins that serve each city25,44,45. Percent ground or surface water was categorized into one of six types: primarily surface water (> 75% of diversion volume from surface sources); three classes of mixed sources (50–75%, 25–50%, or 1–25% surface volume); groundwater sources only; or no available data. Surface water includes all diversion types except groundwater and alluvial aquifers. Watershed area was calculated as the combined area (km²) of all watersheds and groundwater basins being used for drinking water for each city. Percentage of protected area is from IUCN-designated protected lands within this total area46. Land cover types (percentage forest and percentage agricultural and/or pastoral) were calculated for the source watersheds and basins44 of each city and grouped based on classification per Supplementary Table 2. For cities with mixed above and below-ground water sources with diversion volumes available for each, land cover was weighted by diversion volume. If diversion volumes were not available for all sources, land cover was represented by the sources with available data.
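As a rough illustration of these two rules, the Python sketch below classifies a city by its surface-water share and computes a volume-weighted land cover percentage. It is not the authors' code. The handling of class boundaries (for example, whether exactly 50% belongs to the 50–75% or 25–50% class) and the choice to average only over sources that report a diversion volume are assumptions made here for illustration.

```python
def source_category(surface_pct):
    """Map percent of diversion volume from surface sources to one of six types.

    Boundary handling is an assumption; the text does not specify it.
    """
    if surface_pct is None:
        return "no available data"
    if surface_pct > 75:
        return "primarily surface water"
    if surface_pct >= 50:
        return "mixed, 50-75% surface"
    if surface_pct >= 25:
        return "mixed, 25-50% surface"
    if surface_pct >= 1:
        return "mixed, 1-25% surface"
    return "groundwater only"


def weighted_land_cover(sources):
    """Volume-weighted mean land cover over (percent_cover, volume) pairs.

    Sources without a diversion volume (None) are skipped, which is one
    possible reading of "represented by the sources with available data".
    """
    usable = [(cover, volume) for cover, volume in sources if volume is not None]
    total = sum(volume for _, volume in usable)
    if not usable or total == 0:
        return None
    return sum(cover * volume for cover, volume in usable) / total
```

For example, a hypothetical city drawing 30 units from a watershed with 80% agricultural cover and 10 units from a basin with 20% cover would be assigned a weighted cover of 65%.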

### Calculating post-stratification weights

Post-stratification weights were calculated for each city in the CWM to further address sampling bias and adjust the distribution of cities to reflect real city distributions45. Using the World Urbanization Prospects (WUP) report1, the proportion of cities within each geographical region was calculated for each of 5 city population classes1 (Supplementary Table 3). Region was used as opposed to country because some countries have few or no cities in the CWM data set. The WUP report originally supplied the base data for the CWM, and the geographical regions and population classes are described in that report as well. Proportions of each city class were calculated and used to determine a weight field (# database cities in region class/sum of UNPD cities per region) that adjusts city data proportions to the WUP report proportions.
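For readers unfamiliar with the technique, the following Python sketch shows the textbook form of a post-stratification weight: the stratum's share in the reference population divided by its share in the sample. This is a generic illustration under assumed inputs, and the exact ratio used to build the weight field described above may differ. Strata keys (region, population class) and all counts are hypothetical.

```python
def post_stratification_weights(sample_counts, population_counts):
    """Weight per stratum = population share / sample share (textbook form)."""
    n = sum(sample_counts.values())
    big_n = sum(population_counts.values())
    return {
        stratum: (population_counts[stratum] / big_n) / (sample_counts[stratum] / n)
        for stratum in sample_counts
    }


# Hypothetical counts: one stratum is oversampled relative to the reference list.
sample = {("North America", "300k-500k"): 30, ("Asia", "300k-500k"): 10}
reference = {("North America", "300k-500k"): 50, ("Asia", "300k-500k"): 50}
weights = post_stratification_weights(sample, reference)
```

With these invented counts the oversampled stratum receives a weight below 1 and the undersampled stratum a weight above 1, and the weighted sample size equals the original sample size.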

### Variable selection

Thirty candidate variables from existing data sets were identified to represent potential enabling conditions as identified in Huber-Stearns et al.5 Variables either directly quantify conditions, as in the case of biophysical and economic characteristics, or serve as recognized proxies of city characteristics. Predictably, many of these variables are correlated, as they are based on shared information (e.g., several of the country level economic indices are calculated using GDP). Collinear and replicated variables were excluded. Selection was based on analysis of Spearman pairwise correlations and variance inflation factors47. The R package Corrgram v1.1048 was used to calculate correlation coefficients. Of the 30 variables tested, 18 were found to be correlated with at least 1 other variable at corr > 0.7, indicating high collinearity17. Supplementary Table 4 provides the correlation coefficients between highly correlated variables (corr > 0.7) and justification for which of the correlated variables were selected for inclusion in the final models.
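A pairwise Spearman screen of this kind can be sketched in plain Python as follows. The paper's analysis was run in R, so this is only an illustration: the variable names and values are invented, and comparing the absolute value of the coefficient against 0.7 is an assumption (the text states only corr > 0.7).

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def collinear_pairs(columns, threshold=0.7):
    """Return variable pairs whose absolute Spearman correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(spearman(columns[a], columns[b])) > threshold
    ]


# Hypothetical variables, for illustration only.
data = {
    "gdp_per_capita": [1, 2, 3, 4, 5],
    "gni_per_capita": [2, 4, 5, 9, 20],
    "rainfall": [3, 1, 4, 2, 5],
}
flagged = collinear_pairs(data)
```

Each flagged pair would then be reviewed to decide which variable to retain, as documented for the actual data in Supplementary Table 4.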

In addition to the Spearman correlation coefficient, the variance inflation factor (VIF) was calculated with the R package car48 for the full database as well as for a subset containing only non-USA cities, though not all variables could be included because of missing values. VIF is calculated as 1/(1 − R²), where R² comes from a linear model regressing one predictor on the others, and estimates how much the variance of a coefficient is inflated by linear dependence with other predictors. A higher VIF value indicates that the variance (the square of the standard error) is larger than if the predictor were not correlated with other predictors. VIFs were calculated iteratively by sequentially dropping the predictor with the largest VIF, recalculating with the remaining variables, and repeating until returned values were under the preselected VIF threshold of 349. Supplementary Table 5 provides the VIF values for our final list of 17 variables, with any values exceeding our threshold of 3 in bold.
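The iterative VIF procedure can be sketched in Python as below. This is an illustrative reimplementation rather than the `car` routine the authors used: each predictor is regressed on the others by ordinary least squares (solved here through the normal equations, which is adequate for a small example but would fail for perfectly collinear columns), and the predictor with the largest VIF is dropped until all values are at or below the threshold. Variable names and data are hypothetical.

```python
def r_squared(y, predictors):
    """R^2 from an OLS regression of y on the predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(p)
    ]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def vif(columns, name):
    """VIF = 1 / (1 - R^2), with R^2 from regressing `name` on all other columns."""
    others = [values for key, values in columns.items() if key != name]
    return 1.0 / (1.0 - r_squared(columns[name], others))


def prune_by_vif(columns, threshold=3.0):
    """Drop the predictor with the largest VIF until all VIFs are <= threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = {key: vif(kept, key) for key in kept}
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept


# Hypothetical predictors: the first two are nearly collinear.
predictors = {
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gni": [1.2, 1.9, 3.1, 4.2, 4.9, 6.1],
    "rainfall": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
retained = prune_by_vif(predictors)
```

In this toy example one of the two nearly collinear predictors is removed and the weakly related third predictor is retained.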

Three representative data sets (Conservation Spending, Average World Bank Governance Indicators, and National GDP per capita) did not meet the VIF criteria, but were included in the model analysis because there were no other proxy variables for the EC variable they represented. After reducing both the number of cities and the representative data, the final database used for analysis contained 416 cities and 17 variables, representing 14 of the EC variables described by Huber-Stearns et al. (Correlation coefficients provided in Supplementary Table 6). A final data table with all representative data is provided as Supplementary Data 3.

### Random forest model

We determined the predictive importance of our representative data using a random forest model of classification trees23. This model was selected because inference trees are robust for data with high dimensionality, that is, many predictor variables relative to the number of data points50, often referred to as a large p, small n problem. Previously published research on enabling conditions for IWS programs often discusses only one or a few enabling conditions; our analysis instead allowed us to build a model that uses interactions between variables, rather than evaluating fit to an existing model or set of assumptions. Using machine learning to consider many variables at once allows us to rank those variables by their importance for predicting the presence or absence of IWS programs. Logistic regression was considered as a potential model, but initially resulted in perfect separation, likely because of the small minority class and high dimensionality of the data. The random forest approach has been widely used in the medical field for highly unbalanced data with varied and potentially interacting predictor variables50,65,66, and is becoming more prevalent in the conservation and natural resource management literature, especially for evaluating global patterns67,68,69,70. Random forest methods also reduce the bias toward the majority class that can occur when logistic regression is applied to unbalanced data sets71,72, which is important because cities with IWS programs represent the minority class in this data set.

Random forest models are a type of machine learning algorithm consisting of many individual decision trees, each constructed from a random subset of the observations and predictor variables. Each tree in the random forest model predicts the presence or absence of an IWS program for a CWM city. The model then ranks all variables according to their aggregate prediction performance across the forest of individual trees23, which allows us to rank variables by their importance in predicting the presence or absence of IWS in a given city. We selected this classification system specifically for its high classification accuracy and its ability to model complex interactions between predictor variables24. We used the R package party51 because it is particularly well suited to unbalanced data sets with high dimensionality50, can address missing data52, and can reduce bias from predictor variable type and correlated predictors25,44,45.
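To make the ensemble idea concrete, the sketch below builds a toy forest in plain Python. It is a deliberate simplification and not the method used in the study: each "tree" is a single-split stump fitted to a bootstrap sample with a random subset of predictors, whereas the party package grows conditional inference trees. All function names, the toy data, and the parameter values are assumptions for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    """Classify one row with a fitted stump."""
    f, t, l_cls, r_cls = stump
    if f is None:
        return l_cls
    return l_cls if row[f] <= t else r_cls


def fit_forest(rows, labels, n_trees=101, mtry=2, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and mtry random predictors."""
    rng = random.Random(seed)
    names = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(names, mtry)                 # random predictor subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: "x" determines the class, "z1" and "z2" are noise.
rows = [{"x": float(i), "z1": float(i % 2), "z2": float((i * 7) % 5)} for i in range(1, 21)]
labels = [int(i > 10) for i in range(1, 21)]
forest = fit_forest(rows, labels)
print(predict_forest(forest, {"x": 3.0, "z1": 1.0, "z2": 0.0}))
```

Individual stumps that happen to draw only noise predictors vote more or less at random, but the majority vote is dominated by stumps that saw the informative predictor, which is the intuition behind aggregating many weak trees.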

The data were split 80/20 (Pareto principle) into training and test sets and were not transformed. We weighted enabling conditions data to represent city distribution regionally and globally using UN statistical region boundaries (described above in the section titled Post stratification weights). Because the dependent variable (IWS presence or absence) is unbalanced, several strategies were attempted to address potential bias in the model due to the small size of the minority class (only 11.5% of modeled cities contain an IWS program because some cities were not included in the models). Four approaches to this class imbalance were tested: (1) the larger class (cities with no IWS program) was undersampled53; (2) the smaller class (cities with an IWS program) was upsampled; (3) new minority-class observations were created using the Synthetic Minority Over-sampling Technique (SMOTE) function54 in the R package DMwR55; and (4) weights were incorporated in the random forest classifier, which made the classifier cost sensitive and penalized the model fit for misclassifying the minority class. The final models reported here addressed class imbalance by incorporating class weights in the random forest model, as this method produced models with higher predictive ability (predictive performance is described below) than the other options.
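The split and three of the four imbalance strategies can be sketched as follows (SMOTE is omitted because it requires a nearest-neighbor interpolation step). This is an illustrative sketch, not the study's code: the inverse-frequency weighting formula is an assumption, since the text does not state how the class weights were derived, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def class_weights(labels):
    """Inverse-frequency weights, n / (n_classes * class_count) (assumed scheme)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}


def resample(rows, label_of, mode, seed=0):
    """Balance classes by undersampling ("under") or upsampling ("over")."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    sizes = [len(g) for g in groups.values()]
    target = min(sizes) if mode == "under" else max(sizes)
    balanced = []
    for group in groups.values():
        if len(group) >= target:
            balanced.extend(rng.sample(group, target))  # drop surplus rows
        else:
            extra = rng.choices(group, k=target - len(group))  # duplicate rows
            balanced.extend(group + extra)
    return balanced


# Hypothetical toy data: 200 cities, 23 of them (11.5%) with an IWS program.
cities = [(i, 1 if i < 23 else 0) for i in range(200)]
train, test = train_test_split(cities)
print(len(train), len(test))
print(class_weights([label for _, label in cities]))
```

With these toy proportions the minority class receives a weight roughly eight times that of the majority class, so a misclassified IWS city costs the model correspondingly more than a misclassified non-IWS city.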

Each model contained 8,000 trees56, with the number of preselected variables (mtry) set to 4 (the square root of the number of predictors57, rounded to a whole number) and all other parameters set to default. We used the unbiased permutation variable importance measure (function varimpAUC) because it is particularly suited to unbalanced response classes58. The varimpAUC output is also a non-conditional variable importance measure that can be computed with missing data52. With this approach, correlations between predictor variables need to be addressed separately, which was done using a Spearman correlation test as described in the Variable selection section above. Predictor ranking was evaluated using mean decrease in accuracy because of varying scales of measurement and correlation among predictors59.

The predictive power of each model was evaluated using a standard metric, the area under the receiver operating characteristic curve (AUC), which assesses classification accuracy56. AUC values range from 0.5 (no better than chance) to 1; the closer to unity, the more accurate the model. Models with a value of 0.7 are considered reasonable and those with values > 0.8 are considered strong60. Both models presented here achieved AUC values > 0.7.

The relationships between individual predictors and the outcome (presence of IWS) were evaluated using partial dependence plots via the R package mlr61. These plots depict the marginal effect of an individual condition within each random forest model by integrating out (and thus controlling for) all other variables, providing an average trend for that variable62,63. Greater y values indicate that an observation for a specific variable is associated with a higher probability of classifying a city as having an IWS program.
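The three evaluation ideas in this section (AUC, permutation importance measured as a drop in AUC, and partial dependence) can be sketched in a model-agnostic way. The code below is an illustration only: varimpAUC computes importance tree by tree on out-of-bag observations, whereas this sketch permutes a column once per repeat for any scoring function. The `predict` callable, the dict-based row format, and all names are assumptions.

```python
import random
from math import floor, sqrt


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance_auc(predict, rows, labels, column, repeats=20, seed=0):
    """Mean drop in AUC after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = auc(labels, [predict(r) for r in rows])
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - auc(labels, [predict(r) for r in permuted]))
    return sum(drops) / repeats


def partial_dependence(predict, rows, column, grid):
    """Average prediction with one predictor fixed at each grid value."""
    return [
        sum(predict(dict(r, **{column: g})) for r in rows) / len(rows)
        for g in grid
    ]


# mtry as the square root of 17 predictors; rounding down is an assumption.
mtry = floor(sqrt(17))

# Hypothetical scoring function and data: the score depends on "x" only.
rows = [{"x": float(i), "z": float(i % 2)} for i in range(1, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
score = lambda r: r["x"]
print(mtry, auc(labels, [score(r) for r in rows]))
print(permutation_importance_auc(score, rows, labels, "z"))
print(partial_dependence(score, rows, "x", [2.0, 6.0]))
```

A predictor the model ignores has an importance of zero under this measure, and its partial dependence curve is flat, which is the baseline against which the reported rankings and plots can be read.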

### Code availability

All data and code, including the code for the Shiny app referenced in the Data availability section, are available on GitHub (https://github.com/cromulo/IWS) and at https://doi.org/10.5281/zenodo.1403842.