ClimActor, harmonized transnational data on climate network participation by city and regional governments

Cities and regions have become increasingly engaged in global climate change governance. They are pledging their own climate mitigation targets and participating in membership networks that typically are transnational in nature and engage thousands of subnational governments. Researching these growing trends in participation has been difficult due to the disparate and inconsistent nature of this self-reported data. To facilitate future analyses of these actors, we introduce ClimActor, the largest harmonized global dataset of more than 10,000 city and regional governments participating in networks like the Global Covenant of Mayors for Climate and Energy, C40 Cities for Climate Leadership, ICLEI Local Leaders for Sustainability, among others. We include key contextual information on each actor’s population, geographic location, and administrative jurisdiction to facilitate disambiguation of potential overlaps in actions or emissions. We also provide a series of cleaning functions based on phonetic and fuzzy string matching algorithms within an open-source R package to make it easy for anyone to immediately use the ClimActor dataset with other relevant data. Measurement(s) climate action network participation Technology Type(s) digital curation • computational modeling technique Sample Characteristic - Environment city Sample Characteristic - Location Earth (planet) Measurement(s) climate action network participation Technology Type(s) digital curation • computational modeling technique Sample Characteristic - Environment city Sample Characteristic - Location Earth (planet) Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12994214

determining overlaps between actors: city governments embedded within states often commit their own climate actions without coordinating with higher levels of jurisdictions, which could result in double counting of emissions reductions or other activities if not appropriately accounted for 14,16,17 . Due to these challenges, most studies evaluating subnational climate actions are focused on a small number of cities with available data 18,19 or localized to a specific region 20 .
To obtain a full global picture of which subnational governments are taking climate actions as reported through primarily transnational climate initiatives, such as the Global Covenant of Mayors for Climate and Energy, C40 Cities for Climate Leadership, and ICLEI Local Leaders for Sustainability, this paper shares what we believe to be the most comprehensive and harmonized database of subnational climate action network participation and contextual information to date. It is intended to underpin research efforts to build on and extend analyses on the scope and impacts of subnational climate action. To create it, we collated data from multiple sources (see Online-only Table 1) and developed an R package ("ClimActor") that provides a series of functions to aid in cleaning, matching and mapping these disparate datasets and enables users to match this information with other datasets (i.e., on emissions, commitments, etc.) in a standardized way. The ClimActor subnational climate action dataset includes more than 10,000 subnational actors' location, population, area and participation in climate action networks (Figs. [1][2][3]. The ClimActor dataset helps navigate several key challenges and considerations around the analysis of subnational climate action. Data platforms often represent subnational actors in disparate ways: New York City, for example, can appear as New York, NY, New York, or the city of New York on multiple platforms but refer to the same city of more than 8 million people. Collecting information on which actors have signed up to various pledging platforms and then merging these data with critical information on population or emissions is then challenging because actor names must be cleaned and standardized to ensure actors' data are correctly matched. Actor names can be misspelled or subnational governments' could have similar names (i.e., London could refer to London, United Kingdom or London, Ontario or even London, Kentucky). The process of disambiguating these names and matching datasets can be incredibly laborious and time-consuming 21 . To address these obstacles, the ClimActor dataset compiles data on subnational climate network participation and relevant contextual data, enabling researchers to explore how climate action participation and trends vary within and across different geographic and economic contexts (Fig. 3) 15,22 .
The dataset also addresses the challenges of disentangling overlaps and hierarchies in subnational climate action data. Identifying overlaps between different levels of jurisdiction and avoiding instances of "double-counting" is a vital part of many climate action analyses 14 . Identifying the links between local, regional, and national governments also aids efforts to explore the nuances of subnational climate action within specific national governance contexts 6 . The ClimActor database lists the networks an actor participates in as well as its higher-levels of political administration, making it possible to understand overlaps in climate commitments. This information can be used to avoid double-counting commitments reported across several different platforms and to facilitate explorations of subnational engagement with transnational climate initiatives. It can also help determine what commitments are not subsumed within a nested jurisdiction but instead in addition to (i.e., "additional") to what a higher level of government has pledged, if any (i.e., if New York City and New York State both make climate pledges, understanding their nested relationship would allow for an analyst to make an evaluation of whether actions at the lower level are truly in addition to what a higher level of administration has pledged) 1 . www.nature.com/scientificdata www.nature.com/scientificdata/ Methods participation data. Membership data for subnational actors participating in climate action networks (Online-only Table 1) such as the carbonn Climate Center and EU Covenant of Mayors for Climate and Energy initiatives and UNFCCC's Global Climate Action Portal were collected from the respective websites using Python v3.6.5 23 and the BeautifulSoup v4.6 24 and json 25 packages. Generally, the workflow to collect actor data from these websites is as follows: a complete list of actors within the database is first extracted from the site and if other relevant information is available for an actor (e.g., location, population, etc.), that information is collected and matched to an actor's name. Data are then cleaned using the pandas package in python 26 before being stored as csv files. Online-only Table 1 lists the complete climate action networks that were collated for the ClimActor dataset.
Contextual data. If a climate action platform or database does not include contextual information on a particular actor, such as population, location, area, etc., this data is collected from Wikipedia's English language pages. Although a crowdsourced website and information repository, Wikipedia has been shown to be an accurate source of political data and information 27 . Wikipedia is accessed through the MediaWiki API using Python, and the BeautifulSoup, json, and requests 28 packages. Each actor in the database is minimally associated with a name, entity type (either City, Region, Company, or Country), and ISO country code, each of which is used to resolve which Wikipedia page corresponds to an actor in question. This procedure begins by passing an actor's name to the MediaWiki API's search function, which returns a list of potential matches ordered by relevance. The top 20 pages are each downloaded and prepared for further filtering by mapping each page to a likely entity type and country.
To ensure the correct Wikipedia page is selected for a subnational actor, the Wikipedia categories to which a page belongs are fed to a function that counts the occurrences of certain words that likely correspond to a specific entity type (e.g., words like municipality, town, capital, ward, and village are more commonly found in pages that should be in the City category). If the word "disambiguation" occurs in a page's categories, the page is taken to be a Disambiguation page 29 . For instance, an actor in question that could have a name that refers to something else (e.g., Golf, Illinois could be mistaken for the sport of golf if no other contextual information is included). Otherwise, the entity type with the highest count of matching words is then taken to be the page's entity type. If no words match, the page is assigned no entity type. www.nature.com/scientificdata www.nature.com/scientificdata/ To infer a page's country, the page's "info-box"-the light grey table on the right of many Wikipedia pages-is searched for links to Wikipedia pages corresponding to the countries on the site's list of countries assigned an ISO code (ISO, 2020). The first of these links found in the infobox is taken to be the one corresponding to the country the page should be assigned to. If no links are found within the infobox, the page's body text is then searched according to the same procedure.
The list of 20 search results is then filtered such that only pages matching the actor in entity type and ISO code remain. The first of these pages-deemed by the Wikipedia search function to be most relevant-is then selected as the best candidate and is mined for contextual information according to the procedure described in the following paragraph. If no pages remain after filtering the top 20 for entity type and ISO code, the results are checked for any disambiguation pages. If one is found, then each of the links in the disambiguation page are processed and filtered as if they were the results of the search above, the first matching page being selected if multiple matches are found. If still no matches are found, then the process is repeated starting from the call of the search function including an actor's ISO code in the search query. If still no matches are found, the actor is not resolved to a page.
If a page is found during this process, contextual information is extracted from the page's info-box, a table containing rows usually several sections indicated by a single column spanning the entire table, each of which contains several related rows with two columns, the left one describing the data found in the right column. For example, under the section "population, " a row may be found containing two columns with values "Metro" and some number respectively. This information is used to find relevant population and figures, matched with corresponding years where possible. In addition to this data, latitude and longitude are collected in decimal degrees where possible. To account for potential overlaps between city and region actors, the latter which often encompass the former, we also collect information on city actors' higher jurisdictions. For example, Manhattan (borough) is located in New York City (city) within New York (state). This information can be used to avoid double counting of climate actions, although there are some manual disambiguations that are still needed. Wallonia Region in Belgium, for instance, is a higher administrative unit than other regional actors, such as Antwerp Province and Brussels Capital Region, and encompasses five other provinces within Belgium (Liège, Hainaut, Luxembourg, Walloon Brabant, and Namur). Avoiding overlaps between Wallonia Region and other city or municipality actors within Belgium would need to account for overlaps within these provinces located in Wallonia as well. In some cases, regional councils commit to climate action and encompass several regions and cities, such as the Assembly of Regions of Cote d'Ivoire (ARDCI).
Because we only scraped English-language Wikipedia pages, we acknowledge that some contextual information is missed through Wikipedia pages in other languages (see Technical Validation:Data Limitations). Other data sources for country-level contextual data (i.e., population, land area) were obtained from the sources listed in Online-only Table 2.

Data Records
The ClimActor package includes three main datasets: subnational climate action participation and accompanying contextual information for each actor (stored as "contextuals"); a key dictionary ("key_dict") of subnational actor standardized and unstandardized names for purposes of string matching (see 'Usage Notes' below); and a country dictionary ("country_dict") for cleaning country names associated with each actor that is required for the cleaning functions in the ClimActor package. The datasets are available on figshare 30 , and are also stored as Rdata objects and automatically loaded when the ClimActor package is loaded within a user's R environment. They are also available as CSV files and can be downloaded from the ClimActor package's GitHub page (www.github.com/ datadrivenenvirolab/ClimActor). An overview of the included information within the datasets is provided in Online-only Table 2. www.nature.com/scientificdata www.nature.com/scientificdata/

technical Validation
We performed validation checks primarily on the self-reported population and geographic location data from subnational actors via climate action reporting platforms. Platforms such as CDP and the Global Covenant of Mayors for Climate and Energy include population data, although sometimes the year associated with the data are not included and require validation. Because actors may simultaneously report to multiple climate action platforms at once and report conflicting data, we compared and validated the data to ensure the database reflects the most recent and accurate data available. Below we describe the validation checks applied to self-reported population and geographic location data. population data. In instances where subnational actors report to multiple climate action platforms and different population data are collected from these various sources, we compared the self-reported population numbers with each other. If the values varied within five percent of each other, we selected the first entry of that actor within the database. If the values varied more than five percent of each other, we prioritized the following data sources' population data: 1) CDP (formerly known as the Carbon Disclosure Project), as they require cities and regions to report on an annual basis; 2) Global Covenant of Mayors for Climate and Energy or EU Covenant of Mayors for Climate and Energy; 3) Carbonn Climate Registry. For actors in which no population data was reported to these three sources, population data was obtained from Wikipedia. For actors where population data was available but where the year corresponding to the population data was missing, the year corresponding to the population data was estimated by comparing the population data with the actor's Wikipedia page. For example, if an actor's population data matched the number on its Wikipedia page, we imputed the corresponding population year with the year listed on an actor's Wikipedia page.
Geographic location data. In instances where a subnational actors' geographic location data (longitude, latitude) were provided, we validated by plotting all actors' coordinates on a map (Fig. 1). A global polygon of countries' boundaries were used to identify potentially erroneous coordinate information. Outliers were then visually inspected and the geographic coordinate information was removed and replaced with Wikipedia's geographic coordinates for that actor if they fell outside the bounding box.

Data limitations.
Our dataset is limited by several constraints. First, data are self-reported by subnational actors that participate in transnational climate initiatives. While there are most certainly subnational governments outside of these initiatives that are taking climate actions, we are unable to systematically capture and assess www.nature.com/scientificdata www.nature.com/scientificdata/ them due to the unwieldy nature and infeasibility of collecting every instance outside of common networks and reporting platforms we used in Online-only Table 1 14 . Second, our data collection efforts were primarily limited to English-language sources, although there are certainly non-English language websites and data sources we can research for future iterations of the database. The use of translation websites and application programmer interfaces (APIs) and social media text mining, combined with natural language processing (NLP) techniques, could also aid in further expanding the ClimActor database to include more non-English actors and those not currently reporting to the largest and most common networks available.

Usage Notes
We developed the ClimActor R package to assist researchers in utilizing our subnational climate action database alongside other datasets. For example, researchers may have data on a subset of subnational actors' carbon emissions or GDP data that they wish to analyze alongside our dataset. Ensuring actors' names are consistent (i.e., New York could refer to either New York City or New York state), "clean" (i.e., one dataset could refer to Dallas, TX as simply "Dallas" or "City of Dallas" while referring to the same actor), and disambiguated (i.e., the city of London could refer to London, Ontario in Canada, London, Kentucky in the United States or London, United Kingdom) can often present challenges in analysis. Data from different languages also may include character encodings, accent marks, or various spellings (e.g., Venice and Venezia), which the ClimActor package assists in disambiguating and standardizing. The ClimActor package streamlines the data cleaning process by: • providing a "key dictionary" of consistent subnational actor names and contextual information (e.g., population, geographical coordinates, etc. whenever these data were available; see above Data Records); • providing a "country dictionary" of consistent country names (i.e South Korea could also be referred to as "Korea" or "Republic of Korea"; see above Data Records); • a set of basic data preparation functions to ensure that the dataset in question has the columns and entries needed to work with the subsequent standardization functions; • a set of data standardization functions to ensure standardized versions of accessory information (entity type, country, ISO code) that allows further functions to standardize actor names; • a set of string matching functions (using a mixture of exact, phonetic, and fuzzy string matching) to standardize actors' names based on our collected database of subnational actors; • a set of post-cleaning functions to update and expand the key dictionary (with novel actors in the user's database) and to include contextual information (i.e., data records listed in Online-only Table 2) from our database to the user's dataset.
The string matching functions are based on phonetic matching algorithms and string distance measures compiled in the phonics R package 31 and stringdist R package 32 respectively. The process reduces a character string (i.e., a subnational government actor's name) to a symbolic representation that approximates its English pronunciation before comparing string distance between these measures. This reduced form is then used for matching and avoids complications when there may be common words that increase the length of an actor's name (i.e Prefeitura de Pedreira and Prefeitura de Pedra Bela) or special characters such as accent marks that may complicate fuzzy string matching 33  www.nature.com/scientificdata www.nature.com/scientificdata/ (Metaphone, Nysiis modified, Onca modified refined, Phonex, and Roger Root). These algorithms were selected after evaluating all 16 (nine core and seven modified) that were included in the phonics R package and selecting the five that consistently yielded the most accurate match results in our validation. To validate the phonetic algorithms, the 16 different phonetic representations were first created for all entries in the ClimActor key dictionary. Representations were then visually inspected for duplication, before string similarity measures (based on the average of the Full Damerau-Levenshtein distance, q-gram distance, cosine distance q-gram profiles, Jaccard distance between q-gram profiles, and the Jaro-Winker distance) were used to calculate string similarities between alternative names for entries in the key dictionary (See Table 1 for an example). For detailed descriptions of these algorithms and their respective trade-offs, see Howard (2019) 31 .
A more detailed user guide outlining the workflow of using the ClimActor package is visible in Fig. 4 and also available at: https://github.com/datadrivenenvirolab/ClimActor. Because the landscape of subnational climate action changes almost daily, the online repository will be updated periodically.

Code availability
Data and code for the ClimActor R package functions is available on GitHub: https://github.com/ datadrivenenvirolab/ClimActor.