All-hazards dataset mined from the US National Incident Management System 1999–2014

This paper describes a new dataset mined from the public archive (1999–2014) of the U.S. National Incident Management System/Incident Command System Incident Status Summary Form (a total of 124,411 reports for 25,083 incidents, including 24,608 wildfires). This system captures detailed information on incident management costs, personnel, hazard characteristics, values at risk, fatalities, and structural damage. Most (98.5%) of the reports are fire-related, followed in decreasing order by other, hurricane, hazardous materials, flood, tornado, search and rescue, civil unrest, and winter storms. The archive, although publicly available, has been difficult to use due to multiple record formats, inconsistent free-form fields, and no bridge between individual reports and high-level incident analysis. Here, we describe this improved dataset and the open, reproducible methods used, including merging records across three versions of the system, cleaning and aligning with the current system, smoothing values across reports, and supporting incident-level analysis. This integrated record offers the opportunity to explore the daily progression of the most costly, damaging, and deadly events in the U.S., particularly for wildfires.


Background & Summary
There has been a steady rise in the occurrence of billion-dollar disasters in the United States (U.S.) since the 1980s, with the past three years (2016-2018) each setting historic highs. The average number of billion-dollar disasters across this period has more than doubled (from 6 to 15 events per year 1 ). Further, cumulative costs in 2017 set a new annual record of $306.2 billion 2 . There is evidence that the frequency and magnitude of natural disasters are changing, with some extreme weather events linked to anthropogenic climate change 3 , such as the intensity of tropical storms 4 . Further, within just the past few decades, the area burned by wildfires in the western U.S. has increased at least threefold [5][6][7][8] , with a strong climate change influence for forest systems 9 . This is a critical moment to develop new methods and data sources to help understand the interrelationships between the physical and environmental characteristics of a hazard, the effectiveness of incident management strategies, and the societal impacts of these large-scale events.
Wildfires and hurricanes are two natural hazards that cause significant societal impacts, and require costly and complex incident response. In the last two years, damages from wildfires alone have exceeded $40 billion 1 and current suppression costs average $2-3 billion each year 10 . For example, in the last two years, California has seen the largest, costliest, and deadliest fires in state history. More than 33,000 homes burned and 130 people died in wildfires in the past two years (www.iii.org/fact-statistic/facts-statistics-wildfires#top). Another striking impact is that of hurricanes between 2016 and 2018, when six separate billion-dollar hurricanes made landfall in the U.S. with an inflation-adjusted total loss of $329.9 billion and 3,318 fatalities 1 . When these events turn into disasters with large societal impacts, they necessitate a government-coordinated response with critical documentation about how the event is unfolding and threats to life and property.
The Incident Command System Incident Status Summary Form 209 (ICS-209) captures a unique perspective across an important population of hazard events. It is intended specifically for significant incidents that operate for an extended duration, compete with other incidents for scarce resources, or require significant mutual aid/additional support and attention 11 . Further, the ICS-209 is intended for use when an incident becomes significant enough to merit special attention, such as those that attract media attention, or when there is increased threat to public safety. Over 98% of the incidents in the dataset are wildfires. Although only 1-2% of wildfires become large incidents, they account for approximately 85% of total suppression costs and upwards of 95% of total acres burned each year 12 . The ICS-209 captures the best in-the-moment observations about the current and forecasted status of the hazard, current resources assigned, estimated costs, current and forecasted critical needs, and the societal and natural values currently under threat. These status summaries are required for each operational period of an incident response or when significant events warrant a status update. As a result, these reports offer a unique opportunity to study the relationship between hazard characteristics, incident response, and societal impacts/threats incrementally across all phases of active response. This paper describes the methods used to create and refine this dataset into a science-grade database. We provide a brief overview of the all-hazards dataset and then examine high-level spatial and temporal trends for wildfires across several key variables captured by these reports. We compare these values with a larger population of wildfires in the U.S.
We then look in detail at relationships between these variables for the 2013 Rim Fire, a large catastrophic event, for insight into how these values might be combined with other sources of data such as satellite-derived datasets. We conclude by identifying the key opportunities for use of this dataset and how this open-source solution could be extended in the future.

Methods
Data source. Historical data from the ICS-209s are archived by the National Wildfire Coordinating Group (NWCG) on the Fire and Aviation Management Website (FAMWEB) in two formats: an HTML version of the forms (2001 to 2013), published at (https://fam.nwcg.gov/fam-web/hist_209/report_list_209), and the raw data, published annually on the FAMWEB homepage (https://fam.nwcg.gov/fam-web). We downloaded and compared the data from both sources and determined that the overlapping records were identical, but the raw data covered a wider timespan, was more complete, and included all records up to the most recent completed year. All data in the ICS-209-PLUS dataset are sourced from the raw data files, and the dataset is designed so that it can be extended as new data is released. We manually downloaded the raw data from FAMWEB, saving each table in Excel format. The data span three separate versions that we will refer to as Historical System 1 (HIST1) from 1999 to 2002, Historical System 2 (HIST2) from 2001 to 2013, and the current version (CURRENT) from 2014 to the present. The input tables for each of these versions are summarized in Table 1 below. The ICS-209-PLUS dataset is currently truncated at 2014 due to missing tables and data duplication issues in 2015 through 2017. If these issues can be resolved, we will extend the dataset to include records from the most recent years. We describe the handling of data from overlapping years in the Purging Duplicate and Erroneous Records section below.
History of ICS. The ICS-209 form, commonly referred to as a sitrep, is part of the U.S. National Incident Management System/Incident Command System (NIMS/ICS). The earliest implementation of ICS was developed by the U.S. Forest Service (USFS) following a devastating fire in California in 1970 that claimed 16 lives, destroyed over 700 structures, and burned over 500,000 acres. Numerous communication and coordination issues hampered the effectiveness of the agencies involved, resulting in a congressional mandate requiring the USFS to design a new system to facilitate interagency coordination and to support the allocation of suppression resources in dynamic, multi-fire situations [13][14][15] . The USFS worked in collaboration with California state agencies to produce FIRESCOPE (FIrefighting RESources of California Organized for Potential Emergencies), an early implementation of the ICS 11 . The ICS-209 is described as providing a "snapshot in time", capturing the most accurate and up-to-date information available at the time of preparation. The form is typically completed by the Situation Unit Leader or Planning Section Chief within the Incident Management Team, but may also be completed by a local dispatcher or another staff member when necessary. Reports are logged for each operational period or when information becomes outdated in a quickly evolving incident. Each report describes current characteristics of the hazard, current environmental conditions, current and projected incident management costs, details about specific resources assigned to the incident, critical resource needs, a description of structural and life safety threats, an ongoing accounting of injuries, fatalities, and damages, and the projected incident management outlook. The format and content of the ICS-209 has evolved over time in parallel with efforts to adapt the form for all-hazards use. Our inspection of records identified new fields added in 2004, when the system was incorporated into NIMS, and again in 2007 to support all-hazards reporting.
It is important to note that the use of NIMS/ICS was not mandatory on large incidents until fiscal year 2006 15 , and so there may be significant gaps in reporting prior to this date. Any time-series analysis exploring trends in the data must acknowledge this limitation. Additionally, the ICS-209-PLUS dataset is based on the published data and may exclude records containing sensitive information.
The raw data are published in three separate formats. The original format, Historical System 1 (HIST1), spans 1999 to 2002 and includes basic incident information, start location, personnel usage, and total structures damaged/destroyed. The second format, Historical System 2 (HIST2), was introduced in 2001 and captures a broader set of information related to the hazard, including incident complexity, fire behavior, fuels, and local weather. It contains free-form narrative text fields to capture projected risk to communities, resources at risk, critical resource needs, planned actions, and projected hazard movement/spread. Additional societal impact values include injuries, fatalities, evacuations in progress, and estimates of structures threatened. The current system format was released in 2014. Tighter standardization of values on the form resulted in cleaner categorical data. Major changes include an expansion of formats for capturing point-of-origin data, expanded functionality for tracking casualties and illnesses, and expanded functionality for tracking life safety management. The current version also introduces a dedicated record for fire complexes 15 . The complex record clusters all fires and sitreps associated with a fire complex under the same incident number, capturing individual fire names, suppression strategy, current containment percentage, and estimated costs to date. The current version also includes the current area for individual fires within the complex. This information is missing in both historical versions of the system. We used data that was manually compiled and verified to derive a complex associations table for wildfires between 1999 and 2013, and we produced a similar table for 2014 to describe the relationship between fire complexes and individual fires. These supplementary tables are described in section 7 below. Finally, the current version has an incidents table that contains basic incident-level information including discovery date, cause, area, location, and estimated cost to date.
Concatenated versions of the original complex and incident tables are included in the dataset (Table 1), but we did not clean or modify these tables. We created a new Wildfire Incident Summary table, described below.

Open/reproducible framework. We produced the ICS-209-PLUS dataset using principles of open and reproducible science [18][19][20] . All data source files and the final ICS-209-PLUS dataset are archived online 17 . The Python source code for the ICS-209-PLUS creation 21 and the R code used for spatial database creation and the creation of all figures and tables 22 are publicly available. Our aims are twofold: to provide transparency into the methods and assumptions used to produce the final dataset and to provide a framework for others to adapt or expand upon the dataset. The code is written in Python using the NumPy and pandas data science libraries. We were unable to automate the downloading of the raw data from FAMWEB, so our code assumes all relevant tables (Table 1) have been downloaded to the corresponding annual directories beforehand.
We accomplished several key objectives in this first release. First, we aligned data elements and standardized values across both historical versions with the current data model, allowing seamless comparison of records across the entire time period. Second, because the free-form fields had limited mechanisms enforcing data entry standards, the original data is notoriously messy and difficult to use; the scripts are designed to automate as much of the cleaning and formatting as possible, improving the overall consistency of the dataset. The scripts are also designed to support the manual updates identified in the process of producing the dataset. This is important for several reasons: it allowed us to easily incorporate updates for fields such as Latitude and Longitude that were deemed critical for dataset use, and it provided a framework for incorporating the cleaning efforts and refinements from the 209 dataset curated by Karen Short. Finally, we connect the ICS-209-PLUS dataset with the Fire Program Analysis fire-occurrence database (FPA FOD 23 ), enabling linkage with fire perimeter data and final fire statistics. The following sections detail how the dataset is produced, from the merging of the original source files to the creation of the Wildfire Incident Summary table and external linkages.
Producing the ICS-209-PLUS dataset. The ICS-209-PLUS dataset is produced by a series of Python scripts that first consolidate the annual files for each of the tables across the three versions of the system (Table 1). Each version of the sitrep table is then cleaned and prepared for the merge. This includes general cleaning and formatting for each field, field-level updates to correct known errors, and deletion of duplicate/erroneous records. Each of the related tables is pivoted, and totals are calculated for personnel, aerial equipment, structures threatened, structures damaged, structures destroyed, injuries, fatalities, and evacuation status (2014+) for each situation report. These totals are joined into the situation report, and then columns across the three versions are aligned and concatenated. Once the data has been consolidated into a single dataset, individual fields are cleaned and smoothed, filling missing values and adjusting values where appropriate. This finalized version is then used to produce an all-hazards dataset (ICS-209-PLUS All-Hazards) and a wildfire dataset (ICS-209-PLUS WF). The wildfire dataset is composed of two tables: all the wildfire daily status summaries and an incident-level summary record. The Wildfire Incident Summary contains high-level statistics that are useful from a research standpoint.
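The consolidate-pivot-join pattern described above can be sketched in pandas as follows. This is a minimal illustration with hypothetical column names (INCIDENT_ID, RESOURCE, COUNT), not the actual production schema:

```python
import pandas as pd

# Hypothetical annual sitrep files, already loaded as one DataFrame per year.
annual = [
    pd.DataFrame({"INCIDENT_ID": ["A", "B"], "REPORT_DOY": [100, 101]}),
    pd.DataFrame({"INCIDENT_ID": ["C"], "REPORT_DOY": [50]}),
]
sitreps = pd.concat(annual, ignore_index=True)  # consolidate annual files

# A related resources table in long format: one row per resource type per report.
resources = pd.DataFrame({
    "INCIDENT_ID": ["A", "A", "B"],
    "RESOURCE": ["CRW", "HEL", "CRW"],
    "COUNT": [120, 3, 45],
})

# Pivot so each resource type becomes a column, then compute a total per report.
wide = (resources
        .pivot_table(index="INCIDENT_ID", columns="RESOURCE",
                     values="COUNT", aggfunc="sum")
        .fillna(0))
wide["TOTAL_RESOURCES"] = wide.sum(axis=1)

# Join the totals back into the consolidated sitrep table.
merged = sitreps.merge(wide.reset_index(), on="INCIDENT_ID", how="left")
```

The same pattern (pivot a long-format related table, total, left-join on the incident key) applies to the structures, casualties, and evacuation tables.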
Cleaning and formatting the individual datasets. The script cleans each version of the sitrep table prior to the merge. This is necessary to deal with subtle differences between the versions. Unique identifiers are constructed within the historical datasets to separate out individual fire events and to group related incidents together. We clean and standardize values for each historical version so that they merge smoothly into the final dataset. Once this preliminary cleaning is complete, members of the historical dataset are compared with a refined version of the record, and sitreps that are not members of this refined set are archived to a deleted-sitreps table (described later).
Creating unique incident and fire identifiers. The Incident Number field is meant to uniquely identify an incident, but there are multiple issues with this field, particularly in the historical datasets. In some instances, incident numbers are incomplete or re-used from year to year, resulting in sitreps for multiple incidents being grouped together as a single incident. Splitting them based on year is problematic because some fires, particularly in the southeastern United States, span the annual boundary or have a final report filed in the next year. There are also instances where the Incident Name and point of origin are distinctly different but share the same incident number. Conversely, there are incidents in the current version that have the same incident number but are split across multiple unique system identifiers. Finally, there are instances where fires are incorporated into a fire complex and the Incident Number changes to that of the fire complex. We solved these issues by creating two concatenated ID fields: the Fire Event ID and the Incident ID. The Fire Event ID identifies individual wildfires regardless of whether they are managed as part of a larger fire complex. The Incident ID groups all sitreps related to an incident response, clustering situation reports that are related but may differ in the Incident Number and/or the Incident Name.
The fire event ID. The Fire Event ID is a concatenation of the Start Year and the Incident Number fields followed by a sequence number (default = 1). The Start Year separates instances where the Incident Number is re-used from year to year. We manually scanned sitreps in both historical versions, sorted by Incident Number, Incident Name, Discovery Date, and report date, to identify records that needed to be split. For example, Incident Number "AR-ARS-D2" was assigned to three separate incidents starting in different locations at different times (Table 3). We split them by adjusting the sequence number for Dierks to 2 and Red Barn to 3.

The incident ID. The Incident ID is a concatenation of the Start Year, the final Incident Number, and the final Incident Name, such that multiple fires can be grouped together if they are later incorporated into a larger response. This information is missing in the historical datasets, but was manually compiled and verified by co-author Karen Short during the compilation of the Fire Program Analysis fire-occurrence database (FPA FOD 23 ). As a United States Forest Service employee, she was able to compare incident information with other internal sources to piece together relationships between individual fires and fire complexes and to purge duplicate and erroneous situation reports across the two historical records. We use this cleaned version of the sit-209 records, referred to as the Short master list 17 , as a definitive reference: this table is used to create the Incident ID for all historical sitreps and determines which records should be deleted.
The example below, taken from the 2006 Boundary Complex wildfire, illustrates how the Incident ID is used to group related fires together (Table 4) while preserving the original values. The complex includes the following individual fires: Elkhorn 2, Lost Lake, Deer, Thicket, Chuck, East Elk, North Elk, and Knapp 2, all under Incident ID 2006_ID-SCF-006336_BOUNDARY COMPLEX. The Fire Event IDs are included at the far right to illustrate how the Incident ID allows multiple physical fires to be grouped together as a single response, whereas the Fire Event ID provides a unique identifier for each physical fire event.
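The identifier construction described above can be sketched as follows. The underscore-delimited format matches the Incident ID example given in the text (2006_ID-SCF-006336_BOUNDARY COMPLEX); the Fire Event ID separator and the year in the AR-ARS-D2 example are assumptions for illustration:

```python
def fire_event_id(start_year, incident_number, seq=1):
    """Unique ID for one physical fire: Start Year + Incident Number + sequence."""
    return f"{start_year}_{incident_number}_{seq}"

def incident_id(start_year, final_number, final_name):
    """Groups all sitreps for one incident response, including fire complexes."""
    return f"{start_year}_{final_number}_{final_name.upper()}"

# The three distinct fires sharing Incident Number "AR-ARS-D2" are split
# by bumping the sequence number (year shown here is illustrative):
ids = [fire_event_id(2001, "AR-ARS-D2", s) for s in (1, 2, 3)]
```

Under this scheme, all sitreps for the Boundary Complex fires share one Incident ID while each physical fire keeps its own Fire Event ID.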
General field level cleaning. We used Python data science tools to inspect the values contained in each column across the three versions to determine what actions were needed to clean and prepare for the merge. Many columns had standardized values but contained extraneous characters or inconsistencies. The script uses regular expressions to standardize values for fields such as GACC Priority, Dispatch Priority, Percent Containment, Containment Date, and Incident Management Team Type. Once these values are standardized, they are linked to corresponding values in the lookup code tables. The script also removes all linefeeds and hidden characters from text fields to make viewing and processing the fields easier. Placeholder values such as "N/A", "same", or "none" and redundant values are deleted from the consolidated text fields. The script fixes any obvious date errors (e.g., year values of 1901 instead of 2001) and applies consistent formatting across all date fields. All Latitude and Longitude values have been converted to decimal degrees. We cleaned and formatted most of the fields with the exception of the weather and fuels variables; we determined that both would require extensive effort that fell outside the scope of this initial release.
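The regular-expression standardization step might look like the following minimal sketch. The function names and patterns are illustrative, not the actual production code, which handles many more variants:

```python
import re

def clean_percent_contained(raw):
    """Extract a numeric percent-containment value from messy free text
    (e.g., '85 %', '85%', '85 percent'). Returns None if no digits found."""
    if raw is None:
        return None
    match = re.search(r"(\d{1,3})\s*%?", str(raw))
    return int(match.group(1)) if match else None

def strip_hidden(text):
    """Collapse linefeeds, tabs, and carriage returns in narrative text
    to single spaces, making the fields easier to view and process."""
    return re.sub(r"[\r\n\t]+", " ", text).strip()
```

Each standardized value can then be matched against the lookup code tables; anything that fails to match is flagged for manual review.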
Throughout the process, we identified individual values that were clearly in error and made some individual field-level updates. These updates are limited and are incorporated into the general field-cleaning function for each script. In the future, these updates could be maintained as part of a field-level update table loaded at runtime to automate individual field-level modifications. This would be an ideal solution to support ongoing update and maintenance of the dataset but is beyond the scope of this initial release.
Finally, we use static information from the Incidents Table to patch missing values in the 2014 sitreps. Few values were changed in the 2014 dataset, particularly for columns with already high fill rates (~0-8 rows updated). There were modest improvements for Point of Origin City (152 rows filled) and Legal Description for point of origin (154 rows filled), but the overall fill rates remain the same. We applied the patch to the 2014 dataset so that when we publish later years, the data in the 2014 records will remain consistent over time.

Incident Type values were also standardized across versions (Table 5). A handful of Incident Types were eliminated in the current system. After careful consideration, we decided to keep the historical values for consistency and to prevent information loss. Prescribed burns (RX) and Wildfire for Resource Benefit have been included in the wildfire datasets. We reclassified all binary (yes/no) values to boolean values (true/false) to make them consistent and to put them in a more standard database format.
Cleaning and consolidating narrative text. Each version of the Incident Status Summary provides space for recording important observations from incident command. The earliest version of the report (Historical System 1) has only one Narrative field, whereas later versions have multiple narrative text fields organized around the following topics: critical resource needs, current threats, projected incident movement and spread, weather, fuels, relevant conditions, and general remarks. Critical resource needs, current threats, and projected fire activity capture projected values at 12, 24, 48, 72, and greater than 72 hours from the current report. We consolidated these observations into one narrative field per topic to manage the complexity of the dataset, eliminate redundancy, and organize the observations for potential text mining and topic modeling efforts. Before consolidating, we clean each individual field to strip hidden characters, eliminate placeholder values (e.g., "n/a", "same", "none"), and eliminate duplicate values. A pipe ('|') character is used to separate observations.

Linking to the FPA FOD. The Incident ID is used to join Wildfire Incident Summary records with records in the FPA FOD Extract file. The extract is an Excel spreadsheet published as part of the dataset using the naming convention FOD_JOIN_{mmddyyyy}.xlsx. Matching records between the two datasets was an iterative process. At the time of publication, all wildfire incidents occurring in the United States and U.S. Territories with a clearly defined point of origin and larger than 1,000 acres with corresponding record(s) in the FPA FOD database have been linked, with 86% of incidents linking to at least one FPA FOD record. As we continue to clean and refine the dataset we will publish incremental updates to this file 17 .
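The narrative consolidation described earlier in this section (placeholder removal, de-duplication, pipe-separated concatenation) can be sketched as follows; the placeholder set and the spacing around the pipe are illustrative assumptions:

```python
# Illustrative placeholder set; the production cleaning handles more variants.
PLACEHOLDERS = {"n/a", "na", "same", "none", ""}

def consolidate_narratives(values):
    """Merge multiple narrative entries into one pipe-separated field,
    dropping placeholders and duplicates while preserving original order."""
    seen, kept = set(), []
    for v in values:
        if v is None:
            continue
        text = " ".join(v.split())  # strip linefeeds and extra whitespace
        if text.lower() in PLACEHOLDERS or text.lower() in seen:
            continue
        seen.add(text.lower())
        kept.append(text)
    return " | ".join(kept)
```

Applied per topic, this yields one clean field per report suitable for text mining.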
As matches were identified between the two datasets, the Short master list 17 needed to be updated to maintain the relationships between fires and complexes in the FPA FOD dataset. This process for managing relationships between fires and complexes at the sitrep level was unwieldy but difficult to modify given the significance of the Short master list in the merge and cleaning process for the historical data. For the current dataset, we replaced it with a merge configuration file, FOD_CPLX_REF_2014.xlsx (Table 7). This file is derived from the Complex Associations Table, with additional rows added for complexes that are in the FPA FOD but not in the ICS-209-PLUS table. The table maintains the relationships between fires and complexes and is used to map individual sitreps to the incident summary record for the fire complex and to the fire complex in the FPA FOD dataset.
FPA FOD and MTBS fields. The majority of incidents in the ICS-209-PLUS dataset matched a single record in the FPA FOD database (83%), but a small percentage (3%) were fire complexes associated with multiple FPA FOD records. To balance the potential for multiple FPA FOD identifiers per incident with the more general case, we added the following fields to the Wildfire Incident Summary (Table 8). The values for the largest fire are used if there are multiple fires. The Cause field takes all unique values for cause; it is typically a single value, but will contain a list if there are multiple causes across a multi-fire incident. The discovery date is the earliest value in the related FPA FOD records and the containment day is the latest. FOD_FINAL_ACRES is the sum of all reported acres. FOD_NUM_FIRES is the number of fires linked to the incident, and FOD_LIST stores information about all fires related to the incident.
All FPA FOD records associated with an incident are stored in the FOD Fire List as a JSON object. This provides both a human-readable and machine parsable summary at the incident level. For example, the 1999 Arizona Jump Complex has three records in the fire-occurrence database and so the FOD Fire List contains three entries:
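A sketch of how such a per-incident JSON summary could be assembled; the field names inside the records are illustrative, not the actual FPA FOD column names:

```python
import json

# Hypothetical FPA FOD records linked to a single multi-fire incident.
fod_records = [
    {"FOD_ID": 1, "FIRE_NAME": "FIRE A", "ACRES": 1200.0},
    {"FOD_ID": 2, "FIRE_NAME": "FIRE B", "ACRES": 300.0},
    {"FOD_ID": 3, "FIRE_NAME": "FIRE C", "ACRES": 50.0},
]

# Serialize the linked records into one machine-parsable string column,
# and derive the incident-level aggregates from the same list.
fod_list = json.dumps(fod_records)                    # FOD_LIST
total_acres = sum(r["ACRES"] for r in fod_records)    # FOD_FINAL_ACRES
num_fires = len(fod_records)                          # FOD_NUM_FIRES
```

Storing the list as JSON keeps the table flat (one row per incident) while remaining both human-readable and trivially parsable downstream.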

SR/R (for Search & Rescue types); all other values are preserved to prevent data loss.

Fixing latitude and longitude. We converted all latitude and longitude values from degrees and minutes to decimal format. We then mapped all the points, identifying those that fell outside of the United States and its territories. The most common issue was an incorrect numeric sign for latitude or longitude: the longitude sign was incorrect for 98.5% of records in the second historical system (Historical System 2) and 4% of the coordinates for 2014. Wherever possible, we used latitude/longitude values from the FPA FOD fire-occurrence database to fill missing and erroneous values. We then manually examined each remaining value that fell outside of the clipped boundaries. We used the information contained in other point-of-origin fields (e.g., the location description) to estimate latitude and longitude. For each of these estimated values, we set the Lat/Long Update flag to true and set the Lat/Long Confidence field to capture our level of confidence in the estimate (low, medium, high). We rated our estimate as high to medium if we were able to get close to the actual point of origin (e.g., an intersection of roads) and low if the location description was vague (e.g., "6 miles southwest of Sisters, Oregon"). Our goal was to maximize available geospatial information while allowing users of the data to filter out low-confidence or updated values when a high level of accuracy is needed. The accuracy and completeness of the data improve over time across the three versions, as do the location description fields available for estimation. In the earliest version (Historical System 1), 29% of coordinates were missing or erroneous, but we were able to populate nearly half (49%) of the missing values with estimates taken from the corresponding record in the FPA FOD database.
With limited information, we were only able to manually estimate the point of origin for 103 additional values (12%), with the majority (89 incidents) estimated as low confidence due to limited location information.
In contrast, only 2% of the coordinates were missing or erroneous in the second historical version (Historical System 2), and we were able to populate 45% of the missing values with estimates taken from the corresponding record in the FPA FOD database. We were able to estimate an additional 26% of missing values with a mix of confidence levels (42 incidents with high confidence, 26 with medium confidence, and 50 with low confidence). Finally, only 1.6% of the coordinates were missing or erroneous in the 2014 data. Over half of the missing values were populated with values taken from the corresponding record in the FPA FOD database, and we were able to estimate all but 2 of the remaining values with a high level of confidence; roughly half required a simple swap of latitude and longitude values to correct. Table 9 below summarizes latitude and longitude updates by system and corresponding levels of confidence.
Some of these records are deleted as part of the merging and cleaning process described in section 5, resulting in 98% of incidents having a valid latitude/longitude in the final dataset.
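The basic coordinate repairs described above (degree/minute conversion and sign correction) might be sketched as follows. The sign fix shown assumes a point expected in the conterminous U.S. (latitude positive, longitude negative) and would not apply to territories such as Guam:

```python
def dm_to_decimal(degrees, minutes, positive=True):
    """Convert degrees and decimal minutes to decimal degrees."""
    value = abs(degrees) + minutes / 60.0
    return value if positive else -value

def fix_conus_signs(lat, lon):
    """For points expected in the conterminous U.S., latitude should be
    positive and longitude negative; flip any incorrect signs."""
    return abs(lat), -abs(lon)
```

Points still falling outside the clipped boundaries after these mechanical fixes are the ones that were examined manually.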
Preparing to merge. The individual fields and values in the incident status reports remained relatively consistent across the three versions, but the underlying data model continued to evolve to adapt to all-hazards management and to capture more detailed information about resources, life safety threat, and management. Our goal when mapping values across the three versions was to maximize continuity while making the historical data forward-compatible with the current system. Most of the columns aligned with minimal or no modification. Several columns had no equivalent column in the current system; we preserved the ones that had a high fill rate: Major Problems, Observed Fire Behavior, and Terrain. Refer to the column definitions in the ICS-209-PLUS sitrep table 17 .
In addition to the consolidated text fields described in the section above, we added fields to the situation report to align the historical data with the current version or where they add value. For example, the Acres field provides a convenient way to compare incident area without having to convert units of measurement. The day-of-year fields (Discovery DOY and Report DOY), Start Year, and Current Year (CY) support simple querying and analysis without having to manipulate the related timestamps. The script also integrates totals calculated from the related tables into the incident status record.

Purging duplicate and erroneous records. As mentioned above, the historical datasets overlap between 2001 and 2003, and incident status reports were sometimes logged in both systems, resulting in duplicate records across the two systems, along with other erroneous records. Many of these records were deleted from the dataset maintained by Karen Short. Rather than deleting each of these records explicitly, we use the records in the Short dataset as a master list (Short1999to2013v2.xlsx).
Any wildfire that does not exist in the master list is removed from the production dataset. Once the cleaning and formatting of the sitrep table is complete, the wildfires in the master list are moved to the production dataset and the deleted records are archived to a separate deletions file for reference (Table 9). The comparison resulted in the deletion of 527 sitreps from the first historical dataset (3.4%, 57% of these overlapping with Historical System 2) and 3,597 sitreps from the second historical dataset (3.4%).
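The master-list filter described above can be sketched as a simple membership split; the identifiers and column names here are illustrative:

```python
import pandas as pd

# Hypothetical cleaned sitrep table and master-list IDs (from the Short list).
sitreps = pd.DataFrame({
    "INCIDENT_ID": ["1999_X_FIRE1", "2001_Y_FIRE2", "2001_Z_BOGUS"],
    "ACRES": [100, 250, 10],
})
master_ids = {"1999_X_FIRE1", "2001_Y_FIRE2"}

# Keep only sitreps whose incident appears in the master list;
# archive the rest to a deletions table rather than discarding them.
keep = sitreps["INCIDENT_ID"].isin(master_ids)
production = sitreps[keep].reset_index(drop=True)
deleted = sitreps[~keep].reset_index(drop=True)
```

Archiving the rejected rows, instead of dropping them silently, preserves an audit trail for the deletions file.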
Merging and final cleaning. Once each of the individual datasets is refined, historical columns are renamed to align with the corresponding columns in the current version (Table 11) and the individual Incident Status Summary tables are concatenated into a single dataset. Unused columns are dropped.
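The rename-and-concatenate step can be sketched with pandas as below. The column names and mappings here are illustrative placeholders; the actual historical-to-current mappings follow Table 11:

```python
import pandas as pd

# Hypothetical column mappings standing in for Table 11.
HIST1_MAP = {"COSTS_TO_DATE": "EST_IM_COST_TO_DATE"}
HIST2_MAP = {"C_COSTS": "EST_IM_COST_TO_DATE"}

hist1 = pd.DataFrame({"COSTS_TO_DATE": [1_000.0], "UNUSED": ["x"]})
hist2 = pd.DataFrame({"C_COSTS": [2_000.0]})
current = pd.DataFrame({"EST_IM_COST_TO_DATE": [3_000.0]})

# Align historical columns with the current version, then concatenate
# the three Incident Status Summary tables into a single dataset.
merged = pd.concat(
    [hist1.rename(columns=HIST1_MAP), hist2.rename(columns=HIST2_MAP), current],
    ignore_index=True,
)

# Drop columns with no equivalent in the current system that were not preserved.
merged = merged.drop(columns=["UNUSED"])
```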
The script then makes a final cleaning and smoothing pass across the records, filling missing values where appropriate and smoothing columns to make them more consistent. The specifics are described below, and the columns in the final dataset are described in 17 .
Filling missing values. Several fields in the dataset are either cumulative or hold values that, once known, are unlikely to change. We forward filled these fields with the previous known value to minimize gaps and to ensure that these values were propagated to the final report. This was important not just for consistency, but also because these records are used to produce the Wildfire Incident Summary table.
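The forward fill must operate within each incident so that values never leak across incident boundaries. A minimal pandas sketch, with hypothetical column names standing in for the production schema:

```python
import pandas as pd

# Toy sitreps for two incidents; CAUSE is a once-known field.
sitreps = pd.DataFrame({
    "INCIDENT_ID": ["A", "A", "A", "B", "B"],
    "REPORT_DOY":  [229, 230, 231, 100, 101],
    "CAUSE":       ["L", None, None, None, "H"],
})

# Forward fill within each incident so the final report carries the most
# recent known value; incident B's first report stays empty because no
# prior value exists for that incident.
sitreps["CAUSE"] = sitreps.groupby("INCIDENT_ID")["CAUSE"].ffill()
```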

Smoothing acres and calculating new acres.
Once the forward-filling of acres is complete, we perform a backwards smoothing pass. If the number of Acres is downgraded on a subsequent report, we reduce the number of acres on previous reports: given that a fire never truly gets smaller, the higher figure was likely an over-estimate at the time the report was filed. Reducing these values was important because we use them to calculate the New Acres field, which in turn is used to calculate the daily fire spread rate (Wildfire FSR) (see the Wildfire Incident Summary section below).
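The backward acres pass and the New Acres calculation can be sketched as follows. This is a minimal plain-Python illustration of the logic described above, not the production merge script:

```python
def smooth_acres(acres):
    """Backward smoothing pass: a fire never truly shrinks, so an acreage
    drop on a later report implies earlier reports were over-estimates.
    Cap each report at the minimum of all subsequent reports."""
    smoothed = list(acres)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = min(smoothed[i], smoothed[i + 1])
    return smoothed

def new_acres(acres):
    """Growth between consecutive reports; the first report's acreage
    counts entirely as new. Feeds the daily fire spread rate."""
    return [acres[0]] + [b - a for a, b in zip(acres, acres[1:])]

reports = [100, 5000, 4500, 6000]  # third report downgraded the estimate
smoothed = smooth_acres(reports)   # -> [100, 4500, 4500, 6000]
growth = new_acres(smoothed)       # -> [100, 4400, 0, 1500]
```

Without the backward pass, the downgrade from 5,000 to 4,500 acres would produce a negative New Acres value and a nonsensical fire spread rate.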
Smoothing cost estimates. The consistency of the cost fields is critical for analysis. Even though these fields are subject to bias and real-time information is limited, all records are subject to the same bias, so the fields provide a metric for comparing cost across incidents in the research dataset. Both cost fields were sparsely populated: the Estimated Incident Management Costs to Date field was populated only 35% of the time and the Projected Final Incident Management Cost field only 2% of the time. These fields were also particularly prone to data entry error and variations in notation, particularly for estimates in the millions or billions of dollars. When the records were sorted by incident and report date, it was easy to identify instances where someone either left off a digit or added too many for that particular day. Also, as cost increased, the notation sometimes changed to simplify data entry (e.g., 1,200,000 becomes 1.2 for $1.2 million). After forward filling the values, we started the cleaning process by manually inspecting all instances where the final reported value was an order of magnitude smaller than the maximum value entered across the reports. We designated the final value based on a comparison of trends across the existing reports. We were conservative, correcting only obvious errors. These corrections are applied individually in the cost adjustments function of the merge script. Once corrections were made to the final reported values, we performed two smoothing passes. We first worked backward from the final cost, reducing any estimate that was more than 10x larger than the current value until it fell within the 10x limit. We then worked forward, increasing any value that was at least 9x smaller than the previous estimate until it fell within the 9x limit.
When both these passes were complete, if there was no value for the Projected Final Incident Management Cost, we defaulted it to the Estimated Incident Management Costs to Date on the final report.
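The two cost-smoothing passes can be sketched as below. This is one plausible reading of the description above, not the exact production logic: the backward pass caps each estimate at 10x the next (later) value, and the forward pass raises any estimate at least 9x smaller than the previous one up to that limit. The function name and example values are illustrative:

```python
def smooth_costs(costs, back_limit=10, fwd_limit=9):
    """Two smoothing passes over per-report cost estimates (sketch)."""
    c = list(costs)
    # Backward pass, working back from the final (manually corrected)
    # cost: cap estimates that exceed back_limit x the next value.
    for i in range(len(c) - 2, -1, -1):
        c[i] = min(c[i], back_limit * c[i + 1])
    # Forward pass: raise estimates at least fwd_limit x smaller than
    # the previous one until they fall within the limit.
    for i in range(1, len(c)):
        if c[i] * fwd_limit < c[i - 1]:
            c[i] = c[i - 1] / fwd_limit
    return c

# An extra digit entered on the first report (200M rather than ~20M):
smoothed = smooth_costs([200_000_000, 5_000_000, 12_000_000])
```

Here the backward pass pulls the first estimate down to 50M, and the forward pass then raises the now-implausibly-small second estimate to the 9x boundary.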

Creating the wildfire incident summary record. The cleaned and merged Incident Status Summary table is used to create the Wildfire Incident Summary table. This table extracts key elements from the individual sitreps to describe the fire and support high-level analysis across wildfire events. This information includes the cause, discovery date information, final acres, final estimated costs, injuries, fatalities, whether evacuations were recorded at any point during the fire, and the point of origin (latitude/longitude) for the fire. The summary also includes relevant statistics for structural threat, structures damaged/destroyed, and estimates of total personnel and aerial support summed across the fire. We identify peak volumes and corresponding days across the fire, including peak personnel, peak aerial support, and peak fire spread. Finally, we calculate what we call the Cessation Date: the date when the fire grew to within 95% of its final size. This metric is valuable because the containment date may be quite conservative, with incident management teams hesitant to declare a fire contained until there is very limited risk of growth.
The historical complex associations tables. We use the Short reference data described above to construct a Complex Associations Table for the historical datasets. This information is limited. It clusters incident names and numbers under the associated fire complex and tallies the number of sitreps related to each. This table is described in Table 12 below.
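The Cessation Date heuristic described above can be sketched as follows, assuming a per-incident sequence of smoothed acreage values ordered by report date (a minimal illustration, not the production code):

```python
def cessation_index(acres):
    """Return the index of the first report at which the fire reached
    95% of its final reported size (the Cessation Date heuristic)."""
    final = acres[-1]
    for i, a in enumerate(acres):
        if a >= 0.95 * final:
            return i
    return len(acres) - 1  # fallback; unreachable for non-empty input

daily_acres = [100, 4_500, 9_000, 9_400, 9_500, 9_600]
idx = cessation_index(daily_acres)  # 0.95 * 9600 = 9120, first reached at index 3
```

Because growth has effectively ceased by this point, the index-3 report date would serve as the Cessation Date even though formal containment may be declared much later.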

Data records
The initial release of the ICS-209-PLUS dataset spans sixteen years from 1999 to 2014 and contains 124,411 Incident Status Summary reports for 25,083 all-hazard incidents. The dominant hazard in the dataset is wildland fire (98.3%), with the remaining 1.7% spread across other hazards. The number of incidents is lower overall prior to the 2005 mandate, but contains roughly the same distribution of fire and non-fire incidents as in subsequent years. Given the dominance of wildfire in the dataset, we created three tables specifically for wildland fire analysis: a Wildfire-specific Incident Status Summary table with just the wildfire sitreps, a Wildfire Incident Summary table with key values for each fire, and a Complex Associations table (historical only, 1999 to 2013). Table 13 below summarizes each of the tables in the ICS-209-PLUS dataset and the number of records contained in each.
All files used to produce the dataset are packaged with yearly database table files as part of the ics209-plus-source.zip file. These input files and directory locations are summarized in Table 14 below. Table definitions and field fill rates are also available online in the ics209-plus-reference.zip file (Table 15 below). The dataset, input, source, and reference files can be found online at figshare 17 .

Technical Validation
It is important to note that the ICS-209-PLUS represents a small but important subset of wildfires (1-2%). The trends presented in this technical validation are for large incidents only and are not meant to be interpreted as holding for wildfires in general.
Wildfire distribution. Wildfires requiring the use of ICS are spread across the interior of Alaska and the continental U.S., with higher concentrations in parts of California, the Northern Rockies, Northern Forests, and parts of the Southeastern U.S. There were a small number of fires in Hawaii and Puerto Rico (not shown). Fire is notably limited in areas of the Central Midwest and Northeastern U.S. The spatial distribution of incidents for the continental U.S. and Alaska, with the total number of incidents by year, is shown in Fig. 1 below.
Trends across key variables. We examined the national patterns in six variables within the dataset: maximum fire spread rate, burned area, projected final costs, total personnel (total personnel summed across sitreps), maximum number of homes threatened, and the number of homes destroyed (Fig. 2). Not surprisingly, the fastest fires were located in the northern Great Basin area, along the border of Nevada and Idaho, landscapes that are dominated by cheatgrass fuels. Additionally, the West experienced larger, faster fires requiring more resources and resulting in higher suppression costs. Suppression costs are an order of magnitude lower on average in the East than in the West, but societally impactful fires are not limited to the West (Fig. 2e,f). Smaller fires in the East threaten and destroy large numbers of homes. The allocation of firefighting personnel is heaviest in California and the Pacific Northwest, with California committing the heaviest resources regionally, which may help mitigate the number of homes damaged or destroyed given the population density and large number of homes threatened across these fire-prone landscapes, except in the hottest and driest portions of Southern California. Large wildfires have a presence in all but key areas of the Midwest and Northeast.
In the Appalachian mountain range, the influence of human ignitions has led, in recent years, to large areas burned requiring management by an incident command team, a pattern that is apparent in this dataset. More research is needed to understand the factors at play in these fires, which is outside the scope of this work.
Summary statistics were generated across the national Geographic Area Coordination Centers (GACCs) for the six key metrics (Table 16). Most notably, within the conterminous U.S. the Great Basin experienced the fastest fires, with an average maximum fire spread rate of 3,147 acres/day, yet Alaska reported the fastest average maximum fire spread rate across all GACCs (5,620 acres/day) and the largest average wildfire (22,738 acres). The average fire size in the Southern GACC (southeastern U.S.) was 850 acres, roughly 85% smaller than the average fire size across the other GACCs. Though wildfires were substantially smaller in the Southern GACC, it experienced the second highest number of structures threatened (136,592) and destroyed (9,555).
There are 100 situation reports for the Rim Fire spanning fire discovery on August 17th at 3:25 pm through the declared containment on October 24th. Looking across key metrics on the 209 reports, one can see the fire narrative play out across the situation reports (Fig. 3). There is a close relationship across the peaks of these variables. There is a clear lag in the reporting of burned area, which may reflect delays in finalizing estimates, particularly in the most volatile phase of large and fast fires. The fire is not estimated at 10,000 acres until the third day and does not reach 100,000 acres until the morning of the seventh day. The acreage stabilizes as the growth slows on September 6th. The estimated final incident management costs rise steeply during the active growth phase of the fire and level out at $46 million, less than half the final estimated incident management cost. This example illustrates that estimates are likely to be conservative, particularly on large, fast-moving events, with estimates based on what is known in the moment rather than a clear accounting of final cost. It may take months for the final numbers to be tallied on large-scale responses.
The allocation of firefighting personnel ramps quickly to just over 5,000 on September 1st, fourteen days into the fire, and then begins to taper down just as quickly. The number of homes threatened rises quickly, with the first threat reported as 25 structures at 6 pm on August 19th. This value jumps to 2,505 at 7 am the following morning, with a total of seven structures destroyed, including two residences, and again on August 23rd to 4,503 structures at 6 am and 5,503 structures at 7:30 pm, coinciding with the issuing of new evacuation advisories.
The values logged for both total personnel and structures threatened may represent the most knowable information in the moment. The tracking of personnel happens in real time during a shift, and concrete estimates of homes and areas threatened by a wildfire are integral to incident management decisions. It is also likely that there is a lag between resource needs and the deployment of new firefighting personnel. We hypothesize that the first report of structures threatened may serve as an accurate indicator of the onset of social disruption, and that the growth and levelling out of personnel indicates that the fire is in its most acute and socially disruptive phase, with the onset of demobilization of resources indicating the easing of the threat posed by a fire. In this case, the first structural threat is recorded at 6 pm on August 19th and jumps to over 2,500 the following evening. This figure reaches its peak at over 5,000 structures approximately 48 hours later on August 22nd, where it remains for thirteen days. The number of personnel steadily increases until September 1st, when resources begin to drop.
The perimeter map (Fig. 3a) visually depicts the rapid growth of the fire. The black dot at the bottom left is the Point of Origin Latitude/Longitude from the ICS 209 report. The perimeter outline is taken from the MTBS dataset 24 and the fire progression is constructed using the MODIS burned area data 26 . Because the MODIS data records the last burn detection for each pixel, the maximum growth is recorded on day five of the fire, which is earlier than the formal reporting of this same acreage on the sitrep. This highlights two important points. Knowing that there is a likely delay in the reporting of acres burned on the sitreps, we may be able to make adjustments to these values as part of any analysis. Additionally, knowing the strengths and weaknesses of the values on the 209 and their inherent biases, the dataset could potentially be combined with other data sources, such as remote sensing, to correct these biases, creating a record that connects the physical measures of a wildfire with important measures of social disruption and firefighting resources.

Usage Notes
As wildfire activity has increased in the U.S. over the past several decades 5-8 , the ICS-209-PLUS dataset offers a unique opportunity to explore the costs and consequences of the nation's major wildfire events. This dataset captures a unique perspective that complements other important sources of information on wildfires, from government-compiled databases (FPA-FOD 23 ) to satellite-derived detections of active fire or burned area (e.g., Landsat, MODIS, VIIRS 24,26,27 ).
This dataset captures critical details about an important subset of wildfires in the United States: those that require the establishment of an incident command team. Although these large, serious fires account for only 2% of wildfires, they account for approximately 80% of suppression costs. The daily situation reports capture the best in-the-moment information across the changing characteristics of the fire, environmental conditions such as weather and terrain, incident response, and the built and natural values at risk. As the magnitude of wildfires grows and we continue to expand our reach into the wildland urban interface (WUI), the values captured across this important population of fires provide a unique opportunity to understand the relationship between changing fire regimes, incident response, and the social decisions we are making in relation to these risks.
The first revision of this dataset accomplishes several key objectives. It aligns the underlying data model across the three versions of the system from 1999 to 2014, making it possible to compare values across these three versions of the ICS-209 reporting system over a longer timeframe than previously possible, and with minimal effort. Substantial cleaning efforts and the filling of missing values allow for more accurate assessment across the reports and more effective trend analysis. Additionally, the code used to create this dataset is open source and can easily be extended to process and add subsequent years to the existing dataset. The effort here represents something that could be taken on at a larger scale, to make data more publicly available on a natural hazard that increasingly threatens lives and infrastructure. More work is needed to streamline the process so that new data can be processed and downloaded without manual effort, and to streamline the publication process. The data from 2015 forward needs to be updated so that it is a complete and accurate record of daily situation reports with associated records for personnel resources, structural threat, and life safety. Finally, the data is published only after it is finalized at the end of the year. There is potential to build tools that capture these 'in-the-moment' reports in real time, allowing new questions to be asked as events unfold and enabling this data to be integrated with other sources such as satellite-derived data.
One of the longstanding barriers to using the ICS-209 has been the data quality and lack of alignment across the three reporting systems. Like most human-generated, observational datasets, the fields are free-form and difficult to wrangle. Our goal was to strike a balance between manual inspection and cleaning where necessary and programmatic data cleaning. As highlighted above, we focused manual efforts on high-value fields like latitude/longitude and cost, where we felt cleaning would provide the highest initial return. We also automated the standardization of values across the reporting systems and auto-filled empty values wherever it made sense. Moving forward, the data will continue to be messy, and data duplication and field-level errors are impossible to fully anticipate and eradicate. As highlighted above, better mechanisms for crowd-sourcing the reporting of errors and automating updates could greatly improve the integrity of the data, particularly as this dataset is adopted for widespread use.
Our hope is that making this data available will lead to cross-sector and cross-discipline work that leads to greater understanding of our nation's wildfire trends and the consequences of those changing trends. Understanding the causes and consequences of wildfire is a complex task requiring expertise across disciplines and potentially benefiting those at the forefront of fire management, climate science, natural hazards research, policy making, planning, and development. The ICS-209-PLUS data provides an important level of detail that can be used in parallel with other sources of information, filling in gaps and providing a more complete and nuanced picture of the relationship between the characteristics of wildfire, incident response, and the causes and consequences of threatening wildfires in the national landscape. There is potential to address a critical informational need as we work to understand trends and address the impacts and consequences of fire in an evolving physical and social landscape. This dataset has great future research benefit, particularly if current limitations are addressed effectively through community-wide efforts to keep improving on this rich dataset in an open science framework.