Developing a large-scale dataset of flood fatalities for territories in the Euro-Mediterranean region, FFEM-DB

This data paper describes the multinational Database of Flood Fatalities from the Euro-Mediterranean region FFEM-DB that hosts data of 2,875 flood fatalities from 12 territories (nine of which represent entire countries) in Europe and the broader Mediterranean region from 1980 to 2020. The FFEM-DB database provides data on fatalities’ profiles, location, and contributing circumstances, allowing researchers and flood risk managers to explore demographic, behavioral, and situational factors, as well as environmental features of flood-related mortality. The standardized data collection and classification methodology enable comparison between regions beyond administrative boundaries. The FFEM-DB is expandable, regularly updated, publicly available, and with anonymized data. The key advantages of the FFEM-DB compared to existing datasets containing flood fatalities are its high level of detail, data accuracy, record completeness, and the large sample size from an extended area. Measurement(s) flood mortality characteristics • condition (means of transport) • activity • accident place • accident dynamic • death cause • protective behavior • hazardous behavior Technology Type(s) data collection Factor Type(s) demographics • country • administrative units • NUTS 1–3 Sample Characteristic - Location territories in the Euro-Mediterranean region Measurement(s) flood mortality characteristics • condition (means of transport) • activity • accident place • accident dynamic • death cause • protective behavior • hazardous behavior Technology Type(s) data collection Factor Type(s) demographics • country • administrative units • NUTS 1–3 Sample Characteristic - Location territories in the Euro-Mediterranean region

of human life 5 , with the river floods of July 2021 resulting in more than 200 fatalities 6 , which demonstrates this problem remains unsolved.
Insights on how people die from floods usually derive from the study of flood fatality accounts. However, existing studies and databases on flood fatalities (FFs) face important limitations, such as (1) small sample size; (2) narrow geographic extent; (3) low level of detail on FFs; and (4) lack of information concerning the circumstances surrounding fatal incidents (Fig. 1). Regarding the first two limitations, studies tend to focus predominantly on national datasets 7 or event-focused datasets [8][9][10] , usually containing a relatively small number of FFs in a specific area. However, examining small samples within narrow geographic boundaries produces results that are hardly transferable to other regions. Such results may be influenced by traditions and cultural factors [11][12][13][14] , infrastructure typology 15 , types of environments or settings 16 , housing types 17 , and the population's quality of training or education. Methodological differences, such as using different systems to classify flood death conditions, are also a major problem for cross-study comparisons (see, for example, Ashley and Ashley 18 and Fitzgerald et al. 19 ). A significant challenge for researchers is comparing information for different regions or countries based on common criteria and standards to gain a general, transferable understanding of the drivers of flood mortality.
In addition to these limitations, currently available international databases, such as the Emergency Events Database EM-DAT 20 , provide a useful accounting of fatality numbers but lack details on the circumstances of actual incidents. At the same time, many of them are multihazard-oriented, making the attribution of fatalities to specific hazards, such as flooding, complex, as they often occur in conjunction with other hazards, e.g., wind or landslides. In addition, some international databases include events only if they exceed a minimum threshold of fatalities. Such thresholds lead to a potential miss of fatalities, which, especially within Europe, can be an important portion of the total number of fatalities as a result of a large number of low-mortality events 21 .
To address these gaps, we propose the Database of Flood Fatalities from the Euro-Mediterranean region, FFEM-DB, a multinational database comprising 2,875 FFs from territories in Europe and the Mediterranean region, from 1980 to 2020. FFEM-DB presents an extensive geographical area (covering 12 study areas, nine of which are entire countries) addressing the sample size issue repeatedly acknowledged in the literature 9,22,23 . It provides a high level of detail for each fatality, precise demographic and geographic location data, and details of the circumstances leading to a fatality. Therefore, it enables the comprehensive cross-regional study of FF circumstances, identifying commonalities and differences between particularly vulnerable groups and hazardous situations that lead to fatal accidents, and considering regional socio-economic characteristics. Furthermore, FFEM-DB creates the foundation for studying the association of FF mortality with cross-border variables, such as geomorphological and hydrometeorological features and risk mitigation initiatives and policies. Such cross-regional and cross-border learning can support improved risk communication and better preparedness to help avoid accidents. In this light, FFEM-DB is valuable in evaluating the impact of the EU Flood Directive (2007/60/EC) on flood risk management in Europe and relevant to the Sendai Framework for Disaster Risk Reduction target of reducing disaster mortality between 2020-2030.
FFEM-DB brings together the best information available at the regional/national level, ensuring that the data are standardized, verified, and quality controlled. Moreover, FFEM-DB is publicly accessible, and it is scalable as it has been developed with a clearly defined methodology that permits the addition of new regional/national datasets. With these characteristics, the FFEM-DB database is globally unique.

Methods
Data sources. The FFEM-DB database draws data from local, high-resolution databases (or datasets) to ensure high accuracy, data quality, and completeness. These databases have been developed and published individually (online-only Table 1) by local research teams or are included here for the first time (e.g., UK) to support mortality studies at national or regional levels. A common denominator of these local databases is the detailed recording of FFs profiles and circumstances of death through multiple sources, namely (1) national authorities, (2) reports from bodies implicated in risk management such as the police and fire department, and (3) local or national media from which detailed information is drawn. Secondary control sources include historical catalogs of damaging flood events/fatalities and scientific publications.
Of the aforementioned data sources, news media were particularly relevant as they are deemed as a reliable source of societal information. They have been previously used to analyze FFs [24][25][26] , indicate the impact of damaging weather events on a local scale [27][28][29][30][31] , explore the evolution of perception of natural hazards 24,32 , and validate hazard maps [33][34][35] . The period covered by the FFEM-DB, 1980-2020, can be divided into two time periods based on the availability and ease of access to press information. Roughly, the 1980-2000 period is based on printed archived newspaper material. After 2000, digital newspapers and archives became more abundant, and, most importantly, there is greater access to local newspapers that often provide detailed accounts of particular events. Online-only Table 1 presents the main sources of primary data on fatal flood events and the associated FFs for each study area.  www.nature.com/scientificdata www.nature.com/scientificdata/ Data collection and reporting standards. Consistency and accuracy were ensured throughout the data collection process. Securing these criteria concerns two main steps: (1) collecting data from the individual research teams and (2) merging the derived data into FFEM-DB.
Regarding data collection in each study area, various and multiple sources have been used by each independent research team, as shown in the online-only Table 1. All the sources used were specified to ensure high transparency and confidence in the derived information. Despite the variety of sources that may have been used depending on availability, a prerequisite was set that data should be verified by at least two independent sources. We can distinguish different combinations of sources for verifying the data, with the following being the most prevalent among the involved research teams: • Press and media combined with field research.
• Documentary records, including the press, combined with official authorities' reports.
• Various media sources using text-mining tools.
To ensure reporting standards were the same for all 12 regions, each research group used a standardized form that enables homogeneity of information imported into the database. This standardized form was developed through a trial period of use by five research groups involved in capturing, designing, and initiating FF data collection 36 to ensure that it could accommodate their data. The basic form adjustments made during the trial period were as follows: • Adjust field categorization to cover all distinct sub-cases.
• Introduce new fields and their categorization.
• Revise fields considering the availability and accessibility of the requested information.
The derived data were consistently checked before entering the standard fields of the database. The following were primarily checked concerning the suitability of the reported FFs: • Each reported FF corresponds to the type of floods the database deals with (see section Data Records).
• Each reported FF is directly associated with rainfall-induced flooding (see section Data Records).
The data collected in the standardized form also included fields for assessing the accuracy of specific information difficult to collect or confirm, namely the approximate hour when the fatality took place and the geographical coordinates. Data was also quality controlled for errors, duplicate entries, and missing data. The identification of duplicate entries was undertaken by computing the Jaccard similarity coefficient 37 . Geographical coordinates were checked in the GIS environment (QGIS 3.10). When needed, coordinates were adjusted based on auxiliary spatial information provided, with either a reverse geocoding process or manual geolocation through Google Maps or OpenStreetMap services. Fatality data were anonymized.
History and updates of the database. The FFEM-DB is an expandable database that is periodically updated 36,38 . The update occurs on average every two years. It was developed in its original composition in 2017 (MEFF version 36 with data exclusively from Mediterranean countries or regions for the 1980-2015 period. It included five territories from four countries (France, Greece, Italy, and Spain). In 2019 it evolved to include FFs from new territories in Europe and neighboring non-European countries of the Mediterranean region, covering 1980-2018 (EUFF version 38 ). Specifically, the EUFF version also included FFs from the Czech Republic, Israel, Portugal, and Turkey. The FFEM-DB is the latest version, covering a larger area of the Euro-Mediterranean region and a more extended period . Compared to the EUFF version, it also includes FFs from Cyprus, Germany, and the UK. Most importantly, FFEM-DB has been developed into a structured and publicly accessible database, available in 4TU Centre for Research Data 39 . We should note that the standards for collecting, reporting, and controlling data were the same in all database versions.

Data Records
Definitions and key concepts. The basic concepts of the FFEM-DB database are defined in the following: Study period: Currently, FFEM-DB covers 41 years, from 1980 to 2020. Flood fatality (FF): A person killed by the direct impact of a flood. It encompasses people killed from short-term clinical causes, such as drowning, collapse/heart attack, poly-trauma, poly-trauma & suffocation, hypothermia, suffocation, and electrocution. People missing and presumed dead are included only if more than one source refers to eyewitness testimonies that, for example, the missing person was swept away by a torrent. FFs resulting from storm surges, dam breaks, and accompanying landslides are not included in the FFEM-DB. Additionally, indirect losses associated with long-term health effects are not included.
Fatal flood event (FE): A flash flood or river flood that has caused one or more deaths in a specified region. Flash floods are caused by sudden, short-lived, and usually heavy rainfall over relatively small basin/watershed, while overflowing rivers and streams cause river floods, usually resulting from long-lasting rainfall/snowmelt. Floods caused by the accumulation of rainwater due to lack of drainage, such as urban floods, are also included. The FEs are aggregated at the NUTS 3 spatial level 40 Table 2. The fields of the table FATALITIES are filled in by selecting from a predefined menu of options, shown in Table 3.
A. FATALITIES table: It contains the date when the fatality occurred, the fatality profile (gender, age, and residency), and circumstances (victim's condition and activity, the place and dynamics of the accident in terms of the particular circumstances that led to death, clinical cause of death, protective or hazardous behaviors). The ID-Fatality is the primary key connecting this table to the LOCATION table, while NUTS_3_ID works as a  foreign key connecting to the table NUTS 3. B. LOCATION table: It contains administrative and geographical information on where the fatality occurred (country name and acronym, territorial levels from 1 to 3 according to the country administrative subdivisions, latitude and longitude, the accuracy of location). The accuracy of geographical coordinates is considered high if the place of death is precise. Otherwise, it is considered low, and the coordinates correspond to the center of the relevant smaller known administrative unit, e.g., at the territorial_LV3 level.
C. NUTS 3 table: This table allows the downscaling of the location of death from the NUTS 0 (country level) to NUTS 3 level. Geographical and demographic information (area, population, population density, population by gender, and age category) on the NUTS 3 level is also provided 40,41 .
The predefined categories in the fields of the FATALITIES table resulted from extensive research in the existing literature. Previous works have highlighted among FFs the role of demographics 12,42 and victim activity 22 , infrastructure, the use of vehicles 23,43,44 and vehicle types 15 , the victim's residence 42 , and the cause of death 22 . In addition, previous studies have shown the influence of environmental factors 16,45,46 and the victim's hazardous behavior 47,48 . Figure 3 shows the number of FFs at the NUTS 3 level for the examined period.

Spatial data visualization.
The geographical distribution can indicate various environmental, climatic, and societal factors of vulnerability to flooding. Indeed, analyses published on a previous version of the database 36 present handy conclusions on the role of the geographical location and demographic features on flood mortality across the studied areas.

technical Validation
Evaluation of completeness and coverage indicators. Based on several indicators, the database is evaluated regarding data completeness and coverage of FFs. The evaluation is undertaken internally, i.e., by evaluating the completeness of the data of each field and study area, and the evolution of completeness within the examined period, as well as externally, by comparing cumulative data against external sources. www.nature.com/scientificdata www.nature.com/scientificdata/ Internal evaluation is intended to measure data completeness at various levels and dimensions, to indicate interannual changes, differences between the study areas, and parameters associated with low data availability.
Field completeness. Table 4 shows the percentages of missing data for each field of the FATALITIES table per study area. The following fields were excluded from the evaluation: (1) the date, which is 100% complete as it is a mandatory field, and (2) the fields of protective and hazardous behaviors, as this information is only available in cases where someone witnessed the accident. Therefore, it is unknown whether the absence of data in these fields is related to the lack of information or the non-manifested behavior.
Among the evaluated fields, the lowest percentages of missing data correspond to gender (20%), accident dynamic (20%), and cause of death (20%), while the highest is associated with victim's activity (61%). The study area of Turkey exhibits the highest proportion of missing data for the selected fields (52%), while those of Catalonia and Greece have the lowest (10%). Temporal change of field completeness Table 5 shows the annual evolution of missing data (%) in the fields of the FATALITIES table for the total FFEM-DB area. Overall, there was a decreasing trend in missing data, which reduced from 47% in 1980 to 23% in 2020. The improvement of completeness over time reflects increased access to information, suggesting even better recording of such data in the future.
As the level of description of the FFs within FFEM-DB is very high, with 11 parameters required for the overall description of death conditions (FATALITIES table), missing values are expected. The percentage of missing values is quite large for some fields, especially at the beginning of the study period. Nevertheless, we consider all fields essential for analyzing the FF circumstances. Moreover, given the large sample, we do not consider that studies addressing the vulnerability of citizens to floods would be undermined. Regardless the acknowledged completeness trends, FFEM-DB creates an extensive dataset that allows study of flood mortality from multiple aspects (related to the different variables included), addressing the limitations mentioned in the introductory section of this study. Finally, we also expect more opportunities to fill these fields in the future through focused research and better information means.
External evaluation is based on the comparison with international databases and literature regarding the FFs coverage achieved. Finally, an evaluation through specific events that have been well-documented is performed.  www.nature.com/scientificdata www.nature.com/scientificdata/ Overall coverage. Comparative analysis and evaluation of FFEM-DB in relation to other disaster databases require the high spatial density of FFs data to be adapted to the spatial levels used within other study areas to ensure comparability between reported FEs among databases.

FATALITIES
Four independent publicly accessible disaster impact databases were considered for the evaluation of FFEM-DB completeness as to the total number of FFs: the Emergency Events Database (EM-DAT) 20 , the Dartmouth Flood Archive (DFA) 49 , the European Past Floods (EPF) 50 , and the Historical Analysis of Natural Hazards in Europe-HANZE-Events database (HANZE-E) 51 . The respective specifications are presented in online-only Table 2.
When comparing FFEM-DB with these databases, the following issues should, however, be considered:    Age  0  13  29  36  42  15  52  2  38  9  65  12  39   Gender  26  7  12  7  32  2  26  2  6  3  35  3  20   Residency  26  17  100  46  58  12  42  40  29  18  31  20  33   Victim's condition  26  11  24  80  50  20  38  21  26  18  90  23  57   Victim's activity  35  23  24  70  73  25  44  33  55  44  82  38  61   Accident place  26  6  24  32  36  6  30  4  12  7  64  12  36   Accident dynamic  30  3  35  49  45  3  28  4  16  24  20 13 20  www.nature.com/scientificdata www.nature.com/scientificdata/ • The external databases report on different geographical coverage and administrative level resolutions. This characteristic affected the comparison for the FEs in BAL, CAT, and SFR as data are not available at this administrative level in the other four impact databases, although basic information on the regions that the event affected within a country is always reported. An analysis of the FEs and associated FFs for these study areas is possible with EM-DAT and HANZE-E, after a thorough study of the reported events in Spain and France. Figure 4 shows the number of FFs in FFEM-DB and the respective estimates of the four disaster databases. Only FFs corresponding to common periods and territories are considered in each case.
High-impact events coverage. Table 6 presents the results for the number of high-impact FEs and associated FFs per study area, derived by the two databases, FFEM-DB and EM-DAT. The EM-DAT was selected for   www.nature.com/scientificdata www.nature.com/scientificdata/ this comparative analysis, as it covers the whole study period and geographical area of the FFEM-DB. Choosing events with 10 or more FFs eliminates the threshold bias associated with EM-DAT's mandatory event entry criteria. The events that were specified in EM-DAT as landslides, dam breaks, or storm surge events were excluded from the comparison. Out of the 48 flood events recorded within EM-DAT (with 10 or more FFs), which concern the examined areas, 11 (23%) were excluded for the reasons mentioned above.
Overall, FFEM-DB contains 18% more FFs than EM-DAT for the FEs with 10 or more FFs. In particular, at the study area level, the comparison reveals different FF numbers for nine out of the 12 study areas. FFEM-DB includes more FFs for CAT, CZE, GRE, POR, and TUR, and less for GER, ISR, ITA, and SFR than EM-DAT. For CYP and UK, there were no FEs with more than 10 FFs; thus, they are not included in the comparison.
In the EM-DAT database, the affected Spanish regions are mentioned descriptively in each event, so it is possible to export events by region. Therefore, CAT was found to be included among other regions for two events with more than 10 FFs in June 2000 and October 2018. However, according to the literature, the June 2000 event caused five deaths in CAT 55 , thus it was excluded from comparison for CAT. In the October 2018 event, all the FFs took place in Mallorca (BAL) 56 . In addition, CAT was not included in the affected areas of EM-DAT in the November 1982 event, in which 14 FFs were recorded in CAT as reported by FFEM-DB and documented in relevant scientific articles 57,58 .
For CZE, both databases include three FEs that took place during the period under review, with FFEM-DB having 5% more FFs 59,60 .
For GER, only one FE is reported by both databases; however, the number of FF differs, 27 in EM-DAT compared to 22 in FFEM-DB (22). Careful analysis of the literature 61,62 indicates that 20 FFs occurred, so the additional FFs in the EM-DAT are likely to reflect fatalities arising from other hazards.
For GRE, FFEM-DB includes more high-impact events, resulting in a higher number of FFs by a factor of two. The Greek FEs and FFs included in FFEM-DB have been validated through scientific publications focusing on the analysis of FFs in the country 16,63 .
For ISR, EM-DAT includes one more high-impact FE in October 1997, when most of the 13 fatalities occurred in car accidents caused by hazardous driving conditions resulting from heavy rainfall, as reported by local media 64 . According to the sources of FFEM-DB, only four FFs resulted from flooding, and therefore this FE is considered to have less than 10 FFs and is excluded from the comparison.
For ITA, FFEM-DB includes more high-impact FEs (seven) than EM-DAT (five), but a lower by 10% number of FFs. Inconsistencies were, however, found regarding the number of FFs provided by EM-DAT for some FEs. In the September 2000 flood event, EM-DAT reported 16 FFs in Soverato, Calabria, while the exact number of FFs was 13 65 . In addition, the November 1994 event is classified as a river flood, when in fact, landslides occurred and were responsible for several deaths 66 . Finally, the Versilia event in 1996 caused many fatalities, and while almost all of which resulted from debris flows 67 , in EM-DAT they were attributed to floods.
For POR, it should be noted that the FFEM-DB only contains FEs and FFs for Portugal mainland, excluding Madeira and the Azores archipelagos. This is why the February 2010 landslide/flash flood event in Madeira 68 was excluded from this comparison. Another two events in December 1981 and January 1996 were excluded from the comparison, as both caused deaths attributed to landslides 69,70 . Beyond that, FFEM-DB includes two high-impact FEs, with only one of them listed in EM-DAT. The number of FFs for this event is comparable among the two databases.
For TUR, FFEM-DB includes 29% more FFs than EM-DAT. Out of the 28 Turkish high-impact FEs included in FFEM-DB, only 13 (46%) are also reported in EM-DAT, which, however, includes four events of which the number of FFs in FFEM-DB is marginally less than 10. All TUR high-impact FEs have been cross-referenced with local databases 71,72 . Accuracy evaluation through specific events. To evaluate further the accuracy of FFEM-DB, the number of FFs for specific well-documented flood events was compared against those reported in international databases. For the validation, the actual number of fatalities was derived from scientific publications and/or governmental reports describing these well-known events. In online-only Table 3, we present the comparison and relevant documentation of all the notable flood events that occurred in the FFEM-DB area and study period causing more than 15 FFs, showing that FFEM-DB has more accurate values than other existing databases with regard to the number of FFs.
Comparisons with international databases showed that the FFEM-DB achieves high coverage of FFs caused by flash floods, urban floods, and river floods while giving special attention to the quality, immediacy, and reliability of the sources it draws the information from. This is supported by the variety of sources used in each country/region included in FFEM-DB and the local character of information. The closeness of the information source to the actual flood events and the cross-checking between different sources enhance the completeness, accuracy, and reliability of the overall dataset.

Usage Notes
The FFEM-DB database can be easily accessed and downloaded from the 4TU Centre for Research Data 39 , and data can be directly used for analysis (https://doi.org/10.4121/14754999.v2). This database is intended to act as a large pool of publically accessible data for analysis of death circumstances over territories within the Euro-Mediterranean region. The relational structure of the database allows for analyses at the FF, study area, country, territorial, and NUTS levels. The FATALITIES table provides granular data on the FF profiles and circumstances. Each FF is further linked to geographical data (LOCATION table) and demographics at the NUTS levels (NUTS 3 table).
www.nature.com/scientificdata www.nature.com/scientificdata/ The data and their structure allow easy integration into databases intended to assess and analyze the societal impacts of disasters related to weather and climate. To this end, we provide, in the following, specific examples of analyses that can be applied. The extensive geographical area of the FFEM-DB dataset offers the opportunity to: • Compare flood mortality in different geomorphological settings (e.g., flat areas of Germany or the Czech Republic against the high-inclination areas of Greece or Italy), as well as in different landcover and urbanization settings. • Compare flood mortality and death circumstances among areas with different policies and measures aiming to address flood risk mitigation, such as through driving education, road network management, risk signage, and the adaptation of impact-based warnings. • Examine the impact of risk mitigation policies and initiatives on flood mortality. For example, our dataset can be used to compare an area/country that uses a "turn around don't drown" -type of awareness campaign 48 with an area/country that does not, in terms of vehicle-related flood fatalities, as a complementary criterion on the efficiency of such campaigns.

Data availability
FFEM-DB is available in the 4TU Centre for Research Data 39 , https://doi.org/10.4121/14754999.v2). It includes the following files: a) a comma-separated values (csv) file, named "Fatalities.csv" that contains the structure and data of the FATALITIES Table (Table 3); b) a comma-separated values (csv) file, named "Location.csv" that contains the structure and data of the LOCATION Table (Table 2); c) a comma-separated values (csv) file, named "NUTS 3.csv" that contains the structure and data of the NUTS 3 Table (Table 2); d) the readme (txt) file, containing the description of the database structure.