## Introduction

There is clear theoretical, model-based, and empirical evidence that global precipitation extremes, i.e. precipitation exceeding a high threshold, will increase in a warming climate1,2,3,4. However, there is considerably more uncertainty regarding the hydrologic response in terms of flooding, and there is not yet clear evidence for widespread increases in flood occurrence either in observations5,6,7,8,9,10 or in model simulations11,12,13. While there is still a theoretical expectation that flood events will increase in a warming climate14,15,16,17, and while such flood increases have been documented regionally18,19, the absence of broader observational trends supporting this hypothesis is conspicuous.

In the literature on hydrological processes, the lack of such trends is often attributed to changes in non-precipitation-flood drivers, such as temperature-driven decreases in snow accumulation and increases in evaporation that yield decreases in soil moisture9,20,21,22,23. Because of the compounding nature of different flood drivers, establishing a direct link between increases in extreme precipitation and increases in flooding is challenging24,25,26. Indeed, previous studies suggest that the strength of the relationship between precipitation and discharge may depend on a range of factors including catchment size, event magnitude25,27, and season28 though the details of these complex relationships remain largely unknown and are hard to generalize.

Further complicating such investigations is the rarity of extreme events with long return intervals and their sparseness in observed precipitation and streamflow records. Several approaches have been proposed to address this data scarcity problem, including: pooling observations across different catchments29 or seasonal predictive ensemble members30,31; tree-ring and historic reconstructions32,33; stochastic streamflow generation34,35; and ensemble modeling using Single Model Initial-condition Large Ensembles (SMILEs)36. To date, however, few studies have combined atmospheric SMILEs with hydrological models to obtain a SMILE of streamflow time series, i.e. a ‘hydro-SMILE’37,38,39. The availability of such a hydro-SMILE is crucial in assessing the relationship between future changes in extreme precipitation and flooding – particularly for high-end extreme events (i.e., those occurring two or fewer times per century), which are rare to nonexistent in observed time series.

Here, we seek to reconcile the extreme precipitation-flood paradox in a warming climate: is there a precipitation threshold beyond which increasing precipitation extremes directly translate into increasing flood risk? We hypothesize that such a threshold should exist because moderately extreme events may be buffered by decreased soil moisture (due to warming) while very extreme events may quickly lead to soil saturation and subsequently to direct translation of precipitation to runoff. Using a hydro-SMILE approach, we consider precipitation and flood characteristics from historical (1961–2000) and warmer future (2060–2099) climates for 78 catchments in major Bavarian river basins (Main, Danube, and the Inn river with their major tributaries; henceforth Hydrological Bavaria) characterized by a wide variety of hydroclimates, soil types, land uses, and streamflow regimes39,40. We find that there does indeed exist a catchment-specific extremeness threshold (i.e. return interval threshold) above which precipitation increases clearly yield increased flood magnitudes, and below which flood magnitude is strongly modulated by land surface processes such as soil moisture availability. Ultimately, this finding may help reconcile seemingly conflicting climatological and hydrological perspectives on changing flood risk in a warming climate.

Addressing the precipitation-flood paradox is simply not possible using observations alone, as the high-end extreme events of interest are rare to nonexistent in temporally limited observational records. This real-world data limitation effectively precludes statistical analyses of extreme events with return periods exceeding ~50 years. To overcome this problem, we use a hydro-SMILE to obtain a large number of extreme precipitation–streamflow pairs. The hydro-SMILE consists of hydrological simulations obtained by driving a hydrological model with climate simulations from a single model initial-condition large ensemble (SMILE) climate model. The underlying model simulations were originally generated by Willkofer et al.40 as part of the ClimEx project41. The hydro-SMILE simulations consist of daily streamflow (mm d−1), snow-water-equivalents (SWE, mm), and soil moisture (%) – all of which were obtained by driving the hydrological model WaSiM-ETH42 with a 50-member ensemble of high-resolution climate input (spatial: 500 × 500 m2, temporal: 3 h) (for further information on the hydro-SMILE see Section “Hydro-SMILE”).

While such a large ensemble approach resolves the problem of small or zero sample sizes for very extreme events, new sources of uncertainty also arise. We acknowledge that the hydro-SMILE modeling chain is affected by uncertainties introduced through both the underlying climate and hydrological models. Climate model uncertainties include those relating to precipitation process representation, downscaling, and bias-correction procedures; hydrological model uncertainties comprise model and parameter uncertainties. These latter uncertainties may be particularly relevant for the very extreme events under consideration in the present study because model calibration and evaluation rely upon observed events – and (as previously noted) modern observational records simply don’t exist for events of the extreme magnitudes considered here. However, we point out that this particular element of the overall uncertainty is essentially irreducible, and will likely remain so until the length of the observed record increases substantially some decades in the future. As such, the use of a hydro-SMILE is an appropriate method – and arguably the only method available at present – to comprehensively and quantitatively address the extreme precipitation-flood paradox.

## Results

### Threshold behavior in flow response to extreme precipitation

We first seek to assess whether there exists a return interval threshold beyond which precipitation (P) increases consistently translate into streamflow (Q) increases, and thereby to increases in flood magnitude. To do so, we use a hydro-SMILE consisting of a 50-member ensemble of 3-hourly precipitation and streamflow time series for Hydrological Bavaria (see Methods section “Study region” and Supplementary Figure 1), which we aggregated to daily resolution. The hydro-SMILE was derived for the period 1961–2099 by combining the Canadian Regional Climate Model large ensemble CRCM5-LE41 with the hydrological model WaSiM-ETH40,42 (see Methods sections “Hydro-SMILE” and “Hydrological model evaluation”). From this ensemble, we extract precipitation–discharge (P − Q) pairs for a historical (1961–2000) and future time period (2060–2099) by first applying a peak-over-threshold approach on precipitation and then identifying corresponding peak discharges (see Methods section “Event identification”). We then empirically compute P and Q magnitudes for different levels of extremeness, i.e. mean events and progressively more extreme events with 10, 20, 50, 100, and 200 year return intervals, by pooling events extracted from the 50 ensemble members. Finally, we derive future relative changes in extreme event magnitudes by comparing magnitudes for a future period (2060–2099) with magnitudes of a historic period (1961–2000) (see Methods section “Changes in event magnitudes and P − Q relationship”).
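The empirical estimation of event magnitudes from pooled ensemble members can be illustrated with a minimal sketch. It assumes that pooling 50 members over a 40-year period yields 2000 model years, so that the T-year magnitude is read off as the event ranked `n_years_total / T` in the pooled sample. The function names and the synthetic Gumbel data are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def empirical_return_level(pooled_events, n_years_total, return_interval):
    """Magnitude exceeded on average once per `return_interval` years.

    Assumption: events from all ensemble members are pooled into one sample
    that represents `n_years_total` model years.
    """
    events = np.sort(np.asarray(pooled_events, dtype=float))[::-1]  # descending
    # Expected number of events at or above the T-year level in the pooled record
    n_exceed = int(round(n_years_total / return_interval))
    if n_exceed < 1 or n_exceed > events.size:
        raise ValueError("return interval cannot be resolved with this sample")
    return float(events[n_exceed - 1])


def relative_change(hist_level, fut_level):
    """Relative change (%) of a future magnitude with respect to a historical one."""
    return 100.0 * (fut_level - hist_level) / hist_level


if __name__ == "__main__":
    n_years = 50 * 40  # 50 members x 40 years per period
    rng = np.random.default_rng(0)
    # Synthetic event magnitudes (hypothetical, roughly three events per year)
    hist = rng.gumbel(loc=30.0, scale=8.0, size=3 * n_years)
    fut = rng.gumbel(loc=31.0, scale=9.0, size=3 * n_years)
    for ri in (10, 20, 50, 100, 200):
        h = empirical_return_level(hist, n_years, ri)
        f = empirical_return_level(fut, n_years, ri)
        print(f"RI {ri:>3} yr: change = {relative_change(h, f):+.1f}%")
```

With 2000 pooled years, even the 200-year level is supported by ten events per period, which is what makes a purely empirical estimate feasible here.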

We find that median future changes in daily precipitation and corresponding discharge extremes over all catchments depend on their respective level of extremeness (here defined as their return interval, RI; Fig. 1). Precipitation frequency and magnitude are found to increase for all levels of extremeness, with the largest median increases corresponding to the most extreme events, which is consistent with prior findings43,44,45. Precipitation events with a 50-year RI (i.e. events of a magnitude occurring approximately twice per century) occur twice as often (a 100% increase) in the future period as in the historical period, while the frequency of 200-year RI events increases by up to 200%. Median increases in precipitation magnitudes corresponding to these frequency increases range from less than 10% for 50-year RI events to up to 15% for 200-year RI events.
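A frequency change of this kind can be computed by counting how often the historical T-year magnitude is reached in each period. The sketch below is one plausible way to do so, under the assumption that both periods cover the same number of pooled model years; the function name is hypothetical.

```python
import numpy as np


def frequency_change(hist_events, fut_events, n_years_total, return_interval):
    """Relative change (%) in how often the historical T-year level is reached.

    Assumption: both periods span the same number of pooled model years,
    so event counts can be compared directly.
    """
    hist = np.asarray(hist_events, dtype=float)
    fut = np.asarray(fut_events, dtype=float)
    n_expected = int(round(n_years_total / return_interval))
    # Historical T-year level: the n_expected-th largest pooled historical event
    level = np.sort(hist)[::-1][n_expected - 1]
    n_hist = int(np.sum(hist >= level))
    n_fut = int(np.sum(fut >= level))
    return 100.0 * (n_fut - n_hist) / n_hist
```

A 100% increase for a 50-year level means that the same magnitude recurs on average every 25 years in the future period.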

In notable contrast to precipitation changes, changes in flood frequency and magnitude exhibit a more complex response as a function of flood event extremeness. We find that there exists a return interval threshold below which flood frequency and magnitude decrease, and above which they increase. The mean location of this threshold across all catchments lies between event RIs of 20–50 years for both frequency and magnitude (Fig. 1). However, the exact location of this threshold is catchment-dependent (Fig. 2). Some catchments already show increases in magnitude/frequency at very low thresholds (<10 years, lightly colored catchments), while in other catchments a threshold only emerges at very long return intervals (100 or 200 years, darkly colored catchments). A few catchments (20%) don’t show any threshold behavior at all as they either exhibit uniformly increasing or decreasing discharges independent of the return interval. However, even in catchments without a distinct threshold, the discharge response becomes increasingly positive for increasing event magnitudes.
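The notion of a return interval threshold can be made concrete with a small sketch that locates the sign change in a catchment's relative changes. The criterion used here (all changes at or below zero under the threshold, all positive above it) is an assumption made for illustration; the exact rule applied in the study is not specified in this section.

```python
def extremeness_threshold(return_intervals, rel_changes):
    """Return the (lower, upper) return-interval bracket of the sign change.

    Returns None when changes are uniformly positive or uniformly
    non-positive, i.e. when no threshold behavior is present.
    """
    ris = list(return_intervals)
    changes = list(rel_changes)
    for i in range(1, len(ris)):
        below = all(c <= 0 for c in changes[:i])
        above = all(c > 0 for c in changes[i:])
        if below and above:
            return ris[i - 1], ris[i]
    return None


# Hypothetical relative changes (%) in flood magnitude for one catchment
print(extremeness_threshold([10, 20, 50, 100, 200], [-8.0, -3.0, 2.0, 6.0, 12.0]))
```

For the hypothetical values above, the threshold falls between the 20-year and 50-year return intervals.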

This finding of a catchment-specific return interval threshold in a great majority of instances suggests that the extreme streamflow response in a warming climate changes sign, from negative to positive, when comparing more ‘common’ flood events (i.e. those occurring 5 or more times per century) to more ‘rare’ flood events (i.e. those occurring two or fewer times per century). This finding has major implications for the interpretation of time series of observed streamflow, as the historical record is often too short to robustly characterize changes in high-magnitude events occurring only several times per century, and any such threshold behavior might go undetected as a result. Still, the results corroborate findings by earlier studies suggesting that historical changes in flooding do, to some degree, depend on event extremeness25,27.

Next, we assess which meteorological factors and catchment characteristics influence the location of the overall flood response threshold along the extremeness spectrum when considering median changes in extremes over all catchments. For this assessment, we compare historical and future precipitation and discharge extremes for (a) small (<1000 km2) and large (>1000 km2) catchments, (b) low- (<1000 m.a.s.l.) and high-elevation catchments (>1000 m.a.s.l.), (c) winter (Oct–Mar) and summer (Apr–Sep) events, (d) snow-influenced (>10 mm stored SWE) and rainfall-driven events (<10 mm stored SWE), and (e) events extracted using different precipitation temporal aggregation levels (1-day, 3-day, and 5-day accumulated precipitation) (see Methods section “Changes in event magnitudes and P − Q relationship”).

Our results show that the threshold above which precipitation increases translate into increases in flood frequency and magnitude is strongly modulated by elevation, season, and event type (Figs. 3, 4), but does not meaningfully depend upon the precipitation temporal aggregation level (Supplementary Figure 2) or upon catchment size (Supplementary Figure 3). The latter result may change when studying a dataset with a wider range of catchment sizes; however, when studying larger catchments, interactions of flood waves from different tributaries would have to be considered. The return interval threshold either does not exist at all or occurs at a much lower extremeness level in high-elevation catchments (<10 years RI) than in low-elevation catchments (~50 years RI). In other words, precipitation frequency and magnitude increases in high-elevation catchments are more directly translated into flood frequency and magnitude increases than in low-elevation catchments for any given event extremeness level (Figs. 3c, 4c). In addition to elevation, this threshold also depends on the season. In high-elevation catchments, discharge frequency and magnitude increases are stronger in winter than in summer. In contrast, flood frequency and magnitude mostly decrease in low-elevation catchments in winter while they increase in summer for high-magnitude events (Figs. 3b, 4b). A substantial portion of this elevational separation in flood response may be explained by differences in extreme precipitation event type, i.e. whether an event is snow-influenced or rainfall-driven (Figs. 3d–f, 4d–f). In low-elevation catchments, flood frequency and magnitude decrease for snow-influenced events, owing to a decrease in extreme precipitation during such events, while they increase for very extreme rainfall-driven events (return intervals >50 years) (Figs. 3e, 4e).
In contrast, high-elevation catchments show flood frequency and magnitude increases for both snow-influenced and moderately extreme rainfall-driven events (Figs. 3f, 4f). This behavior would be consistent with a simultaneous decrease in mean snowpack accumulation and the number of rain-on-snow events39,46,47,48,49,50, which in some cases have lower peaks than solely rainfall-driven events23.

### Flood-precipitation dependence strengthens

In addition to assessing changes in precipitation and flood magnitude, we consider the (non-)stationarity of the relationship between the two variables over time in a warming climate. We compare different measures of dependence including correlation and extremal (i.e. tail) dependence51 for progressively more extreme events for the historical and future period (see Methods section “Changes in event magnitudes and P − Q relationship”). Similar to changes in flood frequency and magnitude, we find that changes in the strength of the P − Q relationship over all catchments are generally positive above a certain return interval threshold and depend on event magnitude, season, and in particular elevation (Fig. 5). The median P − Q relationship changes over all 78 catchments are generally stronger in high- versus low-elevation catchments, and are also stronger in winter than in summer. In low-elevation catchments, the relationship weakens for moderate extreme events and intensifies only for very extreme events, particularly in summer. In high-elevation catchments, the relationship intensifies for both moderate and severe extremes. In these catchments, however, the strengthening of the relationship in winter decreases as events become more extreme, while it intensifies more strongly for the more extreme events in summer. These findings suggest that influences on the threshold above which the P − Q relationship strengthens are complex, and likely vary widely across hydroclimates as suggested by variations by season and event type. They are also suggestive of a potentially important role for antecedent land surface conditions in modulating the underlying relationship – a topic we explore further in the next section.
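The two kinds of dependence measure mentioned above can be illustrated with a short sketch. It assumes a rank correlation (Spearman) as the correlation measure and the empirical conditional exceedance probability, often denoted chi(u), as the tail dependence measure; the specific estimators used in the study are described in its Methods and may differ. All function names are illustrative.

```python
import numpy as np


def empirical_ranks(x):
    """Rescaled ranks in (0, 1); ties are broken by order of appearance."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1  # 1..n
    return ranks / (x.size + 1.0)


def spearman_correlation(p, q):
    """Rank correlation between event precipitation and peak discharge."""
    return float(np.corrcoef(empirical_ranks(p), empirical_ranks(q))[0, 1])


def tail_dependence(p, q, u=0.9):
    """Empirical chi(u): probability that Q is above its u-quantile
    given that P is above its u-quantile."""
    up, uq = empirical_ranks(p), empirical_ranks(q)
    p_extreme = up > u
    if not p_extreme.any():
        raise ValueError("no events above the chosen quantile level")
    return float(np.mean(uq[p_extreme] > u))
```

Evaluating `tail_dependence` for increasing `u` in both periods would show whether the P − Q coupling strengthens specifically for the most extreme events rather than across the whole distribution.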

### Role of antecedent conditions in flood response

We also assess the extent to which land surface and hydro-meteorological drivers beyond precipitation govern flood magnitudes at different levels of extremeness. For this assessment, we construct a multiple linear regression model that predicts flood magnitude (mean and 100-year RI) using a set of predictors: mean event precipitation, mean event temperature, mean event SWE, and mean event soil moisture anomalies, which are only weakly collinear according to the variance inflation factor (VIF does not exceed 10 for any pair and only exceeds 4 for very few pairs; see Methods section “Importance of hydro-meteorological drivers”). We consider the sign and magnitude of the associated regression coefficients, and their change between the two time periods of interest (historical: 1961–2000, future: 2060–2099).
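The regression setup and the collinearity check can be sketched as follows. This is a minimal illustration using ordinary least squares in NumPy; standardizing the predictors so that coefficients are comparable is an assumption, and the function names are hypothetical rather than taken from the study.

```python
import numpy as np


def standardize(X):
    """Scale each predictor column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def vif(X):
    """Variance inflation factor of each predictor: 1 / (1 - R^2), where R^2
    comes from regressing that predictor on all remaining predictors."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        coef = fit_ols(others, X[:, j])
        pred = np.column_stack([np.ones(len(X)), others]) @ coef
        ss_res = np.sum((X[:, j] - pred) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        factors.append(1.0 / (ss_res / ss_tot))  # equals 1 / (1 - R^2)
    return np.array(factors)
```

In such a setup, the columns of `X` would hold event precipitation, temperature, SWE, and soil moisture anomalies, and the model would be fitted separately for each period, season, and extremeness level so that the coefficients can be compared.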

The regression analysis shows that flood magnitude is driven by different meteorological conditions and land surface processes whose importance varies widely by the level of extremeness, elevation, and season (Fig. 6 upper panel). For moderate and severe extremes at both low and high elevations, precipitation is positively related to discharge magnitude (i.e. for sufficiently extreme events, precipitation increases almost always lead to discharge increases). In contrast, the role of all other drivers, particularly that of temperature, strongly depends on the level of extremeness, elevation, and season, and is not statistically significant in all cases.

In low-elevation catchments, temperature increases are associated with discharge decreases, particularly for moderate extremes (negative regression coefficients) (Fig. 6a). In summer, higher temperatures mean higher evapotranspiration and therefore lower soil moisture, which means higher soil water storage capacity and therefore less direct runoff resulting from a given amount of precipitation. In winter, higher temperatures are associated with less snow accumulation and therefore fewer rain-on-snow events46,47,49,50, which can lead to smaller flood peaks because solely rainfall-driven events may not be as severe as rain-on-snow events23. While these temperature effects are strong for moderate floods, temperature loses importance moving toward more extreme events. This effect is particularly pronounced in summer, where the negative effect of temperature weakens while the positive relation between event magnitude and precipitation intensifies. In winter, temperature effects remain important for more extreme events, albeit to a smaller degree (Fig. 6b).

In low-elevation catchments during winter, soil moisture and snow accumulation are indeed important drivers of flood magnitude. Increases in soil moisture lead to increases in flood magnitudes, as precipitation can more directly be converted into runoff. In contrast, more snow accumulation is related to smaller floods because water is temporarily stored in the snowpack, and does not form runoff until melting at some later point. While increasing temperatures may lead to flood decreases in low-elevation catchments through their soil-drying effects, they can also lead to flood increases in high-elevation catchments (particularly in winter). This latter effect arises largely from the phase change of precipitation, which falls increasingly as rain rather than snow in a warming climate47, and which has been directly linked with an increase in flood magnitude in such regions23. Interestingly, the positive association between temperature and flood magnitude at high elevations exists not only for moderate events, but also for very extreme events.

Our analysis of future changes in flood driver importance further shows that the future relevance of precipitation as a flood driver increases for severe events while the importance of temperature increases for moderate but decreases for severe extremes (Fig. 6c–d). This may potentially be understood in the context of soil saturation as a modulating factor: for typical and even moderate events, antecedent soil-drying and snowpack losses resulting from warming temperatures oppose the effect of increasingly extreme precipitation volume; but for sufficiently severe precipitation events, the extremely large volume of water entering the system may be able to quickly saturate the soil column and overcome even a substantial degree of antecedent soil-drying. In addition, increasingly extreme precipitation may lead to infiltration excess even in the case when soils are not yet saturated. Collectively, these findings support the following generalization: the more extreme a flood event, the more important precipitation becomes as a singular driver – particularly in a warmer future climate.

## Discussion

In this work, we demonstrate for Hydrological Bavaria that there is an extremeness or return interval threshold, which varies by catchment, season, and event type, above which extreme precipitation increases outweigh the soil-drying effects of warming temperatures. This result suggests that in other regions around the globe with similar hydro-climates, i.e. temperate climates with pluvial or nival flow regimes, flood risk in a warming climate may also exhibit divergent changes above and below some locally-defined extremeness or return interval threshold. We further find that the hydrologic response to extreme precipitation varies predictably as a function of event magnitude in a warming climate, with streamflow responses becoming increasingly positive even in the few study catchments which do not exhibit distinct threshold behavior. This, when viewed in the context of prior research, may offer evidence for the broader geographic generalizability of our findings. We find that increases in precipitation yield larger and more consistent increases in flood magnitude for more extreme versus more moderate events, which is supported by previous observational studies showing only weak dependence between extreme precipitation and moderate flood occurrence in the United States10, stronger increasing flood trends for extreme than moderate floods in Central Europe27, and trends in extreme precipitation that only align with trends in floods for the rarest events in Australian catchments25. Thus, there does appear to be a growing body of real-world evidence suggestive of the existence of a precipitation-flood response threshold across a wider range of hydroclimatic and hydrologic regimes than explicitly considered in the present study.

The complex influences of elevation, season, and event type upon the return interval threshold suggest that the location of this critical cross-over point may vary widely across regions of the world with varying topography and background climate. Substantial modulation of this threshold would likely occur depending on climatic factors such as aridity and the local relevance of snowmelt, as well as on catchment size and land use and management. Consider, for example, a semi-arid or subtropical regime (as opposed to the moist mid-latitude regime that characterizes the catchments in the present study). In such a location, the return interval threshold might be higher due to drier antecedent soil conditions – a temperature-related phenomenon we also see when comparing summer with winter thresholds (Figs. 3, 4). The existence of a high return interval threshold in drier Mediterranean regions is supported by observation-based studies that have demonstrated a stronger relationship between precipitation and discharge for larger versus smaller flood events in Spain52, and have shown decreases in the occurrence of moderate floods in southern Europe9,27. In contrast, if we consider cold high-latitude regions and/or high altitude regions with a snow-dominant precipitation regime, the return interval threshold might be expected to be much lower. Indeed, this relationship is apparent from our threshold analysis for snow-influenced events in high-elevation regions (Figs. 3, 4). Additionally, and as suggested by our results (Supplementary Figure 3), the return interval threshold may also be modulated by catchment area (generally increasing with catchment size). For river basins larger than the ones included in our Bavarian selection, this finding would imply return interval thresholds higher than 20–50 years.
Furthermore, direct human influence on streamflow such as dynamic reservoir operations and/or flood management interventions might lead to higher return interval thresholds because smaller floods can be buffered by temporary water storage53. In contrast, urbanized catchments (characterized by a high fraction of water-impervious surfaces) might have lower return interval thresholds than catchments with unsealed surfaces because of a more direct relationship between extreme precipitation and flood response54,55. Such a return interval threshold might even vary from year to year in a single location, occurring at a higher level of event extremeness during drought versus pluvial periods.

How exactly such a return interval threshold varies for different hydro-climates remains to be investigated using a global hydro-SMILE. Creating such a global hydro-SMILE for flood analyses requires the combination of a globally downscaled and bias-corrected atmospheric SMILE with a global hydrological model specifically calibrated for flood peaks. Satisfactory calibration for far-from-mean state conditions is challenging using calibration metrics commonly used for large-scale model calibration56, and data storage and computational costs are high at a global scale when a large spatial domain is combined with a large ensemble size. In addition, global-scale models may not represent complex land surface processes as accurately as smaller-scale models, and appropriate reference datasets for meteorology, soils, and hydrogeology are harder to obtain. Creating such a global hydro-SMILE therefore remains a considerable research effort, but one of substantial importance in a warming climate.

There are two important implications arising from the existence of a return interval threshold above which increases in precipitation directly translate to increases in flood occurrence. First, the existence of this threshold suggests that previous studies that focused on less extreme floods, and which have shown little change or even decreases in annual streamflow maxima or events with return intervals of less than ~20 years57,58, will likely be unrepresentative of changes in higher-magnitude events. A robust statistical signal is unlikely to arise in most historical datasets shorter than 100 years because the strongest link between increasing extreme precipitation and flood magnitude occurs for rare, high-magnitude events with return intervals exceeding 20–50 years. This result points to an important limitation of observation-only studies, as well as to the critical importance of large modeling ensembles that can yield larger sample sizes for rare, high-magnitude events. Second, our analysis suggests that despite historical uncertainties, large increases in flood magnitude are likely in a warming climate for the very largest events – potentially including those unprecedented in the modern historical record (i.e., events with 200-year RI, Fig. 1). The fact that climate warming may act to decrease the magnitude of more moderate flood events while simultaneously increasing the magnitude of the most extreme events, however, highlights the considerable risk of developing a “false sense of security” based on recent historical experience. These findings therefore have major implications for climate adaptation and flood risk mitigation activities, as well as infrastructure design, in a warming climate.

Ultimately, we suggest that this analysis may help reconcile seemingly conflicting perspectives in the climatological and hydrological literature on flood risk in a warming climate. The apparent “precipitation-flood” paradox – whereby precipitation extremes have increased, but floods have not5,24 – may in fact be fully resolved by separating flood events by their extremeness. In this sense, both perspectives may ultimately be correct: hydrologic evidence suggesting no consistent increase in recent flood magnitude because of land surface drying and the changing role of snow using observational records of limited length9,59,60 is physically consistent with climatological arguments pointing to a large increase in the magnitude and frequency of historically rare or unprecedented precipitation events and subsequent flood risk61,62.

Future research aimed at expanding the coverage of the regional hydro-SMILE approach to a wider range of hydrologic and climatological regimes will be critical in confirming the broader generalizability of our findings in the present study, but emerging observational evidence does suggest that threshold behavior in precipitation-flood response is plausible across a wide range of regimes in a warming climate9,10,25,52. In this work, we confirm that antecedent land surface conditions are indeed critical in modulating more common or moderate flood events, but that precipitation becomes the dominant driver for very extreme events and ultimately overwhelms the effects of soil moisture or snowpack. Finally, we emphasize that the inherent limitations of the historical observational record can be obviated through the use of a climate model large ensemble approach in combination with an advanced hydrological model – a framework that might be useful for more broadly assessing complex and possibly non-linear changes in extreme events in the warming earth system.

## Methods

### Study region

We study the relationship between extreme precipitation and flood events and its influencing factors in a warming climate for a set of 78 catchments with nearly natural flow conditions in Hydrological Bavaria (Supplementary Figure 1). This region comprises the Main, Danube, and Inn rivers with their major tributaries. This study region is particularly well suited to analyze variations in the precipitation–discharge (P − Q) relationship because the constituent catchments are characterized by diverse topographic and climatic conditions, ranging from a wet alpine region in the south (1700 mm y−1) to a relatively flat and dry foreland to the north (700 mm y−1), and diverse soil types and land uses. The variations in these conditions lead to a wide range of hydrologic regimes, from snow-influenced regimes with flood peaks in spring and summer to primarily rainfall-influenced regimes with the main flood season in winter. While these regime types can be considered representative of the temperate climate zone with similar runoff regimes (pluvial to nival), our catchment selection does not cover other climate zones such as cold climates, semi-arid to arid regions, and the tropics.

### Hydro-SMILE

For this analysis, we use a hydro-SMILE, i.e. hydrological simulations obtained by driving a hydrological model with a Single Model Initial-Condition Large Ensemble (SMILE) climate model. The underlying simulations were originally generated by Willkofer et al.40 as part of the ClimEx project41. The simulations consist of daily streamflow (mm d−1), snow-water-equivalents (SWE, mm), and soil moisture (%) – all of which were obtained by driving the hydrological model WaSiM-ETH42 with a 50-member ensemble of high-resolution climate input (spatial: 500 × 500 m2, temporal: 3 h). The climate input consists of an ensemble provided through the Canadian Regional Climate Model version 5 nested within the Canadian Earth System Model63 under RCP 8.564 – a ‘high-warming’ climate scenario. WaSiM-ETH is a distributed, mainly physically-based hydrological model comprising modules for evapotranspiration, interception, snow accumulation and melt, glaciers, runoff generation, soil water storage, and discharge routing42. The model was set up for 98 catchments in Hydrological Bavaria by Willkofer et al.40 in two steps. First, global model parameters (i.e. parameters applied to all 98 catchments) describing evapotranspiration rates, infiltration rates, groundwater fluxes, snowmelt, and glacier dynamics were defined using spatial information on elevation, slope, and exposition derived from a digital elevation model for Europe (EU-DEM65), land-use derived from the CORINE land cover dataset66, soil characteristics derived from the European soil database (ESDB v2.067), and hydro-geology (hydraulic conductivity) derived from the Bavarian hydrogeology map68 and the international hydrogeological map of Europe (IHME1500 v1.1.69). Second, four local parameters, i.e. those related to recession and direct flow, were calibrated.
These local parameters were calibrated for the period 2004–2010 using the dynamically dimensioned search algorithm70 on the observed 3 h discharge of the 98 catchments provided by the Bavarian Environment Agency (Bayerisches Landesamt für Umwelt - LfU71) and sub-daily observed interpolated meteorological input (i.e. precipitation, temperature, relative humidity, incoming shortwave radiation, and wind speed).

The meteorological Sub-Daily Climatological REFerence dataset (SDCLIREF) created in the ClimEx project is based on a combination of hourly and disaggregated daily station data. Daily station data were disaggregated using the method of fragments72 in order to extend the sub-daily record to 1981–2010 and to densify the station network. The station data were then interpolated to a 500 × 500 m2 grid using a combination of multiple linear regression (considering elevation, exposition, latitude, and longitude) and inverse distance weighting, similar to Rauthe et al.73. The dynamically dimensioned search algorithm used a multi-objective function targeted at optimizing flood characteristics. This function is composed of the Nash–Sutcliffe efficiency (ENS)74 and the Kling–Gupta efficiency (EKG)75, which both focus on high flows76, log(ENS), which emphasizes low flows, and the root-mean-squared error to standard deviation ratio (RSR), which quantifies volume errors. The overall objective function M assigns the largest weights to ENS and EKG because our study focuses on flood events:

$$M=0.5\times (1-E_{\mathrm{NS}})+0.25\times (1-E_{\mathrm{KG}})+0.15\times (1-\log (E_{\mathrm{NS}}))+0.1\times R_{\mathrm{SR}}.$$
(1)
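The weighting in Eq. (1) can be illustrated with a minimal Python sketch. It is not the calibration code used for the hydro-SMILE. It assumes that log(ENS) denotes the Nash–Sutcliffe efficiency computed on log-transformed (strictly positive) flows and that EKG follows the commonly used formulation based on correlation, variability ratio, and bias ratio; all function names are hypothetical.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - squared error / variance of observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def kge(obs, sim):
    """Kling-Gupta efficiency (assumed form: correlation, variability, bias)."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = math.sqrt(sum((s - ms) ** 2 for s in sim) / n)
    r = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / (n * so * ss)
    return 1.0 - math.sqrt((r - 1) ** 2 + (ss / so - 1) ** 2 + (ms / mo - 1) ** 2)

def rsr(obs, sim):
    """Root-mean-squared error divided by the standard deviation of observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(sse / sst)

def objective(obs, sim):
    """Weighted multi-objective value M of Eq. (1); smaller is better."""
    # Assumption: log(E_NS) is the NSE of log-transformed flows.
    log_obs = [math.log(o) for o in obs]
    log_sim = [math.log(s) for s in sim]
    return (0.5 * (1 - nse(obs, sim))
            + 0.25 * (1 - kge(obs, sim))
            + 0.15 * (1 - nse(log_obs, log_sim))
            + 0.1 * rsr(obs, sim))
```

Under these assumptions, a perfect simulation yields M = 0, and each term grows as the corresponding aspect of the fit deteriorates.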

The calibrated model was first run for the reference period 1981–2010 with the sub-daily (3 h) observed interpolated meteorological input also used for model calibration. It was then run for the simulation period 1961–2099 with meteorological data derived from the 50 members of the fifth-generation Canadian Regional Climate Model large ensemble (CRCM5-LE)41, a dynamically downscaled version (0.11°; 12 km) of the second-generation Canadian Earth System Model large ensemble (CanESM2-LE)77. The CRCM5-LE data were further bias-corrected using a quantile mapping approach78,79 adjusted to sub-daily time steps, with the SDCLIREF as the reference climatology (1981–2010). Correction factors were determined for each quantile bin for each month and sub-daily time step. To preserve the ensemble spread, all members were pooled to obtain the correction factors, and these factors were subsequently applied to each ensemble member separately. The bias-corrected data were then further downscaled to 500 × 500 m2 spatial resolution. The center point of each 0.11° CRCM5-LE grid cell was treated as a virtual meteorological station, and for each time step the anomaly from the mean state was interpolated to the 500 × 500 m2 grid using inverse distance weighting. The interpolated anomalies were then multiplied by or added to the climatological reference fields from the SDCLIREF. Afterwards, the downscaled data were corrected to ensure the conservation of mass for each downscaled 0.11° grid cell.
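The pooling step of the bias correction can be sketched as follows. This is a simplified illustration for a single month and sub-daily time step, not the ClimEx implementation: it assumes multiplicative correction factors, equally spaced quantile bins, and linearly interpolated sample quantiles, and all function names are hypothetical.

```python
import bisect

def quantile(sorted_vals, p):
    """Linearly interpolated sample quantile of pre-sorted values."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def correction_factors(reference, members, n_bins=10):
    """Derive one factor per quantile bin from all ensemble members pooled."""
    ref = sorted(reference)
    pooled = sorted(v for member in members for v in member)
    # Bin edges are quantiles of the pooled model distribution.
    edges = [quantile(pooled, (i + 1) / n_bins) for i in range(n_bins - 1)]
    factors = []
    for i in range(n_bins):
        p = (i + 0.5) / n_bins  # bin mid-point probability
        model_q = quantile(pooled, p)
        # Assumption: leave values unchanged where the model quantile is zero.
        factors.append(quantile(ref, p) / model_q if model_q > 0 else 1.0)
    return edges, factors

def correct_member(member, edges, factors):
    """Apply the pooled factors to a single ensemble member."""
    return [v * factors[bisect.bisect_left(edges, v)] for v in member]
```

Because the factors are estimated once from the pooled sample and then applied member by member, differences between members (the ensemble spread) are retained rather than being corrected away individually.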

Previous studies have demonstrated (1) that CRCM5-LE shows realistic patterns of daily and sub-daily extreme precipitation80 and of the timing of annual maximum precipitation over Central Europe81; (2) that its high resolution allows for a realistic representation of local precipitation extremes, especially over coastal and mountainous regions41; (3) that it is consistent with the EURO-CORDEX ensemble82; and (4) that it compares well to other large ensembles with respect to regional precipitation pattern changes81.

For the subsequent analyses of extreme precipitation and flood events, the 3 h meteorological and streamflow time series were aggregated to a daily scale and averaged over each catchment.

### Hydrological model evaluation

We here evaluate the hydrological model for the 78 catchments used in this study for the reference period 1981–2010 using observed daily streamflow from the hydrological services of Bavaria and Baden-Württemberg (both in Germany), Austria, and Switzerland. The evaluation relies on a set of measures including visual inspection, general efficiency metrics, and flood characteristics. Flood events are determined using a peak-over-threshold approach with the 98th flow percentile as a threshold and a minimum time lag of 10 days between successive events to ensure independence. The general efficiency metrics considered are the Kling–Gupta efficiency75, Nash–Sutcliffe efficiency74, volumetric efficiency, and mean absolute error, four metrics often used in flood simulation studies. The flood characteristics considered are the number of events, mean timing (day of the year), mean peak magnitude (mm d−1), mean volume (mm event−1), mean duration (days), and P − Q dependence. The start and end of an event are defined as the times when discharge rises above and falls below the threshold, respectively. Event duration is defined as the time elapsing between the start and end of an event, and event volume as the cumulative flow exceeding the threshold over the whole event duration.
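The flood event definition above can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not specify: the 10-day lag is measured between event peaks, the larger of two dependent events is retained, and volume is taken as the flow in excess of the threshold. The function name and the returned dictionary keys are hypothetical.

```python
def flood_events(flow, threshold, min_lag=10):
    """Peak-over-threshold events from a daily flow series.

    Returns one dict per event with start, end, peak day, peak,
    duration (days), and volume above the threshold.
    """
    series = list(flow)
    events = []
    start = None
    # A sentinel below any threshold closes a run that reaches the series end.
    for day, q in enumerate(series + [float("-inf")]):
        if q > threshold and start is None:
            start = day  # discharge rises above the threshold
        elif q <= threshold and start is not None:
            run = series[start:day]  # discharge has fallen below the threshold
            event = {
                "start": start,
                "end": day,
                "peak_day": start + run.index(max(run)),
                "peak": max(run),
                "duration": day - start,
                "volume": sum(v - threshold for v in run),
            }
            # Assumption: of two peaks closer than min_lag, keep the larger.
            if events and event["peak_day"] - events[-1]["peak_day"] < min_lag:
                if event["peak"] > events[-1]["peak"]:
                    events[-1] = event
            else:
                events.append(event)
            start = None
    return events
```

In practice, `threshold` would be set to the 98th percentile of the flow series of the catchment under consideration.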

The model shows a satisfactory performance qualitatively and quantitatively using general and flood-specific evaluation metrics (Supplementary Figure 4). Kling–Gupta efficiencies ranged from the first quartile of 0.67 to the third quartile of 0.85, Nash–Sutcliffe efficiencies from the first quartile of 0.56 to the third quartile of 0.8, and volumetric efficiencies from the first quartile of 0.68 to the third quartile of 0.8. The mean absolute error was 0.35 mm d−1 (Supplementary Figure 4a). The flood-specific performance evaluation showed a slight underestimation of the number of events (relative error: 1st quartile: −0.14, median: −0.06, 3rd quartile: 0.07), a slight delay of the timing of flood occurrence (relative error: 1st quartile: −0.01, median: 0.05, 3rd quartile: 0.11), a slight overestimation of flood peaks (relative error: 1st quartile: −0.02, median: 0.07, 3rd quartile: 0.22), an overestimation of both flood volume (relative error: 1st quartile: 0.08, median: 0.32, 3rd quartile: 0.54) and duration (relative error: 1st quartile: −0.02, median: 0.14, 3rd quartile: 0.39), and an underestimation of P − Q dependence (relative error: 1st quartile: −0.35, median: −0.24, 3rd quartile: −0.04) (Supplementary Figure 4b). Overall, the model performance with respect to high flows and flooding is satisfactory. In addition, the results of our change impact assessment are less affected by inconsistencies between observed and simulated flow because we assess relative rather than absolute changes in precipitation and flood magnitudes.

### Event identification

Using the daily streamflow simulations from the 50 members of the hydro-SMILE, we identify pairs of extreme precipitation (i.e. areal sum over catchment) and corresponding streamflow for two non-overlapping periods of 40 years, a historical (1961–2000) and a warmer future period (2060–2099). Periods of 40 years were chosen to maximize the sample size while ensuring that the two periods are as distinct as possible. To identify these P − Q pairs, we first define daily extreme precipitation events (mm d−1) using the 99th percentile as a threshold and prescribing a minimum time lag of 10 days between events in order to ensure independence (i.e. to enable declustering). The percentile is determined on all days (including days with zero precipitation) of the full time series 1961–2099. This event extraction procedure results in roughly 2–2.5 events per year on average, depending on the catchment. Over the 2000 model years of data per time period (40 years across 50 ensemble members), we thus select approximately 5000 extreme events per catchment. The start of each precipitation event is defined as the day when precipitation exceeds 1 mm prior to the first threshold exceedance, and the end of each precipitation event is defined as the time when precipitation falls below 1 mm after the final threshold exceedance (for an illustration of the event identification procedure see Supplementary Figure 5). Next, for each precipitation event, we identify the corresponding streamflow peak (mm d−1) within a time window from the start of the precipitation event to 5 days after the end of the precipitation event. Finally, for each event, we determine temperature (°C) on the day of peak precipitation, and snow-water-equivalent (mm) and soil moisture anomalies (deviation from the mean, percentage) on the day prior to the occurrence of the precipitation extreme.

We repeat this event extraction procedure for two additional temporal aggregation levels (3-day and 5-day mean precipitation accumulations) in order to assess the effect of precipitation aggregation on future precipitation and discharge changes, because event identification using different aggregation levels results in the extraction of different event sets.
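The pairing procedure for a single ensemble member can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: declustering keeps the largest exceedances first, the event is delimited by the contiguous spell of days with more than 1 mm around the retained peak day, and the function name and dictionary keys are hypothetical.

```python
def pq_pairs(precip, flow, threshold, min_lag=10, wet=1.0, window=5):
    """Pair declustered precipitation extremes with streamflow peaks."""
    # Candidate days above the threshold, largest first.
    exceed = sorted((d for d, p in enumerate(precip) if p > threshold),
                    key=lambda d: precip[d], reverse=True)
    peaks = []
    for d in exceed:
        # Declustering: keep a day only if it is at least min_lag days
        # away from every peak that has already been retained.
        if all(abs(d - k) >= min_lag for k in peaks):
            peaks.append(d)

    pairs = []
    for d in sorted(peaks):
        start = d
        while start > 0 and precip[start - 1] > wet:
            start -= 1  # extend backwards over preceding wet days
        end = d
        while end < len(precip) - 1 and precip[end + 1] > wet:
            end += 1  # extend forwards over following wet days
        # Search window: event start to `window` days after the event end.
        last = min(end + window, len(flow) - 1)
        pairs.append({"p_day": d, "p": precip[d], "start": start,
                      "end": end, "q": max(flow[start:last + 1])})
    return pairs
```

Here, `threshold` would correspond to the 99th percentile of daily precipitation over 1961–2099, and antecedent SWE and soil moisture could be read from day `p_day - 1` of the respective series.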

### Changes in event magnitudes and P − Q relationship

In the first part of our analysis, we use the P − Q event pairs identified to analyze how precipitation and corresponding flood magnitudes, as well as the relationship between the two variables, may change in the future. To do so, we compare the statistical characteristics of these variables for the future period (2060–2099) to those of the historical period (1961–2000). P and Q magnitudes are determined empirically by pooling events extracted from the 50 ensemble members for different levels of extremeness, i.e. ‘mean’ events (those which occur, on average, once or twice per year) and progressively more extreme events with 10-, 20-, 50-, 100-, and 200-year return intervals. Sample quantiles are computed for probabilities p corresponding to different return periods T using:

$$p=1-(\mu /T),$$
(2)

where μ is the mean inter-arrival time between events. The P − Q relationship is characterized using different dependence measures, including Pearson’s correlation coefficient and the tail dependence coefficient $\overline{\chi }$51, which provides a simple measure of extremal dependence, at different levels of extremeness (i.e. probabilities corresponding to return intervals of 10, 20, 50, 100, and 200 years). Future changes are expressed as relative changes with respect to the characteristics of the historical period.
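As a worked example of Eq. (2), 5000 events in 2000 model years give μ = 0.4 years, so that the 100-year magnitude corresponds to p = 1 − 0.4/100 = 0.996. The Python sketch below illustrates this calculation together with an empirical estimate of the tail dependence coefficient. It assumes linearly interpolated sample quantiles, rank-transformed margins, and the standard definition 2 log P(U > u) / log P(U > u, V > u) − 1 for the tail dependence coefficient; the function names are hypothetical and the estimator used in the study may differ.

```python
import math

def return_level(magnitudes, n_years, T):
    """Empirical T-year magnitude from pooled event magnitudes (Eq. 2)."""
    mu = n_years / len(magnitudes)  # mean inter-arrival time in years
    p = 1.0 - mu / T
    x = sorted(magnitudes)
    pos = p * (len(x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (pos - lo) * (x[hi] - x[lo])

def chi_bar(p_vals, q_vals, u):
    """Empirical tail dependence coefficient at quantile level u (0 < u < 1)."""
    n = len(p_vals)

    def ranks(values):
        # Transform to approximately uniform margins via scaled ranks.
        order = sorted(range(n), key=lambda i: values[i])
        r = [0.0] * n
        for k, i in enumerate(order):
            r[i] = (k + 1) / (n + 1)
        return r

    rp, rq = ranks(p_vals), ranks(q_vals)
    marginal = sum(a > u for a in rp) / n
    joint = sum(a > u and b > u for a, b in zip(rp, rq)) / n
    if marginal == 0 or joint == 0:
        return float("nan")  # too few joint exceedances to estimate
    return 2.0 * math.log(marginal) / math.log(joint) - 1.0
```

With this definition, values close to 1 indicate that P and Q extremes tend to occur together, whereas values close to 0 indicate near independence in the tail.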

We identify factors potentially influencing the nature of change in P − Q magnitudes and relationship by distinguishing between different levels of extremeness (i.e. return intervals), small and large catchments, high-elevation and low-elevation catchments, winter and summer events, and snow-influenced and rainfall-driven events. The levels of extremeness considered for both P and Q are the mean and quantiles corresponding to return intervals of 10, 20, 50, 100, and 200 years. Within the 2000 model years available for analysis, roughly 10 events have a return interval of at least 200 years, while roughly 200 events have a return interval of at least 10 years in each catchment. Small to medium-size catchments are distinguished from large catchments by setting an area threshold83 of 1000 km2, which results in 21 small and 57 large catchments. Similarly, low-elevation catchments are separated from high-elevation catchments using an elevation threshold84 of 1000 m above sea level, which results in 55 low-elevation catchments and 23 high-elevation catchments. Winter events are defined as those events occurring between October and March and summer events as those occurring between April and September. Our results are not sensitive to the use of an alternative seasonal definition aligning with the start of the hydrological year (Nov–Apr, May–Oct). Throughout the analysis, snow-influenced events are defined as those events during which there was at least 10 mm of SWE, while rainfall-driven events are those with less than 10 mm of SWE47.

### Importance of hydro-meteorological drivers

In the second part of the analysis, we identify potential hydro-meteorological drivers influencing extreme precipitation and flood magnitudes and their statistical relationships. A comparison of driver importance for the two periods (historical and future) allows us to identify drivers losing or gaining importance in the future. For both periods, we fit multiple linear models to flood magnitudes (mean or quantiles for the 78 catchments) using four explanatory variables, all of which exhibit only weak collinearity according to the variance inflation factor, which lies around 1–2 for most variables and does not exceed 4 in most cases. The explanatory variables are mean event precipitation for each catchment (i.e. mean precipitation for the extreme events identified), mean event temperature, mean event SWE, and mean event soil moisture anomaly. Both flood magnitudes and the explanatory variables are standardized prior to model fitting by subtracting the mean and dividing by the standard deviation (z-scores) in order to make the resulting regression coefficients inter-comparable and easily interpretable. Comparing regression coefficients of the future model to the coefficients of the historical model (absolute changes) enables quantification of changes in future driver importance. Similar to the change analysis, we also distinguish between different levels of extremeness to determine how driver importance varies for events with different return intervals (mean and 100-year event), between low- and high-elevation catchments to determine the degree to which driver importance depends on catchment elevation, and between winter and summer events to shed light on how driver dependence varies by season.
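The standardized regression and the collinearity check can be sketched in plain Python. The sketch assumes ordinary least squares solved via the normal equations (no intercept is needed because all variables are z-scored) and the usual definition of the variance inflation factor, 1/(1 − R²), from regressing one driver on the others; the function names are hypothetical.

```python
import statistics

def zscore(values):
    """Standardize by subtracting the mean and dividing by the standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least-squares coefficients (no intercept) from the normal equations."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

def standardized_coefficients(drivers, flood):
    """Regression coefficients of z-scored flood magnitudes on z-scored drivers."""
    return ols([zscore(c) for c in drivers], zscore(flood))

def vif(drivers, j):
    """Variance inflation factor of driver j: 1 / (1 - R^2)."""
    cols = [zscore(c) for c in drivers]
    target, others = cols[j], cols[:j] + cols[j + 1:]
    beta = ols(others, target)
    fitted = [sum(b * c[i] for b, c in zip(beta, others)) for i in range(len(target))]
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum(t ** 2 for t in target)
    return ss_tot / ss_res
```

Fitting `standardized_coefficients` once with historical and once with future catchment values, and differencing the resulting coefficients, would give the absolute change in driver importance described above.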