Physics-guided probabilistic modeling of extreme precipitation under climate change

Earth System Models (ESMs) are the state of the art for projecting the effects of climate change. However, longstanding uncertainties in their ability to simulate regional and local precipitation extremes and related processes inhibit decision making. Existing state-of-the art approaches for uncertainty quantification use Bayesian methods to weight ESMs based on a balance of historical skills and future consensus. Here we propose an empirical Bayesian model that extends an existing skill and consensus based weighting framework and examine the hypothesis that nontrivial, physics-guided measures of ESM skill can help produce reliable probabilistic characterization of climate extremes. Specifically, the model leverages knowledge of physical relationships between temperature, atmospheric moisture capacity, and extreme precipitation intensity to iteratively weight and combine ESMs and estimate probability distributions of return levels. Out-of-sample validation suggests that the proposed Bayesian method, which incorporates physics-guidance, has the potential to derive reliable precipitation projections, although caveats remain and the gain is not uniform across all cases.

www.nature.com/scientificreports www.nature.com/scientificreports/ state-of-the-art and best practice, the focus of this manuscript is on characterizing gaps in the physics through model-based simulations.
Based on an implicit assumption that the range of parametric and structural variations embedded in climate models may capture the range of physically plausible behavior 12,13 , multiple model ensembles are used to characterize the variability among models and for associated uncertainty assessment. In a desire to balance two competing views, one that argues that historical skills of model simulations based on comparisons with observations is a valid indicator of future behavior, and another that suggests that given expected changes in the earth's radiative balance model consensus is a better indicator 14 , recent research has examined the principled ways to consider both skills and consensus in a Bayesian framework [15][16][17][18][19] . The hypothesis that known physics-based relations that may or may not be well captured in earth system models, as well as observed and simulated data-driven multivariate dependence structures may improve this aspect of model assessment and probabilistic modeling has been suggested 20 . Here we examine this hypothesis in the context of climate model-driven probabilistic modeling of future precipitation extremes, based on temperature dependence. While we do not consider the full range of temperature scaling based on Clausius-Clapeyron and convective processes [21][22][23] , we do consider a specific case study. Our findings may lead to more effective uncertainty characterization, as well as better understanding of model strengths and the applicability of scientific understanding beyond what may be captured in models to improve uncertainties.

Skill, consensus, and physics-guided climate model weighting
The most common and practical approach for probabilistic climate modeling involves exploiting archived ensembles of ESM runs to estimate probability distributions of climate change. Several methodologies for creating such ensembles have been proposed that essentially focus on skill-and consensus-based weighting of the ensemble members. Skill refers to the ability of an ESM to replicate historical climate observations, while consensus relates on their agreement with their peers about the future 14 . This approach was formalized for regional average temperature and precipitation in a Bayesian framework 15,16 . It was then extended in several studies to accommodate bivariate relationships between averages of climate variables 17 and to support efficient probabilistic modeling across multiple geographic regions simultaneously 18 . To date, most of these studies have only supported averages of climate variables. An exception is a recent study that applies this framework to high quantiles of precipitation 19 . Specifically, it applies a modified version of the framework to the 95 th percentile of precipitation depth on wet days.
Literature has pointed out the difficulty of measuring the "skill" of an ESM 12,13,24,25 , despite a multitude of attempts to do so [26][27][28] . Furthermore, in many cases common skill metrics such as root mean squared error 26 tend to not lead to systematic differences in terms of model projections 27,28 ; that is, a "better model" often does not say anything different about the future than a "bad" model. Several notable studies, however, suggest that skill metrics designed to capture whether an ESM is simulating a non-trivial physical process can lead to clearer insights about anthropogenic attribution 29 or reduced future uncertainty 23,30,31 . From this, we can synthesize a hypothesis that non-trivial, physics-guided measures of skill may be more useful indicators of ESM reliability. This hypothesis is tested formally via the Bayesian model proposed in the current study, using precipitation extremes as a case.

physics of precipitation extremes
Precipitation extremes are in many cases expected to increase in intensity, duration, and/or frequency as a function of climate change given theory 21,22,32 , evidence from observations 33 , and ESM projections 34,35 . At a global scale this can be explained by the Clausius Clapeyron (CC) equation 34,36 , which shows that under ideal conditions, atmospheric moisture capacity increases in a warming climate.
The August-Roche-Magnus formula 37 provides an empirically derived approximation in ideal conditions (between −40 and 50 degrees Celsius and over a plane surface of water): where e s is saturation vapor pressure (i.e., atmospheric moisture holding capacity) in hPa and T is temperature in Celsius. Moisture condenses to precipitable water when atmospheric moisture holding capacity is reached. On average, this implies a shift in the distribution of the intensity of precipitation events: larger e s (T) values imply longer duration between condensation and thus precipitation events. When heavy precipitation events do occur, they are expected to increase in intensity owing to increased atmospheric moisture content. Moreover 38,39 , revealed that super CC scaling (typically of order of 1.5-2 times than CC scaling) is primarily due to response of convection to increase in near-surface humidity, while other atmospheric conditions remain constant. Ultimately, in aggregate, increasing temperatures under climate change translates to increased capacity for drought risk with simultaneous increased potential for extreme precipitation and flood risk 40 . At a global average scale, it has been estimated that atmospheric moisture capacity increases by 7% per degree Celsius 36 ; this is often referred to as CC scaling.
Generally it would be difficult to assess an ESM's ability to simulate the dynamical processes (upward vertical wind velocities) that partially drive extreme precipitation since observational data for those processes are usually not even available. In contrast, in many regions of the world, high quality observations for both temperature and precipitation do exist.
Hence, in this study, we leverage this knowledge with the following hypothesis: a skillful ESM should be able to successfully replicate not only the observed marginal distribution of extreme precipitation but also its observed dependence (whether the relationship is positive, inverse, or lack thereof) on contemporaneous air temperature at a regional scale. The complexity of the relationship between air temperature and extreme precipitation 41 as well as the relative regional dominance of dynamical processes 21 www.nature.com/scientificreports www.nature.com/scientificreports/ This further supports the potential utility in modeling the relationship between temperature and extreme precipitation at a regional scale. In other words, rather than imposing CC scaling or any specific magnitude of scaling, we allow the Bayesian model to learn the specific regional dependence between daily extreme precipitation and same-day temperature. For this study, we restrict the type of temperature-precipitation dependence to a linear type with unknown direction and magnitude. We do so while acknowledging that in many cases that dependence structure could be of other forms (monotonic but nonlinear, non-monotonic, etc.) based on the above physics discussion. Future studies could seek to extend the way the following proposed model measures that dependence.

Methods
The data and preprocessing steps. An ensemble of 15 ESMs from the Coupled Model Intercomparison Project Phase 5 (CMIP5) archive is used in this study. For the years 1950-1999, historical ESM runs are used (SI Table 1 and SI Table 2). For the years 2065-2089, runs from the greenhouse gas scenario RCP8.5 are used. The model presented shortly is run for all 18 continental U.S. Hydrologic Unit 2 (HU2) watersheds provided by the United States Geological Survey's Watershed Boundary Dataset (USGS WBD) 45 . We include the metadata on the ESMs and watersheds used in the supplementary material. USGS WBD HU2 shape files are used to identify grid cells that belong to each watershed (SI Table 3). In each watershed and for each ESM, preprocessing is conducted as follows. For each month and year, we find the day with the maximum total precipitation depth over the entire watershed. Then, for that same day, precipitation and temperature are extracted at each grid cell for that day and then averaged over grid cells in the watershed. We refer to those watershed-averaged precipitation values as block maxima, i.e., maximum values of precipitation over discrete temporal blocks, in accordance with extreme value analysis literature 46 . For each month separately, those block maxima are then sorted in ascending order and treated as return levels. Sorting block maxima in ascending order helps alleviate the fact that ESMs are likely to be out of phase with each other and observations. This idea has been utilized in statistical downscaling; with asynchronous regression approaches, the order statistics of observations are regressed on the order statistics of an ESM to create transfer functions that can be carried forward to future ESM simulations 47 . Surface (2-meter) air temperature averaged over the same days as the block maxima are extracted and re-sorted according to precipitation ordering, as well. Observational precipitation maxima and surface air temperature are extracted from a higher resolution ( ) 1 16 degree gridded observational data product 48 for the years 1950-1999 and are preprocessed in the same manner as the ESMs.
We denote P as return levels/depths of precipitation and T as temperature averaged over the same day in the same location. The subscript k indexes observational datasets (there is only one observational dataset used in this study, but the Bayesian model allows for more than one); m ∈ [1, …, M = 12] indexes season (calendar month in this study), q ∈ [1, …, Q = 25] indexes the ranks of the return levels (i.e., indexes return periods) from smallest to largest from a historical climatology; q′ ∈ [1, …, Q′ = 25] the same but for the future climatology; j indexes ESM datasets.
Let Z j,m = log(P j,m,q=1 ), i.e., for any ESM dataset j (or k for observations), the smallest value of the precipitation is transformed with a natural log. Then, for larger return levels q ∈ [2, …, Q], we let U j,m,q = log(P j,m,q − P j,m,q−1 ), i.e., the logged difference between adjacent return levels. We note that while using logged difference, which essentially models a quantile process as a random walk with log-normal increments, is slightly unusual; we decided on using this framework primarily for computational convenience. This preprocessing is done for three separate climatologies: 1950-1974, 1975-1999, and 2065-2089. For the historical period we use the data from 1950-1974 for model estimation and the data from 1975-1999 to validate our prediction. Then using the information based on such model fitting over a historical data period and an optimized model fitted to the data from 1975-1999, we project probabilistic precipitation extremes scenarios in 2065-2089. The bayesian ensemble model. We leverage the Bayesian skill and consensus-based framework discussed earlier [15][16][17][18][19] as the mechanical foundation for our model. Through a Markov Chain Monte Carlo (MCMC) process, ESM projections of return levels are iteratively weighted and averaged according to (1) their skill as measured by their similarity to observational return levels and (2) their consensus with projections. Skill is formulated to explicitly evaluate whether the return levels from ESMs depend on temperature in the same way that they do in observations. SI Tables 4 and 5 summarizes all the notations used in this research.
First, the smallest of the return levels are assumed to follow Gaussian data models: Here, the unknowns C m and ′ C m are seasonal parameters that can be estimated given that there are multiple models and observational datasets. CBIAS j is a bias term for ESM j that is assumed to be constant over time regimes. The parameter σ j is a scalar weight for each ESM. Finally, θ is a future variance scaling parameter that modulates the importance of consensus in the determination of weights and also allows for a different magnitude of uncertainty in the future climatological regime 20 . Similar to models from past studies 15,16,18 , the weight parameter σ k is estimated from observational data as the inverse of the sample variance of the smallest block maxima (q = 1) over all M seasons. www.nature.com/scientificreports www.nature.com/scientificreports/ Define the variable δ j,m,q = T j,m,q − T j,m,q−1 for any ESM j (or observational dataset k). Similar to past related studies 15,16,18 , we fix the parameter ε − k q , 1 as follows: first, separately for each season m, we fit a simple linear regression of [2, , ] . We save the residuals from each of these regressions. Then, for each order statistic q, we calculate the sample variance of the residuals for using q all seasons m ∈ [1, …, M] Using these, we define the data model for values of U w,m,q and ′ ′ U w m q , , , for q ∈ [2, …, Q] and q′ ∈ [2, …, Q′].
The treatment of σ k and ε k,q as fixed and estimated from data is the empirical aspect of the Bayesian model proposed here. Empirical Bayesian methods leverage the Bayesian statistical paradigm to obtain posterior distributions of unknown parameters conditional on data, but with considerably less intricate computations and generally near identical theoretical properties.
With Eqs. 5-7, we are essentially assuming that the logarithm of the differential between any pair of subsequent order statistics, which is a sample quantity, is Gaussian with the mean being the population equivalent. As such, the model does not suggest that extremes themselves are Gaussian, it merely says sample versions are normally distributed around true population quantities. This Gaussian assumption of the U k,m,q statistics is examined in greater detail in the supplementary materials. It is important to keep in mind that observations themselves are potentially noisy realizations of the truth. Our Bayesian model does not necessarily treat observations as ground truth in the way a supervised learning approach would. The parameter τ k lets us scale the weight of observational climate data to behave more like ground truth, which in turn influences values of unknown parameters in a manner similar in spirit to a supervised learning problem. We explore the sensitivity of model results to choice of τ k in the supplementary material, based on which we settle on τ k = 100 for our data analysis in this paper.
The full joint distribution of all the unknown parameters: Validation of the Bayesian model is a crucial component of assessing its utility. Of course, unlike weather forecasting, true validation over future climatologies is impossible in the immediate term given the lead times of interest. We validate the model using a training-holdout scheme similar to conventional predictive modeling. We do this in each region using 1950-1974 as the "training" and 1975-1999 as the "validation" climatologies, respectively. We examine the accuracy of our Bayesian model, posterior coverage, posterior upper coverage, and posterior width, all as compared to the original ensemble of ESMs. In addition, we also compare posterior projected changes in return levels as compared to those projected changes obtained directly from the original ensemble of ESMs. For this measure, where the original ensemble performs well with reference to held out observations, the ideal Bayesian model should exhibit similar projected changes. In cases where the original ensemble performs poorly against held out observations, the ideal Bayesian model might deviate in terms of projected changes.
Finally, similar in theme to related work 18 , we use the ESMs themselves to validate the model. For a given watershed, each ESM is iteratively treated as true climate and the Bayesian model is run using 1975-1999 as the training climatology and 2065-2089 as the validation climatology. This is motivated by the fact that the difference between 2065-2089 and 1975-1999 should show a more prominent signal and thus might be a more fair way to assess the ability of the Bayesian model to handle longer-term changes (than those between 1950-1974 and 1975-1999) and potential nonstationarity. Details of these validation steps are reported in the supplementary materials.

Results
Validation results. Overall, the Bayesian model tends to be more accurate than the ensemble in non-summer months. Note that ESMs poorly simulate the pronounced diurnal cycle in precipitation over the United States in the summer 49 , and there is a significant correlation between tropical and North Pacific sea surface temperatures and summer precipitation variability 41 . Using these factors in the Bayesian model can potentially yield even better comparative www.nature.com/scientificreports www.nature.com/scientificreports/  www.nature.com/scientificreports www.nature.com/scientificreports/ performance over the unweighted ensemble. However, we did not pursue such more mathematically complex models to keep the analysis tractable and interpretable. Figure 2 examines the ability of the Bayesian model to simulate historical changes in return levels compared to the original ensemble projections. Historical changes (1950-1974 to 1975-1999) are shown for the Bayesian model, the original ensemble, and the observations. All changes are in mm day units and are separately examined marginally over watersheds, return periods, and seasons. Median projections from the Bayesian model better reflect observed magnitudes of change, whereas the median of the original ensemble consistently underestimates observed change. The relative consistency of the Bayesian projections, versus those obtained from the raw ensemble, may reflect a combination of (1) the Bayesian model's ability to weight ESM projections more realistically using observations and/or (2) "shrinkage", or the process of the model parameters tending toward average behavior over seasons and order statistics.  We now focus on one watershed, Ohio, as a complete case study. Figure 1 shows that the Bayesian model substantially improves on its accuracy in the 1975-1999 time period, cutting RMSE almost in half. Figure 2 shows that the Bayesian model also more realistically portrays the change in average Ohio extremes. Figure 3 shows that, qualitatively, the Bayesian model yields similar projected changes as the median projected changes from the original ensemble. Table 1 shows that, while the Bayesian model cuts RMSE almost in half, it also effectively on average reduces width as measured by the credible interval. Finally, we explore detailed projected changes for the uppermost extremes: Fig. 4 shows the detailed end of century percent change projections (1975-1999 median to 2065-2089) for q′ = 25 year return levels Ohio. Violin plots show a full probability distribution of change relative to the 1975-1999 from the Bayesian model for each calendar month. Median, lower bound, and upper bound change projections from the original ensemble are overlaid for comparison. One notable feature in the Bayesian projections that is absent in the original ensemble is a long upper tail. Another is the difference between median projections of the Bayesian model versus the original ensemble, which can provide insight into biases being estimated. December projections show this difference, where the full spread of ESM projections shows increased precipitation, but the Bayesian model shows a median of projected decrease but a heavy upper tail. This lends an explicit likelihood to potential changes that are larger than the original ensemble projections. Though generally similar on average, the Bayesian change projections have the advantage over the original ensemble in that they provide stakeholders with information on probabilities versus discrete, unweighted projections. , effectively removing the notion of temperature dependence. We then perform a meta-analysis of the model's performance against the validation regime with versus without temperature dependence. Table 2 synthesizes the relative difference in model performance when including temperature dependence versus not including it. Including temperature dependence improves overall RMSE in 7 of 18 watersheds, increases or maintains coverage in 11 of 18, increases or maintains upper coverage in 14 of 18, and increases average posterior width in 11 of 18. In the Lower Mississippi, South Atlantic-Gulf, and Ohio watersheds, accuracy improves notably with temperature dependence included. There appears to be a mild bias-variance tradeoff, where posterior intervals improve (and often widen) and accuracy decreases slightly in general. The ability of the Bayesian model to capture bias of ESMs likely contributes to its performance more so than temperature dependence. Despite these mild tradeoffs when including temperature, the Bayesian model still generally outperforms the ensemble in terms of accuracy (see Fig. 1 and Table 1). www.nature.com/scientificreports www.nature.com/scientificreports/

Discussion
In this study, we present a physics-guided Bayesian model that utilizes ensembles of ESMs to estimate probabilistic projections of precipitation extremes under climate change. We exploit the knowledge that in many regions there is a relationship between temperature and extreme precipitation (e.g. 34 ), but that the dependence structure between the two variables might often be more complex than idealistic Clausius Clapeyron scaling 50 . The Bayesian model weights ESMs according to their ability to capture not only historically observed marginal, univariate statistics of daily total precipitation return levels but also their covariance with historically observed same-day average surface temperature. This study has a similar goal to a Bayesian model for precipitation extremes developed recently 19 but is more generalized in the sense that it simultaneously models a complete cumulative distribution function of extreme events rather than one specific statistic of extreme precipitation.
For the model specific to precipitation extremes developed here, there are several caveats worth highlighting. ESMs would not explicitly capture (extra)tropical cyclones or other major storms (e.g., Nor' easters) from the observational record, and thus extreme precipitation as a result of those types of events might not be reflected directly in raw ensemble data. This may impact the Bayesian model's ability to accurately capture some of the most extreme observed events, especially in the southeastern United States (e.g., see Fig. 1). In the validation analysis, while the Bayesian model's 99% credible intervals performed well overall, they did not tend to outperform the original ensemble in terms of upper coverage, i.e., the ability to bound the most extreme observations. This could be explained partially be a form of bias-variance trade-off, given that the Bayesian model usually outperforms the ensemble in terms of accuracy in the 1975-1999 period. It is also reasonable to hypothesize that a Generalized Extreme Value (GEV) data model, for example, might better handle the uppermost extremes. Future research could expand the development of tractable, intepretable Bayesian frameworks for ensemble weighting based on a GEV data model 19 . For this particular work, results in Fig. 4 do show that the upper tail of projected changes can be quite heavy; this is not reflected in the 99% credible intervals.
Caution must also be exercised in interpreting Bayesian projections, especially given mixed results from the ESM cross validation experiment (see supplementary materials). Stationarity and reliability of a relatively complex multivariate distribution of Bayesian parameters, while tested to the extent possible via cross validation with ESMs and with an explicit training-holdout split in the historical time window, is not guaranteed in the future 25 . However, the Bayesian model does produce projections that are qualitatively comparable to those of the original ensemble but with probabilistic information, implying that the risk of leaning on the Bayesian model versus the raw ESMs is minimal in a relative sense.
From stakeholder perspective, near term and long term risk assessment needs to account for uncertainty from disparate sources and climate uncertainties may be of second order. For example, in case of hydrological and flood risk assessment, parametric and modelling uncertainties owing to limited understanding of underlying processes can dominate or in some cases, comparable to climate uncertainties 51 . Similarly, in the context of risk assessment on critical infrastructures including transportation, energy, and water and wastewater networks, uncertainties associated with associated non-linear dynamics and cascading failures 52 can dominate the stressor related uncertainties. Hence, future research and methodologies in this direction need to characterize, both qualitatively or quantitatively, the relative importance and magnitude of various sources of uncertainties within mathematical frameworks to aid policymakers in decision making.
As discussed more in the supplementary materials, certain combinations of prior parameters may work particularly well in certain watersheds, but in this study we opted to find one set of parameters that worked well, generally (See SI Figs. 7-8). The Bayesian model generally performs best when skill is favored over consensus ((See SI Figs. 9-17 for watershed scale results). This may suggest that in general, weighting ESMs based on nontrivial and physics-guided measures of historical skill (in this case, how well ESMs portray precipitation-temperature dependence from observed data) can lead to improvements in the statistical attributes of probabilistic projections, e.g., accuracy and coverage.
www.nature.com/scientificreports www.nature.com/scientificreports/ www.nature.com/scientificreports www.nature.com/scientificreports/ data used are detailed in references and tables in the supplementary information. Significant portion of the work was performed when Kodra and Bhatia were graduate students at Sustainability and Data Science Laboratory at Northeastern University.