Background & Summary

Among the most severe impacts of climate change is the intensification of the hydrologic cycle1,2. The Clausius-Clapyeron relation3, which defines specific humidity of the atmosphere as a function of temperature, suggests that the rising trend of global mean surface air temperature will lead to an increase in evaporation and precipitation4, and potentially exacerbate observed changes in river flows5, hydrological extremes6,7 and water availability8,9. These changes are particularly pronounced over Earth’s large lakes10 (which hold more than 90 percent of all global surface fresh water), where rapid increases in lake temperature11 have led to unprecedented water level dynamics on many of those lakes12,13. The intensified hydrologic cycle, coupled with the ever-increasing water demands of a rapidly growing population14, have strained global water resources, indicating a need for improved understanding of how the different components of the Earth’s system (e.g., climate, land surface, and human) have influenced the hydrologic cycle15. To meet this demand, hydrological models are often used16, in part because of their capacity to represent hydrologic variables across the global landmass. Model simulations have corroborated observed changes in components of the water cycle17,18,19, and related these changes to natural and anthropogenic factors20,21.

As hydrological models have become more advanced, simulations of water balance components (e.g. runoff, evaporation) have also been made available in the public domain22,23,24, providing opportunities to advance understanding of the hydrologic cycle at multiple spatiotemporal scales. However, uncertainties in global data products are often high, especially in regions with very large lakes25, as lake-atmospheric feedbacks can be challenging to simulate accurately26,27. To offset limitations of hydrologic model simulations, remote sensing data products are among the potential alternatives for large lakes research. Recent advances in remote sensing techniques28 have improved the accuracy of data products representing important variables of large lakes hydrology such as water levels29 and evaporation30. However, the development of remote sensing data sets usually does not take into account mass flux balance in the context of the overall hydrologic cycle. This limitation hinders the applicability for large lakes of remote sensing data sets, as they often cannot be used together with other independent data sources to explain the mechanisms driving changes in water storage31.

The Laurentian Great Lakes (hereafter referred to as the Great Lakes; Fig. 1), the largest system of freshwater lakes on Earth, represent many of the challenges facing global large lakes. Water levels of the Great Lakes have fluctuated in response to natural climate variability (e.g., variations in precipitation and evaporation) as well as direct anthropogenic factors such as regulation of outflows and inter-basin diversions32,33,34. The intensified dynamic of water levels in the last two decades35 has elevated societal concern of a potential new norm for the Great Lakes hydrologic cycle12,36 in the future as global temperatures continue to rise37, posing new challenges for regional water management. Although there are multiple data sources available to study the Great Lakes water balance22,38,39,40,41,42,43,44,45, none of them adequately quantify uncertainty46,47 or reconcile the water balance because they were developed independently.

Fig. 1
figure 1

The main features of the Laurentian Great Lakes basin (shaded region) including lake surfaces (light blue), location of major cities, main inter-basin diversions, and connecting channels (Source: NOAA Great Lakes Environmental Research Laboratory and U.S. Army Corps of Engineers, Detroit District).

To provide a framework for incorporating independent data sets and informing water management decisions for large lakes, a statistical framework (the Large Lakes Statistical Water Balance Model, hereafter referred to as the L2SWBM) has been recently developed48,49. This new model can assimilate independent data products to infer the value of water balance components through a Bayesian framework. A conventional water balance equation is used within the L2SWBM to constrain the estimates, ensuring that outputs can close the water balance over multiple time periods. The L2SWBM has been used to support Great Lakes hydrological research, particularly by attributing water level changes to climatic conditions50, assessing bias of different data products representing a common water balance component47,51, and benchmarking the performance of operational forecasts31.

This article presents a seventy-year record of Great Lakes water balance estimates using the L2SWBM. This dataset can be used to explore the mechanisms underlying long-term changes as well as the most recent surge of Great Lakes water levels, and provide new insight into how climate change has influenced, and might continue to influence large lakes. The inputs and source code of the L2SWBM are also made available, and can be customized to incorporate new measurements, estimates or simulations when they become available in the future.


Figure 2 shows a schematic of the Bayesian inference approach encoded in the L2SWBM. The following sections elaborate on the independent data sources used as inputs, and describe the components of the L2SWBM in greater detail.

Fig. 2
figure 2

Schematic figure of the approach to generating new monthly estimate for the Great Lakes water balance.

A compilation of multiple data sources for the Great Lakes water balance

Multiple hydro-climate datasets are available to represent the water balance of the Great Lakes31 ranging from gauge-based aggregated data43,52 to model simulations47 and remote sensing products40. However, they were mostly developed independently with limited consideration of fidelity to the water balance31. This inconsistency among available data sets is a long-standing challenge facing Great Lakes hydrologic research41, and has motivated the development of the L2SWBM50. To inform the Bayesian framework encoded within the L2SWBM, we selected eight independent data sources, including:

  • Beginning of month (BOM) water levels (H) for each of the Great Lakes, provided by the binational Coordinating Committee on Great Lakes Basic Hydraulic and Hydrologic Data (referred to as “Coordinating Committee”, or CCGLBHHD, hereafter; for more information about this ad-hoc group, please see Gronewold, et al.31). Lake wide-average water levels were calculated as the arithmetic mean of daily water level measurements over a subset of in situ gauges located across the coastline of each lake. A greater detail of the underlying data sets is discussed in Gronewold, et al.31.

  • Diversions (D) into, or out of, each lake were provided by the Coordinating Committee31. Diversions include the Long Lac and Ogoki Diversion into Lake Superior, the Chicago Diversion from Lake Michigan-Huron, and the Welland Canal, which diverts water from Lake Erie to Lake Ontario.

  • Connecting channel flows (I or Q) are obtained from two independent data sources. The first dataset was estimated by the Coordinating Committee using a variety of methods such as stage-fall discharge equations or aggregation of discrete flow measurements31. The second dataset was measured using Acoustic Doppler Velocity Meters, which were installed across International Gauging Stations (IGS) maintained by the United States Geological Survey and Water Survey Canada31.

  • Over-lake precipitation (P) is obtained from four data sources: (i) the NOAA-GLERL Great Lakes Monthly Hydrometeorological Database (GLM-HMD)43; (ii) output of the Great Lakes Advanced Hydrologic Prediction System (AHPS)53, which is operated by the United States Army Corps of Engineers (USACE); (iii) National Weather Service Multisensor Precipitation Estimates (NWS MPE)54; and (iv) Meteorological Service of Canada’s Canadian Precipitation Analysis (CaPA)40,55.

  • Over-lake evaporation (E) is obtained from three data sources: (i) the NOAA-GLERL GLM-HMD43; (ii) output of the USACE AHPS41; (iii) the Environment and Climate Change Canada’s Water Cycle Prediction System (ECCC WCPS)39; and output of the NOAA-GLERL Finite-Volume Community Ocean Model (FVCOM)56.

  • Tributary lateral runoff (R) is obtained from three data sources: (i) the NOAA-GLERL GLM-HMD43; (ii) output of the USACE AHPS41; and (iii) the ECCC WCPS39.

Table 1 provides a summary of these data sets and indicates which data set was used to estimate the prior distributions and likelihood functions. Besides the data sets included in Table 1, there are other regional31 and global57,58 data products that have been identified for potential applications of the L2SWBM on the water balance of the Great Lakes (and other large lakes) in the future.

Table 1 Summary of data sets, and an indication of which were used to calculate the prior probability distribution and likelihood functions, for each of the water balance components including over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), lateral runoff (denoted as R), inflow through main channels from upstream lake (denoted as I), outflow through main channels (denoted as Q), diversion (denoted as D) and lake storage (denoted as H). Note that only data from 1950 to 2019 was used in this study.

The Large Lakes Statistical Water Balance Model (L2SWBM)

The L2SWBM uses a conventional water balance model to constrain component estimates, ensuring that the water balance can be closed over multiple timespans for the Great Lakes system. For Lake Superior, Lake Michigan-Huron, Lake Erie, and Lake Ontario, changes in storage over a specific time window were defined using Eq. (1).

$$\Delta {H}_{j,w}={H}_{j+w}-{H}_{j}=\mathop{\sum }\limits_{i=j}^{j+w-1}\left({P}_{i}-{E}_{i}+{R}_{i}+{I}_{i}-{Q}_{i}+{D}_{i}+{\varepsilon }_{i}\right)$$

where: ΔH: change in lake storage over w months, i.e. from month j to month j + w (mm);

P: over-lake precipitation (mm);

E: over-lake evaporation (mm);

R: lateral tributary lake inflow (mm);

I: inflow from upstream lake (m3/s);

Q: outflow to downstream lake (m3/s);

D: inter-basin diversions (to or from a specific lake) and consumptive uses (m3/s);

ε: process error term representing water level changes not explained by the other components (mm) such as ground-water fluxes or glacial isostatic rebound59.

We note that the L2SWBM code converts I, Q, and D from flow rate (m3/s) to lake-depth (mm) using lake surface area whenever required (e.g., the unit of millimetre is required to calculate water balance closure). The sign of D depends on whether water is diverted to (positive values) or from (negative values) a specific lake. In addition, this study used a rolling window of w = 12, which generally leads to better results regarding water balance closure48,49.

Over Lake St. Clair, which has a substantially smaller surface area relative to the other four lakes, the combined effect of inflow (from Lake Michigan-Huron via the St. Clair River) and outflow (to Lake Erie via the Detroit River) generally dominates the hydrologic cycle. Therefore, only net basin supply (NBS = PE + R) was modelled, and the water balance equation for Lake St. Clair was modified as below.

$$\Delta {H}_{j,w}={H}_{j+w}-{H}_{j}=\mathop{\sum }\limits_{i=ji=j}^{j+w-1}\left(NB{S}_{i}+{Q}_{MH{U}_{i}}-{Q}_{i}+{\varepsilon }_{i}\right)$$

where QMHU is the outflow from Lake Michigan-Huron while the other variables are defined following those of Eq. 1.

Each water balance component was then inferred through a Bayesian approach, in which the “true” value of a variable (e.g., over-lake precipitation for Lake Superior) at a specific time-step (e.g., Jan 2019) was probabilistically estimated using a prior probability distribution and likelihood functions parameterized from multiple independent data sources. The following section will describe our approach to parameterizing the Great Lakes water balance using the L2SWBM. It is informative to note that the following sections share some similarity to the recent publication on the L2SWBM48. However, we also included more details on specific modifications (e.g., data used to derive the L2SWBM parameters) in our application to derive a seventy-year long record for the Great Lakes water balance.

Prior distributions of water balance components

We first modelled each water balance component with a probability distribution family, representing a “prior belief” of the possible range of values. The parameters of these distributions were empirically estimated from historical data spanning from 1950 to 2019 (presented in Table 1). Specifically, over-lake evaporation (E), connecting-channel inflow (I), connecting-channel outflow (Q), diversions (D) as well as net basin supply (NBS; for Lake St. Clair) corresponding to each calendar month m (\(m\in [1,12]\)) were modelled with a normal distribution:

$$\pi \left({E}_{m}\right)={\rm{N}}\left({\mu }_{E,m},{\tau }_{E,m}/2\right)$$
$$\pi \left({I}_{m}\right)={\rm{N}}\left({\mu }_{I,m},{\tau }_{I,m}\right)$$
$$\pi \left({Q}_{m}\right)={\rm{N}}\left({\mu }_{Q,{m}_{t}},{\tau }_{Q,m}\right)$$
$$\pi \left({D}_{m}\right)={\rm{N}}\left({\mu }_{D,{m}_{t}},{\tau }_{D,m}\right)$$
$$\pi \left(NB{S}_{m}\right)={\rm{N}}\left({\mu }_{NBS,m},{\tau }_{NBS,m}\right)$$

where the mean (μ) and precision (τ) parameters were calculated empirically from historical data. The use of the precision (\(\tau =1/{\sigma }^{2}\)) rather than the variance (\({\sigma }^{2}\)) in this study is the conventional practice for Bayesian inference60. We note that the precision of the prior probability distribution for E was divided by two (i.e., the variance was doubled) as showed in Eq. 3. This modification allowed a broader range of feasible values to account for a potential shift of evaporation in a warming climate48.

Lateral runoff (R) drained to each lake from the corresponding basin for each calendar month m (\(m\in [1,12]\)), which is always positive, was then modelled with a lognormal prior probability distribution:

$$\pi \left({R}_{t}\right)={\rm{LN}}\left({\mu }_{{\rm{ln}}(R),m},{\tau }_{{\rm{ln}}(R),m}\right)$$

where t is a specific time step, and prior mean (\({\mu }_{{\rm{ln}}(R),m}\)) and precision (\({\tau }_{{\rm{ln}}(R),m}\)) were calculated for each calendar month m using historical data records for that month. For example, at time step (t) January 2019, we have m equals 1 and the lateral runoff is modelled using mean and precision calculated from all observed January runoff values.

We modelled over-lake precipitation (P) using a gamma probability distribution, where the distribution parameters for each calendar month m were also calculated empirically from historical data.

$$\pi \left({P}_{m}\right)={\rm{Ga}}\left({\psi }_{m}^{1},{\psi }_{m}^{2}\right)$$

The shape (\({\psi }^{1}\)) and rate (\({\psi }^{2}\)) parameters of the gamma distribution were defined as below (following Thom61).

$${\psi }_{m}^{1}=\frac{1}{4{\phi }_{m}}\left(1+\sqrt{1+\frac{4{\phi }_{m}}{3}}\right)$$
$${\phi }_{m}=ln\left({\mu }_{P,m}\right)-{\mu }_{{\rm{ln}}(P),m}$$
$${\psi }_{m}^{2}={\psi }_{m}^{1}/{\mu }_{P,m}$$

where \({\mu }_{P,m}\) (Eq. 11), and \({\mu }_{{\rm{ln}}(P),m}\) (Eq. 12) are respectively the mean of historical precipitation, and the mean of the logarithm of precipitation for calendar month m.

The error term ε in Eq. 1 and Eq. 2 was also modelled using a vague normal prior probability distribution following Gelman62 across all calendar months:

$$\pi \left({\varepsilon }_{m}\right)={\rm{N}}\left(0,0.01\right)$$

Likelihood functions for analysis period

To derive the likelihood functions for the analysis period, data from multiple data sources spanning over the 1950–2019 period was used (note that the temporal coverage varies substantially across the data sets, as presented in Table 1).

For changes in lake storage over a period of w months, the likelihood function was defined as:

$${y}_{\Delta {H}_{j,w}}={y}_{{H}_{j+w}}-{y}_{{H}_{j}} \sim N\left(\Delta {H}_{j,w},{\tau }_{\Delta {H}_{j,w}}\right)$$

in which the observed change in storage over a rolling window of length w months (\({y}_{\Delta {H}_{j,w}}\)) is the difference between water level measurements (yH) at the beginning of month j + w and month j. We modelled this value with a normal distribution with mean \(\Delta {H}_{j,w}\) and precision \({\tau }_{\Delta {H}_{j,w}}\).

The likelihood functions for water balance components on the right hand side of Eq. 1 (Eq. 2 for Lake St. Clair) follow a normal distribution:

$${y}_{t,\theta }^{n} \sim {\rm{N}}\left({\theta }_{t}^{n}+{\eta }_{\theta ,{m}_{t}}^{n},{\tau }_{t,\theta }^{n}\right)$$

where \(\theta \in (P,E,R,I,Q,D,NBS)\), \({y}_{t,\theta }^{n}\) is data source \(n\in [1,N]\) for component θ (which has N independent data sources) at time step t (e.g., t = Jan 2019; note that mt = 1 in this case); \({\eta }_{\theta ,{m}_{t}}^{n}\) is the bias of data source number nth in calendar month m (\(m\in [1,12]\)) and \({\tau }_{t,\theta }^{n}\) is the precision of data source number nth at time step t.

Similar to other applications of the L2SWBM50,51, the precision of changes in lake storage (\({\tau }_{\Delta {H}_{j,w}}\)) and the precision of data sources of each water balance component over each time step (\({\tau }_{t,\theta }^{n}\); \(\theta \in P,E,R,I,Q,D,NBS\)) were modelled with a gamma prior probability distribution with both shape and scale parameters equal 0.1:

$${\tau }_{\Delta {H}_{j,w}}={\rm{Ga}}\left(0.1,0.1\right)$$
$${\tau }_{t,\theta }^{n}={\rm{Ga}}\left(0.1,0.1\right)$$

Except for channel flows (of which the bias is relatively low for most lakes46), the bias of each contributing data set was modelled using a normal distribution with mean 0 and precision 0.01 (i.e., a standard deviation of 10):

$$\pi \left({\eta }_{\theta ,{m}_{t}}^{n}\right)={\rm{N}}\left(0,0.01\right)$$

Statistical inference of water balance components

To infer new estimates for the Great Lakes water balance over the 1950–2019 period, the L2SWBM was used to encode the prior distributions and likelihood functions estimated from available independent datasets into a JAGS (Just Another Gibbs Sampler) model inference routine63, which is an open-source successor to BUGS (Bayesian inference Using Gibbs Sampling)64. The ‘rjags’ package within the R statistical software environment65 was then used to simulate the JAGS model over 1,000,000 Markov Chain Monte Carlo (MCMC) iterations using two parallel MCMC chains. We omitted the first 500,000 iterations as a “burn-in” period. The remaining 500,000 iterations were then thinned at 250-iteration intervals to retain the final subset of 2,000 iterations. The 95% credible interval of the final subset was used to infer a feasible range for each water balance component.

It is informative to note that historical estimates of over-lake evaporation are not readily available from 1900 to 1949. As a result, this study only used data spanning from 1950 to 2019 to inform the statistical inference. To ensure that no observation was used to estimate both the prior distribution and the likelihood function (and thus would be favoured by the L2SWBM) at any specific time step, we used the following approach to derive the prior distributions:

  • For the analysis period from 1950 to 1984: the prior distributions are generated from historical data covering the 1985–2019 period.

  • For the analysis period from 1985 to 2019: the prior distributions are generated from historical data covering the 1950–1984 period.

Data Records

The new estimates of the Great Lakes water balance, together with the L2SWBM source code and inputs synthesized for this project (monthly data available up to December 2019 depending on variables), are compressed as multiple zip-archives that are available for download66. The total file size of the dataset is approximately 4 MB, and contains:

  1. (i)

    The L2SWBM source codes, stored in multiple R-script files, together with the R-script of the BUGS model and the model configuration file (accompanied with a text file explaining the variables of this configuration file). The configuration file can be adjusted to include more data sources or focus on a different analysis period. The abovementioned files were compressed into a zip archive named as “”.

  2. (ii)

    Inputs of the L2SWBM. These independent data records were used to derive the prior distribution and likelihood functions for each of the variable. Data for each variable of a specific lake is stored in a separate csv file. All inputs files were compressed into a single zip archive named as “”.

  3. (iii)

    Outputs of the L2SWBM. The L2SWBM generated multiple outputs that are organized as four separate folders (each folder was compressed into one single zip archive). Table 2 provides a description of the data available as well as naming convention of these outputs.

    Table 2 Description and naming convention of outputs generated by the L2SWBM.

Technical Validation

Figure 3 provides a visual assessment of a representative time series of inferred values (95% credible interval of L2SWBM simulations) of storage changes and water balance components for Lake Superior from 2015 to 2019. We note that the published dataset66 also contains the graphs for each of the decadal periods (e.g., 1950–1959) across all lakes. The results in Fig. 3 (and other figures in Do, et al.66) indicate the presence of important differences among the historical data sets. For instance, there are substantial differences between over-lake precipitation (the top panel) aggregated from ECCC CaPA gridded product and that available in a legacy dataset (USACE AHPS).

Fig. 3
figure 3

Comparison between the newly-derived water balance components generated by the L2SWBM (vertical grey bars) and corresponding observations from independent data sets (horizontal dashes) for Lake Superior from 2015 to 2019. From top to bottom: over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), lateral runoff (denoted as R), outflow (denoted as Q), diversions (denoted as D) and changes in lake storage (denoted as ΔH). All of the included data sets are made available in Do, et al.66. Figures for each of the decadal periods (e.g., 1950–1959 or 1960–1969) across all lakes are also available in Do, et al.66.

The Bayesian inferred values (the vertical grey bars in Fig. 3) show generally consistent seasonal and inter-annual patterns, but also contain important differences relative to the other datasets. We also note that estimates over downstream lakes (i.e., Lake Erie and Lake Ontario) show generally higher uncertainty relative to the upstream lakes (i.e., Lake Superior and Lake Michigan-Huron), potentially due to the accumulation of uncertainty of the model simulations.

It is informative to note that the bias in channel flows (i.e., inflow, outflow and diversion) was not modelled by the L2SWBM, owing to relatively reliable records of river discharge comparing to estimates of other variables such as over-lake precipitation46. In addition, only one data source is available to represent changes in lake storage. As a result, estimates of channel flows and lake storage changes appear to be well constrained across all time steps relative to the other variables. Moving forward, we plan to assimilate new global data products of water level (e.g. water level derived from Gravity Recovery and Climate Experiment data57), and river flow simulated by hydrological models (e.g. WRF-Hydro model67) to provide a more holistic view of the uncertainty associated with L2SWBM estimates of these variables.

Tables 36 provide an overview of the new estimates of over-lake precipitation, over-lake evaporation, and lateral runoff across Lake Superior (Table 3), Lake Michigan-Huron (Table 4), Lake Erie (Table 5), and Lake Ontario (Table 6). Here we calculated the mean and the standard deviation of the median (denoted as MED) and the 95% credible interval (denoted as CI) estimated by the L2SWBM for each calendar month. Over-lake precipitation tends to have the highest inter-annual variations (indicated by a high standard deviation of MED) in Lake Superior and Lake Michigan-Huron, while lateral runoff has the highest inter-annual variation in Lake Erie and Lake Ontario, of which the ratio of land area to lake area is relatively high32.

Table 3 The mean and standard deviation (values inside the brackets) of the median (denoted as MED) and the 95% credible interval (denoted as CI) of the L2SWBM inference for over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), and lateral runoff (denoted as R) over Lake Superior. The mean and standard deviation were calculated for each calendar month.
Table 4 The mean and standard deviation (values inside the brackets) of the median (denoted as MED) and the 95% credible interval (denoted as CI) of the L2SWBM inference for over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), and lateral runoff (denoted as R) over Lake Michigan-Huron.
Table 5 The mean and standard deviation (values inside the brackets) of the median (denoted as MED) and the 95% credible interval (denoted as CI) of the L2SWBM inference for over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), and lateral runoff (denoted as R) over Lake Erie.
Table 6 The mean and standard deviation (values inside the brackets) of the median (denoted as MED) and the 95% credible interval (denoted as CI) of the L2SWBM inference for over-lake precipitation (denoted as P), over-lake evaporation (denoted as E), and lateral runoff (denoted as R) over Lake Ontario.

The CI of the new estimates is generally consistent across all time steps, indicated by a relatively small value of both the mean and the standard deviation. Our calculated uncertainties in each water balance component are susceptible to both the a priori range of values for that component, and to the range of variability in assimilated estimates. Consequently, some estimates, such as over-lake evaporation, can have low uncertainty values because evaporation has a very strong seasonal cycle, with very low values in the summer months. In future research, we intend to experiment with different expressions of the a priori water balance uncertainty to determine whether they impact the uncertainties of the L2SWBM estimates.

To assess long-term water balance closure, we also compared the cumulative changes in lake storage simulated by the L2SWBM with those obtained from observed data. Figure 4 shows the results of this comparison over the 2015–2019 period, indicating the capacity of L2SWBM estimates to close the water balance over consecutive periods of 1-, 12-, or 60-months.

Fig. 4
figure 4

Water balance closure assessment using our new L2SWBM water balance estimates across the Great Lakes from 2015 to 2019. Vertical grey bars represent simulated cumulative changes (95% posterior predictive intervals) while black points represent observed cumulative changes in storage over one month (top panels), 12 month (middle panels), and 60 month periods (lower panels). Note that the range of the y axis varies across different rolling windows.

The ability of the new estimates to reconcile the water balance provides a potential pathway towards improved understanding of hydrologic response to long-term climate variability. In addition, the uncertainties of water balance components inferred through the new estimate could be used to identify the time windows that need additional information such as new simulations using state-of-the-art hydrological models.