Introduction

Phytoplankton blooms with the potential to harm aquatic ecosystems and human health are referred to as Harmful Algal Blooms (HABs). Most HAB phenomena are caused by blooms of microscopic algae or phytoplankton (including certain cyanobacteria), although the term also applies to blooms of macroalgae (seaweeds)1. HABs are of worldwide concern due to their potential to harm the environment and human health2. They also have tremendous economic impacts, with over $4.6B per annum in estimated damages within the United States alone3.

The Laurentian Great Lakes of North America contain approximately 95% of the surface freshwater supply in the United States4. The Great Lakes are also the world’s largest and most biologically diverse freshwater resource5. Lake Erie is the fourth-largest by area (surface area = 25874 km2), the shallowest (mean depth = 19 m), and the smallest by water volume (484 km3). It is also the southernmost of the Great Lakes, and its watershed contains both urbanized environments that include large manufacturing facilities and rural environments that include extensive areas of agriculture. Hence, Lake Erie (Fig. 1a) is severely impacted by excessive nutrient loading, especially from its western basins6. Over the past decades, western Lake Erie has witnessed increasingly intense algal blooms, including cyanobacterial blooms7,8,9. From analyses of long-term trends in nutrient loading and meteorological conditions (including severe springtime precipitation events coupled with long-term trends in agricultural land use and practices), it was determined that the 2011 extreme HAB (Fig. 1a) in western Lake Erie was largely driven by trends in these factors10. The bloom extended more than 16 km from shore, and in the central basin it was observed at a depth of 60 m. The bloom was also dense enough to hamper navigation by disrupting boat motors.

Fig. 1: Satellite image of algal bloom over Lake Erie and seasonal prediction of HABs severity index.

a Optical three-band visible image from the Envisat satellite (MERIS_FP) at 300 m resolution showing one of the largest algae blooms from October 2011 extended from Toledo to beyond Cleveland and along the Ontario shore (courtesy of European Space Agency). Data from three visible bands were used each with a bandwidth of 10 nm (442.5, 460, and 665 nm band centers). b HABs severity index (SI) from observation (black bar), prediction using GA-chem model that is based on nutrient loading only (blue bar), prediction using GA-clim model that is based on climate indices only (green bar) and GA-chem-clim that is based on nutrient loading and climate indices (red bar).

A seasonal severity index (SI) for HABs in Lake Erie has been made available by the National Oceanic and Atmospheric Administration (NOAA) since the 2002 season9, and contains estimates for the maximum bloom biomass seen for the peak 30 days of a bloom. Stumpf et al.9 found that a blended model incorporating springtime discharge and measured total phosphorus (TP) appeared sufficient to predict the bloom magnitude in western Lake Erie. However, the linear and non-linear regression models based on TP and water discharge alone could not successfully explain some extreme HAB events (e.g., the extreme HAB event in 2011). It is therefore hypothesized that some other unknown biological, chemical, hydrodynamical, and meteorological factors may be responsible for this unexplained variance.

Previous studies have analyzed the relationships between algal growth rates and toxin production in response to temperature and inorganic nutrients11,12,13,14. However, the relationship between HABs and large-scale atmospheric conditions has rarely been addressed. Nonetheless, the identification and understanding of atmospheric conditions favoring HABs could be essential to predict these events. Baez et al.15 showed that HAB events in southwest Europe are linked to atmospheric teleconnections, which in turn are correlated with hydrographical conditions favoring HABs. Phlips et al.16 examined the roles that hurricanes and El Niño play in contributing to HAB events in two sub-tropical Florida estuaries. They found that an increase in rainfall associated with hurricanes and El Niño resulted in increased nutrient loads, ultimately driving HABs within the Indian River Lagoon and St. Lucie Estuary. Previous studies have also explored the relationship between interannual and multi-decadal variability of atmospheric circulation and the occurrences of HABs (e.g., refs. 17,18,19,20,21). Zhou et al.22 analyzed bloom extent in Lake Erie and developed a model for explaining interannual variability of Lake Erie hypoxia. Their model, based on four variables (April to June tributary discharge, May to July soluble reactive phosphorus (SRP) loading, July wind stress, and June northwesterly wind duration) can explain 82% of the interannual variability of hypoxia. They discussed the need for meteorological factors in the development of nutrient management strategies by showing that a discharge-only model explained only 39% of the variability (i.e., if the meteorological factors were not included in the model).

It is widely accepted that the occurrence of HABs in freshwater bodies results from complex non-linear interactions between chemical, biological, hydrological, and meteorological processes that take place at different spatial and temporal scales. Data-driven machine learning techniques may help understand these complex non-linear interactions and develop predictive modeling that may capture these non-linear interactions in a way that would not be possible using statistical or dynamical models. A recent review paper23 discussed the advances made in the machine learning methods for predicting HABs and shellfish biotoxin contamination, with a particular focus on autoregressive models, support vector machines, random forest, probabilistic graphical models, and artificial neural networks (ANNs). Likewise, most measures of local conditions used by ecologists fail to capture complex associations between weather and ecological processes24. As a result, large-scale, seasonal indices of climate spanning several months can outperform local climatic factors indicating the importance of climatic influences on population ecology25.

In this study, we develop a machine learning-based predictive model of the Lake Erie SI. Building on previous approaches (e.g., ref. 26) that ingest observed nutrient information only, we modify this approach by including a set of physically-based, easily-available, large-scale climate indices as potential predictors for the seasonal prediction of HABs. We believe that the present study is the first to use a machine learning-based approach that uses climate indices in combination with nutrient loading to predict the seasonal SI of HABs.

Results

Our predictive model targets the SI9, which is estimated from the peak three 10-day periods of bloom biomass in Lake Erie and is available from https://www.weather.gov/cle/LakeErieHAB. The SI is a non-dimensional positive number for each season (Supplementary Fig. S1, blue line). During the observed period of 2002–2020, the highest SI (10.5) was observed in 2015, and the lowest SI (0.3) was observed in 2002 and 2005.

For the seasonal prediction of HABs in Lake Erie, we use Genetic Algorithm (GA; see details in “Materials and methods”) models. We used a jackknife procedure to assess the uncertainty of the predictors, as well as the predictions. For this, we used 18 years of data for training the model and one left-out year for cross-validation, and repeated the process so that each year was left out once. Thus, at the end of the jackknife procedure, we have 19 models and 19 cross-validations. In the GA-chem model, which is based on nutrient loading only (Supplementary Table S1), a total of eight parameters from March to May were used as the model input: total phosphorus (TP), SRP, total Kjeldahl nitrogen (TKN), total nitrogen (nitrite + nitrate; TN), chlorides (CL), total suspended solids (TSS), silicon dioxide (SiO2, referred to here as SD), and sulfate (SO4, referred to here as SL). Out of these eight parameters, accumulated TP [March–May] was selected by 12 models, followed by accumulated SRP [March–May] (selected by 8 models), accumulated TSS [March–May] (selected by 6 models), accumulated SD [March–May] (selected by 3 models), accumulated SL [March–May] (selected by 2 models), and accumulated CL [March–May] (selected by 1 model). There is little to no data available before March because of low discharge due to freezing conditions.
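
The leave-one-out loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a one-predictor least-squares fit (`fit_line`, a hypothetical stand-in) replaces the GA search, and the loading and SI values are synthetic rather than the Lake Erie record.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (a stand-in for the GA search)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def leave_one_out(years, x, y):
    """Train on n-1 seasons, predict the held-out season, repeat for every season."""
    predictions = {}
    for i, year in enumerate(years):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        predictions[year] = a + b * x[i]
    return predictions


# Synthetic, illustrative data only (NOT the observed loading or SI values).
rng = random.Random(0)
years = list(range(2002, 2021))                      # 19 seasons
tp_load = [rng.uniform(200.0, 1200.0) for _ in years]  # hypothetical units
si_obs = [0.008 * tp + rng.gauss(0.0, 1.0) for tp in tp_load]

loo_pred = leave_one_out(years, tp_load, si_obs)
print(len(loo_pred), "cross-validated predictions")
```

Each of the 19 passes yields one fitted model and one out-of-sample prediction, which matches the count of models and cross-validations stated in the text.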

The GA-chem-clim model, which is based on chemical loading and climate indices (Table 1), selected TP in combination with three monthly climate indices: the Pacific-North American (PNA, Wallace and Gutzler27) pattern, the El Niño-Southern Oscillation (ENSO; Southern Oscillation Index or SOI, Ropelewski and Jones28), and the Pacific Decadal Oscillation (PDO, Mantua et al.29). These were chosen from an input set of climate indices consisting of the Arctic Oscillation Index (AO), PNA, Quasi-biennial Oscillation (QBO), Atlantic Multidecadal Oscillation (AMO), ENSO SOI Index, Niño index (Niño3.4), North Atlantic Oscillation (NAO), and PDO. Out of these potential predictors, the ENSO (SOI) index from April, PNA from December, PDO from November, and chemical loadings of TP from March–May turned out to be the most useful predictors.

Table 1 Performance (root mean square error (RMSE) and coefficient of determination (R2)) of GA-chem-clim (using nutrient loading and climate indices) developed using Genetic Algorithm under the jackknife procedure.

For the GA-clim model (Supplementary Table S2), which is based on the climate indices only, the prediction is made using the three climate indices ENSO (SOI), PNA, and PDO. Qin et al.30 in their study of extreme climate anomalies enhancing the cyanobacterial blooms in Lake Taihu, China found that a recorded algal bloom in 2017 was connected with the 2016–2017 El Niño, AMO and PDO. They found the persistent warmth during 2016–2017 was related to the warm phases of AMO and PDO. McKibben et al.31 showed evidence of climatic regulation of domoic acid in shellfish over the past 20 years in the Northern California Current regime. They found that the timing of elevated domoic acid is strongly related to warm phases of the PDO and the Oceanic Niño Index, an indicator of El Niño events.

Seasonal prediction of HABs severity index using nutrient loading

Predictive modeling for seasonal severity of HABs (HABs SI) using the GA-chem model (Supplementary Table S1) was developed using the observations of integrated chemical loading (March to May for 2002–2020) for Lake Erie with the aim that the seasonal forecast would be made available in early June.

Using the jackknife method, TP was found to be the dominant predictor by the GA-chem models. The interannual variability of TP, the predictor selected by the largest number of GA-chem models, is shown in Supplementary Fig. S1. A systematic relationship between the HABs SI and TP is evidenced by a high correlation between them (0.59), which is consistent with other studies9. Based on the predictors selected by the GA-chem model, the monthly variation in the most dominant predictors, TP and SRP, was examined (Supplementary Fig. S2). We found that the correlation between TP loading and HABs SI was 0.31 in March, 0.43 in April, and 0.37 in May. The correlation between HABs SI and SRP was 0.41 in March, 0.35 in April, and 0.34 in May (Supplementary Table S3). Apart from these two dominant predictors, we found SD and TSS to be two other potential predictors. In our GA-chem experiments, the selection of SD as a potential predictor was novel (to the best of our knowledge): while past studies have found a connection between SD and algal blooms32,33, these were primarily diatom blooms. Previous work on diatom blooms in Lake Erie (e.g., ref. 34) found that rapid sedimentation of diatoms may contribute to hypoxic zones near the lake bottom. Similar hypoxic zones have been associated with increased internal loading of phosphorus in Lake Erie35,36,37. Given the importance assigned to SD as a predictor by the GA-chem model, it is possible that an increase in SD leads to an indirect increase in cyanobacteria biomass via internal loading of phosphorus resulting from increased diatom sedimentation rates. TSS is the most visible indicator of water quality and can come from soil erosion, runoff, discharges, disturbed bottom sediments and algae. As such, finding TSS in some of the predictive models is not a surprise. There are two other predictors, SL and CL, which are not discussed here since they were selected by only two models (SL) and one model (CL) in the jackknife method.

Based on the two potential predictors (TP and SRP) and four other predictors (TSS, SL, CL, and SD) selected by some GA-chem predictive models, HABs SI simulation (Fig. 1b) was successful for the extreme year of 2011 (with an overprediction in HABs SI by 1.12). For another extreme year, 2015, the GA-chem model missed the observed HABs SI by 7.8 units. The Maumee River watershed reportedly received about eight inches more rain than normal in June 2015 (https://www.washingtonpost.com/news/capital-weather-gang/wp/2015/11/12/this-years-disgusting-green-algal-bloom-in-lake-erie-was-the-most-severe-on-record/). It was also the fourth wettest June in Toledo, Ohio, and one of the top 20 wettest months since records began in 1880. As a result, our GA-chem model, which lacked nutrient loading data from June, may have performed poorly. Therefore, to test the impact of nutrient loading from June, a GA-chemplus model was developed, which is similar to GA-chem but makes use of chemical loading data from March to June (instead of March to May as in GA-chem). Using GA-chemplus, the prediction of HABs SI for 2015 improved considerably compared to GA-chem (observed value: 10.5; predicted value using GA-chemplus: 10.28; predicted value using GA-chem: 2.70). For the entire time period (2002–2020), the root mean squared error (RMSE) of predicted HABs SI using GA-chem is 2.67 and the correlation with the observed HABs SI is 0.53 (Table 2). However, for years with HABs SI > 7, the performance of GA-chem deteriorates, as evidenced by a higher RMSE (4.02). The standard deviation of the observed HABs SI is 3.17, compared to 2.74 for the GA-chem model. To further improve the prediction of SI, we look at predictors other than nutrients and discharge data.
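
The skill measures quoted above (RMSE, correlation, and RMSE restricted to years with SI > 7) can be computed as in the following sketch. The function names are hypothetical, the toy numbers are illustrative only, and the authors' own evaluation code is not shown in the source.

```python
import math


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pearson_r(obs, pred):
    """Pearson product-moment correlation coefficient."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


def rmse_above(obs, pred, threshold=7.0):
    """RMSE restricted to seasons whose observed SI exceeds the threshold."""
    pairs = [(o, p) for o, p in zip(obs, pred) if o > threshold]
    if not pairs:
        raise ValueError("no observations above threshold")
    high_obs, high_pred = zip(*pairs)
    return rmse(high_obs, high_pred)


# Toy values for illustration only (not taken from Table 2).
toy_obs = [1.0, 8.0, 9.0]
toy_pred = [2.0, 6.0, 9.0]
print(rmse(toy_obs, toy_pred), pearson_r(toy_obs, toy_pred), rmse_above(toy_obs, toy_pred))
```

Reporting the thresholded RMSE alongside the all-years RMSE makes it visible when a model fits typical seasons well but degrades on severe ones.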

Table 2 Performance of GA-chem (based on chemical loading data for March–May), GA-clim (based on climate indices data for November (previous year)–May (current year)), and GA-chem-clim (based on chemical loading data for March–May, and climate indices data for November–May) against observed HABs severity index during 2002–2020.

Seasonal prediction using nutrient plus climate indices

Large-scale circulation patterns are known to have an important association with the hydrometeorological conditions for the Great Lakes, and hence, may be considered as potential predictors for HABs. ENSO is the dominant climate mode on the planet38, and its associated teleconnection pattern, i.e., the PNA teleconnection27,39, provides the greatest potential for seasonal prediction of hydrometeorology over the Great Lakes40,41,42. Similar to GA-chem, a set of predictive models, using the jackknife method, was developed using climate indices only (GA-clim, Supplementary Table S2). We found that PDO, PNA and ENSO (SOI) were the most important predictors of all the climate indices included in this analysis. The performance of GA-clim for all events is comparable to GA-chem (RMSE of GA-chem: 2.67, GA-clim: 2.52) and R2 for each predictive model is in the range 0.29–0.69 for GA-chem and 0.49–0.63 for GA-clim (Supplementary Tables S1, S2).

Notaro et al.41 examined the relationship between large-scale circulation and the regional climate of Northeast United States for early winter using regional climate simulations and observations. They found that during positive PNA, there is greater subsidence over land, lower atmospheric moisture content and higher stability, resulting in reduced total simulated cloud cover, precipitation and snowfall. Also, during positive PNA, lower surface air temperature over land resulted in higher simulated cloud cover over the Great Lakes and the Atlantic Ocean due to increased thermal contrast. The time-series of ENSO (SOI) and PNA (Supplementary Fig. S3) indicate that they were mostly in-phase in some years, but were out-of-phase in other years. Thus, no linear relationship of these indices with the severity of HABs over Lake Erie could be established. This finding emphasizes the importance of including climate indices in machine learning models to capture such non-linearity.

The improvement in prediction in GA-chem-clim compared to GA-chem is evident in Fig. 1b. The RMSE is reduced from 2.67 in GA-chem to 2.26 in GA-chem-clim (Table 2). At the same time, the correlation between observed and simulated SI increased from 0.53 in GA-chem to 0.67 in GA-chem-clim. Moreover, the observed standard deviation of SI was captured better by GA-chem-clim than by GA-chem (Table 2), indicating that GA-chem-clim can represent the observed variation in HABs better than GA-chem. Considering years with HABs SI > 7, GA-chem-clim also performs better than GA-chem (Table 2); this is particularly evident in 2017 (Fig. 1b; Obs: 8, GA-chem: 5.65, GA-chem-clim: 8.29). This improvement is notable because one of the major problems noted for the Stumpf et al.9 model was the prediction in years with high HABs SI.

We further investigated large-scale atmospheric conditions to understand the role of climate variability (PNA and ENSO) and their connection to HABs. For this purpose, we analyzed the large-scale circulations based on data obtained from the European Center for Medium-Range Weather Forecasts (ECMWF) Reanalysis 5 (ERA5, Hersbach et al.43) for years with low SI (<2; 2002, 2005, 2006, and 2007) and high SI (>7; 2011, 2013, 2015, 2017, and 2019). The reasoning behind this categorization is to focus on the role of nutrient loading and atmospheric circulations for these extreme low and high HABs SI years.

The composites of large-scale circulation patterns from the winter months show distinct differences preceding years with low (SI < 2) and high (SI > 7) bloom severity. Here we analyze the geopotential height (the height of a pressure surface, e.g., 500 hPa, above mean sea level), which is closely related to wind anomalies through geostrophic balance and is often used to identify atmospheric teleconnection patterns. We find higher geopotential heights at 500 hPa over the northern U.S. (Fig. 2) during low severity years, which result in fewer cold air outbreaks in the winter months and, thus, higher temperatures. Meanwhile, the high-pressure anomaly center also tends to shift the jet stream southward, which leads to less snowfall over the Great Lakes area. The higher geopotential height anomalies over the U.S. can be traced back to the tropical Pacific as the positive PNA wave train, with low-pressure anomalies found over the Aleutian Islands region and high-pressure anomalies over the subtropical North Pacific. A positive PNA is typically observed during El Niño, agreeing with our findings of a prominent positive sea surface temperature (SST) anomaly in the central tropical Pacific (Fig. 3). In addition, the positive phase of PDO can also induce a positive PNA teleconnection pattern44,45. Conversely, lower geopotential heights are found during winter months prior to high severity blooms, which allowed greater intrusions of Arctic air masses during the winters preceding years with a high HAB severity index. Over the Pacific Ocean, high-pressure anomalies weaken the Aleutian Low while low-pressure anomalies are found over and to the west of the Hawaiian Islands, accompanied by prominent negative SST anomalies in the central equatorial Pacific. These large-scale climate conditions are associated with the negative PNA and La Niña, which lead to cooler temperatures and greater snowfall over the U.S. (Fig. 4) in the winter, resulting in more runoff in the summer and potentially stronger bloom seasons.
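
The compositing step behind these comparisons (subtract the 2002–2020 climatology, then average over the low-SI or high-SI years) can be sketched as below. This is a simplified illustration, not the authors' workflow: it uses a single hypothetical grid-point value per winter instead of a gridded ERA5 field, and the values are synthetic. The year lists are those given in the text.

```python
import random
from statistics import mean

LOW_SI_YEARS = [2002, 2005, 2006, 2007]          # SI < 2, as listed in the text
HIGH_SI_YEARS = [2011, 2013, 2015, 2017, 2019]   # SI > 7, as listed in the text


def composite_anomaly(field_by_year, composite_years, clim_years):
    """Mean anomaly over composite_years relative to the clim_years climatology."""
    climatology = mean(field_by_year[y] for y in clim_years)
    return mean(field_by_year[y] - climatology for y in composite_years)


# Synthetic winter-mean 500-hPa heights (m) at one hypothetical grid point.
rng = random.Random(1)
clim_years = range(2002, 2021)
z500 = {y: 5500.0 + rng.gauss(0.0, 30.0) for y in clim_years}

low_composite = composite_anomaly(z500, LOW_SI_YEARS, clim_years)
high_composite = composite_anomaly(z500, HIGH_SI_YEARS, clim_years)
print(low_composite, high_composite)
```

For a gridded field, the same two steps would be applied independently at every grid point to produce anomaly maps like those in Figs. 2 to 4.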

Fig. 2: Geopotential height and winds during years with mild and severe bloom activity.

a Anomalies of geopotential height (shaded, m) and winds (vector, m s−1) for December, January and February for years with HABs SI < 2 at 200 hPa. b Same as (a) but for HABs SI > 7. c Same as (a) but for 500 hPa. d Same as (c) but for HABs SI > 7. e Same as (a) but for 850 hPa. f Same as (e) but for HABs SI > 7. Anomalies were calculated after subtracting the climatology based on 2002–2020.

Fig. 3: Oceanic and atmospheric conditions during years with mild and severe bloom activity.

a Anomalies of SST (˚C) for December, January, and February for years with HABs SI < 2. b Anomalies of 500-hPa geopotential height (shaded, m) and 200-hPa wind (vector, m s−1) for December, January, and February for years with HABs SI < 2. c Same as (a) but for HABs SI > 7. d Same as (b) but for HABs SI > 7. SSTs were obtained from HadISST (Rayner et al.54), and geopotential height and winds were obtained from the ERA5 reanalysis. Anomalies were calculated after subtracting the climatology based on 2002–2020.

Fig. 4: Accumulated snow depth and near-surface temperature during years with mild and severe bloom activity.

a Anomalies of accumulated snow depth (m) for years with HABs SI < 2. b Same as (a) but HABs SI > 7. c Anomalies of accumulated snow depth (m) for HABs SI > 7 minus HABs SI < 2. d Same as (a) but for 2 m temperature (K). e Same as (b) but for 2 m temperature (K). f Same as (c) but for 2 m temperature (K). These anomalies for December, January, and February were calculated after subtracting the climatology based on 2002–2020.

Prediction of multiple years using nutrient loading and climate indices

In the previous sections, we used the maximum available data length for training and a left-out year for prediction. To further test the capability of the GA models (GA-chem, GA-clim and GA-chem-clim) and the current approach, we reduced the data length for training: 79% (15 years) of the data were used for training and 21% (4 years) for prediction, and we performed four sets of experiments. These experiments were designed so that at least one extreme year is part of the prediction phase (see Table 3). Monthly accumulations of chemical loading data and monthly climate indices are used for the GA-chem and GA-chem-clim models. In all four experiments (Fig. 5), GA-chem overpredicts the low-extreme years and underpredicts the high-extreme years. The GA-chem-clim model improves the prediction compared to the GA-chem model, as shown in Table 3 by the reduction in RMSE for the GA-chem-clim models. In Table 4, we show the HABs SI functions for Experiments 1–4. Apart from PNA for December, we found TP for April and May, TSS for March, and CL for March and April as predictors for HABs SI. The priority given to monthly chloride loading by the GA-chem and GA-chem-clim models in Experiments 1–4 has important implications for phytoplankton dynamics in Lake Erie. There is strong evidence that large inputs of road de-icing salt (via spring runoff) can disrupt lake phytoplankton dynamics46. Some species of freshwater cyanobacteria, including Microcystis aeruginosa, have demonstrated high tolerances to elevated salinity, and increased chloride concentrations have been associated with cyanobacterial blooms47. Chloride loading in March–May may also be important as an indirect factor, as road salt application rates may function as a proxy measure for the severity of the preceding winter (consistent with the GA models selecting PNA for the month of December).

Table 3 Root mean squared error (RMSE) and mean bias between observations and HABs severity index for GA-chem and GA-chem-clim models for experiments 1 to 4.

Fig. 5: Prediction of HABs severity index using different input parameters.

a HABs severity index (SI) for Experiment-1 from observation (black bar), prediction using GA-chem model that is based on nutrient loading only (blue bar), prediction using GA-clim model that is based on climate indices only (green bar) and GA-chem-clim that is based on nutrient loading and climate indices (red bar). b Same as (a) but for Experiment-2. c Same as (a) but for Experiment-3. d Same as (a) but for Experiment-4. The years on x-axis show the predicted years for each of these four experiments.

Table 4 HABs SI functions for GA-chem, GA-clim and GA-chem-clim for the 4 experiments.

Although there is an overall improvement in the GA-chem-clim model compared to GA-chem (Figs. 1b, 5 and Tables 2, 3), there may be other environmental predictors that can add value to both these models. To test this hypothesis, we incorporated local data (2 m temperature and 10 m wind speed) at five locations surrounding Lake Erie (Toledo (83.4764°W, 41.5631°N), Cleveland (81.8528°W, 41.4050°N), Buffalo (78.7358°W, 42.9408°N), Delhi (80.5453°W, 42.8635°N), and Niagara District (79.1717°W, 43.1917°N)) in the GA models. The inclusion of wind speed (early April to mid-May) and temperature (early March to late April) in the GA models showed that they can play important roles in the prediction of HABs SI. Since these results are still preliminary, are based on local data from only five stations and two variables (temperature and wind speed), and need more experiments and further investigation, we do not report them here. Also, in this study, due to the small size of the dataset (2002–2020), the number of predictors is limited to avoid overfitting48. We hope to design further experiments in the future to select the predictors from chemical, climate and local parameters in order to improve the prediction of HABs SI.

Concluding remarks

The Lake Erie seasonal severity index was successfully predicted using a GA model with predictors selected from two types of data: chemical data and climate indices. Despite the limited availability (19 years) of the bloom severity data, our GA-chem-clim model provides a more accurate prediction than GA-chem, as seen in its lower RMSE and higher correlation (Tables 2, 3 and Figs. 1b, 5). Out of the five years with HABs SI > 7, the prediction for 2013 was the worst, with a negative bias of ~3.5 units by both GA-chem and GA-chem-clim. The comparison of performance by GA-chem, GA-clim, and GA-chem-clim (Supplementary Table S4) shows that the residual standard error for GA-chem-clim is lower than for GA-chem and GA-clim. The Pearson product-moment correlation and p-value further confirm the superior performance of the GA-chem-clim model. We further demonstrated that the GA-chem-clim model improves on the performance of the GA-chem model (as shown in Table 3) when predicting a set of four different years comprising low and high HABs SI.

To understand the potential role of climate indices for improved HABs prediction, we further analyzed the large-scale atmospheric circulations using reanalysis data. It was found that for the mild HABs years (SI < 2), the large-scale meteorological features are distinctly different from those for the severe HABs years (SI > 7). This suggests that large-scale circulation may affect the seasonal evolution of HABs in Lake Erie. Moreover, the seasonal prediction of HABs can be completed by early June (using GA-chem, GA-clim, and GA-chem-clim), before the expected peak in HAB activity during July to October. This improved early seasonal prediction can provide timely information to policymakers for adopting proper planning and mitigation strategies, such as restrictions on harvesting, and can help in monitoring toxins in shellfish to keep contaminated products off the market. The incorporation of climate indices in the GA models provides the flexibility of early prediction with a greater lead time and also shows improvement over the GA model that uses only chemical loading.

Overall, the present approach shows promise for early-season prediction of HABs SI in Lake Erie, and it would be interesting to apply the same approach to other water bodies. At present, this GA model is deterministic. It would be worthwhile to design a probability-based system that can provide the likely range of HABs severity and their probability of occurrence. Future work will incorporate a stochastic framework into our methodology to address these aspects.

Materials and methods

Machine learning techniques like the Genetic Algorithm (GA; refs. 49,50,51,52) are search heuristics based on the laws of natural selection, where a large number of randomly-generated mathematical functions are initially assumed to explain the relationship between predictor and predictand variables. New forms of functional relations evolve during an iterative process that involves reproduction and mutation among the fittest individual functions. After a large number of iterations (typically thousands), the best individual function emerges.

Predictive modeling for seasonal HABs using the GA was developed using observations of integrated chemical loading (March to May) into Lake Erie as well as the monthly large-scale climate indices for the seven months prior to June (i.e., from November of the previous year to May of the current year). The small number of observations (19, one event per year for the period 2002–2020) and of corresponding predictor observations compelled us to constrain the GA process to select only a few (4 to 5) predictors to avoid the so-called “curse of dimensionality” and the risk of overfitting48. GA procedures were initiated with 2000 size-constrained random functions, which were allowed to reproduce among one another, and mutate at certain random intervals, to produce increasingly fit individual functions in an iterative process. The fitness of a model is defined by how well its predictions match the observations of HABs SI. With each iteration, the GA procedure evolves and refines its functions. The mean of seasonal HABs SI in our limited data set was 4.72 with a standard deviation of 3.17 units. In our model development procedure, we considered only those models that achieved an error of 2.0 units or less in the GA procedure. This value represents the maximum RMSE from Table 1 and Supplementary Tables S1, S2. The GA iterative process continued until the fitness of the best-evolved function reached saturation below the predefined threshold, which took about 4000–6000 iterations.
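
The evolutionary loop described above (random initial population, fitness ranking, reproduction, mutation) can be illustrated with the minimal sketch below. It is not the authors' GA: as a simplifying assumption, each individual is a sparse linear model with at most four predictors rather than a free-form mathematical function, the population and iteration counts are much smaller than the 2000 functions and 4000–6000 iterations reported, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def rmse(model, X, y):
    """Fitness: RMSE of a sparse linear model (lower is better)."""
    bias, terms = model
    err = 0.0
    for row, target in zip(X, y):
        pred = bias + sum(w * row[j] for j, w in terms.items())
        err += (pred - target) ** 2
    return math.sqrt(err / len(y))


def random_model(n_features, max_terms, rng):
    """Random individual: a bias plus weights on 1..max_terms predictor columns."""
    cols = rng.sample(range(n_features), rng.randint(1, max_terms))
    return (rng.uniform(-1, 1), {j: rng.uniform(-1, 1) for j in cols})


def crossover(a, b, max_terms, rng):
    """Child inherits a size-capped random subset of both parents' terms."""
    merged = {**a[1], **b[1]}
    keep = rng.sample(sorted(merged), min(max_terms, len(merged)))
    return (rng.choice([a[0], b[0]]), {j: merged[j] for j in keep})


def mutate(model, n_features, rng, scale=0.1):
    """Perturb all weights; occasionally swap one predictor for another."""
    bias, terms = model
    terms = {j: w + rng.gauss(0, scale) for j, w in terms.items()}
    if rng.random() < 0.2:
        old, new = rng.choice(sorted(terms)), rng.randrange(n_features)
        if new not in terms:
            terms[new] = terms.pop(old)
    return (bias + rng.gauss(0, scale), terms)


def evolve(X, y, pop_size=200, generations=50, max_terms=4, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [random_model(n_features, max_terms, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: rmse(m, X, y))
        parents = population[: pop_size // 4]   # the fittest quarter survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, max_terms, rng), n_features, rng))
        population = parents + children
    return min(population, key=lambda m: rmse(m, X, y))


# Synthetic data: 19 "seasons", 8 candidate predictors (illustrative only).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(19)]
y = [2.0 * row[0] - 1.0 * row[3] + data_rng.gauss(0, 0.5) for row in X]

best = evolve(X, y)
print(sorted(best[1]), round(rmse(best, X, y), 3))
```

Capping the number of terms per individual plays the role of the size constraint mentioned in the text, limiting each model to a handful of predictors given only 19 observations.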

The paucity of observations of HAB events (only 19 seasons) does not permit the probability distribution of HAB to be represented. Further, a single predictive model may not be sufficient to explain the relationship between predictors and predictands. In order to analyze the bias and the variance of the GA-based prediction, we employed the jackknife method53. Using this method, we developed 19 predictive models, using 18 data points for training, leaving one data point (one season) for prediction. Hence, the mean of the jackknife variance for the 19 models was considered the variance of the GA model. With the availability of more observations in the future, this procedure may be repeated to obtain improved predictive models.
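
For reference, the textbook jackknife variance estimator is sketched below. The source does not spell out the exact computation the authors used, so this shows only the standard form, (n - 1)/n times the sum of squared deviations of the leave-one-out estimates from their mean; `jackknife_variance` is a hypothetical helper name.

```python
from statistics import mean


def jackknife_variance(data, estimator):
    """Standard jackknife variance of `estimator` evaluated on `data`."""
    n = len(data)
    # Re-evaluate the estimator n times, each time leaving one observation out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)


# Toy check: for the sample mean, the result equals (sample variance) / n.
toy = [1.0, 2.0, 3.0, 4.0, 5.0]
print(jackknife_variance(toy, mean))
```

With 19 seasons, the same routine would be called with an estimator that refits the model on the 18 retained seasons.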

Supplementary Tables S1, S2 show the performance of, and the predictors selected by, each of the 19 models developed by GA-chem and GA-clim, respectively. The mean jackknife standard error of the models was found to be 1.76, which is much smaller than the natural variability of the seasonal index record (3.17). The mean strength (equivalent to R2 in regression modeling) of the models was found to be 0.37. Within the 19 different data sets, the GA procedure picked different predictors. Integrated March-to-May TP, SRP, TSS, CL and SL loading were the most common predictors selected by GA-chem.

In the GA-clim models that used only climate indices data, the major predictors selected were April ENSO (SOI) index, December PNA, and November PDO. The mean strength of the models was found to be 0.57, and the standard error of prediction was 1.61 units.

As mentioned earlier, the combination of climatic indices with chemical loading led to enhancement of model strength (0.63), and reduction in standard error (1.43). This analysis indicates that a suitable combination of ENSO, PNA and other climate indices, and bio-chemical loading may be responsible for the observed variability of HABs over Lake Erie.