Background and Summary

Calcium concentration and pH are key determinants of many environmental and biological processes in freshwater ecosystems. Both variables regulate metabolic physiology in aquatic organisms, influencing reproduction, growth, and predator-prey interactions across a wide range of taxa including bacteria1,2, aquatic algae and diatoms3,4, molluscs5, crustaceans6, and fish4,7,8. Since differences in these parameters can lead to detectable biological effects on individuals, populations, and communities9,10, pH and calcium concentration can both be important predictors of species distributions11,12 and are often used to evaluate the risk of establishment for invasive species, such as dreissenid mussels13,14,15. The pH and dissolved calcium content of lakes influence their susceptibility to acidification16,17. They affect nutrient availability18,19 and play an important role in determining the environmental risks posed by metals and other contaminants by influencing their dissolution, mobilisation, bioavailability, and toxicity20,21,22, as well as mediating their adsorption and desorption by microplastics23,24.

For large-scale studies at regional, national, and continental levels, a common challenge facing freshwater researchers and resource managers is the limited availability of water quality data17, including calcium and pH. Such data are not readily available for all areas of North America, and given the large number of lakes, rivers, and other water bodies in Canada and the USA, measuring these variables at all sites would be prohibitively expensive and impractical. One way of improving water quality data coverage is to use existing measurements to predict values for unsampled locations via spatial interpolation14,25,26. This approach has several advantages: large amounts of data from multiple sources can be combined, no complex mechanistic modelling is required, and a range of established interpolation methods are available.

The goal of this work was therefore to use spatial interpolation to generate calcium and pH raster layers for the entirety of Canada and the continental USA at higher spatial resolution and coverage than previously available13,15. An expansive dataset covering Canada and the USA (1,347,887 calcium measurements from 97,648 locations, and 8,789,005 pH measurements from 208,784 locations) was compiled from multiple governmental, non-governmental, and academic sources, and used to generate spatially interpolated maps of these variables at a 10 × 10 km resolution. These layers will be of value for projects requiring calcium and pH data at regional to continental scales, including understanding past and present sensitivity of lakes and rivers to acidification16,17, assessment of regional variation in the risks posed by contaminants20, ecological niche modelling27, and invasive species risk assessment13,15.

Methods

Data sources

Since Canada lacks a centralised repository for water quality data, georeferenced Canadian water quality records were obtained from multiple sources: publicly accessible federal28,29,30,31, provincial, and territorial agency databases32,33,34,35,36,37,38,39,40; non-governmental open-access data repositories, namely the Atlantic Datastream (https://atlanticdatastream.ca/)41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93 and the Mackenzie Datastream (https://mackenziedatastream.ca/)52,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115; published reports and primary literature116,117,118,119,120,121,122,123,124; a previous invasive species risk assessment15; and directly from contacts in relevant agencies in each of the provinces and territories (Table 1). Records for the United States (including Alaska, but excluding Hawaii) were obtained from the Water Quality Portal125, which combines data from federal, state, tribal, and local agencies; the dataRetrieval package126 was used to download records directly for sites with calcium and pH data collected between 2000 and 2021 (Water Quality Portal accessed 15th February 2021). To ensure that records were as contemporary as possible while retaining high spatial coverage, records from before 2000 were excluded for most sources. However, older records were retained for some areas of Canada (particularly the Territories) where fewer data were generally available. All data handling, processing, and interpolation were conducted in R v4.1.0127.
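As an illustration of the Water Quality Portal download step, the sketch below uses the dataRetrieval package to pull calcium records and matching site metadata for a single state. The exact query parameters used for this work are not given above, so the state, date window, and choice of readWQPdata/whatWQPsites are assumptions; the full download would loop over all states and both variables.

```r
library(dataRetrieval)

# Calcium records for one state (Wisconsin, FIPS code 55), 2000 onwards;
# WQP web-service dates use the MM-DD-YYYY format
ca_wi <- readWQPdata(
  statecode          = "US:55",
  characteristicName = "Calcium",
  startDateLo        = "01-01-2000",
  startDateHi        = "02-15-2021"
)

# Matching site metadata (coordinates, map datum, site type), from the
# WQP "Station" service
sites_wi <- whatWQPsites(
  statecode          = "US:55",
  characteristicName = "Calcium"
)
```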

Table 1 Summary of data sources used.

Data processing and preparation for interpolation

Records from appropriate site types (lakes, rivers, ponds, and streams) were selected where possible, although most data sources did not provide this information. Records from marine waters or in proximity to mines, industrial facilities, wastewater treatment infrastructure, or other potentially contaminated sites were excluded if this information was provided. For USA Water Quality Portal data, for example, this was done by excluding records with certain keywords (e.g. “WASTEWATER”) in the site name or site description fields. Records with various map datums (NAD27, NAD83, WGS84) were included without correction; differences among these three major datums are generally less than a few hundred metres, which is an acceptable degree of positional error given the intended final resolution of the interpolated data layers. In any case, most records did not include map datum information, although records that specified unusual or unrecognised map datums were excluded. Data were inspected for clearly incorrect positions (e.g., points plotting outside of the relevant state, province, or territory, or points plotting in the ocean); these were corrected where possible. Records that lacked critical metadata (e.g., coordinates or date), had obvious position or date errors that could not be easily rectified, were flagged at the source with quality control concerns, or had impossible (e.g., negative) measurements were excluded.
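A minimal sketch of these exclusion rules is shown below, assuming the combined records sit in a data frame `wq` with columns `site_name`, `value`, `latitude`, and `longitude` (all column names, and all keywords other than “WASTEWATER”, are hypothetical).

```r
library(dplyr)

exclude_terms <- c("WASTEWATER", "EFFLUENT", "MINE")  # illustrative keywords

wq_clean <- wq %>%
  # drop records from potentially contaminated or otherwise
  # inappropriate sites, based on keywords in the site name
  filter(!grepl(paste(exclude_terms, collapse = "|"),
                toupper(site_name))) %>%
  # drop records lacking critical metadata
  filter(!is.na(latitude), !is.na(longitude), !is.na(value)) %>%
  # drop impossible (negative) measurements
  filter(value >= 0)
```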

‘Total’ and ‘Dissolved’ calcium were the most commonly recorded fractions, but data for other fractions were sometimes provided. Analysis of data from samples where more than one fraction was measured demonstrated strong positive correlations, with slopes close to 1, among the most commonly measured fractions (Table 2). Consequently, where data for multiple fractions were provided, measurements of ‘Dissolved’ calcium were preferred, but most other fractions were treated as equivalent and used where ‘Dissolved’ data were not available. Rarer fractions, including ‘Filterable’ and ‘Fixed’ calcium, were excluded: insufficient data were available to compare them with ‘Dissolved’ calcium, and they were encountered extremely rarely. Many records did not specify the fraction analysed; their removal would have substantially reduced the extent of the available data, so they were retained and assumed to be equivalent to ‘Dissolved’. Calcium concentrations were converted to consistent units (mg L−1) and records without units (<0.01% of records) were excluded.
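The fraction-preference rule and unit conversion could be implemented along the following lines (a sketch only: the column names, the ‘Unspecified’ label, and the unit strings are assumptions).

```r
library(dplyr)

# preference order; records without a fraction are treated as 'Dissolved'
fraction_rank <- c(Dissolved = 1, Total = 2, Unspecified = 3)

ca_pref <- ca %>%
  filter(!fraction %in% c("Filterable", "Fixed")) %>%  # too rare to verify
  mutate(fraction = ifelse(is.na(fraction), "Unspecified", fraction),
         rank     = fraction_rank[fraction]) %>%
  group_by(site_id, date) %>%
  filter(rank == min(rank)) %>%  # keep only the most-preferred fraction
  ungroup() %>%
  # convert to consistent units (mg L-1); drop records without units
  mutate(value_mgL = case_when(units == "mg/l" ~ value,
                               units == "ug/l" ~ value / 1000,
                               TRUE            ~ NA_real_)) %>%
  filter(!is.na(value_mgL))
```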

Table 2 Comparison of ‘Dissolved’ calcium versus different calcium fractions measured for samples where measurements for more than one fraction were supplied in the source data.

Some records had extremely high calcium concentrations, including some well over 1000 mg L−1; these values were considered implausible, as freshwater calcium concentrations rarely exceed 450 mg L−1 and are typically much lower128. Anomalously high calcium concentrations may result from inclusion of inappropriate sample types (e.g., contaminated water, industrial effluents, marine samples), equipment malfunction, and data entry errors. A cut-off of 500 mg L−1 was therefore set and all records with higher calcium concentrations were excluded; these represented <0.2% of all records. The only exceptions to this rule were samples from the Pecos and Wichita River systems in Texas: calcium concentrations above 500 mg L−1 are not unusual in this area129,130, and removing all such records left a notable gap in an area where data were already sparse. Instead, all records above 500 mg L−1 for this area were set to 500 mg L−1, maintaining consistency with the rest of the data while avoiding the loss of spatial coverage. For records with calcium concentrations below 0.05 mg L−1 (a common detection limit), one of two approaches was taken. Where records were flagged as being ‘below detection limit’, or where an explicit detection limit was given for values below 0.05 mg L−1, records were set to 0.05 mg L−1 for consistency across the dataset (<0.005% of all records). Other records with calcium concentrations less than 0.05 mg L−1 were excluded (<0.05% of all records). For pH data, records with values lower than 2.5 or above 12.5 were excluded, although for most sources all records fell within this range.
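These screening rules reduce to a few vectorised operations; the sketch below assumes data frames `ca` and `ph` with hypothetical `value_mgL`, `below_dl` (below-detection-limit flag), `basin`, and `value` columns.

```r
library(dplyr)

ca_screened <- ca %>%
  # cap Pecos/Wichita records at 500 mg L-1 rather than excluding them
  mutate(value_mgL = ifelse(basin %in% c("Pecos", "Wichita") &
                              value_mgL > 500, 500, value_mgL)) %>%
  # exclude remaining implausibly high concentrations
  filter(value_mgL <= 500) %>%
  # set flagged below-detection-limit values to the 0.05 mg L-1 limit
  mutate(value_mgL = ifelse(below_dl & value_mgL < 0.05,
                            0.05, value_mgL)) %>%
  # exclude other sub-detection-limit values
  filter(value_mgL >= 0.05)

# pH: exclude implausible values
ph_screened <- ph %>% filter(value >= 2.5, value <= 12.5)
```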

Duplicate data (duplicate records present in individual data sources, presence of the same data in multiple sources) and pseudo-duplicate data (lab and field replicates, samples collected simultaneously from different depths at a location) were handled by calculating the median of all records for each site on each date. For each variable, these site-date medians were then used to calculate the following summary statistics for each site across all dates: mean, standard deviation, 25th percentile, 50th percentile (median), 75th percentile, minimum, and maximum (all of these summary statistics are included in the shared databases; see Data Records section). For spatial interpolation, the median value for each site was used, since this measure is comparatively robust to outliers. Medians for each site were converted to spatial data and reprojected into the North America Albers Equal Area Conic projection using the sf package131.
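The aggregation and reprojection steps might look as follows (a sketch; column names are hypothetical, and CRS 4326 assumes the source coordinates are longitude-latitude on WGS84):

```r
library(dplyr)
library(sf)

# 1. median per site per date collapses duplicates and pseudo-duplicates
site_date <- ca_screened %>%
  group_by(site_id, longitude, latitude, date) %>%
  summarise(value = median(value_mgL), .groups = "drop")

# 2. summary statistics per site across all dates
site_stats <- site_date %>%
  group_by(site_id, longitude, latitude) %>%
  summarise(mean      = mean(value),
            sd        = sd(value),
            q25       = quantile(value, 0.25),
            q75       = quantile(value, 0.75),
            min       = min(value),
            max       = max(value),
            n_dates   = n(),
            value_med = median(value),  # site median, used for interpolation
            .groups   = "drop")

# 3. convert to sf points and reproject to North America Albers
site_sf <- st_as_sf(site_stats,
                    coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform("ESRI:102008")
```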

Spatial interpolation methods

To select the approach used to generate the interpolated data layers, three interpolation methods were compared (Table 3): nearest neighbour (NN), inverse distance weighting (IDW), and ordinary kriging (OK). NN is the simplest method, providing a baseline against which the more advanced methods can be compared; each point for which an interpolated value was required was assigned the value from the closest available data point. IDW uses a combination of values from multiple data points, weighted by distance. For IDW, arbitrary or ‘standard’ values for nmax (the maximum number of points to be considered when predicting a value for a specific grid cell) and idp (the inverse distance power parameter, which controls how the weighting of data points varies with distance) are often used132. In this case, however, the optim function was used to find the values of idp and nmax for the calcium and pH data that minimised each of two error metrics, root-mean-square error (RMSE) and mean absolute error (MAE), during preliminary 5-fold cross-validation (Table 4), giving one optimised IDW parameterisation per metric. OK is a geostatistical technique that uses a fitted model of the spatial autocorrelation among data points (a ‘semi-variogram’ or ‘variogram’) to derive the weights used for the interpolation of values to each grid cell. OK often generates superior results to IDW133, but this is not always the case132,134. An additional advantage of OK is that it generates a measure of statistical uncertainty (Kriging variance) for each interpolated value; this is not typically provided by other methods. Kriging variance is influenced by the distances from the interpolated points to locations with data, and by the spatial covariance relationship determined by the fitted variogram; greater variance indicates greater distance from measured values and thus greater uncertainty in the interpolated values. Variograms for each variable were fitted using the automap package135, which automatically selects relevant models and parameter values that best fit the empirical variogram (Table 3), although constraints can be applied to the process. In this case, variograms were fitted with and without a fixed ‘nugget’ of zero, since manual setting of this parameter can sometimes be advantageous136. In all cases, OK was restricted to an nmax of 100 and nmin (the minimum number of data points to consider) of 15; changes to these numbers had little to no impact on the error metrics obtained during preliminary 5-fold cross-validation. Spatial models for all interpolation methods were fitted using the gstat package137.
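A sketch of the three model set-ups, assuming a recent gstat/automap with sf support and the site_sf points from the previous step (the idp and nmax values shown for IDW are placeholders for the optimised values reported in Table 3):

```r
library(sp)      # for the Spatial coercion used by automap
library(gstat)
library(automap)

# Nearest neighbour: equivalent to IDW restricted to a single neighbour
nn <- gstat(formula = value_med ~ 1, data = site_sf, nmax = 1)

# IDW; the authors tuned idp and nmax with optim() against 5-fold
# cross-validation error, so the values below are illustrative only
idw <- gstat(formula = value_med ~ 1, data = site_sf,
             nmax = 10, set = list(idp = 2))

# Ordinary kriging: automatically fitted variograms, with and without
# a nugget fixed at zero (fix.values is c(nugget, range, sill))
vgm_free <- autofitVariogram(value_med ~ 1, as(site_sf, "Spatial"))
vgm_zn   <- autofitVariogram(value_med ~ 1, as(site_sf, "Spatial"),
                             fix.values = c(0, NA, NA))

ok_zn <- gstat(formula = value_med ~ 1, data = site_sf,
               model = vgm_zn$var_model, nmax = 100, nmin = 15)
```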

Table 3 List of interpolation methods used, including parameter values (where applicable).
Table 4 Error metrics considered for comparison of interpolation methods.

Leave-one-out cross-validation (LOOCV) was performed for each method to compare their predictive accuracies. This technique drops an individual point from the dataset and then uses the remaining data to interpolate a value for the location of the dropped point; this was repeated for all available data points. The interpolated values for each point were then compared with the real measured values to calculate multiple performance metrics (Table 4): the correlation coefficient r; three absolute error measures, RMSE, MAE, and the mean bias error (MBE); and a measure of relative error, the median symmetric accuracy138 (MSA). The interpolation methods were compared by considering their scores on these metrics. Initially, the intention was to use an ‘objective’ function136 to integrate these metrics into a single performance score. However, this was not necessary, since for both calcium concentration and pH one method had the best scores in all key metrics (see Technical Validation). Since the predictive accuracy of any interpolation method can vary spatially132,136, error metrics were also calculated using data from each individual province, territory, and state.
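LOOCV and the error metrics can be computed as sketched below; krige.cv() with nfold equal to the number of points performs leave-one-out cross-validation (gstat.cv() plays the same role for the NN and IDW model objects). The MSA line follows the definition of median symmetric accuracy in ref. 138.

```r
library(gstat)

cv <- krige.cv(value_med ~ 1, site_sf, model = vgm_zn$var_model,
               nmax = 100, nmin = 15, nfold = nrow(site_sf))

obs  <- cv$observed
pred <- cv$var1.pred

metrics <- c(
  r    = cor(obs, pred),
  RMSE = sqrt(mean((pred - obs)^2)),
  MAE  = mean(abs(pred - obs)),
  MBE  = mean(pred - obs),
  # median symmetric accuracy (%); requires strictly positive values
  MSA  = 100 * (exp(median(abs(log(pred / obs)))) - 1)
)
```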

On the basis of these comparisons a final interpolation method was selected for each variable. Calcium and pH values were then interpolated onto a grid with a cell size of 10 × 10 km using the gstat::predict function, and the resulting grids were converted to raster format139. Interpolated rasters were masked using outlines of Canada and the USA from the rnaturalearth package140. Rasters of Kriging variance for each variable were also generated at the same resolution.
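The prediction, rasterisation, and masking steps might look like the following, using terra for the raster handling (the original workflow used the raster package139; either is workable):

```r
library(sf)
library(gstat)
library(terra)
library(rnaturalearth)

# 10 x 10 km grid of cell centres over the extent of the point data
grid <- st_as_sf(st_make_grid(site_sf, cellsize = 10000, what = "centers"))

pred <- predict(ok_zn, newdata = grid)  # adds var1.pred and var1.var

xy     <- st_coordinates(pred)
r_pred <- rast(data.frame(xy, z = pred$var1.pred),
               type = "xyz", crs = "ESRI:102008")
r_var  <- rast(data.frame(xy, z = pred$var1.var),
               type = "xyz", crs = "ESRI:102008")

# mask to the land area of Canada and the USA
outlines <- ne_countries(country = c("Canada", "United States of America"),
                         returnclass = "sf")
outlines <- st_transform(outlines, "ESRI:102008")
r_masked <- mask(r_pred, vect(outlines))
```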

Data Records

Project data are available at Data Dryad141. The data provided include the final interpolated rasters (and kriging variance rasters) for calcium and pH, the point data used for the interpolations (summary statistics for each site) and the underlying data for each site on each date (Table 5).

Table 5 List of data provided in the associated data repository141.

Rasters were generated by Ordinary Kriging with a pre-defined zero nugget: for both variables this was the best-performing method (see Technical Validation). All rasters use the North America Albers Equal Area Conic projection (ESRI:102008), have a resolution of 10 × 10 km, and are provided in GeoTIFF format. For each interpolation, the associated kriging variance rasters have also been provided; these can be used to identify areas of higher uncertainty resulting from low availability of water quality data. Rasters are provided both ‘masked’ (using country outlines for the USA and Canada from the rnaturalearth package140, such that values are only provided for land areas) and ‘unmasked’. The latter allow users to apply their own territorial outlines for masking, to resample the rasters at different resolutions, or to reproject the rasters (for example, into a latitude-longitude projection). It is advisable to perform these latter operations before masking the rasters with territory outlines. Some example R scripts to facilitate masking and reprojection can be found in the associated GitHub repository142 (see Code Availability, below).
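For example, the recommended reproject-then-mask order could be followed as below with the terra package (the file name is a placeholder; see the metadata files and the GitHub repository142 for the actual names and scripts):

```r
library(terra)
library(rnaturalearth)

ca <- rast("calcium_interpolation_unmasked.tif")  # hypothetical file name

# reproject the unmasked raster first...
ca_ll <- project(ca, "EPSG:4326")  # longitude-latitude, WGS84

# ...then mask with a user-supplied outline, e.g. Canada only
canada <- vect(ne_countries(country = "Canada", returnclass = "sf"))
ca_can <- mask(crop(ca_ll, canada), canada)
```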

The ‘sites’ files contain the site data used for the interpolations (Table 5). This includes the following summary statistics for each site: median (used for the interpolations), number of dates with data, total number of records included, mean, standard deviation, minimum, maximum, 25th percentile, 75th percentile. Information on data sources and years with data is also included. The ‘site-date’ files contain summary data for each site on each date with available data: the median, mean, standard deviation, minimum, maximum, 25th percentile, 75th percentile, and number of records. Data sharing agreements with some organisations do not permit open sharing of their data and thus these records are not included in the databases (11,901 sites for calcium, 7,601 sites for pH). For a small number of sites for which both public and proprietary data were available (calcium: 383 sites, pH: 61 sites), summary statistics have been recalculated using only public data; consequently, the values provided may not exactly match those used for the interpolations. The associated metadata files include full information on the contents of each of the data files. Finally, the source_ids.csv file includes identifying information for the data sources included in the databases.

There are obvious spatial patterns in freshwater calcium concentrations across the continent (Fig. 1), reflecting the influence of the chemical composition of the underlying bedrock. These include large areas of comparatively low calcium on the east and west coasts of Canada and the USA, and a large area corresponding to the Canadian Shield geological region. Calcium concentrations are comparatively high (30 mg L−1 or greater) across a continuous broad area running from the southern United States up to Yukon and Alaska. Areas of high and low calcium tend to correspond with areas of high and low pH (Fig. 2).

Fig. 1

Interpolated freshwater calcium concentration map for Canada and the USA, generated using zero-nugget Kriging interpolation.

Fig. 2

Interpolated freshwater pH map for Canada and the USA, generated using zero-nugget Kriging interpolation.

Technical Validation

Calcium

The final calcium database used for the interpolations included records for 97,648 sites; the publicly shareable dataset includes 85,747 sites. Median calcium concentrations for individual sites ranged from 0.06 to 500 mg L−1, but 95% of sites had median calcium concentrations of 115 mg L−1 or lower (Fig. 3). The highest densities of sites were mostly in the eastern United States and parts of southern Ontario, Quebec, and New Brunswick, while coverage was lowest in Alaska, the Canadian territories (Yukon, Northwest Territories, and Nunavut), and parts of northern Quebec (Fig. 4a). The majority of sites (56%) were sampled multiple times; individual sites were sampled from 1 to over 1,000 times (median dates sampled = 2, mean dates sampled = 10.6). In most areas sites were, on average, sampled at least twice; however, there were areas of northern Ontario and Quebec where only a single data point was provided for most sites (Fig. 4b). Some of the provided data for these areas were already temporally averaged values, so this does not mean that all data points for these areas were based on single measurements. Temporal variation in measurements from individual sites is to be expected as a result of measurement error and temporal change, such as seasonal fluctuations in calcium concentrations (Fig. 5). However, the scale of temporal variation at individual sites was generally smaller than the spatial variation among sites. The interquartile range for temporally averaged calcium concentrations across all sites was 48.5 mg L−1, while the median interquartile range for calcium measurements at individual sites was 5 mg L−1, and 75% of sites had an interquartile range of 12.5 mg L−1 or less.

Fig. 3

Frequency distribution of site median calcium concentrations. Dashed line marks the 95th percentile. X-axis is truncated at 150 mg L−1; a small proportion of sites (~3%) had higher values.

Fig. 4

Calcium concentration data coverage. (a) Sites per 10,000 km2. (b) Sampling intensity (mean dates sampled per site per 10,000 km2).

Fig. 5

Temporal variation in dissolved calcium concentration for ten sites, selected (from among the 100 sites with data for the most dates) to have the longest temporal coverage and to come from 10 different administrative regions; source region given for each plot, along with number of dates with data. Seasonal fluctuations are evident for most of the sites, but are small compared to spatial variation across Canada and the USA. Dotted lines mark the interquartile range for each site. Outliers have been removed for presentation, but were not excluded from calculation of site statistics. Note the differences in vertical scales for individual plots.

For the interpolation of calcium concentrations, the zero-nugget kriging method (OK-ZN) had the highest r value and the lowest error metrics (excluding the proportional error, MSA, which was very similar to the lowest value); in particular, the bias error (MBE) was lower than that of the other methods (Table 6). The IDW interpolations, however, were not substantially worse. At the province, territory, and state level, the outcome was mostly similar: OK-ZN was the best or joint-best method in 53 out of 62 cases (Tables 7, 8); OK was slightly superior for four areas, the IDW methods (IDW-OR and IDW-OM) were superior for four areas, and in one US state NN was the best method. There appeared to be no tendency for the best approach to vary with the number of data points in each area; OK-ZN was generally superior for states, provinces, and territories with both low (e.g., Mississippi, n = 75) and high (e.g., Florida, n = 13,591) numbers of data points. Consequently, the zero-nugget kriging interpolation (OK-ZN) was selected as the preferred interpolation method.

Table 6 Error metrics for calcium interpolation methods, generated via leave-one-out cross-validation.
Table 7 Best-performing calcium and pH interpolation methods for each Province / Territory (Canada), according to LOOCV error score and correlation between observed and predicted values.
Table 8 Best-performing calcium and pH interpolations for each State (USA).

pH

The final pH database used for the interpolations included records for 208,784 sites; the publicly shareable dataset includes 201,183 sites. The median pH across all sampled sites was 7.9, and 95% of sites had a median pH between 5.4 and 8.74 (Fig. 6). Density of sites was high across much of the USA, with a considerably higher number of sites than for calcium (Fig. 7a). Coverage tended to be sparser for Canada, with some areas, such as northern Saskatchewan and northern Manitoba, having fewer sites with available data than for calcium. Compared to the calcium data, a greater proportion (67%) of sites had data from more than one date, and sites tended to have data from a greater number of sampling dates (median dates sampled = 4, mean dates sampled = 17.2). However, there were again areas of Quebec and Ontario where the data tended to be based on single values for each site (Fig. 7b). Temporal fluctuation at individual sites was also evident for pH (Fig. 8). However, the scale of temporal variation for individual sites was again smaller than the spatial variation among sites. The median interquartile range for pH measurements at individual sites was 0.3, with 75% of sites having an interquartile range of less than 0.46; the interquartile range across all sites (spatial variability) was 0.97.

Fig. 6

Frequency distribution of site median pH values.

Fig. 7

pH data coverage. (a) Sites per 10,000 km2. (b) Sampling intensity (mean dates sampled per site per 10,000 km2).

Fig. 8

Temporal trends in pH for ten sites, selected (from among the 100 sites with data for the most dates) to have the longest temporal coverage and to come from 10 different administrative regions; source region given for each plot, along with number of dates with data. Dotted lines mark the interquartile range for each site. Outliers have been removed for presentation, but were not excluded from calculation of site statistics.

Error metrics for the pH interpolations, including RMSE and MAE, were generally very low; this is to be expected, since the restricted range of feasible pH values makes extremely large errors impossible. While it is not valid to directly compare most metrics between interpolations with different scales and based on different data, it is worth noting that the proportional errors (MSA) were considerably lower for the pH interpolations than for the calcium interpolations. It is important to be aware, however, that since pH is measured on a logarithmic scale, apparently small differences may have comparatively large physical and chemical implications. There was little variation in the accuracy of the different interpolation methods, with most error metrics being similar for most of the methods (Table 9). However, OK-ZN had the best (or equal-best) scores in every metric excluding the proportional error (MSA), which was very close to the lowest value. For individual provinces, territories, and states, the situation was similar (Tables 7, 8): OK-ZN was the best or equal-best method in 46 cases, with IDW-OM or IDW-OR being slightly better for the others. Consequently, the zero-nugget kriging interpolation (OK-ZN) was selected as the preferred interpolation method.

Table 9 Error metrics for pH interpolation methods, generated via leave-one-out cross-validation.

Kriging variance maps generated by the selected interpolation methods can be used to identify areas of higher uncertainty in the interpolated values, and maps for the two variables show broadly similar patterns (Fig. 9). Across much of the USA and some of the Canadian provinces, there were high densities of sites (Figs. 4a, 7a); kriging variance was lower in these areas, indicating comparatively lower uncertainty in the interpolated values. Variance, and therefore uncertainty, was highest in the northern areas of the continent, where there were fewer sites with data. Interpolated values in such areas should be treated with some caution, since they are more distant from locations with measured values. These areas could be prioritised for future sampling if more certainty is required for estimates of pH and calcium concentrations.
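As a usage sketch, cells with unusually high kriging variance can be flagged directly from the published variance rasters (the file name and the 95th-percentile threshold are illustrative, not prescribed):

```r
library(terra)

kv  <- rast("calcium_kriging_variance.tif")      # hypothetical file name
q95 <- quantile(values(kv), 0.95, na.rm = TRUE)  # variance threshold
high_uncertainty <- kv > q95  # logical raster of high-uncertainty cells
```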

Fig. 9

Kriging variance maps for (a) calcium and (b) pH interpolations. Q25, Q75, and Q95 are 25th, 75th, and 95th percentile values.

Usage Notes

Example application – invasive species risk assessment for dreissenid mussels

To illustrate the advantages of high-resolution calcium and pH data, Wilcox et al.143 performed a continental-scale risk assessment for two species of invasive, freshwater dreissenid mussels using the new data layers. The two species, the zebra mussel Dreissena polymorpha and the quagga mussel D. rostriformis bugensis, have significant ecological and economic impacts on freshwater ecosystems in North America144, but can only survive, grow, and reproduce in waters with sufficiently high concentrations of dissolved calcium and within a particular pH range5.

Due to the lack of high-resolution calcium and pH data, previous risk assessments13,15 for these species have been limited to ecoregion- or sub-drainage-level resolution, and have not included Alaska and large areas of Canada (the Maritime provinces, Newfoundland and Labrador, and the Arctic). By combining the new calcium and pH data layers with additional high-resolution bioclimatic variables (e.g. temperature) from WorldClim145, Wilcox et al.143 were able to model habitat suitability for both dreissenid species for the entire extent of Canada and the continental USA, assess the importance of calcium and pH relative to other bioclimatic drivers of mussel distributions, and calculate the relative risk of invasion for every Canadian province and territory at a 10 × 10 km resolution.

Limitations

‘Big data’ approaches can be a useful tool for water quality projects, but are not without limitations146, and the aggregation of many different data sources, with highly variable quality control standards, necessitates some care in their use. Given the extremely large number of data points involved, inspection of individual data records was not possible, and the filtering and cleaning procedures may therefore have excluded some valid records. On the other hand, it is likely that some low-quality data remain in the final database used for the interpolations. For example, while certain sources provided enough information to quickly screen out inappropriate sample types, most sources did not. Points with large or obvious errors in their values or positions were easy to identify, and therefore to remove or correct. Records with incorrect or inaccurate, but plausible, values or positions were effectively impossible to identify and remove. The large number of records used for the final calcium and pH databases should, however, minimise the impact of these types of error on the final interpolations.

There are also a few limitations to using spatial interpolation to create such large-scale maps of water quality variables. Calcium concentrations and pH are primarily driven by the underlying geology147; transitions between rock types can be abrupt and well delineated, and interpolation across such boundaries may produce results that do not reflect reality. This problem is likely to be minimised in areas with a high density of data points but may be important in data-poor regions. For example, there are relatively large swathes of northern Quebec and Arctic Canada for which no calcium or pH data were available. This may not be problematic in some contexts; for example, in the case of invasive species risk assessment, there are other factors (low temperature, remote location) that may make these areas low risk for many non-native organisms. Geological proxies could be used to predict calcium concentration in locations with no water quality data148, but this requires detailed geological information and validated mechanistic models, and does not account for the effects of plant cover and land use, which influence water chemistry4,149. Despite these limitations, geological data could be used to help improve predictions in areas with lower data coverage, or higher uncertainty, via co-kriging, which allows relationships with additional variables to be used during the interpolation process150,151 (see the sketch below). Alternatively, a range of machine learning approaches is available that can also use additional information, such as geological data and other environmental covariates; these methods can perform better than traditional geostatistical methods for generating spatial interpolations152,153, particularly when the density of data points for the primary variable of interest is low154. However, they do not always generate more accurate interpolations; a combined approach, which uses an ensemble of outputs from different interpolation methods with spatially-varying weightings dependent on the density of available data, may result in better overall accuracy136. Such exercises are good candidates for future improvements to these data layers.
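A hedged sketch of how such a co-kriging model could be set up in gstat follows; the geological covariate (carbonate, in a points object geol_sf) is entirely hypothetical, and this was not part of the published workflow.

```r
library(gstat)

# multivariate gstat object: target variable plus geological covariate
g <- gstat(NULL, id = "ca",   formula = value_med ~ 1, data = site_sf)
g <- gstat(g,    id = "geol", formula = carbonate ~ 1, data = geol_sf)

# direct and cross variograms, fitted as a linear model of
# coregionalisation; initial psill/range/nugget values are illustrative
v <- variogram(g)
g <- fit.lmc(v, g, vgm(1, "Sph", 5e5, 1))

# co-kriged predictions (and variances) on the interpolation grid
ck <- predict(g, newdata = grid)
```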