A palaeoclimate proxy database for water security planning in Queensland Australia

Palaeoclimate data relating to hydroclimate variability over the past millennia have a vital contribution to make to the water sector globally. The water industry faces considerable challenges in accessing climate datasets that extend beyond the record of historical gauging stations. Without such data, variability around the extremes of floods and droughts is unknown, and stress-testing infrastructure design and water demands is challenging. User-friendly access to relevant palaeoclimate data is now essential, as is an efficient process for determining which proxies are most relevant to a given planning scenario and geographic area of interest. This paper presents PalaeoWISE (Palaeoclimate Data for Water Industry and Security Planning), a fully integrated and quality-assured database of proxy data extracted from data repositories and publications and collated in Linked Paleo Data (LiPD) format. We demonstrate the application of the database in Queensland, one of Australia’s most hydrologically extreme states. The database and resultant hydroclimate correlations provide both the scientific community and water resource managers with a valuable resource to better manage for future climate change.


Background & Summary
The essential value of high-resolution, accessible global palaeoclimate datasets to climate change predictions is well recognised [1][2][3]. The rise in popularity of data repositories, together with advances in computing, means that large-scale data compilation and analyses are now more accessible 1,2,[4][5][6][7]. Despite such advances, a disconnect remains between the availability of palaeoclimate databases and their uptake by key industry sectors. One such sector is the water industry, which faces significant challenges with respect to climate variability and change and its impact on future water supply 8.
Improvements to industry decision-making can only be facilitated by establishing the 'plausible ranges of climate change' 8 and by reducing uncertainty through the use of millennial-scale records 9. The relatively short observational record length (<100 years) available for hydrological modelling and water planning is insufficient to capture variability around the extremes of floods and droughts [9][10][11][12][13][14]. Climate information also plays a key role in enabling the sort of 'smarter solutions' required of the industry, with several applications demonstrating the tangible benefits of incorporating palaeoclimate data into water management 13,[15][16][17]. Palaeoflood data, for example, are now routinely used to improve flood frequency analysis in several countries 9,18,19 and are especially valuable for 'stress testing' infrastructure design to safeguard against dam overspill.
Using palaeoclimate data from the Australasian region, we present an efficient and integrated tool that provides access to a standardised database for rapidly assessing the proxy records most relevant to a hydroclimate scenario and geographic area of interest. The database expands on previous compilations and includes records reported in Freund et al. (2017), , and Comas-Bru et al. (2020), with additional records sourced directly from publications or authors. The database comprises 396 records derived from 11 different archive types (e.g., corals, tree rings, sediments, speleothems) with an emphasis on the Common Era (i.e., the last 2000 years). We demonstrate the application of this palaeoclimate information to both the scientific community and the water industry by testing the temporal correlation between sample proxy records and a full suite of hydroclimate indices relevant to water planning in Queensland, one of Australia's largest and most climatically variable states.

Methods
Palaeoclimate data compilation.
Data Sources. The majority of proxy records were sourced from online data repositories (e.g. NOAA World Data Service for Paleoclimatology, PANGAEA) and extracted using record details contained within the published reviews of Freund et al. (2017) and , which focus on proxies relevant to Australian climate. Freund et al. (2017) report details of a high-resolution (annual or higher) proxy network from the southern hemisphere, which was used to reconstruct rainfall for Australia's eight natural resource management regions. Low-resolution proxies (>annual) were largely sourced from , who identified a total of 132 high-quality palaeoclimate datasets and also provided alternative chronologies based on revised age modelling. Relevant records from the Speleothem Isotopes Synthesis and AnaLysis (SISAL) database 21 were filtered using the geographic extent of the region influential to Australasian climate.
Where data were not in an online repository, they were sourced from the supplementary materials or directly from the authors.
Selection Criteria. Extracted records were screened against several broad criteria to capture the maximum number of both high and low-resolution records before being collated in the database. To enhance usage by water resource managers, the Common Era was prioritised where resolution is generally high, with >50% of datasets having a temporal resolution of annual or greater.
The following final criteria were used:
1. The proxy record must be detailed in a peer-reviewed publication.
2. The proxy record must contain at least two samples dated to within the last 2000 years.
3. The proxy record must span at least 20 years.
4. The proxy record must not require further processing to yield a chronological time series. This relates particularly to the exclusion of tree-ring datasets comprised of raw tree-ring width values, which would require further processing.
5. The proxy must be related directly, or teleconnected to, Australian climate, as stated in the original publication or a more recent published synthesis.
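The five criteria above can be read as a simple screening filter. The following Python sketch is illustrative only and is not the authors' code: the `ProxyRecord` fields, the function name, and the `reference_year` used to define "the last 2000 years" are all assumptions introduced for this example. Criteria 1, 4 and 5 need expert judgement, so they are represented here as flags set by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyRecord:
    """Hypothetical, minimal stand-in for one proxy record's metadata."""
    name: str
    sample_years_ce: List[float]        # sample ages as calendar years (CE)
    peer_reviewed: bool                 # criterion 1 (set manually)
    is_time_series: bool                # criterion 4 (no further processing needed)
    linked_to_australian_climate: bool  # criterion 5 (set manually)


def passes_screening(rec: ProxyRecord, reference_year: float = 2020.0) -> bool:
    """Return True if the record meets all five screening criteria.

    reference_year is an assumed anchor for 'the last 2000 years'.
    """
    if not rec.sample_years_ce:
        return False
    # Criterion 2: at least two samples dated to within the last 2000 years.
    recent = [y for y in rec.sample_years_ce if y >= reference_year - 2000.0]
    # Criterion 3: the record must span at least 20 years.
    span = max(rec.sample_years_ce) - min(rec.sample_years_ce)
    return (
        rec.peer_reviewed
        and len(recent) >= 2
        and span >= 20.0
        and rec.is_time_series
        and rec.linked_to_australian_climate
    )


# Illustrative records (invented values, not taken from the database).
coral = ProxyRecord("example_coral", [1900.0, 1950.0, 2000.0], True, True, True)
short = ProxyRecord("example_short", [1990.0, 2000.0], True, True, True)
kept = [r.name for r in (coral, short) if passes_screening(r)]
```

In this sketch the second record is rejected because it spans only 10 years.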
Database collation of proxy records. Proxy records, including all associated metadata, were compiled and reformatted in the Linked Paleo Data (LiPD) format 7 using the lipdR and dplyr packages in the statistical language R [22][23][24]. The LiPD format is based on linked JavaScript Object Notation (JSON-LD). It is highly flexible and self-contained (data and metadata are always stored together), and it permits integration and comparison with previously published syntheses 1,2,4,25. Table 1 outlines a subset of metadata fields for proxy records stored in the database, which is provided as both LiPD and R data files 26. PalaeoWISE database users are directed to McKay and Emile-Geay (2016) and the Linked Earth Ontology 27 for full details of database structure and standard definitions and terminology of field names. All included fields are fully described in the PalaeoWISE files 26. PalaeoWISE 26 also includes an overview of the completeness of the database fields in the supplementary material (Section 1). Meta-analysis and visualisation of the database were undertaken in R using the packages dplyr, ggplot2, sf, and rnaturalearth 23,24,[28][29][30][31].
Following collation and standardisation of proxy records, summary dashboards similar to those outlined by PAGES2k Consortium (2017) were produced for each record to facilitate quality control of the database contents.

Data Records
The PalaeoWISE (Palaeoclimate Data for Water Industry and Security Planning) database contains 396 palaeoclimate proxy records 26, each of which documents an archive's response to past changes in climate. The majority of proxies come from sites located in the Australasian region, with some records in the Indian and central Pacific Oceans, as well as Antarctica (Fig. 1). Proxies are predominantly located at tropical latitudes (Fig. 1). This reflects both the dominance of tropical coral as a palaeoclimate archive for the Australasian region and the influence of dedicated ocean/atmospheric climate research programs that have produced multiple proxy records from a single site (e.g. Global Tropical Moored Buoy Array Program) (Table 2). A single marine sediment core extracted from the Makassar Strait, Indonesia, for example, has yielded four proxy datasets 94. The records derive from a range of archive types (including sediment, speleothems, and tree rings), and the temporal resolutions range from monthly/seasonal (e.g. corals) to decadal/centennial (e.g. foraminifera) (Fig. 1). Records in the database have timespans ranging from 21 to 40,000 years, although the majority of records do not extend beyond the beginning of the Common Era (Fig. 1, Table 2).
PalaeoWISE 26 is hosted on figshare (https://doi.org/10.6084/m9.figshare.14593863.v3) and is also accessible via the project website (www.palaeoclimate.com.au/project-outputs/proxy-map/access-the-palaeowise-database/). PalaeoWISE 26 includes 15 items as detailed in Table 3, together with the code to produce the figures presented in this manuscript. The proxy data are presented as a zipped folder of LiPD and Rdata files, which includes a brief introduction on how to interact with LiPD files in R and a README.txt file. PalaeoWISE 26 also includes all proxy dashboard figures (Fig. 2), and correlation maps and coefficients for each of the 396 proxy records, 73 Queensland catchments, and 75 climate variables.
An analysis of correlation coefficient lags (in years) for the seven example proxy datasets is also included in PalaeoWISE 26. More information on each item can be found in Table 3 and in the PalaeoWISE readme file 26. The proxy data contained in PalaeoWISE 26 are also hosted by the NOAA World Data Service (WDS) for Paleoclimatology (https://www.ncdc.noaa.gov/paleo/study/34073) 32. This community-specific, open access repository archives the PalaeoWISE proxy data in LiPD format, and also in the WDS template text format for records not previously archived in the WDS Paleoclimatology 32.

Technical Validation
Database quality control. Essential quality assurance was completed on the individual proxy records using summary dashboards, following the example of PAGES2k Consortium (2017). Proxy records, each comprising a single time series and multiple metadata fields, were verified by comparison with the original source data where available. The full collection of summary dashboard plots is available in PalaeoWISE 26. The overall completeness and accuracy of individual datasets were also verified during the creation of the LiPD files for each dataset.

Relationship between proxies and hydroclimate.
A key goal was to examine the extent to which the database captures the variability in hydroclimate using the state of Queensland as an example. However, a common challenge is that of stationarity, which assumes that the relationship between the proxy and climate variable over the shared period is representative of the entire time span of the proxy record. While methods exist to model unstable/nonlinear or multivariate relationships between proxies and climate variables, the approach adopted here is simple in the hope that it can be employed by a greater range of potential users, including the water industry, to efficiently screen the database for proxy data of relevance to catchment-scale hydroclimatic variability.

Table 2. Summary of all proxy records in the database by archive type. Note: a single reference may be associated with multiple datasets. *Bold text denotes references for the example datasets discussed in this paper. Italicised text denotes references for which data were sourced from supplementary materials or directly from authors.
www.nature.com/scientificdata

Selection of example proxy and hydroclimate variables. From the complete database, an example proxy set was selected for each of the eight archive types (sediment, foraminifera, ice core, leaf material, tree ring, ostracod, speleothem and coral) based on the highest correlation coefficient between the proxy, the 75 climate variables and 73 Queensland catchments. None of the ostracod-derived proxies reported a significant correlation coefficient with any of the selected climate variables and catchments, so no example is provided here. The datasets for the example proxy records are either continuous or have gaps/irregular time steps, which allows us to test for changes in correlation coefficients based on record continuity, but all have an average temporal resolution of less than ten years.
A comprehensive set of hydroclimate variables relevant to catchment-scale hydroclimate modelling and future climate change projections (https://www.longpaddock.qld.gov.au/qld-future-climate/dashboard/) was selected: annual rainfall, evapotranspiration, temperature, Standardised Precipitation Index (SPI) 129,130, Standardised Precipitation Evaporation Index (SPEI) 129, and indices for severe and extreme wetness and dryness (Table 4). Gridded datasets (cell size = 0.05 degrees, approximately 10 km) of annual rainfall, evapotranspiration, and temperature were extracted from the Scientific Information for Landowners (SILO) database (https://www.longpaddock.qld.gov.au/silo) for the period 1889 to 2019 using the July to June water year. SPI and SPEI grids (cell size = 0.05 degrees) were then calculated from instrumental data at timescales of 12, 24, 36, and 48 months. In hydrological applications, annual and multi-annual time scales are important for water storages (and thus water supply security) because storages aggregate water over time and have variable 'stress' periods ranging from single to multiple years. These stress periods relate primarily to droughts, which in Australia are typically multi-year events. Periods of severe and extreme wetness and dryness were derived from all SPI and SPEI series using the criteria outlined in Table 4 and were assessed over the same ~120-year period of recorded climate data. Catchment-averaged annual time series for the 73 Queensland catchments were then derived from all climate grids for the July to June water year for the period 1/1/1889 to 31/12/2019.
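As a minimal illustration of the July to June water year used above, the Python sketch below aggregates monthly catchment values into water-year totals. It is not the authors' processing code: the function name, the input layout (a dictionary keyed by year and month), and the choice to label each water year by the calendar year in which it ends are assumptions made for this example. Incomplete water years at either end of the record are dropped.

```python
from typing import Dict, Tuple


def water_year_totals(monthly: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Sum monthly values into July-June water years.

    monthly maps (calendar_year, month) to a value such as rainfall in mm.
    A water year is labelled by the calendar year in which it ends
    (an assumed convention): July 2000 to June 2001 is water year 2001.
    Only water years with all 12 months present are returned.
    """
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for (year, month), value in monthly.items():
        wy = year + 1 if month >= 7 else year
        totals[wy] = totals.get(wy, 0.0) + value
        counts[wy] = counts.get(wy, 0) + 1
    return {wy: totals[wy] for wy in sorted(totals) if counts[wy] == 12}


# Illustrative input: a constant 10 mm per month for 2000 and 2001.
example = {(y, m): 10.0 for y in (2000, 2001) for m in range(1, 13)}
annual = water_year_totals(example)
```

For temperature an average would replace the sum; the grouping logic is the same.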
Outlier analysis of proxy data. As correlation calculations are not resistant to outliers in the proxy data, technical validation also tested for outliers using Rosner's test 131 in the R package EnvStats 132. This procedure allows the user to test for multiple outliers in a dataset, as opposed to more static approaches that test for only a single outlier at a time. We note that Rosner's test does not take into account the temporal structure of the data, though there are other methods for finding outliers in such series (e.g. Chen and Liu (1993)). However, these are considerably more complex to implement in irregularly sampled series [133][134][135][136].
A maximum of three outliers was tested for in each of the seven example proxy datasets (Fig. 3) and two climate time series (annual rainfall and temperature; Fig. 4). Of the 2,156 proxy observations considered, the procedure found only three potential outliers, shown as vertical lines in Fig. 3. Identification as an outlier does not mean that a value is incorrect; these values remain in the database, but they may warrant further investigation in any subsequent analysis. None of the data points extracted for the climatic observations were considered outliers. Beyond the seven records presented here as examples, the entire proxy database was quality controlled, with outliers identified using the method described above. The quality codes for outliers, suspected outliers, and missing values are detailed in PalaeoWISE (in both the LiPD metadata files and the fieldnames spreadsheet) 26.
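For readers unfamiliar with Rosner's test, the following pure-Python sketch shows the generalised extreme Studentized deviate procedure on which it is based, testing for up to k outliers. The analysis described above used the EnvStats package in R; this block is an independent illustration, not that implementation. The function names are hypothetical, and the Student-t quantile is obtained by simple numerical integration and bisection only to keep the example self-contained.

```python
import math
from typing import List


def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """Student-t CDF for t >= 0, using Simpson's rule on the density."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    c /= math.sqrt(df * math.pi)
    pdf = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3.0


def t_quantile(p: float, df: int) -> float:
    """Quantile of the Student-t distribution for p > 0.5, by bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def rosner_outliers(values: List[float], k: int = 3, alpha: float = 0.05) -> List[int]:
    """Return indices of values flagged as outliers (at most k)."""
    n = len(values)
    if n < k + 2:
        raise ValueError("need at least k + 2 observations")
    remaining = list(enumerate(values))
    removed: List[int] = []
    n_outliers = 0
    for i in range(1, k + 1):
        xs = [v for _, v in remaining]
        mean = sum(xs) / len(xs)
        sd = math.sqrt(sum((v - mean) ** 2 for v in xs) / (len(xs) - 1))
        if sd == 0.0:
            break
        # Most extreme remaining observation and its test statistic R_i.
        pos = max(range(len(xs)), key=lambda j: abs(xs[j] - mean))
        r_i = abs(xs[pos] - mean) / sd
        # Critical value lambda_i for the i-th step.
        p = 1.0 - alpha / (2.0 * (n - i + 1))
        t = t_quantile(p, n - i - 1)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        removed.append(remaining.pop(pos)[0])
        if r_i > lam:
            n_outliers = i  # the largest i with R_i > lambda_i decides
    return removed[:n_outliers]


# Illustrative series (invented values) with one obvious anomaly at index 9.
series = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 25.0]
flagged = rosner_outliers(series, k=3)
```

As in the text, a flagged value is a candidate for closer inspection and is not automatically removed.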
Temporal correlations. The relationship between the proxy records and catchment-averaged hydroclimate time series was tested using correlation analysis across the whole database. Correlation coefficients were determined using a kernel-based approach, which is similar to Pearson's correlation coefficient but has the advantage of applying to irregularly spaced data. The approach was used previously in Roberts et al. (2017). Because Pearson's correlation is not appropriate for unevenly spaced series, the correlation method (and Python/Fortran code) of Rehfeld and Kurths (2014) was used. Conservative correlation lags of −5 to +5 years are included to acknowledge the potential for some dating uncertainty in high-resolution proxies.
An approximate test for significant correlation is given as $|r| > z_{\alpha/2} / \sqrt{N^{*}}$, where $r$ is the correlation coefficient, $z_{\alpha/2}$ is the corresponding quantile of the standard Gaussian distribution (its inverse cumulative distribution function), $\alpha$ is the significance level and $N^{*}$ is the minimum number of data points for either time series within the overlapping period. Exact significance tests are not known for the Gaussian kernel method, and the number of overlapping points changes depending on the lag and irregularity of the spacing of the two datasets being correlated 137. Additionally, the significance tests also depend on the characteristics of the data series, for example those that are nonlinear, heteroskedastic or have a hidden dependence structure. This approximate significance test was applied to all correlation results presented here, and non-significant correlations are not presented.
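A short numerical sketch of the approximate test follows. It assumes the test takes the common large-sample form in which a correlation is significant when its absolute value exceeds the two-sided standard Gaussian quantile divided by the square root of N*; the function names are hypothetical, and the quantile comes from the Python standard library.

```python
import math
from statistics import NormalDist


def significance_threshold(n_star: int, alpha: float = 0.05) -> float:
    """Approximate critical value for |r|, assuming z_(alpha/2) / sqrt(N*)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z / math.sqrt(n_star)


def is_significant(r: float, n_star: int, alpha: float = 0.05) -> bool:
    """True if the correlation exceeds the approximate threshold."""
    return abs(r) > significance_threshold(n_star, alpha)


# With 100 overlapping points and alpha = 0.05 the threshold is about 0.196.
threshold_100 = significance_threshold(100)
```

Because N* is the smaller of the two sample counts in the overlap, short or gappy proxy records need a larger coefficient to pass the same test.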
To test the robustness of the Roberts et al. (2017) kernelised approach, we re-calculated the correlation coefficients based on the ranks for the data values. This in effect allows for a comparison of Pearson vs Spearman-type correlation where highly non-linear relationships would appear as a large difference between them. The differences between the Spearman and Pearson-type correlations when run on the same data sets showed very few

Climatic Index; Description and use; Method; Reference; Derivation period
Average precipitation; Catchment-averaged precipitation (mm); Annual precipitation averaged over each catchment; 208; months
Morton's potential evapotranspiration; months
Temperature; Catchment-averaged temperature (°C); Annual temperature averaged over each catchment; 208; months
Standardised Precipitation Index (SPI); Identification of wetter and drier periods; Gamma distribution using a 1900-1999 reference period; 130

Fig. 5 Shown are the maximum absolute ccf between catchment-averaged rainfall and the example proxies for all Queensland catchments from lags +5 to −5 years. White = non-statistically significant. Histogram shows the distribution of maximum absolute ccf by lag. The Burdekin and the Balonne-Condamine catchments referred to in the text are illustrated. Vector map data sourced from www.qldspatial.information.qld.gov.au.

Visualising temporal correlations. Heat maps were constructed from the resultant correlation data to provide a condensed, visual tool that highlights the potential of individual proxies to reflect catchment-scale hydroclimate and the associated time lag (Figs. 5, 6). The heat maps display the maximum absolute correlation coefficients by climate index and catchment, with examples for catchment-averaged rainfall (Fig. 5) and temperature (Fig. 6) provided. Maps for each of the 75 hydroclimatic variables are available in a single page format, as are the correlation results for each catchment, dataset, and climate variable 26. An interactive summary of the correlation results is also presented on the project website at www.palaeoclimate.com.au.
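The value plotted in each heat-map cell can be described as the coefficient with the largest absolute value across the tested lags, with non-significant results left blank. The Python sketch below illustrates that selection for a single proxy and catchment. The function name and the input layout (a dictionary from lag in years to a coefficient, with `None` marking a non-significant result) are assumptions made for this example, and the numbers are invented.

```python
from typing import Dict, Optional, Tuple


def best_lag(ccf_by_lag: Dict[int, Optional[float]]) -> Optional[Tuple[int, float]]:
    """Return (lag, coefficient) with the largest |coefficient|.

    Returns None when no lag has a significant coefficient, which
    corresponds to a white cell in the heat maps.
    """
    significant = {lag: r for lag, r in ccf_by_lag.items() if r is not None}
    if not significant:
        return None
    lag = max(significant, key=lambda k: abs(significant[k]))
    return lag, significant[lag]


# Invented coefficients for lags of -1, 0 and +1 years.
cell = best_lag({-1: -0.6, 0: 0.4, 1: None})
blank = best_lag({-1: None, 0: None, 1: None})
```

Keeping the lag alongside the coefficient matters because, as noted in the following paragraph, the best lag is not constant across catchments.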
The heat maps deliver meaningful information on the selection of proxy records and their associated skill with selected hydroclimate variables. This is especially valuable for appreciating the extent to which a given proxy correlates at the catchment (e.g., dataset 274), region (e.g., dataset 170; coastal eastern Queensland) or broader state level (dataset 269) (Fig. 5). However, as heat maps are designed to show the 'best case' correlation coefficient, the lag is not constant across catchments. For example, a high correlation between catchment-averaged rainfall and proxy dataset 269 occurs at a lag of −1 year in the Burdekin catchment (Fig. 5) but at a lag of +1 year in the Balonne-Condamine catchment (Fig. 5; PalaeoWISE correlations 26). Despite the variability in associated lag, the majority of maximum absolute correlation coefficient values occur at lag −1 (Figs. 5, 6). To supplement the maps, and as an additional tool to aid the selection of relevant records, Fig. 7 shows the most 'successful' datasets for catchment-averaged rainfall and temperature records. Here, success was defined as having the highest significant absolute correlation coefficient in each of the 73 Queensland catchments for the climate variable of interest. Figure 7 shows that dataset 269 has the largest number of highest correlations for rainfall, but that dataset 470 has the highest correlation coefficient for temperature within the Queensland catchments. Similar plots for each climate variable are presented in PalaeoWISE (success histograms) 26.
Table 3 details the individual files contained within PalaeoWISE 26. The current and all future versions of PalaeoWISE 26 can be accessed at https://doi.org/10.6084/m9.figshare.14593863.v3, and the project website (www.palaeoclimate.com.au/project-outputs/proxy-map/access-the-palaeowise-database/).
The proxy data contained in PalaeoWISE 26 can also be accessed on NOAA WDS Paleoclimatology (https://www.ncdc.noaa.gov/paleo/study/34073) 32 in both the LiPD format and the WDS template text format for records not previously archived in this repository.
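The 'success' measure used for Fig. 7 can be sketched as a tally: for each catchment, find the dataset with the highest significant absolute correlation, then report the share of catchments won by each dataset. The Python example below is illustrative and is not the code distributed with PalaeoWISE. The function name, the nested-dictionary input and all coefficient values are assumptions; the dataset numbers and catchment names are borrowed from the text purely as labels.

```python
from collections import Counter
from typing import Dict, Optional


def success_proportions(
    abs_ccf: Dict[str, Dict[int, Optional[float]]]
) -> Dict[int, float]:
    """Proportion of catchments in which each dataset has the highest |ccf|.

    abs_ccf maps catchment name -> {dataset id -> absolute coefficient},
    with None marking a non-significant result. Catchments with no
    significant result count towards the total but award no win.
    """
    wins: Counter = Counter()
    for by_dataset in abs_ccf.values():
        significant = {d: r for d, r in by_dataset.items() if r is not None}
        if significant:
            wins[max(significant, key=significant.get)] += 1
    n_catchments = len(abs_ccf)
    return {dataset: count / n_catchments for dataset, count in wins.items()}


# Invented coefficients for four catchments and two datasets.
example = {
    "Burdekin": {269: 0.5, 170: 0.3},
    "Balonne-Condamine": {269: 0.4, 170: None},
    "Catchment C": {269: None, 170: 0.2},
    "Catchment D": {269: None, 170: None},
}
shares = success_proportions(example)
```

In this toy case dataset 269 wins half of the catchments and dataset 170 a quarter, and one catchment has no significant proxy at all.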

Usage Notes
The approach and outputs are likely to be used primarily by the scientific community in the first instance to access both high- and low-resolution palaeoclimate proxy data in a single digital database. The inclusion of low- and high-resolution proxies facilitates use in hydrological modelling scenarios that may vary in timescale from annual to centennial.
PalaeoWISE 26 also provides an essential resource for scientists and water managers to screen proxies correlated to hydroclimatic indices of their interest. The correlation approach is intended as an efficient, visual tool to identify relevant proxies and catchments for further investigation. The code accompanying this work allows for straightforward extrapolation of the approach to areas outside of Queensland where accompanying hydroclimate variables exist.
We welcome any additional or clarifying information to be incorporated into future versions. When using this database or any correlations presented within, please cite both the original data author(s)/collector(s) as well as this publication.

Code availability
Code to reformat the relational database to the LiPD and Rdata formats was adapted from this example (https://github.com/nickmckay/sisal2lipd) and is available in PalaeoWISE 26. Code to produce the figures is available in PalaeoWISE 26. Correlations were all produced using code published within the original publications cited herein.

Fig. 7 Identification of the most successful datasets for (a) catchment-averaged rainfall and (b) temperature. Success here is the proportion of the 73 Queensland catchments for which each proxy in the seven example datasets recorded the highest correlation coefficient at the 0.05% significance level. Similar plots for each climate variable are available in PalaeoWISE 26.