A database of water chemistry in eastern Siberian rivers

Permafrost degradation leads to considerable changes in river ecosystems. The Eastern Siberian River Chemistry (ESRC) database was constructed as a spatially extensive resource for assessing climate warming-induced changes in freshwater systems in permafrost-dominated eastern Siberia. The database includes 9487 major ion (Na+, K+, Ca2+, Mg2+, Cl−, SO42− and HCO3−) measurements from 1434 water samples collected mainly in six large river basins in eastern Siberia between 1940 and 2019. Data were obtained from public databases, scientific literature in English and Russian, and researchers, and were formatted with a consistent table structure. The database is transparent and reproducible. Climate variable (air temperature and precipitation) data, discharge data, trace element concentration data, and isotope data at the basin and subbasin scales are also provided. This database enhances knowledge about the water chemistry of the permafrost region, especially in eastern Siberia, where data are scarce. It will be useful to those assessing spatiotemporal changes in river water chemistry associated with permafrost degradation or other environmental stressors in a warmer climate.


Background & Summary
The Arctic Ocean accounts for only 1% of the global ocean volume, yet it receives more than 10% of global river discharge (~4300 km³ per year) 1,2 from ~15% of the global land surface 3. Surface water from Arctic and sub-Arctic river basins is generally fresh 4, with low concentrations of dissolved ions. Over the past several decades, the Arctic freshwater system has experienced significant changes 5 due to accelerated climate warming and an intensified hydrological cycle as well as human activities across the terrestrial pan-Arctic [6][7][8].
The chemical compositions of river water are the result of natural processes and anthropogenic influences 9 . Progressive increases in major ion delivery to the Arctic and sub-Arctic freshwater systems are highly associated with permafrost degradation in a warmer climate 10 . Permafrost degradation enhances infiltration, increases groundwater storage, and drives deeper flow paths 11 , leading to increasing contributions of highly mineralized groundwater to streamflow. As a result, Arctic freshwater is shifting from a mineral-poor surface water-dominated river system to a mineral-rich groundwater system 12 . Our understanding of the response of the Arctic freshwater system to permafrost degradation is mainly based on river water chemistry observations in western Siberia 13 .
The water chemistry database for western Siberia is relatively rich, especially for the Ob River, with sampling dating back to the 1930s 13, and is constantly being updated [14][15][16]. In contrast, water chemistry data for eastern Siberia are relatively sparse. Early data on water chemistry in eastern Siberia were published mainly in the Russian literature and were difficult to access. In fact, the water chemistry of eastern Siberia was continuously observed and studied by scholars in the former Soviet Union during the 1940s and 1950s (e.g., Bochkarev 17). In the 1990s, research for two PhD theses systematically studied the water chemistry of the Lena River 18 and the other rivers of eastern Siberia 19. Since 2003, the Arctic Great Rivers Observatory (ArcticGRO), which originated from the Pan-Arctic River Transport of Nutrients, Organic Matter, and Suspended Sediments (PARTNERS) project, has provided open-access water chemistry data for the Lena and Kolyma Rivers. However, water chemistry data for other rivers (e.g., Angara, Selenga, Yana and Indigirka) are still limited.
www.nature.com/scientificdata
The objective of this study was to combine existing eastern Siberian river chemistry datasets into a single database that can help assess climate effects on freshwater chemistry in permafrost-dominated regions. Data obtained from public databases, researchers, and the literature, including English and Russian articles and dissertations, were combined to create a georeferenced database with 9487 water chemistry results for 1434 samples collected from rivers across eastern Siberia (Fig. 1). A shapefile delineating the river basin polygons was constructed to accompany the chemistry database. The database also includes climate variables such as air temperature and precipitation at the basin scale. It is transparent and reproducible and can be used to assess the responses of freshwater systems to climate change in permafrost-dominated regions.

Methods
Data acquisition. Google Scholar, Scopus, and eLIBRARY.RU, as well as public data sources, were searched using the term "water chemistry" in Eastern Siberia. In total, 1434 multisource data, including major ions, were obtained from both published datasets and unpublished field studies (Table 1). Among these data, (1) 159 datasets were from the ArcticGRO water quality data 20 and the GLObal RIver CHemistry (GLORICH) databases 21 ; (2) 928 water chemistry data were sourced from 10 published studies in both English [22][23][24][25][26] and Russian 17,18,[27][28][29]; and (3) 347 unpublished datasets were provided by Gabysheva O.I. and Wang P. Chemical analyses of the waters sampled by research groups led by Gabysheva O.I. and Wang P. were performed at the laboratory of the Institute for Biological Problems of Cryolithozone and the Baikal Institute of Nature Management (Siberian Branch, Russian Academy of Sciences), respectively, following the methodology described by Semenov 30 .
For the 347 unpublished datasets, water samples were collected in pre-cleaned polypropylene bottles and immediately filtered through disposable sterile Sartorius filter elements (pore size 0.45 μm). The first 50 mL of the filtrate was discarded. The filtered solutions for cation and trace element analysis were acidified (pH = 2) with ultrapure double-distilled HNO3 and stored in high-density polyethylene (HDPE) bottles prewashed with 1 M HCl and rinsed with Milli-Q deionized water. Filtered water samples for anions were not acidified and were stored in HDPE bottles prewashed according to the procedure described above for cations. Some components were analysed directly at the sampling sites; the remaining samples were fixed according to the analysis procedure and transported in a refrigerated box at 1–3 °C. Anions (Cl−, SO42−, HCO3−) were determined by high-performance liquid chromatography (HPLC), and cations (Ca2+, Mg2+, K+ and Na+) were analysed by flame atomic-absorption spectrometry.
We consolidated all collected data for major dissolved ions (Na+, K+, Ca2+, Mg2+, Cl−, SO42− and HCO3−) in eastern Siberian rivers, divided them into 7 categories according to their spatial distribution in the six major river basins and out-of-basin areas (named Angara, Selenga-Baikal, Lena, Yana, Indigirka, Kolyma and Eastern Siberia in the "Basin" attribute), and eliminated duplicate data.

$$\text{Concentration in mEq/L} \times \frac{AW}{V} = \text{Concentration in mg/L}$$
The inorganic total dissolved solids (TDS) were determined as the sum of the seven major ions (Na+, K+, Ca2+, Mg2+, Cl−, SO42− and HCO3−) expressed in mg/L. Among the ArcticGRO datasets 20, the SO42− concentrations (117 datasets) were obtained by multiplying the concentration of sulfur (mg S/L) by three, and the HCO3− concentrations (147 datasets) were calculated from the alkalinity based on the ratio of equivalent weights 32 and marked as "cal_alk" in the "Note" attribute. The HCO3− concentrations in 151 groups of data from Huh, et al. 24, Huh, et al. 26, and GLORICH 21 were determined by the charge balance method from the other ions and marked as "cal_ib" in the "Note" attribute.
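The conversions in this subsection can be illustrated with a minimal Python sketch. It reads AW as the atomic (formula) weight and V as the valence of the ion, which is an interpretation rather than something stated in the text, and it assumes that alkalinity is reported in mg CaCO3/L. The weights are rounded textbook values, and all function names are hypothetical.

```python
# Approximate formula weights (g/mol) and charge magnitudes of the seven major ions.
# Rounded textbook values, not taken from the database.
IONS = {
    "Na": (22.99, 1),
    "K": (39.10, 1),
    "Ca": (40.08, 2),
    "Mg": (24.305, 2),
    "Cl": (35.45, 1),
    "SO4": (96.06, 2),
    "HCO3": (61.02, 1),
}


def meq_to_mg(ion, meq_per_l):
    """Convert mEq/L to mg/L by multiplying by the equivalent weight AW / V."""
    weight, valence = IONS[ion]
    return meq_per_l * weight / valence


def mg_to_meq(ion, mg_per_l):
    """Convert mg/L to mEq/L (inverse of meq_to_mg)."""
    weight, valence = IONS[ion]
    return mg_per_l * valence / weight


def sulfate_from_sulfur(mg_s_per_l):
    """SO4 (mg/L) from sulfur (mg S/L) using the factor of three given in the text.

    The factor approximates the weight ratio 96.06 / 32.06.
    """
    return 3.0 * mg_s_per_l


def bicarbonate_from_alkalinity(alk_mg_caco3_per_l):
    """HCO3 (mg/L) from alkalinity via the ratio of equivalent weights.

    Assumption: alkalinity is expressed as mg CaCO3/L (equivalent weight 100.09 / 2).
    """
    hco3_equivalent_weight = 61.02 / 1
    caco3_equivalent_weight = 100.09 / 2
    return alk_mg_caco3_per_l * hco3_equivalent_weight / caco3_equivalent_weight


def tds(sample_mg_per_l):
    """Inorganic TDS as the sum of the seven major ions, all in mg/L."""
    return sum(sample_mg_per_l[ion] for ion in IONS)
```

With these helpers, a sample reported in mEq/L would first be converted ion by ion with `meq_to_mg` and then summed with `tds`.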
Ionic charge balance controls. To control the data quality of the water samples, the ionic charge balance technique was used, since the sum of the concentrations of the negatively charged ions should equal the sum of the concentrations of the positively charged ions in each sample. The ion balance (IB) was determined as follows (WMO 33):

$$IS = \sum_{i \in \text{cations}} C_i + \sum_{i \in \text{anions}} C_i$$

$$ID = \sum_{i \in \text{cations}} C_i - \sum_{i \in \text{anions}} C_i$$

$$IB = \frac{ID}{IS} \times 100\%$$

where C_i is the concentration of ion type i in a specific sample (mEq/L); IS is the sum of all ion concentrations (mEq/L); ID is the difference between the sum of the cation concentrations and the sum of the anion concentrations (mEq/L); and IB is the ratio of ID to IS, representing both systematic and random errors during the measurements. In total, 122 samples (8.5% of the total samples) with absolute values of IB greater than 10% were marked as "imbalance" in the "IB" attribute and excluded from this study, and 48 samples in which some ions were absent were marked as "absent". The remaining 1264 samples were considered reasonable for further analysis.
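The ion balance check described above can be sketched in a few lines of Python. The sketch assumes that IB is expressed as a percentage, which is the only reading consistent with a threshold of 10, and that concentrations are already in mEq/L. The function names, the dictionary keys and the "ok" label for passing samples are hypothetical; only "imbalance" and "absent" come from the text.

```python
CATIONS = ("Na", "K", "Ca", "Mg")
ANIONS = ("Cl", "SO4", "HCO3")


def ion_balance(sample_meq):
    """Return IB in percent, or None if any major ion is missing.

    sample_meq maps ion names to concentrations in mEq/L.
    """
    if any(sample_meq.get(ion) is None for ion in CATIONS + ANIONS):
        return None
    cation_sum = sum(sample_meq[ion] for ion in CATIONS)
    anion_sum = sum(sample_meq[ion] for ion in ANIONS)
    ion_sum = cation_sum + anion_sum          # IS
    ion_difference = cation_sum - anion_sum   # ID
    if ion_sum == 0:
        return None
    return 100.0 * ion_difference / ion_sum   # IB


def ib_flag(sample_meq, threshold=10.0):
    """Label a sample in the style of the "IB" attribute.

    "ok" is a placeholder for samples that pass; the database's own label
    for such samples is not given in the text.
    """
    ib = ion_balance(sample_meq)
    if ib is None:
        return "absent"
    return "imbalance" if abs(ib) > threshold else "ok"
```

A sample with 6 mEq/L of cations and 4 mEq/L of anions, for example, gives ID = 2, IS = 10 and IB = 20%, so it would be flagged as "imbalance".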
Normal distribution assessment. The normality of the 1264 sets of TDS data was assessed using skewness and kurtosis, an approach that applies to both small and large samples 34. The skewness (γ1) describes the degree of asymmetry in a distribution, and the kurtosis (γ2) describes the extent to which the density of observations differs from the probability density of the normal curve 35. Both are calculated from the sample size n, the individual values x_i, the mean value x̄ and the standard deviation SD. A z-test was applied, and z scores were obtained by dividing the skew values or excess kurtosis by their standard errors 34:

$$Z_{\gamma_1} = \frac{\gamma_1}{SE_{\gamma_1}}, \qquad Z_{\gamma_2} = \frac{\gamma_2}{SE_{\gamma_2}}$$

where SE γ1 and SE γ2 are the standard errors of skewness and kurtosis, respectively. The normality test results from IBM SPSS software (https://www.ibm.com/analytics/spss-statistics-software), with positive skew values and positive excess kurtosis, show that the TDS values do not follow the normal distribution (Table 3), as the absolute z scores are larger than 1.96 (α = 0.05).
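As an illustration of the z-test, the following Python sketch computes skewness, excess kurtosis, their standard errors and the resulting z scores. It assumes the bias-corrected sample estimators and standard-error formulas commonly reported by statistical packages; the exact estimators used for Table 3 are not given in the text, so this is one plausible reading. For n = 1264 these standard-error formulas give about 0.07 and 0.14, which is consistent with the values in Table 3. Function names are hypothetical.

```python
import math


def standard_errors(n):
    """Standard errors of sample skewness and excess kurtosis for sample size n."""
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return se_skew, se_kurt


def skew_kurtosis_z(values):
    """Return skewness, excess kurtosis, their standard errors and z scores."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four values are required")
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if sd == 0:
        raise ValueError("values must not all be identical")
    sum_cubed = sum(((x - mean) / sd) ** 3 for x in values)
    sum_fourth = sum(((x - mean) / sd) ** 4 for x in values)
    # Bias-corrected sample skewness and excess kurtosis (assumed estimators).
    skew = n / ((n - 1) * (n - 2)) * sum_cubed
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum_fourth
            - 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew, se_kurt = standard_errors(n)
    return {
        "skewness": skew,
        "kurtosis": kurt,
        "se_skewness": se_skew,
        "se_kurtosis": se_kurt,
        "z_skewness": skew / se_skew,
        "z_kurtosis": kurt / se_kurt,
    }
```

A dataset would be judged non-normal at α = 0.05 when the absolute value of either z score exceeds 1.96.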
Outlier detection. The 1264 sets of TDS data varied widely (12–2586 mg/L). Tukey's method 36 applies to both symmetric and skewed data and, unlike the standard deviation (SD) method (Mean ± 2 SD, Mean ± 3 SD) 37, detects more outliers for data that do not follow a normal distribution. Since Tukey's method makes no distributional assumptions about the data 37, outliers in this study were detected by Tukey's 3 IQR (interquartile range) method. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) 38:

$$IQR = Q3 - Q1$$

Table 3. Normality tests of the TDS dataset using skewness and kurtosis.
          Sample size   Skewness   SE (skewness)   z (skewness)   Kurtosis   SE (kurtosis)   z (kurtosis)
Result    1264          5.34       0.07            77.33          38.98      0.14            282.43
Samples were flagged as potential outliers by the inner fences, with a 1.5 IQR interval, and as possible outliers by the outer fences, with a 3 IQR interval 37,39.
Inner fences are situated at a distance of 1.5 IQR below Q1 and above Q3:

$$\text{Low potential outliers} = Q1 - 1.5 \times IQR$$

$$\text{High potential outliers} = Q3 + 1.5 \times IQR$$
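A minimal Python sketch of Tukey's fences follows, covering the inner fences above and the outer 3 IQR fences used in this study. The quartile convention is an assumption: the text does not say how Q1 and Q3 were computed, so the sketch uses the "inclusive" method of the standard library. Function names and the "within fences" label are hypothetical.

```python
import statistics


def tukey_fences(values):
    """Return the inner (1.5 IQR) and outer (3 IQR) fences of a dataset."""
    # Assumed quartile convention: the "inclusive" method of statistics.quantiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def classify(value, fences):
    """Label a value following the terminology of this section.

    Values beyond the outer fences are "possible outliers"; values beyond
    the inner fences but inside the outer fences are "potential outliers".
    """
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    if value < outer_low or value > outer_high:
        return "possible outlier"
    if value < inner_low or value > inner_high:
        return "potential outlier"
    return "within fences"
```

Because every value beyond the outer fences also lies beyond the inner fences, a count of all values outside the inner fences includes the possible outliers as well.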
The intervals with 3 IQR are called outer fences and are located below Q1 and above Q3 at 3 IQR distances:

$$\text{Low possible outliers} = Q1 - 3 \times IQR$$

$$\text{High possible outliers} = Q3 + 3 \times IQR$$

The outlier detection results (Table 4) show that 4.4% and 8.2% of the 1264 TDS data were identified as possible outliers and potential outliers, respectively.
Subbasin selection. The subbasin boundaries used in this study were extracted from the HydroBASINS shapefile 40, which divides a basin into subbasins at every location where two river branches meet, each with an individual upstream area that exceeds a certain size threshold (i.e., 100 km²). The rule still allows smaller subbasins to occur, and we selected the 6th-level basins for this database according to the data volume and sampling density.
In total, the 218 subbasins in which the sampling sites were located were selected from the 776 subbasins in the eastern Siberia region (including its six major basins) (Fig. 2). Each subbasin is identified by a unique code in ObjectID and is accompanied by the average river water chemistry and the climatic factors (temperature (T), precipitation (P) and potential evaporation (PET)) at the subbasin scale. T and PET were derived from the Climatic Research Unit (CRU) 4.04 dataset 41, and P was obtained from the Global Precipitation Climatology Centre (GPCC) dataset 42 at a resolution of 0.5° from 1901 to 2019.
Meteorological data processing. We clipped the meteorological data (.nc file) using the subbasin boundaries and then pre-processed the data to filter out missing values. The monthly precipitation data (mm/month) of each year were summed to obtain the annual precipitation (mm/year). The potential evaporation data, which are daily values, were likewise converted to annual totals by multiplying by the number of days in each year. After that, we averaged the meteorological data within each subbasin.
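The aggregation steps above can be sketched as follows. This is a simplified illustration on plain Python lists, not the code distributed with the database: it assumes that the grid cells of one subbasin have already been clipped, that missing values are represented as None, and that potential evaporation is supplied as a mean daily rate per year. All function names are hypothetical.

```python
import calendar


def annual_precipitation(monthly_mm):
    """Sum twelve monthly totals (mm/month) to an annual total (mm/year).

    Returns None if the year is incomplete or contains a missing month.
    """
    if len(monthly_mm) != 12 or any(value is None for value in monthly_mm):
        return None
    return sum(monthly_mm)


def annual_pet(mean_daily_mm, year):
    """Scale a mean daily PET rate (mm/day) by the number of days in the year."""
    if mean_daily_mm is None:
        return None
    days = 366 if calendar.isleap(year) else 365
    return mean_daily_mm * days


def subbasin_mean(cell_values):
    """Average the valid grid-cell values that fall within one subbasin."""
    valid = [value for value in cell_values if value is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)
```

For one subbasin and one year, the annual value of each clipped grid cell would be computed first, and `subbasin_mean` would then be applied to the list of cell values.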

Data Records
The dataset is publicly available at figshare 43. The water chemistry database consists of the following 3 categories and associated listed files:
Category 1: Boundary data. This folder contains the boundaries of eastern Siberia and its six major river basins (Angara, Selenga-Baikal, Lena, Yana, Indigirka and Kolyma) with the river system, which consist of four shp files:
Eastern_Siberia_boundary.shp
Basin_boundary.shp
Subbasin_boundary.shp
River_system.shp
Category 2: Water chemistry data. This folder contains the full river water chemistry database, which consists of a csv file with all total dissolved solids (TDS) and major ions (Na+, K+, Ca2+, Mg2+, Cl−, SO42− and HCO3−), as well as related information (basins, coordinates, sample period, data source, permafrost type, and lithology), basic climatic (temperature and precipitation) and discharge data for each sample (Sample_ID). This folder also contains a sample summary csv file, which provides the maximum, minimum, mean, standard deviation and sample count for the ion concentrations and TDS in each river basin.

Technical Validation
Quality assurance for the 1434 unique datasets from the independent sources was separated into two stages (Fig. 3): (1) import and standardization and (2) screening by chemical and statistical methods.
Import and standardization. Data extracted from different sources (manuscripts, online databases and field work reports) were input into an initial data file according to the corresponding attributes without alteration. After the multisource data were assembled, an initial check for transcription errors was conducted and input errors (e.g., decimal point mislocation, incorrect placement of variables, and character errors) were corrected. Then, standardization and unit conversion were carried out for the original water chemistry data by parametric conversion (i.e., determining the concentration of hydrogen carbonate from alkalinity and determining the concentration of sulfate from the sulfur concentration) and conversion of units into mg/L. Duplicate data were then screened by comparing the coordinates and times of the datasets. Finally, a random 10% of the data were selected from our database and validated to eliminate errors introduced during the whole import and standardization process.
Screening by chemical and statistical methods. We compared the original TDS values from the data sources (i.e., literature and databases) with the TDS calculated as the sum of the major ions to ensure the rationality of the original ion concentration data. Forty-eight datasets were missing ions and were marked as "absent" in the "IB" column of the "Samples_database.csv" file. Then, we performed the charge balance check across all datasets and identified 122 samples that did not meet the ion balance (marked as "imbalance" in the "IB" column of the "Samples_database.csv" file). The remaining 1264 sets of data were explored by the normal distribution assessment and outlier detection methods, and the input and processing of outliers were then verified. Both the inner and outer fences were determined for the 1264 TDS datasets, and outliers with high mineralization of river water appear only in the Angara, Lena and Selenga-Baikal River basins due to different karst processes. Finally, 150 datasets were selected randomly from the final database twice, and the final validation was conducted by people not involved in the data collection process.

Meteorological data validation.
To ensure the reliability of the meteorological data, the gridded data were compared with the observation data from the meteorological stations. The validation of monthly gridded data against the observed data (Fig. 4)

Usage Notes
The Eastern Siberian River Chemistry (ESRC) database includes the boundaries of eastern Siberia, its six river basins and the 218 subbasins in which water samples were taken. The database also includes, for 1434 samples, the concentrations of 7 major ions, total dissolved solids (TDS), climatic factors (temperature and precipitation), lithology, permafrost type, and sampling information, as well as annual air temperature, precipitation, potential evaporation, and discharge data for each subbasin from 1901 to 2019.
Meteorology datasets. The datasets contain 3 files, tmp.csv, pre.csv and pet.csv, which hold air temperature (T, °C), precipitation (P, mm/yr), and potential evaporation (PET, mm/yr) data, respectively. Each file has the same structure with two main attributes: Subbasin ID (Subbasin) and yearly data average (1901 to 2019).
Subbasin ID: the unique code of the subbasin according to ObjectID; there are 218 subbasins in total.
Yearly data average: named by the corresponding year and holding the average annual temperature (or precipitation or potential evaporation) within each subbasin from 1901 to 2019. Missing values are denoted as −9999.
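As a usage illustration, the sketch below reads a table laid out as described above and converts the −9999 marker into None. The exact column headers are an assumption: the sketch supposes one "Subbasin" column followed by one column per year named by the year itself, and the two-row sample is invented for demonstration only. The function name is hypothetical.

```python
import csv
import io

MISSING = -9999.0  # missing-value marker stated in the usage notes


def read_yearly_table(handle, id_column="Subbasin"):
    """Read a subbasin-by-year table into {subbasin_id: {year: value or None}}.

    Assumes one id column and one column per year, named by the year.
    """
    table = {}
    for row in csv.DictReader(handle):
        subbasin_id = row.pop(id_column)
        series = {}
        for year, text in row.items():
            value = float(text)
            series[int(year)] = None if value == MISSING else value
        table[subbasin_id] = series
    return table


# Invented two-row sample with the assumed layout (not real data).
sample_file = io.StringIO("Subbasin,1901,1902\n1,-10.5,-9999\n2,-8.0,-7.5\n")
sample_table = read_yearly_table(sample_file)
```

For the real files, the `io.StringIO` object would be replaced by a file handle such as `open("tmp.csv", newline="")`, with `id_column` adjusted if the header differs.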

Code availability
Within the repository, we also provide code in the code folder for extracting the climate data of each subbasin from the Climatic Research Unit at the University of East Anglia (http://www.cru.uea.ac.uk/) and the Global Precipitation Climatology Centre (https://climatedataguide.ucar.edu/climate-data/gpcc-global-precipitation-climatologycentre).
♦ The shp folder contains 218 subbasin boundary shp files.
♦ The downloaded input data are stored in 3 nc files with annual average temperature, annual precipitation, and annual potential evaporation data.