Spatiotemporal estimation of nutrient data from the northwest pacific and east asian seas

Nutrient data obtained from field observations have the potential to enhance our understanding of oceanic biogeochemical cycling and productivity changes. In particular, long-term nutrient data can provide valuable information on the links between climate change and biogeochemical changes. However, unlike other observational variables such as sea surface temperature, nutrient data are limited in terms of their broad-scale observations and automated sensor-based measurements. In this study, we analyzed nitrate and phosphate data obtained from coastal regions in Northeast Asia and the northwest Pacific from 1980 to 2019 using the spatiotemporal kriging technique and provide results in a spatiotemporal grid format. The data are available at monthly intervals and may be attractive to researchers in the fields of oceanography, marine ecology, and marine biogeochemistry at the climate change scale. Furthermore, sharing the source code of the data production process can contribute to better long-term data reproduction in the future.


Background & Summary
The northwest Pacific and adjacent eastern Asian waters are known for their high primary productivity [1][2][3] , which has been examined in various marine chemical and biological studies, including those on water quality changes and material cycling.Nutrient data play a crucial role in these studies, with nitrogen and phosphorus being especially significant, as they influence the growth and reproduction of phytoplankton and shape the area's phytoplankton species composition [4][5][6][7][8][9] .Human activities, such as artificial nitrogen input in densely populated regions such as eastern China and western Europe, affect the ocean's biogeochemical structure [10][11][12][13] .
Coastal waters near East Asia, including the heavily impacted Yellow Sea and East China Sea, the unique East Sea with relatively lower human impact, and the northwest Pacific influenced by the Kuroshio current, are influenced by both natural and human factors, leading to complex changes in nutrient levels 14 .Nutrient supply from the deep sea has been reported to have decreased worldwide due to the strengthening of the stratification [1][2][3]15 , while artificial supply through the atmosphere and rivers has increased in the East Asian waters 11,12,[16][17][18] . Ths, understanding the long-term changes in nutrient levels in this region is crucial, and various perspectives are being studied to comprehend the phenomenon and predict future changes 6,[19][20][21] .
From the perspective of the spatial estimation of ocean information, studies providing gridded data such as OISST (Optimum Interpolation SST [Sea Surface Temperature]) are ongoing [22][23][24] .In addition, research is being conducted to remove the bias of numerical models using spatial estimation techniques such as Kriging 25 .However, these techniques mainly focus on temperature and salinity.Nutrient data, however, primarily rely on in situ observations, as satellite remote sensing and unmanned equipment cannot cover them.Efforts have been made to provide monthly gridded nutrient data for the North Pacific region [26][27][28] , but 4D gridded nutrient data (x, y, z, t) that consider both spatial and temporal variations are limited to some reanalysis data using numerical models 29 .
This study uses 40 years of nutrient concentration observations from 1980 to 2019 to spatially and temporally optimize the data for the northwest Pacific Ocean (N25-45°, E121-145°) into a gridded format.The estimated nutrient grid data were validated through a verification process and are presented (with validation errors shown) in the modeled results.

Methods
The procedure used to create gridded data is summarized in Fig. 1.This section outlines the steps involved in transforming raw observational data into nutrient grid data and validating the results.All procedures were executed using the R programming language (R core team, 2023) 30 .The data production process consisted mainly of three steps: data collection and preprocessing, spatiotemporal estimation, and postprocessing and validation.These steps will be discussed in further detail below.
The OMMORV data are currently accessible for download starting from 1997, with earlier data available in the JMA Data Report of Oceanographic Observations Special Issue 35 .To avoid duplications, any potential overlap in the data were excluded from the spatiotemporal estimation process.

Nutrient data.
The SOO data can be obtained by specifying the region, line, station, observation date, and depth.These data comprise ten water quality parameters, including temperature, salinity, dissolved oxygen concentration, nitrate concentration, and phosphate concentration.The data have been collected since 1961 with a bimonthly (February, April, June, August, October, and December) measurement frequency.Although the station and line locations may have undergone slight changes, as of 2020, data from 207 stations along 25 lines have been accumulated.
Fig. 1 Workflow of the study.Nutrient and depth data were downloaded from various sources.The nutrient data obtained from different sources were compiled into a common data format, and data within the valid range of time, space, and nutrient variables were selected.Then, the refined data were used for variogram modeling and spatiotemporal kriging.10-fold cross-validation was performed to indicate estimation errors, and the data were saved.
The OMMORV data can be obtained through the provided link associated with each research vessel.The hydrographic data, saved with an '.E' extension, were used in this study.The format of the data underwent a change in 2010 and is now classified into two versions, 'E2.x' and 'E3.x' .The data are a 126-byte ASCII record in a fixed width format, with observation information and items arranged at regular intervals.Detailed information on the format of the data is available separately 36 .
To ensure compatibility, the nutrient data from both SOO and OMMORV underwent unit conversion.The original units of μmol/kg were changed to μmol/L, and the density was calculated based on the recorded water temperature and salinity data.This calculation was carried out assuming standard atmospheric pressure (10.1325 dbar) with the gsw library in the R programming language 37,38 .The SOO and OMMORV data were merged into a table of 2,023,251 observations and 16 variables, and the missing information for each variable is shown in Table 2.
The bathymetry used in this study utilized NOAA's ETOPO 2022 data, which is the second release of these data following ETOPO1 33,39 .The water depth data can be obtained either by downloading the desired area data from the following URL or by using the R Package marmap 40 .In this study, the water depth data within the range of N25-45° and E121-145° were processed at 10-minute intervals using the getNOAA.bathyfunction from the marmap library.The computation was performed on grid points that were densely set from the surface to the 9,000 m depth range using the standard depth from WOD18 41 .However, to reduce the computational load, the number of depth grids was reduced from 137 to 43.Depth intervals were created at 20 m intervals for the 0-200 m range, 100 m intervals for the 200-2000 m range, and 500 m intervals for the 2000-9000 m range to produce the grids.

Spatiotemporal estimation.
The spatiotemporal kriging (STK) approach was used to transform the irregular nutrient data collected from the SOO and OMMORV sources into spatiotemporal grid data.This method  has been widely used in various fields [42][43][44][45] and distinguishes itself from those in previous studies by considering the vertical dimension in the 4-dimensional spatiotemporal estimation.The R libraries gstat and spacetime were utilized for maximum 3-dimensional spatiotemporal estimation [46][47][48] ; however, a custom function had to be developed, as 4-dimensional coordinates are not supported by these libraries.Kriging with External Drift (KED) was applied for nutrient estimation in unmeasured spatiotemporal areas using spatiotemporal coordinates as the auxiliary variables.KED is also referred to as universal kriging when the drift is limited to spatial coordinates 49,50 .The unmeasured point's estimation at a specific time point is represented as a weighted combination of the spatial trend and the residual from the regression model at the measured point as in Eq. (1).
where x i and x 0 represent the coordinates of the observation and the target location, respectively, and the 4-dimensional coordinate structure including horizontal, vertical, and temporal dimensions is represented as . The subscript i denotes the observed location, and 0 denotes the location of interest for prediction.f k (x 0 ) is a function representing the average spatiotemporal variation, and a linear function is used.ω k is the coefficient of the regression function f k (x 0 ), and ω 0 is the Lagrange parameter to remove bias.e is the residual of f k (x 0 ), and λ i represents the weight coefficient of e.
The optimal coefficients (ω k , ω 0 , λ i ) that satisfy the condition of minimizing the error variance in Eq. ( 2) are derived in the form of Eq. ( 3), and it is solved as shown in Eq. (4).
The process of finding the solution in the form of a matrix equation is shown in Eqs.(3, 4), and the block matrices that make up the overall matrix equation are constituted as in Eq. (5-11).The bold symbols indicate the block matrix.
where X is the observed coordinates, X 0 is the target coordinates, and ~ represents the min-max scaled coordinates.I n is a unit vector of n × 1, 0 m is a zero vector of 1 × m, and 0 is a zero matrix of m × m.The superscript T on a matrix represents the transpose.
) is expressed as a function of the spatial distance h and the temporal distance u (12, 13). ) ) ) )

The h u ( , )
st SM γ was fitted using the Sum-Metric (14) model 51,52 .The Spherical model (15) was applied equally to the γ s , γ t , and γ joint models.
In the variogram model, h is the separation distance (generally, spatial distance), a is the range, and b is the nugget.The minimum of the observational data is C 0 + b, and κ is an anisotropic parameter for time.Each parameter was optimally estimated using the L-BFGS-B algorithm 46,53 .An example of spatiotemporal variogram modeling using observed nitrate data from 2013 is provided in Fig. 2.
When computing the actual z x ( ) 0 , only the estimated λ i is used, and the regression coefficient ω k , which determines the spatial average variation, is calculated using Eq. ( 16) as a constraint.
The prediction results of Eq. ( 16) are accompanied by the indicator of estimation uncertainty, the error variance, as provided in Eq. (17).

Data Records
The reproduced data are provided in comma-separated values (.csv) and R Data (.RData) file formats, with processing codes written in the R language 54 .The code for reading, analyzing, and visualizing the data can also be used to update the data.The data provided in CSV format consist of spatiotemporal coordinates (x, y, z, t), estimated values of nitrate or phosphate, and error variances of kriging.The error variances provide quantitative information on the magnitude of estimation errors and can be utilized in future conditional simulations.R Data (.RData) is a binary data format that can be directly loaded into memory using the load function in the R programming language for immediate use.

technical Validation
The performance of the estimation model was evaluated using 10-fold cross-validation for spatial estimation results obtained through STK (Fig. 3, Table 3).Note that Simple and Ordinary Kriging always predict values that are less than or equal to the maximum observed value, while KED can predict values that are greater or smaller than the neighboring observed values.Therefore, if an estimated value falls outside the range of the WOD18 standard, it may need to be adjusted to a value within the range before interpretation.In this case, negative concentration values were replaced with 0. The root mean square error (RMSE), mean absolute error (MAE), and adjusted coefficient of determination R ( ) were used as performance evaluation metrics (18

= − − − −
Where n is the number of data points and m(=4) is the number of predictor variables used in the estimation.
The performance of the model in predicting water temperature and nitrate and phosphate concentrations was evaluated using error metrics (RMSE, MAE, adjusted R-squared).Note that since various reanalysis datasets are available, sea water temperature estimates are not provided here, and only the performance evaluation results   ).The spatial distribution of the errors showed that the area near Hokkaido in Japan had higher nutrient concentrations than other areas, with approximately 7 μmol/L for nitrate and approximately 0.8 μmol/L for phosphate(Fig.4a,b).
The RMSE variation was analyzed according to depth(Fig.4c,d).The RMSE was observed to be relatively higher within the 0-1000 m depth range, where the thermocline is located, but remained stable below a depth of 1000 m.However, an increase in RMSE was observed in the deep sea below 5000 m.The increased error in the thermocline and deep sea was attributed to the abrupt changes in nutrient concentration and lack of data, respectively.
Additionally, Compatibility with global-scale projects was assessed.The raw data utilized in this study was contrasted with the biogeochemical data product of GLODAPv2.2022(https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0257247/) 59 .Since the data prior to 2010 was verified in previous research using CLIVAR and SIO datasets 11 , the focus was on data from 2010 onwards.A total of 671 data points with precisely matching longitude, latitude, depth and time were compared(Fig.5).
Subsequently, the data estimated by STK was also compared with the GLODAPv2.2022data(Fig.6).Among the gridded data from the period 2010-2019, grids that were spatiotemporally closest to certain GLODAPv2.2022data were compared.The grid data closest to the GLODAPv2.2022data were identified, and those within the 5% quantile distance were selected.The criteria for selection were a horizontal distance of approximately 15 km, a vertical distance of about 16 m, and a time difference within roughly 9 days.For NO3, 2,652 data points were contrasted, yielding an R adj 2 of 0.984 and a Residual Standard Error of around 2.03.For PO4, 2,676 data points were examined, with an R adj 2 of 0.981 and a Residual Standard Error of approximately 0.158.

Usage Notes
This dataset was used to assess the nutrient dynamics in select areas of the northwest Pacific, both locally and regionally (Fig. 7).Gridded data can be examined through basic statistical analysis and spatial statistical methods such as EOF.These data can also be utilized for comparison with biogeochemical modeling outcomes.The  surface (0-50 m averaged) nitrate concentration trend estimated in this study corroborates the decreasing nitrate concentration trend observed in previous studies 20,21 since approximately 2010 in the Yellow Sea (Fig. 8) .Table 4.The system environment used for development and testing is presented.The main development system for the STK library utilized a Windows operating system and was also tested on a Linux (Ubuntu) system.

2 where the matrices ij 2 σ and σ j 0 2
are calculated based on the spatiotemporal variogram h u ( , ) st SM γ estimated from the observational data.The variogram represents the change in covariance with respect to distance and time, reflecting the strength of the correlation between data points as a function of spatial and temporal distance.γ h u ( , ) st SM

Fig. 2
Fig. 2 Example of empirical (a) and fitted (b) variograms.The nitrate data from 2013 were used and fitted using the Sum-Metric model.The spatial and temporal distances were min-max scaled.The semivariance of the z-axis represents the inverse correlation with respect to spatiotemporal distance.

Fig. 3
Fig. 3 10-fold cross-validation results using observational data from 1980 to 2019, spanning a total of 40 years(a: nitrate, b: phosphate, c: water temperature).Some values fall outside of the valid range, but many data points are densely distributed around the 1:1 line (black solid line).Due to the nature of KED, concentration values calculated as negative values can be converted to 0 or detection limit values.

Fig. 4
Fig. 4 Error distribution for horizontal (a,b) and vertical (c,d) directions.a and c represent the error distribution of nitrate concentration, while b and d represent the error distribution of phosphate concentration.In areas with high data density, errors are relatively low.The vertical distribution of errors shows an increasing trend in the thermocline (approximately 300 m).

Fig. 5 A
Fig. 5 A comparison of nutrient data used for STK estimation (from NIFS and JMA) with spatiotemporally matched cruise data from GLODAPv2.2022.The results of the comparison for nitrate (a,c) and phosphate (b,d) are presented, with the distribution of differences converging towards a central value of 0.

Fig. 6
Fig. 6 Results from a comparison of nitrate (a) and phosphate (b) gridded data estimated by STK from 2010 to 2019 with GLODAPv2.2022data.Information from the linear model is represented by the linear regression line (black solid line), 95% confidence interval (red dotted line), and 95% prediction interval (blue dotted line).

Fig. 7
Fig. 7 Spatial distribution of the mean and standard deviation of nitrate (a,b) and phosphate (c,d) concentrations during the target period of estimation (1980-2019).The mean concentration and variability are highest in the Yellow Sea and near Hokkaido, Japan.The high standard deviation observed in the northwest Pacific off the southeast coast of Japan is attributed to the lack of data.

Fig. 8
Fig.8The time series decomposition of the estimated nitrate (a) and phosphate (b) concentrations in the Yellow Sea region is displayed, where they were decomposed into trend, seasonal, and residual components.These outcomes can be utilized for research on the sources of variability and trend analysis.

Table 1 .
Data specifications for SOO, OMMORV, and ETOPO 2022.The table includes information on data providers, file formats, observational periods, spatiotemporal resolutions, and accessible URLs.

Table 2 .
The basic description of the dataset merged from SOO and OMMORV.The preprocessed R object name and its description, unit, data type, and number of missing data are specified.