A database of global coastal conditions

Remote sensing satellite imagery can be used to monitor and understand dynamic environmental phenomena by retrieving information about Earth’s surface. Marine ecosystems, however, have been studied with less intensity than terrestrial ecosystems due, in part, to data limitations. Data on sea surface temperature (SST) and Chlorophyll-a (Chlo-a) can provide quantitative information on environmental conditions in coastal regions at high spatial and temporal resolutions. Using the exclusive economic zone of coastal regions as the study area, we compiled monthly and annual statistics of SST and Chlo-a globally for 2003 to 2020. This ready-to-use dataset aims to reduce the computational time and costs for local-, regional-, continental-, and global-level studies of coastal areas. Data may be of interest to researchers in the areas of ecology, oceanography, biogeography, fisheries, and global change. Target applications of the database include monitoring of biodiversity, marine microorganisms, and environmental anomalies.

originates in the surface thermal skin layer of the ocean and not the water below as measured by in situ thermometers 27 . SST provides fundamental information on the global climate systems, and it is an essential parameter in weather prediction 28 . Chlo-a is a proxy for understanding fluctuations in algae and pigmented bacteria as it can elucidate photosynthetic activity in coastal systems 4,20,29 . The near-surface concentration of Chlo-a is calculated using an empirical relationship derived from in situ measurements, and the implementation of the standard O'Reilly band ratio OCx (e.g., OC3M, for the MODIS sensor) algorithm merged with the color index algorithm of Hu et al. 30,31 . SST and Chlo-a have been crucial in studies to reconstruct environmental phenomena, such as Vibrio cholerae emergence 13,32,33 , algae blooms 29,34,35 , El Niño and La Niña dynamics 36 , and coral bleaching 37 .
Satellite-derived data have many limitations given their sensitivity to absorption of solar insolation, heat exchange with the atmosphere, and sub-surface turbulence. Nevertheless, since these conditions are known and common, validation and uncertainty are estimated relative to in situ buoys to correct final datasets 38–40 . Satellite-derived data provide an opportunity to analyze large study areas during extended periods, at the cost of limiting the information to surface level. Complementary approaches may include the addition of more oceanic and atmospheric observations like bathymetry, wind direction, and wind speed 1 . We compiled remotely sensed data of monthly SST and Chlo-a from the exclusive economic zone (EEZ) of coastal areas globally for an 18-year period (2003-2020). Data were used to generate summary statistics as yearly and monthly composites. Code is included to update the database as data are released. This database can be downloaded freely through Figshare 41 .

Methods
This section describes the procedures used to generate the individual data records that comprise the SST and Chlo-a databases. Data retrieval and analysis performed during the development of the database were executed using the statistical software R 42 . The SST and Chlo-a databases were developed in four stages: (a) data procurement, (b) preparation, (c) processing, and (d) analysis. The first two stages were associated with input data, while the third stage applied specific methods to construct the core of each database. The fourth stage included the statistical analyses of the data. The methodological stages are summarized in Fig. 1 and described in detail below.

Data procurement. The database is based on satellite observations derived from the MODIS instrument, which is carried aboard the Terra and Aqua satellites. The Terra and Aqua satellites have been orbiting Earth since their launches in 1999 and 2002, respectively, obtaining data of Earth's surface every one to two days at three spatial resolutions (250, 500, 1000 m) and 36 spectral bands (from 0.405 to 14.385 µm). From the atmospheric and oceanic observations made available from NASA's Aqua Spacecraft, Sea Surface Temperature (SST) in °C and Chlorophyll-a (Chlo-a) in mg m−3 were selected since they summarize major physical and biological phenomena. SST and Chlo-a are available at a temporal resolution of 1-day, 8-day, and monthly composites and a spatial resolution of ~4 km (Table 1).
SST and Chlo-a, among other environmental variables, can be accessed through the National Oceanic and Atmospheric Administration's (NOAA) Coastal Watch Environmental Research Division Data Access Protocol (ERDDAP) data server, also known as NOAA's Coastal Watch. NOAA's Coastal Watch is a program that provides timely access to near-real-time satellite data to monitor, restore, and manage coastal ocean resources, and the ERDDAP Data Server supports manual downloads through a web application and remote downloads from any computer program (e.g., MATLAB, R, JSONP, Python) of both gridded and tabular data 43 .
Data downloading. The remote request to the ERDDAP Data Server relies on the creation of specially formed URLs to query the server for a specific database. A URL consists of a root, a target, and a constraint expression 43 . To procure the inputs needed to assemble this database, specially formed URLs were created through a programming algorithm in R (Auxiliary Materials 44 ).
The root or base URLs that provided the location of the gridded database were obtained from the ERDDAP griddap documentation webpage (https://coastwatch.pfeg.noaa.gov/erddap/griddap/documentation.html) and remained constant in all requests for a specific database.
The target is equivalent to the unique identifier or data set ID previously assigned by the ERDDAP (https://coastwatch.pfeg.noaa.gov/erddap/griddap), in conjunction with a specific data file type extension. For this study, .nc was selected, producing NetCDF-3 binary files with COARDS/CF/ACDD metadata. NetCDF (Network Common Data Form) files are recommended when using software tools to analyze geospatial data as they provide multidimensional scientific data in a standardized manner (https://coastwatch.pfeg.noaa.gov/erddap/griddap/documentation.html) 45,46 .
The constraint expression (or query) helped define the parameters, which correspond to the study period and spatial coverage. Regarding the first parameter, the study period comprised all available observations from the MODIS instrument aboard the Aqua satellite (i.e., monthly composites from 2003 to 2020). The spatial coverage was defined by the minimum and maximum latitude (i.e., 89.98°S to 89.98°N) and longitude (i.e., 179.98°W to 179.98°E) from the original satellite image for global coverage.
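The root, target, and constraint expression described above can be combined as in the following minimal Python sketch. It is an illustration only and not the authors' R code: the dataset ID, variable name, and time stamps are hypothetical placeholders, and the number, order, and direction of the dimensions in the constraint depend on the specific ERDDAP dataset.

```python
# Root of the griddap service, taken from the URLs cited in the text.
ROOT = "https://coastwatch.pfeg.noaa.gov/erddap/griddap/"


def build_griddap_url(dataset_id, variable, time_range, lat_range, lon_range,
                      file_type=".nc"):
    """Assemble root + target + constraint expression for a griddap request."""
    # Target: the data set ID followed by the file type extension.
    target = f"{dataset_id}{file_type}"
    # Constraint: each dimension is written as [(start):stride:(stop)];
    # a stride of 1 keeps every cell along that dimension.
    dims = "".join(
        f"[({start}):1:({stop})]"
        for start, stop in (time_range, lat_range, lon_range)
    )
    return f"{ROOT}{target}?{variable}{dims}"


url = build_griddap_url(
    dataset_id="HYPOTHETICAL_DATASET_ID",      # placeholder, not a real ID
    variable="hypothetical_variable",          # placeholder variable name
    time_range=("2003-01-16T00:00:00Z", "2020-12-16T00:00:00Z"),  # assumed stamps
    lat_range=(-89.98, 89.98),
    lon_range=(-179.98, 179.98),
)
print(url)
```

Keeping the three URL parts separate means that the same function can request SST or Chlo-a by changing only the target and the variable name.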
Data preparation. Data within the NetCDF files were imported into R using the RNetCDF package 47 . A NetCDF object contains a list of at least four attributes: time, longitude, latitude, and the values of the variable being measured (i.e., SST and Chlo-a). The attribute corresponding to the specific variable being measured was extracted from the NetCDF object and transformed into a raster object using the RNetCDF and raster packages in R 48 . A raster object consists of a matrix of cells (i.e., pixels) organized into rows and columns, where each cell contains a value representing information (i.e., temperature and pigmentation), together with the metadata describing the spatial information of the object 49 .
As the last piece of the data preparation process, the extent of the raster was verified to match that of the original satellite data. Extent was set to latitude and longitude of 89.98°S to 89.98°N and 179.98°W to 179.98°E, respectively. The coordinate reference system (CRS) was defined to be relative to the WGS84 datum for easy manipulation by the end user.
Data processing. A significant feature of the SST and Chlo-a databases is the segmentation by the world's exclusive economic zones (EEZ). An EEZ is a marine zone within 200 nautical miles of a country's coastline where each country claims jurisdiction for economic activities 50 . Given the oceanographic nature of the data, focusing on the 200-mile buffer of the EEZ provides a more comprehensive explanation of oceanic changes, with the potential to promote the development of ocean planning initiatives directly influencing human settlements on the coasts. To represent the EEZ, a geospatial vector file in shapefile format was constructed by delimiting a buffer of ~200 miles off coastlines globally.
The EEZ regions were defined using the functions crop and mask from the raster 48 package. The function mask placed the area of interest (i.e., the EEZ) on top of each monthly raster, assigning no value to cells outside of the area of interest, while the function crop ensured that each raster matched the extent of the area of interest (Fig. 2). The core database included 408 individual rasters cropped and masked to the EEZ of each country.

Statistical analysis. Complementary to the core database, data were treated as an m by n matrix, where m represents the years and n represents the months, and stacked in two distinct ways: (1) in yearly composites and (2) in monthly composites.
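The cropping and masking step can be illustrated with a small, library-free Python sketch. It is not the raster package implementation: the grid is a plain list of lists, the EEZ is assumed to be already rasterized into a Boolean grid of the same shape, and the function names mask and crop are reused here only to mirror the description.

```python
def mask(grid, inside):
    """Keep cells where `inside` is True; set all other cells to None (no value)."""
    return [
        [value if keep else None for value, keep in zip(row, flags)]
        for row, flags in zip(grid, inside)
    ]


def crop(grid, inside):
    """Trim the grid to the smallest row/column window containing the area of interest."""
    rows = [i for i, flags in enumerate(inside) if any(flags)]
    cols = [j for j in range(len(inside[0])) if any(flags[j] for flags in inside)]
    r0, r1, c0, c1 = rows[0], rows[-1], cols[0], cols[-1]
    return [row[c0:c1 + 1] for row in grid[r0:r1 + 1]]


# Toy monthly raster (values in °C) and a toy EEZ footprint on the same grid.
sst = [[10.0, 11.0, 12.0, 13.0],
       [14.0, 15.0, 16.0, 17.0],
       [18.0, 19.0, 20.0, 21.0]]
eez = [[False, False, False, False],
       [False, True,  True,  False],
       [False, False, True,  False]]

result = crop(mask(sst, eez), eez)
print(result)  # [[15.0, 16.0], [None, 20.0]]
```

Masking removes values outside the area of interest, while cropping only reduces the extent, so cells inside the cropped window but outside the EEZ remain empty.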

We created the annual and monthly stacks using the stack function in the raster package 48 . The mean, range, maximum, minimum, and standard deviation values were estimated for annual and monthly SST and Chlo-a. We obtained a total of 90 rasters for the yearly composites (18 years, five different statistics) and 60 rasters for the monthly summaries (12 months, five different statistics).
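The five summary statistics can be sketched cell by cell across a stack of layers, as in the following Python illustration. It is a simplified stand-in for the R workflow: layers are lists of lists, missing cells are None, and the standard deviation is assumed to be the sample standard deviation.

```python
import statistics


def cell_stats(values):
    """Summary statistics for one cell across a stack, ignoring missing values (None)."""
    obs = [v for v in values if v is not None]
    if not obs:
        return None  # the cell has no valid observation in any layer
    return {
        "mean": statistics.fmean(obs),
        "min": min(obs),
        "max": max(obs),
        "range": max(obs) - min(obs),
        # Sample standard deviation (assumption); needs at least two observations.
        "sd": statistics.stdev(obs) if len(obs) > 1 else None,
    }


def stack_stats(layers):
    """Apply cell_stats to every cell position of equally shaped layers."""
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    return [
        [cell_stats([layer[i][j] for layer in layers]) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy stack: the same month in three different years, on a 1 x 2 grid.
january_layers = [
    [[20.0, None]],
    [[22.0, None]],
    [[24.0, 5.0]],
]
stats = stack_stats(january_layers)
print(stats[0][0])
```

Stacking all layers of one calendar month across years gives a monthly summary, while stacking the twelve months of one year gives a yearly summary.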

Data Records
Final data are provided in the form of GeoTIFFs for the EEZ boundaries and statistical analysis results 41 . Data can be downloaded as annual, monthly, or summary composites of the 18-year period. Data can also be updated using the code included in the Auxiliary Material in Figshare 44 .

Technical Validation
Remotely sensed environmental observations from the MODIS instrument, including SST and Chlo-a, have been extensively validated by the scientific community against a number of models and in situ measurements 51–58 and used in a diverse set of studies 13,14,19,59–67 . For instance, validation of the SST observations uses accurate ship-based infrared radiometers and drifting and moored buoys with thermometers at a depth of one meter 38,56,57 . NASA's standard processing and distribution of the SST products are performed using software developed by the Ocean Biology Processing Group 18 . SST products are validated internally by NASA using a collocated matchup database of in situ observations that are collected within 30 minutes of an overpass and 10 km of a pixel. MODIS SST observations represent the thermal skin layer of the ocean, which is <1 mm thick and is cooler than the underlying water due to vertical heat flux 68,69 . At night or when wind speeds are greater than ~6 m/s, the skin and subsurface temperatures are nearly equal. It is under these conditions that validation and uncertainty estimates relative to sub-surface in situ buoys are typically reported 20,38 . The estimation vs. observation relationship, however, can be very variable under conditions of low wind speeds and reduced sub-surface turbulence 21,70 . Furthermore, NASA MODIS uses a collection of cloud classification algorithms to indicate when a pixel corresponds to clear sky conditions (i.e., no cloud coverage). The most recent cloud-classification method is the Alternating Decision Trees 71 . Other SST validation tests include a regional ice test, where reflectance thresholds are determined using the Sentinel-2 MSI calibrated reflectance 72 , and correction of dust contamination 73 .
MODIS Chlo-a observations are derived from the O'Reilly OC3M algorithm and the Hu color index 30,31 . The algorithm is calculated using an empirical relationship from in situ measurements and remote sensing reflectance in the blue-to-green region of the visible spectrum. Level 3 MODIS data may provide biased minimum and maximum values when observations contain errors, for example, cloud contamination or sunlight affecting the value captured by the sensor. Due to potential atmospheric contamination, some regions could have a limited number of observations from which to estimate the monthly values, which increases uncertainty. There is an estimated ±35% nominal uncertainty related to the OC3M algorithm used to derive the global Chlo-a product. Nevertheless, error could increase in optically complex waters like those present in coastal areas 74,75 .
We performed a data validation procedure comparing MODIS observations of SST and Chlo-a against gold-standard sensors. More specifically, we compared MODIS data against SST data from Sentinel-3 76 during the year 2020. We found that data from MODIS and Sentinel-3 were statistically indistinguishable with a Pearson correlation coefficient of r = 0.99 for the annual mean, minimum, and maximum composites (R 2 = 0.99, p < 0.05; Supplementary Fig. S1). Additionally, Chlo-a data were evaluated by comparing MODIS data against SeaWiFS 30 observations for the year 2010, when the SeaWiFS satellite ended operations. We found that MODIS Chlo-a data were significantly correlated with SeaWiFS Chlo-a data but with less strength than for SST evaluations. More specifically, correlation was r = 0.83 (R 2 = 0.67, p < 0.05) for the mean, r = 0.71 (R 2 = 0.53, p < 0.05) for the maximum, and r = 0.76 (R 2 = 0.52, p < 0.05) for the minimum Chlo-a composites (Fig. S2). Together, these results suggest that MODIS data have a robust representation of environmental conditions in global coastal waters, at least when compared against gold-standard datasets of SST and Chlo-a.
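A comparison of this kind can be sketched as a Pearson correlation between co-located cells of two composites. The Python example below is an illustration with made-up toy values, not the validation data: both rasters are assumed to be flattened to lists on the same grid, and cells missing in either product (None) are skipped.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences, skipping missing pairs."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Toy values standing in for flattened composites from two sensors.
sensor_a = [1.0, 2.0, None, 4.0, 5.0]
sensor_b = [2.1, 3.9, 6.0, 8.2, 9.8]

r = pearson_r(sensor_a, sensor_b)
print(round(r, 3))
```

Skipping cells that are empty in either product keeps the comparison restricted to locations observed by both sensors.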

Usage Notes
The proposed use of this dataset is for coarse-scale, regional or global-level studies of coastal environmental conditions. Fine-scale assessments of SST and Chlo-a are warranted to improve accuracy and detail of these variables for local-level applications. The data can be used to identify anomalies for SST and Chlo-a at local, regional, and global levels. The example demonstrates SST and Chlo-a data explorations in tropical and temperate localities, identifying patterns over time (Fig. 3). Areas in the mid-Atlantic region of the United States show an increase in mean SST from June to October (Fig. 3a), while areas in the subtropics of the Americas (i.e., Ecuador and Colombia) reveal cooler temperatures during the same period (Fig. 3b). Additional exploration of the data in tropical and subtropical zones of different latitudes reveals that Chlo-a increases from September to December (Fig. 4b). In contrast, in the tropics, Chlo-a concentration increases between March and May (Fig. 4a).
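One possible way to explore anomalies with the monthly data is sketched below in Python. It assumes, for illustration only, that an anomaly is defined as the departure of a monthly value from the long-term mean of the same calendar month. The values are toy numbers for a single location.

```python
from collections import defaultdict


def monthly_anomalies(series):
    """Return {(year, month): value - long-term mean of that calendar month}."""
    by_month = defaultdict(list)
    for (year, month), value in series.items():
        by_month[month].append(value)
    # Long-term mean per calendar month, used as the reference (assumption).
    climatology = {month: sum(vals) / len(vals) for month, vals in by_month.items()}
    return {key: value - climatology[key[1]] for key, value in series.items()}


# Toy monthly mean SST (°C) at one location, keyed by (year, month).
sst_series = {
    (2003, 6): 24.0, (2004, 6): 25.0, (2005, 6): 26.0,
    (2003, 7): 27.0, (2004, 7): 27.0, (2005, 7): 30.0,
}

anomalies = monthly_anomalies(sst_series)
print(anomalies[(2005, 7)])  # 2.0
```

Comparing each month against its own long-term mean removes the seasonal cycle, so that unusually warm or cool months stand out.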

Code availability
Code in R language to recreate the database and the figures in the Usage Notes is available on Figshare 41 .