A salmon diet database for the North Pacific Ocean

The North Pacific Marine Salmon Diet Database is an open-access relational database built to centralize and make accessible salmon diet data through a standardized database structure. The initial data contribution contains 21,862 observations of salmon diet, and associated salmon biological parameters, prey biological parameters, and environmental data from the North Pacific Ocean. The data come from 907 unique spatial areas and mostly fall within two time periods, 1959–1969 and 1987–1997, during which there are more data available compared to other time periods. Data were extracted from 62 sources identified through a systematic literature review, targeting peer-reviewed and gray literature. The purpose of this database is to consolidate data into a common format to address gaps in our ecological understanding of the North Pacific Ocean, particularly with respect to salmon. This database can be used to address a variety of questions regarding salmon foraging, productivity, and marine survival. The North Pacific Marine Salmon Diet Database will continue to grow in the future as more data are digitized and become available.


Background & Summary
Even though salmon spend 1-6 years of their life in the marine environment, this phase of their life cycle is poorly understood compared to their freshwater phase 1 . There are limited data on how salmon are distributed, what they feed on, and what threats to survival they may face during this phase, which includes nearshore and offshore components. The marine phase is hypothesized to contain salmon population bottlenecks 2 , and research has shown that salmon smolt to adult survival rates can be less than 1% for some stocks in the North Pacific 3 . There is also evidence that Pacific salmon marine survival has been declining over the past several decades in certain areas, especially more southern stocks 4,5 . Therefore, it is becoming urgent that researchers understand what is happening to salmon during the marine phase of their life cycle, especially after they have moved offshore, because this phase tends to be data-poor compared to the early marine coastal phase.
One of the most important factors affecting salmon survival is the presence and abundance of suitable prey. Although it is difficult to assess prey distribution across ocean basins, information on prey presence and abundance can be derived from salmon diets. Marine diet data has the potential to give insight into food webs, niche overlap among species/stocks, competition, health, and changing ocean conditions [6][7][8] . Since the early 1900s, researchers have been examining the diets of Pacific salmon to provide information on species biology and the conditions that salmon face in the ocean 9,10 . Although there have been some reviews of salmon diet data in the North Pacific, these have been limited in time and space and the data are not normally made publicly available 7,8,11-15 . Salmon diet data have been collected using a variety of methodologies employed by researchers from countries across the North Pacific. These data are scattered across the peer-reviewed and gray literature, making collation challenging. However, there is high value in collating these data, considering the costliness and difficulty of conducting fieldwork in the open ocean. Synthesizing these data can reveal important information about salmon open ocean life history experience and further understanding of the potential impacts of changing oceans on salmon productivity. Additionally, more comprehensive, accurate and robust diet data will be an asset to ecosystem models, which are increasingly being applied in ecosystem-based management 16 . As salmon face an uncertain future with climate change 17,18 , this is a critical time to consolidate available knowledge in order to advance research on salmon marine ecology.
The goal of this project was to develop an open-access database framework for collating marine salmon diet data, alongside available salmon biological data, prey biological data, and environmental data. We compiled an initial contribution of salmon stomach content diet data from offshore areas using a systematic literature review, followed by quality control and standardization procedures for two time periods : 1959-1969 and 1987-1997. These decades were selected partially because they are time periods in which there are a larger quantity of data available on salmon diets. This database will continue to grow as more sources are identified and added and can be used as a tool by researchers to study salmon marine survival and North Pacific ecosystem dynamics.

Methods
Systematic literature review. In order to identify sources that contained salmon diet data, in the form of stomach contents, a systematic literature review was performed using database keyword searches of ProQuest: Aquatic Sciences and Fisheries Abstracts (https://search.proquest.com/asfa), Web of Science: Core Collection and Web of Science: Zoological Record (https://webofknowledge.com). Not all salmon diet studies are part of the peer-reviewed literature and many North Pacific researchers have published data through the North Pacific Anadromous Fish Commission (NPAFC) and the defunct International North Pacific Fisheries Commission (INPFC). Therefore, database search results were supplemented with relevant INPFC documents and bulletins (https://npafc.org/inpfc/), NPAFC documents (https://npafc.org/npafc-documents/) and bulletins (https://npafc. org/bulletin/), and relevant bibliographic references found within these documents and bulletins ( Table 1).
The database keyword searches identified 591 unique sources. Sources were filtered for relevance and excluded based on the following criteria: (i) the source did not have salmon stomach content diet data from between 1959-1969 or 1987-1997 for the marine environment, as defined by the area beyond the Riverine Coastal Domain (~15 km) 19 (556 sources); (ii) the source did not have extractable diet data and authors did not respond to inquiries (1 source); (iii) the source was a review, in which case the original sources were used, if possible and relevant, to extract data (2 sources); (iv) the source overlapped completely with another source, i.e., the data were reported using the exact same metrics for the same samples as another source (2 sources).
The database search was supplemented with sources that met the same criteria from the INPFC documents and bulletins and the NPAFC documents and bulletins, bringing the total number of unique sources to 62 (Online-only Table 1).

Data extraction.
For each salmon diet sample, we extracted the following data (if available): source metadata (e.g., publication year, title, authors) (Online-only Table 2), salmon capture method (Online-only Table 3), site information (time, location), salmon information (e.g., taxonomy, life stage, sex), salmon replicates, type of diet data (e.g., percent weight of prey, total number of prey), and prey information (e.g., taxonomy, life stage, quantity) (Online-only Table 4). A diet sample is defined as a distinct sample in time and space from a specific source and is entered into the database exactly as it is reported in the source. A sample can contain diet data from one salmon or more than one salmon when individuals were grouped together for diet analysis (up to 2,215 in this data compilation). If different diet metrics were reported for the same diet sample (e.g., number of prey and volume of prey), then all metrics were entered into the database. If the data were not available in table format, but figure format only, then the data were extracted using WebPlotDigitizer (https://apps.automeris.io/wpd/).
For each source, we extracted data as it was presented in the sources in almost all cases, and therefore it was extracted according to the data resolution of the source. For example, if the sample location was presented as a station, it was extracted as a station with specific geographical coordinates, but if it was presented as a transect or area, then it was extracted as a transect or area with latitude and/or longitude minimums and maximums. If geographical coordinates were not specified, then they were estimated based on survey maps or descriptions present in the source. Prey taxonomy was reported to different resolutions across sources (e.g., Copepoda versus Neocalanus cristatus). In order to keep the lowest data resolution reported in the source while also being able to compare across sources, each prey item was reported at all possible taxonomic levels (kingdom, phylum, class, order, family, genus, species). In addition to the salmon diet data, if the source presented additional related data for salmon biological parameters (i.e., variables) (Online-only Table 5), prey biological parameters (Online-only Tables 6 and 7), or environmental parameters (Online-only Table 8), these data were extracted as well. For detailed information about the different types of data extracted and the extraction methodology see Online-only Tables 2-8.

Database framework.
We built an open-access relational database in MySQL v8.0.18 called the "North Pacific Marine Salmon Diet Database" 20,21 . The North Pacific Marine Salmon Diet Database contains all of the extracted data noted above: diet data, salmon biological data, prey biological data, and environmental data. This database also allows for inclusion of prey biological data that are not associated with a salmon sample. For example, if a researcher conducted a zooplankton tow for potential prey and they have biological data for these potential prey (e.g., length, weight), these data can be added to the database. In this database, all data are linked by site, which has both a temporal and spatial component. All data are also related to a source in order to distinguish related data and trace its origins. While the database was built specifically to house North Pacific salmon diet data from the marine environment, the database structure was designed to easily be applied to other predator and prey interactions with only slight modifications.

Data records
The North Pacific Marine Salmon Diet Database currently contains 6,869 diet observations from 6,305 unique diet samples of over 69,942 salmon. Types of diet data included percent weight of prey, absolute weight of prey, average weight of prey, percent volume of prey, percent number of prey, absolute number of prey, average number of prey, frequency of occurrence (numerical and percent), stomach content index, index of fullness, and index of relative importance. The database also houses 11,965 observations of salmon biological parameters for 6,172 unique salmon samples. One observation means one biological parameter measured for one sample of salmon, which can contain one or many fish. Salmon biological parameters include length, weight, daily ration, empty stomachs, male/female ratio and many others. Additionally, the database includes 238 observations of prey biological parameters from 112 unique prey taxonomic categories. Prey biological parameters include body length, body weight, body width and size index. Finally, the database contains 2,790 observations of environmental parameters. Environmental parameters include temperature, salinity and others. These data are available in a static figshare repository 22 . The most current version of the relational database, associated documentation, and data are available in a dynamic GitHub repository, which will be updated as more sources are identified and added to the database 21 . Data were visualized using R statistical software v3.6.1 23 .
Spatial and temporal coverage. Diet samples were collected at 751 unique spatial locations, which included areas (polygons), transects, and point locations across the North Pacific from the California Current to the Sea of Japan (Fig. 1). Salmon biological data were collected from 709 locations, prey biological data from 4 locations, and environmental data from 446 locations. Salmon biological data were reported across the entire North Pacific, while prey biological data were sparsely reported from a few locations in the Gulf of Alaska and the Sea of Okhotsk/Kuril Islands. Environmental data were mainly available from the eastern and central North Pacific Ocean. Since our search was focused on specific time periods, most of the data we collected fell within our specified decadal periods: 1959-1969 and 1987-1997 (Fig. 2). However, some sources reported data from other time periods and these data were also included in the database. While diet data and salmon biological data were consistently reported across the temporal range, environmental data and prey biological data were inconsistently reported.
Salmon and prey species coverage. Sockeye (Oncorhynchus nerka), pink (Oncorhynchus gorbuscha) and chum (Oncorhynchus keta) were reported most frequently in our database, while coho (Oncorhynchus kisutch), Chinook (Oncorhynchus tshawytscha) and steelhead (Oncorhynchus mykiss) were reported less frequently ( Table 2). The most commonly reported prey groups were amphipods, fish (Class Actinopterygii), euphausiids, cephalopods (Subclass Coleoidea), and copepods ( Table 3). The category 'miscellaneous' was also commonly reported, although it usually made up just a small percentage of the diets. Within the diet data reported in the database, there are 186 unique prey taxa, meaning the lowest taxonomic classifications of prey items. Only 18.5% of salmon diet data were reported to the species level, while the majority were reported to higher taxonomic levels (e.g., Amphipoda, Decapoda, Euphausiacea).

Search terms Source
Results before filtering

Results after filtering
(Chinook OR "Oncorhynchus tshawytscha" OR coho OR "Oncorhynchus kisutch" OR sockeye OR "Oncorhynchus nerka" OR pink OR "Oncorhynchus gorbuscha" OR chum OR "Oncorhynchus keta" OR steelhead OR "Oncorhynchus mykiss") AND (marine OR ocean* OR coast* OR "Gulf of Alaska" OR "Bering Sea") AND (stomach OR gut* OR "prey composition" OR "diet* composition" OR "composition of diet" OR "composition of prey") AND (diet* OR prey OR food) www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
Standardization procedures were used to verify and collate the data. Since some of the taxonomic records were outdated, the taxonomies were verified and updated using the World Register of Marine Species (http://www. marinespecies.org/). For types of diet data that should add to a cumulative percentage of 100 (e.g., percent weight, percent volume), diet samples were excluded if the cumulative percent of prey was above 105% or below 95%. If the cumulative percentage did not add to 100 but still fell within this range, diet data values for that sample were rescaled to add to 100. For other metrics, including absolute and average weight and number of prey, as well as numerical frequency of occurrence, we consulted a salmon diet expert to determine if our highest values were reasonable to find in adult salmon stomachs (V. Zahner, pers. comm.).

Usage Notes
The North Pacific Marine Salmon Diet Database contains observations from many different sources, which report data with varying levels of detail. The method for data digitalization was to extract data as it was presented in the sources, or as close to it as possible. In this way, we have provided a more complete set of data that can be used for a variety of studies on salmon ecology and North Pacific ecosystems and left the manipulation and interpretation of data up to the discretion of the user.
The North Pacific Marine Salmon Diet Database is publicly available and can be used under the license of CC BY, meaning that the work can be distributed, remixed, adapted and built upon with acknowledgement of authors. There are a number of ways to access the information contained in the database. A static copy of the initial data contribution to the North Pacific Marine Salmon Diet Database described in this paper is available in a figshare repository in the form of seven csv files ('Diet_data' , 'Prey_biological_data' , 'Sources' , 'Predator_biological_data' , 'Environmental_data' , 'Gear_type_prey' , 'Gear_type_predator') 22 . In this case, 'predator' refers to salmon, but the database was designed so that it could easily be applied to other predator and prey interactions and thus we used the predator-prey terminology. Detailed information about the variables in the database can be found in Online-only Tables 2-8, which corresponds to the csv files in the figshare repository. To avoid making assumptions about how the data will be used, we have included all available data from the sources. This means that in some cases the same diet samples are reported multiple times using different diet metrics. In the figshare repository, there is R v3.6.1 code ('Preprocessing_data') to exclude overlapping data based on diet metrics of interest: presence/absence, weight/volume, frequency of occurrence, and numerical.
A dynamic version of the North Pacific Marine Salmon Diet Database can be accessed through a GitHub repository and will be updated as more historic data are digitized and made available by the Pelagic Ecosystems Laboratory at the University of British Columbia 21 . The GitHub repository contains updated versions of the files found in the figshare repository, in addition to documentation for the North Pacific Marine Salmon Diet Database ('npmsdd_documentation'), a data entry template ('npmsdd_data_entry_template'), and a dumped version of the MySQL relational database ('npmsdd_MySQL_database_dump') for users that are interested in working with the database in the relational database format.

code availability
The figshare repository contains R v3.6.1 code for downstream data processing ('Preprocessing_data'), and code to create the figures that appear in this publication ('Publication_figures') 22 . The GitHub repository contains the most up-to-date code for downstream data processing 21 .  Table 3. The number of diet samples containing each prey taxonomy. Prey taxonomy refers to the lowest taxonomic level identified by the source. Prey taxonomies that were reported in less than 20 samples were excluded.