Myctobase, a circumpolar database of mesopelagic fishes for new insights into deep pelagic prey fields

The global importance of mesopelagic fish is increasingly recognised, but they remain poorly studied. This is particularly true in the Southern Ocean, where mesopelagic fishes are both key predators and prey, but where the remote environment makes sampling challenging. Despite this, multiple national Antarctic research programs have undertaken regional sampling of mesopelagic fish over several decades. However, data are dispersed, and sampling methodologies often differ precluding comparisons and limiting synthetic analyses. We identified potential data holders by compiling a metadata catalogue of existing survey data for Southern Ocean mesopelagic fishes. Data holders contributed 17,491 occurrence and 11,190 abundance records from 4780 net hauls from 72 different research cruises. Data span across 37 years from 1991 to 2019 and include trait-based information (length, weight, maturity). The final dataset underwent quality control processes and detailed metadata was provided for each sampling event. This dataset can be accessed through Zenodo. Myctobase will enhance research capacity by providing the broadscale baseline data necessary for observing and modelling mesopelagic fishes.


Background & Summary
Open-ocean pelagic ecosystems are under-represented in databases of marine biodiversity, despite holding the largest biomass of organisms on Earth 1 . The open-ocean pelagic community predominantly consists of fish, crustaceans and cephalopods that inhabit mesopelagic (200-1000 m) and bathypelagic (1000 m to >4000 m) depths 2 . Many pelagic species undertake diel vertical migration (DVM), moving from the depths to shallower waters at dusk and in the reverse direction at dawn, possibly constituting the largest animal migration on the planet 3 . This has implications for carbon sequestration and climate regulation, as organisms actively transport organic carbon from the surface to the deep ocean [3][4][5][6] .
Mesopelagic fishes are a central component of open-ocean pelagic communities dominating global vertebrate biomass with estimates of up to 10 billion tons 4,7,8 . They are thought to represent a key link to coupling physical-biogeochemical models to the population dynamics of top-predators 6,9 . Thus, the open-ocean pelagic environment and its inhabitants are critical components for the provision of globally important ecosystem services 1,2 .
In the Southern Ocean, mesopelagic fishes are key prey for sentinel species such as seals and seabirds [10][11][12] . As major consumers of secondary productivity (zooplankton) they also exert control on lower trophic levels of oceanic food webs 13,14 . Despite their ecologically important role, there are gaps in our knowledge of their biodiversity, abundance, biomass, and the processes that shape their distribution, life cycles and behaviour 4,8,[15][16][17][18][19] . Mesopelagic fishes are difficult to sample due to their patchy distribution and ability to avoid and escape pelagic nets 8,20,21 . This is confounded by the differences in catch efficiency between gear types, leading to biased estimates of abundance and biomass 4,8 .
Mesopelagic fishes are likely to be impacted by ocean-warming with evidence indicating future range shifts of temperate species, range reductions of Antarctic species 16,22 and possible biogeographic shifts in body size patterns 23 . This will have implications for the predators that feed upon them 14 and potentially more broadly for global biogeochemical cycles 24 . Tracking the future distribution and abundance of species requires the development of reliable population baseline estimates and an understanding of the biases associated with different sampling strategies [25][26][27][28] .
The advent of biodiversity informatics has seen the development of open-access biodiversity data repositories, such as the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS), to enhance research output in the context of a global biological monitoring system 29 . However, much of the data for mesopelagic fishes archived in these repositories are for species occurrence only, without information on abundance or biomass. Often, the methodological metadata (such as net-type or depth of trawl) are also lacking, further limiting the scope of the analyses 1,30 .
Decades of data from net sampling of the pelagic environment exist for localised regions of the Southern Ocean as part of multiple national Antarctic research programs 31 . Here, we take the important step of integrating and standardising the format of published and unpublished survey data on the abundance, biomass, biodiversity, and methodological metadata for mesopelagic fishes of the Southern Ocean. We collected and synthesised data from the Scotia Sea 14,32-34 , the Antarctic Peninsula 35,36 , the Lazarev Sea 37,38 , the Kerguelen Plateau 39-43 , East Antarctica 44 , Macquarie Island 45,46 and unpublished data from the Indian Ocean from Japan's national Antarctic program into a single dataset called Myctobase (Fig. 1).
The long-term aims of Myctobase are (1) to provide a single findable, standardised, and citable resource for multiple combined datasets; (2) to provide the baseline data upon which to measure the status and future trends of the mesopelagic fish assemblage; (3) to support further research such as investigating the patterns and biophysical drivers of diversity; and (4) to provide the broadscale spatiotemporal perspective necessary for the holistic management and conservation of the open-ocean pelagic ecosystem in the Southern Ocean. The dataset is available on GBIF, OBIS and Zenodo to support the management of the open-ocean pelagic ecosystem. Researchers wishing to contribute to this project should contact the corresponding authors. We hope that this database will continue to grow, providing a resource for ongoing monitoring of the Southern Ocean pelagic ecosystem.

Methods
Sampling design. Samples were collected onboard multiple national Antarctic research cruises between the years 1991-2016 and for the year 2019 and jointly covered all calendar months (Online-only Table 1). Net sampling occurred predominantly in the Indian and Atlantic Sectors of the Southern Ocean (Fig. 1).
Mesopelagic fishes were sampled using a range of gear types as a "standard" sampling gear does not currently exist 21 . Samples were collected from pelagic trawls using opening-closing net systems with a range of Rectangular Midwater Trawl (RMT) and International Young Gadoid Pelagic Trawl (IYGPT) nets, an Isaacs-Kidd Midwater Trawl (IKMT) net and a Matsuda-Oozeki-Hu Trawl (MOHT) net (Table 1, Online-only  Table 2). Opening-closing net systems allowed for the sampling of discrete depth intervals and included stratified oblique and horizontal trawls. Survey designs (for trawl locations) included randomised design, designated stations, and target trawls on acoustically detected aggregations of fish (see Online-only Table 1 for more information on sampling methodology).
Once onboard, samples were sorted to the lowest taxonomic level possible using published guides 41,47 and personal/institutional reference collections. Taxonomic identities were verified in home laboratories. Standard length (mm) and wet weight (g) measurements were taken onboard with a motion compensated balance or in home laboratories. Samples were preserved in either ethanol, formalin or frozen for further analyses.
Detailed information on methodologies utilised for each research cruise can be found in the relevant citations listed in Online-only Table 1.
Data collection and processing. Potential data holders were identified by compiling a metadata catalogue of scientific publications and existing survey data for Southern Ocean mesopelagic fishes. Data holders were invited to contribute published and unpublished data to the Myctobase project. A standardised template for the collation of multiple datasets was created using Darwin Core terms 48 where possible. The terms used in the template alongside definitions for each term can be found in Table 1. Data were collated with R statistical software, version 4. 0. 4 49 .
Units of measurement were standardised across datasets. Length and weight measurements were converted into millimetres and grams, respectively. The taxonomy of each observation was verified and associated aphia IDs (globally unique and stable identifiers for each taxonomic name) were retrieved from the World Register of Marine Species using the R package, WoRMS 50 (Fig. 2).
Abundance was calculated for each net haul by species per filtered volume of water. This was performed by dividing the species count (n) by the volume filtered (m 3 ). This was not possible for all net hauls and species as count and volume filtered information were not always recorded or technical difficulties were encountered during the research cruise. Methodology for calculating volume filtered varied between datasets. Several cruises www.nature.com/scientificdata www.nature.com/scientificdata/ used mechanical flow meters while others used a calculation where the nets filtering area (m 2 ) was multiplied by the distance it was towed (m). The filtering area of the MOHT net was calculated by multiplying the net mouth area by a coefficient estimated from calibration tows in calm seas. The coefficient was estimated under a calm sea condition using a net frame with a flowmeter which was vertically deployed up to 100 m wire out. The net frame was retrieved slowly to the sea surface and a count from the flowmeter was recorded. This procedure was repeated five times and an average value was calculated to obtain the coefficient. For some datasets, volume filtered was not previously calculated, but the information needed to calculate the volume filtered had been recorded. In these instances, we retrospectively calculated the volume filtered using this information. This was shown in a separate column to the values that were obtained at the time of a cruise, this is labelled www.nature.com/scientificdata www.nature.com/scientificdata/ volumeFiltered2 in the database. Detailed information on the methodologies utilised to obtain volume filtered, including our retrospective calculations, for each cruise can be found in Online-only Table 2.
The solar position and day/night information was added for each observation based on the date, time and the start latitude and longitude of each net haul using the R package, maptools 51 (Fig. 2). Solar position is the angle of the sun in relation to the horizon (0°), thus dawn was defined as a solar position of −12° to 12° before midday, and dusk was defined as a solar position of 12° to −12° after midday. Day was defined as the period after dawn and before dusk and night was defined as the period after dusk and before dawn. We added the zone (eg. Northern, Subantarctic and Antarctic) in which net hauls were undertaken using the definitions in 52 . The sector in which the net haul was undertaken was additionally added following definitions in 53 , which define four major sectors: Atlantic, Indian, West Pacific and East Pacific (Fig. 2).

Data Records
The dataset is comprised of three comma-separated files which are freely available at Zenodo 54 . Filenames adhere as closely as possible to the naming convention set out by the Darwin Core Standard 48 . The first file (event.csv) describes the survey methodology. Each row has its own unique event ID, which consists of the institute, cruise, event number (as recorded in the voyage logbook) and the net number (institute_cruise_event_net). An event ID represents the sampling event or net haul and contains the details of the event including date, time, position (latitude, longitude, and depth), sampling protocol, net type, net mesh size, tow speed, volume filtered and haul type (station, routine, target, test, surface or failed). The second file (groupOccurrence.csv) contains the catch data linked to the survey methodology by an event ID. Each row has its own unique occurrence ID, which is the event ID and aphia ID (retrieved from WoRMS) combined (eventID_aphiaID). An occurrence ID contains taxonomic information (eg. phylum, class, order, and family), the number of individuals (n) and estimated abundance (n_m 3 ) for the associated sampling event. The final file (individualOccurrence.csv) contains measurements of individuals. Each row contains the event and occurrence ID, which links each measurement to the first and second file, and a 'catalogNumber' , linking the data to the original dataset for traceability. Rows also contain taxonomic information, standard length (mm), weight (g), life stage and reproductive maturity of the individual (following definitions in 55 ) and sex, where available. Additional notes on the preservation method and whether the specimen was measured before or after preservation are included. The highest concentration of data points is within the Atlantic Sector (45% of net hauls) spanning from the Antarctic continent to the Polar Front at 50° S (Fig. 1). The East and West Pacific Sector stands as a major gap in the spatial coverage of Myctobase (21% and 1% of net hauls in the East and West, respectively).
Vertically, data extend from the surface down to a maximum depth of 2000 m. More than half of the trawls took place in the epipelagic layer at 0-200 m (n = 3713). Data from the mesopelagic (200-1000 m) and bathypelagic (>1000 m) layers were also recorded from stratified oblique trawls.
Data span from 1991-2016 and 2019, with no data for 2017 and 2018 (Online-only Table 1). The year 2006 contains the highest number of net hauls (11%). Cumulatively, the datasets within Myctobase cover data for every month (NB. not every month of every year), with 76% of net hauls occurring in the summer months (November to March).

Records of occurrence.
The four most abundant fish families in Myctobase are Paralepididae, Nototheniidae, Myctophidae and Bathylagidae. Myctophidae make up the highest number of records in Myctobase for both the Atlantic and Indian sectors (Fig. 3). In the Indian Ocean sector, the highest number of records for Myctophidae are in the Subantarctic zone. Conversely, in the Atlantic Ocean sector the highest number of records for Myctophidae are in the Antarctic zone. There are a similar number of records for Bathylagidae, Nototheniidae and Paralepididae in the Indian sector for both the Antarctic and Subantarctic zones. In the Atlantic Ocean sector, the highest number of records for Bathylagidae, Nototheniidae and Paralepididae are in the Antarctic zone. www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
The final dataset was subject to quality control and validation processes. Rather than removing ambiguous or incomplete records altogether, two extra columns were added to the event information (event.csv). The first column labelled validation is an index used to indicate whether the record passed quality control measures (0 = fail, 1 = pass). The second column labelled validationDescription details the reasons for a failure. Indexing the data in this way prevents the loss of potentially valuable data. For example, some events are missing latitude and longitude information, however a more general sampling location is often available and may be enough for regional analyses (Fig. 2).
The R package obistools 56 was used to validate sampling locations and dates of events. Sampling locations on land or at depth values higher than the bathymetry raster used in obistools did not pass the quality control step and thus were given a '0' in the validation column. Further, records missing latitude, longitude and date/time data were similarly identified. Records where technical difficulties were recorded, such as net failing to open or close were also allocated a '0' in the validation column (Fig. 2).
Abundance was standardised to number of individuals per m 3 (n_m 3 ) for all datasets. In some instances, abundance could not be calculated due to missing count or volume filtered data. Additionally, abundances were not calculated for US AMLR Program as this data were considered bycatch and not the sole focus of the sampling program. Subsequently, this data is important for documenting the species observed and overall distribution in the region of the Antarctic Peninsula. Important information regarding the use of abundance values is available in 'Usage Notes' .
Although the purpose of the dataset is to document the occurrences and relevant metadata of mesopelagic fish species, we also retained records of cephalopods. Occurrence records of cephalopods alongside those of fish were included in some of the datasets contributed to Myctobase. Cephalopods are similarly important species of the open-ocean pelagic community and are a key group supporting the flow of energy from primary producers to higher order predators in Southern Ocean food webs [57][58][59] . Data on squid are limited due to their low catchability with scientific nets 59 . As such, we retained records to maximise availability of valuable data. The data are structured to enable users to easily filter out cephalopod occurrences if they are not of interest.
Taxonomic fields within the final dataset were checked for spelling errors and to verify the usage of valid/ accepted names according to WoRMS. Where appropriate, records were corrected. All original data records were archived for future data checking and validation. Users can provide feedback for any data record to the corresponding authors.

Usage Notes
Myctobase will enhance research capacity by facilitating international effort toward observing and modelling mesopelagic fish taxa. The data held in Myctobase are suitable for a number of applications, for example, investigating the biophysical determinants shaping patterns of occurrence and biodiversity. Further examples of data use can be found in the citations listed in Online-only Table 1. Data can be analysed using a variety of statistical software such as R 49 or Matlab 60 .
Myctobase contains data from 72 different research cruises across the Southern Ocean. Each cruise employed a specific sampling strategy and provided varying levels of detail on methodology and samples collected. The different sampling methodologies used between datasets necessitates caution when comparing abundance estimates, as well as length and weight measurements across research cruises. The biases associated with net type and mesh size of the net have previously been described 8,21 , and are likely to influence the final output of analyses that use data from different projects. For example, the IYGPT has a larger mesh size at the front of the net limiting the size of fish that may be caught due to smaller fish escaping through the mesh. The minimum size limit has been suggested to be approximately 25-35 mm 61 . This suggests that nets with larger mesh sizes may not be appropriate for calculating densities of species that can escape the net. However this is a complex issue which requires further study as the net will also exert a herding effect on fish 32 . Biases should be considered and acknowledged. Myctobase provides detailed information for each sampling event to ensure that researchers can account for these differences in their analyses or to enable comparison of data using similar methods.
Further, abundance values should be treated as relative rather than absolute values due to the patchy distribution of species and the limited spatial and temporal coverage of sampling creating uncertainty around estimates 4,8 . Data are best used to demonstrate the community composition, distribution, and occurrence of fish species, life history and relative abundance within the Southern Ocean. For example, there are a multitude of available modelling techniques which may be used to predict a species geographic distribution in relation to environmental variables 62,63 . Furthermore, this information provides the data necessary for ground-truthing acoustic data 64 .
We anticipate that Myctobase will continue to grow into a fully circumpolar database with continued collaboration from across the scientific marine community. We invite researchers to contact the corresponding author(s) to contribute data (particularly from under-represented regions, taxa, and data types such as abundance data), including data from unpublished research which can be embargoed until publication.
Terms of use. The database is released under a CC-BY license. Users are encouraged to formally cite the data record used according to the standards and format of the journal in which they are published.

Code availability
We used freely available code from the following packages: maptools 51 , obistools 56