Image dataset for benchmarking automated fish detection and classification algorithms

Multiparametric video-cabled marine observatories are becoming strategic for monitoring the marine ecosystem remotely and in real time. These platforms can acquire continuous, high-frequency and long-lasting image data sets that require automation in order to extract biological time series. The OBSEA, located 4 km off Vilanova i la Geltrú at 20 m depth, was used to produce coastal fish time series continuously over the 24 h during 2013–2014. The image content of the photos was extracted via tagging, resulting in 69917 fish tags of 30 identified taxa. We also provide a meteorological and oceanographic dataset filtered by a quality-control procedure to define the real-world conditions affecting image quality. The tagged fish dataset can be of great importance for developing Artificial Intelligence routines for the automated identification and classification of fishes in extensive time-lapse image sets.


Background & Summary
In a context of global climate change and increasing human impact on coastal marine areas, the monitoring of changes in fish behaviour and population abundances is becoming strategic to provide data on ecosystem productivity, functioning and derived services (e.g., the status of already overexploited stocks) [1][2][3]. For this reason, monitoring the temporal dynamics of fish communities is of pivotal importance to distinguish the variability in species composition, due to diel and seasonal activity rhythms, from longer-lasting trends of change 4,5. The temporal trend of fish presence and abundance, obtained from the analysis of imagery data, is produced by the rhythmic migration of populations within the three-dimensional marine space of the seabed and water column [6][7][8]. The information derived from such dynamics, coupled with environmental (oceanographic and meteorological) data, provides useful insight into species' ecological niches [9][10][11], and allows understanding and forecasting the impact of anthropic activities (e.g., commercial fishing, urban and port expansion) and planning the consequent mitigation actions (e.g., establishment of marine protected areas) 7,12,13.
Cabled video-observatory monitoring technology is considered the core of a growing network of in situ, robotised marine ecological laboratories in coastal and deep-sea areas 14,15. International marine observatory infrastructure initiatives, such as the European Multidisciplinary Seafloor and water column Observatory (EMSO-ERIC), the Joint European Research Infrastructure of Coastal Observatories (JERICO-RI), or Ocean Networks Canada (ONC), are becoming widespread all over the world 16, and increasingly install multiparametric sensors that, besides imagery depicting biological information, also acquire oceanographic and geo-chemical data 13,17.
Unlike other types of data, the scientific content of videos and images is not immediately usable. To overcome this problem, the image content is often inspected by trained operators in order to manually extract relevant biological information, such as the number of individuals and their corresponding classification into species [18][19][20]. This manual process requires considerable human effort and is extremely time-consuming. For this reason, automated image analysis methodologies for the extraction and coding of the image content urgently need to be defined and developed, in order to transform imaging devices into actual biological tools for underwater observing systems 21,22.

The cameras were oriented towards these artificial reefs, with a Field of View (FOV) area of about 3 × 3 m, resulting in an imaged volume of about 10.5 m³ (Fig. 2).
The image monitoring was performed in a 30-min time-lapse mode, synchronising illumination with the moment of shooting at nighttime. To shoot photos at night, the camera was flanked by two illuminators placed 1 m apart beside the camera, each consisting of 13 high-luminosity white LEDs. The lights emitted 2900 lumens, with a colour temperature of 2700 K and an illumination angle of 120°. An automated protocol, controlled by a LabVIEW application, switched the lights on and off before and after the camera shooting, resulting in a 30-s light-on period that allowed the lights to warm up and attain the maximum amount of homogeneous illumination. The second camera's image resolution was 2048 × 1536 pixels (Fig. 2). Both cameras acquired images in JPEG format.
Fish tags and annotation procedure. In order to tag the relevant biological content of the images (i.e., fish individuals), a Python script was developed based on the OpenCV framework for Python (https://opencv.org/) 48 (Fig. 3).
The script allows the operator to trace a line around each biological subject, from which a bounding box (bbox) is then calculated. The script and all the instructions for the tagging procedure are available through the Zenodo repository 49.
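The core of this step, turning a hand-traced outline into a bbox, can be illustrated with a short, dependency-free sketch. The actual script relies on OpenCV; the function name and the axis-aligned simplification below are ours, not the published implementation:

```python
def bounding_box(trace):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a traced
    outline; `trace` is a list of (x, y) pixel coordinates clicked around
    a fish. Illustrative sketch only; the published script uses OpenCV."""
    xs = [x for x, _ in trace]
    ys = [y for _, y in trace]
    return min(xs), min(ys), max(xs), max(ys)

# A hand-traced outline around a fish silhouette
print(bounding_box([(120, 85), (180, 90), (240, 110), (175, 125)]))
# -> (120, 85, 240, 125)
```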
The species classification was performed according to FishBase 50. In those cases where a fish was not fully classifiable, because it was too distant or badly positioned within the FOV, we classified it as "Unknown fish", since these unclassified fishes are still important for the estimation of fish biomass (Fig. 3). Some examples concern individuals appearing in the photo as mere dots; others concern overlapping fishes, such as when they form schools.

Oceanographic and meteorological data acquisition and processing. The OBSEA was equipped with a CTD probe to measure water temperature, salinity, and changes in depth, calculated from shifts in water pressure (as a proxy for tides). During 2013–2014, two CTD probes were sequentially deployed to avoid data gaps during sensor maintenance operations (Table 2). The deployment periods of both CTD probes are reported in Table 3.

Fig. 3
Flowchart for the tagging procedure. The tagging of the photos was carried out with a Python script, which releases as output a list of tags in text format and saves the images with their bounding boxes (rectangles of different colours). Here, we report an example of a processed photo with tagged specimens and untagged fishes (green circle).

Table 2. Technical characteristics of the two CTD probes and of the two meteorological stations, i.e., the CTD sensors (SBE16 and SBE37) installed at the OBSEA, the meteorological station of the Polytechnic University of Catalonia (UPC) in Vilanova i la Geltrú (Station 1), and the meteorological station of Sant Pere de Ribes (Station 2), operating during 2013–2014.
Moreover, meteorological variables were measured by the meteorological station on the roof of the Polytechnic University of Catalonia (UPC) building in Vilanova i la Geltrú, and by the meteorological station of Sant Pere de Ribes, Spain (www.meteo.cat) (Table 2). The former was a Vantage Pro2 meteorological station, installed to collect data on air temperature and on wind speed and direction. Furthermore, we compiled solar irradiance and rain data from the meteorological station in Sant Pere de Ribes, which was equipped with a Pyranometer SKS 1110 to measure solar irradiance and a Rain[e] sensor for the rain.
All the oceanographic and meteorological data were averaged every 30 min, in order to have mean and standard deviation measurements contemporary to the timing of the acquired images (see above), except for the irradiance and rain data, which were compiled by selecting and extracting only the readings corresponding to the acquired image timings (see above).

In order to filter these data, we applied a Quality Control (QC) procedure to all the environmental variables except solar irradiance and rain, which are considered prefiltered, institutional data. This procedure is based on the guidelines of the Quality Assurance of Real-Time Oceanographic Data (QARTOD), issued by the United States Integrated Ocean Observing System (US-IOOS) Program Office as part of its Data MAnagement and Cyberinfrastructure (DMAC) (https://ioos.noaa.gov/project/qartod/), and was implemented with the IOOS QC Python tools (https://github.com/ioos/ioos_qc). Following the QARTOD guidelines, the following tests were applied:
• Gross Range test. Highlights data points that exceed sensor-based or operator-selected minimum and maximum levels.
• Climatology test. Highlights data points that fall outside the seasonal ranges introduced by the operator.
• Spike test. Highlights a data point n−1 that deviates from its adjacent points (n−2 and n) by more than a selected threshold.
• Rate of change test. Examines excessive rises or falls in the data.
• Flat line test. Examines invariant values in the data.
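To illustrate how such checks operate, here is a simplified, self-contained sketch of the spike and flat line tests. The actual processing used the IOOS QC Python tools; the thresholds, window size, and reduced flag scheme below are illustrative assumptions:

```python
def spike_test(values, threshold):
    """Flag point n-1 when it deviates from the mean of its neighbours
    (n-2 and n) by more than `threshold`. Simplified flag codes:
    1 = pass, 4 = fail (real QARTOD schemes also use 2 = not evaluated,
    3 = suspect, 9 = missing)."""
    flags = [1] * len(values)
    for n in range(2, len(values)):
        ref = (values[n - 2] + values[n]) / 2
        if abs(values[n - 1] - ref) > threshold:
            flags[n - 1] = 4
    return flags

def flat_line_test(values, count, eps=1e-6):
    """Flag every run of `count` consecutive, near-identical values."""
    flags = [1] * len(values)
    for i in range(count - 1, len(values)):
        window = values[i - count + 1 : i + 1]
        if max(window) - min(window) < eps:
            for j in range(i - count + 1, i + 1):
                flags[j] = 4
    return flags

temps = [18.1, 18.2, 25.0, 18.3, 18.3, 18.3, 18.3]
print(spike_test(temps, threshold=4.0))      # flags the 25.0 spike
print(flat_line_test(temps, count=4))        # flags the stuck 18.3 run
```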

Column labels and descriptions.
Date/Time: the time stamp information in UTC, with "yyyy-mm-ddThh:mm:ss" as format.

Each time the quality test was run, each value of the dataset was flagged with a quality-control code. The QC flags and their meanings are shown in Table 4.
The oceanographic and meteorological data were annotated into comma-delimited files (CSV) with additional information on QC flags, time stamps, and the measurement devices used for their acquisition 51-53.

Data Records
Tagging outputs. All time-lapse images were saved with a filename indicating the date (i.e., the year, the month, and the day), the timestamp in Coordinated Universal Time (UTC) (i.e., hours, minutes and seconds), the name of the platform, and finally the camera used to acquire the image 48. As a result, we obtained an inspected dataset of 33805 images, depicting a total of 69917 manually tagged fish specimens, 36777 of which pertain to 29 different taxa (Fig. 4) (Table 5). The remaining specimens (i.e., 33140) were attributed to the unclassified category (see previous section).
In the dataset file for manual tagging 48, we report the timestamp in UTC (yyyy-mm-ddThh:mm:ss) and the filename (with associated timestamp) of the tagged image, plus the fish taxon name and the image coordinates of the vertices of the bounding box (bbox) containing the identified specimen in the OBSEA photo (Fig. 4).
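A reader for such a tag file could be sketched as follows; note that the column order, the filename pattern, and the flattened x1,y1 … x4,y4 vertex layout below are hypothetical assumptions and must be checked against the repository documentation:

```python
import csv
from io import StringIO

# Hypothetical row layout: timestamp, image filename, taxon,
# then the four bbox vertex coordinates (x1, y1, ..., x4, y4).
sample = StringIO(
    "2013-06-15T10:30:00,OBSEA_2013-06-15_10-30-00_cam1.jpg,"
    "Diplodus vulgaris,100,50,200,55,195,120,95,115\n"
)

for row in csv.reader(sample):
    timestamp, filename, taxon = row[0], row[1], row[2]
    # Pair up the eight remaining fields into four (x, y) vertices
    vertices = [(float(row[i]), float(row[i + 1])) for i in range(3, 11, 2)]
    print(taxon, vertices)
```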

Fig. 6
Time series plots of the environmental variables. Here we report the time series for the three oceanographic variables (i.e., water temperature, salinity and depth) and the five meteorological variables (i.e., air temperature, wind speed and direction, solar irradiance and rain) at the OBSEA platform and at the meteorological stations on the "Development Centre of Remote Acquisition and Information Processing" (SARTI) rooftop and in Sant Pere de Ribes between 2013 and 2014. In the seawater temperature, pressure and salinity graphs, the use of the SBE37 CTD probe is highlighted with grey bands, and that of the SBE16 CTD probe with light yellow bands. The green points in the time series are good-quality data, the yellow ones suspicious data, and the red ones bad data. The relative percentage of each QC index is reported in the time series, except for the rain and solar irradiance data, considered a prefiltered, institutional source (see previous section).

(2023) 10:5 | https://doi.org/10.1038/s41597-022-01906-1

In order to improve the reuse of this dataset, we report its details, described also in the PANGAEA repository 48, in Table 6.
The proposed dataset can be used with any image analysis methodology, including the popular Deep Learning (DL) approaches, thanks to the annotated bboxes and related species labels for each fish individual. The bboxes proposed in this work are rotated rectangles that tightly fit each tagged fish individual. Image analysis approaches based on convolutional operators need the bboxes to be rectangles with edges parallel to the image borders and, depending on the specific implementation, the bboxes can have different encodings. An example is the rectangle encoding of the "You Only Look Once" (YOLO) approach 54, for which it is straightforward to transform the general-purpose rectangle encoding proposed in our work into the YOLO encoding and vice versa.
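Such a conversion can be sketched as follows: the rotated bbox is wrapped in the smallest axis-aligned rectangle, whose centre and size are then normalised by the image dimensions, following the common YOLO label convention (the function name and example coordinates are ours):

```python
def rotated_bbox_to_yolo(corners, img_w, img_h):
    """Convert the four corners of a rotated bbox to the YOLO encoding
    (x_center, y_center, width, height), all normalised to [0, 1] by the
    image size. The rotated box is first wrapped in the smallest
    axis-aligned rectangle, since convolutional detectors expect box
    edges parallel to the image borders."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (
        (x_min + x_max) / 2 / img_w,
        (y_min + y_max) / 2 / img_h,
        (x_max - x_min) / img_w,
        (y_max - y_min) / img_h,
    )

# A tilted bbox inside a 2048 x 1536 image (the second camera's resolution)
print(rotated_bbox_to_yolo([(100, 50), (200, 55), (195, 120), (95, 115)], 2048, 1536))
```

Going back from YOLO to an axis-aligned rectangle only inverts the normalisation; the rotation angle of the original tight box is not recoverable from the YOLO encoding.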
A recent work on Deep Learning (DL) methods for the automatic recognition and classification of fish specimens 55 identified the paucity of multi-species datasets labelled by specialists with a community-oriented approach as a major constraint for this methodology. Our dataset, ground-truthed by specialists, labels multiple fish species with a great number of tags, with images taken by a camera focused on the same artificial reef during the whole monitoring period. For this reason, this dataset is good material for DL procedures and Artificial Intelligence-based approaches in general.
Oceanographic and meteorological datasets. The measurements from the CTD device of the OBSEA, the meteorological station on the "Development Centre of Remote Acquisition and Information Processing" (SARTI, https://www.sarti.webs.upc.edu/web_v2/) rooftop, and the Sant Pere de Ribes station were stored in a PANGAEA repository [51][52][53]. To facilitate the use of these data, we report the details of the three datasets in Tables 7, 8 and 9, respectively.
Environmental data had temporal gaps in their time series due to sensor malfunction or power/communications loss. The temporal coverage for each variable is detailed in Table 10.

Technical Validation
The manual tagging and fish classification were performed following the FishBase website 48 and consulting local fish faunal guides [56][57][58]. The operator who carried out the tagging was trained in fish classification using the Citizen Science tool of the OBSEA website (https://www.obsea.es/citizenScience/). Furthermore, to better classify the recognizable fish specimens, we cross-checked our identifications with specialists in fish classification from the Institut de Ciències del Mar of Barcelona (ICM-CSIC, www.icm.csic.es).
Here, we report the time series for the three most abundant fish taxa (i.e., Diplodus vulgaris, Oblada melanura and Chromis chromis) and the total fish counts detected during the tagging procedure, in order to show that there are no large gaps in the image acquisition at the OBSEA during 2013 and 2014, and that the data encompass all seasons, allowing the detection and classification of the highest number of species of the locally changing fish community (Fig. 5).
We also report the time series of the environmental variables measured at the OBSEA platform and at the two meteorological stations, on the "Development Centre of Remote Acquisition and Information Processing" (SARTI) rooftop and in Sant Pere de Ribes, between 2013 and 2014. These time series are displayed with their respective Quality Control (QC) indexes highlighted by different colours, in order to demonstrate the good quality of these data and the low occurrence of gaps in the time series (see previous section) (Fig. 6).
Finally, we show the graphs resulting from the diel waveform analysis of the tagging data for the three most abundant species and for the total number of fish individuals, related to the respective solar irradiance values, in order to identify the phase of the rhythms (i.e., the averaged peak timing, as a significant increase in fish counts) in relation to the photoperiod, with the data averaging solving the problem of gaps in data acquisition (Fig. 7).
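The gist of such a waveform analysis, pooling counts by time of day across the whole series and averaging each bin, can be sketched as follows (a simplified illustration under our own assumptions, not the authors' exact procedure):

```python
from collections import defaultdict
from statistics import mean

def diel_waveform(records):
    """Average fish counts per time-of-day bin across all days, which
    smooths over daily acquisition gaps. `records` is a list of
    ("HH:MM", count) pairs pooled from the whole time series; the
    returned dict maps each clock-time bin to its mean count."""
    bins = defaultdict(list)
    for clock, count in records:
        bins[clock].append(count)
    return {clock: mean(counts) for clock, counts in sorted(bins.items())}

# Two days of toy counts in three 30-min-style bins
records = [("06:00", 2), ("12:00", 15), ("18:00", 5),
           ("06:00", 4), ("12:00", 11), ("18:00", 7)]
waveform = diel_waveform(records)
print(waveform)
# The bin with the highest mean marks the phase (peak timing) of the rhythm
print(max(waveform, key=waveform.get))
```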
It can be observed that, in general, the species are diurnal, as reported in the literature 59. The only exception is O. melanura, which has been reported to be more active during crepuscular hours 59 but in our case was tagged more often during nighttime. This could be explained by the better visibility under artificial illumination of this species, which lacks well-recognizable marks for its classification. Therefore, it can be inferred that, in general, the tags for the different species are proportional to the local abundances, except for certain species such as O. melanura. This last statement is based on a recent article 60 describing a method for estimating organisms' abundance from visual counts with cameras. The article proposes a Bayesian framework that, under appropriate assumptions, allows estimating the animals' density in a single survey without the need to track the movement of single specimens.

Usage Notes
As can be observed in Table 5, the classes of the inspected dataset are imbalanced (e.g., there are 14328 Diplodus vulgaris tags and only 1 Trachurus sp. tag). This characteristic has to be managed by Artificial Intelligence applications for the automated interpretation of the image content. In case the image analysis method cannot manage imbalanced datasets 61,62, data augmentation approaches can be used to generate new reliable samples starting from the classes tagged in the dataset [63][64][65].
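As a complementary first remedy, many DL frameworks accept per-class loss weights inversely proportional to tag frequency. A minimal sketch, using only the two example counts quoted above (the full distribution is in Table 5):

```python
def class_weights(tag_counts):
    """Inverse-frequency class weights: rare classes get large weights so
    that a weighted loss does not ignore them. `tag_counts` maps each
    taxon to its number of tags."""
    total = sum(tag_counts.values())
    n_classes = len(tag_counts)
    return {taxon: total / (n_classes * count)
            for taxon, count in tag_counts.items()}

counts = {"Diplodus vulgaris": 14328, "Trachurus sp.": 1}
print(class_weights(counts))
```

Such weights can be passed, for example, to a weighted cross-entropy loss; they do not replace data augmentation but are often combined with it.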

Code availability
The Python code developed for tagging and labelling the images is available through the Zenodo repository 49. Another tool that can be used for tagging fishes is the public LabelImg tool (https://github.com/tzutalin/labelImg).