A macroscopic and stereological imaging dataset of Pleuronectes platessa ovaries

The North Sea plaice, Pleuronectes platessa (Linnaeus, 1758), is a commonly studied commercial flatfish with poorly known ovarian histology. This dataset is a collection of female plaice gonad images and their corresponding histological slides, collected during a complete season of the plaice’s reproductive cycle. Stereology was used to determine the percentage of the different structures found throughout the ovaries. Inter-agent calibrations were carried out to harmonize the stereological readings, based on a comprehensive reading protocol and a histological lexicon written specifically for plaice ovaries. The distribution and homogeneity of the different cell types found throughout the ovaries were also evaluated. This dataset can be used to automate the stereological reading process (through statistical learning methods, for example) or to objectively determine the plaice’s maturity phase and link that information to macroscopic measurements or to image analysis of the full ovaries.


Background & Summary
In stock assessments, the reproductive capacity of a commercial fish species is a key parameter for fisheries management plans. This reproductive capacity, or the capacity of a fish population to produce viable eggs and larvae 1, is usually estimated by computing the fish length at which 50% of the population has reached sexual maturity (L50) 2. Being able to accurately determine the maturity phase of a fish is thus of paramount importance 1,3. Unfortunately, the methods used to estimate the sexual maturity of commercial species are very subjective, relying on criteria such as the size, color or texture of the gonads 4,5. Moreover, the maturity cycle of certain fish species is poorly known 6, the determination of maturity phases can vary greatly between assessing operators 7,8, numerous terminologies are used to describe the ichthyological reproductive system 4,5,9, and maturity scales are constantly evolving and differ from one institution to another 5. This led the International Council for the Exploration of the Sea (ICES) to work on harmonizing the definitions, terminologies and practices used to determine these different maturity phases 9.
This study was carried out under project MATO ("MATurité Objective des poissons par l'histologie quantitative", Objective fish maturity using quantitative histology), run by the Institut Français de Recherche pour l'Exploitation de la Mer (IFREMER). The project aimed to improve knowledge of the ovarian histology of the North Sea plaice (Pleuronectes platessa) in order to correlate histological maturity with macroscopic parameters such as the size, weight and age of the fish, as well as the size, color and texture of the ovaries. Moreover, in order to harmonize data collection and terminologies, the terminology of Brown-Peterson et al. 4 and the maturity scale of the ICES 5,9 were used. Pleuronectes platessa was chosen as an easily accessible and commercially important fish species, with the aim of updating the outdated knowledge on the histological structures of this species' ovaries [10][11][12][13].
During this study (Fig. 1), each plaice was measured (weight and size), its otoliths were collected for age estimation, and the ovaries were extracted to be photographed, weighed and placed in a Davidson solution before being embedded in paraffin, trichrome stained (Prenant-Gabe 14), and mounted between slide and slip. Each slide was then scanned, and stereology was used to quantify each structure found throughout the ovary.
Scientific Data | (2020) 7:165 | https://doi.org/10.1038/s41597-020-0505-8 | www.nature.com/scientificdata
The dataset presented here is composed of:
• 151 pictures of both ovaries for the 151 fish sampled during this study;

Fig. 1 Schematic overview of the data acquisition process for Pleuronectes platessa macroscopic parameters, ovaries and ovarian histological slides. In blue are the materials and methods used for data collection, in green is the raw data obtained throughout the study, in red are the documents that have been published as a result of the data collection process, as well as to allow reproducibility of the slide readings, in brown are the potential use and reuse of the database.
The fish analyzed were all females, ranging in length from 15 cm to 36 cm, in weight from 34 g to 523 g, and in age from 1 to 5 years.
This dataset completes, updates and enriches the existing knowledge on plaice oogenesis, and allows a comparison between macroscopic and microscopic parameters for each fish. Using an objective histological method to determine the sexual maturity phase in Pleuronectes platessa is time consuming but yields better results than the visual method. Moreover, with this database, less time-consuming methods, such as image analysis and statistical learning for the recognition of cellular structures, can be put in place. This dataset can also be used to calibrate stereology readings between readers, as well as to objectively define the sexual maturity of each individual using histology.

Methods
Sampling. All 151 individuals were fished by bottom trawling in the English Channel (ICES division VIId), during 10 different sampling events from January 2017 to August 2019 (Table 1), so as to gather data from fish at different phases of the reproductive cycle 15,16. The sampling method took into consideration all captured female plaice, regardless of size. During the data collection in August, only female plaice of 20 cm and under were dissected, in order to fill in previously missing data for the sexually immature (SI) phase.
Each fish was measured (total length with an accuracy of less than 1 cm), weighed (ungutted weight with an accuracy of less than 1 g) and aged through otolithometry. An experienced operator estimated each female's sexual maturity through the observation of the ovaries, following the maturity staging grids of the ICES 9. Both ovaries were then extracted and photographed.
These pictures (Fig. 2) were standardized by being taken by the same operator, with the same camera (Nikon D3200), and in the same workroom, so as to minimize variations in shot angle and lighting. The ovaries were positioned on a green background, next to a 0.50€ coin on a blue background that served as a fixed size marker. The identification tag of the sampled fish appeared under the ovaries. The picture was named after the fish's identification tag. Each fish's identification tag was composed of the following data: species code, date of sampling, sampling zone, total length of fish, ungutted weight of fish, sex, visually estimated maturity phase.

Table 1 (column headings: Month; Year; Number of fish sampled).

The ImageJ software (v. 1.50J) was used to analyze the pictures and calculate the macroscopic parameters of each ovary: surface (mm²), length (mm), width (mm), width at mid-length (mm), mean color value of the different hues found on the ovary, the standard deviation of the mean color value, and the modal value (the most frequently occurring color value within the selected ovary).
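As an illustration of how the coin can act as a size marker, the following minimal sketch converts pixel measurements to millimetres and summarizes color values. It is not the ImageJ procedure used in this study: the 24.25 mm coin diameter is an assumed nominal value for a 0.50€ coin, the function names and the pixel values are hypothetical, and the standard deviation convention (population) may differ from the one ImageJ applies.

```python
from statistics import mean, multimode, pstdev

# Assumption: nominal diameter of a 0.50 EUR coin, not stated in the text.
COIN_DIAMETER_MM = 24.25


def mm_per_pixel(coin_diameter_px: float) -> float:
    """Scale factor derived from the coin diameter measured in pixels."""
    return COIN_DIAMETER_MM / coin_diameter_px


def to_mm(length_px: float, scale: float) -> float:
    """Convert a length in pixels to millimetres."""
    return length_px * scale


def to_mm2(area_px: float, scale: float) -> float:
    """Convert an area in square pixels to square millimetres."""
    return area_px * scale ** 2


def colour_summary(values):
    """Mean, population standard deviation and modal value of pixel values."""
    return {
        "mean": mean(values),
        "std_dev": pstdev(values),
        # If several values tie for the mode, keep the smallest one.
        "modal": min(multimode(values)),
    }


# Hypothetical measurements: a coin 485 px wide, an ovary 600 px long.
scale = mm_per_pixel(485.0)
ovary_length_mm = to_mm(600, scale)
summary = colour_summary([10, 10, 20, 40])
```

Areas scale with the square of the linear factor, which is why `to_mm2` squares the scale rather than reusing `to_mm`.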
Mounting between slide and slip. Both ovaries were placed into separate tissue processing embedding cassettes with their respective nametag. For the individuals with ovaries of 3 centimeters and over, the dorsal ovary (coded D) and ventral ovary (coded V) were cut into 3 sections of 1 cm. These sections were located in the anterior (coded 1), median (coded 2), and posterior (coded 3) area of each ovary. The 6 samples were then placed into 6 separate cassettes (Fig. 3) with a unique nametag that included the fish's identification tag followed by the section position (V1, V2, V3, D1, D2 or D3).
The tissue processing embedding cassettes were placed into a Davidson solution, for tissue fixation, for a period of 12 to 24 hours at a temperature of 4 °C. For the Davidson solution, 400 ml of glycerol, 800 ml of formaldehyde (37%), 1200 ml of ethanol (95%) and 1200 ml of filtered seawater were added in that order. This solution was kept at 4 °C, and 360 ml of concentrated acetic acid (10%) was added before use. The samples were then trimmed at the edges so as to get a straight rim, before being placed into an automated tissue processor (Leica TP1020) for dehydration. The dehydration process lasted 48 hours with a succession of different baths: 1 hour in ethanol 70%, 3 hours in ethanol 70%, 3 hours in ethanol 95%, 4 hours in ethanol 95%, 3 hours in ethanol 100%, 4 hours in ethanol 100%, 6 hours in ethanol 100%, 2 hours in butanol 100%, 4 hours in butanol 100%, 6 hours in butanol 100%. Still in the tissue processor, the samples were immersed in a first liquid paraffin bath (60 °C) for 4 hours before being immersed in a second paraffin bath for 8 hours.
The paraffin-embedded samples were cut into 5 µm thick sections using a microtome (HM330). Three consecutive sections were placed onto a single slide. Each slide was then deparaffinized, rehydrated and stained with Prenant-Gabe's trichrome 14. Finally, the sections were mounted with Roti-Histokitt.
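The bath schedule can be checked with simple arithmetic. The short sketch below sums the durations listed above; the grouping into "dehydration" and "paraffin" baths is our reading of the text. Under that reading, the ethanol and butanol baths add up to 36 hours, and the 48-hour figure is reached once the two paraffin baths are included.

```python
# Bath durations in hours, in the order given in the text.
dehydration_baths = [
    ("ethanol 70%", 1), ("ethanol 70%", 3),
    ("ethanol 95%", 3), ("ethanol 95%", 4),
    ("ethanol 100%", 3), ("ethanol 100%", 4), ("ethanol 100%", 6),
    ("butanol 100%", 2), ("butanol 100%", 4), ("butanol 100%", 6),
]
paraffin_baths = [("paraffin, first bath", 4), ("paraffin, second bath", 8)]

dehydration_hours = sum(hours for _, hours in dehydration_baths)
paraffin_hours = sum(hours for _, hours in paraffin_baths)
total_hours = dehydration_hours + paraffin_hours

print(dehydration_hours, paraffin_hours, total_hours)  # 36 12 48
```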
Quantitative histology. The slides were digitized using an Aperio CS histology slide scanner, running under the Scan Scope Console software (v.10.2.0.2352, Leica Biosystems), at a magnification of 20x (numerical aperture 0.75). The scans were then analyzed using the Aperio software (v12.1.0.5029) 17. Cellular structures were counted with a stereological analysis based on Glagolev's method 18, an assumption-based stereological method that uses a grid of points to estimate the area fraction of each structure from the share of sampled points that fall on it.
Through the Aperio interface (Fig. 4), a sampling grid was overlaid on the scanned ovary slide. The grid is bounded by an outline drawn along the outer ovarian wall, so as to restrict the sampling area to the ovarian section. The sampling grid was composed of 500 to 600 equidistant sampling points covering the entire sampling area. Because the spacing between points scales with the sampling area while the number of points stays between 500 and 600, the sampling effort is the same for every slide, whatever the surface or shape of the outlined ovary, and the points cover as much of the sampling area as possible. The position of the first cross is randomly generated, so each newly created grid is unique. This makes the sampling random, but it also means that the same grid must be used when the results of multiple readers are compared for a given slide. Finally, the choice of 500 to 600 sampling points was guided by Gundersen's rule 19, which states that a biological compartment must receive at least 150 points to obtain a relative uncertainty between 3% and 10% on the estimate of its volume fraction. Since each slide always contained more than 5 structures, more than 750 points (5 × 150) would be needed to satisfy the rule for every structure, so 500 to 600 points are not enough to abide by it. This number of points nevertheless allows an accurate count of the fraction that each structure occupies within the histological slide, without making the reading excessively long or turning the sampling effort into a manual count of every cell.
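The idea of an equidistant grid with a random start, whose spacing adapts to the outlined area, can be sketched as follows. This is not the Aperio algorithm, which is not described here. It is a generic systematic point grid under stated assumptions: a square lattice, a hypothetical elliptical outline standing in for an ovary section, and a target of 550 points.

```python
import math
import random


def systematic_grid(inside, bbox, area, target_points=550, rng=None):
    """Square grid of equidistant points with a random start.

    inside: function (x, y) -> bool, True inside the outlined region.
    bbox:   (xmin, ymin, xmax, ymax) bounding box of the region.
    area:   area of the region, used to derive the point spacing.
    """
    rng = rng or random.Random()
    xmin, ymin, xmax, ymax = bbox
    # One point per spacing**2 of area gives about target_points in total.
    spacing = math.sqrt(area / target_points)
    # Random position of the first cross, within one grid cell.
    x0 = xmin + rng.random() * spacing
    y0 = ymin + rng.random() * spacing
    nx = int((xmax - x0) // spacing) + 1
    ny = int((ymax - y0) // spacing) + 1
    points = []
    for j in range(ny):
        y = y0 + j * spacing
        for i in range(nx):
            x = x0 + i * spacing
            if inside(x, y):
                points.append((x, y))
    return spacing, points


# Hypothetical outline: an ellipse with semi-axes of 15 mm and 5 mm.
a, b = 15.0, 5.0


def in_ellipse(x, y):
    return (x / a) ** 2 + (y / b) ** 2 <= 1.0


spacing, points = systematic_grid(
    in_ellipse, (-a, -b, a, b), math.pi * a * b, 550, random.Random(1)
)
```

Reusing the same `points` list for every reader corresponds to the requirement that a single grid be shared when readings of one slide are compared.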
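For each sampling point, a single cellular structure was assigned from the 20 different cellular structures that could be found throughout the ovary. These 20 structures were identified, discussed and then clearly described by experienced oogenesis specialists (KK and CHB) in a lexicon 20. Once each sampling point was assigned to a structure, the percentage of times that a structure was counted on a single slide was calculated:

$$\text{Fract\_estim} = \frac{100}{\text{total\_points}} \times \text{hit\_points}$$

where, using the column names of the data records, hit_points is the number of times a structure was counted on a slide and total_points is the total number of sampling points identified on that slide.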
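As a minimal sketch, the percentage of points per structure can be computed from the list of structures assigned to the sampling points of one slide. The function name and the counts are hypothetical. The abbreviations vit4, och and oh are those used in this article for oocytes with advanced vitellogenesis, hydrating oocytes and hydrated oocytes.

```python
from collections import Counter


def fraction_estimates(assignments):
    """Percentage of sampling points assigned to each structure on one slide."""
    total_points = len(assignments)
    hit_points = Counter(assignments)
    # Same quantity as (100 / total_points) * hit_points.
    return {
        structure: 100 * hits / total_points
        for structure, hits in hit_points.items()
    }


# Hypothetical slide with 500 sampling points.
slide = ["vit4"] * 300 + ["och"] * 150 + ["oh"] * 50
fractions = fraction_estimates(slide)
print(fractions)  # {'vit4': 60.0, 'och': 30.0, 'oh': 10.0}
```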

Data Records
The images and datasets generated during the current study are available in the Zenodo repository 21 (https://doi.org/10.5281/zenodo.3745640). This dataset contains macroscopic (Full_Ovaries_Data.zip) and histological images (Stereology_Readings_Data.zip) of the ovaries of 151 European plaice (female, Pleuronectes platessa), as well as three data frames: the macroscopic parameters collected for each fish (Macros.csv), the stereological reading results of the calibration exercise from the 3 operators (Interagent.csv), and the stereological reading results for all 226 histological slides read throughout this study (Stereology.csv).

Images:
• Full_Ovaries_Data.zip: archive in zip format of 151 pictures (.JPG; 8-9 MB; sRGB; 6016 × 4000 pixels) of both ovaries from 151 female plaice dissected during this study.
• Stereology_Readings_Data.zip: archive in zip format of two directories containing the following images:
  • Interagent_Calibration: the pictures (.svs: Aperio single-file pyramidal tiled TIFF, with non-standard metadata and compression) are of the 20 histological slides used for the stereological count during the agent calibration exercise
  • Ovary_Slides: the pictures (Aperio single-file pyramidal tiled TIFF, with non-standard metadata and compression) in this dataset are of the 226 histological slides read during this study

Data frames:
• Intergaent_read_me.txt: a text file (.txt) listing the acronyms used in the Interagent.csv file, as well as their meaning.
• Interagent.csv: a text data file (.csv) with the output of two stereological readings, done by three agents for 15 slides, and by two agents for 20 slides. The information contained in this table is as follows:
  • agent: code id for the three agents that did the calibration exercise (A, B and C)
  • num_fish: fish number for this study. Here we have 20 different fish
  • fish_id: identification number of the fish. This id number is identical to the name given to the pictures of the full ovaries (Full_Ovaries_Data)
  • scan_id: identification number of the digitized histological slide that was used for the stereological count (Stereology_Readings_Data/Interagent_Calibration)
  • total_points: total number of identified structures for the stereological sampling grid of a slide
  • cell-type: abbreviation of the structure identified (lexicon available here: https://archimer.ifremer.fr/doc/00501/61235/). In this study, we have 20 different structures
  • hit_points: number of times a structure has been counted on a single slide
  • Fract_estim: percentage (%) of times a structure was counted on a single slide = (100/total_points) * hit_points
  • reading: reading number. In this study, we have two readings, the first (1) and the second (2)
• Macros_read_me.txt: a text file (.txt) listing the acronyms used in the Macros.csv file, as well as their meaning.
• Macros.csv: a text data file (.csv) containing macroscopic parameter measurements for all 151 fish that have been used during this study. The information contained in this table is as follows:
  • num_fish: fish number for this study. Here we have 151 different female fish
  • fish_id: identification number of the fish. This id number is identical to the name given to the pictures of the full ovaries (Full_Ovaries_Data)
  • gon_pos: gonad position, with D being the dorsal gonad of the individual, and V being the ventral gonad
  • date: the date the fish was caught (dd/mm/yyyy)
  • L_fish: total length of the fish (cm)
  • W_fish: total weight of the fish (g)
  • mat_estim: visually estimated maturity, after observation of the fish's gonad with the naked eye, following the WKMATCH 9 scale
  • age: estimated age (in years) of the fish, after analysis of the fish's otolith. The IFREMER laboratory executed this analysis in Boulogne-sur-Mer (FRANCE)
  • W_gon: gonad weight (g)
  • Kurtosis*: kurtosis parameter
  • Skewness*: skewness coefficient
  • gon_area*: gonad area (mm²)
  • L_gon*: gonad length (mm)
  • width_gon*: maximum gonad width (mm)
  • width_mid_L_gon*: width at mid-length of the gonad (mm)
  • mean_col_index*: the mean color value of the different hues found on the ovary
  • std_dev*: standard deviation of the mean_col_index
  • modal*: modal value or the most frequently occurring color value within the selected ovary
  *: values determined after image analysis of the Full_Ovaries_Data images with the ImageJ software (v. 1.50 J)
• Stereology_read_me.txt: a text file (.txt) listing the acronyms used in the Stereology.csv file, as well as their meaning.
• Stereology.csv: a text data file (.csv) of the stereology count results of 226 slides read during this study. Among these slides, 90 were read to test the homogeneity distribution of different cell types found throughout each ovary, and 151 median histological slides of the ventral ovary were also read. The information contained in this table is as follows:
  • agent: code id for the 3 agents that did the calibration exercise (A, B and C)
  • num_fish: fish number for this study. Here we have a total of 151 fish
  • fish_id: identification number of the fish. This id number is identical to the name given to the pictures of the full ovaries (Full_Ovaries_Data)

Inter-agent calibration and reading protocol set up. To quantify the reading disagreements between operators, the differences in cellular structure identification between several agents were assessed. A total of 15 slides were read by 3 different agents. The 15 slides are median sections of the ventral ovary (V2) of 15 different fish, and were randomly picked from the slides at our disposal at the time of this study.
For every slide, a reading error index was established for each cellular structure by calculating the difference, in percentage points, between the maximum and the minimum counting value obtained by the agents for that structure on the slide. To identify the cell structures that readers had trouble identifying, a threshold of 3% was set by taking the 90% quantile of the distribution of the reading error index. This choice reflects a compromise between the quality of the readings (the low reading error percentage of 3%) and the recognition of significant identification problems (the selection of the 10% most error-prone structures, with a reading error index higher than 3%). Consequently, for all of the structures that showed a reading error index of more than 3%, the slides were reviewed and each reader explained why they chose their respective structures for each sampling point. The results of this first reading exercise allowed the reading protocol 22 to be adjusted and improved, and an error limit of 3% to be set for the reading error index of each structure.
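The reading error index and its quantile-based threshold can be sketched as below. The function names and the percentages are hypothetical, and two choices are assumptions rather than details taken from the protocol: a structure that an agent did not count is treated as 0%, and the quantile uses linear interpolation (the default convention of R's quantile function).

```python
def reading_error_index(fractions_by_agent):
    """Max minus min percentage per structure across agents, for one slide.

    fractions_by_agent: {agent: {structure: percentage}}.
    Assumption: a structure not counted by an agent contributes 0%.
    """
    structures = set()
    for fractions in fractions_by_agent.values():
        structures.update(fractions)
    index = {}
    for structure in structures:
        values = [f.get(structure, 0.0) for f in fractions_by_agent.values()]
        index[structure] = max(values) - min(values)
    return index


def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Hypothetical readings of one slide by the three agents.
slide = {
    "A": {"vit4": 60.0, "och": 30.0, "oh": 10.0},
    "B": {"vit4": 57.0, "och": 32.0, "oh": 11.0},
    "C": {"vit4": 58.5, "och": 31.0, "oh": 10.5},
}
index = reading_error_index(slide)
flagged = sorted(s for s, value in index.items() if value > 2.5)
```

In practice the threshold would be taken from the distribution of the index over all slides and structures, for example `quantile(all_index_values, 0.9)`, rather than from the fixed 2.5 used in this toy example.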
During the second reading exercise, the same 15 slides and sampling grids were read again by the same 3 operators. The results were then analyzed through the reading error index of each structure, as well as the estimation of the percentage agreement between readers 23, and Fleiss's kappa 24,25. These last two inter-rater reliability indexes are statistical indexes based on the degree of agreement between readers for the classification of objects or individuals 23-25.

Cellular homogeneity inter- and intra-gonad. To assess the cellular homogeneity inside the gonad and between the ventral and dorsal gonads, 6 slides matching the anterior, median and posterior sections of the dorsal and ventral gonads (respectively coded D1, D2, D3 and V1, V2, V3) for 15 individuals were read. The 15 ovaries chosen for this part of the study did not show oocytes with advanced vitellogenesis (vit4), hydrating oocytes (och), or hydrated oocytes (oh). The reading of all 90 slides was done by the 3 operators that had previously validated the inter-calibration exercise. For each ovary, each of the three section slides was randomly assigned (by a roll of dice) to one of the three operators. Each slide was assigned only once, and each operator had a collection of 30 slides to read, composed of one slide from each ovary.
Afterward, the differences (in %) between the minimum and maximum count of each type of cellular structure found throughout all 6 sections, as well as histograms showing the number of times each cellular structure was counted within the ovaries of a single individual, were established to better visualize the results found for all 15 fish. With the aim of objectively stating the effects of the section position within the gonad, as well as cellular structure occurrences within these sections, generalized linear models (GLM) were fitted.
These models were used to highlight the differences between the 6 slides, for each cellular structure, and for all 15 individuals. The response variable used was the number of times a structure was counted on a single slide divided by the total number of sampled points on that same slide. For the GLM, the error term followed a binomial distribution, and a logit regression model 26,27 was used. The model results were then analyzed by using the deviances of each variable (the 20 cellular structures). The function drop1 27 was used to quantify the deviance of each variable by removing the variables from the whole model one at a time. A principal component analysis (PCA) on the histological structures was performed to summarize and plot the reading data.
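The slide assignment for the homogeneity test can be sketched as follows. This is one reading of the design, in which the three sections of each ovary are shuffled among the three operators so that every operator receives exactly one slide per ovary. A pseudo-random generator stands in for the dice, and the fish identifiers are hypothetical.

```python
import random

OPERATORS = ("A", "B", "C")
SECTIONS = ("1", "2", "3")  # anterior, median, posterior
OVARIES = ("D", "V")        # dorsal, ventral


def assign_slides(fish_ids, rng=None):
    """Map each (fish, slide code) to one operator, one slide per ovary each."""
    rng = rng or random.Random()
    assignment = {}
    for fish in fish_ids:
        for ovary in OVARIES:
            readers = list(OPERATORS)
            rng.shuffle(readers)  # random order of operators for this ovary
            for section, reader in zip(SECTIONS, readers):
                assignment[(fish, ovary + section)] = reader
    return assignment


# Hypothetical identifiers for the 15 fish.
fish_ids = [f"fish{i:02d}" for i in range(1, 16)]
assignment = assign_slides(fish_ids, random.Random(0))
slides_per_operator = {
    op: sum(1 for reader in assignment.values() if reader == op)
    for op in OPERATORS
}
print(slides_per_operator)  # {'A': 30, 'B': 30, 'C': 30}
```

Shuffling within each ovary, rather than drawing an operator independently for every slide, is what guarantees the 30-slide load per operator described above.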
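For reference, a binomial GLM with a logit link of the kind described above can be written in generic form as follows. The notation is ours and the content of the linear predictor is an assumption, since the exact model terms are not listed here. For slide $i$ and structure $s$, $k_{is}$ is the number of points assigned to the structure, $n_i$ the total number of sampled points, $p_{is}$ the underlying proportion, and $\eta_{is}$ the linear predictor (for example, effects of section position and structure). $D(\cdot)$ is the model deviance, and $M_{-j}$ is the full model $M$ with term $j$ removed, which is the comparison that drop1 reports.

```latex
k_{is} \sim \mathrm{Binomial}(n_i,\, p_{is}), \qquad
\operatorname{logit}(p_{is}) = \ln\!\left(\frac{p_{is}}{1 - p_{is}}\right) = \eta_{is}, \qquad
\Delta D_j = D(M_{-j}) - D(M)
```

A large $\Delta D_j$ indicates that removing term $j$ noticeably worsens the fit, that is, that the term explains a substantial share of the variation in the counts.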

Code availability
For the inter-agent calibration results, as well as the cellular homogeneity verification, code was written using RStudio (version 1.2.5001). This code can be obtained by contacting either Carine Sauger (carine.sauger@gmail.com) or Laurent Dubroca (laurent.dubroca@gmail.com).