Tracking Fish Abundance by Underwater Image Recognition

Marine cabled video-observatories allow the non-destructive sampling of species at frequencies and durations never attained before. Nevertheless, the lack of appropriate methods to automatically process video imagery limits the use of this technology for ecosystem monitoring. Automation is a prerequisite for dealing with the huge quantities of video footage captured by cameras, and can transform these devices into true autonomous sensors. In this study, we developed a novel methodology based on genetic programming for content-based image analysis. Our aim was to capture the temporal dynamics of fish abundance. We processed more than 20,000 images acquired in a challenging real-world coastal scenario at the OBSEA-EMSO testing-site. The images were collected at a 30-min frequency, continuously for two years, day and night. The highly variable environmental conditions allowed us to test the effectiveness of our approach under changing light radiation, water turbidity, background confusion, and bio-fouling growth on the camera housing. The automated recognition results were highly correlated with the manual counts, and they proved highly reliable for tracking fish abundance variations at hourly, daily, and monthly time scales. In addition, our methodology can easily be transferred to other cabled video-observatories.

1 Image segmentation and feature extraction

Figure S1 summarizes the image elaboration tasks for learning and for executing the automated image recognition algorithm. All the image elaboration tasks shown in the two pipelines were implemented in Python, using the OpenCV library [1].

Figure S1: Schematic representation of the pipelines used for learning the image binary classifier (a), and for executing the automated image recognition (b).

Training and validation pipeline
The training and validation phase aims at learning a binary classifier capable of automatically recognising the content of an image region, following the methodology discussed in [4].
The proposed approach learns the binary classifier from a set of positive and negative examples, within a supervised machine learning approach [3,5]. The example set is obtained through the sequence of tasks shown in Figure S1a) and described below.
Image Differencing: the image dataset used for training and validation is organised as a time-series where each image carries a time stamp. This temporal organisation allows the use of the image differencing approach discussed in [8]. Image differencing removes the image regions that do not change between consecutive images, such as the background and the patches of bio-fouling on the camera port-hole. At the same time, it highlights the image regions that change across consecutive images (e.g., fish specimens). An example of image differencing between two consecutive images acquired at time t and time t − 1 is shown in Figure S2. Other differencing techniques, involving images acquired at times t, t − 1 and t + 1, can be implemented with the same objective and are also discussed in [8].
Figure S2: An example of image differencing [8] between two consecutive images acquired at time t and at time t − 1. Letters A, B, C, D and E in the image |I_t − I_{t−1}| mark the computed differences between the images I_t and I_{t−1}. In particular, regions A, B, C, E and F in image I_t contain fish, while region D contains a patch of bio-fouling. Due to the very low contrast in the lower right corner of images I_t and I_{t−1}, the fish specimen F does not emerge in the image difference.
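The differencing step can be sketched with NumPy as follows (a minimal illustration: the fixed threshold of 25 grey levels and the toy frames are assumptions made here for brevity; in OpenCV the same operation would rely on cv2.absdiff followed by thresholding):

```python
import numpy as np

def frame_difference(img_t, img_t_minus_1, threshold=25):
    """Absolute difference between two consecutive grey-level frames.

    Static regions (background, bio-fouling patches) cancel out, while
    moving subjects (e.g. fish) produce high-difference blobs.
    """
    diff = np.abs(img_t.astype(np.int16) - img_t_minus_1.astype(np.int16))
    mask = (diff > threshold).astype(np.uint8)  # 1 where the scene changed
    return diff.astype(np.uint8), mask

# Toy frames: a static background with a bright "fish" that moves.
prev_frame = np.full((8, 8), 40, dtype=np.uint8)
prev_frame[1:3, 1:3] = 200          # fish at the top-left at time t-1
curr_frame = np.full((8, 8), 40, dtype=np.uint8)
curr_frame[5:7, 5:7] = 200          # fish moved to the bottom-right at time t

diff, mask = frame_difference(curr_frame, prev_frame)
# Both the old and the new fish positions are flagged as changed regions,
# matching the behaviour of regions A-E in Figure S2.
```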
Image Segmentation: the obtained image difference is then segmented to obtain Regions of Interest (RoIs) potentially containing fish specimens. First, the image difference is blurred with a bilateral filter, with the aim of removing the noise generated by the differencing task while keeping the edges of foreground subjects sharp. Then a Gaussian adaptive thresholding and a morphological opening operator are applied in order to binarise the difference image and remove small, irrelevant blobs [8,1]. For each binary blob identified in the image difference, the algorithm defined in [2,1] was used to extract the region contour. The convex hull of each blob is then computed and mapped onto the input image acquired at time t.
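The binarisation and opening steps can be sketched without OpenCV as follows (a simplified illustration: a global threshold stands in for the Gaussian adaptive threshold of cv2.adaptiveThreshold, and a 3x3 cross-shaped structuring element is assumed; contour and convex-hull extraction would then follow via cv2.findContours and cv2.convexHull):

```python
import numpy as np

def erode(b):
    # 3x3 cross erosion: a pixel survives only if it and its 4-neighbours are set.
    out = np.zeros_like(b)
    out[1:-1, 1:-1] = (b[1:-1, 1:-1] & b[:-2, 1:-1] & b[2:, 1:-1]
                       & b[1:-1, :-2] & b[1:-1, 2:])
    return out

def dilate(b):
    # 3x3 cross dilation: a pixel is set if itself or any 4-neighbour is set.
    out = b.copy()
    out[:-1, :] |= b[1:, :]
    out[1:, :] |= b[:-1, :]
    out[:, :-1] |= b[:, 1:]
    out[:, 1:] |= b[:, :-1]
    return out

def binarise_and_open(diff, thresh=25):
    # Opening = erosion followed by dilation, as cv2.morphologyEx with
    # cv2.MORPH_OPEN would do; it suppresses small, irrelevant blobs.
    binary = (diff > thresh).astype(np.uint8)
    return dilate(erode(binary))

# Toy difference image: a 4x4 candidate blob plus one isolated noise pixel.
diff = np.zeros((10, 10), dtype=np.uint8)
diff[2:6, 2:6] = 160   # plausible fish blob
diff[8, 8] = 160       # single-pixel noise
opened = binarise_and_open(diff)
```

After opening, the isolated noise pixel disappears while the larger blob survives and can be passed on to contour extraction.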
Region Labelling: the RoIs identified on the input image are then labelled in order to define the set of positive and negative examples used for learning the RoI binary classifier. In particular, each RoI identified by the segmentation task is visually inspected and manually labelled with 1 if it contains at least one fish specimen, and with 0 otherwise. A software component with a simple user interface was developed to perform the manual RoI labelling.
Feature Extraction: the bounding box of each labelled RoI was computed, and the image features representing the bounding box interior were extracted according to Tables S1 and S2. Among the geometric image features shown in Table S1, the lengths of the minor and major axes (axm, axM), the convex hull perimeter (perimeter), and the convex hull and bounding box areas (cntArea, bbArea) are all expressed in pixels and describe the size of the relevant subject. The eccentricity (ecc), the equivalent diameter (equiDiameter) and the aspect ratio (aspectRatio) of the convex hull, together with the extent (extent) and the solidity (solidity) of the RoI content, describe the shape of the relevant subject.
Among the texture image features shown in Table S2, the histogram shape index (histIndex) captures the overall pixel intensity variance inside the analysed region. It is obtained by transforming the region into a grey-level image and extracting the histogram h of the pixel intensities. Similarly, the standard deviation of the mean grey level (std) captures the variation of the pixel intensity with respect to the region mean grey intensity µ. The entropy (ent) of h captures the information stored in the region. Finally, the normalized contrast index (contrast) is defined as the ratio between the difference in the mean grey level inside the region (mean(gInt)) and outside the region, but within the oriented bounding box (mean(gExt)), and the mean grey level inside the whole bounding box.
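Two of the texture features can be sketched directly from their definitions (a minimal illustration: the exact binning and the toy bounding box are assumptions; the definitions follow the descriptions of ent and contrast given above):

```python
import numpy as np

def ent(region):
    # Entropy of the grey-level histogram h (feature `ent`).
    h, _ = np.histogram(region, bins=256, range=(0, 256))
    p = h[h > 0] / h.sum()
    return float(-(p * np.log2(p)).sum())

def contrast(bbox, roi_mask):
    # Normalized contrast index (feature `contrast`): difference between
    # the mean grey level inside the RoI, mean(gInt), and outside it but
    # within the bounding box, mean(gExt), divided by the mean grey level
    # of the whole bounding box.
    g_int = bbox[roi_mask].mean()
    g_ext = bbox[~roi_mask].mean()
    return float((g_int - g_ext) / bbox.mean())

# A 4x4 bounding box whose central 2x2 RoI is much brighter than its surround.
bbox = np.full((4, 4), 40, dtype=np.uint8)
bbox[1:3, 1:3] = 200
roi_mask = np.zeros((4, 4), dtype=bool)
roi_mask[1:3, 1:3] = True
```

A uniform region has zero entropy, while a bright RoI on a dark surround yields a large positive contrast, which is consistent with the intuition that fish specimens stand out against the background.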
All the previously discussed image features were chosen so that their computational cost is linear in the number of pixels. Moreover, although a single image feature may appear irrelevant to the recognition of fish specimens, it can become relevant when combined with other image features.

Automated image recognition pipeline
The output of the training and validation process is a binary classifier ready to be used for fish recognition in unseen images. The automated image recognition pipeline shown in Figure S1b) is similar to the training and validation pipeline. The content of the image acquired at time t is obtained by computing the difference with the image acquired at time t − 1. During the automated recognition phase, no user interaction is needed, and the relevant image features selected during the learning process are extracted from every identified RoI. These image features are then fed to the binary classifier, which returns 1 if the RoI contains at least one fish specimen, and 0 otherwise.

Image recognition and feature selection
The Supervised Machine Learning approach used in this work is based on a Genetic Programming (GP) procedure [6,11,7]. GP is an evolutionary computation methodology capable of learning how to accomplish a given task. GP generates solutions to the given task starting from an initial population of randomly generated mathematical expressions, built from a set of mathematical primitives, constants and variables. The initial solutions are improved by mimicking the selection processes that occur naturally in biological systems, through the Selection, Crossover and Mutation genetic operators [6].
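The GP loop can be sketched in pure Python as follows (a minimal illustration, not the paper's configuration: the primitive set, tree depth, toy examples, thresholding at 0, and the mutation-only variation step are all assumptions made here for brevity; the actual parameters, including crossover, are given in Table S3):

```python
import random, operator

# Assumed toy primitive set and the three feature variables used later.
PRIMITIVES = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}
FEATURES = ['contrast', 'equiDiameter', 'ent']

def random_tree(depth=2):
    # An expression tree is a nested tuple; leaves are image features.
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES)
    op = random.choice(list(PRIMITIVES))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, roi):
    # Instantiate each variable with the corresponding feature value.
    if isinstance(tree, str):
        return roi[tree]
    op, left, right = tree
    return PRIMITIVES[op](evaluate(left, roi), evaluate(right, roi))

def fitness(tree, examples):
    # Accuracy of the expression thresholded at 0 as a binary classifier.
    hits = sum((evaluate(tree, roi) > 0) == label for roi, label in examples)
    return hits / len(examples)

def mutate(tree):
    # Crude mutation: replace the tree with a fresh random one half the time.
    return random_tree() if random.random() < 0.5 else tree

def evolve(examples, pop_size=30, generations=20):
    pop = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, examples), reverse=True)
        survivors = pop[:pop_size // 2]              # selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda t: fitness(t, examples))

# Toy labelled RoIs: positives have positive feature values, negatives negative.
random.seed(42)
positives = [{'contrast': 1.5 + i, 'equiDiameter': 2.0 + i, 'ent': 0.5 + i}
             for i in range(4)]
negatives = [{'contrast': -1.5 - i, 'equiDiameter': -2.0 - i, 'ent': -0.5 - i}
             for i in range(4)]
examples = [(r, True) for r in positives] + [(r, False) for r in negatives]
best = evolve(examples)
```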
In this work, the binary classifiers evolved by the GP-based approach are expressed as mathematical functions whose variables correspond to the image features discussed in Section 1.
To evolve the GP-based classifiers the following parameters have to be chosen: the set of mathematical primitives, the number of individuals of the initial population, the number of generations the individuals evolve through, the specific parameters driving the crossover and the mutation among individuals, as shown in Table S3. According to the feature selection method proposed in [9,4], the relevant image features are identified by analysing the number of their occurrences among the classifiers of the population pool obtained by nesting the GP procedure within a K-fold Cross Validation framework (K = 10). Figure S3 shows the probability distribution of the image feature occurrences (green dotted line) according to the Bernoulli trial.
The red filled circles represent the number of occurrences of each image feature in the population pool, while the vertical red lines mark the two-tailed p-value of 0.001 used to select the relevant image features: the image features to the right of the right vertical line are deemed relevant.
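The relevance test can be sketched with the standard library as follows (a simplified illustration: the pool size, the occurrence count, and the uniform null hypothesis p = 1/14 are assumed numbers, and only the upper tail is computed here, whereas the paper applies a two-tailed test following [9,4]):

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k successes under Binomial(n, p).
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail_p(k, n, p):
    # P(X >= k): the chance of observing at least k occurrences of a
    # feature if features appeared in classifiers uniformly at random.
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Assumed illustrative numbers: a pool of 100 evolved classifiers and
# 14 candidate image features, i.e. p = 1/14 under the null hypothesis.
n_pool, n_features = 100, 14
p_null = 1 / n_features
occurrences = 25                      # hypothetical count for one feature
p_value = upper_tail_p(occurrences, n_pool, p_null)
relevant = p_value < 0.001            # same significance level as the paper
```

With these numbers the expected count under the null is about 7, so 25 occurrences fall far into the tail and the feature is flagged as relevant.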
From the relevance analysis, eight out of fourteen image features resulted relevant, as shown in Table S4. Nevertheless, the automated image recognition was defined as an ensemble of all the individuals of the population pool containing the three most occurring image features (i.e., contrast, equiDiameter, and ent), as discussed in [4]. These individuals are listed in Table S5, and the ensemble of the selected individuals is defined by equation (Eq. S1), where r is the unknown RoI to be classified, C_ens is the set of individuals shown in Table S5, and eval(c(r)) is the real number obtained by evaluating the classifier c, instantiating each variable with the corresponding image feature value.

Figure S3: Relevance of the image features, according to the test statistics discussed in [4,9]. The abscissa represents the number of occurrences of each image feature within the population pool. The ordinate represents the probability that an image feature occurred in the population pool. The two red vertical lines represent the two-tailed p-values with p equal to 0.001.
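One common way to combine the eval(c(r)) values of an ensemble is a majority vote, sketched below (an assumed aggregation rule shown only for illustration; the actual combination is the one given by Eq. S1, and the classifiers here are hypothetical stand-ins for the individuals of Table S5):

```python
def ensemble_classify(roi, classifiers):
    # Majority vote: each classifier's real-valued output eval(c(r)) is
    # thresholded at 0, and the RoI is labelled 1 when most classifiers
    # agree that it contains a fish specimen.
    votes = sum(c(roi) > 0 for c in classifiers)
    return 1 if votes > len(classifiers) / 2 else 0

# Hypothetical evolved individuals over the three most relevant features.
classifiers = [
    lambda r: r['contrast'] - 0.5,
    lambda r: r['equiDiameter'] - 10,
    lambda r: r['ent'] - 2,
]
fish_roi = {'contrast': 1.2, 'equiDiameter': 14, 'ent': 1.5}
empty_roi = {'contrast': 0.1, 'equiDiameter': 5, 'ent': 1.0}
```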

Efficacy of the automated recognition for ecological analyses
This section reports the statistical analyses comparing the observed and recognised time-series that are not described in the main paper.
In particular, the PERMutational ANalysis Of VAriance (PERMANOVA) and the Generalised Linear Model (GLM) were performed, and the corresponding results for bio-fouling scores greater than 0 are shown in Tables S6 and S7, respectively.