Annotation-free learning of plankton for classification and anomaly detection

The acquisition of increasingly large plankton digital image datasets requires automatic methods of recognition and classification. As data size and collection speed increase, manual annotation and database representation become bottlenecks for the use of machine learning algorithms for taxonomic classification of plankton species in field studies. In this paper we present a novel set of algorithms to perform accurate detection and classification of plankton species with minimal supervision. Our algorithms approach the performance of existing supervised machine learning algorithms when tested on a plankton dataset generated from a custom-built lensless digital device. Similar results are obtained on a larger image dataset obtained from the Woods Hole Oceanographic Institution. Additionally, we introduce a new algorithm to perform anomaly detection on unclassified samples, where an anomaly is defined as a significant deviation from the established classification. Our algorithms are designed to provide a new way to monitor the environment with a class of rapid online intelligent detectors.

The partition coefficient (PC) is defined as:

PC = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{2}    (2)

where u_{ij} is the degree of membership of observation x_i in cluster j. The PC is evaluated for numbers of clusters greater than or equal to 1, and the value of K that maximizes it is adopted as an estimate of the total number of classes. The PC takes values in the range [1/K, 1]; the closer it is to 1/K, the smaller the separation between the analyzed data.
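As an illustration, the following minimal Python/NumPy sketch (variable names are hypothetical; U is an N x K fuzzy membership matrix produced by any fuzzy clustering run) computes the PC and could be used to select the number of clusters that maximizes it.

import numpy as np

def partition_coefficient(U):
    """Partition coefficient of an N x K fuzzy membership matrix U."""
    n_samples = U.shape[0]
    return np.sum(U ** 2) / n_samples

# Hypothetical usage: memberships_by_k[k] holds the membership matrix obtained
# by running fuzzy k-means with k clusters.
# best_k = max(memberships_by_k, key=lambda k: partition_coefficient(memberships_by_k[k]))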

Xie-Beni coefficient
The Xie-Beni coefficient was introduced in 2 and is defined as:

XB = \frac{\sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^{2} \, \lVert x_j - c_i \rVert^{2}}{N \cdot \min_{i \neq k} \lVert c_i - c_k \rVert^{2}}    (3)

where x_j represents a feature vector (one observation) used as input to the clustering algorithm, c_i is the centroid of cluster i, u_{ij} is the degree of membership of x_j in cluster i, and N is the number of observations. The Xie-Beni coefficient is also referred to as the compactness and separation index. The number of clusters is determined by maximizing the separation between clusters and the compactness within each cluster, i.e., by selecting the number of clusters that minimizes XB.
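A minimal Python/NumPy sketch of this computation, assuming X (N x D data), C (K x D centroids) and U (N x K memberships) come from a fuzzy clustering run; smaller values indicate more compact, better-separated clusters.

import numpy as np
from scipy.spatial.distance import cdist

def xie_beni(X, C, U, m=2.0):
    """Xie-Beni index for data X, centroids C and membership matrix U."""
    n_samples = X.shape[0]
    d2 = cdist(X, C, metric="sqeuclidean")           # squared sample-centroid distances, N x K
    compactness = np.sum((U ** m) * d2)               # weighted within-cluster scatter
    centroid_d2 = cdist(C, C, metric="sqeuclidean")   # pairwise centroid distances
    np.fill_diagonal(centroid_d2, np.inf)             # ignore self-distances
    separation = n_samples * centroid_d2.min()
    return compactness / separation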
Both PC and Xie-Beni have been applied to the lensless microscope dataset, and have yielded incorrect results with respect to the PE (see Fig. S5). In Fig. S5, each box shows the distribution of ten runs on random subsets of all species. For example, for the leftmost box, 3 species have been randomly chosen from the lensless microscope database; this procedure is repeated ten times, and the mode of the resulting estimates is used as the estimated number of clusters.

K-MEANS
K-means is a clustering algorithm widely used in data mining for partitioning large sets of data.
The term k-means was introduced for the first time by James MacQueen in 3. The standard version of the algorithm, which we briefly describe in this section, is also referred to as Lloyd's algorithm and was introduced by Lloyd in 4. The algorithm consists of two different steps. The first one consists in choosing the centroids c_i for each of the k clusters. The second one consists in assigning each point to a cluster, minimizing an objective function. Let us consider a set of observations X = \{x_1, x_2, \ldots, x_N\}, with x_i \in \mathbb{R}^D. The k-means algorithm separates the observations into k partitions S = \{S_1, S_2, \ldots, S_k\}, so as to minimize the intra-partition variance:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^{2}    (4)

After the first k centroids are chosen, the algorithm assigns each observation x to the cluster corresponding to the nearest centroid (according to the Euclidean distance). Then, the centroids are updated using the mean intra-cluster value, iterating the procedure until convergence (defined as no update in the centroids). The k-means algorithm is a crisp clustering algorithm, i.e., there is a hard membership forcing a given point to belong exclusively to one cluster.
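The two steps described above can be sketched in a few lines of Python/NumPy (a didactic illustration, not the implementation used in this work):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: X is an (N, D) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid in Euclidean distance
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids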

FUZZY K-MEANS
Fuzzy clustering is a set of algorithms that relaxes the crisp constraint of hard memberships, allowing each point to belong to two or more clusters. The fuzzy k-means was introduced in 5 and 6. Given a set of observations X = \{x_1, x_2, \ldots, x_N\}, with x_i \in \mathbb{R}^D, the algorithm aims to minimize the following objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}    (5)

The parameter m is the fuzziness, u_{ij} is the degree of membership of observation x_i in cluster j, c_j is the centroid of cluster j, and \lVert \cdot \rVert can be any norm expressing the similarity between observations and centroids. The centroids and the degrees of membership are updated according to Eq. (6) and (7):

u_{ij} = \frac{1}{\sum_{l=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}}    (6)

c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}    (7)

The algorithm converges to a saddle point of the objective function in Eq. (5) when the maximum change in the degrees of membership is lower than a given tolerance \varepsilon.
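A compact NumPy sketch of the update rules in Eqs. (6) and (7) (illustrative only; the tolerance eps and the fuzziness m = 2 are typical choices, not necessarily those used in this work):

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, eps=1e-5, n_iter=300, seed=0):
    """Fuzzy k-means: returns the membership matrix U (N x k) and centroids C (k x D)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)                       # memberships sum to 1 per sample
    for _ in range(n_iter):
        C = (U ** m).T @ X / (U ** m).sum(axis=0)[:, None]  # Eq. (7): centroid update
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)                     # Eq. (6): membership update
        if np.abs(U_new - U).max() < eps:                   # max membership change below tolerance
            return U_new, C
        U = U_new
    return U, C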

MIXTURES OF GAUSSIAN MODELS
The Gaussian Mixture Model (GMM) is a parametric statistical model assuming data are generated from a linear combination of Gaussian sources 7. Given a set of observations X = \{x_1, x_2, \ldots, x_N\}, with x_i \in \mathbb{R}^D, the GMM associates each observation x_i with a probability p(x_i) according to equation (8):

p(x_i) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)    (8)

where \mathcal{N}(\cdot \mid \mu_j, \Sigma_j) is a multivariate normal distribution with mean \mu_j and covariance \Sigma_j, and K is the number of Gaussian distributions to consider, which corresponds to the number of clusters into which the observations are partitioned. The \pi_j are the coefficients of the linear combination of Gaussians, with the constraints 0 \le \pi_j \le 1 and \sum_{j=1}^{K} \pi_j = 1. The complete set of parameters is the vector \theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K). To estimate \theta, an Expectation-Maximization (EM) approach is adopted. Starting from an initial value for \theta (e.g., random), the E-step consists in computing the N x K matrix of the degrees of membership \gamma_{ij} of each observation x_i in cluster j as:

\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}    (9)

The degrees of membership satisfy the condition \sum_{j=1}^{K} \gamma_{ij} = 1. The M-step consists in updating the parameter set according to the estimated degree-of-membership matrix, according to Eqs. (10), (11) and (12):

\mu_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} \, x_i    (10)

\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^{T}    (11)

\pi_j = \frac{N_j}{N}, \quad \text{with } N_j = \sum_{i=1}^{N} \gamma_{ij}    (12)

The two-step process is iterated until convergence, defined as a change in the log-likelihood over the parameter space lower than a determined tolerance \varepsilon.
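In practice a GMM fitted by EM is available off the shelf; a minimal scikit-learn sketch (an assumed library, with placeholder file and variable names) is:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.load("features.npy")   # placeholder: (N, D) feature matrix
K = 10                        # placeholder: assumed number of clusters

gmm = GaussianMixture(n_components=K, covariance_type="full", tol=1e-3, random_state=0)
labels = gmm.fit_predict(X)   # EM stops when the log-likelihood change drops below tol
resp = gmm.predict_proba(X)   # N x K matrix of degrees of membership (gamma_ij)
pi, mu, Sigma = gmm.weights_, gmm.means_, gmm.covariances_   # estimated parameter set theta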

Shape descriptors
Hu moments
The Hu moment descriptors were introduced in 8. Hu moments are invariant with respect to rotation, scaling and translation.
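For instance, with OpenCV the seven Hu moments of a binary mask (placeholder input file) can be computed as follows; the log transform is a common, optional step to compress their dynamic range:

import cv2
import numpy as np

mask = cv2.imread("plankton_mask.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
moments = cv2.moments(mask)                                    # raw spatial/central moments
hu = cv2.HuMoments(moments).flatten()                          # seven Hu invariant moments
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)           # optional log-scaling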

Zernike moments
The Zernike moments 9 are a set of complex, orthogonal polynomials defined over the interior of the unit circle x^2 + y^2 \le 1. The two-dimensional Zernike moment of order m and repetition n is defined as:

Z_{mn} = \frac{m+1}{\pi} \iint_{x^2 + y^2 \le 1} f(x, y) \, [V_{mn}(x, y)]^{*} \, dx \, dy

with m = 0, 1, 2, \ldots, \infty, where * denotes the complex conjugate, f(x, y) is the function to describe, and n is an integer such that m - |n| is even and |n| \le m.
The orthogonal polynomials V_{mn} can be expressed in polar coordinates as:

V_{mn}(\rho, \theta) = R_{mn}(\rho) \, e^{j n \theta}

where R_{mn}(\rho) is the radial polynomial. Zernike moments show robustness to noise, expressive efficiency and fast computation. We implemented a customized version of Zernike moments starting from 10.
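The paper uses a customized implementation (starting from 10); purely for illustration, a sketch with the mahotas library (an assumption, not the authors' code) computes the Zernike moment magnitudes up to a given degree within a disc of given radius:

import cv2
from mahotas.features import zernike_moments

img = cv2.imread("plankton_mask.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
radius = min(img.shape) // 2                                  # disc over which moments are computed
zm = zernike_moments(img, radius, degree=8)                   # magnitudes of the Zernike moments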

Gray Scale Co-occurrence Matrix (GSCM) and Haralick features
The GSCM is one of the earliest methods for texture feature extraction, proposed by Haralick, describing an image by comparing each pixel to its neighborhood 13. Given a pixel I(i, j) in an image I, we consider a window of size k x k centered in (i, j). The intensity value of the pixel (i, j) is subtracted from all the pixels inside the window, and the resulting differences are binarized: each positive value is set to one, each negative (or null) value is set to zero. The resulting matrix is used to build a binary code by concatenating each digit starting from the top-left one and proceeding clockwise. The resulting code is referred to as a local binary pattern (LBP, see fig S6). It is possible to use a subset of the 2^p possible LBP codes to describe the texture of an image.
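As an illustration, scikit-image provides a local binary pattern implementation (an assumed library, not necessarily the one used here); a normalized histogram of the codes serves as the texture descriptor:

import cv2
import numpy as np
from skimage.feature import local_binary_pattern

img = cv2.imread("plankton.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
p, r = 8, 1                                               # 8 neighbours on a circle of radius 1
lbp = local_binary_pattern(img, P=p, R=r, method="uniform")              # per-pixel LBP codes
hist, _ = np.histogram(lbp, bins=p + 2, range=(0, p + 2), density=True)  # texture descriptor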

Shape signatures and Fourier Descriptors (FD)
A shape signature 14 is a contour-based local representation of shape features, characterized by sensitivity to noise and limited robustness. It has been shown that using shape signatures for computing the FD improves the accuracy of the representation. A shape signature is a 1-D function representing the 2D contour of an image. In our work, we used the centroid distance:

r(t) = \sqrt{(x(t) - x_c)^2 + (y(t) - y_c)^2}

where (x_c, y_c) is the image's centroid. The centroid distance is invariant with respect to translation.
The Fourier transformation of the shape signature has been widely used for shape analysis and retrieval 15.
We used the set of points defining the contour of the analyzed shape provided by OpenCV to compute the FD. In detail, after computing the centroid distance signature, we evaluate the Fast Fourier Transform (FFT). Given our shape signature z(t), the FFT of z(t) is:

a_n = \frac{1}{N} \sum_{t=0}^{N-1} z(t) \, e^{-j 2 \pi n t / N}

To obtain rotation invariance, only the real part of FFT[z(t)] is considered. A normalization with respect to the first Fourier coefficient is used to obtain scaling invariance:

FD_n = \frac{\operatorname{Re}(a_n)}{\operatorname{Re}(a_0)}

It has been shown that the first ten FD can be enough to accurately describe an image's shape for retrieval and classification 14.
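A sketch of the described pipeline with OpenCV and NumPy (illustrative; the contour extraction parameters and file name are assumptions, and the OpenCV 4.x API is assumed):

import cv2
import numpy as np

mask = cv2.imread("plankton_mask.png", cv2.IMREAD_GRAYSCALE)            # placeholder binary mask
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
contour = max(contours, key=cv2.contourArea).squeeze()                  # (T, 2) boundary points

xc, yc = contour.mean(axis=0)                                           # centroid of the shape
z = np.sqrt((contour[:, 0] - xc) ** 2 + (contour[:, 1] - yc) ** 2)      # centroid-distance signature

a = np.fft.fft(z) / len(z)             # FFT of the shape signature
fd = np.real(a) / np.real(a[0])        # real part, normalized by the first coefficient
descriptor = fd[1:11]                  # first ten Fourier descriptors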

Multidimensional plot
A major problem for graphical representation of multivariate data is the dimensionality. In this manuscript, we exploited two different methods for visualizing our multidimensional set of features: parallel coordinates and Andrew's curve.

PARALLEL COORDINATES
Parallel coordinates are a method for the graphical representation of multidimensional data. They were introduced in 16, and consist in associating an N-dimensional datum with N vertical parallel axes.
Each observation is represented by a polyline whose vertex on the j-th vertical axis represents the j-th coordinate of the observation, that is, the specific value of the observation along dimension j.
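For example, with pandas (an assumed plotting backend; file and column names are hypothetical) a parallel-coordinates plot of the feature table can be produced as:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.read_csv("features.csv")                 # placeholder: one row per sample, one column per feature
parallel_coordinates(df, "species", alpha=0.3)   # "species" is the class label column
plt.show()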

ANDREW'S CURVE
Andrew's curves 17 provide a visualization of multidimensional data by mapping each observation x = (x_1, x_2, \ldots, x_N) into a function defined as:

f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \ldots, \quad -\pi \le t \le \pi

It has been proved that Andrew's curves preserve the mean, distance and variance of the multidimensional data, meaning that close Andrew's curves correspond to close data observations.
This set of features spans a 131-dimensional space, whose shape represents the projection of the biological diversity of the collection of organisms. In this section we report a graphical representation for each subset of features previously described.
Table S1. Computational time on Raspberry Pi for the analysis of one sample. The standard deviation is computed among the objects contained in the 60 frames of the analyzed video.
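As an illustration of the Andrew's curves described above, pandas offers a built-in implementation (an assumed tool; file and column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves

df = pd.read_csv("features.csv")          # placeholder: feature table with a "species" label column
andrews_curves(df, "species", alpha=0.3)  # one curve f_x(t) per observation, colored by class
plt.show()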

COMPUTATIONAL TIME ON PC
We used a PC with a quad-core i7 2.9 GHz CPU, 32 GB of RAM and an NVIDIA GeForce 940MX GPU with 2 GB of RAM. The total computational time required for the image processor module applied to the 100 images for each of the 40 species included in the WHOI dataset is around 49 seconds. The feature extraction module for the same dataset requires 720 seconds (around 0.18 seconds per sample). The partitioning algorithm requires 32 seconds for the complete dataset, while the classification module based on random forest required 31 seconds (0.007 seconds per sample).

TIME REQUIRED FOR DEC-DETECTORS TRAINING ON PC
We trained the DEC detectors using the 500 training images for each of the ten species included in the lensless microscope dataset. The average training time (see main text for details of the training process) is t = 39.2 ± 3.4 seconds.