Active particle feedback control with a single-shot detection convolutional neural network

The real-time detection of objects in optical microscopy allows their direct manipulation, which has recently become a new tool for the control, e.g., of active particles. For larger heterogeneous ensembles of particles, detection techniques are required that can localize and classify different objects with strongly inhomogeneous optical contrast at video rate, which is often difficult to achieve with conventional algorithmic approaches. We present a convolutional neural network single-shot detector which is suitable for real-time applications in optical microscopy. The network is capable of localizing and classifying multiple microscopic objects at up to 100 frames per second in images as large as 416 × 416 pixels, even at very low signal-to-noise ratios. The detection scheme can be easily adapted and extended, e.g., to new particle classes and additional parameters, as demonstrated for the particle orientation. The developed framework is shown to selectively control self-thermophoretic active particles in a heterogeneous ensemble. Our approach will pave the way for new studies of collective behavior in active matter based on artificial interaction rules.


Architecture
We adapted the "Tiny" version of the YOLOv2 architecture: TinyYOLOv2 [1]. It uses 9 convolutional layers with a 3 × 3 kernel and 6 max-pooling layers with a 2 × 2 kernel (Fig. S1, Tab. S1) to simultaneously predict multiple bounding boxes and class probabilities for those boxes. The last convolutional layer has a 1 × 1 kernel and reduces the data to the output shape 13 × 13 × B · (4 + 1 + C).

Output Decoding
The network takes an input RGB image of size 416 × 416 pixels and divides it into 13 × 13 grid cells (Fig. 1a). For each grid cell it predicts B bounding boxes, and for each bounding box an object confidence as well as probabilities for each of the C classes. The output is encoded in a 13 × 13 × B · (4 + 1 + C) tensor. Since there are 13 × 13 = 169 grid cells, for B = 5 the network predicts 169 · 5 = 845 bounding boxes at once (Fig. 1b). For C = 2 classes each bounding box is then described by 4 + 1 + C = 7 data elements (Fig. S2):

• the bounding box parameters t_x, t_y, t_w, t_h
• the object confidence p_0
• the probabilities for all classes (p_1 … p_C)

YOLOv2 predicts the coordinates of the bounding boxes relative to the location of the grid cell using anchor boxes. This bounds the ground truth to fall between 0 and 1. If the location of the grid cell is c_x, c_y and the relative width and height of the anchor box are b_w, b_h, the predictions t_x, t_y, t_w, t_h correspond to:

$$x = \frac{\sigma(t_x) + c_x}{G}, \quad y = \frac{\sigma(t_y) + c_y}{G}, \quad w = b_w \, e^{t_w}, \quad h = b_h \, e^{t_h},$$

where σ is the logistic activation function, G = 13 the number of grid cells per dimension, x, y the center location and w, h the width and height of the bounding box relative to the input image size. The anchor boxes are a set of B = 5 predefined bounding boxes of a certain height and width. These boxes are defined to capture the aspect ratio of specific object classes. The network does not directly predict bounding boxes, but rather predicts the probabilities and refinements that correspond to a set of anchor boxes. The object confidence p_0 for the bounding box is the probability that the bounding box contains an object, whereas p_1, p_2, …, p_C are the probabilities that the object belongs to a certain class.

Most of the 845 predicted bounding boxes will have a very low object confidence, so we only keep the boxes whose object confidence is larger than a certain object threshold (Fig. 1c). There may still be boxes remaining that largely overlap with each other but belong to the same object (Fig. S3a). To remove these duplicate bounding boxes a non-maximum suppression (NMS) algorithm is applied. The algorithm selects the bounding box with the highest confidence and removes any bounding boxes with an intersection over union (Fig. S3b) larger than a given NMS threshold. Finally, the class of the bounding box is assigned according to the highest class probability.
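To make the decoding and NMS procedure concrete, the following is a minimal NumPy sketch. The anchor box sizes and the two thresholds are illustrative placeholders, not the values used for YOLOTrack.

```python
# Minimal NumPy sketch of the output decoding and NMS described above.
import numpy as np

G, B, C = 13, 5, 2
anchors = np.array([[0.05, 0.05], [0.10, 0.10], [0.15, 0.15],
                    [0.20, 0.10], [0.10, 0.20]])  # relative (b_w, b_h), hypothetical

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode(output, obj_threshold=0.5):
    """output: (G, G, B * (4 + 1 + C)) tensor -> list of (x, y, w, h, p0, class)."""
    output = output.reshape(G, G, B, 4 + 1 + C)
    boxes = []
    for cy in range(G):
        for cx in range(G):
            for b in range(B):
                t_x, t_y, t_w, t_h = output[cy, cx, b, :4]
                p0 = sigmoid(output[cy, cx, b, 4])
                if p0 < obj_threshold:           # discard low-confidence boxes
                    continue
                x = (sigmoid(t_x) + cx) / G      # center relative to image size
                y = (sigmoid(t_y) + cy) / G
                w = anchors[b, 0] * np.exp(t_w)  # size relative to image size
                h = anchors[b, 1] * np.exp(t_h)
                cls = int(np.argmax(output[cy, cx, b, 5:]))
                boxes.append((x, y, w, h, p0, cls))
    return boxes

def iou(a, b):
    """Intersection over union of two (x, y, w, h, ...) boxes."""
    xa = max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ya = max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    xb = min(a[0] + a[2] / 2, b[0] + b[2] / 2)
    yb = min(a[1] + a[3] / 2, b[1] + b[3] / 2)
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def nms(boxes, nms_threshold=0.3):
    """Keep the highest-confidence boxes and drop strongly overlapping ones."""
    boxes = sorted(boxes, key=lambda box: box[4], reverse=True)
    keep = []
    for box in boxes:
        if all(iou(box, k) < nms_threshold for k in keep):
            keep.append(box)
    return keep
```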

Neural Network Training
The network is trained using a sum-squared error between the predictions and the ground truth to calculate the loss (Section 2.1). We trained the network with synthetic images (Section 2.2) with a size of 416 × 416 pixels. While we trained the network with grayscale images, it is also possible to train it with color images. For each image the ground truth for the bounding boxes and the classes is stored in an associated XML file in the Pascal VOC format. A dataset comprises a set of training images, a set of validation images and a set of test images. Typically, the size of the validation set is 20 % of the size of the training set. We trained the network without pre-trained weights for 10 epochs with a batch size of 8. For details on the implementation in Python/Keras see Code 1 and Section 4.
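As an illustration of the annotation format, the following is a minimal sketch of reading one Pascal VOC XML file; the file name and class name in the usage example are hypothetical.

```python
# Minimal sketch of reading one Pascal VOC annotation file as used for the
# ground truth here.
import xml.etree.ElementTree as ET

def parse_voc(path):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) for one image."""
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.iter('object'):
        name = obj.find('name').text
        bndbox = obj.find('bndbox')
        coords = [int(float(bndbox.find(tag).text))
                  for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((name, *coords))
    return objects

# e.g. parse_voc('train/image_00001.xml') -> [('spot', 120, 88, 136, 104), ...]
```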

Loss Function
YOLOv2 uses the following sum-squared error for the loss function [2]:

$$\begin{aligned} \mathcal{L} = \; & \lambda_\mathrm{loc} \sum_{i=0}^{G^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ + \; & \lambda_\mathrm{loc} \sum_{i=0}^{G^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\ + \; & \sum_{i=0}^{G^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_\mathrm{noobj} \sum_{i=0}^{G^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\ + \; & \sum_{i=0}^{G^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 \end{aligned}$$

The first two terms represent the localization loss, terms 3 and 4 the confidence loss, and the last term the classification loss.

General Remarks
(a) The loss function penalizes classification errors only if an object is present in that grid cell.
(b) Since there are B = 5 bounding boxes for each cell, we need to choose one of them for the loss. This will be the box with the highest IOU with the ground truth box; the loss thus only penalizes localization errors if that box is responsible for the ground truth box.
(c) The sum-squared error weights localization errors equally with classification errors.
1st Term

SSE between the predicted box location (x, y) and the ground truth location (x̂, ŷ). We sum over all 13 × 13 grid cells (G = 13) and for each grid cell we sum over all 5 boxes (B = 5).
To comply with (a) and (b) a binary variable 1_ij^obj is used, so that 1_ij^obj = 1 if box j in grid cell i contains an object and box j is responsible for detecting that object, and 0 otherwise. (A box is responsible for detecting an object if it has the highest IOU with the ground truth box among the B boxes.)
As mentioned in (c), the SSE weights localization errors equally with classification errors. To give the localization error a higher weight in the loss function, the parameter λ_loc = 5 is used.
2nd Term

Similar to the 1st term, but calculates the SSE in the box dimensions. To reflect that small deviations in large boxes matter less than in small boxes, YOLO uses the square root of w and h. Otherwise the SSE would weight errors in large boxes and small boxes equally.

3rd Term

SSE between the predicted object confidence and the ground truth confidence for boxes that are responsible for an object, masked with 1_ij^obj.

4th Term
If there is no object in the grid cell we only need to take care of the confidence (the confidence should be zero when there is no object). The variable 1_ij^noobj = 1 if box j in grid cell i contains no object or box j is not responsible for detecting that object, and 0 otherwise.
Since many grid cells do not contain any object, this pushes the confidence scores of those cells quickly towards zero, which can cause the training to diverge early. To attenuate the loss contribution from confidence predictions of boxes that do not contain objects, the parameter λ_noobj = 0.5 is used.

5th Term
Here, the errors of all class probabilities are summed over the 13 × 13 grid cells that contain an object.
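To summarize the five terms, the following is a simplified per-image NumPy sketch, not the actual Keras implementation of Code 1. It assumes predictions and ground truth are already arranged as (G, G, B, 5 + C) tensors with non-negative box dimensions, and, for simplicity, applies the classification term per box rather than per grid cell.

```python
# Simplified sketch of the five YOLO loss terms (NumPy).
import numpy as np

def yolo_loss(pred, truth, obj_mask, lambda_loc=5.0, lambda_noobj=0.5):
    """pred, truth: (G, G, B, 5 + C); obj_mask: (G, G, B), 1 for the box
    responsible for each object, 0 otherwise."""
    noobj_mask = 1.0 - obj_mask
    # 1st term: localization loss on the box centers
    loc_xy = np.sum(obj_mask * ((pred[..., 0] - truth[..., 0]) ** 2 +
                                (pred[..., 1] - truth[..., 1]) ** 2))
    # 2nd term: localization loss on the box dimensions (square roots)
    loc_wh = np.sum(obj_mask * ((np.sqrt(pred[..., 2]) - np.sqrt(truth[..., 2])) ** 2 +
                                (np.sqrt(pred[..., 3]) - np.sqrt(truth[..., 3])) ** 2))
    # 3rd and 4th terms: confidence loss with and without objects
    conf = (pred[..., 4] - truth[..., 4]) ** 2
    conf_obj = np.sum(obj_mask * conf)
    conf_noobj = np.sum(noobj_mask * conf)
    # 5th term: classification loss where an object is present
    cls = np.sum(obj_mask[..., None] * (pred[..., 5:] - truth[..., 5:]) ** 2)
    return (lambda_loc * (loc_xy + loc_wh) + conf_obj +
            lambda_noobj * conf_noobj + cls)
```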

Synthetic Image Generation
Here we list the generation functions used to simulate the darkfield images of nano- and microparticles, Janus-type particles, as well as rod-like and elliptical microparticles. Note that all images are assumed to be in focus.

Spots
Function used to generate a Gaussian intensity distribution:

$$I(x, y) = I_0 \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right) \tag{1}$$

Figure S4: Example of an intensity distribution generated with Eqn. (1).
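A minimal NumPy sketch of such a spot generator, following the Gaussian form of Eqn. (1); all parameter values are illustrative.

```python
# Gaussian spot following Eqn. (1); parameter values are illustrative.
import numpy as np

def gaussian_spot(size=416, x0=208.0, y0=208.0, i0=1.0, sigma=3.0):
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return i0 * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
```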

Ring-Shaped
Function used to generate a ring-shaped intensity distribution:

$$I(x, y) = I_0 \exp\left(-\frac{\left(\sqrt{(x - x_0)^2 + (y - y_0)^2} - r_0\right)^2}{2\sigma^2}\right) \tag{2}$$

where r_0 is the ring radius. Figure S5: Example of a ring-shaped intensity distribution generated with Eqn. (2).

Janus-Type
Function used to generate a Janus-type intensity distribution:

$$I(x, y) = I_0 \exp\left(-\frac{\left(\sqrt{x^2 + y^2} - r_0\right)^2}{2\sigma^2}\right) \frac{1}{2}\left[1 + \cos(\alpha - \phi)\right] \tag{3}$$

where ϕ is the orientation of the particle. Here, α is given as:

$$\alpha = \operatorname{arctan2}(y, x)$$

where x and y are coordinates in the reference frame of the particle center:

$$x = \tilde{x} - x_0, \qquad y = \tilde{y} - y_0$$

Figure S6: Example of a Janus-type intensity distribution generated with Eqn. (3).

Ellipse
Function used to generate an elliptical, two-dimensional Gaussian intensity distribution:

$$I(x, y) = I_0 \exp\left(-\left[a (x - x_0)^2 + 2b (x - x_0)(y - y_0) + c (y - y_0)^2\right]\right) \tag{4}$$

where a, b and c are given in terms of the orientation angle θ:

$$a = \frac{\cos^2\theta}{2\sigma_x^2} + \frac{\sin^2\theta}{2\sigma_y^2}, \qquad b = -\frac{\sin 2\theta}{4\sigma_x^2} + \frac{\sin 2\theta}{4\sigma_y^2}, \qquad c = \frac{\sin^2\theta}{2\sigma_x^2} + \frac{\cos^2\theta}{2\sigma_y^2}$$

Figure S7: Example of an elliptical intensity distribution generated with Eqn. (4).
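A minimal NumPy sketch of this generator, using the standard a, b, c parameterization of a rotated two-dimensional Gaussian given above; parameter values are illustrative.

```python
# Elliptical Gaussian following Eqn. (4); parameter values are illustrative.
import numpy as np

def elliptical_gaussian(size=416, x0=208.0, y0=208.0, i0=1.0,
                        sigma_x=8.0, sigma_y=3.0, theta=0.5):
    y, x = np.mgrid[0:size, 0:size].astype(float)
    a = np.cos(theta) ** 2 / (2 * sigma_x ** 2) + np.sin(theta) ** 2 / (2 * sigma_y ** 2)
    b = -np.sin(2 * theta) / (4 * sigma_x ** 2) + np.sin(2 * theta) / (4 * sigma_y ** 2)
    c = np.sin(theta) ** 2 / (2 * sigma_x ** 2) + np.cos(theta) ** 2 / (2 * sigma_y ** 2)
    dx, dy = x - x0, y - y0
    return i0 * np.exp(-(a * dx ** 2 + 2 * b * dx * dy + c * dy ** 2))
```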

Rod-Like
Function used to generate a rod-like intensity distribution:

$$I(x, y) = I_0 \left(\Pi_{\ell, w} * g\right)(x, y)$$

Here, ∗ denotes the convolution of a two-dimensional rectangular function Π with a Gaussian function g, where ℓ, w are the length and width of the rectangle and σ² the variance of the Gaussian function.
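A minimal sketch of such a rod-like generator, approximating the convolution with scipy.ndimage; the rotation step and all parameter values are illustrative assumptions (note that gaussian_filter convolves with a normalized Gaussian).

```python
# Rod-like distribution: rectangle convolved with a Gaussian (sketch).
import numpy as np
from scipy import ndimage

def rod(size=416, x0=208, y0=208, i0=1.0, length=20, width=5,
        sigma=2.0, theta=0.0):
    img = np.zeros((size, size))
    # Rectangular function Pi with length l and width w
    img[y0 - width // 2: y0 + width // 2 + 1,
        x0 - length // 2: x0 + length // 2 + 1] = i0
    # Optional rotation to the orientation angle theta (illustrative)
    img = ndimage.rotate(img, np.degrees(theta), reshape=False, order=1)
    # Convolution with a (normalized) Gaussian g of variance sigma^2
    return ndimage.gaussian_filter(img, sigma)
```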

Datasets
This section gives an overview of the investigated synthetic datasets and the parameters used for their generation.
Here, [a . . . b] denotes a uniform distribution of random samples between a and b. The SNR of an image is defined as the ratio of the average signal value to the standard deviation of the signal.
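As an illustration of how a target SNR can be imposed on a synthetic image under this definition, the following is a minimal sketch assuming additive Gaussian noise; the noise model is an assumption, not necessarily the one used for the datasets.

```python
# Impose a target SNR on a synthetic image (sketch, additive Gaussian noise).
import numpy as np

def add_noise(image, snr):
    signal = image[image > 0].mean()                   # average signal value
    noise = np.random.normal(0.0, signal / snr, image.shape)
    return np.clip(image + noise, 0.0, None)           # keep intensities >= 0
```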

Dataset 1
Dataset 1 comprises 25000 images for training and 5000 images for validation. The SNR of the images is randomly distributed with SNR = [1 . . . 30].

Dataset 2
Dataset 2 comprises 25000 images for training and 5000 images for validation. The SNR of the images is randomly distributed with SNR = [1 . . . 30].

Codes
The network is trained and evaluated in Python/Keras using the TensorFlow backend [3][4][5]. For real-time inference the model graph is exported as a protocol buffer file (*.pb), and the parameters required to decode the YOLOTrack 1.0 output are exported to an INI file (*.ini). These files are imported by the C-based dynamic link libraries TF.dll and YOLOTrack10.dll, which are easily integrated into other programming languages such as LabVIEW and C++. Fig. S14 shows the structure of the software framework. The TF.dll is a general TensorFlow library for model inference built on top of the TensorFlow C API (tensorflow.dll). It can be used with any TensorFlow model and is not specific to YOLOTrack. The YOLOTrack10.dll adds the specific functions required to decode the YOLOTrack 1.0 output and does not depend on the TensorFlow C API.
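For reference, the exported graph can also be loaded directly from Python. Below is a minimal TensorFlow 1.x-style sketch; the file name ('yolotrack10.pb') and the tensor names ('input:0', 'output:0') are hypothetical and must match the actual export.

```python
# Load an exported protocol buffer graph and run inference (sketch).
import numpy as np
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('yolotrack10.pb', 'rb') as f:  # hypothetical file name
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name='')

frame = np.zeros((1, 416, 416, 3), dtype=np.float32)  # placeholder input image
with tf.compat.v1.Session(graph=graph) as sess:
    # Tensor names are hypothetical and depend on the exported model.
    output = sess.run('output:0', feed_dict={'input:0': frame})
```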

Code 1
Code 1 contains the YOLOTrack 1.0 implementation. The code repository is structured in the following directories:

Code 2
Code 2 contains an extended version of YOLOTrack, called YOLOTrack 1.1, that allows for the detection of oriented bounding boxes (see Section 9 for details). The repository is largely similar to Code 1 and structured as follows:

Offset Correction
The trained network can have a sub-pixel localization offset. However, once trained, the offset vector is constant and can be corrected by subtraction (Fig. S15). The offset vector is independent of the SNR but depends on the object class. In general, a correction is only required when seeking sub-pixel resolution. To correct the offset for a specific class, the error/offset from the ground truth is sampled with 1000 synthetic images containing only one object of that class at randomized positions. The offset vector is then calculated from the mean of the offset distribution. The offset correction has to be done for every model and individually for all object classes; a minimal calibration sketch is given after the setup description below.

Experimental Setup

The sample is illuminated with an oil-immersion darkfield condenser (Olympus, U-DCW, NA 1.2 - 1.4) and a white-light LED (Thorlabs, SOLIS-3C). The scattered light is imaged by the objective and a tube lens (250 mm) onto an EMCCD (electron-multiplying charge-coupled device) camera (Andor, iXon DV885LC). The variable numerical aperture of the objective was set to a value below the minimum aperture of the darkfield condenser. The dichroic beam splitter (D) was selected to reflect the laser wavelength (Omega Optical, 560DRLP), and a notch filter (F) is used to block any remaining back reflections from the laser (Thorlabs, NF533-17). The acousto-optic deflector (AOD), as well as the piezo stage, are driven by an AD/DA (analog-digital/digital-analog) converter (Jäger Messtechnik, ADwin-Gold II). A LabVIEW program running on a desktop PC (Intel Core i7 2600, 4 × 3.40 GHz CPU) is used to record and process the images as well as to control the AOD feedback via the AD/DA converter. For the fastest possible image processing, the LabVIEW program interfaces a GeForce GTX 1660 Ti GPU to run the model inference at a frame rate of up to 100 fps.
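As referenced above, the offset calibration can be sketched as follows; make_image() and detect() are hypothetical placeholders for the synthetic image generator (returning the ground truth position) and the network inference for one object class.

```python
# Sketch of the offset calibration described in the Offset Correction section.
import numpy as np

def calibrate_offset(make_image, detect, n=1000):
    offsets = []
    for _ in range(n):
        image, (x_true, y_true) = make_image()   # one object, random position
        x_pred, y_pred = detect(image)
        offsets.append((x_pred - x_true, y_pred - y_true))
    return np.mean(offsets, axis=0)              # constant offset vector

# At runtime the offset is subtracted from every prediction of that class:
# x, y = detect(frame); x, y = x - offset[0], y - offset[1]
```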

Sample Preparation
The sample consists of two glass cover slips (22 mm × 22 mm) confining a thin liquid film. First, the cover slips were thoroughly cleaned by rinsing successively with acetone, isopropanol and Milli-Q water and dried with a nitrogen gun. To prevent sticking of the microparticles, the glass surfaces were passivated with Pluronic F-127 (Sigma-Aldrich). To attain the adsorption of Pluronic F-127 in a brush-like configuration, the cleaned, hydrophilic cover slips were first rendered hydrophobic with a thin layer of polystyrene (PSS-Polymer, M_w = 88 kDa, PDI = 1.66). To this end, 30 µl of 2 % polystyrene in toluene was spin-coated at 8,000 rpm onto the cover slips, resulting in a polystyrene layer thickness of about 100 nm. Subsequently, the cover slips were immersed in 1 % Pluronic F-127 solution for 10 min. Thereafter, the cover slips were briefly dipped in Milli-Q water and dried with a nitrogen gun. Subsequently, the edges of one cover slip were covered with a thin layer of PDMS (polydimethylsiloxane) for sealing. The particle solution used for the experiments was prepared by mixing 2.2 µm diameter gold-coated melamine formaldehyde (MF) particles (MicroParticles) and 0.5 µm diameter polystyrene particles (Polyscience) in 0.1 % Pluronic F-127 solution. The surface of the MF particles is uniformly coated with gold nanoparticles of about 10 nm diameter with a surface coverage of about 10 %. To passivate the surface of the particles, they were washed in a 1 % Pluronic F-127 solution before mixing. Finally, 0.5 µl of the mixed particle suspension was pipetted into the middle of one of the cover slips and the other was placed on top. Depending on the area covered by the liquid, typically about 18 mm × 18 mm, the resulting liquid film height is about 3 µm.

Orientation Detection (YOLOTrack 1.1)
To allow for orientation detection the network needs to be modified. We normalized the 360° range of possible orientations to a range from 0 to 1 and adapted YOLOTrack 1.0 to directly predict the orientation of the bounding box via a single regression number ϕ (Fig. S20). In the loss function, the mean squared error between the ground truth φ̂_i and the predicted angle ϕ_i was added:

$$\sum_{i=0}^{G^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( \phi_i - \hat{\phi}_i \right)^2$$

To bound the output between 0 and 1, the logistic activation function σ was used. The annotation XML file format was adapted by adding an <orientation></orientation> tag, which also required adapting the annotation parsing and the batch generation. The implementation details can be found in Code 2.
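A minimal sketch of the added orientation decoding and loss term, following the masked mean squared error given above; variable names are illustrative.

```python
# Orientation output decoding and added loss term (sketch).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_orientation(t_phi):
    """Map the raw regression output to an angle in degrees."""
    return 360.0 * sigmoid(t_phi)            # normalized [0, 1] -> [0, 360)

def orientation_loss(phi_pred, phi_true, obj_mask):
    """MSE between predicted and ground-truth normalized angles, masked so
    that only boxes responsible for an object contribute."""
    return np.sum(obj_mask * (phi_pred - phi_true) ** 2)
```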

Video Files
Video 1: Single particle driven between two target positions.
Video 2: Single particle confined at a target position.
Video 3: Control of 6 active particles in a hexagon with a background of passive particles.
Video 4: Control of 6 active particles at low SNR in a grid pattern.
Video 5: Control of 9 active particles at low SNR in a grid pattern whose size is transformed.
Video 6: Control of 9 active particles at low SNR that are transformed from a grid to a circular pattern.