Bellybutton: accessible and customizable deep-learning image segmentation

The conversion of raw images into quantifiable data can be a major hurdle and time-sink in experimental research, and typically involves identifying region(s) of interest, a process known as segmentation. Machine learning tools for image segmentation are often specific to a set of tasks, such as tracking cells, or require substantial compute or coding knowledge to train and use. Here we introduce Bellybutton, an easy-to-use (no coding required) image segmentation method that uses a 15-layer convolutional neural network and can be trained on a laptop. The algorithm trains on user-provided segmentation of example images but, as we show, just one or even a sub-selection of one training image can be sufficient in some cases. We detail the machine learning method and give three use cases where Bellybutton correctly segments images despite substantial lighting, shape, size, focus, and/or structure variation across the region(s) of interest. Instructions for easy download and use, along with further details and the datasets used in this paper, are available at pypi.org/project/Bellybuttonseg.


I. INTRODUCTION
Extracting quantitative information from image data is a major step in many fields of research. Prior to the last decade, state-of-the-art algorithms typically focused on highly specific use cases, such as tracking spherical particles [1] or identifying astronomical light sources [2]. These algorithms were typically task specific, aiming to identify predefined features, as opposed to machine learning algorithms, which are more adaptive. In fact, reviews as late as 2015 did not even mention machine learning (ML) [3]. Progress is still being made in this domain today [4]. Since the introduction of AlexNet [5] in 2012, the capacity of ML methods in this arena has advanced at a breathtaking pace, fueled largely by the success of convolutional neural networks (CNNs) [6]. This class of techniques allows a more general approach to quantification of image data, including addressing more nuanced and harder-to-formulate questions, by requiring only correct examples as training data. More specifically, the task of segmenting an image (identifying the pixels that comprise one or more objects or regions of interest) has become a large focus [7], as it allows researchers to rapidly and deeply analyze complex data. While state-of-the-art benchmarks in this domain [8] require enormous computation and are thus out of even a skilled single user's reach, software tools like Keras [9], an Application Program Interface (API) for Python, greatly simplify the process of creating smaller, custom neural network solutions, in principle in just a few lines of code. In practice, however, the process is rarely that simple, and for those unfamiliar with deep neural networks many pieces of it become daunting: optimizing the many user-defined "hyper-parameters" of the algorithm, picking the right network, cleaning the data, and possibly learning a new programming language can each require substantial additional effort.
As a result, a large and recent body of work has focused on methods and software packages for simplifying this process. The majority focus on biological research, specifically the tracking of cells from microscopy data [10-17], but similar works tackle goals ranging from identifying and tracking 2D materials like graphene [18] to segmenting other medical or biological imaging data [19-22], images of flora and fauna [23], scanning electron microscopy images for material science [24,25], astronomical data [26,27], particle physics [28], and more. Typically these works compete for highest accuracy on benchmark data sets [12], or for ease of use in pre-specified domains (very often biological data) [10,11]. While many of these methods are likely applicable to tasks outside of their intended application, e.g. [16], few are explicitly designed for general use.
Here we introduce an easy-to-use segmentation solution aimed at a broad array of research applications, named "Bellybutton." Bellybutton uses a 15-layer convolutional neural network that can be trained on as little as one (or a portion of one) image with user-defined segmentation, and can account for variations in size, lighting, rotation, focus, or shape of desired segmentation regions, as is common in research applications. The algorithm operates on a pixel-by-pixel basis, determining whether each pixel is inside or outside of a segmentation ('innies' or 'outies,' hence the name Bellybutton). The algorithm can analyze input images of varying shape and size, and automatically performs a variety of data augmentation steps, including flipping and rotating images, normalizing brightness across images, and evenly sampling innies and outies. Bellybutton requires no coding knowledge, and can be trained and run on a laptop. We detail its performance and flexibility through several use cases: segmenting bubbles with poor lighting and focus; segmenting semitransparent, tightly packed particles that have intricate birefringence patterns; and tracking a thin, clear lattice of material that fractures over time. Each of these data sets is available online, along with a guide for Bellybutton's use on new data sets.

II. METHOD
Bellybutton operates on a pixel-by-pixel basis, scanning images and using the neighborhood around a given point in an image to determine whether a pixel is inside or outside of a segment, as well as how far it is from that segment's edge. It uses a deep convolutional neural network (CNN), whose structure is shown schematically in Fig. 1A. The CNN consists of 3x3 convolutional layers, 2x2 max pooling layers, and skip connections inspired by ResNet [29], and ends with four dense layers feeding into two outputs: a classification of pixel type (inside or outside a region), and a distance-from-region-edge scalar value, which is used to separate distinct regions in contact. The scalar value is trained to vary from 0 (for all outside pixels) to a maximum value set by the user (typically 10), allowing the system to localize region edges while easily satisfying this output where it is unimportant, for example in the center of a 100-pixel-wide region. The chosen network architecture strikes a balance between being small enough to train rapidly from scratch on a laptop and being large enough to generate valid segmentation on nontrivial problems. CNNs are the standard choice for segmentation problems [6,13,15,19,20,22-27], as they give the network natural access to spatial information. The decreasing layer size is also standard, and gives the network sufficient flexibility to hierarchically analyze spatial patterns without superfluous parameters. The network itself takes multiple size subsets of an image as input, centered around the pixel in question, each down-sampled to 25x25 pixels. This sampling process is performed automatically during training and prediction, and gives the network the ability to analyze multiple length scales while keeping input size minimal. A typical example is shown in Fig. 1A and B using 1, 3, 9, and 27x scales.
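The multi-scale input described above can be illustrated with a minimal NumPy sketch. It is not the Bellybutton implementation: the function name, the edge padding at image borders, and the use of block averaging for down-sampling are all assumptions made for illustration, and a single-channel (grayscale) image is assumed.

```python
import numpy as np

def multiscale_patches(image, row, col, base=25, scales=(1, 3, 9, 27)):
    """Return an array of shape (len(scales), base, base) centered on (row, col).

    Hypothetical helper: each scale s takes a (base*s) x (base*s) window
    around the pixel and reduces it to base x base.
    """
    max_half = (base * max(scales)) // 2
    # Assumption: pad with edge values so windows near the border stay in bounds.
    padded = np.pad(image, max_half, mode="edge")
    r, c = row + max_half, col + max_half
    patches = []
    for s in scales:
        size = base * s
        half = size // 2
        window = padded[r - half : r - half + size, c - half : c - half + size]
        # Assumption: down-sample by averaging non-overlapping s x s blocks.
        patch = window.reshape(base, s, base, s).mean(axis=(1, 3))
        patches.append(patch)
    return np.stack(patches)
```

With the default arguments, each pixel yields four 25x25 views covering 25, 75, 225, and 675 pixels on a side, so the input size stays fixed while the field of view grows.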
For training, a user may provide individually labeled segmentation maps, in which every pixel of a given segment carries the same number, unique to that segment. Alternatively, if no segments are in contact, a user-provided binary mask is sufficient. Each pixel is then given a classification label of 'innie' (inside a segmented region) or 'outie.' Optionally, the user may exclude regions of an image using a binary Area of Interest (AOI) mask, as indicated by the excluded gray area in Fig. 1C. The distance to segment edges is also calculated from this mask, and used to train the scalar output.
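As an illustration of the scalar training target, the sketch below computes a capped distance-to-edge map from a binary mask. It is a simplified stand-in under stated assumptions: the function name is hypothetical, distance is measured from each innie pixel to the nearest outie pixel, and a brute-force search is used that is only practical for small images. Individually labeled maps with touching segments are not handled here.

```python
import numpy as np

def capped_edge_distance(mask, max_dist=10):
    """Distance from each innie pixel to the nearest outie pixel, capped at max_dist.

    Outie pixels get 0, matching the description of the scalar output.
    """
    mask = np.asarray(mask, dtype=bool)
    target = np.zeros(mask.shape, dtype=float)
    innies = np.argwhere(mask)    # (row, col) of inside pixels, row-major order
    outies = np.argwhere(~mask)   # (row, col) of outside pixels
    if len(innies) == 0:
        return target
    if len(outies) == 0:
        # No edge anywhere in the image: every pixel sits at the cap.
        target[mask] = max_dist
        return target
    # Brute-force pairwise distances (assumption: small images only).
    diffs = innies[:, None, :] - outies[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    # Boolean indexing also runs in row-major order, so values line up with innies.
    target[mask] = np.minimum(nearest, max_dist)
    return target
```

Because the values saturate at `max_dist`, pixels deep inside a large region all share the same target, which is why this output is easy to satisfy away from edges.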
To avoid prolonged training, the user may choose to train using a fraction of the available training data. We find that near-optimal results are often reached without using all available pixels (see Fig. 2E). Furthermore, rotated and flipped images are (optionally) used in training to prevent overfitting. Once trained, Bellybutton produces a score from 0 (outside) to 1 (inside a region) for each pixel, shown in Fig. 1D, which is binarized to produce an innie-vs-outie map. Finally, the scalar distance-to-region-edge output, shown in Fig. 1E, is used to watershed the 'innie' pixels into distinct regions to produce a segmented map, as in Fig. 1F. The data used in Fig. 1, aqueous foams in microgravity, come from Ref. [30], which was the first work to utilize Bellybutton.
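The rotation and flip augmentation can be sketched as follows. The text does not specify exactly which rotations and flips are used, so this example assumes the eight symmetries of a square patch (four 90-degree rotations, each with and without a left-right flip); the function name is hypothetical.

```python
import numpy as np

def dihedral_variants(patch):
    """Return the 8 rotated/flipped copies of a square 2D patch."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)          # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))   # mirrored copy of that rotation
    return variants
```

The pixel's label is unchanged by these operations, so each variant can be paired with the original classification and distance targets.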

III. EXAMPLE USES
Bellybutton is effective for a variety of purposes. Here we use the example of segmenting a 3D printed photoelastic material in the shape of a granular packing. This material is illuminated between cross-polarizers such that it develops a birefringence pattern when under mechanical stress. This lighting is useful experimentally, but complicates the tracking process; previous experiments using photoelastic granular disks have required two sets of images, one with regular lighting to track particles, and a second with the birefringence pattern to analyze force [31]. Bellybutton was trained on two fourths of each of three images of this system, under low, medium, and high stress, and tested on the remaining two fourths of each image, shaded purple in Fig. 2A. While remaining roughly the same shape, the particles present a wide variety of patterns as the stress changes. Furthermore, several confounding factors make this segmentation more difficult. A substantial portion of the image (the left and right edges) is out of focus. The camera is close enough to the sample that only particles in the center are imaged head-on, leading to different viewing angles for particles near the edges of the system. Finally, particles near the left and right edges are tilted sufficiently that their sides are exposed to the camera.
The input scales used are shown in Fig. 2B, overlaid on zoomed-in data. Segmentation is successful, with the majority of errors concentrated at the bottom of the left-
For quantitative analysis of these results, we utilize the SEG score from Ref. [12], which compares each true region with the identified region of highest overlap. We find this metric to be the most indicative of performance by eye, although many others are commonly used [7,12]. For each true region R_i, a 'Jaccard index' is calculated with the Bellybutton-generated region B_i of highest overlap, by dividing the area of their intersection by the area of their union. True regions that do not have an intersection of at least one half of their area are given a score of 0. The SEG reported is the average of all such scores for a given dataset, with a perfect score being 1. A detailed explanation of the calculation is given in [32]. Bellybutton was reliably able to exceed a SEG score of 0.9 on the test set for this data.
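The SEG calculation described above can be sketched as below, assuming integer label maps in which 0 marks background. This follows the description in the text rather than the reference implementation of Ref. [12]; the function name is hypothetical, and the threshold is taken literally as "at least one half" of the true region's area.

```python
import numpy as np

def seg_score(true_labels, pred_labels):
    """Mean Jaccard index between each true region and its best-overlapping prediction."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    scores = []
    for t in np.unique(true_labels):
        if t == 0:  # assumption: label 0 is background, not a region
            continue
        true_region = true_labels == t
        ids, counts = np.unique(pred_labels[true_region], return_counts=True)
        keep = ids != 0  # ignore overlap with predicted background
        ids, counts = ids[keep], counts[keep]
        if len(counts) == 0:
            scores.append(0.0)
            continue
        best = counts.argmax()
        intersection = counts[best]
        # Score 0 unless the overlap covers at least half of the true region.
        if intersection < 0.5 * true_region.sum():
            scores.append(0.0)
            continue
        pred_region = pred_labels == ids[best]
        union = np.logical_or(true_region, pred_region).sum()
        scores.append(intersection / union)
    return float(np.mean(scores)) if scores else 0.0
```

Note that the average runs over true regions only, so spurious predicted regions lower the score only when they are the best match for some true region.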
In the highlighted example the entire training set was used, and the network was trained for E = 2 epochs (each training data point was shown to the network twice). For practical use, however, it may not be necessary to use even this much data (half of three images), as shown in Fig. 2E. A sub-sampling option is given as a parameter in the Bellybutton package, named 'fraction.' This value indicates the fraction (0, 1] of available training pixels that the algorithm will use to train the neural network. For values below 1, individual pixels are randomly chosen, but at a rate that ensures that innies and outies are equally represented [33]. We find that accuracy for a variety of problems depends on the quantity EF = T/M being sufficiently high, where E is the number of epochs in training, M is the size of the total training set, F is the fraction of the training set that is used, and T = EFM is the total number of training steps. This dependency is shown by the data collapse in Fig. 2E. As a result, smaller data fractions F can be used to gauge the tractability of a problem. In this example, even tiny fractions of the training data can still yield passable results, as seen by the modest dependence of SEG on data fraction in Fig. 2F; for optimal results, however, a larger fraction of the data must be used, to give the network access to a wider variety of examples. Overall, more data is typically better, but we often find that F ≥ 0.1 gives reasonable results for systems with many repeated particles, like the one shown in Fig. 2. An important caveat is that these training data should be taken from a sufficiently varied set of images, and locations within those images, to encompass the range of the desired data set.
Bellybutton is also useful for structure-finding. In the following example, a lattice of laser-cut acrylic (polymethyl methacrylate, or PMMA) is slowly fractured while lit between cross-polarizers to reveal changes in internal stress. These changes to the material's structure and brightness, shown in Fig. 3A and E, make it very difficult to track algorithmically. Using just three training images with human-generated masks, Bellybutton is capable of tracking the fracturing structure through time, as shown in Fig. 3E and F, despite lighting and focus changes. The package includes options for a binarized output or a distance-to-edge output; the latter is shown here. The distance-to-edge output can be helpful for skeletonizing a structure, and for suppressing noise and error.
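For the 'fraction' sub-sampling used in the photoelastic example, a minimal sketch of class-balanced pixel selection and of the step count T = EFM is given here. How Bellybutton actually enforces the balance is not detailed in the text, so drawing an equal number of innies and outies, sampling with replacement only when a class is too small, and both function names are assumptions. The mask is assumed to contain at least one pixel of each class.

```python
import numpy as np

def sample_training_pixels(mask, fraction, rng=None):
    """Pick roughly fraction * M flat pixel indices, half innies and half outies."""
    rng = np.random.default_rng() if rng is None else rng
    flat = np.asarray(mask, dtype=bool).ravel()
    innies = np.flatnonzero(flat)
    outies = np.flatnonzero(~flat)
    per_class = max(1, int(round(fraction * flat.size / 2)))
    # Assumption: fall back to sampling with replacement if a class is too small.
    pick_in = rng.choice(innies, per_class, replace=per_class > len(innies))
    pick_out = rng.choice(outies, per_class, replace=per_class > len(outies))
    return np.concatenate([pick_in, pick_out])

def training_steps(epochs, fraction, total_pixels):
    """Total number of training steps, T = E * F * M."""
    return epochs * fraction * total_pixels
```

For instance, E = 2 epochs at F = 0.5 gives EF = 1, the same number of training steps as a single pass over the full training set.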

IV. HOW AND WHEN TO USE BELLYBUTTON
We have tried to make Bellybutton as accessible as possible. It is downloadable as a Python package, which can be installed with one command, and using Bellybutton requires no coding. Instructions for use, details on how to customize training and hyper-parameters, and much more can be found at pypi.org/project/Bellybuttonseg. Starting a project is as simple as running a single command, and Bellybutton creates a folder structure for adding images, masks, and areas of interest. Training and testing parameters are adjusted by editing an automatically generated text file. Furthermore, we have provided the data sets used in each figure as example projects that can be downloaded in one command, set up, and run on a laptop. Deploying one of these example projects takes under a minute, plus training time (computer dependent).
While only three examples of Bellybutton's potential uses are shown, its flexibility should make it useful in a wide variety of situations. Regions are not limited to single particles; masks might specify the two connected regions of a dimer, or a disk and a mark on its surface indicating its rotational position as separate regions, allowing them both to be segmented simultaneously. The same approach could be applied to a cell and its nucleus, an insect and its head or feet, or a particle and its previous position, allowing velocity to be approximated from single images. Regions can be used to identify particle classes as well; training on masks that include only particles of a given shape, size, or orientation will prompt Bellybutton to segment only such particles. A broad rule of thumb is that if a region is easily identifiable by eye, it is a good candidate for Bellybutton. This class of image segmentation problems is both frustrating and common in research, and we believe that giving users an easy-to-use but flexible method like Bellybutton will save countless hours in the lab.

FIG. 1. The Bellybutton Method. (A) Architecture of the 15-layer convolutional neural network. Multiple scales of an experimental image, each reduced to 25x25 pixels, are simultaneously taken as a single input. The network consists of two 3x3 convolutional layers followed by a 2x2 max pooling layer. This pattern is repeated twice more, each with skip connections as shown. The final 2x2x96 layer is flattened and fed through four dense layers, producing two output scalars, one signifying the class of the pixel (inside or outside of a region), the other the distance to the nearest region edge. (B) An example experimental image, overlaid with the chosen input scales 1, 3, 9, and 27x. (C) User-defined mask, in this case binary as no segments are in contact. The user may also define an Area of Interest (AOI), which in this example removes the edges of the image (gray) from training. (D) Class probability output after training. The network generates a prediction score on a pixel-by-pixel basis. (E) Distance map to outside of a particle. Values are capped at a user-specified value, in this case 10 pixels, so much of the image appears binary. The zoomed-in region highlights the gray-scale output near the edges of the bubbles. (F) Final segmentation is produced by watershedding the binarized classification probability (D) using the distance map (E). (D) and (E) are also saved if desired.

FIG. 2. 3D Printed Photoelastic Disks. (A) Images of a 3D printed photoelastic material in the shape of a granular packing under three stress states (high, medium, low). Each was divided into four sections, two of which (gray) were used for training, and two (purple) were used for evaluation. A single network was trained using all six training regions, and tested on all six test regions. (B) Zoom in on orange-framed region in (A). Note the variety of lighting patterns on each disk. Teal and blue superimposed squares are the image scales fed into the network for this task. (C) User-generated masks for these zoomed-in regions (which are part of the test set). (D) Final segmentation output for the zoomed-in region. Note that the colors serve to differentiate regions; there is no attempt to match the colors between (C) and (D). (E) SEG score for the test set as a function of Epochs times Training Fraction, EF. Training fraction F is denoted by color, and is the portion of the training data used in training the network, with each data point shown to the network once per epoch. SEG score is an indicator of segmentation quality, and is calculated by dividing the intersection of generated regions and their corresponding true regions by their union, and averaging over all true regions (see text for further explanation). (F) SEG score for all runs with EF ≥ 3 as a function of data fraction F. Note the diminishing returns on this task for high F.

FIG. 3. Tracking a Changing Structure with Bellybutton. (A) Training images of a fracturing lattice. Image contrast and brightness have been enhanced, and the top 2/3 of each image is shown. Note that these are the only training images, but that we have spread them out in time to encompass a wide range of situations. (B) Binary mask for the third training image with superimposed area of interest (gray). (C) Example test image and (D) accompanying Bellybutton-generated distance map output. Orange square denotes location of zoomed regions in (E) and (F). (E) Zoomed-in (enhanced) images with (F) corresponding Bellybutton-generated distance map for many time steps.