Quantifying yeast colony morphologies with feature engineering from time-lapse photography

Baker’s yeast (Saccharomyces cerevisiae) is a model organism for studying the morphology that emerges at the scale of multi-cell colonies. To look at how morphology develops, we collect a dataset of time-lapse photographs of the growth of different strains of S. cerevisiae. We discuss the general statistical challenges that arise when using time-lapse photographs to extract time-dependent features. In particular, we show how texture-based feature engineering and representative clustering can be successfully applied to categorize the development of yeast colony morphology using our dataset. The Local binary pattern (LBP) from image processing is used to score the surface texture of colonies. This texture score develops along a smooth trajectory during growth. The path taken depends on how the morphology emerges. A hierarchical clustering of the colonies is performed according to their texture development trajectories. The clustering method is designed for practical interpretability; it obtains the best representative colony image for any hierarchical cluster.


Background & Summary
Saccharomyces cerevisiae, or Baker's yeast, is a model organism that can grow a diverse range of strain-dependent morphologies at the scale of cell colonies [1][2][3] .Manifesting as macroscopic tubes, cracks, and ridges that mark the colony surface, these morphologies influence how the organism interacts within a biological environment 4,5 .Beer brewers and bakers have historically relied on visual identification of morphology to make practical distinctions among encountered strains of S. cerevisiae 6,7 .
Easy access to digital cameras and video recording allows for the eyes of a well-trained expert to be replaced with computer algorithms well-trained on large collections of image data.The data can be brought to bear on scientific questions of which yeast morphology identification is only one emblematic example.The challenge of this broad paradigm shift toward data-driven science is how to leverage statistical methods and algorithms to translate the abundant big data into useful information 8 .This is especially pertinent when navigating the high-dimensional pixel spaces associated with digital photographs and video.
The dataset introduced by this article involves large ensembles of S. cerevisiae growth experiments photographed hourly over days.We provide a suite of statistical methods to translate these time-lapse images into actionable information for understanding colony morphology.Although our analysis is illustratively applied to the specific question of yeast colony morphology, our approach can be understood more broadly as a strategy for quantitatively extracting desired features from time series of photographs.Toward this end, we emphasize in our Methods section the general challenges faced in the analysis of time-lapse image data.We review alternative approaches using the literature on colony morphology as a guide, and explain the general benefits of the perspective introduced by this paper.All of our statistical methods are implemented with generic Python packages prioritizing easy adaptation to any time-lapse image datasets.
Even in scientific settings, past categorization methods of yeast colony morphology have relied on qualitative scoring in which a single investigator identified colony categories by eye [9][10][11] .Additional tools such as ImageJ 12 and CellProfiler 13 have been used to identify colonies based on shape, size, and color.Another approach was categorization based on principal component analysis at fixed time points 14 .For time-series of images, size trajectories have been studied for characterizing morphology 15,16 .In this article, we demonstrate the unsupervised categorization of morphology based on complete texture trajectories as shown in Figure 1.Our contribution can be understood in two ways.First, by introducing the texture analysis tools from image processing we demonstrate feature engineering appropriate for the practical quantification of morphology.Second, our approach enables the study of time series of images because of the smooth trajectories of the images once projected into our engineered feature space.From this perspective, this article is a tutorial on a general framework enabling the principled categorization of datasets involving time-lapse image data.
In particular, we introduce a novel statistical framework combining texture analysis tools from image processing with clustering algorithms from machine learning.All of the necessary tools for the analysis pipeline of the framework are readily available in Python (refer to Code Availability).In addition, a supplementary open-source Python package has been introduced by the authors for the purpose of improving upon the available clustering algorithms 17 .The supplemental provides a Python implementation of a hierarchical clustering algorithm that finds a prototypical representative for each data cluster it obtains 18 .Prototypical representatives of clusters are practical aides especially relevant for large image-based datasets.This article will use its photographic dataset of S. cerevisiae colonies to demonstrate that this novel framework combining texture analysis and machine learning provides a simple, interpretable, and effective approach for the categorization of colony morphologies.

Methods
This section describes (i) the experimental design for image data acquisition, (ii) image pre-processing, (iii) texture-based feature engineering, and (iv) representative clustering.These steps are part of the characterization pipeline as shown in Figure 1.All code used in these methods are available in standard open-source Python libraries or are provided by the authors as fully-documented Python packages downloadable from the Python Package Index (see Code Availability).

Yeast strains and genetic manipulation
Unless noted, standard media and methods were used for the growth and manipulation of yeast 19 The S. cerevisiae strains used in this study are listed in Table 1.Strains with a YPG prefix, which are included in the image dataset, are the haploid progeny of two yeast strains.The first strain, YO502, was derived from an Ethiopia white tecc brewing strain 20 .The second strain, YO1817, is isogenic to the YO486 Japanese sake brewing strain 20 , except that it is monosomic for chromosome I and disomic for chromosome XII.YO502 and YO1817 were mated, the heterozygous diploid was sporulated, and haploid recombinant progeny were isolated by manual tetrad dissection.

Image data acquisition
Colonies of S. cerevisiae were generated by arraying approximately 48 individual cells of the same strain onto an agar plate in a checkerboard pattern using a Sony SH800 fluorescent cell sorter with 488 nm and 531 nm excitation lasers.Photographs were acquired while these colonies grew at 30 • C in rich (YPD) medium (1% yeast extract, 2% peptone, 1.5% glucose).Colony images were acquired at median rate of one image every 23 minutes over the course of 3 days using a Canon 5d Mark II camera outfitted with a MP-E 65 mm 1-5x macro lens.This camera was attached to a custom built motorized 2-axis gantry that moved the camera over a set of up to 13 plates.For this particular study, 5500 colonies were selected as a representative ensemble of experimental data.These colonies represent 196 unique strains of the species.More information about this dataset is discussed in the Data Records section.Access to a larger library of data from which this dataset is sampled can also be made available upon request.Contact the corresponding author(s) for more information.

Image pre-processing
To be studied further, colony images must be extracted from the image backgrounds.Backgrounds included agar plate edges or segments of other colonies in the grid.Image selection was accomplished with an image processing pipeline using algorithms in the scikit-image library for Python 21 .
The pre-processing pipeline involved three main steps.(i) Canny edge detection (skimage.feature.canny)determined the edges and boundaries of any objects in the image.This included the colony boundary.(ii) The circular Hough transform (skimage.transform.hough_circle_peaks)found the maximal radius of the largest (approximately) circular object in the Canny edge detection image.This was assumed to encompass the colony boundary.Anything outside this circle was masked.(iii.)A convex hull (skimage.morphology.convex_hull_image)was drawn around the unmasked region of the original image to capture the true (non-circular) colony boundary in detail.Parameters in these functions were tuned by hand to the data.Finally, the high-resolution images were re-sized to one-half of the approximate average pixel dimensions, 100 × 100.Images were padded before re-sizing to preserve colony shapes.

Feature engineering using local binary patterns
The dominant visible morphology of a growing yeast colony is the unique and complicated pattern of folds and lumps that evolve primarily on the colony surface.This texture is an ensemble of 2-dimensional coherent spatiotemporal structures.Dimensional reduction of this kind of complex texture space is important for discovering the appropriate descriptors to distinguish the available morphologies.
A de facto standard for dimensionality reduction is the Principal Component Analysis (PCA) 22 .Given an image dataset, PCA finds the minimal set of basis images that inform the majority of the variability in the dataset.If a PCA algorithm was applied to the yeast image dataset, one might expect a set of 2-dimensional static images describing the full variety of colony image textures.Unfortunately, a well-known issue with PCA is that even the simplest dynamic structure requires a large set of static basis images to accurately describe its motion.For example, a movie of a bouncing ball travelling across a fixed field of view would need an entire flip-book of static images to show the traversal.When applied to growing yeast colonies, PCA is being asked to use static basis images to simultaneously describe the patterns of folds and lumps on a colony surface as well as the outward-bound radial traversal of those patterns.The generic growth information is dominant and will win out at the expense of the more subtle distinctions between texture patterns.This discussion is intended to communicate why the naive application of standard dimensional reduction via the PCA in the image space is not robust to growth; distinctions among morphologies are lost.
Two straightforward options remain.Either (1) remove time altogether, or (2) engineer better features for texture that are robust to growth.As to the former option, Reference 14 showed how yeast colonies could be clustered according to their two-dimensional image at the final recorded time.A disadvantage of using a fixed time is this ignores a lot of the available information in our dataset.Hence, the analysis in this article follows the latter option.We engineer accessible and appropriate features for yeast texture via a standard image processing tool called the local binary pattern (LBP) 23 .We will show that the LBP is robust to growth.A Python implementation of the LBP algorithm is available in scikit-image as skimage.feature.local_binary_patternallowing for ready accommodation to our data processing pipeline.
The LBP was designed to capture the essential notions of texture in 2-dimensional images.For example, texture should not depend on the lighting or orientation at which the original image was viewed-therefore, the LBP is scale and rotationally invariant by its design.The LBP algorithm uses ten pattern categories to label each pixel in an image.The pattern category for a pixel is chosen by comparing that pixel to eight of its surrounding neighbors (Figure 2).We can apply the LBP to every pixel in an image from our colony image dataset.Next, we can record the relative frequency of each LBP-pattern category amongst all of the image's pixels.This produces a vector of 10 numbers which we can associate to that image.In this article, we refer to this engineered feature space of vectors describing our colony images as the LBP space.Importantly, the LBP space has no spatial information.The spatial information was integrated out by computing the relative frequencies of the LBP-pattern categories for each image.This is the critical step for making our LBP feature robust to growth.We have replaced the complicated image of texture-many folds and lump evolving together on a 2-dimensional surface-with a simple trajectory that changes smoothly in the 10 coordinates that define the LBP-space.
Because LBP-space trajectories are robust to growth, dimensionality reduction can be applied in this feature space.A PCA algorithm was used to determine the best set of static basis modes for describing the full variety of our dataset in this 10-dimensional LBP-space.Each PCA mode returned by the algorithm was a vector of 10 numbers.Taken together, these LBP PCA modes provided an optimal set of coordinates to describe the LBP-space trajectories.We found that 3 LBP PCA modes were sufficient to characterize a majority of the variation of the LBP-space trajectories.Hence, we ultimately described our photographed morphology growth using this reduced 3-dimensional feature space.
We now make a few practical comments regarding the robustness advantages of this feature space.First, in the provided dataset our photographs of yeast colonies were taken at inconsistent times.The lack of uniform measurement time is due to the sporadic traversal of the camera during the experiment.Favorably, the smooth trajectories in the LBP space allowed for us to use interpolation so every colony shared the same measurement times.This interpolation could not have been done in the full image space.In our study, we also simplify by setting a universal final time which was fixed to be the median ending time of our dataset.The median is robust to short-time outliers involving failed-growth experiments.If an experiment ended earlier than our chosen final time, growth was assumed to have saturated and the final data point was extrapolated.

Representative clustering
We categorized colony texture within the 3-dimensional feature space defined by the LBP PCA modes.To do so, model-free clustering was applied to the feature-space trajectories of the colonies.Model-free emphasizes that distances were defined directly between trajectories without fits or other modelling assumptions.To define colony distances, the usual Euclidean distance was used to compute a distance between points in the 3-dimensional feature space at each time.Summing this distance over all times in the trajectory gave us a valid distance metric to use for the yeast colonies.
There are a wide variety of machine learning methods for finding clusters once pairwise distances have been obtained; in this paper, hierarchical clustering 24 was used because of its natural emphasis on differences at various scales.Distinct clusters at the top of the hierarchy can be expected to have obvious morphological differences.Meanwhile, differences amongst clusters making up the lower rungs of the hierarchy are expected to have more subtle contrasts.The standard implementation of agglomerative hierarchical clustering assembles clusters from the bottom up according to some linkage rule for joining pairs of clusters.A variety of Python algorithms for hierarchical clustering are found in e.g.SciPy 25 .
The output of a hierarchical clustering algorithm is a binary tree called a dendrogram.In a dendrogram, the leaves (made up of the initial data points) are nodes placed at height zero.Linked clusters are drawn as new nodes above their constituent pair and are placed at a height proportional to the linkage distance.The linkage distance is the quantity the hierarchical clustering algorithm uses to decide which cluster to join, and the definition can vary between implementations 24 .The assembly proceeds until the final linkage when the binary tree is crowned by one last node.
The dendrogram can be cut at different heights.A cut is a choice of partition of the initial data points according to the sub-trees appearing directly below the cut.For our dataset, we would like for the dendrogram cut height to say something interpretable about the differences between morphology among the partitions.That is, higher up in the tree we expect the differences to be more dramatic.Lower in the tree we expect the differences to be more subtle.In an image-based dataset, we have the opportunity to introduce additional interpretability.For each cluster, we can select an optimal representative image.The representative image should offer a best approximation to the shared yeast morphology approximated by the cluster.Of the many linkage rules in standard practice 24,25 (summarized in Table 2), minimax linkage 18 was the only choice that met the desired criteria of interpretable cuts and natural cluster representatives important for our image-based dataset.
The result of hierarchical clustering using minimax linkage is shown in Figure 3.The upper section, Figure 3(a), shows a dendrogram with a view that has been truncated to show only the last seven linkages computed by the algorithm.We do this to avoid viewing the full dendrogram which must eventually display a node for each point in the dataset.In Figures 3(b)-(e), four prototype images are shown and the growth time is displayed below the image.The four prototypes which display obvious differences in texture are the optimal representatives of the four top-level clusters produced by the cut dendrogram.The cut occurs at a height that has been chosen according to the CH index.The CH index 26 is a standard statistical measure balancing scores for variations within and between clusters.The index is a useful metric for selecting appropriate cut heights.
The advantage of hierarchical clustering is demonstrated by Figure 4.In Figure 4(a), the sub-dendrogram extending below the branch represented by Figure 3(c) is shown with a new (lower) cut height.As such, Figures 4(b)-(e) show prototypes with more subtle contrasts than those evident among the previous Figures 3(b)-(e).Exploring the dendrogram tree obtained from minimax linkage allows for visual insights into the structure of texture-based clusters.Agglomerative hierarchical clustering is able to provide texture information at multiple scales.Additionally, Figures 4(b),(c) show how subtle distinctions can manifest in texture growth at early times-this additional contrast cannot be obtained by comparing only the terminal colony textures.

Technical Validation
In this section, we emphasize two points about the technical quality of the dataset of yeast colony images.First, we explain the reliability of the colony metadata.The metadata refer to the folder and file names in the data record called ImageData.We also make some comments on outlier detection.Second, we discuss why it is reasonable to conclude that growth conditions are approximately the same for all the colonies.Equal growth conditions are an important assumption when we create our basis of feature-space growth trajectories (see the Methods section).
The essential metadata for each image file in the ImageData data record are the strain number, the strain experiment iteration, and the photograph timestamp.For additional information on the manifestation of this information within the file, consult the Data Records section.The most straightforward metadata entry was the timestamp which we extracted directly from the metadata of the digital photograph itself.Next, consider the strain number.All of the single colonies used in this study are replicates.Each agar plate was entirely filled by colonies of the same strain.With this redundancy, we can conclude that the strain number was reliable even if a single strain experiment iteration exhibited an uncharacteristic morphology.To address the reliability of the strain experiment iterations, we must recall some facts of the experiment design.Strain experiment iterations were arranged in a checkerboard pattern on a plate.A digital camera was attached to a custom built motorized 2-axis gantry that allowed the camera to move over the iterations.A computer program managed and tracked the motorized control of the camera.The tracking location and time could be compared with the image timestamp to correctly associate an image with its strain experiment iteration.
Even with these reliable designs in place, let us suppose a colony were mislabelled at some point during its growth.We can expect that the new image with the incorrect label will have poor correlation with the previously photographed image which had correctly been assigned this same label.Indeed, time correlations measurements like this are are often used, e.g. to identify noise in gene expression profiles 32 .In our Methods section, we discuss clustering the photograph data using a texture score that develops along a smooth trajectory during growth.An important benefit of the proposed approach is that the smoothness of the trajectory emerges from the continuity of the colony growth.Mislabelling breaks the continuity.Hence, we naturally detect mislabellings as outliers in the hierarchical clustering results.As a final comment, we note that if the label is incorrectly assigned early in the growth like at the first timestep, there is not much that can be done to detect abnormalities.Care must be taken that the labels have been assigned correctly for the initial pass of the camera.
Next, we address the reliability of assuming that colonies develop under equal growth conditions.Again, recall that the colonies are arranged in a checkerboard pattern.This pattern was designed with appropriate spacings to grant equal nutrient access regardless of the colony position.All colonies were grown under the same conditions: at 30 • C in rich (YPD) medium (1% yeast extract, 2% peptone, 1.5% glucose).The special-purpose camera rig was built within the growth environment.The camera rig was located at a distance from the colonies and run at a sufficiently slow speed to mitigate any undesired heating.

Usage Notes
A tutorial has been written for reproducing the pre-and post-processing results discussed in this report.This tutorial is implemented as an interactive Jupyter notebook 33 and can be obtained on GitHub at https://github.com/pyYeast/

Figure
Figure 1.Figures1(a)-(d)show the result of feature engineering for some colonies in the dataset.Later in Figure3these colonies are shown to be the top level prototypes returned by the hierarchical clustering algorithm.We will illustrate how to read this figure by taking Figure1(a) as an example.In the left block, the colony image and the local binary pattern (LBP) categorization of the colony image are compared at beginning, middle, and end times measured in hours.In the right block, the images have been projected onto the three Principal Component Analysis (PCA) modes used for dimensionality reduction of the images at all measurement times.The corresponding feature space trajectories are plotted.Bold lines correspond to existing data that has been binned.Thin lines are missing values filled in using interpolation of the existing data.Distances between the trajectories shown here are used to define the pairwise distance matrix used for clustering.

1 .
Figure 1.Figures1(a)-(d)show the result of feature engineering for some colonies in the dataset.Later in Figure3these colonies are shown to be the top level prototypes returned by the hierarchical clustering algorithm.We will illustrate how to read this figure by taking Figure1(a) as an example.In the left block, the colony image and the local binary pattern (LBP) categorization of the colony image are compared at beginning, middle, and end times measured in hours.In the right block, the images have been projected onto the three Principal Component Analysis (PCA) modes used for dimensionality reduction of the images at all measurement times.The corresponding feature space trajectories are plotted.Bold lines correspond to existing data that has been binned.Thin lines are missing values filled in using interpolation of the existing data.Distances between the trajectories shown here are used to define the pairwise distance matrix used for clustering.

Figure 2 .
Figure 2. Figure 2(a) shows the radius parameter of the local binary pattern (LBP) is adjusted to select the neighboring pixels for the algorithm.In Figure 2(b), the LBP of one pixel is calculated.Neighboring pixels are judged according to the center pixel and the result is categorized.Figure 2(c) shows the nine LBP categories.In Category 0, all eight neighbors are bright.In Category 1, all neighbors are bright except for a single pixel that is dark, and so on.Category 8 is a dark spot.Any discontiguous blend of light and dark pixels falls within the catch-all Feature 9.

Figure 3 .
Figure 3. Figure 3(a) shows the dendrogram constructed by applying agglomerative hierarchical clustering using minimax linkage from the pairwise distance matrix defined by the feature engineering depicted in Figure 1.Four dendrogram branches have been distinguished by a fixed-height cut of the dendrogram (dotted line).From left to right, these four branches correspond to the cluster prototypes shown in Figures 3(b)-(e).Clustering is based upon the entire time-lapse trajectory.The displayed images are of each colony at the latest available time in its time-lapse: 58 hr., 60 hr., 24 hr., and 58 hr. for Figures 3(b), (c), (d), and (e).

Figure 4 .
Figure 4. Figure 4(a) shows the sub-branch represented by the prototype Figure 3(c) of the dendrogram plotted in Figure 3(a).A new cut height is chosen (represented by a dotted line in Figure 4(a)) and four branches are created.The four prototypes in Figures 4(b)-(e) ordered alphabetically correspond to the four branches read left to right.Numbers next to the colony images report the elapsed time of the experiment (in hours) when that image was taken.Figure 4(d) shows an outlier colony for which no growth data was recorded during the experiment (less than 1 hour had elapsed).Figures 4(b),(c) show colony growth at three times along the time-lapse to demonstrate how the complete texture trajectories are used to define the colony distances.

Table 1 .
S. cerevisiae strains used in this study.

Table 2 .
Comparison of common linkage functions

Table 3 .
An example line of the data record CoordinatesTable.

Table 4 .
An example line of the data record DistancesTable.