Label-free cell cycle analysis for high-throughput imaging flow cytometry

Imaging flow cytometry combines the high-throughput capabilities of conventional flow cytometry with single-cell imaging. Here we demonstrate label-free prediction of DNA content and quantification of the mitotic cell cycle phases by applying supervised machine learning to morphological features extracted from brightfield and the typically ignored darkfield images of cells from an imaging flow cytometer. This method facilitates non-destructive monitoring of cells, avoiding the potentially confounding effects of fluorescent stains while maximizing the available fluorescence channels. The method is effective in cell cycle analysis for mammalian cells, both fixed and live, and accurately assesses the impact of a cell cycle mitotic phase blocking agent. As the same method is effective in predicting the DNA content of fission yeast, it is likely to have broad application to other cell types.


Supplementary Tables
classifying DNA content and mitotic phases, we systematically excluded each of the feature classes from our analysis (the baseline analysis is shown as "none") and examined the effect of doing so on the results. Note that omitting any particular class of morphological features has very little impact on the overall accuracy of the approach, likely because many features are correlated and these phenotypes can be detected using many different features. Interestingly, performing the analysis using only the darkfield features (that is, leaving out all brightfield features, second row from the bottom) substantially reduces the accuracy, whereas using only brightfield features (that is, leaving out all darkfield features, last row) only slightly reduces accuracy. Thus, the brightfield features are much more informative than the darkfield features for these phenotypes. Reshaping the image size does not influence the results of the machine learning algorithm.

Supplementary
We repeated our analysis where we reshaped the images to different sizes. The method is robust to the image size, as long as no parts of the cells are cropped (typical cell size: ~30-35x30-35 pixels).
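As a minimal sketch of one way to change the image size without cropping any part of a cell, the Python function below centers an image on a larger square canvas. It assumes that "reshaping" means changing the canvas around the cell rather than rescaling it. The function name `pad_to_size` and the background value `fill` are hypothetical and are not taken from the supplementary code.

```python
def pad_to_size(image, size, fill=0):
    """Center a 2-D image (list of rows) on a size x size canvas.

    Raises ValueError instead of cropping, because cropping parts of the
    cell is the one case the text above warns against.
    """
    height, width = len(image), len(image[0])
    if height > size or width > size:
        raise ValueError("target size would crop the image")
    top = (size - height) // 2
    left = (size - width) // 2
    # Start from a canvas filled with the (assumed) background value.
    canvas = [[fill] * size for _ in range(size)]
    for r in range(height):
        canvas[top + r][left:left + width] = image[r]
    return canvas
```

For a typical cell of about 30-35 pixels across, calling `pad_to_size(cell_image, 55)` would give a 55x55 image with the cell intact.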

SUPPLEMENTARY NOTE 1 | Protocol for the analysis pipeline
Please note that an updated version of this tutorial is available at: www.cellprofiler.org/imagingflowcytometry

STEP 1: EXTRACT SINGLE CELL IMAGES AND IDENTIFY CELL POPULATIONS OF INTEREST WITH IDEAS SOFTWARE
a. Open the IDEAS analysis tool (we used version 6.0.129), which is provided with the ImageStreamX instrument.
b. Load the .rif file that contains the data from the imaging flow cytometer experiment into IDEAS using File > Open. Note that any compensation between the fluorescence channels can be carried out at this point. The IDEAS analysis tool will generate a .cif data file and a .daf data analysis file (we provide a data file on www.cellprofiler.org/imagingflowcytometry).
c. Perform your analysis within the IDEAS analysis tool following the instructions of the software and identify cells that have each phenotype of interest, using a stain that marks each population. This is known as preparing the "ground truth" (expected result) annotations for the phenotype(s) of interest. In cases when a stain has been used to mark the phenotype(s) of interest in one of the samples, any parameters measured by IDEAS can be used to assign cells to particular classes. In the example data set, the PI (Ch4) images of pH3 (Ch5)  Anaphase, G1, G2, Metaphase, Prophase, S and Telophase). You can find a zip archive of the .tif images of the populations exported from the provided example data set on www.cellprofiler.org/imagingflowcytometry.

MONTAGES OF IMAGES USING MATLAB
To allow visual inspection and to reduce the number of .tif files, we tiled the images for the
f. Adjust the name of the image channels as they were exported from IDEAS in step 1 (in the example we used 'Ch3' (brightfield), 'Ch6' (darkfield) and 'Ch4' (PI stain)).
g. Adjust the size of the images (we used 55x55 pixels for each image; this will depend on the size of the cells imaged and on the magnification).
h. Save the Matlab script by clicking 'Save' in the toolstrip.
i. Run the Matlab script by clicking 'Run' in the toolstrip and check that the montages appear in your designated output folder. The montages of 15x15 images that we created from the example data set are provided on www.cellprofiler.org/imagingflowcytometry.
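The tiling performed in this step can be sketched as follows. This is an illustrative Python version, not the Matlab script referred to above. It assumes that every single-cell image has already been brought to the tile size (55x55 pixels in the example) and that tiles are placed in row-major order, which is an assumption about the layout. The function names `tile_montage` and `montage_batches` are hypothetical.

```python
def tile_montage(images, grid=15, tile=55, fill=0):
    """Tile up to grid*grid images (each tile x tile, as lists of rows)
    into one square montage. Unused grid sites keep the fill value."""
    if len(images) > grid * grid:
        raise ValueError("too many images for one montage")
    side = grid * tile
    montage = [[fill] * side for _ in range(side)]
    for index, image in enumerate(images):
        # Row-major placement: index 0 is top-left, index 1 is to its right.
        row0 = (index // grid) * tile
        col0 = (index % grid) * tile
        for r in range(tile):
            for c in range(tile):
                montage[row0 + r][col0 + c] = image[r][c]
    return montage


def montage_batches(images, grid=15):
    """Split a list of images into batches of at most grid*grid images,
    one batch per montage."""
    per_montage = grid * grid
    return [images[i:i + per_montage] for i in range(0, len(images), per_montage)]
```

With these defaults each montage is 825x825 pixels and holds up to 225 cells, so a population of 1,000 cells would yield five montages, the last one partially filled.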

STEP 3: SEGMENT IMAGES AND EXTRACT FEATURES USING CELLPROFILER
To extract morphological features from the brightfield and darkfield images and to determine the ground truth DNA content we used the imaging software CellProfiler.
b. Load the provided CellProfiler project (Supplementary Code 2), including the pipeline within it, using File > Open Project.
c. Specify the images to be analyzed by dragging and dropping the folder where the image montages that were created in step 2 are located into the white area inside the CellProfiler window that is specified by 'File list'.
d. Click on 'NamesAndTypes' under the 'Input modules' and adjust the names of the image channels as they were exported from IDEAS and specified in step 2 f. Then click on Update.
e. Adjust the pipeline (which was loaded as part of Step b) if needed by adding or adjusting analysis modules (visit www.cellprofiler.org for tutorials on how to use CellProfiler). In the provided CellProfiler pipeline, we defined a grid that is centered at each of the 15x15 single cell images. We extracted features for the darkfield images (granularity, radial distribution, texture, intensity) based on the entire square image containing each cell. In other words, we did not attempt to measure darkfield properties within each individual cell by segmenting each cell, because the darkfield image is recorded at a 90° angle to the brightfield image and thus does not align with it. Further, darkfield does not necessarily depict the physical shape of the cell as is the case for brightfield. Next, we segmented the brightfield images (that is, identified individual cell borders) without using any stains, but by smoothing the images (CellProfiler module 'Smooth' with a Gaussian Filter) followed by edge detection (CellProfiler module 'EnhanceEdges' with Sobel edge-finding) and by applying a threshold (CellProfiler module 'ApplyThreshold' with the MCT thresholding method and binary output). We close the obtained objects (CellProfiler module 'Morph' with the 'close' operation) and use them to identify the cells on the grid sites (CellProfiler module 'IdentifyPrimaryObjects'). To filter out secondary objects (such as debris), which are typically smaller than the cells, on the single cell images we measure the sizes of secondary objects (if there are any) and neglect the smaller objects. Then we extract features for the segmented brightfield images (granularity, radial distribution, texture, intensity, area and shape and Zernike polynomials). In a last step, we extract the intensity of the PI images that we use as ground truth for the DNA content of the cells. The complete CellProfiler pipeline with the parameters used in our analysis can be found in Supplementary Code 2.
f. Specify the output folder by clicking on 'View output settings' and selecting an appropriate 'Default Output Folder'.
g. Extract the features of the images by clicking on 'Analyze Images'. The extracted features from the brightfield and darkfield images as well as the intensity of the PI images in .txt-format are provided on www.cellprofiler.org/imagingflowcytometry.

I. Data preparation

c) Adjust the name of the input data containing the features that was created in step 4.I to be used for regression.
d) Adjust the name of the ground truth data for the phases that was created in step 4.I to be used to train the regression.
e) Save the Matlab function by clicking 'Save' in the toolstrip.
f) Run the Matlab function. In our example we set 'LearnRate' to 0.1 and specified the decision tree that serves as the weak learner by setting 'minleaf' to 5 (in Matlab, the minimum number of observations per leaf). To fix the stopping criterion (the number of weak learners used to fit the data) we performed internal cross-validation (see below). Again, the data are split into a training set (90% of the cells) and a testing set (10% of the cells). The algorithm is then trained on the training set, for which the ground truth cell cycle phases of the cells are provided, before it is used to predict the cell cycle phase of the cells in the test set without being given their ground truth cell cycle phases. To show that the label-free prediction of cell cycle phases is robust, we performed ten-fold cross-validation. The predicted cell cycle phases are provided on www.cellprofiler.org/imagingflowcytometry.
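The 90%/10% split and the ten-fold cross-validation described in step f) can be sketched as below. This is an illustrative Python version, not the Matlab function referred to above. It only produces the index sets for each fold; training and prediction are left to the boosting implementation. The function name `ten_fold_splits` and the fixed random `seed` are assumptions.

```python
import random


def ten_fold_splits(n_cells, n_folds=10, seed=0):
    """Yield (train, test) index lists for n_folds-fold cross-validation.

    Each cell appears in exactly one test set, so with ten folds every
    split trains on about 90% of the cells and tests on about 10%.
    """
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # assumed: cells are shuffled first
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Because every cell is predicted exactly once while it is held out, the predictions from the ten test sets can be pooled to evaluate the method on the whole data set.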

Internal cross-validation to determine the stopping criterion
To prevent overfitting and to fix the stopping criterion for the applied boosting algorithms, we performed a five-fold internal cross-validation. To this end, we split the training set into an internal-training set (80% of the cells in the training set) and an internal-validation set (20% of the cells in the training set). We trained the algorithm on the internal-training set with up to 6,000 decision trees. We then predicted the DNA content/cell cycle phase of the internal-validation set and evaluated the quality of the prediction as a function of the number of decision trees used. The optimal number of decision trees is the one for which the quality of the prediction is best. We repeated this procedure five times and set the stopping criterion for the whole training set to the average of the five values obtained in the internal cross-validation.
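The selection of the stopping criterion can be sketched as follows. This is an illustrative Python outline, not the authors' Matlab code. It assumes that the five internal-validation sets are disjoint folds of the training set and that the averaged tree count is rounded to an integer. The callback `error_curve` is hypothetical: it stands for whatever boosting implementation is used, and is expected to train on the internal-training cells and return the validation error after 1, 2, 3, ... trees.

```python
import random


def best_tree_count(validation_errors):
    """validation_errors[k] is the error with k + 1 trees.
    Returns the number of trees with the lowest error (smallest on ties)."""
    best = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return best + 1


def stopping_criterion(train_indices, error_curve, n_folds=5, seed=0):
    """Average the optimal tree count over n_folds internal splits.

    error_curve(inner_train, validation) is a hypothetical callback that
    returns a list of validation errors, one per number of trees.
    """
    indices = list(train_indices)
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    optima = []
    for k in range(n_folds):
        validation = folds[k]  # 20% of the training set for five folds
        inner_train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        optima.append(best_tree_count(error_curve(inner_train, validation)))
    # Assumed: the average is rounded to obtain an integer number of trees.
    return round(sum(optima) / len(optima))
```

The returned value would then be used as the number of weak learners when the algorithm is retrained on the whole training set.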