The CellPhe toolkit for cell phenotyping using time-lapse imaging and pattern recognition

With phenotypic heterogeneity in whole cell populations widely recognised, the demand for quantitative and temporal analysis approaches to characterise single cell morphology and dynamics has increased. We present CellPhe, a pattern recognition toolkit for the unbiased characterisation of cellular phenotypes within time-lapse videos. CellPhe imports tracking information from multiple segmentation and tracking algorithms to provide automated cell phenotyping from different imaging modalities, including fluorescence. To maximise data quality for downstream analysis, our toolkit includes automated recognition and removal of erroneous cell boundaries induced by inaccurate tracking and segmentation. We provide an extensive list of features extracted from individual cell time series, with custom feature selection to identify the variables that provide the greatest discrimination for the analysis in question. Using ensemble classification for accurate prediction of cellular phenotype and clustering algorithms for the characterisation of heterogeneous subsets, we validate CellPhe and demonstrate its adaptability using different cell types and experimental conditions.

Life sciences study design
All data used to produce the results in the manuscript, including separate data that allow the user to follow the worked example in the CellPhe user guide, are available from the Dryad database: https://doi.org/10.5061/dryad.4xgxd25f0. The file example_data.zip contains all the data required to follow the worked example. A video that explains how to use the GUI is available on Zenodo: https://zenodo.org/record/7674584#.ZAJYBOzP0o8. Source data are provided with this paper.
No human research participants were involved in this study and none were recruited. Reporting on sex and gender and ethics oversight are therefore not applicable.
Statistical methods were not used to predetermine sample size as there are no effect sizes or predetermined proportions involved in our study. However, cell seeding densities (i.e. the number of cells per well) were kept consistent throughout all experiments to ensure reproducibility and were guided by the user manual provided by PhaseFocus. Furthermore, balanced training sets were used so that training was not biased towards one class over the other.
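The class balancing mentioned here can be illustrated with a minimal Python sketch. It assumes that balance is achieved by randomly downsampling the larger class, which the text does not specify; the function name `balance_by_downsampling`, the cell identifiers, the class labels and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, labels, seed=0):
    """Reduce every class to the size of the smallest class.

    Assumption: random downsampling. The source only states that
    balanced training sets were used, not how they were built.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Size of the smallest class determines how many samples each class keeps.
    n_keep = min(len(items) for items in by_class.values())

    balanced_samples, balanced_labels = [], []
    for label in sorted(by_class):
        kept = rng.sample(by_class[label], n_keep)
        balanced_samples.extend(kept)
        balanced_labels.extend([label] * n_keep)
    return balanced_samples, balanced_labels


# Hypothetical cell identifiers: 5 treated cells, 3 untreated cells.
cells = ["t1", "t2", "t3", "t4", "t5", "u1", "u2", "u3"]
classes = ["treated"] * 5 + ["untreated"] * 3
balanced_cells, balanced_classes = balance_by_downsampling(cells, classes)
```

After balancing, both classes contribute the same number of cells to training, so a classifier cannot reach high accuracy simply by favouring the majority class.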
To ensure reliable characterisation of cellular phenotype, only cells that were tracked for 50 frames or more were included in analyses. Furthermore, our toolkit includes automated identification and exclusion of segmentation errors prior to downstream analyses.
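The 50-frame inclusion criterion can be sketched in Python as follows. This is not the CellPhe implementation: the `(cell_id, frame)` data layout and the function name `keep_long_tracks` are hypothetical, and the sketch assumes that "tracked for 50 frames" means the cell appears in at least 50 distinct frames.

```python
from collections import Counter

MIN_FRAMES = 50  # threshold stated in the text


def keep_long_tracks(observations, min_frames=MIN_FRAMES):
    """Keep observations of cells seen in at least `min_frames` distinct frames.

    `observations` is a list of (cell_id, frame) pairs (hypothetical layout).
    """
    # Count distinct frames per cell; set() guards against duplicated rows.
    frames_per_cell = Counter(cell_id for cell_id, _ in set(observations))
    return [
        (cell_id, frame)
        for cell_id, frame in observations
        if frames_per_cell[cell_id] >= min_frames
    ]


# Hypothetical tracks: cell 1 spans 60 frames, cell 2 spans 49, cell 3 exactly 50.
observations = (
    [(1, f) for f in range(60)]
    + [(2, f) for f in range(49)]
    + [(3, f) for f in range(50)]
)
filtered = keep_long_tracks(observations)
kept_ids = sorted({cell_id for cell_id, _ in filtered})
```

Cell 2 is dropped because it falls one frame short of the threshold, while cell 3 is retained because the criterion is inclusive ("50 frames or more").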
Training sets were compiled from experiments performed in different months, both to increase the sample size for training and to ensure that the discriminatory variables identified were not artefacts of experimental variability. Models were tested on independent test sets drawn from experiments carried out by a different individual, and test set data were not used during model training. Findings were consistent across multiple experiments, and models achieved high classification accuracy on the independent test sets. The CellPhe method was further validated using experiments with different cell lines and different drugs.
Data sets for entire experiments were randomly assigned as either training or test data so that cells from the same experiment were never assigned to both training and test sets.
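A minimal Python sketch of this experiment-level assignment follows. The function name `split_by_experiment`, the `test_fraction` parameter, the seed and the example records are hypothetical; the text does not state what proportion of experiments was held out for testing.

```python
import random


def split_by_experiment(experiment_ids, test_fraction=0.25, seed=0):
    """Randomly assign whole experiments to 'train' or 'test'.

    Returns a dict mapping each experiment ID to its assigned set, so
    cells from one experiment can never appear in both sets.
    """
    unique = sorted(set(experiment_ids))
    if len(unique) < 2:
        raise ValueError("need at least two experiments to form both sets")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # At least one experiment in the test set, at least one left for training.
    n_test = min(len(unique) - 1, max(1, round(len(unique) * test_fraction)))
    test_experiments = set(unique[:n_test])
    return {exp: ("test" if exp in test_experiments else "train") for exp in unique}


# Hypothetical records: (cell_id, experiment_id).
cells = [
    ("c1", "expA"), ("c2", "expA"), ("c3", "expB"),
    ("c4", "expC"), ("c5", "expD"), ("c6", "expD"),
]
assignment = split_by_experiment([exp for _, exp in cells])
train_cells = [c for c, exp in cells if assignment[exp] == "train"]
test_cells = [c for c, exp in cells if assignment[exp] == "test"]
```

Splitting on the experiment identifier rather than on individual cells is what prevents leakage: a per-cell random split could place cells from the same well and imaging session on both sides of the split.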
Blinding was not possible because cell cultures were either treated or untreated, so cells could not be randomly allocated to groups. Supervised machine learning algorithms require class labels for the training data; to avoid bias, model performance was assessed on independent test sets that were never used in model training, and test set labels were used only to calculate classification accuracy.
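The role of test set labels can be illustrated with a short Python sketch in which predictions are formed without access to the true labels, and the labels enter only in the final accuracy calculation. The majority-vote combination rule is an assumption made for illustration (the text states only that ensemble classification is used), and the function names, model predictions and labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model predictions into one label per cell by majority vote.

    Assumption: simple majority voting; the source does not specify the rule.
    """
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]


def classification_accuracy(predicted, true_labels):
    """Fraction of cells whose predicted label matches the true label."""
    if len(predicted) != len(true_labels):
        raise ValueError("predicted and true_labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)


# Hypothetical predictions from three classifiers for four test cells.
predictions_per_model = [
    ["treated", "untreated", "treated", "untreated"],
    ["treated", "treated", "treated", "untreated"],
    ["untreated", "untreated", "treated", "treated"],
]
ensemble = majority_vote(predictions_per_model)

# True labels are consulted only here, after all predictions are fixed.
true_labels = ["treated", "untreated", "untreated", "untreated"]
accuracy = classification_accuracy(ensemble, true_labels)
```

With an odd number of voters and two classes, as in this example, ties cannot occur; other configurations would need an explicit tie-breaking rule.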