Learning representations for image-based profiling of perturbations

Measuring the phenotypic effect of treatments on cells through imaging assays is an efficient and powerful way of studying cell biology, and requires computational methods for transforming images into quantitative data. Here, we present an improved strategy for learning representations of treatment effects from high-throughput imaging, following a causal interpretation. We use weakly supervised learning for modeling associations between images and treatments, and show that it encodes both confounding factors and phenotypic features in the learned representation. To facilitate their separation, we constructed a large training dataset with images from five different studies to maximize experimental diversity, following insights from our causal analysis. Training a model with this dataset successfully improves downstream performance, and produces a reusable convolutional network for image-based profiling, which we call Cell Painting CNN. We evaluated our strategy on three publicly available Cell Painting datasets, and observed that the Cell Painting CNN improves performance in downstream analysis by up to 30% with respect to classical features, while also being more computationally efficient.


March 2021
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
The results are reported on three benchmark datasets, each corresponding to one high-content screening experiment. After data filtering, the sample sizes (used for training and evaluation of the models) are the following:
BBBC037: 205 treatments placed in 1029 wells. Additionally, there are 175 negative control wells. In total: 1024 wells.
BBBC022: 995 treatments placed in 3971 wells. Additionally, there are 1280 negative control wells. In total: 5251 wells.
BBBC036: 1550 treatments placed in 12180 wells. Additionally, there are 3528 negative control wells. In total: 15708 wells.
We selected these three datasets for the study because each of them is a large collection of perturbations of two types: chemical and genetic. This covers a sufficient space of phenotypic responses for studying representation learning for cell morphology.
We conducted quality control of images in all three datasets by analyzing image-based features with principal component analysis. The outliers observed in the first two principal components were flagged as candidates for exclusion, and were visually inspected to confirm rejection. We found most of these images to be noisy or empty and not suitable for training and evaluation. With this quality control, two wells were removed from BBBC037, 43 wells from BBBC022, and no wells from BBBC036. If treatments had multiple concentrations in BBBC022 and BBBC036, we kept only the maximum concentration for further analysis and evaluation. The metadata used for the extraction of features will be made publicly available.
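The outlier-flagging step of this quality control can be illustrated with a minimal sketch. It assumes one feature vector per well and flags wells whose projection onto the first two principal components is extreme. The function name `flag_pca_outliers`, the robust z-score rule, and the threshold of 6 are hypothetical choices made for illustration; the procedure described above relied on visual inspection of the flagged candidates, not on a fixed cutoff.

```python
import numpy as np


def flag_pca_outliers(features, n_components=2, threshold=6.0):
    """Return a boolean mask marking rows that are extreme in the top PCs.

    The robust z-score rule and the threshold are assumptions for
    illustration only; they are not taken from the study.
    """
    X = np.asarray(features, dtype=float)
    # Center each feature, then obtain the principal axes from the SVD.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[:n_components].T  # projections onto the top PCs

    # Robust z-score per component: distance from the median in MAD units.
    median = np.median(scores, axis=0)
    mad = np.median(np.abs(scores - median), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # avoid division by zero
    robust_z = np.abs(scores - median) / (1.4826 * mad)

    # Flag a well if it is extreme on any of the retained components.
    return (robust_z > threshold).any(axis=1)


# Synthetic example: 200 wells with 10 features, one of them corrupted.
rng = np.random.default_rng(0)
wells = rng.normal(size=(200, 10))
wells[5] += 50.0  # simulate one empty or noisy well
candidates = np.flatnonzero(flag_pca_outliers(wells))
```

In this sketch, `candidates` holds the indices of wells that would be passed on to visual inspection before any exclusion decision is made.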
Treatments (also referred to as perturbations) initially had five replicates (treatment in a single well in five different plates) in the BBBC037 dataset, four replicates in the BBBC022 dataset, and up to eight replicates in the BBBC036 dataset.
Multiple replicates are standard in high-throughput imaging studies to measure signal strength and replicability. On average, the three datasets exhibited more than 80% consistency in replicability, with approximately 50% of the treatments having a phenotype significantly different from control samples. This level of consistency and replicability is sufficient for investigating the effects of treatments and for investigating methods that can amplify that signal.
Treatments were not randomly allocated in experimental plates; they followed a specific predefined plate layout.
Blinding is not applicable for this research, mainly because it is a retrospective study. The compound and gene overexpression screens were designed in previous studies, and our goal was to investigate whether it was possible to extract more signal from these datasets given what is currently known about these treatments (mechanism of action annotations).
U2OS: female, A549: male.
None of the cell lines used were authenticated.
Cells were not tested for mycoplasma contamination.
None.
Materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.