Label-free prediction of cell painting from brightfield images

Cell Painting is a high-content image-based assay applied in drug discovery to predict bioactivity, assess toxicity and understand the mechanisms of action of chemical and genetic perturbations. We investigate label-free Cell Painting by predicting the five fluorescent Cell Painting channels from brightfield input. We train and validate two deep learning models with a dataset representing 17 batches, and we evaluate on batches treated with compounds from a phenotypic set. The mean Pearson correlation coefficient of the predicted images across all channels is 0.84. Without incorporating features into the model training, we achieved a mean correlation of 0.45 with ground truth features extracted using a segmentation-based feature extraction pipeline. Additionally, we identified 30 features which correlated greater than 0.8 to the ground truth. Toxicity analysis on the label-free Cell Painting resulted in a sensitivity of 62.5% and a specificity of 99.3% on images from unseen batches. We provide a breakdown of the feature profiles by channel and feature type to understand the potential and limitations of label-free morphological profiling. We demonstrate that label-free Cell Painting has the potential to be used for downstream analyses and could allow the repurposing of imaging channels for other non-generic fluorescent stains of more targeted biological interest.

Morphological profile generation. Features were aggregated using the median value per image. We then applied a feature selection process, which reduced the number of usable features to 611, comparable to other studies. We used the ground truth data only in the feature reduction pipeline to avoid introducing a model bias into the selected features.
Training and test set generation. We sampled wells for training from 17 of the 19 batches (Table 1). The remaining two batches were used to select the test set and were excluded from the training process. The test set was composed of compounds from a set of known pharmacologically active molecules with known observable phenotypic activity. We randomly sampled 3000 wells for model training and hyperparameter tuning, with the constraint of an overall equal number of wells per batch. From each well we randomly selected one of the four fields of view as the training image. For model tuning we used 90/10 splits sampled randomly from the training set, before using the full training set to train the final models. The test set contained 273 images and was chosen by sampling randomly within each treatment group across the two remaining batches (treatment breakdown in Table 1).

U-net with L1 loss network architecture. Our first model is based on the original U-Net21, a convolutional neural network which has proven very effective in many imaging tasks. U-Net architectures have been used to solve inverse reconstruction problems in cellular22, medical23 and general imaging24 problems. For segmentation tasks, out-of-the-box U-Net based architectures such as nnU-Net25 have been shown to perform very well even compared to state-of-the-art models.
U-Nets comprise a number of convolutions in a contracting path with downsampling (pooling) layers, and an expansive path with upsampling and concatenations, allowing for retention of spatial information while learning detailed feature relationships. The network captures multi-scale features of images through different resolutions by going up and down the branches of the network.
We adapted the typical grayscale or RGB channel U-Net model to have 3 input channels and 5 output channels to accommodate our data. An overview of the model network and training is presented in Fig. 1. There were 6 convolutional blocks in the downsampling path, the first with 32 filters and the final with 1024 filters. Each block performed a 2D convolution with a kernel size of 3 and a stride of 1, followed by a ReLU and then a batch normalization operation. Between blocks, a max pooling operation with a kernel size of 2 and a stride of 2 was applied for downsampling. The upsampling path was symmetric to the downsampling path, with convolutions with a kernel size of 2 and a stride of 2 applied for upsampling. The corresponding blocks in the contracting and expansive paths were concatenated as in the typical U-Net model. The final layer was a convolution with no activation or batch normalization. In total our network had 31 × 10⁶ trainable parameters.
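As a concrete illustration, the following PyTorch sketch implements the contracting path under the stated hyperparameters (a single 3 × 3 convolution per block, ReLU then batch normalization, 2 × 2 max pooling between blocks). The crop size and `padding=1` (to preserve spatial size within a block) are assumptions made for illustration; the full model, including the expansive path and skip concatenations, is given in our GitHub repository.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One contracting-path block: 3x3 conv (stride 1) -> ReLU -> BatchNorm."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )
    def forward(self, x):
        return self.block(x)

# Filter progression for the six contracting blocks (32 -> 1024),
# with 2x2/stride-2 max pooling between blocks, as described above.
filters = [32, 64, 128, 256, 512, 1024]
pool = nn.MaxPool2d(kernel_size=2, stride=2)

blocks = []
in_ch = 3  # 3-channel brightfield input
for out_ch in filters:
    blocks.append(DownBlock(in_ch, out_ch))
    in_ch = out_ch

x = torch.randn(1, 3, 256, 256)  # crop size is illustrative
skips = []  # activations kept for the expansive-path concatenations
for i, blk in enumerate(blocks):
    x = blk(x)
    skips.append(x)
    if i < len(blocks) - 1:
        x = pool(x)
print(x.shape)  # torch.Size([1, 1024, 8, 8])
```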
For pairs of corresponding images $\{(x_i, y_i)\}$, where $x_i$ is a 3-channel image from the brightfield input space and $y_i$ is a 5-channel image from the real fluorescent output space, the loss function for the U-Net model is

$$\mathcal{L}_{L1} = \frac{1}{N}\sum_{i=1}^{N} \lVert y_i - \hat{y}_i \rVert_1, \qquad (1)$$

where $\hat{y}_i$ is the predicted output image from the network.

Table 1. The composition of the Training, Validation and Test sets. Choosing data from many batches and experiments should make our models learn more robust, common features, which gives us the best chance of successful prediction on unseen data.

cWGAN-GP network architecture. Many GAN image reconstruction models follow the framework from Isola12, in which a conditional adversarial objective

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \qquad (2)$$

is used, where G is trying to minimise this objective, and D is trying to maximize it:

$$G^{*} = \arg\min_{G}\max_{D}\,\mathcal{L}_{cGAN}(G, D). \qquad (3)$$

The model objective function $G^{*}$ is then, for example, a mix of both objective functions:

$$G^{*} = \arg\min_{G}\max_{D}\,\mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G) \qquad (4)$$

for some weighting parameter $\lambda$.
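To make the mixed objective in Eq. (4) concrete, a minimal PyTorch sketch of the generator's training loss is given below. The non-saturating BCE form of the adversarial term and the weighting `lam=100` are illustrative assumptions, not our final training configuration (described later).

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(d_fake_logits, y_pred, y_true, lam=100.0):
    """Mixed objective of Eq. (4): adversarial term plus weighted L1 (Eq. (1)).

    d_fake_logits: discriminator scores for (brightfield, predicted) pairs.
    lam: the weighting parameter lambda (value here is illustrative).
    """
    # The generator wants the discriminator to label its outputs as real.
    adversarial = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    reconstruction = F.l1_loss(y_pred, y_true)  # pixel-wise L1, Eq. (1)
    return adversarial + lam * reconstruction
```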
These constructions can be unstable and difficult to train, and this was the case for our dataset. To overcome difficulties with training we opted for a conditional Wasserstein GAN with gradient penalty26 (cWGAN-GP) approach. The improved Wasserstein GAN was designed to stabilize training, which is useful for more complex architectures with many layers in the generator.
The discriminator network D (the critic in the WGAN formulation) is a patch discriminator12 which takes the concatenated brightfield and predicted Cell Painting channels as an eight-channel input. The network had three convolutional layers, with 64 filters in the final layer; each convolution had a kernel size of 4 and a stride of 2. The output is the sum of the cross-entropy losses from each of the localized patches.
In Eq. (4), the L1 loss term enforces low-frequency structure, so using a discriminator which classifies smaller patches (and averages the responses) is helpful for capturing high-frequency structure in the adversarial loss. For Wasserstein GANs the discriminator is called the critic, as it is not trained to classify between real and fake but instead to learn a K-Lipschitz function used to minimize the Wasserstein loss between the real data and the generator output.
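A minimal sketch of how this constraint is enforced in practice is shown below: the gradient penalty (formalized in the equations that follow) is computed on random interpolates between real and generated Cell Painting stacks, with the critic assumed to take the concatenated 8-channel (brightfield plus fluorescent) input described above.

```python
import torch

def gradient_penalty(critic, x_brightfield, y_real, y_fake):
    """WGAN-GP penalty: (||grad_xhat D(x, xhat)||_2 - 1)^2 evaluated on
    random interpolates between real and generated outputs."""
    # One interpolation coefficient per sample, broadcast over C, H, W.
    eps = torch.rand(y_real.size(0), 1, 1, 1, device=y_real.device)
    x_hat = (eps * y_real + (1 - eps) * y_fake).requires_grad_(True)
    # Critic scores the conditional pair as an 8-channel input (3 + 5).
    scores = critic(torch.cat([x_brightfield, x_hat], dim=1))
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```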
Hence, for our conditional WGAN-GP-based construction we trained the generator to minimize the objective

$$\mathcal{L}_{G} = \mathbb{E}_{x}\left[-D(x, G(x))\right] + \lambda_1\,\mathcal{L}_{L1}(G), \qquad (5)$$

where $\lambda_1$ is a weighting parameter for the L1 objective. We introduce $e$ as an adaptive weighting parameter to prevent the unbounded adversarial loss overwhelming L1:

$$\mathcal{L}_{G} = e\,\mathbb{E}_{x}\left[-D(x, G(x))\right] + \lambda_1\,\mathcal{L}_{L1}(G). \qquad (6)$$

The critic objective, which the network is trained to maximize, is

$$\mathcal{L}_{D} = \mathbb{E}_{x,y}\left[D(x, y)\right] - \mathbb{E}_{x}\left[D(x, G(x))\right] - \lambda_2\,\mathcal{L}_{GP}, \qquad (7)$$

where $\lambda_2$ is a weighting parameter for the gradient penalty term

$$\mathcal{L}_{GP} = \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(x, \hat{x}) \rVert_2 - 1\right)^2\right], \qquad (8)$$

which is used to enforce the K-Lipschitz constraint. Here $\hat{x}$ is obtained by uniformly sampling along straight lines between pairs of points in the real data and generated data distributions27.

Model training. Images were cropped to serve as inputs to the network. No data augmentation was used. The models were trained on the AstraZeneca Scientific Computing Platform with a maximum allocation of 500 GB of memory and four CPU cores. The PyTorch implementation of the networks and training is given in our GitHub repository. The U-Net model was trained with a batch size of 10 for 15,000 iterations (50 epochs). The optimizer was Adam, with a learning rate of 2 × 10⁻⁴ and a weight decay of 2 × 10⁻⁴. The total training time for the U-Net model was around 30 h. The cWGAN-GP model was trained with a batch size of 4 for an additional 21,000 iterations (28 epochs). The generator optimizer was Adam, with a learning rate of 2 × 10⁻⁴, β₁ = 0 and β₂ = 0.9. The critic optimizer was Adam, with a learning rate of 2 × 10⁻⁴, β₁ = 0, β₂ = 0.9 and a weight decay of 1 × 10⁻³. The generator was updated once for every 5 critic updates. The L1 weight was λ₁ = 100, the gradient penalty weighting parameter was λ₂ = 10, and e = 1/epoch. The total training time for the cWGAN-GP model was an additional 35 h. For each model, the best performing epoch was selected by plotting the image evaluation metrics of the validation split for each epoch during training. Once the metrics stopped improving (or got worse), training was stopped, as the model was considered to be overfitting.

Image-level evaluation metrics. The predicted and target images were evaluated with five metrics: mean absolute error (MAE), mean squared error (MSE), structural similarity index measure (SSIM)28,29, peak signal-to-noise ratio (PSNR)29 and the Pearson correlation coefficient (PCC)8,30. MAE and MSE capture pixel-wise differences between the images, and low values are favorable for image similarity. SSIM is a similarity measure between two images where, for corresponding sub-windows of pixels, luminance, contrast, means, variances and covariances are evaluated; the mean of these calculations is taken to give the SSIM for the whole image. PSNR contextualizes and standardizes the MSE in terms of the pixel values of the image, with a higher PSNR corresponding to more similar images. PCC measures the linear correlation of pixels in the images.
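The five image-level metrics can be computed with standard scientific Python libraries; the sketch below assumes single-channel 8-bit images and a data range of 255 (see the note on PSNR normalization below).

```python
import numpy as np
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def image_metrics(pred, target, data_range=255):
    """Evaluation metrics for one predicted vs. ground-truth channel.
    `pred` and `target` are 2-D uint8 arrays (one channel, 8-bit)."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    mae = np.mean(np.abs(pred - target))            # pixel-wise L1 error
    mse = np.mean((pred - target) ** 2)             # pixel-wise L2 error
    ssim = structural_similarity(target, pred, data_range=data_range)
    psnr = peak_signal_noise_ratio(target, pred, data_range=data_range)
    pcc, _ = pearsonr(target.ravel(), pred.ravel()) # linear pixel correlation
    return dict(MAE=mae, MSE=mse, SSIM=ssim, PSNR=psnr, PCC=pcc)
```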

PSNR is normalized to the maximum potential pixel value, taken as 255 when the images are converted to 8-bit. For this dataset the PSNR values appeared high, as the maximum 8-bit pixel value of each image was generally lower than the theoretical maximum. Only SSIM can be interpreted as a fully normalized metric, with values between 0 and 1 (1 being a perfect match).

Model inference. At inference time, the trained models were applied in a sliding-window fashion over image crops to assemble the full-size predicted images (the source of the blocking artifacts noted in the Results).
Morphological feature-level evaluation. Pairwise Spearman correlations31 between the features in the test set data were calculated for each model, with the mean values for each feature group arranged into correlation matrices and visualized as heatmaps19. These features were split into several categories: area/shape, colocalization, granularity, intensity, neighbors, radial distribution and texture. We also visualized feature clustering using uniform manifold approximation (UMAP)32, implemented in Python using the UMAP package33 with the parameters n_neighbors = 15, min_dist = 0.8 and n_components = 2. Further details are in the GitHub repository.
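For reference, the embedding step amounts to the following minimal sketch; the random `profiles` matrix stands in for our selected feature table and is purely illustrative.

```python
import numpy as np
import umap  # from the umap-learn package

# `profiles` stands in for the (n_wells x n_features) matrix of the 611
# selected morphological features; random data here purely for illustration.
profiles = np.random.rand(300, 611)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.8, n_components=2)
embedding = reducer.fit_transform(profiles)  # (300, 2), ready for plotting
```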
Toxicity prediction. We normalized the morphological profiles to zero mean and unit standard deviation, and classified the compounds into two groups with a K-NN classifier (k = 5, Euclidean distance) using the controls as training labels (positive controls were used as examples of a toxic phenotype). To account for the imbalance between the numbers of positive and negative controls, we sampled with replacement an equal number of profiles from both categories for training, ran the classifier 100 times, and used majority voting for the final classification.
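A minimal sketch of this bootstrap voting scheme is given below. The helper name, the use of scikit-learn, and the choice of the smaller control-group size as the per-class sample count are illustrative assumptions; profiles are assumed to be normalized already.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_toxicity(profiles, control_profiles, control_labels,
                      n_rounds=100, rng=np.random.default_rng(0)):
    """Bootstrap K-NN toxicity calls: resample the controls to balance the
    classes, classify n_rounds times, and take the majority vote."""
    votes = np.zeros(len(profiles))
    pos = control_profiles[control_labels == 1]  # positive controls (toxic)
    neg = control_profiles[control_labels == 0]  # negative controls (DMSO)
    n = min(len(pos), len(neg))                  # equal per-class sample count
    for _ in range(n_rounds):
        # Sample an equal number of profiles from each class, with replacement.
        p = pos[rng.integers(len(pos), size=n)]
        q = neg[rng.integers(len(neg), size=n)]
        X = np.vstack([p, q])
        y = np.r_[np.ones(n), np.zeros(n)]
        knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
        votes += knn.fit(X, y).predict(profiles)
    return (votes > n_rounds / 2).astype(int)    # 1 = mitoxantrone-like
```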
Ethics approval/Consent to participate. The authors declare this study does not require ethical approval.

Results
To systematically investigate the utility of label-free prediction of Cell Painting from brightfield images, we conducted the evaluation on three separate levels: the image level, the morphological feature level, and a downstream profile-level analysis to identify toxic compounds. We evaluated the two models on the test set of 148 unique compounds, 26 positive control wells (mitoxantrone) and 77 negative control wells (DMSO), representing two experimental batches and 12 plates (Table 1).

Interestingly, only the plasma membrane of the AGP channel was successfully predicted, whereas the models were not able to reproduce the actin filaments and Golgi apparatus structures. Similarly, for the Mito channel the models performed well overall in predicting the mitochondrial structures, but we observed a lack of the fine granularity and detail present in the ground truth. In addition, we observed that the blocking effect from the inference-time sliding window was visible in the generated images. Nevertheless, the DNA, ER and RNA channels were better predicted, of which RNA achieved the best image-level performance.

Using correlation analysis, we deconvoluted the profile similarities according to feature type, cell compartment and imaging channel for both models (Fig. 5, Supplementary Figure D). Overall, many morphological features extracted from the generated images showed substantial correlation with those extracted from ground truth images. Examples of accurately reproduced (> 0.6 correlation) feature groups across both models were texture measurements of the AGP channel in both cells and cytoplasm, intensity measurements of the Mito channel in the cytoplasm, and granularity and texture measurements of the DNA channel within the nuclei. The highest performing feature group was granularity measurements of the RNA channel within the cytoplasm (0.86 correlation in both models). Almost all features correlated positively with the ground truth, and only a small number showed close to zero or negative correlation, such as the radial distribution of RNA in the cytoplasm and the intensity of the AGP channel in the nuclei. The cell colocalization features were not calculated by CellProfiler; however, the cytoplasm and nuclei colocalization features represent the cell as a whole.

The mean correlations for the top 10% of the selected features were 0.80/0.81 (U-Net/cWGAN-GP), and for the top 50% they were 0.64/0.65. The number of features with a correlation greater than 0.8 was 26/30, and for both models every feature group and cell compartment had at least one such strongly correlated feature, except for the colocalization feature group. For the cWGAN-GP model, the 30 features with greater than 0.8 correlation broke down by compartment as 12 cell, 11 nuclei and 7 cytoplasm, and by feature type as 11 texture, 7 radial distribution, 5 area/shape, 4 granularity, 2 intensity, 1 neighbors and 0 colocalization.
Using uniform manifold approximation (UMAP), we visualized the high-dimensional morphological profiles to identify underlying data structures (Fig. 6). UMAP was chosen as it is typically used for visualizing multidimensional morphological data and is regularly applied in the analysis of Cell Painting profiles2,19. We observed that all three image sources (ground truth, U-Net and cWGAN-GP) were separated in the common feature space, owing to the difference between the ground truth and predicted images. Despite this separation, all the image sources maintained the overall data structure. The clear batch effect visible in the ground truth is also evident in the predicted images, and the clustering of positive and negative control wells is likewise retained, indicating successful model performance. Features extracted from the cWGAN-GP model lie closer to the ground truth features than those extracted from the U-Net; however, features from both models lie closer to each other than either does to the ground truth. We repeated the analysis using principal component analysis (PCA)34 and observed the same behavior (Supplementary Figure C).

Profile-level evaluation. Conclusions made from the label-free prediction of Cell Painting images should show agreement with ground truth datasets to be of experimental value. We therefore performed a series of analyses to identify compounds that elicit a toxic phenotype comparable to that of an established positive control compound, mitoxantrone. Our UMAP (Fig. 6) showed that, as would be expected, most of the compounds showed greater resemblance to the negative (DMSO) controls. Promisingly, some compounds clustered with our positive control compound. To identify these compounds, we trained a K-NN classifier using the control profiles as the training set. The classification identified eight compounds in the ground truth profiles, five in the cWGAN-GP profiles and eight in the U-Net profiles (Supplementary Figure E). The U-Net model achieved a sensitivity of 62.5% and a specificity of 98.0% in toxicity classification, whilst the respective values for cWGAN-GP were 50.0% and 99.3%.

Discussion
As the first full prediction of five-channel Cell Painting from brightfield input, we have presented evidence that label-free Cell Painting is a promising and practicable approach. We show our model can perform well in a typical downstream application, and that many label-free features from the predicted images correlate well with features from the stained images. In addition, we see success in clustering by treatment type from the features. We present indications of which channels and biomarkers can be satisfactorily predicted by testing our model predictions with a comprehensive segmentation-based feature extraction profiling methodology.

Figure 6. The UMAPs demonstrate that both models could reproduce the separation between treatments and batches seen in the ground truth features, and that the clustering patterns are very similar. (a) shows each test image labelled by treatment type, (b) shows ground truth vs the two models and (c) highlights the difference between the two batches. cWGAN-GP was closer to the ground truth than U-Net, although the two models sit much closer to each other in feature space than either to the ground truth.

Training with an adversarial scheme using a conditional GAN approach has been shown to enhance performance in virtual staining tasks35. In preliminary experiments, we tested different GAN models and adversarial weightings, but we chose cWGAN-GP due to its stability of training on our dataset. Increasing the adversarial component of the training objective function resulted in undesirable artifacts in the images which were not true representations of the cells. Our results demonstrate that incorporating adversarial loss gives a small increase in performance over L1 loss alone, based on the pixel-wise evaluation metrics in all channels except ER. Even though the differences in metric values are mostly within one standard deviation, the values are calculated for images generated from the same set of brightfield images (for each model), so it is meaningful that cWGAN-GP achieves superior performance for 18 out of 25 metrics across the channels (Table 2). In domain transfer problems such as ours, the finer details in some of the predicted channels can be obstructed and blurred by a pixel-wise loss function such as L1 loss36. Only a small performance difference is expected, as the network and loss function for training both models were very similar, and in the cWGAN-GP model the adversarial component of the loss is weighted relatively low compared to L1.
In addition to the metric evaluation, features from images from the cWGAN-GP model show an increase in performance over the U-Net model, with slightly higher mean correlations to the ground truth (Fig. 5). The strong correlation of specific feature groups increases confidence that extracted morphological feature data from predicted images can be used to contribute overall morphological profiles of perturbations.
Biologically, it is expected that correlations of feature groups within the DNA channel are higher within the nuclei compartment than in either the cytoplasm or cells, since the nuclei compartment is morphologically very distinct from the cytoplasmic region. The high correlation of radial distribution features in the RNA channel suggests that successful visualization of the nucleoli within the nuclear compartment has a large effect on this particular feature group. The AGP and Mito channels both contain small and subtle cellular substructures which are typically less than two pixels wide. We postulate that this fine-scale information is not present in the brightfield images in our data, making accurate image reproduction of the AGP and Mito channels very challenging regardless of model choice. There are known limitations of brightfield imaging which will always restrict a domain transfer model with brightfield input. For example, brightfield images can display heterogeneous intensity levels and poor contrast, and segmentation algorithms have been shown to perform poorly on brightfield images compared with fluorescence channels, even after illumination correction methods have been applied27.
In UMAP feature space (Fig. 6), as well as in the PCA analysis (Supplementary Figure C), the features extracted from model-predicted images do not overlap with the ground truth features. This highlights limitations of the model, but also the challenge of batch effect: it is not expected for the model to exactly predict an unseen batch. The relevant structure of the data is maintained, although absolute values are not. This is also notable at the image level; for example, the high MSE is likely due to the batch effect causing a systematic difference in pixel values between the training and test batches. Similarity metrics such as SSIM may be more informative in this instance. It is notable that ground truth features from different batches also sit in non-overlapping feature space in our UMAPs. Within batches, the negative controls and treatments would form sub-clusters depending on the treatments in a larger dataset; however, we acknowledge our test set was relatively small, resulting in minimal sub-clustering.
Other studies which have used U-Net based models to predict fluorescence from transmitted light or brightfield have evaluated their performance on pixel-level metrics such as PCC15, SSIM28,29 and PSNR37. The mean PCC of all channels in our test set is 0.85 (using the best model for each channel), a value which compares favorably with prevailing work in fluorescent staining prediction8. In our data, two channels (DNA, RNA) exceed a PCC of 0.90 for both models. However, absolute values of image metrics are heavily data dependent, so we present these metrics primarily for model comparison.
Such metrics are standard in image analysis but have some limitations for cellular data. Treating each pixel in the image equally is a significant limitation, as pixels representing cellular structures are clearly more important than background (void) pixels37. Some channels, such as the DNA channel, are more sparse than others, such as AGP, and as such the ratio of pixels of interest to background pixels differs between channels. Extracting features with established scientific pipelines provides a more biologically relevant and objective evaluation, giving deep learning methods more credibility and a greater chance of being practically employed in this field38.
Table 2. Image metrics for each channel for the two models. The best performing model for each channel metric is highlighted in bold.

Despite the small sample size, the models performed well at predicting compounds with phenotypes similar to that of the positive control compound mitoxantrone, with specificities of 98.0% (U-Net) and 99.3% (cWGAN-GP). The sensitivities were 62.5% and 50% respectively. The lower sensitivity values suggest the models are not identifying some potentially toxic or active profiles. There are multiple morphological features present in ground truth images which are linked to toxicity, and it is likely that the models cannot capture all of these accurately. The manifestation of cytotoxicity can be most prevalent in the Mito and AGP channels, due to mitochondrial dysfunction and gross changes to cytoskeletal processes and structure. We report that the Mito and AGP channels were the least well predicted channels on similarity and correlation metrics (Table 2), which could explain the loss of sensitivity. Future studies could focus on predicting these two channels in particular. It is also relevant to consider that the concentration of mitoxantrone used was deliberately chosen to elicit a milder cytotoxic phenotype with an enlarged cellular area (suggesting a cytostatic effect is occurring) rather than complete cell death and an absence of cells entirely. The very high specificities indicate that label-free Cell Painting is not introducing prediction errors which would lead to false positive identification of a mitoxantrone-like phenotype. This is expected for models with significantly more weight on L1 loss rather than a high adversarial weighting, which could introduce artifacts or phantom structures into the predicted images. Despite the lack of sensitivity, we have demonstrated that morphological features extracted from both models recapitulate a significant portion of the morphological feature space, resulting in positive clustering with the chosen control compound, mitoxantrone. We disclose two such compounds which are already in the public domain: glipizide, a clinically used sulfonylurea for the treatment of type 2 diabetes, and GW-842470, a phosphodiesterase 4 inhibitor previously evaluated in clinical studies for the treatment of atopic dermatitis (discontinued).
Although comparable or superior results for phenotype classification could likely be achieved with an image classifier trained directly on the brightfield images, as a label-free, image-to-image approach our method holds promise as a generalist model able to perform multiple tasks. The simple visualization of Cell Painting channels is one approach to improving the interpretation of brightfield images. Our results highlight the rich information captured in the brightfield modality, which may currently be under-utilized in morphological profiling.
Our approach cannot by itself fully replace the fluorescent staining process and the information it contains, but we have provided evidence that it may be possible to replicate the information of some Cell Painting channels and feature groups, and that the brightfield modality alone may be sufficient for certain experimental applications. Importantly, employing such methods may reduce time and experimental cost, and enable the use of specific imaging channels for experiment-specific, non-generic fluorescent stains. We acknowledge that particular feature groups which are predicted poorly by our models (such as colocalization features) may result in an inability (or reduced sensitivity) to identify cellular phenotypes characterized solely by these features. In such experimental applications, the replacement of generic stains with phenotype- or target-relevant biomarkers may offer an effective alternative to the standard Cell Painting protocol.
One limitation of this study is the dataset used. Future studies will require larger datasets with greater diversity in terms of compound treatments and collection sources. Matching the number of fields in our training and test sets to the size of a typical dataset used in drug discovery would allow for greater insight into the capabilities of label-free Cell Painting. International collaborations such as the Joint Undertaking of Morphological Profiling using Cell Painting (JUMP-CP) aim to further develop Cell Painting and provide a highly valuable public dataset for this use39. It is notable that we have evaluated predictions with downstream features the networks have not seen, simply extracted classically from the images. In future studies, incorporating feature information into the training of the network itself may drastically improve performance, and this task may be appropriate for a transfer learning approach40.
We also acknowledge that the brightfield imaging mode itself may restrict the quantity of information being input into our models. The rationale behind imaging at multiple focal planes in the brightfield configuration was to visualize as much cellular substructure as possible. We frequently observed that not all cellular features are visible in a single focal plane; taking information from a z-stack image set therefore increases the input content available to the models. Alternative transmitted-light imaging modalities such as phase contrast microscopy and differential interference contrast (DIC) microscopy have previously been used for fluorescence channel prediction tasks22,41 and can capture a wealth of cellular morphology. It was beyond the scope of this study to investigate the impact of using different brightfield imaging modalities on predictive model performance, yet it should be recognized that all brightfield approaches lack the ability to fully visualize small-scale cellular substructures to some extent14. Brightfield imaging in a z-stack configuration is a typical process widely adopted across both academic and industry laboratories; the methods presented herein are therefore applicable to a range of instruments with varied imaging capabilities.
In summary, we propose a deep learning approach which can digitally predict Cell Painting channels without the application of fluorescent stains, using only the brightfield imaging modality as input. Building upon previous work8,22 we have predicted the five fluorescent channels and used these images to extract the associated groups of morphological features from a standard image analysis pipeline. We then used the features from the predicted images to assess how information-rich such images are. Finally, we have provided a critical evaluation of the predictions using morphological features extracted from the images with CellProfiler analysis and the resulting compound profiles.

Data availability
The datasets generated and analyzed in this study are not publicly available due to AstraZeneca licenses but are available from the corresponding author on reasonable request. Source code from this project is available on GitHub: https://github.com/crosszamirski/Label-free-prediction-of-Cell-Painting-from-brightfield-images.