Advanced microscopes extract rich visual information from biological samples at scales from individual atoms to cells and tissues. Among the different imaging modalities, brightfield illumination with transmitted light is the simplest to acquire while avoiding damaging the sample1. The usefulness of this technology has led to its widespread adoption2,3,4, and thereby to a dramatic increase in the volumes of microscopy data. However, the automated analysis techniques required to extract information at scale are often hindered by the artifacts present in the images5,6. Detecting and neutralizing the impact of such problematic image areas would provide more accurate results from experiments3, making artifact segmentation an important, albeit overlooked, research area in cell biology and beyond7,8.

While any signal that deviates from expectation can be considered artifactual9, a common source of artifacts in cell microscopy is the introduction of foreign objects during sample preparation. These include dust, fragments of dead cells, bacterial contamination, reagent impurities, defects on the light path, etc. We focus on detecting these low-level anomalies8,10 in brightfield microscopy and use the term artifact with this meaning. Manually identifying all the affected images or image regions is a time-consuming solution to this problem11,12. A common alternative approach for large datasets is computer-aided delineation and removal of the artifacts, but two complexities make this task challenging. First, artifacts appear stochastically in microscopy images, leading to sparse data. Second, artifact characteristics, such as morphology and texture, are often very heterogeneous and hence are challenging to define. This means that it is infeasible to comprehensively collect representative examples of all possible artifact types, which renders computational modeling difficult.

Deep learning has emerged as the favored solution to artifact detection7,8. While strongly supervised convolutional neural networks (CNNs) such as U-Net13,14,15,16,17 are state-of-the-art for most computer vision tasks, they cannot overcome some of the challenges that artifact detection brings7. A major bottleneck for strongly supervised deep learning methods is their requirement for pixel-level annotation, which is time-consuming and requires substantial expertise. As an alternative, weakly supervised techniques such as ScoreCAM18, which involve only image-level labeling, greatly reduce the time needed to prepare the dataset. In particular, generative autoencoder-based models19,20,21,22,23,24 are trained to reconstruct artifact-free images and report artifacts on test images as areas with large reconstruction error. Alternatively, one-class classification approaches25,26,27 train a classifier on artifact-free images and report artifacts as images with a low probability of belonging to this clean class. Combining the performance advantages of the strongly supervised methods and the convenience of image-level annotations would therefore be of great practical interest and impact.

In this work, we make the following key contributions: (1) we empirically confirm the prevalence of artifacts in brightfield microscopy images; (2) we test a range of existing approaches from domains other than microscopy on the artifact segmentation task, and find that none of them is accurate enough for practical use; (3) we combine the merits of weakly and strongly supervised methods for artifact segmentation from brightfield cell microscopy images using only image-level annotations. To our knowledge, this is the first attempt to segment artifacts in microscopy images in a weakly supervised way. We introduce ScoreCAM-U-Net, a model that combines the informative pixel-level28 and cheap-to-generate image-level18 annotation schemes, and accurately detects artifacts in held-out samples. As training is performed using only image-level labels, generating training data is orders of magnitude cheaper, without substantially sacrificing performance compared to pixel-level data. (4) We study the impact of removing artifacts on different downstream applications. We demonstrate that artifacts in microscopy images confound downstream analyses such as nuclei segmentation or quantification of ligand binding, and that ScoreCAM-U-Net successfully overcomes these problems.


To delineate artifacts in brightfield microscopy images, we introduce ScoreCAM-U-Net, a method that uses image-level annotations as input for training and produces artifact segmentations as output. On three different datasets, we compare the performance of our pipeline with a strongly supervised counterpart trained on pixel-level annotations as well as with state-of-the-art models that are trained using image-level labeling.


We chose three datasets for this study to cover multiple common variables in experimental design and thereby better assess the generalizability of the results. Overall, the datasets cover nine different cell lines, fixed and live cells, two different plate formats and two microscopes. The provenance of the datasets has been described previously3,4,29,30 and we briefly describe their most important properties here.

Seven cell lines dataset

Seven types of cells including human cells from breast cancer (MCF7), fibrosarcoma (HT1080), cervical cancer (HeLa), hepatocellular carcinoma (HepG2), alveolar basal epithelial (A549), dog cells from kidney tissue (MDCK), and mouse embryonic fibroblast cells (NIH3T3) were seeded in Collagen type 1-coated CellCarrier-384 Ultra Microplates (PerkinElmer, Waltham, MA; cat. 6057700). The cells were stained with 10 µg/ml Hoechst 33342 (Thermo Fisher, Waltham, MA; cat. H3570) and fixed in formaldehyde (Sigma, St. Louis, MO; cat. 252549). A 20 × water immersion objective was used to acquire images on an Opera Phenix high-content screening system (PerkinElmer) in confocal mode. Nine fields of view were acquired from each well with a total of 3024 images of size 1080 × 1080 px (1 px = 0.59 µm) with 350 cells in each field of view on average. All fields of view were imaged in fluorescent and brightfield modalities, with one modality acquired first on all wells and then the second. This dataset is referred to as “seven cell lines” in the further text.

LNCaP dataset

Human prostate adenocarcinoma cells (LNCaP, from ATCC) were seeded in a CellCarrier-384 Ultra Microplate (PerkinElmer), fixed in formaldehyde, and stained using DRAQ5 fluor (Abcam, Cambridge, United Kingdom) to tag nuclear DNA. A 20 × objective was used on a CellVoyager 7000 (Yokogawa, Tokyo, Japan) instrument in confocal mode to acquire fluorescence and brightfield images of size 2556 × 2156 pixels (1 pixel = 0.325 µm) with 681 cells in each field of view on average. Similar to the seven cell lines dataset, one modality was acquired on all wells before moving on to the second modality.

ArtSeg-CHO-M4R dataset

The imaging was performed as described previously29. Briefly, live CHO-K1-hM4R cells were seeded with a density of 25 000 cells per well into µ-Plate 96 Well Black plate (Ibidi) 5–7 h before the imaging to allow attachment. All the experiments were performed in the cell culture medium DMEM/F-12 with 9% FBS (Sigma), antibiotic antimycotic solution (100 U/ml penicillin, 0.1 mg/ml streptomycin, 0.25 μg/ml amphotericin B, Sigma) and 750 μg/ml of selection antibiotic geneticin (G418, Capricorn Scientific). The final volume in the well was 200 μl. All imaging experiments were carried out at 37 °C in a 5% CO2 atmosphere. The images were captured with a Cytation 5 Imaging Multi-Mode Reader (BioTek, Bad Friedrichshall, Germany). Images were obtained using a LUCPLFLN 20 × objective lens with a working distance of 6.6 mm and a numerical aperture of 0.45 (Olympus), using an LED excitation source with a 531(40) nm filter, and captured with a 593(40) nm emission filter. The field of view size was 1224 × 904 pixels (1 pixel = 0.323 µm). For a single field of view, a brightfield image was obtained first, which was immediately followed by fluorescence image acquisition. These steps were repeated for four fields of view in each well. In all experiments, a constant concentration of 2 nM UR-CG07231, a TAMRA-labeled fluorescent ligand, was used to visualize cells expressing muscarinic M4 receptors in the fluorescence channel. In concentration–response experiments, atropine, arecoline (Sigma), UNSW-MK25932 and UR-SK7533 were used. UNSW-MK259, UR-SK75 and UR-CG072 were kindly provided by Dr. Max Keller from the University of Regensburg. The ArtSeg-CHO-M4R dataset is made freely available for public use.

Artifact annotation

The seven cell lines and LNCaP data were inspected and 11.4% and 6.5% of the samples were found to have artifacts, 344/3024 and 51/784 fields-of-view respectively. The same number of fields-of-view from each dataset were randomly sampled to be used as training images without artifacts. At the same time, 99.2% of samples in the ArtSeg-CHO-M4R dataset (1171/1181) were found to have artifacts. The clean images for this dataset were generated as described below.

For all three datasets, pixel-level ground truth masks of artifacts were generated by manual annotation. All annotators had prior training in bioimage analysis, microscopy and cell biology. For the seven cell lines and LNCaP datasets, the artifacts were annotated as polygons using VGG image annotator34, and for the ArtSeg-CHO-M4R dataset, as freehand annotations with the MembraneTools module of Aparecium software35. For all datasets, the artifact pixels were annotated while keeping the number of background pixels annotated as artifacts as low as possible. Due to the fuzzy nature of the artifact borders, including some background pixels during annotation was inevitable. We nevertheless tried to emulate the typical human annotator and keep this number as low as possible, while maintaining a reasonable speed of the annotation process.

For the ArtSeg-CHO-M4R dataset, the artifact annotations contain a considerable number of background pixels in some images as it speeds up annotation and better reflects the annotation process in real-world conditions.

To obtain the weak labels for the seven cell lines and LNCaP datasets, the images were classified as either clean or artifact-containing by manual inspection by the annotator. An image was considered clean if no artifacts were observed. For the ArtSeg-CHO-M4R dataset, as the vast majority of images contain at least one artifact, the clean images were generated by replacing the pixel values of manually annotated artifacts with the values of the corresponding pixels in the estimated background image. The background was estimated by fitting the original image with a two-dimensional second order polynomial function36. To simulate imaging noise, a zero-centered noise profile of the background pixels was added to the estimated background. Testing the generated clean images with the trained model showed that no artifacts were detected in them. Moreover, the modified areas were also not visually detectable by human experts (Supplementary Fig. S1).
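The clean-image generation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the polynomial is fitted only to non-artifact pixels (an assumption, since the text does not say which pixels enter the fit), and the noise is resampled from the zero-centered fit residuals of those pixels.

```python
import numpy as np

def quadratic_design(h, w):
    """Design matrix of a 2D second-order polynomial over an h x w pixel grid."""
    y, x = np.mgrid[0:h, 0:w]
    x = x.ravel() / max(w - 1, 1)  # rescale to [0, 1] for numerical stability
    y = y.ravel() / max(h - 1, 1)
    return np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])

def make_clean_image(image, artifact_mask, seed=0):
    """Replace artifact pixels with estimated background plus resampled noise."""
    h, w = image.shape
    design = quadratic_design(h, w)
    keep = ~artifact_mask.ravel()
    # Assumption: fit on non-artifact pixels only, so artifacts do not bias the fit.
    coeffs, *_ = np.linalg.lstsq(design[keep], image.ravel()[keep], rcond=None)
    background = design @ coeffs
    # Zero-centered noise profile taken from the background residuals.
    residuals = image.ravel()[keep] - background[keep]
    residuals = residuals - residuals.mean()
    rng = np.random.default_rng(seed)
    noise = rng.choice(residuals, size=int(artifact_mask.sum()))
    clean = image.astype(float)  # astype returns a copy
    clean[artifact_mask] = background.reshape(h, w)[artifact_mask] + noise
    return clean
```

Pixels outside the mask are left untouched, so only the annotated regions differ from the original image.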

ScoreCAM-U-Net for artifact segmentation

Our weakly supervised artifact segmentation pipeline combines the ScoreCAM model18, which highlights the areas of the image most useful for differentiating between clean and artifact-containing images, with the U-Net model4, which directly classifies pixels into categories. We call this pipeline “ScoreCAM-U-Net” (Fig. 1, Appendix A, and Supplementary Table S1).

Figure 1

Artifact segmentation pipeline—ScoreCAM-U-Net. During training (top), ScoreCAM18 (purple) is used to generate pixel-level probability maps of artifacts and the corresponding binary masks that are used to train the U-Net4 segmentation model (blue). During the inference (bottom), the trained U-Net (blue) is used to segment artifacts from the images that were deemed to contain artifacts (image with red borders) by the ScoreCAM (purple). Vertical dashed lines: binarization of pixel probability maps values.

ScoreCAM18 is a technique used to explain predictions made by deep learning methods, mostly applied to models that perform image classification. ScoreCAM analyzes both the model output and the corresponding image, and highlights the parts of the image that had a large impact on that prediction. It proceeds in four steps. First, visual representations (activation maps) of the last convolutional layer are extracted from an image classification model (ResNet37 in our implementation). Next, each activation map is upscaled to match the size of the input image, normalized to a range between 0 and 1, and projected onto a copy of the input image via multiplication, producing a projected input image. Then, the classification model (ResNet in our implementation) uses the projected inputs to calculate the probability of the input image belonging to each class. Finally, the activation maps are combined in a weighted sum, with each map weighted by the largest class probability obtained for its projected input, and the sum is passed through the ReLU38 activation function to generate the final output (Supplementary Fig. S2). Unlike competing methods that rely on gradients, ScoreCAM uses the largest class probability to obtain the resulting map. It has been empirically shown that this feature makes ScoreCAM less noisy and therefore more useful in practice18.
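The four steps can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions rather than the implementation used in this work: `classify` is a hypothetical callable standing in for the trained ResNet, upscaling uses nearest-neighbour repetition instead of interpolation, and the input size is assumed to be an integer multiple of the activation-map size.

```python
import numpy as np

def score_cam(image, activation_maps, classify, target_class=None):
    """Toy ScoreCAM over a single-channel image.

    image:           (H, W) array
    activation_maps: (K, h, w) array from the last convolutional layer
    classify:        hypothetical callable, (H, W) image -> 1D class probabilities
    target_class:    class index to explain; None uses the largest probability
    """
    H, W = image.shape
    K, h, w = activation_maps.shape
    saliency = np.zeros((H, W))
    for k in range(K):
        # Step 2a: upscale the map to the input size (nearest neighbour for brevity).
        upscaled = np.kron(activation_maps[k], np.ones((H // h, W // w)))
        span = upscaled.max() - upscaled.min()
        if span == 0:
            continue  # a flat map carries no spatial information
        # Step 2b: normalize to [0, 1] and project onto the input by multiplication.
        normalized = (upscaled - upscaled.min()) / span
        # Step 3: class probabilities for the projected input.
        probs = classify(image * normalized)
        weight = probs.max() if target_class is None else probs[target_class]
        # Step 4: weighted sum of the activation maps.
        saliency += weight * upscaled
    return np.maximum(saliency, 0.0)  # ReLU
```

Passing `target_class` explicitly ties the weights to one class (for example the artifact class), which is a design choice of this sketch; leaving it as `None` follows the largest-probability wording of the description above.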

The strongly supervised U-Net13 model has already been successfully adapted for brightfield nuclei segmentation and its architecture is described in detail in the corresponding paper4 (Supplementary Fig. S2). The architecture consists of an encoder and a decoder connected by a bottleneck, and skip links which pass the signal from the encoder to the decoder. We used an encoder consisting of 15 convolutional layers that use convolutional filters of size 3 × 3 and a rectified linear unit (ReLU)4,38 activation function. After every third layer, there is a 2 × 2 max-pooling layer and a skip connection to the decoder. Symmetrically, the decoder has 15 convolutional layers with ReLU activation functions. After every third convolutional layer, there is an upsampling layer that upscales its input height and width by a factor of 2. Finally, the bottleneck after the encoder has three convolutional layers. There are 64 filters in each convolutional layer in the encoder, decoder, and bottleneck.

Model training and evaluation


To train the ScoreCAM-U-Net model, the ResNet5037 classification model in the ScoreCAM18 framework was first trained to classify clean and artifact-containing images. The model was trained on the seven cell lines dataset using 482 images for training, 101 for validation and 104 for testing; on ArtSeg-CHO-M4R, using 1386 images for training, 404 for validation, and 572 for testing; and on LNCaP, using 70 images for training, 16 for validation and 16 for testing. The test set in the ArtSeg-CHO-M4R dataset was chosen such that ten concentration–response curves with multiple competitive ligands could be obtained. The Adam optimizer39 was used to optimize binary cross-entropy loss for 150 epochs. The initial learning rate (0.002) was reduced by a factor of 10 when the validation loss did not improve for 10 consecutive epochs.
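The learning-rate schedule can be made concrete with a small framework-free sketch. The function name and the exact plateau rule (reduce once the count of non-improving epochs reaches the patience) are assumptions; deep learning frameworks differ slightly in how they count patience.

```python
def reduce_on_plateau(val_losses, initial_lr=0.002, factor=0.1, patience=10):
    """Return the learning rate used in each epoch, given per-epoch validation losses."""
    lr = initial_lr
    best = float("inf")
    stale = 0  # epochs since the validation loss last improved
    schedule = []
    for loss in val_losses:
        schedule.append(lr)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # reduce by a factor of 10
                stale = 0
    return schedule
```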

The output of ScoreCAM was binarized with a threshold of 0.05 and used as pseudo-labels for the U-Net model, which was subsequently trained to segment the artifacts using the same dataset splits and training procedure. The ScoreCAM binarization threshold was selected to maximize the pixel-wise IoU on the validation set. More precisely, ScoreCAM-U-Net was trained using different thresholds for binarizing the results of ScoreCAM before using them as pseudo-labels for the U-Net model. The results were then evaluated using the validation set and the threshold with the best IoU score was selected (Supplementary Fig. S3). All the experiments were conducted using a Tesla V100-PCIE-32GB Graphics Processing Unit.
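The threshold search amounts to a simple loop over candidate values. In the sketch below, `validation_iou` is a hypothetical callable that would train the U-Net on pseudo-labels binarized at the given threshold and return the pixel-wise IoU on the validation set; the candidate values are likewise left to the caller.

```python
def select_threshold(candidates, validation_iou):
    """Pick the ScoreCAM binarization threshold with the highest validation IoU.

    candidates:     iterable of thresholds to try
    validation_iou: hypothetical callable, threshold -> validation IoU of the
                    U-Net trained on pseudo-labels binarized at that threshold
    """
    scores = {t: validation_iou(t) for t in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```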

Comparison with other methods

We compared the segmentation results obtained from the ScoreCAM-U-Net to a number of alternative solutions. As ScoreCAM-U-Net is a combination of ScoreCAM and U-Net models, we first compared our performance to each of these models separately. We expected a strongly supervised U-Net model trained on pixel-level annotations to show better performance than its weakly supervised counterparts. We also compared the proposed approach to the current state-of-the-art algorithms used to detect anomalies using image labels in domains other than microscopy: Patch Support Vector Data Description (PatchSVDD)26,27, Patch Distribution Modeling (PaDiM)27, and an autoencoder-based method (AE)24. All model architectures, training parameters and training processes are adopted here as defined in the original papers24,26,27.

PaDiM and PatchSVDD are both embedding similarity-based methods that use convolutional neural network-based approaches (encoders) that learn robust and short representations (embeddings) from patches of clean images. During the inference, the encoders are used to extract embeddings from test image patches and compare them to the embeddings extracted from the clean images based on a similarity metric. The main difference between these two methods is in the similarity metric employed to compare the embeddings as well as in the way the embeddings are constructed. PaDiM applies the Mahalanobis distance metric40 and constructs an embedding by combining the features of multiple encoder layers, whereas PatchSVDD uses the Euclidean distance metric and constructs the representation from the features of a single encoder layer. Based on the embedding comparisons, each test image patch is assigned a similarity score in which a low similarity score indicates the presence of artifacts. The final segmentation of each test image is constructed after the similarity scores of these patches are distributed to their pixels and the corresponding patch segmentations are merged together.
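The difference between the two distance metrics can be illustrated with a simplified NumPy sketch. It assumes that patch embeddings are already available as vectors, scores a test patch by its distance to clean embeddings (a large distance corresponds to a low similarity), and uses hypothetical function names; the original methods involve further details such as multi-layer features and per-position statistics that are omitted here.

```python
import numpy as np

def euclidean_score(patch_emb, clean_embs):
    """Euclidean distance to the nearest clean embedding (PatchSVDD-like)."""
    return float(np.min(np.linalg.norm(clean_embs - patch_emb, axis=1)))

def mahalanobis_score(patch_emb, clean_embs, eps=0.01):
    """Mahalanobis distance to the clean-embedding distribution (PaDiM-like)."""
    mean = clean_embs.mean(axis=0)
    # A small ridge term keeps the covariance matrix invertible (assumed value).
    cov = np.cov(clean_embs, rowvar=False) + eps * np.eye(clean_embs.shape[1])
    diff = patch_emb - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

Unlike the Euclidean distance, the Mahalanobis distance penalizes deviations more strongly along directions in which the clean embeddings vary little.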

The AE method also utilizes a convolutional neural network-based approach (an encoder-decoder network architecture) that first learns representations of the clean input images (using the encoder) and then reconstructs the original clean input images from the learned representations (via the decoder). During inference, the trained model is expected to fail to reconstruct the artifactual areas of the test images, as the network has only acquired rich representations of clean images. Therefore, artifacts manifest themselves as areas with a high pixel-wise difference between the input image and its reconstructed counterpart.

We measure the ability of the models to correctly identify the presence of an artifact in the image using the F1 score which is the harmonic mean of precision and recall. We also assess the segmentation performance via calculating pixel-wise precision, recall, F1, and the intersection over union (Box 1).
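For reference, the pixel-wise metrics can be computed from two binary masks as sketched below. The function name is hypothetical and the formulas are the standard definitions, which are assumed to match those given in Box 1.

```python
def pixel_metrics(pred, truth):
    """Pixel-wise precision, recall, F1 and IoU for two equally sized 0/1 masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            tp += int(p == 1 and t == 1)  # artifact pixel correctly predicted
            fp += int(p == 1 and t == 0)  # background predicted as artifact
            fn += int(p == 0 and t == 1)  # artifact pixel missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```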


We first binarized the probability maps produced by the models at cutoffs of 0.75 for AE, 0.3 for PaDiM, 0.0005 for PatchSVDD, 0.001 for ScoreCAM, 0.001 for ScoreCAM-U-Net and 0.45 for U-Net. These cutoffs were selected to maximize pixel-wise IoU (Box 1) performance on validation data. We then filtered out objects smaller than 1000, 500, and 500 pixels in the seven cell lines, the ArtSeg-CHO-M4R, and the LNCaP datasets respectively using the remove_small_objects function from the skimage package41. The sizes of the filtered-out objects were selected to maximize the pixel-wise IoU for the majority of the models, and different sizes do not drastically change the performance of the models (Supplementary Tables S2, S3, and S4). We recommend using expert knowledge to select the size of objects to filter out.
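The size filter can be illustrated without any dependencies. The sketch below is a pure-Python stand-in for the skimage function, not a call to it: it assumes 4-connectivity and drops connected components with fewer than `min_size` pixels, and its name is hypothetical.

```python
from collections import deque

def remove_small_objects_sketch(mask, min_size):
    """Drop 4-connected foreground components with fewer than min_size pixels."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the component only if it is large enough.
            if len(component) >= min_size:
                for y, x in component:
                    out[y][x] = 1
    return out
```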

Measuring impact of artifacts and artifact removal on the downstream analyses

To evaluate the utility of removing artifacts in microscopy experiments, we focused on two common types of downstream analyses: nuclei segmentation and effective concentration estimation from concentration–response assays. The former is a standard step in the majority of cell microscopy workflows while the latter is an example of a commonly used pipeline where cell segmentation is used for image intensity quantification which is followed up by regression analysis.

Nuclei segmentation

In order to assess how nuclei segmentation accuracy inside the artifactual regions compares to artifact-free areas, we evaluated the performance of nuclei segmentation in the seven cell lines dataset inside and outside the artifactual areas. To detect and segment the nuclei from the brightfield images we used an existing PPU-Net3 model. The training, ground truth preparation, and post-processing steps for this model are described in the original publication3. We calculated segmentation pixel-wise F1 and object-wise F1 scores (Box 1), following previously described approaches3,42, and morphological properties (size and solidity) of the resulting nuclei.

Ligand affinity estimation

In downstream analysis of pharmacological experiments, the cell bodies are segmented from brightfield images using a U-Net-based deep learning model29, and the cell fluorescence intensities are quantified from a parallel fluorescence channel based on the segmentation. The fluorescence intensities of cells depend on the strength of interaction (affinity) between the protein and the interaction partner (ligand) as well as the ligand concentration. The strength of protein–ligand interaction is determined using regression analysis of competitive ligand concentrations and the well average fluorescence intensity information from up to 64 individual images.

We studied the impact of artifacts and artifact removal on the determination of receptor-ligand interaction affinity. For that, in each of the ten individual concentration–response experiments, the cells were detected from brightfield images using a previously developed U-Net based segmentation model with an F1-score of 0.8929. The artifactual areas determined manually or with ScoreCAM-U-Net were removed from the analysis. As an experimental control, the analysis was also carried out without any artifact removal. The average intensity of the detected cell pixels as well as the average intensity of the background were determined from the aligned red fluorescence protein filter (excitation: 531(40) nm, emission: 593(40) nm) fluorescence images acquired in parallel with the brightfield images. The values were averaged for all images from the same well. For each well, the specific fluorescence intensity of the bound fluorescent ligand was calculated as the difference between the cellular and background fluorescence intensities. LogIC50 values corresponding to half maximal displacement of the fluorescent ligand were obtained via nonlinear regression analysis. For that, the dependence of fluorescence intensity on the competitive ligand concentration was fitted with the Hill equation using GraphPad Prism 5.0 and the "log(inhibitor) vs. response" nonlinear regression model, which is equivalent to fitting a logistic function.
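The quantities entering the regression can be written out as follows. This is a sketch of a commonly used parameterization of the inhibitor concentration–response model, assumed rather than taken from the analysis files; the function and parameter names are hypothetical, and the default slope of -1 is an assumption corresponding to a standard-slope displacement curve.

```python
def specific_intensity(cell_mean, background_mean):
    """Specific fluorescence of bound ligand: cellular minus background intensity."""
    return cell_mean - background_mean

def log_inhibitor_response(log_conc, bottom, top, log_ic50, hill_slope=-1.0):
    """Response at a given log10 competitor concentration.

    Y = bottom + (top - bottom) / (1 + 10 ** ((log_ic50 - log_conc) * hill_slope))
    With a negative slope the response falls from `top` to `bottom` as the
    competitor concentration increases.
    """
    return bottom + (top - bottom) / (1 + 10 ** ((log_ic50 - log_conc) * hill_slope))
```

At `log_conc == log_ic50` the response lies halfway between the two plateaus, which is the half maximal displacement that LogIC50 refers to.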

Concentration–response experiments serve as a good example of image analysis pipelines that rely on image intensity calculation and regression in the downstream analysis. For quantifying the quality of the full pipeline, we chose the absolute difference between the LogIC50 values calculated from manual artifact removal and the alternative option. The difference of LogIC50 values describes how accurately pharmacological parameters can be obtained with and without anomaly removal. We also used the R2 value of the Hill equation fit as a metric, which reflects the overall agreement between the experiment and the model. Finally, we chose the Pearson’s correlation coefficient r between predicted fluorescence intensity values using manual artifact removal and the alternative method, which allows isolating the effect of artifacts on the signal directly without the influence of other sources of uncertainty.
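The two agreement metrics can be computed with the standard formulas, as in this dependency-free sketch (function names are hypothetical, and the inputs are assumed to be equally long sequences of per-well values).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def r_squared(observed, fitted):
    """Coefficient of determination of a fit: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```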


To develop and test a weakly supervised method for artifact segmentation, we confirmed that artifacts exist and are prevalent in brightfield microscopy images, annotated artifacts in three datasets, tested models for finding them automatically, and evaluated the impact of their removal on downstream analysis results.

Artifacts in brightfield images are prevalent and diverse

The artifacts in the seven cell lines dataset range from very big (e.g. a clump of detached cells covering 49% of the image pixels) to tiny ones only a few pixels in size. The average annotated artifact size in this dataset is 4,417 pixels, which is larger than a typical nucleus in this dataset, and 16% of images had at least 10% of their area covered by artifacts. The artifacts in the seven cell lines dataset were heterogeneous in their size and morphological properties (Fig. 2, Supplementary Fig. S4).

Figure 2

Artifacts are heterogeneous, and range in shapes and sizes. A UMAP projection of all artifacts from the seven cell lines dataset. The inputs to the UMAP are the pixels of each patch that contains an artifact and the outputs are the first two features in the UMAP embeddings of each patch. We then used these two features respectively as ‘x’ and ‘y’ values to plot the corresponding input patch in 2D space.

In the LNCaP dataset, we annotated 60 objects that affected 6.5% of the images. Artifact sizes ranged from large (e.g. a hair covering 10% of the image pixels) to small (covering only 0.07% of the pixels), with the average artifact being 75,933 pixels (Supplementary Fig. S4).

In the ArtSeg-CHO-M4R dataset, almost all images had artifacts, with a total of 13,713 artifact objects in 1,171 affected images. Again, the largest object covered a large part of the image (e.g. 63% as a clump of detached cells), while the smallest one was a few pixels in size (Supplementary Fig. S4). An average artifact in this dataset had an area of 3,450 pixels, or 0.31% of image size.

Artifacts can be accurately detected with weak supervision

Next, we compared different approaches for artifact detection and segmentation qualitatively and quantitatively (Fig. 3A,B; Supplementary Fig. S5). We first evaluated the ability of the models to detect artifacts in the images. As ScoreCAM-U-Net and ScoreCAM both use the same ResNet classification backend, their detection performance is the same, with both models achieving image classification F1 scores of 93.2%, 93.7% and 90% in seven cell lines, LNCaP and ArtSeg-CHO-M4R datasets respectively (Fig. 3B, Supplementary Table S5). Other methods were less accurate, with the only exception of U-Net outperforming ScoreCAM-based models in the LNCaP dataset (99.4% F1 score for U-Net over 93.7% for ScoreCAM-U-Net; Fig. 3B).

Figure 3

Artifact segmentation and image-level classification results for all models (colors) in seven cell lines, LNCaP, and ArtSeg-CHO-M4R datasets. (A) Examples of brightfield images and the corresponding artifact segmentation of all models (columns, colors) and datasets (rows; separated by lines and dataset names). White contour: expert annotated artifact boundaries; colored contours: artifact segmentation boundaries of the corresponding model. (B) Different performance metrics for all models (colors) and datasets (rows). Left column: artifact segmentation precision (x-axis) and recall (y-axis) of artifact detection at different thresholds (points along the curve) for all models and datasets. Middle column: artifact segmentation pixel-wise IoU (y-axis) for all models and all datasets. Right column: image-level classification F1 score (y-axis) for all models (x-axis) and datasets (rows).

We then assessed the models’ performance in segmenting artifacts. ScoreCAM-U-Net outperformed the other non-strongly supervised models by achieving the highest area under the precision-recall curve, as well as the largest average object intersection over union on the seven cell lines and LNCaP datasets (Fig. 3B). There was no dominant weakly supervised model in the ArtSeg-CHO-M4R dataset. Compared to the strongly supervised U-Net model, ScoreCAM-U-Net achieved the second-highest IoU performance in the seven cell lines (49.5 ScoreCAM-U-Net vs 72.9 U-Net) and the LNCaP (39.9 ScoreCAM-U-Net vs 65.74 U-Net) datasets (Supplementary Table S5).

Although the strongly supervised approach outperformed weakly supervised methods, it took substantial time to prepare the pixel-level annotations required for the U-Net model compared to weak labeling. On average, an expert spent 279 s to produce a pixel-level annotation for a single microscopy image, while it took them only 2 s to indicate whether a given image contained an artifact. Hence, weakly supervised methods consume about two orders of magnitude less expert time in this case. Therefore, when choosing a method for dealing with artifacts, it is reasonable to take into account the dataset size and the amount of time needed to produce relevant annotations. For complex projects which require large training datasets for model development, generating precise pixel-level labels is prohibitively time-consuming, and hence weakly supervised approaches like ScoreCAM-U-Net are the best automated option available.
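The annotation-time comparison is simple arithmetic on the two per-image timings reported above; the variable names below are illustrative.

```python
from math import log10

PIXEL_LEVEL_SECONDS = 279  # average expert time per image, pixel-level annotation
IMAGE_LEVEL_SECONDS = 2    # average expert time per image, image-level label

ratio = PIXEL_LEVEL_SECONDS / IMAGE_LEVEL_SECONDS  # 139.5
orders_of_magnitude = log10(ratio)                  # slightly above 2
```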

Weakly supervised artifact removal improves downstream analysis

After establishing the quality of the proposed ScoreCAM-U-Net method for artifact detection and segmentation, we evaluated the impact of using it for cleaning images on two downstream applications.

Removing artifacts improves quality of nuclei segmentation

As artifacts distort pixels that otherwise represent nuclei (Fig. 4), we observed substantial degradation in nuclei segmentation performance due to artifacts. The pixel-wise F1 score decreased from 0.89 in artifact-free to 0.60 in artifactual regions; and the object-wise F1 score decreased from 0.65 in artifact-free to 0.28 in artifactual regions (Fig. 5). This had a direct impact on naive analyses that do not differentiate between artifactual and clean regions, reducing segmentation accuracy (0.87 pixel-wise F1, 0.61 object-wise F1; Fig. 5). Importantly, automatically removing artifacts using ScoreCAM-U-Net has the same impact as manual removal, improving the segmentation performance to near-optimal 0.89 and 0.64 pixel-wise and object-wise F1 scores (Fig. 5).

Figure 4

Visual impact of artifacts on nuclei segmentation. Two pairs of brightfield images with corresponding nuclei segmentation (in dark red) overlaid. Zoomed-in purple circles represent examples of artifact-free areas and artifactual areas (light red). White contours: artifact borders; yellow-ish white contours: nucleus ground truth borders; arrows and text: guides to corresponding regions and elements.

Figure 5

Impact of artifacts and artifact removal on downstream analyses. Density (y-axis) of image-average nucleus segmentation pixel-wise F1 (top, x-axis) and object-wise F1 (bottom, x-axis) in the seven cell lines dataset for different areas of the image (colors). Pink: area in the images manually annotated as not artifacts; blue: area in the images manually annotated as artifacts; green: area in the images automatically annotated as not artifacts by ScoreCAM-U-Net; yellow: all image area. Dashed lines: mean pixel-wise F1 and object-wise F1 of segmented nuclei in the artifactual and artifact-free regions (different colors).

We next considered nuclear size and morphology metrics with and without artifact correction. Nuclei in areas containing artifacts show different morphological properties with nuclei solidity of 0.92 and size of 213 pixels while the same properties are 0.95 and 400 pixels respectively in the artifact-free regions (Supplementary Fig. S6). In concordance with the segmentation results, automatically removing artifacts using ScoreCAM-U-Net recovers the expected nuclei size and solidity of 397 pixels and 0.95 respectively for artifact-free areas, again performing close to the gold standard of manual removal (Supplementary Fig. S6). These results demonstrate that automatic removal can overcome the detrimental effect of artifacts with quality close to manual filtering.

Removing artifacts improves pharmacological parameter estimates

Cell segmentation in images is commonly used to determine biochemical or pharmacological parameters from microscopy experiments. A common example is quantifying the image intensity of the segmented areas, which can be followed by a test of significance or a regression analysis to determine biochemical parameters such as the half-life of a reaction or the half maximal effective concentration of a substance. We analyzed how the presence of artifacts affects the quality of a microscopy image-based analysis used to determine ligand affinity to M4 muscarinic receptors using the ArtSeg-CHO-M4R dataset. Manual artifact removal has a clear effect on both the plateau locations and the estimated Log(IC50) values (Fig. 6, Supplementary Fig. S7). The mean absolute difference between Log(IC50) calculated with manual artifact removal and with no artifact removal is 0.29 units, equivalent to a roughly two-fold error in dose (10^0.29 ≈ 1.95). In contrast, after automatic artifact removal with ScoreCAM-U-Net, the Log(IC50) difference from manual artifact removal was reduced to just 0.16 units, which is similar to the standard deviation of 0.11 observed between biological replicate experiments. The model fit explained 0.89, 0.86, and 0.74 of the data variation for manual artifact removal, ScoreCAM-U-Net artifact removal, and no artifact removal, respectively. Finally, the Pearson’s correlation coefficient of well-average fluorescence intensities between manual and ScoreCAM-U-Net artifact removal is 0.98, while the correlation between manual artifact removal and no artifact removal is only 0.93. Overall, removing artifacts increases replicate correlation, which in turn reduces estimate uncertainty, and the estimated ligand affinity better reflects the values established from manually cleaned images. This confirms that artifact removal considerably improves downstream regression or statistical analyses that rely on image intensity quantification.
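To make the regression step concrete, the following is a minimal sketch of fitting a displacement curve with the Hill coefficient fixed at −1. It assumes the common four-parameter logistic form with `bottom` and `top` plateaus, and it uses a grid search over Log(IC50) combined with linear least squares for the plateaus. This is a simple stand-in for illustration, not the fitting software or procedure used by the authors, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def hill(log_conc, bottom, top, log_ic50):
    """Displacement curve with the Hill coefficient fixed at -1 (assumed form)."""
    return bottom + (top - bottom) / (1.0 + 10.0 ** (log_conc - log_ic50))

def fit_log_ic50(log_conc, signal, grid=None):
    """Grid search over log_ic50; plateaus solved by linear least squares."""
    log_conc = np.asarray(log_conc, dtype=float)
    signal = np.asarray(signal, dtype=float)
    if grid is None:
        grid = np.linspace(log_conc.min(), log_conc.max(), 2001)
    best = None
    for log_ic50 in grid:
        # For a fixed log_ic50 the model is linear in the plateaus:
        # signal = bottom * (1 - s) + top * s
        s = 1.0 / (1.0 + 10.0 ** (log_conc - log_ic50))
        design = np.column_stack([1.0 - s, s])
        coef, *_ = np.linalg.lstsq(design, signal, rcond=None)
        sse = float(np.sum((design @ coef - signal) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(log_ic50))
    sse, bottom, top, log_ic50 = best
    sst = float(np.sum((signal - signal.mean()) ** 2))
    # r2 is the fraction of data variation explained by the fitted curve
    return {"bottom": bottom, "top": top, "log_ic50": log_ic50,
            "r2": 1.0 - sse / sst}

def fold_error(delta_log_ic50):
    """Convert a difference in Log(IC50) into a fold difference in dose."""
    return 10.0 ** abs(delta_log_ic50)

# Synthetic, noise-free example (values are illustrative only)
log_conc = np.linspace(-10.0, -4.0, 13)
signal = hill(log_conc, bottom=100.0, top=1000.0, log_ic50=-7.0)
fit = fit_log_ic50(log_conc, signal)
```

With this helper, `fold_error(0.29)` evaluates to about 1.95, which corresponds to the roughly two-fold dose error associated with the 0.29-unit Log(IC50) difference, while `fold_error(0.16)` is about 1.45.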

Figure 6

Cell fluorescence intensity dependence on M4 receptor ligand concentration determined with live-cell fluorescence microscopy in the presence of 2 nM UR-CG072. Displacement curves of three different ligands are shown: pirenzepine (A,C), atropine (B) and arecoline (D). Three different artifact removal methods at the image analysis stage are compared (colors): manual artifact segmentation, ScoreCAM-U-Net segmentation and no artifact removal. For each combination of ligand and artifact removal method, a regression analysis is performed with the Hill equation (Hill coefficient fixed at −1), with the best fits shown as continuous lines. For each displacement curve, the Log(IC50) ± SD is presented, where SD represents the estimated standard deviation of Log(IC50). Each displacement curve was measured in duplicate, with each data point representing the average fluorescence intensity of cells in each well.


We proposed ScoreCAM-U-Net, a deep learning model for identifying artifacts in brightfield microscopy images that combines the benefits of weakly supervised learning, which does not require delineating objects, and strongly supervised learning, which provides pixel-level resolution. ScoreCAM-U-Net outperforms the other models that are not fully supervised in segmenting artifacts from the images. Moreover, ScoreCAM-based methods outperform the others in detecting artifactual images, with the only exception of U-Net in the LNCaP dataset. This exception is likely due to the relatively small size of the LNCaP test set (16 images), where misclassifying a single image is enough to cause this difference in classification accuracy. Inspecting the results, we found that ScoreCAM-U-Net consistently classified one of the clean images as artifactual, while U-Net missed it only once. As ScoreCAM-U-Net is trained using only image-level labels, generating training data is orders of magnitude faster than with pixel-level annotation, without substantially sacrificing performance. To our knowledge, this is the first attempt to automatically detect artifacts in large sets of brightfield microscopy images.

Several factors can contribute to dissimilar model performance across datasets, including but not limited to dataset size, the nature of the artifacts, and the average artifact size. We speculate that the relatively inferior performance of the models on the ArtSeg-CHO-M4R dataset compared to the other two datasets could be attributed to the blurrier artifact borders and the relatively small average artifact size (“Artifacts in brightfield images are prevalent and diverse”). In contrast, the more moderate fuzziness of the borders in the seven cell lines and LNCaP datasets may contribute to the better performance on them than on ArtSeg-CHO-M4R, with the seven cell lines dataset performing best because it has more representative samples in its training set.

Because the borders of the artifacts are blurry, the predicted and annotated pixels do not match exactly, which leads to an accuracy gap. However, the impact of artifact removal on downstream tasks, rather than classification and segmentation performance, is arguably the most relevant metric for practical application (“Weakly supervised artifact removal improves downstream analysis”). Our results demonstrate that artifacts have an adverse impact on nuclei segmentation and that detection and measurement of nuclei improve when such artifacts are removed. We showed that this impact manifests both in quantitative segmentation metrics, such as pixel-wise and object-wise F1 scores, and in morphological properties of the nuclei, such as solidity and size, which are central for cytometry applications. Almost all study designs that use large-scale cell microscopy and an image quantification-based readout would benefit from our model.

One important application of cell microscopy is intensity quantification for studying the localization and co-localization of fluorescently labeled molecules. To exemplify this type of analysis, we studied how artifact removal affects the calculation of drug-receptor binding affinities based on live-cell fluorescence and brightfield microscopy. After artifact removal with ScoreCAM-U-Net, the estimated ligand affinities are in better agreement with the values established from manually cleaned images. The model-based estimates also reduce linear regression uncertainty and the variability of results between independent experiments, indicating a combination of a better fit of the theoretical model and improved reproducibility of the measurements. Thus, artifact removal improves image intensity quantification independent of the nature of the statistical analysis applied downstream.

Our ScoreCAM-U-Net method establishes the utility of automatically segmenting artifacts from brightfield microscopy images. The key advantage of our approach is its scalability, such that clean images can be obtained for screening campaigns that would be prohibitively expensive to process manually. For example, a screening experiment generating 100,000 images could take around 388 h of continuous work by an expert to delineate artifacts from only 5% of them.
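As a rough check of the scale involved, the figures quoted above imply the following per-image annotation time. This is derived here under the assumption that annotation time is uniform across images; it is not a number stated separately by the authors.

```latex
\[
0.05 \times 100{,}000 = 5{,}000 \ \text{images}, \qquad
\frac{388\ \text{h} \times 60\ \text{min/h}}{5{,}000\ \text{images}}
= \frac{23{,}280\ \text{min}}{5{,}000\ \text{images}}
\approx 4.7\ \text{min per image}
\]
```

Under the same assumption, delineating artifacts in all 100,000 images would take about 7,760 h of continuous expert work.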

A limitation of our approach is its inability to differentiate between types of artifacts. For example, the current model cannot tell whether one image contains cell debris and another contains bacterial contamination. A natural extension could build on our approach to train a model that differentiates between artifact types. Other extensions could apply the power of deep learning to other imaging modalities, such as histopathology, and further reduce annotation time. We envision that ultimately, all common artifacts will be automatically segmented and optionally removed at the time of acquisition, with no input needed from the operator. Moreover, we believe that the encouraging results presented in this work will motivate the use of weakly supervised segmentation methods such as ScoreCAM-U-Net in other areas where pixel-level annotations are prohibitively expensive or time-consuming to acquire, e.g., medicine.