CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical structures

The field of bioimage analysis is currently undergoing a profound transformation, driven by advances in imaging technologies and artificial intelligence. The emergence of multi-modal AI systems could make it possible to extract and utilize knowledge from bioimaging databases based on information from other data modalities. We leverage the multi-modal contrastive learning paradigm, which embeds both bioimages and chemical structures into a unified space by means of bioimage and molecular structure encoders. This common embedding space unlocks the possibility of querying bioimaging databases with chemical structures that induce different phenotypic effects. Concretely, in this work we show that a retrieval system based on multi-modal contrastive learning is capable of identifying the correct bioimage corresponding to a given chemical structure from a database of ~2000 candidate images with a top-1 accuracy >70 times higher than a random baseline. Additionally, the bioimage encoder demonstrates remarkable transferability to further prediction tasks within the domain of drug discovery, such as activity prediction, molecule classification, and mechanism of action identification. Thus, our approach not only addresses the current limitations of bioimaging databases but also paves the way towards foundation models for microscopy images.

1 Supplementary Notes

CLOOME hyperparameter search space
Below, we state the hyperparameter selection for the results reported in this study, based on performance on a validation set for each of the downstream tasks. For the retrieval task, the model was trained for 70 epochs for the random split and for 60 epochs for the scaffold split, selected by top-1 accuracy on the validation set. For zero-shot molecule classification, the selected models were trained for 63 and 57 epochs for the random and scaffold splits, respectively. For zero-shot mechanism-of-action classification, the selected models were trained for 70 and 69 epochs for the random and scaffold splits, respectively.
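The model-selection procedure described above can be sketched as picking the checkpoint with the best validation metric. A minimal illustration (the function name and dictionary format are our own assumptions, not the authors' code):

```python
def select_best_epoch(val_metric_by_epoch):
    """Pick the training epoch with the highest validation metric
    (e.g. top-1 retrieval accuracy), as used for model selection.

    val_metric_by_epoch: dict mapping epoch number -> validation score.
    Returns (best_epoch, best_score).
    """
    best = max(val_metric_by_epoch, key=val_metric_by_epoch.get)
    return best, val_metric_by_epoch[best]
```

In practice, one such metric trace per split (random and scaffold) would yield the epoch counts reported above.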

Bilinear model hyperparameter search space
The hyperparameters of the bilinear models that yielded the results shown in this paper were also selected based on performance on a validation set and are shown in Supplementary Table 6.

Retrieval task results for sampled images or molecules
In Supplementary Table 7, we report the top-1, top-5 and top-10 accuracies for structure and image retrieval, respectively, at a sampling rate of 1%, or equivalently, 1 matched example along with 99 unmatched ones, a setting often used to evaluate retrieval systems.

Supplementary Table 7: Results for the retrieval task among 100 candidates. Given a molecule-perturbed microscopy image, the matched molecule must be selected from a set of candidates, and vice versa. Top-1, top-5 and top-10 accuracies in percent are shown for a hold-out test set, along with the upper and lower limits of a 95% confidence interval (CI) on the resulting proportion (n = 2,115 for the random split and n = 1,398 for the scaffold split). The best method in each category is marked in bold.
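The quantities in this table can be computed from paired, L2-normalised embeddings. A minimal NumPy sketch, assuming cosine similarity for ranking and a normal-approximation CI on the accuracy proportion (the function names and the choice of CI method are our own, not necessarily the authors' implementation):

```python
import numpy as np

def topk_retrieval_accuracy(img_emb, mol_emb, ks=(1, 5, 10)):
    """Fraction of queries whose matched item ranks within the top k.

    img_emb, mol_emb: (n, d) L2-normalised embeddings; row i of each
    matrix corresponds to the same image/molecule pair.
    """
    sims = img_emb @ mol_emb.T                       # (n, n) cosine similarities
    order = np.argsort(-sims, axis=1)                # candidates sorted best-first
    # rank (0 = best) of the matched candidate for each query
    ranks = np.argmax(order == np.arange(len(sims))[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}

def proportion_ci95(p, n):
    """Normal-approximation 95% CI on a proportion (one plausible choice)."""
    half = 1.96 * np.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```

Evaluating among 100 candidates corresponds to calling `topk_retrieval_accuracy` on sampled sub-matrices of 1 matched plus 99 unmatched examples.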

Downstream tasks evaluation with corrupted images
In this section, we evaluate the performance of CLOOME when applying different corruptions (see Figure 1), which were not considered during pre-training, to the test images. The goal of this evaluation is to assess the model's robustness to changes in the data distribution and to simulate a scenario where the images used for inference exhibit a domain shift relative to the images used during pre-training.
Different corruptions and their effects on performance metrics. In this experiment, we investigated the impact of the following transformations on performance metrics (see Table 8): random horizontal and vertical flipping, small rotation (from -10 to 10 degrees with respect to the center of the image), random horizontal and vertical flipping combined with small rotation, and large rotation (from -180 to 180 degrees). The performance metrics drop only slightly for most tasks. In fact, for the cross-modal retrieval tasks (Tables 8 and 9), the performance on corrupted images remains mostly within the confidence intervals of the previous evaluation with the original images. As shown in Figure 1, rotations introduce wedge-like structures that are added to fill the rectangular shape, which could explain the lower accuracy. As expected, introducing image transformations not considered during training slightly affects the performance metrics.
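The corruptions above can be sketched with plain NumPy. The nearest-neighbour rotation below fills out-of-frame pixels with a constant, which reproduces the wedge-like border structures mentioned in the text (a minimal illustration; the authors' actual augmentation pipeline and fill value are not specified here):

```python
import numpy as np

def rotate_nn(img, degrees, fill=0.0):
    """Rotate a 2-D image about its centre with nearest-neighbour sampling.

    Output pixels whose source falls outside the original frame get `fill`,
    producing the wedge-like structures added to keep the rectangular shape.
    """
    h, w = img.shape
    theta = np.deg2rad(degrees)
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # inverse rotation: for each output pixel, find its source location
    xs = np.cos(theta) * (xx - cx) + np.sin(theta) * (yy - cy) + cx
    ys = -np.sin(theta) * (xx - cx) + np.cos(theta) * (yy - cy) + cy
    xs, ys = np.rint(xs).astype(int), np.rint(ys).astype(int)
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    out = np.full_like(img, fill, dtype=float)
    out[inside] = img[ys[inside], xs[inside]]
    return out

def corrupt(img, rng):
    """Random horizontal/vertical flips plus a small (-10..10 degree) rotation."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    return rotate_nn(img, rng.uniform(-10, 10))
```

Rotating a constant image by 45 degrees, for instance, leaves zero-valued corner wedges where no source pixel exists.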
Further experiments with corrupted images. We selected one of these transformations (specifically, random horizontal and vertical flipping together with small rotation) to show the effect of using corrupted images on the remaining downstream tasks. Regarding bioactivity prediction performance, shown in Table 10, the mean AUC changes only from 0.714 to 0.713. As shown in Tables 11 and 12, for the zero-shot tasks, the considered distortions affect performance more than in the retrieval and linear probing tasks. A possible explanation is that image embeddings corresponding to cells treated with different molecules are closer to each other than their corresponding structure embeddings. If this is the case, corrupting the images will have a larger impact on image-to-image (i.e., zero-shot) tasks than on cross-modal (i.e., retrieval) tasks.
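The zero-shot tasks discussed above amount to nearest-neighbour classification in the shared embedding space. A minimal sketch, assuming one reference embedding per class (the function name and interface are our own assumptions):

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs):
    """Assign each image to the class whose reference embedding (e.g. a
    molecular structure embedding) is nearest in cosine similarity.
    No task-specific training is involved, hence "zero-shot".

    image_emb: (n, d) image embeddings; class_embs: (c, d) class embeddings.
    Returns an (n,) array of predicted class indices.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    cls = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return np.argmax(img @ cls.T, axis=1)
```

Under this view, image corruptions perturb `image_emb` directly, which is consistent with zero-shot performance being more sensitive to them than cross-modal retrieval.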

Table 2 :
Considered hyperparameter space of CLOOME models. The selected configurations, based on manual search on a validation set, are shown in bold for the random split and in italics for the scaffold split.

Table 3 :
Considered hyperparameter space of CLOOME models. The selected configurations, based on manual search on a validation set, are shown in bold.

Table 4 :
Considered hyperparameter space of CLOOME models. The selected configurations, based on manual search on a validation set, are shown in bold for the random split and in italics for the scaffold split.

Table 5 :
Considered hyperparameter space of CLOOME models. The selected configurations, based on manual search on a validation set, are shown in bold for the random split and in italics for the scaffold split.

Table 6 :
Considered hyperparameter space for the bilinear model. The selected configurations, based on manual search on a validation set, are shown in bold for the random split and in italics for the scaffold split. The selected models were trained for 52 and 54 epochs for the random and scaffold splits, respectively.