Finding the semantic similarity in single-particle diffraction images using self-supervised contrastive projection learning

Single-shot coherent diffraction imaging of isolated nanosized particles has seen remarkable success in recent years, yielding in-situ measurements with ultra-high spatial and temporal resolution. The progress of high-repetition-rate sources for intense X-ray pulses has further enabled recording datasets containing millions of diffraction images, which are needed for the structure determination of specimens with greater structural variety and dynamic experiments. The size of the datasets, however, represents a monumental problem for their analysis. Here, we present an automatized approach for finding semantic similarities in coherent diffraction images without relying on human expert labeling. By introducing the concept of projection learning, we extend self-supervised contrastive learning to the context of coherent diffraction imaging and achieve a dimensionality reduction producing semantically meaningful embeddings that align with physical intuition. The method yields substantial improvements compared to previous approaches, paving the way toward real-time and large-scale analysis of coherent diffraction experiments at X-ray free-electron lasers.


Introduction
A guiding principle in fundamental condensed matter research is that for understanding function, we have to study structure [1].Techniques based on lensless diffractive imaging, like X-ray crystallography and coherent diffraction imaging (CDI), are powerful and widely used tools to discover structures up to atomic resolutions [2].In the recent past, single-particle coherent diffraction imaging using intense coherent X-ray pulses from free-electron lasers (SP-CDI) [2,3] has revolutionized the field of structural characterization [4][5][6][7][8][9][10][11][12].SP-CDI is a technique with which in-situ measurements of isolated and non-fixated nano-scaled targets can be acquired.Depending on the experimental scheme, each recorded diffraction image is a complete and self-contained experiment that needs individual analysis [3].However, due to the advent of high repetition-rate sources like the European XFEL [13] and LCLS-II [14], millions of images are typically recorded during one experimental campaign [15].Manual analysis of such amounts of data represents an enormous problem.It may leave researchers unable to analyze significant amounts of their data comparatively as they have to resort to large-scale averaging, which might wash out or conceal important information, or manually select subsets of the dataset.In this work, we present a novel embedding technique for diffraction images called contrastive projection learning (CPLR) based on contrastive learning (CLR) [16,17].CPLR produces a dimensionality-reduced embedding space with which semantic comparisons between diffraction images become possible and, thus, enables human-level comparative analysis on big-data scale datasets.
At the very core of every comparative analysis is an assumption establishing a similarity measure between samples.However, current approaches for establishing such a measure for diffraction images cannot compete with the perception of a trained researcher.This perceived similarity, or semantic similarity [18], is contextually aware [19], whereas computational methods for diffraction image data currently lack such awareness.
Available strategies for comparative analysis are based on either supervised classifier schemes [20,21] or unsupervised sorting methods [22][23][24][25][26].However, all these methods come with significant trade-offs: Supervised algorithms align with human perception and produce high-accuracy results [21,27] but are in real-world scenarios unavailable as they require time-and labor-expensive manually fabricated expert labels.Unsupervised routines work in such cases but do not reach comparable accuracy levels and introduce additional restrictions or requirements.For example, traditional cluster techniques produce a lot of unwanted predictions [22], threshold-based approaches act primarily as hit-finder [23,28], autocorrelation-based methods only extract particle size information and are computationally costly [24], auxiliary approaches rely on rarely available additional experimental data [24], and Fourier-inversion based techniques [15,24,26] are only applicable to SP-CDI data from the small-angle-scattering regime, where reconstruction by Fourier inversion is possible [10,29].
Our CPLR method can potentially improve all strategies mentioned above; It can serve as an improved similarity measure for unsupervised methods, as in the context of regular self-supervised learning [30][31][32], and can act as a powerful pretraining for subsequent supervised or distillation-based training [17].Furthermore, CPLR directly establishes a way to find semantically similar diffraction images in a fully self-supervised fashion.In self-supervised contrastive learning, a supervised task is constructed by artificially creating label information via domain-specific augmentation strategies [16,33,34].In this work, we design a novel augmentation approach for diffraction image data where a deep neural network contrasts images from different coordinate projections.
The quality of the CPLR embedding space is evaluated using a publicly available diffraction image dataset [35] from an SP-CDI experiment on superfluid helium nanodroplets, for which semantically sensitive expert labels are available [12,21].images taken from a publicly available dataset [35] from an SP-CDI experiment on superfluid helium nanodroplets [12,21].a and b are semantically similar while a and c are not.d shows four widely used similarity measures (see [37]), that disagree with the human perception, while only our method (labelled CPLR) agrees with it.
Color-coding is that the pair of images that is more-similar is green, while the one that is less-similar is orange.
Using the broadly established linear evaluation protocol [16,17,36], we show that our method outperforms non-contrastive methods by a large margin while improving the contrastivelearning baseline by 6 to 10 %.
Figure 1 provides a concrete example of three diffraction images from [12], a, b, and c.Two images (b and c) are to be compared to a.A human immediately identifies the elongated streak-like feature in a as the dominant characteristic and can identify b as being more similar to a than c is to a. And, indeed, this is correct from a physical perspective.The nanoparticles' structures that produce a and b are more similar to each other than those that produce a and c [12].However, from an algorithmic perspective, this is not a trivial problem. Figure 1 d shows for b and c five similarity measures: Four widely used measures for image similarity (see [37]) and our method (labeled CPLR).Color-coding is that the image that is more similar to a is green, while the less similar image is orange.The only measure that agrees with the human perception, and the physics of the problem, is our CPLR method.
Our approach is not limited to data from SP-CDI experiments.Theoretically, CPLR can be applied to diffraction data from all experimental techniques operating in Polar coordinates, including X-ray crystallography and traditional CDI approaches.Ultimately, CPLR provides a path for analyzing the impending amounts of diffraction data where the human perceived similarity is maintained even among millions of diffraction images.

Results and discussion
Contrastive learning is about augmentation, not architecture.In contrastive learning (CLR), we artificially create label information for supervised learning by designing augmentation pipelines that consider domain [38] and task [39] knowledge [16].Therefore, CLR is an instance of self-supervised learning [40].The fundamental assumption in self-supervised learning is that the input data contain more task-specific information than sparse categorical ground truth data in supervised learning [41].Consequently, a careful augmentation design should provide better results on downstream tasks than a supervised learning scenario [41,42].While improvements in Accuracy in supervised learning are usually related to architecture modifications, regularization, or loss function, CLR is about domain-specific augmentation strategies above anything else [16,41].Formally, CLR is a technique to create an embedding space from arbitrary input modalities, enabling comparative analysis.CLR dates back to work done in the nineties [42] but only recently has seen a renaissance, yielding State-of-theart results in visual- [16,17,36,43], audio- [44][45][46], video- [47][48][49], and text-representation [50,51] learning.

Contrastive baseline and contrastive projection learning.
In this study, we use the experimental design presented in [17], called SimCLRv2, as a baseline to compare our results.A large encoder and a smaller transformation neural network produce the representations in two stages.First, the encoder acts as a feature extractor; then, the transformation network learns an optimized representation that minimizes the CLR loss, termed normalized temperature-scaled cross-entropy loss (NT-Xent) [52].Conceptually, a duplicate is produced for each input image where both images are heavily augmented.All duplicates form so-called positive pairs with their originals, while all images with all other images but their duplicates form negative pairs.Then, the network learns to discriminate between positive and negative pairs during training.
In contrastive projection learning (CPLR), we produce the positive pairs not from the same image, as in [16,17], but project the diffraction images, which are naturally recorded in Polar coordinates, to Cartesian coordinates.Figure 2 i) and ii) provide a schematic overview of the network design and the conceptual idea.Using coordinate projections as an augmentation strategy implicitly penalizes that trivial Polar symmetries are learned and explicitly enforces that learned representations are invariant under rotational and translational changes.A simple example can be constructed with the help of figure 2 iii).There, example images for every class in the dataset are shown.While figure 2 iii) is fully explained in the following subsection, here, we concentrate on the Polar and Cartesian projection of the Elliptical class -the first column.The Polar form shows the characteristic Airy rings typical for single-laser-shot and single-particle imaging data [12].Usually, learning rotational invariance in arbitrary image data is achieved via random rotational transformations during the augmentation stage of training a network.However, such a transformation would yield no, or very little, change with such diffraction images due to the high degree of rotational symmetry.Therefore, we encourage the network to decouple the learned representation from rotational symmetries by correlating the Polar form with the Cartesian form.In addition, after the coordinate transformation, we leverage a stochastic augmentation pipeline [16,17].More details can be found in the Methods section Augmentation strategy.The dataset.Helium nanodroplets were imaged at XUV photon energies in a single-shot single-particle experiment [12] at the FERMI free-electron laser [54].The scattering images were recorded using a non-linear MCP-type detector [55] in a so-called wide-angle setting where each diffraction image contains 3D-structural information and cannot be reconstructed via Fourier inversion [10,12,56].The publicly available and hand-curated dataset [35] contains 7264 diffraction images with semantically sensitive labels and was previously used in a supervised classification task [21].We discarded 6000 diffraction images as they either exhibited strictly round or no Airy patterns at all.The round Airy patterns are by far the most common class, and we removed them to create a more balanced dataset since they can be reliably sorted using radial slices and a classical peak-finder [57].
The provided expert labels can be used for multiclass and multilabel analysis, meaning every diffraction image has binary label information for multiple classes that are often mutually non-exclusive.Figure 2 iii) shows this dataset's seven possible classes and their absolute and relative occurrence.For every class, one example is given in Polar and Cartesian form.To illustrate the multilabel property: The characteristic streak-like feature that defines the Streak class can also be found, for example, in the image for the Layered class.For this reason, the given percentages do not add up to 100 %, as multiple images belong to multiple classes and most classes are heavily under-represented.This pronounced multilabel imbalance is typical for diffraction image datasets [9,12].Training and evaluation.As in [17], we use a 2-wide [58], selective-kernel [59] ResNet50-D [53] network with squeezeexcitation blocks [60] as feature extractor, and a three-layer multi-layer-perceptron (MLP) as transformation network.We train for 1000 epochs, with a batch size of 628, using the LARS optimizer [61] with a cosine schedule [62], ten warmup epochs [63], and optimizing the NT-Xent loss [52].Training takes half an hour on four NVIDIA 3090 GPUs.The code, pretrained models, and training results are openly available at [64].
The quality of the learned embedding space is evaluated using the so-called linear evaluation protocol [16,17,36], which is carried out as follows: After self-supervised training, we freeze the feature extractor network and use the first layer of the transformation network (this is indicated by the green layer in 2 i)) as embedding space.Then, for every ground truth class in the dataset, we train a linear classifier on top of the learned representations and calculate the Precision and Recall score.Both metrics are obtained for every class via 5-fold stratified cross-validation to account for statistical fluctuations from sampling and the dataset's class imbalance.It has turned out to be important to use Precision and Recall as most classes are very rare, with three out of seven classes appearing in under 7 % of all images.In these cases, Accuracy would produce very high scores for classifiers predicting every image as not being part of any class.Moreover, we calculate an additional metric called Overlap in order to compare our method with metrics operating on raw images, such as the structural similarity (SSIM) index [65], the complex wavelet SSIM (CWSSIM) index [66], and the keypoint-based scale-invariant feature transform (SIFT) distance [67].Overlap is the global average of the normalized dot-product between the ground truth of every image and its 13 closest images according to the pairwisecalculated distances.13 was chosen as it corresponds to 1 % of all images in the dataset, which sets the Overlap score to be a local-neighborhood evaluation.Consequently, an Overlap score of 0.5 corresponds to: On average, the 13 most similar images shared 50 % of the original image's labels.
We compare our method to the SimCLRv2 baseline and other approaches that have been used in the past with diffraction images in the following sections.More details on training and evaluation can be found in the Methods section Training and linear evaluation strategy.The embedding space is linearly separable into semantic features.The evaluation scores after training are provided in table I.The second and third columns show the Precision and Recall score of the linear evaluation protocol, and the last two columns show the used metric for calculating the pairwise distances to calculate the Overlap score, which is given in the last column.Contrastive-based shows the results for the CLR baseline and our CPLR method.The arrows indicate the coordinate projections used, where the first term is used for inference and the second is used for constructing the contrastive task.Consequently, Polar ⇔ Polar and Cartesian ⇔ Cartesian are cases of the standard CLR framework with either purely unmodified (Polar) or purely projected (Cartesian) diffraction images.Cartesian ⇔ Polar and Polar ⇔ Cartesian are cases of our CPLR method, where the difference between the two is that we changed the input for inference at evaluation time to either the Cartesian or the Polar form.Continuous latent variables and variational Bayesian methods are techniques used previously in the context of diffraction images.More details are given in the Methods section Non-contrastive-based methods.Random baseline gives the result for an artificial embedding space built from uniform noise; this is the lowest possible score.This baseline is equivalent to a Random guesser with no learned information about the dataset.Direct measures are methods applied directly to the images, which cannot be evaluated using the linear evaluation protocol.
All methods except for the Direct measures are trained / run five times using five different integer random_state keys where the evaluation scores were each time obtained via 5-fold stratified cross-validation.The macro average [68] over all classes for all train and cross-validation runs is given in the table.The standard deviation of all methods is equal to or below 0.01.
Relative to the CLR baseline, the CPLR method yields significant improvements on all metrics improving Precision by 6 % and Recall by 10 %.Relative to the best non-CLR methods (VAE for Precision and all but UMAP for Recall), the CPLR method improves Precision by 35 % and Recall by 36 %.In addition, the Overlap score is relatively improved by about 6 % compared to the CLR baseline and 17 % compared to the TABLE I. Macroaverage [68] results on the helium nanodroplets dataset.Contrastive-based are the results for the CLR baseline and the CPLR method, where the arrows indicate with which projections the contrastive task was constructed.Continuous latent variables or variational Bayesian methods list techniques that have been used with diffraction data in the past.Random baseline gives the result for an artificial embedding space built from uniform noise, this is the lowest possible score.Direct measures are applied directly on the images, and cannot be evaluated using the linear evaluation protocol.All methods except for the Direct measures are trained / ran for five times where each time the evaluation scores where obtained via 5-fold stratified cross validation.The standard deviation of all methods is equal or below 0.01.The best result for each score is given in bold letters.
Linear CPLR is more robust with fewer samples.The general idea of using the linear evaluation protocol for evaluation is to look for linearly separable regions in the embedding space.Therefore, this method only applies to one-hot [72] ground truth data, meaning multiclass but single label.However, the helium nanodroplets dataset has multiclass and multilabel [73] ground truth data, where each image has multiple associated labels.Moreover, as typical in datasets on helium nanodroplets, the dataset is heavily unbalanced, where simpler shapes, like Elliptical, dominate other classes [9].It is, therefore, instructive to look at the individual averages for every class, which are given in table II.The CPLR method performs significantly better than the CLR baseline and non-contrastive methods in linear evaluation.The most significant improvement is with rarely occurring classes that appear only in ≤ 7 % of all images.VAE and PCA-based techniques fail entirely to place these diffraction images in a linearly separable region of the embedding space, resulting in poor Precision and Recall scores.However, the CLR baseline also yields limited success in the case of radial symmetry-breaking features like the Newton Rings and Layered class.There, the diffraction images contain features that either break radial symmetry (Layered) or introduce a second radially symmetric feature (Newton Rings), cf.figure 2 iii), which in combination with a low class-count brings the CLR method to its limits.The symmetry-breaking projection of the CPLR method helps in those cases and yields better results when fewer images are available.
A qualitative comparison of the CPLR, CLR, and VAE results is given in figure 3 i) to iii).In each plot, the column Input image shows the same three diffraction images, randomly chosen out of the three classes Streak, Layered, and Double Rings.Next to each input image are the four diffraction images belonging to the four closest embeddings in the embedding space.Additionally, every image is augmented in the top left corner by the class names given in the ground truth data and, for all images but the input image, by the Overlap score with its corresponding input image.
The images for the Streak / Bent class combination in the first row show strong Overlap scores for all three frameworks (1.00 for CPLR, 0.92 for CLR, and 0.79 for VAE).However, only the contrastive-based methods placed those embeddings of images next to each other where the characteristic streak feature is orientated and elongated differently than in the input image.We consider this a strength highlighting that both contrastive methods focus more on the semantics within a diffraction image than the pixel-wise similarity.This can also be seen in the Streak / Bent / Layered class combination in the second row of figure 3, where the direction of the characteristic Streak and Bent features vary substantially in size and orientation for both CLR-based methods but are identically aligned within the nearest neighbors of the VAE framework.
As already discussed above, our CPLR method outperforms the baseline CLR methods, especially in scenarios with low sample counts and symmetry-breaking features.This behavior is best seen in the second row for the Streak / Bent / Layered class combination.Only 27 images in the dataset have a Layered label assigned to them, and the CLR method fails to learn this characteristic having an average Overlap score of 0.25 for this example and only placing one additional image with a Layered label near the input image in the embedding space.
On the other hand, the CPLR method performs significantly better, with an average Overlap score of 0.67 and placing two images with a Layered label, two with a Streak, and one with a Bent label next to the input image.This observation also holds for the third example with the Elliptical / Double Rings example, where the CPLR method reaches an average Overlap score of 0.58, compared to 0.54 and 0.38 for the CLR and the VAE method, respectively.These qualitative observations, along with the quantitative results presented above, show conclusively that the CPLR method is introducing significant improvements compared to previous methods for finding the semantic similarity in diffraction images and the baseline CLR method.

Summary and outlook
We have introduced a novel method for finding the semantic similarities in diffraction images without relying on expert labeling.Based on contrastive learning (CLR), we introduced contrastive projection learning (CPLR), where the contrastive learning task is constructed from coordinate-projections of an input diffraction image and not from the same image as in CLR.This relatively easy alternation of the learning scenario substantially improves the quality of the learned embedding space on all metrics and scores.CPLR, therefore, provides a much-needed pathway for the upcoming big-data challenges within the coherent diffraction imaging (CDI) community since similarity calculations are at the core of almost every segmentation, classification, and clustering algorithm.Consequently, CPLR can be implemented as a stand-in-replacement for other similarity metrics in all so-far published classification and clustering approaches for diffraction images, potentially improving iii) The most similar images using the VAE framework In addition, our method can, theoretically, also be applied to all data that inherit Polar symmetry, such as in X-ray crystallography.
Our results have the potential to enable multiple future possibilities.For example, currently, 3D reconstruction via CDI methods can either be done in the small-angle regime, where reconstruction by Fourier inversion is possible [10,29], using the Expand-Maximize-Compress (EMC) algorithm [74], or in the wide-angle regime, via a recursive forward-fitting Multi-Slice-Fourier-Transform (MSFT) approach [10,29,56].In both cases, the similarity between diffraction images needs to be calculated.As of today, the EMC method can be applied to datasets on the order of millions of images [15,26].However, the similarity calculation is currently done using the cross-correlation between radial intensity profile lines of the diffraction images at different angles [15,26], which is computationally costly, and, as the authors in [26] pointed out, may not be sufficient for more complex patterns.As with the MSFT method, similarity calculations are currently done using either the MSE or even manual estimation by researchers [10,12,29,75].
Ultimately, CPLR can provide a path to apply the EMC algorithm on more complex datasets, get better results on simpler datasets, and replace the MSE metric in MSFT-based approaches.
Furthermore, it enables quick and reliable statistical reasoning on the variability and occurrence of features within diffraction image datasets, as was done in [12], for example.
Finally, recent research on contrastive methods in computer vision [43,76] promises accuracies comparable to supervised methods or surpassing them and can be easily implemented into our framework.
Therefore, this manuscript stands as a stepping stone for adapting self-supervised learning to the domain of diffraction imaging.
All code for the discussed experiments, pretrained models, and extracted embedding spaces are available at our ETH Gitlab repository [64].

Methods
Augmentation strategy.A well-defined augmentation strategy is critical in contrastive learning [16].As pointed out by [16,17], the essential parts of constructing this strategy are random cropping and random color distortion transformations.The latter is targeted towards histogram and colorchannel correlation-based overfitting of the network.Since diffraction data is monochrome, we replace the channel-independent RGB distortion with a single-channel jitter distortion.Furthermore, as in [16,17], we use a probabilistic augmentation strategy that includes Flip, Rotation, Crop & Resize, Jitter, Fill, and Translation transformations on all input patches.However, our Crop & Resize routine is not changing the aspect ratio, as is usually done in other contrastive learning augmentation pipelines.Changing the aspect ratio would break the correlation between the Polar and Cartesian projections.Every transformation has a fixed probability of 50 % for being applied at every invocation.We implemented the entire pipeline using TensorFlow augmentation layers placed on the GPU itself.Code is available in the official repository The linear classifier we used for linear evaluation was a single-layer perceptron with an inverse-scaling learning rate schedule and a l2 penalty of 0.0001.We used the implementation provided by the sklearn Python package.Code is available in the official repository [64].Non-contrastive-based methods.Listed in table I are the Factor-Analysis (FA), the Principal-Component-Analysis (PCA), the Kernel-PCA, the Uniform Manifold Approximation & Projection (UMAP) [69], and the Variational Autoencoder (VAE) [70,71] methods.All of these have been used with various forms of spectrographic image data.FA-and PCA-based methods are parameter-free dimensionality reduction techniques that are regularly used within all scientific disciplines; while FA considers the dataset's variance, PCA considers the covariance of the data.FA and PCA-based methods have been used with powder diffraction data [77,78] and X-ray diffraction phase analysis [79] and as a dimensionality reduction for subsequent classification [80] and clustering [22].
A VAE is a generative variational Bayesian model where the input information is encoded to a low dimensional representation via an encoder function and then recreated by a decoder function.The loss function, called the Evidence lower bound, is a lower bound on the marginal likelihood [70].VAEs have been used with diffraction images in various tasks, such as anomaly-detection [81], dimensionality reduction [82], phase reconstruction [83,84], and modeling the continuous 3D shape transition in heterogeneous samples [26].We train the VAE as described in [71], using the code from [85].UMAP is a dimensionality reduction technique based on manifold learning and topological data analysis and has been used with other spectrographic image data, such as Ronchigrams [86] and Audio spectrograms [87,88].We use UMAP with the default parameters and a fixed integer random_state for reproducibility.
The size of the low-dimensional representation for all mentioned methods was set to 1024, identical to the dimensionality of the CLR-based representation space.

43 FIG. 1 .
FIG.1.Illustrating semantic similarity.a, b and c are diffraction images taken from a publicly available dataset[35] from an SP-CDI experiment on superfluid helium nanodroplets[12,21]. a and b are semantically similar while a and c are not.d shows four widely used similarity measures (see[37]), that disagree with the human perception, while only our method (labelled CPLR) agrees with it.Color-coding is that the pair of images that is more-similar is green, while the one that is less-similar is orange.
FIG.2.Showing details on the CPLR architecture, the concept, and the classes within the dataset.i) Schematic of the used architecture.In this example, a positive pair, consisting of one diffraction image in its natural Polar and its Cartesian form, are passed on to the feature-extractor network, a modified ResNet50-D[53].The adjacent grey box then shows the transformation network, a three layer MLP, where the first layerin green -forms the embedding space within the here used SimCLRv2[17] framework.ii) The conceptual idea in more detail.The goal during training is to maximize similarity (minimize distance in embedding space) for positive pairs and minimize similarity (maximize distance in embedding space) for negative pairs.iii) The seven possible, non-exclusive classes of the here used multiclass and multilabel dataset.The top row shows the diffraction image in Polar coordinates and the bottom row the Cartesian projection.In addition, the absolute and relative class counts are given alongside the class names.Several different features can appear in a single image which then belongs to several classes simultaneously.For example, a Streak feature is also present in the example for the Layered class.

FIG. 3 .
FIG.3.Qualitative results on the helium nanodroplets dataset.In i) to iii) the column Input image shows three randomly chosen imagesfrom a pre-defined class combination.Next to the input images are the four closest diffraction images according to calculated pairwise distances for the CPLR, CLR, and VAE method using the metric given in table I. Every image is augmented in the top left corner by the class names given in the ground truth data and -for all images but the input image -the Overlap score with its corresponding input image.
[64].Training and linear evaluation strategy.The NT-Xent loss that is minimized during training is given by: l ,  = − log exp sim z  , z  / 2 =1 1 [≠ ] exp (sim (z  , z  ) /) , where sim (u, v) = u  v/( | |u| | | |v| |) denotes the cosine similarity between two vectors u and v, 1 [≠ ] ∈ {0, 1} is an indicator function evaluating to 1 if, and only if,  ≠ , and  denotes a temperature parameter.We performed extensive hyper-parameter optimization to obtain the best possible values for the temperature parameter , which are 0.200 for Polar ⇔ Polar, 0.200 for Cartesian ⇔ Cartesian, 0.075 for Cartesian ⇔ Polar, 0.100 for Polar ⇔ Cartesian.The results of this hyper-search, as well as scripts to re-run it, can be found in the official repository [64].

TABLE II .
Results for every class in the helium nanodroplets dataset.Compared to table I, we only show the four best-performing methods, namely our CPLR (Polar ⇔ Cartesian) method along with the CLR (Polar ⇔ Polar) baseline, and the VAE, and Kernel PCA (Using the euclidean metric) approaches.All methods have been trained / run for five times where each time the evaluation scores where obtained via 5-fold stratified cross validation.The standard deviation of all methods and for all classes is equal or below 0.01.The best result for each score is highlighted in bold letters.

Similarity Linear eval. Similarity Linear eval. Similarity Linear eval. Similarity Class n abs n rel Precision
Input imageClosest images in the CPLR embedding spacei) The most similar images using the CPLR framework The most similar images using the CLR framework