Introduction

Astronomy has become an immensely data-rich field. For example, the Sloan Digital Sky Survey (SDSS) will produce more than 50,000,000 images of galaxies in the near future1. In turn, galaxy morphology can be used to provide an independent test of the two proposed scenarios for galaxy formation. Elliptical galaxies, for example, are believed to be formed through major mergers2, whereas disk-dominated galaxies cannot have undergone recent major mergers, as such mergers would have severely disrupted their shape3. Thus, the class of quenching models is sufficient to explain the full range of morphological types observed for quenched galaxies. For example, bars can be found in all types of disk galaxies, from the earliest to the latest stages of the Hubble sequence, and barred galaxies constitute a major fraction of all disk galaxies. A small number of galaxies that appear unbarred at visual wavelengths have actually been found to be barred when observed in the near infra-red; the three clearest cases are NGC 1566 (ref. 4), NGC 1068 (refs 5, 6) and NGC 309 (ref. 7). De Zeeuw and Franx8 surveyed the literature on the dynamics of these objects. We are still far from a complete understanding of the dynamical structure of galaxies, and here we can do no more than scratch the surface of the majority of these problems.

Galaxy morphological classification schemes can be used to determine galaxy morphology automatically via classification or image-retrieval methods. For example, a Deep Neural Network (DNN) algorithm has been used to classify the Galaxy Zoo dataset (e.g. ref. 9). This method minimizes the sensitivity to changes in the scaling, rotation, translation and sampling of an image by using a rotation-invariant convolution. Its results agree with human classification at better than 99%; however, as human classification has several associated errors, the DNN approach inherits the same errors10. In ref. 11, the random forest method was used to classify an HST/WFC3 image containing 1639 galaxies, identifying disturbed morphologies using multimode, intensity and deviation statistics. Additionally, ref. 12 proposed a method that consists of two stages: first, feature extraction (shape, color and concentration) from galaxy images in the SDSS DR7 spectroscopic sample, followed by the classification of these features using a support vector machine.

The authors of ref. 10 proposed a different approach, called MORFOMETRYKA, which used the Linear Discriminant Analysis (LDA) algorithm to classify various features (concentration, asymmetry, smoothness, entropy and spirality) extracted from galaxy images. Their approach achieved an accuracy better than 90%, based on 10-fold cross-validation, in classifying galaxies as either elliptical or spiral.

These galaxy classification methods have provided powerful results. However, there is another way to deal with galaxy images: rather than only assigning them to groups, one can determine the images most similar to a query image, which requires image-retrieval techniques13.

An image-retrieval system is a computer system for browsing, searching, detecting and retrieving images from a large database of digital images14. The content-based image retrieval (CBIR) approach is one of the most commonly used image-retrieval methods15; it avoids the use of textual descriptions and instead retrieves images based on similarities in their content. Relevant content can be information related to image patterns, colors, textures, shape and location16.

Such image content is obtained by using feature-extraction methods, and the resulting features are saved in a database. To answer a query, the similarity between the stored features and the features of the queried image (extracted using the same methods) is computed and used to determine the closest matches. However, CBIR is challenging for galaxy images, because the number of galaxy images is very large and determining the most relevant images in such a database becomes a non-trivial task.

Several methods have been applied to improve the quality of CBIR for galaxy images. Ref. 17 introduced a CBIR method for astronomical images which used a multi-resolution approach to compress the original images into sketches. These sketches (features) were compared with the features of the queried image through the use of correlation and symmetry functions18. Later, ref. 19 proposed a CBIR method which summarized and indexed the Zurich archive of solar radio spectrograms. The summarization step clustered the content of an image into regions of similar texture, each represented by a set of parameters (location, texture roughness and region extent); the indexing step was then performed by quantizing these regions.

In general, the previous methods consider the shape, texture or color features individually, or pairs of them (color/texture, color/shape, shape/texture), but not all three. Moreover, not all of the extracted features are important: some may be redundant or irrelevant, which in turn reduces the quality of the classification or image-retrieval results. To address this, the aim of this paper is to introduce a new machine-learning approach for the retrieval of galaxy images. Our approach avoids the limitations of previous methods by extracting the shape, color and texture features from galaxy images, and then keeping only the most relevant features, using the K-NN classifier as a measure of the quality of the feature subsets selected by the Sine Cosine Algorithm (SCA).

The proposed approach consists of two stages: training and image retrieval. The training stage consists of two steps: the first is feature extraction, in which the color, shape and texture features are extracted from a dataset of galaxy images; the second is feature selection, which is performed with the modified sine cosine algorithm20 and selects the most relevant features using the classification accuracy as the fitness function. In the second stage, images similar to the queried image are returned by using the Euclidean distance as a similarity measure.

Feature extraction

In this section, visual features such as color, texture and shape are introduced15.

Color Feature Extraction

The color of an image is one of the most widely used features in image retrieval and several other image-processing applications. It is a very important feature since it is invariant with respect to scaling, translation and rotation21. Therefore, the aim of any color feature extraction method is to represent the main colors of the image content (red, green, and blue, i.e. RGB) and then use these color features to describe the image and distinguish it from other images. RGB colors used in this study were obtained by converting from the SDSS color system using the Maxim DL astronomical software22.

The color histogram is one of the most well-known color features used for image feature extraction23, 34; it represents the joint probability of the intensities of the color channels of an image. From probability theory, a probability distribution can be uniquely characterized by its moments. Thus, if we interpret the color distribution of an image as a probability distribution, moments can be used to characterize the color distribution. The moments of the color distribution are the features extracted from the images; if we denote the value of the ith color channel at the jth image pixel as P ij , then the color moments are defined as in refs 23 and 24 (an illustrative code sketch follows the list):

  • The first-order moment (the mean):

    $${E}_{i}=\frac{1}{N}\sum _{j=1}^{N}\,{P}_{ij}$$
    (1)
  • The second-order moment (the standard deviation):

    $${\sigma }_{i}=\sqrt{\frac{1}{N-1}\sum _{j=1}^{N}\,{({P}_{ij}-{E}_{i})}^{2}}$$
    (2)
  • The third-order moment (skewness):

$${s}_{i}=\sqrt[3]{\frac{1}{N}\sum _{j=1}^{N}\,{({P}_{ij}-{E}_{i})}^{3}}$$
(3)
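As an illustration of equations (1)-(3), the following minimal NumPy sketch computes the three moments for each channel of an RGB image. It is not the implementation used in this work (our experiments were run in Matlab), and the function name is ours.

```python
import numpy as np

def color_moments(image):
    """Mean, standard deviation and skewness (equations (1)-(3)) per color channel.

    image : array of shape (H, W, 3); returns a 9-element feature vector.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)  # N x 3, one row per pixel
    n = pixels.shape[0]

    mean = pixels.mean(axis=0)                                   # E_i, eq. (1)
    std = np.sqrt(((pixels - mean) ** 2).sum(axis=0) / (n - 1))  # sigma_i, eq. (2)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))          # s_i, eq. (3)

    return np.concatenate([mean, std, skew])
```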

Texture Feature Extraction

The texture descriptor is an important feature that provides properties such as smoothness, coarseness and regularity25. Textures can be rough or smooth, vertical or horizontal. Generally, they capture patterns in the image data, such as repetitiveness and granularity.

There are several texture extraction methods, such as the discrete cosine transform (DCT), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT) and the Gabor filter26, 27. The Gray Level Co-Occurrence Matrix (GLCM) and the Color Co-Occurrence Matrix (CCM) are the most commonly used statistical approaches for extracting the texture of an image28. The resulting features include the contrast, correlation, entropy, energy and homogeneity, which are defined as follows:

  • The contrast represents the amount of local variation in an image. This concept refers to pixel variance, and it is defined as:

    $$CN=\frac{1}{{(G-1)}^{2}}\sum _{u=0}^{G-1}\,\sum _{v=0}^{G-1}\,{|u-v|}^{2}\,p(u,v)$$
    (4)
  • The correlation represents the relation between pixels in an image, which determines the linear dependency between two pixels and is defined as:

    $$CR=\frac{1}{2}\sum _{u=0}^{G-1}\,\sum _{v=0}^{G-1}\,\frac{(u-{\mu }_{u})(v-{\mu }_{v})}{{\sigma }_{u}^{2}{\sigma }_{v}^{2}}p(u,v)+1$$
    (5)
  • The energy (En) represents the textural uniformity, where large values of En indicate a completely homogeneous image.

    $$En=\sum _{u=0}^{G-1}\,\sum _{v=0}^{G-1}\,p{(u,v)}^{2}$$
    (6)
  • The entropy (ET) measures the randomness of the intensity distribution. It is inversely correlated with En, and is defined as:

    $$ET=\frac{1}{2\,\log (G)}\sum _{u=0}^{G-1}\,\sum _{v=0}^{G-1}\,p(u,v)\,{\log }_{2}\,p(u,v)$$
    (7)
  • The homogeneity (H) is used to measure the closeness of the distribution, where large values of H indicate that the image contrast is low. The definition of H is given in the following equation:

$$H=\sum _{u=0}^{G-1}\,\sum _{v=0}^{G-1}\,\frac{p(u,v)}{1+|u-v|}$$
(8)

where u and v are the coordinates of the co-occurrence matrix, G is the number of grey levels, and μ u , μ v , σ u and σ v are the mean values and the standard deviations of the uth row and the vth column of the co-occurrence matrix, respectively.
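For illustration, the five statistics of equations (4)-(8) can be computed from a normalized co-occurrence matrix as in the sketch below. Building the matrix itself (e.g. at the four standard rotations 0°, 45°, 90°, 135°) is assumed to be done separately, and the normalization constants follow the equations exactly as written above.

```python
import numpy as np

def glcm_features(p):
    """Contrast, correlation, energy, entropy and homogeneity, equations (4)-(8),
    from a normalized G x G co-occurrence matrix p (entries sum to 1)."""
    G = p.shape[0]
    u, v = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")

    mu_u, mu_v = (u * p).sum(), (v * p).sum()
    sigma_u = np.sqrt((((u - mu_u) ** 2) * p).sum())
    sigma_v = np.sqrt((((v - mu_v) ** 2) * p).sum())

    contrast = ((u - v) ** 2 * p).sum() / (G - 1) ** 2                 # eq. (4)
    correlation = (0.5 * ((u - mu_u) * (v - mu_v) * p).sum()
                   / (sigma_u ** 2 * sigma_v ** 2) + 1)                # eq. (5), as written
    energy = (p ** 2).sum()                                            # eq. (6)
    nz = p[p > 0]
    entropy = (nz * np.log2(nz)).sum() / (2 * np.log(G))               # eq. (7), as written
    homogeneity = (p / (1 + np.abs(u - v))).sum()                      # eq. (8)

    return np.array([contrast, correlation, energy, entropy, homogeneity])
```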

Shape Feature Extraction

Shape features were extracted by using contour moments, defined mathematically as follows. Let z(i) be an ordered sequence that represents the Euclidean distance between the centroid and each of the N boundary pixels of the object. The rth contour sequence moment m r (ref. 14) is defined as:

$${m}_{r}=\frac{1}{N}\times \sum _{i=1}^{N}\,{[z(i)]}^{r}$$
(9)
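A hedged sketch of equation (9) is given below. The boundary-extraction step is a simple assumption (the text does not prescribe a particular contour-tracing routine), and any segmentation that yields the object's boundary pixels could be substituted.

```python
import numpy as np

def contour_sequence_moment(mask, r=1):
    """r-th contour sequence moment (equation (9)) of the object in a binary mask.

    z(i) is the Euclidean distance from the object's centroid to each boundary pixel.
    """
    ys, xs = np.nonzero(mask)
    centroid = np.array([ys.mean(), xs.mean()])

    # Crude boundary: object pixels with at least one background 4-neighbour
    # (an assumption; any contour-following routine could be used instead).
    padded = np.pad(mask.astype(bool), 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask.astype(bool) & ~interior
    by, bx = np.nonzero(boundary)

    z = np.hypot(by - centroid[0], bx - centroid[1])  # z(i)
    return (z ** r).mean()                            # m_r = (1/N) * sum z(i)^r
```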

Sine Cosine Algorithm

In this section, the sine cosine algorithm (SCA)20 is described. SCA is a recent meta-heuristic algorithm that uses either the sine or the cosine function to search for the best solution. The current solution X i , \((i=1,2,\ldots ,po{p}_{size})\), from the population of solutions is updated as in the following equations20:

$${X}_{i}={X}_{i}+{r}_{1}\times \,\sin ({r}_{2})\times |{r}_{3}P-{X}_{i}|$$
(10)
$${X}_{i}={X}_{i}+{r}_{1}\times \,\cos ({r}_{2})\times |{r}_{3}P-{X}_{i}|$$
(11)

These two equations are combined into a single update rule that switches between the sine and cosine functions20:

$${X}_{i}=\{\begin{array}{ll}{X}_{i}+{r}_{1}\times \,\sin ({r}_{2})\times |{r}_{3}P-{X}_{i}| & if\,{r}_{4} < 0.5\\ {X}_{i}+{r}_{1}\times \,\cos ({r}_{2})\times |{r}_{3}P-{X}_{i}| & if\,{r}_{4}\ge 0.5\end{array}$$
(12)

where r 1, r 2, r 3 and r 4 are random variables, P is the best solution, and |·| represents the absolute value20.

Following ref. 20, each parameter performs a specific task. The r 2 parameter defines the direction of the movement of X i (i.e., towards or away from P), while r 3 gives a random weight to P in order to stochastically emphasize (r 3 > 1) or deemphasize (r 3 < 1) its influence when defining the distance. Next, r 4 switches between the sine and cosine functions in equation (12)20. Finally, r 1 determines the region of the next position (or movement direction), which can lie either in the space between X i and P or outside it; it is also responsible for balancing exploration and exploitation to improve the convergence performance, and its value is updated as in ref. 20:

$${r}_{1}=a-t\frac{a}{{t}_{max}}$$
(13)

where t is the current iteration, t max is the maximum number of iterations, and a is a constant. Figure 1 shows how equation (12) defines a region between two solutions in the search space.

Figure 1: The effect of the sine and cosine functions on the next solution20.
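A compact sketch of one SCA iteration, following equations (10)-(13), is given below. The value a = 2 and the uniform ranges of r 2 and r 3 are common choices from ref. 20, but the exact settings shown here are assumptions.

```python
import numpy as np

def sca_step(X, P, t, t_max, a=2.0, rng=None):
    """One SCA iteration over a population X (pop_size x dim) towards the best solution P.

    r1 decays linearly (eq. (13)), r2 sets the move direction, r3 randomly re-weights P,
    and r4 switches between the sine and cosine branches (eq. (12)).
    """
    rng = rng if rng is not None else np.random.default_rng()
    r1 = a - t * a / t_max                       # eq. (13)
    r2 = rng.uniform(0.0, 2.0 * np.pi, X.shape)
    r3 = rng.uniform(0.0, 2.0, X.shape)
    r4 = rng.uniform(0.0, 1.0, X.shape)

    step = np.abs(r3 * P - X)
    return np.where(r4 < 0.5,
                    X + r1 * np.sin(r2) * step,  # eq. (10)
                    X + r1 * np.cos(r2) * step)  # eq. (11)
```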

The Proposed Image Retrieval Approach

In this section, we investigate a new approach to galaxy image retrieval as illustrated in Algorithm 1. Our proposed approach consists of two stages: a training stage and the galaxy image retrieval stage.

In the first stage, the input is the dataset of galaxy images. The shape, texture and color features are then extracted for each galaxy image I and combined into a feature vector FV I , where I denotes the current image. The next step in the training stage is to reduce the size of FV by using the binary SCA (BSCA) algorithm (see Algorithm 2) to select the most relevant features. This process is performed by maximizing the accuracy of the K-NN classifier, which is used as the fitness function.

The BSCA starts by generating a random population of size pop size , and its output is the best solution P, which points to the selected features (Sel Feat ). Each solution in the BSCA population is represented as a binary vector by using the sigmoid function, which transforms a real number into a binary value as:

$${X}_{i}=\{\begin{array}{ll}1 & if\,S({X}_{i}) > \sigma \\ 0 & otherwise\end{array},\,S({X}_{i})=\frac{1}{1+{e}^{-{X}_{i}}}$$
(14)

where \(\sigma \in [0,1]\) and X i is the current solution (for example, the solution X i  = 001100 with six features means that the third and fourth features are selected).

After the solutions are converted to binary vectors, the fitness function is computed for each solution. The fitness function is defined according to the classification accuracy rate as:

$${F}_{i}=\frac{{N}_{C}}{{N}_{I}}\times 100$$
(15)

where N C is the number of correctly predicted samples, and N I represents the total number of images. The dataset is divided by using 10-fold cross-validation (CV): the K-NN algorithm predicts the labels of the testing fold, and the output of the 10-fold CV is the average accuracy over the 10 runs.
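A hedged sketch of this fitness evaluation using scikit-learn is shown below; our experiments were implemented in Matlab, so the library and function names here are illustrative assumptions, and only K = 1 and the 10-fold CV follow the text.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(binary_solution, features, labels):
    """Equation (15): mean 10-fold CV accuracy (in %) of a 1-NN classifier
    trained on the feature columns selected by the binary solution vector."""
    selected = np.flatnonzero(binary_solution)
    if selected.size == 0:
        return 0.0  # an empty feature subset is useless
    knn = KNeighborsClassifier(n_neighbors=1)
    scores = cross_val_score(knn, features[:, selected], labels, cv=10)
    return 100.0 * scores.mean()
```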

The solution X i is updated using equation (12) (i.e., equation (10) or (11), depending on the value of r 4 ). This process is repeated until the maximum number of iterations is reached, or until there is only a small difference between \({F}_{i}^{old}\) and F i . The output of this stage is the global best solution P, which represents the optimally selected features Sel Feat .

The second stage starts by extracting the features of a queried image, FQ, after which the features corresponding to Sel Feat are selected. The Euclidean distance is then used to compute the similarity between FQ and each FV I , and the closest images to the query image are returned (based on either a distance threshold or the required number of images).
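The retrieval stage therefore amounts to a nearest-neighbour search over the selected feature columns; a minimal sketch (variable names are assumptions) is shown below, and the complete procedure is listed in Algorithm 1.

```python
import numpy as np

def retrieve(query_features, database_features, selected, top_k=5):
    """Indices of the top_k database images closest to the query,
    using the Euclidean distance over the selected feature columns."""
    fq = query_features[selected]
    fv = database_features[:, selected]
    distances = np.linalg.norm(fv - fq, axis=1)  # EDist_i for every database image
    return np.argsort(distances)[:top_k]
```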

Algorithm 1 The proposed approach for galaxy image retrieval

1: Input: database of images, queried image.
2: Output: precision and recall.
3: Training stage:
     Compute the feature vector FV I for each image I in the database.
     Select features Sel Feat  = BSCA(FV).
     Update the set of features FV = FV(Sel Feat ).
4: Image retrieval stage:
     Compute the feature vector FQ of the queried image I Q .
     Update FQ = FQ(Sel Feat ).
     for all images I i in the database (this loop can be parallelized) do
         Compute the Euclidean distance EDist i between FQ and FV i .
     end for
5: Select the smallest distances from EDist and determine the indices S index that satisfy EDist < \(\epsilon \).
6: Select from the database the images with indices S index .
7: Compute the precision and recall.

Algorithm 2 Binary Sine Cosine Algorithm (BSCA)

1: Input: features of each image (FV).
2: Initialize a set of solutions (X) of size pop size , and set the maximum number of iterations t max .
3: for i = 1 : pop size  do
4:     Convert X i to a binary vector using equation (14).
5:     Compute the fitness function F i based on the selected features from FV, using 10-fold cross-validation.
6:     if F i  > F P then
7:         F P  = F i .
8:         P = X i .
9:     end if
10: end for
11: repeat
12:     Update r 1 , r 2 , r 3 and r 4 .
13:     Update the positions using equation (12).
14:     Re-evaluate the solutions (steps 4-9) and update P and F P if a better solution is found.
15: until (t ≥ t max ) or the change in F P is sufficiently small.
16: Return the best solution P obtained so far as the global optimum, with fitness F P .
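For illustration, a condensed version of Algorithm 2 might look as follows. It reuses the sca_step and fitness sketches given earlier; the sigmoid threshold σ = 0.5 and the initialization range are assumptions, and the early-stopping test on the fitness change is omitted for brevity.

```python
import numpy as np

def binarize(X, sigma=0.5):
    """Equation (14): sigmoid transfer followed by thresholding at sigma."""
    return (1.0 / (1.0 + np.exp(-X)) > sigma).astype(int)

def bsca(features, labels, pop_size=20, t_max=100, rng=None):
    """Condensed BSCA feature selection (Algorithm 2); fitness() and sca_step()
    are the sketches given earlier in this section."""
    rng = rng if rng is not None else np.random.default_rng()
    X = rng.uniform(-1.0, 1.0, (pop_size, features.shape[1]))  # continuous positions
    best_fit, best = -np.inf, None

    for t in range(t_max):
        B = binarize(X)
        for i in range(pop_size):
            f = fitness(B[i], features, labels)  # equation (15)
            if f > best_fit:                     # keep the best solution P
                best_fit, best = f, B[i].copy()
        X = sca_step(X, best, t, t_max)          # equation (12) update towards P

    return np.flatnonzero(best), best_fit        # selected feature indices, accuracy
```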


Experimental Results

We tested our proposed approach using the EFIGI catalogue, which consists of 4458 galaxy images29. We also compared the performance of our method with the particle swarm optimization (PSO)30 and genetic algorithm (GA)31 methods. The parameters used in each algorithm are given in Table 1. The parameters common to the three algorithms are the population size and the maximum number of iterations, which were set to 20 and 100, respectively; the maximum number of iterations was used as the stopping criterion. The experiments were implemented in Matlab and run in a 64-bit Windows environment.

Table 1 The parameter settings of each algorithm.

Images Database

The EFIGI catalogue29 contains 16 morphological attributes that were measured by visual examination of the composite g, u, r color image of each galaxy, derived from the SDSS FITS images29. The EFIGI catalogue merges data from standard surveys and catalogues (the Principal Galaxy Catalogue, SDSS, the Value-Added Galaxy Catalogue, HyperLeda, and the NASA Extragalactic Database). The bulge-to-disk ratio32 and the degree of azimuthal variation of the surface brightness are often used as discriminant parameters along the Hubble sequence; this is not surprising, since the EFIGI classification scheme is very close to the RC3 system. The final EFIGI database is a large sub-sample of the local universe which densely samples the Hubble sequence. Its morphological sequence is based on the RC3 revised Hubble sequence (RHS) and is referred to as the EFIGI morphological sequence (EMS).

Finally, all colors of the original data were used to create composite, “true color”, RGB images in PNG format with the Maxim DL astronomical software22, using the same intensity mapping for all RGB images.

Performance measures

Two measurements were used to evaluate the performance of the proposed algorithm: the precision rate and the recall rate.

  • The precision rate is defined as the ratio of the number of retrieved images similar to the queried image relative to the total number of retrieved images28.

    $$precision=\frac{p}{p+r}\times 100$$
    (16)
  • The recall rate is defined as the percentage of retrieved images similar to the query image among the total number of images similar to the queried image in the database28.

$$recall=\frac{p}{p+q}\times 100$$
(17)

where p, q and r are the number of relevant images retrieved, relevant images in the dataset which are not retrieved, and non-relevant images in the dataset which are retrieved, respectively.
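A minimal sketch of these two measures, assuming the retrieved indices and the set of database images relevant to the query are available as Python sets:

```python
def precision_recall(retrieved, relevant):
    """Equations (16) and (17): precision and recall in percent.

    retrieved : set of indices returned for a query image.
    relevant  : set of indices in the database with the same morphological type.
    """
    p = len(retrieved & relevant)   # relevant images that were retrieved
    r = len(retrieved - relevant)   # non-relevant images that were retrieved
    q = len(relevant - retrieved)   # relevant images that were missed
    precision = 100.0 * p / (p + r) if (p + r) else 0.0
    recall = 100.0 * p / (p + q) if (p + q) else 0.0
    return precision, recall
```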

Results and Discussion

In order to assess the effectiveness of our approach, we used the leave-one-out method, in which each image in the dataset is considered in turn as the queried image, so the retrieval process was repeated 4458 times. In addition, we used the 1-NN classifier with 10-fold cross-validation (CV) to evaluate the subsets of selected features; this classifier is parameter-free and easy to implement33. As discussed previously, the 10-fold CV divides the dataset into ten groups, and the experiment is performed ten times, each time selecting one group as the test set and using the remaining groups as the training set. The output is the average accuracy of the ten runs.

In general, we used color, texture and shape feature vectors for galaxy image retrieval. The total number of extracted features was 30: nine color features extracted from the three RGB channels (three moments per channel), 20 texture features (the five GLCM measures at four rotations each) and one shape feature. The extracted feature vectors were passed to the feature-selection methods (in this study, we compared the BSCA, PSO and GA methods) to determine the relevant features.
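Putting the extraction steps together, the 30-dimensional feature vector can be assembled as in the sketch below, which reuses the color_moments, glcm_features and contour_sequence_moment sketches from the previous sections; the four co-occurrence matrices (one per rotation) and the segmentation mask are assumed inputs.

```python
import numpy as np

def galaxy_feature_vector(rgb_image, glcm_list, mask):
    """Assemble the 30-dimensional feature vector used in this work:
    9 color moments + 20 GLCM statistics (5 measures x 4 rotations) + 1 contour moment.

    glcm_list : four normalized co-occurrence matrices, one per rotation angle.
    mask      : binary segmentation mask of the galaxy.
    """
    color = color_moments(rgb_image)                                  # 9 features
    texture = np.concatenate([glcm_features(p) for p in glcm_list])   # 20 features
    shape = np.array([contour_sequence_moment(mask, r=1)])            # 1 feature
    return np.concatenate([color, texture, shape])
```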

The best selected features and their accuracy (the value of the fitness function) are given in Table 2. From this table, it can be seen that the BSCA algorithm selects a small number of features with high accuracy, followed by the PSO, whereas the GA selects a large number of features with lower accuracy. In addition, we observed that the most relevant features, i.e. those that contain the most information and are used to distinguish between the classes, are the third color moment, energy, homogeneity, entropy and contour. These features are common to the three algorithms, and all of them are selected by the proposed method.

Table 2 The selected features and their accuracy.

The comparison of our proposed method with the other methods is illustrated in Figs 2, 3, 4 and 5 and Table 3. From Table 3, we can conclude that the proposed approach is better than PSO and GA in terms of the precision and recall measures. The best results were obtained when an edge-on spiral galaxy was used as the queried image, because these present the most regular structure, while the lowest accuracy occurred when a spiral galaxy was tested.

Figure 2: Galaxy image retrieval for a spiral galaxy29.

Figure 3: Galaxy image retrieval for an edge-on spiral galaxy29.

Figure 4: Galaxy image retrieval for an irregular galaxy29.

Figure 5: Galaxy image retrieval for an elliptical galaxy29.

Table 3 A comparison between the proposed approach and the PSO and GA methods for galaxy image retrieval.

Moreover, from Table 3, it can be seen that the proposed method is faster than the other two algorithms, taking ~292.0 s (nearly half the time of the other algorithms) to select the best features. We note that the GA method takes less time to complete than the PSO algorithm. In general, the computing time is divided into three parts: the first is the time needed to extract the features from the images (~375 s in total, i.e. ~0.084 s per image). The second part is the time needed to select the most relevant features, as given in Table 3. The last part is the time needed to compute the matching, which requires ~0.0157 s, in addition to the time needed to extract the features of the queried image (~0.084 s).

Figures 2, 3, 4 and 5 show examples of the retrieved images for four galaxy types. In these figures, the five database images that are closest to the queried image are given as the retrieval results.

In order to investigate the influence of the size of the training set when selecting the best features, the dataset was randomly divided into training and testing sets. The proposed method was then evaluated at three different training-set sizes, i.e. 50%, 70% and 85% of the dataset (the remainder being the test set). Our results are shown in Table 4, where it can be seen that the worst accuracy was obtained when the training and test sets were of equal size. The best accuracy was achieved when the training set comprised 85% of the entire database (as expected: increasing the size of the training set increases the accuracy).

Table 4 The effect of the size of training set on the performance of the proposed approach for galaxy image retrieval.

Finally, from the previous results, we can draw two conclusions. The first is that the proposed approach for galaxy image retrieval is better than the PSO and GA algorithms in terms of recall, precision, accuracy and time complexity. The second is that the most suitable method for splitting the dataset (when selecting the best-fitting features) is 10-fold CV; however, if the dataset is divided randomly, then the most suitable size for the training set is in the range of 85% to 90%.

Conclusions

In this study, we proposed a machine-learning approach for galaxy image retrieval that can be used for the automatic determination of galaxy morphological types from datasets of galaxy images. The automated determination of galaxy types is very important for understanding the physical properties of the past, present and future of the universe, while also offering a means of identifying and analyzing peculiar galaxies that cannot be associated with a defined morphological stage on the Hubble sequence.

Our approach automatically retrieves specific morphological types from different morphological classes without human guidance. The proposed algorithm was compared with the PSO and GA algorithms, and its performance was evaluated based on recall and precision. The results indicate the superior performance of our proposed approach.

Based on the promising results of the algorithm, our future work will attempt to further investigate its application to other complex problems in astronomy by modifying the proposed method.