A Real-World Benchmark for Sentinel-2 Multi-Image Super-Resolution

Insufficient spatial resolution is a serious limitation in many practical scenarios, especially when acquiring images at a finer scale is infeasible or incurs higher costs. This is inherent to remote sensing, including Sentinel-2 satellite images that are available free of charge at a high revisit frequency, but whose spatial resolution is limited to 10 m ground sampling distance. The resolution can be increased with super-resolution algorithms, in particular when reconstruction is performed from multiple images captured at subsequent revisits of a satellite, taking advantage of information fusion that leads to enhanced reconstruction accuracy. One of the obstacles in multi-image super-resolution is the scarcity of real-world benchmarks: commonly, simulated data are exploited which do not fully reflect the operating conditions. In this paper, we introduce a new benchmark (named MuS2) for super-resolving multiple Sentinel-2 images, with WorldView-2 imagery used as the high-resolution reference. Within MuS2, we publish the first end-to-end evaluation procedure for this problem, which we expect will help researchers advance the state of the art in multi-image super-resolution.


Background & Summary
Super-resolution (SR) is aimed at reconstructing a high-resolution (HR) image from a single image or from multiple low-resolution (LR) observations presenting the same scene. Multi-image SR (MISR) fuses multiple LR images, each of which contains a different portion of HR information. This allows for achieving higher reconstruction accuracy than relying on single-image SR (SISR) [1], but MISR is highly sensitive to the variability of the input images and their proper co-registration [2]. This poses a challenge when preparing the data for training and validation. Recent advances in satellite image SR include SISR [3] and MISR [4] techniques for enhancing Sentinel-2 (S-2) multispectral images (MSIs), composed of 13 bands whose resolution ranges from 60 m ground sampling distance (GSD) to 10 m GSD [5].
Commonly, SR techniques are evaluated relying on an artificial scenario: a certain image is treated as an HR reference, which is subsequently degraded to obtain simulated LR images. The similarity of the super-resolved outcome to the reference is then used to evaluate the SR performance. Unfortunately, such a procedure does not reflect real-world operating conditions [6], and methods that perform well for simulated data are not necessarily effective for original (i.e., not downsampled) images. It is therefore crucial to properly validate the emerging techniques using real LR images coupled with a real HR reference; in an excellent survey on real-world SISR [6], Chen et al. identified the deficiency of realistic datasets as one of the most important challenges in this field. Recently, several real-world SISR datasets have been elaborated [7-9]; however, preparing such datasets for MISR is much more costly and troublesome.
In 2019, the European Space Agency organized an SR challenge [10] based on real-world scenes acquired by the Proba-V satellite, each of which contains an HR image (100 m GSD) coupled with at least nine LR images (300 m GSD). The dataset allowed for developing the first MISR techniques underpinned with convolutional neural networks (CNNs), applied either to enhance the LR images before their multi-temporal fusion [11] or employed to learn the reconstruction process in an end-to-end manner [2, 12, 13]. Very recently, the WorldStrat dataset was published, which matches multiple S-2 images with SPOT spectral bands of 6 m GSD [14] (the magnification factor equals 1.67×). However, the problem of comparing the reconstruction outcome with the reference was not discussed there, and it is not clear whether and how WorldStrat can be used for benchmarking SR algorithms. We demonstrate that this is not a straightforward task, especially when LR and HR images are captured by different satellites. In [15], S-2 SISR was assessed by comparing the outcome against WorldView-3 images. The authors observed that the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) do not correlate well with the quality of the reconstructed images: the highest scores were obtained for blurred images in which the details were not reconstructed accurately. Contrary to that, the learned perceptual image patch similarity (LPIPS) [16] was reported to be suitable for evaluating SISR for remote sensing [17, 18], but it was not applied to MISR.
In this paper, we address the problem of evaluating MISR for S-2 images by introducing a new Multi-image Sentinel-2 SR (MuS2) benchmark. It is composed of a new dataset with WorldView-2 (WV-2) images used as an HR reference and an end-to-end validation procedure. Our contribution can be summarized in the following points.
• We publish a new MuS2 benchmark dataset with 91 diverse scenes covering around 2500 km² (Fig. 1), each composed of at least 14 S-2 MSIs coupled with a single WV-2 MSI. The benchmark is intended to serve as a test set for evaluating future advancements in MISR.
• We elaborated a protocol (included in the source code we publish) for assessing the SR outcome for all S-2 10 m bands based on the corresponding WV-2 bands, and we report the results of 3× magnification obtained using well-established MISR techniques.
• We demonstrate that the LPIPS metric [16] is suitable for evaluating the MISR outcome, even if the reference HR images are acquired using a different satellite. To verify that, we have performed a mean opinion score (MOS) survey, whose results are thoroughly discussed.
• We introduce relevance masks which indicate the image regions that can be considered when evaluating with pixel-wise image similarity metrics.

Data Source
Our benchmark dataset is composed of images acquired by the S-2 and WV-2 satellites. The S-2 images were downloaded from the Copernicus Open Access Hub, and we accessed the WV-2 images published within the European Cities dataset (freely available at https://earth.esa.int/web/guest/-/worldview-2-european-cities-dataset). The S-2 data are organized into tiles, each of which contains a 100 × 100 km MSI, and aggregated into products that are distributed at different processing levels. Here, we exploit Level-2A, which includes the bottom-of-atmosphere reflectance correction. Each tile is composed of 13 spectral bands; however, the B10 (cirrus) band is excluded from Level-2A, because it does not represent any surface information. The blue, green, red, and near-infrared (NIR) bands (B02, B03, B04, and B08, respectively) are of 10 m GSD; the vegetation red edge bands (B05, B06, and B07), narrow NIR (B08a), and the short-wave infrared (SWIR) bands (B11 and B12) are of 20 m GSD; while the remaining coastal aerosol (B01), water vapour (B09), and cirrus clouds estimation (B10) bands are of 60 m GSD. Each WV-2 tile contains a panchromatic image of 0.4 m GSD and 8 spectral bands at 1.6 m GSD: C-coastal (400-450 nm), B-blue (450-510 nm), G-green (510-580 nm), Y-yellow (585-625 nm), R-red (630-690 nm), RE-red edge (705-745 nm), NIR1 (770-895 nm), and NIR2 (860-1040 nm). The S-2 10 m bands are coupled with the spectrally corresponding WV-2 bands (as detailed in the Data Records section); for the remaining band pairs, the spectral coverage did not exceed D_C = 0.5, so we do not consider them in our study.
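For quick reference, the band-to-resolution assignment above can be captured in a simple lookup table. The following is an illustrative sketch (not part of the benchmark code); band names follow the standard S-2 convention, with "B8A" denoting the narrow NIR band written as B08a in the text.

```python
# Illustrative lookup: native GSD (in meters) of the Sentinel-2 bands,
# as described above; B10 is absent from Level-2A products.
S2_BAND_GSD = {
    "B02": 10, "B03": 10, "B04": 10, "B08": 10,   # blue, green, red, NIR
    "B05": 20, "B06": 20, "B07": 20, "B8A": 20,   # red edge, narrow NIR
    "B11": 20, "B12": 20,                          # SWIR
    "B01": 60, "B09": 60, "B10": 60,               # aerosol, water vapour, cirrus
}
```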

Scene Selection and Data Alignment
For each WV-2 tile, we collected 14 or 15 S-2 images that entirely contain that tile or overlap with it. The images, both within the S-2 stacks and with respect to the WV-2 tiles, were co-registered at whole-pixel precision. We assembled the dataset using areas defined by seven different military grid reference system (MGRS) tiles. These areas represent diverse environments throughout Europe, such as mountains in Norway, plains in Germany and Belgium, and a coastal region in Spain. The diversity of the chosen locations is illustrated in Fig. 1. Our data preparation procedure (see Fig. 2) operates on original S-2 and WV-2 images. The common area is determined based on the geographic coordinates retrieved from the metadata, and for each band, N LR images (I_in) are cropped and coupled with the cropped WV-2 image. The latter is subsequently downsampled to create the HR reference (I_HR), each of whose dimensions is α× larger than those of I_in. In the published dataset, we use α = 3, as enlarging the images 3× remains sufficiently challenging for the state-of-the-art MISR techniques [12, 13, 19, 20]. However, the software tools that we publish along with the benchmark allow for generating the reference HR data up to α = 6.25 for WV-2 images of original size. Overall, we acquired 91 HR images, each of which is coupled with multiple LR images (every HR and LR image is composed of four bands). To verify whether the bands are correctly co-registered after cropping, we composed color images and inspected them for color artefacts (Fig. 3a, b). In order to check the co-registration correctness between the LR images and the HR image, we assembled checkerboard-mosaicked images and inspected them visually for all the scenes (Fig. 3c). Finally, we combined multiple S-2 images for each scene to inspect whether the obtained images are free of color and halo artefacts that could result from poor co-registration across multiple S-2 images and/or the spectral bands (Fig. 3d). The common-area cropping step is sketched below.
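The following is a minimal sketch of the common-area cropping, assuming rasterio and single-band GeoTIFF inputs sharing a common CRS; all file-handling details (function and variable names) are assumptions for illustration, not the published tooling.

```python
# Minimal sketch: crop all inputs to their largest common geographic region.
import rasterio
from rasterio.windows import from_bounds

def crop_common_area(s2_paths, wv2_path):
    # Intersect the georeferenced bounds of all inputs.
    with rasterio.open(wv2_path) as wv2:
        left, bottom, right, top = wv2.bounds
    for path in s2_paths:
        with rasterio.open(path) as src:
            b = src.bounds
            left, bottom = max(left, b.left), max(bottom, b.bottom)
            right, top = min(right, b.right), min(top, b.top)

    # Crop every image to the common region via a georeferenced window.
    crops = []
    for path in list(s2_paths) + [wv2_path]:
        with rasterio.open(path) as src:
            window = from_bounds(left, bottom, right, top, transform=src.transform)
            crops.append(src.read(1, window=window))

    # Per the pipeline above, the WV-2 crop is subsequently resampled so that
    # each of its dimensions equals alpha times those of the S-2 crops.
    return crops[:-1], crops[-1]  # (LR stack, HR image)
```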

The Evaluation Procedure
The evaluated SR process is fed with N input images for reconstructing a specific S-2 band to produce a single super-resolved image (I_SR), which is subsequently compared with the corresponding I_HR. By an input image we understand a single S-2 band, but it is also possible that multiple bands cropped to show the same area are processed to reconstruct a single S-2 band or multiple bands, and each of the super-resolved 10 m bands can be compared with a corresponding I_HR derived from the WV-2 data. We measure the similarity between I_SR and I_HR with the PSNR, SSIM, and LPIPS metrics that are commonly employed for assessing the SR quality [17]. I_in and I_HR are acquired by different sensors; hence, the differences between I_SR and I_HR result not only from the lack of reconstruction accuracy, but also from the characteristics of the imaging sensor and the temporal changes of the scene (due to different acquisition times) [21]. Also, as the N input images are co-registered in the LR space at whole-pixel precision, the super-resolved I_SR may be displaced by up to α pixels in each dimension compared with I_HR. For Proba-V images (acquired with the same satellite), it was sufficient to co-register I_SR to I_HR and to compensate the brightness bias [10]. The co-registration was performed by shifting I_SR by [−α, α] pixels in the vertical and horizontal dimensions to maximize the PSNR score computed for the whole I_HR image without an α-wide boundary (so that even after shifting, the same part of I_HR is considered for computing the score). For MuS2, we have fully adopted that approach. In [10], the brightness was compensated by subtracting the mean brightness from both I_SR and I_HR. In the case considered here, the differences between S-2 and WV-2 images are more substantial than among Proba-V images; therefore, we match the histogram of I_SR to I_HR before the evaluation. Importantly, this does not convey any HR information from I_HR to I_SR, but only compensates the low-frequency differences between them. Also, we do not modify I_HR, so that every I_SR obtained using different techniques is always compared to the same reference image. A sketch of this compensation pipeline follows.
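The following is a minimal sketch of the compensation steps described above (histogram matching followed by a shift search), assuming numpy and scikit-image; `sr` and `hr` are 2-D float arrays of equal shape, and all names are illustrative rather than taken from the benchmark code.

```python
import numpy as np
from skimage.exposure import match_histograms
from skimage.metrics import peak_signal_noise_ratio

def compensate(sr, hr, alpha=3):
    # Low-frequency radiometric compensation: match the SR histogram to HR
    # (HR itself is never modified).
    sr = match_histograms(sr, hr)

    # Shift search in [-alpha, alpha]: score only the interior of HR,
    # dropping an alpha-wide boundary so the reference crop stays fixed.
    a = alpha
    ref = hr[a:-a, a:-a]
    best_psnr, best_sr = -np.inf, None
    for dy in range(-a, a + 1):
        for dx in range(-a, a + 1):
            cand = np.roll(sr, (dy, dx), axis=(0, 1))[a:-a, a:-a]
            psnr = peak_signal_noise_ratio(ref, cand,
                                           data_range=ref.max() - ref.min())
            if psnr > best_psnr:
                best_psnr, best_sr = psnr, cand
    return best_sr, ref, best_psnr
```

Because the boundary is cropped after shifting, the pixels wrapped around by `np.roll` never enter the scored region for any shift within [−α, α].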
As proposed in [10], in addition to reporting the direct values of PSNR, we also compute their relation to the scores retrieved for images obtained by averaging LR images enlarged with bicubic interpolation, treated as a baseline which does not offer any information gain. We adopt the same approach for the SSIM and LPIPS metrics, and we also compute the balanced metric B. Denoting by M(·, I_HR) the score of a given metric against the HR reference, it is computed as

B = M(I_bic, I_HR) / M(I_SR, I_HR) for PSNR and SSIM, and
B = M(I_SR, I_HR) / M(I_bic, I_HR) for LPIPS,

where I_bic is obtained by averaging all the bicubically-upsampled I_in's in the scene (they are all co-registered, as justified earlier). Hence, B < 1 means better performance compared with bicubic interpolation, and B > 1 indicates the opposite case (for PSNR and SSIM, a higher score indicates higher similarity, while for LPIPS, the lower the score, the higher the similarity).
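A minimal sketch of the balanced metric, assuming precomputed scalar scores for the SR result and for the averaged-bicubic baseline (the function and argument names are illustrative, not from the benchmark code):

```python
def balanced_metric(score_sr, score_bic, higher_is_better=True):
    """B < 1 means the SR result beats the bicubic baseline."""
    if higher_is_better:           # PSNR, SSIM
        return score_bic / score_sr
    return score_sr / score_bic    # LPIPS

# Example: 32 dB PSNR for SR vs. 30 dB for the bicubic baseline.
b_psnr = balanced_metric(32.0, 30.0, higher_is_better=True)  # 0.9375 < 1
```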
During our experiments (reported later in this paper), we observed that the PSNR and SSIM metrics are not particularly effective in assessing the reconstruction accuracy when I_in and I_HR originate from different satellites: the scores differ little across the I_SR's obtained with the employed SR or interpolation procedures. Based on the experimental validation and the MOS survey, whose results are discussed later in this paper, it is clear that the outcomes obtained using MISR techniques better reflect the image details than relying on bicubic interpolation, and this is correctly captured by the LPIPS metric. It can be seen from Figure 4 that there are regions, marked in red in Figure 4e, in which bicubic interpolation consistently prevails over all four considered SR techniques trained from real-life data in terms of the mean squared error (MSE), on which the PSNR metric is based (these regions were extracted after applying shift and brightness compensation). This may result from the temporal differences between the I_in's and I_HR: in the regions where bicubic interpolation renders lower MSE than all the SR techniques, the HR reference can be regarded as inappropriate for assessing the reconstruction quality. It can be noticed that such regions are mainly located in the cultivated fields, while in the highly-detailed urban areas they are isolated (this may result from shadows and differences in lighting conditions). Overall, for every scene, we generate the relevance masks, which indicate the regions in I_HR where at least one MISR model (out of the four considered in our study) offers better performance than bicubic interpolation in terms of MSE; if the relevance masks are applied, then the metrics are computed only inside such regions (hence, in Figure 4e, the pixels marked in red would be ignored when computing the metrics with the relevance masks applied). A sketch of the mask generation is shown below.
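One plausible reading of the mask generation is a per-pixel comparison of squared errors; the following is a minimal sketch under that assumption, with `sr_images` a list of shift- and brightness-compensated SR outputs (one per MISR model), `bic` the averaged-bicubic baseline, and `hr` the aligned HR reference (all names illustrative).

```python
import numpy as np

def relevance_mask(sr_images, bic, hr):
    """True where at least one MISR model beats bicubic in squared error."""
    err_bic = (bic - hr) ** 2
    err_sr = np.stack([(sr - hr) ** 2 for sr in sr_images])
    return err_sr.min(axis=0) < err_bic  # boolean mask, same shape as hr

# Metrics restricted to relevant pixels, e.g., a masked MSE:
# mse = np.mean(((sr - hr)[mask]) ** 2)
```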

Data Records
The MuS2 benchmark dataset is available at https://doi.org/10.7910/DVN/1JMRAT and its folder structure is depicted in Figure 5. The images are grouped into 91 scenes, each of which is placed in a separate scene-level folder, whose names follow the pattern <ord num> <mgrs> <date>-2AS <tile>, where <ord num> is the ordinal number of the Sentinel-2 product, <mgrs> is the MGRS tile representing the captured area, <date> is the acquisition date of the WV-2 image, and <tile> is the name of the WV-2 tile. Each scene-level folder contains 12 band-level folders (b1, b2, b3, etc.) with the corresponding S-2 bands. Every band-level folder contains 14 or 15 Level-2A images captured at different revisits. The WV-2 bands (numbered from 0 to 7), along with a panchromatic image, are stored in the hr resized folder for each scene, resized to dimensions 3× larger than those of the S-2 input images. Also, the scene classification maps (scl), cloud masks (cld), aerosol optical thickness (aot), and water vapour maps (wvp) are included. The resolution of the HR images is 3× larger than the resolution of the S-2 10 m GSD images (i.e., bands B02, B03, B04, and B08). The S-2 and WV-2 band images are intended to be coupled for evaluation as follows:
• S-2 B02 (b2) with the WV-2 B band (mul band 1),
• S-2 B03 (b3) with the WV-2 G band (mul band 2),
• S-2 B04 (b4) with the WV-2 R band (mul band 4),
• S-2 B08 (b8) with the WV-2 NIR1 band (mul band 6).
Additionally, the relevance masks for the 10 m S-2 bands are included in the masks folder.
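The band coupling above can be iterated over programmatically; the following sketch assumes the folder layout just described, while the exact file naming inside each folder is an assumption here (adapt the glob patterns to the dataset as published).

```python
from pathlib import Path

# S-2 band folder -> WV-2 band index within the "hr resized" folder,
# following the coupling listed above.
S2_TO_WV2 = {"b2": 1, "b3": 2, "b4": 4, "b8": 6}

def scene_pairs(scene_dir):
    """Yield (s2_band, LR revisit paths, candidate HR paths) per coupling."""
    scene = Path(scene_dir)
    for s2_band, wv2_idx in S2_TO_WV2.items():
        lr_paths = sorted((scene / s2_band).glob("*"))   # 14-15 revisits
        # Folder name taken from the description above; file patterns are
        # hypothetical and should be adjusted to the actual naming scheme.
        hr_paths = sorted((scene / "hr resized").glob(f"*{wv2_idx}*"))
        yield s2_band, lr_paths, hr_paths
```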

Technical Validation
The goal of our experiments was twofold: (i) to confirm that MuS2, including the evaluation procedure, is suitable for assessing the reconstruction accuracy, and (ii) to report the scores of well-established SR techniques, which can be used as a baseline for future research. For this purpose, we included HighRes-net, which recursively combines latent LR representations to obtain the super-resolved image [19], as well as the residual attention MISR (RAMS) network equipped with an attention mechanism [20].
In this study, we focus on evaluating MISR performed for the four 10 m S-2 bands matched with the HR reference obtained from WV-2. In order to verify that, based on MuS2, we can assess how much the reconstructed details resemble those observed in I_HR, we consider three groups of methods: (i) MISR networks trained with real-world LR and HR Proba-V images [10], (ii) MISR networks trained using simulated S-2 LR data [22] (obtained by degrading an original S-2 image, later treated as I_HR), and (iii) image interpolation techniques treated as a baseline. In [23], we demonstrated that RAMS and HighRes-net trained from Proba-V images are successful in super-resolving S-2 data, and they do not generate the artefacts observed when these networks are trained using simulated LR images. To verify how such artefacts influence the scores for MuS2, we report the performance for the networks trained with Proba-V NIR, Proba-V Red, and simulated S-2 images.
Quantitative results obtained for the four S-2 bands, without and with the relevance masks applied, are reported in Table 1. It can be seen that without the relevance masks, the PSNR and SSIM scores do not differ much within particular bands, and they do not indicate the SR techniques to be better than interpolation. This contradicts what can be judged from visual inspection of the reconstruction outcome, an example of which is presented in Figure 6. The images rendered by HighRes-net and RAMS trained over Proba-V images present more details than the interpolated images (which are quite blurry), and they are free from the artefacts present for the models trained from the simulated S-2 data. It is worth noting that, based on visual assessment, the details in I_SR are quite close to those observed in the WV-2 image, which suggests that they were indeed reconstructed rather than hallucinated (such hallucination artefacts are quite common for SISR techniques, especially for larger magnification factors [24]). Unfortunately, these qualitative observations are quantitatively reflected only in the LPIPS values, while PSNR and SSIM are slightly worse for HighRes-net and RAMS; overall, both LPIPS and B indicate that the SR networks perform better than interpolation, and they penalize the artefacts. Apparently, the differences between the I_in and I_HR images resulting from different image acquisition conditions prevail over the accuracy of reconstructing the details when local pixel-wise metrics like PSNR or SSIM are used, but they have a smaller impact on the feature-based LPIPS metric. When the relevance masks are applied, all the metrics indicate the superiority of the SR networks, and their B scores preserve a similar ordering as without the masks.
As shown in Table 1, the reported metrics (without the relevance masks) are not consistent in indicating the SR performance. In order to verify which of them is most reliable for assessing the reconstruction accuracy, we prepared a MOS survey composed of 15 queries. We considered the cases in which the PSNR, SSIM, and LPIPS metrics pointed to a different SR or interpolation outcome as the most (eight cases) or the least (seven cases) similar to I_HR. In each query, participants of diverse backgrounds (including Earth observation professionals and individuals without any remote sensing experience) were presented these three images alongside I_HR, and asked to select the best or the worst image for retrieving some specific details (e.g., delineating the roads). In this way, we wanted to prevent the participants from being biased toward picking the image of the best perceptual quality instead of the one whose details are most similar to those in I_HR. The results of the survey (averaged from over 160 responses) are reported in Table 2; clearly, LPIPS correlates best with human judgement both in picking the most and the least accurate reconstruction outcome (retrieved with an SR network trained from Proba-V data and with an interpolation technique, respectively). Overall, even though LPIPS is a perceptual metric, the reported study shows that it can be exploited as a reliable indicator of reconstruction accuracy in this case. We expect that the balanced metric B introduces additional stability, as the PSNR and SSIM scores are sensitive to the hallucination effects [24], which potentially may not affect the LPIPS metric.

Questionnaire sent to the respondents to determine the mean opinion score (MOS). The questions and images are in the same order as in the survey. There are 15 questions concerned with the reconstruction accuracy. For each image, we provide the metric used to determine the best (or worst) match alongside the method used to obtain that result (obviously, such information was not revealed to the respondents).

Figure 2. The data preparation pipeline: the yellow boxes indicate input data, the blue ones temporary data, and the red ones the output. For each WV-2 image, N S-2 images with an overlapping area are processed to identify the coordinates of the largest common region. It is used to crop the four 10 m bands from the S-2 images and the corresponding WV-2 bands. The latter are downsampled, so that they are α× larger than the cropped S-2 images.

Figure 3. Color images composed of S-2 and WV-2 bands to inspect the co-registration correctness of the prepared data. The example shows correct co-registration among (a) S-2 and (b) WV-2 bands, (c) between S-2 and WV-2 images (in the form of a checkerboard image), and (d) among multiple S-2 images (incorrect co-registration would result in color artefacts).

Figure 4. Areas (red) in which the bicubically interpolated image is closer to the HR reference than the super-resolved one (after applying appropriate shift and brightness compensation). Outcomes of (a, b) HighRes-net [19] and (c, d) RAMS [20] models trained over Proba-V NIR and Red images, respectively, are shown along with (e) the final relevance mask (red areas are masked out).

1. Which of the following images presents the urban area in the MOST DETAILED way?
2. Which of the following images is the MOST DIFFERENT from the reference image? (Shown alongside the reference image: LPIPS pick (Bicubic), 119 votes; SSIM pick (HRN SIM), 29 votes; PSNR pick (HRN NIR), 18 votes.)
3. Which of the following images presents the annotated woodless area in the MOST ACCURATE way? This area was annotated in red in the reference image.
4. Which of the following images presents the SHAPE of annotated buildings in the WORST way? The buildings of interest are annotated in red in the reference image.
5. In which image the rural roads are presented in the CLEAREST (MOST DETAILED) way? The roads of interest are annotated in red in the reference image.
6. In which image the buildings manifest THE LEAST LEVEL OF DETAIL?
7. In which of the following images the junction is presented in the MOST ACCURATE (DETAILED) way?
8. In which of the following images it is the MOST CHALLENGING to distinguish separate trees?
9. In which of the following images the area of interest (annotated in red in the reference image) is presented in the MOST DETAILED way? (Shown alongside the reference image: PSNR pick (Linear), 3 votes; LPIPS pick (RAMS NIR), 150 votes; SSIM pick (Bicubic), 13 votes.)
10. Which of the following images presents the coastal area in the LEAST ACCURATE (LEAST DETAILED) way? The area of interest is rendered in red in the reference image.
11. Which of the following images presents the roads in the MOST DETAILED way? The roads of interest are annotated in red in the reference image.
12. In which of the following images the urban area is presented in the LEAST DETAILED way?
13. Which of the following images allows one to count the buildings in THE MOST CONVENIENT way? The buildings of interest are annotated in red in the reference image.
14. In which of the following images the parking is presented in the LEAST DETAILED way? The parking of interest is annotated in red in the reference image.
15. Which of the following images presents the roads of interest MOST FAITHFULLY? The roads of interest are annotated in red in the reference image.
Table 2. The MOS survey outcome showing how often each metric was consistent with the answers (in %), stratified into SR networks trained with Proba-V and simulated images, and interpolation.
