Background & Summary

Object detection is becoming faster and more accurate with the development of AI technology. This helps underwater robots achieve better performance in accident rescue, facility maintenance, biological investigation and other underwater applications. Onshore AI algorithms have developed rapidly on the basis of rich, high-quality datasets; transferring these achievements from land to underwater requires appropriate underwater datasets. Several underwater optical datasets already exist, such as the Brackish1 dataset, the Segmentation of Underwater IMagery2 (SUIM) dataset and the Detecting Underwater Objects3 (DUO) dataset, supporting research on object detection, semantic segmentation and other AI applications. However, due to the scattering and attenuation of light in water, underwater optical imaging is difficult and often yields low-quality images, so acoustic sensors are widely used to perceive the underwater environment. The multibeam forward-looking sonar (MFLS) is portable enough for underwater robots while providing dynamic, high-resolution real-time image data, which makes it well suited to scenarios requiring close and detailed inspection.

There have been previous works on underwater object detection and related applications with MFLS in AI areas. Haoting Zhang et al. proposed MFLS image target detection models based on the You Only Look Once (YOLO) v5 network4. Zhimiao Fan et al. proposed a modified Mask Region Convolutional Neural Network (Mask RCNN) for MFLS image object detection and segmentation5. Longyu Jiang et al. proposed three simple but effective active-learning-based algorithms for MFLS image object detection6. Alan Preciado-Grijalva et al. investigated the potential of three self-supervised learning methods (RotNet, Denoising Autoencoders, and Jigsaw) to learn sonar image representations7. Gustavo Divas Karimanzira et al. proposed an underwater object detection solution with MFLS based on RCNN and deployed it on an NVIDIA Jetson TX28. Neves et al. proposed a multi-object detection system using two novel convolutional neural network-based architectures that output object position and rotation from sonar images to support autonomous underwater vehicle (AUV) navigation9. Xiang Cao et al. proposed an obstacle detection and avoidance algorithm for an AUV with MFLS, using the YOLOv3 network for obstacle detection10. The datasets used in these studies have various shortcomings, such as small sample size, few object categories, and virtual image data generated by style transfer. Most importantly, most of these datasets are not public.

Underwater data collection often comes with high economic, labor and time costs. Operating MFLS devices and annotating sonar images require professionals, while most researchers are inexperienced with MFLS, which results in few related public datasets11,12,13,14,15. Developing MFLS-based algorithms requires a large number of sonar images to verify and improve approaches; for example, deep-learning object detection methods require data to train neural networks. Some researchers have applied sonar simulation11,12 and image translation technology13,14,15 as a workaround for the lack of data, but owing to the complexity of acoustic propagation and the instability of the underwater environment, differences always remain between generated images and real images. In contrast to the situation for optical images, the lack of datasets hinders research on object detection with MFLS images in AI areas. Our proposed dataset aims to improve this situation.

Considering the recognition habits of human vision, MFLS devices generally provide images processed with filtering and pseudo-coloring, which may cause the loss of effective data. Based on acoustic propagation characteristics, the MFLS provides range and azimuth angle information, so an image in sector representation achieves better visual perception; however, invalid information is introduced into the areas beyond the sector. Figure 1 shows the MFLS raw image and the processed images. The images contain the same three target objects: a ball, a square cage and a human body model.

Fig. 1
figure 1

MFLS raw and processed images.
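The sector conversion described above amounts to a polar-to-Cartesian remap. The following is a minimal sketch, assuming the raw image stores range bins as rows (row 0 = nearest range) and beams as columns spanning a 120° field of view; it is an illustration, not the manufacturer's rendering code.

```python
import numpy as np

def polar_to_sector(raw, fov_deg=120.0):
    """Remap a raw sonar image (rows = range bins, cols = beams) onto a
    Cartesian sector view by nearest-neighbour lookup. Pixels outside
    the sector stay zero -- the "invalid" area mentioned in the text."""
    n_ranges, n_beams = raw.shape
    half_fov = np.deg2rad(fov_deg) / 2.0
    width = int(2 * n_ranges * np.sin(half_fov)) + 1
    out = np.zeros((n_ranges, width), dtype=raw.dtype)
    ys, xs = np.mgrid[0:n_ranges, 0:width]
    x = xs - width / 2.0        # horizontal offset from the sonar origin
    y = n_ranges - ys           # forward distance (sonar sits at bottom centre)
    r = np.hypot(x, y)
    theta = np.arctan2(x, y)    # 0 rad = straight ahead
    valid = (r < n_ranges) & (np.abs(theta) <= half_fov)
    beam = ((theta + half_fov) / (2 * half_fov) * (n_beams - 1)).astype(int)
    out[valid] = raw[r.astype(int)[valid],
                     np.clip(beam[valid], 0, n_beams - 1)]
    return out
```

Rendering the sector on demand from the raw polar data is why publishing the raw images, rather than the padded sector images, loses no information.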

There have been several MFLS datasets providing processed image data. Erin McCann et al. provided a sonar dataset containing 8 fish species for fish classification and fishery assessment16. Deepak Singh et al. provided a sonar dataset containing typical household marine debris and distractor marine objects in 11 classes for semantic segmentation17. Matheus M. Dos Santos et al. published the ARACATI 2017 dataset, which provides optical aerial and acoustic underwater images for cross-view and cross-domain underwater localization18; pontoon objects and moving boats are present in its MFLS data. Figure 2 shows an example comparison of these three datasets and our UATD dataset.

Fig. 2
figure 2

Comparison of MFLS datasets.

Our UATD dataset directly addresses the above two issues. Starting from 2020, we collected MFLS data in Maoming and Dalian, China, in lake and shallow-water environments. The dataset provides raw, high-resolution MFLS images with annotations for 10 categories of target objects, together with a benchmark of SOTA detectors on UATD covering both efficiency and accuracy indicators. This dataset should promote research on underwater object detection based on MFLS. Our work has supported three consecutive editions of the China Underwater Robot Professional Contest (URPC), providing the dataset for the underwater target object detection algorithm competition.


Collecting MFLS data

The Tritech Gemini 1200ik multibeam forward-looking sonar was used for data collection. The sonar operates at two acoustic frequencies: 720 kHz for long-range target detection, and 1200 kHz for enhanced high-resolution imaging at shorter ranges. Table 1 shows the acoustic specifications of the sonar. The Gemini software development kit, which provides the raw data of sonar images, is available for Windows and Linux operating systems.

Table 1 Acoustic specifications of Tritech Gemini 1200ik.

We designed a mechanical structure to mount the sonar for data collection, as shown in Fig. 3a. The sonar is fixed to a box structure, which is mounted to the end of a metal rod. A connecting piece installed in the middle of the rod fixes the collection equipment to the hull and allows us to adjust the rod to control the depth of the sonar below the surface and its tilt angle toward the bottom during collection. The sonar data collection structure mounted on the boat is shown in Fig. 3b.

Fig. 3
figure 3

Data collection equipment.

We performed the experiments in two places: Golden Pebble Beach at Dalian (39.0904292°N, 122.0071952°E) and Haoxin Lake at Maoming (21.7011602°N, 110.8641811°E). The experimental environments and satellite maps with the experimental areas marked are shown in Fig. 4. The experimental waters range from a depth of 4 meters to 10 meters at Dalian and are about 4 meters deep at Maoming.

Fig. 4
figure 4

Environment and satellite map of experiments performed.

Ten categories of target objects were selected: cube, ball, cylinder, human body model, tyre, circle cage, square cage, metal bucket, plane model and ROV, as shown with their scales in Fig. 5. Each object was individually tied to a floating ball with a long rope so that its rough location could be identified from the ball on the surface. As a result, the objects might be suspended in the water or lie on the bottom.

Fig. 5
figure 5

Objects of the dataset. The scale of each object is shown below its name in the figure. Scales are in meters: L (length), W (width), H (height), R (radius).

After deploying the objects, we drove the boat mounted with the sonar and cruised around the selected sites, searching for the target objects and recording data while adjusting the sonar direction.

Object annotations

The shape of the same target may change when the sonar images it from different positions and angles, which makes it difficult for an annotator to judge the target category by experience and intuition alone. Therefore, we developed an annotation software for sonar images named the forward-looking sonar label tool (OpenSLT). Compared with other annotation tools, OpenSLT has two new features: (1) it takes an image stream as input, and (2) it supports real-time annotation. These features allow us to overcome the problem mentioned above. OpenSLT consists of three modules: a toolbar, an image display area and an annotation display area, as shown in Fig. 6a. The tool first receives the raw sonar data as input and plays it as a video stream. The playback speed can be accelerated or slowed down until the target is found; the annotator can then press the pause button and annotate in the image display area with the mouse. The annotation is automatically generated and saved locally. This method enables the annotator to continuously track the target object while annotating, as shown in Fig. 6, avoiding situations where the target position cannot be confirmed or the target type cannot be judged when it reappears.

With the data protocol provided by Tritech, we extract the sonar working information and image information from every frame of the original sonar data, including working range, frequency, azimuth, elevation, sound speed and image resolution, and store this information in a CSV file. OpenSLT loads the CSV file and retains this information during annotation. In addition, OpenSLT writes file path information into the annotation for batch processing.

Fig. 6
figure 6

Sequential frame annotation. This example shows the moving ROV during the data collection.
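The per-frame working information described above could be stored and reloaded roughly as follows; the CSV column names here are illustrative assumptions, not the exact OpenSLT schema.

```python
import csv

# Assumed column names covering the fields named in the text;
# OpenSLT's actual schema may differ.
FIELDS = ["frame", "range_m", "frequency_hz", "azimuth_deg",
          "elevation_deg", "sound_speed_mps", "width_px", "height_px"]

def write_frame_metadata(path, frames):
    """Store per-frame sonar working information in a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(frames)

def read_frame_metadata(path):
    """Reload the per-frame records (all values come back as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```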

Data Records

The UATD dataset is openly available to the public in a figshare repository19. The dataset contains 9200 image files in BMP format, each paired with an annotation file in XML format, and is divided into three ZIP archives. "UATD-Training" contains 7600 pairs of images and annotations; the remaining two parts contain 800 pairs each. Each part consists of two folders: the image files are in the folder named "image" and the annotation files in the folder named "annotation".
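Given this layout, image/annotation pairs can be enumerated by matching filename stems. A small sketch, using the folder names stated above:

```python
from pathlib import Path

def list_pairs(part_dir):
    """Match each BMP in <part>/image with the XML in <part>/annotation
    that shares its filename stem, following the UATD folder layout."""
    part = Path(part_dir)
    pairs = []
    for img in sorted((part / "image").glob("*.bmp")):
        ann = part / "annotation" / (img.stem + ".xml")
        if ann.exists():
            pairs.append((img, ann))
    return pairs
```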

The class distribution of the objects is shown in Fig. 7a, and statistics of the collection ranges and frequencies in Fig. 7b. In total, 2900 images were collected at 720 kHz and 6300 at 1200 kHz. The sonar working range varied from 5 meters to 25 meters during data collection, so we count the number of images within 1-meter range bins, separately for the 720 kHz and 1200 kHz working frequencies, as shown in Fig. 7b.

Fig. 7
figure 7

An overview of the distribution statistics of UATD dataset.
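The per-range counting described above amounts to a 1-meter histogram over the working ranges; a minimal sketch:

```python
import numpy as np

def count_by_range(ranges_m, lo=5, hi=25):
    """Count images in 1 m working-range bins between lo and hi meters."""
    edges = np.arange(lo, hi + 1)            # bin edges 5, 6, ..., 25
    counts, _ = np.histogram(ranges_m, bins=edges)
    return {int(e): int(c) for e, c in zip(edges[:-1], counts)}
```

Running this once per working frequency reproduces the two per-frequency distributions plotted in Fig. 7b.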

A pair of UATD data files is shown in Fig. 8 as an example. The echo intensity data are stored in the first channel of the BMP image file; the remaining two channels duplicate the first. The annotation file is divided into four sections. The "sonar" section provides the basic sonar working information at the moment the corresponding image was collected; in the example, the range is 14.9941 m, the azimuth 120°, the elevation 12°, the sound speed 1582.4 m/s and the frequency 1200 kHz. All of these parameters are parsed directly from the sonar output data stream. The "file" section provides the relative paths of the image file and annotation file; the "filename" parameter gives the common filename prefix shared by a pair of image and annotation files. The "size" section provides the image information: in this example, the resolution is 1024 × 1428 with 3 channels. The "object" section provides the category name under the "name" tag and the bounding box of the object, in pixels, under the "bndbox" tag.

Fig. 8
figure 8

An example of UATD dataset.
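An annotation file with the four sections described above can be parsed with the standard library. In this sketch, the "bndbox" child names (xmin/ymin/xmax/ymax) follow the Pascal VOC convention and are an assumption, as the text only names the "bndbox" tag itself.

```python
import xml.etree.ElementTree as ET

def parse_uatd_annotation(xml_text):
    """Extract the "sonar" working parameters and the labelled objects
    from a UATD annotation; returns (sonar_dict, object_list)."""
    root = ET.fromstring(xml_text)
    sonar = {child.tag: child.text for child in root.find("sonar")}
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            # VOC-style corner order is assumed here
            "bbox": [int(float(box.findtext(t)))
                     for t in ("xmin", "ymin", "xmax", "ymax")],
        })
    return sonar, objects
```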

Technical Validation

The appearance of the same underwater target at different MFLS imaging angles generally differs, which poses a great challenge for the subsequent labeling work. To address this challenge, we adopted three measures to ensure the accuracy of the sonar image annotations. First, three members of our team were responsible for annotation and completed the labeling individually after the collected data were randomly assigned; cross-checking with the data playback function in OpenSLT was then performed to reduce manual annotation errors. Second, during data collection we recorded the video stream of processed, human-vision-friendly images from the Gemini 1200ik synchronously with the raw data. OpenSLT played back the raw data as a stream corresponding to the recorded video, so it was convenient to detect and track objects by comparison with the video while annotating, avoiding labeling errors caused by losing the target. Finally, we handed the annotated data over to a professional data management company cooperating with us, whose staff checked the data again to ensure correctness.

Object detection benchmarks

A benchmark based on our dataset is given in Table 2. MMDetection20 is currently one of the best open-source object detection toolboxes based on PyTorch; it provides a variety of SOTA detectors and is simple to use, so the benchmarks were generated with MMDetection (v2.25.0). We chose Faster R-CNN21 and YOLOv322 with various backbones as our object detectors, as they are the most popular two-stage and one-stage SOTA detectors respectively.

Table 2 Benchmark of Faster-RCNN and YOLOv3 with various backbones performed on UATD.

The evaluation considered both accuracy and efficiency. We first adopted the evaluation metrics mean average precision (mAP) and mean average recall23 (mAR) to measure the accuracy of the detectors on UATD. Then, efficiency was tested on a local computer and the indicators FPS, Params and FLOPs were reported. Details are as follows:

Accuracy metrics:

mAP - The mean AP over all categories (10 in UATD) at an intersection over union (IoU) threshold of 0.5.

mAR - The mean average recall over all categories (10 in UATD).

APname - The AP of a single class (name is one of the classes in UATD).

Efficiency metrics:

Params - The parameter size of the model.

FLOPs - Floating-point operations, with an input image size of 512 × 512.

FPS - Frames per second.
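The accuracy metrics above all rest on the IoU criterion; a minimal box-IoU sketch, with boxes given as [xmin, ymin, xmax, ymax] in pixels:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes
    given as [xmin, ymin, xmax, ymax]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

Under the mAP metric above, a prediction counts as a true positive when its IoU with a ground-truth box of the same class is at least 0.5.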

The models were trained on a local machine with an NVIDIA GeForce GTX 1080 GPU. The height of the input sonar image is resized to 512 pixels while the width is scaled to preserve the original aspect ratio, in both training and inference. Parameters pretrained on ImageNet24 were used to initialize the backbone. During training, the initial learning rate was set to 0.0005 and decayed by a factor of 0.1 at the 8th and 11th of 12 total epochs. A warm-up strategy was adopted, increasing the learning rate linearly from a warm-up ratio of 0.0001 over the first 500 iterations. The Adam optimizer was employed to train the models.
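The schedule described above maps onto an MMDetection 2.x-style config fragment roughly as follows. The key names mirror MMDetection's standard schedule configs; this is a sketch of the stated hyperparameters, not the authors' published configuration files.

```python
# Schedule sketch in MMDetection 2.x config style (assumed, not the
# authors' exact config).
optimizer = dict(type='Adam', lr=0.0005)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,       # linear warm-up over the first 500 iterations
    warmup_ratio=0.0001,
    step=[8, 11])           # multiply lr by 0.1 at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
```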

According to the benchmark, Faster R-CNN with a ResNet-18 backbone achieves the best mAP of 83.9% and the best mAR of 89.7%. On the other hand, YOLOv3 with a MobileNetV2 backbone performs well in efficiency, with only 3.68 M Params and 4.22 G FLOPs, as well as the fastest inference speed of 93.4 FPS on the local machine.