An efficient coral survey method based on a large-scale 3-D structure model obtained by Speedy Sea Scanner and U-Net segmentation

Over the last three decades, a large portion of coral cover has been lost around the globe. This significant decline necessitates rapid assessment of coral reef health to enable more effective management. In this paper, we propose an efficient method for coral cover estimation and demonstrate its viability. A large-scale 3-D structure model, with a resolution of 0.01 m along the x, y and z axes, was successfully generated by means of a towed optical camera array system (Speedy Sea Scanner). The survey efficiency attained was 12,146 m²/h. In addition, we propose a segmentation method utilizing the U-Net architecture and estimate coral coverage using a large-scale 2-D image. The U-Net-based segmentation method showed higher accuracy than pixelwise CNN modeling. Moreover, the computational cost of the U-Net-based method is much lower than that of the pixelwise CNN-based one. We believe that an array of these survey tools can contribute to the rapid assessment of coral reefs.

Scientific Reports | (2020) 10:12416 | https://doi.org/10.1038/s41598-020-69400-5 www.nature.com/scientificreports/

In recent years, convolutional neural networks (CNNs) have shown outstanding accuracy in automatic image classification and segmentation 21,22, especially in the field of computer vision. Several studies in the literature have applied variants of the CNN method to coral classification or segmentation using various types of dataset, e.g., those obtained by laboratory experiments or by divers and underwater vehicles 23-25. However, research using large-scale images obtained from the sea remains limited 25, and continued research effort is required to remedy this. Thanks to recent technological innovation, a more efficient image collection system has become available, namely the "Speedy Sea Scanner (SSS)" (Fig. 1), a towed optical camera array that succeeded in producing a large-scale, high-resolution 2-D image (orthophoto) of the seafloor around the Kujuku islands in 2017 26. When those images were collected, the surveying efficiency of the SSS was approximately 7,000 m²/h. According to previous studies, the surveying efficiency of divers or swimmers is approximately 150 m²/h 12,13, while that of AUVs is some 2,470 m²/h at 2 m above the seafloor 27. Thus, the surveying efficiency of the SSS is a dramatic improvement over these earlier methods, and a large number of images can now be obtained with greater ease than before. In addition, precise depth information on the seafloor can be obtained from a 3-D structure model derived from part of the survey area 26. However, a large-scale 3-D structure model of an entire survey area has not yet been generated, and the accuracy of the seafloor's depth distribution has yet to be evaluated.
In conjunction with the development of the SSS technology, and to reduce the time required to analyze huge quantities of data, an automatic coral coverage estimation method based on conventional image segmentation approaches, namely pixelwise CNN and bag-of-visual-words (BoVW), was proposed and the performances of the two approaches were compared 26. In the comparison, the accuracy of pixelwise CNN was found to be better than that of BoVW. However, field sampling data are still lacking, and the substantial computational cost of large-scale coral cover estimation undermined the practical application of the method.
In this study, we demonstrate the effectiveness of the coral cover estimation method we propose herein. We collected seafloor images using the SSS off the coast of Kumejima in Okinawa, Japan, and used them to construct a large-scale 3-D model (see the Methods section for the methodological details of the SSS). In addition, we obtained multibeam echosounder (MBES) data to use as reference data for the seabed topography. In general, MBES data, where collected and available, feed into the General Bathymetric Chart of the Oceans (GEBCO) to generate wider-scale bathymetric data sets for the entire ocean (https://www.gebco.net/). We therefore prepared two digital elevation models (DEMs), one from the SSS data and one from the MBES data. Herein, we refer to them as DEM SSS and DEM MBES, respectively. The resolution in the horizontal plane and the accuracy of the depth information in the vertical plane were then compared between the DEMs. We show that the resolution of DEM SSS is much higher than that of DEM MBES and quantify the difference between the two.
In addition, we propose another segmentation method based on U-Net 28 (which is often used in medical applications 29) and perform the coral cover estimation using the large-scale 2-D image (orthophoto) converted from the 3-D structure model. The computational cost of the U-Net-based segmentation method is much smaller than that of the pixelwise CNN-based one 26. The prediction time of U-Net is about 1/1,000 of that of pixelwise CNN (see the Results section for the details). We believe that an array of these survey tools can contribute to enabling the rapid assessment of coral reefs.

Methods
Data collection. The SSS towed optical camera array system was used for collecting the images. The following is a brief list of the general advantages of the SSS:
• Lower cost of development and maintenance than that for underwater vehicles.
• Higher surveying efficiency than that which can be achieved by methods relying on divers and underwater vehicles.
• Simple operation without additional electrical equipment.
• Robust pair-matching between adjacent images for 3-D structure model generation.
• High portability: it can be carried by a small boat and easily deployed at the survey site, including small islands.
The system's depth rating is 50 m. The baseline of the array is 4.4 m long, with six equally spaced cameras (Panasonic DC-GH5 with custom-made waterproof housings and batteries) installed on the platform. Each optical camera can record up to 6 h of high-definition video at a recording rate of 23.98 frames per second. We chose the baseline length such that two adults could handle the system and carry it to the survey area by small boat. The attitude during towing is held stable by the tailplane, and the tilt angle can be tuned through the attachment position of the towing rope. The system was towed by the survey boat, which was equipped with a navigation system (POS MV, Applanix). The positioning error of the navigation system was approximately ± 1 m. The distance of the SSS from the seafloor was set to around 2-5 m, while the boat maintained a speed of 2-3 knots during the survey. To keep the survey safe and to monitor the vertical position of the SSS, the ship was equipped with a fish echosounder (HDS Gen2, LOWRANCE). In addition to the SSS survey, precise seabed topography was measured using a multibeam sonar (Sonic 2022, R2Sonic LLC) with an operating frequency of 400 kHz. We also used these bathymetric data to validate the accuracy of the depth distribution in the 3-D structure model (DEM SSS) generated from the collected images. The DEM MBES was generated from the sounding data using commercial software (HYPACK, Xylem Inc.). The tidal and sound refraction corrections were conducted following the general processing flow in the software. The sound velocity profile for the sound refraction correction was measured with a conductivity-temperature-depth profiler (CTD; Minos.X, AML Oceanographic Ltd.) before the survey. The vertical resolution of the multibeam sonar was 1.25 cm with 0.9° × 0.9° directivity. The mean density of the sounding data was 7.77 points per grid cell, and we adopted the central value in each cell for the DEM MBES grid data.
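The gridding step at the end of this paragraph can be illustrated with a minimal sketch. It assumes that "central value" means the median of the soundings falling in each grid cell, which the text does not state explicitly; the function name `grid_median`, the 0.5 m cell size and the sample soundings are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def grid_median(soundings, cell_size):
    """Bin (x, y, depth) soundings into square cells; keep the median depth per cell."""
    cells = defaultdict(list)
    for x, y, depth in soundings:
        # Integer cell index; floor division also handles negative coordinates.
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(depth)
    return {key: median(depths) for key, depths in cells.items()}


# Hypothetical soundings in metres (x, y, depth); not data from the survey.
soundings = [
    (0.10, 0.20, 10.2), (0.30, 0.40, 10.4), (0.45, 0.10, 10.9),
    (0.60, 0.20, 12.0), (0.90, 0.30, 12.4),
]
dem = grid_median(soundings, cell_size=0.5)  # assumed cell size
```

A median is less sensitive to isolated outlier soundings than a mean, which is one plausible reason to prefer a central value when each cell holds only a handful of points.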
The images were collected offshore at Kumejima, Okinawa prefecture, Japan on July 6, 2018. Kumejima is surrounded by a wide variety of different marine habitats, e.g., intertidal mudflats and rocky shores, vibrant coral reefs, muddy/sandy substrates and submarine limestone caves. The SSS survey was conducted in an area with water depths spanning 5-45 m. The offshore survey time taken at Kumejima was about 56 min for the seven survey lines.

Large-scale 3-D structure model generation. Details of the data processing methods employed were outlined in our previous study 26. Here, we briefly recall the image processing flow. First, the GPS device and cameras were time-synchronized with GPS time. Next, continuous still images were extracted from the video data; in this study, we extracted two still images per second. Color corrections were then performed on the images. The camera locations were estimated on the basis of the GPS data and added to the corresponding still images. The GPS data were then up-sampled using the cubic spline interpolation method, with an up-sampling rate of 10 times that of the original data points. The vertical distance between the fish echosounder and the SSS was recorded using the fish echosounder with 0.1 m vertical resolution, and a tidal correction was then applied to this distance. The vertical offset between the water surface and the fish echosounder was measured directly. In addition, we directly measured the horizontal distance between the GPS and the SSS during the survey. Using the measured distances, the position offset of the SSS was corrected. A 3-D point cloud was reconstructed from the continuous images using low-cost commercial software (Metashape, Agisoft) employing Structure from Motion (SfM) techniques. SfM is a technique that utilizes a series of 2-D images to construct a 3-D structure model 30,31. From the 3-D structure model, the DEM SSS and the 2-D image (orthophoto) can be produced.
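The GPS up-sampling step can be sketched as follows. This is an illustrative pure-Python natural cubic spline, not the authors' implementation: the boundary condition (zero second derivative at both ends), the function names and the sample track are assumptions, and only the factor of 10 comes from the text.

```python
def natural_cubic_spline(xs, ys):
    """Return per-interval coefficients (a, b, c, d) of a natural cubic spline."""
    n = len(xs) - 1  # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the "natural" boundary condition.
    c = [0.0] * (n + 1)
    b = [0.0] * n
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(ys[j], b[j], c[j], d[j]) for j in range(n)]


def upsample(ts, values, factor=10):
    """Insert factor - 1 spline-interpolated samples between consecutive fixes."""
    new_t, new_v = [], []
    for j, (a, b, c, d) in enumerate(natural_cubic_spline(ts, values)):
        step = (ts[j + 1] - ts[j]) / factor
        for k in range(factor):
            dt = k * step
            new_t.append(ts[j] + dt)
            new_v.append(a + b * dt + c * dt ** 2 + d * dt ** 3)
    new_t.append(ts[-1])
    new_v.append(values[-1])
    return new_t, new_v


# Hypothetical 1 Hz easting track in metres; each coordinate is splined separately.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
easting = [0.0, 1.1, 2.3, 3.2, 4.0]
t10, e10 = upsample(times, easting, factor=10)
```

Up-sampling the track in this way lets each still image (two per second here) be paired with a position close to its own timestamp rather than with the nearest raw GPS fix.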
Network architecture. We built a U-Net-based 28 deep neural network that takes an image of 512 × 512 pixels as input and produces a predicted label image of the same size (see the supplementary Fig. S1). This network, like the U-Net, consists of an encoder part in the first half and a decoder part in the second. The encoder network extracts a small feature map from the input image using convolution (Conv) and pooling (Pool) layers, while the decoder expands it to the original image size using convolution and up-sampling (Upsamp) layers. Each encoder block consists of two 3 × 3 convolution layers with rectified linear unit (ReLU) activation, followed by 2 × 2 max pooling with a stride of two. Each decoder block comprises a 2 × 2 up-sampling layer and two 3 × 3 convolution layers. After each of the first three decoder blocks, a 50% dropout layer was added. In the final layer of the decoder, the feature map was converted into the two classes (coral or non-coral) by a 1 × 1 convolution, and a softmax activation function was then applied. The skip connections link each convolution layer of the encoder to the corresponding up-sampling layer of the decoder in order to preserve high-resolution information from the input image. Each skip connection simply concatenates the channels of the encoder layer with those of the decoder layer. We implemented the above network using the Keras 32 library with the Tensorflow 33 backend.
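To make the encoder and decoder bookkeeping concrete, the sketch below traces feature-map shapes through a U-Net-style network. It is not the authors' Keras code: the depth of four blocks and the 64 base filters are borrowed from the original U-Net as assumptions (the actual values are in supplementary Fig. S1), and "same" padding is assumed because the text states that the output has the same size as the input.

```python
def unet_shapes(input_size=512, depth=4, base_filters=64, n_classes=2):
    """Trace (stage, height, width, channels) through a U-Net-style network."""
    shapes = []
    size, filters = input_size, base_filters
    skips = []
    for _ in range(depth):
        # Two 3x3 "same" convolutions keep the spatial size.
        shapes.append(("encoder conv", size, size, filters))
        skips.append(filters)
        # 2x2 max pooling with stride 2 halves height and width.
        size //= 2
        shapes.append(("encoder pool", size, size, filters))
        filters *= 2
    shapes.append(("bottleneck conv", size, size, filters))
    for _ in range(depth):
        # 2x2 up-sampling doubles the size; the skip connection adds channels.
        size *= 2
        skip_filters = skips.pop()
        shapes.append(("decoder upsample + concat", size, size, filters + skip_filters))
        filters //= 2
        shapes.append(("decoder conv", size, size, filters))
    # A 1x1 convolution maps the features to per-pixel class scores.
    shapes.append(("1x1 conv + softmax", size, size, n_classes))
    return shapes


trace = unet_shapes()
```

Under these assumptions the 512 × 512 input shrinks to 32 × 32 at the bottleneck and is restored to 512 × 512 with two output channels (coral and non-coral).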
Network training and evaluation. For the training of the network, a data-augmentation technique based on rotation 21,34 was employed to improve prediction performance and, in particular, to prevent overfitting. The images used in this study have no specific orientation, and coral remains coral when an image is rotated. Thus, the images rotated by 90, 180 and 270 degrees and the correspondingly rotated labeled coral images were used in the training.
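The rotation-based augmentation can be sketched in a few lines. This is an illustrative version on nested lists rather than the authors' pipeline; the helper names and the tiny 2 × 2 example are hypothetical. The key point is that the image and its label mask must be rotated together.

```python
def rot90(image):
    """Rotate a 2-D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_by_rotation(image, label):
    """Return the original pair plus its 90, 180 and 270 degree rotations."""
    pairs = [(image, label)]
    for _ in range(3):
        # Apply the same rotation to image and label so they stay aligned.
        image, label = rot90(image), rot90(label)
        pairs.append((image, label))
    return pairs


# Hypothetical 2 x 2 image and binary coral mask.
image = [[1, 2], [3, 4]]
label = [[0, 1], [0, 0]]
pairs = augment_by_rotation(image, label)
```

Each labeled image therefore yields four training pairs, so a training split of 160 images would grow to 640 pairs.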
When training the U-Net and pixelwise CNN models, we based the loss function on the F-measure score and trained the networks to maximize this score. We employed Adam 35 Stochastic Gradient Descent (SGD) solver 36 for training the network and explored the optimal hyperparameters within the following ranges: SGD learning rates of 10⁻⁴, 10⁻³ and 10⁻², and epoch numbers of 100, 200 and 300. We fixed the batch size to 4. For the evaluation of the prediction performance, we performed a five-fold cross-validation 37. That is, the 200 images of the dataset were randomly divided into five sub-datasets, and four of these were used to train the U-Net. The sub-dataset that was not used for training was evaluated by accuracy, precision, recall and the F-measure. The cross-validation scores were calculated by averaging over the five training and evaluation sessions with different training sub-datasets.
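The five-fold protocol described above can be sketched as follows. The shuffling seed, the function names and the way folds are cut from the shuffled index list are assumptions made for illustration; only the 200 images and five folds come from the text.

```python
import random


def five_fold_splits(n_items=200, n_folds=5, seed=0):
    """Randomly partition item indices into folds; return (train, test) index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Train on the four folds that are not held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def average_scores(fold_scores):
    """Average one metric over the cross-validation sessions."""
    return sum(fold_scores) / len(fold_scores)


splits = five_fold_splits()
```

Each session trains on 160 images and evaluates on the held-out 40, and every image is used for evaluation exactly once across the five sessions.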
Evaluation metrics. We employed four evaluation metrics, namely accuracy, precision, recall and the F-measure, to evaluate the prediction performances of the U-Net and pixelwise CNN models. The accuracy was defined as the ratio of correctly predicted pixels to all of the predicted pixels. Although this metric indicates overall performance, it is not a suitable measure when the percentage of coral is very low or very high. For example, when the percentage is very low, a model that predicts all pixels as non-coral shows high accuracy. Therefore, we also calculated the F-measure for evaluation using precision and recall. Precision is the fraction of pixels manually labeled as coral among the pixels predicted to be coral. Recall is the fraction of pixels manually labeled as coral that were successfully predicted to be coral. Finally, the F-measure is defined as the harmonic mean of precision and recall as follows:

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

When the values of precision and recall are both high and balanced, the F-measure also reaches a high value. All four metrics range from 0 to 1.
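The four metrics can be computed from flattened binary masks as in the sketch below. The function name and the convention of returning 0 when a denominator is zero are assumptions; the example pixel values are hypothetical and chosen to reproduce the low-coral situation described above.

```python
def segmentation_metrics(pred, truth):
    """Return (accuracy, precision, recall, F-measure) for flat 0/1 pixel lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    accuracy = (tp + tn) / len(truth)
    # Assumed convention: report 0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


# Ten pixels with a single coral pixel, and a model that predicts no coral at all.
truth = [1] + [0] * 9
all_non_coral = [0] * 10
acc, prec, rec, f1 = segmentation_metrics(all_non_coral, truth)
```

Here the accuracy is 0.9 even though the model finds no coral, while recall and the F-measure are both 0, which is why the F-measure is reported alongside accuracy.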

Results and discussion
Reconstructed optical map of the seafloor. The 3-D structure model was generated from 30,957 images obtained across seven survey lines (Fig. 2). The total length of the survey lines was around 1,838 m. The resolution along the x, y and z axes was 0.01 m, and the corals can be identified in the constructed model. The survey site is a well-known diving spot, and some drop-offs with depth differences of around 5-7 m can be identified.
The large-scale 2-D image was produced from the 3-D structure model and is illustrated in Fig. 3. A survey area of 11,434 m² was covered, yielding a calculated survey efficiency of 12,146 m²/h. The pixel resolution in the horizontal plane (x-y plane) is about 3.5 mm/pixel (± 0.4%); the viewing scale can be adjusted in any type of commercial or free geographical information system (GIS) software. As shown in Fig. 3, the resolution was sufficient to identify the coral. A large quantity of coral can be identified in the high-resolution image in Fig. 3, and the presence of at least 10 individual species of coral, such as Pocillopora eydouxi and P. verrucosa, was confirmed in these data by an expert.
In addition, the DEM SSS inside the black border line was produced from the 3-D structure model and overlaid onto the DEM MBES (background), as shown in Fig. 4. The connection between the DEM SSS and DEM MBES appears seamless. To compare the DEM resolutions, enlarged views are shown in Fig. 4a,b. The resolution of the image (horizontal plane) is 0.5 m/pixel in Fig. 4a and 0.01 m/pixel in Fig. 4b; thus, the seafloor structure can be delineated precisely using the DEM SSS. The accuracy of the photogrammetry method has been well discussed in the literature (approximately 1-2 mm at 3 m distance) 38. The distribution of depth differences in the vertical plane (elevation) was calculated and is illustrated as the color gradation in Fig. 5a. From this figure, it can be seen that the difference is large around the slope area. In addition, Fig. 5b picked up from the point data at the characteristic land features. From these results, the error was larger in the horizontal plane than in the vertical direction. We assume that the depth differences were mainly caused by the offset in the horizontal plane due to the GPS positioning error (± 1 m). Although a slight difference in the vertical plane is observed, this high-resolution DEM SSS will offer useful information for the advanced surveying of seabed topography, especially in shallow coastal areas. This precise seabed topography will contribute not only to coral surveys but also to other ecological, engineering and geographical studies, e.g., high-resolution advection modeling and structural calculations of natural reefs 39-41. The survey efficiency of 12,146 m²/h achieved in this study is higher than the 7,000 m²/h of the previous study 26 because six cameras were used in this case, whereas only five were used in the previous study owing to battery problems. In addition, the water transparency was better than before (see the supplementary Fig. S3); therefore, we could maintain the SSS at a high altitude of around 3-5 m. Thus, the efficiency of the SSS is approximately five times greater than that of an AUV and some 80 times higher than that of diving, making it suitable for the rapid assessment of coral reefs.
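The efficiency figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text (11,434 m², about 56 min, 12,146 m²/h, 2,470 m²/h and 150 m²/h); the function and variable names are illustrative.

```python
def survey_efficiency(area_m2, minutes):
    """Surveyed area per hour, in m^2/h."""
    return area_m2 / (minutes / 60.0)


# Taking "about 56 min" literally gives a slightly higher value than reported.
efficiency_at_56_min = survey_efficiency(11434, 56)

# Survey time implied by the reported 12,146 m^2/h.
implied_minutes = 11434 / 12146 * 60

# Ratios against the AUV and diver efficiencies cited in the introduction.
auv_ratio = 12146 / 2470
diver_ratio = 12146 / 150
```

With exactly 56 min the efficiency comes out near 12,251 m²/h, so the reported 12,146 m²/h corresponds to roughly 56.5 min, consistent with "about 56 min". The ratios are about 4.9 relative to the AUV and about 81 relative to divers.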
Of course, conditions differ from one survey site to another; therefore, the optimal survey strategy should be sought for each site. The use of an acoustic positioning system or of known benchmark positions on the seafloor would be one way to maintain the accuracy of the DEM SSS. Also, for surveys in deeper water or under more turbid conditions, LED lights should be used, and care should be taken to operate the towed camera array system safely on a long towing rope so that it does not hit the corals.

Evaluation of U-Net-based segmentation.
In this study, we propose and evaluate a U-Net-based coral segmentation approach for the efficient surveying of large areas, such as that depicted in Fig. 3 (see the Methods section for details of the U-Net model and data processing). For training and evaluation, we divided the entire dataset (Fig. 3a) into 14,016 images of 512 × 512 pixels. Each divided image covers about 3.2 m². We randomly selected 200 of the divided images and manually labeled the coral in them under the supervision of coral experts. The images in the leftmost and rightmost columns in Fig. 6 are examples of the divided images and labeled coral images, respectively.
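The tiling and the quoted tile area can be illustrated as follows. The orthophoto dimensions are not given in the text, so the sizes used in the example are hypothetical, as are the function names; how edge tiles were handled is also not stated, so here they are simply clipped and would need padding before being fed to a fixed-size network.

```python
TILE = 512  # tile side in pixels, as stated in the text


def tile_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) pixel boxes covering the image row by row."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Edge tiles are clipped to the image bounds (an assumption).
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def tile_area_m2(tile=TILE, metres_per_pixel=0.0035):
    """Ground area of one square tile at the stated 3.5 mm/pixel resolution."""
    side = tile * metres_per_pixel
    return side * side


# Hypothetical orthophoto size, used only to demonstrate the tiling.
boxes = tile_boxes(1024, 1536)
area = tile_area_m2()
```

A 512-pixel tile at 3.5 mm/pixel has a side of about 1.79 m, giving the area of about 3.2 m² quoted above.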
We then performed training and performance evaluations on the dataset of the 200 image pairs above. The color correction (CC) 26 and the rotation-based data augmentation (DA) 21,34 applied to the obtained images may affect prediction performance. Therefore, we trained and evaluated four types of U-Net models, with and without CC and with and without DA. Furthermore, to compare prediction performance with the U-Net model, we employed the pixelwise CNN model, which had exhibited good performance in our previous work 26. We evaluated the performances of the pixelwise CNN models with different input window sizes of 32 × 32, 48 × 48, 64 × 64, 96 × 96, 128 × 128 and 160 × 160, because the size of the local images used for the input window of the pixelwise CNN model greatly influences the prediction performance (see the Methods section for details of the training procedure and evaluation metrics). Figure 6 shows prediction examples for two test images, A and B. The images in the leftmost column are the original ones, while the images in the second column were obtained by color correcting the originals. The images in the third and fourth columns are the results predicted using the U-Net with CC and DA and the pixelwise CNN (window: 64 × 64 pixels) models, respectively. The results for the different processing conditions (CC and Table 1 and Fig. 7a show the evaluated performances of U-Net with and without CC and DA, as well as those of pixelwise CNN with different window sizes using the images with CC and DA. The predictions by all variants of U-Net achieved high levels of accuracy (> 0.9). From the results listed in Table 1, it can be confirmed that performance tends to increase with the application of CC and DA. The U-Net model with both CC and DA showed the highest accuracy (0.910) and F-measure (0.772). The pixelwise CNN results show that performance tends to increase with increasing window size. However, Fig. 7a clearly shows that the accuracy (blue dashed line) and F-measure (orange dashed line) of the U-Net are better than those of the pixelwise CNN. These results indicate that the U-Net has high predictive performance and that both CC and DA are effective for improving it. While pixelwise CNN uses the local information of window Fig. S1). Therefore, U-Net is considered to have achieved higher performance than pixelwise CNN.
We assessed the details of the relationship between prediction performance and prediction time. Figure 7b displays prediction times per image using U-Net and pixelwise CNN with different window sizes. We used an Nvidia GeForce GTX 1080 Ti GPU with an Intel Xeon E5-2630 v4 CPU. These results indicate that the prediction time of pixelwise CNN rapidly increases as the input window size expands, while the prediction time of U-Net is very short (0.057 s). Note that the prediction time of U-Net does not change because the input size is constant (512 × 512 pixels). The prediction time of U-Net is about 1/1,000 of that of pixelwise CNN with a window size of 64 × 64. The results shown in Fig. 7a,b indicate that U-Net-based prediction is more accurate and substantially faster than pixelwise CNN.
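To give a sense of scale, the per-image times above can be extrapolated to the 14,016 tiles of the full orthophoto. This is a back-of-the-envelope sketch, not a measured result: it treats "about 1/1,000" as an exact factor of 1,000 and ignores any overhead outside the network's forward pass.

```python
UNET_SECONDS_PER_TILE = 0.057   # stated U-Net prediction time per image
N_TILES = 14016                 # number of 512 x 512 tiles in the orthophoto
SPEED_FACTOR = 1000             # "about 1/1,000", treated as exact here

# Pure prediction time for all tiles with U-Net.
unet_total_seconds = UNET_SECONDS_PER_TILE * N_TILES

# Implied per-tile and total times for pixelwise CNN (64 x 64 window).
cnn_seconds_per_tile = UNET_SECONDS_PER_TILE * SPEED_FACTOR
cnn_total_days = cnn_seconds_per_tile * N_TILES / 86400
```

Under these assumptions U-Net needs roughly 800 s of pure prediction time for the whole area, whereas pixelwise CNN would need on the order of nine days. The 1,120 s reported for the full estimation is somewhat longer than 800 s; the text does not break this down, so the difference may reflect steps other than prediction.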
Estimation of coral cover in the surveyed area. We built a prediction model for the entire surveyed area using all 200 images and the U-Net with CC and DA, which had exhibited the best performance in the above evaluations. The 2-D image (orthophoto) of the entire surveyed area was divided into 14,016 local images (512 × 512 pixels). We estimated the quantity of coral in the surveyed area (11,434 m²) using the built model and the divided images. The calculation time for this estimation was 1,120 s (18.7 min) using the same GPU and CPU as outlined above. Figure 8 shows the overall coral coverage predicted by the model. The predicted percent coral cover ranged from 0 to 35%. According to the previous survey, conducted in 2011 by scuba divers using the manta method, the coral cover in the area was estimated to be around 25 to 50% 42. The present estimates are about half of those values, so our results indicate a decline in coral cover, which may be due to the 2016 bleaching event 43. As previously described, the changes to coral reefs have been dramatic, and determining the mechanisms underlying them requires the capacity to assess reefs rapidly. In addition, the U-Net-based segmentation method could potentially be applied to studies of species cover or disease prevalence. Although the fields are different, Saito et al. have classified the layers of two-dimensional materials into three classes 44. Also, Kohl et al. have classified images of street scenes taken from a camera into 19 classes, including person, car and road 45. As remarked above, the efficient survey method presently under discussion has the potential to become a useful tool for quantitatively investigating biological systems such as coral.
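Once each tile has a predicted binary mask, percent coral cover follows from a pixel count. The sketch below is illustrative only: the function names and the small example masks are hypothetical, and pooling pixels over tiles is one simple way to aggregate, not necessarily the one used for Fig. 8.

```python
def coral_cover_percent(mask):
    """Percentage of pixels labelled coral (1) in a 2-D binary mask."""
    total = sum(len(row) for row in mask)
    coral = sum(sum(row) for row in mask)
    return 100.0 * coral / total if total else 0.0


def pooled_cover_percent(masks):
    """Percent cover over several tiles, pooling all of their pixels."""
    total = sum(len(row) for mask in masks for row in mask)
    coral = sum(sum(row) for mask in masks for row in mask)
    return 100.0 * coral / total if total else 0.0


# Hypothetical predicted masks for two tiny tiles.
tile_a = [[1, 0, 0, 0], [1, 0, 0, 0]]
tile_b = [[0, 0, 0, 0], [0, 0, 0, 0]]
cover_a = coral_cover_percent(tile_a)
cover_all = pooled_cover_percent([tile_a, tile_b])
```

Per-tile values of this kind give a spatial map of cover, while the pooled value summarizes the whole surveyed area.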

Conclusions
In this paper, we proposed an efficient method for coral cover estimation and demonstrated its viability. A large-scale 3-D structure model, with a resolution of 0.01 m along the x, y and z axes, was successfully generated by means of a towed optical camera array system (Speedy Sea Scanner). The survey efficiency attained was 12,146 m²/h. In addition, we proposed a segmentation method utilizing the U-Net architecture and estimated coral coverage using a large-scale 2-D image. The U-Net-based segmentation method showed higher accuracy than pixelwise CNN modeling. Moreover, the computational cost of the U-Net-based method is much lower than that of the pixelwise CNN-based one. We believe that an array of these survey tools can contribute to the rapid assessment of coral reefs.