Automating the assessment of biofouling in images using expert agreement as a gold standard

Biofouling is the accumulation of organisms on surfaces immersed in water. It is of particular concern to the international shipping industry because it increases fuel costs and presents a biosecurity risk by providing a pathway for non-indigenous marine species to establish in new areas. There is growing interest within jurisdictions to strengthen biofouling risk-management regulations, but it is expensive to conduct in-water inspections and assess the collected data to determine the biofouling state of vessel hulls. Machine learning is well suited to tackle the latter challenge, and here we apply deep learning to automate the classification of images from in-water inspections to identify the presence and severity of fouling. We combined several datasets to obtain over 10,000 images collected from in-water surveys, which were annotated by a group of biofouling experts. We compared the annotations from three experts on a 120-sample subset of these images, and found that they showed 89% agreement (95% CI: 87–92%). Subsequent labelling of the whole dataset by one of these experts achieved similar levels of agreement with this group of experts, which we defined as performing at most 5% worse (p = 0.009–0.054). Using these expert labels, we were able to train a deep learning model that also agreed similarly with the group of experts (p = 0.001–0.014), demonstrating that automated analysis of biofouling in images is feasible and effective using this method.


Introduction
Global trade relies on the international shipping industry, which has been implicated in the spread of many marine non-indigenous species (NIS) around the world 1,2 . Modern vessels have two primary pathways for translocating NIS, namely (i) as stowaways in ballast water, or (ii) attached to the vessel surface as biofouling 3 ; examples of each follow. Ballast water was the likely vector for zebra mussels to spread from Europe to the Great Lakes in North America 4 , where they have led to increases in toxic blue-green algae 5 and cost industry more than $200 million per year in maintaining water intake structures 6 . Biofouling is one of the most significant pathways for the spread of non-indigenous seaweeds 7-9 , which can outcompete native species 10 , make native kelp forests less resilient 11 and adversely impact fishing and tourism operations 12,13 .
Although vessels are incentivised to manage their biofouling to reduce hydrodynamic drag and fuel costs 3,14 , this is a challenging undertaking, and fouling can occur even on hulls that employ current best practice 15,16 . The primary method of biofouling management is the regular application of anti-fouling coatings. These contain biocides, such as copper, or create a surface that releases organisms or dissuades attachment, slowing the accumulation of biofouling 17 . A vessel's operating profile contributes to fouling risk, with extended periods of inactivity being associated with higher biofouling pressure 18 . Niche areas, such as sea chests, propellers, and other complex surface structures, are at high risk of becoming fouled, as they can offer a sheltered environment for fouling organisms to establish. They are also a lower priority for management because they contribute less to hydrodynamic drag than the flat surfaces of the hull 19 .
There is growing interest in closer management of the biofouling pathway by biosecurity regulators 20,21 . New Zealand has implemented a clean hull standard that sets requirements for vessels to manage biofouling and proposed a clean hull threshold to determine the potential biosecurity risk of a vessel 22 . For vessels staying longer than three weeks or visiting areas other than those designated as places of first arrival, any macrofouling other than goose barnacles is considered a biosecurity risk, while for short-stay vessels there are macrofouling coverage thresholds 16,23 . In implementing this policy, New Zealand has stressed the vessel management requirements rather than the thresholds, as even with best management practices ships can become fouled 16 .
Australia has also proposed requirements for vessels to implement biofouling management practices or provide evidence that their fouling is appropriately controlled 21 .
In-water inspections are the best way to verify that biofouling standards are being met and to collect the data needed to measure the effectiveness of biofouling management practices. However, in-water inspections are expensive, require specialist dive teams to operate in an environment with a number of health and safety risks, and restrict the activities a vessel can undertake while they are conducted 24 . A biofouling expert also needs either to be present during the inspection or to review the images and footage gathered afterwards, which can be a costly and time-consuming process. An alternative is to employ an underwater drone or remotely operated vehicle (ROV), which would enhance data collection opportunities but also potentially increase the burden on the expert interpreting the data.
In this paper we explore the potential for deep learning, a type of machine learning which models phenomena using deep neural networks, to automate or assist the analysis of biofouling inspection data. In the last decade deep learning has revolutionised computer vision; indeed, many regard AlexNet, the 2012 winner of the ImageNet visual recognition challenge 25 , as the watershed moment for deep learning 26 . AlexNet was among the first deep convolutional neural networks (CNNs) 27 , an architecture that is particularly suited to computer vision tasks. Our approach is motivated by the plethora of successful applications of deep CNNs to complex image recognition tasks, from identifying wild animals in camera trap images 28,29 to identifying coral species 30 .
A prominent example of automating biofouling image analysis is CoralNet, a machine learning method initially designed for annotating benthic surveys of coral reefs using a random annotation point approach 31 . CoralNet has been applied to assess the level of cover of different species and higher-level taxonomic groups present in fouling communities on oil platforms on the UK continental shelf, using images taken by ROVs 32 . Our aim in this study was to develop a method that could be used to assess biosecurity risk, and in this context CoralNet was less suitable. Most of the images of vessel hulls available for developing our method had limited biofouling coverage. Unlike coral reefs and oil platforms, vessels are not stationary and actively manage their biofouling. This makes sampling error an important consideration for annotation point approaches like CoralNet; CNNs consider the whole image and so do not have this weakness.
Determining the potential biosecurity risk of a vessel also does not require the identification of particular species. A positive relationship has been found between the degree of biofouling present on a vessel and the number of NIS present 23 . This has led many jurisdictions, such as New Zealand, to require biofouling to be managed holistically rather than targeting specific species 16 . Species-based approaches also scale poorly in the marine context: these communities contain a large number of species, the taxonomy is highly complex, and previously unobserved species are common 20,21 . Instead, we aimed to identify the presence and severity of biofouling, which is a much simpler problem. This also makes our model more transferable to locations outside the support of the current data, and it may be of use to other jurisdictions, although performance will likely be improved by adding local examples to the training data.

Dataset
We assembled a dataset of 10,263 images collected from in-water surveys of around 300 commercial and recreational vessels. This dataset comprised images provided by three jurisdictions, namely: the Australian Department of Agriculture, Water and the Environment (DAWE), the New Zealand Ministry for Primary Industries (MPI), and the California State Lands Commission (CSLC). Examples from the CSLC dataset are available in the literature 33 , and the MPI dataset has previously been used to inform vessel biofouling management in New Zealand 16,23,34 .
Each image was accompanied by a label according to the six-class Level of Fouling (LoF) scheme 35 . However, due to inconsistencies in the LoF labelling across the three jurisdictions, it was necessary to relabel the dataset in a systematic way. We first devised a Simplified Level of Fouling (SLoF) scale (Table 1). This scale was based on the LoF scheme, but collapsed the six levels into pairs to create a three-class scale. This was the simplest possible set of annotations that supported our goal of identifying images with fouling present and highlighting images with severe fouling. We then separately asked experts and workers from the Amazon Mechanical Turk platform to annotate images according to our SLoF scheme. For the former, we engaged three biofouling experts from Ramboll New Zealand who hold qualifications in marine biology. Due to time and budget constraints, we asked the experts to grade only a set of 120 images from the DAWE dataset, constructed by stratified random sampling to ensure balance across LoF levels. We will call this set of 120 images the expert-labelled dataset.
Next, the convenience of the Amazon Mechanical Turk platform enabled us to annotate each of the 10,263 images according to the new SLoF scheme. The examples and user interface supplied to workers are given in the supporting information. Nine workers graded each image, and the results were aggregated by taking the median value. This labelling scheme produced an imbalanced dataset, with around 70% of the images labelled SLoF 0, compared to 20% SLoF 1 and 10% SLoF 2 (Table 2). Example images and their SLoF labels are provided in Figure 1. Finally, we divided the overall dataset of 10,263 images into a training set and a test set, as is commonly done in machine learning to enable proper evaluation. The test set consists of the 120 expert-labelled images plus 721 other images from 14 vessels selected to span varying degrees of fouling as determined by SLoF. The test set was constructed to challenge the machine learning model with different styles of vessel niches and fouling communities. Hence, we had a total of 841 images in the test set; the remaining data were used both to train the deep learning model and to perform five-fold cross-validation for hyperparameter tuning.
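To make the aggregation step above concrete, a minimal sketch is given below; the function name and the grades are illustrative rather than taken from our pipeline.

```python
import numpy as np

def aggregate_mturk_grades(worker_grades):
    """Combine the SLoF grades from several MTurk workers for one image.

    worker_grades: integer SLoF grades (0, 1 or 2), one per worker.
    Returns the median grade, rounded back to an integer class.
    """
    return int(round(float(np.median(worker_grades))))

# Example: nine workers grade a single image
print(aggregate_mturk_grades([0, 1, 1, 0, 1, 1, 2, 1, 0]))  # -> 1
```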

Machine learning
A machine learning algorithm typically learns by training on a set of examples. We present to the machine learning algorithm a set of images with their accompanying SLoF labels (i.e., 0, 1, 2). We wish the algorithm to accurately label images outside of this training set, i.e., to generalise to never-before-seen images.
The setup so far makes the problem a classic supervised learning task. However, unlike most image classification problems, our classes are ordinal. For example, mistaking an image of SLoF 2 for SLoF 0 is a larger error than mistaking an image of SLoF 1 for SLoF 0. This is analogous to the recent APTOS 2019 Blindness Detection Kaggle competition 36 , which asked participants to label the severity of a disease in images on an integer scale. Many of the best-performing Kaggle entries used regression losses rather than classification losses, and we follow the same approach here, as it allows the relative magnitude of errors to be captured.
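As a minimal sketch of this regression formulation in PyTorch (all numbers invented), the network emits a single raw score per image and a smooth-L1 loss penalises predictions in proportion to their distance from the true grade, so mistaking SLoF 2 for 0 costs more than mistaking it for 1:

```python
import torch
import torch.nn.functional as F

# Raw single-number network outputs for a batch of four images
predictions = torch.tensor([0.2, 1.7, 0.9, 0.1])
# True SLoF grades treated as regression targets
targets = torch.tensor([0.0, 2.0, 1.0, 2.0])

# Smooth-L1 is quadratic near zero and linear for large errors, so the
# 0.1-versus-2.0 mistake dominates the batch loss.
loss = F.smooth_l1_loss(predictions, targets)
print(loss)
```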
To measure model performance, we consider our three-class problem as two separate binary classification tasks: (1) identify fouled images (SLoF = 0 versus SLoF > 0) and (2) identify heavily fouled images (SLoF = 2 versus SLoF < 2). This allows us to measure the effectiveness of our model as a classifier without choosing arbitrary class thresholds. Instead of the more commonly used receiver operating characteristic (ROC) curve, we use the average precision metric, because it provides a better indication of classifier performance when classes are imbalanced 37,38 . We apply the average precision metric to each of the two binary classification tasks, and finally report their average as an overall indicator of performance.
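For illustration, this evaluation can be set up with scikit-learn as follows; the labels and scores are invented for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score

slof_true = np.array([0, 0, 1, 2, 1, 0, 2])                 # ground-truth grades
raw_scores = np.array([0.1, 0.4, 0.9, 1.8, 1.2, 0.2, 1.6])  # raw model outputs

# Task 1: any fouling present (SLoF > 0)
ap_fouled = average_precision_score(slof_true > 0, raw_scores)
# Task 2: heavy fouling present (SLoF = 2)
ap_heavy = average_precision_score(slof_true == 2, raw_scores)

# Overall indicator of performance: the mean of the two average precisions
print((ap_fouled + ap_heavy) / 2)
```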
Given that we are working with image data, the natural deep learning architecture to use is the convolutional neural network (CNN). A CNN comprises an input layer, which in our case is an RGB image, and an output layer, which is a raw number that relates to the SLoF class of the image. Between these are multiple hidden layers, which are connected in a sequence and make up the architecture of the network. Each layer performs an operation on the previous layer, such as convolutions, pooling operations, or matrix-matrix multiplications, and the nature of these operations is determined by trainable weights 26 . The creators of AlexNet were the first to discover that stacking a large number of these layers greatly improved the performance of CNNs on image-recognition tasks 27 .
Training a CNN involves many components, including the selection of a network architecture, a method of optimising the weights of the network (optimiser), a differentiable function that describes network performance with different configurations of weights (loss function), optimiser parameters, an image augmentation pipeline, and a learning rate schedule that modifies the size of each weight update over each epoch (i.e., iteration through the training data). Together these components affect the quality of the trained neural network. The term hyperparameter is often used to refer to parameters of the optimiser, the learning rate scheduler, and so on. The number of possible combinations of these design components is extremely large, and the search space that can be explored to determine the best combination is limited by the amount of computing power available.
We trained and tested our deep learning models with PyTorch 39 , an open-source deep-learning library developed by Facebook. We began the model building process by conducting a learning rate test 40 , using stochastic gradient descent (SGD) as the optimiser and a default set of optimiser parameters taken from the APTOS challenge. The result of this test was used to inform a quasi-random search for the best optimiser parameters 41 , drawing parameters from a Sobol sequence 42 to provide more even coverage of the search space than random sampling. This was done by training the model for a small number of epochs, and the best sets of optimiser parameters were chosen for further exploration in addition to the default set.
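A sketch of drawing candidate optimiser parameters from a Sobol sequence with PyTorch's built-in SobolEngine is shown below; the search bounds are illustrative, not the ones we used.

```python
import torch
from torch.quasirandom import SobolEngine

# Search over two SGD hyperparameters: log10(learning rate) and momentum.
bounds = {"log_lr": (-4.0, -1.0), "momentum": (0.5, 0.99)}

sobol = SobolEngine(dimension=len(bounds), scramble=True, seed=0)
points = sobol.draw(16)  # 16 points spread evenly over the unit square

candidates = []
for point in points:
    # Rescale each unit-interval coordinate into its parameter's range
    params = {name: lo + float(u) * (hi - lo)
              for (name, (lo, hi)), u in zip(bounds.items(), point)}
    params["lr"] = 10 ** params.pop("log_lr")
    candidates.append(params)

# Each candidate set would then be trained for a few epochs, with the
# best performers retained for longer runs.
print(candidates[0])
```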
We then tested performance for different combinations of the training components. We considered mean squared error and smooth-L1 losses, which we weighted by class frequency to remove the bias introduced by the imbalance of the dataset 43 . In addition to the standard optimisation algorithm SGD, we tested other optimisers such as Adaptive Moment Estimation (Adam) 44 , Rectified Adam (RAdam) 45 and Adam with a corrected weight decay algorithm (AdamW) 46 . Several learning rate schedules were examined, including a multi-step learning rate decay schedule, one-cycle 47 and cosine annealing 48 . In CNNs, image augmentation pipelines are important for preventing overfitting to the training data, and we tried two approaches of varying complexity drawn from the APTOS competition. These applied operations to our training images that did not change their class, such as rotations, random cropping, and adjustments to colour and contrast.
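One plausible implementation of the class-frequency weighting is sketched below; the exact scheme used in our experiments may differ, and the 70/20/10 frequencies simply mirror the class imbalance reported in Table 2.

```python
import torch
import torch.nn.functional as F

# Inverse-frequency weights for an (illustrative) 70/20/10 class split
class_freq = torch.tensor([0.7, 0.2, 0.1])
class_weights = (1.0 / class_freq) / (1.0 / class_freq).sum()

def weighted_smooth_l1(predictions, targets):
    """Smooth-L1 loss where each sample is weighted by the rarity of its class."""
    per_sample = F.smooth_l1_loss(predictions, targets, reduction="none")
    weights = class_weights[targets.long()]  # look up each sample's class weight
    return (weights * per_sample).sum() / weights.sum()

predictions = torch.tensor([0.3, 1.9, 1.1])
targets = torch.tensor([0.0, 2.0, 1.0])
print(weighted_smooth_l1(predictions, targets))
```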
We considered off-the-shelf network architectures, starting with the small resnet18 residual network, which was used to test every possible combination of the training components above. The residual network was introduced to address the vanishing gradient problem in networks with large numbers of layers by allowing inputs to skip layers, and obtained first place in the 2015 ImageNet classification challenge 49 . Once the best training components were identified, we trained larger and more modern network architectures on larger images, allowing us to determine whether increasing the image size from 224×224 to 448×448 pixels improved performance. These architectures included the "ResNeXt" squeeze-and-excitation networks, which built upon the residual learning idea and introduced a squeeze-and-excitation block that incorporates relationships between image colour channels 50,51 . We also tested the Inception architecture, which attempts to identify features at different scales in the image by applying convolution layers with several different kernel sizes simultaneously 52 . We further considered EfficientNets, which incorporate some of these previous ideas into an architecture designed to scale optimally and efficiently as the number of layers is increased 53 . A summary of the network architectures used in this paper and the Python packages used to implement them is provided in Table 3. We used pretrained ImageNet weights to initialise all our networks. These weights are obtained by pre-training the network on the ImageNet database, which contains millions of images across a thousand categories 54 , and were available for download through the architecture packages. This is a common practice known as transfer learning, which reduces the number of epochs required to reach a performance plateau and improves results on small datasets 28 . All network weights were trained, except those of the batch-normalisation layers, as these are best trained on large datasets like ImageNet.
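A minimal transfer-learning sketch using torchvision's resnet18 follows; the other architectures in Table 3 come from different packages, and the pretrained argument reflects the torchvision API of the time.

```python
import torch
import torchvision

# Load resnet18 initialised with ImageNet weights (transfer learning)
model = torchvision.models.resnet18(pretrained=True)

# Replace the 1000-class ImageNet head with a single regression output
model.fc = torch.nn.Linear(model.fc.in_features, 1)

# Keep the batch-normalisation layers fixed: stop their gradient updates and
# (re-applied after each call to model.train() in the training loop) hold them
# in eval mode so the ImageNet running statistics are not overwritten.
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        module.eval()
        for param in module.parameters():
            param.requires_grad = False
```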
The final step was creating a network ensemble. This is a technique where the class of an image is predicted by multiple networks, and their outputs are combined to obtain better performance 28 . We took the simplest approach, which is to average the raw network outputs. We identified the best-performing ensemble by testing every combination of networks trained on a particular image size, giving 510 possible ensembles to test for each image resolution. The full details of the model fitting process are provided in the supporting documentation.
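The exhaustive search can be sketched as below, assuming the raw validation outputs of each trained network have been saved; the names and the scoring callable are illustrative.

```python
from itertools import combinations
import numpy as np

def best_ensemble(raw_outputs, score_fn):
    """Test every combination of networks and return the best ensemble.

    raw_outputs: dict mapping network name -> array of raw validation outputs.
    score_fn: callable scoring an output array (e.g. mean average precision
              against the validation labels, which it closes over).
    """
    names = list(raw_outputs)
    best_combo, best_score = None, -np.inf
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Ensemble prediction is the simple average of the raw outputs
            ensemble = np.mean([raw_outputs[n] for n in combo], axis=0)
            score = score_fn(ensemble)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score
```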

Thresholding to create a classifier
The raw output of our model is a single number, which needs to be thresholded to map back to the SLoF classes. The precision-recall curve created by combining validation cross-folds is used to guide this mapping (Figure 2). In particular, the curve highlights the trade-off between precision and recall when choosing a threshold. A high-precision classifier will capture only some of the positive results, while a high-recall classifier will capture most of the positive results along with many false positives. For illustrative purposes we selected three classifiers to explore, namely a high-precision classifier chosen with a 25% recall threshold, a high-recall classifier chosen with a 90% recall threshold, and a balanced classifier with a 70% recall threshold.
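For illustration, a threshold achieving a given target recall can be read off the curve as follows (labels and scores invented):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, raw_scores, target_recall):
    """Pick the highest threshold whose recall is still >= target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, raw_scores)
    # recall[:-1] aligns with thresholds and is non-increasing as the
    # threshold rises, so take the largest threshold that still qualifies.
    valid = recall[:-1] >= target_recall
    return thresholds[valid].max()

# Binary task "fouling present" (SLoF > 0), with invented data
y_true = np.array([0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.6, 0.5, 0.9, 1.4, 0.2, 1.1])
print(threshold_for_recall(y_true, scores, target_recall=0.75))
```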

Comparison to experts
Perfect agreement within our labelling scheme is unlikely among biofouling experts due to its subjectivity, and the frequency at which experts agree with each other is a useful benchmark for evaluating the performance of our alternative labelling methods. Our 120-image expert-labelled dataset was graded by three experts, yielding a total of 720 expert-expert label pairs. These were obtained by pairing the labels of one expert with the annotations provided by the other two, and repeating the process for each expert. We also paired the MTurk and model labels with each expert, providing 360 label pairs each to compare against the expert-expert label pairs.
We assessed the significance of differences in precision and recall with a two-sided Fisher's exact test 55 using the fisher.test function in R 56 , with the null hypothesis that the precision or recall between experts is no different from the precision or recall with our alternate labels, using experts as the ground truth. We also used the two-one-sided t-tests (TOST) approach to test for non-inferiority 57 using the TOSTER R package 58 . The null hypothesis in this method was that the agreement observed between experts is at least 5% better than the agreement observed between our alternate labels and experts. A separate non-inferiority test was necessary because a lack of significant differences does not allow us to conclude that two distributions are similar 59 . We chose a p-value of 0.05 to signify statistical significance.
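These tests were run in R; an equivalent Fisher test in Python, with invented agreement counts purely for illustration, might look like this:

```python
from scipy.stats import fisher_exact

# 2x2 contingency table of (agree, disagree) counts: expert-expert pairs
# versus model-expert pairs. The counts here are made up.
table = [[640, 720 - 640],   # expert-expert: 640 of 720 pairs agree
         [310, 360 - 310]]   # model-expert: 310 of 360 pairs agree
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)
```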

Model thresholding and performance
Our best performing model based on five-fold cross-validation was an ensemble consisting of the resnet18, se_resnext50_32x4d, inceptionv4, inceptionresnetv2, efficientnet-b3 and efficientnet-b5 CNN architectures (see Table 3) with an input image size of 448×448 pixels, trained using an SGD optimiser, the default set of optimiser hyperparameters, smooth-L1 loss, a cosine annealing learning rate schedule and the more complex set of image augmentations. This gave a mean average precision of 0.849 (standard deviation 0.018), a significant improvement on the 0.799 (standard deviation 0.028) obtained from our initial random search. The full results of the model fitting process are provided in the supporting documentation. The results for each binary classification problem with this model on our validation and test datasets are shown in Table 4. The classifiers show better results on the test dataset, which is promising for the generalisability of our model.

Inter-rater reliability
We found that experts agree most often on images showing clean or heavily fouled hulls, while images that contained only some fouling were more likely to receive inconsistent grades (Figure 3a). Overall, experts showed 89% agreement for both tasks (95% CI: 87–92%). As we considered every combination of experts, the recall and precision calculated for each task were the same. Experts achieved 91% precision and recall for identifying images containing fouling (95% CI: 88–94%) and 87% for images containing heavy fouling (95% CI: 82–90%) (Table 5). When the rate of agreement between the MTurk labels and experts was compared to agreement between experts, we found that the difference was not statistically significant for either recall (p = 0.28–0.44) or precision (p = 0.32–0.48) on both tasks. However, the non-inferiority test showed that the agreement of the MTurk labels with experts was similar to within a margin of at most 5% worse (p = 0.004–0.02). This similarity can also be observed in the confusion matrix between expert and MTurk labelling (Figure 3b).
Expert agreement is a useful benchmark for our computer vision model, and depending on the thresholds chosen to create a classifier, different outcomes were found (Table 5). Choosing a 70% recall threshold for both tasks resulted in a classifier that was close to having similar agreement with experts to within a margin of at most 5% worse (p = 0.071–0.093). While these p-values were slightly above our threshold for statistical significance, the differences in precision (p = 0.33–0.65) and recall (p = 0.25–0.88) were also not statistically significant. The results for this classifier are shown as a confusion matrix in Figure 3c. Using a 90% recall threshold instead produced recall that was significantly higher than, or close to significantly higher than, that of experts (p = 0.019–0.058), at the cost of significantly lower precision (p = 0.002–0.004). Conversely, using a much lower recall threshold of 25% resulted in significantly higher precision (p = 0.004–0.018), with a corresponding decrease in recall (p < 0.001).

Discussion
In this study we applied deep learning methods to identify the presence and severity of biofouling on ship hulls using images annotated on the Amazon Mechanical Turk crowdsourcing platform, and compared our performance to experts. Our MTurk labels showed similar agreement to experts, and this result is highly promising as it suggests images can be effectively graded for the presence and severity of biofouling by non-experts by aggregating their annotations, offering the potential for substantial time and cost savings. This labelling was also sufficient for training CNNs that were found to have close to expert agreement, although fine-tuning our model on expert-annotated data may offer further improvements. We have also demonstrated that if high precision or recall is desired for the application of the model, then classifiers can be created that offer better performance than experts with regard to this property. This allows the behaviour of the classifiers to be tuned for a particular application. For example, when screening vessels for biosecurity risk it may be desirable to have a classifier with higher recall so that few images with severe fouling are missed. Conversely, if an activity were being undertaken where intervention capacity was limited, then a classifier with higher precision would be more appropriate.

The effectiveness of management activities for vessel biofouling in reducing biosecurity risk is currently a key knowledge gap for regulators, which makes it difficult to determine which combination of activities will provide confidence that a vessel is low risk. This model could be applied to provide a cheaper and more reliable way to identify the most effective management strategies, if combined with standardised vessel sampling protocols 24,60 , clear definitions of vessel biosecurity risk, such as the clean hull standard for New Zealand 16 , collection of management data, and ongoing in-water vessel inspections. This would also support more consistent assessment of effective management strategies between different organisations, which is a limitation of expert assessments. Building this evidence base would also benefit industry, as it would provide a basis from which to work towards regulatory alignment between jurisdictions.

Labels
In-water cleaning and hull grooming are increasingly important biofouling management activities, as regular cleaning can limit biofouling accumulation and provide options where the anti-fouling coatings of vessels are no longer effective or have failed 61,62 . However, cleaning also presents a biosecurity risk, because it can lead to the release of viable propagules, and organisms that detach can remain viable 63-65 . One way this risk can be managed is by considering the biofouling state of the vessel before setting conditions on in-water cleaning or grooming activities. For example, New Zealand recommends that in-water cleaning of macrofouling with an international origin must capture the biological waste and dispose of it on land or render it non-viable, but this would not be necessary if only a slime layer were present 66 . Automatic detection of biofouling using the state-of-the-art deep learning tools developed in this paper could be a cost-effective and reliable way for regulators and industry to process the outcomes of biofouling inspections for this purpose.
So far we have only tested our model on static images. Since videos are constructed from a stream of images, our model should be readily adaptable to video as well. However, further work is needed to address issues such as identifying the frames in which the camera is directed towards a vessel hull rather than open water, or in which image quality is poor, a common issue when analysing stills obtained from ROV footage 32 . The video format would also offer the opportunity to incorporate information from future and previous frames to improve and smooth fouling estimates, and ideas from current action recognition methods could potentially be applied 67 .
Our SLoF labelling scheme relates only to the percentage cover of macrofouling present within an image; it could be used more rigorously to determine the absolute biosecurity risk of a vessel if the area of hull captured within the image could be estimated. Given that in-water inspection methods are expected to vary greatly between jurisdictions, being able to do this without scale bars would be a major advantage. One possibility would be to train a deep neural network on images of vessel hulls taken using multiple cameras, building a model that estimates depth from a single image 68 .

Figure 2.
Precision-recall curve for the model using validation data from each cross-fold.

Figure 3.
Confusion matrices for expert versus MTurk and model labels using the SLoF score on the 120-image test set.

Table 2.
Breakdown of the number of images by SLoF in the five-fold cross-validation and test datasets.

Table 3.
Summary of neural network architectures used in model building.

Table 4.
Precision and recall of classifiers using the model with chosen recall thresholds on the validation and test datasets.

Table 5.
Precision and recall for expert-expert, MTurk-expert and classifier-expert label pairs. Numbers in brackets are the 95% confidence intervals. The TOST column contains non-inferiority p-values from the two-one-sided t-tests approach, with the null hypothesis that the agreement observed between experts is at least 5% better than the agreement observed for the method-expert label pairs. The p-value columns are from a two-sided exact Fisher test, with the null hypothesis that the method-expert label pairs do not differ in precision or recall from the expert-expert label pairs.