Benthic and coral reef community field data for Heron Reef, Southern Great Barrier Reef, Australia, 2002–2018

This paper describes point-based field data sets of benthic coral reef community composition, derived from georeferenced photoquadrats using machine learning. Annually over a 17-year period (2002–2018), data were collected using downward-looking photoquadrats that capture an approximately 1 m² footprint along 100 m–1500 m transect surveys distributed along the reef slope and across the reef flat of Heron Reef (28 km²), Southern Great Barrier Reef, Australia. Benthic community composition for the photoquadrats was automatically interpreted through deep learning, following initial manual calibration of the algorithm. The resulting data sets support understanding of coral reef biology, ecology, mapping and dynamics. Similar methods to derive the benthic data have been published for seagrass habitats; here we have adapted the methods for application to coral reef habitats, with the integration of automatic photoquadrat analysis. The approach presented is globally applicable to a range of submerged and benthic community ecological applications, and provides the basis for further studies at this site, for regional to global comparative studies, and for the design of similar monitoring programs elsewhere.


Background & Summary
This study describes a unique point-based data set for coral reef environments, collected using a photoquadrat survey method published for seagrass environments 1. The data set describes the spatial and temporal distribution of benthic community abundance and composition for Heron Reef, a 28 km² shallow platform reef located in the Capricorn Bunker Group, Southern Great Barrier Reef (GBR), Australia. On average, 3,600 coral reef data points were collected annually over the period 2002 to 2018. Annual data sets were acquired for independent research projects, but the collection methods were consistent. The initial field data collection design was planned to acquire detailed field data to describe the spatial distribution and variability of benthic composition across the study site to assist with calibration and validation of earth observation-based mapping products.
To create a map based on earth observation imagery, it is common to use training or calibration data to transform the imagery into a map of surface properties using a supervised algorithm (e.g. multivariate statistical clustering, random forest) 2 . To report on the accuracy measures of the maps, reference or validation data are contrasted with the output maps 3 . Hence for calibration and validation purposes, georeferenced field data must be representative of all the features to be mapped and collection should ideally coincide with satellite image acquisition. Many earth observation approaches have been implemented for mapping the benthic communities of Heron Reef 4-12 and several of these maps are now accessible online 6,13,14 .
Several studies have utilised time series benthic data to analyse changes in benthic community and coral type trends, supporting broad ecological knowledge of coral reef ecosystems such as the Caribbean reef degradation 15 and coral cover decline on the GBR 16 . Similarly, benthic community and coral cover data sets have been identified as important indicators of coral reef health providing the backbone for monitoring and management initiatives around the world 17,18 .
Articles and data sets have been published that describe the benthic community properties of Heron Reef, however, their spatial coverage, number of georeferenced data points, and revisit times are limited 19 . The time series photoquadrat data sets presented in this paper could be used for further understanding of benthic community distribution, including statistical analysis of trends in coral cover, analysis of changes in benthic community and coral type, or used for testing of other earth observation-based mapping and modelling approaches. Additionally, as our methodology describes machine annotation of the field photoquadrats, it would be possible to reanalyse the photoquadrats with new categories not previously considered important from a biological perspective (e.g. unknown disease or impact, or a specific benthic community type), or for other features (e.g. the counting of sea cucumbers (Holothuroidea sp.)).
Detailed analyses of our complete data set may permit a greater understanding of the persistence and/or dynamics of the benthic community at Heron Reef. As such, our ongoing analyses include evaluation of changes in community composition following major impacts such as cyclones, coral bleaching, crown of thorns predation, etc., and additionally, statistical analyses of coral recovery after such impacts. To this degree, these benthic community data sets are invaluable.

Methods
The photoquadrat-based data in this study were collected for Heron Reef, Southern Great Barrier Reef, Australia (Fig. 1). Here we provide a short overview of the collection methods; a detailed description can be found in 11. These methods are applicable to any habitat. Photoquadrats were analysed for substrate and/or benthic community types known to be present on the reef (Fig. 1). The benthic community classes included in the analysis are shown in Table 1.
Georeferenced photoquadrat data collection. Detailed information on benthic community composition was gathered at Heron Reef on the reef flat (0-2 m depth) and at the 5 m contour on the reef slope using a repeatable, fine spatial scale (sampling every 2-4 m) technique for surveying benthic cover 11. The technique required a snorkeler or diver to manually capture georeferenced photoquadrats along defined transect surveys using a standard digital camera in a waterproof housing (e.g. Sony Cyber shot, Canon AA540, Lumix, or Olympus T4). A plumb-line attached to the camera ensured that the footprint of each photoquadrat approximated 1 m² of the benthos.
From 2002-2004, a 100 m transect tape was deployed at each defined survey start site at a maximum depth of 3 m, or on scuba at 5 m depth. From 2005 onwards, instead of deploying a tape, the surveyor towed a standard handheld GPS (e.g. Garmin eTrex, Garmin 72) at the surface in a waterproof bag for all surveys. This enabled accurate registration of the location at which each photoquadrat was acquired, which was subsequently assigned by synchronising the photoquadrat timestamp with the track log from the towed GPS. Once this method was established, transect survey lengths were extended to distances of 500 m-1500 m. The start and end point of each transect was defined by GPS waypoints, permitting accurate revisits in subsequent years. The distance between successive photoquadrats was estimated from the surveyor's kick cycle and was therefore only approximate. However, this was not considered a problem, as the exact location of each photograph was known through the GPS synchronisation.
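The time synchronisation described above can be sketched as follows. This is a minimal illustration, not the processing code used for the data set: the function name `locate_photo`, the linear interpolation between bracketing GPS fixes, and the optional `clock_offset` correction are all assumptions made for the example, and the coordinates are placeholder values.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def locate_photo(track, photo_time, clock_offset=timedelta(0)):
    """Estimate (lat, lon) of a photo from a GPS track log.

    track: list of (datetime, lat, lon) tuples sorted by time.
    clock_offset: camera time minus GPS time (hypothetical correction).
    Linear interpolation between fixes is an assumption of this sketch.
    """
    t = photo_time - clock_offset
    times = [fix[0] for fix in track]
    if not times or t < times[0] or t > times[-1]:
        raise ValueError("photo time falls outside the track log")
    i = bisect_left(times, t)
    if times[i] == t:
        # The photo coincides exactly with a GPS fix.
        return track[i][1], track[i][2]
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    frac = (t - t0) / (t1 - t0)  # fraction of the way between the two fixes
    return lat0 + frac * (lat1 - lat0), lon0 + frac * (lon1 - lon0)


# Placeholder track with two fixes 10 s apart, and a photo taken midway.
track = [
    (datetime(2010, 1, 1, 10, 0, 0), -23.4400, 151.9100),
    (datetime(2010, 1, 1, 10, 0, 10), -23.4402, 151.9104),
]
lat, lon = locate_photo(track, datetime(2010, 1, 1, 10, 0, 5))
```

Raising an error for timestamps outside the track log, rather than extrapolating, keeps photographs taken before or after the GPS was recording from receiving a spurious position.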
All surveys were performed during the day, and derivation of sunlight and sun angle can be ascertained through the timestamp of each photoquadrat and its corresponding GPS location. Reef Flat surveys were collected at high tide to provide sufficient water depth for the snorkeler to safely traverse the reef. Reef Slope surveys were collected at low tide. No water quality information was recorded.
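As an illustration of how sun angle might be derived from a photoquadrat timestamp and GPS position, the sketch below uses a common low-precision approximation. It is not taken from the study: the function name `solar_elevation_deg` is hypothetical, timestamps are assumed to be in UTC, and the equation of time and atmospheric refraction are ignored, so results may differ from the true elevation by a degree or more.

```python
import math
from datetime import datetime, timezone


def solar_elevation_deg(when_utc, lat_deg, lon_deg):
    """Approximate solar elevation (degrees) for a UTC time and position.

    Low-precision sketch: ignores the equation of time and refraction.
    """
    day = when_utc.timetuple().tm_yday
    # Approximate solar declination for the day of year.
    decl = math.radians(
        -23.44 * math.cos(math.radians(360.0 / 365.0 * (day + 10)))
    )
    hours = when_utc.hour + when_utc.minute / 60 + when_utc.second / 3600
    # Approximate local solar time from longitude (15 degrees per hour).
    solar_time = hours + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    lat = math.radians(lat_deg)
    sin_elev = (
        math.sin(lat) * math.sin(decl)
        + math.cos(lat) * math.cos(decl) * math.cos(hour_angle)
    )
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_elev))))


# Sanity check at the equator near the March equinox.
noon = solar_elevation_deg(
    datetime(2015, 3, 21, 12, 0, 0, tzinfo=timezone.utc), 0.0, 0.0
)
midnight = solar_elevation_deg(
    datetime(2015, 3, 21, 0, 0, 0, tzinfo=timezone.utc), 0.0, 0.0
)
```

For analyses that are sensitive to illumination, a dedicated solar position library would be a more accurate choice than this approximation.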
The locations of the transect surveys were chosen to ensure they traversed gradients or edge features to detect any change in benthic cover over these features. This was done initially through visual assessment of existing satellite imagery in combination with expert knowledge of the study area. The aim was to produce data that provided an adequate representation of the variation in benthic community cover across Heron Reef. Limited transect surveys were located within the deep lagoonal area of the reef, as this area is hard to access by boat due to tidal range restrictions permitting short working times in the lagoon. Transect surveys were revisited in subsequent years, and additional transect surveys were included on subsequent trips based on increased knowledge of the environment. The benthic data sets and photoquadrat images are available at 20.
Automated photoquadrat analysis for benthic community composition. Percentage cover of the benthic communities for each photoquadrat was determined through a machine-learning (ML) approach which assessed benthic community composition. A previously devised category scheme consisting of 63 class codes that differentiated all major GBR-specific coral morphologies and other bottom types was used 21. Following machine annotation, these class codes were collapsed first into broad groups and subsequently into six simplified groups for validation purposes (Table 1).
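The aggregation of point annotations into percentage cover can be sketched as below. This is a minimal example under stated assumptions: the class codes and group names in `CODE_TO_GROUP` are hypothetical placeholders rather than the labelset in Table 1, and the function `percent_cover` is illustrative only.

```python
from collections import Counter

# Hypothetical mapping from detailed class codes to simplified groups;
# the real labelset and groups are those listed in Table 1.
CODE_TO_GROUP = {
    "ACR_BRA": "coral",
    "POR_MAS": "coral",
    "TURF": "algae",
    "SAND": "sand",
}


def percent_cover(point_labels, code_to_group=CODE_TO_GROUP):
    """Return percentage cover per simplified group for one photoquadrat.

    point_labels: class codes assigned to the annotation points.
    Codes missing from the mapping are counted as "other".
    """
    if not point_labels:
        raise ValueError("no annotation points supplied")
    groups = [code_to_group.get(code, "other") for code in point_labels]
    counts = Counter(groups)
    return {group: 100.0 * n / len(groups) for group, n in counts.items()}


# Example photoquadrat annotated at 50 points.
labels = ["ACR_BRA"] * 20 + ["POR_MAS"] * 5 + ["TURF"] * 15 + ["SAND"] * 10
cover = percent_cover(labels)
```

Because the mapping is applied after annotation, the same point labels can be regrouped under a different scheme without re-annotating the photoquadrats.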
Initial training of the ML platform was achieved via manual annotation of approximately 5% of the total number of photoquadrats (equivalent to 108,700 annotated points; based on 21 ), to achieve a machine annotation accuracy of >70% as determined by the classifier 21 . A unique source was created for each camera used. To give a default and uniform image annotation area, boundaries of 5% were used for the top and left sides of the photoquadrat, whilst a boundary of 95% was used for the right and bottom sides of the photoquadrat. Annotation points (50) were generated randomly over the entire annotation area per photoquadrat. For manual annotation of photoquadrat sets, the level of confidence was set to 100%. A further approximately 2.5% of photoquadrats

Data Records
Detailed information regarding the output benthic cover percentages and the number of benthic photoquadrats acquired for each field campaign is documented in Table 2. The benthic data sets and photoquadrat images are available at 20, and the photoquadrats and benthic cover analysis for individual survey years are accessible online through the campaign-specific DOIs listed in the table, from where the data can be downloaded directly.

Technical Validation
To understand the validation technique applied to these data sets, it is important to reiterate the purpose of collecting the data set itself, which was a fast field method to gather benthic community information over a large spatial extent, whilst accurately representing variability. Validation of the data set was conducted on various levels, and included: standardisation of photoquadrat capture method and conditions, and a quantitative accuracy assessment.
Standardisation of photoquadrat image capture. To standardise photoquadrat image capture, the camera and lens setup was calibrated prior to each annual survey so as to capture a footprint that covered the same extent of the benthos. This was accomplished by attaching a plumb-line to the camera system such that, when it touched the bottom, the captured photoquadrats represented ~1 m² of the benthos. To do this standardisation, the camera was moved vertically over a marked 1 m² area until the field of view enveloped the area, and the plumb-line was fixed. During the survey the operator used the plumb-line to determine the camera height above the substrate. When held vertically with the weight touching the substrate, this permitted reproducible capture of photoquadrats that covered the same area for all surveys. Light conditions were generally the same for each expedition, as the data were collected over a consecutive 4-5 day period with stable weather, water clarity conditions and tidal range. Ideally light conditions would have been standardised using a strobe; however, this would have slowed the transect surveys.
Table 1. Benthic community and coral type descriptions and their class codes used for photoquadrat annotation. Manual and automated (machine) annotation utilized the full labelset (63 class codes). Following machine annotation, these 63 class codes were aggregated via broad groups into six simplified groups for validation of the machine learning.
Table 3. Quantitative assessment of the machine annotations via construction of a confusion matrix. For each camera used, machine annotation (modelled data) of 2.5% of all the photoquadrats captured was compared with manual annotation (reference data) of the same validation data set using a standard confusion matrix 3. From this, the overall accuracy and individual class accuracies were calculated following a well-documented approach 3.
Quantitative accuracy assessment. To determine the accuracy of the machine annotation we constructed a confusion matrix that compared, for a select set of validation photoquadrats, the benthic composition output from the machine learning annotation (modelled data), with the equivalent manual annotations (reference data). Using the confusion matrix we calculated the overall accuracy and the individual benthic label user and producer accuracy following a well-documented method 3 . All cameras demonstrated an overall accuracy of between 74% and 82% (Table 3; 3 ). To provide a validation data set, ~2.5% of photoquadrats were manually annotated in an identical manner to the training data (36,950 annotated points; see Methods Section).
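The accuracy measures named above can be illustrated with a small worked example. The sketch assumes the conventional definitions (user's accuracy is the correct count divided by the modelled row total; producer's accuracy is the correct count divided by the reference column total); the function names and the six sample points are hypothetical and are not drawn from Table 3.

```python
def confusion_matrix(reference, modelled, labels):
    """Build a matrix with modelled labels as rows, reference as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, mod in zip(reference, modelled):
        matrix[index[mod]][index[ref]] += 1
    return matrix


def accuracies(matrix, labels):
    """Return overall accuracy plus per-label user and producer accuracy."""
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    overall = sum(matrix[i][i] for i in range(n)) / total
    user, producer = {}, {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])                       # points modelled as label
        col_total = sum(matrix[j][i] for j in range(n))  # reference points of label
        # None marks a label with no points in that row or column.
        user[label] = matrix[i][i] / row_total if row_total else None
        producer[label] = matrix[i][i] / col_total if col_total else None
    return overall, user, producer


# Hypothetical validation points: manual (reference) vs machine (modelled).
labels = ["coral", "algae", "sand"]
reference = ["coral", "coral", "coral", "algae", "algae", "sand"]
modelled = ["coral", "coral", "algae", "algae", "algae", "sand"]
matrix = confusion_matrix(reference, modelled, labels)
overall, user, producer = accuracies(matrix, labels)
```

In this example five of the six points agree, so the overall accuracy is 5/6; the one coral point labelled as algae lowers the producer's accuracy for coral and the user's accuracy for algae to 2/3 each.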