A contemporary baseline record of the world’s coral reefs

Addressing the global decline of coral reefs requires effective actions from managers, policymakers and society as a whole. Coral reef scientists are therefore challenged with the task of providing prompt and relevant inputs for science-based decision-making. Here, we provide a baseline dataset covering 1300 km of tropical coral reef habitats globally and comprising over one million geo-referenced, high-resolution photo-quadrats analysed using artificial intelligence to automatically estimate the proportional cover of benthic components. The dataset contains information on five major reef regions, and spans 2012–2018, including surveys before and after the 2016 global bleaching event. The taxonomic resolution attained by image analysis, as well as the spatially explicit nature of the images, allows for multi-scale spatial analyses and temporal assessments (decline and recovery), and supports image recognition developments. This standardised dataset across broad geographies offers a significant contribution towards a sound baseline for advancing our understanding of coral reef ecology and thereby taking collective and informed actions to mitigate catastrophic losses in coral reefs worldwide.
Measurement(s): ecosystem • coral reef • composition
Technology Type(s): automated image annotation • machine learning
Factor Type(s): year of data collection • geographic location
Sample Characteristic - Organism: Anthozoa • Algae • Porifera
Sample Characteristic - Environment: marine coral reef biome • marine coral reef fore reef
Sample Characteristic - Location: Atlantic Ocean • Eastern Australia • Indian Ocean • Southeast Asia • Pacific Ocean • Great Barrier Reef
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13007516


Background & Summary
The escalating deterioration of coral reefs over the past half-century 1,2 has imposed urgent challenges for coral reef science 3,4 , especially at the spatial and temporal scales 5 required to support science-based management and conservation at global scales 6,7 . Novel research and management approaches that consider current ecological and socio-cultural paradigms of coral reefs are therefore pivotal for addressing such challenges 4,7-13 . Many of these strategies, however, face major challenges in terms of data acquisition, analysis and implementation 14 . In addition, the reluctance to share data among scientists 15 may often frustrate progress in reef science, management and conservation 16 . Clearly, this increasing demand for scientific outputs needs to be pursued collectively as a top priority 16 in order to take up the challenges of managing Anthropocene reefs effectively in a rapidly changing world 4-6,17 .
There have been a large number of technological innovations that are revolutionising the collection and analysis of critical data 18-23 . Yet, the lack of adequate metrics for monitoring 3 , the paucity of large-scale assemblage datasets 7 and the limited availability of key long-term ecological data 24 have generated a lag in the task of incorporating scientific findings into management and policy actions. Large-scale and standardised field data

Methods
A comprehensive description of the methodological aspects used during the field surveys and image analysis has been published in González-Rivero et al. 23,25,35 . Therefore, here we include a synopsis of how this dataset was generated and made available to the wider community.
Our approach involved the rapid acquisition of high-resolution imagery over large extents of reef and efficient image analysis to provide key information about the state of coral reef benthic habitat across multiple spatial scales 23 . The data generation and processing involved three main components: (1) photographic surveys, (2) post-processing of images and (3) image analysis, which are described below and summarised in Fig. 1.
Photographic surveys. An underwater propulsion vehicle customised with a camera system ("SVII", Supplementary Fig. 1), consisting of three synchronised DSLR (Digital Single-Lens Reflex) cameras (Canon 5D-MkII cameras and Nikon Fisheye Nikkor lens with 10.5 mm focal length), was used to survey the fore-reef (reef slope) habitats from five major coral reef regions: Central Pacific Ocean, Western Atlantic Ocean, Central Indian Ocean, Southeast Asia and Eastern Australia in 23 countries or territories (Table 1, Supplementary Fig. 2). Within each region, multiple reef locations were surveyed, aiming to capture the variability and status of fore-reef environments across regions and within each region. Sampling design varied according to particular environmental and socioeconomic factors potentially influencing the distribution and structure of coral reef assemblages at each region and/or country. Overall, prior to field expeditions, reef localities were selected considering factors such as wave exposure, reef zones (i.e. fore-reefs), local anthropogenic stressors (e.g. coastal development), fishing pressures, levels of management (e.g. marine park, protected areas), and presence of monitoring sites.
Underwater images were collected in each reef location once every three seconds, approximately 2 m apart, following a transect along the seascape at a standard depth of 10 m (±2 m). Although overlap between consecutive images is possible, the process for extracting standardised photo-quadrats from an image ensures that the photo-quadrats are non-overlapping between and within images (see further details in the next section). Each transect averaged 1.8 km in length, hereafter referred to as a "survey". See Supplementary Fig. 3 for an explanation of the hierarchical structure of the photographic surveys. No artificial illumination was used during image capture, but light exposure was manually adjusted by modifying the ISO during the dive, using an on-board tablet computer encased in an underwater housing (Supplementary Fig. 1). This computer enabled the diver to control camera settings (exposure and shutter speed) according to light conditions. Images were geo-referenced using a surface GPS unit tethered to the diver (Supplementary Fig. 1). Altitude and depth of the camera relative to the reef substrate and surface were logged at half-second intervals using a Micron Tritech transponder (altitude, Supplementary Fig. 1) and pressure sensor (depth) in order to select the imagery within a particular depth and to scale and crop the images during the post-processing stage. Further details about the photographic surveys are provided in González-Rivero et al. 25,35 .
Post-processing of images for manual and automated annotation. The post-processing pipeline produced images with features required for manual and automated annotation in terms of size and appearance. The process involved several steps that transformed the raw images from the downward facing camera into photo-quadrats of 1 m 2 , hereafter referred to as a "quadrat" (Fig. 1).
As imagery was collected without artificial light using a fisheye lens, each image was processed prior to annotation in order to balance colour and to correct the non-linear distortion introduced by the fisheye lens 23 (Fig. 1). Initially, colour balance and lens distortion correction were manually applied on the raw images using Photoshop (Adobe Systems, California, USA). Later, in order to reduce the manual post-processing time of thousands of images, automatic batch processing was conducted on compressed images 23 (jpeg format) using Photoshop and ImageMagick, the latter an open-source software for image processing (https://imagemagick.org/index.php). In addition, using the geometry of the lens and altitude values, images were cropped to a standardised area of approximately 1 m 2 of substrate 23,35 (Fig. 1). Thus, the number of non-overlapping quadrats extracted from one single raw image varied depending on the distance between the camera and the reef surface. Figure 1 illustrates a situation where the altitude of the camera allowed for the extraction of two quadrats from one raw image. Further details about colour balance, lens distortion correction and cropping are provided in González-Rivero et al. 23,35 .
Image analysis: manual and automated annotation for estimating covers of benthic categories. Manual annotation of the benthic components by a human expert took at least 10 minutes per quadrat, creating a bottleneck between image post-processing and the required data-product. To address this issue, we developed an automated image analysis to identify and estimate the relative abundance of benthic components such as particular types of corals, algae, and other organisms as well as non-living components. To do this, automated image annotation based on deep learning methods (Deep Learning Convolutional Neural Networks) 23 was applied to automatically identify benthic categories from images based on training using human annotators (manual annotation).
The process for implementing a Convolutional Neural Network (hereafter "network") to classify coral reef images involved three main stages: (i) label-set (benthic categories) definition, (ii) training and fine-tuning of the network, and (iii) automated image annotation and data processing.
Label-set definition. As a part of the manual and automated annotation processes to extract benthic cover estimates, label-sets of benthic categories were established based on their functional relevance to coral reef ecosystems and on whether their features could be reliably identified from images by human annotators 25 . The labels were derived, modified and/or simplified from existing classification schemes 40,41 , and were grouped according to the main benthic groups of coral reefs including hard coral, soft coral, other invertebrates, algae, and other. Since coral reef assemblages vary in species composition at global and regional scales, and surveys were conducted at different times between 2012 and 2018 across the regions, nine label-sets were used to account for such biogeographical and temporal disparity. In general, a label-set was developed after each main survey expedition to a specific region. The label-sets varied in complexity (from 23 to 61 labels), reflecting regional differences in the capacity to visually recognise corals (in photographs) to the lowest possible taxon. While label-sets for the Atlantic and Central Pacific (Hawaii) included categories with coral genus and species, for the Indian Ocean (Maldives, Chagos Archipelago), Southeast Asia (Indonesia, Philippines, Timor-Leste, Solomon Islands, and Taiwan), and Eastern Australia, coral labels were based on a combination of taxonomy (e.g., family and genus) and colony morphology (e.g., branching, massive, encrusting, foliose, tabular).
The other main benthic groups were generally characterised by labels reflecting morphology and/or functional groups across the regions. "Soft Corals" were classified into three groups: 1) Alcyoniidae (soft corals), the dominant genera; 2) Sea fans and plumes from the family Gorgoniidae; and 3) Other soft corals. "Algae" groups were categorised according to their functional relevance: 1) Crustose coralline algae; 2) Macroalgae; and 3) Epilithic Algal Matrix. The latter is a multi-specific algal assemblage of up to 1 cm in height that smothers the reef surface (dominated by algal turfs). "Other Invertebrates" consisted of labels to classify sessile invertebrates other than soft corals (e.g., Millepora, bryozoans, clams, tunicates, soft hexacorallia, hydroids) and some mobile invertebrates observed in the images (mostly echinoderms). The remaining group, "Other", consisted of sand, sediments, and occasional organisms or objects detected in the images such as fish, human debris (e.g., plastic, rope, etc.), and transect hardware. The exception within these main groups was the "Sponges", which were classified and represented by multiple labels only in the Atlantic (given their abundance and diversity in the Caribbean), including categories with sponge genus and species, and major growth forms (rope, tube, encrusting, massive).
Training and fine-tuning of the network. The deep learning approach used relies on a convolutional neural network architecture named VGG-D 16 42 . Details on the initialisation and utilisation of this network are provided in González-Rivero et al. 23 . A total of nine networks were used, one for each country within the regions, except for the Western Atlantic Ocean, where the network was trained using data from several countries, and the Philippines and Indonesia, where the network was trained using data from those two countries (Table 2). The first step in implementing a network was to randomly select a subset of images from the whole regional set to be classified, which were then divided into training and testing sets (Fig. 1). Human experts manually annotated both sets using the corresponding label-set under CoralNet 43 , an online platform designed for image analysis of coral reef related materials (https://coralnet.ucsd.edu/). The number of images and points manually annotated per network is presented in Table 2 (generally 100 points per image for training sets and 40 or 50 points per image for testing sets).
Each training and testing data set was exported from CoralNet 43 and used along with the associated quadrats to support an independent training and fine-tuning process aimed at finding the network configuration that produced the best outcomes. Initially, each quadrat used from the training and testing sets was converted to a set of patches cropped out around each annotation point location. The patch area to crop around each annotation point was set to 224 × 224 pixels to align with the pre-defined image input size of the VGG-D architecture. The fine-tuning exercise ran in general for 40 K iterations to establish the best combination of model parameters or weights that minimised the cross-entropy loss while the overall accuracy increased.
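As an illustration of the patch-cropping step, the sketch below computes the pixel window of a 224 × 224 patch around one annotation point. The function name `patch_window` is hypothetical, and the handling of points near the image border (shifting the window inwards so that it stays inside the image) is an assumption; the treatment actually used is described in González-Rivero et al. 23 .

```python
def patch_window(x, y, width, height, size=224):
    """Return (left, top, right, bottom) of a size x size crop around point (x, y).

    Assumption: near the border the window is shifted inwards rather than padded.
    """
    if width < size or height < size:
        raise ValueError("image is smaller than the requested patch")
    half = size // 2
    # Centre the window on the point, then clamp it to the image bounds.
    left = min(max(x - half, 0), width - size)
    top = min(max(y - half, 0), height - size)
    return left, top, left + size, top + size


# Example with a hypothetical 1000 x 1000 pixel quadrat and one annotation point.
window = patch_window(500, 400, 1000, 1000)
```

The returned tuple follows the (left, top, right, bottom) convention used by common imaging libraries, so it could be passed to a crop routine of the user's choice.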
An independent 20% subset from the original set of quadrats was used to assess the performance of the final classification (% accuracy). In addition, the learning rate and image scale parameters were independently optimised for each network by running an experiment using different values for these parameters in order to select the values that yielded the smallest errors per label. Further details of the model parametrisation for each network are provided in González-Rivero et al. 23 (see Supplementary Material).
Automated image annotation and data processing. Once optimised, a network was used to automatically annotate the corresponding set of non-annotated quadrats. The quadrats were processed through the network, where for each quadrat, 50 points (input patches) were classified using the associated labels. Upon completion of automated image annotation for a specific region/country, the annotation outputs containing locations of 50 pixels (i.e., their x and y coordinates) with their associated labels per quadrat (a csv file per quadrat) were incorporated and collated into a MySQL database along with information about the field surveys. In addition to the manual and automated annotations tables (raw data), we provide two levels of aggregation for the benthic data. First, the relative abundance (cover) for each of the benthic labels per quadrat, which was calculated as the ratio of the number of points classified for a given label to the total number of points evaluated in a quadrat. Second, the relative abundance for each of the main benthic groups (hard coral, soft coral, other invertebrates, algae, and other) per survey, which involved three calculations: 1) summarise the quadrat covers by image, averaging all the quadrats from one single image per label, 2) summarise image covers by survey, averaging all the images across one survey per label, and 3) merge survey data by main benthic groups, summing the covers of all labels belonging to the same group across one survey.
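The two aggregation levels described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the function names, the toy labels and the dictionary-based inputs are hypothetical, and a label that does not occur in a quadrat is assumed to have zero cover there.

```python
from collections import Counter, defaultdict
from statistics import mean


def quadrat_cover(point_labels):
    """Cover per label in one quadrat: points with that label / all points."""
    if not point_labels:
        raise ValueError("a quadrat needs at least one classified point")
    total = len(point_labels)
    return {label: n / total for label, n in Counter(point_labels).items()}


def survey_group_cover(quadrat_covers, image_of, group_of):
    """Aggregate one survey: quadrats -> images -> survey -> main groups.

    quadrat_covers: {quadrat_id: {label: cover}}
    image_of:       {quadrat_id: image_id}
    group_of:       {label: main benthic group}
    """
    labels = sorted(group_of)
    # 1) Average the quadrats of each image, per label.
    by_image = defaultdict(list)
    for quadrat_id, cover in quadrat_covers.items():
        by_image[image_of[quadrat_id]].append(cover)
    image_cover = {
        image_id: {lab: mean(c.get(lab, 0.0) for c in covers) for lab in labels}
        for image_id, covers in by_image.items()
    }
    # 2) Average the images of the survey, per label.
    survey_cover = {
        lab: mean(cover[lab] for cover in image_cover.values()) for lab in labels
    }
    # 3) Sum the label covers that belong to the same main benthic group.
    groups = defaultdict(float)
    for lab, cover in survey_cover.items():
        groups[group_of[lab]] += cover
    return dict(groups)
```

With 50 classified points per quadrat, each point contributes 0.02 to the cover of its label, so quadrat covers are multiples of 0.02.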

Data Records
The dataset presented here has been made freely available through The University of Queensland "eSpace" repository 34 , and released under a Creative Commons Attribution license (CC BY 3.0; https://creativecommons.org/licenses/by/3.0/deed.en_US). For attribution, we expect users to cite this paper when using the content. It includes three core components: 1) a series of relational tables (csv format) with the manual and automated annotations (raw data), 2) a series of relational tables (csv format) with benthic covers per survey and quadrat (processed data), dataset IDs, and label-set descriptions, and 3) the imagery associated with the dataset (jpeg format). We are providing cover data from more than one million quadrats (over 55 million associated data records) based on automated annotation, and over 859 K points (image features) classified by humans within selected quadrats through the CoralNet 43 web interface, which can be used to implement approaches of automated image analyses.
A unique identification number (ID) was assigned to every survey (5 digits), image (9 digits), and quadrat (11 digits) which are the basis for the database relational links among tables and files. For example, in quadrat ID 17001644602, the first five digits correspond to the survey ID (17001), the survey ID and the next four digits (6446, a number automatically assigned by the camera between 0001 and 9999) to the image ID (170016446), and the image ID along with the last two digits (02, quadrat number within an image automatically assigned during the cropping process) to the quadrat ID. The core table "seasurvey_quadrat.csv" contains all the survey, image and quadrat IDs (over 1.1 million records) and is zipped within the "tabular-data.zip" file. See Supplementary Fig. 3 for an explanation of the hierarchical structure of the dataset.
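The ID scheme described above can be unpacked with a few string slices. The sketch below is illustrative only; the names `QuadratKey` and `split_quadrat_id` are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class QuadratKey(NamedTuple):
    survey_id: str       # first 5 digits
    image_id: str        # first 9 digits (survey ID + camera image number)
    quadrat_id: str      # all 11 digits
    camera_number: str   # 4-digit number assigned by the camera
    quadrat_number: str  # 2-digit quadrat number within the image


def split_quadrat_id(quadrat_id) -> QuadratKey:
    """Split an 11-digit quadrat ID into its survey, image and quadrat parts."""
    digits = str(quadrat_id)
    if len(digits) != 11 or not digits.isdigit():
        raise ValueError(f"expected an 11-digit quadrat ID, got {quadrat_id!r}")
    return QuadratKey(digits[:5], digits[:9], digits, digits[5:9], digits[9:])


key = split_quadrat_id(17001644602)
```

Keeping the parts as strings preserves leading zeros in the camera and quadrat numbers (e.g., "02").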
Manual and automated annotations. The structure of the tables containing the data records from manual (testing and training datasets) and automated annotations is presented in Table 3. Each annotation record is identified by the quadrat ID and includes the coordinates of the pixel that has been annotated (x and y), and its corresponding label. Both types of annotation records are zipped within the "tabular-data.zip" file. While there is a single file ("seaviewsurvey_annotations.csv") for the automated annotations with more than 55.2 million data records, there are nine files with the manual annotations, which correspond to the regions/countries that make up the training sets (Table 2). They were named "annotations_" with the region/country code (Table 3) appended (e.g., annotations_PAC_AUS.csv).
The table "seaviewsurvey_labelsets.csv" provides the description of the label-sets used for each region. The structure of this table is presented in Table 3, where each record corresponds to a label within a region. Refer to the Usage Notes section for instructions on accessing visual examples of the labels.
Benthic cover. The structure of the tables containing the data records from benthic covers per survey and quadrat is presented in Table 3. Both types of cover records are zipped within the "tabular-data.zip" file. Each data record in the surveys file ("seaviewsurvey_surveys.csv") corresponds to a unique survey identified by the survey ID and contains the cover of the five main benthic groups (hard coral, soft coral, other invertebrates, algae, and other) along with metadata of the survey (e.g., ocean, country, latitude, longitude, etc.). Cover records per quadrat were grouped in five files and named "seaviewsurvey_reefcover_" with the ocean and/or region appended (e.g., seaviewsurvey_reefcover_pacificaustralia.csv) to facilitate the data retrieval by the main regions surveyed. (2020) 7:355 | https://doi.org/10.1038/s41597-020-00698-6 www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
The methodological approach of automated image annotation used to generate this dataset is described and validated in González-Rivero et al. 23 . In order to evaluate the reliability of the estimations of benthic covers for each network, we used the set of images and manual annotations defined above as testing datasets. Each testing dataset was constructed from contiguous images within standard spatial units (hereafter called "test transects") with an extent of 30 m in length, which is consistent with most coral reef monitoring programs 31,44-46 and best represents the spatial heterogeneity within a site 25 . Thus, the aggregation of images within 30 m test transects allowed the evaluation of the performance of automated estimations within a scale that is consistent with monitoring sampling units, accounting for the variability in benthic abundance estimation among images. Test transects were selected at random while ensuring that images used for training the networks were not included. A total of 5,747 images, within 517 test transects (Table 2), were annotated (identical annotation points) by trained human observers, hereafter called "observer", and the networks, hereafter called "machine". The benthic composition within these test transects was averaged across images and contrasted between the two groups annotated: observer vs. machine. Specifically, we compared the error between machine and observer at the level of benthic categories and the consistency between machine and observer estimations from a community perspective. We used the Absolute Error (E) to estimate the variability in the machine estimates when compared against observer estimations of abundance of a given benthic category. The absolute error (hereafter called "error") for each category (i) was calculated as the absolute difference between the abundance estimated by the machine (m) and the observer (o; Eq. (1)).
The error was calculated and compared at two aggregation levels: a) main benthic groups and b) label-set, sensu González-Rivero et al. 25 :
$$E_i = |m_i - o_i| \qquad (1)$$
Such errors indicated that the trained networks (machine) were able to produce cover estimates comparable to those generated by observers, and therefore suitable for spatial and temporal analysis and monitoring 23 . While the errors are variable among regions and benthic categories (Fig. 2), they are within the range of previously reported inter- and intra-observer variability for established monitoring programs 47 (e.g., 2-5%, Long Term Monitoring Program, AIMS, Australia). Among main benthic groups (e.g., hard coral, algae), "Algae" showed the highest errors among regions, ranging between 2% (Pacific, Hawaii) and 5% (Southeast Asia, Fig. 2). The abundance of "Hard Coral" and "Soft Coral" estimated by the machine showed errors between 1% (Atlantic) and 4% (Australia), and 0.5% (Indian Ocean) and 3% (Australia), respectively. The remaining groups (Other and Other Invertebrates) showed a consistent error below 2% (Fig. 2).
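For readers who wish to repeat this comparison on their own annotations, the absolute error (the absolute difference between machine and observer cover for each category, Eq. (1)) can be sketched as below. The function names and the toy cover values are hypothetical; a category missing from one of the two estimates is assumed to have zero cover there.

```python
def absolute_error(machine, observer):
    """Absolute difference |m_i - o_i| per category i for one test transect."""
    categories = set(machine) | set(observer)
    return {
        cat: abs(machine.get(cat, 0.0) - observer.get(cat, 0.0))
        for cat in categories
    }


def mean_error_by_category(pairs):
    """Mean error per category over (machine, observer) pairs.

    Each category is averaged over the transects in which it appears
    in at least one of the two estimates.
    """
    errors = {}
    for machine, observer in pairs:
        for cat, err in absolute_error(machine, observer).items():
            errors.setdefault(cat, []).append(err)
    return {cat: sum(vals) / len(vals) for cat, vals in errors.items()}


# Toy covers (proportions) for one transect; not values from the dataset.
machine = {"Hard Coral": 0.30, "Algae": 0.50, "Other": 0.20}
observer = {"Hard Coral": 0.28, "Algae": 0.55, "Other": 0.17}
errors = absolute_error(machine, observer)
```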
The error of machine estimates was more consistent within main groups, with the only exception of the "Algae" group (Fig. 3). "Epilithic Algal Matrix", a functional group comprising a diverse number of algae groups (e.g., turf, cyanobacteria), was the most variable label (5%-7% error). The error of estimations within the "Hard Coral" group remained below 2% among regions, while "Soft Coral", in particular "Other soft corals", showed an error of up to 3%. This label comprises a large diversity of genera and growth forms, while more taxonomically defined labels showed an error below 2%. The remaining labels within the "Other" group, comprising mainly substrate categories (e.g., sand, terrigenous sediment), and "Other Invertebrates", comprising benthic invertebrates other than hard and soft corals, showed a consistently low error (below 1%-2%; Fig. 3).
In addition, the overall performance of automated image annotation was evaluated by 1) correlating the estimates of abundance (cover) produced by the machine against those produced by the observer, and 2) evaluating the overall agreement between machine and observer estimations using the Bland-Altman plot 48 , also called a difference plot. Correlation was evaluated using the coefficient of determination (R 2 ) from a linear regression model, which also evaluated the significance of this correlation. The R 2 provides an indication of the strength of the correlation by evaluating the co-variance between the observer and machine estimations. The Bland-Altman plot shows the differences between the two estimations against the observer estimations, or reference sensu Krouwer 49 . We used the Bland-Altman plot analysis 50 to evaluate: (1) the mean of the difference or bias of machine estimations, (2) the homogeneity of the difference between machine and observer across the mean (i.e., over-dispersion) and (3) the critical difference or agreement limits. The latter refers to the range, within the 95% Confidence Interval, of the difference between the two methods, and can be used as a reference to define where the measurements fall out of the range of the agreement (i.e., precision of the agreement). Bias refers to the difference between the two methods, and the Bland-Altman plot can help visualise whether this bias changes across the mean of values evaluated, thereby providing a measurement of the consistency of the bias 50 .
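The two agreement measures described above can be sketched with the standard library alone. This is an illustration under assumptions rather than the analysis code behind Fig. 4: the agreement limits use the conventional Bland-Altman formulation (bias ± 1.96 standard deviations of the differences), which is assumed to correspond to the 95% range reported here, and the significance test of the regression is not included. Function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def r_squared(observer, machine):
    """R^2 of the simple linear regression of machine on observer estimates.

    Both inputs must vary; constant input leads to a division by zero.
    """
    mo, mm = mean(observer), mean(machine)
    sxy = sum((o - mo) * (m - mm) for o, m in zip(observer, machine))
    sxx = sum((o - mo) ** 2 for o in observer)
    syy = sum((m - mm) ** 2 for m in machine)
    return sxy ** 2 / (sxx * syy)


def bland_altman(observer, machine):
    """Return (bias, (lower, upper)) for machine minus observer differences."""
    diffs = [m - o for o, m in zip(observer, machine)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # assumed 95% limits of agreement
    return bias, (bias - half_width, bias + half_width)


# Toy cover estimates (proportions) for four test transects.
observer = [0.10, 0.20, 0.30, 0.40]
machine = [0.12, 0.18, 0.33, 0.37]
bias, limits = bland_altman(observer, machine)
```

Plotting the differences against the observer estimates, with horizontal lines at the bias and at the two limits, gives the difference plot described in the text.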
As expected, network estimations (machine) of benthic cover were highly correlated with observer estimations for all five global regions (R 2 = 0.97, P < 0.001, Fig. 4). Most importantly, the differences between the machine and observer were unbiased across the spectrum of benthic cover (mean ~ 0), and the variability around the mean difference was estimated at 4% (Critical Difference or 95% Confidence Interval of the difference) for all labels across the surveyed regions (Fig. 4).
Lastly, to evaluate the machine performance for estimating community structure, pair-wise comparisons of manual and automated estimations of benthic composition within each test transect were executed using the Bray-Curtis similarity index. This index is sensitive to misrepresentation in the automated estimation of abundance for specific labels or benthic groups when compared against manual observations. Therefore, index values of 100% represent a complete resemblance between machine and observer estimations for community composition. While the Absolute Error already provides a metric for label-specific performance of automated annotations, the community-wide analysis provides a synthesis of how closely the automated estimation of benthic composition matches manual observations across the range of communities within a region.
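A minimal sketch of this pair-wise comparison is given below, using the usual definition of the Bray-Curtis index expressed as a percentage similarity, 100 × (1 − Σ|m_i − o_i| / Σ(m_i + o_i)). Whether any data transformation was applied before computing the index is not stated here, so none is assumed; the function name and the toy covers are hypothetical.

```python
def bray_curtis_similarity(machine, observer):
    """Bray-Curtis similarity (%) between two cover compositions.

    machine, observer: {category: cover}; missing categories count as 0.
    """
    categories = set(machine) | set(observer)
    diff = sum(
        abs(machine.get(c, 0.0) - observer.get(c, 0.0)) for c in categories
    )
    total = sum(machine.get(c, 0.0) + observer.get(c, 0.0) for c in categories)
    if total == 0:
        raise ValueError("both compositions are empty")
    return 100.0 * (1.0 - diff / total)


# Toy compositions for one test transect; not values from the dataset.
similarity = bray_curtis_similarity(
    {"Hard Coral": 0.5, "Algae": 0.5},
    {"Hard Coral": 0.4, "Algae": 0.6},
)
```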
The comparison of communities within regions using the Bray-Curtis similarity index showed that the estimations of benthic composition were consistent between observer and machine (i.e., similarity between 84% and 94%) among and within regions (Fig. 5). Across regions, Australia and Southeast Asia exhibited the lowest similarity values, 84% and 88% respectively. Indian Ocean and Atlantic showed similarities around 90%, while Central Pacific Ocean presented up to 94% similarity between automated estimations and manual observations. Irrespective of the differences in community structure among and within regions, this comparison corroborates the reliability of the machine estimations at both label and community level.
Machine estimations were generally better for well-defined and large organisms in quadrats, such as corals and soft corals, showing the lowest error (<1%-2%). The window size (224 × 224 pixels), however, penalises the estimation of smaller, patchy or less defined organisms, such as algal categories (error 3%-6%). Additional caveats and discussions in terms of the errors and accuracy of data derived from automated classifiers can be found in Beijbom et al. 18,19 , González-Rivero et al. 23,25 and Williams et al. 22 .
In summary, the validation exercise demonstrates that the trained networks generated cover estimates that were unbiased compared to those derived from observers, with an overall error of 4%. Thus, our data set represents a useful source of information to support spatio-temporal analysis of coral reefs ranging from small (presence/absence within 1 m 2 quadrats) to broad scales (regional/global analysis).

Usage Notes
Users can identify transects of relevance using a map or file naming convention present on the data repository 34 . The survey ID ("surveyid" in tables) is the unique identifier of each reef location in space and time that users will require as a key cross-reference to navigate within the repository and link the different cover tables, as well as cover records with the imagery folders.
Our relational dataset contains some tables with thousands to millions of rows (e.g., cover, quadrat and annotation tables), hence we advise using software that can handle this structure, such as R, PostgreSQL or Microsoft Access. Manipulating these data with Microsoft Excel is not advised. For users interested in working with the whole dataset, the best approach will be to create a database, for instance in the MySQL system, and import all the csv files provided within the "tabular-data.zip". Then the user can relate the tables using the "surveyid", "imageid" and "quadratid" fields, which are cross-referenced between tables (see Data Records section and Table 3).
Accessing benthic covers. We provide guidelines to access cover records from the Indian Ocean Region (hereafter called "IOR") per survey and quadrat as a transferable example for retrieving any cover record. The first step is to download the "tabular-data.zip" file from the repository 34 . As soon as the file is unzipped, 18 csv files (i.e., tables) are extracted including six with cover records, one file for each of the five regions and one summary file. Within the latter, the "seaviewsurvey_surveys.csv" file, the user can directly access the percent cover of main benthic groups (hard coral, algae, soft coral, other invertebrate, and other) summarised per survey ("surveyid" column). Filtering this table by the "ocean" column, users can select the surveys associated only with IOR, in this case 92 surveys. In addition, users can filter the surveys per country ("country" column), link each survey to the zip folder with the images ("folder_name" column) and the geographic location (columns with latitude and longitude at the start and end of the survey).
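The survey-level filtering described above could look like the following sketch, which relies only on the column names quoted in the text ("surveyid", "ocean", "country", "folder_name"). The function name is hypothetical, and the exact value stored in the "ocean" column for the IOR is an assumption that should be checked against the file.

```python
import csv


def surveys_for_ocean(csv_lines, ocean):
    """Return the survey rows whose "ocean" column equals `ocean`.

    csv_lines: an open text file or any iterable of csv lines.
    The comparison ignores case and surrounding whitespace.
    """
    wanted = ocean.strip().lower()
    reader = csv.DictReader(csv_lines)
    return [row for row in reader if row["ocean"].strip().lower() == wanted]


# Hypothetical usage; the path and the "ocean" value are assumptions:
# with open("seaviewsurvey_surveys.csv", newline="") as handle:
#     ior_surveys = surveys_for_ocean(handle, "IND")
#     folders = {row["surveyid"]: row["folder_name"] for row in ior_surveys}
```

Each returned row is a dictionary keyed by column name, so the "country" column can be filtered in the same way.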
If users require percentage covers of detailed benthic categories per quadrat, they must work with the file "seaviewsurvey_reefcover_indianocean.csv" (if the user is interested in another region/country, please refer to Table 1 to find the appropriate file). Users can relate cover records at the level of surveys (seaviewsurvey_surveys.csv file) with records at the level of quadrats (seaviewsurvey_reefcover_indianocean.csv) using the "surveyid" field.
To summarise percentage covers at the level of quadrats, users must sum the values of the benthic categories of interest. In our example with the IOR, this is done by selecting and merging among the 45 field names of the columns in the "seaviewsurvey_reefcover_indianocean.csv" file. For this purpose, we recommend using the label-set table "seaviewsurvey_labelsets.csv" to understand the labels of different benthic categories, review the descriptions and examples, and guide your merging scheme. Filtering the label-set table by the "region" column, in this case selecting Indian Ocean, users will obtain the 45 categories (under the "label" column) associated with this region. Please note we also provide a reduced set of categories (under "merged_label" and "merged_name" columns) used in the technical validation; however, users can merge the categories based on their own criteria. A good consistency check to use after merging categories is that the cover fields in each row should always sum to 1.0 (these values are proportions, and 1.0 represents 100%).
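The merging step and the consistency check can be sketched as follows. The sketch assumes that the cover column names match the entries of the "label" column in "seaviewsurvey_labelsets.csv", so that a merging scheme can be built as a label-to-category mapping (for instance from the "label" and "merged_label" columns). The function names, the toy labels and the tolerance are hypothetical.

```python
def merge_cover(cover_row, scheme):
    """Sum the cover of detailed labels into merged categories.

    cover_row: {label: proportion} for one quadrat (values may be strings)
    scheme:    {label: merged category}
    """
    merged = {}
    for label, cover in cover_row.items():
        category = scheme[label]
        merged[category] = merged.get(category, 0.0) + float(cover)
    return merged


def sums_to_one(cover, tolerance=1e-6):
    """Consistency check: proportions in one row should add up to 1.0."""
    return abs(sum(cover.values()) - 1.0) <= tolerance


# Toy quadrat with hypothetical labels; not values from the dataset.
row = {"ACR_BRA": "0.2", "POR_MAS": "0.3", "EAM": "0.4", "SAND": "0.1"}
scheme = {"ACR_BRA": "Hard Coral", "POR_MAS": "Hard Coral",
          "EAM": "Algae", "SAND": "Other"}
merged = merge_cover(row, scheme)
```

Only the cover columns should be passed to `merge_cover`; ID and metadata columns need to be set aside first.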
To create summaries of the cover data for spatial analyses (at different scales), users will need to consider the cases where more than one 1 m² photo-quadrat is extracted from a single georeferenced image (see Post-processing of images section). This implies that all the quadrats originating from the same image have the same georeference. For this situation, we recommend aggregating the cover table (in this case "seaviewsurvey_reefcover_indianocean.csv") by the "imageid" field, taking the mean cover value among all records (quadrats) for an image. From this point, users can spatially aggregate the images according to their own criteria and/or purposes. For instance, we applied hierarchical clustering to obtain the 30 m test transects for the technical validation. Further details about this method are provided in González-Rivero et al. 23 (see SM4 in Supplementary Material) and Vercelloni et al. 37. When aggregating, users can relate cover records using the "surveyid", "imageid" and "quadratid" fields. Note that the dataset has a unique ID for each survey (five digits), image (nine digits) and quadrat (eleven digits), and that these IDs link the different tables and files. Thus, in the ID 36001004202 from the IOR cover dataset, the first five digits correspond to the survey ID (36001); the survey ID plus the next four digits form the image ID (360010042); and the image ID plus the last two digits form the quadrat ID (36001004202, where 02 implies that at least two quadrats are available for image 360010042).
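The ID structure and the per-image averaging described above can be sketched as follows. The digit layout (5, 9 and 11 digits) and the example ID 36001004202 are taken from the text; the cover columns LAB_A and LAB_B, the cover values, the second image ID and the helper names are assumptions for illustration.

```python
from collections import defaultdict


def split_quadrat_id(quadratid):
    """Decompose an 11-digit quadrat ID into survey ID, image ID and quadrat number."""
    quadratid = str(quadratid)
    if len(quadratid) != 11 or not quadratid.isdigit():
        raise ValueError("expected an 11-digit quadrat ID")
    return {
        "surveyid": quadratid[:5],        # first five digits
        "imageid": quadratid[:9],         # survey ID plus the next four digits
        "quadrat_number": quadratid[9:],  # last two digits
    }


def mean_cover_per_image(rows, cover_fields):
    """Average the cover of all quadrats extracted from the same image."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["imageid"]].append(row)
    means = {}
    for imageid, quadrats in grouped.items():
        means[imageid] = {
            field: sum(float(q[field]) for q in quadrats) / len(quadrats)
            for field in cover_fields
        }
    return means


# Invented quadrat records; LAB_A and LAB_B are placeholder cover columns.
quadrats = [
    {"imageid": "360010042", "quadratid": "36001004201", "LAB_A": 0.25, "LAB_B": 0.75},
    {"imageid": "360010042", "quadratid": "36001004202", "LAB_A": 0.5, "LAB_B": 0.5},
    {"imageid": "360010043", "quadratid": "36001004301", "LAB_A": 1.0, "LAB_B": 0.0},
]

parts = split_quadrat_id("36001004202")
image_means = mean_cover_per_image(quadrats, ["LAB_A", "LAB_B"])
print(parts)
print(image_means["360010042"])
```

Because every quadrat from one image shares the same georeference, the per-image means can then be joined to the image coordinates before any further spatial aggregation.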
Accessing temporal data. Table 1 identifies which regions/countries have temporal data. Using the "seaviewsurvey_surveys.csv" table, users can identify which reef locations have been surveyed multiple times by filtering by the "transectid" field. A transect ID will appear more than once in this field if the transect has been surveyed in more than one year. For instance, filtering the table by Indian Ocean/Maldives, and then by the first transect ID, 37001, shows two survey IDs associated with this transect, 37001 and 46001, which were done in 2015 and 2017 respectively, as recorded in the "surveydate" field. This means that images and cover data are available for this reef location for these years. Once the user has identified and selected reef locations with temporal data, we recommend plotting the georeferenced image data of the surveys of interest to visually explore their spatial overlap across years and to design appropriate temporal comparisons. Further details about temporal assessments are provided in González-Rivero et al. 23 (see SM4 in Supplementary Material), Kennedy et al. 39 and Vercelloni et al. 37.
Accessing the imagery associated with benthic covers. To inspect the image quadrats associated with the data in detail, users only need the survey ID(s) and to download the required set of quadrats from the "photo-quadrats" directory at the repository 34. Within this directory, the different zip folders of each survey can be identified with a code (e.g., IND_MDV_37001_201503) that includes the five digits of the survey ID in the
Fig. 4 Overall agreement between observer (manual) and machine (networks) estimations of abundance (cover). Agreement is here discretised in two metrics: (a) Correlation between machine and observer annotations, and (b) bias (Bland-Altman plot).
Each filled circle in these panels represents the estimated cover for each label classified by both the machine and the observer in a given transect. The correlation shows that estimations of benthic abundance by expert observations are significantly represented by the automated estimations (R² = 0.97). The Bland-Altman plot shows that, overall, the differences (Bias) between machine and observer tend to a mean of zero (grey continuous line), with a homogeneous error around the mean, defined by the Critical Difference (Critical diff.) or the 95% Confidence Interval of the difference between observers and machines (dashed grey lines). Figure originally published in González-Rivero et al. 23 (as Figure A1).
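Returning to the guidelines under "Accessing temporal data", the search for transects surveyed in more than one year can be sketched as below. The field names "surveyid", "transectid" and "surveydate", and the pair of surveys 37001 (2015) and 46001 (2017) for transect 37001, come from the text. The format of "surveydate" (a bare year here), the extra survey 37002 and the helper `repeated_transects` are assumptions.

```python
import csv
import io
from collections import defaultdict

# Stand-in for "seaviewsurvey_surveys.csv". The real "surveydate" format is not
# described in the text, so a bare year is assumed; survey 37002 is invented.
SURVEYS_CSV = """surveyid,transectid,surveydate
37001,37001,2015
46001,37001,2017
37002,37002,2015
"""


def repeated_transects(survey_rows):
    """Return transects with more than one survey, as sorted (date, surveyid) pairs."""
    by_transect = defaultdict(list)
    for row in survey_rows:
        by_transect[row["transectid"]].append((row["surveydate"], row["surveyid"]))
    return {
        transectid: sorted(visits)
        for transectid, visits in by_transect.items()
        if len(visits) > 1
    }


surveys = list(csv.DictReader(io.StringIO(SURVEYS_CSV)))
repeats = repeated_transects(surveys)
print(repeats)
```

The survey IDs returned for each transect can then be used to retrieve the corresponding cover records and image folders for each year.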