Background & Summary

Understanding how life is distributed on earth and how these patterns are shaped by the environment, including interactions between species, have been key goals in biology for centuries1–3, while biodiversity conservation has provided an additional motivation for understanding spatial patterns in nature and linking these to human impacts. The recent increase in the availability of broad-scale biological data, particularly in the form of extraordinarily large online databases such as the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS), has re-invigorated investigation into the distribution of life on earth in relation to environmental drivers4,5, and fuelled the growing study of macroecology6,7. However, these virtual collections of records lack the quantitative abundance information that is particularly important for biodiversity conservation applications. For example, opportunities for reliable assessment of species for the IUCN Red List are greatly reduced without abundance information, and using range maps based on presence-only records to prioritize conservation planning can lead to protection of marginal habitat8.

A shortage of resources and coordination across political boundaries has limited collection of broad-scale quantitative biodiversity datasets. However, even the largest, most expensive, multi-national scientific projects have not provided consistently collected abundance data at a global scale for any community type. For example, the US$650 million 10-year Census of Marine Life, while achieving enormous gains in understanding the variety of life in the sea, fell short in progress towards one of its goals—to estimate the abundance of marine life9.

Established in 2007 as a means to overcome the shortage of resources and capacity to provide quantitative data on marine species over large temporal and spatial scales, the Reef Life Survey (RLS) program has involved data collection by an international network of trained volunteer (or ‘citizen’) scientists and professional biologists largely acting in a voluntary capacity. Focussing on quality of outputs and consistency of data through selective inclusion and training of volunteer participants, rather than broader engagement of all interested, RLS fills a niche between other citizen science programs and large-scale professional initiatives such as the Census of Marine Life. The RLS program represents a marine analogue to well-organized and large-scale amateur bird watching programs (e.g., eBird and the Christmas Bird Count), but with a more structured quantitative sampling methodology than most. Through the long term, it aims to provide a biological equivalent to the synoptic picture of the physical parameters generated for the world’s oceans through sensor networks such as the ARGO float array10 and the Australian Integrated Marine Observing System (IMOS).

Here we describe the global reef fish dataset collected by the Reef Life Survey program, which can be used to assess large-scale spatial patterns in diversity and community structure, and as a baseline for comparison with future surveys, to address long-standing ecological questions or conservation goals. These data have already been used to describe global patterns in reef fish functional diversity11 and have provided the most comprehensive empirical assessment of key features for successful marine protected area (MPA) design and management12. The dataset described here includes all survey sites analysed for the latter study, with the exception of data collected using the same methodology from 107 sites that were provided to us for analysis but belong to other organisations or are otherwise confidential. Some transects surveyed at different depths at the same site, but on different days, have also been excluded from this dataset, which only includes surveys from the latest date (at the time of writing) at any given site. A summary of survey effort and key diversity values by ecoregion13 is provided in Table 1.

Table 1 Summary of taxonomic richness and abundance information by ecoregion


Survey methods

The RLS global reef fish dataset includes data from 1,879 sites, collected using standard RLS survey methods, described in detail in the online Reef Life Survey methods manual. Surveys involve underwater visual census (UVC) by SCUBA divers along a 50 m transect line, laid along a depth contour on hard substrate (coral or rocky reef). All fish species observed within 5 m of the transect line were recorded on a waterproof datasheet as the diver swam slowly along the line (at approximately 2 m/min).

Abundance estimates were made by keeping a tally of individuals of less abundant species and, in locations with high fish densities, estimating the number of more abundant species. Abundances of schooling fishes were estimated by counting a subset of the school and combining this count with an estimate of the proportion of the total school that the subset represented. In coral reefs with high fish species richness and densities, the order of priority for accurate recording was first to ensure that all species observed along transects were included, then to tally individuals of larger or rare species, and finally to estimate abundances of more common species. Only divers with the most extensive and appropriate experience undertook surveys in diverse coral reefs.
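The school-size estimate can be illustrated with a small sketch. The text does not spell out the arithmetic, so the division used here (subset count divided by the estimated proportion counted) is an assumption about how the two quantities are combined, and the function name is hypothetical.

```python
def estimate_school_size(counted: int, proportion_counted: float) -> int:
    """Scale a partial count up to a whole-school estimate.

    counted: individuals actually counted in the subset.
    proportion_counted: the diver's estimate of the fraction of the school
        that the subset represents (0 < proportion_counted <= 1).
    Assumption: the two are combined by simple division.
    """
    if counted < 0:
        raise ValueError("counted must be non-negative")
    if not 0 < proportion_counted <= 1:
        raise ValueError("proportion_counted must be in (0, 1]")
    return round(counted / proportion_counted)


# A diver counts 40 fish and judges them to be about a quarter of the school.
school_estimate = estimate_school_size(40, 0.25)
print(school_estimate)  # 160
```

Any error in the estimated proportion carries through directly to the total, which is consistent with treating these records as estimates rather than exact tallies.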

Nearly all fishes observed were identified to species level, with photographs of unknown species taken with an underwater digital camera for later identification using appropriate field guides and consultation with taxonomic experts for the particular group, as necessary. When species level identification was not possible, records were classified at the highest taxonomic resolution possible given the information available and experience of the observer. In total, 1.8% of records in the RLS global reef fish dataset were not at the species level (i.e., are at genus level or higher).

Fishes within 5 m of the line were recorded separately for each side of the transect line, with each side referred to as a ‘block’. Thus, two blocks form a complete transect (also referred to as an individual survey). Multiple transects were usually surveyed at each site (global mean 1.98±0.03 SE transects per site, min=1, max=9), usually along different depth contours (mean depth 7.28 m±0.07 SE, min=0.1 m, max=42 m). Sites are distinguished by unique site codes with latitude and longitude recorded in decimal degrees (WGS84) using a handheld GPS unit, or occasionally taken from Google Earth.
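Because site coordinates are stored as WGS84 decimal degrees, they can be written to KML for display in Google Earth, as the usage notes suggest. The following is a minimal sketch using only the Python standard library. The site codes and coordinates are made up for illustration, and the function name is hypothetical. KML lists longitude before latitude.

```python
from xml.sax.saxutils import escape


def sites_to_kml(sites):
    """Build a KML document string from (site_code, lat, lon) tuples.

    Coordinates are WGS84 decimal degrees. KML expects "lon,lat" order.
    """
    placemarks = []
    for site_code, lat, lon in sites:
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(str(site_code))}</name>\n"
            f"    <Point><coordinates>{lon},{lat}</coordinates></Point>\n"
            "  </Placemark>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n"
        + "\n".join(placemarks)
        + "\n</Document>\n</kml>\n"
    )


# Hypothetical site codes and coordinates, for illustration only.
example_sites = [("SITE_A", -42.88, 147.33), ("SITE_B", -16.92, 145.78)]
kml_text = sites_to_kml(example_sites)

# Writing kml_text to a file with a .kml extension makes it openable in Google Earth.
```

With the real dataset, the tuples would come from the SiteCode, SiteLat and SiteLong fields, taking one row per unique site code.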

Quality control

Data in the RLS global reef fish dataset were collected by a combination of experienced scientists and skilled recreational divers, with all divers having either substantial prior experience in reef fish surveys or extensive training in the RLS methods. Screening of interested divers was undertaken before training so that only the most committed and capable divers with appropriate SCUBA experience were invited to participate. Although a minimum of 50 dives’ experience was used as a standard in diver selection, a survey of RLS divers in 2010 indicated that most RLS divers had completed over 300 dives. For divers without prior formal scientific training, one-on-one instruction in survey methods and assistance with species identification were provided during a training course typically lasting four to five days, but up to two weeks (depending on local marine life and skills of the diver). During these courses, trainees undertook practice surveys with an experienced scientist, who carefully compared the trainee's data with their own following each dive, with final approval given once the trainee's data were considered highly consistent with those of the trainer. A formal comparison of data collected by divers without tertiary scientific training with data collected by experienced scientists showed that the variation between recreational and scientific divers was non-significant and negligible in comparison to other sources of variation within and between sites14. The vast majority of divers who contributed data to this dataset were trained by the authors. Data collected during training were not added to the database.

Following each survey, each RLS diver transcribed their data from the underwater datasheets onto custom data entry forms in Microsoft Excel. This was usually done the same day as survey dives were undertaken. Excel data entry templates contained lookups from region-specific species lists and were in a consistent format for uploading to the RLS database. Data checks were made upon upload to the database, including for data structure (and completeness) and consistency in metadata among divers, as well as checks designed to detect species not previously recorded by RLS divers in that particular region. Any species added that was previously not in the RLS database for that region prompted querying of those particular data points, and taxonomic and distributional data were also checked before addition of new species.
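The regional novelty check described above can be sketched as follows. This is an illustration only, not the RLS database code: the record layout, the key names (`region`, `species`) and the function name are assumptions, and the region and species labels are placeholders.

```python
def flag_new_regional_records(records, known_species_by_region):
    """Return records whose species has not been recorded in that region before.

    records: iterable of dicts with 'region' and 'species' keys (assumed layout).
    known_species_by_region: dict mapping region name -> set of species names.
    """
    flagged = []
    for record in records:
        known = known_species_by_region.get(record["region"], set())
        if record["species"] not in known:
            flagged.append(record)
    return flagged


# Placeholder region and species labels, for illustration only.
known = {"Region X": {"Species A", "Species B"}}
uploaded = [
    {"region": "Region X", "species": "Species A", "total": 3},
    {"region": "Region X", "species": "Species C", "total": 1},
]
to_query = flag_new_regional_records(uploaded, known)
print([r["species"] for r in to_query])  # ['Species C']
```

Flagged records are candidates for follow-up rather than errors in themselves. In the workflow described, they prompt a query and a check of taxonomic and distributional information before the species is added for that region.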

Consistency in data collection was continually emphasized, and was assisted by continued participation of the same divers over time. Further to this, the authors participated in surveys over many ecoregions (59 of 72 collectively), collecting 35.1% of all data in the dataset, and thus providing a substantial element of consistency in diver participation at the global scale.

Data collection mechanisms

This dataset was compiled from data collected in a combination of collaborative surveys with scientific colleagues worldwide, targeted RLS field campaigns and ad-hoc local surveys by trained RLS divers at their regular dive sites or when on holidays. Field campaigns involved small groups of divers (usually 4 to 8) undertaking survey dives over a period of four days to two weeks (or occasionally longer) under the direction and supervision of a scientist or experienced survey diver (mostly one of the authors). At the conclusion of each field campaign, one of the RLS organizers or scientists leading the trip collated data from participants and undertook manual checks of the data. These checks included close scrutiny of species lists, abundances and site details. Evidence in the form of images was typically requested for records of species not seen by the experienced surveyor on the trip, with such evidence essential for divers with less experience in that particular region. Uncertain records or records of new species for regions for which definitive evidence was not available were reduced to the highest taxonomic resolution for which there was confidence (usually genus). For ad-hoc surveys by trained divers outside of group field campaigns, species identification assistance and data transfer occurred via email, and all the data checks were made by a scientist in the office before uploading data to the database.

Data Records

Data record 1

The RLS reef fish dataset is managed in a live database, and thus any errors are corrected as identified and taxonomic details updated as appropriate. It is accessible in comma-separated format on the Reef Life Survey website (Data Citation 1), and contains the data fields outlined in Table 2. We strongly recommend the use of this ‘live’ dataset over the archived version described below as Data Record 2, and would appreciate that any errors identified by users of the dataset be reported to the corresponding author to enable correction where necessary, allowing improvement for subsequent users.

Table 2 Key to the data fields in the Reef Life Survey reef fish dataset.

Data record 2

An archival version of the RLS reef fish dataset, in the same format (comma-separated) and the same fields as described above, has been deposited in figshare (Data Citation 2). Details of the data fields provided in Table 2 are also provided in csv format associated with this data record.

Technical Validation

All methods to estimate fish densities involve biases. For UVC methods, as used to collect RLS data, biases have been explored and include species-specific avoidance or attraction to divers and influences of habitat, visibility or species’ physical and behavioural characteristics on detectability15–17. Differences between divers have also been noted to contribute to variation in UVC data18,19, although this variation is generally small compared to that associated with the spatial and temporal factors of interest such as site, region or month14,15.

We consider that, among the available methods for non-destructive sampling of marine fish communities, the UVC method applied here provides the most efficient means to cover the diverse range of assemblages, micro-habitats and conditions associated with the world’s coral and rocky reefs. It must be noted, however, that the densities of diver-shy species will consistently be under-estimated, while densities of species attracted to divers will be over-estimated, and the net effect of these on total fish density at any given site will depend on the local species composition. Habitat characteristics can potentially affect species detectability; however, in an experiment in which a macroalgal forest was cleared, the difference in detectability between vegetated and open reef was found to be negligible for five of six fish species15. In summary, given a range of biases, data should not be regarded as providing accurate estimates of fish density within transect blocks; however, biases are largely systematic at the species level, and data can be used for relative comparisons, such that if counts of a species double between sites, then underlying densities can also be regarded as doubling.
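The argument for relative comparisons can be written out under a simplifying assumption that is not stated in this form in the text: the expected count $c_{s,i}$ of species $s$ at site $i$ equals its true density $d_{s,i}$ scaled by a species-specific detectability factor $p_s$ that does not vary between sites. The symbols are introduced here for illustration only.

```latex
\mathbb{E}[c_{s,i}] = p_s \, d_{s,i}
\quad\Longrightarrow\quad
\frac{\mathbb{E}[c_{s,1}]}{\mathbb{E}[c_{s,2}]}
  = \frac{p_s \, d_{s,1}}{p_s \, d_{s,2}}
  = \frac{d_{s,1}}{d_{s,2}}
```

The factor $p_s$ cancels in the ratio, so a doubling of counts corresponds to a doubling of density even when $p_s$ is far from 1. The cancellation holds only within a species, and only to the extent that detectability is similar at the sites being compared.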

Usage Notes

There are a few important considerations for usage of the RLS global reef fish dataset (field names are described in Table 2):

  1. The level of spatial aggregation of data. The base unit in the data is a transect block (250 m²). There are always two adjacent blocks per complete transect (surveyID), which can be summed to form a standard 500 m² survey area. However, multiple transects are usually surveyed within each site (SiteCode), and the number of transects differs between sites. Given blocks, and even transects, within a site are not independent of one another, analysis of patterns among sites should consider spatial autocorrelation at the site scale. Options include aggregating data across blocks and transects within each site (e.g., calculating mean densities per 250 m² or 500 m²), or adding SiteCode as a nested spatial factor in statistical models and using summed values within each 500 m² transect. If aggregating data across multiple transects per site, consideration should be given to the accumulation of species in the sample with each additional transect (see consideration 2 below).

  2. Species richness or diversity estimates will be biased by the varying number of transects surveyed at each site if not calculated consistently at the block or transect (surveyID) level first (or if fixed coverage sampling methods are not used).

  3. Spatial bias. The Australian continent is much better sampled than other parts of the world, while high-latitude areas are poorly covered, with no survey data from Antarctica. Table 1 lists the number of surveys and sites within each ecoregion and provides a quick guide to coverage within any given ecoregion. Likewise, Figure 1 and the site map on the RLS website can be used for a quick visual overview. Site latitude (SiteLat) and longitude (SiteLong) data are readily extracted from the dataset for conversion to kml files and display in Google Earth.

Figure 1: The distribution of sites included in the Reef Life Survey reef fish dataset (black circles). Many sites are overlapping. Marine Ecoregions of the World13 have been colour-coded to reflect the density of sites surveyed within.

There are also many other considerations which are common to any dataset of this nature. These are not described in detail here, but a few examples include:

  4. Spatial autocorrelation (i.e., sites in close proximity exhibit closer relationships than sites at distance) should be considered for many analyses as data are highly clumped.

  5. Consideration should be given to assumptions relating to species detectability for some studies, including whether differences in detectability due to physical appearance, behaviour or abundance may affect conclusions.

  6. Abundance data have a long distribution tail, and factors typically act in a multiplicative rather than additive way (e.g., a change from 5–10 is more ecologically comparable to a change from 50–100 than from 50–55), so log transformation of abundance data is appropriate in most cases.
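The multiplicative reasoning behind log transformation can be checked numerically. The sketch below uses base-10 logarithms. The helper names are hypothetical, and the `+ 1` offset for zero counts is a common convention assumed here rather than something prescribed by the text.

```python
import math


def log_change(before: float, after: float) -> float:
    """Change on the log10 scale; both values must be positive."""
    return math.log10(after) - math.log10(before)


def log_abundance(count: float) -> float:
    """log10(count + 1), an assumed convention that keeps zero counts defined."""
    return math.log10(count + 1)


doubling_small = log_change(5, 10)    # about 0.301
doubling_large = log_change(50, 100)  # about 0.301
additive_step = log_change(50, 55)    # about 0.041

print(round(doubling_small, 3), round(doubling_large, 3), round(additive_step, 3))
```

On the log scale the two doublings are the same size, while the additive step from 50 to 55 is much smaller, matching the ecological interpretation given in the text.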

The analyses of this dataset described in associated papers11,12 used community level metrics (e.g., diversity values) calculated per transect or block, and averaged across transects within each site. We also found the random forest methods used to model community metrics in these papers to be robust to spatial autocorrelation.
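The per-transect-then-per-site approach can be sketched as follows: blocks are summed within each transect (surveyID), richness is counted per transect, and both are averaged across transects within each site (SiteCode). This is an illustration, not the code used in the cited papers. The field names `Taxon` and `Total` are assumptions (only `surveyID`, `SiteCode`, `SiteLat` and `SiteLong` are named in the text), and the example rows are made up.

```python
from collections import defaultdict
from statistics import mean


def site_means(rows):
    """Per-site mean abundance and species richness per 500 m2 transect.

    rows: iterable of dicts with keys SiteCode, surveyID, Taxon, Total
    (Taxon and Total are assumed field names), one row per taxon per block.
    """
    abundance = defaultdict(float)  # (site, survey) -> count summed over both blocks
    taxa = defaultdict(set)         # (site, survey) -> taxa recorded on the transect
    for row in rows:
        key = (row["SiteCode"], row["surveyID"])
        count = float(row["Total"])
        abundance[key] += count
        if count > 0:
            taxa[key].add(row["Taxon"])

    per_site = defaultdict(lambda: {"abundance": [], "richness": []})
    for (site, survey), total in abundance.items():
        per_site[site]["abundance"].append(total)
        per_site[site]["richness"].append(len(taxa[(site, survey)]))

    return {
        site: {
            "n_transects": len(values["abundance"]),
            "mean_abundance": mean(values["abundance"]),
            "mean_richness": mean(values["richness"]),
        }
        for site, values in per_site.items()
    }


# Made-up rows for illustration; real rows could come from csv.DictReader.
rows = [
    {"SiteCode": "S1", "surveyID": 1, "Block": 1, "Taxon": "Species A", "Total": 4},
    {"SiteCode": "S1", "surveyID": 1, "Block": 2, "Taxon": "Species A", "Total": 6},
    {"SiteCode": "S1", "surveyID": 1, "Block": 2, "Taxon": "Species B", "Total": 2},
    {"SiteCode": "S1", "surveyID": 2, "Block": 1, "Taxon": "Species A", "Total": 8},
    {"SiteCode": "S2", "surveyID": 3, "Block": 1, "Taxon": "Species C", "Total": 5},
]
summary = site_means(rows)
print(summary["S1"])
```

Averaging per-transect richness, rather than pooling species across all transects at a site, keeps sites with different numbers of transects comparable. A transect on which no fish were recorded would not appear in rows of this form and would need separate handling.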

Additional information

How to cite this article: Edgar, G. J. & Stuart-Smith, R. D. Systematic global assessment of reef fish communities by the Reef Life Survey program. Sci. Data 1:140007 doi: 10.1038/sdata.2014.7 (2014).