The retrospective analysis of Antarctic tracking data project

The Retrospective Analysis of Antarctic Tracking Data (RAATD) is a Scientific Committee for Antarctic Research project led jointly by the Expert Groups on Birds and Marine Mammals and Antarctic Biodiversity Informatics, and endorsed by the Commission for the Conservation of Antarctic Marine Living Resources. RAATD consolidated tracking data for multiple species of Antarctic meso- and top-predators to identify Areas of Ecological Significance. These datasets and accompanying syntheses provide a greater understanding of fundamental ecosystem processes in the Southern Ocean, support modelling of predator distributions under future climate scenarios and create inputs that can be incorporated into decision making processes by management authorities. In this data paper, we present the compiled tracking data from research groups that have worked in the Antarctic since the 1990s. The data are publicly available through biodiversity.aq and the Ocean Biogeographic Information System. The archive includes tracking data from over 70 contributors across 12 national Antarctic programs, and includes data from 17 predator species, 4060 individual animals, and over 2.9 million observed locations.


Background & Summary
There is increasing evidence and concern that Southern Ocean ecosystems are facing globally significant challenges, especially in regions undergoing some of the fastest rates of warming on Earth, or where commercial fishing may be impacting ecosystem processes. At lower latitude locations, in the west, such as the Antarctic Peninsula, winter air temperatures have warmed by 4.8 times the global average, and ocean surface temperatures have risen by 1 °C 1,2 . At the same time, concerns about commercial catches of Antarctic krill Euphausia superba and toothfish Dissostichus spp. continue, e.g. 3 . Ecological effects arising at multiple scales from the physical changes in the environment require further investigations [4][5][6] to allow a realistic assessment of the effects of regional and global warming and ocean acidification vs. top predator recoveries and/or fishing 7 . The paucity of data on spatial and temporal ecosystem dynamics, and heterogeneity of change even at relatively small spatial scales, e.g. 8 , adds considerable uncertainty around projections for biological systems. Local mitigation or management measures require a solid knowledge foundation to encapsulate critical ecosystem processes or vulnerable ecosystem components 9 .
The distributions and abundances of marine endotherms in the Southern Ocean are linked to both habitat and prey availability 10 . Areas with high concentrations of predators often signal higher diversity or abundance of lower trophic organisms, and are therefore regions that may need special management consideration. In addition to a long history of at-sea surveys, e.g. 11 , recent advances in electronic tagging techniques provide the capacity to record the movement and behaviour of a range of animals in relation to environmental parameters 12 . Bio-loggers and transmitters now allow collection of different types of data at the individual level, including geographic location and environmental data 13,14 . The use of these devices is now commonplace, leading to an explosion in the quantity and quality of data, creating new challenges for data management, integration, and analysis, and requiring the development of new tools and approaches 15 . Scientists have thus taken advantage of the miniaturisation of electronic tags to remotely follow penguins, petrels, albatross, seals and whales at sea for more than two decades in the Southern Ocean to learn how they spend their time at sea and understand the role they play in different food webs. While lacking a species-interaction context, such data can help to identify regions utilized by multiple species of predators, which are indicative of Areas of Ecological Significance 16 , or biological hotspots, e.g. 17,18 .
Despite the considerable number of studies on the distribution and habitat-use patterns of upper-trophic-level, air-breathing vertebrates in parts of the Southern Ocean based on tracking data, e.g. 19,20 , no
# A full list of authors and their affiliations appears at the end of the paper.

Methods
Original deployment of tracking devices. RAATD aggregated data from three types of tracking devices (Fig. 1). In increasing order of precision these are light-level recording Global Location Sensors (GLS loggers or geolocators), satellite-relayed Platform Terminal Transmitters (PTTs), and Global Positioning System devices (GPS). Typically, GLS and GPS devices record data in internal memory, and must be physically recovered in order to download the data. PTTs transmit a carrier signal to satellites, and can deliver data remotely and in near-real time. Some modern devices now combine the capabilities of PTT and GPS (or other) devices, relaying high-quality GPS data to the end user via satellites. GLS devices, which are among the smallest, allowing for deployments on smaller predators, typically record ambient light levels throughout the day, from which coarse estimates of latitude and longitude can be calculated (to within 100-200 km) using day length and timing of local noon. Some GLS units can also record sea surface temperature, which can help refine position estimates 25 . GLS positions were estimated by RAATD data contributors using five methods [26][27][28][29][30] (GLS Methods 23 ) and generally corresponded to individual distribution during the non-breeding season. GPS tags make use of global navigation satellite systems and provide very high resolution (about ten meters) location fixes and time information. Some are satellite-linked, while others have smaller batteries and must be recovered (i.e. the animal carrying the tag must be recaptured) to download the archived data. PTT tags transmit signals to Argos satellites, which relay them to a receiving station at the Collecte de Localisation Satellites (CLS) in Toulouse, France, where locations are estimated from Doppler shifts in the received signals to an accuracy of approximately 1,000 m. Processing by CLS involved a least-squares filtering method up to 2008; thereafter, Kalman filters have been used 31 . Different models of GLS, PTT, and GPS devices from different manufacturers have been used throughout the years, each having specific characteristics (size, operating modes, etc.) that may influence the accuracy of the locations, but because device type was not always provided by the data providers, a standard correction has been applied in RAATD (see below). In summary, the "RAATD core group" (i.e. the analysing team) worked on location data converted from light-level data by the data contributors, on CLS-processed PTT location data, and on raw data directly delivered by GPS devices.
Fig. 1 Data workflow from tracking-device deployment on animals to state-space model-filtered tracks (and associated data). Arrows and boxes correspond to the specific sections in the text. The blue box indicates the filtering and validation workflow for which R scripts are provided; purple boxes indicate publicly available data files through the AADC and Darwin Core packages available through the Global Biodiversity Information Facility (GBIF) and Ocean Biogeographic Information System (OBIS).
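To illustrate why light-level geolocation yields only coarse positions, the following is a minimal sketch of the textbook relationship between local noon, day length and position. It is not one of the five contributor methods cited above; it ignores the equation of time, atmospheric refraction and sensor shading, and all function names are hypothetical.

```python
import math


def longitude_from_noon(noon_utc_hours):
    """Longitude (degrees east) from the UTC hour of local solar noon.

    The Earth rotates 15 degrees per hour, so noon observed h hours before
    12:00 UTC places the animal 15 * h degrees east of Greenwich.
    """
    lon = (12.0 - noon_utc_hours) * 15.0
    return (lon + 180.0) % 360.0 - 180.0


def day_length_hours(latitude_deg, declination_deg):
    """Day length from the sunrise equation cos(H) = -tan(lat) * tan(decl)."""
    x = -math.tan(math.radians(latitude_deg)) * math.tan(math.radians(declination_deg))
    x = max(-1.0, min(1.0, x))  # clamp for polar day / polar night
    return 2.0 * math.degrees(math.acos(x)) / 15.0


def latitude_from_day_length(day_length, declination_deg):
    """Invert the sunrise equation to recover latitude from day length."""
    tan_decl = math.tan(math.radians(declination_deg))
    if abs(tan_decl) < 1e-6:
        # Near the equinoxes day length is ~12 h everywhere, so it carries
        # no latitude information.
        raise ValueError("solar declination too close to zero")
    half_day_angle = math.radians(day_length / 2.0 * 15.0)
    return math.degrees(math.atan(-math.cos(half_day_angle) / tan_decl))
```

The explicit failure when the solar declination approaches zero reflects the well-known weakness of day-length-based latitude estimates around the equinoxes.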
Device attachment to animals was also species-specific. Loggers that are small enough, such as GLS devices, are mounted on leg or flipper bands/tags, while larger data-loggers and transmitters are often attached to the plumage or pelage on the back or head of the animal, a position that optimizes data communication with satellites. Devices were attached to the back using harnesses, glue or marine tape. For whales, transmitters with subcutaneous anchors were attached to the back using poles, crossbows or air guns. Scientists limited handling time and stress as much as possible during attachment and retrieval of devices, e.g. [32][33][34][35][36][37] , following established animal handling guidelines that meet ethical reviews. However, it should be noted that the RAATD dataset contains tracking data that span almost three decades, during which time substantial progress has been made in terms of miniaturization and advances in electronic components. Any adverse effects of devices on animals are therefore likely to be less acute in recent years compared to the earlier years of tracking.
Step 2. Associated metadata. Where available, information on the deployment site and relevant characteristics of the animal at the time of deployment was standardized. Where age class and sex were known, this information was included in the metadata.
Step 3. Data standardization. Location dates and times were converted to UTC (Coordinated Universal Time). Records with missing latitude or longitude values were removed, and all longitudes were transformed to lie between 180 °W and 180 °E. Data files were row-ordered by individual, with rows within an individual in their correct temporal sequence. Near-duplicate positions were removed. These were defined as positions that occurred three seconds or less after an existing position fix from the same animal and that had either identical longitude and latitude values (for GPS devices) or longitude and latitude values that differed by less than 1 × 10⁻⁵ together with the same location quality value (for PTT devices).
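The standardization rules in this step can be sketched as follows. This is an illustrative Python outline, not the project's R code: the record layout (dictionaries with `individual_id`, `time`, `lon`, `lat`, `device_type` and `location_quality` keys) is an assumption, timestamps are assumed to be timezone-aware, and each fix is compared only with the most recently retained fix of the same animal.

```python
from datetime import datetime, timedelta, timezone


def wrap_longitude(lon):
    """Map any longitude onto [-180, 180); note that 180 itself becomes -180."""
    return (lon + 180.0) % 360.0 - 180.0


def is_near_duplicate(prev, cur, max_gap=timedelta(seconds=3), tol=1e-5):
    """Apply the device-specific near-duplicate rule to two consecutive fixes."""
    if cur["time"] - prev["time"] > max_gap:
        return False
    if cur["device_type"] == "GPS":
        return cur["lon"] == prev["lon"] and cur["lat"] == prev["lat"]
    if cur["device_type"] == "PTT":
        return (abs(cur["lon"] - prev["lon"]) < tol
                and abs(cur["lat"] - prev["lat"]) < tol
                and cur["location_quality"] == prev["location_quality"])
    return False


def standardize(records):
    """Convert to UTC, drop incomplete rows, wrap longitudes, sort, de-duplicate."""
    rows = [
        dict(r, time=r["time"].astimezone(timezone.utc), lon=wrap_longitude(r["lon"]))
        for r in records
        if r.get("lon") is not None and r.get("lat") is not None
    ]
    rows.sort(key=lambda r: (r["individual_id"], r["time"]))
    kept = []
    for r in rows:
        same_animal = kept and kept[-1]["individual_id"] == r["individual_id"]
        if same_animal and is_near_duplicate(kept[-1], r):
            continue
        kept.append(r)
    return kept
```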
Entries in the age class, breeding stage, device type, location quality, scientific, common, and abbreviated name, sex, and deployment site columns were validated against controlled vocabularies. Mandatory entries (e.g., deployment date, device type, individual animal identifier) were checked for missing values. When the data contributors could not provide missing deployment dates, the first data point of the track was used as a reference point for deployment. Where animal identifiers were missing, they were created from the tag identifier or file name.
Deployment locations were recorded by the original field team either at the individual-animal level (using e.g., a hand-held GPS device) or at the deployment-site level (i.e., one deployment location per group of animals). The latter was common for deployments at colonies, whereas the former was most common for non-colony deployments (e.g., seals and whales). Where deployment locations were not recorded by the field team, the first location estimate(s) in the tracking data were used. Deployment site names were standardized to colony names wherever possible (e.g., to the beach-on-island level).
Periods at the start or end of deployments were identified and discarded if there was evidence that location data during these periods did not represent the animals' at-sea movement. For example, tags may have been turned on early (thereby recording locations prior to their deployment on animals) or animals may have remained at the deployment site, e.g. the breeding colony, for an extended period at the start or end of the tag deployment. Some tracks also showed a marked deterioration in the frequency and quality (for PTTs) of location estimates near the end of a track. Such locations were visually identified based on maps of each track in conjunction with plots of location distance from deployment site against time. This information is captured in the location_to_keep column appended to each species' raw data file (1 = keep, 0 = discard).
Step 4. Data filtering. Each track in the standardized dataset was visually inspected by the Data Editorial Group, and flagged for removal (using the keepornot column in the metadata file) if location estimates appeared unreasonably noisy relative to the length and extent of the track, and/or the location estimates were very irregular in time.
www.nature.com/scientificdata
Next, automated quality-control checks were used to remove individual deployments that: (1) were flagged for removal (keepornot column in the metadata file); (2) had fewer than twenty location records; or (3) had deployments lasting less than 1 day. Additionally, individual deployments were checked to ensure that: (1) near-duplicate records in PTTs (locations occurring within 2 min of each other) were removed; (2) PTT Argos Z-class locations were reclassified as B-class locations (the least precise Argos location quality class that has an associated error variance 38 ); and (3) locations implying unrealistic travel rates during the preceding time step (over 10 m s⁻¹ for penguins and marine mammals and over 30 m s⁻¹ for flying seabirds) were removed. Note that the definition of "duplicate locations" in the filtering context is more aggressive (less than two minutes vs less than three seconds) than that used during data standardization: for standardization, the intention was to keep the data as close to original as possible, whereas for filtering the presence of multiple positions in a short period of time (less than two minutes) has a negative effect on the filter performance.
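A rough sketch of these automated checks is given below. It is a simplified single forward pass written for illustration, not the published R scripts: the fix layout (`time`, `lon`, `lat`, `location_quality`) is an assumption, speeds are computed against the last retained fix using great-circle distance on a spherical Earth, and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6371000.0
# Speed thresholds (m/s) quoted in the text.
MAX_SPEED_PENGUIN_OR_MAMMAL = 10.0
MAX_SPEED_FLYING_SEABIRD = 30.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two positions in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def passes_deployment_checks(track, keep_flag=True):
    """Reject flagged deployments, those with < 20 fixes, or lasting < 1 day."""
    if not keep_flag or len(track) < 20:
        return False
    return track[-1]["time"] - track[0]["time"] >= timedelta(days=1)


def filter_fixes(track, device_type, max_speed_ms):
    """Per-fix checks on a time-sorted track; returns the retained fixes."""
    kept = []
    for fix in track:
        fix = dict(fix)
        if device_type == "PTT" and fix.get("location_quality") == "Z":
            fix["location_quality"] = "B"  # Z has no associated error variance
        if kept:
            prev = kept[-1]
            dt = (fix["time"] - prev["time"]).total_seconds()
            if dt <= 0 or (device_type == "PTT" and dt < 120):
                continue  # near-duplicate in the filtering sense
            speed = haversine_m(prev["lon"], prev["lat"], fix["lon"], fix["lat"]) / dt
            if speed > max_speed_ms:
                continue  # implies an unrealistic travel rate
        kept.append(fix)
    return kept
```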
A state-space model (SSM) was used to estimate locations at regular time intervals (one hour for GPS data; two hours for Argos data; twelve hours for GLS data) and account for measurement error in the original observations 12,38 . The data were SSM-filtered and subjected to a final quality control where tracks that failed to converge, as judged by nlminb convergence criteria 39 , were re-fitted using different initial values. If re-fitted tracks continued to fail to converge they were removed from the final filtered dataset.
For converged tracks, longitude and latitude residuals were examined for systematic trends indicative of lack of fit. Tracks that failed this inspection were removed from the final filtered dataset.
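To show the general idea of estimating locations at regular intervals while accounting for measurement error, the sketch below runs a one-dimensional random-walk Kalman filter over irregular observations. It is a deliberately simplified stand-in, not the state-space model fitted in RAATD: it treats a single coordinate, assumes Gaussian errors with known variances, filters forward only (no smoothing), and every name and parameter other than the time steps is hypothetical.

```python
# Time steps (hours) quoted in the text for each device type.
TIME_STEP_HOURS = {"GPS": 1.0, "PTT": 2.0, "GLS": 12.0}


def kalman_regularize(times_h, obs, obs_var, step_h, process_var_per_h):
    """Filter one coordinate and report estimates on a regular time grid.

    times_h: ascending observation times in hours
    obs, obs_var: observed values and their error variances
    step_h: spacing of the output grid in hours
    process_var_per_h: random-walk variance added per hour of elapsed time
    Returns (grid_times, means, variances).
    """
    n_steps = int((times_h[-1] - times_h[0]) // step_h)
    grid = [times_h[0] + k * step_h for k in range(n_steps + 1)]
    # Kind 1 (observation) sorts before kind 2 (grid point) at equal times,
    # so a grid estimate uses any observation made at exactly that time.
    events = sorted([(t, 1, i) for i, t in enumerate(times_h)]
                    + [(t, 2, k) for k, t in enumerate(grid)])
    mean, var, now = obs[0], obs_var[0], times_h[0]  # first fix acts as the prior
    means, variances = [], []
    for t, kind, idx in events:
        var += process_var_per_h * (t - now)  # predict: uncertainty grows with time
        now = t
        if kind == 1 and idx > 0:  # update with a new observation
            gain = var / (var + obs_var[idx])
            mean += gain * (obs[idx] - mean)
            var *= 1.0 - gain
        elif kind == 2:  # record the estimate at a grid time
            means.append(mean)
            variances.append(var)
    return grid, means, variances
```

Because each grid estimate carries a variance, long gaps between observations show up as inflated uncertainty instead of being hidden by interpolation.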
Step 5. Data publication. RAATD established a data-sharing and publication agreement with all data providers in 2017. The standardized (trimmed) and filtered data are held in a data repository hosted at the Australian Antarctic Data Centre (AADC) (see details below, in the 'Standardized Data' section). The filtered data are also published, according to the OBIS-ENV guidelines 40 , in international repositories through the SCAR Antarctic Biodiversity Portal (see details below, in the 'OBIS-ENV compliant data' section). For this purpose, and to ensure standardized file structure, secure (meta)data storage and the facilitation of community access to the data (where appropriate), the resulting datasets have been uploaded to the biodiversity.aq IPT instance (Integrated Publishing Toolkit; www.ipt.biodiversity.aq), the accepted route for publishing data to the SCAR Antarctic Biodiversity Portal (www.biodiversity.aq). This should ensure a seamless flow to the Ocean Biogeographic Information System (OBIS) and the Global Biodiversity Information Facility (GBIF).

Data Records
Original data provided by contributors. They are made available as (i) a single metadata file containing a description for each individual in the dataset and (ii) a set of seventeen CSV files, one for each species, which aggregate all of the respective individual location data (Online-only Table 1). Records in the metadata file and the species files can be linked by the common 'individual_id' field, as each animal in the study has a unique identifier. The data and metadata are available to the public through the Australian Antarctic Data Centre: standardized data 41 ; state-space model-processed (filtered) data 42 .
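As a sketch of how the metadata file and a species file might be joined, the example below links rows on `individual_id` using Python's standard `csv` module. Only `individual_id`, `keepornot` and `location_to_keep` are column names taken from this paper; the other column names, the `keep`/`discard` values and the inline sample rows are invented for illustration and should be checked against the actual files.

```python
import csv
import io

# Invented sample content standing in for the real files.
METADATA_CSV = """individual_id,abbreviated_name,keepornot
A1,ADPE,keep
A2,ADPE,discard
"""

SPECIES_CSV = """individual_id,date,decimal_longitude,decimal_latitude,location_to_keep
A1,2010-01-01T00:00:00Z,77.5,-68.6,1
A1,2010-01-01T02:00:00Z,77.9,-68.4,1
A1,2010-01-01T04:00:00Z,77.5,-68.6,0
A2,2010-01-03T00:00:00Z,77.5,-68.6,1
A9,2010-01-05T00:00:00Z,10.0,-60.0,1
"""


def load_tracks(metadata_file, species_file):
    """Group retained fixes by animal and attach each animal's metadata row."""
    meta = {row["individual_id"]: row for row in csv.DictReader(metadata_file)}
    tracks = {}
    for row in csv.DictReader(species_file):
        info = meta.get(row["individual_id"])
        if info is None or info["keepornot"] != "keep":
            continue  # unknown animal, or deployment flagged for removal
        if row["location_to_keep"] != "1":
            continue  # fix trimmed from the start or end of the deployment
        entry = tracks.setdefault(row["individual_id"], {"meta": info, "fixes": []})
        entry["fixes"].append(row)
    return tracks


tracks = load_tracks(io.StringIO(METADATA_CSV), io.StringIO(SPECIES_CSV))
```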

OBIS-ENV compliant data. The standardised data will also be provided as a set of Darwin Core Archives using the Darwin Core Event core (Fig. 1) in compliance with the OBIS-ENV-DATA format 40 . All field definitions (Darwin Core Terms) are available on the Darwin Core website (at: http://rs.tdwg.org/dwc/terms/index.htm#occurrenceindex). The Darwin Core aims to share data about taxa in a simple structured way. It includes a glossary of terms and is primarily based on taxa and their occurrence in nature as documented by observations, specimens, samples, and related information. Documents describing how these terms are managed, how the set of terms can be extended for new purposes, and how the terms can be used can be found on the website.
The OBIS-ENV compliant data are made publicly available through the Antarctic Biodiversity Portal Integrated Publishing Toolkit (http://ipt.biodiversity.aq/resource?r=raatd_scar_trackingdata). The Antarctic Biodiversity Portal acts as the Antarctic thematic node for the Ocean Biogeographic Information System (OBIS, Ant-OBIS) and the Global Biodiversity Information Facility (GBIF, AntaBIF).
Geographic coverage. All species considered in this dataset have circumpolar Antarctic distributions (Fig. 2; species-specific distributions are given in Supplementary Fig. S1) with a longitudinal range spanning 180 °W to 180 °E. The species breed either on the coast of the Antarctic continent or on the sub-Antarctic islands to the north (see Supplementary Table S1 for a list of the main study sites). Species with geographically limited distributions (such as chinstrap penguins Pygoscelis antarcticus) were not included; instead we concentrated on species whose distribution covers a large portion of the Southern Ocean. In addition, a number of deployments in the Antarctic (crabeater seals Lobodon carcinophaga and Weddell seals Leptonychotes weddellii) were conducted in the pack ice at un-named locations. Similarly, humpback whales Megaptera novaeangliae were instrumented at sea either off the coast of the Antarctic Peninsula, off Australia or off New Zealand.
Taxonomic coverage. Seventeen species of meso- and top predators were selected for analyses (Table 1 and Fig. 3): five marine mammals (one baleen whale, one otariid and three phocid seals) and twelve seabirds (five penguins, five albatrosses, and two petrels). These species cover a diverse range of ecological niches and life-history traits and include dietary specialists (e.g., crabeater seals), deep divers (e.g., elephant seal Mirounga leonina and emperor penguin Aptenodytes forsteri), wide-ranging, highly migratory species (e.g., wandering albatross Diomedea exulans), nearshore foragers (e.g., Adélie penguin Pygoscelis adeliae) and capital (e.g., Weddell seal) versus income (e.g., Antarctic fur seal Arctocephalus gazella) breeders. In total, 4,060 individuals were included in the standardized dataset before quality control (some individuals may have been counted more than once in this total, as repeat deployments on the same individuals are not taken into account in this summation), with 1,482 marine mammals and 2,578 seabirds, providing 2,964,245 location fixes before filtering (Table 1). After filtering and quality control processes, the total number of individuals used was 2,823, providing 2,328,772 location fixes (a 21% decrease in the number of location fixes) (Table 1).
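The totals quoted in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable and function names are illustrative.

```python
# Figures quoted in the text (standardized vs. filtered data).
MAMMALS_BEFORE = 1482
SEABIRDS_BEFORE = 2578
INDIVIDUALS_BEFORE = 4060
INDIVIDUALS_AFTER = 2823
FIXES_BEFORE = 2964245
FIXES_AFTER = 2328772


def percent_decrease(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Marine mammals plus seabirds should equal the stated total of individuals.
individuals_sum_matches = MAMMALS_BEFORE + SEABIRDS_BEFORE == INDIVIDUALS_BEFORE
fix_reduction = percent_decrease(FIXES_BEFORE, FIXES_AFTER)
individual_reduction = percent_decrease(INDIVIDUALS_BEFORE, INDIVIDUALS_AFTER)
```

The reduction in location fixes comes to about 21.4%, matching the stated 21%, and the reduction in individuals to about 30.5%, in line with the roughly 30% reduction in tracks reported in the Technical Validation.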
Temporal coverage. The data are not distributed evenly in time (Fig. 3). While the time frame ranges from 1991 to 2016, most data were collected during the period 2007 to 2014. This is a reflection of increased research effort in the Southern Ocean and advances in technology since the early 1990s, and also the timeline of the RAATD project which stopped actively seeking new data inputs in 2016. Further, some data providers primarily contributed older datasets that were already published or soon to be published, rather than unpublished data.
The lack of even coverage in terms of taxa, space and time is a function of several factors. First, some deployments were mostly conducted during the breeding season when species like Adélie penguins make relatively short duration (2-14 days) and local (10-200 km) foraging trips, compared with post-moulting southern elephant seals that make distant (several thousands of km) and longer duration foraging trips (many weeks). Second, the coverage reflects the research effort related to funding and logistics and, third, the availability of species that lend themselves to instrumentation (e.g., central place foragers). For instance, crabeater seals are very abundant but because they inhabit pack ice they are logistically very difficult to capture for tracking studies. In the case of humpback whales, long-term attachments of tracking equipment are relatively difficult to attain, so fewer data are available. The technology to track the smaller flying bird species is also comparatively new, relying until recently upon small archival light loggers (see above), so there have been relatively few studies of these species to date.

Technical Validation
The standardized data were subjected to a range of quality checks before undertaking further processing (Fig. 1). These included:
• Counts of unique deployment positions and longitude/latitude variability were calculated for each dataset, and used as a check for errors in deployment position.
• The distance from the recorded deployment position to the first few track points was calculated, and any distances greater than 10 km were flagged for manual inspection and verification. Similarly, the deployment date was compared to the date of the first point of the track, and differences were flagged for manual verification.
• Various cross-checks were conducted to identify other data errors or discrepancies, including checking for multiple device identifiers associated with a single individual animal identifier, checking for identical individual identifiers on different species or in different datasets, checking that redeployed devices (i.e. the same device deployed on multiple individuals) did not have temporal overlap, and checking for data missing from the 29th of February of leap years (perhaps indicating data that had been discarded by accident).
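Two of the cross-checks listed above, temporal overlap of redeployed devices and missing leap-day data, can be sketched as follows. This is an illustrative outline under assumed inputs (deployments as `(device_id, individual_id, start, end)` tuples and fixes as `datetime` objects), not the checks as implemented by the authors.

```python
import calendar
from datetime import date, datetime


def overlapping_redeployments(deployments):
    """Return (device_id, individual_a, individual_b) for overlapping reuse.

    deployments: iterable of (device_id, individual_id, start, end) tuples.
    """
    by_device = {}
    for dep in deployments:
        by_device.setdefault(dep[0], []).append(dep)
    clashes = []
    for device_id, deps in by_device.items():
        deps.sort(key=lambda d: d[2])
        latest = deps[0]  # deployment with the latest end time seen so far
        for cur in deps[1:]:
            if cur[2] <= latest[3] and cur[1] != latest[1]:
                clashes.append((device_id, latest[1], cur[1]))
            if cur[3] > latest[3]:
                latest = cur
    return clashes


def missing_leap_days(fix_times):
    """Return leap years spanned by a track that has no fix on 29 February."""
    days = {t.date() for t in fix_times}
    first, last = min(days), max(days)
    missing = []
    for year in range(first.year, last.year + 1):
        if calendar.isleap(year):
            leap_day = date(year, 2, 29)
            if first < leap_day < last and leap_day not in days:
                missing.append(year)
    return missing
```

Both functions only flag candidates; as in the text, flagged cases would still need manual verification.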
A number of additional quality-control checks were implemented prior to the data filtering; these are described in the Methods (Steps 3-4). State-space models (SSMs) are now the standard approach for dealing with observation errors in electronic tagging location data 38,43 . The SSM filtering protocol that was applied to all the data provided essential quality control and validation. It was a variation of the approach used by Jonsen et al. 38 , implemented in the statistical computing language R 44 via the Template Model Builder package (TMB package) 45 . The TMB package provides extremely fast and stable maximum likelihood estimation, via automatic differentiation and the Laplace approximation, for non-Gaussian and nonlinear SSMs 46 . This was essential for filtering the large amount of tracking data compiled herein.
The SSM filtering accounted for observation errors in the tracking data and, unlike the raw track data, provided location estimates and standard errors at regular time intervals along estimated tracks 38,47 . These location (error-filtered and time-regularised) outputs are essential for determining species' habitat preferences from tracking data (see Usage Notes) and other types of ecological inferences. We encourage users of these filtered outputs to evaluate the level of uncertainty in the estimated locations for their ecological inferences, as our methods for filtering are specific for our purposes. Example filtering code is provided so that users can reproduce our filtered data from the raw data or produce a new set of filtered data using, for example, different time steps (https://github.com/SCAR/RAATD).
Following SSM filtering, estimated tracks were evaluated for goodness of fit by examination of (1) maps of estimated and observed locations and (2) residual plots of latitude and longitude. Tracks associated with obviously poor fits to the data, unrealistic estimated movements and frequent extended periods without observations (relative to the step length duration) were discarded from the final output dataset. This examination was conducted independently by three people, and an estimated track was discarded when at least two examiners agreed that it should be. In cases where the optimisation algorithm (nlminb in R) failed to converge, up to ten attempts with different initial values were made in an effort to obtain convergence. Tracks for which convergence could not be obtained were discarded from the final output dataset. Combined, these quality control and validation procedures accounted for a 30% reduction in the number of individual tracks retained in the filtered data compared with the standardized data (Table 1).
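The two decision rules in this paragraph, retrying a fit from different initial values and discarding a track on a two-of-three vote, can be expressed compactly. The sketch below is schematic: `fit` stands for any user-supplied fitting routine that returns a `(converged, result)` pair, and is a placeholder, not the nlminb-based routine used in RAATD.

```python
def fit_with_restarts(fit, initial_values, max_attempts=10):
    """Try successive initial values until the fit converges.

    Returns the first converged result, or None if no attempt converged
    within `max_attempts` (in which case the track would be discarded).
    """
    for attempt, start in enumerate(initial_values):
        if attempt >= max_attempts:
            break
        converged, result = fit(start)
        if converged:
            return result
    return None


def should_discard(votes, min_agree=2):
    """Discard a track when at least `min_agree` examiners vote to discard."""
    return sum(1 for vote in votes if vote) >= min_agree
```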
The SSM-filtered data are affected by several caveats. First, the standardised GPS, PTT and GLS data were filtered using time steps of 1, 2 and 12 h, respectively. These time steps were chosen as they are generally appropriate relative to the typical sampling frequencies of the three tag types. In some cases, these time steps did not match well with the sampling frequency of particular tags. For example, GPS tags deployed on some birds had far higher sampling frequencies and a 1-h time step may be too coarse in these cases. Second, for GLS data the time period around the equinoxes (approximately four weeks each) yields suspect latitude estimates. The SSM filter does not fully account for this uncertainty. Third, for animals carrying tags programmed to turn off when hauled out on land or ice, the SSM-estimated locations imply movement looping beyond and back to these haul-out sites when the tags are off. These estimates are clearly spurious. Fourth, tag-sampling frequency often declines toward the end of long deployments. Despite some influence on SSM-estimated locations near the end of these deployments, these data were retained in RAATD.
Table 1. Count of the number of individuals and location fixes by species included in the RAATD project initially (standardized data) and following the cleaning and filtering processes (filtered data). Note, in some cases (e.g., emperor penguin) the number of fixes of the filtered data is greater than the original number of fixes in the raw data due to a high prevalence of tag duty-cycling (tags working non-continuously, e.g., recording 12 h during the day and being turned off for 12 h at night) or due to periods when no location fixes were recorded but which were interpolated by the state-space model. *Royal penguins have a limited geographic distribution, but they can be considered ecologically equivalent to Macaroni penguins where they occur, and these two species will be considered together in RAATD's further analyses.

Usage Notes
Thanks to an unprecedented sharing effort from the SCAR EG-BAMM community, a benchmark dataset has been assembled that fills important gaps in spatial occurrence of various species for areas of the world that are traditionally data-poor. The dataset compiled for RAATD is used in the analytical project described in the Background section to determine Areas of Ecological Significance for the 17 species of predators considered in the dataset. To this end, a habitat selectivity procedure is one possible modelling method, aiming to identify the particular environmental conditions that are favoured by the animals, relative to the range of conditions that are available. This first requires estimation of the geographic space available to a given animal, which can be assessed using various methods, e.g. 48 . This region of geographic space has an associated range of environmental conditions over the period in which animals were tracked. Regression modelling can then be used to identify the environmental covariates that discriminate areas that are preferentially utilized. For particular analyses, tracking data may need to be subdivided, for example by breeding stage, depending on whether or not the animals' interactions with the environment differ by breeding stage. The individual habitat preference models can then be combined to provide a multi-species view of important regions of habitat including their underlying environmental processes, e.g. 17,18 . Following this analysis and production of a scientific article, the dataset will be available for re-use to help address emerging research questions or pressing conservation issues.
The SCAR EG-BAMM is pleased to make this dataset openly available for the Antarctic and broader scientific communities. It is organized and curated using current best principles and practices in biodiversity informatics 49 . In this framework, the final version of the dataset is fully compliant with the Darwin Core body of standards and can be downloaded through the Global Biodiversity Information Facility (GBIF) and Ocean Biogeographic Information System (OBIS) data portals.
In regard to the use of the dataset, the RAATD consortium promotes the CC-BY licence (Creative Commons Attribution License), the standard practice for citing GBIF-mediated data, in the belief that it reflects an established norm that the communities we serve use to cite original work. Users are expected to comply with the guidelines of the SCAR/SCADM Data Policy (https://www.scar.org/scar-library/reports-and-bulletins/scar-reports/2717-scar-report-39/file/), to recognize the valuable contributions of data providers (generally scientists who collect, synthesise, model, or prepare analysed data) and to facilitate repeatability of research results. Users of SCAR data should communicate with and formally acknowledge data authors (contributors) and sources; refer to Data Contacts and Citations 23 for specific citations. Where possible, this acknowledgment should take the form of a citation, such as when citing a book or journal article.

Code availability
The code for (i) trimming the raw tracks and (ii) the state-space filtering has been made available on the SCAR GitHub page (https://github.com/SCAR/RAATD). Additional information is provided in the Technical Validation section.