Whales from space dataset, an annotated satellite image dataset of whales for training machine learning models

Monitoring whales in remote areas is important for their conservation; however, using traditional survey platforms (boat and plane) in such regions is logistically difficult. The use of very high-resolution satellite imagery to survey whales, particularly in remote locations, is gaining interest and momentum. However, the development of this emerging technology relies on accurate automated systems to detect whales, which are currently lacking. Such detection systems require access to an open source library containing examples of whales annotated in satellite images to train and test automatic detection systems. Here we present a dataset of 633 annotated whale objects, created by surveying 6,300 km² of satellite imagery captured by various very high-resolution satellites (i.e. WorldView-3, WorldView-2, GeoEye-1 and Quickbird-2) in various regions across the globe (e.g. Argentina, New Zealand, South Africa, United States, Mexico). The dataset covers four different species: southern right whale (Eubalaena australis), humpback whale (Megaptera novaeangliae), fin whale (Balaenoptera physalus), and grey whale (Eschrichtius robustus).


Background & Summary
Very high-resolution (VHR) satellite imagery allows regular surveys of remote and large areas of the ocean that are difficult to access by boat or plane. Interest in using VHR satellite imagery for the study of great whales (including sperm whales and baleen whales) has grown in recent years 1-5 since Abileah 6 and Fretwell et al. 7 showed its potential. This growing interest may be linked to the improvement in the spatial resolution of satellite imagery, which increased in 2014 from 46 cm to 31 cm. This upgrade enhanced the confidence in the detection of whales in satellite imagery, as more details could be seen, such as whale-defining features (e.g. flukes).
Detecting whales in the imagery is conducted either manually 1,4,5,7 or automatically 2,3. A downside of the manual approach is that it is time-demanding, with manual counters often having to view hundreds and sometimes thousands of square kilometres of open ocean. The development of automated approaches to detect whales by satellite would not only speed up this application, but also standardize the procedure and reduce the possibility of missing whales due to observer fatigue. Various automated approaches exist, from pixel-based methods to artificial intelligence. Machine learning, an application of artificial intelligence, seems to be the most appropriate automated method to detect whales efficiently in satellite imagery 2,3,8,9. In machine learning, an algorithm learns how to identify features by repeatedly testing different search parameters against a training dataset 10,11. For whales, the algorithm needs to be trained to detect the wide variety of shapes and colours characterising whales. Shapes and colours will be influenced by the species, the environment (e.g. various degrees of turbidity), the light conditions, and the behaviours (e.g. foraging, travelling, breaching), as different behaviours will result in different postures. The larger a training dataset is, the more accurate and transferable to other satellite images the algorithm will be. At the time of writing, such a dataset does not exist or is not publicly available.
Creating a dataset large enough to train algorithms to detect whales in VHR satellite imagery will require the various research groups analysing VHR satellite imagery to openly share examples of whales and non-whale objects in VHR satellite imagery. This could be facilitated by uploading such data to a central open source repository, similar to GenBank 12 for DNA sequences or OBIS-Seamap 13 for marine wildlife observations. Ideally, clipped-out image chips of the whale objects would be shared as tiff files, which retain most of the characteristics of the original image. However, all VHR satellites are commercially owned, except for Cartosat-3, owned by the government of India 14, which means it is not possible to publicly share image chips as tiff files. Instead, image chips could be shared in png or jpeg format, which involves losing some spectral information. If tiff files are required, georeferenced and labelled boxes encompassing the whale objects could also be shared, including information on the satellite imagery to allow anyone to ask the commercial providers for the exact imagery.
www.nature.com/scientificdata
Here we present a database of whale objects found in VHR satellite imagery. It represents four different species of whales (i.e. southern right whale, Eubalaena australis; grey whale, Eschrichtius robustus; humpback whale, Megaptera novaeangliae; fin whale, Balaenoptera physalus; Fig. 1), which were manually detected in images captured by different satellites (i.e. GeoEye-1, Quickbird-2, WorldView-2, WorldView-3). We created the database by (i) detecting whale objects manually in satellite imagery; (ii) classifying whale objects as either "definite", "probable" or "possible" as in Cubaynes et al. 1; and (iii) creating georeferenced and labelled points and boxes centred around each whale object, as well as providing image chips in png format. With this database made publicly available, we aim to initiate the creation of a central database that can be built upon.

Methods
Image acquisition. Twelve satellite images were used to build the database. They were acquired by different very high-resolution satellites owned by Maxar Technologies, formerly known as DigitalGlobe (Table 1). The imagery was either chosen in connection with other projects 1,3,7,8 or specifically acquired to enlarge the database. Some images were selected from Maxar Technologies' archives 15 and others were requested to be captured during a specific time window (see "Usage Notes" section for advice about access to satellite images).
Criteria to select the imagery were: 1) less than 20% cloud cover, 2) calm sea state (i.e. no white caps and low swell), and 3) a location and time at which only one species was known to be present. The percentage of cloud coverage was assessed by the satellite imagery provider. We visually assessed the sea state for the presence of white caps and the level of swell. As it is currently unknown whether species could be differentiated in VHR satellite images, we selected well studied locations to ensure the presence in the imagery of only one great whale species.
Detecting whales. The satellite images were manually scanned for the presence of whales using ArcGIS 10.4 ESRI 2017, following the systematic method of Cubaynes et al. 1, which involved overlaying a grid on top of the imagery and scanning one cell after the other at a scale of 1:1,500. Prior to scanning, the imagery was pansharpened, a process that combines the high spatial resolution of the panchromatic image (grey scale image) with the high spectral resolution of the multispectral image (colour image) to produce one image of high spatial and spectral resolution. We used the ESRI pansharpening algorithm.
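The systematic scan described above can be pictured as a walk over grid cells. The following Python sketch is only an illustration of that idea, not the ArcGIS procedure we used: the function name `grid_cells`, the extent and the cell size are hypothetical, and the coordinates are assumed to be in a projected system with metre units.

```python
from typing import Iterator, Tuple

Cell = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def grid_cells(xmin: float, ymin: float, xmax: float, ymax: float,
               cell_size: float) -> Iterator[Cell]:
    """Yield cells covering the extent, row by row, starting at the top-left.

    Cells on the right and bottom edges are clipped to the extent.
    The cell size is a hypothetical parameter chosen by the analyst.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    top = ymax
    while top > ymin:
        bottom = max(top - cell_size, ymin)
        left = xmin
        while left < xmax:
            right = min(left + cell_size, xmax)
            yield (left, bottom, right, top)
            left = right
        top = bottom


# Hypothetical 250 m x 150 m extent scanned with 100 m cells.
cells = list(grid_cells(0, 0, 250, 150, 100))
for number, cell in enumerate(cells, start=1):
    print(number, cell)
```

Visiting the cells in a fixed order is what makes the scan systematic: every part of the image is viewed once, and the analyst can stop and resume at a known cell.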
Whale objects were marked with a point and were subsequently assigned a level of confidence as explained below in the "Technical Validation" section.

Creating labelled and georeferenced points and boxes.
For each detected whale, a point was placed on it with associated metadata (see Data records). Boxes were created around each point indicating a whale object using ArcGIS 10.4 ESRI 2017, following the workflow illustrated in Fig. 2. We created square boxes with side lengths that are a power of 2 (i.e. 128 × 128 pixels) to facilitate their use in machine learning approaches, particularly deep learning algorithms. Each whale object was represented by a point and a box (delimiting the pixels in the pansharpened image). The boxes created around the whale objects were saved as one shapefile (a georeferenced file) per satellite image, as the coordinate system varied from one image to the next (Table 2); the same was done for the points. The exception was the four satellite images of the Pelagos Sanctuary, for which a single box shapefile and a single point shapefile were created for all four images.
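The geometry of a box centred on a point can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ArcGIS workflow of Fig. 2: the function name `box_around_point` is hypothetical, the image is assumed to be north-up in a projected coordinate system with metre units, and the pixel size in the example is a made-up value.

```python
from typing import Tuple


def box_around_point(x: float, y: float, pixel_size: float,
                     n_pixels: int = 128) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) of a square box centred on (x, y).

    The box spans n_pixels * pixel_size map units on each side.
    Assumes a north-up image in a projected coordinate system.
    """
    half_side = n_pixels * pixel_size / 2.0
    return (x - half_side, y - half_side, x + half_side, y + half_side)


# Hypothetical whale point and a hypothetical 0.5 m pixel size.
box = box_around_point(500000.0, 5000000.0, pixel_size=0.5)
print(box)
```

For a 0.31 m pixel, a 128 × 128 pixel box would span 128 × 0.31 = 39.68 m on each side.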
Creating image chips. Image chips were created using the boxes created in the above section, following the workflow presented in Fig. 3. Prior to creating the image chips for Valdes 2012 and 2016, the corresponding box shapefiles and satellite images had to be re-projected to WGS 1984 UTM Zone 20 S. We did the same for Auckland 2006 and 2011, using WGS 1984 UTM Zone 58 S. The file name of each image chip corresponds to the box ID of the respective box, so the associated data, including the specific raw satellite image that the chip came from, can be found in the corresponding box shapefile.
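Because the chip file name matches the box ID, the record for a chip can be retrieved with a simple lookup. The Python sketch below assumes the attribute table has been exported to csv; the column names (`BoxID`, `ImageryID`, `Certainty`), the file names and the sample rows are all made up for illustration and should be replaced with the names given in the guidance csv file.

```python
import csv
import io
from pathlib import Path


def box_record_for_chip(chip_path, rows, id_column="BoxID"):
    """Return the attribute row whose box ID equals the chip file name.

    The file extension is ignored. `id_column` is a hypothetical column
    name. Returns None when no row matches.
    """
    box_id = Path(chip_path).stem
    for row in rows:
        if row.get(id_column) == box_id:
            return row
    return None


# Made-up stand-in for an exported attribute table.
sample_csv = io.StringIO(
    "BoxID,ImageryID,Certainty\n"
    "box_001,image_A,definite\n"
    "box_002,image_A,possible\n"
)
rows = list(csv.DictReader(sample_csv))

record = box_record_for_chip("chips/box_002.png", rows)
print(record)
```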
Future updates of the datasets. As we acquire and analyse more satellite imagery, we aim to annually update the Whales from Space dataset. The updates will be available under the Whales from Space dataset deposited on the NERC Polar Data Centre repository 16,17 to ensure consistency and long-term public availability of the data.

Data records
The "Whales from space dataset" is available on the NERC UK Polar Data Centre repository and separated into two sub-datasets: a dataset that contains the whale annotations (box and point shapefiles with associated csv files) named "Whales from space dataset: Box and point shapefiles" 16; and a dataset with the image chips named "Whales from space dataset: Image chips" 17. The "Whales from space dataset: Box and point shapefiles" dataset can be accessed on the NERC UK Polar Data Centre directly using the DOI link (https://doi.org/10.5285/C1AFE32C-493C-4DC7-AF9F-649593B97B2C). This dataset contains nine shapefiles with boxes centred on each whale and nine point shapefiles marking each individual detected whale (Table 2), totalling 633 annotated whale objects (Table 3 and Fig. 4). This dataset also includes four csv files: 1) a csv file joining all the attribute tables linked to every box and point shapefile for whale objects (WhaleFromSpaceDB_Whales.csv); 2) a second csv file explaining each column of the first csv file (WhaleFromSpace_Guidance.csv); and 3) two further csv files describing the naming of each box (WhaleFromSpaceDB_BoxNaming.csv) and each point (WhaleFromSpaceDB_PointNaming.csv).
The "Whales from space dataset: Image chips" dataset comprises the 633 annotated whale objects as image chips. To fulfil the End User Licence Agreement with Maxar Technologies 18, these image chips are shared in png format, and access to the dataset is available upon request from the NERC UK Polar Data Centre, which can be contacted at PDCServiceDesk@bas.ac.uk. Data access requires a user name and email address, which will be shared with Maxar Technologies. Anyone using any of the image chips is also required to attribute the images properly (see Usage Notes).
Each box and point has associated metadata, which is included in the attribute table of the corresponding shapefile. It contains information about the detected whale: certainty level (i.e. "definite", "probable", "possible") derived from the classification score assessed based on various criteria (i.e. body length, body width, body shape, body colour, flukeprint, blow, contour, wake, after-breach, defecation, other disturbance, fluke, flipper, head callosities and mudtrail) following the method of Cubaynes et al. 1, most likely species, and potential other species. For each annotated whale, we also provide information about the imagery analysed: the location, latitude and longitude (in decimal degrees and recorded using the same geographic coordinate system and projection as the satellite imagery), imagery ID, imagery date, type of satellite, spatial resolution, number of multispectral bands, product level and type (e.g. Standard2A). The size of each box is also specified in pixels.

Technical Validation
Certainty of whale identification. Ground truthing, the process of verifying on the ground what is observed in a satellite image 19, is not possible when attempting to detect mobile objects such as whales, because whales visible in the imagery will have moved by the time the imagery is received by the customer for analysis, which can take from six hours to a couple of days. Alternatives have been tried, such as timing the collection of a satellite image with a boat or aerial survey 20,21. However, it is difficult to synchronise the acquisition of a satellite image with such surveys, due to several factors, including competing tasking, where satellite image orders for defence and disaster relief take priority over other orders. This is currently relevant as only one very high-resolution satellite can acquire 30 cm resolution imagery. The presence of clouds is also a limiting factor, as it will prevent the detection of whales in satellite images but not impact the detection capabilities from a boat survey 20,21. There has also been an attempt to match whales tagged with tracking devices to those observed in the imagery, but the low accuracy of the coordinates provided by the tracking devices fixed on whales did not permit this matching 8. With this dataset, to assess our confidence that the object observed was a whale, 1) we analysed images of well surveyed areas, where only one species was recorded at a specific time 1; and 2) we established a certainty level reflecting our confidence in the detection. As whales will not always be at the sea surface and as light is attenuated with increasing depth, whales below the surface will not be as visible as those near the surface, for which characteristic whale features can be observed (e.g. fluke, flipper). Recognising that some whale objects will be easier to detect than others, we created three levels of confidence (i.e. definite, probable, and possible). The certainty level was assigned based on a combination of criteria 1. We recommend that only the whales with a "definite" certainty level be used to train automated detection systems.
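Selecting only the "definite" whale objects for training amounts to a filter on the certainty column. The Python sketch below shows one way to do this with the standard library; the column names `BoxID` and `Certainty` and the sample rows are assumptions made for illustration, not the actual headers or contents of the csv files.

```python
import csv
import io


def select_by_certainty(rows, levels=("definite",), column="Certainty"):
    """Keep only the rows whose certainty level is one of `levels`.

    Matching ignores letter case and surrounding spaces. `column` is a
    hypothetical column name.
    """
    wanted = {level.strip().lower() for level in levels}
    return [row for row in rows
            if (row.get(column) or "").strip().lower() in wanted]


# Made-up stand-in for the whale annotation table.
sample_csv = io.StringIO(
    "BoxID,Certainty\n"
    "box_001,definite\n"
    "box_002,probable\n"
    "box_003,possible\n"
    "box_004,Definite\n"
)
rows = list(csv.DictReader(sample_csv))

training_rows = select_by_certainty(rows)
print([row["BoxID"] for row in training_rows])
```

Passing `levels=("definite", "probable")` would widen the selection, at the cost of including less certain detections.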

Species differentiation.
As species differentiation has not been tested when analysing satellite images, we reference the most likely species in this database. The most likely species was assigned based on the scientific literature, hence our decision to acquire images of specific areas when only one large whale species was expected to be present 1.

Usage Notes
Correct attribution for satellite images. Anyone using any of the image chips is required to attribute the image chips as follows: "Satellite image © 2022 Maxar Technologies".

Table 3. Summary of the number of whale objects counted in the imagery. See Table 1 for more details about the satellite imagery.

Fig. 4
Proportion of whale objects included in the database per species (top to bottom: southern right whale, humpback whale, fin whale and grey whale) and per certainty categories ("definite", "probable", and "possible"). The proportion is given separately for each satellite image analysed in this study (Table 1).
Advice on getting access to satellite images. All the satellite images that we have used to build the dataset were provided by Maxar Technologies (formerly DigitalGlobe). We recommend contacting your national Maxar Technologies office to enquire about acquisition and cost, as pricing is determined case by case. To ensure you acquire the same satellite images that we created the boxes for, we have provided the Catalogue ID in Table 1. All the images we have used are now considered archival and are accessible at a lower cost. There are different product types and levels for the same satellite image, and we recommend acquiring the satellite images with the same product level and type, as specified in Table 1. Acquiring a different product level or type may shift the image, meaning the whale-object boxes will not be centred on the whales they were created for.
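Whether a newly acquired or differently processed image is shifted relative to a reference can be checked by comparing a small window of both images. The Python sketch below is a minimal brute-force illustration and not a procedure we used: the function names are hypothetical, the images are plain nested lists of single-band values, and only whole-pixel shifts up to `max_shift` are tested.

```python
from typing import List, Tuple

Image = List[List[float]]


def shift_image(image: Image, d_row: int, d_col: int) -> Image:
    """Return a copy of `image` moved by (d_row, d_col), zero-filled."""
    n_rows, n_cols = len(image), len(image[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            rr, cc = r + d_row, c + d_col
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                out[rr][cc] = image[r][c]
    return out


def estimate_shift(reference: Image, shifted: Image,
                   max_shift: int = 3) -> Tuple[int, int]:
    """Return the (d_row, d_col) that best maps `reference` onto `shifted`.

    Every whole-pixel offset within +/- max_shift is tried, and the one
    with the lowest mean squared difference over the overlap is kept.
    """
    n_rows, n_cols = len(reference), len(reference[0])
    best, best_error = (0, 0), float("inf")
    for d_row in range(-max_shift, max_shift + 1):
        for d_col in range(-max_shift, max_shift + 1):
            total, count = 0.0, 0
            for r in range(n_rows):
                for c in range(n_cols):
                    rr, cc = r + d_row, c + d_col
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        diff = reference[r][c] - shifted[rr][cc]
                        total += diff * diff
                        count += 1
            if count and total / count < best_error:
                best, best_error = (d_row, d_col), total / count
    return best


# Synthetic 10 x 10 window with a small bright, asymmetric feature.
reference = [[0.0] * 10 for _ in range(10)]
reference[4][4], reference[4][5], reference[5][4] = 9.0, 5.0, 3.0
moved = shift_image(reference, 1, 2)

print(estimate_shift(reference, moved, max_shift=2))
```

A result other than (0, 0) on real data would suggest that the boxes need to be offset by that number of pixels before being used with the new image.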

Code availability
We used ArcGIS 10.4 ESRI 2017 to analyse the satellite images and create the boxes. ArcGIS 10.6 ESRI 2017 can also be used. Various pansharpening algorithms exist 22. As we used the ESRI pansharpening algorithm, we recommend using it. The Gram-Schmidt algorithm is often preferred when monitoring wildlife from space 23; however, we have found that it may sometimes shift the pansharpened image relative to the panchromatic and multispectral images. Therefore, if a pansharpening algorithm other than ESRI is used, we recommend testing that it does not shift the image, or at least determining by how many pixels it has shifted the image.

Acknowledgements
This work was supported by an Innovation Voucher from the British Antarctic Survey and a grant from NC-International NERC (NE/T012439/1). We are thankful to Ellen Bowler for her advice on the best format of the boxes, so that this database is useful for machine learning. We are also grateful for the insightful knowledge of the teams of machine learning experts from the GAIA (Geospatial Artificial Intelligence for Animals) and the GSTS smartWhales projects, and the Cambridge Image Analysis and the AI for the study of Environmental Risk research groups from the Department of Applied Mathematics and Theoretical Physics at the University of Cambridge, which used and confirmed the application of these datasets to machine learning.