An open access dataset for developing automated detectors of Antarctic baleen whale sounds and performance evaluation of two commonly used detectors

Since 2001, hundreds of thousands of hours of underwater acoustic recordings have been made throughout the Southern Ocean south of 60° S. Detailed analysis of the occurrence of marine mammal sounds in these circumpolar recordings could provide novel insights into their ecology, but manual inspection of the entirety of all recordings would be prohibitively time-consuming and expensive. Automated signal processing methods have now developed to the point that they can be applied to these data in a cost-effective manner. However, training and evaluating the efficacy of these automated signal processing methods still requires a representative annotated library of sounds that identifies the true presence and absence of different sound types. This work presents such a library of annotated recordings for the purpose of training and evaluating automated detectors of Antarctic blue and fin whale calls. Creation of the library has focused on the annotation of a representative sample of recordings to ensure that automated algorithms can be developed and tested across a broad range of instruments, locations, environmental conditions, and years. To demonstrate the utility of the library, we characterise the performance of two automated detection algorithms that have been commonly used to detect stereotyped calls of blue and fin whales. The availability of this library will facilitate the development of improved detectors for the acoustic presence of Southern Ocean blue and fin whales. It can also be expanded upon to facilitate standardisation of subsequent analyses of spatiotemporal trends in call density of these circumpolar species.

Automated detection of whale sounds. The volume of existing and incoming acoustic data far exceeds the capacity of human expert analysts to inspect it manually, and as a result automated algorithms have been relied upon to determine the presence of marine mammal sounds in the recordings. Ecological results from long-term analyses have been reported in the form of presence (e.g. months, days, or hours of recordings with call presence) or as estimates of call numbers per time period (e.g. see studies listed in Table 1). However, the results are not easily comparable because different studies had different data collection protocols and employed different analytical techniques, neither of which has been standardised. Furthermore, robust measures of bias and variability, which can be dataset-specific, are not always reported alongside results (Table 1).
A variety of automatic detection algorithms have been used to detect the calls of blue whales and fin whales. Algorithms to detect stereotyped calls of these species include matched filters 32-34, energy detectors 35, and subspace projection detectors (blue whales only) 26,36. The most widely used algorithm, however, has been spectrogram correlation 37, which has been implemented in a variety of software packages 38-40 and applied to a wide variety of datasets 16-20,30,41-44. Spectrogram correlation is similar to matched filtering except that it acts on the spectrogram rather than purely in the time or frequency domain; instead of cross-correlating a time series or spectrum, it correlates an image template, or kernel, pixel-by-pixel with the spectrographic data of interest.
Factors that affect detector performance. Three main factors can impact the performance of an automated detection algorithm: the acoustic properties of the recording site, variability in the signals being detected, and variability in the characteristics of the recording system. The acoustic recordings from the Southern Ocean span a wide geographic and temporal range and encompass a variety of environments; thus characteristics of the recording site (e.g. propagation loss and noise levels) are expected to be both site- and time-specific 43,46.
In addition to site-specific features, the properties of blue and fin whale sounds can change over time and space. Sounds from most blue whale populations have changed slowly and in a predictable manner since they were first described in the 1970s 14,47-49. On top of the well-documented year-to-year decreases in the tonal frequency of sounds, predictable intra-annual changes have also been observed 14,48,50. There is some evidence that the properties of fin whale sounds vary geographically in the Antarctic 15,18, and they have been found to vary temporally in other oceans 33,51,52. While these changes may seem small and/or occur over long time periods, they must nevertheless be accounted for when using automated detection algorithms to detect trends in long-term and widely dispersed datasets 53.
Lastly, the acoustic recordings around the Antarctic have been made with a variety of instruments. These include: Scripps Acoustic Recording Packages (ARP); Multi-Electronique Autonomous Underwater Recorders for Acoustic Listening (AURAL); Australian Antarctic Division Moored Acoustic Recorders (AAD-MAR); Develogic Sono.Vaults; and Pacific Marine Environmental Laboratory Autonomous Underwater Hydrophones (PMEL-AUH). Different instruments may have different capabilities, including depth rating, system frequency response, and duty cycle requirements, and these further affect the performance of an automated detector 54. The duty cycle of an instrument, for example, is known to affect the accuracy of predicting the presence of Antarctic blue whales in addition to the call rate 55. Additionally, the depth of the recorder is expected to change the detection range and noise levels observed at a recorder 4.
Here we create and document an open access set of recordings collected around the Antarctic and manual annotations of blue and fin whale call occurrences in a subset of those recordings. This dataset takes the form of an "annotated library" of Antarctic underwater sound recordings. We demonstrate how the library can be used to evaluate the performance of automated detectors over the variety of recording scenarios contained within the library. We also suggest methods to help standardise the reporting of results with a view towards facilitating long-term comparisons of PAM studies of baleen whales around Antarctica.

Methods
Towards a representative circumpolar dataset. Our annotated library contains data from four geographic regions: the Atlantic, Pacific, and Indian sectors of the Southern Ocean and the Western Antarctic Peninsula (WAP; Fig. 1). In each region, we identified sites that had at least a full year of data from 2014 or 2015, and ideally two consecutive years. When two consecutive years were not available, another year from the same site was included or two different sites were selected. The Indian sector also included data from 2005 to increase the temporal span of the library, as well as a second location with data from 2014 and 2017.

Table 1. Previous analyses of long-term datasets that have used automated algorithms to detect calls and report spatial distribution and/or temporal occupancy of Antarctic blue whales. a False positive rate reported only for months when there were more than 500 calls detected. b False positive detections were from a different detector operating in an adjacent frequency band with a similar, frequency-adjusted, spectrogram correlation kernel. c True positive rates for this detector for high, medium, and low signal-to-noise ratio (SNR) calls and a variety of interfering noises were reported by Socheleau et al. 2015, but the prevalence of these conditions within the full dataset is not indicated. d False negative rate reported as a percentage of total uncorrected detections for a 20% subset of days without automated detections.

Subsampling from each dataset. Moorings in the Antarctic are typically recovered and serviced at most once a year due to their remote locations, potentially long periods of ice cover, and reduced or negligible access during the Antarctic winter. We therefore define a site-year as a recording from a single instrument and site that is approximately a year in duration. A subset of approximately 200 h of data was selected from each site-year for annotation.
This number of hours was chosen a priori and was constrained by budgetary limits, but it was believed to be a reasonable trade-off among analyst time, maintaining adequate sample sizes within each site-year, and annotating a sufficient number of different site-years. For each site-year a systematic random subsampling scheme was used to generate a representative set of acoustic recordings from the larger dataset. The systematic random subsampling scheme consisted of:

1. Splitting the dataset into "chunks" of time. The optimal length of a time chunk will be species- and study-specific. For the annotated library, time was split into mostly hour-long chunks with some exceptions.
2. Calculating the spacing between chunks, t_s, to ensure that the desired sample size of time chunks was created and that there was broad representation of hours of the day across all chunks. For the annotated library, spacing was calculated such that there were at least 150, and usually nearer to 200, annotated periods between the first and last available chunks.
3. Picking a random number between 1 and t_s to determine the starting chunk (this was the random element of the subsampling scheme).
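As a minimal sketch, the three steps above might be implemented as follows. The function name, the one-year record length, and the target of ~200 chunks are illustrative; this is not the code used in the original analysis.

```python
import random

def systematic_subsample(total_hours, target_chunks=200, seed=None):
    """Sketch of the systematic random subsampling scheme described above.

    Returns the start hours (offsets into a site-year of `total_hours`) of
    hour-long chunks.
    """
    rng = random.Random(seed)
    # Step 2: spacing t_s between chunks so that roughly `target_chunks`
    # fit between the first and last available chunks. A spacing that is
    # not a multiple of 24 lets start times drift through the 24 h cycle.
    t_s = max(1, total_hours // target_chunks)
    # Step 3: a random starting chunk between 1 and t_s (the only random
    # element of the scheme).
    start = rng.randint(1, t_s)
    return list(range(start, total_hours, t_s))

chunks = systematic_subsample(total_hours=365 * 24, target_chunks=200, seed=42)
```

For a full year of hourly data this yields roughly 200 evenly spaced chunk start times whose hours of the day drift through the full 24 h cycle, which is the property the scheme relies on for diel coverage.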
The aim of the subsampling scheme was to capture a representative sample of the signals recorded for a given site-year, i.e., to select periods of time with calls that spanned a range of signal-to-noise ratios (SNR) and periods of time without calls, as well as other sounds that might contribute to false positives (though rare events may have been missed). Having a temporally representative subsample of sounds was deemed necessary to understand how a detector would perform when used across the entire dataset.
For each site, 10-18 h of data were annotated per month, with the exceptions of the Ross Sea in 2014, which had no data for January, and Maud Rise 2014, which recorded only from January to September (Fig. 2). Over the whole year the subsampling scheme ensured a relatively even distribution of hours across the 24 h cycle, with each site containing between 5 and 10 h inspected for any given hour of the cycle.
However, four sites had been annotated previously, and thus used slightly different subsampling schemes. Additionally, three of these sites had recording duty cycles shorter than an hour.

Manual annotations.
For manual detection and annotation of calls, recordings were visualised in Raven Pro 1.5 57. Spectrogram settings included a 120 s timespan, frequency limits between 0 and 125 Hz, a Fast Fourier Transform (FFT) window of approximately 1 s, a frequency resolution of approximately 1.4 Hz, and 85% time overlap between successive FFTs. The lower and upper limits of the spectrogram power (spectrogram floor and ceiling) were adjusted for each 1 h segment. The lower limit of spectrogram power was adjusted by the analyst until approximately 25% of the spectrogram was at or below the floor value (i.e. a visual estimate of the 25th percentile spectral noise level). The ceiling of the spectrogram was then adjusted so that the difference between ceiling and floor was between 30 and 50 dB relative to full scale. The ceiling could be adjusted further to provide additional contrast in the event of long, loud broadband sounds such as ice noise or the prolonged occurrence of baleen whale choruses.
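The floor-and-ceiling adjustment can be expressed numerically. The sketch below is not Raven Pro's implementation: the 250 Hz sample rate is hypothetical (recorder sample rates are not stated here), and the percentile-based floor is a numerical analogue of the analyst's visual estimate.

```python
import numpy as np

fs = 250                                    # hypothetical sample rate (Hz)
rng = np.random.default_rng(0)
x = rng.standard_normal(fs * 120)           # 120 s of noise as stand-in audio

nfft = fs                                   # ~1 s FFT -> ~1 Hz-wide bins
hop = nfft - int(0.85 * nfft)               # 85% overlap between FFTs
window = np.hanning(nfft)
frames = [window * x[s:s + nfft] for s in range(0, len(x) - nfft + 1, hop)]
Sdb = 10 * np.log10(np.abs(np.fft.rfft(frames)) ** 2).T  # freq x time, in dB

# Floor so that ~25% of the spectrogram sits at or below it (a numerical
# version of the analyst's visual 25th-percentile estimate); ceiling 40 dB
# above the floor, within the 30-50 dB range described above.
floor = np.percentile(Sdb, 25)
ceiling = floor + 40.0
display = np.clip(Sdb, floor, ceiling)
```

Clipping the displayed power to this range compresses low-level noise to a uniform background while preserving contrast for signals within ~40 dB of the noise floor.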
Within each subsample the analyst marked the time-frequency bounds of all occurrences of blue and fin whale sounds. Each analyst had extensive expertise in the identification of blue and fin whale sounds, particularly those from the Southern Hemisphere including the Antarctic. The analyst assigned one of eight classifications to annotations: Bm-Ant-A, Bm-Ant-B, Bm-Ant-Z, Bm-D, Bp-20, Bp-20Plus, Bp-Downsweep, and Unidentified (Table 3, Figs. 3 and 4). The first two letters of the classification correspond to genus and species, so sounds starting with Bm were produced by blue whales and those starting with Bp by fin whales. The remainder of the classification corresponds to a particular call type for that species (or subspecies in the case of Antarctic blue whales). In addition to marking the time-frequency boundaries of all potential detections, the analyst also noted qualitative information about background noise and other sources of sound present in each inspected chunk, including the presence and intensity of a "chorus" of elevated background noise in the 20-30 Hz band over which Antarctic blue whale Z-calls and fin whale 20 Hz pulses contain most of their energy 18,19.
For each site and classification, the 5th and 95th percentile frequency limits and durations of annotations were measured and plotted to visually identify gross differences among sites as a rough form of "quality control" across sites and analysts. These percentiles also directly informed respective parameters for automated detectors (Fig. 5).
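The percentile-based quality control above can be sketched as follows. The duration values are hypothetical stand-ins for the per-annotation measurements; in practice these come from the manual annotation tables.

```python
import numpy as np

# Hypothetical per-annotation durations (s) for one classification at two
# sites; real values come from the annotation tables described above.
durations = {
    "site_A": np.array([9.8, 11.2, 10.5, 12.0, 10.1, 11.6, 10.9]),
    "site_B": np.array([14.9, 16.2, 15.5, 17.1, 15.0, 16.4, 15.8]),
}

# 5th and 95th percentiles per site, as used for cross-site quality control
# and to inform detector parameters.
limits = {site: np.percentile(d, [5, 95]) for site, d in durations.items()}

# A gross difference between sites (e.g. non-overlapping 5th-95th percentile
# ranges) flags a possible analyst effect of the kind discussed in the
# Results, rather than a true difference in the sounds.
non_overlapping = limits["site_A"][1] < limits["site_B"][0]
```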
Signal-to-noise ratio, as described by Lurton (2010) 60, was then measured for each manual annotation. In brief, the root-mean-square (RMS) signal-plus-noise power, Z_s+n, was measured over the full duration of each detection in the frequency band of interest: 17-29 Hz for Bm-Ant-A, Bm-Ant-B, and Bm-Ant-Z; 20-30 Hz for Bp-20 and Bp-20Plus. Since some analysts marked time-frequency boundaries more tightly than others, a buffer of 1 s before and after the annotation was created to ensure that no residual signal was included in the measurement of noise. The noise measurement period was the same duration as the annotation, but split evenly before and after the buffer (i.e. the noise period was d/2 s before and d/2 s after the buffer, where d is the duration of the manual annotation). The RMS noise power, Z_n, and the variance of noise power, σ²_n, were measured over the same band of interest. Finally, the SNR in dB was calculated as:

(1) SNR = 10 log10((Z_s+n − Z_n)/σ_n)

Automated detectors. In order to demonstrate the utility of the annotated library and compare the site-specific performance of automated detection algorithms, we characterised the performance of two automated detectors commonly used for detecting sounds of Antarctic blue and fin whales for each of the sites in the annotated library: an energy sum detector and a spectrogram correlation detector. Energy sum detectors rely only on knowledge of the duration and frequency band of the call, and so are generally more flexible when calls vary within the detection band. Spectrogram correlation detectors rely on a priori knowledge of the shape of the call in the time-frequency domain, and thus perform better when calls are highly stereotyped with relatively little variation in shape from one call to the next. These two types of detectors were chosen to demonstrate that the library is suitable for different types of detectors, not because we believed they were optimal for their respective tasks.
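The per-annotation SNR measurement described above (a signal window, 1 s buffers, and flanking noise windows of d/2 s on either side) can be sketched as follows. For brevity this sketch uses a simple broadband power-ratio SNR rather than the band-limited estimator of the study; the sample rate and the 26 Hz test tone are illustrative.

```python
import numpy as np

def annotation_snr(x, fs, t0, t1, buffer_s=1.0):
    """Sketch of the per-annotation SNR measurement described above: the
    noise window has the same total duration d as the annotation, split as
    d/2 before and d/2 after 1 s buffers on either side of the annotation.
    Band-pass filtering to the band of interest is omitted for brevity.
    """
    i0, i1 = int(t0 * fs), int(t1 * fs)
    half = (i1 - i0) // 2
    b = int(buffer_s * fs)
    sig = x[i0:i1]
    noise = np.concatenate([x[i0 - b - half:i0 - b], x[i1 + b:i1 + b + half]])
    p_sn = np.mean(sig ** 2)        # mean-square power of signal + noise
    p_n = np.mean(noise ** 2)       # mean-square power of noise alone
    return 10 * np.log10((p_sn - p_n) / p_n)

# Synthetic check: a 26 Hz tone embedded in weaker noise.
fs = 250
rng = np.random.default_rng(1)
x = 0.1 * rng.standard_normal(fs * 60)
t = np.arange(fs * 2) / fs
x[int(10 * fs):int(12 * fs)] += np.sin(2 * np.pi * 26 * t)
snr_db = annotation_snr(x, fs, t0=10.0, t1=12.0)
```

With a tone of mean-square power 0.5 in noise of power 0.01, the returned value is close to 10 log10(50) ≈ 17 dB.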
For fin whale 20 Hz pulses (both the Bp-20 and Bp-20Plus classifications) we applied an energy sum detector 38 which targeted the 20 Hz pulse of fin whales by summing the energy of each spectrogram slice in the band from 15 to 30 Hz. The detection score was thus the sum of the squared values of the spectrogram frequency bins (after noise normalisation) at each time step. In addition to the threshold on summed energy, a minimum and maximum time over threshold of 0.5 and 2.5 s, as well as a minimum time between detections of 0.5 s, were used as criteria for detection of individual fin whale 20 Hz pulses.

Table 3. Classification and labelling system for blue and fin whale sounds in the SORP library of annotated recordings.

Label | Call type | References | Description
Bm-Ant-A | Antarctic blue whale unit A | 11,27 | A constant-frequency tone between 28 and 25 Hz (depending on the year) without other units
Bm-Ant-B | Antarctic blue whale unit AB | 11,27 | Antarctic blue whale unit A tone followed by a partial or full inter-tone downsweep (unit B)
Bm-Ant-Z | Antarctic blue whale Z-call (AKA 3-unit vocalisation) | 11,17 | Antarctic blue whale 'Z-call' with upper tonal unit A and lower tonal unit C present (and downswept unit B either present or absent)
Bm-D | Blue whale FM (AKA D-calls) | 11 | Any downswept frequency-modulated calls from blue whales. Typically, but not always, longer in duration and lower in frequency than FM calls from fin and minke whales

For Antarctic blue whale song (i.e. the classifications Bm-Ant-A, Bm-Ant-B, and Bm-Ant-Z), we applied a spectrogram correlation detector 37 which targeted Bm-Ant-Z calls, but was also effective at detecting Bm-Ant-A and Bm-Ant-B since these call types are essentially each a subset of the full Z-call. The detection score for the spectrogram correlation detector was the magnitude of the 2D cross-correlation between the correlation kernel and the spectrogram at each time step. The threshold values are therefore in the somewhat arbitrary units of a 'recognition score', the result of cross-correlation between the normalised spectrogram and the correlation kernel. In addition to a detection score threshold, a minimum time over threshold and a minimum time between calls were also used 37.
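The recognition-score computation and the minimum-time-over-threshold criterion described above can be sketched as follows. The kernel shape is a crude illustration of a Z-call (upper tone, downsweep, lower tone), not the kernel used in the study, and the thresholding logic is simplified.

```python
import numpy as np

def spectrogram_correlation(spec, kernel):
    """Sketch of spectrogram correlation: slide a time-frequency kernel
    along the time axis of a (freq x time) spectrogram and return the 2D
    cross-correlation at each time step, i.e. the 'recognition score'.
    Kernel construction and spectrogram normalisation are omitted here.
    """
    n_f, n_t = kernel.shape
    return np.array([np.sum(spec[:n_f, i:i + n_t] * kernel)
                     for i in range(spec.shape[1] - n_t + 1)])

def pick_detections(scores, threshold, min_over=2):
    """Apply a threshold with a minimum time over threshold (in time
    steps), yielding one detection per run of above-threshold scores."""
    detections, run = [], 0
    for i, s in enumerate(scores):
        run = run + 1 if s > threshold else 0
        if run == min_over:
            detections.append(i - min_over + 1)
    return detections

# Synthetic check: a crude Z-call-like kernel embedded in a noisy spectrogram.
rng = np.random.default_rng(2)
kernel = np.zeros((8, 6))
kernel[1, :3] = 1.0           # upper tonal unit
kernel[1:6, 3] = 1.0          # downsweep
kernel[6, 3:] = 1.0           # lower tonal unit
spec = rng.random((8, 80))
spec[:, 30:36] += 4 * kernel  # embed the template at time step 30
scores = spectrogram_correlation(spec, kernel)
hits = pick_detections(scores, threshold=0.8 * scores.max(), min_over=1)
```

The score peaks where the spectrogram matches the kernel, and the run-length logic collapses each above-threshold run into a single detection, analogous to the minimum time over threshold used in the study.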
Detectors were run on the annotated library's subsets of recordings for each site using Pamguard version 2.01.03 39. Each detector was applied to the subsample of data for each site using a range of empirically determined thresholds in order to create receiver operating characteristic (ROC) and precision-recall (PR) curves for each site 61,62.
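Sweeping a threshold to build ROC and PR points, as described above, can be sketched as follows. The scores and labels are toy values; in the study the labels come from matching automated detections to the manual annotations.

```python
import numpy as np

def roc_pr_points(scores, labels, thresholds):
    """Sketch of sweeping a detector threshold to build ROC and PR curves
    from per-event scores: `labels` is 1 for events that are true calls and
    0 otherwise. Returns (false positive rate, true positive rate,
    precision) at each threshold.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    fpr, tpr, prec = [], [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tpr.append(tp / max(np.sum(labels == 1), 1))
        fpr.append(fp / max(np.sum(labels == 0), 1))
        prec.append(tp / max(tp + fp, 1))
    return fpr, tpr, prec

# Toy scores: true calls tend to score higher than non-calls.
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr, prec = roc_pr_points(scores, labels, thresholds=[0.5, 0.25])
```

Lowering the threshold moves along both curves at once: recall (TPR) rises while precision falls, which is why the study reports both ROC and PR curves per site.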
Feature extraction and detector design. The spectrogram correlation and energy sum detectors were parameterised by the time and frequency properties of calls, namely the duration and frequency of each unit of each call. The specific time-frequency properties used for each detector were chosen based on published descriptions of calls. The detector parameters were validated by comparison with measurements from the manual annotations, specifically the 5th and 95th percentiles of the energy distribution of each annotation (Fig. 5).
The mean duration of all manual annotations for a given classification, d, was used to determine the time boundaries for each automated detection. A "refractory period" of length d was applied after each detection to prevent new detections from overlapping existing detections. The refractory period prevented multipath arrivals (e.g. reverberation from the seabed and surface that can arrive before or after the direct arrival) from being detected as separate calls. However, the refractory period had the downside of preventing legitimate detection of calls from different animals that arrived within d seconds of each other. This was believed to be a prudent trade-off because multipath arrivals appeared to be far more common than overlapping calls from two different animals. Furthermore, by preventing automated detections from overlapping, the total number of possible automated detections (and hence true negative and false positive rates) could be calculated from the total duration of the recording, the mean annotation duration d, the refractory period, and the total duration of all the manual annotations.
Noise normalisation was applied to the spectrogram prior to automated detection. The noise normalisation algorithm was Pamguard's 'Average Subtraction' algorithm, which involved subtracting a decaying average from each spectrogram frequency bin at each time step. Specific parameters for the fin whale 20 Hz pulse detector are described in Table 4, and those for Antarctic blue whale song in Table 5.
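The noise normalisation and the energy sum stage described above can be sketched together as follows. This is in the spirit of Pamguard's 'Average Subtraction' but is not its actual implementation; the update constant `alpha`, the frequency grid, and the synthetic pulse are illustrative.

```python
import numpy as np

def average_subtraction(spec, alpha=0.02):
    """Sketch in the spirit of Pamguard's 'Average Subtraction': subtract a
    decaying (exponentially weighted) running average from each frequency
    bin at each time step. `alpha` is an illustrative update constant, not
    Pamguard's actual parameterisation.
    """
    out = np.empty_like(spec, dtype=float)
    avg = spec[:, 0].astype(float)
    for t in range(spec.shape[1]):
        out[:, t] = spec[:, t] - avg
        avg = (1 - alpha) * avg + alpha * spec[:, t]
    return out

def energy_sum_scores(spec, freqs, f_lo=15.0, f_hi=30.0):
    """Detection score per spectrogram slice: summed squared bin values in
    the 15-30 Hz band, as for the fin whale 20 Hz pulse detector above."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.sum(spec[band, :] ** 2, axis=0)

# Synthetic check: an in-band pulse at time steps 40-45 of a flat spectrogram.
freqs = np.arange(125.0)
spec = np.ones((125, 100))
spec[15:31, 40:46] += 3.0
scores = energy_sum_scores(average_subtraction(spec), freqs)
```

Because the running average tracks the stationary background, the score is near zero away from the pulse and jumps at its onset, which is the property the detector's threshold and time-over-threshold criteria exploit.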
To parameterise the blue whale detector, Eq. (2) was used to determine the frequency (in Hz) of unit A of Antarctic blue whale calls. In this equation, derived from 14, f_a is the frequency of unit A and t is the number of days since 12 March 2002. For each site-year, t was set to the 1st of June for detector parameters that required estimation of f_a.
(2) f_a = 27.6659 − (0.135/365) × t

Evaluation of detector performance. Detections from the automated detectors were matched to the manual annotations by comparing the start and end times of all pairs of manual and automated detections. Detections were considered a match if there was any time overlap between the manual and automated observations. This criterion created the potential for duplicate matches between multiple automated and manual annotations. Duplicates were identified and labelled, but were counted as neither true positives nor false positives when calculating ROC and precision-recall curves. For each threshold, automated detections were tabulated to create a confusion matrix of true positives, false positives, their respective rates, precision, and recall. ROC and precision-recall curves for each site and detector were then created from each set of true and false positives (Fig. 7).
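The overlap-based matching and duplicate handling described above can be sketched as follows; the interval values are illustrative.

```python
def match_detections(manual, auto):
    """Sketch of the overlap-based matching above: `manual` and `auto` are
    lists of (start_s, end_s) intervals. A manual annotation with at least
    one overlapping automated detection is a true positive; an automated
    detection overlapping no manual annotation is a false positive; extra
    (duplicate) matches count toward neither, as described above.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(1 for m in manual if any(overlaps(m, a) for a in auto))
    fp = sum(1 for a in auto if not any(overlaps(a, m) for m in manual))
    return tp, fp

# One annotation matched twice (a duplicate), one missed, one false alarm.
manual = [(0.0, 10.0), (20.0, 30.0)]
auto = [(5.0, 12.0), (8.0, 9.0), (50.0, 55.0)]
tp, fp = match_detections(manual, auto)
```

In this toy case the two detections overlapping the first annotation yield a single true positive (the second match is a duplicate), the unmatched annotation is a miss, and the detection at 50-55 s is the only false positive.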
To investigate the relationship between the number of automated detections and SNR, a generalised additive model (GAM) 63 was fitted using results from the automated detection process. For each manually detected call, the SNR and whether or not the call was automatically detected were recorded. Specifically, each manual annotation was assigned a value of 1 when any automated detection matched and a value of 0 when no automated detection matched. The matches were modelled as the response of a logistic regression with SNR as a predictor, using a GAM with a binomial error distribution and a logit link function. The GAM was fitted separately for each site using the default number of knots in the package 'mgcv' 63 in R version 3.6.1 64.

Results
Distribution of annotations throughout the library. The annotated library consisted of 1880.25 h (audio duration) of annotated data across 11 site-years and 7 sites. In total, there were 105,161 annotations across all sites, though annotations were not evenly distributed across sites or classifications (Table 6). Bm-Ant-A was the most numerous classification with 24,363 manual detections in total, while Bm-Ant-Z was the least numerous with 2,515. Ross Sea 2014 had the fewest annotations of all site-years with only 359 (104 of Bm-Ant-A, and the remainder unidentified). Elephant Island 2014 had the most annotations of all site-years with 21,438 in total, including unidentified sounds.
The percentage of hours with each type of annotation was also variable across sites (Table 6). Bm-Ant-A had the highest percentages across all sites, ranging from 0.6 to 91.2% of hours, while Bp-20Plus had the lowest, with no Bp-20Plus detections at Casey 2014 or Ross Sea 2014. Antarctic blue whale classifications were generally present in a higher percentage of hours than fin whale sounds across most site-years (Table 6).

Description of classification features.
Within each classification the 5th and 95th percentiles of the frequency bounds and durations were similar across sites, with a few notable exceptions. Annotations of Bp-20, Bp-20Plus, and Bp-Downsweep from Elephant Island 2014 appeared to have longer durations than those classifications at other sites. However, visual comparison of these annotations suggests that this difference arose from the way the analyst marked annotations (i.e. more generous time-boundaries than other analysts) rather than a true difference in the duration of the sounds. This suggests that our use of the 90th percentile energy duration did not provide a measure of duration that was fully robust against analyst variability. Thus, different features or measures of duration may be more robust or appropriate for developing automated detectors and/or classifiers.
The stereotyped calls of Antarctic blue whales (Bm-Ant-A, Bm-Ant-B, and Bm-Ant-Z) and those of fin whales (Bp-20, Bp-20Plus) are well described in the scientific literature and are very distinct from one another, and this was reflected in the plots of their 90% durations and 5th-95th percentile frequency bounds. In contrast, the properties of Bm-D and Bp-Downsweep have not been as well defined in the literature, and these call types have forms that appear very similar to each other. These classes therefore have a higher potential for confusion and a higher likelihood of being marked as unidentified. As a result, the time-frequency bounds of unidentified calls combined two categories: (1) calls that clearly did not fit into any of the defined classifications, and (2) calls that were intermediate between Bm-D and Bp-Downsweep. However, when annotations are restricted to signals that can be definitively attributed to one species or the other, the two call types do appear to be distinguishable using duration and frequency (Fig. 5). This is an instance where having multiple experienced analysts annotate the same dataset might converge on clear guidelines for distinguishing between the two call types. While decisions to only annotate or detect signals that are clearly attributable to a known species are necessary and justifiable, further research on acoustic behaviour would be required to determine whether this has downstream implications for making accurate population abundance estimates. In contrast to the duration measurements, the upper frequency limit of the Bp-20Plus call type did show true differences across sites, revealing geographic separation similar to that described in previous studies 15,18.
Gedamke (2009) 15 found that fin whales detected on recorders in the Indian Ocean (including sites south of 60° S) had higher-frequency components near 100 Hz, while fin whales detected in the Tasman Sea (Pacific Ocean, including sites south of 60° S) had higher-frequency components at 82 and 94 Hz. Širović et al. (2009) 18 found that fin whale sounds recorded off the WAP and in the Scotia Sea had higher-frequency components around 90 Hz, while recordings off East Antarctica had higher-frequency components near 100 Hz. In our study, the Indian and Atlantic sectors had higher-frequency components around 100 Hz, while the WAP and Pacific sectors were around 90 Hz.

While there is a temptation to speculate on the drivers of these temporal trends, such analyses are beyond the scope of this work, which was the creation of a dataset suitable for characterising automated detectors. Rather, the purpose of plotting the monthly number of annotations by site is simply to describe the contents of the annotated library and to identify months or seasons that do and do not have a sufficient number of detections to allow characterisation of a detector. In that regard, there is a notable lack of fin whale annotations (Bp-20, Bp-20Plus, and Bp-Downsweep) from July to December.
An example of using the annotated data to examine the performance of automated detectors. ROC, precision-recall, and SNR. ROC and PR curves indicated that detector performance was fair to poor for these datasets. ROC and PR curves varied by site for both the blue and fin whale detectors, with some sites much worse than others (Fig. 7). For example, the true positive rate for the blue whale detector ranged from 8 to 55% at a false alarm rate of 1% (~2.8 false positives per hour). The true positive rate for the fin whale detector ranged from 1 to 76% at a false alarm rate of 1% (~14.4 false positives per hour).
In addition to variability in detector performance, the distribution of SNR also varied across sites, with the combined Bp-20 and Bp-20Plus distributions showing more variability than the combined Bm-Ant-A, Bm-Ant-B, and Bm-Ant-Z distributions (Fig. 8). The modelled probability of detection at a 1% false positive rate was similar across sites at high SNR, but more variable across sites at low SNR (e.g. < 0 dB) (Fig. 9).

Discussion
We created an annotated library of blue and fin whale sounds that spans four circumpolar Antarctic recording regions, five different years (2005, 2013, 2014, 2015, 2017), and five different types of instrument. The acoustic data in our library come from a variety of data collection campaigns conducted by laboratories from five nations.
The distribution of calls in our library varied considerably across sites, years, and species. Antarctic blue whale sounds, particularly Bm-Ant-A, were the most numerous and were well represented at all sites and over most of the year. Fin whale sounds had a much more seasonal representation in the annotated library, with annotations in late summer and throughout the autumn months and few throughout the rest of the year. Fin whale Bp-20Plus sounds also revealed some degree of biogeographic separation, with calls in the Atlantic and Indian sectors having higher upper-frequency components than those in the Pacific and WAP sectors. The annotations in the library form a representative ground-truth dataset that can be used to extract the features of each call type, and also to train and characterise the performance of automated detectors.
Detector performance. To test the utility of the library, we characterised the performance of a spectrogram correlation detector for blue whale calls and an energy sum detector for fin whale calls. The performance of the automated detectors varied by site-year. Neither detector performed particularly well, and some sites and years showed much worse performance than others (Fig. 7). Differences in detector performance broadly followed differences in SNR across sites, such that sites with lower SNR had worse performance than those with higher SNR (Fig. 8). Across sites, the automated detectors showed greater variability at low SNR than at high SNR (Fig. 9).

Characterising the performance of an automated detector and estimating the probability of automatic detection as a function of SNR using a representative subset of data, as we have done here, are important steps towards meaningful comparisons of animal sounds across sites and over time 43,46. In addition to the performance of the detector, differences in call density (a useful metric for such comparisons) can arise from site-specific factors such as differences in instrumentation (including depth) 54, analyst variability 53,65, ambient and local noise sources 36,66, propagation 46,67, and animal behaviour 43. These factors are not mutually exclusive and can interact in a complex manner. Addressing and accounting for how each of these factors affects call density is beyond the scope of this manuscript, but is a requirement for comparisons of acoustic detections that meaningfully address biological questions of distribution and temporal trends. The library and methods we present here for assessing the performance of detectors are one step removed from estimating call density, which in turn is one step removed from estimating animal density 68.
None of the passive acoustic studies of Antarctic blue or fin whales to date (listed in Table 1) have completely reported the performance of their detector over a representative subsample of their data. The methods we have presented here for characterising the performance of a detector on a representative subsample of data constitute a bare minimum of reporting for future studies that use automated detectors to study Antarctic blue and fin whale calls. Specifically, reporting should include all parameters of the automated detector, including any noise pre-processing steps; the distribution and SNR of ground-truth detections throughout the dataset; and the true and false positive rates and/or precision and recall of the detector for a representative sample of the data.
We hope the open-access annotated library presented here can provide a base dataset upon which to develop improved detectors, i.e. those with higher true positive rates and lower false positive rates. Here we have extracted duration and frequency measurements from annotations, but the library can readily be used to extract more complex features such as pitch-tracks 69 or other time-frequency features 70 to train machine learning algorithms 37,71,72, deep neural networks 73, or any other advanced detectors that may provide better performance than the spectrogram correlation detector. Better detectors would not only reduce a source of uncertainty in estimating call density, but would also reduce the amount of analyst effort required to verify true positives and account for false positives.
Future development of this dataset will aim to expand the annotated library to serve as a test-bed for subsequent analyses that address the issues of noise, detection range, and analyst variability to produce standardised outputs that are appropriate for circumpolar comparisons of call-density. This additional development would entail (1) collating pressure calibration details for noise analysis at each site-year, (2) estimating detection range throughout each site-year and (3) having multiple analysts annotate the same subsets of data for the purposes of quantifying analyst bias and variability.

Conclusions
We created an annotated library of blue and fin whale sounds that spans four circumpolar Antarctic recording regions, five different years (2005, 2013, 2014, 2015, 2017), and five different types of instrument. The annotations in the library form a representative ground-truth dataset, and we demonstrate how to train, test, and characterise the performance of two common automated detectors using the library. The annotated library presented here can serve as a benchmark upon which detectors can be developed, compared, and improved. It may also serve as a base dataset for developing additional analytical techniques to enable robust comparisons of acoustic detections of blue and fin whales across diverse circumpolar sites and over long spans of time.
We encourage further contributions of data and annotations to help expand the library, and in the future we hope to include annotations of sounds from additional Antarctic species, as well as data from other recording locations.