33 years of globally calibrated wave height and wind speed data based on altimeter observations

This dataset consists of 33 years (1985 to 2018) of global significant wave height and wind speed data obtained from 13 altimeters, namely GEOSAT, ERS-1, TOPEX, ERS-2, GFO, JASON-1, ENVISAT, JASON-2, CRYOSAT-2, HY-2A, SARAL, JASON-3 and SENTINEL-3A. The altimeter data have been calibrated and validated against National Oceanographic Data Center (NODC) buoy data. Differences between altimeter and buoy data as a function of time are investigated to assess long-term stability. A cross-validation between altimeters is also carried out to check the stability and consistency of the calibrations developed. Quantile-quantile comparisons between altimeter and buoy data, as well as between altimeters, are undertaken to test the consistency of probability distributions and extreme value performance. The data are binned into 1° by 1° bins globally, so that users can conveniently download only their regions of interest. All data are quality controlled. This globally calibrated and cross-validated dataset provides a single point of storage for all altimeter missions in a consistent format.


Background & Summary
Satellite radar altimeters have provided global coverage of wind speed and significant wave height (wave height) for more than three decades. Such data have been used for many applications, including offshore engineering design, validation of numerical models, wind and wave climatology, and investigation of long-term trends in oceanic wind speed and wave height [1][2][3][4][5]. Since the launch of GEOSAT in 1985, there has been almost continuous coverage of global observations from satellite altimeters. Following the conclusion of the GEOSAT mission in late 1989, there was a short gap until the launch of the European Remote-Sensing Satellite (ERS-1) in mid-1991. In total, 13 satellite altimeter missions have been operated, the two most recent launches being JASON-3 and SENTINEL-3A in 2016. The vast majority of these satellites have been placed in near-polar, sun-synchronous orbits. Altimeters are nadir-looking instruments, meaning they measure along a narrow pencil beam directly below the satellite (footprint approximately 10 km wide). Combined with the orbit geometry, this means that the satellites trace out "herring bone" ground track patterns. Along-track resolution of altimeter data is high, with data available at approximately 1 Hz (about 7 km along track). The ground track separation depends on orbit geometry but can be up to 400 km at the equator, with satellites repeating the same ground tracks on a 3 to 10 day repeat cycle (note that CRYOSAT-2 has a 369 day repeat cycle with a semi-repeat cycle of 30 days). In recent years, as the number of altimeters in operation has increased, the density of observations has greatly improved.
As a number of agencies have been responsible for the launch and operation of these satellites, the data tend to be distributed across a relatively large number of sites, have been calibrated in a variety of ways and exist in a variety of data formats. As this complicates usage of the data, a number of attempts have been made both to calibrate altimeter missions consistently and to provide data repositories for multiple missions. These include: Globwave (http://globwave.ifremer.fr/), Radar Altimeter Data System (RADS, http://rads.tudelft.nl), AVISO (https://www.aviso.altimetry.fr), National Satellite Ocean Application Service (NSOAS, http://www.nsoas.org.cn/) and National Oceanic and Atmospheric Administration (NOAA, https://www.noaa.gov/). However, none of these repositories archives all the missions over the period since 1985 in a consistent manner.
This paper outlines an archive containing wind speed and wave height data, together with related quantities for all 13 altimeter missions. The data are consistently calibrated against buoys, cross-validated between satellites and quality controlled. The satellite calibrations are checked for long term stability, discontinuities and drift.
In situ measurements. In order to calibrate the altimeters in a consistent fashion, a long-term, high-quality database of buoy in situ measurements of wind speed and wave height is required. These data should span a range of different meteorological environments and geographic regions. In addition, such data should be relatively far from land, so as to avoid contamination of altimeter measurements due to land/islands within the altimeter footprint. The only in situ dataset which meets these requirements is the National Data Buoy Center (NDBC) buoy archive. Once NDBC data have been quality controlled, the data are archived by the National Oceanographic Data Center (NODC) and are available in the public domain. As with any long-term in situ data archive of this type, there have been changes in buoy hulls, instrumentation packages and analysis methods over the duration of the measurements [8]. The impact of such changes has previously been investigated in the context of trend estimates [9]. In the present application, data from a large number of buoys are pooled and a mean calibration obtained across all buoys. This process ensures that changes in hull type at specific locations have a negligible impact on the overall calibration result. The desire to have a long-duration dataset relatively far offshore means that the data will be almost exclusively from the northern hemisphere. Although this will bias mean climatic conditions to some extent, it is unlikely to have a significant impact on the mean calibration [10]. Also, some doubts have been raised about the validity of such data at high wind speeds and wave heights [3][4][8][11][12][13][14]. Despite these concerns, the NDBC in situ dataset has been extensively used to validate model results and calibrate satellite systems and has been found to be of high quality [15][16].
In this work, wind speed and significant wave height have been obtained from NODC moored buoy data more than 50 km offshore to avoid land contamination 17 . From 2011, NODC data contain a series of quality flags for wind speed and significant wave height (0, 1, 2, and 3 which represent quality_good, out_of_range, sensor_nonfunctional, and questionable, respectively). Only wind speed and significant wave height which are flagged "0" have been used for calibration of the altimeter data. The locations of NODC buoys used in the calibration are shown in Fig. 1.
The buoys measure either significant wave height ($H_s$), wind speed at the height of the anemometer $z$ ($U_z$), or both. For wind speed, a consistent reference height of 10 m is required ($U_{10}$). This was obtained by assuming a neutral stability logarithmic boundary layer, as given by [1]:
$$U_{10} = U_z \sqrt{\frac{\kappa^2}{C_d}}\,\frac{1}{\ln(z/z_o)} \qquad (1)$$
where $\kappa$ is the von Kármán constant, which is approximately 0.4, $C_d$ is the drag coefficient and $z_o$ is the roughness length. Measurements of $C_d$ over the ocean yield results with scatter over an order of magnitude, and much research has focused on the wind speed and sea state dependence of $C_d$ [1][18][19]. In this work, $C_d = 1.2 \times 10^{-3}$ and $z_o = 9.7 \times 10^{-5}$ m are used. As mentioned in previous studies [7], a different assumption of $C_d$ does not have a significant influence on the final satellite wind speed [17]. For a more detailed description of NOAA buoy data, one can refer to Zieger [20]. This choice of boundary layer correction is consistent with previous altimeter calibrations [17].
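The height correction can be illustrated with a short sketch. It assumes the neutral logarithmic profile takes the form U10 = Uz * sqrt(κ²/Cd) / ln(z/zo), which is consistent with the constants quoted in the text (with these values a measurement at z = 10 m is returned essentially unchanged). The function name `u10_from_uz` is hypothetical and is not taken from the authors' Matlab scripts.

```python
import math

KAPPA = 0.4    # von Karman constant (approximate value quoted in the text)
C_D = 1.2e-3   # drag coefficient used in the text
Z_0 = 9.7e-5   # roughness length in metres used in the text


def u10_from_uz(u_z, z):
    """Convert wind speed u_z (m/s) measured at height z (m) to the 10 m level.

    Assumes a neutral-stability logarithmic boundary layer (illustrative form).
    """
    if z <= Z_0:
        raise ValueError("anemometer height must exceed the roughness length")
    return u_z * math.sqrt(KAPPA ** 2 / C_D) / math.log(z / Z_0)


if __name__ == "__main__":
    # An anemometer at 5 m reads lower than the equivalent 10 m wind speed.
    print(round(u10_from_uz(8.0, 5.0), 2))
```

Under this form, an anemometer mounted below 10 m yields a 10 m wind speed larger than the measured value, as expected for a logarithmic profile.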
Altimeter data. The altimeter data used in this database were sourced from three different archives, namely Globwave [21], Radar Altimeter Data System (RADS) [22], and National Satellite Ocean Application Service (NSOAS) [23]. The data from six altimeters which were mostly retired before 2013, namely GEOSAT, ERS-1, TOPEX, ERS-2, GFO, and ENVISAT, were obtained from Globwave. The calibration and validation of these altimeters has previously been described, using essentially the same process adopted here [7]. These altimeters were re-calibrated for the present database, resulting in very minor changes to the calibration relationships. Data for a further six altimeters, JASON-1, JASON-2, CRYOSAT-2, SARAL, JASON-3 and SENTINEL-3A, were obtained from RADS. The final satellite is Hai Yang-2A (HaiYang means ocean in Chinese). This satellite is China's first dynamic environmental satellite and the data from this altimeter are not available in the public domain. However, following personal communication, the data were provided by NSOAS. Summary information for each of the satellites/altimeters is included in Table 1.
Values of significant wave height and wind speed are determined from the high frequency altimeter data by fitting a functional form to the radar return from the ocean surface in a process called waveform retracking. As noted above, the original data for the present database were sourced from Globwave, RADS and NSOAS. In the case of both Globwave 24 and RADS 25 , these data were originally sourced from the various satellite agencies in the form of 1 Hz Geophysical Data Records (GDRs). The processing used to form the GDRs uses a range of different retracking approaches and no attempt has been made to harmonize the retracking. Rather, we use the 1 Hz data from the GDRs and calibrate at this level. This calibration clearly removes some differences between various datasets. A harmonized retracking of all data would presumably further increase the quality of data but is beyond the scope of this database.
Altimeter quality controls. Altimeter Geophysical Data Records are not free from data errors, and such data contain numerous data "spikes" due to land and ice contamination and issues associated with the variable quality of the altimeter waveform received by the satellite [17]. As indicated previously [7], the Globwave data contain a series of quality flags (0, 1, and 2, representing good_measurement, acceptable_for_some_applications, and bad_measurement, respectively) [24]. These flags proved very reliable in excluding poor quality data. In the present database, data flags 1, 2, 3, 4, and 9 are used, representing Good_data, Probably_good_data, SAR-mode data or possible hardware error (only used for CRYOSAT-2), Bad_data and Missing_data, respectively. Hence, the quality flags from Globwave have been transformed from flags 0, 1 and 2 to the present flags 1, 2 and 4. Moreover, all NaN values in Globwave have been defined as missing data, which are flagged 9.
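The flag transformation described above amounts to a simple lookup. The sketch below is illustrative only; the names `GLOBWAVE_TO_DATABASE` and `convert_flag` are hypothetical, and representing a missing value as `None` or NaN is an assumption.

```python
import math

# Mapping stated in the text: Globwave flags 0, 1, 2 become database flags 1, 2, 4.
GLOBWAVE_TO_DATABASE = {0: 1, 1: 2, 2: 4}
MISSING_DATA = 9  # flag assigned to NaN values


def convert_flag(globwave_flag, value):
    """Return the database flag for one Globwave measurement.

    `value` is the measured quantity; NaN (or None) marks missing data.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING_DATA
    return GLOBWAVE_TO_DATABASE[globwave_flag]


if __name__ == "__main__":
    print(convert_flag(0, 2.5), convert_flag(2, 1.0), convert_flag(1, float("nan")))
```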
The RADS and NSOAS data do not have a similar system of quality flags. Hence, the following criteria have been used to assess the quality of the data. This approach is similar to that adopted previously [17][26]. Initially the $H_s$ data are considered:
(1) If $H_s > 30$ m, the data point was flagged "4", which means "bad" data.
(2) All points identified as over land or ice using land/ice masks (as defined in the GDRs distributed with the original data by the various satellite agencies) were discarded.
(3) In the present dataset, 1 Hz values are used. However, the satellite agencies also distribute data related to the variability of the 20 Hz products in the GDRs. Data for which the standard deviation of these 20 Hz altimeter values is greater than 2.5 m indicate significant variability within the footprint and were flagged "4".
(4) After applying these quality controls, the data were divided into blocks of 25 points, which represents approximately 180 km along the ground track. As argued in Zieger et al. [17], this represents segments long enough to form reliable statistics but not so long that the data will not be coherent. Individual values in the block were identified as outliers, and flagged "4", based on the median absolute deviation (MAD) [27]. The MAD is defined as [28]:
$$\mathrm{MAD} = b \,\operatorname{median}_i\left\{\left|x_i - \operatorname{median}_j\{x_j\}\right|\right\}$$
where $x_i$ is the original observation, with $i = 1, 2, 3, \ldots, n$. In this case $n = 25$. The value of $b$ is 1.4826, which is the scaling factor for Gaussian distributions [29]. Furthermore, following Miller [30], a threshold value of 3 has been chosen, and hence all values which fall outside the following criterion were categorized as outliers [27]. With $M = \operatorname{median}_j\{x_j\}$, the criterion is given by
$$\left|\frac{x_i - M}{\mathrm{MAD}}\right| \le 3$$
This can be rewritten as:
$$M - 3\,\mathrm{MAD} \le x_i \le M + 3\,\mathrm{MAD}$$
(5) In a final check, blocks identified in test 4 above were further considered. These blocks were re-divided into sub-blocks either side of the flagged points, and a ratio $R$, defined relative to the mean of the block, was evaluated for each sub-block. A large value of $R$ indicates that there may be multiple spikes in the block. If $R > 0.5$, the entire sub-block was flagged "4".
The steps undertaken at points 4 and 5 above are intended to flag erroneous "spikes" in the data. Visual examination of many cases indicated that the procedure was remarkably successful at this, whilst not removing strong along-track gradients which may be caused by strong currents [31][32]. Nevertheless, such data are retained in the database and flagged "4", so users interested in along-track variability can process them if desired.
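The MAD test of step 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the criterion |x - median| <= 3 MAD with b = 1.4826 applied independently to consecutive blocks of 25 points, it treats a short final block like any other, and the function names are hypothetical. Step 5 is not included.

```python
import statistics

B = 1.4826        # Gaussian scaling factor quoted in the text
THRESHOLD = 3.0   # outlier threshold quoted in the text
BLOCK_SIZE = 25   # points per block quoted in the text
GOOD, BAD = 1, 4  # database flag values


def mad_flags(block):
    """Flag each value in one block as GOOD or BAD using the MAD criterion."""
    med = statistics.median(block)
    mad = B * statistics.median([abs(x - med) for x in block])
    # Note: if more than half the values are identical, MAD is zero and every
    # value that differs from the median is flagged.
    return [BAD if abs(x - med) > THRESHOLD * mad else GOOD for x in block]


def flag_series(values, block_size=BLOCK_SIZE):
    """Apply the MAD test to consecutive blocks of an along-track series."""
    flags = []
    for start in range(0, len(values), block_size):
        flags.extend(mad_flags(values[start:start + block_size]))
    return flags


if __name__ == "__main__":
    hs = [2.0, 2.1, 1.9, 2.05, 1.95] * 5
    hs[12] = 9.0  # artificial spike
    print(flag_series(hs))
```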
Similar criteria have been applied for wind speed ($U_{10}$). In this case, for test 1 the wind speed limit was set at 60 m s$^{-1}$ for all altimeters except SARAL. In the case of SARAL this limit was set at 24 m s$^{-1}$ (see the discussion of SARAL calibration below). As for significant wave height, wind speed values above these limits were classified as "bad" data, which are flagged "4".
Calibration against in situ measurements. The quality controlled significant wave height and wind speed data were calibrated by comparing the buoy measurements with altimeter passes. Buoy observations and altimeter passes were considered a "matchup" if they satisfied the following criteria:
a. The altimeter track was within 50 km of the buoy and the overpass occurred within 30 min of the buoy recording data. These matchup criteria have been widely used in previous studies [17][33][34][35][36][37][38].
b. Only buoys which are more than 50 km offshore have been considered, in order to avoid the impact of the proximity of land on both buoy and satellite observations.
c. A minimum of five points were required in the altimeter pass within the 50 km radius region around the buoy.
d. Any large variability in the along-track altimeter data was excluded. Specifically, passes in which the variability $\sigma$ exceeded a set threshold were excluded.
Again, the same criteria were also applied for wind speed. As not all buoys measure both wind speed and wave height, there is not a one-to-one overlap between the buoys used to calibrate wind speed and wave height.
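Criteria a and c above can be sketched as a simple test on distance, time and point count. The sketch is illustrative only: the great-circle (haversine) distance on a spherical Earth of radius 6371 km is an assumption about how separation is measured, the tuple layout and function names are hypothetical, and criteria b and d are not implemented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation, assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_matchup(buoy, alt_points, max_km=50.0, max_minutes=30.0, min_points=5):
    """Check criteria a and c for one buoy record and one altimeter pass.

    buoy: (lat, lon, time_s); alt_points: list of (lat, lon, time_s).
    """
    b_lat, b_lon, b_time = buoy
    near = [
        p for p in alt_points
        if haversine_km(b_lat, b_lon, p[0], p[1]) <= max_km
        and abs(p[2] - b_time) <= max_minutes * 60.0
    ]
    return len(near) >= min_points


if __name__ == "__main__":
    track = [(30.0 + k * 0.06, -150.0, float(k)) for k in range(-3, 4)]
    print(is_matchup((30.0, -150.0, 0.0), track))
```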
The values of significant wave height ($H_s$) for calibration can be extracted directly from the various data archives. However, altimeter $U_{10}$ is calculated from the radar cross-section, $\sigma_0$ (ratio of the returned to transmitted energy of the altimeter pulse), and a variety of different relationships have historically been used for different altimeters. In order to have consistent calibrations across the various altimeters, it is desirable to use a consistent $U_{10}$-$\sigma_0$ relationship [7][17]. Hence, following the method used in Zieger et al. [17] and Young et al. [7], uncalibrated wind speed was calculated from the backscatter coefficient $\sigma_0$ using the algorithm of ref. [39], given by Equations (2) and (3) with the parameters in (4). Note that in these relationships wind speed is in meters per second and radar backscatter is in decibels. Equations (2) and (3), with the values of the parameters given in (4), have been developed for Ku-band radar altimeters (13.5-13.8 GHz). The SARAL altimeter, however, is a Ka-band radar altimeter (35.75 GHz). Following Lillibridge et al. [40], for calibrated Ka-band altimeters (2) and (3) still hold, but with the coefficients in (4) replaced by those given in (5). The high wind speed relationship, Eq. (6), has recently been validated for use in extreme value analyses [42]. Quilfen et al. [43] also obtained a very similar result to Eq. (6). There is some evidence that altimeter wind speed may be a function of sea state in addition to radar cross-section [44]. The analysis here does not consider any sea state dependence, which appears small. Values of radar cross-section ($\sigma_0$) provided for each of the altimeter systems have a variety of different datum offsets. Therefore, following Young et al. [7], it is necessary to remove this offset before applying (3)-(6).
This is achieved by comparing buoy measurements of $U_{10}$ with altimeter $\sigma_0$ and determining the offset $\sigma_{\mathrm{offset}}$ which gives the best fit (in a least-squares sense) between the data and (3)-(6). Figure 3 shows this process for JASON-3 and HY-2A (the altimeter with the largest $\sigma_0$ offset, see Online-only Table 1).
A linear regression analysis is then performed between the buoy and altimeter match-up data ($U_{10}$ values). Although the buoy data are considered "ground truth" for the purposes of the calibration, such data do contain both sampling and calibration errors [7]. As a consequence, a conventional regression analysis is not appropriate. However, in such cases, a reduced major axis (RMA) regression can be used [45]. This regression minimizes the triangular area bounded by the vertical and horizontal offsets between the data point and the regression line and the chord of the regression line. This is in contrast to a conventional regression, which minimizes the vertical offset from the regression line. In addition, standard least-squares regression analysis is highly sensitive to outliers. Such outliers can be removed by the use of robust regression [46]. Robust regression assigns a weight to each point, with values between 0 and 1. Points with a weight less than 0.1 were designated as outliers and removed from the analysis before applying the RMA regression analysis.
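A minimal sketch of an RMA fit is given below. It uses the standard closed form for reduced major axis regression, in which the slope is the ratio of the standard deviations of the two variables, signed by their covariance. The robust-regression weighting step used to remove outliers is not included, and the function name is hypothetical.

```python
import math
import statistics


def rma_regression(x, y):
    """Reduced major axis regression of y on x; returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    std_x, std_y = statistics.stdev(x), statistics.stdev(y)
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # The slope magnitude is std_y / std_x; its sign follows the covariance.
    slope = math.copysign(std_y / std_x, covariance_sum)
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    buoy = [0.0, 1.0, 2.0, 3.0]
    altimeter = [0.5, 0.5, 2.5, 2.5]
    print(rma_regression(buoy, altimeter))
```

Because the RMA slope does not depend on which variable is treated as error-free, it is steeper than the ordinary least-squares slope whenever the correlation is less than perfect.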
In the type of regression analysis described above, it is desirable to have as many matchups as possible, as this narrows the confidence limits on the calibration (regression) result. Hence, it is usual to pool data from all buoys over the full duration of the altimeter mission [7][17]. However, such an "average" calibration will mask any changes in the calibration over time (drift or discontinuities in calibration). Such issues can be addressed by first calibrating the individual altimeters against all buoy data (average calibration) and then examining the differences between buoy and altimeter (with average calibration) as a function of time [7]. Young et al. [7] identified a number of such changes to calibration in their analysis of GEOSAT, ERS-1, TOPEX, ERS-2, GFO and ENVISAT. For the seven additional altimeters included here, discontinuities in the significant wave height of HY-2A and the wind speed of CRYOSAT-2 (see Fig. 4) were identified. The other altimeters do not change their calibration significantly during their respective missions. Note that the present results for HY-2A are consistent with the results of Liu et al. [47]. Figure 4 shows the difference between buoy and altimeter values of wind speed, $\Delta U_{10}$, as a function of time for CRYOSAT-2. As can be seen in this figure, a clear change in calibration occurs in mid-2014. When such discontinuities were identified in the data, a piecewise calibration was performed. That is, the altimeter was calibrated separately either side of the discontinuity. Figure 5 shows the result once the data were calibrated in this fashion, clearly removing the discontinuity.
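The piecewise calibration can be sketched as selecting one of two linear relations according to the observation time. The coefficients and break time in the example are placeholders chosen for illustration, not the values in the Online-only Tables, and assigning the break instant itself to the later period is an assumption.

```python
def piecewise_calibrate(value, time, break_time, before, after):
    """Apply a linear calibration chosen by the side of the break on which `time` falls.

    `before` and `after` are (slope, offset) pairs for the two periods.
    """
    slope, offset = before if time < break_time else after
    return slope * value + offset


if __name__ == "__main__":
    # Hypothetical coefficients and a break in decimal years.
    print(piecewise_calibrate(10.0, 2015.0, 2014.5, (1.0, 0.0), (1.1, -0.5)))
```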
In Figs. 4 and 5 there is a clear periodicity in the data, with an annual signal in $\Delta U_{10}$. As demonstrated in Young and Donelan [10], this is a result of changes in the structure of the atmospheric boundary layer caused by changes in the air-water temperature difference (atmospheric stability). These stability effects do not impact $H_s$, and no attempt has been made to correct $U_{10}$ for this effect.
As noted earlier, SARAL operates in the Ka frequency band, whereas all other altimeters operate in the Ku-band. As a consequence, the parameters in the $U_{10}$-$\sigma_0$ relations (2) and (3) were defined by (5). Examination of scatter and Q-Q plots between buoy and SARAL wind speeds showed good agreement using this approach. However, when cross-validation was carried out with other altimeters (see Technical Validation below) and the wind speed range was extended to higher values, it was clear that the SARAL calibration under-estimated higher wind speeds. This behavior is shown in the altimeter-altimeter validation Q-Q plots in Fig. 6.
In order to address this issue, the other calibrated altimeters in orbit at the same time as SARAL (JASON-2, JASON-3, CRYOSAT-2, and SENTINEL-3A) were used to determine a wind speed correction for SARAL wind speeds $U_{10} > 10$ m s$^{-1}$. The wind speed correction developed is quadratic (see Online-only Table 1). The available data for the correction were limited to $U_{10} < 24$ m s$^{-1}$. As caution should be exercised in a quadratic extrapolation, wind speeds greater than 24 m s$^{-1}$ have been flagged as "bad". As a result, the calibration relation for SARAL wind speed has been separated into two regions: $U_{10} < 10$ m s$^{-1}$ and $10$ m s$^{-1}$ $\le U_{10} \le 24$ m s$^{-1}$. The final calibration relationships for significant wave height are shown in Online-only Table 2 and for wind speed in Online-only Table 1.

Data Records
A total of 26 parameters, as outlined in Table 2, are archived in the repository [6]. Again, since SARAL is a Ka-band altimeter rather than Ku-band, the variable names have been changed accordingly. All data are stored in NetCDF files, with the records binned into 1° by 1° bins. Within each bin, full data resolution is provided, with the latitude and longitude of every 1 Hz measurement recorded. The binned storage provides a convenient mechanism for most users to access data. Data files are not included for areas where there are no data: land or ice areas, or bins where the satellite orbit meant there were no overpasses. Individual files are provided for each of the 13 altimeters, as summarized in Table 3. Data commence in 1985 and continue to 2018, except for the short break in 1990 and 1991. It should be noted that, although data in the nearshore region are provided in the database, the data are recommended for applications more than 50 km offshore. Data less than 50 km offshore which passed all other quality control processes are flagged "2" ("probably good data").
All data files are provided in NetCDF format following the IMOS (Integrated Marine Observing System) data protocols upon which the project is based [48][49]. The IMOS standard flag system is used for all data flags, where 1, 2, 3, 4, and 9 represent Good_data, Probably_good_data, SAR-mode altimeter data or hardware error (CRYOSAT-2 only), Bad_data and Missing_data, respectively. Note that CRYOSAT-2 operated in SAR mode for some geographic regions. These data have been flagged as "3" in the database. The calibrations developed for CRYOSAT-2 were not developed for SAR-mode data and hence these data should be used with caution. As the full database consists of approximately 500,000 files (Table 3), it has been stored using the following folder hierarchy: \Satellite_Name\20degree_by_20degree_subregion\NetCDF_files, e.g. \JASON1\020S_280E\IMOS_SRS-Surface-Waves_MW_JASON-1_FV02_016S-282E-DM00.nc, which also illustrates the filename format. For both the 20-degree subregion and the 1-degree NetCDF file, the latitude and longitude signify the southernmost and westernmost limits of the data region.
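As an illustration of the storage hierarchy, the sketch below builds the path of the 1° file containing a given position. The naming pattern is inferred from the single example path above, so several details are assumptions: longitudes are taken as 0 to 360 degrees east, northern latitudes are assumed to use an "N" suffix, forward slashes are used as separators, and the folder name (e.g. JASON1) and mission name (e.g. JASON-1) are passed separately because they differ in the example. The function names are hypothetical.

```python
import math


def _lat_tag(lat_deg):
    """Format an integer latitude as e.g. '016S' (the 'N' suffix is assumed)."""
    return f"{abs(lat_deg):03d}{'S' if lat_deg < 0 else 'N'}"


def bin_path(folder, mission, lat, lon):
    """Return the relative path of the 1-degree file containing (lat, lon)."""
    lon = lon % 360.0  # assumed 0-360 degrees east convention
    lat1, lon1 = math.floor(lat), math.floor(lon)  # south and west edges of the 1-degree bin
    lat20 = math.floor(lat / 20) * 20              # south edge of the 20-degree subregion
    lon20 = math.floor(lon / 20) * 20              # west edge of the 20-degree subregion
    subregion = f"{_lat_tag(lat20)}_{lon20:03d}E"
    name = f"IMOS_SRS-Surface-Waves_MW_{mission}_FV02_{_lat_tag(lat1)}-{lon1:03d}E-DM00.nc"
    return "/".join([folder, subregion, name])


if __name__ == "__main__":
    print(bin_path("JASON1", "JASON-1", -15.3, 282.4))
```

For the position 15.3° S, 282.4° E this reproduces the example path quoted in the text; paths for other bins should be checked against the archive listing before use.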
The data can be accessed in a number of ways:
(a) Static archive. A static "snapshot" of the data as described in this paper has been archived and allocated the identifier https://doi.org/10.26198/5c77588b32cc1. This is a full copy of all data at the date of submission of this publication.
(b) Dynamic archive - AODN graphical portal. As the intention is to update the data at approximately 6-month intervals, a dynamic archive is maintained at the Australian Ocean Data Network (AODN). The AODN portal can be accessed at: https://portal.aodn.org.au/. The user can access the data graphically from the portal. To find the data, the following navigation is recommended:
(i) Click the "Get Ocean Data Now" button.
(ii) Scroll to the keyword search box at the bottom left of the screen. Enter the keywords "altimeter waves".
(iii) Click on the thumbnail map of the world to the right. The graphical interface which opens allows the user to scroll to any area of the world and define a region to download with the mouse. The specific satellites to download can be specified in the menu to the left.
(c) Dynamic archive - Direct interface. The dynamic archive can also be accessed directly as an Amazon S3 archive. It is recommended that this is done using software such as Cyberduck. Instructions for setting up such access can be found at: https://help.aodn.org.au/downloading-data-from-servers/amazon-s3-servers/
Once access to the S3 server is gained, the user should navigate to:

IMOS/SRS/Surface-Waves/

The dynamic archive is in the folder: Wave-Wind-Altimetry-DM00
The static archive mentioned above is in the folder: Wave-Wind-Altimetry-DM00_C-20190228T030000Z

Technical Validation
As noted above, a further set of checks to verify the consistency and stability of the various altimeters was conducted in the form of cross-validations between altimeter missions. The same criteria as for the buoy matchups have been applied for the cross-validations (observations within 50 km and 30 min). Again, RMA regression has been performed for each cross-validation. Matchup scatter plots, probability density functions and Q-Q plots were analyzed for each combination. This follows the same approach used by Young et al. [7]. Figure 7 shows the RMA cross-validation, Fig. 8 presents Q-Q plots and Fig. 9 shows altimeter-altimeter differences as a function of time. Note that the gaps in the time series in Fig. 9 occur due to changes in the orbits of satellites over time. Such changes mean that for a period of time there will be no cross-over points which meet the required match-up criteria. Data are shown for both $H_s$ and $U_{10}$. As calibrated data are used, the scatter plots (Fig. 7) and Q-Q plots (Fig. 8) should lie along the diagonal and the altimeter-altimeter differences should be zero.
In order to analyse the performance of the cross-validations, four different statistical parameters, namely bias (B), root-mean-square error (RMSE), Pearson's correlation coefficient (ρ) and scatter index (SI), were used. These parameters were calculated based on the relations of ref. [16], in which M and O stand for model and observation, respectively.
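The four statistics can be sketched as follows. The definitions used here are common textbook forms and are assumptions rather than a reproduction of the relations in ref. [16]: bias is the mean of M - O, RMSE is the root mean square of M - O, ρ is Pearson's correlation coefficient, and SI is taken as the root mean square of the de-meaned differences normalised by the mean observation. The scatter index in particular has several variants in the literature. The function name is hypothetical.

```python
import math
import statistics


def validation_stats(m, o):
    """Return (bias, rmse, rho, si) for paired series m (model) and o (observation)."""
    n = len(m)
    mean_m, mean_o = statistics.fmean(m), statistics.fmean(o)
    bias = sum(a - b for a, b in zip(m, o)) / n
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(m, o)) / n)
    cov = sum((a - mean_m) * (b - mean_o) for a, b in zip(m, o))
    var_m = sum((a - mean_m) ** 2 for a in m)
    var_o = sum((b - mean_o) ** 2 for b in o)
    rho = cov / math.sqrt(var_m * var_o)
    # Assumed SI variant: de-meaned RMS difference divided by the mean observation.
    si = math.sqrt(sum(((a - mean_m) - (b - mean_o)) ** 2 for a, b in zip(m, o)) / n) / mean_o
    return bias, rmse, rho, si


if __name__ == "__main__":
    print(validation_stats([1.1, 2.1, 3.1, 4.1], [1.0, 2.0, 3.0, 4.0]))
```

With these definitions, swapping the two series changes the sign of the bias, which is why the bias tables must be read in a fixed row-to-column direction.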
A summary of the statistical parameters is provided in Tables 4 to 7. It should be noted that C-2, HY2, J-1, J-2, J-3, SA, and S-3 are abbreviated forms of CRYOSAT-2, Hai Yang-2A, JASON-1, JASON-2, JASON-3, SARAL and SENTINEL-3A, respectively. Moreover, whilst other statistical parameters are commutative, bias is not. Hence, in order to read the bias in Tables 6 and 7, one has to read it from row to column.
All the statistical parameters shown in Tables 4 to 7 indicate consistent performance between the altimeters. All values of the wave height correlation coefficient (Table 4) are close to one. Similarly, RMSE values (Table 4) are small, with all values less than 20 cm. Bias and scatter index (Table 6) are also very small, being less than 12 cm and 0.1, respectively. Similar to the cross-validation for significant wave height, the cross-validation for wind speed indicates excellent agreement, with the exception of HY-2A (Tables 5 and 7). Although the HY-2A wind speed data are clearly of lower quality than those of the other altimeters, it is still likely that they will be acceptable for most applications. SENTINEL-3A is a new-generation SAR-mode altimeter and operates in this mode at all times. As a result, the radar return is no longer Gaussian, which can introduce biases due to swell and the track angle relative to the swell direction [50]. The results in Tables 4 to 7 indicate that this platform produces error statistics comparable to the other platforms when averaged over all geographic regions. It is possible, however, that its performance in swell-dominated regions may differ from the other platforms. These potential regional differences have not been explored in the present work. It should be noted that NaN values in the tables indicate that the corresponding cross-validations do not satisfy our matchup criteria. This does not mean that the altimeter pairs have no matchups at all.
Scientific Data | (2019) 6:77 | https://doi.org/10.1038/s41597-019-0083-9 | www.nature.com/scientificdata/
A number of other studies have used this dataset for global studies. Young and Donelan [10] examined the global climatology of $U_{10}$ and $H_s$ and found the data to be consistent with buoy and model reanalysis results. Takbash et al. [42] used the data to examine global extreme value wind speed and wave height (i.e. 1 in 100 year values). They show that the data produce extreme value estimates consistent with buoy data. That study indicates that, although the present calibrations are limited to values of $U_{10} < 24$ m s$^{-1}$ and $H_s < 9$ m, the tails of the respective probability distribution functions remain valid above these limits. Young and Ribal [51] used the data to investigate trends in wind speed and wave height. This is a particularly demanding analysis, as it requires long-term stability of the data. The results show that wind speed and wave height trends are consistent in both magnitude and spatial distribution and that the wind speed trends are consistent with radiometer and scatterometer data.

Code Availability
All data are available as NetCDF files. The calibration process described in this paper and the production of the NetCDF files were undertaken using Matlab scripts written for this purpose. This code is available from the corresponding author upon request.