Trends and variability of the atmosphere–ocean turbulent heat flux in the extratropical Southern Hemisphere

Ocean–atmosphere interactions are complex and extend over a wide range of temporal and spatial scales. Among the key components of these interactions is the ocean–atmosphere (latent and sensible) turbulent heat flux (THF). Here, based on daily optimally-interpolated data from the extratropical Southern Hemisphere (south of 30°S) from a period 1985–2013, we analyze short-term variability and trends in THF and variables influencing it. It is shown that, in spite of climate-change-related positive trends in surface wind speeds over large parts of the Southern Ocean, the range of the THF variability has been decreasing due to decreasing air–water temperature and humidity differences. Occurrence frequency of very large heat flux events decreased accordingly. Remarkably, spectral analysis of the THF data reveals, in certain regions, robust periodicity at frequencies 0.03–0.04 day−1, corresponding exactly to frequencies of the baroclinic annular mode (BAM). Finally, it is shown that the THF is correlated with the position of the major fronts in sections of the Antarctic Circumpolar Current where the fronts are not constrained by the bottom topography and can adjust their position to the atmospheric and oceanic forcing, suggesting differential response of various sections of the Southern Ocean to the changing atmospheric forcing.

summarizes the results of the basic comparative analysis between the OAFlux and the SOFS meteorological/hydrological data and the turbulent flux components (unfortunately, the SOFS measurements lie outside of the data coverage of the ISCCP data set -which ends in 2009 -and hence no analogous comparison for the radiative flux data can be made). Fragments of the analyzed time series in the period 17 Mar 2010-13 Mar 2011 are shown in Supplementary Fig. 1. In the table, r 2 denotes the Pearson correlation coefficient and σ -standard deviation of differences. Supplementary Fig. 2 compares the cumulative distribution functions from the SOFS and OAFlux data set.
As can be seen, the OAFlux air-temperature values are slightly higher than the measured ones almost throughout the whole analysis period, with a positive bias of 0.42 • C. This manifests itself in a negative bias in F sh , as F sh is strongly negatively correlated with T a (see further Supplementary Fig. 20).
The range of variability of q a in the OAFlux data is underestimated, similarly as the range of variability of F lh , which is a function of the air humidity. No significant biases or deviations are present in the wind speed datathe pdfs of U 10 are almost identical (very high p-values of the K-S test).
Overall, at the SOFS location the OAFlux latent heat flux tends to be underestimated within the range of low values, and overestimated within a range of high values ( Supplementary Fig. 2e).
Noticeably, both datasets show a seasonal cycle in both air and water temperature, as well as in the air humidity, but no seasonal cycle is present in the turbulent heat flux time series, dependent on the difference T s − T a and q s −q a (Supplementary Fig. 1). In the case of the THF, short-term, synoptic variability dominates over longer-term, seasonal signal.   Figure 2: Cumulative distribution functions of the meteorological variables (T a , T s , q a , U 10 ) and the turbulent heat flux (F lh , F sh ) at the SOFS station (magenta point on the map in Fig. 1 in the main text): measured (blue), from the OAFLux data set (red), and from the NCEP-DOE data set (green).

S1.2. Comparison between different data sets
Unfortunately, the satellite-data-based GSSTF.3 and HOAPS products are available only until the end of 2008 (see the list in the introduction to this note) and therefore cannot be validated against the SOFS measurements . The results of  validation of the NCEP-DOE data are shown in Supplementary Figs. 1, 2 and Supplementary Table 2.
The time series in Supplementary Fig. 1, and the cdfs in Supplementary Fig. 2e,f show that the most evident feature of the NCEP-DOE data is overestimation of the extremes, especially positive extremes of F lh and negative extremes of F sh . Both effects are strongly related to overestimated variability of wind speed (the standard deviation of differences σ between the modeled and measured wind speed reaches almost 2 m/s), as well as to bias in air temperature and humidity (0.78 • C and 0.31 g/kg, respectively). As a result, almost all statistics describing the accuracy of F lh and F sh are considerably worse for NCEP-DOE than for OAFlux. In particular, whereas the bias for both OAFlux THF components is negligible, it exceeds 21 W/m 2 for the NCEP-DOE latent heat flux. The standard deviation of differences of both NCEP-DOE THF products is more than 60% higher than for analogous OAFlux products.
In an attempt to assess the quality of the other two data sources, GSSTF.3 and HOAPS, we first determine the cdfs of F lh and F sh from the OAFlux data for three different time periods: [2004][2005][2006][2007][2008][2009][2010][2011][2012][2013], and the third one corresponding to the period of availability of the SOFS observational data (exactly as analyzed before). Supplementary Figs 3d and 4d show that the respective cdfs hardly change, suggesting that, first, the data from the "SOFS-period" are representative for a longer time period spanning 5 years, and second, there is no significant trend in the OAFlux fluxes between 2004-2008 and 2009-2013. Combined with the above-demonstrated high quality of the OAFlux product at the SOFS station, we may formulate a hypothesis that the F lh and F sh from OAFlux in the years 2004-2008 reflect the real-world variability of these variables at the analyzed location. We may then compare the other products to OAFlux in period 2004-2008 (in which all four data sets are available), as shown in Supplementary Figs 3

and 4.
Among the four data sets, the slope of the cdf of F lh is steepest in the case of OAFlux, i.e., the range of variability of these data is the lowest. The scatterplots in Supplementary Fig. 3b,c show also that the F lh values of the two satellite-derived products are limited from below (by zero in HOAPS and −18 W/m 2 in GSSTF.3). The shapes of the clouds of dots suggest that the algorithms produce values lower than these limits which are then adjusted to the allowed range. In the case of F sh , the HOAPS and, especially, GSSTF.3 cdfs agree well with that of OAFlux within the range of negative values, whereas for positive F sh only the NCEP-DOE cdf is close to that of OAFlux ( Supplementary Fig. 4d). Here, again, the range of variability of the NCEP-DOE data is the largest among the four data sets; it is smallest in the case of GSSTF.3, where values larger than ±50 W/m 2 occur only sporadically.
The lack of observational data from other locations makes it impossible to assess the quality of the four datasets in other parts of the domain of study. However, it is worth noting that the relationships between different THF products are similar everywhere -as the histograms of F lh and F sh in Supplementary Fig. 5 clearly show. They were calculated for pairs of values from the whole data sets, i.e., each histogram represents over 65 millions of points from the whole domain and from the period 1987-2008, common to all data sets. They have analogous features to those discussed above (note that the color scale is logarithmic, so that the scatter of points seems larger than it really is). In summary: (i) the NCEP-DOE F lh product has larger variance than that of the other products, due to higher extremes; (ii) the deviations between the OAFlux and NCEP-DOE F lh are smaller than those between OAFlux and the remaining two products; (iii) the values of F lh in HOAPS and GSSTF.3 data are cut at the prescribed lower limit; (iv) the values of the NCEP-DOE F sh are similar to those from OAFlux for F sh > 0, but considerably larger (in terms of amplitude) for F sh < 0; (v) the differences between the pairs of F sh products are largest for OAFlux-HOAPS data; (vi) contrary to the remaining products, the values of the HOAPS F sh are never smaller than −50 W/m 2 , presumably due to some "limiter"; (vii) the range of variability of the GSSTF.3 F sh is Supplementary  considerably lower than that of the OAFlux data. Additional analysis (not shown) confirmed that these observations have location-independent character. For example, the ratio of the variance of F lh between the OAFlux and NCEP-DOE products is lower than 1 everywhere in the domain of study; similarly, the ratio of the variance of F sh between the OAFlux and GSSTF.3 is everywhere higher than 1.
Obviously, the above-described observations cannot be used to justify statements concerning the quality of the analyzed data sets in different parts of the area of study. The existence of certain general relationships between the particular data sets presumably results from certain assumptions and parameterizations underlying the algorithms, different types/sources of data used as input, etc. However, the performance of those algorithms, and thus the quality of the final products, may be spatially variable, and validation at one location -the SOFS station -remains insufficient to estimate that quality elsewhere.      Supplementary Fig. 7, but for a profile at 60 • W (line E on the map in Fig. 1 in the main text).  Fig. 13, but for the profile at 315 • E (line F on the map in Fig. 1).

Detection of periodicity in the analyzed spectrograms
In terms of the dependence of the power spectral density P on frequency f , the spectrograms analyzed in this work can be divided into two groups: power law, with P (f ) ∼ f −α , α > 1 (for α = 2 we speak of a red-noise spectrum); and exponential, with P (f ) ∼ exp(−αf ). Spectra of Ts, Fsw and F lw belong to the first group, spectra of F lh , F sh , Ta, U10 and qa -to the second group (Fig. 7 in the main paper and Supplementary Figs. 16, 18).
For white noise spectra, i.e., those with frequency-independent power density, an established, exact method exists for testing the significance of periodic variations against the background noise -the Fisher's g-statistic (see, e.g., [14]). However, no analogous exact method exists for non-white noise spectra. Established tests, based, e.g., on smoothing of the periodogram, are biased, because they have a strongly non-Gaussian distribution which formally makes them inappropriate for standard least-squares goodness-of-fit test.
In this work, two "measures of importance" of the peaks associated with the BAM signal are used, as detailed below. When interpreting the results, it should be kept in mind that the final spectrograms are averages of 513 spectrograms estimated from the subsets of data; additionally, they have been smoothed after averaging. This procedure, possible thanks to the sufficient length of the analyzed time series, tends to smooth out spurious peaks and leave only the robust ones, repeatedly occurring in the majority of the individual spectrograms. The above-mentioned tools used in the analysis are: 1. The peak-enhancement factor, or the peak prominence λp, a characteristics similar to the prominence ratio used in acoustics (again, suitable only for approximately-white noise spectra), here defined as the ratio between the maximum power density within frequency range 0.0313-0.0391 day −1 (BAM) to the minimum power density within frequency range 0.0166-0.0293 day −1 . For spectra with power density decreasing with f -including power-law and exponential spectra analyzed here -it is expected that λp < 1. The larger λp, the higher the peak (or, more precisely, its "left" slope in the spectral plot).
2. The height of the BAM-peak above the upper 95% prediction bounds associated with an exponential/power law fit to the spectrogram. (It is important not to confuse the prediction bounds for new observations with the confidence bounds for the coefficients of the fitted function. The prediction bounds measure the confidence that the new observation lies within the interval given a single predictor value -in this case, the BAM frequency.) The procedure for each analyzed spectrogram was as follows: • Remove from the spectrogram data points corresponding to the BAM frequency range in order to obtain the fit independent of these data points.
• Fit an a exp(−αf ) or af α function to the spectrogram (depending on its shape), find the 95% confidence bounds on the fit coefficients and analyze the histogram of residuals to estimate the accuracy of the fit.
• Find the 95% prediction bounds for new observations for the whole frequency range.
• Check whether the spectrogram data within the BAM frequency range lies above or below the upper prediction bound.
The example results for the zonally-averaged spectrograms of F lh (those shown in Fig. 7a in the main text) are shown in Supplementary Fig. 16.
Once more, neither of the two methods provides a final, conclusive answer regarding the significance of the BAM-peak. Moreover, the second methods seems too strict in situations when the BAM-peak is the global maximum of the spectrum, but does not exceed the prediction bounds -like in the example shown in Supplementary Fig. 16c. Nevertheless, our attention should be directed to regions where both methods suggest the existence of the BAM signal. An additional indication of the robustness of the results provide the NCEP-DOE data, from which a very similar spatial pattern of λp is obtained, as shown in Supplementary Fig. 17. The spectrograms for zonally-averaged NCEP-DOE data are also very similar to those obtained from the OAFlux data (not shown). Supplementary Figure 17: Maps of the peak-enhancement factor λ p for the spectra of F lh (a) and F sh (b) obtained from the NCEP-DOE data, analogous to those shown in Fig. 8 in the main text.