Regression-based gap-filling methods show air temperature reductions and wind pattern changes during the 2019 total eclipse in Chile

Singular disruptive events like solar eclipses affect the measured values of meteorological variables at the earth’s surface. To quantify such an impact, it is necessary to estimate what value the parameter would have taken had the event not occurred. We design and compare several methods to perform such an estimate based on longer observational timeseries from individual meteorological surface stations. Our methods are based on regularised regressions (including a Bayesian variant) and provide both a point an associated error estimate of the disruptive event’s impact. With their help, we study the effect of the total solar eclipse of July 2nd, 2019, in the Coquimbo Region of Chile, on near-surface air temperatures and winds. The observational data used have been collected by the meteorological surface station network of the Centro de Estudios Avanzados en Zonas Áridas (CEAZA). Most stations inside the eclipse’s umbra registered a temperature drop of 1–2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{\circ }$$\end{document}∘C, while the most extreme estimated temperature drop surpassed 6 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{\circ }$$\end{document}∘C. The presence of an ‘eclipse cyclone’ can neither be proven nor refuted. Application of the regression methods to other comparable problems like volcanic eruptions, forest fires, or simply gap filling of observational data, are conceivable.

. 'Contact' times of the eclipse of July 2nd, 2019, calculated for La Serena (29°54 ′ 16.3 ′′ S 71°14 ′ 56.2 ′′ W) on http:// xjubi er. free. fr/ en/ and rounded to the nearest minute, in local time . Also given are the time of sunset (calculated on https:// plane tcalc. com/) and time ranges referred to in the text: ' t • ' refers to the times removed from statistical calculations to avoid training data containing effects of the eclipse, and ' u , v ' refers to the time span over which wind data is averaged.

Results
Terminology. All discussions below apply to timeseries of data from a single sensor, in particular screen level air temperature. We think of such a timeseries as a data matrix X dt indexed by a calendar date d (rows; 'date axis') and a time t (columns; 'time axis'). The time period affected by the eclipse will be given the symbol t • ; this includes the full duration of the eclipse (C1-C4), plus an additional time span beyond C4 to account for some dynamical lag in the system (see Table 1). Conversely, all times not affected by the eclipse are symbolised by t • . Analogously, the day of the eclipse will be denoted d • and all other days by d • . When we talk about an average over a number of days, we mean that the averaging is performed over the date dimension and the result is a onedimensional array indexed by times (a daily cycle). All data used in this study comes from the meteorological surface station network operated by the Centro de Estudios Avanzados en Zonas Áridas (CEAZA) in La Serena, Chile, from here on called the CEAZAMet station network.
Evaluation of the regression methods with respect to 2m air temperatures. Our methods are generalisations of two types of estimates found in the literature: (1) averages over a number of 'similar' days, and (2) smoothed observations from the day of the eclipse with interpolated values for t • 1,2 . Approach (1) can formally be seen as a regression with selected days d ∈ d • serving as predictors and equal regression weights which sum to 1. Dropping the restrictions on the weights and allowing for an intercept generalises the approach, and any regression method with subset selection replaces the task of choosing 'similar' days with an objective procedure. Including an intercept can be interpreted as allowing for slowly-varying (synoptic) background conditions. Since we need to judge the similarity of days without considering t • , this regression is trained on t • even though we are interested in t • , for which the model is simply evaluated. Approach (2) can equally be cast as a regression, with t ∈ t • as targets and t ∈ t • as predictors, trained on examples d ∈ d • . This procedure is equivalent to a smoothing along the time axis with the kernel's weights fit by regression instead of having been determined a priori. We will call the two approaches 'average' and 'smoothing' type regressions, respectively. In both cases we can make use of the additional information we have from years of monitoring data at the stations at which we observe the eclipse; we use approximately 4 years of data preceding the eclipse, since for this time period, most CEAZAMet stations in the eclipse's umbra have been collecting data at a 5 min sampling rate.
We use the lasso as a regression method with subset selection (for both the 'average' and the 'smoothing' type); the predictors are selected by including an ℓ 1 penalty on the vector of regression weights during the loss minimization, which effectively sets many of the weights to zero 54 . For comparison, we evaluate unweighted averages over N days similar to the eclipse day ('N-averages' from now on), as well as standard smoothing procedures (local regression, in both linear 5 and quadratic variants). Similarity between predictor and target days for the N-average is assessed as the pairwise Euclidean distance in the space spanned by times t • . Uncertainties for all methods can be quantified empirically by estimating the values for all t ∈ t • on all days d ∈ d • and subtracting the corresponding actual observations. This gives an error distribution for each time t ∈ t • . For the smoothing regression-the only method where the quantity of interest to us is actually the target-care needs to be taken to separate training and test data. To this end, we use four-fold cross-validation, so that approximately 1 year of data is in the test set at any time, and we calculate the error for a day only if it is in a test set. Figures 1 and 2 show the regression-based estimates and those obtained from comparison methods, respectively, for air temperature on the day of the eclipse at a few selected stations in and around La Serena. The box plots show the temperature depression T of the observation relative to the reference estimate. The plot elements representing uncertainty (shaded bands, boxes, whiskers and fliers) are constructed by adding the error distributions computed as described above to the point estimate. The regression methods' uncertainty is much reduced relative to that of the comparison methods; some but not all estimates of T fall outside of 95% of the calculated error distributions.
Besides the empirical estimates for uncertainties, we also employ a Bayesian regression technique which estimates a full posterior distribution. Certain specific regularisation penalties in classical regression correspond to particular distributions of priors over the parameters in a Bayesian setting 54 , and we chose automatic relevance determination (ARD) 55 as a method that selects predictors based on the posterior distributions of their associated coefficients. The agreement between the estimates and their uncertainties obtained from the three methods (Figs. 3, 4, 5) is encouraging. Note that a Bayesian regression would be meaningless in the 'average' direction, since the time range of interest t • is not actually the target in that case.
Stronger temperature depressions due to the eclipse are generally found further inland (Figs. 3, 4), which is expected due to the thermal inertia of the sea. Only one station close to the coast shows a larger T estimate (Fray Jorge Quebrada)-but it should be noted that the timing of the largest temperature drop is late there compared to most stations (Fig. 5), suggesting it may have been related to the sunset rather than the eclipse (the station is separated from the sea and setting sun by a mountain range). There is a hint of a suggestion that moving away from the center line of the eclipse reduces the experienced temperature drop, especially outside of the region of totality, but the proximity to the coast appears dominant. The largest estimated temperature drop of more than 6 • C (Cachiyuyo) might raise suspicions, but it turns out that its magnitude is largely a result of its occurrence at the end of the day, when temperatures are dropping very fast even on regular days and the eclipse's effect is more akin to an earlier onset of nightfall (cf. Fig. 7).
The eclipse wind. Encouraged by the results for the surface temperatures, we apply the smoothing type lasso also to the wind field. The average type regression gives quantitatively very similar results but is both much more costly computationally (at least if uncertainties are to be estimated) and harder to analyse since its setup differs from regular regression frameworks in that the information of interest (times t • ) are not actually the regression's target. The results (not shown) with respect to uncertainties are very similar to those for the tem- www.nature.com/scientificreports/ perature case, with some but not all stations achieving a statistical significance of the estimated eclipse effect at 95%; we performed the ARD variant of the regression as well with comparable results. Instead of focusing on uncertainties as in the temperature case, we simply consider the point estimates for the change in the wind field induced by the eclipse. We applied the regression to the u, v wind vectors separately and calculate an 'eclipse wind' following Clayton 3 by subtracting the reference estimate from the observations on eclipse day. A two-dimensional spatial picture can be constructed, again following Clayton, by plotting a station's relative latitudinal distance to the center path of the eclipse in y, and replacing the time of an observation by a virtual distance along the eclipse's path calculated with the assumption of a propagation speed of the eclipse of approximately 2000 km h −1 (Fig. 6).
Clayton hypothesised that an eclipse provided an optimal occasion for the assessment of earlier theories by Ferrel regarding the structure of cyclones with a cold core 56 . The essence of the argument is that a cool surface temperature anomaly will lead to surface outflow of air and replacement by subsiding motion, with ensuing inflow aloft and possibly a closing of the vertical cell at some distance from the center. Outflow is associated with anticyclonic and inflow with cyclonic rotation, but air which has acquired cyclonic angular momentum aloft will conserve it when subsiding. The outflow therefore has to first overcome the existing cyclonic momentum before crossing a circle of zero rotation and turning anticyclonic at some distance from the center of subsidence. Clayton surmised, however, that because of friction, the cyclonic rotation of the innermost ring would be too weak to be observable, and focused on demonstrating an anticyclonic sense of rotation further away.
The wind field is very variable and a result of many interacting forces, including local to continental-scale contrasts in surface heat fluxes and the effects of synoptic weather systems. The signal-to-noise ratio of any eclipse-induced effects may be rather low, and Clayton's results have not been reproduced convincingly 2,5 . Instead, where changes have been demonstrated, explanations other than his proposed 'cyclone' have been favoured 4 . It should also be noted that Ferrel's and Clayton's ideas about cyclones predate the modern understanding of midlatitude and tropical cyclones as primarily the result of dynamical instabilities by half a century.
A noteworthy feature of the eclipse wind in Fig. 6 is the 'bow shock' on the eastern (C1) front, where air seems to move out of the way of the arriving eclipse. Since the eclipse travels with supersonic speed, it is impossible for any eclipse-caused outflow to result in air parcels moving ahead (eastward) of the eclipse. Remember, however, that while the frame of reference for the wind vector locations in Fig. 6 can be considered to be moving with the eclipse, the same is not true for the coordinate system of the quivers, which is both isotropic and static. The www.nature.com/scientificreports/ fact that eastward flow occurs at some stage during the eclipse is therefore not a contradiction. (As an aside, the strong anisotropy of the virtual along-path direction related to the eclipse's supersonic movement casts doubt on Clayton's original argument and his Fig. 4, which represents the eclipse cyclone as a fairly symmetric structure.) With much imagination, one might discern an anticyclonic movement of the air masses following C1, which later turns into a cyclonic movement around the time of totality. This would imply an initial anticyclonic outflow, followed by subsidence which carries with it a cyclonic vorticity stemming from the required inflow aloft. It could further be argued that the flow field appears more turbulent after the passing of totality than before, analogously to that in the wake of a fast-moving solid object. Without more detailed analysis, which is beyond the scope of this study, we cannot confirm or refute any of the preceeding interpretations; and others are equally possible: for example, it appears that zones of converging and diverging flow alternate in the penumbral zone, with convergence on the northern edge coexisting with divergence to the south and vice versa. Similarly strong gradients in surface heating as across the edge of the umbra may be found over ocean eddies 57 and fronts 58,59 , resulting in wind field adjustments which could be compared to the eclipse case. However, the supersonic movement of the umbral shadow is likely to result in a very different dynamical adjustment of the atmosphere.
In the Coquimbo Region, the wind field is influenced by the Southeast Pacific Subtropical Anticyclone, passing frontal systems, a strong land-sea breeze and equally strong orographically driven wind systems. Around La Serena, both the land-sea and the mountain-valley systems tend to produce westerly winds during the day and easterlies at night, but differences in reversal timing may lead to complex interactions. The eclipse occurred just before sunset, and an earlier-than-usual reversal of the daily wind systems due to the associated decreased surface heating is consistent with the mostly easterly direction of the eclipse wind in Fig. 6.

Discussion
After witnessing the eclipse in La Serena on July 2nd, 2019, we were interested in finding out how much the atmosphere's surface layer had cooled during the event. Observers reported that they felt colder during the eclipse than before, but that does not necessarily imply that the air temperature dropped considerably, since much of the sensation may be due to an experienced reduction in direct radiative heating of the body. CEAZAMet operates a monitoring network with many stations inside the belt of totality of this eclipse which provided us with near-surface air temperature data. However, no 'obvious' pattern of change, such as a pronounced dip in the temperature curve, is consistently discernible at all or a majority of the stations. The eclipse occurred so close to sunset that temperatures had started dropping before C1 and little radiative energy was available after C4 to raise temperatures again. Furthermore, any features of the data from the day of the eclipse which could potentially be www.nature.com/scientificreports/ interpreted as dips in temperature do not stand out against the general background of intra-diurnal variability in the Coquimbo Region. This is because the dominantly southerly wind alternates frequently between a slight on-and offshore flow, which brings with it strong advective temperature changes since a cold ocean borders a strongly heated land surface.
We therefore chose to focus on the problem of how to separate the effect of the eclipse from other circumstantial influences on the observed meteorological variables. The reference estimate of how a meteorological variable would have behaved on the same day but in absence of the eclipse is speculative, and we cannot confidently use its point value without some idea of how uncertain it is. While some previous studies give confidence intervals on certain specific estimates 4,5 , these are not related to the uncertainties surrounding the actual effect sizes at specific times and locations. Here, we calculate uncertainties and their bias (for the lasso estimates) empirically on the basis of longer-term records from each of the stations at which we observed the eclipse, by performing our estimation procedure for any non-eclipse day and calculating the errors with respect to the true evolution of that day. The error distribution thus pertains to any arbitrary day of the year and includes uncertainties arising from annual-scale variability; in areas with a more pronounced annual cycle than the Coquimbo region, we would expect the estimates to exhibit larger uncertainties, which in turn could be counteracted by restricting the training data set to days closer to the day to be predicted (in terms of the day of year, DOY) if longer-term records are available.
Calculating the errors associated with particular estimates allows for selecting one with a low uncertainty. The lowest test errors are achieved by the smoothing lasso and the ARD regression (Table 2), but the error for ARD is taken to be simply the posterior standard deviation and may not be entirely accurate since ARD assumes the data to be Gaussian-distributed. It is convenient that the smoothing-type estimates perform the best, since they are both more time-efficient than the average-lasso and take the form of a regular regression. Therefore different regression methods can be used interchangeably and analysed with well-established tools such as crossvalidation. All the methods we have described possess an adjustable parameter which we select to approximately minimise a test error; for the lasso regressions it is the regularisation parameter α , for the unweighted average it is the number of days N , and for the local regressions it is the kernel width. The estimates are not overly sensitive to the parameter values, and a complete numerical optimization over all stations is computationally expensive for some methods. Additionally, different stations' estimates have rather different error levels and a general loss www.nature.com/scientificreports/ function would need to be able to weight those appropriately. Therefore, we simply carry out a grid search for a selection of stations and choose roughly appropriate parameter values. It is conceivable that methods with better performance can be found. We limit ourselves to predicting diurnal temperature cycles by using other diurnal temperature cycles as training data. A regression framework is more typically used to relate one variable to another, and this could certainly be done here; however, contemporaneous multi-variable regressions cannot be used since additional independent variables may also be affected by the eclipse. To the extent that the values of other variables may be able to differentiate locations in state space characterised by very similar temperatures, it could be useful to include them. Similarly, it might be possible to exploit autocorrelations on an inter-diurnal scale. We experimented with non-linear regression methods (Gaussian Processes, Neural Networks) for the more appropriate smoothing setup but found no improvements, which is explained by the dominant influence of only the two data points immediately before and after the prediction period t • . We also experimented with using the solar elevation angle β as a coordinate instead of time, since it may be expected that it normalise annual variability and guard against the impact of varying day lengths to some extent, but similarly found no improvement to the predictive accuracy of our models while significantly complicating the analysis.
A physics-based model could in principle be initialized to the state of the atmosphere at the start of the eclipse and run predictively with eclipse-unaware radiative forcing in order to provide a reference estimate. If all necessary physical parameters are measured for a specific location, this could be a relatively confined boundary layer energy balance model. However, in our case we have little knowledge of important parameters such as surface albedo, boundary layer thickness and longwave radiation. As we have previously stated, advection plays an important role in the surface energy balance, and either a crude observational estimate based on widely spaced station measurements or a full numerical weather prediction (NWP) model is needed in order to account for it. However, such a physics-based modeling exercise represents a major effort, and we devised our statistical methods primarily in order to provide a simpler alternative. It is furthermore not clear whether the results from a physical model would be less uncertain than those of a statistical one, in particular under the circumstances described for the Coquimbo Region: relatively small swings between on-and offshore winds can have major effects on local temperatures, and the horizontal resolution of NWP models is coarse compared to the complexities of the terrain. Even more importantly, adequate data for model initialization is not available.

Methods
In terms of the data matrix X with rows corresponding to days and columns to times, the 'average' type regression estimate for d • is given by where β is an estimate for the (column) vector of regression coefficients β and Ŷ is the regression estimate. This expression corresponds to an unweighted average over N (selected) days if β contains N elements with value 1/N and otherwise zeros. Viewed as a regression, an intercept is included by adding a column of ones to X T , and the estimate β d • t • for β is obtained by minimising a loss function. We denote with the subscripts the fact that β is estimated by using the observation at d • but only the times t • (excluding the eclipse) as a target. The 'smoothing' approach consists in using the columns of X corresponding to t • as the regression's targets and all remaining columns X dt • as predictors: In this case, all non-eclipse days d • on record serve as training samples, and estimates are obtained for any day but only for t • . The regression coefficients β in this case correspond to weights of a smoothing kernel applied to the day of the eclipse, where the values at t • have been removed from and filled in by the smoothing procedure.
We calculate a distribution over prediction errors by subtracting the observed value at t • for all non-eclipse days from the corresponding estimates:Ŷ   The lasso, which we apply to both regression problems, is a regularised regression whose loss function contains a ℓ 1 penalty on the vector of regression coefficients β . It is this form of the penalty that leads to the coefficients for less influential predictors to be set to zero, and thereby to subset selection 54 . The penalty is scaled by a hyperparameter, denoted α here, which controls how many predictors are culled: the larger the value of α , the more coefficients become zero, and the lower the resulting model complexity. We choose its value by approximately minimising a root mean square test error (RMSE) computed by averaging PE 2 over t • and d • . The regression estimate itself is not overly sensitive to the precise value of α , and so we simply choose a single value that is approximately optimal for a number of meteorological stations (see Table 3).
We compare the lasso estimates with estimates based on simple averaging and smoothing. For the averaging, the data matrix's date dimension is first ordered according to the mean square difference over times t • between each day and eclipse day. Then, the N closest days, in this Euclidean distance sense, are averaged. A lower value of N corresponds to a larger value of α (stronger regularisation and fewer non-zero coefficients). As with α , we select a value for N which approximately minimizes the average test error over all stations.
For smoothing, we use local linear and quadratic regression with a Gaussian kernel 54 . The period t • is removed before the smoothing estimate is computed and the resulting gap filled in by the estimate. The governing parameter for local regressions is a length scale, which is again fit by approximately minimizing the average test error over all stations.
An additional regression type, linear Bayesian regression with automatic relevance determination (ARD), is applied to the smoothing problem. The prediction error is here estimated in the form of a full predictive distribution rather than empirically for a point estimate. Since the eclipse period t • is what we want to compute any errors for and the average type regression does not use it as a target, the Baysian regression does not really provide any useful information in this case. ARD regularises complexity by pruning predictors whose distributions' precision (the inverse of the standard deviation) exceeds a threshold value 55 . In contrast to the lasso, however, the numerical value of this threshold has almost no effect on the results and can be left at a default setting (see Table 3). Note, however, that the implementation used 60 supposes Gaussian distributions for both the data and the regression coefficients (and therefore also the predictive error distribution), an assumption that is only approximately valid in our case. Table 2. Test errors over all stations and all non-eclipse days for the methods discussed in this paper. Since we perform ARD only for a subset of times out of t • , we restrict all errors to this subset. The ARD error is the posterior standard deviation averaged over times and stations. The other errors are computed to be comparable with this, as the RMSE over days in d • , averaged over times and stations.  www.nature.com/scientificreports/ Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.