Gap-free 16-year (2005–2020) sub-diurnal surface meteorological observations across Florida

The sub-tropical, flat, peninsular region of Florida has a distinctive climate with extreme weather events that affect agriculture, public health, and the management of natural resources. Meteorological data at high temporal resolution, especially in tropical latitudes, are essential for understanding diurnal and semi-diurnal variations of climate, which are considered fundamental modes of climate variation of the Earth system. However, many meteorological datasets contain gaps that limit their use for model validation and detailed observational analysis. The objective of this paper is to apply a set of gap-filling strategies to develop a gap-free dataset of 15-minute observations for the sub-tropical region of Florida. Using data from the Florida Automated Weather Network (FAWN), linear interpolation, trend continuation, reference to external sources, and nearest-station substitution were applied to fill the data gaps, with the method chosen according to the extent of each gap. The outcome of this study is a continuous, publicly accessible set of surface meteorological observations for 30 FAWN stations at 15-minute intervals for the years 2005–2020.


Background & Summary
Rainfall, temperature, wind patterns, and solar radiation are key meteorological variables for characterizing climate variability and change. For example, rainfall data at high spatial and temporal resolution are necessary for the development of hydrological models, flood risk assessment, land management, and climate model validation [1][2][3][4]. In sub-tropical, flat regions such as Florida, slight seasonal climate shifts can have drastic impacts on flooding, agricultural production, and public health [5][6][7][8]. Florida is a sub-tropical region where average air temperatures are fairly uniform across the state in summer and increase from north to south in winter [8]. During summer months, average temperatures are typically between 24 and 28 °C (297–301 K) across the state; in winter, northern Florida averages around 7–13 °C (280–286 K), while the southern part of the state averages around 15–19 °C (288–292 K) [9]. Elevation in Florida ranges from sea level to about 105 m above sea level [10]. Owing to its peninsular geography in subtropical latitudes and its interactions with relatively warm oceans, Florida has a climate distinct from that of the rest of the United States [11,12]. Its wet season is closely linked to freshwater availability and ecosystem functioning, and as the state's population continues to grow, further strain is placed on its natural resources [11].
Within the climate system, diurnal and semi-diurnal variations represent a fundamental mode of variability [13]. They are generated by diurnally varying solar heating that acts near the surface, through the depth of the troposphere, and in the stratosphere, and they manifest as pronounced oscillations with periods of approximately 24 h (diurnal) and 12 h (semi-diurnal). In the upper atmosphere these periodic oscillations are also called atmospheric tides, and they significantly affect the diurnal and semi-diurnal variations of many climatic variables [14]. The fidelity of numerical climate and weather models and reanalyses is often assessed by their ability to represent diurnal scales, because these are a fundamental mode of variation of the climate system (e.g., [15][16][17][18]). However, a major limitation in verifying how well these models simulate diurnal cycles is the lack of data that robustly resolve diurnal variations [19]. To examine these variations in sub-tropical areas such as the state of Florida, continuous data are needed at sub-diurnal (hourly or finer) temporal resolution and fine spatial resolution.
The Florida climate is representative of a trade wind regime found at latitudes within about 25° north and south of the equator, including monsoon regions such as India and Vietnam [11,20,21]. Such areas often lack a high-density observational network, but for Florida many climate and weather datasets provide information at varying spatio-temporal resolutions. For example, the Florida Climate Center (footnote I) provides daily precipitation and temperatures for approximately 100 stations across Florida, and hourly local climatological data (LCD) are available through the National Oceanic and Atmospheric Administration (NOAA) [22,23]. These datasets lack sub-hourly temporal resolution, which limits their applicability. In addition, there are two precipitation-only data sources. The first, Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (IMERG) from the National Aeronautics and Space Administration (NASA), provides 30-minute precipitation data at a 10 km resolution from June 2000 to the present [12]. The second, from the National Climatic Data Center (NCDC), offers 15-minute precipitation observations from sparsely located stations (31 stations throughout Florida) [24].
The Florida Automated Weather Network (FAWN) is the only network providing sub-hourly data for 10 meteorological variables [25]. Initiated in 1997 to provide climatic data to rural areas in Florida and inform growers, it currently comprises 42 stations [26]. The goal of FAWN is to provide accurate, reliable, real-time weather data to users across Florida for applications such as cold weather protection strategies, irrigation scheduling, and extreme precipitation analysis [26][27][28][29]. Each FAWN tower, as shown in Figure 1, is equipped with sensors that measure air temperature at 60 cm (T 60cm), air temperature at 2 m (T 2m), air temperature at 10 m (T 10m), soil temperature (T Soil), relative humidity (RH), precipitation (PPT), wind speed (WS), wind direction (WD), solar radiation (Sol Rad), and barometric pressure [26]. The FAWN dataset also includes derived parameters such as dew point temperature (T Dew), wet bulb temperature, and potential evapotranspiration [26]. However, barometric pressure, wet bulb temperature, and potential evapotranspiration are not included in the annual datasets. FAWN data are offered at a fine temporal resolution of 15 minutes, but many of the datasets contain gaps ranging from 15 minutes to one year. These gaps, which arise from operational issues, limit the use of the data in applications that require continuous records [26].
To address this challenge, this paper details the methods used to fill gaps in FAWN meteorological observations and generate a continuous dataset over 2005–2020. Previous studies have employed various data homogenization and gap-filling methods for meteorological variables such as wind speed and precipitation across Florida to improve prediction methods and better understand trends in extreme weather [29][30][31]. The value of the dataset generated in this study lies in its fine temporal resolution and diverse set of meteorological variables. This study leverages the FAWN infrastructure to create a gap-free dataset for wider scientific applications in regions with similar characteristics. The publicly available dataset will provide a unique resource within a complex sub-tropical region for climate analysis and modeling.
Footnote I: The Florida Climate Center is located at Florida State University's Center for Ocean-Atmospheric Prediction Studies and is part of a three-tiered system providing climate services at the national, regional, and state levels.

Data Acquisition and Preprocessing
Yearly observations at 15-minute intervals were obtained from FAWN for all active stations [25]. We examined the FAWN data available between 1997 and 2020 and selected the stations with data spanning the longest period during which the largest number of stations was available, resulting in 30 stations over 2005–2020, as shown in Figure 2. In the northern part of the state, 16 stations were located in forested and woody environments, and in the south, nine stations were in areas classified as savanna. Four of the stations were positioned in urban areas, and one station was located in cropland.
FAWN implements initial quality control measures and filtering before publishing the raw data; details are given in Table 1. Annual tests based on EPA guidelines determine whether sensors need repair or replacement, and the filtering of these operational incidents, together with power failures, results in data gaps [32]. In this study, supplemental quality control checks were implemented to enhance data reliability. For all temperature measurements, if the value changed by more than 5 °C within one 15-minute time step, the data point was marked as a data gap and filled as described below. Each WS value above 30 mph was manually checked against nearby FAWN stations and LCD reports. If the high wind speed was confirmed, the value remained in the dataset; if it could not be verified, the value was marked as a data gap. Additionally, RH values of 0% were marked as data gaps.
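The supplemental checks described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names are hypothetical, missing values are represented by `None`, and each temperature is compared with the most recent accepted value, which is an assumption about how consecutive spikes are treated. Wind speeds above the threshold are only listed for manual review, since the confirmation against nearby stations and LCD reports was a manual step.

```python
from typing import List, Optional

Series = List[Optional[float]]


def flag_temperature_jumps(values: Series, max_step_c: float = 5.0) -> Series:
    """Mark a temperature as a gap (None) if it jumps by more than max_step_c.

    Assumption: each value is compared with the last accepted value; the
    comparison is reset after an already-missing observation.
    """
    cleaned: Series = []
    last_good: Optional[float] = None
    for v in values:
        if v is None:
            cleaned.append(None)
            last_good = None          # do not compare across an existing gap
        elif last_good is not None and abs(v - last_good) > max_step_c:
            cleaned.append(None)      # suspicious jump: treat as a data gap
        else:
            cleaned.append(v)
            last_good = v
    return cleaned


def flag_zero_humidity(values: Series) -> Series:
    """Mark relative humidity readings of exactly 0% as gaps."""
    return [None if v is not None and v == 0 else v for v in values]


def wind_speeds_to_review(values: Series, threshold_mph: float = 30.0) -> List[int]:
    """Return the indices of wind speeds that need manual confirmation."""
    return [i for i, v in enumerate(values) if v is not None and v > threshold_mph]
```

Values flagged in this way become ordinary gaps and pass through the same filling procedure as gaps present in the raw data.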

Gap Filling
A data gap was identified wherever the time difference between two consecutive data points was greater than 15 minutes. Figure 3 gives the number of 15-minute observations in the raw data for each station in each year, providing insight into the number of data points present and the extent of missing data. Figures 4a and 4b give the minimum and maximum number of consecutive missing 15-minute observations for each station in each year, respectively, to show the distribution of gap extents across space and time. Station #s 7 and 8 had a larger number of 15-minute gaps than other stations, with the most gaps occurring in 2007 and 2008, respectively. The years 2007–2009 had the most gaps across all stations, with over 1000 gaps for most stations during that period. Large data gaps such as these are primarily due to operational issues, generally power failures. Gap filling of meteorological variables is inherently uncertain and challenging, and methodological approaches differ between variables [33][34][35][36][37][38]. We applied several gap-filling methods, selected according to gap size and the nature of the meteorological variable.
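As an illustration of the gap definition above, the following sketch scans a list of observation times and reports every place where consecutive timestamps are more than 15 minutes apart. The function name `find_time_gaps` is hypothetical, and the sketch assumes the timestamps are sorted and lie on a regular 15-minute grid.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=15)


def find_time_gaps(timestamps):
    """Return (first_missing, last_missing, n_missing) for each gap.

    Assumes `timestamps` is sorted and lies on a regular 15-minute grid.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > STEP:
            # Number of 15-minute observations absent between prev and curr.
            n_missing = (curr - prev) // STEP - 1
            gaps.append((prev + STEP, curr - STEP, n_missing))
    return gaps
```

Counting `n_missing` per gap gives the kind of per-station, per-year gap-length summary that a figure such as Figure 4 reports.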
Datasets with Diurnal Cycles. Gap filling for datasets with diurnal cycles, such as temperatures, RH, and Sol Rad, followed the same methodology. Figure 5a depicts an example of gap filling for T 10m over the study period for station #28. The year 2007 had the most gaps for this station (see Figure 5b), and different gap-filling techniques were applied depending on the gap size. For data gaps < 6 hours (about 82% of gaps), linear interpolation was implemented, using the slope between the two data points at either end of the gap to estimate the missing data points (see Figure 5c) [35][36][37]. Such temporal interpolation is a reliable gap-filling method for continuous climate variables such as near-surface air temperature and solar radiation [35][36][37]. For gaps between 6 and 12 hours (about 1% of gaps), trend continuation was implemented, similar to Tardivo [39] and Kemp [40], by extracting the data values and trends from the two days before and the two days after the gap (see Figure 5d). Each missing time step was filled with the average of the values at that time of day on the surrounding days. When data gaps were greater than 12 hours (about 17% of gaps), an outside data source was referenced. These large gaps mainly occurred in the years 2005–2009 for most stations (see Figures 3 and 4). In this study, LCD from NOAA [22] was used as the external data source, and these large gaps were filled with data from weather stations within the same city or, if none were available, within the same county (see Figure 5e). Hourly LCD values were linearly interpolated to 15-minute intervals. Since LCD is available only for T 10m, T Dew, and RH, larger data gaps in T 60cm, T 2m, T Soil, and Sol Rad were filled using data from the nearest FAWN station (see Figure 5f) [34,35,38]. The nearest station was the one with available data at the smallest Euclidean distance. Temporal correlations between the monthly means of these nearest stations were high. The nearest-station method was also applied for any periods when there were gaps in the LCD, following Luedeling [35] and Graf [38]. In this study, the distance to the nearest station was typically around 32 km.
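The size-dependent procedure for variables with diurnal cycles can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used to produce the dataset: the series is assumed to lie on a regular 15-minute grid with `None` for missing values, gap duration is taken as the number of missing steps times 15 minutes, and all function names are hypothetical. Gaps longer than 12 hours are only reported, because filling them requires LCD or neighboring-station data.

```python
from statistics import mean

STEPS_PER_HOUR = 4                      # 15-minute observations
STEPS_PER_DAY = 24 * STEPS_PER_HOUR     # 96 observations per day


def find_runs(values):
    """Return (start_index, length) for every run of consecutive None values."""
    runs, i, n = [], 0, len(values)
    while i < n:
        if values[i] is None:
            j = i
            while j < n and values[j] is None:
                j += 1
            runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def choose_method(length):
    """Map a gap length (in 15-minute steps) to the method named in the text."""
    hours = length / STEPS_PER_HOUR
    if hours < 6:
        return "linear"
    if hours <= 12:
        return "trend"
    return "external"


def fill_gaps(values):
    """Fill short and medium gaps; return (filled_series, runs_left_unfilled)."""
    filled, unfilled = list(values), []
    for start, length in find_runs(values):
        end = start + length            # index of the first valid point after the gap
        method = choose_method(length)
        if method == "linear" and start > 0 and end < len(values):
            # Straight line between the valid points on either side of the gap.
            left, right = values[start - 1], values[end]
            slope = (right - left) / (length + 1)
            for k in range(length):
                filled[start + k] = left + slope * (k + 1)
        elif method == "trend":
            # Average of the original values at the same time of day on the
            # two days before and the two days after the missing step.
            for i in range(start, end):
                same_time = [
                    values[j]
                    for d in (1, 2)
                    for j in (i - d * STEPS_PER_DAY, i + d * STEPS_PER_DAY)
                    if 0 <= j < len(values) and values[j] is not None
                ]
                if same_time:
                    filled[i] = mean(same_time)
        else:
            # Long gaps (and short gaps touching the series edge) need other data.
            unfilled.append((start, length))
    return filled, unfilled
```

The same-time averaging draws only on original observations, so a value filled for one gap is never reused to fill another.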
Discrete Datasets. To fill gaps in discrete datasets such as PPT, WS, and WD, LCD and the nearest FAWN stations were used, as for the larger data gaps described above [34,35,38]. The NCDC PPT data could not be used because of the large distances from FAWN stations and the lack of observations consistent with the FAWN and LCD PPT observations. Given the distribution of available FAWN stations, it was reasonable to assume that gradients of observed data for these meteorological variables were captured by filling the gaps using the nearest station.
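Nearest-station selection by smallest Euclidean distance can be sketched as below. The coordinate system is an assumption (planar x, y coordinates in km), the station identifiers are placeholders rather than FAWN station names, and `nearest_station` is a hypothetical helper.

```python
import math


def nearest_station(target_id, coords, available_ids):
    """Return (station_id, distance) of the closest station that has data.

    coords maps station id -> (x, y) in a planar coordinate system (assumed km);
    available_ids holds the stations with valid data for the period being filled.
    Returns None if no other station has data.
    """
    tx, ty = coords[target_id]
    best = None
    for sid in available_ids:
        if sid == target_id:
            continue
        x, y = coords[sid]
        dist = math.hypot(x - tx, y - ty)   # Euclidean distance
        if best is None or dist < best[1]:
            best = (sid, dist)
    return best
```

Restricting the search to `available_ids` reflects the requirement that the substitute station has data during the gap, so the chosen station may differ from one gap to the next.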

Data Records
Gap-free data for 30 FAWN stations over the period 2005–2020 are available through Figshare, an open-access repository, in CSV file format under the title "Gap-Free Sub-Diurnal Meteorological Data from Florida Automated Weather Network (FAWN)". The data are continuous over 16 years for each station listed in Figure 2, and annual data within this period can be downloaded. The datasets provide 10 gap-filled meteorological variables, whose units and labels are given in Table 2.

Technical Validation
In addition to visual inspection of the filled data, such as comparing diurnal patterns with those of surrounding days, the validity of the data was assessed to ensure consistency between the filled data and the raw data for each station and meteorological variable. This was done by applying statistical difference tests to the raw and filled data. A two-tailed t-test on the means of each meteorological variable at each station was conducted to determine whether the mean of the filled data differed significantly from the mean of the raw data [37]. This test was chosen as one source of validation to ensure that the gap-filling process did not significantly alter the mean of the filled data relative to the raw data. All p-values resulting from the t-test were > 0.1, so no significant difference was found between the filled data means and the raw data means (see Table 3 for minimum p-values).
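A comparison of means of this kind can be sketched as follows. The text does not state which variant of the t-test was used; this sketch assumes Welch's unequal-variance form and approximates the two-tailed p-value with the standard normal distribution, which is reasonable only for large samples such as multi-year 15-minute records. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test_large_sample(a, b):
    """Return (t, p) for a two-tailed comparison of the means of a and b.

    Assumptions: Welch's t statistic; p-value from the normal approximation,
    valid when both samples are large.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

Under the criterion used in the text, a p-value above 0.1 for a given station and variable would indicate no significant difference between the raw and filled means.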
Figures 6a and 6b provide the mean, standard deviation, minimum, and maximum values for the 10 meteorological variables at each station. As expected, the mean air temperature increases from north to south, from around 292 K at station #1 to around 297 K at station #30. The maximum PPT was highest, between 52.1 mm and 68.3 mm, at station #s 3, 8, 21, and 28, indicating the areas that received the highest-intensity rainfall within a 15-minute period during the study period. The standard deviation of the temperature values tended to decrease from station #1 (around 8 K) to station #30 (around 5 K), indicating higher temperature variability at the more northern stations.
To test the difference in standard deviations between the filled and raw datasets for each meteorological variable at each station, an F-test was implemented. Having determined that the means of the filled and raw data were not significantly different, we conducted this test to reveal whether the dispersion of values around the mean differed significantly between the two datasets. The p-values resulting from the F-test were also all > 0.1, indicating no significant difference in standard deviations.
To test for a statistical difference between the distributions of the raw and filled datasets, the Kolmogorov-Smirnov (KS) test was implemented. This test checks whether two datasets come from the same distribution, and its test statistic is the greatest distance between the cumulative distribution functions of the two datasets [41]. The KS test was thus chosen as a third validation metric to determine whether the shape and spread of the filled and raw datasets differed significantly. The resulting p-values were all > 0.1, showing no such difference.
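The KS statistic described above, the greatest distance between two empirical cumulative distribution functions, can be computed with a short sketch. The function name is hypothetical, and the p-value is deliberately omitted because it requires the Kolmogorov distribution, which is normally taken from a statistics library.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Return the two-sample KS statistic D = max |F_a(x) - F_b(x)|.

    F_a and F_b are the empirical CDFs of a and b. Both are step functions
    that change only at observed values, so it suffices to evaluate the
    difference at every observed value.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)   # fraction of a <= x
        fb = bisect_right(b_sorted, x) / len(b_sorted)   # fraction of b <= x
        d = max(d, abs(fa - fb))
    return d
```

Identical samples give D = 0 and fully separated samples give D = 1, so small values of D for raw versus filled data indicate closely matching distributions.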

Usage Notes
The gap-filled dataset generated through this work is unmatched in temporal resolution and spatial extent across the state of Florida. It provides 10 meteorological variables at 15-minute intervals for 30 stations, from Jay in the north (latitude 31 °N) to Homestead in the south (latitude 25.5 °N). It has potential applications in climate monitoring, agriculture, and hydrology. The gap-free data can be used to understand climate variability and to verify numerical climate and weather models, which can in turn be used to predict future weather conditions from current observations [42]. The continuous 16-year data product developed through the methods outlined above can serve as an important resource for climate research and forecasting in sub-tropical regions such as Florida [11,19].

Figure 2. Location of 30 FAWN stations selected for this study across Florida, along with their names and numbers. The base map is a 2019 Land Cover map from Moderate Resolution Imaging Spectroradiometer (MODIS).

Figure 3. Heat map representing the number of observations (in thousands) present in the raw data for each station in each year.

Figure 6. Statistical description for the 10 meteorological variables at (a) station #s 1-15 and (b) station #s 16-30, provided in four triangles. In the clockwise direction from the top, each triangle provides the maximum, mean, minimum, and standard deviation of the gap-filled dataset over the 16-year period.

Table 2. Micrometeorological variables and their descriptions as available from FAWN on an annual basis for download.

Table 3. Minimum p-values for each meteorological variable from the statistical significance tests, including the t-test, F-test, and Kolmogorov-Smirnov (KS) test, on each station.