Background & Summary

Data such as rainfall, temperature, wind patterns, and solar radiation are significant meteorological variables in determining climate variations and change. For example, high spatial and temporal resolution rainfall data is necessary for the development of hydrological models, flood risk assessment, land management, and climate model validation1,2,3,4. In sub-tropical, flat regions such as Florida, slight seasonal climate shifts can have drastic impacts on flooding, agricultural production, and public health5,6,7,8. Florida is a sub-tropical region with average air temperatures fairly stable in the summer across the state and varying from North to South (increasingly warm) in the winter8. During summer months, average temperatures are typically between 24 and 28 °C (297–301 K) across the state, and in winter, Northern Florida averages around 7–13 °C (280–286 K) while the southern part of the state tends to average around 15–19 °C (288–292 K)9. The elevation levels in Florida range from sea level to about 105 m above sea level10. Due to its peninsular geography in subtropical latitudes and interactions with relatively warm oceans, Florida has a unique climate to the rest of the United States11,12. Its wet season is heavily interconnected with fresh water availability and ecosystem functionality, and as population growth continues throughout the state, there is the further strain placed on its natural resources11.

Within the climate system, the diurnal and semi-diurnal scale variations represent a fundamental mode of variability13,14,15,16. Diurnal variations are generated from diurnally varying solar heating that affects near the surface, through the depths of the troposphere, and in the stratosphere that manifests as pronounced oscillations with periods of approximately 24 h (diurnal) and 12 h (semi-diurnal). These periodic oscillations that appear in the upper atmosphere are also called atmospheric tides, which significantly impact the diurnal and semi-diurnal variations of many climatic variables16. Often, the fidelity of numerical climate and weather models and reanalyses is assessed in the ability to represent the diurnal scales owing to its feature of being a fundamental mode of variation of the climate system (e.g.17,18,19,20). However, a huge limitation of verifying these models to simulate the diurnal cycles is the lack of data that robustly resolves the diurnal variations21. To examine these variations in sub-tropical areas such as the state of Florida, continuous data is needed at sub-diurnal (hourly or finer) temporal and spatial resolution.

The Florida climate is representative of a trade wind regime for latitudes between about 25 degrees North and South of the equator, including monsoon regions such as India and Vietnam11,22,23. Areas such as these often lack a high-density observational network, but for the Florida region, many climate and weather datasets exist to provide information at varying spatio-temporal resolutions. For example, the Florida Climate Center, affiliated with the National Climatic Data Center (NCDC), provides daily precipitation and temperatures for approximately 100 stations across Florida24, and hourly local climatological data (LCD) is available through the National Oceanic and Atmospheric Administration (NOAA)25,26. These datasets lack a sub-hourly temporal resolution, limiting their applicability. In addition, there are two precipitation-only data sources. Integrated Multi-Satellite Retrievals for Global Precipitation Mission (IMERG) data from the National Aeronautics and Space Administration (NASA) provides 30-minute precipitation data at a 10 km resolution over the period June 2000-present12. The second source, from the NCDC, offers 15-minute precipitation observations from stations that are sparsely located (31 stations throughout Florida)27.

The Florida Automated Weather Network (FAWN) is the only network providing sub-hourly data for 10 meteorological variables28. Initiated in 1997 to provide climatic data to rural areas in Florida to inform growers, it is currently comprised of 42 stations29. The goal of FAWN is to provide accurate, reliable, and real-time weather data to users across Florida for applications such as cold weather protection strategies, irrigation scheduling, and extreme precipitation analysis29,30,31,32. Each FAWN tower, as shown in Fig. 1, is equipped with sensors that measure air temperature at 60 cm (T 60 cm), air temperature at 2 m (T 2 m), air temperature at 10 m (T 10 m), soil temperature (T Soil), relative humidity (RH), precipitation (PPT), wind speed (WS), wind direction (WD), solar radiation (Sol Rad), and barometric pressure29. The FAWN dataset also includes derived parameters such as dew point temperature (T Dew), wet bulb temperature, and potential evapotranspiration29. However, barometric pressure, wet bulb temperature, and potential evapotranspiration are not included in annual datasets. FAWN data is offered at a fine temporal resolution of 15 minutes, but many of these datasets contain gaps of various sizes, from 15 minutes to one year, due to operational issues that limit their use for applications that require continuous datasets29. To address this challenge, this paper details the methods utilized to fill gaps in FAWN meteorological observations to generate a continuous dataset over 2005–2020. Previous studies have employed various data homogenization and gap-filling methods for meteorological variables, such as WS and PPT, across Florida to improve prediction methods and better understand trends in extreme weather32,33,34. The value of the the dataset generated in this study lies in its fine temporal resolution and diverse set of meteorological variables. This study leverages the FAWN infrastructure in order to create a gap-free dataset for wider scientific applications in regions of similar characteristics. The publicly available dataset will provide a unique resource within a complex sub-tropical region for climate analysis and modeling.

Fig. 1
figure 1

Typical FAWN tower configuration (left) and photo of a FAWN tower (right).

Methods

Data acquisition and preprocessing

Yearly observations at 15-minute intervals were obtained from FAWN for all active stations28. In this study, we examined the FAWN data available between 1997 and 2020 and selected the stations with data present across the longest period of time during which the most stations were available, resulting in the chosen 30 stations over 2005–2020, as shown in Fig. 2. In the northern part of the State, 16 stations were located in forested and woody environments, and in the South, nine stations were in areas classified as savanna. Four of the stations were positioned in urban areas, and one station was located in cropland.

Fig. 2
figure 2

Location of 30 FAWN stations selected for this study across Florida, along with their names and numbers. The base map is a 2019 Land Cover map from Moderate Resolution Imaging Spectroradiometer (MODIS).

FAWN implements initial quality control measures and filtering before publishing the raw data, details for which are given in Table 1. Annual tests are conducted to determine if repair or replacement of sensors is needed based upon EPA guidelines, and filtering of these potential operational incidents as well as power failures result in data gaps35. In this study, supplemental quality control mechanisms were implemented to enhance the data reliability. For all temperature measurements, if there was a difference >5 °C within one time step of 15 minutes, the data point was marked as a data gap that was filled as described below. For WS higher than 30 mph, the event was manually checked against nearby FAWN stations and LCD reports to confirm high WS. If the high WS value was confirmed, then it remained in the data set, and if it could not be verified, the value was marked as a data gap. Additionally, RH values of 0% were marked as data gaps.

Table 1 Quality control measures implemented by FAWN, adapted from the FAWN measurement system specifications47.

Gap filling

Data gaps occurred if the difference between two consecutive data points was greater than 15 minutes. The number of 15-minute observations in the raw data for each station at each year is given in Fig. 3 to provide insight into the amount of data points present and the extent of missing data. Figure 4a,b provide the minimum and maximum number of consecutive 15-minute data gaps missing for each station in each year, respectively, to demonstrate the distribution of data gap extents across space and time. Station #s 7 and 8 had a larger number of 15-minute gaps than other stations, with the most gaps occurring in 2007 and 2008, respectively. The years 2007–2009 had the most gaps for all stations, with over 1000 gaps for most stations during that period. Large data gaps such as these are primarily due to operational issues, generally from power failures. Gap filling of meteorological variables is inherently uncertain and challenging, with differing methodological approaches for different variables36,37,38,39,40,41. We applied several methods of data gap filling based upon gap size and the nature of the meteorological variables.

Fig. 3
figure 3

Heat map representing the number of observations (in thousands) present in the raw data for each station in each year.

Fig. 4
figure 4

The minimum (left triangle) and maximum (right triangle) number of 15-minute data gaps present at each station in each year for (a) station #s 1–15 and (b) station #s 16–30.

Datasets with diurnal cycles

Gap filling for datasets with diurnal cycles such as temperatures, RH, and Sol Rad followed the same methodology. Figure 5a depicts an example of gap filling for T 10 m over the study period for station #28. The year 2007 had the most gaps for this station (see Fig. 5b), and various gap filling techniques were applied based on the gap size. For data gaps <6 hours (about 82% of gaps), linear interpolation was implemented using the slope between the two data points at either end of the gap to estimate the missing data points (see Fig. 5c)38,39,40. Such temporal interpolation is a reliable data gap filling method in continuous climate variables such as near-surface air temperature and solar radiation data38,39,40. For gaps between 6 and 12 hours (about 1% of gaps), trend continuation for the meteorological variables was implemented, similar to Tardivo and Berti42 and Kemp et al.43, by extracting the data values and trends from two days prior to and two days after gaps (see Fig. 5d). At each missing time step, the measurement was filled with the average values at that particular time from the surrounding days. When the data gaps were greater than 12 hours (about 17% of gaps), an outside data source was referenced. These large gaps mainly occurred in the years 2005–2009 for most stations (see Figs. 3, 4). In this study, LCD from NOAA25 was used as an external data source to fill these large data gaps with data from weather stations within the same city, or if not available then the same county (see Fig. 5e). Hourly LCD values were linearly interpolated to 15-minute intervals. Since LCD is available only for T 10 m, T Dew, and RH, larger data gaps in T 60 cm, T 2 m, T Soil and Sol Rad were filled using data from the nearest FAWN station (see Fig. 5f)37,38,41. The nearest station was determined through the smallest euclidean distance to a station with available data. The temporal correlations were high between the monthly means of these nearest stations. The nearest station method was also applied for any periods when there were gaps in the LCD, following Luedeling38 and Graf41. In this study, the distance to the nearest station was typically around 32 km.

Fig. 5
figure 5

Examples for the methods used for filling diurnal meteorological variable such as 10 m air temperature at the Immokalee station (#28): (a) 16-year time-series data; (b) Zoomed in time-series for 2007; (c) linear interpolation for gaps less than 6 hours; (d) trend continuation method for gaps between 6–12 hours; and (e) external data source and (f) nearest station for gaps larger than 12 hours.

Discrete datasets

To fill gaps in the discrete datasets such as PPT, WS, and WD, LCD and nearest FAWN stations were used, similar to the larger data gaps mentioned above37,38,41. The NCDC PPT data could not be used due to large distances from FAWN stations and lack of observations consistent with the FAWN and LCD PPT observations. Given the distribution of available FAWN stations, it was reasonable to assume that gradients of observed data for these meteorological variables were captured by filling the gaps using the nearest station.

Data Records

Gap-free data for 30 FAWN stations over the period 2005–2020 are available through Figshare, an open access repository, in CSV file format titled “Florida Automated Weather Network Yearly CSV Data (Gap Free)44. The data is continuous over 16 years for each station listed in Fig. 2, and annual data within the given time period can be downloaded. There are 10 gap-filled meteorological variables provided in the datasets, the units and labels of which are given in Table 2.

Table 2 Micrometeorological variables and their descriptions as available from FAWN on an annual basis for download.

Technical Validation

In addition to visual inspection of filled data such as comparing diurnal patterns with surrounding days, the validity of the data was assessed to ensure consistency between the filled data and the raw data for each station and meteorological variable. This was assessed by conducting differential statistics between the raw data and the filled data. A two-tailed T-test on the means of each meteorological variable at each station was conducted to determine whether the mean of the filled data differed significantly from the mean of the raw data40. This test was chosen as one source of validation in order to ensure that the gap filling process did not significantly alter the mean of the filled data as compared to the raw data. All p-values resulting from the T-test were >0.1, so there was no significant difference found between the filled data means and raw data means (see Table 3 for minimum p-values).

Table 3 Minimum p-values for each meteorological variable from the statistical significance tests, including the T-test, F-test, and Kolmogorov-Smirnov (K-S) test, on each station.

Figure 6a,b provide the mean, along with the standard deviation, minimum, and maximum values, for the 10 meteorological variables at each station. As expected, the mean air temperature values increase from station #1, at around 292 K, to station #30, at around 297 K (North to South). The maximum PPT was highest, between 52.1 mm and 68.3 mm, at station #s 3, 8, 21, and 28, providing information on the areas which received the highest intensity rainfall within a 15-minute period over the study period. The standard deviation of the temperature values tended to decrease from station 1 (around 8 K) to 30 (around 5 K), supporting higher temperature variability in the more northern stations.

Fig. 6
figure 6

Statistical description for the 10 meteorological variables at (a) station #s 1–15 and (b) station #s 16–30, provided in four triangles. In the clockwise direction from the top, each triangle provides the maximum, mean, minimum, and standard deviation of the gap-filled dataset over the 16-year period.

To test the difference in standard deviations between the meteorological variables at each station in the filled dataset and raw dataset, an F-test was implemented. As we determined that the means of the filled and raw data were not significantly different, this test was conducted to reveal whether the dispersal of values around the averages of each dataset significantly varied. These p-values resulting from the F-test were also all >0.1, indicating no notable difference in standard deviations.

In order to test the statistical difference in distribution of the raw and filled datasets, the Kolmogorov-Smirnov (K-S) test was implemented. This test essentially checks whether two datasets come from the same distribution, and the test statistic can be interpreted to represent the greatest distance between the cumulative distribution function of each dataset45. Thus, the K-S test was chosen as a third validation metric to determine if there existed significant difference between the shape and spread of the filled and raw datasets. The resulting p-values from the K-S test showed no such difference, as they were all > 0.1.

Usage Notes

The gap-filled dataset generated through this work is unmatched in temporal resolution and spatial extent across the state of Florida. It provides information on 10 meteorological variables at 15-minute intervals, spanning 30 stations from as far north as Jay (latitude 31°N) to Homestead in the south (latitude 25.5°N). It also has potential applications in climate monitoring, agriculture, and hydrology. The gap free data can be applied to understand climate variability and verify numerical climate and weather models, which can be used to predict future weather conditions from current observation46. The continuous 16-year data product developed through the methods outlined above can serve as an important resource for climate research and forecasting in sub-tropical regions such as Florida11,21.