Analysis of atmospheric temperature data by 4D spatial–temporal statistical model

The meteorological data such as temperature of the upper atmosphere is ssential for accurate weather forecasting. The Universal Rawinsonde Observation Program (RAOB) establishes an extensive radiosonde network worldwide to observe atmospheric meteorological data from the surface to the low stratosphere. The RAOB data data has very high accuracy but can offer a very limited spatial coverage. Meanwhile, ERA-Interim reanalysis data is widely available but with low-quality. We propose a 4D spatiotemporal statistical model which can make effective inferences from ERA-Interim reanalysis data to RAOB data. Finally, we can obtain a huge amount of RAOB data with high-quality and can offer a very wide spatial coverage. In empirical research, we collected data from 200 launch sites around the world in January 2015. The 4D spatiotemporal statistical model successfully analyzed the observation gaps at different pressure levels.

www.nature.com/scientificreports/ given by B-spline basis functions. The spline coefficients, indexed in space and time, are modelled as spatiotemporal models on the sphere. In fact, we use an extension of the observation-minus-background approach, which is common in atmospheric sciences 17,18 and filter out large scale components. The rest of the paper is organized as follows. "Datasets" section describes the RAOB dataset, and the ERAinterim reanalysis data, which is used as the background. In "4D spatiotemporal statistical model" section, we present the 4D spatio-temporal statistical model. "Model results" section reports the model results, and "Conclusion" section concludes.

Datasets
In this section, we introduce the data used for the estimation and validation of the model: the RAOB radiosonde temperature vertical profiles which are used as the model response, and the European Centre for Medium-range Weather Forecasts (ECMWF) ERA-interim reanalysis data, which are used as the background.
The universal rawinsonde observation program. In this paper, we use the RAOB dataset for January, 2015. This dataset is obtained from the database implemented by the Earth System Research Laboratory of the National Oceanic and Atmospheric Administration (NOAA/ESRL, https:// ruc. noaa. gov/ raobs/).
We consider twice a day observations at 00:00 UTC and 12:00 UTC from the n = 200 worldwide launch sites. Temperature data collected at the latitude 42.27 and longitude 118.97 in January 2015 is depicted in Fig. 1, where the y-axis presents the barometric pressure levels.
A radiosonde is made by a set of instruments carried into the atmosphere by a weather balloon, measuring various atmospheric parameters and transmitting them by radio to a ground receiver up to altitudes of approximately 30-35 km, depending on the balloon burst. The observations are processed and encoded for transmission and efficient storage. While the radiosonde transmits an essentially continuous stream of temperature readings back to the station (each 5-10 m of altitude, measured each 1-2 s), only a subset of this information is encoded and transmitted.
ERA-interim description. ERA-Interim is a global atmospheric reanalysis data provided by ECMWF. It is based on assimilating global observations into a computational model which gives the best state estimate 19 . We focus on 00:00 UTC and 12:00 UTC reanalysis which are described as instantaneous, though they represent 30 min averages and have a horizontal resolution of 0.75 o , which is about 80 km depending on latitude. The product comprises 60 vertical levels from the surface up to 0.1 hPa, and each level is populated by 241 × 480 pixels.

4D spatiotemporal statistical model
In this section, we introduce a 4D spatiotemporal statistical model, which fits the functional data on the sphere. Let y TEMP (s, h, t) be the atmospheric profile at site with coordinates s = (lat, lon) ∈ S 2 and time t , observed at altitude or barometric pressure h , where S 2 is the sphere representing the Earth. The model is defined by the following measurement and dynamics equations: In this model, y TEMP (s, h, t) represents RAOB data and x ERA (s, h, t) represents ERA-Interim reanalysis data. Among (1), s represents different position of launching site on the two-dimensional plane of longitude and latitude, h represents the third dimension of altitude or pressure, and t represents the fourth dimension of time. In addition, φ(h) ′ z(s, t) means the difference between RAOB data and ERA-Interim reanalysis data with altitude h in position s at time t . To this end, this is a 4D model, which is relatively novel. This model is very different from conventional 3D models in traditional literature. In the case study, we define h ∈ H = [50, 925] hPa , t = 1, . . . , 62 . In the measurement equation, ǫ(s, h, t) is a Gaussian measurement error. The variance of ǫ(s, h, t) is represented as σ 2 ǫ (h). To this end, σ 2 ǫ (h) needs to satisfy the following condition: where φ(h) is a set of p × 1 basis functions, and β ERA (h) is modelled as: where c β is a p × 1 vector needed to be estimated. In Eq. (2), z(s, t) is a p-variate latent random variable with Markovian dynamics ruled by the persistency matrix G . Therefore, G matrix is the correlation coefficient matrix of latent random variable z(s,t) in time. In the dynamics equation, η(s; t) is a p dimensional Gaussian innovation random field which is independent in time, with spatial covariance function: In fuction (5), v = (v 1 , . . . , v p ) ′ is the variance vector, and ρ s, s ′ ; θ j is a valid spatial correlation function 20 on the sphere with range parameter θ j from θ = (θ 1 , . . . , θ p ) ′ . In addition, "diag()" means the elements on the diagonal of the matrix.
The O − B approach is generalized here by using the ERA-interim background as the unique covariate of model, where O refers to observations and B refers to background 21 . In fact, the O − B approach is generalized to O = βB + e , where e = ′ z + ǫ is capable to give a more detailed description of the bias structure. To model the effect of the East-West atmospheric dynamics on the O − B error, the following simple anisotropic correlation structure is used: In this paper, we use the B-spline system (B-spline) as φ(h) in the above model equation. Knots are the values of where the pieces of polynomial meet, which are denoted and sorted into nondecreasing order. Figure 2 shows .

Model results
We first estimated the model by Expectation-Maximization algorithm. Second, we analyzed the model estimation results and verified the fitting goodness of our model based on cross-validation R 2 . Finally, based on the results of spatial prediction, we analyzed the observation gaps by O − B approach.

Model estimation results.
We choose Bspline to model Eqs. (1)(2). As the accuracy of the data at high altitude decreases, we set more internal breakpoints at high altitudes where the barometric pressure is below 300 hPa . Based on the empirical analysis, we set the parameter knots for estimating β ERA (h) and σ 2 ǫ (h) as The estimated coefficient z(s, t) of the latent term φ(h) ′ z(s, t) in Eq. (1) is p-variate latent Gaussian random variable. Since the computation burden would be pretty high if too many knots are set, here the knots is as We implement a cubic spline here. Therefore, the number of basis functions for β ERA (h) , σ 2 ǫ (h) and φ(h) ′ z(s, t) is 9, 9, 4 respectively. Tables 1 and 2 show the estimate results for c β and c ǫ .
We plot the β ERA (h) and σ 2 ǫ (h) estimates in Fig. 3. We can find that in the low-altitude layer with barometric pressure higher than 400 hPa , the estimated value of β ERA (h) is close to 1, and the variance σ 2 ǫ (h) of the residuals is pretty low. In the atmosphere where the barometric pressure is lower than 400 hPa , the variance σ 2 ǫ (h) of the residuals increases, and β ERA (h) deviates from the value 1. Thus, our model captures the bias from high-level RAOB temperature data, benefiting the accuracy of estimation of parameters.
Model fitting goodness. In this section, 30% sites are randomly selected as test set, and the remaining 70% sites are used for model estimation as training set. The real value and the predicted value from test set are defined as y = (y 1 , y 2 , . . . , y n 1 ) and y * = (y * 1 , y * 2 , . . . , y * n 1 ) respectively, where n1 represents the number of sites in the test set. By comparing the predicted value with the real value for the selected 30% sites, we obtain the cross-validation R 2 of the RAOB data model. Specifically,R 2 = 1 − n1 k=1 (y * −y) 2 / n 1 k=1 (y − n 1 k=1 y k /n 1 ) 2 .
We show the fitting criteria R 2 with respect to barometric pressure h , time t , and launching site s , respectively, in Figs Fig. 4, represents the number of data in each interval. The left part of Fig. 4 shows that as the altitude increases, the cross-validation R 2 becomes smaller. In general, excluding the four upper-altitude intervals, the criteria R 2 higher than 0.96 is achieved, indicating the high accuracy of model prediction. The right part of Fig. 4 shows that R 2 fluctuates randomly over time. According . β ERA (h) estimate and its 90%, 95%, 99% confidence bands (left), σ 2 ǫ (h) estimate and its 90%, 95%, and 99% confidence bands (right). Among them, the outermost light red border on both sides represent 90% confidence bands, red border in the middle of both sides represent 95% confidence bands, and the innermost crimson border on both sides represent 99% confidence bands. where z is the average of latent space-time variable z(s, t).
As mentioned in the introduction, the GAIA-CLIM project suggests to consider as observational gaps the areas of the atmosphere where the uncertainty of spatial prediction is higher. Figure 6 shows the average estimates of the difference φ(h) ′ z(s, t) between the RAOB and ERA-Interim at pressure level 450 hPa . As it is expected, the differences in areas where there are sampling sites can be estimated well with color marked on the map. To visualize the changes of the average uncertainty of spatial prediction along different altitudes, we drew Figs   www.nature.com/scientificreports/ the standard deviations of the spatial predictions at barometric pressures 450 hPa and 100 hPa , respectively. We found that, no matter what the altitude is, the variance of differences in areas where there are sampling sites are very small, while the variance of differences in areas far away from sampling sites are very large. In this way, we explore the underlying mechanism of how the uncertainty changing at various barometric pressure levels.

Conclusion
Meteorological data such as temperature data of the upper atmosphere is essential for accurate weather forecasting. In this paper, we propose a 4D spatiotemporal model that objectively measures the observation gap of two types of temperature data (RAOB data and ERA-Interim reanalysis data). Based on the O − B method, the observation gap is analyzed through the spatial prediction results. The estimated results of the model are of significantly practical interpretation, and the model is meaningful in theory and practice. From the theoretical point of view, we propose a 4D statistical model on the basis of traditional 3D model. This model not only considers the traditional third dimension of time, longitude and latitude, but also the dimension of altitude. Our model effectively characterizes the vertical profile of the radiosonde data through a linear combination of random components and B-spline curves. We establish the model with these two kind of temperature data from 200 launching sites worldwide in January 2015. The model shows a good model fittingness by the criteria the cross-validation R 2 .
From an application point of view, it can help climate scientists make effective inferences from ERA-Interim reanalysis data to RAOB data. As mentioned before, ERA-Interim reanalysis data is widely available with lowquality; RAOB data with high-quality but can offer a very limited spatial coverage. Through the method we