Abstract
Spatiotemporal datasets, which consist of spatially referenced time series, are ubiquitous in diverse applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As the scale of modern datasets increases, there is a growing need for statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle many observations. This article introduces the Bayesian Neural Field (BayesNF), a domain-general statistical model that infers rich spatiotemporal probability distributions for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust predictive uncertainty quantification. Evaluations against prominent baselines show that BayesNF delivers improvements on prediction problems from climate and public health data containing tens to hundreds of thousands of measurements. Accompanying the paper is an open-source software package (https://github.com/google/bayesnf) that runs on GPU and TPU accelerators through the JAX machine learning platform.
Introduction
Spatiotemporal data, which consists of measurements gathered at different times and locations, is ubiquitous across diverse disciplines. Government bodies such as the European Environment Agency^{1} and United States Environmental Protection Agency^{2}, for example, routinely monitor a variety of air quality indicators (PM_{10}, NO_{2}, O_{3}, etc.) in order to understand their ecological and public health impacts^{3,4}. As it is physically impossible to place sensors at all locations in a large geographic area, environmental data scientists routinely develop statistical models to predict these indicators at new locations or times where no data is available^{5,6}. Spatiotemporal data analysis also plays an important role in cloud computing, where consumer demand for resources such as CPU, RAM, and storage is driven by time-evolving macroeconomic factors and varies across data center locations. Cloud service providers build sophisticated demand-forecasting models to determine prices^{7}, perform load balancing^{8}, save energy^{9}, and achieve service level agreements^{10}. Additional applications of spatiotemporal data analysis include meteorology (forecasting rain volume^{11} or wind speeds^{12}), epidemiology (“nowcasting” active flu cases^{13}), and urban planning (predicting rider congestion patterns at metro stations^{14}).
Unlike traditional regression or classification methods in machine learning that operate on independent and identically distributed (i.i.d.) data, accurate models of spatiotemporal data must capture complex and highly nonstationary dynamics in both the time and space domains. For example, two locations twenty miles apart in California’s Central Valley may exhibit nearly identical temperature patterns, whereas two locations only one mile apart in nearby San Francisco might have very different microclimates; and these effects may differ depending on the time of year. Handling such variability across different scales is a key challenge in designing accurate statistical models. Another challenge is that spatiotemporal observations are typically driven by unknown and noisily observed data-generating processes, which require models that report probabilistic predictions to account for the aleatoric and epistemic uncertainty in the data.
The dominant approach to spatiotemporal data modeling in statistics rests on Gaussian processes, a rich class of Bayesian nonparametric priors on random functions^{15,16,17}. Consider a spatiotemporal field Y(s, t) indexed by spatial locations \({{\bf{s}}}\in {{\mathbb{R}}}^{d}\) and time points \(t\in {\mathbb{R}}\). A typical Gaussian-process-based “prior probability” distribution (used in popular geostatistical software packages such as R-INLA^{18} and sdmTMB^{19}) over the random field Y is given by:
In Eq. (1), η is a random function whose covariance over space and time is determined by a kernel function \({k}_{\theta }(({{\bf{s}}},t),({{{\bf{s}}}}^{{\prime} },{t}^{{\prime} }))\) parameterized by θ; x(s, t) is a covariate vector associated with index (s, t); h is a mean function with parameters β (e.g., for a linear function, \(h(x;\beta ):={\beta }^{{\prime} }x\)) of the latent field F; and Dist is a noise model (e.g., Normal, Poisson) for the observations Y(s, t), with index-specific parameter g(F(s, t)) (where g is a link function, e.g., \(\exp\)) and global parameters γ.
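Putting these definitions together, the model in Eq. (1) can be written in the following general form. This is a reconstruction from the description above, not a verbatim copy of the displayed equation; in particular, the zero prior mean of η and the additive combination of h and η are assumptions.

```latex
\begin{aligned}
\eta &\sim \mathrm{GP}(0,\, k_{\theta}), \\
F(\mathbf{s}, t) &= h\big(x(\mathbf{s}, t);\, \beta\big) + \eta(\mathbf{s}, t), \\
Y(\mathbf{s}, t) &\sim \mathrm{Dist}\big(g(F(\mathbf{s}, t)),\, \gamma\big).
\end{aligned}
```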
Given an observed dataset \({{\mathcal{D}}}:=\{Y({{{\bf{s}}}}_{1},{t}_{1})={y}_{1},\ldots,Y({{{\bf{s}}}}_{N},{t}_{N})={y}_{N}\}\), the inference problem is to determine the unknown parameters (θ, β, γ), which in turn define a posterior distribution over the processes (η, F, Y) given \({{\mathcal{D}}}\). Advantages of the model (1) are (i) its flexibility, as η is capable of representing highly complex covariance structure; and (ii) its ability to quantify uncertainty, as the posterior spreads its probability mass over a range of functions and model parameters that are consistent with the data. Moreover, the model easily handles arbitrary patterns of missing data by treating them as latent variables. A number of recent articles have developed specialized Gaussian process techniques for modeling rich spatiotemporal fields e.g., refs. ^{19,20,21,22,23}.
Despite their flexibility, spatiotemporal models based on Gaussian processes (such as Eq. (1)) come with significant challenges. The first is computational. The simplest and most accurate posterior inference algorithms for these models have a computational cost of O(N^{3}), where N is the number of observations, which is unacceptably high in datasets with tens or hundreds of thousands of observations. Reducing this cost requires compromises, either on the modeling side (e.g., imposing a discrete Markovian structure on the model^{18,19}) or on the posterior-inference side (e.g., approximating the true posterior with a simpler Gaussian process^{20,21,23}). Either way, the resulting models have less expressive power and cannot explain the data as accurately. These approximations also involve delicate linear-algebraic derivations or stochastic differential equations, which are challenging to implement and apply to new settings.
The second challenge is expertise, where the accuracy of model (1) on a given dataset is dictated by key choices such as the covariance kernel k_{θ} and mean function h. Even for seasoned data scientists, designing these quantities is difficult because it requires detailed knowledge about the application domain. Further, even small modifications to the model can impose large changes to the learning algorithm, and so most software packages only support a small set of predetermined covariance structures k_{θ} (e.g., separable Matérn kernels, radial basis kernel, polynomial kernel) that are optimized enough to work effectively on large datasets.
To alleviate these fundamental tensions, this article introduces the Bayesian Neural Field (BayesNF), a method that combines the scalability of deep neural networks with many of the attractive properties of Gaussian processes. BayesNF is built on a Bayesian neural network model^{24} that maps from multivariate space-time coordinates to a real-valued field. The parameters of the network are assigned a prior distribution, and as in Gaussian processes, conditioning on observed data induces a posterior over those parameters (and in turn over the entire field). Because inference is performed in “weight space” rather than “function space”, the cost of analyzing a dataset grows linearly with the number of observations, as opposed to cubically for a Gaussian process. Because BayesNF is a hierarchical model (Fig. 1), it naturally handles missing data as latent variables and quantifies uncertainty over parameters and predictions. And because BayesNF defines a field over continuous space-time, it can model nonuniformly sampled data, interpolate in space, and extrapolate in time to make predictions at novel coordinates.
Our description of BayesNF as a neural “field” is inspired by the recent literature on neural radiance fields (NeRFs^{25,26}) in computer vision. A key discovery that enabled the success of NeRFs is that neural networks are biased towards learning functions whose Fourier spectra are dominated by low frequencies, and that this bias can be corrected by concatenating sinusoidal positional encodings to the raw spatial inputs^{27}. To ensure that our BayesNF model assigns high prior probability to data that includes both low- and high-frequency variation, we append Fourier features to the raw time and position data that are fed to the network. In Methods, we show that these Fourier features, coupled with learned scale factors and convex combinations of activation functions, improve BayesNF models’ ability to learn flexible and well-calibrated distributions of spatiotemporal data. Incorporating sinusoidal seasonality features lets BayesNF models make predictions based on (multiple) seasonal effects as well. Taken together, these characteristics enable state-of-the-art performance in terms of point predictions and 95% prediction intervals on diverse large-scale spatiotemporal datasets, without the need to heavily customize the BayesNF model structures on a per-dataset basis.
BayesNF belongs to a family of emerging techniques that leverage deep neural networks with hierarchical Bayesian models for spatiotemporal data analysis; a thorough survey of these advances is given in Wikle and Zammit-Mangion^{28}. Our method is inspired by limitations of existing deep neural network approaches for probabilistic prediction in spatiotemporal data. For example, the Bayesian spatiotemporal recurrent neural networks introduced in McDermott and Wikle^{29} require the data to be observed at a fixed spatial grid and regular discrete-time intervals. In contrast, BayesNF is defined over continuous space-time coordinates, enabling prediction at novel locations and in datasets with irregularly sampled time points. The deep “Empirical Orthogonal Function” model^{30} is a powerful exploratory analysis tool but is less useful for prediction: it cannot handle missing data, make predictions at new time points, or deliver uncertainty estimates. Additional methods in this category include Bayesian neural networks that are highly task oriented, e.g., for analyzing power flow^{31}, wind speed^{32}, or floater intrusion risk^{33}. These methods leverage domain-specific architectures designed specifically for the analysis problem at hand, and do not aim to provide software libraries that are easy for practitioners to apply in new spatiotemporal datasets beyond the application domain. In contrast, a central goal of BayesNF is to provide a domain-general modeling tool that is easily applicable to the same type of datasets as the Gaussian process model (1), without the need to redesign substantial parts of the probabilistic model or network architecture for each new task.
Neural processes^{34} also integrate deep neural networks with probabilistic modeling, but are based on a graphical model structure that is fundamentally difficult to apply to spatiotemporal datasets. In particular, because neural processes aim to “meta-learn” a prior distribution over random functions, the authors note it is essential to have access to a large number of independent and identically distributed (i.i.d.) datasets during training. However, most spatiotemporal data analyses are based on only a single real-world dataset (e.g., those in Table 1) where there is no notion of sharing statistical strength across multiple i.i.d. observations of the entire field.
Graph neural networks (GNNs), surveyed in Jin et al.^{35}, are another popular deep-learning approach for spatiotemporal prediction that has been particularly useful in settings such as analyzing traffic or population-migration patterns. These models require as input a graph describing the connectivity structure of the spatial locations, which makes them less appropriate for spatial data that lack such discrete connectivity structure. Moreover, the requirement that the graph be fixed makes it harder for GNNs to interpolate or extrapolate to locations that are not included in the graph at training time. The BayesNF model, on the other hand, operates over continuous space, and is therefore more appropriate for spatial data without known discrete connectivity structure. In addition, as noted in Jin et al.^{35}, GNNs have not yet been demonstrated on probabilistic prediction tasks, and we are unaware of the existence of open-source software libraries based on GNNs that can easily handle the sparse datasets in Table 1.
Results
Model description
Consider a dataset \({{\mathcal{D}}}=\{y({{{\bf{s}}}}_{i},{t}_{i})\mid i=1,\ldots,N\}\) of N spatiotemporal observations, where \({{{\bf{s}}}}_{i}\in {{\mathcal{S}}}\subset {{\mathbb{R}}}^{d}\) denotes a d-dimensional spatial coordinate and \({t}_{i}\in {{\mathcal{T}}}\subset {\mathbb{R}}\) denotes a time index. For example, if the field is observed at longitude-latitude coordinates in discrete time, then \({{\mathcal{S}}}=(-180,\, 180]\times [-90,\, 90]\subset {{\mathbb{R}}}^{2}\) and \({{\mathcal{T}}}=\{1,2,\ldots \}\). If the field also incorporates an altitude dimension, then \({{\mathcal{S}}}\subset {{\mathbb{R}}}^{3}\). We model this dataset as a realization {Y(s_{i}, t_{i}) = y(s_{i}, t_{i}), 1 ≤ i ≤ N} of a random field \(Y:{{\mathcal{S}}}\times {{\mathcal{T}}}\to {\mathbb{R}}\) over the entire spatiotemporal domain. Following the notation in Wikle and Zammit-Mangion^{28}, we describe the field using a hierarchical Bayesian model:
In this notation, upper case letters denote random quantities, Greek letters denote model parameters, lower case letters denote nonrandom (fixed) quantities, and square brackets [ ⋅ ] denote (yet-to-be-specified) probability distributions. The distribution of the observable random variables Y(s, t) is parameterized by global parameters Θ_{y} and an unobservable (latent) spatiotemporal field F(s, t). In turn, F(s, t) is parameterized by a set of random global parameters Θ_{f} and a collection x(s, t) = [x_{1}(s, t), …, x_{m}(s, t)] of m fixed covariates associated with index (s, t).
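In bracket notation, the three levels described above (observation, process, and parameter models) can be sketched as follows. This is a reconstruction from the surrounding text; the ordering of the levels and the independence of Θ_{f} and Θ_{y} in the last line are assumptions.

```latex
\begin{aligned}
\text{Observation model:} \quad & \big[\,Y(\mathbf{s}, t) \mid F(\mathbf{s}, t),\, \Theta_{y}\,\big] \\
\text{Process model:} \quad & \big[\,F(\mathbf{s}, t) \mid \Theta_{f},\, x(\mathbf{s}, t)\,\big] \\
\text{Parameter models:} \quad & [\,\Theta_{f}\,]\,[\,\Theta_{y}\,]
\end{aligned}
```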
Box 1 completes the definition of BayesNF by showing specific probability distributions for the model (2)–(4). Figure 1 shows a probabilistic-graphical-model representation of a BayesNF model with H = 3 layers, which takes a spatiotemporal index (s, t) at the input layer and generates a realization Y(s, t) of the observable field at the output layer. At a high level, the input layer transforms the spatiotemporal coordinates (s, t) into a fixed set of spatiotemporal covariates, which include linear terms, interaction terms, and Fourier features in time and space. The second layer performs a linear scaling of these covariates using a learnable scale factor; this layer aims to avoid the need for the practitioner to manually specify how to appropriately scale the data, which is known to heavily influence the learning dynamics^{36}. Next, the hidden layers of the network contain the usual dense connections, except that the activations are specified as a learnable convex combination of “primitive” activations, such as rectified linear units (relu), exponential linear units (elu), or hyperbolic tangent (tanh). The goal of these convex combinations is to automate the discovery of the covariance structure in the field, given that activation functions correspond directly to the covariance of random functions defined by Bayesian neural networks^{37}. At the final layer, the output of the feed-forward network is used to parameterize a probability distribution over the observable field values, which serves to capture the fundamental aleatoric uncertainty in the noisy data. Epistemic uncertainty in BayesNF is expressed by assigning prior probability distributions to all learnable parameters, such as covariate scale factors; connection weights, biases, and their variances; and additional parameters of the observation distribution.
We next describe the components of this process in sequence from inputs to outputs in more detail. This description defines a prior distribution over Bayesian Neural Fields—in Methods we discuss ways of inferring the posterior over the random variables defined in Box 1.
Spatiotemporal covariates
Letting (s, t) = ((s_{1}, …, s_{d}), t) denote a generic index in the field, the covariates [x_{1}(s, t), …, x_{m}(s, t)] may include the following functions:
The linear and interaction covariates (5)–(7) are the usual first- and second-order effects used in spatiotemporal trend-surface analysis models (Section 3.2 of ref. ^{17}). In Eq. (8), the temporal seasonal features are defined by a set \({{\mathcal{P}}}=\{{p}_{1},\ldots,{p}_{\ell }\}\) of ℓ seasonal periods, where each p_{i} has harmonics \({{{\mathcal{H}}}}_{{p}_{i}}^{{{\rm{t}}}}\subset \{1,2,\ldots,\lfloor {p}_{i}/2\rfloor \}\) for i = 1, …, ℓ. For example, if the data are hourly and there are ℓ = 2 seasonal effects (daily and monthly), the corresponding periods are p_{1} = 24 and p_{2} = 730.5, respectively. Noninteger periodicities handle seasonal effects that have varying duration in the time measurement unit (e.g., days per month or weeks per year). The Methods section discusses how to construct appropriate seasonal features for a variety of time units and seasonal effect combinations. In Eq. (9), the spatial Fourier features for coordinate s_{i} are determined by a set \({{{\mathcal{H}}}}_{i}^{{{\rm{s}}}}\subset {\mathbb{N}}\) of additional frequencies that capture periodic structure in the i-th dimension (i = 1, …, d). These covariates correct for the tendency of neural networks to learn low-frequency signals^{27}: the empirical evaluation in the next section confirms that their presence greatly improves the quality of learned models. Covariates may also include static (e.g., “continent”) or dynamic (e.g., “temperature”) exogenous features, provided they are known at all locations and time points in the training and testing datasets.
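The covariate families described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the package's implementation: the function name and arguments are hypothetical, and the exact functional forms (sine/cosine pairs at frequency h/p for seasonal terms, and at frequency 2^h for spatial terms) are assumptions, since the displayed equations (5)–(9) are not reproduced in this text.

```python
import numpy as np

def spatiotemporal_covariates(s, t, periods, time_harmonics, space_harmonics):
    """Build an (N, m) covariate matrix (hypothetical helper).

    s: (N, d) spatial coordinates; t: (N,) time indexes.
    periods: seasonal periods p_1..p_l, in the units of t.
    time_harmonics: one list of harmonics per seasonal period.
    space_harmonics: one list of frequencies per spatial dimension.
    """
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    d = s.shape[1]
    cols = [t] + [s[:, i] for i in range(d)]              # linear terms
    cols += [t * s[:, i] for i in range(d)]               # time-space interactions
    cols += [s[:, i] * s[:, j]                            # space-space interactions
             for i in range(d) for j in range(i + 1, d)]
    for p, harmonics in zip(periods, time_harmonics):     # seasonal features
        for h in harmonics:
            angle = 2.0 * np.pi * h * t / p
            cols += [np.cos(angle), np.sin(angle)]
    for i, freqs in enumerate(space_harmonics):           # spatial Fourier features
        for h in freqs:
            angle = 2.0 * np.pi * (2.0 ** h) * s[:, i]    # assumed 2^h frequency
            cols += [np.cos(angle), np.sin(angle)]
    return np.stack(cols, axis=1)
```

For hourly data with daily and monthly effects, a call might pass `periods=[24, 730.5]`; each harmonic contributes one cosine and one sine column.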
Covariate scaling layer
Scaling inputs improves neural network learning (e.g., ref. ^{36}), but determining the appropriate strategy (e.g., z-score, min/max, tanh, batch-norm, layer-norm, etc.) is challenging. BayesNF uses a prior distribution over scale factors to learn these quantities as part of Bayesian inference within the overall probabilistic model. In particular, the next stage in the network is a width-m hidden layer \({h}_{i}^{0}({{\bf{s}}},t)={e}^{{\xi }_{i}^{0}}{x}_{i}({{\bf{s}}},t)\) obtained by randomly scaling each of the m covariates x(s, t), where \({e}^{{\xi }_{i}^{0}}\) is a log-normally distributed scale factor (for i = 1, …, m).
Hidden layers
The model contains L + 1 ≥ 1 hidden layers, where layer ℓ has N ^{ℓ} units \({h}^{\ell }={({h}_{1}^{\ell },\ldots,{h}_{{N}^{\ell }}^{\ell })}^{{\prime} }\) (for ℓ = 1, …, L). These hidden units are derived from N ^{ℓ} pre-activation units \({z}^{\ell }={N}_{\ell -1}^{-1/2}{\Omega }^{\ell }{h}^{\ell -1}+{\beta }^{\ell }\), where \({\Omega }^{\ell }=[{\omega }_{ij}^{\ell };1\le i\le {N}^{\ell },1\le j\le {N}_{\ell -1}]\) is a random N ^{ℓ} × N_{ℓ−1} weight matrix and \({\beta }^{\ell }={({\beta }_{1}^{\ell },\ldots,{\beta }_{{N}^{\ell }}^{\ell })}^{{\prime} }\) is a random bias term. The network parameters \({\omega }_{ij}^{\ell }\) and \({\beta }_{i}^{\ell }\) are drawn i.i.d. N(0, σ ^{ℓ}), where the variance \({\sigma }^{\ell }=\ln (1+{e}^{{\xi }^{\ell }})\) is a learnable parameter whose prior is obtained by applying a softplus transformation to ξ ^{ℓ} ~ N(0, 1). The \({N}_{\ell -1}^{-1/2}\) prefactor ensures the network has a well-defined Gaussian process limit as the number of hidden units N ^{ℓ} → ∞^{24}.
In addition to the covariate scaling layer, BayesNF departs from a traditional Bayesian neural network by using A^{ℓ} ≥ 1 activation functions \(({u}_{1}^{\ell },\ldots,{u}_{{A}^{\ell }}^{\ell })\) at hidden layer ℓ, instead of the usual A^{ℓ} = 1. For example, the architecture shown in Fig. 1 uses A^{ℓ} = 2, where \({u}_{1}^{\ell }\) is the hyperbolic tangent (tanh) and \({u}_{2}^{\ell }\) is the exponential linear unit (elu) activation (for ℓ = 1, 2). Each post-activation unit \({h}_{i}^{\ell }\) (for i = 1, …, N ^{ℓ}) is then a random convex combination of the activations \({u}_{1}^{\ell }({z}_{i}^{\ell }),\ldots,{u}_{{A}^{\ell }}^{\ell }({z}_{i}^{\ell })\), where the coefficient of \({u}_{j}^{\ell }\) is the output of a softmax function \({e}^{{\gamma }_{j}^{\ell }}/{\sum }_{k=1}^{{A}^{\ell }}{e}^{{\gamma }_{k}^{\ell }}\) whose j-th input is \({\gamma }_{j}^{\ell } \sim N(0,1)\) (for j = 1, …, A^{ℓ}). The activation function governs the overall covariance properties of the random function defined by a Bayesian neural network^{24,37}. By specifying the overall activation at each layer as a learnable convex combination of A^{ℓ} “basic” activation functions (e.g., tanh, relu, elu), BayesNF aims to automate the process of selecting an appropriate activation and in turn the covariance structure within the random field.
Finally, the latent stochastic process F(s, t) is defined as the pre-activation unit \({z}_{1}^{L+1}\) of layer L + 1, which has exactly N^{L+1} = 1 unit. We let Θ_{f} denote all n_{f} random network parameters in Box 1 and denote the prior as π_{f}. Further, the notation \({F}_{{\theta }_{f}}({{\bf{s}}},t)\) denotes the (deterministic) value of the process F at index (s, t) when Θ_{f} = θ_{f}.
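To make the generative process concrete, the following NumPy sketch draws one sample of F from a prior of the kind described above: a covariate scaling layer, dense layers with softplus-transformed variances and the fan-in prefactor, and a softmax-weighted convex combination of activations. The function name is hypothetical, and the N(0, 1) prior on the log scale factors and the sharing of one variance between weights and biases within a layer are assumptions; Box 1 in the source gives the authoritative distributions.

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))

def elu(z):
    # Exponential linear unit; the minimum avoids overflow for large positive z.
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def sample_prior_field(x, widths, rng, activations=(np.tanh, elu)):
    """Draw one prior sample of F at the rows of the covariate matrix x (N, m)."""
    # Covariate scaling layer: h^0_i = exp(xi^0_i) * x_i, with xi^0_i ~ N(0, 1) assumed.
    h = x * np.exp(rng.normal(size=x.shape[1]))
    sizes = list(widths) + [1]                  # the final layer has a single unit
    for layer, n_out in enumerate(sizes):
        n_in = h.shape[1]
        sigma = softplus(rng.normal())          # layer variance, softplus of N(0, 1)
        omega = rng.normal(scale=np.sqrt(sigma), size=(n_out, n_in))
        beta = rng.normal(scale=np.sqrt(sigma), size=n_out)
        z = h @ omega.T / np.sqrt(n_in) + beta  # pre-activation with fan-in prefactor
        if layer == len(sizes) - 1:
            return z[:, 0]                      # F(s, t) is the last pre-activation
        gamma = rng.normal(size=len(activations))
        w = np.exp(gamma - gamma.max())
        w /= w.sum()                            # softmax gives convex coefficients
        h = sum(wj * u(z) for wj, u in zip(w, activations))
```

Repeated calls with different random generators give different functions over the same inputs, which is the sense in which the network defines a prior over fields.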
Observation layer
The final layer connects the stochastic process F(s, t) with the observable spatiotemporal field Y(s, t) ~ Dist(F(s, t); Θ_{y}) through a noise model that captures aleatoric uncertainty in the data. The parameter vector \({\Theta }_{y}=({\Theta }_{y,1},\ldots,{\Theta }_{y,{n}_{y}})\) is n_{y}-dimensional and has a prior π_{y}. There are many choices for this distribution, depending on the field Y(s, t); for example,
which correspond to a Gaussian noise model with mean F(s, t) and variance Θ_{y,1} (n_{y} = 1); a Student-T model with location F(s, t), scale Θ_{y,1}, and Θ_{y,2} degrees of freedom (n_{y} = 2); and a Poisson counts model with rate \(\exp F({{\bf{s}}},t)\) (n_{y} = 0), respectively. A key design choice in these observation distributions is that certain parameters such as Θ_{y,1} in Eq. (10) or Θ_{y,1}, Θ_{y,2} in Eq. (11) are not index-specific but rather shared across all inputs, which serves to mitigate the model’s sensitivity to overfitting noise fluctuations from high-frequency Fourier features.
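Written out, the three observation models described in the preceding sentence take the following form. This is a reconstruction from that description; the argument order inside each distribution is a notational assumption.

```latex
\begin{aligned}
Y(\mathbf{s}, t) &\sim \mathrm{Normal}\big(F(\mathbf{s}, t),\, \Theta_{y,1}\big), \\
Y(\mathbf{s}, t) &\sim \mathrm{StudentT}\big(F(\mathbf{s}, t),\, \Theta_{y,1},\, \Theta_{y,2}\big), \\
Y(\mathbf{s}, t) &\sim \mathrm{Poisson}\big(\exp F(\mathbf{s}, t)\big).
\end{aligned}
```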
Posterior inference and querying. Let P(Θ_{f}, Θ_{y}, Y) be the joint probability distribution over the parameters and observable field in Box 1. The posterior distribution given \({{\mathcal{D}}}\) is
While the right-hand side of Eq. (13) is tractable to compute, the left-hand side cannot be normalized or sampled from exactly. In the Posterior Inference section of Methods, we discuss two approximate posterior inference algorithms for BayesNF: maximum a posteriori ensembles and variational inference ensembles. They each produce a collection of parameters \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\approx P({\Theta }_{f},{\Theta }_{y}\mid {{\mathcal{D}}})\) drawn from an approximation to the posterior (13). The Prediction Queries subsection of Methods discusses how these posterior samples can be used to compute point predictions \(\hat{y}({{{\bf{s}}}}_{*},{t}_{*})\) of the spatiotemporal field at a novel index (s_{*}, t_{*}) and the associated prediction intervals \([{\hat{y}}_{{{\rm{low}}}}({{{\bf{s}}}}_{*},{t}_{*}),{\hat{y}}_{{{\rm{hi}}}}({{{\bf{s}}}}_{*},{t}_{*})]\) for a given level α ∈ (0, 1) (e.g., α = 95%).
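One simple way to turn an ensemble of M parameter samples into a point prediction and an interval is sketched below, assuming the Gaussian observation model and an equally weighted mixture over ensemble members. The function name is hypothetical and this is not necessarily the procedure in the Prediction Queries subsection of Methods, which is not reproduced in this text.

```python
import numpy as np

def ensemble_predict(means, variances, level=0.95, n_draws=2000, seed=0):
    """Point prediction and central interval from an equally weighted ensemble.

    means: (M, K) values of F for M ensemble members at K query indexes.
    variances: (M,) Gaussian noise variance of each member.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    rng = np.random.default_rng(seed)
    m, k = means.shape
    member = rng.integers(m, size=(n_draws, k))        # pick a member per draw
    noise = rng.normal(size=(n_draws, k)) * np.sqrt(variances)[member]
    draws = means[member, np.arange(k)] + noise        # samples from the mixture
    y_hat = means.mean(axis=0)                         # mixture mean
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return y_hat, lo, hi
```

Sampling from the mixture, rather than averaging per-member quantiles, lets disagreement between ensemble members widen the interval.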
Prediction accuracy on scientific datasets
Datasets
To quantitatively assess the effectiveness of BayesNF on challenging prediction problems, we curated a benchmark set comprising six publicly available, large-scale spatiotemporal datasets that together cover a range of complex empirical processes:

1. Daily wind speed (km/h) from the Irish Meteorological Service^{38}. 1961-01-01 to 1978-12-31; 12 locations; 78,888 observations, 0% missing.
2. Daily particulate matter 10 (PM10, μg/m^{3}) air quality in Germany from the European Environment Information and Observation Network^{39}. 1998-01-01 to 2009-12-31; 70 locations; 149,151 observations, 52% missing.
3. Hourly particulate matter 10 (PM10, μg/m^{3}) from the London Air Quality Network^{20}. 2018-12-31 to 2019-03-31; 72 locations; 144,570 observations, 7% missing.
4. Weekly chickenpox counts (thousands) from the Hungarian National Epidemiology Center^{40}. 2005-01-03 to 2014-12-29; 20 locations; 10,440 observations, 0% missing.
5. Monthly accumulated precipitation (mm) in Colorado and surrounding areas from the University Corporation for Atmospheric Research^{41}. 1950-01-01 to 1997-12-01; 358 locations; 134,800 observations, 35% missing.
6. Monthly sea surface temperature (°C) anomalies in the Pacific Ocean from the National Oceanic and Atmospheric Administration Climate Prediction Center^{17}. 1970-01-01 to 2003-03-01; 2261 locations; 902,139 observations, 0% missing.
Table 1 summarizes key statistics of these datasets. Figure 2 shows snapshots of the observed data at a fixed point in time (Fig. 2a) and in space (Fig. 2b), highlighting the complex statistical patterns (e.g., nonstationarity and periodicity) in the underlying fields along these two dimensions. Five train/test splits were created for each benchmark. Each test set contains (#locations)/(#splits) locations, holding out the 10% most recent observations.
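The split construction can be sketched as follows, under one possible reading of the description above: locations are partitioned into disjoint groups (one per split), and the test set of each split holds the most recent 10% of observations at that group's locations. The function name is hypothetical, and both this reading and the random assignment of locations to groups are assumptions.

```python
import numpy as np

def make_splits(location_ids, times, n_splits=5, recent_frac=0.1, seed=0):
    """Return one boolean test mask per split (hypothetical helper)."""
    location_ids = np.asarray(location_ids)
    times = np.asarray(times)
    rng = np.random.default_rng(seed)
    # Partition the unique locations into n_splits disjoint groups.
    groups = np.array_split(rng.permutation(np.unique(location_ids)), n_splits)
    masks = []
    for group in groups:
        mask = np.zeros(location_ids.shape[0], dtype=bool)
        for loc in group:
            rows = np.flatnonzero(location_ids == loc)
            n_test = max(1, int(round(recent_frac * rows.size)))
            # Hold out the most recent observations at this location.
            newest = rows[np.argsort(times[rows], kind="stable")[-n_test:]]
            mask[newest] = True
        masks.append(mask)
    return masks
```

The training set of a split is then the complement of its mask, so every location still contributes earlier observations for training.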
Baselines
The prediction accuracy on the benchmark datasets in Table 1 using BayesNF is compared to several state-of-the-art baselines. This evaluation focuses specifically on baseline methods that (i) have high-quality and widely used open-source implementations; (ii) can generate both point and interval predictions; and (iii) are directly applicable to new spatiotemporal datasets (e.g., those in Table 1) without the need to redevelop substantial parts of the model. The methods are:

1. StSVGP: Spatiotemporal Sparse Variational Gaussian Process^{20}. This method handles large datasets (i.e., linear-time scaling in the number of time points) by leveraging a state-space representation based on stochastic partial differential equations and Bayesian parallel filtering and smoothing on GPUs. Parameter estimation is performed using natural gradient variational inference.
2. StGBoost: Spatiotemporal Gradient Boosting Trees^{42}. Prediction intervals are estimated by minimizing the quantile loss using an ensemble of 1000 tree estimators. As this baseline is not a typical time series model, the same covariates [x_{1}(s, t), …, x_{m}(s, t)] (5)–(9) provided to BayesNF are also provided as regression inputs.
3. StGLMM: Spatiotemporal Generalized Linear Mixed Effects Models^{19}. These methods handle large datasets by integrating latent Gaussian-Markov random fields with stochastic partial differential equations. Parameter estimation is performed using maximum marginal likelihood inference. Three observation noise processes are considered:
   - IID: Independent and identically distributed Gaussian errors.
   - AR1: Order 1 autoregressive Gaussian errors.
   - RW: Gaussian random walk errors.
4. NBEATS: Neural Basis Expansion Analysis^{43}. This baseline employs a “window-based” deep learning autoregressive model where future data is predicted over a fixed-size horizon conditioned on a window of previous observations and exogenous features. The model is configured with indicators for all applicable seasonal components (e.g., hour of day, day of week, day of month, week of year, month) as well as trend and seasonal Fourier features. The method contains a large number of numeric hyperparameters, which are automatically tuned using the NeuralForecast^{44} package. Prediction intervals are estimated by minimizing quantile loss.
5. TSReg: Trend Surface Regression with Ordinary Least Squares (OLS) (Section 3.2 of ref. ^{17}). The observation noise model is Gaussian with maximum likelihood estimation of the variance. As with StGBoost, the regression covariates are identical to those provided to BayesNF.
6. BayesNF: Bayesian Neural Field, using variational and maximum a posteriori inference.

We also attempted to use the fixed-rank kriging (Frk) method^{22}, but were unable to perform inference over noise parameters for spatiotemporal data. Taken together, the baselines provide broad coverage over recent statistical, machine learning, and deep learning methods for large-scale prediction. All methods were run on a TPU v3-8 accelerator, which consists of 8 cores each with 16 GiB of memory. Additional evaluation details are described in Methods.
Quantitative results
Table 2 shows accuracy and runtime results for all baselines and benchmarks. Point predictions are evaluated using root-mean-square error (RMSE (25)) and mean absolute error (MAE (26)), and 95% prediction intervals are evaluated using the mean interval score (MIS (27)), averaged over all train/test splits. The final column shows the wall-clock runtime in seconds for which each method was run. While runtimes cannot be perfectly aligned due to the variety of learning algorithms used and their iterative nature, the wall-clock numbers show that all baselines were run for sufficiently long to ensure a fair comparison. Figure 3 compares predictions on held-out data at one representative spatial location in each of the six benchmarks. We discuss several takeaways from these results.
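The three metrics can be sketched in NumPy as follows. The RMSE and MAE are standard; the interval score uses the common definition (interval width plus a 2/α penalty for observations falling outside the interval, with α = 1 − level), which is assumed here to match Eq. (27) of Methods since that equation is not reproduced in this text.

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error of point predictions."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error of point predictions."""
    return float(np.mean(np.abs(y - y_hat)))

def mean_interval_score(y, lo, hi, level=0.95):
    """Mean interval score for central prediction intervals [lo, hi]."""
    alpha = 1.0 - level
    below = np.maximum(lo - y, 0.0)   # amount by which y falls under the interval
    above = np.maximum(y - hi, 0.0)   # amount by which y exceeds the interval
    return float(np.mean((hi - lo) + (2.0 / alpha) * (below + above)))
```

Lower is better for all three; the interval score rewards narrow intervals but penalizes misses heavily, so it measures sharpness and calibration together.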
BayesNF using VI is the strongest method in 12/18 cases, followed by BayesNF using MAP, which is tied with VI in 3/18 cases (Precipitation) and superior in 3/18 cases (Sea Surface Temperature). In 2/18 cases (Chickenpox; MAE and RMSE), errors from the BayesNF methods are slightly higher than those of the StGLMM (AR1) baseline, although the running time of the latter is ~4x higher. The most apparent improvements of BayesNF occur in the Wind Speed, Precipitation, and Sea Surface Temperature datasets, shown qualitatively in rows 1, 5, 6 of Fig. 3. Results using additional ablations are discussed in the Ablations subsection of Methods. Combined with Table 2, these results highlight the expressive modeling capacity of BayesNF models, their ability to accurately quantify predictive uncertainty, and the benefit of using spatial embeddings to capture high-frequency signals in the data.
While predictions from StSVGP generally follow the overall “shape” of the held-out data, the mean and interval predictions are not well calibrated (Fig. 3, second column). StSVGP requires several modeling trade-offs to ensure linear-time scaling in the number of time points, including the use of Matérn kernels (which cannot express effects such as seasonality) and kernels that are separable in time and space. Additional difficulties include manually selecting the number of spatial inducing points and the complex algorithms needed to optimize their locations. StSVGP runs out of memory on the Sea Surface Temperature benchmark (roughly 1 million observations).
The StGLMM methods (AR1, IID, RW) fail to complete on 4/6 benchmarks. The scaling characteristics are also unpredictable: for example, StGLMM runs on Air Quality 2 (144,570 observations) but fails on Wind Speed (78,888 observations). On the two datasets they can handle (rows 3 and 4 of Fig. 3), the StGLMM methods are highly competitive on Chickenpox and not competitive on Air Quality 2, with the AR1 error model delivering the lowest errors.
StGBoost delivers reasonable prediction intervals but its point predictions underfit (Fig. 3, third column). It has a high computational cost because (i) a large number of estimators is needed to obtain accurate predictions (using 1000 estimators provided statistically significant improvements over 500 estimators in 17/18 benchmarks); (ii) three models must be separately trained from scratch: one model to predict the mean and two models to predict upper and lower quantiles. Whereas BayesNF uses a single learned distribution for all queries, StGBoost trains different models for different queries, which does not guarantee probabilistically coherent answers.
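The quantile (pinball) loss that underlies this kind of interval estimation can be sketched as below. This is the standard definition, shown for illustration; the function name is hypothetical and the snippet does not reproduce the StGBoost training setup.

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Mean quantile loss of predictions y_hat for quantile level q in (0, 1)."""
    diff = y - y_hat
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

Minimizing this loss with q = 0.025 and q = 0.975 yields the lower and upper ends of a 95% interval, which is why separate models are needed for the two quantiles and for the mean.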
N-BEATS is only competitive on the Sea Surface Temperature benchmark, where it is the next-best baseline after BayesNF. Its runtime on this benchmark is 3x–4x faster than BayesNF due to automatic early stopping. The method fails to deliver predictions on the Precipitation benchmark because the training and test datasets contain time series that are too sparse to handle; e.g., the number of observed time points is smaller than the autoregressive window size or prediction horizon. The prediction errors on the remaining three benchmarks are high even though all the seasonal effects were added to the model, suggesting that either (i) the model is not able to effectively leverage spatial correlations for cross-time-series learning; or (ii) the hyperparameter tuning algorithm does not converge to sensible values within the allotted time.
TSReg requires less than 1 second to train, but does not capture any meaningful structure and produces poor predictions. Using LASSO or ridge regression instead of OLS did not improve the results. TSReg uses identical covariates to BayesNF but performs much worse, highlighting the need to capture nonlinear dependencies in the data for generating accurate forecasts.
Analyzing German air quality data
Atmospheric particulate matter (PM10) is a key indicator of air quality used by governments worldwide, as these particles can induce adverse health effects when inhaled into the lungs. Accurate predictions of PM10 values at novel points in space and time within a geographic region can help decision makers characterize pollution patterns and inform public health decisions.
We explore predictions from BayesNF on the German Air Quality dataset^{39}, which contains daily PM10 measurements from 70 stations between 1998-01-01 and 2009-12-31. We infer a BayesNF model for this dataset with depth H = 2; weekly, monthly, and yearly seasonal effects (8); and harmonics \({{{\mathcal{H}}}}_{1}^{{{\rm{s}}}}={{{\mathcal{H}}}}_{2}^{{{\rm{s}}}}=\{1,\ldots,4\}\) for the spatial Fourier features (9). The distribution of Y given the stochastic process F is a StudentT (11) truncated to \({{\mathbb{R}}}_{\ge 0}\).
Spatial and temporal interpolation
Figure 4a shows the PM10 observations at 2003-02-01, 2005-01-01, 2005-04-01, and 2007-01-01, where roughly 50% of the stations do not have an observed measurement at a given point in time. Figure 4b shows the median PM10 predictions y_{0.5}(s_{*}, t_{*}) (24) interpolated at a grid of 10,000 novel spatial indexes (s_{*}, t_{*}) within Germany. Figure 4c shows the width \({\hat{y}}_{{{\rm{hi}}}}({{{\bf{s}}}}_{*},{t}_{*})-{\hat{y}}_{{{\rm{low}}}}({{{\bf{s}}}}_{*},{t}_{*})\) of the inferred 95% prediction interval. These plots reflect the spatiotemporal structure captured by BayesNF and identify coordinates within the field with low and high predictive uncertainty about air pollution. The axis-aligned artifacts in Fig. 4b, where predictions are consistent along certain thin regions, are a result of the spatial Fourier features (9). How well these artifacts reflect the true behavior can be empirically investigated by obtaining PM10 measurements at the novel locations along these regions. Figure 4d shows the observed and median predicted PM10 values across all time points at four stations with the highest missing data rates: DEBWO31, southwest Germany, 51% missing; DEBB056, northeast Germany, 84% missing; DEBU034, northwest Germany, 99% missing; DESL008, west Germany, 89% missing. PM10 trajectories predicted by BayesNF at time points where data is missing reproduce the temporal patterns at time points with observed data, which include high-frequency periodic variation and irregular, spatially correlated jumps.
Variography
The accuracy of PM10 predictions in Fig. 4d cannot be quantitatively assessed because the ground-truth values are not known at the predicted time points. However, we can gain more insight into how well the learned spatiotemporal field matches the observed field by comparing the empirical and inferred semivariograms. The semivariogram γ of a process Y characterizes the joint spatiotemporal dependence structure; it is defined as
\[\gamma ({\bf{h}},\tau )=\frac{1}{2}{\rm{Var}}\left[Y({\bf{s}}+{\bf{h}},t+\tau )-Y({\bf{s}},t)\right],\]
where the choice of \({{\bf{s}}}\in {{\mathcal{S}}},t\in {{\mathcal{T}}}\) is arbitrary (e.g., (s, t) = (0, 0)), under the assumption that only the displacements in time and space affect the dependence (Section 2.4.2 of ref. ^{17}).
The surface plots in Fig. 5 compare the empirical semivariogram (left) computed at the 70 observed stations with the inferred semivariogram (right) computed at 70 uniformly chosen random locations within Germany, for distances h ∈ [0, 1000] kilometers and time lags τ ∈ {0, …, 10} days. The agreement between these two plots suggests that BayesNF accurately generalizes the spatiotemporal dependence structure from the observed locations to novel locations in the field. The lower two panels in Fig. 5 show the empirical (solid line) and inferred (dashed line) semivariograms, separately for each of the 10 time lags τ. The difference between the semivariograms is highest for τ ∈ {0, 1, 2} days, suggesting that the learned model is expressing relatively smooth phenomena and assuming that the high-frequency day-to-day variance is due to unpredictable independent noise. The differences between the semivariograms become small for τ > 2 days, which suggests that BayesNF effectively captures these longer-term temporal dependencies.
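As a rough illustration of how an empirical spatiotemporal semivariogram can be estimated from station data, the sketch below uses the classical method-of-moments estimator: it averages half the squared differences over all pairs of observations whose spatial separation falls in a given distance bin and whose time separation equals a given lag. The function name `empirical_semivariogram`, the planar Euclidean distance, and the binning scheme are assumptions made for this example; they are not taken from the BayesNF software or from the procedure used to produce Fig. 5.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, series, dist_bins, max_lag):
    """Method-of-moments estimate of the semivariogram gamma(h, tau).

    coords:    list of (x, y) station positions in km (planar; an assumption).
    series:    one list of values per station, equal lengths, None = missing.
    dist_bins: increasing bin edges for the spatial distance h.
    max_lag:   largest time lag tau (in time steps) to evaluate.
    Returns a dict mapping (bin_index, lag) -> estimated semivariance.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n_stations = len(coords)
    n_times = len(series[0])
    for i in range(n_stations):
        for j in range(n_stations):
            h = math.dist(coords[i], coords[j])
            # Locate the distance bin [edge_k, edge_{k+1}) containing h.
            b = next((k for k in range(len(dist_bins) - 1)
                      if dist_bins[k] <= h < dist_bins[k + 1]), None)
            if b is None:
                continue
            for lag in range(max_lag + 1):
                if lag == 0 and i >= j:
                    continue  # skip self-pairs and double counting at lag 0
                for t in range(n_times - lag):
                    y0, y1 = series[i][t], series[j][t + lag]
                    if y0 is None or y1 is None:
                        continue  # missing observations contribute nothing
                    sums[(b, lag)] += 0.5 * (y1 - y0) ** 2
                    counts[(b, lag)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy usage: two stations 5 km apart, three time steps.
gamma = empirical_semivariogram(
    coords=[(0.0, 0.0), (3.0, 4.0)],
    series=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    dist_bins=[0.0, 1.0, 10.0],
    max_lag=1,
)
```

An inferred semivariogram could be computed with the same function by passing values simulated from a fitted model at novel locations in place of the observed series. For real geographic coordinates, a great-circle distance would be a more appropriate choice than the planar distance assumed here.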
Discussion
This article proposes a probabilistic approach to scalable spatiotemporal prediction called the Bayesian Neural Field. The model combines a deep neural network architecture for high-capacity function approximation with hierarchical Bayesian modeling for accurate uncertainty estimation over complex spatiotemporal fields. Posterior inference is conducted using stochastic ensembles of maximum a-posteriori estimation or variationally trained surrogates, which are easy to apply and deliver well-calibrated 95% prediction intervals over test data. The results in Fig. 6 confirm that quantifying uncertainty using MAP or VI ensembles is superior to performing maximum-likelihood estimation (MLE), which ignores the parameter priors. While these inference methods are approximate in nature and are not guaranteed to match the true posterior, the BayesNF model is a deep neural network where interpreting parameters such as weights and biases is not of inherent interest to a practitioner in a given data analysis task. Rather, we expect BayesNF to be most useful in cases where the predictive calibration is more relevant. Additional advantages of BayesNF are its relative simplicity, ability to handle missing data, and ability to learn a full probability distribution over arbitrary space-time indexes within the spatiotemporal field.
Evaluations against prominent statistical and machine learning baselines on large-scale datasets show that BayesNF delivers significant improvements in both point and interval forecasts. The results also show that combining periodic effects in the temporal domain with Fourier features in the spatial domain enables BayesNF to capture spatiotemporal patterns with multiple (non-integer) periodicity and high-frequency components. As a domain-general method, BayesNF can produce strong results on multiple datasets without the need to hand-design the model from scratch each time or apply dataset-specific inference approximations. For a representative air quality dataset, the semivariograms inferred by BayesNF evaluated at novel spatial locations agree with the empirical semivariogram computed at observed locations, which highlights the model’s ability to generalize well in space and time.
Practitioners across a spectrum of disciplines, from meteorology to urban studies and environmental informatics, are in need of more scalable and easy-to-use statistical methods for spatiotemporal prediction. A freely available implementation of BayesNF built on the Jax machine learning platform, along with user documentation and tutorials, is available at https://github.com/google/bayesnf. We hope these materials help practitioners obtain strong BayesNF models for many spatiotemporal problems that existing software cannot easily handle.
The approach discussed in this paper opens several avenues to future work. While Bayesian Neural Fields are designed to minimize the user’s involvement in constructing a predictive model, further improvements can be achieved by enabling domain experts to incorporate specific statistical covariance structure that they know to be present. It is also worthwhile to explore applications of BayesNF for modeling the residuals of causal or mechanistic laws in physical systems where there exist strong domain theories of the average data-generating process, but poor models of the empirical noise process. Another promising extension is using BayesNF models to handle not only “geostatistical” datasets, in which the measurements are point-referenced in space, but also “areal” or “lattice” datasets, where the measurements represent aggregated quantities over a geographical region. While areal datasets are often converted to geostatistical datasets by using the centroid of the region as the representative point, a more principled approach would be to compute the integral of a Bayesian Neural Field over the region. Finally, BayesNF can be generalized to handle multivariate spatiotemporal data, where each spatial location is associated with multiple time series that contain within-location and across-location covariance structure. Effectively handling such datasets will even further broaden the scope of problems that BayesNF can solve.
Methods
Posterior inference
Let P(Θ_{f}, Θ_{y}, Y) denote the joint probability distribution over the parameters and observable field in Box 1. The posterior distribution is given by Eq. (13) in the main text. We describe two approximate posterior inference algorithms for BayesNF. In these sections, we define Θ = (Θ_{f}, Θ_{y}), θ = (θ_{f}, θ_{y}) and r = (s, t).
Stochastic MAP ensembles
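A simple approach to uncertainty quantification is based on the “maximum a-posteriori” estimate:
\[({\hat{\theta }}_{f},{\hat{\theta }}_{y})=\mathop{\arg \max }\limits_{{\theta }_{f},{\theta }_{y}}\ \log P({\theta }_{f},{\theta }_{y}\mid {\mathcal{D}}). \tag{15}\]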
We find an approximate solution to the optimization problem (15) using stochastic gradient ascent on the joint log probability, according to the following procedure, where B ≤ N is a minibatch size and (ϵ_{1}, ϵ_{2}, … ) is a sequence of learning rates:
Repeat until convergence
We construct an overall “deep ensemble” \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\) containing M ≥ 1 MAP estimates by repeating the above procedure M times, each with a different initialization of θ_{0} and random seed.
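The following minimal sketch illustrates the general idea of a stochastic MAP ensemble on a deliberately simple toy model, not on the BayesNF network itself. It assumes a scalar parameter with a Normal(0, prior_scale²) prior and unit-variance Gaussian observations, so that the gradient of the log joint is available in closed form. The function names `map_estimate` and `map_ensemble`, the learning-rate schedule, and the toy model are all assumptions made for illustration.

```python
import random


def map_estimate(data, prior_scale, batch_size, num_steps, seed):
    """Stochastic gradient ascent on the log joint of a toy model.

    Toy model (an assumption for illustration only):
        theta ~ Normal(0, prior_scale^2),  y_i | theta ~ Normal(theta, 1).
    """
    rng = random.Random(seed)
    n = len(data)
    theta = rng.gauss(0.0, prior_scale)  # random initialization theta_0
    for step in range(num_steps):
        batch = rng.sample(data, batch_size)  # minibatch of size B <= N
        grad_log_prior = -theta / prior_scale ** 2
        # Rescale the minibatch term by N / B so that it is an unbiased
        # estimate of the full-data log-likelihood gradient.
        grad_log_lik = (n / batch_size) * sum(y - theta for y in batch)
        learning_rate = 1.0 / (n * (step + 1))  # decaying step sizes
        theta += learning_rate * (grad_log_prior + grad_log_lik)
    return theta


def map_ensemble(data, num_members, **kwargs):
    """Repeat the procedure with a different seed for each ensemble member."""
    return [map_estimate(data, seed=s, **kwargs) for s in range(num_members)]


data_rng = random.Random(0)
data = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
ensemble = map_ensemble(data, num_members=4, prior_scale=10.0,
                        batch_size=20, num_steps=500)
# Closed-form MAP estimate for this toy model, used only as a sanity check.
exact_map = sum(data) / (len(data) + 1.0 / 10.0 ** 2)
```

In this toy problem every ensemble member lands close to the closed-form MAP estimate, because the log joint is concave. For a deep network the objective is non-concave, so members started from different initializations would generally converge to different modes, which is what gives a deep ensemble its diversity.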
Stochastic variational inference
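A more uncertainty-aware alternative to MAP ensembles is mean-field variational inference, which uses a surrogate posterior \({q}_{\phi }(\theta )={\prod }_{i=1}^{{n}_{f}}\nu ({\theta }_{f,i};{\phi }_{f,i})\mathop{\prod }_{i=1}^{{n}_{y}}\nu ({\theta }_{y,i};{\phi }_{y,i})\) over Θ to approximate the true posterior \(P({\theta }_{f},{\theta }_{y}\mid {{\mathcal{D}}})\) (13) given the data \({{\mathcal{D}}}\). Optimal values for the variational parameters \(\phi=({\phi }_{f,1},\ldots,{\phi }_{f,{n}_{f}},{\phi }_{y,1},\ldots,{\phi }_{y,{n}_{y}})\) are obtained by maximizing the “evidence lower bound”:
\[{\rm{ELBO}}(\phi )={\mathbb{E}}_{\theta \sim {q}_{\phi }}\left[\log P({\mathcal{D}}\mid \theta )\right]-{\rm{KL}}\left({q}_{\phi }\,\|\,\pi \right) \tag{21}\]
\[={\mathbb{E}}_{\theta \sim {q}_{\phi }}\left[\log P({\mathcal{D}}\mid \theta )\right]-\sum_{i=1}^{{n}_{f}}{\rm{KL}}\left(\nu (\cdot ;{\phi }_{f,i})\,\|\,{\pi }_{f,i}\right)-\sum_{i=1}^{{n}_{y}}{\rm{KL}}\left(\nu (\cdot ;{\phi }_{y,i})\,\|\,{\pi }_{y,i}\right), \tag{22}\]
where π denotes the prior over Θ, with factors π_{f,i} and π_{y,i} for the individual parameters, and Eq. (22) follows from the independence of the priors. Finding the maximum of Eq. (22) is a challenging optimization problem. Our implementation leverages a Gaussian variational posterior q_{ϕ} with KL reweighting, as described in Blundell et al. (Sections 3.2 and 3.4 of ref. ^{45}).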
Mean-field variational inference is known to underestimate posterior variance and can also get stuck in local optima of Eq. (21). To alleviate these problems, we use a variational ensemble that is analogous to the MAP ensemble described above. More specifically, we first perform M ≥ 1 runs of stochastic variational inference with different initializations and random seeds, which gives us an ensemble {ϕ^{i}, i = 1, …, M} of variational parameters. We then approximate the posterior \(P(\theta \mid {{\mathcal{D}}})\) with an equal-weighted mixture of the resulting variational distributions \({\{{q}_{{\phi }^{i}}\}}_{i=1}^{M}\).
Prediction queries
We can approximate the posterior (13) using a set of samples \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\), which may be obtained from either MAP ensemble estimation or stochastic variational inference (by sampling from the ensemble of M variational distributions). We can then approximate the posterior-predictive distribution \(P(Y({{{\bf{r}}}}_{*})\mid {{\mathcal{D}}})\) (which marginalizes out the parameters Θ) of Y(r_{*}) at a novel field index r_{*} = (s_{*}, t_{*}) by a mixture model with M equally weighted components:
\[P(Y({\bf{r}}_{*})\mid {\mathcal{D}})\approx \frac{1}{M}\sum_{i=1}^{M}P(Y({\bf{r}}_{*})\mid {\theta }_{f}^{i},{\theta }_{y}^{i}). \tag{23}\]
Equipped with Eq. (23), we can directly compute predictive probabilities of events {Y(r_{*}) ≤ y}, predictive probability densities {Y(r_{*}) = y}, or conditional expectations \({\mathbb{E}}\left[\varphi (Y({{{\bf{r}}}}_{*}))\mid {{\mathcal{D}}}\right]\) for a probe function \(\varphi :{\mathbb{R}}\to {\mathbb{R}}\). Prediction intervals around Y(r_{*}) are estimated by computing the α-quantile y_{α}(r_{*}), which satisfies
\[P(Y({\bf{r}}_{*})\le {y}_{\alpha }({\bf{r}}_{*})\mid {\mathcal{D}})=\alpha . \tag{24}\]
For example, the median estimate is y_{0.50}(s_{*}, t_{*}) and 95% prediction interval is [y_{0.025}(s_{*}, t_{*}), y_{0.975}(s_{*}, t_{*})]. The quantiles (24) are estimated numerically using Chandrupatla’s root finding algorithm^{46} on the cumulative distribution function of the mixture (23).
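To make the quantile computation concrete, the sketch below evaluates the cumulative distribution function of an equally weighted mixture and inverts it numerically. Two simplifications are assumed: the mixture components are Normal distributions rather than the observation distributions used by BayesNF, and plain bisection stands in for Chandrupatla’s algorithm, which solves the same root-finding problem with faster convergence. The names `mixture_cdf` and `mixture_quantile` are hypothetical.

```python
from statistics import NormalDist


def mixture_cdf(y, components):
    """CDF of an equally weighted mixture of Normal(mu, sigma) components."""
    total = sum(NormalDist(mu, sigma).cdf(y) for mu, sigma in components)
    return total / len(components)


def mixture_quantile(alpha, components, tol=1e-10):
    """Solve mixture_cdf(y) = alpha for y by bisection."""
    # Bracket the root well outside the bulk of every component.
    lo = min(mu - 12.0 * sigma for mu, sigma in components)
    hi = max(mu + 12.0 * sigma for mu, sigma in components)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, components) < alpha:
            lo = mid  # the quantile lies to the right of mid
        else:
            hi = mid  # the quantile lies at or to the left of mid
    return 0.5 * (lo + hi)


# Toy ensemble of M = 3 predictive components given as (mean, scale) pairs.
components = [(-1.0, 1.0), (1.0, 1.0), (0.0, 2.0)]
median = mixture_quantile(0.5, components)
interval = (mixture_quantile(0.025, components),
            mixture_quantile(0.975, components))
```

Because the mixture CDF is monotone, any bracketing root finder converges to the same quantile; the choice between bisection and a hybrid method such as Chandrupatla’s affects only the number of CDF evaluations required.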
Temporal seasonal features
Including seasonal features (cf. Eq. (8)), where possible, is often essential for accurate prediction. Example periodic multiples p for datasets with a variety of time units and seasonal components are listed below (Y=Yearly; Q=Quarterly; Mo=Monthly; W=Weekly; D=Daily; H=Hourly; Mi=Minutely; S=Secondly):

Q: Y=4

Mo: Q=3, Y=12

W: Mo=4.35, Q=13.045, Y=52.18

D: W=7, Mo=30.44, Q=91.32, Y=365.25

H: D=24, W=168, Mo=730.5, Q=2191.5, Y=8766

Mi: H=60, D=1440, W=10080, Mo=43830, Q=131490, Y=525960

S: Mi=60, H=3600, D=86400, W=604800, Mo=2629800, Q=7889400, Y=31557600
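The multiples listed above can be derived from the length of each time unit, as in the following sketch, which assumes a 365.25-day year with months and quarters defined as one twelfth and one quarter of a year. Under these assumptions the computed values agree with the listed multiples to within about 0.01. The helper `seasonal_features` shows one common way to turn a period p into cosine and sine covariates; its exact form is an assumption and may differ from Eq. (8).

```python
import math

# Length of each time unit in seconds, assuming a 365.25-day year.
SECONDS = {
    "S": 1.0,
    "Mi": 60.0,
    "H": 3600.0,
    "D": 86400.0,
    "W": 604800.0,
    "Mo": 365.25 * 86400.0 / 12.0,
    "Q": 365.25 * 86400.0 / 4.0,
    "Y": 365.25 * 86400.0,
}


def period(time_unit, season):
    """Number of `time_unit` steps in one `season` cycle (the multiple p)."""
    return SECONDS[season] / SECONDS[time_unit]


def seasonal_features(t, p, harmonics):
    """Cosine/sine pairs with period p / h for each harmonic h (assumed form)."""
    features = []
    for h in harmonics:
        angle = 2.0 * math.pi * h * t / p
        features.extend([math.cos(angle), math.sin(angle)])
    return features


# Weekly seasonality for daily data, evaluated at day index t = 3.
weekly = seasonal_features(t=3, p=period("D", "W"), harmonics=[1, 2])
```

For example, `period("H", "W")` gives 168 and `period("D", "Y")` gives 365.25, matching the H and D rows of the list.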
Ablations
To better understand how the prediction accuracy of BayesNF varies with the choices of inference algorithm and network architecture, results from two classes of ablation studies for the benchmarks in Table 2 are reported.
Inference methods: comparison of VI, MAP, and MLE
Figure 6 shows a comparison of runtime vs. accuracy profiles on the six benchmarks from Table 1 using three parameter inference methods for BayesNF: VI, MAP, and MLE. MLE is the maximum likelihood estimation baseline described in Lakshminarayanan et al.^{47}, which is identical to Box 1 except that the terms π_{f} and π_{y} in Eq. (18) are ignored. MLE performs no better than MAP or VI in all 18/18 profiles (and is typically worse), illustrating the benefits of parameter priors and posterior uncertainty, which do not impose runtime overhead. Between MAP and VI, the latter performs better in 13/18 profiles: that is, on all metrics for Wind, Air Quality 1, and Air Quality 2; on RMSE and MAE for Chickenpox; and on RMSE and MIS for Precipitation.
Model architectures
Figure 7 shows the percentage change in RMSE, MAE, MIS, and runtime using BayesNF (MAP inference; 64 particles; fixed number of training epochs) while applying a single change to the reference model for each benchmark. The goal of these ablations is to study how changes to the network structure affect the predictive performance.
Figure 7a, b shows results for decreasing or increasing the network depth by one layer. The Sea Surface Temperature benchmark is the most sensitive to the network depth, where decreasing the depth causes the forecast errors to increase by around 50%, whereas increasing the depth delivers 5–10% decreases. The MIS error is particularly sensitive to reducing the network depth where the results become significantly worse in 5/6 benchmarks, although the runtime also decreases by up to 50%.
Figure 7c, d shows results for halving or doubling the width of the hidden layers. The Sea Surface Temperature benchmark is highly sensitive to halving the network width, with errors increasing above 25%. The remaining benchmarks demonstrate slight improvements in the errors which are not statistically significant, suggesting that the runtime gains could justify halving the width in these benchmarks. Doubling the width causes substantial increases in the runtime with no systematic pattern in the RMSE, MAE, or MIS values across the benchmarks.
Figure 7e, f shows results using only tanh or elu activations instead of the convex combination layer. Discarding the convex combination layer delivers runtime speedups, which are larger using tanh as compared to elu. However, there is no clear winner in terms of error when using only tanh or only elu; and no error metric is consistently negative by selecting one of the two activations. The changes in error which are consistently positive (as compared to the convex combination layer) are (i) tanh only: Air Quality 2 (MIS 16%); (ii) elu only: Sea Surface Temperature (RMSE 59%, MAE 76%, MIS 49%) and Precipitation (MAE 7.8%, MIS 16%).
Figure 7g shows results for disabling the covariate scaling layer. The runtime is only slightly changed in all benchmarks. However, several errors increase consistently on average, namely in the Precipitation (RMSE 24%, MAE 27%, MIS 33%), Chickenpox (MIS 32%), and Air Quality 1 (MAE 13%) benchmarks. The remaining changes are neither consistently above nor below zero.
Figure 7h shows results for omitting the spatial Fourier features (Eq. (9)). While omitting these features delivers small runtime improvements, it also causes substantial increases in RMSE, MAE, and MIS values across all benchmarks except for Wind. These results support the hypothesis that spatial Fourier features are essential for accurate generalization across space and time.
In summary, the results (specifically Fig. 7e–h) demonstrate that architectural choices in BayesNF such as the spatial Fourier features, convex combination layer, and covariate scaling are effective in reducing the prediction error across several benchmarks and metrics at the cost of a manageable runtime overhead.
Evaluation metrics
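The quality of point forecasts is evaluated using RMSE and MAE scores. Interval forecasts are evaluated using the MIS score at level α = 0.05. The definitions are as follows:
\[{\rm{RMSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}},\qquad {\rm{MAE}}=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-{\hat{y}}_{i}\right|,\]
\[{\rm{MIS}}=\frac{1}{n}\sum_{i=1}^{n}\left[({u}_{i}-{\ell }_{i})+\frac{2}{\alpha }({\ell }_{i}-{y}_{i}){\bf{1}}[{y}_{i} < {\ell }_{i}]+\frac{2}{\alpha }({y}_{i}-{u}_{i}){\bf{1}}[{y}_{i} > {u}_{i}]\right],\]
where n is the number of test observations, y_{i} is the true value, \({\hat{y}}_{i}\) is the point forecast, and (ℓ_{i}, u_{i}) are endpoints of the interval forecast.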
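As a small worked example, the three scores can be computed as in the sketch below. It assumes the standard definitions of root mean squared error, mean absolute error, and the mean interval score, in which an observation falling outside the interval adds a penalty of 2/α times its distance to the nearest endpoint. The function names are hypothetical and are not taken from the evaluation pipeline in the repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of point forecasts."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error of point forecasts."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_interval_score(y_true, lower, upper, alpha=0.05):
    """Mean interval score of (1 - alpha) prediction intervals; lower is better."""
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        total += hi - lo  # reward narrow intervals
        if y < lo:
            total += (2.0 / alpha) * (lo - y)  # penalty for missing below
        elif y > hi:
            total += (2.0 / alpha) * (y - hi)  # penalty for missing above
    return total / len(y_true)


# Two test points: the first is covered by its interval, the second is not.
score = mean_interval_score([0.0, 5.0], lower=[-1.0, 0.0], upper=[1.0, 4.0])
```

In the toy call, the covered point contributes only its interval width of 2, whereas the missed point contributes its width of 4 plus a penalty of 40 × 1, which shows how strongly the score punishes miscoverage at α = 0.05.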
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets from Table 1 are publicly available under open-source licenses.
• Wind Speed. GNU GPL v2. https://r-spatial.github.io/gstat/reference/wind.html.
• Air Quality 1. GNU GPL v3. https://rdrr.io/cran/spacetime/man/air.html.
• Air Quality 2. CC Attribution 1.0 Generic. https://doi.org/10.5281/zenodo.4531304.
• Chickenpox Cases. CC Attribution 4.0 International. https://doi.org/10.24432/C5103B.
• Precipitation. Public Domain. https://www.image.ucar.edu/Data/US.monthly.met/.
• Sea Surface Temperature. GNU GPL v2. https://github.com/andrewzm/STRbook/.
The full datasets, test/train splits, model predictions, and ablation results are available at https://doi.org/10.5281/zenodo.12735404. Refer to the README in these files for additional information.
Code availability
An open-source Python implementation of BayesNF is available at https://github.com/google/bayesnf under an Apache-2.0 License. The full evaluation pipeline containing all model configurations for the baselines is also provided in the repository. The source code of bayesnf v0.1.3 is uploaded in the Supplementary Code.
References
European Environment Agency. European air quality index. https://airindex.eea.europa.eu/AQI/index.html (2024).
U.S. Environmental Protection Agency. U.S. air quality index. https://www.airnow.gov/ (2024).
Wang, S., Yuan, W. & Shang, K. The impacts of different kinds of dust events on PM10 pollution in northern China. Atmos. Environ. 40, 7975–7982 (2006).
Medina, S., Plasencia, A., Ballester, F., Mücke, H. G. & Schwartz, J. Apheis: public health impact of PM10 in 19 European cities. J. Epidemiol. Community Health 58, 831–836 (2004).
Huang, W. et al. An overview of air quality analysis by big data techniques: Monitoring, forecasting, and traceability. Inf. Fusion 75, 28–40 (2021).
Karagulian, F. et al. Review of the performance of low-cost sensors for air quality monitoring. Atmosphere 10, 506 (2019).
Niu, D., Feng, C. & Li, B. Pricing cloud bandwidth reservations under demand uncertainty. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, 151–162 (Association for Computing Machinery, 2012).
Mishra, S. K., Sahoo, B. & Paramita Parida, P. Load balancing in cloud computing: a big picture. J. King Saud. Univ. Comput. Inf. Sci. 32, 149–158 (2020).
Cao, J., Wu, Y. & Li, M. Energy efficient allocation of virtual machines in cloud computing environments based on demand forecast. In Advances in Grid and Pervasive Computing, vol. 7296 of Lecture Notes in Computer Science, 137–151 (Springer, 2012).
Faniyi, F. & Bahsoon, R. A systematic review of service level management in the cloud. ACM Comput. Surveys 48, 1–27 (2015).
Sigrist, F., Künsch, H. R. & Stahel, W. A. A dynamic nonstationary spatiotemporal model for short term prediction of precipitation. Ann. Appl. Stat. 6, 1452–1477 (2012).
Jung, J. & Broadwater, R. P. Current status and future advances for wind speed and power forecasting. Renew. Sustain. Energy Rev. 31, 762–777 (2014).
Lu, F. S., Hattab, M. W., Clemente, C. L., Biggerstaff, M. & Santillana, M. Improved state-level influenza nowcasting in the United States leveraging internet-based data and network approaches. Nat. Commun. 10, 147 (2019).
Gan, Z., Yang, M., Feng, T. & Timmermans, H. Understanding urban mobility patterns from a spatiotemporal perspective: Daily ridership profiles of metro stations. Transportation 47, 315–336 (2020).
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (The MIT Press, 2006).
Cressie, N. & Wikle, C. K. Statistics for Spatio-Temporal Data. Wiley Series in Probability and Statistics (John Wiley & Sons, 2011).
Wikle, C. K., Zammit-Mangion, A. & Cressie, N. Spatio-Temporal Statistics with R (Chapman and Hall/CRC, 2019).
Rue, H. et al. Bayesian computing with INLA: A review. Annu. Rev. Stat. Appl. 4, 395–421 (2017).
Anderson, S. C., Ward, E. J., English, P. A. & Barnett, L. A. K. sdmTMB: An R package for fast, flexible, and user-friendly generalized linear mixed effects models with spatial and spatiotemporal random fields. bioRxiv (2022).
Hamelijnck, O., Wilkinson, W., Loppi, N., Solin, A. & Damoulas, T. Spatiotemporal variational Gaussian processes. In Proc. 35th International Conference on Neural Information Processing Systems, vol 34 of Advances in Neural Information Processing Systems, 23621–23633 (Curran Associates, Inc., 2021).
Zhang, J., Ju, Y., Mu, B., Zhong, R. & Chen, T. An efficient implementation for spatial-temporal Gaussian process regression and its applications. Automatica 147, 110679 (2023).
Zammit-Mangion, A. & Cressie, N. FRK: an R package for spatial and spatiotemporal prediction with large datasets. J. Stat. Softw. 98, 1–48 (2021).
Banerjee, S. Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spat. Stat. 37, 100417 (2020).
Neal, R. M. Bayesian Learning for Neural Networks. Ph.D. thesis (University of Toronto, 1996).
Mildenhall, B. et al. NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 99–106 (2021).
Hoffman, M. D. et al. ProbNeRF: Uncertainty-aware inference of 3D shapes from 2D images. In International Conference on Artificial Intelligence and Statistics, 10425–10444 (PMLR, Norfolk, 2023).
Tancik, M. et al. Fourier features let networks learn high frequency functions in low dimensional domains. In Proceedings of the 34th International Conference on Neural Information Processing Systems, vol. 33 of Advances in Neural Information Processing Systems, 7537–7547 (Curran Associates, Inc., 2020).
Wikle, C. K. & Zammit-Mangion, A. Statistical deep learning for spatial and spatiotemporal data. Annu. Rev. Stat. Appl. 10, 247–270 (2023).
McDermott, P. L. & Wikle, C. K. Bayesian recurrent neural network models for forecasting and quantifying uncertainty in spatial-temporal data. Entropy 21, 184 (2019).
Amato, F., Guignard, F., Sylvain, R. & Kanevski, M. A novel framework for spatiotemporal prediction of environmental data using deep learning. Nat. Sci. Rep. 10, 22243 (2020).
Gao, F., Xu, Z. & Yin, L. Bayesian deep neural networks for spatiotemporal probabilistic optimal power flow with multisource renewable energy. Appl. Energy 353, 122106 (2024).
Liu, Y. et al. Probabilistic spatiotemporal wind speed forecasting based on a variational Bayesian deep learning model. Appl. Energy 260, 114259 (2020).
Wang, J. et al. Predicting wind-caused floater intrusion risk for overhead contact lines based on Bayesian neural network with spatiotemporal correlation analysis. Reliab. Eng. Syst. Saf. 225, 108603 (2022).
Garnelo, M. et al. Neural processes (2018). arXiv.1807.01622.
Jin, M. et al. A survey on graph neural networks for time series: forecasting, classification, imputation, and anomaly detection (2023). arXiv.2307.03759.
LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient backprop. In Montavon, G., Orr, G. B. & Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, 9–48, 2nd edn (Springer, 2012).
Pearce, T., Tsuchida, R., Zaki, M., Brintrup, A. & Neely, A. Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, vol. 115 of Proceedings of Machine Learning Research, 134–144 (PMLR, Norfolk, 2020).
Haslett, J. & Raftery, A. E. Space-time modelling with long-memory dependence: assessing Ireland’s wind power resource. J. R. Stat. Soc. C (Appl. Stat.) 38, 1–50 (1989).
Pebesma, E. spacetime: spatiotemporal data in R. J. Stat. Softw. 51, 1–30 (2012).
UCI Machine Learning Repository. Hungarian chickenpox cases. https://doi.org/10.24432/C5103B (2021).
University Corporation for Atmospheric Research. US precipitation and temperature data 1895–1997. https://www.image.ucar.edu/Data/US.monthly.met/ (2010).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Oreshkin, B. N., Carpov, D., Chapados, N. & Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations (2020).
Nixtla Labs. NeuralForecast: Scalable and user friendly neural forecasting algorithms. https://github.com/Nixtla/neuralforecast (2024).
Blundell, C., Cornebise, J., Kavukcuoglu, K. & Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, vol. 37 of Proceedings of Machine Learning Research, 1613–1622 (PMLR, Norfolk, 2015).
Chandrupatla, T. R. A new hybrid quadratic/bisection algorithm for finding the zero of a nonlinear function without using derivatives. Adv. Eng. Softw. 28, 145–149 (1997).
Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, vol. 30 of Advances in Neural Information Processing Systems, 6405–6416 (Curran Associates, Inc., 2017).
Esri, DigitalGlobe, GeoEye, icubed, USDA FSA, USGS, AEX, Getmapping, Aerogrid, IGN, IGP, swisstopo, and the GIS User Community. World imagery [basemap]  captured Mar 15, 2022. https://www.arcgis.com/home/item.html?id=10df2279f9684e4a9f6a7f08febac2a9.
© Stadia Maps, © OpenMapTiles, © OpenStreetMap, © Stamen Design, © CNES, Distribution Airbus DS, © Airbus DS, © PlanetObserver (Contains Copernicus Data). Stadia maps. https://stadiamaps.com, https://stamen.com, https://openstreetmap.org/copyright.
Author information
Contributions
The statistical model was designed and implemented by F.S., M.H., J.B., and U.K. Evaluations were designed by F.S. and implemented by F.S., J.B., C.C., and B.P.; R.S. and B.P. provided guidance and oversight. F.S. drafted the manuscript; all authors contributed to its revision and completion.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Tom Rainforth and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Saad, F., Burnim, J., Carroll, C. et al. Scalable spatiotemporal prediction with Bayesian neural fields. Nat Commun 15, 7942 (2024). https://doi.org/10.1038/s41467-024-51477-5
DOI: https://doi.org/10.1038/s41467-024-51477-5