Introduction

Spatiotemporal data, which consists of measurements gathered at different times and locations, is ubiquitous across diverse disciplines. Government bodies such as the European Environment Agency1 and United States Environmental Protection Agency2, for example, routinely monitor a variety of air quality indicators (PM10, NO2, O3, etc.) in order to understand their ecological and public health impacts3,4. As it is physically impossible to place sensors at all locations in a large geographic area, environmental data scientists routinely develop statistical models to predict these indicators at new locations or times where no data is available5,6. Spatiotemporal data analysis also plays an important role in cloud computing, where consumer demand for resources such as CPU, RAM, and storage is driven by time-evolving macroeconomic factors and varies across data center location. Cloud service providers build sophisticated demand-forecasting models to determine prices7, perform load balancing8, save energy9, and achieve service level agreements10. Additional applications of spatiotemporal data analysis include meteorology (forecasting rain volume11 or wind speeds12), epidemiology (“nowcasting” active flu cases13), and urban planning (predicting rider congestion patterns at metro stations14).

Unlike traditional regression or classification methods in machine learning that operate on independent and identically distributed (i.i.d.) data, accurate models of spatiotemporal data must capture complex and highly nonstationary dynamics in both the time and space domains. For example, two locations twenty miles apart in California’s central valley may exhibit nearly identical temperature patterns, whereas two locations only one mile apart in nearby San Francisco might have very different microclimates; and these effects may differ depending on the time of year. Handling such variability across different scales is a key challenge in designing accurate statistical models. Another challenge is that spatiotemporal observations are typically driven by unknown and noisily observed data-generating processes, which require models that report probabilistic predictions to account for the aleatoric and epistemic uncertainty in the data.

The dominant approach to spatiotemporal data modeling in statistics rests on Gaussian processes, a rich class of Bayesian nonparametric priors on random functions15,16,17. Consider a spatiotemporal field Y(s, t) indexed by spatial locations \({{\bf{s}}}\in {{\mathbb{R}}}^{d}\) and time points \(t\in {\mathbb{R}}\). A typical Gaussian-process based “prior probability” distribution (used in popular geostatistical software packages such as R-INLA18 and sdm-TMB19) over the random field Y is given by:

$$\eta \sim {{\rm{GP}}}(0,{k}_{\theta });\quad F({{\bf{s}}},t)=h(x({{\bf{s}}},t);\beta )+\eta ({{\bf{s}}},t);\quad Y({{\bf{s}}},t) \sim {{\rm{Dist}}} \, (g(F({{\bf{s}}},t)),\gamma ).$$
(1)

In Eq. (1), η is a random function whose covariance over space and time is determined by a kernel function \({k}_{\theta }(({{\bf{s}}},t),({{{\bf{s}}}}^{{\prime} },{t}^{{\prime} }))\) parameterized by θ; x(s, t) is a covariate vector associated with index (s, t); h is the mean function of the latent field F, with parameters β (e.g., for a linear function, \(h(x;\beta ):={\beta }^{{\prime} }x\)); and Dist is a noise model (e.g., Normal, Poisson) for the observations Y(s, t), with index-specific parameter g(F(s, t)) (where g is a link function, e.g., \(\exp\)) and global parameters γ.

Given an observed dataset \({{\mathcal{D}}}:=\{Y({{{\bf{s}}}}_{1},{t}_{1})={y}_{1},\ldots,Y({{{\bf{s}}}}_{N},{t}_{N})={y}_{N}\}\), the inference problem is to determine the unknown parameters (θ, β, γ), which in turn define a posterior distribution over the processes (η, F, Y) given \({{\mathcal{D}}}\). Advantages of the model (1) are (i) its flexibility, as η is capable of representing highly complex covariance structure; and (ii) its ability to quantify uncertainty, as the posterior spreads its probability mass over a range of functions and model parameters that are consistent with the data. Moreover, the model easily handles arbitrary patterns of missing data by treating them as latent variables. A number of recent articles have developed specialized Gaussian process techniques for modeling rich spatiotemporal fields, e.g., refs. 19,20,21,22,23.
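
As a concrete illustration of the generative process in Eq. (1), the following sketch draws one realization of Y at a set of space-time coordinates, assuming a squared-exponential kernel for kθ, a linear mean function h, an identity link g, and a Gaussian noise model; the kernel choice, function names, and parameter values are illustrative and not those of any particular geostatistical package.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel over stacked (s, t) coordinates."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def sample_prior_field(coords, covariates, beta, noise_scale, rng):
    """One draw of Y at the given (s, t) coordinates under Eq. (1):
    eta ~ GP(0, k); F = h(x; beta) + eta; Y ~ Normal(F, noise_scale^2)."""
    K = rbf_kernel(coords, coords) + 1e-6 * np.eye(len(coords))  # jitter for stability
    eta = rng.multivariate_normal(np.zeros(len(coords)), K)      # eta ~ GP(0, k_theta)
    F = covariates @ beta + eta             # linear mean h(x; beta) = beta'x
    return rng.normal(F, noise_scale)       # Gaussian observation model

rng = np.random.default_rng(0)
coords = np.column_stack([rng.uniform(size=50),        # s_1
                          rng.uniform(size=50),        # s_2
                          np.linspace(0.0, 1.0, 50)])  # t
covariates = np.column_stack([np.ones(50), coords])    # intercept + coordinates
y = sample_prior_field(coords, covariates,
                       beta=np.array([1.0, 0.5, -0.5, 2.0]),
                       noise_scale=0.1, rng=rng)
```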

Despite their flexibility, spatiotemporal models based on Gaussian processes (such as Eq. (1)) come with significant challenges. The first is computational. The simplest and most accurate posterior inference algorithms for these models have a computational cost of O(N3), where N is the number of observations, which is unacceptably high in datasets with tens or hundreds of thousands of observations. Reducing this cost requires compromises, either on the modeling side (e.g., imposing a discrete Markovian structure on the model18,19) or on the posterior-inference side (e.g., approximating the true posterior with a simpler Gaussian process20,21,23). Either way, the resulting models have less expressive power and cannot explain the data as accurately. These approximations also involve delicate linear-algebraic derivations or stochastic differential equations, which are challenging to implement and apply to new settings.

The second challenge is expertise: the accuracy of model (1) on a given dataset is dictated by key choices such as the covariance kernel kθ and mean function h. Even for seasoned data scientists, designing these quantities is difficult because it requires detailed knowledge about the application domain. Further, even small modifications to the model can require large changes to the learning algorithm, and so most software packages only support a small set of predetermined covariance structures kθ (e.g., separable Matérn kernels, radial basis kernel, polynomial kernel) that are sufficiently optimized to work effectively on large datasets.

To alleviate these fundamental tensions, this article introduces the Bayesian Neural Field (BayesNF)—a method that combines the scalability of deep neural networks with many of the attractive properties of Gaussian processes. BayesNF is built on a Bayesian neural network model24 that maps from multivariate space-time coordinates to a real-valued field. The parameters of the network are assigned a prior distribution, and as in Gaussian processes, conditioning on observed data induces a posterior over those parameters (and in turn over the entire field). Because inference is performed in “weight space” rather than “function space”, the cost of analyzing a dataset grows linearly with the number of observations, as opposed to cubically for a Gaussian process. Because BayesNF is a hierarchical model (Fig. 1), it naturally handles missing data as latent variables and quantifies uncertainty over parameters and predictions. And because BayesNF defines a field over continuous space–time, it can model non-uniformly sampled data, interpolate in space, and extrapolate in time to make predictions at novel coordinates.

Fig. 1: Probabilistic graphical model representation of the Bayesian Neural Field.
figure 1

a An example spatiotemporal domain comprised of two spatial coordinates (latitude, longitude) and a daily time coordinate. b In the probabilistic graphical model, each node denotes a model variable and each edge denotes a direct relationship between a pair of variables. Gray nodes are observed variables and white nodes are local latent variables, both of which are associated with an observation Y(s, t) at a spatiotemporal coordinate (s, t). Pink nodes are global latent variables (parameters), which are shared across all spatiotemporal coordinates. c Realizations of the spatiotemporal field generated from the BayesNF at four example time points. Satellite basemap source: Esri, DigitalGlobe, GeoEye, i-cubed, USDA FSA, USGS, AEX, Getmapping, Aerogrid, IGN, IGP, swisstopo, and the GIS User Community48.

Our description of BayesNF as a neural “field” is inspired by the recent literature on neural radiance fields (NeRFs25,26) in computer vision. A key discovery that enabled the success of NeRFs is that neural networks are biased towards learning functions whose Fourier spectra are dominated by low frequencies, and that this bias can be corrected by concatenating sinusoidal positional encodings to the raw spatial inputs27. To ensure that our BayesNF model assigns high prior probability to data that includes both low- and high-frequency variation, we append Fourier features to the raw time and position data that are fed to the network. In Methods, we show that these Fourier features, coupled with learned scale factors and convex combinations of activation functions, improve BayesNF models’ ability to learn flexible and well-calibrated distributions of spatiotemporal data. Incorporating sinusoidal seasonality features lets BayesNF models make predictions based on (multiple) seasonal effects as well. Taken together, these characteristics enable state-of-the-art performance in terms of point predictions and 95% prediction intervals on diverse large-scale spatiotemporal datasets, without the need to heavily customize the BayesNF model structures on a per-dataset basis.

BayesNF belongs to a family of emerging techniques that leverage deep neural networks with hierarchical Bayesian models for spatiotemporal data analysis—a thorough survey of these advances is given in Wikle and Zammit-Mangion28. Our method is inspired by limitations of existing deep neural network approaches for probabilistic prediction in spatiotemporal data. For example, the Bayesian spatiotemporal recurrent neural networks introduced in McDermott and Wikle29 require the data to be observed at a fixed spatial grid and regular discrete-time intervals. In contrast, BayesNF is defined over continuous space-time coordinates, enabling prediction at novel locations and in datasets with irregularly sampled time points. The deep “Empirical Orthogonal Function” model30 is a powerful exploratory analysis tool but is less useful for prediction: it cannot handle missing data, make predictions at new time points, or deliver uncertainty estimates. Additional methods in this category include Bayesian neural networks that are highly task oriented—e.g., for analyzing power flow31, wind speed32, or floater intrusion risk33. These methods leverage domain-specific architectures designed specifically for the analysis problem at hand, and do not aim to provide software libraries that are easy for practitioners to apply in new spatiotemporal datasets beyond the application domain. In contrast, a central goal of BayesNF is to provide a domain-general modeling tool that is easily applicable to the same type of datasets as the Gaussian process model (1), without the need to redesign substantial parts of the probabilistic model or network architecture for each new task.

Neural processes34 also integrate deep neural networks with probabilistic modeling, but are based on a graphical model structure that is fundamentally difficult to apply to spatiotemporal datasets. In particular, because neural processes aim to “meta-learn” a prior distribution over random functions, the authors note it is essential to have access to a large number of independent and identically distributed (i.i.d.) datasets during training. However, most spatiotemporal data analyses are based on only a single real-world dataset (e.g., those in Table 1) where there is no notion of sharing statistical strength across multiple i.i.d. observations of the entire field.

Table 1 Spatiotemporal datasets analyzed in the empirical evaluation

Graph neural networks (GNNs), surveyed in Jin et al.35, are another popular deep-learning approach for spatiotemporal prediction which have been particularly useful in settings such as analyzing traffic or population-migration patterns. These models require as input a graph describing the connectivity structure of the spatial locations, which makes them less appropriate for spatial data that lack such discrete connectivity structure. Moreover, the requirement that the graph be fixed makes it harder for GNNs to interpolate or extrapolate to locations that are not included in the graph at training time. The BayesNF model, on the other hand, operates over continuous space, and is therefore more appropriate for spatial data without known discrete connectivity structure. In addition, as noted in Jin et al.35, GNNs have not yet been demonstrated on probabilistic prediction tasks, and we are unaware of the existence of open-source software libraries based on GNNs that can easily handle the sparse datasets in Table 1.

Results

Model description

Consider a dataset \({{\mathcal{D}}}=\{y({{{\bf{s}}}}_{i},{t}_{i})| i=1,\ldots,N\}\) of N spatiotemporal observations, where \({{{\bf{s}}}}_{i}\in {{\mathcal{S}}}\subset {{\mathbb{R}}}^{d}\) denotes a d-dimensional spatial coordinate and \({t}_{i}\in {{\mathcal{T}}}\subset {\mathbb{R}}\) denotes a time index. For example, if the field is observed at longitude-latitude coordinates in discrete time, then \({{\mathcal{S}}}=(-180,\, 180]\times [-90,\, 90]\subset {{\mathbb{R}}}^{2}\) and \({{\mathcal{T}}}=\{1,2,\ldots \}\). If the field also incorporates an altitude dimension, then \({{\mathcal{S}}}\subset {{\mathbb{R}}}^{3}\). We model this dataset as a realization {Y(si, ti) = y(si, ti), 1 ≤ i ≤ N} of a random field \(Y:{{\mathcal{S}}}\times {{\mathcal{T}}}\to {\mathbb{R}}\) over the entire spatiotemporal domain. Following the notation in Wikle and Zammit-Mangion28, we describe the field using a hierarchical Bayesian model:

$$\,{{\mbox{Observation}}} \, {{\mbox{Model:}}}\,\,[Y(\cdot )| F(\cdot ),{\Theta }_{y}],$$
(2)
$$\,{{\mbox{Process}}} \, {{\mbox{Model:}}}\,\,[F(\cdot )| x(\cdot ),{\Theta }_{f}],$$
(3)
$$\,{{\mbox{Parameter \, Models:}}}\,\,[{\Theta }_{y},{\Theta }_{f}].$$
(4)

In this notation, upper case letters denote random quantities, Greek letters denote model parameters, lower case letters denote non-random (fixed) quantities, and square brackets [ ] denote (yet-to-be-specified) probability distributions. The distribution of the observable random variables Y(s, t) is parameterized by global parameters Θy and an unobservable (latent) spatiotemporal field F(s, t). In turn, F(s, t) is parameterized by a set of random global parameters Θf and a collection x(s, t) = [x1(s, t), …, xm(s, t)] of m fixed covariates associated with index (s, t).

Box 1 completes the definition of BayesNF by showing specific probability distributions for the model (2)–(4). Figure 1 shows a probabilistic-graphical-model representation of a BayesNF model with H = 3 layers, which takes a spatiotemporal index (s, t) at the input layer and generates a realization Y(s, t) of the observable field at the output layer. At a high level, the input layer transforms the spatiotemporal coordinates (s, t) into a fixed set of spatiotemporal covariates, which include linear terms, interaction terms, and Fourier features in time and space. The second layer performs a linear scaling of these covariates using a learnable scale factor—this layer aims to avoid the need for the practitioner to manually specify how to appropriately scale the data, which is known to heavily influence the learning dynamics36. Next, the hidden layers of the network contain the usual dense connections, except that the activations are specified as a learnable convex combination of “primitive” activations, such as the rectified linear unit (relu), exponential linear unit (elu), or hyperbolic tangent (tanh). The goal of these convex combinations is to automate the discovery of the covariance structure in the field, given that activation functions correspond directly to the covariance of random functions defined by Bayesian neural networks37. At the final layer, the output of the feedforward network is used to parameterize a probability distribution over the observable field values, which serves to capture the fundamental aleatoric uncertainty in the noisy data. Epistemic uncertainty in BayesNF is expressed by assigning prior probability distributions to all learnable parameters, such as covariate scale factors; connection weights, biases, and their variances; and additional parameters of the observation distribution.

We next describe the components of this process in sequence from inputs to outputs in more detail. This description defines a prior distribution over Bayesian Neural Fields—in Methods we discuss ways of inferring the posterior over the random variables defined in Box 1.

Spatiotemporal covariates

Letting (s, t) = ((s1, …, sd), t) denote a generic index in the field, the covariates [x1(s, t), …, xm(s, t)] may include the following functions:

$$\{t,{s}_{1},\ldots,{s}_{d}\}\quad Linear\,Terms$$
(5)
$$\{t{s}_{1},\ldots,t{s}_{d}\}\quad Temporal-Spatial\,Interactions$$
(6)
$$\{{s}_{i}{s}_{j};1\le i < j\le d\}\quad Spatial-Spatial\,Interactions$$
(7)
$$\{(\cos (2\pi ht/p),\sin (2\pi ht/p));p\in {{\mathcal{P}}},h\in {{{\mathcal{H}}}}_{p}^{{{\rm{t}}}}\}\quad Temporal\,Seasonal\,Features$$
(8)
$$\{(\cos (2\pi {2}^{h}{s}_{i}),\sin (2\pi {2}^{h}{s}_{i}));1\le i\le d,h\in {{{\mathcal{H}}}}_{i}^{{{\rm{s}}}}\}\quad Spatial\,Fourier\,Features$$
(9)

The linear and interaction covariates (5)–(7) are the usual first and second-order effects used in spatiotemporal trend-surface analysis models (Section 3.2 of ref. 17). In Eq. (8), the temporal seasonal features are defined by a set \({{\mathcal{P}}}=\{{p}_{1},\ldots,{p}_{\ell }\}\) of seasonal periods, where each pi has harmonics \({{{\mathcal{H}}}}_{{p}_{i}}^{{{\rm{t}}}}\subset \{1,2,\ldots,\lfloor {p}_{i}/2\rfloor \}\) for i = 1, …, ℓ. For example, if the data are observed hourly and there are ℓ = 2 seasonal effects (daily and monthly), the corresponding periods are p1 = 24 and p2 = 730.5, respectively. Non-integer periodicities handle seasonal effects that have varying duration in the time measurement unit (e.g., days per month or weeks per year). The Methods section discusses how to construct appropriate seasonal features for a variety of time units and seasonal effect combinations. In Eq. (9), the spatial Fourier features for coordinate si are determined by a set \({{{\mathcal{H}}}}_{i}^{{{\rm{s}}}}\subset {\mathbb{N}}\) of additional frequencies that capture periodic structure in the ith dimension (i = 1, …, d). These covariates correct for the tendency of neural networks to learn low-frequency signals27: the empirical evaluation in the next section confirms that their presence greatly improves the quality of learned models. Covariates may also include static (e.g., “continent”) or dynamic (e.g., “temperature”) exogenous features, provided they are known at all locations and time points in the training and testing datasets.
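
To make the construction concrete, the sketch below assembles the covariate vector of Eqs. (5)–(9) for a single index (s, t). The function name, argument layout, and the example harmonic sets are our own illustrative choices, not defaults of the released implementation.

```python
import numpy as np

def spatiotemporal_covariates(s, t, periods, harmonics_t, harmonics_s):
    """Build the covariate vector x(s, t) of Eqs. (5)-(9) for one index.
    `periods` and `harmonics_t` define the temporal seasonal features;
    `harmonics_s[i]` lists the spatial Fourier frequencies for coordinate s_i."""
    s = np.asarray(s, dtype=float)
    d = len(s)
    feats = [t, *s]                                                    # (5) linear terms
    feats += [t * s[i] for i in range(d)]                              # (6) temporal-spatial
    feats += [s[i] * s[j] for i in range(d) for j in range(i + 1, d)]  # (7) spatial-spatial
    for p in periods:                                                  # (8) seasonal features
        for h in harmonics_t[p]:
            feats += [np.cos(2 * np.pi * h * t / p), np.sin(2 * np.pi * h * t / p)]
    for i in range(d):                                                 # (9) spatial Fourier
        for h in harmonics_s[i]:
            feats += [np.cos(2 * np.pi * 2**h * s[i]), np.sin(2 * np.pi * 2**h * s[i])]
    return np.array(feats)

# Example: hourly data with daily and monthly seasonal effects, two spatial dims.
x = spatiotemporal_covariates(
    s=[10.2, 51.1], t=36.0,
    periods=[24, 730.5],
    harmonics_t={24: [1, 2], 730.5: [1]},
    harmonics_s={0: [1, 2, 3, 4], 1: [1, 2, 3, 4]})
```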

Covariate scaling layer

Scaling inputs improves neural network learning (e.g., ref. 36), but determining the appropriate strategy (e.g., z-score, min/max, tanh, batch-norm, layer-norm, etc.) is challenging. BayesNF uses a prior distribution over scale factors to learn these quantities as part of Bayesian inference within the overall probabilistic model. In particular, the next stage in the network is a width-m hidden layer \({h}_{i}^{0}({{\bf{s}}},t)={e}^{{\xi }_{i}^{0}}{x}_{i}({{\bf{s}}},t)\) obtained by randomly scaling each of the m covariates x(s, t), where \({e}^{{\xi }_{i}^{0}}\) is a log-normally distributed scale factor (for i = 1, …, m).

Hidden layers

The model contains L + 1 ≥ 1 hidden layers, where layer ℓ has \({N}^{\ell }\) units \({h}^{\ell }={({h}_{1}^{\ell },\ldots,{h}_{{N}^{\ell }}^{\ell })}^{{\prime} }\) (for ℓ = 1, …, L). These hidden units are derived from \({N}^{\ell }\) pre-activation units \({z}^{\ell }={N}_{\ell -1}^{-1/2}{\Omega }^{\ell }{h}^{\ell -1}+{\beta }^{\ell }\), where \({\Omega }^{\ell }=[{\omega }_{ij}^{\ell };1\le i\le {N}^{\ell },1\le j\le {N}_{\ell -1}]\) is a random \({N}^{\ell }\times {N}_{\ell -1}\) weight matrix and \({\beta }^{\ell }={({\beta }_{1}^{\ell },\ldots,{\beta }_{{N}^{\ell }}^{\ell })}^{{\prime} }\) is a random bias vector. The network parameters \({\omega }_{ij}^{\ell }\) and \({\beta }_{i}^{\ell }\) are drawn i.i.d. from \(N(0,{\sigma }^{\ell })\), where the variance \({\sigma }^{\ell }=\ln (1+{e}^{{\xi }^{\ell }})\) is a learnable parameter whose prior is obtained by applying a softplus transformation to \({\xi }^{\ell } \sim N(0,1)\). The \({N}_{\ell -1}^{-1/2}\) prefactor ensures the network has a well-defined Gaussian process limit as the number of hidden units \({N}^{\ell }\to \infty\)24.

In addition to the covariate scaling layer, BayesNF departs from a traditional Bayesian neural network by using \({A}^{\ell }\ge 1\) activation functions \(({u}_{1}^{\ell },\ldots,{u}_{{A}^{\ell }}^{\ell })\) at hidden layer ℓ, instead of the usual \({A}^{\ell }=1\). For example, the architecture shown in Fig. 1 uses \({A}^{\ell }=2\), where \({u}_{1}^{\ell }\) is the hyperbolic tangent (tanh) and \({u}_{2}^{\ell }\) is the exponential linear unit (elu) activation (where ℓ = 1, 2). Each post-activation unit \({h}_{i}^{\ell }\) (for i = 1, …, \({N}^{\ell }\)) is then a random convex combination of the activations \({u}_{1}^{\ell }({z}_{i}^{\ell }),\ldots,{u}_{{A}^{\ell }}^{\ell }({z}_{i}^{\ell })\), where the coefficient of \({u}_{j}^{\ell }\) is the output of a softmax function \({e}^{{\gamma }_{j}^{\ell }}/{\sum }_{k=1}^{{A}^{\ell }}{e}^{{\gamma }_{k}^{\ell }}\) whose j-th input is \({\gamma }_{j}^{\ell } \sim N(0,1)\) (for j = 1, …, \({A}^{\ell }\)). The activation function governs the overall covariance properties of the random function defined by a Bayesian neural network24,37. By specifying the overall activation at each layer as a learnable convex combination of \({A}^{\ell }\) “basic” activation functions (e.g., tanh, relu, elu), BayesNF aims to automate the process of selecting an appropriate activation and in turn the covariance structure within the random field.

Finally, the latent stochastic process F(s, t) is defined as the pre-activation unit \({z}_{1}^{L+1}\) of layer L + 1, which has exactly \({N}^{L+1}=1\) unit. We let Θf denote all nf random network parameters in Box 1 and denote the prior as πf. Further, the notation \({F}_{{\theta }_{f}}({{\bf{s}}},t)\) denotes the (deterministic) value of the process F at index (s, t) when Θf = θf.
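
The following sketch evaluates \({F}_{{\theta }_{f}}({{\bf{s}}},t)\) for one draw of the network parameters, combining the covariate scaling layer, the hidden layers with convex combinations of tanh and elu activations, and the single-unit output layer. The parameter container and its field names are hypothetical, and the priors over the parameters (Box 1) are omitted for brevity.

```python
import numpy as np

def elu(x):
    """Exponential linear unit."""
    return np.where(x > 0, x, np.expm1(x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def forward_field(x, params, activations=(np.tanh, elu)):
    """Evaluate F_theta(s, t) for a covariate vector x = x(s, t) and one draw
    of the network parameters. `params` is a dict with the scaling-layer
    log-factors `xi0`, lists of `weights` and `biases` for layers 1..L+1,
    and convex-combination logits `gammas` for layers 1..L."""
    h = np.exp(params["xi0"]) * x                       # covariate scaling layer
    for W, b, gamma in zip(params["weights"][:-1],
                           params["biases"][:-1],
                           params["gammas"]):
        z = W @ h / np.sqrt(len(h)) + b                 # pre-activations z^l
        coeffs = softmax(gamma)                         # convex-combination coefficients
        h = sum(c * u(z) for c, u in zip(coeffs, activations))
    W, b = params["weights"][-1], params["biases"][-1]
    return float(W @ h / np.sqrt(len(h)) + b)           # F(s, t): single output unit
```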

Observation layer

The final layer connects the stochastic process F(s, t) with the observable spatiotemporal field Y(s, t) ~ Dist(F(s, t); Θy) through a noise model that captures aleatoric uncertainty in the data. The parameter vector \({\Theta }_{y}=({\Theta }_{y,1},\ldots,{\Theta }_{y,{n}_{y}})\) is ny-dimensional and has a prior πy. There are many choices for this distribution, depending on the field Y(s, t); for example,

$$Y({{\bf{s}}},t) \sim {{\rm{Normal}}}(F({{\bf{s}}},t),{\Theta }_{y,1}),$$
(10)
$$Y({{\bf{s}}},t) \sim {{{\rm{StudentT}}}}_{{\Theta }_{y,2}}(F({{\bf{s}}},t),{\Theta }_{y,1}),$$
(11)
$$Y({{\bf{s}}},t) \sim {{\rm{Poisson}}}({e}^{F({{\bf{s}}},t)}),$$
(12)

which correspond to a Gaussian noise model with mean F(s, t) and variance Θy,1 (ny = 1); a StudentT model with location F(s, t), scale Θy,1, and Θy,2 degrees of freedom (ny = 2); and a Poisson counts model with rate \(\exp F({{\bf{s}}},t)\) (ny = 0), respectively. A key design choice in these observation distributions is that certain parameters such as Θy,1 in Eq. (10) or Θy,1, Θy,2 in Eq. (11) are not index-specific but rather shared across all inputs, which serves to mitigate the model’s sensitivity to over-fitting noise fluctuations from high-frequency Fourier features.
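
As a minimal sketch, the helper below samples an observation Y(s, t) from a latent value F(s, t) under each of the three example noise models; the function name and parameter packing are illustrative assumptions.

```python
import numpy as np

def sample_observation(F, theta_y, family, rng):
    """Sample Y(s, t) given the latent value F(s, t) under the example noise
    models (10)-(12); `theta_y` packs the shared (global) parameters."""
    if family == "normal":        # Eq. (10): theta_y = (variance,)
        return rng.normal(F, np.sqrt(theta_y[0]))
    if family == "student_t":     # Eq. (11): theta_y = (scale, degrees of freedom)
        return F + theta_y[0] * rng.standard_t(theta_y[1])
    if family == "poisson":       # Eq. (12): log link, no free theta_y
        return rng.poisson(np.exp(F))
    raise ValueError(f"unknown family: {family}")
```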

Posterior inference and querying. Let P(Θf, Θy, Y) be the joint probability distribution over the parameters and observable field in Box 1. The posterior distribution given \({{\mathcal{D}}}\) is

$$ P({\theta }_{f},{\theta }_{y}| {\left\{Y({{{\bf{s}}}}_{i},{t}_{i})=y({{{\bf{s}}}}_{i},{t}_{i})\right\}}_{i=1}^{N})\\ \propto \left(\prod _{i=1}^{{n}_{f}}{\pi }_{f}({\theta }_{f,i})\right)\left(\prod _{i=1}^{{n}_{y}}{\pi }_{y}({\theta }_{y,i})\right) \prod _{i=1}^{N}{{\rm{Dist}}}(y({{{\bf{s}}}}_{i},{t}_{i});{F}_{{\theta }_{f}}({{{\bf{s}}}}_{i},{t}_{i}),{\theta }_{y})$$
(13)

While the right-hand side of Eq. (13) is tractable to compute, the left-hand side cannot be normalized or sampled from exactly. In the Posterior Inference section of Methods, we discuss two approximate posterior inference algorithms for BayesNF: maximum a-posteriori ensembles and variational inference ensembles. They each produce a collection of parameters \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\approx P({\Theta }_{f},{\Theta }_{y}| {{\mathcal{D}}})\) drawn from an approximation to the posterior (13). The Prediction Queries subsection of Methods discusses how these posterior samples can be used to compute point predictions \(\hat{y}({{{\bf{s}}}}_{*},{t}_{*})\) of the spatiotemporal field at a novel index (s*, t*) and the associated prediction intervals \([{\hat{y}}_{{{\rm{low}}}}({{{\bf{s}}}}_{*},{t}_{*}),{\hat{y}}_{{{\rm{hi}}}}({{{\bf{s}}}}_{*},{t}_{*})]\) for a given level α ∈ (0, 1) (e.g., α = 95%).

Prediction accuracy on scientific datasets

Datasets

To quantitatively assess the effectiveness of BayesNF on challenging prediction problems, we curated a benchmark set comprised of six publicly available, large-scale spatiotemporal datasets that together cover a range of complex empirical processes:

  1. Daily wind speed (km/h) from the Irish Meteorological Service38. 1961-01-01 to 1978-12-31; 12 locations; 78,888 observations, 0% missing.

  2. Daily particulate matter 10 (PM10, μg/m3) air quality in Germany from the European Environment Information and Observation Network39. 1998-01-01 to 2009-12-31; 70 locations; 149,151 observations, 52% missing.

  3. Hourly particulate matter 10 (PM10, μg/m3) from the London Air Quality Network20. 2018-12-31 to 2019-03-31; 72 locations; 144,570 observations, 7% missing.

  4. Weekly chickenpox counts (thousands) from the Hungarian National Epidemiology Center40. 2005-01-03 to 2014-12-29; 20 locations; 10,440 observations, 0% missing.

  5. Monthly accumulated precipitation (mm) in Colorado and surrounding areas from the University Corporation for Atmospheric Research41. 1950-01-01 to 1997-12-01; 358 locations; 134,800 observations, 35% missing.

  6. Monthly sea surface temperature (°C) anomalies in the Pacific Ocean from the National Oceanic and Atmospheric Administration Climate Prediction Center17. 1970-01-01 to 2003-03-01; 2261 locations; 902,139 observations, 0% missing.

Table 1 summarizes key statistics of these datasets. Figure 2 shows snapshots of the observed data at a fixed point in time (Fig. 2a) and in space (Fig. 2b), highlighting the complex statistical patterns (e.g., nonstationarity and periodicity) in the underlying fields along these two dimensions. Five train/test splits were created for each benchmark. Each test set contains (#locations)/(#splits) locations, holding out the 10% most recent observations.

Fig. 2: Spatial and temporal observations for evaluation datasets from Table 1.
figure 2

a Snapshots of spatial observations at fixed points in time. b Snapshots of temporal observations at fixed locations in space. Satellite basemap source: Stadia Maps, OpenMapTiles, OpenStreetMap, Stamen Design, CNES, Distribution Airbus DS, Airbus DS, PlanetObserver (Contains Copernicus Data)49.

Baselines

The prediction accuracy of BayesNF on the benchmark datasets in Table 1 is compared to that of several state-of-the-art baselines. This evaluation focuses specifically on baseline methods that (i) have high-quality and widely used open-source implementations; (ii) can generate both point and interval predictions; and (iii) are directly applicable to new spatiotemporal datasets (e.g., those in Table 1) without the need to redevelop substantial parts of the model. The methods are:

  1. StSVGP: Spatiotemporal Sparse Variational Gaussian Process20. This method handles large datasets (i.e., linear time scaling in the number of time points) by leveraging a state-space representation based on stochastic partial differential equations and Bayesian parallel filtering and smoothing on GPUs. Parameter estimation is performed using natural gradient variational inference.

  2. StGBoost: Spatiotemporal Gradient Boosting Trees42. Prediction intervals are estimated by minimizing the quantile loss using an ensemble of 1000 tree estimators. As this baseline is not a typical time series model, the same covariates [x1(s, t), …, xm(s, t)] (5)–(9) provided to BayesNF are also provided as regression inputs.

  3. StGLMM: Spatiotemporal Generalized Linear Mixed Effects Models19. These methods handle large datasets by integrating latent Gaussian-Markov random fields with stochastic partial differential equations. Parameter estimation is performed using maximum marginal likelihood inference. Three observation noise processes are considered:

    • IID: Independent and identically distributed Gaussian errors.

    • AR1: Order 1 auto-regressive Gaussian errors.

    • RW: Gaussian random walk errors.

  4. NBEATS: Neural Basis Expansion Analysis43. This baseline employs a “window-based” deep learning auto-regressive model where future data is predicted over a fixed-size horizon conditioned on a window of previous observations and exogenous features. The model is configured with indicators for all applicable seasonal components—e.g., hour of day, day of week, day of month, week of year, month—as well as trend and seasonal Fourier features. The method contains a large number of numeric hyperparameters, which are automatically tuned using the NeuralForecast44 package. Prediction intervals are estimated by minimizing quantile loss.

  5. TSReg: Trend Surface Regression with Ordinary Least Squares (OLS) (Section 3.2 of ref. 17). The observation noise model is Gaussian with maximum likelihood estimation of the variance. As with StGBoost, the regression covariates are identical to those provided to BayesNF.

  6. BayesNF: Bayesian Neural Field, using variational and maximum a-posteriori inference.

We also attempted to use the fixed-rank kriging (Frk) method22, but were unable to perform inference over noise parameters for spatiotemporal data. Taken together, the baselines provide broad coverage over recent statistical, machine learning, and deep learning methods for large-scale prediction. All methods were run on a TPU v3-8 accelerator, which consists of 8 cores each with 16 GiB of memory. Additional evaluation details are described in Methods.

Quantitative results

Table 2 shows accuracy and runtime results for all baselines and benchmarks. Point predictions are evaluated using root-mean square error (RMSE (25)) and mean absolute error (MAE (26)), and 95% prediction intervals are evaluated using the mean interval score (MIS (27)), averaged over all train/test splits. The final column shows the wall-clock runtime in seconds for each method. While runtimes cannot be perfectly aligned due to the variety of learning algorithms used and their iterative nature, the wall-clock numbers show that all baselines were run for sufficiently long to ensure a fair comparison. Figure 3 compares predictions on held-out data at one representative spatial location in each of the six benchmarks. We discuss several takeaways from these results.

Table 2 Point prediction errors in terms of root-mean square error (RMSE) and mean absolute error (MAE); interval prediction error in terms of mean interval score (MIS); and wall-clock runtime in seconds on spatiotemporal benchmark datasets using multiple baseline methods
Fig. 3: Comparison of predictions using BayesNF and various baselines.
figure 3

Each row shows results for a given spatiotemporal benchmark dataset at one spatial location. Black dots are observed data, blue dots are test data, red lines are median forecasts, and gray regions are 95% prediction intervals. (BayesNF: Bayesian Neural Field. Svgp: Spatiotemporal Sparse Variational Gaussian Process. Gboost: Spatiotemporal Gradient Boosting Trees. StGLMM: Spatiotemporal Generalized Linear Mixed Effect Models. NBEATS: Neural Basis Expansion Analysis. TSReg: Trend-Surface Regression).

BayesNF using VI is the strongest method in 12/18 cases, followed by BayesNF using MAP: it is tied with VI in 3/18 cases (Precipitation) and superior in 3/18 cases (Sea Surface Temperature). In 2/18 cases (Chickenpox; MAE and RMSE) errors from the BayesNF methods are slightly higher than the StGLMM (AR1) baseline, although the running time of the latter is ~4x higher. The most apparent improvements of BayesNF occur in the Wind Speed, Precipitation, and Sea Surface Temperature datasets, shown qualitatively in rows 1, 5, 6 of Fig. 3. Results using additional ablations are discussed in the Ablations subsection of Methods. Combined with Table 2, these results highlight the expressive modeling capacity of BayesNF models, their ability to accurately quantify predictive uncertainty, and the benefit of using spatial embeddings to capture high-frequency signals in the data.

While predictions from StSVGP generally follow the overall “shape” of the held-out data, the mean and interval predictions are not well calibrated (Fig. 3, second column). StSVGP requires several modeling trade-offs to ensure linear-time scaling in the number of time points, including the use of Matérn kernels (which cannot express effects such as seasonality) and kernels that are separable in time and space. Additional difficulties include manually selecting the number of spatial inducing points and complex algorithms needed to optimize their locations. StSVGP runs out of memory on the Sea Surface Temperature benchmark (1 million observations).

The StGLMM methods (AR1, IID, RW) fail to complete on 4/6 benchmarks. The scaling characteristics are also unpredictable: for example, StGLMM runs on Air Quality 2 (144,570 observations) but fails on Wind Speed (78,888 observations). On the two datasets they can handle (rows 3 and 4 of Fig. 3), the StGLMM methods are highly competitive on Chickenpox and not competitive on Air Quality 2, with the AR1 error model delivering the lowest errors.

StGBoost delivers reasonable prediction intervals but its point predictions underfit (Fig. 3, third column). It has a high computational cost because (i) a large number of estimators is needed to obtain accurate predictions (using 1000 estimators provided statistically significant improvements over 500 estimators in 17/18 benchmarks); (ii) three models must be separately trained from scratch: one model to predict the mean and two models to predict upper and lower quantiles. Whereas BayesNF uses a single learned distribution for all queries, StGBoost trains different models for different queries, which does not guarantee probabilistically coherent answers.

NBEATS is only competitive on the Sea Surface Temperature benchmark, where it is the next-best baseline after BayesNF. Its runtime on this benchmark is 3x–4x faster than BayesNF due to automatic early stopping. The method fails to deliver predictions on the Precipitation benchmark because the training and test datasets contain time series that are too sparse to handle; e.g., the number of observed timepoints is smaller than the auto-regressive window size or prediction horizon. The prediction errors on the remaining three benchmarks are high even though all the seasonal effects were added to the model, suggesting that either (i) the model is not able to effectively leverage spatial correlations for cross time-series learning; or (ii) the hyperparameter tuning algorithm does not converge to sensible values within the allotted time.

TSReg requires less than 1 second to train, but does not capture any meaningful structure and produces poor predictions. Using LASSO or ridge regression instead of OLS did not improve the results. TSReg uses identical covariates to BayesNF but performs much worse, highlighting the need to capture nonlinear dependencies in the data for generating accurate forecasts.

Analyzing German air quality data

Atmospheric particulate matter (PM10) is a key indicator of air quality used by governments worldwide, as these particles can induce adverse health effects when inhaled into the lungs. Accurate predictions of PM10 values at novel points in space and time within a geographic region can help decision makers characterize pollution patterns and inform public health decisions.

We explore predictions from BayesNF on the German Air Quality dataset39, which contains daily PM10 measurements from 70 stations between 1998-01-01 and 2009-12-31. We infer a BayesNF model for this dataset with depth H = 2; weekly, monthly, and yearly seasonal effects (8); and harmonics \({{{\mathcal{H}}}}_{1}^{{{\rm{s}}}}={{{\mathcal{H}}}}_{2}^{{{\rm{s}}}}=\{1,\ldots,4\}\) for the spatial Fourier features (9). The distribution of Y given the stochastic process F is a StudentT (11) truncated to \({{\mathbb{R}}}_{\ge 0}\).

Spatial and temporal interpolation

Figure 4a shows the PM10 observations at 2003-02-01, 2005-01-01, 2005-04-01, and 2007-01-01, where roughly 50% of the stations do not have an observed measurement at a given point in time. Figure 4b shows the median PM10 predictions y0.5(s*, t*) (24) interpolated at a grid of 10,000 novel spatial indexes (s*, t*) within Germany. Figure 4c shows the width \({\hat{y}}_{{{\rm{hi}}}}({{{\bf{s}}}}_{*},{t}_{*})-{\hat{y}}_{{{\rm{low}}}}({{{\bf{s}}}}_{*},{t}_{*})\) of the inferred 95% prediction interval. These plots reflect the spatiotemporal structure captured by BayesNF and identify coordinates within the field with low and high predictive uncertainty about air pollution. The axis-aligned artifacts in Fig. 4b, where predictions are consistent along certain thin regions, are a result of the spatial Fourier features (9). How well these artifacts reflect the true behavior can be empirically investigated by obtaining PM10 measurements at novel locations along these regions. Figure 4d shows the observed and median predicted PM10 values across all time points at four stations with the highest missing data rates: DEBWO31, southwest Germany, 51% missing; DEBB056, northeast Germany, 84% missing; DEBU034, northwest Germany, 99% missing; DESL008, west Germany, 89% missing. PM10 trajectories predicted by BayesNF at time points where data is missing reproduce the temporal patterns at time points with observed data, which include high-frequency periodic variation and irregular, spatially correlated jumps.

Fig. 4: Spatiotemporal prediction of atmospheric particulate matter (PM10) in German air dataset.
figure 4

a shows the observed data at four time points: each shaded circle represents a measurement of PM10 at a given station. Higher values of PM10 correspond to lower air quality. The data is sparse: at any given time point, only 47% of stations (on average) are associated with a PM10 observation. b Median predictions of PM10 air quality at four time points across the whole spatial field. c Width of 95% prediction intervals of PM10 air quality at four time points across the whole spatial field. d Observed PM10 data (black) and median prediction (red) at four sparsely observed locations across time. Satellite basemap source: Stadia Maps, OpenMapTiles, OpenStreetMap, Stamen Design, CNES, Distribution Airbus DS, Airbus DS, PlanetObserver (Contains Copernicus Data)49.

Variography

The accuracy of PM10 predictions in Fig. 4d cannot be quantitatively assessed because the ground-truth values are not known at the predicted time points. However, we can gain more insight into how well the learned spatiotemporal field matches the observed field by comparing the empirical and inferred semi-variograms. The semi-variogram γ of a process Y characterizes the joint spatiotemporal dependence structure; it is defined as

$$2\gamma ({{\bf{h}}},\tau )={{\rm{Var}}}\left[Y({{\bf{s}}}+{{\bf{h}}},t+\tau )-Y({{\bf{s}}},t)\right]\quad ({{\bf{h}}}\in {{\mathcal{S}}},\tau \in {{\mathcal{T}}}),$$
(14)

where the choice of \({{\bf{s}}}\in {{\mathcal{S}}},t\in {{\mathcal{T}}}\) is arbitrary (e.g., (s, t) = (0, 0)), under the assumption that only the displacements in time and space affect the dependence (Section 2.4.2 of ref. 17).
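
A method-of-moments estimate of the empirical semi-variogram can be sketched as follows for point-referenced data arranged on a common time grid, binning averaged squared differences by spatial distance and integer time lag. The binning scheme and function name are our own, and missing observations are handled by NaN-aware averaging; this is not necessarily the exact estimator used to produce Fig. 5.

```python
import numpy as np

def empirical_semivariogram(coords, values, space_bins, time_lags):
    """Method-of-moments estimate of the semivariogram in Eq. (14) for data
    observed at `coords` (num_sites x d) on a shared time grid, with `values`
    of shape (num_sites, num_times); missing entries may be NaN. `time_lags`
    are integer lags in units of the time grid. Returns an array of shape
    (num_distance_bins, num_time_lags)."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    num_times = values.shape[1]
    gamma = np.full((len(space_bins) - 1, len(time_lags)), np.nan)
    for k, tau in enumerate(time_lags):
        # Squared differences Y(s_a, t + tau) - Y(s_b, t), averaged over t.
        diffs = values[:, None, tau:] - values[None, :, :num_times - tau]
        sq = np.nanmean(diffs ** 2, axis=-1)        # (num_sites, num_sites)
        for b in range(len(space_bins) - 1):
            mask = (dists >= space_bins[b]) & (dists < space_bins[b + 1])
            gamma[b, k] = 0.5 * np.nanmean(sq[mask])
    return gamma
```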

The surface plots in Fig. 5 compare the empirical semi-variogram (left) computed at the 70 observed stations with the inferred semi-variogram (right) computed at 70 uniformly chosen random locations within Germany, for distances h ∈ [0, 1000] kilometers and time lags τ ∈ {0, …, 10} days. The agreement between these two plots suggests that BayesNF accurately generalizes the spatiotemporal dependence structure from the observed locations to novel locations in the field. The lower two panels in Fig. 5 show the empirical (solid line) and inferred (dashed line) semi-variograms, separately for each of the 10 time lags τ. The difference between the semi-variograms is highest for τ ∈ {0, 1, 2} days, suggesting that the learned model is expressing relatively smooth phenomena and assuming that the high-frequency day-to-day variance is due to unpredictable independent noise. The differences between the semi-variograms become small for τ > 2 days, which suggests that BayesNF effectively captures these longer-term temporal dependencies.

Fig. 5: Comparison of the empirical and inferred spatiotemporal semivariograms, which measure the variance of the difference between field values at a pair of locations, for German PM10 air quality dataset.
figure 5

The empirical semivariogram is computed using the locations of the 70 stations in the observed dataset. The inferred semivariogram is computed on 70 novel spatial locations, sampled uniformly at random within the boundary of the field. a The agreement between the semivariogram surfaces indicates that BayesNF extrapolates the joint spatiotemporal dependence structure between locations in the observed data to novel locations. b For short time lags less than three days, the empirical variogram is higher than the inferred variogram at all distances, showing that BayesNF models high-frequency day-to-day variance as unpredictable observation noise.

Discussion

This article proposes a probabilistic approach to scalable spatiotemporal prediction called the Bayesian Neural Field. The model combines a deep neural network architecture for high-capacity function approximation with hierarchical Bayesian modeling for accurate uncertainty estimation over complex spatiotemporal fields. Posterior inference is conducted using stochastic ensembles of maximum a-posteriori estimates or variationally trained surrogates, which are easy to apply and deliver well-calibrated 95% prediction intervals over test data. The results in Fig. 6 confirm that quantifying uncertainty using MAP or VI ensembles is superior to performing maximum-likelihood estimation (MLE), which ignores the parameter priors. While these inference methods are approximate and not guaranteed to match the true posterior, the BayesNF model is a deep neural network whose parameters (weights and biases) are not of inherent interest to a practitioner in a given data analysis task; we therefore expect BayesNF to be most useful in settings where predictive calibration matters more than faithful posterior inference over the parameters. Additional advantages of BayesNF are its relative simplicity, ability to handle missing data, and ability to learn a full probability distribution over arbitrary space-time indexes within the spatiotemporal field.

Fig. 6: Runtime versus prediction error profiles using variational inference (VI; orange), maximum a-posteriori (MAP; blue), and maximum likelihood estimation (MLE; green) for BayesNF on the spatiotemporal benchmarks from Table 1.
figure 6

Markers indicate ensemble size (8, 16, 32, 64, 96). The prediction errors are given in terms of root-mean square error (RMSE) and mean absolute error (MAE) for point forecasts and in terms of mean interval score (MIS) for 95% interval forecasts.

Evaluations against prominent statistical and machine learning baselines on large-scale datasets show that BayesNF delivers significant improvements in both point and interval forecasts. The results also show that combining periodic effects in the temporal domain with Fourier features in the spatial domain enables BayesNF to capture spatiotemporal patterns with multiple (non-integer) periodicity and high-frequency components. As a domain-general method, BayesNF can produce strong results on multiple datasets without the need to hand-design the model from scratch each time or apply dataset-specific inference approximations. For a representative air quality dataset, the semi-variograms inferred by BayesNF evaluated at novel spatial locations agree with the empirical semi-variogram computed at observed locations, which highlights the model’s ability to generalize well in space and time.

Practitioners across a spectrum of disciplines—from meteorology to urban studies and environmental informatics—are in need of more scalable and easy-to-use statistical methods for spatiotemporal prediction. A freely available implementation of BayesNF built on the Jax machine learning platform, along with user documentation and tutorials, is available at https://github.com/google/bayesnf. We hope these materials help practitioners obtain strong BayesNF models for many spatiotemporal problems that existing software cannot easily handle.

The approach discussed in this paper opens several avenues to future work. While Bayesian Neural Fields are designed to minimize the user’s involvement in constructing a predictive model, further improvements can be achieved by enabling domain experts to incorporate specific statistical covariance structure that they know to be present. It is also worthwhile to explore applications of BayesNF for modeling the residuals of causal or mechanistic laws in physical systems where there exist strong domain theories of the average data-generating process, but poor models of the empirical noise process. Another promising extension is using BayesNF models to handle not only “geostatistical” datasets, in which the measurements are point-referenced in space, but also “areal” or “lattice” datasets, where the measurements represent aggregated quantities over a geographical region. While areal datasets are often converted to geostatistical datasets by using the centroid of the region as the representative point, a more principled approach would be to compute the integral of a Bayesian Neural Field over the region. Finally, BayesNF can be generalized to handle multivariate spatiotemporal data, where each spatial location is associated with multiple time series that contain within-location and across-location covariance structure. Effectively handling such datasets will even further broaden the scope of problems that BayesNF can solve.

Methods

Posterior inference

Let P(Θf, Θy, Y) denote the joint probability distribution over the parameters and observable field in Box 1. The posterior distribution is given by Eq. (13) in the main text. We describe two approximate posterior inference algorithms for BayesNF. In these sections, we define Θ = (Θf, Θy), θ = (θf, θy), and r = (s, t).

Stochastic MAP ensembles

A simple approach to uncertainty quantification is based on the “maximum a-posteriori” estimate:

$${\theta }^{*}=\arg {\max }_{\theta }\left\{\log P({\theta }_{f},{\theta }_{y},{\left\{Y({{{\bf{r}}}}_{i})=y({{{\bf{r}}}}_{i})\right\}}_{i=1}^{N})\right\}.$$
(15)

We find an approximate solution to the optimization problem (15) using stochastic gradient ascent on the joint log probability, according to the following procedure, where B ≤ N is a mini-batch size and (ϵ1, ϵ2, … ) is a sequence of learning rates:

$${{\rm{Initialize}}}\,{\theta }_{0} \sim {\pi }_{f}{\pi }_{y};\,t\leftarrow 0$$
(16)

Repeat until convergence

$$ \{{I}_{1},\ldots,{I}_{B}\} \sim {{\rm{Uniform}}}(\{K\subset [N] \, | \, {{\rm{card}}}(K)=B\})$$
(17)
$${\hat{g}}_{t}={\nabla }_{\theta }{\left[\log {\pi }_{f}({\theta }_{f})+\log {\pi }_{y}({\theta }_{y})+\frac{N}{B}\mathop{\sum }_{j=1}^{B}\log \left({{\rm{Dist}}} ( \, y({{{\bf{r}}}}_{{I}_{j}});{F}_{{\theta }_{f}}({{{\bf{r}}}}_{{I}_{j}}),{\theta }_{y})\right)\right]}_{{\theta }_{t-1}}$$
(18)
$${\theta }_{t}={\theta }_{t-1}+{\epsilon }_{t}{\hat{g}}_{t};t\leftarrow t+1.$$
(19)

We construct an overall “deep ensemble” \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\) containing M ≥ 1 MAP estimates by repeating the above procedure M times, each with a different initialization of θ0 and random seed.
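
A minimal JAX sketch of this procedure is given below. The interfaces (`log_joint`, `init_params`), the constant learning rate, and the fixed step count are illustrative assumptions rather than the settings of the released implementation; `log_joint(params, batch, scale)` is assumed to return the log prior plus `scale` times the sum of batch log likelihoods, matching Eq. (18).

```python
import jax
import jax.numpy as jnp

def map_ensemble(log_joint, init_params, data, rng_key,
                 num_members=8, num_steps=5000, batch_size=512, lr=1e-3):
    """Stochastic MAP ensemble (Eqs. (16)-(19)): stochastic gradient ascent on
    the joint log probability from `num_members` random initializations.
    `data` is a dict of equal-length arrays (assumed here to include a "y" key)."""
    N = len(data["y"])
    grad_fn = jax.jit(jax.grad(log_joint))          # gradient w.r.t. params
    ensemble = []
    for m in range(num_members):
        key = jax.random.fold_in(rng_key, m)
        params = init_params(key)                   # theta_0 ~ pi_f * pi_y, Eq. (16)
        for _ in range(num_steps):
            key, sub = jax.random.split(key)
            idx = jax.random.choice(sub, N, shape=(batch_size,), replace=False)    # Eq. (17)
            batch = {k: v[idx] for k, v in data.items()}
            g = grad_fn(params, batch, N / batch_size)                             # Eq. (18)
            params = jax.tree_util.tree_map(lambda p, gi: p + lr * gi, params, g)  # Eq. (19)
        ensemble.append(params)
    return ensemble
```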

Stochastic variational inference

A more uncertainty-aware alternative to MAP ensembles is mean-field variational inference, which uses a surrogate posterior \({q}_{\phi }(\theta )={\prod }_{i=1}^{{n}_{f}}\nu ({\theta }_{f,i};{\phi }_{f,i})\mathop{\prod }_{i=1}^{{n}_{y}}\nu ({\theta }_{y,i};{\phi }_{y,i})\) over Θ to approximate the true posterior \(P({\theta }_{f},{\theta }_{y}| {{\mathcal{D}}})\) (13) given the data \({{\mathcal{D}}}\). Optimal values for the variational parameters \(\phi=({\phi }_{f,1},\ldots,{\phi }_{f,{n}_{f}},{\phi }_{y,1},\ldots,{\phi }_{y,{n}_{y}})\) are obtained by maximizing the “evidence lower bound”:

$${{\rm{ELBO}}}(\phi )=\log P({{\mathcal{D}}})-{{\rm{KL}}}({q}_{\phi }(\theta )| | P(\theta | {{\mathcal{D}}}))={{\mathbb{E}}}_{\phi }\left[\log \frac{P({{\mathcal{D}}},\theta )}{{q}_{\phi }(\theta )}\right]$$
(20)
$$={{\mathbb{E}}}_{\phi }[\log P({{\mathcal{D}}}| \theta )]-{{\rm{KL}}}({q}_{\phi }(\theta )| | \pi (\theta )).$$
(21)
$$ =\mathop{\sum }_{i=1}^{N}{{\mathbb{E}}}_{\phi }\left[\log \left({{\rm{Dist}}}(y({{{\bf{r}}}}_{i});{F}_{{\theta }_{f}}({{{\bf{r}}}}_{i}),{\theta }_{y})\right)\right]\\ - \sum _{i=1}^{{n}_{f}}{{\mathbb{E}}}_{{\phi }_{f,i}}\left[\log \left(\frac{\nu ({\theta }_{f,i};{\phi }_{f,i})}{{\pi }_{f}({\theta }_{f,i})}\right)\right]- \sum _{i=1}^{{n}_{y}}{{\mathbb{E}}}_{{\phi }_{y,i}}\left[\log \left(\frac{\nu ({\theta }_{y,i};{\phi }_{y,i})}{{\pi }_{y}({\theta }_{y,i})}\right)\right].$$
(22)

where Eq. (22) follows from the independence of the priors. Finding the maximum of Eq. (22) is a challenging optimization problem. Our implementation leverages a Gaussian variational posterior qϕ with KL reweighting, as described in Blundell et al. (Sections 3.2 and 3.4 of ref. 45).

Mean-field variational inference is known to underestimate posterior variance and can also get stuck in local optima of Eq. (21). To alleviate these problems, we use a variational ensemble that is analogous to the MAP ensemble described above. More specifically, we first perform M ≥ 1 runs of stochastic variational inference with different initializations and random seeds, which gives us an ensemble {ϕii = 1, …, M} of variational parameters. We then approximate the posterior \(P(\theta | {{\mathcal{D}}})\) with an equal-weighted mixture of the resulting variational distributions \({\{{q}_{{\phi }^{i}}\}}_{i=1}^{M}\).
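
For concreteness, the sketch below computes a Monte Carlo estimate of the ELBO (22) for a mean-field Gaussian surrogate over a flattened parameter vector, using the reparameterization trick. The interface and the omission of KL reweighting and structured priors are simplifications relative to the procedure in ref. 45.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def elbo_estimate(phi, log_lik_fn, log_prior_fn, data, key, num_samples=4):
    """Monte Carlo estimate of the ELBO in Eq. (22) for a mean-field Gaussian
    surrogate q_phi = N(mu, softplus(rho)^2) over a flat parameter vector theta.
    `log_lik_fn(theta, data)` and `log_prior_fn(theta)` are assumed to return
    the summed log likelihood and log prior, respectively."""
    mu, rho = phi["mu"], phi["rho"]
    sigma = jnp.log1p(jnp.exp(rho))                         # softplus scale
    eps = jax.random.normal(key, (num_samples, mu.size))
    thetas = mu + sigma * eps                               # reparameterized draws

    def per_sample(theta):
        log_q = jnp.sum(norm.logpdf(theta, mu, sigma))
        return log_lik_fn(theta, data) + log_prior_fn(theta) - log_q

    return jnp.mean(jax.vmap(per_sample)(thetas))
```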

Prediction queries

We can approximate the posterior (13) using a set of samples \({\{({\theta }_{f}^{i},{\theta }_{y}^{i})\}}_{i=1}^{M}\), which may be obtained from either MAP ensemble estimation or stochastic variational inference (by sampling from the ensemble of M variational distributions). We can then approximate the posterior-predictive distribution \(P(Y({{{\bf{r}}}}_{*})| {{\mathcal{D}}})\) (which marginalizes out the parameters Θ) of Y(r*) at a novel field index r* = (s*, t*) by a mixture model with M equally weighted components:

$$\hat{P}(Y({{{\bf{r}}}}_{*})| {{\mathcal{D}}})=\frac{1}{M}\mathop{\sum }_{i=1}^{M}{{\rm{Dist}}}(Y({{{\bf{r}}}}_{*});{F}_{{\theta }_{f}^{i}}({{{\bf{r}}}}_{*}),{\theta }_{y}^{i}).$$
(23)

Equipped with Eq. (23), we can directly compute predictive probabilities of events {Y(r*) ≤ y}, predictive probability densities {Y(r*) = y}, or conditional expectations \({\mathbb{E}}\left[\varphi (Y({{{\bf{r}}}}_{*}))| {{\mathcal{D}}}\right]\) for a probe function \(\varphi :{\mathbb{R}}\to {\mathbb{R}}\). Prediction intervals around Y(r*) are estimated by computing the α-quantile yα(r*), which satisfies

$$\hat{P}(Y({{{\bf{r}}}}_{*})\le {y}_{\alpha }({{{\bf{r}}}}_{*})| {{\mathcal{D}}})=\alpha \quad \alpha \in [0,1].$$
(24)

For example, the median estimate is y0.50(s*, t*) and the 95% prediction interval is [y0.025(s*, t*), y0.975(s*, t*)]. The quantiles (24) are estimated numerically using Chandrupatla’s root finding algorithm46 on the cumulative distribution function of the mixture (23).
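
The quantile computation can be sketched as follows for the case where each ensemble member contributes a Gaussian predictive component; we use SciPy's `brentq` root finder here in place of Chandrupatla's algorithm, and the Gaussian components and bracketing interval are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def predictive_quantile(alpha, means, scales):
    """Quantile y_alpha(r*) of the ensemble predictive distribution (23),
    assuming Gaussian mixture components with the given means and scales."""
    def shifted_cdf(y):
        return np.mean(norm.cdf(y, loc=means, scale=scales)) - alpha
    lo = (means - 10 * scales).min()        # heuristic bracketing interval
    hi = (means + 10 * scales).max()
    return brentq(shifted_cdf, lo, hi)

# Median and 95% prediction interval at a novel index r* from M = 4 ensemble members.
means = np.array([2.1, 1.9, 2.3, 2.0])
scales = np.array([0.4, 0.5, 0.45, 0.4])
median = predictive_quantile(0.50, means, scales)
interval = (predictive_quantile(0.025, means, scales),
            predictive_quantile(0.975, means, scales))
```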

Temporal seasonal features

Including seasonal features (cf. Eq. (8)), where possible, is often essential for accurate prediction. Example periods p for datasets with a variety of time units and seasonal components are listed below (Y=Yearly; Q=Quarterly; Mo=Monthly; W=Weekly; D=Daily; H=Hourly; Mi=Minutely; S=Secondly):

  • Q: Y=4

  • Mo: Q=3, Y=12

  • W: Mo=4.35, Q=13.045, Y=52.18

  • D: W=7, Mo=30.44, Q=91.32, Y=365.25

  • H: D=24, W=168, Mo=730.5, Q=2191.5, Y=8766

  • Mi: H=60, D=1440, W=10080, Mo=43830, Q=131490, Y=525960

  • S: Mi=60, H=3600, D=86400, W=604800, Mo=2629800, Q=7889400, Y=31557600

Ablations

To better understand how the prediction accuracy of BayesNF varies with the choices of inference algorithm and network architecture, results from two classes of ablation studies for the benchmarks in Table 2 are reported.

Inference methods: comparison of VI, MAP, and MLE

Figure 6 shows a comparison of runtime vs. accuracy profiles on the six benchmarks from Table 1 using three parameter inference methods for BayesNF—VI, MAP, and MLE. MLE is the maximum likelihood estimation baseline described in Lakshminarayanan et al.47, which is identical to Box 1 except that the terms πf and πy in Eq. (18) are ignored. MLE performs no better than MAP or VI in all 18/18 profiles (and is typically worse), illustrating the benefits of parameter priors and posterior uncertainty, which impose no runtime overhead. Between MAP and VI, the latter performs better in 13/18 profiles: that is, on all metrics for Wind, Air Quality 1, and Air Quality 2; on RMSE and MAE for Chickenpox; and on RMSE and MIS for Precipitation.

Model architectures

Figure 7 shows the percentage change in RMSE, MAE, MIS, and runtime using BayesNF (MAP inference; 64 particles; fixed number of training epochs) while applying a single change to the reference model for each benchmark. The goal of these ablations is to study how changes to the network structure affect the predictive performance.

Fig. 7: Percentage change in error (RMSE, MAE, MIS) and runtime on benchmark datasets using BayesNF with various modeling ablations.
figure 7

Horizontal bars in red (resp. blue) show an increase (resp. decrease) in the error and runtime measurements. Errors bars show the minimum and maximum across all train/test splits. a Decrease network depth by 1. b Increase network depth by 1. c Half network width. d Double network width. e No convex combination (tanh activation). f No convex combination (elu activation). g No covariate scaling layer. h No spatial Fourier features.

Figure 7a, b shows results for decreasing or increasing the network depth by one layer. The Sea Surface Temperature benchmark is the most sensitive to the network depth, where decreasing the depth causes the forecast errors to increase by around 50%, whereas increasing the depth delivers 5–10% decreases. The MIS error is particularly sensitive to reducing the network depth where the results become significantly worse in 5/6 benchmarks, although the runtime also decreases by up to 50%.

Figure 7c, d shows results for halving or doubling the width of the hidden layers. The Sea Surface Temperature benchmark is highly sensitive to halving the network width, with errors increasing above 25%. The remaining benchmarks demonstrate slight improvements in the errors which are not statistically significant, suggesting that the runtime gains could justify halving the width in these benchmarks. Doubling the width causes substantial increases in the runtime with no systematic pattern in the RMSE, MAE, or MIS values across the benchmarks.

Figure 7e, f shows results using only tanh or elu activations instead of the convex combination layer. Discarding the convex combination layer delivers runtime speedups, which are larger using tanh as compared to elu. However, there is no clear winner in terms of error when using only tanh or only elu, and neither activation on its own consistently reduces any error metric. The changes in error which are consistently positive (as compared to the convex combination layer) are (i) tanh only: Air Quality 2 (MIS 16%); (ii) elu only: Sea Surface Temperature (RMSE 59%, MAE 76%, MIS 49%) and Precipitation (MAE 7.8%, MIS 16%).

Figure 7g shows results for disabling the covariate scaling layer. The runtime is only slightly changed in all benchmarks. However, several errors increase consistently on average, namely in the Precipitation (RMSE 24%, MAE 27%, MIS 33%), Chickenpox (MIS 32%), and Air Quality 1 (MAE 13%) benchmarks. The remaining changes are neither consistently above nor below zero.

Figure 7h shows results for omitting the spatial Fourier features (Eq. (9)). While omitting these features delivers small runtime improvements, it also causes substantial increases in RMSE, MAE, and MIS values across all benchmarks except for Wind. These results support the hypothesis that spatial Fourier features are essential for accurate generalization across space and time.

In summary, the results (specifically Fig. 7e–h) demonstrate that architectural choices in BayesNF such as the spatial Fourier features, convex combination layer, and covariate scaling are effective in reducing the prediction error across several benchmarks and metrics, at the cost of a manageable runtime overhead.

Evaluation metrics

The quality of point forecasts is evaluated using RMSE and MAE scores. Interval forecasts are evaluated using the MIS score at level α = 0.05. The definitions are as follows:

$$\, {{\mbox{Root}}} \, {{\mbox{Mean}}} \, {{\mbox{Squared}}} \, {{\mbox{Error}}} \, {{\mbox{(RMSE)}}}\,\quad \sqrt{\sum _{i=1}^{n}{( \, {y}_{i}-{\hat{y}}_{i})}^{2}/n}$$
(25)
$$\,{{\mbox{Mean}}} \, {{\mbox{Absolute}}} \, {{\mbox{Error}}} \, {{\mbox{(MAE)}}}\,\quad \sum _{i=1}^{n}\left\vert \, {y}_{i}-{\hat{y}}_{i}\right\vert /n$$
(26)
$$\,{{\mbox{Mean}}} \, {{\mbox{Interval}}} \, {{\mbox{Score}}} \, {{\mbox{(MIS)}}}\,\quad \sum _{i=1}^{n}\left[({u}_{i}-{\ell }_{i})+\frac{2}{\alpha }({\ell }_{i}-{y}_{i}){{\bf{1}}}[{y}_{i} < {\ell }_{i}]+\frac{2}{\alpha }({y}_{i}-{u}_{i}){{\bf{1}}}[{u}_{i} < {y}_{i}]\right]/n,$$
(27)

where yi is the true value, \({\hat{y}}_{i}\) is the point forecast, and (iui) are endpoints of the interval forecast.
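
For reference, these metrics translate directly into NumPy as follows (a transcription of Eqs. (25)–(27); array names are our own).

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error, Eq. (25)."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    """Mean absolute error, Eq. (26)."""
    return np.mean(np.abs(y - yhat))

def mean_interval_score(y, lower, upper, alpha=0.05):
    """Mean interval score, Eq. (27), for central (1 - alpha) prediction intervals."""
    width = upper - lower
    penalty_below = (2.0 / alpha) * (lower - y) * (y < lower)
    penalty_above = (2.0 / alpha) * (y - upper) * (y > upper)
    return np.mean(width + penalty_below + penalty_above)
```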

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.