## Introduction

Probabilistic forecasts predict the likelihood of weather conditions at a given time and location. Weather conditions of interest can range from core atmospheric variables such as rate of rain and snow, wind velocity and direction, temperature, pressure levels, and solar coverage to weather patterns such as hurricanes, wildfires, and floods1,2. For the case of precipitation, a probabilistic forecast answers the question, “What is the current probability of a given amount of precipitation occurring at a location and time in the future?”

Short-term forecasting up to twelve hours in advance allows for predicting weather conditions with higher spatial and temporal precision than longer time ranges. This makes it possible for these forecasts to have substantial impact on society by helping with daily planning, energy management, transportation, and the mitigation of extreme weather events, among others3. Short-term forecasting is also a longstanding scientific challenge that combines our best understanding of the physics of the atmosphere with our most advanced computational capabilities. Current operational models for short-term forecasting are Numerical Weather Prediction (NWP) models that rely on physics-based simulations. The atmospheric simulations make use of supercomputers with heterogeneous hardware that run virtually continuously in data centers around the globe and update the forecasts based on the latest observations. The weather conditions that the models predict include hundreds of atmospheric and oceanic features. The forecasts usually have a frequency of one or more hours and a grid resolution of 3–12 km. NWP methods obtain a probabilistic forecast by ensembling or post-processing the output of multiple individual physics-based models, each in turn requiring atmospheric simulation at the supercomputer scale. The accuracy of a physics-based forecast is tied to the grid resolution as more precise physics simulations require a finer representation of the state of the atmosphere. This relationship creates a computational bottleneck inherent to physics-based models that has proven challenging to overcome4. Besides resolution, the accuracy of the forecasts also depends on how well the physical models used in NWP describe the atmosphere at the various relevant scales; improving these models is a substantial scientific challenge by itself5.

Due to the computational bottleneck, the large computational resources, and the time lag that physics-based models incur when making a forecast, efficient models based on deep neural networks represent a promising alternative framework for weather modeling6,7 (Fig. 1). Instead of explicitly simulating the physics of the atmosphere, neural models learn the relationships between input observations and output variables directly from data. Neural networks can run in a matter of seconds on parallel hardware and can thus generate forecasts more frequently and with higher spatial resolution. The networks are also notably simple and can be specified with generic modules in a few dozens of lines of code without hand-tuned routines for a specific task. The prediction of a neural network can also naturally be made probabilistic, learning to capture all the possible variability of the forecast from the data itself. These properties can not only offer improved forecasts, but also frequent and personalized forecasts3 and open avenues of new applications that rely on the models’ efficiency and flexibility. However, showing that the neural networks are able to learn to emulate the physics of the atmosphere sufficiently well to make skillful high-resolution forecasts for up to twelve hours ahead—a period that requires an advanced understanding of atmospheric physics and is well beyond the skill of extrapolation and short-term nowcasting methods6,8,9,10—is a substantial open challenge at the core of the neural modeling approach.

In this work, we present MetNet-2, a probabilistic weather model based on deep neural networks that is a successor to MetNet7. MetNet-2 features a forecasting range of up to 12 h of lead time at a frequency of 2 min and a spatial resolution of 1 km. In order to capture sufficient input context, MetNet-2 uses input observations from a 2048 km × 2048 km region and adopts novel neural network architectural elements in order to effectively process the large context. Such elements are (a) a context-aggregating module that enables the receptive field of the network to double after every layer, (b) a strong lead time conditioning scheme and (c) a model parallel training setup utilizing multiple chips for increased memory and parallel computation.

## Results

We train MetNet-2 to forecast precipitation, a fast-changing weather variable, over a 7000 km × 2500 km region of the Continental United States (CONUS). We find that MetNet-2 outperforms the probabilistic ensemble High-Resolution Ensemble Forecast (HREF) for the entire lead time range of 12 h according to the probabilistic metric Cumulative Ranked Probability Score (CRPS). When both MetNet-2 and HREF are thresholded to produce a categorical forecast, MetNet-2 outperforms HREF up to at least 9 h of lead time for both low and high rates of precipitation, according to the categorical Critical Success Index (CSI) metric. These results hold despite the key difference that MetNet-2 has an output resolution of 1 km and does not rely on forward atmospheric simulation, whereas HREF has a resolution of 3 km and relies on the results of five different forward atmospheric simulations from respective physics-based models, including those from the High-Resolution Rapid Refresh (HRRR) model11.

We also study the performance of MetNet-2 in a hybrid mode where the physics-based forecast is used as an additional input to MetNet-2 itself. We find that Hybrid MetNet-2 is able to outperform a MetNet2-postprocessed HRRR forecast up to the entire range of 12 h, according to both CRPS and CSI. This shows the ability of MetNet-2 to extract and relay unique information that is not available in the atmospheric simulation even for longer lead times. MetNet-2’s performance represents a step forward towards skillful forecasts with neural networks and suggests that MetNet-2 may be learning to emulate aspects of atmospheric physics. We perform an analysis into what MetNet-2 has learnt using state-of-the-art interpretation methods. The analysis reveals that MetNet-2 appears to make use of advanced physics principles when making its forecasts, which the model learns directly from the data.

### Comparison with HREF

Although MetNet-2 uses the assimilation features that come from HRRR in order to gain a more complete picture of the initial state of the atmosphere, MetNet-2 does not rely on the atmospheric simulation itself that is the most computationally intensive part of an NWP model. The ensemble HREF model relies on 10 such simulations coming from five different NWP models each running on a supercomputer. The first result on dataset A is that MetNet-2 obtains a better CRPS than HREF over the entire lead time range of 12 h (Fig. 2). This metric is particularly appropriate in this evaluation as both CRPS and HREF are probabilistic. CRPS also takes into account the entire distribution across all precipitation rates; in other words, the result incorporates the performance from low to high rates of precipitation. When thresholding both MetNet-2 and HREF, optimized based on the categorical metric CSI, MetNet-2 outperforms HREF for at least the first 9 h of lead time for low (0.2 mm/h) and high rates of precipitation up to 20 mm/h (Fig. 2 and Supplementary Fig. 3). MetNet-2 and HREF both outperform HRRR on these metrics across the whole 12 h range (Figs. 2 and 3). The skill gap between MetNet-2 and HREF is greatest in relative terms at the earliest hours and decreases gradually over time. Figure 4 represents two case studies of MetNet-2 and HREF forecasts. Both the uncertainty of the prediction and the variability grow over time and these aspects are evident in MetNet2’s and HREF’s forecasts. The probability of precipitation that MetNet-2 assigns to a given location tends to decrease on average over time as the probability mass is spread over a growing region of likely precipitation. The expected amount of precipitation that MetNet-2 forecasts in a patch closely matches the ground truth amount of precipitation.

### Hybrid results

A second core result is the performance comparison of MetNet-2 in a hybrid setting that uses the prediction of an NWP model, in this case HRRR. We compare MetNet-2 Postprocess with MetNet-2 Hybrid on test dataset B for both instantaneous and cumulative precipitation targets. MetNet-2 Postprocess maps HRRR’s forecast to a probabilistic one. MetNet-2 Hybrid maps both MetNet-2’s default inputs as well as HRRR’s forecasts to a probabilistic forecast. Remarkably, the MetNet-2 architecture is able to add value to HRRR’s postprocessed forecast all the way up to 12 h of lead time. That is, the performance of MetNet-2 Hybrid is higher than that of MetNet-2 Postprocess across the whole range based on both CRPS and CSI scores and for both low and high rates of precipitation (Figs. 3 and 5). Figure 6 shows a case study with these models and Fig. 7a, b visualize, respectively, the probabilistic error based on the Brier score achieved by the models, and the prediction regions for various rates of precipitation based on the CSI thresholds.

### Ablations

We perform various ablations on the default MetNet-2 in order to shed further light on the model’s performance. A regional evaluation of MetNet-2 shows that the model performs well across diverse regions that see varying levels of annual precipitation (Supplement H). On the architectural front, we find that the size of the input context of 2048 km × 2048 km improves performance over context sizes of 1536 km × 1536 km, 1024 km × 1024 km and 512 km × 512 km (see “Methods” and Supplement F.1). The additional observations of the atmosphere that the assimilation process incorporates also have an impact on MetNet-2’s performance, especially at later hours (see “Methods” and Supplement F.2). MetNet-2 is able to extract information from a broad range of observations and any additional observations are likely to improve MetNet-2’s performance. Furthermore, both removing the special conditioning scheme for the lead time index (see “Methods” and Supplement F.4) and limiting the maximum dilation factor to 16 or below also impact MetNet-2’s performance negatively (see “Methods” and Supplement F.5).

### Interpretation

MetNet-2’s remarkable performance makes it important to understand what the physics-free neural network is learning. This can help researchers gain new insight about interactions between different meteorological variables and ensure that the model conforms to our prior knowledge about weather physics. We adopt a state-of-the-art neural interpretation technique called Integrated Gradients to attribute predictions to the input variables12. Among the notable findings, Fig. 8b shows that the relative importance of absolute vorticity is small for near-term forecasts, but grows in importance as lead time increases all the way up to 12 h. In Fig. 8a, the importance of upper-level vorticity for a twelve hour forecast is consistent with what is known as quasi-geostrophic theory, a non-trivial set of simplifications and filtering of the equations of motion. A key result in the theory is that positive vorticity in the upper-troposphere is consistent with upward motion in the lower-troposphere13. This upward motion does not directly trigger precipitation, but prepares the atmosphere for convection. See Supplement G for other key findings.

## Discussion

MetNet-2’s strong performance for both low and high levels of precipitation and for both instantaneous and cumulative measures, its ability to estimate uncertainty and capture variation, its independence from atmospheric simulation, its design simplicity and the rapid and different nature of MetNet-2’s computation represent a step towards a fundamental shift in forecasting from physics-based models to learning-based ones. The results also show how neural networks can learn to emulate complex and large-scale physics paving the way for ever more ambitious applications of neural nets in the physical sciences. Direct sensor data, although not readily available, can likely be used in place of the assimilation state to further reduce MetNet-2’s total latency to essentially just the time required for observing the atmosphere while removing any remaining reliance on NWP’s initial state. Though designed for geo-spatial prediction, little in MetNet-2’s architecture is specific to precipitation. This raises hopes that MetNet-2 could work well for many other weather variables possibly at once and even learn to transfer from one variable to the next and improve overall performance.

## Methods

### Ground truth

The radar precipitation measures are especially important for our task as they also serve as the ground truth training targets for MetNet-2. The instantaneous measures and the hourly cumulated measures are produced at 2 and 60 min intervals respectively. The measures range from a rate of 0–102.4 mm/h, with the higher and more extreme rates of precipitation becoming increasingly rare in the data; see Supplement B.2 summarizing how often various rates occur in the data.

### Dataset creation and splits

The data for MetNet-2 comes in input-output pairs where the inputs include radar, satellite, and weather state and outputs, the ground truth, correspond to the radar precipitation estimates. The available data spans a period from July 2017 to August 2020. The training, validation and test data sets are generated without overlap from periods in sequence. Successive periods of 400, 12, 40, 40 and 12 h are used to sample, respectively, training, validation, and test data, with the two 12 h periods inserted as hiatus. Spatially, the target patches are sampled randomly from intersections on a grid over the CONUS region spaced at .5 degrees in longitude and latitude. We sample two different test datasets, A and B, the former for our main comparison with HREF and the latter to compare the various MetNet-2 variants. Test dataset A covers only cumulative precipitation as HREF doesn’t forecast instantaneous precipitation and the available HREF data over CONUS overlaps at 953 timestamps with the rest of the data, from which the test dataset A is sampled. Test dataset B covers both cumulative and instantaneous precipitation and overlaps with the rest of the data at all timestamps. Both datasets contain 39,841 patches each.

### MetNet-2 postprocess and hybrid

To study MetNet-2’s performance in hybrid settings, we consider other training modes for MetNet-2 that, contrary to the default MetNet-2, make use of the outcome of NWP’s atmospheric simulation. MetNet-2 Postprocess takes as an input HRRR’s forecast for a given lead time along with static location, altitude, and time features and learns to map HRRR’s forecast as closely as possible to the ground truth. It also learns to correct for any systematic biases in HRRR’s forecast and makes the forecast probabilistic. MetNet-2.

Hybrid learns to extract information from all the available inputs, including the twelve hourly forecasts from HRRR, as well as the radar and assimilation inputs used in the default MetNet-2. Whereas HRRR produces individual non-probabilistic rollouts, MetNet-2 and the variants are probabilistic at their core. Figure 1 summarizes the types of models and the respective steps.

### Model and architecture

A probabilistic forecast captures the combined uncertainty of both the measurements and the model:

$$P({{{{\bf{r}}}}}_{x,y,t}{{{{{\rm{|}}}}}}t_0)=f({{{{{\bf{c}}}}}}_{x,y,t_0},L)$$
(1)

where r are rates of precipitation, x,y,t are the location and target time of the forecast, t0 is the time at which the forecast is made, cx,y,t0 is the atmospheric context at time t0 relevant for location x,y and L = t − t0 is the lead time of the forecast. MetNet-2 bins the precipitation rates into 512 categories that allow the model to forecast arbitrary discrete probability distributions over the categories7.

The size of the input context plays a key role in the design of MetNet-2’s architecture. Due to fast-changing nature of the atmosphere, the longer the lead time of the forecast for a location x,y the more context the model needs around x,y in order to have sufficient information for a skillful forecast. The context grows spatially in both dimensions and hence the total number of locations to attend grows quadratically in the length of the lead time. For a target patch of size 512 km × 512 km and forecast lead times of up to 12 h, MetNet-2 uses an input context size of 2048 km × 2048 km. This amounts to between 64 and 85 km of context per hour of lead time in each spatial dimension.

Besides making a large context available to the network, the network must be able to process and attend to the key parts of the context with its architecture. It is a special feature of the weather forecasting task that these key parts vary as a function of lead time: for the same input patch of data as lead time increases, the network must attend to key parts of an ever-growing potential region. These variable range dependencies present a challenge for the design of the neural architecture.

### Input encoder

The input to MetNet-2 captures 2048 km × 2048 km of weather context for each input feature, but it is downsampled via averaging by a factor of 4 in each spatial dimension, resulting in an input patch of 512 × 512 positions (Fig. 9a). The downsampling provides a trade-off between maintaining a sufficient amount of information in the context while substantially reducing the amount of computation required to encode this information.

In addition to the input patches having spatial dimensions, they also have a time dimension in the form of multiple time slices (see Supplement B.1 for details). This is to ensure that the network has access to the temporal dynamics in the input features. After padding and concatenation together along the depth axis, the input sets are embedded using a convolutional recurrent network10 in the time dimension7.

### Exponentially dilated convolutions

The next part of MetNet-2’s architecture aims at connecting each position in the layer representing the encoded context with every other position in order to capture the full context. MetNet-2 uses two-dimensional convolutional residual blocks with a sequence of exponentially increasing dilation factors of size 1, 2, 4,…, 12816,17. Dilation factors increase the receptive field of the convolution by skipping positions without increasing the number of parameters (Fig. 9c). Each position connects in this manner to all of the other 512 × 512 positions of the encoded tensor. Supplementary Fig. 2 illustrates the exact residual block with the dilated convolutions. Three stacks of 8 residual blocks form this context aggregating part of MetNet-2’s architecture. The target patch of precipitation that MetNet-2 predicts corresponds to 512 km × 512 km and is centered in the middle of the 2048 km × 2048 km of the input patch. Because of that, the 512 × 512 positions from the context aggregation in the input encoder, are cropped to 128 × 128 positions. To obtain a prediction for the full size target patch, we upsample four times in each dimension, effectively creating another layer of 512 × 512 positions. This is processed with another shallow network and ends with a categorical prediction over 512 precipitation levels for each target position. See Fig. 9d for a full depiction of the architecture and Supplement D for additional architectural details.