Extended-range statistical ENSO prediction through operator-theoretic techniques for nonlinear dynamics

Forecasting the El Niño-Southern Oscillation (ENSO) has been a subject of vigorous research due to the important role of the phenomenon in climate dynamics and its worldwide socioeconomic impacts. Over the past decades, numerous models for ENSO prediction have been developed, among which statistical models approximating ENSO evolution by linear dynamics have received significant attention owing to their simplicity and comparable forecast skill to first-principles models at short lead times. Yet, due to highly nonlinear and chaotic dynamics (particularly during ENSO initiation), such models have limited skill for longer-term forecasts beyond half a year. To resolve this limitation, here we employ a new nonparametric statistical approach based on analog forecasting, called kernel analog forecasting (KAF), which avoids assumptions on the underlying dynamics through the use of nonlinear kernel methods for machine learning and dimension reduction of high-dimensional datasets. Through a rigorous connection with Koopman operator theory for dynamical systems, KAF yields statistically optimal predictions of future ENSO states as conditional expectations, given noisy and potentially incomplete data at forecast initialization. Here, using industrial-era Indo-Pacific sea surface temperature (SST) as training data, the method is shown to successfully predict the Niño 3.4 index in a 1998-2017 verification period out to a 10-month lead, which corresponds to an increase of 3-7 months over a benchmark linear inverse model (LIM), while significantly improving upon the ENSO predictability "spring barrier". In particular, KAF successfully predicts the historic 2015/16 El Niño at initialization times as early as June 2015, which is comparable to the skill of current dynamical models. An analysis of a 1300-yr control integration of a comprehensive climate model (CCSM4) further demonstrates that the enhanced predictability afforded by KAF holds over potentially much longer leads, extending to 24 months versus 18 months in the benchmark LIM.
Probabilistic forecasts for the occurrence of El Niño/La Niña events are also performed, and assessed via information-theoretic metrics, showing an improvement of skill over LIM approaches, thus opening an avenue for environmental risk assessment relevant in a variety of contexts.
Previous studies on improving the skill of conventional LIMs 1,2 have highlighted the importance of nonlinearity in ENSO dynamics, such as surface-subsurface interactions and surface winds 3, which are usually underestimated in linear dynamics approximations. Adequate representation of such processes within statistical ENSO models is important to attain optimal long-term forecast skill, influencing two major components of model design, namely (i) the construction of predictor vectors to extract pertinent information about the state of the climate system; and (ii) the assumed evolution dynamics employed to make predictions beyond the training period. For instance, LIMs use the leading principal components from empirical orthogonal function (EOF) analysis as predictors, and approximate their tendencies by a Markov prediction model similar to a linear regression. A linear structure is imposed in two aspects here: the linear predictors obtained by EOF analysis, and the linear dynamical model treating these predictors as state vectors.
This suggests that two corresponding types of improvement of conventional LIMs can be sought.
Indeed, a number of studies 3,4 have shown that inclusion of LIM residuals by the present and recent past states can implicitly capture subsurface forcing and its nonlinear interactions with SST, consequently contributing to more skillful ENSO predictions. On the other hand, replacing EOF analysis by nonlinear dimension reduction techniques 5 has also led to improved performance by LIMs under certain circumstances, especially for long-term ENSO forecasts.
Other statistical approaches for ENSO prediction [6][7][8][9] have included methods based on Lorenz's analog forecasting technique 10 or neural networks (NNs) 11 . The core idea of analog methods 6-8 is to employ observational or free-running (non-initialized) model data as libraries of states, through which forecasts can be made by identifying states in the library that closely resemble the observed data at forecast initialization, and following the evolution of the quantity of interest on these so-called analogs to yield a prediction. Advantages of analog forecasting include avoiding the need for an expensive, and potentially unstable, initialization system, as well as reducing structural model error if natural analogs are employed. For phenomena such as ENSO exhibiting a reasonably small number of effective degrees of freedom, the performance of analog techniques has been shown to be comparable to, and sometimes to exceed, the skill of initialized forecasts by coupled general circulation models (CGCMs) 7,8 . Meanwhile, NN models build a representation of the predictand as a nonlinear function of the predictor variables by optimization within a high-dimensional parametric class of functions. A recent study 9 utilizing convolutional NNs trained on GCM output and reanalysis data has demonstrated forecast skill for the 3-month averaged Niño 3.4 index extending out to 17 months.
The KAF approach 12,13 employed in this work can be viewed as a generalization of conventional analog forecasting, utilizing nonlinear-kernel algorithms for statistical learning 14 and operator-theoretic ergodic theory 15,16 to optimally capture the evolution of a response variable (here, an ENSO index) under partially observed nonlinear dynamics. In particular, unlike conventional approaches, the output of KAF is not given by a local average over analog states (or linear combinations of states in the case of constructed analogs 6 ), but is rather determined by a projection operation onto a function space of observables of high regularity, learned from high-dimensional training data. This hypothesis space has as a basis a set of eigenvectors of a judiciously constructed kernel matrix, which depends nonlinearly on the input data, thus potentially capturing richer patterns of variability than the linear principal components from EOF analysis. It should be noted, in particular, that the eigenvectors employed in KAF are not expressible as linear projections of the data onto corresponding EOFs. By virtue of this structure, the forecast function from KAF can be shown to converge in an asymptotic limit of large data to the conditional expectation of the response function, acted upon by the Koopman evolution operator of the dynamical system over the desired lead time, conditioned on the input (predictor) data at forecast initialization 13 . As is well known from statistics, the conditional expectation is optimal in the sense of minimal expected square error (L 2 error) among all forecasts utilizing the same predictors.
Here, we build KAF models using lagged histories of Indo-Pacific SST fields as input data.
Specifically, we employ the class of kernels introduced in the context of nonlinear Laplacian spectral analysis (NLSA) 17,18 algorithms, whose eigenfunctions provably capture modes of high dynamical coherence 19,20 , including representations of ENSO, ENSO combination modes, and Pacific decadal variability 21-23 . This capability is realized without ad hoc prefiltering of the input data through the use of delay-coordinate maps 24 and a nonlinear (normalized Gaussian) kernel function 25 measuring similarity between delay-embedded sequences of SST fields. A key advantage of the methodology employed here over LIMs or other parametric models is its nonparametric nature, in the sense that KAF does not impose explicit assumptions on the dynamical model structure (e.g., linearity), thus avoiding systematic model errors that oftentimes preclude parametric models from attaining useful skill at long lead times. Moreover, by employing delays, KAF enhances the information content of the predictor data beyond the information present in individual SST snapshots, allowing it to capture important contributors to ENSO predictability, such as upper-ocean heat content and surface winds 3,26 .
A distinguishing aspect of the approach presented here over recent analog 7,8 and NN 9 methods utilizing large, GCM-derived, multivariate datasets for training is that our observational forecasts are trained only on industrial-era observational/reanalysis SST fields. Indeed, due to the large number of parameters involved, NN methods cannot be adequately trained from industrial-era data alone 9 , and are trained instead using thousands of years of model output. In contrast, KAF is an intrinsically nonparametric method, involving only a handful of meta-parameters (such as embedding window and hypothesis space dimension), and thus able to yield robust ENSO forecasts from a comparatively modest training dataset spanning ∼ 120 years. It should also be noted that KAF exhibits theoretical convergence and optimality guarantees 13 which, to our knowledge, are not available for other comparable statistical ENSO forecasting methodologies.

Results
We present prediction results obtained via KAF and LIMs in a suite of prediction experiments for (i) the Niño 3.4 index in industrial-era HadISST data 27 ; (ii) the Niño 3.4 index in a 1300-yr control integration of CCSM4 28 , a CGCM known to exhibit fairly realistic ENSO dynamics and Pacific decadal variability 29 ; and (iii) the probability of occurrence of El Niño/La Niña events in both HadISST and CCSM4. All experiments use SST fields from the respective datasets in a fixed Indo-Pacific domain as training data. The Niño 3.4 prediction skill is assessed using root-mean-square error (RMSE) and pattern correlation (PC) scores. We employ PC = 0.6 and 0.5 as representative thresholds separating useful from non-useful predictions. The probabilistic El Niño/La Niña experiments are assessed using the Brier skill score (BSS) 30,31 , as well as information-theoretic relative entropy metrics 32-34 . For consistency with an operational forecasting environment, all experiments employ disjoint, temporally consecutive portions of the available data for training and verification, respectively. In the CCSM4 experiments, Niño 3.4 indices are computed relative to the monthly climatology of the training data; in HadISST we use anomalies relative to the 1981-2010 monthly climatology, as per current convention in many modeling centers 35 . Turning to the HadISST results in Figure 1: in 2008-2017, the PC skill of KAF exhibits a more uniform rate of decay, crossing the PC = 0.5 threshold at a lead time of 9 months, compared to 6-7 months for LIM. When measured over the full, 20-year verification period, KAF provides 3 to 4 months of additional useful skill, as well as a markedly slower rate of skill reduction beyond 4 months.
As a comparison with state-of-the-art dynamical forecasts, the skill scores in Figure 1 are generally comparable to, and in some cases exceed, recently reported scores from the North American Multimodel Ensemble 36 (though it should be kept in mind that the verification periods of that study and ours differ).
In separate calculations, we have verified that incorporating delays in the kernel is a significant contributor to the skill of KAF, particularly at lead times greater than 2 months. This is consistent with previous work, which has shown that delay embedding increases the capability of kernel algorithms to extract dynamically intrinsic coherent patterns in the spectrum of the Koopman operator 19,20 , including ENSO and its linkages with seasonal and decadal variability of the climate system 21-23 . Given that our main focus here is on the extended-range regime, Figure 1 depicts KAF results for a single embedding window of 12 months; panels (c-f) of that figure show running forecasts for representative lead times in the range 0-9 months, with black lines indicating the true signal at the corresponding verification times. Operationally, however, one would employ different embedding windows at different leads to optimize performance.
Next, in order to assess the potential skill of KAF and LIM approaches in an environment with more data available for training and verification (contributing to more robust model construction and skill assessment, respectively) than HadISST, we examine Niño 3.4 prediction results for the CCSM4 dataset, using the first 1100 years for training and the last 200 years for verification. In general, KAF appears to benefit from the larger CCSM4 dataset more than LIM, which is consistent with the former method better leveraging the information content of large datasets due to its nonparametric nature and theoretical ability to consistently approximate an optimal prediction function (the conditional expectation). In contrast, LIM's performance is limited to some extent by structural model error due to linear dynamics, and this type of error may not be possible to eliminate even with arbitrarily large training datasets.

Another important consideration in ENSO prediction is the seasonal dependence of skill, exhibiting the so-called "spring barrier" 37 . Figure 3 shows the month-to-month distribution of the KAF and LIM Niño 3.4 PC scores, computed for 6- and 9-month leads in HadISST and 0-24-month leads in CCSM4, using the same training and verification datasets as in Figures 1 and 2, respectively. These plots feature characteristic spring barrier curves, with the highest predictability occurring in late winter to early spring and a clear drop of skill in summer 3,4 . The diminished summer predictability is thought to be caused by the low amplitude of SST anomalies developing then, making ENSO dynamics more sensitive to high-frequency atmospheric noise 3 . As we have shown in previous work 22 , the class of kernels employed here for forecasting is adept at capturing the effects of atmospheric noise and its nonlinear impact on ENSO statistics, including the positive SST anomaly skewness underlying El Niño/La Niña asymmetry.
The method has also been found highly effective in capturing the phase locking of ENSO to the seasonal cycle through a hierarchy of combination modes 21,23,38 . Correspondingly, while both methods exhibit a reduction of skill during summer, KAF fares significantly better than the benchmark LIMs in both the HadISST and CCSM4 datasets, as is evident in the HadISST results in Figure 3. In fact, the forecast trajectories in Figure 4 have comparable, and sometimes higher, skill than forecasts from many operational weather prediction systems, despite KAF utilizing far lower numerical/observational resources and temporal resolution.
We now turn to the skill of probabilistic El Niño/La Niña forecasts. A distinguishing aspect of probabilistic prediction over point forecasts (e.g., the results in Figures 1-3) is that it provides more direct information about uncertainty, and as a result more actionable information to decision makers. Probabilistic prediction with either first-principles or statistical models, or ensembles of models, is typically conducted by binning collections of point forecasts, e.g., realized by random draws from distributions of initial conditions and/or model parameters. Here, we employ an alternative approach, which to our knowledge has not been previously pursued, whereby KAF and LIM approaches are used to predict the probability of occurrence of El Niño or La Niña events directly, without generating ensembles of trajectories. Our approach is based on the fact that predicting conditional probability is equivalent to predicting the conditional expectation of a characteristic function indexing the event of interest 13 . As a result, this task can be carried out using the same KAF and LIM approaches described above, replacing the Niño 3.4 index by the appropriate characteristic function as the response variable.
Here, we construct characteristic functions for El Niño and La Niña events in the HadISST and CCSM4 data using the standard criterion requiring that at least five overlapping seasonal (3-month running averaged) Niño 3.4 indices be greater (smaller) than 0.5 (−0.5) • C, respectively 35 .
In this context, natural skill metrics are provided by the BSS 30,31 , as well as relative entropy from information theory 32,33 (see Methods). For binary outcomes, such as occurrence vs. non-occurrence of El Niño or La Niña events, the BSS is equivalent to a climatology-normalized RMSE score for the corresponding characteristic function. As information-theoretic metrics, we employ two relative entropy functionals 34 , denoted D and E, which measure respectively the precision relative to climatology and ignorance (lack of accuracy) relative to the truth in a probabilistic forecast.
For a binary outcome, D attains a maximal value D * , corresponding to a maximally precise binary forecast relative to climatology. On the other hand, E is unbounded, but has a natural scale E * corresponding to the ignorance of a probabilistic forecast based on random draws from the climatology; this makes E = E * a natural threshold indicating loss of skill. Overall, a skillful binary probabilistic forecast should simultaneously have BSS ≪ 1, D ≃ D * , and E ≪ E * . Note that we only report values of these scores in the CCSM4 experiments, as we found that the HadISST verification is not sufficiently long for statistically robust computation of relative-entropy scores.
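As a concrete illustration of these two kinds of scores, the sketch below (our own illustration, not code from the paper) computes a climatology-normalized Brier/RMSE-type score for a binary event, following the convention above in which small values indicate skill, together with the binary relative entropy (Bernoulli Kullback-Leibler divergence) that underlies D and E functionals; the function names and exact forms are our assumptions.

```python
import numpy as np

def normalized_brier_score(p_forecast, outcomes, p_clim):
    # outcomes: 0/1 event indicators; p_clim: climatological event probability.
    # Climatology-normalized RMSE of the forecast probabilities: 0 is perfect,
    # values near 1 indicate no skill beyond climatology (BSS << 1 convention).
    rmse = np.sqrt(np.mean((p_forecast - outcomes) ** 2))
    rmse_clim = np.sqrt(np.mean((p_clim - outcomes) ** 2))
    return rmse / rmse_clim

def binary_relative_entropy(p, q, eps=1e-12):
    # KL divergence D(Bernoulli(p) || Bernoulli(q)), in nats; nonnegative and
    # zero only when p = q. Clipping guards against log(0).
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
```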
The results of probabilistic El Niño/La Niña prediction for CCSM4 and HadISST are shown in Figures 5 and 6, respectively. As is evident in Figure 5, KAF yields considerably more skillful probabilistic forecasts than the benchmark LIM.

In conclusion, this work has demonstrated the efficacy of KAF in statistical ENSO prediction, with a robust improvement in useful prediction horizon in observational data by 3-7 months over LIM approaches, and three appealing characteristics, namely more skillful forecasts at long lead times with slower reduction of skill, a reduced spring predictability barrier, and improved prediction of event onset. Moreover, in the setting of model data with larger sample sizes, the enhanced performance of KAF becomes more pronounced, with skill extending out to 24-month leads. Aside from the higher skill in predicting ENSO indices, the method is also considerably more skillful in probabilistic El Niño/La Niña prediction. We attribute these improvements to the nonparametric nature of KAF, which avoids structural errors associated with assumed linear dynamics while consistently approximating an optimal prediction function (the conditional expectation) from the training data.

Kernel analog forecasting

Kernel analog forecasting (KAF) 12,13 is a kernel-based nonparametric forecasting technique for partially observed, nonlinear dynamical systems. Specifically, it addresses the problem of predicting the value of a time series y(t + τ ) ∈ R, where t and τ are the forecast initialization and lead times, respectively, given a vector x(t) ∈ R m of predictor variables observed at forecast initialization, under the assumption that y(t + τ ) and x(t) are generated by an underlying dynamical system. To make a prediction of y(t + τ ) at lead time τ = q ∆t, where q is a non-negative integer and ∆t > 0 a fixed interval, it is assumed that a time-ordered dataset consisting of pairs (x n , y n+q ), with x n = x(t n ), y n = y(t n ), t n = n ∆t, and n ∈ {0, . . . , N − 1}, is available for training. In the examples studied here, y(t) is the Niño 3.4 index, x(t) is a lagged sequence of Indo-Pacific SST snapshots (to be defined in Eq. (5) below), ∆t is equal to 1 month, and N is the number of samples in the training data.
For every such lead time τ , KAF constructs a forecast function f τ : R m → R, such that f τ (x(t)) approximates y(t + τ ). To that end, it treats x(t) and y(t) as the values of observables (functions of the state) along a trajectory of an abstract dynamical system (here, the Earth's climate system), operating on a hidden state space Ω.
On Ω, the dynamics is characterized by a flow map Φ t : Ω → Ω, such that Φ t (ω) corresponds to the state reached after dynamical evolution by time t, starting from a state ω ∈ Ω. Moreover, there exist functions X : Ω → R m and Y : Ω → R such that, along a dynamical trajectory initialized at an arbitrary state ω ∈ Ω, we have x(t) = X(Φ t (ω)) and y(t) = Y (Φ t (ω)). That is, in this picture, x(t) and y(t) are realizations of random variables, X and Y , referred to as predictor and response variables, respectively, defined on the state space Ω.
It is a remarkable fact, first realized in the seminal work of Koopman 41 and von Neumann 42 in the 1930s, that the action of a general nonlinear dynamical system on observables such as X and Y can be described in terms of intrinsically linear operators, called Koopman operators. In particular, taking F = {f : Ω → R} to be the vector space of all real-valued observables on Ω (note that Y lies in F ), the Koopman operator U τ at time τ is defined as the linear operator mapping f ∈ F to g = U τ f ∈ F , such that g(ω) = f (Φ τ (ω)). From this perspective, the problem of forecasting y(t + τ ) amounts to evaluating (U τ Y )(Φ t (ω)), where ω ∈ Ω is the state initializing the observed dynamical trajectory. It should be kept in mind that while F is by construction a linear space, on which U τ acts linearly without approximation, the elements of F may not (and in general, will not) be linear functions satisfying a relationship such as f (ω + c ω ′ ) = f (ω) + c f (ω ′ ) for ω, ω ′ ∈ Ω and c ∈ R. In fact, in many applications Ω is not a linear set (e.g., it could be a nonlinear manifold), in which case no elements of F are linear functions.
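The linearity of the Koopman operator on observables, despite nonlinear underlying dynamics, can be checked directly in a toy example (our illustration; the quadratic map stands in for an arbitrary nonlinear flow):

```python
import numpy as np

def flow(omega):
    # A nonlinear one-step flow map Phi on the state space Omega = R.
    return omega ** 2 - 1.0

def koopman(f):
    # The Koopman operator: (U f)(omega) = f(Phi(omega)).
    return lambda omega: f(flow(omega))

# Linearity of U on observables holds exactly, with no approximation,
# even though both the dynamics and the observables are nonlinear.
c, omega = 2.5, 0.7
lhs = koopman(lambda w: np.sin(w) + c * np.cos(w))(omega)
rhs = koopman(np.sin)(omega) + c * koopman(np.cos)(omega)
assert abs(lhs - rhs) < 1e-12
```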
In the setting of forecasting with initial data determined through X, a practically useful approximation of U τ Y must invariably be through a function F τ ∈ F that can be evaluated given values of X alone. That is, we must have F τ (Φ t (ω)) = f τ (x(t)), where f τ : R m → R is a realvalued function on data space, referred to above as the forecast function, and x(t) = X(Φ t (ω)).
In real-world applications, including the ENSO forecasting problem studied here, the observed data x(t) are generally not sufficiently rich to uniquely infer the underlying dynamical state on Ω (i.e., X is a non-invertible map). In that case, any forecast function F τ will generally exhibit a form of irreducible error. The goal then becomes to construct F τ optimally given the training data (x n , y n+q ) so as to incur minimal error with respect to a suitable metric. Effectively, this is a learning problem for a function in an infinite-dimensional function space, which KAF addresses using kernel methods from statistical learning theory 14 .
Following the basic tenets of statistical learning theory, and in particular kernel principal component regression, KAF searches for an optimal f τ in a finite-dimensional hypothesis space H L , of dimension L, which is a subspace of a reproducing kernel Hilbert space (RKHS), H, of real-valued functions on data space R m . For our purposes, the key property that the RKHS structure of H affords is that a set of orthonormal basis functions ψ 0 , . . . , ψ L−1 for H L can be constructed from the observed data x 0 , . . . , x N −1 , such that ψ l (x) can be evaluated at any point x ∈ R m , not necessarily lying in the set of training points x n . In particular, ψ l (x(t)) can be evaluated for the data x(t) observed at forecast initialization. As with every RKHS, H is uniquely characterized by a symmetric, positive-definite kernel function k : R m × R m → R, which can be intuitively thought of as a measure of similarity, or correlation, between data points. As a simple example, EOF analysis is based on a covariance kernel, k(x, x ′ ) = x ⊤ x ′ , also known as a linear kernel; note, however, that the kernel k used in our experiments is a Markov-normalized, nonlinear Gaussian kernel (whose construction is described below).
Associated with any kernel function k and the training data x_0, . . . , x_{N−1} is an N × N symmetric, positive-semidefinite kernel matrix K with elements K_ij = k(x_{i−1}, x_{j−1}), and a corresponding orthonormal basis of R^N consisting of eigenvectors φ_1, . . . , φ_N, satisfying

K φ_l = λ_l φ_l.

Note that in the case of the covariance kernel, the φ_l corresponding to nonzero eigenvalues become principal components, given by linear projections of the data onto the corresponding EOFs; that is,

φ_l = X^⊤ e_l / λ_l^{1/2},    (2)

where X is the m × N data matrix whose n-th column is equal to x_{n−1}, and e_l ∈ R^m is a unit-norm eigenvector (EOF) of the m × m spatial covariance matrix C = X X^⊤, corresponding to the same eigenvalue λ_l,

C e_l = λ_l e_l.    (3)

It should also be noted that Eq. (2) is special to eigenvectors associated with covariance kernels, and in particular does not hold for the nonlinear Gaussian kernels employed in this work.
By convention, we order the eigenvalues λ_l in decreasing order. Assuming then that λ_L is strictly positive, the basis functions ψ_l : R^m → R of the hypothesis space H_L are given by the Nyström extension of the eigenvectors φ_l,

ψ_l(x) = (1/λ_l) Σ_{n=1}^{N} k(x, x_{n−1}) φ_{l,n},

where φ_{l,n} denotes the n-th component of φ_l; on the training data, ψ_l(x_{n−1}) = φ_{l,n}. Given these basis functions, the KAF forecast function f_τ at lead time τ = q ∆t is expressed as the linear combination

f_τ(x) = Σ_l c_l(τ) ψ_l(x),

where the sum runs over the L leading eigenfunctions, and the expansion coefficients c_l(τ) are determined by regression of the time-shifted response values y(t_n + τ) = y_{n+q} against the eigenvectors φ_l, viz.

c_l(τ) = Σ_{n=1}^{N} y_{n−1+q} φ_{l,n}.

One can verify that with this choice of expansion coefficients c_l(τ), f_τ minimizes the empirical risk

R(f) = (1/N) Σ_{n=0}^{N−1} (f(x_n) − y_{n+q})²

over functions f in the hypothesis space H_L. A key aspect of the KAF forecast function is its asymptotic behavior as the number of training data N and the dimension of the hypothesis space L grow. In particular, it can be shown 13 that if the dynamics on Ω has an ergodic invariant probability measure (i.e., there is a well-defined notion of climatology, which can be sampled from long dynamical trajectories), and the eigenvalues of the kernel matrix K are all strictly positive for every N, then under mild regularity assumptions on X, Y, and the kernel k (related to continuity and finite variance), F τ converges in a joint limit of N → ∞ followed by L → ∞ to the conditional expectation E(U τ Y | X) of the response observable U τ Y evolved under the Koopman operator for the desired lead time, conditioned on the data available at forecast initialization through X. In particular, it is a consequence of standard results from probability theory and statistics that E(U τ Y | X) is optimal in the sense of minimal RMSE among all finite-variance approximations of U τ Y that depend on the values of X alone. Note that no linearity assumption on the dynamics was made in order to obtain this result.
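The construction above can be summarized in a short numerical sketch. This is our minimal illustration, not the authors' code: a plain Gaussian kernel stands in for the Markov-normalized NLSA kernel, eigenvectors are taken with unit Euclidean norm, and the out-of-sample evaluation uses the standard Nyström formula.

```python
import numpy as np

def kaf_fit(X_train, y_shifted, L, eps=1.0):
    # X_train: (N, m) predictors x_n; y_shifted: (N,) response values y_{n+q}.
    D2 = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)                          # N x N kernel matrix
    lam, phi = np.linalg.eigh(K)                   # ascending eigenvalues
    lam, phi = lam[::-1][:L], phi[:, ::-1][:, :L]  # leading L eigenpairs
    c = phi.T @ y_shifted                          # coefficients c_l(tau)
    return X_train, lam, phi, c, eps

def kaf_predict(model, x):
    X_train, lam, phi, c, eps = model
    k_x = np.exp(-((X_train - x) ** 2).sum(-1) / eps)  # k(x, x_n) for all n
    psi = (k_x @ phi) / lam                        # Nystrom extension psi_l(x)
    return psi @ c                                 # f_tau(x) = sum_l c_l psi_l
```

On the training points the forecast reduces to the projection of the shifted response onto the leading L eigenvectors; with L = N it reproduces the training responses exactly.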
It should also be noted that while, as with many statistical forecasting techniques, stating analogous asymptotic convergence or optimality results in the absence of measure-preserving, ergodic dynamics is a challenging problem, the KAF formulation described above remains well-posed in quite general settings, including the long-term climate change trends present in the observational SST data studied in this work.

Choice of predictors and kernel

For the covariance (linear) kernel, the kernel matrix K has rank at most m, so at most m of its eigenvalues are nonzero. This means that one cannot increase the hypothesis space dimension L beyond m, even if plentiful data is available, i.e., N ≫ m. In response, following earlier work 17,18,43 , to construct k we start from a kernel of the form k̃(x, x ′ ) = h(d(x, x ′ )), where h : R → R is a nonlinear shape function, set here to a Gaussian, h(u) = e^{−u²/ǫ} with ǫ > 0, and d : R m × R m → R a distance-like function on data space. We set d to an anisotropic modification of the Euclidean distance, ‖x − x ′ ‖, shown to improve skill in capturing slow dynamical timescales 43 .
The kernel k̃ is then Markov-normalized to obtain k using the normalization procedure introduced in the diffusion maps algorithm 25 . With this construction, all eigenvalues of K are positive if the bandwidth parameter ǫ is small enough.
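A minimal sketch of one common variant of this diffusion-maps normalization (our illustration; the paper's construction may differ in details such as the density-renormalization exponent):

```python
import numpy as np

def markov_normalize(K_tilde, alpha=1.0):
    # Divide out an estimate of the sampling density (exponent alpha), then
    # row-normalize so that the resulting matrix is Markov (row-stochastic).
    q = K_tilde.sum(axis=1)
    K_alpha = K_tilde / np.outer(q, q) ** alpha
    return K_alpha / K_alpha.sum(axis=1, keepdims=True)
```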
What remains is to specify the predictor function X. For that, we follow the popular approach employed, among many techniques, in singular spectrum analysis (SSA) 44 , extended EOF analysis, and NLSA 17,18 , which involves augmenting the dimension of data space using the method of delay-coordinate maps 24 . Specifically, using S : Ω → R^p to denote the function on state space such that S(ω) is equal to the values of the SST field sampled at the p gridpoints on the Indo-Pacific domain corresponding to climate state ω, and writing

s(t) = S(Φ^t(ω)),    (4)

we set

X(ω) = (S(ω), S(Φ^{−∆t}(ω)), . . . , S(Φ^{−(Q−1)∆t}(ω))),    (5)

where Q is the number of delays, and m = pQ. It is well known from dynamical systems theory 24 that for sufficiently large Q, X(ω) generically becomes in one-to-one correspondence with ω, meaning that as Q grows, X(ω) becomes a more informative predictor than S(ω). Elsewhere 19,20 , it was shown that as Q increases, the eigenvectors of the corresponding kernel matrix K, and thus the hypothesis spaces H_L, become increasingly adept at capturing intrinsic dynamical timescales associated with the spectrum of the Koopman operator of the dynamical system. In particular, it has been shown 21-23 that for interannual embedding windows, Q ≳ 24, the leading eigenvectors of K become highly efficient at capturing coherent modes of variability associated with ENSO evolution, as well as its linkages with the seasonal cycle and decadal variability. Due to this property, the kernels employed in this work are expected to be useful for ENSO prediction, since the associated eigenspaces can capture nonlinear functions of the input data, and within those eigenspaces, meaningful representations of ENSO dynamics are possible using a modest number of eigenfunctions.
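The delay-coordinate construction of Eq. (5) amounts to concatenating each snapshot with its Q − 1 predecessors; a sketch (ours, with illustrative array conventions):

```python
import numpy as np

def delay_embed(s, Q):
    # s: (T, p) array of snapshots S at times t_0, ..., t_{T-1}.
    # Returns (T - Q + 1, p*Q): each row stacks the snapshot at a given time
    # followed by its Q - 1 predecessors, as in Eq. (5).
    T, p = s.shape
    X = np.empty((T - Q + 1, p * Q))
    for i in range(Q - 1, T):
        X[i - Q + 1] = s[i - Q + 1 : i + 1][::-1].ravel()
    return X
```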
Linear inverse models Under the classical LIM ansatz 1 , the dynamics of ENSO can be well modeled as a linear system driven by stochastic external forcing represented as temporally Gaussian white noise. The method used here to conduct LIM experiments follows closely the procedure described by Penland and Sardeshmukh 2 . Specifically, the dynamics is governed by a stochastic differential equation

ψ̇(t) = B ψ(t) + W(t),    (6)

where ψ(t) ∈ R^L is the LIM state vector at time t, B is an L × L matrix representing a stable dynamical operator, and W(t) is a Gaussian white noise process.
Let s(t) be an SST vector from Eq. (4), observed at forecast initialization time t, and s ′ (t) = s(t) − s̄(t) be the corresponding anomaly vector relative to the monthly climatology s̄(t) at time t (computed as described above for the HadISST and CCSM4 datasets). Then, the corresponding LIM state vector ψ(t) is given by projection onto the leading L EOFs e_l from Eq. (3), computed for the covariance matrix C based on the anomalies s ′ (t_n) in the training data, i.e., ψ_l(t) = e_l^⊤ s ′ (t). The RMSE-optimal estimate ψ̂(t + τ) for ψ(t + τ) at an arbitrary lead time τ under the assumed dynamics in Eq. (6) is then given by the solution of the deterministic part of Eq. (6) with initial data ψ(t), viz.

ψ̂(t + τ) = G(τ) ψ(t),
where G(τ) = exp(Bτ). Given this estimate, the predicted Niño 3.4 index at time t + τ is given by

ŷ(t + τ) = Σ_{l=1}^{L} z_l ψ̂_l(t + τ),    (7)

where ψ̂_l(t + τ) is the l-th component of ψ̂(t + τ), and z_l is the regression coefficient of the Niño 3.4 index y(t_n) against the l-th SST principal component time series ψ_l(t_n) in the training phase,

z_l = Σ_n y(t_n) ψ_l(t_n) / Σ_n ψ_l(t_n)².    (8)

Note that because the Niño 3.4 index y(t) is a linear function of the Indo-Pacific SST anomalies s ′ (t), forecasts via Eq. (7) are equivalent to (but numerically cheaper than) first computing LIM forecasts ŝ ′ (t + τ) of the full Indo-Pacific SST anomalies s ′ (t + τ), and then deriving from these forecasts a predicted Niño 3.4 index. Specifically, let J denote the set of indices j of the components s ′_j(t) of s ′ (t) lying within the Niño 3.4 region, and |J| the number of elements of J. Then, we have

y(t) = (1/|J|) Σ_{j∈J} s ′_j(t),

so that

z_l = (1/|J|) Σ_{j∈J} e_{jl},

where e_{jl} is the j-th component of EOF e_l. Noticing that the j-th component of the LIM forecast ŝ ′ (t + τ) is given by

ŝ ′_j(t + τ) = Σ_{l=1}^{L} e_{jl} ψ̂_l(t + τ),

it then follows that

ŷ(t + τ) = (1/|J|) Σ_{j∈J} ŝ ′_j(t + τ),    (9)

from which we deduce that predicting using Eq. (7) is equivalent to Eq. (9).
In practice, G(τ_0) is first estimated at some lag τ_0 = q_0 ∆t, q_0 ∈ N, from the training SST data s(t_n) sampled at the times t_n from (1), via the empirical lag-covariance matrices,

$$ G(\tau_0) = C(\tau_0)\, C(0)^{-1}, \qquad C(q\,\Delta t) = \frac{1}{N - q} \sum_{n=0}^{N - q - 1} \psi(t_{n+q})\, \psi(t_n)^\top, \qquad (10) $$

and then G(τ) is computed at the desired lead time τ by

$$ G(\tau) = G(\tau_0)^{\tau/\tau_0}. \qquad (11) $$

All LIM-based forecasts reported in this paper were obtained via Eq. (7) with G(τ) from Eq. (11).
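As a concrete illustration, the LIM training and forecast steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function and variable names are illustrative, the lag q_0 is assumed to be an integer number of samples, and the lead time is taken to be an integer multiple of τ_0 so that the matrix power in Eq. (11) is exact.

```python
import numpy as np

def fit_lim_propagator(psi, q0):
    """Estimate G(tau_0) = C(tau_0) C(0)^{-1} (Eq. (10)) from a
    principal-component time series psi of shape (N, L), using
    empirical lag-covariance matrices over the training period."""
    N = psi.shape[0]
    X0, Xq = psi[: N - q0], psi[q0:]
    C0 = X0.T @ X0 / (N - q0)   # C(0)
    Cq = Xq.T @ X0 / (N - q0)   # C(tau_0)
    return Cq @ np.linalg.inv(C0)

def lim_forecast_index(G0, psi_t, z, q, q0):
    """Propagate the state q steps ahead via G(tau) = G(tau_0)^(q/q0)
    (Eq. (11)), then regress onto the index with coefficients z (Eq. (7))."""
    G = np.linalg.matrix_power(G0, q // q0)
    return z @ (G @ psi_t)
```

On synthetic data generated by a stable linear map plus white noise, `fit_lim_propagator` recovers the generating matrix up to sampling error, which is a quick sanity check of the estimator.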
Note that under analogous ergodicity assumptions to those employed in KAF, the empirical time averages in Eq. (10) converge to climatological ensemble averages. It should also be noted that, among other approximations, the model structure in Eq. (6) assumes that the dynamics is seasonally independent.
Validation

The main tunable parameters of the KAF method employed here are the number of delays Q, the Gaussian kernel bandwidth ε, and the hypothesis space dimension L. Throughout, we use the values Q = 12, ε = 1, and L = 400. As stated above, these values were determined by hold-out validation, i.e., by varying the parameters in search of optimal prediction skill on validation datasets. This search was not exhaustive, as we found only mild dependence of forecast skill under modest parameter changes around these nominal values.
The LIMs in this work have two tunable parameters, namely τ 0 in Eq. (11) and the number L of principal components employed. We set τ 0 = 2 months and L = 20, using the same hold-out validation procedure as in KAF.
Probabilistic forecasting

Our approach to probabilistic El Niño/La Niña forecasting is based on the standard result from probability theory that the conditional probability of an event is equal to the conditional expectation of its characteristic function. Specifically, let, as above, Y : Ω → R be the real-valued function on the state space Ω such that Y(ω) is equal to the Niño 3.4 index corresponding to climate state ω ∈ Ω. Let also Ȳ : Ω → R be the 3-month running-averaged Niño 3.4 index, i.e.,

$$ \bar Y(\omega) = \frac{1}{3} \sum_{j=-1}^{1} Y\big( \Phi^{j\,\Delta t}(\omega) \big). $$

Here, we follow a standard definition of El Niño events 35 , which declares ω to be an El Niño state if Ȳ > 0.5 °C for a period of five consecutive months about ω. This leads to the characteristic function χ_+ : Ω → R, such that χ_+(ω) = 1 if there exists a set J of five consecutive integers in the range [−4, 4] such that Ȳ(Φ^{j∆t}(ω)) > 0.5 for all j ∈ J, and χ_+(ω) = 0 otherwise. Similarly, we define a characteristic function χ_− for La Niña events, requiring that Ȳ < −0.5 °C over a five-month period about ω. With these definitions, the conditional probabilities P_{±,τ}(x(t)) for El Niño/La Niña, respectively, to occur at lead time τ, given the predictor vector x(t) at forecast initialization time t, are equal to the conditional expectations E(U^τ χ_± | X = x(t)) of χ_± acted upon by the time-τ Koopman operator U^τ. In particular, because the values of χ_± are available to us over the training period, we can estimate P_{±,τ}(x(t)) using the KAF and LIM methodologies analogously to the Niño 3.4 predictions described above. All probabilistic forecast results reported in this paper were obtained in this manner.
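The event definition above can be sketched as follows. This is an illustrative reading of the definition, not code from the paper: the monthly index series stands in for Y sampled along a trajectory, the running mean is centered, and the five-month run is searched within the nine months centered on each sample, matching the range [−4, 4] in the text.

```python
import numpy as np

def el_nino_indicator(nino34, threshold=0.5, run=5):
    """Binary indicator chi_+ from a monthly Nino 3.4 series:
    chi_+[n] = 1 if the 3-month running mean exceeds `threshold` (deg C)
    for `run` consecutive months within the 9 months centered on n."""
    ybar = np.convolve(nino34, np.ones(3) / 3, mode="same")  # 3-month running mean
    exceed = ybar > threshold
    chi = np.zeros(len(ybar), dtype=int)
    for n in range(len(ybar)):
        w = exceed[max(0, n - 4): n + 5]
        chi[n] = int(any(w[i: i + run].all()
                         for i in range(len(w) - run + 1)))
    return chi
```

A La Niña indicator follows by replacing the exceedance test with `ybar < -threshold`.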
Forecast skill quantification

Let ω̃_n = Φ^{n∆t}(ω̃_0), with ω̃_0 ∈ Ω and n ∈ {0, 1, …, Ñ − 1}, be the climate states in Ω over a verification period consisting of Ñ samples, and x̃_n = X(ω̃_n) and ỹ_n = Y(ω̃_n) be the corresponding values of the predictor and response (Niño 3.4) functions. We assess the forecast skill of the prediction function f_τ for U^τ Y at lead time τ = q ∆t using the root-mean-square error (RMSE) and Pearson correlation (PC; also known as pattern correlation) scores, defined as

$$ \mathrm{RMSE}(\tau) = \bigg( \frac{1}{\tilde N} \sum_{n=0}^{\tilde N - 1} \big( f_\tau(\tilde x_n) - \tilde y_{n+q} \big)^2 \bigg)^{1/2}, \qquad \mathrm{PC}(\tau) = \frac{1}{\tilde N} \sum_{n=0}^{\tilde N - 1} \frac{\big( f_\tau(\tilde x_n) - \hat\mu_\tau \big)\big( \tilde y_{n+q} - \mu_\tau \big)}{\hat\sigma_\tau\, \sigma_\tau}, $$

respectively. Here, μ_τ (μ̂_τ) and σ_τ (σ̂_τ) are the empirical means and standard deviations of ỹ_{n+q} (f_τ(x̃_n)), respectively.
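The two deterministic scores are straightforward to compute; a minimal sketch (function names are illustrative, and the standard deviations are population estimates, matching the normalization by Ñ):

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root-mean-square error between forecasts and verification values."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def pc_score(y_true, y_pred):
    """Pearson (pattern) correlation: mean product of standardized series."""
    a = (y_pred - np.mean(y_pred)) / np.std(y_pred)
    b = (y_true - np.mean(y_true)) / np.std(y_true)
    return np.mean(a * b)
```

Note that PC is invariant under affine rescaling of the forecasts, whereas RMSE is not, which is why the two scores are reported jointly.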
In the case of the probabilistic El Niño/La Niña forecasts, we additionally employ the BSS and relative-entropy scores, which are known to provide natural metrics for assessing the skill of statistical forecasts [31][32][33][34] . In what follows, we outline the construction of these scores in the special case of binary response functions, such as the characteristic functions χ_± for El Niño/La Niña events, taking values in the set {0, 1}. First, note that every probability measure on {0, 1} can be characterized by a single real number π ∈ [0, 1] such that the probability to obtain 0 and 1 is given by π and 1 − π, respectively. Given two such measures characterized by π, ρ ∈ [0, 1], we define the quadratic distance function

$$ D(\pi, \rho) = (\pi - \rho)^2. $$

Moreover, under the condition that ρ = 0 only if π = 0, we define the relative entropy (Kullback–Leibler divergence)

$$ D_{\mathrm{KL}}(\pi \,\|\, \rho) = \pi \log_2 \frac{\pi}{\rho} + (1 - \pi) \log_2 \frac{1 - \pi}{1 - \rho}, $$

where, by convention, we set π log_2(π/ρ) or (1 − π) log_2[(1 − π)/(1 − ρ)] equal to zero whenever ρ or 1 − ρ is equal to zero, respectively. This quantity has the properties of being non-negative, and vanishing if and only if π = ρ. Thus, D_KL can be thought of as a distance-like function on probability measures, though note that it is non-symmetric (i.e., in general, D_KL(π ∥ ρ) ≠ D_KL(ρ ∥ π)), and does not obey the triangle inequality. Intuitively, D_KL(π ∥ ρ) can be interpreted as measuring the precision (additional information content) of the probability measure characterized by π relative to that characterized by ρ, or, equivalently, the ignorance (lack of information) of ρ relative to π.
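The two distance functions for binary distributions can be written directly. This sketch uses the standard convention that terms with a vanishing probability weight contribute zero, and assumes (as in the text) that ρ = 0 only if π = 0, so no division by zero arises:

```python
import numpy as np

def quadratic_distance(pi, rho):
    """Quadratic distance between two binary distributions
    characterized by pi and rho."""
    return (pi - rho) ** 2

def kl_binary(pi, rho):
    """Relative entropy D_KL(pi || rho) in bits for binary distributions,
    with vanishing-weight terms contributing zero."""
    d = 0.0
    if pi > 0.0:
        d += pi * np.log2(pi / rho)
    if pi < 1.0:
        d += (1.0 - pi) * np.log2((1.0 - pi) / (1.0 - rho))
    return d
```

The tests below exercise the stated properties: D_KL vanishes when π = ρ and is non-symmetric in its arguments.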
For a verification state ω̃_n and lead time τ = q ∆t, we consider the scores

$$ B(\tau; \tilde\omega_n) = D\big( P_{+,\tau}(\tilde x_n),\, \chi_+(\tilde\omega_{n+q}) \big), \qquad \bar B(\tau; \tilde\omega_n) = D\big( \bar P_+,\, \chi_+(\tilde\omega_{n+q}) \big), $$
$$ D(\tau; \tilde x_n) = D_{\mathrm{KL}}\big( P_{+,\tau}(\tilde x_n) \,\|\, \bar P_+ \big), \qquad E(\tau; \tilde\omega_n) = D_{\mathrm{KL}}\big( \chi_+(\tilde\omega_{n+q}) \,\|\, P_{+,\tau}(\tilde x_n) \big), $$

where P̄_+ is the climatological probability of El Niño occurrence in the training data. Among these, B(τ;ω̃_n) and B̄(τ;ω̃_n) measure the quadratic error of the probabilistic binary forecasts by P_{+,τ}(x̃_n) and the climatological distribution P̄_+, respectively, relative to the true probability χ_+(ω̃_{n+q}). In particular, B(τ;ω̃_n) vanishes if and only if P_{+,τ}(x̃_n) is equal to χ_+(ω̃_{n+q}) (i.e., whenever the forecast model predicts an El Niño event with probability 1, the event actually occurs, and vice versa). Moreover, based on the interpretation of relative entropy stated above, D(τ;x̃_n) measures the additional precision in the forecast distribution relative to climatology, and E(τ;ω̃_n) the ignorance of the forecast distribution relative to the truth. In our assessment of a forecasting framework such as KAF and LIM, we consider time-averaged, aggregate scores over the verification dataset, viz.

$$ B(\tau) = \frac{1}{\tilde N} \sum_{n=0}^{\tilde N - 1} B(\tau; \tilde\omega_n), $$

and analogously for B̄(τ), D(τ), and E(τ). Note that B(τ) is equal to the mean square difference between the characteristic function χ_+ and the forecast function P_+ evaluated on the verification dataset. Similarly, B̄(τ) is equal to the mean square difference between χ_+ and the constant function equal to the climatological probability P̄_+.
With these definitions, the Brier skill score is given by

$$ \mathrm{BSS}(\tau) = 1 - \frac{B(\tau)}{\bar B(\tau)}. $$

Note that in this expression we have tacitly assumed that the climatological forecast has nonzero error, i.e., B̄(τ) ≠ 0, which holds true apart from trivial cases. Let now P_{+*} be defined such that it is equal to 1 if P̄_+ ≤ 0.5, and 0 if P̄_+ > 0.5. Note that the binary probability distribution represented by P_{+*} places all probability mass on the outcome in {0, 1} that is least probable with respect to the climatological distribution, which in the present context corresponds to an El Niño event. One can verify that the precision score D(τ) is bounded above by the relative entropy D_* = D_KL(P_{+*} ∥ P̄_+). The latter quantity provides a natural scale for D(τ), corresponding to the precision score of a probabilistic forecast that is maximally precise relative to climatology, in the sense of predicting with probability 1 the climatologically least likely outcome. To define a natural scale for E(τ), we consider the time-averaged relative entropy

$$ E_* = \frac{1}{\tilde N} \sum_{n=0}^{\tilde N - 1} D_{\mathrm{KL}}\big( \chi_+(\tilde\omega_n) \,\|\, \bar P_+ \big), $$

i.e., the average ignorance of the climatological distribution relative to the truth. This quantity sets a natural threshold for useful probabilistic El Niño forecasts, in the sense that such forecasts should have E(τ) < E_*. The BSS and relative-entropy scores and thresholds for probabilistic La Niña prediction are derived analogously to their El Niño counterparts using the characteristic function χ_−.
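The aggregate Brier skill score reduces to a short computation over the verification samples. A minimal sketch, with illustrative names; `p_forecast` stands for the sequence P_{+,τ}(x̃_n), `chi_true` for χ_+(ω̃_{n+q}), and `p_clim` for the constant climatological probability P̄_+:

```python
import numpy as np

def brier_skill_score(p_forecast, chi_true, p_clim):
    """BSS(tau) = 1 - B(tau)/Bbar(tau), where B and Bbar are the
    verification-averaged quadratic errors of the forecast probabilities
    and of the constant climatological probability, respectively."""
    chi = np.asarray(chi_true, dtype=float)
    B = np.mean((np.asarray(p_forecast, dtype=float) - chi) ** 2)
    B_clim = np.mean((p_clim - chi) ** 2)
    return 1.0 - B / B_clim
```

By construction, a perfect forecast attains BSS = 1 and the climatological forecast attains BSS = 0, which the tests below verify.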