Principal nonlinear dynamical modes of climate variability

We suggest a new nonlinear expansion of space-distributed observational time series. The expansion allows the construction of principal nonlinear manifolds holding an essential part of the observed variability. It yields low-dimensional hidden time series, interpreted as internal modes driving the observed multivariate dynamics, as well as their mapping to a geographic grid. Bayesian optimality is used for selecting the relevant structure of the nonlinear transformation, including both the number of principal modes and the degree of nonlinearity. Furthermore, the optimal characteristic time scale of the reconstructed modes is also found. The technique is applied to monthly sea surface temperature (SST) time series with a duration of 33 years covering the globe. Three dominant nonlinear modes were extracted from the time series: the first efficiently separates the annual cycle, the second is responsible for ENSO variability, and combinations of the second and third modes explain substantial parts of Pacific and Atlantic dynamics. A relation of the obtained modes to decadal natural climate variability, including the current hiatus in global warming, is demonstrated and discussed.

Figure S2. Fourier spectrum amplitudes of the PCs (left panels) and the NDM hidden time series (right panels), from global SST data. The annual cycle is entirely captured by NDM 1, while it is spread among many EOFs. Gnuplot (Copyright 1986-1993, 1998, 2004 Thomas Williams, Colin Kelley) was used to create the graphs in this figure.

Figure S3. Nino 3.4 index (with subtracted mean annual signal) with its 95% Bayesian confidence interval; blue: the Nino 3.4 index derived from the SST field reconstructed from NDM 2, with its 95% Bayesian confidence interval; black: the original SSTA-based Nino 3.4 data. The two strongest El Nino events are partially captured by NDM 1, the others are not. Gnuplot (Copyright 1986-1993, 1998, 2004 Thomas Williams, Colin Kelley) was used to create the graphs in this figure.

Figure S4. Comparison of the PCs produced by EOF decomposition (normalized to equal standard deviation) and the hidden time series of NDMs (line thickness reflects the 95% Bayesian confidence interval), both for global SST data. The time scales are much better separated by the NDMs. Gnuplot (Copyright 1986-1993, 1998, 2004 Thomas Williams, Colin Kelley) was used to create the graphs in this figure.

Caption of Supplementary Animation
Representation of the NDMs in data space (see the text of the paper). The captured fractions of SST are marked by color at each grid point. Modes 1, 2 and 3 are displayed in the top and bottom-left panels. The bottom-right panel shows the rest of the data, which is not involved in the nonlinear expansion. Gnuplot (Copyright 1986-1993, 1998, 2004 Thomas Williams, Colin Kelley) was used to create this animation.

Nonlinear dynamical modes decomposition
Let the D-dimensional time series X^1, ..., X^N be the result of an observation:

X^n = X(t_n), n = 1, ..., N, (S1)

with equal distances between the time moments t_n enumerated by the index n, and data length N. Note that hereinafter spatially distributed time series are understood as a particular case of multidimensional time series, when the coordinates of the vector X^n are given by enumerating spatial nodes. In order to capture the nonlinear interactions described in the main text, we suggest an iterative procedure of data decomposition via the sum

X^n = Σ_i F_i(a_i, p_i^n). (S2)

Here F_i(a_i, ·) : R → R^D is a time-independent nonlinear mapping parameterized by a_i ∈ R^M, and p_i^n is its argument, which depends on the discrete time n. Each term of the form F(a, p^n) can be physically interpreted as a nonlinear multidimensional (in particular, spatially distributed) response of the system to the hidden scalar signal p^n. The terms F_i(a_i, p_i^n) governed by the scalar signals p_i will be called nonlinear dynamical modes (NDMs); they are found under specific prior assumptions by the method described below.
A geometrical interpretation of an NDM is also useful. Generally, the parameter a of the function F defines a nonlinear curve in R^D. The time series p^1, ..., p^N are then just the coordinates of N points along such a curve. These points are specific projections of the observed data (see Fig. S5 for an illustration).
The iterative scheme of the decomposition is constructed as follows. On the k-th iteration (k = 1, 2, ...) we fit the term F(a, p^n) optimally into the residual

X^n − Σ_{i=1}^{k−1} F_i(ā_i, p̄_i^n).

Here a and p^1, ..., p^N are the fitted parameters. The particular parameterization of the model to fit, as well as the cost function and the criterion of optimality of the fitting, are described in detail below. As a result of the k-th iteration, we obtain the k-th NDM.
It is interesting that the classic empirical orthogonal functions (EOF) decomposition1 can be represented via the sum (S2) with F_i(a_i, p_i^n) = a_i · p_i^n and fitting restricted by the condition of fixed length ‖a_i‖ = 1. Thus, the EOF decomposition can be regarded as the linear case of the decomposition (S2).

Figure S5. Schematic example of the idea of a nonlinear mode (nonlinear projection of the data onto a curve) compared to an EOF (orthogonal projection of the data onto a line).
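As a point of reference, the linear limit of (S2), i.e. the ordinary EOF decomposition, can be sketched in a few lines of Python on synthetic data (a minimal illustration; the data and variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic D-dimensional "observed" time series X^n (rows = time moments)
N, D = 500, 5
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))
X -= X.mean(axis=0)                        # center the data

# EOF decomposition: eigenvectors of the data covariance matrix C_X
C = X.T @ X / N
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # order PCs by decreasing variance
eigvals, V = eigvals[order], V[:, order]

# PCs (hidden scalar signals of the linear modes): Y^n = V^T X^n
Y = X @ V
# The i-th linear mode is F_i(a_i, p_i^n) = a_i * p_i^n with a_i = V[:, i]
mode1 = np.outer(Y[:, 0], V[:, 0])
```

The variance of the i-th PC equals the i-th eigenvalue of C, and different PCs are uncorrelated, which is the orthogonality that the nonlinear expansion deliberately gives up.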

In the general case the key problem is finding an optimal (possibly nonlinear) structure of the parameterization for each mode F(a, ·); it depends on the structure of the dataset and the duration of the observed time series. This problem can be solved within a Bayesian framework.

NDM parameterization
A Bayesian formulation of the optimal fitting problem assumes that we have a set of models which can be fitted to the given data (this procedure will be called "model learning"), and we should choose the most useful (optimal) model from this set. In this section we write explicitly the functional form which defines the set of models on each iteration of the above-described scheme.
Let us denote the residuals given on a particular iteration, again, as the time series of D-dimensional column-vectors X^1, ..., X^N (following the notation at step 3 of the algorithm given in the main text).
First of all, we define the rotation of the dataset X^1, ..., X^N to its principal components (PCs), i.e. the preliminary EOF decomposition:

Y^n = V^T X^n. (S3)

Here V is an orthonormal matrix whose columns are the eigenvectors of the data covariance matrix

C_X = (1/N) Σ_{n=1}^N X^n (X^n)^T. (S4)

The components Y_i^n of the column-vector Y^n are ordered such that their variances, i.e. the eigenvalues of C_X, decrease with i = 1, ..., D. The rotation (S3) is especially appropriate when the initial data is spatially distributed and therefore has many nodes with similar dynamics. It ranks the directions in the D-dimensional space by their significance and allows us to include the EOF-based truncation into the model in a natural way. Now let us define the truncated column-vector Y_d^n:

Y_d^n = (Y_1^n, ..., Y_d^n)^T, (S5)

and introduce the following form of the model:

Y_d^n = Φ(a, p^n) + ξ_d^n, Y_i^n = ξ_i^n for i = d+1, ..., D. (S6)

Here we make the assumption that the d leading EOFs are nonlinearly coupled by the function Φ(a, p^n) governed by the scalar time series p = (p^1, ..., p^N), while the residuals and the other EOFs are modelled by Gaussian delta-correlated processes ξ. In order to define a nonlinear function of p^n in (S6), we use a polynomial parameterization, as it is one of the universal approximators for nonlinear dependencies:

Φ(a, p^n) = Σ_{i=0}^m α_i Π_i(p^n). (S8)

Here the vector a is formed by the components of all α_i ∈ R^d:

a = (α_0^1, ..., α_0^d, ..., α_m^1, ..., α_m^d). (S9)

An infinite basis of polynomials {Π_0, Π_1, ..., Π_m, ...} is fixed in (S8), with Π_i(p) being the Hermite polynomial of degree i normalized to unity (the convenience of this choice of basis will become clear in further sections). The deterministic part of such a model at particular values of the parameters will be considered as the NDM (as follows from (S3) and (S6)):

F(a, p^n) = V_d Φ(a, p^n). (S10)

Here V_d is the matrix whose d columns are the d leading EOFs corresponding to the truncated PCs (S5).
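A minimal sketch of the parameterization (S8)-(S9) in Python, using NumPy's probabilists' Hermite polynomials He_i and the normalization Π_i = He_i/√(i!); the function name and array layout are our illustrative choices:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

def ndm_map(alpha, p):
    """Evaluate Phi(a, p^n) = sum_{i=0}^m alpha_i * Pi_i(p^n), cf. (S8).

    alpha : (m+1, d) array, row i holds the coefficient vector alpha_i in R^d
    p     : (N,) hidden scalar time series
    returns an (N, d) array of modelled truncated PCs Y_d^n
    """
    m = alpha.shape[0] - 1
    # Pi_i = He_i / sqrt(i!): probabilists' Hermite polynomial of degree i,
    # normalized to unit norm under the standard Gaussian weight
    basis = np.stack([He.hermeval(p, np.eye(m + 1)[i]) / sqrt(factorial(i))
                      for i in range(m + 1)])        # shape (m+1, N)
    return basis.T @ alpha                            # shape (N, d)

# Example: Phi(a, p) = (Pi_1(p), 0) = (p, 0) for this choice of coefficients
p = np.linspace(-2.0, 2.0, 7)
alpha = np.zeros((3, 2))
alpha[1, 0] = 1.0
out = ndm_map(alpha, p)
```

With only the degree-1 coefficient of the first component set, the map reduces to the identity in that component, i.e. the linear EOF-like case.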

Bayesian NDM learning
The Bayesian approach to model learning is based on the analysis of the posterior probability distribution P(a, p, σ | X^1, ..., X^N) of the model parameters under the condition that the dataset X^1, ..., X^N was observed. The estimation of this distribution is given by the Bayes theorem:2

P(a, p, σ | X^1, ..., X^N) = P(X^1, ..., X^N | a, p, σ) · P_pr(a, p, σ) / A(X^1, ..., X^N). (S11)

Here the normalization term A(X^1, ..., X^N) is independent of the model parameters. The second term is the likelihood function, which can be written for any parameterization of Φ directly from (S3) and (S6) as a product of Gaussian factors ((S12)-(S13)); there, the operation ‖·‖ denotes the Euclidean norm in R^d. The last term in (S11) is the prior probability distribution P_pr(a, p, σ), which does not depend on the observed data, since it is a property of the model.
The parameters (ā, p̄, σ̄) which maximize (S11) are considered here as the result of model learning. Technically, maximization of (S11) is equivalent to minimization of the following cost function ϕ over (a, p, σ):

ϕ(a, p, σ) = − log [ P(X^1, ..., X^N | a, p, σ) · P_pr(a, p, σ) ] → min. (S14)

To solve the problem (S14), we set the prior distribution factorized over a, p and σ:

P_pr(a, p, σ) = P_pr^a(a) · P_pr^p(p) · P_pr^σ(σ). (S15)

In Bayesian modeling the choice of relevant prior distributions is very important: like the choice of parameterization, it should be adequate to the problem from both the mathematical and the physical points of view.
Prior distribution of σ

The problem (S14) without any prior requirements is extremely well-posed with respect to σ, being a one-dimensional minimization problem in each component. Physically this means that for any form of the mode one can easily calculate the deviation of the data from the mode.
That is why it is sufficient to let P_pr^σ be a non-zero constant on a finite area covering all expected values of σ (S16). Under this condition, the function (S14) has a unique minimum over σ which can be found explicitly at any (a, p); we denote it σ_min(a, p) (S17). Since the definition (S10) of an NDM does not depend on σ, it is convenient to exclude this parameter from the model (S6), defining σ by (S17). Such a model depends only on the parameters (a, p) with the same prior distributions P_pr^a(a) and P_pr^p(p); its likelihood is derived from (S12)-(S13) and denoted in the main text as L(a, p^1, ..., p^N):

L(a, p) = P(X^1, ..., X^N | a, p, σ_min(a, p)). (S18)

This model has the following cost function to be minimized over (a, p):

ψ(a, p) = − log [ L(a, p) · P_pr^a(a) · P_pr^p(p) ] → min. (S19)

Obviously, the problems (S14) and (S19) have the same solution (ā, p̄), i.e. the results of learning are equal for both models (with and without σ). It can also be shown that the results of the Bayesian optimization described in sec. 1.3 are not affected by the exclusion of σ in the case N ≫ 1.
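Under the diagonal Gaussian noise model the minimization over σ at fixed (a, p) reduces to taking the per-component RMS of the residuals, which is what (S17) expresses. A hedged sketch (the function name is ours; we assume component-wise noise amplitudes):

```python
import numpy as np

def sigma_min(Y, Phi):
    """Closed-form minimizer of (S14) over sigma at fixed (a, p), cf. (S17):
    per-component root-mean-square of the model residuals Y^n - Phi(a, p^n)."""
    return np.sqrt(np.mean((Y - Phi) ** 2, axis=0))

# Check on synthetic residuals with known noise amplitudes (0.5, 1.0, 2.0)
rng = np.random.default_rng(1)
true_sigma = np.array([0.5, 1.0, 2.0])
Y = rng.standard_normal((2000, 3)) * true_sigma
s = sigma_min(Y, np.zeros_like(Y))
```

The recovered amplitudes approach the true ones as N grows, which is why the exclusion of σ does not affect the results at N ≫ 1.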
Prior distribution of the hidden time series p

We expect that the hidden signal p is associated with a hidden dynamical system, i.e. the time ordering and the evolution law of p are important. At the same time, we do not know the evolution law of the hidden system before the decomposition. In order to include our preferences into the model, we use the simplest evolution law differing from white noise as prior information:

p^{n+1} = b·p^n + σ_p·η^n. (S20)

Here η^n ∈ R is a Gaussian delta-correlated random process. Equation (S20) is a linear stochastic operator (an autoregressive process of first order) with two parameters, b and σ_p. In general, it can be thought of as making smooth signals more preferable. Indeed, the time scale of the autocorrelation function of the process (S20) is determined by b:

τ = −1 / ln b. (S21)

Also, to remove an obvious degeneracy connected with linear stretching of the coordinate p along the curve and the corresponding changes in the parameters a of the curve, we fix the amplitude of the process (S20) to be unity:

σ_p^2 = 1 − b^2. (S22)

To specify P_pr^p(p) we need to specify the distribution of p^1 initializing the process (S20). Naturally, it is supposed to be equal to the invariant measure P_inv of the process (S20) under the condition (S22):

P(p^1) = P_inv(p^1), (S23)

where P_inv is the standard normal density:

P_inv(p) = (2π)^{−1/2} exp(−p^2/2). (S24)

Finally, from (S20)-(S24) we obtain the following prior distribution of p:

P_pr^p(p) = P_inv(p^1) · Π_{n=1}^{N−1} (2π σ_p^2)^{−1/2} exp( −(p^{n+1} − b·p^n)^2 / (2σ_p^2) ). (S25)

Thus, the sequence p corresponds to the process (S20) in the steady state. Actually, the distribution (S25) is normal with an inverse covariance matrix B:

P_pr^p(p) ∝ exp( −(1/2) p^T B p ). (S26)

Here p is treated as a column-vector; B is symmetric and tridiagonal and depends only on the free parameter τ. Mathematically, (S26) addresses the general problem of degeneracy among the infinite number of possible coordinates that can be introduced on the same curve. This is especially important when the curve has high curvature or self-intersections. Different coordinates can provide different, more or less "good", hidden time series p, which we evaluate using (S26).
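The tridiagonal matrix B in (S26) can be written down explicitly for the stationary unit-variance AR(1) process and checked against the known AR(1) covariance matrix C_ij = b^{|i−j|}. A sketch (we assume the link b = exp(−1/τ) between the AR coefficient and the prior time scale, consistent with (S21)):

```python
import numpy as np

def ar1_inverse_cov(N, tau):
    """Tridiagonal inverse covariance matrix B of the stationary unit-variance
    AR(1) process (S20), cf. (S26); b = exp(-1/tau) per (S21)."""
    b = np.exp(-1.0 / tau)
    B = np.zeros((N, N))
    idx = np.arange(N)
    B[idx, idx] = 1.0 + b ** 2               # interior diagonal entries
    B[0, 0] = B[-1, -1] = 1.0                # boundary corrections
    B[idx[:-1], idx[:-1] + 1] = -b           # super-diagonal
    B[idx[:-1] + 1, idx[:-1]] = -b           # sub-diagonal
    return B / (1.0 - b ** 2)                # sigma_p^2 = 1 - b^2, cf. (S22)

N, tau = 50, 7.5
B = ar1_inverse_cov(N, tau)
b = np.exp(-1.0 / tau)
# Stationary AR(1) covariance matrix with unit variance
C = b ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
```

Multiplying B by C gives the identity, confirming that the tridiagonal form is indeed the inverse covariance of the steady-state process.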

Prior distribution of the parameters a

Despite the condition (S22), the distribution (S26) does not exclude infinitely small variances of the time series p and the corresponding degeneracy. The aim of the prior distribution of a is therefore to do this job, as well as to restrict the set of curves described by Φ.
We take the prior distribution of a in Gaussian form with a covariance matrix K:

P_pr^a(a) ∝ exp( −(1/2) a^T K^{−1} a ). (S27)

Here a is treated as a column-vector. The Gaussianity of (S27), together with the linearity of (S8) in a, provides an easy numerical implementation and computational advantages. At the same time, a Gaussian distribution is a very general and popular way to specify a preferred area in the parameter space.
Note that the basis of normalized Hermite polynomials in (S8) is orthonormal with respect to the scalar product weighted with the invariant measure (S24) (by definition of the Hermite polynomials):

∫ Π_i(p) Π_j(p) P_inv(p) dp = δ_ij. (S28)

This property is the reason for choosing the Hermite polynomial basis: different terms in (S8) are based on orthogonal polynomials. In order to provide the regularizing properties of (S27) described above, and taking into account the property (S28), we use the prior matrix K which (i) makes the covariance matrix of (S8) a priori equal to the covariance matrix of the input data Λ_d = ⟨Y_d^n (Y_d^n)^T⟩_n, and (ii) keeps (S27) invariant under rotations of the orthonormal basis Π_0, ..., Π_m. These conditions uniquely define K as the following block-diagonal matrix of size d·(m+1):

K = (1/(m+1)) · diag(Λ_d, ..., Λ_d) ((m+1) identical blocks). (S30)

Note that Λ_d is a diagonal matrix with the variances λ_k = (1/N) Σ_{n=1}^N (Y_k^n)^2 of the first d PCs on the diagonal (this follows from our definition of Y_d^n). Thus, taking into account (S30), we can write the prior distribution (S27) in the following form:

P_pr^a(a) ∝ exp( −((m+1)/2) Σ_{k=1}^d (1/λ_k) Σ_{i=0}^m (α_i^k)^2 ). (S31)
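The orthonormality (S28) can be verified numerically with Gauss-Hermite quadrature for the probabilists' weight exp(−p²/2) (a small self-check in Python; nothing here is specific to the paper's data):

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

m = 5
# Quadrature nodes/weights for the weight exp(-p^2/2); with 2m+2 nodes the
# rule is exact for polynomial integrands of degree up to 2*(2m+2)-1 > 2m
x, w = He.hermegauss(2 * m + 2)
# Normalized basis Pi_i = He_i / sqrt(i!), evaluated at the nodes
P = np.stack([He.hermeval(x, np.eye(m + 1)[i]) / sqrt(factorial(i))
              for i in range(m + 1)])
# Gram matrix of scalar products weighted with the invariant measure (S24);
# the 1/sqrt(2*pi) converts the quadrature weight into the N(0,1) density
G = (P * w) @ P.T / sqrt(2.0 * pi)
```

The Gram matrix comes out as the identity to machine precision, which is exactly the property that makes the prior covariance of (S8) easy to control.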

Bayesian optimal NDM selection
In the scope of sec. 1.1 and sec. 1.2 we have a set of models defined by triplets of free control parameters, or hyperparameters: the number d of PCs nonlinearly coupled by the model in (S6), the degree m of the polynomial nonlinearity in (S8), and the prior time scale τ of the hidden time series p in (S25). Obviously, every particular triplet (d, m, τ) provides a set of various data generators (S6) with a prior measure on this set, and there is the problem of choosing the "best" triplet.
Within the Bayesian framework we evaluate the usefulness of the triplet (d, m, τ) for describing a given dataset as the probability of the corresponding model to generate this dataset:

P(X^1, ..., X^N | d, m, τ) = ∫ L(a, p) · P_pr^a(a) · P_pr^p(p) da dp. (S32)

Here the prior distributions P_pr^a, P_pr^p and the likelihood L under the integral are taken for the hyperparameters (d, m, τ). The quantity (S32) is called the Bayesian evidence, or the marginal likelihood of the model.3 The optimal NDM is defined as (S10) with the parameters (ā, p̄) solving the problem (S19) for the triplet of hyperparameters (d̄, m̄, τ̄) maximizing the evidence (S32). This optimality criterion is completely equivalent to the minimum description length (MDL) principle4,5 for model selection in information theory, which defines the optimal model as the one coding the observed data with the minimal amount of information E:

E(d, m, τ) = − log P(X^1, ..., X^N | d, m, τ). (S33)

In general, the estimation of (S33) is a very hard problem. There are several methods for the numerical approximation of an integral like (S32): the Laplace approximation,6 Gibbs sampling,7 and the EM algorithm.8 We use the Laplace approximation (in the same way as in ref. 9 for latent variables). It is especially appropriate here because of the Gaussianity of the likelihood (S12) in a in the case of the polynomial parameterization (S8), and the Gaussianity of the prior distributions over p (S26) and over a (S27). The expression for the cost function (S33) obtained by this method is:

E ≈ ψ(ā, p̄) + (1/2) log det( (1/2π) ∇∇^T ψ ). (S34)

Here the function ψ is taken for the hyperparameter triplet (d, m, τ), and ∇∇^T ψ is the matrix of second derivatives of ψ with respect to (a, p) at the minimum (ā, p̄). Actually, the first term in (S34) is the minimum of the cost function (S19). The second term is, generally, a penalty for the complexity of the model. The computation of det (1/2π) ∇∇^T ψ can in our case be additionally simplified at N ≫ 1 due to the facts below.
First, the elements of the matrix ∇_a ∇_p^T ψ are N times smaller than the elements of ∇_a ∇_a^T ψ and d times smaller than the elements of ∇_p ∇_p^T ψ. It follows that we can neglect the contribution of the mixed derivatives ∇_a ∇_p^T ψ to det (1/2π) ∇∇^T ψ at the leading order in 1/N. Hence (S34) takes the form:

E ≈ ψ(ā, p̄) + (1/2) log det( (1/2π) ∇_p ∇_p^T ψ ) + (1/2) log det( (1/2π) ∇_a ∇_a^T ψ ). (S35)

Second, due to the properties of the parameterizing function (S8) and the prior distribution (S31), det (1/2π) ∇_a ∇_a^T ψ at the minimum (ā, p̄) can be separated into d determinants corresponding to the d PCs (again, at the leading order in 1/N). Then from (S35) we come to the final expression:

E ≈ ψ(ā, p̄) + (1/2) log det( (1/2π) ∇_p ∇_p^T ψ ) + (1/2) Σ_{k=1}^d log det( (1/2π) ∇_{α_k} ∇_{α_k}^T ψ ). (S36)

Here α_k = (α_1^k, ..., α_m^k) is a vector with elements from (S9). The second term in (S36) is the determinant of a tridiagonal matrix (because of the tridiagonality of B in (S26)). Thus, all components of (S36) can be computed quite fast, and the estimation time is limited mainly by the time of the minimization (S19) for each triplet.
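The structure of the estimate (S34) is easy to check on a toy quadratic cost, where the Laplace approximation is exact (a generic sketch, not the paper's actual ψ):

```python
import numpy as np

def laplace_log_evidence(psi_min, hessian):
    """Laplace estimate of E = -log integral exp(-psi(theta)) d theta,
    cf. (S34): psi at its minimum plus half the log-determinant of the
    Hessian divided by 2*pi."""
    sign, logdet = np.linalg.slogdet(hessian / (2.0 * np.pi))
    return psi_min + 0.5 * logdet

# Toy quadratic cost psi(theta) = 0.5 * theta^T A theta, minimum 0 at theta=0;
# the Gaussian integral is known exactly: (2*pi)^{k/2} / sqrt(det A)
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
E_laplace = laplace_log_evidence(0.0, A)
E_exact = -np.log((2.0 * np.pi) / np.sqrt(np.linalg.det(A)))
```

For a genuinely nonlinear ψ the same formula is only a second-order expansion around the minimum, which is why it works best at N ≫ 1.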

Simple example of NDM extraction
Data description

Let us demonstrate a single NDM extraction step on a simple two-dimensional model dataset (see Figs. S6a, S6c) produced by a generator of the form (S6):

X^n = φ(q^n) + ζ^n. (S37)

Here ζ^n ∈ R^2 is a Gaussian delta-correlated random process with noise intensity σ_φ = 0.4, and φ is a vector function describing a nonlinear curve in R^2 (S38). This curve is shown in Fig. S7a. To produce the scalar time series q^1, ..., q^N we use the time series of the variable y of the Lorenz system10 in the chaotic regime corresponding to the classic Lorenz attractor:

dx/dt = σ(y − x), dy/dt = x(r − z) − y, dz/dt = xy − βz, σ = 10, r = 28, β = 8/3. (S39)

In this regime the system's trajectory jumps between the two symmetric wings of the attractor, while on each wing the trajectory rotates about the center of the wing. The average jumping period is 2.26 times the average rotation period.
We create the time series y^n by taking N = 400 values of the variable y(t) with the time step Δt = 0.14 units, after the system has come to the attractor. This corresponds on average to 5.3 observations per rotation and 33.2 jumps during the observation time interval, i.e. the time series y^n is smooth and covers the attractor. Finally, the time series q^n is produced by normalizing y^n to a standard deviation equal to 1. The time series q^n is shown in Fig. S8a. A time-delayed representation of this time series is also drawn in Fig. S8b to show the Lorenz attractor in a 2-D projection of the delay-coordinate phase space.
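Generating such a time series is straightforward; a sketch with a fixed-step RK4 integrator (our implementation choices — the internal step, transient length and initial condition are not specified in the text):

```python
import numpy as np

def lorenz_rhs(s, sigma=10.0, r=28.0, beta=8.0 / 3.0):
    """Classic Lorenz-63 vector field in the chaotic regime."""
    x, y, z = s
    return np.array([sigma * (y - x), x * (r - z) - y, x * y - beta * z])

def rk4_step(s, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = lorenz_rhs(s)
    k2 = lorenz_rhs(s + 0.5 * dt * k1)
    k3 = lorenz_rhs(s + 0.5 * dt * k2)
    k4 = lorenz_rhs(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

N, dt, sub = 400, 0.14, 14                 # sample every dt using 14 substeps
s = np.array([1.0, 1.0, 1.0])
for _ in range(5000):                      # discard the transient
    s = rk4_step(s, dt / sub)
y = np.empty(N)
for n in range(N):                         # sample y(t) every dt = 0.14
    for _ in range(sub):
        s = rk4_step(s, dt / sub)
    y[n] = s[1]
q = y / y.std()                            # normalize to unit standard deviation
```

The resulting q^n has unit standard deviation by construction and visits both wings of the attractor over the observation interval.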
The observed noisy dataset X^n is shown on the 2-D plane of the variables (X_1, X_2) in Fig. S7b. The standard deviation of both φ_1(q^n) and φ_2(q^n) is approximately 1, so the noise level in the observed variables can be estimated as σ_φ and equals 40%.

Results

It is evident from Figs. S6b, S6d that the structure of the Lorenz attractor in the observed variables (X_1, X_2) is completely destroyed by the nonlinearity and noise. In this situation the NDM provides an answer to the important question: what is the optimal one-dimensional phase variable capturing the dynamical properties of the system? In our model example we know that this variable is given by the projection of the data onto the true curve in Fig. S7a.
The result of NDM extraction using our method can be seen in Figs. S7 and S8. The nonlinear curve given by the parameters a of the NDM at the optimal hyperparameters minimizing (S36) is shown in Fig. S7d. Obviously, this curve is very close to the true one, and the time series p of the hidden variable (Fig. S8e) demonstrates the structure of the Lorenz attractor in a delayed representation (Fig. S8f). In accordance with the optimality criterion (S33), the curves obtained at non-optimal hyperparameters are not similar to the true one: the overly simple linear model of a mode (Fig. S7c) is not flexible enough, while the overly complex model (Fig. S7e) gives an overfitted curve capturing the particular noise realization in (S37). Neither the linear (Figs. S8c, d) nor the overfitted (Figs. S8g, h) modes provide as good a structure of the attractor in the hidden variables as the optimal one does.

Bayesian confidence interval estimation
In order to calculate the Bayesian confidence intervals presented in the main text for the NDMs of the SST data, we first sampled the parameters (a, p) distributed according to the approximate posterior distribution proportional to the integrand in (S32). The sampling was performed under the same approximation of the integrand in the vicinity of its maximum as used in obtaining (S36) from (S32), i.e. we created Gaussian samples with mean (ā, p̄) and a block-diagonal covariance matrix with blocks given by the inverses of ∇_p ∇_p^T ψ and ∇_{α_1} ∇_{α_1}^T ψ, ..., ∇_{α_d} ∇_{α_d}^T ψ, making p and α_1, ..., α_d approximately independent Gaussian variables.
Then for each time moment n we cut off the 2.5% of samples with the largest values of p^n and the 2.5% of samples with the smallest values of p^n to obtain the 95% confidence interval for p^n.

Figure S8 (caption fragment). (e, f) Nonlinear mode with the optimal hyperparameters d = 2, m = 3, τ = 7.5; (g, h) overfitted nonlinear mode with hyperparameters d = 2, m = 7, τ = 2.6.

To calculate the confidence intervals for the NDM-based indices, we produced time series of these indices from each sample (a, p), i.e. created samples of the indices, and then estimated the 95% confidence interval for each time moment of each index in the same way as for the p time series.
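The percentile cut described above amounts to a per-time-moment 2.5/97.5 percentile over the posterior samples; a minimal sketch on synthetic samples (the Gaussian samples here only stand in for the actual posterior draws):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in posterior samples of an index time series (rows = samples)
n_samples, N = 4000, 120
mean_signal = np.sin(np.linspace(0.0, 6.0, N))
samples = mean_signal + rng.standard_normal((n_samples, N))

# 95% Bayesian confidence interval: cut 2.5% of samples on each side,
# independently for every time moment n
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
coverage = np.mean((samples >= lo) & (samples <= hi))
```

By construction about 95% of the samples fall inside the band at every time moment.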
The Bayesian confidence intervals obtained in this way show the uncertainty in determining the NDMs induced by the uncertainty of learning a and p for each NDM.
Generally, the confidence band is narrower for the hidden time series p than for the indices, because the uncertainties of the latter are additionally influenced by the uncertainty of a.