A Koopman operator-based prediction algorithm and its application to COVID-19 pandemic and influenza cases

Future state prediction for nonlinear dynamical systems is a challenging task. Classical prediction theory is based on a, typically long, sequence of prior observations and is rooted in assumptions on statistical stationarity of the underlying stochastic process. These algorithms have trouble predicting chaotic dynamics, "Black Swans" (events which have never previously been seen in the observed data), or systems where the underlying driving process fundamentally changes. In this paper we develop (1) a global and local prediction algorithm that can handle these types of systems, (2) a method of switching between local and global prediction, and (3) a retouching method that tracks what the predictions would have been if the underlying dynamics had not changed, and uses these predictions when the underlying process reverts back to the original dynamics. The methodology is rooted in Koopman operator theory from dynamical systems. An advantage is that it is model-free, purely data-driven, and adapts organically to changes in the system. While we showcase the algorithms on predicting the number of infected cases for the COVID-19 pandemic and for influenza, we emphasize that this is a general prediction methodology with applications far outside of epidemiology.


Introduction
The ability to predict events is one of the key differentiators of homo sapiens. The key element of prediction is reliance on data collected over some time interval for estimation of the evolution over the next time period. Mathematicians have long worked on the formal aspects of prediction theory, and separate strands such as Wiener-Kolmogorov 1 , Furstenberg 2 and Bayesian prediction 3 have emerged. However, all of these are concerned with prediction of future events based on a, typically long, sequence of prior observations. This is rooted in assumptions on statistical stationarity of the underlying stochastic process.
The point of view on prediction in this paper is quite different: we view the process over a short (local) time scale and extract its coarse-grained ingredients. We proceed with prediction of the evolution based on these, learning the process and building a global time scale on which such prediction is valid. Then, we monitor for a change in these coarse-grained ingredients, detect whether a substantial change (a "Black Swan" event 15 ; see Supplementary Information section S1 for the mathematical definition that we use) is happening, and switch back to local learning and prediction. In this way, we accept the limitations on predictability due to, possibly finite-time, nonstationarity, and incorporate them into the prediction strategy. In principle, such a strategy is valuable even in the case where, over a long time interval, the system is indeed a stationary stochastic process with respect to some invariant measure. An example is the Lorenz system, the prototypical system exhibiting chaotic dynamics and the Butterfly Effect 51,56 , studied in Supplementary Information section S2.1. For typical learning algorithms Black Swan events are devastating: the learning algorithm has to be restarted, as otherwise it would learn the deviation as normal. Our method includes a technique of retouching the Black Swan event data, wherein its replacement with a realization of the normal process evolution obviates the need for a restart.
However, Black Swan unpredictability and Butterfly Effect unpredictability are ontologically different. Namely, the Butterfly Effect is a consequence of dynamics inherent to the system, while a Black Swan arises from an action external to the system dynamics.
We approach the prediction problem from the perspective of Koopman operator theory 7,[9][10][11][12]20 in its recently developed form that is applicable to nonstationary stochastic processes 13,21 . The Koopman operator theory is predicated on the existence of the composition operator that dynamically evolves all possible observables on the data, enabling the study of nonlinear dynamics by examining its action on a linear space of observables. The key ingredients of this approach are the eigenvalues and eigenfunctions of the Koopman operator and the associated Koopman Mode Decomposition (KMD) of the observable functions, which is then approximated numerically using Dynamic Mode Decomposition (DMD). The numerical approach used in this work relies on lifting the available data to a higher dimensional space using the Hankel-Takens matrix, and on an improved implementation of the DMD algorithm for discovering approximations of the Koopman modes with small residuals. The obtained Koopman mode approximations and the related eigenvalues, called Ritz pairs, are crucial for obtaining satisfactory predictions using the KMD. One of the main advantages of the method is that it is completely data-driven, i.e., model-free.
The retouching method is presented in §2.2. In §2.3 we present the application of the prediction algorithm to infection-case data from the current COVID-19 pandemic.
In the Supplementary Information, we provide the mathematical background of all algorithms, with worked examples that clarify all technical details and offer more in-depth discussions. Further, we test the predictive potential of the proposed method using diverse case studies, with the goal of stressing the data driven (model free) nature of the method. For instance, an application of the method to prediction of physiological process data, which additionally validates our methodology, could be of interest in a personalized medical treatment context. As further case studies, we use the Lorenz system and the AL index of geomagnetic substorms.

Prediction with Koopman Mode Decomposition
Our approach starts with the Koopman operator family U^t, which acts on observables f by composition, U^t f(x) = f(x(t)). U^t is a linear operator that allows studying the nonlinear dynamics by examining its action on a linear space F of observables. In a data driven setting, usually the states z_i ≈ x(t_i) of the dynamical system at discrete time moments t_i are known. They are governed by the discrete dynamical system z_{i+1} = T(z_i), for which the Koopman operator reads U f = f ∘ T.
The key to the spectral analysis of the dynamical system is a representation of a vector-valued observable f = (f_1, . . ., f_d)^T as a linear combination of the eigenfunctions ψ_j of U. Under certain assumptions, each observable f_i can be approximated as f_i(z) ≈ ∑_{j=1}^m ψ_j(z)(v_j)_i, where ψ_1, . . ., ψ_m are selected eigenfunctions and v_j = ((v_j)_1 . . . (v_j)_d)^T are the vectors of the coefficients. Then, since U ψ_j = λ_j ψ_j, we can predict the values of the observable f at the future states T(z), T^2(z), . . . by numerically evaluating

f(T^k(z)) ≈ ∑_{j=1}^m λ_j^k ψ_j(z) v_j, k = 1, 2, . . . .   (1)

This decomposition of the observables is called the Koopman Mode Decomposition (KMD); the scalars λ_j are the Koopman eigenvalues, and the v_j's are the Koopman modes. Their numerical approximations can be computed from the supplied data pairs (f(z_i), f(T(z_i))), i = 0, . . ., M, using e.g. the Dynamic Mode Decomposition (DMD) 28 , 12 , 13 . In this work, the numerical algorithm is fed with a sequence of successive data snapshots, called the active window, to compute the approximation of the KMD, and then (1) is used for prediction. If we use wider windows and longer forecasting lead times, we speak of global prediction; for narrower windows with locally adapted widths and shorter forecasting lead times, we have local prediction (nowcasting).
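To make the computation concrete, the following minimal Python sketch predicts future snapshots from an active window via a truncated-SVD DMD and the evaluation of (1). It is a simplified stand-in, under our own naming (dmd_predict, r, steps), for the refined residual-based algorithms detailed in the Supplementary Information.

```python
import numpy as np

def dmd_predict(F, r, steps):
    """Sketch of KMD-based prediction from an active window of snapshots.

    F     : d x (M+2) array, columns are successive snapshots f(z_0), ..., f(z_{M+1})
    r     : number of retained modes (at most the numerical rank of the data)
    steps : number of future time steps to predict
    """
    X, Y = F[:, :-1], F[:, 1:]
    # POD basis of the window (truncated SVD)
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    Ur, sr, Vr = U[:, :r], s[:r], Vh[:r, :].conj().T
    # Rayleigh quotient of the DMD matrix on the POD subspace
    Ar = Ur.conj().T @ Y @ (Vr / sr)
    lam, W = np.linalg.eig(Ar)              # Ritz values (Koopman eigenvalues)
    modes = Ur @ W                          # Ritz vectors (Koopman modes v_j)
    # amplitudes from the most recent snapshot: f ~ sum_j alpha_j v_j
    alpha = np.linalg.lstsq(modes, F[:, -1], rcond=None)[0]
    # equation (1): f(T^k z) ~ sum_j lambda_j^k alpha_j v_j
    return np.real(np.column_stack(
        [modes @ (alpha * lam**k) for k in range(1, steps + 1)]))
```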

Case study: Influenza epidemics.
As a first example showcasing our prediction methodology, we use data associated with influenza epidemics. Although clearly not driven by an underlying deterministic dynamical system, the influenza time series exhibits substantial regularity in that outbreaks typically occur during the winter months, thus enabling coarse-grained predictions of the type "we will see a very small number of influenza cases occurring in the summer months". However, predicting the number of influenza cases accurately is a notoriously hard problem 15 , exacerbated by the possibility that a vaccine designed in a particular year does not effectively protect against infection. Moreover, the H1N1 pandemic that occurred in 2009 is an example of a Black Swan event.
The World Health Organization's FluNet is a global web-based tool for influenza virological surveillance. FluNet makes publicly available the numbers of specimens with detected influenza viruses of type A and type B. The data have been collected from different countries, starting with the year 1997, and are updated weekly by the National Influenza Centers (NICs) of the Global Influenza Surveillance and Response System (GISRS) and other national influenza reference laboratories collaborating actively with GISRS. We use the weekly reported data for different countries, which consist of the number of specimens received in the laboratories and the number of specimens with confirmed viruses of type A.
The Koopman Mode Decomposition was used in the context of analyzing the dynamics of the flu epidemic from different data (Google Flu) in 24 . We remark that the authors of that paper did not attempt prediction, and analyzed only "stationary" modes, e.g. the yearly cycles, making that paper's goals quite different from the nonstationary prediction pursued here.
We first compare the global and the local prediction algorithms. The KMD is computed using active windows of size w = 312 and 208 × 104 Hankel-Takens matrices. In Fig. 1a, we show the performance of both algorithms, using the learning data from the window April 2003 - April 2009 (shadowed rectangle). In the global prediction algorithm the dynamics is predicted 104 weeks ahead. The first type of failure in the global prediction algorithm and forecasting appears after the Black Swan event that occurred in the years 2009 and 2010. This is recognized by the algorithm, which adapts by using the smallest learning span; with this strategy, it allows for reasonably accurate forecasting, at least for shorter lead times. These data, in addition to those from Supplementary Information section S2.4, show the benefits of monitoring the prediction error and switching to local prediction. The initial Hankel-Takens matrix is 3 × 2, and the threshold for the local prediction relative error in Supplementary Information Algorithm S4 is 0.005. (Some predicted values were even negative; those were replaced with zeros.) The global prediction algorithm recovers after retouching the Black Swan event data, which allows for using a big learning window. Compare with the positions of the corresponding colored rectangles in Figure 2.

Retouching the Black Swan event data
Next, we introduce an approach that robustifies the global algorithm in the presence of disturbances in the data, including the missing-data scenario. We use the data window July 2004 - July 2010, which contains a Black Swan event in the period 2009-2010. As shown in Figure 1b, the learned KMD failed to predict the future following the active training window. This is expected, because the perturbation caused by the Black Swan event resulted in computed Ritz pairs that deviated from the precedent ones (from a learning window before the disturbance) and, moreover, mostly had large residuals. This can be seen as a second type of failure of the global prediction.
The proposed Black Swan event detecting device, built into the prediction algorithm (see Supplementary Information Algorithm S3), checks for this anomalous behaviour of the Ritz values and pinpoints the problematic subinterval. Then, the algorithm replaces the corresponding supplied data with the values obtained as predictions based on the time interval preceding the Black Swan event. Figure 1b shows that such a retouching of the disturbance allows for a reasonable global prediction.
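Once the problematic subinterval has been pinpointed, the replacement step itself is short. A minimal sketch, with a hypothetical predict callable standing in for any predictor trained on clean data (for instance, a KMD-based one such as the sketch above):

```python
import numpy as np

def retouch(series, flagged, predict):
    """Replace a flagged (Black Swan) span with model-based predictions.

    series  : 1-D array of observed values
    flagged : (start, end) index pair of the detected disturbance
    predict : callable(history, steps) -> array of `steps` predicted values,
              trained only on the data preceding the disturbance
    """
    start, end = flagged
    clean = series.copy()
    clean[start:end] = predict(series[:start], end - start)
    return clean
```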
Note that in a realistic situation, global predictions of this kind will trigger responses from the authorities and will therefore undermine their own accuracy and induce a loss of confidence; local prediction mechanisms then need to be deployed again.

Monitoring and restoring the Ritz values
We now discuss the effect of the Black Swan event, and of its retouching, on the computed eigenvalues and eigenvectors. We have observed that, as soon as a disturbance starts entering the training windows, the Ritz values start exhibiting atypical behavior, e.g. moving deeper into the right half plane (i.e. becoming more unstable) and having larger residuals, because the training data no longer represent a Krylov sequence of the underlying Koopman operator. This is illustrated in panels (a) and (b) of Figure 2, which show, for the sliding training windows, the real and the imaginary parts of those eigenvalues for which the residuals of the associated eigenvectors are smaller than η_r = 0.075. Note the absence of such eigenvalues in time intervals that contain the disturbance caused by the Black Swan event.
On the other hand, the retouching technique that repairs the distorted training data restores the intrinsic dynamics over the entire training window. The distribution of the relevant eigenvalues becomes more consistent, and the prediction error decreases; see panels (c) and (d) in Figure 2 and Supplementary Information Figure S16.

Discussion
Our proposed retouching procedure relies on detecting anomalous behavior of the Ritz values; a simple strategy of monitoring the spectral radius of active windows (the Ritz value of largest modulus extracted from the data in that window) is outlined in the Supplementary Information. Note that this can also be used as a litmus test for switching to the local prediction algorithm.
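A minimal sketch of this litmus test, assuming a ritz_values callable that returns the Ritz values computed from one active window (e.g. via a DMD routine); the names and the flagging rule below are ours:

```python
import numpy as np

def spectral_radius_trace(series, w, ritz_values):
    """Spectral radius (largest |Ritz value|) over sliding windows of width w."""
    radii = []
    for b in range(len(series) - w + 1):
        lam = ritz_values(series[b:b + w])
        radii.append(np.abs(lam).max())
    return np.array(radii)

# flag windows whose spectral radius jumps well above the running level,
# e.g. radii > 1.05 * np.median(radii); such windows are candidates for
# retouching or for switching to local prediction
```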
In the Supplementary Information, we provide further examples, with the influenza data, that confirm the usefulness of the retouching procedure. In general, this procedure can also be adapted to the situation when the algorithm receives a signal that the incoming data are missing or corrupted.

COVID-19 prediction.
The second set of data we consider is associated with the ongoing COVID-19 pandemic. Because the virus is new, the whole event is, in a sense, a "Black Swan". However, as we show below, the prediction approach advanced here is capable of adjusting quickly to new incoming, potentially sparse data, and is robust to inaccurate reporting of cases.
At the beginning of the spread of COVID-19, we witnessed an at times rather chaotic situation in gaining knowledge about the new virus and the disease. The development of COVID-19 diagnostic tests made tracking and modelling feasible, but with many caveats: the data themselves are clearly not ideal, as they depend on the reliability of the tests, testing policies in different countries (triage, number of tests, reporting intervals, reduced testing during weekends), contact tracing strategies, the use of surveillance technology, credit card usage and phone contact tracking, the number of asymptomatic transmissions, etc. Many different and unpredictable exogenous factors can distort the data. For instance, the authors of 26 comment at https://ourworldindata.org/coronavirus-testing that e.g. "The Netherlands, for instance, makes it clear that not all labs were included in national estimates from the start. As new labs get included, their past cumulative total gets added to the day they begin reporting, creating spikes in the time series." For a prediction algorithm, this creates a Black Swan event that may severely impair prediction skills; see §2.2.
This poses challenging problems for compartmental-type models (SIR, SEIR), which, in order to be useful in practice, have to be coupled with data assimilation to keep adjusting the key parameters; see e.g. 27. Our technique of retouching (§2.2) can in fact be used to assist data assimilation by detecting a Black Swan disturbance and thus avoiding assimilating the disturbance as normal.
In the KMD based framework, the changes in the dynamics are automatically assimilated on the fly by recomputing the KMD using new (larger or shifted) data snapshot windows. This is different from the compartmental-type models of infectious diseases, most notably in the fact that the procedure presented here does not assume any model and, moreover, is entirely oblivious to the nature of the underlying process.

An example: European countries
As a first numerical example, we use the reported cumulative daily cases in European countries. In Supplementary Information section S1.5, we use these data for a detailed worked example that shows all technical details of the method. This is a good test case for the method: using the data from different countries in the same vector observable poses an additional difficulty for a data driven revealing of the dynamics, because the countries independently, and in an uncoordinated manner, impose different restrictions, thus changing the dynamics on local levels. For instance, at the time of writing these lines, a new and seemingly more infectious strain of the virus circulating in some parts of London and in the south of England prompted the UK government to impose full lockdown measures in some parts of the United Kingdom. Many European countries reacted sharply and immediately suspended air traffic with the UK.
In the first numerical experiment, we use two datasets from the time period February 29 to November 19, and consider separately two sets of countries: Germany, France and the UK in the first, and Germany, France, the UK, Denmark, Slovenia, Czechia, Slovakia and Austria in the second. The results for a particular prediction interval are given in Figure 3 and Figure 4. For more examples, and a discussion of how the prediction accuracy depends on the Government Response Stringency Index (GRSI 47,48 ), see Supplementary Information section S1.5. In the above examples, the number of computed modes was equal to the dimension of the subspace spanned by the training snapshots, so that the KMD of the snapshots themselves was accurate up to the errors of finite precision arithmetic. In general, that will not be the case, and the computed modes will span only a portion of the training subspace, meaning that the KMD of the snapshots might have a larger representation error. (Here we refer the reader to Supplementary Information section S1.3, where all technical details are given.) This fact has a negative impact on the extrapolation forward in time, and the problem can be mitigated by giving more weight to the reconstruction of more recent snapshots. This is illustrated in Figures 5 and 6, where the observables are the raw data (reported cases) for Germany, extended by two additional sequences of filtered (smoothed) values. The figures illustrate an important point of the prediction methodology that we emphasized in the introduction: a longer dataset and a better data reconstruction ability (i.e. interpolation) do not necessarily lead to better prediction. Namely, weighting more recent data more heavily produces better prediction results. This was already observed in 30 for the case of traffic dynamics, and the method we present here can be used to optimize the prediction ability.

An example: USA and worldwide data
We have deployed the algorithm to assess the global and United States evolution of the COVID-19 pandemic. The evolution of the virus is rapid, and "Black Swans", in the sense of new cases in regions not previously affected, appear with high frequency. Despite that, the Koopman Mode Decomposition based algorithm performed well.
In Figure 7a we show the worldwide forecast numbers of confirmed cases produced by the algorithm for November 13th, 2020. The forecasts were generated by utilizing the previous three days of data to forecast the next three days of data for regions with more than 100 reported cases. The bubbles in Figure 7a are color-coded according to their relative percent error. As can be observed, the majority of the forecasts fell below 15% error. The highest relative error for November 13th, 2020 was 19.8%, which resulted from an absolute error of 196 cases. The mean relative percent error, produced by averaging across all locations, is 1.8%, with a standard deviation of 3.36%, for November 13th, 2020. Overall, the numbers of confirmed cases are predicted accurately, and since the forecasts were available between one and three days ahead of time, local authorities could very well utilize our forecasts to focus testing and prevention measures on hot-spot areas that will experience the highest growth.
A video demonstrating the worldwide forecasts for March 25, 2020 - November 29, 2020 is provided in the Supplementary Information online (Figure 7a is a snapshot from that video). Lastly, it is well known that the capacity to test people for the virus increased throughout the development of the pandemic, and thus resulted in changes in the dynamics of reported cases. Although it is impossible for a data-driven algorithm to account for changes due to external factors, such as increased testing capabilities, it is important that the algorithm be able to adjust and relearn the new dynamics. For this reason, we encourage the reader to reference the video and note that, although periods of inaccuracy due to Black Swan events occur, the algorithm is always able to stabilize and recover. In contrast, since this is an at times rapidly (exponentially) growing set of data, methods like the naive persistence forecast do poorly.
In Figures 7b and 7c we show the performance of the prediction for the cumulative data for the US in March-April 2020. It is of interest to note that the global curve is obtained as a sum of the local predictions shown in Figure 7a, rather than from a separate algorithm run on the global data. Again, the performance of the algorithm on these nonstationary data is good.

Discussion
In this work, we have presented a new paradigm for prediction whose central tenet is understanding the confidence with which the algorithm is capable of predicting future realizations of a non-stationary stochastic process. Our methodology is based on Koopman operator theory 20 . Operator-theoretic methods have been used for detection of change in complex dynamics in the past, based on both the Koopman 1,32 and the Perron-Frobenius operators 33 . Other methods include variational finite element techniques combined with an information-theoretic measure (Akaike's information criterion) and the maximum entropy principle 50 . Our approach to the problem of prediction of nonstationary processes has several key ingredients. First, the Koopman operator on the space of the observables is used as a global linearization tool, whose eigenfunctions provide a coordinate system suitable for representation of the observables. Second, in a numerical computation, we lift the available snapshots to a higher-dimensional Hankel-Takens structure, which, in particular in the case of an abundance of data, allows for a better numerical (finite dimensional) Rayleigh-Ritz approximation of the eigenvalues and eigenvectors of the associated Koopman operator, as well as of the KMD. Third, using our recent implementation of the DMD, we select the Koopman modes that have the smallest residuals, and thus the highest confidence, which is the key to the prediction capabilities of the KMD. In the absence of enough modes with reasonably small residuals, i.e. low confidence, we switch to local prediction, with narrower learning windows and shorter lead times. By monitoring the prediction error, the algorithm may return to global prediction.
Our methodology is entirely consistent with the typical training/test dataset validation techniques in machine learning. Namely, the globally learned model on the training data is applied to test data for the next time interval. The novelty in our approach is that we constantly check how well the learned model generalizes, and if it does not generalize well, we restart the learning. One can say that we implemented a feedback loop within which the machine learning algorithm's generalizability from training to test dataset is constantly checked, and the system adapts to new conditions. Evidence for the effectiveness of this procedure is presented for the COVID-19 prediction example, where we show how the generalization error diminishes over time.

Methods
Our starting assumption is that the observed data are generated by a dynamical process realized on some underlying state space. This is a broad enough assumption to cover data generated by both deterministic and stochastic dynamical systems 11 . The (internal) state is often inaccessible; instead, an observable (output) is given as a function f(x(t)) of the state vector x(t).

The Koopman operator and the KMD
The Koopman operator family U^t acts on observables f by composition, U^t f(x) = f(x(t)). It is a global linearization tool: U^t is a linear operator that allows studying the nonlinear dynamics by examining its action on a linear space F of observables. In data analysis, for the discrete time steps t_i, the discrete sequence z_i ≈ x(t_i), generated as numerical software output, is a discrete dynamical system z_{i+1} = T(z_i), for which the Koopman operator reads U f = f ∘ T.
The key to the spectral analysis of the dynamical system is a representation of a vector-valued observable f = (f_1, . . ., f_d)^T as a linear combination of the eigenfunctions ψ_j of U. In a subspace spanned by eigenfunctions, each observable f_i can be written as f_i(z) ≈ ∑_{j=1}^∞ ψ_j(z)(v_j)_i; then, since U ψ_j = λ_j ψ_j, we can envisage the values of the observable f at the future states T(z), T^2(z), . . . by

f(T^k(z)) ≈ ∑_j λ_j^k ψ_j(z) v_j

(see e.g. 16,20 ). The numerical approximation of the KMD can be computed using, for example, DMD algorithms. Different versions of the algorithm used in this work are described in detail in the Supporting Information-Methods.

Finite dimensional compression and Rayleigh-Ritz extraction
For practical computation, U is restricted to a finite dimensional space F_D spanned by a dictionary of suitably chosen functions D = { f_1, . . ., f_d }, and we use a matrix representation U of the compression Ψ_{F_D} U|_{F_D} : F_D → F_D, where Ψ_{F_D} is an L^2 projection, e.g. with respect to the empirical measure defined as the sum of the Dirac measures concentrated at the z_i's. Since U is the adjoint of the DMD matrix A associated with the snapshots z_i, the approximate (numerical) Koopman modes and eigenvalues are the Ritz pairs (Ritz eigenvalues and eigenvectors) of A, computed using the Rayleigh-Ritz method. The residuals of the Ritz pairs can be computed and used to check the accuracy 12 . See Supporting Information-Methods.

The Hankel-DMD (H-DMD)
The data snapshots (numerical values of the observables) can be rearranged in a Hankel-Takens matrix structure: for a subsequence (window) of w successive snapshots f_b, f_{b+1}, . . ., f_{b+w−1}, split w = m_H + n_H and then define new snapshots as the columns h_j of the n_H × m_H Hankel-Takens matrix

H(i, j) = f_{b+i+j−2}, i = 1, . . ., n_H, j = 1, . . ., m_H,

i.e. h_j = (f_{b+j−1}, . . ., f_{b+j+n_H−2})^T (see 7,22,45 and the Supporting Information). Then, for these data we compute the KMD and use (1) for prediction. Predictions of the observables f_i are then extracted from the predicted values of the observables h_j. The introduction of the Hankel-Takens matrix alleviates issues that arise from using a basis on a potentially high dimensional space: taking products of basis elements on 1-dimensional subspaces (for example, a Fourier basis on an interval in R) leads to an exponential growth in the number of basis elements, the so-called curse of dimensionality. The Hankel-Takens matrix is instead based on the dynamical evolution of one or more observables (functions on state space) that span a Krylov subspace. The idea is that one might start even with a single observable and, through its evolution, span an invariant subspace of the Koopman operator (note the connection of such methods with the Takens embedding theorem ideas 1,7,22 ). Since the number of basis elements is in this case equal to the number of dynamical evolution steps, in any dimension, Krylov subspace-based methods do not suffer from the curse of dimensionality.

Figure 7: (7a) The widths of the bubbles represent the number of cases in a region; only regions with more than 100 cases are used, and the bubbles are colored according to their relative percent error. (7b) Comparison of true and forecast data for cumulative confirmed cases in the US for April to December 2020. The cumulative forecasts shown here were obtained by summing the forecasts of the individual locations, indicating that the region-specific forecasts were sufficiently accurate for tracking the cumulative dynamics of the virus in the US. (7c) Percent error for the forecasts of the cumulative confirmed cases in the US. On average the percent error is less than 5 percent; although spikes occur, which could be due to changes in testing availability, the algorithm adjusts and the error stabilizes within a short amount of time. Johns Hopkins provided data for around 1787 locations in the United States, and we produced forecasts for each of those locations.
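Returning to the Hankel-Takens lifting, a minimal sketch of the construction for a scalar series (the helper name hankel_takens is ours; the dimensions follow the influenza case study):

```python
import numpy as np

def hankel_takens(f, n_H, m_H):
    """Lift a scalar time series window into an n_H x m_H Hankel-Takens matrix.

    Column j is the delay-embedded snapshot (f_j, f_{j+1}, ..., f_{j+n_H-1});
    the window must contain at least n_H + m_H - 1 values.
    """
    f = np.asarray(f)
    assert len(f) >= n_H + m_H - 1
    return np.column_stack([f[j:j + n_H] for j in range(m_H)])

# e.g. a window of w = 312 weekly values lifted into a 208 x 104 matrix,
# as in the influenza case study:
# H = hankel_takens(flu_window, n_H=208, m_H=104)
```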

S1 Methods
In this section we provide technical details of the numerical spectral analysis of dynamical systems, which is at the core of the prediction algorithms presented in this work. The tools of the trade are the Koopman operator, the KMD and the DMD decompositions. We first present a compact tutorial on the numerical aspects of the Koopman modal analysis of nonlinear dynamical systems, and then we provide theoretical underpinnings for the methods presented in the paper.
We do not go into the details of convergence of the Koopman operator approximations utilized in the paper, but we mention the associated work. For example, for on-attractor evolution, the properties of the Generalized Laplace Analysis (GLA) method acting on L^2 functions were studied in 1,2 . The off-attractor case was pursued in 3 in Hardy-type spaces. This study was continued in 4 to construct dynamics-adapted Hilbert spaces. A study of convergence of the DMD-type approximations utilized in this paper is provided in 5 . There are two types of results presented in these papers: 1) convergence of the spectral objects over an infinite time interval, and 2) the rate of convergence to the spectral objects. Since we are pursuing a finite-time analysis, the results of type 2) are more relevant. The gist of these results is that the convergence rate is 1/n for regular dynamics (limit cycles, limit tori), where n is the number of snapshots, and 1/√n for irregular (chaotic) dynamics. This can be improved, even to exponential convergence, under some conditions 6 . While the proofs indicated here are for the GLA, and depend on the convergence of time averages over trajectories, they can be extended to DMD methods since, in general, these can be related to time averages over trajectories 7 .
Since the prediction methods in the paper are tightly connected to the detection of "Black Swan" events, and these are often defined in imprecise terms, we provide the mathematical definition that we utilize in this work, for the orientation of the reader:

Definition S1.1 Let f : M → C, f ∈ H, and let K : H → H be an operator from the Hilbert space H to itself. Consider the dynamics given by f ↦ K f. Let µ be an ergodic invariant measure for K, i.e.

lim_{n→∞} (1/n) ∑_{k=0}^{n−1} (K^k f)(x) = ∫_M f dµ
for almost all x ∈ M. Assume that the support S of µ is a proper subset of M, S ≠ M. A Black Swan event for an observable f is a value g ∉ f(S). The magnitude of the Black Swan event for the observation g is d(g, f(S)), where d is a metric and f(S) is the range of f on S.
The set M does not necessarily need to be the state space. In the examples shown in the paper, the observables are spectral (e.g. the spectral radius), M = C, and the operator K is the operator acting on the spectral objects induced by the Koopman evolution.
For a non-degenerate stochastic process with an ergodic measure µ, it can be unlikely that the support of µ is different from M. In that case, the "Black Swanness" of a value g can be defined as d(g, ν), where d is e.g. the Wasserstein distance between a delta distribution at g and ν, and ν(E) = µ(f^{−1}(E)) is the pushforward measure under f 8 . Note that this particular d can also be used in the deterministic case described previously.

S1.1 Organization of the section
This section is organized as follows. First, in §S1.2 we review the continuous and the discrete autonomous dynamical systems, and the definition of the Koopman operator U on the space of observables. The review material is based on the papers on the theory and applications of the Koopman operator 9 , 10 , 11 , and our recent work 12 , 13 , 14 . The spatio-temporal representation of the evolution of the observables using the eigenvalues and eigenvectors of the Koopman operator (Koopman mode decomposition, KMD) is discussed in §S1.2.1. The next key ingredient, numerical computation of approximate eigenvalues and eigenvectors, is reviewed in §S1.3. In particular, the details of the matrix representation of a compression of U to the subspace of observables are worked out in §S1.3.2; the numerical realization of the Koopman mode decomposition is presented in detail in §S1.3.3 and §S1.3.4. In §S1.3.5 we provide the key elements of Schmid's DMD algorithm, and in §S1.3.6 its recent enhancement that allows selection of Ritz pairs that can be used for a KMD suitable for prediction. In §S1.3.7 we review the least squares methods for the spatio-temporal representation of the snapshots using the selected Koopman modes. Finally, after having prepared all necessary ingredients, in §S1.4 we present the global prediction algorithm. The setup of the prediction framework is given in §S1.4.1, which introduces basic notation, and §S1.4.2, where we lift the data snapshots into a Hankel structure that will be used in the algorithms. In §S1.4.3, we discuss the limitations of the numerical realization of the KMD based prediction scheme introduced in §S1.2.1. A worked example that illustrates the technical details is provided in §S1.5, where we apply the prediction scheme to the spread of the coronavirus disease in European countries. In §S1.7, we discuss the problem of Black Swan events 15 in the training data, and we propose a novel technique, based on the method reviewed in §S1.3.6, for detecting and retouching Black Swan type disturbances of the data. This makes the global prediction scheme resilient to sudden and unpredictable disturbances that have become part of the learning data window. In §S1.7.2, we present a more flexible local prediction scheme that dynamically resizes the learning data windows and the forecasting lead time in the event of sudden changes and large prediction errors.

S1.2 Setting the scene: Koopman operator
Consider an autonomous system of differential equations

ẋ(t) = F(x(t)),   (S5)
with state space X and vector-valued nonlinear function F. Here X is a compact smooth N-dimensional manifold, endowed with a Borel sigma algebra B and, for simplicity, identified with a subset of R^N, with F : X → R^N. The associated flow map ϕ^t : X → X advances an initial state x(t_0) forward in time by a time unit t,

ϕ^t(x(t_0)) = x(t_0 + t).   (S6)

Note that ϕ^{t+s} = ϕ^t ∘ ϕ^s, where ∘ denotes the composition of mappings. The (internal) state is often inaccessible; instead, an observable (output) is given as a function f : X → C of the state, where the class (function space) F of observables is appropriately chosen and endowed with a Banach or Hilbert space structure. For a more detailed introduction we refer to 16 , in particular Chapters II and V, and 17 . For instance, we can take F = L^p(X, µ), 1 ≤ p ≤ ∞, with an appropriate measure µ and, e.g. for p = 2, with the corresponding Hilbert space structure. The Koopman operator semigroup (U_{ϕ^t})_{t≥0} is defined by

(U_{ϕ^t} f)(x) = f(ϕ^t(x)).   (S7)

Here we assume that ϕ^t preserves sets of measure zero (if µ(A) = 0, then µ((ϕ^t)^{−1}(A)) = 0) and that U_{ϕ^t} is defined on the equivalency classes (modulo µ). It can be considered as a linearization tool for (S5): U_{ϕ^t} is a linear operator that allows studying (S5) by examining its action on the infinite-dimensional space F of observables. If ϕ^t is measure-preserving (µ((ϕ^t)^{−1}(A)) = µ(A) for all A ∈ B), then U_{ϕ^t} is an isometry. For an introduction to the theory of the Koopman operator on the Banach lattice L^p see [17, Chapter 7]. An analogous approach is applicable to a discrete dynamical system

z_{i+1} = T(z_i),   (S8)

where T : X → X is a measurable nonlinear map on a state space X and i ∈ Z. The Koopman operator U ≡ U_T for the discrete system is defined analogously by

(U_T f)(x) = f(T(x)).   (S9)

Discrete dynamical systems naturally describe the evolution of discrete events, e.g. stock market data, reported cases of influenza illnesses, or the lynx population in Europe, but they are also at the core of the numerical analysis of continuous systems. More precisely, if we run a numerical simulation of (S5) in a time interval [t_0, t_*], the numerical solution is obtained on a discrete equidistant grid with fixed time lag ∆t,

t_i = t_0 + i∆t, i = 0, 1, 2, . . . .   (S10)

In this case, a black-box software toolbox acts as a discrete dynamical system

z_i = T(z_{i−1}), z_0 = x(t_0),   (S11)

that produces the discrete sequence of z_i ≈ x(t_i); this is sampling with noise. For t_i = t_0 + i∆t we have (using (S6), (S7) and the group property)

f(x(t_i)) = f(ϕ^{i∆t}(x(t_0))) = (U_{ϕ^{∆t}}^i f)(x(t_0)).   (S12)

On the other hand, using (S9),

f(z_i) = (U_T^i f)(z_0), where z_0 = x(t_0).   (S13)

Hence, in a software simulation of (S5) with the initial condition z_0 = x(t_0), we have an approximation (U_T^i f)(z_0) ≈ (U_{ϕ^{∆t}}^i f)(x(t_0)), with a fidelity that depends on the numerical scheme deployed in the software, and which is studied in shadowing theory; see e.g. 18, 19 . This can obviously be extended to vector-valued observables: for g = (g_1, . . ., g_d) : X → C^d we define

U_d g = (U g_1, . . ., U g_d)^T.   (S14)

The observables can be both purely physical quantities (e.g. temperature, pressure, energy) and mathematical constructs using suitable classes of functions (e.g. (multivariate) Hermite polynomials, radial basis functions). In particular, if we set f(x) = x, we obtain the full-state observable.
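As a toy numerical illustration of the definition (S9), and of the fact that U_T acts linearly on observables even when T is nonlinear, consider the logistic map (our choice, purely illustrative):

```python
import numpy as np

T = lambda x: 4.0 * x * (1.0 - x)      # nonlinear map on X = [0, 1]
f = lambda x: x**2                     # an observable
g = np.cos                             # another observable

x = 0.3
# the Koopman operator acts by composition: (U_T f)(x) = f(T(x))
Uf = lambda x: f(T(x))
# linearity in the observable: U_T(a f + b g) = a U_T f + b U_T g
a, b = 2.0, -1.0
lhs = a * f(T(x)) + b * g(T(x))
rhs = (lambda y: a * f(y) + b * g(y))(T(x))
print(np.isclose(lhs, rhs))            # True: composition is linear in f
```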

S1.2.1 Spectral decomposition and representation of observables
Spectral decomposition of U is the pillar of both theoretical and practical analysis of dynamical systems in the framework of the Koopman linearizations (S7), (S9). We consider (S9) with F = L^2(X, µ), where X is compact and big enough to contain the states. An eigenpair (λ_j, ψ_j) of the eigenvalue λ_j ∈ C and a nonzero function ψ_j ∈ F (eigenvector, eigenfunction) satisfies

U ψ_j = λ_j ψ_j.

The key to the spectral analysis of the dynamical system is a representation of an observable as a linear combination of the eigenfunctions of U. Since the Koopman operator can have continuous spectrum 20,21 , this might not always be possible. However, an observable h that belongs to a subspace spanned by eigenfunctions can be written as

h = ∑_j ψ_j v_j.   (S15)

Then we can envisage the values of the observable h at the future states T(z), T^2(z), . . . by

h(T^k(z)) = ∑_j λ_j^k ψ_j(z) v_j, k = 0, 1, 2, . . . .   (S16)

For more details see e.g. 22, 23 , 24 . If the dynamics is evolving on an attractor, then all eigenvalues are on the unit circle; however, we are also interested in an off-attractor analysis. The mapping ϕ^t is thus not assumed measure-preserving, and thus U is not necessarily unitary. A detailed analysis of the spectrum in this general case and of the function spaces associated with it can be found in 22 .

S1.3 Numerical computation
For a practical application of the Koopman operator, we need numerical methods to compute its approximate eigenvalues and eigenvectors, and the modes v_j in (S15). We consider the discrete case (S8), (S9), (S14); for numerical computations with a continuous system we invoke the discretization (S11), (S12), (S13). For more details on the Extended DMD and an implementation using the kernel trick see 11 .

S1.3.1 The data
In a typical data driven setting, we will have a sequence of snapshots, where we use the notion of a snapshot as the numerical value of a scalar or vector-valued observable at a specific instance in time. We do not assume explicit knowledge of the mappings F (S5) or T (S8). For example, the snapshots may be obtained from a high speed camera recording of a combustion process in a turbine 25 , or, e.g., from wind tunnel measurements. Less expensive and less restrictive is a numerical simulation of (S5) represented by (S11), (S12), (S13), where we can feed an initial z_0 to a software tool (representing T, or its linearization through a numerical scheme encoded in the software toolbox) to obtain the sequence

f_i = f(z_i) = (U_d^i f)(z_0), i = 0, 1, . . ., M + 1,   (S17)

where f = (f_1, . . ., f_d)^T is a vector-valued (d > 1) observable with the action of U_d defined by (S14). The time resolution ∆t can be set to obtain the desired numerical accuracy. This can then be repeated for many initial values z_0; if a new initial value is defined as z̃_0 = p(z_0), then the new simulation data can be incorporated by adding the new observables f_i ∘ p as new components of f. In a CFD application, f may be the full-state observable, and the entries in the state vectors z_i are then e.g. the values of the pressure and of the components of the velocity at a discrete spatial grid in the physical domain.
Hence, independent of the underlying process, the numerical data values are of the form of a matrix S with columns f_0, f_1, . . ., f_{M+1}, respectively:

S = (f_0 f_1 . . . f_{M+1}) ∈ R^{d×(M+2)}.   (S18)

Although the snapshots are generated by the nonlinear system (S8), (S12), the recursion (Krylov sequence) (S17), driven by the linear operator U_d and numerically evaluated along a trajectory initialized at z_0, motivates seeking a linear operator (matrix) A ∈ C^{d×d} whose action on the available snapshots is given by

A f_i = f_{i+1}, i = 0, . . ., M.   (S19)
Thus, if we set X = S(1 : d, 1 : M + 1), Y = S(1 : d, 2 : M + 2), then such an A would satisfy Y = AX, and this could be extended linearly to the span of the columns of X by A(Xv) = Yv, v ∈ C^{M+1}. The action of A outside the range of X is not specified by the available data.
In general, X and Y are not necessarily extracted from a single trajectory (S17), (S18). The data may consist of several short bursts with different initial conditions, arranged as a sequence of column-vector pairs of snapshots (x_k, y_k), where x_k = f(z_k) and y_k = f(T(z_k)), so that the kth column of Y corresponds to the value of the observable in the kth column of X through the action of U_d, as in (S19); see 26 . Depending on the parameters d and M, the matrices X, Y can be square, tall (more rows than columns), or wide (more columns than rows). Then, analogously to (S19), we can search for a linear transformation A such that Y = AX. Such an A may not exist.
However, we can always define a particular matrix A which minimizes ‖Y − AX‖_F. Clearly, if X^T has a nontrivial null-space, A is not unique; we can choose B so that BX = 0 and thus (A + B)X = AX. One can impose an additional condition of minimality of ‖A‖_F, which yields the well-known solution A = YX^†, expressed using the Moore-Penrose pseudoinverse X^† of X. This additional constraint, although useful for enforcing uniqueness and boundedness, has (to the best of our knowledge) no other useful interpretation in this framework. If we are interested in approximating some eigenvalues and eigenvectors, then, with the restricted information on A (only its action on the range of X is meaningfully defined), we will use the Rayleigh quotient X^† AX = X^†(A + B)X = X^† Y, so this non-uniqueness of A is immaterial. If X is of full row rank, then the optimal A is unique. If X is of full column rank, then A = YX^† satisfies Y = AX exactly. Throughout this paper we assume that X is of full (row or column) rank. However, even when of full rank, X may be severely ill-conditioned so that its numerical rank is lower, which poses nontrivial numerical challenges; these issues are addressed in our recent work 27 .
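A small numerical illustration of the minimal-norm least squares fit A = YX^† and of the fact that only the action of A on the range of X is determined by the data (all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 6, 40
X = rng.standard_normal((d, M))        # full row rank (generically)
A_true = rng.standard_normal((d, d))
Y = A_true @ X

A = Y @ np.linalg.pinv(X)              # minimal ||A||_F least squares solution
C = np.linalg.pinv(X) @ Y              # Rayleigh quotient X^+ A X = X^+ Y
print(np.allclose(A @ X, Y))           # True: Y = AX holds exactly here
```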

S1.3.2 A dual representation and the compression of U
The sequence of vector-valued observables (S18) naturally describes the (discrete) time dynamics, with the column index representing the timestamp, and the row indices representing the scalar observables (e.g. the pressure and the components of the velocity at particular spatial coordinates) that build the vector observable. For instance, in a CFD application, the multi-indexed (2D, 3D) spatial positions are mapped (vectorized) into a column vector in the usual way, and S is generated by the software, column by column.

In a dual interpretation (reading) of the data, we can think of each row in (S18) as a set of values of an observable, sampled over the spatial domain. In other words, we transpose the matrix S in (S18), partition S̃ = S^T as

X^T = S̃(1 : M + 1, 1 : d),   (S20)
Y^T = S̃(2 : M + 2, 1 : d),   (S21)

and consider the action of U on the space F_D spanned by the dictionary of scalar functions D = { f_1, . . ., f_d }, i.e. of the compression Ψ_{F_D} U|_{F_D}, where Ψ_{F_D} is a suitable projection with range F_D. This is the standard construction: we need a representation of U f_i of the form

U f_i ≈ ∑_{j=1}^{d} u_{ji} f_j.   (S22)

With the data at hand, the projection is feasible only in the discrete (algebraic) sense: we can define the matrix U = (u_{ji}) ∈ C^{d×d} column-wise by minimizing the residual ρ_i(z) = (U f_i)(z) − ∑_{j=1}^{d} u_{ji} f_j(z) in (S22) over the states z = z_k, using the values (U f_i)(z_k) = f_i(T(z_k)) = f_i(z_{k+1}). To that end, write the least squares residual

∑_{k=0}^{M} |ρ_i(z_k)|^2 = ∑_{k=0}^{M} | f_i(z_{k+1}) − ∑_{j=1}^{d} u_{ji} f_j(z_k) |^2,   (S23)

which is the L^2 residual with respect to the empirical measure defined as the sum of the Dirac measures concentrated at the z_k's,

µ_emp = ∑_{k=0}^{M} δ_{z_k}.   (S24)

Hence, the columns u_i = (u_{1i}, . . ., u_{di})^T of the matrix representation are defined as the solutions of the least squares problems

u_i = argmin_{u ∈ C^d} ‖ X^T u − Y^T e_i ‖_2,   (S25)

for i = 1, . . ., d, where e_i is the ith column of the identity matrix. The solutions of the above algebraic least squares problems for all i = 1, . . ., d are compactly written as the matrix U ∈ C^{d×d} that minimizes ‖X^T U − Y^T‖_F; for real data, this minimizer is related to the DMD matrix from §S1.3.1 by

U = A^T,   (S26)

and the action of U can be represented, using (S22), as

U (f_1, . . ., f_d) ≈ (f_1, . . ., f_d) U.   (S27)

Similarly as with the computation of A in §S1.3.1, U is uniquely determined only if X^T is of full column rank. Otherwise, we must proceed carefully when using the spectral data of U to infer approximate eigenvalues of U. In particular, if d > M + 1, X^T has a nontrivial null-space, and if Ũ is another least squares solution, then X^T(Ũ − U) = 0. On the other hand, along the linear manifold U + Ker(X^T) = {U + B : X^T B = 0}, the Rayleigh quotient (the matrix representation of the compression of U onto the range of X) X^†(U + B)X = X^† UX ∈ C^{(M+1)×(M+1)} remains uniquely determined. (Note that in the case of complex data we work with the adjoints X^* and Y^*, instead of X^T (S20) and Y^T (S21), to obtain U = A^* = Ā^T, which is then the matrix representation in the basis of the complex conjugate functions f̄_i.) The quality of this finite dimensional approximation of U depends on the selected dictionary of the observables (capturing a nearly invariant subspace that corresponds to the most relevant eigenvalues), as well as on the approximation level of the underlying measure by the empirical one, in particular on the distribution of the z_k's. For related numerical issues see 27 , and for a theoretical study of convergence see 5 .
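In code, the matrix representation U amounts to one ordinary least squares solve with the common coefficient matrix X^T; a minimal sketch (koopman_matrix is our name):

```python
import numpy as np

def koopman_matrix(X, Y):
    """Minimizer U of ||X^T U - Y^T||_F; X, Y are d x (M+1) snapshot matrices."""
    U, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return U    # d x d; each column solves one least squares problem (S25)
```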

S1.3.3 Computation of eigenfunctions and the Koopman mode decomposition (KMD)
Next, we describe the framework for practical computation of the modal decomposition from §S1.2.1. It is the classical Rayleigh-Ritz extraction, based on (S27) and the spectral decomposition of U.
Consider first the case rank(X) = d; then d ≤ M + 1, and U is uniquely defined, column by column, from the solutions of the least squares problems (S25), for i = 1, . . ., d. In this case all d eigenvalues (with the corresponding eigenvectors) are well determined by the data. For technical simplicity we assume that U is diagonalizable, with the spectral decomposition

U = QΛQ^{−1}, Λ = diag(λ_i)_{i=1}^d, Q = (q_1, . . ., q_d), Uq_i = λ_i q_i.   (S28)

We do not assume that the eigenvalues are simple, and in the case of multiple eigenvalues we list them as successive diagonal entries of Λ. Then, for z ∈ X, the approximate eigenfunctions of U, extracted from the span of f_1, . . ., f_d, are

(φ_1(z), . . ., φ_d(z)) = (f_1(z), . . ., f_d(z)) Q.   (S29)

Following §S1.2.1, we seek a decomposition of observables in terms of the φ_i's, similar to (S15). In a numerical simulation, these eigenfunctions are accessible, as well as the observables, only as the tabulated values for z ∈ {z_0, . . ., z_M}. Let now g(z)^T = (g_1(z), . . ., g_d(z)) be a vector-valued observable with components in the span of the dictionary, and let

g(z)^T = (f_1(z), . . ., f_d(z)) Γ = (φ_1(z), . . ., φ_d(z)) Q^{−1} Γ,   (S30)

where the second equality follows by inserting QQ^{−1}. Since (U φ_i)(z) ≈ λ_i φ_i(z), we have

g(T^k(z)) ≈ ∑_{i=1}^{d} λ_i^k φ_i(z) v_i, k = 0, 1, . . .,   (S31)

where the modes v_i are the columns of V = Γ^T Q^{−T}. In the sequel, we use the normalized modes v_i/‖v_i‖_2, where ‖v_i‖_2 φ_i is again an eigenfunction. To evaluate (S31) numerically at z_0, use (S29).
If some eigenvalues are multiple, then with any block-diagonal nonsingular matrix D = ⊕_k D_k that commutes with Λ, we have U = (QD)Λ(QD)^{−1}. (The number of the blocks D_k equals the number of distinct eigenvalues, and their dimensions correspond to the multiplicities.) If we repeat the same construction with Q̃ = QD, then the new approximate eigenfunctions of U are (φ̃_1, . . ., φ̃_d) = (φ_1, . . ., φ_d)D, and the matrix of the modes is Ṽ = VD^{−1}. In the end, we obtain another representation of the sum in (S31).
Using (S26), we conclude that AQ^{−T} = Q^{−T}Λ, i.e. the columns of Q^{−T} are the (right) eigenvectors of A. Hence, for computing the Koopman modes, we can proceed by computing the eigenvectors of A. The eigenvector matrix is necessarily of the form Q^{−T}D^{−1}, with some D = ⊕_k D_k as above.
Consider now the case d > M + 1 = rank(X). We have M + 1 < d Ritz pairs of U, and in the decomposition (S28) the matrix Q is tall rectangular, d × (M + 1), so we cannot immediately insert QQ^{−1} as in (S30). To replace the spanning set f_1, . . ., f_d with (φ_1, . . ., φ_{M+1}) = (f_1, . . ., f_d)Q, we must use QQ^†, although QQ^† ≠ I_d. If the full column rank Q is extracted from the range of X, then QQ^†X = X. We proceed under the assumption that the data snapshots are real; the additional goal is to point out that in that case all computation can be done (and in a software implementation should be done) in real arithmetic, even if the eigenvalues and eigenvectors (the columns of Q) are complex. Since the matrix U is then real as well, the pair Λ, Q computed by the Rayleigh-Ritz method will be closed under conjugation and can be indexed as follows: if λ_i ∈ R, then q_i ∈ R^d, and if ℑ(λ_i) > 0, then λ_{i+1} = λ̄_i, q_{i+1} = q̄_i. Using the identity

(q_i, q̄_i) = (ℜ(q_i), ℑ(q_i)) [1 1; i −i],

we immediately conclude that Q = Q̂J, where Q̂ is real and J is nonsingular. (Here ℜ(·) and ℑ(·) denote the real and the imaginary parts of complex scalars or vectors.) Hence, QQ^† = Q̂Q̂^† is real symmetric and X^T QQ^† = X^T. On the other hand, in a practical computation, we see the function values only at z ∈ {z_0, . . ., z_{M+1}}, and for those values we can use QQ^† instead of QQ^{−1} in relation (S30). The rest is straightforward, yielding the modal matrix V = Γ^T Q^{†T} and QQ^† AQ^{†T} = Q^{†T}Λ. The latter reveals that Q^†, Λ correspond to Ritz pairs of A, extracted from the range of X.
In the next section, we derive the KMD directly from an application of the Rayleigh Ritz procedure to the matrix A.
S1.3.4 KMD from the Krylov decomposition of the data

Define the Krylov sequence of observables K_{M+1} = (f, U_d f, . . ., U_d^M f). The action of U_d on K_{M+1} can be written as

U_d K_{M+1} = K_{M+1} C_{M+1} + E_{M+1},   (S33)

where we have written

U_d^{M+1} f = ∑_{j=0}^{M} α_j U_d^j f + r_{M+1}   (S34)

and E_{M+1} = (0 . . . 0 r_{M+1}); C_{M+1} is the companion matrix whose last column holds the coefficients α_j. In (S34), r_{M+1} is the residual obtained after projecting U_d^{M+1} f onto the subspace spanned by K_{M+1}. Here we assume that M + 1 < d (possibly even M + 1 ≪ d), so that we expect a nonzero residual r_{M+1}. Our earlier full rank assumption on X implies that its rank is M + 1.
If P_{M+1} is the orthogonal projector onto [K_{M+1}], then the compression P_{M+1} U_d |_{[K_{M+1}]} is represented by the companion matrix C_{M+1}. If

C_{M+1} w = λ w,   (S35)

then λ and the function h = K_{M+1} w satisfy

U_d h = λ h + w_{M+1} r_{M+1},   (S36)

i.e. (λ, h) is an approximate eigenpair, with the residual measured in the norm of the function space F. Given the data snapshots (S18) as the only available numerical information, the coefficients α = (α_0, . . ., α_M) in (S34) can be determined using the discretized (algebraic) least squares projection and the notation from §S1.3.1 as follows. With y_{M+1} = f(z_{M+1}), the least squares error to be minimized is

‖ Xα − y_{M+1} ‖_2.   (S37)

If X is of full column rank, then α = X^† y_{M+1} is the unique solution, expressed using the Moore-Penrose pseudoinverse. Hence, for a particular initial z_0, the relation (S33) reads

(U_d K_{M+1})(z_0) = K_{M+1}(z_0) C_{M+1} + E_{M+1}(z_0), with K_{M+1}(z_0) = X.

On the other hand, by (S19), (U_d K_{M+1})(z_0) = AK_{M+1}(z_0), and, as a concrete numerical realization of (S33) on the trajectory starting at z_0, we obtain the Krylov decomposition

AX = XC_{M+1} + E_{M+1}, E_{M+1} = (0 . . . 0 r_{M+1}), r_{M+1} = y_{M+1} − XX^† y_{M+1},   (S38)

where C_{M+1} = X^† AX = X^† Y is the Rayleigh quotient. Note here that the full column rank assumption on X implies X^† X = I. Also note that here we do not have A explicitly formed, nor do we think of it as A = YX^†. Hence, since the residual r_{M+1} = y_{M+1} − XX^† y_{M+1} is unlikely to be zero, we can extract from X only approximate (Ritz) eigenpairs of A. To that end, we first compute the eigenvalues and eigenvectors of C_{M+1}. Under the generic assumption that all eigenvalues of C_{M+1} are algebraically simple, * its spectral decomposition is

C_{M+1} = V_{M+1}^{−1} Λ V_{M+1}, Λ = diag(λ_j)_{j=1}^{M+1},   (S39)

where V_{M+1} is the Vandermonde matrix

V_{M+1} = (λ_j^k), j = 1, . . ., M + 1 (rows), k = 0, . . ., M (columns).   (S40)

In other words, the eigenvectors of C_{M+1} are the columns of the inverse of the Vandermonde matrix, and the columns of XV_{M+1}^{−1} are approximate eigenvectors of A. With an eye towards (S31), we write X = VV_{M+1}, where V = XV_{M+1}^{−1} = (v_1 . . . v_{M+1}), i.e., for k = 0, . . ., M,

x_k = f(z_k) = ∑_{j=1}^{M+1} v_j λ_j^k.   (S41)

It is precisely this structure that yields the spatio-temporal representation in §S1.2.1. Indeed, if we set Φ = K_{M+1} V_{M+1}^{−1}, then λ_j and the jth column Φ_{:j} of Φ are a Ritz pair of U_d, U_d Φ_{:j} ≈ λ_j Φ_{:j}; see (S35), (S36). If Φ_{:j} = (ϕ_{1j}, . . ., ϕ_{dj})^T, then U ϕ_{ij} ≈ λ_j ϕ_{ij}, i = 1, . . ., d. We have, for k = 0, . . ., M,

(U_d^k f)(z) ≈ ∑_{j=1}^{M+1} λ_j^k Φ_{:j}(z),   (S42)

and we can extrapolate this to future steps by increasing k, which amounts to raising the powers of λ_j. Also note that Φ evaluated at z_0 equals precisely V, so that (S41) is a concrete numerical realization of (S42). In an ideal situation, λ_j is a geometrically simple eigenvalue and the ϕ_{ij} are nearly collinear for i = 1, . . ., d. However, this is not essential for the purposes of snapshot representation and prediction, because the action of U_d is component-wise, and each ϕ_{ij} is an approximate eigenfunction of U. This algebraically elegant process has a drawback that becomes apparent when we consider its numerical software implementation. Vandermonde matrices are notoriously ill-conditioned. Moreover, in the case of an off-attractor analysis, the values |λ_i|^j may vary in size over several orders of magnitude, which poses challenging problems for finite precision computation. For that reason, the SVD-based method of Schmid 28 , designated as DMD, has become the main computational device for the KMD. However, we have recently shown in 13 that this companion matrix based approach can be implemented more accurately using the DFT and specially tailored algorithms for the Vandermonde and the related Cauchy matrices.

S1.3.5 Schmid's dynamic mode decomposition (DMD)
The Rayleigh-Ritz procedure outlined in §S1.3.4 is based on a Krylov sequence, which naturally fits the dynamics of a discrete dynamical system driven by U (in the space of observables). However, it yields a numerically ill-conditioned problem, as a consequence of that very representation. From a numerical point of view, the Rayleigh-Ritz procedure is best executed in unitary/orthonormal bases, so that X should be replaced with an orthonormal matrix spanning the same subspace. Since X can be nearly numerically rank deficient (its columns are actually generated by the power method), Schmid 28 used the PCA 29 with a prescribed cutoff threshold to construct the best lower-dimensional subspace (i.e. a POD basis) that captures the data, and then used the Rayleigh-Ritz extraction from that subspace. For the reader's convenience, we briefly review the DMD algorithm; we assume the more general setting where the snapshots are generated with several initial conditions, so that the input data are not necessarily of the form (S17). That is, the matrices X and Y are such that, column-wise, y_k = Ax_k; see 26 . The total number of snapshots (the column dimension) is in this general case denoted by m; in the case of a single trajectory (S17), m = M + 1.
The theoretical underpinning is the classical matrix theorem on best low rank approximations (Eckart-Young-Mirsky): if X = UΣV^* is the SVD of X, with singular values σ_1 ≥ σ_2 ≥ . . ., then

min_{rank(B) ≤ r} ‖X − B‖_2 = ‖X − U_r Σ_r V_r^*‖_2 = σ_{r+1}.   (S43)
* Since C_{M+1} is an unreduced Hessenberg matrix, its eigenvalues must be of geometric multiplicity one. If C_{M+1} has multiple eigenvalues, then its generalized eigenvector matrix is the inverse of the confluent Vandermonde matrix generated by the distinct eigenvalues. The Jordan structure of each multiple eigenvalue consists of a single Jordan block.
Hence, we can replace X with its best low rank approximation obtained by truncating its SVD, X = UΣV^* ≈ U_r Σ_r V_r^*, where U_r = U(:, 1 : r), V_r = V(:, 1 : r), and Σ_r = diag(σ_i)_{i=1}^r contains the largest r singular values of X. In brief, U_r is the POD basis for the snapshots x_1, . . ., x_m. The index r is selected so that the approximation error (S43) is below a user-prescribed threshold value, and it is a numerical rank 32 of X. Now, DMD uses the range of U_r for the Rayleigh-Ritz extraction. The Rayleigh quotient

A_r = U_r^* A U_r   (S44)

is computed, using Y = AX and the truncated SVD of X, as

A_r = U_r^* Y V_r Σ_r^{−1},   (S45)

which is suitable for the data driven setting because it does not use A explicitly. Clearly, (S44), (S45) only require that Y = AX; it is not necessary that Y is a shifted X as in §S1.3.4. Each eigenpair (λ, w) of A_r generates the corresponding Ritz pair (λ, U_r w) of A. This is the essence of Schmid's method 28 , summarized in Algorithm S1 below.
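Algorithm S1 appears as a display in the original; the following sketch reproduces its main steps under the notation above (the truncation parameter tol and the function name are ours):

```python
import numpy as np

def schmid_dmd(X, Y, tol=1e-10):
    """Schmid's DMD for snapshot pairs with Y = A X (sketch of Algorithm S1).

    Returns the Ritz values lam and Ritz vectors Z (columns), extracted from
    the POD subspace of X truncated at relative singular value level tol.
    """
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))            # numerical rank of X
    Ur, sr, Vr = U[:, :r], s[:r], Vh[:r, :].conj().T
    Ar = Ur.conj().T @ Y @ (Vr / sr)           # A_r = U_r^* Y V_r Sigma_r^{-1}
    lam, W = np.linalg.eig(Ar)                 # eigenpairs of A_r
    Z = Ur @ W                                 # Ritz vectors of A
    return lam, Z
```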

S1.3.6 Refined Rayleigh-Ritz Data Driven Modal Decomposition (RRRDDMD)
Recently, in 12 , we revisited the DMD and introduced several modifications. First, we showed that the residuals ‖Av_i − λ_i v_i‖_2 can be computed in a data driven scenario as well. This allows for selecting good Ritz pairs, with small residuals, which proved to be the key for selecting good modes for the prediction algorithm; see §S1.4.3. Further, we showed that the Ritz vectors can be improved by using the well-known refinement technique, which we have adapted to the data driven setting of the DMD.
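The residual computation rests on the identity A(U_r w) = Y V_r Σ_r^{−1} w, so A is never formed explicitly. A plain SVD-based sketch (in 12 this is implemented with additional numerical safeguards; the function name and the normalization below are ours):

```python
import numpy as np

def ritz_residuals(Y, lam, W, Ur, Vr, sr):
    """Data-driven residuals ||A z_i - lam_i z_i||_2 / ||z_i||_2.

    lam, W     : eigenvalues and eigenvector matrix of the Rayleigh quotient A_r
    Ur, Vr, sr : truncated SVD factors of X, so that z_i = U_r W[:, i]
    """
    AZ = Y @ (Vr / sr) @ W          # = A (U_r W), evaluated from data only
    Z = Ur @ W
    return np.linalg.norm(AZ - Z * lam, axis=0) / np.linalg.norm(Z, axis=0)
```

Ritz pairs whose residual exceeds a threshold (η_r = 0.075 in the influenza example) are discarded before the KMD is used for prediction.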

S1.3.7 Spatio-temporal representation of the snapshots
In general, a DMD algorithm will compute r ≤ m Ritz vectors (modes) with the corresponding Ritz values, i.e. the eigenvalues of A_r. In particular, in Schmid's DMD, r may be considerably smaller than m, as e.g. in the case of an off-attractor analysis of a dynamical system, after removing peripheral eigenvalues; see [12, §4.1]. In any case, the most important coherent structures of the process are determined by a subset of the modes, so we may want to express the available data snapshots by r < m modes. It is desirable that such modes can represent the snapshots reasonably well, and that they have small residuals, which, as we shall see below, is essential for the prediction of the evolution of the sequence (S17), with m = M + 1. Assume that we have such a selection of r numerically linearly independent modes and, to ease the notation, assume that we have indexed the Ritz pairs so that the selected ones are indexed with j = 1, ..., r. With this setup, a modal decomposition of f_k = f(z_k) can be written as

f_k ≈ ∑_{j=1}^r λ_j^k α_j v_j, k = 0, ..., M.   (S46)

If r = M + 1 = m, then the coefficients α_1, ..., α_m can be computed from the condition that (S46) holds exactly at k = 0, i.e. f_0 = ∑_j α_j v_j (cf. (S47)), so that this reconstruction is exact for k = 0, ..., M. In matrix notation, if we define V_r = (v_1 ... v_r), the coefficients solve a linear system with V_r. To compensate for the truncation error when r < m, the coefficients α_j can be recomputed by solving the weighted least squares problem

(α_j)_{j=1}^r = arg min_{α_j} ∑_{k=0}^M w_k^2 ‖√Ω (f_k − ∑_{j=1}^r λ_j^k α_j v_j)‖_2^2,   (S49)

where w_k ≥ 0 are weights that can be used to emphasize the importance of some time indices or to introduce forgetting factors, Ω is a positive definite matrix,† and √Ω stands for the positive definite square root or the Cholesky factor of Ω. For numerical methods for this optimization problem we refer to 44, 14; the minimizer also admits an explicit formula in terms of the pseudoinverse of the weighted mode matrix.
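The weighted problem can be set up as an ordinary least squares problem by stacking the weighted blocks. A minimal NumPy sketch with Ω = I (the function and variable names are ours):

import numpy as np

def kmd_coefficients(F, lam, V, weights):
    # Weighted LS fit of the KMD coefficients:
    #   min_alpha  sum_k  w_k^2 || f_k - sum_j lam_j^k alpha_j v_j ||_2^2
    # F: (d, M+1) snapshots f_0..f_M; lam: (r,) Ritz values; V: (d, r) modes.
    d, m = F.shape
    blocks, rhs = [], []
    for k in range(m):
        blocks.append(weights[k] * (V * lam**k))  # d x r block, columns lam_j^k v_j
        rhs.append(weights[k] * F[:, k])
    A = np.vstack(blocks)                         # stacked weighted system
    b = np.concatenate(rhs)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# Forgetting factors: e.g. weights = rho ** np.arange(M, -1, -1) with 0 < rho < 1
# emphasizes the most recent snapshots.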

S1.4 Global Koopman prediction algorithm
Here we give the details of the newly proposed algorithm, designated as the Global Koopman Prediction (GKP) algorithm. The basic idea is to extract the intrinsic eigenvalues and modes of the dynamical system under consideration, and then to predict the evolution of the system by using the principles outlined in §S1.2.1, §S1.3.3. In order to reveal the relevant eigenvalues and the corresponding modes that capture the dynamics of the system on a larger time interval and not only locally, one has to use large training sets. This strategy is at risk if the algorithm is oblivious to unusual and unexpected changes (perturbations) that can be classified as Black Swan events. If such data are used in a learning window, the long term prediction is doomed to fail. We use the numerically computed spectral information on U to develop an additional device that equips the algorithm with a litmus test for detecting Black Swan events (a posteriori, of course) and, moreover, with a retouching scheme to restore the global prediction capability (see §2.2 in the main text). Furthermore, an important feature is that the data snapshots are lifted into a Hankel matrix structure, as described in §S1.4.2.

S1.4.1 Setting the scene -the prediction task
Consider a discrete dynamical system z_{k+1} = T(z_k) that is accessible through a sequence of snapshots

f_0, f_1, f_2, ..., f_k, ...,   (S51)

where d ≥ 1 is the dimension of the scalar or vector-valued system observable f : X → R^d, and f_k = f(z_k) is its value for the (possibly unknown) state z_k, with the time stamp t_k, k = 0, 1, 2, .... The goal is to learn the dynamics from the available data and then to predict the future values. More precisely, suppose that the present time moment is t_{p−1} and that, up to that moment, the data is readily available; we seek a prediction of the data values at the next time moments t_p, t_{p+1}, ..., t_{p+τ(p)}. We call these future time moments the prediction moments. The prediction will be based on a sliding window of size w in the sequence (S51), i.e. we will use the f_k's starting from the index b = p − w, which defines the active window

W(p, w) = (f_b, f_{b+1}, ..., f_{p−1}).

In terms of the system mapping T, these values can be represented as

f_{b+k} = f(T^k(z_b)), k = 0, ..., w − 1.   (S52)

S1.4.2 Lifting the data into a Hankel matrix structure and the H-DMD
The key for a successful application of the prediction framework from §S1.2.1 is that the finite dimensional numerical approximation from §S1.3.3 captures the spectral information accurately enough. To that end, we adopt the Hankel-DMD approach of 45, 46. For an active window W(p, w), first conveniently split w = m_H + n_H, then lift the observables into the higher dimensional space R^ℓ, ℓ = d·n_H, and arrange them as columns of an ℓ × (m_H + 1) (block) Hankel matrix as follows:

H = (h_1 h_2 ... h_{m_H+1}), h_i = (f_{b+i−1}^T, f_{b+i}^T, ..., f_{b+i+n_H−2}^T)^T ∈ R^ℓ.   (S53)

We can think of the h_i's as the values of the vector-valued observable h : X → R^ℓ composed with the powers of T analogously to (S52), i.e.
h_i = h(T^{i−1}(z_b)), i = 1, ..., m_H + 1, which can be interpreted as a Krylov sequence for the Koopman operator acting on the lifted observable h. The techniques from §S1.3 now apply in this new setting simply by using h instead of f, and ℓ instead of d. The matrix H plays the role of the snapshot matrix‡ S, and we have X = H(:, 1 : m_H), Y = H(:, 2 : m_H + 1). We will attempt predicting the h_k's, and from the obtained results extract the predictions of the original observable f. The starting points are the DMD of H (the H-DMD) and the corresponding KMD. Since the KMD changes with the sliding active data window W(p, w), we use the term active KMD (AKMD) when we refer to the computation used for prediction.
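As a concrete illustration of (S53), a small sketch follows; the function name and the zero-based column convention are ours.

import numpy as np

def hankel_lift(F, nH, mH):
    # Arrange the snapshots of the active window (columns of F, each in R^d)
    # into the (d*nH) x (mH+1) block Hankel matrix of (S53): column i stacks
    # f_{b+i-1}, f_{b+i}, ..., f_{b+i+nH-2} on top of each other.
    d, w = F.shape
    assert w == nH + mH, "the active window must contain w = nH + mH snapshots"
    cols = [F[:, i:i + nH].reshape(-1, order="F") for i in range(mH + 1)]
    return np.column_stack(cols)

# H = hankel_lift(F_window, nH, mH)
# X, Y = H[:, :mH], H[:, 1:mH + 1]    # the data pair for the H-DMD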

S1.4.3 Prediction -the basic idea and its limitations
Suppose that the DMD algorithm, applied to (S53), has extracted r = m_H Ritz pairs (λ_j, v_j), and that the Ritz vectors span the range of H. Then we can determine the coefficients α_j such that

h_k = ∑_{j=1}^{m_H} α_j λ_j^{k−1} v_j + δ_{k,m_H+1} r_{m_H+1}, k = 1, ..., m_H + 1,

where δ_{k,m_H+1} is the Kronecker delta symbol, and r_{m_H+1} is the residual of the orthogonal projection of h_{m_H+1} onto the range of X. This means that the decomposition of the snapshots in terms of the modes is exact, except for the last one, which may not belong to the range of X; the residual r_{m_H+1} represents its decomposition error. For details we refer to [13, §2.4, §3.2].
If we want to extend the above relation beyond the index k = m_H + 1 (i.e. to extrapolate into the future the evolution of the sequence h_k), we can apply the appropriate power of A and use the approximation Av_j = λ_j v_j + r_j ≈ λ_j v_j. This is a fairly simple operation: it amounts to increasing the power of the λ_j's. Of course, the residuals will accumulate with each such iteration, e.g.

A^2 v_j = A(λ_j v_j + r_j) = λ_j^2 v_j + λ_j r_j + A r_j.
So, using the first sums above (and ignoring the residual terms) to predict the future of the h_k's has limited range, except in the case of a small r_{m_H+1} and small residuals r_j, which are not too much amplified under the action of the powers of A. Hence, it is desirable to have a KMD that uses only the selected modes corresponding to the Ritz pairs with small residuals, and that we can have an accurate decomposition of the type (S48) using the selected modes. This selection is possible in data driven scenarios using the methods from 44 and 12, outlined in §S1.3.6. The desire to have a high fidelity representation of the data snapshots with as few modes v_j as possible is motivated by revealing latent coherent structures of e.g. the flow field; small residuals allow for extrapolation of the dynamics forward in time.
If our goal is solely the prediction, the weight factors w_k in (S49) can be tuned to favor the most recent snapshots, and the weighting matrix Ω can emphasize particular block rows in the h_k's; see §S1.3.7 and [14, §3]. In particular, with a suitable choice of W and Ω, we can focus the reconstruction of the h_k's on the present snapshot f_{b+n_H+m_H−1} = f_{t_{p−1}}.
Since the f_k's, starting from the past time stamp index b and ending at the present index p − 1 = b + n_H + m_H − 1, appear in the last block row of H (see (S53)), the corresponding formulas are obtained by taking the last d components of the Ritz vectors v_j. To that end, define ṽ_j = v_j((n_H − 1)d + 1 : n_H d) as the trailing d components of v_j. Hence, from the AKMD of the lifted observables, we read off an approximate decomposition of the snapshots f_k as

f_k ≈ ∑_{j=1}^r λ_j^k α_j ṽ_j.   (S59)

For indices up to the present index p − 1, (S59) is a reconstruction of the acquired data, while for k = p, p + 1, ..., p + τ(p), (S59) is an extrapolation of the AKMD, and it gives us predictions for future data snapshots. We say that f_{p+τ}, τ > 0, is the prediction of the observable at the lead time τ.
If the number of rows n_H of the Hankel matrix is smaller than the number of columns m_H (n_H < m_H), then the KMD gives some sort of regression function for the data in the reconstruction window b + n_H, ..., b + n_H + m_H − 1. The reason why the first part of the active window is not declared as the reconstruction window is that, when the KMD of the form (S59) is used, the Koopman eigenfunction values are determined such that for t = 0 the sum on the right hand side equals the first snapshot. When applied to the Hankel matrix, this means that the data at the beginning of the active window, b, ..., b + n_H − 1, which form the first column of the Hankel matrix, are reconstructed with high accuracy. On the other hand, if n_H ≥ m_H and if the Hankel matrix has full column rank, the data in the whole active window are reconstructed with high accuracy.
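The extrapolation step of (S59) then amounts to evaluating the modal sum with increasing powers of the Ritz values; a minimal sketch (names ours):

import numpy as np

def kmd_predict(alpha, lam, V_tail, k):
    # Evaluate the modal sum of (S59) at time index k: sum_j lam_j^k alpha_j v~_j.
    # V_tail holds the trailing d components v~_j of the Ritz vectors (columns).
    return (V_tail * lam**k) @ alpha

# For lead times tau = 1, ..., T past a window ending at index m:
# forecasts = [kmd_predict(alpha, lam, V_tail, m + tau).real for tau in range(1, T + 1)]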

S1.5 A worked example
We now illustrate the key elements of the procedure outlined in §S1.4.2, §S1.4.3 using a worked example. The problem under study is the spread of the coronavirus disease (COVID-19). The data consist of reported cumulative daily cases in some European countries. It should be stressed that the algorithm uses only the raw data; no other information on the nature of the data or on modelling parameters is assumed. Further, the data itself is clearly not ideal, as it depends on the reliability of the tests, testing policies in different countries (triage, number of tests, reporting intervals, reduced testing during the weekends), contact tracing strategies, the number of asymptomatic transmissions etc. Moreover, using the data from different countries in the same vector observable poses an additional difficulty for a data driven revealing of the dynamics, because the countries independently and in an uncoordinated manner impose different restrictions, thus changing the dynamics.
For an analysis of a particular country, it is better to define the observables as the reported cases at the local level, e.g. provinces, counties, cities with similar conditions. Clearly, the dynamics of the spread of the infection depends on the population density as well. This is best seen e.g. by comparing the heat map of the reported cases in the USA with the image of the USA from space at night. In the numerical examples in this section, we purposely use data from different countries to make the prediction task more challenging, which makes it an excellent stress test example. Our goal with this example is twofold. First, we show the potential and the limits of the proposed prediction algorithm. Second, we discuss technical details of the computational scheme.
We use the following datasets:

DS1 The numbers of reported COVID-19 cases in Germany, France and the United Kingdom in the period February 29 to November 18. The ordered triple of reported cases is an observable.

DS2 The dataset DS1 augmented by the numbers of reported cases in Denmark, Czechia, Slovenia, Austria and Slovakia.

DS3 The numbers of reported COVID-19 cases in a selected European country in the period February 29 to November 18, augmented with two sequences of filtered data.
The test of the prediction algorithm runs on the lifted data (265 observables from R^d: d = 3 for DS1 and DS3; d = 8 for DS2), i.e. on the columns of the 94d × 172 Hankel matrix H = (h_1 h_2 ... h_172) (see (S53)), with the block partition 94 × 172, each block being d × 1. The matrix H is used as a historical record, encoding the period February 29 - November 18, and we run the prediction algorithm starting from some past index and test its accuracy by comparison with the historical data. We use a simple increasing window starting at the index b = 1 and ending at p − 1, where we choose different values of p. Then we predict the next τ + 1 values from the moment p on.
In the first experiment, we use DS1 and attempt prediction for 35 days ahead.We take the first 40 columns of H as available data and set X = H(1 : 282, 1 : 39), Y = H(1 : 282, 2 : 40).(This corresponds to the period February 29 -July 10, and the prediction for 35 days ahead starts July 11.)The prediction relative error is shown on the left panel in Figure S8.The right panel shows the Koopman Ritz values computed in the algorithm; note that the algorithm has revealed the eigenvalue 1, and that all other Ritz values are inside the unit circle.The quite satisfactory prediction skill (recall, no information whatsoever on the nature of the data is used) and well behaved Ritz values are related to the nature of the dynamics of the infection during the summer.
In this example, it is instructive to check the Government Response Stringency Index (GRSI) 47 § for the entire time interval involved in the computation. The three indices behave differently: it can be noticed that France had sharper changes than Germany and the United Kingdom (e.g. around June 20), and sometimes a similar increase of stringency but a week earlier than the other two countries (e.g. beginning to mid March). On the other hand, the GRSI for Germany and the United Kingdom were not that much different throughout the observed period; see the left panel in Figure S9. It should be noted, however, that the GRSI does not measure the quality of the implementation of the imposed restrictions, and that for a particular country it does not necessarily indicate the trends in the dynamics of the disease spreading.

Now, in the same interval, we add new observables by including the data from five more countries: Denmark, Czechia, Slovenia, Austria and Slovakia. Hence, the matrix H is 752 × 172. The prediction errors for a 28 days prediction are given in the left panel in Figure S10. Remarkably, the computed Koopman Ritz values nearly match the ones computed in the first test with only three countries; see the right panel in Figure S10. Note that even with the differences shown in Figure S9, the main trend of the implementation of the measures is similar. This might help explain the robustness of the spectrum indicated in Figure S10, where such differences do not seem to lead to a drastic change in the spectral behavior. We believe this indicates the robustness of our methodology.

We proceed with the numerical experiment using the dataset DS1. We further expand the learning window and then consider three consecutive steps with h_1:105, h_1:106, h_1:107. The relative errors for a 35 days prediction are shown in the first row of Figure S11. In the context of policy changes that affected the dynamics of the infection spreading, and the fact that the algorithm is purely data driven, the results can be considered satisfactory: the error is below five percent for 16 days and below ten percent for three weeks for all three countries (first graph), below six percent for almost the entire 35 days period (second graph), and below five percent for more than three weeks and below ten percent for 30 days (third graph).
In the next test, we use the data windows h_1:132, ..., h_1:140 for 28 days predictions for the time intervals October 11 - November 7, October 12 - November 8, ..., October 19 - November 15. The results are shown in Figure S12. Now we go to the dataset DS3. The focus is on some computational details related to the two main ingredients, the DMD and the KMD. The dataset DS3 is constructed by a single and a double application of the Savitzky-Golay filter to the Germany data, so that d = 3. (The filter uses a cubic polynomial and a data window of width 5. On the left boundary, we add zero values, and on the right boundary we leave the original data. The filtered data differ from the original by at most five to ten percent relative error in the first 30 days and by at most O(10^−3) afterwards.) The purpose of the test is to create a situation that one could encounter when deploying the Koopman/DMD framework for data driven prediction or for a discovery and analysis of latent coherent structures.
We recall that a DMD algorithm uses a rank revealing decomposition with some threshold value, and that the number of the computed Ritz pairs may be smaller than the column dimension of the matrix X; see §S1.3.5. Then the reconstruction formula (S47) for the coefficients is not valid, and one has to satisfy (S46) by solving the least squares problem

(α_j)_{j=1}^r = arg min_{α_j} ∑_k ‖h_k − ∑_{j=1}^r λ_j^k α_j v_j‖_2^2,

where the reconstruction error is not necessarily small, and it introduces noise into the extrapolation process outlined in §S1.4.3. (Recall, if we have a full set of modes, then the reconstruction is perfect and the only error is from the finite precision arithmetic.) The prediction skill based on this KMD is shown in Figure S13. Now we change the reconstruction strategy and state the problem as the weighted least squares problem (S49), with the weights of the four most recent snapshots set to one, and the weights of all other snapshots set to the machine round-off unit ε ≈ 2.2·10^−16. The effect of weighting is best seen by comparing the middle graphs in Figures S13 and S14. In the case of weighting, the reconstruction of almost all leading snapshots is bad, but the ones more important for the prediction task have a much smaller error.

S1.6 Comments on SIR type models
The key coefficient in SIR-type models, the so-called reproduction number R_0, can be estimated using Koopman operator techniques. Namely, the classic SIR model reads

ds/dt = −β s i, di/dt = β s i − ν i, dr/dt = ν i.

Under the condition s = 1 (an infinite reservoir of susceptibles), exponential growth happens when β − ν > 0, i.e. β/ν > 1. The reproduction number is defined by R_0 = β/ν. Thus, R_0 is related to the coefficient of exponential growth. Since we know that the eigenvalues of the linearized system are eigenvalues of the Koopman operator, the largest real Koopman eigenvalue is related to R_0. Another number commonly estimated for use in tracking of epidemics is the instantaneous reproduction number, defined by 49 as

R_t ≈ 1 + D (d/dt) ln i(t),

where D is the duration of the infectiousness. For small ν, and constant s = 1, we have R_t ≈ 1 + (β − ν) D = β/ν = R_0 for D = 1/ν, providing another connection between the SIR models, the Koopman operator spectrum and reproduction numbers.
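The discrete-time version of this connection can be spelled out explicitly (a sketch in our notation; ∆t denotes the sampling step and λ the leading real eigenvalue extracted by, e.g., the DMD):

i(t) = i(0) e^{(β−ν)t}  ⟹  i_{k+1} = λ i_k, λ = e^{(β−ν)∆t},

so that β − ν = (ln λ)/∆t and, with D = 1/ν,

R_0 = β/ν = 1 + (β − ν)/ν = 1 + (ln λ)/(ν ∆t).

Hence an estimate of the leading Koopman eigenvalue of the sampled case-count dynamics, together with the recovery rate ν (equivalently, the duration D), yields a data-driven estimate of R_0.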

S1.7 The framework for prediction and Black Swan event detection
In the GKP algorithm, the spectral information is extracted from a sequence of active windows: for each window, the snapshots are arranged in a Hankel matrix whose columns define a new set of snapshots, and approximate eigenvalues and eigenvectors are computed using Algorithm S2. In the case when the dynamics of the system is not coupled with some other dynamical system, we expect that, in the absence of unexpected disturbances, the AKMD will capture at least the basic trends of the dynamics. In particular, the spectral radius of the active window (the maximal absolute value of the selected Ritz values) should not change too much. Further, the DMD algorithm should compute Ritz pairs with reasonably small residuals. This is plausible, because the sequence (S17) can be, at any moment, interpreted as an excerpt from a power method generated sequence, and the power method in the limit reveals the absolutely dominant eigenvalues. However, if the dynamical system data are hit by a disturbance, this can be recognized e.g. by detecting the active windows whose spectral radii change significantly, or by the absence of Ritz pairs with small residuals (see Figures S15b, S15c, S15d). This enables us to pinpoint the discrete time moments/subintervals at which disturbances interfere with the original dynamics. For the chosen reference interval I, if max_j |λ_j| ∉ I or if there are no Ritz pairs with reasonably small residuals, we flag the observed active window as a window which possibly contains a Black Swan event. By sliding the active windows along the computational domain and using the flagged windows, we determine the time sub-intervals containing disturbances whose dynamics is not well captured by the corresponding AKMD models.
The reference interval I can be determined (and dynamically adjusted) e.g. by first computing the Ritz values for many active windows, and then, by trial and error, including statistical reasoning and information theoretic techniques (see e.g. 50), learning to differentiate between the acceptable interval I for the spectral radii and the values that are considered outliers. This is best done on a case-by-case basis.
This scheme can be implemented with different sizes of the Hankel matrices (see §S1.4.2) and with different sizes of active windows; the Black Swan event intervals can then be determined by taking into account all flagged intervals.
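The flagging criterion just described can be condensed into a few lines; the interval and thresholds below are placeholders, to be tuned as discussed above.

import numpy as np

def flag_black_swan(lam, residuals, I=(0.9, 1.1), eta_r=0.075):
    # Flag an active window as possibly containing a Black Swan event:
    # either no Ritz pair has an acceptably small residual, or the spectral
    # radius of the selected Ritz values falls outside the reference interval I.
    good = residuals < eta_r
    if not np.any(good):
        return True                      # no reliable spectral information
    rho = np.max(np.abs(lam[good]))      # spectral radius of the window
    return not (I[0] <= rho <= I[1])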

S1.7.1 The retouching trick to process Black Swan events
If the Black Swan event data are included in the training set, the dynamics of the original system (decoupled from this disturbance) cannot be revealed, which means that the prediction of the dynamics after the Black Swan event will be damaged, if not impossible.However, instead of using the original data we can replace them with the data obtained by the prediction based on the information from the previous active windows, preceding flagged intervals.This is illustrated in (S62): the perturbed value (U k f)(z 0 ) + ε k is replaced with (U k f)(z 0 ), which is a predicted value based on the previous undisturbed data.The same can be done for the remaining data in the flagged window.
The prediction after the Black Swan event then becomes more stable and in most cases quite successfully predicts data after the Black Swan event; see §2.2 and Fig. 2 in the main text.
There are many variations of this scheme. For instance, it could happen that the Black Swan interval detected in the algorithm is too long and possibly unrealistic. Therefore we limit the length of the interval on which the data are replaced, in order to prevent the algorithm from changing the dynamics too much. Then we apply the algorithm again and check whether the replacements result in a decrease of the maximal absolute value of the eigenvalues over the active windows. The whole process can be repeated several times to remove any remaining Black Swan events that were not taken into account in the previous steps. Finally, the retouched data, cleaned from the Black Swan event disturbances, are used for the prediction.
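A sketch of the retouching step, with the prediction of the undisturbed evolution abstracted into a callable; the names and the interface are ours, and F is assumed to be a NumPy array of snapshot columns.

def retouch(F, flagged, predict_from_past, max_len=None):
    # Replace flagged (Black Swan) snapshots with values predicted from the
    # data preceding the flagged interval, cf. (S62):
    #   (U^k f)(z_0) + eps_k  is replaced by the predicted  (U^k f)(z_0).
    # predict_from_past(k) is assumed to return the prediction for index k,
    # e.g. from the AKMD of the last undisturbed active window.
    F = F.copy()
    if max_len is not None:
        flagged = flagged[:max_len]      # keep the retouched interval short
    for k in flagged:
        F[:, k] = predict_from_past(k)
    return F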

S1.7.2 Local Koopman prediction
In some cases, the global prediction algorithm is not feasible. For instance, when we have just started collecting the data, we do not have enough information for a GKP analysis; or the GKP recognizes the beginning of a Black Swan event, as discussed at the beginning of §S1.7. Then the available data cannot be used for prediction, because the dynamical system has changed. The new model must be built from scratch, as if we had just started getting new data. The best we can do is to create a new local algorithm that needs less data, but also has a much shorter reach into the future.
In the Local Koopman Prediction (LKP) algorithm we change the size of the active window depending on the success of the previous prediction. The idea is to assimilate as much acquired data as possible, so we let the Hankel matrix dimensions vary with the prediction moment, i.e. n_H = n_H(p) and m_H = m_H(p). We also choose a minimal Hankel matrix dimension, and we start the predictions with such a minimal Hankel matrix, i.e. for the first prediction moment p = p_0 we set

n_H(p_0) = n_{H,min}, m_H(p_0) = m_{H,min}.   (S64)

When the data f_p at the prediction time t_p becomes available, we can compute the error of the prediction f̂_p, in a suitable norm, as

ε_p = ‖f_p − f̂_p‖ / ‖f_p‖.   (S65)

At other prediction moments, if the prediction error (S65) is smaller than the referent error ε_ref, we assimilate the newly acquired data into the active window by increasing the Hankel matrix size,

n_H(p) = n_H(p − 1) + 1, or m_H(p) = m_H(p − 1) + 1.   (S66)

The procedure is summarized in the following algorithm.

Require: data snapshots f_0, f_1, ..., f_end; indices of the time moments for the beginning and the end of the local prediction, k_0, k_f (optionally); minimal Hankel matrix dimensions n_{H,min}, m_{H,min}; error threshold ε_ref; lead time τ_l.
Ensure: predicted system observables f̂_{n_{H,min}+m_{H,min}}, f̂_{n_{H,min}+m_{H,min}+1}, ..., f̂_{end+τ_l} (or f̂_{k_0}, ..., f̂_{k_f+τ_l}).
1: if k_0 and k_f are not defined then
2:   k_0 = n_{H,min} + m_{H,min}, k_f = end
3: end if
4: for p = k_0, k_0 + 1, ..., k_f (let t_{p−1} be the time of the last known data) do
5:   if the error ε_p (S65) is larger than the referent error ε_ref then
6:     Resize the Hankel matrix to the minimal size (S64).
7:   else
8:     Increase the size of the Hankel matrix using (S66).
9:   end if
10:  Form the Hankel matrix (S53) and compute the AKMD.
11:  Using the AKMD, extrapolate to obtain the prediction (S59).
12: end for
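A compact sketch of the loop above, with the AKMD computation abstracted into a callable; the names are ours, and only one-step-ahead predictions are scored.

import numpy as np

def local_koopman_prediction(F, nH_min, mH_min, eps_ref, predict_next):
    # LKP loop: grow the Hankel window while predictions are good, shrink
    # to the minimal size after a failure. predict_next(window, nH, mH)
    # stands in for "form H (S53), compute the AKMD, extrapolate via (S59)".
    nH, mH = nH_min, mH_min
    pred, preds = None, {}
    for p in range(nH_min + mH_min, F.shape[1]):
        if pred is not None:
            # score the previous forecast, cf. (S65)
            err = np.linalg.norm(F[:, p - 1] - pred) / np.linalg.norm(F[:, p - 1])
            if err > eps_ref:
                nH, mH = nH_min, mH_min      # resize to the minimal size (S64)
            else:
                mH += 1                      # assimilate the new data (S66)
        window = F[:, p - nH - mH : p]       # active window ending at t_{p-1}
        pred = predict_next(window, nH, mH)  # one-step-ahead prediction of f_p
        preds[p] = pred
    return preds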

S2.1 Case study: The Lorenz system

Chaotic dynamics is commonly reduced to the Butterfly Effect: exponential sensitivity to initial conditions, and therefore ultimate exponential divergence of prediction from the true trajectory. However, this neglects the finer aspects of chaotic dynamics that are exhibited in the most paradigmatic of chaotic systems, the Lorenz dynamical system, modeled by the Lorenz equations

dx/dt = σ(y − x), dy/dt = x(ρ − z) − y, dz/dt = xy − βz,   (S67)

with σ = 10, β = 8/3, and ρ = 28, for which the system exhibits chaotic behavior. For understanding the prediction capability for the Lorenz system, more important than the long term exponential divergence of trajectories is the short term divergence typically induced by switching between the two wings of the butterfly (see Figure S15a). At the core of our approach is the observation that, while inside one of the butterfly wings, the system behaves in a predictable manner. The exponential divergence is ultimately due to switching between the two butterfly wings. The first time such a switch happens, the situation resembles a Black Swan event 15 (although there is an ontological difference, highlighted in the main text): the trajectory suddenly wanders off to a different part of the state space and starts exploring there, until another switch happens, taking it back to the known part of the state space. This fits our paradigm of splitting the state space into domains over which prediction is possible and monitoring for the switch between such domains.
The current theory is thus an extension of the ideas in 20, where deterministic components of stochastic dynamical systems were extracted using Koopman operator methods, and 52, where it was shown that the Lorenz system can be described well by a set of linear evolution equations driven by a stochastic term that induces switching. In both of these, the detection of the switching moment and the precise interaction of local and global behavior on subdomains of the state space were not accounted for; we address them here.
We use (S67) to test the prediction potential of the KMD. We have generated data using a numerical simulation (ODE solver) of (S67) with the time resolution δt = 0.01s, thus obtaining a discrete dynamical system. For a present moment (index) p, an active window of length w is selected as in §S1.4.2, with b = p − w, and the selected data are lifted into the Hankel structure. The KMD is computed for the corresponding vector valued observables h_i, and used for their prediction as explained in §1 of the paper. For computing the KMD for the global prediction algorithm, active windows of size w = 400 and 300 × 100 Hankel matrices are used. By sliding the active windows along the computational domain we get predictions at different times. When the actual data and the prediction errors become available, we either continue forecasting with the same KMD, or a switching device invokes the local prediction scheme with 21 × 11 Hankel matrices if the error is above a preset threshold. The prediction then uses a shorter forecast lead time, and the predicted data are based on a sequence of local KMDs. The local algorithm keeps increasing the active windows and the lead time, whilst monitoring the error; see the Methods section.
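The data generation for this experiment can be reproduced with a standard ODE solver; a sketch follows (the initial condition and the solver tolerances are our choices):

import numpy as np
from scipy.integrate import solve_ivp

sigma, beta, rho = 10.0, 8.0 / 3.0, 28.0   # the parameter values used above

def lorenz(t, u):
    x, y, z = u
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.arange(0.0, 40.0, 0.01)             # time resolution dt = 0.01
sol = solve_ivp(lorenz, (t[0], t[-1]), [1.0, 1.0, 1.0],
                t_eval=t, rtol=1e-9, atol=1e-9)
F = sol.y                                   # 3 x N snapshot matrix; x1 = F[0, :]
# Active windows of length w = 400 are then lifted into 300 x 100 Hankel
# matrices (d = 3) and the GKP/LKP switching logic is applied.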
In Figure S15, we show the KMD reconstruction and prediction results for the observable x_1 = x of the system (S67) for a selection of five active windows. While the reconstruction, indicated by green traces, works well (as expected, see e.g. 44, 13, 14), the prediction capability, indicated by magenta traces, is lost for the third and the fourth active windows. An inspection of the quality of the approximate Ritz pairs computed by the DMD 12 shows that for the time interval containing those two windows none of the computed pairs has a residual below 0.01, i.e. no useful spectral information, which is essential for the KMD, could be extracted from the available data snapshots; see Figures S15b, S15c. As a consequence, the prediction using a numerical realization of the KMD cannot give satisfactory results. On the other hand, in the part of the domain where the trajectory behaves chaotically and the intensive change of the nature of the eigenpairs precludes accurate numerical approximations, the local prediction scheme quickly adapts to the new data, forgets the previously acquired information, and delivers better results. See Figure S15d.
The reconstruction with a reduced number of modes (see Figure S15d) uses only the Ritz pairs with small residuals (see [12, §3.2]). The numbers of modes used for prediction after the first, the second and the fifth active window (gray rectangles) were 10, 18 and 10, respectively. One can observe that the reconstruction and prediction capabilities, shown in dashed green, are comparable with those of the full KMD.

S2.1.1 Remark
Regarding the question of detecting the switching moments, Figure S15d provides an insight. If we look at the first pink zone with no "good" eigenvalues (Figures S15b, S15c), we see that it starts close to the switching moment. Also, this zone is quite long because the switching moments in that zone are too close to each other and no learning data window can fit in-between. Only when two switching moments are distant enough can we find "good" eigenvalues; the learning data window then exits the pink zone, and the global prediction recovers. It is remarkable that recent works 53,54 have found spectral objects, pseudoeigenfunctions, that govern quite regular short term dynamics inside the wing of the Lorenz attractor. This dynamical feature, discovered by careful analyses of the continuous Koopman operator spectrum for the Lorenz system, seems to enable the prediction algorithm performance.

S2.1.2 Discussion
Historically, the most discussed way in which a substantial change in dynamics can occur in dynamical systems is through a change of the value of a bifurcation parameter 55. The prediction method that we propose is not necessarily related to a change of a parameter in the system. Namely, the original description of the Black Swan event does not relate to a change in a parameter, just to travel to another part of the space (here considering the dynamical system to be the ecological system). White swans were known to exist in Europe, but explorers found black swans in Australia. The prediction that an explorer would make when traveling to Australia might have been the existence of white swans; upon observation, they concluded that black swans exist. The bird had all other properties of the white swan, except for the color. There were no parameter changes, no bifurcation that occurred. Similarly, the prediction of the dynamics while on one wing of the Lorenz butterfly attractor is based on the eigenvalues of the Koopman operator detected while sampling that wing. Once the dynamics "travels" to the other wing, the change in dynamics is recognized (although there are no parameter changes), but as the dynamics continues on the other wing, the same eigenvalues are obtained. The difference is in the resulting local eigenfunctions 4 (or pseudoeigenfunctions, as in 53,54), which are related by the symmetry (x, y) → (−x, −y).
As is well known, chaotic dynamics is an asymptotic property of a dynamical system, and the associated unpredictability is not due to a local passage near saddles, but to the long term repetition of such events, which ultimately leads to mixing dynamics 56. The switch in dynamics is here due to internal effects, and thus ontologically different from the Black Swan situation. The switch is due, in the Lorenz case, precisely to the local saddle event that transitions the dynamics from one wing of the butterfly to the other. We presented a method by which such a passage can actually be detected, and accounted for, inside a prediction algorithm.
We note that there are methods that can predict the evolution of chaotic dynamical systems over several Lyapunov times 57.
Note that our purpose is somewhat different from that in 57: we are more interested in the detection of failure to predict accurately than in establishing a method for long-term (climate) prediction. In separate work 27 we pursue the question of long-term prediction of the Lorenz model and provide evidence of the ability of Koopman based methods to predict over many Lyapunov time-scales.

S2.2 Case study: Resonant breathing
The mathematical model of the human cardiovascular system was developed by Ursino and Magosso in (58, 59, 60, 61). This model includes mathematical descriptions of a pulsating heart, as well as the mechanics of blood flow (62) and baroreflex activation (58, 59, 60, 61). It includes more than 90 parameters and 21 states (pressures, flows, volumes, resistances, and elastances). Twenty-one delay differential equations reflect conservation of mass and balance of forces at arteries and veins, as well as delayed physiological responses to vagal and sympathetic neural activity. This allows for the simulation of high-resolution blood pressure and heart period as functions of time. In 63 we modified the Ursino and Magosso model to use an experimentally derived respiration period as a model input. In addition, we set the external noise in the Ursino and Magosso model to zero, because of the noise in the respiratory input used in our model.
Data for model validation were provided by 12 men and 12 women who were healthy college students between 21 and 24 years of age. They were participants in an experiment, one of the aims of which was to develop a computational physiology approach to model how cardiovascular processes change when the baroreflex mechanism is challenged. This study was approved by the Rutgers University Institutional Review Board for the protection of human subjects involved in research. One of the tasks that the participants completed was a 5-minute resonance breathing task (6P) (64, 65, 66, 67), during which they breathed at a rate of approximately 6 breaths/min following a visual pacer (Easy Air, Biofeedback Foundation of Europe, Montreal, Canada). The specific details on the participants' selection/exclusion process and the experimental procedure can be found in 63.
In 63 , to find the optimal set of parameters for each subject, we selected as model output the cost function that takes into account power spectral densities and time averages of several observables, such as heart period.Instead of doing brute-force optimization on the cost function with over 90 parameters in the model, we used the following procedure: 1) an initial sensitivity analysis was performed to select the most important parameters to tune, and 2) optimization of these most important parameters was performed to minimize the cost function.The details of the calibration procedure can be found in 63 .
We use the results of the chosen simulation to test the global prediction algorithm. The numerical solutions were obtained using AIMdyn's GOSUMD software. The time step used for the numerical simulations was ∆t_0 = 0.003. The parameters in the simulations, with the exception of the function modeling breathing, were chosen as obtained in 63. The input breathing function in this simulation was chosen so that in the first part of the simulation, the period of the input breathing function was constant and equal to 10 seconds. The period of 10 seconds simulates the rhythm of resonant breathing. In the second half of the simulation, an experimentally determined normal breathing function was used as the input.
As already mentioned, in the global prediction algorithm one should provide a long enough set of data in order to extract the dynamical system parameters related to the phenomena we want to capture with the algorithm. On the other hand, the time step between neighboring snapshots should be chosen so that a balance is achieved between the numerical complexity and the length of the dynamical phenomena one wants to reveal with the KMD algorithm. Since in the system there are no frequencies larger than 10, it is enough to take ∆t = 0.03 between neighboring snapshots. By using time-lagged embedding for each variable separately, we form the Hankel matrices and apply the GKP for the reconstruction and prediction. In the computations we present here, we use training sets that consist of 900 snapshots and Hankel matrices of dimension 600 × 300, sliding along the computational domain with the chosen step. The length of the training sets was chosen so that at least two time periods of the global disturbance that we want to reveal are included in them.
The switching moment from resonant to normal breathing is nicely detected by the eigenvalues provided by the DMD algorithm in Figure S16. The nature of the eigenvalues changes significantly in the active window beginning at t = 150, when the switching of the dynamics occurs. It is clearly visible from Figures S16-S17 that when the training set is in the zone of resonant breathing, the GKP results are excellent in that same zone, and then deteriorate as we move into the normal breathing zone. This is as expected, since after the moment of transition from resonant to normal breathing the dynamical system is not governed by the same set of parameters. When the training set is in the zone of normal breathing, the GKP in that same zone is much less accurate than in the resonant breathing zone: it captures the global behavior well but is poor in the details, and most of the error comes from the difference in the phase. This is also to be expected, since normal breathing is much more irregular than resonant breathing, and it turns out that it cannot be learned with high accuracy.

S2.3 Case study: Geomagnetic substorms
Geomagnetic storms and substorms are violent disturbances of the Earth's magnetosphere, caused by the energy transfer of the solar wind into the planet's magnetosphere, with a potentially severe impact on human civilization 68, 69.
Physics-based modeling (see e.g. 70, 71, 72) of geomagnetic substorms and storm/substorm interaction is a challenging task and the subject of intensive study. It must cover multiscale, nonlinear interactions of plasmas that are not in equilibrium, or are in an unstable equilibrium, which makes such modeling difficult to apply when prediction is needed; see e.g. 73.
On the other hand, given an abundance of observation data, a data-driven approach is an attractive alternative; see e.g. 74, 75. The intensity of a substorm is quantified by the auroral electrojet index AL, which is a measure of the magnitude of the geomagnetic field disturbances on the ground induced by the ionospheric currents that develop during a substorm. Other information, such as e.g. solar wind data 76, the Dst index, and other substorm signature indices, may be available 77 and used as observables.
For the purpose of this case study of the proposed approach as a purely data driven black-box methodology, we choose to use the AL index as the only observable; the data are downloaded from the Kyoto Geomagnetism Data Service (http://wdc.kugi.kyoto-u.ac.jp/).
The presented results are obtained by using the global prediction algorithm with active windows of size 30 and Hankel matrices of dimension 20 × 10. By sliding the active windows along the computational domain we obtain predictions at different time moments. In Figure S18 we present the obtained reconstruction and prediction results for four active windows. Note that the modal representation of the signal is good, and it could provide valuable insights to experts in magnetic storm physics.
In the framework of our theory, the poor global prediction results are to be expected -almost all eigenvectors used in the KMD have large residuals.This once more justifies our approach, based on using the residuals of the Ritz pairs, computable even in the data driven setting, using the method from 12 .However, large prediction errors trigger the switch to the local prediction algorithm, which delivers more accurate predictions, at least for shorter lead time, as shown in Figure S18.

S2.4 Additional numerical results for prediction of influenza cases
Here we provide some additional results for the material in §2 of the paper. In Figures S19-S22 we show the prediction of the dynamics of influenza for the USA and the UK for 2 and 52 weeks ahead, obtained with the KMDs in the global prediction algorithm, using sliding active windows of size 312.
In Figure S23, we provide an additional numerical illustration (the most relevant eigenvalues before and after retouching, with the corresponding prediction errors, and the relation with the dominant frequencies from the DFT analysis) related to Figure 1b in the main text (global prediction with retouching of the Black Swan event data).

Figure 1 .
Figure 1. Influenza data (USA). (1a): The data are collected in the window April 2003 - April 2009 (shadowed rectangle) and then the dynamics is predicted for 104 weeks ahead. The local prediction algorithm recovers the prediction capability by forgetting the old data and using narrower learning windows; it delivers prediction for one week ahead. (1b): The active window (shadowed rectangle) is July 2004 - July 2010, and the dynamics is predicted for 104 weeks ahead. The global prediction fails due to the Black Swan data in the learning window. (Some predicted values were even negative; those were replaced with zeros.) The global prediction algorithm recovers after retouching the Black Swan event data, which allows for using a big learning window. Compare with the positions of the corresponding colored rectangles in Figure 2.

Figure 2 .
Figure 2. The real and imaginary parts of the Ritz values with residuals below η_r = 0.075 for sliding active windows. The color intensity of the eigenvalues indicates the amplitudes of the corresponding modes. Pink rectangles mark the ends of training windows with no acceptable Ritz values. Note how the unstable eigenvalues (ℜ(λ) > 0) impact the prediction performance, and how the retouching moves them towards neutral/stable; this is shown in the yellow rectangle in panels (a) and (c). Also influenced by the disturbance are the eigenvalues in the light blue rectangles in panels (a), (b); retouching moves the real parts of the eigenvalues towards neutral/stable and rearranges them in a lattice-like structure 22, as shown in panels (c), (d). Compare with Figure 1b.

Figure 3 .
Figure 3. Prediction of COVID-19 cases (35 days ahead, starting July 11) for Germany, France and United Kingdom.Left panel: The Hankel-Takens matrix H is 282 × 172, the learning data consists of h 1:40 .The KMD uses 39 modes.Middle panel: The matrix H is 363 × 145, the learning data is h 1:13 .The KMD uses 12 modes.Right panel: The Koopman-Ritz values corresponding to the first (magenta circles) and the middle (blue plusses) panel.Note how the three rightmost values nearly match.

Figure 4 .
Figure 4. Prediction errors and KMD spectrum of COVID-19 cases (28 days ahead, starting July 11) for Germany, France, the United Kingdom, Denmark, Slovenia, Czechia, Slovakia and Austria. Left panel: The Hankel-Takens matrix H is 752 × 172; the learning data consist of h_1:40. The KMD uses 39 modes. Middle panel: The matrix H is 968 × 145; the learning data are h_1:13. The KMD uses 12 modes. Right panel: The Koopman-Ritz values corresponding to the first two computations in Figure 3 (magenta circles and blue pluses, respectively) and the first two panels in this Figure (orange x-es and cyan squares, respectively). Note how the corresponding Koopman-Ritz values nearly match for all cases considered.

Figure 5 .
Figure 5. Prediction experiment with data from Germany. Left panel: the computed residuals for the computed 102 Koopman Ritz pairs (extracted from a subspace spanned by 132 snapshots h_1:132). Note that all residuals are small. The corresponding Ritz values are shown in the first panel in Figure 6. Middle panel: KMD reconstruction error for h_1:132 and the error in the predicted values h_133:160 (encircled with •). The reconstruction is based on the coefficients (α_j)_{j=1}^r = arg min_{α_j} ∑_k ‖h_k − ∑_{j=1}^r λ_j^k α_j v_j‖_2^2. Right panel: Prediction errors for the period October 11 - November 7.

Figure 6 .
Figure 6. Prediction experiment with DS3 with data from Germany. Left panel: the computed 102 Koopman Ritz values (extracted from a subspace spanned by 132 snapshots h_1:132). The corresponding residuals are shown in the first panel in Figure 5. Middle panel: KMD reconstruction error for h_1:132 and the error in the predicted values h_133:160 (encircled with •). The reconstruction is based on the coefficients (α_j)_{j=1}^r = arg min_{α_j} ∑_k w_k^2 ‖h_k − ∑_{j=1}^r λ_j^k α_j v_j‖_2^2. Right panel: Prediction errors for the period October 11 - November 7. Compare with the third graph in Figure 5.

Figure 7 .
Figure 7. Prediction of confirmed COVID-19 cases utilizing the publicly available COVID-19 data repository provided by Johns Hopkins. The true data range between March 22nd, 2020 and November 29th, 2020. We utilize the last three days of data to forecast the following three days of data. (7a) Predicted conditions and prediction error worldwide on November 13. The widths of the bubbles represent the number of cases in a region; only regions with more than 100 cases are used, and the bubbles are colored according to their relative percent error. (7b) Comparison of true and forecast data for cumulative confirmed cases in the US for April to December 2020. The cumulative forecasts shown here were obtained by summing the forecasts of the individual locations, indicating that the region specific forecasts were sufficiently accurate for tracking the cumulative dynamics of the virus in the US. (7c) Percent error for the forecasts of the cumulative confirmed cases in the US. On average the percent error is less than 5 percent, and although spikes occur, which could be due to changes in testing availability, the algorithm adjusts and the error stabilizes within a short amount of time. Furthermore, Johns Hopkins provided data for around 1787 locations around the United States, and we produced forecasts for each of those locations.


Figure S8 .
Figure S8. Left panel: Relative prediction error for Germany, France and the United Kingdom for a 35 days prediction starting after the data window h_1:40. (In terms of the original data, the prediction starts on July 11.) Right panel: The Koopman Ritz values used in the KMD.

Figure S9 .
Figure S9. The Government Response Stringency Index measures response indicators (OxCGRT indicators) such as school closing, workplace closing, cancelling public events, restrictions on gathering size, closing public transport, stay at home requirements, and restrictions on internal movement and international travel. The GRSI data for Germany, France, the United Kingdom, Denmark, Czechia, Slovenia, Austria and Slovakia are taken from 47. For more details see 48.

Figure S10 .
Figure S10. Left panel: Relative prediction error for eight European countries for a 28 days prediction starting after the data window h_1:40. (In terms of the original data, the prediction starts on July 11.) Right panel: The Koopman Ritz values used in the KMD, denoted by blue pluses (+). The red circles (•) denote the Ritz values computed using only three countries, as shown in Figure S8.

Figure S13 .
Figure S13. Prediction experiment with DS3 with data from Germany. Left panel: the computed residuals for the computed 102 Koopman Ritz pairs (extracted from a subspace spanned by 132 snapshots h_1:132). Note that all residuals are small. The corresponding Ritz values are shown in the first panel in Figure S14. Middle panel: KMD reconstruction error for h_1:132 and the error in the predicted values h_133:160 (encircled with •). The reconstruction is based on the coefficients (α_j)_{j=1}^r = arg min_{α_j} ∑_k ‖h_k − ∑_{j=1}^r λ_j^k α_j v_j‖_2^2. Right panel: Prediction errors for the period October 11 - November 7. Compare with the first graph in Figure S12.

Figure S11 .
Figure S11. First row: Prediction error for Germany, France and the United Kingdom for a 35 days prediction starting from the data windows h_1:105 (prediction for September 14 - October 18), h_1:106 (prediction for September 15 - October 19), h_1:107 (prediction for September 16 - October 20), respectively. Second row: The corresponding Koopman Ritz values used in the KMD.

Figure S14 .
Figure S14. Prediction experiment with DS3 with data from Germany. Left panel: the computed 102 Koopman Ritz values (extracted from a subspace spanned by 132 snapshots h_1:132). The corresponding residuals are shown in the first panel in Figure S13. Middle panel: KMD reconstruction error for h_1:132 and the error in the predicted values h_133:160 (encircled with •). The reconstruction is based on the coefficients (α_j)_{j=1}^r = arg min_{α_j} ∑_k w_k^2 ‖h_k − ∑_{j=1}^r λ_j^k α_j v_j‖_2^2. Right panel: Prediction errors for the period October 11 - November 7. Compare with the first graph in Figure S12, and with the third graph in Figure S13.

Figure S15 .
Figure S15. (S15a): the Lorenz system (S67). For each active window, the Ritz pairs with the residual below η_r = 0.01 are selected (see [12, §3.2]); the real and the imaginary parts of the corresponding Ritz values are shown in (S15b, S15c). The color intensity of the eigenvalues indicates the amplitudes of the corresponding modes. (S15d): KMD reconstruction and prediction of the observable x_1 ≡ x for the Lorenz system (S67). The data are collected in five active windows (time intervals [1, 5], [6.5, 10.5], [14.5, 18.5], [21.5, 25.5], [28.5, 32.5], marked by shadowed rectangles) and then the dynamics is predicted for the time moments ahead of the active window. Note, by comparing the positions of the intervals with poor prediction with the eigenvalue-free pink rectangles in Figures S15b, S15c, that the failure of the global prediction occurs after the active windows which do not contain Ritz pairs with sufficiently small residuals, as indicated by the magenta curves. The local algorithm recovers the prediction capability, using a sequence of shorter moving KMDs and prediction for 10 time steps ahead, as indicated by the orange curves.

Figure S16 .
Figure S16. Physiology model. The real and the imaginary parts of the eigenvalues for sliding active windows for which the residuals are smaller than the threshold η_r = 0.025. The intensity of the color of the eigenvalues is associated with the amplitude of the modes.

Figure S17 .
Figure S17. Physiology model. Extrasplanchnic peripheral resistance and splanchnic venous unstressed volume. First column: Reconstruction and prediction obtained using the GKP for the chosen active windows. The full (r = 300) and reduced (r < 300) predictions obtained with the GKP on active windows in the first part of the simulation, in the resonant breathing zone, capture the dynamics accurately, while even the full prediction obtained with the training set in the normal breathing zone does not capture the dynamics well. Second column: The prediction errors for the full and reduced predictions (with r modes) for the chosen active window in the first part of the simulation.

Figure S18 .
Figure S18. Geomagnetic substorms data: KMD reconstruction and prediction of the AL index. For reconstruction and global prediction, the data are collected in four active windows (time intervals [900, 1050], [1200, 1350], [1600, 1750], [2000, 2150], indicated by shadowed rectangles); the algorithm uses 20 × 10 Hankel matrices, and the dynamics is then predicted for 20 time steps ahead. The time resolution of the collected data is δt = 5 min. The local prediction algorithm uses 3 × 2 matrices, and the error threshold for resizing the Hankel matrix to the minimal size (switching to the local algorithm) is set to 10. The dynamics with the local prediction algorithm is predicted two time steps ahead.

Figure S19 .
Figure S19. Influenza data (USA). Global Koopman prediction of influenza with active windows of size 312 and different sizes of Hankel matrices. The prediction 2 weeks ahead, using KMDs from the active windows sliding along the computational domain with sliding step ∆p = 1.

Figure S22 .
Figure S22. Influenza data (UK). Global Koopman prediction of influenza with active windows of size 312 and different sizes of Hankel matrices. The prediction 52 weeks ahead, using KMDs from the active windows sliding along the computational domain with sliding step ∆p = 1.

Figure S23 .
Figure S23. Influenza data (USA). The most relevant eigenvalues and the prediction errors in the global algorithm (using 208 × 104 Hankel matrices) for the active window as in Figure 1b in the main paper. Note how the unstable eigenvalues (ℜ(λ) > 0) impact the prediction performance, and how the retouching moves them to the left. Compare with Figures 1b and 2 in the main paper. Note how the dominant frequencies from the DFT analysis correspond to the imaginary parts of the eigenvalues computed after the retouching and selected by the residual criterion.