Discovering spatiotemporal patterns of COVID-19 pandemic in South Korea

A novel severe acute respiratory syndrome coronavirus 2 emerged in December 2019, and it took only a few months for WHO to declare COVID-19 a pandemic in March 2020. It is very challenging to discover complex spatial–temporal transmission mechanisms. However, it is crucial to capture essential features of regional-temporal patterns of COVID-19 to implement prompt and effective prevention or mitigation interventions. In this work, we develop a novel framework of compatible window-wise dynamic mode decomposition (CwDMD) for nonlinear infectious disease dynamics. The compatible window is a selected representative subdomain of time series data, in which compatibility between spatial and temporal resolutions is established so that DMD can provide meaningful data analysis. A total of four compatible windows have been selected from COVID-19 time-series data from January 20, 2020, to May 10, 2021, in South Korea. The spatiotemporal patterns of these four windows are then analyzed. Several hot and cold spots and their spatial–temporal relationships were identified, and some hidden regional patterns were discovered. Our analysis reveals that the first wave was contained in the Daegu and Gyeongbuk areas, but the disease spread rapidly to the whole of South Korea after the second wave. Later on, the spatial distribution became more homogeneous after the third wave. Our analysis also identifies that some patterns are not related to regional relevance. These findings have then been analyzed and associated with the inter-regional and local characteristics of South Korea. Thus, the present study is expected to provide public health officials with helpful insights for future region- and time-specific mitigation plans.


Results
Spatial-temporal characteristics of COVID-19 in South Korea. In this section, we present an overview of COVID-19 data collected in South Korea (see Fig. 1 for more description). Daily confirmed cases and deaths of COVID-19 from January 20, 2020 to May 10, 2021, were obtained from the Korea Centers for Disease Control and Prevention (KCDC) and each provincial website 12 . As of May 10, 2021, there were a total of 127,772 COVID-19 confirmed cases and 1875 deaths in South Korea. To analyze the spatiotemporal patterns of COVID-19, the spatial distribution of COVID-19 confirmed cases is refined in 17 first-tier administrative divisions of South Korea. Figure 1 shows a South Korea map (a) with spatial distributions of the cumulative number of COVID-19 confirmed cases (b) and the cumulative number of COVID-19 deaths (c). As displayed in b, c, d of Fig. 1, South Korea shows a high level of spatial and temporal heterogeneity in 17 regions. We can observe that the main characteristics of the temporal patterns of South Korea can be placed into the particular four stages, i.e., three big waves and the last stage. More precisely, the first window is from January 20, 2020 to April 26, 2020, the second window is from July 28, 2020 to October 12, 2020, the third window is from November 3, 2020 to February 1, 2021, and the period after the third wave is February 2, 2021, to May 10, 2021. These are chosen as four windows and represented by different colors in Fig. 2a.
The first case of COVID-19 in South Korea was a 35-year-old Chinese woman who traveled from Wuhan, China, and was confirmed on January 20, 2020. She entered through Incheon international airport and was isolated at a hospital upon entry. After the index case, only 30 confirmed cases occurred until February 17, 2020. However, there was an explosive outbreak in Daegu due to the superspreading events from the Shincheonji Church-related clusters from February 18 to March 23, 2020 12 . As a result, the first wave (January 20, 2020-April 26, 2020, see [...] Seoul including the Guro call center and the Itaewon club cluster in May 2020. From July 28 to October 12, 2020, the second wave started in Seoul and Gyeonggi Province (see Fig. 1d). The main cause of the second wave was the rally held at Gwanghwamun Square in Seoul. Seoul had the highest number of confirmed cases, and Gyeonggi Province had the second-highest number of confirmed cases and the highest number of deaths. The largest wave was the third wave from November 3, 2020 to February 1, 2021. This was partly due to the winter season, which creates favorable conditions for close contact between people staying indoors. After the third wave, a constant level of COVID-19 cases was maintained nationwide from February 2, 2021, to May 10, 2021.
Analysis of spatial-temporal COVID-19 in South Korea. In this section, we shall present the analysis of spatial-temporal COVID-19 in South Korea. [...] that the hot spots were moving from Daegu to Seoul and Gyeonggi while Jeonnam remained the cold spot in the first, third, and last periods. Interestingly, Daegu and Gyeongbuk were the cold spots during the second wave after the severe first outbreak. The chosen four windows are then used to apply CwDMD, which results in the discrete DMD modes and eigenvalues in each window. Supplementary Figs. S2-S5 compare the results of the DMD data fitting with the region-specific COVID-19 data for each window. There is a perfect agreement between the COVID-19 data (red dot) and the DMD output (black solid) in all 17 regions.
[Figure 2. (a) [...] Korea. South Korea went through three big waves; after the third wave, the incidence has been maintained with no significant increase or decrease. The four windows of main interest are colored and given as: (1) the first wave (January 20, 2020-April 26, 2020); (2) the second wave (July 28, 2020-October 12, 2020); (3) the third wave (November 3, 2020-February 1, 2021); and (4) after the third wave (February 2, 2021-May 10, 2021). (b) Weekly incidence and cumulative cases in 17 regions, plotted as bars and as a curve, respectively. The regions with the three highest cumulative cases, the next five highest, and the rest are marked in red, green, and blue, respectively.]
CwDMD has been used to investigate the spatiotemporal pattern of COVID-19 in 17 regions, whose discussions are presented in the following four subsections. Note that a few important DMD modes selected in each window are categorized into three regimes, oscillatory, growing, and decaying. These are then used for the phase and magnitude analysis of each window.
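The three regimes mentioned above can be illustrated with a short sketch. It assumes the usual discrete-time convention (modulus above one for growing, below one for decaying, close to one for oscillatory); the tolerance `tol` and the function name `classify_modes` are illustrative choices, not values or names taken from this study.

```python
import numpy as np

def classify_modes(eigvals, tol=1e-2):
    """Label each discrete-time DMD eigenvalue by its modulus.

    tol is an assumed tolerance around the unit circle, not a value
    reported in the paper.
    """
    labels = []
    for lam in np.asarray(eigvals, dtype=complex):
        r = abs(lam)
        if r > 1 + tol:
            labels.append("growing")
        elif r < 1 - tol:
            labels.append("decaying")
        else:
            labels.append("oscillatory")
    return labels

# Hypothetical eigenvalues, for illustration only.
example = [1.2, 0.5 + 0.1j, np.exp(0.3j)]
print(classify_modes(example))
```

In practice the tolerance would have to be chosen with the sampling interval and the noise level of the data in mind.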
The first wave. The first wave is chosen as the total of 14 weeks and so, the spatial vs temporal resolution is 17 to 14. This is compatible as discussed in the section for "Methods".
In Fig. 3, we show the power of DMD modes in a, i.e., the measure of the scaled size of the DMD modes (see the section of "Methods" in details). The power is used for the selection of dominant DMD mode and the selected DMD mode is then utilized for both magnitude and phase analysis.
We note that the first three DMD modes of the highest power were chosen and they are denoted by #1, #2, and #3 as shown in Fig. 3a. In fact, the selected three DMD modes correspond to the growing, the oscillatory, and the decaying modes, respectively in the discrete dynamical system for the first window. The magnitude analysis has been performed using these dominant DMD modes. We observe that all three DMD modes show that Daegu and Gyeongbuk have the largest magnitude and Seoul and Gyeonggi are next. These are indicated by gray bars in b, c, d, respectively, and consistent with the cumulative confirmed cases of the first wave given in e and f of Fig. 3.
Next, we explored the phase analysis from the three selected DMD modes. We note that phase or phase difference can be interpreted as the time (in week) between peak to peak of the region-specific COVID-19 outbreak. Namely, the smaller the phase difference of two different regions is, the closer the peaks of these regions will be. We find that in all three DMD modes, the phases of Busan, Gyeongnam, and Chungnam are similar. Note that these three regions are close to the epicenter. Consequently, we find a strong correlation between the phase of the southern part of South Korea and the distances from the epicenter, i.e., Daegu and Gyeongbuk. This is consistent with the data presented in f of Fig. 3.
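One possible way to turn a phase difference into a peak-to-peak lag is sketched below. It assumes that the entry of a mode for region j evolves like |φ_j| |λ|^k cos(kθ + α_j), with θ the argument of the eigenvalue λ and α_j the argument of the mode entry, so that the lag between two regions is the phase difference divided by θ. This conversion and the helper name `peak_lag_weeks` are assumptions made for illustration; the exact procedure used in this study is not spelled out in this section.

```python
import numpy as np

def peak_lag_weeks(mode, eigval, i, j, dt_weeks=1.0):
    """Approximate lag (in weeks) of region j's peak behind region i's peak.

    Assumes the oscillatory part of the mode dominates the peak location,
    i.e. growth or decay within one period is ignored.
    """
    theta = np.angle(eigval)  # phase advance per sampling step (radians)
    if np.isclose(theta, 0.0):
        raise ValueError("real positive eigenvalue: phase lag is undefined")
    # Phase difference alpha_i - alpha_j, wrapped into (-pi, pi].
    dphi = np.angle(mode[i] * np.conj(mode[j]))
    return dphi / theta * dt_weeks

# Hypothetical mode with a 12-week period; region 1 trails region 0.
eigval = np.exp(2j * np.pi / 12)
mode = np.array([1.0, np.exp(-1j * np.pi / 3)])
print(peak_lag_weeks(mode, eigval, 0, 1))
```

A positive value means that region j peaks later than region i; a small absolute value corresponds to nearly simultaneous peaks.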
On the other hand, the phase of DMD mode #1, shows that there is a time lag of 2-3 weeks between the peaks of Seoul and Gyeonggi and those from Busan, Gyeongnam, Gyeongbuk, and Daegu. In particular, from the fact that the DMD mode #1 is a growing mode, the above conclusion indicates that there was definitely a different cause for the COVID-19 outbreak of Daegu and Gyeongbuk from that of Seoul and Gyeonggi. Note that it can be clearly identified in the graph of e and f in Fig. 3. More precisely, the weekly confirmed cases of Seoul and Gyeonggi are similar to those of other regions from weeks 5 to 7. However, the confirmed cases began to increase from week 8 to 13, while those of other regions decreased. We later found that this peculiar behavior could be associated with a few large workplace-related clusters such as the Guro-Gu call center in Seoul and Gyeonggi from March 2020 to April 2020 36,37 .
The second wave. The second wave is chosen as the total of 11 weeks and so, the spatial vs temporal resolution is 17 to 11. This is compatible as discussed in the section for "Methods".
In Fig. 4, we show the power of DMD modes in a. The power is used for the selection of dominant DMD mode and the selected DMD mode is then utilized for both magnitude and phase analysis. We note that the first two DMD modes of the highest power were chosen and they are denoted by #1 and #2 as shown in Fig. 4a. The selected two DMD modes correspond to the oscillatory and the growing modes, respectively. The magnitude analysis has been performed using these dominant DMD modes. The weekly confirmed cases of the total of six selected regions are then shown in d of Fig. 4.
The magnitude analysis using both of these selected DMD modes shows that Seoul and Gyeonggi have a significantly large magnitude of confirmed cases. This is in fact, consistent with the data shown as Fig. 4d. The main drive behind this large magnitude can be correlated with the outbreak from the rally held at Gwanghwamun Square in Seoul on August 15, 2020. Note that this rally was organized by SarangGeil Church in Seoul and people from all regions of South Korea participated. We observe that, unlike the first wave, the magnitude of Daegu and Gyeongbuk are relatively small. This can be attributed to the continued strict mitigation interventions in the Daegu and Gyeongbuk area since the first wave. See also the COVID-19 cases shown in Fig. 1 as well as in Supplementary Fig. S1, which are consistent with our magnitude analysis for Daegu and Gyeongbuk.
We now explore the phase analysis from the two selected DMD modes. First, we begin with the following facts; (1) the maximum phase difference in the DMD mode #1 is between Sejong and Ulsan and its value is 1.54 weeks; (2) the maximum phase difference is 1.04 weeks and it is between Busan and Jeju in the DMD mode #2. The relatively short phase difference indicates that the second wave can be characterized as an almost simultaneous nationwide spread. This can be attributed to the fact that all participants from all regions who attended the rally in Seoul returned to their home region within a few days, i.e., less than a week 38 .
The third wave. The third wave is chosen as the total of 13 weeks and so, the spatial vs temporal resolution is 17 to 13. This is compatible as discussed in the section for "Methods".
In Fig. 5, we show the power of DMD modes in a. The power is used for the selection of dominant DMD mode and the selected DMD mode is then utilized for both magnitude and phase analysis. We note that the first two DMD modes of the highest power were chosen and they are denoted by #1 and #2 as shown in Fig. 5a. The selected two DMD modes correspond to the growing and the oscillatory modes, respectively. The magnitude analysis has been performed using these dominant DMD modes. The weekly confirmed cases of the total of eight selected regions are then shown in d of Fig. 5.
The magnitude analysis using both of these selected DMD modes shows that Seoul and Gyeonggi have a significantly large magnitude of confirmed cases, similar to the second wave. This is in fact, consistent with the data shown as Fig. 5d. This is due to the cold winter seasons, as people favorably stayed indoors for close contacts, which is enhanced by the higher population density in Seoul and Gyeonggi; the South Korean population is highly disproportionate and the metropolitan area has more than 50% of the total South Korean population.
The phase analysis in this wave shows that the maximum phase difference is larger than that of the second wave for both modes. Namely, the maximum phase difference in the DMD mode #1 is 4.02 weeks, which is between Busan and Jeonnam, while the maximum phase difference in the DMD mode #2 is 4.13 weeks, which is between Gyeonggi and Jeonnam. In particular, regions grouped according to the higher phase similarity are (1) Busan, Gyeongnam, and Ulsan, which are all located in the southeast area, (2) Seoul, Gyeonggi, and Incheon, which are all located in the northwest area, and (3) Daegu and Gyeongbuk, which are in the central area. This analysis identifies that there are strong spatial correlations in the third wave, which seems natural. To our surprise, however, we observe more or less independent phase behavior between Gwangju and Jeonnam in DMD mode #2. This means that Jeonnam is not much affected by the outbreak of COVID-19 in Gwangju, even though Jeonnam surrounds Gwangju. In fact, this holds throughout the whole period over which the COVID-19 data were collected. This indicates that the expected spatial correlation is sometimes misleading. Additionally, a similar phenomenon is also observed between Daejeon and Chungnam.
The period after the third wave. The period after the third wave is chosen as the total of 13 weeks and so, the spatial vs temporal resolution is 17 to 13 again, like the third wave. The main feature of this period is that the weekly incidence is relatively large all over South Korea. In Fig. 6, we show the power of DMD modes in a. The power is used for the selection of dominant DMD mode and the selected DMD mode is then utilized for both magnitude and phase analysis. We note that a single DMD mode shows the dominant power and so, only this DMD mode is chosen and denoted by #1 as shown in Fig. 6a. The selected DMD mode corresponds to the oscillatory mode.
The magnitude analysis using this selected DMD mode shows that Seoul and Gyeonggi have the largest magnitude, which is consistent with the highest cumulative COVID-19 cases during the period after the third wave in these regions as shown in Fig. 2b. This consistency also holds for the next largest magnitudes or cumulative cases occurring in the southeast areas, which include Busan, Ulsan, and Gyeongnam.
The phase analysis in this period shows that the maximum phase difference is 11.8 weeks, which is between Incheon and Gwangju. Furthermore, the phase difference between neighboring regions such as Seoul and Gyeonggi, Daejeon and Chungnam, and Daegu and Gyeongbuk is also more than three weeks, which is relatively large. This indicates that the overall large weekly incidence in each region is local in nature. Namely, the outbreaks in each region are mainly due to local outbreaks within the region, and the inter-regional correlation of outbreaks seems to be irrelevant in this period. This has been further justified by investigating the spatial variations using the estimation of the so-called coefficient of variation below.
We investigate the rate of incidence per 100,000 people in each region for the first wave, the second wave, the third wave, and the period after the third wave and plot this in Fig. 7a-d, respectively. This shows that the regional variation in weekly incidence is gradually decreasing over time. We observe that in the first wave (see Fig. 7a), only the rate of incidence for Daegu and Gyeongbuk is higher than average. After the first wave, the rate of incidence for Daegu and Gyeongbuk falls below the average, whereas that of Seoul and Gyeonggi stays higher than the average. Even if it is not definitely clear, as time proceeds, the regional differences seem to get smaller. To quantify this observation on the time-dependent regional difference in the incidence rate, we compute the so-called coefficient of variation (CV) for the rate of incidence per 100,000 people. The CV is defined as the ratio of the standard deviation to the mean 39 . This is a dimensionless number that can be used to compare the dispersion of groups with different means or different units. Similar to the standard deviation, the larger the CV is, the more dispersed the data will be. The computed CV is presented in Fig. 7e [...] for the third wave, and 0.51 CV (95% CrI: 0.38-0.77 CV) for the period after the third wave. This result clearly demonstrates that the first drastic reduction in CV occurred during the second wave, and the regional variation of weekly incidence tends to decrease over time. Namely, the spatiotemporal incidence pattern tends to be homogeneous, thereby indicating that the local outbreaks are dominant in most of the regions for the period after the third wave.
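As a minimal illustration of the CV defined above (standard deviation divided by mean), the following sketch computes it for a vector of regional incidence rates. The rates are made-up numbers, the use of the sample standard deviation (`ddof=1`) is an assumption, and the credible intervals reported in this study are not reproduced here.

```python
import numpy as np

def coefficient_of_variation(rates, ddof=1):
    """Ratio of the standard deviation to the mean of regional rates.

    ddof=1 (sample standard deviation) is an assumed convention.
    """
    rates = np.asarray(rates, dtype=float)
    return rates.std(ddof=ddof) / rates.mean()

# Hypothetical incidence rates per 100,000 people (not data from the paper).
heterogeneous = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
homogeneous = [5.0] * 8

print(coefficient_of_variation(heterogeneous))
print(coefficient_of_variation(homogeneous))
```

A CV near zero corresponds to a spatially homogeneous incidence rate, while a larger CV indicates that a few regions dominate.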
Novel compatible window-wise dynamic mode decomposition. Our data analysis using CwDMD has clearly shown the usefulness of the method for identifying patterns in spatially and temporally correlated nonlinear data. It has also shown that some hidden patterns can be identified. The standard DMD, however, has a limitation in that it may provide misleading analysis for inconsistent data 32 . Inconsistent data, which is equivalent to nonlinear data, can be interpreted as data in which the spatial resolution (the amount of spatial detail) is incompatible with the temporal resolution (the amount of temporal detail). The precise condition for compatibility is given in the section "Methods". In Fig. 8, we consider the COVID-19 time series data collected in a total of 17 regions. The standard DMD operator is shown to fit the data perfectly when a total of 18 or fewer temporal data points are selected. The number 18 is the maximal temporal resolution for which the compatibility between spatial and temporal resolutions is valid. As the temporal resolution increases beyond this, the quality of the DMD data fit deteriorates significantly, which indicates that DMD is inadequate to provide meaningful data analysis in these cases. To quantify the inadequacy, we investigate the phase and magnitude analysis from the selected DMD mode. For the 19-week data from December 27, 2020 to May 8, 2021, there is an evident disagreement between the COVID-19 data (black solid) and the DMD output (orange bar). The actual data indicates that the number of confirmed cases is higher in Gyeonggi and Seoul and relatively lower in Ulsan. However, the DMD analysis indicates instead that the number of confirmed cases in Ulsan is higher than in Seoul. This implies that the selected DMD mode does not represent the data pattern adequately. Thus, we conclude that a direct and reliable DMD analysis of large time-series data is not feasible unless the data is linear.
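The threshold of 18 temporal data points for 17 regions can be reproduced qualitatively on synthetic data. The sketch below uses random numbers as a stand-in (it is not the COVID-19 data) and fits the operator as Y times the pseudoinverse of X; the helper name `dmd_fit_residual` is hypothetical.

```python
import numpy as np

def dmd_fit_residual(T):
    """Relative Frobenius residual ||Y - A X|| / ||Y|| with A = Y X^+."""
    X, Y = T[:, :-1], T[:, 1:]
    A = Y @ np.linalg.pinv(X)
    return np.linalg.norm(Y - A @ X) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
n = 17  # spatial resolution (number of regions)
for snapshots in (14, 18, 19, 30):
    T = rng.random((n, snapshots))  # synthetic stand-in, NOT COVID-19 data
    print(snapshots, dmd_fit_residual(T))
```

For generic data the residual is at the level of rounding error up to 18 snapshots and becomes clearly nonzero from 19 snapshots on, which mirrors the behavior described for Fig. 8.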
We, therefore, arrive at the need to introduce a novel compatible window-wise dynamic mode decomposition. The main issue in DMD for large time series lies in the nonlinearity of the data. The point of CwDMD is that for any given nonlinear data, it is proven to be possible to select an adequate set of representative subdomains called windows, each containing moderate-sized linear data. For example, Fig. 2a shows the specially chosen windows for the COVID-19 data in South Korea that we analyze. The total size-times duration of all the windows serving a given system depends only on local situations that can arise in the full time-series data. We then apply the standard DMD to each window. This strategy is called the compatible window-wise dynamic mode decomposition (CwDMD). Basically, CwDMD is a collection of DMDs for a specially selected set of consistent windows. In each window, we choose the most significant DMD modes, and the data reconstructed from the selected DMD modes, in its own dimension, is constructed and investigated to understand the actual data.
[Figure 7 (caption fragment): [...] respectively. This result shows that regional variation in the rate of incidence per 100,000 population becomes gradually uniform over time.]
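The window-wise strategy can be sketched as a loop that fits one operator per window and refuses windows that violate the compatibility condition m ≤ n. The week indices below are hypothetical (only the window lengths 14, 11, 13, and 13 follow the text), the data are synthetic, and the operator is formed directly with a pseudoinverse for brevity rather than with an SVD-based DMD algorithm.

```python
import numpy as np

def window_dmd(T_win):
    """Fit A = Y X^+ on one window; return (eigenvalues, modes, residual)."""
    n, snapshots = T_win.shape
    m = snapshots - 1
    if m > n:
        raise ValueError(f"window is not compatible: m={m} > n={n}")
    X, Y = T_win[:, :-1], T_win[:, 1:]
    A = Y @ np.linalg.pinv(X)
    eigvals, modes = np.linalg.eig(A)
    # A has rank at most m, so keep the m eigenvalues of largest modulus.
    order = np.argsort(-np.abs(eigvals))[:m]
    residual = np.linalg.norm(Y - A @ X) / np.linalg.norm(Y)
    return eigvals[order], modes[:, order], residual

# Hypothetical weekly index ranges (start, stop) for the four windows.
windows = {"wave1": (0, 14), "wave2": (27, 38), "wave3": (41, 54), "after3": (54, 67)}

rng = np.random.default_rng(0)
T = rng.random((17, 68))  # synthetic stand-in: 17 regions x 68 weeks

results = {name: window_dmd(T[:, a:b]) for name, (a, b) in windows.items()}
for name, (eigvals, modes, residual) in results.items():
    print(name, len(eigvals), residual)
```

Each window is analyzed independently, so the dominant modes, magnitudes, and phases can differ from one window to the next.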

Discussion
In this study, we have developed a novel data-driven framework: compatible window-wise dynamic mode decomposition (CwDMD). Using the CwDMD, we have identified the spatiotemporal transmission patterns of COVID-19 in South Korea from January 20, 2020 to May 10, 2021. It is generally very challenging to uncover COVID-19 transmission dynamics since there exists a complex interplay among various time-varying factors such as virus, human, mobility, socio-economic infrastructures, and public health policies. However, our CwDMD analysis successfully elucidates how spatial correlations among 17 regions evolve in the presence of such complex features. The first wave was focused on the Daegu and Gyeongbuk area, which was mainly caused by the superspreading events from the Shincheonji Church-related clusters 12,17 . It spread to several regions nearby, but this was quickly contained. This was due to aggressive interventions such as drive-through or walk-through rapid PCR testing, contact tracing, isolation, mask distribution, and social distancing 40 . Most of all, the behavior and awareness of South Koreans were the most crucial reasons for great success. As a result, the substantially largest outbreaks from the Shincheonji Church-related clusters did not last for more than a month. Our analysis also confirmed that the local outbreaks were kept in the Daegu and Gyeongbuk areas. Towards the end and after the first wave, a few major large-scale outbreaks occurred in the metropolitan regions including Seoul, Gyeonggi, and Incheon. For example, in Seoul, there were a few sporadic large outbreaks, which include the Guro-call center, Itaewon Club, and Richway (Seoul-based health product retailer) between March 2020 and June 2020.
The implementation of online schooling, which was initiated in the middle of the first wave, and ongoing intensive interventions contributed to maintaining a low level of COVID-19 outbreaks nationwide until the rally held in Seoul. On August 15, 2020, the rally led by SarangGeil Church caused 641 cases in Seoul and thus initiated the second wave. Our phase analysis for the second wave captured that COVID-19 spread rapidly throughout the nation. This is linked to the fact that people participated in the rally and returned to their home regions within a few days 38 , which gave rise to local outbreaks in every region as well.
The local outbreaks became dominant compared to inter-regional outbreaks during the winter season from November 2020 to February 2021. The large-scale local spread of COVID-19 led to the third wave with the largest cumulative cases nationwide. Since the outbreaks were significantly severe in the third wave and the majority of the cases were concentrated in the metropolitan area, region-specific public health policies were first implemented and the risk assessment level for social distancing interventions was refined from Level 3 to Level 5, as of November 7, 2020 41 . Moreover, region-specific restrictions on large gatherings, such as prohibiting gatherings of more than four people and closing shops after 10 pm, were imposed during the third wave 42 [...] dose, as of May 10, 2021 2 . COVID-19 has then been maintained without major outbreaks for more than four months after the third wave, from February 2, 2021 to May 10, 2021. Overall, cumulative cases and deaths of COVID-19 in South Korea seem not that large compared to those of other countries with similar population densities, and the duration of each wave seems not too long either. For example, as of July 9, 2021, a total of 814,533 cumulative cases and 14,933 deaths of COVID-19 were reported in Japan, while a total of 165,344 cumulative cases and 2036 deaths were reported in South Korea 4 . Japan's vaccination rate (2.1% of the primary dose and 1.0% of the second dose as of May 10, 2021) and population density (337/km²) are similar to those of South Korea. However, the fourth big wave occurred in Japan from March 2021 to May 2021, with a maximum daily number of confirmed cases of more than 6000. This can be associated with the fact that Japan imposes a voluntary social distancing policy, while South Korea continues to enforce compulsory social distancing policies even after the third wave.
Japan has invoked a number of COVID-19 State of Emergencies, but compulsory policies such as forced suspension or lockdown was not imposed 43 . On the other hand, policies in South Korea such as prohibiting gatherings of more than four people and closing shops after 10 p.m. forcibly prevent further infections from occurring. It is worth mentioning that there are data-related issues in this study. First, the official (reported) data could be different from the real ones due to the selective biases of various kinds 5,6 . Next, other factors such as temperatures, seasonality, UV radiation, pollution, etc. [44][45][46] are not included in the analysis.
South Korea is one of the most successful countries for mitigating and preventing the COVID-19 pandemic. Since South Korea has learned a valuable lesson from the MERS-CoV outbreak, which was the largest outbreak originated from the Middle Eastern countries in 2015, various preparedness plans have been initiated for emerging infectious diseases including medical infrastructure and transparent data disclosure through daily briefings 47 . Real-time infection transmission notification through mobile phone applications or websites, and a real-time alarm system through mobile phone (including location-specific risk notification through GPS) have been newly developed during the COVID-19 pandemic. In addition, South Koreans were quickly alert and carried out voluntary preventing activities such as wearing a mask and prohibiting gatherings. With such an ensemble of national infrastructure and citizens' voluntary participation in quarantine, South Korea demonstrates its superiority in handling COVID-19 outbreaks through successful mitigation strategies.
DMD has been successful in extracting spatial-temporal coherent patterns in a specific form of periodic, growing, and decaying dynamical spectrum decomposition 34 . On the other hand, it is shown that the balance between spatial and temporal resolutions has to be taken into account, since otherwise DMD mode analysis can result in erroneous data interpretation for highly nonlinear time series data. This balance is mathematically identified as the linearity of data in this paper, which means that DMD can in general make sense only for an appropriate selection of windows from the full temporal data sets, chosen so that the spatial resolution is larger than the temporal resolution. This clearly limits the use of classical DMD and/or its variants 22,25 , since it is often useful to extract spatiotemporal patterns from rather long data sets. To overcome this issue, one can select a special set of the time series data with certain labels as discussed in 48 or, more generally, one can use a certain multiscale temporal representation of the data. Namely, one can decompose the temporal steps from fine to coarse so that at the coarse level the global data is linear, while the fine scale is handled only in several local windows. A somewhat similar but different idea, named multiresolution DMD, can be found in 49 . Overall, a systematic method or mathematical modeling for forecasting COVID-19 data is an open and challenging issue. The multiscale approach briefly described above is potentially useful for generating the prediction operator. Lastly, if we can identify the data related to external controls and interventions to stop the spread of COVID-19, then we may be able to apply DMD with control, presented in 50 , for analysis, which is yet to be investigated.

Methods
Compatible window-wise dynamic mode decomposition (CwDMD). In this section, we shall describe the compatible window-wise dynamic mode decomposition (CwDMD), a novel dynamic mode decomposition method that respects the compatibility of the data set. A detailed statement of compatibility will be presented as well. Basically, we present a new observation that consistent data is linear data and suggest that DMD has to be applied to consistent, or linear, data. A compatibility condition is a way of achieving this consistency or linearity of the data set. We shall show that certain windows of the given time series data have to be selected so that a balance between the spatial and temporal resolutions of the data set is achieved. This balance then leads to the linearity of the selected windows. The application of DMD to each window is shown to result in accurate data analysis.
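Throughout this section, for the sake of convenience, we denote by $\mathbb{C}^{n\times \ell}$ the space of complex matrices of size $n \times \ell$. For $n = 1$ or $\ell = 1$, we shall omit writing it. Namely, for $\ell = 1$, we set $\mathbb{C}^{n} := \mathbb{C}^{n\times 1}$, which is the set of complex vectors of size $n$. For any element $c \in \mathbb{C}$, we shall denote its complex conjugate by $\bar{c}$. We shall denote a vector by $\underset{\sim}{\cdot}$ and a tensor by $\underset{\approx}{\cdot}$. For $\underset{\approx}{M} \in \mathbb{C}^{n\times \ell}$, its null space and range will be denoted by $\mathcal{N}(\underset{\approx}{M})$ and $\mathcal{R}(\underset{\approx}{M})$, respectively. Given the data set $\underset{\approx}{T} = [\underset{\sim}{u}_0, \underset{\sim}{u}_1, \ldots, \underset{\sim}{u}_m] \in \mathbb{C}^{n\times (m+1)}$, where $\underset{\sim}{u}_k$ stands for the $k$th snapshot of the data set for $k \ge 0$, with $m + 1$ being the total number of snapshots, we let $\underset{\approx}{X}$ and $\underset{\approx}{Y}$ denote the following:
$$\underset{\approx}{X} = [\underset{\sim}{u}_0, \underset{\sim}{u}_1, \ldots, \underset{\sim}{u}_{m-1}], \qquad \underset{\approx}{Y} = [\underset{\sim}{u}_1, \underset{\sim}{u}_2, \ldots, \underset{\sim}{u}_m].$$
We shall briefly review the general description of the dynamic mode decomposition (DMD) applied to $\underset{\approx}{T}$. For clarity, we assume an ordered sequence of data separated by a constant sampling time $\Delta t$. The idea of DMD lies in the assumption that there exists a linear operator $\underset{\approx}{A}$ that connects, at least approximately, each datum $\underset{\sim}{u}_k$ and its successor $\underset{\sim}{u}_{k+1}$ for all $k \ge 0$, that is
$$\underset{\sim}{u}_{k+1} \approx \underset{\approx}{A}\, \underset{\sim}{u}_k. \qquad (1)$$
The ambiguity in the approximation $\approx$ will be clarified by defining $\underset{\approx}{A} = \underset{\approx}{Y} \underset{\approx}{X}^{\dagger}$, or as the solution to the following optimization problem:
$$\underset{\approx}{A} = \arg\min_{\underset{\approx}{B} \in \mathbb{C}^{n\times n}} \| \underset{\approx}{Y} - \underset{\approx}{B}\, \underset{\approx}{X} \|_F,$$
where $\| \cdot \|_F$ is the Frobenius norm. We note that the operator $\underset{\approx}{A}$ is a type of dynamic operator that relates two consecutive data sets. The goal of the dynamic mode decomposition is to extract the dynamic characteristics of $\underset{\approx}{A}$, not to construct the mapping $\underset{\approx}{A}$ directly. More precisely, DMD obtains the spectra, or spatial-temporal characteristics, of the dynamical process described by $\underset{\approx}{A}$. We note that the spectra can be used to completely construct the action of the operator $\underset{\approx}{A}$ if the need arises.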
The essential algorithmic background lies in the singular value decomposition of the data $\underset{\approx}{X}$ and the relationship between the eigen-pairs of $\underset{\approx}{A}$ and its representation in principal component modes (see Lemma 1 and Lemma 2 in the Supplementary note for Methods). These are used to obtain the standard dynamic mode decomposition algorithm, as provided in Algorithm 1 51 .
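For orientation, a common SVD-based formulation of DMD (often called exact DMD) is sketched below. It is a generic textbook-style variant under stated assumptions, not a transcription of Algorithm 1, and the truncation tolerance `tol` and the function name `dmd` are illustrative choices.

```python
import numpy as np

def dmd(T, tol=1e-10):
    """SVD-based DMD of a snapshot matrix T (n x (m+1)).

    Returns the eigenvalues and the corresponding DMD modes (columns).
    tol is an assumed relative threshold for discarding tiny singular values.
    """
    X, Y = T[:, :-1], T[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # numerical rank of X
    U, s, V = U[:, :r], s[:r], Vh[:r].conj().T
    B = (Y @ V) / s                          # Y V Sigma^{-1}
    A_tilde = U.conj().T @ B                 # operator projected on POD modes
    eigvals, W = np.linalg.eig(A_tilde)
    modes = B @ W                            # eigenvectors of A = Y X^+
    return eigvals, modes

# Synthetic check: the modes are eigenvectors of A = Y X^+.
rng = np.random.default_rng(2)
T = rng.random((6, 5))
lam, Phi = dmd(T)
A = T[:, 1:] @ np.linalg.pinv(T[:, :-1])
print(np.allclose(A @ Phi, Phi * lam))
```

The modulus `np.abs(Phi[:, i])` of a mode then gives a per-region magnitude, and `np.angle(Phi[:, i])` a per-region phase, which are the quantities used in the magnitude and phase analyses.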
Generally, the data analysis can be accomplished through the dynamic modes and eigenvalues, given as $\{\underset{\sim}{\phi}_i\}_{i=1,\cdots,n}$ and $\{\lambda_i\}_{i=1,\cdots,n}$. The $\{\underset{\sim}{\phi}_i\}_{i=1,\cdots,n}$'s are called the DMD modes or mode vectors, and they provide a rich set of information, especially spatial information, about the data set 25 . For example, the modulus of an element of the mode vector provides a measure of the corresponding spatial region's participation in that mode. On the other hand, the eigenvalues $\{\lambda_i\}_{i=1,\cdots,n}$ are relevant to the time evolution of the data sets and thus, they contain temporal information.
Linearity, consistency, and CwDMD. A loophole in DMD is that the DMD spectra are found for an approximate dynamic operator $\underset{\approx}{A}$ for the data set $\underset{\approx}{T}$. It is theoretically unknown how much the error in Eq. (1) leads to misleading data interpretation from the DMD spectra. This is elaborated in Fig. 8 for further clarity. The desired DMD therefore does not start by constructing DMD spectra for an $\underset{\approx}{A}$ that satisfies (1), but builds DMD spectra based on an $\underset{\approx}{A}$ that satisfies the following relationship exactly:

$$\underset{\sim}{u}_{k+1} = \underset{\approx}{A}\,\underset{\sim}{u}_k \quad \text{for all } k \ge 0, \qquad \text{i.e.,} \quad \underset{\approx}{Y} = \underset{\approx}{A}\,\underset{\approx}{X},$$

in which case the data set $\underset{\approx}{T}$ is said to be linear (see the notion of linear consistency defined for $\underset{\approx}{T}$ in Definition 2 and also Theorem 1 of the Supplementary note). We remark that linearity is much more intuitive and general than linear consistency. The notion of linearity is an extension of the existence of a line connecting two points in the two-dimensional Euclidean space consisting of one spatial dimension and one temporal dimension.
On the other hand, we observe that these two concepts, linearity and linear consistency, are in fact equivalent. Namely, the linear consistency of $\underset{\approx}{T}$ holds if and only if the linearity of $\underset{\approx}{T}$ holds (see Theorem 2 in the Supplementary note for a detailed proof). In other words, nonlinear data is inconsistent and inconsistent data is nonlinear. This equivalence is remarkable since these two concepts can be used to derive the so-called compatibility condition, which can be used to easily verify the linearity of $\underset{\approx}{T}$. Note that the linear consistency condition provides an important algebraic condition for the data to be linear. However, the authors find it difficult to verify that condition in general. The concept of compatibility is based on the observation that the linearity of the data $\underset{\approx}{T}$ is tied to the balance between spatial and temporal resolutions. As mentioned, for example, in one spatial dimension, only two points (two temporal data) can in general be connected by a line, unless data consisting of more than two points are collinear. The extension to the higher-dimensional case can be understood as a simple inequality: $m \le n$. More precisely, the compatibility condition can be stated as follows:

Definition (Compatibility Condition) The compatibility condition is the balance between temporal and spatial resolutions, i.e., a data set $\underset{\approx}{T}$ with temporal resolution $m + 1$ and spatial resolution $n$ satisfies $m \le n$.
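The two conditions discussed above can be checked numerically as in the following hedged sketch. Compatibility is the inequality $m \le n$ read off the shape of the snapshot matrix, and linearity is tested by checking whether $\underset{\approx}{Y} = (\underset{\approx}{Y}\,\underset{\approx}{X}^{\dagger})\,\underset{\approx}{X}$ holds up to a tolerance. The function names `is_compatible` and `is_linear` and the tolerance `tol` are hypothetical and not taken from the paper.

```python
import numpy as np

def is_compatible(T):
    """Compatibility: m <= n for a snapshot matrix T of shape (n, m + 1)."""
    n, num_snapshots = T.shape
    return num_snapshots - 1 <= n

def is_linear(T, tol=1e-10):
    """Linearity: the operator A = Y X^+ maps every snapshot to the next one exactly."""
    X, Y = T[:, :-1], T[:, 1:]
    A = Y @ np.linalg.pinv(X)
    residual = np.linalg.norm(Y - A @ X)        # Frobenius norm of Y - A X
    return bool(residual <= tol * max(1.0, np.linalg.norm(Y)))
```

For instance, with $n = 2$ regions and four snapshots ($m = 3 > n$) the test generally fails, whereas keeping only three snapshots ($m = 2$) of the same data generally passes.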
Note that for $m > n$, $\underset{\approx}{T}$ will in general be inconsistent unless it is linear. The compatibility condition is stated so as to cover very general situations in which DMD can be used meaningfully. We can show that under the compatibility condition, DMD will provide meaningful results with probability one. To be more precise, we note that consistency can be easily understood in terms of the linear independence of the columns of the data $\underset{\approx}{X}$, i.e., linear independence of the columns of $\underset{\approx}{X}$ implies the consistency of $\underset{\approx}{T}$, and this in particular excludes the trivial case in which a column of $\underset{\approx}{X}$ is the zero vector. Theoretically, it is established that if $\underset{\approx}{T}$ satisfies the compatibility condition, then for almost all $\underset{\approx}{X} \in \mathbb{C}^{n\times m}$ with $m \le n$ the columns are linearly independent 52,53 . This means that the null space of $\underset{\approx}{X}$ is trivial, so that $\underset{\approx}{T}$ is linearly consistent. Therefore, the data set $\underset{\approx}{T}$ is linear. The compatibility condition thus implies consistency with probability one, i.e., the linearity of the data $\underset{\approx}{T}$ is almost always guaranteed when $m \le n$, which in turn leads to meaningful DMD results. In the very rare case in which consistency breaks under the compatibility condition, one can apply an arbitrarily small perturbation to obtain $\underset{\approx}{T}^{\varepsilon} \in \mathbb{C}^{n\times(m+1)}$, which is proven to yield linear data 54 .
Namely, for $m \le n$, let $\underset{\approx}{X}^{\varepsilon} \in \mathbb{C}^{n\times m}$ consist of the first $m$ columns of $\underset{\approx}{T}^{\varepsilon}$. We then consider the square matrix $\widehat{\underset{\approx}{X}}^{\varepsilon} \in \mathbb{C}^{m\times m}$ obtained from $\underset{\approx}{X}^{\varepsilon}$ by removing all rows below the $m$th row. This square matrix can be proven to be diagonalizable 52,54 , i.e., it consists of linearly independent columns, and thus the columns of $\underset{\approx}{X}^{\varepsilon}$ are linearly independent. In view of the spatio-temporal analysis of the data, an arbitrarily small perturbation will not change the result significantly. Furthermore, theoretically, such an arbitrarily small perturbation will not affect the computation of the DMD spectra, in particular if the perturbation is Gaussian 55,56 . We remark that our data is generally well behaved, i.e., whenever we choose $m \le n$, the data set $\underset{\approx}{T}$ is always linearly consistent, and so no perturbation was needed.
We are now in a position to introduce our new algorithm, the so-called compatible window-wise dynamic mode decomposition (CwDMD). Our observation is that for $m > n$, $\underset{\approx}{T}$ will in general be inconsistent unless it is linear.
As such, a direct and reliable DMD analysis of large time-series data is not feasible in general. The strategy is to choose an adequate set of representative subdomains called windows, each containing a moderate amount of time-series data that satisfies the compatibility condition. The total size-times duration of all the windows serving a given ) . Note that these windows may or may not overlap, depending on the situation. Therefore, the choice of windows can be made without much restriction other than the compatibility condition. This is summarized in Algorithm 2.
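One simple way to produce windows that satisfy the compatibility condition is sketched below. It cuts a series of snapshot indices into consecutive windows holding at most $n + 1$ snapshots each, so that $m \le n$ in every window, with an optional overlap. This is only an illustration of the constraint, not the window selection of Algorithm 2; the function name `compatible_windows` and its parameters are hypothetical.

```python
def compatible_windows(num_snapshots, n, overlap=0):
    """Return (start, stop) index pairs, each covering at most n + 1 snapshots.

    A window [start, stop) with stop - start = m + 1 snapshots satisfies m <= n.
    `overlap` is the number of snapshots shared by consecutive windows.
    """
    width = n + 1
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < n + 1")
    step = width - overlap
    windows = []
    start = 0
    # A window needs at least two snapshots to define one transition.
    while start < num_snapshots - 1:
        stop = min(start + width, num_snapshots)
        windows.append((start, stop))
        if stop == num_snapshots:
            break
        start += step
    return windows
```

With `overlap=1`, consecutive windows share one snapshot, so no transition between consecutive snapshots is dropped at a window boundary.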
Data fitting, dimensional reduction, frequency and phase analysis. In this section, we discuss data fitting using the DMD operator and the choice of modes for dimensional reduction, together with their use in the phase analysis of each window. Throughout this section, we assume that $\underset{\approx}{T} \in \mathbb{C}^{n\times(m+1)}$ is consistent and that the DMD operator $\underset{\approx}{A}$ has the eigenpairs $\{(\lambda_i, \underset{\sim}{\phi}_i)\}_{i=1,\cdots,n}$. We would also like to mention that the precise action of the operator $\underset{\approx}{A}$ may not be found solely from this eigenspectrum. Namely, the data $\underset{\approx}{X}$ has to be represented in terms of the DMD modes, which requires solving a certain optimization problem. In prior work, this has been accomplished by taking into account the whole data $\underset{\approx}{X}$. We shall show that, under the consistency condition, this can be done using any single snapshot in $\underset{\approx}{X}$, thereby achieving a significant computational reduction. We begin our discussion with the fact that almost all complex matrices are diagonalizable 52,54 . Namely, the geometric and algebraic multiplicities of the eigenvalues of almost all complex matrices are identical. This means that the DMD modes form a full set of eigenvectors for almost all data sets satisfying the compatibility condition. A list of equivalent conditions for the algebraic and geometric multiplicities of a matrix $\underset{\approx}{A} \in \mathbb{C}^{n\times n}$ to agree can be found in 57 . For a fixed snapshot index $\eta$, let $\underset{\sim}{\alpha} = [\alpha_1, \cdots, \alpha_n]^{T}$ be the coordinates of $\underset{\sim}{u}_{\eta}$ in the frame of the DMD modes, i.e., $\underset{\sim}{u}_{\eta} = \sum_{i=1}^{n} \alpha_i \underset{\sim}{\phi}_i$. With $\underset{\sim}{\alpha}$ given above, we can obtain the action of the DMD operator $\underset{\approx}{A}$ as follows: for $k \ge \eta$,

$$\underset{\sim}{u}_k = \underset{\approx}{A}^{k-\eta}\,\underset{\sim}{u}_{\eta} = \sum_{i=1}^{n} \alpha_i \lambda_i^{k-\eta} \underset{\sim}{\phi}_i = \sum_{i=1}^{n} \alpha_i\, e^{(k-\eta)\Re(\log(\lambda_i))}\, e^{\hat{i}(k-\eta)\Im(\log(\lambda_i))}\, \underset{\sim}{\phi}_i, \qquad (4)$$

where $\hat{i}$ is the imaginary unit such that $\hat{i}^2 = -1$. We remark that it is standard to choose $\eta = 0$, which is also our choice. DMD is often argued to be biased toward the initial data 24 ; our observation is that this is not really the case for consistent data.
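The single-snapshot fitting described above can be illustrated with a short NumPy sketch. It assumes that the DMD modes (columns of `Phi`) and eigenvalues `lam` are already available, takes $\eta = 0$, and obtains the coordinates $\underset{\sim}{\alpha}$ of one snapshot by least squares. The function names `amplitudes` and `reconstruct` are hypothetical.

```python
import numpy as np

def amplitudes(Phi, u_eta):
    """Coordinates alpha of a single snapshot u_eta in the frame of DMD modes."""
    Phi = np.asarray(Phi, dtype=complex)
    u_eta = np.asarray(u_eta, dtype=complex)
    alpha, *_ = np.linalg.lstsq(Phi, u_eta, rcond=None)
    return alpha

def reconstruct(Phi, lam, alpha, k):
    """Snapshot u_k = sum_i alpha_i * lam_i**k * phi_i, taking eta = 0."""
    lam = np.asarray(lam, dtype=complex)
    return np.asarray(Phi, dtype=complex) @ (alpha * lam**k)
```

When the modes form a full set of eigenvectors and the data is consistent, `reconstruct` reproduces every later snapshot from the coordinates of the initial one, which is why no fit over the whole of $\underset{\approx}{X}$ is required.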
We recall that the framework of the optimized DMD 22 is also designed to obtain the same $\underset{\sim}{\alpha}$ for fitting $\underset{\approx}{X}$, by solving the following optimization problem:

$$\min_{\underset{\sim}{\alpha} \in \mathbb{C}^{n}} \Big\| \underset{\approx}{X} - \big[\underset{\sim}{u}^{\alpha}_0, \cdots, \underset{\sim}{u}^{\alpha}_{m-1}\big] \Big\|_F, \quad \text{where} \quad \underset{\sim}{u}^{\alpha}_k = \sum_{i=1}^{n} \alpha_i\, e^{k\Re(\log(\lambda_i))}\, e^{\hat{i}k\Im(\log(\lambda_i))}\, \underset{\sim}{\phi}_i.$$

It is clear that the consistency of the data leads to a significant reduction of the computational effort. We can now consider a discrete-to-continuous extension of the action of the DMD operator. From the discrete representation of $\underset{\sim}{u}_k$ in (4), a continuous extension can be obtained as follows: for all $t \ge t_0 = 0$,

$$\underset{\sim}{u}(t) = \sum_{i=1}^{n} \alpha_i\, e^{t\Re(\log(\lambda_i))}\, e^{\hat{i}t\Im(\log(\lambda_i))}\, \underset{\sim}{\phi}_i.$$

We now discuss the mode choice for the phase analysis, which will be used to obtain the dimensional reduction of the data. The most natural guide for choosing the important DMD mode is to find the mode that contributes most significantly to the data, both temporally and spatially. This leads us to choose the index of the DMD mode for which the following quantity, the product of the temporal and spatial contributions in each window, is maximized:

$$k^{*} = \arg\max_{k} \; |\lambda_k|^{p}\, \big\| \alpha_k \underset{\sim}{\phi}_k \big\|_F,$$

where $p$ is the temporal resolution of the window. We call the quantity $|\lambda_k|^{p} \|\alpha_k \underset{\sim}{\phi}_k\|_F$ the power of the $k$th DMD mode and observe that in general one or two dominant powers exist. These modes are then chosen to form a dimensionally reduced data set. For example, if $\underset{\sim}{\phi}_k$ is the DMD mode whose power is the largest, then it is used to form the dimensionally reduced data: for all $t \ge t_0 = 0$,

$$\underset{\sim}{u}^{\mathrm{red}}(t) = \alpha_k\, e^{t\Re(\log(\lambda_k))}\, e^{\hat{i}t\Im(\log(\lambda_k))}\, \underset{\sim}{\phi}_k,$$

which is used for data interpretation such as phases and magnitudes. In the literature, DMD modes are chosen based on their norms or on their norms weighted by the corresponding DMD eigenvalues 32 . For example, weighting the norm by the DMD eigenvalues can be interpreted as penalizing spurious modes with large norms but quickly decaying contributions to the dynamics 29 . In our choice, we incorporate $\underset{\sim}{\alpha}$, the coordinates of the data in the frame of the DMD modes, as a special scale for the DMD modes.
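The power of each DMD mode, $|\lambda_k|^{p} \|\alpha_k \underset{\sim}{\phi}_k\|_F$, can be evaluated as in the following sketch. It uses the fact that the Frobenius norm of the vector $\alpha_k \underset{\sim}{\phi}_k$ equals $|\alpha_k|$ times the Euclidean norm of $\underset{\sim}{\phi}_k$. The function names `mode_powers` and `dominant_mode` are hypothetical, and the inputs `Phi`, `lam`, `alpha` are assumed to be already computed.

```python
import numpy as np

def mode_powers(Phi, lam, alpha, p):
    """Power |lam_k|**p * ||alpha_k * phi_k|| of every DMD mode (columns of Phi)."""
    mode_norms = np.linalg.norm(np.asarray(Phi), axis=0)   # ||phi_k||_2 per column
    return np.abs(lam) ** p * np.abs(alpha) * mode_norms

def dominant_mode(Phi, lam, alpha, p):
    """Index of the DMD mode with the largest power."""
    return int(np.argmax(mode_powers(Phi, lam, alpha, p)))
```

Sorting the output of `mode_powers` in decreasing order would give the one or two dominant modes mentioned in the text.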
These measurements are meaningful especially for highly nonlinear data, since the coordinates given in terms of the DMD modes can strongly affect the dynamics of the data. We remark that the frequency of the solution for mode $k$ can be defined as $\Im(\log(\lambda_k))/2\pi$, and thus the period is given by the reciprocal of the frequency. The identified DMD modes can be categorized as periodic, growing, or decaying modes depending on the magnitude of $\lambda_k$. Namely, for eigenvalues on (or close to), outside, or inside the unit circle, the corresponding modes are considered oscillatory, growing, and decaying modes, respectively. In the present work, we set a tolerance $\epsilon = 5\times 10^{-2}$ and denote by $N_o = \{i : ||\lambda_i| - 1| \le \epsilon\}$, $N_g = \{i : |\lambda_i| > 1 + \epsilon\}$, and $N_d = \{i : |\lambda_i| < 1 - \epsilon\}$ the set of oscillatory modes, the set of growing modes, and the set of decaying modes, respectively. We first select the DMD modes of large power, and then measure the magnitudes of their eigenvalues to determine whether they are oscillatory, growing, or decaying modes.
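The classification and the frequency definition above translate directly into code. The sketch below uses only the Python standard library; the function names `classify`, `frequency`, and `period` are hypothetical, the default tolerance is the value $\epsilon = 5\times 10^{-2}$ stated in the text, and time is measured in units of the sampling interval.

```python
import cmath
import math

def classify(lam, eps=5e-2):
    """Label an eigenvalue as 'oscillatory', 'growing', or 'decaying'."""
    r = abs(lam)
    if abs(r - 1.0) <= eps:
        return "oscillatory"                    # on or close to the unit circle
    return "growing" if r > 1.0 + eps else "decaying"

def frequency(lam):
    """Frequency Im(log(lam)) / (2*pi), using the principal branch of log."""
    return cmath.log(lam).imag / (2.0 * math.pi)

def period(lam):
    """Reciprocal of the frequency; infinite for a non-oscillating eigenvalue."""
    f = frequency(lam)
    return math.inf if f == 0 else 1.0 / abs(f)
```

For example, an eigenvalue $e^{2\pi\hat{i}/7}$ lies on the unit circle and corresponds to a period of seven sampling intervals.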