Algorithmic discovery of dynamic models from infectious disease data

Horrocks, Jonathan; Bauch, Chris T.

doi:10.1038/s41598-020-63877-w

Download PDF

Article
Open access
Published: 27 April 2020

Algorithmic discovery of dynamic models from infectious disease data

Jonathan Horrocks¹ &
Chris T. Bauch¹

Scientific Reports volume 10, Article number: 7061 (2020) Cite this article

5783 Accesses
26 Citations
2 Altmetric
Metrics details

Subjects

Abstract

Theoretical models are typically developed through a deductive process where a researcher formulates a system of dynamic equations from hypothesized mechanisms. Recent advances in algorithmic methods can discover dynamic models inductively–directly from data. Most previous research has tested these methods by rediscovering models from synthetic data generated by the already known model. Here we apply Sparse Identification of Nonlinear Dynamics (SINDy) to discover mechanistic equations for disease dynamics from case notification data for measles, chickenpox, and rubella. The discovered models provide a good qualitative fit to the observed dynamics for all three diseases, However, the SINDy chickenpox model appears to overfit the empirical data, and recovering qualitatively correct rubella dynamics requires using power spectral density in the goodness-of-fit criterion. When SINDy uses a library of second-order functions, the discovered models tend to include mass action incidence and a seasonally varying transmission rate–a common feature of existing epidemiological models for childhood infectious diseases. We also find that the SINDy measles model is capable of out-of-sample prediction of a dynamical regime shift in measles case notification data. These results demonstrate the potential for algorithmic model discovery to enrich scientific understanding by providing a complementary approach to developing theoretical models.

Dynamic causal modelling of COVID-19 and its mitigations

Article Open access 20 July 2022

A deconvolution approach to modelling surges in COVID-19 cases and deaths

Article Open access 09 February 2023

GrowthPredict: A toolbox and tutorial-based primer for fitting and forecasting growth trajectories using phenomenological growth models

Article Open access 18 January 2024

Introduction

Dynamic models that capture the governing mechanisms of systems lie at the heart of many scientific theories of natural systems, ranging from planetary motion to epidemiology¹. Since Isaac Newton’s Principia mathematica, research have been devoted to creating models that accurately describe and predict the behaviour of these systems². These models are typically arrived at through a deductive process by hypothesizing mechanisms, formulating dynamic mathematical models that represent those mechanisms, and testing them against data. This tried-and-true approach remains essentially unchanged today.

In the late twentieth century, methods of reconstructing phase spaces or differential equation models from time series data were proposed^3,4. More recently, advances in algorithmic sophistication, computational power, and increased data availability have renewed the development of ‘model discovery’ methods for dynamical system that determine a system of governing dynamical equations from a given dataset⁵. A seminal paper in model discovery uses symbolic regression to recover nonlinear differential equations⁶. This approach automated the process of finding the symbolic structure of the dynamical system governing a natural process. Being able to model a system symbolically rather than numerically is crucial due to the explanatory value of a model built with elementary functions, since in principle it allows prediction under a broad range of possible conditions and not just a replication of the given dataset. In other words, the technique is intended to automatically uncover the nonlinearities that govern system dynamics.

However, early attempts at dynamical systems model discovery were subject to overfitting, as well as being computationally expensive and lacking the ability to scale well to systems with higher dimensionality. Deriving dynamic models from data faces several challenges stemming from large dimensionality. The simplest method to obtain a model that explains the data well minimizes the residual squared error between the predicted response and the data (OLS). This tends to create very complicated models with high descriptive value. However, the models tend to be overfitted to any noise present in the data, compromising their predictive ability⁷. Highly complicated models also exceed human analytic ability, thus detracting from the interpretability of the model.

Most alternative methods to OLS either use subset selection, which attempts to identify some subset of the predictors that adequately describes the system while disregarding the rest^8,9, or shrinkage (regularization), which fits the model using all of the available predictors but forces the coefficients of select predictors towards zero, thereby performing a kind of variable selection. For instance, SINDy (Sparse Identification of Nonlinear Dynamics) is a recent breakthrough that automates model discovery through sparsity-promoting regression techniques^{10,11,12,13,14,15,16,17,18,19}. SINDy begins with a large library of nonlinear terms. The algorithm fits the model to the data with the current library, removes any nonlinear terms from the library that have small fitted coefficients, and repeats the process. This progressively shrinks the size of the library until a relatively small system of differential or difference equations with good explanatory power for the given dataset is obtained. This inductive approach contrasts with the deductive approach of first formulating a model, inferring its parameters from data, and then testing its predictive power²⁰.

Epidemiological systems have long been studied with dynamic models, on account both of their public health relevance and the complex patterns exhibited by epidemics (Fig. 1)^{21,22,23,24,25,26,27,28}. Contemporary epidemic modelling approaches can be traced to the work of Kermack and McKendrick in 1927²⁹. Their compartmental model partitions a population into a series of mutually exclusive compartments–such as susceptible, infected or recovered–and models how individuals move between the compartments^23,30,31. Nonlinearity typically arises from infection transmission. For instance, the commonly used mass-action mixing principle posits that the number of new infections is the product of the number of susceptible and infectious individuals^23,30,31. This principle was originally used to describe the rate of a well-mixed chemical reaction by relating it to the concentration of reactants³², hence compartmental models conceptualize new infections as resulting from random mixing between susceptible and infectious individuals. These models have been expanded in many ways to account for observed epidemic patterns. For instance, seasonal variation in the transmission rate can give rise to complex dynamics such as bifurcations, oscillations and deterministic chaos^{21,22,23,24,25,26,27,28}. The spatial and temporal dynamics of many childhood infectious disease such as measles, pertussis, rubella and chickenpox are well-described by these dynamic models.

Epidemic models are one of several model systems that have been used to validate SINDy through an approach known as model rediscovery^11,19. In this approach, synthetic data are generated by adding noise to the output from a pre-specified model, and the algorithm is applied to the synthetic data to study the conditions under which it can discover the original model¹⁰. Through this process, SINDy has shown it can not only successfully rediscover the original epidemic models, but it many cases it can also generate hierarchies of models of varying complexity, including new models that can fit the synthetic data but have different model equations from the original model^11,19. SINDy has also been used for discovery of new models from empirical data in experimental mechanics and optics^33,34 as well as to discovery of new models and model reductions from synthetic fluid mechanics data^35,36. Thus far, however, SINDy has not yet been tested in model discovery from empirical data on infectious disease dynamics. This represents a very different challenge for model discovery algorithms: the complicated and noisy nature of real-world epidemiological data represents a challenge for generating sparse, low-dimensional models that can provide mechanistic insights or predict real-world dynamics.

Here, using minimal prior epidemiological knowledge, we apply SINDy to time series data from measles, rubella and chickenpox to discover dynamic models that govern their epidemic patterns, and we compare the resulting models to a standard compartmental model for these infections. We choose epidemiological systems because complex biological systems present a nontrivial challenge for SINDy, and these three infectious diseases display a diverse range of nontrivial dynamical behaviour. At the same time, a significant amount of research in compartmental epidemic models suggests that we have a good idea of what kind of discovered models would satisfy the criteria of interpretability and predictive power we expect from a dynamic model. This exercise allows us to determine whether inductive model discovery techniques have potential to identify new dynamic epidemiological models. Perhaps more importantly, this exercise could also tell us whether model discovery can enrich and nuance our understanding of existing study systems by uncovering a role for nonlinear mechanisms that were previously not considered. In the next section, we first apply SINDy to model rediscovery of a compartmental model with a seasonally varying transmission rate. Then we apply SINDy to model discovery from infectious disease case notification data for measles, rubella and chickenpox, and compare it to a standard compartmental epidemic model.

Results

Model rediscovery from simulated data

For model rediscovery we use a discrete-time Susceptible-Infectious-Recovered (SIR) model accounting for demographic processes (birth and death) and seasonal variation in the transmission rate:

$${S}_{t+1}={S}_{t}+\nu -\beta (t){S}_{t}{I}_{t}-\mu {S}_{t},$$

(1)

$${I}_{t+1}={I}_{t}+\beta (t){S}_{t}{I}_{t}-\gamma {I}_{t}-\mu {I}_{t},$$

(2)

$${R}_{t+1}={R}_{t}+\gamma {I}_{t}-\mu {R}_{t},$$

(3)

where ${S}_{t}$ (${I}_{t}$, ${R}_{t}$) is the number of susceptible (infectious, recovered) persons, $t$ is the timestep, $\nu $ ($\mu $) is the per capita birth (death) rate per timestep, $\gamma $ is the per capita recovery rate per timestep, and $\beta (t)$ is the seasonally-varying transmission rate with form given by

$$\beta (t)={\beta }_{0}(1+{\beta }_{1}\,\cos \,(2\pi t/T-\phi )),$$

(4)

where $T=1$ year is the period of the oscillation and $\phi $ is the phase shift corresponding with the seasonal behaviour of the transmission rate. We interpreted one timestep to correspond to one week. The mass action incidence term (${\beta }_{t}{S}_{t}{I}_{t}$) appears in both the ${S}_{t+1}$ and ${I}_{t+1}$ equations. Note that we may treat the state variable $R$ as redundant since it does not appear in the other equations, and thus we may exclude it from simulations.

We used a discrete-time instead of a continuous-time (differential equation) model because SINDy necessitates evaluating the derivative of the input data, which for a continuous time model applied to empirical data can yield noisy and unpredictable values, making it difficult for the sparse regression algorithm to obtain a global minimum. In contrast, applying SINDy to a discrete-time modelling framework lends itself naturally to the weekly temporal resolution of the empirical data we used. Hence we use a discrete-time compartmental model for both model rediscovery as well as for model discovery from empirical data. However, we note that model rediscovery with SINDy also works with a continuous-time compartmental model.

We solved the discrete-time SIR model numerically and added Gaussian noise to the output state variables to generate a simulated dataset. SINDy was then applied to the simulated dataset using a function library consisting of all polynomials involving ${S}_{t}$ and ${I}_{t}$ up to and including second order, for both constant and seasonally-varying coefficients (see Methods). This process was repeated for a range of noise magnitudes. ‘Coefficient’ refers to the weighting assigned by SINDy to a given term, but we will use the terminology interchangeably.

For sufficiently small noise, SINDy was able to recover the original model with a very high accuracy by correctly recovering the coefficients of the SIR model (Fig. 2). However, at higher (but still relatively small) noise magnitudes, SINDy begins to overfit the noise in the simulated data and, while it still correctly identifies the seasonally-varying mass action mixing term as having the largest coefficient, it also incorrectly identifies other nonlinear terms as having a role in the model, such as ${I}_{t}^{2}$ (Fig. 2). The relatively small values for noise at which these spurious terms are introduced illustrates the challenges that noisy data present for SINDy. (However, it is also possible that other methods for regularization not explored here might enable SINDy to fit the data well even for the higher noise case.) Results for other noise levels appear in Supplementary Figs. 1–4.

Model discovery from empirical data

We studied three datasets consisting of case notifications for measles in England and Wales (1948–1967); chickenpox in Ontario, Canada (1946–1967); and rubella in Ontario, Canada (1946–60) (Fig. 1). Each provides a different example of an attractor class: measles is biennial, chickenpox is predominantly annual, and rubella is multiennial with a weaker annual signatures as well^25,27. Hence, these data represent an interesting test of whether a SINDy model discovered from a library of annually-forced transmission rates can generate a model that can predict not only annual but also biennial or multiennial dynamics. Our baseline analysis used a second-order polynomial library. However, we also tried a third-order polynomial library, which in principle allows capturing more features of the data but also risks more overfitting of noise. The data in Fig. 1 was first smoothed to reduce the risk of overfitting (see Methods).

The datasets describe the case incidence (number of new cases per week), but our SINDy state variables concern the prevalence of epidemiological states: the number of susceptible individuals (${S}_{t}$) and infectious individuals (${I}_{t}$, the infection prevalence) at any given time $t$. Hence we converted case incidence to case prevalence to obtain the required input time series of the number of infectious individuals (see Methods).

We do not have data on the temporal dynamics of susceptible individuals, so we reconstructed a time series of the number of susceptible individuals using a standard method (see Methods)³⁷. The method requires the initial number of susceptible persons, ${S}_{0}$, and this initial condition can strongly influence disease dynamics. The number of susceptible individuals in a given year is not known for any of our historical datasets. Similarly, as mentioned in the Introduction, coefficients below a certain sparsity threshold value ($\lambda $, the “sparsity knob”) are removed from the library at each iteration of SINDy. But, there is no a priori knowledge to guide the selection of the value of $\lambda $. Hence, we applied SINDy to each point of the ${S}_{0}-\lambda $ parameter grid. The Akaike Information Criterion (AIC) score of the model identified at each point on the parameter grid was computed to give us a way of measuring model parsimony across the grid³⁸. To quantify sparsity of the coefficient matrix we introduced a sparsity index which is the ratio of the number of library functions with zero coefficients to the total number of functions in the library:

$$r=1-\frac{\parallel \Xi {\parallel }_{0}}{\parallel \Theta {\parallel }_{0}},$$

where $\Xi $ is the set of coefficient vectors and $\Theta $ is the collection of library functions.

We found that SINDy discovered models that reflect the observed dynamics of infectious individuals for both measles and chickenpox (Figs. 3 and 4). This includes both the biennial attractor of measles and the annual attractor of chickenpox. The models are not very sparse ($r=0.25$) although the remaining coefficients differ greatly in the magnitudes that SINDy assigns to them. For measles, SINDy assigned the largest coefficients to the bilinear incidence terms $SI$ and $\beta SI$, corresponding to constant and seasonally varying mass-action incidence terms respectively, as in the discrete-time SIR model (Fig. 3). SINDy also assigned large values for the $SI$ and $\beta SI$ coefficients for chickenpox, although the $\beta II$ term was assigned the largest value overall, suggesting overfitting (Fig. 4). The model that SINDy discovered for rubella was more sparse ($r=0.7$), and SINDy also assigned the largest coefficients to the mass-action mixing terms $SI$ and $\beta SI$, although the model predicts an annual attractor instead of the observed multiennial attractor (Fig. 5). This occurs despite the fact that models with bilinear incidence are capable of generating multiennial attractors a rubella^25,27. We reserve a more complete discussion of overfitting and the interpretation of SINDy model coefficients for a following subsection.

The fit between model and data for measles and chickenpox is low compared to many well-controlled examples from physics¹⁰, but it is comparable to the agreement achieved in many studies in epidemiological modelling³⁹. This is because biological and social systems are complex and have multiple nonlinear feedbacks, in addition to numerous environmental heterogeneities. We discuss the case of rubella further in a following subsection on using power spectral densities.

The SINDy-discovered models confirm that seasonal forcing plays an important role in epidemics of childhood infectious diseases. The seasonally forced transmission rate (BSI) is always the first or second largest term in the discovered models. This provides a separate line of evidence in support of deductively derived models^25,27 that support the role of seasonal variation in the transmission rate. Also, despite the fact that seasonal forcing occurs at a period of one year, SINDy can recover a seasonally forced model that exhibits a biennial attractor, hence these results provide another line of evidence that seasonal forcing can generate attractors of multiple different periods²⁵, some of which are epidemiological relevant. The emergence of a biennial attractor from seasonal (annual) forcing happens due to the interplay between forcing dynamics, the gradual build-up of new susceptible individuals through births, and its rapid depletion during epidemic periods.

Power spectral density

The results for rubella (Fig. 5) highlight a limitation of using goodness-of-fit to time series as the criterion. As noted, a bilinear incidence term can generate multiennial attractors under seasonal forcing, but SINDy selected a model that generates an annual attractor under a second-order library. In terms of qualitative descriptions of the data, a model that generates multiennial attractors of the right frequency but with epidemic peaks at different years than what is observed in the dataset is perhaps more desirable, and says more about underlying epidemiological mechanisms, than a model that simply yields an annual attractor.

Hence, we developed a modification of our approach such that the goodness-of-fit between the power spectral density (PSD) of the model and the data is used as the criterion for assessing models, instead of fit between the prevalence time series (see Methods). Using this approach for rubella with a second-order polynomial library yields a model that reproduces the large multiennial outbreaks observe in the empirical dataset, as well as the notable tendency for seasonal rubella incidence to ramp up slowly in the first part of the season, but decline very quickly in the second part (Fig. 6). The discovered model also includes a strong contribution from the mass-action incidence term $\beta (t)SI$.

The estimated power spectral densities for the empirical data and the best SINDy models appears in Supplementary Figs. 5–7 for all three diseases. These figures show that dominant peak of the SINDy model PSD matches that of the empirical data, for all three diseases: 1 year for chickenpox, 2 years for measles, and 5 years for rubella. However, the SINDy model for chickenpox fails to capture some of the lower frequencies present in the empirical data.

Interpretation of SINDy model coefficients

Inspection of the SINDy coefficients for the $I$ variable in the baseline measles and chickenpox analyses (Figs. 3 and 4), as well as the PSD analysis for rubella (Fig. 6) suggests that SINDy is producing meaningful models for the measles baseline and rubella PSD analyses, but could be overfitting the data in the chickenpox baseline analysis. (As noted above, SINDy exhibits a poor fit in the rubella baseline analysis, Fig. 5). We focus our discussion on the $I$ equation for which observational data are available to SINDy, unlike the case for the $S$ equation where SINDy must fit reconstructed data.

The SINDy models for the measles baseline and rubella PSD analyses exhibit biologically meaningful coefficients. We observe order 10¹ two-variable terms pertaining to transmission (such as $SI$); order 10⁰ single-variable terms pertaining to demographic processes (such as $S$, $I$); and order ${10}^{-1}$ or 10⁰ ${S}^{2}$ terms. All other terms are of order ${10}^{-2}$ or smaller. The relative orders of magnitude of the transmission and demographic-related terms are expected given that epidemic processes occur an order of magnitude faster than demographic processes. We also note that the SINDy models include both SI (constant) and BSI (seasonally varying) terms. This is also expected since the transmission rate in the SIR compartmental model $dI/dt=({b}_{0}+{b}_{1}\,\sin (t))SI+\cdots $ includes both seasonally varying and constant contributions to the total transmission rate^25,27. Their relative magnitude varies depending on how important seasonal forcing is for dynamics. The sign of the BSI term may be positive or negative, depending on the phase shift.

The presence of the positive ${S}^{2}$ term in the $I$ equation both the baseline measles and rubella PSD models is notable. Adding a term of the form ${I}_{t+1}=+\,\alpha {S}_{t}^{2}$ ($\alpha > 0$) to the $I$ equation has the effect of slowing down and linearizing the epidemic curve in the early stages of the epidemic. This is because the transmission portion of the equation changes from ${I}_{t+1}=\beta {S}_{t}{I}_{t}+\cdots $ (as in the SIR model) to ${I}_{t+1}=\beta {S}_{t}{I}_{t}+\alpha {S}_{t}^{2}\cdots =\beta {S}_{t}({I}_{t}+\frac{\alpha }{\beta }{S}_{t})+\cdots $. As a result, the force of infection changes from $\beta {I}_{t}$ to $\beta ({I}_{t}+\frac{\alpha }{\beta }{S}_{t})$. In the early stages of the outbreak ${S}_{t}\gg {I}_{t}$ and ${S}_{t}\approx {S}_{cst}$ is approximately constant, hence, the force of infection is approximated by $\beta ({I}_{t}+\frac{\alpha }{\beta }{S}_{cst})$; this functional form has a smaller slope with respect to changes in ${I}_{t}$, meaning that the epidemic will growth subexponentially. We also note that $\alpha < 0$ has the opposite effect and could even make incidence of new infections decline, if $\alpha $ is large enough.

The ${S}^{2}$ coefficient is positive in the rubella PSD and measles baseline SINDy models, but absent for the rubella baseline model and has the wrong sign for the chickenpox baseline model. The empirical case notification data for measles and rubella show a corresponding pattern of a slower initial growth (in the case of measles, driven partially by small outbreaks in even years), followed by a more rapid decline after the peak of the outbreak. The SINDy model time series for the baseline measles and rubella PSD analyses share this feature as well. However, this pattern is not apparent in the chickenpox data or in the baseline chickenpox SINDy dynamics. Moreover, in contrast to the measles baseline and rubella PSD SINDy models, conventional compartmental models always show a symmetric epidemic curve, or faster growth followed by a slower decay. This demonstrates an advantage of the SINDy models over classic compartmental models for capturing subexponential epidemic growth in the early stages of an epidemic–a time when accurate projection of new cases is most important. It also suggests a simple way that early subexponential growth could be captured phenomenologically in simple compartmental models without the need to resort to more complicated models such as spatial models (which can also capture subexponential growth^40,41).

The $I$ equation for the baseline chickenpox SINDy model is strongly dominated by the seasonally varying $BII$ coefficient. This suggests that SINDy is overfitting the chickenpox data. This outcome is made more likely by the fact that the chickenpox data exhibit a simple annual cycle that can be better explained by a sinusoidal oscillation than a nonlinear transmission model. In comparison, measles dynamics show a two-year periodicity, and rubella dynamics show a multi-year periodicity. In both latter cases, interaction between nonlinear mass-action infection dynamics, vital dynamics, and seasonal forcing appears necessary to explain their rich dynamics^25,27,28.

Effect of changing sparsity knob λ

Our approach applies SINDy to a grid of possible values for the initial number of susceptible individuals ${S}_{0}$ and the sparsity knob $\lambda $ which determines the threshold for removing terms from the library in each iteration of SINDy. The sparsity knob is particularly important because if it value is set too low, the discovered model will include most (or all) of the functions in the library, resulting in overfitting. Conversely, if the threshold is set too high then features required to emulate the dynamics of the system may be removed, resulting in a model that does not resemble the data in a meaningful way.

Hence, in order to balance sparsity with goodness-of-fit, we focussed on the models that yielded the lowest AIC score across the grid for the foregoing results (Figs. 3–5). This approach shows there is an optimal region in the ${S}_{0}-\lambda $ plane that ensures the SINDy algorithm can generate regularized, accurate models from empirical disease data (Supplementary Figs. 8–15). For instance, in the case of measles, the best AIC corresponded to a model using an initial susceptible value of ${S}_{0}=0.11286$ and a sparsity knob of $\lambda =0.00517$ (Fig. 3). In contrast, a much lower sparsity setting ($\lambda =0.0001$) discovers a model that is overfitted to apparently random features of the data and that includes all the terms of the library, whereas a much higher sparsity setting ($\lambda =0.1$) discovers a model that exhibits annual attractor instead of the characteristics biennial attractor of measles (Supplementary Fig. 9).

Third-order polynomial library

When a third-order library is used instead, the discovered models for measles and chickenpox capture the observed epidemiological dynamics as well as the second-order library does, and the models have similar sparsity indices (Supplementary Figs. 16–18). SINDy continues to assign significant weight to the bilinear incidence terms $SI$ and $\beta SI$, but SINDy assigned even stronger weight to the ${S}^{2}I$ and $S{I}^{2}$ terms (and their corresponding seasonal terms). This could indicate overfitting, or it may indicate a more general nonlinear incidence mechanism in the underlying system. It has been shown that an incidence function of the form ${S}^{p}{I}^{q}$ (where $p,q > 0$) may more adequately represent some endemic cycles, a form that is present when a 3rd order polynomial library is used^42,43. In the case of rubella, SINDy generates a model that captures the multiennial attractor observed in rubella dynamics (Supplementary Fig. 15) although the model is not very sparse ($r=0.15$) and is strongly dependent on trilinear incidence terms ${S}^{2}I$ and ${I}^{3}$. Parameter planes for ${S}_{0}$ and $\lambda $ and examples of predictions for very low and high sparsity thresholds appear in Supplementary Figs. 19–27.

Comparison with compartmental epidemic models

We compared the models discovered by SINDy to the classical discrete-time SIR compartmental model with seasonal forcing and mass-action mixing. We fitted Eqs. (1)–(4) to the case prevalence and reconstructed susceptible time series for all three infections by sweeping across parameter grids and minimizing the least-squared error between model and data time series (see Methods). We thereby inferred the parameter values $\nu $, $\mu $, ${\beta }_{0}$, ${\beta }_{1}$ and $\gamma $ and compared these parameter values to the coefficients of functions determined from SINDy for the models in Figs. 3–5.

These comparisons show that SINDy tends to select the coefficients corresponding to mass-action mixing, often with seasonal forcing as well, and that these coefficients have a similar magnitude and sign as the inferred parameters of the compartmental SIR model (Fig. 7). The similarities are strongest for measles and chickenpox. This demonstrates that SINDy is effective in capturing the theoretical principles of mass action incidence–a core mechanism of most epidemiological models–as well as seasonal forcing, which is required to explain endemic patterns of many childhood infectious diseases in the pre-vaccine era. The SINDy models also depend on the linear terms for each of the $S$ and $I$ equations. Usually these have a similar magnitude and sign as in the SIR model. These terms correspond to vital dynamics (births and deaths). However, several other features are noticeably different from the inferred SIR model. For instance, SINDy often infers a different magnitude and/or sign than the SIR model, particularly for the baseline rubella analysis that provides a poor fit to the empirical case notification data (Fig. 5).

SINDy model can predict qualitative shifts in empirical measles dynamics out-of-sample

To test whether models discovered by SINDy can predict real-world dynamics, we studied the ability of the second-order SINDy baseline measles model (Fig. 3) to make out-of-sample prediction of regime shifts in measles dynamics. We used a classic example of nonlinear measles dynamics in the United Kingdom^25,27. During 1948–1967, the recruitment rate of new susceptible individuals in England & Wales was roughly constant and measles incidence exhibited a clear biennial pattern (Fig. 1). However, the recruitment rate of new susceptible individuals dropped significantly in 1967, due to the introduction of mass vaccination and a decline in birth rates, causing dynamics to shift suddenly to a more irregular pattern dominating the time period 1968 to 1988^25,27. Previous research has shown that simple compartmental epidemic models can predict this shift, as well as the patterns observed before and after the shift^25,27. These models characterize the dynamics from 1948–1967 as a biennial attractor²⁵ that is relatively stable to perturbations from noise (on account of the fact that the period of the biennium’s Floquet multiplier–which predicts response to noise–is two years and thus exactly matches the two-year period of the biennial attractor²⁷. The dynamics from 1968–1988 are characterized as an annual attractor that is more easily perturbed by noise (on account of the Floquet multiplier of the annual attractor having a non-annual period greater than one year²⁷). These differences have implications for the power spectra of the time series of infection incidence. In the power spectra of both the classical models and the empirical data, we observe a shift from a dominant peak at two years and very little power elsewhere in the spectrum (except for a supporting annual peak) during the biennial era, to a dominant peak at one year and a second peak at a period of approximately 2.5 years (corresponding to the period of the attractor’s Floquet multiplier), during the era of irregular dynamics²⁷.

The second-order SINDy model for measles was discovered from the incidence patterns observed during the biennial era (Fig. 3). We hypothesized that if the SINDy model is discovering real-world mechanisms and not over-fitting the data, it should predict the same transition observed in the empirical data and predicted by previous models^25,27. To test this hypothesis, we reduced the susceptible recruitment rate in the SINDy model (the coefficient of the $S$ term) to match the drop in the susceptible recruitment rate observed in the United Kingdom from 1948–1967 to 1968–1988 and we compared the time series and power spectra of infection incidence and susceptible individuals before and after the drop. We also added white noise to the simulations to test the response of the attractors to noise. We found that the SINDy model predicts the same transition. In the biennial era, the SINDy model predicts a stable biennial cycle that is relatively robust to noise: the time series of infection incidence shows a clear biennial pattern and the power spectrum shows a strong peak at a period of two years, a supporting peak at one year, and little else (Figs. 3 and 8). In simulations where the coefficient of the $S$ term is reduced to capture a drop in the recruitment rate, the SINDy model predicts a noisy annual cycle: the time series of infection incidence shows a dominant annual signature in the presence of significant irregularities and perhaps an envelope of a non-seasonal periodicity, and the corresponding power spectrum shows a strong annual peak with a secondary peak at a higher period, caused by the response of the annual attractor to noise. The time series also resemble the empirical post-vaccine dynamics^25,27. Hence, the SINDy model is predicting the empirically observed regime shift in measles dynamics out-of-sample. (Predicted susceptible dynamics are also shown in Figs. 8 and 9 but were not analyzed in refs. ^25,27. Ten additional stochastic realizations with different starting random number seeds and their power spectra appear in the SI Appendix: Figs. 28–37. These show similar patterns to Figs. 8 and 9).

Discussion

Model discovery generates models inductively from data, using minimal prior knowledge about the system. This differs from the deductive approach that currently dominates model development in most fields, including theoretical epidemiology. Here we demonstrated that Sparse Identification of Nonlinear Dynamics (SINDy) can discover dynamical system models from empirical data on childhood infectious diseases from the pre-vaccine era. These inductively derived models (1) reproduce the observed dynamics of measles, chickenpox and rubella, (2) recover prominent features of deductively derived models, such as the fundamental mechanisms of mass-action mixing and seasonal forcing that underpin decades of research in theoretical epidemiology, (3) confirm the important role of seasonal variation in the transmission rate for dynamics of childhood infectious diseases, and (4) can predict regime shifts in dynamical patterns for measles, out-of-sample. Hence, our results show that model discovery methods could lend insight to both model creation as well as understanding of epidemiological mechanisms. Because this approach to developing the dynamic models is fundamentally different from the conventional approach, it raises the possibility that real-world mechanisms could be discovered that would otherwise be difficult or impossible to find through conventional ‘deductive’ modelling.

However, we also note that the model discovered by SINDy appeared to be overfitted in the case of chickenpox. Also, we were not able to obtain a good fit to the empirical rubella time series data; this prompted us to fit the power spectral density instead, which yielded a reasonable qualitative agreement between model and data. The approach of using power spectral density as a fitting criterion might be desirable for describing the qualitative outbreak patterns of infectious diseases like rubella, which in our dataset exhibits large epidemics every 6–7 years interspersed with small annual epidemics. These aspects of our findings emphasize the challenges in applying SINDy to complex, noisy biological systems.

Limitations to the approach stem from data quality and availability, regression algorithms and criterion for selecting parameters like the sparsity threshold. Case notifications are typically under-reported even when they are available, and our approach requires reconstruction of the susceptible time series, which necessitates making assumptions^37,44. SINDy may be sensitive to the method used to reconstruct the susceptible time series, and these methods might also bias SINDy toward selecting certain terms. Unfortunately, susceptible reconstruction is necessary due to the lack of high-quality longitudinal serological data, but a careful sensitivity analysis in future work could help us better understand the impact of differing methods of susceptible reconstruction.

Applying SINDy to vaccine era data will require further thought because the vaccination in the 20th century changed the dynamics of childhood infectious disease dramatically^25,27,44,45 and introduced the additional dimension of human behaviour through vaccine decision-making^46,47,48. Fortunately, data are increasingly abundant in an era of digital data, open sharing, and online social media^49,50, although this does not necessarily translate into higher quality data. And, the risk of overfitting to noise is always present and will only increase as more data becomes available⁵¹. In this case, we tested whether the model was over-fitted by analysing its out-of-sample prediction of empirical measles dynamics. However, over-fitting could also be tested varying the window of the moving average filter or subsampling the data before filtering.

The sensitivity of SINDy to noise (despite improvements over previous algorithms in this respect) is another limitation, and was illustrated in our model rediscovery subsection. Re-discovered models may describe the datasets very well in the presence of noise, but the spurious inclusion of terms that do not appear in the original model suggests caution when interpreting models discovered from noisy empirical data. We hypothesize that noise is also the cause of unexpected terms in our models discovered from empirical data, especially for the results using a third-order library. Part of the solution to this problem will be improved regression approaches that are less likely to overfit noise. However, the other part of the solution is inevitably the thoughtful application of subject expertise by humans when interpreting discovered models.

Another limitation is the choice of functional basis. This might be the largest limitation, since human bias is introduced when formulating the function library that SINDy starts with¹⁰. Our choice of seasonally varying transmission rates and polynomial functions may be limiting the discovery of a more parsimonious and interpretable model based on other terms, at least in principle. The state variables themselves are also pre-defined and may not be the best choice to recover the dynamics of the data.

There remains much opportunity to develop further techniques that assist in applying sparse identification methods to epidemiological data. A more exhaustive literature review of current transmission modelling practices would aid in determining a functional basis that could successfully capture a sparse model using subset selection methods. There exist many modern adaptations on compartmental modelling of infectious diseases^52,53 which incorporate functions that extend beyond a simple polynomial basis constructed from state variables. Specifically, a general nonlinear incidence (a transmission function of the form ${S}^{p}{I}^{q}$, where $p,q > 0$) could be explored. Extending the function library to include non-integer values for $p$ and $q$ may allow a more accurate representation of the transmission mechanism and result in a more parsimonious discovered model. In addition, the choice of a sinusoidal function for the transmission rate may not be optimal²⁵. While an alternative based on forcing from the school year calendar was tested in our research, further analysis in this area is necessary.

In conclusion, we have shown that SINDy can be applied to epidemiological data to yield models that describe observed epidemic patterns and correspond well with canonical models like compartmental epidemic models. In the case of measles, it can also predict dynamical shifts out-of-sample. Although inductive model discovery methods come with their own set of challenges and limitations, their radically different approach to model generation suggests they can form a powerful complement to traditional modelling approaches, thereby improving our scientific understanding for many natural systems.

Material and Methods

Sparse Identification of Nonlinear Dynamics (SINDy)

This work builds on the sparse regression methods outlined in ref. ¹⁰. Given the recent advances in both compressed sensing^54,55,56 and sparse regression^7,57 it has become computationally feasible to extract system dynamics from large, multimodal datasets. These techniques rely heavily on the fact that many dynamical systems can be represented by governing equations that are sparse in the space of all possible functions. In this work we focus on dynamical systems that are given by a system of ordinary differential equations of the form

$$\dot{x}=f(x(t),t),$$

(5)

where $x(t)=({x}_{1}(t),{x}_{2}(t),\ldots ,{x}_{n}(t))$ represents the state of the n-dimensional system at time $t$, and $f=({f}_{1},{f}_{2},\ldots ,{f}_{n})$ is the sparse set of functions that dictate the dynamics of the system.

It is assumed that the time series data is sampled at points ${t}_{1},{t}_{2},\ldots ,{t}_{m}$ for both $x$ and $\dot{X}$, usually given as either data from simulations or empirical data from measurements. Depending on the system in question, numerical differentiation methods to approximate $\dot{x}$ that are well-suited for the level of noise must be used. The method used in ref. ¹⁰ is total variation regularization^58,59 that works well on a noisy system when only the state variables are available. Alternatively, a discrete adaptation of SINDy may be used, where the response of the system $f({x}_{t},t)$ is ${x}_{t+1}$. Regardless, the time series data of the state variables and the response are represented by the matrices

$$\begin{array}{rcl}X & = & [\begin{array}{cccc}{x}_{1}({t}_{1}) & {x}_{2}({t}_{1}) & \ldots & {x}_{n}({t}_{1})\\ {x}_{1}({t}_{2}) & {x}_{2}({t}_{2}) & \ldots & {x}_{n}({t}_{2})\\ \vdots & \vdots & \ddots & \vdots \\ {x}_{1}({t}_{m}) & {x}_{2}({t}_{m}) & \ldots & {x}_{n}({t}_{m})\end{array}]\\ \dot{X} & = & [\begin{array}{cccc}{\dot{x}}_{1}({t}_{1}) & {\dot{x}}_{2}({t}_{1}) & \ldots & {\dot{x}}_{n}({t}_{1})\\ {\dot{x}}_{1}({t}_{2}) & {\dot{x}}_{2}({t}_{2}) & \ldots & {\dot{x}}_{n}({t}_{2})\\ \vdots & \vdots & \ddots & \vdots \\ {\dot{x}}_{1}({t}_{m}) & {\dot{x}}_{2}({t}_{m}) & \ldots & {\dot{x}}_{n}({t}_{m})\end{array}].\end{array}$$

We then construct a library of linear and nonlinear candidate functions for the model, given prior knowledge of the system we wish to describe. Common choices for these functions are polynomial and trigonometric functions of the state variables, though other functions (e.g. exponential, rational) functions may be included as well. This function library is then evaluated at each time-step, generating the $m\times p$ matrix

$$\Theta (X)=[1\,X\,{X}^{{P}_{2}}\,{X}^{{P}_{3}}\,\ldots \,\sin (X)\,\cos (X)\,\sin (2X)\,\cos (2X)\,\ldots ],$$

(6)

where ${X}^{{P}_{n}}$ represents all possible polynomials of degree $n$ that can be constructed by the state variables. Now, relying on the assumption that the derivative $\dot{X}$ can be described by relatively few of the nonlinearities active in $\Theta (X)$, we may set up the sparse regression problem

$$\dot{X}=\Theta (X)\Xi ,$$

(7)

where $\Xi =({\xi }_{1},{\xi }_{2},\ldots ,{\xi }_{p})$ is a set of sparse coefficient vectors.

There are several current methods that have been developed to perform sparse regression. A common choice is the LASSO (least absolute and shrinkage operator)^7,57, a regression method that promotes sparsity by applying an ${l}_{1}$ penalty on the norm of the coefficient vector. However, this method does not scale well to large datasets. This paper utilizes an iterative method developed by Brunton et al., as described below:

1.
Perform a least-squares regression on the relation in Eq. [7].
2.
Set all terms in Ξ that are less (in absolute value) than some threshold $\lambda $ to zero.
3.
Create new library Θ′, dropping functions that correspond to zero entries in Ξ.
4.
Repeat steps 1–3 until equilibrium (i.e. no terms in Ξ are smaller in magnitude than λ), or some other stopping criteria is reached.

This yields the set of sparse vectors that provides an approximate solution to Eq. [7]. We can then reconstruct the $k$th row of the dynamical system by taking

$${\dot{x}}_{k}=\Theta ({x}_{k}^{T}){\xi }_{k},$$

(8)

where Θ$({x}_{k}^{T})$ is the symbolic representations of the elements of x.

Finally, combining all of the rows of the discovered dynamical system results in the system of equations

$$\dot{x}={\Xi }^{T}\Theta {({x}^{T})}^{T}.$$

(9)

The code for this algorithm, along with several examples that demonstrate its application, can be found at ref. ⁶⁰. The modified repository used for all computation done can be found at ref. ⁶¹.

Applying SINDy to epidemiological systems

The application of data-driven model discovery methods to epidemiological systems presents a unique set of challenges. Firstly, incidence data is often subjected to noise at several levels, notably inconsistent reporting of infection cases^44,62,63. In addition, the derivative data must be approximated using numerical methods, leading to another source of inaccuracy. Secondly, most compartmental epidemic models depend on the densities of both infected and susceptible individuals. However, temporal data of the seropositive individuals in a population would require extensive surveying and is frequency not available. Instead, several methods for reconstructing the susceptible class from the given incidence data are used (see subsection on Susceptible Reconstruction. When using up to second order polynomials, the function library we used was

$$\Theta (X)=[\begin{array}{cccccccccc}1 & S & I & {S}^{2} & {I}^{2} & SI & \beta & \beta S & \beta I & \beta {S}^{2}\\ \beta {I}^{2} & \beta SI & & & & & & & & \end{array}],$$

and when using up to third order polynomials, the function library used was

$$\Theta (X)=\left[\begin{array}{rccccccccccc}1 & S & I & {S}^{2} & {I}^{2} & SI & {S}^{3} & {S}^{2}I & S{I}^{2} & {I}^{3} && \\ & & \beta & \beta S & \beta I & \beta {S}^{2} & \beta {I}^{2} & \beta SI & \beta {S}^{3} & \beta {S}^{2}I & \beta S{I}^{2} & \beta {I}^{3}\end{array}\right],$$

where $\beta $ is the seasonally-varying transmission rate given in Eq. 22. When the model coefficients are given in figures describing SINDy-discovered models, this parameter is represented by $B$.

Model selection

There exists a group of statistical metrics that are used to balance goodness-of-fit with model complexity, called information criteria. These metrics are useful in the comparison and selection of models when first given a space of candidate models from which to choose. In the context of symbolic modelling this space is usually constructed from a functional basis, often heuristically defined given contextual theory^64,65,66. Given a computationally tractable basis, each possible model would be fitted and the information criterion would be computed and used to select the model that best balances parsimony and predictive power. The information criterion used in this paper is called the Akaike information criterion (AIC)³⁸ and is derived from use of maximum likelihood. The AIC value for a given candidate model $i$ is defined by

$$AI{C}_{i}=2k-2\,\mathrm{ln}(L({\bf{x}},\hat{\mu }),$$

(10)

where $L$ is the conditional probability of the observations ${\bf{x}}$ given the set of best-fit model parameters $\hat{\mu }$, and $k$ is the number of free parameters in the model.

Data sources and preprocessing

Temporal data of infection incidence of various infections and time periods has been made available by numerous sources, often from governmental reporting programs. The three infectious diseases and the corresponding locations and time periods used for this study are measles in England and Wales from 1952–1967 (from ref. ⁶⁷), chickenpox in Ontario (Canada) from 1946–1967, and rubella in Ontario from 1946–1960 (both from²⁷). These diseases and time periods were chosen as they exhibit contrasting dynamic behaviour, most notably in the period of the epidemic cycle. For each of these diseases, the time frame chosen is before the vaccines for the respective diseases became commonly available. The raw data were also smoothed using a Savitzky-Golay filter⁶⁸ of order 3 with a window length of 19 in order to reduce the risk of SINDy overfitting the data, resulting in the smoother time series shown in Figs. 2–6. Once this data was imported and both the time and case vectors were labelled, both the birth and population data (taken from^{67,69,70,71,72}) were imported and linearly interpolated to be given per week, the same scale as the case notification data. Time series of birth rates for Ontario and the United Kingdom appear in Supplementary Fig. 38.

Discrete time model

In the limit as Δ$t\to 0$ the discrete model (Eqs. 1–3) converges to the continuous-time compartmental SIR model^27,31. Applying SINDy to discover a continuous-time model involves determining the derivative vector $\dot{x}(t)=\langle \dot{S}(t),\dot{I}(t),\dot{R}(t)\rangle $. As this requires numerical differentiation of a potentially noisy system, valuable information can be lost. However, when using the discrete system, the response vector is ${x}_{t+1}=\langle {S}_{t+1},{I}_{t+1},{R}_{t+1}\rangle $, which is simply the next data point and thus is implicitly available without numerical approximations. Hence, we used the discrete-time SIR model for our analysis. Solving Eq. 4 for $\beta $ and iterating over empirical data for measles, chickenpox and rubella give the time series found in Supplementary Fig. 25. Given that each of these diseases is most common among school-aged children^44,45,73 it is unsurprising that the period corresponding with the lowest transmission is in the summer, followed by a peak in September correlating with a return to school. These findings are further discussed and confirmed by refs. ^73,74, though more analysis by ref. ⁴⁴ indicate the peak in transmission rate occurs several weeks earlier, alleging this effect may also be attributed to weather fluctuations.

Susceptible reconstruction

The simplest method for the reconstruction of the susceptible class is to iterate the equation

$${S}_{t+1}={S}_{t}-\alpha {C}_{t,t+1}+{B}_{t,t+1},$$

(11)

where ${S}_{t}$ represents the number of susceptibles at the start of week $t$, ${C}_{t,t+1}$ and ${B}_{t,t+1}$ are the number of new cases and births respectively in week $t$, and $\alpha $ is the rate at which cases are reported (i.e. ${\alpha }^{-1}$ is the average proportion of all cases that are reported to the data collection agency)⁴⁴. The idea behind this method is simple: each week the suceptible class grows by the number of new births into the population (in the absence of vaccination), and shrinks by the number of new infections. If the reporting rate $\alpha $ was well known, this relation would provide a good approximation. However, reporting varies significantly for different diseases and locations⁶³ as well as changing temporally³⁷. It is also difficult to estimate explicitly, due to the lack of serological data available.

An extension of this method is derived in ref. ³⁷. They assume the discrete relation

$${S}_{t+1}={S}_{t}-{\alpha }_{t}{C}_{t,t+1}+{B}_{t-d,t-d+1}+{u}_{t},$$

(12)

where $u$ describes the additive noise ($E(u)=0,V(u)={\sigma }_{u}^{2}$), and $d$ represents a short delay to allow for the period of time between birth and susceptibility to the disease. Now let ${Z}_{t}$ describe the deviation from the mean $E(S)=\bar{S}$ at week $t$, i.e.

$${S}_{t}=\bar{S}+{Z}_{t}\mathrm{}.$$

(13)

By substituting Eq. [13] into Eq. [12] we see that ${Z}_{t}$ also satisfies the relation

$${Z}_{t+1}={Z}_{t}-{\alpha }_{t}{C}_{t,t+1}+{B}_{t-d,t-d+1}+u.$$

(14)

Iterating this expression results in the relation

$${Z}_{t}={Z}_{0}-\mathop{\sum }\limits_{i=1}^{t}\,{\alpha }_{i}{C}_{i,i+1}+\mathop{\sum }\limits_{i=1}^{t}\,{B}_{i-d,i-d+1}+\mathop{\sum }\limits_{i=1}^{t}\,{u}_{i}$$

(15)

Reference³⁷ uses the simplifying notation

$${X}_{t}=\mathop{\sum }\limits_{i=1}^{t}\,{C}_{i,i+1},\,{Y}_{t}=\mathop{\sum }\limits_{i=1}^{t}\,{B}_{i-d,i-d+1},\,{U}_{t}=\mathop{\sum }\limits_{i=1}^{t}\,{u}_{i},\,{R}_{t}=\mathop{\sum }\limits_{i=1}^{t}\,({\alpha }_{i}-\bar{\alpha }){C}_{i}.$$

This simplifies Eq. [15] to

$${Z}_{t}={Z}_{0}-\bar{\alpha }{X}_{t}+{Y}_{t}-{R}_{t}+{U}_{t}.$$

(16)

If it is assumed that the reporting rate is constant (${R}_{t}\approx 0$) and noise is negligible (${U}_{t}\approx 0$), this reduces to the linear relationship

$${Y}_{t}=\bar{\alpha }{X}_{t}+({Z}_{t}-{Z}_{0}).$$

(17)

Hence, applying a linear regression to the cumulative births (${Y}_{t}$) against the cumulative cases (${X}_{t}$) provides an estimate for the residuals ${Z}_{t}-{Z}_{0}$ and the average reporting rate $\bar{\alpha }$.

We call this the ‘global regression method’. Applying this reconstruction method yields time series of of the proportion of susceptible individuals for the data in Fig. 1 (Supplementary Fig. 39). From these figures it can be seen that each reconstruction (especially for the chickenpox and rubella case notification data) suffers from local shifts in the mean, caused by the assumption that the reporting rate is temporally invariant. Reference³⁷ correct for this by assuming that the dominant fluctuations in Eq. 16 are caused by variation in the reporting rate ${\alpha }_{t}$ rather than in external noise (${u}_{t}$). Equation 16 can then be expressed as

$${Y}_{t+1}={R}_{t}-{U}_{t}{Z}_{0}-({\alpha }_{t+1}-\bar{\alpha }){X}_{t}+{\alpha }_{t+1}{X}_{t+1}+{Z}_{t+1}-{u}_{t+1}.$$

(18)

Local linear regression techniques can then be applied to estimate both the reporting rate and the susceptible class. This method is sensitive to the bandwidth parameter, and must be tuned beforehand to minimize large-scale fluctuations from the global mean. We use this ‘locally linear regression method’³⁷ for all of the results reported in our paper. The susceptible reconstruction time series from the incidence data in Fig. 1 appears in Supplementary Fig. 40, and the resulting transmission rate reconstruction time series also appears in Supplementary Fig. 41.

Incidence to prevalence conversion

The prevalence of the infection is defined by the number (or proportion) of infectious individuals at any given time. Compartmental epidemic models usually predict infection prevalence. However, data are usually in the form of newly occurring cases, referred to as incidence. Hence both the proportion of susceptible individuals and the prevalence of infection must be recovered from the infection incidence data before the SINDy algorithm can be applied. Given temporal case incidence data ${C}_{t}$, suppose that the duration of infection (${D}_{i}$), the mean individual lifespan ($L$), and the proportion of people that will contract the infection in their lifetime ($p$) are known and constant. The average proportion of the population that is infected at any given time is then given by

$$\langle {P}_{t}\rangle =\frac{p{D}_{i}}{L}$$

(19)

From the relation

$$\frac{{P}_{t}}{\langle {P}_{t}\rangle }=\frac{{C}_{t}}{\langle {C}_{t}\rangle }$$

we then obtain

$${P}_{t}=\frac{{C}_{t}p{D}_{i}}{\langle {C}_{t}\rangle L}$$

(20)

which is used to construct the prevalence (infectious) class given incidence data. We assumed ${D}_{i}=2$ weeks for all $i$^25,27,75, $L=65$ years⁷⁶ and $p=0.95$⁷⁵.

Weighted thresholding

SINDy is based on sparse regression, a statistical learning technique that performs feature selection while fitting the active terms to the data. The realization of this technique used in ref. ¹⁰ is the iterated thresholding method. The key parameter in this algorithm is $\lambda $, a chosen threshold below which coefficients (and their corresponding functions) are eliminated on any given iteration. In ref. ¹⁰ and subsequent papers this parameter is taken as constant, though ref. ¹¹ analyses the effects of fitting $\lambda $ using cross-validation. However, epidemiological data present an additional challenge, as the state variables are often orders of magnitude apart (the proportion susceptible, for instance, is $O\mathrm{(0.1)}$ while the proportion infected is $O\mathrm{(0.0001)}$. When evaluating a higher order function library using data on contrasting scales, high order functions of small state variables (such as ${I}^{3}$) have a much smaller column norm than larger state variables or functions with a smaller polynomial order. As a result, the iterated sparse regression algorithm can assign them large coefficients to account for this, which are much less likely to be eliminated by a fixed thresholding value.

To account for this, we introduce a threshold for each function in the library that is scaled according to the norm of the corresponding column (see also ref. ¹⁰). For each column $k$ in the function library Θ(X), we construct the threshold

$${\lambda }_{w}^{(k)}=\frac{{\lambda }_{c}}{|{\Theta }^{(k)}(X)|},$$

(21)

where ${\Theta }^{(k)}(X)$ is the $k$th column in the function library, |·| is the l₂– norm, and ${\lambda }_{c}$ is a constant threshold value. The algorithm in Section is then performed in the same way, using this function-dependent sparsity knob instead. This is the technique utilized throughout this paper, and any reference to a constant $\lambda $ value is the ${\lambda }_{c}$ parameter in Eq. 21.

Choice of functional basis

Determining the correct basis of elementary functions is a key step when generating a model using SINDy, and the lack of a rigorous method to identify such a basis is one of its notable downfalls¹⁰. Nevertheless, most compartmental models in epidemiology have been constructed using a simple basis of polynomial and trigonometric functions, which is what we use in this analysis. Many compartmental epidemic models only use polynomial functions on the second degree or lower, so we commonly limit our function library to second or third order polynomials.

Depending on the nature of the system and the assumptions made, it becomes necessary to add several features to the function library. The dynamics of the prevalence of both measles and chickenpox are strongly dictated the seasonal forcing function^25,27. Hence, a new parameter $\beta $ is constructed such that

$$\beta ={\beta }_{0}(1+{\beta }_{1}\,\cos (2\pi t/T-\phi )),$$

(22)

where $T$ is the period of the seasonal oscillations (usually $1\,y{r}^{-1}$) and $\phi $ is the phase shift. This parameter is then multiplied by each of the $p$ columns in Θ to create $p$ new features in the function library.

Model dynamics are sensitive to $\phi $. Hence, we included $\phi $ in a three-dimensional parameter sweep involving $\phi $, ${S}_{0}$, and $\lambda $ as follows. For each point of the ${S}_{0}-\lambda $ parameter grid ($\lambda \in [0.0001,0.1]$, ${S}_{0}\in [0.05,0.13]$, see Supplementary Figs. 5–12), we also conducted a parameter sweep for $\phi $ ranging from $0$ to $52$ weeks in increments of Δ$\phi =0.5\,{\rm{wk}}$. The SINDy model was generated for each value of $\phi $ in this parameter range, and the SINDy model with lowest AIC values was selected to represent that point in the ${S}_{0}-\lambda $ parameter grid. (We opted for this approach to facilitate visualization of the three-parameter sweep in the two-dimensional ${S}_{0}-\lambda $ plane and to focus on the ${S}_{0}-\lambda $ relationship. Altering the values of $\phi $ away from the optimal values usually worsens the model fit in predictable ways. For instance, a value of $\phi $ that causes transmission to peak in the summer months also causes an epidemic peak in the summer, which is rarely observed in the empirical datasets).

Given that the susceptible population is influenced heavily by the birth rate, the addition of a birth parameter is also beneficial. A functional form of the birth rate can be assumed and added to the library, but given that in the place and time period of this study the birth rate does not behave in a way that can be described by a linear or exponential function we choose to represent the birth rate in the function library by simply including a column of the empirical data $B(t)$ that gives the total number of births in week $t$. This data is already required to scale the state variables and the source for each location used in this paper is given in Section.

Power spectral density

Estimates of the power spectral density of the prevalence time series can be useful for model selection when qualitative features such as attractor class are seen as more important than goodness-of-fit. However, a model selection method that values sparsity is still desirable. This leads to a model selection process of computing the AIC score of the spectral densities of the model and the data, which will promote a parsimonious model that attempts to match the qualitative features present in the data. We followed standard procedures for computing power spectra of time series^27,77,78. First, the infection time series data from both the discovered SINDy model and the relevant data were smoothed using a moving average window with a span of 23 timesteps. Second, the smoothed data were linearly trend-corrected. Third, 20% of the time series was tapered using a split cosine bell. The periodogram of the resulting time series was then computed. We used the Matlab functions smooth, detrend, and periodogram⁷⁹. The AIC score was then computed using the residual sum of squares between the empirical and SINDy model power spectral density estimates, taking the number of free parameters from the discovered model. The methodology for computed the power spectral density described in the out-of-sample prediction subsection was the same.

SIR model fitting

This can be done by simulating the model across a wide range of linearly-spaced parameter values and selecting the model which minimizes the sum of squares error between the simulated model and the observed data. Baseline values for the parameters were taken from refs. ^25,75. For simplicity a closed system was assumed, implying that the birth and death rates were equal ($\nu =\mu $). Also, recall that given basic reproductive ratio ${{\mathscr{R}}}_{0}$ and recovery rate $\gamma $, the mean transmission rate is completely determined by ${\beta }_{0}={{\mathscr{R}}}_{0}\cdot \gamma $. The parameters that were varied, with corresponding ranges and step sizes, were ${R}_{0}\in (6,16)$, $\gamma \in (0.55,1.25)$, ${\beta }_{1}\in (0.05,0.35)$, $\mu \in (3\times {10}^{-4},6\times {10}^{-4})$ and $\in (0,51.5)$. Simulations of the discrete SIR model were run at each point in the parameter plane, beginning 150 years prior to the temporal range of the data to eliminate the effects of transients and the impact of the initial conditions, which were fixed at $({S}_{0},{I}_{0})=(0.1,5\times {10}^{-5})$. The resulting models with the minimal sum of squares error when compared with the data were selected and plotted in Supplementary Figs. 42–44.

Data availability

The code used to generate the results is publicly available at ref. ⁶¹. The infectious disease data for measles, rubella and chickenpox can be obtained from the International Infectious Disease Data Archive (http://iidda.mcmaster.ca/). Demographic data are available from published sources^{67,69,70,71,72}.

References

Strogatz, S. H. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (CRC Press, 2018).
Schichl, H. Models and History of Modeling, In: Modeling languages in mathematical optimization. Springer, Boston chap. 2, 25–39 (2004).
Packard, N. H., Crutchfield, J. P., Farmer, J. D. & Shaw, R. S. Geometry from a time series. Phys. Rev. Lett. 45, 712–716 (1980).
Article ADS Google Scholar
Crutchfield, J. P. & McNamara, B. S. Equation of motion from a data series. Complex systems 1, 121 (1987).
MathSciNet MATH Google Scholar
Daniels, B. C. & Nemenman, I. Automated adaptive inference of phenomenological dynamical models. Nature communications 6, 8133 (2015).
Article ADS CAS PubMed PubMed Central Google Scholar
Bongard, J. & Lipson, H. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 104, 9943–9948 (2007).
Article ADS CAS MATH Google Scholar
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267–288 (1996).
Article MathSciNet MATH Google Scholar
John, G. H., Kohavi, R. & Pfleger, K. Irrelevant features and the subset selection problem. In Machine Learning Proceedings 1994, 121–129 (Elsevier, 1994).
Clyde, M., Parmigiani, G. & Vidakovic, B. Multiple shrinkage and subset selection in wavelets. Biometrika 85, 391–401 (1998).
Article MathSciNet MATH Google Scholar
Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113, 3932–3937 (2016).
Article ADS MathSciNet CAS MATH Google Scholar
Mangan, N. M., Kutz, J. N., Brunton, S. L. & Proctor, J. L. Model selection for dynamical systems via sparse regression and information criteria. Proc. R. Soc. A 473, 20170009 (2017).
Article ADS MathSciNet CAS PubMed MATH Google Scholar
Rudy, S. H., Brunton, S. L., Proctor, J. L. & Kutz, J. N. Data-driven discovery of partial differential equations. Science Advances 3, e1602614 (2017).
Article ADS PubMed PubMed Central Google Scholar
Tran, G. & Ward, R. Exact recovery of chaotic systems from highly corrupted data. Multiscale Modeling & Simulation 15, 1108–1129 (2017).
Article MathSciNet Google Scholar
Kaiser, E., Kutz, J. N. & Brunton, S. L. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proceedings of the Royal Society A 474.2219, 20180335 (2018).
Article MathSciNet MATH Google Scholar
El Sayed, M. Y., Semaan, R. & Radespiel, R. Sparse modeling of the lift gains of a high-lift configuration with periodic coanda blowing. In 2018 AIAA Aerospace Sciences Meeting, 1054 (2018).
Dam, M. Topological bifurcations of coherent structures and dimension reduction of plasma convection models. Ph.D. thesis, DTU Compute (2018).
Mangan, N. M., Brunton, S. L., Proctor, J. L. & Kutz, J. N. Inferring biological networks by sparse identification of nonlinear dynamics. IEEE Transactions on Molecular, Biological and Multi-Scale Communications 2, 52–63 (2016).
Article Google Scholar
Quade, M., Abel, M., Nathan Kutz, J. & Brunton, S. L. Sparse identification of nonlinear dynamics for rapid model recovery. Chaos: An Interdisciplinary Journal of Nonlinear Science 28, 063116 (2018).
Article MathSciNet Google Scholar
Mangan, N. M., Askham, T., Brunton, S. L., Kutz, J. N. & Proctor, J. L. Model selection for hybrid dynamical systems via sparse regression. Proceedings of the Royal Society A 475, 20180534 (2019).
Article ADS MathSciNet CAS MATH Google Scholar
He, D., Ionides, E. L. & King, A. A. Plug-and-play inference for disease dynamics: measles in large and small populations as a case study. Journal of the Royal Society Interface (2009).
Olsen, L. F. & Schaffer, W. M. Chaos versus noisy periodicity: alternative hypotheses for childhood epidemics. Science 249, 499–504 (1990).
Article ADS CAS PubMed Google Scholar
Ellner, S. P., Bailey, B. A. & Bobashev, G. V. Noise and nonlinearity in measles epidemics: Combining mechanistic and statistical approaches to population modeling. The American Naturalist 151, 425–440 (1998).
Article CAS PubMed Google Scholar
Allen, L. J. An Introduction to Stochastic Epidemic Models, in: Mathematical Epidemiology, eds Allen, L. J. et al., vol. 1945 (Springer, Berlin, 2008).
Bolker, B. M. & Grenfell, B. T. Chaos and biological complexity in measles dynamics. Proceedings of the Royal Society of London B: Biological Sciences 251, 75–81 (1993).
Article ADS CAS Google Scholar
Earn, D. J. D., et al. A simple model for complex dynamical transitions in epidemics. Science 287 (2000).
Rohani, P., Earn, D. J. & Grenfell, B. T. Opposite patterns of synchrony in sympatric disease metapopulations. Science 286, 968–971 (1999).
Article CAS PubMed Google Scholar
Bauch, C. T. & Earn, D. J. D. Transients and attractors in epidemics. Proc. R. Soc. Lond. B 270, 1573–1578 (2003).
Article Google Scholar
Ferrari, M. J. et al. The dynamics of measles in sub-saharan africa. Nature 451, 679 (2008).
Article ADS CAS PubMed Google Scholar
Kermack, W. O. & McKendrick, A. G. A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. A 115 (1927).
Anderson, R. M. & May, R. M. Infectious diseases of humans: dynamics and control (Oxford university press, 1992).
Hethcote, H. W. The mathematics of infectious diseases. SIAM review 42, 599–653 (2000).
Article ADS MathSciNet MATH Google Scholar
Érdi, P. & Tóth, J. Mathematical models of chemical reactions: theory and applications of deterministic and stochastic models (Manchester University Press, 1989).
Lai, Z. & Nagarajaiah, S. Sparse structural system identification method for nonlinear dynamic systems with hysteresis/inelastic behavior. Mech. Sys. & Sig. Proc 117, 813–842 (2019).
Article Google Scholar
Sorokina, M., Sygletos, S. & Turitsyn, S. Sparse identification for nonlinear optical communication systems: SINO method. Optics express 24, 30433–30443 (2016).
Article ADS PubMed Google Scholar
Dam, M., Brøns, M., Juul Rasmussen, J., Naulin, V. & Hesthaven, J. S. Sparse identification of a predator-prey system from simulation data of a convection model. Physics of Plasmas 24, 022310 (2017).
Article ADS CAS Google Scholar
Loiseau, J.-C. & Brunton, S. L. Constrained sparse Galerkin regression. Journal of Fluid Mechanics 838, 42–67 (2018).
Article ADS MathSciNet MATH Google Scholar
Finkenstädt, B. F. & Grenfell, B. T. Time series modelling of childhood diseases: a dynamical systems approach. Appl. Statist 49, 187–205 (2000).
MathSciNet MATH Google Scholar
Akaike, H. Information theory and an extension of the maximum likelihood principle. In Breakthroughs in statistics, 610–624 (Springer, 1992).
Ferguson, N. M., Donnelly, C. A. & Anderson, R. M. Transmission intensity and impact of control policies on the foot and mouth epidemic in great britain. Nature 413, 542 (2001).
Article ADS CAS PubMed Google Scholar
Bauch, C. T. The spread of infectious diseases in spatially structured populations: an invasory pair approximation. Math. Biosci. 198, 217–237 (2005).
Article MathSciNet PubMed MATH Google Scholar
Grenfell, B. T., Bjornstad, O. N. & Kappey, J. Travelling waves and spatial hierarchies in measles epidemics. Nature 414, 716–723 (2001).
Article ADS CAS PubMed Google Scholar
Liu, W.-m., Hethcote, H. W. & Levin, S. A. Dynamical behavior of epidemiological models with nonlinear incidence rates. Journal of mathematical biology 25, 359–380 (1987).
Article MathSciNet CAS PubMed MATH Google Scholar
Korobeinikov, A. & Maini, P. K. A lyapunov function and global properties for sir and seir epidemiological models with nonlinear incidence. Mathematical Biosciences and Engineering 1, 57–60 (2004).
Article MathSciNet PubMed MATH Google Scholar
Fine, P. E. M. & Clarkson, J. A. Measles in england and wales - i: An analysis of factors underlying seasonal patterns. International Journal of Epidemiology 11 (1982).
Schenzle, D. An age-structured model of pre- and post-vaccination measles transmission. Mathematical Medicine and Biology: A Journal of the IMA 1, 169–191 (1984).
Article MathSciNet CAS MATH Google Scholar
Bauch, C. T. Imitation dynamics predict vaccinating behaviour. Proceedings of the Royal Society of London B: Biological Sciences 272, 1669–1675 (2005).
Article Google Scholar
Oraby, T., Thampi, V. & Bauch, C. T. The influence of social norms on the dynamics of vaccinating behaviour for paediatric infectious diseases. Proc. R. Soc. B 281, 20133172 (2014).
Article PubMed Google Scholar
Wang, Z., Andrews, M. A., Wu, Z.-X., Wang, L. & Bauch, C. T. Coupled disease–behavior dynamics on complex networks: A review. Physics of life reviews 15, 1–29 (2015).
Article ADS PubMed PubMed Central Google Scholar
Salathe, M. et al. Digital epidemiology. PLoS computational biology 8, e1002616 (2012).
Article CAS PubMed PubMed Central Google Scholar
Pananos, A. D. et al. Critical dynamics in population vaccinating behavior. Proceedings of the National Academy of Sciences 201704093 (2017).
Lazer, D., Kennedy, R., King, G. & Vespignani, A. The parable of google flu: traps in big data analysis. Science 343, 1203–1205 (2014).
Article ADS CAS PubMed Google Scholar
Satsuma, J. et al. Extending the SIR epidemic model. Physica A: Statistical Mechanics and its Applications 336, 369–375 (2004).
Article ADS Google Scholar
McCluskey, C. C. Complete global stability for an sir epidemic model with delay—distributed or discrete. Nonlinear Analysis: Real World Applications 11, 55–59 (2010).
Article MathSciNet MATH Google Scholar
Donoho, D. L. Compressed sensing. IEEE Transactions on information theory 52, 1289–1306 (2006).
Article MathSciNet MATH Google Scholar
Candès, E. J. & Wakin, M. B. An introduction to compressive sampling. IEEE signal processing magazine 25, 21–30 (2008).
Article ADS Google Scholar
Brunton, S. L., Proctor, J. L. & Kutz, J. N. Compressive sampling and dynamic mode decomposition. Journal of computational dynamics 2(2), 165–191 (2013).
Article MATH Google Scholar
James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning (Springer, 2013).
Rudin, L. I., Osher, S. & Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60, 259–268 (1992).
Article ADS MathSciNet MATH Google Scholar
Chartrand, R. Numerical differentiation of noisy, nonsmooth data. ISRN Applied Mathematics 2011 (2011).
Brunton, S. Computer Code: Sparse Identification of Nonlinear Dynamics (SINDy), www.faculty.washington.edu/sbrunton/sparsedynamics.zip (Accessed 9 February 2020).
Horrocks, J. H. Computer Code: Application of SINDy to epidemic data, https://github.com/jonathanhorrocks/SINDy-data (Accessed 9 February 2020).
Davis, S. F. et al. Reporting efficiency during a measles outbreak in new york city, 1991. American journal of public health 83, 1011–1015 (1993).
Article CAS PubMed PubMed Central Google Scholar
Doyle, T. J., Glynn, M. K. & Groseclose, S. L. Completeness of notifiable infectious disease reporting in the united states: an analytical literature review. American journal of epidemiology 155, 866–874 (2002).
Article PubMed Google Scholar
Burnham, K. P. & Anderson, D. R. Model selection and multimodel inference: a practical information-theoretic approach (Springer Science & Business Media, 2003).
Claeskens, G., Hjort, N. L. et al. Model selection and model averaging. Cambridge Books (2008).
Woodward, M. Epidemiology: study design and data analysis (CRC press, 2013).
Bolker, B. Infectious disease data, https://ms.mcmaster.ca/bolker/measdata.html (Accessed 9 February 2020).
Schafer, R. W. et al. What is a Savitzky-Golay filter? IEEE Signal processing magazine 28, 111–117 (2011).
Article ADS Google Scholar
GB Historical GIS/University of Portsmouth. Total population of the United Kingdom, http://www.visionofbritain.org.uk/unit/10061325/cube/TOT_POP (Accessed 9 February 2020).
Gov.UK. 200 years of the census in Wales, https://web.archive.org/web/20090319202324/http://www.statistics.gov.uk/census2001/bicentenary/pdfs/wales.pdf (Accessed 9 February 2020).
Statistics Canada. Ontario birth rate, https://www150.statcan.gc.ca/cansim/results/cansim-0530001-eng-2134590597138961162.csv (Accessed 9 February 2020).
Statistics Canada. Ontario total population, https://www150.statcan.gc.ca/n1/pub/11-516-x/sectiona/4147436-eng.htm#1 (Accessed 9 February 2020).
London, W. P. & Yorke, J. A. Recurrent outbreaks of measles, chickenpox, and mumps. American Journal of Epidemiology 98 (1978).
Soper, H. E. The interpretation of periodicity in disease prevalence. Journal of the Royal Statistical Society 92, 34–73 (1929).
Article MATH Google Scholar
Anderson, R. M. & May, R. M. Infectious diseases of humans: dynamics and control (Oxford university press, 1992).
Leeson, G. W. Increasing longevity and the new demography of death. International Journal of Population Research 2014 (2014).
Brockwell, P. J., Davis, R. A. & Fienberg, S. E. Time Series: Theory and Methods: Theory and Methods (Springer Science & Business Media, 1991).
Priestley, M. B. Spectral analysis and time series, vol. 1 (Academic press London, 1981).
Matlab v 9.4. Periodogram power spectral density estimate, https://www.mathworks.com/help/signal/ref/periodogram.html (Accessed 9 February 2020).
World Health Organization. Immunization Coverage Data, http://apps.who.int/immunization_monitoring/globalsummary/timeseries/tswucoveragemcv1.html.

Download references

Acknowledgements

The authors are grateful to David J.D. Earn for providing data for this project from the International Infectious Disease Data Archive (IIDDA), and to Sri Namachivaya, Giang Tran, and two anonymous reviewers for helpful comments.

Author information

Authors and Affiliations

Department of Applied Mathematics, University of Waterloo, Waterloo, N2L 3G1, Canada
Jonathan Horrocks & Chris T. Bauch

Authors

Jonathan Horrocks
View author publications
You can also search for this author in PubMed Google Scholar
Chris T. Bauch
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.T.B. conceived the study. C.T.B. and J.H. developed the methodology. J.H. conducted the analysis and drafted the manuscript. C.T.B. wrote the final version of the manuscript.

Corresponding author

Correspondence to Chris T. Bauch.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Appendix.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Horrocks, J., Bauch, C.T. Algorithmic discovery of dynamic models from infectious disease data. Sci Rep 10, 7061 (2020). https://doi.org/10.1038/s41598-020-63877-w

Download citation

Received: 08 April 2019
Accepted: 07 April 2020
Published: 27 April 2020
DOI: https://doi.org/10.1038/s41598-020-63877-w

This article is cited by

Revisiting the standard for modeling the spread of infectious diseases
- Michael Nikolaou
Scientific Reports (2022)
SINDy-SA framework: enhancing nonlinear system identification with sensitivity analysis
- Gustavo T. Naozuka
- Heber L. Rocha
- Regina C. Almeida
Nonlinear Dynamics (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Dynamic causal modelling of COVID-19 and its mitigations

A deconvolution approach to modelling surges in COVID-19 cases and deaths

GrowthPredict: A toolbox and tutorial-based primer for fitting and forecasting growth trajectories using phenomenological growth models

Introduction

Results

Model rediscovery from simulated data

Model discovery from empirical data

Power spectral density

Interpretation of SINDy model coefficients

Effect of changing sparsity knob λ

Third-order polynomial library

Comparison with compartmental epidemic models

SINDy model can predict qualitative shifts in empirical measles dynamics out-of-sample

Discussion

Material and Methods

Sparse Identification of Nonlinear Dynamics (SINDy)

Applying SINDy to epidemiological systems

Model selection

Data sources and preprocessing

Discrete time model

Susceptible reconstruction

Incidence to prevalence conversion

Weighted thresholding

Choice of functional basis

Power spectral density

SIR model fitting

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary Appendix.

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Revisiting the standard for modeling the spread of infectious diseases

SINDy-SA framework: enhancing nonlinear system identification with sensitivity analysis

Comments

Search

Quick links