A novel geo-hierarchical population mobility model for spatial spreading of resurgent epidemics

Computational models for large, resurgent epidemics are recognized as a crucial tool for predicting the spread of infectious diseases. It is widely agreed, that such models can be augmented with realistic multiscale population models and by incorporating human mobility patterns. Nevertheless, a large proportion of recent studies, aimed at better understanding global epidemics, like influenza, measles, H1N1, SARS, and COVID-19, underestimate the role of heterogeneous mixing in populations, characterized by strong social structures and geography. Motivated by the reduced tractability of studies employing homogeneous mixing, which make conclusions hard to deduce, we propose a new, very fine-grained model incorporating the spatial distribution of population into geographical settlements, with a hierarchical organization down to the level of households (inside which we assume homogeneous mixing). In addition, population is organized heterogeneously outside households, and we model the movement of individuals using travel distance and frequency parameters for inter- and intra-settlement movement. Discrete event simulation, employing an adapted SIR model with relapse, reproduces important qualitative characteristics of real epidemics, like high variation in size and temporal heterogeneity (e.g., waves), that are challenging to reproduce and to quantify with existing measures. Our results pinpoint an important aspect, that epidemic size is more sensitive to the increase in distance of travel, rather that the frequency of travel. Finally, we discuss implications for the control of epidemics by integrating human mobility restrictions, as well as progressive vaccination of individuals.

www.nature.com/scientificreports/ known key factor influencing the speed of epidemics. Chen et al. 26 show that overlapping in communities leads to increased infection prevalence and a higher spread velocity in the early stages of emerging infections; Salathé et al. 4 show that the dynamics of epidemics is influenced by the structure of communities, which, in turn, has implications on immunization strategies for large epidemics; Shang et al. 27 show that overlapping communities and a higher network average degree accelerate spreading; Stegehuis et al. 28 show that the structure of communities has a significant influence on the behavior of percolation on networks, as community structure can stimulate or suppress spreading, based on the mesoscopic set of communities. Further incorporating human mobility and contact patterns increases the realism of an epidemic model. We note the work of Liu et al. 21 which shows that the reproduction number R 0 has a much higher variability than expected, due to the heterogeneity of contact networks. Fueled by a data driven approach, the authors propose a multiplex representation of the population. Mistry et al. 22 provide accurate age-stratified contact matrices for a large number of countries further motivating the need for a heterogeneous approach in disease modeling. Sattenspiel et al. 29 extend a SIR model with five fixed patterns of mobility, but use otherwise large compartments for modeling the population structure. Watts et al. 20 propose a fully synthetic hierarchical block model, aimed at reproducing multiple epidemic waves, but without integrating realistic distances between communities, or any correlation to real-world human settlement organization. Additionally, Salathé et al. 30 analyze contact networks in the USA and confirm that heterogeneity is essential on the larger scale, but it is reliable enough to assume homogeneous contact inside small communities (like high schools). This conclusion reinforces our choice for partitioning the population into very fine-grained communities (to the level of households), and further simplification of the small communities' topology, from a social network, to a stochastic block model 31 .
We also note the recent work of Calvetti et al. 32 which adapt a network SEIR model using a single scale lattice of geographical blocks for modeling spatial mobility of the population. By contrast, our approach models a multiscale population with much higher granularity. Finally we note our previous work 33 , which is a first attempt at exploring geo-spatially organized populations. However 33 , is limited to studying the impact of country density on epidemic spreading on a single scale population model.
The motivation of this study is to describe a robust epidemic modeling framework which simultaneously incorporates accurate population modeling and human mobility, both of which represent ongoing challenges due to their theoretical complexity as well as limitation in available data for validation. Thus, we propose the novel geo-hierarchical population mobility model (GHPM) which lies at the crossroads of population organization and mobility, both of which are key aspects to consider when targeting realistic large-scale resurgent epidemic outbreaks. We propose the novel idea of distributing a population into spatially organized communities (i.e., human settlements), which are then organized into a hierarchy of administrative divisions (i.e., district, neighborhood, street, block, household). Thus, the population is partitioned with very high granularity all the way down to household-/family-level, containing just a few individuals, but where the transmission risks are highest 34 . Embedded into our population model, we further propose a novel mobility algorithm based on the geographical distance between settlements and their size, which determines the complexity of the underlying hierarchical structure. Finally, targeting the reliable reproduction of resurgent epidemics, we analyze the complex interplay between the population mobility model of GHPM and outbreak dynamics by adopting a modified SIR model with patient relapse. Altogether, the resulting framework is analyzed using detailed computer simulations.
In contrast to other computational models like GLEaMviz 35 , RAPIDD Ebola forecasting 36 , or 37 , our GHPM model is, to the best of our knowledge, the first framework to combine a geo-spatial and a hierarchical model to structure population, alongside an epidemic model with relapse. Using available empirical data on influenza and COVID-19, we show how GHPM reproduces similar epidemic dynamics (e.g., size, waves). The main focus of this paper is to determine how the population organization, travel distance and travel frequency affect the spread of diseases on large scales (country-level), and how restriction and immunization strategies can be applied efficiently to control epidemics.

Results
Characteristics of large epidemics. Real-world large, resurgent epidemics are known to be shaped by repeated waves of non-deterministic amplitude, and an overall limited epidemic size φ (it is safe to assume φ < 0.1 considering epidemics over the last decades 38,39 ). To this end, we investigate the epidemic sizes of various outbreaks over heterogeneous geographical areas. Since many infectious diseases are repetitive, seasonal (e.g., influenza, pertussis), and others appear and are then permanently eradicated (e.g., H1N1, smallpox), our approach, throughout the rest of the study, is to use the yearly epidemic size φ y as a measure of the repeating outbreak waves. Therefore, we further provide a statistical overview of seasonal, eradicated and ongoing epidemics worldwide, as such: seasonal influenza in Germany (2014), influenza in California (winter season 2016/2017), global H1N1 cases (2009), measles in Indonesia (2016), Pertussis in California (2010), and data on COVID-19, with estimates on global epidemic sizes for 2020.
The distributions of the yearly epidemic sizes φ y are given in Fig. 1 [COVID-19 in panel (f)], showing predominantly small-sized epidemics [ N(φ y ) is highest for smaller outbreak sizes, like φ y < 0.02 in panel (f), for COVID -19]. The power-law fits for the flatter area in each distribution [e.g., 0.02 < φ < 0.06 in panel (f)] range within γ = 0.55 − 1.05 , well outside the representative power-law exponent 2 < γ < 3 ; this means that larger sized epidemics are not exceptional situations. Nevertheless, the upper yearly real-world threshold for epidemic sizes seems to be small 39 , e.g., φ < 4 × 10 −4 for influenza in California, or φ < 0.1 for COVID-19. Even though COVID-19 is an ongoing pandemic, with final size and dynamics not yet known, the statistical analysis shows that the regional epidemic size distribution is consistent with historical data on other outbreaks.
We observe that, in case of large and/or long-lasting epidemics, we find heterogeneity and multi-modality in outbreak size over a heterogeneous geographical area (i.e, world's countries, regions in Indonesia, counties in www.nature.com/scientificreports/ California, Germany). Since regional diversity (e.g., demographics, climate, mobility, NPIs) is highly relevant in determining seasonality and resurgence of epidemics 40 we are encouraged to further develop a multiscale population structure, based on real-world geographic data.
The geo-hierarchical population mobility model (GHPM). Population structuring. Despite of the political, historical and geographical factors influencing the boundaries of countries and states, human populations have always flocked together into settlements 41 . As such, our proposed GHPM model starts from a targeted, real-world geographical area A (e.g., a continent, a country, a state), quantified by a set S of (real) settlements, which represent the basis of human organization and cooperation. In contrast to the abstract representation of network communities, each settlement s i ∈ S is modeled according to real-world data, using its geographical position s i (x i , y i ) (i.e., longitude, latitude) and estimated size � * (s i ) (i.e., number of individuals). Next, we create a multi-level hierarchical structure under each settlement based on its estimated population � * (s i ) . This hierarchy of administrative divisions is summarized in Table 1, with the lowest level corresponding to households, each containing a family of average size h size 42 (more details in "Methods"). The number of households n h for a settlement is estimated as n h = � * (s i )/h size , where h size is given by a distribution (see SI.1). From households upward, in the administrative hierarchy of each settlement, we define blocks, whose number n b is given by the number of households as n b = (n h ) β , followed by streets ( n s = (n b ) β ), neighborhoods ( n n = (n s ) β ), districts ( n d = (n n ) β ), and finally the root of the hierarchy represented by the settlement. The impact of choosing different branching factors β is detailed in SI.2. Also, note that the number of hierarchical levels added to each settlement depends on its estimated size � * (s i ) as defined in Table 1. Thus, we differentiate between the complexity of population organization in villages, towns, cities and metropolises (with 2-5 levels accordingly).
Once the hierarchy of administrative divisions is created, each household h j in s i is randomly populated with individuals based on h size . The final size of a household is represented as �(h j ) and the final population of each in all examples, for small outbreaks (e.g., φ y < 0.02 in f) followed by a relatively uniform distribution of larger epidemic sizes (e.g., 0.02 < φ y < 0.06 in f). The insets summarize the cumulative frequency distributions CFD, quantifying the amount of epidemics sized ≤ φ y , on log-log axes. Fitting a power-law over the flatter regions (orange line) results in exponents γ = 0.55 − 1.05. Table 1. Summary of the possible hierarchical administrative divisions characterizing a settlement s i , given its estimated real-world population � * (s i ) . Each division models a distinct hierarchical level in the settlement; based on its size, a settlement is modeled with 2-5 hierarchical levels (village-metropolis). www.nature.com/scientificreports/ parent division is calculated as the sum of the population of the households in the parent's subtree. As such, the final population �(s i ) (without the * symbol) of the settlement becomes �(s i ) = �(h j ) ; likewise, the total population of A becomes � = �(s i ) . We also introduce the notations D i , for the set of all divisions inside any settlement s i ∈ S (note that |D i | = n h + n b + n s + n n + n d ), and H i , for the set of all households in any s i ∈ S (note that |H i | = n h , H i ⊂ D i ). This concludes the definition of the geo-hierarchical population structure, which is depicted in Fig. 2, using three settlements from the United Kingdom as an example. For more details on the experimental setup of GHPM refer to the "Methods" section.

Administrative division Applies to Population Inclusion criteria Amount/s i Level
Individual mobility. The mobility algorithm is described by a stochastic mobility function, based on the previously obtained geo-hierarchical population structure. The algorithm is applied to every individual n k , from all settlements s i ∈ S , for every iteration of the simulation time t. An individual can adopt one of three different movement scenarios: S1 Travel to another settlement s j ∈ S with probability P 1 , in a randomly chosen household h j of that settlement. Furthermore, for a random timeout of 1-10 iterations (e.g., days), the individual is part of h j , after which, the individual is returned to his original household. We use this simple timeout mechanism to implement the idea of business/leisure travel for a random, limited duration. S2 Otherwise, remain in the same settlement s i with probability P 2 , and move to another division div j ∈ D i (a household, or any upper level division inside the settlement). Furthermore, the individual is part of div j for 1 iteration (e.g., day), after which, the individual is returned to his original household. We use this simple timeout mechanism to implement the idea of short local trips to work, shopping etc. S3 Otherwise, remain home with probability P 3 . The individual does not move out of his household.
The mechanisms of inter-settlement travel (scenario 1) and intra-settlement travel (scenario 2) are based on the same principle-a probability proportional to the population of the target settlement/division, and indirect proportional to the distance traveled. As such, the absolute inter-settlement travel probability p * inter (s i , s j ) is: www.nature.com/scientificreports/ where is the geographical Euclidean distance between the two settlements (in km, based on latitude and longitude), lg is the log base 10 of the population of the settlement to travel to, 0 < � ≤ 1 is a tunable travel distance parameter, and 0 < � ≤ 1 is a tunable travel frequency parameter. The effects of these two parameters are detailed in the forthcoming experimental results sections and the Discussion. The actual probability p inter (s i , s j ) of travel between two settlements is obtained through normalization of p * inter (s i , s j ) as: Based on Eq. (2), if an individual leaves its home settlement s i , he will be associated to a randomly chosen household h j in the target settlement s j . If the individual does not leave the settlement (based on Eq. 2), we compute a similar intra-settlement probability p * intra (h i , div j ) of mobility between a household h i ∈ H i and any other administrative division div j ∈ D i (e.g., household, block, street etc.) from within s i as: where, similar to Eq. (1), we use the distance , population , the travel frequency parameter , and the travel distance parameter . Inside a settlement, is the maximum distance to the lowest common ancestor of the two divisions. See Fig. 2 with examples = 1 (London) and = 3 (Edinburgh). When div j is another household, then �(div j ) is equivalent with the size of the household �(h j ) ; for div j being an upper level division, �(div j ) is the sum of sizes of all households under that division. The actual intra-settlement probability p intra (h i , div j ) of travel between a household and another division is obtained through normalization of p * intra (h i , div j ) as: Based on Eqs. (1)-(4), any individual will choose one of the three mobility scenarios with the following probabilities: Adding up the three probabilities P 1 + P 2 + P 3 = 1.
As a final observation, we consider that while an individual is "away" from home (scenarios 1, 2), he is exempt from any further travel until he returns to his original household. Also, the starting point of travel for any individual is its original household.
Epidemic transmission. We intend to use the GHPM model to replicate resurgent epidemics, like influenza or COVID-19 (see Fig. 1), so that we adopt a SIR epidemic model with relapse 43 , also know as a SIRI model 44 . Hence, we make the following assumptions: • Each individual can be in one of three mutually exclusive states (susceptible S, infected I, or recovered R), where the fraction of the population in each state, at any discrete moment in time t, is denoted as S(t), I(t), and R(t), respectively. At every time step, an infected coming in contact with a susceptible individual, can transfer the disease with a probability . Subsequently, an infected individual can recover with a probability µ , after which he remains recovered, but only for a specific period (see "GHPM experimental setup" for details on the reproduction number estimation). After this timeout period, the individual becomes susceptible again. • Homogeneous population mixing is sufficiently accurate at the small granularity of households 30,34 , as well as for short duration (1 day) in higher administrative divisions (e.g., block, neighborhood). Moreover, we consider that epidemic contagion only occurs between infected and susceptible individuals found in the same administrative division at the same time t. Given R 0 > 1 as a necessary condition to trigger an epidemic, we choose R 0 = 3 ; nevertheless, this does not guarantee that an outbreak will occur every single simulation. • Only one random individual (seed) is infected at t = 0 . Once triggered, an epidemic can develop inside the household of the seed using homogeneous mixing with the described SIR model. Further spatial spreading depends entirely on the described mobility algorithm.

Estimating epidemic dynamics.
We run the GHPM model in various scenarios, defined by different parameter settings, to better understand the potential of recent epidemic outbreaks to exhibit heterogeneous and resurgent behavior. In Fig. 3 we investigate the potential of our GHPM model's heterogeneity in estimating epidemic dynamics. As such, we vary the travel distance parameter (while keeping all other parameters fixed). Figure 3d depicts a bimodal distribution of the yearly epidemic size φ y , given simple homogeneous population mixing, modeled through one single-scale settlement ( = 533, 160 individuals). Alternatively, Fig. 3e,f depict two epidemic size distributions achieved by GHPM with its specific hierarchical structure and mobility (same population , R 0 = 3 , simulation time t = 2000 iterations (days), = 0.1 , respectively = 0.35 ). Regardless of the epidemic sizes, a visible mode is detected near φ y = 0 , followed by a relatively flat distribution. Figure 3b,c shows two www.nature.com/scientificreports/ representative simulated epidemics generated with GHPM for the corresponding travel distance parameters = 0.1 , and = 0.35 . Figure 3a depicts the same time series information corresponding to the homogeneously mixing model in Fig. 3d.
Two main observations can be summarized from Fig. 3. First, the homogeneous model in panel (a) reproduces an over-simplified spike in the number of daily cases without any resurgence, while panels (b-c) display different shapes and sizes φ y , as well as resurgence. Even if the epidemic looks like it repeatedly dies out in panel (b), it manages to flare up again in new waves. Second, the homogeneous model in panel (a) is only capable of reproducing a bimodal epidemic size distribution. Nevertheless, real data (see Fig. 1) confirms the variability and resurgence reproduced by our GHPM heterogeneous model.
These results also have implications in the context of epidemic control. Even though the two scenarios depicted in Fig. 3b,c (with corresponding distributions in Fig. 3e,f) have a similar parameter setting, the difference in travel distance results in a large variation of epidemic size. As increases, the expected epidemic size φ increases and the distribution tends to become more bimodal. Conversely, for very small , the epidemic size distribution tends to have one single mode near 0. Thus, extremes of the travel distance parameter lead to entirely local or entirely global outbreaks, and GHPM can be compared with a homogeneous one. However, when is neither too small or too large (like in Fig. 3b,c), we obtain the most faithful reproduction of real epidemic dynamics (Fig. 1c,d), and the impact of a fixed R 0 is negligible.  Figure 4a summarizes the impact of increasing the travel distance parameter on the total epidemic size φ . In accordance to Eqs. (1) and (3), influences the range over which individuals are likely to travel, both outside and inside settlements. Namely, a higher increases the probability of any individual to travel to a more distant settlement (or division) from its original household, because it directly reduces the weight of in Eqs. (1) and (3). Figure 4b depicts φ as a function of the travel frequency parameter . In addition, we choose to introduce h as the average number of individuals transiting (leaving) a household, which is a more intuitive measure than alone. Unlike , which is an input parameter in the model, the value of h is measured during simulation, and normalized by the number of households. The inset of Fig. 4b shows the linear relationship ( , n ). In accordance to Eqs. (1) and (3), affects the weight of each probability of any individual to travel outside its home settlement (or household).
In case of Figure 4a, we observe a clear phase transition from local to global epidemic as increases. Intuitively, when the two travel parameters are close to 0, infected individuals are more likely to remain in their original household (or settlement); as such, the resulting epidemic size φ is delimited to the local scale of h size . As long as the size of a modeled local scale is much smaller than the size of the entire population, truly large epidemics cannot occur with , , h → 0 . On the other hand, When � > 0.01 and � h > 0.01 , large scale outbreaks can occur, regardless of the modeled population size.
When either or h converge towards extremes (0 and 1) GHPM can be assimilated to a homogeneously population model, where the set of susceptible individuals coincides with the size of a household h size and , correspondingly (stochastic dynamics appear at the household level, or at the global scale). Nevertheless, neither of these limits (0 and 1) are realistic representations of real-world mobility, which is a mixture of local and non-local patterns. As such, we further focus on the intermediate values of travel parameters where shifts in φ are visible. The phase transition in Fig. 4a is best fitted by a 3rd degree polynomial, whereas the less abrupt increase in φ triggered by h is best modeled by a linear fit. This suggests that small changes in have a much greater impact on the size of the epidemic φ compared to similar changes in h . Similarly, by considering the linear relationship between ( , h ), depicted in the inset of Fig. 4b, we conclude that the epidemic size φ is more sensitive to increases in than increases in .
The analysis of travel parameters in the GHPM model leads to an insightful conclusion with possible effective policy implications on the ongoing COVID-19 pandemic and future outbreaks. We found that restricting the distance, rather than the frequency of travel -during an outbreak-is the more adequate approach to minimize the eventual impact of an epidemic.
Embedding mobility restrictions and immunization. Finally, we compare the effectiveness of mobility restrictions versus mass immunization. The first, is implemented by long distance travel restrictions, a measure already adopted world-wide, throughout 2020, during the COVID-19 pandemic, and with notable results 45,46 . In GHPM, this restriction is obtained by reducing the travel distance parameter , as discussed in the previous section. The second, is implemented in GHPM by adding a progressive linear vaccination policy. More precisely, we pick random, not currently infected individuals, and transfer them to a permanent recovered state R * . The vaccination policy is not started at t = 0 , but only after t = 365 days in our 5-year long simulation frame- www.nature.com/scientificreports/ work (this approach mirrors the current COVID-19 situation better). Afterwards, a random set of individuals are selected every single iteration as the number of immune individuals grows progressively. Figure 5a depicts the impact of two mobility restriction policies of reducing travel to 30% (orange line), respectively to just 10% (red line) of the original, unrestricted travel. We note that, for increasing � > 0.2 , even very strict measures do not guarantee a proportional reduction in epidemic size φ. Figure 5b summarizes the impact of two immunization policies of vaccinating individuals at a rate of 60% per year (5%/month; green line), respectively 90% per year (7.5%/month; red line), compared to no vaccination at all (gray line). Here we observe that, as the travel distance parameter increases, even an aggressive vaccination policy is unable to guarantee a proportional reduction in epidemic size φ . Of course, this is, in part, a consequence of our choice to implement the vaccination policy with a 1 year delay in our simulations. Further experimenting with different vaccination delays, or vaccination rates is outside the scope of this paper but can make an interesting follow-up research topic.

Discussion
The introduction of our multiscale GHPM model enables us to reveal several meaningful characteristics of realworld epidemic outbreaks that are, otherwise, challenging to describe with homogeneous mixing models which adopt a single scale [15][16][17][18][19] . For instance, most epidemic spreading models describe any outbreak though only two outcomes: (1) an epidemic trigger condition is not fulfilled and the disease subsides in a local sub-population, or (2) the condition is fulfilled, and the disease manages to spread globally to a large scale comparable to the entire population 47 . The epidemic sizes distribution always becomes bimodal, as the first mode correlates to unsuccessful, local outbreaks, and the second mode correlates to successful, global scale outbreaks (see Fig. 3a,d). Different, network science approaches have also been used in computational epidemics, like a forest-fire topology 48 , and social networks with two 49 and multiple dimensions 50 . However, even relatively complicated network models lead to the same bimodal distributions.
The motivation of this study is to bridge realistic, hierarchical population structuring, with individual mobility patterns and infectious dynamics with patient relapse into a reliable simulation framework, targeting the better prediction and control of epidemic dynamics. As such, we propose a very fine-grained population structuring and mobility influenced by spatial and hierarchical constraints. To the best of our knowledge, our approach is novel, and we provide qualitative comparisons to homogeneous population mixing through means of epidemic size distributions. Furthermore, our empirical data restates the important temporal heterogeneity of many large epidemics that has yet to receive full attention in the modeling state of the art. This heterogeneity is exemplified in our real data on influenza and COVID-19 evolution (Fig. 1), and reproduced by our GHPM model (Fig. 3).
A predominant body of the disease epidemiology state of the art focuses on R 0 as the central topic of research 8,12,23,51 . While this number has an inherent value for compartmental models with homogeneous mixing, R 0 could also be estimated for a more complex deterministic multiscale model. Nevertheless, estimating or redefining R 0 , would not improve any relationship to the final epidemic size in stochastic multiscale models 20 , like our GHPM. The large, non-deterministic variations in epidemic size (Fig. 3e,f), and the resurgent characteristic (Fig. 3b,c) do not result from initial conditions like R 0 = /µ , but rather from "black swan" 52 events, during which an infected individual travels from a compromised household (or settlement) to new uninfected settlements. In this way, the fluctuating yearly epidemic size φ y is mostly decided by the increasing travel distance and travel frequency to susceptible populations. www.nature.com/scientificreports/ In addition, we show that, apart from rare, long distance travel events, the population structure is a decisive factor in influencing the speed and impact of epidemic outbreaks. For example, the recent COVID-19 epidemic has a very varied evolution across countries (suggested by the distribution in Fig. 1b), ranging from less than 1-10% prevalence in 2020. These variations can also be attributed to the intensity, delay, and strength of implementing restrictions 23 . Even though non-pharmaceutical interventions (NPIs) are a relevant form of active control on epidemic duration and size 9 , our observations pinpoint that the GHPM population organization can achieve stochastic dynamics in epidemic size as large and heterogeneous as empirically observed fluctuations.
The focus of our simulation experiments was to determine how population organization, travel distance and travel frequency affect the spread of disease. In this sense, we show that travel restrictions, like reducing , and (or equivalently h ) in our model (distance and frequency), can determine a significant change in the resulting epidemic size φ y , comparable to stronger social intervention strategies like vaccination or total quarantine. Specifically, we find a phase transition from local to global epidemic around � > 0.1 and � h > 1 . When the two parameters are close to 0, infected individuals are more likely to remain in their original household, and the resulting epidemic is delimited to the local scale ( h size ≪ ). Overall, we conclude that minor modifications in have a higher impact on the size of the epidemic φ y compared to similar modifications in h . One of the main take-away messages of this study is that we found travel distance to be more significant in increasing epidemic size than travel frequency. Additionally, our results pinpoint that, even an aggressive immunization policy (e.g., vaccination of 60-90% non-infected individuals per year), is unable to guarantee an immediate proportional reduction in epidemic size, given a delay of one year to start the vaccination campaign. To this end, vaccination (at a moderately realistic pace) does not yield lower epidemic sizes than reduced travel distances.
To conclude, major recent outbreaks like Ebola, SARS or COVID-19 repeatedly confront public health authorities with the uncertainty of-how big will it be this time? Unfortunately, current state of the art computational epidemiology can hardly offer accurate answers. Even the most complicated models of infectious spreading require an estimation of the relevant susceptible population. It is much easier to do retrospective studies in which, after an epidemic is observed, the specific parameters are approximated. However, such an approach is often limited because it has reduced relevance for future outbreaks. For instance, the global 1918 Spanish flu pandemic did not stop the 2003 SARS or 2020 COVID-19 pandemics. Even if the planet was less populated and less connected a century ago, the 1918 flu made considerably more victims compared to the potentially more infectious diseases of the XXI century. With the statistics of SARS pointing towards a quick mitigation with relatively low global impact, COVID-19 has had a very different evolution 53 . In the current pandemic context, the population susceptible to the SARS-CoV-2 virus is roughly the entire population of the planet. Should we then estimate scenarios with outbreaks concentrated around large populated hubs, or include billions of susceptible individuals spread across the entire planet? Perhaps the ultimate question is simply formulated as-what is the epidemic size distribution for a given infectious pathogen?
Our GHPM model tries to address these questions, to the best possible extent, by structuring population as a geographically spaced hierarchical set of sub-populations, modeled down to the level of households. The major advantage over other single scale, multiscale, or network models is that it can be extended to very large scales (e.g., continental or global) without implying homogeneous populations. This way, a modeled epidemic can be seen as multiple smaller epidemics occurring at different times, in different sub-populations. In this context, most of the infectious spreading happens locally, and global spreading is determined by rare distant travel. Because the epidemic size distribution is remarkably susceptible to the population structure, we suggest that epidemic control can be improved through adequate strategies applied to the boundaries inherently delimiting large, multiscale populations. We believe that future studies can adopt and extend our concept of geo-hierarchical population mobility to study progressively more realistic epidemic models of infectious spreading.

Methods
Geo-spatial population data. GHPM supports a pseudo-realistic organization of the population on which to run an epidemic outbreak simulation. In this sense, the settlements number |S|-and implicitly the total population of the experiment -are defined by a chosen geographical area A, most commonly limited to a country. In this sense, the number, position and size of settlements are defined according to data extracted from the Global Rural-Urban Mapping Project (GRUMP v1), revision 01 (March 2017) curated by the Center for International Earth Science Information Network (CIESIN), Columbia University 54 . GRUMP is an undergoing large-scale project, and is still missing some data, to a variable extend, for some countries. For the purpose of this paper we run all experiments on a model of Germany, with with 53.31M inhabitants (as defined by GRUMP) spread over 1132 settlements, which we consider a mid-sized representative example.
Inside each settlement, GHPM uses a simplified synthetic, but intuitive hierarchical organization of inhabitants using 2-5 levels of administrative units. We chose the synthetic alternative (inside settlements) because of the limitation in available data defining the organization of each settlement. Nevertheless, with more available data in the future, GHPM can be modified to offer a more precise mapping of each settlement's population. Conversely, GHPM can be used without any real data, if one so chooses, by creating a synthetic set of settlements, defined by positions and populations according to any distribution of choice.
GHPM experimental setup. In order to significantly accelerate the large number of GHPM experiments, we chose to scale down the modeled population by 100 fold. As such, the final simulation population used throughout the experiments is 533,160 individuals (i.e., 1% of uniformly scaled down population of Germany from the GRUMP dataset). www.nature.com/scientificreports/ In general, the size of each unique household h i may be chosen as a uniformly distributed integer number between 1-4 individuals (i.e., average h size = 2.5 ) based on UN data for developed countries 42 . In particular, we use a custom distribution of household sizes according to data available in 2019 for Germany (Federal Statistical Office-Statistisches Bundesamt), as detailed in SI.1.
A branching factor β can be used to determine the number of divisions in a settlement, based on the number of households. In this sense, a static parameter of β = 0.5 − 0.7 makes a good approximation of the hierarchical density of administrative units. In this paper we use β = 0.6 . Overall, it makes practical sense to use 1 > β > 0.5 (i.e., larger than square root) in order to obtain a more dense hierarchy of upper level divisions. A sensitivity analysis for β is provided in SI.2.
All simulations run for t = 2000 iterations (days). If we consider the correspondence 1 iteration = 1 day, then the simulation duration translates to 2000/365 ≈ 5.48 years. Since the target of our study are resurgent epidemics (with relapse), the final epidemic duration and final size cannot be expressed in absolute values (they run indefinitely in most cases). Our approach, throughout the paper, is to use the yearly epidemic size φ y as a measure of the repeating outbreak waves. Intuitively, we average the total epidemic size φ over the simulation period, i.e., φ y = φ/5.48 . Furthermore, the impact of each parameter in the GHPM model is discussed in SI.3.
The timeout period for any recovered individual is uniformly chosen between of 3-6 months (current estimates of COVID-19 immunity of 3 months 55 , 4-5 months 56 , 6 months 57 ). The adopted infection rate is = 0.06 , and the recovery rate is µ = 0.02 . While a theoretical reproduction number R 0 is based on the µ and , in practice 22 , we estimate R 0 from the household size distribution, and obtain R 0 = 0.983 . Recent studies suggest an R 0 = 3.30 ± 1.4 for COVID-19 51 .