Activity-based epidemic propagation and contact network scaling in auto-dependent metropolitan areas

We build on recent work to develop a fully mechanistic, activity-based and highly spatio-temporally resolved epidemiological model which leverages person-trajectories obtained from an activity-based model calibrated for two full-scale prototype cities, consisting of representative synthetic populations and mobility networks for two contrasting auto-dependent city typologies. We simulate the propagation of the COVID-19 epidemic in both cities to analyze spreading patterns in urban networks across various activity types. Investigating the impact of the transit network, we find that its removal dampens disease propagation significantly, suggesting that transit restriction is more critical for mitigating post-peak disease spreading in transit dense cities. In the latter stages of disease spread, we find that the greatest share of infections occur at work locations. A statistical analysis of the resulting activity-based contact networks indicates that transit contacts are scale-free, work contacts are Weibull distributed, and shopping or leisure contacts are exponentially distributed. We validate our simulation results against existing case and mortality data across multiple cities in their respective typologies. Our framework demonstrates the potential for tracking epidemic propagation in urban networks, analyzing socio-demographic impacts and assessing activity- and mobility-specific implications of both non-pharmaceutical and pharmaceutical intervention strategies.


Methods
PanCitySim provides spatio-temporal COVID-19 propagation outcomes based on a high-resolution mobility simulation of person and vehicle movements in an average day for a full-scale city. We implemented PanCitySim for two prototype cities, which are simulated on their respective networks to provide multi-modal trajectories for each person at a five-minute resolution. From these activity-specific person-trajectories, we generated activitybased contact networks for the entire population and analyzed their scaling properties. We then simulated epidemic propagation via a Susceptible-Exposed-Infectious-Recovered-Deceased (SEIRD) epidemiological model on the contact network. The epidemiological model takes into account the latest measurements of relevant parameters, such as the incubation period, duration of infectiousness, and age-dependent likelihood of symptomaticity and mortality 4,45,49,50 . We validated our epidemic propagation predictions using actual infections and mortality observations from the relevant cities. Finally, we simulated a scenario where the transit network (including its corresponding contacts) is removed, and analyzed the outcomes on disease propagation. The authors confirm that all methods were carried out in accordance with relevant guidelines and regulations.
Activity-based simulation model and city data. A recent study identified 12 urban typologies based on 69 mobility, socio-demographic, environmental and network structure obtained from over 300 cities 51 . The two auto-dependent typologies consisting of cities largely in the US/Canada were used as the test beds for this study: Auto Sprawl (86% car mode share; e.g. Baltimore, Tampa, Raleigh) and Auto Innovative (78% car mode share; e.g. Washington D.C., Atlanta, Boston). Auto Sprawl typifies the lower-density US/Canada cities with low transit usage ( ∼4%), while Auto Innovative consists of denser cities with an average of 11% transit mode share. Auto Innovative is also three times as populated on average, but only slightly denser, which indicates that Auto Sprawl cities have larger areas than those in Auto Innovative. The GDP per capita of Auto Innovative cities is 1.2 higher on average compared to that of Auto Sprawl cities. Given these key differences between the two typolo-  www.nature.com/scientificreports/ gies, prototype cities representing the population, land-use and mobility demand and supply outcomes in both typologies were synthesized 47 . Both prototype cities were built on actual road and transit networks, population microdata and land use categories from representative (or archetype) cities close to the centroid of their respective typologies. For Auto Sprawl, the archetype chosen was the Baltimore Metropolitan Area (population 2.77 × 10 6 , density 4.11 × 10 2 km −2 ), while for Auto Innovative, it was Greater Boston (population 4.6 × 10 6 , density 5.09 × 10 2 km −2 ). However, the demand and supply models for both prototype cities were calibrated to fit average typology values, in order to ensure representativeness of overall mobility outcomes. The spatio-temporal activity and mobility patterns of each city are shown in Fig. 2.
The simulation of the prototype cities was performed in SimMobility, an open-source platform for microscopic demand and mesoscopic supply dynamic traffic assignment modeling 44 . The cities were calibrated for modeshares, activity patterns and network speeds 47 . The inputs to the simulator are: land use, demographic and economic factors, as well as road and transit networks. A discrete choice modeling framework then simulates daily activity schedules (DAS) for each individual in a given synthetic population. Parameters for population (including age, gender, employment, vehicle ownership and household characteristics), land use and road/ transit networks are obtained from a real-world archetype city in the typology. The DAS is a high level plan, including only important choices, which are translated into trip chains. Lower level choices are made during the day when those plans are executed. During the day, agents are either performing an activity (Home, Work, Education, Shopping or Other) or executing a trip. During an activity, agents are stationary at one location and then, at the beginning of each trip, agents further detail their plan. Once the detailed plan is made and the start time is reached, the supply simulator moves the agents accordingly. Given the path of each traveler, the supply simulator produces the actual movement trajectory of each heterogeneous vehicle type and pedestrian (passenger) movements, which are performed on the network to provide event-driven trajectories for each person. These trajectories are the final outputs of the simulator (SimMobility). The outputs of the traffic simulator serve as the inputs to the SEIRD model. As long as the spatio-temporal location of the individuals are produced as output, any traffic simulator can be used to produce the inputs for the SEIRD model. From the 5-minute person-trajectories, activity-specific contact graphs are constructed and transmission events are simulated. SEIRD model. We define the following states in our susceptible-exposed-infectious-recovered-deceased (SEIRD) model: susceptible ( S ), exposed, i.e. infected but not contagious ( E ), infectious and symptomatic ( I S ), infectious but asymptomatic (I A ), recovered ( R ) and deceased ( D ). We assume that symptomatic cases are automatically quarantined by the end of the day. The transitions to each of these states are governed by the following probabilities: φ is the probability of infection, where is a parameter to be calibrated. Using the mechanistic framework 15,45 , q is a measure of viral shedding rate [m 3 ] and i the contact intensity [1/m 3 ]. τ is the duration of contact. The indices are m (susceptible population), n (infectious) and t (time step). κ is the daily probability of transitioning from exposed E to infectious I {A,S} and d L is the incubation period 4 . Furthermore, d L ∼ Lognormal( = 1.62, 2 = 0.42) 50 . The median incubation period is taken as 5 days 50 . ρ is the probability an infectious person is symptomatic. Thus, the transition to state I S is governed by the probability ρκ . Further, α is the proportion of infectious people who are symptomatic, γ is the daily probability that a person recovers ( d I is the duration of infectiousness, also lognormally distributed) and µ is the mortality rate. Evidence suggests that ρ and µ are highly age-dependent 4 . Given that case fatality rates (CFR) have been shown to differ significantly by age group, we use existing adjusted posterior mode estimates of the CFR, based on measurements over 41 days 49 . We assume the CFR is obtained from the CDF of an underlying exponential distribution governing the duration from onset of COVID-19 to death. Thus: The value of d D computed from the above equation is taken as the median of the lognormal random variate. Thus, the mean of the associated normal distribution is obtained by ln d D = ln(d D ) . We also compute the variance of d D based on credible interval estimates 49 d D . Finally, we estimate the variance of the associated normal distribution, that is 2 ln d D by solving the transcendental equations relating the parameters of the lognormal distribution to those of the associated normal distribution. These parameters are summarized in Table 1.
Contact intensity. The transmission probability is explicitly dependent on the separation distance and shedding rate 15 . With regard to distance, the contact intensity i has an inverse cube relationship. For simplicity, we assume the shedding rate ( q 0 ) remains constant, but estimate the separating distance d n,m at time step t between two persons n and m at a given physical node V based on locations which are assumed to be normally distributed  per node and time step. For agents in transit, we partition the vehicles into squares, randomly assign passengers to each square partition, and estimate the expected distance between randomly selected passenger pairs. A detailed description of the distance computation between individuals who share space in transit and those who share space while performing an activity are presented in Supplementary Sections 2.1 and 2.2, respectively. When agents are at home, we use the expected distance between inhabitants based on the average square footage of homes in the city. Thus, the probability of transmission is given by:

Contact network generation.
A contact network is a collection of several contact graphs at different times of day. In this framework, we generate contact graphs at 5-minute time steps for the entire population using activity-based simulated trajectories from calibrated models of the two cities. The contact graph, at any 5-minute time window, is created by a union of three subgraphs-Home, Activity or Transit. The Home graph consists of agents who share a home during the time window. The Activity graph comprises individuals who are performing an activity (Work, Education, Shopping or Other). The Transit graph consists of individuals who are traveling in a bus or train or waiting for the same. We assume that motorists make no contacts while en route to their destinations, hence they are not modeled in our contact graph. To facilitate an efficient representation of the contact graphs, we employ a hub-and-spoke transformation, which leverages on the sparseness of the graphs (Supplementary Figure 7).
Calibration. To find , we calibrate the SEIRD model to achieve a basic reproductive number (average number of secondary cases caused by an infected person in the early stage of the epidemic) of R 0 = 2.5 using: where Θ � = Θq 0 i 0 , and X is the set of index cases while S the set of secondary cases 15 . We define R 0 as the average number of new infections per initially infected person. We calibrate R 0 over a period of 5 days on the full population, as this is the median duration of incubation 50 . Post calibration, R 0 was equal to 2.43 ± 0.14 in case of Auto Sprawl and 3.19 ± 0.16 in case of Auto Innovative using 30 samples for both cases (errors are 95% confidence intervals). The five-day average of R t (daily reproductive rate), however was 0.77 in both cities. While calibrating, we found that a universal value of Θ � is not possible. For a given value of Θ � , the variance in the value of R 0 decreases with an increase in I 0 . This behavior shown in Fig. 7c. In order to arrive at a stable value calibrated Θ � , the I 0 was set to 1000 for calibration.

Results
Analysis of contact network structure and scaling properties. We generated an activity-based contact network for the population in each city every 5 minutes (see Methods). The Union contact network comprises all activities for each individual: Work, Education, Shopping, Other and Transit. The Transit activity comprises waiting and the duration of time spent on a bus or train. The spatial resolution of each contact is at the node level (activity location) or at transit vehicle (bus/train) partitions. Figure 3a shows the degree distribution for the Union contact network at selected times in both cities. Further, we plot the average degree k for each activity contact network, including the Union (Fig. 3b). We observe that Work is responsible for the maximum  In Auto Innovative, however, the activity responsible for second greatest peak contact average was Shopping with 110 contacts per person between 4 and 5 PM. The maximum number of people who are in contact at any given time of the day is represented by the maximum clique size of the contact graph, as shown in Fig. 3c. On average, the maximum number of contacts created in Auto Sprawl is more than 1600 between 10 and 11 AM, while for Auto Innovative the maximum number of contacts is twice as high during the same time.
In order to gain further insights into the contact network structure, we analyzed the complementary cumulative distribution functions (CCDF) for each 5-minute time step by activity (Supplementary Figure 8). There appear to be no identifiable patterns in the Union contact graphs, but we hypothesized that the Work contact network follows a Weibull distribution p k ∝ e (− k) , while the Transit contact network obeys the power law, p k ∝ k − . Contact networks for Shopping and Other activities appear to follow an exponential distribution p k ∝ e − k . In order to test these null hypotheses, we conducted Kolmogorov-Smirnov (KS) tests to determine if alternative distributions provide better fits 52 . The KS tests produce p-values based on loglikehood ratios between   Supplementary Figs. 9a and 9b). The fitted Weibull parameters for Work diverge between both cities. The scale-free Transit contact graphs were fitted to p k ∝ k −̂ , with a cut-off at k = 150 . Between 8 AM and 8 PM, ̂≈ 1.4 in both cities. For Shopping, which follows an exponential distribution, ̂ is similar for both cities at the same time periods. In the case of Other, however, while the trends in Auto Sprawl and Auto Innovative are similar, ̂ tends to be higher in Auto Sprawl. A summary of all the fitted parameters for the distributions governing Work, Shopping, Other and Transit is given in Table 2 (corresponding plots are shown in Supplementary Figure 9c).
The fitted distributions are assumed to be valid from a minimum degree k min , which is determined by minimizing the KS-distance between observed and predicted values based on the power law fit. While this k min is biased, it facilitates a consistent comparison between candidate distributions 53 . Other alternative distributions considered were the lognormal and truncated power law. Notably, for both Transit contact networks, the parameter k min is largely time-invariant, averaging 2.4 in Auto Sprawl and 2.2 in Auto Innovative (Supplementary Figure 9d). This indicates that the power law takes effect with a minimum of two or three contacts in each city's transit network.
Spatio-temporal evolution and activity-based impacts. We calibrated Auto Sprawl and Auto Innovative to fit the expected early dynamics of COVID-19 4,54-56 (see Methods). We considered the basic reproduc- Given the uncertainty about R 0 , we also considered the timedependent reproduction number R t (average number of new daily infections per infectious individuals). We computed the five-day average of R t as a measure of the early propagation of the epidemic, which was 0.77 in both cities. Given these starting assumptions, we simulated the epidemic for 270 days in both cities (Fig. 4a). The peak number of infections occurred at day 27 in Auto Sprawl and at day 34 in Auto Innovative, dissipating after 150 days and 250 days, respectively. At the peak, there were 9.21 × 10 4 infections (both exposed and infectious individuals) in Auto Sprawl and 1.38 times more infections in Auto Innovative (Fig. 4b).
We plot the daily infection and mortality rates, along with R t in Fig. 4c. The infection rate is given by the number of daily new infections as a proportion of the entire population. The mortality rate is defined as the number of daily deaths also with respect to the entire population. While both Auto Sprawl and Auto Innovative  www.nature.com/scientificreports/ had similar early onset dynamics, we observe that disease propagation diverged rapidly between both cities. In Auto Sprawl, R t peaked at day 9, with an average rate of change of 1.78 × 10 −1 day −2 from onset to peak. Post-peak, however, the rate of change is −9.61 × 10 −3 day −2 . Furthermore, the slope of the infection rate from onset to peak (day 18) in Auto Sprawl is 5.15 × 10 2 day −2 . Post-peak, the rate is 4.25 × 10 1 day −2 . In Auto Innovative, however, R t peaked at day 13, with a slope steeper than that of Auto Sprawl by a factor of 1.1. R t then dissipated 2.4 times slower compared to Auto Sprawl. The infection rate peaked on day 26 in Auto Innovative (in twice the amount of time as in Auto Sprawl), and dissipated at a rate of −5.44 × 10 1 day −2 . From onset to peak, the mortality rate is 1.34 day −2 in both cities, peaking on day 28 Auto Sprawl and on day 34 in Auto Innovative. Observing epidemic propagation by activity (Fig. 4c), we find that at the onset, Transit activity was responsible for most of the transmissions, moreso trains than buses. At the peak of infection, Transit remained responsible for the greatest share of transmissions in Auto Sprawl, while Home and Work had the greatest share in Auto Innovative. In the latter stages, we find that the greatest share of infections occurred during Work. Education was also significant in Auto Innovative, while Shopping was responsible for a sizable number of infections in Auto Sprawl. The lowest impact activity was Bus transit. We note that the trajectory of the epidemic can also be analyzed across socio-demographic dimensions (such as age (Supplementary Figure 3)), which are available for the synthetic populations. Furthermore, as discussed in the section on Methods, the epidemiological model in PanCitySim has age-specific parameters. With this level of granularity, vulnerable demographics can be identified and relief operations and interventions targeted accordingly.
We observe the spatial evolution of the epidemic in both cities (Fig. 5). In Auto Sprawl, at the onset of the epidemic, we identify several hubs, mostly located outside of the city center. These quickly migrate to the city center by the second week, becoming stronger there, with the peak of the disease being observed in the sixth week. In Auto Innovative, a denser city with greater transit usage, the onset of the epidemic begins in city center, grows rapidly and spreads out to the suburbs. The peak of epidemic is observed between weeks 5 and 7, after which there is a gradual decline of in the numbers throughout the city. We observe that towards the end of the simulation, the rate of decline in the number of cases is much slower for Auto Innovative compared to Auto Sprawl. At the end of our simulation period, Auto Innovative had around 10 times more active cases compared to Auto Sprawl.

Impact of transit network on disease propagation.
To understand the extent of the impact of the physical transit network on the propagation of the epidemic, we removed the Transit activity from the contact network construction in both Auto Sprawl and Auto Innovative. Under this scenario (No Transit), individuals could no longer travel via transit and thus had to choose an alternative mode or change their activity. In realworld circumstances, the majority of travelers in these types of cities would elect to travel by private car as single www.nature.com/scientificreports/ passengers, and thus not contribute to disease propagation while traveling. We then compared the SEIRD model outputs and time-dependent reproductive rates in the full-network and no-transit cases (Fig. 4b).
The results indicate that R t peaked on the first day in Auto Sprawl (the average value in the first five days was 1.40), dissipating at the rate of −7.65 × 10 −3 day −2 . The removal of transit thus effectively dampened the reproduction of the epidemic from the onset in Auto Sprawl. In Auto Innovative, R t peaked at day 25 (compared to day 13 in the full case), but its slope was five times smaller than in the full network. Thus, the force of the epidemic was also lessened in Auto Innovative due to transit removal.
This effect is even more apparent when we consider the infection rates. In Auto Sprawl, the infection rate peaked at day 86 with 1.49 ×10 3 infections/day-a 98% decrease from the peak rate (day 19) in the full-network case. The slope of the infection rate (onset to peak) similarly decreased by 98% when transit was removed. Meanwhile, in Auto Innovative, the infection rate peaked at day 61 with 4.54 ×10 3 infections/day as a result of transit removal-a decrease of 64% from the full-network peak (day 18). In a similar trend, the rate of change decreased by 84% in the no-transit case, when compared to the full network.
The removal of transit also dampened the force of mortality, as daily deaths peaked on day 62 in Auto Sprawl (compared to day 28 in the full-network network) and on day 78 in Auto Innovative (compared to day 34 in the full-network case). These results highlight the importance of modeling the Transit contact network in detail, and the central role that public transportation played in spreading the virus, especially in the early stages of the disease outbreak.

Sensitivity analyses and validation.
As the dynamics of the COVID-19 are yet to be fully understood, with several possible assumptions including, but not limited to, re-infections 57 , multiple strains of the virus 58 and multiple-outbreaks 59 , the simulation results presented here are just one realization out of numerous possible outcomes of the initial parameters. However, in generating our results, we ran the simulator ten times on the full population in each city. Figure 6 shows the mean and 95% confidence intervals for various outputs of the model. Thus, we analyzed the sensitivity of three key variables (number of days to reach infection peak; peak number of infections and the basic reproductive rate) to changes in the initial number of infections ( I 0 ) in both prototype cities. The results (Fig. 7) are based on a population sample comprising 100,000 households randomly selected from each simulated prototype city (Auto Sprawl and Auto Innovative). We varied the number of initial infections ( I 0 ) and ran the simulation 50 times in each case. The sensitivity patterns of I 0 were similar in both prototype cities. We observe in Fig. 7a,b that a higher number of initial infections results in a larger and more rapid evolution of the epidemic.  www.nature.com/scientificreports/ We compare the simulation outputs (predicted infections and deaths) of PanCitySim to those of example cities from the respective typologies (Fig. 8). Given that our simulations are based on a no-intervention scenario, we also assess the dates of policy interventions put in place by respective governments in each city. First, we note that either a smaller or a delayed peak of the epidemic is seen in both cities compared to the respective simulated output. This can be attributed to the restrictions put in place at the start of the epidemic (represented by the red dots). Second, we observe an increase in the number of infections once those restrictions are relaxed or lifted (represented by green triangles). These two trends are more obvious in case of Auto Sprawl compared to Auto Innovative, possibly owing to the densely populated areas in and a wider prevalence of transit in Auto Innovative, both of which increase the likelihood of mixing (as seen in the contact network structure). Notwithstanding these considerations, we find that the mean absolute percentage errors for infections and deaths across all days are largely centered around 1% (Fig. 9). This indicates that the model framework performs reasonably well and is thus potentially viable as a predictive and decision-making tool for cities in Auto Sprawl and Auto Innovative.

Discussion
We developed a flexible and dynamic framework, PanCitySim, which is built on a modified SEIRD model combined with an activity-based mechanistic model which simulates mobility and human interactions in an urban environment. The epidemiological model enables a computation of infection dynamics of the entire population in the city during their activities (including transit) at a 5-minute resolution. We demonstrated the use of PanCitySim and the rich output it provides in two prototype cities representative of most urban areas in the US and Canada (both auto-dependent, but one with higher population density and greater share of mass transit).
We generated 5-minute activity-based contact networks for entire synthetic populations (2.5 and 4.5 million individuals in Auto Sprawl and Auto Innovative, respectively). We observed that the largest average contacts  Boxplots showing the mean absolute percentage errors (MAPE) for the simulation output with respect to the observations from the cities in each typology. The MAPE is computed for exposed and infectious cases ( E + I ) and recorded deaths (D). The baseline is the day since the first death reported in each city. www.nature.com/scientificreports/ per individual is 90 in Auto Sprawl and 120 in Auto Innovative. We analyzed the degree distributions of these networks, gaining important insights into the mixing of populations across two prototype cities representative of the US and Canada. Furthermore, contact networks for four activity types were fitted. We found that these contact networks follow well-known distributions describing complex systems. Significantly, Transit contacts obey the power law ( ̂∼ 1.4 up to a maximum of 150 contacts), which is time-invariant and constant in two distinct city types. One implication of this is that there is no epidemic threshold for the Transit contact, and any number of initial infections could quickly reach epidemic proportions on this network, thus rendering it a candidate activity for early containment and intervention 62,63 . Shopping and Other follow exponential distributions that are also largely time-invariant. Work, on the other hand, follows a Weibull (stretched exponential) distribution with time-dependent parameters differing by city. While Work accounts for the greatest number of contacts per person, particularly during the middle of the day, Transit, more than any other activity, accounts for the closest of contacts.
Observing the dynamics of COVID-19 for 270 days in both cities, we found that even if the index cases begin on the outskirts of the city, the epidemic rapidly spreads to the city center. In both cities the epidemic peaked between days 27 and 34 with more than 9 × 10 4 infections, dissipating slowly after 150 days if the city is sparse or after 250 days for a denser typology with more than 1.3 × 10 5 exposed or infected individuals. Further, we found that at the onset of the epidemic it is crucial to restrict mass transit services or focus interventions (such as enforcing mask-wearing by passengers) on this activity. Post-peak, however, restrictions should be targeted towards work areas, as well as shopping centers or schools, in less dense car-oriented cities.
Our approach, which is fully mechanistic and highly spatio-temporally resolved, offers insights into the contact network structure and the importance of having a detailed representation of population mobility. With PanCitySim, scenarios can be realistically modeled and targeted to specific activities, ages, and employment types. We can detect the emergence of super-spreading events and show how urban activity patterns affects the spreading of such events. It can also be used to test distinct vaccination strategies, such as prioritizing population groups with higher exposure to the virus or higher risk of severe disease or death. The framework is responsive to changes in demand and supply availabilities and is thus useful for decision makers in understanding and mitigating epidemics in metropolitan areas.

Data availability
The full data set used for this study is publicly available. Given that the sizes of the data files used in our experiments are large, we have created a demonstration data set with a random sample of 30,000 individuals, which can be downloaded from our Github repository for PanCitySim: https:// github. com/ panci tysim/ PanCi tySim. The repository also contains an end-to-end working Python notebook using the demonstration dataset. A supplementary Information document is also available in this repository. It comprises model implementation and calibration details, pseudocode summaries and additional implementation notes on the traffic simulator (Sim-Mobility). An algorithmic description of the various steps and data structures involved in the implementation of PanCitySim is presented in Supplementary Section 6 (Pseudocode).