Superspreading of airborne pathogens in a heterogeneous world

Epidemics are regularly associated with reports of superspreading: single individuals infecting many others. How do we determine if such events are due to people inherently being biological superspreaders or simply due to random chance? We present an analytically solvable model for airborne diseases which reveal the spreading statistics of epidemics in socio-spatial heterogeneous spaces and provide a baseline to which data may be compared. In contrast to classical SIR models, we explicitly model social events where airborne pathogen transmission allows a single individual to infect many simultaneously, a key feature that generates distinctive output statistics. We find that diseases that have a short duration of high infectiousness can give extreme statistics such as 20% infecting more than 80%, depending on the socio-spatial heterogeneity. Quantifying this by a distribution over sizes of social gatherings, tracking data of social proximity for university students suggest that this can be a approximated by a power law. Finally, we study mitigation efforts applied to our model. We find that the effect of banning large gatherings works equally well for diseases with any duration of infectiousness, but depends strongly on socio-spatial heterogeneity.

www.nature.com/scientificreports/ visits while being infectious. Thus a very short period with high infectivity is modeled by M = 1 , while large M corresponds to a disease with a prolonged infectious state. Our social world model consists of visits to locations, and is quantified by the size distribution over these social gatherings. Consider first the case in which an infected individual at each time step visits a location where the number of people is exponentially distributed. For fixed R 0 , the total the number of secondary infections will then follow a negative binomial distribution with parameter (1 + R 0 /M) −1 , independent of the shape of the exponential distribution (see SI). Thus the statistics will at most be marginally overdispersed, and in particular we find that the 20% most infectious individuals will at most infect 65% for R 0 ≥ 1.0.
However, many social events are much larger than the family size events that dominate our daily life. Thus we assume a fat-tailed probability density function for the chance of being at a location in which one on average could infect up to n other people, Here the exponent ν > 2 determines the broadness of the distribution: For small ν people will often visit locations where they interact with large crowds. The parameter a regularises our model to make it valid for small locations, and sets the scale of the typical location size. The distribution corresponds to picking a random person and asking how many other people the airborne particles from this person will reach on average. This will typically be lower than the total number of people at the location. Further, this distribution is marginalized over time in the sense that it should be thought to include rare events that people go to perhaps just once a month, etc.
The number of people an individual infects at a location Z i we take to follow Poisson statistics, Z i |N i ∼ Pois(αN i ) , where α is the overall infectiousness of the disease and N i is a random number drawn from P N . This effectively means that a single person can infect many in a single timestep of the model, which is in line with how airborne pathogens are believed to spread 17 .
We derive the solution to our model in the "Methods" section. We note that while we discretize time, a qualitatively similar model can be developed with time continuous and for which the duration of events vary.
Superspreading. We consider epidemics of known basic reproduction number R 0 . In our model this average infection number reads To compare varying values of ν and M, we choose α such that R 0 is fixed. Naturally, increasing the number of events (M), i.e. the duration of the infectious stage, the infectivity per event has to decrease in order to keep R 0 fixed. Note how, as ν → 2 , the infectiousness must be scaled to zero, α → 0 reflecting the fact that infections become dominated by extremely large events. Thus for ν → 2 the epidemic is driven solely by extreme superspreading events. We further note that for small R 0 , confusingly the term superspreading actually refers to (1)   The value of M, which is defined by the disease, and may differ widely between diseases with equal R 0 , has a crucial effect on the offspring distribution, as shown in Fig. 2. As M becomes small, to keep R 0 fixed, α must be increased, and thus the chance of infecting a lot in a single location goes up. So a disease with a short time of infectiousness will be more prone to yield superspreading events, while long infectiousness gives statistics similar to a homogeneously spreading disease. Further, in our model, since we fix R 0 , the average size of the events will be independent of the time of infectiousness.
As a concrete example, we find from Fig. 2 that the probability of a person to infect 100 others is a full order of magnitude smaller with M = 25 compared to M = 1 . The inset of Fig. 2 shows that as M is decreased, the chance of infecting exactly zero increases. The simple logic is that when a few is infecting many, there must be many who infect none. We note that in practice the effective M will typically be smaller than the duration of infectiousness if people are aware of the disease and its symptoms and decide to self-isolate. In this case, M should reflect the amount of time spent being infectious while on a normal daily routine and not the inherent duration of infectiousness of the disease. Thus an airborne diseases with short pre-symptomatic infectivity may likely be driven by superspreading events.
We present in Fig. 3 mobility tracking data of ∼650 students at a Danish university (details in "Methods" section). This data allow us to determine the average time that people spend at a location. We estimate this by calculating the distribution of time τ spent consecutively together for all pairs of students. The inset of Fig. 3a shows that this approximately follows an exponential distribution with characteristic time τ 0 ∼ 1.6 h. The fit does not match the data for small τ , where students simply passing one another, and not actually staying nearby one another, bias the data. If we denote the duration of infectiousness of the disease by T, and assume that the disease only spreads for meetings longer than a certain duration, one can approximate M ∼ T/τ 0 . Taking into account that for the student data presented here, time spent alone (such as sleeping) is not captured, we get an estimate of M in the range of 5-10 for T ∼ 24 h.
The other crucial parameter in our model is the exponent ν . It is well-established that the sizes of, for example, cities or companies approximately follow a Zipf distribution ∼x −2 asymptotically 18,19 . Although firm sizes are indicative of places people visit, we expect such a distribution to be steeper in general. Further, the distribution will vary from country to country, city to city, and neighbourhood to neighbourhood. As a concrete example we employ the mobility tracking data, for which we find that the sizes of groups (Fig. 3a) that students gather in follow a power law distribution with exponent in the range of ∼2.5-3.0.
The number of potential infections at a location depends on the range and suspension time of the airborne pathogens. Although the tracking data is limited in size, and thus exponential cutoffs are quite small, we can nonetheless calculate the average number of people that are within short range of one another within each cluster. Figure 3b shows that in a cluster of x people, each person in on average close to ∼ √ x other people. In the general case of a location size distribution, which asymptotically goes like ∼x −γ , the chance of picking a person at a location of size n must then scale as ∼x −(γ −1) . P N describes the average number of people one interacts with at a location. If we assume that these x people are distributed in a two-dimensional space, and that during the visit a person walks in a few straight lines through this space, there will of the order of √ x interactions, precisely as observed in the tracking data. Transforming we find P n ∼ n −2(γ −1)+1 . Thus γ = 2.5-3.0 of Fig. 3 leads to ν = 2.0-3.0, which can be compared to social contact studies for individuals 20 . We stress that these parameters are estimated for a small cohort of university students, and different parameters will most likely be obtained on data sets concerning people with different demographics. Inset shows the probability that an individual will infect precisely zero. In all plots we take a = 3.

Scientific Reports
| (2021) 11:11191 | https://doi.org/10.1038/s41598-021-90666-w www.nature.com/scientificreports/ Our model becomes invalid as ν → 2 , where an exponential cutoff must be included. All figures are reproduced in the SI including such a cutoff. For ν = 2.5 the numerical deviations of superspreading statistics are small when comparing the model with and without an exponential cutoff.
In Fig. 4a we show the classical superspreading plot: the fraction f x of infected people infected by the x most infectious people. The concavity of these curves signify superspreading. The inset, adapted from Ref. 2 , shows, for instance, that for the SARS epidemic, the 20% most infectious infected more than 80% of the total.
In reality, the 'ideal' homogeneous distribution shown in the inset, in which the top x infectors infect exactly x, is unattainable simply due to the randomness of Poisson statistics. For finite R 0 , the grey ( R 0 = 1.0) and darkgrey ( R 0 = 2.0 ) background indicate the minimal statistics demanded by Poisson randomness alone. As we include spatial heterogeneity, the statistics become more extreme, especially for small M (blue curves, Fig. 4). Statistics of social interaction networks generated from proximity data. Proximity is determined using Bluetooth by an app installed on smartphones distributed to a group of students at a Danish university ( N ∼ 650 ). (a) The proximity data are split in one hour windows and in each window we identify the connected components of the social interaction network and plot the distribution of cluster sizes. Power law fits over the entire range of clusters gives ∼x −2.5 , whereas restricting the range to smaller cluster gives ∼x −3 . Parameter errors from fits are smaller than the error due to range selection. Inset shows the distribution of the duration of time that people spent together. Blue curve allows gaps in the data of 5 min and pink allows 30 min gaps (see "Methods" section). Fitting these curves with ∼ exp(−τ/τ 0 ) for times > 30 min yields characteristic times τ 0 ≈ 1.5 h for the blue curve and τ 0 ≈ 1.7 for the pink curve. (b) For each cluster, the plot shows the average number of students that had a strong Bluetooth signal with one another indicating proximity 2 m. A power law fit of the data yields ∼x 0.5 . As ν is lowered, the statistics become more extreme, but so does the effect of an exponential cutoff. We show in the SI, that with a cutoff of n = 1000 , f 20 can never be larger than 90% (for M = 1).

Mitigation.
As an epidemic unfolds, mitigation measures can be employed to halt the spread of disease. In SIR models, effects of mitigation are captured by lowering the infectiousness. [ α in our model]. In contrast, in network models 21 the number of connections between individuals can be lowered for a similar effect.
We capture the effects of mitigation by calculating how a measure affects the instantaneous value of R 0 . Mitigation measures such as enforcing the use of masks and good hygiene will directly affect α . As R 0 is directly proportional to α , the effect of this measure is mathematically trivial in our model. As we directly model people visiting different locations, we focus instead on the effect of closing locations. We emphasize that the effect on R 0 of closing places is in fact independent of the duration of infectiousness (M) in our model, and is thus effective even for a disease that does not show spreading heterogeneity. This is a direct consequence of allowing infectious individuals to infect many in each time step.
Instructionally, we begin by calculating the effect of closing locations of a given size. Figure 5 shows that if small locations are closed, R is actually increased as some of the people that would visit a small location instead go to larger locations. However when locations of a reasonable size are closed, we see that R decreases. This effect subsequently diminishes for very large n, as there are only few large gatherings of a particular size.
A more natural approach is to close all places larger than a given cutoff size, i.e. a ban on large gatherings. We derive in the SI, the distribution resulting from replacing P N (n) with P N (n, where x is the location cutoff size. Figure 5 shows that a much stricter mitigation effort is required to bring down R for large ν than for small. For small ν , a large fraction of all infectious take place at large locations and thus one can get away with a smaller mitigation effort. It is thus crucial to gauge the value of ν for a community when choosing the size of mitigation needed to bring R 0 below one. A separate mitigation effort comes through contact tracing [22][23][24] : finding and isolating people who have been in contact with an infectious individual. Automated contact tracing depends strongly on the participation of a large fraction of the population 25 , whereas manual contact tracing comes with a large administrative burden. In the present context, we find that the duration of infectiousness M and the power law exponent ν likewise have big consequences for the effectiveness of this approach.
We show in Fig. 6a the probability, given a person was infected, that he was infected at an event of size z ≥ 50 . If this probability is high, the effect of contact tracing is eased, as the epidemic will be driven by large spreading events. In this way, our model demonstrates that social parameters should be taken into to account when evaluating whether contact tracing is a viable strategy.
We can further gauge this behaviour by calculating the likelihood of where events took place. For instance, consider the case where, say, c = 3 members of a family all got infected and we can assume that they got infected at the same event. Where is this most likely to have happened? This will depend on the socio-spatial parameters of the community they live in and on the duration of infectiousness of the disease. By Bayes' theorem, we have  Figure 6b shows this for c = 3 and ν = 2.5 . We see that, for fixed R 0 , the larger the time of infectiousness, M, the more likely it is that this happened at a large location. In the same manner, the value of R 0 affects the distribution as well as shown in the figure.

Discussion
We have presented a model that is simple enough to be studied analytically, yet reveals key insights into the statistics of epidemics spreading in heterogeneous spaces. The model is equally applicable to populations ranging from a full country to a small isolated neighbourhood, the difference being modelled by the parameter ν and the exponential cutoff. In contrast, the infectiousness period M is set by the disease, and thus does not change by location. Given knowledge of a disease and the space in which it is spreading, our model gives baseline statistics which can give quite extreme measures (e.g. 20% infecting 80%) even in the absence of biological superspreaders. We thus conclude that to judge directly from data if superspreaders are important drivers in an epidemic, it should also be contrasted to a baseline model that include spatial heterogeneity and duration of infectiousness instead of a simple homogeneous population.
Our model does not directly take into account effects such as a variance in infectiousness attributed to factors such as ventilation or the type of event taking place at a location (concert or dinner?). However, this simply amounts to a re-interpretation of P N , as variations in the type of event would either increase or decrease the number of effective interactions.
In the SI we develop and solve analytically a version of the model that includes an exponential cutoff. Such a cutoff is necessary to get accurate statistics for ν 2.5 (dependent on the exact cutoff). In particular, with an exponential cutoff we can further allow ν < 2 , which perhaps could be representative for very social demographics such as young people. The value of the exponential cutoff could be estimated based on spreading events: If one observes that it is rare to find a person who infects more than k other people, then one can employ an exponential cutoff in the order of k/α.
With an exponential cutoff, a bound can be put on statistics such as f 20 even without knowing the value of ν . Thus, by simply knowing the approximate value of the exponential cutoff as well as the duration of infectiousness, our model predicts extremeness statistics such as f 20 . If these are found to be more extreme in an ongoing epidemic it would be strong evidence for the existence of biological superspreaders no matter the value of ν . For example, as detailed in the SI, if the exponential cutoff is s ∼ 1000 , we find for a disease with R 0 = 2.0 that f 20 0.9 implies the existence of superspreaders for M = 1 and any value of ν . For M = 5 this drops to f 20 0.7 , while for M = 25 we have f 20 0.55 . We note further that an alternative regularization to an exponential cutoff is a direct cap on the number of interactions. This is equivalent to a ban of gatherings, the details of which can be found in the SI.
While the above arguments show that knowledge of ν is not always required, our model naturally gives more accurate results when ν is known. We have presented example data for which ν can be approximately extracted. The social mobility data is generated from Bluetooth signals between smartphones of ∼650 students. Naturally, these people will in general be in rooms with many other people, for whom we have no data. Thus the data www.nature.com/scientificreports/ will in general underestimate the number of people at locations. In particular, after study/work hours will be highly underestimated as people will go home or away from university campus, where there is little chance to encounter other students that participate in the study. Conversely, student life tend to quite regularly gather people in large crowds for instance at lectures. Thus depending on the range over which we extract the power law, different exponents are found. Further, since our data is limited to students at a university, the results should be considered in this context. Data for other demographics will lead to different parameters. The data, in any case, demonstrates that power laws can approximate the size distribution of locations that people visit also on the scale of 10 2 -10 3 people.
To pinpoint social heterogeneity our study calls for active sociological studies that go beyond the contact time measurements of Ref. 26 . Most effectively this could be done by mobile tracking, in analogy to the limited study among student used in our work. With social structure input and estimates of M and R 0 for a given ongoing disease we may fix α . The assumption of all people being equally infectious can then be tested from observed rate of household transmissions 27 , taking into account the fraction of time spent by individuals in the household. Likewise, one of the parameters of the model could be fixed by comparing to e.g. the fraction of people that infect no one (Fig. 2).
In the context of the 2019 SARS-CoV-2 epidemic, there is strong evidence that the spread is by aerosols 28 . Further, viral load during the infection is quite peaked 29 , which could indicate a short period of high infectiousness on the order of 1-2 days. Superspreaders have been suggested to play a role in driving the SARS-CoV-2 epidemic 7 , which, in turn, has an effect on the choice of mitigation efforts. It is thus of paramount importance to fully understand the true distribution of superspreading, taking into account that superspreading statistics can result from many sources, one being socio-spatial heterogeneity as discussed here.
Our model could further be expanded to explicitly include biological superspreaders. For instance, if one expects varying viral loads to be driving the variation in spreading, the constant infectivity α can be replaced with a dispersed probability distribution with mean value α . In this case, special care should be taken when α for some highly infectious individuals can exceed unity. Below this limit, however, our conclusions on mitigation strategies by banning large gatherings remain unaffected. If, on the other hand, variations in spreading are due to e.g. varying particle sizes of the airborne pathogens, the variation should instead be applied to P N directly, as some individuals will be able to infect a higher fraction of people at a given location. For example, if only some people produce pathogenic aerosol these would have a much higher range than others 17 . In a location of x people, such a superspreader might reach the entirety of the x people, instead of just ∼ √ x . This would take the exponent ν from, say, 2.5-1.75.
Our model further gives insights into applying mitigations. In the case of SARS-CoV-2, many countries chose to implement bans on large gatherings. These range from banning gatherings larger than 1000 people e.g. in France, to banning social gatherings of more than 6 people in the UK. Figure 5 shows how the difference in effectiveness of such bans depend strongly on the value of ν . In particular, the shape of the power law in Eq. (1) sets the range over which changing the size of the ban has the highest effect. Naturally, the lower the ν the better the effect of a ban of a certain size. However, one may also consider the effect of changing a ban from one size to another. For instance, for ν = 2.5 changing from a ban of gathering larger than 100-10 reduces R by ∼50% , whereas for ν = 3.0 the reduction is more than ∼68% . Further, as long as the ban size is significantly smaller than the exponential cutoff, the effect of such a cutoff is negligible.
The traditional approach to modelling a heterogeneous population is to consider network models. In our model, time is discretized by location visits, during which an infectious individual can, in principle, infect the entirety of an event. In contrast, in agent-based SIR and network models infectious individual typically interact with one person per time step. Our approach thus focuses on airborne pathogens, where transmission does not require one-to-one interactions such as touch, and has the effect that the number people that are together becomes central. This in turn leads to the prediction that mitigation strategies aimed at reducing R 0 depends strongly on the socio-spatial distribution of gatherings, but is in fact independent of the duration of infectivity M. Thus we find that for airborne diseases, mitigation by reducing large gatherings will also work well for diseases that do not exhibit heterogeneous spreading ( M ≫ 1).
The distance of infection of such airborne transmissions can vary a lot between diseases and this should be reflected in the final choice of distribution exemplified by Eq. (1). In particular, for diseases that spread as aerosol, the distance of influence can become very big whereas a disease that is transmitted by larger droplets only allow limited secondary infections even at very large events. Likewise, in calculating the statistics of our model we have assumed a fully susceptible population. For diseases such as influenza, where cross-immunity plays a major role [30][31][32] , the number of susceptible people at the social events should be re-scaled accordingly.

Methods
To derive the statistics of our model, we consider the early stages of an epidemic where only a fraction of the entire population has been infected. Here, we can neglect saturation and derive analytically the offspring distribution in terms of α and the social parameters a and ν from Eq. (1). We find where A is a matrix with entries and b is a vector with entries www.nature.com/scientificreports/ and E n (·) is the exponential integral function. We relegate the derivation to the SI, but note that Eq. (4) is exact even when truncating A and b to be finite. We furthermore use E ν+j (aα) = 1 aα e −aα − (ν + j)E ν+j+1 (aα) , permitting evaluation by recursion as long as E ν (aα) = ∞ 1 e −aα t /t ν dt is known, which can be evaluated using Gaussian quadrature or any other standard method.
Finally, we can use a (zero-padded) discrete Fourier transform (DFT) to obtain the probability for any M: An important special case of our general formula is the probability that a person infects exactly zero after having visited M places. These people represent an important dual aspect of superspreading: When few are infecting the majority, only few infection events are left to the majority of people. We note that a and α always appear in the combination a · α , thus one of them can be chosen freely if the other is adjusted to fix R 0 . In this way, our calculations are valid for any value of a as long as α is adjusted accordingly, and thus the exact behaviour of the power law at small values is not important for the statistics when considering fixed R 0 .
For the purpose of estimating the social parameter ν , we utilized data collected from smartphones distributed to around 1000 students at a Danish university 33 . Of these students we only used the subset of 642 students who had daily records of proximity to others. The cell phone tracking data consist of proximity measurements achieved from repeated Bluetooth scans by the smartphones every fifth minute. The study lasted two years and the data set contains more than 30 million data points. We divide the data into into blocks of 1 h, and subsequently discretize the data in a consistent manner by capping the received signal strength indicator at a value that corresponds to ∼2 m. This results in hourly networks of students that are within the vicinity of one another. We use the connected components of these graphs as indicators for the sizes of locations to generate Fig. 3a. Parameter errors are estimated from the sample variance of the data.
The relatively low number of total students who participate in the study combined with infrequent Bluetooth scans result in low exponential cutoffs for the number of direct connections within 2 m. Nonetheless, as an approximation for the range of an airborne pathogens, we consider the average number of students that are directly connected within each cluster. Figure 3b shows that this average approximately grows like the square root of the cluster size. The true number of potential infections in a location of a given size depends on the precise characteristics of the disease, in particular if it spreads as aerosol or not 17 .
For estimating the amount of time people spent together at locations, we calculate the distribution of the duration of meetings between all pairs of students. Two students are considered to be in the same room, if they are in the same cluster. We only have Bluetooth signals every fifth minute, and a person can be beyond the reach of this signal for a brief time without having actually left a certain location. Thus we permit gaps in the tracks. Figure 3 shows the result of allowing gaps of 5 min and 30 min. Naturally, the latter yields larger times spent together, but the difference is not massive. We further note that the data show peaks at around 2 h and 4 h, which we suspect to be the result of students going to a single lecture or two lectures in a row, respectively.