Big-data-driven modeling unveils country-wide drivers of endemic schistosomiasis

Schistosomiasis is a parasitic infection that is widespread in sub-Saharan Africa, where it represents a major health problem. We study the drivers of its geographical distribution in Senegal via a spatially explicit network model accounting for epidemiological dynamics driven by local socioeconomic and environmental conditions, and human mobility. The model is parameterized by tapping several available geodatabases and a large dataset of mobile phone traces. It reliably reproduces the observed spatial patterns of regional schistosomiasis prevalence throughout the country, provided that spatial heterogeneity and human mobility are suitably accounted for. Specifically, a fine-grained description of the socioeconomic and environmental heterogeneities involved in local disease transmission is crucial to capturing the spatial variability of disease prevalence, while the inclusion of human mobility significantly improves the explanatory power of the model. Concerning human movement, we find that moderate mobility may reduce disease prevalence, whereas either high or low mobility may result in increased prevalence of infection. The effects of control strategies based on exposure and contamination reduction via improved access to safe water or educational campaigns are also analyzed. To our knowledge, this represents the first application of an integrative schistosomiasis transmission model at a whole-country scale.

in the transmission cycle of the disease (up to a few days vs. months/years 1 ), the concentrations of cercariae and miracidia can be considered at their equilibrium values (as obtained by settingĊ i = 0 andṀ i = 0). By also rescaling the state variables as the model described in the Methods section of the main text can be written as The parameters β i and χ i represent, respectively, aggregated exposure and contamination rates -the former of which also accounts for the local abundance of the intermediate snail hosts. Although simplified, this formulation can indeed ease the 2/16 application of the model to real case studies for which detailed information on the heterogeneity of transmission risk might not be readily available.

Spatial heterogenity of human exposure and contamination
The use of the model outlined above requires the specification of the parameters β i and χ i , i.e. of the (possibly) site-specific exposure and contamination rates. Both parameters are clearly related to rurality and availability of environmental freshwater.
Rural communities that lack access to piped drinking water and improved sanitation, and that have to resort to unsafe water sources for their primary needs, are in fact both more prone and more conducive to schistosomiasis transmission 2, 3 . Conversely, 30 the availability of adequate water provisioning and sanitation infrastructures may represent an effective protection against schistosomiasis, as shown in a recent systematic review of available field data 4 . However, neither can safe water supplies completely avert human contact with environmental freshwater, especially in water-rich regions, nor can the presence of adequate sanitation guarantee per se its use 5 . Therefore, the causal pathway through which water and sanitation affect disease transmission still remains elusive -even more so in developing countries where understanding exposure and mechanisms 35 of spread would be most important 6 . Moreover, in the simplified formulation of the epidemiological model described above, exposure risk also depends on local snail population abundances, which are also certainly influenced by rurality and environmental freshwater availability. Malacological surveys suggest that occurrence of the snail species involved in schistosomiasis transmission is widespread in Senegal 7 . However, the lack of quantitative data on an appropriate spatial scale precludes the use of these observations to inform the model about the spatial distribution of snail populations. Habitat suitability 40 for freshwater snails can be mapped via geostatistical methods 8,9 , which however require a considerable amount of field observations, currently not available for Senegal.
To parameterize the model, we assume that the synthetic exposure and contamination rates β i and χ i increase with the fraction of people living in rural areas, ρ i , and the availability of environmental freshwater, ω i ; more specifically, β i and χ i are assumed to increase with the product ρ i ω i , which may represent a simple (yet comprehensive) proxy for transmission 45 risk heterogeneity linked to socioeconomic and environmental conditions. The rurality index ρ i is available at the department level through the Global Atlas of Helminth Infections (Fig.1c . The spatial distribution of the product ρ i ω i that is actually assumed to drive the spatial distribution of the aggregated exposure and contamination rates is shown in Fig. S2b.
are the baseline values of the synthetic exposure and contamination rates, respectively, while φ and ξ are two coefficient accounting for the combined effects of rurality and freshwater abundance on exposure and contamination 11 .

Model simulation and evaluation of epidemiological indicators
As schistosomiasis is endemic in Senegal, model outputs are evaluated by running the epidemiological model up to convergence to steady state starting from an initial condition in which human communities are set to be completely uninfected (h 0 i = 1 and h p i = 0 for 0 < p ≤ P in all arrondissements), while the prevalence of infected snails is initially set to 5% (s i = 0.95 and y i = 0.05 in all arrondissements). Note that the model produces an estimate of the distribution of human hosts among infection classes (and of the prevalence of susceptible/infected snails in each arrondissement). Comparing this output with prevalence Infection prevalence can be upscaled to departmental/regional scales via weighted averaging (i.e. using arrondissement population sizes as weights).
The model also allows to easily estimate the Average Parasite Burden (APB), a standard measure that is routinely used in epidemiology to quantify the community-level intensity of infection, in addition to disease prevalence, and that closely relates 80 to morbidity. APB can be defined as the mean number of parasites hosted in each resident of a human community (say i), i.e.
Therefore, in our framework, evaluation of APB is done ex-post -that is, APB can be seen as an output (rather than as a state variable) of the model. Although APB represents the basis for the traditional approach to describe schistosomiasis transmission dynamics 18 , it forcedly neglects the heterogeneity that typically characterizes the distribution of schistosomes among human hosts within a community. Models based on a Stratified Worm Burden (SWB) approach 17 , like the one used in this work, can 85 4/16 be used to effectively overcome this limitation of purely APB-based models. SWB models have also been shown to better reproduce the epidemiological dynamics actually observed in the field 19,20 .
In the SWB approach, the analysis of APB can be usefully complemented with the evaluation of some indicators of the heterogeneity of parasite distribution, such as the aggregation parameter k i , obtained by fitting a negative binomial distribution to the simulated parasite loads in each community i, or the dispersion index D i , defined as the ratio between the sample 90 variance of parasite distribution (σ 2 i ) and the sample APB i within each community. A typical feature of SWB-based models parameterized with realistic values of the mortality rates of human hosts and adult schistosomes is that they tend to produce parasite distributions with relatively low aggregation. In fact, by using a simple SWB model applied to an isolated and stationary host population subject to a constant force of infection, it has been shown 19 that the equilibrium distribution of parasite burden within human hosts is controlled by the ratio between the human and the schistosome mortality rates. Specifically, in 95 absence of human demographic turnover (mu H = 0 and α p H = 0) the distribution is strictly Poisson, well approximated by a negative binomial otherwise. The highly aggregated distributions typically observed in egg-count data can be produced by more complex SWB models accounting for a detailed description of in-host biology (including e.g. parasite mating, aggregation, density-dependent fecundity and random egg-release) 20 .

Model calibration
100 Some of the model parameters can be reliably estimated from the literature or from epidemiological/demographic records.
Specifically, the baseline mortality rates of human hosts, snails and parasites can be evaluated as the inverse of the average lifetimes of people in Senegal (61 years 21 , hence µ H = 4.5 · 10 −5 days −1 ), snail intermediate hosts (about 1 year 22 , hence µ S = 2.7 · 10 −3 days −1 ) and schistosomes (about 5 years 22 , hence µ P = 5.5 · 10 −4 days −1 ), respectively. From these figures we have µ H /µ P = 0.082. Following a field study conducted in an endemic area of Sudan 23 , parasite-induced mortality in 105 human hosts is set to α H = 1.1 · 10 −7 days −1 parasite −1 . The extra-mortality suffered by infected snails is set to α S = 1.4 · 10 −2 days −1 , according to the observation that the lifespan of infected snails is about two months 22 . As for parasite load in human hosts, we consider a maximum burden of P = 150 parasites and a threshold for infection T = 10 parasites 17 . The human population of each community is thus divided into P + 1 classes, with classes T < p ≤ P being considered as infected.
Conversely, numerical fitting is necessary to calibrate the parameters describing human exposure and contamination risk. . Despite relative small sample sizes (as an example, 5,000 children from 100 schools in 20 districts were tested in the most recent campaign), which may lead to some uncertainty at fine spatial scales, this dataset is currently in use at the Senegalese Ministry of Health as the most reliable and up-to-date picture of the spatial distribution of the disease in the country.

5/16
and M4) environmental heterogeneity. Note that no parameters involving human mobility are calibrated in any of the model set-ups (see below). Calibration is performed independently for each model set-up by minimizing the residual sum of squares of reported (u D r ) vs. modeled (u M r ) values of schistosomiasis prevalence in each region r (Fig. S2d), i.e.
Regional infection prevalences can be readily upscaled from the epidemiological dataset at the health-district level via weighted 125 averaging (i.e. using health-district population sizes as weights; each health-district value is assigned to the region where the health-district lies). Numerical fitting is performed with the Nelder-Mead simplex method 24 .
Because all the model set-ups have the same number of calibration parameters, performance comparison can simply be performed by evaluating the coefficient of determination is the total sum of squares and u D r is the mean of the regional prevalence values. Note that R 2 can also be negative, namely if RSS > T SS.

Inference of human mobility from mobile phone traces
Mobility is estimated from the anonymized movement routes of Sonatel mobile phone users collected for one year, from 135 January 1 to December 31, 2013. Sonatel is the main telecommunications provider of Senegal, with more than 9 million subscribers in the country. The 2013 mobile phone dataset contains more than 15 billion CDRs. Each record includes an anomynous identifier for the user making the call, as well as information about when (time stamp) and where (antenna tower) the call was initiated. Matrix Q = [Q i j ] is defined in this work as the probability that people usually living in community (arrondissement) i come in contact with freshwater in community (arrondissement) j ( j = 1 . . . 123, including i). We assume 140 that this probability is proportional to the time spent in arrondissement j, and that the number of phone calls made by a user while being in arrondissement j is also proportional to the time spent in that arrondissement. Therefore, the entries Q i j of matrix Q are assumed to be proportional to the number of phone calls made by users usually living in arrondissement i while being in arrondissement j.
To characterize human mobility patterns, first we upscale antenna-level data to the spatial scale of arrondissements, namely 145 by assigning the traces originated at each antenna to the relevant arrondissement based on the geographical position of the antenna. Then, we use CDRs to identify the 'home' (residential) arrondissement for each anonymous user. Following a definition often used in the context of CDR analysis and related epidemiological applications 12 , we define home as the site (arrondissement) where most calls are made by a user during night hours (from 7pm to 7am) over the whole dataset (i.e. over a 6/16 timespan of one year). If several arrondissements match this criterion, home is randomly selected among the arrondissements that host most of the night calls made by the user. Afterwards, for each arrondissement i, the number of calls made in arrondissement j by users whose home site has been identified with i is extracted from the dataset. This number, properly divided by the total number of calls made by users usually living in arrondissement i (independently of the location where the call originates from), represents an estimate of the entries of the mobility matrix (graphically shown in Fig. 1d in the main text).
The mobility matrix Q so derived is used in the model set-ups where human movement is accounted for (M2 and M4), and can 155 be made available upon request for replication purposes.
A useful indicator of mobility is the community-level fraction of the population of each arrondissement that leaves its home site at least once during the time window of interest (e.g. one year in Fig. 1d in the main text)

160
To elucidate the role of human movement on schistosomiasis transmission, it may be useful to analyze some scenarios in which mobility is different from CDR-based estimates. To that end, the mobility matrix estimated from CDRs has to be artificially manipulated. Given an average country-wide mobility m * , the local values of m * i can be obtained by redistributing the total number m * ∑ i K i of mobile people proportionally to the contribution m i K i of each arrondissement i to the total number m ∑ i K i of mobile people in the reference case, so that The diagonal elements Q * ii of the modified mobility matrix Q * can thus be obtained as Q * ii = 1 − m * i , while the off-diagonal entries Q * i j are assumed to be proportional to the contribution of each destination site j to overall mobility from site i as estimated from CDRs, i.e.
Matrix Q * so obtained can then be fed to the model set-ups originally calibrated with Q estimated from CDRs (set-ups M2 and M4). Clearly, m * = 0 (or, equivalently, Q * = I, with I being the identity matrix) corresponds to a no-mobility scenario 170 (set-ups M1 and M3).
As a word of caution, we acknowledge that changes in human mobility could actually produce effects on the resulting mobility matrix that are far more complex than the simple manipulations described above. In that case, human mobility models -such as the so-called gravity 13 or radiation 14 models, which can be calibrated against movement patterns extracted from CDRs 15 -could be used to better analyze different mobility scenarios. Note, however, that recent studies have shown that standard mobility models may perform poorly in the sub-Saharan context, thus warranting extra-care in their use 16 .

7/16 on exposure/contamination reduction WAter, Sanitation and Hygiene (WASH) interventions
The effect of WASH actions aimed at increasing access to safe drinking water and improved sanitation 3 can be seen as equivalent, 180 in our simplified formulation of the spatially heterogeneous exposure and contamination rates, to decreasing the fraction ρ i of residents of community i living in rural conditions, at least as far as water and sanitation are concerned. These actions can be either targeted (e.g. implemented only in selected rural communities) or untargeted (implemented in all communities), and are obviously constrained by the available budget.
Let τ and η be the planned extent of the interventions (evaluated as the number of potentially benefited people, τ ≤ ∑ i ρ i K i ) 185 and their potential efficiency (i.e. the probability of success per effort unit). Untargeted actions can be described in the model as where ρ i represents the fraction of people in community i with no access to safe water supplies and improved sanitation after action implementation.
Targeted interventions can be formulated in the model by sorting communities (according to some suitable criterion), selecting the first U ones so that ∑ U i=1 ρ i K i ≤ τ and setting 190 ρ i = 1 − η therein. Natural ranking criteria in our framework include the quantity ρ i ω i , which quantifies transmission risk based on the rurality of living conditions and the abundance of environmental freshwater where snail populations can thrive, or the regional values of schistosomiasis prevalence. Other options (like e.g. prioritizing rural or water-rich regions, or communities with large inbound/outbound mobility fluxes) are obviously possible.

195
The effect of IEC campaigns aimed at promoting hygiene and increasing awareness about disease transmission pathways 2 can be modeled as a decrease of the baseline exposure/contamination rates. Like in the case of WASH interventions, IEC campaigns can be either targeted or untargeted, and are subject to budget constraints.
Let τ and η be again the planned extent of the interventions (τ ≤ ∑ i K i ) and their supposed efficiency, respectively.
Untargeted interventions can be modeled as in the case of the fine-grained model set-ups M3 and M4 (similar relationships can be worked out for the coarse-grained set-ups M1 and M2).
The implementation of targeted interventions requires sorting the communities (again, according to some suitable criterion),

8/16
selecting the first U ones so that ∑ U i=1 K i ≤ τ, and setting The previous equations refer again to model set-ups M3 and M4, but similar relationships can be obtained for 205 set-ups M1 and M2 as well. Like in the case of WASH interventions, ranking criteria can prioritize high-risk or high-prevalence communities, but other indicators describing access to safe water sources or sanitation facilities (available online through GAHI, Maps and data, http://www.thiswormyworld.org; last last date of access: 03/02/2017) could be used as well. Figure S1. Schematic representation of the schistosomiasis transmission model. Rectangles represent the human components of three sample communities (say i, j and k, identified by different colors), stratified by infection class (H p i are hosts burdened with 0 ≤ p ≤ P parasites in community i). Rounded rectangles represent the freshwater components of the three communities, including (say, for community i) susceptible (S i ) and infected (I i ) snails, and the larval stages of the parasite (miracidia, M i , and cercariae, C i ). Circles, squares and hexagons indicate the force of infection for the different human communities, the human contribution to freshwater infestation and mobility-related processes, respectively. See Fig. 1a in the main text for a graphical depiction of the transmission cycle of the disease, the Methods section in the main text for a complete description of the model and Table 1 in the main text for a description of the variables and parameters used in the model.  , χ 0 = 2.7 · 10 −3 [days −1 parasites −1 ], φ = 5.2 · 10 −1 , ξ = 1.3 · 10 2 ; for M4, β 0 = 5.5 · 10 −3 [days −1 ], χ 0 = 2.2 · 10 −3 [days −1 parasites −1 ], φ = 4.3 · 10 −1 , ξ = 1.1 · 10 2 . The model set-up accounting for fine-grained spatial heterogeneity and human mobility (M4) has better overall explanatory power than the others (R 2 = 0.76) and is thus retained as reference model. Figure S4. Sensitivity analysis of model projections with respect to parameter variations. In each panel, results are shown for simulations performed with the reference model and the best-fit parameter set (see Fig. 2 in the main text) except for the value of one parameter (label) that has instead been allowed to vary in a ±100% range with respect to its reference value. Shown are the effects of parameter variations on the country-averaged (blue) and the maximum regional (green) values of schistosomiasis prevalence. Blue/green numbers within each panel indicate deviations from the reference simulation for ±50% parameter variations.   (8), evaluated as ∑ i K i Q i, j with i ∈ Saint-Louis, j = 1...123. Highlighted are religious gatherings that generate peaks of outgoing mobility fluxes (Grand Magal de Touba, GMdT; Gamou de Tivaouane; Kazu Rajab) and within-region mobility. Overall, remarkable mobility fluxes are also directed towards the most 'attractive' and well connected region of Dakar, or to those closest in distance to Saint-Loius (Louga, 5, and Matam, 14), which show the 'gravitational' nature of human mobility. Daily mobility patterns are defined by looking at the spatial patterns of the calls made by each user over one day (instead of one year like in Fig. 1d).