Mapping road network communities for guiding disease surveillance and control strategies

Human mobility is increasing in its volume, speed and reach, leading to the movement and introduction of pathogens through infected travellers. An understanding of how areas are connected, the strength of these connections and how this translates into disease spread is valuable for planning surveillance and designing control and elimination strategies. While analyses have been undertaken to identify and map connectivity in global air, shipping and migration networks, such analyses have yet to be undertaken on the road networks that carry the vast majority of travellers in low and middle income settings. Here we present methods for identifying road connectivity communities, as well as mapping bridge areas between communities and key linkage routes. We apply these to Africa, and show that many highly connected communities straddle national borders. When malaria prevalence and population data are integrated as an example, the communities change, highlighting the regions most strongly connected to areas of high burden. The approaches and results presented provide a flexible tool for supporting the design of disease surveillance and control strategies through mapping areas of high connectivity that form coherent units of intervention and key link routes between communities for targeting surveillance.

networks, the regular and planar nature of road networks precludes the formation of clear communities, i.e. roads that cluster together shaping areas that are more connected within their boundaries than with external roads. Highly connected regional communities can promote rapid disease spread within them, but can be afforded protection from recolonization by surrounding regions of reduced connectivity, making them potentially useful intervention or surveillance units 6,26,27 . For isolated areas, a focused control or elimination program is likely to stand a better chance of success than those highly connected to high-transmission or outbreak regions. For example, reaching a required childhood vaccination coverage target in one district is substantially more likely to result in disease control and elimination success if that district is not strongly connected to neighbouring districts where the target has not been met. The identification of 'bridge' routes between highly connected regions could also be of value in targeting limited resources for surveillance 28 . Moreover, progressive elimination of malaria from a region needs to ensure that parasites are not reintroduced into areas that have been successfully cleared, necessitating a planned strategy for phasing that should be informed by connectivity and mobility patterns 26 . Here we develop methods for identifying and mapping road connectivity communities in a flexible, hierarchical way. Moreover, we map 'bridge' areas of low connectivity between communities and apply these new methods to the African continent. Finally, we show how these can be weighted by data on disease prevalence to better understand pathogen connectivity, using P. falciparum malaria as an example.

Data
African road network data. Data on the African road network (ARN) were obtained from GPS navigation and cartography as described in a previous study 24. The dataset maps primary and secondary roads across the continent and, although it carries commercial restrictions, it is more complete and consistent than alternative open road datasets (e.g. OpenStreetMap 29, gRoads 30). Visual inspection and comparison between the ARN and other spatial road inventories validated the improved accuracy and consistency of the ARN; however, a quantitative validation analysis was not possible due to the lack of consistent ground-truth data at continental scales. Figure 1a shows the African road network data used in this analysis. The road network dataset is a commercial restricted product and requests for it can be directly addressed to GARMIN 31.
Plasmodium falciparum malaria prevalence and population maps. To demonstrate how geographically referenced data on disease occurrence or prevalence can be integrated into the approaches outlined, gridded data on Plasmodium falciparum malaria prevalence were obtained from the Malaria Atlas Project (http://www.map.ox.ac.uk/). These represent modelled estimates of the prevalence of P. falciparum parasites in 2015 per 5 × 5 km grid square across Africa 32. Additionally, gridded data on estimated population totals per 1 × 1 km grid square across Africa in 2015 were obtained from the WorldPop program (http://www.worldpop.org/). The population data were aggregated to the same 5 × 5 km gridding as the malaria data, and the two datasets were then multiplied together to obtain estimates of total numbers of P. falciparum infections per 5 × 5 km grid square.
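The aggregation and multiplication step described above can be illustrated with a minimal sketch. It assumes that the grids are held as plain nested lists, that the fine grid aligns exactly with the coarse one, and that aggregation is a simple block sum; the function names and the toy values are hypothetical and are not taken from the original analysis.

```python
def aggregate_sum(grid, factor):
    """Sum non-overlapping factor x factor blocks of a 2D grid.

    Assumption: grid dimensions are exact multiples of `factor`
    (e.g. factor=5 turns 1 x 1 km cells into 5 x 5 km cells).
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block_total = sum(
                grid[rr][cc]
                for rr in range(r, r + factor)
                for cc in range(c, c + factor)
            )
            coarse_row.append(block_total)
        coarse.append(coarse_row)
    return coarse


def estimated_infections(prevalence, population):
    """Cell-wise product of prevalence (a fraction) and population count."""
    return [
        [m * p for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(prevalence, population)
    ]


# Toy data (hypothetical): a 5 x 10 grid of 1 km cells with 10 people each,
# and a 1 x 2 grid of 5 km prevalence estimates.
population_1km = [[10] * 10 for _ in range(5)]
prevalence_5km = [[0.2, 0.5]]

population_5km = aggregate_sum(population_1km, 5)
infections_5km = estimated_infections(prevalence_5km, population_5km)
```

In practice the gridded products would be read with a raster library and may need resampling to a common grid first; the sketch only shows the arithmetic.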

Results
Detecting communities in the African road network. We modeled the ARN as a 'primal' road network, where roads are links and road junctions are nodes 33. Spatial road networks, like any network embedded in two dimensions, are subject to physical spatial constraints that impose a grid-like structure on them. The ARN primal network is composed of 300,306 road segments that account for a total length of 2,304,700 km, with an average road length of 7.6 km ± 13.2 km. Such a large standard deviation, as already observed elsewhere 23,24,34, is due to the long-tailed distribution of road lengths, as illustrated in Fig. 1c. Another property of road network structure is the frequency distribution of the degree of nodes, defined as the number of links connected to each node. Most networks in nature and society have a long-tailed distribution of node degree, implying the existence of hubs (nodes that connect to a large number of other nodes) 21, with the majority of nodes connecting to very few others. For road networks, however, the degree distribution peaks strongly around 3, indicating that at most junctions each road connects with two other roads. The long-tailed distribution of the length of road segments, coupled with the peaked degree distribution, indicates the presence of a translationally invariant grid-like structure, in which road density varies smoothly among regions while connectivity and structure do not. Within such grid-like structures it is very difficult to identify clustered communities, i.e. groups of roads that are more connected within themselves than to other groups. This observation is confirmed by the spatial distribution of betweenness centrality (Bc), which measures the number of times the shortest paths between each pair of nodes pass through a road. The probability distribution of Bc is long tailed (Fig. 1d), while its spatial distribution spreads across the entire network in the form of a structural backbone, as shown in Fig. 1b. Under such conditions, and because of the absence of bottlenecks, any strategy to detect communities that relies on pruning by Bc values 35 will be minimally effective.
To detect communities in road networks we follow the observation that human displacement in urban networks is guided by straight lines 36. Therefore, geometry can be used to detect communities of roads by assuming that people tend to move along streets more than between streets. We developed a community detection pipeline that converts a primal road network, where roads are links and road junctions are nodes 33, to a dual network representation, where roads become nodes and road junctions become the links between nodes 37, by means of the straightness and contiguity of roads. It is important to note here that the units of analysis are road segments, which here are typically short and straight between intersections, making the straightness assumption valid. Community detection in the dual network is then performed using a modularity optimization algorithm 38. The communities found in the dual network are then mapped back to the original primal road network. These communities encode information about the geometry of the road pattern but can also incorporate weights associated with a particular disease to guide the process of community detection.
Nodes in the dual network represent lines in the primal network. The conversion from primal to dual is done by using a modified version of the algorithm known as continuity negotiation 37. In brief, we assume that a pair of adjacent edges belongs to the same street if the angle $\theta$ between these edges is smaller than $\theta_c = 30^\circ$. We also assume that the angle between two adjacent edges $(i, j)$ and $(j, p)$ is given by the dot product $\cos\theta = \mathbf{r}_{i,j} \cdot \mathbf{r}_{j,p} / (|\mathbf{r}_{i,j}|\,|\mathbf{r}_{j,p}|)$, where $\mathbf{r}_{i,j} = \mathbf{r}_j - \mathbf{r}_i$. Under these assumptions, the angle between two edges belonging to a perfect straight line is zero, while it assumes a value of 90° for perpendicular edges.
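As an illustration of the angle criterion, the following sketch computes the angle between two adjacent edges from the dot product and applies the 30° threshold. It assumes planar (projected) junction coordinates given as (x, y) pairs; the function names are hypothetical and this is not the authors' implementation.

```python
import math

THETA_C = 30.0  # critical angle in degrees, as stated in the text


def edge_angle(r_i, r_j, r_p):
    """Angle in degrees between edge (i, j) and edge (j, p).

    Uses cos(theta) = (r_ij . r_jp) / (|r_ij| |r_jp|), with r_ij = r_j - r_i.
    Assumption: coordinates are planar, so Euclidean geometry applies.
    """
    ax, ay = r_j[0] - r_i[0], r_j[1] - r_i[1]
    bx, by = r_p[0] - r_j[0], r_p[1] - r_j[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        raise ValueError("edges must have non-zero length")
    cos_theta = (ax * bx + ay * by) / norm
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def same_street(r_i, r_j, r_p, theta_c=THETA_C):
    """True if the two adjacent edges are treated as one continuous street."""
    return edge_angle(r_i, r_j, r_p) < theta_c


straight = edge_angle((0, 0), (1, 0), (2, 0))   # collinear edges
corner = edge_angle((0, 0), (1, 0), (1, 1))     # perpendicular edges
```

For geographic coordinates the vectors would first need to be projected, or the bearing difference computed on the sphere.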
Our algorithm starts by searching for the edge that generates the longest road in the primal space, as can be seen in Fig. 2a. Then, a node is created in the dual space and assigned to this road. Next, we search for the edge that generates the second longest road, and a new node is created in the dual space and assigned to this road. If there is at least one intersection between the new road and the previous one, we connect the respective nodes in the dual space. The algorithm continues until all the edges in the primal space are assigned to a node in the dual space, as shown in Fig. 2b. Note that the conversion from the primal to the dual road network has been used extensively to estimate human perception and movement along road networks (space syntax, see 36), which also supports our use of road geometry to detect communities.
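The longest-road-first procedure can be sketched on a toy network as follows. This is a simplified, quadratic-time illustration and not the authors' C++ implementation: it assumes planar coordinates with distinct junction positions, extends a road through the straightest unassigned edge below the critical angle, and treats two roads as intersecting when they share a junction. All names are hypothetical.

```python
import math
from itertools import combinations


def deflection(a, b, c):
    """Deflection angle in degrees when travelling a -> b -> c."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    cos_t = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def grow_road(seed, free, coords, theta_c):
    """Extend a seed edge at both ends through the straightest free edges."""
    road, used = [seed], {seed}
    for tail, head in (seed, seed[::-1]):
        while True:
            options = []
            for edge in free:
                if head in edge and edge not in used:
                    nxt = edge[0] if edge[1] == head else edge[1]
                    theta = deflection(coords[tail], coords[head], coords[nxt])
                    if theta < theta_c:
                        options.append((theta, edge, nxt))
            if not options:
                break
            _, edge, nxt = min(options)  # smallest deflection wins
            road.append(edge)
            used.add(edge)
            tail, head = head, nxt
    return road


def road_length(road, coords):
    return sum(math.dist(coords[u], coords[v]) for u, v in road)


def primal_to_dual(coords, edges, theta_c=30.0):
    """Return (roads, dual_links); each road becomes one dual node."""
    free = {tuple(sorted(e)) for e in edges}
    roads = []
    while free:
        # Try every remaining edge as a seed and keep the longest road.
        best = max(
            (grow_road(seed, free, coords, theta_c) for seed in sorted(free)),
            key=lambda road: road_length(road, coords),
        )
        roads.append(best)
        free -= set(best)
    junctions = [{n for e in road for n in e} for road in roads]
    dual_links = {
        (a, b)
        for a, b in combinations(range(len(roads)), 2)
        if junctions[a] & junctions[b]  # assumption: shared junction = intersection
    }
    return roads, dual_links


# Toy network: one straight three-segment road with two perpendicular side roads.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (1, 1), 5: (2, -1)}
edges = [(0, 1), (1, 2), (2, 3), (1, 4), (2, 5)]
roads, dual_links = primal_to_dual(coords, edges)
```

In this toy case the three collinear segments merge into a single dual node that links to both side roads, which is how long roads become hubs in the dual space.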
Despite the regular structure of the network in the primal space, the topology of these networks in the dual space is very rich. For instance, the degree distribution in dual space follows the power law $P(k) \sim k^{-\gamma}$. This property has been previously identified in urban networks 33 and it is strongly related to the long-tailed distribution of road lengths in these networks (see Fig. 1c). Since most of the roads are short, most of the nodes in dual space will have a small number of connections. On the other hand, there are a few long roads (Fig. 2a) that give rise to hubs in the dual space (Fig. 2b). Our approach for detecting communities in road networks therefore consists of performing classical community detection in the dual representation (Fig. 2c) and then bringing the result back to the primal representation, as shown in Fig. 2d. The algorithm used to detect the communities is the modularity-based algorithm by Clauset and Newman 35.
The hierarchical mapping of communities on the African road network, with outputs for 10, 20, 30 and 40 sets of communities, is shown in Fig. 3. The maps highlight how connectivity rarely aligns with national borders, with the areas most strongly connected through dense road networks typically straddling two or more countries. The hierarchical nature of the approach is illustrated through the breakdown of the 10 large regions in Fig. 3a into further sub-regions in Fig. 3b, c and d, emphasizing the main structural divides within each region mapped in Fig. 3a. Some large regions appear consistently in each map. For example, a single community spans the entire north African coast, extending south into the Sahara. South Africa appears as wholly contained within a single community, while the Horn of Africa, containing Somalia and much of Ethiopia and Kenya, is consistently mapped as one community. The four maps shown are example outputs, but any number of communities can be identified. The clustering that maximises modularity produces 104 communities, and these are mapped in Fig. 4.
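To make the quantity being maximised concrete, the sketch below evaluates the standard Newman modularity $Q = \sum_c [L_c/m - (d_c/2m)^2]$ for a given partition of an unweighted, undirected graph, where $L_c$ is the number of links inside community $c$, $d_c$ is the total degree of its nodes and $m$ is the number of links. It only scores a partition and does not perform the optimisation itself; the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # links with both ends in the community
    degree = defaultdict(int)    # summed node degree per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy dual network: two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {n: "A" for n in range(6)}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
```

Cutting the hierarchy at different levels gives the 10, 20, 30 and 40 community maps, while the cut with the highest value of this score corresponds to the 'best' partition.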
Even with division into 104 communities, the north Africa region remains a single community, strongly separated from sub-Saharan Africa by large bridge regions. South Africa also remains almost wholly within its own community, with Somalia and Namibia showing similar patterns. The countries with the largest numbers of communities tend to be those with the least dense infrastructure, equating to poor connectivity, such as DRC and Angola, though West Africa also shows many distinct clusters, especially within Nigeria. Apart from the Sahara, the largest bridge regions of poor connectivity are located across the central belt of sub-Saharan Africa, where population densities are low and transport infrastructure is both sparse and often poor. The communities mapped in Figs 3 and 4 align in many cases with recorded population and pathogen movements. For example, the broad southern and eastern community divides match well those seen in HIV-1 subtype analyses 12 and community detection analyses based on migration data 27. At more regional scales, there also exist similarities with prior analyses based on human and pathogen movement patterns. For example, the western, coastal and northern communities within Kenya in Fig. 4b were identified previously through mobile phone and census derived movement data 39,40. Further, Guinea, Liberia and Sierra Leone typically remain mostly within a single community in Fig. 3, with some divides evident in Fig. 4c. This shows some strong similarities with the spread of Ebola virus inferred through genome analysis 15, particularly the multiple links between rural Guinea and Sierra Leone, though Fig. 4c highlights a divide between the regions containing Conakry and Freetown when Africa is broken into the 104 communities. Figure 3 highlights the connections between Kinshasa in western DRC and Angola, with the recent yellow fever outbreak spreading within the communities mapped. Figure 4d shows the 'best' communities map for an area of southern Africa, in which the strong cross-border links between Swaziland, southern Mozambique and western South Africa are mapped within a single community, as well as wider links highlighted in Fig. 3, matching the travel patterns found from Swaziland malaria surveillance data 41.
Integrating P. falciparum malaria prevalence and population data with road networks for weighted community detection. The previous section outlined methods for community detection on unweighted road networks. To integrate disease occurrence, prevalence or incidence data for the identification of areas of likely elevated movement of infections or for guiding the identification of operational control units, an adaptation to weighted networks is required. We demonstrate this through the integration of the data on estimated numbers of P. falciparum infections per 5 × 5 km grid square into the community detection pipeline. The final pipeline for community detection calculated a trade-off between form and function of roads in order to obtain a network partition.
The form is related to the topology of the road network and is taken into account during the primal-dual conversion. The topological component guarantees that only neighbouring and well-connected locations can belong to the same community. The functional part, on the other hand, is calculated by multiplying estimated P. falciparum malaria prevalence by population to obtain estimated numbers of infections, as outlined above.
The two factors were combined to form a weight for each edge of our primal network. The weight $w_{i,j}$ of edge $(i, j)$ is defined as where $m(\mathbf{r})$ is the P. falciparum malaria prevalence and $p(\mathbf{r})$ is the population count, both at coordinate $\mathbf{r}$. These values are obtained directly from the data. When the primal representation is converted into its dual version, the weights of primal edges, given by Eq. 1, are converted into weights of dual nodes, which are defined as where $i$ represents the $i$th dual node and $\Omega_i$ represents the set of all the primal edges that were combined together to form the dual node $i$ (see Fig. 2a,b). Finally, weights for the dual edges are created from the weights of dual nodes, by simply assuming The dual network weighted by values of $\lambda_{i,j}$ was used as input for a weighted community detection algorithm. Ultimately, when the communities detected in the dual space are translated back to primal space, neighbouring locations with similar values of estimated P. falciparum infections belong to the same communities. For the example of P. falciparum malaria used here, the max function was used, representing maximum numbers of infections on each road segment in 2015. This was chosen to identify connectivity to the highest burden areas. Areas with large numbers of infections are often 'sources', with infected populations moving back and forth from them and spreading parasites elsewhere 6,42. Therefore, mapping which regions are most strongly connected to them is of value. Alternative metrics can be used, however, depending on the aims of the analyses.
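The step from primal edge weights to dual node weights can be sketched as below. The sketch takes the primal edge weights as given inputs and assumes that the max function mentioned in the text is applied over the set $\Omega_i$ of primal edges forming each dual node; the rule that turns dual node weights into dual edge weights is not reproduced here. The function name and toy values are hypothetical.

```python
def dual_node_weights(primal_weights, omega, aggregate=max):
    """Aggregate primal edge weights into one weight per dual node.

    primal_weights: dict mapping a primal edge (u, v) -> weight.
    omega: list where omega[i] holds the primal edges merged into dual node i.
    aggregate: assumption, max by default; any reducer over an iterable works.
    """
    return [aggregate(primal_weights[e] for e in edge_set) for edge_set in omega]


# Toy values (hypothetical): estimated infections attached to four primal edges.
primal_weights = {(0, 1): 50.0, (1, 2): 125.0, (2, 3): 10.0, (1, 4): 5.0}
omega = [[(0, 1), (1, 2), (2, 3)], [(1, 4)]]

w_max = dual_node_weights(primal_weights, omega)
w_sum = dual_node_weights(primal_weights, omega, aggregate=sum)
```

Swapping the reducer (for example `sum` instead of `max`) is one way to express the alternative metrics mentioned above.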
The integration of P. falciparum malaria prevalence and population (Fig. 5a) through weighting road links by the maximum values across them produces a different pattern of communities (Fig. 5b) to those based solely on network structure (Fig. 3). The mapping of 20 communities is shown here, as it identifies key regions of known malaria connectivity, as outlined below. The mapping shows areas of key interest in malaria elimination efforts connected across national borders, such as much of Namibia linked to southern Angola 43, but the Zambezi region of Namibia more strongly linked to the community encompassing neighbouring Zambia, Zimbabwe and Botswana 44. In Namibia, malaria movement communities identified through the integration of mobile phone-based movement data and case-based risk mapping 26 show correspondence in mapping a northeast community. Moreover, Swaziland is shown as being central to a community covering southern Mozambique and the malaria endemic regions of South Africa, matching closely the origin locations of the majority of internationally imported cases to Swaziland and South Africa 41,45,46. The movements of people and malaria between the highlands and southern and western regions of Uganda, and into Rwanda 47, also align with the community patterns shown in Fig. 5b. Finally, though quantifying different factors, the analyses show a similar east-west split to that found in analyses of malaria drug resistance mutations 6,48 and malaria movement community mapping 27.

Discussion
The emergence of new disease epidemics is becoming a regular occurrence, and drug and insecticide resistance are continuing to spread around the world. As global, regional and local efforts to eliminate a range of infectious diseases continue and are initiated, an improved understanding of how regions are connected through human transport can therefore be valuable. Previous studies have shown how clusters of connectivity exist within the global air transport network 49,50 and shipping traffic network 50, but these represent primarily the sources of occasional long-distance disease or vector introductions 1,8, rather than the mode of transport that the majority of the population uses regularly. The approaches presented here, focused on road networks, provide a tool for supporting the design of disease and resistance surveillance and control strategies through mapping (i) areas of high connectivity where pathogen circulation is likely to be high, forming coherent units of intervention; (ii) areas of low connectivity between communities that form likely natural borders of lower pathogen exchange; and (iii) key link routes between communities for targeting surveillance efforts. The outputs of the analyses presented here highlight how highly connected areas consistently span national borders. With infectious disease control, surveillance, funding and strategies principally implemented country by country, this emphasises a mismatch in scales and the need for cross-border collaboration. Such collaborations are being increasingly seen, for example with countries focused on malaria elimination (e.g. 51,52), but the outputs here show that the most efficient disease elimination strategies may need to reconsider units of intervention, moving beyond being constrained by national borders. Results from the analysis of pathogen movements elsewhere confirm these international connections (e.g. 6,12,41,48), building up additional evidence on how pathogen circulation can be substantially more prevalent in some regions than others.
The approaches developed here provide a complement to other approaches for defining and mapping regional disease connectivity and mobility 9. Previously, census-based migration data have been used to map blocks of countries of high and low connectivity 27, but these analyses are restricted to national scales and cover only longer-term human mobility. Efforts are being made to extend these to subnational scales 53,54, but they remain limited to large administrative unit scales and the same long timescales. Mobile phone call detail records (CDRs) have also been used to estimate and map pathogen connectivity 26,40, but the nature of the data means that they do not include cross-border movements, so they remain limited to national-level studies. An increasing number of studies are uncovering patterns in human and pathogen movements and connectivity through travel history questionnaires (e.g. 41,47,55,56), resulting in valuable information, but typically limited to small areas and short time periods.
There exist a number of limitations to the methods and outputs presented here that future work will aim to address. Firstly, the hierarchies of road types are not currently taken into account in the network analyses, meaning that a major highway and small local roads contribute equally to community detection and epidemic spreading. The lack of reliable data on road typologies, and inconsistencies in classifications between countries, make this challenging to incorporate, however. Moreover, the relative importance of a major road versus secondary roads, tertiary roads and tracks is exceptionally difficult to quantify within a country, let alone between countries and across Africa. Finally, data on seasonal variations in road access do not exist consistently across the continent. Our focus has therefore been on connectivity, in terms of how well regions are connected based on existing road networks, irrespective of the ease of travel. A broader point that deserves future research is that while intuition suggests a correspondence in most places, connectivity may not always translate into human or pathogen movement.
Future directions for the work presented here include quantitative comparison and integration with other connectivity data, the integration of different pathogen weightings, and the extension to other regions of the world. Qualitative comparisons outlined above show some good correspondence with analyses of alternative sources of connectivity and disease data. A future step will be to compare these different connections and communities quantitatively to examine the weight of evidence for delineating areas of strong and weak connectivity. This could potentially follow similar studies looking at community structure on weighted networks, such as in the US based on commuting data 57, or the UK and Belgium from mobile network data 58,59. Here, P. falciparum malaria was used to provide an example of the potential for weighting analyses by pathogen occurrence, prevalence, incidence or transmission suitability. Moreover, future work will examine the integration of alternative pathogen weightings. The maximum difference method was used here to pick out regions well connected to areas of high P. falciparum burden, but the potential exists to use different weighting methods depending on requirements, strategic needs, and the nature of the pathogen being studied.
Despite the rapid growth of air travel, shipping and rail in many parts of the world, roads continue to be the dominant route on which humans move on sub-national, national and regional scales. They form a powerful force in shaping the development of areas, facilitating trade and economic growth, but also bringing with them the exchange of pathogens. Results here show that their connectivity is not equal however, with strong clusters of high connectivity separated by bridge regions of low network density. These structures can have a significant impact on how pathogens spread, and by mapping them, a valuable evidence base to guide disease surveillance as well as control and elimination planning can be built.

Methods
Results were produced through four main phases.
Phase 1: Road network cleaning and weighted adjacency list production. The road cleaning operation aimed to produce a road network from the georeferenced vector network of road infrastructure. This phase was conducted using ESRI ArcMap 10.4 (http://desktop.arcgis.com/en/arcmap/) through the use of the topological cleaning tool. The tool integrates contiguous roads, removes very short links and removes overlapping road segments. Road junctions were created using the polyline to node conversion tool, while road-link association was computed using the spatial join tool. Malaria prevalence values were assigned to each road using the spatial join tool. The adjacency matrix output, which also contained the coordinates of each road junction, was extracted in the form of a text file.
Phase 2: Conversion from the primal to the dual network. The primal network created in phase 1 was then used as input for a continuity negotiation-like algorithm. The goal of this algorithm was to translate the primal network into its dual representation (see Fig. 2a,b). The implementation of the negotiation-like algorithm used the iGraph library in C++ (http://igraph.org/c/) on an octa-core iMac. The conversion took around 20 hours to run for a primal network with ~200 k nodes. The algorithm works by first identifying roads composed of many contiguous edges in the primal space. Two primal edges are assumed to be contiguous if the angle between them is not greater than 30°. Because the dual representation generated by the algorithm strongly depends on the starting edge, we started by looking for the edge that produces the longest road. As soon as this edge was found, a dual node was created to represent that road. Next we proceeded to look for the edge that produced the second longest road and created a dual node for that road. We continued this process until every primal edge had been assigned to a road. Finally, dual nodes were connected to each other if their primal counterparts (roads) crossed each other in the primal space.
Phase 3: Community detection. We used a traditional modularity optimization-based algorithm to identify communities in the dual representation of the road network. The modularity metrics were computed in R using the iGraph library (http://igraph.org/r/). To incorporate the prevalence of malaria, we used the malaria prevalence values as edge weights for community detection.
Phase 4: Mapping communities. Detected communities were mapped back to the primal road network with the use of the spatial join tool in ArcMap. All maps were produced in ArcMap.