Network properties of salmonella epidemics

We examine non-typhoidal Salmonella (S. Typhimurium or STM) epidemics as complex systems, driven by evolution and interactions of diverse microbial strains, and focus on emergence of successful strains. Our findings challenge the established view that seasonal epidemics are associated with random sets of co-circulating STM genotypes. We use high-resolution molecular genotyping data comprising 17,107 STM isolates representing nine consecutive seasonal epidemics in Australia, genotyped by multiple-locus variable-number tandem-repeats analysis (MLVA). From these data, we infer weighted undirected networks based on distances between the MLVA profiles, depicting epidemics as networks of individual bacterial strains. The network analysis demonstrated dichotomy in STM populations which split into two distinct genetic branches, with markedly different prevalences. This distinction revealed the emergence of dominant STM strains defined by their local network topological properties, such as centrality, while correlating the development of new epidemics with global network features, such as small-world propensity.

The multiplex Polymerase Chain Reaction (PCR) was performed to amplify the variable number tandem repeats (STTR9, STTR5, STTR6, STTR10pl and STTR3) and subsequent analyses were conducted as described previously 1 . MLVA results were reported as a string of five numbers representing relevant repeats 2 .
The S. Typhimurium reference strain LT2 (GenBank NC_003197) was used as a control throughout and consistently generated the expected MLVA type 4-13-13-10-0211. A total of 17,107 human STM isolates were examined following removal of duplicate isolates from the same episode of the disease (99.3% of all STM cases recorded in NSW during the study period).
Video S1 illustrates the dataset, by highlighting the STM cases recorded in 2008 in NSW, including Sydney, Australia.
The MLVA profile is given by the vector where the subscript denotes the loci, e.g., p 1 is the number of repeats in locus 1.

Network and Clustering
The network topology is a complete graph where we represent each MLVA profile as a node, and the edge weight between nodes is given by the Manhattan distance between profiles.
Clustering methods. The two clustering methods used are: overlapping and partitioning, as shown in Figure S1.
Edge Weights. For this work, we require edge weights (or, equivalently, a dissimilarity measure) between all pairs of MLVA profiles. The proper choice of an edge weight w ij between a pair of nodes i and j depends on the properties of the system.
Let p i and p j be the the MLVA profiles for nodes i and j. Then, the L 1 -norm distance d ij , or the Manhattan distance, between these nodes is computed as .
An alternative similarity measure is given by the L 2 -norm distance: The L 1 -norm is selected as the similarity measure between two MLVA p i and p j in order to represent the expectation that a single large difference in one individual locus of p i and p j represents a smaller overall distance than a combination of several small differences between individual loci of p i and p j . For example, MLVA pattern {3-12-12-9-523} is expected to be more similar to {3-12-12-11-523} than to {3-13-13-10-523}. In this example, the L 1 -norm for the first pair results is 2, i.e., the sum of differences in the 4 th locus, being smaller than the L 1 -norm for the second pair which yields 3 across three different inner loci. On the other hand, the L 2 -norm would also produce distance 2 for the first pair (i.e., a square root of the squared difference in the 4 th locus), but would yield distance √3 for the second pair, i.e. a square root of the sum of three individual minimal distances across three inner loci, which is greater than 2, contradicting our expectation.
In our analysis, we paid a special attention to locus 5. Rather than treating it as any other vector element of p i , we follow a normalization procedure which combines the approach of Larsson et al. 3 (that is, explicitly identifying and summing the corresponding numbers of tandem repeats in 27 and 33 base-pairs, when these numbers are available) with a division of p i5 by 37 and rounding the result, when these numbers are not available. The edge weights are given by the reciprocal of these distances, i.e., .
Computing the weights in this way results in them being normalised. The edge weights are used to compute a number of network measures; moreover, they are employed by Cytoscape to lay out the network via a force-directed spring algorithm.
Overlapping and Partitioning Clustering. Using the edge weights, we obtained the overlapping clustering algorithm simply by including all nodes within a certain distance. For instance, let C i be the cluster associated with node i, then for some threshold D max . In this work we set D max = 5 such that we account for a detectable mutation in all loci.
We used hierarchical agglomerative clustering, which begins with all nodes in separate clusters and greedily combines them according to a dissimilarity criterion. The dissimilarity requires a "metric" and a "linkage criteria". The metric gives the distance between pairwise observations (i.e., the L 1 -norm), and the linkage criterion d IJ determines the distance between two sets of observations I and J. We employed the Ward's minimum variance method in this study 4 .
An example. It is instructive to consider MLVA profile 3-9-7-12-622 which appears as the most frequent profile in terms of the average cluster prevalence (overlapping clusters), with the average cluster prevalence of 124.7714286 (see Table S2). The individual prevalence of this profile is only 1 (that is, it has been identified only once in 8 years), and so it is important to trace the reasons behind its high average cluster prevalence. The overlapping cluster, within the threshold distance D max = 5, to which the profile 3-9-7-12-622 belongs includes 35 MLVA profiles overall. In order to determine its cluster neighbours, we firstly consider the normalisation of the profile 3-9-7-12-622since it does not appear in the study of Larsson et al. 3 , the corresponding numbers of tandem repeats in 27 and 33 base pairs are not available, and hence, the value p i5 is obtained by division of 622 by 37 and rounding the result, yielding 3-9-7-12-17. This normalised profile is, therefore, within D max = 5 to several profiles with very high individual frequency: specifically, the top 4 most frequent profiles are in the same overlapping clusters (see Table S1). In these profiles the value p i5 = 523 is replaced by 14 = 2 + 12, as there are 2 base-pairs 27 and 12 base-pairs 33, according to Larsson et al. 3 Thus, the average cluster prevalence of the profile 3-9-7-12-622 (after normalisation: 3-9-7-12-17) is high because it belongs to an overlapping cluster which also includes profiles with high individual prevalence, while these profiles themselves belong to overlapping clusters with smaller average cluster prevalence.

Network Measures
Clustering and Closeness. In capturing characteristics of the weighted networks inferred in this study, we utilize several measures: the clustering coefficient and closeness centrality (both local and global), as well as the characteristic path length (a global measure). We employ the weighted versions of these measures, following recent studies of brain networks 5 . We chose closeness centrality, rather than betweenness centrality 6 , percolation centrality 7 , or PageRank centrality 8 , due to a natural connection between the normalized closeness centrality and the characteristic path length, with the latter being the average of the former, as shown below. Since the networks are fully connected, the unweighted degree distribution of each node is equivalent, reducing the immediate applicability of other measures, such as the network assortativity 9,10 and the rich-club coefficient 11 , which need to be carefully adapted for weighted networks 12 .
The clustering coefficient is typically defined for unweighted networks to measure the degree to which a nodes' neighbours are connected. There have been numerous proposals for applying this measure to weighted networks, all designed to capture different network characteristics. We employ the approach of Onnela et al. 13 : where , and k i is the number of edges connected to n i . A common global measure of clustering in a network is given by averaging the local coefficients, where N is the number of nodes in the network. This is known as the average clustering coefficient.
The closeness centrality of a node is computed as the sum of the length of the shortest paths between nodes. In this work we use the normalized form of this measure: The characteristic path length is a robust measure for quantifying the average distance between nodes. It is computed as follows: which is equivalent to taking the average closeness centrality.
We can combine the above measures by taking the ratio of the clustering coefficient to the closeness centrality c i / l i -similarly, the global form is given as C/L. This ratio relates to another common measure in network theory, known as the small world coefficient 14 . The small world coefficient is often used to capture the degree to which a network can be characterised as having the "small-world" phenomena 15 . Typically for the small world coefficient, however, C and L are taken relative to the random network clustering coefficient C rand and path length L rand . That is, Given the MLVA network is complete, there is no random structure; however, random edge weights for a complete network would result in C rand ≈ 1 and L rand ≈ 1. Thus, the ratio C/L could be considered a proxy to . Various other centrality measures, including PageRank and betweenness centrality, yielded similar results for tracing the evolution of strains, as well as predicting the severity of a given MLVA profile. However, we only presented results for the clustering and closeness of a node due to their intuitive description and our focus in small-world properties.

Entropy
Given the frequency of each MLVA profile i, observed during a specific time interval T, we obtain an MLVA frequency distribution. Then the entropy H T of this MLVA frequency distribution is measured as Figure S1. Clustering the MLVA profiles. The two clustering methods used are: overlapping (left) and partitioning (right). The size of each node is set in proportion to the prevalence of the corresponding MLVA profile. The overlapping approach (left) is to cluster all nodes within a certain distance to a focus node, within the same cluster. We show two clusters (a blue and a yellow cluster), where the green nodes correspond to both. The partitioning approach (right) uses agglomerative clustering to yield non-overlapping clusters. We use a distance threshold, and if the nodes are further away than this threshold on the dendrogram, they become separate clusters. The figure shows the polar dendrogram of the entire MLVA dataset, with insets showing two separate clusters, and grey shading denoting the distance threshold.      Movie S1. STM cases recorded in 2008 in NSW, including Sydney, Australia.