Individual Movement Strategies Revealed through Novel Clustering of Emergent Movement Patterns

Understanding movement is critical in several disciplines but analysis methods often neglect key information by adopting each location as sampling unit, rather than each individual. We introduce a novel statistical method that, by focusing on individuals, enables better identification of temporal dynamics of connectivity, traits of individuals that explain emergent movement patterns, and sites that play a critical role in connecting subpopulations. We apply this method to two examples that span movement networks that vary considerably in size and questions: movements of an endangered raptor, the snail kite (Rostrhamus sociabilis plumbeus), and human movement in Florida inferred from Twitter. For snail kites, our method reveals substantial differences in movement strategies for different bird cohorts and temporal changes in connectivity driven by the invasion of an exotic food resource, illustrating the challenge of identifying critical connectivity sites for conservation in the presence of global change. For human movement, our method is able to reliably determine the origin of Florida visitors and identify distinct movement patterns within Florida for visitors from different places, providing near real-time information on the spatial and temporal patterns of tourists. These results emphasize the need to integrate individual variation to generate new insights when modeling movement data.

. Throughout our different analyses, we set the hyper-17 parameter to 0.1 to make it uninformative while the hyper-parameter was also set to 0.1 to 18 promote a more parsimonious model with fewer groups. 19 We fit this model within a Bayesian framework, using a Blocked Gibbs sampler similar to the one 20 described in 1 . The full conditional distributions (FCDs) are given below: 21 -FCD for the probability of individuals in group k being seen at each location : 22 where is the number of times individuals from group k were seen in each location l (l=1,…,L). 25 -FCD for the latent group membership variable for individual j : 26 We drew from a multinomial distribution with size 1 and probabilities given by the expression above. 28 -FCD for the stick-breaking parameters : 29 where is the number of individuals assigned to group k and > is the sum of the number of 32 individuals in groups k+1,…,K ( > = +1 + ⋯ + ). 33

Simulated data
The data required by our method consist of a matrix with the number of times each location was visited by each individual. In this matrix, rows are individuals and columns are locations. For instance, if we had four locations and locations 1-4 were visited 2, 3, 1, and 1 times, respectively, our observation for individual j would consist of the following vector w j = [2, 3, 1, 1].
The data required for the network analysis methods consist of a matrix with the amount of movement between all pairwise combinations of locations. To generate this input, we randomly sample without replacement the locations visited by each individual to then determine the amount of movement between the different locations. For instance, using the example above, we start by creating a derived vector with the explicit identifiers of each location r j = [1,1,2,2,2,3,4]. We then randomly sample these location identifiers without replacement. Say that this process generates the vector r * j = [2, 2, 1, 1, 3, 2, 4], where now the elements in this vector represent the location visited at each time point. This vector can then be represented by the following matrix where numbers refer to the amount of movement departing from location i (row i) and arriving in location s (column s). The input matrix required for the network analysis methods is obtained by summing these matrices over all individuals j. To illustrate this process, we provide below the code we used to create the simulated data shown in panel D of Fig. 1  The outputs from this code are: 1. the individual level data "sim1 indiv data.csv" used by our method IBC; 2. the spatially aggregated data "sim1 locloc data.csv" used by common network analysis methods; and 3. the underlying visitation rate parameters for each group "sim1 parameter.csv".
The figure below shows the assumed visitation rate for each group, where dotted red lines indicate locations that are frequently visited by individuals from various groups. Locations Visitation rate

Blocked Gibbs sampler
Our algorithm is primarily implemented in R, with some auxiliary functions implemented in C++ through the use of the R package "Rcpp". To fit this model, we used the following code: set.seed(10) library(gtools) library(Rcpp) setwd( U:/independent studies/movement LDA/appendix code ) source( gibbs functions.R ) sourceCpp("gibbs_functions_cpp.cpp") #import data (rows are individuals, columns are locations, cells contain the number #of times each individual was seen at each location) dat=data.matrix(read.csv( sim1 indiv data.csv ,as.is=T)) ngroups=25 #maximum number of groups alpha=0. This code imports the individual level data "sim1 indiv data.csv" and relies on the wrapper function "gibbs.sampler" (contained in the file "gibbs functions.R") to fit the model. The arguments of this function are: 1. ngroups: the maximum number of groups; 2. ngibbs: number of iterations for the Gibbs sampler; 3. dat: the data; and 4. alpha and epsilon: values for the hyper-prior parameters α and .
The output of this function consists in a list containing the following parameters and results for each Gibbs iteration: z,V,β,ψ and the log-likelihood (a useful metric to assess algorithm convergence).
To assess if the algorithm has converged, we examine the trace-plot of the log-likelihood. The code below suggests that the algorithm has converged in 100 iterations. It is important to note that our algorithm does not necessarily result in a model with the highest likelihood because our truncated stick-breaking prior will attempt to enforce a more parsimonius representation (i.e., a smaller number of groups).

Simulated data
We created multiple sets of simulated data to compare the performance of our method to that of alternative algorithms. In our first set of simulations, we varied the proportion of mixed membership sites (from 0% to 80%) while assuming the existence of 5 underlying communities. In our second set of simulations, we varied the underlying number of communities from 3 to 10 while setting the proportion of mixed membership sites to 40%. We performed these sets of simulations assuming 40 (small network) and 400 (large network) locations to explore how network size influenced our results.

Alternative algorithms
We compared our IBC method to three existing network analysis methods. The first two methods were hard-clustering methods that have been widely adopted in the literature: the fast modularity optimization algorithm (FMM) proposed by Blondel et al. 1 and the flow-based "Map equation" method (ME) 2 . Both of these methods were implemented using the "igraph" package in R. The third algorithm is a newer method called linked community algorithm (Link) which has generated considerable interest in the scientific community because, similar to our method, it can handle overlapping communities 3 . This method is implemented by the R package "linkcomm".
Differently from the main text, we did not include the modularity optimization approach using the simulated annealing algorithm because it was too slow for the large network. We also tried to use the model-based clustering model of Handcock et al 4 but the algorithm was not converging well despite running very long chains.

Comparison of models
There are multiple ways in which the different clustering algorithms can be compared. Here we compare these algorithm using three criteria. The first one consisted in assessing if these algorithms correctly detected the true number of communities. The second criterion consisted of comparing the link community algorithm and the IBC method (the only two algorithms that allow for mixed membership sites) in relation to their ability to correctly discriminate between mixed membership sites and non-mixed membership sites. This is an important task given the role of these critical connectivity sites in multiple disciplines. For this criterion, we calculate the proportion of sites correctly classified. This is given by + where n is the total number of sites, and CM and CNM are the number of sites correctly classified as mixed and non-mixed membership sites, respectively.
Our third criterion was based on model fit (as measured by the likelihood). In this approach, we assume that each observation consists on the number of movements originating from location i (i.e., the i-th row of the location-by-location X matrix). The total number of movements from this location is given by = ∑ =1 . Based on these definitions, we assume that the likelihood is given by: The hard clustering network analysis approaches we adopted in the manuscript do not explicitly define . Nevertheless, we can estimate as follows. Define the probability of movement within if location i belongs to group k but location j does not, where . = ∑ =1 is the total number of locations.
For the IBC method, we assume that is given by: where is a variable that denotes the membership status of a randomly chosen individual. Unfortunately, we did not find a simple way to define for the linked community algorithm and therefore we do not use the likelihood criterion to compare this method to the other algorithms.

Results
Our IBC method generally tended to correctly estimate the number of communities (Fig. A1). However, there were some discrepancies, particularly when the number of communities was large (≥ 7) and when a very high proportion of sites (≥ 60%) were mixed for the large network. The hard-clustering methods tended to almost always underestimate the true number of communities. In particular, the map equation algorithm surprisingly estimated that just a single community was present in most of our simulations. This result illustrates how having even a modest amount of mixed membership sites can substantially hinder the ability of some of these network analysis algorithms to identify the underlying patterns in these data. Finally, despite the ability to represent mixed sites, the link community algorithm failed to correctly identify the true number of communities. Fig. A1. Comparison between the estimated (y-axis) and the true number of communities. Top panels depict the results for our first set of simulations, in which the true number of communities was set to 5 (grey horizontal line) and the proportion of mixed sites varied from 0 to 80% (x-axis). Bottom panels depict the results for our second set of simulations, in which the true number of communities varied from 3 to 10 (x-axis) and the proportion of mixed sites was set to 40%. Diagonal grey line is a 1:1 line in the bottom panels.
We also assessed how well the link community algorithm and the IBC method were able to detect mixed membership sites. While the IBC method almost always correctly identified mixed membership and non-mixed membership sites, the link community algorithm had a substantially worse performance (Fig. A2). Finally, in relation to model fit, our results reveal that IBC had a substantially better fit than the other two hard-clustering methods (Fig. A3). Fig. A2. Comparison of IBC and the Link algorithm in predicting mixed and non-mixed membership sites. Values closer to 1 indicated better fit to the data. Top panels depict the results for our first set of simulations, in which the true number of communities was set to 5 and the proportion of mixed sites varied from 0 to 80% (x-axis). Bottom panels depict the results for our second set of simulations, in which the true number of communities varied from 3 to 10 (x-axis) and the proportion of mixed sites was set to 40%. Higher values indicated better fit to the data. Top panels depict the results for our first set of simulations, in which the true number of communities was set to 5 and the proportion of mixed sites varied from 0 to 80% (x-axis). Bottom panels depict the results for our second set of simulations, in which the true number of communities varied from 3 to 10 (x-axis) and the proportion of mixed sites was set to 40%.